In [None]:
import pandas as pd
import datetime

1. Introduction
2. Data Description:
    1. Twitter Data
        1. Columns we will be using
        2. Limitations of the data
    2. COVID-19 Global Data
        1. Columns we will be using
        2. Limitations of the data
    3. Ethical Considerations
3. Data Preparation, Cleaning and Manipulation:
4. Exploratory Data Analysis (EDA):
    1. Twitter
    2. COVID-19
5. Sentiment Analysis:
6. Correlation Analysis:

## 1. Introduction

As the pandemic has affected the globe for three years and continues to peak repeatedly in different regions, epidemics and the corresponding preventive measures have become a central topic of discussion and concern in modern society. While it was previously much more difficult to collect reactions to vaccines from different communities in one place, Twitter gives us access to sentiments expressed by a wide variety of communities.


With the vaccine come many different opinions, and many people choose to voice theirs on social media. We wondered whether looking at social media would help us understand what different communities across the world think of the vaccine, and whether such opinions are shaped by their socio-political climates and geographical locations.

To do this, we turned to Twitter and scraped tweets containing hashtags related to the vaccine. We then quantified the approval expressed in each tweet by calculating a sentiment score for its text, and expressed this information in the form of graphs and maps. Currently, we have 5000 data points, of which over 2000 have geographical coordinates.

We hope that by reviewing these charts and maps, we can better understand the concerns different communities have about the vaccine, and which factors may contribute to them.

The proposed project is to analyze tweets about the Pfizer-BioNTech vaccine in order to understand how sentiment varies by country, over time, and by demographic factors. The project also aims to study the correlation between sentiment and the number of confirmed cases, deaths, and active cases, and to compare the sentiment of tweets about the Pfizer-BioNTech vaccine to the sentiment of tweets about other COVID-19 vaccines. The goal is to gain insights on how people perceive and discuss the Pfizer-BioNTech vaccine on social media in different countries, how it changes over time and how it is influenced by various demographic factors. This information can be useful for researchers, healthcare professionals, and policymakers, to understand public opinion and to develop strategies to improve vaccine uptake.

## 2. Data Description

### 2.1 Twitter Data

Our primary dataset is The Pfizer and BioNTech Vaccine Tweets Dataset posted on Kaggle and created by Gabriel Preda, who is a data scientist in Romania. 

- Key features: we select 'user_location', 'text', and 'hashtags' to extract the information we need: locations for geographic analysis, and the original tweets about the vaccine for text processing and sentiment analysis
- Estimated size: 4.54 MB
- Location: https://www.kaggle.com/gpreda/pfizer-vaccine-tweets
- Format: CSV file
- Access method: download or Kaggle API
- Collection methodology: tweets about Pfizer & BioNTech were collected with tweepy, using the #PfizerBioNTech hashtag

### 2.2 COVID-19 Global

Our secondary dataset is the Covid-19 Global Dataset. The creator of the data set is an artificial intelligence engineer from Lebanon. 

- Key features: useful columns include ‘total_confirmed’, ‘total_deaths’, ‘active_cases’, and ‘country’, which reflect the up-to-date numbers of daily confirmed cases, deaths, and active cases for 218 countries
- Estimated size: 20.38 kB (but with multiple versions)
- Location: https://www.kaggle.com/josephassaker/covid19-global-dataset?select=worldometer_coronavirus_summary_data.csv
- Format: CSV file
- Access method: download or Kaggle API


### 2.3 Ethical Concerns and Bias

User privacy protection: we removed unique identifiers by replacing user names with integer codes.

Limitations of the data

## 3. Data Preparation, Cleaning and Manipulation

In [None]:
# load the datasets:
tweet_df = pd.read_csv("data/vaccination_tweets.csv")
covid_df = pd.read_csv("data/worldometer_coronavirus_daily_data.csv")

In [None]:
tweet_df.sample(5)

In [None]:
# check for shape and missing values of the Twitter dataset
print(tweet_df.shape)
print(tweet_df.isna().sum())

In [None]:
covid_df.sample(5)

In [None]:
# check for shape and missing values of the covid-19 stats dataset
print(covid_df.shape)
print(covid_df.isna().sum())

In [None]:
# subset the Twitter data (.copy() avoids SettingWithCopyWarning on the assignments below)
tweet_df = tweet_df[['user_name', 'user_location', 'user_created', 'date', 'text', 'hashtags', 'retweets', 'favorites']].copy()
# for privacy, encode user_name as integers (7202 unique user names detected)
tweet_df['user_name'] = tweet_df['user_name'].factorize()[0]
# convert the timestamp columns to plain dates
tweet_df['date'] = pd.to_datetime(tweet_df['date'], errors='coerce').dt.date
tweet_df['user_created'] = pd.to_datetime(tweet_df['user_created'], errors='coerce').dt.date
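
The integer encoding above depends on row order, so the same user can receive a different code if the data is reloaded or filtered differently. As a hedged alternative, the sketch below derives a stable pseudonym from a salted hash. The function name `pseudonymize` and the `SALT` value are hypothetical and not part of this notebook.

```python
import hashlib

# Hypothetical salt: in practice, keep it secret and out of the repository.
SALT = "replace-with-a-secret-value"

def pseudonymize(user_name: str, salt: str = SALT) -> str:
    """Return a stable 12-character pseudonym for a user name."""
    digest = hashlib.sha256((salt + user_name).encode("utf-8")).hexdigest()
    return digest[:12]

names = ["alice", "bob", "alice"]          # invented example names
codes = [pseudonymize(n) for n in names]   # same name always maps to the same code
```

With pandas this could be applied as `tweet_df['user_name'].astype(str).map(pseudonymize)`; whether hashing or integer codes is the better fit depends on how the data will be shared.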

In [None]:
covid_df['date'] = pd.to_datetime(covid_df['date'], errors = 'coerce').dt.date

### 3.1 Adding Country and City for Joining

In [None]:
# first, we try to get the city if possible
#!pip install geotext
from geotext import GeoText

def get_city(loc_txt):
    """Return the first city GeoText finds in the location string, else None."""
    if not isinstance(loc_txt, str):  # missing locations are NaN (a float)
        return None
    cities = GeoText(loc_txt).cities
    return cities[0] if cities else None

def get_country(loc_txt):
    """Return the first country GeoText finds in the location string, else None."""
    if not isinstance(loc_txt, str):
        return None
    mentions = list(GeoText(loc_txt).country_mentions)
    return mentions[0] if mentions else None

tweet_df['city'] = tweet_df['user_location'].apply(get_city)
tweet_df['country'] = tweet_df['user_location'].apply(get_country)

In [None]:
tweet_df.sample(5)

In [None]:
# second, we fill in the country where GeoText found none
# !pip install pycountry
import pycountry

def fill_country(loc_txt, country):
    """Fall back to substring matches on country name, then alpha-2, then alpha-3 code."""
    if country is not None:
        return country
    loc_txt = str(loc_txt)
    for attr in ('name', 'alpha_2', 'alpha_3'):
        candidates = [getattr(c, attr) for c in pycountry.countries if getattr(c, attr) in loc_txt]
        if candidates:
            return candidates[0]
    return None

tweet_df['country'] = tweet_df[['user_location','country']].apply(lambda x: fill_country(x['user_location'], x['country']), axis=1)
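
Plain substring tests on two-letter codes can produce false positives, because a code such as "IN" also occurs inside longer uppercase words. The sketch below shows one possible refinement that matches codes only as standalone tokens. The `CODES` table and `match_country_code` are hypothetical stand-ins; a real run would build the table from `pycountry.countries`.

```python
import re

# Tiny illustrative lookup (assumption): code -> country name.
CODES = {"IN": "India", "US": "United States", "DE": "Germany"}

def match_country_code(loc_txt):
    """Return the first country whose code appears as a standalone token, else None."""
    for code, name in CODES.items():
        # \b requires a word boundary on both sides of the code
        if re.search(rf"\b{re.escape(code)}\b", str(loc_txt)):
            return name
    return None

hit = match_country_code("Austin, TX, US")      # standalone "US" token
miss = match_country_code("INDIANAPOLIS")       # "IN" is only part of a longer word
naive = "IN" in "INDIANAPOLIS"                  # the substring test still matches
```

Word boundaries do not resolve every ambiguity: "Fort Wayne, IN" would still map to India here, since US state abbreviations collide with country codes.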

In [None]:
# %%time
# this may take about 5 mins
# !pip install country_converter
# convert country names to standard format
import country_converter as coco
tweet_df['country'] = tweet_df['country'].apply(lambda x: str(x))
tweet_df['country'] = coco.convert(names=tweet_df['country'].to_list(), to='name_short')
tweet_df['country'] = tweet_df['country'].apply(lambda x: None if x=="not found" else x)
# names that country_converter cannot match are reported as warnings and end up as None

In [None]:
tweet_df.sample(5)

In [None]:
tweet_df.isna().sum()

### 3.2 Text Cleaning and sentiment evaluation
Besides the spatial relationship, we are also interested in the content that users posted. To measure users' approval of the COVID-19 vaccine, we use the sentiment score of each tweet's text.

The NLTK library we will be using returns measures of positivity, negativity, and neutrality, plus a compound sentiment score, for each text. The higher the compound sentiment score, the greater the approval.

We will need to import the NLTK library and download some dictionaries to run certain methods.
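
To make the compound score easier to interpret, it can be bucketed into labels. The sketch below uses the ±0.05 cut-offs conventionally suggested for VADER compound scores. The function `label_sentiment` is a hypothetical helper and is not used elsewhere in this notebook.

```python
def label_sentiment(compound: float, threshold: float = 0.05) -> str:
    """Map a compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

labels = [label_sentiment(c) for c in (0.62, -0.30, 0.01)]
```

The threshold is a tunable assumption; a wider neutral band may suit short, heavily cleaned tweets better.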

In [None]:
import re

# make all text lowercase
tweet_df['clean_text'] = tweet_df['text'].apply(lambda x: x.lower())

# remove Twitter handles
tweet_df['clean_text'] = tweet_df['clean_text'].apply(lambda x: re.sub(r'@\S+', '', x))

# remove hashtags
tweet_df['clean_text'] = tweet_df['clean_text'].apply(lambda x: re.sub(r'\B#\S+', '', x))

# remove URLs
tweet_df['clean_text'] = tweet_df['clean_text'].apply(lambda x: re.sub(r'http\S+', '', x))

# remove all the special characters
tweet_df['clean_text'] = tweet_df['clean_text'].apply(lambda x: ' '.join(re.findall(r'\w+', x)))

# remove all single characters (replace with a space so neighbouring words are not fused)
tweet_df['clean_text'] = tweet_df['clean_text'].apply(lambda x: re.sub(r'\s+[a-zA-Z]\s+', ' ', x))

# substitute multiple spaces with a single space
tweet_df['clean_text'] = tweet_df['clean_text'].apply(lambda x: re.sub(r'\s+', ' ', x))

# remove short words (three characters or fewer)
tweet_df['clean_text'] = tweet_df['clean_text'].apply(lambda x: ' '.join([w for w in x.split() if len(w) > 3]))
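
To see what the cleaning steps do to a single tweet, they can be condensed into one function and run on a sample. This is a sketch only: `clean_tweet` is a hypothetical helper and the sample tweet is invented. The single-character and whitespace steps are subsumed by the final length filter.

```python
import re

def clean_tweet(text: str) -> str:
    """Apply the notebook's cleaning steps to one tweet."""
    text = text.lower()
    text = re.sub(r"@\S+", "", text)       # drop @handles
    text = re.sub(r"\B#\S+", "", text)     # drop #hashtags
    text = re.sub(r"http\S+", "", text)    # drop URLs
    words = re.findall(r"\w+", text)       # keep runs of word characters only
    return " ".join(w for w in words if len(w) > 3)  # drop words of 3 characters or fewer

sample = "@WHO The #PfizerBioNTech vaccine is NOT working for me!! https://t.co/abc"
cleaned = clean_tweet(sample)
print(cleaned)  # vaccine working
```

The length filter removes "not" along with other short words, so a negated statement can come out reading as its opposite. This is worth keeping in mind when interpreting the sentiment scores.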

In [None]:
import nltk
# download some resources
# nltk.download('punkt')
# nltk.download('stopwords')
# nltk.download('vader_lexicon')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.sentiment import SentimentIntensityAnalyzer

# load the sentiment function
sia = SentimentIntensityAnalyzer()
# load the stemmer function 
porter = nltk.PorterStemmer()

In [None]:
# remove stop words (a set makes membership tests fast)
stop_words = set(stopwords.words('english'))

# tokenization
tokenized_tweet = tweet_df['clean_text'].apply(lambda x: x.split())

# remove stop words
tokenized_tweet = tokenized_tweet.apply(lambda x: [item for item in x if item not in stop_words])

# similar to stop words, drop topic words that appear in almost every tweet:
# any token containing 'http', 'vac' or 'covid' (this also covers 'vaccine' and 'covid')
unwanted_fragments = ['http', 'vac', 'covid']
tokenized_tweet = tokenized_tweet.apply(lambda x: [w for w in x if not any(f in w for f in unwanted_fragments)])

# de-tokenization
tweet_df['tweet_words'] = tokenized_tweet
tweet_df['clean_text'] = tokenized_tweet.apply(' '.join)
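
NLTK's English stop-word list includes negations such as "not" and "no", so the stop-word filter above can remove words that carry sentiment. The sketch below shows one way to keep them. `STOP_WORDS` is a small illustrative subset rather than the full NLTK list, and `remove_stop_words` and `NEGATIONS` are hypothetical names.

```python
# Illustrative subset (assumption); the notebook uses stopwords.words('english').
STOP_WORDS = {"the", "is", "a", "this", "not", "no", "for"}
NEGATIONS = {"not", "no", "nor", "never"}

def remove_stop_words(tokens, keep=NEGATIONS):
    """Drop stop words, except those listed in `keep`."""
    return [t for t in tokens if t not in STOP_WORDS or t in keep]

tokens = "this vaccine is not safe".split()
kept_negation = remove_stop_words(tokens)              # negation survives
dropped_negation = remove_stop_words(tokens, keep=set())  # original behaviour
```

In this notebook the earlier length filter would also need relaxing, since it removes three-letter words before the stop-word step runs.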

In [None]:
tweet_df[['text','tweet_words','clean_text']].sample(5)

In [None]:
# Now we will calculate the sentiment scores for each tweet (one VADER pass per tweet).
scores = pd.DataFrame(tweet_df['clean_text'].apply(sia.polarity_scores).tolist(), index=tweet_df.index)
tweet_df['compound_sentiment'] = scores['compound']
tweet_df['neg_sentiment'] = scores['neg']
tweet_df['pos_sentiment'] = scores['pos']
tweet_df['neu_sentiment'] = scores['neu']

In [None]:
tweet_df.sample(5)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
plt.title('Distribution Of Sentiments Across Our Tweets',fontsize=19,fontweight='bold')
# bw_method replaces the deprecated `bw` argument of kdeplot
sns.kdeplot(tweet_df['neg_sentiment'], bw_method=0.1, label='neg_sentiment')
sns.kdeplot(tweet_df['pos_sentiment'], bw_method=0.1, label='pos_sentiment')
sns.kdeplot(tweet_df['neu_sentiment'], bw_method=0.1, label='neu_sentiment')
sns.kdeplot(tweet_df['compound_sentiment'], bw_method=0.1, label='compound_sentiment')
plt.legend(loc='upper right')
plt.show()

In [None]:
merged_df = tweet_df.merge(covid_df, on=['country', 'date'], how='left')
merged_df['date'] = pd.to_datetime(merged_df['date'], errors = 'coerce')
merged_df['user_created'] = pd.to_datetime(merged_df['user_created'], errors = 'coerce')
display(merged_df.sample(5))
merged_df.isna().sum()
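
A left merge silently leaves the COVID-19 columns empty wherever the `country` and `date` keys do not match exactly, for example when the two sources spell a country differently. The sketch below uses the `indicator=True` option of `DataFrame.merge` to count matched and unmatched rows. The toy frames and their values are invented for illustration.

```python
import pandas as pd

# Toy stand-ins (assumption) for tweet_df and covid_df.
tweets = pd.DataFrame({
    "country": ["United States", "India", None],
    "date": ["2021-01-01", "2021-01-01", "2021-01-02"],
    "compound_sentiment": [0.4, -0.1, 0.0],
})
cases = pd.DataFrame({
    "country": ["USA", "India"],
    "date": ["2021-01-01", "2021-01-01"],
    "daily_new_cases": [200000, 18000],
})

# indicator=True adds a `_merge` column: 'both' or 'left_only' for a left join
check = tweets.merge(cases, on=["country", "date"], how="left", indicator=True)
match_counts = check["_merge"].value_counts()
unmatched = check.loc[check["_merge"] == "left_only", "country"].tolist()
```

Inspecting `unmatched` for the real data would show whether missing case counts come from tweets without a country or from naming differences that could be harmonised before merging.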

## 4. Exploratory Data Analysis (EDA)


1. User Demographics
    1. Created Time
    2. Followers
    3. Frequency by City and Country
2. Sentiment Score Over Time
3. Sentiment Score vs. Location


### 4.1 User Demographics

In [None]:
# displot is figure-level and creates its own figure; size it via height/aspect
sns.displot(merged_df, x="user_created", kde=True, color='blue', height=6, aspect=2, binwidth=30)
plt.xlabel('Date', fontsize=12)
plt.ylabel('# User Created', fontsize=12)
plt.title('User Created Over Time', fontsize=15, fontweight='bold')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


In [None]:
# same plot, zoomed in on accounts created between 2019 and 2022
sns.displot(merged_df, x="user_created", kde=True, color='blue', height=6, aspect=2, binwidth=30)
plt.xlabel('Date', fontsize=12)
plt.xlim([datetime.date(2019, 1, 1), datetime.date(2022, 1, 1)])
plt.ylabel('# User Created', fontsize=12)
plt.title('User Created Over Time', fontsize=15, fontweight='bold')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
# Let's see the user demographics: unique users per country
user_country = merged_df[['user_name', 'country']].value_counts().reset_index()
top_10_countries =  user_country['country'].value_counts(sort=False).nlargest(10)

sns.countplot(y=user_country['country'], order=top_10_countries.index, orient='h')
plt.xlabel('# of Occurrences', fontsize=12)
plt.ylabel('Country', fontsize=12)
plt.title("User's Countries", fontsize=15, fontweight='bold')
# plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
# Let's see the number of tweets per country
top_10_countries =  merged_df['country'].value_counts(sort=False).nlargest(10)
sns.countplot(y=merged_df['country'], order=top_10_countries.index, orient='h')
plt.xlabel('Count', fontsize=12)
plt.ylabel('Country', fontsize=12)
plt.title('Distribution of Countries', fontsize=15, fontweight='bold')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
top_10_cities =  merged_df['city'].value_counts(sort=False).nlargest(10)
sns.countplot(y=merged_df['city'], order=top_10_cities.index, orient='h')
plt.xlabel('Count', fontsize=12)
plt.ylabel('city', fontsize=12)
plt.title('Distribution of city', fontsize=15, fontweight='bold')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
top_10_cities =  merged_df[merged_df.country=='United States']['city'].value_counts(sort=False).nlargest(10)
sns.countplot(y=merged_df[merged_df.country=='United States']['city'], order=top_10_cities.index, orient='h')
plt.xlabel('Count', fontsize=12)
plt.ylabel('city', fontsize=12)
plt.title('Distribution of U.S. city', fontsize=15, fontweight='bold')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
# daily tweet counts for the five countries with the most users
tweet_country = merged_df[['date', 'country']].value_counts().reset_index(name='tweets')
top_5_countries =  user_country['country'].value_counts(sort=False).nlargest(5)
tweet_country = tweet_country[tweet_country['country'].isin(top_5_countries.index)]

# Make the joint plot (jointplot is figure-level, so it creates its own figure)
g = sns.jointplot(x='date', y='tweets', data=tweet_country,
                  hue='country', height=15,
                  xlim=(datetime.date(2020, 11, 1), datetime.date(2022, 1, 1)), ylim=(0, 40))

g.set_axis_labels('Date', 'Counts', fontsize=12)
g.fig.suptitle('Vaccine Tweets Discussion Over Time By Country', fontsize=15, fontweight='bold')
g.fig.tight_layout()
plt.show()

In [None]:
# two time points where things get interesting: 2021-02 and 2021-09
# also: why is data missing, and how does that affect the curve?

In [None]:
tweet_country = merged_df[['date', 'country']].value_counts().reset_index(name='tweets')
top_5_countries =  user_country['country'].value_counts(sort=False).nlargest(5)
tweet_country = tweet_country[tweet_country['country'].isin(top_5_countries.index)]

In [None]:
# popular tweets

### 4.2 Sentiment Score Over Time

In [None]:
from pandas.tseries.offsets import MonthEnd
df = merged_df.copy()
df['date'] = pd.to_datetime(df['date'], errors = 'coerce')
# df["month_end_date"] = df['date'].dt.date + MonthEnd(0)

In [None]:
# Passing the entire dataset in long-form mode aggregates over repeated values (here, tweets posted on the same date) to show the mean and 95% confidence interval:
plt.figure(figsize=(12,6))
sns.lineplot(x='date', y='compound_sentiment', data=df, color='blue')
plt.xlabel('Date', fontsize=12)
plt.ylabel('compound_sentiment', fontsize=12)
plt.title('compound_sentiment Over Time', fontsize=15, fontweight='bold')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
# df = df.groupby(['date']).mean().reset_index()
plt.figure(figsize=(12,6))
sns.lineplot(x='date', y='neg_sentiment', data=df, color='blue')
plt.xlabel('Date', fontsize=12)
plt.ylabel('neg_sentiment', fontsize=12)
plt.title('neg_sentiment Over Time', fontsize=15, fontweight='bold')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
# df = df.groupby(['date']).mean().reset_index()
plt.figure(figsize=(12,6))
sns.lineplot(x='date', y='pos_sentiment', data=df, color='blue')
plt.xlabel('Date', fontsize=12)
plt.ylabel('pos_sentiment', fontsize=12)
plt.title('pos_sentiment Over Time', fontsize=15, fontweight='bold')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
# for the compound sentiment score, the trend becomes more volatile after 2021-07
# we can also observe this in the negative sentiment score chart; the peak in 2021-10 is due to the small amount of data, so we consider it an outlier
# the positive sentiment score seems more volatile after 2021-09, probably because less data was collected
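
One way to check whether the late volatility comes from sparse data is to compute the daily tweet count alongside the daily mean and mask days with too few tweets. The sketch below uses invented toy values; `MIN_TWEETS` is a hypothetical cut-off, not a value used in this notebook.

```python
import pandas as pd

# Toy data (assumption): three tweets on one day, a single tweet on the next.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2021-10-01"] * 3 + ["2021-10-02"]),
    "compound_sentiment": [0.2, 0.4, 0.0, -0.9],
})

MIN_TWEETS = 3  # hypothetical minimum number of tweets for a trustworthy daily mean

daily = toy.groupby("date")["compound_sentiment"].agg(["mean", "count"])
# keep the mean only where enough tweets support it; other days become NaN
daily["reliable_mean"] = daily["mean"].where(daily["count"] >= MIN_TWEETS)
```

Plotting `reliable_mean` instead of the raw daily mean would leave gaps on sparse days rather than spikes driven by one or two tweets.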

### 4.3 Sentiment Score vs. Location

In [None]:
df = merged_df.copy()
# average only the numeric columns; text columns cannot be averaged
df = df.groupby(["date","country"]).mean(numeric_only=True).reset_index()
top_5_countries = merged_df['country'].value_counts(sort=False).nlargest(5)
df = df[(df['country'].isin(top_5_countries.index))&(df['pos_sentiment']>0)]

# maybe group by week is better

plt.figure(figsize=(12,6))
sns.lineplot(x='date', y='pos_sentiment', data=df, hue='country')
plt.xlabel('Date', fontsize=12)
plt.ylabel('pos_sentiment', fontsize=12)
plt.title('pos_sentiment per country over time', fontsize=15, fontweight='bold')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
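
The note in the cell above suggests grouping by week. A minimal sketch of that idea follows, using `pd.Grouper` to average `pos_sentiment` per country and calendar week. The toy frame and its values are invented; the real input would be `merged_df` with `date` as a datetime column.

```python
import pandas as pd

# Toy data (assumption): two tweets in one week, one in the next.
toy = pd.DataFrame({
    "country": ["India", "India", "India"],
    "date": pd.to_datetime(["2021-01-04", "2021-01-06", "2021-01-12"]),
    "pos_sentiment": [0.2, 0.4, 0.1],
})

# freq="W" bins dates into weeks ending on Sunday; each bin is labelled by its end date
weekly = (
    toy.groupby(["country", pd.Grouper(key="date", freq="W")])["pos_sentiment"]
    .mean()
    .reset_index()
)
```

Weekly bins trade temporal detail for smoother lines, which may suit countries with only a handful of tweets per day.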

In [None]:
import plotly.express as px

df = merged_df.copy()
df = df[['date','country', 'compound_sentiment','neg_sentiment','pos_sentiment', 'neu_sentiment']].dropna()
df = df.groupby(['country','date']).mean().reset_index()

# df['date'] = pd.to_datetime(df['date'],format='%Y-%m-%d')

start_date = df['date'].min()
end_date = df['date'].max()


fill_df = []
for name, group in df.groupby('country'):
    # reindex each country onto the full daily range; days without tweets are filled with 0
    reindexed = group.set_index('date').reindex(pd.date_range(start=start_date, end=end_date, freq='D'))
    reindexed['country'] = name
    reindexed = reindexed.fillna(0)

    fill_df.append(reindexed.reset_index())

df = pd.concat(fill_df).rename(columns={'index':'date'})
df['date'] = df['date'].astype(str)
pos_df = df.copy()


top_10_countries =  merged_df['country'].value_counts(sort=False).nlargest(10)
df = df[(df['country'].isin(top_10_countries.index))].fillna(0)

# Change in pos_sentiment over time for different countries
fig = px.bar(df,
             y = "country",
             x = 'pos_sentiment',
             animation_frame= 'date',
             range_x = [0,1],
             color='country'
)
fig.show()


In [None]:
# !pip install plotly-express
import plotly.express as px

df = covid_df.copy()
top_10_countries =  merged_df['country'].value_counts(sort=False).nlargest(10)
df = df[(df['country'].isin(top_10_countries.index))].fillna(0)
# maybe group by week is better


fig = px.bar(df,
             y = "country",
             x = 'cumulative_total_cases',
             animation_frame= 'date',
             range_x = [0,50000000],
             color='country'
)
fig.show()


### 4.4 COVID-19 Global Cases vs. Positive Sentiment Over Time

In [None]:
# !pip install plotly-express  ('Viridis_r' below is a built-in colour scale, not a package)
df = covid_df.copy()
fig = px.choropleth(df,
                    locations="country",
                    color="cumulative_total_cases",
                    hover_name="country",
                    animation_frame="date",
                    locationmode='country names',
                    color_continuous_scale='Viridis_r',
                    range_color=(1000, 2000000),
                    height=600
                    )
fig.show()

In [None]:
# Change in pos_sentiment over time for different countries
df = pos_df.copy()
fig = px.choropleth(df,
                    locations="country",
                    color="pos_sentiment",
                    hover_name="country",
                    animation_frame="date",
                    locationmode='country names',
                    color_continuous_scale='Viridis_r',
                    range_color=(0,1),
                    height=600
                    )
fig.show()

## 5. Word Cloud Visual

In [None]:
# !pip install wordcloud
from wordcloud import WordCloud

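df = merged_df.copy()
# join tweets with a space so the last word of one tweet does not fuse with the first word of the next
text = ' '.join(' '.join(words) for words in df['tweet_words'])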

wordcloud = WordCloud(width=800, height=800, background_color='white').generate(text)

plt.figure(figsize=(8, 8), facecolor=None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)

plt.show()
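
A word cloud shows relative frequency only by font size, so it can help to look at the underlying counts as well. The sketch below counts tokens with `collections.Counter`; the `tweet_words` list here is an invented stand-in for the `tweet_words` column.

```python
from collections import Counter

# Toy token lists (assumption), one list per tweet.
tweet_words = [["dose", "today"], ["second", "dose"], ["feeling", "fine"]]

# flatten the per-tweet lists and count each token
counts = Counter(word for words in tweet_words for word in words)
top = counts.most_common(1)
print(top)  # [('dose', 2)]
```

The same counts could be passed to `WordCloud.generate_from_frequencies` instead of building one long string, which avoids any issue with how tweets are joined.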

## 6. Correlation Analysis

In [None]:
df = merged_df.copy()
df = df[['retweets','favorites', 'compound_sentiment','neg_sentiment',  'pos_sentiment',  'neu_sentiment',  'cumulative_total_cases',
        'daily_new_cases',  'active_cases',  'cumulative_total_deaths','daily_new_deaths']].fillna(0)
df_corr = df.corr()
fig, ax = plt.subplots(figsize=(16, 16),facecolor='w')

# reuse the correlation matrix computed above instead of recomputing it
sns.heatmap(df_corr, annot=True, vmax=1, square=True, cmap="viridis", fmt='.2g', annot_kws={"fontsize":12})
plt.title('Correlation Heat Map')
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.show()
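
Filling missing case counts with 0 before computing correlations treats "no data" as "zero cases", which can distort the coefficients. The sketch below contrasts that with the default behaviour of `DataFrame.corr`, which skips missing values pairwise. The numbers are invented purely to make the effect visible.

```python
import pandas as pd

# Toy data (assumption): case counts are missing for the last two rows.
toy = pd.DataFrame({
    "pos_sentiment": [0.1, 0.2, 0.3, 0.4],
    "daily_new_cases": [100.0, 200.0, None, None],
})

# correlation after replacing missing values with 0
corr_filled = toy.fillna(0).corr().loc["pos_sentiment", "daily_new_cases"]
# correlation using only rows where both values are present
corr_pairwise = toy.corr().loc["pos_sentiment", "daily_new_cases"]
```

In this toy case the sign of the correlation flips between the two approaches. On the real data, comparing both variants (or using `method="spearman"` for a rank-based check) would show how sensitive the heat map is to the fill.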

In [None]:
cols = ['pos_sentiment', 'cumulative_total_cases', 'active_cases', 'daily_new_cases','daily_new_deaths']
sns.pairplot(df[cols])