# **Covid-19 Vaccines - Sentiment Analysis & Time Series**
Notebook for the second project for the Machine Learning Complements course (CAC).

## **Introduction**


## Imports

The following libraries will be used in this project:

In [None]:
import os
import pandas as pd
import re
import numpy as np
import matplotlib.pyplot as plt
import utils as ut
import warnings
import seaborn as sns
import contractions
import nltk
import plotly.express as px
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.classify import NaiveBayesClassifier
nltk.download('vader_lexicon')
nltk.download('movie_reviews')
warnings.simplefilter(action='ignore')
from statsmodels.tsa.seasonal import seasonal_decompose
from dateutil.parser import parse


## Load Data

In [None]:
df_tweets = pd.read_csv('tweets.csv')

## Initial Observations

The dataset contains a single file: `tweets.csv`.

In this section we will take a look at the first few rows of the file to get a better understanding of the data, and do some initial data exploration.

In [None]:
ut.initial_obs(df_tweets)

## Data Understanding

## Data Pre-Processing
We can see that many attributes are not relevant for the kind of work we will be doing. Therefore, we'll just select the most relevant ones.

In [None]:
df_tweets = df_tweets[['id','user_location','date','text','hashtags','user_followers','source']]
pd.set_option('display.max_colwidth', None)

df_tweets['text'].head(5)
df_tweets['orig_text'] = df_tweets['text']

### Removing Spaces within the text
This step trims extra whitespace from each tweet so that the text has a consistent form for the later processing steps.

In [None]:
df_tweets['text'] = df_tweets['text'].apply(ut.trim_text)
df_tweets['text'].head(5)

### Contractions Mapping
Contractions mapping simplifies language by expanding contractions like "can't" to "cannot" for consistent analysis and interpretation.
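To make the idea concrete, here is a minimal sketch of contraction expansion using only the standard library. The mapping `CONTRACTION_MAP` and the helper `expand_contractions` are hypothetical stand-ins written for illustration; the `contractions` package used in this notebook covers far more cases and handles capitalisation differently.

```python
import re

# Tiny illustrative mapping (an assumption); the real package knows many more forms.
CONTRACTION_MAP = {
    "can't": "cannot",
    "won't": "will not",
    "i'm": "i am",
    "it's": "it is",
    "don't": "do not",
}

# Longest keys first, so a longer contraction wins over any shorter overlapping one.
_pattern = re.compile(
    r"\b("
    + "|".join(re.escape(k) for k in sorted(CONTRACTION_MAP, key=len, reverse=True))
    + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text: str) -> str:
    """Replace each known contraction with its expanded (lower-case) form."""
    return _pattern.sub(lambda m: CONTRACTION_MAP[m.group(0).lower()], text)

print(expand_contractions("I can't wait, but it's late and I don't know."))
# → I cannot wait, but it is late and I do not know.
```

Note that this sketch only recognises the straight apostrophe (`'`); tweets that use typographic apostrophes would need an extra normalisation step.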

In [None]:
df_tweets['text'] = df_tweets['text'].apply(contractions.fix)
df_tweets['text'].head(5)

### Cleaning URLs
Removing URLs from the text streamlines the content for NLP tasks and prevents link tokens from interfering with the later steps.
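The pattern `http\S+` used in this step matches `http` followed by every non-space character up to the next whitespace. The short sketch below shows its effect on a made-up tweet; the helper name `strip_urls` is an assumption introduced for illustration.

```python
import re

URL_PATTERN = re.compile(r"http\S+")

def strip_urls(text: str) -> str:
    """Remove every run of non-space characters that begins with 'http'."""
    return URL_PATTERN.sub("", text)

# Made-up example tweet, not taken from the dataset.
sample = "Got my second dose today! https://t.co/abc123 #vaccinated"
print(repr(strip_urls(sample)))
# → 'Got my second dose today!  #vaccinated'
```

Two details are worth noticing: the removal leaves a double space behind (collapsed later when whitespace is normalised), and links written without a scheme, such as `www.example.com`, are not matched by this pattern.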

In [None]:
df_tweets['text'] = df_tweets['text'].apply(lambda x:re.sub(r"http\S+", "", x))
df_tweets['text'].head(5)

### Emojis & Emotion Handling
Emojis carry much of the sentiment and nuance in social media text. We initially considered removing them, but their presence may be crucial for identifying the sentiment of a tweet, so we convert them to text descriptions instead.

In [None]:
pattern = re.compile(r'[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF\U0001F700-\U0001F77F\U0001F780-\U0001F7FF\U0001F800-\U0001F8FF\U0001F900-\U0001F9FF\U0001FA00-\U0001FA6F\U0001FA70-\U0001FAFF\U00002702-\U000027B0\U000024C2-\U0001F251\U0001F004\U0001F0CF\U0001F170-\U0001F251\U0001F600-\U0001F64F\U0001F680-\U0001F6FF]+', flags=re.UNICODE)

# Find examples in df_tweets['text'] that have emojis
emojis_examples = df_tweets[df_tweets['text'].str.contains(pattern, na=False)][0:5]

for index in emojis_examples.index:
    print(df_tweets.loc[index, 'text'])

df_tweets['text'] = df_tweets['text'].apply(ut.convert_emojis_to_text)

print('\n')

for index in emojis_examples.index:
    emoji_text = df_tweets.loc[index, 'text']
    print(emoji_text)

### Handling Twitter Handles (@) & Hashtags
Handling Twitter handles (@) and hashtags supports contextual analysis and topic extraction in social media text. We remove Twitter handles, since they mostly identify people and carry little information for this task. Hashtags, on the other hand, may indicate sentiment or other important information such as topics, e.g. #sad, #happy or #astrazeneca.
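The helper in `utils` is not shown in this notebook, so the following is only a sketch of one way to implement the behaviour described above: mentions are dropped, while hashtag words are kept without the `#` sign. The function name `remove_handles_keep_hashtags` is hypothetical.

```python
import re

def remove_handles_keep_hashtags(text: str) -> str:
    """Drop @mentions entirely and keep hashtag words without the '#'."""
    text = re.sub(r"@\w+", "", text)        # remove @handles
    text = re.sub(r"#(\w+)", r"\1", text)   # '#sad' -> 'sad'
    return re.sub(r"\s+", " ", text).strip()  # tidy leftover whitespace

print(remove_handles_keep_hashtags("@WHO thanks for the update #happy #astrazeneca"))
# → thanks for the update happy astrazeneca
```

Keeping the hashtag word means that `#happy` still contributes to the sentiment score downstream.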

In [None]:
df_tweets['text'] = df_tweets['text'].apply(ut.remove_twitter_handles_hashtags)

for index in emojis_examples.index:
    emoji_text = df_tweets.loc[index, 'text']
    print(emoji_text)

### Convert text to lower-case
We convert all characters to lower case so that the same word written with different capitalisation is treated as one token. The drawback is that on social media people often use upper case to express emotion, e.g. SAD or HAPPY, and that signal is lost.

Here we also remove special characters and isolated single letters, and collapse repeated whitespace.

In [None]:
df_tweets['text'] = df_tweets['text'].apply(ut.remove_special_characters)

df_tweets['text'] = df_tweets['text'].apply(lambda x:re.sub(r'\s+[a-zA-Z]\s+', ' ', x))
df_tweets['text'] = df_tweets['text'].apply(lambda x:re.sub(r'\s+', ' ', x, flags=re.I))

df_tweets['text'] = df_tweets['text'].str.lower()

### Tokenization
Tokenization breaks down text into individual units, such as words or phrases, enabling granular analysis and feature extraction. We also remove stopwords, meaning words that often appear within the text and don't add any meaning to it.
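As a rough illustration of these two steps, the sketch below tokenizes with a simple regular expression and filters against a very small stop-word set. Both are simplifying assumptions: `word_tokenize` from NLTK is more sophisticated, and the stop-word list actually used by `ut.remove_stopwords` is not shown here.

```python
import re

# Small illustrative stop-word set (an assumption); NLTK's English list is much longer.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "my", "i"}

def tokenize(text: str) -> list[str]:
    """Split lower-cased text into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stopwords(tokens: list[str]) -> list[str]:
    """Keep only the tokens that are not in the stop-word set."""
    return [t for t in tokens if t not in STOPWORDS]

tokens = tokenize("I got the first dose of the Moderna vaccine")
print(tokens)
print(remove_stopwords(tokens))
# → ['got', 'first', 'dose', 'moderna', 'vaccine']
```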

In [None]:
df_tweets['tokenized_text'] = df_tweets['text'].apply(lambda x: word_tokenize(x))
df_tweets['tokenized_text'] = df_tweets['tokenized_text'].apply(ut.remove_stopwords)
df_tweets['token_text'] = df_tweets['tokenized_text'].apply(lambda text: " ".join(text))


df_tweets['tokenized_text'].head(5)

### Stemming
Stemming strips word endings using heuristic rules to approximate the root form. It is simpler and faster than lemmatization, but the result is not always a valid word. For instance, the Porter stemmer used below reduces "studies" to "studi", which is not a valid word.
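The sketch below is a deliberately naive suffix stripper, written only to show why rule-based stemming can produce strings that are not words. It is not the Porter algorithm (which applies several ordered rule phases with extra conditions), and the names `SUFFIXES` and `toy_stem` are assumptions made for this example.

```python
# Suffixes are tried in this order; the first match wins.
SUFFIXES = ("ing", "ed", "ly", "es", "s")

def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["vaccinated", "vaccines", "hoping", "studies"]:
    print(w, "->", toy_stem(w))
# vaccinated -> vaccinat
# vaccines -> vaccin
# hoping -> hop
# studies -> studi
```

Even this toy version groups related forms together ("vaccinated" and "vaccines" share the prefix "vaccin"), which is the property that matters for counting words, at the cost of readability.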

In [None]:
stemmer = PorterStemmer()
df_tweets['stemmed_text'] = df_tweets['tokenized_text'].apply(lambda x: [stemmer.stem(word) for word in x])

df_tweets['stemmed_text'].head(5)

### Lemmatization
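Lemmatization, on the other hand, involves resolving words to their dictionary form, known as the lemma. It uses lexical knowledge bases to ensure that the root form returned is a valid word. For example, "am," "are," and "is" are all lemmatized to "be" when they are tagged as verbs. Note that `WordNetLemmatizer.lemmatize` treats every word as a noun unless a part-of-speech tag is passed, so the call below leaves such verb forms unchanged. Lemmatization is generally more accurate than stemming but can be slower due to its linguistic complexity.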
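The role of the part-of-speech tag can be illustrated with a toy lookup table. This is only a sketch: `LEMMA_TABLE` and `toy_lemmatize` are hypothetical and stand in for a real lexical database such as WordNet, which contains vastly more entries and morphological rules.

```python
# Toy lemma table keyed by (word, part of speech); an assumption for illustration.
LEMMA_TABLE = {
    ("am", "v"): "be",
    ("are", "v"): "be",
    ("is", "v"): "be",
    ("was", "v"): "be",
    ("doses", "n"): "dose",
    ("better", "a"): "good",
}

def toy_lemmatize(word: str, pos: str = "n") -> str:
    """Look up (word, part of speech); fall back to the word itself."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

print(toy_lemmatize("is", pos="v"))  # → be
print(toy_lemmatize("is"))           # → is  (default noun tag finds no entry)
print(toy_lemmatize("doses"))        # → dose
```

The second call shows why lemmatizing without part-of-speech information leaves many verb forms untouched.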

In [None]:
lemmatizer = WordNetLemmatizer()
df_tweets['lemmatized_text'] = df_tweets['tokenized_text'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])

df_tweets['lemmatized_text'].head(5)
df_tweets['clean_text'] = df_tweets['lemmatized_text'].apply(lambda text: " ".join(text))
df_tweets.drop_duplicates(subset='clean_text', inplace=True)


## Sentiment Analysis - Using VADER

In [None]:
sid = SentimentIntensityAnalyzer()

def analyze_sentiment(text):
    vader_scores = sid.polarity_scores(text)['compound']
    if vader_scores >= 0.05:
        sentiment = 'Positive'
    elif vader_scores <= -0.05:
        sentiment = 'Negative'
    else:
        sentiment = 'Neutral'
    return sentiment, vader_scores

df_tweets['sentiment'], df_tweets['vader_score'] = zip(*df_tweets['clean_text'].apply(analyze_sentiment))

#df_tweets['sentiment'] = df_tweets['sentiment'].replace({'Positive': 1, 'Neutral': 0, 'Negative': -1})

In [None]:
ut.plot_sentiments(df_tweets)

### WordClouds

#### Positive Sentiment - WordCloud

In [None]:
positive_tweets = df_tweets[df_tweets['sentiment'] == "Positive"]
negative_tweets = df_tweets[df_tweets['sentiment'] == "Negative"]
neutral_tweets = df_tweets[df_tweets['sentiment'] == "Neutral"]

ut.generate_word_cloud(positive_tweets['token_text'], 'Positive Sentiment Word Cloud')
positive_tweets['clean_text'].head(5)

In [None]:
ut.common_words(positive_tweets, 50)

#### Neutral Sentiment - WordCloud

In [None]:
ut.generate_word_cloud(neutral_tweets['token_text'], 'Neutral Sentiment Word Cloud')
neutral_tweets['clean_text'].head(5)

In [None]:
ut.common_words(neutral_tweets, 50)

#### Negative Sentiment - WordCloud

In [None]:
ut.generate_word_cloud(negative_tweets['token_text'], 'Negative Sentiment Word Cloud')
negative_tweets['clean_text'].head(5)

In [None]:
ut.common_words(negative_tweets, 50)

### N-Gram Analysis by sentiment

#### Uni-Gram

In [None]:
ut.plot_n_grams(df_tweets, 1)

#### Bi-Gram

In [None]:
ut.plot_n_grams(df_tweets, 2)

#### Tri-Gram

In [None]:
ut.plot_n_grams(df_tweets, 3)

### Plotting Average Word Amount by Sentiment

In [None]:
ut.plot_avg_word_length_distribution_multi(positive_tweets, neutral_tweets, negative_tweets)



## Geo-Spatial Sentiment Analysis

## Time-Series Analysis

In [None]:
# Function to filter the data to a single date and print tweets from users with the most followers
def date_filter(df, date):
    return df[df['date'].astype(str)==date].sort_values('user_followers', ascending=False)[['date' ,'orig_text']]

def date_printer(df, dates, num=10): 
    for date in dates:
        display(date_filter(df, date).head(num))
        
# Get tweets for each vaccine: Pfizer, Sinopharm, Sinovac, Moderna, AstraZeneca, Covaxin, Sputnik V.
df_tweets_pz = df_tweets[df_tweets['clean_text'].str.contains('pfizer', case=False, na=False)]
df_tweets_sinopharm = df_tweets[df_tweets['clean_text'].str.contains('sinopharm', case=False, na=False)]
df_tweets_sinovac = df_tweets[df_tweets['clean_text'].str.contains('sinovac', case=False, na=False)]
df_tweets_moderna = df_tweets[df_tweets['clean_text'].str.contains('moderna', case=False, na=False)]
df_tweets_astrazeneca = df_tweets[df_tweets['clean_text'].str.contains('astrazeneca', case=False, na=False)]
df_tweets_covaxin = df_tweets[df_tweets['clean_text'].str.contains('covaxin', case=False, na=False)]
df_tweets_sputnik = df_tweets[df_tweets['clean_text'].str.contains('sputnik', case=False, na=False)]

print('Number of tweets for Pfizer:', len(df_tweets_pz))
print('Number of tweets for Sinopharm:', len(df_tweets_sinopharm))
print('Number of tweets for Sinovac:', len(df_tweets_sinovac))
print('Number of tweets for Moderna:', len(df_tweets_moderna))
print('Number of tweets for AstraZeneca:', len(df_tweets_astrazeneca))
print('Number of tweets for Covaxin:', len(df_tweets_covaxin))
print('Number of tweets for Sputnik V:', len(df_tweets_sputnik))

### Sentiment Over Time

In [None]:
import datetime

# Assuming df_tweets is already defined and contains a 'date' column
df_tweets['date_'] = pd.to_datetime(df_tweets['date']).dt.date

# Count tweets per day
tweets_per_day = df_tweets.groupby('date_').size().reset_index(name='Tweets Per Day')

# Create the plot using Plotly Express
fig = px.line(tweets_per_day, x='date_', y='Tweets Per Day')

# Add a horizontal line representing the mean of tweet counts
mean_tweet_count = tweets_per_day['Tweets Per Day'].mean()
fig.add_shape(type="line",
    x0=tweets_per_day['date_'].min(), y0=mean_tweet_count, x1=tweets_per_day['date_'].max(), y1=mean_tweet_count,
    line=dict(
        color="Green",
        width=3,
        dash="dashdot",
    ),
    name='Mean',
)

# Update the traces to include markers
fig.update_traces(mode="markers+lines")

# Add annotations
annotations = [
    dict(
        x=datetime.datetime(2021, 3, 1), 
        y=tweets_per_day.loc[tweets_per_day['date_'] == datetime.date(2021, 3, 1), 'Tweets Per Day'].values[0],
        text='March 1',
        showarrow=True,
        arrowhead=3,
        bordercolor="#c7c7c7"
    ),
    dict(
        x=datetime.datetime(2021, 4, 21), 
        y=tweets_per_day.loc[tweets_per_day['date_'] == datetime.date(2021, 4, 21), 'Tweets Per Day'].values[0],
        text='April 21',
        showarrow=True,
        arrowhead=3,
        yshift=5,
        bordercolor="#c7c7c7"
    ),
    dict(
        x=datetime.datetime(2021, 6, 30), 
        y=tweets_per_day.loc[tweets_per_day['date_'] == datetime.date(2021, 6, 30), 'Tweets Per Day'].values[0],
        text='June 30',
        showarrow=True,
        arrowhead=3,
        yshift=5,
        ay=-30,
        bordercolor="#c7c7c7"
    ),
    dict(
        x=datetime.datetime(2021, 8, 11), 
        y=tweets_per_day.loc[tweets_per_day['date_'] == datetime.date(2021, 8, 11), 'Tweets Per Day'].values[0],
        text='August 11',
        showarrow=True,
        arrowhead=3,
        yshift=5,
        ay=-30,
        bordercolor="#c7c7c7"
    ),dict(
        x=datetime.datetime(2021, 10, 12), 
        y=tweets_per_day.loc[tweets_per_day['date_'] == datetime.date(2021, 10, 12), 'Tweets Per Day'].values[0],
        text='October 12',
        showarrow=True,
        arrowhead=3,
        yshift=5,
        ay=-30,
        bordercolor="#c7c7c7"
    ),dict(
        x=datetime.datetime(2021, 11, 3), 
        y=tweets_per_day.loc[tweets_per_day['date_'] == datetime.date(2021, 11, 3), 'Tweets Per Day'].values[0],
        text='November 3',
        showarrow=True,
        arrowhead=3,
        yshift=5,
        ay=-30,
        bordercolor="#c7c7c7"
    )
]

for annotation in annotations:
    fig.add_annotation(annotation)

# Update layout
fig.update_layout(
    title='<b>Daily Tweets</b>',
    hovermode='x unified',
    width=1000
)

# Show the plot
fig.show()

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation


dates_to_check = ['2021-03-01','2021-04-21','2021-06-30','2021-08-11','2021-10-12','2021-11-03']

for date in dates_to_check:
    print("DATE: ", date)
    tweets_data = ut.get_tweets_by_date(df_tweets, date)
    ut.topic_modelling(tweets_data)
    print('---------------------------\n\n')

#for i in range(1,100):
    #print("index: ", i, tweets_on_specific_date.iloc[i])


In [None]:
df_filtered = df_tweets[df_tweets['sentiment'] != 'Neutral']

# Count non-neutral tweets per 'user_location' and select the 10 locations with the most tweets
top_10_locations = df_filtered.groupby('user_location')['sentiment'].value_counts().unstack().fillna(0).sum(axis=1).nlargest(10).index

# Filter DataFrame to include only the top 10 locations
df_top_10 = df_filtered[df_filtered['user_location'].isin(top_10_locations)]

# For each top location, compute the share of positive and negative tweets
avg_sentiment = df_top_10.groupby('user_location')['sentiment'].value_counts(normalize=True).unstack().fillna(0)

# Plot
plt.figure(figsize=(10, 6))
sns.barplot(x=avg_sentiment.index, y=avg_sentiment['Positive'], color='green', label='Positive')
sns.barplot(x=avg_sentiment.index, y=avg_sentiment['Negative'], color='red', label='Negative')
plt.ylabel('Average Sentiment')
plt.xlabel('User Location')
plt.title('Average Sentiment per Top 10 User Locations')
plt.legend()
plt.xticks(rotation=45)
plt.show()


In [None]:
#tweets_data = ut.get_tweets_by_date(df_tweets, '2021-11-03')
#for i in range(1,2000):
#   print("index: ", i, tweets_data.iloc[i])



In [None]:
df_tweets['date'] = pd.to_datetime(df_tweets['date'])
df_tweets['datedt'] = pd.to_datetime(df_tweets['date'])
df_tweets['date'] = df_tweets['date'].dt.date
df_tweets['year'] = df_tweets['datedt'].dt.year
df_tweets['month'] = df_tweets['datedt'].dt.month
df_tweets['day'] = df_tweets['datedt'].dt.day
df_tweets['dayofweek'] = df_tweets['datedt'].dt.dayofweek
df_tweets['hour'] = df_tweets['datedt'].dt.hour
df_tweets['minute'] = df_tweets['datedt'].dt.minute
df_tweets['dayofyear'] = df_tweets['datedt'].dt.dayofyear
df_tweets['date_only'] = df_tweets['datedt'].dt.date



# Get counts of number of tweets by sentiment for each date
timeline = df_tweets.groupby(['date', 'sentiment']).agg(**{'tweets': ('id', 'count')}).reset_index().dropna()

# Plot results

# Sentiment labels are capitalised ('Positive', 'Negative', 'Neutral'), so map colours by those exact names
fig = px.line(timeline, x='date', y='tweets', color='sentiment', color_discrete_map={'Positive': '#00CC96', 'Negative': '#EF553B', 'Neutral': '#636EFA'}, title='Number of Tweets by Sentiment Over Time')
fig.show()

There are some spikes in the data, which may be due to some events that happened in the world. Let's investigate them further.
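One simple way to pick out such spike days programmatically is to flag every day whose tweet count lies well above the average. The sketch below does this with a mean plus `k` standard deviations rule; the helper `find_spikes`, the threshold `k = 2`, and the counts are all assumptions made for illustration and are not taken from the dataset.

```python
from statistics import mean, pstdev

def find_spikes(daily_counts: dict[str, int], k: float = 2.0) -> list[str]:
    """Return the dates whose count exceeds mean + k * standard deviation."""
    values = list(daily_counts.values())
    threshold = mean(values) + k * pstdev(values)
    return [day for day, count in daily_counts.items() if count > threshold]

# Made-up counts for illustration only.
counts = {
    "2021-02-26": 100, "2021-02-27": 110, "2021-02-28": 90,
    "2021-03-01": 400, "2021-03-02": 120, "2021-03-03": 105,
    "2021-03-04": 95,
}
print(find_spikes(counts))
# → ['2021-03-01']
```

With real data the same idea could be applied to the per-day counts, although a robust statistic such as the median may be preferable when several large spikes inflate the standard deviation.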

In [None]:
spike = df_tweets[df_tweets['date'].astype(str)=='2021-03-01']
spike['user_location'].value_counts().head(10)

In [None]:
spike = spike.sort_values('user_followers', ascending=False)
spike['orig_text'].head(10)

It looks like the Prime Minister of India took his first dose of the COVID-19 vaccine on March 1st, 2021. This event caused a spike in the number of tweets, and we can see that the sentiment is mostly positive.

#### Covaxin

In [None]:
def filtered_timeline(df, vax, title):
    title_str = 'Timeline showing sentiment of tweets about the '+title+' vaccine'
    # Filter the DataFrame passed in, rather than the global df_tweets
    vac_tweets = df[df['clean_text'].str.contains(vax, case=False, na=False)]
    
    timeline = vac_tweets.groupby(['date', 'sentiment']).agg(**{'tweets': ('id', 'count')}).reset_index()
    fig = px.line(timeline, x='date', y='tweets', color='sentiment', color_discrete_map={'Positive': '#00CC96', 'Negative': '#EF553B', 'Neutral': '#636EFA'},title=title_str)

    fig.show()
    return vac_tweets

covaxin = filtered_timeline(df_tweets, 'covaxin', title='Covaxin')

In [None]:
covaxin_spike = covaxin[covaxin['date'].astype(str)=='2021-11-03']
print('Number of tweets on 2021-11-03:', len(covaxin_spike))
covaxin_spike['user_location'].value_counts().head(10)

In [None]:
covaxin_spike = covaxin[covaxin['user_location']=='India']

date_printer(covaxin_spike, ['2021-11-03'], num=5)

After a brief investigation we found that Covaxin was approved for emergency use by the WHO (World Health Organization) on November 3rd, 2021. This led to a positive spike in the tweets.

#### Sinopharm

In [None]:
sinopharm = filtered_timeline(df_tweets, 'sinopharm', title='Sinopharm')


In [None]:
sinopharm_spike = sinopharm[sinopharm['date'].astype(str)=='2021-08-13']
print('Number of tweets on 2021-08-13:', len(sinopharm_spike))


print(sinopharm_spike['orig_text'].head(10))

# Count how many duplicate tweets there are
print('Number of duplicate tweets:', len(sinopharm_spike[sinopharm_spike.duplicated(subset='clean_text')]))

For the Sinopharm vaccine there was one major event: an 89-year-old man died after taking the vaccine. This event led to a large number of negative tweets.
Additionally, we observed a notable amount of duplicate data within the dataset, with 821 similar tweets recorded on that day.

#### Sinovac

In [None]:
sinovac = filtered_timeline(df_tweets, 'sinovac', title='Sinovac')

In [None]:
sinovac_spike = sinovac[sinovac['date'].astype(str)=='2021-06-02']

print('Number of tweets on June 2nd:', len(sinovac_spike))

date_printer(sinovac_spike, ['2021-06-02'], num=10)

Similar to Covaxin, the Sinovac vaccine was approved for emergency use by the WHO on June 1st, 2021, and the tweets peaked the following day. This event led to a mostly positive sentiment in the tweets.

#### Moderna

In [None]:
moderna = filtered_timeline(df_tweets, 'moderna', title='Moderna')

There were numerous spikes related to the Moderna vaccine. Let's focus on the positive spike that happened on June 29th.

In [None]:
date_printer(moderna, ['2021-06-29'], num=10)

On June 29th, 2021, India granted approval for the importation and utilization of the Moderna vaccine. This development sparked a notable increase in positive tweets.

#### AstraZeneca

In [None]:
astrazeneca = filtered_timeline(df_tweets, 'astrazeneca', title='AstraZeneca')

In [None]:
date_printer(astrazeneca, ['2021-03-16'], num=10)

#### Sputnik V

In [None]:
sputnik = filtered_timeline(df_tweets, 'sputnik', title='Sputnik V')


dates = ['2021-04-12', '2021-05-14']

date_printer(sputnik, dates, num=5)

#### Pfizer

In [None]:
pfizer = filtered_timeline(df_tweets, 'pfizer', title='Pfizer')

In [None]:
date_printer(pfizer, ['2021-08-23'], num=10)

The biggest peak of positive engagement for the Pfizer vaccine seems to relate to its full approval by the US FDA on August 23rd, 2021, the first COVID-19 vaccine to receive it.

In [None]:
## WORK IN PROGRESS
locations = {
    'India': {'lat': 23.309469, 'long': 78.532748},
    'United States': {'lat': 36.434542, 'long': -103.931671},
    'China': {'lat': 35.377854, 'long': 103.165949},
}
country_counts = {country: 0 for country in locations.keys()}
for index, tweet in df_tweets.iterrows():
    user_location = str(tweet['user_location'])  # Convert to string
    # If the user location contains a country name
    for country in locations.keys():
        if country in user_location:
            country_counts[country] += 1  # Increment count for the country
            break

data = []
for country, count in country_counts.items():
    data.append({'Location': country, 'Latitude': locations[country]['lat'], 'Longitude': locations[country]['long'], 'Number of Tweets': count})

df_locations = pd.DataFrame(data)

print(df_locations)
fig = px.scatter_mapbox(df_locations, lat="Latitude", lon="Longitude", hover_name="Number of Tweets", size = 'Number of Tweets',color="Number of Tweets", 
                    color_continuous_scale=px.colors.sequential.Plasma, size_max=15, zoom=1,
                   mapbox_style="carto-positron")
fig.show()

In [None]:
def bar_plot(df, title, x, y, color, color_discrete_sequence, angle, size, rge=0, scale=''):
    # Plot the mean sentiment scores for each vaccine

    if(scale!=''):
        fig = px.bar(df, x, y, title=title, labels={x: x, y: y}, color=color, color_continuous_scale=scale)
    else:
        if(rge==0):
            fig = px.bar(df, x, y, title=title, labels={x: x, y: y}, color=color, color_discrete_sequence=color_discrete_sequence)
        else:
            fig = px.bar(df, x, y, title=title, labels={x: x, y: y}, color=color, color_discrete_sequence=color_discrete_sequence, range_y=[-rge, rge])

    # Rotate x-axis labels for better readability
    fig.update_layout(xaxis_tickangle=angle)
    # bigger figure
    fig.update_layout(width=size[0], height=size[1])
        
    # Show plot
    fig.show()

In [None]:
# Count occurrences of each location
location_counts = df_tweets['user_location'].value_counts().reset_index()
location_counts.columns = ['user_location', 'count']

# Sort locations by count in descending order
location_counts = location_counts.sort_values(by='count', ascending=False)

# Select top 10 locations
top_10_locations = location_counts.head(10)

colors = px.colors.qualitative.Set1[:10]  # Using a qualitative color palette for variety

bar_plot(top_10_locations, 'Top 10 User Locations', 'user_location', 'count', 'user_location', colors, -45, (900, 600))




As can be seen, most of the tweets in the dataset originate from India.

In [None]:
plt.figure(figsize=(7,5))
df_tweets['source'].value_counts().nlargest(5).plot(kind='bar')
plt.xticks(rotation=45)
plt.title('Top 5 Tweet Sources')

plt.ylabel('Number of Tweets')
plt.xlabel('Tweet Source')
plt.show()


Tweets are fairly evenly distributed among the three largest sources: Android, Web, and iPhone.

In [None]:
def plot_time_variation(df, x='date_only', y='count', hue = None, size=1, title="", is_log=False):
    f, ax = plt.subplots(1,1, figsize=(3*size,1*size))
   
    g = sns.lineplot(x=x, y=y, hue=hue, data=df)
    #plt.xticks(rotation=90)
    if hue:
        plt.title(f'{y} grouped by {hue} | {title}')
    else:
        plt.title(f'{title}')
    if(is_log):
        ax.set(yscale="log")
    ax.grid(color='black', linestyle='dotted', linewidth=0.75)
    plt.show()

tweets_agg_df = df_tweets.groupby(["date"])["text"].count().reset_index()
tweets_agg_df.columns = ["date_only", "count"]

plot_time_variation(tweets_agg_df, title="Number of Tweets per Day", x='date_only', y='count', size=4)

In [None]:
# Count occurrences of each country
country_counts = df_tweets['user_location'].value_counts().reset_index()
country_counts.columns = ['user_location', 'tweet_count']

# Select top 10 countries with the most appearances in tweets
top_10_countries = country_counts.nlargest(10, 'tweet_count')

mean_sentiment_by_country = df_tweets.groupby('user_location')['vader_score'].mean().reset_index()
mean_sentiment_by_country.columns = ['user_location', 'mean_sentiment']
#select only countries in top 10
mean_sentiment_by_country = mean_sentiment_by_country[mean_sentiment_by_country['user_location'].isin(top_10_countries['user_location'])]
colors = px.colors.qualitative.Set1[:10]
bar_plot(mean_sentiment_by_country, 'Mean Sentiment of Top 10 Countries with Most Appearances in Tweets', 'user_location', 'mean_sentiment', 'user_location', colors, -45, (900,600), .5)





Sentiment analysis shows that tweets from the most prominent locations in the dataset had a rather positive mean sentiment (most likely a reaction to optimistic vaccine news rather than to COVID-19 itself), while in Toronto and in the rest of the world the opposite was true.

In [None]:
def calculate_mean_sentiment(df):
    return df['vader_score'].mean()

# Calculate the mean sentiment score for each vaccine
mean_sentiments = {
    'Pfizer': calculate_mean_sentiment(df_tweets_pz),
    'Sinopharm': calculate_mean_sentiment(df_tweets_sinopharm),
    'Sinovac': calculate_mean_sentiment(df_tweets_sinovac),
    'Moderna': calculate_mean_sentiment(df_tweets_moderna),
    'AstraZeneca': calculate_mean_sentiment(df_tweets_astrazeneca),
    'Covaxin': calculate_mean_sentiment(df_tweets_covaxin),
    'Sputnik V': calculate_mean_sentiment(df_tweets_sputnik)
}

# Convert mean sentiments to a DataFrame
mean_sentiments_df = pd.DataFrame(list(mean_sentiments.items()), columns=['Vaccine', 'Mean Sentiment']).sort_values('Mean Sentiment', ascending=False)
colors = px.colors.qualitative.Set1[:7]

# Plot the mean sentiment scores for each vaccine
bar_plot(mean_sentiments_df, 'Mean Sentiment Scores for Each Vaccine', 'Vaccine', 'Mean Sentiment', 'Vaccine', colors, -45, (900, 600), .15)


As can be seen, all vaccines were viewed rather positively by the public.

### Patterns in a Time Series

#### Trend

A trend in a time series is the long-term movement or direction of the data over a period of time.

#### Seasonality

Seasonality refers to predictable patterns that repeat over a fixed period due to seasonal factors.

To understand if the data has either a trend or a seasonality we will take a look at the number of tweets over time.
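Both notions can be made concrete with a few lines of plain Python. The sketch below estimates a trend with a centered moving average and then averages the detrended values at each position within the period to expose the seasonal pattern. The series is synthetic and the helper names `moving_average` and `seasonal_means` are assumptions made for this example; this is a simplified picture of what a classical additive decomposition does, not a reimplementation of `seasonal_decompose`.

```python
def moving_average(series: list[float], window: int) -> list[float]:
    """Centered moving average for an odd window; the edges are left out."""
    half = window // 2
    return [
        sum(series[i - half : i + half + 1]) / window
        for i in range(half, len(series) - half)
    ]

def seasonal_means(series: list[float], period: int) -> list[float]:
    """Average value at each position within the period."""
    return [
        sum(series[pos::period]) / len(series[pos::period])
        for pos in range(period)
    ]

# Synthetic example: upward trend plus a repeating 3-step pattern.
pattern = [0.0, 3.0, -3.0]
series = [float(t) + pattern[t % 3] for t in range(12)]

trend = moving_average(series, window=3)
# trend[i] is centered on series[i + 1], so align before subtracting.
detrended = [series[i + 1] - trend[i] for i in range(len(trend))]

print(trend)                                # steadily increasing values
print(seasonal_means(detrended, period=3))  # → [3.0, -3.0, 0.0]
```

Because the detrended series starts at the second observation, the recovered seasonal pattern is the original one shifted by one position.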

In [None]:
plot_time_variation(tweets_agg_df, title="Number of Tweets per Day", x='date_only', y='count', size=4)

After analyzing the COVID-19 vaccine tweets dataset, it appears that there are no significant trends or seasonality.
The time series decomposition reveals a relatively flat trend component, indicating no long-term increase or decrease in tweet volumes.

Additionally, the seasonal component does not show any consistent, repeating patterns over time.


Therefore, the data seems to be mostly influenced by irregular fluctuations rather than systematic trends or seasonal effects. This makes sense, since the number of tweets is driven by unpredictable events such as news reports, vaccine rollouts, public health announcements, and social media campaigns.


### Decomposition of a Time Series 

In [None]:

from statsmodels.tsa.seasonal import seasonal_decompose
from dateutil.parser import parse

# Multiplicative Decomposition 
result_mul = seasonal_decompose(tweets_agg_df['count'], model='multiplicative', extrapolate_trend='freq', period=150)

# Additive Decomposition
result_add = seasonal_decompose(tweets_agg_df['count'], model='additive', extrapolate_trend='freq', period=150)

# Plot
plt.rcParams.update({'figure.figsize': (16,12)})
result_mul.plot().suptitle('Multiplicative Decomposition', fontsize=16)
plt.tight_layout(rect=[0, 0.03, 1, 0.95])

result_add.plot().suptitle('Additive Decomposition', fontsize=16)
plt.tight_layout(rect=[0, 0.03, 1, 0.95])

plt.show()


### Stationary or Non-Stationarity Time Series

Now, we will try to understand whether our time series is stationary or not. A stationary time series is one whose statistical properties such as mean, variance, and autocorrelation are all constant over time.
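Before running formal tests, an informal check is to split the series in two and compare the mean and variance of each half: for a stationary series they should be similar. The sketch below illustrates this on two made-up series; the helper `half_stats` is hypothetical, and this heuristic is no substitute for the statistical tests that follow.

```python
from statistics import mean, pvariance

def half_stats(series: list[float]) -> tuple[tuple[float, float], tuple[float, float]]:
    """Return (mean, variance) for the first and the second half of the series."""
    mid = len(series) // 2
    first, second = series[:mid], series[mid:]
    return (mean(first), pvariance(first)), (mean(second), pvariance(second))

# Made-up series for illustration.
flat = [5.0, 7.0, 5.0, 7.0, 5.0, 7.0, 5.0, 7.0]          # same behaviour throughout
drifting = [1.0, 2.0, 3.0, 4.0, 11.0, 12.0, 13.0, 14.0]  # level shifts upward

print(half_stats(flat))      # → ((6.0, 1.0), (6.0, 1.0))
print(half_stats(drifting))  # → ((2.5, 1.25), (12.5, 1.25))
```

In the second series the variance is unchanged but the mean differs clearly between the halves, which already violates the constant-mean requirement.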

#### Augmented Dickey Fuller test (ADF Test)

In [None]:
from statsmodels.tsa.stattools import adfuller, kpss

result = adfuller(tweets_agg_df['count'])
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
    print('\t%s: %.3f' % (key, value))



Based on the results, we can conclude that the time series is not stationary.

In this test, we assume that the null hypothesis is the time series possesses a unit root and is non-stationary.

* Since the ADF statistic (-1.887130) is not less than the critical values at the 1%, 5%, or even 10% significance levels, you cannot reject the null hypothesis that the time series has a unit root.

* Additionally, the p-value (0.338140) is significantly higher than 0.05, which further indicates that you cannot reject the null hypothesis.
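The decision rule described above can be written as a small helper. In this sketch the ADF statistic is the one quoted in the text, while the critical values are typical large-sample figures for a test with a constant term; they are assumptions used for illustration, since the values printed by the cell depend on the sample size. The function name `adf_rejects_unit_root` is hypothetical.

```python
def adf_rejects_unit_root(statistic: float, critical_values: dict[str, float]) -> dict[str, bool]:
    """For each significance level, report whether the unit-root null is rejected.

    The null is rejected when the statistic is more negative than the critical value.
    """
    return {level: statistic < crit for level, crit in critical_values.items()}

# Assumed, approximate critical values (not copied from the notebook output).
assumed_critical_values = {"1%": -3.43, "5%": -2.86, "10%": -2.57}

print(adf_rejects_unit_root(-1.887130, assumed_critical_values))
# → {'1%': False, '5%': False, '10%': False}
```

Since the statistic is less negative than every critical value, the null hypothesis of a unit root is not rejected at any of the three levels.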

#### Kwiatkowski-Phillips-Schmidt-Shin – KPSS test

In [None]:
result = kpss(tweets_agg_df['count'])

print('\nKPSS Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[3].items():
    print('\t%s: %.3f' % (key, value))
    

The KPSS Test also supports the conclusion that the time series is not stationary.

* The KPSS statistic (0.621666) is greater than the critical values at the 10%, 5%, and 2.5% significance levels.

* The p-value is less than 0.05, which indicates that you can reject the null hypothesis that the time series is stationary.

A major difference between KPSS and ADF tests is the capability of the KPSS test to check for stationarity in the ‘presence of a deterministic trend’.
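Because the two tests have opposite null hypotheses, it is common to read them together. The sketch below encodes the usual four-way interpretation; the function `combine_adf_kpss` is a hypothetical helper, the ADF p-value is the one reported above, and the KPSS p-value of 0.02 is an assumed placeholder, since the text only states that it is below 0.05.

```python
def combine_adf_kpss(adf_p: float, kpss_p: float, alpha: float = 0.05) -> str:
    """Summarise two tests whose null hypotheses point in opposite directions."""
    adf_stationary = adf_p < alpha      # ADF null (unit root) rejected
    kpss_stationary = kpss_p >= alpha   # KPSS null (stationarity) not rejected
    if adf_stationary and kpss_stationary:
        return "stationary"
    if not adf_stationary and not kpss_stationary:
        return "non-stationary"
    if kpss_stationary:
        return "trend-stationary (consider detrending)"
    return "difference-stationary (consider differencing)"

print(combine_adf_kpss(adf_p=0.338140, kpss_p=0.02))
# → non-stationary
```

When the two tests disagree, the mixed cases suggest which transformation (detrending or differencing) is likely to make the series stationary.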

### Test for Seasonality

In [None]:
from pandas.plotting import autocorrelation_plot

# Draw Plot
plt.rcParams.update({'figure.figsize':(10,6), 'figure.dpi':120})
autocorrelation_plot(tweets_agg_df['count'].tolist())

### Autocorrelation and Partial Autocorrelation

* Autocorrelation is simply the correlation of a series with its own lags. If a series is significantly autocorrelated, that means, the previous values of the series (lags) may be helpful in predicting the current value.

* Partial Autocorrelation also conveys similar information but it conveys the pure correlation of a series and its lag, excluding the correlation contributions from the intermediate lags.
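For reference, the sample autocorrelation at a given lag can be computed by hand as follows. This is a minimal sketch on a synthetic series, using the biased estimator (the full-series variance in the denominator) that ACF plots commonly show; the function name `autocorrelation` is an assumption made for this example.

```python
def autocorrelation(series: list[float], lag: int) -> float:
    """Sample autocorrelation at the given lag (biased estimator)."""
    n = len(series)
    mu = sum(series) / n
    denom = sum((x - mu) ** 2 for x in series)
    numer = sum((series[t] - mu) * (series[t - lag] - mu) for t in range(lag, n))
    return numer / denom

# Synthetic alternating series: negative correlation at lag 1, positive at lag 2.
series = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

print(autocorrelation(series, 0))  # → 1.0
print(autocorrelation(series, 1))  # → -0.875
print(autocorrelation(series, 2))  # → 0.75
```

The magnitude shrinks with the lag even for a perfectly periodic series, because fewer pairs of observations contribute to the numerator.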

In [None]:
from statsmodels.tsa.stattools import acf, pacf
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Draw Plot
fig, axes = plt.subplots(1,2,figsize=(16,3), dpi= 100)
plot_acf(tweets_agg_df['count'].tolist(), lags=50, ax=axes[0])
plot_pacf(tweets_agg_df['count'].tolist(), lags=50, ax=axes[1])


The presence of significant spikes in both plots suggests that there is autocorrelation at lags 1 and 2.

Let's consider an ARIMA model with p (number of lag observations included in the model) equal to 2 and q (size of the moving average window) equal to 2.
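To give an intuition for what p = 2 means, the sketch below iterates a pure autoregressive model of order 2, in which each new value is a constant plus a weighted sum of the two previous values. The coefficients and starting values are arbitrary assumptions chosen for illustration; a fitted ARIMA model estimates them from the data and additionally includes the moving-average and differencing parts. The helper `ar2_forecast` is hypothetical.

```python
def ar2_forecast(history: list[float], phi1: float, phi2: float,
                 const: float, steps: int) -> list[float]:
    """Iterate y_t = const + phi1 * y_{t-1} + phi2 * y_{t-2} forward for `steps` steps."""
    values = list(history[-2:])  # only the last two observations are needed
    forecasts = []
    for _ in range(steps):
        nxt = const + phi1 * values[-1] + phi2 * values[-2]
        forecasts.append(nxt)
        values.append(nxt)
    return forecasts

print(ar2_forecast([10.0, 12.0], phi1=0.5, phi2=0.25, const=2.0, steps=3))
# → [10.5, 10.25, 9.75]
```

Each forecast feeds into the next one, which is why multi-step forecasts from such models tend to flatten out toward a constant level.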

In [None]:
#Number of tweets per day of the week
df_dayofweek = df_tweets.groupby('dayofweek').size().reset_index(name='count')
df_dayofweek['dayofweek'] = df_dayofweek['dayofweek'].map({0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 3: 'Thursday', 4: 'Friday', 5: 'Saturday', 6: 'Sunday'})

df_dayofweek['dayofweek'] = pd.Categorical(df_dayofweek['dayofweek'], categories=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'], ordered=True)

colors = px.colors.qualitative.Set1
bar_plot(df_dayofweek, 'Number of Tweets per Day of the Week', 'dayofweek', 'count', 'dayofweek', colors, 0, (900, 600))

Tweets regarding COVID were more prominent on weekdays than at the weekend. This may be related to the scheduled release of weekly reports.

In [None]:
#Number of tweets per hour
df_hour = df_tweets.groupby('hour').size().reset_index(name='count')
df_hour = df_hour.sort_values('count', ascending=False)

colors = px.colors.qualitative.Set2[:10]
scale = "aggrnyl"
bar_plot(df_hour, 'Number of Tweets per Hour', 'hour', 'count', 'count', colors, 0, (900, 600),0,scale)



As expected, most tweets happened during day hours, with a peak from 13h to 15h.

In [None]:
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error

tweet_counts_hourly = df_tweets.groupby(['year', 'month', 'day', 'hour']).size().reset_index(name='tweet_count')

# Convert date columns to datetime format
tweet_counts_hourly['date'] = pd.to_datetime(tweet_counts_hourly[['year', 'month', 'day', 'hour']])

# Drop redundant columns
tweet_counts_hourly.drop(columns=['year', 'month', 'day', 'hour'], inplace=True)

# Set 'date' column as index
tweet_counts_hourly.set_index('date', inplace=True)
tweet_counts_hourly = tweet_counts_hourly.asfreq('H')
# Plot the time series
plt.figure(figsize=(10, 6))
plt.plot(tweet_counts_hourly.index, tweet_counts_hourly['tweet_count'])
plt.title('Hourly Tweet Counts')
plt.xlabel('Date')
plt.ylabel('Tweet Count')
plt.show()





### Forecasting

In [None]:

import matplotlib.dates as mdates


sample = tweet_counts_hourly[tweet_counts_hourly.index.month == 8]
sample = sample[sample.index.year == 2021]
# keep only days 1 to 20
sample = sample[sample.index.day <= 20]

# Split the data into train and test
train_size = int(len(sample) * 0.8)
train, test = sample[0:train_size], sample[train_size:len(sample)]

# Fit the ARIMA model on the training dataset
model_train = ARIMA(train, order=(2,1,2))
model_train_fit = model_train.fit()

# Forecast on the test dataset
test_forecast = model_train_fit.get_forecast(steps=len(test))
test_forecast_series = pd.Series(test_forecast.predicted_mean, index=test.index)

test = test.dropna()
test_forecast_series = test_forecast_series.dropna()

test_forecast_series = test_forecast_series.loc[test.index]

# Calculate the mean squared error
mse = mean_squared_error(test['tweet_count'], test_forecast_series)
rmse = mse**0.5


# Create a plot to compare the forecast with the actual test data
plt.figure(figsize=(10,5))
plt.plot(train, label='Training Data', color='blue')

plt.plot(test, label='Actual Test Data', color='green')
plt.plot(test_forecast_series, label='Forecast', color='red')

plt.title('ARIMA Model Forecast')
plt.xlabel('Date')
plt.ylabel('Tweet Count')

# Format the x-axis dates
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
plt.gca().xaxis.set_major_locator(mdates.DayLocator(interval=1))

plt.legend()
plt.grid(True)
plt.xticks(rotation=45)  # Rotate date labels for better readability
plt.tight_layout()  # Adjust layout to prevent clipping of date labels
plt.show()

print(f'Root Mean Squared Error: {rmse:.2f}')



The high RMSE suggests that the series has little regular structure, such as seasonality, for the model to exploit. It also illustrates how hard it is to predict tweet counts about an unpredictable event like the COVID epidemic.
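
Whether an RMSE counts as high is easier to judge against a simple baseline. The sketch below, in plain Python with made-up numbers, compares a forecast's RMSE with that of a naive forecast that repeats the last training value. The helper `rmse` and all values are illustrative assumptions, not results from this dataset.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up hourly counts: last training value, test values, and a model forecast
last_train_value = 40
actual = [42, 55, 38, 61, 47]
model_forecast = [45, 46, 46, 47, 47]

# Naive baseline: repeat the last observed training value for every step
naive_forecast = [last_train_value] * len(actual)

model_rmse = rmse(actual, model_forecast)
naive_rmse = rmse(actual, naive_forecast)
print(f"model RMSE: {model_rmse:.2f}, naive RMSE: {naive_rmse:.2f}")
```

A model RMSE close to or above the naive one would indicate that the model adds little beyond the last observation.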

In [None]:
# Aggregate tweet counts by day
tweet_counts_daily = df_tweets.groupby(['year', 'month', 'day']).size().reset_index(name='tweet_count')

# Convert date columns to datetime format
tweet_counts_daily['date'] = pd.to_datetime(tweet_counts_daily[['year', 'month', 'day']])


# Drop redundant columns
tweet_counts_daily.drop(columns=['year', 'month', 'day'], inplace=True)

# Set 'date' column as index

tweet_counts_daily.set_index('date', inplace=True)
tweet_counts_daily = tweet_counts_daily.asfreq('D')

# Plot the time series
plt.figure(figsize=(10, 6))
plt.plot(tweet_counts_daily.index, tweet_counts_daily['tweet_count'])
plt.title('Daily Tweet Counts')
plt.xlabel('Date')
plt.ylabel('Tweet Count')
plt.show()


In [None]:
# Keep February to April; copy so that new columns can be added safely
months = tweet_counts_daily.index.month
data_sample = tweet_counts_daily[(months >= 2) & (months <= 4)].copy()



#### Choosing a model
Exponential smoothing methods are appropriate for non-stationary data (i.e., data with a trend and/or seasonality).

ARIMA models should be fitted to stationary data only. One should therefore first stabilise the series (for example by deflating or taking logarithms) and then look at the differenced series.
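
To make the differencing step concrete, here is a minimal plain-Python sketch. It assumes a made-up series with a linear trend and shows that first differencing removes that trend. The helper `difference` is hypothetical and only stands in for what `pandas.Series.diff` or the `d` term of an ARIMA model does.

```python
def difference(series, order=1):
    """Apply first differencing `order` times: y'_t = y_t - y_{t-1}."""
    result = list(series)
    for _ in range(order):
        result = [result[t] - result[t - 1] for t in range(1, len(result))]
    return result

# Made-up series: linear trend (slope 3) plus a small repeating wiggle
wiggle = [0, 1, 0, -1]
trended = [5 + 3 * t + wiggle[t % 4] for t in range(12)]

first_diff = difference(trended, order=1)

print("original:   ", trended)
print("differenced:", first_diff)
```

The differenced values fluctuate around the slope instead of growing over time, which is the kind of series an ARMA model can then be fitted to.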

#### Smoothing methods
Smoothing methods produce forecasts as weighted averages of past observations. The weights can be uniform (a moving average) or follow an exponential decay, which gives more weight to recent observations and less weight to old ones. More advanced methods add further parts to the forecast, such as trend and seasonal components.

We will use the component form for our mathematical equations: y denotes the time series, p the forecast, l the level, s the seasonal component and b the trend component.
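
For reference, the additive Holt-Winters method can be written in component form with this notation. The smoothing parameters $\alpha$, $\beta$, $\gamma$, the seasonal period $m$ and the forecast horizon $h$ are symbols introduced here for illustration; this is the usual textbook formulation, not something derived from the notebook's data.

$$
\begin{aligned}
p_{t+h} &= l_t + h\,b_t + s_{t+h-m(k+1)}, \qquad k = \lfloor (h-1)/m \rfloor \\
l_t &= \alpha\,(y_t - s_{t-m}) + (1-\alpha)\,(l_{t-1} + b_{t-1}) \\
b_t &= \beta\,(l_t - l_{t-1}) + (1-\beta)\,b_{t-1} \\
s_t &= \gamma\,(y_t - l_{t-1} - b_{t-1}) + (1-\gamma)\,s_{t-m}
\end{aligned}
$$

Dropping the seasonal terms gives Holt's linear trend method. Dropping the trend terms as well gives simple exponential smoothing, $l_t = \alpha\,y_t + (1-\alpha)\,l_{t-1}$ with $p_{t+h} = l_t$.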

#### Simple Exponential Smoothing

When to use?
* Few data points, Irregular data, No seasonality or trend.
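
The recursion behind simple exponential smoothing can be sketched in a few lines of plain Python. This is an illustrative implementation, not the statsmodels one: the function name `simple_exp_smoothing` and the toy data are assumptions, and the initial level is set to the first observation, which is only one of several common conventions.

```python
def simple_exp_smoothing(series, alpha):
    """Return one-step-ahead fitted values and the final level.

    Assumes the initial level equals the first observation; libraries
    may initialise differently.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]
    fitted = []
    for y in series:
        fitted.append(level)  # forecast for this step, made before seeing y
        level = alpha * y + (1 - alpha) * level  # update level with the new observation
    return fitted, level

toy_counts = [20, 24, 22, 30, 28]
fitted_02, level_02 = simple_exp_smoothing(toy_counts, alpha=0.2)
fitted_05, level_05 = simple_exp_smoothing(toy_counts, alpha=0.5)
print(fitted_02, level_02)
print(fitted_05, level_05)
```

A larger `alpha` makes the fitted values react faster to recent observations, while a smaller one yields a smoother curve.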



In [None]:
from statsmodels.tsa.holtwinters import SimpleExpSmoothing, Holt, ExponentialSmoothing


# Set the value of Alpha and define m (Time Period)
m = 12
alpha = 1/(2*m)

data_sample['alpha.2'] = SimpleExpSmoothing(data_sample['tweet_count']).fit(smoothing_level=.2,optimized=False).fittedvalues
data_sample['alpha.5'] = SimpleExpSmoothing(data_sample['tweet_count']).fit(smoothing_level=.5,optimized=False).fittedvalues



# Plot
plt.figure(figsize=(10,5))
plt.plot(data_sample['tweet_count'], label='Actual')
plt.plot(data_sample['alpha.2'], label='alpha= 0.2')
plt.plot(data_sample['alpha.5'], label='alpha= 0.5')
plt.title('Simple Exponential Smoothing')

# Format the x-axis dates
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))


plt.grid(True)

plt.tight_layout()  # Adjust layout to prevent clipping of date labels
plt.legend()
plt.show()



#### Holt's Exponential Smoothing

When to use?
* Trend in data, No seasonality.
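
Holt's method extends the level recursion with a second one for the trend. The plain-Python sketch below is illustrative only: `holt_linear`, its initialisation (level from the first observation, trend from the first difference) and the toy series are assumptions made for this example, and statsmodels may initialise and optimise differently.

```python
def holt_linear(series, alpha, beta, horizon):
    """Holt's linear trend method; returns `horizon` forecasts past the end.

    Assumes the level starts at the first observation and the trend at the
    first difference (a simple convention chosen for this sketch).
    """
    if len(series) < 2:
        raise ValueError("need at least two observations")
    level = series[0]
    trend = series[1] - series[0]
    for y in series[1:]:
        previous_level = level
        level = alpha * y + (1 - alpha) * (previous_level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
    # Forecasts extend the last level along the last trend
    return [level + h * trend for h in range(1, horizon + 1)]

linear_series = [10, 12, 14, 16, 18]
holt_forecast = holt_linear(linear_series, alpha=0.3, beta=0.05, horizon=3)
print(holt_forecast)
```

On a perfectly linear series the forecasts simply continue the line, whatever the values of `alpha` and `beta`.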

In [None]:
# `smoothing_trend` is the current name of the beta parameter (formerly `smoothing_slope`)
data_sample['holt_.05'] = Holt(data_sample['tweet_count']).fit(smoothing_level=.3, smoothing_trend=.05,optimized=False).fittedvalues
data_sample['holt_.2'] = Holt(data_sample['tweet_count']).fit(smoothing_level=.3, smoothing_trend=.2,optimized=False).fittedvalues

# Plot
plt.figure(figsize=(10,5))
plt.plot(data_sample['tweet_count'], label='Actual')
plt.plot(data_sample['holt_.05'], label='alpha= 0.3 & beta= 0.05')
plt.plot(data_sample['holt_.2'], label='alpha= 0.3 & beta= 0.2')
plt.title('Holt\'s Exponential Smoothing')

# Format the x-axis dates
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))


plt.grid(True)

plt.tight_layout()  # Adjust layout to prevent clipping of date labels
plt.legend()
plt.show()



#### Holt’s Damped Trend

The problem with Holt’s linear trend method is that the forecast trend stays constant, so forecasts increase or decrease indefinitely. For long forecast horizons, this can be problematic. The damped trend method adds a damping parameter 𝜙 so that the forecasts converge to a constant value in the future (the trend flattens out). The trend 𝑏 is replaced by 𝜙𝑏.
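
The effect of the damping parameter can be seen in a small plain-Python sketch. It is an illustration under stated assumptions rather than the statsmodels implementation: the function `holt_damped`, its initialisation and the toy series are hypothetical. Setting `phi=1.0` recovers the undamped linear trend method.

```python
def holt_damped(series, alpha, beta, phi, horizon):
    """Additive damped-trend method; returns `horizon` forecasts.

    Assumes level = first observation and trend = first difference at the
    start (a simple convention chosen for this sketch).
    """
    if len(series) < 2:
        raise ValueError("need at least two observations")
    level = series[0]
    trend = series[1] - series[0]
    for y in series[1:]:
        previous_level = level
        level = alpha * y + (1 - alpha) * (previous_level + phi * trend)
        trend = beta * (level - previous_level) + (1 - beta) * phi * trend
    forecasts = []
    damped_sum = 0.0
    for h in range(1, horizon + 1):
        damped_sum += phi ** h  # phi + phi^2 + ... + phi^h
        forecasts.append(level + damped_sum * trend)
    return forecasts

rising_series = [10, 12, 14, 16, 18]
damped = holt_damped(rising_series, alpha=0.3, beta=0.05, phi=0.8, horizon=50)
undamped = holt_damped(rising_series, alpha=0.3, beta=0.05, phi=1.0, horizon=50)
print(damped[-1], undamped[-1])
```

With `phi` below 1 the forecasts level off after a few steps, whereas the undamped forecasts keep growing along a straight line.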

When to use?
* Data has a trend. Use the multiplicative version unless the data has been log-transformed beforehand, in which case use the additive version.

Note that the cell below does not apply damping: it fits an additive Holt-Winters (triple exponential smoothing) model, which adds a seasonal component with a period of 12 observations to the trend.

In [None]:
data_sample['HWES3_ADD'] = ExponentialSmoothing(data_sample['tweet_count'],trend='add',seasonal='add',seasonal_periods=12).fit().fittedvalues


# Plot
plt.figure(figsize=(10,5))
plt.plot(data_sample['tweet_count'], label='Actual')

plt.plot(data_sample['HWES3_ADD'], label='Triple Exponential Smoothing', color='green')
plt.title('Triple Exponential Smoothing')
plt.xlabel('Date')
plt.ylabel('Tweet Count')

# Format the x-axis dates
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))

plt.grid(True)

plt.tight_layout()  # Adjust layout to prevent clipping of date labels

plt.legend()
plt.show()


Although exponential smoothing did not forecast the series particularly well, triple exponential smoothing was substantially more successful than the other strategies and was able to capture the overall trend of the tweet count.


Splitting the data into training and testing sets.

In [None]:
# Split into train and test sets
train_size = int(len(data_sample) * 0.7)
train, test = data_sample[0:train_size], data_sample[train_size:len(data_sample)]

In [None]:


# Fit the Holt-Winters model on the training dataset
fitted_model = ExponentialSmoothing(train['tweet_count'],trend='add',seasonal='add',seasonal_periods=12).fit()

# Forecast on the test dataset
test_predictions = fitted_model.forecast(len(test))

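# Calculate the root mean squared error
rmse = mean_squared_error(test['tweet_count'], test_predictions) ** 0.5
print(f'Root Mean Squared Error: {rmse:.2f}')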

# Create a plot to compare the forecast with the actual test data
plt.figure(figsize=(10,5))
plt.plot(train['tweet_count'], label='Training Data', color='blue')
plt.plot(test['tweet_count'], label='Actual Test Data', color='green')
plt.plot(test_predictions, label='Forecast', color='red')
plt.title('Holt-Winters Model Forecast')
plt.xlabel('Date')
plt.ylabel('Tweet Count')
# Format the x-axis dates
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))


plt.grid(True)

plt.tight_layout()  # Adjust layout to prevent clipping of date labels
plt.legend()
plt.show()


#### ARIMA

ARIMA models (which include ARMA, AR and MA models) are a general class of models to forecast stationary time series.

An ARIMA model is usually written ARIMA(p, d, q), where p is the order of the AR part, d the order of differencing (the “I” part), and q the order of the MA part.

1. Choosing the differencing order

The first step of fitting an ARIMA model is to determine the differencing order to stationarize the series. To do that, we look at the ACF and PACF plots, and keep in mind these two rules:

#### Plotting predictions

In [None]:

import matplotlib.dates as mdates

sample = tweet_counts_hourly[tweet_counts_hourly.index.month == 8]
sample = sample[sample.index.year == 2021]
# keep only days 1 to 20
sample = sample[sample.index.day <= 20]

# Split the data into train and test
train_size = int(len(sample) * 0.8)
train, test = sample[0:train_size], sample[train_size:len(sample)]

# Fit the ARIMA model on the training dataset
model_train = ARIMA(train, order=(2,1,2))
model_train_fit = model_train.fit()

# Forecast on the test dataset
test_forecast = model_train_fit.get_forecast(steps=len(test))
test_forecast_series = pd.Series(test_forecast.predicted_mean, index=test.index)

test = test.dropna()
test_forecast_series = test_forecast_series.dropna()

test_forecast_series = test_forecast_series.loc[test.index]

# Calculate the mean squared error
mse = mean_squared_error(test['tweet_count'], test_forecast_series)
rmse = mse**0.5


# Create a plot to compare the forecast with the actual test data
plt.figure(figsize=(10,5))
plt.plot(train, label='Training Data', color='blue')

plt.plot(test, label='Actual Test Data', color='green')
plt.plot(test_forecast_series, label='Forecast', color='red')

plt.title('ARIMA Model Forecast')
plt.xlabel('Date')
plt.ylabel('Tweet Count')

# Format the x-axis dates
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
plt.gca().xaxis.set_major_locator(mdates.DayLocator(interval=1))

plt.legend()
plt.grid(True)
plt.xticks(rotation=45)  # Rotate date labels for better readability
plt.tight_layout()  # Adjust layout to prevent clipping of date labels
plt.show()

print(f'Root Mean Squared Error: {rmse:.2f}')