# **Covid-19 Vaccines - Sentiment Analysis & Time Series**
Notebook for the second project for the Machine Learning Complements course (CAC).

## **Introduction**


## Imports

The following libraries will be used in this project:

In [None]:
import os
import pandas as pd
import re
import numpy as np
import matplotlib.pyplot as plt
import utils as ut
import warnings
import seaborn as sns
import contractions
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.classify import NaiveBayesClassifier
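# Resources needed by VADER, word_tokenize and WordNetLemmatizer.
# Newer NLTK releases may also require 'punkt_tab' for word_tokenize.
nltk.download('vader_lexicon')
nltk.download('punkt')
nltk.download('wordnet')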
warnings.simplefilter(action='ignore')


## Load Data

In [None]:
df_tweets = pd.read_csv('tweets.csv')

## Initial Observations

The dataset consists of a single file: `tweets.csv`.

In this section we look at its first few rows and do some initial data exploration to get a better understanding of the data.

In [None]:
ut.initial_obs(df_tweets)

## Data Understanding

## Data Pre-Processing
We can see that many attributes are not relevant for the kind of work we will be doing. Therefore, we will select only the most relevant ones.

In [None]:
# .copy() makes the later column assignments act on an independent DataFrame
df_tweets = df_tweets[['id','user_location','date','text','hashtags']].copy()
pd.set_option('display.max_colwidth', None)

df_tweets['text'].head(5)

### Removing Spaces within the text
We trim unnecessary whitespace from each tweet so that later steps, such as tokenization, operate on cleanly separated words.

In [None]:
df_tweets['text'] = df_tweets['text'].apply(ut.trim_text)
df_tweets['text'].head(5)
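
The `ut.trim_text` helper lives in the project's `utils` module and is not shown in this notebook. As a rough sketch of what such a step could look like, assuming it collapses runs of whitespace and strips both ends (the name `collapse_whitespace` is hypothetical):

```python
import re

def collapse_whitespace(text: str) -> str:
    """Collapse every run of whitespace into one space and strip both ends."""
    return re.sub(r"\s+", " ", text).strip()

sample = "  Got my   first dose\n\ntoday  "
cleaned = collapse_whitespace(sample)
print(cleaned)
```

Newlines and tabs count as whitespace here, so multi-line tweets end up on a single line.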

### Contractions Mapping
Contraction mapping expands contractions such as "can't" into "cannot", so that both forms are treated as the same tokens in later analysis.

In [None]:
df_tweets['text'] = df_tweets['text'].apply(contractions.fix)
df_tweets['text'].head(5)
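
To illustrate the idea behind contraction mapping, here is a toy sketch with a hand-written dictionary. The `contractions` package ships a much larger mapping; `CONTRACTION_MAP` and `expand_contractions` are hypothetical names used only for this example.

```python
import re

# Tiny illustrative mapping; a real mapping has hundreds of entries.
CONTRACTION_MAP = {
    "can't": "cannot",
    "won't": "will not",
    "i'm": "I am",
    "it's": "it is",
}

# One alternation over all known contractions, matched case-insensitively.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTION_MAP) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text: str) -> str:
    """Replace every known contraction with its expanded form."""
    return _PATTERN.sub(lambda m: CONTRACTION_MAP[m.group(0).lower()], text)

expanded = expand_contractions("I can't wait, it's my turn")
print(expanded)
```

This sketch only matches the straight apostrophe (`'`); tweets that use a typographic apostrophe would need to be normalized first.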

### Cleaning URLs
Tweets often contain links. The step below removes every token that starts with `http`, so that URLs do not interfere with the NLP tasks that follow.

In [None]:
df_tweets['text'] = df_tweets['text'].apply(lambda x:re.sub(r"http\S+", "", x))
df_tweets['text'].head(5)
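
The effect of the `http\S+` pattern can be checked on a single string. The sample tweet below is made up for illustration, and `strip_urls` is a hypothetical helper name:

```python
import re

URL_PATTERN = re.compile(r"http\S+")

def strip_urls(text: str) -> str:
    """Remove every run of non-space characters that starts with 'http'."""
    return URL_PATTERN.sub("", text)

tweet = "Vaccine rollout update https://t.co/abc123 stay safe"
without_url = strip_urls(tweet)
print(repr(without_url))
```

Two things are worth noting: the removal leaves a double space behind, which a later whitespace-collapsing step has to clean up, and links written without a scheme (for example `www.example.com`) are not matched by this pattern.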

### Emojis & Emotion Handling
Emojis carry nuances of sentiment and expression that plain words may miss. We initially considered removing them, but their presence may be crucial for identifying the sentiment of a tweet, so we convert them to text instead.

In [None]:
pattern = re.compile(r'[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF\U0001F700-\U0001F77F\U0001F780-\U0001F7FF\U0001F800-\U0001F8FF\U0001F900-\U0001F9FF\U0001FA00-\U0001FA6F\U0001FA70-\U0001FAFF\U00002702-\U000027B0\U000024C2-\U0001F251\U0001F004\U0001F0CF\U0001F170-\U0001F251\U0001F600-\U0001F64F\U0001F680-\U0001F6FF]+', flags=re.UNICODE)

# Find examples in df_tweets['text'] that have emojis
emojis_examples = df_tweets[df_tweets['text'].str.contains(pattern, na=False)][0:5]

for index in emojis_examples.index:
    print(df_tweets.loc[index, 'text'])

df_tweets['text'] = df_tweets['text'].apply(ut.convert_emojis_to_text)

print('\n')

for index in emojis_examples.index:
    emoji_text = df_tweets.loc[index, 'text']
    print(emoji_text)
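
The `ut.convert_emojis_to_text` helper is not shown in this notebook. One possible way to do such a conversion with the standard library alone is to look up each emoji's Unicode name. The sketch below is an assumption about the general approach, not the project's implementation; `emoji_to_text` and the code-point ranges it checks are choices made for this example.

```python
import unicodedata

def emoji_to_text(text: str) -> str:
    """Replace characters from common emoji blocks with their Unicode names."""
    out = []
    for ch in text:
        code = ord(ch)
        # Assumed ranges: main pictograph/emoticon blocks plus misc symbols and dingbats.
        if 0x1F300 <= code <= 0x1FAFF or 0x2600 <= code <= 0x27BF:
            name = unicodedata.name(ch, "")
            # Pad with spaces so the name does not merge with neighbouring words.
            out.append(f" {name.lower().replace(' ', '_')} " if name else " ")
        else:
            out.append(ch)
    return "".join(out)

converted = emoji_to_text("Got my jab 😀")
print(converted)
```

Multi-code-point emojis (skin-tone modifiers, joined sequences) are handled one code point at a time here, so a dedicated emoji library is usually the better choice for real data.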

### Handling Twitter Handles (@) & Hashtags
Handling Twitter handles (@) and hashtags supports contextual analysis and topic extraction in social media text. We remove Twitter handles, since they mostly identify people and are therefore not very important for this task. Hashtags, on the other hand, may indicate sentiment or other important information such as topics, e.g. #sad, #happy or #astrazeneca.

In [None]:
df_tweets['text'] = df_tweets['text'].apply(ut.remove_twitter_handles_hashtags)

for index in emojis_examples.index:
    emoji_text = df_tweets.loc[index, 'text']
    print(emoji_text)
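
The behaviour described above (drop handles, keep the words carried by hashtags) could be sketched with two regular expressions. This is only an assumption about what `ut.remove_twitter_handles_hashtags` does, since that helper is not shown; `drop_handles_keep_hashtag_words` is a hypothetical name.

```python
import re

def drop_handles_keep_hashtag_words(text: str) -> str:
    """Remove @handles entirely and strip the '#' from hashtags."""
    text = re.sub(r"@\w+", "", text)       # "@who" -> ""
    text = re.sub(r"#(\w+)", r"\1", text)  # "#happy" -> "happy"
    return text

result = drop_handles_keep_hashtag_words("@who thanks for the update #happy #astrazeneca")
print(result.strip())
```

Keeping the hashtag word means that tags such as `#happy` still contribute to a lexicon-based sentiment score.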

### Convert text to lower-case
We convert all characters to lower case so that different casings of the same word are treated as one token. The drawback is that on social media people may use upper case to express sentiment, e.g. SAD or HAPPY, and that signal is lost.

In this step we also remove special characters, isolated single letters and repeated whitespace.

In [None]:
df_tweets['text'] = df_tweets['text'].apply(ut.remove_special_characters)

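# Remove isolated single letters; replace with a space so neighbouring words are not merged
df_tweets['text'] = df_tweets['text'].apply(lambda x: re.sub(r'\s+[a-zA-Z]\s+', ' ', x))
# Collapse repeated whitespace into a single space
df_tweets['text'] = df_tweets['text'].apply(lambda x: re.sub(r'\s+', ' ', x))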

df_tweets['text'] = df_tweets['text'].str.lower()

### Tokenization
Tokenization breaks text down into individual units, such as words or phrases, enabling granular analysis and feature extraction. We also remove stopwords, i.e. very frequent words that add little meaning to the text.

In [None]:
df_tweets['tokenized_text'] = df_tweets['text'].apply(lambda x: word_tokenize(x))
df_tweets['tokenized_text'] = df_tweets['tokenized_text'].apply(ut.remove_stopwords)
df_tweets['token_text'] = df_tweets['tokenized_text'].apply(lambda text: " ".join(text))


df_tweets['tokenized_text'].head(5)
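
As a dependency-free illustration of the two steps, the sketch below tokenizes with a simple regular expression and filters against a tiny stopword set. It is not equivalent to NLTK's `word_tokenize` or to the project's `ut.remove_stopwords`; `STOPWORDS`, `tokenize` and `remove_stopwords` are hypothetical names for this example.

```python
import re

# Tiny illustrative list; real stopword lists contain well over a hundred entries.
STOPWORDS = {"the", "a", "an", "is", "to", "and", "of", "i", "my"}

def tokenize(text: str) -> list:
    """Lower-case the text and split it into runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def remove_stopwords(tokens: list) -> list:
    """Keep only the tokens that are not in the stopword set."""
    return [tok for tok in tokens if tok not in STOPWORDS]

tokens = remove_stopwords(tokenize("I got the second dose of my vaccine"))
print(tokens)
```

For sentiment work it can be worth checking whether the stopword list contains negations such as "not", since removing them can flip the meaning of a tweet.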

### Stemming
Stemming chops off word endings with heuristic rules to approximate the root form (the Porter stemmer used here strips suffixes). It is simpler and faster than lemmatization, but the result is not always a valid word. For instance, "studies" is stemmed to "studi", which is not a valid word.

In [None]:
stemmer = PorterStemmer()
df_tweets['stemmed_text'] = df_tweets['tokenized_text'].apply(lambda x: [stemmer.stem(word) for word in x])

df_tweets['stemmed_text'].head(5)
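
To make the "chopping" idea concrete, here is a deliberately naive suffix stripper. It is not the Porter algorithm, which applies several ordered rule phases with extra conditions; `SUFFIXES` and `naive_stem` are hypothetical names for this illustration only.

```python
# Suffixes are tried in order; the first match is removed.
SUFFIXES = ("ing", "ies", "ed", "es", "s")

def naive_stem(word: str, min_stem: int = 3) -> str:
    """Strip the first matching suffix, as long as at least `min_stem` characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

stems = [naive_stem(w) for w in ["vaccines", "vaccinated", "studies", "dose"]]
print(stems)
```

Even this toy version shows the typical trade-off: the stems are cheap to compute, but several of them are not dictionary words.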

### Lemmatization
Lemmatization, on the other hand, resolves words to their dictionary form, known as the lemma. It uses lexical knowledge bases to ensure that the returned form is a valid word. For example, "am", "are" and "is" are all lemmatized to "be" when the lemmatizer is told they are verbs. Note that `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is passed, so the call below leaves most verb forms unchanged. Lemmatization is generally more accurate than stemming but can be slower due to its linguistic complexity.

In [None]:
lemmatizer = WordNetLemmatizer()
df_tweets['lemmatized_text'] = df_tweets['tokenized_text'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])

df_tweets['clean_text'] = df_tweets['lemmatized_text'].apply(lambda text: " ".join(text))

# Keep the preview as the last expression so the notebook displays it
df_tweets['lemmatized_text'].head(5)


## Sentiment Analysis - Using VADER

In [None]:
sid = SentimentIntensityAnalyzer()

def analyze_sentiment(text):
    vader_scores = sid.polarity_scores(text)['compound']
    if vader_scores >= 0.05:
        sentiment = 'Positive'
    elif vader_scores <= -0.05:
        sentiment = 'Negative'
    else:
        sentiment = 'Neutral'
    return sentiment, vader_scores

df_tweets['sentiment'], df_tweets['vader_score'] = zip(*df_tweets['text'].apply(analyze_sentiment))
#df_tweets['sentiment'] = df_tweets['sentiment'].replace({'Positive': 1, 'Neutral': 0, 'Negative': -1})
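
The labelling rule used in `analyze_sentiment` can be isolated and checked without VADER itself. The sketch below applies the same ±0.05 cut-offs (a commonly used convention for the compound score) to a few hand-picked values; `label_from_compound` is a hypothetical name and the scores are made up.

```python
def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a compound score in [-1, 1] to a sentiment label."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"

# Made-up scores, including both boundary values.
labels = [label_from_compound(s) for s in (0.62, 0.05, 0.0, -0.049, -0.05)]
print(labels)
```

Both boundaries are inclusive, so a score of exactly 0.05 or -0.05 is not labelled neutral.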

In [None]:
ut.plot_sentiments(df_tweets)

### WordClouds

#### Positive Sentiment - WordCloud

In [None]:
positive_tweets = df_tweets[df_tweets['sentiment'] == "Positive"]
negative_tweets = df_tweets[df_tweets['sentiment'] == "Negative"]
neutral_tweets = df_tweets[df_tweets['sentiment'] == "Neutral"]

ut.generate_word_cloud(positive_tweets['token_text'], 'Positive Sentiment Word Cloud')
positive_tweets['clean_text'].head(5)

In [None]:
ut.common_words(positive_tweets, 50)

#### Neutral Sentiment - WordCloud

In [None]:
ut.generate_word_cloud(neutral_tweets['token_text'], 'Neutral Sentiment Word Cloud')
neutral_tweets['clean_text'].head(5)

In [None]:
ut.common_words(neutral_tweets, 50)

#### Negative Sentiment - WordCloud

In [None]:
ut.generate_word_cloud(negative_tweets['token_text'], 'Negative Sentiment Word Cloud')
negative_tweets['clean_text'].head(5)

In [None]:
ut.common_words(negative_tweets, 50)

### N-Gram Analysis by sentiment

#### Uni-Gram

In [None]:
ut.plot_n_grams(df_tweets, 1)

#### Bi-Gram

In [None]:
ut.plot_n_grams(df_tweets, 2)

#### Tri-Gram

In [None]:
ut.plot_n_grams(df_tweets, 3)
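
The `ut.plot_n_grams` helper is not shown here. As a minimal sketch of the counting that an n-gram plot is typically built on, assuming already tokenized tweets (the `ngrams` function and the sample documents are made up for this example):

```python
from collections import Counter

def ngrams(tokens: list, n: int) -> list:
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

docs = [
    ["second", "dose", "done"],
    ["second", "dose", "tomorrow"],
]

# Count bi-grams across all documents; n-grams never span two tweets.
bigram_counts = Counter(gram for doc in docs for gram in ngrams(doc, 2))
top = bigram_counts.most_common(1)
print(top)
```

Running the same count separately on the positive, neutral and negative subsets gives the per-sentiment comparison.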

### Plotting Average Word Amount by Sentiment

In [None]:
ut.plot_avg_word_length_distribution_multi(positive_tweets, neutral_tweets, negative_tweets)



## Sentiment Analysis - Typical ML Approach
Our dataset is not labelled, so we cannot simply split it into train and test sets and train a model on it. Instead, we train a model on a separate labelled tweet dataset with a similar theme, and then apply it to our dataset.

In [None]:
"""
training_df = pd.read_csv('train_tweets.csv')
training_df = training_df[training_df['new_sentiment'].notna()]
training_df = training_df[['old_text','new_sentiment']]


training_df = ut.pre_process_pipeline(training_df,'old_text')
training_df.rename(columns={'new_sentiment': 'sentiment'}, inplace=True)
training_df['sentiment'] = training_df['sentiment'].replace({'positive': 1, 'neutral': 0, 'negative': -1})

ut.initial_obs(training_df)"""

### Generating Training-Features

In [None]:
"""
train_features, test_features = ut.generate_features(training_df,df_tweets,'tfidf')

#TF-IDF Results:
pred_nb_tfidf, pred_svm_tfidf = ut.predict_labels(train_features,training_df['sentiment'].values,test_features)

df_tweets['predicted_tfidf_NB'] = pred_nb_tfidf
df_tweets['predicted_tfidf_SVM'] = pred_svm_tfidf


df_eval = df_tweets.copy()
df_eval = df_eval[['clean_text','text','sentiment','predicted_tfidf_NB','predicted_tfidf_SVM']]

import seaborn as sns

sentiment_counts = df_eval['predicted_tfidf_NB'].value_counts()
plt.figure(figsize=(8, 6))
sns.barplot(x=sentiment_counts.index, y=sentiment_counts.values, palette="viridis")
plt.title('Sentiment Distribution')
plt.xlabel('Sentiment')
plt.ylabel('Count')
plt.show()"""
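
The key point of the train-on-one-dataset, predict-on-another setup is that the TF-IDF vocabulary and weights are learned from the labelled training tweets only and then reused on the unlabelled ones. The pure-Python sketch below illustrates that idea with a smoothed IDF, `ln((1 + N) / (1 + df)) + 1`, and no vector normalization. It is an assumption about the general technique, not the contents of `ut.generate_features`; `fit_idf`, `transform` and the sample documents are hypothetical.

```python
import math
from collections import Counter

def fit_idf(train_docs: list) -> dict:
    """Learn a smoothed IDF weight per term from the training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter(term for doc in train_docs for term in set(doc))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

def transform(doc: list, idf: dict) -> dict:
    """Map a tokenized document to a sparse {term: tf * idf} vector.

    Terms that were never seen during fitting are dropped.
    """
    counts = Counter(term for term in doc if term in idf)
    return {term: count * idf[term] for term, count in counts.items()}

train_docs = [["vaccine", "safe"], ["vaccine", "scary"]]
idf = fit_idf(train_docs)

# "moderna" does not occur in the training data, so it is ignored.
vec = transform(["vaccine", "safe", "moderna"], idf)
print(vec)
```

Words that occur only in the unlabelled tweets contribute nothing to the prediction, which is one reason the training dataset should cover a similar theme.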

## Geo-Spatial Sentiment Analysis