# Sentiment Analysis of Covid-19 related Tweets

In this demo, we apply different models to perform sentiment analysis on tweets related to the Covid-19 pandemic.

In [1]:
# Load libraries
import pandas as pd
pd.options.mode.chained_assignment = None  # Suppress chained-assignment warnings
import re
from nltk.stem.porter import PorterStemmer
from nltk import download
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

### Exploratory Data Analysis

The dataset can be downloaded from [Kaggle](https://www.kaggle.com/c/sentiment-analysis-of-covid-19-related-tweets).

In [2]:
train_df = pd.read_csv('./data/training.csv')  # Labeled
test_df = pd.read_csv('./data/validation.csv')  # Unlabeled 

In [3]:
print("We have {} labeled entries.".format(len(train_df)))
print("We have {} unlabeled entries.".format(len(test_df)))

We have 5000 labeled entries.
We have 2500 unlabeled entries.


In [4]:
train_df.head(10)

Unnamed: 0,ID,Tweet,Labels
0,1,NO JOKE I WILL HOP ON A PLANE RN! (Well after ...,0 10
1,2,BanMediaHouse whose is responsible for spreadi...,6
2,3,Im waiting for someone to say to me that all t...,3 4
3,4,He is a liar. Proven day night. Time again. Li...,6
4,5,"NEW: U.S. CoronaVirus death toll reaches 4,000...",8
5,6,Coronavirus impact Govt extends I-T deadlines ...,5 8
6,7,"42,000 people might have died in China from Co...",6 7 8
7,8,Dear Chinese! Kindly cook your bat thoroughly ...,5 10
8,9,This is how the govt of kenya is checking the ...,3 6 9
9,10,My mental health hasn't suffered at all under ...,10


In [5]:
print("We have {} labels in total.".format(len(set(train_df.Labels.str.split().sum()))))

We have 11 labels in total.
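The expression above works because summing a Series of lists concatenates them, although repeated list concatenation can become slow on larger frames. As a minimal pure-Python sketch of the same idea, the snippet below counts labels in the ten label strings visible in the `head(10)` output; the variable names are illustrative and not part of the notebook.

```python
from collections import Counter

# Label strings copied from the first ten rows shown above
sample_labels = ["0 10", "6", "3 4", "6", "8", "5 8", "6 7 8", "5 10", "3 6 9", "10"]

# Split each string into individual labels and count how often each one occurs
label_counts = Counter(label for row in sample_labels for label in row.split())

# Sort numerically so that "10" comes after "9"
distinct_labels = sorted(label_counts, key=int)
print(len(distinct_labels), distinct_labels)
print(label_counts.most_common(3))
```

Only nine of the eleven labels happen to appear in these ten rows, which is why the count has to be taken over the full training set.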


> According to the documentation on Kaggle, the 11 labels are: Optimistic (0), Thankful (1), Empathetic (2), Pessimistic (3), Anxious (4), Sad (5), Annoyed (6), Denial (7), Surprise (8), Official report (9) and Joking (10).

To perform binary classification later on, we keep only the entries labeled `Optimistic` (0) or `Pessimistic` (3).

In [6]:
# Keep rows whose labels contain a 0 not preceded by 1 (this excludes 10) or a 3
train_df_bin = train_df[train_df.Labels.str.contains(r'(?<!1)0|3')].copy()
train_df_bin.head()

Unnamed: 0,ID,Tweet,Labels
0,1,NO JOKE I WILL HOP ON A PLANE RN! (Well after ...,0 10
2,3,Im waiting for someone to say to me that all t...,3 4
8,9,This is how the govt of kenya is checking the ...,3 6 9
12,13,Has anyone elses FB ads been killing it since ...,0 5 10
16,17,Three teams are fighting Covid19 day night: 1:...,0 1
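The filter pattern `(?<!1)0|3` matches either a `0` that is not preceded by a `1` (so the label `10` on its own does not count as Optimistic) or a `3`. The sketch below checks this with the standard `re` module on a few label strings taken from the outputs above; it assumes search semantics, which is how `str.contains` treats a regular expression.

```python
import re

# Same pattern as in the cell above: a "0" not preceded by "1", or a "3"
pattern = re.compile(r"(?<!1)0|3")

# Label strings taken from the outputs above
examples = ["0 10", "10", "3 4", "5 10", "6", "0 1"]

# Keep only the label strings that contain Optimistic (0) or Pessimistic (3)
selected = [labels for labels in examples if pattern.search(labels)]
print(selected)
# ['0 10', '3 4', '0 1']
```

The lookbehind is needed only because the labels are matched as raw characters. It relies on the label ids running from 0 to 10, so no other id contains the digit `0` or `3`.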


In [7]:
# '0' (Pessimistic) if label 3 is present, otherwise '1' (Optimistic)
train_df_bin['LabelsBinary'] = train_df_bin['Labels'].apply(lambda x: '1' if (x.find('3') == -1) else '0')

In [8]:
train_df_bin.head()

Unnamed: 0,ID,Tweet,Labels,LabelsBinary
0,1,NO JOKE I WILL HOP ON A PLANE RN! (Well after ...,0 10,1
2,3,Im waiting for someone to say to me that all t...,3 4,0
8,9,This is how the govt of kenya is checking the ...,3 6 9,0
12,13,Has anyone elses FB ads been killing it since ...,0 5 10,1
16,17,Three teams are fighting Covid19 day night: 1:...,0 1,1
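Note that the mapping above marks a tweet as Pessimistic whenever label 3 is present, so a tweet carrying both 0 and 3 (if any exist in the data) would end up in class `0`. A minimal sketch of the same rule using token membership instead of a character search; the function name `to_binary` and the `"0 3"` example are hypothetical.

```python
def to_binary(labels: str) -> str:
    """Return '0' if Pessimistic (3) is among the labels, else '1'."""
    tokens = labels.split()  # "3 6 9" -> ["3", "6", "9"]
    return "0" if "3" in tokens else "1"

# The first three strings come from the output above; "0 3" is a made-up edge case
for labels in ["0 10", "3 4", "0 1", "0 3"]:
    print(labels, "->", to_binary(labels))
```

Comparing whole tokens would also stay correct if a label id such as 13 were ever added, whereas the character search would not.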


### Data Preparation

We normally do not feed raw text into ML models.  
The first step is to remove punctuation and special characters and, at the same time, convert capital letters to lowercase.

In [9]:
def clean_text(t):
    # Lowercase, then replace every non-word character with a space
    t = re.sub(r'[\W]', ' ', t.lower())
    return t

In [10]:
test_text = 'COVID is a Hoax! What are you saying ???'
test_text_clean = clean_text(test_text)
test_text_clean

'covid is a hoax  what are you saying    '
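Because every non-word character is replaced by its own space, the cleaned string contains runs of spaces. This is harmless here, since tokenization later splits on whitespace, but a variant that collapses the runs can make the intermediate column easier to read. A minimal sketch, where `clean_text_compact` is a hypothetical name:

```python
import re

def clean_text_compact(t: str) -> str:
    # Replace each run of non-word characters with one space, then trim the ends
    return re.sub(r"\W+", " ", t.lower()).strip()

print(repr(clean_text_compact("COVID is a Hoax! What are you saying ???")))
# 'covid is a hoax what are you saying'
```

In both versions, `\W` leaves digits and underscores untouched, so a token such as `covid19` survives cleaning.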

The next step is **Tokenization**: splitting each string of text into a list of smaller parts (tokens). We also perform **Normalization** by reducing each token to its stem with the Porter stemming algorithm.

In [11]:
def tokenize_text(t):
    stemmer = PorterStemmer()
    return [stemmer.stem(word) for word in t.split()]

In [12]:
tokenized_text = tokenize_text(test_text_clean)
tokenized_text

['covid', 'is', 'a', 'hoax', 'what', 'are', 'you', 'say']

Another step is to remove stop words, which are common words that do not carry much information.

In [13]:
download('stopwords', quiet=True);

In [14]:
stop_words = stopwords.words('english')

In [15]:
[word for word in tokenized_text if word not in stop_words]

['covid', 'hoax', 'say']
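The order of the steps can matter: the stop-word list contains unstemmed words, while the tokens above have already been stemmed, and a stemmed form (Porter stemmers commonly reduce `this` to `thi`, for example) would no longer match the list. A minimal sketch of filtering before stemming, using a small hand-picked stop-word set as a stand-in for NLTK's much longer list; `STOP_WORDS` and `remove_stop_words` are hypothetical names.

```python
# Hypothetical, hand-picked subset of English stop words (NLTK's list is much longer)
STOP_WORDS = {"is", "a", "what", "are", "you", "this"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    # Keep only tokens that are not stop words, preserving their order
    return [tok for tok in tokens if tok not in stop_words]

raw_tokens = "covid is a hoax what are you saying".split()
print(remove_stop_words(raw_tokens))
# ['covid', 'hoax', 'saying']

# A stemmed form such as "thi" is not in the set, so it is not filtered out
print(remove_stop_words(["thi", "covid"]))
# ['thi', 'covid']
```

Storing the stop words in a `set` also makes each membership test constant-time, compared with scanning a list.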

Let's apply the preprocessing steps to the tweets.

In [16]:
def apply_preprocess(df, old_column, new_column, function):
    # Add new_column to df in place by applying function to old_column; return a 3-row preview
    df[new_column] = df[old_column].apply(function)
    return df.head(3)

In [17]:
apply_preprocess(train_df, 'Tweet', 'TweetCleaned', clean_text)

Unnamed: 0,ID,Tweet,Labels,TweetCleaned
0,1,NO JOKE I WILL HOP ON A PLANE RN! (Well after ...,0 10,no joke i will hop on a plane rn well after ...
1,2,BanMediaHouse whose is responsible for spreadi...,6,banmediahouse whose is responsible for spreadi...
2,3,Im waiting for someone to say to me that all t...,3 4,im waiting for someone to say to me that all t...


In [18]:
apply_preprocess(train_df, 'TweetCleaned', 'TweetTokenized', tokenize_text)

Unnamed: 0,ID,Tweet,Labels,TweetCleaned,TweetTokenized
0,1,NO JOKE I WILL HOP ON A PLANE RN! (Well after ...,0 10,no joke i will hop on a plane rn well after ...,"[no, joke, i, will, hop, on, a, plane, rn, wel..."
1,2,BanMediaHouse whose is responsible for spreadi...,6,banmediahouse whose is responsible for spreadi...,"[banmediahous, whose, is, respons, for, spread..."
2,3,Im waiting for someone to say to me that all t...,3 4,im waiting for someone to say to me that all t...,"[im, wait, for, someon, to, say, to, me, that,..."


As most ML algorithms need numerical input, we convert the text to **TF-IDF** (term frequency–inverse document frequency) features.

In [19]:
tfidf = TfidfVectorizer(use_idf=True,
                        norm='l2',
                        smooth_idf=True,
                        preprocessor=clean_text,
                        tokenizer=tokenize_text)
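To make the settings above more concrete, here is a minimal pure-Python sketch of the weighting they describe: raw term counts multiplied by a smoothed inverse document frequency, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization of each document vector. The formula follows scikit-learn's documented behaviour for `smooth_idf=True` as far as I understand it. The toy documents and the function name `tfidf_vectors` are assumptions for illustration only.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF, as if one extra document contained every term
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts
        weights = {t: tf[t] * idf[t] for t in tf}
        # L2 normalization; fall back to 1.0 for an empty document
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

# Hypothetical, already tokenized and stemmed toy documents
docs = [["covid", "hoax"], ["covid", "vaccin", "hope"], ["covid", "lockdown"]]
vectors = tfidf_vectors(docs)
print(vectors[0])
```

In the first toy document, `hoax` receives a larger weight than `covid` because `covid` occurs in every document and therefore carries less discriminating information.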

### Logistic Regression

For this part, we use the subset with binary labels.

In [20]:
print("We have {} entries with binary labels: Optimistic(1) / Pessimistic(0).".format(len(train_df_bin)))

We have 1705 entries with binary labels: Optimistic(1) / Pessimistic(0).


In [21]:
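# Note: the vectorizer is fit on all binary-labeled tweets before the train/test split,
# so the vocabulary and IDF weights also reflect the held-out tweets.
X = tfidf.fit_transform(train_df_bin.Tweet)
y = train_df_bin.LabelsBinary.values  # String labels: '1' (Optimistic) / '0' (Pessimistic)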

In [22]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=2019)

In [23]:
print("{} tweets for training".format(X_train.shape[0]))
print("{} tweets for testing".format(X_test.shape[0]))

1364 tweets for training
341 tweets for testing


In [24]:
model = LogisticRegressionCV(cv=5, scoring = 'accuracy', random_state=2020, max_iter=500)
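`LogisticRegressionCV` selects its regularization strength by cross-validation; with `cv=5`, the training set is divided into five folds, each serving once as the validation set. The sketch below shows plain contiguous k-fold splitting on the 1364 training tweets. It is a simplified illustration under that assumption: scikit-learn's own splitter may additionally stratify the folds by class, which this sketch does not do, and `kfold_indices` is a hypothetical helper.

```python
def kfold_indices(n_samples: int, k: int):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds get one additional sample
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

fold_sizes = [len(val) for _, val in kfold_indices(1364, 5)]
print(fold_sizes)
# [273, 273, 273, 273, 272]
```

Each candidate regularization strength is scored on every validation fold, and the value with the best average score (accuracy here) is used for the final fit.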

In [25]:
model.fit(X_train, y_train);

In [26]:
print("Accuracy on training dataset: {}".format(model.score(X_train, y_train)))

Accuracy on training dataset: 0.9325513196480938


In [27]:
print("Accuracy on test dataset: {}".format(model.score(X_test, y_test)))

Accuracy on test dataset: 0.6920821114369502
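The training accuracy (about 0.93) is well above the test accuracy (about 0.69), which points to overfitting. To judge the test score, it also helps to compare it with a trivial baseline that always predicts the most frequent training class. A minimal sketch of both computations on hypothetical toy labels (not taken from the dataset); the helper names are illustrative.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    # Fraction of positions where the prediction equals the true label
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def majority_baseline(y_train, n):
    # Predict the most frequent training label for every test item
    most_common_label = Counter(y_train).most_common(1)[0][0]
    return [most_common_label] * n

# Hypothetical toy labels, not taken from the dataset
y_train_toy = ["1", "1", "1", "0", "0"]
y_test_toy = ["1", "0", "1", "1"]

baseline_pred = majority_baseline(y_train_toy, len(y_test_toy))
print(accuracy(y_test_toy, baseline_pred))
# 0.75
```

Applied to the real `y_train` and `y_test`, the same comparison would show how much of the 0.69 is explained by class imbalance alone.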
