# COVID-19 Vaccines Sentiment Analysis
------------------

In this notebook, we perform a sentiment analysis of tweets about COVID-19 vaccines, using data from the COVID-19 All Vaccines Tweets dataset. The tweets were collected with the tweepy Python package through the Twitter API. For each vaccine, tweets were gathered with a relevant search term (the one most frequently used on Twitter to refer to that vaccine).

Before we start, we will be importing the necessary libraries for our analysis. 

In [None]:
pip install twython

In [1]:
import pandas as pd 
import numpy as np 

import matplotlib.pyplot as plt 
import re
import string

import nltk
from nltk.tokenize import sent_tokenize
from nltk.corpus import words
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.sentiment.util import *
nltk.download('stopwords')
nltk.download('vader_lexicon')


from collections import Counter

from matplotlib import ticker
import seaborn as sns
import plotly.express as px

sns.set(style="darkgrid")

## Importing the Dataset

Now we will be importing the dataset from the COVID-19 All Vaccines Tweets. 

In [2]:
path = '../input/all-covid19-vaccines-tweets/vaccination_all_tweets.csv'
df = pd.read_csv(path)
df.head()

Now that we have imported the dataset, we check its shape to see the number of rows and columns.

In [6]:
df.shape

We can see that our dataset has 189,054 rows and 16 columns. Since we do not need all of the columns, we select the ones that matter for our analysis and create a new dataframe.

In [7]:
columns = ['user_name', 'date', 'text']
# Copy the selection so later column assignments do not raise SettingWithCopyWarning
df = df[columns].copy()
df.head()

Now that we have a new dataset with the important data for our analysis, we need to check the data types of the dataframe. 

In [8]:
df.info()

We can see that all three of our columns have the same data type (object). The date column records each tweet down to the second. Since we do not need that much precision, we keep only the day, month, and year of each tweet. We also replace each user name with an integer category code.

In [9]:
df.user_name = df.user_name.astype('category')
df.user_name = df.user_name.cat.codes

df.date = pd.to_datetime(df.date).dt.date
df.head()

Now that we have finished importing our dataset, we can continue to process our data for analysis. 

## Processing the Data

For processing our data, we will need to select the text column of our dataset. 

In [10]:
texts = df['text']
texts.head()

The first step of our processing is removing the URLs from all the tweets, since we don't need them. After that, we convert all of the text to lowercase for easier analysis. Lastly, we remove all punctuation from the text.

In [11]:
# Match both http:// and https:// links
remove_url = lambda x: re.sub(r'https?://\S+', '', str(x))
texts_lr = texts.apply(remove_url)
texts_lr.head()

In [12]:
to_lower = lambda x : x.lower()
texts_lc = texts_lr.apply(to_lower)
texts_lc.head()

In [13]:
rmv_pcs = lambda x : x.translate(str.maketrans('', '', string.punctuation))
texts_pcs = texts_lc.apply(rmv_pcs)
texts_pcs
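
The three cleaning steps above can also be sketched as a single function applied to one tweet. This is a minimal illustration, not the notebook's own code: `clean_tweet` and the sample tweet are hypothetical, and the URL pattern here assumes that plain `http` links should be removed as well.

```python
import re
import string

def clean_tweet(text):
    """Strip URLs, lowercase, and drop ASCII punctuation from one tweet."""
    text = re.sub(r'https?://\S+', '', str(text))  # remove http and https links
    text = text.lower()
    # string.punctuation covers ASCII symbols only, including '#' and '@'
    return text.translate(str.maketrans('', '', string.punctuation))

sample = "Got my #Moderna shot today! https://t.co/abc123"  # made-up tweet
print(clean_tweet(sample))
```

Note that the `#` is stripped along with the other punctuation, so hashtags survive as ordinary words.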

Now that we have removed all the unnecessary characters from our text, we remove the stop words. A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query. Removing them reduces the noise in our analysis.

In [14]:
# The text is already lowercased and stripped of punctuation, so custom terms are listed in that form
update_words = ['covid', 'covid19', 'coronavirus', 'coronavirusoutbreak', 'coronaviruspandemic', 'epitwitter', 'ihavecorona', 'amp']
stop_words = set(stopwords.words('english'))
stop_words.update(update_words)

remove_words = lambda x : ' '.join([word for word in x.split() if word not in stop_words])
texts_rs = texts_pcs.apply(remove_words)
texts_rs.head()
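
Stop-word removal is an exact match on whitespace-separated tokens, so a custom entry only has an effect if it is written the way tokens look after cleaning. The sketch below uses a tiny, hypothetical stop list (not NLTK's) to show that an entry with a leading `#` never matches once punctuation has been removed.

```python
# Hypothetical mini stop list; the notebook uses NLTK's English list plus custom terms
stop_words = {'the', 'a', 'is', 'covid19', '#vaccine'}

def remove_stop_words(text):
    """Drop every token that appears in stop_words (exact match)."""
    return ' '.join(word for word in text.split() if word not in stop_words)

cleaned = "the covid19 vaccine is a relief"  # already lowercased, punctuation removed
print(remove_stop_words(cleaned))
```

Here `covid19` is dropped, while `vaccine` stays because the list only contains `#vaccine`.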

## Text Analysis

Before we analyze the sentiment of the tweets, we look at the text itself. First, we collect every word from every tweet into one list, then count and visualize the words. The purpose is to see the most common words across all of the tweets.

In [15]:
word_list = [word for line in texts_rs for word in line.split()]
word_list[:10]

In [17]:
word_counts = Counter(word_list).most_common(50)
words_df = pd.DataFrame(word_counts)
words_df.columns = ['word', 'frequency']

px.bar(words_df, x='word', y='frequency', title='Most Common Words')
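
As a small illustration of the counting step, `Counter.most_common(n)` returns a list of `(word, count)` tuples sorted by descending count, which is why the resulting dataframe has exactly two columns. The token list below is made up.

```python
from collections import Counter

tokens = "vaccine dose today vaccine moderna dose vaccine".split()  # toy tokens
top = Counter(tokens).most_common(2)  # two most frequent words with their counts
print(top)
```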

## Join Table

Since we are done with the processing of the text data, we can now put the cleaned text into our main dataframe.

In [18]:
df.text = texts_rs
df.head()

In [19]:
df.info()

Now we see that the date column still has the object dtype, because `.dt.date` left it holding plain Python date objects. For our analysis, we need to convert it to the datetime data type. Also, to limit the scope of our analysis, we keep only tweets posted on or after March 1, 2021.

In [20]:
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')

filtered_df = df.loc[(df['date'] >= '2021-03-01')]
filtered_df

## Sentiment Analysis 

Now that we have finished preparing the data, we can continue with our sentiment analysis. Sentiment can be classified as neutral, positive, or negative. To classify it, we use NLTK's SentimentIntensityAnalyzer, which scores how positive, negative, or neutral each tweet is.

In [21]:
sid = SentimentIntensityAnalyzer()
ps = lambda x : sid.polarity_scores(x)
sentiment_scores = filtered_df.text.apply(ps)
sentiment_scores

In [22]:
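# Keep the index of the filtered tweets so that a later index-based join lines up each score with its tweet
sentiment_df = pd.DataFrame(data=list(sentiment_scores), index=sentiment_scores.index)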
sentiment_df.head()

We can see that the output contains neg, neu, and pos, the proportions of the text rated negative, neutral, and positive, plus compound, a single normalized score that summarizes the overall sentiment (it is not a simple average of the other three). We will focus on compound.

For negative sentiment, the compound score is closer to -1, and for positive sentiment it is closer to 1. A score of exactly 0 is treated as neutral.
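
To see why compound always falls between -1 and 1, here is a sketch of the normalization step as it is commonly described for VADER: the summed word valences `x` are mapped to `x / sqrt(x^2 + alpha)`. The default `alpha=15` is an assumption about the library's setting, and the raw sums below are invented.

```python
import math

def normalize(score, alpha=15):
    """Map an unbounded valence sum onto [-1, 1]; alpha=15 is an assumed default."""
    norm = score / math.sqrt(score * score + alpha)
    return max(-1.0, min(1.0, norm))  # clip as a safeguard

for raw in (-4.0, 0.0, 1.5, 4.0):  # made-up raw valence sums
    print(raw, round(normalize(raw), 4))
```

Larger raw sums move the score toward the ends of the range without ever passing them, and a sum of 0 stays at 0.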

For our analysis, we will create another column called label, where we will be labelling the scores based on the compound polarity value. 

In [23]:
labelize = lambda x : 'neutral' if x==0 else('positive' if x>0 else 'negative')
sentiment_df['label'] = sentiment_df.compound.apply(labelize)
sentiment_df.head()
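
The labelling above counts only an exact 0 as neutral. An alternative that is often suggested for VADER scores is a small neutral band around zero, commonly ±0.05. The sketch below shows that variant; the function name and the threshold default are assumptions, and using it would change the counts reported later in this notebook.

```python
def label_with_threshold(compound, threshold=0.05):
    """Label a compound score, treating values inside +/- threshold as neutral."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'

print([label_with_threshold(c) for c in (0.6, 0.02, -0.01, -0.4)])
```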

Now that we have the label for each tweet, we will join the label column into our main dataframe. Once we have joined the two tables, we will be counting the number of positive, negative, and neutral tweets from our dataframe and visualize it. 

In [24]:
data = filtered_df.join(sentiment_df.label)
data.head()
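
`DataFrame.join` matches rows by index, so the label column has to carry the same index as the filtered tweets. The toy example below (all names and values are hypothetical) shows what happens when the scores get a fresh 0, 1, 2 index while the tweets keep their original row numbers.

```python
import pandas as pd

# Toy stand-in for the filtered tweets: filtering leaves gaps in the original index
tweets = pd.DataFrame({'text': ['a', 'b', 'c']}, index=[5, 8, 9])
scores = [{'compound': 0.5}, {'compound': 0.0}, {'compound': -0.3}]

scores_fresh = pd.DataFrame(scores)                        # index 0, 1, 2
scores_aligned = pd.DataFrame(scores, index=tweets.index)  # index 5, 8, 9

print(tweets.join(scores_fresh.compound))    # no index matches, so compound is NaN
print(tweets.join(scores_aligned.compound))  # each score lands on its own tweet
```

It is worth checking `data.label.isna().sum()` after the join to confirm that every tweet received a label.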

In [25]:
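# Name both columns explicitly; the names produced by reset_index() alone differ between pandas versions
counts_df = data.label.value_counts().rename_axis('label').reset_index(name='counts')
counts_df

In [26]:
sns.barplot(data=counts_df, x='label', y='counts')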

We can see that most tweets about the vaccines are neutral, and that positive tweets outnumber negative ones. But this visualization covers all tweets from March to September 2021 combined.

For a closer analysis, we look at the number of positive, negative, and neutral tweets per day from March 2021 onward.

In [27]:
data_agg = data[['user_name', 'date', 'label']].groupby(['date', 'label']).count().reset_index()
data_agg.columns = ['date', 'label', 'counts']
data_agg.head()
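
The aggregation above counts non-null `user_name` values in each (date, label) group. An equivalent way to get rows per group, sketched here on made-up data, is `groupby(...).size()`, which counts rows regardless of missing values and names the result column in one step.

```python
import pandas as pd

# Toy labelled tweets (hypothetical values)
toy = pd.DataFrame({
    'date': pd.to_datetime(['2021-03-01', '2021-03-01', '2021-03-01', '2021-03-02']),
    'label': ['neutral', 'positive', 'neutral', 'negative'],
})

# One row per (date, label) pair with the number of tweets in that group
daily = toy.groupby(['date', 'label']).size().reset_index(name='counts')
print(daily)
```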

In [28]:
px.line(data_agg, x='date', y='counts', color='label', title='COVID-19 Vaccines Sentiment Analysis')

From the visualization, we can see that the sentiment of the tweets about COVID-19 vaccines is mostly neutral. Although there are negative tweets about the vaccines, the positive tweets outnumber them.