# Vaccine Discourse Analysis
## **Table of Contents**
1. [Set Up](#section-1)
2. [Data Preprocessing](#section-2)   
    2.1 Processing the initial dataset   
    2.2 Cleaning the tweets
3. Data Visualization


### **Set Up** <a class='anchor' id='section-1'></a>    

Import Necessary Packages

In [19]:
import numpy as np
import pandas as pd
import re       

Read the CSV file

In [20]:
df = pd.read_csv('TweetsAboutCovid-19.csv')
df.head()



|  | id | created_at | date | time | timezone | place | tweet | language | replies_count | retweets_count | likes_count | hashtags | cashtags | retweet | video | thumbnail |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.38588e+18 | 2021-04-24 08:43:17 UTC | 4/24/2021 | 8:43:17 | 0 |  | 🇨🇺: ✍️ Covid-19 en Cuba: 1241 nuevos casos pos... | es | 0.0 | 0.0 | 0.0 | ['reportando', 'cuba'] | [] | False | 0.0 |  |
| 1 | 1.38588e+18 | 2021-04-24 08:43:17 UTC | 4/24/2021 | 8:43:17 | 0 |  | The latest The Zika Advice Paper! https://t.c... | en | 0.0 | 0.0 | 0.0 | ['covid19', 'amr'] | [] | False | 0.0 |  |
| 2 | 1.38588e+18 | 2021-04-24 08:43:16 UTC | 4/24/2021 | 8:43:16 | 0 |  | Tum karo toh mantra ..woh kare toh tantra ..ai... | tl | 0.0 | 0.0 | 0.0 | ['covidindia', 'covidvaccine', 'covidresources... | [] | False | 1.0 | https://pbs.twimg.com/media/EzufwnkUYA0TExx.jpg |
| 3 | 1.38588e+18 | 2021-04-24 08:43:16 UTC | 4/24/2021 | 8:43:16 | 0 |  | https://t.co/4rdhSH3IYl Prime Minister @Nare... | en | 0.0 | 0.0 | 0.0 | ['covid_19'] | [] | False | 1.0 | https://pbs.twimg.com/media/EzueSRxVgAEiTie.jpg |
| 4 | 1.38588e+18 | 2021-04-24 08:43:16 UTC | 4/24/2021 | 8:43:16 | 0 |  | @bc_pt64 @KackCake @sherlockine1 @SternchenJvB... | de | 0.0 | 0.0 | 0.0 | [] | [] | False | 0.0 |  |


To understand the initial dataset, we inspect the column names, the type of each column, and the number of null entries in each column.

In [21]:
df.dtypes

id                float64
created_at         object
date               object
time               object
timezone            int64
place              object
tweet              object
language           object
replies_count     float64
retweets_count    float64
likes_count       float64
hashtags           object
cashtags           object
retweet            object
video             float64
thumbnail          object
dtype: object
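
One detail in the `dtypes` output is worth a note: `id` was parsed as `float64`. Tweet IDs of this size (around 1.39e18) lie above 2^53 (about 9.0e15), the limit up to which a 64-bit float represents every integer exactly, so distinct IDs can collapse to the same value. The sketch below illustrates the effect and one possible workaround, reading `id` as text. The two IDs and the inline CSV are made up for illustration, and the workaround assumes the source file stores IDs as full integers.

```python
import io
import pandas as pd

# Two made-up, distinct 19-digit IDs (not taken from the dataset).
id_a, id_b = 1385880000000000001, 1385880000000000002

# Converted to 64-bit floats, the two IDs become indistinguishable.
collide_as_float = float(id_a) == float(id_b)

# Reading the column as text keeps every digit.
sample_csv = io.StringIO(f"id,tweet\n{id_a},first\n{id_b},second\n")
sample = pd.read_csv(sample_csv, dtype={"id": "string"})

print("IDs collide as floats:", collide_as_float)
print("IDs read as text:", sample["id"].tolist())
```

Passing a `dtype` mapping to `pd.read_csv` for the real file would work the same way, provided every value in the column can be read as that type.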

In [22]:
print('shape of dataframe:', df.shape)
df.isnull().sum()

shape of dataframe: (803645, 16)


id                     0
created_at             0
date                   0
time                   0
timezone               0
place             802765
tweet                  0
language               1
replies_count          1
retweets_count         1
likes_count            1
hashtags               1
cashtags               1
retweet                1
video                  1
thumbnail         570664
dtype: int64

### **Data Preprocessing** <a class='anchor' id='section-2'></a>

#### Processing the initial dataset   
***Drop columns that we will not be using in this project***    

We will remove the `place` and `thumbnail` columns because most of their entries are missing (about 99.9% and 71% of rows, respectively). They are also not necessary for the objectives of this project.

Based on the objectives of this project, we only need the `id`, `date`, `time`, `tweet`, `language`, `replies_count`, `retweets_count`, `likes_count`, and `hashtags` columns from the dataset. We will remove the other columns.
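
As a small aside, the share of missing values per column can be computed directly instead of read off the raw counts. The sketch below uses an invented toy frame and a hypothetical cutoff of 50%; neither comes from the notebook, which selects the columns to drop by hand.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the tweet data; all values are invented.
toy = pd.DataFrame({
    "tweet": ["a", "b", "c", "d"],
    "place": [np.nan, np.nan, np.nan, "Paris"],
    "thumbnail": [np.nan, "url", np.nan, np.nan],
})

# Share of missing values per column (mean of a boolean mask).
missing_share = toy.isnull().mean()

# Hypothetical cutoff: flag columns with more than half their values missing.
threshold = 0.5
sparse_columns = missing_share[missing_share > threshold].index.tolist()

print(missing_share)
print("Columns above the cutoff:", sparse_columns)
```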

In [23]:
df = df.drop(["created_at", "timezone", "place", 'cashtags', 'retweet', 'video', 'thumbnail'], axis=1)
df.head()

|  | id | date | time | tweet | language | replies_count | retweets_count | likes_count | hashtags |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.38588e+18 | 4/24/2021 | 8:43:17 | 🇨🇺: ✍️ Covid-19 en Cuba: 1241 nuevos casos pos... | es | 0.0 | 0.0 | 0.0 | ['reportando', 'cuba'] |
| 1 | 1.38588e+18 | 4/24/2021 | 8:43:17 | The latest The Zika Advice Paper! https://t.c... | en | 0.0 | 0.0 | 0.0 | ['covid19', 'amr'] |
| 2 | 1.38588e+18 | 4/24/2021 | 8:43:16 | Tum karo toh mantra ..woh kare toh tantra ..ai... | tl | 0.0 | 0.0 | 0.0 | ['covidindia', 'covidvaccine', 'covidresources... |
| 3 | 1.38588e+18 | 4/24/2021 | 8:43:16 | https://t.co/4rdhSH3IYl Prime Minister @Nare... | en | 0.0 | 0.0 | 0.0 | ['covid_19'] |
| 4 | 1.38588e+18 | 4/24/2021 | 8:43:16 | @bc_pt64 @KackCake @sherlockine1 @SternchenJvB... | de | 0.0 | 0.0 | 0.0 | [] |


***Filtering for only English tweets***   
 
We keep only the tweets whose `language` is `en`, so that the text is easier to interpret. We then drop the `language` column, since after filtering it holds a single value.

In [24]:
language_counts = df['language'].value_counts()
print(language_counts)

en     412115
es     131519
in      40841
pt      33253
hi      28140
        ...  
am         10
ps          7
ka          5
dv          2
ckb         2
Name: language, Length: 64, dtype: int64


In [25]:
df = df[df['language'] == 'en']
new_language_counts = df['language'].value_counts()
print(new_language_counts)

df = df.drop(['language'], axis=1)

en    412115
Name: language, dtype: int64


***Check for null and duplicate entries***   

Lastly, we check for null and duplicate entries and remove them.

In [26]:
df = df.dropna()
df.isnull().sum()

id                0
date              0
time              0
tweet             0
replies_count     0
retweets_count    0
likes_count       0
hashtags          0
dtype: int64

In [27]:
duplicates = df[df.duplicated()]
print("Number of duplications:", len(duplicates))
print("Initial shape of dataframe:", df.shape)
df = df.drop_duplicates()
print("Final shape of dataframe:", df.shape)

Number of duplications: 907
Initial shape of dataframe: (412115, 8)
Final shape of dataframe: (411208, 8)
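
Called without arguments, `DataFrame.duplicated()` flags only rows that match an earlier row in every column. If repeated tweet text under different IDs should also count as duplicates, the `subset` parameter narrows the comparison to selected columns. The sketch below shows this possible variation on invented toy data; it is not what the notebook does.

```python
import pandas as pd

# Toy data: rows 0 and 1 are identical; row 2 repeats the text under a new id.
toy = pd.DataFrame({
    "id": ["1", "1", "2", "3"],
    "tweet": ["stay safe", "stay safe", "stay safe", "get vaccinated"],
})

# Rows identical to an earlier row in every column.
full_row_dupes = int(toy.duplicated().sum())

# Rows whose tweet text already appeared, regardless of id.
same_text_dupes = int(toy.duplicated(subset=["tweet"]).sum())

# Keep the first occurrence of each distinct tweet text.
deduped_by_text = toy.drop_duplicates(subset=["tweet"], keep="first")

print("Full-row duplicates:", full_row_dupes)
print("Same-text duplicates:", same_text_dupes)
print("Rows left after text-based deduplication:", len(deduped_by_text))
```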


#### Cleaning the tweets
Before analyzing the data, we clean the tweets so that links, mentions, and other noise do not distort the results of natural language processing (NLP) techniques.

The preprocessing function below cleans the dataset by removing:  
- Hyperlinks  
- `@mentions`  
- Special characters  
- Emojis  
- Extra spaces  

In [34]:
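def pre_processing(text):
    """
    Clean a tweet by removing hyperlinks, @mentions, line breaks,
    special characters (including emojis), and extra spaces.
    Hashtags are kept.

    Parameters:
    - text (str): The tweet text to be cleaned.
    
    Returns:
    - str: The cleaned tweet text.
    """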
    
    # Remove hyperlinks from the tweet
    text = re.sub(r'https?:\/\/\S+', '', text)
    
    # Remove @mentions, which are specific to Twitter posts
    text = re.sub(r'@[A-Za-z0-9]+', '', text)
    
    # Remove newline escape sequences (like \n) from the tweet
    # to ensure the tweet text doesn't contain any line breaks
    text = re.sub(r'\n','', text) 

    # Keep only letters, numbers, and hashtags. 
    # All other characters are replaced with a space.
    text = re.sub(r"[^A-Za-z0-9#]+", ' ', text)
    
    # Remove any extra spaces between words and any trailing or leading spaces
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text

# convert TWEET column from object to string
df['tweet'] = df['tweet'].astype('string')

# create a column for the cleaned tweets
df['tweet_cleaned'] = df['tweet'].apply(pre_processing)
df['tweet_cleaned'] = df['tweet_cleaned'].astype('string')
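
Two behaviors of the patterns in `pre_processing` are easy to miss. First, `@[A-Za-z0-9]+` stops at an underscore, so a handle such as `@Physio_voice` is only partly removed. Second, replacing `\n` with an empty string joins the words on either side of a line break. The sketch below demonstrates both on made-up strings. The alternative patterns are suggestions, not part of the notebook, and the outputs shown elsewhere in this document were produced with the original patterns.

```python
import re

mention_pattern = r'@[A-Za-z0-9]+'              # pattern used in pre_processing
mention_with_underscore = r'@[A-Za-z0-9_]+'     # hypothetical alternative

sample = "@Physio_voice Namaste sir"

# The original pattern leaves the part of the handle after the underscore.
original_result = re.sub(mention_pattern, '', sample)
# Allowing underscores removes the whole handle.
alternative_result = re.sub(mention_with_underscore, '', sample)

two_lines = "stay home\nstay safe"

# Replacing a line break with nothing glues two words together.
joined_words = re.sub(r'\n', '', two_lines)
# Replacing it with a space keeps the words apart.
separated_words = re.sub(r'\n', ' ', two_lines)

print(repr(original_result))
print(repr(alternative_result))
print(repr(joined_words))
print(repr(separated_words))
```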

Remove the rows containing empty strings (if any) in `tweet_cleaned`, since these tweets contain only links, mentions, or special characters.

In [35]:
print('shape of df before', df.shape)
df = df[df['tweet_cleaned'] != '']
print('shape of df after',df.shape)

shape of df before (411208, 9)
shape of df after (411208, 9)


Next, we remove **stop words** from the tweets. Stop words are very common words, such as articles, prepositions, and conjunctions, that contribute little to the meaning of the text.

Once they are removed, the remaining content is more focused on keywords and meaningful context. This reduction in noise simplifies the data and can improve the performance of downstream NLP tasks.

Removing stop words also shrinks the vocabulary, which reduces the dimensionality of the data and can make later processing and model training cheaper.

We use the list of English stop words from the Python `NLTK` (Natural Language Toolkit) package.
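
To make the effect on vocabulary size concrete, here is a minimal sketch on two invented sentences. It uses a small hand-picked stop-word set rather than NLTK's list, so the numbers are illustrative only, and `drop_stop_words` is a hypothetical helper, not the function used in the notebook.

```python
# Small hand-picked stop-word set for illustration only;
# the notebook uses NLTK's much longer English list.
toy_stop_words = {"the", "is", "a", "to", "of", "and", "in"}

# Invented example sentences, not taken from the dataset.
tweets = [
    "the vaccine is a step to the end of the pandemic",
    "the second dose is in the mail",
]

def drop_stop_words(text, stop_words):
    """Lowercase the text and drop every word found in stop_words."""
    return " ".join(w for w in text.lower().split() if w not in stop_words)

filtered = [drop_stop_words(t, toy_stop_words) for t in tweets]

# Distinct words before and after filtering.
vocab_before = {w for t in tweets for w in t.split()}
vocab_after = {w for t in filtered for w in t.split()}

print(filtered)
print("Vocabulary size before:", len(vocab_before))
print("Vocabulary size after:", len(vocab_after))
```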

In [31]:
# Importing the stopwords list from the Natural Language Toolkit (NLTK).
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\tanka\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [39]:
# Use a distinct name so the imported `stopwords` module is not overwritten
# (which would break this cell when re-run), and a set for fast lookups.
stop_words = set(stopwords.words("english"))

def remove_stopwords(text):
    """
    Lowercase the text and remove English stop words.

    Parameters:
    - text (str): The input text from which stopwords need to be removed.
    
    Returns:
    - str: The lowercased text with stopwords removed.
    """
    # Keep only the words that are not in the stop-word set
    filtered_text = [word for word in text.lower().split() if word not in stop_words]
    
    return " ".join(filtered_text)

# create a column for no stop words from tweet_cleaned
df['tweet_no_stop'] = df['tweet_cleaned'].apply(remove_stopwords)

In [40]:
df.head()

|  | id | date | time | tweet | replies_count | retweets_count | likes_count | hashtags | tweet_cleaned | tweet_no_stop |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1.38588e+18 | 4/24/2021 | 8:43:17 | The latest The Zika Advice Paper! https://t.c... | 0.0 | 0.0 | 0.0 | ['covid19', 'amr'] | The latest The Zika Advice Paper Thanks to #co... | latest zika advice paper thanks #covid19 #amr |
| 3 | 1.38588e+18 | 4/24/2021 | 8:43:16 | https://t.co/4rdhSH3IYl Prime Minister @Nare... | 0.0 | 0.0 | 0.0 | ['covid_19'] | Prime Minister on Saturday said that like last... | prime minister saturday said like last year pr... |
| 5 | 1.38588e+18 | 4/24/2021 | 8:43:16 | Covid-19: India is going through very terrible... | 0.0 | 0.0 | 0.0 | ['presssangharsh', 'dailynews', 'news', 'india... | Covid 19 India is going through very terrible ... | covid 19 india going terrible situation says d... |
| 6 | 1.38588e+18 | 4/24/2021 | 8:43:15 | @CPBlr @KamalPantIPS speaks to me on the rules... | 0.0 | 0.0 | 0.0 | ['covid19'] | speaks to me on the rules that people will hav... | speaks rules people follow volunteers bengalur... |
| 7 | 1.38588e+18 | 4/24/2021 | 8:43:14 | @Physio_voice @BiswabhusanHC @ysjagan @Audimul... | 0.0 | 0.0 | 0.0 | [] | voice Namaste sir we are the NTRUHS Physiother... | voice namaste sir ntruhs physiotherapy student... |


### **Data Visualization** <a class='anchor' id='section-3'></a>
