# Twitter Sentiment Analysis.

## Business Problem.

We want to analyze a COVID-19 Twitter dataset to understand how positive and negative trends spread after news announcements.

Key Questions:

How do positive and negative sentiments spread among users after a news announcement related to COVID-19?

Purpose:

Help media outlets measure the impact of their announcements on public sentiment.

Goals:

Track sentiment trends over time.


## Dataset location and download instructions.

[Covid-19 Twitter Dataset](https://www.kaggle.com/datasets/arunavakrchakraborty/covid19-twitter-dataset/data)


* If running the notebook locally, place the dataset files in the ```Data``` folder.
* If running in Google Colab, upload the files to the ```/content``` root folder of the Colab environment.

## Installing the required modules.

We'll start by installing the requirements [available here](https://github.com/leksea/capstone-twitter-sentiment-analysis/blob/main/requirements.txt).

In [8]:
!wget https://raw.githubusercontent.com/leksea/capstone-twitter-sentiment-analysis/main/requirements.txt
!pip install -r 'requirements.txt'

--2024-12-27 01:55:37--  https://raw.githubusercontent.com/leksea/capstone-twitter-sentiment-analysis/main/requirements.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 103 [text/plain]
Saving to: ‘requirements.txt.1’


2024-12-27 01:55:37 (4.14 MB/s) - ‘requirements.txt.1’ saved [103/103]



### Importing modules.

In [9]:
# built-in modules
import os
import string
import re
import glob
from datetime import datetime
# data manipulation, analysis
import numpy as np
import pandas as pd

# general data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# world maps
from folium import plugins
from folium.plugins import HeatMap
import branca.colormap as cm
import cartopy.crs as ccrs
import cartopy.feature as cfeature
from cartopy.mpl.ticker import LongitudeFormatter, LatitudeFormatter
# word cloud
from wordcloud import WordCloud

# Natural Language Processing (NLP)
import nltk
from emot.emo_unicode import UNICODE_EMOJI
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
from nltk import pos_tag, ne_chunk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
%matplotlib inline
# stop words for tokenizer
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Loading The Covid-19 Twitter Datasets.

In [10]:
# Supplemental function to determine data directory
# Input: none
# Output: Data directory, depending on runtime environment.

def determine_data_dir():
    """
    Determines the data directory based on the execution environment:
    - Local: Uses 'Data' directory in the current working directory.
    - Cloud (e.g., Google Colab): Uses '/content' as the data directory.

    Returns:
        str: Path to the appropriate data directory.
    """
    if 'COLAB_GPU' in os.environ:  # Check if running in Google Colab
        data_dir = "/content"
        print(f"Running in Google Colab. Using data directory: {data_dir}")
    else:
        data_dir = os.path.join(os.getcwd(), "Data")
        print(f"Running locally. Using data directory: {data_dir}")

        # Ensure the 'Data' directory exists locally
        if not os.path.isdir(data_dir):
            print(f"The directory '{data_dir}' does not exist. Please create it and place the data files there.")
            raise FileNotFoundError(f"'{data_dir}' directory is required for local execution.")

    return data_dir

In [11]:
# Loading the files
# Determine the data directory
data_dir = determine_data_dir()

# Step 1: Locate all CSV files in the determined directory
files_pattern = os.path.join(data_dir, "*.csv")
files = glob.glob(files_pattern)

# Step 2: Check if files are found
if not files:
    print(f"No CSV files found in directory: {data_dir}")
else:
    # Step 3: Load and inspect each file
    dfs = []  # Store valid DataFrames
    for file in files:
        try:
            # Load the DataFrame
            df = pd.read_csv(file)
            rows, cols = df.shape
            print(f"File: {file} | Rows: {rows}, Columns: {cols}")

            # Optional: Skip empty files or files with no columns
            if rows == 0 or cols == 0:
                print(f"Skipping empty or invalid file: {file}")
                continue

            # Append to list if valid
            dfs.append(df)

        except Exception as e:
            print(f"Error loading file {file}: {e}")

    # Step 4: Concatenate all valid DataFrames
    if dfs:
        data = pd.concat(dfs, ignore_index=True)
        print(f"Data loaded successfully with {data.shape[0]} rows and {data.shape[1]} columns.")
        print(data.head())
    else:
        print("No valid DataFrames to concatenate.")

Running in Google Colab. Using data directory: /content
File: /content/Covid-19 Twitter Dataset (Apr-Jun 2020).csv | Rows: 143903, Columns: 17
File: /content/Covid-19 Twitter Dataset (Apr-Jun 2021).csv | Rows: 147475, Columns: 17
File: /content/Covid-19 Twitter Dataset (Aug-Sep 2020).csv | Rows: 120509, Columns: 17
Data loaded successfully with 411887 rows and 17 columns.
             id  created_at  \
0  1.250000e+18  2020-04-19   
1  1.250000e+18  2020-04-19   
2  1.250000e+18  2020-04-19   
3  1.250000e+18  2020-04-19   
4  1.250000e+18  2020-04-19   

                                              source  \
0  <a href="http://twitter.com/download/android" ...   
1  <a href="http://twitter.com/download/android" ...   
2  <a href="http://twitter.com/download/iphone" r...   
3  <a href="http://twitter.com/download/iphone" r...   
4  <a href="http://twitter.com/download/android" ...   

                                       original_text lang  favorite_count  \
0  RT @GlblCtzn: .@priya

## Exploratory Data Analysis.
### Data Understanding and Cleaning.

We'll start by getting general information about the dataset and identifying the columns of interest.

In [12]:
# get general info about the dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 411887 entries, 0 to 411886
Data columns (total 17 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   id               411883 non-null  float64
 1   created_at       411885 non-null  object 
 2   source           411587 non-null  object 
 3   original_text    411885 non-null  object 
 4   lang             411884 non-null  object 
 5   favorite_count   411884 non-null  float64
 6   retweet_count    411884 non-null  float64
 7   original_author  411884 non-null  object 
 8   hashtags         97775 non-null   object 
 9   user_mentions    295207 non-null  object 
 10  place            293775 non-null  object 
 11  clean_tweet      409915 non-null  object 
 12  compound         411887 non-null  float64
 13  neg              411887 non-null  float64
 14  neu              411887 non-null  float64
 15  pos              411887 non-null  float64
 16  sentiment        411887 non-null  obje

In [13]:
# info about the numeric columns
data.describe()

| | id | favorite_count | retweet_count | compound | neg | neu | pos |
|---|---|---|---|---|---|---|---|
| count | 411883.0 | 411884.0 | 411884.0 | 411887.0 | 411887.0 | 411887.0 | 411887.0 |
| mean | 1.324197e+18 | 0.216726 | 1585.174163 | 0.008415 | 0.09092 | 0.807021 | 0.102052 |
| std | 5.902218e+16 | 6.33225 | 9423.896052 | 0.370853 | 0.152717 | 0.200474 | 0.15708 |
| min | 1.25e+18 | 0.0 | 0.0 | -0.9925 | 0.0 | 0.0 | 0.0 |
| 25% | 1.26e+18 | 0.0 | 1.0 | -0.1027 | 0.0 | 0.667 | 0.0 |
| 50% | 1.31e+18 | 0.0 | 15.0 | 0.0 | 0.0 | 0.819 | 0.0 |
| 75% | 1.395011e+18 | 0.0 | 243.0 | 0.2263 | 0.18 | 1.0 | 0.2 |
| max | 1.40914e+18 | 2923.0 | 416923.0 | 0.9805 | 1.0 | 1.0 | 1.0 |
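
Note that the `id` column is loaded as `float64`, which cannot represent every 19-digit tweet ID exactly, so distinct IDs can collapse to the same value. The following is a minimal sketch, assuming the raw CSV stores full integer IDs (this has not been verified for the Kaggle files, and the two-row sample below is invented), of how reading the column as text keeps IDs distinct:

```python
import io
import pandas as pd

# Hypothetical two-row CSV with 19-digit IDs that differ only in the last digit.
csv_text = (
    "id,original_text\n"
    "1250000000000000001,first\n"
    "1250000000000000002,second\n"
)

# Parsed as float64 (the dtype reported by data.info()), both IDs round to the same value.
as_float = pd.read_csv(io.StringIO(csv_text), dtype={"id": "float64"})

# Parsed as strings, the IDs stay distinct.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"id": str})

print(as_float["id"].nunique())  # 1
print(as_text["id"].nunique())   # 2
```

If the IDs are only used for counting rows, as in this notebook, the float representation is harmless; it matters once IDs are used for joins or de-duplication.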


In [29]:
# select the subset of columns we'll be working with
cols_to_keep = ['id', 'created_at', 'original_text',
                'lang', 'favorite_count', 'retweet_count', 'original_author',
                'hashtags', 'user_mentions', 'place']
tweets_df = data[cols_to_keep].copy()

In [30]:
# Supplemental function that displays the unique values of every categorical column in a dataframe.
def display_categorical_vals(df):
    # select categorical columns
    categorical_columns = df.select_dtypes(include=['object', 'category']).columns

    # print categorical columns and their unique values
    for col in categorical_columns:
        unique_values = df[col].unique()
        print(f"Column '{col}' has unique values: {unique_values}")

In [26]:
display_categorical_vals(tweets_df)

Column 'created_at' has unique values: ['2020-04-19' '2020-04-22' '2020-04-23' '2020-04-24' '2020-04-25'
 '2020-04-26' '2020-04-27' '2020-04-28' nan '2020-04-29' '2020-04-30'
 '2020-05-01' '2020-05-02' '2020-05-03' '2020-05-04' '2020-05-05'
 '2020-05-06' '2020-05-07' '2020-05-08' '2020-05-09' '2020-05-10'
 '2020-05-11' '2020-05-12' '2020-05-13' '2020-05-14' '2020-05-15'
 '2020-05-16' '2020-05-17' '2020-05-18' '2020-05-19' '2020-05-20'
 '2020-05-22' '2020-05-23' '2020-05-24' '2020-05-25' '2020-05-26'
 '2020-05-27' '2020-05-28' '2020-05-29' '2020-05-30' '2020-05-31'
 '2020-06-01' '2020-06-03' '2020-06-04' '2020-06-05' '2020-06-06'
 '2020-06-07' '2020-06-08' '2020-06-09' '2020-06-10' '2020-06-11'
 '2020-06-12' '2020-06-13' '2020-06-14' '2020-06-15' '2020-06-17'
 '2020-06-19' '2020-06-20' '2021-04-26' '2021-04-27' '2021-04-28'
 '2021-04-29' '2021-04-30' '2021-05-01' '2021-05-03' '2021-05-04'
 '2021-05-05' '2021-05-07' '2021-05-08' '2021-05-09' '2021-05-10'
 '2021-05-11' '2021-05-13' '2021-

### Data Type Conversion.

In [35]:
# rename the 'created_at' column to 'date' for clarity and convert it to datetime
tweet_df = data.rename(columns={'created_at': 'date'})
tweet_df['date'] = pd.to_datetime(tweet_df['date'])
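
By default `pd.to_datetime` raises an error on strings it cannot parse. The following is a minimal sketch on made-up values, assuming malformed dates should become `NaT` rather than stop the notebook; it uses `errors='coerce'` and counts what failed to parse:

```python
import pandas as pd

# Toy values standing in for the 'created_at' column (not taken from the dataset).
raw_dates = pd.Series(["2020-04-19", "2020-04-22", None, "not a date"])

# errors='coerce' turns missing or unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(raw_dates, errors="coerce")

# Count the entries that ended up as NaT.
n_unparsed = int(parsed.isna().sum())
print(n_unparsed)  # 2
```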

Next comes the exploratory data analysis:
* What were the daily tweet patterns?
* Which 20 days had the most tweets?
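
The listing of unique dates earlier suggests that some calendar days have no tweets at all. The following is a minimal sketch on an invented frame (the column names `date` and `id` mirror the notebook, the values do not) that counts tweets per day and reindexes onto a continuous calendar so that such gaps show up as zeros:

```python
import pandas as pd

# Toy data: two tweets on 2020-04-19, two on 2020-04-22, nothing in between.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "date": pd.to_datetime(
        ["2020-04-19", "2020-04-19", "2020-04-22", "2020-04-22"]
    ),
})

# Count tweets per day; normalize() keeps a DatetimeIndex at midnight.
per_day = toy.groupby(toy["date"].dt.normalize()).size()

# Reindex onto every calendar day in the range so gaps appear as zero counts.
all_days = pd.date_range(per_day.index.min(), per_day.index.max(), freq="D")
per_day_full = per_day.reindex(all_days, fill_value=0)

# Days with no tweets at all.
missing_days = list(per_day_full[per_day_full == 0].index.strftime("%Y-%m-%d"))
print(missing_days)  # ['2020-04-20', '2020-04-21']
```

Whether zero-filled days should appear in a plot is a design choice: they make collection gaps visible, but they can also be mistaken for days on which nobody tweeted.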

In [None]:
# exploratory analysis: plot number of tweets per day
# group by date and count tweets
tweets_per_day = tweet_df.groupby(tweet_df['date'].dt.date)['id'].count()

plt.figure(figsize=(12, 6))
plt.bar(tweets_per_day.index, tweets_per_day.values)
plt.xlabel("Date")
plt.ylabel("Number of Tweets")
plt.title("Number of Tweets Per Day")
plt.grid(True)
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for better readability
plt.tight_layout()  # Adjust layout to prevent labels from overlapping
plt.show()

In [37]:
# exploratory analysis: list the top 20 days with the most tweets
# tweets_per_day is already a Series indexed by date with the tweet counts as values,
# so there are no columns to rename

# sort by tweet count and get the top 20
top_20_days = tweets_per_day.sort_values(ascending=False).head(20)

# display the result
print(top_20_days)

date
2021-06-03    7229
2020-05-04    7165
2021-05-25    7114
2020-09-27    7040
2020-09-30    7028
2020-04-25    6178
2020-10-01    6166
2020-05-09    5857
2020-05-23    5600
2020-04-24    5323
2021-06-01    5175
2021-06-09    5158
2020-05-22    5102
2021-05-08    5032
2020-05-05    4949
2021-05-05    4886
2020-05-25    4804
2021-05-26    4750
2021-05-07    4586
2020-05-06    4575
Name: id, dtype: int64


### Evaluating Missing Values.
Before coming up with a strategy for each column, we'll check the contents of the categorical data and the distribution of NaNs.

* It makes sense that fields like ```hashtags``` and ```user_mentions``` have missing values, so we'll leave them as they are.
* We'll check the ```lang``` and ```place``` columns.
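
A per-column overview can help decide which columns need attention first. The following is a minimal sketch on an invented frame (only the column names come from the notebook) that reports the count and share of NaNs per column:

```python
import numpy as np
import pandas as pd

# Toy frame with the same column names as the notebook; values are invented.
toy = pd.DataFrame({
    "lang": ["en", "en", np.nan, "en"],
    "place": ["India", np.nan, np.nan, "USA"],
    "hashtags": [np.nan, "covid19", np.nan, np.nan],
})

# isna().sum() gives NaN counts per column; isna().mean() gives the share of rows.
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "share_missing": toy.isna().mean().round(2),
}).sort_values("n_missing", ascending=False)

print(missing)
```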


In [38]:
# number of NaNs in lang
sum(tweet_df.lang.isna())

3

In [41]:
# look at the tweet text
tweet_df[tweet_df.lang.isna()]

Unnamed: 0,id,date,source,original_text,lang,favorite_count,retweet_count,original_author,hashtags,user_mentions,place,clean_tweet,compound,neg,neu,pos,sentiment
23125,,NaT,,,,,,,,,,,0.0,0.0,1.0,0.0,neu
236779,1.40049e+18,2021-06-03,"<a href=""http://twitter.com/download/android"" ...",@santoshmt7666 @globaltimesnews The COVID-19 d...,,,,,,,,covid19 death toll india could time higher,-0.5994,0.394,0.606,0.0,neg
368535,1.31e+18,NaT,,,,,,,,,,,0.0,0.0,1.0,0.0,neu


Only 3 rows have a missing language. Two of them are empty apart from default sentiment scores, and the third has text but lacks the engagement counts and author, so we'll discard all three.

In [45]:
# drop rows where 'lang' is NaN
tweet_df = tweet_df.dropna(subset=['lang'])

# verify the changes
print(f"Number of NaNs in 'lang' after dropping: {sum(tweet_df.lang.isna())}")

# drop the lang column from the df
tweet_df.drop(columns=['lang'], inplace=True)

Number of NaNs in 'lang' after dropping: 0


In [47]:
#rename place column to location for clarity
tweet_df = tweet_df.rename(columns={'place': 'location'})

In [48]:
# count of NaNs
sum(tweet_df.location.isna())

118112

In [50]:
# what are the unique locations
print(set(tweet_df.location))
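
Printing the full set can be long and unordered when the field is free text. The following is a minimal sketch on invented values that summarizes locations with `value_counts` instead. The whitespace and case normalization is an assumption made for this illustration, not a step taken from the notebook:

```python
import numpy as np
import pandas as pd

# Invented free-text locations, including NaN and inconsistent spelling.
location = pd.Series(
    ["India", "india ", "USA", np.nan, "India", np.nan], name="location"
)

# Light normalization (assumed for this sketch): trim whitespace and lowercase.
normalized = location.str.strip().str.lower()

# dropna=False keeps NaN as its own bucket so missing locations stay visible.
counts = normalized.value_counts(dropna=False)
print(counts.head(10))
```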

