# Twitter Sentiment Analysis.

## Business Problem.

We want to analyze a COVID-19 Twitter dataset to understand how positive and negative sentiment spreads after news announcements. We also want to apply a bot detection algorithm to estimate what share of tweets in each sentiment category comes from bots, and how bot activity affects the sentiment of the general public (non-bot users).

Key Questions:

* How do positive and negative sentiments spread among users after a news announcement related to COVID-19?
* What proportion of tweets in each sentiment category (positive/negative/neutral) come from bots?
* How do bots influence the general sentiment of non-bot users?

Purpose:

* Help media outlets measure the impact of their announcements on public sentiment.
* Assist public health agencies in identifying misinformation or bot-driven content to improve communication strategies.
* Support social media platforms in detecting and limiting bot activity that could distort public opinion.

Goals:

* Track sentiment trends over time.
* Quantify bot participation in each sentiment category.
* Measure the influence of bots on genuine public sentiment.
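
As a rough illustration of the second goal, the sketch below computes the share of bot tweets per sentiment category on a small made-up table. The column names `sentiment` and `is_bot` are assumptions for illustration only; the dataset's actual columns, and the bot labels produced later in the analysis, may be named differently.

```python
import pandas as pd

# Hypothetical toy data: 'sentiment' and 'is_bot' are assumed column names,
# not columns confirmed to exist in the dataset.
tweets = pd.DataFrame({
    "sentiment": ["positive", "positive", "negative", "negative", "negative", "neutral"],
    "is_bot": [True, False, True, True, False, False],
})

# The mean of a boolean column is the fraction of True values, so this gives
# the percentage of bot tweets within each sentiment category.
bot_share = tweets.groupby("sentiment")["is_bot"].mean().mul(100).round(1)
print(bot_share)
```

The same `groupby` pattern could be applied per day to follow trends over time, once real sentiment labels and bot flags are available.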

## Datasets location and download instructions.

1. [Covid-19 Twitter Dataset](https://www.kaggle.com/datasets/arunavakrchakraborty/covid19-twitter-dataset/data)

2. [Twitter Bots Accounts](https://www.kaggle.com/datasets/davidmartngutirrez/twitter-bots-accounts)

* If running the notebook locally, place the dataset files in a ```Data``` folder inside the working directory.
* If running in Google Colab, upload the files to the ```/content``` root folder of the Colab environment.

## Installing the required modules.

We'll start by installing the requirements [available here](https://github.com/leksea/capstone-twitter-sentiment-analysis/blob/main/requirements.txt).

In [2]:
!wget https://raw.githubusercontent.com/leksea/capstone-twitter-sentiment-analysis/main/requirements.txt
!pip install -r 'requirements.txt'

--2024-12-26 02:08:36--  https://raw.githubusercontent.com/leksea/capstone-twitter-sentiment-analysis/main/requirements.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 98 [text/plain]
Saving to: ‘requirements.txt’


2024-12-26 02:08:37 (5.10 MB/s) - ‘requirements.txt’ saved [98/98]

Collecting cartopy (from -r requirements.txt (line 7))
  Downloading Cartopy-0.24.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.9 kB)
Downloading Cartopy-0.24.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.7 MB)
Installing collected packages: cartopy
Successfully installed cartopy-0.24.1


### Importing modules.

In [3]:
import numpy as np
import pandas as pd
import seaborn as sns
import os
import string
import re
import glob
from datetime import datetime
import matplotlib.pyplot as plt
import branca.colormap as cm
import cartopy.crs as ccrs
import cartopy.feature as cfeature
from cartopy.mpl.ticker import LongitudeFormatter, LatitudeFormatter
import requests
import folium
from folium import plugins
from folium.plugins import HeatMap
import branca.colormap
import nltk
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
from nltk import pos_tag, ne_chunk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from wordcloud import WordCloud
from tqdm import tqdm, notebook
%matplotlib inline
# stop words for tokenizer
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

## Loading The Covid-19 Twitter Datasets.

In [5]:
# Supplemental function to determine data directory
# Input: none
# Output: Data directory, depending on runtime environment.

def determine_data_dir():
    """
    Determines the data directory based on the execution environment:
    - Local: Uses 'Data' directory in the current working directory.
    - Cloud (e.g., Google Colab): Uses '/content' as the data directory.

    Returns:
        str: Path to the appropriate data directory.
    """
    if 'COLAB_GPU' in os.environ:  # Check if running in Google Colab
        data_dir = "/content"
        print(f"Running in Google Colab. Using data directory: {data_dir}")
    else:
        data_dir = os.path.join(os.getcwd(), "Data")
        print(f"Running locally. Using data directory: {data_dir}")

        # Ensure the 'Data' directory exists locally
        if not os.path.isdir(data_dir):
            print(f"The directory '{data_dir}' does not exist. Please create it and place the data files there.")
            raise FileNotFoundError(f"'{data_dir}' directory is required for local execution.")

    return data_dir
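
`determine_data_dir` relies on the `COLAB_GPU` environment variable alone. A minimal sketch of a slightly more defensive check follows, assuming the Colab kernel has already imported the `google.colab` package by the time the cell runs. `running_in_colab` and `resolve_data_dir` are hypothetical helper names, not part of the notebook.

```python
import os
import sys

def running_in_colab(environ=None, modules=None):
    """Best-effort Colab check: the env variable first, then the imported module."""
    # Parameters exist only so the check can be exercised outside Colab.
    environ = os.environ if environ is None else environ
    modules = sys.modules if modules is None else modules
    return "COLAB_GPU" in environ or "google.colab" in modules

def resolve_data_dir(in_colab, cwd):
    """Return '/content' on Colab, otherwise '<cwd>/Data' (same layout as the notebook)."""
    return "/content" if in_colab else os.path.join(cwd, "Data")

data_dir_guess = resolve_data_dir(running_in_colab(), os.getcwd())
print(data_dir_guess)
```

Keeping the detection separate from the path logic makes both pieces easy to check without a Colab runtime.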


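The loading step in this section reads every CSV with pandas' default type inference, and its printed preview shows the `id` column as `1.250000e+18`, that is, as a float. A 64-bit float cannot hold every 19-digit tweet ID exactly, so one hedged alternative is to read `id` as text and parse `created_at` as a date. The sketch uses a tiny in-memory CSV with made-up IDs; `read_tweets` is a hypothetical helper, and it can only preserve digits that are actually present in the source files.

```python
import io
import pandas as pd

# Toy stand-in for one dataset CSV (made-up IDs; the real files have 17 columns).
toy_csv = io.StringIO(
    "id,created_at,original_text\n"
    "1250000000000000001,2020-04-19,first tweet\n"
    "1250000000000000002,2020-04-19,second tweet\n"
)

def read_tweets(source):
    """Read one CSV, keeping tweet IDs as text and parsing the date column."""
    return pd.read_csv(source, dtype={"id": str}, parse_dates=["created_at"])

tweets = read_tweets(toy_csv)
print(tweets["id"].tolist())

# For comparison: the same ID pushed through a float no longer round-trips.
print(int(float("1250000000000000001")) == 1250000000000000001)
```

With the real files, this could be plugged in as `pd.concat(map(read_tweets, files), ignore_index=True)`.
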
In [6]:
# Loading the files
# Determine the data directory
data_dir = determine_data_dir()

# Step 1: Locate all CSV files in the determined directory
files_pattern = os.path.join(data_dir, "*.csv")
files = glob.glob(files_pattern)

# Step 2: Check if files are found
if not files:
    print(f"No CSV files found in directory: {data_dir}")
else:
    print(f"Found {len(files)} files: {files}")

    # Step 3: Load and concatenate the files into a single DataFrame
    data = pd.concat(map(pd.read_csv, files), ignore_index=True)
    print(f"Data loaded successfully with {data.shape[0]} rows and {data.shape[1]} columns.")
    print(data.head())

Running in Google Colab. Using data directory: /content
Found 3 files: ['/content/Covid-19 Twitter Dataset (Apr-Jun 2020).csv', '/content/Covid-19 Twitter Dataset (Apr-Jun 2021).csv', '/content/Covid-19 Twitter Dataset (Aug-Sep 2020).csv']
Data loaded successfully with 411887 rows and 17 columns.
             id  created_at  \
0  1.250000e+18  2020-04-19   
1  1.250000e+18  2020-04-19   
2  1.250000e+18  2020-04-19   
3  1.250000e+18  2020-04-19   
4  1.250000e+18  2020-04-19   

                                              source  \
0  <a href="http://twitter.com/download/android" ...   
1  <a href="http://twitter.com/download/android" ...   
2  <a href="http://twitter.com/download/iphone" r...   
3  <a href="http://twitter.com/download/iphone" r...   
4  <a href="http://twitter.com/download/android" ...   

                                       original_text lang  favorite_count  \
0  RT @GlblCtzn: .@priyankachopra is calling on l...   en             0.0   
1  RT @OGSG_Official: OG