# ![ga_logo](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Project 5: Sentiment Analysis of COVID-19-related Tweets


#### Eu Jin Lee [GitHub](https://github.com/missingNA) [LinkedIn](https://www.linkedin.com/in/eeujinlee/)  

#### Gwen Rathgeber [GitHub](https://git.generalassemb.ly/gwenrathgeber) [LinkedIn](https://www.linkedin.com/in/gwenrathgeber/)

## Problem Statement 

Throughout the course of the COVID-19 pandemic, social media has been filled with a constant stream of valuable information that could shed light on the state of the world. Twitter is a great text-based social medium for applications of Natural Language Processing (NLP) techniques. In this project, our goal is to visually represent the mood of the United States over the course of the COVID-19 pandemic. 

The nation's response to the pandemic has been highly regional. For example, some states saw larger, earlier, or delayed spikes in case rates, and states varied in when they enacted and enforced stay-at-home policies. We hope to visualize the nation's response along those state boundaries.
 
We also make initial steps toward creating a web-based tool or app for tracking and monitoring the mood of the country in real time.

## Executive Summary 


The workflow of the project is as follows:

- Importing packages, obtaining COVID-19 geo-tagged tweet ids 
- Hydrating tweets 
- Sentiment analysis of tweets
- Evaluation
- Visualization Deliverable
- Conclusions and Recommendations

For this project, we pulled tweet IDs from the Institute of Electrical and Electronics Engineers (IEEE) DataPort repository. Data collection was interesting but slow. First, we obtained a dataset of ~168,493 COVID-19-related tweet IDs (the total available at the time). Next, we hydrated those IDs using the `twarc` Twitter API wrapper to obtain the full `.json` data for each tweet. We then kept only tweets from the United States (our primary focus) that included state information. Our final tweet count was ~65,000.

Sentiment analysis was then conducted using TextBlob's `NaiveBayesAnalyzer`, which classifies each tweet as positive or negative and reports a probability for each class. TextBlob is a popular Python library for processing textual data. It provides a simple API for common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, and translation.

The sentiment results were then mapped in Tableau using the geo-tagged information from these tweets to build a visual timeline of the country's overall sentiment on COVID-19. This lets us view the public's response to events over the course of the pandemic.

### Table of Contents 

- [Problem Statement](#Problem-Statement)
- [Executive Summary](#Executive-Summary)
- [Data Dictionary](#Data-Dictionary)
- [Package Imports](#Package-Imports)
- [Scraping COVID-19 Geo Tagged Tweet URLs](#Scraping-COVID-19-Geo-Tagged-Tweet-URLs)
- [Hydrating Tweets using TWARC API](#Hydrating-Tweets-using-TWARC-API)
- [Analyzing Twitter Data](#Analyzing-Twitter-Data)
- [Visualization](#Visualization)
- [Conclusions and Recommendations](#Conclusions-and-Recommendations)
- [References](#References)

## Data Dictionary

| Features       | Type    | Description                                        |
|----------------|---------|----------------------------------------------------|
| full_text      | object  | full text of the tweet                             |
| id             | float64 | tweet id                                           |
| date           | object  | date tweet was posted                              |
| city           | object  | city from which tweet was posted                   |
| country_code   | object  | country code from which related tweet was posted   |
| country        | object  | country corresponding to country code              |
| coordinates    | object  | longitude and latitude of the tweet location       |
| state          | object  | state from which tweet was posted                  |
| classification | object  | overall tweet sentiment (positive or negative)     |
| p_pos          | float64 | probability (0 to 1) of positive sentiment         |
| p_neg          | float64 | probability (0 to 1) of negative sentiment         |

---

## Package Imports

In [1]:
# Standard Packages
import pandas as pd
import numpy as np

# Modeling Packages
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer

# Data Obtaining and Cleaning Packages
import re
import csv
import json  # needed to parse the hydrated .jsonl file
import requests
from bs4 import BeautifulSoup
import time

## Scraping COVID-19 Geo Tagged Tweet URLs 

We sourced our tweets from [this collection](https://ieee-dataport.org/open-access/coronavirus-covid-19-geo-tagged-tweets-dataset). Once logged in, the page lists direct links to the `.csv` files of tweet IDs for each day, so we saved the specific `<div>` containing this list and used `BeautifulSoup` and `requests` to download the files. Finally, we concatenated the full list and converted the `.csv` to a `.txt` outside of Python for easy access from `twarc`.

In [7]:
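# Parse the saved <div> of download links, closing the file afterwards
with open("../data/data_links.html") as html_file:
    soup = BeautifulSoup(html_file, "html.parser")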

In [8]:
links = []

for link in soup.find_all('a'):
    links.append(link.get('href'))

In [42]:
all_csvs = []
for link in links:
    # Download one day's CSV of tweet IDs
    response = requests.get(link)
    # Use the last name.ext token in the link as the output file name
    title = re.findall(r'(\w+)(\.\w+)+(?!.*(\w+)(\.\w+)+)', link)
    title = ''.join(title[0])
    decoded_content = response.content.decode('utf-8')

    # Parse the CSV text into rows, then keep and save them as a DataFrame
    cr = csv.reader(decoded_content.splitlines(), delimiter=',')
    daily_ids = pd.DataFrame(list(cr))
    all_csvs.append(daily_ids)
    daily_ids.to_csv('../data/tweet_ids/' + title, index=False)

In [45]:
all_ids = pd.concat(all_csvs)

In [46]:
all_ids.to_csv('../data/tweet_ids/all_ids.csv', index=False)
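
The `.csv` to `.txt` conversion was done outside of Python. As a minimal sketch of the same step in Python, the snippet below writes one tweet ID per line, which is the input `twarc hydrate` reads. It assumes the IDs are in the first column and that the first row is a header (as pandas writes by default); `ids_from_csv` and `write_ids_txt` are hypothetical helper names, and the sample IDs are made up.

```python
import csv
import io


def ids_from_csv(csv_file, has_header=True):
    """Yield tweet IDs from the first column of an open CSV file.

    Assumptions: IDs are in the first column, and the first row is a
    header unless has_header is False. Non-numeric cells are skipped.
    """
    reader = csv.reader(csv_file)
    if has_header:
        next(reader, None)  # discard the header row
    for row in reader:
        if row and row[0].strip().isdigit():
            yield row[0].strip()


def write_ids_txt(csv_file, txt_file, has_header=True):
    """Write one tweet ID per line and return how many were written."""
    count = 0
    for tweet_id in ids_from_csv(csv_file, has_header):
        txt_file.write(tweet_id + "\n")
        count += 1
    return count


# In-memory demonstration with made-up IDs and a pandas-style "0,1" header
sample_csv = io.StringIO(
    "0,1\n1240000000000000001,0.5\n1240000000000000002,-0.1\n"
)
sample_txt = io.StringIO()
n_written = write_ids_txt(sample_csv, sample_txt)
print(n_written)
```

With real files, the same function could be called with `open('../data/tweet_ids/all_ids.csv', newline='')` as the source and an `ids.txt` file opened for writing as the destination.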

## Hydrating Tweets using TWARC API

To follow this process, install twarc with `pip install twarc` (twarc requires Twitter API credentials), then run the shell command `twarc hydrate ids.txt > tweets.jsonl` in the folder containing `ids.txt` to download all the tweets in JSON form. This process took about 20 minutes for us.

## Analyzing Twitter Data 

In [None]:
# Adapted from: https://medium.com/shiyan-boxer/2020-us-presidential-election-twitter-sentiment-analysis-and-visualization-89e58a652af5
# Big thanks to Shiyan Boxer

class TweetAnalyzer():
    """
    Functionality for analyzing and categorizing content from tweets.
    """

    def clean_tweet(self, tweet):
        # Replace @mentions, URLs, and non-alphanumeric characters with spaces, then collapse whitespace
        return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split())

    def analyze_sentiment(self, tweet):
        return TextBlob(self.clean_tweet(tweet), analyzer=NaiveBayesAnalyzer())

    def tweets_to_data_frame(self, tweets):
        df = pd.DataFrame(data=[tweet['full_text']
                                for tweet in tweets], columns=['full_text'])
        df['id'] = np.array([tweet['id'] for tweet in tweets])
        df['date'] = np.array([tweet['created_at'] for tweet in tweets])
        df['city'] = [tweet['place']['full_name'] for tweet in tweets]
        df['country_code'] = [tweet['place']['country_code']
                              for tweet in tweets]
        df['country'] = [tweet['place']['country'] for tweet in tweets]
        df['coordinates'] = [tweet['coordinates']['coordinates']
                             for tweet in tweets]

        return df
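
To illustrate what `clean_tweet` removes, here is a standalone sketch that applies the same regular expression to a made-up tweet. The module-level function and the `TWEET_NOISE` name are for illustration only and are not part of the notebook.

```python
import re

# Same pattern as TweetAnalyzer.clean_tweet, compiled once as a raw string
TWEET_NOISE = re.compile(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)")


def clean_tweet(tweet):
    """Replace mentions, URLs, and non-alphanumeric characters with spaces,
    then collapse runs of whitespace into single spaces."""
    return " ".join(TWEET_NOISE.sub(" ", tweet).split())


# Made-up example tweet
sample = "@CDCgov Stay home, stay safe! https://t.co/abc123 #COVID19"
cleaned = clean_tweet(sample)
print(cleaned)  # Stay home stay safe COVID19
```

Note that punctuation and the `#` symbol are dropped while the hashtag text itself is kept, so hashtags still contribute words to the sentiment analysis.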

In [None]:
# Load .jsonl of all tweets

all_tweets = []
with open('../data/tweets.jsonl', 'r') as json_file:
    json_list = list(json_file)

for json_str in json_list:
    try:
        result = json.loads(json_str)
        all_tweets.append(result)
    except json.JSONDecodeError:
        # Skip lines that are not valid JSON instead of hiding every error
        continue

In [None]:
# Remove tweets which do not have proper geo fields
all_tweets = [tweet for tweet in all_tweets if isinstance(tweet['place'], dict)]

In [None]:
# Use TweetAnalyzer class to convert our .jsonl to a pandas dataframe
ta = TweetAnalyzer()
df = ta.tweets_to_data_frame(all_tweets)

# Keep only US data
df = df[df['country_code'] == 'US']

# Keep only tweets with convenient state labels
df['state'] = [city[-2:] for city in df['city']]

valid_states = ['OH', 'CA', 'MA', 'FL', 'IL', 'MD', 'NC', 'NY', 'AZ',
                'LA', 'TX', 'UT', 'GA', 'NV', 'MI', 'NJ', 'IN', 'ME', 'KS', 'VA',
                'MN', 'TN', 'PA', 'SC', 'WI', 'NM', 'OR', 'MO', 'WA', 'DC',
                'AL', 'CT', 'ID', 'KY', 'MS', 'CO', 'OK', 'HI', 'AR', 'VT', 'RI',
                'NH', 'MT', 'DE', 'NE',  'SD', 'IA', 'ND', 'WV',  'AK',
                'WY']

df = df[df['state'].isin(valid_states)]

# Reset index
df = df.reset_index(drop=True)  # drop=True avoids saving the old index as an extra column
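
The state filter above relies on place names ending in a two-letter state abbreviation. The sketch below shows that logic on made-up place names; `extract_state` is a hypothetical helper, and `VALID_STATES` is only a small subset of the notebook's `valid_states` list. Place names that do not end in a listed abbreviation (for example, a name ending in `USA`) are dropped.

```python
# Subset of the notebook's valid_states list, for illustration only
VALID_STATES = {"NY", "CA", "TX", "FL"}


def extract_state(full_name, valid_states=VALID_STATES):
    """Return the last two characters of a place name if they are a
    recognized state abbreviation, otherwise None."""
    candidate = full_name[-2:]
    return candidate if candidate in valid_states else None


# Made-up place names
places = ["Brooklyn, NY", "Los Angeles, CA", "Florida, USA", "Toronto, Ontario"]
states = [extract_state(place) for place in places]
print(states)  # ['NY', 'CA', None, None]
```

One consequence of this approach is that tweets tagged only at the state level are excluded even though their state is known, which may account for part of the drop in tweet count.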

We then needed to split our data into two halves to share the analysis time:

In [None]:
df.loc[0:32316, :].to_csv(
    '../data/tweets/cleaned_tweets_first_half.csv', index=False)
df.loc[32317:, :].to_csv(
    '../data/tweets/cleaned_tweets_second_half.csv', index=False)

In [None]:
df = pd.read_csv('../data/tweets/cleaned_tweets_second_half.csv')

s = time.time()
for i, tweet in enumerate(df['full_text']):
    analysis = ta.analyze_sentiment(tweet).sentiment
    df.loc[i, 'classification'] = analysis[0]
    df.loc[i, 'p_pos'] = analysis[1]
    df.loc[i, 'p_neg'] = analysis[2]
    if i % 100 == 0:
        print(
            f'{i} of {df.shape[0]}, time elapsed: {(time.time() - s) / 60} minutes')
        df.to_csv('../data/tweets/analyzed_tweets_second_half.csv', index=False)

df.to_csv('../data/tweets/analyzed_tweets_second_half.csv', index=False)

Total runtimes, taken from the last progress line printed for each half:

- One half: `32200 of 32278, time elapsed: 2317.5798520445824 minutes` (roughly 39 hours)
- Other half: `32300 of 32317, time elapsed: 1938.4947881817818 minutes` (roughly 32 hours)
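
One plausible contributor to these runtimes, which has not been benchmarked here, is that `analyze_sentiment` builds a new `NaiveBayesAnalyzer()` for every tweet. TextBlob analyzers appear to train lazily on first use, so each new instance would train again. The sketch below uses `CountingAnalyzer`, a hypothetical stand-in that is not part of TextBlob, to show how the two usage patterns differ in the number of training runs.

```python
class CountingAnalyzer:
    """Hypothetical stand-in for a sentiment analyzer that trains lazily."""

    train_calls = 0  # shared counter across all instances

    def __init__(self):
        self._trained = False

    def train(self):
        CountingAnalyzer.train_calls += 1
        self._trained = True

    def analyze(self, text):
        if not self._trained:
            self.train()
        # Dummy result; a real analyzer would classify the text
        return ("pos", 0.5, 0.5)


tweets = ["first tweet", "second tweet", "third tweet"]

# Pattern 1: a new analyzer per tweet trains once per tweet
for tweet in tweets:
    CountingAnalyzer().analyze(tweet)
per_tweet_trainings = CountingAnalyzer.train_calls

# Pattern 2: one shared analyzer trains only once
CountingAnalyzer.train_calls = 0
shared = CountingAnalyzer()
for tweet in tweets:
    shared.analyze(tweet)
shared_trainings = CountingAnalyzer.train_calls

print(per_tweet_trainings, shared_trainings)  # 3 1
```

With TextBlob, the analogous change would be to create one `NaiveBayesAnalyzer()` (for example in `TweetAnalyzer.__init__`) and pass that instance as `analyzer=` to every `TextBlob(...)` call.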

Re-merging files after we processed them on separate computers:

In [None]:
df_1 = pd.read_csv('../data/tweets/analyzed_tweets_first_half.csv')
df_2 = pd.read_csv('../data/tweets/analyzed_tweets_second_half.csv')
df_3 = pd.concat([df_1, df_2])
df_3.to_csv('../data/tweets/analyzed_tweets.csv', index=False)

Finally, we created a second version of the file that uses `\n` as the separator and `\t` as the quote character, which Tableau found easier to interpret:

In [None]:
df_3.to_csv('../data/tweets/analyzed_tweets_for_tableau.csv',
            index=False, sep='\n', quotechar='\t')

## Visualization

We moved to Tableau to create the visualization: a timelapse of each state's rolling average sentiment over the previous four days. [You can view or download it yourself on Tableau Public.](https://public.tableau.com/views/COVIDTwitterSentimentVisualization/Sheet1?:language=en&:display_count=y&publish=yes&:origin=viz_share_link)

![](../assets/coronavirus_sentiment_timelapse_w_legend.gif)
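
The rolling average itself was computed in Tableau. As a rough Python equivalent, the sketch below averages sentiment per state and day, then takes a trailing four-day mean. Several choices here are assumptions rather than details from the notebook: sentiment is measured as mean `p_pos`, the window ends on (and includes) the current day, days without tweets are skipped, and `daily_state_means` and `trailing_average` are hypothetical helper names. Date parsing assumes an English locale, and the rows are made up.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def daily_state_means(rows):
    """Average p_pos per (state, calendar date).

    rows: iterable of (created_at, state, p_pos), with created_at in
    Twitter's format, e.g. "Wed Apr 15 13:45:10 +0000 2020".
    """
    totals = defaultdict(lambda: [0.0, 0])  # (state, day) -> [sum, count]
    for created_at, state, p_pos in rows:
        day = datetime.strptime(created_at, "%a %b %d %H:%M:%S %z %Y").date()
        bucket = totals[(state, day)]
        bucket[0] += p_pos
        bucket[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


def trailing_average(daily, state, day, window=4):
    """Mean of the daily means for `state` over `window` days ending on `day`.

    Days with no tweets are skipped rather than treated as zero.
    """
    values = [
        daily[(state, day - timedelta(days=offset))]
        for offset in range(window)
        if (state, day - timedelta(days=offset)) in daily
    ]
    return sum(values) / len(values) if values else None


# Made-up rows for illustration
rows = [
    ("Mon Apr 13 10:00:00 +0000 2020", "NY", 0.2),
    ("Tue Apr 14 11:00:00 +0000 2020", "NY", 0.4),
    ("Tue Apr 14 12:00:00 +0000 2020", "NY", 0.6),
    ("Thu Apr 16 09:00:00 +0000 2020", "NY", 0.8),
]
daily = daily_state_means(rows)
april_16 = datetime(2020, 4, 16).date()
ny_average = trailing_average(daily, "NY", april_16)
print(ny_average)
```

Averaging daily means gives each day equal weight regardless of tweet volume; pooling all tweets in the window instead would weight busy days more heavily.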

## Conclusions and Recommendations

In this project, we analyzed the sentiment of COVID-19-related tweets over time and by state. The overall trend shows that the public was initially optimistic, became much less cheerful in mid-April, then trended somewhat more positive in the following months. We were also able to see the overall sentiment by state. In particular, you can track the major decrease in positive sentiment in NY and surrounding states after the pandemic reached its peak case load. Interestingly, our data shows little visible decrease in sentiment during the larger and more widespread second wave.

The fight against COVID-19 needs not only guidance from the government but also a positive attitude from the public. Our analysis provides a potential approach to examine the public’s mood and allow institutions to respond to it in a timely manner.

Our analysis has shown some relationships between geographic data and the general sentiments of the state during the pandemic. Moving forward, introducing more granular data such as the growth of confirmed cases or adding additional dimensionality to our sentiment analysis would help provide a more comprehensive picture. 

Our workflow was limited by the level of API access granted by Twitter, which prevented us from gathering much historical data on our own. We needed to find a pre-existing dataset, and future work would benefit from integrating additional COVID-related tweet datasets with what we were able to obtain.

In addition, time constraints prevented us from testing a variety of pre-trained sentiment analysis models. The TextBlob Naive Bayes model was trained on IMDB movie reviews, making it somewhat less than ideal for our data, but most other well-known pre-trained models have a similar issue. Future work could use a more advanced sentiment analysis model, either by accessing a paid API or by hand-labeling tweets to create a custom-fitted model for this application.

Finally, we would have liked to include a wider variety of visualizations of this data, including but not limited to annotations of major events, time series, and visualizations of additional features such as retweets and likes.

## References

- COVID-19 Geo Tagged Tweets Dataset: https://ieee-dataport.org/open-access/coronavirus-covid-19-geo-tagged-tweets-dataset
- Package for Hydrating Tweets: https://github.com/DocNow/twarc
- TweetAnalyzer class adapted from: https://medium.com/shiyan-boxer/2020-us-presidential-election-twitter-sentiment-analysis-and-visualization-89e58a652af5
- Everything You Need to Know About Sentiment Analysis: https://monkeylearn.com/sentiment-analysis/
- Making a request to download csv: https://stackoverflow.com/questions/35371043/use-python-requests-to-download-csv
