# Problem Statement 

Amid the recent series of unfortunate events, the COVID-19 pandemic chief among them, social media has been a constant stream of valuable information that could shed light on the state of the country. Twitter is a text-based social medium that lends itself well to Natural Language Processing (NLP) techniques. The goal of this project is therefore to conduct a sentiment analysis that determines the polarity of tweets related to COVID-19 and to map them geographically across the United States.

The nation's response to the pandemic has been largely regional and state-based in nature. Some states have enacted strictly enforced stay-at-home policies, while others have provided less stringent guidelines paired with instructions on a federal level. From this analysis, we can observe how the nation has responded to the respective state and federal guidelines by comparing the sentiment of tweets across the United States against both the local social-distancing policies and the occurrence of the pandemic in those areas.
 

In this project, we make initial steps toward designing and implementing a web-tool or an app for tracking and monitoring geo-tagged tweets during the pandemic, in close to real time.

While traditional methods for alerting on such events rely on information from official sources (e.g. the CDC), our focus is on using social media activity to identify these events and alert when an event first occurs. The primary questions we look at here are: given a sea of text content from social media platforms, how does the general public respond to pandemic guidelines and policies, how quickly does the public respond once a policy has been implemented, and what sort of implementation would be valuable?

*Topic Extraction?

# Executive Summary 


The workflow of the project is as follows:

- Importing packages, obtaining COVID-19 geo-tagged tweet ids 
- Hydrating tweets 
- Exploratory data analysis of tweets collected 
- Sentiment analysis of tweets
- Evaluation
- Visualizations 
- Conclusions and Recommendations

For the purposes of this project, we pulled tweets from the Institute of Electrical and Electronics Engineers' (IEEE) Data Repository. The data mining process was interesting but long. First, we obtained a dataset of COVID-19-related tweet IDs totalling ~168,493 (at the time). Next, we hydrated those tweet IDs to obtain the full Twitter information for those tweets. The tweets were then filtered to keep only those from the United States (our primary focus) with state information. Our final tweet count was ~65,000.

A sentiment analysis was then conducted using TextBlob to determine the polarity of the tweets, ranging from overwhelmingly positive on one end to overwhelmingly negative on the other. TextBlob is a popular Python library for processing textual data. It provides a simple API for common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, and translation.

The sentiment analysis data is then mapped using the geo-tagged information from these tweets to build a visual timeline of the country's overall sentiment on COVID-19. This lets us view the public's response to events as they unfold throughout the current pandemic. It also allows us to dig into deeper issues regarding response times and helps us make better decisions in health policymaking, especially during the pandemic.
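
As a rough illustration of the aggregation behind such a timeline, the sketch below averages the positive-class probability per state and calendar day using only the standard library. It assumes each tweet is available as a dict with `state`, `date`, and `p_pos` keys (mirroring the column names used later in this notebook) and that `date` is in Twitter's `created_at` format. The function name and sample rows are made up for the example, and the actual maps in this project were built in Tableau.

```python
from collections import defaultdict
from datetime import datetime

# Twitter's `created_at` layout, e.g. "Wed Mar 25 14:03:11 +0000 2020"
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def daily_state_sentiment(rows):
    """Average p_pos per (state, calendar day).

    `rows` is an iterable of dicts with 'state', 'date' and 'p_pos' keys
    (an assumed layout for this sketch).
    """
    totals = defaultdict(lambda: [0.0, 0])  # (state, day) -> [sum, count]
    for row in rows:
        day = datetime.strptime(row["date"], TWITTER_TIME_FORMAT).date()
        bucket = totals[(row["state"], day)]
        bucket[0] += float(row["p_pos"])
        bucket[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Made-up rows, purely for illustration
sample_rows = [
    {"state": "NY", "date": "Wed Mar 25 14:03:11 +0000 2020", "p_pos": 0.75},
    {"state": "NY", "date": "Wed Mar 25 18:40:00 +0000 2020", "p_pos": 0.25},
    {"state": "CA", "date": "Thu Mar 26 01:15:42 +0000 2020", "p_pos": 0.5},
]
timeline = daily_state_sentiment(sample_rows)
```

Each key of `timeline` is a `(state, date)` pair, which is the shape a state-by-day map or line chart would need.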

# Contents 

- [Data Dictionary](#Data-Dictionary)
- [Package Import](#Package-Import)
- [Scraping COVID-19 Geo Tagged Tweet URLs](#Scraping-COVID-19-Geo-Tagged-Tweet-URLs)
- [Hydrating Tweets using TWARC API](#Hydrating-Tweets-using-TWARC-API)
- [Exploratory Data Analysis (EDA)](#Exploratory-Data-Analysis-(EDA))
- [Modeling](#Modeling)
- [Model Selection](#Model-Selection)
- [Model Evaluation](#Model-Evaluation)
- [Conclusions and Recommendations](#Conclusions-and-Recommendations)
- [References](#References)

# Data Dictionary

# Package Import

In [None]:
# Standard Packages
import csv
import json
import re
import time

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Data Obtaining and Cleaning Packages
import requests
from bs4 import BeautifulSoup

# Modeling Packages
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
# ENGLISH_STOP_WORDS replaces the `stop_words` module, which newer scikit-learn releases no longer expose
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Twitter
# pip install textblob
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer

pd.set_option('display.max_columns', 100)

# Scraping COVID-19 Geo Tagged Tweet URLs 



We sourced our tweets from [this collection](https://ieee-dataport.org/open-access/coronavirus-covid-19-geo-tagged-tweets-dataset). It turned out that, once logged in, the page contains a list of direct links to the `.csv` files of tweet IDs for each day, so we saved the specific `<div>` containing this list and used `BeautifulSoup` and `requests` to download the files. Finally, we concatenated the full list and converted the `.csv` to a `.txt` outside of Python for easy access from `twarc`.

If you are following this process, make sure `twarc` is installed (`pip install twarc`); you can then run the shell command `twarc hydrate ids.txt > tweets.jsonl` in the folder with the `ids.txt` file to download all the tweets in JSON form. This process took about 20 minutes for us.

In [7]:
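# Parse the saved <div> of download links, closing the file afterwards
with open("../data/data_links.html") as html_file:
    soup = BeautifulSoup(html_file, "html.parser")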

In [8]:
links = []

for link in soup.find_all('a'):
    links.append(link.get('href'))

In [42]:
all_csvs = []
for link in links:
    file = requests.get(link)
    # Use the last filename-like token in the URL as the output file name
    title = re.findall(r'(\w+)(\.\w+)+(?!.*(\w+)(\.\w+)+)', link)
    title = ''.join(list(title[0]))
    decoded_content = file.content.decode('utf-8')

    # Parse the downloaded CSV text into rows
    cr = csv.reader(decoded_content.splitlines(), delimiter=',')
    daily_ids = pd.DataFrame(list(cr))
    all_csvs.append(daily_ids)
    daily_ids.to_csv('../data/tweet_ids/' + title, index=False)
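
The regular expression above recovers the file name from each link. As a possible alternative, the standard library can parse the URL directly, which also ignores any query string. This is only a sketch: the helper name and the example URL are hypothetical, and we have not checked it against the real download links.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def filename_from_url(url):
    """Return the last path component of a URL, ignoring query and fragment."""
    return PurePosixPath(urlparse(url).path).name


# Hypothetical link, for illustration only
example_url = "https://example.com/files/march20_march21.csv?download=1"
example_name = filename_from_url(example_url)
print(example_name)
```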

In [45]:
all_ids = pd.concat(all_csvs)

In [46]:
all_ids.to_csv('../data/tweet_ids/all_ids.csv', index=False)
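
We did the `.csv` to `.txt` conversion outside of Python, but the same step could be sketched in Python as below. It assumes the tweet ID sits in the first column and that the file starts with the header row that `to_csv` writes. The function name and sample rows are hypothetical.

```python
import csv


def tweet_ids_from_csv(lines, id_column=0, has_header=True):
    """Collect the numeric tweet IDs from CSV lines, one string per ID."""
    reader = csv.reader(lines)
    if has_header:
        next(reader, None)  # skip the header row written by to_csv
    ids = []
    for row in reader:
        if len(row) <= id_column:
            continue  # blank or short row
        value = row[id_column].strip()
        if value.isdigit():  # drop anything that is not a plain integer
            ids.append(value)
    return ids


# Made-up rows in the assumed layout: ID first, another column second
sample_lines = ["0,1", "1242000000000000001,0.5", "1242000000000000002,-0.1", ""]
sample_ids = tweet_ids_from_csv(sample_lines)

# Writing the file that `twarc hydrate` reads (not executed here):
# with open('../data/tweet_ids/all_ids.csv', newline='') as src, open('ids.txt', 'w') as dst:
#     dst.write('\n'.join(tweet_ids_from_csv(src)) + '\n')
```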

# Exploratory Data Analysis (EDA)

## Unsupervised Sentiment Analysis 

### Analyzing Twitter data 

In [None]:
# Adapted from: https://medium.com/shiyan-boxer/2020-us-presidential-election-twitter-sentiment-analysis-and-visualization-89e58a652af5
# Big thanks to Shiyan Boxer


class TweetAnalyzer():
    """
    Functionality for analyzing and categorizing content from tweets.
    """

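    def clean_tweet(self, tweet):
        # Replace @mentions, non-alphanumeric characters, and URLs with spaces, then collapse whitespace
        return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split())

    def analyze_sentiment(self, tweet):
        # Returns a TextBlob whose .sentiment is (classification, p_pos, p_neg)
        return TextBlob(self.clean_tweet(tweet), analyzer=NaiveBayesAnalyzer())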

    def tweets_to_data_frame(self, tweets):
        df = pd.DataFrame(data=[tweet['full_text']
                                for tweet in tweets], columns=['full_text'])
        df['id'] = np.array([tweet['id'] for tweet in tweets])
        df['date'] = np.array([tweet['created_at'] for tweet in tweets])
        df['city'] = [tweet['place']['full_name'] for tweet in tweets]
        df['country_code'] = [tweet['place']['country_code']
                              for tweet in tweets]
        df['country'] = [tweet['place']['country'] for tweet in tweets]
        df['coordinates'] = [tweet['coordinates']['coordinates']
                             for tweet in tweets]

        return df
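
To make the effect of `clean_tweet` concrete, here is the same regular expression applied to a made-up tweet as a standalone function. The sample text is invented for illustration.

```python
import re

# Same pattern as TweetAnalyzer.clean_tweet: @mentions, non-alphanumerics, URLs
TWEET_NOISE = re.compile(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)")


def clean_tweet(tweet):
    """Replace noise with spaces, then collapse runs of whitespace."""
    return " ".join(TWEET_NOISE.sub(" ", tweet).split())


cleaned = clean_tweet("@CDCgov Stay home! https://t.co/abc #COVID19")
print(cleaned)  # Stay home COVID19
```

Note that punctuation inside words is replaced as well, so a contraction such as "don't" becomes "don t" before it reaches the classifier.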

In [None]:
# Load .jsonl of all tweets
import json

all_tweets = []
with open('../data/tweets.jsonl', 'r') as json_file:
    for json_str in json_file:
        try:
            all_tweets.append(json.loads(json_str))
        except json.JSONDecodeError:
            # Skip lines that are not valid JSON
            pass

In [None]:
# Remove tweets which do not have proper geo fields
all_tweets = [tweet for tweet in all_tweets if isinstance(tweet['place'], dict)]

In [None]:
# Use TweetAnalyzer class to convert our .jsonl to a pandas dataframe
ta = TweetAnalyzer()
df = ta.tweets_to_data_frame(all_tweets)

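# Keep only US data (copy so that adding a column below does not act on a slice)
df = df[df['country_code'] == 'US'].copy()

# Keep only tweets with convenient state labels
# (the last two characters of place names shaped like "City, ST")
df['state'] = [city[-2:] for city in df['city']]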

valid_states = ['OH', 'CA', 'MA', 'FL', 'IL', 'MD', 'NC', 'NY', 'AZ',
                'LA', 'TX', 'UT', 'GA', 'NV', 'MI', 'NJ', 'IN', 'ME', 'KS', 'VA',
                'MN', 'TN', 'PA', 'SC', 'WI', 'NM', 'OR', 'MO', 'WA', 'DC',
                'AL', 'CT', 'ID', 'KY', 'MS', 'CO', 'OK', 'HI', 'AR', 'VT', 'RI',
                'NH', 'MT', 'DE', 'NE',  'SD', 'IA', 'ND', 'WV',  'AK',
                'WY']

df = df[df['state'].isin(valid_states)]

# Reset index
df = df.reset_index()
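
Taking the last two characters works here because the `valid_states` filter discards anything that is not a state code. A more explicit variant, sketched below with a hypothetical helper name, only accepts names that end in a comma followed by exactly two capital letters. It could replace the slicing step but would still be combined with the `valid_states` check.

```python
import re

# Matches a trailing ", ST" where ST is exactly two capital letters
STATE_SUFFIX = re.compile(r",\s*([A-Z]{2})$")


def extract_state(full_name):
    """Return the two-letter suffix of a 'City, ST' place name, or None."""
    match = STATE_SUFFIX.search(full_name.strip())
    return match.group(1) if match else None


print(extract_state("Los Angeles, CA"))
print(extract_state("California, USA"))
```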

We then needed to split our data into two halves to share the analysis time:

In [None]:
df.loc[0:32316, :].to_csv(
    '../data/tweets/cleaned_tweets_first_half.csv', index=False)
df.loc[32317:, :].to_csv(
    '../data/tweets/cleaned_tweets_second_half.csv', index=False)

In [None]:
df = pd.read_csv('../data/tweets/cleaned_tweets_second_half.csv')

s = time.time()
for i, tweet in enumerate(df['full_text']):
    analysis = ta.analyze_sentiment(tweet).sentiment
    df.loc[i, 'classification'] = analysis[0]
    df.loc[i, 'p_pos'] = analysis[1]
    df.loc[i, 'p_neg'] = analysis[2]
    if i % 100 == 0:
        print(
            f'{i} of {df.shape[0]}, time elapsed: {(time.time() - s) / 60} minutes')
        df.to_csv('../data/tweets/analyzed_tweets_second_half.csv', index=False)

df.to_csv('../data/tweets/analyzed_tweets_second_half.csv', index=False)

Total runtime for one half of the data was `32200 of 32278, time elapsed: 2317.5798520445824 minutes`.

Total runtime for the other half was `32300 of 32317, time elapsed: 1938.4947881817818 minutes`.
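
For scale, the logged figures work out to a few seconds per tweet, as the short calculation below shows. One likely contributor, as far as we understand TextBlob, is that `analyze_sentiment` creates a new `NaiveBayesAnalyzer` for every tweet and each new instance trains its classifier the first time it is used. The sketch also shows one way to reuse a single analyzer through TextBlob's `Blobber` factory. The helper names are ours, and we have not re-run the analysis this way, so any speed-up is an expectation rather than a measured result.

```python
def seconds_per_tweet(tweets_done, minutes_elapsed):
    """Average processing time per tweet, in seconds."""
    return minutes_elapsed * 60 / tweets_done


# Figures taken from the two progress logs above
rate_a = seconds_per_tweet(32200, 2317.5798520445824)
rate_b = seconds_per_tweet(32300, 1938.4947881817818)
print(f"{rate_a:.1f} s and {rate_b:.1f} s per tweet")


def build_shared_blobber():
    """Create one TextBlob factory whose analyzer is shared by every tweet."""
    # Imported lazily so the arithmetic above runs without textblob installed
    from textblob import Blobber
    from textblob.sentiments import NaiveBayesAnalyzer
    return Blobber(analyzer=NaiveBayesAnalyzer())


# Possible usage (not executed here):
# blobber = build_shared_blobber()
# classification, p_pos, p_neg = blobber(ta.clean_tweet(tweet)).sentiment
```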

Re-merging files after we processed them on separate computers:

In [None]:
df_1 = pd.read_csv('../data/tweets/analyzed_tweets_first_half.csv')
df_2 = pd.read_csv('../data/tweets/analyzed_tweets_second_half.csv')
df_3 = pd.concat([df_1, df_2])
df_3.to_csv('../data/tweets/analyzed_tweets.csv', index=False)

Finally, we created a slightly different version that uses `\n` as the separator and `\t` as the text delimiter, which Tableau found easier to interpret:

In [None]:
df_3.to_csv('../data/tweets/analyzed_tweets_for_tableau.csv',
            index=False, sep='\n', quotechar='\t')

## Visualizing Twitter data

In [None]:
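# Note: this cell expects 'len', 'likes', and 'retweets' columns,
# which tweets_to_data_frame above does not create
# Get average length over all tweets: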
print(np.mean(df['len']))

# Get the number of likes for the most liked tweet:
print(np.max(df['likes']))

# Get the number of retweets for the most retweeted tweet:
print(np.max(df['retweets']))

print(df.head(10))

# Time Series
# time_likes = pd.Series(data=df['len'].values, index=df['date'])
# time_likes.plot(figsize=(16, 4), color='r')
# plt.show()

# time_favs = pd.Series(data=df['likes'].values, index=df['date'])
# time_favs.plot(figsize=(16, 4), color='r')
# plt.show()

# time_retweets = pd.Series(data=df['retweets'].values, index=df['date'])
# time_retweets.plot(figsize=(16, 4), color='r')
# plt.show()

# Layered Time Series:
time_likes = pd.Series(data=df['likes'].values, index=df['date'])
time_likes.plot(figsize=(16, 4), label="likes", legend=True)

time_retweets = pd.Series(data=df['retweets'].values, index=df['date'])
time_retweets.plot(figsize=(16, 4), label="retweets", legend=True)
plt.show()

# Conclusions and Recommendations 

In this project, we analyzed the sentiments of COVID-19-related tweets in several ways. The overall trend shows that the public has become more optimistic over time. Digging into the dual-dimensional sentiment analysis we conducted, we found that the “Positive” sentiment went down initially and again towards the end, while “Negative” went up through the height of the pandemic. We were also able to see the respective sentiment consensus by state. Our results reflect the general reaction to the 'first wave' of the pandemic and the political climate during this time.

The fight against COVID-19 needs not only guidance from the government but also a positive attitude from the public. Our analysis provides a potential approach for revealing the public's sentiment and helping institutions respond to it in a timely manner.

Our analysis has shown some relationships between geographic data and the general sentiments of each state during the pandemic. Moving forward, introducing more granular data, such as the growth in confirmed cases, and adding further dimensions to our sentiment analysis would help provide a more comprehensive picture. This would allow us to generate more insights into the hardest-hit areas and the demographics of those affected, enabling institutions and government to take affirmative action based on this valuable information.

# References 

- COVID-19 Geo Tagged Tweets Dataset: https://ieee-dataport.org/open-access/coronavirus-covid-19-geo-tagged-tweets-dataset
- Package for Hydrating Tweets: https://github.com/DocNow/twarc
- Unsupervised Sentiment Analysis (K Means Clustering): https://towardsdatascience.com/unsupervised-sentiment-analysis-a38bf1906483
- Recommended Python libraries for Sentiment Analysis: https://www.iflexion.com/blog/sentiment-analysis-python
- Everything You Need to Know About Sentiment Analysis: https://monkeylearn.com/sentiment-analysis/
- Twitter Sentiment Analysis with Python and NLTK: http://www.laurentluce.com/posts/twitter-sentiment-analysis-using-python-and-nltk/
- Is it possible to do sentiment analysis of unlabelled text using word2vec model?: https://stackoverflow.com/questions/61185290/is-it-possible-to-do-sentiment-analysis-of-unlabelled-text-using-word2vec-model
- Making a request to download csv: https://stackoverflow.com/questions/35371043/use-python-requests-to-download-csv
