# Problem Statement 


The COVID-19 response in the United States has been largely regional and state-based. Some states have enacted strictly enforced stay-at-home policies, while others have only issued guidelines. It would be worthwhile to analyze the sentiment of tweets across the United States and compare it with both the local social distancing policies and the incidence of COVID-19 in those areas.

Suggestions for Deliverables:

- A short write-up describing the project, results, and next steps or a proposal to scale
- Open source code for identifying social media posts from specific regions and conducting a sentiment analysis or topic extraction on that data

Descriptions of input data:

- Tweets from Twitter
- Government data on social distancing policies
- Health-related data on COVID-19 occurrences in each region

# Executive Summary 


# Contents 

- [Data Dictionary](#Data-Dictionary)
- [Package Import](#Package-Import)
- [Scraping COVID-19 Geo Tagged Tweet URLs](#Scraping-COVID-19-Geo-Tagged-Tweet-URLs)
- [Hydrating Tweets using TWARC API](#Hydrating-Tweets-using-TWARC-API)
- [Exploratory Data Analysis (EDA)](#Exploratory-Data-Analysis-(EDA))
- [Modeling](#Modeling)
- [Model Selection](#Model-Selection)
- [Model Evaluation](#Model-Evaluation)
- [Conclusions and Recommendations](#Conclusions-and-Recommendations)
- [References](#References)

# Data Dictionary

# Package Import

In [None]:
# Standard packages
import json
import re

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Modeling packages
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
# The sklearn.feature_extraction.stop_words module was removed in scikit-learn 0.24;
# the built-in English stop word list is available as ENGLISH_STOP_WORDS.
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Text processing and sentiment (install with: pip install textblob)
from textblob import TextBlob

pd.set_option('display.max_columns', 100)

# Scraping COVID-19 Geo Tagged Tweet URLs 



The tweet scraping process can be found in the get_tweet_ids.ipynb Jupyter notebook.

# Hydrating Tweets using TWARC API

The process of hydrating the tweet URLs to obtain the full tweets can be found in the hydrate_tweets.ipynb Jupyter notebook.
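Hydration means looking up full tweet objects from their IDs. As a rough illustration of what happens before any request is sent, the sketch below groups IDs into batches. It assumes a batch size of 100, the documented cap of Twitter's v1.1 statuses/lookup endpoint; twarc is understood to handle this batching internally, so this is not the code used in hydrate_tweets.ipynb. The helper name `batch_ids` and the example IDs are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List

# Assumption: 100 IDs per request, the documented cap of Twitter's
# v1.1 statuses/lookup endpoint.
BATCH_SIZE = 100


def batch_ids(tweet_ids: Iterable[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` non-empty tweet IDs."""
    # Strip whitespace and newlines, then drop blank entries.
    stripped = (tid.strip() for tid in tweet_ids)
    cleaned = (tid for tid in stripped if tid)
    while True:
        batch = list(islice(cleaned, size))
        if not batch:
            return
        yield batch


# Hypothetical IDs, for illustration only.
example_ids = [str(n) for n in range(1, 251)]
batches = list(batch_ids(example_ids))
print([len(b) for b in batches])  # → [100, 100, 50]
```

Because `batch_ids` accepts any iterable of strings, an open file with one ID per line can be passed in directly.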

# Exploratory Data Analysis (EDA)

In [None]:
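import json  # required by json.loads below

all_tweets = []
with open('/content/tweets.jsonl', 'r') as json_file:
    # Each line of the JSONL file holds one hydrated tweet as a JSON object.
    for json_str in json_file:
        try:
            all_tweets.append(json.loads(json_str))
        except json.JSONDecodeError:
            # Skip malformed lines only. A bare `except` would also hide
            # unrelated errors, such as a NameError from a missing import.
            pass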

In [None]:
len(all_tweets)

0

In [None]:
len([tweet for tweet in all_tweets if isinstance(tweet['place'], dict)])

0

In [None]:
all_tweets = [tweet for tweet in all_tweets if isinstance(tweet['place'], dict)]
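Since the goal is a state-level comparison, each geo-tagged tweet eventually needs to be mapped to a US state. The sketch below is one possible approach, not part of the original notebook. It assumes that city-level places carry a `full_name` of the form "City, ST" (for example "Austin, TX") and a `country_code` of "US". The helper name `extract_state` is hypothetical, and places at other granularities (for example "Texas, USA") are deliberately left unresolved.

```python
from typing import Optional

# Two-letter postal abbreviations for the 50 states plus DC.
US_STATE_ABBREVIATIONS = set(
    "AL AK AZ AR CA CO CT DE FL GA HI ID IL IN IA KS KY LA ME MD "
    "MA MI MN MS MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC "
    "SD TN TX UT VT VA WA WV WI WY DC".split()
)


def extract_state(place: dict) -> Optional[str]:
    """Return the state abbreviation for a US city-level place, else None."""
    if place.get('country_code') != 'US':
        return None
    parts = [part.strip() for part in place.get('full_name', '').split(',')]
    # Assumption: city-level places look like "City, ST".
    if len(parts) == 2 and parts[1].upper() in US_STATE_ABBREVIATIONS:
        return parts[1].upper()
    return None


# Hypothetical place dictionaries, for illustration only.
print(extract_state({'full_name': 'Austin, TX', 'country_code': 'US'}))        # → TX
print(extract_state({'full_name': 'Texas, USA', 'country_code': 'US'}))        # → None
print(extract_state({'full_name': 'Toronto, Ontario', 'country_code': 'CA'}))  # → None
```

Returning `None` for anything that does not match keeps ambiguous places out of the state-level aggregates instead of guessing.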




## Analyzing Twitter data 

In [None]:
# https://medium.com/shiyan-boxer/2020-us-presidential-election-twitter-sentiment-analysis-and-visualization-89e58a652af5

class TweetAnalyzer():
    """
    Functionality for analyzing and categorizing content from tweets.
    """

    def clean_tweet(self, tweet):
        # Remove @mentions, URLs, and any character that is not alphanumeric or whitespace.
        return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split())

    def analyze_sentiment(self, tweet):
        # Returns a TextBlob; its .sentiment attribute holds polarity and subjectivity.
        return TextBlob(self.clean_tweet(tweet))

    def tweets_to_data_frame(self, tweets):
        # Expects tweets whose 'place' field is a dict.
        df = pd.DataFrame(data=[tweet['full_text'] for tweet in tweets], columns=['full_text'])
        df['id'] = np.array([tweet['id'] for tweet in tweets])
        df['date'] = np.array([tweet['created_at'] for tweet in tweets])
        df['city'] = [tweet['place']['full_name'] for tweet in tweets]
        df['country_code'] = [tweet['place']['country_code'] for tweet in tweets]
        df['country'] = [tweet['place']['country'] for tweet in tweets]
        # 'coordinates' is None when a tweet has a place but no exact point.
        df['coordinates'] = [tweet['coordinates']['coordinates'] if tweet.get('coordinates') else None
                             for tweet in tweets]

        return df

 
# if __name__ == '__main__':

#     twitter_client = TwitterClient()
#     tweet_analyzer = TweetAnalyzer()

#     api = twitter_client.get_twitter_client_api()

#     tweets = api.user_timeline(screen_name="realDonaldTrump", count=20)

#     # Demonstrations of possible EDA information 
#     #print(dir(tweets[0]))
#     #print(tweets[0].retweet_count)

#     df = tweet_analyzer.tweets_to_data_frame(tweets)
#     print(df.head(10))

#""" Sentiment Analysis """
    #df['sentiment'] = np.array([tweet_analyzer.analyze_sentiment(tweet) for tweet in df['tweets']])

In [None]:
all_tweets

[]

In [None]:
test_tweet = all_tweets[0]


test_tweet['place']['full_name']

IndexError: ignored

In [None]:
ta = TweetAnalyzer()
df = ta.tweets_to_data_frame(all_tweets)

TypeError: ignored

## Visualizing Twitter data

In [None]:
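# The 'len', 'likes', and 'retweets' columns are not created by tweets_to_data_frame,
# so derive them here. Assumption: tweets are Twitter API v1.1 JSON objects with
# 'favorite_count' and 'retweet_count' fields.
df['len'] = df['full_text'].str.len()
df['likes'] = [tweet['favorite_count'] for tweet in all_tweets]
df['retweets'] = [tweet['retweet_count'] for tweet in all_tweets]

# Parse 'created_at' (e.g. "Wed Oct 10 20:19:24 +0000 2018") so it can index a time series.
df['date'] = pd.to_datetime(df['date'], format='%a %b %d %H:%M:%S %z %Y')

# Get average length over all tweets:
print(np.mean(df['len']))

# Get the number of likes for the most liked tweet:
print(np.max(df['likes']))

# Get the number of retweets for the most retweeted tweet:
print(np.max(df['retweets']))

print(df.head(10))

# Time Series
# time_likes = pd.Series(data=df['len'].values, index=df['date'])
# time_likes.plot(figsize=(16, 4), color='r')
# plt.show()

# time_favs = pd.Series(data=df['likes'].values, index=df['date'])
# time_favs.plot(figsize=(16, 4), color='r')
# plt.show()

# time_retweets = pd.Series(data=df['retweets'].values, index=df['date'])
# time_retweets.plot(figsize=(16, 4), color='r')
# plt.show()

# Layered Time Series:
time_likes = pd.Series(data=df['likes'].values, index=df['date'])
time_likes.plot(figsize=(16, 4), label="likes", legend=True)

time_retweets = pd.Series(data=df['retweets'].values, index=df['date'])
time_retweets.plot(figsize=(16, 4), label="retweets", legend=True)
plt.show()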

# Unsupervised Sentiment Analysis 
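
One simple way to obtain sentiment labels without training data is to threshold a lexicon-based polarity score. TextBlob exposes such a score as `TextBlob(text).sentiment.polarity`, a float in the range [-1.0, 1.0]. The sketch below shows only the labeling step and is not the K-means approach listed in the references. The function name `label_polarity`, the 0.05 threshold, and the example scores are assumptions chosen for illustration.

```python
from collections import Counter


def label_polarity(polarity: float, threshold: float = 0.05) -> str:
    """Map a polarity score in [-1.0, 1.0] to a coarse sentiment label."""
    # Assumption: scores within +/- threshold of zero are treated as neutral.
    if polarity > threshold:
        return 'positive'
    if polarity < -threshold:
        return 'negative'
    return 'neutral'


# Hypothetical polarity scores, for illustration only.
example_scores = [0.8, -0.4, 0.0, 0.02, -0.6]
labels = [label_polarity(score) for score in example_scores]
print(labels)           # → ['positive', 'negative', 'neutral', 'neutral', 'negative']
print(Counter(labels))
```

In the notebook, the scores could come from `ta.analyze_sentiment(text).sentiment.polarity` applied to each tweet's `full_text`, and the resulting labels could then be counted per state.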

# Modeling 

## Model Preparation 

### Instantiating feature and target variables

## Model Selection 

## Model Evaluation 

# Conclusions and Recommendations 

# References 

- COVID-19 Geo Tagged Tweets Dataset: https://ieee-dataport.org/open-access/coronavirus-covid-19-geo-tagged-tweets-dataset
- Package for Hydrating Tweets: https://github.com/DocNow/twarc
- Unsupervised Sentiment Analysis (K Means Clustering): https://towardsdatascience.com/unsupervised-sentiment-analysis-a38bf1906483
- Recommended Python libraries for Sentiment Analysis: https://www.iflexion.com/blog/sentiment-analysis-python
- Everything You Need to Know About Sentiment Analysis: https://monkeylearn.com/sentiment-analysis/
- Twitter Sentiment Analysis with Python and NLTK: http://www.laurentluce.com/posts/twitter-sentiment-analysis-using-python-and-nltk/
- Is it possible to do sentiment analysis of unlabelled text using word2vec model?: https://stackoverflow.com/questions/61185290/is-it-possible-to-do-sentiment-analysis-of-unlabelled-text-using-word2vec-model