In [1]:
import pandas as pd
import numpy as np

## Objective

In this analysis, I attempt to measure the sentiment of tweets to learn whether tweets affect the number of Covid-19 cases and deaths in the United States. This study matters because the reopening of our society, from getting an ice cream cone to being able to earn a living, hinges on our ability to lower the rate of infection in our country. With so many individuals receiving their news and information through social media, being able to predict whether COVID cases will increase or decrease based on tweets can inform public policy. If we can predict the future number of COVID cases from the text of tweets, public officials, business leaders and concerned citizens can alter their tweeting practices to promote improved COVID outcomes.

To create the dataset, I used the TWINT library to collect all tweets from January 1, 2020 until July 10, 2020. I then made various subsets of the tweets. For example, to measure the impact of tweets by public leaders viewed as polar opposites in their response to the pandemic, I collected tweets by President Trump and the Governor of New York, Andrew Cuomo. Another subset, which I labeled the baseline, consists of tweets by the New York Times and the Washington Post, two of America's leading journalism outlets.

These subsets serve different purposes. The baseline tweets can be treated as mainly factual: while both outlets have op-ed columnists, we can assume that most tweets from their news reporting divisions provide factual updates on the Covid response. By considering the two polar opposites, Trump and Cuomo, we can measure Covid outcomes, in terms of cases, after the public has consumed their tweets. Finally, the main Covid collection will allow us to see whether more individuals subscribed to the Trump or Cuomo tweets and how Covid cases changed, for better or worse, in their region.

## Obtaining Data

The TWINT queries used to gather the tweets are in the Covid Data Queries notebook in the repo. The JSON files produced by those queries are loaded into DataFrames below.

In [2]:
#All Covid tweets
All_Covid_tweets = pd.read_json('tweets/Covid_tweets3.json',lines=True)

#All Trump tweets
Trump_Covid_tweets = pd.read_json('tweets/Trump_Covid_tweets3.json', lines=True)

#All Cuomo tweets
Cuomo_Covid_tweets = pd.read_json('tweets/Cuomo_Covid_tweets3.json',lines=True)

#Baseline Tweets
NYTimes_tweets = pd.read_json('tweets/Nytimes_Covid_tweets3.json',lines=True)
#print( len(NYTimes_tweets))
WashingtonPost_tweets = pd.read_json('tweets/Washpost_tweets3.json',lines=True)
#print(len(WashingtonPost_tweets))

#combining NYTimes and Washington Post to get Baseline Tweets
Baseline_tweets = pd.concat([NYTimes_tweets,WashingtonPost_tweets],axis=0)

#Reformatting Date columns for later merge
All_Covid_tweets['Date'] = All_Covid_tweets['date']
Trump_Covid_tweets['Date'] = Trump_Covid_tweets['date']
Cuomo_Covid_tweets['Date'] = Cuomo_Covid_tweets['date']
Baseline_tweets['Date'] = Baseline_tweets['date']
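
As a small illustration of what `lines=True` expects, the sketch below reads two made-up records in the one-JSON-object-per-line layout. The field values are hypothetical and only three fields are included; real TWINT exports carry many more columns.

```python
import io
import pandas as pd

# Two hypothetical records, one JSON object per line
raw = io.StringIO(
    '{"date": "2020-06-10", "tweet": "first example", "username": "nytimes"}\n'
    '{"date": "2020-06-09", "tweet": "second example", "username": "washingtonpost"}\n'
)

sample_tweets = pd.read_json(raw, lines=True)

# Same convention as the notebook: a capitalized Date column for later merges
sample_tweets['Date'] = pd.to_datetime(sample_tweets['date'])
print(sample_tweets[['Date', 'username', 'tweet']])
```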

Data for Covid Cases and Deaths was collected from The COVID Tracking Project.

In [3]:
# Covid data set

covid_cases = pd.read_csv('covid data/time_series_covid_19_confirmed.csv')

#Getting US data - confirmed cases
covid_cases = covid_cases[covid_cases['Country/Region'] == 'US']
#covid_cases = covid_cases.transpose()

# Covid death data set

covid_deaths = pd.read_csv('covid data/time_series_covid_19_deaths.csv')


#Getting US data - deaths

#covid_deaths = covid_deaths.transpose()
covid_deaths = covid_deaths[covid_deaths['Country/Region'] == 'US']


In [4]:
#Covid cases and deaths (still need to rename columns, from left to right = cases then deaths)
covid_data = pd.concat([covid_cases,covid_deaths],axis=0)
covid_data = covid_data.transpose()

In [5]:
covid_data = covid_data.drop(['Province/State','Country/Region','Lat','Long'])

In [6]:
covid_data.head()

Unnamed: 0,225,225.1
1/22/20,1,0
1/23/20,1,0
1/24/20,2,0
1/25/20,2,0
1/26/20,5,0


### Adding Case/Death Data on Day of the Tweet

In [7]:
#Edited column names in Excel for Merge
covid_data_formatted = pd.read_excel('covid data/covid_data_date.xlsx')
covid_data_formatted.head()

Unnamed: 0,Date,Cases,Deaths
0,1/22/20,1,0
1,1/23/20,1,0
2,1/24/20,2,0
3,1/25/20,2,0
4,1/26/20,5,0
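
The renaming done in Excel could also be done in pandas. The following is a minimal sketch on toy data, assuming the wide files hold one row for the US with one column per date; `to_long` is a hypothetical helper, not part of the notebook, and the values are made up to mirror the first rows above.

```python
import pandas as pd

# Toy stand-ins for the US rows of the two wide time-series files
cases_wide = pd.DataFrame({'Province/State': [None], 'Country/Region': ['US'],
                           'Lat': [37.09], 'Long': [-95.71],
                           '1/22/20': [1], '1/23/20': [1], '1/24/20': [2]})
deaths_wide = pd.DataFrame({'Province/State': [None], 'Country/Region': ['US'],
                            'Lat': [37.09], 'Long': [-95.71],
                            '1/22/20': [0], '1/23/20': [0], '1/24/20': [0]})

def to_long(wide, value_name):
    """Turn one wide US row into a Series indexed by date (hypothetical helper)."""
    meta = ['Province/State', 'Country/Region', 'Lat', 'Long']
    series = wide.drop(columns=meta, errors='ignore').iloc[0]
    series.index = pd.to_datetime(series.index, format='%m/%d/%y')
    return series.astype('int64').rename(value_name)

# One row per date with named Cases and Deaths columns
covid_long = (
    pd.concat([to_long(cases_wide, 'Cases'), to_long(deaths_wide, 'Deaths')], axis=1)
    .rename_axis('Date')
    .reset_index()
)
print(covid_long)
```

This yields `Date`, `Cases` and `Deaths` columns directly, with `Date` already a datetime.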


In [8]:
#Converting all Date columns to datetime for Merge
covid_data_formatted['Date'] = pd.to_datetime(covid_data_formatted['Date'])
All_Covid_tweets['Date'] = pd.to_datetime(All_Covid_tweets['Date'])
Trump_Covid_tweets['Date'] = pd.to_datetime(Trump_Covid_tweets['Date'])
Cuomo_Covid_tweets['Date'] = pd.to_datetime(Cuomo_Covid_tweets['Date'])
Baseline_tweets['Date'] = pd.to_datetime(Baseline_tweets['Date'])

In [9]:
#All Tweet Data with corresponding case/death information
All_Covid_tweets_case_data = pd.merge(All_Covid_tweets,covid_data_formatted,on='Date')
#Trump Tweet Data with corresponding case/death information
Trump_Covid_tweets_case_data = pd.merge(Trump_Covid_tweets,covid_data_formatted,on='Date')
#Cuomo Tweet Data with corresponding case/death information
Cuomo_Covid_tweets_case_data = pd.merge(Cuomo_Covid_tweets,covid_data_formatted, on='Date')
#Baseline Tweet Data with corresponding case/death information
Baseline_tweets_case_data = pd.merge(Baseline_tweets,covid_data_formatted,on='Date')

### Adding case/death data for two weeks after original tweet

In [10]:
#Getting the date two weeks after each tweet for Covid case/death reaction to Tweets
from datetime import datetime,timedelta

N = 14
days_N_from_now = All_Covid_tweets['Date'] + timedelta(days=N)

All_Covid_tweets_case_data['14 days'] = (All_Covid_tweets_case_data['Date'] + timedelta(days=N))
Trump_Covid_tweets_case_data['14 days'] = (Trump_Covid_tweets_case_data['Date'] + timedelta(days=N))
Cuomo_Covid_tweets_case_data['14 days'] = (Cuomo_Covid_tweets_case_data['Date'] +timedelta(days=N))
Baseline_tweets_case_data['14 days'] = (Baseline_tweets_case_data['Date'] + timedelta(days=N))

In [11]:
covid_data_two_week = pd.read_excel('covid data/covid_data_14days.xlsx')
covid_data_two_week.head()

Unnamed: 0,14 days,Cases,Deaths
0,1/22/20,1,0
1,1/23/20,1,0
2,1/24/20,2,0
3,1/25/20,2,0
4,1/26/20,5,0


In [12]:
#Converting all Date columns to datetime for Merge
covid_data_two_week['14 days'] = pd.to_datetime(covid_data_two_week['14 days'])
All_Covid_tweets_case_data['14 days'] = pd.to_datetime(All_Covid_tweets_case_data['14 days'])
Trump_Covid_tweets_case_data['14 days'] = pd.to_datetime(Trump_Covid_tweets_case_data['14 days'])
Cuomo_Covid_tweets_case_data['14 days'] = pd.to_datetime(Cuomo_Covid_tweets_case_data['14 days'])
Baseline_tweets_case_data['14 days'] = pd.to_datetime(Baseline_tweets_case_data['14 days'])

In [13]:
#All Tweet Data with corresponding case/death information
All_Covid_tweets_case_data = pd.merge(All_Covid_tweets_case_data,covid_data_two_week,on='14 days')
#Trump Tweet Data with corresponding case/death information
Trump_Covid_tweets_case_data = pd.merge(Trump_Covid_tweets_case_data,covid_data_two_week,on='14 days')
#Cuomo Tweet Data with corresponding case/death information
Cuomo_Covid_tweets_case_data = pd.merge(Cuomo_Covid_tweets_case_data,covid_data_two_week, on='14 days')
#Baseline Tweet Data with corresponding case/death information
Baseline_tweets_case_data = pd.merge(Baseline_tweets_case_data,covid_data_two_week,on='14 days')
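
The offset-then-merge pattern above repeats for every DataFrame and every lag. One way to express it once is sketched below on toy data; `add_lagged_counts` and the `Cases_14d`-style column names are assumptions for illustration, not names used in the notebook. The sketch uses a left join, so tweets whose lagged date falls outside the case data are kept with missing values, whereas the default inner join of `pd.merge` drops them.

```python
import pandas as pd

def add_lagged_counts(tweets, covid, days):
    """Attach the Cases/Deaths recorded `days` after each tweet's Date (hypothetical helper)."""
    lagged = covid.rename(columns={
        'Date': 'lag_date',
        'Cases': f'Cases_{days}d',
        'Deaths': f'Deaths_{days}d',
    })
    out = tweets.copy()
    out['lag_date'] = out['Date'] + pd.Timedelta(days=days)
    # Left join keeps tweets even when no case data exists for the lagged date
    return out.merge(lagged, on='lag_date', how='left').drop(columns='lag_date')

# Toy data: 30 days of made-up cumulative counts and two tweets
covid = pd.DataFrame({
    'Date': pd.date_range('2020-06-01', periods=30, freq='D'),
    'Cases': range(100, 130),
    'Deaths': range(10, 40),
})
tweets = pd.DataFrame({
    'Date': pd.to_datetime(['2020-06-01', '2020-06-02']),
    'tweet': ['a', 'b'],
})

tweets = add_lagged_counts(tweets, covid, 14)
tweets = add_lagged_counts(tweets, covid, 28)
print(tweets)
```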

### Adding Case/Death Data for four weeks after original tweet

In [14]:
covid_data_four_week = pd.read_excel('covid data/covid_data_28days.xlsx')
covid_data_four_week.head()

Unnamed: 0,28 days,Cases,Deaths
0,1/22/20,1,0
1,1/23/20,1,0
2,1/24/20,2,0
3,1/25/20,2,0
4,1/26/20,5,0


### Getting Dates and COVID Data for four weeks after Tweet

In [15]:
#Getting the date four weeks after each tweet for Covid case/death reaction to Tweets
from datetime import datetime,timedelta

N = 28
days_N_from_now = All_Covid_tweets['Date'] + timedelta(days=N)

All_Covid_tweets_case_data['28 days'] = (All_Covid_tweets_case_data['Date'] + timedelta(days=N))
Trump_Covid_tweets_case_data['28 days'] = (Trump_Covid_tweets_case_data['Date'] + timedelta(days=N))
Cuomo_Covid_tweets_case_data['28 days'] = (Cuomo_Covid_tweets_case_data['Date'] +timedelta(days=N))
Baseline_tweets_case_data['28 days'] = (Baseline_tweets_case_data['Date'] + timedelta(days=N))

In [16]:
#Converting all Date columns to datetime for Merge
covid_data_four_week['28 days'] = pd.to_datetime(covid_data_four_week['28 days'])
All_Covid_tweets_case_data['28 days'] = pd.to_datetime(All_Covid_tweets_case_data['28 days'])
Trump_Covid_tweets_case_data['28 days'] = pd.to_datetime(Trump_Covid_tweets_case_data['28 days'])
Cuomo_Covid_tweets_case_data['28 days'] = pd.to_datetime(Cuomo_Covid_tweets_case_data['28 days'])
Baseline_tweets_case_data['28 days'] = pd.to_datetime(Baseline_tweets_case_data['28 days'])

In [17]:
#All Tweet Data with corresponding case/death information
All_Covid_tweets_case_data = pd.merge(All_Covid_tweets_case_data,covid_data_four_week,on='28 days')
#Trump Tweet Data with corresponding case/death information
Trump_Covid_tweets_case_data = pd.merge(Trump_Covid_tweets_case_data,covid_data_four_week,on='28 days')
#Cuomo Tweet Data with corresponding case/death information
Cuomo_Covid_tweets_case_data = pd.merge(Cuomo_Covid_tweets_case_data,covid_data_four_week, on='28 days')
#Baseline Tweet Data with corresponding case/death information
Baseline_tweets_case_data = pd.merge(Baseline_tweets_case_data,covid_data_four_week,on='28 days')

In [18]:
Baseline_tweets_case_data.head()

Unnamed: 0,cashtags,conversation_id,created_at,date,geo,hashtags,id,likes_count,link,mentions,...,video,Date,Cases_x,Deaths_x,14 days,Cases_y,Deaths_y,28 days,Cases,Deaths
0,[],1270707306578264064,2020-06-10 13:20:05,2020-06-10,,[],1270707306578264065,209,https://twitter.com/nytimes/status/12707073065...,[],...,0,2020-06-10,2000702,113631,2020-06-24,2382426,122604,2020-07-08,3054699,132300
1,[],1270636833555308544,2020-06-10 08:40:03,2020-06-10,,[],1270636833555308544,857,https://twitter.com/nytimes/status/12706368335...,[],...,0,2020-06-10,2000702,113631,2020-06-24,2382426,122604,2020-07-08,3054699,132300
2,[],1270815442173603840,2020-06-10 20:29:46,2020-06-10,,[],1270815442173603841,159,https://twitter.com/washingtonpost/status/1270...,[],...,0,2020-06-10,2000702,113631,2020-06-24,2382426,122604,2020-07-08,3054699,132300
3,[],1270541216308957184,2020-06-10 02:20:06,2020-06-09,,[],1270541216308957184,382,https://twitter.com/nytimes/status/12705412163...,[nytmag],...,0,2020-06-09,1979908,112714,2020-06-23,2347491,121847,2020-07-07,2996098,131480
4,[],1270470755889840128,2020-06-09 21:40:07,2020-06-09,,[],1270470755889840134,404,https://twitter.com/nytimes/status/12704707558...,[nytmag],...,0,2020-06-09,1979908,112714,2020-06-23,2347491,121847,2020-07-07,2996098,131480


### Combined Tweet DataFrame

In [19]:
#Tweet dataframes combined

Master_Tweet_df = pd.concat([All_Covid_tweets_case_data,Trump_Covid_tweets_case_data,Cuomo_Covid_tweets_case_data,Baseline_tweets_case_data])
#Master_Tweet_df.to_csv('raw_tweet_data.csv')

### Organizing US State Data from NYTimes

The NYTimes maintains a GitHub repo that tracks state-by-state COVID data. This data can be useful later in the analysis when we track how certain localities have fared in dealing with the COVID pandemic. Tracking state COVID details will allow us to examine whether states classified as subscribing to the tenets of the Trump administration fare better or worse than states that align more with the politics of NY Governor Andrew Cuomo.

In [20]:
#reading in data from CSV
state_case_df = pd.read_csv('covid data/us-states.csv')

#Grouping rows by state
state_data = state_case_df.groupby('state')
state_case_df.to_csv('data/state_data_raw.csv')

state_case_df = state_case_df.sort_values(['state','date'],ascending=[True,True])
#state_case_df = state_case_df.sort_values('date',ascending=True)
state_case_df.reset_index(drop=True,inplace=True)
#state_case_df = state_case_df.sort_values('date',ascending=True)
#state_case_df.head(50)
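
Once the rows are sorted by state and date, per-state daily changes can be derived from the running totals. This is a sketch on made-up numbers, assuming the file has a cumulative `cases` column alongside `state` and `date`; the `new_cases` column name is hypothetical.

```python
import pandas as pd

# Toy cumulative counts for two states
state_toy = pd.DataFrame({
    'date': ['2020-03-01', '2020-03-02', '2020-03-03', '2020-03-01', '2020-03-02'],
    'state': ['New York', 'New York', 'New York', 'Texas', 'Texas'],
    'cases': [1, 3, 7, 2, 2],
})
state_toy['date'] = pd.to_datetime(state_toy['date'])
state_toy = state_toy.sort_values(['state', 'date']).reset_index(drop=True)

# Difference within each state; a state's first day keeps its initial count
state_toy['new_cases'] = (
    state_toy.groupby('state')['cases'].diff().fillna(state_toy['cases'])
)
print(state_toy)
```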

In [21]:
#dropping unnecessary columns

Master_Tweet_df = Master_Tweet_df.drop(['cashtags', 'conversation_id','geo', 'hashtags',
       'id','link', 'mentions', 'name', 'near', 'photos',
       'place', 'quote_url','reply_to', 'retweet',
       'retweet_date', 'retweet_id','source', 'time',
       'timezone', 'trans_dest', 'trans_src', 'translate','urls',
       'user_id', 'user_rt', 'user_rt_id',],axis=1)

### SCRUBBING OF TWEETS

In [22]:
#pip install textfeatures
import textfeatures as tf
Master_Tweet_df.columns

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jamaalsmith/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Index(['created_at', 'date', 'likes_count', 'replies_count', 'retweets_count',
       'tweet', 'username', 'video', 'Date', 'Cases_x', 'Deaths_x', '14 days',
       'Cases_y', 'Deaths_y', '28 days', 'Cases', 'Deaths'],
      dtype='object')

In [23]:
#Master_Tweet_df=Master_Tweet_df.drop('date',axis=1)
Master_Tweet_df.head()

Unnamed: 0,created_at,date,likes_count,replies_count,retweets_count,tweet,username,video,Date,Cases_x,Deaths_x,14 days,Cases_y,Deaths_y,28 days,Cases,Deaths
0,2020-05-27 01:57:20,2020-05-26,0,0,0,"Two of the United States leading news sources,...",whitewindlandon,0,2020-05-26,1689162,99952,2020-06-09,1979908,112714,2020-06-23,2347491,121847
1,2020-05-27 01:57:20,2020-05-26,0,0,0,"Two of the United States leading news sources,...",whitewindlandon,0,2020-05-26,1689162,99952,2020-06-09,1979908,112714,2020-06-23,2347491,121847
2,2020-05-26 23:18:19,2020-05-26,0,0,0,…Unless U’re a physician or a nurse in a surgi...,rescon1,0,2020-05-26,1689162,99952,2020-06-09,1979908,112714,2020-06-23,2347491,121847
3,2020-05-26 23:00:14,2020-05-26,0,2,0,The reality is that Andy Beshear didn't create...,nealhead,0,2020-05-26,1689162,99952,2020-06-09,1979908,112714,2020-06-23,2347491,121847
4,2020-05-26 22:34:09,2020-05-26,1,0,1,"In large countries such as the United States, ...",6121el,0,2020-05-26,1689162,99952,2020-06-09,1979908,112714,2020-06-23,2347491,121847


In [24]:
tf.word_count(Master_Tweet_df,"tweet",'word_count')
tf.avg_word_length(Master_Tweet_df,'tweet','avg_word_length')
tf.stopwords_count(Master_Tweet_df,'tweet','stopwords_count')
tf.char_count(Master_Tweet_df,'tweet','char_count')
tf.stopwords(Master_Tweet_df,'tweet','stopwords')
tf.clean(Master_Tweet_df,'tweet','clean_text')

Unnamed: 0,created_at,date,likes_count,replies_count,retweets_count,tweet,username,video,Date,Cases_x,...,Deaths_y,28 days,Cases,Deaths,word_count,avg_word_length,stopwords_count,char_count,stopwords,clean_text
0,2020-05-27 01:57:20,2020-05-26,0,0,0,"Two of the United States leading news sources,...",whitewindlandon,0,2020-05-26,1689162,...,112714,2020-06-23,2347491,121847,26,10.821429,7,332,"[of, the, and, these, do, not, or]",united states leading news sources time today ...
1,2020-05-27 01:57:20,2020-05-26,0,0,0,"Two of the United States leading news sources,...",whitewindlandon,0,2020-05-26,1689162,...,112714,2020-06-23,2347491,121847,26,10.821429,7,332,"[of, the, and, these, do, not, or]",united states leading news sources time today ...
2,2020-05-26 23:18:19,2020-05-26,0,0,0,…Unless U’re a physician or a nurse in a surgi...,rescon1,0,2020-05-26,1689162,...,112714,2020-06-23,2347491,121847,36,6.216216,16,267,"[a, or, a, in, a, you, have, no, a, in, the, o...",unless physician nurse surgical room business ...
3,2020-05-26 23:00:14,2020-05-26,0,2,0,The reality is that Andy Beshear didn't create...,nealhead,0,2020-05-26,1689162,...,112714,2020-06-23,2347491,121847,47,4.872340,21,277,"[is, that, didn't, he, didn't, it, in, the, an...",reality andy beshear didnt create covid didnt ...
4,2020-05-26 22:34:09,2020-05-26,1,0,1,"In large countries such as the United States, ...",6121el,0,2020-05-26,1689162,...,112714,2020-06-23,2347491,121847,43,5.395349,14,275,"[such, as, the, and, the, of, the, is, to, or,...",large countries united states russia brazil in...
5,2020-05-26 22:22:16,2020-05-26,0,0,0,"Right now, it's a very Good look for a Preside...",roswell32,0,2020-05-26,1689162,...,112714,2020-06-23,2347491,121847,38,4.976190,16,250,"[it's, a, very, for, a, of, the, by, is, and, ...",right good look president united states leadin...
6,2020-05-26 21:57:10,2020-05-26,13,3,6,United States has officially surpassed the gri...,gary_lyman,0,2020-05-26,1689162,...,112714,2020-06-23,2347491,121847,44,5.295455,18,276,"[has, the, of, over, from, is, that, this, is,...",united states officially surpassed grim milest...
7,2020-05-26 21:13:57,2020-05-26,0,0,0,The United States has more confirmed COVID-19 ...,oldnavy1968,0,2020-05-26,1689162,...,112714,2020-06-23,2347491,121847,40,5.250000,13,249,"[has, more, than, the, have, more, than, the, ...",united states confirmed covid cases next count...
8,2020-05-26 20:45:24,2020-05-26,0,0,0,The average age of deceased and COVID-19 posit...,dablazinjr,0,2020-05-26,1689162,...,112714,2020-06-23,2347491,121847,37,4.800000,12,235,"[of, and, is, it, is, not, all, the, is, it, o...",average deceased covid positive patients years...
9,2020-05-26 18:15:33,2020-05-26,0,0,0,You are some of the stupidest people how do yo...,smartpe53402672,0,2020-05-26,1689162,...,112714,2020-06-23,2347491,121847,56,4.150943,25,275,"[are, some, of, the, how, do, you, for, is, be...",stupidest people paid stupidity choosing votin...


In [25]:
#Tokenizing the cleaned tweets (stopwords and punctuation were already removed by tf.clean)

clean_tweet = Master_Tweet_df['clean_text']
#Tweet Tokenizer 
from nltk.tokenize import TweetTokenizer
ttknz = TweetTokenizer()

#creation of the corpus
#corpus = Master_Tweet_df['clean_tweets'].astype(str)
#corpus.dtypes

#tokenizing corpus
tok_corp = []
for sent in clean_tweet:
    toked = ttknz.tokenize(sent)
    tok_corp.append(toked)

In [26]:
len(tok_corp)

8514

In [27]:
#Sentiment Analysis 
from textblob import TextBlob
from textblob.sentiments import PatternAnalyzer, NaiveBayesAnalyzer
#from twitter_nlp_toolkit.tweet_sentiment_classifier import tweet_sentiment_classifier

#tweets = Master_Tweet_df['clean_tweets']

tweets = Master_Tweet_df['clean_text']

Sentiment = []
for tweet in tweets:
    #Classifier = tweet_sentiment_classifier.SentimentAnalyzer()
    #sentiment = Classifier.predict_proba(tweet)
    blob = TextBlob(tweet,analyzer=PatternAnalyzer())
    rating = blob.sentiment.polarity
    Sentiment.append(rating)

Master_Tweet_df['Sentiment'] = Sentiment
#Master_Tweet_df['Sentiment'] = Master_Tweet_df['Sentiment'].astype(int)
#Master_Tweet_df['Sentiment'].round(decimals = 4)

#Master_Tweet_df['Sentiment'].head()

In [28]:
#Saving as CSV for later uploads to different notebooks
Master_Tweet_df.to_csv('data/Master_Tweet_df.csv')
poll_data = pd.read_csv('data/poll_data_dates.csv')

In [29]:
#uploading reformatted date column - will revisit to do in Pandas, but errors kept occurring
Master_Tweet_df = pd.read_excel('data/Master_Tweet_df.xls')

In [30]:
#Converting Date columns to integer nanosecond timestamps so both merge keys share a dtype
Master_Tweet_df['Date'] = pd.to_datetime(Master_Tweet_df['Date'])
Master_Tweet_df['Date'] = Master_Tweet_df['Date'].astype('int64')
poll_data['Date'] = pd.to_datetime(poll_data['Date'])
poll_data['Date'] = poll_data['Date'].astype('int64')

In [31]:
#Merging with Poll Data

#poll_data = pd.read_csv('data/poll_data_dates.csv')
#pd.to_datetime(poll_data['Date']) #converting to datetime object for merge purposes
#pd.to_datetime(Master_Tweet_df['Date']) #converting to datetime object for merge purposes

left = Master_Tweet_df.sort_values(by='Date')
right = poll_data.sort_values(by='Date')

Master_Tweet_df = pd.merge_asof(left,right,on='Date',allow_exact_matches=False)
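
To make the behavior of `pd.merge_asof` concrete, here is a small sketch on toy data. The dates and poll names are made up. With the default backward direction and `allow_exact_matches=False`, each tweet is paired with the most recent poll dated strictly before it. The sketch keeps the keys as datetimes, which `merge_asof` also accepts when both sides share the same dtype.

```python
import pandas as pd

tweets = pd.DataFrame({
    'Date': pd.to_datetime(['2020-02-11', '2020-02-16', '2020-02-19']),
    'tweet': ['a', 'b', 'c'],
})
polls = pd.DataFrame({
    'Date': pd.to_datetime(['2020-02-09', '2020-02-16', '2020-02-18']),
    'Poll': ['Poll A', 'Poll B', 'Poll C'],
})

# Both frames must be sorted on the key before an as-of merge
merged = pd.merge_asof(
    tweets.sort_values('Date'),
    polls.sort_values('Date'),
    on='Date',
    direction='backward',
    allow_exact_matches=False,
)
print(merged)
```

The tweet dated 2020-02-16 is matched to Poll A rather than Poll B because a poll on the same date is excluded.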

In [32]:
#Master_Tweet_df = Master_Tweet_df.drop('Date')
#Master_Tweet_df = Master_Tweet_df.drop('Date',axis=1)
Master_Tweet_df.head()

Unnamed: 0.1,Unnamed: 0,created_at,date,likes_count,replies_count,retweets_count,tweet,username,video,Date,...,stopwords,Sentiment,Poll,Start Date,End Date,Sample,MoE,Biden (D),Trump (R),Spread
0,501,2020-02-11 17:20:08,2020-02-11,4374,464,2189,The World Health Organization has an official ...,nytimes,0,1581379200000000000,...,"['has', 'an', 'for', 'the', 'by', 'no', 'to', ...",0.0,QuinnipiacQuinnipiac,2/5/20,2/9/20,1519 RV,2.5,50.0,43.0,Biden +7
1,7317,2020-02-14 02:13:11,2020-02-13,0,0,0,Leading health experts warned on Feb. 12 that ...,tamiann02,0,1581552000000000000,...,"['on', 'that', 'the', 'of', 'of', 'the', 'or',...",0.375,QuinnipiacQuinnipiac,2/5/20,2/9/20,1519 RV,2.5,50.0,43.0,Biden +7
2,7318,2020-02-13 21:28:24,2020-02-13,0,0,0,Leading health experts warned on Feb. 12 that ...,1val1richy,0,1581552000000000000,...,"['on', 'that', 'the', 'of', 'of', 'the', 'or',...",0.375,QuinnipiacQuinnipiac,2/5/20,2/9/20,1519 RV,2.5,50.0,43.0,Biden +7
3,7319,2020-02-13 14:01:49,2020-02-13,0,0,0,"‚ÄúSo far, about 82 percent of the [covid-19 v...",kolsaw,0,1581552000000000000,...,"['about', 'of', 'the', 'all', 'in', 'the', 'ha...",0.048611,QuinnipiacQuinnipiac,2/5/20,2/9/20,1519 RV,2.5,50.0,43.0,Biden +7
4,7316,2020-02-15 23:04:26,2020-02-15,0,0,0,CDC director Dr. Robert Redfield says will bec...,_watch_observe_,0,1581724800000000000,...,"['will', 'the', 'or', 'it', 'is', 'to', 'how',...",0.216667,NPR/PBS/MaristNPR/PBS,2/13/20,2/16/20,1164 RV,3.7,50.0,44.0,Biden +6


In [33]:
Master_Tweet_df.columns

Index(['Unnamed: 0', 'created_at', 'date', 'likes_count', 'replies_count',
       'retweets_count', 'tweet', 'username', 'video', 'Date', 'Cases_x',
       'Deaths_x', '14 days', 'Cases_y', 'Deaths_y', '28 days', 'Cases',
       'Deaths', 'word_count', 'avg_word_length', 'stopwords_count',
       'clean_text', 'char_count', 'stopwords', 'Sentiment', 'Poll',
       'Start Date', 'End Date', 'Sample', 'MoE', 'Biden (D)', 'Trump (R)',
       'Spread'],
      dtype='object')

## Creation of Target Column

The Victory Spread column is the Biden poll figure minus the Trump poll figure. A positive spread shows how much Biden is leading by; a negative spread shows how much Trump is ahead by.

In [34]:
# Calculating Victory Spread
Master_Tweet_df['Victory Spread'] = Master_Tweet_df['Biden (D)'] - Master_Tweet_df['Trump (R)']

In [64]:
#Creating Target Column

def victor(x):
    #Cut points 2, 4 and 6; the bins are contiguous so every spread receives a label
    if pd.isna(x):
        return None
    if x <= 2:
        return 'Definitely Trump'
    if x < 4:
        return 'Likely Trump'
    if x < 6:
        return 'Likely Biden'
    return 'Definitely Biden'

Master_Tweet_df['Target'] = Master_Tweet_df['Victory Spread'].apply(victor)
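
The same labeling can be written without a row-wise `apply`. The sketch below uses `np.select` on toy poll figures, assuming cut points at 2, 4 and 6 with a spread of exactly 2 counted as 'Definitely Trump'; those boundaries are an interpretation of the conditions in `victor`, not a confirmed specification.

```python
import numpy as np
import pandas as pd

# Toy poll figures (made up)
polls = pd.DataFrame({
    'Biden (D)': [50.0, 48.0, 47.0, 49.0],
    'Trump (R)': [43.0, 52.0, 44.0, 44.0],
})
polls['Victory Spread'] = polls['Biden (D)'] - polls['Trump (R)']

spread = polls['Victory Spread']
# np.select takes the first condition that holds, so the order defines the bins
conditions = [spread <= 2, spread < 4, spread < 6]
choices = ['Definitely Trump', 'Likely Trump', 'Likely Biden']
polls['Target'] = np.select(conditions, choices, default='Definitely Biden')
print(polls)
```

Note that a missing spread fails every condition and would fall through to the default label, so rows without poll data should be filtered out or handled separately before using this form.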


In [65]:
Master_Tweet_df['MoE'].dtype

dtype('O')

In [67]:
Master_Tweet_df.head(40)

Unnamed: 0.1,Unnamed: 0,created_at,date,likes_count,replies_count,retweets_count,tweet,username,video,Date,...,Poll,Start Date,End Date,Sample,MoE,Biden (D),Trump (R),Spread,Victory Spread,Target
0,501,2020-02-11 17:20:08,2020-02-11,4374,464,2189,The World Health Organization has an official ...,nytimes,0,1581379200000000000,...,QuinnipiacQuinnipiac,2/5/20,2/9/20,1519 RV,2.5,50.0,43.0,Biden +7,7.0,Definitely Biden
1,7317,2020-02-14 02:13:11,2020-02-13,0,0,0,Leading health experts warned on Feb. 12 that ...,tamiann02,0,1581552000000000000,...,QuinnipiacQuinnipiac,2/5/20,2/9/20,1519 RV,2.5,50.0,43.0,Biden +7,7.0,Definitely Biden
2,7318,2020-02-13 21:28:24,2020-02-13,0,0,0,Leading health experts warned on Feb. 12 that ...,1val1richy,0,1581552000000000000,...,QuinnipiacQuinnipiac,2/5/20,2/9/20,1519 RV,2.5,50.0,43.0,Biden +7,7.0,Definitely Biden
3,7319,2020-02-13 14:01:49,2020-02-13,0,0,0,"‚ÄúSo far, about 82 percent of the [covid-19 v...",kolsaw,0,1581552000000000000,...,QuinnipiacQuinnipiac,2/5/20,2/9/20,1519 RV,2.5,50.0,43.0,Biden +7,7.0,Definitely Biden
4,7316,2020-02-15 23:04:26,2020-02-15,0,0,0,CDC director Dr. Robert Redfield says will bec...,_watch_observe_,0,1581724800000000000,...,NPR/PBS/MaristNPR/PBS,2/13/20,2/16/20,1164 RV,3.7,50.0,44.0,Biden +6,6.0,Definitely Biden
5,1978,2020-02-16 19:56:00,2020-02-16,5,0,0,Americans should question the lack of informat...,shima3t,0,1581811200000000000,...,ABC News/Wash PostABC/WP,2/14/20,2/17/20,913 RV,4,52.0,45.0,Biden +7,7.0,Definitely Biden
6,1979,2020-02-16 19:56:00,2020-02-16,5,0,0,Americans should question the lack of informat...,shima3t,0,1581811200000000000,...,ABC News/Wash PostABC/WP,2/14/20,2/17/20,913 RV,4,52.0,45.0,Biden +7,7.0,Definitely Biden
7,7315,2020-02-17 18:02:26,2020-02-17,2,1,0,Several new confirmed cases of COVID-19 Corona...,narniaexpert,0,1581897600000000000,...,ABC News/Wash PostABC/WP,2/14/20,2/17/20,913 RV,4,52.0,45.0,Biden +7,7.0,Definitely Biden
8,582,2020-02-20 01:37:02,2020-02-19,88,5,47,How epidemics like COVID-19 end (and how to en...,washingtonpost,0,1582070400000000000,...,EmersonEmerson,2/16/20,2/18/20,1250 RV,2.7,48.0,52.0,Trump +4,-4.0,Definitely Trump
9,572,2020-02-20 01:37:02,2020-02-19,89,5,47,How epidemics like COVID-19 end (and how to en...,washingtonpost,0,1582070400000000000,...,EmersonEmerson,2/16/20,2/18/20,1250 RV,2.7,48.0,52.0,Trump +4,-4.0,Definitely Trump


## Final DF Housekeeping

In [71]:
# couldn't find any value from these columns; assigning the result so the drop persists
Master_Tweet_df = Master_Tweet_df.drop(['Unnamed: 0','14 days', 'Cases_y', 'Deaths_y', '28 days', 'Cases',
       'Deaths','Date','Victory Spread','Spread'],axis=1)
Master_Tweet_df.head(10)

Unnamed: 0,created_at,date,likes_count,replies_count,retweets_count,tweet,username,video,Cases_x,Deaths_x,...,stopwords,Sentiment,Poll,Start Date,End Date,Sample,MoE,Biden (D),Trump (R),Target
0,2020-02-11 17:20:08,2020-02-11,4374,464,2189,The World Health Organization has an official ...,nytimes,0,12,0,...,"['has', 'an', 'for', 'the', 'by', 'no', 'to', ...",0.000000,QuinnipiacQuinnipiac,2/5/20,2/9/20,1519 RV,2.5,50.0,43.0,Definitely Biden
1,2020-02-14 02:13:11,2020-02-13,0,0,0,Leading health experts warned on Feb. 12 that ...,tamiann02,0,13,0,...,"['on', 'that', 'the', 'of', 'of', 'the', 'or',...",0.375000,QuinnipiacQuinnipiac,2/5/20,2/9/20,1519 RV,2.5,50.0,43.0,Definitely Biden
2,2020-02-13 21:28:24,2020-02-13,0,0,0,Leading health experts warned on Feb. 12 that ...,1val1richy,0,13,0,...,"['on', 'that', 'the', 'of', 'of', 'the', 'or',...",0.375000,QuinnipiacQuinnipiac,2/5/20,2/9/20,1519 RV,2.5,50.0,43.0,Definitely Biden
3,2020-02-13 14:01:49,2020-02-13,0,0,0,"‚ÄúSo far, about 82 percent of the [covid-19 v...",kolsaw,0,13,0,...,"['about', 'of', 'the', 'all', 'in', 'the', 'ha...",0.048611,QuinnipiacQuinnipiac,2/5/20,2/9/20,1519 RV,2.5,50.0,43.0,Definitely Biden
4,2020-02-15 23:04:26,2020-02-15,0,0,0,CDC director Dr. Robert Redfield says will bec...,_watch_observe_,0,13,0,...,"['will', 'the', 'or', 'it', 'is', 'to', 'how',...",0.216667,NPR/PBS/MaristNPR/PBS,2/13/20,2/16/20,1164 RV,3.7,50.0,44.0,Definitely Biden
5,2020-02-16 19:56:00,2020-02-16,5,0,0,Americans should question the lack of informat...,shima3t,0,13,0,...,"['should', 'the', 'of', 'in', 'their', 'the', ...",0.000000,ABC News/Wash PostABC/WP,2/14/20,2/17/20,913 RV,4,52.0,45.0,Definitely Biden
6,2020-02-16 19:56:00,2020-02-16,5,0,0,Americans should question the lack of informat...,shima3t,0,13,0,...,"['should', 'the', 'of', 'in', 'their', 'the', ...",0.000000,ABC News/Wash PostABC/WP,2/14/20,2/17/20,913 RV,4,52.0,45.0,Definitely Biden
7,2020-02-17 18:02:26,2020-02-17,2,1,0,Several new confirmed cases of COVID-19 Corona...,narniaexpert,0,13,0,...,"['of', 'in', 'and', 'the', 'of', 'in', 'the', ...",0.178788,ABC News/Wash PostABC/WP,2/14/20,2/17/20,913 RV,4,52.0,45.0,Definitely Biden
8,2020-02-20 01:37:02,2020-02-19,88,5,47,How epidemics like COVID-19 end (and how to en...,washingtonpost,0,13,0,...,"['how', 'to', 'them']",0.000000,EmersonEmerson,2/16/20,2/18/20,1250 RV,2.7,48.0,52.0,Definitely Trump
9,2020-02-20 01:37:02,2020-02-19,89,5,47,How epidemics like COVID-19 end (and how to en...,washingtonpost,0,13,0,...,"['how', 'to', 'them']",0.000000,EmersonEmerson,2/16/20,2/18/20,1250 RV,2.7,48.0,52.0,Definitely Trump


## Saving to CSV

In [73]:
#saving df to csv for upload in other notebooks
Master_Tweet_df.to_csv('data/Master_Tweet_notebook.csv')

In [74]:
len(Master_Tweet_df['Target'])

8514
