In [1]:
import glob
import pandas as pd

Let's get the list of CSV files ...

In [2]:
file_list = sorted(glob.glob('./data/TWITTER_CSV_EXPORTS/*.csv'))

In [3]:
file_list

['./data/TWITTER_CSV_EXPORTS/AZ-01-Shedd - shedd_tweets.csv',
 './data/TWITTER_CSV_EXPORTS/AZ-02 - stauz_tweets.csv',
 './data/TWITTER_CSV_EXPORTS/AZ-02.2 - Sheet1.csv',
 './data/TWITTER_CSV_EXPORTS/CA-25 - garcia_tweets.csv',
 './data/TWITTER_CSV_EXPORTS/CA-39-Kim - kim_tweets.csv',
 './data/TWITTER_CSV_EXPORTS/CA-45 - Sheet1.csv',
 './data/TWITTER_CSV_EXPORTS/CA-48-Steel - Sheet1.csv',
 './data/TWITTER_CSV_EXPORTS/CO-06-House - Sheet1.csv',
 './data/TWITTER_CSV_EXPORTS/IA-01-Hinson - Sheet1.csv',
 './data/TWITTER_CSV_EXPORTS/IA-02-Miller-Meeks - Sheet1.csv',
 './data/TWITTER_CSV_EXPORTS/IA-03-Young - Sheet1.csv',
 './data/TWITTER_CSV_EXPORTS/IL-06-Ives - Sheet1.csv',
 './data/TWITTER_CSV_EXPORTS/IL-14-Oberweis - Sheet1.csv',
 './data/TWITTER_CSV_EXPORTS/IL-17-Joy-King - Sheet1.csv',
 './data/TWITTER_CSV_EXPORTS/KS-03-Adkins - Sheet1.csv',
 './data/TWITTER_CSV_EXPORTS/ME-02-Bennett - Sheet1.csv',
 './data/TWITTER_CSV_EXPORTS/MI-11-Bentivoglio - Sheet1.csv',
 './data/TWITTER_CSV_EXPORT

Now we need to concatenate all of those. Here's a nifty piece of code for that, which I got from [this Medium post](https://medium.com/@kadek/elegantly-reading-multiple-csvs-into-pandas-e1a76843b688).

If you look closely, you can see that it's looping through every file (called 'f') in the file_list, reading the file as a CSV, and concatenating the whole thing.

In [4]:
df = pd.concat([pd.read_csv(f) for f in file_list], ignore_index = True)
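If the one-liner feels too compact, here is a rough long-hand sketch of the same idea. It uses two tiny made-up "files" held in memory (the `fake_files` list is an assumption for illustration, not our real exports), so you can run it anywhere. Notice that when the files don't share the same columns, pandas fills in the gaps with blanks.

```python
import io
import pandas as pd

# Two tiny in-memory "files" standing in for the real CSV exports (made-up data).
fake_files = [
    io.StringIO("content,covid_final\nfirst tweet,0\nsecond tweet,1\n"),
    io.StringIO("content,other_final\nthird tweet,0\n"),
]

# The long-hand version of the list comprehension: read each file in a loop.
frames = []
for f in fake_files:
    frames.append(pd.read_csv(f))

# Stack the frames; ignore_index=True renumbers the rows 0, 1, 2, ...
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

The combined table has three rows and three columns, with blanks (`NaN`) wherever a file didn't have that column.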

In [5]:
df.describe()

Unnamed: 0,date,handle,content,link,covid_1,covid_2,covid_3,covid_final,other_1,other_2,other_3,other_final,url,twitter handle,district,incumbent,incumbent party,opponent,opponent party,opponent twitter handle
count,3993,1124,3993,611,3500,3403,1409,3379,3234,3232,1400,3252,219,77,94,94,94,94,86,48
unique,3899,28,3987,611,6,5,5,5,2,2,4,4,219,4,10,4,9,10,1,1
top,"April 05, 2020 at 04:51PM",@hiral4congress,RT @WhiteHouse: President @realDonaldTrump jus...,http://twitter.com/PathForFreedom/status/12479...,False,False,False,0,False,False,False,0,http://twitter.com/OlaForCongress/status/12532...,@hinsonashley,TX-07,Lizzie Fletcher,D,Wesley Hunt,R,@WesleyHuntTX
freq,7,105,2,1,3175,3187,1309,3134,3046,3031,1330,2962,1,74,48,48,86,48,86,48


Ooooh. That's a lot of extra columns. That's okay. We just need three of them:

In [46]:
master = df[['content','covid_final','other_final']].copy()

In [47]:
master.describe()

Unnamed: 0,content,covid_final,other_final
count,3993,3379,3252
unique,3987,5,4
top,RT @WhiteHouse: President @realDonaldTrump jus...,0,0
freq,2,3134,2962


In [48]:
master.head()

Unnamed: 0,content,covid_final,other_final
0,Thank you to all those fighting COVID-19 on th...,0,0
1,The Arizona Department of Health Services has ...,0,0
2,The Arizona Department of Health Services has ...,0,0
3,RT @animalag: FFA and 4-H members are lending ...,0,0
4,"As of today, I am officially on the ballot for...",0,0


Nice. We have almost 4,000 tweets. Let's store all of the tweet texts (and just the tweet texts) in a corpus file.

In [49]:
corpus = master['content'].copy()

In [51]:
corpus.to_csv('data/corpus.csv', index=False, header=True)

In [52]:
master.describe()

Unnamed: 0,content,covid_final,other_final
count,3993,3379,3252
unique,3987,5,4
top,RT @WhiteHouse: President @realDonaldTrump jus...,0,0
freq,2,3134,2962


Something curious about that description ☝🏻. See the "unique" row? Says there are 5 and 4 unique values for `covid_final` and `other_final`. That's suspicious. Let's see what's in those columns:

In [53]:
master.covid_final.unique()

array([0, 1, nan, '0', '#DIV/0!', '1'], dtype=object)

In [54]:
master.other_final.unique()

array([0, nan, 1.0, '0', '#DIV/0!'], dtype=object)

Ah, yes, we need to clean those up. We have invalid values and also numbers represented as strings (that's what the quotation marks indicate).
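If you'd rather not squint at quotation marks, one way to see the mix is to count the Python type of every value. This is just a sketch on a made-up Series (`scores` is a stand-in with the same kinds of values as `covid_final`, not our actual data).

```python
import numpy as np
import pandas as pd

# Made-up column with the same kinds of values as covid_final.
scores = pd.Series([0, 1, np.nan, '0', '#DIV/0!', '1'], dtype=object)

# Replace each value with the name of its Python type, then count the names.
type_counts = scores.map(lambda v: type(v).__name__).value_counts()
print(type_counts)
```

Here the strings show up as `str`, the real numbers as `int`, and the blank as `float` (because `nan` is technically a float).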

First, let's get rid of any row that has a `nan`, which means "not a number" and really indicates missing data. These are boxes we didn't fill out.

In [55]:
## Drop rows with any blanks (nan)
master.dropna(inplace=True)
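By default `dropna` throws out a row if *any* column in it is blank. If that ever feels too aggressive, the `subset` argument limits the check to the columns you name. A small sketch on a made-up table (`toy` is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Made-up data: one complete row, two rows with a blank in different columns.
toy = pd.DataFrame({
    'content': ['tweet a', 'tweet b', 'tweet c'],
    'covid_final': [0, np.nan, 1],
    'other_final': [0, 0, np.nan],
})

# Default behavior: drop a row if ANY column is blank.
dropped_any = toy.dropna()

# Narrower: only drop rows where covid_final is blank.
dropped_covid = toy.dropna(subset=['covid_final'])

print(len(toy), len(dropped_any), len(dropped_covid))
```

For our scoring we want both columns filled in, so the default is the right call here.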

In [56]:
master.describe()

Unnamed: 0,content,covid_final,other_final
count,3216,3216,3216
unique,3211,5,4
top,Join me in asking Congress to implement nation...,0,0
freq,2,2976,2931


We lost about 800 rows there! That's okay. We're still good. Now let's get rid of rows that contain a `#DIV/0!` error. Those errors show up because we tried to run our scoring formula against some blank data.

In [57]:
master = master[master.covid_final != '#DIV/0!']

In [58]:
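# .copy() gives us an independent table, so adding columns later won't raise a SettingWithCopyWarning
master = master[master.other_final != '#DIV/0!'].copy()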
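The two filters above can also be folded into one step. Here's a sketch of one way to do it, on a made-up table (`toy` and `has_error` are names I'm inventing for the example): check both columns for the error string at once, then keep the rows where neither has it.

```python
import pandas as pd

# Made-up data: rows a and b each have an error in one column, row c is clean.
toy = pd.DataFrame({
    'content': ['tweet a', 'tweet b', 'tweet c'],
    'covid_final': [0, '#DIV/0!', '1'],
    'other_final': ['#DIV/0!', 0, 1],
})

# True for each row where either score column holds the error string.
has_error = toy[['covid_final', 'other_final']].eq('#DIV/0!').any(axis=1)

# ~ flips True/False, so this keeps only the rows without an error.
clean = toy[~has_error].copy()
print(clean)
```

Either approach gets you the same rows; two separate lines are arguably easier to read when you're starting out.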

In [59]:
master.describe()

Unnamed: 0,content,covid_final,other_final
count,3215,3215,3215
unique,3210,4,3
top,Join me in asking Congress to implement nation...,0,0
freq,2,2976,2931


In [60]:
master.covid_final.unique()

array([0, 1, '0', '1'], dtype=object)

Nice. So now we have both numbers and strings -- so the number 1 and the "word" '1'. Let's make a new column called "covid" that will be `True` if "covid_final" is in the group `[1, '1']`:

In [61]:
master['covid'] = master['covid_final'].isin([1 , '1'])

And the same for "other" and "other_final":

In [62]:
master['other'] = master['other_final'].isin([1, '1'])
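Another possible route, sketched here on a made-up Series rather than our real data, is `pd.to_numeric` with `errors='coerce'`. It converts strings like `'1'` to real numbers and turns anything it can't parse into `nan`. One difference to keep in mind: in this version blanks and errors end up as `False` instead of being dropped, so it is not a drop-in replacement for the cleanup steps above.

```python
import numpy as np
import pandas as pd

# Made-up column with the same kinds of values we saw in covid_final.
raw = pd.Series([0, 1, np.nan, '0', '#DIV/0!', '1'], dtype=object)

# errors='coerce' turns anything that can't be parsed as a number into NaN.
numeric = pd.to_numeric(raw, errors='coerce')

# Compare to 1 to get booleans; NaN compares as False.
flag = numeric.eq(1)
print(flag.tolist())
```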

In [63]:
master.describe()

Unnamed: 0,content,covid_final,other_final,covid,other
count,3215,3215,3215,3215,3215
unique,3210,4,3,2,2
top,Join me in asking Congress to implement nation...,0,0,False,False
freq,2,2976,2931,3034,2990


In [64]:
master

Unnamed: 0,content,covid_final,other_final,covid,other
0,Thank you to all those fighting COVID-19 on th...,0,0,False,False
1,The Arizona Department of Health Services has ...,0,0,False,False
2,The Arizona Department of Health Services has ...,0,0,False,False
3,RT @animalag: FFA and 4-H members are lending ...,0,0,False,False
4,"As of today, I am officially on the ballot for...",0,0,False,False
...,...,...,...,...,...
3997,We NEED supplies: Day 1 of our #coronavirus wo...,0,0,False,False
3998,"In #TX10, people are missing paychecks and won...",1,1,True,True
3999,"#TX10 communities: please, listen to the advic...",0,0,False,False
4000,CD23: LET’S DEBATE THE ISSUES!\n\nThank You G...,0,0,False,False


We can double-check to make sure all we have are `True`s and `False`s with the next line.

Note: `master.other` is just a cleaner way of writing `master['other']`. They're the same thing.

In [41]:
master.other.unique(), master.covid.unique()

(array([False,  True]), array([False,  True]))
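One caveat on the dot shortcut from the note above: it only works when the column name is a plain word. Our original data has columns like `twitter handle` with a space in the name, and those need brackets. The sketch below uses a made-up table (`toy`, with invented handles) to show both.

```python
import pandas as pd

toy = pd.DataFrame({
    'other': [False, True],
    'twitter handle': ['@example_a', '@example_b'],  # made-up handles
})

# Dot access and bracket access return the same column.
same = toy.other.equals(toy['other'])

# A name with a space only works with brackets.
handles = toy['twitter handle']

# To create a NEW column, always use brackets (dot assignment won't add one).
toy['either'] = toy['other']
print(same, list(toy.columns))
```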

Now we're finally to the point we were trying to get to Wednesday :-)

In pandas, if you have a _boolean_ -- that is, a variable that's True or False -- you can do a lot of fun things!

The English version of the next line is: "For every row, show True if `master.covid` is True OR `master.other` is True." The pandas symbol for "or" is `|`.

In [42]:
(master.covid | master.other)

0       False
1       False
2       False
3       False
4       False
        ...  
3997    False
3998     True
3999    False
4000    False
4001     True
Length: 3215, dtype: bool

Pretty slick, right? Scroll up to the last few rows of the table above to see that it worked.

Let's add a column called `either` that contains those values.

In [43]:
master['either'] = (master.covid | master.other)
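While we're here: `|` has two siblings. `&` means "and", and `~` means "not". Here's a small sketch on two made-up boolean Series (not our real columns) that also shows a handy trick for counting how many rows are `True`.

```python
import pandas as pd

# Made-up boolean columns, four rows each.
covid = pd.Series([False, True, True, False])
other = pd.Series([False, True, False, True])

either = covid | other      # True if at least one is True
both = covid & other        # True only if both are True
neither = ~(covid | other)  # ~ flips True and False

# True counts as 1, so sum() counts the True rows.
print(either.sum(), both.sum(), neither.sum())
```

The parentheses matter: these symbols bind tighter than comparisons like `==`, so wrap each condition in parentheses when you combine them.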

In [44]:
master

Unnamed: 0,content,covid_final,other_final,covid,other,either
0,Thank you to all those fighting COVID-19 on th...,0,0,False,False,False
1,The Arizona Department of Health Services has ...,0,0,False,False,False
2,The Arizona Department of Health Services has ...,0,0,False,False,False
3,RT @animalag: FFA and 4-H members are lending ...,0,0,False,False,False
4,"As of today, I am officially on the ballot for...",0,0,False,False,False
...,...,...,...,...,...,...
3997,We NEED supplies: Day 1 of our #coronavirus wo...,0,0,False,False,False
3998,"In #TX10, people are missing paychecks and won...",1,1,True,True,True
3999,"#TX10 communities: please, listen to the advic...",0,0,False,False,False
4000,CD23: LET’S DEBATE THE ISSUES!\n\nThank You G...,0,0,False,False,False


And we'll save that to a CSV called "scored.csv."

In [45]:
master.to_csv('data/scored.csv', index=False, header=True)
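It can be worth reading a saved CSV back in to make sure the `True`/`False` columns survive the trip. Here's a sketch of that check on a made-up table, written to a throwaway folder so nothing in `data/` gets touched (the file name `scored_check.csv` is hypothetical).

```python
import os
import tempfile
import pandas as pd

# Made-up data shaped like a slice of our scored table.
toy = pd.DataFrame({
    'content': ['tweet a', 'tweet b'],
    'covid': [False, True],
    'other': [True, True],
})

# Write to a temporary folder, then read the file straight back in.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'scored_check.csv')  # hypothetical file name
    toy.to_csv(path, index=False, header=True)
    reloaded = pd.read_csv(path)

# The boolean columns should come back as bool, not as text.
print(reloaded.dtypes)
```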

Now we have our corpus CSV and our scored CSV and we're ready for the machine learning part. 