# Overview

This notebook assumes you have already collected and scored MOC tweets. It creates a dataset for use in R to analyze the patterns of polarization over time. You will do some parsing on an AWS server and some locally before ultimately making a CSV file that you can open and analyze in R.

LH Note: on my computer, the Git repository and the data live in different places, so you'll see notes about moving files or changing directories. I haven't found a good way to keep the two locations in sync, so for now paths are hard-coded or there's a note about where to find a file.

# Get Data

2016 election data is on an AWS server under ```/data/purpletag```. 

To log in:

```ssh -i ~/.ssh/carolgrrr.pem ubuntu@purpletag.casmlab.org```

The data is large (> 4GB), so it is best to run Jupyter notebooks on the server to parse it. The resulting CSV files can then be used locally.

You can run a notebook on the server and use your local browser with these two commands:

* ```ssh -L 8080:localhost:8888 -i ~/.ssh/carolgrrr.pem ubuntu@purpletag.casmlab.org```
* ```nohup jupyter notebook --no-browser > log.txt 2>&1 &```

Then access ```http://localhost:8080``` in your browser.

## On server: Parsing from ```scores``` files to CSV

This section assumes you have already run purpletag's ```collect``` and ```score``` functions, so the Twitter data you want has been collected in JSON format and parsed into score files.

In [1]:
# Based on https://stackoverflow.com/questions/26415906/read-multiple-txt-files-into-pandas-dataframe-with-filename-as-column-header
import pandas as pd
import os
import glob
import yaml

def build_file_list(directory, extension):
    '''
    Return the names of the files in `directory` that match `extension`.

    args: 
        directory - full path to where the files are
            ex: /data/purpletag/scores
        extension - glob pattern that tells us which files to include in the list
            ex: *.1.moc.scores # using 1-day purpletag MOC scores

    Note: this changes the working directory to `directory`; build_df relies
    on that because the returned file names are relative.
    '''
    
    # Step 1: move to the target directory so the pattern and file names are relative to it
    os.chdir( directory )

    # Step 2: Build up list of files (file names with extension):
    fileList = glob.glob(extension)
        
    return fileList

def build_df(fileList, outfile, score_type):
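    '''
    Read each score file into a DataFrame, join them column-wise
    (one column per file), and pickle the result.

    args:
        fileList - list of files to include, usually output from build_file_list
        outfile - full path to where to put the df
            ex: /data/purpletag/mocs_by_date.pkl
        score_type - 'moc' for MOC score files; anything else is treated as tag score files
    '''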
    # Step 3: Build up DataFrame:
    # Based on https://stackoverflow.com/questions/35717706/python-how-to-turn-a-dictionary-of-dataframes-into-one-big-dataframe-with-colum
    d = {} # dictionary to hold multiple dfs

    for filename in fileList:
        df1 = pd.read_csv(filename, header=None, sep=' ', index_col=0)
        if score_type == 'moc': # moc score files: drop the 13-character '.1.moc.scores' suffix
            d[filename[:-13]] = df1
        else: # tag score files: drop the 9-character '.1.scores' suffix
            d[filename[:-9]] = df1

    df = pd.concat(d, axis=1)
    df.columns = df.columns.droplevel(-1) 

    df.to_pickle(outfile)
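
To see what ```build_df``` produces without touching the server, here is a minimal sketch on made-up data. The file names and scores are hypothetical; I'm assuming each score file is named by date followed by the suffix (which is what the ```[:-13]``` slice implies) and holds one ```handle score``` pair per line.

```python
import io
import pandas as pd

# Hypothetical contents of two one-day MOC score files: "<handle> <score>" per line.
files = {
    "2016-09-06.1.moc.scores": "repa 1.5\nrepb -2.0\n",
    "2016-09-07.1.moc.scores": "repa 0.5\nrepc 3.0\n",
}

frames = {}
for filename, text in files.items():
    day = filename[:-len(".1.moc.scores")]  # same as filename[:-13]
    frames[day] = pd.read_csv(io.StringIO(text), header=None, sep=" ", index_col=0)

# Concatenating a dict along axis=1 gives (date, original column) column pairs;
# dropping the inner level leaves one column per date and one row per handle.
wide = pd.concat(frames, axis=1)
wide.columns = wide.columns.droplevel(-1)
print(wide)
```

Handles that have no score on a given day end up as NaN in that day's column, which is why the real dataframe is so sparse.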

In [2]:
# set some variables that get used a few times below
#git_path = "/home/ubuntu/purpletag" # where is the GitHub repo that holds this notebook
git_path = "/Users/libbyh/Documents/git/libbyh/purpletag/"
git_data_path = git_path + "2016-election-study/data-files/"
pt_path = "/data/purpletag" # where are the scores and other purpletag-generated data

In [None]:
fileList = build_file_list('/data/purpletag/scores', '*.1.moc.scores')
build_df(fileList, '/data/purpletag/mocs_by_date_test.pkl', 'moc')

Copy the file from the AWS server to your local machine if you want to work locally. For example, to copy the file ```mocs_by_date.pkl``` from the server to my local repo, I use:

```scp -i ~/.ssh/carolgrrr.pem ubuntu@purpletag.casmlab.org:/data/purpletag/mocs_by_date.pkl ~/Documents/git/casmlab/purpletag/files/```

## Locally: Prepping for stats

We now have a pickled dataframe with one row per handle and one column per date. We keep only the nine weeks between Labor Day and Election Day (2016-09-06 through 2016-11-07) and compute weekly averages.

In [3]:
import pandas as pd

df = pd.read_pickle(git_data_path + 'mocs_by_date.pkl')
df.head()

Unnamed: 0,2015-11-10,2015-11-11,2015-11-12,2015-11-13,2015-11-14,2015-11-15,2015-11-16,2015-11-17,2015-11-18,2015-11-19,...,2016-10-30,2016-10-31,2016-11-01,2016-11-02,2016-11-03,2016-11-04,2016-11-05,2016-11-06,2016-11-07,2016-11-08
austinscottga08,,2.40933,,,,,,,,,...,,,,,,,,,,
benniegthompson,,,,,,,,3.85585,,,...,,,,,,,,,,
bettymccollum04,,,,,,,,,,,...,,,-90.838,-61.3435,-40.482,,-33.4513,,,
billpascrell,,,-1.16875,-0.916501,-0.972477,,,,,,...,,,,-17.9542,,,,,,
boblatta,,,0.100723,,,1.70286,,1.56897,,-0.031978,...,,,,,,,,,,


In [4]:
def weekly_avg(df):
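    '''
    Given a df from build_df, add columns week1..week9 holding each row's
    mean score over the nine 7-day windows from 2016-09-06 through 2016-11-07.
    Days with no score (NaN) are skipped by mean(); the daily columns are kept.
    '''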
    week1_dates = ['2016-09-06','2016-09-07','2016-09-08','2016-09-09','2016-09-10','2016-09-11','2016-09-12']
    week2_dates = ['2016-09-13','2016-09-14','2016-09-15','2016-09-16','2016-09-17','2016-09-18','2016-09-19']
    week3_dates = ['2016-09-20','2016-09-21','2016-09-22','2016-09-23','2016-09-24','2016-09-25','2016-09-26']
    week4_dates = ['2016-09-27','2016-09-28','2016-09-29','2016-09-30','2016-10-01','2016-10-02','2016-10-03']
    week5_dates = ['2016-10-04','2016-10-05','2016-10-06','2016-10-07','2016-10-08','2016-10-09','2016-10-10']
    week6_dates = ['2016-10-11','2016-10-12','2016-10-13','2016-10-14','2016-10-15','2016-10-16','2016-10-17']
    week7_dates = ['2016-10-18','2016-10-19','2016-10-20','2016-10-21','2016-10-22','2016-10-23','2016-10-24']
    week8_dates = ['2016-10-25','2016-10-26','2016-10-27','2016-10-28','2016-10-29','2016-10-30','2016-10-31']
    week9_dates = ['2016-11-01','2016-11-02','2016-11-03','2016-11-04','2016-11-05','2016-11-06','2016-11-07']

    df['week1'] = df[week1_dates].mean(axis=1)
    df['week2'] = df[week2_dates].mean(axis=1)
    df['week3'] = df[week3_dates].mean(axis=1)
    df['week4'] = df[week4_dates].mean(axis=1)
    df['week5'] = df[week5_dates].mean(axis=1)
    df['week6'] = df[week6_dates].mean(axis=1)
    df['week7'] = df[week7_dates].mean(axis=1)
    df['week8'] = df[week8_dates].mean(axis=1)
    df['week9'] = df[week9_dates].mean(axis=1)
    
    return df

df = weekly_avg(df)
df.head()

Unnamed: 0,2015-11-10,2015-11-11,2015-11-12,2015-11-13,2015-11-14,2015-11-15,2015-11-16,2015-11-17,2015-11-18,2015-11-19,...,2016-11-08,week1,week2,week3,week4,week5,week6,week7,week8,week9
austinscottga08,,2.40933,,,,,,,,,...,,,4.932035,,1.06429,3.6757,6.0426,1.56161,,
benniegthompson,,,,,,,,3.85585,,,...,,,-1.21138,,,,-1.82301,,,
bettymccollum04,,,,,,,,,,,...,,-56.669016,-119.88476,-67.1729,-59.540265,-27.995798,-33.955595,-4.879673,-33.080633,-56.5287
billpascrell,,,-1.16875,-0.916501,-0.972477,,,,,,...,,-2.87143,-2.62381,-2.353348,-2.337897,,-1.02655,,-1.33333,-17.9542
boblatta,,,0.100723,,,1.70286,,1.56897,,-0.031978,...,,1.37997,9.673527,,,,0.974138,,,
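
The nine hard-coded date lists could also be generated. This is only a sketch of an alternative, not what was run above; the helper name ```week_columns``` is made up, and it assumes the date columns are strings formatted as ```YYYY-MM-DD```, as they are in the output above.

```python
import pandas as pd

def week_columns(start="2016-09-06", n_weeks=9):
    """Return {'week1': [7 date strings], ...} for consecutive 7-day windows."""
    days = pd.date_range(start, periods=7 * n_weeks, freq="D").strftime("%Y-%m-%d")
    return {"week%d" % (i + 1): list(days[7 * i: 7 * (i + 1)]) for i in range(n_weeks)}

weeks = week_columns()
print(weeks["week1"][0], weeks["week9"][-1])  # 2016-09-06 2016-11-07

# Toy frame with one column per date, shaped like the output of build_df.
toy = pd.DataFrame(1.0, index=["repa"], columns=[d for w in weeks.values() for d in w])
for name, cols in weeks.items():
    toy[name] = toy[cols].mean(axis=1)
```

If a date column were missing from the real data, ```df[cols]``` would raise a ```KeyError```, the same as with the hard-coded lists.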


In [5]:
weekly_df = df[['week1','week2','week3','week4','week5','week6','week7','week8','week9']]
weekly_df["handle"] = weekly_df.index.str.lower() # this line throws the SettingWithCopyWarning which I'm ignoring
weekly_df.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


Unnamed: 0,week1,week2,week3,week4,week5,week6,week7,week8,week9,handle
austinscottga08,,4.932035,,1.06429,3.6757,6.0426,1.56161,,,austinscottga08
benniegthompson,,-1.21138,,,,-1.82301,,,,benniegthompson
bettymccollum04,-56.669016,-119.88476,-67.1729,-59.540265,-27.995798,-33.955595,-4.879673,-33.080633,-56.5287,bettymccollum04
billpascrell,-2.87143,-2.62381,-2.353348,-2.337897,,-1.02655,,-1.33333,-17.9542,billpascrell
boblatta,1.37997,9.673527,,,,0.974138,,,,boblatta
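
The warning above comes from adding a column to a selection of another dataframe. One common way to avoid it is to take an explicit copy first. A minimal sketch on made-up data (the handles and scores are hypothetical):

```python
import pandas as pd

scores = pd.DataFrame(
    {"2016-09-06": [1.0, 2.0], "week1": [1.0, 2.0]},
    index=["RepA", "RepB"],
)

# .copy() makes the selection an independent frame, so adding a column
# is unambiguous and the original frame is left untouched.
weekly = scores[["week1"]].copy()
weekly["handle"] = weekly.index.str.lower()
print(weekly)
```

The result is the same as in the cell above; the only difference is that pandas no longer has to guess whether the assignment was meant for the original dataframe.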


In [6]:
weekly_df.sort_values(by = 'week2', ascending = True).head()

Unnamed: 0,week1,week2,week3,week4,week5,week6,week7,week8,week9,handle
replawrence,-28.383236,-308.088,-18.402877,-32.248687,0.698665,-43.500575,,,-0.811634,replawrence
repdennyheck,,-300.145,-10.527903,-17.045025,,-18.14737,-16.686711,-40.5577,-62.14465,repdennyheck
repbobbyrush,,-270.786,-30.6797,-1.161955,,,,-8.6529,-1.89872,repbobbyrush
nitalowey,-1.29884,-215.463,-39.0497,-6.69917,,-11.493604,-1.08432,-3.043239,-18.510867,nitalowey
repcleaver,-6.302088,-194.11,-8.202654,-32.020135,-0.119293,0.375556,-18.248317,-3.68411,-9.614692,repcleaver


## Merge with legislator data
Now we merge our #polar-tag data with data about the legislators (e.g., party, state).

In [7]:
# get the data from Govtrack
# yaml.safe_load parses plain YAML and does not need an explicit Loader argument
with open(git_data_path + 'legislators-social-media-nov16.yaml', 'r') as f:
    df_social = pd.io.json.json_normalize(yaml.safe_load(f))

with open(git_data_path + 'legislators-current.yaml', 'r') as f:
    df_current = pd.io.json.json_normalize(yaml.safe_load(f))

# merge everything into one data frame with one row per MOC
df_meta = pd.merge(df_current, df_social, on="id.govtrack")
df_meta["handle"] = df_meta["social.twitter"].str.lower()

print("metadata length: " + str(len(df_meta)))

metadata length: 470
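
Column names like ```id.govtrack``` and ```social.twitter``` come from flattening the nested YAML records. Here is a small sketch of that step on made-up records; the names and ids are hypothetical, and I use ```pd.json_normalize```, the newer spelling of ```pd.io.json.json_normalize``` available in pandas 1.0 and later.

```python
import pandas as pd

# Hypothetical records shaped like entries in the two legislators YAML files.
current = [{"id": {"govtrack": 1, "bioguide": "X000001"},
            "name": {"official_full": "Pat Example"}}]
social = [{"id": {"govtrack": 1}, "social": {"twitter": "RepExample"}}]

# Nested keys become dotted column names, e.g. {"id": {"govtrack": 1}} -> "id.govtrack".
df_current = pd.json_normalize(current)
df_social = pd.json_normalize(social)

meta = pd.merge(df_current, df_social, on="id.govtrack")
meta["handle"] = meta["social.twitter"].str.lower()
print(meta.columns.tolist())
```

Lower-casing the Twitter handle matters because the score files use lower-case handles, and the later merge on ```handle``` is an exact string match.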


In [30]:
print("weekly_df length: " + str(len(weekly_df)))

df_merged = pd.merge(df_meta, weekly_df, on="handle", how='outer', indicator=True)
                     
print("df_merged length: " + str(len(df_merged)))

#cols_to_keep = ['id.govtrack','social.twitter','name.official_full','bio.gender','terms','week1','week2','week3','week4','week5','week6','week7','week8','week9']

df_merged = df_merged[['_merge','id.govtrack','handle','name.official_full','bio.gender','terms','week1','week2','week3','week4','week5','week6','week7','week8','week9']]
df_merged.to_csv(git_data_path + 'meta_weekly_merged.csv', encoding='utf-8')


weekly_df length: 511
df_merged length: 529


In [9]:
# see where the merge failed and probably fix those by hand
df_merge_failed = df_merged.loc[df_merged['_merge'] != "both"]
df_merge_failed.to_csv(git_data_path + 'merge_failed.csv', encoding='utf-8')
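
The merged dataframe is longer than either input because an outer merge keeps unmatched rows from both sides. A minimal sketch on made-up data shows how the ```_merge``` column separates them:

```python
import pandas as pd

meta = pd.DataFrame({"handle": ["repa", "repb"], "party": ["Democrat", "Republican"]})
weekly = pd.DataFrame({"handle": ["repb", "repc"], "week1": [1.2, -3.4]})

merged = pd.merge(meta, weekly, on="handle", how="outer", indicator=True)

# 'both' = matched; 'left_only' = metadata with no scores; 'right_only' = scores with no metadata
unmatched = merged[merged["_merge"] != "both"]
print(merged["_merge"].value_counts())
```

Assuming each handle matches at most one row on the other side, the lengths above imply 470 + 511 - 529 = 452 matched rows, with the remaining rows split between ```left_only``` and ```right_only```.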

In [51]:
# Thank you, Roger Allen, https://gist.github.com/rogerallen/1583593
us_state_abbrev = {
    'Alabama': 'AL',
    'Alaska': 'AK',
    'Arizona': 'AZ',
    'Arkansas': 'AR',
    'California': 'CA',
    'Colorado': 'CO',
    'Connecticut': 'CT',
    'Delaware': 'DE',
    'Florida': 'FL',
    'Georgia': 'GA',
    'Hawaii': 'HI',
    'Idaho': 'ID',
    'Illinois': 'IL',
    'Indiana': 'IN',
    'Iowa': 'IA',
    'Kansas': 'KS',
    'Kentucky': 'KY',
    'Louisiana': 'LA',
    'Maine': 'ME',
    'Maryland': 'MD',
    'Massachusetts': 'MA',
    'Michigan': 'MI',
    'Minnesota': 'MN',
    'Mississippi': 'MS',
    'Missouri': 'MO',
    'Montana': 'MT',
    'Nebraska': 'NE',
    'Nevada': 'NV',
    'New Hampshire': 'NH',
    'New Jersey': 'NJ',
    'New Mexico': 'NM',
    'New York': 'NY',
    'North Carolina': 'NC',
    'North Dakota': 'ND',
    'Ohio': 'OH',
    'Oklahoma': 'OK',
    'Oregon': 'OR',
    'Pennsylvania': 'PA',
    'Rhode Island': 'RI',
    'South Carolina': 'SC',
    'South Dakota': 'SD',
    'Tennessee': 'TN',
    'Texas': 'TX',
    'Utah': 'UT',
    'Vermont': 'VT',
    'Virginia': 'VA',
    'Washington': 'WA',
    'West Virginia': 'WV',
    'Wisconsin': 'WI',
    'Wyoming': 'WY',
}

In [52]:
# get bioguide ids
bg_path = git_data_path + 'bioguide_ids.tsv'
bg = pd.read_csv(bg_path, sep='\t', header=0, index_col=0) # DataFrame.from_csv was removed in pandas 1.0
bg["Name"] = bg["Name"].str.strip()
bg[['LastName', 'FirstName']] = bg['Name'].str.split(',', n=1, expand=True)
bg['PartyState'] = bg['PartyState'].str.replace(")", "", regex=False)
bg[['Party', 'State']] = bg['PartyState'].str.split(' - ', n=1, expand=True)
bg['State_Abbrev'] = bg.State.map(us_state_abbrev)
bg = bg.rename(index=str, columns={"Member ID": "bioguide"})

bg.head()

Member,Name,PartyState,bioguide,LastName,FirstName,Party,State,State_Abbrev
"Abdnor, James (Republican - South Dakota)","Abdnor, James",Republican - South Dakota,A000009,Abdnor,James,Republican,South Dakota,SD
"Abercrombie, Neil (Democratic - Hawaii)","Abercrombie, Neil",Democratic - Hawaii,A000014,Abercrombie,Neil,Democratic,Hawaii,HI
"Abourezk, James (Democratic - South Dakota)","Abourezk, James",Democratic - South Dakota,A000017,Abourezk,James,Democratic,South Dakota,SD
"Abraham, Ralph Lee (Republican - Louisiana)","Abraham, Ralph Lee",Republican - Louisiana,A000374,Abraham,Ralph Lee,Republican,Louisiana,LA
"Abraham, Spencer (Republican - Michigan)","Abraham, Spencer",Republican - Michigan,A000355,Abraham,Spencer,Republican,Michigan,MI


In [53]:
# get Ballotpedia House results data
bp_house = pd.read_csv(git_data_path + 'house_results.csv', sep=",", header=0, index_col=0) # DataFrame.from_csv was removed in pandas 1.0
bp_house.head()

district,inc_name,inc_party,inc_ran,inc_won,had_challenger,winner_votes,runnerup_votes,winner_party,runnerup_party,Clinton 2016,Trump 2016,Obama 2012,Romney 2012,Obama 2008,McCain 2008
AK-AL,"Young, Don",R,1,1,1,155088,111019,R,D,37.6,52.8,41.2,55.3,38.1,59.7
AL-01,"Byrne, Bradley",R,1,1,0,208083,7810,R,write in,34.1,63.5,37.4,61.8,38.5,60.9
AL-02,"Roby, Martha",R,1,1,1,134886,112089,R,D,33.0,64.9,36.4,62.9,35.0,64.5
AL-03,"Rogers, Mike",R,1,1,1,192164,94549,R,D,32.3,65.3,36.8,62.3,36.6,62.6
AL-04,"Aderholt, Rob",R,1,1,0,235925,3519,R,write in,17.4,80.4,24.0,74.8,25.5,73.3


In [60]:
# get Ballotpedia Senate results data
bp_sen = pd.read_csv(git_data_path + 'senate_results.csv', sep=",", header=0, index_col=0) # DataFrame.from_csv was removed in pandas 1.0
bp_sen['State'] = bp_sen.index
bp_sen.head()

district,inc_name,inc_party,inc_ran,inc_won,had_challenger,winner_votes,runnerup_votes,winner_party,runnerup_party,State
AL,Shelby,R,1,1,1,1335104,748709,R,D,AL
AK,Murkowski,R,1,1,1,138149,36200,R,D,AK
AZ,McCain,R,1,1,1,1359267,1031245,R,D,AZ
AR,Boozman,R,1,1,1,661984,400602,R,D,AR
CA,Boxer,D,0,0,0,7542753,4701417,D,D,CA


In [66]:
# match bioguide id data to name data from Ballotpedia for the House
with_election_results = pd.merge(bg, bp_house, left_on="Name", right_on="inc_name")
with_election_results.to_csv(git_data_path + 'bioguide_house_results.csv')
with_election_results = pd.merge(bg, bp_sen, left_on=["LastName","State_Abbrev"], right_on=["inc_name","State"])
with_election_results.to_csv(git_data_path + 'bioguide_sen_results.csv')

# This automated approach to merging gets about half the House and most of the Senate results. Doing the rest by hand.

In [33]:
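# expand the list of terms into one column per term (0, 1, 2, ...)
df1 = pd.concat([df_merged.drop(['terms'], axis=1), df_merged['terms'].apply(pd.Series)], axis=1)
# expand term 0 (each member's first listed term) into its own columns, e.g. type and party;
# senators who first served in the House therefore show type 'rep' below
df2 = pd.concat([df1.drop([0], axis=1), df1[0].apply(pd.Series)], axis=1)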

keep_df = df2[['id.govtrack','handle','name.official_full','bio.gender','type','party','week1','week2','week3','week4','week5','week6','week7','week8','week9']]
keep_df

Unnamed: 0,id.govtrack,handle,name.official_full,bio.gender,type,party,week1,week2,week3,week4,week5,week6,week7,week8,week9
0,400050,sensherrodbrown,Sherrod Brown,M,rep,Democrat,-0.873145,-1.141480,-1.825096,-21.350100,-0.467001,0.422609,-16.612679,0.359653,-0.367748
1,300018,senatorcantwell,Maria Cantwell,F,rep,Democrat,-42.572318,-8.845897,-3.729938,-21.854320,-10.628532,-18.030278,-17.871626,-9.655197,-17.863474
2,400064,senatorcardin,Benjamin L. Cardin,M,rep,Democrat,-8.692816,-4.934183,-5.241676,-15.389205,-1.365323,-8.171898,-9.316775,-5.438023,-9.632692
3,300019,senatorcarper,Thomas R. Carper,M,rep,Democrat,-38.936383,0.153670,-4.047725,-10.227893,-11.687868,0.652402,-2.603721,-1.830320,-2.459966
4,412246,senbobcasey,"Robert P. Casey, Jr.",M,sen,Democrat,-19.997507,-4.658190,-1.272941,-21.918950,-1.944227,-3.823387,-12.673860,-7.019990,-36.656500
5,412248,senbobcorker,Bob Corker,M,sen,Republican,2.096852,5.911753,2.764636,4.465404,,1.819670,,1.748260,
6,300043,senfeinstein,Dianne Feinstein,F,sen,Democrat,-22.061820,-39.389700,-7.216368,-16.176455,-6.700642,-5.140945,-19.095330,,-19.184040
7,300052,senorrinhatch,Orrin G. Hatch,M,sen,Republican,5.009444,11.089169,11.336792,11.149503,6.292435,5.263352,2.093247,19.990525,2.214622
8,412242,,Amy Klobuchar,F,sen,Democrat,,,,,,,,,
9,412438,,Justin Amash,M,rep,Republican,,,,,,,,,
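
Several senators in the table above show ```rep``` as their type because column ```0``` is the first term in each member's list. If the chamber during the 2016 campaign is what we want, one option is to take the last term instead. This is a sketch on a made-up record, and it assumes the ```terms``` list is ordered oldest to newest, which the output above suggests but which I haven't verified for every member.

```python
import pandas as pd

# Hypothetical member who served in the House before the Senate.
df = pd.DataFrame({
    "handle": ["senexample"],
    "terms": [[
        {"type": "rep", "party": "Democrat", "start": "1993-01-05"},
        {"type": "sen", "party": "Democrat", "start": "2007-01-04"},
    ]],
})

# First listed term (what column 0 holds) versus most recent listed term.
first_term = df["terms"].apply(lambda terms: terms[0]).apply(pd.Series)
latest_term = df["terms"].apply(lambda terms: terms[-1]).apply(pd.Series)

print(first_term["type"].iloc[0], latest_term["type"].iloc[0])  # rep sen
```

Rows that failed the merge have no ```terms``` list at all (NaN), so the real data would need those rows dropped or skipped before indexing into the list.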


In [34]:
# melt it so each row is a person x week
df_long = pd.melt(keep_df, id_vars=['id.govtrack','handle','name.official_full','bio.gender','party','type'],
                value_vars=['week1','week2','week3','week4','week5','week6','week7','week8','week9'],
                var_name='week', value_name='avg_score')
df_long['week'] = df_long['week'].str[-1:]

df_long.rename(columns = {'type':'chamber', 'name.official_full': 'name', 'bio.gender': 'gender'}, inplace = True)

df_long.head()

Unnamed: 0,id.govtrack,handle,name,gender,party,chamber,week,avg_score
0,400050,sensherrodbrown,Sherrod Brown,M,Democrat,rep,1,-0.873145
1,300018,senatorcantwell,Maria Cantwell,F,Democrat,rep,1,-42.572318
2,400064,senatorcardin,Benjamin L. Cardin,M,Democrat,rep,1,-8.692816
3,300019,senatorcarper,Thomas R. Carper,M,Democrat,rep,1,-38.936383
4,412246,senbobcasey,"Robert P. Casey, Jr.",M,Democrat,sen,1,-19.997507
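
For reference, here is the wide-to-long reshaping on a tiny made-up dataframe. The only change from the cell above is an optional conversion of the week label to an integer, which is my assumption about what is convenient for modeling week as a number in R.

```python
import pandas as pd

wide = pd.DataFrame({
    "handle": ["repa", "repb"],
    "week1": [1.0, -2.0],
    "week2": [0.5, None],
})

# One row per handle x week; the old column names go into 'week'.
long = pd.melt(wide, id_vars=["handle"], value_vars=["week1", "week2"],
               var_name="week", value_name="avg_score")

# 'week1' -> 1. Stripping the prefix also works for two-digit weeks,
# unlike taking only the last character.
long["week"] = long["week"].str.replace("week", "", regex=False).astype(int)
print(long)
```

Missing weekly averages stay in the long dataframe as NaN rows, so each person contributes one row per week whether or not they tweeted scored hashtags that week.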


In [None]:
# make sure we have just two parties
df_long.party.unique()

In [None]:
# get an absolute value of the polar score
df_long['abs'] = df_long['avg_score'].abs()

In [None]:
df_long.to_csv('data-files/weekly_averages_long.csv')

# Now move to R for analysis

Run ```~/Documents/git/casmlab/purpletag/2016_election.R```

That R script sends its output to ```2016_election_results.txt```

In [None]:
# use a context manager so the file is closed after reading
with open('data-files/2016_election_results.txt', 'r') as results:
    print(results.read())

Based on the outlier-excluded linear mixed-effects models, it makes sense to remove RepThompson. The pattern stays the same even with RepThompson in the set, though: a negative effect for Republican and for week, and a positive effect for their interaction. ```lmm5``` is the best-fitting model.

## Changing the way we score hashtags

What if we score tags over the 63-day period and then score MOCs?

Run the following (on the server) to get new scores:

* ```purpletag parse -t 63 -d 200```
* ```purpletag score```
* ```purpletag score --counts --score-mocs```

That first command took a week because the code starts with today and works backwards 200 days, one day at a time. Each day takes over an hour. See Issue #18 about options for changing this behavior.

With the new tag measures, we can start the process over. Start at "On server: Parsing from ```scores``` files to CSV" with a new file name.

# Getting Tag Data for Paper

We need to know more about the tags people were using to make sense of the regression results. So, let's get some tag data.

In [None]:
score_files = build_file_list('/data/purpletag/scores', '*.1.scores')
build_df(score_files, '/data/purpletag/scores_by_date.pkl', 'tag')

In [None]:
import pandas as pd

df = pd.read_pickle('/data/purpletag/scores_by_date.pkl')
df.head()

In [None]:
df = weekly_avg(df)
df.head()

In [None]:
df_tags_weeks = df[['week1','week2','week3','week4','week5','week6','week7','week8','week9']]
df_tags_weeks = df_tags_weeks.dropna(how='all')
df_tags_weeks.head()
len(df_tags_weeks.index) # number of hashtags in our df

What was happening in week 2 that made Democrats so polarized that week?

In [None]:
week2 = df_tags_weeks.sort_values(by = 'week2', ascending = True)
week2.head(10)

In [None]:
week2 = df_tags_weeks.sort_values(by = 'week2', ascending = False)
week2.head(10)

In [None]:
week9_dems = df_tags_weeks.sort_values(by = 'week9', ascending = True)
week9_dems.head(10)

In [None]:
week9_reps = df_tags_weeks.sort_values(by = 'week9', ascending = False)
week9_reps.head(10)
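
The sort-then-head pattern in the cells above can also be written with ```nsmallest``` and ```nlargest```. A sketch on made-up tag scores, reading the most negative scores as the Democratic end and the most positive as the Republican end, as the variable names above do:

```python
import pandas as pd

# Hypothetical weekly tag scores; tag_c has no score in week 2.
tags = pd.DataFrame({"week2": [-120.0, 35.0, None, -4.0]},
                    index=["tag_a", "tag_b", "tag_c", "tag_d"])

most_negative = tags.nsmallest(2, "week2")  # like sort_values(ascending=True).head(2)
most_positive = tags.nlargest(2, "week2")   # like sort_values(ascending=False).head(2)
print(most_negative)
print(most_positive)
```

Tags with no score in that week never appear in either list.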

# Go to the JSON for examples

In [None]:
import json
from datetime import datetime

# print(datetime.now().strftime('%Y-%m-%d %H:%M:%S'))

missing = ['repboustany', 'repcorrinebrown', 'repduckworth', 'repgwengraham', 'repjoeheck', 
           'repjoepitts', 'repkirkpatrick', 'repmattsalmon', 'repmickmulvaney', 'repmikepompeo', 
           'repmiketurner', 'repmurphyfl', 'reprobbishop', 'repsamfarr', 'reptoddyoung', 'senatorsessions', 
           'senatortester', 'sentoomey', 'tiberipress', 'yvetteclarke']

# search JSON for high score tweeters
with open('/data/purpletag/jsons/1478545228.json','r') as f:
    for line in f:
        data = json.loads(line)
#         if (data['user']['screen_name'].lower() == 'replawrence' 
#             or data['user']['screen_name'].lower() == 'repdennyheck'): # most polarized Dems during Week 2
        if (data['user']['screen_name'].lower() in missing): # call lower(); the bare method never matches
            print(data['user']['screen_name'])
            print(data['created_at'])
            print(data['text'])
                        
# print(datetime.now().strftime('%Y-%m-%d %H:%M:%S')) # takes about 2 min to search the whole thing
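
The handle search can be tried on a couple of made-up lines before spending two minutes on the full file. This is a sketch only: the function name ```tweets_by_handles``` and the sample tweets are hypothetical, and it assumes one JSON tweet per line, as in the loop above.

```python
import io
import json

def tweets_by_handles(lines, handles):
    """Yield (screen_name, created_at, text) for tweets whose author is in handles."""
    wanted = {h.lower() for h in handles}  # set lookup, compared in lower case
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        user = tweet["user"]["screen_name"]
        if user.lower() in wanted:
            yield user, tweet["created_at"], tweet["text"]

# Two hypothetical lines standing in for the real JSON file.
sample = io.StringIO(
    '{"user": {"screen_name": "RepSamFarr"}, "created_at": "Mon Nov 07 2016", "text": "hello"}\n'
    '{"user": {"screen_name": "someoneelse"}, "created_at": "Mon Nov 07 2016", "text": "hi"}\n'
)
hits = list(tweets_by_handles(sample, ["repsamfarr"]))
print(hits)
```

To run it on the real data, pass the open file object in place of ```sample```.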

In [None]:
matches = list()

# search JSON for high score tags
with open('/data/purpletag/jsons/1478545228.json','r') as f:
    for line in f:
        data = json.loads(line)
        tags = data['entities']['hashtags']
        for tag in tags:
            if tag['text'].lower() == 'gunvote':
                matches.append(data['id'])
                print(data['user']['screen_name'])
                print(data['created_at'])
                print(data['text'])

print(len(matches))
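
The same kind of dry run works for the hashtag search. Again a sketch with a made-up function name (```tweet_ids_with_tag```) and made-up tweets. One difference from the cell above: this version counts a tweet once even if the tag appears in it twice.

```python
import io
import json

def tweet_ids_with_tag(lines, tag):
    """Return ids of tweets that use the given hashtag (case-insensitive)."""
    tag = tag.lower()
    ids = []
    for line in lines:
        tweet = json.loads(line)
        hashtags = tweet.get("entities", {}).get("hashtags", [])
        if any(h["text"].lower() == tag for h in hashtags):
            ids.append(tweet["id"])
    return ids

# Three hypothetical tweets; only the first uses the tag.
sample = io.StringIO(
    '{"id": 1, "entities": {"hashtags": [{"text": "GunVote"}]}}\n'
    '{"id": 2, "entities": {"hashtags": [{"text": "other"}]}}\n'
    '{"id": 3, "entities": {"hashtags": []}}\n'
)
ids = tweet_ids_with_tag(sample, "gunvote")
print(len(ids))
```

Because the function builds a fresh list on every call, the count cannot carry over from an earlier search.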

In [None]:
# check for missing data from alternate_route.ipynb
# search JSON for high score tags
matches = list() # start from an empty list so the count is not added to the previous cell's
with open('/data/purpletag/jsons/1478545228.json','r') as f:
    for line in f:
        data = json.loads(line)
        tags = data['entities']['hashtags']
        for tag in tags:
            if tag['text'].lower() == 'gunvote':
                matches.append(data['id'])
                print(data['user']['screen_name'])
                print(data['created_at'])
                print(data['text'])

print(len(matches))