# Overview

This notebook assumes you have already collected and scored MOC tweets. It creates a dataset for use in R to analyze the patterns of polarization over time. You will do some parsing on an AWS server and some locally before ultimately making a CSV file that you can open and analyze in R.

LH Note: on my computer, Git stuff and data live in different places, so you'll see notes about moving files or changing directories. I haven't figured out a good way to keep files in both places or to mirror or sync or whatever. So, for now, paths are hard-coded or there's a note about where to find a file.

## Setup and Helpers

In [1]:
# Based on https://stackoverflow.com/questions/26415906/read-multiple-txt-files-into-pandas-dataframe-with-filename-as-column-header
import pandas as pd
import os
import glob
import yaml
import numpy as np

def build_file_list(directory, extension):
    '''
    args: 
        directory - full path to where the files are
            ex: /data/purpletag/scores
        extension - glob pattern that tells us which files to include in the list
            ex: *.1.moc.scores # using 1-day purpletag MOC scores
    returns: list of matching file names (with extension, relative to directory)
    note: changes the working directory to `directory`; build_df relies on
        that to open the files by name
    '''
    
    # Step 1: move to the target directory so the glob pattern and the
    # returned file names are relative to it
    os.chdir( directory )

    # Step 2: Build up list of files, sorted so the order is reproducible:
    fileList = sorted(glob.glob(extension))
        
    return fileList

def build_df(fileList, outfile, score_type):
    '''
    args:
        fileList - list of files to include, usually output from build_file_list
        outfile - full path to where to put the df
            ex: /data/purpletag/mocs_by_date.pkl
        score_type - 'moc' for MOC score files (*.1.moc.scores);
            anything else is treated as tag score files (*.1.scores)
    '''
    # Step 3: Build up DataFrame:
    # Based on https://stackoverflow.com/questions/35717706/python-how-to-turn-a-dictionary-of-dataframes-into-one-big-dataframe-with-colum
    d = {} # dictionary to hold multiple dfs

    for filename in fileList:
        df1 = pd.read_csv(filename, header=None, sep=' ', index_col=0)
        if score_type == 'moc': # moc score files
            d[filename[:-13]] = df1 # strip the 13-character '.1.moc.scores' suffix
        else: # tag score files
            d[filename[:-9]] = df1 # strip the 9-character '.1.scores' suffix

    df = pd.concat(d, axis=1)
    df.columns = df.columns.droplevel(-1) 

    df.to_pickle(outfile)

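The two helpers above depend on `os.chdir` and on hard-coded slice lengths. As a rough alternative, here is a minimal sketch that returns full paths and strips a named suffix instead. The function names `list_score_files` and `column_key` are hypothetical (they are not part of purpletag), and the sketch assumes score files are named like `2016-09-06.1.moc.scores`, which is inferred from the date columns in the dataframe rather than confirmed.

```python
from pathlib import Path


def list_score_files(directory, pattern):
    """Return full paths of files in `directory` matching `pattern`, sorted by name.

    Hypothetical helper: unlike build_file_list, it does not change the
    working directory.
    """
    return sorted(Path(directory).glob(pattern))


def column_key(path, suffix):
    """Strip a known suffix such as '.1.moc.scores' from a file name."""
    name = Path(path).name
    if not name.endswith(suffix):
        raise ValueError(f"{name} does not end with {suffix}")
    return name[: -len(suffix)]
```

Under that naming assumption, `column_key('/data/purpletag/scores/2016-09-06.1.moc.scores', '.1.moc.scores')` would give `'2016-09-06'`, and a wrong suffix fails loudly instead of silently truncating the name.
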
In [2]:
# set some variables that get used a few times below
#git_path = "/home/ubuntu/purpletag" # where is the GitHub repo that holds this notebook
git_path = "/Users/libbyh/Documents/git/libbyh/purpletag/"
git_data_path = git_path + "2016-election-study/data-files/"
pt_path = "/data/purpletag" # where are the scores and other purpletag-generated data

In [None]:
fileList = build_file_list('/data/purpletag/scores', '*.1.moc.scores')
build_df(fileList, '/data/purpletag/mocs_by_date_test.pkl', 'moc')

In [3]:
# Thank you, Roger Allen, https://gist.github.com/rogerallen/1583593
us_state_abbrev = {
    'Alabama': 'AL',
    'Alaska': 'AK',
    'Arizona': 'AZ',
    'Arkansas': 'AR',
    'California': 'CA',
    'Colorado': 'CO',
    'Connecticut': 'CT',
    'Delaware': 'DE',
    'Florida': 'FL',
    'Georgia': 'GA',
    'Hawaii': 'HI',
    'Idaho': 'ID',
    'Illinois': 'IL',
    'Indiana': 'IN',
    'Iowa': 'IA',
    'Kansas': 'KS',
    'Kentucky': 'KY',
    'Louisiana': 'LA',
    'Maine': 'ME',
    'Maryland': 'MD',
    'Massachusetts': 'MA',
    'Michigan': 'MI',
    'Minnesota': 'MN',
    'Mississippi': 'MS',
    'Missouri': 'MO',
    'Montana': 'MT',
    'Nebraska': 'NE',
    'Nevada': 'NV',
    'New Hampshire': 'NH',
    'New Jersey': 'NJ',
    'New Mexico': 'NM',
    'New York': 'NY',
    'North Carolina': 'NC',
    'North Dakota': 'ND',
    'Ohio': 'OH',
    'Oklahoma': 'OK',
    'Oregon': 'OR',
    'Pennsylvania': 'PA',
    'Rhode Island': 'RI',
    'South Carolina': 'SC',
    'South Dakota': 'SD',
    'Tennessee': 'TN',
    'Texas': 'TX',
    'Utah': 'UT',
    'Vermont': 'VT',
    'Virginia': 'VA',
    'Washington': 'WA',
    'West Virginia': 'WV',
    'Wisconsin': 'WI',
    'Wyoming': 'WY',
}

# Get Data

2016 election data is on an AWS server under ```/data/purpletag```. 

To login: 

```ssh -i ~/.ssh/carolgrrr.pem ubuntu@purpletag.casmlab.org```

The data is large (> 4GB), so it is best to run Jupyter notebooks on the server to parse it. The resulting CSV files can then be used locally.

You can run a notebook on the server and use your local browser with these two commands:

* ```ssh -L 8080:localhost:8888 -i ~/.ssh/carolgrrr.pem ubuntu@purpletag.casmlab.org```
* ```nohup jupyter notebook --no-browser > log.txt 2>&1 &```

Then access ```http://localhost:8080``` in your browser.

## On server: Parsing from ```scores``` files to CSV

This section assumes you have already run purpletag's ```collect``` and ```score``` functions and gotten the Twitter data that you want in JSON format and parsed that data into score files.

Move the file from the AWS server to local if you want to work locally. For example, to move the file ```mocs_by_date.pkl``` from the server to my local repo, I use:

```scp -i ~/.ssh/carolgrrr.pem ubuntu@purpletag.casmlab.org:/data/purpletag/mocs_by_date.pkl ~/Documents/git/casmlab/purpletag/files/```

## Locally: Prepping for stats

We now have a pickled dataframe with one row per handle and one column per date. We need to keep data only from Labor Day to Election Day (the columns 2016-09-06 through 2016-11-07 in the code below) and get weekly averages.

In [4]:
import pandas as pd

df = pd.read_pickle(git_data_path + 'mocs_by_date.pkl')
df.head()

Unnamed: 0,2015-11-10,2015-11-11,2015-11-12,2015-11-13,2015-11-14,2015-11-15,2015-11-16,2015-11-17,2015-11-18,2015-11-19,...,2016-10-30,2016-10-31,2016-11-01,2016-11-02,2016-11-03,2016-11-04,2016-11-05,2016-11-06,2016-11-07,2016-11-08
austinscottga08,,2.40933,,,,,,,,,...,,,,,,,,,,
benniegthompson,,,,,,,,3.85585,,,...,,,,,,,,,,
bettymccollum04,,,,,,,,,,,...,,,-90.838,-61.3435,-40.482,,-33.4513,,,
billpascrell,,,-1.16875,-0.916501,-0.972477,,,,,,...,,,,-17.9542,,,,,,
boblatta,,,0.100723,,,1.70286,,1.56897,,-0.031978,...,,,,,,,,,,


In [5]:
def weekly_avg(df):
    '''
    Given a df from build_df, keep just the weeks we are interested in.
    '''
    week1_dates = ['2016-09-06','2016-09-07','2016-09-08','2016-09-09','2016-09-10','2016-09-11','2016-09-12']
    week2_dates = ['2016-09-13','2016-09-14','2016-09-15','2016-09-16','2016-09-17','2016-09-18','2016-09-19']
    week3_dates = ['2016-09-20','2016-09-21','2016-09-22','2016-09-23','2016-09-24','2016-09-25','2016-09-26']
    week4_dates = ['2016-09-27','2016-09-28','2016-09-29','2016-09-30','2016-10-01','2016-10-02','2016-10-03']
    week5_dates = ['2016-10-04','2016-10-05','2016-10-06','2016-10-07','2016-10-08','2016-10-09','2016-10-10']
    week6_dates = ['2016-10-11','2016-10-12','2016-10-13','2016-10-14','2016-10-15','2016-10-16','2016-10-17']
    week7_dates = ['2016-10-18','2016-10-19','2016-10-20','2016-10-21','2016-10-22','2016-10-23','2016-10-24']
    week8_dates = ['2016-10-25','2016-10-26','2016-10-27','2016-10-28','2016-10-29','2016-10-30','2016-10-31']
    week9_dates = ['2016-11-01','2016-11-02','2016-11-03','2016-11-04','2016-11-05','2016-11-06','2016-11-07']

    df['week1'] = df[week1_dates].mean(axis=1)
    df['week2'] = df[week2_dates].mean(axis=1)
    df['week3'] = df[week3_dates].mean(axis=1)
    df['week4'] = df[week4_dates].mean(axis=1)
    df['week5'] = df[week5_dates].mean(axis=1)
    df['week6'] = df[week6_dates].mean(axis=1)
    df['week7'] = df[week7_dates].mean(axis=1)
    df['week8'] = df[week8_dates].mean(axis=1)
    df['week9'] = df[week9_dates].mean(axis=1)
    
    return df

df = weekly_avg(df)
df.head()

Unnamed: 0,2015-11-10,2015-11-11,2015-11-12,2015-11-13,2015-11-14,2015-11-15,2015-11-16,2015-11-17,2015-11-18,2015-11-19,...,2016-11-08,week1,week2,week3,week4,week5,week6,week7,week8,week9
austinscottga08,,2.40933,,,,,,,,,...,,,4.932035,,1.06429,3.6757,6.0426,1.56161,,
benniegthompson,,,,,,,,3.85585,,,...,,,-1.21138,,,,-1.82301,,,
bettymccollum04,,,,,,,,,,,...,,-56.669016,-119.88476,-67.1729,-59.540265,-27.995798,-33.955595,-4.879673,-33.080633,-56.5287
billpascrell,,,-1.16875,-0.916501,-0.972477,,,,,,...,,-2.87143,-2.62381,-2.353348,-2.337897,,-1.02655,,-1.33333,-17.9542
boblatta,,,0.100723,,,1.70286,,1.56897,,-0.031978,...,,1.37997,9.673527,,,,0.974138,,,

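The hard-coded date lists in `weekly_avg` could also be generated. The following is only a sketch, assuming the columns are `YYYY-MM-DD` strings as in the output above; `weekly_avg_by_range` is a hypothetical name. It also differs from `weekly_avg` in two ways: it returns a copy instead of modifying `df`, and it skips dates that have no column rather than raising a `KeyError`.

```python
import numpy as np
import pandas as pd


def weekly_avg_by_range(df, start="2016-09-06", n_weeks=9):
    """Add week1..weekN columns holding the row mean of each 7-day window."""
    out = df.copy()
    first_day = pd.Timestamp(start)
    for i in range(n_weeks):
        # Seven consecutive days starting i weeks after the first day
        days = pd.date_range(first_day + pd.Timedelta(weeks=i), periods=7)
        labels = [d.strftime("%Y-%m-%d") for d in days]
        # Assumption: a date with no score file simply has no column
        present = [c for c in labels if c in df.columns]
        if present:
            out[f"week{i + 1}"] = df[present].mean(axis=1)  # NaNs are skipped
        else:
            out[f"week{i + 1}"] = np.nan
    return out
```

Calling `weekly_avg_by_range(df)` with the defaults covers the same nine windows as the lists above.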

In [6]:
weekly_df = df[['week1','week2','week3','week4','week5','week6','week7','week8','week9']].copy() # .copy() avoids the SettingWithCopyWarning on the next line
weekly_df["handle"] = weekly_df.index.str.lower()
weekly_df.to_csv(git_data_path + 'week_by_handle.csv', encoding='utf-8')
weekly_df


Unnamed: 0,week1,week2,week3,week4,week5,week6,week7,week8,week9,handle
austinscottga08,,4.932035,,1.064290,3.675700,6.042600,1.561610,,,austinscottga08
benniegthompson,,-1.211380,,,,-1.823010,,,,benniegthompson
bettymccollum04,-56.669016,-119.884760,-67.172900,-59.540265,-27.995798,-33.955595,-4.879673,-33.080633,-56.528700,bettymccollum04
billpascrell,-2.871430,-2.623810,-2.353348,-2.337897,,-1.026550,,-1.333330,-17.954200,billpascrell
boblatta,1.379970,9.673527,,,,0.974138,,,,boblatta
bradsherman,,-1.941180,,-5.381950,,,,,-0.748092,bradsherman
call_me_dutch,-1.312583,-4.765343,-3.129013,-1.573270,-2.967365,-2.715476,-20.466664,-1.576630,-6.732820,call_me_dutch
candicemiller,1.678195,,0.029657,-0.969076,,1.086960,,,,candicemiller
cathymcmorris,7.926215,3.141282,8.873570,13.830632,1.000000,7.823813,3.747878,11.442992,10.554400,cathymcmorris
cbrangel,-39.434357,-61.711175,-41.164824,-37.333904,-19.490345,-32.642642,-8.514975,-4.536950,-18.458800,cbrangel


In [7]:
weekly_df.sort_values(by = 'week2', ascending = True).head()

Unnamed: 0,week1,week2,week3,week4,week5,week6,week7,week8,week9,handle
replawrence,-28.383236,-308.088,-18.402877,-32.248687,0.698665,-43.500575,,,-0.811634,replawrence
repdennyheck,,-300.145,-10.527903,-17.045025,,-18.14737,-16.686711,-40.5577,-62.14465,repdennyheck
repbobbyrush,,-270.786,-30.6797,-1.161955,,,,-8.6529,-1.89872,repbobbyrush
nitalowey,-1.29884,-215.463,-39.0497,-6.69917,,-11.493604,-1.08432,-3.043239,-18.510867,nitalowey
repcleaver,-6.302088,-194.11,-8.202654,-32.020135,-0.119293,0.375556,-18.248317,-3.68411,-9.614692,repcleaver


## Merge with legislator data
### Legislator Meta Data
Now we merge our #polar-tag data with data about the legislators (e.g., party, state).

In [8]:
# get term info 
# https://stackoverflow.com/questions/35968189/retrieving-data-from-a-yaml-file-based-on-a-python-list

# Connect to data from Govtrack Nov 15, 2016 commit
# https://github.com/unitedstates/congress-legislators/tree/1473ea983d5538c25f5d315626445ab038d8141b
# yaml.safe_load avoids PyYAML's unsafe default loader; pd.json_normalize
# replaces the deprecated pd.io.json.json_normalize (pandas >= 1.0)
with open(git_data_path + 'legislators-social-media-nov16.yaml', 'r') as f:
    df_social = pd.json_normalize(yaml.safe_load(f))

with open(git_data_path + 'legislators-current-nov16.yaml', 'r') as f:
    df_current = pd.json_normalize(yaml.safe_load(f), 'terms', [['id','bioguide'],['name','last']])

df_current_term = df_current.loc[df_current['end'] > '2016-01-01'] 
df_current_term = df_current_term[['district','state','party','type','id.bioguide','name.last']]

# merge everything into one data frame with one row per MOC
df_meta = pd.merge(df_current_term, df_social, on="id.bioguide")

df_meta["handle"] = df_meta["social.twitter"].str.lower()
df_meta['district'] = np.where(df_meta['type'] == 'sen', 'None', df_meta['district'])

df_meta = df_meta[['id.bioguide','handle','state','district','type','party','name.last']]

df_meta.to_csv(git_data_path + 'legislator_meta.csv', encoding='utf-8')

df_meta

Unnamed: 0,id.bioguide,handle,state,district,type,party,name.last
0,B000944,sensherrodbrown,OH,,sen,Democrat,Brown
1,C000127,senatorcantwell,WA,,sen,Democrat,Cantwell
2,C000141,senatorcardin,MD,,sen,Democrat,Cardin
3,C000174,senatorcarper,DE,,sen,Democrat,Carper
4,C001070,senbobcasey,PA,,sen,Democrat,Casey
5,C001071,senbobcorker,TN,,sen,Republican,Corker
6,F000062,senfeinstein,CA,,sen,Democrat,Feinstein
7,H000338,senorrinhatch,UT,,sen,Republican,Hatch
8,K000367,,MN,,sen,Democrat,Klobuchar
9,M001170,mccaskilloffice,MO,,sen,Democrat,McCaskill


### 2016 Election Results
Now add the results from the election.

In [9]:
# get Ballotpedia House results data
bp_house = pd.read_csv(git_data_path + 'house_results.csv', sep=",", header=0, thousands=',')
bp_house['state_district'] = bp_house.index
# split 'STATE-DISTRICT' into two columns; expand=True returns a two-column frame
bp_house[['state', 'district']] = bp_house['district'].str.split('-', n=1, expand=True)
bp_house['district'] = bp_house['district'].str.replace('AL','0')
bp_house['district'] = bp_house['district'].astype(float).astype(str)
bp_house = bp_house[['state','district','inc_name','inc_party','inc_ran','inc_won','had_challenger','winner_votes','runnerup_votes','winner_party','runnerup_party']]
bp_house.head()

Unnamed: 0,state,district,inc_name,inc_party,inc_ran,inc_won,had_challenger,winner_votes,runnerup_votes,winner_party,runnerup_party
0,AK,0.0,"Young, Don",R,1,1,1,155088.0,111019.0,R,D
1,AL,1.0,"Byrne, Bradley",R,1,1,0,208083.0,7810.0,R,write in
2,AL,2.0,"Roby, Martha",R,1,1,1,134886.0,112089.0,R,D
3,AL,3.0,"Rogers, Mike",R,1,1,1,192164.0,94549.0,R,D
4,AL,4.0,"Aderholt, Rob",R,1,1,0,235925.0,3519.0,R,write in

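To make the district parsing easier to follow, here is a toy sketch of the same steps. The three labels are hypothetical stand-ins that assume the Ballotpedia file uses `STATE-DISTRICT` strings with `AL` marking an at-large seat, which matches the `AK`, `0.0` row above but is not confirmed from the file itself.

```python
import pandas as pd

# Hypothetical 'STATE-DISTRICT' labels; 'AL' after the dash means at-large
raw = pd.Series(["AK-AL", "AL-1", "NY-12"])

parts = raw.str.split("-", n=1, expand=True)
parts.columns = ["state", "district"]

# Series.replace matches whole values, so only an exact 'AL' district becomes '0'
parts["district"] = parts["district"].replace("AL", "0")

# Same float-then-string round trip as the cell above, e.g. '1' -> '1.0'
parts["district"] = parts["district"].astype(float).astype(str)
print(parts)
```

The sketch uses `Series.replace` (whole-value match) where the cell above uses `.str.replace` (substring match); both give the same result here because the state has already been split off.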

In [10]:
# get Ballotpedia Senate results data
bp_sen = pd.read_csv(git_data_path + 'senate_results.csv', sep=",", header=0, thousands=',')
bp_sen['district'] = 'None'
bp_sen = bp_sen[['state','district','inc_name','inc_party','inc_ran','inc_won','had_challenger','winner_votes','runnerup_votes','winner_party','runnerup_party']]
bp_sen.head()

Unnamed: 0,state,district,inc_name,inc_party,inc_ran,inc_won,had_challenger,winner_votes,runnerup_votes,winner_party,runnerup_party
0,AL,,Shelby,R,1,1,1,1335104,748709,R,D
1,AK,,Murkowski,R,1,1,1,138149,36200,R,D
2,AZ,,McCain,R,1,1,1,1359267,1031245,R,D
3,AR,,Boozman,R,1,1,1,661984,400602,R,D
4,CA,,Boxer,D,0,0,0,7542753,4701417,D,D


In [11]:
# match bioguide id data to name data from Ballotpedia for the House
house_election_and_meta = pd.merge(df_meta, bp_house, on=['state','district'], how='right')
house_election_and_meta

Unnamed: 0,id.bioguide,handle,state,district,type,party,name.last,inc_name,inc_party,inc_ran,inc_won,had_challenger,winner_votes,runnerup_votes,winner_party,runnerup_party
0,A000055,robert_aderholt,AL,4.0,rep,Republican,Aderholt,"Aderholt, Rob",R,1,1,0,235925.0,3519.0,R,write in
1,A000367,,MI,3.0,rep,Republican,Amash,"Amash, Justin",R,1,1,1,203545.0,128400.0,R,D
2,B001269,reploubarletta,PA,11.0,rep,Republican,Barletta,"Barletta, Lou",R,1,1,1,199421.0,113800.0,R,D
3,B000213,repjoebarton,TX,6.0,rep,Republican,Barton,"Barton, Joe",R,1,1,1,159444.0,106667.0,R,D
4,B001270,repkarenbass,CA,37.0,rep,Democrat,Bass,"Bass, Karen",D,1,1,1,192490.0,44782.0,D,D
5,B000287,repbecerra,CA,34.0,rep,Democrat,Becerra,"Becerra, Xavier",D,1,1,1,122842.0,36314.0,D,D
6,B001271,congressmandan,MI,1.0,rep,Republican,Benishek,"Bergman, Jack",R,0,0,0,197777.0,144334.0,R,D
7,B001257,repgusbilirakis,FL,12.0,rep,Republican,Bilirakis,"Bilirakis, Gus",R,1,1,1,253559.0,116110.0,R,D
8,B001250,reprobbishop,UT,1.0,rep,Republican,Bishop,"Bishop, Rob",R,1,1,1,182925.0,73380.0,R,D
9,B000490,sanfordbishop,GA,2.0,rep,Democrat,Bishop,"Bishop, Sanford",D,1,1,1,148543.0,94056.0,D,R


In [12]:
# match bioguide id data to name data from Ballotpedia for the Senate
senate_election_and_meta = pd.merge(df_meta, bp_sen, left_on=['state','name.last','district'], right_on=['state','inc_name','district'], how='right')
senate_election_and_meta

Unnamed: 0,id.bioguide,handle,state,district,type,party,name.last,inc_name,inc_party,inc_ran,inc_won,had_challenger,winner_votes,runnerup_votes,winner_party,runnerup_party
0,A000368,kellyayotte,NH,,sen,Republican,Ayotte,Ayotte,R,1,0,1,354649,353632,D,R
1,B001267,senbennetco,CO,,sen,Democrat,Bennet,Bennet,D,1,1,1,1370710,1215318,D,R
2,B001277,senblumenthal,CT,,sen,Democrat,Blumenthal,Blumenthal,D,1,1,1,1008714,552621,D,R
3,B000575,royblunt,MO,,sen,Republican,Blunt,Blunt,R,1,1,1,1378458,1300200,R,D
4,B001236,johnboozman,AR,,sen,Republican,Boozman,Boozman,R,1,1,1,661984,400602,R,D
5,B000711,senatorboxer,CA,,sen,Democrat,Boxer,Boxer,D,0,0,0,7542753,4701417,D,D
6,B001135,senatorburr,NC,,sen,Republican,Burr,Burr,R,1,1,1,2395376,2128165,R,D
7,C000880,mikecrapo,ID,,sen,Republican,Crapo,Crapo,R,1,1,1,449017,188249,R,D
8,G000386,chuckgrassley,IA,,sen,Republican,Grassley,Grassley,R,1,1,1,926007,549460,R,D
9,H001061,senjohnhoeven,ND,,sen,Republican,Hoeven,Hoeven,R,1,1,1,268788,58116,R,D


In [17]:
# combine house and senate results
election_results = pd.concat([house_election_and_meta, senate_election_and_meta])

# race closeness
election_results['margin'] = (election_results['winner_votes'] - election_results['runnerup_votes'])/(election_results['winner_votes'] + election_results['runnerup_votes'])
election_results.to_csv(git_data_path + 'meta_and_election_results.csv', encoding='utf-8')

election_results

Unnamed: 0,id.bioguide,handle,state,district,type,party,name.last,inc_name,inc_party,inc_ran,inc_won,had_challenger,winner_votes,runnerup_votes,winner_party,runnerup_party,margin
0,A000055,robert_aderholt,AL,4.0,rep,Republican,Aderholt,"Aderholt, Rob",R,1,1,0,235925.0,3519.0,R,write in,0.970607
1,A000367,,MI,3.0,rep,Republican,Amash,"Amash, Justin",R,1,1,1,203545.0,128400.0,R,D,0.226378
2,B001269,reploubarletta,PA,11.0,rep,Republican,Barletta,"Barletta, Lou",R,1,1,1,199421.0,113800.0,R,D,0.273357
3,B000213,repjoebarton,TX,6.0,rep,Republican,Barton,"Barton, Joe",R,1,1,1,159444.0,106667.0,R,D,0.198327
4,B001270,repkarenbass,CA,37.0,rep,Democrat,Bass,"Bass, Karen",D,1,1,1,192490.0,44782.0,D,D,0.622526
5,B000287,repbecerra,CA,34.0,rep,Democrat,Becerra,"Becerra, Xavier",D,1,1,1,122842.0,36314.0,D,D,0.543668
6,B001271,congressmandan,MI,1.0,rep,Republican,Benishek,"Bergman, Jack",R,0,0,0,197777.0,144334.0,R,D,0.156215
7,B001257,repgusbilirakis,FL,12.0,rep,Republican,Bilirakis,"Bilirakis, Gus",R,1,1,1,253559.0,116110.0,R,D,0.371816
8,B001250,reprobbishop,UT,1.0,rep,Republican,Bishop,"Bishop, Rob",R,1,1,1,182925.0,73380.0,R,D,0.427401
9,B000490,sanfordbishop,GA,2.0,rep,Democrat,Bishop,"Bishop, Sanford",D,1,1,1,148543.0,94056.0,D,R,0.224597

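As a quick check on the `margin` column, the formula can be worked by hand for one row. This sketch wraps it in a hypothetical `margin` function and uses the vote counts from the AL-4 row above.

```python
def margin(winner_votes, runnerup_votes):
    """Winner's lead as a share of the top-two vote: 0 is a tie, 1 means no runner-up votes."""
    return (winner_votes - runnerup_votes) / (winner_votes + runnerup_votes)


# AL-4: (235925 - 3519) / (235925 + 3519) = 232406 / 239444
print(round(margin(235925, 3519), 6))  # 0.970607
```

Because the denominator is only the top two candidates' votes, the value is not the same as a margin computed over all votes cast.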

## Merge Meta + Election Data with Scores data

In [18]:
# keep every race in election_results, then drop rows with no Twitter handle to match on
df_wide = pd.merge(election_results, weekly_df, on='handle', how='left')
df_wide.dropna(subset=['handle'], how='any', inplace=True)
df_wide.to_csv(git_data_path + 'merged_wide.csv', encoding='utf-8')

df_wide

Unnamed: 0,id.bioguide,handle,state,district,type,party,name.last,inc_name,inc_party,inc_ran,...,margin,week1,week2,week3,week4,week5,week6,week7,week8,week9
0,A000055,robert_aderholt,AL,4.0,rep,Republican,Aderholt,"Aderholt, Rob",R,1,...,0.970607,1.508142,,2.456140,19.826600,,10.846300,,0.851852,
2,B001269,reploubarletta,PA,11.0,rep,Republican,Barletta,"Barletta, Lou",R,1,...,0.273357,,-0.114591,3.561290,,0.182323,,,0.869231,
3,B000213,repjoebarton,TX,6.0,rep,Republican,Barton,"Barton, Joe",R,1,...,0.198327,1.933650,,2.550000,,,2.173910,2.116500,16.993080,21.865500
4,B001270,repkarenbass,CA,37.0,rep,Democrat,Bass,"Bass, Karen",D,1,...,0.622526,-14.938060,-92.063393,-18.224034,-19.090097,-5.101102,-11.509697,-20.173614,-11.338472,-26.374088
5,B000287,repbecerra,CA,34.0,rep,Democrat,Becerra,"Becerra, Xavier",D,1,...,0.543668,-30.206601,-40.407968,-17.238693,-5.432492,-4.291443,-3.787882,-2.167685,-3.903488,-12.661166
6,B001271,congressmandan,MI,1.0,rep,Republican,Benishek,"Bergman, Jack",R,0,...,0.156215,,,,,,,,,
7,B001257,repgusbilirakis,FL,12.0,rep,Republican,Bilirakis,"Bilirakis, Gus",R,1,...,0.371816,1.312443,,,,,,1.897960,10.711600,
8,B001250,reprobbishop,UT,1.0,rep,Republican,Bishop,"Bishop, Rob",R,1,...,0.427401,,,,,,,,,
9,B000490,sanfordbishop,GA,2.0,rep,Democrat,Bishop,"Bishop, Sanford",D,1,...,0.224597,-1.034570,0.231907,-1.026090,-49.990600,0.699301,,-0.944954,-4.695650,
10,B001273,repdianeblack,TN,6.0,rep,Republican,Black,"Black, Diane",R,1,...,0.530748,2.974408,3.727599,7.829558,7.146729,2.420896,9.991508,5.292473,9.057695,3.280900

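The left merge leaves all-NaN weekly scores for any handle that never appears in `weekly_df` (for example `reprobbishop` above). One way to list those handles, sketched here with a hypothetical helper `unmatched_handles`, is the `indicator` option of `pd.merge`.

```python
import pandas as pd


def unmatched_handles(left, right, key="handle"):
    """Return sorted key values that appear in `left` but have no match in `right`."""
    checked = pd.merge(
        left[[key]].dropna().drop_duplicates(),
        right[[key]].drop_duplicates(),
        on=key,
        how="left",
        indicator=True,  # adds a `_merge` column: 'both' or 'left_only'
    )
    return sorted(checked.loc[checked["_merge"] == "left_only", key])
```

For this notebook the call would be `unmatched_handles(election_results, weekly_df)`, assuming both frames still carry the lowercased `handle` column.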

In [19]:
# melt it so each row is a person x week
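# pd.DataFrame.from_csv was removed in pandas 1.0; read_csv with index_col=0 is the equivalent here
df_wide = pd.read_csv(git_data_path + 'merged_wide.csv', sep=",", header=0, index_col=0)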
df_long = pd.melt(df_wide, id_vars=['id.bioguide','handle','name.last','type','party','district',
               'inc_ran','inc_won','had_challenger','inc_party','winner_party','runnerup_party',
                'runnerup_votes','winner_votes','margin'],
                value_vars=['week1','week2','week3','week4','week5','week6','week7','week8','week9'],
                var_name='week', value_name='avg_score')
df_long['week'] = df_long['week'].str[-1:]
df_long['abs'] = abs(df_long['avg_score'])

df_long.rename(columns = {'type':'chamber'}, inplace = True)

# drop rows that have no party or no score for that week
df_long.dropna(subset=['party','abs'], how='any', inplace=True)

df_long.to_csv(git_data_path + 'merged_long.csv', encoding='utf-8')
df_long.head()

Unnamed: 0,id.bioguide,handle,name.last,chamber,party,district,inc_ran,inc_won,had_challenger,inc_party,winner_party,runnerup_party,runnerup_votes,winner_votes,margin,week,avg_score,abs
0,A000055,robert_aderholt,Aderholt,rep,Republican,4.0,1,1,0,R,R,write in,3519.0,235925.0,0.970607,1,1.508142,1.508142
2,B000213,repjoebarton,Barton,rep,Republican,6.0,1,1,1,R,R,D,106667.0,159444.0,0.198327,1,1.93365,1.93365
3,B001270,repkarenbass,Bass,rep,Democrat,37.0,1,1,1,D,D,D,44782.0,192490.0,0.622526,1,-14.93806,14.93806
4,B000287,repbecerra,Becerra,rep,Democrat,34.0,1,1,1,D,D,D,36314.0,122842.0,0.543668,1,-30.206601,30.206601
6,B001257,repgusbilirakis,Bilirakis,rep,Republican,12.0,1,1,1,R,R,D,116110.0,253559.0,0.371816,1,1.312443,1.312443

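For readers new to `pd.melt`, here is a toy version of the wide-to-long step. The two handles and their scores are made up for illustration and are not taken from the data.

```python
import pandas as pd

# Made-up wide table: one row per handle, one column per week
wide = pd.DataFrame({
    "handle": ["a", "b"],
    "party": ["Democrat", "Republican"],
    "week1": [-2.0, 1.5],
    "week2": [-4.0, None],
})

long_df = pd.melt(
    wide,
    id_vars=["handle", "party"],     # repeated on every output row
    value_vars=["week1", "week2"],   # each becomes its own row per handle
    var_name="week",
    value_name="avg_score",
)

# Strip the 'week' prefix instead of taking the last character,
# so the same code would still work with ten or more weeks
long_df["week"] = long_df["week"].str.replace("week", "", regex=False).astype(int)
long_df["abs"] = long_df["avg_score"].abs()
long_df = long_df.dropna(subset=["abs"])  # drop handle-weeks with no score
print(long_df)
```

The result has one row per handle per week with a score, which is the shape the mixed-effects models in R expect.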

# Now move to R for analysis

Run ```~/Documents/git/casmlab/purpletag/2016_election.R```

That R script sends its output to ```2016_election_results.txt```

In [None]:
# use a context manager so the file is closed after reading
with open('data-files/2016_election_results.txt', 'r') as results:
    print(results.read())

Based on the outlier-excluded linear mixed-effects models, it makes sense to remove RepThompson. The pattern stays the same even with RepThompson in the set, though: a negative effect for Republican and for week, and a positive effect for their interaction. ```lmm5``` is the best-fitting model.

## Changing the way we score hashtags

What if we score tags for the 63-day period and then score MOCs?

Run the following (on the server) to get new scores:

* ```purpletag parse -t 63 -d 200```
* ```purpletag score```
* ```purpletag score --counts --score-mocs```

That first command took a week because the code starts with today and works backwards 200 days, one day at a time. Each day takes over an hour. See Issue #18 about options for changing this behavior.

With the new tag measures, you can start the process over. Start at "On server: Parsing from ```scores``` files to CSV" with a new file name.

# Getting Tag Data for Paper

We need to know more about the tags people were using to make sense of the regression results. So, let's get some tag data.

In [None]:
score_files = build_file_list('/data/purpletag/scores', '*.1.scores')
build_df(score_files, '/data/purpletag/scores_by_date.pkl', 'tag')

In [None]:
import pandas as pd

df = pd.read_pickle('/data/purpletag/scores_by_date.pkl')
df.head()

In [None]:
df = weekly_avg(df)
df.head()

In [None]:
df_tags_weeks = df[['week1','week2','week3','week4','week5','week6','week7','week8','week9']]
df_tags_weeks = df_tags_weeks.dropna(how='all')
df_tags_weeks.head()
len(df_tags_weeks.index) # number of hashtags in our df

What was happening in week 2 that made Democrats so polarized that week?

In [None]:
week2 = df_tags_weeks.sort_values(by = 'week2', ascending = True)
week2.head(10)

In [None]:
week2 = df_tags_weeks.sort_values(by = 'week2', ascending = False)
week2.head(10)

In [None]:
week9_dems = df_tags_weeks.sort_values(by = 'week9', ascending = True)
week9_dems.head(10)

In [None]:
week9_reps = df_tags_weeks.sort_values(by = 'week9', ascending = False)
week9_reps.head(10)

# Go to the JSON for examples

In [None]:
import json
from datetime import datetime

# print(datetime.now().strftime('%Y-%m-%d %H:%M:%S'))

missing = ['repboustany', 'repcorrinebrown', 'repduckworth', 'repgwengraham', 'repjoeheck', 
           'repjoepitts', 'repkirkpatrick', 'repmattsalmon', 'repmickmulvaney', 'repmikepompeo', 
           'repmiketurner', 'repmurphyfl', 'reprobbishop', 'repsamfarr', 'reptoddyoung', 'senatorsessions', 
           'senatortester', 'sentoomey', 'tiberipress', 'yvetteclarke']

# search JSON for high score tweeters
with open('/data/purpletag/jsons/1478545228.json','r') as f:
    for line in f:
        data = json.loads(line)
#         if (data['user']['screen_name'].lower() == 'replawrence' 
#             or data['user']['screen_name'].lower() == 'repdennyheck'): # most polarized Dems during Week 2
        if data['user']['screen_name'].lower() in missing: # call .lower(); without the parentheses the test is always False
            print(data['user']['screen_name'])
            print(data['created_at'])
            print(data['text'])
                        
# print(datetime.now().strftime('%Y-%m-%d %H:%M:%S')) # takes about 2 min to search the whole thing

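The search above assumes every line is valid JSON with a `user` field. A more defensive variant might look like the following sketch; `iter_tweets_by_handle` is a hypothetical helper, and skipping malformed lines or records without a user is an assumption about what is wanted, not something the original code does.

```python
import json


def iter_tweets_by_handle(lines, handles):
    """Yield parsed tweets whose author's lowercased screen name is in `handles`."""
    wanted = {h.lower() for h in handles}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            continue  # assumption: skip malformed lines instead of stopping the scan
        if not isinstance(tweet, dict):
            continue
        # Some records (e.g. delete notices) may have no 'user' object
        screen_name = (tweet.get("user") or {}).get("screen_name") or ""
        if screen_name.lower() in wanted:
            yield tweet
```

It accepts any iterable of lines, so an open file works directly: `for tweet in iter_tweets_by_handle(f, missing): print(tweet['created_at'], tweet['text'])`.
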
In [None]:
matches = list()

# search JSON for high score tags
with open('/data/purpletag/jsons/1478545228.json','r') as f:
    for line in f:
        data = json.loads(line)
        tags = data['entities']['hashtags']
        for tag in tags:
            if tag['text'].lower() == 'gunvote':
                matches.append(data['id'])
                print(data['user']['screen_name'])
                print(data['created_at'])
                print(data['text'])

print(len(matches))

In [None]:
# check for missing data from alternate_route.ipynb
# search JSON for high score tags
with open('/data/purpletag/jsons/1478545228.json','r') as f:
    for line in f:
        data = json.loads(line)
        tags = data['entities']['hashtags']
        for tag in tags:
            if tag['text'].lower() == 'gunvote':
                matches.append(data['id'])
                print(data['user']['screen_name'])
                print(data['created_at'])
                print(data['text'])

print(len(matches))