# Overview

This notebook assumes you have already collected and scored tweets from members of Congress (MOCs). It creates a dataset for use in R to analyze patterns of polarization over time. You will do some parsing on an AWS server and some locally, and end up with a CSV file that you can open and analyze in R.

LH Note: on my computer, the Git repository and the data live in different places, so you'll see notes about moving files or changing directories. I haven't found a good way to keep files in both places or to mirror or sync them. So, for now, paths are hard-coded or there's a note about where to find a file.

# Get Data

2016 election data is on an AWS server under ```/data/purpletag```. 

To log in: 

```ssh -i ~/.ssh/carolgrrr.pem ubuntu@purpletag.casmlab.org```

The data is large (> 4 GB), so it is best to run Jupyter notebooks on the server to parse it. The resulting CSV files can then be used locally.

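You can run a notebook on the server and view it in your local browser with these two commands. Run the first from your local machine; it opens an SSH session that forwards local port 8080 to port 8888 (Jupyter's default) on the server. Run the second inside that session to start Jupyter in the background:

* ```ssh -L 8080:localhost:8888 -i ~/.ssh/carolgrrr.pem ubuntu@purpletag.casmlab.org```
* ```nohup jupyter notebook --no-browser > log.txt 2>&1 &```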

Then access ```http://localhost:8080``` in your browser.

## On server: Parsing from ```scores``` files to CSV

This section assumes you have already run purpletag's ```collect``` and ```score``` functions, so that the Twitter data you want has been collected in JSON format and parsed into score files.

In [None]:
# Based on https://stackoverflow.com/questions/26415906/read-multiple-txt-files-into-pandas-dataframe-with-filename-as-column-header
import pandas as pd
import os
import glob

# Step 1: get a list of all score files in target directory
my_dir = "/data/purpletag/scores/"
filesList = []
os.chdir(my_dir) # later cells read the score files by relative name

# Step 2: Build up list of files:
for files in glob.glob("*.1.moc.scores"): # using 1-day purpletag MOC scores
    filesList.append(files) # filename with extension

In [None]:
# Step 3: Build up DataFrame:
# Based on https://stackoverflow.com/questions/35717706/python-how-to-turn-a-dictionary-of-dataframes-into-one-big-dataframe-with-colum
d = {} # dictionary to hold multiple dfs

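for filename in filesList:
    # space-separated file with the handle in the first column
    df1 = pd.read_csv(filename, header=None, sep=' ', index_col=0)
    d[filename[:-13]] = df1 # drop the 13-character ".1.moc.scores" suffix, leaving the date
    
df = pd.concat(d, axis=1) # dictionary keys (dates) become the top column level
df.columns = df.columns.droplevel(-1) # keep only the date level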

df.to_pickle('/data/purpletag/mocs_by_date.pkl')
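
As a check on the logic above, here is a minimal, self-contained sketch of the same steps run on toy data. The helper name `load_scores`, the toy handles, and the toy values are all made up for illustration; the file layout (one space-separated `handle score` pair per line) is inferred from the `read_csv` call above and may not match every score file.

```python
import tempfile
from pathlib import Path

import pandas as pd

SUFFIX = ".1.moc.scores"  # 1-day MOC score files, as in the cells above


def load_scores(score_dir):
    """Return a handle x date DataFrame built from every score file in score_dir."""
    frames = {}
    for path in sorted(Path(score_dir).glob("*" + SUFFIX)):
        date = path.name[: -len(SUFFIX)]  # e.g. "2016-09-06"
        # Assumed layout: one "handle score" pair per line, space-separated.
        frames[date] = pd.read_csv(path, header=None, sep=" ", index_col=0)[1]
    # Each Series becomes one column, labelled with its date.
    return pd.concat(frames, axis=1)


# Toy files standing in for /data/purpletag/scores/ (values are made up).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "2016-09-06" + SUFFIX).write_text("alice 1.5\nbob -2.0\n")
    Path(tmp, "2016-09-07" + SUFFIX).write_text("alice 0.5\n")
    toy = load_scores(tmp)

print(toy)
```

Using `len(SUFFIX)` instead of a hard-coded 13 keeps the date extraction correct if the glob pattern changes, and selecting the single score column before concatenating avoids the extra column level that the cell above removes with `droplevel`.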

Move the file from the AWS server to local if you want to work locally:

```scp -i ~/.ssh/carolgrrr.pem ubuntu@purpletag.casmlab.org:/data/purpletag/mocs_by_date.pkl ~/Documents/git/casmlab/purpletag/2016-election-study/data-files/```

## Locally: Prepping for stats

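We now have a pickled dataframe of the form handle × date (one row per handle, one column per date). We need to keep data only from Labor Day to Election Day and get weekly averages. The nine weeks used below run from 2016-09-06 through 2016-11-07.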

In [16]:
import pandas as pd

df = pd.read_pickle('data-files/mocs_by_date.pkl')
df.head()

Unnamed: 0,2015-11-10,2015-11-11,2015-11-12,2015-11-13,2015-11-14,2015-11-15,2015-11-16,2015-11-17,2015-11-18,2015-11-19,...,2016-10-30,2016-10-31,2016-11-01,2016-11-02,2016-11-03,2016-11-04,2016-11-05,2016-11-06,2016-11-07,2016-11-08
austinscottga08,,2.40933,,,,,,,,,...,,,,,,,,,,
benniegthompson,,,,,,,,3.85585,,,...,,,,,,,,,,
bettymccollum04,,,,,,,,,,,...,,,-90.838,-61.3435,-40.482,,-33.4513,,,
billpascrell,,,-1.16875,-0.916501,-0.972477,,,,,,...,,,,-17.9542,,,,,,
boblatta,,,0.100723,,,1.70286,,1.56897,,-0.031978,...,,,,,,,,,,


In [17]:
week1_dates = ['2016-09-06','2016-09-07','2016-09-08','2016-09-09','2016-09-10','2016-09-11','2016-09-12']
week2_dates = ['2016-09-13','2016-09-14','2016-09-15','2016-09-16','2016-09-17','2016-09-18','2016-09-19']
week3_dates = ['2016-09-20','2016-09-21','2016-09-22','2016-09-23','2016-09-24','2016-09-25','2016-09-26']
week4_dates = ['2016-09-27','2016-09-28','2016-09-29','2016-09-30','2016-10-01','2016-10-02','2016-10-03']
week5_dates = ['2016-10-04','2016-10-05','2016-10-06','2016-10-07','2016-10-08','2016-10-09','2016-10-10']
week6_dates = ['2016-10-11','2016-10-12','2016-10-13','2016-10-14','2016-10-15','2016-10-16','2016-10-17']
week7_dates = ['2016-10-18','2016-10-19','2016-10-20','2016-10-21','2016-10-22','2016-10-23','2016-10-24']
week8_dates = ['2016-10-25','2016-10-26','2016-10-27','2016-10-28','2016-10-29','2016-10-30','2016-10-31']
week9_dates = ['2016-11-01','2016-11-02','2016-11-03','2016-11-04','2016-11-05','2016-11-06','2016-11-07']

df['week1'] = df[week1_dates].mean(axis=1)
df['week2'] = df[week2_dates].mean(axis=1)
df['week3'] = df[week3_dates].mean(axis=1)
df['week4'] = df[week4_dates].mean(axis=1)
df['week5'] = df[week5_dates].mean(axis=1)
df['week6'] = df[week6_dates].mean(axis=1)
df['week7'] = df[week7_dates].mean(axis=1)
df['week8'] = df[week8_dates].mean(axis=1)
df['week9'] = df[week9_dates].mean(axis=1)

df.head()

Unnamed: 0,2015-11-10,2015-11-11,2015-11-12,2015-11-13,2015-11-14,2015-11-15,2015-11-16,2015-11-17,2015-11-18,2015-11-19,...,2016-11-08,week1,week2,week3,week4,week5,week6,week7,week8,week9
austinscottga08,,2.40933,,,,,,,,,...,,,4.932035,,1.06429,3.6757,6.0426,1.56161,,
benniegthompson,,,,,,,,3.85585,,,...,,,-1.21138,,,,-1.82301,,,
bettymccollum04,,,,,,,,,,,...,,-56.669016,-119.88476,-67.1729,-59.540265,-27.995798,-33.955595,-4.879673,-33.080633,-56.5287
billpascrell,,,-1.16875,-0.916501,-0.972477,,,,,,...,,-2.87143,-2.62381,-2.353348,-2.337897,,-1.02655,,-1.33333,-17.9542
boblatta,,,0.100723,,,1.70286,,1.56897,,-0.031978,...,,1.37997,9.673527,,,,0.974138,,,
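
The nine hand-written date lists above could also be generated. The following is a sketch only, assuming (as in this notebook) that the columns are `YYYY-MM-DD` strings and that every date in the range is present as a column; the function name `add_weekly_means` and the toy data are made up.

```python
import numpy as np
import pandas as pd


def add_weekly_means(df, first_day, n_weeks):
    """Return a copy of df with week1..weekN columns holding 7-day row means."""
    out = df.copy()
    for week in range(n_weeks):
        start = pd.Timestamp(first_day) + pd.Timedelta(days=7 * week)
        days = pd.date_range(start, periods=7).strftime("%Y-%m-%d").tolist()
        # mean() skips NaN by default, so days without a score are ignored;
        # a date that is missing as a column raises KeyError.
        out["week%d" % (week + 1)] = df[days].mean(axis=1)
    return out


# Toy frame: two handles, fourteen days (values are made up).
dates = pd.date_range("2016-09-06", periods=14).strftime("%Y-%m-%d")
data = np.vstack([np.arange(14.0), np.full(14, np.nan)])
data[1, 7] = 4.0  # the second handle has a single score, in week 2
toy = pd.DataFrame(data, index=["alice", "bob"], columns=dates)

weekly = add_weekly_means(toy, "2016-09-06", n_weeks=2)
print(weekly[["week1", "week2"]])
```

For this notebook the equivalent call would be `add_weekly_means(df, "2016-09-06", n_weeks=9)`, which should cover the same dates as `week1_dates` through `week9_dates`.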


In [18]:
weekly_df = df[['week1','week2','week3','week4','week5','week6','week7','week8','week9']]
weekly_df

Unnamed: 0,week1,week2,week3,week4,week5,week6,week7,week8,week9
austinscottga08,,4.932035,,1.064290,3.675700,6.042600,1.561610,,
benniegthompson,,-1.211380,,,,-1.823010,,,
bettymccollum04,-56.669016,-119.884760,-67.172900,-59.540265,-27.995798,-33.955595,-4.879673,-33.080633,-56.528700
billpascrell,-2.871430,-2.623810,-2.353348,-2.337897,,-1.026550,,-1.333330,-17.954200
boblatta,1.379970,9.673527,,,,0.974138,,,
bradsherman,,-1.941180,,-5.381950,,,,,-0.748092
call_me_dutch,-1.312583,-4.765343,-3.129013,-1.573270,-2.967365,-2.715476,-20.466664,-1.576630,-6.732820
candicemiller,1.678195,,0.029657,-0.969076,,1.086960,,,
cathymcmorris,7.926215,3.141282,8.873570,13.830632,1.000000,7.823813,3.747878,11.442992,10.554400
cbrangel,-39.434357,-61.711175,-41.164824,-37.333904,-19.490345,-32.642642,-8.514975,-4.536950,-18.458800


In [19]:
import pandas as pd
import yaml

# get the data from Govtrack
# safe_load avoids executing arbitrary YAML tags; pd.json_normalize needs pandas >= 1.0
with open('/Users/libbyh/Dropbox/CASM/SMCE/Shared Social Media and Civic Engagement/Data/purpletag/legislators-social-media.yaml', 'r') as f:
    df_social = pd.json_normalize(yaml.safe_load(f))

with open('/Users/libbyh/Dropbox/CASM/SMCE/Shared Social Media and Civic Engagement/Data/purpletag/legislators-current.yaml', 'r') as f:
    df_current = pd.json_normalize(yaml.safe_load(f))

print(len(weekly_df))
# merge everything into one data frame with one row per MOC
df_meta = pd.merge(df_current, df_social, on="id.govtrack")
df_meta["handle"] = df_meta["social.twitter"].str.lower()
weekly_df = weekly_df.copy() # explicit copy, so the assignment below does not trigger SettingWithCopyWarning
weekly_df["handle"] = weekly_df.index.str.lower()

print(len(df_meta))

df_merged = pd.merge(df_meta, weekly_df, left_on="handle", right_index=True)

print(len(df_merged))

cols_to_keep = ['id.govtrack','social.twitter','name.official_full','bio.gender','terms','week1','week2','week3','week4','week5','week6','week7','week8','week9']

df_merged = df_merged[cols_to_keep]


511
529
444


Only 444 rows survive the merge (from 511 scored handles and 529 legislator records). Not sure why, but it's better than 12.
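
One way to find out which handles fail to match is an outer merge with `indicator=True`, which labels every row by the frame it came from. This is a sketch on made-up toy frames, not the notebook's data; it assumes the handle is an ordinary column named `handle` in both frames.

```python
import pandas as pd

# Toy stand-ins for the metadata and score frames (all values are made up).
meta = pd.DataFrame({"handle": ["alice", "bob"],
                     "party": ["Democrat", "Republican"]})
scores = pd.DataFrame({"handle": ["alice", "carol"],
                       "week1": [1.0, -2.0]})

# how="outer" keeps unmatched rows from both sides; indicator=True adds a
# "_merge" column with the values left_only, right_only, or both.
check = pd.merge(meta, scores, on="handle", how="outer", indicator=True)

no_scores = check.loc[check["_merge"] == "left_only", "handle"].tolist()
no_metadata = check.loc[check["_merge"] == "right_only", "handle"].tolist()

print("in metadata but never scored:", no_scores)
print("scored but missing from metadata:", no_metadata)
```

Inspecting the second list is the quickest route to spotting systematic causes, such as accounts that were renamed or that belong to members who are not in the current-legislators file.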

In [20]:
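# expand each MOC's list of terms into one column per term (0, 1, 2, ...)
df1 = pd.concat([df_merged.drop(['terms'], axis=1), df_merged['terms'].apply(pd.Series)], axis=1)
# expand the dict in column 0 (the first listed term) into its own columns, including type and party
df2 = pd.concat([df1.drop([0], axis=1), df1[0].apply(pd.Series)], axis=1)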

keep_df = df2[['id.govtrack','social.twitter','name.official_full','bio.gender','type','party','week1','week2','week3','week4','week5','week6','week7','week8','week9']]
keep_df

Unnamed: 0,id.govtrack,social.twitter,name.official_full,bio.gender,type,party,week1,week2,week3,week4,week5,week6,week7,week8,week9
0,400050,SenSherrodBrown,Sherrod Brown,M,rep,Democrat,-0.873145,-1.141480,-1.825096,-21.350100,-0.467001,0.422609,-16.612679,0.359653,-0.367748
1,300018,SenatorCantwell,Maria Cantwell,F,rep,Democrat,-42.572318,-8.845897,-3.729938,-21.854320,-10.628532,-18.030278,-17.871626,-9.655197,-17.863474
2,400064,SenatorCardin,Benjamin L. Cardin,M,rep,Democrat,-8.692816,-4.934183,-5.241676,-15.389205,-1.365323,-8.171898,-9.316775,-5.438023,-9.632692
3,300019,SenatorCarper,Thomas R. Carper,M,rep,Democrat,-38.936383,0.153670,-4.047725,-10.227893,-11.687868,0.652402,-2.603721,-1.830320,-2.459966
4,412246,SenBobCasey,"Robert P. Casey, Jr.",M,sen,Democrat,-19.997507,-4.658190,-1.272941,-21.918950,-1.944227,-3.823387,-12.673860,-7.019990,-36.656500
5,412248,SenBobCorker,Bob Corker,M,sen,Republican,2.096852,5.911753,2.764636,4.465404,,1.819670,,1.748260,
6,300043,SenFeinstein,Dianne Feinstein,F,sen,Democrat,-22.061820,-39.389700,-7.216368,-16.176455,-6.700642,-5.140945,-19.095330,,-19.184040
7,300052,SenOrrinHatch,Orrin G. Hatch,M,sen,Republican,5.009444,11.089169,11.336792,11.149503,6.292435,5.263352,2.093247,19.990525,2.214622
9,412243,McCaskillOffice,Claire McCaskill,F,sen,Democrat,-1.730870,-30.371824,-8.932897,-2.108497,-0.586722,1.789353,-0.857143,-1.657192,-0.748092
10,400272,SenatorMenendez,Robert Menendez,M,rep,Democrat,-11.664095,-8.637506,-0.829845,-19.863147,-1.791103,-16.889765,-13.598838,-3.754033,-29.232743
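
Several sitting senators in the output above (for example Sherrod Brown and Maria Cantwell) carry ```type``` = ```rep```. That is consistent with column 0 holding each legislator's first term rather than the current one. The sketch below shows one way to take the last listed term instead; it assumes the terms list is in chronological order, and the toy records are made up.

```python
import pandas as pd

# Toy legislator records (made up): each has a list of term dicts,
# assumed to be ordered from earliest to latest.
toy = pd.DataFrame({
    "name": ["Legislator A", "Legislator B"],
    "terms": [
        [{"type": "rep", "party": "Democrat"},
         {"type": "sen", "party": "Democrat"}],
        [{"type": "rep", "party": "Republican"}],
    ],
})

# Take the last term of each list and expand its keys into columns.
last_terms = toy["terms"].apply(lambda terms: terms[-1])
latest = pd.DataFrame(last_terms.tolist(), index=toy.index)

toy_latest = toy.drop(columns="terms").join(latest[["type", "party"]])
print(toy_latest)
```

Indexing with `terms[-1]` also avoids creating one wide column per term, which the `apply(pd.Series)` approach does for legislators with long careers.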


In [21]:
# melt it so each row is a person x week
df_long = pd.melt(keep_df, id_vars=['id.govtrack','social.twitter','name.official_full','bio.gender','party','type'],
                value_vars=['week1','week2','week3','week4','week5','week6','week7','week8','week9'],
                var_name='week', value_name='avg_score')
df_long['week'] = df_long['week'].str[len('week'):] # 'week1' -> '1'; unlike str[-1:], this also handles two-digit weeks

df_long.rename(columns = {'type':'chamber', 'social.twitter': 'handle', 'name.official_full': 'name', 'bio.gender': 'gender'}, inplace = True)

df_long.head()

Unnamed: 0,id.govtrack,handle,name,gender,party,chamber,week,avg_score
0,400050,SenSherrodBrown,Sherrod Brown,M,Democrat,rep,1,-0.873145
1,300018,SenatorCantwell,Maria Cantwell,F,Democrat,rep,1,-42.572318
2,400064,SenatorCardin,Benjamin L. Cardin,M,Democrat,rep,1,-8.692816
3,300019,SenatorCarper,Thomas R. Carper,M,Democrat,rep,1,-38.936383
4,412246,SenBobCasey,"Robert P. Casey, Jr.",M,Democrat,sen,1,-19.997507


In [22]:
# make sure we have just two parties
df_long.party.unique()

array(['Democrat', 'Republican'], dtype=object)

In [23]:
# get an absolute value of the polar score
df_long['abs'] = df_long['avg_score'].abs()

In [24]:
df_long.to_csv('data-files/weekly_averages_long.csv')

# Now move to R for analysis

Run ```~/Documents/git/casmlab/purpletag/2016_election.R```.

That R script sends its output to ```2016_election_results.txt```.

In [180]:
# read the R output back in; the context manager closes the file afterwards
with open('data-files/2016_election_results.txt', 'r') as results:
    print(results.read())


> # for pretty regression tables
> # http://stackoverflow.com/questions/30195718/stargazer-save-to-file-dont-show-in-console
> mod_stargazer <- functi .... [TRUNCATED] 

> df <- read.csv('weekly_averages_long.csv', header = TRUE, sep = ",", quote = "\"",
+                dec = ".", fill = TRUE, comment.char = "")

> summary(df)
       X           id.govtrack                 handle                 name      gender  
 Min.   :   0.0   Min.   :300002   AustinScottGA08:   9   Adam B. Schiff:   9   F: 792  
 1st Qu.: 998.8   1st Qu.:400326   BennieGThompson:   9   Adam Kinzinger:   9   M:3204  
 Median :1997.5   Median :412292   BettyMcCollum04:   9   Adam Smith    :   9           
 Mean   :1997.5   Mean   :401868   BillPascrell   :   9   Adrian Smith  :   9           
 3rd Qu.:2996.2   3rd Qu.:412533   BobLatta       :   9   Al Franken    :   9           
 Max.   :3995.0   Max.   :412674   BradSherman    :   9   Al Green      :   9           
                                   (Other)    

Based on the outlier-excluded linear mixed-effects models, it makes sense to remove RepThompson. The pattern stays the same even with RepThompson in the set, though: negative effects for Republican and for week, and a positive effect for their interaction. ```lmm5``` is the best-fitting model. 

## Changing the way we score hashtags

What if we score tags for the 63-day period and then score MOCs?

Run the following (on the server) to get new scores:

* ```purpletag parse -t 63 -d 200```
* ```purpletag score```
* ~~```purpletag score --counts --score-mocs```~~

That first command took a week because the code starts with today and works backwards 200 days, one day at a time. Each day takes over an hour. See Issue #18 about options for changing this behavior.

With the new tag measures, we can start the process over. Start at "On server: Parsing from ```scores``` files to CSV" with a new file name.

The ```--score-mocs``` option didn't work quite right, so we score MOCs here instead.

In [12]:
import io

def get_tag_scores(score_file):
    scores = {}
    with io.open(score_file, encoding='utf8') as f:
        for line in f:
            (key, val) = line.split()
            scores[key] = val
    return scores

def score_mocs(score_file, counts):
    """ Output a file with user scores. """
    scores = get_tag_scores(score_file)

    mocs = {}
    # get the MOC's tags
    for line in inf:
        moc_score = 0
        tags = {}
        parts = line.strip().split()
        handle = parts[0]
        tags = parse_tags(parts[1:], args)

        # calculate a MOC score
        for tag, count in tags.items(): # items() replaces Python 2's iteritems()
            try:
                tag_score = float(scores[tag])
            except KeyError:
                # skip tags that have no score
                continue
            # weight by the tag's count only when counts is true
            moc_score += (tag_score * count) if counts else tag_score
        mocs[handle] = moc_score
    return mocs

In [13]:


# handle2party = twitter_handle_to_party()
# print handle2party.items()[0]

score_file = "data-files/2016-11-09.63.scores"
print('scoring MOCs with', score_file)
score_mocs(score_file, True) # pass a boolean; the string 'true' only worked because non-empty strings are truthy

scoring MOCs with data-files/2016-11-09.63.scores


NameError: name 'inf' is not defined
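
The ```NameError``` arises because ```score_mocs``` reads from ```inf```, a file handle that is never opened in this notebook; ```parse_tags``` and ```args``` are not defined here either. Below is a minimal sketch of a self-contained version. The function names, the ```tags_file``` parameter, and above all the input format (one MOC per line, the handle followed by its hashtags, with a tag repeated once per use) are assumptions made for illustration and may not match what purpletag actually writes.

```python
import io
import tempfile
from collections import Counter
from pathlib import Path


def read_tag_scores(score_file):
    """Map each hashtag to its float score (one "tag score" pair per line)."""
    scores = {}
    with io.open(score_file, encoding="utf8") as f:
        for line in f:
            tag, value = line.split()
            scores[tag] = float(value)
    return scores


def score_mocs_from_file(score_file, tags_file, use_counts=True):
    """Return {handle: score}, summing hashtag scores for each MOC.

    ASSUMPTION: tags_file has one MOC per line, "handle tag tag ...",
    with a tag repeated once per use. The real purpletag format may differ.
    """
    scores = read_tag_scores(score_file)
    mocs = {}
    with io.open(tags_file, encoding="utf8") as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue  # skip blank lines
            handle, tags = parts[0], Counter(parts[1:])
            total = 0.0
            for tag, count in tags.items():
                if tag not in scores:
                    continue  # unscored tags contribute nothing
                total += scores[tag] * (count if use_counts else 1)
            mocs[handle] = total
    return mocs


# Toy input files (contents are made up).
with tempfile.TemporaryDirectory() as tmp:
    score_path = Path(tmp, "toy.scores")
    tags_path = Path(tmp, "toy.tags")
    score_path.write_text("tag_a 2.0\ntag_b -1.5\n", encoding="utf8")
    tags_path.write_text("alice tag_b tag_b unknown\nbob tag_a\n", encoding="utf8")
    weighted = score_mocs_from_file(score_path, tags_path, use_counts=True)
    unweighted = score_mocs_from_file(score_path, tags_path, use_counts=False)

print(weighted)
print(unweighted)
```

Passing the tags file as an explicit argument removes the dependence on a global file handle, and returning the dictionary lets the result feed straight into a DataFrame for the steps earlier in this notebook.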