# Why this notebook?
While running the analysis, I found discrepancies between the list of legislators we collected and the list of legislators we have data for. I can't recreate the collection process from the notes on our servers, so I'm re-collecting the tweets, rescoring them, and rerunning the analysis. This way I can track each step, know exactly which data are included, and see where the decision points are.

## Getting Old Data
Purpletag relies on the [unitedstates project's congress-legislators list](https://github.com/unitedstates/congress-legislators), but that list changes over time, and we aren't currently tracking which commit of the file we use. Instead, I will use the following process:

1. get a specific commit of the legislators-social-media.yaml file to get a list of accounts to collect
1. use a mod of Jefferson Henrique's [GetOldTweets - python](https://github.com/libbyh/GetOldTweets-python) script to collect the ids of the old tweets
1. use the Twitter API to get the tweets as JSON by ID
1. pass those JSON files to [Purpletag](http://github.com/casmlab/purpletag) for scoring
1. analyze as before
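
Step 3 has one practical wrinkle: Twitter's v1.1 `statuses/lookup` endpoint accepts at most 100 tweet IDs per request, so the collected IDs have to be sent in batches. Here is a minimal sketch of that batching, assuming the IDs are already in a list. `chunk_ids` is a hypothetical helper, not part of Purpletag or GetOldTweets, and the API call itself is left out.

```python
def chunk_ids(tweet_ids, batch_size=100):
    """Yield successive batches of at most batch_size tweet IDs."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(tweet_ids), batch_size):
        yield tweet_ids[start:start + batch_size]


# Illustrative IDs only; real IDs would come from the GetOldTweets output.
example_ids = [str(n) for n in range(1, 251)]
batches = list(chunk_ids(example_ids))
print([len(batch) for batch in batches])  # [100, 100, 50]
```

Each batch can then be joined with commas and sent as the `id` parameter of one lookup request.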

In [28]:
# common imports and functions
import pandas as pd
import yaml
import urllib.request
import os
import errno
from pathlib import Path

# set file paths for the right system
git_path = '/Users/libbyh/Documents/git/casmlab/purpletag/'
git_data_path = git_path + '2016-election-study/data-files/'

### Get Specific Commit of legislators-social-media.yaml

The November 16, 2016 commit was the last one before the new House and Senate were sworn in.
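
Naming a commit hash in the raw GitHub URL is what makes this step repeatable: the URL always returns the file as it existed at that commit. One way to keep the hash visible as its own value is sketched here. `pinned_raw_url`, `RAW_HOST`, and `LEGS_COMMIT` are hypothetical names introduced for illustration.

```python
RAW_HOST = "https://raw.githubusercontent.com"


def pinned_raw_url(owner, repo, commit, path):
    """Build a raw GitHub URL for one file at one specific commit."""
    return f"{RAW_HOST}/{owner}/{repo}/{commit}/{path}"


# Commit hash taken from the URL used in this notebook.
LEGS_COMMIT = "a35d649180d55a0b7d1e381e1774d315371a9188"
pinned_url = pinned_raw_url(
    "unitedstates",
    "congress-legislators",
    LEGS_COMMIT,
    "legislators-social-media.yaml",
)
print(pinned_url)
```

Keeping the hash in a named constant also makes it easy to write it out next to the downloaded file, so the data and its source commit stay together.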

In [2]:
legs_yaml = 'https://raw.githubusercontent.com/unitedstates/congress-legislators/a35d649180d55a0b7d1e381e1774d315371a9188/legislators-social-media.yaml'
urllib.request.urlretrieve(legs_yaml, git_data_path + 'legs.yaml')
with open(git_data_path + 'legs.yaml', 'r') as f:
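    # safe_load parses plain YAML without constructing arbitrary Python objects;
    # pd.json_normalize is the top-level location of json_normalize (pandas >= 1.0)
    df_legs = pd.json_normalize(yaml.safe_load(f))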

df_legs.head()

| | id.bioguide | id.govtrack | id.thomas | social.facebook | social.facebook_id | social.instagram | social.instagram_id | social.twitter | social.twitter_id | social.youtube | social.youtube_id |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | R000600 | 412664.0 | 2222 | congresswomanaumuaamata | 1537155909907320 | | | RepAmata | 3026623000.0 | | UCGdrLQbt1PYDTPsampx4t1A |
| 1 | H001070 | 412645.0 | 2260 | RepCresentHardy | 320612381469421 | | | RepHardy | 2964223000.0 | RepHardy | UCc8E6NWCdgrXjBVI2NNPYdA |
| 2 | Y000064 | 412428.0 | 2019 | RepToddYoung | 186203844738421 | | | RepToddYoung | 234128500.0 | RepToddYoung | UCuknj4PGn91gHDNAfboZEgQ |
| 3 | E000295 | 412667.0 | 2283 | senjoniernst | 351671691660938 | senjoniernst | 1582703000.0 | SenJoniErnst | 2856788000.0 | | UCLwrmtF_84FIcK3TyMs4MIw |
| 4 | T000476 | 412668.0 | 2291 | SenatorThomTillis | 1576257352609470 | | | senthomtillis | 2964175000.0 | | UCUD9VGV4SSGWjGdbn37Ea2w |
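
The dotted column names above (`id.bioguide`, `social.twitter`) come from flattening each nested YAML record into a single row. Here is a stdlib-only sketch of that flattening, for illustration. `flatten` is a hypothetical helper that mimics what `json_normalize` does for nested dicts, and the record reuses values from the first row above.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested mappings, carrying the key prefix along.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# One legislator entry, trimmed to a few fields.
record = {
    "id": {"bioguide": "R000600"},
    "social": {"twitter": "RepAmata", "facebook": "congresswomanaumuaamata"},
}
flat_record = flatten(record)
print(sorted(flat_record))
print(flat_record["social.twitter"].lower())
```

A record with no `twitter` key under `social` simply lacks the `social.twitter` key, which is why the resulting column can contain missing values.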


In [18]:
# lowercase the handles and drop legislators with no Twitter account
# (dropna returns a new Series, so the result must be assigned)
series_handles = df_legs["social.twitter"].str.lower().dropna()
print(series_handles.head())
series_handles.to_csv(git_data_path + 'handles.csv', index=False)

0         repamata
1         rephardy
2     reptoddyoung
3     senjoniernst
4    senthomtillis
Name: social.twitter, dtype: object


### Get Old Tweet IDs
Now we pass the Twitter handles we just collected to the GetOldTweets-python script for collection.

In [None]:
get_old_tweets_path = '/Users/libbyh/Documents/git/libbyh/GetOldTweets-python/'
get_old_tweets = get_old_tweets_path + 'Exporter.py'

list_handles = series_handles.tolist()

# getting all tweets for all handles is faster than searching by date, so we're greedy
tweets_path = git_data_path + 'tweets/'
os.makedirs(tweets_path, exist_ok=True)  # make sure the output directory exists
for handle in list_handles:
    output_file = tweets_path + handle + '_tweets.csv'
    %run $get_old_tweets --username $handle --output $output_file
    # break # run once to test
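
`%run` only works inside IPython. If the collection ever needs to run as a plain script, the same loop could be sketched with `subprocess`, assuming `Exporter.py` accepts the `--username` and `--output` flags used above. `build_command` and `collect` are hypothetical helpers. Skipping handles whose CSV already exists is an added convenience for resuming an interrupted run, not something the loop above does.

```python
import subprocess
import sys
from pathlib import Path


def build_command(exporter, handle, output_file):
    """Return the argv list that runs the exporter for one handle."""
    return [
        sys.executable, str(exporter),
        "--username", handle,
        "--output", str(output_file),
    ]


def collect(exporter, handles, out_dir):
    """Run the exporter once per handle, skipping handles already collected."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for handle in handles:
        output_file = out_dir / f"{handle}_tweets.csv"
        if output_file.exists():
            continue  # assumption: an existing file means a finished run
        subprocess.run(build_command(exporter, handle, output_file), check=True)
```

A call such as `collect(get_old_tweets, list_handles, git_data_path + 'tweets/')` would then stand in for the loop. With `check=True`, a failed exporter run stops the collection instead of passing silently.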