# Why this notebook?
While running the analysis, I found discrepancies between the list of legislators we collected and the legislators for whom we actually have data. I can't recreate the original collection process from the notes on our servers, so I'm re-collecting the tweets, rescoring them, and then rerunning the analysis. This way I can keep track of each step, know exactly what data is included, and see where the decision points are.

## Getting Old Data
Purpletag relies on the [unitedstates project's congress-legislators list](https://github.com/unitedstates/congress-legislators), but that list changes over time, and we aren't currently tracking which commit of the file is being used. Instead, I will use the following process:

1. Get a specific commit of the `legislators-social-media.yaml` file to build the list of accounts to collect.
1. Use a modified version of Jefferson Henrique's [GetOldTweets - python](https://github.com/libbyh/GetOldTweets-python) script to collect the ids of the old tweets.
1. Create tags files from the tweets returned.
1. Pass those tags files to [Purpletag](http://github.com/casmlab/purpletag) for scoring.
1. Analyze as before.
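
Step 1 works because GitHub serves raw file contents at a URL that includes the commit hash, so the same URL always returns the same version of the file. Here is a minimal sketch of building such a URL, assuming the `raw.githubusercontent.com/<owner>/<repo>/<commit>/<path>` layout. The helper name `pinned_raw_url` is hypothetical and not part of Purpletag.

```python
def pinned_raw_url(owner, repo, commit, path):
    """Build a raw.githubusercontent.com URL pinned to a single commit.

    Hypothetical helper for illustration only; not part of Purpletag.
    """
    return "https://raw.githubusercontent.com/{}/{}/{}/{}".format(
        owner, repo, commit, path.lstrip("/"))


# The commit hash below is the one used later in this notebook.
legs_url = pinned_raw_url(
    "unitedstates",
    "congress-legislators",
    "a35d649180d55a0b7d1e381e1774d315371a9188",
    "legislators-social-media.yaml",
)
print(legs_url)
```

Recording the hash next to the data files is one way to make clear later which version of the list a given collection run used.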

In [97]:
# common imports and functions
import pandas as pd
import yaml
import urllib.request
import os
import errno
from pathlib import Path
from datetime import datetime, timedelta
from ipywidgets import widgets
import glob
import csv
from collections import Counter

# set file paths for the right system
git_path = '/Users/libbyh/Documents/git/casmlab/purpletag/'
git_data_path = git_path + '2016-election-study/data-files/'

In [None]:
def get_files(subdir, extension):
    """ Return all files in a subdirectory matching this extension. """
    # assumes the subdirectory lives under git_data_path (no config object is defined in this notebook)
    return glob.glob(os.path.join(git_data_path, subdir, '*.' + extension))

### Get Specific commit of legislators-social-media.yaml

The November 16, 2016 commit was the last one before the new House and Senate were sworn in.

In [2]:
legs_yaml = 'https://raw.githubusercontent.com/unitedstates/congress-legislators/a35d649180d55a0b7d1e381e1774d315371a9188/legislators-social-media.yaml'
urllib.request.urlretrieve(legs_yaml, git_data_path + 'legs.yaml')
with open(git_data_path + 'legs.yaml', 'r') as f:
    # safe_load avoids executing arbitrary YAML tags; pd.json_normalize is the pandas >= 1.0 name
    df_legs = pd.json_normalize(yaml.safe_load(f))

df_legs.head()

| | id.bioguide | id.govtrack | id.thomas | social.facebook | social.facebook_id | social.instagram | social.instagram_id | social.twitter | social.twitter_id | social.youtube | social.youtube_id |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | R000600 | 412664.0 | 2222 | congresswomanaumuaamata | 1537155909907320 | | | RepAmata | 3026623000.0 | | UCGdrLQbt1PYDTPsampx4t1A |
| 1 | H001070 | 412645.0 | 2260 | RepCresentHardy | 320612381469421 | | | RepHardy | 2964223000.0 | RepHardy | UCc8E6NWCdgrXjBVI2NNPYdA |
| 2 | Y000064 | 412428.0 | 2019 | RepToddYoung | 186203844738421 | | | RepToddYoung | 234128500.0 | RepToddYoung | UCuknj4PGn91gHDNAfboZEgQ |
| 3 | E000295 | 412667.0 | 2283 | senjoniernst | 351671691660938 | senjoniernst | 1582703000.0 | SenJoniErnst | 2856788000.0 | | UCLwrmtF_84FIcK3TyMs4MIw |
| 4 | T000476 | 412668.0 | 2291 | SenatorThomTillis | 1576257352609470 | | | senthomtillis | 2964175000.0 | | UCUD9VGV4SSGWjGdbn37Ea2w |


In [3]:
series_handles = df_legs["social.twitter"].dropna()
print("MOCs with Twitter handles:", len(series_handles))
print(series_handles.head())
series_handles.to_csv(git_data_path + 'handles.csv', index=False)

MOCs with Twitter handles: 526
0         RepAmata
1         RepHardy
2     RepToddYoung
3     SenJoniErnst
4    senthomtillis
Name: social.twitter, dtype: object


### Get Old Tweet IDs
Now we pass the Twitter handles we just collected to the GetOldTweets-python script for collection.
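
The collection loop skips any handle whose output file already exists, so an interrupted run can pick up where it left off. Here is a minimal standalone sketch of that check, assuming the `<handle>_tweets.csv` naming used in this notebook. The function name `pending_handles` is hypothetical.

```python
import os


def pending_handles(handles, out_dir):
    """Return the handles that do not yet have a <handle>_tweets.csv file in out_dir.

    Hypothetical helper for illustration; it only checks that the file exists,
    not that the earlier collection finished successfully.
    """
    todo = []
    for handle in handles:
        out_file = os.path.join(out_dir, str(handle) + "_tweets.csv")
        if not os.path.exists(out_file):
            todo.append(handle)
    return todo
```

Because the check is existence-only, a file left behind by a failed or partial run would still be treated as done. Deleting such files before re-running is one way to force a fresh collection for those handles.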

In [4]:
# get_old_tweets_path = '/Users/libbyh/Documents/git/libbyh/GetOldTweets-python/'
# get_old_tweets = get_old_tweets_path + 'Exporter.py'

# list_handles = series_handles.tolist()

# # getting all tweets for all handles is faster than searching by date, so we're greedy
# for handle in list_handles: 
#     print("working on", handle)
#     output_file = git_data_path + "tweets/" + str(handle) + '_tweets.csv'
#     if os.path.exists(output_file): # if the file is already there, move on; lets me pick up where i left off when moving
#         continue
#     else:
#         %run $get_old_tweets --username $handle --output $output_file

working on RepAmata
working on RepHardy
working on RepToddYoung
working on SenJoniErnst
working on senthomtillis
working on RepKevinYoder
working on RepJohnYarmuth
working on RepDonYoung
working on Rep_SteveWomack
working on RepWilson
working on RepWebster
working on MarkWarner
working on RobWittman
working on SenWhitehouse
working on PeterWelch
working on RepWalberg
working on RepDWStweets
working on RepWestmoreland
working on RepJoeWilson
working on RepGregWalden
working on RonWyden
working on SenatorWicker
working on RepEdWhitfield
working on MaxineWaters
working on ChrisVanHollen
working on RepVisclosky
working on NydiaVelazquez
working on SenatorTomUdall
working on RepFredUpton
working on RepTipton
working on RepPaulTonko
working on RepDinaTitus
working on CongressmanGT
working on NikiInTheHouse
working on SenatorTester
working on RepMikeTurner
working on TiberiPress
working on SenToomey
working on RepThompson
working on SenJohnThune
working on MacTXPress
working on BennieGThompso

Traceback (most recent call last):
  File "/Users/libbyh/anaconda/envs/purpletag-analysis/lib/python3.5/site-packages/restkit/__init__.py", line 9, in <module>
    from restkit.conn import Connection
  File "/Users/libbyh/anaconda/envs/purpletag-analysis/lib/python3.5/site-packages/restkit/conn.py", line 12, in <module>
    import cStringIO
ImportError: No module named 'cStringIO'


Searching...

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweets/RepMarthaRoby_tweets.csv".
working on RepScottRigell
Searching...

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweets/RepScottRigell_tweets.csv".
working on RepRichmond
Searching...

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweets/RepRichmond_tweets.csv".
working on RepRibble
Searching...

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweets/RepRibble_tweets.csv".
working on RepJimRenacci
Searching...

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweets/RepJimRenacci_tweets.csv".
working on RepTomReed
Searching...

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-file

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweets/GraceNapolitano_tweets.csv".
working on EleanorNorton
Searching...

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweets/EleanorNorton_tweets.csv".
working on SenBillNelson
Searching...

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweets/SenBillNelson_tweets.csv".
working on RepRichardNeal
Searching...

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweets/RepRichardNeal_tweets.csv".
working on RepJerryNadler
Searching...

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweets/RepJerryNadler_tweets.csv".
working on Sen_JoeManchin
Searching...

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/dat

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweets/RepLipinski_tweets.csv".
working on RepStephenLynch
Searching...

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweets/RepStephenLynch_tweets.csv".
working on RepRickLarsen
Searching...

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweets/RepRickLarsen_tweets.csv".
working on JimLangevin
Searching...

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweets/JimLangevin_tweets.csv".
working on RepJohnLarson
Searching...

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweets/RepJohnLarson_tweets.csv".
working on RepLoBiondo
Searching...

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/twe

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweets/GreggHarper_tweets.csv".
working on MazieHirono
Searching...

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweets/MazieHirono_tweets.csv".
working on SenDeanHeller
Searching...

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweets/SenDeanHeller_tweets.csv".
working on RepBrianHiggins
Searching...

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweets/RepBrianHiggins_tweets.csv".
working on RepHensarling
Searching...

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweets/RepHensarling_tweets.csv".
working on RepMikeHonda
Searching...

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tw

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweets/RepAnnaEshoo_tweets.csv".
working on RepEliotEngel
Searching...

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweets/RepEliotEngel_tweets.csv".
working on DesJarlaisTN04
Searching...

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweets/DesJarlaisTN04_tweets.csv".
working on RepJeffDuncan
Searching...

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweets/RepJeffDuncan_tweets.csv".
working on RepSeanDuffy
Searching...

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweets/RepSeanDuffy_tweets.csv".
working on RepJeffDenham
Searching...

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweets/SenDanCoats_tweets.csv".
working on Clyburn
Searching...

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweets/Clyburn_tweets.csv".
working on RepSteveChabot
Searching...

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweets/RepSteveChabot_tweets.csv".
working on SenatorCarper
Searching...

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweets/SenatorCarper_tweets.csv".
working on SenatorCardin
Searching...

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweets/SenatorCardin_tweets.csv".
working on SenatorCantwell
Searching...

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweets/Se

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweets/RepSwalwell_tweets.csv".
working on RepMullin
Searching...

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweets/RepMullin_tweets.csv".
working on RepJoeKennedy
Searching...

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweets/RepJoeKennedy_tweets.csv".
working on SusanWBrooks
Searching...

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweets/SusanWBrooks_tweets.csv".
working on RepWalorski
Searching...

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweets/RepWalorski_tweets.csv".
working on RepTomRice
Searching...

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweets/RepTomRice_

TypeError: catching classes that do not inherit from BaseException is not allowed

working on RepBera
Searching...

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweets/RepBera_tweets.csv".
working on RepBillFoster
Searching...

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweets/RepBillFoster_tweets.csv".
working on RepPittenger
Searching...

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweets/RepPittenger_tweets.csv".
working on RepMarkMeadows
Searching...

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweets/RepMarkMeadows_tweets.csv".
working on RepLoisFrankel
Searching...

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweets/RepLoisFrankel_tweets.csv".
working on RepRichHudson
Searching...

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweets/USRepKCastor_tweets.csv".
working on SenatorIsakson
Searching...

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweets/SenatorIsakson_tweets.csv".
working on RepCardenas
Searching...

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweets/RepCardenas_tweets.csv".
working on RepTimWalz
Searching...

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweets/RepTimWalz_tweets.csv".
working on RepMarkPocan
Searching...

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweets/RepMarkPocan_tweets.csv".
working on RepJuanVargas
Searching...

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweets/Rep

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweets/RepGwenGraham_tweets.csv".
working on RepJohnKatko
Searching...

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweets/RepJohnKatko_tweets.csv".
working on RepRyanZinke
Searching...

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweets/RepRyanZinke_tweets.csv".
working on RepWesterman
Searching...

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweets/RepWesterman_tweets.csv".
working on RepFrenchHill
Searching...

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweets/RepFrenchHill_tweets.csv".
working on RepBrianBabin
Searching...

Done. Output file generated "/Users/libbyh/Documents/git/casmlab/purpletag/2016-election-study/data-files/tweet

### Generate tags files from those tweets
Purpletag's ```parse``` functions create .tags files from JSON returned by the Twitter API. Here, we create similar files based on the CSVs produced by GetOldTweets-python instead.

.tags files are named by date and the number of days included (e.g., 2016-09-22.1.tags). Each line holds one handle followed by its tag counts, like so:

```bettymccollum04 honorourheroesmn2016:1 doyourjob:1 zika:1```
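
To make the line format concrete, here is a minimal sketch of writing and reading one such line. The function names `format_tags_line` and `parse_tags_line` are hypothetical and not part of Purpletag. The sort order (ascending count, then tag name) is an assumption chosen only to make the output deterministic.

```python
from collections import Counter


def format_tags_line(handle, tag_counts):
    """Render one line as 'handle tag:count tag:count ...'."""
    # Assumed ordering: ascending count, ties broken alphabetically by tag.
    pairs = sorted(tag_counts.items(), key=lambda kv: (kv[1], kv[0]))
    return handle + " " + " ".join("{}:{}".format(tag, n) for tag, n in pairs)


def parse_tags_line(line):
    """Split one line back into (handle, Counter of tag counts)."""
    handle, *pairs = line.split()
    counts = Counter()
    for pair in pairs:
        # rpartition splits on the last colon, so the count is always the final piece
        tag, _, count = pair.rpartition(":")
        counts[tag] = int(count)
    return handle, counts


handle, counts = parse_tags_line(
    "bettymccollum04 honorourheroesmn2016:1 doyourjob:1 zika:1")
print(handle, dict(counts))
print(format_tags_line(handle, counts))
```

A round trip through both functions preserves the handle and the counts but not necessarily the original tag order.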

In [124]:
# generate tag files for 9/5/2016 - 11/8/2016
csv_path = git_data_path + 'tweets/'

# loop through CSV files just generated
files = glob.glob(csv_path + '*.csv')
# files = files[:10] # use first 10 for testing

def get_tweet_data(file, start_date, end_date):
    """ Return the tweets in one CSV that fall in the date range and have hashtags. """
    with open(file, 'r') as f:
        tweets = pd.read_csv(f, header=0, sep=';', parse_dates = True, quoting=csv.QUOTE_NONE, usecols=["date","hashtags","permalink"])
    # parse permalink (https://twitter.com/<handle>/status/<tweet_id>) to get username and tweet id
    parts = tweets['permalink'].str.split('/', n=5, expand=True)
    tweets['handle'] = parts[3]
    tweets['tweet_id'] = parts[5]
    # just the dates we care about, and just the tweets with tags
    in_range = (tweets.date >= start_date) & (tweets.date <= end_date) & (tweets.hashtags.notnull())
    tweets_in_range = tweets[in_range].copy() # copy so the assignment below doesn't write to a slice
    tweets_in_range['handle_lower'] = tweets_in_range['handle'].str.lower()

    return tweets_in_range

def get_tag_list(df):
    all_tags = []
    tweet_tags = df['hashtags'].tolist()
    for tag_list in tweet_tags:
        tags = tag_list.split(' ')
        all_tags.append(tags)
    flat_list = [item for sublist in all_tags for item in sublist]
    flat_list = [element.lower() for element in flat_list]
    
    return flat_list

def write_tags_file(outfile, handle, tag_list):
    with open(outfile, 'a+') as file:
        file.write(handle + ' ')
        file.write(' '.join('%s:%d' % (x[0], x[1]) for x in sorted(tag_list.items(), key=lambda x: x[1])))
        file.write(u'\n')

# for each file, generate a line like
#   handle tag:count tag:count tag:count
no_tweets = []
no_tags = []

for file in files:    
    handle = os.path.splitext(os.path.basename(file))[0].replace("_tweets","").lower()
    try:
        df_tweets = get_tweet_data(file, '2016-09-05', '2016-11-08') # get the tweets we care about
        tag_list = get_tag_list(df_tweets) # generate a list of tags in those tweets
        if len(tag_list) > 0:
            write_tags_file(git_data_path + 'election.tags', handle, Counter(tag_list)) # write to the big tags file
        else:
            no_tags.append(handle)
    except Exception: # CSV was empty or could not be parsed; don't swallow KeyboardInterrupt
        no_tweets.append(handle)
        
print("no tweets for", no_tweets)
print("no tags for", no_tags)
    

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


no tweets for ['repboustany', 'repcorrinebrown', 'repduckworth', 'repgwengraham', 'repjoeheck', 'repjoepitts', 'repkirkpatrick', 'repmattsalmon', 'repmickmulvaney', 'repmikepompeo', 'repmiketurner', 'repmurphyfl', 'reprobbishop', 'repsamfarr', 'reptoddyoung', 'senatorsessions', 'senatortester', 'sentoomey', 'tiberipress', 'yvetteclarke']
no tags for ['andercrenshaw', 'blumenauermedia', 'conawaytx11', 'congressmandan', 'grahamblog', 'luisgutierrez', 'randy_forbes', 'rep_hunter', 'rep_janicehahn', 'repamata', 'repbecerra', 'repcheri', 'repdannydavis', 'repdavid', 'repedwhitfield', 'repfranklucas', 'repkaygranger', 'repmarktakai', 'reppittenger', 'reprichnugent', 'reprussell', 'repstutzman', 'repwestmoreland', 'sencoonsoffice', 'senkaineoffice', 'stevescalise']
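
Since this notebook exists to track which legislators end up in the data, it can help to compare the handles we set out to collect with the handles that made it into the tags file. Here is a minimal sketch of that comparison. The function name `coverage_report` is hypothetical, and the sample tags lines are illustrative values, not results from this collection.

```python
def coverage_report(expected_handles, tags_lines):
    """Compare expected handles with the handles present in tags-file lines.

    Returns (missing, unexpected) as sorted lists of lowercased handles.
    Hypothetical helper for illustration only.
    """
    expected = {h.lower() for h in expected_handles}
    # the handle is the first whitespace-separated token on each non-empty line
    found = {line.split()[0].lower() for line in tags_lines if line.strip()}
    return sorted(expected - found), sorted(found - expected)


# Illustrative inputs only; the tag counts below are made up.
missing, unexpected = coverage_report(
    ["RepAmata", "RepHardy", "SenJoniErnst"],
    ["rephardy zika:2", "senjoniernst doyourjob:1"],
)
print("missing from tags file:", missing)
print("in tags file but not expected:", unexpected)
```

If every handle is accounted for, the missing list should match the combined "no tweets" and "no tags" lists above. Any other handle in it would point to a step where an account was dropped.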
