## A bit more interesting

I want to breifly talk about two other ways to use representational learning in addtion to the first:

1. Representations as a byproduct of prediction
2. Hand crafted representations
3. Prediction as a byproduct of representations

We covered the first in WordRepresentations notebook. The second we won't cover here, but I took inspiration from this homework from a class I took a long time ago: https://github.com/cs109/content/blob/master/HW5_solutions.ipynb

In this homework we used graph algorithms to hand craft a representation of senators, and lo and behold they turned out to be quite partison. So if you want a great example of tactic 2, then check that out.

The third tactic is what we will do below. We will create a prediciton problem that we don't really care about. But this problem if solved by representations will create quite useful ones. The classic example of this is word2vec.

We will walk through much more of the details this time because we are fresh ground.

In [1]:
import requests
from pattern import web
import json

In [2]:
def get_senate_vote(congress, year, vote):
    url = 'http://www.govtrack.us/data/congress/{}/votes/{}/s{}/data.json'.format(congress, year, vote)
    page = requests.get(url).text
    return json.loads(page)

In [3]:
def get_all_votes(congress, year):
    page = requests.get('https://www.govtrack.us/data/congress/{}/votes/{}/'.format(congress, year)).text
    dom = web.Element(page)
    votes = [a.attr['href'] for a in dom.by_tag('a') 
             if a.attr.get('href', '').startswith('s')]
    n_votes = len(votes)
    votes_on_bills = []
    for i in range(1, n_votes + 1):
        vote = get_senate_vote(congress, year, i)
        if 'bill' in vote:
            votes_on_bills.append(vote)
    
    return votes_on_bills

The above two functions will scrape a website that keeps track of how US government votes go. I have already scraped it, but for people curious as to how I got the data, you can check out above.

In [4]:
# vote_data_113_2013 = get_all_votes(113, 2013)
# vote_data_113_2014 = get_all_votes(113, 2014)
# vote_data_114_2015 = get_all_votes(114, 2015)
# vote_data_114_2016= get_all_votes(114, 2016)

# all_vote_data = vote_data_113_2013 + vote_data_113_2014 + vote_data_114_2015 + vote_data_114_2016

# with open('all_vote_data.json', 'w') as outfile:
#     json.dump(all_vote_data, outfile)
    
all_vote_data = json.load(open('all_vote_data.json'))

In [5]:
all_vote_data[0]

{u'bill': {u'congress': 113,
  u'number': 15,
  u'title': u'A resolution to improve procedures for the consideration of legislation and nominations in the Senate.',
  u'type': u'sres'},
 u'category': u'passage',
 u'chamber': u's',
 u'congress': 113,
 u'date': u'2013-01-24T19:54:00-05:00',
 u'number': 1,
 u'question': u'On the Resolution S.Res. 15',
 u'record_modified': u'2013-01-24T20:38:00-05:00',
 u'requires': u'3/5',
 u'result': u'Resolution Agreed to',
 u'result_text': u'Resolution Agreed to (78-16, 3/5 majority required)',
 u'session': u'2013',
 u'source_url': u'http://www.senate.gov/legislative/LIS/roll_call_votes/vote1131/vote_113_1_00001.xml',
 u'subject': u'S. Res. 15',
 u'type': u'On the Resolution',
 u'updated_at': u'2016-12-25T10:01:28-05:00',
 u'vote_id': u's1-113.2013',
 u'votes': {u'Nay': [{u'display_name': u'Crapo (R-ID)',
    u'first_name': u'Mike',
    u'id': u'S266',
    u'last_name': u'Crapo',
    u'party': u'R',
    u'state': u'ID'},
   {u'display_name': u'Cruz (R-

You can see that we have the bill and all the votes that it got from various senators. In addition to this information we will want to find out one more bit of info, who sponsored the bill?

In [6]:
def get_senate_bill(bill_type, bill_number):
    url = 'http://www.govtrack.us/data/congress/113/bills/{}/{}{}/data.json'.format(bill_type, bill_type, bill_number)
    page = requests.get(url).text
    return json.loads(page)

In [8]:
# bill_data = []
# for vote in all_vote_data:
#     if 'bill' in vote:
#         bill_type = vote['bill']['type']
#         bill_number = vote['bill']['number']
#         bill = get_senate_bill(bill_type, bill_number)
#         bill['id'] = '{}{}'.format(bill_type, bill_number)
#         bill_data.append(bill)
        
# with open('bill_data.json', 'w') as outfile:
#     json.dump(bill_data, outfile)
    
bill_data = json.load(open('bill_data.json'))

In [9]:
bill_data[0]

{u'actions': [{u'acted_at': u'2013-01-24',
   u'references': [{u'reference': u'CR S293',
     u'type': u'text of measure as introduced'}],
   u'text': u'Submitted in the Senate.',
   u'type': u'action'},
  {u'acted_at': u'2013-01-24',
   u'references': [{u'reference': u'CR S270-274', u'type': u'consideration'},
    {u'reference': u'CR S293', u'type': u'text of measure as introduced'}],
   u'text': u'Measure laid before Senate by unanimous consent.',
   u'type': u'action'},
  {u'acted_at': u'2013-01-24',
   u'how': u'roll',
   u'references': [{u'reference': u'CR S272', u'type': u'text'}],
   u'result': u'pass',
   u'roll': u'1',
   u'status': u'PASSED:SIMPLERES',
   u'text': u'Resolution agreed to in Senate, under the order of 1/24/2012, having achieved 60 votes in the affirmative, without amendment by Yea-Nay Vote. 78 - 16. Record Vote Number: 1.',
   u'type': u'vote',
   u'vote_type': u'vote',
   u'where': u's'}],
 u'amendments': [{u'amendment_id': u'samdt3-113',
   u'amendment_type':

Again we get a ton of information. But we are just interested in who sponsored it.

We will then map each senator to an ID, just like we did with words:

In [10]:
senators = []
for vote in all_vote_data:
    for sen in vote['votes']['Nay']:
        senators.append(sen['last_name'] + ', ' + sen['state'])
        
    for sen in vote['votes']['Yea']:
        senators.append(sen['last_name'] + ', ' + sen['state'])
        
# leave the first two blank for padding and not senators
senator_to_id = {k: v + 2 for v, k in enumerate(set(senators))}

We will convert all the sponsors and cosponsors into IDs:

In [11]:
# for each bill, pull out it's sponsor and co-sponsors
def unique_name(senator):
    last_name = senator['name'].split(',')[0]
    return '{}, {}'.format(last_name, senator['state'])

def get_sen_id(sen):
    if sen not in senator_to_id:
        return 1
    return senator_to_id[sen]

bill_dict = {}
for bill in bill_data:
    bill_dict[bill['id']] = {
        'sponsor': get_sen_id(unique_name(bill['sponsor'])),
        'cosponsors': [get_sen_id(unique_name(cosponsor)) for cosponsor in bill['cosponsors']]
    }

And finally we will make our data. So we are really interested in representing our senators, but for an ML algorithm to learn that, it needs a goal to acheive with the representations aka a procedure to determine if the representation is good. 

Our prediciton problem will be: can we predict a senator's vote based on who sponsored it?

Notice the prediciton problem is composed of an interaction of representations (if representations don't interact the problem becomes too simple). Now if we were truely interested in the prediciton problem we would include a ton more features: the age of the sentator, whether they are rep or a dem. But we are interested in the representation. So let's get our data:

In [12]:
senator_vote_data = []
id_to_displayname = {}

for vote in all_vote_data:
    bill_type = vote['bill']['type']
    bill_number = vote['bill']['number']
    bill_id = '{}{}'.format(bill_type, bill_number)
    if bill_id in bill_dict:
        bill_sponsors = bill_dict[bill_id]
        sponsor = get_sen_id(bill_sponsors['sponsor'])
        cosponsors = [get_sen_id(cosponsor) for cosponsor in bill_sponsors['cosponsors']]
    else:
        continue
    
    for sen in vote['votes']['Nay']:
        senator_id = get_sen_id(sen['last_name'] + ', ' + sen['state'])
        id_to_displayname[senator_id] = sen[u'display_name']
        senator_vote_data.append((0, senator_id, sponsor, cosponsors)) 
        
    for sen in vote['votes']['Yea']:
        senator_id = get_sen_id(sen['last_name'] + ', ' + sen['state'])
        id_to_displayname[senator_id] = sen[u'display_name']
        senator_vote_data.append((1, senator_id, sponsor, cosponsors)) 

In [14]:
print len(senator_vote_data)
senator_vote_data[0]

69856


(0, 106, 1, [1, 1])

~70k examples of (vote, senator voting, sponsor, cosponsor) tuples is pretty good (we could of course scrape more). 

In [15]:
# we extract our target 
y = [d[0] for d in senator_vote_data]

In [16]:
# we extract our features
import numpy as np

# again we pad
def pad_or_crop(lst, l=10):
    return (lst + [0] * l)[:10]

x_1 = np.array(map(lambda x: x[1], senator_vote_data))
x_2 = np.array(map(lambda x: x[2], senator_vote_data))
x_3 = np.array(map(lambda x: pad_or_crop(x[3]), senator_vote_data))
x = [x_1, x_2, x_3]

In [17]:
# we add in padding and unknown senators
id_to_displayname[0] = '<PAD>'
id_to_displayname[1] = '<NOT A SENATOR>'

In [18]:
# this gives us how many representations:
len(id_to_displayname)

120

In [19]:
# we again need to write down the metadata
import csv

with open('senator_metadata.csv', 'wb') as csvfile:
    writer = csv.writer(csvfile, delimiter='\t')
       
    for key, value in sorted(id_to_displayname.items()):
        writer.writerow([value.encode('utf8')])


In [20]:
# finally we build our model
from keras.layers import concatenate
from keras.layers import Dense, Input, Flatten
from keras.layers import MaxPooling1D, Embedding

embedding_layer = Embedding(len(id_to_displayname), 100)

# train a 1D convnet with global maxpooling
voting = voting_input = Input(shape=(1,), dtype='int32')
voting = embedding_layer(voting)
voting = Dense(32, activation='relu')(voting)
voting = Dense(32, activation='relu')(voting)

sponsor = sponsor_input = Input(shape=(1,), dtype='int32')
sponsor = embedding_layer(sponsor)
sponsor = Dense(32, activation='relu')(sponsor)
sponsor = Dense(32, activation='relu')(sponsor)

cosponsor = cosponsor_input = Input(shape=(10,), dtype='int32')
cosponsor = embedding_layer(cosponsor)
cosponsor = MaxPooling1D(10)(cosponsor)
cosponsor = Dense(32, activation='relu')(cosponsor)
cosponsor = Dense(32, activation='relu')(cosponsor)

combined = concatenate([voting, sponsor, cosponsor])
combined = Dense(32, activation='relu')(combined)
combined = Dense(1, activation='sigmoid')(combined)

Using TensorFlow backend.


In [25]:
from keras.models import Model
from keras.callbacks import TensorBoard

model = Model([voting_input, sponsor_input, cosponsor_input], combined)

model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])

embedding_metadata = {
    embedding_layer.name: 'senator_metadata.csv'
}

model.fit([x_1, x_2, x_3], np.array(y).reshape(-1, 1, 1),
          batch_size=128,
          epochs=10,
          validation_split=0.2,
          callbacks=[TensorBoard(log_dir='senator_reps', embeddings_freq=1, embeddings_metadata=embedding_metadata)])

Train on 55884 samples, validate on 13972 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x12c316fd0>