# Representation Learning at the Houses of the Oireachtas in Ireland




This notebook runs representation learning on voting data from the Houses of the Oireachtas in Ireland.

It is based on Nathaniel Tucker's Senator Representations notebook for the US Congress:

https://github.com/knathanieltucker/tf-keras-tutorial/blob/master/SenatorRepresentations.ipynb

The Oireachtas is the legislature of the Republic of Ireland.

The Oireachtas consists of:
- The President of Ireland
- The two houses of the Oireachtas:
    - Dáil Éireann (lower house)
    - Seanad Éireann (upper house)
    
Further info: https://en.wikipedia.org/wiki/Oireachtas

Information regarding the Houses of the Oireachtas is featured on their website, including legislation, and is the copyright of the Houses of the Oireachtas: https://beta.oireachtas.ie/. 

Another interesting resource is https://www.kildarestreet.com/ which is a searchable archive of everything that's been said in the Dáil and all written parliamentary questions since January 2004, everything in the Seanad since September 2002, and all Committee meetings since September 2012.

Open Data from the Houses of the Oireachtas can be accessed via:
https://beta.oireachtas.ie/en/open-data/

This includes a link to the open data APIs via a Swagger UI:
https://api.oireachtas.ie/

A vote in the Houses of the Oireachtas is also called a division. We will be looking at the divisions (votings) table for both houses, the Dáil and the Seanad (Senate):

In [1]:
import requests # http://docs.python-requests.org/
from BeautifulSoup import BeautifulSoup
import json
import time
import csv
import numpy as np

In [2]:
def get_votings(chamber = 'seanad'):  # chamber: 'seanad' or 'dail' if chamber_type: 'house'
    
    votings_data = []

    table = "divisions" # votings table
    url = 'https://api.oireachtas.ie/v1/{}'.format(table)
    date_start = '1900-01-01'
    date_end = '2017-12-31'
    chamber_type = 'house' # 'committee' or 'house'

    batch = 500
    skip = 0
    limit = batch
    count = 1 # start

    while count > 0:

        params = dict(chamber_type=chamber_type,
                      chamber=chamber, 
                      date_start=date_start, 
                      date_end=date_end,
                      skip=skip, 
                      limit=limit)
        r = requests.get(url, params=params)

        print r.url
        # print "Status Code:", r.status_code
        # print "Headers:", r.headers

        # Add data from this batch
        contents = r.json()

        if 'message' in contents and contents['message'] == 'server error':
            print 'ERROR RETRIEVING DATA'
            return votings_data

        # Results    
        votings = contents['results'][:batch] # in order not to have duplicates
        votings_data.extend(votings)

        # Update count, more to retrieve?
        count = len(votings)
        print 'Number retrieved', count
        # Advance the offset only; the page size (limit) stays fixed at `batch`
        skip += batch

        time.sleep(1)
        
    return votings_data
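
The skip/limit paging used by `get_votings` can be tried out without network access. The sketch below is a minimal illustration, assuming a hypothetical `fetch_page(skip, limit)` callable that stands in for the `requests.get` call. It keeps the page size fixed and stops at the first empty page. The names `paginate`, `fake_fetch` and `fake_table` are made up for this example.

```python
def paginate(fetch_page, batch=500):
    """Collect all records from a skip/limit style endpoint.

    fetch_page is a hypothetical stand-in for the HTTP call: it takes
    (skip, limit) and returns a list of at most `limit` records.
    """
    records = []
    skip = 0
    while True:
        page = fetch_page(skip, batch)
        if not page:            # an empty page means nothing is left
            break
        records.extend(page)
        skip += len(page)       # advance by what was actually received
    return records


# Fake data source with 1234 records, used only for this illustration.
fake_table = list(range(1234))

def fake_fetch(skip, limit):
    return fake_table[skip:skip + limit]

all_records = paginate(fake_fetch, batch=500)
print(len(all_records))
```

Because the offset advances by the number of records received, no record is requested twice and no slicing is needed to remove duplicates.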

### Seanad Éireann (upper house)

In [3]:
# seanad_data = get_votings()

In [4]:
# with open('data/congress/Ireland/seanad_data.json', 'w') as outfile:
#      json.dump(seanad_data, outfile)

In [5]:
seanad_data = json.load(open('data/congress/Ireland/seanad_data.json'))

How many divisions are there in the Seanad data?

In [6]:
len(seanad_data)

3530

In [7]:
# seanad_data[0]

### Dáil Éireann (lower house)

In [8]:
# dail_data = get_votings(chamber = 'dail')

In [9]:
# with open('data/congress/Ireland/dail_data.json', 'w') as outfile:
#      json.dump(dail_data, outfile)

In [10]:
dail_data = json.load(open('data/congress/Ireland/dail_data.json'))

In [11]:
len(dail_data)

5000

**Note**

The division count for the Dáil is actually 7,329, but the API only let me retrieve the first 5,000 divisions.
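
One possible workaround, untested against the live API, is to query smaller date windows and combine the results. This assumes the cap applies per query rather than per chamber, and it would require turning `date_start` and `date_end` into parameters of `get_votings`. The hypothetical helper `year_windows` below only builds the date pairs.

```python
def year_windows(first_year, last_year):
    """Return (date_start, date_end) strings, one pair per calendar year."""
    return [('%d-01-01' % year, '%d-12-31' % year)
            for year in range(first_year, last_year + 1)]

windows = year_windows(2015, 2017)
for date_start, date_end in windows:
    print(date_start + ' ' + date_end)
```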

In [12]:
# dail_data[0]

Verify there are no duplicates:

In [13]:
def verify_no_voting_duplicates(vote_data):
    votes = []
    duplicates = []
    for vote in vote_data:
        uri = vote["division"]["uri"]
        if uri not in votes:
            # print uri
            votes.append(uri)
        else:
            duplicates.append(uri)
    return len(duplicates)
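
To see which divisions are duplicated rather than only how many, the same check can be written with a `Counter`. This is a minimal sketch on made-up records that contain only the `division` and `uri` fields used above. `find_duplicate_uris` is a hypothetical helper.

```python
from collections import Counter

def find_duplicate_uris(vote_data):
    """Return the division URIs that occur more than once."""
    counts = Counter(vote["division"]["uri"] for vote in vote_data)
    return sorted(uri for uri, n in counts.items() if n > 1)

# Made-up records that mimic only the fields used above.
sample = [
    {"division": {"uri": "example/division/1"}},
    {"division": {"uri": "example/division/2"}},
    {"division": {"uri": "example/division/1"}},
]
duplicates = find_duplicate_uris(sample)
print(duplicates)
```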

In [14]:
verify_no_voting_duplicates(seanad_data)

0

In [15]:
verify_no_voting_duplicates(dail_data)

0

Get members from the votings:

In [16]:
def get_members(vote_data):
    members = []
    for vote in vote_data:
        div = vote["division"]
        for outcome, voting in div["tallies"].iteritems():            
            if voting is not None:
                # print voting["showAs"] # Tá (YES), Níl (NO), Staon (Abstention)
                # print len(voting["members"])
                for member in voting["members"]:
                    member = member["member"]
                    #last_name, first_name = member["showAs"].split(",")
                    #first_name = first_name[:-1].strip()
                    #last_name = last_name.strip()
                    member_code = member["memberCode"]
                    # print member_code
                    if member_code is not None:
                        members.append(member_code)
    return members
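
Since `get_members` returns one entry per vote cast, the same traversal can also show how active each member was. The sketch below runs on a made-up record that mimics only the fields read above. The member codes and the key `otherTally` are placeholders, not real API values.

```python
from collections import Counter

# Made-up record that mimics only the fields read by get_members;
# the key "otherTally" is a placeholder, not a real API field name.
sample_vote = {"division": {"tallies": {
    "taVotes": {"members": [
        {"member": {"memberCode": "Member-A.S.2016-04-25"}},
        {"member": {"memberCode": "Member-B.S.2016-04-25"}},
    ]},
    "nilVotes": {"members": [
        {"member": {"memberCode": "Member-C.S.2016-04-25"}},
        {"member": {"memberCode": None}},   # skipped, as in get_members
    ]},
    "otherTally": None,                     # an empty tally is None
}}}

def member_codes(vote):
    """Yield every non-empty member code recorded in one division."""
    for tally in vote["division"]["tallies"].values():
        if tally is None:
            continue
        for entry in tally["members"]:
            code = entry["member"]["memberCode"]
            if code is not None:
                yield code

# Count how many divisions each member took part in.
participation = Counter(code for vote in [sample_vote, sample_vote]
                        for code in member_codes(vote))
print(participation.most_common(1)[0][1])
```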

**Senators**

In [17]:
senators = get_members(seanad_data)

In [18]:
senators[3]

u'Paudie-Coffey.S.2007-07-23'

In [19]:
len(set(senators))

197

**Deputies**

In [20]:
deputies = get_members(dail_data)

In [21]:
deputies[2]

u'Thomas-P-Broughan.D.1992-12-14'

In [22]:
len(set(deputies))

372

All members:

In [23]:
all_members = list(senators) # copy list
all_members.extend(deputies)

In [24]:
len(set(all_members))

478

In [25]:
# for v, k in enumerate(set(all_members)):
#     print v, k

In [26]:
# reserve ids 0 and 1 for padding and "not a member"
member_to_id = { k: v + 2 for v, k in enumerate(set(all_members)) }

In [27]:
# member_to_id

In [28]:
senator_to_id = { k: v + 2 for v, k in enumerate(set(senators)) }
deputy_to_id = { k: v + 2 for v, k in enumerate(set(deputies)) }

In [29]:
len(senator_to_id), len(deputy_to_id)

(197, 372)
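
The ids above depend on the iteration order of a `set`, which is not guaranteed to be the same across interpreter runs, so saved metadata files and trained embeddings may not line up after a restart. Below is a minimal sketch of a reproducible alternative, assuming that sorting the member codes is acceptable. `build_id_map` is a hypothetical helper, not part of the notebook.

```python
def build_id_map(codes, offset=2):
    """Map each distinct code to a stable integer id, starting at `offset`."""
    return {code: index + offset
            for index, code in enumerate(sorted(set(codes)))}

codes = ['Paudie-Coffey.S.2007-07-23',
         'Victor-Boyhan.S.2016-04-25',
         'Paudie-Coffey.S.2007-07-23']
id_map = build_id_map(codes)
print(sorted(id_map.items()))
```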

In [30]:
def get_member_id(name, collection):
    name = name.replace(" ", "-").replace(".", "")
    sens = [k for k, v in collection.iteritems() if k is not None and k.startswith(name)]
    if len(sens) > 0:
        return collection[sens[0]]
    return 1

In [31]:
get_member_id('Aideen Hayden', member_to_id)

445

In [32]:
get_member_id('Aideen Hayden', senator_to_id), get_member_id('Aideen Hayden', deputy_to_id)

(133, 1)
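
The lookup works by turning a display name into the prefix of a member code and returning 1 ("not a member") when nothing matches. The sketch below shows the same idea on a toy dictionary. The ids are made up, and `lookup` is a hypothetical variant that sorts the matches so the result does not depend on dictionary order.

```python
def to_code_prefix(name):
    """'Thomas P. Broughan' -> 'Thomas-P-Broughan'"""
    return name.replace(" ", "-").replace(".", "")

def lookup(name, collection, not_found=1):
    """Return the id of the first member code starting with the name prefix."""
    prefix = to_code_prefix(name)
    matches = sorted(code for code in collection if code.startswith(prefix))
    return collection[matches[0]] if matches else not_found

# Toy mapping: the codes appear earlier in the notebook, the ids are invented.
toy_ids = {'Paudie-Coffey.S.2007-07-23': 2,
           'Thomas-P-Broughan.D.1992-12-14': 3}

print(lookup('Paudie Coffey', toy_ids))
print(lookup('Thomas P. Broughan', toy_ids))
print(lookup('Jane Doe', toy_ids))
```

Because this is a prefix match, a short or partial name can match the wrong member, which is worth keeping in mind when names are extracted from free text.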

In [33]:
def get_member_party(name):
    page = requests.get('https://beta.oireachtas.ie/en/members/member/%s/' % (name)).text
    # dom = web.Element(page)
    # party = dom.content.split('<p class="bio-text">Party:</p> <p>')[1].split('</p')[0].strip()
    bs = BeautifulSoup(page)
    party = bs.find(text="Party:").findNext('p').contents[0].strip()
    return party

In [34]:
all_members[0], get_member_party(all_members[0])

(u'Victor-Boyhan.S.2016-04-25', u'Independent')

In [35]:
all_members[17], get_member_party(all_members[17])

(u'Brian-\xd3-Domhnaill.S.2007-07-23', u'Fianna F\xe1il')

In [36]:
parties = {}
for member in set(all_members):
    parties[member] = get_member_party(member)

In [37]:
# parties

In [38]:
set(val for val in parties.values())

{u'Anti-Austerity Alliance - People Before Profit',
 u'Ceann Comhairle',
 u'Fianna F\xe1il',
 u'Fine Gael',
 u'Green Party',
 u'Independent',
 u'Independents 4 Change',
 u'Labour',
 u'Labour Party',
 u'Progressive Democrats',
 u'Renua',
 u'Sinn F\xe9in',
 u'Social Democrats',
 u'Socialist Party',
 u'Solidarity - People Before Profit'}

In [39]:
from collections import Counter
Counter([v for v in parties.values()])

Counter({u'Anti-Austerity Alliance - People Before Profit': 2,
         u'Ceann Comhairle': 1,
         u'Fianna F\xe1il': 165,
         u'Fine Gael': 122,
         u'Green Party': 12,
         u'Independent': 64,
         u'Independents 4 Change': 3,
         u'Labour': 12,
         u'Labour Party': 45,
         u'Progressive Democrats': 9,
         u'Renua': 3,
         u'Sinn F\xe9in': 33,
         u'Social Democrats': 2,
         u'Socialist Party': 1,
         u'Solidarity - People Before Profit': 4})
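
The list contains what look like two labels for the same party (`Labour` and `Labour Party`), and `Anti-Austerity Alliance - People Before Profit` appears to be an earlier name of `Solidarity - People Before Profit`. If these should be treated as one group when labelling the embeddings, a small alias table is enough. This is a sketch under that assumption. Merging labels is a modelling choice, and `PARTY_ALIASES` and `normalise_party` are hypothetical names.

```python
from collections import Counter

# Assumed aliases; merging them is a modelling choice, not something the API does.
PARTY_ALIASES = {
    u'Labour': u'Labour Party',
    u'Anti-Austerity Alliance - People Before Profit':
        u'Solidarity - People Before Profit',
}

def normalise_party(party):
    """Map an alias to its canonical label; leave other labels unchanged."""
    return PARTY_ALIASES.get(party, party)

# Toy input with invented member keys.
toy_parties = {'member-1': u'Labour',
               'member-2': u'Labour Party',
               'member-3': u'Fine Gael'}
counts = Counter(normalise_party(p) for p in toy_parties.values())
print(counts[u'Labour Party'])
```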

**Extracting sponsors and members against a bill**

The following function extracts the sponsors and opponents of a bill from a line of manually entered text that comes in several formats:

In [40]:
# The sponsors and opponents of a bill are given in one line of text, from which we extract:
# 2 members who sponsored the bill
# 2 members who opposed the bill
def get_sponsors(tellers_data, collection):
    
    #print tellers_data
    
    sp_first, sp_second, ag_first, ag_second = None, None, None, None
    
    if ";" in tellers_data:
        
        tellers = tellers_data.split(";")
        
        for teller in tellers:
            
            if teller.strip() == "":
                continue
            if "Tellers:" in teller:
                teller = teller.split("Tellers:")[1].strip()
            elif ":" in teller:
                teller = teller.split(":")[1].strip()

            if "," in teller:
                values = teller.split(",")
            elif ":" in teller:
                values = teller.split(":")
            else:
                #print "Bad format"
                continue
            if len(values) != 2:
                #print "BAD format"
                continue
                
            support, senators = values

            if "Senators" in senators:
                senators = senators.split("Senators")
                senators = senators[-1]
                
            if " and" in senators:
                senators = senators.split(" and")
            elif "and " in senators:
                senators = senators.split("and ")
            elif "an d" in senators:  # tolerate the typo "an d"
                senators = senators.split("an d")
            else:
                senators = senators.split("and")
            
            if len(senators) > 1:
                first_senator, second_senator = senators[:2]
                second_senator = second_senator.strip()
            else:
                first_senator = senators[0]
                first_senator = first_senator.strip()
                second_senator = "None"
            
            support = support.strip()

            if support.encode('utf-8') == str('T\xc3\xa1'):
                sp_first = get_member_id(first_senator, collection)
                sp_second = get_member_id(second_senator, collection)
                # print "Sponsors", first_senator, sp_first, second_senator, sp_second
            elif support.encode('utf-8') == str('N\xc3\xadl'):
                ag_first = get_member_id(first_senator, collection)
                ag_second = get_member_id(second_senator, collection)
                # print "Against", first_senator, ag_first, second_senator, ag_second
                
    return sp_first, sp_second, ag_first, ag_second 
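
To make the parsing steps easier to follow, here is a much simplified sketch that handles only one layout. The input format, "Tellers: Tá, Senators A and B; Níl, Senators C and D.", is inferred from the code above rather than taken from API documentation, the names in the example are invented, and `parse_tellers` is a hypothetical helper that returns names instead of ids.

```python
import re

TA = u'T\u00e1'      # "Ta" with an acute accent: the yes side
NIL = u'N\u00edl'    # "Nil" with an acute accent: the no side

def parse_tellers(text):
    """Split a tellers line into {side: [name, ...]} for one assumed layout."""
    sides = {}
    text = text.replace(u'Tellers:', u'')
    for part in text.split(u';'):
        if u',' not in part:
            continue                      # skip fragments without "side, names"
        side, names = part.split(u',', 1)
        # Drop the leading "Senators" label and a trailing full stop.
        names = re.sub(u'^\\s*Senators\\s+', u'', names).strip().rstrip(u'.')
        # Split on the word "and" surrounded by whitespace only.
        sides[side.strip()] = [n.strip()
                               for n in re.split(u'\\s+and\\s+', names)
                               if n.strip()]
    return sides

line = (u'Tellers: T\u00e1, Senators Alice Example and Bob Sample; '
        u'N\u00edl, Senators Carol Test and Dan Demo.')
tellers = parse_tellers(line)
print(tellers[TA])
print(tellers[NIL])
```

Splitting on "and" only when it is surrounded by whitespace avoids cutting names that merely contain those letters.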

In [41]:
# Not all members will appear in the votings
member_id_to_displayname = {}
senator_id_to_displayname = {}
deputy_id_to_displayname = {}
for member_code in set(all_members):
    member_id = member_to_id[member_code]
    name = member_code
    party = parties[name]
    if name is not None and '.' in name:
        name = name.split('.')[0].split('-')[-1] # last name
    if party is not None:
        name = name + ", " + party
    member_id_to_displayname[member_id] = name
    if member_code in senators:
        senator_id = senator_to_id[member_code]
        senator_id_to_displayname[senator_id] = name
    if member_code in deputies:
        deputy_id = deputy_to_id[member_code]
        deputy_id_to_displayname[deputy_id] = name

In [42]:
# member_id_to_displayname

In [43]:
# senator_id_to_displayname

In [44]:
# deputy_id_to_displayname

In [45]:
len(member_id_to_displayname), len(senator_id_to_displayname), len(deputy_id_to_displayname)

(478, 197, 372)

In [46]:
def get_vote_data(data, collection_to_ids, display_names):

    votings = []
    added, not_added = 0, 0

    for vote in data:

        div = vote["division"]
        #outcome = div["outcome"] # Carried, Lost, _
        #print div["date"]
        #print div["isBill"]
        #print div["category"]
        #print div["chamber"]["showAs"]
        #print div["debate"]["showAs"]

        # Bill sponsors
        sponsors = get_sponsors(div["tellers"], collection_to_ids)

        if None not in sponsors:

            sponsor, cosponsor, against, coagainst = sponsors

            tallies = div["tallies"]

            if tallies["taVotes"] is not None:
                for member in tallies["taVotes"]["members"]:
                    member = member["member"]
                    name = member["showAs"]
                    last_name = name.split(',')[0]
                    #uri = member["uri"]
                    #short = member["memberCode"]
                    if member["memberCode"] is not None:
                        member_id = collection_to_ids[member["memberCode"]]
                        party = parties[member["memberCode"]]
                        display_names[member_id] = last_name + ", " + party # replace names with utf8 encoding names
                        votings.append((1, member_id, sponsor, [cosponsor, against, coagainst])) 

            if tallies["nilVotes"] is not None:
                
                for member in tallies["nilVotes"]["members"]:
                    member = member["member"]
                    name = member["showAs"]
                    last_name = name.split(',')[0]
                    if member["memberCode"] is not None:
                        member_id = collection_to_ids[member["memberCode"]]
                        party = parties[member["memberCode"]]
                        display_names[member_id] = last_name + ", " + party
                        votings.append((0, member_id, sponsor, [cosponsor, against, coagainst]))   

            added += 1

        else: # Votings that could not be added

            not_added += 1
            
    return votings, added, not_added

In [47]:
seanad_votings, added, not_added = get_vote_data(seanad_data, senator_to_id, senator_id_to_displayname)

In [48]:
len(seanad_votings), added, not_added

(75487, 2802, 728)

In [49]:
# seanad_votings

In [50]:
dail_votings, added, not_added = get_vote_data(dail_data, deputy_to_id, deputy_id_to_displayname)

In [51]:
len(dail_votings), added, not_added

(148744, 3574, 1426)

In [52]:
# dail_votings

In [53]:
len(seanad_votings), len(dail_votings)

(75487, 148744)

### Senate votings

~75K examples of (vote, senator voting, sponsor, cosponsor, against, coagainst) tuples is pretty good.

In [54]:
y = [d[0] for d in seanad_votings]

In [55]:
# len(y)
Counter(y)

Counter({0: 36729, 1: 38758})

In [56]:
# again we pad
def pad_or_crop(lst, l=10):
    # pad with zeros (the <PAD> id), then crop to exactly l items
    return (lst + [0] * l)[:l]
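
Each example carries three extra member ids (cosponsor, against, coagainst), which are padded with the `<PAD>` id 0 to a fixed length of 10 so that every row has the same shape. Here is a quick check of that behaviour, written so that the length argument `l` controls both padding and cropping:

```python
def pad_or_crop(lst, l=10):
    # Append l zeros, then keep the first l items.
    return (lst + [0] * l)[:l]

padded = pad_or_crop([7, 3, 9])
cropped = pad_or_crop(list(range(2, 14)), l=4)
print(padded)
print(cropped)
```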

In [57]:
x_1 = np.array(map(lambda x: x[1], seanad_votings))
x_2 = np.array(map(lambda x: x[2], seanad_votings))
x_3 = np.array(map(lambda x: pad_or_crop(x[3]), seanad_votings))
x = [x_1, x_2, x_3]

In [58]:
# x

In [59]:
# len(x)

In [60]:
# we add in padding and unknown senators
senator_id_to_displayname[0] = '<PAD>'
senator_id_to_displayname[1] = '<NOT A SENATOR>'

In [61]:
# senator_id_to_displayname

In [62]:
# this gives us how many representations:
len(senator_id_to_displayname)

199

In [63]:
# we again need to write down the metadata
with open('data/congress/Ireland/senator_metadata.csv', 'wb') as csvfile:
    writer = csv.writer(csvfile, delimiter='\t')
    for key, value in sorted(senator_id_to_displayname.items()):
        writer.writerow([value.encode('utf8')])

In [64]:
# finally we build our model
from keras.layers import concatenate
from keras.layers import Dense, Input, Flatten
from keras.layers import MaxPooling1D, Embedding

embedding_layer = Embedding(len(senator_id_to_displayname), 100)

# one shared embedding feeds three dense branches; the padded member list is max-pooled
voting = voting_input = Input(shape=(1,), dtype='int32')
voting = embedding_layer(voting)
voting = Dense(32, activation='relu')(voting)
voting = Dense(32, activation='relu')(voting)

sponsor = sponsor_input = Input(shape=(1,), dtype='int32')
sponsor = embedding_layer(sponsor)
sponsor = Dense(32, activation='relu')(sponsor)
sponsor = Dense(32, activation='relu')(sponsor)

cosponsor = cosponsor_input = Input(shape=(10,), dtype='int32')
cosponsor = embedding_layer(cosponsor)
cosponsor = MaxPooling1D(10)(cosponsor)
cosponsor = Dense(32, activation='relu')(cosponsor)
cosponsor = Dense(32, activation='relu')(cosponsor)

combined = concatenate([voting, sponsor, cosponsor])
combined = Dense(32, activation='relu')(combined)
combined = Dense(1, activation='sigmoid')(combined)

Using TensorFlow backend.
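
The shapes flowing through the model can be traced with plain NumPy. The sketch below uses a random matrix as a stand-in for the trained embedding weights and invented member ids. It shows the lookup performed by the shared `Embedding` layer and the max over the 10 padded slots that `MaxPooling1D(10)` computes. Every branch keeps a length-1 axis, which is why the model output has shape (batch, 1, 1).

```python
import numpy as np

rng = np.random.RandomState(0)
n_members, dim = 199, 100                        # sizes taken from the Seanad run
embedding = rng.normal(size=(n_members, dim))    # random stand-in for trained weights

voter_ids = np.array([[5], [42]])                # shape (batch, 1), invented ids
other_ids = np.array([[7, 8, 9] + [0] * 7,
                      [3, 1, 4] + [0] * 7])      # shape (batch, 10), 0 = <PAD>

voter_vecs = embedding[voter_ids]                # (2, 1, 100): one vector per voter
other_vecs = embedding[other_ids]                # (2, 10, 100): one vector per slot
pooled = other_vecs.max(axis=1, keepdims=True)   # (2, 1, 100): max over the 10 slots

print(voter_vecs.shape)
print(other_vecs.shape)
print(pooled.shape)
```

Note that the padding id 0 has its own embedding row here, so the `<PAD>` vector takes part in the max unless masking is added.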


In [65]:
from keras.models import Model
from keras.callbacks import TensorBoard

model = Model([voting_input, sponsor_input, cosponsor_input], combined)

model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])

embedding_metadata = {
    embedding_layer.name: '../senator_metadata.csv'
}

model.fit([x_1, x_2, x_3], np.array(y).reshape(-1, 1, 1),
          batch_size=128,
          epochs=10,
          validation_split=0.2,
          callbacks=[TensorBoard(log_dir='data/congress/Ireland/senator_reps', 
                                 embeddings_freq=1,
                                 embeddings_metadata=embedding_metadata)])

Train on 60389 samples, validate on 15098 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x1845ab0350>

In data/congress/Ireland, launch TensorBoard:
    
> $ tensorboard --logdir=senator_reps/

Go to TensorBoard:

> http://localhost:6006/#projector
        
In TensorBoard, we can look at the representations created in our model using t-SNE or PCA.
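
The PCA view can also be reproduced outside TensorBoard. Below is a minimal sketch using an SVD in NumPy, with a random matrix standing in for the learned weights. In the notebook the matrix could be taken from `embedding_layer.get_weights()[0]`. `pca_2d` is a hypothetical helper.

```python
import numpy as np

def pca_2d(vectors):
    """Project row vectors onto their first two principal components."""
    centred = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred.dot(vt[:2].T)

rng = np.random.RandomState(0)
fake_weights = rng.normal(size=(199, 100))   # stand-in for the embedding matrix
coords = pca_2d(fake_weights)
print(coords.shape)
```

Each row of `coords` is a 2-D point that can be plotted and labelled with the matching line of the metadata file.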

### Dáil votings

~150K examples of (vote, deputy voting, sponsor, cosponsor, against, coagainst) tuples, around twice the number scraped for the Seanad.

In [66]:
y = [d[0] for d in dail_votings]

In [67]:
# len(y)
Counter(y)

Counter({0: 69948, 1: 78796})

In [68]:
x_1 = np.array(map(lambda x: x[1], dail_votings))
x_2 = np.array(map(lambda x: x[2], dail_votings))
x_3 = np.array(map(lambda x: pad_or_crop(x[3]), dail_votings))
x = [x_1, x_2, x_3]

In [69]:
# we add in padding and unknown deputies
deputy_id_to_displayname[0] = '<PAD>'
deputy_id_to_displayname[1] = '<NOT A DEPUTY>'

In [70]:
# deputy_id_to_displayname

In [71]:
len(deputy_id_to_displayname)

374

In [72]:
# we again need to write down the metadata
with open('data/congress/Ireland/deputy_metadata.csv', 'wb') as csvfile:
    writer = csv.writer(csvfile, delimiter='\t')
    for key, value in sorted(deputy_id_to_displayname.items()):
        writer.writerow([value.encode('utf8')])

In [73]:
# finally we build our model
from keras.layers import concatenate
from keras.layers import Dense, Input, Flatten
from keras.layers import MaxPooling1D, Embedding

embedding_layer = Embedding(len(deputy_id_to_displayname), 100)

# one shared embedding feeds three dense branches; the padded member list is max-pooled
voting = voting_input = Input(shape=(1,), dtype='int32')
voting = embedding_layer(voting)
voting = Dense(32, activation='relu')(voting)
voting = Dense(32, activation='relu')(voting)

sponsor = sponsor_input = Input(shape=(1,), dtype='int32')
sponsor = embedding_layer(sponsor)
sponsor = Dense(32, activation='relu')(sponsor)
sponsor = Dense(32, activation='relu')(sponsor)

cosponsor = cosponsor_input = Input(shape=(10,), dtype='int32')
cosponsor = embedding_layer(cosponsor)
cosponsor = MaxPooling1D(10)(cosponsor)
cosponsor = Dense(32, activation='relu')(cosponsor)
cosponsor = Dense(32, activation='relu')(cosponsor)

combined = concatenate([voting, sponsor, cosponsor])
combined = Dense(32, activation='relu')(combined)
combined = Dense(1, activation='sigmoid')(combined)

In [74]:
from keras.models import Model
from keras.callbacks import TensorBoard

model = Model([voting_input, sponsor_input, cosponsor_input], combined)

model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])

embedding_metadata = {
    embedding_layer.name: '../deputy_metadata.csv'
}

model.fit([x_1, x_2, x_3], np.array(y).reshape(-1, 1, 1),
          batch_size=128,
          epochs=10,
          validation_split=0.2,
          callbacks=[TensorBoard(log_dir='data/congress/Ireland/deputy_reps', 
                                 embeddings_freq=1,
                                 embeddings_metadata=embedding_metadata)])

Train on 118995 samples, validate on 29749 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x1a4a597210>

In data/congress/Ireland, launch TensorBoard:
    
> $ tensorboard --logdir=deputy_reps/

### Utils

In [75]:
print '\xc3\xa1' == str('á')
print str('á').decode('utf-8').encode('utf-8')
print str('á').decode('utf-8').encode('utf-8') == str('\xc3\xa1')
print str('Tá').decode('utf-8').encode('utf-8') == str('T\xc3\xa1')
print str('Níl').decode('utf-8').encode('utf-8') == str('N\xc3\xadl')

True
á
True
True
True
