# Collecting and parsing US House voting records

Since 1990, every roll-call vote in the House that was taken electronically has been recorded, and the records are available from https://clerk.house.gov. Our goal is to download this data and reshape it into a form that is convenient to work with. Specifically, for each year we would like the voting record of each representative, ideally as a numeric vector of their votes on each item. The first step is to collect the data itself, which we can manage with a series of shell commands.

In [None]:
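!mkdir -p house_votes
# Each `!` line runs in its own subshell, so the `cd` must share a command with the loop; bash is used for the brace expansion.
!bash -c 'cd house_votes && for y in {1990..2021}; do mkdir -p ${y}; cd ${y}; for i in {001..999}; do wget -q https://clerk.house.gov/evs/${y}/roll${i}.xml; done; cd ..; done'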
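
For readers who would rather stay in Python, the same download can be sketched with the standard library. This is only a sketch under assumptions: the URL pattern is copied from the `wget` command above, the helper names `roll_call_url` and `download_year` are hypothetical, and it assumes that a missing roll call produces an HTTP error and that roll-call numbers within a year have no gaps.

```python
import urllib.error
import urllib.request
from pathlib import Path

BASE_URL = "https://clerk.house.gov/evs"  # pattern taken from the wget command above


def roll_call_url(year, number):
    """Build the URL for one roll call; numbers are zero-padded to three digits."""
    return f"{BASE_URL}/{year}/roll{number:03d}.xml"


def download_year(year, out_dir="house_votes", max_roll=999):
    """Fetch one year's roll calls, stopping at the first request that fails."""
    target = Path(out_dir) / str(year)
    target.mkdir(parents=True, exist_ok=True)
    saved = 0
    for number in range(1, max_roll + 1):
        try:
            with urllib.request.urlopen(roll_call_url(year, number)) as response:
                data = response.read()
        except urllib.error.HTTPError:
            break  # assumption: an HTTP error means there are no further roll calls
        (target / f"roll{number:03d}.xml").write_bytes(data)
        saved += 1
    return saved


print(roll_call_url(1990, 7))
```

Stopping at the first failed request avoids issuing hundreds of requests for files that do not exist, at the cost of relying on the no-gaps assumption.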

Next we'll need the ability to parse the data (which comes in XML files) and form it into pandas dataframes.

In [2]:
import pandas as pd
import xml.etree.ElementTree as et
import glob

The next catch is that votes are recorded in several ways: sometimes as "Yea" or "Nay", sometimes as "Aye" or "No". There are also abstentions (voting "Present", or being recorded as "Not Voting"). Finally, there are the votes for Speaker of the House at the start of every new Congress (i.e. every two years); these are recorded as the name of the individual being voted for. This last case is harder to deal with because there are several options rather than the usual Yes/No/Abstain. To keep the analysis relatively simple, and since we are mostly concerned with how House representatives align with one another, we will select the two most popular candidates and (rather arbitrarily) code Republican candidates as -1 and Democratic candidates as +1. Votes for any other candidate are left unmapped and end up as 0 in the final table.
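
The coding scheme can be illustrated on a handful of values. The snippet below is a minimal sketch: `sample_map` is a trimmed copy of the full mapping defined in the next cell, `encode_vote` is a hypothetical helper, and "Someone Else" stands in for a minor speaker candidate.

```python
# Trimmed copy of the vote mapping, for illustration only
sample_map = {
    "Yea": 1, "Aye": 1,
    "Nay": -1, "No": -1,
    "Present": 0, "Not Voting": 0,
    "Pelosi": 1, "McCarthy": -1,
}


def encode_vote(text, mapping, default=0):
    """Return the numeric code for a recorded vote; unknown values count as abstentions."""
    return mapping.get(text, default)


raw_votes = ["Yea", "No", "Present", "Pelosi", "Someone Else"]
encoded = [encode_vote(v, sample_map) for v in raw_votes]
print(encoded)  # [1, -1, 0, 1, 0]
```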

We can then set about parsing the XML files. Each file holds one roll-call vote, with a separate XML element for each representative giving their vote on the issue. There is also some metadata associated with the vote that we can pull out for use in any later analysis.
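
To make that structure concrete, here is a hand-written fragment and the calls needed to read it. The element names under `vote-metadata` and `vote-data` are the ones the parsing function below looks up; the root and `recorded-vote` element names, the `party` and `state` attributes, and all of the values are assumptions made up for illustration, not a real roll call.

```python
import xml.etree.ElementTree as et

# Invented fragment: legislators, bill number, and description are placeholders
SAMPLE_XML = """<rollcall-vote>
  <vote-metadata>
    <congress>116</congress>
    <session>2nd</session>
    <rollcall-num>10</rollcall-num>
    <legis-num>H R 0000</legis-num>
    <vote-question>On Passage</vote-question>
    <vote-desc>Placeholder description</vote-desc>
  </vote-metadata>
  <vote-data>
    <recorded-vote>
      <legislator party="D" state="XX">Doe</legislator>
      <vote>Nay</vote>
    </recorded-vote>
    <recorded-vote>
      <legislator party="R" state="XX">Roe</legislator>
      <vote>Yea</vote>
    </recorded-vote>
  </vote-data>
</rollcall-vote>"""

root = et.fromstring(SAMPLE_XML)
meta = root.find("vote-metadata")

# Congress, session, and roll-call number together identify the vote
rollcall_id = "-".join(meta.find(tag).text for tag in ("congress", "session", "rollcall-num"))

# One dictionary per legislator: element attributes plus name and vote
rows = [
    {**node.find("legislator").attrib,
     "legislator": node.find("legislator").text,
     "vote": node.find("vote").text}
    for node in root.find("vote-data")
]

print(rollcall_id)  # 116-2nd-10
print(rows[0])      # {'party': 'D', 'state': 'XX', 'legislator': 'Doe', 'vote': 'Nay'}
```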

In [66]:
vote_map = {
    "Aye":1, 
    "No":-1, 
    "Yea":1, 
    "Nay":-1,
    "Not Voting":0, 
    "Present":0,
    # Speaker votes -- major candidates only R = -1, D = 1
    "Michel":-1,
    "Foley":1,
    "Gingrich":-1,
    "Gephardt":1,
    "Hastert":-1,
    "Pelosi":1,
    "Boehner":-1,
    "Ryan":-1,
    "Ryan (WI)":-1,
    "McCarthy":-1,
}

def parse_roll_call_xml(filename):
    """Parse one roll-call XML file into a dataframe with one row per legislator."""
    xroot = et.parse(filename).getroot()
    meta = xroot.find('vote-metadata')
    vote_data = []
    for node in xroot.find('vote-data'):
        # Copy the attributes so the parsed tree is not modified in place
        vote = dict(node.find('legislator').attrib)
        vote['legislator'] = node.find('legislator').text
        vote['vote'] = node.find('vote').text
        vote_data.append(vote)
    result = pd.DataFrame(vote_data)
    # Attach the vote-level metadata to every row
    result["congress"] = meta.find('congress').text
    result["session"] = meta.find('session').text
    result["rollcall"] = meta.find('rollcall-num').text
    result["rollcall_id"] = "-".join(
        meta.find(tag).text for tag in ('congress', 'session', 'rollcall-num')
    )
    # Some votes have no legis-num element, so fall back to a placeholder
    try:
        result["legislation"] = meta.find('legis-num').text
    except AttributeError:
        result["legislation"] = 'None'
    result["description"] = meta.find('vote-desc').text
    result["question"] = meta.find('vote-question').text
    return result

With a function in hand to parse an individual XML file into a dataframe, we now simply need to go through all the files, grouped by year, and concatenate the results into a single dataframe for each year. We can then convert the votes to their numeric representation and use a pivot table to get the voting record of each representative for the whole year. We save all of that for later use.

In [67]:
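# range() excludes its end point, so this processes 1990 through 2020
for year in range(1990, 2021):
    df = pd.concat([parse_roll_call_xml(fname) for fname in glob.glob(f'house_votes/{year}/*.xml')])
    df.to_csv(f'house_votes/{year}_full.csv')
    # Values missing from vote_map (minor speaker candidates) become NaN here
    df["vote_numeric"] = df.vote.map(vote_map)
    # One row per legislator, one column per roll call; missing entries are filled with 0
    voting_record = df.pivot_table(index='legislator', columns='rollcall_id', values='vote_numeric').fillna(0)
    voting_record.to_csv(f'house_votes/{year}_voting_record.csv')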
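
The reshaping step is easier to see on a toy frame. The legislators and votes below are invented for illustration; only the `map`, `pivot_table`, and `fillna` calls mirror the loop above.

```python
import pandas as pd

# Long format: one row per (legislator, roll call) pair, as produced by the parser
toy = pd.DataFrame({
    "legislator": ["Doe", "Doe", "Roe", "Roe", "Poe"],
    "rollcall_id": ["116-2nd-1", "116-2nd-2", "116-2nd-1", "116-2nd-2", "116-2nd-2"],
    "vote": ["Yea", "Nay", "No", "Aye", "Present"],
})
toy_map = {"Yea": 1, "Aye": 1, "Nay": -1, "No": -1, "Present": 0}
toy["vote_numeric"] = toy["vote"].map(toy_map)

# Wide format: one row per legislator, one column per roll call.
# Poe has no entry for the first roll call, so that cell is filled with 0.
record = toy.pivot_table(
    index="legislator", columns="rollcall_id", values="vote_numeric"
).fillna(0)
print(record)
```

Note that `pivot_table` aggregates with the mean by default, so if two legislators shared the same name in one year their votes would be averaged into a single row.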

We can now look at the last voting record to see that we are indeed getting what we expected.

In [68]:
voting_record

legislator,116-2nd-1,116-2nd-10,116-2nd-100,116-2nd-101,116-2nd-102,116-2nd-103,116-2nd-104,116-2nd-105,116-2nd-106,116-2nd-107,...,116-2nd-90,116-2nd-91,116-2nd-92,116-2nd-93,116-2nd-94,116-2nd-95,116-2nd-96,116-2nd-97,116-2nd-98,116-2nd-99
Abraham,0.0,1.0,1.0,-1.0,0.0,-1.0,1.0,-1.0,-1.0,-1.0,...,-1.0,1.0,1.0,1.0,-1.0,-1.0,-1.0,1.0,-1.0,1.0
Adams,0.0,-1.0,-1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,-1.0
Aderholt,0.0,0.0,1.0,-1.0,1.0,-1.0,1.0,-1.0,-1.0,-1.0,...,-1.0,1.0,1.0,1.0,-1.0,-1.0,-1.0,1.0,1.0,-1.0
Aguilar,0.0,-1.0,-1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,-1.0,1.0,1.0,1.0,1.0,-1.0
Allen,0.0,1.0,1.0,-1.0,1.0,-1.0,1.0,-1.0,-1.0,-1.0,...,-1.0,1.0,1.0,1.0,-1.0,-1.0,-1.0,1.0,1.0,-1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Wright,0.0,1.0,1.0,-1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,-1.0,1.0,1.0,1.0,-1.0,-1.0,-1.0,1.0,-1.0,1.0
Yarmuth,0.0,-1.0,-1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,-1.0,-1.0
Yoho,0.0,1.0,1.0,-1.0,0.0,-1.0,1.0,-1.0,-1.0,-1.0,...,-1.0,1.0,1.0,-1.0,-1.0,-1.0,-1.0,1.0,-1.0,1.0
Young,0.0,-1.0,1.0,-1.0,0.0,-1.0,1.0,-1.0,-1.0,-1.0,...,1.0,1.0,1.0,1.0,-1.0,-1.0,-1.0,1.0,1.0,1.0
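
As a quick sanity check on these vectors, one can compare how often two representatives voted the same way. The `agreement` helper below is a hypothetical illustration rather than part of the pipeline, and it uses only the first ten columns shown in the table above.

```python
def agreement(a, b):
    """Fraction of roll calls on which both cast a yes/no vote and voted the same way."""
    pairs = [(x, y) for x, y in zip(a, b) if x != 0 and y != 0]
    if not pairs:
        return None  # no roll call on which both took a side
    return sum(x == y for x, y in pairs) / len(pairs)


# First ten columns of the table above
abraham = [0, 1, 1, -1, 0, -1, 1, -1, -1, -1]
adams = [0, -1, -1, 1, 1, 1, 1, 1, 1, 1]
aguilar = [0, -1, -1, 1, 1, 1, 1, 1, 1, 1]

print(agreement(adams, aguilar))  # 1.0
print(agreement(abraham, adams))  # 0.125
```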
