# How similar are Bernie Sanders' and Hillary Clinton's voting records?

<img src="https://upload.wikimedia.org/wikipedia/en/e/e3/DemocraticLogo.png" style="height:100px">

## Motivation

The 2015 primary season is in full swing, and plenty of attention is being given to the two main Democratic candidates, Bernie Sanders and Hillary Clinton. Although there are clear differences between Sanders and Clinton, I thought it would be interesting to try to quantify those differences and explore whether they correlate with any meaningful characteristics. This seems like a fairly simple project, given the numerous efforts to open up government records and the tools available to evaluate this type of data.

For this exploration, we're going to use the <a href="https://govtrack.us"><b>GovTrack.us API</b></a> to access information on Sanders, Clinton, and other Senators. As an overview, we're going to access Senator identifying information and voting records, convert that raw data into useful structures, and calculate and compare similarity metrics, much as we would for social network analysis or recommendation systems.

## I. Accessing Senator information

First, we need to pull a comprehensive list of Senators to analyze. I arbitrarily decided to compare Senators who served after January 01, 2000, and to restrict the analysis to Senators who served for at least two years. Without confirming whether these constraints are useful, I wanted to avoid potential issues with temporal voting trends and changing political climates, as well as Senators who served only briefly and may have "outlier" voting records. With that in mind, we can pull identifying information for each Senator, including their GovTrack ID, name, party, state, and gender.

In [1]:
import json
import requests

# Set the base GovTrack API URL
_URL_GOVTRACK = 'https://www.govtrack.us/api/v2/'

# Set the URL for Senator information, only search for Senators that
# match our arbitrary criteria
url_senators = _URL_GOVTRACK + \
    'role?role_type=senator&enddate__gte=2002-01-01&limit=500'

# Access and load Senator information, careful to ensure that the number
# of senators does not exceed the API limit
senators_raw = json.loads(requests.get(url_senators).text)['objects']

# Clean Senator information, keeping only certain identifying details.
senators = {str(sen['person']['id']): {
                'name': sen['person']['name'],
                'lastname': sen['person']['lastname'],
                'firstname': sen['person']['firstname'],
                'party': sen['party'],
                'state': sen['state'],
                'gender': sen['person']['gender']}
            for sen in senators_raw}

At the time of this writing, this returns a dictionary with records for approximately 200 Senators, each similar to this:

In [2]:
bernie_id = [id_ for id_, sen in senators.items()
             if sen['lastname'] == 'Sanders' and
                 sen['firstname'] == 'Bernard'][0]

senators[bernie_id]

{'firstname': u'Bernard',
 'gender': u'male',
 'lastname': u'Sanders',
 'name': u'Sen. Bernard \u201cBernie\u201d Sanders [I-VT]',
 'party': u'Independent',
 'state': u'VT'}
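The cleaning step can be tried offline with a small sketch. The records below are hypothetical and only mimic the fields used in this post, and `clean_senators` and `find_senator_id` are illustrative names, not part of the GovTrack API. The sketch is written for Python 3. If the response contains more than one role record for the same person, keying by ID keeps a single entry per person.

```python
def clean_senators(roles):
    """Index role records by person ID, keeping identifying fields only."""
    cleaned = {}
    for role in roles:
        person = role['person']
        cleaned[str(person['id'])] = {
            'name': person['name'],
            'lastname': person['lastname'],
            'firstname': person['firstname'],
            'party': role['party'],
            'state': role['state'],
            'gender': person['gender'],
        }
    return cleaned


def find_senator_id(senators, firstname, lastname):
    """Return the first matching ID, or None when nobody matches."""
    matches = [id_ for id_, sen in senators.items()
               if sen['firstname'] == firstname and sen['lastname'] == lastname]
    return matches[0] if matches else None


# Hypothetical role records (made-up people, not real API output)
_ann = {'id': 1, 'name': 'Sen. Ann Example [D-XX]', 'lastname': 'Example',
        'firstname': 'Ann', 'gender': 'female'}
_bob = {'id': 2, 'name': 'Sen. Bob Sample [R-YY]', 'lastname': 'Sample',
        'firstname': 'Bob', 'gender': 'male'}
sample_roles = [
    {'party': 'Democrat', 'state': 'XX', 'person': _ann},
    {'party': 'Democrat', 'state': 'XX', 'person': _ann},  # second role, same person
    {'party': 'Republican', 'state': 'YY', 'person': _bob},
]

sample_senators = clean_senators(sample_roles)
ann_id = find_senator_id(sample_senators, 'Ann', 'Example')
nobody = find_senator_id(sample_senators, 'No', 'Body')
```

Returning `None` for a missing name is a design choice for this sketch; indexing an empty list with `[0]` would raise an `IndexError` instead.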

As an aside, I didn't have time to go very deep with this analysis. One could use the <a href="https://www.opensecrets.org"><b>OpenSecrets.org</b></a> or the <a href="https://www.votesmart.org"><b>VoteSmart.org</b></a> APIs (among others) to access additional information for each Senator. For instance, OpenSecrets.org has great information on net worth and funding, while VoteSmart.org has contextual data, including education and religious affiliation. GovTrack.us makes it easy to work with all three by providing the other services' IDs as `osid` and `pvsid` in the raw response.

## II. Accessing Senator voting records

Second, we need to grab the votes for each Senator, but there are two tricky points here: one political and one technical. For the former, many Senators were also Representatives at one point, and we do not want to incorrectly include Representative votes in our analysis. For the latter, the GovTrack API has a limit of 600 objects (Senators, votes, bills, etc.) per request, so we need to manually step through all API calls.

In [3]:
# Set the base URL for Senator voting information, searching for only 
# votes made after Jan 01, 2000.
_URL_VOTERS = _URL_GOVTRACK + 'vote_voter?created__gt=2000-01-01'
_API_LIMIT = 600

# Functions to get all votes
def get_votes_for_senator(govtrack_id):
    """
    Get all votes for a Senator given his/her GovTrack ID.
    """
    # Get total vote count for the senator, used for paging
    total_count = _get_total_vote_count_for_senator(govtrack_id)
    # Create dictionary to hold votes
    votes = {}
    # Page through the API responses, get and clean vote information.
    # Stepping by _API_LIMIT up to total_count includes the last partial page.
    offsets = range(0, total_count, _API_LIMIT)
    for offset in offsets:
        votes_raw = _get_raw_votes_for_senator(govtrack_id, offset)
        votes = _update_votes_with_raw_votes(votes, votes_raw)
    return votes


def _get_total_vote_count_for_senator(govtrack_id):
    """
    Get the total number of votes for a Senator given his/her GovTrack ID.
    """
    # URL for metainformation without records
    url_meta = _URL_VOTERS + '&person={id}&limit=0'.format(id=govtrack_id)
    # Access and load metainformation
    meta = json.loads(requests.get(url_meta).text)['meta']
    # Return vote count
    return meta['total_count']


def _get_raw_votes_for_senator(govtrack_id, offset):
    """
    Get a page of votes in the raw GovTrack API format.
    """
    # URL for votes, starting with vote number == offset
    url_votes = _URL_VOTERS + '&person={id}&offset={offset}&limit={limit}'\
        .format(id=govtrack_id, offset=offset, limit=_API_LIMIT)
    # Access, load, and return votes
    return json.loads(requests.get(url_votes).text)['objects']


def _update_votes_with_raw_votes(votes, votes_raw):
    """
    Update a dictionary of votes with GovTrack vote IDs and Senator vote 
    actions.
    """
    # Step through each raw vote
    for vote in votes_raw:
        # Keep the vote only if it was a Senate vote
        if vote['vote']['chamber_label'] == 'Senate':
            # Store the vote ID and Senator action
            vote_id = str(vote['vote']['id'])
            vote_action = vote['option']['value']
            votes.update({vote_id: vote_action})
    return votes
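One detail worth checking when paging by hand is that the final, partially filled page is requested too. The sketch below tests the offset arithmetic without touching the network. `page_offsets`, `fetch_all`, and `fake_fetch_page` are hypothetical helpers, and the 1,450 fake records stand in for an API response. It is written for Python 3.

```python
def page_offsets(total_count, limit):
    """Offsets that cover every record, including a final partial page."""
    return list(range(0, total_count, limit))


def fetch_all(fetch_page, total_count, limit):
    """Collect all records by calling fetch_page(offset, limit) per page."""
    records = []
    for offset in page_offsets(total_count, limit):
        records.extend(fetch_page(offset, limit))
    return records


# Stand-in for the API: 1,450 fake records served one page at a time
fake_records = list(range(1450))


def fake_fetch_page(offset, limit):
    return fake_records[offset:offset + limit]


offsets = page_offsets(1450, 600)
all_records = fetch_all(fake_fetch_page, 1450, 600)
print(offsets)           # [0, 600, 1200]
print(len(all_records))  # 1450
```

With 1,450 records and a limit of 600, three requests are needed; flooring `1450 / 600` would give only two and silently drop the last 250 records.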

Whew, now we can actually get the votes for each Senator. This can take some time, depending on your setup, so you may find it helpful to track its progress. I usually prefer sending log statements to stdout when developing, but the print statements with a leading carriage return and a trailing comma are great for keeping progress statements to a single line.

Note that I'm commenting this out because I've already downloaded votes and saved them to a JSON file.

In [5]:
# Object to store votes
senator_votes = {}

# Get votes for each Senator
"""
for idx, govtrack_id in enumerate(senators.keys()):
    print '\rGetting votes for Senator ', idx+1, ' out of ', len(senators),
    senator_votes.update({govtrack_id: get_votes_for_senator(govtrack_id)})
"""

# Export downloaded votes
_LOCAL_PATH = '{LOCAL_PATH}'
"""
with open(_LOCAL_PATH, 'w') as file_:
    json.dump(senator_votes, file_)
"""

## III. Cleaning and formatting Senator voting information

After loading the downloaded data, we have a dictionary of votes for each Senator. For example, Bernie's votes look like:

In [6]:
# Load data from local file
with open(_LOCAL_PATH, 'r') as file_:
    senator_votes = json.load(file_)
    
for key, value in senator_votes[bernie_id].items()[:5]:
    print 'Vote ID:  ', key, ' - Vote action:  ', value

Vote ID:   110004  - Vote action:   Yea
Vote ID:   110005  - Vote action:   Yea
Vote ID:   110007  - Vote action:   Nay
Vote ID:   110000  - Vote action:   Yea
Vote ID:   110001  - Vote action:   Yea
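Because the votes pass through a JSON file, it helps to remember that JSON object keys are always strings. That is one reason to store Senator and vote IDs as strings from the start. The round trip below is a minimal Python 3 sketch with made-up data; the temporary path and the `votes_by_senator` name are assumptions for illustration.

```python
import json
import os
import tempfile

# Made-up votes keyed by string IDs, mirroring the structure used in this post
votes_by_senator = {'1': {'100': 'Yea', '101': 'Nay'}}

# Write to a temporary file and read it back
path = os.path.join(tempfile.mkdtemp(), 'senator_votes.json')
with open(path, 'w') as file_:
    json.dump(votes_by_senator, file_)
with open(path) as file_:
    reloaded = json.load(file_)

# Integer keys do not survive a round trip: they come back as strings
int_keyed = json.loads(json.dumps({1: 'Yea'}))
print(reloaded == votes_by_senator)  # True
print(int_keyed)                     # {'1': 'Yea'}
```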


Okay, we now have Senator identifying information and voting records in two separate objects. Before proceeding, we want to create one derived object, a pandas dataframe, to format the data into something more amenable to exploration. Specifically, we'll create a table where Senators are columns, votes are rows, and the action taken by Senator `i` on vote `j` sits in column `i`, row `j`, so it can be read as `votes[i][j]`. Votes that a Senator has no record for are filled with `NaN`. Fortunately, pandas is clever enough to handle this transformation for us.

In [7]:
import pandas

# Convert data to pandas dataframe
votes = pandas.DataFrame(senator_votes)

# Access Bernie's voting action on the Keystone XL vote, identifying
# the Keystone XL vote by manually searching GovTrack.us
keystone_xl_id = '116202'
print 'Bernie voted "{v}" on Keystone XL'.format(
    v=votes[bernie_id][keystone_xl_id])

Bernie voted "Nay" on Keystone XL


The example above illustrates how we can use a Senator ID and vote ID to access the Senator's action on that vote. In this case, Bernie Sanders voted Nay on Keystone XL in January 2015.
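To see what pandas does with a dictionary of dictionaries, here is a toy version with made-up Senators and votes (the names `toy_votes` and `toy` are illustrative). Outer keys become columns, inner keys become the row index, and any vote missing from a Senator's record becomes `NaN`.

```python
import pandas as pd

# Made-up records: each Senator took part in a different set of votes
toy_votes = {
    'senator_a': {'v1': 'Yea', 'v2': 'Nay'},
    'senator_b': {'v2': 'Yea', 'v3': 'Not Voting'},
}

toy = pd.DataFrame(toy_votes)
print(toy)

# Column first, then row: the same access pattern as votes[bernie_id][vote_id]
action = toy['senator_a']['v2']
# The same cell addressed as (row, column)
action_loc = toy.loc['v2', 'senator_a']
# senator_a has no record for v3, so the cell is NaN
missing = toy['senator_a']['v3']
```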

## IV. Calculating similarity values

One critical detail for our analysis is identifying the set of possible actions that Senators can take on each vote.

In [8]:
list(pandas.Series(votes.values.ravel()).unique())

[nan, u'Yea', u'Not Voting', u'Nay', u'Present', u'Not Guilty', u'Guilty']

We can now develop and apply a similarity metric to this dataset. Remember, our goal is to evaluate the similarity of Senators' voting records. To illustrate the data we're evaluating, check out this subset of the pandas dataframe for Bernie and Hillary's columns:

In [9]:
hillary_id = [id_ for id_, sen in senators.items()
              if sen['lastname'] == 'Clinton' and
                  sen['firstname'] == 'Hillary'][0]

votes.get([bernie_id, hillary_id])[1017:1025]

| Vote ID | 400357 | 300022 |
|---|---|---|
| 111248 | Yea | Yea |
| 111250 | Yea | Yea |
| 111251 | Yea | Yea |
| 111254 | Nay | Nay |
| 111255 | Nay | Nay |
| 111260 | Yea | Yea |
| 111264 | Nay | Yea |
| 111266 | Yea | Not Voting |


Each row from the two columns is a datapoint to evaluate whether Bernie and Hillary are similar to one another. If you're comfortable with graph theory, social networks, or recommendation systems, this will look pretty familiar.

Looking at the pair of actions in each datapoint or vote, we can see that Sanders and Clinton can 1) both vote `Yea`, 2) both vote `Nay`, 3) vote `Nay` and `Yea`, respectively, and 4) vote and not vote, respectively. Ignoring for a second that there are other possible responses, how can we use this information to calculate their similarity?

Well, one way of addressing this question would be to count a point towards similarity if both Senators take the same action (both `Yea` or both `Nay`), a point towards dissimilarity if the two cast different votes (`Yea` and `Nay`), and to ignore the datapoint if at least one Senator did not vote (`nan` or `Not Voting`). The similarity is then the net number of agreements divided by the number of counted votes, `(same - diff) / (same + diff)`. Essentially, we're applying a modified Jaccard index to these two columns. Similarity values will range between `1` for Senators with identical voting records and `-1` for Senators who always took different voting actions.

To make things even simpler, we're only going to count `Yea` and `Nay`, ignoring all other possible actions. It's not that we couldn't tease meaning from `Present` or `Not Voting`, such as Senators avoiding divisive votes during election years, but we would need to dig deeper to feel confident about their meaning. We're also going to require that each pair of Senators have at least 100 shared votes to evaluate their similarity, defaulting to `NA` if we don't have enough data.

In [11]:
import numpy as np

# Require at least 100 shared votes
_THRESHOLD_VOTES = 100

# Function to calculate senator similarity
def calculate_senator_similarity(votes_i, votes_j):
    """
    Calculate the similarity of two Senators based on their voting records.
    """
    idx_overlap = \
        np.logical_and(
            np.logical_or(votes_i.values == 'Yea', votes_i.values == 'Nay'),
            np.logical_or(votes_j.values == 'Yea', votes_j.values == 'Nay'))
    same = np.sum(votes_i.values[idx_overlap] == votes_j.values[idx_overlap])
    diff = np.sum(votes_i.values[idx_overlap] != votes_j.values[idx_overlap])
    total = np.sum(idx_overlap)
    if total >= _THRESHOLD_VOTES:
        similarity = float(same - diff) / total
    else:
        similarity = None
    return similarity


# Step through Senators, only calculating once for each pair
similarities = {senator: {} for senator in votes.columns}

for idx_i, senator_i in enumerate(votes.columns):
    print '\rCalculating similarity for Senator ', idx_i+1, ' out of ', len(votes.columns),
    for idx_j, senator_j in enumerate(votes.columns):
        if idx_i < idx_j:
            votes_i = votes.get(senator_i)
            votes_j = votes.get(senator_j)
            similarity = calculate_senator_similarity(votes_i, votes_j)
            similarities[senator_i].update({senator_j: similarity})
            similarities[senator_j].update({senator_i: similarity})

Calculating similarity for Senator  201 out of 201


Note that I calculated the similarity for each pair of Senators only once. We store the similarity value for Senators `i` and `j` twice, in case we'd like to use the indices in any order.
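As a sanity check on the metric, here is a plain Python 3 sketch applied to the eight Sanders and Clinton rows shown earlier. `toy_similarity` is a hypothetical stand-in for the function above, and its threshold is lowered to 1 (an assumption for this toy case) so that eight rows are enough to produce a value.

```python
COUNTED = ('Yea', 'Nay')


def toy_similarity(votes_i, votes_j, threshold=1):
    """(same - diff) / total over votes where both actions are Yea or Nay."""
    pairs = [(a, b) for a, b in zip(votes_i, votes_j)
             if a in COUNTED and b in COUNTED]
    total = len(pairs)
    if total == 0 or total < threshold:
        return None
    same = sum(1 for a, b in pairs if a == b)
    diff = total - same
    return (same - diff) / total


# The eight rows from the dataframe excerpt, in order
sanders = ['Yea', 'Yea', 'Yea', 'Nay', 'Nay', 'Yea', 'Nay', 'Yea']
clinton = ['Yea', 'Yea', 'Yea', 'Nay', 'Nay', 'Yea', 'Yea', 'Not Voting']

# 6 agreements, 1 disagreement, 1 ignored row -> (6 - 1) / 7
excerpt_similarity = toy_similarity(sanders, clinton)
print(round(excerpt_similarity, 3))  # 0.714
```

Since `diff = total - same`, the value equals `2 * same / total - 1`, so a similarity of `s` corresponds to agreeing on a fraction `(1 + s) / 2` of the counted votes.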

Again, we'll make use of a pandas dataframe to format this dictionary. However, this matrix will have Senators as columns and rows, with each matrix element `[i, j]` representing the similarity between Senators `i` and `j`. As a reminder, we expect Senators to have no similarity value relative to Senators with whom they share fewer than `100` votes, or relative to themselves, since the loop skips pairs where `i` equals `j`.

In [12]:
# Convert similarities to a dataframe
similarities = pandas.DataFrame(similarities)
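The nested loop above compares every pair of columns in Python. As an alternative (not what this post uses), the same numbers can be obtained from two matrix products by encoding `Yea` as `+1`, `Nay` as `-1`, and everything else as `0`. This is a sketch under those assumptions for Python 3 with numpy and pandas. `similarity_matrix` and the toy dataframe are hypothetical, and the toy threshold is lowered to 2.

```python
import numpy as np
import pandas as pd


def similarity_matrix(votes, threshold=100):
    """Pairwise (same - diff) / total for all columns of a votes dataframe."""
    # +1 for Yea, -1 for Nay, 0 for anything else (including NaN)
    signed = ((votes == 'Yea').astype(float)
              - (votes == 'Nay').astype(float)).to_numpy()
    counted = np.abs(signed)
    # Each shared vote adds +1 on agreement and -1 on disagreement
    net = signed.T @ signed
    # Number of votes where both Senators voted Yea or Nay
    total = counted.T @ counted
    sim = np.full(net.shape, np.nan)
    enough = total >= threshold
    sim[enough] = net[enough] / total[enough]
    # Leave self-similarity undefined, as the pairwise loop does
    np.fill_diagonal(sim, np.nan)
    return pd.DataFrame(sim, index=votes.columns, columns=votes.columns)


toy_votes = pd.DataFrame(
    {'a': ['Yea', 'Nay', 'Yea', 'Yea'],
     'b': ['Yea', 'Nay', 'Nay', np.nan],
     'c': ['Nay', 'Yea', 'Not Voting', 'Nay']},
    index=['v1', 'v2', 'v3', 'v4'])

toy_sim = similarity_matrix(toy_votes, threshold=2)
print(toy_sim)
```

In the toy data, `a` and `b` share three counted votes with two agreements and one disagreement, giving `1/3`, while `a` and `c` disagree on all three shared votes, giving `-1`.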

## V. Analyzing similarity values

Using this dataframe, we can easily see the similarity of any two Senators. For instance, the whole premise of this post was that we wanted to know how similar Bernie and Hillary were based on their voting record:

In [13]:
similarities[bernie_id][hillary_id]

0.86390532544378695

Given that this is essentially a modified recommendation system, this is the relative confidence that we think someone would support Hillary, given that they thought Bernie was a good Senator (or vice versa). That is, if a similarity value of `1` represents high confidence that someone would like a second Senator, given their preference for the first, `-1` would represent high confidence that someone would not like a second Senator, and `0` would represent complete uncertainty. Here, we're pretty confident that a Bernie supporter would also be comfortable voting for Hillary, and a Hillary supporter would just as easily vote for Bernie.

What about the Republicans? What's the chance that a Bernie or Hillary supporter would also support Ted Cruz, Lindsey Graham, Rand Paul, or Rick Santorum?

In [14]:
# Get Republican GovTrack IDs
republicans = [('Ted', 'Cruz'), ('Lindsey', 'Graham'), ('Rand', 'Paul'),
               ('Richard', 'Santorum')]
r_ids = {' '.join(republican): id_ for id_, sen in senators.items()
         for republican in republicans
         if sen['firstname'] == republican[0] and 
            sen['lastname'] == republican[1]}

# Print similarity values
for r_name, r_id in r_ids.items():
    print 'Bernie and ' + r_name, similarities[bernie_id][r_id]
    print 'Hillary and ' + r_name, similarities[hillary_id][r_id], '\n'

Bernie and Rand Paul -0.435294117647
Hillary and Rand Paul nan 

Bernie and Lindsey Graham -0.300217864924
Hillary and Lindsey Graham -0.232323232323 

Bernie and Richard Santorum nan
Hillary and Richard Santorum -0.205555555556 

Bernie and Ted Cruz -0.441605839416
Hillary and Ted Cruz nan 



There are two things we expect to see here. First, supporters of Bernie and Hillary are expected to not support any of the Republican candidates for President. Second, we cannot estimate the similarity between Bernie and Santorum, or between Hillary and Paul or Cruz, because their tenures did not overlap.

We could infer unknown similarity values from known Senator similarities. For instance, Bernie joined the Senate in 2007, the same year that Santorum left. If we can find Senators who served both before and after 2007, we can compare their pre-2007 Santorum similarities to their post-2007 Bernie similarities. One of the simplest approaches would be looking at correlations in the Santorum and Bernie similarities. Positive correlations would suggest that Bernie and Santorum would have been similar, while negative correlations would suggest the opposite.

In [15]:
import scipy.stats

# Find Senators that have a defined similarity to both Bernie and Santorum
bernie_similarities = similarities.get(bernie_id)\
    .values.flatten()
santorum_similarities = similarities.get(r_ids['Richard Santorum'])\
    .values.flatten()
idx_shared = np.where(
    np.logical_and(np.isfinite(bernie_similarities),
                   np.isfinite(santorum_similarities)))[0]

# Calculate the correlation between Bernie and Santorum similarities to 
# other Senators
scipy.stats.pearsonr(bernie_similarities[idx_shared],
                     santorum_similarities[idx_shared])[0]

-0.94672584909717317

Ignoring any statistical or political issues with this type of analysis, the correlation between the similarity of Senators to Bernie and the similarity of Senators to Santorum is `-0.95`, suggesting that Bernie and Santorum would have been dissimilar had their Senate service overlapped.
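The masking step matters as much as the correlation itself, because only Senators with a defined similarity to both people can contribute. Here is a minimal numpy-only sketch with made-up similarity values. `masked_pearson` is a hypothetical helper that also reports how many Senators the estimate rests on.

```python
import numpy as np


def masked_pearson(x, y):
    """Pearson correlation over positions where both inputs are finite."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    shared = np.isfinite(x) & np.isfinite(y)
    r = np.corrcoef(x[shared], y[shared])[0, 1]
    return r, int(shared.sum())


# Made-up similarities of five Senators to two people whose terms never overlapped
sims_to_first = [0.9, 0.8, np.nan, -0.4, -0.5]
sims_to_second = [-0.3, -0.2, 0.7, np.nan, 0.8]

r, n = masked_pearson(sims_to_first, sims_to_second)
print(n)  # 3 Senators have a defined similarity to both
```

Reporting `n` alongside `r` is a design choice for this sketch: a strong correlation computed from a handful of shared Senators deserves less weight than one computed from dozens.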

Another way to look at this would be to ask which Senators have voting records most similar to Bernie or Hillary. Again, this is our relative estimate for which Senators someone might like, given that they like either Bernie or Hillary. These seem to make sense given political and demographic trends.

In [19]:
print 'Senators most similar to Bernie:'
bernie_sorted = similarities[bernie_id].sort_values(ascending=False)
for id_, sim in bernie_sorted[:10].iteritems():
    print '  ', senators[id_]['name'], ' = ', sim

Senators most similar to Bernie:
   Vice President Joseph Biden [D]  =  0.943396226415
   Sen. Tammy Baldwin [D-WI]  =  0.924349881797
   Sen. Edward “Ed” Markey [D-MA]  =  0.920477137177
   Sen. Cory Booker [D-NJ]  =  0.919642857143
   Sen. Brian Schatz [D-HI]  =  0.913533834586
   Sen. Mazie Hirono [D-HI]  =  0.911894273128
   Sen. Elizabeth Warren [D-MA]  =  0.905429071804
   Sen. Patrick Leahy [D-VT]  =  0.903357903358
   Sen. Christopher Murphy [D-CT]  =  0.893961708395
   Sen. Roland Burris [D-IL, 2009-2010]  =  0.890971039182


In [20]:
print 'Senators most similar to Hillary:'
hillary_sorted = similarities[hillary_id].sort_values(ascending=False)
for id_, sim in hillary_sorted[:10].iteritems():
    print '  ', senators[id_]['name'], ' = ', sim

Senators most similar to Hillary:
   Sen. Sherrod Brown [D-OH]  =  0.928143712575
   Sen. Benjamin Cardin [D-MD]  =  0.911504424779
   Sen. Charles “Chuck” Schumer [D-NY]  =  0.910031488979
   Sen. Sheldon Whitehouse [D-RI]  =  0.905882352941
   President Barack Obama [D]  =  0.882750845547
   Sen. Frank Lautenberg [D-NJ, 2003-2013]  =  0.878064110622
   Sen. Robert “Bob” Casey [D-PA]  =  0.876832844575
   Sen. John Kerry [D-MA, 1985-2013]  =  0.870855148342
   Sen. Barbara Mikulski [D-MD]  =  0.869763205829
   Sen. Barbara Boxer [D-CA]  =  0.868943239502


We could also ask which Senators a Bernie supporter would not support:

In [21]:
print 'Senators most dissimilar to Bernie:'
bernie_sorted = similarities[bernie_id].sort_values(ascending=True)
for id_, sim in bernie_sorted[:10].iteritems():
    print '  ', senators[id_]['name'], ' = ', sim

Senators most dissimilar to Bernie:
   Sen. Jim DeMint [R-SC, 2005-2012]  =  -0.584964761159
   Sen. Jon Kyl [R-AZ, 1995-2012]  =  -0.530221882173
   Sen. Thomas Coburn [R-OK, 2005-2014]  =  -0.527764815679
   Sen. Tim Scott [R-SC]  =  -0.490254872564
   Sen. Jim Bunning [R-KY, 1999-2010]  =  -0.486935866983
   Sen. Mike Lee [R-UT]  =  -0.486338797814
   Sen. James “Jim” Inhofe [R-OK]  =  -0.476989247312
   Sen. James Risch [R-ID]  =  -0.46511627907
   Sen. John Ensign [R-NV, 2001-2011]  =  -0.461461461461
   Sen. Ted Cruz [R-TX]  =  -0.441605839416
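The listings above repeat the same sort-and-slice pattern, so it can be wrapped in a small helper. This is a sketch with a made-up similarity table. `rank_similar` is a hypothetical name, and dropping undefined values first is an added safeguard so that pairs without enough shared votes can never show up in a ranking.

```python
import pandas as pd


def rank_similar(similarities, senator_id, n=10, most_similar=True):
    """Top-n most (or least) similar Senators, skipping undefined pairs."""
    scores = similarities[senator_id].dropna()
    return scores.sort_values(ascending=not most_similar).head(n)


nan = float('nan')
# Made-up similarities; 'd' never shared enough votes with 'a'
toy_similarities = pd.DataFrame({
    'a': {'a': nan, 'b': 0.9, 'c': -0.5, 'd': nan},
    'b': {'a': 0.9, 'b': nan, 'c': -0.3, 'd': 0.2},
})

closest = rank_similar(toy_similarities, 'a', n=2)
farthest = rank_similar(toy_similarities, 'a', n=2, most_similar=False)
print(list(closest.index))   # ['b', 'c']
print(list(farthest.index))  # ['c', 'b']
```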


That's interesting and all, but we can do so much more with the data we've gathered. All of the comparisons above are pairwise; that is, they only account for the relationship between any two Senators. We can use the overall structure of this matrix, which is essentially a social network with weights representing political similarity, to explore groupings and other large-scale patterns. That's on tap for another notebook.