### Principal Component Analysis of US Senators Based on Voting Patterns
&nbsp;

We perform principal component analysis (PCA) on US Senate voting patterns to identify clusters among US senators. We'll use the voting history to compare the polarisation in American politics across the years 1993 to 2016.
&nbsp;

We use PCA on the voting records to reduce their dimensionality to 2D in order to visualise voting patterns. For more information on PCA, I suggest reading [this great blog post by Matt Brems](https://towardsdatascience.com/a-one-stop-shop-for-principal-component-analysis-5582fb7e0a9c?gi=de80317e8307) on the subject.
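
As a rough illustration of the idea, the sketch below runs PCA on a small made-up vote matrix. The six senators and five votes are invented for the example and are not taken from the dataset. Senators who vote alike should land near each other in the 2D projection.

```python
import numpy as np
from sklearn.decomposition import PCA

# hypothetical toy data: rows are senators, columns are roll call votes
# (1 = Yea, -1 = Nay, 0 = abstention or missing)
toy_votes = np.array([
    [ 1,  1, -1, -1,  1],
    [ 1,  1, -1, -1,  0],
    [ 1,  1, -1,  1,  1],
    [-1, -1,  1,  1, -1],
    [-1, -1,  1,  1,  0],
    [-1, -1,  1, -1, -1],
])

# project the 5-dimensional voting records onto 2 principal components
toy_pca = PCA(n_components=2)
toy_coords = toy_pca.fit_transform(toy_votes)

print(toy_coords.shape)                   # one (x1, x2) pair per senator
print(toy_pca.explained_variance_ratio_)  # share of variance kept by each component
```

In this toy case the first three rows and the last three rows vote as opposing blocs, so they should fall on opposite sides of the first component.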

&nbsp;

The voting records are downloaded from [Bradley Robinson](https://data.world/bradrobinson/us-senate-voting-records).

We process the data before applying PCA.

In the downloaded data, votes are coded as follows:
* Yea is 1
* Nay is 0
* Abstention or missing data is 2

We change the coding to:
* Yea is 1
* Nay is -1
* Abstention or missing data is 0
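
The recoding above can be sketched on a tiny, made-up table. The senator names and vote columns are hypothetical, and the sketch uses `Series.map` rather than the `DataFrame.replace` call used in the notebook code further down. One plausible reading of the new coding is that it places an abstention midway between Yea and Nay.

```python
import pandas as pd

# hypothetical mini roll call table using the downloaded coding
# (1 = Yea, 0 = Nay, 2 = abstention or missing)
raw_votes = pd.DataFrame(
    {'vote_1': [1, 0, 2], 'vote_2': [0, 0, 1]},
    index=['Senator A', 'Senator B', 'Senator C'],
)

# old code -> new code; each value is looked up once,
# so a Nay recoded to -1 is never recoded a second time
recode = {1: 1, 0: -1, 2: 0}
recoded_votes = raw_votes.apply(lambda col: col.map(recode))

print(recoded_votes)
```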


Despite the analysis being based only on roll call votes and not policy positions, clear clusters emerge from the voting data. The trend towards greater polarization in Congress over the past few decades emerges naturally from this analysis. Moreover, the analysis is effective at highlighting outliers and swing voters.


In [536]:
# import relevant libraries
import csv, os, re, math
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from plotly.offline import init_notebook_mode, iplot
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from plotly.graph_objs import *
import plotly.express as px
import plotly.graph_objects as go
from IPython.display import display, HTML
init_notebook_mode(connected=True)

In [645]:
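# given a list of names, this function returns the list in natural order,
# so that embedded numbers compare numerically ('2' sorts before '10')
def numericalSort(names):
    numbers = re.compile(r'(\d+)')
    def key(value):
        # split into text and digit chunks; digit chunks sit at odd positions
        parts = numbers.split(value)
        parts[1::2] = map(int, parts[1::2])
        return parts
    return sorted(names, key=key)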

# path to data files
path = './data/data/' 
# list of filenames in data folder
filenames = numericalSort(os.listdir(path))
# list of years from 1993 to 2016
year = np.arange(1993, 1993+len(os.listdir(path)))
# list of congress session numbers
session = np.floor((year-1787)/2).astype(np.int8)
# dictionary for which congress session at each year
year_session_dict = {k: v for k, v in zip(year, session)}
# X1 and X2 are the two PCA components, namesArray and partiesArray 
# contain the names of the senators and their parties respectively
X1, X2, namesArray, partiesArray = [], [], [], []
idx = 0
# loop over the CSV files
for i in range(len(filenames)//2):
    # each congress meeting is split into two CSV files
    # we open both and combine them into one dataframe
    # file1 is the first half and file2 is the second half
    file1 = filenames[idx]
    file2 = filenames[idx+1]
    
    # read the CSV files and save into pandas dataframe
    df1 = pd.read_csv(path+file1, sep=',')
    df2 = pd.read_csv(path+file2, sep=',')
    if 'name' not in df1.columns:
        df1 = pd.read_csv(path+file1, sep=',', header=1)
    if 'name' not in df2.columns:
        df2 = pd.read_csv(path+file2, sep=',', header=1)    
    
    # change the index to be the senator names and drop the first column
    df1.set_index('name', inplace=True)
    df1 = df1.drop(df1.columns[0], axis=1)
    df2.set_index('name', inplace=True)
    df2 = df2.drop(df2.columns[0], axis=1)
    
    # merge both dataframes into one dataframe
    df = pd.merge(left=df1,right=df2, left_on='name', right_on='name')
    df.rename(columns={'party_x':'party'}, inplace=True)
    del df['party_y']
    
    # keep only senators who missed fewer than 30% of the votes
    df = df.loc[df.replace(2, np.nan).isnull().mean(axis=1) < .3 ,:]
    
    # list of senator names and party for each congressional meeting
    names = df.index.values
    parties = df['party'].values
    
    # recode the votes: Yea 1 -> 1, Nay 0 -> -1, abstention or missing 2 -> 0
    df = df.replace([True, False], [1, 0])
    df = df.replace([0, 1, 2], [-1, 1, 0])
    
    # remove party column and convert voting data to numpy array
    del df['party']
    x = df.to_numpy()
    
    # scale data to have zero mean and unit standard deviation
    x = StandardScaler().fit_transform(x)
    
    # calculate the two principal components
    pca = PCA(n_components=2)
    principalComponents = pca.fit_transform(x)

    print('Explained variation per principal component: {}'.format(pca.explained_variance_ratio_))
    print('PCA holds '+str(round(100*np.sum(pca.explained_variance_ratio_),2))+
                          '% of the information. '+ str(round(100*(1-
                           np.sum(pca.explained_variance_ratio_)),2))+'% is lost')

    x1 = principalComponents[:,0]
    x2 = principalComponents[:,1]
    
    # append the X1 and X2 arrays with the principal components of each...
    # ...congressional meeting
    X1.append(x1)
    X2.append(x2)
    
    # append names and parties array with names and parties of senators...
    # ... for each congressional meeting
    namesArray.append(names)
    partiesArray.append(parties)
    idx+=2

# replace 'ID' with 'I' (both represent independent senators)
for idx1, session_year in enumerate(partiesArray):
    for idx2, party_name in enumerate(session_year):
        if party_name == 'ID':
            partiesArray[idx1][idx2] = 'I'

Explained variation per principal component: [0.38696505 0.04123276]
PCA holds 42.82% of the information. 57.18% is lost
Explained variation per principal component: [0.45013447 0.02918053]
PCA holds 47.93% of the information. 52.07% is lost
Explained variation per principal component: [0.3767398  0.04056872]
PCA holds 41.73% of the information. 58.27% is lost
Explained variation per principal component: [0.44172924 0.03081053]
PCA holds 47.25% of the information. 52.75% is lost
Explained variation per principal component: [0.37300319 0.04212149]
PCA holds 41.51% of the information. 58.49% is lost
Explained variation per principal component: [0.49671641 0.03544079]
PCA holds 53.22% of the information. 46.78% is lost
Explained variation per principal component: [0.44569079 0.03373602]
PCA holds 47.94% of the information. 52.06% is lost
Explained variation per principal component: [0.27928151 0.14012035]
PCA holds 41.94% of the information. 58.06% is lost
Explained variation per principa

Naturally, a large share of the information is lost in the process, because we are projecting a high-dimensional space onto a two-dimensional one. For the 114th Congress, for example, we go from a 493-dimensional space to a 2-dimensional space while still retaining 53.88% of the information (as measured by explained variance).
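
To see how the retained share grows with the number of components, one option is to fit PCA without a component limit and accumulate the explained variance ratios. The sketch below is only an illustration: it uses random stand-in data (a hypothetical 100 by 50 matrix driven by one hidden factor), not the Senate votes, and the 90% threshold is an arbitrary choice.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# stand-in data: 100 "senators" x 50 "votes", one hidden factor plus noise
rng = np.random.default_rng(0)
hidden_factor = rng.normal(size=(100, 1))
loadings = rng.normal(size=(1, 50))
demo_votes = hidden_factor @ loadings + 0.5 * rng.normal(size=(100, 50))

demo_scaled = StandardScaler().fit_transform(demo_votes)

# fit all components, then accumulate the explained variance ratios
full_pca = PCA().fit(demo_scaled)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)

print('share kept by 2 components:', round(cumulative[1], 4))
# smallest number of components whose cumulative share reaches 90%
print('components needed for 90%:', int(np.searchsorted(cumulative, 0.90) + 1))
```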
<br><br>
Next, we plot and visualise the principal components.

In [653]:
# function to make dataframe from the principal components data...
# ...which is used for plotting with plotly. Takes the congress year...
# ...as input and outputs the relevant dataframe for that year
def make_dataframe(year):
    i = (year - 1993)//2
    d = {'x1': X1[i], 'x2': X2[i], 'name': namesArray[i], 'party': partiesArray[i]}
    
    dd = pd.DataFrame(data=d)
    dd = dd.sort_values(['party'])
    
    party_names = np.unique(partiesArray[i])
    df = {party:dd.query("party == '%s'" %party)
                              for party in party_names}
    return df

# function that clusters data using k-means clustering...
# ...it finds the ideal number of clusters using the...
# ...silhouette method
def k_means_cluster(X):
    best_centers = None
    # silhouette scores lie in [-1, 1], so start below the minimum
    max_silhouette_avg = -1
    for n in range(2, 10):
        kmeans = KMeans(n_clusters=n, init='k-means++',
                        max_iter=300, n_init=10, random_state=0)
        # fit_predict fits the model and returns the labels in one pass
        pred_y = kmeans.fit_predict(X)
        silhouette_avg = silhouette_score(X, pred_y)
        if silhouette_avg > max_silhouette_avg:
            # keep the centers of the best model instead of refitting later
            best_centers = kmeans.cluster_centers_
            max_silhouette_avg = silhouette_avg

    return best_centers


# function to plot the principal components using plotly. Takes the...
# ... congress year as input and plots the principal components for that year
def plot_PCA(year):
    idx = (year-1993)//2
    X = np.vstack([X1[idx], X2[idx]]).T
    clusters = k_means_cluster(X)
    if year%2 == 0:
        year_title = '(' + str(year-1) + '-' + str(year+1) +')'
    else:
        year_title = '(' + str(year) + '-' + str(year+2) +')'
    
    congress_number = year_session_dict[year]
    df = make_dataframe(year)
    fig = go.Figure()
    for party_name, party in df.items():
        fig.add_trace(go.Scatter(
            x = party['x1'],
            y = party['x2'],
            name = party_name,
            hovertemplate = 
            "<b>%{text}</b><br><br>",
            text = party['name'],
            marker_size=5))

    fig.update_traces(
        mode='markers')
            
    fig.add_trace(go.Scatter(
        mode='markers',
        hoverinfo='none',
        x = clusters[:, 0],
        y = clusters[:, 1],
        name = 'cluster',
        opacity = 0.25,
        marker=dict(
            color='rgba(0, 0, 0, 0)',
            size=100,
            line=dict(
                color='red',
                width=8))))
    
    fig.update_layout(
        width=600,
        height=500,
        margin=go.layout.Margin(l=15,r=1,b=10,t=50,pad=1),
        xaxis={'title':'PCA 1'},
        yaxis={'title':'PCA 2'},
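        # build the ordinal suffix so that 103 becomes '103rd' and 111 '111th'
        title=str(congress_number) + (
            'th' if 10 <= int(congress_number) % 100 <= 20
            else {1: 'st', 2: 'nd', 3: 'rd'}.get(int(congress_number) % 10, 'th')
        ) + ' Congress ' + year_title,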
        font=dict(
            family="Courier New, monospace",
            size=12,
            color="#000000"
         )
    )
    fig.show()    
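
The silhouette criterion used in `k_means_cluster` can be sanity-checked on synthetic data where the number of groups is known. The sketch below assumes three well-separated blobs generated with scikit-learn's `make_blobs`. This is stand-in data, not PCA output from the Senate votes, and the candidate range of 2 to 6 clusters is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# stand-in 2D data with three well-separated groups
points, _ = make_blobs(n_samples=150, centers=[[-5, 0], [0, 5], [5, 0]],
                       cluster_std=0.5, random_state=0)

# average silhouette score for each candidate number of clusters
scores = {}
for n in range(2, 7):
    labels = KMeans(n_clusters=n, n_init=10, random_state=0).fit_predict(points)
    scores[n] = silhouette_score(points, labels)

# the candidate with the highest average silhouette score wins
best_n = max(scores, key=scores.get)
print(scores)
print('best number of clusters:', best_n)
```

A higher average silhouette score means points sit closer to their own cluster than to the nearest other cluster, which is why the maximum is taken as the preferred number of clusters.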

In [654]:
plot_PCA(1993)

### 103rd Congress (1993-1995)

The 103rd Senate voting topology clearly separates senators by party, with the Democrats forming a close-knit cluster while the Republicans are more spread out.

Two notable exceptions are Senator Shelby, then a Democrat, and Senator Jeffords, then a Republican. Shelby voted more often along the Republican line and, similarly, Jeffords voted more often along the Democratic line. Unsurprisingly, both senators eventually left their parties: Shelby joined the Republican Party in 1994, during President Clinton's first term, and Jeffords left the Republican Party in 2001 to become an independent caucusing with the Democrats.

In [656]:
plot_PCA(2008)

### 110th Congress (2007-2009)

The 110th Senate voting topology also separates senators by party; however, unlike in the 103rd Senate, each party is further divided into two voting blocs, indicating an internal split in each party's ideology. The Democratic clusters are also denser than their Republican counterparts, with very few Democrats straying from the party line.

On the Republican side, several senators diverge from the party line; Senator Snowe and Senator Collins are the most likely to vote across the aisle.

In [657]:
plot_PCA(2016)

### 114th Congress (2015-2017)

The 114th Senate voting topology shows two parties that are completely disjoint, highlighting the division in current US politics. Even senators considered swing voters, such as Senator Manchin and Senator Collins, are closely aligned with their respective parties.

Interestingly, the two libertarian-leaning Republicans, Senator Mike Lee and Senator Rand Paul, are separated from the rest of the Republican Party. On the Democratic side, Bernie Sanders also deviates from the rest of the Democratic cluster, illustrating his exceptionally progressive voting pattern.