### Principal Component Analysis of US Senators Based on Voting Patterns

We perform principal component analysis (PCA) on US Senate voting records to identify clusters among US senators. We'll use the voting history to compare the polarisation in American politics for the years 2012 to 2016.

We use PCA on the voting records to reduce their dimensionality to 2D in order to visualise voting patterns. For more information on PCA, I suggest reading [this great blog post by Matt Brems](https://towardsdatascience.com/a-one-stop-shop-for-principal-component-analysis-5582fb7e0a9c?gi=de80317e8307) on the subject.
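
To make the idea concrete, here is a minimal sketch on synthetic data (not the Voteview records). It builds two hypothetical voting blocs that mostly vote against each other, then projects every member onto the first two principal components using an SVD of the centred vote matrix. The names `project_2d`, `make_bloc` and `party_line`, as well as the 10% defection rate, are assumptions made purely for illustration.

```python
import numpy as np

def project_2d(votes):
    """Project the rows of a vote matrix onto its first two principal components."""
    centered = votes - votes.mean(axis=0)      # centre each roll call (column)
    # rows of vt are the principal directions of the centred data
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T                 # coordinates along PC1 and PC2

rng = np.random.default_rng(0)
n_votes = 50
# hypothetical "party line" on each roll call: +1 (Yea) or -1 (Nay)
party_line = rng.choice([-1, 1], size=n_votes)

def make_bloc(sign, n_members):
    """Members follow sign * party_line and defect on roughly 10% of the votes."""
    flips = rng.random((n_members, n_votes)) < 0.1
    return np.where(flips, -sign * party_line, sign * party_line)

votes = np.vstack([make_bloc(+1, 10), make_bloc(-1, 10)])
coords = project_2d(votes)
print(coords.shape)  # (20, 2)
```

In this toy setting the two blocs end up on opposite sides of the first component, which is the kind of separation we look for in the real records.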

The voting records are downloaded from [Voteview](https://voteview.com/data).

In [675]:
# import relevant libraries
import csv, os, re, math
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from plotly.offline import init_notebook_mode, iplot
from plotly.graph_objs import *
import plotly.express as px
import plotly.graph_objects as go
from IPython.display import display, HTML
init_notebook_mode(connected=True)

We start by downloading the data from Voteview. Three datasets are required: members' votes, members' ideologies, and party data. We download the data in CSV format and use pandas to read and manipulate it.

In [676]:
path_voting_data = 'all_votes.csv'   # members' voting data
path_bio_data = 'all_members.csv'    # members' biodata and ideology
path_party_data = 'all_parties.csv'  # party codes

# let's open the files and look at the data

df_bio = pd.read_csv(path_bio_data, sep=',')
print("Member's Biodata")
display(df_bio)

df_party_dict = pd.read_csv(path_party_data, sep=',')
print("Parties' Data")
display(df_party_dict)

df_voting = pd.read_csv(path_voting_data, sep=',')
print("Voting Data")
display(df_voting)

Member's Biodata


Unnamed: 0,congress,chamber,icpsr,state_icpsr,district_code,state_abbrev,party_code,occupancy,last_means,bioname,...,died,nominate_dim1,nominate_dim2,nominate_log_likelihood,nominate_geo_mean_probability,nominate_number_of_votes,nominate_number_of_errors,conditional,nokken_poole_dim1,nokken_poole_dim2
0,1,President,99869,99,0,USA,5000,,,"WASHINGTON, George",...,,,,,,,,,,
1,1,Senate,2936,1,0,CT,5000,0.0,3.0,"ELLSWORTH, Oliver",...,1807.0,0.530,0.809,-24.37915,0.77800,97.0,8.0,,0.528,0.849
2,1,Senate,4998,1,0,CT,5000,0.0,3.0,"JOHNSON, William Samuel",...,1819.0,0.991,0.137,-30.41227,0.69000,82.0,16.0,,0.997,0.075
3,1,Senate,507,11,0,DE,4000,0.0,3.0,"BASSETT, Richard",...,1815.0,0.087,0.007,-38.18355,0.65400,90.0,23.0,,0.024,0.166
4,1,Senate,7762,11,0,DE,5000,0.0,3.0,"READ, George",...,1798.0,0.282,-0.239,-34.31907,0.69900,96.0,15.0,,0.270,-0.206
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9805,116,Senate,40915,56,0,WV,100,,,"MANCHIN, Joe, III",...,,-0.055,0.445,-37.60304,0.89292,332.0,9.0,,-0.041,0.356
9806,116,Senate,29940,25,0,WI,100,,,"BALDWIN, Tammy",...,,-0.505,-0.204,-68.30391,0.81555,335.0,39.0,,-0.387,-0.140
9807,116,Senate,41111,25,0,WI,200,,,"JOHNSON, Ron",...,,0.599,-0.290,-24.53653,0.92834,330.0,11.0,,0.561,0.040
9808,116,Senate,40707,68,0,WY,200,,,"BARRASSO, John A.",...,,0.538,0.237,-17.83657,0.94830,336.0,12.0,,0.602,0.251


Parties' Data


Unnamed: 0,congress,chamber,party_code,party_name,n_members,nominate_dim1_median,nominate_dim2_median,nominate_dim1_mean,nominate_dim2_mean
0,1,President,5000,Pro-Administration,1,,,,
1,1,House,4000,Anti-Administration,29,0.0180,0.0920,-0.024379,0.141931
2,1,House,5000,Pro-Administration,37,0.5780,0.0570,0.527838,0.025757
3,1,Senate,4000,Anti-Administration,9,-0.2375,-0.2130,-0.239750,-0.044750
4,1,Senate,5000,Pro-Administration,20,0.4270,-0.3085,0.352050,-0.166200
...,...,...,...,...,...,...,...,...,...
820,116,House,200,Republican,202,0.5190,0.1170,0.507178,0.074411
821,116,House,328,Independent,1,0.2800,-0.9600,0.280000,-0.960000
822,116,Senate,100,Democrat,45,-0.3380,-0.1320,-0.345111,-0.124178
823,116,Senate,200,Republican,53,0.4520,0.0240,0.492170,0.042887


Voting Data


Unnamed: 0,congress,chamber,rollnumber,icpsr,cast_code,prob
0,1,Senate,1,507,1,90.4
1,1,Senate,1,1346,6,48.6
2,1,Senate,1,1536,1,99.8
3,1,Senate,1,2307,1,100.0
4,1,Senate,1,2936,1,99.7
...,...,...,...,...,...,...
4259377,116,Senate,366,49300,1,80.3
4259378,116,Senate,366,49308,1,45.8
4259379,116,Senate,366,49703,1,100.0
4259380,116,Senate,366,49706,1,100.0


The fields are (from Voteview):
<br><br>
* **congress**: Integer 1+. The number of the congress that this member’s row refers to. e.g. 115 for the 115th Congress (2017-2019)
<br><br>
* **chamber**: House, Senate, or President. The chamber in which the member served.
<br><br>
* **rollnumber**: Integer 1+. Starts from 1 in the first rollcall of each congress. Excludes quorum calls and vacated votes.
<br><br>
* **icpsr**: Integer 1-99999. This is an ID code which identifies the member in question. In general, each member receives a single ICPSR identifier applicable to their entire career. A small number of members have received more than one: this can occur for members who have switched parties; as well as members who subsequently become president. Creating a new identifier allows a new NOMINATE estimate to be produced for separate appearances of a member in different roles.
<br><br>
* **cast_code**: Integer 0-9. Indicator of how the member voted.
<br><br>
* **prob**: Estimated probability, based on NOMINATE, of the member making the vote as recorded
<br><br>
* **state_icpsr**: Integer 0-99. Identifier for the state represented by the member.
<br><br>
* **district_code**: Integer 0-99. Identifier for the district that the member represents within their state (e.g. 3 for the Alabama 3rd Congressional District). Senate members are given district_code 0. Members who represent historical “at-large” districts are assigned 99, 98, or 1 in various circumstances.
<br><br>
* **state_abbrev**: String. Two-character postal abbreviation for state (e.g. MO for Missouri).
<br><br>
* **party_code**: Integer 1-9999. Identifying code for the member’s party. Please see documentation for Party Data for more information about which party_code identifiers refer to which parties.
<br><br>
* **occupancy**: Integer 1+. ICPSR occupancy code. This item is considered legacy or incomplete information and has not been verified. In general, members receive 0 if they are the only occupant, 1 if they are the first occupant, 2 if they are the second occupant, etc.
<br><br>
* **last_means**: Integer 1-5. ICPSR Attain-Office Code. This is an indicator that reflects the member’s last means of attaining office. This item is considered legacy or incomplete information and has not been verified. Members received 1 if they were elected in a general election, 2 if elected by special election, 3 if directly elected by a state legislature, and 5 if appointed.
<br><br>
* **bioname**: String. Name of the member, surname first. For most members, agrees with the Biographical Directory of Congress.
<br><br>
* **bioguide_id**: String. Member identifier in the Biographical Directory of Congress.
<br><br>
* **born**: Integer. Year of member’s birth.
<br><br>
* **died**: Integer. Year of member’s death
<br><br>

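We process the data before applying PCA.

Votes are coded as follows:
* Yea is 1
* Nay is -1
* Abstention or missing data is 0

In terms of Voteview's `cast_code` values, codes 1, 2 and 3 become 1, codes 4, 5 and 6 become -1, and codes 7, 8 and 9 become 0 (code 0 is left unchanged). The next cell applies this mapping with `replace` and assigns the result back to `df_voting`, since `replace` returns a new dataframe rather than modifying the original in place.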
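
As a small illustration of this recoding, the sketch below applies the same mapping to a made-up four-row frame. The `icpsr` ids, the `CAST_CODE_MAP` name and the extra `vote` column are hypothetical. The notebook itself uses `DataFrame.replace` on `cast_code` directly, while `Series.map` is used here only as an equivalent way to show the idea.

```python
import pandas as pd

# cast_code -> 1 (Yea), -1 (Nay), 0 (abstention or no recorded vote)
CAST_CODE_MAP = {0: 0, 1: 1, 2: 1, 3: 1, 4: -1, 5: -1, 6: -1, 7: 0, 8: 0, 9: 0}

toy_votes = pd.DataFrame({
    'icpsr': [101, 102, 103, 104],   # made-up member ids
    'cast_code': [1, 6, 9, 3],
})

# map returns a new Series, so the result has to be assigned to be kept
toy_votes['vote'] = toy_votes['cast_code'].map(CAST_CODE_MAP)
print(toy_votes['vote'].tolist())  # [1, -1, 0, 1]
```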

In [677]:
# make a dictionary of names for each icpsr number
df_name = df_bio.set_index('icpsr')[['bioname']]
name_dict = df_name.to_dict()

# make a dictionary for which party each member belonged to at a given congress session
df_party = df_bio.set_index('icpsr')[['congress', 'party_code']]

# make a dictionary of parties for each party code
df_party_dict = df_party_dict[['party_code', 'party_name']].set_index('party_code')
party_dict = df_party_dict.to_dict()['party_name']

# set 1 for Yea, -1 for Nay, and 0 for abstention
df_voting = df_voting.replace({'cast_code': {2:1,3:1,4:-1,5:-1,6:-1,7:0,8:0,9:0}})
# index voting record by congress number
df_voting = df_voting.set_index('congress')

# delete fields which are not needed
del (df_voting['chamber'], df_voting['prob'])

# create unique list of congress sessions
CongressSessions = df_voting.index.unique()

# split the dataframe into sub-frames, one per Congress number
DataFrameDict = {key: df_voting[df_voting.index == key] for key in CongressSessions}

# create array of dates corresponding to each Congress number
dates1 = (2*CongressSessions)+1787
dates2 = (2*CongressSessions)+1789
dates = [str(i)+'-'+str(j) for i, j in zip(dates1, dates2)]
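
The date arithmetic at the end of the cell above relies on each Congress lasting two years, with the 1st Congress starting in 1789. A minimal sketch of that conversion in both directions is shown below. The helper names `congress_to_years` and `year_to_congress` are hypothetical, and the year lookup ignores the fact that a new Congress only convenes in early January of an odd year.

```python
def congress_to_years(congress):
    """Return (start_year, end_year) for a Congress number, e.g. 1 -> (1789, 1791)."""
    start = 2 * congress + 1787
    return start, start + 2

def year_to_congress(year):
    """Return the Congress number that covers most of the given calendar year."""
    return (year - 1787) // 2

print(congress_to_years(107))  # (2001, 2003)
print(year_to_congress(2001))  # 107
```

The second helper uses the same formula that `make_dataframe` and `plot_PCA` apply later when they turn a year into a Congress number.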

Next, we apply PCA to the processed data.

In [698]:
X1, X2 = [], []  # arrays where we will store components of PCA for each Congress number
ind = 0
# loop over each Congress number, which corresponds to a two year period
for key in DataFrameDict.keys():
    df = DataFrameDict[key].set_index('icpsr')                  # make icpsr id the index
    df = df.pivot_table('cast_code', ['icpsr'], 'rollnumber')   # pivot table 
    # df = df.loc[df.isnull().mean(axis=1) < .75, :]  # optional: drop senators missing 75% or more of the roll calls
    df = df.fillna(0)    # fill NaN values with 0
    # df = df[df.astype('bool').mean(axis=1) >= 0.8]   # optional: keep senators with a nonzero vote on at least 80% of roll calls
    

    x = df.to_numpy() # convert dataframe to numpy array
    x = StandardScaler().fit_transform(x) # scale data to have zero mean and unit std
    
    # apply PCA and append to X1 and X2
    pca = PCA(n_components=2)
    principalComponents = pca.fit_transform(x)
    x1 = principalComponents[:,0]
    x2 = principalComponents[:,1]

    X1.append(x1)
    X2.append(x2)

    print(dates[ind]+': PCA holds '+str(round(100*np.sum(pca.explained_variance_ratio_),2))+
                              '% of the information. '+ str(round(100*(1-
                               np.sum(pca.explained_variance_ratio_)),2))+'% is lost')
    ind+=1

1789-1791: PCA holds 36.69% of the information. 63.31% is lost
1791-1793: PCA holds 48.32% of the information. 51.68% is lost
1793-1795: PCA holds 47.2% of the information. 52.8% is lost
1795-1797: PCA holds 49.39% of the information. 50.61% is lost
1797-1799: PCA holds 42.72% of the information. 57.28% is lost
1799-1801: PCA holds 54.28% of the information. 45.72% is lost
1801-1803: PCA holds 56.77% of the information. 43.23% is lost
1803-1805: PCA holds 36.42% of the information. 63.58% is lost
1805-1807: PCA holds 37.64% of the information. 62.36% is lost
1807-1809: PCA holds 36.08% of the information. 63.92% is lost
1809-1811: PCA holds 31.35% of the information. 68.65% is lost
1811-1813: PCA holds 42.51% of the information. 57.49% is lost
1813-1815: PCA holds 43.84% of the information. 56.16% is lost
1815-1817: PCA holds 25.49% of the information. 74.51% is lost
1817-1819: PCA holds 23.13% of the information. 76.87% is lost
1819-1821: PCA holds 26.58% of the information. 73.42% is

Naturally, a large share of the information is lost in the process, because we are projecting a high-dimensional space onto two dimensions. Here "information" means the fraction of the total variance explained by the two components, i.e. the sum of `pca.explained_variance_ratio_`. For example, for the years 2017-2019, we go from a 399-dimensional space to a 2-dimensional one while still keeping 63.13% of the information!
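
The sketch below shows, on synthetic data, how such a percentage can be computed by hand: the share of variance carried by each component is its squared singular value divided by the sum of all squared singular values of the centred data. The function name `explained_variance_ratio`, the hidden-factor setup and the noise level are assumptions for illustration only, and the figure it prints is unrelated to the Senate results.

```python
import numpy as np

def explained_variance_ratio(x):
    """Fraction of the total variance carried by each principal component."""
    centered = x - x.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)  # singular values, largest first
    return s**2 / np.sum(s**2)

rng = np.random.default_rng(1)
# synthetic data: 100 rows and 30 columns driven by one hidden factor plus noise
factor = rng.normal(size=(100, 1))
loadings = rng.normal(size=(1, 30))
x = factor @ loadings + 0.5 * rng.normal(size=(100, 30))

ratio = explained_variance_ratio(x)
kept = 100 * ratio[:2].sum()
print(f"first two components keep {kept:.2f}% of the variance, {100 - kept:.2f}% is lost")
```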
<br><br>
Next, we plot and visualise the principal components.

In [699]:
# make a list of all parties in the US
parties_list = []
for key in party_dict:
    parties_list.append(party_dict[key])
parties_list = np.unique(parties_list)

# assign a unique color to each party
dz = np.arange(len(parties_list))
colors = plt.cm.nipy_spectral(np.linspace(0,1,len(dz)))
np.random.shuffle(colors)
colors_dict = {}
for A, B in zip(parties_list, colors):
    colors_dict[A] = B
    
# partyArray holds the party of each senator for each Congress number
# nameArray holds the senator names for each Congress number
# colorArray holds the party color of each senator for each Congress number
partyArray, nameArray, colorArray = [], [], []
for i in range(len(X1)):
    congress = i + 1
    # pivot_table sorts its index, so the PCA rows follow the sorted unique icpsr ids;
    # take the ids in that same order so names and parties line up with X1[i] and X2[i]
    ids = np.sort(DataFrameDict[congress]['icpsr'].unique())
    party_codes = df_party.query('congress == @congress')['party_code']
    party_codes = party_codes[~party_codes.index.duplicated()]  # one party code per member

    full_names = [name_dict['bioname'][icpsr] for icpsr in ids]
    parties = [party_dict[party_codes.loc[icpsr]] for icpsr in ids]

    colors = [colors_dict.get(key) for key in parties]
    nameArray.append(full_names)
    partyArray.append(parties)
    colorArray.append(colors)

In [700]:
X1[107].shape

(101,)

In [701]:
# function to make dataframe for plotting with plotly
def make_dataframe(year):
    congress_number = math.floor((year-1787)/2)
    i = congress_number - 1
    parties = partyArray[i]
    d = {'x1': X1[i], 'x2': X2[i], 'name': nameArray[i], 'party': parties}
    dd = pd.DataFrame(data=d)
    dd = dd.sort_values(['party'])
    
    party_names = np.unique(parties)
    df = {party:dd.query("party == '%s'" %party)
                              for party in party_names}
    
    return df

# function to plot the principal components
def plot_PCA(year):
    congress_number = math.floor((year-1787)/2)
    df = make_dataframe(year)
    fig = go.Figure()
    for party_name, party in df.items():
        fig.add_trace(go.Scatter(
            x = party['x1'],
            y = party['x2'],
            name = party_name,
            hovertemplate = 
            "<b>%{text}</b><br><br>",
            text = party['name'],
            marker_size=5))

    fig.update_traces(
        mode='markers')

    fig.update_layout(
        width=600,
        height=500,
        margin=go.layout.Margin(l=15,r=1,b=10,t=50,pad=1),
        xaxis={'title':'PCA 1'},
        yaxis={'title':'PCA 2'},
        title=str(CongressSessions[congress_number-1]
                 ) +'th Congress ('+ dates[congress_number-1]+')',
        font=dict(
            family="Courier New, monospace",
            size=12,
            color="#000000"
         )
    )

    fig.show()    

In [702]:
df = DataFrameDict[107].set_index('icpsr')                  # make icpsr id the index
df = df.pivot_table('cast_code', ['icpsr'], 'rollnumber')   # pivot table 
df = df.loc[df.isnull().mean(axis=1) < .8, :]  # drop senators missing 80% or more of the roll calls
df = df.fillna(0)    # fill NaN values with 0
# df = df[df.astype('bool').mean(axis=1) >= 0.5]
    
df

icpsr \ rollnumber,1,2,3,4,5,6,7,8,9,10,...,624,625,626,627,628,629,630,631,632,633
1366,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,-1.0,1.0,-1.0,-1.0,-1.0,-1.0,1.0,1.0,1.0,1.0
4812,1.0,1.0,1.0,1.0,1.0,1.0,1.0,-1.0,0.0,0.0,...,0.0,1.0,-1.0,-1.0,1.0,-1.0,1.0,1.0,1.0,1.0
9369,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,-1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
10808,1.0,1.0,1.0,1.0,1.0,-1.0,1.0,-1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,-1.0,-1.0,1.0,1.0,1.0
11204,1.0,1.0,1.0,1.0,1.0,1.0,1.0,-1.0,1.0,1.0,...,1.0,1.0,-1.0,-1.0,1.0,-1.0,1.0,1.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49905,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,-1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
94240,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-1.0,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.0,1.0,1.0
94659,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,-1.0,1.0,1.0,1.0,1.0,1.0,-1.0,-1.0,1.0
95407,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.0,-1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
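
The absence filter in the cell above can be hard to read at a glance, so here is a minimal sketch of the same steps on a made-up pivot table. The member ids, the votes and the variable names `toy`, `missing_share` and `kept` are hypothetical, and the 0.8 threshold simply mirrors the value used above.

```python
import numpy as np
import pandas as pd

# toy pivot table: rows are members, columns are roll calls, NaN = no recorded vote
toy = pd.DataFrame(
    [[1, -1, 1, 1, -1, 1],
     [np.nan, np.nan, np.nan, np.nan, np.nan, 1],   # missing 5 of 6 roll calls
     [1, np.nan, -1, 1, 1, -1]],                    # missing 1 of 6 roll calls
    index=pd.Index([11, 22, 33], name='icpsr'),
    columns=pd.Index(range(1, 7), name='rollnumber'),
)

missing_share = toy.isnull().mean(axis=1)        # fraction of missing votes per member
kept = toy.loc[missing_share < 0.8].fillna(0)    # drop heavy absentees, then fill gaps with 0
print(kept.index.tolist())  # [11, 33]
```

Filtering before `fillna(0)` matters: once the gaps are filled, a missing vote can no longer be told apart from a recorded abstention.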


In [703]:
#name_dict['bioname'][94240]

df[98:99].values==0


array([[ True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
      

In [713]:
plot_PCA(2001)


In [672]:
len(partyArray)

116

In [673]:
x.shape

(1, 546)