Introduction

In this assignment, two methods of cluster analysis are applied to the voting records of the members of the most recent full two-year Senate session.

The first clustering uses K-Means with K = 2.  K-Means is a prototype-based algorithm: each data point is assigned to its nearest centroid, each centroid is then recomputed as the mean of the points assigned to it, and the two steps repeat until the locations of the centroids stabilize or, as configured here, a maximum of 10 iterations is reached.  The result assigns every data point to one of K clusters based on its proximity to the most recently updated centroids.
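The assign-then-update loop described above can be sketched in a few lines of NumPy. This is only an illustration on made-up 2-D points, not the scikit-learn implementation used later; the function name `kmeans_sketch` and the toy data are assumptions introduced here.

```python
import numpy as np

def kmeans_sketch(points, k=2, max_iter=10, seed=1):
    """Illustrative K-Means loop: assign to nearest centroid, then update."""
    rng = np.random.default_rng(seed)
    # Random initialization: use k distinct data points as starting centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Distance from every point to every centroid, shape (n_points, k)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # New centroid = mean of its assigned points (unchanged if it has none)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stabilized before the iteration cap was reached
        centroids = new_centroids
    return labels, centroids

# Hypothetical toy data: two well-separated groups of three points each
toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0],
                [10.0, 10.0], [11.0, 10.0], [10.0, 12.0]])
toy_labels, toy_centroids = kmeans_sketch(toy)
print(toy_labels)
```

The cluster numbers themselves are arbitrary: which group is called 0 and which is called 1 depends on the random starting centroids.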

The second clustering applies the DBSCAN algorithm to the same data, with a distance parameter (eps) of 10 and a minimum of 8 data points.  Unlike the K-Means algorithm, where the number of clusters is a user-supplied parameter, DBSCAN infers how many clusters there are in the data based on the tuning parameters, and it can also label a point as noise that does not belong to any cluster.  Essentially, a cluster keeps growing as long as it reaches points that themselves have at least the minimum number of points within the eps distance.
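To make the roles of eps and the minimum point count concrete, here is a minimal sketch of the DBSCAN idea in NumPy. It is a simplified illustration, not the scikit-learn implementation; the function name `dbscan_sketch` and the 1-D toy data are assumptions. Following the scikit-learn convention, a point counts as its own neighbor.

```python
import numpy as np

def dbscan_sketch(points, eps, min_samples):
    """Label points with cluster ids 0, 1, ...; noise points keep the label -1."""
    n = len(points)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    # Each neighborhood includes the point itself
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    is_core = np.array([len(nb) >= min_samples for nb in neighbors])
    labels = np.full(n, -1)
    cluster_id = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue
        # Grow a new cluster outward from this unassigned core point
        labels[i] = cluster_id
        frontier = [i]
        while frontier:
            current = frontier.pop()
            for j in neighbors[current]:
                if labels[j] == -1:
                    labels[j] = cluster_id
                    if is_core[j]:
                        frontier.append(j)  # only core points keep expanding
        cluster_id += 1
    return labels

# Hypothetical toy data: two tight groups and one isolated point
toy_points = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0], [50.0]])
toy_labels = dbscan_sketch(toy_points, eps=1.5, min_samples=2)
print(toy_labels.tolist())  # [0, 0, 0, 1, 1, 1, -1]
```

The isolated point at 50 has no neighbor within eps, so it is never reached by a growing cluster and stays labeled as noise.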

Results of the clusterings will be compared using their silhouette coefficients.
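For reference, the silhouette coefficient of a single point is (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster; the overall score is the average over all points. The sketch below computes this by hand on made-up 1-D data. The function name `silhouette_sketch` is an assumption, and the notebook itself relies on scikit-learn's `metrics.silhouette_score` instead.

```python
import numpy as np

def silhouette_sketch(points, labels):
    """Mean silhouette coefficient, computed directly from the definition."""
    labels = np.asarray(labels)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    scores = np.zeros(len(points))
    for i in range(len(points)):
        same = labels == labels[i]
        if same.sum() == 1:
            continue  # a single-member cluster scores 0 by convention
        # a: mean distance to the other members of the point's own cluster
        a = dists[i, same].sum() / (same.sum() - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(dists[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

# Hypothetical toy data: two obvious groups on a line
toy_points = np.array([[0.0], [1.0], [10.0], [11.0]])
good_score = silhouette_sketch(toy_points, [0, 0, 1, 1])  # groups kept together
bad_score = silhouette_sketch(toy_points, [0, 1, 0, 1])   # groups split apart
print(round(good_score, 4), round(bad_score, 4))
```

Scores close to 1 indicate compact, well-separated clusters, while scores near or below 0 indicate points that sit closer to another cluster than to their own.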


Data Provenance

The dataset used for this assignment is the VoteView Congressional Roll Call Votes Database for the U.S. Senate members of the 116th Congress.  The 116th Congress is the most recently concluded one, running from 2019 until 2021.  The data consist of the 720 recorded votes for each of the 103 members for the whole session, stored as a cast code that corresponds to a vote of yes, no, present, or abstaining.  Each voting member is identified by their Inter-University Consortium for Political and Social Research (ICPSR) ID number, which is unique to a person, although "A small number of members have received more than one: this can occur for members who have switched parties; as well as members who subsequently become president." (Lewis et al., 2021)

This dataset is compiled and updated by Jeffrey Lewis, Keith Poole, Howard Rosenthal, Aaron Rudkin, and Luke Sonnet and is available from VoteView.com (https://voteview.com/static/data/out/votes/S116_votes.csv).  The cast codes are explained at https://voteview.com/articles/data_help_votes (Lewis et al., 2021).

Before summary statistics can be reported, some preprocessing is needed to get the dataset into a usable format.



In [1]:
#Import required packages for the assignment

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn import cluster
from sklearn.cluster import DBSCAN
from sklearn import metrics

%matplotlib inline

In [2]:
#Read the file in and take a look:

RawVoteHist = pd.read_csv(r'C:\Users\Mike\Documents\Grad School 2021\DSC-607 Data Mining\S116_votes.csv')

RawVoteHist.head()

Unnamed: 0,congress,chamber,rollnumber,icpsr,cast_code,prob
0,116,Senate,1,14226.0,1,97.3
1,116,Senate,1,14307.0,6,97.9
2,116,Senate,1,14435.0,6,99.9
3,116,Senate,1,14852.0,1,97.9
4,116,Senate,1,14858.0,6,99.1


The dataset already contains only the Senate votes for the 116th Congress, so the "congress" and "chamber" features are constant and carry no information.  Additionally, the "prob" feature shown is not of use for this purpose.  These three features need to be dropped.

The feature "rollnumber" is the sequential vote number for each of the 720 items voted on.  For the clustering, each vote should be its own feature, so the data table needs to be pivoted so that each senator is a row and each roll call is a column.
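As a small illustration of what the pivot does, the sketch below reshapes a made-up long table (one row per member per vote) into a wide one (one row per member, one column per roll call). The ICPSR IDs and votes are hypothetical; only the column names match the dataset.

```python
import pandas as pd

# Hypothetical long-format records: member 103 has no entry for roll call 1
long_votes = pd.DataFrame({
    'icpsr':      [101, 102, 101, 102, 103],
    'rollnumber': [1, 1, 2, 2, 2],
    'cast_code':  [1, 6, 6, 6, 1],
})

# One row per member, one column per roll call
wide_votes = long_votes.pivot(index='icpsr', columns='rollnumber', values='cast_code')
print(wide_votes)
```

Any member/vote combination that has no row in the long table, such as member 103 on roll call 1 here, shows up as NaN in the wide table.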

Finally, the cast codes in raw format are [1, 6, 7, 9], which correspond to ["Yea", "Nay", "Present (did not choose)", "Not Voting (Abstention)"].  The pivoted table also contains NaN values, which makes sense: there are only 100 Senate seats, yet we have voting records for 103 members, because some members leave and are replaced during the session.  A NaN means that the member was not in the Senate at the time of that vote, so it will be treated the same as Present or Not Voting.  All values need to be placed on a scale of 0 to 1: Nay votes become 0, Yea votes stay at 1, and the other possible responses are assigned a value of 0.5.

In [3]:
#Drop unnecessary features
RawVoteHist = RawVoteHist.drop(['congress','chamber','prob'], axis = 1) 

#Change icpsr ID to an integer
RawVoteHist['icpsr'] = RawVoteHist['icpsr'].astype(int)

#Pivot the data so each senator is their own row, with each variable's value being their response for that vote
RecordBySenator = RawVoteHist.pivot(index = 'icpsr', columns = 'rollnumber', values = 'cast_code')

RecordBySenator.head() # Take a quick look


icpsr,1,2,3,4,5,6,7,8,9,10,...,711,712,713,714,715,716,717,718,719,720
14226,1.0,1.0,1.0,1.0,6.0,6.0,1.0,1.0,1.0,6.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
14307,6.0,6.0,6.0,6.0,1.0,1.0,6.0,1.0,6.0,1.0,...,1.0,1.0,6.0,6.0,6.0,6.0,1.0,1.0,1.0,1.0
14435,6.0,6.0,6.0,6.0,1.0,1.0,6.0,1.0,6.0,1.0,...,6.0,6.0,6.0,6.0,6.0,6.0,1.0,6.0,6.0,6.0
14852,1.0,1.0,1.0,1.0,6.0,6.0,1.0,1.0,1.0,6.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
14858,6.0,6.0,6.0,6.0,1.0,1.0,6.0,1.0,6.0,1.0,...,1.0,1.0,6.0,6.0,6.0,6.0,1.0,1.0,1.0,1.0


In [4]:
#Convert vote cast codes to numeric values on a 0 to 1 scale (Yea is already coded as 1)
RecordBySenator.fillna(value = 0.5, inplace = True) #Not a member for the vote
RecordBySenator.replace(to_replace = 6, value = 0, inplace = True) #Nay
RecordBySenator.replace(to_replace = 7, value = 0.5, inplace = True) #Present - did not choose
RecordBySenator.replace(to_replace = 9, value = 0.5, inplace = True) #Not voting (abstained)

RecordBySenator.describe()

rollnumber,1,2,3,4,5,6,7,8,9,10,...,711,712,713,714,715,716,717,718,719,720
count,103.0,103.0,103.0,103.0,103.0,103.0,103.0,103.0,103.0,103.0,...,103.0,103.0,103.0,103.0,103.0,103.0,103.0,103.0,103.0,103.0
mean,0.558252,0.548544,0.533981,0.427184,0.572816,0.572816,0.504854,0.88835,0.514563,0.538835,...,0.864078,0.849515,0.509709,0.509709,0.519417,0.504854,0.917476,0.830097,0.830097,0.825243
std,0.491601,0.482612,0.476209,0.487127,0.487127,0.487127,0.482514,0.288098,0.487371,0.483499,...,0.290319,0.303856,0.479893,0.479893,0.474457,0.482514,0.25338,0.339791,0.339791,0.354997
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0
50%,1.0,1.0,0.5,0.0,1.0,1.0,0.5,1.0,0.5,1.0,...,1.0,1.0,0.5,0.5,0.5,0.5,1.0,1.0,1.0,1.0
75%,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


Looking at the summary statistics above, we see that every vote shown has a minimum of 0 and a maximum of 1, so the data has been converted to a 0 to 1 scale.
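The summary table only shows the minimum and maximum of a sample of columns, so a direct check that nothing but 0, 0.5, and 1 remains can be a useful complement. The sketch below does the recoding with a single mapping and then runs that check on a made-up table; the variable names and data are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical pivoted cast codes, including one NaN (member absent from the Senate)
toy_codes = pd.DataFrame({1: [1.0, 6.0, 9.0], 2: [6.0, np.nan, 7.0]},
                         index=pd.Index([101, 102, 103], name='icpsr'))

# Nay -> 0, Present / Not Voting / missing -> 0.5, Yea stays at 1
recoded = toy_codes.replace({6: 0.0, 7: 0.5, 9: 0.5}).fillna(0.5)

# True only if every cell is one of the three intended values
all_valid = bool(recoded.isin([0.0, 0.5, 1.0]).all().all())
print(all_valid)
```

If a cast code other than 1, 6, 7, or 9 were present, it would pass through the recoding unchanged and this check would return False.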

In [5]:
#Cluster using K-Means, K = 2, random initialization, at most 10 iterations
RecordClusterData = RecordBySenator #Same DataFrame under a second name (not a copy)

KM_SenatorClusterData = cluster.KMeans(n_clusters = 2,init = 'random', max_iter = 10, random_state = 1)

KM_SenatorClusterData.fit(RecordClusterData)

ClusterLabels = KM_SenatorClusterData.labels_

ClusterTable = pd.DataFrame(ClusterLabels, index = RecordBySenator.index, columns = ['Voting Bloc'])

ClusterTable #See which cluster each ICPSR ID got assigned to.

icpsr,Voting Bloc
14226,1
14307,0
14435,0
14852,1
14858,0
...,...
49308,0
49703,1
49706,1
94659,1


In [6]:
# Breakdown of cluster labels
ClusterSummary = ClusterTable['Voting Bloc'].value_counts()

ClusterSummary

1    55
0    48
Name: Voting Bloc, dtype: int64

In [7]:
#Cluster using DBSCAN, eps = 10, minimum samples = 8

VoteBloc = DBSCAN(eps = 10, min_samples = 8).fit(RecordClusterData)

#Boolean mask marking which points DBSCAN identified as core samples
core_samples_mask = np.zeros_like(VoteBloc.labels_, dtype = bool)
core_samples_mask[VoteBloc.core_sample_indices_] = True

VoteBlocLabels = pd.DataFrame(VoteBloc.labels_, index = RecordBySenator.index, columns = ['Voting Bloc'])

VoteBlocLabels #See which cluster each ICPSR ID got assigned to, including possible noise point classification

icpsr,Voting Bloc
14226,0
14307,1
14435,1
14852,0
14858,1
...,...
49308,1
49703,0
49706,0
94659,0


In [8]:
DBClusterSummary = VoteBlocLabels['Voting Bloc'].value_counts()

DBClusterSummary

 0    53
 1    44
-1     6
Name: Voting Bloc, dtype: int64

In [9]:
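#Count the clusters, subtracting the noise label (-1) only if it is actually present
ClusterQty = len(set(VoteBloc.labels_)) - (1 if -1 in VoteBloc.labels_ else 0)
NoiseQty = list(VoteBlocLabels['Voting Bloc']).count(-1)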

print('Clusters Identified: %d' % ClusterQty)
print('Points not in a cluster (noise): %d' % NoiseQty)

Clusters Identified: 2
Points not in a cluster (noise): 6


In [10]:
#Calculate and compare the silhouette coefficients
#Note: DBSCAN noise points (label -1) are scored here as if they formed a cluster of their own
print("K-Means \n\tK = 2\n\tSilhouette Coefficient: %0.4f" 
      % metrics.silhouette_score(RecordClusterData, ClusterLabels))
print("DBSCAN \n\teps = 10\n\tMinimum points = 8\n\tSilhouette Coefficient: %0.4f" 
      % metrics.silhouette_score(RecordClusterData, VoteBlocLabels['Voting Bloc']))


K-Means 
	K = 2
	Silhouette Coefficient: 0.5355
DBSCAN 
	eps = 10
	Minimum points = 8
	Silhouette Coefficient: 0.3688


Comparing the cluster sizes identified by the two algorithms shows close agreement (55 and 48 members for K-Means versus 53 and 44 for DBSCAN), assuming there are 2 clusters.  DBSCAN's finding that 6 points do not belong to a cluster is an interesting contrast with K-Means, which must assign every point to a cluster.  In the context of Senate voting behavior, those individuals may be worth a closer look: they were not easily classified, so they may not follow the same patterns as the vast majority of members.
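Similar cluster sizes do not by themselves show that the same senators ended up together, and the cluster numbers are arbitrary (a bloc that K-Means calls 1 may be the one DBSCAN calls 0). A cross-tabulation of the two label sets makes the overlap explicit. The sketch below uses made-up labels for six members; the series names are hypothetical.

```python
import pandas as pd

# Hypothetical labels for six members from the two algorithms
kmeans_toy = pd.Series([1, 0, 0, 1, 0, 1], name='K-Means')
dbscan_toy = pd.Series([0, 1, 1, 0, -1, 0], name='DBSCAN')

# Rows: K-Means bloc, columns: DBSCAN bloc (-1 = noise), cells: member counts
agreement = pd.crosstab(kmeans_toy, dbscan_toy)
print(agreement)
```

In the notebook, the same call on `ClusterTable['Voting Bloc']` and `VoteBlocLabels['Voting Bloc']` would show how the members of each K-Means bloc are distributed over the DBSCAN blocs and the noise group.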

However, the silhouette scores show that K-Means (0.5355) identified clusters that are better defined in terms of cohesion and separation than those from DBSCAN (0.3688).  Note that the DBSCAN score was computed with the 6 noise points treated as a group of their own, which may lower it.  DBSCAN's typical benefit over K-Means is its ability to identify non-globular shapes.  Perhaps the data this time is globular in the data space, reflecting more or less monolithic voting behavior, and K-Means is the more appropriate method.
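One way to see how much the noise points affect a DBSCAN silhouette score is to compute it twice, once with every point and once with only the points that were assigned to a cluster. The sketch below does this on made-up 1-D data; it is an illustration of the comparison, not a result from the Senate data, and the variable names are hypothetical.

```python
import numpy as np
from sklearn import metrics

# Hypothetical toy data: two tight groups plus one far-away point
toy_points = np.array([[0.0], [1.0], [10.0], [11.0], [30.0]])
toy_labels = np.array([0, 0, 1, 1, -1])  # -1 marks a DBSCAN-style noise point

# Score with the noise point counted as its own one-member cluster
score_with_noise = metrics.silhouette_score(toy_points, toy_labels)

# Score using only the points that were actually assigned to a cluster
clustered = toy_labels != -1
score_without_noise = metrics.silhouette_score(toy_points[clustered],
                                               toy_labels[clustered])

print("With noise: %0.4f, without noise: %0.4f"
      % (score_with_noise, score_without_noise))
```

Excluding noise makes the DBSCAN score describe only the clustered points, so it is no longer computed on the same set of senators as the K-Means score; which comparison is fairer is a judgment call.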

References:

Lewis, Jeffrey B., Keith Poole, Howard Rosenthal, Adam Boche, Aaron Rudkin, and Luke Sonnet (2021). Voteview: Congressional Roll-Call Votes Database. https://voteview.com/ (Retrieved 15 Oct 2021)