**1: Clustering Overview**

So far, we've looked at regression and classification. These are both types of supervised machine learning. In supervised learning, you train an algorithm to predict an unknown variable from known variables.

Another major type of machine learning is called **unsupervised learning**. In unsupervised learning, we aren't trying to predict anything. Instead, we're finding patterns in data.

One of the main unsupervised learning techniques is called clustering. We use clustering when we're trying to explore a dataset, and understand the connections between the various rows and columns. For example, we can cluster NBA players based on their statistics. Here's how such a clustering might look:

[Figure: NBA clusters]

The clusters made it possible to discover player roles that might not have been noticed otherwise. Here's an article that describes how the clusters were created.

Clustering algorithms group similar rows together. There can be one or more groups in the data, and these groups form the clusters. As we look at the clusters, we can start to better understand the structure of the data.

Clustering is a key way to explore unknown data, and it's a very commonly used machine learning technique. In this mission, we'll work on clustering US Senators based on how they voted.

In [60]:
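import json

# Read the raw vote record and parse it into a dictionary.
# Named vote_data rather than dict so the built-in dict type isn't shadowed.
with open('data.json') as f:
    vote_data = json.load(f)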



{'amendment': {'author': 'First Swalwell of California Amendment',
  'number': 17,
  'type': 'h-bill'},
 'bill': {'congress': 114, 'number': 2028, 'type': 'hr'},
 'category': 'amendment',
 'chamber': 'h',
 'congress': 114,
 'date': '2015-04-30T23:24:00-04:00',
 'number': 198,
 'question': 'On Agreeing to the Amendment: Amendment 17 to H R 2028',
 'requires': '1/2',
 'result': 'Failed',
 'result_text': 'Failed',
 'session': '2015',
 'source_url': 'http://clerk.house.gov/evs/2015/roll198.xml',
 'type': 'On the Amendment',
 'updated_at': '2015-06-05T15:41:27-04:00',
 'vote_id': 'h198-114.2015',
 'votes': {'Aye': [{'display_name': 'Adams',
    'id': 'A000370',
    'party': 'D',
    'state': 'NC'},
   {'display_name': 'Aguilar', 'id': 'A000371', 'party': 'D', 'state': 'CA'},
   {'display_name': 'Bass', 'id': 'B001270', 'party': 'D', 'state': 'CA'},
   {'display_name': 'Beatty', 'id': 'B001281', 'party': 'D', 'state': 'OH'},
   {'display_name': 'Becerra', 'id': 'B000287', 'party': 'D', 'stat

To group Senators together, we need some way to figure out how "close" the Senators are to each other. We'll then group together the Senators that are the closest. We can actually discover this distance mathematically, by finding how similar the votes of two Senators are. The closer together the voting records of two Senators, the more ideologically similar they are (voting the same way indicates that you share the same views).

To find the distance between two rows, we can use Euclidean distance. The formula is:

$$d = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + \cdots + (q_n - p_n)^2}$$

Let's say we have two Senators' voting records:


name,party,state,00001,00004,00005,00006,00007,00008,00009,00010,00020,00026,00032,00038,00039,00044,00047
Alexander,R,TN,0,1,1,1,1,0,0,1,1,1,0,0,0,0,0
Ayotte,R,NH,0,1,1,1,1,0,0,1,0,1,0,1,0,1,0
If we took only the numeric vote columns, we'd have this:


00001,00004,00005,00006,00007,00008,00009,00010,00020,00026,00032,00038,00039,00044,00047
0,1,1,1,1,0,0,1,1,1,0,0,0,0,0
0,1,1,1,1,0,0,1,0,1,0,1,0,1,0
If we wanted to compute the Euclidean distance, we'd plug the vote numbers into our formula:

$$d = \sqrt{(0 - 0)^2 + (1 - 1)^2 + (1 - 1)^2 + (1 - 1)^2 + (1 - 1)^2 + (0 - 0)^2 + \cdots + (0 - 0)^2}$$

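As you can see, these Senators are very similar! If you look at the votes above, they disagree on only 3 of the 15 votes (00020, 00038, and 00044). Each disagreement adds 1 to the sum under the square root and each agreement adds 0, so the final Euclidean distance between these two Senators is √3 ≈ 1.73.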
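
As a quick check of the arithmetic above, here is a minimal sketch that applies the formula by hand using only the standard library. The two lists are copied from the vote rows shown earlier; the variable names are illustrative.

```python
import math

# Vote columns copied from the two rows shown above.
alexander = [0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0]
ayotte = [0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Each squared difference is 1 where the votes differ and 0 where they match.
squared_diffs = [(q - p) ** 2 for p, q in zip(alexander, ayotte)]
disagreements = sum(squared_diffs)
distance = math.sqrt(disagreements)

print(disagreements)       # 3
print(round(distance, 2))  # 1.73
```

Because the votes are all 0 or 1, the sum under the square root is simply the number of votes on which the two Senators disagree.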

To compute Euclidean distance in Python, we can use the euclidean_distances() function in the scikit-learn library. The code below finds the Euclidean distance between the Senator in the first row and the Senator in the second row.


euclidean_distances(votes.iloc[[0], 3:], votes.iloc[[1], 3:])
We select only the columns after the first 3 because the first 3 are name, party, and state, which aren't numeric. Writing the row index as a list ([0] rather than 0) keeps each selection two-dimensional, which current versions of scikit-learn require for euclidean_distances().
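
To try the scikit-learn call without the full dataset, the sketch below rebuilds a two-row stand-in for votes from the sample rows shown earlier. The DataFrame construction and the vote_matrix name are assumptions for illustration; the real votes table has many more rows.

```python
import pandas as pd
from sklearn.metrics.pairwise import euclidean_distances

# Stand-in for `votes`, built from the two sample rows shown earlier.
vote_ids = ["00001", "00004", "00005", "00006", "00007", "00008", "00009",
            "00010", "00020", "00026", "00032", "00038", "00039", "00044",
            "00047"]
rows = [
    ["Alexander", "R", "TN", 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0],
    ["Ayotte", "R", "NH", 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0],
]
votes = pd.DataFrame(rows, columns=["name", "party", "state"] + vote_ids)

# Drop the three non-numeric columns and convert to a 2-D float array.
vote_matrix = votes.iloc[:, 3:].to_numpy(dtype=float)

# Slicing with [0:1] and [1:2] keeps each input 2-D (one row each).
distance = euclidean_distances(vote_matrix[0:1], vote_matrix[1:2])

print(distance.shape)                   # (1, 1)
print(round(float(distance[0, 0]), 2))  # 1.73
```

The result is a 1×1 array because the function returns one distance for every pair of rows across its two inputs.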



In [61]:
from sklearn.metrics.pairwise import euclidean_distances

In [64]:
import numpy as np

v = np.array([1, 0, 1])
type(v)

numpy.ndarray

In [66]:
X = [[0, 1], [1, 1]]
euclidean_distances(X, X)

array([[ 0.,  1.],
       [ 1.,  0.]])

**5: Initial Clustering**

We'll use an algorithm called k-means clustering to split our data into clusters. k-means clustering uses Euclidean distance to form clusters of similar Senators. We'll dive more into the theory of k-means clustering and build the algorithm from the ground up in a later mission. For now, it's important to understand clustering at a high level, so we'll leverage the scikit-learn library to train a k-means model.

The k-means algorithm will group Senators who vote similarly on bills together, in clusters. Each cluster is assigned a center, and the Euclidean distance from each Senator to the center is computed. Senators are assigned to clusters based on which one they are closest to. From our background knowledge, we think that Senators will cluster along party lines.
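
The assignment step described above can be sketched in a few lines of NumPy. This is only an illustration of "assign each row to its nearest center", not the full k-means algorithm: the points and centers below are made-up values, and real k-means would go on to update the centers and repeat.

```python
import numpy as np

# Hypothetical data: four voters, three votes each.
points = np.array([[1, 1, 0],
                   [1, 0, 0],
                   [0, 0, 1],
                   [0, 1, 1]], dtype=float)

# Hypothetical cluster centers (k-means would refine these iteratively).
centers = np.array([[1.0, 0.5, 0.0],
                    [0.0, 0.5, 1.0]])

# distances[i, j] is the Euclidean distance from point i to center j.
distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)

# Each point is assigned to the cluster whose center is closest.
labels = distances.argmin(axis=1)

print(labels)  # [0 0 1 1]
```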

The k-means algorithm requires us to specify the number of clusters upfront. Because we suspect that clusters will occur along party lines, and the vast majority of Senators are either Republicans or Democrats, we'll pick 2 for our number of clusters.

We'll use the KMeans class from scikit-learn to perform the clustering. Because we aren't predicting anything, there's no risk of overfitting, so we'll train our model on the whole dataset. After training, we'll be able to extract cluster labels that indicate what cluster each Senator belongs to.

We can initialize the model like this:


kmeans_model = KMeans(n_clusters=2, random_state=1)

The above code will initialize the k-means model with 2 clusters, and a random state of 1 to allow for the same results to be reproduced whenever the algorithm is run.

We'll then be able to use the fit_transform() method to fit the model to the numeric vote columns of votes (everything after the first 3 columns) and get the distance of each Senator to each cluster. The result will look like this:


array([[ 3.12141628,  1.3134775 ],
   [ 2.6146248 ,  2.05339992],
   [ 0.33960656,  3.41651746],
   [ 3.42004795,  0.24198446],
   [ 1.43833966,  2.96866004],
   [ 0.33960656,  3.41651746],
   [ 3.42004795,  0.24198446],
   [ 0.33960656,  3.41651746],
   [ 3.42004795,  0.24198446],
   [ 0.31287498,  3.30758755],
   ...
This is a NumPy array with two columns. The first column is the Euclidean distance from each Senator to the center of the first cluster, and the second column is the distance to the center of the second cluster. The farther a Senator is from a cluster's center, the less that Senator's voting history aligns with the voting history of the cluster.
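
Here is a minimal sketch of the whole fit_transform() workflow on a tiny, made-up vote matrix rather than the real Senate data. The six rows and the vote_matrix name are hypothetical; n_init=10 is passed explicitly only so the behavior does not depend on the installed scikit-learn version's default.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical vote matrix: six voters forming two clearly separated blocs.
vote_matrix = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
], dtype=float)

kmeans_model = KMeans(n_clusters=2, random_state=1, n_init=10)

# One row per voter, one column per cluster: distance to each cluster center.
distances = kmeans_model.fit_transform(vote_matrix)

# labels_ holds the index of the cluster each voter was assigned to.
labels = kmeans_model.labels_

print(distances.shape)  # (6, 2)
print(labels)
```

Which bloc ends up labeled 0 and which is labeled 1 is arbitrary, so the useful thing to inspect is which rows share a label, not the label values themselves.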

In [68]:
from sklearn.cluster import KMeans
# Two clusters, matching the expected Republican/Democrat split described above.
k_means = KMeans(n_clusters=2, random_state=1)
