## Clustering

Last class we studied document vectors and how to use them to find keywords and similar documents. What else can we do with vectors? We can cluster them to find natural groups or categories, or visualize them directly by projecting them to 2D or 3D space.

In [1]:
import pandas as pd
import numpy as np
from textblob import TextBlob
from sklearn.feature_extraction.text import TfidfVectorizer

We'll start by getting tf-idf vectors for the Menendez press releases, like we did last class.

In [2]:
pr = pd.read_csv('menendez-press-releases.csv')
len(pr)

1530

In [10]:
# need a tokenizer

In [1]:
# create tf-idf vectors
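
The two cells above are left blank in these notes. As a rough sketch of what they might contain, here is one way to tokenize and build tf-idf vectors. The `docs` list and the regex-based `tokenize` function are stand-ins invented for illustration: the real notebook would use a text column from `pr` and probably a TextBlob-based tokenizer.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in documents; the real input would be the press release text in `pr`.
docs = [
    "Senator announces new funding for transit and rail safety",
    "Transit funding will improve rail safety across the state",
    "Senator calls for sanctions and a stronger foreign policy",
    "New sanctions proposed as part of foreign policy bill",
]

def tokenize(text):
    # Lowercase the text and keep runs of letters (an assumed, simplified tokenizer).
    return re.findall(r"[a-z]+", text.lower())

# token_pattern=None tells scikit-learn we are supplying our own tokenizer.
vectorizer = TfidfVectorizer(tokenizer=tokenize, token_pattern=None, stop_words="english")
matrix = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

print(matrix.shape)  # (number of documents, number of distinct terms)
```

Each row of `matrix` is one document vector, and `vectorizer.vocabulary_` maps each term to its column.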

We're going to use a clustering algorithm called k-means. To see how it works, try this [interactive demo](http://web.stanford.edu/class/ee103/visualizations/kmeans/kmeans.html) or [this one](https://www.naftaliharris.com/blog/visualizing-k-means-clustering/).

In [5]:
from sklearn.cluster import KMeans

In [2]:
# cluster
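
The clustering cell is also blank here. A minimal sketch of the step, assuming scikit-learn's `KMeans` and a tiny made-up corpus in place of the press releases (the choice of two clusters is arbitrary and only suits this toy data):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in documents, not the Menendez press releases.
docs = [
    "Senator announces new funding for transit and rail safety",
    "Transit funding will improve rail safety across the state",
    "Senator calls for sanctions and a stronger foreign policy",
    "New sanctions proposed as part of foreign policy bill",
]

vectors = TfidfVectorizer(stop_words="english").fit_transform(docs)

# n_init=10 reruns k-means from 10 random starts and keeps the best result;
# random_state makes the run repeatable.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(vectors)

print(labels)                     # one cluster number per document
print(km.cluster_centers_.shape)  # (number of clusters, number of terms)
```

The cluster numbers themselves are arbitrary. What matters is which documents share a number.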

Ok, let's see what's in each cluster!

In [9]:
def print_sorted_vector(v):
    # this "lambda" thing is an anonymous function, google me to unlock bonus coding knowledge
    sorted_list = sorted(v.items(), key=lambda x: (x[1],x[0]), reverse=True) 
    sorted_list = sorted_list[:10]
    print('\n'.join([str(x) for x in sorted_list]))

Now we're going to print out the top words of the center vector of each cluster, to see how the k-means algorithm did.

In [3]:
# print cluster centroids

In fact, Overview uses k-means in its "topic tree" visualization.

### Visualizing clusters to understand politics
This is a fairly literal translation of a [previous post](http://www.compjournalism.com/?p=13) of mine (it was done in R at the time). We're going to load up the voting record of the U.K. House of Lords, turn each MP's voting record into a vector, and see how all these politicians relate in this abstract ideological space.

The data is circa 2012, because they had an interesting coalition government at the time.

In [11]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline

In [12]:
votes = pd.read_csv('uk-lords-votes.csv')
votes.shape

(613, 102)

In [9]:
# print out this votes matrix

This is data I processed earlier, and you can think of it as a template for the format you will need to get your data into to do your homework. Each row is one member of parliament. Each of the numbered columns is one vote, where 1 means aye, 0 means abstain, and -1 means nay. The `party` column indicates which political party that MP belonged to at the time.

If you're interested in the original data, including the names of these politicians and what they were voting on, you can find it all [here](http://www.compjournalism.com/?p=13).

We'll want to turn the list of parties into a list of colors.

In [14]:
# compute the color that each MP should be, based on their party
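
A small sketch of one way to do this, using a dictionary lookup. The party labels and color choices below are assumptions made for illustration; the labels in the real `party` column may be spelled differently.

```python
import pandas as pd

# Hypothetical stand-in for the `party` column of the votes table.
votes = pd.DataFrame({"party": ["Con", "Lab", "LDem", "XB", "Lab"]})

# Assumed mapping from party label to plot color.
party_colors = {"Con": "blue", "Lab": "red", "LDem": "orange"}

# Parties missing from the dictionary come back as NaN, so give them a fallback color.
colors = votes["party"].map(party_colors).fillna("gray").tolist()

print(colors)  # ['blue', 'red', 'orange', 'gray', 'red']
```

The resulting list has one color per row, in row order, which is the form matplotlib's `scatter` accepts for its `c` argument.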


Now that we've set everything up, we're ready to start projecting. We can view at most three dimensions at once with our puny human visual system. The simplest projection is just to pick three dimensions of our vectors and plot them.

In [8]:
# 3d scatterplot of three votes
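
As a sketch of what this cell might look like, here is a 3D scatterplot of the first three columns of a randomly generated stand-in matrix (not the real Lords data), plus a count of how many distinct positions the points occupy.

```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older matplotlib

rng = np.random.default_rng(0)

# Hypothetical stand-in for the votes matrix: 60 members, 10 votes, values -1, 0, or 1.
X = rng.choice([-1, 0, 1], size=(60, 10))

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(X[:, 0], X[:, 1], X[:, 2])
ax.set_xlabel("vote 1")
ax.set_ylabel("vote 2")
ax.set_zlabel("vote 3")

# With only three possible values per axis there are at most 3 * 3 * 3 = 27 positions.
distinct = len(np.unique(X[:, :3], axis=0))
print(distinct, "distinct positions for", len(X), "members")
```

With `%matplotlib inline` the figure appears under the cell; in a plain script you would call `plt.show()`.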



Not very interesting. Every vote coordinate is -1, 0, or 1, so no matter which three votes (dimensions) we pick, the points can only land on the 27 grid positions of a cube: its corners, edge midpoints, face centers, and center. Plus, many of the 613 MPs voted the same way on this set of three votes, so their points overlap and we only see a few dots.

Instead, we're going to let the computer pick the right projection from this wacky high dimensional space to two dimensions. We are using PCA, "principal components analysis," which finds the directions along which the points are most spread out (have maximum variance) and projects onto them. These directions don't have to be aligned to any of our original axes: PCA will "rotate" the points in high dimensional space until they are as spread out as possible.
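
To make the "rotation" idea concrete, here is a small made-up example: 2D points that lie close to the diagonal line y = x. Neither original axis captures the spread well, but the first principal component should point roughly along the diagonal. The data is synthetic and exists only to illustrate the idea.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic points scattered tightly around the line y = x.
t = rng.normal(size=200)
points = np.column_stack([t, t + 0.1 * rng.normal(size=200)])

pca = PCA(n_components=1).fit(points)
direction = pca.components_[0]  # unit vector of the first principal component

print(direction)                      # both entries close to 0.71 in magnitude
print(pca.explained_variance_ratio_)  # share of total variance along that direction
```

The sign of `direction` is arbitrary; only the line it lies along matters.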

In [5]:
# PCA to 2D

In [6]:
# 2D scatterplot
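
A hedged sketch of the two cells above, run on synthetic data rather than the real votes: two invented blocs of 30 members who vote opposite ways on 20 votes, with some abstentions mixed in. The bloc sizes, the abstention rate, and the colors are all assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Synthetic votes: bloc +1 follows `lean` on every vote, bloc -1 votes the opposite way.
bloc = np.repeat([1, -1], 30)
lean = rng.choice([-1, 1], size=20)
X = np.outer(bloc, lean)
X[rng.random(X.shape) < 0.2] = 0  # about 20% of votes become abstentions

colors = np.where(bloc == 1, "blue", "red").tolist()

# Project the 20-dimensional voting records down to 2 dimensions.
X2 = PCA(n_components=2).fit_transform(X)
print(X2.shape)  # (60, 2)

fig, ax = plt.subplots()
dots = ax.scatter(X2[:, 0], X2[:, 1], c=colors)
ax.set_xlabel("first principal component")
ax.set_ylabel("second principal component")
```

On data like this the two blocs should separate along the first component. For the real matrix you would pass only the numbered vote columns to `fit_transform`, leaving out `party`.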

We can actually project down to any number of dimensions. (More than 3 but fewer than the original 100 can be useful for some data processing operations.) Here, we'll project down to 3 and take a look at our voting clusters in glorious 3D.

In [18]:
# PCA to 3D

In [7]:
# 3D scatterplot
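
The 3D version might look something like the sketch below. It uses three invented blocs of synthetic voters so there is something to see in the third dimension; none of this is the real Lords data.

```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older matplotlib
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)

# Synthetic votes: three blocs of 20 members, each bloc with its own pattern on 20 votes.
bloc = np.repeat([0, 1, 2], 20)
leans = rng.choice([-1, 1], size=(3, 20))
X = leans[bloc]                   # each member starts with their bloc's pattern
X[rng.random(X.shape) < 0.2] = 0  # about 20% of votes become abstentions

colors = np.array(["blue", "red", "orange"])[bloc].tolist()

# Same call as before, but keeping three components instead of two.
X3 = PCA(n_components=3).fit_transform(X)
print(X3.shape)  # (60, 3)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(X3[:, 0], X3[:, 1], X3[:, 2], c=colors)
```

The only change from the 2D case is `n_components=3` and a 3D axis, which is why projecting to other dimensions is easy to try.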