#Clustering in Sci-Kit Learn : An Introduction

<br>

Sci-Kit Learn is a Python module which provides a number of great tools for clustering data. 

It's the analytics engine behind our web-application and here, we'll show you the magic behind the machine! 


####**A Few Notes** : 

* This tutorial is meant for Python 2.7.10. If you try this with a different version of Python, we can't promise your computer won't explode. 

* Make sure you've installed the following Python modules: 

  * Pandas (version 0.16.2)
  * Sci-Kit Learn (version 0.17)
  
<br>

Ready? Let's begin by importing our modules! 

<br>

In [None]:
import pandas as pdbear
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import SpectralClustering
from sklearn import preprocessing

####Loading the Data :
<br>
And then let's get our dataset ready by loading it into a Pandas dataframe. 

<br>



In [None]:
rawDATA = pdbear.read_csv('data/cereal.csv')
#reads csv from computer directory into Python environment as dataframe

rawDATA[:3]
#selects first three rows of dataframe

<br>
This is a dataset of different cereal brands' nutritional information, including calories, carbs, fat, fiber, etc. 

For this clustering example, we're only going to concern ourselves with numerical variables, so let's remove categorical variables such as manufacturer, brand, display shelf, and name.
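
If you'd rather not type out every column name, pandas can pick the numeric columns for you. Here's a minimal sketch on a made-up miniature of the cereal table (the cereal names and values are hypothetical, and it assumes a recent pandas rather than the version pinned above):

```python
import pandas as pd

# Hypothetical miniature of the cereal table (made-up values, not the real data)
toy = pd.DataFrame({
    'Cereal Name': ['Alpha Flakes', 'Beta Puffs', 'Gamma Bran'],
    'Manufacturer': ['A', 'B', 'A'],
    'Calories': [110, 120, 90],
    'Sugars': [6.0, 12.0, 3.0],
})

# Keep only the columns pandas considers numeric (ints and floats)
numeric_only = toy.select_dtypes(include='number')
print(list(numeric_only.columns))
```

One caveat: a categorical variable that happens to be stored as a number (display shelf, say) would slip through, so listing the columns by name, as we do in this tutorial, is the safer choice.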

<br>



In [None]:
DATA = rawDATA[[u'Calories', u'Protein (g)', u'Fat', u'Sodium', u'Dietary Fiber', u'Carbs', u'Sugars', u'Potassium', 
                u'Vitamins and Minerals',u'Serving Size Weight', u'Cups per Serving']]
#note how with a Pandas dataframe, we nest the column names inside a list like this: data[[colnames]] 
#instead of passing them directly like this: data[colnames]

DATA[:3]
#selects first three rows of the dataframe

####Creating the Model : The R Way

<br>

Now that we've got our data loaded, we can start with our model building. 

<br>
Sci-Kit Learn is a little different from R. In R, we usually build the model in one step, by passing the data and the parameters to a function which outputs a model. Then, once we have the model, we pass it to a predict function to output new estimates. For example: 

  * Create model with parameters and fit to data all at once using a function:
    * **Model <- glm(Y~., data = your_dataset, family = binomial())**    
<br>

  * Feed the model into another function to get our predictions:
    * **predict(Model, test)**
  
<br>

####Creating the Model...Pythonically!!! : The Sci-Kit Way

<br>

With Sci-Kit Learn, it's a little different. We create a model-object first, passing it all of the parameters, without actually giving it any data. Then, once we have this model-object, we pass it the data and fit it using an object method (as opposed to a function). Once the model has been fit, the model's predictions are stored as an object attribute. For example:

  * First, instantiate a model with parameters. Notice how the model hasn't actually touched any data yet:
    * **Model = KMeans(n_clusters = 3)**    
<br>

  * Then, fit the model with actual data using the '.fit(x)' method:
    * **Model.fit(your_dataset)**  
<br>

  * Finally, we can extract our predictions simply by calling the now fit model's attribute, '.labels\_':
    * **Model.labels\_**

<br>
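
Here are the three steps above as one runnable sketch. The data is a tiny made-up array (not the cereal data), and `n_init` and `random_state` are extra KMeans parameters we set only so that the toy run is repeatable:

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up data: two tight groups of 2-D points (hypothetical, for illustration only)
your_dataset = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                         [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]])

# Step 1: instantiate the model with its parameters; no data involved yet
Model = KMeans(n_clusters=2, n_init=10, random_state=0)

# Step 2: fit the model to the data using the object method
Model.fit(your_dataset)

# Step 3: read the cluster assignments from the fitted model's attribute
print(Model.labels_)
```

The label numbers themselves are arbitrary. What matters is which observations share a label.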


####But First : Preparing the Data!

<br>

There are a few steps we should take before we build and fit our model. 

First of all, with K-Means, Hierarchical, and Spectral Clustering, it behooves us to standardize the data first (for each variable column, subtract the column mean from each value and then divide by the column standard deviation). Since all of these methods depend on some distance or similarity measure, we want to make sure each variable is being treated equally, so the results won't be over-influenced by certain variables simply for being on a bigger measurement scale. 

Secondly, it's convenient to have our data as a numpy array. The **'.fit(x)'** methods accept array-like input (a pandas dataframe works too), but they operate on numpy arrays internally. Luckily, the **'scale(x)'** function accomplishes both goals: it standardizes the data and returns a numpy array. 
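
To see what standardizing actually does, here's a small sketch that does it by hand with numpy and compares the result against **'scale(x)'**. The numbers are made up:

```python
import numpy as np
from sklearn import preprocessing

# Made-up table: column 0 is on a much larger scale than column 1
X = np.array([[100.0, 1.0],
              [150.0, 3.0],
              [200.0, 2.0]])

# Standardize by hand: subtract each column's mean, divide by its standard deviation
manual = (X - X.mean(axis=0)) / X.std(axis=0)

# scale(x) does the same thing, column by column
scaled = preprocessing.scale(X)

print(np.allclose(manual, scaled))  # True
# After scaling, every column has mean 0 and standard deviation 1
print(scaled.mean(axis=0), scaled.std(axis=0))
```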

<br>


In [None]:
DATA_Matrix = preprocessing.scale(DATA)
#returns a scaled version of the dataframe, in numpy array/matrix form

<br>

Now we can start model building! 

<br>

##K-Means

Recall that the K-Means clustering algorithm has three main model parameters: 

  * Distance Measure Between Points
  * Cluster Center Measure
  * Number of Clusters
  
And one main computational parameter: 

  * Number of Iterations
  
Sci-Kit Learn uses **Euclidean Distance** as the distance measure, and **mean** as the center measure. You can choose the number of clusters with **n_clusters** (the default is 8, so you'll almost always want to set it yourself) and the maximum number of iterations with **max_iter** (default is 300). The first two parameters, unfortunately, are fixed and cannot be changed. 
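
To make those four parameters concrete, here's a bare-bones K-Means loop written with numpy. This is a teaching sketch, not Sci-Kit Learn's actual implementation: the function name `kmeans_sketch`, the random starting centers, and the toy data are all our own assumptions.

```python
import numpy as np

def kmeans_sketch(X, n_clusters, max_iter=300, seed=0):
    """Toy K-Means: Euclidean distance, mean centers. Not Sci-Kit Learn's code."""
    rng = np.random.RandomState(seed)
    # Start from n_clusters distinct observations picked at random
    centers = X[rng.choice(len(X), n_clusters, replace=False)]
    for _ in range(max_iter):
        # Distance measure: Euclidean distance from every point to every center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Center measure: the mean of the points assigned to each cluster
        new_centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break  # the centers have stopped moving, so we're done early
        centers = new_centers
    return labels, centers

# Made-up data: two well-separated groups of three points each
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [9.0, 9.0], [9.0, 10.0], [10.0, 9.0]])
labels, centers = kmeans_sketch(X, n_clusters=2)
print(labels)
print(centers)
```

Swapping the distance line or the mean line is exactly the kind of change that Sci-Kit Learn's KMeans doesn't let us make.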

So let's start!

<br>

First, we create an empty model (so no data yet) along with the parameters we'll use. 

In [None]:
Model_K = KMeans(n_clusters=3)
#creates an empty model object, with set parameters, but not fit to data yet

Then, we fit the model with the data using an object method.

In [None]:
Model_K.fit(DATA_Matrix)
#fits the model using data from the data matrix. Now that the model has been fit, it will have additional object
#attributes available (such as .labels_)

Now that the model has been fit, the cluster labels are stored in an object attribute called .labels_ as a numpy array.

In [None]:
Model_K.labels_
#calling the newly available object attribute .labels_ returns a numpy array with the cluster assignments for each
#observation, matched by index 

And that's it! Easy as that. We can go ahead and attach the cluster labels back to the observations, for context, but all the actual work is finished! 3 steps! Isn't Sci-Kit Learn great?!

In [None]:
resultKmeans = rawDATA[['Cereal Name']].copy()
#we create a new object, a single column dataframe with just the observations' names. Note how we select the column 
#'Cereal Name' from the original dataframe, and then call .copy() on the result. Doing so ensures we're working with
#an independent dataframe, so adding columns to it below won't touch the original (or trigger a pandas warning)

resultKmeans['Cluster Labels K Means'] = Model_K.labels_
#joins the cluster labels from K Means to the dataframe of observations' names, created above

resultKmeans[:10]

Fast and easy! 
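
Once the labels are attached, a natural next step is to ask what each cluster looks like. Here's a sketch using a made-up results table (the cereal names, values, and labels are all hypothetical) that counts the observations in each cluster and averages the variables within each one:

```python
import pandas as pd

# Hypothetical results table: made-up cereals, values, and cluster labels
results = pd.DataFrame({
    'Cereal Name': ['Alpha Flakes', 'Beta Puffs', 'Gamma Bran', 'Delta Os'],
    'Calories': [110, 120, 90, 100],
    'Sugars': [6.0, 12.0, 3.0, 5.0],
    'Cluster Labels K Means': [0, 1, 0, 0],
})

# How many observations landed in each cluster
sizes = results['Cluster Labels K Means'].value_counts().sort_index()

# Average of each numeric variable within each cluster
profile = results.groupby('Cluster Labels K Means')[['Calories', 'Sugars']].mean()

print(sizes)
print(profile)
```

Averaging the original, unscaled variables keeps the profile in units we can actually read, even though the model itself was fit on the standardized matrix.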

####Note on K-Means

The real drawback to K-Means in Sci-Kit Learn is that it's not very versatile. We can't control the distance or center measures. However, the next two clustering functions allow more flexibility. 

##Agglomerative Hierarchical

So with agglomerative hierarchical clustering, we have two main parameters for creating the dendrogram:

  * Distance Measure
  * Linkage Function Between Clusters

And an additional two parameters/choices for extracting clusters from the dendrogram:

  * Cluster Selection Method
  * Number of Clusters
  
Fortunately, Sci-Kit Learn gives us some leeway in choosing these parameters. The distance measure can be selected using **affinity** (default: 'euclidean') and the linkage function can be selected using **linkage** (default: 'ward'). The two choices interact: ward linkage, for example, only works with Euclidean distance. 

For cluster assignment, we need to define the number of clusters we want, of course, but Sci-Kit Learn doesn't let you choose a cluster selection method. It simply uses the top-down approach: start at the top of the dendrogram and move down until you have k clusters. 

Also important to note: the agglomerative hierarchical clustering algorithm in Sci-Kit Learn is built for cluster assignment. It builds a dendrogram in order to assign clusters, but it doesn't plot the dendrogram or hand it to you in a ready-to-use form. So if you wanted to, let's say, build a dendrogram and analyze it instead of choosing some pre-determined k clusters, you would need to use a different function. The scipy.cluster.hierarchy module in scipy has functions for this (linkage and dendrogram). 
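
For the curious, here's a rough sketch of the scipy route on made-up data. It builds the full merge tree with `linkage`, cuts it into clusters with `fcluster`, and gets the dendrogram layout from `dendrogram` (with `no_plot=True`, so nothing is drawn):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# Made-up data: two tight groups of points (hypothetical, for illustration only)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [9.0, 9.0], [9.0, 10.0], [10.0, 9.0]])

# Build the full merge tree with complete linkage and Euclidean distance
Z = linkage(X, method='complete', metric='euclidean')

# Each row of Z records one merge: the two clusters joined, their distance, the new size
print(Z.shape)

# Cut the tree so that at most 2 clusters remain (scipy's labels start at 1, not 0)
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)

# no_plot=True returns the dendrogram layout as a dict instead of drawing it
tree = dendrogram(Z, no_plot=True)
print(tree['ivl'])  # leaf order along the bottom of the dendrogram
```

Dropping `no_plot=True` draws the dendrogram, provided matplotlib is installed.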

One more note: Sci-Kit Learn also has a parameter, **connectivity**, which lets you constrain the clustering with a connectivity matrix. The matrix defines which observations count as neighbors, and only neighboring clusters are allowed to merge. We won't use it in this tutorial. 
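
As a rough illustration of **connectivity**, the sketch below builds a nearest-neighbors graph with `kneighbors_graph` and passes it to the model. The data is made up, and 3 neighbors is an arbitrary choice for this toy example:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import kneighbors_graph

# Made-up data: six points along a line, with a gap in the middle
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])

# Sparse matrix marking each observation's 3 nearest neighbors as "connected"
connectivity = kneighbors_graph(X, n_neighbors=3, include_self=False)

# With a connectivity matrix, only connected clusters are candidates for merging
Model = AgglomerativeClustering(n_clusters=2, linkage='ward',
                                connectivity=connectivity)
Model.fit(X)
print(Model.labels_)
```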

<br>

So let's begin! Again, create the empty model with the model parameters:

In [None]:
Model_Agglo = AgglomerativeClustering(n_clusters = 3, affinity = 'euclidean', linkage = 'complete')
#creates empty agglomerative clustering object. Parameters are selected, here, we change the linkage function from 
#the default to complete linkage

Now, we fit the model! 

In [None]:
Model_Agglo.fit(DATA_Matrix)
#fits our previously empty model

Let's look at the cluster labels! 

In [None]:
Model_Agglo.labels_
#cluster labels assigned to observations, by index

Let's put it into our results dataframe and compare against K-Means!

In [None]:
resultKmeans['Cluster Labels Agglo'] = Model_Agglo.labels_
#joins the cluster labels from Agglo to our previously created dataframe

resultKmeans[:10]

As we can see, our Agglomerative Clustering model produced different results than K-Means did. 

<br> 

####Note on Agglomerative Clustering

With Agglomerative Clustering, the linkage function we choose can have an enormous impact on the dendrogram and, hence, on the cluster assignments produced. For example, Ward Linkage merges the pair of clusters that gives the smallest increase in within-cluster variance, while Complete Linkage measures the distance between two clusters by their two farthest points, so clusters become harder to merge the further apart their extremes are. 

Because of this, Agglomerative Clustering can be versatile for different situations, depending on the data and context. 
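
A quick way to get a feel for this is to fit the same data with a few different linkage functions and compare the cluster sizes. The sketch below uses randomly generated points rather than the cereal data, and leaves the distance measure at its Euclidean default:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Made-up data: two overlapping clouds of random points
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(20, 2)),
               rng.normal(4.0, 1.0, size=(20, 2))])

labels_by_linkage = {}
for linkage in ['ward', 'complete', 'average']:
    Model = AgglomerativeClustering(n_clusters=3, linkage=linkage)
    labels_by_linkage[linkage] = Model.fit(X).labels_
    # Cluster sizes: a quick way to see how differently each linkage splits the data
    print(linkage, np.bincount(labels_by_linkage[linkage]))
```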

##Spectral Clustering

Finally, Spectral Clustering. Spectral Clustering is similar to Agglomerative in that there are multiple phases: 

  * 1: Create an affinity/similarity matrix of the observations
  * 2: Generate the Laplacian
  * 3: Cluster the results
  
Hence, there is a lot of room for variation. The parameters are as follows:

  * Similarity Measure for computing affinity/similarity matrix
    * Many possible methods: KNN, RBF, etc. 
    * Any parameters associated with those methods
  * Clustering method
    * Possible methods: K-Means, Discretized
    * Any parameters of those methods
    * Number of clusters
    
The Spectral Clustering function for Sci-Kit Learn allows for pretty high customization of these parameters. The affinity/similarity matrix calculation can be set with **affinity** (default: 'rbf'). The clustering method can be assigned using **assign_labels** (default: 'kmeans'). There are also parameters specific to certain affinity and assign_labels choices, which only take effect if the associated method is chosen. 

NOTE: affinity here is defined a little differently from the affinity in the Agglomerative Clustering function. There, affinity is the measure used to calculate the distance between points, whereas here, it's the measure used to compute an affinity/similarity matrix. 

Again, as before, the number of clusters needs to be selected. 

One thing to note is that although one could run spectral clustering on either the observations of a dataset or the variables of a dataset, Spectral Clustering will by default treat the observations as the data being assigned clusters, and will compute the affinity/similarity matrix accordingly. 
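
Before we hand everything over to Sci-Kit Learn, here's a by-hand numpy sketch of the three phases listed earlier. It's a simplification under our own assumptions: an RBF affinity with a hand-picked `gamma`, the plain (unnormalized) Laplacian, and, since there are only two clusters, a split on the sign of one eigenvector standing in for the K-Means step. Sci-Kit Learn's own implementation differs in its details.

```python
import numpy as np

# Made-up data: two groups of three points each
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [4.0, 4.0], [4.0, 5.0], [5.0, 4.0]])

# Phase 1: RBF affinity matrix, W[i, j] = exp(-gamma * squared distance)
gamma = 0.1  # hypothetical choice for this toy data
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
W = np.exp(-gamma * sq_dists)

# Phase 2: unnormalized graph Laplacian, L = D - W, with D the diagonal degree matrix
D = np.diag(W.sum(axis=1))
L = D - W

# Phase 3: take the eigenvector of the second-smallest eigenvalue, then cluster it.
# For two clusters, splitting on its sign stands in for the K-Means step.
eigenvalues, eigenvectors = np.linalg.eigh(L)
second_vector = eigenvectors[:, 1]
labels = (second_vector > 0).astype(int)
print(labels)
```

Which group ends up labeled 0 and which 1 is arbitrary, since an eigenvector's sign is arbitrary.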

So let's begin! 

<br>

First, begin, as usual, by creating the empty model: 

In [None]:
Model_SpookyGhost = SpectralClustering(n_clusters=3, eigen_solver= 'arpack', 
                                       affinity= 'nearest_neighbors', n_neighbors= 4, assign_labels= 'discretize')
#creates empty spectral clustering model, with set parameters. 
#this model is using KNN with K=4 to create the affinity matrix, and then assigning clusters using a 
#discretized approach. The 'eigen_solver' parameter specifies which linear algebra function to use to compute the 
#eigen-stuff
#Model is called SpookyGhost because spectres are ghosts and ghosts are spooky. 
#James Bond was also a choice, but I decided it was too topical.

Again, we fit the model. 

NOTE: we use the .fit(x) method as with K-Means and Agglomerative; however, these models have another object method available: .fit_predict(x), which is slightly different. 
  * 1: It runs the same algorithm as .fit(x), affinity/similarity matrix included, and stores the results in the same object attributes. 
  * 2: It also directly outputs the cluster labels, so there's no need to read them from .labels_ afterwards. 
  
Since we've been reading the labels from .labels_ throughout this tutorial, we'll stick with .fit(x) instead of .fit_predict(x). 
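
One way to see how .fit(x) and .fit_predict(x) relate is to call .fit_predict(x) and then inspect the fitted object. The sketch below does that on randomly generated data; the `gamma` and `random_state` values are arbitrary choices for the toy run:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Made-up data: two clouds of ten random points each
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(10, 2)),
               rng.normal(4.0, 0.5, size=(10, 2))])

Model = SpectralClustering(n_clusters=2, affinity='rbf', gamma=0.1, random_state=0)

# .fit_predict(x) fits the model and hands back the labels in one call
returned_labels = Model.fit_predict(X)

# The fitted object carries the same labels in its attribute...
print(np.array_equal(returned_labels, Model.labels_))  # True

# ...and the affinity/similarity matrix that was computed along the way
print(Model.affinity_matrix_.shape)
```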

In [None]:
Model_SpookyGhost.fit(DATA_Matrix)
#fits the model using our data. 
#note that we are using .fit(), NOT .fit_predict()

And now, we can get our cluster labels!

In [None]:
Model_SpookyGhost.labels_
#returns our cluster labels, according to index. 

Let's throw it into our results dataframe and see how they all compare!


In [None]:
resultKmeans['Cluster Labels Spectre'] = Model_SpookyGhost.labels_
#joins the cluster labels from SpookyGhost to our previously created dataframe

resultKmeans[:10]

And there you go. Notice how all of these methods produce different results. Each clustering algorithm has its own peculiarities, and parameter choices can definitely impact how the data is labeled. The appropriate method to use depends on the context and data being clustered, and there is still no consensus on the appropriate way to determine this. 
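
A word of caution when comparing label columns by eye: the label numbers are arbitrary, so two methods can agree completely while using different numbers. One option (our suggestion, not something the models above do for you) is `adjusted_rand_score` from Sci-Kit Learn, which compares the groupings themselves. The label lists below are made up:

```python
from sklearn.metrics import adjusted_rand_score

# Made-up label columns: the same grouping, but with the label numbers swapped
labels_a = [0, 0, 1, 1, 2, 2]
labels_b = [1, 1, 2, 2, 0, 0]
# A genuinely different grouping of the same six observations
labels_c = [0, 1, 0, 1, 0, 1]

same = adjusted_rand_score(labels_a, labels_b)
different = adjusted_rand_score(labels_a, labels_c)
print(same)       # 1.0, because the two groupings are identical
print(different)  # much lower, because the groupings disagree
```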

####Note on Spectral Clustering

Spectral Clustering can be quite different from K-Means Clustering in that while K-Means clusters according to closeness to some central point, Spectral Clustering clusters according to the similarity of the data points to each other. However, the degree to which this is true depends on the parameter choices, which can have an enormous impact on how the algorithm behaves. If, for example, RBF was used instead of KNN, or KNN used a large K, and K-Means was used instead of Discretized, the algorithm can often behave similarly to K-Means, grouping data together based on a common region instead of connectivity to nearby points. 

The parameters used in our example were chosen for their tendency to make the model give datapoints that are chain-linked together the same label, so it handles spirals and similar data better. However, this is just one aspect of Spectral Clustering's range of behavior. 
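
To see the difference for yourself, here's a sketch that runs K-Means and a KNN-based Spectral Clustering model on a synthetic dataset of one circle nested inside another, generated with `make_circles`. The parameter values (10 neighbors, the noise level, the random seeds) are our own arbitrary choices, and results on other data may look quite different:

```python
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_circles

# Synthetic data: a small circle nested inside a larger one
X, true_ring = make_circles(n_samples=200, factor=0.3, noise=0.03, random_state=0)

# K-Means groups points by closeness to a center
labels_k = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X).labels_

# Spectral Clustering with a KNN affinity follows chains of nearby points
Model = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                           n_neighbors=10, assign_labels='discretize',
                           random_state=0)
labels_s = Model.fit(X).labels_

# Cross-tabulate each labeling against the ring each point was drawn from
for name, labels in [('K-Means', labels_k), ('Spectral', labels_s)]:
    table = np.zeros((2, 2), dtype=int)
    np.add.at(table, (true_ring, labels), 1)
    print(name)
    print(table)
```

In each printed table, rows are the true rings and columns are the assigned labels, so a labeling that follows the rings puts each row's count almost entirely in a single column.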