##K-Means in Sci-Kit Learn : An Introduction

<br>

Sci-Kit Learn is a Python module that provides a number of great tools for clustering data. 

It's the analytics engine behind our web application, and here we'll show you the magic behind the machine! 


####**A Few Notes** : 

* This tutorial is meant for Python 2.7.10. If you try this with a different version of Python, we can't promise your computer won't explode. 

* Make sure you've installed the following Python modules: 

  * Pandas (version 0.16.2)
  * Sci-Kit Learn (version 0.17)
  * SciPy (version 0.16.0)
  * Numpy (version 1.9.2)
  
<br>

Ready? Let's begin by importing our modules! 

<br>

In [5]:
import pandas as pdbear
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import SpectralClustering
from sklearn import preprocessing

####Loading the Data :
<br>
Next, let's get our dataset ready by loading it into a Pandas dataframe. 

<br>



In [6]:
rawDATA = pdbear.read_csv('data/cereal.csv')
#reads csv from computer directory into Python environment as dataframe

rawDATA[:3]
#selects first three rows of dataframe

Unnamed: 0,Cereal Name,Manufacturer,Type,Calories,Protein (g),Fat,Sodium,Dietary Fiber,Carbs,Sugars,Display Shelf,Potassium,Vitamins and Minerals,Serving Size Weight,Cups per Serving
0,100%_Bran,Nabisco,C,70,4,1,130,10,5,6,3,280,25,1,0.33
1,100%_Natural_Bran,Quaker Oats,C,120,3,5,15,2,8,8,3,135,0,1,-1.0
2,All-Bran,Kelloggs,C,70,4,1,260,9,7,5,3,320,25,1,0.33


<br>
This is a dataset of different cereal brands' nutritional information, including calories, carbs, fat, fiber, etc. 

For this clustering example, we're only going to concern ourselves with numerical variables, so let's remove the variables we treat as categorical: cereal name, manufacturer, type, and display shelf.

<br>



In [8]:
DATA = rawDATA[[u'Calories', u'Protein (g)', u'Fat', u'Sodium', u'Dietary Fiber', u'Carbs', u'Sugars', u'Potassium', 
                u'Vitamins and Minerals',u'Serving Size Weight', u'Cups per Serving']]
#note how with a Pandas dataframe, we nest the column names inside a list like this: data[[colnames]] 
#instead of passing them directly like this: data[colnames]

DATA[:3]
#selects first three rows of the dataframe

Unnamed: 0,Calories,Protein (g),Fat,Sodium,Dietary Fiber,Carbs,Sugars,Potassium,Vitamins and Minerals,Serving Size Weight,Cups per Serving
0,70,4,1,130,10,5,6,280,25,1,0.33
1,120,3,5,15,2,8,8,135,0,1,-1.0
2,70,4,1,260,9,7,5,320,25,1,0.33
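
As a side note, the same selection can be written the other way around: instead of listing every column to keep, we list the few columns to drop. The sketch below is only an illustration on a hypothetical three-row stand-in for the cereal data (the `toy` dataframe and the standard `pd` alias are our own assumptions, not part of the notebook above).

```python
import pandas as pd

# Hypothetical stand-in for the cereal data, using values from the preview above.
toy = pd.DataFrame({
    'Cereal Name': ['100%_Bran', '100%_Natural_Bran', 'All-Bran'],
    'Manufacturer': ['Nabisco', 'Quaker Oats', 'Kelloggs'],
    'Type': ['C', 'C', 'C'],
    'Calories': [70, 120, 70],
    'Display Shelf': [3, 3, 3],
    'Sugars': [6, 8, 5],
})

# Columns we treat as categorical, even though Display Shelf is stored as a number.
categorical = ['Cereal Name', 'Manufacturer', 'Type', 'Display Shelf']

# Drop them along the column axis; every remaining column is kept.
numeric = toy.drop(categorical, axis=1)

print(sorted(numeric.columns))
```

Dropping is handy when the dataframe has many numerical columns and only a few categorical ones; listing the columns to keep, as we did above, is more explicit about what goes into the model.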


####Creating the Model : The R Way

<br>

Now that we've got our data loaded, we can start with our model building. 

<br>
Sci-Kit Learn is a little different from R. In R, we usually build the model in one step, by passing the data and the parameters to a function which outputs a model. Then, once we have the model, we pass it to a predict function to output new estimates. For example: 

  * Create model with parameters and fit to data all at once using a function:
    * **Model <- glm(Y~., data = your_dataset, family = binomial())**    
<br>

  * Feed the model into another function to get our predictions:
    * **predict(Model, test)**
  
<br>

####Creating the Model...Pythonically!!! : The Sci-Kit Way

<br>

With Sci-Kit Learn, it's a little different. We create a model-object first, passing it all of the parameters, without actually giving it any data. Then, once we have this model-object, we pass it the data and fit it using an object method (as opposed to a function). Once the model has been fit, the model's predictions are stored as an object attribute. For example:

  * First, instantiate a model with parameters. Notice how the model hasn't actually touched any data yet:
    * **Model = KMeans(n_clusters = 3)**    
<br>

  * Then, fit the model with actual data using the '.fit(x)' method:
    * **Model.fit(your_dataset)**  
<br>

  * Finally, we can extract our predictions simply by calling the now fit model's attribute, '.labels\_':
    * **Model.labels\_**
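
Here are the three steps above as one small runnable sketch. The data is a hypothetical toy array of two obvious groups, and `n_init` and `random_state` are set only so the sketch behaves the same on every run; neither is needed for the pattern itself.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two well-separated groups of points.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

# Step 1: instantiate the model with parameters only; no data is involved yet.
model = KMeans(n_clusters=2, n_init=10, random_state=0)

# Step 2: fit the model to the data.
model.fit(X)

# Step 3: read each row's cluster assignment from the fitted attribute.
labels = model.labels_
print(labels)
```

The label values themselves are arbitrary identifiers: which group ends up as cluster 0 and which as cluster 1 carries no meaning.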

<br>


####But First : Preparing the Data!

<br>

Before we build and fit our model, there are a few steps we should take. 

First of all, with K-Means, Hierarchical, and Spectral Clustering, it behooves us to standardize the data first (for each variable column, subtract the column's mean from every value and then divide by the column's standard deviation). Since all of these methods depend on some distance or similarity measure, we want to make sure each variable is being treated equally, so the results won't be over-influenced by certain variables simply for being on a bigger measurement scale. 

Secondly, we want our data in matrix form. The **'.fit(x)'** methods work on numpy arrays; Sci-Kit Learn will generally convert a numeric pandas dataframe for us, but handing over an explicit array keeps things predictable. Luckily, the **'scale(x)'** function accomplishes both goals: it standardizes each column and returns a numpy array. 

<br>


In [15]:
DATA_Matrix = preprocessing.scale(DATA)
#returns a scaled version of the dataframe, in numpy array/matrix form
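
To see what `scale` is doing, we can compare it against standardizing by hand. This is a minimal sketch on a hypothetical two-column dataframe (the values are made up for illustration), assuming the population standard deviation (`ddof=0`), which is what `scale` uses.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Hypothetical toy dataframe: two columns on very different scales.
df = pd.DataFrame({'Calories': [70.0, 120.0, 110.0],
                   'Cups per Serving': [0.33, 1.00, 0.75]})

# By hand: subtract each column's mean, then divide by its standard deviation.
manual = (df - df.mean()) / df.std(ddof=0)

# scale() applies the same column-wise transformation and returns a numpy array.
scaled = preprocessing.scale(df)

print(type(scaled))
print(np.allclose(manual.values, scaled))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, every column has mean 0 and standard deviation 1, so no single variable dominates the distance calculations.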

<br>

Now we can start model building! 

<br>

In [17]:
Model = KMeans(n_clusters=3)
#creates an empty model object, with set parameters, but not fit to data yet
Model.fit(DATA_Matrix)
#fits the model using data from the data matrix. Now that the model has been fit, it will have additional object
#methods and attributes available
Model.labels_
#calling the newly available object attribute .labels_ returns a numpy array with the cluster assignments for each
#observation, matched by index 

array([1, 1, 1, 1, 0, 0, 0, 2, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 1, 2, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 2, 0, 1, 0, 1, 2, 0, 0,
       2, 1, 2, 2, 0, 1, 1, 1, 1, 2, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 2, 0,
       0, 0, 1, 1, 0], dtype=int32)
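
Besides `.labels_`, a fitted K-Means model exposes a few other useful things. The sketch below uses hypothetical toy data again (not the cereal matrix) to show `.cluster_centers_`, `.inertia_`, and the `.predict(x)` method; `random_state` is set only to make the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two small groups of points.
X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.2, 5.1]])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# cluster_centers_ holds one row per cluster and one column per variable.
print(model.cluster_centers_.shape)

# inertia_ is the sum of squared distances from each point to its nearest center.
print(model.inertia_)

# predict() assigns new observations to the nearest fitted center.
new_points = np.array([[0.1, 0.1], [4.9, 5.3]])
print(model.predict(new_points))
```

Keep in mind that K-Means starts from random centers, so without a fixed `random_state` the cluster numbering (and occasionally the clusters themselves) can differ from one run to the next.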

In [48]:
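resultKmeans = rawDATA[['Cereal Name']].copy()
#copies the name column into a new dataframe, so adding a column below leaves rawDATA untouched

resultKmeans['labels'] = Model.labels_
#attaches each cereal's cluster assignment, matched by row position

resultKmeans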

Unnamed: 0,Cereal Name,labels
0,100%_Bran,1
1,100%_Natural_Bran,1
2,All-Bran,1
3,All-Bran_with_Extra_Fiber,1
4,Almond_Delight,0
5,Apple_Cinnamon_Cheerios,0
6,Apple_Jacks,0
7,Basic_4,2
8,Bran_Chex,1
9,Bran_Flakes,1
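
Once the labels sit next to the cereal names, a quick way to get a feel for the clusters is to group by label. This sketch rebuilds just the ten rows shown above as a hypothetical stand-in dataframe (`result` is our own name), then counts and lists the members of each cluster.

```python
import pandas as pd

# Stand-in dataframe built from the ten rows of output shown above.
result = pd.DataFrame({
    'Cereal Name': ['100%_Bran', '100%_Natural_Bran', 'All-Bran',
                    'All-Bran_with_Extra_Fiber', 'Almond_Delight',
                    'Apple_Cinnamon_Cheerios', 'Apple_Jacks', 'Basic_4',
                    'Bran_Chex', 'Bran_Flakes'],
    'labels': [1, 1, 1, 1, 0, 0, 0, 2, 1, 1],
})

# Count how many cereals fall into each cluster.
sizes = result.groupby('labels').size()
print(sizes)

# List the cereal names that belong to each cluster.
for label, group in result.groupby('labels'):
    print(label, list(group['Cereal Name']))
```

In these ten rows the bran cereals land together in one cluster, which is the kind of sanity check that helps decide whether the clustering is picking up something meaningful.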
