# Agenda
---
- How to build unsupervised learning models with sklearn

# Clustering

### KMeans
- KMeans finds cluster centers, each of which is the mean of the points assigned to it.
- A point belongs to a cluster because that cluster's center is the closest of all the cluster centers to the point.
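
The two bullets above describe the two steps that KMeans alternates between. A minimal NumPy sketch of that loop is shown here for illustration only. The function name `kmeans_sketch`, the random initialisation, and the toy data are assumptions made for this example; sklearn's `KMeans` uses k-means++ initialisation and several restarts by default.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Plain Lloyd iterations: assign points to the nearest center, then re-average."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points picked at random (an assumption of this sketch).
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: Euclidean distance from every point to every center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center becomes the mean of the points assigned to it.
        # A cluster that received no points keeps its previous center.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving
        centers = new_centers
    return centers, labels

# Two well-separated toy blobs (made-up data, not the iris set).
rng = np.random.default_rng(1)
toy = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
centers, labels = kmeans_sketch(toy, k=2)
print(centers)
print(labels)
```

On data this cleanly separated, the first 20 points should share one label and the last 20 the other. Which blob gets label 0 depends on the initial centers.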

<img src="images/kmeans.gif" width="500"/>

In [8]:
%matplotlib inline
import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 200)

#### Loading the iris data into a feature array

In [11]:
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
X = iris['data']

#### Executing KMeans clustering algorithm

In [12]:
km = KMeans(n_clusters=3, max_iter=1000)
km.fit(X)

KMeans(copy_x=True, init='k-means++', max_iter=1000, n_clusters=3, n_init=10,
    n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
    verbose=0)

#### Checking the clusters

In [13]:
km.cluster_centers_

array([[ 6.85      ,  3.07368421,  5.74210526,  2.07105263],
       [ 5.006     ,  3.418     ,  1.464     ,  0.244     ],
       [ 5.9016129 ,  2.7483871 ,  4.39354839,  1.43387097]])
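
Each row of `cluster_centers_` should be the per-feature mean of the points assigned to that cluster. The sketch below checks this by recomputing the means from `labels_`. The `random_state=0` and `n_init=10` arguments are additions made here only so the sketch is repeatable; the notebook's own cell does not set a seed, and the order in which clusters are numbered can differ between runs.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris()['data']

# Seed and n_init are fixed for repeatability (an assumption of this sketch).
km = KMeans(n_clusters=3, max_iter=1000, n_init=10, random_state=0).fit(X)

# Recompute every center as the mean of the rows carrying that cluster label.
manual_centers = np.array([X[km.labels_ == j].mean(axis=0) for j in range(3)])

# Largest absolute difference between the recomputed and the fitted centers.
max_diff = np.abs(manual_centers - km.cluster_centers_).max()
print(manual_centers)
print(max_diff)
```

If the algorithm has fully converged, `max_diff` should be zero up to floating-point error.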

#### Displaying the labels

In [14]:
km.labels_

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 2, 2,
       0, 0, 0, 0, 2, 0, 2, 0, 2, 0, 0, 2, 2, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0,
       2, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 2], dtype=int32)
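
A long label array is hard to read by eye. As a rough sketch, the cluster sizes can be counted with `np.bincount`, and each label can be checked against the index of the nearest cluster center. The seed and `n_init` value are assumptions added for repeatability.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris()['data']

# Seed and n_init are fixed for repeatability (an assumption of this sketch).
km = KMeans(n_clusters=3, max_iter=1000, n_init=10, random_state=0).fit(X)

# Number of points assigned to each cluster id.
counts = np.bincount(km.labels_, minlength=3)
print(counts)

# Distance from every point to every center, then the index of the closest one.
dists = np.linalg.norm(X[:, None, :] - km.cluster_centers_[None, :, :], axis=2)
nearest = dists.argmin(axis=1)
print(np.array_equal(nearest, km.labels_))
```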

#### Comparing the labels side by side

In [15]:
df = pd.DataFrame(X, columns=iris['feature_names'])
df['target'] = iris['target']
df['kmeans_lables'] = km.labels_
df.head()

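|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target | kmeans_lables |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 | 1 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 | 1 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 | 1 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 | 1 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 | 1 |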
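
`df.head()` only shows the first five rows. One possible way to compare all 150 rows at once is a cross-tabulation of the true species against the cluster labels, sketched below with `pd.crosstab`. Cluster ids are arbitrary, so cluster 1 need not correspond to target 1. What matters is whether each species falls mostly into a single column. The seed and `n_init` value are assumptions added for repeatability.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
X = iris['data']

# Seed and n_init are fixed for repeatability (an assumption of this sketch).
km = KMeans(n_clusters=3, max_iter=1000, n_init=10, random_state=0).fit(X)

df = pd.DataFrame(X, columns=iris['feature_names'])
df['target'] = iris['target']
df['kmeans_lables'] = km.labels_  # column name kept as spelled in the notebook

# Rows: true species id, columns: cluster id, cells: number of flowers.
table = pd.crosstab(df['target'], df['kmeans_lables'])
print(table)
```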


#### Getting distance between new data point and cluster centers

In [19]:
new_data_point = np.array([[4.8, 4.3, 2, 0.9]])

# Remember: the variable 'km' holds the fitted KMeans model.
# transform() returns the distance from each row to every cluster center.
km.transform(new_data_point)

array([[ 4.59141225,  1.24015805,  3.10405311]])
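
In the output above, the smallest of the three distances is the second one, so the new point is closest to the center of cluster 1. The sketch below shows that `transform` matches a hand-computed Euclidean distance and that `predict` returns the index of the smallest distance. The seed and `n_init` value are assumptions added for repeatability, so the cluster numbering may differ from the recorded run.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris()['data']

# Seed and n_init are fixed for repeatability (an assumption of this sketch).
km = KMeans(n_clusters=3, max_iter=1000, n_init=10, random_state=0).fit(X)

new_data_point = np.array([[4.8, 4.3, 2, 0.9]])

# Distances reported by the fitted model, shape (1, 3).
dists = km.transform(new_data_point)

# The same distances computed by hand as Euclidean norms.
manual = np.linalg.norm(km.cluster_centers_ - new_data_point, axis=1)

# predict() assigns the point to the cluster with the closest center.
pred = km.predict(new_data_point)
print(dists, manual, pred)
```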