# Agenda
---
- How to model unsupervised learning using sklearn

> In machine learning, the problem of unsupervised learning is that of trying to find hidden structure in unlabeled data. Since the training set given to the learner is unlabeled, there is no error or reward signal to evaluate a potential solution. Basically, we are just finding a way to represent the data and extract as much information from it as we can.

# Clustering

### KMeans
- KMeans finds cluster centers, each of which is the mean of the points assigned to its cluster.
- A point belongs to a cluster because that cluster's center is the closest of all the cluster centers to the point.
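
The two rules above can be turned into a small iterative procedure (often called Lloyd's algorithm). The following is only a minimal NumPy sketch on made-up toy data, not scikit-learn's implementation; the names `assign` and `kmeans_sketch`, the random initialization, and the two synthetic blobs are all assumptions made for illustration.

```python
import numpy as np

def assign(X, centers):
    # Distance from every point to every center, then the index of the nearest one.
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Toy k-means loop (hypothetical helper, for illustration only)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        labels = assign(X, centers)
        # Update step: each center moves to the mean of its points
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assign(X, centers)

# Toy data: two well-separated 2-D blobs (assumed, not from the iris set).
rng = np.random.default_rng(42)
points = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centers, labels = kmeans_sketch(points, k=2)
print(centers)
```

scikit-learn's `KMeans` adds a smarter initialization (`k-means++`) and several restarts (`n_init`), which this sketch leaves out.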

<img src="images/kmeans.gif" width="500"/>

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
pd.set_option('display.max_rows', 200)

#### Loading the iris data in DataFrame

In [2]:
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
X = iris['data']
df = pd.DataFrame(X, columns=iris['feature_names'])
df['target'] = iris['target']

#### Executing KMeans clustering algorithm

In [3]:
km = KMeans(n_clusters=3, max_iter=1000)
km.fit(X)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=1000,
    n_clusters=3, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

#### Checking the clusters

In [4]:
km.cluster_centers_

array([[ 5.9016129 ,  2.7483871 ,  4.39354839,  1.43387097],
       [ 5.006     ,  3.418     ,  1.464     ,  0.244     ],
       [ 6.85      ,  3.07368421,  5.74210526,  2.07105263]])
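
As a sanity check on the claim that each center is the mean of its cluster, the centers can be recomputed by hand from the labels. This is a sketch under a few assumptions: `random_state=0` and `n_init=10` are chosen here only for reproducibility, and `tol=0.0` is set so that the fit runs until the assignments stop changing, which makes the comparison exact up to floating-point error.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris()['data']

# Assumed settings: fixed seed for reproducibility, tol=0.0 to iterate
# until the cluster assignments no longer change.
km = KMeans(n_clusters=3, n_init=10, tol=0.0, random_state=0).fit(X)

# Mean of the points assigned to each cluster, computed by hand.
manual_centers = np.array([X[km.labels_ == j].mean(axis=0) for j in range(3)])

# Largest absolute difference between the hand-computed means and the fitted centers.
max_diff = np.abs(manual_centers - km.cluster_centers_).max()
print(max_diff)
```

The printed difference should be at the level of floating-point noise. The order of the rows in `cluster_centers_` is arbitrary and can change between runs when no seed is fixed.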

#### Displaying the labels

In [5]:
km.labels_

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 0,
       2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 0, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2,
       0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 0], dtype=int32)

#### Comparing the labels side by side

In [10]:
df['kmeans_labels'] = km.labels_
df.head()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target | kmeans_labels |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 | 1 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 | 1 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 | 1 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 | 1 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 | 1 |
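
The first five rows only cover one species, and the cluster numbers are arbitrary (cluster 1 here happens to correspond to target 0). One way to compare all 150 rows at once is a cross-tabulation of the true target against the cluster label, plus a score that ignores how clusters are numbered. The sketch below assumes `random_state=0` and `n_init=10` for reproducibility and uses `adjusted_rand_score` as one possible agreement measure; the column name `cluster` is chosen here for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

iris = load_iris()
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(iris['data'])

# Rows: true species index; columns: cluster index assigned by KMeans.
table = pd.crosstab(iris['target'], km.labels_,
                    rownames=['target'], colnames=['cluster'])
print(table)

# Agreement between the two labelings, invariant to cluster renumbering
# (1.0 means identical groupings, values near 0 mean chance-level agreement).
ari = adjusted_rand_score(iris['target'], km.labels_)
print(ari)
```

A row whose counts are spread over two columns indicates a species that KMeans splits between clusters.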


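#### Computing the distance from each point to every cluster center

In [14]:
# transform() returns one row per sample and one column per cluster center
distances = km.transform(X)
distances[1:4]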

array([[ 5.11494335,  0.43816892,  3.39857426],
       [ 5.27935534,  0.41230086,  3.56935666],
       [ 5.15358977,  0.51883716,  3.42240962]])
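
Each row of the `km.transform(X)` output holds the Euclidean distance from one sample to each of the three cluster centers, so the column with the smallest value identifies the sample's cluster. The sketch below checks both statements; it refits the model with `random_state=0` and `n_init=10` (assumed here for reproducibility), and the variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris()['data']
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Distances to the centers as reported by scikit-learn: shape (150, 3).
distances = km.transform(X)

# The same distances computed by hand with NumPy broadcasting.
manual = np.linalg.norm(X[:, None, :] - km.cluster_centers_[None, :, :], axis=2)
print(np.allclose(distances, manual))

# The index of the smallest distance in each row is the predicted cluster.
nearest = distances.argmin(axis=1)
print(np.array_equal(nearest, km.predict(X)))
```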