## Unsupervised Learning

* Unsupervised learning finds patterns in data.
    * E.g. clustering customers by their purchases
    * E.g. compressing the data using purchase patterns (dimension reduction)


## Iris dataset

* Iris samples are points in 4-dimensional space
* Dimension = number of features
* Dimension too high to visualize... but unsupervised learning gives insight

## k-Means Clustering

* Find clusters of samples
* Number of clusters must be specified

In [None]:
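from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Assumption: use the Iris measurements bundled with scikit-learn
# (one row per sample, one column per feature)
samples = load_iris().data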
model = KMeans(n_clusters=3)
model.fit(samples)
labels = model.predict(samples)


The labels array contains the cluster id of each sample.

New samples can be assigned to existing clusters.

> k-means remembers the mean of each cluster (the centroid), so it assigns a new sample to the cluster whose centroid is nearest


In [None]:
new_labels = model.predict(new_samples)
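
The nearest-centroid rule can be sketched with plain NumPy. This is only an illustration of the idea, not scikit-learn's actual implementation; the function name, the centroids, and the points below are all hypothetical.

```python
import numpy as np

def assign_to_nearest_centroid(points, centroids):
    """Return, for each point, the index of the closest centroid."""
    # Pairwise differences via broadcasting:
    # shape (n_points, n_centroids, n_features)
    diffs = points[:, np.newaxis, :] - centroids[np.newaxis, :, :]
    # Squared Euclidean distance from every point to every centroid
    sq_dists = (diffs ** 2).sum(axis=2)
    # Index of the smallest distance in each row
    return sq_dists.argmin(axis=1)

# Hypothetical 2-D centroids and new points, for illustration only
example_centroids = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
example_points = np.array([[1.0, 0.5], [4.0, 6.0], [9.0, 1.0]])

print(assign_to_nearest_centroid(example_points, example_centroids))
# [0 1 2]
```

With a fitted model, model.cluster_centers_ would play the role of example_centroids.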

The data can be visualized with matplotlib.pyplot

In [None]:
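import matplotlib.pyplot as plt

# Column indices assume the standard Iris feature order:
# sepal length, sepal width, petal length, petal width
xs = samples[:,0] # sepal length
ys = samples[:,2] # petal length

plt.scatter(xs, ys, c=labels)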
plt.show()

> The centroids are available as model.cluster_centers_

In [None]:
# Import pyplot
import matplotlib.pyplot as plt

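# Assign the columns of new_points: xs and ys
# (new_points is assumed to be a 2-D array the model was fitted on,
# with labels = model.predict(new_points))
xs = new_points[:,0]
ys = new_points[:,1]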

# Make a scatter plot of xs and ys, using labels to define the colors
plt.scatter(xs,ys, c=labels, alpha=0.5)

# Assign the cluster centers: centroids
centroids = model.cluster_centers_

# Assign the columns of centroids: centroids_x, centroids_y
centroids_x = centroids[:,0]
centroids_y = centroids[:,1]

# Make a scatter plot of centroids_x and centroids_y
plt.scatter(centroids_x, centroids_y, marker="D", s=50)
plt.show()

## Evaluating the Clusters

* KMeans found 3 clusters among the iris samples
* Do the clusters correspond to the species?

### Clusters vs Species is a "cross-tabulation"

Given the species of each sample as a list of species

print(species)
["setosa","setosa", "versicolor", "virginica",...]

Now create a DataFrame with one column for the labels and another for the species

In [None]:
import pandas as pd
# species is the list shown above, with one entry per sample
df = pd.DataFrame({"labels":labels, "species":species})
print(df)

ct = pd.crosstab(df["labels"], df["species"])
print(ct)

|species|setosa|versicolor|virginica|
|---|---|---|---|
|labels |   |    |    |
| 0 | 0  |  2 | 36 |
| 1 | 50 |  0 |  0 |
| 2 | 0  | 48 | 14 |
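
What a cross-tabulation computes can be shown without pandas: it counts how often each (label, species) pair occurs. The sketch below uses only the standard library, and the six labels and species are hypothetical values chosen for illustration.

```python
from collections import Counter

# Hypothetical cluster labels and species for six samples
example_labels = [1, 1, 2, 0, 2, 0]
example_species = ["setosa", "setosa", "versicolor",
                   "virginica", "versicolor", "virginica"]

# Count how often each (label, species) pair occurs
pair_counts = Counter(zip(example_labels, example_species))

# Print one line per pair that actually occurs, sorted by label
for (label, name), count in sorted(pair_counts.items()):
    print(label, name, count)
```

Each cell of the table above is one such count; pairs that never occur show up as 0.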


### What if there is no species list?

* Using only samples, and their cluster labels
* A good clustering has tight clusters
* Samples in each cluster bunched together

### Inertia measures clustering quality

* Measures how spread out the clusters are (lower is better)
* Sum of squared distances from each sample to the centroid of its cluster

In [None]:
print(model.inertia_)
# 78.9408414261
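
scikit-learn defines inertia as the sum of squared distances from each sample to its closest cluster center. A minimal NumPy sketch of that computation follows; the function name and the toy data are hypothetical, chosen so the result is easy to check by hand.

```python
import numpy as np

def compute_inertia(points, point_labels, centers):
    """Sum of squared distances from each point to its assigned center."""
    # centers[point_labels] picks, for every point, the center of its cluster
    diffs = points - centers[point_labels]
    return float((diffs ** 2).sum())

# Hypothetical data: two clusters of two points each
toy_points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 4.0]])
toy_labels = np.array([0, 0, 1, 1])
toy_centers = np.array([[0.0, 1.0], [10.0, 2.0]])

print(compute_inertia(toy_points, toy_labels, toy_centers))
# 10.0  (1 + 1 + 4 + 4)
```

For a fitted model, compute_inertia(samples, model.labels_, model.cluster_centers_) should agree with model.inertia_ up to floating-point error, assuming samples is a NumPy array.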

The plot below shows how inertia changes with the number of clusters:

![](../images/clustering_inertia.png)

With 3 clusters the inertia is already low, but it keeps decreasing as more clusters are added,
**so what is the best number of clusters?**

> A good clustering has tight clusters (so low inertia) BUT not too many clusters!

> Choose an elbow in the inertia plot: the point where inertia starts to decrease more slowly

In [None]:
ks = range(1, 6)
inertias = []

for k in ks:
    # Create a KMeans instance with k clusters: model
    model = KMeans(n_clusters=k)

    # Fit model to samples
    model.fit(samples)

    # Append the inertia to the list of inertias
    inertias.append(model.inertia_)

# Plot ks vs inertias
plt.plot(ks, inertias, '-o')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.xticks(ks)
plt.show()


## Another example

### How many clusters of grain?

[Dataset](https://archive.ics.uci.edu/ml/datasets/seeds)

In [None]:
ks = range(1, 6)
inertias = []

for k in ks:
    # Create a KMeans instance with k clusters: model
    model = KMeans(n_clusters=k)

    # Fit model to samples
    model.fit(samples)

    # Append the inertia to the list of inertias
    inertias.append(model.inertia_)

# Plot ks vs inertias
plt.plot(ks, inertias, '-o')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.xticks(ks)
plt.show()

# Create a KMeans model with 3 clusters: model
model = KMeans(n_clusters=3)

# Use fit_predict to fit model and obtain cluster labels: labels
labels = model.fit_predict(samples)

# Create a DataFrame with labels and varieties as columns: df
# (varieties is assumed to hold the grain variety of each sample)
df = pd.DataFrame({'labels': labels, 'varieties': varieties})

# Create crosstab: ct
ct = pd.crosstab(df["labels"], df["varieties"])

# Display ct
print(ct)

