![@mikegchambers](../../images/header.png)

# K-means Clustering

In this notebook, we explore K-means clustering using scikit-learn.

![Clusters](clusters.png)

In [None]:
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

import numpy as np
import matplotlib.pyplot as plt

# Use the ggplot style for all plots in this notebook.
plt.style.use('ggplot')

# Data

In [None]:
K = 4
X, y = make_blobs(n_samples=100, centers=K, cluster_std=1.5)

In [None]:
plt.scatter(X[:,0], X[:,1], c=y)

# Model 

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

**n_clusters**: The number of clusters to form as well as the number of centroids to generate.

**init**: Method for initialization (for example `'k-means++'` or `'random'`).

**n_init**: Number of times to run with different centroid seeds. 

**max_iter**: Maximum number of iterations of the k-means algorithm for a single run.

In [None]:
model = KMeans(n_clusters=K, init='random', n_init=5, max_iter=50)

Notice that when we fit the model we pass in only `X`, not `y`. K-means is unsupervised, so it never sees the true labels.

In [None]:
model.fit(X)
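
As a rough sketch of how the parameters above show up after fitting, the snippet below fits the same kind of model on a freshly generated dataset and inspects two fitted attributes, `n_iter_` and `inertia_`. The `random_state=0` arguments and the `demo_` variable names are assumptions added here for reproducibility; they are not part of the cells in this notebook.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumed for this sketch: fixed seeds so the numbers are repeatable.
demo_X, _ = make_blobs(n_samples=100, centers=4, cluster_std=1.5, random_state=0)

demo_model = KMeans(n_clusters=4, init='random', n_init=5, max_iter=50, random_state=0)
demo_model.fit(demo_X)  # only the features are passed in; no labels

# n_iter_ reports the iterations run for the best of the n_init attempts,
# so it cannot exceed max_iter.
print('iterations used:', demo_model.n_iter_)

# inertia_ is the sum of squared distances of samples to their closest center.
print('inertia:', demo_model.inertia_)
```

If `n_iter_` comes out equal to `max_iter`, the run may have stopped before converging, which is a hint to raise `max_iter`.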

Let's get the centers of the clusters that K-means found.

In [None]:
clusters = model.cluster_centers_

Let's get the labels that K-means has decided on.

In [None]:
labels = model.labels_

This time we will make two plots, side by side. We will compare the original generated data with the clusters 'discovered' by the algorithm.

In [None]:
plt.subplots(nrows=1, ncols=2, figsize=(10, 4))

plt.tight_layout()

plt.subplot(1, 2, 1)
plt.title('Original Data')
plt.scatter(X[:,0], X[:,1], c=y)

plt.subplot(1, 2, 2)
plt.title('K-means Data')
plt.scatter(X[:,0], X[:,1], c=labels)
plt.scatter(clusters[:,0], clusters[:,1], marker="x", s=400, linewidth=1, c="black", zorder=1000)

plt.show()
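
The colors in the two plots need not match, because the cluster numbers that K-means assigns are arbitrary. One way to compare the two labelings without relying on colors is the adjusted Rand index from `sklearn.metrics`, which ignores how the clusters are numbered. The following is a minimal sketch on a freshly generated dataset; the `random_state=0` seeds and the `demo_` names are assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Assumed for this sketch: fixed seeds so the result is repeatable.
demo_X, demo_y = make_blobs(n_samples=100, centers=4, cluster_std=1.5, random_state=0)
demo_labels = KMeans(n_clusters=4, init='random', n_init=5,
                     max_iter=50, random_state=0).fit_predict(demo_X)

# Score of 1.0 means the two groupings are identical up to renaming.
ari = adjusted_rand_score(demo_y, demo_labels)

# Renumber the clusters (0->1, 1->2, 2->3, 3->0): the grouping is unchanged,
# so the score should be unchanged too.
renamed = np.array([1, 2, 3, 0])[demo_labels]
ari_renamed = adjusted_rand_score(demo_y, renamed)

print('adjusted Rand index:', ari)
print('after renumbering:  ', ari_renamed)
```

This comparison is only possible here because `make_blobs` returns the true labels; with real unlabeled data there is nothing to compare against.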

# Choosing a value for K

Let's explore how to find a good value for K.

In this loop, we try a range of K values from 1 to 9. Each time we store one minus the model's `inertia_` (the "sum of squared distances of samples to their closest cluster center"), so that a larger value means tighter clusters. This gives us a way to track how the variation within the clusters changes as K grows.

In [None]:
mi = []

for i in range(1,10):
    m = KMeans(n_clusters=i, init='random', n_init=5, max_iter=5)
    m.fit(X)
    mi.append([i,1-m.inertia_])
    
mi = np.array(mi)
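
To make the quantity behind `inertia_` concrete, here is a small sketch that recomputes it by hand from the fitted centers and labels and compares it with the value scikit-learn reports. It uses its own generated dataset; the `random_state=0` seeds and the `demo_` names are assumptions for this example only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumed for this sketch: fixed seeds so the result is repeatable.
demo_X, _ = make_blobs(n_samples=100, centers=4, cluster_std=1.5, random_state=0)
demo_model = KMeans(n_clusters=4, init='random', n_init=5,
                    max_iter=50, random_state=0).fit(demo_X)

# Look up, for every sample, the center of the cluster it was assigned to.
assigned_centers = demo_model.cluster_centers_[demo_model.labels_]

# Sum of squared distances of samples to their assigned cluster center.
manual_inertia = np.sum((demo_X - assigned_centers) ** 2)

print('manual: ', manual_inertia)
print('sklearn:', demo_model.inertia_)
```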

Now let's plot what we found, and look for the elbow: the point where adding more clusters stops improving the score by much. This point will represent a good choice for the value of K.

In [None]:
axes = plt.axes()

axes.plot(mi[:,0],mi[:,1])

plt.title('Elbow Plot')
axes.set_ylabel('Reduction in Variation')
axes.set_yticks([])
axes.set_xlabel('Number of clusters (K)')

plt.show()

# Viewing the cluster boundaries

Let's use a 'brute force' method, predicting the cluster for every point on a dense grid, to visualize the boundaries of the clusters in our original model.

In [None]:
axes = plt.axes()

plt.scatter(X[:,0], X[:,1], c='white')
plt.scatter(clusters[:,0], clusters[:,1], marker="x", s=400, linewidth=1, c="black", zorder=1000)

# Create a grid of points to evaluate:
xlim = axes.get_xlim()
ylim = axes.get_ylim()
xx = np.linspace(xlim[0], xlim[1], 200)
yy = np.linspace(ylim[0], ylim[1], 200)
YY, XX = np.meshgrid(yy, xx)
xy = np.vstack([XX.ravel(), YY.ravel()]).T

Z = model.predict(xy)

axes.scatter(xy[:,0],xy[:,1],marker="s",c=Z,s=5, zorder=-10)

plt.show()
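
The boundaries in this plot come from a simple rule: `predict` assigns each point to the cluster whose center is closest. The sketch below checks that rule by computing the nearest center by hand for some random points and comparing with `predict`. The dataset, the `random_state=0` seeds, the 500 random points, and the `demo_` names are all assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumed for this sketch: fixed seeds so the result is repeatable.
demo_X, _ = make_blobs(n_samples=100, centers=4, cluster_std=1.5, random_state=0)
demo_model = KMeans(n_clusters=4, init='random', n_init=5,
                    max_iter=50, random_state=0).fit(demo_X)

# Random query points spread over the bounding box of the data.
rng = np.random.default_rng(0)
points = rng.uniform(demo_X.min(axis=0), demo_X.max(axis=0), size=(500, 2))

# Distance from every point to every center: shape (500, 4).
diffs = points[:, None, :] - demo_model.cluster_centers_[None, :, :]
dists = np.linalg.norm(diffs, axis=2)

# The index of the nearest center is the predicted cluster.
manual = dists.argmin(axis=1)
agreement = np.mean(manual == demo_model.predict(points))

print('fraction agreeing with predict():', agreement)
```

Because every point goes to its nearest center, the boundary between two neighboring clusters is a straight line, which is why the regions in the plot have straight edges.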