# K-Means clustering

In this notebook, we look at how to do K-Means clustering in Python using Scikit-learn and a few other modules. Let us import the usual modules as well as the `MinMaxScaler` and the `KMeans` model from Scikit-learn:

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

from sklearn.preprocessing import MinMaxScaler

from sklearn.cluster import KMeans

As example data, we will use an age and income dataset from the book ["Introduction to R for Business Intelligence"](https://jgendron.github.io/com.packtpub.intro.r.bi/), Packt Publishing Ltd., 2016, by Jay Gendron. The dataset is also on Moodle and can be loaded with:

In [None]:
data = pd.read_csv("Ch5_age_income_data.csv")
data.head()

For our clustering, we will select only the `age` and the `income` variables:

In [None]:
X = data[['age', 'income']]

Here is an example of how to run K-Means clustering using `KMeans` with $K=3$:

In [None]:
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)

After calling the **fit()** function, the clustering is done: K-Means is *unsupervised* learning, so no target labels are needed. The clustering result is stored in an array called **labels_**, which holds one cluster label for each data point in X.

In [None]:
kmeans.labels_

Each cluster's center (its centroid) is also stored, in another array called **cluster_centers_**:

In [None]:
kmeans.cluster_centers_
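
To make the relationship between **labels_** and **cluster_centers_** concrete, here is a minimal sketch on synthetic data rather than the age-income dataset. The names `X_demo` and `km_demo`, the blob positions, and the `n_init` and `random_state` settings are assumptions chosen only to make the example reproducible. It checks that each label is the index of the centroid nearest to that point.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data (an assumption): three well-separated 2-D blobs.
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in blob_centers])

km_demo = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# Euclidean distance from every point to every centroid, shape (90, 3).
dists = np.linalg.norm(
    X_demo[:, None, :] - km_demo.cluster_centers_[None, :, :], axis=2
)
nearest = dists.argmin(axis=1)

print(km_demo.labels_.shape)           # one label per data point
print(km_demo.cluster_centers_.shape)  # one row per cluster, one column per feature
print(np.array_equal(nearest, km_demo.labels_))
```

The cluster numbers themselves (0, 1, 2) carry no meaning and can change from run to run when no `random_state` is set.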

Let us now visualize the result of the clustering. To do this easily with Seaborn, we first add a column with the assigned cluster to the dataset:

In [None]:
data["3MeansCluster"] = kmeans.labels_

We can now plot the points of X together with their associated cluster and the cluster centroids:

In [None]:
sns.scatterplot(data = data, x = "age", y = "income", hue = "3MeansCluster")
plt.scatter(x = kmeans.cluster_centers_[:,0], y = kmeans.cluster_centers_[:,1], color='blue', s = 100)
plt.title("3-Means clustering of the age-income data")
plt.show()

From the plot above we can see that the clusters are not really based on both the Age and Income columns. They are determined almost entirely by Income, which has much larger numeric values and therefore dominates the distance calculation. We therefore need to scale the data before clustering.
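
A small worked example may help show why one column can dominate. The three people below are hypothetical (the values are made up, not taken from the dataset) and are written as (age, income) pairs. The sketch computes the Euclidean distances that K-Means would use on unscaled data.

```python
import numpy as np

# Hypothetical people as (age, income); the numbers are invented for illustration.
a = np.array([25.0, 50_000.0])
b = np.array([65.0, 50_500.0])  # 40 years older than a, income 500 higher
c = np.array([25.0, 52_000.0])  # same age as a, income 2,000 higher

d_ab = np.linalg.norm(a - b)  # sqrt(40**2 + 500**2)
d_ac = np.linalg.norm(a - c)  # sqrt(0**2 + 2000**2)

print(d_ab, d_ac)
print(d_ab < d_ac)
```

On the raw scale, `a` ends up closer to `b` than to `c`, even though `a` and `b` differ by 40 years of age: the income difference contributes far more to the squared distance than the age difference does.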

### Data Scaling

In [None]:
minMaxScaler = MinMaxScaler()
X_scaled_mm = pd.DataFrame(minMaxScaler.fit_transform(X), columns=X.columns)
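
Min-max scaling maps each column to the range [0, 1] by computing (x - min) / (max - min) per column. The following sketch, which uses a small made-up array `X_toy` (an assumption, not the notebook's data), checks that this formula reproduces what `MinMaxScaler` returns with its default settings.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up (age, income) rows for illustration only.
X_toy = np.array([[20.0, 30_000.0],
                  [40.0, 60_000.0],
                  [60.0, 90_000.0]])

scaled_sklearn = MinMaxScaler().fit_transform(X_toy)

# The same transformation written out by hand, column by column.
col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)
scaled_manual = (X_toy - col_min) / (col_max - col_min)

print(scaled_sklearn)
print(np.allclose(scaled_sklearn, scaled_manual))
```

After scaling, both columns span the same range, so neither dominates the distance calculation purely because of its units.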

We can now build the 3-means clustering model again:

In [None]:
kmeans_scaled = KMeans(n_clusters=3)
kmeans_scaled.fit(X_scaled_mm)

Let us also visualize the clustering of the scaled data:

In [None]:
data_scaled = X_scaled_mm.copy()
data_scaled["3MeansClusterScaled"] = kmeans_scaled.labels_

sns.scatterplot(data = data_scaled, x = "age", y = "income", hue = "3MeansClusterScaled")
plt.scatter(x = kmeans_scaled.cluster_centers_[:,0], y = kmeans_scaled.cluster_centers_[:,1], color='blue', s = 100)
plt.title("3-Means clustering of the age-income data with Min-Max scaling")
plt.show()

If we want a plot of the points on their original (unscaled) axes, we can simply add the clustering information to the original dataset `data` as before:

In [None]:
data["3MeansClusterScaled"] = kmeans_scaled.labels_

However, our cluster centroids are in scaled coordinates, so we need to transform them back to the original scale before plotting them:

In [None]:
centroids = minMaxScaler.inverse_transform(kmeans_scaled.cluster_centers_)
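
As a quick check of what `inverse_transform` does, here is a sketch on a made-up array `X_toy` with a hypothetical scaled centroid (both are assumptions for illustration). It undoes the min-max formula by computing x_scaled * (max - min) + min for each column.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up (age, income) rows for illustration only.
X_toy = np.array([[20.0, 30_000.0],
                  [40.0, 60_000.0],
                  [60.0, 90_000.0]])
scaler = MinMaxScaler().fit(X_toy)

# A hypothetical centroid expressed in scaled coordinates.
centroid_scaled = np.array([[0.25, 0.75]])

centroid_original = scaler.inverse_transform(centroid_scaled)

# The same back-transformation written out by hand.
col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)
centroid_manual = centroid_scaled * (col_max - col_min) + col_min

print(centroid_original)
print(np.allclose(centroid_original, centroid_manual))
```

The scaler used for the back-transformation must be the same fitted object that scaled the data, since it holds the per-column minimum and maximum.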

We can now plot the unscaled data with the clusters generated from the scaled data:

In [None]:
sns.scatterplot(data = data, x = "age", y = "income", hue = "3MeansClusterScaled")
plt.scatter(x = centroids[:,0], y = centroids[:,1], color='blue', s = 100)
plt.title("3-Means clustering of the age-income data (with scaling used for the clustering)")
plt.show()

### The Elbow Method

We can try different K values and plot the SSE for each of them. From the plot, we can choose the elbow point, i.e., the K beyond which adding more clusters gives only a small further decrease in SSE.

We generate a series of K-Means models by varying K from 1 to 20. A model's attribute **inertia_** stores the overall SSE (sum of squared errors) for the model.

In [None]:
errors = []
K = range(1, 21)  # K = 1, ..., 20 (range excludes its upper bound)
for k in K:
    kmeanModel = KMeans(n_clusters=k)
    kmeanModel.fit(X_scaled_mm)
    errors.append(kmeanModel.inertia_)
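
To see what **inertia_** measures, the sketch below recomputes it by hand on random stand-in data (the array `X_demo`, the choice of 4 clusters, and the `n_init` and `random_state` settings are assumptions for illustration). It sums the squared Euclidean distance from each point to the centroid of its assigned cluster.

```python
import numpy as np
from sklearn.cluster import KMeans

# Random stand-in data in the unit square (not the notebook's dataset).
rng = np.random.default_rng(1)
X_demo = rng.random((100, 2))

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_demo)

# Look up, for every point, the centroid of the cluster it was assigned to.
assigned_centers = km.cluster_centers_[km.labels_]

# Sum of squared distances between each point and its assigned centroid.
sse_manual = ((X_demo - assigned_centers) ** 2).sum()

print(km.inertia_, sse_manual)
print(np.isclose(km.inertia_, sse_manual))
```

Because every added cluster can only keep or reduce this sum, the SSE curve generally falls as K grows, which is why we look for an elbow rather than a minimum.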

We plot the (K, SSE) pairs for all Ks:

In [None]:
plt.figure(figsize=(10, 8))
plt.plot(K, errors, 'bx-')
plt.xlabel('k')
plt.ylabel('SSE')
plt.title('The Elbow Method showing the optimal k')
plt.show()

Let's take a closer look by printing, for each K, how much the SSE decreases when going from K-1 to K clusters:

In [None]:
for i in range(1, len(errors)):
    print('K={0}: {1}'.format(i+1, errors[i-1] - errors[i]))
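
The same differences can also be computed with `np.diff`. The sketch below uses a short list of hypothetical SSE values (made up for illustration, not the output of the models above) and additionally reports the decrease relative to the previous SSE, which can make the flattening of the curve easier to spot.

```python
import numpy as np

# Hypothetical SSE values for K = 1, ..., 5 (invented numbers).
sse = np.array([100.0, 40.0, 15.0, 12.0, 10.5])

# Absolute decrease when going from K-1 to K clusters.
decrease = -np.diff(sse)

# Decrease as a fraction of the SSE at K-1.
relative = decrease / sse[:-1]

for k, (d, r) in enumerate(zip(decrease, relative), start=2):
    print('K={0}: decrease={1:.2f}, relative={2:.1%}'.format(k, d, r))
```

In this made-up series the decrease drops sharply after K=3, which is the kind of pattern the elbow method looks for.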

In this case, it looks like we should choose K=5, 6, or 7.

Let us try with K=7, for instance:

In [None]:
kmeans_scaled7 = KMeans(n_clusters=7)
kmeans_scaled7.fit(X_scaled_mm)

data["7MeansClusterScaled"] = kmeans_scaled7.labels_
centroids7 = minMaxScaler.inverse_transform(kmeans_scaled7.cluster_centers_)

In [None]:
sns.scatterplot(data = data, x = "age", y = "income", hue = "7MeansClusterScaled", palette="deep")
plt.scatter(x = centroids7[:,0], y = centroids7[:,1], color='blue', s = 100)
plt.title("7-Means clustering of the age-income data (with scaling used for the clustering)")
plt.show()