<a href="https://colab.research.google.com/github/henthornlab/ProcessAnalytics/blob/master/RODataKmeans_Elbow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Elbow Method in K-Means Algorithm**
David B. Henthorn, Dept. of Chemical Engineering,
Rose-Hulman Institute of Technology

<img style="float: right;" src="https://raw.githubusercontent.com/henthornlab/ProcessAnalytics/master/RHITlogo.png">

This Python example shows how the "elbow method" can help us decide how many clusters to use in a k-means algorithm. The elbow method is rudimentary but often effective for simpler datasets.


In [0]:
import pandas as pd
import plotly.express as px
import numpy as np

We are using scikit-learn for the k-means algorithm. More details are at: https://scikit-learn.org/

To calculate the distance from each data point to its cluster center, we use cdist from scipy.

In [0]:
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist

I've hosted some of our reverse osmosis data in an Excel file on my GitHub page. The cell below reads it directly from that link:

In [0]:
df = pd.read_excel('https://raw.githubusercontent.com/henthornlab/ProcessAnalytics/master/ROSampleRuns.xlsx')

In this next code block I ask scikit-learn to run the k-means clustering algorithm using between 1 and 20 clusters. We then use the cdist function from scipy to compute the average intra-cluster distance (the distance from each data point to its nearest cluster center, averaged over all data points). We should see that distance drop as we add more cluster centers.

In [0]:
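intraCluster = []
models = range(1, 21)  # k = 1 through 20 (range excludes its upper bound)
for k in models:
    # Fit k-means once for this number of clusters
    kmeanModel = KMeans(n_clusters=k).fit(df)
    # Mean distance from each data point to its nearest cluster center
    intraCluster.append(sum(np.min(cdist(df, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / df.shape[0])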
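
To make the distance calculation in the loop concrete, here is a minimal sketch on a toy dataset. The four points and two cluster centers are made-up values chosen so the arithmetic is easy to check by hand; they are not taken from the reverse osmosis data.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Four toy points and two hypothetical cluster centers (illustrative values only)
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 4.0]])
centers = np.array([[0.0, 1.0], [10.0, 2.0]])

# cdist returns an (n_points, n_centers) matrix of pairwise Euclidean distances
distances = cdist(points, centers, 'euclidean')

# For each point, keep only the distance to its nearest center
nearest = np.min(distances, axis=1)

# Average over all points: the same quantity appended to intraCluster in the loop
mean_distance = nearest.sum() / points.shape[0]

print(nearest)        # [1. 1. 2. 2.]
print(mean_distance)  # 1.5
```

Taking the minimum across each row is what assigns every point to its closest center, so no explicit cluster labels are needed for this calculation.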

We'll plot the intra-cluster distance vs number of clusters and look for the "elbow" in the plot.

In [0]:
fig = px.scatter(x=models, y=intraCluster)
fig.update_layout(
    title="Elbow Method for k-means clustering",
    xaxis_title='Number of Clusters',
    yaxis_title="Mean intra-cluster distance",
    font=dict(
        family="Arial",
        size=16,
        color="#7f7f7f"
    )
)
fig.show()

We see that the intra-cluster distance decreases substantially as we go up to five clusters, and that adding more clusters beyond that point brings much smaller improvements. ***We will therefore focus on five clusters for this dataset.***
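
One way to build intuition for reading an elbow plot is to try the same idea on synthetic data where the number of clusters is known in advance. The sketch below is an illustration under stated assumptions, not part of the reverse osmosis analysis: it uses `make_blobs` from scikit-learn with three made-up, well-separated centers, and it reads the fitted model's `inertia_` attribute (the sum of squared distances to the nearest center), which is a slightly different measure from the mean distance used above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three known clusters (the centers are hypothetical values)
true_centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X, _ = make_blobs(n_samples=300, centers=true_centers, cluster_std=1.0, random_state=0)

# Fit k-means for k = 1 through 6 and record the inertia of each fit
ks = list(range(1, 7))
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

# Improvement gained by each additional cluster
drops = [inertias[i] - inertias[i + 1] for i in range(len(inertias) - 1)]
for k, drop in zip(ks[1:], drops):
    print(f"going to k={k}: inertia drops by {drop:.1f}")
```

With data like this, the drop when going from two to three clusters should be far larger than the drop from three to four, which is the "elbow" we look for in the plot.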