# K-Means Clustering

In [None]:
import numpy as np
import pandas as pd
import matplotlib as mp
import matplotlib.pyplot as plt
%matplotlib inline

import random
random.seed(0)
# scikit-learn draws from NumPy's global RNG when random_state is None
np.random.seed(0)

K-Means Clustering (kclust) fits a k-means model for each number of clusters in a range and keeps the best one. When true labels are supplied, the best model is the one with the highest adjusted Rand index (ARI); otherwise it is the one with the highest silhouette score.
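
The selection loop described above can be sketched with scikit-learn alone. This is a minimal illustration, not graspologic's implementation: the helper name `sweep_kmeans` and the toy data from `make_blobs` are assumptions made for the example. Higher silhouette scores indicate better-separated clusters, so the sketch keeps the number of clusters with the highest score. The sweep starts at 2 because the silhouette score is undefined for a single cluster.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def sweep_kmeans(x, max_clusters=10, random_state=0):
    """Fit KMeans for k = 2..max_clusters and return (best_k, scores)."""
    scores = {}
    for k in range(2, max_clusters + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = km.fit_predict(x)
        scores[k] = silhouette_score(x, labels)
    # Keep the k whose silhouette score is largest.
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Toy data (an assumption for illustration): three well-separated blobs.
x_demo, _ = make_blobs(
    n_samples=300,
    centers=[[-10, -10], [0, 0], [10, 10]],
    cluster_std=0.5,
    random_state=0,
)
best_k, scores = sweep_kmeans(x_demo, max_clusters=6)
print(best_k)
```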

K-means clustering separates samples into a chosen number of groups of roughly equal variance by minimizing the sum, over all samples, of the squared distance from each sample to its nearest cluster mean. Mathematically, the algorithm chooses the cluster means to minimize the following quantity:

$\sum_{i=1}^{N} \min_{c \in C} ||x_{i} - \mu_{c}||^{2}$

where $N$ is the number of samples, $\mu_{c}$ is the mean of cluster $c$, $C$ is the set of all clusters, $x_{i}$ is the $i$-th sample point, and $|| \cdot ||$ is the Euclidean norm.
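
To make the objective concrete, the sketch below evaluates it directly with NumPy and compares the result with the `inertia_` attribute that scikit-learn's `KMeans` reports. The helper name `kmeans_objective` and the two-blob toy data are assumptions made for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans


def kmeans_objective(x, centers):
    """Sum over samples of the squared distance to the nearest center."""
    # Pairwise squared distances, shape (n_samples, n_centers).
    sq_dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # For each sample keep the nearest center, then sum over samples.
    return sq_dists.min(axis=1).sum()


# Toy data (an assumption for illustration): two Gaussian blobs in 2-D.
rng = np.random.default_rng(0)
x_demo = np.vstack([
    rng.normal(-5, 1, size=(50, 2)),
    rng.normal(5, 1, size=(50, 2)),
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(x_demo)
objective = kmeans_objective(x_demo, km.cluster_centers_)
print(objective, km.inertia_)
```

The two printed values should agree up to floating-point error, since `inertia_` is the same sum of squared distances to the closest cluster center.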



Let's look at an example using the synthetic data from this [paper](https://arxiv.org/abs/1909.02688).

In [None]:
from graspologic.cluster.kclust import KMeansCluster

# Synthetic data
df = pd.read_csv('https://raw.githubusercontent.com/tathey1/autogmm/master/data/synthetic.csv')
arr = df.to_numpy()

x_synthetic = arr[:,1:]

c_true_synthetic = arr[:,0]

# Fit model
model_synthetic = KMeansCluster(max_clusters=20)

c_hat_kmeans_synthetic = model_synthetic.fit_predict(x_synthetic,c_true_synthetic)

In [None]:
# We can review each model tested and observe the best one generated
model = model_synthetic.model_
k = model_synthetic.n_clusters_
silhouettes = model_synthetic.silhouette_
ari = model_synthetic.ari_

print('Best model: ' + str(model))
print('\nBest k: ' + str(k))
print('\nSilhouettes: ' + str(silhouettes))
print('\nARIs: ' + str(ari))

Now let's look at the Drosophila mushroom body dataset.

In [None]:
from graspologic.cluster.kclust import KMeansCluster

# Drosophila Data
df_x = pd.read_csv('https://raw.githubusercontent.com/tathey1/autogmm/master/data/embedded_right.csv')
arr_x = df_x.to_numpy()[1:,:]

x_drosophila = arr_x

df_c = pd.read_csv('https://raw.githubusercontent.com/tathey1/autogmm/master/data/classes.csv')
arr_c = df_c.to_numpy()[1:,:]
arr_c = arr_c.reshape(len(arr_c),)

c_true_drosophila = arr_c

# Fit model
model_drosophila = KMeansCluster(max_clusters=20)

c_hat_kmeans_drosophila = model_drosophila.fit_predict(x_drosophila,c_true_drosophila)

We can compare our method to the existing implementation of k-means clustering in scikit-learn. Our method builds on scikit-learn's `KMeans` by automatically searching over the number of clusters instead of requiring the user to fix it in advance. Below, we compare each selected kclust model with the model that scikit-learn's `KMeans` produces with its default settings.
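
A comparison like this can also be made numerically with the adjusted Rand index. The sketch below is a self-contained illustration on toy data from `make_blobs` (an assumption, not the datasets used in this notebook). It contrasts scikit-learn's default `KMeans`, which uses `n_clusters=8`, with a model given the number of clusters that matches the data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Toy data (an assumption for illustration): three well-separated blobs.
x_demo, y_demo = make_blobs(
    n_samples=300,
    centers=[[-10, -10], [0, 0], [10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Default KMeans: n_clusters is 8 unless specified.
default_labels = KMeans(n_init=10, random_state=0).fit_predict(x_demo)
# KMeans with the number of clusters matching the toy data.
tuned_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(x_demo)

# ARI is 1.0 for a perfect match and is unaffected by how clusters are numbered.
ari_default = adjusted_rand_score(y_demo, default_labels)
ari_tuned = adjusted_rand_score(y_demo, tuned_labels)
print(ari_default, ari_tuned)
```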

In [None]:
from sklearn.cluster import KMeans

# Default Sklearn KMeans
g_synthetic_default = KMeans()
g_drosophila_default = KMeans()

c_hat_default_synthetic = g_synthetic_default.fit_predict(x_synthetic)
c_hat_default_drosophila = g_drosophila_default.fit_predict(x_drosophila)

Now let's compare the synthetic data results:

In [None]:
# Plotting Synthetic clusters
import seaborn as sns
sns.set()
sns.set_context("talk", font_scale=1.10)
sns.set_style("darkgrid", {"axes.facecolor": ".9"})

# Labels
c_list = ['red', 'green', 'blue','orange','purple','yellow','gray']

# Figure
fig = plt.figure(figsize=(36,9))

# Synthetic Kmeans Model
plt.subplot(1, 3, (2,2))
plt.title('Synthetic Kclust Clustering',fontsize=24,fontweight='bold')
max_c = int(np.max(c_hat_kmeans_synthetic))
plt.scatter(x_synthetic[:,0],x_synthetic[:,1],c=c_hat_kmeans_synthetic,cmap=mp.colors.ListedColormap(c_list[0:max_c+1]))
plt.xlabel('First Dimension',fontsize=24)
plt.xticks(fontsize=18)
plt.yticks(fontsize=18)

# Synthetic Default SKlearn Model
plt.subplot(1, 3, (3,3))
plt.title('Synthetic Default SKlearn Clustering',fontsize=24,fontweight='bold')
max_c = int(np.max(c_hat_default_synthetic))
plt.scatter(x_synthetic[:,0],x_synthetic[:,1],c=c_hat_default_synthetic,cmap=mp.colors.ListedColormap(c_list[0:max_c+1]))
plt.xticks(fontsize=18)
plt.yticks(fontsize=18)

# Synthetic True Clustering
plt.subplot(1, 3, (1,1))
plt.title('True Clustering',fontsize=24,fontweight='bold')
max_c = int(np.max(c_true_synthetic))
plt.scatter(x_synthetic[:,0],x_synthetic[:,1],c=c_true_synthetic,cmap=mp.colors.ListedColormap(c_list[0:max_c+1]))
plt.ylabel('Second Dimension',fontsize=24)
plt.xticks(fontsize=18)
plt.yticks(fontsize=18)

plt.show()

We can see the perfect fit achieved by the optimized kclust model in the middle plot.
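
One caveat when reading these panels: cluster labels are arbitrary, so two identical partitions can be drawn with different colors. A possible way to make the colors comparable is to relabel the predicted clusters so they best match the true ones before plotting. The sketch below does this with Hungarian matching from SciPy. The helper name `align_labels` is hypothetical, and the sketch assumes there are no more predicted clusters than true clusters.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def align_labels(y_true, y_pred):
    """Relabel y_pred so that its cluster ids best match those in y_true."""
    true_ids, true_idx = np.unique(y_true, return_inverse=True)
    pred_ids, pred_idx = np.unique(y_pred, return_inverse=True)
    if len(pred_ids) > len(true_ids):
        raise ValueError("sketch assumes no more predicted than true clusters")
    # counts[p, t] = number of samples in predicted cluster p and true cluster t.
    counts = np.zeros((len(pred_ids), len(true_ids)), dtype=int)
    np.add.at(counts, (pred_idx, true_idx), 1)
    # One-to-one matching that maximizes the total overlap.
    rows, cols = linear_sum_assignment(counts, maximize=True)
    mapping = {pred_ids[r]: true_ids[c] for r, c in zip(rows, cols)}
    return np.array([mapping[p] for p in y_pred])


y_true_demo = np.array([0, 0, 1, 1, 2, 2])
y_pred_demo = np.array([2, 2, 0, 0, 1, 1])  # same partition, permuted ids
aligned = align_labels(y_true_demo, y_pred_demo)
print(aligned)
```

Passing the aligned labels to `plt.scatter` instead of the raw predictions would then give matching clusters the same color across panels.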

Now let's do the same for the models fit to the Drosophila data:

In [None]:
# Plotting Drosophila clusters
import seaborn as sns
sns.set()
sns.set_context("talk", font_scale=1.10)
sns.set_style("darkgrid", {"axes.facecolor": ".9"})

# Labels
c_list = ['red', 'green', 'blue','orange','purple','yellow','gray']

# Figure
fig = plt.figure(figsize=(36,9))

# Drosophila Kmeans Model
plt.subplot(1, 3, (2,2))
plt.title('Drosophila Kclust Clustering',fontsize=24,fontweight='bold')
max_c = int(np.max(c_hat_kmeans_drosophila))
plt.scatter(x_drosophila[:,0],x_drosophila[:,1],c=c_hat_kmeans_drosophila,cmap=mp.colors.ListedColormap(c_list[0:max_c+1]))
plt.xlabel('First Dimension',fontsize=24)
plt.xticks(fontsize=18)
plt.yticks(fontsize=18)

# Drosophila Default SKlearn Model
plt.subplot(1, 3, (3,3))
plt.title('Drosophila Default SKlearn Clustering',fontsize=24,fontweight='bold')
max_c = int(np.max(c_hat_default_drosophila))
plt.scatter(x_drosophila[:,0],x_drosophila[:,1],c=c_hat_default_drosophila,cmap=mp.colors.ListedColormap(c_list[0:max_c+1]))
plt.xticks(fontsize=18)
plt.yticks(fontsize=18)

# Drosophila True Clustering
plt.subplot(1, 3, (1,1))
plt.title('Drosophila True Clustering',fontsize=24,fontweight='bold')
max_c = int(np.max(c_true_drosophila))
plt.scatter(x_drosophila[:,0],x_drosophila[:,1],c=c_true_drosophila,cmap=mp.colors.ListedColormap(c_list[0:max_c+1]))
plt.ylabel('Second Dimension',fontsize=24)
plt.xticks(fontsize=18)
plt.yticks(fontsize=18)

plt.show()

The best fit is once again the optimized kclust model in the middle plot.