# K-Means Clustering

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Seed NumPy's generator, since the synthetic data below is drawn with np.random
np.random.seed(10)

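K-Means Clustering (kclust) finds the optimal k-means clustering model by iterating over a range of cluster counts and selecting the model with the highest silhouette score. When true labels are supplied, as in the example below, the model with the highest adjusted Rand index (ARI) is selected instead.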
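A higher silhouette score indicates tighter, better-separated clusters, so a selection loop of this kind keeps the model with the highest score. The following is a minimal sketch of that idea using scikit-learn directly; it is not the graspologic implementation, and the helper name `select_k_by_silhouette` and the demo data are assumptions made for illustration. The silhouette score is only defined for two or more clusters, so the loop starts at 2.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def select_k_by_silhouette(X, max_clusters):
    """Fit KMeans for k = 2..max_clusters and keep the highest-silhouette model."""
    best_k, best_score, best_model = None, -np.inf, None
    for k in range(2, max_clusters + 1):
        model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        score = silhouette_score(X, model.labels_)
        if score > best_score:
            best_k, best_score, best_model = k, score, model
    return best_k, best_score, best_model

# Two well-separated blobs (hypothetical demo data)
rng = np.random.default_rng(0)
X_demo = np.vstack((rng.normal(0, 1, (100, 2)), rng.normal(10, 1, (100, 2))))

best_k, best_score, _ = select_k_by_silhouette(X_demo, max_clusters=5)
print(best_k, round(best_score, 3))
```

On data this cleanly separated, the loop should settle on two clusters.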

K-means clustering works to separate samples into a number of groups of equal variance while minimizing the sum of squared distances from each sample point to its nearest cluster mean. Mathematically, the algorithm chooses the cluster means that minimize the following quantity:

$$\sum_{i=1}^{N} \min_{c \in C} \|x_{i} - \mu_{c}\|^{2}$$

where $N$ is the number of samples, $\mu_{c}$ is the mean of cluster $c$, $C$ is the set of all clusters, $x_{i}$ is a sample point, and $\| \cdot \|$ is the Euclidean norm.
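To make the objective concrete, here is a small sketch that evaluates it with NumPy for a fixed set of cluster means. The function name `kmeans_objective` and the toy arrays are hypothetical, chosen only for illustration.

```python
import numpy as np

def kmeans_objective(X, centers):
    """Sum over samples of the squared Euclidean distance to the nearest center."""
    # Pairwise squared distances, shape (n_samples, n_clusters)
    sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # For each sample keep the closest center, then sum over samples
    return sq_dists.min(axis=1).sum()

# Toy data: four points and two hypothetical cluster means
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])
mu_toy = np.array([[0.0, 0.5], [4.0, 0.5]])

objective = kmeans_objective(X_toy, mu_toy)
print(objective)  # 1.0
```

Each toy point lies 0.5 from its nearest mean, so every term contributes 0.25 and the total is 1.0.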
Let's look at an example using synthetic data.



In [None]:
from graspologic.cluster.kclust import KMeansCluster

# Synthetic data

# Dim 1
class_1 = np.random.randn(150, 1)
class_2 = 2 + np.random.randn(150, 1)
dim_1 = np.vstack((class_1, class_2))

# Dim 2
class_3 = np.random.randn(150, 1)
class_4 = 2 + np.random.randn(150, 1)
dim_2 = np.vstack((class_3, class_4))

X = np.hstack((dim_1, dim_2))

# Labels
label_1 = np.zeros((150, 1))
label_2 = 1 + label_1

c = np.vstack((label_1, label_2)).reshape(300,)

# Fit model
kclust_ = KMeansCluster(max_clusters=2)

c_hat_kclust = kclust_.fit_predict(X,c)

We can compare our method to the existing implementation of k-means clustering in scikit-learn. Our method extends the scikit-learn framework by automatically selecting the number of clusters, up to `max_clusters`, rather than requiring the user to fix it in advance. Below, we compare the selected kclust model with the default `KMeans` model from scikit-learn, which uses eight clusters unless told otherwise.
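One way to put a number on such a comparison is the adjusted Rand index (ARI) between the true labels and each predicted labeling. The sketch below uses only scikit-learn, with a two-cluster `KMeans` standing in for a model whose cluster count was chosen well; the data and variable names are assumptions for illustration and are separate from the notebook's `X` and `c`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Two overlapping Gaussian classes, similar in spirit to the synthetic data above
rng = np.random.default_rng(0)
X_demo = np.vstack((rng.normal(0, 1, (150, 2)), rng.normal(2, 1, (150, 2))))
y_demo = np.repeat([0, 1], 150)

# A two-cluster model versus the scikit-learn default (n_clusters=8)
labels_two = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_demo)
labels_default = KMeans(random_state=0).fit_predict(X_demo)

ari_two = adjusted_rand_score(y_demo, labels_two)
ari_default = adjusted_rand_score(y_demo, labels_default)
print(f"ARI, 2 clusters: {ari_two:.2f}")
print(f"ARI, default (8 clusters): {ari_default:.2f}")
```

ARI is invariant to how the clusters are named, so no label remapping is needed for this check.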

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import confusion_matrix
from scipy.optimize import linear_sum_assignment

# Default scikit-learn KMeans (n_clusters=8 unless specified)
k_default_ = KMeans()

c_hat_k_default = k_default_.fit_predict(X)

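# Function to remap labels for direct comparison
def remap_labels(y_true, y_pred):
    # Rows are true labels, columns are predicted labels
    confusion_mat = confusion_matrix(y_true, y_pred)
    # One-to-one matching of predicted to true labels that maximizes agreement
    row_inds, col_inds = linear_sum_assignment(confusion_mat, maximize=True)
    label_map = dict(zip(col_inds, row_inds))
    # Relabel each prediction with its matched true label
    remapped_y_pred = np.vectorize(label_map.get)(y_pred)

    return remapped_y_pred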

# Remap Predicted labels
c_hat_kclust = remap_labels(c, c_hat_kclust)
c_hat_k_default = remap_labels(c, c_hat_k_default)
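To see what the remapping does, here is a small self-contained check that repeats the function on toy labels. The toy arrays are hypothetical, and the sketch assumes labels are integers running from 0 to K-1, so that confusion-matrix indices coincide with label values.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from scipy.optimize import linear_sum_assignment

def remap_labels(y_true, y_pred):
    # Match predicted labels to true labels so that total agreement is maximal
    confusion_mat = confusion_matrix(y_true, y_pred)
    row_inds, col_inds = linear_sum_assignment(confusion_mat, maximize=True)
    label_map = dict(zip(col_inds, row_inds))
    return np.vectorize(label_map.get)(y_pred)

y_true_toy = np.array([0, 0, 0, 1, 1, 1])
# Same grouping with the label names swapped, plus one misassigned sample
y_pred_toy = np.array([1, 1, 1, 0, 0, 1])

remapped_toy = remap_labels(y_true_toy, y_pred_toy)
print(remapped_toy)  # [0 0 0 1 1 0]
```

After remapping, only the single misassigned sample disagrees with the true labels, which is what makes the scatter plots directly comparable by color.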

In [None]:
# Plotting Synthetic clusters
import matplotlib as mp
import seaborn as sns
sns.set()
sns.set_context("talk", font_scale=1.10)
sns.set_style("darkgrid", {"axes.facecolor": ".9"})

# Labelings (eight colors, since the default KMeans model has eight clusters)
c_list = ['red', 'green', 'blue','orange','purple','yellow','gray','cyan']

# Figure
fig = plt.figure(figsize=(36,9))

# Synthetic True Clustering
plt.subplot(1, 3, (1,1))
plt.title('True Clustering',fontsize=24,fontweight='bold')
max_c = int(np.max(c))
plt.scatter(X[:,0],X[:,1],c=c,cmap=mp.colors.ListedColormap(c_list[0:max_c+1]))
plt.ylabel('Second Dimension',fontsize=24)
plt.xticks(fontsize=18)
plt.yticks(fontsize=18)

# Synthetic Kmeans Model
plt.subplot(1, 3, (2,2))
plt.title('Kclust Clustering',fontsize=24,fontweight='bold')
max_c = int(np.max(c_hat_kclust))
plt.scatter(X[:,0],X[:,1],c=c_hat_kclust,cmap=mp.colors.ListedColormap(c_list[0:max_c+1]))
plt.xlabel('First Dimension',fontsize=24)
plt.xticks(fontsize=18)
plt.yticks(fontsize=18)

# Synthetic Default scikit-learn Model
plt.subplot(1, 3, (3,3))
plt.title('Default scikit-learn Clustering',fontsize=24,fontweight='bold')
max_c = int(np.max(c_hat_k_default))
plt.scatter(X[:,0],X[:,1],c=c_hat_k_default,cmap=mp.colors.ListedColormap(c_list[0:max_c+1]))
plt.xticks(fontsize=18)
plt.yticks(fontsize=18)

plt.show()