#### Unsupervised Clustering. K-means algorithm.

#### k-means

- unsupervised clustering algorithm
- groups objects into k categories based on their attributes (k is not learned from the data; it must be chosen in advance)
- minimizes the sum of squared distances (e.g. Euclidean) between each object and its cluster centroid
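
Since k has to be chosen by hand, one common heuristic is to fit the model for several values of k and look for the "elbow" where the objective stops dropping sharply. The sketch below is only an illustration: the toy data, the blob centers, and the range of k are assumptions, not part of this notebook's dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical toy data: three well-separated 2-D blobs (centers chosen for illustration).
Xb, _ = make_blobs(n_samples=300, centers=[[0, 0], [5, 5], [10, 0]],
                   cluster_std=0.5, random_state=0)

inertias = []
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(Xb)
    # inertia_ is the sum of squared distances of samples to their closest centroid
    inertias.append(model.inertia_)

for k, inertia in zip(range(1, 7), inertias):
    print(k, round(inertia, 1))
```

On data like this the printed values should fall steeply up to k = 3 and flatten afterwards; on real data the elbow is often much less clear.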

#### heuristic

- given k clusters $S=\{s_1, \dots, s_k\}$ with centroids $M=\{\mu_1, \dots, \mu_k\}$


- each observation is assigned to the cluster $s_c$ with the closest centroid,

$
\forall\,x_{i}\in\,X\,,\;\text{centroid of}\,x_{i}\,\text{is}\;\underset{\mu_c\,\in\,M}{\arg\min}\,\|x_i-\mu_c\|^2
$
- minimize the sum of the squared distances from each observation to its cluster centroid

$
\underset{M}{\min}\sum_{c=1}^k\,\sum_{x_{i}\in\,s_c}\,\|x_i-\mu_c\|^2
$
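
As a small worked example of the two formulas above, the following sketch computes the assignments and the objective for fixed centroids. The four points and two centroids are made-up values chosen so the result can be checked by hand.

```python
import numpy as np

# Toy data: four 2-D points and two centroids (illustrative values only).
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
M = np.array([[0.0, 1.0], [10.0, 1.0]])

# Squared Euclidean distance from every point to every centroid, shape (n, k).
d2 = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)

# Assignment: index of the closest centroid for each point.
labels = d2.argmin(axis=1)

# Objective: sum of squared distances from each point to its own centroid.
objective = d2[np.arange(len(X)), labels].sum()

print(labels)     # [0 0 1 1]
print(objective)  # 4.0
```

Each point lies at distance 1 from its closest centroid, so the objective is 1 + 1 + 1 + 1 = 4.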

#### algorithm

- randomly select k centroids
- iterate until convergence or until a fixed number of iterations is reached
    - 1. for each $x_{i}\in X$:
        - find the closest centroid
        - assign $x_i$ to that cluster
    - 2. for each cluster $s_{c},\;c\in\{1,\dots,k\}$:
        - update centroid: $\mu_c=\frac{1}{|s_c|}\sum_{x_i\in\,s_c}x_{i}$
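
The two steps above can be sketched in a few lines of vectorized NumPy. This is a minimal illustration under stated assumptions: the function name `lloyd` is hypothetical, the initial centroids are passed in explicitly (instead of being drawn at random) so the run is reproducible, and "convergence" is taken to mean that no point changed cluster.

```python
import numpy as np

def lloyd(X, M0, max_iter=100):
    """Run k-means iterations starting from the initial centroids M0."""
    M = np.array(M0, dtype=float)
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Step 1: assign each point to its closest centroid.
        d2 = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        # Converged: no point changed cluster.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 2: move each centroid to the mean of its points.
        for c in range(len(M)):
            members = X[labels == c]
            if len(members) > 0:  # an empty cluster keeps its centroid
                M[c] = members.mean(axis=0)
    return M, labels

X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
M, labels = lloyd(X, M0=[[1.0, 0.0], [2.0, 0.0]])
print(M)       # centroids end up at (0, 1) and (10, 1)
print(labels)  # [0 0 1 1]
```

Starting from two centroids on the left, the first assignment already splits the points into a left and a right pair, the update moves the centroids to the pair means, and the second pass detects that nothing changed.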


#### visualization: 
https://www.youtube.com/watch?v=9nKfViAfajY

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
#sns.set_theme()

#### generate synthetic dataset

In [None]:
from sklearn.datasets import make_classification

In [None]:
X, y = make_classification(n_samples = 1000, n_features = 5, n_informative = 3, n_classes = 3, n_clusters_per_class = 1, class_sep = 2.0, n_redundant = 0, random_state = 1234)
print(X.shape, y.shape)

#### data exploration

In [None]:
df = pd.DataFrame(X, columns = ['X%d' %j for j in range(X.shape[1])])
df['target'] = y
df.head()

In [None]:
df['target'].value_counts()

In [None]:
sns.pairplot(df, hue = 'target', corner = True)

#### train/test split

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, y, test_size = 0.2, random_state = 2783)

#### instantiate kMeans model

In [None]:
from sklearn.cluster import KMeans 

In [None]:
km = KMeans(n_clusters = 3)

In [None]:
%time km.fit(Xtrain)

In [None]:
Ypred = km.predict(Xtest)

#### evaluate kMeans model

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

In [None]:
cm = confusion_matrix(Ytest, Ypred)

In [None]:
#ConfusionMatrixDisplay(cm, display_labels = lbls).plot(ax = axs[0], xticks_rotation = 90.0, values_format = '.2f', cmap = 'GnBu')
ConfusionMatrixDisplay(cm).plot();
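
Note that the cluster ids returned by k-means are arbitrary: cluster 0 need not correspond to class 0, so the confusion matrix may look like a permuted diagonal. One possible way to make it easier to read is to relabel the clusters with the permutation that agrees best with the true classes. The helper `align_clusters` below is a hypothetical illustration, not part of scikit-learn; it brute-forces all k! permutations, which is only reasonable for small k (for larger k, `scipy.optimize.linear_sum_assignment` on the confusion matrix is a more scalable option).

```python
import numpy as np
from itertools import permutations

def align_clusters(y_true, y_pred, k):
    """Relabel cluster ids with the permutation that best agrees with y_true."""
    y_true = np.asarray(y_true).astype(int)
    y_pred = np.asarray(y_pred).astype(int)
    best_perm, best_hits = None, -1
    for perm in permutations(range(k)):
        mapped = np.array(perm)[y_pred]  # cluster id c becomes perm[c]
        hits = int((mapped == y_true).sum())
        if hits > best_hits:
            best_perm, best_hits = perm, hits
    return np.array(best_perm)[y_pred]

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([2, 2, 0, 0, 1, 0])  # same grouping, different ids, one error
aligned = align_clusters(y_true, y_pred, k=3)
print(aligned)  # [0 0 1 1 2 1]
```

In this notebook the aligned labels could then be passed to `confusion_matrix(Ytest, ...)` in place of `Ypred`.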

#### k-means implementation

In [None]:
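class kamins():
    """Minimal k-means with random centroid initialization."""

    def __init__(self, k):
        # number of clusters
        self.k = k

    def _assign(self, X):
        # index of the closest centroid (squared Euclidean distance) for each row of X
        S = -np.ones(X.shape[0], dtype = int)
        for i, x in enumerate(X):
            dxM = np.array([np.sum((x -m)**2) for m in self.M])
            S[i] = np.argmin(dxM)
        return S

    def fit(self, Xtrain, maxIterations = 100):

        # pick random centroids uniformly within the per-feature range of the data
        minX = Xtrain.min(axis = 0)
        rangeX = Xtrain.max(axis = 0) -minX
        self.M = np.array([minX +rangeX *np.random.rand(Xtrain.shape[1]) for _ in range(self.k)])

        # initialize array for cluster assignment
        S = -np.ones(Xtrain.shape[0], dtype = int)
        iteration = 0
        while iteration < maxIterations:
            # assign each observation to its closest centroid
            Snew = self._assign(Xtrain)
            # stop at convergence: no observation changed cluster
            if np.array_equal(Snew, S):
                break
            S = Snew
            # update centroids (an empty cluster keeps its previous centroid,
            # which avoids taking the mean of an empty slice)
            for c in range(self.k):
                if np.any(S == c):
                    self.M[c] = np.mean(Xtrain[S == c], axis = 0)
            # loop
            iteration += 1

    def predict(self, Xtest):
        # return the index of the closest centroid for each observation
        return self._assign(Xtest)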

In [None]:
mykm = kamins(3)

In [None]:
%time mykm.fit(Xtrain, maxIterations = 50)

In [None]:
Ypred = mykm.predict(Xtest)

In [None]:
cm = confusion_matrix(Ytest, Ypred)
ConfusionMatrixDisplay(cm).plot();