# KMeans Clustering
Clustering is an unsupervised learning method: it looks for hidden structure in a dataset without using labels.
It works by measuring the similarity between observations and assuming that two observations that are highly similar to each other belong to the same category (or cluster).

__KMeans__
KMeans starts from a set of initial centers, which are either chosen at random or passed in as part of the initialization. Each observation is assigned to the center with the smallest distance to it. Each center is then moved to the mean of all observations assigned to it. These two steps repeat until a termination criterion is reached, e.g. a maximum number of iterations or no further change in the centers' positions.
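The loop described above can be sketched in a few lines of NumPy. This is a minimal illustration on made-up 2D points, not the notebook's implementation: the function name `kmeans_sketch`, the `max_iter` parameter, and the choice to leave an empty cluster's center where it is are all assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(X, centers, max_iter=100):
    """Alternate assignment and update steps until the centers stop moving."""
    centers = np.asarray(centers, dtype=float).copy()
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every observation.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Update step: move each center to the mean of its assigned observations.
        new_centers = centers.copy()
        for k in range(len(centers)):
            members = X[labels == k]
            if len(members) > 0:  # assumption: an empty cluster keeps its old center
                new_centers[k] = members.mean(axis=0)
        # Termination: stop as soon as no center changed position.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Toy data: two obvious groups, with hand-picked initial centers.
toy = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
toy_centers, toy_labels = kmeans_sketch(toy, centers=[[1.0, 1.0], [9.0, 9.0]])
print(toy_centers)
print(toy_labels)
```

On this toy input the centers settle on the mean of each group after one update, and the second pass only confirms that nothing moved.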

In [24]:
# Imports
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

import numpy as np

# Load Dataset
dataset = load_digits()

X = dataset.data
X_train, X_test = train_test_split(X, random_state=0)

In [26]:
# Euclidean distance, used to measure the distance to the centers

def distance(p1, p2):
    # Square each coordinate difference before summing; summing first
    # would let positive and negative differences cancel out.
    return np.sqrt(np.sum((p1 - p2)**2))

# Test
print(X_train[0])
print(X_train[1])
print(round(distance(X_train[0], X_train[1]), 4))

[  0.   3.  13.  16.   9.   0.   0.   0.   0.  10.  15.  13.  15.   2.   0.
   0.   0.  15.   4.   4.  16.   1.   0.   0.   0.   0.   0.   5.  16.   2.
   0.   0.   0.   0.   1.  14.  13.   0.   0.   0.   0.   0.  10.  16.   5.
   0.   0.   0.   0.   4.  16.  13.   8.  10.   9.   1.   0.   2.  16.  16.
  14.  12.   9.   1.]
[  0.   0.   1.  14.  13.   4.   0.   0.   0.   3.  15.  12.  11.  15.   0.
   0.   0.   8.  11.   1.   7.  13.   0.   0.   0.   1.  13.  14.  16.   1.
   0.   0.   0.   0.   0.  14.  13.  14.   2.   0.   0.   0.   2.  12.   0.
   9.   8.   0.   0.   0.   3.  13.   4.  12.   6.   0.   0.   0.   0.   9.
  14.  13.   1.   0.]
45.2659
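As a cross-check, the Euclidean distance is the L2 norm of the difference vector, which NumPy provides as `np.linalg.norm`. The sketch below uses two small made-up vectors (not the digits data) and also shows why the squaring has to happen before the summing: otherwise positive and negative differences cancel each other.

```python
import numpy as np

a = np.array([1.0, -1.0])
b = np.array([0.0, 0.0])

# Square each difference first, then sum, then take the root.
euclidean = np.sqrt(np.sum((a - b) ** 2))
# Summing first lets +1 and -1 cancel, which wrongly yields a distance of 0.
cancelled = np.sqrt(np.sum(a - b) ** 2)

print(euclidean, np.linalg.norm(a - b))
print(cancelled)
```

The first line prints the same value twice (the square root of 2), while the second prints `0.0` even though the two points are clearly not identical.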


In [33]:
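def initCenters(n, features, sigma, mu):
    # Draw n random centers from a normal distribution with mean mu and
    # standard deviation sigma; the result has shape (n, features).
    centers = sigma*np.random.randn(n, features) + mu
    return centers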

# Test
sigma = np.std(X)
mu = np.mean(X)

centers = initCenters(5, X.shape[1], sigma, mu)
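Centers drawn from a single normal distribution can land far from every observation, so some of them may never get any points assigned. A common alternative is to pick existing observations as the initial centers. The sketch below is one way to do that; the function name `init_centers_from_data` and the `seed` parameter are hypothetical and only serve this example.

```python
import numpy as np

def init_centers_from_data(X, n, seed=0):
    """Pick n distinct observations from X to serve as initial centers."""
    rng = np.random.default_rng(seed)
    # replace=False guarantees that no row index is chosen twice.
    idx = rng.choice(X.shape[0], size=n, replace=False)
    return X[idx].astype(float)

# Made-up data: 10 observations with 2 features each.
demo = np.arange(20, dtype=float).reshape(10, 2)
demo_centers = init_centers_from_data(demo, n=3)
print(demo_centers.shape)
```

With this scheme every center starts on a real data point, so each cluster has at least one member after the first assignment step as long as the chosen rows differ.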

In [47]:
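def findNearestCenter(p1, centers):
    # Index of the center with the smallest distance to p1
    distances = [distance(p1, c) for c in centers]
    return np.argmin(distances)

def recalculateCenter(datapoints):
    # New center = mean of all assigned observations, computed per feature.
    # Note: an empty set of datapoints yields NaN values.
    return np.mean(datapoints, axis=0)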

# Test
print(recalculateCenter(np.array([[1,2,3],[2,3,4]])))

[ 1.5  2.5  3.5]
