## K-Means  `kmeans(data)`
- This notebook, `0-final-kmeans.ipynb`, runs the implementation and meets the requirements listed below
- For visualization, check the notebook `3-final-results-with-plot.ipynb`
- To see how my implementation compares to sklearn, check the notebook `1-initial-comparisons.ipynb`
- Check the elbow graphs and discussion in `naive-elbow-inspection.ipynb`


#### `def kmeans(data, nk=10, niter=100)`
- Returns two items: `best_k` and a vector with the corresponding cluster label for each sample.
- `nk` defaults to 10 and is the maximum number of clusters the program will test. For a given data set, the best k is therefore greater than 1 and less than or equal to `nk`.
- `niter` is the maximum number of iterations before the algorithm "gives up": if the centroids have not converged after `niter` iterations, it uses the most recently computed centroids.
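To illustrate how the `niter` cap behaves, here is a minimal sketch of a Lloyd-style loop for one fixed k. It is not the code in `KMeans.py`: the name `lloyd_fixed_k`, the `seed` parameter, and the choice of initial centroids (k random samples) are assumptions made for this example.

```python
import numpy as np


def lloyd_fixed_k(data, k, niter=100, seed=0):
    """Run a Lloyd-style loop for one fixed k (hypothetical helper, not from KMeans.py)."""
    rng = np.random.default_rng(seed)
    # Assumed initialization: use k distinct samples as the starting centroids.
    centroids = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    labels = np.zeros(len(data), dtype=int)
    for _ in range(niter):
        # Assign each sample to its nearest centroid (squared Euclidean distance).
        dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its members; empty clusters stay put.
        new_centroids = centroids.copy()
        for j in range(k):
            members = data[labels == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)
        # Converged: the centroids stopped moving, so stop early.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # If the loop used all niter iterations, the most recent centroids are returned.
    return centroids, labels
```

Under these assumptions, a call such as `lloyd_fixed_k(samples, k=6)` would return the centroids and one label per sample. A full `kmeans` would repeat this for each candidate k up to `nk` and then pick the best one.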

## Requirements
- `data` is an MxN numpy array (M samples, N features)
- `kmeans(data)` should return
  - an integer K, which should be programmatically identified
  - a vector of length M containing the cluster labels
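The return contract above can be checked mechanically. The helper below is a sketch: `check_kmeans_output` is a hypothetical name, and the check that labels are integers in the range 0 to K-1 is an assumption based on the sample output further down, not a stated requirement.

```python
import numpy as np


def check_kmeans_output(data, k, labels, nk=10):
    """Validate a (k, labels) pair against the requirements (hypothetical helper)."""
    data = np.asarray(data)
    labels = np.asarray(labels)
    # The input must be an MxN array with at least one sample.
    if data.ndim != 2 or data.shape[0] == 0:
        return False
    m = data.shape[0]
    # K must be an integer greater than 1 and at most nk.
    if not isinstance(k, (int, np.integer)) or not 1 < k <= nk:
        return False
    # One label per sample, returned as a flat vector of length M.
    if labels.shape != (m,):
        return False
    # Assumption: labels are integers in the range [0, k).
    if not np.issubdtype(labels.dtype, np.integer):
        return False
    return bool(labels.min() >= 0 and labels.max() < k)
```

A call like `check_kmeans_output(samples, k, labels)` after running `kmeans(samples)` would confirm the shapes and ranges, though not the quality of the clustering.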


In [1]:
import numpy as np
from numpy.random import multivariate_normal
from KMeans import kmeans

In [2]:
# Create synthetic data from six different multivariate normal distributions
x1 = np.random.multivariate_normal(mean=[-55, 5], cov=[[1, 0], [0, 1]], size=75)
x2 = np.random.multivariate_normal(mean=[-1, 1], cov=[[2, 0], [0, 2]], size=200)
x3 = np.random.multivariate_normal(mean=[60, -10], cov=[[1, 0], [0, 2]], size=60)
x4 = np.random.multivariate_normal(mean=[-1, 50], cov=[[1, 0], [0, 3]], size=50)
x5 = np.random.multivariate_normal(mean=[3, -64], cov=[[3, 0], [0, 1]], size=40)
x6 = np.random.multivariate_normal(mean=[100, 100], cov=[[1, 0], [0, 2]], size=20)

samples = np.concatenate([x1, x2, x3, x4, x5, x6])
np.random.shuffle(samples)

In [3]:
# Let's test the synthetic samples

print()
print("(M x N) row = M (number of samples) columns = N (number of features per sample)")
print("Shape of array:", samples.shape)

print()
print("Which means there are", samples.shape[0], "samples and", samples.shape[1], "features per sample")

print()
print("Let's run our kmeans implementation")

#----------------------------------------------
k, labels = kmeans(samples)
#----------------------------------------------

print()
print()
print("Proposed number of clusters:", k)

print("Labels shape:")
print(labels.shape)

print("Print all the labels:")
print(labels)


(M x N) row = M (number of samples) columns = N (number of features per sample)
Shape of array: (445, 2)

Which means there are 445 samples and 2 features per sample

Let's run our kmeans implementation
>>>>>>>>>>

Proposed number of clusters: 6
Labels shape:
(445,)
Print all the labels:
[2 2 1 2 2 5 2 5 2 3 2 2 2 1 2 2 0 3 1 2 4 4 2 4 2 5 2 2 2 2 5 2 3 5 5 1 2
 2 5 2 5 1 2 2 5 3 1 2 3 2 3 2 3 2 5 2 5 2 5 3 2 2 1 2 2 5 0 5 2 5 2 5 2 0
 3 2 4 2 5 5 2 2 2 2 1 5 3 4 2 1 1 0 2 5 4 2 2 2 5 5 5 1 2 2 1 1 5 5 0 2 2
 2 2 2 0 1 5 2 2 4 3 3 4 1 0 3 5 3 2 0 4 2 1 5 2 2 0 2 2 4 2 4 2 2 2 4 2 4
 1 4 5 2 2 4 5 2 3 4 4 2 1 5 2 2 5 2 2 2 0 1 1 2 1 0 4 5 4 5 4 2 1 2 5 2 2
 2 5 3 1 5 5 2 1 2 3 2 2 2 4 2 1 5 2 2 1 5 1 5 5 2 4 3 2 2 2 5 2 0 2 3 2 2
 3 2 3 5 2 3 2 1 2 3 1 1 4 1 2 2 2 3 2 2 5 4 2 5 5 2 2 1 2 3 2 5 5 2 4 5 5
 2 2 1 2 2 1 1 2 1 2 3 5 2 1 2 2 2 2 1 2 4 1 2 3 5 1 1 2 4 2 2 2 3 2 2 2 0
 3 2 0 2 4 3 5 2 2 2 3 3 2 5 2 1 5 1 2 2 3 3 2 2 2 1 2 5 3 2 2 2 2 3 0 5 2
 3 1 5 5 2 2 2 2 1 1 3 2 2 5 3 4 2 2

In [4]:
unique, counts = np.unique(labels, return_counts=True)
dict(zip(unique, counts))

{0: 20, 1: 60, 2: 200, 3: 50, 4: 40, 5: 75}
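The counts above can be compared with the sizes of the six generated groups. The sketch below uses a hypothetical helper, `sizes_match`, and rebuilds a label vector from the reported counts only so that it runs on its own. In the notebook, `labels` from `kmeans(samples)` would be passed directly.

```python
import numpy as np


def sizes_match(labels, expected_sizes):
    """Return True if the cluster sizes equal the generated group sizes (hypothetical helper)."""
    _, counts = np.unique(labels, return_counts=True)
    # Cluster ids are arbitrary, so compare sorted sizes rather than id by id.
    return sorted(counts.tolist()) == sorted(expected_sizes)


# Sizes of x1..x6 as generated in cell 2.
expected = [75, 200, 60, 50, 40, 20]
# Counts as reported by the cell above.
reported = {0: 20, 1: 60, 2: 200, 3: 50, 4: 40, 5: 75}
# Stand-in label vector with the reported counts (illustration only).
labels_from_counts = np.repeat(list(reported.keys()), list(reported.values()))
print(sizes_match(labels_from_counts, expected))  # prints True
```

Matching sizes is a necessary check, not a sufficient one: it does not prove that each sample landed in the cluster of the distribution it was drawn from.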