```
From: https://github.com/ksatola
Version: 0.0.1

```

# K-Means Clustering


## How does it work?
Let’s say we’d like to divide the following points into clusters.

<img src="images/k-means1.png" alt="" style="width: 400px;"/>

First, we must choose how many clusters we’d like to have. The `K` in ‘K-means’ stands for the number of clusters we’re trying to identify, which is where the method gets its name. We can start by choosing two clusters.

The second step is to specify the cluster seeds. A `seed` is basically a starting cluster `centroid`. It is chosen at random or is specified by the data scientist based on prior knowledge about the data.

One of the clusters will be the green cluster and the other one the orange cluster. Their seeds are marked in the picture below.

<img src="images/k-means2.png" alt="" style="width: 400px;"/>
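
As a minimal sketch of random seeding, the snippet below picks `K` distinct data points to act as starting centroids. The `choose_seeds` helper and the toy `points` array are made up for illustration; libraries such as scikit-learn use more refined initialization schemes.

```python
import numpy as np

def choose_seeds(points, k, rng):
    """Pick k distinct data points at random to serve as starting centroids."""
    indices = rng.choice(len(points), size=k, replace=False)
    return points[indices]

# Toy 2-D data set (illustrative values, not taken from the pictures).
points = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
                   [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]])

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
seeds = choose_seeds(points, k=2, rng=rng)
print(seeds)
```

Seeding from existing data points guarantees that every seed starts inside the data range, but an unlucky draw can still place both seeds in the same group.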

The next step is to assign each point on the graph to a seed, based on `proximity`.

For instance, this point is closer to the green seed than to the orange one. Therefore, it will belong to the green cluster.

<img src="images/k-means3.png" alt="" style="width: 400px;"/>

This point, on the other hand, is closer to the orange seed, so it will be part of the orange cluster.

<img src="images/k-means4.png" alt="" style="width: 400px;"/>

In this way, we can assign all points on the graph to a cluster, based on their squared Euclidean distance from the seeds.
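
The assignment step can be sketched with NumPy as follows. The `assign_to_nearest` helper, the toy points and the seed positions are assumptions chosen for illustration.

```python
import numpy as np

def assign_to_nearest(points, centroids):
    """Return, for every point, the index of the closest centroid."""
    # diff[i, j] is the vector from centroid j to point i.
    diff = points[:, None, :] - centroids[None, :, :]
    # Squared Euclidean distance between every point and every centroid.
    sq_dist = (diff ** 2).sum(axis=2)
    return sq_dist.argmin(axis=1)

points = np.array([[1.0, 1.0], [2.0, 1.5], [8.0, 8.0], [9.0, 7.5]])
seeds = np.array([[0.0, 0.0], [10.0, 10.0]])  # index 0 = green, 1 = orange

labels = assign_to_nearest(points, seeds)
print(labels)  # [0 0 1 1]
```

Taking the square root is unnecessary here: the centroid with the smallest squared distance is also the one with the smallest distance.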

The final step is to calculate the centroid, that is, the geometrical center (the mean) of the green points and of the orange points. The green seed moves to the center of the green points to become their centroid, and the orange seed does the same for the orange points.

<img src="images/k-means5.png" alt="" style="width: 400px;"/>
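
A minimal sketch of the centroid update, assuming every cluster has at least one point assigned to it (the `update_centroids` helper and the data are illustrative):

```python
import numpy as np

def update_centroids(points, labels, k):
    """Move each centroid to the mean of the points assigned to it.

    Assumes no cluster is empty; an empty cluster would produce NaN here.
    """
    return np.array([points[labels == j].mean(axis=0) for j in range(k)])

points = np.array([[1.0, 1.0], [2.0, 2.0], [8.0, 8.0], [10.0, 6.0]])
labels = np.array([0, 0, 1, 1])  # 0 = green, 1 = orange

centroids = update_centroids(points, labels, k=2)
print(centroids)
```

Here the green centroid lands at (1.5, 1.5) and the orange one at (9.0, 7.0), the coordinate-wise averages of their points.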

From here, we repeat the last two steps: reassign every point to its nearest centroid, then recalculate the centroids. We can do that 10, 15 or 1000 times, until we’ve reached a clustering solution where no point changes its cluster and the centroids no longer move.

<img src="images/k-means6.png" alt="" style="width: 400px;"/>
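
Putting the steps together, the whole loop might look like the sketch below. It is a bare-bones illustration, not how any particular library implements K-means: the `k_means` function, the `max_iter` cap and the deliberately poor seeds are all assumptions.

```python
import numpy as np

def k_means(points, seeds, max_iter=100):
    """Alternate assignment and centroid update until assignments stop changing."""
    centroids = seeds.astype(float).copy()
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        diff = points[:, None, :] - centroids[None, :, :]
        new_labels = (diff ** 2).sum(axis=2).argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # nothing changed, so the solution is stable
        labels = new_labels
        # Update step: move each centroid to the mean of its points.
        for j in range(len(centroids)):
            members = points[labels == j]
            if len(members) > 0:  # leave a centroid in place if it has no points
                centroids[j] = members.mean(axis=0)
    return centroids, labels

points = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
                   [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]])
seeds = np.array([[1.0, 1.0], [1.5, 2.0]])  # both seeds start in the same group

centroids, labels = k_means(points, seeds)
print(labels)  # [0 0 0 1 1 1]
print(centroids)
```

Even though both seeds start inside the lower-left group, the second centroid is pulled toward the upper-right points after the first update, and the loop settles after a few passes.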

## Discussion

One disadvantage arises from the fact that `in K-means we have to specify the number of clusters before starting`. This is an issue that many clustering algorithms share. In the case of K-means, if we choose K too small, distinct groups get merged into one cluster, and its centroid may end up in the empty space between them rather than inside either group.

<img src="images/k-means7.png" alt="" style="width: 400px;"/>
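
The effect of a too-small K can be reproduced on synthetic data. The sketch below builds three tight groups and asks scikit-learn's `KMeans` for only two clusters; the data layout and the `n_init` and `random_state` values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Three tight, well-separated groups of five points each (synthetic data).
offsets = np.array([[0.0, 0.0], [-0.5, 0.0], [0.5, 0.0], [0.0, -0.5], [0.0, 0.5]])
group_centers = np.array([[0.0, 0.0], [10.0, 0.0], [20.0, 0.0]])
points = np.vstack([center + offsets for center in group_centers])

# Ask for only two clusters although the data contains three groups.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Distance from each centroid to the data point closest to it.
diff = model.cluster_centers_[:, None, :] - points[None, :, :]
nearest_point_dist = np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)

print(model.cluster_centers_)
print(nearest_point_dist)
```

One centroid sits inside a single group, while the other has to represent two groups at once and therefore lies roughly halfway between them, several units away from any actual data point.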

In cases where K is too large, some of the clusters may be split into two.

Another important issue is that `K-means enforces spherical, blob-shaped clusters`.

<img src="images/k-means8.png" alt="" style="width: 400px;"/>

The reason is that we are trying to minimize the straight-line distance from each point to its centroid. So if the clusters are more elongated, K-means will have difficulty separating them.

<img src="images/k-means9.png" alt="" style="width: 400px;"/>
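
A small synthetic experiment illustrates the problem with elongated clusters. The two parallel stripes below are made-up data, and the `n_init` and `random_state` values are arbitrary choices; on data like this, K-means tends to cut the stripes across their length instead of separating one stripe from the other.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two long, thin, parallel stripes (synthetic data): 41 points each,
# 20 units long but only 2 units apart.
x = np.linspace(0.0, 20.0, 41)
stripe_a = np.column_stack([x, np.zeros_like(x)])
stripe_b = np.column_stack([x, np.full_like(x, 2.0)])
points = np.vstack([stripe_a, stripe_b])
true_labels = np.array([0] * len(x) + [1] * len(x))

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Cluster ids are arbitrary, so score both possible matchings.
match = (model.labels_ == true_labels).mean()
agreement = max(match, 1.0 - match)

print(f"agreement with the true stripes: {agreement:.2f}")
print(model.cluster_centers_)
```

An agreement close to 0.5 means the result is no better than a coin flip at telling the stripes apart: splitting the data into a left half and a right half gives a smaller total squared distance to the centroids than the split we would consider natural.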

## Implementation
See 3000 Clustering algorithms

In [None]:
import pandas as pd
from sklearn.cluster import KMeans

# read the train and test dataset
train_data = pd.read_csv('train-data.csv')
test_data = pd.read_csv('test-data.csv')

# shape of the dataset
print('Shape of training data :', train_data.shape)
print('Shape of testing data :', test_data.shape)

# Now we divide the training data into different clusters
# and predict which cluster a particular data point belongs to.

# create the object of the K-Means model
model = KMeans(n_clusters=5)  

# fit the model with the training data
model.fit(train_data)

# number of clusters the model was configured with (set explicitly above)
print('\nNumber of clusters : ', model.n_clusters)

# predict the clusters on the train dataset
#predict_train = model.predict(train_data)
#print('\nClusters on train data', predict_train)

# assign each point of the test dataset to its nearest learned cluster
predict_test = model.predict(test_data)
print('Clusters on test data', predict_test)
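
The cell above needs `train-data.csv` and `test-data.csv`. To try the same workflow without those files, here is a hedged variant that swaps in synthetic data from `make_blobs`; the sample sizes, the 240/60 split and the `n_init` and `random_state` values are assumptions for illustration, not part of the original example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the CSV files: 300 points scattered around 5 centers.
X, _ = make_blobs(n_samples=300, centers=5, random_state=0)
train_data, test_data = X[:240], X[240:]

# fit the model with the training data
model = KMeans(n_clusters=5, n_init=10, random_state=0)
model.fit(train_data)

# fitted attributes: centroid coordinates, training labels, total squared distance
print('Centroids shape :', model.cluster_centers_.shape)
print('First train labels :', model.labels_[:10])
print('Inertia :', model.inertia_)

# assign each test point to its nearest learned cluster
predict_test = model.predict(test_data)
print('Clusters on test data', predict_test)
```

Fixing `random_state` makes the run repeatable, and `n_init=10` restarts the algorithm from ten different seedings and keeps the solution with the lowest inertia.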