# K-Means Clustering

So what is clustering?  
Clustering means dividing the data into groups of similar points.  
Let's learn about K-means.  
For K-means we need to decide how many clusters we want the data to be divided into.  
This number can be the optimum suggested by the data, or any number we choose.  
For example, we might want to cluster cities on the basis of fire incidents.  
We might ask for three clusters and read them as low, moderate and high, or pick the number of clusters from the data and let the algorithm assign the labels.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
dataset = pd.read_csv('data/Mall_Customers.csv')
dataset.head()

Unnamed: 0,CustomerID,Genre,Age,Annual Income (k$),Spending Score (1-100)
0,1,Male,19,15,39
1,2,Male,21,15,81
2,3,Female,20,16,6
3,4,Female,23,16,77
4,5,Female,31,17,40


Let's cluster the customers on the basis of annual income and spending score.

In [3]:
X = dataset.iloc[:, [3, 4]].values

Let's first find the optimum number of clusters.  
K-means starts with `n_clusters` centroids placed at random (the `k-means++` option used below spreads the starting centroids apart).  
Each point is assigned to the centroid nearest to it and becomes part of that cluster.  
The centroids keep moving until the assignments stop changing.

steps:  
* initialize k centroids randomly
* assign each point to the closest centroid
* compute the centroid of each new cluster
* repeat until no change
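
The four steps above can be sketched in plain Python. This is only a toy illustration, not how scikit-learn implements `KMeans`: the function name `kmeans_sketch`, its `seed`, `init` and `max_iter` parameters, and the six toy points are all assumptions made for this example.

```python
import math
import random


def kmeans_sketch(points, k, seed=0, init=None, max_iter=100):
    """Toy k-means on a list of (x, y) tuples; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Step 1: start from the given centroids, or pick k data points at random.
    centroids = list(init) if init is not None else rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each point to its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step 4: stop as soon as no assignment changes.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of the points assigned to it.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# Hypothetical toy data: two small groups of three points each.
toy_points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
toy_labels, toy_centroids = kmeans_sketch(toy_points, k=2)
print(toy_labels)
print(toy_centroids)
```

Because the starting centroids are random, the cluster numbers (0 or 1) can swap between seeds even when the grouping is the same.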

In [None]:
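from sklearn.cluster import KMeans
# WCSS (within-cluster sum of squares) for k = 1..9; scikit-learn exposes it as inertia_
wcss = []
for i in range(1, 10):
    # random_state is an arbitrary seed so the curve is the same on every run
    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)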
plt.plot(range(1, 10), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

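This is an elbow curve. It plots the WCSS (within-cluster sum of squares, `inertia_` in scikit-learn) against the number of clusters.  
The point where the curve stops dropping sharply and starts to flatten, 5 here, is the number of clusters we want.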
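
Reading the elbow by eye can be backed up with a simple numeric check. The sketch below picks the k where the curve bends most, measured by the discrete second difference of the WCSS values. This is one possible heuristic, not something the notebook or scikit-learn prescribes. The function name `elbow_by_second_difference` is an assumption, and the WCSS numbers are made up for illustration (they are not the values from the mall data).

```python
def elbow_by_second_difference(wcss, first_k=1):
    """Return the k at which the WCSS curve bends most sharply.

    wcss[i] is the WCSS for k = first_k + i. The bend is measured with the
    discrete second difference; this is a heuristic, not a guarantee.
    """
    if len(wcss) < 3:
        raise ValueError("need at least three WCSS values")
    best_i = max(
        range(1, len(wcss) - 1),
        key=lambda i: wcss[i - 1] - 2 * wcss[i] + wcss[i + 1],
    )
    return first_k + best_i


# Made-up WCSS values for k = 1..9, for illustration only.
example_wcss = [1000, 800, 600, 400, 200, 180, 165, 150, 140]
print(elbow_by_second_difference(example_wcss))  # prints 5
```

With the real data, the `wcss` list built in the loop above can be passed in directly. It is still worth looking at the plot, since a noisy curve can fool the heuristic.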

In [None]:
# Fit K-means with the 5 clusters suggested by the elbow curve
# random_state is an arbitrary seed so the cluster numbering is repeatable
kmeans = KMeans(n_clusters = 5, init = 'k-means++', random_state = 42)
y_kmeans = kmeans.fit_predict(X)

# Visualising the clusters
colors = ['red', 'blue', 'green', 'cyan', 'magenta']
for i, color in enumerate(colors):
    plt.scatter(X[y_kmeans == i, 0], X[y_kmeans == i, 1], s = 100, c = color, label = f'Cluster {i + 1}')
plt.title('Clusters of customers')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()

These are the clusters formed.

In [None]:
kmeans.labels_

In [None]:
labels=pd.DataFrame({
    'Labels':kmeans.labels_
})

In [None]:
finals=pd.concat([dataset,labels],axis=1)
finals.head()

# Hierarchical Clustering
This type of clustering works purely on distance: it keeps merging the two closest clusters.  
Different distance measures can be used. The cells below use Euclidean distance with Ward linkage.  
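
To make the idea of merging by distance concrete, here is a minimal sketch in plain Python. It uses single linkage (the distance between two clusters is the distance between their closest pair of points) because that is the simplest rule to write down. This is a choice made for the example and differs from the Ward linkage used in the cells below. The function name `agglomerate` and the toy points are assumptions.

```python
import math


def agglomerate(points, n_clusters):
    """Merge the two closest clusters (single linkage) until n_clusters remain."""
    clusters = [[p] for p in points]  # every point starts as its own cluster

    def single_link(a, b):
        # distance between clusters = distance between their closest pair of points
        return min(math.dist(p, q) for p in a for q in b)

    while len(clusters) > n_clusters:
        # find the pair of clusters with the smallest distance between them
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]  # merge cluster j into cluster i
        del clusters[j]
    return clusters


# Hypothetical toy data: two close pairs and one isolated point.
toy_points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(agglomerate(toy_points, 3))
# prints [[(0, 0), (0, 1)], [(5, 5), (5, 6)], [(10, 0)]]
```

Each pass through the loop corresponds to one join in a dendrogram. This brute-force search is fine for a handful of points but far too slow for real datasets.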

In [None]:
import scipy.cluster.hierarchy as sch
dendrogram = sch.dendrogram(sch.linkage(X, method = 'ward'))
plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Euclidean distances')
plt.show()

This type of graph is called a dendrogram and maps the data as a hierarchy.  
Points joined near the x axis are the closest and form a cluster.  
As we move up, two clusters join to form a new cluster, and this keeps happening until all the points are covered.  
So how do we determine the ideal number of clusters?
* Observe the largest vertical line in association with the main cluster.  
In this case it's the one with the red cluster at the bottom.
* Start from the bottom end of that line (distance ~ 100) and move up.
* As we are about to cross 250 we encounter a green horizontal line in the neighboring cluster. This is it!
* We stop as soon as we reach a horizontal line and imagine a horizontal line cutting the whole graph just below it.
* The number of vertical lines this cut crosses is the number of clusters, 5 in this case.
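
The visual rule above amounts to finding the largest gap between two successive merge heights and cutting the tree inside that gap. A minimal sketch of that reading follows. The function name `clusters_from_largest_gap` is an assumption, and the merge heights are made-up numbers for nine points, not values from the customer data.

```python
def clusters_from_largest_gap(merge_heights):
    """Suggest a cluster count by cutting the dendrogram inside its largest gap.

    merge_heights holds one height per merge (n - 1 values for n points).
    """
    heights = sorted(merge_heights)
    if len(heights) < 2:
        raise ValueError("need at least two merges")
    n_points = len(heights) + 1
    # vertical distance between each merge and the next one up
    gaps = [hi - lo for lo, hi in zip(heights, heights[1:])]
    i = max(range(len(gaps)), key=gaps.__getitem__)
    # cutting inside gap i means the first i + 1 merges have already happened
    return n_points - (i + 1)


# Made-up merge heights for 9 points: six small merges, then two large ones.
example_heights = [0.5, 0.5, 0.6, 0.9, 1.0, 1.1, 17.3, 22.4]
print(clusters_from_largest_gap(example_heights))  # prints 3
```

With SciPy, the merge heights are the third column of the linkage matrix, for example `sch.linkage(X, method = 'ward')[:, 2]`. As with the elbow curve, the result is a suggestion and should be checked against the plot.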

In [None]:
# Fitting Hierarchical Clustering to the dataset
from sklearn.cluster import AgglomerativeClustering
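# Ward linkage always uses Euclidean distance, so no distance argument is needed
# (the old `affinity` parameter was renamed to `metric` and later removed from scikit-learn)
hc = AgglomerativeClustering(n_clusters = 5, linkage = 'ward')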
y_hc = hc.fit_predict(X)

In [None]:
# Visualising the clusters
colors = ['red', 'blue', 'green', 'cyan', 'magenta']
for i, color in enumerate(colors):
    plt.scatter(X[y_hc == i, 0], X[y_hc == i, 1], s = 100, c = color, label = f'Cluster {i + 1}')
plt.title('Clusters of customers')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()