### Clustering

1)	Clustering groups objects so that the members of a group share characteristics with each other and differ from the members of other groups. These groups are known as "clusters".<br>
2)	Clustering is a form of unsupervised learning: we have only a set of unlabeled input data and must extract information from it without knowing in advance what the output will be.<br>
3)<b>	There is no need to split the data into training and testing datasets.</b>


#### Clustering algorithms fall into the following categories
1) Centroid-based - KMeans<br>
2) Density-based - DBSCAN (Density-based spatial clustering of applications with noise)<br>
3) Hierarchical - Divisive and Agglomerative Clustering

### KMeans

Where K = number of clusters

1)	K-means is an iterative algorithm that partitions the dataset into<b> K pre-defined, distinct, non-overlapping subgroups (clusters)</b>, where each data point belongs to exactly one group.<br>
2)	It tries to make the data points within a cluster as similar as possible while keeping the clusters as different (far apart) as possible. It assigns data points to clusters such that the<b> sum of the squared distances between the data points and their cluster’s centroid (the arithmetic mean of all the data points that belong to that cluster) is minimized.</b> The less variation there is within a cluster, the more homogeneous (similar) its data points are.<br>
3)	Because K-means, like other distance-based clustering algorithms, uses distances to measure the similarity between data points, it is recommended to standardize or scale the data: the features in a dataset almost always have different units of measurement, for instance age vs. income.<br>
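
Point 3 above can be illustrated with a minimal sketch. The ages and incomes below are made-up values for illustration, and `standardize` and `dist` are hypothetical helper names that are not defined elsewhere in this notebook.

```python
import numpy as np

# Made-up (age, income) rows, for illustration only.
X = np.array([[25.0, 30000.0],
              [45.0, 32000.0],
              [26.0, 90000.0]])

def standardize(X):
    """Rescale each column to zero mean and unit variance (z-score)."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def dist(a, b):
    """Euclidean distance between two points."""
    return np.sqrt(((a - b) ** 2).sum())

Z = standardize(X)

# Raw units: income dominates, so row 0 looks far closer to row 1 than to row 2,
# even though rows 0 and 2 have almost the same age.
print(dist(X[0], X[1]), dist(X[0], X[2]))
# Standardized: age and income contribute on a comparable scale.
print(dist(Z[0], Z[1]), dist(Z[0], Z[2]))
```

With the raw values, the 20-year age gap between rows 0 and 1 barely registers next to the income differences; after standardizing, both features influence the distance.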


### K-Means Algorithm

1) Specify the number of clusters K.<br>
2) Initialize the centroids by shuffling the dataset and then randomly selecting K data points without replacement.<br>
3) Compute the distance between each data point and every centroid.<br>
4) Assign each data point to the cluster whose centroid is closest.<br>
5) Recompute each cluster's centroid as the average of all the data points assigned to it.<br>
6) Repeat steps 3, 4 and 5 until the centroids no longer change.
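
The steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the function name `kmeans`, its parameters `max_iter` and `seed`, and the toy array `X` are all assumptions made for this sketch.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain K-means on an (n_samples, n_features) array (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Pick K distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Distance from every point to every centroid, shape (n_samples, k).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Assign each point to its nearest centroid.
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its points;
        # a cluster that lost all its points keeps its old centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Toy data: two well-separated groups of three points each.
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
              [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]])
centroids, labels = kmeans(X, k=2)
print(labels)
print(centroids)
```

Which group receives label 0 depends on the random initialization, so only the grouping itself is meaningful, not the label values.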

<img src="kmeans1.png">

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [4]:
data  = {'Age': np.random.randint(18,80,20),
        'Expense': np.random.randint(200,1000,20)}
df = pd.DataFrame(data)
df.head()

Unnamed: 0,Age,Expense
0,68,745
1,50,401
2,74,835
3,22,944
4,54,452
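
Because the data above is drawn without a seed, every run of the cell produces different values. If reproducibility matters, one option is a seeded generator. The sketch below assumes NumPy's `default_rng` API; the seed value `42` and the name `df_seeded` are arbitrary choices, and the result will not match the table shown above.

```python
import numpy as np
import pandas as pd

# Seeded generator: the same seed yields the same data on every run.
rng = np.random.default_rng(42)
data = {'Age': rng.integers(18, 80, size=20),          # upper bound is exclusive
        'Expense': rng.integers(200, 1000, size=20)}
df_seeded = pd.DataFrame(data)
print(df_seeded.head())
```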


In [6]:
df1 = df.copy()
df1.head()

Unnamed: 0,Age,Expense
0,68,745
1,50,401
2,74,835
3,22,944
4,54,452


### Assume K = 4

In [7]:
df.sample(4)

Unnamed: 0,Age,Expense
19,68,316
4,54,452
2,74,835
3,22,944


### Initial cluster centroids

In [10]:
# The comment after each centroid gives the row index it was sampled from
k1 = [68,316]  # 19
k2 = [54,452]  # 4
k3 = [74,835]  # 2
k4 = [22,944]  # 3

#### Euclidean distance: dist((x1, y1), (x2, y2)) = sqrt((x2 - x1)^2 + (y2 - y1)^2)

In [12]:
df['dist_k1'] = np.sqrt((k1[0] - df['Age'])**2 + (k1[1] - df['Expense'])**2) 
df['dist_k2'] = np.sqrt((k2[0] - df['Age'])**2 + (k2[1] - df['Expense'])**2)
df['dist_k3'] = np.sqrt((k3[0] - df['Age'])**2 + (k3[1] - df['Expense'])**2)
df['dist_k4'] = np.sqrt((k4[0] - df['Age'])**2 + (k4[1] - df['Expense'])**2)
df.head()

Unnamed: 0,Age,Expense,dist_k1,dist_k2,dist_k3,dist_k4
0,68,745,429.0,293.33428,90.199778,204.247399
1,50,401,86.884981,51.156622,434.663088,543.721436
2,74,835,519.034681,383.521838,0.0,120.768373
3,22,944,629.68246,493.039552,120.768373,0.0
4,54,452,136.718689,0.0,383.521838,493.039552
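
The four per-centroid columns can also be computed in one step with NumPy broadcasting. The sketch below reuses the first five rows of the table above and the initial centroids `k1` to `k4`; the array names `points`, `centroids` and `dists` are introduced here for illustration.

```python
import numpy as np

# First five rows of the table above: (Age, Expense).
points = np.array([[68, 745], [50, 401], [74, 835], [22, 944], [54, 452]], dtype=float)
# Initial centroids k1..k4 from the notebook.
centroids = np.array([[68, 316], [54, 452], [74, 835], [22, 944]], dtype=float)

# Broadcasting: (5, 1, 2) - (1, 4, 2) -> (5, 4, 2). Taking the norm over the
# last axis gives a (5, 4) matrix whose column j holds the distance to centroid j.
dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
print(dists.round(6))
```

Each row of `dists` should agree with the `dist_k1` to `dist_k4` values of the corresponding row in the table.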


In [14]:
df.head(20)

Unnamed: 0,Age,Expense,dist_k1,dist_k2,dist_k3,dist_k4
0,68,745,429.0,293.33428,90.199778,204.247399
1,50,401,86.884981,51.156622,434.663088,543.721436
2,74,835,519.034681,383.521838,0.0,120.768373
3,22,944,629.68246,493.039552,120.768373,0.0
4,54,452,136.718689,0.0,383.521838,493.039552
5,71,206,110.040901,246.586699,629.007154,739.624905
6,40,705,390.00641,253.387056,134.372616,239.676866
7,41,644,329.109404,192.439601,193.829822,300.601065
8,44,609,293.981292,157.318149,227.982455,335.721611
9,71,734,418.010765,282.511947,101.044545,215.640905


In [16]:
df[['dist_k1','dist_k2','dist_k3','dist_k4']].min(axis=1)

0      90.199778
1      51.156622
2       0.000000
3       0.000000
4       0.000000
5     110.040901
6     134.372616
7     192.439601
8     157.318149
9     101.044545
10     44.294469
11     40.311289
12     79.056942
13    144.086779
14    108.046286
15     48.918299
16    134.391220
17     23.537205
18     58.137767
19      0.000000
dtype: float64
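
These minimum distances are the terms of the quantity K-means tries to minimize: squaring and summing them gives the within-cluster sum of squares for the current centroids. A minimal sketch, with the 20 values copied from the output above (the names `min_dist` and `wcss` are introduced here):

```python
import numpy as np

# The 20 minimum distances printed above.
min_dist = np.array([
    90.199778, 51.156622, 0.0, 0.0, 0.0,
    110.040901, 134.372616, 192.439601, 157.318149, 101.044545,
    44.294469, 40.311289, 79.056942, 144.086779, 108.046286,
    48.918299, 134.391220, 23.537205, 58.137767, 0.0,
])

# Within-cluster sum of squares for the initial centroids.
wcss = float((min_dist ** 2).sum())
print(round(wcss, 2))
```

Recomputing this value after each iteration is one way to watch the algorithm make progress, since it should not increase from one iteration to the next.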

In [25]:
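# Label each row with the cluster whose centroid is nearest.
dist_cols = ['dist_k1','dist_k2','dist_k3','dist_k4']
r1 = []
for _, row in df.iterrows():
    nearest = row[dist_cols].min()  # smallest of the four distances
    if nearest == row['dist_k1']:
        r1.append('C1')
    elif nearest == row['dist_k2']:
        r1.append('C2')
    elif nearest == row['dist_k3']:
        r1.append('C3')
    else:
        r1.append('C4')
print(r1)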

['C3', 'C2', 'C3', 'C4', 'C2', 'C1', 'C3', 'C2', 'C2', 'C3', 'C4', 'C3', 'C1', 'C3', 'C3', 'C4', 'C2', 'C2', 'C3', 'C1']
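
The same labels can be obtained without an explicit loop. The sketch below uses pandas `idxmin(axis=1)`, which returns the name of the column holding each row's minimum; the small frame `d` holds the distances of the first five rows copied from the table earlier in the notebook, and the mapping dictionary is an assumption of this sketch.

```python
import pandas as pd

dist_cols = ['dist_k1', 'dist_k2', 'dist_k3', 'dist_k4']
# Distances for the first five rows, copied from the notebook output.
d = pd.DataFrame([
    [429.0,      293.33428,  90.199778,  204.247399],
    [86.884981,  51.156622,  434.663088, 543.721436],
    [519.034681, 383.521838, 0.0,        120.768373],
    [629.68246,  493.039552, 120.768373, 0.0],
    [136.718689, 0.0,        383.521838, 493.039552],
], columns=dist_cols)

# Column name of the smallest distance per row, mapped to a cluster label.
to_label = {'dist_k1': 'C1', 'dist_k2': 'C2', 'dist_k3': 'C3', 'dist_k4': 'C4'}
labels = d.idxmin(axis=1).map(to_label)
print(labels.tolist())  # ['C3', 'C2', 'C3', 'C4', 'C2']
```

These five labels match the first five entries of the list printed by the loop.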


In [27]:
df['Cluster_Iter1']  = r1
df['Min_Dist'] = df[['dist_k1','dist_k2','dist_k3','dist_k4']].min(axis=1)
df.head(20)

Unnamed: 0,Age,Expense,dist_k1,dist_k2,dist_k3,dist_k4,Cluster_Iter1,Min_Dist
0,68,745,429.0,293.33428,90.199778,204.247399,C3,90.199778
1,50,401,86.884981,51.156622,434.663088,543.721436,C2,51.156622
2,74,835,519.034681,383.521838,0.0,120.768373,C3,0.0
3,22,944,629.68246,493.039552,120.768373,0.0,C4,0.0
4,54,452,136.718689,0.0,383.521838,493.039552,C2,0.0
5,71,206,110.040901,246.586699,629.007154,739.624905,C1,110.040901
6,40,705,390.00641,253.387056,134.372616,239.676866,C3,134.372616
7,41,644,329.109404,192.439601,193.829822,300.601065,C2,192.439601
8,44,609,293.981292,157.318149,227.982455,335.721611,C2,157.318149
9,71,734,418.010765,282.511947,101.044545,215.640905,C3,101.044545


In [28]:
df.sort_values(by='Cluster_Iter1')

Unnamed: 0,Age,Expense,dist_k1,dist_k2,dist_k3,dist_k4,Cluster_Iter1,Min_Dist
19,68,316,0.0,136.718689,519.034681,629.68246,C1,0.0
5,71,206,110.040901,246.586699,629.007154,739.624905,C1,110.040901
12,71,237,79.056942,215.671046,598.007525,708.695986,C1,79.056942
1,50,401,86.884981,51.156622,434.663088,543.721436,C2,51.156622
17,49,475,160.131196,23.537205,360.867012,469.776543,C2,23.537205
16,24,583,270.601183,134.39122,256.912436,361.00554,C2,134.39122
4,54,452,136.718689,0.0,383.521838,493.039552,C2,0.0
7,41,644,329.109404,192.439601,193.829822,300.601065,C2,192.439601
8,44,609,293.981292,157.318149,227.982455,335.721611,C2,157.318149
14,19,742,428.808815,292.104433,108.046286,202.022276,C3,108.046286


In [29]:
dfC1 = df[df['Cluster_Iter1']=='C1']
dfC2 = df[df['Cluster_Iter1']=='C2']
dfC3 = df[df['Cluster_Iter1']=='C3']
dfC4 = df[df['Cluster_Iter1']=='C4']
print(dfC1.shape)
print(dfC2.shape)
print(dfC3.shape)
print(dfC4.shape)

(3, 8)
(6, 8)
(8, 8)
(3, 8)


In [33]:
k1_iter2 = [dfC1['Age'].mean(),dfC1['Expense'].mean()]
k2_iter2 = [dfC2['Age'].mean(),dfC2['Expense'].mean()]
k3_iter2 = [dfC3['Age'].mean(),dfC3['Expense'].mean()]
k4_iter2 = [dfC4['Age'].mean(),dfC4['Expense'].mean()]
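
As an alternative to building one DataFrame per cluster, the centroid update can be written as a single `groupby`. The sketch below uses only five rows copied from the sorted table above (all three C1 rows and two of the six C2 rows), so only the C1 result is comparable with the full notebook data; the names `rows` and `centroids_iter2` are introduced here.

```python
import pandas as pd

# All three C1 rows and two of the C2 rows, copied from the sorted table.
rows = pd.DataFrame({
    'Age': [68, 71, 71, 50, 49],
    'Expense': [316, 206, 237, 401, 475],
    'Cluster_Iter1': ['C1', 'C1', 'C1', 'C2', 'C2'],
})

# One mean per cluster and per feature.
centroids_iter2 = rows.groupby('Cluster_Iter1')[['Age', 'Expense']].mean()
print(centroids_iter2)
```

On the full `df`, the same `groupby` call would return all four updated centroids at once.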

### Updated Cluster Centroids for Iteration 2

In [34]:
print(k1_iter2)
print(k2_iter2)
print(k3_iter2)
print(k4_iter2)

[70.0, 253.0]
[43.666666666666664, 527.3333333333334]
[54.25, 768.25]
[45.666666666666664, 949.3333333333334]
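
The algorithm stops once the centroids no longer change, so a natural next step is to compare the updated centroids with the initial ones. A minimal sketch using the values printed above; the array names and the tolerance `atol=1e-6` are assumptions of this sketch.

```python
import numpy as np

# Initial centroids k1..k4 and the updated centroids printed above.
old = np.array([[68, 316], [54, 452], [74, 835], [22, 944]], dtype=float)
new = np.array([[70.0, 253.0],
                [43.666666666666664, 527.3333333333334],
                [54.25, 768.25],
                [45.666666666666664, 949.3333333333334]])

# How far each centroid moved between the two iterations.
shift = np.linalg.norm(new - old, axis=1)
print(shift.round(2))

# Converged only if no centroid moved (up to a small tolerance).
converged = np.allclose(new, old, atol=1e-6)
print(converged)  # False: the centroids moved, so another iteration is needed
```

Since the centroids moved, the distance, assignment and update steps would be repeated with the new centroids.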
