## K-Means Clustering

    K-Means clustering is a widely used unsupervised machine learning algorithm.
    It partitions "n" objects into "k" clusters, where each object belongs to the cluster with the nearest mean (centroid).
    
    1. The main goal of K-Means clustering is to group similar data points into the same cluster.
    2. In K-Means clustering, "k" is the total number of groups, or clusters.
    3. K-Means clustering assigns points to clusters using the Euclidean distance.
    
**Euclidean Distance :**

    Distance between two points (x1,y1) and (x2,y2):
d = $\sqrt{(x_2-x_1)^2+(y_2-y_1)^2}$
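
The distance calculation can be sketched in a few lines of Python. The function names `euclidean_distance` and `squared_euclidean_distance` are illustrative and not part of the notebook; the squared form is included because the worked example below compares squared distances, which gives the same nearest centroid without the square root.

```python
import math

def squared_euclidean_distance(p, q):
    """Sum of squared coordinate differences between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def euclidean_distance(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(squared_euclidean_distance(p, q))

print(euclidean_distance((0, 0), (3, 4)))          # 5.0
print(squared_euclidean_distance((2, 3), (4, 2)))  # 5
```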

**Let us walk through the algorithm step by step using the example below.**
    
    1. Consider 4 data points A,B,C,D as below:
    
        X1   X2
    A   2    3
    B   6    1
    C   1    2
    D   3    0

    2. Choose two initial centroids, AB and CD, calculated as:
    
    AB = Average of A,B
    CD = Average of C,D
    
        X1   X2
    AB  4    2
    CD  2    1

In [20]:
(2+6)/2  # AB X1

4.0

In [21]:
(3+1)/2  # AB X2

2.0

In [22]:
(1+3)/2  # CD X1

2.0

In [23]:
(2+0)/2  # CD X2

1.0
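
The four averaging cells above can be folded into one helper. This is a minimal sketch; the `points` dictionary and the `centroid` function are names introduced here for illustration.

```python
# Data points from step 1, keyed by name.
points = {"A": (2, 3), "B": (6, 1), "C": (1, 2), "D": (3, 0)}

def centroid(members):
    """Mean of each coordinate over the given points."""
    n = len(members)
    return tuple(sum(coords) / n for coords in zip(*members))

AB = centroid([points["A"], points["B"]])
CD = centroid([points["C"], points["D"]])
print(AB, CD)  # (4.0, 2.0) (2.0, 1.0)
```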

    3. Calculate the squared Euclidean distance from every data point to the centroids AB and CD.
    For example, the squared distance between A(2,3) and AB(4,2) is s = (2-4)^2 + (3-2)^2 = 5.
    
        A    B    C    D
    AB  5    5    9    5
    CD  4    16   2    2
    
    A is nearer to CD (4) than to AB (5).

In [25]:
(2-4)**2 + (3-2)**2  # A(X1,X2) AB(4,2)

5

In [26]:
(6-4)**2 + (1-2)**2  # B(X1,X2) AB(4,2)

5

In [29]:
(1-4)**2 + (2-2)**2  # C(X1,X2) AB(4,2)

9

In [30]:
(3-4)**2 + (0-2)**2  # D(X1,X2) AB(4,2)

5

In [31]:
(2-2)**2 + (3-1)**2  # A(X1,X2) CD(2,1)

4

In [32]:
(6-2)**2 + (1-1)**2  # B(X1,X2) CD(2,1)

16

In [33]:
(1-2)**2 + (2-1)**2  # C(X1,X2) CD(2,1)

2

In [34]:
(3-2)**2 + (0-1)**2  # D(X1,X2) CD(2,1)

2
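
The eight distance cells above can also be computed in one pass. A minimal sketch, assuming the hypothetical helper `sq_dist` and dictionaries `points` and `centroids` that hold the values from steps 1 and 2:

```python
points = {"A": (2, 3), "B": (6, 1), "C": (1, 2), "D": (3, 0)}
centroids = {"AB": (4, 2), "CD": (2, 1)}

def sq_dist(p, c):
    """Squared Euclidean distance between point p and centroid c."""
    return sum((a - b) ** 2 for a, b in zip(p, c))

# One row per centroid, one column per data point.
table = {
    cname: {pname: sq_dist(p, c) for pname, p in points.items()}
    for cname, c in centroids.items()
}
for cname, row in table.items():
    print(cname, row)
# AB {'A': 5, 'B': 5, 'C': 9, 'D': 5}
# CD {'A': 4, 'B': 16, 'C': 2, 'D': 2}
```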

    4. Observe that the distance between A and CD (4) is less than the distance between A and AB (5).
    Since point A is closer to CD, we move A to the CD cluster.
    
    5. Two clusters have been formed so far: B and ACD. Recompute their centroids as in step 2.
    
    ACD = Average of A,C,D
    B = B
    
          X1  X2
    B     6   1
    ACD   2   1.67
    
    New centroids B,ACD

In [35]:
(2+1+3)/3  # ACD X1

2.0

In [36]:
(3+2+0)/3  # ACD X2

1.6666666666666667
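
Steps 4 and 5 together form one assign-then-update pass. The sketch below is one way to express it; the names `clusters` and `new_centroids` are illustrative, and the cluster keys keep the original labels AB and CD even though their membership changes to B and A, C, D.

```python
points = {"A": (2, 3), "B": (6, 1), "C": (1, 2), "D": (3, 0)}
centroids = {"AB": (4.0, 2.0), "CD": (2.0, 1.0)}

def sq_dist(p, c):
    return sum((a - b) ** 2 for a, b in zip(p, c))

# Assignment step: each point joins the cluster of its nearest centroid.
clusters = {name: [] for name in centroids}
for pname, p in points.items():
    nearest = min(centroids, key=lambda c: sq_dist(p, centroids[c]))
    clusters[nearest].append(pname)

# Update step: each centroid becomes the mean of its members.
new_centroids = {}
for cname, members in clusters.items():
    coords = [points[m] for m in members]
    new_centroids[cname] = tuple(sum(v) / len(coords) for v in zip(*coords))

print(clusters)       # {'AB': ['B'], 'CD': ['A', 'C', 'D']}
print(new_centroids)  # {'AB': (6.0, 1.0), 'CD': (2.0, 1.6666666666666667)}
```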

    6. Since K-Means is an iterative procedure, we now calculate the squared distance from all points (A,B,C,D) to the new
    centroids (B,ACD), as in step 3.
    
         A      B      C      D
    B    20     0      26     10
    ACD  1.77   16.45  1.11   3.79

In [37]:
(2-6)**2 + (3-1)**2  # A(X1,X2)  B(6,1)

20

In [38]:
(6-6)**2 + (1-1)**2  # B(X1,X2) B(6,1)

0

In [39]:
(1-6)**2 + (2-1)**2  # C(X1,X2) B(6,1)

26

In [40]:
(3-6)**2 + (0-1)**2  # D(X1,X2) B(6,1)

10

In [41]:
(2-2)**2 + (3-1.67)**2  # A(X1,X2)  ACD(2,1.67)

1.7689000000000001

In [42]:
(6-2)**2 + (1-1.67)**2  # B(X1,X2) ACD(2,1.67)

16.4489

In [43]:
(1-2)**2 + (2-1.67)**2  # C(X1,X2) ACD(2,1.67)

1.1089

In [44]:
(3-2)**2 + (0-1.67)**2  # D(X1,X2) ACD(2,1.67)

3.7889
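
The ACD cells above use the rounded centroid coordinate 1.67. As a check, the sketch below repeats the calculation with the exact value 5/3 using Python's `fractions` module; the variable names are illustrative. The values shift slightly in the second decimal place, but every point keeps the same nearest centroid.

```python
from fractions import Fraction

points = {"A": (2, 3), "B": (6, 1), "C": (1, 2), "D": (3, 0)}
acd = (Fraction(2), Fraction(5, 3))  # exact centroid of A, C, D

# Exact squared distance from each point to the ACD centroid.
exact = {
    name: (p[0] - acd[0]) ** 2 + (p[1] - acd[1]) ** 2
    for name, p in points.items()
}
for name, d in exact.items():
    print(name, d, round(float(d), 2))
# A 16/9 1.78
# B 148/9 16.44
# C 10/9 1.11
# D 34/9 3.78
```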

    7. Each point is closest to the centroid of its own cluster: for example, "A" is far from cluster "B" (20) and near cluster "ACD" (1.77).
    All data points stay assigned to the same clusters (B,ACD) based on their minimum distance,
    so no assignment changes and the iterative procedure ends here.
    
    8. To conclude, we started with two centroids and ended up with two clusters, K=2.
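
The whole walkthrough can be condensed into a short loop that repeats the assign and update steps until the assignments stop changing. This is a minimal pure-Python sketch under the example's assumptions (two starting centroids at AB and CD); the function `kmeans` and its `max_iter` parameter are hypothetical names, not a library API.

```python
def sq_dist(p, c):
    return sum((a - b) ** 2 for a, b in zip(p, c))

def kmeans(points, centroids, max_iter=100):
    """Repeat assign/update until no point changes cluster."""
    centroids = dict(centroids)  # work on a copy, leave the input untouched
    assignment = None
    for _ in range(max_iter):
        new_assignment = {
            name: min(centroids, key=lambda c: sq_dist(p, centroids[c]))
            for name, p in points.items()
        }
        if new_assignment == assignment:
            break  # converged: assignments are unchanged
        assignment = new_assignment
        for cname in centroids:
            members = [points[n] for n, c in assignment.items() if c == cname]
            if members:  # an empty cluster keeps its previous centroid
                centroids[cname] = tuple(
                    sum(v) / len(members) for v in zip(*members)
                )
    return assignment, centroids

points = {"A": (2, 3), "B": (6, 1), "C": (1, 2), "D": (3, 0)}
assignment, final_centroids = kmeans(points, {"AB": (4, 2), "CD": (2, 1)})
print(assignment)  # {'A': 'CD', 'B': 'AB', 'C': 'CD', 'D': 'CD'}
```

The result depends on the starting centroids; a different initial choice can lead to a different final grouping.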

## K-Means Clustering Problem

In [45]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [46]:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'

In [51]:
col_names = ['sepal-length','sepal-width','petal-length','petal-width','class']

In [52]:
df = pd.read_csv(url, names=col_names)

In [53]:
df.head()

   sepal-length  sepal-width  petal-length  petal-width        class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa


In [54]:
df['class'].nunique()

3

### Split data into attributes and labels

In [55]:
X = df.iloc[:,:-1].values

In [56]:
X

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
       [4.9, 3

In [63]:
Y = df.iloc[:,4].values

In [64]:
Y

array(['Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-versicolor', 'Iris-versicolor',
       'Iris-versicolor', 'Iris-versicolor', 'Iris-versicolor',
       'Iris-versicolor', 'Iris-versicolor', 'Iris-versic

### Training & Testing Data

In [65]:
from sklearn.model_selection import train_test_split

In [66]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20)
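
Because the split above is random, each run produces different train and test sets unless a seed is fixed (in scikit-learn, `train_test_split` accepts a `random_state` argument for this). The pure-Python sketch below illustrates the idea behind a seeded 80/20 shuffled split; `shuffled_split` is a hypothetical helper and is not how scikit-learn implements it.

```python
import random

def shuffled_split(n_samples, test_size=0.20, seed=0):
    """Shuffle row indices with a fixed seed and cut off a test portion."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_size)
    return indices[n_test:], indices[:n_test]

# 150 rows, as in the Iris data set.
train_idx, test_idx = shuffled_split(150)
print(len(train_idx), len(test_idx))  # 120 30
```

Calling `shuffled_split(150)` again with the same seed returns the same indices, which is what makes downstream results repeatable.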

In [67]:
X_train

array([[5.1, 3.8, 1.6, 0.2],
       [6.5, 2.8, 4.6, 1.5],
       [5. , 3.5, 1.6, 0.6],
       [5. , 3. , 1.6, 0.2],
       [4.3, 3. , 1.1, 0.1],
       [5.9, 3.2, 4.8, 1.8],
       [6.7, 3.1, 4.7, 1.5],
       [6.9, 3.1, 4.9, 1.5],
       [5.9, 3. , 4.2, 1.5],
       [5.1, 3.7, 1.5, 0.4],
       [6.9, 3.1, 5.4, 2.1],
       [6.4, 3.2, 5.3, 2.3],
       [5.4, 3.4, 1.5, 0.4],
       [6.1, 2.8, 4. , 1.3],
       [6.4, 2.7, 5.3, 1.9],
       [5. , 3.5, 1.3, 0.3],
       [6.2, 2.9, 4.3, 1.3],
       [5.7, 3. , 4.2, 1.2],
       [6.3, 2.9, 5.6, 1.8],
       [4.9, 2.5, 4.5, 1.7],
       [4.8, 3.4, 1.9, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [6.4, 2.8, 5.6, 2.1],
       [6.8, 3. , 5.5, 2.1],
       [6.9, 3.2, 5.7, 2.3],
       [7.3, 2.9, 6.3, 1.8],
       [6.7, 3.3, 5.7, 2.5],
       [5.7, 4.4, 1.5, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [6.5, 3. , 5.8, 2.2],
       [6. , 2.2, 5. , 1.5],
       [5.8, 4. , 1.2, 0.2],
       [5.1, 3.5, 1.4, 0.3],
       [7.7, 3. , 6.1, 2.3],
       [5.1, 3

In [69]:
Y_train

array(['Iris-setosa', 'Iris-versicolor', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-versicolor', 'Iris-versicolor',
       'Iris-versicolor', 'Iris-versicolor', 'Iris-setosa',
       'Iris-virginica', 'Iris-virginica', 'Iris-setosa',
       'Iris-versicolor', 'Iris-virginica', 'Iris-setosa',
       'Iris-versicolor', 'Iris-versicolor', 'Iris-virginica',
       'Iris-virginica', 'Iris-setosa', 'Iris-setosa', 'Iris-virginica',
       'Iris-virginica', 'Iris-virginica', 'Iris-virginica',
       'Iris-virginica', 'Iris-setosa', 'Iris-setosa', 'Iris-virginica',
       'Iris-virginica', 'Iris-setosa', 'Iris-setosa', 'Iris-virginica',
       'Iris-setosa', 'Iris-setosa', 'Iris-virginica', 'Iris-versicolor',
       'Iris-virginica', 'Iris-setosa', 'Iris-versicolor',
       'Iris-versicolor', 'Iris-virginica', 'Iris-versicolor',
       'Iris-versicolor', 'Iris-versicolor', 'Iris-setosa', 'Iris-setosa',
       'Iris-virginica', 'Iris-versicolor', 'Iris-versicolor',
       'Iris-vi

In [70]:
X_test

array([[5.3, 3.7, 1.5, 0.2],
       [6.6, 3. , 4.4, 1.4],
       [6.1, 2.8, 4.7, 1.2],
       [5.8, 2.7, 5.1, 1.9],
       [4.5, 2.3, 1.3, 0.3],
       [4.9, 2.4, 3.3, 1. ],
       [6.3, 3.4, 5.6, 2.4],
       [6. , 2.2, 4. , 1. ],
       [6.1, 3. , 4.6, 1.4],
       [5.8, 2.7, 5.1, 1.9],
       [5.4, 3.9, 1.7, 0.4],
       [5.2, 3.4, 1.4, 0.2],
       [7.2, 3.2, 6. , 1.8],
       [5.1, 3.8, 1.5, 0.3],
       [5.1, 3.5, 1.4, 0.2],
       [5.4, 3. , 4.5, 1.5],
       [7.4, 2.8, 6.1, 1.9],
       [5.6, 3. , 4.1, 1.3],
       [6. , 2.9, 4.5, 1.5],
       [6.9, 3.1, 5.1, 2.3],
       [6.7, 3.3, 5.7, 2.1],
       [5. , 3.4, 1.6, 0.4],
       [7.1, 3. , 5.9, 2.1],
       [5.7, 2.5, 5. , 2. ],
       [6.3, 2.3, 4.4, 1.3],
       [7.7, 3.8, 6.7, 2.2],
       [5. , 3.3, 1.4, 0.2],
       [6. , 3.4, 4.5, 1.6],
       [4.9, 3.1, 1.5, 0.1],
       [7.9, 3.8, 6.4, 2. ]])

In [71]:
Y_test

array(['Iris-setosa', 'Iris-versicolor', 'Iris-versicolor',
       'Iris-virginica', 'Iris-setosa', 'Iris-versicolor',
       'Iris-virginica', 'Iris-versicolor', 'Iris-versicolor',
       'Iris-virginica', 'Iris-setosa', 'Iris-setosa', 'Iris-virginica',
       'Iris-setosa', 'Iris-setosa', 'Iris-versicolor', 'Iris-virginica',
       'Iris-versicolor', 'Iris-versicolor', 'Iris-virginica',
       'Iris-virginica', 'Iris-setosa', 'Iris-virginica',
       'Iris-virginica', 'Iris-versicolor', 'Iris-virginica',
       'Iris-setosa', 'Iris-versicolor', 'Iris-setosa', 'Iris-virginica'],
      dtype=object)

### Normalization (StandardScaler)

In [72]:
from sklearn.preprocessing import StandardScaler

In [73]:
scaler = StandardScaler()

In [74]:
scaler.fit(X_train)

StandardScaler()

In [75]:
X_train = scaler.transform(X_train)

In [76]:
X_train

array([[-0.88266603,  1.78964731, -1.2040804 , -1.28673384],
       [ 0.85169529, -0.58074648,  0.51059106,  0.41870778],
       [-1.00654898,  1.07852917, -1.2040804 , -0.76198257],
       [-1.00654898, -0.10666772, -1.2040804 , -1.28673384],
       [-1.87372965, -0.10666772, -1.48985898, -1.41792166],
       [ 0.10839758,  0.36741104,  0.62490249,  0.81227124],
       [ 1.0994612 ,  0.13037166,  0.56774677,  0.41870778],
       [ 1.3472271 ,  0.13037166,  0.6820582 ,  0.41870778],
       [ 0.10839758, -0.10666772,  0.2819682 ,  0.41870778],
       [-0.88266603,  1.55260793, -1.26123612, -1.02435821],
       [ 1.3472271 ,  0.13037166,  0.96783678,  1.20583469],
       [ 0.72781234,  0.36741104,  0.91068107,  1.46821032],
       [-0.51101718,  0.84148979, -1.26123612, -1.02435821],
       [ 0.35616349, -0.58074648,  0.16765677,  0.15633215],
       [ 0.72781234, -0.81778585,  0.91068107,  0.94345905],
       [-1.00654898,  1.07852917, -1.37554755, -1.15554603],
       [ 0.48004644, -0.

In [77]:
X_test = scaler.transform(X_test)
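
`StandardScaler` subtracts each column's mean and divides by its standard deviation, and the statistics learned from the training data are reused for the test data. The sketch below shows that arithmetic in plain Python on a small made-up sample; `fit_standardizer` and `standardize` are hypothetical helpers, and the population standard deviation is used to match the scaler's behaviour.

```python
import statistics

def fit_standardizer(rows):
    """Per-column mean and population standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    stds = [statistics.pstdev(c) for c in columns]
    return means, stds

def standardize(rows, means, stds):
    """Apply (value - mean) / std column by column."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Illustrative two-column sample, not the real split.
train = [[5.1, 3.5], [4.9, 3.0], [6.5, 2.8], [5.9, 3.2]]
test = [[5.0, 3.4]]

means, stds = fit_standardizer(train)        # learn statistics from train only
train_scaled = standardize(train, means, stds)
test_scaled = standardize(test, means, stds)  # reuse the train statistics
```

After scaling, each training column has mean 0 and standard deviation 1; the test columns generally do not, because they are scaled with the training statistics.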

In [78]:
X_test

array([[-0.63490013,  1.55260793, -1.26123612, -1.28673384],
       [ 0.97557825, -0.10666772,  0.39627963,  0.28751997],
       [ 0.35616349, -0.58074648,  0.56774677,  0.02514433],
       [-0.01548537, -0.81778585,  0.79636963,  0.94345905],
       [-1.62596374, -1.76594337, -1.37554755, -1.15554603],
       [-1.13043194, -1.52890399, -0.23243324, -0.2372313 ],
       [ 0.60392939,  0.84148979,  1.08214821,  1.59939814],
       [ 0.23228053, -2.00298275,  0.16765677, -0.2372313 ],
       [ 0.35616349, -0.10666772,  0.51059106,  0.28751997],
       [-0.01548537, -0.81778585,  0.79636963,  0.94345905],
       [-0.51101718,  2.02668668, -1.14692469, -1.02435821],
       [-0.75878308,  0.84148979, -1.31839183, -1.28673384],
       [ 1.71887596,  0.36741104,  1.31077107,  0.81227124],
       [-0.88266603,  1.78964731, -1.26123612, -1.15554603],
       [-0.88266603,  1.07852917, -1.31839183, -1.28673384],
       [-0.51101718, -0.10666772,  0.45343534,  0.41870778],
       [ 1.96664186, -0.

### Model

In [90]:
from sklearn.neighbors import KNeighborsClassifier

In [91]:
model = KNeighborsClassifier(n_neighbors=5)

In [92]:
model.fit(X_train, Y_train)

KNeighborsClassifier()

In [93]:
y_pred = model.predict(X_test)

In [94]:
y_pred

array(['Iris-setosa', 'Iris-versicolor', 'Iris-versicolor',
       'Iris-virginica', 'Iris-setosa', 'Iris-versicolor',
       'Iris-virginica', 'Iris-versicolor', 'Iris-versicolor',
       'Iris-virginica', 'Iris-setosa', 'Iris-setosa', 'Iris-virginica',
       'Iris-setosa', 'Iris-setosa', 'Iris-versicolor', 'Iris-virginica',
       'Iris-versicolor', 'Iris-versicolor', 'Iris-virginica',
       'Iris-virginica', 'Iris-setosa', 'Iris-virginica',
       'Iris-virginica', 'Iris-versicolor', 'Iris-virginica',
       'Iris-setosa', 'Iris-versicolor', 'Iris-setosa', 'Iris-virginica'],
      dtype=object)
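
`KNeighborsClassifier` is scikit-learn's k-nearest-neighbours classifier: a supervised method that uses the labels in `Y_train`, unlike the unsupervised K-Means procedure walked through earlier. With `n_neighbors=5`, a prediction is a majority vote among the five closest training points. A minimal pure-Python sketch of that idea follows; `knn_predict` and the toy data are hypothetical, and this is not scikit-learn's implementation.

```python
from collections import Counter

def knn_predict(train_points, train_labels, query, k=5):
    """Label a query point by majority vote among its k nearest neighbours."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], query)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Toy two-class data, for illustration only.
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
                (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_points, train_labels, (1.1, 1.0), k=3))  # a
print(knn_predict(train_points, train_labels, (5.1, 5.0), k=3))  # b
```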

### Accuracy

In [95]:
from sklearn.metrics import classification_report, confusion_matrix

In [96]:
print("Classification Report:")
print(classification_report(Y_test, y_pred))

Classification Report:
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00         9
Iris-versicolor       1.00      1.00      1.00        10
 Iris-virginica       1.00      1.00      1.00        11

       accuracy                           1.00        30
      macro avg       1.00      1.00      1.00        30
   weighted avg       1.00      1.00      1.00        30



In [98]:
print("Confusion Matrix")
print(confusion_matrix(Y_test, y_pred))

Confusion Matrix
[[ 9  0  0]
 [ 0 10  0]
 [ 0  0 11]]


In [100]:
pd.DataFrame(data=confusion_matrix(Y_test, y_pred),
            index=['setosa','versicolor','virginica'],
            columns=['setosa','versicolor','virginica'])

            setosa  versicolor  virginica
setosa           9           0          0
versicolor       0          10          0
virginica        0           0         11
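
In a confusion matrix, rows are the true classes and columns are the predicted classes, so correct predictions sit on the diagonal. The sketch below builds one by hand and derives accuracy from it; the `confusion` helper and the four-sample labels are made up for illustration.

```python
def confusion(y_true, y_pred, labels):
    """Count (true, predicted) pairs; rows are true, columns are predicted."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

labels = ["setosa", "versicolor", "virginica"]
y_true = ["setosa", "versicolor", "virginica", "virginica"]
y_hat = ["setosa", "virginica", "virginica", "virginica"]

matrix = confusion(y_true, y_hat, labels)
# Accuracy is the diagonal sum divided by the number of samples.
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(y_true)
print(matrix)    # [[1, 0, 0], [0, 0, 1], [0, 0, 2]]
print(accuracy)  # 0.75
```

In the notebook's matrix every off-diagonal entry is 0, which corresponds to the accuracy of 1.00 in the classification report.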


### Test & Prediction

In [107]:
output = pd.DataFrame([Y_test,y_pred], index=['Y-Test','Y-Prediction'])

In [108]:
output

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,20,21,22,23,24,25,26,27,28,29
Y-Test,Iris-setosa,Iris-versicolor,Iris-versicolor,Iris-virginica,Iris-setosa,Iris-versicolor,Iris-virginica,Iris-versicolor,Iris-versicolor,Iris-virginica,...,Iris-virginica,Iris-setosa,Iris-virginica,Iris-virginica,Iris-versicolor,Iris-virginica,Iris-setosa,Iris-versicolor,Iris-setosa,Iris-virginica
Y-Prediction,Iris-setosa,Iris-versicolor,Iris-versicolor,Iris-virginica,Iris-setosa,Iris-versicolor,Iris-virginica,Iris-versicolor,Iris-versicolor,Iris-virginica,...,Iris-virginica,Iris-setosa,Iris-virginica,Iris-virginica,Iris-versicolor,Iris-virginica,Iris-setosa,Iris-versicolor,Iris-setosa,Iris-virginica


In [109]:
output.transpose()

            Y-Test     Y-Prediction
0      Iris-setosa      Iris-setosa
1  Iris-versicolor  Iris-versicolor
2  Iris-versicolor  Iris-versicolor
3   Iris-virginica   Iris-virginica
4      Iris-setosa      Iris-setosa
5  Iris-versicolor  Iris-versicolor
6   Iris-virginica   Iris-virginica
7  Iris-versicolor  Iris-versicolor
8  Iris-versicolor  Iris-versicolor
9   Iris-virginica   Iris-virginica
