<a href="https://colab.research.google.com/github/proteus21/DATA-SCIENCE-STUDY/blob/main/Machine%20Learning/06_Clustering/01_k_means_implementation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### UNSUPERVISED LEARNING

# K-means implementation


scikit-learn is a basic library for machine learning in Python.
To install the scikit-learn library, use the command below:
```
!pip install scikit-learn
```
To update to the latest version of the scikit-learn library, use the command below:
```
!pip install --upgrade scikit-learn
```

### Contents:
1. [Import libraries](#0)
2. [Data generation](#1)
3. [Data visualisation](#2)
4. [Implementation of the K-means algorithm](#3)
5. [Implementation of the K-means algorithm - summary](#4)
6. [Implementation of the K-means algorithm - visualisation](#5)

In [21]:
!pip install scikit-learn

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


### <a name='0'></a> Import libraries

In [23]:
import numpy as np
import pandas as pd
from numpy.linalg import norm
import random
import plotly.express as px
import plotly.graph_objects as go

np.random.seed(42)
np.set_printoptions(precision=6)


### <a name='1'></a> Data generation

In [24]:
from sklearn.datasets import make_blobs
data=make_blobs(n_samples=40,centers=2,cluster_std=1.0, center_box=(-4.0,4.0), random_state=42)[0]
df=pd.DataFrame(data, columns=['x1','x2'])
df.head()

Unnamed: 0,x1,x2
0,0.37743,0.069424
1,2.217347,2.327304
2,1.376777,0.603609
3,-1.467097,3.139985
4,-1.605386,5.457993
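
The cell above keeps only element `[0]` of what `make_blobs` returns. As a small illustration (the names `X` and `y_true` are chosen here and do not appear in the notebook), the call returns a tuple holding the coordinates and the index of the blob each point was drawn from. The second element is not needed for clustering, but it can be handy for checking a result afterwards.

```python
from sklearn.datasets import make_blobs

# make_blobs returns (X, y): point coordinates and the generating blob of each point
X, y_true = make_blobs(n_samples=40, centers=2, cluster_std=1.0,
                       center_box=(-4.0, 4.0), random_state=42)

print(X.shape)               # (40, 2)
print(y_true.shape)          # (40,)
print(set(y_true.tolist()))  # {0, 1}
```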


### <a name='2'></a> Data visualisation

In [25]:
fig=px.scatter(df,'x1','x2',width=900, height=500, title='K-means algorithm')
fig.update_traces(marker_size=12)

### <a name='3'></a> Implementation of the K-means algorithm

In [26]:
# determination of the boundary values
x1_min=df.x1.min()
x1_max=df.x1.max()

x2_min=df.x2.min()
x2_max=df.x2.max()

print(x1_min,x1_max)
print(x2_min,x2_max)

-2.728596881734133 3.333845579232757
-1.1983010410246 5.457992635788267


In [27]:
# random generation of centroid coordinates
centroid_1=np.array([random.uniform(x1_min,x1_max),random.uniform(x2_min, x2_max)])
centroid_2=np.array([random.uniform(x1_min,x1_max),random.uniform(x2_min, x2_max)])
print(centroid_1)
print(centroid_2)

[-0.990841 -0.459394]
[2.676483 1.86425 ]
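
Note that `np.random.seed(42)` does not seed Python's built-in `random` module, so the centroids drawn with `random.uniform` above will differ between runs. A minimal sketch of a reproducible alternative, assuming a NumPy `Generator` is acceptable here (the bounds are copied, rounded, from the printed output above, and the names `rng`, `low`, `high` and `centroids` are illustrative):

```python
import numpy as np

# Bounds copied (rounded) from the printed output above; illustrative only.
x1_min, x1_max = -2.7286, 3.3338
x2_min, x2_max = -1.1983, 5.4580

# A local generator with a fixed seed gives the same starting centroids on every run.
rng = np.random.default_rng(42)

k = 2
low = np.array([x1_min, x2_min])
high = np.array([x1_max, x2_max])

# One row per centroid, each coordinate drawn uniformly within its own bounds.
centroids = rng.uniform(low, high, size=(k, 2))
print(centroids)
```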


In [28]:
# visualisation of the centroid starting points
fig = px.scatter(df,'x1','x2', width=900, height=500, title='K-means algorithm - centroid initialization')
# add the centroids as separate traces
fig.add_trace(go.Scatter(x=[centroid_1[0]], y=[centroid_1[1]], name='centroid 1', mode='markers', marker_line_width=3))
fig.add_trace(go.Scatter(x=[centroid_2[0]], y=[centroid_2[1]], name='centroid 2', mode='markers', marker_line_width=3))
fig.show()

In [29]:
# assigning points to the nearest centroid
clusters = []
for point in data:
    centroid_1_dist = norm(centroid_1 - point)
    centroid_2_dist = norm(centroid_2 - point)
    cluster = 1
    if centroid_1_dist > centroid_2_dist:
        cluster = 2
    clusters.append(cluster)
    
df['cluster'] = clusters
df.head()

Unnamed: 0,x1,x2,cluster
0,0.37743,0.069424,1
1,2.217347,2.327304,2
2,1.376777,0.603609,2
3,-1.467097,3.139985,1
4,-1.605386,5.457993,2
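
The assignment loop above can also be written without an explicit Python loop. The sketch below uses NumPy broadcasting on a small made-up set of points (the arrays `points` and `centroids` are toy data, not the notebook's), and adds 1 to the result so the labels match the 1-based labels used here.

```python
import numpy as np

# Toy data: two points near (0, 0) and two near (5, 5)
points = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 4.0], [5.0, 5.0]])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])

# distances[i, j] = Euclidean distance from point i to centroid j
distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)

# index of the nearest centroid, shifted to 1-based labels
labels = distances.argmin(axis=1) + 1
print(labels)  # [1 1 2 2]
```

On a tie `argmin` returns the first index, which agrees with the strict `>` comparison in the loop (a tied point stays in cluster 1).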


In [30]:
# assignment visualisation
fig = px.scatter(df,'x1','x2',color='cluster', width=900, height=500, title='K-means algorithm - iteration 1 - assignment to nearest centroid')
# add the centroids as separate traces
fig.add_trace(go.Scatter(x=[centroid_1[0]], y=[centroid_1[1]], name='centroid 1', mode='markers', marker_line_width=3))
fig.add_trace(go.Scatter(x=[centroid_2[0]], y=[centroid_2[1]], name='centroid 2', mode='markers', marker_line_width=3))
fig.update_traces(marker_size=12)
fig.update_layout(showlegend=False)
fig.show()


In [31]:
# calculate the new centroid coordinates as the mean of the points assigned to each cluster
new_centroid_1 = [df[df.cluster == 1].x1.mean(), df[df.cluster == 1].x2.mean()]
new_centroid_2 = [df[df.cluster == 2].x1.mean(), df[df.cluster == 2].x2.mean()]

print(new_centroid_1, new_centroid_2)

[-1.0943174658289914, 2.4251944931955136] [1.4775237603801776, 1.775767516567894]
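
The update step generalises to any number of clusters with a single `groupby`. This is a sketch on a made-up frame (`toy` is illustrative data, not the notebook's `df`):

```python
import pandas as pd

# Toy frame with the same column names as the notebook's df
toy = pd.DataFrame({'x1': [0.0, 2.0, 10.0, 12.0],
                    'x2': [0.0, 2.0, 10.0, 14.0],
                    'cluster': [1, 1, 2, 2]})

# one row per cluster label, holding the mean of each coordinate
new_centroids = toy.groupby('cluster')[['x1', 'x2']].mean()
print(new_centroids.loc[1].tolist())  # [1.0, 1.0]
print(new_centroids.loc[2].tolist())  # [11.0, 12.0]
```

One caveat under either approach: a cluster that receives no points has no mean. In the cell above the result would be `NaN`, while with `groupby` the label is simply absent from the result, so an empty cluster needs explicit handling.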


In [32]:
# visualisation of the updated centroids
fig = px.scatter(df,'x1','x2',color='cluster', width=900, height=500, title='K-means algorithm - calculation of new centroids')
# previous centroids
fig.add_trace(go.Scatter(x=[centroid_1[0]], y=[centroid_1[1]], name='centroid 1', mode='markers', marker_line_width=3))
fig.add_trace(go.Scatter(x=[centroid_2[0]], y=[centroid_2[1]], name='centroid 2', mode='markers', marker_line_width=3))
# recalculated centroids
fig.add_trace(go.Scatter(x=[new_centroid_1[0]], y=[new_centroid_1[1]], name='new centroid 1', mode='markers', marker_line_width=3))
fig.add_trace(go.Scatter(x=[new_centroid_2[0]], y=[new_centroid_2[1]], name='new centroid 2', mode='markers', marker_line_width=3))
fig.update_traces(marker_size=12)
fig.update_layout(showlegend=False)
fig.show()


In [33]:
# visualisation with only the updated centroids
fig = px.scatter(df,'x1','x2',color='cluster', width=900, height=500, title='K-means algorithm - updated centroids')
# add the updated centroids as separate traces
fig.add_trace(go.Scatter(x=[new_centroid_1[0]], y=[new_centroid_1[1]], name='centroid 1', mode='markers', marker_line_width=3))
fig.add_trace(go.Scatter(x=[new_centroid_2[0]], y=[new_centroid_2[1]], name='centroid 2', mode='markers', marker_line_width=3))
fig.update_traces(marker_size=12)
fig.update_layout(showlegend=False)
fig.show()

In [34]:
# reassign points to nearest centroid
clusters = []
for point in data:
    centroid_1_dist = norm(new_centroid_1 - point)
    centroid_2_dist = norm(new_centroid_2 - point)
    cluster = 1
    if centroid_1_dist > centroid_2_dist:
        cluster = 2
    clusters.append(cluster)
    
df['cluster'] = clusters
df.head()

Unnamed: 0,x1,x2,cluster
0,0.37743,0.069424,2
1,2.217347,2.327304,2
2,1.376777,0.603609,2
3,-1.467097,3.139985,1
4,-1.605386,5.457993,1


In [35]:
# assignment visualisation
fig = px.scatter(df,'x1','x2',color='cluster', width=900, height=500, title='K-means algorithm - iteration 2 - assignment to nearest centroid')
# add the current centroids (both coordinates come from the updated centroids)
fig.add_trace(go.Scatter(x=[new_centroid_1[0]], y=[new_centroid_1[1]], name='centroid 1', mode='markers', marker_line_width=3))
fig.add_trace(go.Scatter(x=[new_centroid_2[0]], y=[new_centroid_2[1]], name='centroid 2', mode='markers', marker_line_width=3))
fig.update_traces(marker_size=12)
fig.update_layout(showlegend=False)
fig.show()


In [36]:
# update the centroid coordinates
new_2_centroid_1 = [df[df.cluster == 1].x1.mean(), df[df.cluster == 1].x2.mean()]
new_2_centroid_2 = [df[df.cluster == 2].x1.mean(), df[df.cluster == 2].x2.mean()]

print(new_2_centroid_1, new_2_centroid_2)

[-1.184810430866379, 3.18988309513586] [1.8482624297593075, 0.8622246431993411]


In [37]:
# visualisation of the recalculated centroids
fig = px.scatter(df,'x1','x2',color='cluster', width=900, height=500, title='K-means algorithm - recalculation of centroids')
# centroids from the previous iteration
fig.add_trace(go.Scatter(x=[new_centroid_1[0]], y=[new_centroid_1[1]], name='centroid 1', mode='markers', marker_line_width=3))
fig.add_trace(go.Scatter(x=[new_centroid_2[0]], y=[new_centroid_2[1]], name='centroid 2', mode='markers', marker_line_width=3))
# recalculated centroids
fig.add_trace(go.Scatter(x=[new_2_centroid_1[0]], y=[new_2_centroid_1[1]], name='new centroid 1', mode='markers', marker_line_width=3))
fig.add_trace(go.Scatter(x=[new_2_centroid_2[0]], y=[new_2_centroid_2[1]], name='new centroid 2', mode='markers', marker_line_width=3))
fig.update_traces(marker_size=12)
fig.update_layout(showlegend=False)
fig.show()


In [38]:
# reassign points to the nearest centroid (using the centroids from the latest update)
clusters = []
for point in data:
    centroid_1_dist = norm(new_2_centroid_1 - point)
    centroid_2_dist = norm(new_2_centroid_2 - point)
    cluster = 1
    if centroid_1_dist > centroid_2_dist:
        cluster = 2
    clusters.append(cluster)

df['cluster'] = clusters
df.head()


Unnamed: 0,x1,x2,cluster
0,0.37743,0.069424,2
1,2.217347,2.327304,2
2,1.376777,0.603609,2
3,-1.467097,3.139985,1
4,-1.605386,5.457993,1


In [39]:
fig = px.scatter(df, 'x1', 'x2', color='cluster', width=950, height=500, 
                 title='K-means algorithm - update centroids')
fig.add_trace(go.Scatter(x=[new_2_centroid_1[0]], y=[new_2_centroid_1[1]], name='centroid 1', mode='markers', marker_line_width=3))
fig.add_trace(go.Scatter(x=[new_2_centroid_2[0]], y=[new_2_centroid_2[1]], name='centroid 2', mode='markers', marker_line_width=3))
fig.update_traces(marker_size=12)
fig.update_layout(showlegend=False)

### <a name='4'></a> Implementation of the K-means algorithm - summary

In [40]:
data = make_blobs(n_samples=40, centers=2, cluster_std=1.0, center_box=(-4.0, 4.0), random_state=42)[0]
df = pd.DataFrame(data, columns=['x1', 'x2'])
df.head()

x1_min = df.x1.min()
x1_max = df.x1.max()

x2_min = df.x2.min()
x2_max = df.x2.max()

centroid_1 = np.array([random.uniform(x1_min, x1_max), random.uniform(x2_min, x2_max)])
centroid_2 = np.array([random.uniform(x1_min, x1_max), random.uniform(x2_min, x2_max)])

for i in range(10):
    clusters = []
    for point in data:
        centroid_1_dist = norm(centroid_1 - point)
        centroid_2_dist = norm(centroid_2 - point)
        cluster = 1
        if centroid_1_dist > centroid_2_dist:
            cluster = 2
        clusters.append(cluster)

    df['cluster'] = clusters

    centroid_1 = [df[df.cluster == 1].x1.mean(), df[df.cluster == 1].x2.mean()]
    centroid_2 = [df[df.cluster == 2].x1.mean(), df[df.cluster == 2].x2.mean()]

# print the centroids produced by the loop above
print(centroid_1, centroid_2)


### <a name='5'></a> Implementation of the K-means algorithm - visualisation

In [41]:
fig = px.scatter(df, 'x1', 'x2', color='cluster', width=950, height=500, 
                 title='K-means algorithm - final result')
# plot the centroids produced by the loop in the summary cell
fig.add_trace(go.Scatter(x=[centroid_1[0]], y=[centroid_1[1]], name='centroid 1', mode='markers', marker_line_width=3))
fig.add_trace(go.Scatter(x=[centroid_2[0]], y=[centroid_2[1]], name='centroid 2', mode='markers', marker_line_width=3))
fig.update_traces(marker_size=12)
fig.update_layout(showlegend=False)
fig.show()

### SUMMARY

The k-means algorithm:
* we initially set the parameter k, i.e. the number of groups into which we want to divide our input data

* we assign each point to its nearest centroid, using a measure of the distance between points (usually Euclidean), and then move each centroid to the mean of the points assigned to it

* we repeat these two steps until stabilisation, i.e. until the groups obtained no longer change
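
The loop in the summary cell runs a fixed 10 iterations. Below is a minimal sketch of the "until stabilisation" rule for any k, under stated assumptions: the function name `k_means`, its parameters, and the choice to start from k randomly chosen data points are illustrative and are not what the notebook does (it samples centroids inside the bounding box).

```python
import numpy as np

def k_means(points, k, n_iter=100, seed=0):
    """Lloyd-style k-means that stops once the assignments no longer change."""
    rng = np.random.default_rng(seed)
    # Assumption: initialise from k distinct data points (fancy indexing returns a copy).
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = None
    for _ in range(n_iter):
        # distance from every point to every centroid, then nearest centroid per point
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # stabilised: no point changed its group
        labels = new_labels
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Usage on synthetic data: two groups of 20 points around (0, 0) and (5, 5)
rng = np.random.default_rng(1)
points = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)),
                    rng.normal(5.0, 0.5, size=(20, 2))])
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```

The labels here are 0-based, unlike the 1-based labels used earlier in the notebook.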


**Possible improvements or modifications:**

* different ways of finding distances

* changing the number of groups during the course of the algorithm (preventing over-unification and over-fragmentation)

* use of a weighted distance measure that takes into account the importance of attributes
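
To make the first and last items concrete, here is a small sketch of two alternative distance measures. The function names and the weights are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def weighted_distance(a, b, weights):
    """Weighted Euclidean distance: sqrt(sum_i w_i * (a_i - b_i)^2)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return np.sqrt(np.sum(weights * (a - b) ** 2))

def manhattan_distance(a, b):
    """Sum of absolute coordinate differences."""
    return np.sum(np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)))

p, c = [0.0, 0.0], [3.0, 4.0]
print(weighted_distance(p, c, [1.0, 1.0]))  # 5.0 (equal weights: plain Euclidean)
print(weighted_distance(p, c, [1.0, 0.0]))  # 3.0 (attribute x2 ignored)
print(manhattan_distance(p, c))             # 7.0
```

Either function could replace `norm(centroid - point)` in the assignment step. Keep in mind that the mean is the natural cluster centre for Euclidean distance specifically; with Manhattan distance the coordinate-wise median plays that role (the k-medians variant).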

**Advantages of the k-means algorithm:**
* low computational cost: each iteration is linear in the number of points, so the algorithm scales well
* for large data sets and a small number of groups, the algorithm is typically much faster than other algorithms in this class
* grouped sets are generally tighter and more compact

**Disadvantages:**
* does not help to determine the number of groups (K)
* different initial values lead to different results
* only works well for "spherical" clusters with homogeneous density
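
A common heuristic for the first two points is the "elbow" method: fit the model for several values of k, each with several restarts, and look at where the within-cluster sum of squares stops dropping sharply. The sketch below uses scikit-learn's `KMeans` and its `inertia_` attribute; the range of k and the values of `n_init` and `random_state` are arbitrary choices for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same data-generating call as in the notebook
X, _ = make_blobs(n_samples=40, centers=2, cluster_std=1.0,
                  center_box=(-4.0, 4.0), random_state=42)

inertias = {}
for k in range(1, 6):
    # n_init=10 reruns the algorithm from 10 initialisations and keeps the best run,
    # which reduces the sensitivity to the starting centroids
    model = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42).fit(X)
    inertias[k] = model.inertia_  # sum of squared distances to the nearest centroid

for k, value in inertias.items():
    print(k, round(value, 2))
```

The elbow is read off by eye, so it is a guide rather than a rule.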

**Using scikit-learn**

In [52]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2, init='k-means++')
kmeans.fit(data)





In [53]:
y_kmeans = kmeans.predict(data)
y_kmeans[:10]

array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0], dtype=int32)

In [54]:
df['y_kmeans'] = y_kmeans
df.head()

Unnamed: 0,x1,x2,cluster,y_kmeans
0,0.37743,0.069424,1,1
1,2.217347,2.327304,1,1
2,1.376777,0.603609,1,1
3,-1.467097,3.139985,2,0
4,-1.605386,5.457993,2,0
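
In the table above the `cluster` and `y_kmeans` columns use different label values (1/2 versus 0/1), but they describe the same grouping: cluster numbers are arbitrary. One way to compare two labelings regardless of naming is the adjusted Rand index. The sketch below uses only the five rows shown above (the list names are illustrative).

```python
from sklearn.metrics import adjusted_rand_score

# Labels of the five rows shown in the table above
custom_labels = [1, 1, 1, 2, 2]    # column 'cluster'
sklearn_labels = [1, 1, 1, 0, 0]   # column 'y_kmeans'

# 1.0 means the two partitions are identical up to renaming of the labels
score = adjusted_rand_score(custom_labels, sklearn_labels)
print(score)  # 1.0
```

On the full data this would be `adjusted_rand_score(df['cluster'], df['y_kmeans'])`.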


Cluster visualisation

In [58]:
px.scatter(df, 'x1', 'x2', color='y_kmeans', width=950, height=500, title='K-means algorithm - 2 clusters')