<a href="https://colab.research.google.com/github/nileshgode/My-Python-Projects/blob/master/Finding_the_Closest_Centroids_in_Our_Dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
import pandas as pd
from sklearn.cluster import KMeans
import altair as alt

In [0]:
file_url = 'https://raw.githubusercontent.com/PacktWorkshops/The-Data-Science-Workshop/master/Chapter05/DataSet/taxstats2015.csv'
df = pd.read_csv(file_url, usecols=['Postcode', 'Average total business income', 'Average total business expenses'])

In [0]:
X = df[['Average total business income', 'Average total business expenses']]

In [0]:
business_income_min = df['Average total business income'].min()
business_income_max = df['Average total business income'].max()

business_expenses_min = df['Average total business expenses'].min()
business_expenses_max = df['Average total business expenses'].max()

In [5]:
print(business_income_min)
print(business_income_max)
print(business_expenses_min)
print(business_expenses_max)

0
876324
0
884659


In [0]:
import random
random.seed(42)

In [0]:
centroids = pd.DataFrame()

In [0]:
centroids['Average total business income'] = random.sample(range(business_income_min, business_income_max), 4)

In [0]:
centroids['Average total business expenses'] = random.sample(range(business_expenses_min, business_expenses_max), 4)

In [10]:
centroids['cluster'] = centroids.index
centroids

,Average total business income,Average total business expenses,cluster
0,670487,288389,0
1,116739,256787,1
2,26225,234053,2
3,777572,146316,3


In [0]:
chart1 = alt.Chart(df.head()).mark_circle().encode(x='Average total business income', y='Average total business expenses', color=alt.value('orange'),
    tooltip=['Postcode', 'Average total business income', 'Average total business expenses']
).interactive()

In [0]:
chart2 = alt.Chart(centroids).mark_circle(size=100).encode(x='Average total business income', y='Average total business expenses', color=alt.value('black'),
    tooltip=['cluster', 'Average total business income', 'Average total business expenses']
).interactive()

In [14]:
chart1 + chart2

In [0]:
def squared_euclidean(data_x, data_y, centroid_x, centroid_y):
    """Return the squared Euclidean distance between a data point and a centroid."""
    return (data_x - centroid_x)**2 + (data_y - centroid_y)**2

In [0]:
data_x = df.at[0, 'Average total business income']
data_y = df.at[0, 'Average total business expenses']

In [18]:
distances = [squared_euclidean(data_x, data_y, centroids.at[i, 'Average total business income'], centroids.at[i, 'Average total business expenses']) for i in range(4)]
distances

[215601466600, 10063365460, 34245932020, 326873037866]
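The list above holds squared distances, and that is enough to pick the closest centroid: the square root is monotonically increasing, so it never changes which value is smallest. A minimal sketch of that check, reusing the four values printed above (the names `squared`, `euclidean`, `nearest_squared`, and `nearest_euclidean` are illustrative and not part of the original notebook):

```python
import math

# Squared distances from the first data point to the four centroids,
# copied from the output above
squared = [215601466600, 10063365460, 34245932020, 326873037866]

# Taking the square root gives the actual Euclidean distances
euclidean = [math.sqrt(d) for d in squared]

# The position of the minimum is the same in both lists
nearest_squared = squared.index(min(squared))
nearest_euclidean = euclidean.index(min(euclidean))
print(nearest_squared, nearest_euclidean)  # prints "1 1"
```

Skipping the square root saves a little work per comparison without affecting the cluster assignment.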

In [0]:
cluster_index = distances.index(min(distances))

In [0]:
df.at[0, 'cluster'] = cluster_index

In [21]:
df.head()

,Postcode,Average total business income,Average total business expenses,cluster
0,2000,210901,222191,1.0
1,2006,69983,48971,NaN
2,2007,575099,639499,NaN
3,2008,53329,32173,NaN
4,2009,237539,222993,NaN


In [0]:
# Repeat the same steps for the next four rows (index 1 to 4)
for row in range(1, 5):
    distances = [squared_euclidean(df.at[row, 'Average total business income'], df.at[row, 'Average total business expenses'], centroids.at[i, 'Average total business income'], centroids.at[i, 'Average total business expenses']) for i in range(4)]
    df.at[row, 'cluster'] = distances.index(min(distances))

In [23]:
df.head()

,Postcode,Average total business income,Average total business expenses,cluster
0,2000,210901,222191,1.0
1,2006,69983,48971,2.0
2,2007,575099,639499,0.0
3,2008,53329,32173,2.0
4,2009,237539,222993,1.0


In [24]:
chart1 = alt.Chart(df.head()).mark_circle().encode(x='Average total business income', y='Average total business expenses', color='cluster:N',
    tooltip=['Postcode', 'cluster', 'Average total business income', 'Average total business expenses']
).interactive()

chart2 = alt.Chart(centroids).mark_circle(size=100).encode(x='Average total business income', y='Average total business expenses', color=alt.value('black'),
    tooltip=['cluster', 'Average total business income', 'Average total business expenses']
).interactive()
chart1 + chart2

In this final result, we can see where the four centroids have been placed in the graph and which cluster each of the five data points has been assigned to:

- The two data points in the bottom-left corner have been assigned to cluster 2, whose centroid has coordinates of roughly 26,000 (average total business income) and 234,000 (average total business expenses). It is the closest centroid for these two points.
- The two observations in the middle have been assigned to cluster 1, whose centroid has coordinates of roughly 117,000 (average total business income) and 257,000 (average total business expenses).
- The observation at the top has been assigned to cluster 0, whose centroid has coordinates of roughly 670,000 (average total business income) and 288,000 (average total business expenses).
- None of these five points is closest to the centroid of cluster 3, so that cluster is empty for now.
Awesome! You just re-implemented the initialization and assignment steps of the k-means algorithm from scratch. You went through how to randomly initialize centroids (cluster centers), calculate the squared Euclidean distance between some data points and each centroid, find the closest centroid for each point, and assign the point to the corresponding cluster.
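
The cells above assign one row at a time. As a rough sketch of how the same assignment could be done for every row at once, the example below uses NumPy broadcasting on the five points and four centroids shown in the outputs above. NumPy is not imported in the original notebook, and the names `points`, `centroid_coords`, `sq_dist`, and `labels` are assumptions made for illustration.

```python
import numpy as np

# The five data points from df.head(): (income, expenses)
points = np.array([
    [210901, 222191],
    [69983, 48971],
    [575099, 639499],
    [53329, 32173],
    [237539, 222993],
], dtype=np.int64)

# The four randomly initialized centroids: (income, expenses)
centroid_coords = np.array([
    [670487, 288389],
    [116739, 256787],
    [26225, 234053],
    [777572, 146316],
], dtype=np.int64)

# Broadcasting: (5, 1, 2) - (1, 4, 2) gives a (5, 4, 2) array of differences
diff = points[:, None, :] - centroid_coords[None, :, :]

# Squared Euclidean distance from every point to every centroid, shape (5, 4)
sq_dist = (diff ** 2).sum(axis=2)

# Index of the closest centroid for each point
labels = sq_dist.argmin(axis=1)
print(labels.tolist())  # prints [1, 2, 0, 2, 1]
```

The resulting labels match the `cluster` column that was filled in row by row above.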