## Finding the Closest Centroids in Our Dataset
In this exercise, we will code the first iteration of k-means by hand: we will initialize random centroids and assign data points to their closest cluster centroid.

In [1]:
import pandas as pd
from sklearn.cluster import KMeans
import altair as alt
import numpy as np

In [2]:
file_url = 'https://raw.githubusercontent.com/PacktWorkshops/The-Data-Science-Workshop/master/Chapter05/DataSet/taxstats2015.csv'
df = pd.read_csv(file_url, usecols=['Postcode', 'Average total business income', 'Average total business expenses'])
df.tail(10)

Unnamed: 0,Postcode,Average total business income,Average total business expenses
2463,852,95299,79526
2464,853,21186,15336
2465,854,49303,29720
2466,860,63190,55802
2467,862,134224,144254
2468,870,62793,44687
2469,872,53025,45670
2470,880,45603,28700
2471,885,53148,39850
2472,886,121057,90120


Extract the 'Average total business income' and 'Average total business expenses' columns using the following pandas column subsetting syntax: dataframe_name[<list_of_columns>]. Then, save them into a new variable called X:

In [3]:
X = df[['Average total business income', 'Average total business expenses']]

In [4]:
type(X)

pandas.core.frame.DataFrame
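
As a small aside on the subsetting syntax used above, the following sketch shows that passing a list of column names returns a DataFrame, while passing a single name returns a Series. The three-row `toy` DataFrame and its values are made up for illustration and are not rows from the dataset:

```python
import pandas as pd

# Toy data for illustration only; not taken from taxstats2015.csv
toy = pd.DataFrame({
    'Postcode': [1, 2, 3],
    'Average total business income': [100, 200, 300],
    'Average total business expenses': [80, 150, 310],
})

# A list of column names keeps the result two-dimensional (a DataFrame)
two_cols = toy[['Average total business income', 'Average total business expenses']]

# A single column name returns a one-dimensional Series
one_col = toy['Average total business income']

print(type(two_cols).__name__)
print(type(one_col).__name__)
```

This is why type(X) reports a DataFrame here even though X only holds a subset of the columns.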

Now, calculate the minimum and maximum values of the 'Average total business income' and 'Average total business expenses' columns using the min() and max() methods:

In [5]:
business_income_min = df['Average total business income'].min()
business_income_max = df['Average total business income'].max()
business_expenses_min = df['Average total business expenses'].min()
business_expenses_max = df['Average total business expenses'].max()

print(business_income_min)
print(business_income_max)
print(business_expenses_min)
print(business_expenses_max)

0
876324
0
884659
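
The same four bounds can also be obtained in a single call. The sketch below is one possible alternative, not the approach used in this exercise; it relies on the pandas agg() method and a small made-up `toy` DataFrame whose extremes mirror the values printed above:

```python
import pandas as pd

# Toy data for illustration; only the extremes mirror the printed output above
toy = pd.DataFrame({
    'Average total business income': [0, 53148, 876324],
    'Average total business expenses': [0, 39850, 884659],
})

# agg() with a list of function names returns one row per function
bounds = toy.agg(['min', 'max'])

income_min = bounds.loc['min', 'Average total business income']
income_max = bounds.loc['max', 'Average total business income']
expenses_max = bounds.loc['max', 'Average total business expenses']

print(bounds)
```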


Now import the random package and use its seed() function to set a seed of 42, so that the random values generated below are reproducible:

In [6]:
import random
random.seed(42)
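
To see what setting the seed achieves, the short sketch below draws the same kind of sample twice, resetting the seed in between. The range of 0 to 999 and the variable names are arbitrary choices for illustration:

```python
import random

random.seed(42)
first_draw = random.sample(range(0, 1000), 4)

# Resetting the seed restarts the same pseudo-random sequence
random.seed(42)
second_draw = random.sample(range(0, 1000), 4)

print(first_draw == second_draw)
```

Because both draws start from the same seed, they contain the same four values in the same order.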

Create an empty pandas DataFrame and assign it to a variable called centroids:

In [7]:
centroids = pd.DataFrame()

Generate four random values using the sample() function from the random package, with possible values between the minimum and maximum of the 'Average total business income' column (built with range()), and store the results in a new column called 'Average total business income' in the centroids DataFrame:

Repeat the same process to generate four random values for 'Average total business expenses', this time using the minimum and maximum of that column:

In [8]:
centroids['Average total business income'] = random.sample(range(business_income_min, business_income_max), 4)
centroids['Average total business expenses'] = random.sample(range(business_expenses_min, business_expenses_max), 4)
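
The two lines above follow the same pattern, so they could be wrapped in a small helper. The sketch below is one way to do that under a few assumptions: the function name `init_centroids`, its parameters, and the `toy_centroids` variable are hypothetical, and the bounds are the minimum and maximum values printed earlier. It is not guaranteed to reproduce the exact centroids shown in this exercise:

```python
import random

def init_centroids(bounds, k, seed=None):
    """Draw k random integer coordinates per feature.

    bounds maps a feature name to a (minimum, maximum) pair. As in the cell
    above, range(minimum, maximum) excludes the maximum itself, and
    random.sample draws without replacement, so the k values drawn for one
    feature are all different.
    """
    rng = random.Random(seed)  # local generator; leaves the global seed alone
    return {name: rng.sample(range(low, high), k)
            for name, (low, high) in bounds.items()}

bounds = {
    'Average total business income': (0, 876324),
    'Average total business expenses': (0, 884659),
}
toy_centroids = init_centroids(bounds, k=4, seed=42)
print(toy_centroids)
```

The resulting dictionary can be passed to pd.DataFrame() to obtain a table with one row per centroid.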

Create a new column called 'cluster' in the centroids DataFrame using its .index attribute, and print the DataFrame:

In [9]:
centroids['cluster'] = centroids.index
centroids

Unnamed: 0,Average total business income,Average total business expenses,cluster
0,670487,288389,0
1,116739,256787,1
2,26225,234053,2
3,777572,146316,3


Create a scatter plot with the altair package to display the first five rows of the df DataFrame (df.head()) and save it in a variable called 'chart1':

In [10]:
chart1 = alt.Chart(df.head()).mark_circle().encode(x='Average total business income',
                                                   y='Average total business expenses',
                                                   color=alt.value('orange'),
                                                   tooltip=['Postcode', 'Average total business income',
                                                            'Average total business expenses']).interactive()

Now create a second scatter plot using the altair package to display the centroids and save it in a variable called 'chart2':

In [11]:
chart2 = alt.Chart(centroids).mark_circle(size=100).encode(x='Average total business income',
                                                   y='Average total business expenses',
                                                   color=alt.value('black'),
                                                   tooltip=['cluster', 'Average total business income',
                                                            'Average total business expenses']).interactive()

In [12]:
chart1 + chart2

Define a function called squared_euclidean that calculates the squared Euclidean distance and returns its value. This function takes the x and y coordinates of a data point and of a centroid:

In [13]:
def squared_euclidean(data_x, data_y, centroid_x, centroid_y):
    return (data_x - centroid_x)**2 + (data_y - centroid_y)**2
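
The function skips the square root of the usual Euclidean distance. For finding the closest centroid this makes no difference, because the square root is monotonic and therefore preserves the ordering of the distances. The sketch below checks this with the coordinates of row 0 of df (visible in the df.head() output further down) and the four centroids printed above; math.dist requires Python 3.8 or later:

```python
import math

def squared_euclidean(data_x, data_y, centroid_x, centroid_y):
    return (data_x - centroid_x)**2 + (data_y - centroid_y)**2

# Row 0 of df and the four centroids generated in this exercise
point = (210901, 222191)
centroid_coords = [(670487, 288389), (116739, 256787),
                   (26225, 234053), (777572, 146316)]

squared = [squared_euclidean(point[0], point[1], cx, cy)
           for cx, cy in centroid_coords]
true_dist = [math.dist(point, c) for c in centroid_coords]

# Both lists have their minimum at the same position
print(squared.index(min(squared)) == true_dist.index(min(true_dist)))
```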

Using the pandas .at accessor, extract the first row's x and y coordinates and save them in two variables called data_x and data_y:

In [14]:
data_x = df.at[0, 'Average total business income']
data_y = df.at[0, 'Average total business expenses']
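
For reference, .at looks up a single value by row label and column label. The sketch below compares it with two related pandas accessors, .loc and .iat, on a two-row `toy` DataFrame that copies the first two rows of df as shown later in this exercise:

```python
import pandas as pd

toy = pd.DataFrame(
    {'Average total business income': [210901, 69983],
     'Average total business expenses': [222191, 48971]},
    index=[0, 1],
)

x_at = toy.at[0, 'Average total business income']    # label-based, single value only
x_loc = toy.loc[0, 'Average total business income']  # label-based, also accepts slices and lists
x_iat = toy.iat[0, 0]                                # position-based, single value only

print(x_at, x_loc, x_iat)
```

All three return the same value here because the row labels happen to match the row positions.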

Using a for loop or list comprehension, calculate the squared Euclidean distance of the first observation (using its data_x and data_y coordinates) to each of the 4 centroids contained in centroids, save the results in a variable called distances, and display it:

In [15]:
distances = [squared_euclidean(data_x, data_y, centroids.at[i, 'Average total business income'], 
                               centroids.at[i, 'Average total business expenses']) for i in range(4)]
distances

[215601466600, 10063365460, 34245932020, 326873037866]

Use the index() method of the list containing the squared Euclidean distances to find the position of the smallest distance, which is the index of the closest cluster, as shown in the following code snippet:

In [16]:
cluster_index = distances.index(min(distances))
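
The idiom above first finds the smallest value and then looks up its position. The sketch below applies it to the distances printed earlier and shows an equivalent alternative, along with how ties are resolved. The names `cluster_index_alt` and `tied` are introduced only for this illustration:

```python
distances = [215601466600, 10063365460, 34245932020, 326873037866]

# Same idiom as above: position of the smallest value
cluster_index = distances.index(min(distances))

# Equivalent alternative: pick the index whose distance is smallest
cluster_index_alt = min(range(len(distances)), key=distances.__getitem__)

# On a tie, both approaches return the first (lowest) index
tied = [5, 3, 3, 9]
tied_index = tied.index(min(tied))
tied_index_alt = min(range(len(tied)), key=tied.__getitem__)

print(cluster_index, cluster_index_alt, tied_index, tied_index_alt)
```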

Save the cluster index of the first observation in a new column called 'cluster' in the df DataFrame using the pandas .at accessor:

In [17]:
df.at[0, 'cluster'] = cluster_index
df.head()

Unnamed: 0,Postcode,Average total business income,Average total business expenses,cluster
0,2000,210901,222191,1.0
1,2006,69983,48971,
2,2007,575099,639499,
3,2008,53329,32173,
4,2009,237539,222993,
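
The cluster value prints as 1.0 rather than 1 because the rows that have not been assigned yet hold NaN, which forces the column to a floating-point dtype. The sketch below reproduces this on a standalone Series and shows one optional way around it, the pandas nullable 'Int64' dtype. This conversion is a suggestion and not part of the original exercise:

```python
import pandas as pd

# One assigned value followed by four missing ones, as in the 'cluster' column above
cluster = pd.Series([1, None, None, None, None])
print(cluster.dtype)

# The nullable integer dtype keeps whole numbers and marks missing values as <NA>
cluster_int = cluster.astype('Int64')
print(cluster_int.dtype)
```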


Repeat the distance calculation and cluster assignment from the previous steps for the next 4 rows, finding for each row the cluster with the smallest distance value:

Row 1

Using a for loop or list comprehension, calculate the squared Euclidean distance of the second observation (the row with index 1) to each of the 4 centroids contained in centroids, and save the results in a variable called distances:

In [18]:
distances = [squared_euclidean(df.at[1, 'Average total business income'], 
                               df.at[1, 'Average total business expenses'],
                              centroids.at[i, 'Average total business income'], 
                               centroids.at[i, 'Average total business expenses']) for i in range(4)]

Save the cluster index of this observation in the 'cluster' column of the df DataFrame using the pandas .at accessor:

In [19]:
df.at[1, 'cluster'] = distances.index(min(distances))
df.head()

Unnamed: 0,Postcode,Average total business income,Average total business expenses,cluster
0,2000,210901,222191,1.0
1,2006,69983,48971,2.0
2,2007,575099,639499,
3,2008,53329,32173,
4,2009,237539,222993,


Row 2

In [20]:
distances = [squared_euclidean(df.at[2, 'Average total business income'], 
                               df.at[2, 'Average total business expenses'],
                              centroids.at[i, 'Average total business income'], 
                               centroids.at[i, 'Average total business expenses']) for i in range(4)]

In [21]:
df.at[2, 'cluster'] = distances.index(min(distances))
df.head()

Unnamed: 0,Postcode,Average total business income,Average total business expenses,cluster
0,2000,210901,222191,1.0
1,2006,69983,48971,2.0
2,2007,575099,639499,0.0
3,2008,53329,32173,
4,2009,237539,222993,


Row 3

In [22]:
distances = [squared_euclidean(df.at[3, 'Average total business income'], 
                               df.at[3, 'Average total business expenses'],
                              centroids.at[i, 'Average total business income'], 
                               centroids.at[i, 'Average total business expenses']) for i in range(4)]

In [23]:
df.at[3, 'cluster'] = distances.index(min(distances))
df.head()

Unnamed: 0,Postcode,Average total business income,Average total business expenses,cluster
0,2000,210901,222191,1.0
1,2006,69983,48971,2.0
2,2007,575099,639499,0.0
3,2008,53329,32173,2.0
4,2009,237539,222993,


Row 4

In [24]:
distances = [squared_euclidean(df.at[4, 'Average total business income'], 
                               df.at[4, 'Average total business expenses'],
                              centroids.at[i, 'Average total business income'], 
                               centroids.at[i, 'Average total business expenses']) for i in range(4)]

In [25]:
df.at[4, 'cluster'] = distances.index(min(distances))
df.head()

Unnamed: 0,Postcode,Average total business income,Average total business expenses,cluster
0,2000,210901,222191,1.0
1,2006,69983,48971,2.0
2,2007,575099,639499,0.0
3,2008,53329,32173,2.0
4,2009,237539,222993,1.0
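
The per-row cells above all repeat the same two operations, so they can be folded into a loop. The sketch below does this in plain Python with the coordinates of the first five rows and the four centroids shown in this exercise. The helper name `closest_centroid` and the tuple-based layout are assumptions made for illustration:

```python
def squared_euclidean(data_x, data_y, centroid_x, centroid_y):
    return (data_x - centroid_x)**2 + (data_y - centroid_y)**2

def closest_centroid(point, centroid_coords):
    """Return the index of the centroid with the smallest squared distance."""
    distances = [squared_euclidean(point[0], point[1], cx, cy)
                 for cx, cy in centroid_coords]
    return distances.index(min(distances))

# (income, expenses) of the first five rows of df
points = [(210901, 222191), (69983, 48971), (575099, 639499),
          (53329, 32173), (237539, 222993)]
# (income, expenses) of the four centroids
centroid_coords = [(670487, 288389), (116739, 256787),
                   (26225, 234053), (777572, 146316)]

assignments = [closest_centroid(p, centroid_coords) for p in points]
print(assignments)  # [1, 2, 0, 2, 1], matching the 'cluster' column above
```

The same loop would work over every row of the dataset, not only the first five.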


Finally, plot the centroids and the first 5 rows of the dataset using the altair package, as in the earlier chart1 and chart2 cells, this time coloring each data point by its assigned cluster:

In [27]:
chart1 = alt.Chart(df.head()).mark_circle().encode(x='Average total business income', 
                                                  y='Average total business expenses', 
                                                  color='cluster:N',
                                                  tooltip=['Postcode', 'cluster', 
                                                          'Average total business income', 
                                                          'Average total business expenses']).interactive()
chart2 = alt.Chart(centroids).mark_circle(size=100).encode(x='Average total business income', 
                                                           y='Average total business expenses', 
                                                           color=alt.value('black'), 
                                                           tooltip=['cluster', 
                                                                   'Average total business income', 
                                                                   'Average total business expenses']).interactive()

In [28]:
chart1 + chart2

In this final result, we can see where the four centroids have been placed in the graph and which cluster each of the five data points has been assigned to.

You just re-implemented a big part of the k-means algorithm from scratch. You went through how to randomly initialize centroids (cluster centers), calculate the squared Euclidean distance for some data points, find their closest centroid, and assign them to the corresponding cluster. This wasn't easy, but you made it.