## Finding the Optimal Number of Clusters

In [1]:
import pandas as pd
from sklearn.cluster import KMeans
import altair as alt
import numpy as np

In [3]:
file_url = 'https://raw.githubusercontent.com/PacktWorkshops/The-Data-Science-Workshop/master/Chapter05/DataSet/taxstats2015.csv'
df = pd.read_csv(file_url, usecols=['Postcode', 'Average total business income', 'Average total business expenses'])
df.head(5)

Unnamed: 0,Postcode,Average total business income,Average total business expenses
0,2000,210901,222191
1,2006,69983,48971
2,2007,575099,639499
3,2008,53329,32173
4,2009,237539,222993


In [4]:
#6 Assign the two feature columns (business income and business expenses) to a variable called X:
X = df[['Average total business income', 'Average total business expenses']]

In [5]:
#7 Create an empty pandas DataFrame called clusters and an empty list called inertia, then add a 'cluster_range' column holding the values 1 to 14:
clusters = pd.DataFrame()
inertia = []
clusters['cluster_range'] = range(1, 15)

In [7]:
#8 Create a for loop to go through each cluster number and fit a k-means model accordingly,
   # then append each fitted model's 'inertia_' attribute to the 'inertia' list
for k in clusters['cluster_range']:
    kmeans = KMeans(n_clusters=k).fit(X) # instantiate kmeans algorithm and pass k as n_clusters
    inertia.append(kmeans.inertia_)
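
The inertia_ value collected in the loop above is the sum of squared distances from each point to the centroid of its assigned cluster. The following is a minimal sketch that recomputes it by hand, assuming small synthetic data (the points array and the manual_inertia name are illustrative and not part of the exercise):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D data: two well-separated blobs (an assumption for illustration,
# not the tax statistics dataset used in this notebook)
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=(0, 0), scale=1.0, size=(50, 2)),
    rng.normal(loc=(10, 10), scale=1.0, size=(50, 2)),
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Look up the centroid assigned to each point, then sum the squared distances
assigned_centroids = model.cluster_centers_[model.labels_]
manual_inertia = ((points - assigned_centroids) ** 2).sum()

print(manual_inertia, model.inertia_)
```

The two printed numbers should agree up to floating-point rounding. Because every extra centroid can only bring points closer to a center, inertia tends to fall as k grows, which is why the elbow rather than the minimum is used to choose k.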


In [8]:
#9 Assign the inertia list to a new column called 'inertia' in the clusters DataFrame and display its content:
clusters['inertia'] = inertia
clusters

Unnamed: 0,cluster_range,inertia
0,1,13335160000000.0
1,2,7061972000000.0
2,3,3718858000000.0
3,4,2341927000000.0
4,5,1715246000000.0
5,6,1226534000000.0
6,7,942557400000.0
7,8,748842100000.0
8,9,634720400000.0
9,10,563849200000.0


In [9]:
#10 Use mark_line() and encode() from the altair package to plot the Elbow graph with 'cluster_range' as the x-axis and 'inertia' as the y-axis:
alt.Chart(clusters).mark_line().encode(alt.X('cluster_range'), alt.Y('inertia'))

In [10]:
#11 Looking at the Elbow plot, identify the optimal number of clusters, and assign this value to a variable called optim_cluster:
optim_cluster = 4
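
Reading the elbow by eye is subjective, so it can help to tabulate how much inertia falls with each added cluster. The sketch below does this with the inertia values displayed earlier in this notebook; the elbow DataFrame and the 'change' and 'relative_change' column names are illustrative assumptions, not part of the exercise:

```python
import pandas as pd

# Inertia values copied from the table displayed above (k = 1 to 10)
elbow = pd.DataFrame({
    'cluster_range': range(1, 11),
    'inertia': [
        13335160000000.0, 7061972000000.0, 3718858000000.0, 2341927000000.0,
        1715246000000.0, 1226534000000.0, 942557400000.0, 748842100000.0,
        634720400000.0, 563849200000.0,
    ],
})

# Change in inertia relative to the previous k (negative means improvement)
elbow['change'] = elbow['inertia'].diff()

# The same change expressed as a fraction of the previous inertia
elbow['relative_change'] = elbow['inertia'].pct_change()

print(elbow)
```

In these numbers, each additional cluster lowers inertia by less than the previous one. Where the gain stops being worthwhile remains a judgment call, so the table supports the visual choice rather than replacing it.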

In [11]:
#12 Train a k-means model with this number of clusters and a random_state value of 42 using the fit method from sklearn:
kmeans = KMeans(random_state=42, n_clusters=optim_cluster)
kmeans.fit(X)

KMeans(n_clusters=4, random_state=42)

In [12]:
#13 Now, using the predict method from sklearn, get the predicted cluster assignment for each data point contained in the X variable
  # and save the results into a new column called 'clusters2' in the df DataFrame:
df['clusters2'] = kmeans.predict(X)

In [13]:
#14 Display the first five rows of the df DataFrame using the head method from the pandas package:
df.head()

Unnamed: 0,Postcode,Average total business income,Average total business expenses,clusters2
0,2000,210901,222191,3
1,2006,69983,48971,0
2,2007,575099,639499,2
3,2008,53329,32173,0
4,2009,237539,222993,3
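
To interpret the assignments, it can help to summarize each cluster. The sketch below groups by 'clusters2' and computes a count and the mean of each feature. It uses only the five rows displayed above, rebuilt by hand as a DataFrame called sample (an illustrative name), so the figures do not describe the full dataset; the same groupby call could be applied to df:

```python
import pandas as pd

# The five rows shown in the output above, re-entered by hand for illustration
sample = pd.DataFrame({
    'Postcode': [2000, 2006, 2007, 2008, 2009],
    'Average total business income': [210901, 69983, 575099, 53329, 237539],
    'Average total business expenses': [222191, 48971, 639499, 32173, 222993],
    'clusters2': [3, 0, 2, 0, 3],
})

# One row per cluster: number of postcodes and mean of each feature
summary = sample.groupby('clusters2').agg(
    postcodes=('Postcode', 'count'),
    mean_income=('Average total business income', 'mean'),
    mean_expenses=('Average total business expenses', 'mean'),
)

print(summary)
```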


In [14]:
#15 Now plot the scatter plot using the mark_circle() and encode() methods from the altair package. To make the chart interactive, use the tooltip parameter and the interactive() method from the altair package
  # as shown in the following code snippet:
alt.Chart(df).mark_circle().encode(x='Average total business income', 
                                  y='Average total business expenses', 
                                  color='clusters2:N',
                                  tooltip=['Postcode', 'clusters2', 
                                          'Average total business income',
                                          'Average total business expenses']).interactive()

The results from Exercise 5.02, Clustering Australian Postcodes by Business Income and Expenses, had eight different clusters, some of which were very similar to each other. Here, you saw that choosing the number of clusters with the Elbow method (four in this case) provides better differentiation between the groups, which is why the number of clusters is one of the most important hyperparameters to tune for k-means. In the next section, we will look at two other important hyperparameters for initializing k-means.