# $K$-Means Clustering Example

In [None]:
# import modules
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# import the data
cust = pd.read_excel('./data/small_customer_data.xlsx')
cust.head()

In [None]:
# see info
cust.info()

In [None]:
# Create a scatter plot with age on x-axis and income on y-axis
sns.relplot(data=cust, x='Age', y='Income', kind='scatter')

In [None]:
# import k-means estimator
from sklearn.cluster import KMeans

In [None]:
# How many clusters do you want to try?
first_attempt = KMeans(n_clusters=????)

We can use the method `fit_predict()` to compute the cluster centers and predict the cluster index for each sample. This method will return the labels or index of the cluster that each sample belongs to.

In [None]:
# fit and predict
first_attempt.fit_predict(cust)

In [None]:
# To explicitly see the labels we have the attribute .labels_
first_attempt.labels_
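
As a quick check of how `fit_predict()` relates to `.labels_`, here is a minimal sketch on a tiny made-up dataset (the array `X` and the choices `n_init=10` and `random_state=0` are assumptions for illustration, not part of the customer data). It suggests that the array returned by `fit_predict()` is the same one stored in `.labels_`.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up data: two obviously separated groups of three points each
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.2]])

# n_init and random_state are set only to make the run repeatable
km = KMeans(n_clusters=2, n_init=10, random_state=0)

# fit_predict() fits the model and returns one label per row of X
returned_labels = km.fit_predict(X)

# The same labels are kept on the fitted estimator
print(returned_labels)
print(km.labels_)
```

Keep in mind that the label numbers themselves are arbitrary: which group is called 0 and which is called 1 can change from run to run unless `random_state` is fixed.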

In [None]:
# To explicitly see the cluster centers we have the attribute
# .cluster_centers_ which is a multidimensional array
first_attempt.cluster_centers_

Because we only have two attributes/variables, we can easily plot the clusters. We will give each cluster a different color. Additionally, let's add the cluster centers to the plot to see if they look like the center of their respective clusters.

In [None]:
# Create the scatter plot
# hue will be the labels_
fig, ax = plt.subplots()
sns.scatterplot(x=cust.Age, y=cust.Income, hue=first_attempt.labels_, palette='tab10')

# Now add the cluster centers to the plot
# These will be triangles and colored black
ax.scatter(first_attempt.cluster_centers_[:,0],
           first_attempt.cluster_centers_[:,1],
           marker='^', c='k')

## Notice Anything?

What did you notice about the resulting clusters? Why do you think that is happening?

### Solution?

Let's try to scale the data. This will help each attribute start out on "the same footing" when trying to determine the clusters.

There are various ways to scale the data. Depending on your chosen methodological approach, one type of scaling may be better than another. Recall that $K$-means clustering uses **distance** to determine the clusters. Because of this algorithmic detail, we want the range of the attributes to be the same. The easiest way to accomplish this task is with min-max scaling using the `MinMaxScaler`. 
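
To see why distance-based clustering is sensitive to the range of each attribute, consider the following sketch. The three customers are made up for illustration, and the scaling is done by hand with the min-max formula $(x - \min) / (\max - \min)$ rather than with `MinMaxScaler`.

```python
import numpy as np

# Made-up customers; columns are [Age, Income]
X = np.array([
    [25, 40_000],   # customer 0
    [60, 41_000],   # customer 1: very different age, similar income
    [26, 90_000],   # customer 2: similar age, very different income
], dtype=float)

def euclidean(a, b):
    """Straight-line (Euclidean) distance between two points."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Distances in the original units: Income dominates because its
# values are thousands of times larger than Age values
d_raw_01 = euclidean(X[0], X[1])
d_raw_02 = euclidean(X[0], X[2])

# Min-max scaling, column by column: (x - min) / (max - min)
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Distances after scaling: both attributes now contribute comparably
d_scaled_01 = euclidean(X_scaled[0], X_scaled[1])
d_scaled_02 = euclidean(X_scaled[0], X_scaled[2])

print(d_raw_01, d_raw_02)
print(d_scaled_01, d_scaled_02)
```

In the raw units, the 35-year age gap between customers 0 and 1 barely registers next to the income gap between customers 0 and 2. After scaling, the two distances are roughly equal.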

One question you should always ask about any scaler is: Does the scaling change the original **shape** of my data?

Let's see the results of using the `MinMaxScaler` on this dataset.

In [None]:
# First, create a histogram of `Age`
sns.displot(cust.Age)

In [None]:
# Look at a kernel density estimation plot
sns.kdeplot(cust.Age)

In [None]:
# Look at histogram of Income
sns.displot(cust.Income)

In [None]:
# KDE plot for Income
sns.kdeplot(cust.Income)

In [None]:
# Income looks right-skewed. Let's calculate
# the skewness coefficient to verify
cust.Income.skew()

### Time to Scale

We need to import `MinMaxScaler` from `sklearn.preprocessing`. Then we create a scaler object by calling the class constructor. Next, we fit and transform the data, give the resulting columns the same names as in the original data, and take a look at the result.

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
mm_scaler = MinMaxScaler()
mm_scaled_data = pd.DataFrame(mm_scaler.fit_transform(cust))
mm_scaled_data.columns = cust.columns
mm_scaled_data

In [None]:
# Did Age change shape?
sns.displot(mm_scaled_data.Age)

In [None]:
# KDE plot of Age
sns.kdeplot(mm_scaled_data.Age)

In [None]:
# What about Income
sns.displot(mm_scaled_data.Income)

In [None]:
# KDE plot for Income
sns.kdeplot(data=mm_scaled_data, x='Income')

In [None]:
# What about the skewness of Income
mm_scaled_data.Income.skew()

### Result?

Using the `MinMaxScaler` does **not** change the shape of our original data. 
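
The reason is that min-max scaling is a linear transformation with a positive scale factor: every value is shifted by the same amount and divided by the same positive number. The sketch below illustrates this on synthetic right-skewed data (the lognormal sample is an assumption for illustration, not the customer data), applying the min-max formula by hand.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "income" values (made up for illustration)
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=10, sigma=0.5, size=500))

# Min-max scaling by hand: shift by the min, divide by the range
income_scaled = (income - income.min()) / (income.max() - income.min())

# Skewness is unaffected by shifting and positive rescaling
skew_before = income.skew()
skew_after = income_scaled.skew()

print(skew_before, skew_after)
```

The two skewness values should agree up to floating-point rounding, even though the scaled values now lie between 0 and 1.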

### Find Clusters with Scaled Data

Using the same number of clusters as above, we now want to fit and predict on the scaled data.

In [None]:
# Use the same number of clusters as above
second_attempt = KMeans(n_clusters=????)

In [None]:
# See the labels for each observation
# It will return the labels_ 
second_attempt.fit_predict(mm_scaled_data)

In [None]:
# What do the cluster centers look like for the scaled data?
second_attempt.cluster_centers_

### Centers on Scaled Data

Notice that the cluster centers are in terms of the scaled data. This fact should not surprise you: you sent in scaled data, so the centers necessarily must also be scaled from the original data. Because the cluster centers are on a different scale than the original data, how do we interpret them in the original scale? For example, for the first cluster center, how do you interpret an Age value of, say, 0.8153? To make sense of the cluster centers we need to put them back into the original units. Luckily, this is easily accomplished with the `.inverse_transform()` method.

In [None]:
# Put the new centers back into original units
centers_orig_scale = mm_scaler.inverse_transform(second_attempt.cluster_centers_)
centers_orig_scale
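
For intuition about what `.inverse_transform()` does for a `MinMaxScaler`, the following sketch reverses the scaling by hand on a small made-up dataset. The array `X` and the scaled center `center_scaled` are hypothetical values chosen so the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up data; columns are [Age, Income]
X = np.array([[20, 30_000],
              [40, 50_000],
              [60, 110_000]], dtype=float)

scaler = MinMaxScaler().fit(X)

# A hypothetical cluster center expressed in scaled units
center_scaled = np.array([[0.5, 0.25]])

# Let the scaler undo the scaling
center_orig = scaler.inverse_transform(center_scaled)

# Undo it by hand: x = x_scaled * (max - min) + min
manual = center_scaled * (scaler.data_max_ - scaler.data_min_) + scaler.data_min_

print(center_orig)
print(manual)
```

Here a scaled Age of 0.5 sits halfway between 20 and 60, so it maps back to 40; a scaled Income of 0.25 sits a quarter of the way from 30,000 to 110,000, so it maps back to 50,000.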

In [None]:
# Now, plot the new clusters and their centers
# Notice that we can use the original data for the x and y coordinates
# and simply change the color of the observations based on the labels
# that resulted from clustering with the scaled data
fig, ax = plt.subplots()
sns.scatterplot(x=cust.Age, y=cust.Income, hue=second_attempt.labels_, palette='tab10')

# To correctly get the centers on the chart, we need to use the
# inverse_transform'ed centers
ax.scatter(centers_orig_scale[:,0],
           centers_orig_scale[:,1],
           marker='^', c='k')

## Finding the "Right" Number of Clusters

How do you know the "correct" number of clusters to use? In some instances, the number is driven solely by the business context. For example, you want 3 groups of customers so that you can target each group separately, because the business owner specified that they wanted 3 groups. Other times, you may let the algorithm(s) decide the number of clusters.

One approach to finding a "good" number of clusters is called the "elbow method". In this approach, you try different values for $k$ and plot the resulting values of *inertia*. You then try to find the "elbow" where adding an additional cluster does not drastically improve (i.e., lower) the inertia.

Let's try it.

In [None]:
# Create an empty dictionary to hold the results
# key = the number of clusters : value = inertia
inertia = {}

# Loop over 1 to 9 clusters finding the clusters and inertia
for k_value in range(1,10):
    kmeans = KMeans(n_clusters=k_value, random_state=42)
    kmeans.fit(mm_scaled_data)
    inertia[k_value] = kmeans.inertia_

In [None]:
# Look at results
inertia
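
If it is unclear what the inertia values measure, the sketch below recomputes one by hand: inertia is the sum of squared distances from each observation to the center of the cluster it was assigned to. The data are synthetic blobs (an assumption for illustration), and `n_init=10` is set only to keep the run repeatable across library versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three made-up blobs in the unit square
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=[0.2, 0.2], scale=0.05, size=(50, 2)),
    rng.normal(loc=[0.8, 0.3], scale=0.05, size=(50, 2)),
    rng.normal(loc=[0.5, 0.8], scale=0.05, size=(50, 2)),
])

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Look up the center assigned to each observation ...
assigned_centers = km.cluster_centers_[km.labels_]

# ... and add up the squared distances to those centers
manual_inertia = float(np.sum((X - assigned_centers) ** 2))

print(manual_inertia, km.inertia_)
```

Because every additional cluster lets observations sit closer to some center, inertia generally keeps falling as $k$ grows, which is why we look for an elbow rather than a minimum.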

In [None]:
# Plot the results
# Create DataFrame for easier plotting
df_inertia = pd.DataFrame.from_dict(inertia, orient='index', columns=['inertia'])
df_inertia

In [None]:
df_inertia.plot(marker='o')

An alternative to the elbow method is to calculate the **silhouette score** for each value of $k$ to see how many clusters should be used. The best value is +1 and the worst is -1. Values near 0 indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar. A +1 indicates highly dense clustering. Overall, you can think of it this way: the score is higher when the clusters are dense and well separated, which relates to a standard concept of a cluster.
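
To make the score concrete, here is a minimal sketch that computes the silhouette value for a single sample by hand. For one sample, let $a$ be its mean distance to the other members of its own cluster and $b$ its mean distance to the members of the nearest other cluster; the silhouette value is $(b - a) / \max(a, b)$. The four points and their labels are made up for illustration.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Made-up data: two tight pairs of points, far apart from each other
X = np.array([[0.0, 0.0], [0.0, 1.0],
              [5.0, 0.0], [5.0, 1.0]])
labels = np.array([0, 0, 1, 1])

# a: mean distance from sample 0 to the rest of its own cluster
a = np.linalg.norm(X[0] - X[1])

# b: mean distance from sample 0 to the nearest other cluster
b = np.mean([np.linalg.norm(X[0] - X[2]),
             np.linalg.norm(X[0] - X[3])])

# Silhouette value for sample 0
s0_manual = (b - a) / max(a, b)

# scikit-learn's per-sample values and their overall mean
s_all = silhouette_samples(X, labels)
overall = silhouette_score(X, labels)

print(s0_manual, s_all[0], overall)
```

The overall silhouette score is the mean of the per-sample values, so it rewards clusterings where $a$ is small (dense clusters) and $b$ is large (well-separated clusters).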

In [None]:
# import silhouette_score
from sklearn.metrics import silhouette_score

In [None]:
# Create an empty dictionary to hold silhouette scores
ss = {}

# Loop over 2 to 9 clusters finding the clusters and silhouette_score
for k_value in range(2,10):
    kmeans = KMeans(n_clusters=k_value, random_state=42)
    kmeans.fit(mm_scaled_data)
    labels = kmeans.labels_
    ss[k_value] = silhouette_score(mm_scaled_data, labels)

In [None]:
# See what the results look like
ss

In [None]:
# Create a DataFrame to make plotting easier
df_ss = pd.DataFrame.from_dict(ss, orient='index', columns=['silhouette_score'])
df_ss

In [None]:
# Create a line plot
df_ss.plot(marker='o')

In [None]:
# Try creating clusters with the "best" number indicated above


In [None]:
# Put the new centers back into original units


In [None]:
# Now, plot the new clusters and their centers
# Notice that we can use the original data for the x and y coordinates
# and simply change the color of the observations based on the labels
# that resulted from clustering with the scaled data


# To correctly get the centers on the chart, we need to use the
# inverse_transform'ed centers


## Another Scaling Option

Another popular scaling option is the `StandardScaler`. It standardizes the data; that is, each scaled attribute will have a mean of 0 and a variance of 1 (and therefore a standard deviation of 1). This scaler is often used with machine learning estimators that assume the individual attributes/variables/features look more or less like standard normally distributed data.
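
As a starting point, the sketch below applies `StandardScaler` to a small made-up dataset and compares the result with the formula $(x - \text{mean}) / \text{std}$ computed by hand. The array `X` is hypothetical; note that the hand calculation uses the population standard deviation (NumPy's default, `ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data; columns are [Age, Income]
X = np.array([[20, 30_000],
              [40, 50_000],
              [60, 110_000],
              [30, 45_000]], dtype=float)

# Standardize with scikit-learn
X_std = StandardScaler().fit_transform(X)

# Standardize by hand, column by column
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))  # approximately 0 for each column
print(X_std.std(axis=0))   # approximately 1 for each column
```

Unlike min-max scaling, the standardized values are not confined to the interval from 0 to 1.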

I will leave it to you to explore the scaler on your own using the ancillary material below.

### Additional Resources

The following links point you to additional resources that you might find helpful in learning this material.

1. [API documentation for `KMeans`][1].
2. [API documentation for `MinMaxScaler`][2].
3. [API documentation for `StandardScaler`][3].


-----

[1]: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
[2]: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
[3]: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

**&copy; 2023 - Present: Matthew D. Dean, Ph.D.   
Clinical Associate Professor of Business Analytics at William \& Mary.**