**NOTE: This notebook is written for the Google Colab platform. However, it can also be run (possibly with minor modifications) as a standard Jupyter notebook.**



In [None]:
#@title -- Installation of Packages -- { display-mode: "form" }
import sys
!{sys.executable} -m pip install yellowbrick
!{sys.executable} -m pip install --quiet sweetviz
!{sys.executable} -m pip install git+https://github.com/michalgregor/class_utils.git

In [None]:
#@title -- Import of Necessary Packages -- { display-mode: "form" }
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
import matplotlib.pyplot as plt
from class_utils.sklearn import (
    make_ext_column_transformer, transformer_extensions
)
from sklearn.cluster import KMeans

import sweetviz as sv
from class_utils.plots import crosstab_plot, ColGrid, RainCloud
import seaborn as sns

from yellowbrick.cluster import SilhouetteVisualizer, KElbowVisualizer
# revert yellowbrick's invasive changes to matplotlib's
# styling; also suppressing deprecation warnings
import warnings
import yellowbrick

with warnings.catch_warnings(record=True) as w:
    yellowbrick.style.rcmod.set_aesthetic('reset')
    yellowbrick.style.rcmod.reset_orig()
    
cluster_colors = sns.color_palette()

In [None]:
#@title -- Downloading Data -- { display-mode: "form" }
DATA_HOME = "https://github.com/michalgregor/ml_notebooks/blob/main/data/{}?raw=1"

from class_utils.download import download_file_maybe_extract
download_file_maybe_extract(DATA_HOME.format("Mall_Customers.csv"), directory="data")

# also create a directory for storing any outputs
import os
os.makedirs("output", exist_ok=True)

## Customer Segmentation using Clustering

As a further example, we are going to have a look at a dataset of mall customers and try to use clustering to identify customer segments, i.e. groups of customers who share some common characteristics. Knowing about customer segments can be very useful: it allows companies to, for example, use different marketing strategies when targeting different segments.

Let's start by loading the dataset. As we can see, it is not very complex – it only contains the gender, the age, the annual income and the spending score of each customer. We are going to see, though, that we can still draw upon it to gain some useful insights.



In [None]:
df = pd.read_csv("data/Mall_Customers.csv")
df.head()

### Exploratory Analysis

As a first step in our process, we are going to do some light exploratory analysis on the data. First, you can run `sv.analyze` to get the basic information about the type and distribution of columns, their correlations, missing data and such.



In [None]:
report = sv.analyze(df, target_feat='Spending Score (1-100)')
report.show_notebook()
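
If `sweetviz` is not available, part of the same basic information (column types, missing values, summary statistics, correlations) can be obtained with plain pandas. The following is only a minimal sketch on a made-up toy table, `toy`, not on the actual dataset:

```python
import pandas as pd

# Toy stand-in for the mall customers table (all values are made up).
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", None],
    "Age": [19, 35, 52, 41],
    "Annual Income (k$)": [15.0, 60.0, None, 88.0],
})

# Column types and the number of missing values per column.
overview = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_missing": toy.isna().sum(),
})
print(overview)

# Summary statistics of the numeric columns.
summary = toy.describe()
print(summary)

# Pairwise Pearson correlations between the numeric columns.
print(toy.corr(numeric_only=True))
```

On the real data, `toy` would simply be replaced by `df`.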

Afterwards we might want to explore relationships between various pairs of variables. E.g. we could use violin plots to display distributions of numeric variables, conditioning on the gender of the customer.



In [None]:
g = ColGrid(df, 'Gender', ["Age", "Annual Income (k$)", "Spending Score (1-100)"], col_wrap=2)
g.map_dataframe(sns.violinplot);
plt.gcf().set_size_inches(10, 6)

---
### Task 1: Relationships of Numeric Variables with the Spending Score

**To explore the relationships between the various numeric variables and the spending score, create a grid of scatter plots with those *numeric variables* on the horizontal axis and the *spending score* on the vertical axis.**

---


In [None]:
g = ColGrid(      # ---
    
# ---

---
### Task 2: Describing the Observed Segments

Having displayed and visually inspected the scatter plots, you should observe 2 clusters in the *age* vs. *spending score* plot and 5 clusters in the *annual income* vs. *spending score* plot. In the cell below, **describe what customer segment each of these clusters could correspond to, i.e. how it could be interpreted**.

---


### Preprocessing

Having done some basic exploration, let's say that we now want to arrive at a certain number of interpretable clusters and then perhaps explore the properties of each of them further. You have already thought about and hopefully provided some interpretation of the five clusters that are present in the annual income vs. spending score plot. So let us try to capture those clusters now.

To this end, we are now going to drop all columns apart from `Annual Income (k$)` and `Spending Score (1-100)` and apply some standard preprocessing to these two.



In [None]:
# all inputs are numeric
categorical_inputs = [
    # "Gender"
]

numeric_inputs = [
    # "Age",
    "Annual Income (k$)", "Spending Score (1-100)"
]

# the preprocessing pipeline
input_preproc = make_ext_column_transformer(
    (make_pipeline(
        transformer_extensions(
            SimpleImputer(strategy='constant', fill_value='MISSING')
        ),
        OneHotEncoder()),
     categorical_inputs),
    
    (make_pipeline(
        transformer_extensions(
            SimpleImputer()
        ),
        StandardScaler()),
     numeric_inputs),

    inverse_dropped='ignore',
    verbose_feature_names_out=False
)

# the preprocessed data (clustering is unsupervised, so there are no class labels)
X = input_preproc.fit_transform(df)

---
### Task 3: Applying $k$-means to the Data

**As your next task, apply $k$-means clustering to the data. Assign the resulting cluster identifiers to a `clust` column in `df`.**  Note: To make the following cells work, also assign the `KMeans` object to `model`.

---


In [None]:
model = # ---


df["clust"] = # ---


One of the nice things about $k$-means is that with the clusters being ball-shaped, they can easily be represented by their centroids and this makes them relatively interpretable. When using scikit-learn's `KMeans` object, we can extract the cluster centers using `model.cluster_centers_`.

These cluster centers are, of course, expressed in standardized units, since $k$-means is applied to the standardized data `X`. This makes them harder to interpret. We are therefore going to use `input_preproc` to transform them back onto the original scale (k$ for the annual income and 1-100 for the spending score) before we display them.



In [None]:
cluster_centers = input_preproc.inverse_transform(model.cluster_centers_)
cluster_centers
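
Why the inverse transform yields interpretable centers can be illustrated with a minimal sketch. It assumes a bare scikit-learn `StandardScaler` in place of the `input_preproc` pipeline, made-up toy data, and hypothetical group labels `groups` in place of real cluster assignments. Because standardization is an affine map, centroids computed in the standardized space map back to the per-group means in the original units:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: two columns on different scales (all values are made up).
X_orig = np.array([
    [15.0, 80.0],
    [20.0, 75.0],
    [90.0, 10.0],
    [95.0, 20.0],
])
groups = np.array([0, 0, 1, 1])  # hypothetical cluster labels

# Standardize each column to zero mean and unit variance.
scaler = StandardScaler()
X_std = scaler.fit_transform(X_orig)

# Centroids computed in the standardized space ...
centers_std = np.array([X_std[groups == g].mean(axis=0) for g in (0, 1)])

# ... and mapped back onto the original scale.
centers_orig = scaler.inverse_transform(centers_std)
print(centers_orig)

# For comparison: the per-group means computed directly in original units.
direct_means = np.array([X_orig[groups == g].mean(axis=0) for g in (0, 1)])
print(np.allclose(centers_orig, direct_means))
```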

### Analysis of the Discovered Clusters

Now that we have retrieved the clusters, we can do further analysis to get more insights about the customers in each of them. Given that we have been trying to capture the 5 clusters visible in the annual income vs. spending score plot, let's first make sure that this worked correctly.



In [None]:
sns.scatterplot(x="Annual Income (k$)", y="Spending Score (1-100)", data=df, s=20, hue="clust", palette=cluster_colors[:cluster_centers.shape[0]])
sns.scatterplot(x="Annual Income (k$)", y="Spending Score (1-100)", data=cluster_centers, s=100, color='k')
plt.grid(ls='--')
plt.gca().set_axisbelow(True)

Next we can have a look at associations between the cluster number and other variables – in much the same way that we do during exploratory analysis. Let's display the violin plots of cluster vs. the three numeric variables we have.

One thing we can observe is that the age distribution in two of the clusters is much more concentrated than in the others. E.g. in the group that earns a lot and spends a lot, the median age is 32, with the minimum being 27 and the maximum being 40. In the group that spends a lot in spite of having a low income, the ages are significantly lower, with a median of 23.5 and a maximum of 35. The other clusters more or less span the entire range of ages.

As you can see, this already gives us new useful insights: it shows us, for instance, that in our sample, older people were less likely to spend irresponsibly than young people.



In [None]:
g = ColGrid(df, 'clust', ["Age", "Annual Income (k$)", "Spending Score (1-100)"], col_wrap=2)
g.map_dataframe(sns.violinplot);
plt.gcf().set_size_inches([12, 8])
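
Per-cluster statistics such as the minimum, median and maximum age can also be read off numerically rather than from the violin plots. The following is a minimal sketch using a pandas `groupby` on a made-up toy frame, `toy`, with hypothetical cluster labels:

```python
import pandas as pd

# Toy stand-in for df with a cluster column (all values are made up).
toy = pd.DataFrame({
    "clust": [0, 0, 0, 1, 1, 1],
    "Age":   [28, 31, 39, 19, 24, 33],
})

# Minimum, median and maximum age within each cluster.
age_by_cluster = toy.groupby("clust")["Age"].agg(["min", "median", "max"])
print(age_by_cluster)
```

Applied to the real data, the same expression with `df` in place of `toy` would give one row of statistics per cluster.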

We can also display a matrix showing the association between gender and the clusters. One striking thing here is that there are far fewer males than females in two of the clusters, and these happen to be the clusters that correspond to low-income customers.



In [None]:
plt.figure(figsize=(5, 5))
crosstab_plot(x='Gender', y='clust', data=df);
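
The counts behind such a matrix can also be tabulated directly. As a minimal sketch, assuming made-up toy data and hypothetical cluster labels, `pd.crosstab` gives the raw counts, and `normalize="index"` turns them into per-cluster shares:

```python
import pandas as pd

# Toy stand-in for df (all values are made up).
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Female", "Male", "Male"],
    "clust":  [0, 0, 0, 1, 1, 1],
})

# Number of customers of each gender in each cluster.
counts = pd.crosstab(toy["clust"], toy["Gender"])
print(counts)

# The same table as row-wise proportions (each row sums to 1).
shares = pd.crosstab(toy["clust"], toy["Gender"], normalize="index")
print(shares)
```

Row-wise shares are easier to compare across clusters of different sizes than raw counts.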

It seems that in our sample, women are somewhat more likely to have a low income than men (there are more women than men in the low-income clusters). To analyze this further, we can have a look at the data for customers who earn less than 40k. Let's select these customers and count the number of men and women. This does indeed show that there are fewer men than women in this category.



In [None]:
df_low = df[df["Annual Income (k$)"] < 40]
df_low_male = df_low[df_low["Gender"] == "Male"]
df_low_female = df_low[df_low["Gender"] == "Female"]

print(
    f"Number of males with <40k income: {len(df_low_male)};\n"
    f"Number of females with <40k income: {len(df_low_female)};\n"
    f"The ratio of males vs. females is: {len(df_low_male) / len(df_low_female)}"
)
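
A ratio within the low-income group is easier to judge when set against the same quantity for the whole sample, since one gender may simply be more numerous overall. The following is a minimal sketch of such a comparison on made-up toy data; the helper `male_share` is hypothetical and introduced only for illustration:

```python
import pandas as pd

# Toy stand-in for df (all values are made up).
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Female",
               "Male", "Female", "Male", "Male"],
    "Annual Income (k$)": [20, 25, 30, 35, 60, 70, 80, 90],
})

def male_share(frame):
    """Fraction of rows whose Gender is 'Male' (hypothetical helper)."""
    return (frame["Gender"] == "Male").mean()

# Customers below the income threshold used in the text.
low = toy[toy["Annual Income (k$)"] < 40]

share_low = male_share(low)
share_all = male_share(toy)
print(f"Share of males with <40k income: {share_low:.2f}")
print(f"Share of males overall: {share_all:.2f}")
```

If the two shares are close, the imbalance in the low-income group mostly reflects the composition of the sample as a whole.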

We can even make a raincloud plot to get a fuller idea of how the male/female customers in this income range are distributed.



In [None]:
RainCloud(x="Gender", y="Annual Income (k$)", data=df_low)