# Cluster Analysis

Consider this scenario:
> Imagine that you own a mall and want to hold an annual promotion to celebrate the mall's anniversary. Your customers have their own preferences about which promotions appeal to them, so you need a different promotion for each behavioral group. You have demographic data about your members: gender, age, annual income, and a spending score (a score based on each customer's behavior and purchasing data). Why not run a customer segmentation analysis?

Let's do a customer segmentation analysis!

## Prepare and explore the data

For this experiment, we will use [this dataset](https://www.kaggle.com/vjchoudhary7/customer-segmentation-tutorial-in-python) from Kaggle.

In [None]:
# Package imports
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.cluster.hierarchy as shc
import seaborn as sns

In [None]:
df = pd.read_csv('data/Mall_Customers.csv')

# Rename the columns to simplify the analysis
df.columns = ['customer_id', 'gender', 'age', 'annual_income', 'spending_score']

df.head()

Since `customer_id` is only an identifier, we can use this column as the DataFrame index.

In [None]:
df = df.set_index('customer_id')

Let's do some data exploration.

In [None]:
df.info()

The data contains 200 rows and 4 columns; each row represents a customer. The columns are:
* `gender` - The customer's gender
* `age` - The customer's age in years
* `annual_income` - The customer's annual income in thousands of dollars
* `spending_score` - A score based on the customer's behavior, ranging from 1 to 100

In [None]:
df.groupby(['gender']).size()

In [None]:
df.describe()

## Data preprocessing

Because k-means is a distance-based algorithm, it is better to scale the features to a common range before clustering. The functions we will use also accept only numerical input, so we first encode the `gender` feature as a number.

In [None]:
# Encode the gender feature: 1 for 'Male', 0 otherwise

df = df.assign(gender = [1 if gender == 'Male' else 0 for gender in df['gender']])

In [None]:
# Normalize the data to the [0, 1] range

scaler = MinMaxScaler().fit(df)

features = ['gender', 'age', 'annual_income', 'spending_score']

df_scaled = df.copy()
df_scaled[features] = scaler.transform(df_scaled[features])

df_scaled.head()
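To see why scaling matters for a distance-based method, here is a minimal sketch on made-up data. The three customers and the `min_max_scale` and `distance` helpers are hypothetical, and income is expressed in dollars (not thousands, as in our dataset) to exaggerate the effect. The helper mirrors what `MinMaxScaler` does with its default settings.

```python
import numpy as np

# Hypothetical customers: [age in years, annual income in dollars]
X = np.array([
    [20.0, 50_000.0],
    [60.0, 50_500.0],
    [21.0, 90_000.0],
])

def min_max_scale(values):
    """Rescale each column to [0, 1], like MinMaxScaler with default settings."""
    col_min = values.min(axis=0)
    col_max = values.max(axis=0)
    return (values - col_min) / (col_max - col_min)

def distance(a, b):
    """Euclidean distance between two points."""
    return float(np.linalg.norm(a - b))

X_scaled = min_max_scale(X)

# Raw data: the income column dominates the distance, so the 40-year age gap
# between customers 0 and 1 barely registers
raw_01 = distance(X[0], X[1])
raw_02 = distance(X[0], X[2])

# Scaled data: the age gap and the income gap now carry comparable weight
scaled_01 = distance(X_scaled[0], X_scaled[1])
scaled_02 = distance(X_scaled[0], X_scaled[2])

print(f"raw:    d(0,1) = {raw_01:.1f}, d(0,2) = {raw_02:.1f}")
print(f"scaled: d(0,1) = {scaled_01:.3f}, d(0,2) = {scaled_02:.3f}")
```

On the raw values the two distances differ by a large factor purely because of the units; after scaling they are nearly equal.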

## Modeling

### K-Means

First, let's use the default parameters.

In [None]:
kmeans = KMeans(random_state = 123).fit(df_scaled)

In [None]:
df_result = df.copy()
df_result['cluster'] = kmeans.labels_

In [None]:
sns.pairplot(data = df_result, hue = 'cluster', diag_kind = 'None', palette = 'tab10')

The default number of clusters is 8. With the default parameters, the clusters are not clearly separated from one another. Let's tune this parameter to get better results.

#### Elbow method

In [None]:
possible_k = [2, 3, 4, 5, 6, 7, 8, 9, 10]
inertia = []

for k in possible_k:
    kmeans = KMeans(n_clusters = k, random_state = 123).fit(df_scaled)

    inertia.append(kmeans.inertia_)

In [None]:
plt.plot(possible_k, inertia, 'bx-')
plt.xlabel('Values of K')
plt.ylabel('Inertia')
plt.title('The Elbow Method using Inertia')
plt.show()

The plot above shows an elbow at k = 4, so the elbow method suggests 4 clusters.
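The inertia plotted above is the sum of squared distances from each sample to its closest cluster center. As a rough illustration, the sketch below computes it by hand on toy data; the `inertia` helper, the points, and the centroids are all made up for this example and are not part of the analysis.

```python
import numpy as np

def inertia(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    # Pairwise squared distances, shape (n_points, n_centroids)
    diffs = points[:, np.newaxis, :] - centroids[np.newaxis, :, :]
    squared_dist = (diffs ** 2).sum(axis=2)
    return float(squared_dist.min(axis=1).sum())

# Toy data: two tight groups of two points each (hypothetical values)
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

# k = 1: a single centroid at the overall mean
inertia_k1 = inertia(points, points.mean(axis=0, keepdims=True))

# k = 2: one centroid at the mean of each group
inertia_k2 = inertia(points, np.array([[0.0, 1.0], [10.0, 1.0]]))

print(inertia_k1, inertia_k2)
```

Inertia keeps shrinking as k grows, which is why we look for the point where the curve bends and not for its minimum.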

#### Silhouette Score

In [None]:
possible_k = [2, 3, 4, 5, 6, 7, 8, 9, 10]
silhouette = []

for k in possible_k:
    kmeans = KMeans(n_clusters = k, random_state = 123).fit(df_scaled)

    silhouette.append(silhouette_score(X = df_scaled, labels = kmeans.labels_))

In [None]:
plt.plot(possible_k, silhouette, 'bx-')
plt.xlabel('Values of K')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score Method')
plt.show()

Using the silhouette score, the best k is 2, the value with the highest score.
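For each sample, the silhouette coefficient is (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points of the nearest other cluster. The score is the average over all samples. Here is a minimal sketch of that calculation on toy data; `silhouette_by_hand` is a hypothetical helper written for illustration and assumes every cluster has at least two points.

```python
import numpy as np

def silhouette_by_hand(points, labels):
    """Mean silhouette coefficient; assumes every cluster has at least two points."""
    # Pairwise Euclidean distances, shape (n, n)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    scores = []
    for i in range(len(points)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        a = dist[i, same].mean()
        # Mean distance to each other cluster; keep the nearest one
        b = min(dist[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Toy data: two obvious groups on a line (hypothetical values)
points = np.array([[0.0], [1.0], [10.0], [11.0]])

good = silhouette_by_hand(points, np.array([0, 0, 1, 1]))  # matches the groups
bad = silhouette_by_hand(points, np.array([0, 1, 0, 1]))   # mixes the groups

print(f"good labels: {good:.3f}, bad labels: {bad:.3f}")
```

A score close to 1 means the clusters are compact and well separated, while a negative score means points sit closer to another cluster than to their own.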

#### Compare the results

In [None]:
# 2 clusters

kmeans_2clust = KMeans(n_clusters = 2, random_state = 123).fit(df_scaled)

df_result_2clust = df.copy()
df_result_2clust['cluster'] = kmeans_2clust.labels_

_ = sns.pairplot(data = df_result_2clust, hue = 'cluster', diag_kind = 'None', palette = 'tab10')
plt.show()

In [None]:
# 4 clusters

kmeans_4clust = KMeans(n_clusters = 4, random_state = 123).fit(df_scaled)

df_result_4clust = df.copy()
df_result_4clust['cluster'] = kmeans_4clust.labels_

_ = sns.pairplot(data = df_result_4clust, hue = 'cluster', diag_kind = 'None', palette = 'tab10')
plt.show()

What do you think? Which value of k is best?

### Hierarchical Clustering

Before clustering the data, we can plot a dendrogram to help us decide on the number of clusters for this problem.

In [None]:
plt.figure(figsize=(16, 7))
plt.title("Dendrograms")

dend = shc.dendrogram(shc.linkage(df_scaled, method='ward'))

From the plot above, we can see that the longest vertical line is the blue one. Hence, we can set a threshold of 6 and cut the dendrogram there.

In [None]:
plt.figure(figsize=(16, 7))
plt.title("Dendrograms")

dend = shc.dendrogram(shc.linkage(df_scaled, method='ward'))
_ = plt.axhline(y=6, color='r', linestyle='--')
plt.show()
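The same cut can also be made programmatically. The sketch below, on made-up toy data, picks a threshold in the middle of the largest gap between consecutive merge heights and then uses `fcluster` from `scipy.cluster.hierarchy` to count the resulting clusters. The largest-gap rule is one common heuristic that mirrors the visual reading of the dendrogram, not the only way to choose a cut.

```python
import numpy as np
import scipy.cluster.hierarchy as shc

# Toy data: two well-separated groups (hypothetical values)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

Z = shc.linkage(X, method='ward')

# The third column of the linkage matrix holds the merge heights
heights = Z[:, 2]

# Place the threshold in the middle of the largest gap between consecutive merges
gap_index = int(np.argmax(np.diff(heights)))
threshold = (heights[gap_index] + heights[gap_index + 1]) / 2

# Assign cluster labels by cutting the tree at that height
labels = shc.fcluster(Z, t=threshold, criterion='distance')
n_clusters = len(np.unique(labels))

print(f"threshold = {threshold:.2f}, clusters = {n_clusters}")
```

Each branch that crosses the horizontal cut line becomes one cluster, so counting the crossings in the dendrogram gives the same number.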

In [None]:
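# Ward linkage always uses Euclidean distance, so no distance argument is needed
# (the `affinity` parameter was deprecated and later removed from scikit-learn)
AggClust = AgglomerativeClustering(n_clusters=2, linkage='ward').fit(df_scaled)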

df_result_AggClust = df.copy()
df_result_AggClust['cluster'] = AggClust.labels_

_ = sns.pairplot(data = df_result_AggClust, hue = 'cluster', diag_kind = 'None', palette = 'tab10')
plt.show()