
# KMeans Clustering - Client segmentation

---

With this project I want to take an in-depth look at the KMeans clustering algorithm. On Kaggle (as always) I found a dataset describing the characteristics of different clients. My goal is to build a model that segments the clients, so we can tell which ones are most likely to be converted into customers.
I'll be using scikit-learn and pandas.

In [148]:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
import plotly.express as px
import plotly.graph_objects as go
import matplotlib.pyplot as plt

df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/client-segmentation/Mall_Customers.csv')
print(df.head(7))
print(df.info())
print(df.isnull().sum())

   CustomerID  Gender  Age  Annual Income (k$)  Spending Score (1-100)
0           1    Male   19                  15                      39
1           2    Male   21                  15                      81
2           3  Female   20                  16                       6
3           4  Female   23                  16                      77
4           5  Female   31                  17                      40
5           6  Female   22                  17                      76
6           7  Female   35                  18                       6
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   CustomerID              200 non-null    int64 
 1   Gender                  200 non-null    object
 2   Age                     200 non-null    int64 
 3   Annual Income (k$)      200 non-null    int64 
 4   Spending Sc

Good. We can see that the dataset has only a few columns, and I think the most interesting value to study is the spending score (whatever it means). I will split the data with Pareto as always and start training the model.

But before doing that I will set the CustomerID column as the index, since it looks like it is acting as a primary key, just for optimization.

In [149]:
df = df.set_index(keys = 'CustomerID')
print(df.head(7))

            Gender  Age  Annual Income (k$)  Spending Score (1-100)
CustomerID                                                         
1             Male   19                  15                      39
2             Male   21                  15                      81
3           Female   20                  16                       6
4           Female   23                  16                      77
5           Female   31                  17                      40
6           Female   22                  17                      76
7           Female   35                  18                       6


Now this makes more sense from a human point of view. I always try to keep the development of a project as simple as I can, to make sure I'm not leaving anything behind.

Now we can start with the split of the dataset and the training of the model. The fastest way to get useful information out of this dataset is to analyse how the annual income and the spending score cluster together, and to find the groups of clients that this pair of properties reveals. If the dataset were a little bigger and had more properties, it would be great to redo the same analysis with different pairs of properties. With this we could give the client a more in-depth analysis.

>**NOTE** that the Gender column holds strings. This will not fit our model, so we'll have to make it binary, for example 1 for female and 0 for male. It is not strictly necessary here, since Gender is not one of the columns we cluster on, but it is good practice.

In [150]:
# Encode Gender as a number: 0 for male, 1 for female
df['Gender'] = df['Gender'].map({'Male': 0, 'Female': 1})

x = df.iloc[:, [2, 3]].values # The relationship between the annual income and the spending score

# Since we now have array objects, we'll have to convert them to dataframes
x = pd.DataFrame(data=x, columns=['Annual income', 'Spending score'])
print(x.head())

   Annual income  Spending score
0             15              39
1             15              81
2             16               6
3             16              77
4             17              40


At this stage, and from my point of view, I think that to better understand the inner mechanics of KMeans clustering it is important to introduce some basic concepts:

> The KMeans clustering algorithm takes n observations and aims to partition them into k clusters (collections of data points grouped together because of certain similarities), where k <= n, so as to minimize the WCSS. For a fixed dataset this is equivalent to maximizing the BCSS.

*   WCSS: The *Within Cluster Sum of Squares* is the sum of the squared distances between each point and the centroid of the cluster it belongs to.
*   BCSS: The *Between Cluster Sum of Squares* is the sum of the squared distances between each cluster centroid and the overall mean of the data, weighted by the number of points in the cluster.
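
To make the two definitions concrete, here is a minimal sketch that computes both quantities by hand with NumPy. The six toy points and the names `points`, `toy_model`, `wcss_manual` and `bcss_manual` are made up for illustration and are not part of the mall dataset. It also checks that the hand-made WCSS matches the `inertia_` attribute of the fitted model.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (assumption): two well-separated groups of 2-D points
points = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
                   [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]])

toy_model = KMeans(n_clusters=2, n_init=10, random_state=7).fit(points)
labels = toy_model.labels_
centroids = toy_model.cluster_centers_

# WCSS: squared distance from each point to the centroid of its own cluster
wcss_manual = float(((points - centroids[labels]) ** 2).sum())

# BCSS: squared distance from each centroid to the overall mean,
# weighted by the number of points in that cluster
overall_mean = points.mean(axis=0)
cluster_sizes = np.bincount(labels, minlength=2)
bcss_manual = float((cluster_sizes * ((centroids - overall_mean) ** 2).sum(axis=1)).sum())

# Total sum of squares: squared distance from each point to the overall mean
tss = float(((points - overall_mean) ** 2).sum())

print(wcss_manual, toy_model.inertia_)   # the two values should agree
print(wcss_manual + bcss_manual, tss)    # WCSS + BCSS should add up to the total
```

Since the total is fixed by the data, every bit of WCSS the algorithm removes shows up as BCSS, which is why minimizing one maximizes the other.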

An interesting approach I saw a while ago (often called the elbow method) is to calculate the WCSS for different numbers of clusters. This lets us pick the number of clusters that gives the best reading of the data without being incomplete or too complex.



In [151]:
wcss = [] # We initialize an empty list

for c in range(1, 10): # We try every number of clusters from 1 to 9
  model = KMeans(n_clusters = c, random_state = 7) # We initialize the model
  model.fit(x) # We train the model
  wcss.append(model.inertia_) # The 'inertia_' attribute of the model is the sum of the squared distances of the data points to their closest centroid


Now we can see visually which number of clusters will probably suit our case best.

In [152]:
# We make a dictionary with the data we have
dic1 = {
    'n of clusters': list(range(1, 10)), # Same range as the loop above; avoids reusing the name x
    'WCSS': wcss
}

df_wcss1 = pd.DataFrame(data = dic1) # We convert it to a dataframe

figure = px.line(data_frame = df_wcss1, x = 'n of clusters', y = 'WCSS', width = 800, height=400, title = 'WCSS depending on the number of clusters') # We make a visual object
figure.show() # Plot the result

It is easy to see that the best number of clusters in this case is 5. Why? Because beyond 5 clusters the WCSS barely keeps decreasing, so extra clusters add complexity without a real improvement. It is safe to assume that the best number of clusters is 5. Now we can redo the model and check some values!
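
Reading the elbow by eye is a bit subjective, so a second opinion can help. The sketch below is not part of the original analysis: it swaps in the silhouette score (`sklearn.metrics.silhouette_score`) as a cross-check, and it runs on synthetic data from `make_blobs` with five hand-picked centres rather than on the mall dataset. The names `demo_points`, `silhouette_by_k` and `best_k` are made up for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (assumption): 200 points around 5 well-separated centres
demo_centres = [[0, 0], [10, 0], [0, 10], [10, 10], [5, 5]]
demo_points, _ = make_blobs(n_samples=200, centers=demo_centres,
                            cluster_std=0.6, random_state=7)

silhouette_by_k = {}
for k in range(2, 10):  # the silhouette score needs at least 2 clusters
    candidate = KMeans(n_clusters=k, n_init=10, random_state=7).fit(demo_points)
    silhouette_by_k[k] = silhouette_score(demo_points, candidate.labels_)

# The k with the highest average silhouette is the candidate to compare with the elbow
best_k = max(silhouette_by_k, key=silhouette_by_k.get)
print(best_k, round(silhouette_by_k[best_k], 3))
```

If both the elbow and the silhouette score point to the same number of clusters, I would feel more comfortable with the choice. If they disagree, it is worth plotting both candidates.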

In [153]:
model = KMeans(n_clusters = 5, random_state = 7)
model.fit(x)

KMeans(n_clusters=5, random_state=7)

In [154]:
y = model.predict(x)

To plot the result I'll use Plotly. First I will add a column named 'Label' to the dataset, holding the cluster each data point belongs to.

Then I'll plot the results.

In [159]:
x['Label'] = y

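# I will use a for loop with globals() so that the variable's name changes on each iteration
# KMeans labels run from 0 to 4, so df_cluster1 holds label 0, df_cluster2 holds label 1, and so on
for c in range(1, 6):
  globals()['df_cluster' + str(c)] = x.loc[x['Label'] == c - 1]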

In [164]:
figure = go.Figure()
figure.add_trace(go.Scatter(x = df_cluster1['Annual income'], y = df_cluster1['Spending score'], mode='markers', name='Cluster 1'))
figure.add_trace(go.Scatter(x = df_cluster2['Annual income'], y = df_cluster2['Spending score'], mode='markers', name='Cluster 2'))
figure.add_trace(go.Scatter(x = df_cluster3['Annual income'], y = df_cluster3['Spending score'], mode='markers', name='Cluster 3'))
figure.add_trace(go.Scatter(x = df_cluster4['Annual income'], y = df_cluster4['Spending score'], mode='markers', name='Cluster 4'))
figure.add_trace(go.Scatter(x = df_cluster5['Annual income'], y = df_cluster5['Spending score'], mode='markers', name='Cluster 5'))
figure.update_layout(
    title='KMeans Cluster - Client segmentation (Annual income vs Spending score)',
    xaxis_title='Annual income',
    yaxis_title='Spending score'
)
figure.show()

At this point, our job is done. What we could do next is combine this information with the client ID of each entry, so that every client is assigned to a cluster, and then try to find classification patterns among some of their properties (and that, of course, would be a classification project).
The final step of the analysis would be to send this information to the business so it can make a marketing or expansion decision.
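
A minimal sketch of that hand-off, assuming a hypothetical miniature of the dataset indexed by CustomerID (the values, `clients`, `segmenter` and `summary` are invented for the example). The idea is to keep the client ID as the index while clustering, so each label stays attached to its client, and then to summarise each cluster for the business.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical miniature of the dataset, indexed by CustomerID
clients = pd.DataFrame(
    {'Annual income': [15, 16, 17, 85, 87, 88],
     'Spending score': [80, 77, 81, 10, 12, 9]},
    index=pd.Index([1, 2, 3, 4, 5, 6], name='CustomerID'),
)

segmenter = KMeans(n_clusters=2, n_init=10, random_state=7)
# fit_predict returns one label per row, in row order, so it lines up with the index
clients['Label'] = segmenter.fit_predict(clients[['Annual income', 'Spending score']])

# One row per cluster: how many clients it has and their average profile
summary = clients.groupby('Label').agg(
    clients_in_cluster=('Annual income', 'size'),
    mean_income=('Annual income', 'mean'),
    mean_score=('Spending score', 'mean'),
)
print(clients)
print(summary)
```

Note that in the notebook above `x` was rebuilt from `.values`, so it lost the CustomerID index. Passing `index=df.index` when building it would be one way to keep the IDs.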