# Know your customers

One of the most common applications of KMeans is customer segmentation. Here we use a very simple dataset, Mall Customers, to try to discover customer segments.

0. Import the usual libraries

In [1]:
import pandas as pd

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

import plotly.express as px
import plotly.io as pio
pio.renderers.default = "iframe_connected"

1. Import the ```Mall_Customers.csv``` dataset

In [2]:
dataset = pd.read_csv("https://full-stack-bigdata-datasets.s3.eu-west-3.amazonaws.com/Machine+Learning+non+Supervis%C3%A9/KMeans/Exercices/Datasets/Mall_Customers.csv")
dataset.head()

Unnamed: 0,CustomerID,Genre,Age,Annual Income (k$),Spending Score (1-100)
0,1,Male,19,15,39
1,2,Male,21,15,81
2,3,Female,20,16,6
3,4,Female,23,16,77
4,5,Female,31,17,40


In [3]:
# Basic statistics
print("Number of rows: {}".format(dataset.shape[0]))
print()

print("Dataset preview: ")
display(dataset.head())
print()

print("Basic statistics: ")
data_desc = dataset.describe(include='all')
display(data_desc)
print()

print("Percentage of missing values: ")
display(100*dataset.isnull().sum()/dataset.shape[0])

Number of rows: 200

Dataset preview: 


Unnamed: 0,CustomerID,Genre,Age,Annual Income (k$),Spending Score (1-100)
0,1,Male,19,15,39
1,2,Male,21,15,81
2,3,Female,20,16,6
3,4,Female,23,16,77
4,5,Female,31,17,40



Basic statistics: 


Unnamed: 0,CustomerID,Genre,Age,Annual Income (k$),Spending Score (1-100)
count,200.0,200,200.0,200.0,200.0
unique,,2,,,
top,,Female,,,
freq,,112,,,
mean,100.5,,38.85,60.56,50.2
std,57.879185,,13.969007,26.264721,25.823522
min,1.0,,18.0,15.0,1.0
25%,50.75,,28.75,41.5,34.75
50%,100.5,,36.0,61.5,50.0
75%,150.25,,49.0,78.0,73.0



Percentage of missing values: 


CustomerID                0.0
Genre                     0.0
Age                       0.0
Annual Income (k$)        0.0
Spending Score (1-100)    0.0
dtype: float64

2. Remove the "CustomerID" variable from your dataset. 

In [4]:
# Drop the useless columns
useless_cols = ['CustomerID']

print("The following columns will be dropped: ", useless_cols)
dataset = dataset.drop(useless_cols, axis=1)
print("...Done.")
print(dataset.head())

The following columns will be dropped:  ['CustomerID']
...Done.
    Genre  Age  Annual Income (k$)  Spending Score (1-100)
0    Male   19                  15                      39
1    Male   21                  15                      81
2  Female   20                  16                       6
3  Female   23                  16                      77
4  Female   31                  17                      40


3. Encode your categorical variables as dummy variables and standardize the numeric ones

In [5]:
# Build the pipeline for the numeric variables
numeric_features = [1, 2, 3] # Positions of the numeric columns in the dataset
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler()) # standardize the variables (zero mean, unit variance)
])

# Build the pipeline for the categorical variables
categorical_features = [0] # Positions of the categorical columns in the dataset
categorical_transformer = Pipeline(
    steps=[
    ('encoder', OneHotEncoder(drop='first')) # encode the categories as columns of 0s and 1s
    ])

# Combine the pipelines in a ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Preprocess the dataset
print("Preprocessing the dataset...")
print(dataset.head())
X = preprocessor.fit_transform(dataset) # fit_transform: fit the transformers, then apply them
print('...Done.')
print(X[0:5, :])
print()

Preprocessing the dataset...
    Genre  Age  Annual Income (k$)  Spending Score (1-100)
0    Male   19                  15                      39
1    Male   21                  15                      81
2  Female   20                  16                       6
3  Female   23                  16                      77
4  Female   31                  17                      40
...Done.
[[-1.42456879 -1.73899919 -0.43480148  1.        ]
 [-1.28103541 -1.73899919  1.19570407  1.        ]
 [-1.3528021  -1.70082976 -1.71591298  0.        ]
 [-1.13750203 -1.70082976  1.04041783  0.        ]
 [-0.56336851 -1.66266033 -0.39597992  0.        ]]
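
As a sanity check, the first preprocessed row can be reproduced by hand. The sketch below relies on two facts: `StandardScaler` divides by the population standard deviation (`ddof=0`) whereas `describe()` reports the sample standard deviation (`ddof=1`), and `OneHotEncoder(drop='first')` drops the alphabetically first category (`Female`), leaving a single column equal to 1 for `Male`. The means and standard deviations are copied from the `describe()` output above; the variable names are illustrative only.

```python
import math

n = 200  # number of rows in the dataset

# (mean, sample standard deviation with ddof=1) copied from describe()
stats = {
    "Age": (38.85, 13.969007),
    "Annual Income (k$)": (60.56, 26.264721),
    "Spending Score (1-100)": (50.2, 25.823522),
}

# First row of the dataset: Male, 19, 15, 39
first_row = {
    "Genre": "Male",
    "Age": 19,
    "Annual Income (k$)": 15,
    "Spending Score (1-100)": 39,
}

scaled = []
for column, (mean, sample_std) in stats.items():
    # Convert the sample std (ddof=1) into the population std (ddof=0)
    population_std = sample_std * math.sqrt((n - 1) / n)
    scaled.append((first_row[column] - mean) / population_std)

# With drop='first', "Female" is the dropped reference category
is_male = 1.0 if first_row["Genre"] == "Male" else 0.0

manual_row = scaled + [is_male]
print(manual_row)
```

The printed values should agree with the first row of `X` shown above to about five decimal places; the small residual comes from the rounded standard deviations.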



4. We are going to build our clusters, but first we need to choose the number of clusters. Start with the ```Elbow``` method to see whether it suggests a value for ```k```.

In [6]:
# Use the Elbow method to find the optimal number of clusters

wcss = []  # within-cluster sum of squares (inertia) for each k
for i in range(2, 11):
    kmeans = KMeans(n_clusters=i)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

print(wcss)

[438.5224115567773, 344.4341455934711, 254.28290726083466, 216.78490151651047, 181.9514362434146, 164.72617895920516, 150.28418136449196, 138.94104830026822, 129.01779874884602]
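
The `inertia_` attribute collected above is the within-cluster sum of squares (WCSS): the sum of squared Euclidean distances between each sample and its closest centroid. A minimal NumPy sketch on made-up toy points (the arrays `points` and `centroids` are hypothetical, not taken from the dataset):

```python
import numpy as np

# Hypothetical toy data: two groups of two points each
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
# Hypothetical centroids, one in the middle of each group
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])

# Squared distance from every point to every centroid,
# shape (n_points, n_centroids)
squared_distances = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)

# Each point contributes its squared distance to the nearest centroid
inertia = squared_distances.min(axis=1).sum()
print(inertia)  # 4.0
```

Adding clusters can only keep this quantity equal or lower it, which is why the curve is read for its bend rather than for its minimum.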


In [7]:
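# Plot the inertia (WCSS) against the number of clusters
fig = px.line(x = list(range(2, 11)), y = wcss, labels = {"x": "Number of clusters k", "y": "WCSS (inertia)"})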
fig.show()
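
The elbow is easier to spot by looking at how much the inertia drops each time a cluster is added. The sketch below reuses the WCSS values printed above, rounded to two decimals; the names `wcss_printed` and `drops` are illustrative only.

```python
ks = list(range(2, 11))
# WCSS values printed above for k = 2..10, rounded to two decimals
wcss_printed = [438.52, 344.43, 254.28, 216.78, 181.95, 164.73, 150.28, 138.94, 129.02]

# Decrease in inertia obtained when going from k-1 to k clusters
drops = {
    k: round(previous - current, 2)
    for k, previous, current in zip(ks[1:], wcss_printed[:-1], wcss_printed[1:])
}

for k, drop in drops.items():
    print(f"k={k}: inertia decreases by {drop}")
```

With these values the gain falls sharply after k=4 (from about 90 to about 37) and again after k=6 (from about 35 to about 17), so the elbow is not clear-cut on its own.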

5. Then use the _Silhouette_ method to see if we can refine our hypothesis for ```k```.

In [8]:
# Use the silhouette score to determine the optimal number of clusters
s_score = []
for i in range(2, 11):
    kmeans = KMeans(n_clusters=i)
    kmeans.fit(X)
    s_score.append(silhouette_score(X, kmeans.predict(X)))

print(s_score)

[0.3031976564160757, 0.31384595454509323, 0.3502702043465398, 0.34977050035201074, 0.356485834425401, 0.3346555570188711, 0.33241936178446657, 0.33880160672227616, 0.31734740445677234]
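
For reference, the silhouette coefficient of a sample is $(b - a) / \max(a, b)$, where $a$ is its mean distance to the other members of its own cluster and $b$ is its mean distance to the members of the nearest other cluster; `silhouette_score` averages this over all samples. A minimal NumPy sketch on hypothetical one-dimensional data (the function `mean_silhouette` is illustrative, not part of scikit-learn):

```python
import numpy as np

def mean_silhouette(X, labels):
    """Average silhouette coefficient, computed directly from its definition."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances, shape (n_samples, n_samples)
    distances = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the sample itself
        if not same.any():
            scores.append(0.0)  # convention for single-sample clusters
            continue
        a = distances[i, same].mean()
        # Mean distance to each other cluster; keep the smallest one
        b = min(
            distances[i, labels == other].mean()
            for other in np.unique(labels)
            if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Hypothetical toy data: two tight, well-separated groups
X_toy = np.array([[0.0], [1.0], [10.0], [11.0]])
labels_toy = np.array([0, 0, 1, 1])
print(round(mean_silhouette(X_toy, labels_toy), 3))  # 0.9
```

Values close to 1 indicate compact, well-separated clusters; the scores around 0.30 to 0.36 printed above point to a much weaker structure in the preprocessed data.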


In [9]:
# Plot the silhouette scores against the number of clusters
fig = px.bar(x = list(range(2, 11)), y = s_score, labels = {"x": "Number of clusters k", "y": "Silhouette score"})
fig.show()
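
One simple way to read the bar chart is to pick the `k` with the highest average silhouette score. A short sketch using the scores printed above, rounded to four decimals (`scores_printed` is an illustrative name):

```python
ks = list(range(2, 11))
# Silhouette scores printed above for k = 2..10, rounded to four decimals
scores_printed = [0.3032, 0.3138, 0.3503, 0.3498, 0.3565, 0.3347, 0.3324, 0.3388, 0.3173]

# Pair each k with its score and keep the pair with the highest score
best_k, best_score = max(zip(ks, scores_printed), key=lambda pair: pair[1])
print(best_k, best_score)  # 6 0.3565
```

The margin over k=4 and k=5 is small (about 0.006), and KMeans is randomly initialized, so a different run may rank these values differently.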

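6. The silhouette score is highest for $K=6$ (about 0.356), and the elbow curve also flattens around that value, so we will take $K=6$ clusters. Apply KMeans to your dataset.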

In [10]:
# Re-train a KMeans with the chosen number of clusters
kmeans = KMeans(n_clusters= 6)
kmeans.fit(X)

KMeans(n_clusters=6)
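
To make explicit what `fit` does, here is a minimal sketch of the classic Lloyd iteration behind k-means, written with NumPy on hypothetical toy data. It is a simplified illustration, not scikit-learn's implementation: the function `lloyd_kmeans` and its parameters are made up for this example, and it uses plain random initialization, whereas `KMeans` defaults to the k-means++ initialization.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Toy k-means: alternate assignment and centroid update steps."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center
        squared = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = squared.argmin(axis=1)
        # Update step: each center moves to the mean of its points
        # (a center that lost all its points is left where it was)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving: converged
        centers = new_centers
    return labels, centers

# Hypothetical toy data: two well-separated blobs of three points
blobs = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
blob_labels, blob_centers = lloyd_kmeans(blobs, k=2)
print(blob_labels)
print(blob_centers)
```

Because the starting centers are random, fixing a seed (here `seed`, or `random_state` in `KMeans`) is what makes a run repeatable; the cluster numbers themselves are arbitrary and can differ between runs.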

In [11]:
dataset.loc[:,'Cluster_KMeans'] = kmeans.predict(X)
dataset.head()

Unnamed: 0,Genre,Age,Annual Income (k$),Spending Score (1-100),Cluster_KMeans
0,Male,19,15,39,0
1,Male,21,15,81,0
2,Female,20,16,6,1
3,Female,23,16,77,0
4,Female,31,17,40,1


7. Let's create a graph to visualize each of the clusters as well as their centroids. We will first put the ```Annual Income``` on the x-axis and the ```Spending Score``` on the y-axis.

In [12]:
# Two-dimensional visualization
fig = px.scatter(dataset, x = 'Annual Income (k$)', y = "Spending Score (1-100)", color = "Cluster_KMeans")
fig.show()
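
The scatter plot above shows the points but not the centroids. One hedged way to obtain centroids in the original units is to average each cluster's points: at convergence a k-means center is (approximately) the mean of the samples assigned to it, and because standardization is an affine transformation, that mean can be taken directly on the unscaled numeric columns. The sketch below only uses the five labelled rows displayed earlier as stand-in data, so the numbers are purely illustrative and are not the centroids of the full dataset.

```python
import numpy as np

# Stand-in data: the five labelled rows displayed earlier
# Columns: Annual Income (k$), Spending Score (1-100)
points = np.array([[15, 39], [15, 81], [16, 6], [16, 77], [17, 40]], dtype=float)
cluster_labels = np.array([0, 0, 1, 0, 1])

# Centroid of a cluster = mean of the points assigned to it
centroids = {
    int(k): points[cluster_labels == k].mean(axis=0)
    for k in np.unique(cluster_labels)
}

for k, center in centroids.items():
    print(f"cluster {k}: income={center[0]:.2f}, score={center[1]:.2f}")
```

On the full dataset, the same averages could be computed with a pandas `groupby` on `Cluster_KMeans` and drawn on top of the scatter plot, for example with `fig.add_scatter(...)`.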

8. The previous plot shows well-separated clusters. This time, plot the ```Age``` variable against the ```Spending Score```. What do you notice?

In [13]:
# Two-dimensional visualization
fig = px.scatter(dataset, x = 'Age', y = "Spending Score (1-100)", color = "Cluster_KMeans")
fig.show()

----> This time the clusters overlap more and are much less clearly separated.

In [14]:
# Visualization in the space of the three numeric variables
fig = px.scatter_3d(dataset, x = 'Annual Income (k$)', y = "Spending Score (1-100)", z = 'Age', color = "Cluster_KMeans")
fig.show()