# Customer Segmentation

In this project, our goal is to find distinct groups of customers based on their age, annual income, and spending score.
We will use the KMeans method to cluster the customers.

## Importing necessary libraries

In [1]:
import pandas as pd
from sklearn.cluster import KMeans
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

import warnings

warnings.filterwarnings("ignore")

## Loading the data

In [4]:
data = pd.read_csv("Mall_Customers.csv")
data.head()

Unnamed: 0,CustomerID,Gender,Age,Annual Income (k$),Spending Score (1-100)
0,1,Male,19,15,39
1,2,Male,21,15,81
2,3,Female,20,16,6
3,4,Female,23,16,77
4,5,Female,31,17,40


In [5]:
data.describe()

Unnamed: 0,CustomerID,Age,Annual Income (k$),Spending Score (1-100)
count,200.0,200.0,200.0,200.0
mean,100.5,38.85,60.56,50.2
std,57.879185,13.969007,26.264721,25.823522
min,1.0,18.0,15.0,1.0
25%,50.75,28.75,41.5,34.75
50%,100.5,36.0,61.5,50.0
75%,150.25,49.0,78.0,73.0
max,200.0,70.0,137.0,99.0


## Preprocessing

"CustomerID" is only a row identifier and carries no information about customer behavior,
so we remove it.

In [6]:
data.drop(["CustomerID"], axis=1, inplace=True)

In [7]:
data.isna().sum()

Gender                    0
Age                       0
Annual Income (k$)        0
Spending Score (1-100)    0
dtype: int64

None of the columns has missing values.

In [8]:
data.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Gender                  200 non-null    object
 1   Age                     200 non-null    int64 
 2   Annual Income (k$)      200 non-null    int64 
 3   Spending Score (1-100)  200 non-null    int64 
dtypes: int64(3), object(1)
memory usage: 16.9 KB


## Exploratory Data Analysis

### Gender and Age

In [12]:
fig = px.histogram(data_frame=data, x="Gender", color_discrete_sequence=["pink"], title="Gender Count", text_auto=True)
fig.show()

There are more "Female" customers than "Male" customers.

In [22]:
fig = px.histogram(data_frame=data, x="Age", facet_col="Gender", color_discrete_sequence=["pink"],
                   title="Age and Gender", barmode="group", text_auto=True)
fig.update_layout(bargap=0.2)
fig.show()

Many "Female" customers are between 30 and 35 years old.

### Spending Score

In [25]:
fig = px.histogram(data_frame=data, x="Spending Score (1-100)", facet_col="Gender",
                   color="Gender",  # needed for the two-color sequence below to take effect
                   color_discrete_sequence=["pink", "blue"],
                   title="Spending Score", barmode="group", text_auto=True)
fig.update_layout(bargap=0.2)
fig.show()

Most of the customers with a spending score between 70 and 100 are "Female".

### Income

In [39]:
data.groupby(["Gender"]).agg(Average_Spending=("Spending Score (1-100)", "mean"))

Gender,Average_Spending
Female,51.526786
Male,48.511364


In [40]:
data.groupby("Gender").agg(Average_Income=("Annual Income (k$)", "mean"))

Gender,Average_Income
Female,59.25
Male,62.227273


"Male" customers have a higher average income (62.2 k$ vs 59.3 k$) but a lower average spending score (48.5 vs 51.5).
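The two aggregations above can also be computed in a single `groupby` call with named aggregation. The sketch below uses a small made-up DataFrame (`toy` and its values are illustrative assumptions, not the mall data) so that it runs on its own.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the notebook
toy = pd.DataFrame({
    "Gender": ["Female", "Female", "Male", "Male"],
    "Annual Income (k$)": [50, 70, 60, 80],
    "Spending Score (1-100)": [60, 40, 30, 50],
})

# One pass over the groups: count, mean income, and mean spending score
summary = toy.groupby("Gender").agg(
    Customers=("Annual Income (k$)", "count"),
    Average_Income=("Annual Income (k$)", "mean"),
    Average_Spending=("Spending Score (1-100)", "mean"),
)
print(summary)
```

Putting both averages side by side makes the income and spending comparison easier to read than two separate tables.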

### Age and Income

In [29]:
fig = px.scatter(data_frame=data, x="Age", y="Annual Income (k$)", color="Gender", title="Income vs Age", width=1000,
                 height=600)
fig.show()

Customers younger than 30 or older than 60 tend to have a low "Annual Income".

### Age and Spending Score

In [30]:
fig = px.scatter(data_frame=data, x="Age", y="Spending Score (1-100)", color="Gender", title="Spending Score vs Age",
                 width=1000,
                 height=600)
fig.show()

Customers older than 40 tend to have lower spending scores.

### Income and Spending Score

In [31]:
fig = px.scatter(data_frame=data, x="Annual Income (k$)", y="Spending Score (1-100)", color="Gender",
                 title="Annual Income and Spending Score",
                 width=1000,
                 height=600)
fig.show()

The points appear to form several distinct groups.
We will explore them using KMeans.
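To make clear what KMeans does, here is a minimal NumPy sketch of the basic algorithm: assign every point to its nearest center, move each center to the mean of its points, and repeat. The function `simple_kmeans` and the two-blob data are illustrative assumptions; this is not how scikit-learn implements `KMeans` internally (it uses a smarter initialization and several restarts).

```python
import numpy as np

def simple_kmeans(X, k, n_iter=100, seed=0):
    """Basic KMeans loop (illustrative sketch, not sklearn's implementation)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest center for every point
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: mean of each cluster (keep the old center if a cluster is empty)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving
        centers = new_centers
    return labels, centers

# Two well-separated made-up blobs around (0, 0) and (10, 10)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
               rng.normal(10, 0.5, size=(50, 2))])
labels, centers = simple_kmeans(X, k=2)
print(centers)
```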

## Preprocessing

### Scaling
KMeans is based on distances, so a feature with a larger range would dominate the result. To give every feature the same significance, we scale the data before training the model.

In [34]:
scaler = StandardScaler()

In [35]:
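# Keep only "Annual Income (k$)" and "Spending Score (1-100)"; copy so a column can be added later
data_new = data.iloc[:, [2, 3]].copy()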

In [36]:
scaler.fit(data_new)

In [37]:
data_scaled = scaler.transform(data_new)
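As a quick check of what `StandardScaler` does, the sketch below compares its output with the formula `(x - mean) / std` applied per column. The array `X` holds made-up numbers, not rows from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values standing in for (income, spending score)
X = np.array([[15.0, 39.0],
              [40.0, 81.0],
              [75.0, 6.0],
              [120.0, 77.0]])

scaled = StandardScaler().fit_transform(X)

# Same result by hand: subtract the column mean, divide by the population std (ddof=0)
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, each column has mean 0 and standard deviation 1. Note that `fit_transform` does the `fit` and `transform` steps in one call.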

## Creating the model

### Elbow Method
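In KMeans clustering, we use the elbow method to select the optimal number of clusters. We plot the within-cluster sum of squares (WCSS) against the number of clusters and look for the point where the curve stops dropping sharply.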

In [38]:
wcss = []
for i in range(1, 10):
    kmeans = KMeans(n_clusters=i)
    kmeans.fit(data_scaled)
    wcss.append(kmeans.inertia_)

In [56]:
fig = px.line(x=range(1, 10), y=wcss, height=600, width=700, title="Elbow Method", color_discrete_sequence=["red"])
fig.update_layout(xaxis_title="Clusters",
                  yaxis_title="WCSS")
fig.show()

The elbow plot suggests 3 or 5 as candidate numbers of clusters.
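The WCSS stored in `inertia_` is the sum of squared distances from each point to the center of its cluster. The sketch below verifies this by hand on synthetic data; `make_blobs` and the parameter values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data, used only to illustrate the definition
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.6, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# WCSS by hand: squared distance of every point to its assigned center
assigned_centers = km.cluster_centers_[km.labels_]
manual_wcss = ((X - assigned_centers) ** 2).sum()

print(manual_wcss, km.inertia_)
```

Setting `random_state` as in this sketch also makes the elbow curve reproducible between runs.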

### Silhouette Score

In [48]:
silhouette_avg = []
for i in range(2, 10):
    temp_kmeans = KMeans(n_clusters=i)
    temp_kmeans.fit(data_scaled)
    cluster_label = temp_kmeans.labels_
    silhouette_avg.append(silhouette_score(data_scaled, cluster_label))
fig = px.line(x=range(2, 10), y=silhouette_avg, height=600, width=700, title="Silhouette Method")
fig.update_layout(xaxis_title="Clusters",
                  yaxis_title="Silhouette Score")
fig.show()

Based on the silhouette scores, we choose 5 as the number of clusters.
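Instead of reading the best value off the plot, the number of clusters with the highest silhouette score can be picked programmatically. The following is a sketch on synthetic data with four well-separated blobs (the blob centers and all parameter values are assumptions).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Four well-separated synthetic blobs
X, _ = make_blobs(n_samples=200,
                  centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
                  cluster_std=0.5, random_state=0)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # between -1 and 1, higher is better

# Number of clusters with the highest average silhouette score
best_k = max(scores, key=scores.get)
print(best_k)
```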

In [49]:
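# Note: this model is fit on the unscaled data_new (not data_scaled),
# so the cluster centers below are in the original units
kmeans = KMeans(n_clusters=5)
clusters = kmeans.fit_predict(data_new)
data_new["Clusters"] = clusters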
data_new.head()

Unnamed: 0,Annual Income (k$),Spending Score (1-100),Clusters
0,15,39,4
1,15,81,3
2,16,6,4
3,16,77,3
4,17,40,4


### Center of Clusters

In [50]:
# Each row is one center: [Annual Income (k$), Spending Score (1-100)]
centers = kmeans.cluster_centers_
centers

array([[55.2962963 , 49.51851852],
       [86.53846154, 82.12820513],
       [88.2       , 17.11428571],
       [25.72727273, 79.36363636],
       [26.30434783, 20.91304348]])
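The elbow and silhouette steps use `data_scaled`, while the final model is fit on the unscaled columns, which is why the centers above are in k$ and score units. If the final model were fit on the scaled data instead, the centers could be converted back to original units with `inverse_transform`. A sketch of that variant on made-up data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Made-up stand-in for (income, spending score) with three groups
X, _ = make_blobs(n_samples=150,
                  centers=[[20, 20], [60, 50], [90, 80]],
                  cluster_std=3.0, random_state=0)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit on the scaled data, consistent with the elbow and silhouette analysis
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)

# Centers are in scaled units; map them back to the original units
centers_original = scaler.inverse_transform(km.cluster_centers_)
print(centers_original)
```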

## Visualizing the Clusters

In [51]:
data_new["Clusters"] = data_new["Clusters"].apply(lambda x: f"Cluster {x + 1}")

In [57]:
fig = px.scatter(data_frame=data_new, x="Annual Income (k$)", y="Spending Score (1-100)", color="Clusters")
fig.show()

### From this we can say there are 5 types of customers
1. Mid Income & Mid Spender (Cluster 1)
2. High Income & High Spender (Cluster 2)
3. High Income & Low Spender (Cluster 3)
4. Low Income & High Spender (Cluster 4)
5. Low Income & Low Spender (Cluster 5)

Using this information, we can offer personalized advertisements and services to each group.
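The segment names can also be derived from the cluster centers instead of by eye. The sketch below uses the centers printed above; the helper `level` and the thresholds (40 and 70 k$ for income, 40 and 60 for spending score) are assumptions chosen for illustration.

```python
import numpy as np

# Cluster centers from the output above: [Annual Income (k$), Spending Score (1-100)]
centers = np.array([
    [55.2962963, 49.51851852],
    [86.53846154, 82.12820513],
    [88.2, 17.11428571],
    [25.72727273, 79.36363636],
    [26.30434783, 20.91304348],
])

def level(value, low, high):
    """Map a value to Low / Mid / High using hypothetical thresholds."""
    if value < low:
        return "Low"
    if value > high:
        return "High"
    return "Mid"

segment_names = [
    f"{level(income, 40, 70)} Income & {level(spending, 40, 60)} Spender"
    for income, spending in centers
]

for i, name in enumerate(segment_names, start=1):
    print(f"Cluster {i}: {name}")
```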