Customer segmentation using K-Means

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
df = pd.read_csv('D:\\K_Means_Project\\data\\raw\\Mall_Customers.csv')

In [4]:
df.head()

Unnamed: 0,CustomerID,Gender,Age,Annual Income (k$),Spending Score (1-100)
0,1,Male,19,15,39
1,2,Male,21,15,81
2,3,Female,20,16,6
3,4,Female,23,16,77
4,5,Female,31,17,40


CustomerID and Gender are excluded from the clustering features.

CustomerID is an identifier and carries no behavioral information. Gender is an available feature, but exploratory analysis shows substantial overlap in spending and income behavior across genders. In addition, Gender is categorical, and encoding it numerically for K-Means would impose an artificial distance between categories. Gender is therefore excluded from clustering and retained only for post-cluster interpretation.
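
One way to check the overlap claim is to compare per-gender summary statistics. The sketch below is illustrative only: it uses the five rows displayed by df.head() above as a stand-in (the names `sample`, `features` and `gender_summary` are hypothetical), and five rows are far too few to draw conclusions. On the full dataset the same groupby would presumably be run on df.

```python
import pandas as pd

# Stand-in data: the five rows displayed by df.head() above.
sample = pd.DataFrame({
    'Gender': ['Male', 'Male', 'Female', 'Female', 'Female'],
    'Age': [19, 21, 20, 23, 31],
    'Annual Income (k$)': [15, 15, 16, 16, 17],
    'Spending Score (1-100)': [39, 81, 6, 77, 40],
})

features = ['Annual Income (k$)', 'Spending Score (1-100)']

# Per-gender mean, min and max for each feature. Heavily overlapping
# ranges would suggest that Gender adds little separation between customers.
gender_summary = sample.groupby('Gender')[features].agg(['mean', 'min', 'max'])
print(gender_summary)
```

If the ranges for the two groups largely coincide on the full data, that supports leaving Gender out of the distance computation.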

In [17]:
new_df = df[['Age','Annual Income (k$)','Spending Score (1-100)']]

In [18]:
new_df.head()

Unnamed: 0,Age,Annual Income (k$),Spending Score (1-100)
0,19,15,39
1,21,15,81
2,20,16,6
3,23,16,77
4,31,17,40


K-Means is distance-based, and the selected features are measured on different scales, so feature scaling is required. StandardScaler standardizes each feature to zero mean and unit variance, so that no single variable disproportionately influences the clustering outcome.
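
To make the transformation concrete: for each column, StandardScaler computes z = (x - mean) / std, where std is the population standard deviation (ddof=0). The sketch below checks this by hand on a small stand-in frame built from the five rows shown earlier (the names `sample`, `manual` and `scaled` are hypothetical). The scaled values differ from those on the full dataset because the mean and standard deviation are computed from these five rows only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Stand-in data: the five rows displayed by new_df.head().
sample = pd.DataFrame({
    'Age': [19, 21, 20, 23, 31],
    'Annual Income (k$)': [15, 15, 16, 16, 17],
    'Spending Score (1-100)': [39, 81, 6, 77, 40],
})

# Manual z-score. ddof=0 matches StandardScaler, which divides by the
# population standard deviation rather than the sample one.
manual = (sample - sample.mean()) / sample.std(ddof=0)

# Same transformation via scikit-learn.
scaled = StandardScaler().fit_transform(sample)

# The two results should agree, and each scaled column should have
# mean 0 and standard deviation 1 (up to floating-point error).
print(np.allclose(manual.to_numpy(), scaled))
print(scaled.mean(axis=0).round(6))
print(scaled.std(axis=0).round(6))
```

Note that pandas' `std()` defaults to ddof=1, so omitting `ddof=0` in the manual version would give slightly different values from StandardScaler.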

In [19]:
from sklearn.preprocessing import StandardScaler

In [20]:
scale = StandardScaler()

In [21]:
scaled_data = scale.fit_transform(new_df)

In [22]:
scaled_df = pd.DataFrame(scaled_data, columns = new_df.columns)

In [23]:
scaled_df.head()

Unnamed: 0,Age,Annual Income (k$),Spending Score (1-100)
0,-1.424569,-1.738999,-0.434801
1,-1.281035,-1.738999,1.195704
2,-1.352802,-1.70083,-1.715913
3,-1.137502,-1.70083,1.040418
4,-0.563369,-1.66266,-0.39598


In [None]:
from sklearn.cluster import KMeans