<a href="https://www.kaggle.com/code/manishkr1754/customer-segmentation-using-k-means?scriptVersionId=144343830" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

---
<center><h1>Customer Segmentation using K-Means</h1></center>
<center><h3>Part of 30 Days 30 ML Projects Challenge</h3></center>

---

## 1) Understanding Problem Statement
---

The primary goal of this project is to use data analytics and machine learning to perform **customer segmentation** for a retail business. Customer segmentation is a critical task for understanding and effectively targeting different customer groups. This problem falls within the realm of **Unsupervised Machine Learning (Clustering)**, because customers are grouped into distinct segments based on their attributes and behaviors without any predefined labels. For example, suppose you own a mall and want to know which customers are most likely to convert, so that this insight can be passed to the marketing team to plan its strategy accordingly.

## 2) Understanding Data
---

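The project uses **Customer Segmentation Data**, which contains several customer attributes. Because K-Means is unsupervised, no outcome (dependent) variable is used for modeling; the clusters are built from the Annual Income and Spending Score columns.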

## 3) Getting System Ready
---
Importing required libraries


In [None]:
import pandas as pd
import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings("ignore")
%matplotlib inline

## 4) Data Eyeballing
---

### Loading Data

In [None]:
customer_segment_data = pd.read_csv('Datasets/Day13_Customer_Segmentation_Data.csv') 

In [None]:
customer_segment_data

In [None]:
print('The size of Dataframe is: ', customer_segment_data.shape)
print('-'*100)
print('The Column Name, Record Count and Data Types are as follows: ')
customer_segment_data.info()
print('-'*100)

In [None]:
# Defining numerical & categorical columns
numeric_features = [feature for feature in customer_segment_data.columns if customer_segment_data[feature].dtype != 'O']
categorical_features = [feature for feature in customer_segment_data.columns if customer_segment_data[feature].dtype == 'O']

# print columns
print('We have {} numerical features : {}'.format(len(numeric_features), numeric_features))
print('\nWe have {} categorical features : {}'.format(len(categorical_features), categorical_features))

In [None]:
print('Missing Value Presence in different columns of DataFrame are as follows : ')
print('-'*100)
total=customer_segment_data.isnull().sum().sort_values(ascending=False)
percent=(customer_segment_data.isnull().sum()/customer_segment_data.isnull().count()*100).sort_values(ascending=False)
pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])

In [None]:
print('Summary Statistics of numerical features for DataFrame are as follows:')
print('-'*100)
customer_segment_data.describe()

In [None]:
print('Summary Statistics of categorical features for DataFrame are as follows:')
print('-'*100)
customer_segment_data.describe(include='object').T

## 5) Data Cleaning & Preprocessing
---

### Choosing the Annual Income Column & Spending Score column for Clustering

In [None]:
X = customer_segment_data.iloc[:,[3,4]].values

In [None]:
X
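
Selecting columns by position depends on the column order in the CSV. A minimal sketch of the same selection by name is shown here on a tiny made-up frame. The two selected column names are the ones used later in this notebook, while `CustomerID`, `Gender`, `Age` and all values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical miniature of the dataset: all values (and the first three column names) are made up
demo = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4],
    'Gender': ['Male', 'Female', 'Female', 'Male'],
    'Age': [19, 35, 52, 28],
    'Annual Income (k$)': [15, 40, 78, 120],
    'Spending Score (1-100)': [39, 55, 12, 90],
})

# Positional selection, as done above
X_by_position = demo.iloc[:, [3, 4]].values

# Name-based selection: keeps working if the column order changes
X_by_name = demo[['Annual Income (k$)', 'Spending Score (1-100)']].values

print((X_by_position == X_by_name).all())
```

Selecting by name makes the intent explicit and raises a `KeyError` if a column is missing, rather than silently picking a different one.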

## 6) Model Building
---

### Data Standardization

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

In [None]:
scaler.fit(X)

In [None]:
standardized_data = scaler.transform(X)

In [None]:
standardized_data

In [None]:
X = standardized_data

In [None]:
X
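
As a sanity check on what standardization does, the sketch below compares `StandardScaler` with the manual formula `(x - mean) / std` on randomly generated stand-in data (an assumption, not the notebook's dataset). It also shows that `fit_transform` combines the separate `fit` and `transform` steps used above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the two selected columns (income in k$, spending score)
rng = np.random.default_rng(0)
X_demo = np.column_stack([rng.uniform(15, 140, 200), rng.uniform(1, 100, 200)])

scaler_demo = StandardScaler()
X_scaled = scaler_demo.fit_transform(X_demo)  # fit(X) followed by transform(X)

# StandardScaler divides by the population standard deviation (ddof=0), as np.std does by default
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(np.allclose(X_scaled, X_manual))
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```

After scaling, each column has mean 0 and standard deviation 1, so neither feature dominates the Euclidean distances that K-Means relies on.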

### Choosing Number of Clusters

#### Method-1: Using `.score()` method

In [None]:
from sklearn.cluster import KMeans

cluster_iteration = range(1,10)
scores = []

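for cluster_number in cluster_iteration:
    K_Means = KMeans(n_clusters=cluster_number, random_state=45)
    K_Means.fit(X)  # fit once; the original code refit the model a second time here
    # .score(X) is the negative within-cluster sum of squares, so values closer to 0 are better
    scores.append(K_Means.score(X))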

In [None]:
scores

#### Elbow Curve (Number of Clusters Vs Score)

In [None]:
plt.style.use('ggplot')  # set the style before plotting so it applies to this figure
plt.plot(cluster_iteration,scores)
plt.xlabel('Number of Clusters')
plt.ylabel('Score')
plt.title('Elbow Curve')
plt.xticks(cluster_iteration)
plt.show()

#### Inference

- Optimal number of clusters for Customer Segmentation = 5
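
For `KMeans`, `.score(X)` on the training data is the negative of the within-cluster sum of squares, so Method-1 and Method-2 carry the same information with opposite sign. The sketch below checks this on random stand-in data. The data and the explicit `n_init=10` are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Random stand-in data, not the notebook's dataset
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))

model = KMeans(n_clusters=3, n_init=10, random_state=45).fit(X_demo)

# Within-cluster sum of squares computed by hand:
# squared distance from every point to the centroid of its assigned cluster
manual_wcss = ((X_demo - model.cluster_centers_[model.labels_]) ** 2).sum()

print(np.isclose(model.score(X_demo), -model.inertia_, rtol=1e-5))
print(np.isclose(manual_wcss, model.inertia_, rtol=1e-5))
```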

#### Method-2: Using `.inertia_` attribute (Within-Cluster Sum of Squares)

In [None]:
from sklearn.cluster import KMeans

cluster_iteration = range(1,10)
inertias = []

for cluster_number in cluster_iteration:
    K_Means = KMeans(n_clusters=cluster_number, random_state=45)
    K_Means.fit(X)
    inertias.append(K_Means.inertia_)

In [None]:
inertias

#### Elbow Curve (Number of Clusters Vs Inertia)

In [None]:
plt.style.use('ggplot')  # set the style before plotting so it applies to this figure
plt.plot(cluster_iteration,inertias)
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia (Within-Cluster Sum of Squares)')
plt.title('Elbow Curve')
plt.xticks(cluster_iteration)
plt.show()

#### Inference

- Optimal number of clusters for Customer Segmentation = 5
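
Reading an elbow off a plot is somewhat subjective. One possible aid, sketched here on synthetic blobs rather than the notebook's data, is to tabulate the percentage drop in inertia for each added cluster and look for the point where the drops become small. The data generator settings and `n_init=10` are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the standardized features (not the notebook's data)
X_demo, _ = make_blobs(n_samples=200, centers=5, cluster_std=0.6, random_state=0)

ks = list(range(1, 10))
inertias_demo = [
    KMeans(n_clusters=k, n_init=10, random_state=45).fit(X_demo).inertia_
    for k in ks
]

# Percentage drop in inertia when moving from k-1 to k clusters
pct_drop = [
    100 * (prev - curr) / prev
    for prev, curr in zip(inertias_demo[:-1], inertias_demo[1:])
]

for k, drop in zip(ks[1:], pct_drop):
    print(f'k={k}: inertia falls by {drop:.1f}%')
```

This table is a heuristic only; it complements the plot rather than replacing a judgment about where the curve flattens.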

#### Method-3: Silhouette Score Method

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

cluster_iteration = range(2,10)  # Start from at least 2 clusters (Silhouette Score Requirement)
silhouette_scores = []

for cluster_number in cluster_iteration:
    K_Means = KMeans(n_clusters=cluster_number, random_state=45)
    K_Means.fit(X)
    clustered_labels = K_Means.labels_

    # Mean silhouette coefficient over all samples (ranges from -1 to 1, higher is better)
    silhouette_avg = silhouette_score(X,clustered_labels)
    silhouette_scores.append(silhouette_avg)
    print(f'For n_clusters={cluster_number}, the Silhouette score is {silhouette_avg}')

#### Inference

- Optimal number of clusters for Customer Segmentation = 5
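
Instead of scanning the printed values by eye, the scores can be stored and the best-scoring `k` picked programmatically. The following is a minimal sketch on synthetic blobs. The data, the dictionary name `sil_by_k` and `n_init=10` are assumptions, not part of the original notebook.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the standardized features (not the notebook's data)
X_demo, _ = make_blobs(n_samples=200, centers=5, cluster_std=0.6, random_state=0)

sil_by_k = {}
for k in range(2, 10):  # silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=45).fit_predict(X_demo)
    sil_by_k[k] = silhouette_score(X_demo, labels)

# The k with the highest mean silhouette coefficient
best_k = max(sil_by_k, key=sil_by_k.get)
print(f'Best k by silhouette score: {best_k}')
```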

### Final Model with number of clusters (n_clusters) = 5

In [None]:
# Clustering for final model
K_Means_final_model = KMeans(n_clusters= 5, random_state=45)
K_Means_final_model.fit(X)

In [None]:
# Extract the cluster label assigned to each customer
cluster_labels = K_Means_final_model.labels_

In [None]:
cluster_labels

In [None]:
customer_segment_data['cluster'] = cluster_labels

In [None]:
customer_segment_data

In [None]:
customer_segment_data.groupby('cluster').agg({'Annual Income (k$)': 'mean',
                            'Spending Score (1-100)': ['mean', 'count'],}).round(0)
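
Because the model was fitted on standardized features, its centroids are in standardized units. A complementary way to summarize the clusters, sketched below on random stand-in data, is to map the centroids back to the original units with `inverse_transform`. All variable names and values here are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical raw features: income in k$ and spending score (random, not the real data)
rng = np.random.default_rng(0)
X_raw = np.column_stack([rng.uniform(15, 140, 200), rng.uniform(1, 100, 200)])

scaler_demo = StandardScaler().fit(X_raw)
X_std = scaler_demo.transform(X_raw)

model = KMeans(n_clusters=5, n_init=10, random_state=45).fit(X_std)

# Centroids expressed in the original units (k$ and score points)
centers_original = scaler_demo.inverse_transform(model.cluster_centers_)

for cluster_id, (income, score) in enumerate(centers_original):
    print(f'Cluster {cluster_id}: income ~ {income:.0f} k$, spending score ~ {score:.0f}')
```

These centroid values should be close to the per-cluster means produced by the `groupby` summary, since each K-Means centroid is the mean of the points assigned to it.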

### Cluster Visualization

In [None]:
Y = cluster_labels

In [None]:
Y

In [None]:
# plotting all the clusters and their Centroids

plt.figure(figsize=(15,8))
plt.scatter(X[Y==0,0], X[Y==0,1], s=50, c='green', label='Cluster 0')
plt.scatter(X[Y==1,0], X[Y==1,1], s=50, c='red', label='Cluster 1')
plt.scatter(X[Y==2,0], X[Y==2,1], s=50, c='yellow', label='Cluster 2')
plt.scatter(X[Y==3,0], X[Y==3,1], s=50, c='violet', label='Cluster 3')
plt.scatter(X[Y==4,0], X[Y==4,1], s=50, c='blue', label='Cluster 4')

# plot the centroids
plt.scatter(K_Means_final_model.cluster_centers_[:,0], K_Means_final_model.cluster_centers_[:,1], s=100, c='cyan', label='Centroids')

plt.legend(loc='upper right')
plt.title('Customer Groups')
plt.xlabel('Annual Income (standardized)')
plt.ylabel('Spending Score (standardized)')
plt.show()

### Inference

The customer segmentation analysis reveals distinct customer groups based on annual income and spending behavior.

- Cluster 0 represents a balanced segment with moderate income and spending scores. 
- Cluster 1 consists of customers with lower incomes but high spending scores, making them potential targets for promotions. 
- Cluster 2 comprises high-income individuals with high spending scores, ideal for premium offerings. 
- Cluster 3, with high incomes but low spending, presents an opportunity to encourage increased spending. 
- Lastly, Cluster 4 encompasses customers with low incomes and conservative spending habits. 

These findings enable businesses to tailor marketing strategies, product offerings, and promotions to better meet the unique needs of each segment, ultimately enhancing customer satisfaction and optimizing revenue generation.