# Customer Segmentation

## Table of Contents

| Sr No | Topic                                            |
| ----- | -------------------------------------------------- |
| 1     | [Overview](#overview)                            |
| 2     | [EDA - Insights & Actionables](#eda)            |
| 3     | [Initial Model Building](#initial-model-building) |
| 4     | [Learning & Iteration](#learning-iteration)     |
| 5     | [Cluster Definition & Actionable Business Strategy](#cluster-definition-actionable-business-strategy) |




### Overview  <a id="overview"></a>

We have customer data from a credit card company. In this project, we use this data to gain insights into the customer base that will help develop business strategies to drive credit card usage while keeping default rates to a minimum.


#### Objective:

Conduct segmentation on the credit card customer base to support strategies for driving growth and reducing risk.



In [1]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt  
import seaborn as sns
import warnings
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

import sys
sys.path.append('../../src/features/')

import build_features

sys.path.append('../../src/models')

import predict_model

warnings.filterwarnings('ignore')


import sklearn as sk

customer_data = pd.read_csv('../../data/raw/customer_segmentation.csv')
customer_data.head()

,customer_id,age,gender,dependent_count,education_level,marital_status,estimated_income,months_on_book,total_relationship_count,months_inactive_12_mon,credit_limit,total_trans_amount,total_trans_count,avg_utilization_ratio
0,768805383,45,M,3,High School,Married,69000,39,5,1,12691.0,1144,42,0.061
1,818770008,49,F,5,Graduate,Single,24000,44,6,1,8256.0,1291,33,0.105
2,713982108,51,M,3,Graduate,Married,93000,36,4,1,3418.0,1887,20,0.0
3,769911858,40,F,4,High School,Unknown,37000,34,3,4,3313.0,1171,20,0.76
4,709106358,40,M,3,Uneducated,Married,65000,21,5,1,4716.0,816,28,0.0


### EDA - Insights & Actionables  <a id="eda"></a>

Below is the summary of the EDA findings & actionables. For a deeper understanding, here is a link to the EDA Report: <a href="/Users/nishantchandavarkar/Documents/Dataquest Based Projects/Customer Segmentation/notebooks/Reports/EDA_Report.ipynb">EDA Report</a>


#### EDA Summary

- **Transaction Variables:** A new variable, `avg_transaction_value`, was created by dividing `total_trans_amount` by `total_trans_count` to avoid multicollinearity between the two, and its skewness was addressed with a log transformation.

- **Months on Book (MOB):** `months_on_book` was binned into two groups, "MOB > 3 Years" and "MOB < 3 Years", to serve as a loyalty indicator. The split is based on its [scatterplot](../../reports/figures/MOB_scatterplot.png).

- **Gender Labeling:** The "gender" column was transformed into numerical values (1 for males and 0 for females) for K-Means Clustering.

- **Education Level:** `education_level` was converted into an ordinal categorical variable to explore educational impacts.

- **Marital Status:** Simplified to a binary feature (married or not) to streamline analysis due to diverse categories.

- **Categorization:** `dependent_count` and `months_inactive_12_mon` were converted into the binary flags `> 2 Dependants` and `Inactive > 2 Months` for segmentation and analysis purposes.

- **Dropping Variable:** `total_relationship_count` was removed from analysis due to a lack of contextual information.  
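
The transformations summarized above can be sketched in pandas as follows. This is a minimal illustration, not the project's `ProcessData` class: the function name `engineer_features` is hypothetical, the thresholds (36 months, 2 dependants, 2 inactive months) are inferred from the column names and sample rows shown in this notebook, and the ordinal encoding of `education_level` is left out because its mapping is not documented here.

```python
import numpy as np
import pandas as pd


def engineer_features(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch of the EDA-driven transformations (not ProcessData itself)."""
    out = pd.DataFrame(index=raw.index)
    out["customer_id"] = raw["customer_id"]
    # Binary gender flag: 1 for males, 0 for females.
    out["gender"] = (raw["gender"] == "M").astype(int)
    # Average spend per transaction, log-transformed to reduce skew.
    # Assumes total_trans_count is never zero.
    out["avg_transaction_value"] = np.log(
        raw["total_trans_amount"] / raw["total_trans_count"]
    )
    # Binary flags; the thresholds are assumptions inferred from the column names.
    out["> 2 Dependants"] = (raw["dependent_count"] > 2).astype(int)
    out["Married"] = (raw["marital_status"] == "Married").astype(int)
    out["MOB > 3Y"] = (raw["months_on_book"] > 36).astype(int)
    out["Inactive > 2 Months"] = (raw["months_inactive_12_mon"] > 2).astype(int)
    return out


# The first three rows of the raw data shown earlier in this notebook.
sample = pd.DataFrame({
    "customer_id": [768805383, 818770008, 713982108],
    "gender": ["M", "F", "M"],
    "dependent_count": [3, 5, 3],
    "marital_status": ["Married", "Single", "Married"],
    "months_on_book": [39, 44, 36],
    "months_inactive_12_mon": [1, 1, 1],
    "total_trans_amount": [1144, 1291, 1887],
    "total_trans_count": [42, 33, 20],
})
features = engineer_features(sample)
print(features)
```
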





In [2]:
from build_features import ProcessData



DataProcessor = ProcessData(customer_data.copy())

processed_data = DataProcessor.main()

processed_data.head()

,customer_id,age,gender,education_level,estimated_income,credit_limit,avg_utilization_ratio,avg_transaction_value,> 2 Dependants,Married,MOB > 3Y,Inactive > 2 Months
0,768805383,45,1,0,69000,12691.0,0.061,3.304617,1,1,1,0
1,818770008,49,0,1,24000,8256.0,0.105,3.666665,1,0,1,0
2,713982108,51,1,1,93000,3418.0,0.0,4.547011,1,1,0,0
3,769911858,40,0,0,37000,3313.0,0.76,4.069881,1,0,0,1
4,709106358,40,1,0,65000,4716.0,0.0,3.37221,1,1,0,0


### Initial Model Building <a id="initial-model-building"></a>


**Model Objective Definition**

Our primary aim is to conduct segmentation on our credit card customer base, thereby enabling the formulation of strategies aimed at promoting growth while minimizing risk. In this context, the interpretability of the clusters is of paramount importance. With this in mind, we define the model objective.

**Model Objective:** Create a model that generates clusters that are easily explainable using the input attributes.

**Choice of Model: KMeans Clustering**

Given our model objective's emphasis on cluster interpretability to facilitate meaningful business strategy application, we selected KMeans Clustering as our model of choice.

**Explanation:** KMeans is a centroid-based clustering algorithm: each cluster is summarized by its centroid, the mean of the input attributes of its members, so a cluster can be described directly in terms of those attributes. It also tends to produce clusters of comparable size. Both properties align with our objective of creating clusters that can be readily explained based on the input attributes provided for training.

**Number of Clusters: 5**

To determine the appropriate number of clusters for our initial model, we used the [Elbow Curve](../../reports/figures/elbow_curve.png) method. We chose 5 clusters because the rate of decrease in inertia slowed beyond that point.
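
As a minimal sketch of how an elbow curve can be computed with scikit-learn's `KMeans`: the synthetic data from `make_blobs`, the range of `k`, and the helper name `inertia_by_k` are assumptions for illustration. In the project, the input would be the standardized customer features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def inertia_by_k(data: np.ndarray, k_values: range, seed: int = 0) -> dict:
    """Fit KMeans for each k and return the within-cluster sum of squares (inertia)."""
    inertias = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed)
        model.fit(data)
        inertias[k] = model.inertia_
    return inertias


# Synthetic stand-in for the scaled customer features (assumption).
X, _ = make_blobs(n_samples=300, centers=4, n_features=5, random_state=0)
inertias = inertia_by_k(X, range(1, 9))

# Drop in inertia gained by adding one more cluster; the "elbow" is where
# these drops become small relative to the earlier ones.
drops = {k: inertias[k - 1] - inertias[k] for k in range(2, 9)}
for k, drop in drops.items():
    print(f"k={k}: inertia={inertias[k]:.1f}, drop={drop:.1f}")
```

Plotting `inertias` against `k` (for example with `plt.plot`) gives the elbow curve itself.
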



### Learning & Iteration  <a id="learning-iteration"></a>


After the initial model building phase, we refined our approach to achieve more meaningful and interpretable clusters. Here are the key learnings and actions taken:

**1. Adjustment of Cluster Number**

- **Observation:** With the initial choice of 5 clusters, we observed that some clusters exhibited similarities in characteristics, making it challenging to explain them based on input parameters. Additionally, most variables were evenly distributed across clusters, hampering our ability to uniquely define and strategize for each cluster.

- **Action:** To address these challenges, we decided to reduce the number of clusters to 4. This adjustment aimed to strike a better balance between granularity and interpretability.

**2. Handling of Variables**

- **Observation:** After moving to 4 clusters, we noticed an improvement in model explainability. However, certain variables, namely `education_level`, `> 2 Dependants`, `age`, and `Married`, remained evenly distributed across clusters even after the number of clusters was reduced.

- **Action:** To further enhance cluster interpretability, we made a strategic decision to drop these variables during the cluster assignment stage. By excluding these variables from defining clusters, we achieved well-defined clusters that are easy to explain.
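
One way to quantify "evenly distributed across clusters" is to compare how far the per-cluster means of a variable spread apart relative to its overall standard deviation. The sketch below is an illustrative assumption rather than the check used in this project; the helper name `cluster_separation` and the toy values are hypothetical.

```python
import pandas as pd


def cluster_separation(df: pd.DataFrame, cluster_col: str = "Cluster") -> pd.Series:
    """Range of per-cluster means divided by the overall standard deviation, per feature."""
    features = df.drop(columns=[cluster_col])
    means = features.groupby(df[cluster_col]).mean()
    return ((means.max() - means.min()) / features.std()).sort_values()


# Toy data with made-up values; only the column names follow the notebook.
toy = pd.DataFrame({
    "Cluster": [0, 0, 0, 1, 1, 1],
    "age": [30, 45, 60, 31, 44, 61],                          # similar in both clusters
    "credit_limit": [2000, 2500, 3000, 20000, 25000, 30000],  # differs by cluster
})
separation = cluster_separation(toy)
print(separation)
```

Variables with a value close to zero barely differ between clusters and are candidates for exclusion.
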

These iterative steps yielded clusters that possess distinct characteristics and are readily interpretable. Importantly, the properties of these clusters enable us to construct effective business strategies aimed at both fostering growth and managing risk.
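
The final clustering step can be approximated as below. This is a sketch that assumes `standardize_data` behaves like scikit-learn's `StandardScaler` and that `assign_cluster` fits `KMeans` and appends the labels; the project's helpers in `build_features` and `predict_model` may differ. The function name `scale_and_cluster` and the toy frame are hypothetical.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def scale_and_cluster(df: pd.DataFrame, feature_cols: list, n_clusters: int,
                      seed: int = 0) -> pd.DataFrame:
    """Standardize the chosen features, fit KMeans, and return df with a Cluster column."""
    scaled = StandardScaler().fit_transform(df[feature_cols])
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(scaled)
    result = df.copy()
    result["Cluster"] = labels
    return result


# Toy data with made-up values; only the column names follow the notebook.
toy = pd.DataFrame({
    "estimated_income": [20000, 22000, 110000, 120000, 60000, 65000, 70000, 72000],
    "credit_limit": [2000, 2500, 30000, 28000, 6000, 6500, 7000, 7200],
    "avg_utilization_ratio": [0.6, 0.5, 0.05, 0.04, 0.3, 0.28, 0.25, 0.27],
    "age": [30, 55, 41, 62, 35, 48, 29, 50],  # kept in the output, excluded from clustering
})
features = ["estimated_income", "credit_limit", "avg_utilization_ratio"]
clustered = scale_and_cluster(toy, features, n_clusters=3)
print(clustered)
```

Excluded columns such as `age` stay in the returned frame, so they remain available for profiling the clusters afterwards.
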

In [3]:
from build_features import ProcessData

DataProcessor = ProcessData(customer_data)

processed_data = DataProcessor.main()
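# Standardize the clustering features; the identifier and the four
# evenly distributed variables are excluded.
scaled_data = build_features.standardize_data(processed_data.drop(['customer_id','education_level', '> 2 Dependants', 'age', 'Married'], axis=1))
# Assign each customer to one of 4 clusters; the result gains a `Cluster` column.
customer_data = predict_model.assign_cluster(scaled_data, 4, customer_data )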

customer_data.head()

,customer_id,age,gender,education_level,estimated_income,credit_limit,avg_utilization_ratio,avg_transaction_value,> 2 Dependants,Married,MOB > 3Y,Inactive > 2 Months,Cluster
0,768805383,45,1,0,69000,12691.0,0.061,3.304617,1,1,1,0,2
1,818770008,49,0,1,24000,8256.0,0.105,3.666665,1,0,1,0,0
2,713982108,51,1,1,93000,3418.0,0.0,4.547011,1,1,0,0,2
3,769911858,40,0,0,37000,3313.0,0.76,4.069881,1,0,0,1,3
4,709106358,40,1,0,65000,4716.0,0.0,3.37221,1,1,0,0,2


In [4]:
# Mean of each clustering feature per cluster.
excluded_columns = ['customer_id', 'education_level', '> 2 Dependants', 'age', 'Married']
cluster_aggregates = customer_data.drop(excluded_columns, axis=1).groupby('Cluster').mean()
cluster_aggregates

Cluster,gender,estimated_income,credit_limit,avg_utilization_ratio,avg_transaction_value,MOB > 3Y,Inactive > 2 Months
0,0.0,37427.105153,4500.780708,0.349202,4.06257,0.36992,0.0
1,0.914704,111366.645449,25947.769574,0.050229,4.205333,0.385105,0.005729
2,0.989849,75580.550098,6465.890341,0.272225,4.013863,0.36313,0.0
3,0.424451,60726.648352,7442.710027,0.283618,4.044832,0.524725,1.0


### Cluster Definition & Actionable Business Strategy  <a id="cluster-definition-actionable-business-strategy"></a>



In the context of customer segmentation for a credit card company, four distinct clusters have been identified based on income, activity level, utilization ratio, and credit limit considerations. Here's a summary of each cluster and the recommended actions:

**Cluster 0 - Low-Income (Low Income, Active Customers, High Utilization Ratio):**
- This cluster comprises customers with low income levels who are actively using their credit cards but have a high utilization ratio.
- Action Recommendation: It's crucial to closely monitor default rates among these customers and consider modifying credit policies to mitigate the risk of defaults. This may involve reevaluating credit limits, interest rates, or offering financial counseling services to help manage debt.

**Cluster 1 - Premium Customers (High Income, Low Utilization Ratio):**
- This cluster represents high-income customers who, despite their high income, have a low utilization ratio.
- Action Recommendation: To maximize profitability, incentivize these premium customers to use their credit cards more frequently. This can be achieved through targeted marketing campaigns, exclusive rewards, or cashback offers tailored to their spending habits.

**Cluster 2 - Potential Premiums (Above Average Income, New Active Customers, Low Credit Limit Based on Income):**
- Customers in this cluster have above-average incomes and are newly active but have relatively low credit limits based on their income levels.
- Action Recommendation: Recognize the potential of these customers to become premium clients. Increase their credit limits to align with their income and spending capacity. This can enhance their loyalty and spending on the card.

**Cluster 3 - Inactive Customers (Above Average Income, Inactive Customers, Low Credit Limit Based on Income):**
- This cluster consists of above-average income customers who are currently inactive with low credit limits based on their income.
- Action Recommendation: Focus on marketing initiatives to reactivate these customers and encourage card usage. Consider offering promotions, rewards, or benefits that align with their lifestyle and financial capacity.

By categorizing customers into these clusters and tailoring strategies accordingly, the credit card company can effectively manage risk, boost card usage, and cater to the unique needs of each customer segment, ultimately enhancing its overall performance and customer satisfaction.
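
The "low credit limit based on income" labels can be made concrete by comparing each cluster's mean credit limit with its mean income. The sketch below uses the cluster means from the aggregate table in this notebook, rounded to two decimals. The derived column name `limit_to_income` is an illustrative assumption, and a ratio of means is only a rough proxy for the per-customer ratio.

```python
import pandas as pd

# Cluster means copied from the aggregate table (rounded to two decimals).
cluster_means = pd.DataFrame(
    {
        "estimated_income": [37427.11, 111366.65, 75580.55, 60726.65],
        "credit_limit": [4500.78, 25947.77, 6465.89, 7442.71],
    },
    index=pd.Index([0, 1, 2, 3], name="Cluster"),
)

# Mean credit limit as a fraction of mean income, per cluster.
cluster_means["limit_to_income"] = (
    cluster_means["credit_limit"] / cluster_means["estimated_income"]
)
print(cluster_means.round(3))
```

On these means, Cluster 2 has the lowest ratio (about 0.086) and Cluster 1 the highest (about 0.233).
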