<table class="table table-bordered">
    <tr>
        <th style="width:250px;">
            <img src='https://bcgriseacademy.com/hs-fs/hubfs/rise2.0_black_logo.png' style="background-color:white; width: 100%; height: 100%; padding: 20px">
        </th>
        <th style="text-align:center;">
            <h1>Mini project - Unsupervised Learning</h1>
            <h3>IBF TFIP</h3>
        </th>
    </tr>
</table>

## Know the context

You are a data analyst working in a retail bank based in the Middle East, where they have been doing traditional mass marketing campaigns for years. The bank is now keen to explore the benefits of running tailored marketing campaigns for their customer base.

## Business problem

The retail bank is facing a couple of challenges:
1. Profitability pressure from reduced utilization by existing customers
2. Increasingly competitive landscape where other banks are running personalized ad campaigns using differentiated products and services

## Project objectives and description

In this discovery phase, the objective is to understand the various segments that exist in the bank's customer base, based on the customers' demographics and utilization patterns.
___

## 1. Initial Setup

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler

from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn import metrics

from sklearn.cluster import AgglomerativeClustering
import matplotlib.cm as cm

import warnings
warnings.filterwarnings('ignore')

#### (1.1) Read the raw CSV file from the Data folder into pandas

In [None]:
# insert your code here


#### (1.2) View the first 5 rows of the data

In [None]:
# insert your code here


___
## 2. Data Exploration

#### (2.1) Find the shape and data types of the dataset

In [None]:
# insert your code here


#### (2.2) Convert the columns to the appropriate data type (wherever necessary)
- HINT: Review date related columns
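
As a minimal sketch of a date conversion, the example below parses a string column into a datetime column with `pd.to_datetime`. The toy values and their ISO format are assumptions for illustration; the real file may use a different date format, in which case the `format=` or `dayfirst=` argument may be needed.

```python
import pandas as pd

# Toy data: these dates are made up and assumed to be ISO-formatted strings.
toy = pd.DataFrame({"Account Opening Date": ["2021-03-01", "2020-01-10"]})

# Convert the string column to a proper datetime column.
toy["Account Opening Date"] = pd.to_datetime(toy["Account Opening Date"])
print(toy.dtypes)
```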

In [None]:
# insert your code here


#### (2.3) Check if there are any null values in the columns

In [None]:
# insert your code here


#### (2.4) For each categorical column, count the number of unique categories
- HINT: Instead of doing it one by one, filter by dtypes that are equal to object, and print the unique count using `.nunique()`
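
The hint above can be sketched on a toy dataframe as follows. The column names `Gender` and `Segment` and their values are hypothetical, and the string columns are given an explicit `object` dtype so the filter behaves the same across pandas versions.

```python
import pandas as pd

# Toy data: column names and values are made up for illustration.
toy = pd.DataFrame({
    "Gender": pd.Series(["F", "M", "F", "M"], dtype="object"),
    "Segment": pd.Series(["Retail", "Retail", "Premium", "Retail"], dtype="object"),
    "Age": [31, 45, 28, 52],
})

# Keep only the object-typed columns, then count distinct values per column.
unique_counts = toy.select_dtypes(include="object").nunique()
print(unique_counts)
```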

In [None]:
# insert your code here


#### (2.5) For numerical columns, check the data distributions using appropriate functions and plots (e.g., box plots)

In [None]:
# insert your code here


**Note**
- This dataset contains a few very large values (outliers) in the monthly average balance.
- Instead of dropping these outliers, we will convert the balances into categories in the next step.
- Dropping outliers is risky when we do not know exactly what caused them. Bank balances will always contain outliers because some customers are much wealthier than others. These are not data errors, so they should not be dropped just because they are inconvenient for our analysis.

___
## 3. Data preprocessing

The goal here is to convert all values to numeric form so that each customer's record becomes a numeric feature vector.

#### (3.1) Create a new column `Tenure` for the number of months (rounded to the nearest integer) since the customer opened the account
- HINT: The snapshot date is 01 Mar 2022, so you should find the date difference between the snapshot date and the account opening date.
- Feel free to search for 'Pandas - Number of Months Between Two Dates' on StackOverflow for the answer
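
One possible approach, shown here on toy dates, is to take the difference in days and divide by an average month length. The divisor 30.4375 (365.25 / 12) is an assumed convention, not something the dataset prescribes; counting calendar months instead can give slightly different results.

```python
import pandas as pd

# Toy data: the opening dates are made up for illustration.
toy = pd.DataFrame({
    "Account Opening Date": pd.to_datetime(["2021-03-01", "2020-01-10", "2022-02-01"])
})

snapshot = pd.Timestamp("2022-03-01")  # snapshot date given in the hint

# Days between the snapshot and the opening date, converted to months.
days = (snapshot - toy["Account Opening Date"]).dt.days
toy["Tenure"] = (days / 30.4375).round().astype(int)  # assumed average month length
print(toy)
```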

In [None]:
# insert your code here


#### (3.2) Clean the `Monthly Average Balance` column by converting all negative balance values to 0
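
A minimal sketch of this cleaning step on toy values, using `Series.clip` to floor the balances at 0. The numbers are made up for illustration.

```python
import pandas as pd

# Toy data: balances are made up for illustration.
toy = pd.DataFrame({"Monthly Average Balance": [-250.0, 0.0, 1200.5, -3.2]})

# Replace every negative balance with 0 and leave other values unchanged.
toy["Monthly Average Balance"] = toy["Monthly Average Balance"].clip(lower=0)
print(toy)
```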

In [None]:
# insert your code here


#### (3.3) Rather than discarding outliers, we want to group customers into the following categories based on percentiles of the monthly average balance.

- $\ge$95th percentile: 'Very High'
- $\ge$80th and $\lt$95th percentile: 'High'
- $\ge$50th and $\lt$80th percentile: 'Upper'
- $\lt$50th percentile: 'Normal'

#### Create Python variables to store these percentile cutoffs
- HINT: You only need to create 3 variables, for the Very High, High, and Upper cutoffs. Use the `.quantile` method
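
As an illustration of `.quantile`, the sketch below computes the three cutoffs (50th, 80th and 95th percentiles) on a toy series. The variable names and the toy balances are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy balances 0, 1, ..., 100 so the percentiles are easy to verify by eye.
balances = pd.Series(np.arange(101.0))

# Hypothetical variable names for the three cutoffs.
upper_cutoff = balances.quantile(0.50)
high_cutoff = balances.quantile(0.80)
very_high_cutoff = balances.quantile(0.95)
print(upper_cutoff, high_cutoff, very_high_cutoff)
```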

In [None]:
# insert your code here


#### (3.4) Using the percentile cutoff variables above, create a new column called `Balance Level` by splitting the `Monthly Average Balance` column into 4 categories and giving them the corresponding integer labels: 1, 2, 3, 4 (where 4 corresponds to the very high balance category)
- HINT: Refer to https://stackoverflow.com/questions/44314670/create-rename-categories-with-pandas on how to use `pd.cut`. For the `bins` parameter, we suggest setting the lowest edge to -0.01 and the highest edge to 99999999
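
A rough sketch of `pd.cut` with integer labels is shown below. The cutoff values 50, 80 and 95 are placeholders standing in for the percentile variables, and the balances are made up. Note that `pd.cut` bins are right-inclusive by default, so a value exactly equal to a cutoff falls into the lower bin.

```python
import pandas as pd

# Toy balances and placeholder cutoffs (assumed values, for illustration only).
balances = pd.Series([0.0, 40.0, 60.0, 90.0, 99.0])
upper_cutoff, high_cutoff, very_high_cutoff = 50.0, 80.0, 95.0

# Bin edges: lowest edge just below 0, highest edge far above any balance.
bins = [-0.01, upper_cutoff, high_cutoff, very_high_cutoff, 99999999]

# Labels 1 to 4, where 4 is the very high balance category.
balance_level = pd.cut(balances, bins=bins, labels=[1, 2, 3, 4]).astype(int)
print(balance_level.tolist())
```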

In [None]:
# insert your code here


#### (3.5) For categorical columns, convert them to numeric format
- HINT: Use pandas dummy encoding (`pd.get_dummies`)
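
Dummy encoding can be sketched on a toy dataframe like this. The `Gender` column and its values are hypothetical, and `dtype=int` is an optional choice that produces 0/1 integers rather than booleans.

```python
import pandas as pd

# Toy data: column names and values are made up for illustration.
toy = pd.DataFrame({
    "Gender": pd.Series(["F", "M", "F"], dtype="object"),
    "Age": [31, 45, 28],
})

# One new 0/1 column is created per category of the encoded column.
encoded = pd.get_dummies(toy, columns=["Gender"], dtype=int)
print(encoded)
```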

In [None]:
# insert your code here


#### (3.6) Drop columns that are no longer informative for downstream clustering
- `Customer No`, `Customer Nationality`, `Account Opening Date`, `Monthly Average Balance`

In [None]:
# insert your code here


#### (3.7) Scale the numerical values in the dataset
Note: Use min-max scaling

**💬 Checkpoint** 
- Why do we need to scale the dataset before clustering?
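
Min-max scaling maps each column to the range [0, 1] via (x - min) / (max - min). A minimal sketch using scikit-learn's `MinMaxScaler` on toy data follows; the values are made up, and wrapping the result back into a dataframe is a convenience choice rather than a requirement.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy data: values are made up for illustration.
toy = pd.DataFrame({"Age": [20, 30, 40], "Tenure": [0, 60, 120]})

# Fit the scaler and transform every column to the [0, 1] range.
scaler = MinMaxScaler()
scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns, index=toy.index)
print(scaled)
```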

In [None]:
# insert your code here


___
## 4. K-Means Clustering
The objectives are to:
- Find the most appropriate k value using the Elbow method
- Calculate and store the silhouette coefficient values

#### (4.1) Create a baseline k-means model with `k=3` and `random_state=0` by fitting on the scaled data

In [None]:
# insert your code here


#### (4.2) Obtain the inertia value and silhouette score of this model
HINT: For silhouette score, you will need to first get predictions with `fit_predict`
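
The two quantities can be obtained roughly as sketched below. The data comes from `make_blobs` purely for illustration (it is not the bank dataset), and `n_init=10` is set explicitly as an assumption to keep behavior consistent across scikit-learn versions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled data.
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)

model = KMeans(n_clusters=3, random_state=0, n_init=10)
pred = model.fit_predict(X)  # cluster label for each row

inertia = model.inertia_  # within-cluster sum of squared distances
sil = silhouette_score(X, pred)  # mean silhouette coefficient, between -1 and 1
print(inertia, sil)
```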

In [None]:
# insert your code here


#### (4.3) Repeat the above two steps for each k value from 2 to 12. Save the inertia values and silhouette scores in separate dictionaries
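
One way to structure this loop is sketched below on synthetic data from `make_blobs` (not the bank dataset). The dictionary names are hypothetical, and `n_init=10` is an explicit assumption rather than a requirement of the exercise.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled data.
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)

inertia_by_k = {}
silhouette_by_k = {}

for k in range(2, 13):  # k = 2, 3, ..., 12
    model = KMeans(n_clusters=k, random_state=0, n_init=10)
    pred = model.fit_predict(X)
    inertia_by_k[k] = model.inertia_
    silhouette_by_k[k] = silhouette_score(X, pred)

print(inertia_by_k)
print(silhouette_by_k)
```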

In [None]:
# insert your code here


#### (4.4) Use the Elbow method (along with showing the plot) to determine the optimal number of clusters. 

In [None]:
# insert your code here


#### (4.5) Use the Silhouette method (along with showing the plot) to determine the optimal number of clusters. 
- HINT: This part is NOT about generating the silhouette analysis graphs. Instead, repeat the code in Step 4.4, but instead of using inertia values, use the silhouette scores that you have already saved from Step 4.3
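
A possible shape for such a plot is sketched below. To keep the block self-contained, the scores are recomputed on synthetic `make_blobs` data for a shorter range of k; in the notebook, the dictionary saved in Step 4.3 would be plotted instead. The same pattern works for inertia values by swapping the dictionary and the axis label.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled data.
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)

silhouette_by_k = {}
for k in range(2, 7):
    pred = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(X)
    silhouette_by_k[k] = silhouette_score(X, pred)

# Line plot of score against k; a higher silhouette score is better.
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(list(silhouette_by_k.keys()), list(silhouette_by_k.values()), marker="o")
ax.set_xlabel("Number of clusters (k)")
ax.set_ylabel("Silhouette score")
ax.set_title("Silhouette score by k (toy data)")

# The k with the highest silhouette score on this toy data.
best_k = max(silhouette_by_k, key=silhouette_by_k.get)
print(best_k)
```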

In [None]:
# insert your code here


#### (4.6) What can we summarize from the elbow and silhouette plots, and what is the optimal cluster number you would choose?

Answer: 

___
## 5. Hierarchical clustering

#### (5.1) Using the scaled data we created earlier, produce a dendrogram with hierarchical clustering
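
A dendrogram can be produced along the following lines with the `linkage` and `dendrogram` functions imported earlier. The data is synthetic, and Ward linkage is an assumption here since the exercise does not specify a linkage method.

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs

# Small synthetic stand-in for the scaled data.
X, _ = make_blobs(n_samples=30, centers=3, cluster_std=0.5, random_state=0)

# Build the merge hierarchy; each row of Z describes one merge.
Z = linkage(X, method="ward")  # assumed linkage method

fig, ax = plt.subplots(figsize=(8, 4))
tree = dendrogram(Z, ax=ax)
ax.set_xlabel("Sample index")
ax.set_ylabel("Merge distance")
ax.set_title("Dendrogram (toy data)")
```

Tall vertical gaps between successive merges suggest a natural place to cut the tree into clusters.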

In [None]:
# insert your code here


#### (5.2) What can we summarize from the dendrogram, and what is the optimal cluster number you would choose?

Answer: 

#### (5.3) Using the optimal number of clusters above, create a hierarchical clustering model
- Use `AgglomerativeClustering`
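
A minimal sketch of fitting `AgglomerativeClustering` and scoring the result is shown below on synthetic data. The choice of 3 clusters and Ward linkage is for illustration only; use the number of clusters read off your own dendrogram.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled data.
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)

# Illustrative settings: 3 clusters with Ward linkage (assumed, not prescribed).
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
agg_pred = agg.fit_predict(X)

agg_sil = silhouette_score(X, agg_pred)
print(agg_sil)
```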

In [None]:
# insert your code here


#### (5.4) Calculate the silhouette score

In [None]:
# insert your code here


___
## 6. Generating cluster labels

#### (6.1) Compare the silhouette scores of the k-means model (with k=3) and the hierarchical clustering model (with k=4), and determine which model is better to use. Explain your choice of model.

Answer: 

#### (6.2) Using the model with the higher silhouette score, generate the predicted labels for each customer in the dataset, and save the output array in a variable called `labels`
- HINT: Use `fit_predict`

In [None]:
# insert your code here


#### (6.3) Append the `labels` array as a column (with the same name) to the processed dataframe from Step 3.6, so that each customer now has an assigned label
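
Appending the array is a plain column assignment, sketched here on a toy dataframe with made-up values. Assigning a NumPy array is positional, so this assumes the rows are still in the same order as when the model was fitted. The same pattern applies to any dataframe with one row per customer.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the processed dataframe and the predicted labels.
processed = pd.DataFrame({"Age": [25, 40, 33], "Tenure": [12, 60, 30]})
labels = np.array([0, 1, 0])

# One label per row, matched by position.
processed["labels"] = labels
print(processed)
```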

In [None]:
# insert your code here


#### (6.4) Append the `labels` array as a column (with the same name) to the original dataset

In [None]:
# insert your code here


___
## 7. Extracting Insights
Now that we have assigned each customer to the relevant cluster labels, it is time to better understand how the clusters differ in terms of profile

#### (7.1) Compare the customer ages across the clusters
- HINT: Sample code: `df.groupby(['labels'])['Column name'].mean().round(2)`
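
The sample code in the hint behaves as follows on a toy dataframe. The ages and labels are made up for illustration.

```python
import pandas as pd

# Toy data: two clusters with two customers each.
toy = pd.DataFrame({
    "labels": [0, 0, 1, 1],
    "Age": [20, 30, 40, 51],
})

# Mean age per cluster, rounded to 2 decimal places.
age_by_cluster = toy.groupby(["labels"])["Age"].mean().round(2)
print(age_by_cluster)
```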

In [None]:
# insert your code here


#### (7.2) Compare the average monthly balance across the clusters
- HINT: Use the right dataset that contains the column required

In [None]:
# insert your code here


#### (7.3) Compare the tenure across the clusters
- HINT: Use the right dataset that contains the column required

In [None]:
# insert your code here


___