<a href="https://colab.research.google.com/github/milind69/milind69/blob/main/all_bank_customer_segmentation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Description
Context
AllLife Bank wants to focus on its credit card customer base in the next financial year. Its marketing research team has advised that market penetration can be improved. Based on this input, the Marketing team proposes to run personalized campaigns to target new customers as well as upsell to existing customers. Another insight from the market research was that customers perceive the bank's support services poorly. Based on this, the Operations team wants to upgrade the service delivery model to ensure that customer queries are resolved faster. The Head of Marketing and the Head of Delivery both decide to reach out to the Data Science team for help.

 

### Objective
To identify different segments in the existing customer base, based on spending patterns and past interactions with the bank, using clustering algorithms, and to provide recommendations to the bank on how to better market to and service these customers.

 

### Data Description
The data provided is of various customers of a bank and their financial attributes like credit limit, the total number of credit cards the customer has, and different channels through which customers have contacted the bank for any queries (including visiting the bank, online and through a call center).

### Data Dictionary

- Sl_No: Primary key of the records
- Customer Key: Customer identification number
- Average Credit Limit: Average credit limit of each customer for all credit cards
- Total credit cards: Total number of credit cards possessed by the customer
- Total visits bank: Total number of visits that customer made (yearly) personally to the bank
- Total visits online: Total number of visits or online logins made by the customer (yearly)
- Total calls made: Total number of calls made by the customer to the bank or its customer service department (yearly)

In [None]:
# this will help in making the Python code more structured automatically (good coding practice)
# %reload_ext nb_black

# Library to suppress warnings or deprecation notes
import warnings

warnings.filterwarnings("ignore")


import pandas as pd

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)

import numpy as np
import missingno as msg
from scipy import stats as st
import pandas_profiling
import altair as alt
import math


# libraries to help with data visualization
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from mpl_toolkits.mplot3d import Axes3D

%matplotlib inline
import seaborn as sns

sns.set(color_codes=True)
sns.set_style("whitegrid")
# sns.set(style="ticks")


# Supervised estimators from scikit-learn (tree and linear models)
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression, Lasso, Ridge, RidgeClassifier

from sklearn.model_selection import (
    train_test_split,
    GridSearchCV,
    RandomizedSearchCV,
    KFold,
    cross_val_score,
    LeaveOneOut,
    StratifiedKFold,
)


# To get different metric scores
# (plot_confusion_matrix is not imported: it is unused here and was removed from newer scikit-learn releases)
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    precision_recall_curve,
    roc_curve,
    make_scorer,
    classification_report,
)


from pandas_profiling import ProfileReport


from sklearn.ensemble import (
    BaggingClassifier,
    RandomForestClassifier,
    AdaBoostClassifier,
    GradientBoostingClassifier,
    StackingClassifier,
)

from sklearn.ensemble._forest import ForestClassifier, ForestRegressor

# import treeinterpreter

from sklearn.neighbors import (
    KNeighborsClassifier,
    KNeighborsRegressor,
    KNeighborsTransformer,
    kneighbors_graph,
)

from sklearn.cluster import KMeans
from xgboost import XGBClassifier

from sklearn.svm import SVC

from sklearn.impute import SimpleImputer, MissingIndicator

from sklearn.preprocessing import (
    MinMaxScaler,
    OneHotEncoder,
    OrdinalEncoder,  # for features
    StandardScaler,
    PolynomialFeatures,
    LabelEncoder,  # convert yes=1 no=0 data alphabetical for targets
    RobustScaler,
)

from sklearn.compose import ColumnTransformer

# We can use Dummy for Baseline
from sklearn.dummy import DummyClassifier, DummyRegressor
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
# SCORERS is not imported: it is unused here and was removed from newer scikit-learn releases
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline, Pipeline

from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score, silhouette_samples
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score
from sklearn.decomposition import PCA


# to visualize the elbow curve and silhouette scores
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet

# to compute distances
from scipy.spatial.distance import cdist, pdist

print("Setup Done!!!")

In [None]:
# function to plot a boxplot and a histogram along the same scale.


def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a star will indicate the mean value of the column
    # histogram; pass bins only when given so seaborn otherwise picks its default binning
    hist_kwargs = {"bins": bins} if bins else {}
    sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, **hist_kwargs)
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram

In [None]:
df = pd.read_excel("Credit Card Customer Data.xlsx", engine="openpyxl")

In [None]:
df.info()

In [None]:
print("There are {0} observations and {1} features".format(df.shape[0], df.shape[1]))

In [None]:
print(f"There are duplicated rows: {df.duplicated().any()}")

In [None]:
# ProfileReport(df)

In [None]:
df.describe().T

In [None]:
# Percentage of missing values per column
# (isnull().mean() returns a fraction, so multiply by 100 before printing it as a percentage)
for col, frac in df.isnull().mean().items():
    print(f"There are {frac * 100:.2f}% missing rows in {col}")

In [None]:
# The Sl_No column is only a row identifier and carries no information, so we drop it
df.drop("Sl_No", axis=1, inplace=True)

### Let us check whether Customer Key has duplicate records

In [None]:
df["Customer Key"].duplicated().any()


In [None]:
# Check which customer keys are duplicated
dup_keys = df.loc[df["Customer Key"].duplicated(), "Customer Key"].tolist()
print(f"Duplicate customer keys are {dup_keys}")

# Show every row that shares one of the duplicated keys
df.loc[df["Customer Key"].isin(dup_keys)]

#### Customer Key has a few duplicate IDs. We keep the first occurrence of each key (keep="first"), drop the repeats, and create a new DataFrame to use for the exploratory analysis


In [None]:
# Keep only the first row for each Customer Key
dfx = df.loc[~df["Customer Key"].duplicated(keep="first")]

In [None]:
dfx.info()

In [None]:
dfx.describe().T

In [None]:
for col in df.columns:
    print("-" * 80)
    print(df[col].value_counts())

In [None]:
for col in df.columns:
    print("-" * 80)
    print(col)
    print(df[col].unique())

### Observations:
 - Total_visits_bank, Total_visits_online and Total_calls_made have a minimum value of zero, which looks like a valid observation, so we will not delete or impute it
 - There are no fully duplicated rows in the DataFrame
 - Some Customer Key values are duplicated; the rows could still be valid, so we drop the repeated keys only for the exploratory analysis (`dfx`)
 - The rows with duplicated Customer Key are brought back for clustering, since they could be valid records
 - There are no missing values
 - There are no out-of-bound values; all outliers seen so far look valid
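
The duplicate-key handling described above can be sketched on a small, made-up DataFrame. The `toy` frame and its values are hypothetical and only stand in for the bank data; `duplicated` and `drop_duplicates` are the pandas calls doing the work.

```python
import pandas as pd

# Hypothetical toy data: customer key 101 appears twice with different attributes
toy = pd.DataFrame(
    {
        "Customer Key": [101, 102, 101, 103],
        "Avg_Credit_Limit": [5000, 8000, 12000, 3000],
    }
)

# keep=False marks every occurrence of a repeated key, which is handy for inspection
repeated_rows = toy[toy["Customer Key"].duplicated(keep=False)]

# keep="first" retains the first occurrence of each key and drops the later ones
deduped = toy.drop_duplicates(subset="Customer Key", keep="first")

print(repeated_rows)
print(deduped)
```

Inspecting with `keep=False` before dropping makes it easier to judge whether the repeated keys are data-entry errors or legitimately different records.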

## EDA

### Univariate Analysis

In [None]:
# selecting numerical columns
num_col = dfx.select_dtypes(include=np.number).columns.tolist()

for item in num_col:
    # histogram_boxplot creates its own figure, so no extra plt.figure() call is needed
    histogram_boxplot(dfx, item, figsize=(5, 5), kde=True)
    plt.show()

In [None]:
fig = plt.figure(figsize=(20, 10))
numcol = [
    "Total_Credit_Cards",
    "Total_visits_bank",
    "Total_visits_online",
    "Total_calls_made",
]
for i, col in enumerate(numcol):
    ax = fig.add_subplot(2, 2, i + 1)
    sns.countplot(
        data=dfx, x=col, palette="Paired", ax=ax,
    )
    for p in ax.patches:
        label = "{:.1f}%".format(100 * p.get_height() / len(dfx))
        x = p.get_x() + p.get_width() / 2  # horizontal center of the bar
        y = p.get_height() + 1  # just above the top of the bar

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the bar with its percentage

### Observations:
 - There are outliers in Avg_Credit_Limit and Total_visits_online, but they appear to be valid data because they relate to Total_Credit_Cards, so we will keep them
 - On average, customers made 4 or fewer calls to the bank in a year
 - It looks like customers holding more credit cards have a higher Avg_Credit_Limit and use online banking more, possibly to check expenditure or to increase the credit limit
 - 66% of customers hold 4 or more credit cards
 - 23% of customers visited the bank in person twice in a year, and 15% made 5 visits
 - 22% of customers had no online activity, while 68% made between 1 and 5 online visits

### Bivariate Analysis

In [None]:
sns.pairplot(dfx, diag_kind="kde")

### Observations:
- Total_visits_online and Avg_Credit_Limit are right skewed. More users should be encouraged to use the online channel to generate more business
- Total_calls_made vs Total_visits_online shows two major clusters: customers with fewer online visits made more support calls
- Similar two-cluster groupings are seen across many feature pairs (it will be interesting to see how they are clustered)

In [None]:
dfx.corr()

In [None]:
# Report each pair of features whose absolute correlation exceeds 0.95 (each pair is checked once)
corr = dfx.corr()
corcols = list(corr.columns)
found = 0
for i, colsx in enumerate(corcols):
    for temps in corcols[i + 1 :]:
        corval = corr.loc[colsx, temps]
        if abs(corval) > 0.95:
            print(f"correlation value between {colsx} and {temps} is {corval:0.2f}")
            found += 1
if not found:
    print("no high correlation between features found")

In [None]:
sns.heatmap(dfx.corr(), vmin=-1, vmax=1, annot=True, cmap=cm.Accent_r)

### Observations:
- No pair of features shows a very high correlation (absolute value above 0.95)

In [None]:
# fig, ax = plt.subplots(111,figsize=(10, 10))
fig = plt.figure(figsize=(12, 4))
sns.barplot(
    data=dfx,
    x="Total_visits_bank",
    y="Total_visits_online",
    hue="Total_Credit_Cards",
    ci=None,
)
plt.legend(loc="upper right")
plt.show()

In [None]:
# fig, ax = plt.subplots(111,figsize=(10, 10))
fig = plt.figure(figsize=(6, 4))
sns.barplot(data=dfx, x="Total_Credit_Cards", y="Avg_Credit_Limit", ci=None)

plt.show()

In [None]:
fig = plt.figure(figsize=(6, 4))
sns.barplot(data=dfx, x="Total_Credit_Cards", y="Total_calls_made", ci=None)

plt.show()

### Observations:
- Customers with 8 or more credit cards used online banking more than others
- Customers with 8 or more credit cards have higher credit limits
- More support calls were made by customers with 4 or fewer credit cards

### Data Preprocessing 

In [None]:
# No outlier treatment needed
# We will not drop duplicate customer IDs, as they could be valid entries
# Drop the Customer Key column before clustering

In [None]:
custdata = df.drop("Customer Key", axis=1)

In [None]:
custdata.head()

In [None]:
# The features are on very different scales, so we standardize them with z-scores (standard scaling)

In [None]:
subset_scaled_df = custdata.apply(st.zscore)

In [None]:
subset_scaled_df.head()
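
As a quick sanity check on the scaling step, the sketch below shows on hypothetical toy values that `scipy.stats.zscore` (population standard deviation, `ddof=0`) gives the same result as the manual formula and as scikit-learn's `StandardScaler`. The `toy` frame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats as st
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features on very different scales
toy = pd.DataFrame(
    {"limit": [1000.0, 2000.0, 3000.0, 10000.0], "cards": [1.0, 2.0, 3.0, 4.0]}
)

# Same call pattern as the notebook: z-score each column
z_scipy = toy.apply(st.zscore).to_numpy()

# Manual z-score: subtract the mean, divide by the population standard deviation
z_manual = ((toy - toy.mean()) / toy.std(ddof=0)).to_numpy()

# StandardScaler also uses the population standard deviation
z_sklearn = StandardScaler().fit_transform(toy)

print(np.allclose(z_scipy, z_manual), np.allclose(z_scipy, z_sklearn))
```

Note that `DataFrame.std()` defaults to `ddof=1`, so a manual z-score without `ddof=0` would differ slightly from the other two.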

## KMeans Clustering

#### Deciding number of clusters

We don't know in advance how many clusters are needed to group the data well. There are multiple ways to decide the value of k.

An elbow plot uses inertia_, the sum of squared distances from each point to its nearest centroid, plotted against k; the k at which the curve bends is a good choice for the clustering. A similar elbow curve can be plotted against distortion, the average distance from each point to its nearest centroid.
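
To make the two elbow quantities concrete, here is a minimal sketch on synthetic blobs (an assumption for illustration, not the bank data). It recomputes inertia by hand from the distances to the nearest centroid and compares it with `KMeans.inertia_`, then computes the average distortion the same way as the elbow cell in this notebook.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups (illustrative only)
X, _ = make_blobs(
    n_samples=300, centers=[[0, 0], [10, 0], [0, 10]], cluster_std=0.8, random_state=1
)

model = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)

# Distance from every point to its nearest centroid
nearest = cdist(X, model.cluster_centers_, "euclidean").min(axis=1)

inertia_manual = np.sum(nearest ** 2)  # sum of squared distances
distortion = nearest.mean()  # average (unsquared) distance

print(inertia_manual, model.inertia_, distortion)
```

Inertia grows with the number of samples because it is a sum, while distortion is an average, so the two curves have different scales but usually bend at the same k.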

#### Validating the k value

There are multiple ways to validate the clusters. The most popular one is the Silhouette Coefficient.

*Silhouette Coefficient*: a value between -1 and 1, where values near 1 indicate tight, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest points assigned to the wrong cluster.
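
For a single point, the silhouette value is `(b - a) / max(a, b)`, where `a` is the mean distance to the other points in its own cluster and `b` is the mean distance to the points of the nearest other cluster. The sketch below computes this by hand on hypothetical toy data and compares it with scikit-learn's `silhouette_samples` and `silhouette_score`.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.metrics import silhouette_samples, silhouette_score

# Hypothetical toy data: two well-separated groups on a line
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

D = cdist(X, X)  # pairwise Euclidean distances
manual = []
for i in range(len(X)):
    same = labels == labels[i]
    same[i] = False  # exclude the point itself from its own cluster
    a = D[i, same].mean()  # mean distance to its own cluster
    b = min(
        D[i, labels == c].mean() for c in np.unique(labels) if c != labels[i]
    )  # mean distance to the nearest other cluster
    manual.append((b - a) / max(a, b))
manual = np.array(manual)

print(manual)
print(silhouette_samples(X, labels))
print(silhouette_score(X, labels))  # mean of the per-sample values
```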


In [None]:
%%time
clusters = range(2, 12)
meanDistortions = []

for k in clusters:
    model = KMeans(n_clusters=k, random_state=1)
    model.fit(subset_scaled_df)
    prediction = model.predict(subset_scaled_df)
    distortion = (
        sum(
            np.min(cdist(subset_scaled_df, model.cluster_centers_, "euclidean"), axis=1)
        )
        / subset_scaled_df.shape[0]
    )

    meanDistortions.append(distortion)

    print("Number of Clusters:", k, "\tAverage Distortion:", distortion)

plt.plot(clusters, meanDistortions, "bx-")
plt.xlabel("k")
plt.ylabel("Average distortion")
plt.title("Selecting k with the Elbow Method")
plt.show()

In [None]:
%%time
clusters = range(2, 12)
Inertias = []

for k in clusters:
    model = KMeans(n_clusters=k, random_state=1)
    model.fit(subset_scaled_df)
    Inertias.append(model.inertia_)  # fit() already stores inertia_, so no predict call is needed
    print("Number of Clusters:", k, "\tInertia:", model.inertia_)

plt.plot(clusters, Inertias, "bx-")
plt.xlabel("k")
plt.ylabel("Inertia")
plt.title("Selecting k with the Elbow Method")
plt.show()

In [None]:
# KElbowVisualizer is already imported in the setup cell
# random_state is fixed so the result is reproducible, as in the cells above
model = KMeans(random_state=1)
vz = KElbowVisualizer(model, k=range(2, 12), metric="silhouette")
vz.fit(subset_scaled_df)  # Fit the data to the visualizer
vz.show()

In [None]:
%%time
inertias = []
sils = []
chs = []
dbs = []
sizes = range(2, 12)
for k in sizes:
    k2 = KMeans(random_state=1, n_clusters=k)
    k2.fit(subset_scaled_df)
    inertias.append(k2.inertia_)
    sils.append(silhouette_score(subset_scaled_df, k2.labels_))
    chs.append(calinski_harabasz_score(subset_scaled_df, k2.labels_))
    dbs.append(davies_bouldin_score(subset_scaled_df, k2.labels_))
    print("Silhouette Score for k {0} is {1}".format(k, sils[-1]))
fig, ax = plt.subplots(figsize=(10, 8))
_ = (
    pd.DataFrame(
        {
            "inertia": inertias,
            "silhouette": sils,
            "calinski_harabasz": chs,
            "davies_bouldin": dbs,
            "k": sizes,
        },
    )
    .set_index("k")
    .plot(ax=ax, subplots=True, layout=(2, 2))
)

In [None]:
# finding optimal no. of clusters with SilhouetteVisualizer
for k in (2, 3, 4):
    fig = plt.figure(figsize=(5, 5))
    visualizer = SilhouetteVisualizer(KMeans(n_clusters=k, random_state=1))
    visualizer.fit(subset_scaled_df)
    visualizer.show()
    plt.show()

### Observations:
 - In the silhouette visualizer we look for
    * clear cluster separation and a high average silhouette score line, with every cluster's silhouette extending beyond the average line
    * for k = 3 we get clear cluster separation and an average of ~0.5
 - Taken together, the metrics above suggest that k = 3 is the value of n_clusters likely to give the best results

### Run KMeans with n_clusters=3

In [None]:
datacols = [
    "Avg_Credit_Limit",
    "Total_Credit_Cards",
    "Total_visits_bank",
    "Total_visits_online",
    "Total_calls_made",
]

In [None]:
%%time
# let's take 3 as number of clusters
kmeans = KMeans(n_clusters=3, random_state=0)
kmeans.fit(subset_scaled_df)

In [None]:
df["K_means_segments"] = kmeans.labels_
custdata["K_means_segments"] = kmeans.labels_
subset_scaled_df["K_means_segments"] = kmeans.labels_

### Customer Profiling 

In [None]:
fig = plt.figure(figsize=(20, 6))
for i in subset_scaled_df["K_means_segments"].unique():
    ax = fig.add_subplot(1, 3, i + 1)
    sns.boxplot(
        data=subset_scaled_df.loc[subset_scaled_df["K_means_segments"] == i, datacols],
        ax=ax,
    )
    ax.set_title("cluster " + str(i))
    plt.xticks(rotation=45)

In [None]:
fig, ax1 = plt.subplots(figsize=(10, 5))
subset_scaled_df.groupby("K_means_segments").mean().T.plot.bar(ax=ax1)

### Observations:
- For group 0, Avg_Credit_Limit, Total_Credit_Cards and Total_visits_bank are the distinguishing features
- For group 1, Total_visits_online and Total_calls_made are the distinguishing features
- For group 2, Avg_Credit_Limit, Total_Credit_Cards and Total_visits_online are the distinguishing features

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
sns.scatterplot(
    data=df,
    x="Avg_Credit_Limit",
    y="Total_visits_online",
    ax=ax1,  # no hue in the "before" plot, so no palette is needed
)
sns.scatterplot(
    data=df,
    x="Avg_Credit_Limit",
    y="Total_visits_online",
    hue="K_means_segments",
    palette="gist_rainbow",
    ax=ax2,
)
ax1.set_title("Before clustering")
ax2.set_title("After clustering")
ax2.legend( loc="upper left")


In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
# no hue in the "before" plot, so no palette is needed
sns.scatterplot(data=df, x="Customer Key", y="Total_Credit_Cards", ax=ax1)
sns.scatterplot(
    data=df,
    x="Customer Key",
    y="Total_Credit_Cards",
    hue="K_means_segments",
    palette="gist_rainbow",
    ax=ax2,
)
ax1.set_title("Before clustering")
ax2.set_title("After clustering")
ax2.legend(loc="upper left")

In [None]:
cluster_profile = custdata.groupby("K_means_segments").mean()
cluster_profile["count_in_each_segments"] = (
    df.groupby("K_means_segments")["Avg_Credit_Limit"].count().values
)

In [None]:
# let's display cluster profiles
cluster_profile.style.background_gradient(cmap="nipy_spectral", axis=0)

In [None]:
cluster_profile.style.highlight_max(color="lightgreen", axis=0)

### Observations:
- The customers are grouped into 3 clusters
- Customers with higher credit limits and more credit cards, who use the online facility, form one group. It has the lowest customer count of the 3, and these customers tend to make fewer support calls
- Customers with lower credit limits and fewer credit cards form the second group, which tends to use the online service but has made more support calls
- Customers in group 0 have credit limits and credit card counts between those of the other two groups. These customers rely more on bank visits than on the online facility and rank second in support calls. This group forms the majority of the customer base
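
A cluster profile of the kind summarized above can be sketched as follows. The `toy` frame, its values, and the `segment` column are hypothetical; the sketch uses `groupby(...).size()` so the counts are aligned with the profile by segment label rather than by position.

```python
import pandas as pd

# Hypothetical toy data with an already assigned segment label
toy = pd.DataFrame(
    {
        "Avg_Credit_Limit": [5000, 7000, 50000, 60000, 12000, 9000],
        "Total_calls_made": [6, 8, 1, 0, 2, 3],
        "segment": [1, 1, 2, 2, 0, 0],
    }
)

# Mean of every feature per segment
profile = toy.groupby("segment").mean()

# Number of customers per segment, aligned on the segment index
profile["count_in_each_segment"] = toy.groupby("segment").size()

print(profile)
```

Reading the profile row by row (one row per segment) is what supports statements such as "higher credit limit, fewer support calls" for a given group.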

### Agglomerative (Hierarchical) Clustering

In [None]:
dfh = df.drop("K_means_segments", axis=1)

In [None]:
hcustdata = custdata.drop("K_means_segments", axis=1)

In [None]:
subset_scaled_hf = hcustdata.apply(st.zscore)

In [None]:
subset_scaled_hf.head()

In [None]:
%%time
# list of distance metrics
distance_metrics = ["euclidean", "chebyshev", "mahalanobis", "cityblock"]

# list of linkage methods
linkage_methods = ["single", "complete", "average", "weighted"]

high_cophenet_corr = 0
high_dm_lm = [0, 0]

for dm in distance_metrics:
    for lm in linkage_methods:
        Z = linkage(subset_scaled_hf, metric=dm, method=lm)
        # compare against pairwise distances computed with the same metric used for the linkage
        c, coph_dists = cophenet(Z, pdist(subset_scaled_hf, metric=dm))
        print(
            "Cophenetic correlation for {} distance and {} linkage is {}.".format(
                dm.capitalize(), lm, c
            )
        )
        if high_cophenet_corr < c:
            high_cophenet_corr = c
            high_dm_lm[0] = dm
            high_dm_lm[1] = lm

In [None]:
# printing the combination of distance metric and linkage method with the highest cophenetic correlation
print(
    "Highest cophenetic correlation is {}, which is obtained with {} distance and {} linkage.".format(
        high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1]
    )
)

**Let's explore different linkage methods with Euclidean distance only.**

In [None]:
%%time
# list of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]

# lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]
compare = []

# to create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))

# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(subset_scaled_hf, metric="euclidean", method=method)

    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")

    coph_corr, coph_dist = cophenet(Z, pdist(subset_scaled_hf))
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )

    compare.append([method, coph_corr])

In [None]:
# let's create a dataframe to compare cophenetic correlations for each linkage method
df_cc = pd.DataFrame(compare, columns=compare_cols)
df_cc

### Observations:

- For dendrograms, we look for the one that gives clearly separated cluster groups and has a cophenetic correlation above 0.50
- From the dendrograms, *ward* linkage with *euclidean* distance looks like the better choice, as it gives separate and distinct clusters
- By cophenetic correlation, the *average* method gives the better score, so we will evaluate it further
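
The cophenetic correlation used above is the Pearson correlation between the original pairwise distances and the distances implied by the dendrogram (the height at which two points are first merged). A minimal sketch on synthetic data, which is an assumption for illustration only:

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist

# Synthetic data: two compact groups far apart (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])

d = pdist(X, metric="euclidean")  # original pairwise distances
Z = linkage(X, method="average", metric="euclidean")

# c is the cophenetic correlation, coph_d the dendrogram-implied distances
c, coph_d = cophenet(Z, d)

# The same number obtained directly as a Pearson correlation
c_manual = np.corrcoef(d, coph_d)[0, 1]

print(c, c_manual)
```

A value close to 1 means the dendrogram preserves the original distances well; it says nothing by itself about how many clusters to cut.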

#### We will check scores for *ward* and *average* linkage with *euclidean* distance

In [None]:
%%time
sils = []
chs = []
dbs = []
sizes = range(2, 12)
for k in sizes:
    # euclidean is the default distance (and the only one ward supports), so it is not passed;
    # this also avoids the `affinity` argument, which newer scikit-learn versions renamed to `metric`
    hcluster = AgglomerativeClustering(n_clusters=k, linkage="ward")
    hcluster.fit(subset_scaled_hf)
    sils.append(silhouette_score(subset_scaled_hf, hcluster.labels_))
    chs.append(calinski_harabasz_score(subset_scaled_hf, hcluster.labels_))
    dbs.append(davies_bouldin_score(subset_scaled_hf, hcluster.labels_))
    print("Silhouette Score for k {0} is {1}".format(k, sils[-1]))
fig, ax = plt.subplots(figsize=(10, 8))
_ = (
    pd.DataFrame(
        {"silhouette": sils, "calinski_harabasz": chs, "davies_bouldin": dbs, "k": sizes}
    )
    .set_index("k")
    .plot(ax=ax, subplots=True, layout=(2, 2))
)

In [None]:
%%time
sils = []
chs = []
dbs = []
sizes = range(2, 12)
for k in sizes:
    # euclidean is the default distance, so it is not passed;
    # this also avoids the `affinity` argument, which newer scikit-learn versions renamed to `metric`
    hcluster = AgglomerativeClustering(n_clusters=k, linkage="average")
    hcluster.fit(subset_scaled_hf)
    sils.append(silhouette_score(subset_scaled_hf, hcluster.labels_))
    chs.append(calinski_harabasz_score(subset_scaled_hf, hcluster.labels_))
    dbs.append(davies_bouldin_score(subset_scaled_hf, hcluster.labels_))
    print("Silhouette Score for k {0} is {1}".format(k, sils[-1]))
fig, ax = plt.subplots(figsize=(10, 8))
_ = (
    pd.DataFrame(
        {"silhouette": sils, "calinski_harabasz": chs, "davies_bouldin": dbs, "k": sizes}
    )
    .set_index("k")
    .plot(ax=ax, subplots=True, layout=(2, 2))
)


### Observations:
 - For a good clustering
    * the silhouette score lies between -1 and 1; higher is better
    * the Calinski-Harabasz score is 0 or more; higher is better
    * the Davies-Bouldin score is 0 or more; lower is better
 
 - With *ward* linkage all 3 scores agree on k = 3
 - With *average* linkage the 3 scores give inconsistent results
 
 
 *For Agglomerative Clustering we will go with k = 3*
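
The way the three scores are read can be sketched on synthetic blobs with three known groups (an assumption for illustration, not the bank data): pick the k with the highest silhouette, the highest Calinski-Harabasz score, and the lowest Davies-Bouldin score, then check whether they agree.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Synthetic data with three well-separated groups (illustrative only)
X, _ = make_blobs(
    n_samples=300, centers=[[0, 0], [10, 0], [0, 10]], cluster_std=0.6, random_state=1
)

scores = {}
for k in range(2, 7):
    labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X)
    scores[k] = (
        silhouette_score(X, labels),
        calinski_harabasz_score(X, labels),
        davies_bouldin_score(X, labels),
    )

best_sil = max(scores, key=lambda k: scores[k][0])  # higher is better
best_ch = max(scores, key=lambda k: scores[k][1])  # higher is better
best_db = min(scores, key=lambda k: scores[k][2])  # lower is better

print(best_sil, best_ch, best_db)
```

When the three choices coincide, as the notebook reports for *ward* linkage, that k is a reasonably safe pick; when they disagree, the dendrogram and business interpretability have to break the tie.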

### Modeling 

In [None]:
%%time
# euclidean is the default (and only) distance for ward linkage, so it is not passed explicitly
hcluster = AgglomerativeClustering(n_clusters=3, linkage="ward")
hcluster.fit(subset_scaled_hf)

In [None]:
dfh["hcluster_segments"] = hcluster.labels_
hcustdata["hcluster_segments"] = hcluster.labels_
subset_scaled_hf["hcluster_segments"] = hcluster.labels_

In [None]:
subset_scaled_hf.head()

### Customer Profiling 

In [None]:
fig = plt.figure(figsize=(20, 6))
for i in subset_scaled_hf["hcluster_segments"].unique():
    ax = fig.add_subplot(1, 3, i + 1)
    sns.boxplot(
        data=subset_scaled_hf.loc[subset_scaled_hf["hcluster_segments"] == i, datacols],
        ax=ax,
    )
    ax.set_title("cluster " + str(i))
    plt.xticks(rotation=45)

In [None]:
fig, ax1 = plt.subplots(figsize=(10, 5))
subset_scaled_hf.groupby("hcluster_segments").mean().T.plot.bar(ax=ax1)

### Observations:
- For group 0, Avg_Credit_Limit, Total_Credit_Cards and Total_visits_bank are the distinguishing features
- For group 1, Total_visits_online and Total_calls_made are the distinguishing features
- For group 2, Avg_Credit_Limit, Total_Credit_Cards and Total_visits_online are the distinguishing features

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
sns.scatterplot(
    data=dfh,
    x="Avg_Credit_Limit",
    y="Total_visits_online",
    ax=ax1,  # no hue in the "before" plot, so no palette is needed
)
sns.scatterplot(
    data=dfh,
    x="Avg_Credit_Limit",
    y="Total_visits_online",
    hue="hcluster_segments",
    palette="gist_rainbow",
    ax=ax2,
)
ax1.set_title("Before clustering")
ax2.set_title("After clustering")
ax2.legend( loc="upper left")


In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
# no hue in the "before" plot, so no palette is needed
sns.scatterplot(data=dfh, x="Customer Key", y="Total_Credit_Cards", ax=ax1)
sns.scatterplot(
    data=dfh,
    x="Customer Key",
    y="Total_Credit_Cards",
    hue="hcluster_segments",
    palette="gist_rainbow",
    ax=ax2,
)
ax1.set_title("Before clustering")
ax2.set_title("After clustering")
ax2.legend(loc="upper left")

In [None]:
cluster_profile_h = hcustdata.groupby("hcluster_segments").mean()
cluster_profile_h["count_in_each_segments"] = (
    dfh.groupby("hcluster_segments")["Avg_Credit_Limit"].count().values
)

In [None]:
# let's display cluster profiles
cluster_profile_h.style.background_gradient(cmap="nipy_spectral", axis=0)

In [None]:
cluster_profile_h.style.highlight_max(color="lightgreen", axis=0)

### Observations:
- The customers are grouped into 3 clusters
- Customers with higher credit limits and more credit cards, who use the online facility, form one group. It has the lowest customer count of the 3, and these customers tend to make fewer support calls
- Customers with lower credit limits and fewer credit cards form the second group, which tends to use the online service but has made more support calls
- Customers in group 0 have credit limits and credit card counts between those of the other two groups. These customers rely more on bank visits than on the online facility and rank second in support calls. This group forms the majority of the customer base

### Compare the cluster profile of both the methods

In [None]:
print("\t\tCluster profile for Agglomerative Clustering")
cluster_profile_h.style.highlight_max(color="lightgreen", axis=0)

In [None]:
print("\t\tCluster profile for KMeans Clustering")
cluster_profile.style.highlight_max(color="lightgreen", axis=0)

### Observations
- Both clustering models give similar results
- No major deviations were found between the models
- The clustering criteria remain the same across both models
- The silhouette score is also the same across both models
- There is no specific criterion for selecting one algorithm over the other here, but in general KMeans may work more efficiently with large datasets
- An experiment with a larger dataset is needed before a model can be selected
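
One way to quantify how similar the two clusterings are, beyond comparing profiles by eye, is a cross-tabulation of the two label sets together with the adjusted Rand index. This is a sketch on synthetic blobs (an assumption for illustration); the variable names are hypothetical.

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic data with three well-separated groups (illustrative only)
X, _ = make_blobs(
    n_samples=300, centers=[[0, 0], [10, 0], [0, 10]], cluster_std=0.6, random_state=1
)

km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
hc_labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

# Rows: KMeans segments, columns: agglomerative segments
agreement = pd.crosstab(
    km_labels, hc_labels, rownames=["kmeans"], colnames=["agglomerative"]
)

# 1.0 means identical partitions, regardless of how the cluster ids are numbered
ari = adjusted_rand_score(km_labels, hc_labels)

print(agreement)
print(ari)
```

In this notebook the same two calls could be applied to `df["K_means_segments"]` and `dfh["hcluster_segments"]`; the cross-tabulation also shows which KMeans group corresponds to which agglomerative group, since the label numbers need not match.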

### Actionable Insights & Recommendations

- It looks like customers in group 2 use the online facility more and tend to make fewer customer support calls. These customers may come from a higher-income segment with greater online banking awareness and may have mastered online banking.

- Customers in group 1 have lower credit limits and fewer credit cards. This group tends to use the online facility but makes more customer support calls. This can be a possible indication of an online service that is not user friendly, or an opportunity to improve the online service.

- Group 0 customers make more visits to the bank and have a lower online presence; they also make a moderate number of customer support calls. The bank needs to investigate why these customers do not use the online service and address their concerns if they are related to technology. This will help reduce the workload of banking staff.

- Group 0 customers have a moderate number of credit cards with good credit limits and should be encouraged to use them with additional returns.

- The bank needs to focus on group 1 and address their concerns, as this group tries to use the online service but struggles to find its way. Addressing this set of customers can help increase credit card usage and business opportunities.

- The bank needs to review its online service, talking to customers and the support team to identify issues in the online service and fix them.

- Reviewing the online banking service will help increase business and also reduce the number of complaints about customer support.