<a href="https://colab.research.google.com/github/jgabrielg99/Python/blob/main/Project_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Credit Card Users Churn Prediction

## Problem Statement

### Business Context

Thera Bank recently saw a steep decline in the number of users of its credit card. Credit cards are a good source of income for banks because of the different kinds of fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, foreign transaction fees, and others. Some fees are charged to every user irrespective of usage, while others are charged under specified circumstances.

Customers leaving the credit card service would cause the bank a loss, so the bank wants to analyze customer data, identify the customers who are likely to leave and the reasons why, and improve in those areas.

As a data scientist at Thera Bank, you need to build a classification model that will help the bank improve its services so that customers do not give up their credit cards.

### Data Description

* CLIENTNUM: Client number. Unique identifier for the customer holding the account
* Attrition_Flag: Internal event (customer activity) variable - if the account is closed then "Attrited Customer" else "Existing Customer"
* Customer_Age: Age in Years
* Gender: Gender of the account holder
* Dependent_count: Number of dependents
* Education_Level: Educational Qualification of the account holder - Graduate, High School, Unknown, Uneducated, College (refers to a college student), Post-Graduate, Doctorate
* Marital_Status: Marital Status of the account holder
* Income_Category: Annual Income Category of the account holder
* Card_Category: Type of Card
* Months_on_book: Period of relationship with the bank (in months)
* Total_Relationship_Count: Total no. of products held by the customer
* Months_Inactive_12_mon: No. of months inactive in the last 12 months
* Contacts_Count_12_mon: No. of Contacts in the last 12 months
* Credit_Limit: Credit Limit on the Credit Card
* Total_Revolving_Bal: Total Revolving Balance on the Credit Card
* Avg_Open_To_Buy: Open to Buy Credit Line (Average of last 12 months)
* Total_Amt_Chng_Q4_Q1: Change in Transaction Amount (Q4 over Q1)
* Total_Trans_Amt: Total Transaction Amount (Last 12 months)
* Total_Trans_Ct: Total Transaction Count (Last 12 months)
* Total_Ct_Chng_Q4_Q1: Change in Transaction Count (Q4 over Q1)
* Avg_Utilization_Ratio: Average Card Utilization Ratio

#### What Is a Revolving Balance?

- If we don't pay the balance of the revolving credit account in full every month, the unpaid portion carries over to the next month. That's called a revolving balance


##### What is the Average Open to buy?

- 'Open to Buy' means the amount left on your credit card to use. Now, this column represents the average of this value for the last 12 months.

##### What is the Average utilization Ratio?

- The Avg_Utilization_Ratio represents how much of the available credit the customer spent. This is useful for calculating credit scores.


##### Relation b/w Avg_Open_To_Buy, Credit_Limit and Avg_Utilization_Ratio:

- ( Avg_Open_To_Buy / Credit_Limit ) + Avg_Utilization_Ratio = 1
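The relation above can be checked numerically. The sketch below uses a few made-up rows (not taken from the dataset) and assumes that open-to-buy is the credit limit minus the revolving balance and that utilization is the revolving balance divided by the credit limit.

```python
import numpy as np
import pandas as pd

# Illustrative rows only; these values are made up, not taken from BankChurners.csv
toy = pd.DataFrame(
    {
        "Credit_Limit": [10000.0, 4000.0, 2500.0],
        "Total_Revolving_Bal": [2500.0, 0.0, 1000.0],
    }
)

# Assumption: open-to-buy is the unused part of the credit limit
toy["Avg_Open_To_Buy"] = toy["Credit_Limit"] - toy["Total_Revolving_Bal"]
# Assumption: utilization is the used part of the credit limit
toy["Avg_Utilization_Ratio"] = toy["Total_Revolving_Bal"] / toy["Credit_Limit"]

# (Avg_Open_To_Buy / Credit_Limit) + Avg_Utilization_Ratio should equal 1 for every row
relation = toy["Avg_Open_To_Buy"] / toy["Credit_Limit"] + toy["Avg_Utilization_Ratio"]
relation_holds = bool(np.allclose(relation, 1.0))
print(relation_holds)
```

Because the three columns are tied together this way, they carry overlapping information, which is worth keeping in mind when reading the correlation heatmap later.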

### **Please read the instructions carefully before starting the project.**
This is a commented Jupyter IPython Notebook file in which all the instructions and tasks to be performed are mentioned.
* Blanks '_______' are provided in the notebook and need to be filled with appropriate code to get the correct result. Every '_______' blank comes with a comment that briefly describes what needs to be filled in.
* Identify the task to be performed correctly, and only then proceed to write the required code.
* Fill in the code wherever asked by commented lines like "# write your code here" or "# complete the code". Running incomplete code may throw an error.
* Please run the code cells sequentially from the beginning to avoid unnecessary errors.
* Add the results/observations (wherever mentioned) derived from the analysis in the presentation and submit the same.


## Importing necessary libraries

In [None]:
# Installing the libraries with the specified version.
# uncomment and run the following line if Google Colab is being used
!pip install scikit-learn==1.2.2 seaborn==0.13.1 matplotlib==3.7.1 numpy==1.25.2 pandas==1.5.3 imbalanced-learn==0.10.1 xgboost==2.0.3 -q --user

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m9.6/9.6 MB[0m [31m15.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m18.2/18.2 MB[0m [31m16.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.1/12.1 MB[0m [31m12.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m226.0/226.0 kB[0m [31m4.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m297.1/297.1 MB[0m [31m3.2 MB/s[0m eta [36m0:00:00[0m
[0m[31mERROR: Operation cancelled by user[0m[31m
[0m

In [None]:
# Installing the libraries with the specified version.
# uncomment and run the following lines if Jupyter Notebook is being used
# !pip install scikit-learn==1.2.2 seaborn==0.13.1 matplotlib==3.7.1 numpy==1.25.2 pandas==1.5.3 imbalanced-learn==0.12.0 xgboost==2.0.3 -q --user
# !pip install --upgrade -q threadpoolctl

In [None]:
# reading, manipulating, and visualizing data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# impute missing data
from sklearn.impute import SimpleImputer

# model scoring
from sklearn import metrics

# splitting data into train, validation, and test sets
from sklearn.model_selection import train_test_split

# model evaluation metrics
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, confusion_matrix

# hyperparameter tuning
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# balancing techniques
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# model building
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, BaggingClassifier
from xgboost import XGBClassifier


import warnings
warnings.filterwarnings('ignore')

**Note**: *After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again*.

## Loading the dataset

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
data = pd.read_csv('/content/drive/MyDrive/AIML Course/Files/BankChurners.csv')
df = data.copy()

In [None]:
df.head()

## Data Overview

- Observations
- Sanity checks

In [None]:
df.shape

* There are 10127 rows and 21 columns in this dataset

In [None]:
df.info()

* Attrition_Flag, Gender, Education_Level, Marital_Status, Income_Category, Card_Category are all object data types
* There appear to be missing values in Education_Level and Marital_Status

In [None]:
df.describe().round(3)

* From an initial look, there don't appear to be any unexpected values; however, there are certainly outliers in columns such as Credit_Limit, Avg_Open_To_Buy, Total_Amt_Chng_Q4_Q1, and Total_Trans_Amt, as the max value is substantially higher than even the 3rd quartile
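One common way to quantify this kind of observation is the 1.5 × IQR rule, the same rule a boxplot uses for its whiskers. The sketch below is only an illustration: the helper name `iqr_outlier_share` is hypothetical and the sample values are made up.

```python
import pandas as pd


def iqr_outlier_share(series: pd.Series, whis: float = 1.5) -> float:
    """Return the fraction of values outside [Q1 - whis*IQR, Q3 + whis*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return float(((series < lower) | (series > upper)).mean())


# Made-up, right-skewed sample standing in for a column such as Credit_Limit
sample = pd.Series([1500, 2000, 2500, 3000, 3500, 4000, 4500, 34000])
share = iqr_outlier_share(sample)
print(share)
```

On the real data, the same helper could be applied column by column, for example with `df.select_dtypes(include="number").apply(iqr_outlier_share)`.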

In [None]:
df.isna().sum()

In [None]:
df.loc[df['Education_Level'].isna()]

In [None]:
df.loc[df['Marital_Status'].isna()]

* There doesn't appear to be any pattern linking missing Education_Level or Marital_Status

In [None]:
df.duplicated().sum()

In [None]:
df['CLIENTNUM'].nunique()

In [None]:
df.drop(['CLIENTNUM'], axis=1, inplace=True)

In [None]:
df['Attrition_Flag'] = df['Attrition_Flag'].replace({
    'Existing Customer': 0,
    'Attrited Customer': 1
})

## Exploratory Data Analysis (EDA)

- EDA is an important part of any project involving data.
- It is important to investigate and understand the data better before building a model with it.
- A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
- A thorough analysis of the data, in addition to the questions mentioned below, should be done.

**Questions**:

1. How is the total transaction amount distributed?
2. What is the distribution of the level of education of customers?
3. What is the distribution of the level of income of customers?
4. How does the change in transaction count between Q4 and Q1 (`Total_Ct_Chng_Q4_Q1`) vary by the customer's account status (`Attrition_Flag`)?
5. How does the number of months a customer was inactive in the last 12 months (`Months_Inactive_12_mon`) vary by the customer's account status (`Attrition_Flag`)?
6. What are the attributes that have a strong correlation with each other?



#### The below functions need to be defined to carry out the Exploratory Data Analysis.

In [None]:
# function to plot a boxplot and a histogram along the same scale.


def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a triangle will indicate the mean value of the column
    # pass bins only when given so that seaborn otherwise picks the bin count automatically
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram

In [None]:
# function to create labeled barplots


def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # horizontal center of the bar
        y = p.get_height()  # top of the bar

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot

In [None]:
# function to plot stacked bar chart

def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 1, 5))
    # place the legend outside the plot area, to the right
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()

In [None]:
### Function to plot distributions

def distribution_plot_wrt_target(data, predictor, target):

    fig, axs = plt.subplots(2, 2, figsize=(12, 10))

    target_uniq = data[target].unique()

    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
    )

    axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
    )

    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")

    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )

    plt.tight_layout()
    plt.show()

### Univariate Analysis

In [None]:
labeled_barplot(df, 'Attrition_Flag')

In [None]:
df['Attrition_Flag'].value_counts(normalize=True)

* 1627 customers (approximately 16%) left their credit card services

In [None]:
df.columns

In [None]:
histogram_boxplot(df, 'Customer_Age')

* Normal distribution of ages with a few outliers at the older end
* The average age is around 45-46 years old

In [None]:
labeled_barplot(df, 'Gender')

In [None]:
df['Gender'].value_counts(normalize=True)

* 53% of customers are female
* 47% of customers are male

In [None]:
histogram_boxplot(df, 'Dependent_count')

* Normal distribution of the number of dependents
* The average number of dependents is slightly above 2

In [None]:
labeled_barplot(df, 'Education_Level')

In [None]:
df['Education_Level'].value_counts(normalize=True)

* The largest share of customers (36%) hold a Graduate degree
* The smallest share of customers (5%) hold a Doctorate

In [None]:
labeled_barplot(df, 'Marital_Status')

In [None]:
df['Marital_Status'].value_counts(normalize=True)

In [None]:
labeled_barplot(df, 'Income_Category')

* There is an unexpected value called "abc" in Income_Category
  * We may be able to impute likely values if there is a high correlation between Income_Category and another feature such as Credit_Limit

In [None]:
df.columns

In [None]:
labeled_barplot(df, 'Card_Category')

* The Blue Card is the most commonly held credit card. Platinum is the rarest card

In [None]:
histogram_boxplot(df, 'Months_on_book')

* Values around 35 months appear significantly more frequently than any other value
* There are outliers on either end of the dataset

In [None]:
histogram_boxplot(df, 'Total_Relationship_Count')

* There is a somewhat even distribution. The average number of products held is slightly below 4

In [None]:
histogram_boxplot(df, 'Months_Inactive_12_mon')

* Most customers were inactive for 2 to 3 months in the year
* Inactivity of 1 month, 5 months, or 6 months is considered an outlier

In [None]:
histogram_boxplot(df, 'Contacts_Count_12_mon')

* Most customers were contacted between 2 and 3 times in a year
* There are outliers on either end of this dataset - 0 contacts, 5 contacts, and 6 contacts

In [None]:
histogram_boxplot(df, 'Credit_Limit')

* Credit limit is heavily right-skewed; however, there is a large number of customers at around 34k.
  * This may be the maximum possible credit limit
* There is a large number of outliers on the higher end of the dataset

In [None]:
histogram_boxplot(df, 'Total_Revolving_Bal')

* A large number of customers have no revolving balance. Otherwise, the data is somewhat normally distributed.
* There is also a larger number of customers with a revolving balance around 2500.

In [None]:
df.columns

In [None]:
histogram_boxplot(df, 'Avg_Open_To_Buy')

* This data is heavily right-skewed. Similar to Credit_Limit, there is a moderate concentration of values between 30,000 and 35,000
* There is a large number of outliers on the higher end of the data

In [None]:
histogram_boxplot(df, 'Total_Amt_Chng_Q4_Q1')

* This data is approximately normally distributed
  * The average falls around 0.75
* There are outliers on either end of this dataset

In [None]:
histogram_boxplot(df, 'Total_Trans_Amt')

* This data is multimodal featuring 4 different peaks
* There are a large number of outliers on the higher end of the dataset

In [None]:
histogram_boxplot(df, 'Total_Trans_Ct')

* Bimodal distribution with peaks around 40 and 75 transactions
  * There are a few outliers at the top end of the dataset

In [None]:
histogram_boxplot(df, 'Total_Ct_Chng_Q4_Q1')

* Approximately normal distribution
  * Outliers on either end of the data

In [None]:
histogram_boxplot(df, 'Avg_Utilization_Ratio')

* Right skewed
* Related to Credit Limit and Open to Buy

### Bivariate Analysis

In [None]:
df.columns

In [None]:
plt.figure(figsize=(15, 7))
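# numeric_only=True restricts the correlation to numeric columns (object columns cannot be correlated)
sns.heatmap(df.corr(numeric_only=True), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")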
plt.show()

* Total Transaction Count has the strongest correlation with Attrition_Flag
* Total Transaction Count and Total Transaction Amount have a very high correlation
  * This is expected, as someone who uses their card more frequently will likely have a higher transaction amount
* Months on Book and Customer Age also have a very high correlation
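Strongly correlated pairs like the ones noted above can also be listed programmatically instead of being read off the heatmap. The following is a minimal sketch on synthetic data; the helper name `top_correlated_pairs` and the 0.7 threshold are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(frame: pd.DataFrame, threshold: float = 0.7) -> pd.Series:
    """List feature pairs whose absolute Pearson correlation exceeds the threshold."""
    corr = frame.corr(numeric_only=True).abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Synthetic columns: "b" is a noisy copy of "a", "c" is independent noise
rng = np.random.default_rng(1)
a = rng.normal(size=500)
toy = pd.DataFrame(
    {"a": a, "b": a + rng.normal(scale=0.1, size=500), "c": rng.normal(size=500)}
)
strong = top_correlated_pairs(toy)
print(strong)
```

Calling `top_correlated_pairs(df)` on the notebook's dataframe would be one way to confirm the pairs listed in the observations.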

In [None]:
distribution_plot_wrt_target(df, 'Customer_Age', 'Attrition_Flag')

In [None]:
stacked_barplot(df, 'Gender', 'Attrition_Flag')

* A slightly higher percentage of males attrited than females

In [None]:
stacked_barplot(df, 'Dependent_count', 'Attrition_Flag')

In [None]:
stacked_barplot(df, 'Education_Level', 'Attrition_Flag')

* A slightly higher percentage of customers with a doctorate attrited than other education levels

In [None]:
stacked_barplot(df, 'Marital_Status', 'Attrition_Flag')

In [None]:
stacked_barplot(df, 'Income_Category', 'Attrition_Flag')

In [None]:
stacked_barplot(df, 'Card_Category', 'Attrition_Flag')

* A higher percentage of platinum card holders attrited compared to other card categories

In [None]:
distribution_plot_wrt_target(df, 'Months_on_book', 'Attrition_Flag')

In [None]:
df.columns

In [None]:
stacked_barplot(df, 'Total_Relationship_Count', 'Attrition_Flag')

* More customers with 1 or 2 products attrited than customers with more than 2 products

In [None]:
stacked_barplot(df, 'Months_Inactive_12_mon', 'Attrition_Flag')

* Surprisingly, customers with 0 months inactive are the most likely to attrite, followed by customers with 4 months inactive
  * However, 0 months inactive is the smallest category, with only 29 customers out of 10127
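The caveat about the small category can be made concrete by reporting each group's size and a rough binomial standard error next to its attrition rate. This is a sketch on made-up data; the helper name `rate_with_uncertainty` is hypothetical, and the standard error formula sqrt(p(1-p)/n) is a simple approximation rather than anything used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd


def rate_with_uncertainty(frame: pd.DataFrame, group_col: str, target_col: str) -> pd.DataFrame:
    """Per-group positive rate, group size, and a rough binomial standard error."""
    summary = frame.groupby(group_col)[target_col].agg(rate="mean", n="size")
    summary["std_err"] = np.sqrt(summary["rate"] * (1 - summary["rate"]) / summary["n"])
    return summary


# Made-up data: a tiny group (4 rows) and a large group (400 rows)
toy = pd.DataFrame(
    {
        "Months_Inactive_12_mon": [0] * 4 + [3] * 400,
        "Attrition_Flag": [1, 1, 0, 0] + [1] * 80 + [0] * 320,
    }
)
summary = rate_with_uncertainty(toy, "Months_Inactive_12_mon", "Attrition_Flag")
print(summary)
```

The tiny group shows a much larger standard error than the large one, which is why a high attrition rate in a 29-customer category should be read with caution.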

In [None]:
stacked_barplot(df, 'Contacts_Count_12_mon', 'Attrition_Flag')

* The more times a customer contacts the bank regarding credit services, the more likely they are to attrite
  * All customers that contacted 6 times attrited



In [None]:
distribution_plot_wrt_target(df, 'Credit_Limit', 'Attrition_Flag')

In [None]:
distribution_plot_wrt_target(df, 'Total_Revolving_Bal', 'Attrition_Flag')

* Customers with little to no revolving balance attrited more than customers with high revolving balances
  * This aligns with longer periods of inactivity resulting in attriting

In [None]:
distribution_plot_wrt_target(df, 'Avg_Open_To_Buy', 'Attrition_Flag')

In [None]:
distribution_plot_wrt_target(df, 'Total_Amt_Chng_Q4_Q1', 'Attrition_Flag')

In [None]:
distribution_plot_wrt_target(df, 'Total_Trans_Amt', 'Attrition_Flag')

* Customers that spent less on their cards attrited more often than customers who had a higher transaction amount

In [None]:
distribution_plot_wrt_target(df, 'Total_Trans_Ct', 'Attrition_Flag')

* Similarly to Transaction Amounts, customers with fewer transactions are more likely to attrite

In [None]:
distribution_plot_wrt_target(df, 'Total_Ct_Chng_Q4_Q1', 'Attrition_Flag')

In [None]:
distribution_plot_wrt_target(df, 'Avg_Utilization_Ratio', 'Attrition_Flag')

* Customers that attrited generally had a lower average utilization ratio

## Data Pre-processing

In [None]:
# outlier detection using boxplot
numeric_columns = df.select_dtypes(include=np.number).columns.tolist()

plt.figure(figsize=(15, 12))

for i, variable in enumerate(numeric_columns):
    plt.subplot(4, 4, i + 1)
    plt.boxplot(df[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)

plt.show()

## Missing value imputation




* Education_Level and Marital_Status have missing values, and Income_Category contains the placeholder value "abc", which is treated as missing below

In [None]:
df.loc[df['Income_Category'] == 'abc', 'Income_Category'] = np.nan

In [None]:
# impute categorical columns with their most frequent value
SI = SimpleImputer(strategy='most_frequent')

In [None]:
df.isna().sum()

In [None]:
X = df.drop('Attrition_Flag', axis=1)
y = df['Attrition_Flag']

In [None]:
# split data for temp and test sets
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

# split temp into train and valid sets
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)

print(X_train.shape, X_val.shape, X_test.shape)

In [None]:
cat_cols = list(X_train.select_dtypes(include='object').columns)

In [None]:
X_train[cat_cols] = SI.fit_transform(X_train[cat_cols])
X_val[cat_cols] = SI.transform(X_val[cat_cols])
X_test[cat_cols] = SI.transform(X_test[cat_cols])

In [None]:
# check for missing values in train, val, and test set
print(X_train.isna().sum())
print("-" * 30)
print(X_val.isna().sum())
print("-" * 30)
print(X_test.isna().sum())

In [None]:
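X_train = pd.get_dummies(X_train, drop_first=True)
# encode validation and test sets, then align their columns to the training columns
# so that a category missing from one split cannot change the feature set
X_val = pd.get_dummies(X_val, drop_first=True).reindex(columns=X_train.columns, fill_value=0)
X_test = pd.get_dummies(X_test, drop_first=True).reindex(columns=X_train.columns, fill_value=0)
print(X_train.shape, X_val.shape, X_test.shape)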

* After encoding the data, there are 29 variables in the dataset

## Model Building

### Model evaluation criterion

The nature of predictions made by the classification model will translate as follows:

- True positives (TP) are customers who attrited and were correctly predicted to attrite by the model.
- False negatives (FN) are customers who attrited but were predicted to stay.
- False positives (FP) are customers who were predicted to attrite but actually stayed.

**Which metric to optimize?**

* We need to choose the metric that ensures the maximum number of attriting customers are predicted correctly by the model.
* We want Recall to be maximized: the greater the Recall, the fewer the false negatives.
* We want to minimize false negatives because if the model predicts that a customer will stay when they will in fact leave, the bank loses that customer, and the associated fee income, without a chance to intervene.
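Recall is TP / (TP + FN), so it rises as false negatives fall. The sketch below computes it by hand from a confusion matrix and with `recall_score`; the labels are made up, with 1 standing for the positive class (an attrited customer, matching the encoding of `Attrition_Flag` used earlier).

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels: 1 = positive class (attrited), 0 = negative class (existing)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 0])

# For binary labels [0, 1], ravel() returns TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall_manual = tp / (tp + fn)
recall_sklearn = recall_score(y_true, y_pred)
print(tp, fn, fp, recall_manual, recall_sklearn)
```

Here one of the four positives is missed, so recall is 3 / 4; the single false positive lowers precision but does not affect recall.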

**Let's define a function to output different metrics (including recall) on the train and test set and a function to show confusion matrix so that we do not have to use the same code repetitively while evaluating models.**

In [None]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "Accuracy": acc,
            "Recall": recall,
            "Precision": precision,
            "F1": f1

        },
        index=[0],
    )

    return df_perf

In [None]:
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

### Model Building with original data

Sample code for model building with original data

In [None]:
models = []  # Empty list to store all the models

# Appending models into the list
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(('Decision tree', DecisionTreeClassifier(random_state=1)))
models.append(('AdaBoost', AdaBoostClassifier(random_state=1)))
models.append(('Gradient Boosting', GradientBoostingClassifier(random_state=1)))
models.append(('XGBoost', XGBClassifier(random_state=1)))

print("\n" "Training Performance:" "\n")
for name, model in models:
    model.fit(X_train, y_train)
    scores = recall_score(y_train, model.predict(X_train))
    print("{}: {}".format(name, scores))

print("\n" "Validation Performance:" "\n")

for name, model in models:
    # models were already fitted on X_train in the loop above, so no refit is needed
    scores_val = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores_val))

* XGBoost has the best performance on the original data.
  * There is overfitting in these models and the classes are imbalanced

### Model Building with Oversampled data


In [None]:
print("Before Oversampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before Oversampling, counts of label 'No': {} \n".format(sum(y_train == 0)))

sm = SMOTE(
    sampling_strategy=1, k_neighbors=5, random_state=1
)  # Synthetic Minority Over Sampling Technique
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)


print("After Oversampling, counts of label 'Yes': {}".format(sum(y_train_over == 1)))
print("After Oversampling, counts of label 'No': {} \n".format(sum(y_train_over == 0)))


print("After Oversampling, the shape of train_X: {}".format(X_train_over.shape))
print("After Oversampling, the shape of train_y: {} \n".format(y_train_over.shape))

In [None]:
models = []  # Empty list to store all the models

# Appending models into the list
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(('Decision tree', DecisionTreeClassifier(random_state=1)))
models.append(('AdaBoost', AdaBoostClassifier(random_state=1)))
models.append(('Gradient Boosting', GradientBoostingClassifier(random_state=1)))
models.append(('XGBoost', XGBClassifier(random_state=1)))

print("\n" "Training Performance:" "\n")
for name, model in models:
    model.fit(X_train_over, y_train_over)
    scores = recall_score(y_train_over, model.predict(X_train_over))
    print("{}: {}".format(name, scores))

print("\n" "Validation Performance:" "\n")

for name, model in models:
    # models were already fitted on the oversampled data in the loop above
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))

In [None]:
print("\nTraining and Validation Performance Difference:\n")

for name, model in models:
    # models were fitted on the oversampled data in the previous cell
    scores_train = recall_score(y_train_over, model.predict(X_train_over))
    scores_val = recall_score(y_val, model.predict(X_val))
    difference2 = scores_train - scores_val
    print("{}: Training Score: {:.4f}, Validation Score: {:.4f}, Difference: {:.4f}".format(name, scores_train, scores_val, difference2))

* XGBoost has the best score, followed by AdaBoost

### Model Building with Undersampled data

In [None]:
# Random undersampler for under sampling the data
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)

In [None]:
print("Before Under Sampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before Under Sampling, counts of label 'No': {} \n".format(sum(y_train == 0)))

print("After Under Sampling, counts of label 'Yes': {}".format(sum(y_train_un == 1)))
print("After Under Sampling, counts of label 'No': {} \n".format(sum(y_train_un == 0)))

print("After Under Sampling, the shape of train_X: {}".format(X_train_un.shape))
print("After Under Sampling, the shape of train_y: {} \n".format(y_train_un.shape))

In [None]:
models = []  # Empty list to store all the models

# Appending models into the list
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(('Decision tree', DecisionTreeClassifier(random_state=1)))
models.append(('AdaBoost', AdaBoostClassifier(random_state=1)))
models.append(('Gradient Boosting', GradientBoostingClassifier(random_state=1)))
models.append(('XGBoost', XGBClassifier(random_state=1)))


print("\n" "Training Performance:" "\n")
for name, model in models:
    model.fit(X_train_un, y_train_un)
    scores = recall_score(y_train_un, model.predict(X_train_un))
    print("{}: {}".format(name, scores))

print("\n" "Validation Performance:" "\n")

for name, model in models:
    # models were already fitted on the undersampled data in the loop above
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))

In [None]:
print("\nTraining and Validation Performance Difference:\n")

for name, model in models:
    # models were fitted on the undersampled data in the previous cell
    scores_train = recall_score(y_train_un, model.predict(X_train_un))
    scores_val = recall_score(y_val, model.predict(X_val))
    difference3 = scores_train - scores_val
    print("{}: Training Score: {:.4f}, Validation Score: {:.4f}, Difference: {:.4f}".format(name, scores_train, scores_val, difference3))

* XGBoost has the best score with AdaBoost and Gradient Boosting close behind
  * With undersampling, AdaBoost performed better on the Validation set than on the Training set

* After building 18 models, it was observed that XGBoost, AdaBoost, and Gradient Boost performed best on an undersampled dataset. XGBoost and AdaBoost performed the best on the oversampled dataset.
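Comparing this many models is easier when the scores are collected in one table rather than printed line by line. The sketch below shows one possible way to do that; the helper name `recall_table` is hypothetical, and the data is a synthetic, imbalanced stand-in generated with `make_classification`, not the churn dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split


def recall_table(models, X_tr, y_tr, X_va, y_va):
    """Fit each model once and tabulate train recall, validation recall, and their gap."""
    rows = []
    for name, model in models:
        model.fit(X_tr, y_tr)
        train = recall_score(y_tr, model.predict(X_tr))
        val = recall_score(y_va, model.predict(X_va))
        rows.append({"Model": name, "Train": train, "Validation": val, "Difference": train - val})
    # Best validation recall first
    return pd.DataFrame(rows).sort_values("Validation", ascending=False).reset_index(drop=True)


# Synthetic stand-in with roughly 16% positives (an assumption mirroring the churn rate)
X_toy, y_toy = make_classification(n_samples=600, n_features=8, weights=[0.84], random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=1, stratify=y_toy
)

toy_models = [
    ("AdaBoost", AdaBoostClassifier(random_state=1)),
    ("Gradient Boosting", GradientBoostingClassifier(random_state=1)),
]
results = recall_table(toy_models, X_tr, y_tr, X_va, y_va)
print(results)
```

Running the same helper once per training set (original, oversampled, undersampled) would give three tables that can be compared side by side.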

### Hyperparameter Tuning

#### Sample Parameter Grids

**Note**

1. Sample parameter grids have been provided to do necessary hyperparameter tuning. These sample grids are expected to provide a balance between model performance improvement and execution time. One can extend/reduce the parameter grid based on execution time and system configuration.
  - Please note that if the parameter grid is extended to improve the model performance further, the execution time will increase


- For Gradient Boosting:

```
param_grid = {
    "init": [AdaBoostClassifier(random_state=1),DecisionTreeClassifier(random_state=1)],
    "n_estimators": np.arange(50,110,25),
    "learning_rate": [0.01,0.1,0.05],
    "subsample":[0.7,0.9],
    "max_features":[0.5,0.7,1],
}
```

- For Adaboost:

```
param_grid = {
    "n_estimators": np.arange(50,110,25),
    "learning_rate": [0.01,0.1,0.05],
    "estimator": [
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}
```

- For Bagging Classifier:

```
param_grid = {
    'max_samples': [0.8,0.9,1.0],  # floats are fractions of the training set; an int 1 would mean a single sample
    'max_features': [0.7,0.8,0.9],
    'n_estimators' : [30,50,70],
}
```
- For Random Forest:

```
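param_grid = {
    "n_estimators": np.arange(50,110,25),
    "min_samples_leaf": np.arange(1, 4),
    # flatten the fractions into individual candidates; a nested array is not a valid max_features value
    "max_features": list(np.arange(0.3, 0.6, 0.1)) + ['sqrt'],
    "max_samples": np.arange(0.4, 0.7, 0.1)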
}
```

- For Decision Trees:

```
param_grid = {
    'max_depth': np.arange(2,6),
    'min_samples_leaf': [1, 4, 7],
    'max_leaf_nodes' : [10, 15],
    'min_impurity_decrease': [0.0001,0.001]
}
```

- For XGBoost (optional):

```
param_grid={'n_estimators':np.arange(50,110,25),
            'scale_pos_weight':[1,2,5],
            'learning_rate':[0.01,0.1,0.05],
            'gamma':[1,3],
            'subsample':[0.7,0.9]
}
```
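The trade-off between grid size and execution time mentioned in the note can be estimated before running a search. The sketch below counts the combinations in the XGBoost sample grid (written out as plain lists) and the number of model fits implied by the `n_iter=10`, `cv=5` settings used in the cells that follow. The variable names `n_combinations`, `n_sampled`, and `n_fits` are illustrative, not part of the notebook.

```python
from math import prod

# Candidate values copied from the XGBoost sample grid above
param_grid = {
    "n_estimators": [50, 75, 100],  # what np.arange(50, 110, 25) expands to
    "scale_pos_weight": [1, 2, 5],
    "learning_rate": [0.01, 0.1, 0.05],
    "gamma": [1, 3],
    "subsample": [0.7, 0.9],
}

# An exhaustive grid search would try every combination of candidate values
n_combinations = prod(len(values) for values in param_grid.values())

# A randomized search samples at most n_iter combinations,
# and each sampled combination is fitted once per cross-validation fold
n_iter, cv = 10, 5
n_sampled = min(n_iter, n_combinations)
n_fits = n_sampled * cv

print(f"{n_combinations} combinations in the grid, {n_fits} fits in the randomized search")
```

Extending any list in the grid multiplies the number of combinations, but the cost of the randomized search only grows if `n_iter` or `cv` is raised as well. scikit-learn's `ParameterGrid` gives the same combination count via `len(ParameterGrid(param_grid))`.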

### Decision Tree Classifier

#### Sample tuning method for Decision tree with original data

In [None]:
# defining model
Model = DecisionTreeClassifier(random_state=1)

# Parameter grid to pass in RandomizedSearchCV
param_grid = {'max_depth': np.arange(2,6),
              'min_samples_leaf': [1, 4, 7],
              'max_leaf_nodes' : [10,15],
              'min_impurity_decrease': [0.0001,0.001] }

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)

print("Best parameters are {} with CV score={}".format(randomized_cv.best_params_, randomized_cv.best_score_))

#### Sample tuning method for Decision tree with oversampled data

In [None]:
# defining model
Model = DecisionTreeClassifier(random_state=1)

# Parameter grid to pass in RandomizedSearchCV
param_grid = {'max_depth': np.arange(2,6),
              'min_samples_leaf': [1, 4, 7],
              'max_leaf_nodes' : [10,15],
              'min_impurity_decrease': [0.0001,0.001] }

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over,y_train_over)

print("Best parameters are {} with CV score={}".format(randomized_cv.best_params_, randomized_cv.best_score_))

#### Sample tuning method for Decision tree with undersampled data

In [None]:
# defining model
Model = DecisionTreeClassifier(random_state=1)

# Parameter grid to pass in RandomizedSearchCV
param_grid = {'max_depth': np.arange(2,6),
              'min_samples_leaf': [1, 4, 7],
              'max_leaf_nodes' : [10,15],
              'min_impurity_decrease': [0.0001,0.001] }

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)

print("Best parameters are {} with CV score={}".format(randomized_cv.best_params_, randomized_cv.best_score_))

In [None]:
tuned_dtc = DecisionTreeClassifier(
    min_samples_leaf = 7,
    min_impurity_decrease = 0.0001,
    max_leaf_nodes = 15,
    max_depth = 5,
    random_state=1
)
tuned_dtc.fit(X_train_un, y_train_un)

In [None]:
# Checking model's performance on training set
dtc_train = model_performance_classification_sklearn(tuned_dtc, X_train_un, y_train_un)
dtc_train

In [None]:
# Checking model's performance on validation set
dtc_val = model_performance_classification_sklearn(tuned_dtc, X_val, y_val)
dtc_val
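The tuned tree above is rebuilt by copying the printed best parameters into a new constructor by hand. As a hedged alternative, the sketch below shows how a fitted `RandomizedSearchCV` already exposes the same information through `best_params_` and `best_estimator_`. It runs on a synthetic dataset from `make_classification`, which is only a stand-in for the churn data, and the names `X_demo`, `y_demo`, `demo_grid`, `search`, and `tuned_demo` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (an assumption, not the bank dataset)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.84, 0.16], random_state=1
)

# Same candidate values as the decision tree grid used above
demo_grid = {
    "max_depth": np.arange(2, 6),
    "min_samples_leaf": [1, 4, 7],
    "max_leaf_nodes": [10, 15],
    "min_impurity_decrease": [0.0001, 0.001],
}

search = RandomizedSearchCV(
    estimator=DecisionTreeClassifier(random_state=1),
    param_distributions=demo_grid,
    n_iter=10,
    scoring=make_scorer(recall_score),
    cv=5,
    random_state=1,
)
search.fit(X_demo, y_demo)

# Option 1: unpack best_params_ instead of retyping the values
tuned_demo = DecisionTreeClassifier(random_state=1, **search.best_params_)
tuned_demo.fit(X_demo, y_demo)

# Option 2: best_estimator_ is already refit on the full training data (refit=True by default)
same = (tuned_demo.predict(X_demo) == search.best_estimator_.predict(X_demo)).all()
print("Both routes give identical predictions:", same)
```

Unpacking `best_params_` avoids transcription mistakes and keeps the rebuilt model in sync if the grid or the random seed changes.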

### AdaBoost Classifier


#### AdaBoost on Original Data

In [None]:
# defining model
Model = AdaBoostClassifier(random_state=1)

# Parameter grid to pass in RandomizedSearchCV
param_grid = {
    "n_estimators": np.arange(50,110,25),
    "learning_rate": [0.01,0.1,0.05],
    "base_estimator": [
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)

print("Best parameters are {} with CV score={}".format(randomized_cv.best_params_, randomized_cv.best_score_))

In [None]:
tuned_ada = AdaBoostClassifier(
    n_estimators = 100,
    learning_rate = 0.1,
    base_estimator = DecisionTreeClassifier(max_depth=3, random_state=1),
    random_state = 1,  # same seed as the model passed to RandomizedSearchCV
)
tuned_ada.fit(X_train, y_train)

In [None]:
# Checking model's performance on training set
ada_train = model_performance_classification_sklearn(tuned_ada, X_train, y_train)
ada_train

In [None]:
# Checking model's performance on validation set
ada_val = model_performance_classification_sklearn(tuned_ada, X_val, y_val)
ada_val

#### AdaBoost on Undersampled Data

In [None]:
# defining model
Model = AdaBoostClassifier(random_state=1)

# Parameter grid to pass in RandomizedSearchCV
param_grid = {
    "n_estimators": np.arange(50,110,25),
    "learning_rate": [0.01,0.1,0.05],
    "base_estimator": [
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)

print("Best parameters are {} with CV score={}".format(randomized_cv.best_params_, randomized_cv.best_score_))

In [None]:
tuned_ada_un = AdaBoostClassifier(
    n_estimators = 100,
    learning_rate = 0.1,
    base_estimator = DecisionTreeClassifier(max_depth=3, random_state=1),
    random_state = 1,  # same seed as the model passed to RandomizedSearchCV
)
tuned_ada_un.fit(X_train_un, y_train_un)

In [None]:
# Checking model's performance on training set
ada_train_un = model_performance_classification_sklearn(tuned_ada_un, X_train_un, y_train_un)
ada_train_un

In [None]:
# Checking model's performance on validation set
ada_val_un = model_performance_classification_sklearn(tuned_ada_un, X_val, y_val)
ada_val_un

#### AdaBoost on Oversampled Data

In [None]:
# defining model
Model = AdaBoostClassifier(random_state=1)

# Parameter grid to pass in RandomizedSearchCV
param_grid = {
    "n_estimators": np.arange(50,110,25),
    "learning_rate": [0.01,0.1,0.05],
    "base_estimator": [
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over,y_train_over)

print("Best parameters are {} with CV score={}".format(randomized_cv.best_params_, randomized_cv.best_score_))

In [None]:
tuned_ada_over = AdaBoostClassifier(
    n_estimators = 100,
    learning_rate = 0.1,
    base_estimator = DecisionTreeClassifier(max_depth=3, random_state=1),
    random_state = 1,  # same seed as the model passed to RandomizedSearchCV
)
tuned_ada_over.fit(X_train_over, y_train_over)

In [None]:
# Checking model's performance on training set
ada_train_over = model_performance_classification_sklearn(tuned_ada_over, X_train_over, y_train_over)
ada_train_over

In [None]:
# Checking model's performance on validation set
ada_val_over = model_performance_classification_sklearn(tuned_ada_over, X_val, y_val)
ada_val_over

* Of all of the AdaBoost models after hyperparameter tuning, the model built with the undersampled dataset performed the best
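The AdaBoost cells above pass the weak learner through the `base_estimator` keyword. scikit-learn 1.2 renamed this argument to `estimator` and later releases removed the old name, so the cells may need adjusting depending on the installed version. The sketch below is one possible way to pick the keyword by inspecting the constructor signature. The helper `weak_learner_kwarg` and the two stand-in classes are hypothetical and exist only so the example runs without scikit-learn installed.

```python
import inspect


def weak_learner_kwarg(ensemble_cls):
    """Return the constructor keyword that ensemble_cls uses for its weak learner."""
    params = inspect.signature(ensemble_cls.__init__).parameters
    # Prefer the newer name when both are accepted
    return "estimator" if "estimator" in params else "base_estimator"


# Hypothetical stand-ins that mimic the old and new constructor signatures
class OldStyleEnsemble:
    def __init__(self, base_estimator=None, n_estimators=50):
        self.base_estimator = base_estimator
        self.n_estimators = n_estimators


class NewStyleEnsemble:
    def __init__(self, estimator=None, n_estimators=50):
        self.estimator = estimator
        self.n_estimators = n_estimators


print(weak_learner_kwarg(OldStyleEnsemble))
print(weak_learner_kwarg(NewStyleEnsemble))
```

In the notebook, the returned name could be used both as the key in `param_grid` and when rebuilding the tuned model, for example `AdaBoostClassifier(**{weak_learner_kwarg(AdaBoostClassifier): DecisionTreeClassifier(max_depth=3, random_state=1)})`.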

### Gradient Boosting Classifier

#### Gradient Boost on Undersampled Data

In [None]:
# defining model
Model = GradientBoostingClassifier(random_state=1)

# Parameter grid to pass in RandomizedSearchCV
param_grid = {
    "init": [AdaBoostClassifier(random_state=1),DecisionTreeClassifier(random_state=1)],
    "n_estimators": np.arange(50,110,25),
    "learning_rate": [0.01,0.1,0.05],
    "subsample":[0.7,0.9],
    "max_features":[0.5,0.7,1],
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)

print("Best parameters are {} with CV score={}".format(randomized_cv.best_params_, randomized_cv.best_score_))

In [None]:
tuned_gbc_un = GradientBoostingClassifier(
  subsample = 0.7,
  n_estimators = 100,
  max_features = 0.7,
  learning_rate = 0.05,
  init = DecisionTreeClassifier(random_state=1),
  random_state = 1,  # same seed as the model passed to RandomizedSearchCV
)
tuned_gbc_un.fit(X_train_un, y_train_un)

In [None]:
# Checking model's performance on training set
gbc_train_un = model_performance_classification_sklearn(tuned_gbc_un, X_train_un, y_train_un)
gbc_train_un

In [None]:
# Checking model's performance on validation set
gbc_val_un = model_performance_classification_sklearn(tuned_gbc_un, X_val, y_val)
gbc_val_un

### XGBoost Classifier

#### XGBoost Classifier on Original Data

In [None]:
# defining model
Model = XGBClassifier(random_state=1)

# Parameter grid to pass in RandomizedSearchCV
param_grid={'n_estimators':np.arange(50,110,25),
            'scale_pos_weight':[1,2,5],
            'learning_rate':[0.01,0.1,0.05],
            'gamma':[1,3],
            'subsample':[0.7,0.9]
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)

print("Best parameters are {} with CV score={}".format(randomized_cv.best_params_, randomized_cv.best_score_))

In [None]:
tuned_xgb = XGBClassifier(
    subsample = 0.9,
    scale_pos_weight = 5,
    n_estimators = 100,
    learning_rate = 0.1,
    gamma = 3
)
tuned_xgb.fit(X_train, y_train)

In [None]:
# Checking model's performance on training set
xgb_train = model_performance_classification_sklearn(tuned_xgb, X_train, y_train)
xgb_train

In [None]:
# Checking model's performance on validation set
xgb_val = model_performance_classification_sklearn(tuned_xgb, X_val, y_val)
xgb_val

#### XGBoost Classifier on Undersampled Data

In [None]:
# defining model
Model = XGBClassifier(random_state=1)

# Parameter grid to pass in RandomizedSearchCV
param_grid={'n_estimators':np.arange(50,110,25),
            'scale_pos_weight':[1,2,5],
            'learning_rate':[0.01,0.1,0.05],
            'gamma':[1,3],
            'subsample':[0.7,0.9]
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)

print("Best parameters are {} with CV score={}".format(randomized_cv.best_params_, randomized_cv.best_score_))

In [None]:
tuned_xgb_un = XGBClassifier(
    subsample = 0.9,
    scale_pos_weight = 5,
    n_estimators = 100,
    learning_rate = 0.1,
    gamma = 3
)
tuned_xgb_un.fit(X_train_un, y_train_un)

In [None]:
# Checking model's performance on training set
xgb_train_un = model_performance_classification_sklearn(tuned_xgb_un, X_train_un, y_train_un)
xgb_train_un

In [None]:
# Checking model's performance on validation set
xgb_val_un = model_performance_classification_sklearn(tuned_xgb_un, X_val, y_val)
xgb_val_un

## Model Comparison and Final Model Selection

In [None]:
# training performance comparison

models_train_comp_df = pd.concat(
    [
        ada_train.T,
        ada_train_un.T,
        ada_train_over.T,
        gbc_train_un.T,
        xgb_train.T,
        xgb_train_un.T
    ],
    axis=1,
)
models_train_comp_df.columns = [
    'AdaBoost trained with Original data',
    'AdaBoost trained with Undersampled data',
    'AdaBoost trained with Oversampled data',
    'Gradient boosting trained with Undersampled data',
    'XGBoost trained with Original data',
    'XGBoost trained with Undersampled data'
]
print("Training performance comparison:")
models_train_comp_df

In [None]:
# Validation performance comparison

# Use a separate name so the training comparison table is not overwritten
models_val_comp_df = pd.concat(
    [
        ada_val.T,
        ada_val_un.T,
        ada_val_over.T,
        gbc_val_un.T,
        xgb_val.T,
        xgb_val_un.T
    ],
    axis=1,
)
models_val_comp_df.columns = [
    'AdaBoost trained with Original data',
    'AdaBoost trained with Undersampled data',
    'AdaBoost trained with Oversampled data',
    'Gradient boosting trained with Undersampled data',
    'XGBoost trained with Original data',
    'XGBoost trained with Undersampled data'
]
print("Validation performance comparison:")
models_val_comp_df

* The validation recall is highest for XGBoost trained with undersampled data
  * This model will be evaluated on the test set
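Reading the winner off the comparison table by eye can also be done programmatically. The sketch below assumes the comparison table has one row per metric, including a row labelled `Recall`, and one column per model. That layout depends on the `model_performance_classification_sklearn` helper, which is defined outside this section. The table `demo_comp_df` and every number in it are placeholders for illustration, not results from this notebook.

```python
import pandas as pd

# Placeholder metrics for two hypothetical models; NOT results from this notebook
demo_comp_df = pd.DataFrame(
    {
        "Model A": [0.95, 0.80, 0.85, 0.82],
        "Model B": [0.93, 0.90, 0.75, 0.82],
    },
    index=["Accuracy", "Recall", "Precision", "F1"],
)

# Select the column (model) with the highest value in the Recall row
best_by_recall = demo_comp_df.loc["Recall"].idxmax()
print("Highest validation recall:", best_by_recall)
```

Choosing by recall alone matches the selection rule stated above. Comparing the same row in the training table would additionally show how much each model overfits.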

### Test set final performance

In [None]:
xgb_test = model_performance_classification_sklearn(tuned_xgb_un, X_test, y_test)
xgb_test

* The XGBoost Classifier trained with Undersampled data has a Recall score of 98.7% on the unseen test data
  * This performance aligns with the metrics achieved with this model on the training and validation sets
  * Our model generalizes extremely well
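For reference, recall is the share of actual positives that the model identifies. Assuming attrited customers are encoded as the positive class (the encoding is set earlier in the notebook and is not shown in this section), it can be written as:

$$
\text{Recall} = \frac{TP}{TP + FN}
$$

Here $TP$ is the number of attrited customers the model flags correctly and $FN$ is the number of attrited customers it misses. A recall of 98.7% therefore means that roughly 987 of every 1,000 customers who actually attrited would be flagged. Recall says nothing about false positives, so precision should be read alongside it when estimating the cost of retention offers.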

In [None]:
feature_names = X_train.columns
importances = tuned_xgb_un.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

* As expected from the EDA, we can see that Total Transaction Count is the most important feature for making predictions
  * Features such as Total Transaction Amount, Total Revolving Balance, and Total Relationship Count are a few of the other important features in this model
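Alongside the bar chart, a short ranked list can make the top features easier to quote in a report. The sketch below pairs feature names with importance scores and keeps the `k` largest. The helper `top_features` is hypothetical, and the names and scores in the demo lists are placeholders, not values from the tuned model.

```python
def top_features(names, scores, k=3):
    """Return the k (name, score) pairs with the largest scores, highest first."""
    ranked = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Placeholder names and scores for illustration only
demo_names = ["feature_a", "feature_b", "feature_c", "feature_d"]
demo_scores = [0.10, 0.45, 0.05, 0.40]

for name, score in top_features(demo_names, demo_scores, k=3):
    print(f"{name}: {score:.2f}")
```

In the notebook, the same helper could be called as `top_features(feature_names, importances)`, using the variables defined in the plotting cell above.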

# Business Insights and Conclusions

* When determining whether or not a customer is likely to cancel their credit cards, Thera Bank should primarily look into the customer's credit card usage.
  * Total Transaction Count is the most important feature in our model: all customers in the data who attrited used their credit cards fewer than 100 times in the past 12 months, and over 50% of them had fewer than 50 transactions in that period.
* Similar to Transaction Count, Transaction Amount also has a big impact on credit card attrition. With outliers removed, all customers who closed their credit card accounts spent less than approximately $4,000 in the past 12 months.
* Another feature with a significant impact on attrition is the total number of products held. Almost a third of all customers with only 1 or 2 products from Thera Bank closed their credit card accounts.
* Recommendations for retaining credit card customers:
  * Getting customers to use their credit cards regularly appears to be the most valuable approach to credit card retention. Incentives tied to transaction counts or spend amounts could encourage customers to use their cards more often.
  * Brand loyalty is another important factor in customer retention. Customers who use several other Thera Bank products in addition to their credit cards are more likely to keep their accounts active. Promoting Thera Bank's other offerings, such as checking and savings accounts, IRAs, CDs, mortgages, and loans, can help customers build rapport with Thera Bank.

***