<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1">Introduction</a></span><ul class="toc-item"><li><span><a href="#Data" data-toc-modified-id="Data-1.1">Data</a></span></li></ul></li><li><span><a href="#Exploratory-Data-Analysis" data-toc-modified-id="Exploratory-Data-Analysis-2">Exploratory Data Analysis</a></span></li></ul></div>

# Introduction
This dataset is taken from [Kaggle](https://www.kaggle.com/sakshigoyal7/credit-card-customers). It covers credit card customers of a particular company, whose senior management want to predict which customers will churn before they leave, and ideally target those customers to ensure they stay.

**The objective of the project is to fit a model that can best predict which customers will churn using the input features.**

The project begins with an exploratory data analysis (EDA), in which the data is displayed and features are extracted or manipulated into a format that can be fed into multiple machine learning models used to predict which customers will leave.

## Data
The data has been sourced from [Kaggle](https://www.kaggle.com), but was originally posted on [LEAPS Analyttica](https://leaps.analyttica.com/). It contains roughly 10,000 records of credit card customers, with input variables such as `Customer_Age`, `Marital_Status`, `Gender`, `Education_Level`, `Income_Category`, etc.

In total, there are 22 input variables and one response variable, `Attrition_Flag`, which takes a value of **'Existing Customer'** or **'Attrited Customer'**.

As published on Kaggle, the data includes two redundant columns that should be dropped, so we will do this straight away before we conduct the EDA.


# Exploratory Data Analysis

In [2]:
import pandas as pd

In [7]:
repo_url = 'https://raw.githubusercontent.com/philip-papasavvas/ml_sandbox'
df = pd.read_csv(f"{repo_url}/main/data/BankChurners.csv")

In [11]:
# Drop the two redundant Naive Bayes classifier output columns
columns_to_drop = [
    'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1',
    'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2'
]

df = df.drop(columns=columns_to_drop)
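
Listing the two long column names by hand is easy to get wrong. As a minimal sketch, the same columns could instead be selected by their shared `Naive_Bayes_Classifier` prefix. The `toy` frame below is a hypothetical stand-in for `df`, and its shortened `..._example_1` and `..._example_2` column names are illustrative rather than the real ones, so the snippet runs without downloading the CSV.

```python
import pandas as pd

# Hypothetical stand-in for the real DataFrame; only the column-name
# prefix matters here, so the redundant names are shortened.
toy = pd.DataFrame({
    "CLIENTNUM": [768805383, 818770008],
    "Attrition_Flag": ["Existing Customer", "Existing Customer"],
    "Naive_Bayes_Classifier_example_1": [0.1, 0.2],
    "Naive_Bayes_Classifier_example_2": [0.9, 0.8],
})

# Collect every column whose name starts with the shared prefix.
prefix = "Naive_Bayes_Classifier"
redundant = [col for col in toy.columns if col.startswith(prefix)]

# Drop them by keyword; the remaining columns keep their original order.
cleaned = toy.drop(columns=redundant)

print(cleaned.columns.tolist())
```

Applied to the real data, the same list comprehension would run over `df.columns`; it is worth printing `redundant` before dropping to confirm that only the intended columns matched.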

Let's inspect the dataset after dropping the unwanted columns.

In [19]:
pd.set_option('display.max_columns', 25)
df.head(2)

| | CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 768805383 | Existing Customer | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | 5 | 1 | 3 | 12691.0 | 777 | 11914.0 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | 818770008 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | 6 | 1 | 2 | 8256.0 | 864 | 7392.0 | 1.541 | 1291 | 33 | 3.714 | 0.105 |


There are lots of different input features, so let's get some intuition on what these features represent:
- `CLIENTNUM`: client number, the unique identifier for each account holder
- `Customer_Age`: age (in years) of account holder
- `Gender`: male M, female F
- `Dependent_count`: number of dependents of the account holder
- `Education_Level`: highest qualification of the account holder (e.g. 'High School', 'Graduate', 'Uneducated')
- `Marital_Status`: status of account holder (married, single, divorced or unknown)
- `Income_Category`: annual income of account holder ('$60K - $80K', 'Less than $40K', '$80K - $120K', '$40K - $60K','$120K +', 'Unknown')
- `Card_Category`: type of card the account holder has ('Blue', 'Gold', 'Silver', 'Platinum')
- `Months_on_book`: period (in months) the account holder has been with the bank
- `Total_Relationship_Count`: total number of products held by the account holder
- `Months_Inactive_12_mon`: number of months the account holder has been inactive in the last 12
- `Contacts_Count_12_mon`: number of contacts in the last 12 months
- `Credit_Limit`: monthly credit limit
- `Total_Revolving_Bal`: total revolving balance on the credit card
- `Avg_Open_To_Buy`: the average amount of credit available (open to buy) over the last 12 months
- `Total_Amt_Chng_Q4_Q1`: change in transaction amount, Q4 divided by Q1
- `Total_Trans_Amt`: total transaction amount over the last 12 months
- `Total_Trans_Ct`: total transaction count over the last 12 months
- `Total_Ct_Chng_Q4_Q1`: change in transaction count, Q4 divided by Q1
- `Avg_Utilization_Ratio`: average card utilisation ratio (amount of revolving credit being used divided by total amount of revolving credit available)
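
Some of these features appear to be derived from others. The sketch below checks two assumed relationships against the two rows shown by `df.head(2)`: that `Avg_Open_To_Buy` equals `Credit_Limit` minus `Total_Revolving_Bal`, and that `Avg_Utilization_Ratio` equals `Total_Revolving_Bal` divided by `Credit_Limit`. Both formulas are assumptions inferred from the feature descriptions, not documented definitions, and the `sample` frame is just those two rows retyped by hand.

```python
import pandas as pd

# The two rows displayed by df.head(2), restricted to the relevant columns.
sample = pd.DataFrame({
    "Credit_Limit": [12691.0, 8256.0],
    "Total_Revolving_Bal": [777, 864],
    "Avg_Open_To_Buy": [11914.0, 7392.0],
    "Avg_Utilization_Ratio": [0.061, 0.105],
})

# Assumed relationship 1: open to buy = credit limit - revolving balance.
open_to_buy = sample["Credit_Limit"] - sample["Total_Revolving_Bal"]
open_to_buy_matches = (open_to_buy - sample["Avg_Open_To_Buy"]).abs().max() < 1e-9

# Assumed relationship 2: utilisation = revolving balance / credit limit,
# rounded to the three decimals shown in the data.
utilisation = (sample["Total_Revolving_Bal"] / sample["Credit_Limit"]).round(3)
utilisation_matches = (utilisation - sample["Avg_Utilization_Ratio"]).abs().max() < 1e-9

print(open_to_buy_matches, utilisation_matches)
```

Both checks hold for these two rows, which suggests the three columns are strongly related. Whether the relationships hold across the full dataset would need to be confirmed on `df` itself before deciding to drop any of them.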

The target variable is:
- `Attrition_Flag`: whether the account holder has churned or not, with churn label as 'Attrited Customer', non-churn as 'Existing Customer'
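
Most machine learning models expect a numeric target, so the two labels will need encoding at some point. A minimal sketch of one way to do this is shown below; the `attrition` series is a hypothetical sample rather than the real column, and both the output name `churned` and the convention of 1 for churn are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample of the target column; the real values come from df.
attrition = pd.Series(
    ["Existing Customer", "Attrited Customer", "Existing Customer"],
    name="Attrition_Flag",
)

# Assumed convention: 1 = churned, 0 = still a customer.
label_map = {"Existing Customer": 0, "Attrited Customer": 1}
churned = attrition.map(label_map)

# map() yields NaN for any label missing from label_map, so fail early.
if churned.isna().any():
    raise ValueError("Unexpected label in Attrition_Flag")

churned = churned.astype(int).rename("churned")
print(churned.tolist())
```

Using an explicit mapping rather than an automatic label encoder keeps the meaning of 1 and 0 visible in the code, which helps when interpreting metrics such as precision and recall for the churn class.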
