# **Customer Attrition Study**

## Objectives

* Answer business requirement 1:
    * The client wants to better understand the patterns in the customer base in order to learn which variables characterize a prospect least likely to attrite.

## Inputs

* outputs/datasets/collection/BankChurners.csv

## Outputs

* Generate code and seaborn plots that answer business requirement 1 and can be used for the Streamlit App


---

# Change working directory

We need to change the working directory from the current `jupyter_notebooks` folder to its parent folder in order to access the whole project
* We access the current directory with os.getcwd()

In [2]:
import os
current_dir = os.getcwd()
current_dir

'/workspaces/creditcard-churn/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [3]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [4]:
current_dir = os.getcwd()
current_dir

'/workspaces/creditcard-churn'
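
Note that re-running the `os.chdir` cell moves up one more level each time it is executed. A minimal sketch of a guard against that, assuming the notebooks always live in a folder named `jupyter_notebooks` (as in the path printed earlier); the helper name `move_to_project_root` is hypothetical:

```python
import os
from pathlib import Path


def move_to_project_root(notebook_dir_name="jupyter_notebooks"):
    """Move to the parent folder only if we are still inside the notebooks folder."""
    cwd = Path.cwd()
    if cwd.name == notebook_dir_name:
        # Assumption: the project root is the direct parent of the notebooks folder
        os.chdir(cwd.parent)
    return Path.cwd()


print(move_to_project_root())
```

Calling the helper a second time leaves the working directory unchanged, so the cell can be re-run safely.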

---

# Load Data

In [5]:
import pandas as pd
df = (pd.read_csv("outputs/datasets/collection/BankChurners.csv")
        .drop(['CLIENTNUM'], axis=1)
        )
df.head(3)

| Unnamed: 0.1 | Unnamed: 0 | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | ... | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1 | Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | ... | 12691.0 | 777 | 11914.0 | 1.335 | 1144 | 42 | 1.625 | 0.061 | 9.3e-05 | 0.99991 |
| 1 | 1 | 0 | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | ... | 8256.0 | 864 | 7392.0 | 1.541 | 1291 | 33 | 3.714 | 0.105 | 5.7e-05 | 0.99994 |
| 2 | 2 | 0 | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | ... | 3418.0 | 0 | 3418.0 | 2.594 | 1887 | 20 | 2.333 | 0.0 | 2.1e-05 | 0.99998 |


---

# Data Exploration

We want to become familiar with the dataset: check variable types and their distributions, check for missing data, and understand what the variables mean in the business context. For some demographic variables the bank does not hold the customer's information, which may necessitate imputation. Credit limit, revolving balance, and utilization ratio appear to be heavily concentrated near 0, which will likely need to be taken into account when working with these variables in models.

In [7]:
from pandas_profiling import ProfileReport
pandas_report = ProfileReport(df=df, minimal=True)
pandas_report.to_notebook_iframe()

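
The two observations above (unknown demographic values and concentration near 0) can also be checked directly with pandas. The following is a minimal sketch, assuming missing demographic values are stored as the string `"Unknown"` (as the `Income_Category_Unknown` column produced later in this notebook suggests); the helper name and the toy values are invented for illustration:

```python
import pandas as pd


def summarize_unknowns_and_skew(df, unknown_label="Unknown"):
    """Return the share of `unknown_label` per non-numeric column
    and the skewness per numeric column."""
    categorical = df.select_dtypes(exclude="number")
    unknown_share = (categorical == unknown_label).mean().sort_values(ascending=False)
    skewness = df.select_dtypes(include="number").skew().sort_values(ascending=False)
    return unknown_share, skewness


# Toy frame with invented values, only to show the shape of the result
toy = pd.DataFrame({
    "Income_Category": ["Unknown", "Less than $40K", "Unknown", "$60K - $80K"],
    "Marital_Status": ["Married", "Single", "Unknown", "Married"],
    "Total_Revolving_Bal": [0, 0, 0, 2000],
})
unknown_share, skewness = summarize_unknowns_and_skew(toy)
print(unknown_share)
print(skewness)
```

On the real data this would be called as `summarize_unknowns_and_skew(df)`; a large positive skewness indicates values piled up near the low end of the range.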

---

# Correlation Study

We use `OneHotEncoder` to transform each categorical variable in the dataset into individual columns of 1s and 0s, so that these variables can be correlated with `Attrition_Flag`.

In [10]:
from feature_engine.encoding import OneHotEncoder
encoder = OneHotEncoder(variables=df.columns[df.dtypes=='object'].to_list(), drop_last=False)
df_ohe = encoder.fit_transform(df)
print(df_ohe.shape)
df_ohe.head(3)

(10127, 41)


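| Unnamed: 0.1 | Unnamed: 0 | Attrition_Flag | Customer_Age | Dependent_count | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | ... | Income_Category_$60K - $80K | Income_Category_Less than $40K | Income_Category_$80K - $120K | Income_Category_$40K - $60K | Income_Category_$120K + | Income_Category_Unknown | Card_Category_Blue | Card_Category_Gold | Card_Category_Silver | Card_Category_Platinum |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 45 | 3 | 39 | 5 | 1 | 3 | 12691.0 | 777 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 1 | 0 | 49 | 5 | 44 | 6 | 1 | 2 | 8256.0 | 864 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 2 | 0 | 51 | 3 | 36 | 4 | 1 | 0 | 3418.0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |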
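
To make the effect of the encoding concrete, here is a minimal sketch on a toy frame with invented values. It uses `pandas.get_dummies` as a stand-in for feature_engine's `OneHotEncoder`; this is a substitution for illustration, not the notebook's method, and the column order it produces may differ from the output above.

```python
import pandas as pd

# Toy frame with invented values
toy = pd.DataFrame({
    "Attrition_Flag": [0, 1, 0],
    "Card_Category": ["Blue", "Gold", "Blue"],
    "Customer_Age": [45, 49, 51],
})

# Encode every non-numeric column into one 0/1 column per category
cat_cols = toy.select_dtypes(exclude="number").columns.tolist()
toy_ohe = pd.get_dummies(toy, columns=cat_cols, dtype=int)
print(toy_ohe)
```

Each category becomes its own column named `<variable>_<category>`, which is what allows a categorical variable to enter a correlation matrix.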


For the correlation study to be meaningful for prospective customers, the variables related to customer usage should be excluded, since they are not known for a prospect. We first run the study on the full encoded dataset to see how these variables rank.

We use `.corr()` with the `spearman` and `pearson` methods and inspect the 10 variables most correlated with `Attrition_Flag`, sorting by the absolute value of the correlation coefficient in descending order. We exclude the first element of the sorted list, as it corresponds to `Attrition_Flag` itself.

In [11]:
corr_spearman = df_ohe.corr(method='spearman')['Attrition_Flag'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_spearman

Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2   -0.636359
Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1    0.636041
Total_Trans_Ct                                                                                                                       -0.376115
Total_Ct_Chng_Q4_Q1                                                                                                                  -0.312059
Total_Revolving_Bal                                                                                                                  -0.240551
Avg_Utilization_Ratio                                                                                                                -0.240385
Total_Trans_Amt                                                                                                                      -0.223782

We do the same for `pearson`

In [12]:
corr_pearson = df_ohe.corr(method='pearson')['Attrition_Flag'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_pearson

Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1    0.999989
Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2   -0.999989
Total_Trans_Ct                                                                                                                       -0.371403
Total_Ct_Chng_Q4_Q1                                                                                                                  -0.290054
Total_Revolving_Bal                                                                                                                  -0.263053
Contacts_Count_12_mon                                                                                                                 0.204491
Avg_Utilization_Ratio                                                                                                                -0.178410
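
The slice `[1:]` used in both cells relies on `Attrition_Flag` being the first element after sorting. A minimal sketch of a variant that removes the target by label instead; the helper name `top_correlations` and the toy values are made up for illustration:

```python
import pandas as pd


def top_correlations(df, target, method="spearman", n=10):
    """Rank the other columns by absolute correlation with `target`."""
    corr = df.corr(method=method)[target].drop(labels=[target])
    return corr.sort_values(key=abs, ascending=False).head(n)


# Toy frame with invented values
toy = pd.DataFrame({
    "Attrition_Flag": [0, 0, 0, 1, 1, 1],
    "Total_Trans_Ct": [60, 55, 70, 30, 42, 25],
    "Customer_Age": [45, 49, 51, 40, 44, 56],
})
print(top_correlations(toy, "Attrition_Flag", method="spearman"))
print(top_correlations(toy, "Attrition_Flag", method="pearson"))
```

On the notebook's data the equivalent call would be `top_correlations(df_ohe, "Attrition_Flag", method="spearman")`.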

Note that most of the variables correlated with attrition are either variables related to the customer's usage of their account or redundant statistical variables included by the dataset creator (the two `Naive_Bayes_Classifier_*` columns). These variables should be dropped from the dataset during cleaning, as none of the ML tasks use them. We will attempt the correlation study without these columns.

In [None]:
# drop() returns a new DataFrame and needs columns= to target column labels,
# so we pass the labels as columns and reassign the result
df_ohe = df_ohe.drop(columns=[
    'Contacts_Count_12_mon',
    'Credit_Limit',
    'Total_Revolving_Bal',
    'Avg_Open_To_Buy',
    'Total_Amt_Chng_Q4_Q1',
    'Total_Trans_Amt',
    'Total_Trans_Ct',
    'Total_Ct_Chng_Q4_Q1',
    'Avg_Utilization_Ratio',
    'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1',
    'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2'])
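
`DataFrame.drop` returns a new frame rather than modifying the original in place, and it needs `columns=` (or `axis=1`) to treat the labels as column names. A minimal sketch on a toy frame with invented values, re-running the correlation on the reduced data; `usage_cols` stands in for the longer list of columns named in the cell above:

```python
import pandas as pd

# Toy frame with invented values
toy = pd.DataFrame({
    "Attrition_Flag": [0, 0, 0, 1, 1, 1],
    "Total_Trans_Ct": [60, 55, 70, 30, 42, 25],
    "Customer_Age": [45, 49, 51, 40, 44, 56],
    "Dependent_count": [3, 5, 3, 1, 2, 0],
})
usage_cols = ["Total_Trans_Ct"]  # hypothetical short list for the toy example

# The original frame is left untouched; the result must be assigned
reduced = toy.drop(columns=usage_cols)

corr_reduced = (
    reduced.corr(method="spearman")["Attrition_Flag"]
    .drop(labels=["Attrition_Flag"])
    .sort_values(key=abs, ascending=False)
)
print(corr_reduced)
```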

---


# Push files to Repo

* In case you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  # create here your folder
  # os.makedirs(name='')
  pass  # a try block needs at least one statement to be valid Python
except Exception as e:
  print(e)
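
If files do need to be saved, a minimal sketch of a helper is shown below. The helper name `save_dataset` is hypothetical, and the folder and file name are left to the caller because the template does not specify them. Writing with `index=False` avoids storing the row index, which otherwise reappears as an `Unnamed: 0` column when the file is read back.

```python
import os

import pandas as pd


def save_dataset(df, folder, filename):
    """Create `folder` if needed and write `df` there as CSV without the row index."""
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    path = os.path.join(folder, filename)
    df.to_csv(path, index=False)
    return path
```

With `exist_ok=True` the call can be repeated without the `try`/`except` wrapper used in the template cell.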
