# Project Description

Beta Bank customers are leaving, one by one, every month. The bankers discovered that it is cheaper to retain existing customers than to attract new ones. `We need to predict whether a customer will leave the bank soon.` Data on clients' past behavior and contract terminations with the bank will be provided.

## Objective
Create a model with the maximum possible F1 value. To pass this test, the F1 value should be at least 0.59. Additionally, I will measure the AUC-ROC metric and compare it with the F1 value.
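As a reference for how the two metrics differ in practice, here is a minimal sketch on made-up labels and probabilities (not the bank data). F1 is computed from hard class predictions, while AUC-ROC is computed from the predicted scores; the 0.5 threshold is an assumption for illustration.

```python
from sklearn.metrics import f1_score, roc_auc_score

# Toy labels and predicted probabilities (made up for illustration only).
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_proba = [0.1, 0.4, 0.2, 0.8, 0.3, 0.6, 0.9, 0.05]

# F1 needs hard class labels, so threshold the probabilities at 0.5.
y_pred = [int(p >= 0.5) for p in y_proba]

f1 = f1_score(y_true, y_pred)          # depends on the chosen threshold
auc = roc_auc_score(y_true, y_proba)   # threshold-independent ranking quality
print(f"F1: {f1:.3f}  AUC-ROC: {auc:.3f}")
```

Because AUC-ROC does not depend on a threshold, a model can rank customers well and still score a low F1 if the decision threshold is poorly placed.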

## Way to work


## Preparing the data

### Importing libraries

In [95]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report
from imblearn.over_sampling import SMOTE

### Importing Dataset

In [2]:
df = pd.read_csv('D:/Tripleten/datasets/Churn.csv')

In [3]:
df.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0


### Preprocessing the data


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           9091 non-null   float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


In [5]:
df.duplicated().sum()

0

From the output above we can conclude the following:
- column names are OK
- dtypes are assigned correctly
- no duplicates found
- `Tenure has null values.`
- `"Geography" and "Gender" need to be encoded as binary columns.`
- `It will be necessary to apply a standard scaler (pending)`

Also, "RowNumber", "CustomerId" and "Surname" do not seem relevant for model training, so I'll proceed to remove them.

#### Filtering by relevant columns

In [6]:
df.drop(columns=['RowNumber','CustomerId', 'Surname'], inplace=True)
df

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,Female,42,2.0,0.00,1,1,1,101348.88,1
1,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,502,France,Female,42,8.0,159660.80,3,1,0,113931.57,1
3,699,France,Female,39,1.0,0.00,2,0,0,93826.63,0
4,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.10,0
...,...,...,...,...,...,...,...,...,...,...,...
9995,771,France,Male,39,5.0,0.00,2,1,0,96270.64,0
9996,516,France,Male,35,10.0,57369.61,1,1,1,101699.77,0
9997,709,France,Female,36,7.0,0.00,1,0,1,42085.58,1
9998,772,Germany,Male,42,3.0,75075.31,2,1,0,92888.52,1


#### Dropping null values

In [7]:
# Simple code to analyze the % of null values
((df['Tenure'].isna().sum())/df.shape[0] )*100

9.09

Now I will focus on the null values in Tenure. I'm considering two options:

- Drop the affected rows, which represent 9.09% of the dataset (909 null values out of 10000).
- Compute the median of the Tenure column and use it to replace the null values.

I will proceed with the first option, to avoid introducing fictitious data into our dataset.
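For comparison, the two options can be sketched side by side. This is a minimal illustration on a tiny made-up frame (the name `toy` and its values are assumptions, not rows from the dataset).

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the real dataset.
toy = pd.DataFrame({"Tenure": [2.0, np.nan, 8.0, 1.0, np.nan, 5.0]})

# Option 1: drop the rows with a missing Tenure.
dropped = toy.dropna(subset=["Tenure"])

# Option 2: replace missing values with the column median.
median_tenure = toy["Tenure"].median()
imputed = toy.assign(Tenure=toy["Tenure"].fillna(median_tenure))

print(len(dropped), len(imputed), median_tenure)
```

Dropping keeps only observed values at the cost of fewer rows; imputing keeps every row but adds values that were never observed.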

In [8]:
df.isna().sum() # Recheck that all null values come from the Tenure column
df.dropna(inplace=True)
df.isna().sum()

CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64

Checking if the changes are correct

In [9]:
df.shape[0] # 9091 values
df.shape[0] + 909

10000

Converting `Tenure` to integer type.

In [10]:
df['Tenure']= df['Tenure'].astype(int)

#### One hot encoding

In [11]:
data_ohe = pd.get_dummies(data=df, dummy_na=False )
data_ohe

Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Geography_France,Geography_Germany,Geography_Spain,Gender_Female,Gender_Male
0,619,42,2,0.00,1,1,1,101348.88,1,True,False,False,True,False
1,608,41,1,83807.86,1,0,1,112542.58,0,False,False,True,True,False
2,502,42,8,159660.80,3,1,0,113931.57,1,True,False,False,True,False
3,699,39,1,0.00,2,0,0,93826.63,0,True,False,False,True,False
4,850,43,2,125510.82,1,1,1,79084.10,0,False,False,True,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9994,800,29,2,0.00,2,0,0,167773.55,0,True,False,False,True,False
9995,771,39,5,0.00,2,1,0,96270.64,0,True,False,False,False,True
9996,516,35,10,57369.61,1,1,1,101699.77,0,True,False,False,False,True
9997,709,36,7,0.00,1,0,1,42085.58,1,True,False,False,True,False
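The encoding above keeps one column per category, so `Gender_Female` and `Gender_Male` are exact complements of each other. For linear models such as logistic regression, that redundancy can be removed with the `drop_first` option of `pd.get_dummies`. A minimal sketch on a made-up frame (the name `toy` and its rows are assumptions):

```python
import pandas as pd

# Made-up rows with the same categorical columns as the dataset.
toy = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Female", "Male", "Male", "Female"],
    "Age": [42, 41, 39, 35],
})

# drop_first=True drops the first level of each categorical column,
# which stays implied when all remaining dummies of that column are False.
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```

Tree-based models are not affected by the redundant columns, so this is optional for the random forest.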


## Training the models


Setting X (features) and y (target) and splitting the data into training and test sets

In [12]:
X = data_ohe.drop(columns='Exited')
y = data_ohe['Exited']

X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=1000)
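The standard scaling flagged as pending during preprocessing would fit naturally right after this split. A minimal sketch, assuming made-up numeric features rather than the bank data, and fitting the scaler on the training rows only so that no test-set statistics leak into training:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Made-up numeric features on very different scales (e.g. age vs. balance).
X_toy = np.column_stack([rng.integers(18, 90, 200), rng.uniform(0, 250_000, 200)])
y_toy = rng.integers(0, 2, 200)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=1000)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn mean/std from the training rows only
X_te_scaled = scaler.transform(X_te)      # reuse the same statistics on the test rows

print(X_tr_scaled.mean(axis=0).round(3), X_tr_scaled.std(axis=0).round(3))
```

Scaling matters mostly for SVM and logistic regression; random forests are insensitive to feature scale.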

In [87]:
params = {
'n_estimators': np.arange(1,101,10),
'max_depth': np.arange(1,6,1) ,
}

scoring = {'accuracy': 'accuracy', 
           'recall': 'recall',
           'f1': 'f1'}


# scoring = {'accuracy': 'accuracy', 
#            'precision': 'precision', 
#            'recall': 'recall', 
#            'f1': 'f1'}

rfc_gr = GridSearchCV(RandomForestClassifier(random_state=1000), param_grid=params, cv=3, verbose=2, scoring=scoring, refit='f1')
rfc_gr.fit(X_train,y_train)

Fitting 3 folds for each of 50 candidates, totalling 150 fits
[CV] END ........................max_depth=1, n_estimators=1; total time=   0.0s
[CV] END ........................max_depth=1, n_estimators=1; total time=   0.0s
[CV] END ........................max_depth=1, n_estimators=1; total time=   0.0s
[CV] END .......................max_depth=1, n_estimators=11; total time=   0.0s
[CV] END .......................max_depth=1, n_estimators=11; total time=   0.0s
[CV] END .......................max_depth=1, n_estimators=11; total time=   0.0s
[CV] END .......................max_depth=1, n_estimators=21; total time=   0.0s
[CV] END .......................max_depth=1, n_estimators=21; total time=   0.0s
[CV] END .......................max_depth=1, n_estimators=21; total time=   0.0s
[CV] END .......................max_depth=1, n_estimators=31; total time=   0.0s
[CV] END .......................max_depth=1, n_estimators=31; total time=   0.0s
[CV] END .......................max_depth=1, n_

It was necessary to remove the `precision` metric from the scoring of the previous grid search because it raised a warning explaining that the model could not predict the positive class due to the imbalance. In other words, it is necessary `to assign weight to class 1 through the 'class_weight' parameter`, or `apply class rebalancing techniques like oversampling or undersampling`.
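The `class_weight` option mentioned above could look like the following. This is only a sketch on synthetic imbalanced data from `make_classification`; the sample sizes and hyperparameters are assumptions, and the scores it prints say nothing about the bank dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 80% class 0), not the bank dataset.
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, stratify=y_toy, random_state=0)

scores = {}
for label, cw in (("unweighted", None), ("balanced", "balanced")):
    # class_weight="balanced" weights each class inversely to its frequency.
    model = RandomForestClassifier(
        n_estimators=50, max_depth=5, class_weight=cw, random_state=0
    )
    model.fit(X_tr, y_tr)
    scores[label] = f1_score(y_te, model.predict(X_te))

print(scores)
```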

Before starting with the rebalancing, I will check the best scores.

In [93]:
# print(f'Best parameters {rfc_gr.best_params_}')
# print(f'F1 Score: {rfc_gr.best_score_}')

no_balance_scores_df = pd.DataFrame(rfc_gr.cv_results_)
best_row = rfc_gr.best_index_
no_balance_results = no_balance_scores_df.iloc[best_row,:]
cols = ['mean_test_f1', 'mean_test_accuracy', 'mean_test_precision', 'mean_test_recall', 'params' ]
print(f'Best results of the test')
no_balance_results[cols].reset_index()


Best results of the test


Unnamed: 0,index,40
0,mean_test_f1,0.425115
1,mean_test_accuracy,0.836171
2,mean_test_precision,0.742309
3,mean_test_recall,0.299756
4,params,"{'max_depth': 5, 'n_estimators': 1}"


In [92]:
y_pred = rfc_gr.best_estimator_.predict(X_test)
report = classification_report(y_test, y_pred, output_dict=True)
df_results = pd.DataFrame(report).transpose()
print(df_results ,end='\n\n')
print('Confusion Matrix')
print(confusion_matrix(y_test,y_pred))

              precision    recall  f1-score      support
0              0.852799  0.977864  0.911060  1807.000000
1              0.800995  0.345494  0.482759   466.000000
accuracy       0.848218  0.848218  0.848218     0.848218
macro avg      0.826897  0.661679  0.696909  2273.000000
weighted avg   0.842179  0.848218  0.823251  2273.000000

Confusion Matrix
[[1767   40]
 [ 305  161]]


The main objective of our model is to predict whether a customer will leave the bank soon (cancelling the account), so we need to focus on the results for class `1`. The current model has high precision and recall for class `0` (customers who will not leave the bank); however, recall for class `1` is just 34.5%. This means the following:

Of the 466 customers in the test set who actually left, the model identified only 161 (34.5% recall), while 80.1% of the customers it predicted as `1` really did leave (precision). This gives a harmonic mean (F1 score) of 48.3%.
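These figures can be reproduced by hand from the confusion matrix printed above, where rows are the actual classes and columns are the predicted ones:

```python
# Values read from the confusion matrix above (rows: actual, columns: predicted).
tn, fp, fn, tp = 1767, 40, 305, 161

precision = tp / (tp + fp)  # share of predicted churners who really left
recall = tp / (tp + fn)     # share of actual churners the model caught
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```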

This is not enough for our goal, so I'll focus on rebalancing.

### Rebalancing

Although `class_weight` is a good option for rebalancing, I also want to explore SVM algorithms and compare all models on the same rebalanced training data, so I will use oversampling instead.

We will start by leaving `sampling_strategy` at its default ('auto'), which oversamples the minority class until both classes have the same number of samples (50% each).

In [114]:
# X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=1000)

smote = SMOTE(random_state=42, sampling_strategy='auto')
X_resampled, y_resampled = smote.fit_resample(X_train,y_train)
print(y_resampled.value_counts())
# print(y_train.value_counts())
# y_resampled.shape

Exited
0    5430
1    5430
Name: count, dtype: int64


Now I will train three classification models to find a better result, wrapping each of them in GridSearchCV. The models will be:

- Random Forest Classifier. 
- Support Vector Machine.
- Logistic Regression.
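One possible way to organize this comparison is a loop over estimators and parameter grids. The sketch below is not the search used in this notebook: the data is synthetic, and the `candidates` dictionary and its deliberately small grids are hypothetical placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the resampled training data.
X_toy, y_toy = make_classification(n_samples=600, n_features=8, random_state=0)

# Hypothetical, deliberately small grids; a real search would need wider ranges.
candidates = {
    "random_forest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [20, 50], "max_depth": [3, 5]},
    ),
    "svm": (SVC(random_state=0), {"C": [0.1, 1.0], "kernel": ["rbf", "linear"]}),
    "log_reg": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
}

results = {}
for name, (estimator, grid) in candidates.items():
    # Each model gets its own grid search, all scored by F1 with 3-fold CV.
    search = GridSearchCV(estimator, param_grid=grid, scoring="f1", cv=3)
    search.fit(X_toy, y_toy)
    results[name] = (search.best_score_, search.best_params_)

for name, (score, best) in results.items():
    print(f"{name}: F1={score:.3f} with {best}")
```

On the real data, SVM and logistic regression would likely need scaled features for a fair comparison, since both are sensitive to feature magnitude.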


In [None]:
grid