# Project Description

Beta Bank customers are leaving, one by one, every month. The bankers found that it is cheaper to retain existing customers than to attract new ones. `We need to predict whether a customer will leave the bank soon.` Data on clients' past behavior and contract terminations with the bank is provided.

## Objective
Create a model with the highest possible F1 score; to pass, the score must be at least 0.59. Additionally, I will measure the AUC-ROC metric and compare it with the F1 score.

## Way to work
Throughout this project I focused on training the models and obtaining the best possible metrics. The notebook follows this structure:

- Preparing the data (importing libraries and datasets).
- Preprocessing the data (checking for duplicates, errors, and null values, and keeping only relevant information).
- Training a model (analyzing the parameters to improve and the processing techniques needed).
- Rebalancing the datasets (oversampling).
- Training 3 classification models (Random Forest, SVM, and Logistic Regression).
- Conclusions




## Preparing the data

### Importing libraries

These are the libraries that will be used throughout the project.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc
from sklearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Importing Dataset

In [None]:
df = pd.read_csv('D:/Tripleten/datasets/Churn.csv')

### Preprocessing the data

To provide a complete analysis, I first need to make sure the imported dataset is ready to be used by the model.

In [None]:
## Quick view of the dataset
df.head()

In [None]:
## Analyzing dtypes and null values
df.info()

In [None]:
## Checking duplicated rows
df.duplicated().sum()

From the output above I identified the following:
- Column names are OK.
- dtypes are assigned correctly.
- No duplicates were found.
- `Tenure has null values.`
- `"Geography" and "Gender" need to be encoded as binary columns.`
- `A standard scaler will need to be applied.`
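
The last finding refers to standardization. As a minimal, dependency-free sketch of what a standard scaler computes (each value minus the column mean, divided by the population standard deviation), assuming a made-up list of balances that is not taken from Churn.csv and a hypothetical helper named `standardize`:

```python
from statistics import fmean, pstdev

def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)
    return [(x - mu) / sigma for x in values]

# Hypothetical balances, for illustration only (not taken from Churn.csv)
balances = [0.0, 50_000.0, 100_000.0, 150_000.0]
scaled = standardize(balances)

print([round(z, 3) for z in scaled])
# After scaling, the column has mean 0 and standard deviation 1
print(round(fmean(scaled), 6), round(pstdev(scaled), 6))
```

Putting columns such as 'Balance' and 'Age' on the same scale matters mostly for distance-based and linear models like SVM and Logistic Regression.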

Also, in my view, "RowNumber", "CustomerId", and "Surname" are not relevant for model training, so I will remove them and make the other necessary changes.

#### Filtering by relevant columns

In [None]:
df.drop(columns=['RowNumber','CustomerId', 'Surname'], inplace=True)
df

#### Dropping null values

In [None]:
# Simple code to analyze the % of null values
((df['Tenure'].isna().sum())/df.shape[0] )*100

Now I will focus on the null values in Tenure. I'm considering two options:

- Drop the affected rows, which represent 9.09% of the dataset (909 null values out of 10,000 rows).
- Replace the null values with the median of the Tenure column.
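
For reference, the second option (median imputation) could look roughly like the sketch below. It uses plain Python with made-up Tenure values and a hypothetical helper named `fill_with_median`; in pandas the equivalent would be `df['Tenure'].fillna(df['Tenure'].median())`.

```python
from statistics import median

def fill_with_median(values):
    """Replace None entries with the median of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

# Hypothetical Tenure values; None stands in for a null
tenure = [2, None, 8, 5, None, 1, 7]
print(fill_with_median(tenure))
```

The trade-off is that imputed rows keep their other columns available for training, at the cost of introducing values that were never observed.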

I will go with the first option to avoid introducing fictitious data into the dataset. However, the main objective is to predict which users will leave the bank (value 1 in the 'Exited' column), so before removing the rows I will analyze the impact of deleting them.

In [None]:
# Analyzing the shape of the df
print(f'Total Rows: {df.shape[0]}')

# Analyzing null rows for the Tenure column
print(f"Total Null Rows: {df['Tenure'].isnull().sum()}", end='\n\n')

# Distribution of the exited values.
exited_null_rows = df['Exited'].value_counts()
print(f'Exited values distribution: \n{exited_null_rows}', end='\n\n')

# Distribution of the exited results for tenure null values.
tenure_exited_null_rows = df[df['Tenure'].isnull()]['Exited'].value_counts()
print(f'Exited values distribution for null rows: \n{tenure_exited_null_rows}', end='\n\n')

print(f'The percentage of class 1 rows to be eliminated will be {((tenure_exited_null_rows[1]*100)/exited_null_rows[1]):.2f}%')


There are 10,000 rows in total, of which 909 have a null Tenure. Of those 909 rows, 183 belong to customers who left the bank (Exited = 1). Given the percentage of rows that would be removed, we can proceed to delete them (it does not represent a large loss).

In [None]:
df.isna().sum() # To revalidate all null values came from Tenure column
df.dropna(inplace=True)
df.isna().sum()

Checking if the changes are correct

In [None]:
df.shape[0] # 9091 values
df.shape[0] + 909

Transforming into 'integer' type.

In [None]:
df['Tenure']= df['Tenure'].astype(int)

Before removing any more data, I will look at summary statistics to see whether anything looks off.

In [None]:
df.describe()

At first sight the data looks right: means and medians are similar for all columns, while the standard deviation shows high variability for the 'Balance' and 'EstimatedSalary' columns.

A box plot can also help to identify atypical values.

In [None]:
num_columns = df.select_dtypes(include=['int', 'float']).columns

# Creating a 3x3 figure with one box plot per numeric column.
fig, axs = plt.subplots(3, 3, figsize=(15, 10))

for i, column in enumerate(num_columns):
    row = i // 3
    col = i % 3
    axs[row, col].boxplot(df[column])
    axs[row, col].set_title(column)

plt.show()


From the previous charts we can infer the following:

- Credit score: there are atypical values (outliers below the lower whisker), which could have several explanations, such as new clients with little credit history.
- Age: most clients are concentrated between 20 and 60 years old, although some are older.
- Tenure: looks stable, with a median of 5 and a maximum of 10.
- Balance: values are highly concentrated at the lower end, with a few clients holding high balances.
- Num of products: the median is 1, although some clients have up to 4 products.
- Has credit card and Is Active Member: both have a median of 1, and all their values are 0 or 1.
- Estimated Salary: shows a regular distribution with no atypical values.
- Exited: most clients remain with the bank, so the users who left appear as atypical values.


In [None]:
# Reusing the numeric columns to plot a histogram of each one
num_columns = df.select_dtypes(include=['int', 'float']).columns

# Creating a 3x3 figure with one histogram per numeric column.
fig, axs = plt.subplots(3, 3, figsize=(15, 10))

for i, column in enumerate(num_columns):
    row = i // 3
    col = i % 3
    axs[row, col].hist(df[column])
    axs[row, col].set_title(f'{column} distribution')

plt.show()


These histograms reinforce the observations above. Let's also analyze the categorical values.

In [None]:
# Creating a figure that plot the object types
fig, (ax1,ax2) = plt.subplots(1, 2, figsize=(10, 5))

ax1.hist(df['Geography'], color=['green'])
ax1.set_title('Geography distribution')
ax2.hist(df['Gender'])
ax2.set_title('Gender distribution')

plt.show()


As we can see, clients from France represent about half of the customer base, while the split between males and females is similar. In the following charts I filter the Geography and Gender analysis to users who left the bank.

In [None]:
quit_users_df = df[df['Exited']==1]

fig, (ax1,ax2) = plt.subplots(1, 2, figsize=(10, 5))

ax1.hist(quit_users_df['Geography'], color='green')
ax1.set_title('Geography (user who left the bank)')
ax2.hist(quit_users_df['Gender'])
ax2.set_title('Gender (user who left the bank)')

plt.show()

The results are interesting: Germany and France are the countries where most users left the bank. Now that we know more about the columns, let's analyze their correlation. One-hot encoding will be needed first to turn the categorical columns into binary ones.

#### One hot encoding

The dataset has been preprocessed, so let's apply one-hot encoding (OHE) to the dataframe. Remember that the `get_dummies` method only affects the object columns, turning each one into binary (boolean) columns.

In [None]:
data_ohe = pd.get_dummies(data=df, dummy_na=False )
data_ohe
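
To make the transformation concrete, here is a small plain-Python sketch of what one-hot encoding does to a categorical column. The rows and the helper `one_hot` are hypothetical and only for illustration; the notebook itself relies on `pd.get_dummies`, which names the new columns in the same `Column_value` style.

```python
def one_hot(rows, column):
    """Expand one categorical column into 0/1 indicator columns."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Keep every other column untouched
        new_row = {k: v for k, v in row.items() if k != column}
        for cat in categories:
            new_row[f"{column}_{cat}"] = int(row[column] == cat)
        encoded.append(new_row)
    return encoded

# Hypothetical rows for illustration only
rows = [
    {"Age": 42, "Geography": "France"},
    {"Age": 35, "Geography": "Germany"},
    {"Age": 51, "Geography": "Spain"},
]
encoded_rows = one_hot(rows, "Geography")
for r in encoded_rows:
    print(r)
```

Each original row ends up with exactly one indicator set to 1, which is why the encoded columns can be fed to models that only accept numeric input.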

Now we can look at the correlation between the 'Exited' column and the others.

In [None]:
corr_matrix = data_ohe.corr()
corr_with_exited = corr_matrix['Exited'].sort_values(ascending=False)

print(corr_with_exited)

Looking at the correlation between 'Exited' and the other columns, nothing stands out strongly; however, age and geography show a slight positive correlation. The full correlation matrix looks as follows.

In [None]:
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".1f", vmin=-1, vmax=1)

## Training a model

Now that our dataset is prepared, it needs to be split into two datasets, one to train and one to test the model. To do this, I set `'X'` (features) and `'y'` (target) and pass them to `train_test_split`.

In [None]:
X = data_ohe.drop(columns='Exited')
y = data_ohe['Exited']

# Hold out 25% of the rows as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1000)

Next, I use `GridSearchCV`, which runs a cross-validated search over a grid of hyperparameters while training the model.

In [None]:
params = {
'n_estimators': np.arange(1,101,10),
'max_depth': np.arange(1,6,1) ,
}

scoring = {'accuracy': 'accuracy', 
           'recall': 'recall',
           'f1': 'f1'}

rfc_gr = GridSearchCV(RandomForestClassifier(random_state=1000), param_grid=params, cv=3, verbose=2 ,scoring =scoring, refit='f1')
rfc_gr.fit(X_train,y_train)
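
To get a feel for how much work this search does, the sketch below counts the candidate combinations for the same value ranges, written with the built-in `range()` instead of `np.arange`. The variable names are my own and only illustrative.

```python
from itertools import product

# Same value ranges as the grid above, written with built-in range()
n_estimators = list(range(1, 101, 10))   # 1, 11, ..., 91
max_depth = list(range(1, 6))            # 1, 2, 3, 4, 5
cv_folds = 3

# Every (n_estimators, max_depth) pair is one candidate model
candidates = list(product(n_estimators, max_depth))
total_fits = len(candidates) * cv_folds

print(f"{len(candidates)} candidates x {cv_folds} folds = {total_fits} fits")
```

On top of these fits, the search trains the best candidate once more on the whole training set, because `refit` is set.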

I had to remove `precision` from the scoring metrics above because it raised a warning: for some parameter combinations the model never predicted class 1, a consequence of the class imbalance. In other words, it is necessary either `to assign more weight to class 1 through the 'class_weight' parameter` or to `apply class rebalancing techniques such as oversampling or undersampling`.

Before starting with the rebalancing, I will check the best scores.

In [None]:
no_balance_scores_df = pd.DataFrame(rfc_gr.cv_results_)
best_row = rfc_gr.best_index_
no_balance_results = no_balance_scores_df.iloc[best_row,:]
cols = ['mean_test_f1', 'mean_test_accuracy', 'mean_test_recall', 'params' ]

print(f'Best results of the test')
no_balance_results[cols]


The search gives a best cross-validated F1 score (the harmonic mean of precision and recall) of 42.51%, which is not enough for our objective. I will continue with the prediction on the test set to see whether the results change.

In [None]:
y_pred = rfc_gr.best_estimator_.predict(X_test)
report = classification_report(y_test, y_pred, output_dict=True)
df_results = pd.DataFrame(report).transpose()
print(df_results ,end='\n\n')
print('Confusion Matrix')
print(confusion_matrix(y_test,y_pred))

The main objective of the model is to predict whether a customer will leave the bank soon (closing the account), so we need to focus on class `1`. The current model has high precision and recall for class `0` (users who will not leave the bank); however, recall for class `1` is only about 34%, which means the following:

Of the customers in the test set who actually left (class `1`), the model identified only 34.5%, while its precision for that class reaches 80%, giving a harmonic mean (F1 score) of 48.27%.
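
As a quick check of that arithmetic, the F1 score can be recomputed from the rounded precision and recall quoted above. The helper name `f1_score_from` is my own; the result lands close to the reported 48.27%, with the small difference coming from rounding the inputs.

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded class-1 figures quoted in the text above
precision, recall = 0.80, 0.345
f1 = f1_score_from(precision, recall)
print(f"F1 = {f1:.4f}")
```

Because it is a harmonic mean, the F1 score is pulled toward the lower of the two values, which is why a recall of about 34% keeps it below 0.5 despite the high precision.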

This is not enough for our goal, so I will focus on rebalancing.

Build a similar table with the percentage of users who stay and the percentage who do not, based on the predictions.

In [None]:
# Counting the predicted classes on the test set
leaving_users = int((y_pred == 1).sum())
staying_users = int((y_pred == 0).sum())

# Counting the real classes on the test set
r_leaving_users = int((y_test == 1).sum())
r_staying_users = int((y_test == 0).sum())

per_leaving_users = 100 * leaving_users / (leaving_users + staying_users)
per_staying_users = 100 - per_leaving_users

r_per_leaving_users = 100 * r_leaving_users / (r_leaving_users + r_staying_users)
r_per_staying_users = 100 - r_per_leaving_users

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))

ax1.bar(x=[f'Leaving users ({per_leaving_users:.2f}%)', f'Staying users ({per_staying_users:.2f}%)'], height=[leaving_users, staying_users])
ax1.set_title('Predicted user behaviour')

ax2.bar(x=[f'Leaving users ({r_per_leaving_users:.2f}%)', f'Staying users ({r_per_staying_users:.2f}%)'], height=[r_leaving_users, r_staying_users])
ax2.set_title('Actual user behaviour')

plt.show()


In [None]:
According to the prediction of this model

## Rebalancing the dataset

Although `class_weight` is a good option for rebalancing (scikit-learn's `RandomForestClassifier`, `SVC`, and `LogisticRegression` all accept it), I will use oversampling instead, so that the three models are compared on the same rebalanced training data.

I will start by leaving `sampling_strategy` at its default ('auto'), which oversamples the minority class until both classes have the same number of rows (50% each).

In [None]:
smote = SMOTE(random_state=42, sampling_strategy='auto')
X_resampled, y_resampled = smote.fit_resample(X_train,y_train)
print(y_resampled.value_counts())


In [None]:
var1 = y_resampled.value_counts()[0]
var2 = y_resampled.value_counts()[1]

plt.bar(x=['0','1'], height=[var1,var2], )
plt.title('Exited Values')
plt.show()

As expected, the training data is now balanced.
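
For intuition about where the extra class-1 rows come from: SMOTE does not duplicate rows, it creates synthetic ones by interpolating between a minority sample and one of its nearest minority-class neighbours. The sketch below shows only that interpolation step, with two made-up rows and a hypothetical helper `synthetic_point`; the neighbour search and the rest of the algorithm are left out.

```python
import random

def synthetic_point(sample, neighbour, rng):
    """Pick a random point on the segment between two minority samples."""
    gap = rng.random()  # uniform in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbour)]

rng = random.Random(42)

# Two hypothetical minority-class rows: (Age, Balance)
a = [40.0, 100_000.0]
b = [50.0, 120_000.0]

new_row = synthetic_point(a, b, rng)
print(new_row)
```

Every feature of the synthetic row lies between the values of the two real rows, so the new samples stay inside the region already occupied by the minority class.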

## Training new models

Now I will train 3 classification models to look for a better result, using GridSearchCV for each of them. The models are:

- Random Forest Classifier. 
- Support Vector Machine.
- Logistic Regression.


### Defining the pipelines

To keep results consistent, I fix `random_state` at 1000 when creating each model. For SVC specifically, I set `probability=True`; otherwise `predict_proba`, which is needed later for the ROC curve, would not be available.

In [None]:
pipe_rf = Pipeline([('scaler', StandardScaler()), ('rf', RandomForestClassifier(random_state=1000))])
pipe_svc = Pipeline([('scaler', StandardScaler()), ('svc', SVC(random_state=1000, probability=True))])
pipe_lr = Pipeline([('scaler', StandardScaler()), ('lr', LogisticRegression(random_state=1000))])

I put each model's parameters into its own dictionary, using the `step__parameter` naming that GridSearchCV expects for pipelines. I also apply several evaluation metrics to look at the models in more detail.

In [None]:
params = [
    {
    'rf__n_estimators': np.arange(1, 100, 10),
    'rf__max_depth': np.arange(1, 6, 1)},
    {
    'svc__C': [0.1, 1, 10],
    'svc__kernel':['rbf'], 
    'svc__gamma':['scale'],
    # Note: 'degree' only affects the 'poly' kernel, so it is ignored with 'rbf'
    'svc__degree':np.arange(1, 4, 1)},
    {
    'lr__penalty':['elasticnet'],
    'lr__C': [0.1, 1, 10],
    'lr__solver': ['saga'],
    'lr__l1_ratio': [0.5],
    'lr__max_iter': [1000]}
]

scores =['accuracy','precision', 'recall', 'f1']

The following block of code covers my whole process for selecting the best model.

Using GridSearchCV, I iterate over the 3 models with the following workflow:
- Scaling data: a standard scaler is applied in each model's pipeline.
- Training on the resampled dataset: each model is fitted with its own parameter grid, looking for the best F1 score.
- Obtaining the best parameters: the best parameters and scores of each trained model are printed.
- Testing on the held-out dataset: once trained, each model predicts the test dataset.
- Plotting ROC results: to give better insight into the results, the loop plots the ROC curve with the AUROC for every model.
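
The AUROC in the last step is the area under the ROC curve, which scikit-learn's `auc` computes with the trapezoidal rule. A minimal plain-Python sketch of that calculation, using made-up ROC points and a hypothetical helper `trapezoid_auc`, could look like this:

```python
def trapezoid_auc(fpr, tpr):
    """Area under a curve given as points sorted by increasing FPR."""
    area = 0.0
    for i in range(1, len(fpr)):
        # Width of the strip times the average height of its two ends
        area += (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
    return area

# Hypothetical ROC points, for illustration only
fpr = [0.0, 0.1, 0.4, 1.0]
tpr = [0.0, 0.6, 0.9, 1.0]

print(f"AUROC = {trapezoid_auc(fpr, tpr):.3f}")
# The diagonal (random classifier) has an area of 0.5
print(trapezoid_auc([0.0, 1.0], [0.0, 1.0]))
```

An AUROC of 0.5 corresponds to the dashed diagonal in the chart, and values closer to 1 mean the model ranks leaving customers above staying ones more consistently.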


In [None]:
%%time
# If syntax highlighting is missing, it is due to the %%time magic

# Creating a list of the models nested in a pipeline
pipes = [pipe_rf, pipe_svc, pipe_lr]

# Defining the columns we want to analyze once the model has been trained
cols = ['mean_test_f1', 'mean_test_precision' ,'mean_test_accuracy', 'mean_test_recall']

# Iterating over each model together with its own parameter grid
for pipe, grid in zip(pipes, params):
    gs =GridSearchCV(pipe, param_grid=grid, scoring=scores, refit='f1', cv=2)
    gs.fit(X_resampled,y_resampled)

    # The results of each model will be printed; F1 Score, Precision, Accuracy, Recall and best parameters.
    print('_______________________________________________')
    print(f'For the pipeline {pipe[1]}')
    results_df = pd.DataFrame(gs.cv_results_)
    results_df = results_df[cols]
    best_row = gs.best_index_
    results = results_df.iloc[best_row,:]
    print(f'With the parameters: {gs.best_params_}', end='\n\n')
    print('The best mean test results are:')
    print(results, end= '\n\n')


    # This block shows the test-set performance for each class ('0' and '1').
    y_pred = gs.predict(X_test)
    report = classification_report(y_test, y_pred, output_dict=True)
    report_df = pd.DataFrame(report).transpose()
    print('And for the validation test we have the following results.')
    print(report_df ,end='\n\n')
    print('Confusion Matrix')
    print(confusion_matrix(y_test,y_pred) ,end='\n\n')

    # This block computes the ROC curve and its AUC
    y_pred_proba = gs.predict_proba(X_test)[:,1]
    fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
    roc_auc = auc(fpr, tpr)

    # This block adds each model's ROC curve, labelled with its AUROC, to the chart.
    plt.plot(fpr, tpr, label= f'{pipe.steps[-1][0]} (AUROC) = {roc_auc:.4f}')

plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier')

plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR)')
plt.title('ROC Curve')
plt.legend(loc='lower right')
plt.grid(True)
plt.show()

As we can see, the best mean F1 scores for each model during cross-validation were:

| Model | F1 Mean |
|---------------------|---------------------|
| Random Forest | 0.818391 |
| SVC | 0.844246 |
| Logistic Regression | 0.798570 |

At first sight SVC seems to be the best model when comparing the mean F1 scores. Note that these scores were cross-validated on the SMOTE-resampled data, so they are not directly comparable with the test-set results; the classification report gives more detail for each class.
 
| Model | Class | Precision | Recall | F1-score | Support |
|--------------|---|----------|---------|----------|-------------|
| Random Forest | 1 | 0.544762 | 0.613734 | 0.577195 | 466 |
| SVM | 1 | 0.623529 | 0.568670 | 0.594837 | 466 |
| Logistic Regression | 1 | 0.532258 | 0.424893 | 0.472554 | 466 |

The results for Random Forest and SVM are close: Random Forest gives better recall, while SVM gives better precision. If we focus on the F1 score, SVM is the best of the three models. As a final check, I also computed the AUROC for each model, with the following results:

| Model | AUROC |
|--------------|---|
| Random Forest | 0.8288 |
| SVM | 0.8283 |
| Logistic Regression | 0.7553 |


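In this case, Random Forest shows a slightly better performance (0.8288 vs. 0.8283). In conclusion, I personally choose the Random Forest model because its higher recall for class `1` means it catches more of the users who are going to leave the bank. Note, however, that only SVM reaches the target F1 score of 0.59 on the test set (0.5948 vs. 0.5772 for Random Forest).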



## Conclusions

Throughout this project we faced different situations. The original dataset was preprocessed so it could be used to train the models. The encoding applied was one-hot encoding (OHE), using the pandas `get_dummies` method.

First, I trained a single Random Forest model to find the best F1 score; however, it was necessary to balance and standardize the data to reach better results. I made those changes and included 3 models: Random Forest, SVM, and Logistic Regression, each with its own parameter grid.

In general, it can be inferred that the Random Forest and SVM models perform much better than Logistic Regression across accuracy, precision, recall, and F1 score.

A classification report was run to go deeper into the analysis, showing results per class (focusing on class '1'). Finally, a ROC curve chart was plotted to check the performance of each model, showing the AUC results.

In conclusion, the results for Random Forest and SVM were very similar, but I found Random Forest the better option for this project because of its higher recall when predicting the users who are going to leave the bank.