# Bank Customer Churn - Data Processing and Model Selection

With the data analysis finished and a general idea of the features in hand, it's possible to move forward.

The data must be prepared for the model before any more specific feature engineering.
Even though some models have particular requirements for the data they consume, some processing steps are common to many models:

- Drop or treat features with high cardinality or high variance
- Encode categorical variables (unless the model supports categorical features natively, like CatBoost)
- Deal with missing data
- Apply feature scaling (depends on the model)

Only the processing steps needed for a Machine Learning model to work properly are applied here, followed by the model selection for this project.

In [35]:
import pandas as pd
import numpy as np

from typing import Any

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, classification_report, accuracy_score
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from xgboost import XGBClassifier

In [2]:
data_path = '../data/raw/training/Abandono_clientes.csv'
df_raw = pd.read_csv(data_path)

In [3]:
df_raw

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.00,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.80,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.00,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.10,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,9996,15606229,Obijiaku,771,France,Male,39,5,0.00,2,1,0,96270.64,0
9996,9997,15569892,Johnstone,516,France,Male,35,10,57369.61,1,1,1,101699.77,0
9997,9998,15584532,Liu,709,France,Female,36,7,0.00,1,0,1,42085.58,1
9998,9999,15682355,Sabbatini,772,Germany,Male,42,3,75075.31,2,1,0,92888.52,1


## Data Preprocessing

Since there's no need to fill any missing data and the only categorical variables are binary (Gender) or have very few categories (Geography), the processing is simple. If a candidate model requires specific processing, it will be applied accordingly.

RowNumber, CustomerId and Surname won't be kept. These features are identifiers, so they would most likely only add noise to the data and not help generalization.
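
The two observations above (no missing values, identifier-like columns) can be checked directly. The following is a minimal sketch on a hypothetical miniature frame named `toy`, built from the first rows displayed earlier; on the real data the same calls would presumably be run against `df_raw`.

```python
import pandas as pd

# Hypothetical miniature of the raw data (values copied from the first rows shown above)
toy = pd.DataFrame({
    'RowNumber': [1, 2, 3, 4],
    'CustomerId': [15634602, 15647311, 15619304, 15701354],
    'Surname': ['Hargrave', 'Hill', 'Onio', 'Boni'],
    'Geography': ['France', 'Spain', 'France', 'France'],
    'Gender': ['Female', 'Female', 'Female', 'Female'],
    'Exited': [1, 0, 1, 0],
})

# Missing values per column: all zeros means nothing needs to be imputed
missing_per_column = toy.isna().sum()

# Distinct values per column: identifiers have one value per row
cardinality = toy.nunique()

# Columns whose cardinality equals the row count behave like identifiers
id_like = cardinality[cardinality == len(toy)].index.tolist()

print(missing_per_column.to_dict())
print(id_like)
```

On a frame with thousands of rows, a strict equality with the row count may be too strict for a column like Surname, so a ratio threshold could be a more practical criterion.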

In [4]:
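# Drop the identifier columns by name rather than by position;
# drop() returns a new DataFrame, so df_raw is left untouched
df_pp = df_raw.drop(columns=['RowNumber', 'CustomerId', 'Surname'])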

In [5]:
# Assign the result back instead of calling an inplace method on a column selection,
# which raises a pandas FutureWarning and stops working in pandas 3.0
df_pp['Gender'] = df_pp['Gender'].map({'Male': 0, 'Female': 1})


In [6]:
geography_dummies = pd.get_dummies(df_pp['Geography']).astype(int)
df_pp = pd.concat([df_pp, geography_dummies],axis=1).drop('Geography', axis=1)
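
The three dummy columns always sum to 1, so one of them is redundant. For linear models such as Logistic Regression this can matter, and a common variant drops one category. The sketch below is not what this notebook does; it uses a hypothetical `toy` frame and the default `Geography_` prefix produced by `pd.get_dummies`.

```python
import pandas as pd

# Hypothetical miniature frame with one categorical and one numeric column
toy = pd.DataFrame({
    'Geography': ['France', 'Spain', 'Germany', 'France'],
    'Age': [42, 41, 42, 39],
})

# Full one-hot encoding: one 0/1 column per country
full = pd.get_dummies(toy, columns=['Geography'], dtype=int)

# Variant: drop the first category (France); it is implied when the other two are 0
reduced = pd.get_dummies(toy, columns=['Geography'], drop_first=True, dtype=int)

geo_cols = ['Geography_France', 'Geography_Germany', 'Geography_Spain']
rows_sum_to_one = bool((full[geo_cols].sum(axis=1) == 1).all())

print(full.columns.tolist())
print(reduced.columns.tolist())
print(rows_sum_to_one)
```

Tree-based models are not affected by this redundancy, which is one reason to keep all three columns when the final model is a tree ensemble.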

In [7]:
df_pp

Unnamed: 0,CreditScore,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,France,Germany,Spain
0,619,1,42,2,0.00,1,1,1,101348.88,1,1,0,0
1,608,1,41,1,83807.86,1,0,1,112542.58,0,0,0,1
2,502,1,42,8,159660.80,3,1,0,113931.57,1,1,0,0
3,699,1,39,1,0.00,2,0,0,93826.63,0,1,0,0
4,850,1,43,2,125510.82,1,1,1,79084.10,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,771,0,39,5,0.00,2,1,0,96270.64,0,1,0,0
9996,516,0,35,10,57369.61,1,1,1,101699.77,0,1,0,0
9997,709,1,36,7,0.00,1,0,1,42085.58,1,1,0,0
9998,772,0,42,3,75075.31,2,1,0,92888.52,1,0,1,0


In [20]:
df_pp.to_csv('../data/interim/churn_customer_preprocessing.csv', index=False)

## Model Selection
Since the output is binary (whether a customer will churn or not), this is a **classification** problem, so a classification algorithm must be chosen.

A good practice is to start from a simple baseline model and check whether the other candidates can beat it. The baseline chosen here is Logistic Regression, from the family of linear models; the other candidates come from different families:

- **Tree-based ensemble models**: RandomForest and XGBoost;
- **Distance-based models**: K-Nearest Neighbours (KNN) and Support Vector Machines (SVM).

LogisticRegression, SVM, and KNN benefit from scaling, whereas XGBoost and RandomForest do not, because tree-based models split on feature thresholds and are insensitive to the scale of each feature.

The performance metric chosen for this comparison is the **F1 Score**.

The F1 Score is the harmonic mean of the precision and recall of a classification model.
It is defined as:
$$
F_{1}=2\cdot\frac{\text{Precision}\cdot\text{Recall}}{\text{Precision}+\text{Recall}}
$$
with Precision and Recall defined as
$$ \text{Precision}=\frac{TP}{TP+FP} $$
$$ \text{Recall}=\frac{TP}{TP+FN} $$

where $TP$ is the number of *True Positives*, $FP$ the number of *False Positives*, and $FN$ the number of *False Negatives*.
A classifier with high recall may have low precision, meaning it captures most of the positive instances but also produces a considerable number of false positives. The F1 Score balances this trade-off.

A high F1 Score generally indicates well-balanced performance: the model attains high precision and high recall at the same time. A low F1 Score implies that the model struggles with at least one of the two. Overall, it is a robust metric for checking whether the model generalizes in a balanced way across both kinds of error.
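
As a worked instance of the formulas above, the sketch below counts $TP$, $FP$ and $FN$ by hand on a small set of hypothetical labels (not taken from the churn data) and compares the result with `f1_score` from scikit-learn.

```python
from sklearn.metrics import f1_score

# Hypothetical labels: 1 = churned, 0 = stayed
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # churners correctly flagged
fp = sum(t == 0 and p == 1 for t, p in pairs)  # loyal customers flagged as churners
fn = sum(t == 1 and p == 0 for t, p in pairs)  # churners that were missed

precision = tp / (tp + fp)  # 2 / 3
recall = tp / (tp + fn)     # 2 / 4
f1_manual = 2 * precision * recall / (precision + recall)  # 4 / 7

f1_sklearn = f1_score(y_true, y_pred)
print(f'{f1_manual:.4f} {f1_sklearn:.4f}')
```

Here the model finds only half of the churners, and the F1 Score of 4/7 (about 0.57) sits between the precision of 2/3 and the recall of 1/2, closer to the lower of the two.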

In [30]:
scaler = StandardScaler()

In [8]:
x = df_pp.drop(columns='Exited')
y = df_pp.Exited

In [37]:
x_scaled = scaler.fit_transform(x)

In [41]:
x_scaled = pd.DataFrame(x_scaled, columns=x.columns)

In [9]:
x_train, x_valid, y_train, y_valid = train_test_split(x, y, test_size=0.3, stratify=y, random_state=0)

In [38]:
x_train_scaled = scaler.fit_transform(X=x_train)
x_valid_scaled = scaler.transform(X=x_valid)
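
A note on scaling and cross-validation: when the scaler is fitted on all rows before the folds are created, the mean and standard deviation it learns include each validation fold. One common way to avoid this is to put the scaler and the model in a scikit-learn `Pipeline`, so the scaler is refitted on the training part of every fold. The sketch below illustrates the idea on synthetic data from `make_classification` (an assumption standing in for the churn data, with roughly 20% positives); the names `x_demo`, `y_demo`, `pipe` and `skf_demo` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn data (assumption: about 20% positive class)
x_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=0
)

# The scaler is part of the estimator, so it is fitted on the training folds only
pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=0))

skf_demo = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
fold_f1 = cross_val_score(pipe, x_demo, y_demo, scoring='f1', cv=skf_demo)

print(fold_f1.mean(), fold_f1.std())
```

With standardization on a dataset of this size the difference is usually small, so the scores reported in this notebook are unlikely to change much, but the pipeline keeps the evaluation strictly out-of-sample.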

In [34]:
log_reg_model = LogisticRegression(random_state=0)

knn_model = KNeighborsClassifier()
svm_model = SVC(random_state=0)

rf_model = RandomForestClassifier(random_state=0)
xgb_model = XGBClassifier()

In [43]:
def cross_validation(
    model: Any,
    x: pd.DataFrame,
    y: pd.Series,
    skf: StratifiedKFold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
) -> list[float]:
    """Fit `model` on each stratified fold and return the per-fold F1 scores."""
    cv_f1_scores = []

    for train_idx, test_idx in skf.split(x, y):
        # Select the rows of the current fold by position
        x_train_fold, x_test_fold = x.iloc[train_idx], x.iloc[test_idx]
        y_train_fold, y_test_fold = y.iloc[train_idx], y.iloc[test_idx]

        # fit() retrains the estimator from scratch on the training part of the fold
        model.fit(x_train_fold, y_train_fold)
        model_f1_score = f1_score(y_test_fold, model.predict(x_test_fold))
        cv_f1_scores.append(model_f1_score)

    return cv_f1_scores

In [44]:
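skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Scaled features for the scale-sensitive models, raw features for the tree ensembles;
# the same splitter is passed to every call so all models see identical folds
cv_f1_scores_logistic = cross_validation(model=log_reg_model, x=x_scaled, y=y, skf=skf)
cv_f1_scores_rf = cross_validation(model=rf_model, x=x, y=y, skf=skf)
cv_f1_scores_xgb = cross_validation(model=xgb_model, x=x, y=y, skf=skf)
cv_f1_scores_knn = cross_validation(model=knn_model, x=x_scaled, y=y, skf=skf)
cv_f1_scores_svm = cross_validation(model=svm_model, x=x_scaled, y=y, skf=skf)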

In [47]:
cv_scores = {
    'logistic_reg': cv_f1_scores_logistic, 'random_forest': cv_f1_scores_rf,
    'xgboost': cv_f1_scores_xgb, 'knn': cv_f1_scores_knn, 'svm': cv_f1_scores_svm,
}

# Summarize the per-fold F1 scores of each model
cross_val_dict = {
    name: {'max_score': max(scores), 'min_score': min(scores),
           'avg_score': np.mean(scores), 'std_score': np.std(scores)}
    for name, scores in cv_scores.items()
}

In [48]:
pd.DataFrame(cross_val_dict)

Unnamed: 0,logistic_reg,random_forest,xgboost,knn,svm
max_score,0.367491,0.634731,0.623229,0.517241,0.578947
min_score,0.22963,0.518987,0.534535,0.440895,0.481356
avg_score,0.314076,0.571552,0.573204,0.474631,0.524926
std_score,0.040593,0.031971,0.02234,0.023806,0.028779


The best models were Random Forest and XGBoost, with similar performance (average F1 Score of about 0.57). Every other candidate also beat the baseline logistic regression model (average F1 Score of about 0.31).

An even simpler sanity check would be a random binary generator, to verify that the models surpass chance-level performance.
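
Such a random baseline could be sketched with scikit-learn's `DummyClassifier` using `strategy='uniform'`, which predicts each class with equal probability regardless of the features. The labels below are hypothetical, generated with a positive rate of about 20% as an assumption; they are not the churn data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Hypothetical labels with roughly 20% positives (assumption, not the real target)
y_demo = (rng.random(10_000) < 0.2).astype(int)
x_demo = np.zeros((len(y_demo), 1))  # the dummy model ignores the features

# Coin-flip classifier: predicts 0 or 1 uniformly at random
coin_flip = DummyClassifier(strategy='uniform', random_state=0)
coin_flip.fit(x_demo, y_demo)

random_f1 = f1_score(y_demo, coin_flip.predict(x_demo))
print(f'{random_f1:.3f}')
```

For a coin flip on data with positive rate $p$, precision is about $p$ and recall about 0.5, so the expected F1 Score is roughly $2\cdot 0.5p/(p+0.5)$, which is close to 0.29 for $p=0.2$. Under that assumption, a model scoring near this value is barely better than chance.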

Let's continue by comparing the two ensemble models on the hold-out validation set.

In [12]:
rf_model.fit(x_train, y_train)
xgb_model.fit(x_train, y_train)

rf_preds = rf_model.predict(x_valid)
xgb_preds = xgb_model.predict(x_valid)

In [22]:
rf_accuracy = accuracy_score(y_valid, rf_preds)
xgb_accuracy = accuracy_score(y_valid, xgb_preds)

In [28]:
print(f'Random Forest Accuracy: {rf_accuracy:.03f}')
print(f'XGBoost Accuracy: {xgb_accuracy:.03f}')

Random Forest Accuracy: 0.857
XGBoost Accuracy: 0.850


In [13]:
rf_classif_report = classification_report(y_valid, rf_preds)
xgb_classif_report = classification_report(y_valid, xgb_preds)

In [17]:
print(rf_classif_report)

              precision    recall  f1-score   support

           0       0.87      0.96      0.91      2389
           1       0.75      0.45      0.56       611

    accuracy                           0.86      3000
   macro avg       0.81      0.70      0.74      3000
weighted avg       0.85      0.86      0.84      3000



In [18]:
print(xgb_classif_report)

              precision    recall  f1-score   support

           0       0.88      0.95      0.91      2389
           1       0.69      0.47      0.56       611

    accuracy                           0.85      3000
   macro avg       0.78      0.71      0.74      3000
weighted avg       0.84      0.85      0.84      3000



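Based on the analysis, both models perform very similarly: the F1 Score for the churn class is 0.56 in both cases, with XGBoost slightly behind RandomForest in accuracy (0.850 vs 0.857) and precision (0.69 vs 0.75) and slightly ahead in recall (0.47 vs 0.45). The choice therefore relies on the characteristics of each model: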

- Random Forest is a bagging model: its trees are built independently of each other (and can be trained in parallel) on bootstrap samples, and the prediction aggregates the outputs of all trees. Averaging many trees reduces variance, although the individual deep trees can still fit noise in the training data.
- XGBoost is a boosting model: its trees are built sequentially, with each new tree correcting the errors of the trees built before it. Its built-in regularization techniques aim to limit overfitting and improve performance in real-life scenarios.
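
The aggregation step of the bagging model can be made concrete. In scikit-learn, `RandomForestClassifier.predict_proba` is documented as the mean of the class probabilities predicted by the individual trees, and the sketch below reproduces that average by hand. It uses synthetic data from `make_classification` (an assumption, not the churn data) and covers only the Random Forest bullet; boosting is not illustrated here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the real features (assumption)
x_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(x_demo, y_demo)

# Average the per-tree class probabilities by hand
per_tree = np.stack([tree.predict_proba(x_demo) for tree in forest.estimators_])
manual_proba = per_tree.mean(axis=0)

# Compare with the probabilities returned by the forest itself
same = bool(np.allclose(manual_proba, forest.predict_proba(x_demo)))
print(per_tree.shape, same)
```

Because every tree contributes equally and independently, adding more trees mainly stabilizes this average, whereas in boosting each added tree changes what the following trees are trained to correct.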

Based on that, both models would perform well for this task. In this case, I'll pick XGBoost because I am more familiar with it.

With the model selected, the next step is **Feature Engineering and Feature Selection**.