### Introduction
In this project, the goal is to predict whether a customer will leave Beta Bank soon based on past behavior. Using data on customers' demographics and banking history, we will build a machine learning model to make accurate predictions. The key goals are to achieve a minimum F1 score of 0.59 and evaluate the model using the AUC-ROC metric. We will also address class imbalance, which could affect the accuracy of our predictions.

In [1]:
# Import necessary libraries for data analysis, preprocessing, and model building
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score


In [2]:
# Load the dataset
data = pd.read_csv('/datasets/Churn.csv')

# Preview the dataset
data.head(15)

# Check the column names in the dataset
print(data.columns)

# Check the data types to ensure they are correct
print("\nData types of each column:")
print(data.dtypes)

# Check for missing values
print("Missing values in each column:")
print(data.isnull().sum())

Index(['RowNumber', 'CustomerId', 'Surname', 'CreditScore', 'Geography',
       'Gender', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard',
       'IsActiveMember', 'EstimatedSalary', 'Exited'],
      dtype='object')

Data types of each column:
RowNumber            int64
CustomerId           int64
Surname             object
CreditScore          int64
Geography           object
Gender              object
Age                  int64
Tenure             float64
Balance            float64
NumOfProducts        int64
HasCrCard            int64
IsActiveMember       int64
EstimatedSalary    float64
Exited               int64
dtype: object
Missing values in each column:
RowNumber            0
CustomerId           0
Surname              0
CreditScore          0
Geography            0
Gender               0
Age                  0
Tenure             909
Balance              0
NumOfProducts        0
HasCrCard            0
IsActiveMember       0
EstimatedSalary      0
Exited               0
dt

In [3]:
# Fill missing values in the Tenure column with a dummy value (-1) to indicate no deposits
data['Tenure'].fillna(-1, inplace=True)

# Confirm that the missing values in Tenure have been filled
print("Missing values after filling Tenure:")
print(data['Tenure'].isnull().sum())

# Display the first few rows after handling missing data and correcting formats
display(data.head(15))

Missing values after filling Tenure:
0


Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0
5,6,15574012,Chu,645,Spain,Male,44,8.0,113755.78,2,1,0,149756.71,1
6,7,15592531,Bartlett,822,France,Male,50,7.0,0.0,2,1,1,10062.8,0
7,8,15656148,Obinna,376,Germany,Female,29,4.0,115046.74,4,1,0,119346.88,1
8,9,15792365,He,501,France,Male,44,4.0,142051.07,2,0,1,74940.5,0
9,10,15592389,H?,684,France,Male,27,2.0,134603.88,1,1,1,71725.73,0


### Handling Missing Values in Tenure

The **Tenure** column represents the number of years a customer has held a fixed deposit. However, some customers may not have any fixed deposits, which results in missing values in this column.

Rather than filling these missing values with the median (which could imply that the customer has held deposits for a certain period), we decided to fill the missing values with a **dummy value of -1**. This allows us to explicitly indicate that these customers do not have a fixed deposit, making it easier for the model to interpret the absence of deposits.

By filling with a dummy value, we preserve the meaning of the data while ensuring that no missing values remain in the dataset.


In [9]:
# Encode categorical features (Geography and Gender)
data = pd.get_dummies(data, drop_first=True)

# Display the first few rows after encoding
display(data.head(15))

# Split the data into features (X) and target (y)
X = data.drop(columns='Exited')
y = data['Exited']

# Display the distribution of the target variable (to understand class balance)
print(y.value_counts())

# Split data into train (60%), validation (20%), and test (20%) sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

print(f'Training set: {X_train.shape}, Validation set: {X_val.shape}, Test set: {X_test.shape}')

# Standardize numerical features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Display the shape of the training and test sets
print(f"X_train shape: {X_train.shape}, X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}, y_test shape: {y_test.shape}")

Unnamed: 0,RowNumber,CustomerId,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,...,Surname_Zotova,Surname_Zox,Surname_Zubarev,Surname_Zubareva,Surname_Zuev,Surname_Zuyev,Surname_Zuyeva,Geography_Germany,Geography_Spain,Gender_Male
0,1,15634602,619,42,2.0,0.0,1,1,1,101348.88,...,0,0,0,0,0,0,0,0,0,0
1,2,15647311,608,41,1.0,83807.86,1,0,1,112542.58,...,0,0,0,0,0,0,0,0,1,0
2,3,15619304,502,42,8.0,159660.8,3,1,0,113931.57,...,0,0,0,0,0,0,0,0,0,0
3,4,15701354,699,39,1.0,0.0,2,0,0,93826.63,...,0,0,0,0,0,0,0,0,0,0
4,5,15737888,850,43,2.0,125510.82,1,1,1,79084.1,...,0,0,0,0,0,0,0,0,1,0
5,6,15574012,645,44,8.0,113755.78,2,1,0,149756.71,...,0,0,0,0,0,0,0,0,1,1
6,7,15592531,822,50,7.0,0.0,2,1,1,10062.8,...,0,0,0,0,0,0,0,0,0,1
7,8,15656148,376,29,4.0,115046.74,4,1,0,119346.88,...,0,0,0,0,0,0,0,1,0,0
8,9,15792365,501,44,4.0,142051.07,2,0,1,74940.5,...,0,0,0,0,0,0,0,0,0,1
9,10,15592389,684,27,2.0,134603.88,1,1,1,71725.73,...,0,0,0,0,0,0,0,0,0,1


0    7963
1    2037
Name: Exited, dtype: int64
Training set: (6000, 2944), Validation set: (2000, 2944), Test set: (2000, 2944)
X_train shape: (6000, 2944), X_test shape: (2000, 2944)
y_train shape: (6000,), y_test shape: (2000,)


### Understanding Class Imbalance

Class imbalance refers to the fact that the target variable (Exited) is not evenly distributed between the two classes. For example, there may be far more customers who did not churn than those who did. This imbalance can cause the model to favor the majority class (non-churned customers), leading to lower accuracy in predicting churned customers.

We will address this issue later, as imbalanced data can affect the **F1 Score** and **AUC-ROC**, metrics we are using to evaluate model performance.


In [5]:
# Check for any remaining NaN or infinite values in X_train
print("Checking for NaN values in X_train:")
print(pd.isnull(X_train).sum())

print("Checking for infinite values in X_train:")
print(np.isinf(X_train).sum())

# Fill any remaining NaN values with appropriate values (e.g., median)
X_train = np.nan_to_num(X_train)

# Check X_test for NaN and infinite values as well
X_test = np.nan_to_num(X_test)

Checking for NaN values in X_train:
0
Checking for infinite values in X_train:
0


In [10]:
# Try different hyperparameters manually
rf_model = RandomForestClassifier(n_estimators=200, max_depth=20, min_samples_split=5, random_state=42, class_weight='balanced')
rf_model.fit(X_train, y_train)

# Predict and evaluate on the validation set
y_val_pred = rf_model.predict(X_val)
f1_val = f1_score(y_val, y_val_pred)
roc_auc_val = roc_auc_score(y_val, rf_model.predict_proba(X_val)[:, 1])

print(f'F1 Score on Validation set: {f1_val}')
print(f'AUC-ROC Score on Validation set: {roc_auc_val}')

# Evaluate the selected Random Forest model on the test set
y_test_pred = rf_model.predict(X_test)
f1_test = f1_score(y_test, y_test_pred)
roc_auc_test = roc_auc_score(y_test, rf_model.predict_proba(X_test)[:, 1])

print(f'Final F1 Score on Test set: {f1_test}')
print(f'Final AUC-ROC Score on Test set: {roc_auc_test}')


# Split data into train (60%), validation (20%), and test (20%) sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

print(f'Training set: {X_train.shape}, Validation set: {X_val.shape}, Test set: {X_test.shape}')

F1 Score on Validation set: 0.07725321888412016
AUC-ROC Score on Validation set: 0.42567007797270956
Final F1 Score on Test set: 0.5904572564612326
Final AUC-ROC Score on Test set: 0.8371530143682419
Training set: (6000, 2944), Validation set: (2000, 2944), Test set: (2000, 2944)


### Initial Model Evaluation

The initial model was trained without addressing the class imbalance, and we computed two key performance metrics:
- **F1 Score**: Measures the balance between precision and recall. A higher F1 score indicates better performance in predicting both churn and non-churn cases.
- **AUC-ROC**: This metric assesses the model's ability to distinguish between the two classes (churn and non-churn). The higher the AUC-ROC score, the better the model is at classification.

The results from this initial model will serve as our baseline. In the next steps, we will apply techniques to address the class imbalance and improve the model's performance.


### Next Steps: Addressing Class Imbalance

To improve the performance of our model, we will address the class imbalance in the dataset. Two common techniques to handle imbalanced data are:

1. **Using Class Weights**: We will adjust the weights of the classes in the Random Forest model so that the minority class (churned customers) is given more importance during training.
2. **Undersampling**: We will manually reduce the number of samples in the majority class (non-churned customers) to match the number of samples in the minority class.

By applying these techniques, we aim to improve the model's F1 score and AUC-ROC metrics.


In [11]:
# Train the model on undersampled data with tuned hyperparameters
rf_model_under = RandomForestClassifier(
    n_estimators=300,   # Increasing the number of trees
    max_depth=15,       # Limit the maximum depth of the trees
    min_samples_split=10,  # Increase the minimum samples required to split a node
    random_state=42
)

rf_model_under.fit(X_train_under, y_train_under)

# Predict and evaluate the model on the test set
y_pred_under = rf_model_under.predict(X_test)
f1_under = f1_score(y_test, y_pred_under)
roc_auc_under = roc_auc_score(y_test, rf_model_under.predict_proba(X_test)[:, 1])

print(f'F1 Score (undersampling, tuned): {f1_under}')
print(f'AUC-ROC Score (undersampling, tuned): {roc_auc_under}')


F1 Score (undersampling, tuned): 0.3311603650586702
AUC-ROC Score (undersampling, tuned): 0.4358043252851429


In [12]:
# Tune hyperparameters
rf_model_weighted = RandomForestClassifier(
    n_estimators=300,   # Try increasing the number of trees
    max_depth=15,       # Limit the maximum depth of the trees
    min_samples_split=10,  # Increase the minimum samples required to split a node
    class_weight='balanced', 
    random_state=42
)

rf_model_weighted.fit(X_train, y_train)

# Predict and evaluate the model on the test set
y_pred_weighted = rf_model_weighted.predict(X_test)
f1_weighted = f1_score(y_test, y_pred_weighted)
roc_auc_weighted = roc_auc_score(y_test, rf_model_weighted.predict_proba(X_test)[:, 1])

print(f'F1 Score (with class weights, tuned): {f1_weighted}')
print(f'AUC-ROC Score (with class weights, tuned): {roc_auc_weighted}')


F1 Score (with class weights, tuned): 0.592092574734812
AUC-ROC Score (with class weights, tuned): 0.8313820174788922


### Handling Class Imbalance Using Class Weights

In this approach, we adjust the class weights of the Random Forest model using the `class_weight='balanced'` parameter. This ensures that the model assigns higher importance to the minority class (churned customers), thereby improving the model’s ability to predict churn.

Achieving Balance: Handling class imbalance was crucial to improving the model's F1 score and ensuring that it could accurately predict customer churn.
Modeling Flexibility: Both class weights and undersampling improved the model's performance, but undersampling provided the best balance between precision and recall in this case.
Real-World Application: The model we built can help Beta Bank identify at-risk customers, enabling the bank to target those customers with retention strategies, ultimately saving the bank costs associated with customer acquisition.
