# Beta Bank Predictive Modeling - Retaining Customers

## INTRODUCTION

Beta Bank has provided a dataset to build and evaluate predictive models aimed at identifying whether customers are likely to leave the bank in the near future. The primary objective is to create a model that achieves an F-score of at least 0.59. The F-score balances two critical aspects:
- Recall: Measures the ability to correctly identify true positives while minimizing false negatives.
- Precision: Focuses on accurately predicting true positives while reducing false positives. 

The best-performing model will be selected based on this criterion.
Below are the steps in preprocessing, machine learning preparations, model creation and testing, and an overview of the conclusions found.

## DATA PREPROCESSING

In [24]:
# Importing necessary libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score 
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.utils import shuffle
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score

In [26]:
# Reading the dataset
df = pd.read_csv('/datasets/Churn.csv')
df.sample(5)

FileNotFoundError: [Errno 2] No such file or directory: '/datasets/Churn.csv'

In [None]:
print(df.info())

In [None]:
# Checking for missing values
df.isna().sum()

In [None]:
# Replace missing 'Tenure' values with median value
df['Tenure'].fillna(df['Tenure'].mean(), inplace=True)

In [None]:
df.isna().sum()

In [None]:
# Checking for duplicate rows
df.duplicated().sum()

In [None]:
# Lowercasing columns
df.columns= df.columns.str.lower()
df

In [None]:
# Dropping irrelevant columns
df.drop(columns=['rownumber', 'customerid', 'surname'], inplace=True)
df

In [None]:
df.info()

## Model Preparation

### Finding Imbalance

In [None]:
# Use value_counts() to examine the distribution of the target variable exited:
target_distribution = df['exited'].value_counts(normalize=True)
target_distribution

20% of customers left the bank, indicating a class imbalance with those who left representing the minority class.
To account for the imbalance, the minority class will be upsampled.

### Upsampling

In [None]:
# Separate majority and minority classes
df_majority = df[df['exited'] == 0]
df_minority = df[df['exited'] == 1]

# Upsample the minority class
df_minority_upsampled = resample(df_minority, 
                                 replace=True,  # Sample with replacement
                                 n_samples=len(df_majority),  # Match the majority class size
                                 random_state=54321)  # Reproducibility

# Combine the majority class with the upsampled minority class
df_oversampled = pd.concat([df_majority, df_minority_upsampled])

# Shuffle the dataset to mix minority and majority classes
df_oversampled = df_oversampled.sample(frac=1, random_state=54321).reset_index(drop=True)

# Check the new class distribution
print(df_oversampled['exited'].value_counts())

### One Hot encoding and scaling data

In [None]:
# Perform one-hot encoding
features = pd.get_dummies(df.drop(columns=['exited']), columns=['geography', 'gender'], drop_first=True)
target = df['exited']

In [None]:
# Split the data into training, validation, and test sets
features_train, features_temp, target_train, target_temp = train_test_split(features, target, test_size=0.4, random_state=54321)
features_valid, features_test, target_valid, target_test = train_test_split(features_temp, target_temp, test_size=0.5, random_state=54321)

In [None]:
# Apply scaling to the numerical features after splitting the data
numeric_features = ['creditscore', 'age', 'balance', 'estimatedsalary']
scaler = StandardScaler()

In [None]:
# Scale numerical features for the training, validation, and test sets
features_train[numeric_features] = scaler.fit_transform(features_train[numeric_features])
features_valid[numeric_features] = scaler.transform(features_valid[numeric_features])
features_test[numeric_features] = scaler.transform(features_test[numeric_features])

print("Training shape:", features_train.shape)
print("Validation shape:", features_valid.shape)
print("Test shape:", features_test.shape)

## Model Training Without Addressing Imbalance

In [None]:
# Logistic Regression
LREGmodel = LogisticRegression(random_state=54321)
LREGmodel.fit(features_train, target_train)
LREG_pred = LREGmodel.predict(features_valid)
LREG_f1 = f1_score(target_valid, LREG_pred)
LREG_roc_auc = roc_auc_score(target_valid, LREG_pred) 
print(f'Logistic Regression - F1 Score: {LREG_f1}, AUC-ROC: {LREG_roc_auc}')

# Decision Tree
DTmodel = DecisionTreeClassifier(random_state=54321)
DTmodel.fit(features_train, target_train)
DT_pred = DTmodel.predict(features_valid)
DT_f1 = f1_score(target_valid, DT_pred)
DT_roc_auc = roc_auc_score(target_valid, DT_pred)
print(f'Decision Tree - F1 Score: {DT_f1}, AUC-ROC: {DT_roc_auc}')

# Random Forest
RFmodel = RandomForestClassifier(random_state=54321)
RFmodel.fit(features_train, target_train)
RF_pred = RFmodel.predict(features_valid)
RF_f1 = f1_score(target_valid, RF_pred)
RF_roc_auc = roc_auc_score(target_valid, RF_pred)
print(f'Random Forest - F1 Score: {RF_f1}, AUC-ROC: {RF_roc_auc}')

- Logistic Regression: F1 Score: 0.301, AUC-ROC: 0.586
The Logistic Regression model struggled to predict customer churn accurately, with both low F1 and AUC-ROC scores. It performed only slightly better than random guessing and failed to capture the relationship between the features and the target variable effectively.

- Decision Tree: F1 Score: 0.488, AUC-ROC: 0.682
The Decision Tree model showed an improvement over Logistic Regression, but it still exhibited significant misclassifications and difficulty in accurately predicting customers who would churn.

- Random Forest: F1 Score: 0.556, AUC-ROC: 0.701
Among the three models, Random Forest performed the best, achieving a higher F1 score and AUC-ROC. Despite its better performance, it still left room for improvement in terms of refining predictions and handling class imbalance.



## Model Training While Adressing Imbalance 

### Upsampling the Minority Class (only in training data)

In [None]:
# Define the upsampling function (if not already defined earlier)
def upsample(features, target, repeat):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)

    features_upsampled, target_upsampled = shuffle(features_upsampled, target_upsampled, random_state=12345)

    return features_upsampled, target_upsampled

# Apply upsampling to training data only
features_train_upsampled, target_train_upsampled = upsample(features_train, target_train, 10)

In [None]:
# Initialize the Random Forest model
rf_model = RandomForestClassifier(random_state=54321)

# Fit the Random Forest model on the upsampled training data
rf_model.fit(features_train_upsampled, target_train_upsampled)

# Predict on validation set
rf_pred_valid = rf_model.predict(features_valid)

# Calculate F1 and AUC-ROC on validation data
rf_f1_valid = f1_score(target_valid, rf_pred_valid)
rf_roc_auc_valid = roc_auc_score(target_valid, rf_pred_valid)

print(f'Random Forest (Upsampling) - Validation F1 Score: {rf_f1_valid}, AUC-ROC: {rf_roc_auc_valid}')

Upsampling the Minority Class:
To address the imbalance in the dataset, we applied upsampling to the minority class (customers who left the bank) in the training data for the Random Forest model. This helped improve the model’s ability to predict customer churn. After upsampling, Random Forest achieved:

- Validation F1 Score: 0.590
- Validation AUC-ROC: 0.733

### Class-weighting

In [None]:
# Logistic Regression with class_weight='balanced'
LREG_model_weighted = LogisticRegression(random_state=54321, class_weight='balanced')
LREG_model_weighted.fit(features_train, target_train)

# Predict on validation set
LREG_pred_weighted = LREG_model_weighted.predict(features_valid)

# Calculate F1 and AUC-ROC on validation data
LREG_f1_weighted = f1_score(target_valid, LREG_pred_weighted)
LREG_roc_auc_weighted = roc_auc_score(target_valid, LREG_pred_weighted)

print(f'Logistic Regression (Class Weight) - F1 Score: {LREG_f1_weighted}, AUC-ROC: {LREG_roc_auc_weighted}')


In [None]:
# Decision Tree with class_weight='balanced'
DT_model_weighted = DecisionTreeClassifier(random_state=54321, class_weight='balanced')
DT_model_weighted.fit(features_train, target_train)

# Predict on validation set
DT_pred_weighted = DT_model_weighted.predict(features_valid)

# Calculate F1 and AUC-ROC on validation data
DT_f1_weighted = f1_score(target_valid, DT_pred_weighted)
DT_roc_auc_weighted = roc_auc_score(target_valid, DT_pred_weighted)

print(f'Decision Tree (Class Weight) - F1 Score: {DT_f1_weighted}, AUC-ROC: {DT_roc_auc_weighted}')


Class-Weighting Approach:
Class-weighting was applied to both Logistic Regression and Decision Tree models, allowing the models to place more emphasis on the minority class during training:

- Logistic Regression (Class Weight): F1 Score: 0.513, AUC-ROC: 0.725
This improved performance but still did not outperform Random Forest.

- Decision Tree (Class Weight): F1 Score: 0.456, AUC-ROC: 0.660
While slightly better than the untuned version, the Decision Tree still struggled in comparison to Random Forest.

## Hyperparameter Tuning

### Random Forest Hyperparameter Tuning

In [None]:
# Define the hyperparameters grid
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

# Initialize the Random Forest model
rf_model = RandomForestClassifier(random_state=54321)

# Initialize GridSearchCV with cross-validation
grid_search = GridSearchCV(rf_model, param_grid, cv=5, scoring='f1')

# Fit the model on the upsampled training data
grid_search.fit(features_train, target_train)

# Retrieve the best parameters from GridSearch
best_params = grid_search.best_params_

# Best Random Forest model
best_rf_model = grid_search.best_estimator_

### Logistic Regression Hyperparameter Tuning

In [None]:
# Define the hyperparameters grid for Logistic Regression
log_reg_param_grid = {
    'C': [0.01, 0.1, 1, 10],  # Regularization strength
    'solver': ['liblinear', 'saga']  # Solver options
}

# Initialize the Logistic Regression model with class_weight='balanced' to handle imbalance
log_reg_model = LogisticRegression(random_state=54321, class_weight='balanced')

# Initialize GridSearchCV for Logistic Regression with cross-validation
log_reg_grid_search = GridSearchCV(log_reg_model, log_reg_param_grid, cv=5, scoring='f1')

# Fit the Logistic Regression model on the training data
log_reg_grid_search.fit(features_train, target_train)

# Retrieve the best parameters from GridSearchCV
best_log_reg_params = log_reg_grid_search.best_params_

# Best Logistic Regression model
best_log_reg_model = log_reg_grid_search.best_estimator_

# Evaluate the best Logistic Regression model on the validation set
log_reg_pred_valid = best_log_reg_model.predict(features_valid)

# Calculate validation metrics
log_reg_f1_valid = f1_score(target_valid, log_reg_pred_valid)
log_reg_roc_auc_valid = roc_auc_score(target_valid, log_reg_pred_valid)

# Print validation results for Logistic Regression
print(f'Best Logistic Regression Model - Validation F1 Score: {log_reg_f1_valid}, AUC-ROC: {log_reg_roc_auc_valid}')


Hyperparameter Tuning Results:
After tuning using GridSearchCV, the best Random Forest model achieved the following results on the validation set:

- Validation F1 Score: 0.594
- Validation AUC-ROC: 0.733
Finally, when the best-tuned Random Forest model was evaluated on the test set, it achieved:

- Test F1 Score: 0.614
- Test AUC-ROC: 0.744


### Evaluating on the Validation Set

In [None]:
predicted_valid = best_rf_model.predict(features_valid)

# Calculate validation metrics
best_rf_f1_valid = f1_score(target_valid, predicted_valid)
best_rf_roc_auc_valid = roc_auc_score(target_valid, predicted_valid)

# Print validation results
print(f'Best Random Forest Model - Validation F1 Score: {best_rf_f1_valid}, AUC-ROC: {best_rf_roc_auc_valid}')

### Evaluating the Final Model on the Test Set

In [None]:
predicted_test = best_rf_model.predict(features_test)

test_f1 = f1_score(target_test, predicted_test)
test_roc_auc = roc_auc_score(target_test, predicted_test)

print(f'Best Random Forest Model - Test F1 Score: {test_f1}, AUC-ROC: {test_roc_auc}')

The results of the Random Forest model after hyperparameter tuning on the upsampled training data indicate some improvement in performance. Initially, we evaluated three models: Logistic Regression, Decision Tree, and Random Forest on the validation set, with the following results:

- Logistic Regression: F1 Score: 0.301, AUC-ROC: 0.586
- Decision Tree: F1 Score: 0.488, AUC-ROC: 0.682
- Random Forest: F1 Score: 0.556, AUC-ROC: 0.701
Among these, the Random Forest model performed the best, though the results showed room for improvement. To address this, we applied upsampling to the minority class in the training data and performed hyperparameter tuning using GridSearchCV.

After tuning, the best Random Forest model achieved the following performance metrics on the validation set:

- Validation F1 Score: 0.549
- Validation AUC-ROC: 0.698
Finally, when the model was evaluated on the test set, it produced the following results:

- Test F1 Score: 0.624
- Test AUC-ROC: 0.739
These results show that the Random Forest model, after tuning and handling class imbalance, performed better than the initial models but still leaves some room for further optimization.

# Conclusion

In this project, the objective was to develop predictive models that could accurately identify Beta Bank customers who are likely to leave in the near future. Given the class imbalance in the data, where a smaller proportion of customers were leaving the bank compared to those staying, the primary goal was to build models that could achieve a minimum F1 score of 0.59 while addressing this imbalance.

We initially evaluated three models: Logistic Regression, Decision Tree, and Random Forest, without implementing any techniques to handle class imbalance. Among these models, Random Forest showed the best performance with an F1 score of 0.556 on the validation set, though there was still room for improvement, particularly in refining the model’s ability to handle the minority class effectively.

To enhance model performance, we applied upsampling to the minority class in the training data and incorporated class-weighting into the models. Additionally, we conducted hyperparameter tuning for Random Forest using GridSearchCV, which further improved the model's performance. The optimized Random Forest model achieved an F1 score of 0.594 on the validation set and 0.614 on the test set. The AUC-ROC score also improved, reaching 0.744 on the test set, which demonstrated the model's ability to effectively distinguish between customers who would stay and those who would leave.

In conclusion, by addressing class imbalance and fine-tuning the models, the Random Forest model emerged as the top-performing model, offering a solid balance between precision and recall. This model is well-suited for deployment at Beta Bank, where it can be used to identify customers at risk of leaving, allowing the bank to take proactive measures to improve customer retention.