# Beta Bank Project.

## Introduction.

In this project, we address the challenge of customer retention for Beta Bank, an institution that faces a recurring problem: customer churn. Each month, a percentage of users leave the bank, generating significant costs, as attracting new customers is often more expensive than retaining current ones. <br>

The main objective is to develop a machine learning model capable of predicting whether a customer is likely to leave the bank in the near future. To achieve this, we rely on historical data that reflects customer behavior and contract termination with the institution. <br>

The focus is on optimizing the model to achieve the highest possible value of the F1 metric, with a minimum required value of 0.59 on the test set to consider the project successful. Additionally, the AUC-ROC metric will be evaluated to provide a more complete view of the model's performance and compare it with the F1 value. <br>

This analysis will allow Beta Bank to identify at-risk customers early and take proactive measures to improve their experience, thereby strengthening the customer-bank relationship and reducing the churn rate.

## Development.

### Inspect data.

In [3]:
#Libraries import.

import pandas as pd 
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, roc_curve, classification_report, confusion_matrix
from sklearn.utils import resample, shuffle
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

In [4]:
# Data import.

df= pd.read_csv('/Users/pauli/Documents/Data/beta_bank/Churn.csv')

In [5]:
# Data inspect.

df.head(10)

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0
5,6,15574012,Chu,645,Spain,Male,44,8.0,113755.78,2,1,0,149756.71,1
6,7,15592531,Bartlett,822,France,Male,50,7.0,0.0,2,1,1,10062.8,0
7,8,15656148,Obinna,376,Germany,Female,29,4.0,115046.74,4,1,0,119346.88,1
8,9,15792365,He,501,France,Male,44,4.0,142051.07,2,0,1,74940.5,0
9,10,15592389,H?,684,France,Male,27,2.0,134603.88,1,1,1,71725.73,0


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           9091 non-null   float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


This initial inspection shows that the Tenure column contains null values and that three columns (Surname, Geography, Gender) are strings, which must be handled later because the models require numeric input.

In [7]:
# First, let's check for missing values.

df.isna().sum()

RowNumber            0
CustomerId           0
Surname              0
CreditScore          0
Geography            0
Gender               0
Age                  0
Tenure             909
Balance              0
NumOfProducts        0
HasCrCard            0
IsActiveMember       0
EstimatedSalary      0
Exited               0
dtype: int64

We need to check if the rows with NaN in Tenure correspond only to customers who stayed or if there are also customers who left.

In [8]:
nan= df.query('Tenure.isna()')
nan.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
30,31,15589475,Azikiwe,591,Spain,Female,39,,0.0,3,1,0,140469.38,1
48,49,15766205,Yin,550,Germany,Male,38,,103391.38,1,0,1,90878.13,0
51,52,15768193,Trevisani,585,Germany,Male,36,,146050.97,2,0,0,86424.57,0
53,54,15702298,Parkhill,655,Germany,Male,41,,125561.97,1,0,0,164040.94,1
60,61,15651280,Hunter,742,Germany,Male,35,,136857.0,1,0,0,84509.57,0


In [9]:
#Count how many customers containing NaN left or stayed to see if they affect the sample

current_client= nan[nan['Exited'] == 0]
not_client = nan[nan['Exited'] == 1]

print(current_client['Exited'].value_counts())
print(not_client['Exited'].value_counts())

Exited
0    726
Name: count, dtype: int64
Exited
1    183
Name: count, dtype: int64


Of the 909 rows with a missing Tenure, 726 belong to customers who stayed and 183 to customers who left (about 20%). These rows make up roughly 9% of the dataset, too large a share to drop, so the missing values are replaced with the column mean.

In [10]:
tenure_mean= df['Tenure'].mean()

df['Tenure'] = df['Tenure'].fillna(tenure_mean)

#Check that there are no more NaNs.

df.isna().sum()

RowNumber          0
CustomerId         0
Surname            0
CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64
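
As an optional variation on the mean imputation above, the sketch below also records which rows were filled in, so a model can still tell that the original value was missing. It runs on a made-up toy frame, and the `Tenure_missing` column is a hypothetical name that does not appear in the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Tenure column (values are made up).
toy = pd.DataFrame({"Tenure": [2.0, np.nan, 8.0, 1.0, np.nan, 4.0]})

# Hypothetical indicator column: 1 where Tenure was missing, 0 otherwise.
toy["Tenure_missing"] = toy["Tenure"].isna().astype(int)

# Mean of the observed values only (pandas skips NaN by default).
tenure_mean = toy["Tenure"].mean()
toy["Tenure"] = toy["Tenure"].fillna(tenure_mean)

print(toy)
```

Whether the extra indicator helps depends on the model and the data, so it is worth checking on the validation set before keeping it.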

Now the RowNumber, CustomerId, and Surname columns are removed, since they are identifiers that carry no predictive information for the model.

In [11]:
df= df.drop(['CustomerId', 'RowNumber', 'Surname'], axis=1)

df.head()

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0


### Data division and transformation.

Before training the model, we'll split the data into training, validation, and test sets. We'll also transform categorical features into numerical features using One-Hot Encoding (OHE).

In [12]:
#OHE

df_ohe= pd.get_dummies(df, drop_first=True)

#Test that it was done correctly.

df_ohe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   CreditScore        10000 non-null  int64  
 1   Age                10000 non-null  int64  
 2   Tenure             10000 non-null  float64
 3   Balance            10000 non-null  float64
 4   NumOfProducts      10000 non-null  int64  
 5   HasCrCard          10000 non-null  int64  
 6   IsActiveMember     10000 non-null  int64  
 7   EstimatedSalary    10000 non-null  float64
 8   Exited             10000 non-null  int64  
 9   Geography_Germany  10000 non-null  bool   
 10  Geography_Spain    10000 non-null  bool   
 11  Gender_Male        10000 non-null  bool   
dtypes: bool(3), float64(3), int64(6)
memory usage: 732.6 KB
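
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a made-up toy frame. For each categorical column, the first category in sorted order (`France`, `Female`) is dropped and becomes the implicit baseline, which is why the output above has no `Geography_France` or `Gender_Female` column.

```python
import pandas as pd

# Toy rows (made up) with the same categorical columns as the dataset.
toy = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Female", "Male", "Male", "Female"],
    "Age": [42, 41, 29, 39],
})

# One dummy column per category, minus the first category of each feature.
toy_ohe = pd.get_dummies(toy, drop_first=True)

print(toy_ohe.columns.tolist())
```

Dropping one category per feature avoids perfectly collinear columns, which matters mostly for linear models such as logistic regression.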


In [13]:
#Divide the dataset into training, validation, and test sets.

features= df_ohe.drop(['Exited'], axis=1)
target= df_ohe['Exited']

# First it is divided into training set (60%) and temporary set (40%)
features_train, features_temp, target_train, target_temp = train_test_split(features, target, test_size=0.4, random_state=15)

#Then the temporary set is divided into validation (20%) and test (20%)

features_valid, features_test, target_valid, target_test = train_test_split(features_temp, target_temp, test_size=0.5, random_state=15)

#Display the size of the sets

print("Training set:", features_train.shape)
print("Validation set:", features_valid.shape)
print("Test set:", features_test.shape)

Training set: (6000, 11)
Validation set: (2000, 11)
Test set: (2000, 11)
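
The split above does not stratify by the target. As a possible alternative (an assumption, not what the notebook does), passing `stratify` to `train_test_split` keeps the churn rate the same in all three sets. The sketch uses made-up toy data with a 20% positive class.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (made up): 100 rows, 20 of them in the positive class.
toy_features = pd.DataFrame({"x": range(100)})
toy_target = pd.Series([1] * 20 + [0] * 80, name="Exited")

# 60% training, 40% temporary, preserving the class ratio.
f_train, f_temp, t_train, t_temp = train_test_split(
    toy_features, toy_target, test_size=0.4, random_state=15, stratify=toy_target
)

# Split the temporary set evenly into validation and test, again stratified.
f_valid, f_test, t_valid, t_test = train_test_split(
    f_temp, t_temp, test_size=0.5, random_state=15, stratify=t_temp
)

for name, part in [("train", t_train), ("valid", t_valid), ("test", t_test)]:
    print(name, len(part), part.mean())
```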


### Train the model without considering imbalance.

In [14]:
model = LogisticRegression(random_state=15,  solver='liblinear')
model.fit(features_train, target_train)

# The metrics below are computed on the test set, so the variables are named accordingly.

# Class predictions (for F1-score)
pred_test = model.predict(features_test)

# Probability predictions (for AUC-ROC)
prob_one_test = model.predict_proba(features_test)[:, 1]

# Calculate the F1-score
f1 = f1_score(target_test, pred_test)
print(f"F1-Score: {f1}")

# Calculate the AUC-ROC
auc_roc = roc_auc_score(target_test, prob_one_test)
print(f"AUC-ROC: {auc_roc}")




F1-Score: 0.02358490566037736
AUC-ROC: 0.6626656193201742


The F1 score is extremely low (0.024), indicating that the model performs poorly at balancing precision and recall. The model is failing to capture the positive class (customers who leave the bank). <br>
The AUC-ROC is 0.66, which is better than chance (0.5) but still far from the ideal value of 1.0. <br> This pattern is consistent with class imbalance: the model is prioritizing the majority class.
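
The imbalance itself can be checked directly from the target column. The sketch below uses a made-up 80/20 toy target (not the Beta Bank data) to show the class shares and why a model that favors the majority class can look acceptable on accuracy while its F1 collapses.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score

# Toy target with a made-up 80/20 class split.
toy_target = pd.Series([0] * 80 + [1] * 20, name="Exited")

# Share of each class; a strongly uneven split signals imbalance.
class_share = toy_target.value_counts(normalize=True)
print(class_share)

# A baseline that always predicts the majority class.
majority_pred = pd.Series([0] * len(toy_target))

baseline_accuracy = accuracy_score(toy_target, majority_pred)
# zero_division=0 returns 0 when no positive predictions exist.
baseline_f1 = f1_score(toy_target, majority_pred, zero_division=0)

print(f"accuracy={baseline_accuracy:.2f}  f1={baseline_f1:.2f}")
```

On the real data, the same check would be `target.value_counts(normalize=True)`.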



### Improve model quality

Let's try two different approaches to balancing classes.

In [15]:
#Option 1: Undersampling the majority class


df_concat= pd.concat([features, target], axis=1)

#Separate the classes

class_majority = df_concat[df_concat['Exited'] == 0]
class_minority = df_concat[df_concat['Exited'] == 1]

# Subsample the majority class

class_majority_downsampled = resample(class_majority, replace=False, n_samples=len(class_minority),random_state=15)

#Combine balanced classes

balanced_df = pd.concat([class_majority_downsampled, class_minority])

#Divide the data

X_balanced_downsampled = balanced_df.drop(columns='Exited')
y_balanced_downsampled = balanced_df['Exited']


X_train_downsampled, X_test_downsampled, y_train_downsampled, y_test_downsampled = train_test_split(X_balanced_downsampled, y_balanced_downsampled, test_size=0.3, random_state=15)

# Train the model with balanced data
model = LogisticRegression(random_state=15, solver='liblinear')
model.fit(X_train_downsampled, y_train_downsampled)

y_pred_downsampled = model.predict(X_test_downsampled)
f1 = f1_score(y_test_downsampled, y_pred_downsampled)
auc_roc = roc_auc_score(y_test_downsampled, model.predict_proba(X_test_downsampled)[:, 1])

print(f"F1-Score: {f1}")
print(f"AUC-ROC: {auc_roc}")

F1-Score: 0.6687898089171974
AUC-ROC: 0.7025931928687196


In [16]:
# Option 2: Oversampling the minority class.


#Oversampling the minority class
class_minority_oversampled = class_minority.sample(n=len(class_majority), replace=True, random_state=15)

#Combine balanced classes
oversampled_data = pd.concat([class_majority, class_minority_oversampled])

# Shuffle the data
oversampled_data = shuffle(oversampled_data, random_state=15)

# Divide data
X_oversampled = oversampled_data.drop(columns=target.name)
y_oversampled = oversampled_data[target.name]

X_train, X_test, y_train, y_test = train_test_split(X_oversampled, y_oversampled, test_size=0.3, random_state=42)

# Train model

model = LogisticRegression(random_state=15, solver='liblinear')
model.fit(X_train, y_train)


y_pred = model.predict(X_test)
f1 = f1_score(y_test, y_pred)
auc_roc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

print(f"F1-Score: {f1}")
print(f"AUC-ROC: {auc_roc}")


F1-Score: 0.6564822460776218
AUC-ROC: 0.7005813423723871


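The F1 score improved significantly in both cases, indicating a better balance between precision and recall in identifying churning customers. For this case we'll use undersampling, which scored slightly higher on both metrics (F1 0.669 vs. 0.656, AUC-ROC 0.703 vs. 0.701). It also avoids duplicating minority rows, which can encourage overfitting when oversampling. Note that in both options the train/test split was drawn after resampling, so these scores are measured on class-balanced test data rather than on the original class distribution.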
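
One way to keep the evaluation representative is to resample only the training split and leave the validation and test sets untouched. The sketch below illustrates that idea under stated assumptions and is not the notebook's method: `downsample_majority` is a hypothetical helper, the target is assumed to be binary, and the toy values are made up.

```python
import pandas as pd


def downsample_majority(features, target, random_state=15):
    """Balance a binary training set by dropping rows of the majority class.

    Assumes `features` and `target` share the same unique index.
    """
    counts = target.value_counts()
    minority_label = counts.idxmin()
    n_minority = counts.min()

    # Keep every minority row and an equally sized random sample of the rest.
    minority_idx = target[target == minority_label].index
    majority_idx = (
        target[target != minority_label]
        .sample(n=n_minority, random_state=random_state)
        .index
    )
    keep = minority_idx.union(majority_idx)
    return features.loc[keep], target.loc[keep]


# Toy training split (made-up values): 8 customers stayed, 2 left.
toy_features_train = pd.DataFrame({"Age": [42, 41, 39, 43, 44, 50, 29, 27, 35, 38]})
toy_target_train = pd.Series([0, 0, 0, 0, 1, 0, 0, 1, 0, 0], name="Exited")

X_bal, y_bal = downsample_majority(toy_features_train, toy_target_train)
print(y_bal.value_counts())
```

In the notebook's terms this would be applied to `features_train` and `target_train` only, with the F1 and AUC-ROC then computed on the unmodified `features_valid` or `features_test`.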

### Training the fitted model

In [17]:
#Option 1: Logistic regression


model = LogisticRegression(random_state=15, solver='liblinear')
model.fit(X_train_downsampled, y_train_downsampled)

#Predict on the test set
y_pred = model.predict(X_test_downsampled)
y_pred_proba = model.predict_proba(X_test_downsampled)[:, 1]


f1 = f1_score(y_test_downsampled, y_pred)
auc_roc = roc_auc_score(y_test_downsampled, y_pred_proba)


print("Evaluation metrics:")
print(f"F1-Score: {f1:.4f}")
print(f"AUC-ROC: {auc_roc:.4f}")


# Classification report and confusion matrix
print("\nClassification Report:")
print(classification_report(y_test_downsampled, y_pred))

print("Confusion matrix:")
print(confusion_matrix(y_test_downsampled, y_pred))



Evaluation metrics:
F1-Score: 0.6688
AUC-ROC: 0.7026

Classification Report:
              precision    recall  f1-score   support

           0       0.68      0.63      0.65       617
           1       0.65      0.69      0.67       606

    accuracy                           0.66      1223
   macro avg       0.66      0.66      0.66      1223
weighted avg       0.66      0.66      0.66      1223

Confusion matrix:
[[387 230]
 [186 420]]


In [18]:
# Option 2: Decision Tree Classifier


tree_model = DecisionTreeClassifier(random_state=15, max_depth=10)
tree_model.fit(X_train_downsampled, y_train_downsampled)

#Predict on the test set
y_pred = tree_model.predict(X_test_downsampled)
y_pred_proba = tree_model.predict_proba(X_test_downsampled)[:, 1]


f1 = f1_score(y_test_downsampled, y_pred)
auc_roc = roc_auc_score(y_test_downsampled, y_pred_proba)


print("Evaluation metrics:")
print(f"F1-Score: {f1:.4f}")
print(f"AUC-ROC: {auc_roc:.4f}")


print("\nClassification Report:")
print(classification_report(y_test_downsampled, y_pred))

print("Confusion matrix:")
print(confusion_matrix(y_test_downsampled, y_pred))



Evaluation metrics:
F1-Score: 0.7228
AUC-ROC: 0.7810

Classification Report:
              precision    recall  f1-score   support

           0       0.72      0.75      0.74       617
           1       0.74      0.71      0.72       606

    accuracy                           0.73      1223
   macro avg       0.73      0.73      0.73      1223
weighted avg       0.73      0.73      0.73      1223

Confusion matrix:
[[465 152]
 [177 429]]


In [19]:
# Option 3: Random Forest Classifier


rf_model = RandomForestClassifier(random_state=15, n_estimators=50, max_depth=10)
rf_model.fit(X_train_downsampled, y_train_downsampled)

y_pred = rf_model.predict(X_test_downsampled)
y_pred_proba = rf_model.predict_proba(X_test_downsampled)[:, 1]


f1 = f1_score(y_test_downsampled, y_pred)
auc_roc = roc_auc_score(y_test_downsampled, y_pred_proba)


print("Evaluation metrics:")
print(f"F1-Score: {f1:.4f}")
print(f"AUC-ROC: {auc_roc:.4f}")


print("\nClassification Report:")
print(classification_report(y_test_downsampled, y_pred))

print("Confusion matrix:")
print(confusion_matrix(y_test_downsampled, y_pred))


Evaluation metrics:
F1-Score: 0.7605
AUC-ROC: 0.8489

Classification Report:
              precision    recall  f1-score   support

           0       0.76      0.79      0.77       617
           1       0.78      0.74      0.76       606

    accuracy                           0.77      1223
   macro avg       0.77      0.77      0.77      1223
weighted avg       0.77      0.77      0.77      1223

Confusion matrix:
[[488 129]
 [155 451]]


Logistic regression performs decently but scores lowest of the three models, so it is discarded. <br> The decision tree improves clearly on both metrics compared with logistic regression (F1 0.72 vs. 0.67, AUC-ROC 0.78 vs. 0.70).
It has a better balance between precision and recall, although a single tree may be less robust. <br>
Finally, Random Forest performs best on all metrics and shows a solid balance between classes. Its confusion matrix shows that for class 0 (customers who stay) 488 items were classified correctly and 129 were misclassified, and for class 1 (customers who churn) 451 were classified correctly and 155 were misclassified.
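
The reported Random Forest metrics can be reproduced by hand from its confusion matrix, which also makes the link between the matrix and the F1 score explicit. The sketch assumes scikit-learn's layout (rows are true classes, columns are predicted classes) and treats class 1 as the positive class.

```python
# Confusion matrix reported for the Random Forest:
# [[488 129]
#  [155 451]]
tn, fp = 488, 129  # true class 0: correctly kept, wrongly flagged as churn
fn, tp = 155, 451  # true class 1: missed churners, correctly flagged churners

precision = tp / (tp + fp)  # share of predicted churners who really churned
recall = tp / (tp + fn)     # share of actual churners the model caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.4f} recall={recall:.4f} "
      f"f1={f1:.4f} accuracy={accuracy:.4f}")
```

The resulting F1 rounds to 0.7605, matching the value printed by `f1_score` above.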

## Conclusion.

In this project, we sought to predict customer churn at Beta Bank using historical data and a machine learning-based approach. After a baseline model exposed the class imbalance, two balancing techniques (undersampling and oversampling) were tested, and three models were then compared on the undersampled data. <br>

Of the three models evaluated (Logistic Regression, Decision Tree, and Random Forest), the Random Forest model proved to be the most effective, achieving an F1 score of 0.761 and an AUC-ROC of 0.849 on the downsampled test split. This indicates a good balance between precision and recall, as well as a good ability to differentiate between customers who churn and those who remain with the bank. <br>

The Random Forest model's confusion matrix shows the fewest classification errors of the three models, both for customers who do not churn (Class 0) and for those who do (Class 1). This makes it a promising tool for identifying customers at risk of churn and designing personalized retention strategies.