# Beta Bank Customer Retention

## Introduction

This project predicts customer churn using machine learning. Customer churn occurs when a customer stops doing business with a company. Predicting churn matters to Beta Bank because it helps the bank identify customers who are likely to leave and take proactive steps to retain them.

The dataset contains information about the bank's customers and whether they exited (churned) or not. It includes customer attributes such as credit score, geography, gender, age, tenure, and account balance.

The project involves the following steps:
- The data is loaded, explored, and preprocessed. This includes handling missing values, converting data types, and dropping unnecessary columns.
- The target variable is imbalanced: more customers stayed with the bank than left. Upsampling the minority class and downsampling the majority class are used to address this imbalance.
- A Logistic Regression model is trained on the preprocessed data, and its performance is evaluated using the F1 score and AUC-ROC.
- The model is then retrained on the upsampled and the downsampled data, and the results are compared with those of the original model.

The goal of this project is to build a model that can accurately predict customer churn. The insights gained from this project could potentially be used to improve Beta Bank's customer retention strategies.

## Prepare the data

In [121]:
# Import libraries
import pandas as pd
import numpy as np
import warnings

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.metrics import accuracy_score

from sklearn.utils import resample

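# Suppress UserWarning messages to keep the notebook output readable.
# Note: scikit-learn's ConvergenceWarning subclasses UserWarning, so it is hidden too.
warnings.filterwarnings(action='ignore', category=UserWarning)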

In [122]:
# Read the data
data = pd.read_csv('https://practicum-content.s3.us-west-1.amazonaws.com/datasets/Churn.csv')

# Examine the data
data.info()
display(data.sample(10))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           9091 non-null   float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


index,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
6456,6457,15755978,Tseng,606,France,Male,31,10.0,0.0,2,1,0,195209.4,0
4466,4467,15595160,Renwick,413,Spain,Male,35,2.0,0.0,2,1,1,60972.84,0
8539,8540,15668594,Diggs,620,Germany,Female,25,,137712.01,1,1,1,76197.05,0
2756,2757,15765806,Wu,492,France,Male,29,1.0,144591.96,1,1,1,196293.76,0
1510,1511,15786199,Hsing,535,France,Male,33,2.0,133040.32,1,1,1,110299.78,0
6658,6659,15777873,Downer,628,France,Female,31,5.0,0.0,1,0,0,147963.07,1
4862,4863,15686780,Rogova,645,Spain,Female,55,1.0,133676.65,1,0,1,17095.49,0
7851,7852,15651581,Lavrentyev,758,Germany,Male,68,6.0,112595.85,1,1,0,35865.44,1
2288,2289,15579166,Munro,619,France,Female,30,7.0,70729.17,1,1,1,160948.87,0
6877,6878,15695148,Ibeabuchi,614,Spain,Female,37,9.0,0.0,2,1,1,62023.1,0


In [123]:
# Check for duplicates
print(data.duplicated().sum())

0


There are no duplicate rows, so we can move on.

In [124]:
# Check for missing values
print(data.isnull().sum())

RowNumber            0
CustomerId           0
Surname              0
CreditScore          0
Geography            0
Gender               0
Age                  0
Tenure             909
Balance              0
NumOfProducts        0
HasCrCard            0
IsActiveMember       0
EstimatedSalary      0
Exited               0
dtype: int64


There are 909 missing values in the 'Tenure' column. Some models cannot handle missing values, so we fill them with the median tenure. We also convert 'Tenure' to an integer type if every value is a whole number.

In [125]:
# Fill missing values in 'Tenure' with the median value.
# Assigning the result back avoids the chained-assignment warning that inplace=True can raise.
data['Tenure'] = data['Tenure'].fillna(data['Tenure'].median())

# Check whether it is safe to convert 'Tenure' from float to int. If so, convert it.
if np.array_equal(data['Tenure'], data['Tenure'].astype('int')):
    data['Tenure'] = data['Tenure'].astype('int')

print(data.info())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           10000 non-null  int32  
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(2), int32(1), int64(8), object(3)
memory usage: 1.0+ MB
None


We will now remove the columns that are not needed.

In [126]:
# Drop the columns that are not needed for the model
data = data.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1)

display(data.sample(10))

index,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
6358,652,France,Female,32,2,0.0,2,1,0,54628.11,0
7233,571,France,Male,38,1,121405.04,1,1,1,154844.22,0
4950,628,Germany,Female,45,6,53667.44,1,1,0,115022.94,0
6529,836,Spain,Female,37,5,0.0,2,1,0,111324.41,0
6062,718,Spain,Male,32,8,0.0,2,1,1,41399.33,0
6906,688,Spain,Female,46,3,0.0,2,0,1,104902.68,0
7751,750,Spain,Female,39,6,0.0,2,0,0,19264.33,0
1473,552,France,Male,36,8,0.0,2,0,0,132547.02,0
9357,418,France,Female,46,9,0.0,1,1,1,81014.5,1
8294,704,Spain,Female,36,2,175509.8,2,1,0,152039.67,0


These columns were dropped because they do not contribute to the model's prediction of customer churn. RowNumber is an index column that carries no meaningful information. CustomerId is a unique identifier for each customer; including it could lead the model to associate outcomes with individual IDs, which would not generalize to unseen data. Surname is the customer's last name, which is unlikely to influence the likelihood of churn.

In [127]:
# Convert categorical data into numerical data
data = pd.get_dummies(data, drop_first=True)

display(data.sample(10))

index,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Geography_Germany,Geography_Spain,Gender_Male
9413,751,44,10,0.0,2,1,0,170634.49,0,False,False,False
8938,693,47,8,107604.66,1,1,1,80149.27,0,False,True,True
2565,705,50,5,77065.9,2,0,1,145159.26,0,False,False,False
5585,432,38,2,135559.8,2,1,1,71856.3,0,True,False,True
8885,668,45,4,102486.21,2,1,1,158379.25,0,False,True,True
6149,643,34,6,0.0,2,1,1,116046.22,0,False,True,True
5639,523,61,8,66250.71,1,1,1,21859.06,0,False,False,True
1685,613,20,0,117356.19,1,0,0,113557.7,1,True,False,False
7280,804,55,5,0.0,2,1,1,118752.6,0,False,False,True
800,605,52,7,0.0,2,1,1,173952.5,0,False,False,True


The new dataframe has each category encoded in its own column. To avoid the dummy variable trap, the drop_first=True argument of get_dummies drops the first category of each feature, so there is no Geography_France column. A customer whose Geography_Germany and Geography_Spain values are both False is in France. Likewise, a customer whose Gender_Male value is False is female.
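The encoding can be illustrated on a small toy frame. This is a minimal sketch: the rows below are hypothetical and only the column names mirror the dataset.

```python
import pandas as pd

# Toy frame with hypothetical rows; only the column names mirror the dataset.
toy = pd.DataFrame({
    'Geography': ['France', 'Germany', 'Spain', 'France'],
    'Gender': ['Female', 'Male', 'Male', 'Female'],
    'Age': [31, 45, 29, 52],
})

# drop_first=True drops the first category of each categorical column
# (France and Female here), leaving k-1 indicator columns per feature.
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())

# A row whose geography indicators are both off is France by elimination.
geo_cols = ['Geography_Germany', 'Geography_Spain']
is_france = encoded[geo_cols].sum(axis=1) == 0
print(is_france.tolist())
```

Keeping all three geography columns would make them sum to one in every row, which is redundant with the model's intercept; dropping one column removes that redundancy.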

In [128]:
# Split the data into features and target
# The 'Exited' column is the target, the rest are features
features = data.drop('Exited', axis=1)
target = data['Exited']

# Split the data into training and testing sets
from sklearn.model_selection import train_test_split
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.2, random_state=42)

In [129]:
# Examine the balance of classes
class_counts = target.value_counts()
print(class_counts)


Exited
0    7963
1    2037
Name: count, dtype: int64


The output shows the number of customers who stayed with the bank (0) versus those who left (1). Significantly more customers stayed than left.

In [130]:
# Calculate the imbalance ratio
imbalance_ratio = class_counts[0] / class_counts[1]
print(f'Imbalance Ratio: {imbalance_ratio}')

Imbalance Ratio: 3.9091801669121256


At the time the data was collected, about 3.9 customers had stayed for every customer who left.
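With an imbalance of this size, a random split can shift the class proportions slightly between the training and test sets. One common option, not used in this notebook, is to pass `stratify` to `train_test_split`. The sketch below uses a synthetic target with the same class counts as the output above; the features are random noise, and the names `X` and `y` are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real target: 7963 zeros and 2037 ones,
# matching the class counts printed above. The features are random noise.
rng = np.random.default_rng(42)
y = np.array([0] * 7963 + [1] * 2037)
X = rng.normal(size=(len(y), 3))

# stratify=y keeps the class proportions nearly identical in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

print(f'Churn rate, full data: {y.mean():.4f}')
print(f'Churn rate, train:     {y_train.mean():.4f}')
print(f'Churn rate, test:      {y_test.mean():.4f}')
```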

## Train the model

In [131]:
# Train a Logistic Regression model without considering the imbalance
model = LogisticRegression(random_state=42)
model.fit(features_train, target_train)

In [132]:
# Make predictions on the test set
predictions = model.predict(features_test)

In [133]:
# Evaluate the model
print(classification_report(target_test, predictions))

              precision    recall  f1-score   support

           0       0.81      0.98      0.89      1607
           1       0.45      0.07      0.12       393

    accuracy                           0.80      2000
   macro avg       0.63      0.53      0.51      2000
weighted avg       0.74      0.80      0.74      2000



Here's what can be seen from the results:
- Precision is the ratio of correctly predicted observations of a class to all observations predicted as that class. It is high for class 0 (.81) but relatively low for class 1 (.45).
- Recall is the ratio of correctly predicted observations of a class to all actual observations of that class. The recall is .98 for class 0 but only .07 for class 1, so the model misses most of the customers who left.
- The F1 score is the harmonic mean of precision and recall, so it takes both false positives and false negatives into account. It is a better measure than accuracy for an uneven class distribution such as the one in our data. The F1 score is .89 for class 0 and .12 for class 1.
- Support is the number of actual occurrences of each class in the test set: 1607 for class 0 and 393 for class 1.

From these metrics, we can conclude that the model performs well at predicting customers who did not exit (0), but poorly at predicting customers who exited (1).
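To make the metric definitions concrete, here is a small worked example that computes precision, recall, and F1 for the positive class by hand and checks the result against scikit-learn. The labels are hypothetical and are not taken from the bank data.

```python
from sklearn.metrics import f1_score

# Hypothetical labels: 10 customers, 4 of whom actually churned (1).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]

# Count the confusion-matrix cells for the positive class by hand.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

precision = tp / (tp + fp)   # 1 / 2 = 0.5
recall = tp / (tp + fn)      # 1 / 4 = 0.25
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f'precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}')
print('matches sklearn:', abs(f1 - f1_score(y_true, y_pred)) < 1e-9)
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a model with very low recall, like the one above, cannot earn a high F1 score from precision alone.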

## Improve the model

We will use upsampling and downsampling to improve our model.

#### Upsampling

In [134]:
# Upsampling
# Separate majority and minority classes
data_majority = data[data.Exited==0]
data_minority = data[data.Exited==1]

In [135]:
# Upsample minority class
data_minority_upsampled = resample(data_minority, 
                                 replace=True,     # sample with replacement
                                 n_samples=len(data_majority),    # to match majority class
                                 random_state=42) # reproducible results

In [136]:
# Combine majority class with upsampled minority class
data_upsampled = pd.concat([data_majority, data_minority_upsampled])

# Display new class counts
print(data_upsampled.Exited.value_counts())

Exited
0    7963
1    7963
Name: count, dtype: int64


Here's what we did:
- We separated the majority and minority classes into two dataframes. The majority of customers continued their service, while a minority took their business elsewhere.
- The resample function upsampled the minority class. The 'replace=True' argument allows the same observation to be sampled more than once. The 'n_samples=len(data_majority)' argument sets the number of samples to draw so that it equals the length of data_majority.
- The upsampled minority class is then combined with the majority class to create the data_upsampled dataframe, in which both classes are equally represented.

We can see that the new dataset has an equal number of instances in both classes. The class imbalance has been removed through upsampling.
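One thing to keep in mind is that upsampling the full dataset before splitting can place copies of the same row in both the training and test sets. A common variant, which differs from the approach in this notebook, is to split first and upsample only the training portion. The sketch below uses synthetic data; the column `x` and the class sizes are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in data; 'Exited' mirrors the target name, 'x' is hypothetical.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    'x': rng.normal(size=1000),
    'Exited': [0] * 800 + [1] * 200,
})

# Split first, so test rows are never duplicated into the training data.
train, test = train_test_split(
    df, test_size=0.2, random_state=42, stratify=df['Exited'])

# Upsample the minority class inside the training split only.
train_majority = train[train['Exited'] == 0]
train_minority = train[train['Exited'] == 1]
train_minority_up = resample(
    train_minority, replace=True,
    n_samples=len(train_majority), random_state=42)
train_balanced = pd.concat([train_majority, train_minority_up])

print(train_balanced['Exited'].value_counts())
print(test['Exited'].value_counts())

# No test row appears in the balanced training data.
overlap = set(train_balanced.index) & set(test.index)
print('Rows shared between train and test:', len(overlap))
```

With this ordering the test set keeps the original class ratio, so its scores reflect how the model would behave on unseen customers.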

In [137]:
# Train a model using the upsampled data
# Separate input features (features) and target variable (target)
target_upsampled = data_upsampled.Exited
features_upsampled = data_upsampled.drop('Exited', axis=1)

In [138]:
# Perform train-test split
features_train, features_test, target_train, target_test = train_test_split(
    features_upsampled, target_upsampled, test_size=0.2, random_state=42)

In [139]:
# Train model
clf_1 = LogisticRegression().fit(features_train, target_train)

# Predict on test set
pred_y_1 = clf_1.predict(features_test)

In [140]:
# Is our model still predicting just one class?
print('Predicting classes:', np.unique(pred_y_1))

# How's our accuracy?
print('Accuracy score:', accuracy_score(target_test, pred_y_1))

# What about AUROC?
prob_y_1 = clf_1.predict_proba(features_test)
prob_y_1 = [p[1] for p in prob_y_1]
print('AUC-ROC score:', roc_auc_score(target_test, prob_y_1))

Predicting classes: [0 1]
Accuracy score: 0.6603892027620841
AUC-ROC score: 0.7108317027946743


We trained a Logistic Regression model on the upsampled data and evaluated its performance:

- The target variable is separated from the input features.
- The data is split into a training set and a test set.
- The model is trained.
- The trained model is used to make predictions on the test set.
- The unique predicted classes are printed. Seeing both 0 and 1 means the model is predicting both classes.
- Accuracy is calculated and printed.
- AUC-ROC (Area Under the Receiver Operating Characteristic Curve) is calculated from the predicted probabilities. It summarizes the trade-off between the true positive rate and the false positive rate across every possible cut-off.

We can see that the model is predicting both classes 0 and 1. The accuracy is about 66%, while the AUC-ROC is about 0.71. Because the upsampling was done before the split, copies of the same minority-class row can land in both the training set and the test set, so these scores may be optimistic.
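AUC-ROC can also be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below checks that interpretation against `roc_auc_score` and shows what happens when hard class labels are scored instead of probabilities. The labels and probabilities are made up for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted churn probabilities.
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 0])
y_prob = np.array([0.10, 0.35, 0.40, 0.45, 0.60, 0.70, 0.80, 0.20])

# AUC-ROC equals the fraction of (positive, negative) pairs in which
# the positive example gets the higher score (ties count as half).
pos = y_prob[y_true == 1]
neg = y_prob[y_true == 0]
pairs = [(p > n) + 0.5 * (p == n) for p in pos for n in neg]
auc_by_hand = float(np.mean(pairs))

auc_sklearn = roc_auc_score(y_true, y_prob)

# Thresholding at 0.5 first discards the ranking information,
# leaving a single operating point instead of the whole curve.
y_label = (y_prob >= 0.5).astype(int)
auc_from_labels = roc_auc_score(y_true, y_label)

print(f'By hand:          {auc_by_hand:.4f}')
print(f'roc_auc_score:    {auc_sklearn:.4f}')
print(f'From hard labels: {auc_from_labels:.4f}')
```

For this reason the second column of `predict_proba` is usually passed to `roc_auc_score` rather than the output of `predict`.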

#### Downsampling

Downsampling is the opposite of upsampling. We randomly select a subset of the majority class, without replacement, so that its size matches the minority class. We then combine the two into a new, balanced dataframe.

In [141]:
# Downsample majority class
data_majority_downsampled = resample(data_majority, 
                                 replace=False,    # sample without replacement
                                 n_samples=len(data_minority),     # to match minority class
                                 random_state=42) # reproducible result

In [142]:
# Combine minority class with downsampled majority class
data_downsampled = pd.concat([data_majority_downsampled, data_minority])

# Display new class counts
print(data_downsampled.Exited.value_counts())

Exited
0    2037
1    2037
Name: count, dtype: int64


The dataframe is balanced with the same number of observations for majority and minority classes.

In [143]:
# Train model using the downsampled data
# Separate input features (features) and target variable (target)
target_downsampled = data_downsampled.Exited
features_downsampled = data_downsampled.drop('Exited', axis=1)

In [144]:
# Perform train-test split
features_train, features_test, target_train, target_test = train_test_split(
    features_downsampled, target_downsampled, test_size=0.2, random_state=42)

In [145]:
# Train model
clf_2 = LogisticRegression().fit(features_train, target_train)

# Predict on test set
pred_y_2 = clf_2.predict(features_test)

In [146]:
# Is our model still predicting just one class?
print('Predicting classes:', np.unique(pred_y_2))

# How's our accuracy?
print('Accuracy score:', accuracy_score(target_test, pred_y_2))

# What about AUROC?
prob_y_2 = clf_2.predict_proba(features_test)
prob_y_2 = [p[1] for p in prob_y_2]
print('AUC-ROC score:', roc_auc_score(target_test, prob_y_2))

Predicting classes: [0 1]
Accuracy score: 0.6478527607361964
AUC-ROC score: 0.6913760042719485


Similar to upsampling, we did the following:
- The majority class is downsampled.
- The downsampled majority class is combined with the minority class into a new, balanced dataframe in which both classes are equally represented.
- The target variable is separated from the input features.
- The data is split into a training set and a test set.
- A Logistic Regression model is trained.
- The trained model is used to make predictions on the test set.
- The unique predicted classes are 0 and 1, so the model is predicting both classes.
- Accuracy and AUC-ROC are calculated and printed.

We can see that the accuracy is about 65%, while the AUC-ROC is approximately 0.69.

## Testing

In [147]:
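# Make predictions with both models on the current test set.
# features_test and target_test now hold the split of the downsampled data (In [144]);
# clf_1 was trained on upsampled data, which may include some of these rows.
pred_y_1 = clf_1.predict(features_test)
pred_y_2 = clf_2.predict(features_test)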

In [148]:
# Calculate F1 score for both models
f1_score_1 = f1_score(target_test, pred_y_1)
f1_score_2 = f1_score(target_test, pred_y_2)

In [149]:
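# Calculate AUC-ROC for both models.
# Note: these calls score the hard class labels, not predict_proba output,
# so the values are not comparable with the probability-based AUC-ROC printed earlier.
roc_auc_score_1 = roc_auc_score(target_test, pred_y_1)
roc_auc_score_2 = roc_auc_score(target_test, pred_y_2)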

In [150]:
# Print the scores
print(f'Upsampled Model: F1 Score = {f1_score_1}, AUC-ROC = {roc_auc_score_1}')
print(f'Downsampled Model: F1 Score = {f1_score_2}, AUC-ROC = {roc_auc_score_2}')

Upsampled Model: F1 Score = 0.6341463414634145, AUC-ROC = 0.6514023398626181
Downsampled Model: F1 Score = 0.6408010012515645, AUC-ROC = 0.6510837641690332


We compared the performance of the two models: one trained on upsampled data and the other trained on downsampled data.
- We used the predict function to make predictions with both models on the test set from the downsampled split.
- We calculated the F1 score for both models.
- We calculated the AUC-ROC score for both models from the predicted class labels.

From the output, we can see that the two models have similar F1 scores (about .63 and .64) and nearly identical AUC-ROC scores (about .65). Both models perform similarly on this test set.
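When comparing models, it can help to wrap the scoring in a single helper so that every model is evaluated in the same way. The sketch below is one possible version: `evaluate` is a hypothetical helper name, the data is synthetic (generated with `make_classification`), and AUC-ROC is computed from predicted probabilities rather than class labels.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split


def evaluate(model, features, target):
    """Return (F1, AUC-ROC) for a fitted binary classifier.

    F1 uses the hard class labels; AUC-ROC uses the predicted
    probability of the positive class.
    """
    labels = model.predict(features)
    probs = model.predict_proba(features)[:, 1]
    return f1_score(target, labels), roc_auc_score(target, probs)


# Synthetic, imbalanced stand-in data (roughly 80/20), not the bank dataset.
X, y = make_classification(
    n_samples=2000, n_features=8, weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
f1, auc = evaluate(model, X_test, y_test)
print(f'F1 = {f1:.3f}, AUC-ROC = {auc:.3f}')
```

Calling the same helper for each candidate model on one untouched test set keeps the comparison consistent.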

Before the imbalance was addressed, the model performed poorly on the minority class. The F1 score for class 1 (customers who left) was about .12, even though the F1 score for class 0 was .89.

After upsampling and downsampling, the F1 score for class 1 rose to about .63-.64, which indicates better performance on the minority class. Note that the original model was evaluated on a test set with the original class ratio, while the rebalanced models were evaluated on a balanced test set, so part of this gain may reflect the change in test data.

## Conclusion

This project involved building a machine learning model to predict customer churn. The dataset was initially imbalanced, with far more customers who continued their business with Beta Bank than customers who left. The initial model, which was trained without addressing the imbalance, performed poorly and had a low F1 score for the minority class. After the class imbalance was addressed using both upsampling and downsampling, the F1 score improved from .12 to around .63-.64. The AUC-ROC scores of the rebalanced models ranged from about .65 to .71, depending on how they were computed.

This project demonstrates the importance of properly preprocessing the data, handling class imbalance, and choosing the right evaluation metrics when working with imbalanced datasets. It also shows the iterative process of building a model and improving it based on its performance.