# Beta Bank Customer Retention

## Introduction

This project predicts customer churn using machine learning techniques. Customer churn occurs when a customer stops doing business with a company. Predicting churn matters to Beta Bank because it lets the bank identify customers who are likely to leave and take proactive steps to retain them.

The dataset used contains information about the bank's customers and whether they exited (churned) or not. The data includes customer information such as credit score, gender, age, geography, etc.

The project involves the following steps:
- Data is loaded, explored, and preprocessed. This includes handling missing values, converting data types, and dropping unnecessary columns.
- The target variable is imbalanced, with more customers staying than leaving. Upsampling the minority class and downsampling the majority class are used to address this imbalance.
- Logistic Regression and Decision Tree models are trained on the preprocessed data. Each model's performance is evaluated using the F1 score and AUC-ROC.
- Each model is then improved using class weight adjustment, upsampling, downsampling, and threshold adjustment. The results before and after each improvement are compared.

The goal of this project is to build a model that can accurately predict customer churn. The insights gained from this project could potentially be used to improve Beta Bank's customer retention strategies.

## Prepare the data

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import warnings

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, f1_score, roc_auc_score, accuracy_score

from sklearn.utils import resample
from sklearn.utils import shuffle

from sklearn.exceptions import FitFailedWarning
warnings.filterwarnings(action='ignore', category=UserWarning)
warnings.filterwarnings('ignore')

In [2]:
# Read the data
try:
    data = pd.read_csv('./datasets/Churn.csv')
except FileNotFoundError:
    # Fall back to the hosted copy when the local file is not available
    data = pd.read_csv('https://practicum-content.s3.us-west-1.amazonaws.com/datasets/Churn.csv')

# Examine the data
data.info()
display(data.sample(10))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           9091 non-null   float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
9704,9705,15759872,L?,625,France,Male,22,9.0,0.0,2,1,0,157072.91,0
6370,6371,15798200,Manna,707,France,Male,35,,0.0,3,1,1,94148.3,0
229,230,15605461,Lucas,594,Germany,Female,29,3.0,130830.22,1,1,0,61048.53,0
8164,8165,15581370,Andreyeva,681,Spain,Male,38,2.0,99811.44,2,1,0,23531.5,0
2878,2879,15667751,Herrera,487,Spain,Female,36,1.0,140137.15,1,1,0,194073.33,0
9082,9083,15753161,Dickson,768,France,Female,36,,180169.44,2,1,0,17348.56,0
5692,5693,15662662,Duigan,573,France,Female,30,6.0,0.0,2,1,0,66190.21,0
1976,1977,15694192,Nwankwo,598,Spain,Female,38,6.0,0.0,2,0,0,173783.38,0
6608,6609,15576000,Chibueze,765,France,Male,40,,138033.55,1,1,1,67972.45,0
6688,6689,15814267,Zhdanova,550,France,Male,22,6.0,154377.3,1,1,1,51721.52,0


In [3]:
# Check for duplicates
print(data.duplicated().sum())

0


There are no duplicate rows, so we can move on.

In [4]:
# Check for missing values
print(data.isnull().sum())

RowNumber            0
CustomerId           0
Surname              0
CreditScore          0
Geography            0
Gender               0
Age                  0
Tenure             909
Balance              0
NumOfProducts        0
HasCrCard            0
IsActiveMember       0
EstimatedSalary      0
Exited               0
dtype: int64


There are 909 missing values in the 'Tenure' column. Some models cannot handle missing values, so we fill them with the median tenure. We also convert 'Tenure' to an integer type if every value is a whole number.

In [5]:
# Fill missing values in 'Tenure' with the median value
# (assigning the result back avoids the chained-assignment pitfall of inplace=True)
data['Tenure'] = data['Tenure'].fillna(data['Tenure'].median())

# Check whether it is safe to convert 'Tenure' from float to int. If so, convert it.
if np.array_equal(data['Tenure'], data['Tenure'].astype('int')):
    data['Tenure'] = data['Tenure'].astype('int')

print(data.info())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           10000 non-null  int32  
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(2), int32(1), int64(8), object(3)
memory usage: 1.0+ MB
None


We will now remove the columns that are not needed.

In [6]:
# Drop the columns that are not needed for the model
data = data.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1)

display(data.sample(10))

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
1068,594,France,Male,57,6,0.0,1,1,0,19376.56,1
8545,513,Germany,Male,34,7,60515.13,1,0,0,124571.09,0
8796,720,France,Male,33,2,0.0,2,0,1,141031.08,0
4946,546,France,Female,47,5,0.0,1,1,1,66408.01,1
1412,673,Germany,Female,29,4,99097.36,1,1,1,9796.69,0
9710,543,Germany,Female,37,3,122304.65,2,0,0,33998.7,0
9987,606,Spain,Male,30,8,180307.73,2,1,1,1914.41,0
1885,563,Spain,Male,32,6,0.0,2,1,1,19720.08,0
9821,652,Spain,Male,28,8,156823.7,2,1,0,198251.52,0
7218,757,France,Male,36,7,144852.06,1,0,0,130861.95,0


These columns were dropped because they do not help the model predict customer churn. RowNumber is an index column that carries no meaningful information. CustomerId is a unique identifier for each customer; including it could lead the model to associate outcomes with individual IDs, which would not generalize to unseen data. Surname is the customer's last name, which is unlikely to influence their likelihood of churning.

In [7]:
# Convert categorical data into numerical data
data = pd.get_dummies(data, drop_first=True)

display(data.sample(10))

Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Geography_Germany,Geography_Spain,Gender_Male
5789,689,55,1,76296.81,1,1,0,42364.75,1,True,False,False
5516,592,37,3,96651.03,1,1,1,3232.82,0,False,False,True
1642,626,62,3,0.0,1,1,1,65010.74,0,False,True,True
2941,555,26,9,0.0,2,0,1,158918.03,0,False,True,False
4973,617,24,4,137295.19,2,1,1,91195.12,0,False,False,False
6064,772,23,2,0.0,2,1,0,18364.19,0,False,False,True
8089,692,24,2,120596.93,1,0,1,180490.53,0,True,False,True
5822,624,35,2,0.0,2,1,0,87310.59,0,False,True,True
5265,560,27,5,0.0,2,1,0,131919.48,0,False,False,False
7022,493,54,3,167831.88,2,1,0,150159.95,1,False,False,True


The categorical columns are now encoded as separate indicator columns. To avoid the dummy variable trap, drop_first=True omits the first category of each column, so there is no Geography_France column: a customer is in France when both Geography_Germany and Geography_Spain are False. Likewise, a customer is female when Gender_Male is False.
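
As a small illustration of the encoding described above, the sketch below applies get_dummies to a toy frame. The frame `toy` is made up for this example and is not part of the churn data.

```python
import pandas as pd

# Toy frame with the same two categorical columns as the churn data
toy = pd.DataFrame({
    'Geography': ['France', 'Germany', 'Spain'],
    'Gender': ['Female', 'Male', 'Female'],
})

# drop_first=True drops the first category of each column in sorted order
# (France and Female), which then act as the implicit baseline
encoded = pd.get_dummies(toy, drop_first=True)

print(list(encoded.columns))
print(encoded)
```

The first row (France, Female) has every indicator set to False, which is how the baseline categories are represented.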

In [8]:
# Split the data into features and target
# The 'Exited' column is the target, the rest are features
features = data.drop('Exited', axis=1)
target = data['Exited']

# First, split the data into a training set (60% of the data) and a temp set (40%)
features_train, features_temp, target_train, target_temp = train_test_split(
    features, target, test_size=0.4, random_state=42)

# Then, split the temp set into a validation set (50% of the temp)
# and a testing set (50% of the temp)
# This will result in a 20/20 split of the entire dataset for validation/testing
features_valid, features_test, target_valid, target_test = train_test_split(
    features_temp, target_temp, test_size=0.5, random_state=42)
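
To make the 60/20/20 arithmetic concrete, here is a minimal sketch that repeats the two-step split on a stand-in list of 10,000 row numbers. The names `rows`, `train`, `temp`, `valid`, and `test` are placeholders for this example only.

```python
from sklearn.model_selection import train_test_split

rows = list(range(10000))  # stand-in for the 10,000 customer rows

# Step 1: 60% training, 40% temporary
train, temp = train_test_split(rows, test_size=0.4, random_state=42)
# Step 2: split the temporary 40% in half for validation and testing
valid, test = train_test_split(temp, test_size=0.5, random_state=42)

print(len(train), len(valid), len(test))
```

The training size matches the 6,000 training rows (4773 + 1227) reported by the experiments later in the notebook.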

In [9]:
# Examine the balance of classes
class_counts = target.value_counts()
print(class_counts)


Exited
0    7963
1    2037
Name: count, dtype: int64


The counts show how many customers stayed with the bank (0) and how many left (1). Far more customers stayed than left.

In [10]:
# Calculate the imbalance ratio
imbalance_ratio = class_counts[0] / class_counts[1]
print(f'Imbalance Ratio: {imbalance_ratio}')

Imbalance Ratio: 3.9091801669121256


At the time the data was collected, there were about 3.9 times as many customers who stayed as customers who left.
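
The imbalance is the reason the models are judged by F1 score rather than accuracy alone. The sketch below works through the arithmetic for a hypothetical baseline that predicts "stayed" for every customer, using the class counts printed above. The baseline is an illustration and is not one of the models in this project.

```python
# Class counts reported by value_counts() above
stayed, exited = 7963, 2037
total = stayed + exited

imbalance_ratio = stayed / exited

# Hypothetical baseline: predict "stayed" (0) for every customer
true_positives = 0         # it never predicts churn
false_positives = 0
false_negatives = exited   # every churner is missed

baseline_accuracy = stayed / total
# F1 = 2*TP / (2*TP + FP + FN)
baseline_f1 = 2 * true_positives / (2 * true_positives + false_positives + false_negatives)

print(f'ratio={imbalance_ratio:.2f}, accuracy={baseline_accuracy:.4f}, f1={baseline_f1:.1f}')
```

Such a baseline would be right about 80% of the time while identifying no churners at all, so a high accuracy on its own says little here.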

In [11]:
# Shared variables
# Separate the majority (0) and minority (1) classes within the training set.
# The mask is built from target_train so that its index matches the training rows.
features_zeros = features_train[target_train == 0]
features_ones = features_train[target_train == 1]
target_zeros = target_train[target_train == 0]
target_ones = target_train[target_train == 1]

# Candidate decision thresholds used by the threshold-adjustment experiments
thresholds = np.arange(0, 1, 0.05)

## Logistic Regression

Here, we have the hyperparameters that we will use for our logistic regression models.

In [12]:
hyperparameters_lr = {
    'solver': ['newton-cholesky', 'saga', 'liblinear', 'newton-cg', 'lbfgs', 'sag'],
    'class_weight': ['balanced', None],
    'thresholds': thresholds
}

Define a function that will be used for all the logistic regression models.

In [13]:
def logistic_regression_experiment(hyperparameters=hyperparameters_lr, 
                                   features_train=features_train, 
                                   target_train=target_train, 
                                   features_valid=features_valid, 
                                   target_valid=target_valid, 
                                   method=None):
    results = []

    if method == 'upsampling':
        # Upsample minority class to match the number of samples in majority class
        features_upsampled = pd.concat([features_zeros] + [resample(features_ones, replace=True, n_samples=len(features_zeros), random_state=42)])
        target_upsampled = pd.concat([target_zeros] + [resample(target_ones, replace=True, n_samples=len(target_zeros), random_state=42)])
        print(target_upsampled.value_counts())
    elif method == 'downsampling':
        # Downsample the majority class
        features_downsampled = pd.concat([resample(features_zeros, replace=False, n_samples=len(features_ones), random_state=42)] + [features_ones])
        target_downsampled = pd.concat([resample(target_zeros, replace=False, n_samples=len(target_ones), random_state=42)] + [target_ones])
        print(target_downsampled.value_counts())
    else:
        print(target_train.value_counts())

    # Loop over each solver and run code based on which improvement method is used
    for solver in hyperparameters['solver']:
        model = LogisticRegression(random_state=42, solver=solver)
        
        if method == 'class_weight':
            # Loop over all possible values of class_weight in hyperparameters
            for class_weight in hyperparameters['class_weight']:
                # Set the class_weight of the model
                model.class_weight = class_weight
                # Fit the model on the training data
                model.fit(features_train, target_train)
                # Predict the target for the validation data
                predicted_valid = model.predict(features_valid)
                # Get the probabilities of the positive class for the validation data
                probabilities_valid = model.predict_proba(features_valid)[:, 1]
                # Calculate the accuracy of the model on the validation data
                accuracy = accuracy_score(target_valid, predicted_valid)
                # Calculate the F1 score of the model on the validation data
                f1 = f1_score(target_valid, predicted_valid)
                # Calculate the AUC-ROC of the model on the validation data
                auc_roc = roc_auc_score(target_valid, probabilities_valid)
                results.append(['LogisticRegression', solver, 'class_weight', class_weight, accuracy, f1, auc_roc])
        elif method == 'upsampling':
            # Shuffle the dataset
            features_upsampled, target_upsampled = shuffle(features_upsampled, target_upsampled, random_state=42)
            # Fit the model on the upsampled training data
            model.fit(features_upsampled, target_upsampled)
            # Predict the target for the validation data
            predicted_valid = model.predict(features_valid)
            # Get the probabilities of the positive class for the validation data
            probabilities_valid = model.predict_proba(features_valid)[:, 1]
            # Calculate the accuracy of the model on the validation data
            accuracy = accuracy_score(target_valid, predicted_valid)
            # Calculate the F1 score of the model on the validation data
            f1 = f1_score(target_valid, predicted_valid)
            # Calculate the AUC-ROC of the model on the validation data
            auc_roc = roc_auc_score(target_valid, probabilities_valid)
            results.append(['LogisticRegression', solver, 'upsampling', 'N/A', accuracy, f1, auc_roc])
        elif method == 'downsampling':

            # Shuffle the dataset
            features_downsampled, target_downsampled = shuffle(features_downsampled, target_downsampled, random_state=42)
            model.fit(features_downsampled, target_downsampled)
            predicted_valid = model.predict(features_valid)
            probabilities_valid = model.predict_proba(features_valid)[:, 1]
            accuracy = accuracy_score(target_valid, predicted_valid)
            f1 = f1_score(target_valid, predicted_valid)
            auc_roc = roc_auc_score(target_valid, probabilities_valid)
            results.append(['LogisticRegression', solver, 'downsampling', 'N/A', accuracy, f1, auc_roc])
        elif method == 'threshold':
            for threshold in hyperparameters['thresholds']:
                model.fit(features_train, target_train)
                probabilities_valid = model.predict_proba(features_valid)[:, 1]
                predicted_valid = probabilities_valid > threshold
                accuracy = accuracy_score(target_valid, predicted_valid)
                f1 = f1_score(target_valid, predicted_valid)
                auc_roc = roc_auc_score(target_valid, probabilities_valid)
                results.append(['LogisticRegression', solver, 'threshold', threshold, accuracy, f1, auc_roc])
        else:
            model.fit(features_train, target_train)
            predicted_valid = model.predict(features_valid)
            probabilities_valid = model.predict_proba(features_valid)[:, 1]
            accuracy = accuracy_score(target_valid, predicted_valid)
            f1 = f1_score(target_valid, predicted_valid)
            auc_roc = roc_auc_score(target_valid, probabilities_valid)
            results.append(['LogisticRegression', solver, 'None', 'None', accuracy, f1, auc_roc])

    df = pd.DataFrame(results, columns=['model_type', 'solver', 'method', 'parameter', 'accuracy', 'f1_score', 'auc_roc'])
    return df.sort_values('f1_score', ascending=False)
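
The function above repeats the same predict-and-score steps in every branch. One possible way to shorten it, shown here only as a sketch, is a small helper that scores an already fitted model. The name `evaluate_model` and the synthetic data are assumptions for this example and are not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score


def evaluate_model(model, features_valid, target_valid, threshold=None):
    """Return accuracy, F1 and AUC-ROC for an already fitted classifier."""
    probabilities = model.predict_proba(features_valid)[:, 1]
    if threshold is None:
        # Use the model's default decision rule
        predicted = model.predict(features_valid)
    else:
        # Apply a custom probability cut-off for the positive class
        predicted = (probabilities > threshold).astype(int)
    return {
        'accuracy': accuracy_score(target_valid, predicted),
        'f1_score': f1_score(target_valid, predicted),
        'auc_roc': roc_auc_score(target_valid, probabilities),
    }


# Synthetic, imbalanced stand-in data (not the churn dataset)
X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=42)
model = LogisticRegression().fit(X[:400], y[:400])
metrics = evaluate_model(model, X[400:], y[400:], threshold=0.25)
print(metrics)
```

Each branch of the experiment function could then fit its model and append the values returned by the helper.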


### Train the model

In [14]:
df_sorted = logistic_regression_experiment(method=None)

df_display = df_sorted[['solver', 'accuracy', 'f1_score', 'auc_roc']]
display(df_display)
best = df_sorted.iloc[0]
print('Best so far:\n', best)

Exited
0    4773
1    1227
Name: count, dtype: int64


Unnamed: 0,solver,accuracy,f1_score,auc_roc
0,newton-cholesky,0.817,0.296154,0.751975
3,newton-cg,0.817,0.27668,0.748735
4,lbfgs,0.7975,0.097996,0.650513
2,liblinear,0.8055,0.044226,0.635962
1,saga,0.81,0.0,0.488424
5,sag,0.81,0.0,0.500179


Best so far:
 model_type    LogisticRegression
solver           newton-cholesky
method                      None
parameter                   None
accuracy                   0.817
f1_score                0.296154
auc_roc                 0.751975
Name: 0, dtype: object


Here's what can be seen from the results:
- The solver ‘newton-cholesky’ performed the best among all the solvers with an accuracy of 0.817, f1_score of 0.296154, and auc_roc of 0.751975.
- The solvers ‘saga’ and ‘sag’ had an f1_score of 0, indicating that they were not able to correctly classify any positive instances.

### Improve the model

#### Class Weight Adjustment

Class weight adjustment is useful for data with imbalanced classes, as is the case here.
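
For reference, scikit-learn documents the 'balanced' option as weighting each class by n_samples / (n_classes * class_count). The sketch below applies that formula to the training-set counts printed earlier (4773 stayed, 1227 exited). The variable names are chosen for this example.

```python
# Training-set class counts printed by the experiment function
n_stayed, n_exited = 4773, 1227
n_samples = n_stayed + n_exited
n_classes = 2

# 'balanced' heuristic: n_samples / (n_classes * count_of_class)
weight_stayed = n_samples / (n_classes * n_stayed)
weight_exited = n_samples / (n_classes * n_exited)

print(f'weight for class 0: {weight_stayed:.3f}')
print(f'weight for class 1: {weight_exited:.3f}')
```

Errors on the minority class are therefore penalized roughly 3.9 times as heavily as errors on the majority class, which mirrors the class ratio in the training set.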

In [15]:
df_sorted = logistic_regression_experiment(method='class_weight')
df_display = df_sorted[['solver', 'parameter', 'f1_score', 'auc_roc']]
print(df_sorted.iloc[0])
display(df_display)
if df_sorted.iloc[0]['f1_score'] > best['f1_score']:
    best = df_sorted.iloc[0]
print('Best so far:\n', best)

Exited
0    4773
1    1227
Name: count, dtype: int64
model_type    LogisticRegression
solver           newton-cholesky
method              class_weight
parameter               balanced
accuracy                   0.693
f1_score                0.461404
auc_roc                 0.752815
Name: 0, dtype: object


Unnamed: 0,solver,parameter,f1_score,auc_roc
0,newton-cholesky,balanced,0.461404,0.752815
6,newton-cg,balanced,0.461404,0.752802
8,lbfgs,balanced,0.456559,0.747117
4,liblinear,balanced,0.400665,0.681732
10,sag,balanced,0.330759,0.548354
2,saga,balanced,0.329276,0.546681
1,newton-cholesky,,0.296154,0.751975
7,newton-cg,,0.27668,0.748735
9,lbfgs,,0.097996,0.650513
5,liblinear,,0.044226,0.635962


Best so far:
 model_type    LogisticRegression
solver           newton-cholesky
method              class_weight
parameter               balanced
accuracy                   0.693
f1_score                0.461404
auc_roc                 0.752815
Name: 0, dtype: object


The solver ‘newton-cholesky’ with ‘balanced’ class weights performed the best among all the combinations with an f1_score of 0.461404 and auc_roc of 0.752815. This is an improvement over the previous best f1_score of 0.296154 achieved without class weights.

#### Upsampling

In [16]:
df_sorted = logistic_regression_experiment(method='upsampling')
df_display = df_sorted[['solver', 'accuracy', 'f1_score', 'auc_roc']]
print(df_sorted.iloc[0])
display(df_display)
if df_sorted.iloc[0]['f1_score'] > best['f1_score']:
    best = df_sorted.iloc[0]
print('Best so far:\n', best)

Exited
0    4773
1    4773
Name: count, dtype: int64
model_type    LogisticRegression
solver           newton-cholesky
method                upsampling
parameter                    N/A
accuracy                   0.694
f1_score                0.464098
auc_roc                 0.753389
Name: 0, dtype: object


Unnamed: 0,solver,accuracy,f1_score,auc_roc
0,newton-cholesky,0.694,0.464098,0.753389
3,newton-cg,0.694,0.464098,0.753398
2,liblinear,0.6365,0.39768,0.682407
4,lbfgs,0.646,0.393836,0.688034
5,sag,0.482,0.332474,0.551472
1,saga,0.478,0.331626,0.547968


Best so far:
 model_type    LogisticRegression
solver           newton-cholesky
method                upsampling
parameter                    N/A
accuracy                   0.694
f1_score                0.464098
auc_roc                 0.753389
Name: 0, dtype: object


Upsampling handles class imbalance by increasing the number of instances in the minority class. Here the resample function draws minority-class rows with replacement until they match the number of majority-class rows. The solver ‘newton-cholesky’ performed the best among all the solvers with an accuracy of 0.694, f1_score of 0.464098, and auc_roc of 0.753389. This is a slight improvement over the previous best f1_score of 0.461404 achieved with the ‘class_weight’ method.
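
The following is a minimal sketch of upsampling on a toy frame, assuming features and target are kept together in one DataFrame so that rows stay aligned. The frame `toy` and its `age` column are invented for the illustration.

```python
import pandas as pd
from sklearn.utils import resample

# Toy training set: six class-0 rows and two class-1 rows
toy = pd.DataFrame({'age': [25, 31, 38, 42, 47, 53, 60, 64],
                    'Exited': [0, 0, 0, 0, 0, 0, 1, 1]})
majority = toy[toy['Exited'] == 0]
minority = toy[toy['Exited'] == 1]

# Draw minority rows with replacement until both classes are the same size
minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_upsampled])

print(balanced['Exited'].value_counts().to_dict())
```

Only the training set should be resampled; the validation and test sets keep their original class balance so the metrics reflect real conditions.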

#### Downsampling

In [17]:
df_sorted = logistic_regression_experiment(method='downsampling')
df_display = df_sorted[['solver', 'accuracy', 'f1_score', 'auc_roc']]
print(df_sorted.iloc[0])
display(df_display)
if df_sorted.iloc[0]['f1_score'] > best['f1_score']:
    best = df_sorted.iloc[0]
print('Best so far:\n', best)

Exited
0    1227
1    1227
Name: count, dtype: int64
model_type    LogisticRegression
solver           newton-cholesky
method              downsampling
parameter                    N/A
accuracy                  0.6925
f1_score                0.460999
auc_roc                 0.750268
Name: 0, dtype: object


Unnamed: 0,solver,accuracy,f1_score,auc_roc
0,newton-cholesky,0.6925,0.460999,0.750268
3,newton-cg,0.6925,0.460999,0.75026
2,liblinear,0.634,0.397035,0.679607
4,lbfgs,0.6305,0.387738,0.675699
5,sag,0.477,0.329487,0.546278
1,saga,0.477,0.326031,0.545637


Best so far:
 model_type    LogisticRegression
solver           newton-cholesky
method                upsampling
parameter                    N/A
accuracy                   0.694
f1_score                0.464098
auc_roc                 0.753389
Name: 0, dtype: object


Downsampling reduced the number of majority-class rows to match the number of minority-class rows. The solver ‘newton-cholesky’ performed the best among all the solvers with an accuracy of 0.6925, f1_score of 0.460999, and auc_roc of 0.750268. However, this f1_score is not an improvement over the previous best f1_score of 0.464098 achieved with the ‘upsampling’ method.
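
A matching sketch for downsampling is shown below, again on an invented toy frame. Here the majority rows are sampled without replacement, so no row is duplicated and some majority rows are simply discarded.

```python
import pandas as pd
from sklearn.utils import resample

# Toy training set: six class-0 rows and two class-1 rows
toy = pd.DataFrame({'age': [25, 31, 38, 42, 47, 53, 60, 64],
                    'Exited': [0, 0, 0, 0, 0, 0, 1, 1]})
majority = toy[toy['Exited'] == 0]
minority = toy[toy['Exited'] == 1]

# Keep only as many majority rows as there are minority rows
majority_downsampled = resample(majority, replace=False,
                                n_samples=len(minority), random_state=42)
balanced = pd.concat([majority_downsampled, minority])

print(balanced['Exited'].value_counts().to_dict())
```

The trade-off is that the model trains on far fewer rows (2,454 instead of 6,000 in this project), which may explain why downsampling did not beat upsampling here.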

#### Threshold Adjustment

In [18]:
df_sorted = logistic_regression_experiment(method='threshold')
df_display = df_sorted
print(df_sorted.iloc[0])
display(df_display)
if df_sorted.iloc[0]['f1_score'] > best['f1_score']:
    best = df_sorted.iloc[0]
print('Best so far:\n', best)

Exited
0    4773
1    1227
Name: count, dtype: int64
model_type    LogisticRegression
solver           newton-cholesky
method                 threshold
parameter                   0.25
accuracy                  0.7405
f1_score                0.462176
auc_roc                 0.751975
Name: 5, dtype: object


Unnamed: 0,model_type,solver,method,parameter,accuracy,f1_score,auc_roc
5,LogisticRegression,newton-cholesky,threshold,0.25,0.7405,0.462176,0.751975
65,LogisticRegression,newton-cg,threshold,0.25,0.7330,0.461694,0.748735
66,LogisticRegression,newton-cg,threshold,0.30,0.7770,0.461353,0.748735
64,LogisticRegression,newton-cg,threshold,0.20,0.6735,0.458989,0.748735
6,LogisticRegression,newton-cholesky,threshold,0.30,0.7785,0.457772,0.751975
...,...,...,...,...,...,...,...
78,LogisticRegression,newton-cg,threshold,0.90,0.8100,0.000000,0.748735
39,LogisticRegression,saga,threshold,0.95,0.8100,0.000000,0.488424
19,LogisticRegression,newton-cholesky,threshold,0.95,0.8100,0.000000,0.751975
18,LogisticRegression,newton-cholesky,threshold,0.90,0.8100,0.000000,0.751975


Best so far:
 model_type    LogisticRegression
solver           newton-cholesky
method                upsampling
parameter                    N/A
accuracy                   0.694
f1_score                0.464098
auc_roc                 0.753389
Name: 0, dtype: object


The solver ‘newton-cholesky’ with a threshold of 0.25 performed the best among all the combinations with an accuracy of 0.7405, f1_score of 0.462176, and auc_roc of 0.751975. However, this f1_score is not an improvement over the previous best f1_score of 0.464098 achieved with the ‘upsampling’ method.
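
To show why lowering the threshold can raise the F1 score, the sketch below computes F1 by hand for six made-up probabilities at two thresholds. The helper `f1_at_threshold` and the sample values are illustrative assumptions, not taken from the model output.

```python
def f1_at_threshold(probabilities, labels, threshold):
    """Compute F1 when a row is predicted positive if its probability exceeds the threshold."""
    predicted = [int(p > threshold) for p in probabilities]
    tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 0)
    denominator = 2 * tp + fp + fn
    # F1 is defined as 0 when there are no positives to count
    return 2 * tp / denominator if denominator else 0.0


# Made-up predicted probabilities and true labels
probabilities = [0.10, 0.20, 0.30, 0.40, 0.60, 0.80]
labels = [0, 0, 1, 1, 0, 1]

for threshold in (0.5, 0.25):
    print(threshold, round(f1_at_threshold(probabilities, labels, threshold), 3))
```

In this toy case the lower threshold recovers the two positives that the default 0.5 cut-off misses. AUC-ROC is computed from the probabilities themselves, which is why it stays constant for a given solver across all thresholds in the table above.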

## Decision Tree

In [19]:
hyperparameters_dt = {
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5],
    'class_weight': ['balanced', None],
    'thresholds': thresholds
}

Define a function that will be used for all the decision tree models.

In [20]:
# Create a function that will be used for all Decision Tree models
def decision_tree_experiment(hyperparameters=hyperparameters_dt, 
                             features_train=features_train, 
                             target_train=target_train, 
                             features_valid=features_valid, 
                             target_valid=target_valid, 
                             method=None):
    results = []

    if method == 'upsampling':
        # Upsample minority class to match the number of samples in majority class
        features_upsampled = pd.concat([features_zeros] + [resample(features_ones, replace=True, n_samples=len(features_zeros), random_state=42)])
        target_upsampled = pd.concat([target_zeros] + [resample(target_ones, replace=True, n_samples=len(target_zeros), random_state=42)])
        print(target_upsampled.value_counts())
    elif method == 'downsampling':
        # Downsample the majority class
        features_downsampled = pd.concat([resample(features_zeros, replace=False, n_samples=len(features_ones), random_state=42)] + [features_ones])
        target_downsampled = pd.concat([resample(target_zeros, replace=False, n_samples=len(target_ones), random_state=42)] + [target_ones])
        print(target_downsampled.value_counts())
    else:
        print(target_train.value_counts())

    for max_depth in hyperparameters['max_depth']:
        for min_samples_split in hyperparameters['min_samples_split']:
            for min_samples_leaf in hyperparameters['min_samples_leaf']:
                model = DecisionTreeClassifier(max_depth=max_depth, min_samples_split=min_samples_split, min_samples_leaf=min_samples_leaf, random_state=42)
                
                if method == 'class_weight':
                    for class_weight in hyperparameters['class_weight']:
                        model.class_weight = class_weight
                        model.fit(features_train, target_train)
                        predicted_valid = model.predict(features_valid)
                        probabilities_valid = model.predict_proba(features_valid)[:, 1]
                        accuracy = accuracy_score(target_valid, predicted_valid)
                        f1 = f1_score(target_valid, predicted_valid)
                        auc_roc = roc_auc_score(target_valid, probabilities_valid)
                        results.append(['DecisionTree', max_depth, min_samples_split, min_samples_leaf, 'class_weight', class_weight, accuracy, f1, auc_roc])
                elif method == 'upsampling':
                    # Shuffle the dataset
                    features_upsampled, target_upsampled = shuffle(features_upsampled, target_upsampled, random_state=42)
                    model.fit(features_upsampled, target_upsampled)
                    predicted_valid = model.predict(features_valid)
                    probabilities_valid = model.predict_proba(features_valid)[:, 1]
                    accuracy = accuracy_score(target_valid, predicted_valid)
                    f1 = f1_score(target_valid, predicted_valid)
                    auc_roc = roc_auc_score(target_valid, probabilities_valid)
                    results.append(['DecisionTree', max_depth, min_samples_split, min_samples_leaf, 'upsampling', 'N/A', accuracy, f1, auc_roc])
                elif method == 'downsampling':
                    # Shuffle the dataset
                    features_downsampled, target_downsampled = shuffle(features_downsampled, target_downsampled, random_state=42)
                    model.fit(features_downsampled, target_downsampled)
                    predicted_valid = model.predict(features_valid)
                    probabilities_valid = model.predict_proba(features_valid)[:, 1]
                    accuracy = accuracy_score(target_valid, predicted_valid)
                    f1 = f1_score(target_valid, predicted_valid)
                    auc_roc = roc_auc_score(target_valid, probabilities_valid)
                    results.append(['DecisionTree', max_depth, min_samples_split, min_samples_leaf, 'downsampling', 'N/A', accuracy, f1, auc_roc])
                elif method == 'threshold':
                    for threshold in hyperparameters['thresholds']:
                        model.fit(features_train, target_train)
                        probabilities_valid = model.predict_proba(features_valid)[:, 1]
                        predicted_valid = probabilities_valid > threshold
                        accuracy = accuracy_score(target_valid, predicted_valid)
                        f1 = f1_score(target_valid, predicted_valid)
                        auc_roc = roc_auc_score(target_valid, probabilities_valid)
                        results.append(['DecisionTree', max_depth, min_samples_split, min_samples_leaf, 'threshold', threshold, accuracy, f1, auc_roc])
                else:
                    model.fit(features_train, target_train)
                    predicted_valid = model.predict(features_valid)
                    probabilities_valid = model.predict_proba(features_valid)[:, 1]
                    accuracy = accuracy_score(target_valid, predicted_valid)
                    f1 = f1_score(target_valid, predicted_valid)
                    auc_roc = roc_auc_score(target_valid, probabilities_valid)
                    results.append(['DecisionTree', max_depth, min_samples_split, min_samples_leaf, 'None', 'N/A', accuracy, f1, auc_roc])

    df = pd.DataFrame(results, columns=['model_type', 'max_depth', 'min_samples_split', 'min_samples_leaf', 'method', 'parameter', 'accuracy', 'f1_score', 'auc_roc'])
    return df.sort_values('f1_score', ascending=False)

### Train the Model

In [21]:
df_sorted = decision_tree_experiment(method=None)

df_display = df_sorted[['max_depth', 'min_samples_split', 'min_samples_leaf', 'f1_score', 'auc_roc']]
print(df_sorted.iloc[0])
display(df_display.head())

if df_sorted.iloc[0]['f1_score'] > best['f1_score']:
    best = df_sorted.iloc[0]
print('Best so far:\n', best)

Exited
0    4773
1    1227
Name: count, dtype: int64
model_type           DecisionTree
max_depth                     5.0
min_samples_split              10
min_samples_leaf                5
method                       None
parameter                     N/A
accuracy                    0.848
f1_score                 0.511254
auc_roc                  0.814623
Name: 17, dtype: object


Unnamed: 0,max_depth,min_samples_split,min_samples_leaf,f1_score,auc_roc
17,5.0,10,5,0.511254,0.814623
11,5.0,2,5,0.511254,0.814623
14,5.0,5,5,0.511254,0.814623
13,5.0,5,2,0.508091,0.805005
15,5.0,10,1,0.508091,0.798223


Best so far:
 model_type           DecisionTree
max_depth                     5.0
min_samples_split              10
min_samples_leaf                5
method                       None
parameter                     N/A
accuracy                    0.848
f1_score                 0.511254
auc_roc                  0.814623
Name: 17, dtype: object


### Improve the model

#### Class Weight Adjustment

In [22]:
df_sorted = decision_tree_experiment(method='class_weight')
df_display = df_sorted[['max_depth', 'min_samples_split', 'min_samples_leaf', 'parameter', 'f1_score', 'auc_roc']]
print(df_sorted.iloc[0])
display(df_display.head())
if df_sorted.iloc[0]['f1_score'] > best['f1_score']:
    best = df_sorted.iloc[0]
print('Best so far:\n', best)

Exited
0    4773
1    1227
Name: count, dtype: int64
model_type           DecisionTree
max_depth                     5.0
min_samples_split              10
min_samples_leaf                5
method               class_weight
parameter                balanced
accuracy                    0.737
f1_score                 0.526126
auc_roc                  0.822382
Name: 34, dtype: object


Unnamed: 0,max_depth,min_samples_split,min_samples_leaf,parameter,f1_score,auc_roc
34,5.0,10,5,balanced,0.526126,0.822382
28,5.0,5,5,balanced,0.526126,0.822382
22,5.0,2,5,balanced,0.526126,0.822382
20,5.0,2,2,balanced,0.518919,0.810549
32,5.0,10,2,balanced,0.518919,0.810549


Best so far:
 model_type           DecisionTree
max_depth                     5.0
min_samples_split              10
min_samples_leaf                5
method               class_weight
parameter                balanced
accuracy                    0.737
f1_score                 0.526126
auc_roc                  0.822382
Name: 34, dtype: object
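To make the `balanced` setting more concrete, the following is a minimal sketch of how scikit-learn derives balanced class weights, using the class counts printed above (4773 stayers, 1227 leavers). The array `y` is a stand-in built only from those counts, not the actual training target.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Stand-in target with the same class counts as the training set (assumption:
# only the counts matter for the weights, not the order of the rows).
y = np.array([0] * 4773 + [1] * 1227)

# 'balanced' weights are n_samples / (n_classes * count_of_each_class).
weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=y)
manual = len(y) / (2 * np.bincount(y))

print(dict(zip([0, 1], weights.round(3))))
```

The minority class receives a weight of roughly 2.44 against roughly 0.63 for the majority class, so a misclassified leaver costs about four times as much as a misclassified stayer during training.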


#### Upsampling

In [23]:
df_sorted = decision_tree_experiment(method='upsampling')
df_display = df_sorted[['max_depth', 'min_samples_split', 'min_samples_leaf', 'accuracy', 'f1_score', 'auc_roc']]
print(df_sorted.iloc[0])
display(df_display.head())
if df_sorted.iloc[0]['f1_score'] > best['f1_score']:
    best = df_sorted.iloc[0]
print('Best so far:\n', best)

Exited
0    4773
1    4773
Name: count, dtype: int64
model_type           DecisionTree
max_depth                    10.0
min_samples_split              10
min_samples_leaf                5
method                 upsampling
parameter                     N/A
accuracy                    0.785
f1_score                 0.538627
auc_roc                  0.778881
Name: 26, dtype: object


Unnamed: 0,max_depth,min_samples_split,min_samples_leaf,accuracy,f1_score,auc_roc
26,10.0,10,5,0.785,0.538627,0.778881
23,10.0,5,5,0.785,0.538627,0.778881
20,10.0,2,5,0.785,0.538627,0.778881
17,5.0,10,5,0.721,0.52226,0.824539
14,5.0,5,5,0.721,0.52226,0.824539


Best so far:
 model_type           DecisionTree
max_depth                    10.0
min_samples_split              10
min_samples_leaf                5
method                 upsampling
parameter                     N/A
accuracy                    0.785
f1_score                 0.538627
auc_roc                  0.778881
Name: 26, dtype: object


#### Downsampling

In [24]:
df_sorted = decision_tree_experiment(method='downsampling')
df_display = df_sorted[['max_depth', 'min_samples_split', 'min_samples_leaf', 'accuracy', 'f1_score', 'auc_roc']]
print(df_sorted.iloc[0])
display(df_display.head())
if df_sorted.iloc[0]['f1_score'] > best['f1_score']:
    best = df_sorted.iloc[0]
print('Best so far:\n', best)

Exited
0    1227
1    1227
Name: count, dtype: int64
model_type           DecisionTree
max_depth                     5.0
min_samples_split               2
min_samples_leaf                5
method               downsampling
parameter                     N/A
accuracy                    0.761
f1_score                    0.522
auc_roc                  0.812694
Name: 11, dtype: object


Unnamed: 0,max_depth,min_samples_split,min_samples_leaf,accuracy,f1_score,auc_roc
11,5.0,2,5,0.761,0.522,0.812694
17,5.0,10,5,0.761,0.522,0.812694
14,5.0,5,5,0.761,0.522,0.812694
9,5.0,2,1,0.759,0.518,0.814177
13,5.0,5,2,0.759,0.516064,0.812534


Best so far:
 model_type           DecisionTree
max_depth                    10.0
min_samples_split              10
min_samples_leaf                5
method                 upsampling
parameter                     N/A
accuracy                    0.785
f1_score                 0.538627
auc_roc                  0.778881
Name: 26, dtype: object


#### Threshold Adjustment

In [25]:
df_sorted = decision_tree_experiment(method='threshold')
df_display = df_sorted[['max_depth', 'min_samples_split', 'min_samples_leaf', 'parameter', 'f1_score', 'auc_roc']]
print(df_sorted.iloc[0])
display(df_display.head())
if df_sorted.iloc[0]['f1_score'] > best['f1_score']:
    best = df_sorted.iloc[0]
print('Best so far:\n', best)

Exited
0    4773
1    1227
Name: count, dtype: int64
model_type           DecisionTree
max_depth                     5.0
min_samples_split               2
min_samples_leaf                5
method                  threshold
parameter                     0.3
accuracy                   0.8445
f1_score                 0.566248
auc_roc                  0.814623
Name: 226, dtype: object


Unnamed: 0,max_depth,min_samples_split,min_samples_leaf,parameter,f1_score,auc_roc
226,5.0,2,5,0.3,0.566248,0.814623
286,5.0,5,5,0.3,0.566248,0.814623
346,5.0,10,5,0.3,0.566248,0.814623
287,5.0,5,5,0.35,0.565035,0.814623
347,5.0,10,5,0.35,0.565035,0.814623


Best so far:
 model_type           DecisionTree
max_depth                     5.0
min_samples_split               2
min_samples_leaf                5
method                  threshold
parameter                     0.3
accuracy                   0.8445
f1_score                 0.566248
auc_roc                  0.814623
Name: 226, dtype: object


## Random Forest

In [26]:
hyperparameters_rf = {
    'n_estimators': [10, 50, 100],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5],
    'class_weight': ['balanced', None],
    'thresholds': thresholds
}

Define a function that will be used for all the random forest models.

In [27]:
# Create a function that will be used for all Random Forest models
def random_forest_experiment(hyperparameters=hyperparameters_rf, 
                             features_train=features_train, 
                             target_train=target_train, 
                             features_valid=features_valid, 
                             target_valid=target_valid, 
                             method=None):
    results = []

    if method == 'upsampling':
        # Upsample minority class to match the number of samples in majority class
        features_upsampled = pd.concat([features_zeros] + [resample(features_ones, replace=True, n_samples=len(features_zeros), random_state=42)])
        target_upsampled = pd.concat([target_zeros] + [resample(target_ones, replace=True, n_samples=len(target_zeros), random_state=42)])
        print(target_upsampled.value_counts())
    elif method == 'downsampling':
        # Downsample the majority class
        features_downsampled = pd.concat([resample(features_zeros, replace=False, n_samples=len(features_ones), random_state=42)] + [features_ones])
        target_downsampled = pd.concat([resample(target_zeros, replace=False, n_samples=len(target_ones), random_state=42)] + [target_ones])
        print(target_downsampled.value_counts())
    else:
        print(target_train.value_counts())

    for n_estimators in hyperparameters['n_estimators']:
        for max_depth in hyperparameters['max_depth']:
            for min_samples_split in hyperparameters['min_samples_split']:
                for min_samples_leaf in hyperparameters['min_samples_leaf']:
                    model = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, min_samples_split=min_samples_split, min_samples_leaf=min_samples_leaf, random_state=42)
                    if method == 'class_weight':
                        for class_weight in hyperparameters['class_weight']:
                            model.class_weight = class_weight
                            model.fit(features_train, target_train)
                            predicted_valid = model.predict(features_valid)
                            probabilities_valid = model.predict_proba(features_valid)[:, 1]
                            accuracy = accuracy_score(target_valid, predicted_valid)
                            f1 = f1_score(target_valid, predicted_valid)
                            auc_roc = roc_auc_score(target_valid, probabilities_valid)
                            results.append(['RandomForest', max_depth, min_samples_split, min_samples_leaf, 'class_weight', class_weight, accuracy, f1, auc_roc])
                    elif method == 'upsampling':
                        # Shuffle the dataset
                        features_upsampled, target_upsampled = shuffle(features_upsampled, target_upsampled, random_state=42)
                        model.fit(features_upsampled, target_upsampled)
                        predicted_valid = model.predict(features_valid)
                        probabilities_valid = model.predict_proba(features_valid)[:, 1]
                        accuracy = accuracy_score(target_valid, predicted_valid)
                        f1 = f1_score(target_valid, predicted_valid)
                        auc_roc = roc_auc_score(target_valid, probabilities_valid)
                        results.append(['RandomForest', max_depth, min_samples_split, min_samples_leaf, 'upsampling', 'N/A', accuracy, f1, auc_roc])
                    elif method == 'downsampling':
                        # Shuffle the dataset
                        features_downsampled, target_downsampled = shuffle(features_downsampled, target_downsampled, random_state=42)
                        model.fit(features_downsampled, target_downsampled)
                        predicted_valid = model.predict(features_valid)
                        probabilities_valid = model.predict_proba(features_valid)[:, 1]
                        accuracy = accuracy_score(target_valid, predicted_valid)
                        f1 = f1_score(target_valid, predicted_valid)
                        auc_roc = roc_auc_score(target_valid, probabilities_valid)
                        results.append(['RandomForest', max_depth, min_samples_split, min_samples_leaf, 'downsampling', 'N/A', accuracy, f1, auc_roc])
                    elif method == 'threshold':
                        # Fit once; only the decision threshold changes between iterations
                        model.fit(features_train, target_train)
                        probabilities_valid = model.predict_proba(features_valid)[:, 1]
                        auc_roc = roc_auc_score(target_valid, probabilities_valid)
                        for threshold in hyperparameters['thresholds']:
                            predicted_valid = probabilities_valid > threshold
                            accuracy = accuracy_score(target_valid, predicted_valid)
                            f1 = f1_score(target_valid, predicted_valid)
                            results.append(['RandomForest', max_depth, min_samples_split, min_samples_leaf, 'threshold', threshold, accuracy, f1, auc_roc])
                    else:
                        model.fit(features_train, target_train)
                        predicted_valid = model.predict(features_valid)
                        probabilities_valid = model.predict_proba(features_valid)[:, 1]
                        accuracy = accuracy_score(target_valid, predicted_valid)
                        f1 = f1_score(target_valid, predicted_valid)
                        auc_roc = roc_auc_score(target_valid, probabilities_valid)
                        results.append(['RandomForest', max_depth, min_samples_split, min_samples_leaf, 'None', 'N/A', accuracy, f1, auc_roc])

    df = pd.DataFrame(results, columns=['model_type', 'max_depth', 'min_samples_split', 'min_samples_leaf', 'method', 'parameter', 'accuracy', 'f1_score', 'auc_roc'])
    return df.sort_values('f1_score', ascending=False)

### Train the Model

In [28]:
df_sorted = random_forest_experiment(method=None)
df_display = df_sorted[['max_depth', 'min_samples_split', 'min_samples_leaf', 'f1_score', 'auc_roc']]
print(df_sorted.iloc[0])
display(df_display.head())
if df_sorted.iloc[0]['f1_score'] > best['f1_score']:
    best = df_sorted.iloc[0]
print('Best so far:\n', best)

Exited
0    4773
1    1227
Name: count, dtype: int64


model_type           RandomForest
max_depth                     NaN
min_samples_split               5
min_samples_leaf                1
method                       None
parameter                     N/A
accuracy                    0.865
f1_score                 0.560261
auc_roc                  0.829896
Name: 30, dtype: object


Unnamed: 0,max_depth,min_samples_split,min_samples_leaf,f1_score,auc_roc
30,,5,1,0.560261,0.829896
45,10.0,2,1,0.559322,0.844311
3,,5,1,0.55573,0.811037
34,,10,2,0.555556,0.834056
57,,5,1,0.555372,0.834625


Best so far:
 model_type           DecisionTree
max_depth                     5.0
min_samples_split               2
min_samples_leaf                5
method                  threshold
parameter                     0.3
accuracy                   0.8445
f1_score                 0.566248
auc_roc                  0.814623
Name: 226, dtype: object
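The results table does not store `n_estimators`, but rows are appended in nested-loop order, so the row label still identifies the grid point. The sketch below decodes a label for the runs that produce one row per grid point (no method, upsampling, or downsampling). `decode_row_label` is a hypothetical helper, not part of the notebook, and the grid is copied from `hyperparameters_rf`.

```python
import numpy as np

# Grid values copied from hyperparameters_rf; the loop nesting order is
# n_estimators -> max_depth -> min_samples_split -> min_samples_leaf.
grid = {
    'n_estimators': [10, 50, 100],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5],
}

def decode_row_label(label, grid=grid):
    """Map a results-row label back to the grid point that produced it."""
    shape = tuple(len(values) for values in grid.values())
    # unravel_index converts a flat position into one index per loop level
    positions = np.unravel_index(label, shape)
    return {name: values[int(pos)]
            for (name, values), pos in zip(grid.items(), positions)}

decoded = decode_row_label(30)
print(decoded)
```

For row 30, the decoded `max_depth`, `min_samples_split`, and `min_samples_leaf` match the printed row (None, 5, 1), and the decoded `n_estimators` is 50. The class-weight and threshold runs append several rows per grid point, so they would need an extra innermost axis for that parameter.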


### Improve the model

#### Class Weight Adjustment

In [29]:
df_sorted = random_forest_experiment(method='class_weight')
df_display = df_sorted[['max_depth', 'min_samples_split', 'min_samples_leaf', 'parameter', 'f1_score', 'auc_roc']]
print(df_sorted.iloc[0])
display(df_display.head())
if df_sorted.iloc[0]['f1_score'] > best['f1_score']:
    best = df_sorted.iloc[0]
print('Best so far:\n', best)

Exited
0    4773
1    1227
Name: count, dtype: int64
model_type           RandomForest
max_depth                    10.0
min_samples_split               2
min_samples_leaf                1
method               class_weight
parameter                balanced
accuracy                    0.843
f1_score                 0.594315
auc_roc                   0.84575
Name: 144, dtype: object


Unnamed: 0,max_depth,min_samples_split,min_samples_leaf,parameter,f1_score,auc_roc
144,10.0,2,1,balanced,0.594315,0.84575
122,,10,2,balanced,0.591093,0.839329
66,,10,1,balanced,0.588563,0.834272
104,10.0,10,2,balanced,0.585608,0.845286
150,10.0,5,1,balanced,0.585242,0.842479


Best so far:
 model_type           RandomForest
max_depth                    10.0
min_samples_split               2
min_samples_leaf                1
method               class_weight
parameter                balanced
accuracy                    0.843
f1_score                 0.594315
auc_roc                   0.84575
Name: 144, dtype: object


#### Upsampling

In [30]:
df_sorted = random_forest_experiment(method='upsampling')
df_display = df_sorted[['max_depth', 'min_samples_split', 'min_samples_leaf', 'accuracy', 'f1_score', 'auc_roc']]
print(df_sorted.iloc[0])
display(df_display.head())
if df_sorted.iloc[0]['f1_score'] > best['f1_score']:
    best = df_sorted.iloc[0]
print('Best so far:\n', best)

Exited
0    4773
1    4773
Name: count, dtype: int64
model_type           RandomForest
max_depth                     NaN
min_samples_split               2
min_samples_leaf                5
method                 upsampling
parameter                     N/A
accuracy                   0.8345
f1_score                 0.579416
auc_roc                  0.835617
Name: 29, dtype: object


Unnamed: 0,max_depth,min_samples_split,min_samples_leaf,accuracy,f1_score,auc_roc
29,,2,5,0.8345,0.579416,0.835617
56,,2,5,0.8295,0.578492,0.83372
35,,10,5,0.833,0.578283,0.829168
59,,5,5,0.831,0.5775,0.838234
62,,10,5,0.833,0.577215,0.832942


Best so far:
 model_type           RandomForest
max_depth                    10.0
min_samples_split               2
min_samples_leaf                1
method               class_weight
parameter                balanced
accuracy                    0.843
f1_score                 0.594315
auc_roc                   0.84575
Name: 144, dtype: object


#### Downsampling

In [31]:
df_sorted = random_forest_experiment(method='downsampling')
df_display = df_sorted[['max_depth', 'min_samples_split', 'min_samples_leaf', 'accuracy', 'f1_score', 'auc_roc']]
print(df_sorted.iloc[0])
display(df_display.head())
if df_sorted.iloc[0]['f1_score'] > best['f1_score']:
    best = df_sorted.iloc[0]
print('Best so far:\n', best)

Exited
0    1227
1    1227
Name: count, dtype: int64
model_type           RandomForest
max_depth                    10.0
min_samples_split              10
min_samples_leaf                5
method               downsampling
parameter                     N/A
accuracy                   0.7785
f1_score                 0.560079
auc_roc                  0.836475
Name: 80, dtype: object


Unnamed: 0,max_depth,min_samples_split,min_samples_leaf,accuracy,f1_score,auc_roc
80,10.0,10,5,0.7785,0.560079,0.836475
48,10.0,5,1,0.7825,0.559271,0.835504
76,10.0,5,2,0.781,0.55668,0.834691
53,10.0,10,5,0.782,0.556008,0.8332
4,,5,2,0.777,0.555777,0.813875


Best so far:
 model_type           RandomForest
max_depth                    10.0
min_samples_split               2
min_samples_leaf                1
method               class_weight
parameter                balanced
accuracy                    0.843
f1_score                 0.594315
auc_roc                   0.84575
Name: 144, dtype: object


#### Threshold Adjustment

In [32]:
df_sorted = random_forest_experiment(method='threshold')
df_display = df_sorted[['max_depth', 'min_samples_split', 'min_samples_leaf', 'parameter', 'f1_score', 'auc_roc']]
print(df_sorted.iloc[0])
display(df_display.head())
if df_sorted.iloc[0]['f1_score'] > best['f1_score']:
    best = df_sorted.iloc[0]
print('Best so far:\n', best)

Exited
0    4773
1    1227
Name: count, dtype: int64
model_type           RandomForest
max_depth                     NaN
min_samples_split              10
min_samples_leaf                1
method                  threshold
parameter                     0.4
accuracy                    0.863
f1_score                 0.604046
auc_roc                  0.835946
Name: 668, dtype: object


Unnamed: 0,max_depth,min_samples_split,min_samples_leaf,parameter,f1_score,auc_roc
668,,10,1,0.4,0.604046,0.835946
1207,,10,1,0.35,0.600791,0.840374
667,,10,1,0.35,0.59761,0.835946
707,,10,5,0.35,0.593923,0.838596
587,,2,5,0.35,0.593923,0.838596


Best so far:
 model_type           RandomForest
max_depth                     NaN
min_samples_split              10
min_samples_leaf                1
method                  threshold
parameter                     0.4
accuracy                    0.863
f1_score                 0.604046
auc_roc                  0.835946
Name: 668, dtype: object


After tuning hyperparameters for each model type and each imbalance-handling technique, the best result on the validation set came from the random forest with threshold adjustment. With a threshold of 0.4, no depth limit, min_samples_split of 10, and min_samples_leaf of 1, it reached an F1 score of 0.604.
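As an illustration of threshold adjustment in isolation, here is a minimal sketch on synthetic data. The dataset, the 50-tree forest, and the threshold grid are all assumptions made for the example; they are not the bank data or the notebook's `thresholds` list.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the bank data: roughly 80% negatives, 20% positives.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.8, 0.2], random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Fit once, then reuse the predicted probabilities for every threshold.
model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(X_train, y_train)
proba_valid = model.predict_proba(X_valid)[:, 1]

# Hypothetical threshold grid; a lower threshold labels more customers as churners.
thresholds = np.arange(0.05, 1.0, 0.05)
scores = [f1_score(y_valid, (proba_valid > t).astype(int), zero_division=0)
          for t in thresholds]

best_idx = int(np.argmax(scores))
best_threshold = thresholds[best_idx]
print(f"best threshold: {best_threshold:.2f}, F1: {scores[best_idx]:.3f}")
```

Because the threshold is chosen on the validation set, the resulting F1 is an optimistic estimate; the held-out test set gives the unbiased check.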

## Testing

In [33]:
# Define the model with the best parameters
# (n_estimators is not recorded in the results table, so the scikit-learn default of 100 is used)
best_model = RandomForestClassifier(random_state=42, 
                                    max_depth=None, 
                                    min_samples_split=10, 
                                    min_samples_leaf=1)

# Fit the model on the training data
best_model.fit(features_train, target_train)

# Predict probabilities on the test set
probabilities_test = best_model.predict_proba(features_test)[:, 1]

# Apply the threshold to the probabilities to get the final predictions
predictions_test = (probabilities_test > 0.4).astype(int)

# Calculate the F1 score and AUC-ROC on the test set
f1_test = f1_score(target_test, predictions_test)
auc_roc_test = roc_auc_score(target_test, probabilities_test)

print("F1 Score on Test Set: ", f1_test)
print("AUC-ROC on Test Set: ", auc_roc_test)

F1 Score on Test Set:  0.6277561608300908
AUC-ROC on Test Set:  0.8653399496370908


We refit the model with the best hyperparameters and evaluated it on the test set, which was not used during training or tuning. The F1 score on the test set is approximately 0.628, above the project requirement of 0.59. The F1 score is the harmonic mean of precision and recall, so a high value requires both to be reasonably high. The AUC-ROC score on the test set is approximately 0.865. AUC-ROC is the probability that the model assigns a higher churn score to a randomly chosen customer who left than to a randomly chosen customer who stayed; 0.5 corresponds to random ranking and 1 to perfect separation. A score of 0.865 indicates that the model separates the two groups well.
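Both metrics can be checked by hand on a small example. The sketch below uses ten made-up labels and scores (purely illustrative, not project data) to compute F1 from precision and recall and to compute AUC-ROC as the fraction of (leaver, stayer) pairs that are ranked correctly.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

# Made-up labels and predicted churn scores for ten customers.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
scores = np.array([0.1, 0.3, 0.35, 0.6, 0.8, 0.45, 0.7, 0.2, 0.25, 0.5])
y_pred = (scores > 0.4).astype(int)

# F1 is the harmonic mean of precision and recall.
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1_manual = 2 * precision * recall / (precision + recall)

# AUC-ROC as a ranking probability: compare every positive with every negative
# and count the pairs where the positive has the higher score (ties count half).
pos = scores[y_true == 1]
neg = scores[y_true == 0]
diff = pos[:, None] - neg[None, :]
pairwise_auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

print(f"precision={precision:.2f} recall={recall:.2f} F1={f1_manual:.3f}")
print(f"pairwise AUC={pairwise_auc:.3f}")
```

Note that F1 depends on the chosen threshold, while AUC-ROC depends only on how the scores rank the customers, which is why the threshold experiments change F1 but leave AUC-ROC unchanged for a given model.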

## Conclusion

This project involved building a machine learning model to predict customer churn. The dataset was imbalanced, with far more customers who continued their business with Beta Bank than customers who left. The initial model, trained without addressing the imbalance, performed poorly and had a low F1 score.

The class imbalance in the dataset was addressed using various model improvement techniques such as adjusting class weights, upsampling the minority class, and downsampling the majority class. Different thresholds were also experimented with to optimize the F1 score.
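The two resampling techniques can be sketched on a toy table as follows. The ten-row frame and its `feature` column are invented for illustration; only the use of `sklearn.utils.resample` mirrors the notebook.

```python
import pandas as pd
from sklearn.utils import resample

# Toy data with the same kind of imbalance: 8 stayers (0) and 2 leavers (1).
data = pd.DataFrame({'feature': range(10), 'Exited': [0] * 8 + [1] * 2})
majority = data[data['Exited'] == 0]
minority = data[data['Exited'] == 1]

# Upsampling: draw minority rows with replacement until the classes match.
upsampled = pd.concat([
    majority,
    resample(minority, replace=True, n_samples=len(majority), random_state=42),
])

# Downsampling: keep a random subset of majority rows, without replacement.
downsampled = pd.concat([
    resample(majority, replace=False, n_samples=len(minority), random_state=42),
    minority,
])

print(upsampled['Exited'].value_counts().to_dict())
print(downsampled['Exited'].value_counts().to_dict())
```

Resampling is applied to the training data only; the validation and test sets keep their original class distribution so that the reported metrics reflect real conditions.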

Several models were trained and validated, including Logistic Regression, Decision Tree, and Random Forest. Each model was compared primarily on its validation F1 score, with AUC-ROC reported alongside.

The best-performing model was a Random Forest with threshold adjustment. It achieved an F1 score of approximately 0.628 and an AUC-ROC score of approximately 0.865 on the test set, exceeding the project's F1 requirement of 0.59.