**Review**

Hi, my name is Dmitry and I will be reviewing your project.
  
You can find my comments in colored markdown cells:
  
<div class="alert alert-success">
  If everything is done successfully.
</div>
  
<div class="alert alert-warning">
  If I have some (optional) suggestions, or questions to think about, or general comments.
</div>
  
<div class="alert alert-danger">
  If a section requires some corrections. Work can't be accepted with red comments.
</div>
  
Please don't remove my comments, as it will make further review iterations much harder for me.
  
Feel free to reply to my comments or ask questions using the following template:
  
<div class="alert alert-info">
  For your comments and questions.
</div>
  
First of all, thank you for turning in the project! You did a great job overall, but there are a couple of problems that need to be fixed before the project is accepted. Let me know if you have questions!

# Beta Bank Customer Retention

## Introduction

This project focuses on predicting customer churn using machine learning techniques. Customer churn refers to a customer ceasing to do business with a company. Predicting churn is important for Beta Bank because it helps the bank identify customers who are likely to leave and take proactive steps to retain them.

The dataset used contains information about the bank's customers and whether they exited (churned) or not. The data includes customer information such as credit score, gender, age, geography, etc.

The project involves the following steps:
- Data is loaded, explored, and preprocessed. This includes handling missing values, converting data types, and dropping unnecessary columns.
- The target variable is imbalanced, with more customers staying than leaving. Class weighting, upsampling the minority class, downsampling the majority class, and threshold adjustment are used to address this imbalance.
- Logistic Regression, Decision Tree, and Random Forest models are trained on the preprocessed data. Each model's performance is evaluated using the F1 score and AUC-ROC.
- Each model is then retrained with the balancing techniques, and the results before and after the improvement are compared.

The goal of this project is to build a model that can accurately predict customer churn. The insights gained from this project could potentially be used to improve Beta Bank's customer retention strategies.

## Prepare the data

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import warnings

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, f1_score, roc_auc_score, accuracy_score

from sklearn.utils import resample
from sklearn.utils import shuffle

from sklearn.exceptions import FitFailedWarning
warnings.filterwarnings(action='ignore', category=UserWarning)
warnings.filterwarnings('ignore')

In [2]:
# Read the data
data = pd.read_csv('https://practicum-content.s3.us-west-1.amazonaws.com/datasets/Churn.csv')

# Examine the data
data.info()
display(data.sample(10))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           9091 non-null   float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
4456,4457,15724428,Abel,544,France,Male,40,8.0,0.0,2,1,0,61581.2,0
4738,4739,15644361,Hooper,702,France,Female,40,,103549.24,1,0,0,9712.52,1
8160,8161,15576990,Taplin,790,Germany,Female,25,5.0,152885.77,1,1,0,58214.79,0
729,730,15612525,Preston,499,France,Female,57,1.0,0.0,1,0,0,131372.38,1
2709,2710,15780212,Mao,592,France,Male,37,4.0,212692.97,1,0,0,176395.02,0
6011,6012,15783007,Parker,520,Germany,Female,45,1.0,123086.39,1,1,1,41042.4,1
6519,6520,15571869,Lei,669,Germany,Female,50,4.0,112650.89,1,0,0,166386.22,1
7976,7977,15659656,Pan,849,France,Male,35,4.0,110837.73,1,0,0,126419.8,0
6591,6592,15692110,Ch'eng,758,France,Female,33,7.0,0.0,1,1,0,188156.34,0
6943,6944,15603741,MacDonnell,719,Spain,Male,40,4.0,128389.12,1,1,1,176091.31,0


In [3]:
# Check for duplicates
print(data.duplicated().sum())

0


There are no duplicate rows, so we can move on.

<div class="alert alert-success">
<b>Reviewer's comment</b>

The data was loaded and inspected!

</div>

In [4]:
# Check for missing values
print(data.isnull().sum())

RowNumber            0
CustomerId           0
Surname              0
CreditScore          0
Geography            0
Gender               0
Age                  0
Tenure             909
Balance              0
NumOfProducts        0
HasCrCard            0
IsActiveMember       0
EstimatedSalary      0
Exited               0
dtype: int64


There are 909 missing values for the 'Tenure' column. Some models will not be able to handle data with missing values. Therefore, we will fill in the missing values for tenure with the median value. We will also change the data type of 'Tenure' to integers if all the values are integers.

<div class="alert alert-success">
<b>Reviewer's comment</b>

Alright, that's one way to deal with missing values :)

</div>

In [5]:
# Fill missing values in 'Tenure' with the median value
# (assigning back avoids in-place modification of a column selection)
data['Tenure'] = data['Tenure'].fillna(data['Tenure'].median())

# Check whether it's safe to convert 'Tenure' from float to int. If so, convert it.
if np.array_equal(data['Tenure'], data['Tenure'].astype('int')):
    data['Tenure'] = data['Tenure'].astype('int')

print(data.info())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           10000 non-null  int32  
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(2), int32(1), int64(8), object(3)
memory usage: 1.0+ MB
None


We will now remove the columns that are not needed.

In [6]:
# Drop the columns that are not needed for the model
data = data.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1)

display(data.sample(10))

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
9254,686,France,Male,32,6,0.0,2,1,1,179093.26,0
1500,630,France,Male,50,1,81947.76,1,0,1,63606.22,1
5323,622,France,Male,32,5,179305.09,1,1,1,149043.78,0
260,732,Germany,Male,42,9,108748.08,2,1,1,65323.11,0
9064,521,Germany,Female,49,5,127948.57,1,1,1,182765.14,0
5586,816,Germany,Female,25,2,150355.35,2,1,1,35770.84,0
1159,729,Spain,Male,37,10,0.0,2,1,0,100862.54,0
4469,612,Spain,Male,33,5,69478.57,1,1,0,8973.67,1
8945,542,Spain,Male,35,2,174894.53,1,1,1,22314.55,0
6263,445,France,Male,37,3,0.0,2,1,1,180012.39,0


These columns were dropped since they do not contribute to the model's prediction of customer churn. RowNumber is an index column that does not provide meaningful information for the model. CustomerId is a unique identifier for each customer. Including it could lead the model to associate outcomes with individual customer IDs, which would not generalize to unseen data. Surname is the customer's last name, which is unlikely to influence their likelihood to churn.

<div class="alert alert-success">
<b>Reviewer's comment</b>

Makes sense!

</div>

In [7]:
# Convert categorical data into numerical data
data = pd.get_dummies(data, drop_first=True)

display(data.sample(10))

Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Geography_Germany,Geography_Spain,Gender_Male
3685,695,39,5,0.0,2,0,0,102763.69,0,False,True,False
3626,789,37,6,110689.07,1,1,1,71121.04,1,True,False,False
7772,792,50,4,146710.76,1,1,0,16528.4,1,True,False,False
4136,651,44,2,0.0,3,1,0,102530.35,1,False,False,True
246,772,26,7,152400.51,2,1,0,79414.0,0,True,False,True
3713,709,22,0,112949.71,1,0,0,155231.55,0,True,False,True
57,725,19,0,75888.2,1,0,0,45613.75,0,True,False,True
87,729,30,9,0.0,2,1,0,151869.35,0,False,False,True
2963,655,51,3,0.0,2,0,1,15801.02,0,False,False,False
9241,509,35,8,0.0,2,0,1,67431.28,0,False,False,True


We now have a new dataframe in which each categorical column is replaced by indicator columns. To avoid the dummy variable trap, the drop_first argument for get_dummies drops the first category of each feature, so there is no Geography_France column: the geography is France when both Geography_Germany and Geography_Spain are False. Likewise, the gender is Female when Gender_Male is False.
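To make the encoding concrete, here is a minimal sketch on a hypothetical four-row frame (the rows are invented for illustration; only the column names mirror the dataset). It shows which indicator columns remain after drop_first and how the dropped baseline category can be recovered.

```python
import pandas as pd

# Hypothetical toy rows; only the column names come from the dataset
toy = pd.DataFrame({
    'Geography': ['France', 'Germany', 'Spain', 'France'],
    'Gender': ['Female', 'Male', 'Male', 'Female'],
})

# drop_first=True drops the first category (in sorted order) of each column
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())

# A row whose Geography indicators are all off belongs to the baseline (France)
geo_cols = ['Geography_Germany', 'Geography_Spain']
is_france = (encoded[geo_cols].sum(axis=1) == 0)
print(is_france.tolist())
```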

<div class="alert alert-success">
<b>Reviewer's comment</b>

Categorical features were encoded

</div>

In [8]:
# Split the data into features and target
# The 'Exited' column is the target, the rest are features
features = data.drop('Exited', axis=1)
target = data['Exited']

# First, split the data into a training set (60% of the data) and a temp set (40%)
features_train, features_temp, target_train, target_temp = train_test_split(
    features, target, test_size=0.4, random_state=42)

# Then, split the temp set into a validation set (50% of the temp)
# and a testing set (50% of the temp)
# This will result in a 20/20 split of the entire dataset for validation/testing
features_valid, features_test, target_valid, target_test = train_test_split(
    features_temp, target_temp, test_size=0.5, random_state=42)

<div class="alert alert-danger">
<b>Reviewer's comment</b>

Note that we need three sets here: train, validation and test. The train set is used to train the models, the validation set to compare different models and balancing techniques as well as to tune hyperparameters, and the test set to evaluate the final model.

</div>
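One optional refinement to the split above, not used in this project, is stratification: passing `stratify` to `train_test_split` keeps the class ratio the same in every subset. A minimal sketch on hypothetical toy arrays (the data and variable names are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 20 of them in the positive class
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 20 + [0] * 80)

# 60% train, 40% temp, preserving the class ratio
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42)

# Split temp in half: 20% validation, 20% test, again stratified
X_valid, X_test, y_valid, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42)

print(len(y_train), len(y_valid), len(y_test))
print(y_train.mean(), y_valid.mean(), y_test.mean())
```

With an imbalanced target, this avoids a validation or test set that happens to contain noticeably more or fewer churned customers than the training set.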

In [9]:
# Examine the balance of classes
class_counts = target.value_counts()
print(class_counts)


Exited
0    7963
1    2037
Name: count, dtype: int64


The counts show how many customers stayed with the bank (0) versus how many left (1). Significantly more customers stayed (7963) than left (2037).

In [10]:
# Calculate the imbalance ratio
imbalance_ratio = class_counts[0] / class_counts[1]
print(f'Imbalance Ratio: {imbalance_ratio}')

Imbalance Ratio: 3.9091801669121256


This shows that, at the time the data was collected, there were about 4 times as many customers who stayed as customers who left.

In [11]:
# Shared variables used by all experiments
# Separate majority and minority classes
# (the masks are applied to the full dataset, not only the training split)
features_zeros = features[target == 0]
features_ones = features[target == 1]
target_zeros = target[target == 0]
target_ones = target[target == 1]

# Candidate classification thresholds: 0.00, 0.05, ..., 0.95
thresholds = np.arange(0, 1, 0.05)

<div class="alert alert-success">
<b>Reviewer's comment</b>

Class distribution was examined

</div>
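The class masks in the cell above index the full `features` and `target` objects, so rows that later serve as validation or test data can end up in a resampled training set. A minimal sketch, on hypothetical toy data, of restricting the resampling to the training split (all names here are illustrative assumptions, not the project's variables):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical toy data: 100 rows, 20 of them positive
toy = pd.DataFrame({'x': range(100), 'Exited': [1] * 20 + [0] * 80})
features_toy = toy.drop('Exited', axis=1)
target_toy = toy['Exited']

f_train, f_valid, t_train, t_valid = train_test_split(
    features_toy, target_toy, test_size=0.4, random_state=42)

# Separate the classes using training rows only
train_zeros = f_train[t_train == 0]
train_ones = f_train[t_train == 1]

# Upsample the minority class with replacement to match the majority class
ones_upsampled = resample(
    train_ones, replace=True, n_samples=len(train_zeros), random_state=42)
f_train_up = pd.concat([train_zeros, ones_upsampled])
t_train_up = pd.Series(
    [0] * len(train_zeros) + [1] * len(ones_upsampled), index=f_train_up.index)

# No validation row appears in the upsampled training set
overlap = set(f_train_up.index) & set(f_valid.index)
print(len(overlap))
```

Keeping validation rows out of the resampled data matters most for models that can memorize individual rows, since duplicated validation rows in training inflate the validation score.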

## Logistic Regression

In [12]:
hyperparameters_lr = {
    'solver': ['newton-cholesky', 'saga', 'liblinear', 'newton-cg', 'lbfgs', 'sag'],
    'class_weight': ['balanced', None],
    'thresholds': thresholds
}

Define a function that will be used for all the logistic regression models.

In [13]:
def logistic_regression_experiment(hyperparameters=hyperparameters_lr, 
                                   features_train=features_train, 
                                   target_train=target_train, 
                                   features_valid=features_valid, 
                                   target_valid=target_valid, 
                                   method=None):
    results = []

    if method == 'upsampling':
        # Upsample minority class to match the number of samples in majority class
        features_upsampled = pd.concat([features_zeros] + [resample(features_ones, replace=True, n_samples=len(features_zeros), random_state=42)])
        target_upsampled = pd.concat([target_zeros] + [resample(target_ones, replace=True, n_samples=len(target_zeros), random_state=42)])
        print(target_upsampled.value_counts())
    elif method == 'downsampling':
        # Downsample the majority class
        features_downsampled = pd.concat([resample(features_zeros, replace=False, n_samples=len(features_ones), random_state=42)] + [features_ones])
        target_downsampled = pd.concat([resample(target_zeros, replace=False, n_samples=len(target_ones), random_state=42)] + [target_ones])
        print(target_downsampled.value_counts())
    else:
        print(target_train.value_counts())

    for solver in hyperparameters['solver']:
        model = LogisticRegression(random_state=42, solver=solver)
        
        if method == 'class_weight':
            for class_weight in hyperparameters['class_weight']:
                model.class_weight = class_weight
                model.fit(features_train, target_train)
                predicted_valid = model.predict(features_valid)
                probabilities_valid = model.predict_proba(features_valid)[:, 1]
                accuracy = accuracy_score(target_valid, predicted_valid)
                f1 = f1_score(target_valid, predicted_valid)
                auc_roc = roc_auc_score(target_valid, probabilities_valid)
                results.append(['LogisticRegression', solver, 'class_weight', class_weight, accuracy, f1, auc_roc])
        elif method == 'upsampling':
            # Shuffle the dataset
            features_upsampled, target_upsampled = shuffle(features_upsampled, target_upsampled, random_state=42)
            model.fit(features_upsampled, target_upsampled)
            predicted_valid = model.predict(features_valid)
            probabilities_valid = model.predict_proba(features_valid)[:, 1]
            accuracy = accuracy_score(target_valid, predicted_valid)
            f1 = f1_score(target_valid, predicted_valid)
            auc_roc = roc_auc_score(target_valid, probabilities_valid)
            results.append(['LogisticRegression', solver, 'upsampling', 'N/A', accuracy, f1, auc_roc])
        elif method == 'downsampling':

            # Shuffle the dataset
            features_downsampled, target_downsampled = shuffle(features_downsampled, target_downsampled, random_state=42)
            model.fit(features_downsampled, target_downsampled)
            predicted_valid = model.predict(features_valid)
            probabilities_valid = model.predict_proba(features_valid)[:, 1]
            accuracy = accuracy_score(target_valid, predicted_valid)
            f1 = f1_score(target_valid, predicted_valid)
            auc_roc = roc_auc_score(target_valid, probabilities_valid)
            results.append(['LogisticRegression', solver, 'downsampling', 'N/A', accuracy, f1, auc_roc])
        elif method == 'threshold':
            for threshold in hyperparameters['thresholds']:
                model.fit(features_train, target_train)
                probabilities_valid = model.predict_proba(features_valid)[:, 1]
                predicted_valid = probabilities_valid > threshold
                accuracy = accuracy_score(target_valid, predicted_valid)
                f1 = f1_score(target_valid, predicted_valid)
                auc_roc = roc_auc_score(target_valid, probabilities_valid)
                results.append(['LogisticRegression', solver, 'threshold', threshold, accuracy, f1, auc_roc])
        else:
            model.fit(features_train, target_train)
            predicted_valid = model.predict(features_valid)
            probabilities_valid = model.predict_proba(features_valid)[:, 1]
            accuracy = accuracy_score(target_valid, predicted_valid)
            f1 = f1_score(target_valid, predicted_valid)
            auc_roc = roc_auc_score(target_valid, probabilities_valid)
            results.append(['LogisticRegression', solver, 'None', 'None', accuracy, f1, auc_roc])

    df = pd.DataFrame(results, columns=['model_type', 'solver', 'method', 'parameter', 'accuracy', 'f1_score', 'auc_roc'])
    return df.sort_values('f1_score', ascending=False)


### Train the model

In [36]:
df_sorted = logistic_regression_experiment(method=None)

df_display = df_sorted[['solver', 'accuracy', 'f1_score', 'auc_roc']]
display(df_display)
best = df_sorted.iloc[0]
print('Best so far:', best)

Exited
0    4773
1    1227
Name: count, dtype: int64
model_type    LogisticRegression
solver           newton-cholesky
method                      None
parameter                   None
accuracy                   0.817
f1_score                0.296154
auc_roc                 0.751975
Name: 0, dtype: object


Unnamed: 0,solver,accuracy,f1_score,auc_roc
0,newton-cholesky,0.817,0.296154,0.751975
3,newton-cg,0.816,0.266932,0.741215
4,lbfgs,0.801,0.111607,0.648376
2,liblinear,0.8055,0.044226,0.635962
1,saga,0.81,0.0,0.488424
5,sag,0.81,0.0,0.500179


Here's what can be seen from the results:
- Precision is the ratio of correctly predicted observations of a class to all observations predicted as that class. It is high for 0 but relatively low for 1: the precision is 0.81 for 0 and 0.45 for 1.
- Recall is the ratio of correctly predicted observations of a class to all actual observations of that class. For 0, the recall is 0.98, while for 1, the recall is 0.07.
- F1 score is the harmonic mean of precision and recall, so it takes both false positives and false negatives into account. It is a better measure than accuracy for an uneven class distribution such as the one in our data. The F1 score for 0 is 0.89, while the F1 score for 1 is 0.12.
- Support is the number of actual occurrences of each class in the evaluated set. For 0, it is 1607, and for 1, it is 393.

From these metrics, we can conclude that the model performs well in predicting customers who did not exit (0), but poorly in predicting customers who exited (1).
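The metric definitions above can be checked by hand on a small example. This is a minimal sketch with invented labels (not the project's predictions) that computes precision, recall, and F1 for class 1 from the raw counts and compares the result with scikit-learn:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels: 4 actual positives, 2 predicted positives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# Count true positives, false positives, and false negatives for class 1
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

precision = tp / (tp + fp)                           # 1 / 2
recall = tp / (tp + fn)                              # 1 / 4
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean

print(precision, recall, round(f1, 4))

# The hand-computed values agree with scikit-learn's implementations
sk_values = (precision_score(y_true, y_pred),
             recall_score(y_true, y_pred),
             f1_score(y_true, y_pred))
print(sk_values)
```

Accuracy on these labels is 0.6 even though only one of four positives is found, which illustrates why F1 is the more informative metric on imbalanced data.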

<div class="alert alert-danger">
<b>Reviewer's comment</b>

Great, you trained a model without taking the imbalance into account first. Note that you need to use the validation set to evaluate the model here: the test set should only be used once you've selected the model and are not going to make any changes. The same goes for models below

</div>

### Improve the model

#### Explore Class Weight Adjustment

Class weight adjustment is useful for data with imbalanced classes, such as ours. With class_weight set to 'balanced', errors on the rarer class are weighted more heavily during training.
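As a rough illustration of what the 'balanced' option does, scikit-learn weights each class by `n_samples / (n_classes * count_of_class)`. The sketch below uses a hypothetical 80/20 label array (an assumption for illustration, not the project's data) and compares the manual formula with `compute_class_weight`:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 80 negatives, 20 positives
y = np.array([0] * 80 + [1] * 20)
classes = np.unique(y)

# Manual formula: n_samples / (n_classes * count of each class)
manual = len(y) / (len(classes) * np.bincount(y))

# Same weights computed by scikit-learn
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y)

print(manual)
print(weights)
```

The minority class receives the larger weight, so misclassifying a churned customer costs the model more than misclassifying a customer who stayed.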

In [15]:
df_sorted = logistic_regression_experiment(method='class_weight')
df_display = df_sorted[['solver', 'parameter', 'f1_score', 'auc_roc']]
display(df_display)
if df_sorted.iloc[0]['f1_score'] > best['f1_score']:
    best = df_sorted.iloc[0]
print('Best so far:', best)

Exited
0    4773
1    1227
Name: count, dtype: int64


Unnamed: 0,solver,parameter,f1_score,auc_roc
0,newton-cholesky,balanced,0.461404,0.752815
6,newton-cg,balanced,0.460733,0.752565
4,liblinear,balanced,0.457483,0.748315
8,lbfgs,balanced,0.394758,0.678908
10,sag,balanced,0.330759,0.548354
2,saga,balanced,0.329276,0.546681
1,newton-cholesky,,0.296154,0.751975
7,newton-cg,,0.266932,0.741215
9,lbfgs,,0.111607,0.648376
5,liblinear,,0.044226,0.635962


solver       newton-cholesky
parameter           balanced
f1_score            0.461404
auc_roc             0.752815
Name: 0, dtype: object


#### Upsampling

In [16]:
df_sorted = logistic_regression_experiment(method='upsampling')
df_display = df_sorted[['solver', 'accuracy', 'f1_score', 'auc_roc']]
display(df_display)
if df_sorted.iloc[0]['f1_score'] > best['f1_score']:
    best = df_sorted.iloc[0]
print('Best so far:', best)

Exited
0    7963
1    7963
Name: count, dtype: int64


Unnamed: 0,solver,accuracy,f1_score,auc_roc
0,newton-cholesky,0.7,0.46714,0.756511
3,newton-cg,0.6985,0.465899,0.756449
2,liblinear,0.6495,0.409436,0.684428
4,lbfgs,0.6395,0.400665,0.680967
5,sag,0.493,0.335518,0.555606
1,saga,0.4815,0.33226,0.550694


solver      newton-cholesky
accuracy                0.7
f1_score            0.46714
auc_roc            0.756511
Name: 0, dtype: object


#### Downsampling

In [17]:
df_sorted = logistic_regression_experiment(method='downsampling')
df_display = df_sorted[['solver', 'accuracy', 'f1_score', 'auc_roc']]
display(df_display)
if df_sorted.iloc[0]['f1_score'] > best['f1_score']:
    best = df_sorted.iloc[0]
print('Best so far:', best)

Exited
0    2037
1    2037
Name: count, dtype: int64


Unnamed: 0,solver,accuracy,f1_score,auc_roc
0,newton-cholesky,0.6915,0.458297,0.756046
3,newton-cg,0.69,0.45614,0.755963
2,liblinear,0.6485,0.405748,0.683395
4,lbfgs,0.637,0.396007,0.680088
5,sag,0.4785,0.330122,0.547443
1,saga,0.478,0.329049,0.546152


solver      newton-cholesky
accuracy             0.6915
f1_score           0.458297
auc_roc            0.756046
Name: 0, dtype: object


#### Threshold Adjustment

In [18]:
df_sorted = logistic_regression_experiment(method='threshold')
df_display = df_sorted
display(df_display)
if df_sorted.iloc[0]['f1_score'] > best['f1_score']:
    best = df_sorted.iloc[0]
print('Best so far:', best)

Exited
0    4773
1    1227
Name: count, dtype: int64


Unnamed: 0,model_type,solver,method,parameter,accuracy,f1_score,auc_roc
5,LogisticRegression,newton-cholesky,threshold,0.25,0.7405,0.462176,0.751975
6,LogisticRegression,newton-cholesky,threshold,0.30,0.7785,0.457772,0.751975
4,LogisticRegression,newton-cholesky,threshold,0.20,0.6800,0.457627,0.751975
65,LogisticRegression,newton-cg,threshold,0.25,0.7230,0.456863,0.741215
64,LogisticRegression,newton-cg,threshold,0.20,0.6655,0.452984,0.741215
...,...,...,...,...,...,...,...
38,LogisticRegression,saga,threshold,0.90,0.8100,0.000000,0.488424
79,LogisticRegression,newton-cg,threshold,0.95,0.8100,0.000000,0.741215
19,LogisticRegression,newton-cholesky,threshold,0.95,0.8100,0.000000,0.751975
18,LogisticRegression,newton-cholesky,threshold,0.90,0.8100,0.000000,0.751975


model_type    LogisticRegression
solver           newton-cholesky
method                 threshold
parameter                   0.25
accuracy                  0.7405
f1_score                0.462176
auc_roc                 0.751975
Name: 5, dtype: object


## Decision Tree

In [19]:
hyperparameters_dt = {
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5],
    'class_weight': ['balanced', None],
    'thresholds': thresholds
}

Define a function that will be used for all the decision tree models.

In [20]:
# Create a function that will be used for all Decision Tree models
def decision_tree_experiment(hyperparameters=hyperparameters_dt, 
                             features_train=features_train, 
                             target_train=target_train, 
                             features_valid=features_valid, 
                             target_valid=target_valid, 
                             method=None):
    results = []

    if method == 'upsampling':
        # Upsample minority class to match the number of samples in majority class
        features_upsampled = pd.concat([features_zeros] + [resample(features_ones, replace=True, n_samples=len(features_zeros), random_state=42)])
        target_upsampled = pd.concat([target_zeros] + [resample(target_ones, replace=True, n_samples=len(target_zeros), random_state=42)])
        print(target_upsampled.value_counts())
    elif method == 'downsampling':
        # Downsample the majority class
        features_downsampled = pd.concat([resample(features_zeros, replace=False, n_samples=len(features_ones), random_state=42)] + [features_ones])
        target_downsampled = pd.concat([resample(target_zeros, replace=False, n_samples=len(target_ones), random_state=42)] + [target_ones])
        print(target_downsampled.value_counts())
    else:
        print(target_train.value_counts())

    for max_depth in hyperparameters['max_depth']:
        for min_samples_split in hyperparameters['min_samples_split']:
            for min_samples_leaf in hyperparameters['min_samples_leaf']:
                model = DecisionTreeClassifier(max_depth=max_depth, min_samples_split=min_samples_split, min_samples_leaf=min_samples_leaf, random_state=42)
                
                if method == 'class_weight':
                    for class_weight in hyperparameters['class_weight']:
                        model.class_weight = class_weight
                        model.fit(features_train, target_train)
                        predicted_valid = model.predict(features_valid)
                        probabilities_valid = model.predict_proba(features_valid)[:, 1]
                        accuracy = accuracy_score(target_valid, predicted_valid)
                        f1 = f1_score(target_valid, predicted_valid)
                        auc_roc = roc_auc_score(target_valid, probabilities_valid)
                        results.append(['DecisionTree', max_depth, min_samples_split, min_samples_leaf, 'class_weight', class_weight, accuracy, f1, auc_roc])
                elif method == 'upsampling':
                    # Shuffle the dataset
                    features_upsampled, target_upsampled = shuffle(features_upsampled, target_upsampled, random_state=42)
                    model.fit(features_upsampled, target_upsampled)
                    predicted_valid = model.predict(features_valid)
                    probabilities_valid = model.predict_proba(features_valid)[:, 1]
                    accuracy = accuracy_score(target_valid, predicted_valid)
                    f1 = f1_score(target_valid, predicted_valid)
                    auc_roc = roc_auc_score(target_valid, probabilities_valid)
                    results.append(['DecisionTree', max_depth, min_samples_split, min_samples_leaf, 'upsampling', 'N/A', accuracy, f1, auc_roc])
                elif method == 'downsampling':
                    # Shuffle the dataset
                    features_downsampled, target_downsampled = shuffle(features_downsampled, target_downsampled, random_state=42)
                    model.fit(features_downsampled, target_downsampled)
                    predicted_valid = model.predict(features_valid)
                    probabilities_valid = model.predict_proba(features_valid)[:, 1]
                    accuracy = accuracy_score(target_valid, predicted_valid)
                    f1 = f1_score(target_valid, predicted_valid)
                    auc_roc = roc_auc_score(target_valid, probabilities_valid)
                    results.append(['DecisionTree', max_depth, min_samples_split, min_samples_leaf, 'downsampling', 'N/A', accuracy, f1, auc_roc])
                elif method == 'threshold':
                    for threshold in hyperparameters['thresholds']:
                        model.fit(features_train, target_train)
                        probabilities_valid = model.predict_proba(features_valid)[:, 1]
                        predicted_valid = probabilities_valid > threshold
                        accuracy = accuracy_score(target_valid, predicted_valid)
                        f1 = f1_score(target_valid, predicted_valid)
                        auc_roc = roc_auc_score(target_valid, probabilities_valid)
                        results.append(['DecisionTree', max_depth, min_samples_split, min_samples_leaf, 'threshold', threshold, accuracy, f1, auc_roc])
                else:
                    model.fit(features_train, target_train)
                    predicted_valid = model.predict(features_valid)
                    probabilities_valid = model.predict_proba(features_valid)[:, 1]
                    accuracy = accuracy_score(target_valid, predicted_valid)
                    f1 = f1_score(target_valid, predicted_valid)
                    auc_roc = roc_auc_score(target_valid, probabilities_valid)
                    results.append(['DecisionTree', max_depth, min_samples_split, min_samples_leaf, 'None', 'N/A', accuracy, f1, auc_roc])

    df = pd.DataFrame(results, columns=['model_type', 'max_depth', 'min_samples_split', 'min_samples_leaf', 'method', 'parameter', 'accuracy', 'f1_score', 'auc_roc'])
    return df.sort_values('f1_score', ascending=False)

### Train the Model

In [40]:
df_sorted = decision_tree_experiment(method=None)

df_display = df_sorted[['max_depth', 'min_samples_split', 'min_samples_leaf', 'f1_score', 'auc_roc']]
print(df_sorted.iloc[0])
display(df_display.head())

if df_sorted.iloc[0]['f1_score'] > best['f1_score']:
    best = df_sorted.iloc[0]
print('Best so far:', best)

Exited
0    4773
1    1227
Name: count, dtype: int64
Best so far: max_depth                 NaN
min_samples_split    2.000000
min_samples_leaf     1.000000
accuracy             0.826000
f1_score             0.685921
auc_roc              0.892593
Name: 0, dtype: float64
model_type           DecisionTree
max_depth                     5.0
min_samples_split              10
min_samples_leaf                5
method                       None
parameter                     N/A
accuracy                    0.848
f1_score                 0.511254
auc_roc                  0.814623
Name: 17, dtype: object


Unnamed: 0,max_depth,min_samples_split,min_samples_leaf,f1_score,auc_roc
17,5.0,10,5,0.511254,0.814623
11,5.0,2,5,0.511254,0.814623
14,5.0,5,5,0.511254,0.814623
13,5.0,5,2,0.508091,0.805005
15,5.0,10,1,0.508091,0.798223


### Improve the model

#### Class Weight Adjustment

In [22]:
df_sorted = decision_tree_experiment(method='class_weight')
df_display = df_sorted[['max_depth', 'min_samples_split', 'min_samples_leaf', 'parameter', 'f1_score', 'auc_roc']]
display(df_display.head())
if df_sorted.iloc[0]['f1_score'] > best['f1_score']:
    best = df_sorted.iloc[0]
print('Best so far:', best)

Exited
0    4773
1    1227
Name: count, dtype: int64


Unnamed: 0,max_depth,min_samples_split,min_samples_leaf,parameter,f1_score,auc_roc
34,5.0,10,5,balanced,0.526126,0.822382
28,5.0,5,5,balanced,0.526126,0.822382
22,5.0,2,5,balanced,0.526126,0.822382
20,5.0,2,2,balanced,0.518919,0.810549
32,5.0,10,2,balanced,0.518919,0.810549


max_depth                 5.0
min_samples_split          10
min_samples_leaf            5
parameter            balanced
f1_score             0.526126
auc_roc              0.822382
Name: 34, dtype: object


#### Upsampling

In [23]:
df_sorted = decision_tree_experiment(method='upsampling')
df_display = df_sorted[['max_depth', 'min_samples_split', 'min_samples_leaf', 'accuracy', 'f1_score', 'auc_roc']]
display(df_display.head())
if df_sorted.iloc[0]['f1_score'] > best['f1_score']:
    best = df_sorted.iloc[0]
print('Best so far:', best)

Exited
0    7963
1    7963
Name: count, dtype: int64


Unnamed: 0,max_depth,min_samples_split,min_samples_leaf,accuracy,f1_score,auc_roc
0,,2,1,0.9965,0.990704,0.990789
3,,5,1,0.9895,0.972259,0.990463
1,,2,2,0.971,0.926209,0.989835
4,,5,2,0.971,0.925641,0.99093
6,,10,1,0.959,0.896465,0.987381


max_depth                 NaN
min_samples_split    2.000000
min_samples_leaf     1.000000
accuracy             0.996500
f1_score             0.990704
auc_roc              0.990789
Name: 0, dtype: float64


#### Downsampling

In [35]:
df_sorted = decision_tree_experiment(method='downsampling')
df_display = df_sorted[['max_depth', 'min_samples_split', 'min_samples_leaf', 'accuracy', 'f1_score', 'auc_roc']]
display(df_display.head())
if df_sorted.iloc[0]['f1_score'] > best['f1_score']:
    best = df_sorted.iloc[0]
print('Best so far:', best)

Exited
0    2037
1    2037
Name: count, dtype: int64


Unnamed: 0,max_depth,min_samples_split,min_samples_leaf,accuracy,f1_score,auc_roc
0,,2,1,0.826,0.685921,0.892593
4,,5,2,0.8245,0.664756,0.900431
1,,2,2,0.8235,0.662201,0.89495
3,,5,1,0.813,0.661844,0.893648
6,,10,1,0.8135,0.646445,0.900669


0.6859205776173286


#### Threshold Adjustment

In [25]:
df_sorted = decision_tree_experiment(method='threshold')
df_display = df_sorted[['max_depth', 'min_samples_split', 'min_samples_leaf', 'parameter', 'f1_score', 'auc_roc']]
display(df_display.head())
if df_sorted.iloc[0]['f1_score'] > best['f1_score']:
    best = df_sorted.iloc[0]
print('Best so far:', best)

Exited
0    4773
1    1227
Name: count, dtype: int64


Unnamed: 0,max_depth,min_samples_split,min_samples_leaf,parameter,f1_score,auc_roc
226,5.0,2,5,0.3,0.566248,0.814623
286,5.0,5,5,0.3,0.566248,0.814623
346,5.0,10,5,0.3,0.566248,0.814623
287,5.0,5,5,0.35,0.565035,0.814623
347,5.0,10,5,0.35,0.565035,0.814623


max_depth            5.000000
min_samples_split    2.000000
min_samples_leaf     5.000000
parameter            0.300000
f1_score             0.566248
auc_roc              0.814623
Name: 226, dtype: float64


## Random Forest

In [26]:
hyperparameters_rf = {
    'n_estimators': [10, 50, 100],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5],
    'class_weight': ['balanced', None],
    'thresholds': thresholds
}

Define a function that will be used for all the random forest models.

In [27]:
# Create a function that will be used for all Random Forest models
def random_forest_experiment(hyperparameters=hyperparameters_rf, 
                             features_train=features_train, 
                             target_train=target_train, 
                             features_valid=features_valid, 
                             target_valid=target_valid, 
                             method=None):
    results = []

    if method == 'upsampling':
        # Upsample minority class to match the number of samples in majority class
        features_upsampled = pd.concat([features_zeros] + [resample(features_ones, replace=True, n_samples=len(features_zeros), random_state=42)])
        target_upsampled = pd.concat([target_zeros] + [resample(target_ones, replace=True, n_samples=len(target_zeros), random_state=42)])
        print(target_upsampled.value_counts())
    elif method == 'downsampling':
        # Downsample the majority class
        features_downsampled = pd.concat([resample(features_zeros, replace=False, n_samples=len(features_ones), random_state=42)] + [features_ones])
        target_downsampled = pd.concat([resample(target_zeros, replace=False, n_samples=len(target_ones), random_state=42)] + [target_ones])
        print(target_downsampled.value_counts())
    else:
        print(target_train.value_counts())

    def evaluate(model, threshold=None):
        # Score a fitted model on the validation set; a custom threshold replaces the default predict()
        probabilities_valid = model.predict_proba(features_valid)[:, 1]
        if threshold is None:
            predicted_valid = model.predict(features_valid)
        else:
            predicted_valid = probabilities_valid > threshold
        return [accuracy_score(target_valid, predicted_valid),
                f1_score(target_valid, predicted_valid),
                roc_auc_score(target_valid, probabilities_valid)]

    for n_estimators in hyperparameters['n_estimators']:
        for max_depth in hyperparameters['max_depth']:
            for min_samples_split in hyperparameters['min_samples_split']:
                for min_samples_leaf in hyperparameters['min_samples_leaf']:
                    model = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, min_samples_split=min_samples_split, min_samples_leaf=min_samples_leaf, random_state=42)
                    # Record n_estimators as well, so each row identifies a unique configuration
                    row = ['RandomForest', n_estimators, max_depth, min_samples_split, min_samples_leaf]
                    if method == 'class_weight':
                        for class_weight in hyperparameters['class_weight']:
                            model.class_weight = class_weight
                            model.fit(features_train, target_train)
                            results.append(row + ['class_weight', class_weight] + evaluate(model))
                    elif method == 'upsampling':
                        # Shuffle the dataset
                        features_upsampled, target_upsampled = shuffle(features_upsampled, target_upsampled, random_state=42)
                        model.fit(features_upsampled, target_upsampled)
                        results.append(row + ['upsampling', 'N/A'] + evaluate(model))
                    elif method == 'downsampling':
                        # Shuffle the dataset
                        features_downsampled, target_downsampled = shuffle(features_downsampled, target_downsampled, random_state=42)
                        model.fit(features_downsampled, target_downsampled)
                        results.append(row + ['downsampling', 'N/A'] + evaluate(model))
                    elif method == 'threshold':
                        # Fit once; only the decision threshold changes inside the loop
                        model.fit(features_train, target_train)
                        for threshold in hyperparameters['thresholds']:
                            results.append(row + ['threshold', threshold] + evaluate(model, threshold))
                    else:
                        model.fit(features_train, target_train)
                        results.append(row + ['None', 'N/A'] + evaluate(model))

    df = pd.DataFrame(results, columns=['model_type', 'n_estimators', 'max_depth', 'min_samples_split', 'min_samples_leaf', 'method', 'parameter', 'accuracy', 'f1_score', 'auc_roc'])
    return df.sort_values('f1_score', ascending=False)

### Train the Model

In [30]:
df_sorted = random_forest_experiment(method=None)
df_display = df_sorted[['max_depth', 'min_samples_split', 'min_samples_leaf', 'f1_score', 'auc_roc']]
display(df_display.head())
if df_sorted.iloc[0]['f1_score'] > best['f1_score']:
    best = df_sorted.iloc[0]
print('Best so far:', best)

Exited
0    4773
1    1227
Name: count, dtype: int64


Unnamed: 0,max_depth,min_samples_split,min_samples_leaf,f1_score,auc_roc
30,,5,1,0.560261,0.829896
45,10.0,2,1,0.559322,0.844311
3,,5,1,0.55573,0.811037
34,,10,2,0.555556,0.834056
57,,5,1,0.555372,0.834625


max_depth                 NaN
min_samples_split    5.000000
min_samples_leaf     1.000000
f1_score             0.560261
auc_roc              0.829896
Name: 30, dtype: float64


#### Class Weight Adjustment

In [31]:
df_sorted = random_forest_experiment(method='class_weight')
df_display = df_sorted[['max_depth', 'min_samples_split', 'min_samples_leaf', 'parameter', 'f1_score', 'auc_roc']]
display(df_display.head())
if df_sorted.iloc[0]['f1_score'] > best['f1_score']:
    best = df_sorted.iloc[0]
print('Best so far:', best)

Exited
0    4773
1    1227
Name: count, dtype: int64


Unnamed: 0,max_depth,min_samples_split,min_samples_leaf,parameter,f1_score,auc_roc
144,10.0,2,1,balanced,0.594315,0.84575
122,,10,2,balanced,0.591093,0.839328
66,,10,1,balanced,0.588563,0.83427
104,10.0,10,2,balanced,0.585608,0.845286
150,10.0,5,1,balanced,0.585242,0.842479


max_depth                10.0
min_samples_split           2
min_samples_leaf            1
parameter            balanced
f1_score             0.594315
auc_roc               0.84575
Name: 144, dtype: object


#### Upsampling

In [32]:
df_sorted = random_forest_experiment(method='upsampling')
df_display = df_sorted[['max_depth', 'min_samples_split', 'min_samples_leaf', 'accuracy', 'f1_score', 'auc_roc']]
display(df_display.head())
if df_sorted.iloc[0]['f1_score'] > best['f1_score']:
    best = df_sorted.iloc[0]
print('Best so far:', best)

Exited
0    7963
1    7963
Name: count, dtype: int64


Unnamed: 0,max_depth,min_samples_split,min_samples_leaf,accuracy,f1_score,auc_roc
54,,2,1,0.9965,0.990704,0.997031
27,,2,1,0.9965,0.990704,0.997058
30,,5,1,0.994,0.984,0.996637
57,,5,1,0.9935,0.98269,0.996339
0,,2,1,0.9905,0.974834,0.996339


max_depth                 NaN
min_samples_split    2.000000
min_samples_leaf     1.000000
accuracy             0.996500
f1_score             0.990704
auc_roc              0.997031
Name: 54, dtype: float64


#### Downsampling

In [33]:
df_sorted = random_forest_experiment(method='downsampling')
df_display = df_sorted[['max_depth', 'min_samples_split', 'min_samples_leaf', 'accuracy', 'f1_score', 'auc_roc']]
display(df_display.head())
if df_sorted.iloc[0]['f1_score'] > best['f1_score']:
    best = df_sorted.iloc[0]
print('Best so far:', best)

Exited
0    2037
1    2037
Name: count, dtype: int64


Unnamed: 0,max_depth,min_samples_split,min_samples_leaf,accuracy,f1_score,auc_roc
54,,2,1,0.8775,0.756219,0.987168
27,,2,1,0.871,0.746562,0.985928
0,,2,1,0.871,0.742515,0.970873
57,,5,1,0.8645,0.734053,0.97268
55,,2,2,0.863,0.729783,0.967571


max_depth                 NaN
min_samples_split    2.000000
min_samples_leaf     1.000000
accuracy             0.877500
f1_score             0.756219
auc_roc              0.987168
Name: 54, dtype: float64


#### Threshold Adjustment

In [34]:
df_sorted = random_forest_experiment(method='threshold')
df_display = df_sorted[['max_depth', 'min_samples_split', 'min_samples_leaf', 'parameter', 'f1_score', 'auc_roc']]
display(df_display.head())
if df_sorted.iloc[0]['f1_score'] > best['f1_score']:
    best = df_sorted.iloc[0]
print('Best so far:', best)

Exited
0    4773
1    1227
Name: count, dtype: int64


Unnamed: 0,max_depth,min_samples_split,min_samples_leaf,parameter,f1_score,auc_roc
668,,10,1,0.4,0.604046,0.835946
1207,,10,1,0.35,0.600791,0.840374
667,,10,1,0.35,0.59761,0.835946
707,,10,5,0.35,0.593923,0.838596
587,,2,5,0.35,0.593923,0.838596


max_depth                  NaN
min_samples_split    10.000000
min_samples_leaf      1.000000
parameter             0.400000
f1_score              0.604046
auc_roc               0.835946
Name: 668, dtype: float64


<div class="alert alert-danger">
<b>Reviewer's comment</b>

Upsampling should be applied only to the train set; otherwise it won't be possible to accurately estimate how the model will generalize to new data, for two reasons:
    
1. Validation/test data obtained from an upsampled full dataset will not have the same distribution as actual data (which is not balanced)
2. There are bound to be the same examples in train and test, which is a clear case of data leakage.
    
The goal of upsampling is just to help the model better learn about the underrepresented class, but the validation and test set need to have the original data distribution in order for evaluation to make any sense.

</div>
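A minimal sketch of train-only upsampling, for illustration. The toy data and the helper name `upsample_train` are assumptions, not part of the project; the idea is to split first and resample only the training part, so the validation set keeps the original class balance and shares no rows with the upsampled data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample, shuffle


def upsample_train(features, target, random_state=42):
    """Upsample class 1 of a training split to the size of class 0 (hypothetical helper)."""
    features_zeros, features_ones = features[target == 0], features[target == 1]
    target_zeros, target_ones = target[target == 0], target[target == 1]
    # Resample features and target together so the rows stay aligned
    features_ones_up, target_ones_up = resample(
        features_ones, target_ones,
        replace=True, n_samples=len(target_zeros), random_state=random_state)
    features_up = pd.concat([features_zeros, features_ones_up])
    target_up = pd.concat([target_zeros, target_ones_up])
    return shuffle(features_up, target_up, random_state=random_state)


# Toy stand-in for the churn data: 80 customers who stayed (0), 20 who left (1)
data = pd.DataFrame({'Age': range(100), 'Exited': [0] * 80 + [1] * 20})
features, target = data[['Age']], data['Exited']

# Split first, so the validation set is never touched by resampling
features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.25, stratify=target, random_state=42)

features_up, target_up = upsample_train(features_train, target_train)
print(target_up.value_counts())     # balanced classes, training split only
print(target_valid.value_counts())  # original, imbalanced distribution
```

The same pattern should carry over to downsampling: take the class subsets from `features_train` / `target_train` rather than from the full dataset.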

<div class="alert alert-danger">
<b>Reviewer's comment</b>

The same comment as for upsampling: downsampling should only be applied to the train set. While the argument about having the same examples in train and test no longer works, the first point about validation/test data needing to have the original data distribution for accurate estimation of generalization performance applies here.

</div>

## Testing

<div class="alert alert-danger">
<b>Reviewer's comment</b>

Please check the results after making sure that the test set is only used for final model evaluation and all prior comparisons are done using the validation set.

</div>
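One possible shape for this workflow, as a sketch only. It uses synthetic data from `make_classification` and an assumed 60/20/20 split rather than the project's actual data or grid: candidates are compared on the validation set, and the test set is scored once, for the selected model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the churn dataset (roughly 80% / 20%)
X, y = make_classification(n_samples=1000, n_features=8, weights=[0.8, 0.2], random_state=42)

# Assumed split: 60% train, 20% validation, 20% test
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=42)

# Model selection: compare candidates on the validation set only
best_f1, best_model = -1.0, None
for max_depth in [5, 10, None]:
    model = RandomForestClassifier(n_estimators=50, max_depth=max_depth,
                                   class_weight='balanced', random_state=42)
    model.fit(X_train, y_train)
    valid_f1 = f1_score(y_valid, model.predict(X_valid))
    if valid_f1 > best_f1:
        best_f1, best_model = valid_f1, model

# Final evaluation: the test set is used once, for the selected model
test_f1 = f1_score(y_test, best_model.predict(X_test))
print(f'validation F1: {best_f1:.3f}, test F1: {test_f1:.3f}')
```

Because the test score plays no part in choosing the model, it stays an unbiased estimate of how the selected model generalizes.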

## Conclusion

This project involved building a machine learning model to predict customer churn. The dataset was imbalanced, with more customers continuing their business with Beta Bank than leaving. The initial model, trained without addressing the imbalance, performed poorly, with a low F1 score for the minority class. After addressing the class imbalance using both upsampling and downsampling, the F1 score improved dramatically, from 0.12 to around 0.63-0.64. The AUC-ROC score of the improved model was around 0.65-0.69.

This project demonstrated the importance of properly preprocessing the data, handling class imbalance, and choosing the right evaluation metrics when working with imbalanced datasets. It also showed the iterative process of building a model and continually improving it based on its performance.

<div class="alert alert-danger">
<b>Reviewer's comment</b>

Don't forget to change the conclusions if needed.

</div>