# Contents <a id='back'></a>

* [Introduction](#intro)
* [1. Data Overview](#data_review)
    * [Conclusions](#data_review_conclusions)
* [2. Data Pre-Processing](#data_preprocessing)
    * [2.1 Standardize Column Names](#column_names)
    * [2.2 Handling Missing Values](#missing_values)
    * [2.3 Drop Unused Columns](#drop_unused_cols)
    * [2.4 Categorizing Columns](#categorizing_cols)
    * [2.5 Encode Features](#encode)
* [3. Splitting the Data](#splitting_data)
* [4. Assessing Model Quality](#model_quality)
    * [4.1 Logistic Regression](#initial_lr)
    * [4.2 Decision Tree Classifier](#initial_dtree)
    * [4.3 Random Forest](#initial_rf)
    * [Conclusions](#model_quality_conclusions)
* [5. Normalizing the features in the dataset using Standard Scaler](#scaling)
    * [5.1 Logistic Regression](#scaling_lr)
    * [5.2 Decision Tree](#scaling_dtree)
    * [5.3 Random Forest](#scaling_rf)
    * [Conclusions](#after_scaling_conclusions)
* [6. Improving the model's quality](#improve)
    * [6.1 Hyperparameter Tuning](#hyperparameter_tuning)
    * [6.2 Upsampling](#upsampling)
    * [6.3 Downsampling](#downsampling)
    * [Conclusions](#after_improving_conclusions)
* [7. Testing Model on Test Dataset](#testing_model)
* [General Conclusion](#end)

# Introduction <a id='intro'></a>

In this project, I will predict whether a customer of Bank Beta is likely to leave the bank soon. The bank's employees have realized that retaining loyal existing customers is more cost-effective than attracting new ones.
I have data on the clients' past behavior and their history of contract terminations with the bank.


**Objective:**

To train a model with the highest possible F1 score with a minimum F1 score of 0.59 for the test dataset. Additionally, I will measure the AUC-ROC metric and compare it with the F1 score.


**This project will consist of seven steps:**

1. Data Overview
2. Data Preprocessing
3. Splitting the Data
4. Assessing Model Quality
5. Normalizing the Features in the dataset using Standard Scaler
6. Improving the model's quality
7. Testing model on test dataset


[Back to Contents](#back)

## 1. Data Overview <a id='data_review'></a>

The steps to be performed are as follows:
1. Checking the number of rows and columns.
2. Checking for missing values.
3. Checking for duplicate data.
4. Checking statistical information in columns with numerical data types.
5. Checking values in columns with categorical data types.
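
The five checks above can be sketched on a small stand-in table. The `toy` DataFrame and its values are made up for illustration and are not the churn data; only the pandas calls matter here. Note that `duplicated()` covers the duplicate check from step 3.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the churn data; values are illustrative only.
toy = pd.DataFrame({
    "CreditScore": [619, 608, 502, 608],
    "Geography": ["France", "Spain", "France", "Spain"],
    "Tenure": [2.0, 1.0, np.nan, 1.0],
    "Exited": [1, 0, 1, 0],
})

n_rows, n_cols = toy.shape                           # 1. number of rows and columns
missing_per_column = toy.isnull().sum()              # 2. missing values per column
n_duplicates = int(toy.duplicated().sum())           # 3. fully duplicated rows
numeric_summary = toy.describe()                     # 4. statistics for numeric columns
geography_counts = toy["Geography"].value_counts()   # 5. values in a categorical column

print(n_rows, n_cols)
print(missing_per_column)
print("duplicated rows:", n_duplicates)
print(geography_counts)
```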

[Back to Contents](#back)

In [1]:
# load libraries

# dataset
import pandas as pd

# scientific computing
import numpy as np

# model libraries
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# feature scaling: standardize features by removing the mean and scaling to unit variance
from sklearn.preprocessing import StandardScaler

# splitting data
from sklearn.model_selection import train_test_split

# testing model
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, roc_auc_score

# shuffle arrays
from sklearn.utils import shuffle

# ignore warning
import warnings
from pandas.errors import SettingWithCopyWarning
warnings.filterwarnings("ignore")

In [2]:
# load dataset
path = 'data/Churn.csv'
df = pd.read_csv(path)

### 1.1 Data Exploration: churn dataset

In [4]:
df.shape

(10000, 14)

In [3]:
df.head()

| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2.0 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1.0 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8.0 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1.0 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2.0 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           9091 non-null   float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


In [6]:
# percentage of missing values per column
df.isnull().mean() * 100

RowNumber          0.00
CustomerId         0.00
Surname            0.00
CreditScore        0.00
Geography          0.00
Gender             0.00
Age                0.00
Tenure             9.09
Balance            0.00
NumOfProducts      0.00
HasCrCard          0.00
IsActiveMember     0.00
EstimatedSalary    0.00
Exited             0.00
dtype: float64

In [7]:
df.describe()

| | RowNumber | CustomerId | CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 9091.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 5000.5 | 15690940.0 | 650.5288 | 38.9218 | 4.99769 | 76485.889288 | 1.5302 | 0.7055 | 0.5151 | 100090.239881 | 0.2037 |
| std | 2886.89568 | 71936.19 | 96.653299 | 10.487806 | 2.894723 | 62397.405202 | 0.581654 | 0.45584 | 0.499797 | 57510.492818 | 0.402769 |
| min | 1.0 | 15565700.0 | 350.0 | 18.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 11.58 | 0.0 |
| 25% | 2500.75 | 15628530.0 | 584.0 | 32.0 | 2.0 | 0.0 | 1.0 | 0.0 | 0.0 | 51002.11 | 0.0 |
| 50% | 5000.5 | 15690740.0 | 652.0 | 37.0 | 5.0 | 97198.54 | 1.0 | 1.0 | 1.0 | 100193.915 | 0.0 |
| 75% | 7500.25 | 15753230.0 | 718.0 | 44.0 | 7.0 | 127644.24 | 2.0 | 1.0 | 1.0 | 149388.2475 | 0.0 |
| max | 10000.0 | 15815690.0 | 850.0 | 92.0 | 10.0 | 250898.09 | 4.0 | 1.0 | 1.0 | 199992.48 | 1.0 |


In [9]:
# check class counts of the target column (column names are not lowercased yet)
df['Exited'].value_counts()

Exited
0    7963
1    2037
Name: count, dtype: int64

### Conclusion <a id='data_review_conclusions'></a>

1. Column names are not standardized (mixed case).
2. The 'Tenure' column has 909 missing values, about 9.09% of the rows. Since the percentage is not high, the missing values will be filled with the median.
3. Data types in the columns are correct.
4. The composition of the target data is not ideal due to an imbalance: about 80% of the rows have 'Exited' = 0 and about 20% have 'Exited' = 1. Since the majority of the target data is 0, the model will tend to predict the value 0. This can result in poor recall and a low F1 score for the churn class, even when accuracy looks high. To address this imbalance, techniques like upsampling (increasing the frequency of value 1) or downsampling (reducing the frequency of value 0) can be employed. However, upsampling by repeating rows adds no new information and can encourage overfitting, while downsampling discards data.
5. Columns that are not used will be dropped.
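
To make the imbalance concrete, here is a small sketch that turns the class counts from the output above (7963 and 2037) into percentages. The variable names are my own, and the "always predict 0" baseline is only an illustration of why accuracy alone would be misleading.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({0: 7963, 1: 2037}, name="Exited")

# Share of each class in percent.
shares = counts / counts.sum() * 100

# Accuracy of a hypothetical model that always predicts the majority class.
majority_baseline_accuracy = counts.max() / counts.sum()

print(shares)
print("baseline accuracy:", majority_baseline_accuracy)
```

Such a baseline would be right about 80% of the time while never identifying a single churning customer, which is why this project is evaluated with the F1 score instead.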

## 2. Data Preprocessing <a id='data_preprocessing'></a>

[Back to Contents](#back)

### 2.1 Standardize Column Names <a id='column_names'></a>

In [10]:
df.columns = df.columns.str.casefold()

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   rownumber        10000 non-null  int64  
 1   customerid       10000 non-null  int64  
 2   surname          10000 non-null  object 
 3   creditscore      10000 non-null  int64  
 4   geography        10000 non-null  object 
 5   gender           10000 non-null  object 
 6   age              10000 non-null  int64  
 7   tenure           9091 non-null   float64
 8   balance          10000 non-null  float64
 9   numofproducts    10000 non-null  int64  
 10  hascrcard        10000 non-null  int64  
 11  isactivemember   10000 non-null  int64  
 12  estimatedsalary  10000 non-null  float64
 13  exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


### 2.2 Handling Missing Values <a id='missing_values'></a>

In [12]:
# fill in missing values in 'tenure' column with median value
df['tenure'] = df['tenure'].fillna(value=df['tenure'].median())

In [13]:
df.isnull().sum()

rownumber          0
customerid         0
surname            0
creditscore        0
geography          0
gender             0
age                0
tenure             0
balance            0
numofproducts      0
hascrcard          0
isactivemember     0
estimatedsalary    0
exited             0
dtype: int64

**Conclusion**

There are no more missing values.

### 2.3 Drop Unused Columns <a id='drop_unused_cols'></a>

In [14]:
df = df.drop(columns=['rownumber', 'customerid', 'surname'])

### 2.4 Categorizing Columns <a id='categorizing_cols'></a>

In [15]:
# categorical columns
df_categorical = ['geography', 'gender']

# numerical columns
df_numerical = ['creditscore', 'age', 'tenure', 'balance', 'numofproducts', 'hascrcard', 'isactivemember', 'estimatedsalary']

In [16]:
df.head()

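| | creditscore | geography | gender | age | tenure | balance | numofproducts | hascrcard | isactivemember | estimatedsalary | exited |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | France | Female | 42 | 2.0 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 608 | Spain | Female | 41 | 1.0 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 502 | France | Female | 42 | 8.0 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 699 | France | Female | 39 | 1.0 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 850 | Spain | Female | 43 | 2.0 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |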


### 2.5 Encode Features <a id='encode'></a>

I will use get_dummies because the categorical columns do not have any order (they are not ordinal data).
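
As a small illustration of what `drop_first=True` does, the sketch below encodes a three-row toy table (made-up values, not the churn data). The first category of each column, in sorted order, is dropped and becomes the implicit baseline.

```python
import pandas as pd

# Hypothetical toy rows; only the category labels mirror the real columns.
toy = pd.DataFrame({
    "geography": ["France", "Spain", "Germany"],
    "gender": ["Female", "Male", "Female"],
})

# One-hot encode both columns and drop the first category of each
# ("France" and "Female"), which then corresponds to all dummies being 0.
encoded = pd.get_dummies(toy, drop_first=True, columns=["geography", "gender"])

print(list(encoded.columns))
print(encoded)
```

Dropping one category per column avoids a redundant dummy column, since its value is fully determined by the others.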

In [17]:
df = pd.get_dummies(df, drop_first=True, columns = df_categorical)

In [18]:
df.head()

| | creditscore | age | tenure | balance | numofproducts | hascrcard | isactivemember | estimatedsalary | exited | geography_Germany | geography_Spain | gender_Male |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | 42 | 2.0 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 | False | False | False |
| 1 | 608 | 41 | 1.0 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 | False | True | False |
| 2 | 502 | 42 | 8.0 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 | False | False | False |
| 3 | 699 | 39 | 1.0 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 | False | False | False |
| 4 | 850 | 43 | 2.0 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 | False | True | False |


## 3. Splitting Data <a id='splitting_data'></a>

The data will be split into:
1. Training Set: This data is used to train and build the model.
2. Validation Set: This data is used to optimize the model during its construction. It helps assess the model's ability to recognize patterns in a general sense. The validation set is also used to evaluate the accuracy of the created model. If the accuracy is not satisfactory, hyperparameter tuning can be performed.
3. Test Set: This data is used to test the model's performance.
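
A possible variation on the three-way split, shown on synthetic data: passing `random_state` makes the split reproducible, and `stratify` keeps the class ratio roughly equal in all three parts. Both arguments are my own suggestion and are not used in the cell below; the `toy` frame is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 20% positives, similar to the churn target.
toy = pd.DataFrame({"feature": np.arange(1000), "exited": [1] * 200 + [0] * 800})

# First split off 15% as the test set, keeping the class ratio.
train_valid, test = train_test_split(
    toy, test_size=0.15, random_state=42, stratify=toy["exited"]
)

# Then take 25% of the remainder as the validation set
# (about 21% of all rows), again keeping the class ratio.
train, valid = train_test_split(
    train_valid, test_size=0.25, random_state=42, stratify=train_valid["exited"]
)

for name, part in [("train", train), ("valid", valid), ("test", test)]:
    print(name, len(part), round(part["exited"].mean(), 3))
```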

[Back to Contents](#back)

In [19]:
# Split df into df_train_valid, df_test
df_train_valid, df_test = train_test_split(df, test_size=0.15)

# split df_train, df_valid from df_train_valid
df_train, df_valid = train_test_split(df_train_valid, test_size=0.25)

# Define features and target for training dataset
features_train = df_train.drop('exited', axis=1)
target_train = df_train['exited']

# Define features and target for validation dataset
features_valid = df_valid.drop('exited', axis=1)
target_valid = df_valid['exited']

# Define features and target for test dataset
features_test = df_test.drop('exited', axis=1)
target_test = df_test['exited']

In [20]:
print(features_train.shape)
print(features_valid.shape)
print(features_test.shape)

(6375, 11)
(2125, 11)
(1500, 11)


## 4. Assessing Model Quality <a id='model_quality'></a>

The models are first assessed without hyperparameter tuning and before scaling.

The assessment of model quality will use the following metrics:

1. F1 score: The harmonic mean of precision and recall, F1 = 2 * precision * recall / (precision + recall). The best F1 score is 1.0, and the worst is 0. A high F1 score indicates that the classification model has both good precision and good recall.
Source: [Understanding Precision, Recall, and F1-Score](https://stevkarta.medium.com/membicarakan-precision-recall-dan-f1-score-e96d81910354)

2. AUC-ROC score:
The ROC (Receiver Operating Characteristic) curve is a performance measurement tool for classification problems. It plots the true positive rate against the false positive rate at every classification threshold, so it can also be used to choose a threshold for a model.
AUC (Area Under the Curve) makes it easy to compare one model to another. AUC is the area under the ROC curve, that is, the integral of the ROC function. A value of 0.5 corresponds to random guessing and 1.0 to a perfect ranking.
When comparing models, a higher AUC is preferred, since it means a higher true positive rate and/or a lower false positive rate across thresholds.
Source: [Understanding ROC and AUC](https://datasans.medium.com/memahami-roc-dan-auc-2e0e4f3638bf)
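
To show how the two metrics relate to predictions, here is a minimal sketch on eight made-up labels and probabilities (not taken from the churn data). The F1 score is computed from hard predictions at a 0.5 threshold, while AUC-ROC is computed from the probabilities themselves.

```python
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

# Hypothetical true labels and predicted probabilities of class 1.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.9, 0.45, 0.7, 0.2]

# Hard predictions at the default 0.5 threshold.
y_pred = [int(p >= 0.5) for p in y_prob]

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

# F1 as the harmonic mean of precision and recall, and the library value.
f1_manual = 2 * precision * recall / (precision + recall)
f1 = f1_score(y_true, y_pred)

# AUC-ROC uses the probabilities, so it does not depend on the threshold.
auc = roc_auc_score(y_true, y_prob)

print("precision =", precision, "recall =", recall)
print("F1 =", f1, "AUC-ROC =", auc)
```

Because AUC-ROC ignores the threshold, a model can have a reasonable AUC-ROC and still a low F1 score if the 0.5 cutoff suits the imbalanced classes poorly.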

[Back to Contents](#back)

### 4.1 Logistic Regression <a id='initial_lr'></a>

In [22]:
beforeScaling_lr = LogisticRegression(random_state = 42)

# train model on training set
beforeScaling_lr.fit(features_train, target_train)

#  predict using validation set
y_predict_valid_lr = beforeScaling_lr.predict(features_valid)

# measuring probability using validation set
y_probability_valid_lr = beforeScaling_lr.predict_proba(features_valid)[:, 1]

# test performance algorithm using F1 score and auc_score
print('F1 score =', f1_score(target_valid, y_predict_valid_lr))
print('AUC-ROC score =', roc_auc_score(target_valid, y_probability_valid_lr))

F1 score = 0.1140529531568228
AUC-ROC score = 0.6700753101004134


### 4.2 Decision Tree Classifier <a id='initial_dtree'></a>

In [23]:
beforeScaling_dTree = DecisionTreeClassifier()

# train model on training set
beforeScaling_dTree.fit(features_train, target_train)

#  predict using validation set
y_predict_valid_dtree = beforeScaling_dTree.predict(features_valid)

# measuring probability using validation set
y_probability_valid_dtree = beforeScaling_dTree.predict_proba(features_valid)[:, 1]

# test performance algorithm using F1 score and auc_score
print('F1 score =', f1_score(target_valid, y_predict_valid_dtree))
print('AUC-ROC score =', roc_auc_score(target_valid, y_probability_valid_dtree))

F1 score = 0.47314285714285714
AUC-ROC score = 0.6698845737349872


### 4.3 Random Forest <a id='initial_rf'></a>

In [21]:
beforeScaling_rf = RandomForestClassifier()

# train model on training set
beforeScaling_rf.fit(features_train, target_train)

#  predict using validation set
y_predict_valid_rf = beforeScaling_rf.predict(features_valid)

# measuring probability using validation set
y_probability_valid_rf = beforeScaling_rf.predict_proba(features_valid)[:, 1]

# test performance algorithm using F1 score and auc_score
print('F1 score =', f1_score(target_valid, y_predict_valid_rf))
print('AUC-ROC score =', roc_auc_score(target_valid, y_probability_valid_rf))

F1 score = 0.6107091172214182
AUC-ROC score = 0.8696814526348492


### Conclusion (before scaling and before hyperparameter tuning) <a id='model_quality_conclusions'></a>

1. Based on the performance results of the three models, the F1 scores from highest to lowest are as follows:
   - Random Forest model with an F1 score of 0.61 and an AUC-ROC score of 0.87
   - Decision Tree model with an F1 score of 0.47 and an AUC-ROC score of 0.67
   - Logistic Regression model with an F1 score of 0.11 and an AUC-ROC score of 0.67


2. Random Forest is the only model that reaches the minimum F1 score of 0.59 on the validation set.

## 5. Normalizing the features in the dataset using Standard Scaler <a id='scaling'></a>

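StandardScaler is a class from sklearn that standardizes each feature by subtracting its mean and dividing by its standard deviation, so that features measured on very different scales (for example, balance and age) become comparable.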

Source: [Building a Classification Model to Predict Legendary Pokémon](https://medium.com/codelabs-unikom/membangun-model-klasifikasi-untuk-mempredict-pokemon-legend-935d2accceaa#:~:text=StandardScaler%20is%20class%20dari%20sklearn,tidak%20memiliki%20penyimpangan%20yang%20besar.&text=Satu%20hal%20penting%20dalam%20Data,apa%20yang%20akan%20di%20analisis.)
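
As a rough illustration, the sketch below checks the standardization formula z = (x - mean) / std by hand against `StandardScaler` on made-up numbers. The mean and standard deviation are learned from the training part only and then reused for the validation part, which mirrors the `fit_transform` / `transform` pattern in the next cell.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data (for example, credit scores).
train = np.array([[600.0], [650.0], [700.0]])
valid = np.array([[650.0], [750.0]])

# Learn mean and standard deviation from the training data only.
scaler = StandardScaler().fit(train)

# Apply the training statistics to the validation data.
scaled = scaler.transform(valid)

# Same computation by hand; StandardScaler uses the population
# standard deviation (ddof=0), which is also NumPy's default.
manual = (valid - train.mean(axis=0)) / train.std(axis=0)

print(scaled.ravel())
print(manual.ravel())
```

Fitting the scaler on the training set alone keeps information from the validation and test sets out of the training process.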

[Back to Contents](#back)

In [22]:
# Scaling features
scaler = StandardScaler()

features_train[df_numerical] = scaler.fit_transform(features_train[df_numerical])

features_valid[df_numerical] = scaler.transform(features_valid[df_numerical])
features_test[df_numerical] = scaler.transform(features_test[df_numerical])

In [23]:
features_train.head()

| | creditscore | age | tenure | balance | numofproducts | hascrcard | isactivemember | estimatedsalary | geography_Germany | geography_Spain | gender_Male |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1870 | -0.280615 | -0.563822 | 0.357813 | -1.215725 | 0.795789 | -1.534687 | -1.030747 | -0.408434 | False | True | True |
| 5182 | 0.194409 | -0.755542 | -1.092782 | -0.294384 | -0.917536 | 0.651599 | 0.97017 | 1.420323 | False | False | False |
| 4933 | -2.387242 | -0.755542 | -1.455431 | 0.519939 | 0.795789 | 0.651599 | 0.97017 | -1.565444 | True | False | False |
| 6297 | 1.609154 | -0.467962 | -1.455431 | -1.215725 | -0.917536 | -1.534687 | -1.030747 | 0.249458 | False | False | False |
| 3230 | 0.287348 | -0.563822 | 0.357813 | -1.215725 | 0.795789 | 0.651599 | 0.97017 | -0.035811 | False | False | False |


### 5.1 Logistic Regression <a id='scaling_lr'></a>

In [24]:
afterScaling_lr = LogisticRegression(random_state = 42)

# train model on training set
afterScaling_lr.fit(features_train, target_train)

#  predict using validation set
y_predict_valid_lr = afterScaling_lr.predict(features_valid)

# measuring probability using validation set
y_probability_valid_lr = afterScaling_lr.predict_proba(features_valid)[:, 1]

# test performance algorithm using F1 score and auc_score
print('F1 score =', f1_score(target_valid, y_predict_valid_lr))
print('AUC-ROC score =', roc_auc_score(target_valid, y_probability_valid_lr))

F1 score = 0.3501683501683502
AUC-ROC score = 0.7953927951448028


### 5.2 Decision Tree Classifier <a id='scaling_dtree'></a>

In [25]:
afterScaling_dTree = DecisionTreeClassifier()
# train model on training set
afterScaling_dTree.fit(features_train, target_train)

#  predict using validation set
y_predict_valid_dtree = afterScaling_dTree.predict(features_valid)

# measuring probability using validation set
y_probability_valid_dtree = afterScaling_dTree.predict_proba(features_valid)[:, 1]

# test performance algorithm using F1 score and auc_score
print('F1 score =', f1_score(target_valid, y_predict_valid_dtree))
print('AUC-ROC score =', roc_auc_score(target_valid, y_probability_valid_dtree))

F1 score = 0.49944258639910816
AUC-ROC score = 0.6873962724862174


### 5.3 Random Forest <a id='scaling_rf'></a>

In [26]:
afterScaling_rf = RandomForestClassifier()
# train model on training set
afterScaling_rf.fit(features_train, target_train)

#  predict using validation set
y_predict_valid_rf = afterScaling_rf.predict(features_valid)

# measuring probability using validation set
y_probability_valid_rf = afterScaling_rf.predict_proba(features_valid)[:, 1]

# test performance algorithm using F1 score and auc_score
print('F1 score =', f1_score(target_valid, y_predict_valid_rf))
print('AUC-ROC score =', roc_auc_score(target_valid, y_probability_valid_rf))

F1 score = 0.6082036775106082
AUC-ROC score = 0.8691909185795224


### Conclusion <a id='after_scaling_conclusions'></a>

1. Based on the performance results of the three models after scaling, the F1 scores from highest to lowest are as follows:
   - Random Forest model with an F1 score of 0.61 and an AUC-ROC score of 0.87
   - Decision Tree model with an F1 score of 0.50 and an AUC-ROC score of 0.69
   - Logistic Regression model with an F1 score of 0.35 and an AUC-ROC score of 0.80

2. Random Forest is the only model that reaches the minimum F1 score of 0.59 on the validation set.

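3. Compared to the results before scaling, the Random Forest performance is essentially unchanged (an F1 score of 0.61 in both cases), which is expected because tree-based models do not depend on feature scale. Logistic Regression improved by about 3 times, from 0.11 to 0.35. The Decision Tree increased by 0.03 for the F1 score and by 0.02 for the AUC-ROC; since a single tree is also insensitive to scaling and was created without a fixed random_state, this small difference most likely comes from randomness between runs.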

## 6. Improving the model's quality <a id='improve'></a>

There are 3 methods that will be used to improve the model's quality:

1. Hyperparameter tuning for class_weight: Due to imbalance, hyperparameter tuning for class_weight will be performed with class_weight set to "balanced." The model will automatically assign inverse class weights. This configuration helps in assigning a higher weight to the minority class and reducing the weight for the majority class. Source: [Improve Class Imbalance with Class Weights](https://www.analyticsvidhya.com/blog/2020/10/improve-class-imbalance-class-weights/)

2. Upsampling: This procedure adds more data points of the minority class to the training set. In this project, this is done by repeating the existing minority-class rows rather than generating synthetic ones. After this process, the count of both labels becomes nearly equal, which prevents the model from leaning towards the majority class. According to the source, the interaction (decision boundary) between target classes remains unchanged. Because the repeated rows add no new information, upsampling can also introduce bias into the system. Source: [Handling Imbalanced Data - Upsampling](https://www.analyticsvidhya.com/blog/2020/11/handling-imbalanced-data-machine-learning-computer-vision-and-nlp/)

3. Downsampling: This mechanism reduces the number of training samples from the majority class. It helps in balancing the number of target categories. However, by discarding accumulated data, we tend to lose a lot of valuable information. Source: [Handling Imbalanced Data - Downsampling](https://www.analyticsvidhya.com/blog/2020/11/handling-imbalanced-data-machine-learning-computer-vision-and-nlp/)
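
To make "inverse class weights" from the first method concrete, here is a small sketch that computes the weights scikit-learn would assign with class_weight = "balanced", using the formula n_samples / (n_classes * count_per_class). The class counts 5068 and 1307 are the training-set counts reported in section 6.2; the array `y` is rebuilt from those counts purely for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Rebuild a label array with the training-set class counts from section 6.2.
y = np.array([0] * 5068 + [1] * 1307)

# Weights assigned by scikit-learn for class_weight="balanced".
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)

# Same formula by hand: n_samples / (n_classes * count_per_class).
manual = len(y) / (2 * np.bincount(y))

print(dict(zip([0, 1], weights)))
```

With these counts, each churned customer (class 1) weighs roughly four times as much as a retained one, which is comparable to repeating the minority rows four times.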

[Back to Contents](#back)

### 6.1 Hyperparameter Tuning (class_weight) <a id='hyperparameter_tuning'></a>

#### 6.1.1 Logistic Regression

The hyperparameters that will be tuned are:
- class_weight: balanced
  (Explanation: When set to class_weight = balanced, the model will automatically assign inverse class weights. This configuration helps in assigning a higher weight to the minority class and reducing the weight for the majority class.)
  Source: [Improve Class Imbalance with Class Weights](https://www.analyticsvidhya.com/blog/2020/10/improve-class-imbalance-class-weights/)
- random_state: Controls the randomness of the estimator.

In [27]:
afterScaling_lr = LogisticRegression(random_state = 42, class_weight ='balanced')

# train model on training set
afterScaling_lr.fit(features_train, target_train)

#  predict using validation set
y_predict_valid_lr = afterScaling_lr.predict(features_valid)

# measuring probability using validation set
y_probability_valid_lr = afterScaling_lr.predict_proba(features_valid)[:, 1]

# test performance algorithm using F1 score and auc_score
print('F1 score =', f1_score(target_valid, y_predict_valid_lr))
print('AUC-ROC score =', roc_auc_score(target_valid, y_probability_valid_lr))

F1 score = 0.5180327868852459
AUC-ROC score = 0.7975715839072127


#### 6.1.2 Decision Tree Classifier

The hyperparameters that will be tuned are:

- max_depth: Limits the depth of the tree (the number of levels of splits); if set too high, overfitting might occur.
- class_weight: balanced
  (Explanation: When set to class_weight = balanced, the model will automatically assign inverse class weights. This configuration helps in assigning a higher weight to the minority class and reducing the weight for the majority class.)
  Source: [Improve Class Imbalance with Class Weights](https://www.analyticsvidhya.com/blog/2020/10/improve-class-imbalance-class-weights/)
- random_state: Controls the randomness of the estimator.

In [28]:
for depth in range(1, 15):
    model_dtree = DecisionTreeClassifier(max_depth=depth, random_state=42, class_weight='balanced')
    model_dtree.fit(features_train, target_train)
    predictions_valid_dtree = model_dtree.predict(features_valid)
    probabilities_valid_dtree = model_dtree.predict_proba(features_valid)[:, 1]

    # report both metrics on the validation set for this depth
    print("at max_depth", depth, "F1 score and AUC-SCORE is:")
    print('F1 score =', f1_score(target_valid, predictions_valid_dtree))
    print('AUC-ROC score =', roc_auc_score(target_valid, probabilities_valid_dtree))
    print()

at max_depth 1 F1 score and AUC-SCORE is:
F1 score = 0.49515418502202646
AUC-ROC score = 0.6995458472204432

at max_depth 2 F1 score and AUC-SCORE is:
F1 score = 0.5248713550600342
AUC-ROC score = 0.7666427304215595

at max_depth 3 F1 score and AUC-SCORE is:
F1 score = 0.5366726296958856
AUC-ROC score = 0.8089758193962616

at max_depth 4 F1 score and AUC-SCORE is:
F1 score = 0.5318985395849346
AUC-ROC score = 0.818614813583433

at max_depth 5 F1 score and AUC-SCORE is:
F1 score = 0.5748502994011977
AUC-ROC score = 0.8387669063924762

at max_depth 6 F1 score and AUC-SCORE is:
F1 score = 0.5616883116883117
AUC-ROC score = 0.8376652486598882

at max_depth 7 F1 score and AUC-SCORE is:
F1 score = 0.5729827742520398
AUC-ROC score = 0.8196694618023855

at max_depth 8 F1 score and AUC-SCORE is:
F1 score = 0.5511265164644714
AUC-ROC score = 0.7971825631494467

at max_depth 9 F1 score and AUC-SCORE is:
F1 score = 0.5461187214611872
AUC-ROC score = 0.7681729241552595

at max_depth 10 F1 score and

#### 6.1.3 Random Forest

The hyperparameters that will be tuned are:
- class_weight: balanced
  (Explanation: When set to class_weight = balanced, the model will automatically assign inverse class weights. This configuration helps in assigning a higher weight to the minority class and reducing the weight for the majority class.)
  Source: [Improve Class Imbalance with Class Weights](https://www.analyticsvidhya.com/blog/2020/10/improve-class-imbalance-class-weights/)
- random_state: Controls the randomness of the estimator.
- n_estimators: Determines the number of trees in the random forest model. As the number of estimators increases, the variance of predictions decreases. More trees therefore tend to give more stable results, but with diminishing returns and longer training time, and the validation F1 score does not always improve.
- max_depth: Limits the depth of each tree in the forest.
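
Since the loop below prints one block per combination, the best result has to be read off by eye. A possible alternative, sketched here on synthetic data from `make_classification` with a deliberately small grid, is to collect the scores in a list and pick the best combination programmatically. The data, the grid values, and the names `results` and `best` are my own assumptions and do not reproduce the scores in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the churn features.
X, y = make_classification(n_samples=600, n_features=8, weights=[0.8, 0.2], random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

results = []
for depth in [2, 4, 6]:
    for n_estimators in [50, 100]:
        model = RandomForestClassifier(
            max_depth=depth,
            n_estimators=n_estimators,
            class_weight="balanced",
            random_state=42,
        )
        model.fit(X_train, y_train)
        # Store both validation metrics together with the hyperparameters.
        results.append({
            "max_depth": depth,
            "n_estimators": n_estimators,
            "f1": f1_score(y_valid, model.predict(X_valid)),
            "auc_roc": roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1]),
        })

# Pick the combination with the highest validation F1 score.
best = max(results, key=lambda r: r["f1"])
print(best)
```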

In [29]:
max_depth_list = list(range(1, 16))
n_estimator_list = [100, 200, 300, 400, 500]

for depth in max_depth_list:
    for n_estimator in n_estimator_list:
        model_rf = RandomForestClassifier(max_depth=depth, n_estimators=n_estimator, class_weight='balanced', random_state=42)

        model_rf.fit(features_train, target_train)
        predictions_valid_rf = model_rf.predict(features_valid)
        probabilities_valid_rf = model_rf.predict_proba(features_valid)[:, 1]

        # report both metrics on the validation set for this combination
        print("At depth", depth, "and n_estimator", n_estimator, "F1 score and AUC-SCORE is:")
        print('F1 score =', f1_score(target_valid, predictions_valid_rf))
        print('AUC-ROC score =', roc_auc_score(target_valid, probabilities_valid_rf))
        print()

At depth 1 and n_estimator 100 F1 score and AUC-SCORE is:
F1 score = 0.5510204081632654
AUC-ROC score = 0.8260852384676807

At depth 1 and n_estimator 200 F1 score and AUC-SCORE is:
F1 score = 0.5570698466780238
AUC-ROC score = 0.8264401943604934

At depth 1 and n_estimator 300 F1 score and AUC-SCORE is:
F1 score = 0.5321100917431193
AUC-ROC score = 0.8188989145571431

At depth 1 and n_estimator 400 F1 score and AUC-SCORE is:
F1 score = 0.5316455696202532
AUC-ROC score = 0.8188430481786197

At depth 1 and n_estimator 500 F1 score and AUC-SCORE is:
F1 score = 0.5314333612740989
AUC-ROC score = 0.8204999632099459

At depth 2 and n_estimator 100 F1 score and AUC-SCORE is:
F1 score = 0.570446735395189
AUC-ROC score = 0.8458728372217241

At depth 2 and n_estimator 200 F1 score and AUC-SCORE is:
F1 score = 0.5706760316066725
AUC-ROC score = 0.8433854207828378

At depth 2 and n_estimator 300 F1 score and AUC-SCORE is:
F1 score = 0.5614035087719299
AUC-ROC score = 0.8404919511537089

At depth 

#### Conclusion

1. Based on the performance results of the models after tuning the hyperparameters, the F1 scores from highest to lowest are as follows:
   - Random Forest model with depth = 10 and n_estimators = 100, F1 score = 0.662 and AUC-ROC score = 0.876
   - Decision Tree model with depth = 7, F1 score = 0.57 and AUC-ROC score = 0.82
   - Logistic Regression model with F1 score = 0.52 and AUC-ROC score = 0.80
   

2. Random Forest is the only model that reaches the minimum F1 score of 0.59 on the validation set.

3. Comparison of the three models before scaling, after scaling, and after class weight adjustment (values rounded to two decimals):

| Model | Stage | F1 score | AUC-ROC score |
|---|---|---|---|
| Logistic Regression | Before scaling | 0.11 | 0.67 |
| Logistic Regression | After scaling | 0.35 | 0.80 |
| Logistic Regression | After class weight adjustment | 0.52 | 0.80 |
| Decision Tree | Before scaling | 0.47 | 0.67 |
| Decision Tree | After scaling | 0.50 | 0.69 |
| Decision Tree | After class weight adjustment | 0.57 | 0.82 |
| Random Forest | Before scaling | 0.61 | 0.87 |
| Random Forest | After scaling | 0.61 | 0.87 |
| Random Forest | After class weight adjustment | 0.66 | 0.88 |

4. From the information above, it is evident that the F1 score of all three models increased after class weight adjustment.

### 6.2 Upsampling <a id='upsampling'></a>

In [32]:
def upsample(features, target, repeat):
    """Repeat the minority-class (target == 1) rows `repeat` times, then shuffle."""
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    # keep the majority class once and the minority class `repeat` times
    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)

    # shuffle so the repeated rows are not grouped together
    features_upsampled, target_upsampled = shuffle(features_upsampled, target_upsampled, random_state=42)

    return features_upsampled, target_upsampled

In [33]:
# check data composition
target_train.value_counts()

exited
0    5068
1    1307
Name: count, dtype: int64

In [34]:
features_upsampled, target_upsampled = upsample(features_train, target_train, 4)

In [35]:
# data composition after upsampling
target_upsampled.value_counts()

exited
1    5228
0    5068
Name: count, dtype: int64
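
The repeat factor of 4 used above can be derived from the class counts rather than chosen by hand. A minimal sketch, using the counts from the `value_counts()` output before upsampling (the variable names are my own):

```python
# Training-set class counts before upsampling, from the output above.
n_zeros, n_ones = 5068, 1307

# Repeat the minority class about as many times as the majority is larger.
repeat = round(n_zeros / n_ones)

# Number of class-1 rows after repeating them.
upsampled_ones = n_ones * repeat

print("repeat =", repeat)
print("class 1 rows after upsampling =", upsampled_ones)
```

The resulting 5228 rows of class 1 match the count shown after upsampling, which is close to the 5068 rows of class 0.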

#### 6.2.1 Logistic Regression

In [36]:
upsample_lr = LogisticRegression(random_state = 42)

# train model on training set
upsample_lr.fit(features_upsampled, target_upsampled)

#  predict using validation set
y_predict_valid_lr = upsample_lr.predict(features_valid)

# measuring probability using validation set
y_probability_valid_lr = upsample_lr.predict_proba(features_valid)[:, 1]

# test performance algorithm using F1 score and auc_score
print('F1 score =', f1_score(target_valid, y_predict_valid_lr))
print('AUC-ROC score =', roc_auc_score(target_valid, y_probability_valid_lr))

F1 score = 0.5188755020080321
AUC-ROC score = 0.7975906602315865


#### 6.2.2 Decision Tree

In [38]:
for depth in range(1, 15):
    model_dtree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model_dtree.fit(features_upsampled, target_upsampled)
    predictions_valid_dtree = model_dtree.predict(features_valid)
    probabilities_valid_dtree = model_dtree.predict_proba(features_valid)[:, 1]

    # report both metrics on the validation set for this depth
    print("At max_depth", depth, "F1 score and AUC-SCORE is:")
    print('F1 score =', f1_score(target_valid, predictions_valid_dtree))
    print('AUC-ROC score =', roc_auc_score(target_valid, probabilities_valid_dtree))
    print()

At max_depth 1 F1 score and AUC-SCORE is:
F1 score = 0.49515418502202646
AUC-ROC score = 0.6995458472204432

At max_depth 2 F1 score and AUC-SCORE is:
F1 score = 0.5248713550600342
AUC-ROC score = 0.7666427304215595

At max_depth 3 F1 score and AUC-SCORE is:
F1 score = 0.5366726296958856
AUC-ROC score = 0.8089758193962616

At max_depth 4 F1 score and AUC-SCORE is:
F1 score = 0.5318985395849346
AUC-ROC score = 0.818614813583433

At max_depth 5 F1 score and AUC-SCORE is:
F1 score = 0.5748502994011977
AUC-ROC score = 0.8387669063924762

At max_depth 6 F1 score and AUC-SCORE is:
F1 score = 0.5600649350649352
AUC-ROC score = 0.8350449792476842

At max_depth 7 F1 score and AUC-SCORE is:
F1 score = 0.5793721973094171
AUC-ROC score = 0.8223292464579354

At max_depth 8 F1 score and AUC-SCORE is:
F1 score = 0.546712802768166
AUC-ROC score = 0.8041658604648627

At max_depth 9 F1 score and AUC-SCORE is:
F1 score = 0.5479204339963835
AUC-ROC score = 0.7758293432021518

At max_depth 10 F1 score and 
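
Because the printed log is long and hard to scan, the same depth sweep can also be collected into a table and sorted by F1 score. The following is a minimal sketch on synthetic data from `make_classification` (an assumption; the notebook itself uses `features_upsampled`, `target_upsampled`, `features_valid`, and `target_valid`), so its numbers are not comparable with the results in this section.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (assumption, not the bank dataset).
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.8, 0.2], random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

rows = []
for depth in range(1, 15):
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    rows.append({
        "max_depth": depth,
        "f1": f1_score(y_valid, model.predict(X_valid)),
        "auc_roc": roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1]),
    })

# One row per depth, best F1 score first.
results = pd.DataFrame(rows).sort_values("f1", ascending=False)
best = results.iloc[0]
print(results.head())
```

Sorting the table makes it easy to see whether the depth with the best F1 score also has a competitive AUC-ROC score.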

#### 6.2.3 Random Forest

In [39]:
max_depth_list = list(range(1, 16))
n_estimator_list = [100, 200, 300, 400, 500]

for depth in max_depth_list:
    for n_estimator in n_estimator_list:
        model_rf = RandomForestClassifier(max_depth=depth, n_estimators=n_estimator, random_state=42)

        model_rf.fit(features_upsampled, target_upsampled)
        predictions_valid_rf = model_rf.predict(features_valid)
        probabilities_valid_rf = model_rf.predict_proba(features_valid)[:, 1]

        # report both metrics on the validation set for this combination
        print("At depth", depth, "and n_estimator", n_estimator, "F1 score and AUC-SCORE is:")
        print('F1 score =', f1_score(target_valid, predictions_valid_rf))
        print('AUC-ROC score =', roc_auc_score(target_valid, probabilities_valid_rf))
        print()

At depth 1 and n_estimator 100 F1 score and AUC-SCORE is:
F1 score = 0.5378548895899053
AUC-ROC score = 0.8230739044058133

At depth 1 and n_estimator 200 F1 score and AUC-SCORE is:
F1 score = 0.5267175572519085
AUC-ROC score = 0.8247825980318684

At depth 1 and n_estimator 300 F1 score and AUC-SCORE is:
F1 score = 0.5236947791164659
AUC-ROC score = 0.8185364643940405

At depth 1 and n_estimator 400 F1 score and AUC-SCORE is:
F1 score = 0.5236947791164659
AUC-ROC score = 0.8175867359591439

At depth 1 and n_estimator 500 F1 score and AUC-SCORE is:
F1 score = 0.5269076305220883
AUC-ROC score = 0.8189738572600402

At depth 2 and n_estimator 100 F1 score and AUC-SCORE is:
F1 score = 0.5649622799664711
AUC-ROC score = 0.8407126914786058

At depth 2 and n_estimator 200 F1 score and AUC-SCORE is:
F1 score = 0.5626598465473146
AUC-ROC score = 0.8392622095288965

At depth 2 and n_estimator 300 F1 score and AUC-SCORE is:
F1 score = 0.5367892976588627
AUC-ROC score = 0.8365526901705151

At depth
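
The nested loops above can also be written as a single loop over a parameter grid that keeps track of the best combination instead of printing every result. This is a hedged sketch using scikit-learn's `ParameterGrid` on synthetic data with a much smaller grid (both are assumptions chosen so the example runs quickly); it is not the search that produced the results in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import ParameterGrid, train_test_split

# Synthetic, imbalanced stand-in data (assumption, not the bank dataset).
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.8, 0.2], random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Small illustrative grid; the notebook sweeps depths 1-15 and 100-500 trees.
grid = ParameterGrid({"max_depth": [2, 4, 6], "n_estimators": [10, 20]})

best_f1, best_params = -1.0, None
for params in grid:
    model = RandomForestClassifier(random_state=42, **params)
    model.fit(X_train, y_train)
    score = f1_score(y_valid, model.predict(X_valid))
    # Keep only the combination with the highest validation F1 score.
    if score > best_f1:
        best_f1, best_params = score, params

print(best_params, best_f1)
```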

#### Conclusion

1. After upsampling, the models rank as follows by F1 score on the validation set, from highest to lowest:
   - Random Forest model with depth = 11 and n_estimators = 100, F1 score = 0.647 and AUC-ROC score = 0.875
   - Decision Tree model with depth = 5, F1 score = 0.57 and AUC-ROC score = 0.83
   - Logistic Regression model with F1 score = 0.51 and AUC-ROC score = 0.79

2. Only the Random Forest model reaches the minimum F1 score of 0.59 on the validation set.

3. Comparison of the three models before scaling, after scaling, after class weight adjustment, and after upsampling:
   - Logistic Regression
     - Before scaling: F1 score = 0.11 and AUC-ROC score = 0.67
     - After scaling: F1 score = 0.35 and AUC-ROC score = 0.79
     - After class weight adjustment: F1 score = 0.51 and AUC-ROC score = 0.79
     - After upsampling: F1 score = 0.51 and AUC-ROC score = 0.79
   - Decision Tree
     - Before scaling: F1 score = 0.47 and AUC-ROC score = 0.67
     - After scaling: F1 score = 0.50 and AUC-ROC score = 0.69
     - After class weight adjustment: F1 score = 0.57 and AUC-ROC score = 0.82
     - After upsampling: F1 score = 0.57 and AUC-ROC score = 0.83
   - Random Forest
     - Before scaling: F1 score = 0.61 and AUC-ROC score = 0.87
     - After scaling: F1 score = 0.61 and AUC-ROC score = 0.87
     - After class weight adjustment: F1 score = 0.66 and AUC-ROC score = 0.87
     - After upsampling: F1 score = 0.64 and AUC-ROC score = 0.875
        

4. Compared with class_weight adjustment, upsampling leaves the F1 score unchanged for Logistic Regression (0.51) and Decision Tree (0.57) and lowers it for Random Forest (from 0.66 to 0.64).

### 6.3 Downsampling <a id='downsampling'></a>

In [40]:
def downsample(features, target, fraction):
    # split features and target by class
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    # keep a random fraction of class 0 and all of class 1; using the same
    # random_state for both samples keeps feature and target rows aligned
    features_downsampled = pd.concat([features_zeros.sample(frac=fraction, random_state=42), features_ones])
    target_downsampled = pd.concat([target_zeros.sample(frac=fraction, random_state=42), target_ones])

    features_downsampled, target_downsampled = shuffle(features_downsampled, target_downsampled, random_state=42)

    return features_downsampled, target_downsampled
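
The function above samples the class-0 features and the class-0 target separately and relies on the shared `random_state` to pick the same rows. A possible alternative, sketched below under the assumption that the index is unique, samples the target once and selects the matching feature rows by index. The name `downsample_by_index` and the toy data are hypothetical and only serve as an illustration.

```python
import pandas as pd

def downsample_by_index(features, target, fraction, random_state=42):
    """Hypothetical variant: keep a fraction of class-0 rows and all class-1 rows."""
    zeros = target[target == 0].sample(frac=fraction, random_state=random_state)
    ones = target[target == 1]
    # Shuffle the kept labels once; the features follow via the shared index.
    kept = pd.concat([zeros, ones]).sample(frac=1, random_state=random_state)
    return features.loc[kept.index], kept

# Toy data: 8 rows of class 0 and 2 rows of class 1.
toy_features = pd.DataFrame({"x": range(10)})
toy_target = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

toy_features_down, toy_target_down = downsample_by_index(toy_features, toy_target, 0.5)
print(toy_target_down.value_counts())
```

Selecting by index makes the alignment between features and target explicit rather than dependent on two identical sampling calls.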

In [41]:
# check data composition
target_train.value_counts()

exited
0    5068
1    1307
Name: count, dtype: int64

In [42]:
features_downsampled, target_downsampled = downsample(features_train, target_train, 0.3)

In [43]:
# check data composition after downsampling
target_downsampled.value_counts()

exited
0    1520
1    1307
Name: count, dtype: int64
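
As a quick arithmetic check, the class counts printed above explain the result: keeping 30% of the 5068 class-0 rows gives about 1520 rows, which is close to, but not exactly, the 1307 class-1 rows. The short sketch below reproduces that figure and also computes the fraction that would balance the classes exactly. The variable names are illustrative, and this balancing fraction was not used in the notebook.

```python
# Class counts reported for target_train above.
n_zeros, n_ones = 5068, 1307

# Rows of class 0 kept with fraction = 0.3.
kept_zeros = round(n_zeros * 0.3)

# Fraction of class 0 that would give a 1:1 class ratio (not used in the notebook).
balanced_fraction = n_ones / n_zeros

print(kept_zeros, round(balanced_fraction, 3))
```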

#### 6.3.1 Logistic Regression

In [44]:
downsample_lr = LogisticRegression(random_state=42)

# train the model on the downsampled training set
downsample_lr.fit(features_downsampled, target_downsampled)

# predict classes for the validation set
y_predict_valid_lr = downsample_lr.predict(features_valid)

# predicted probability of the positive class for the validation set
y_probability_valid_lr = downsample_lr.predict_proba(features_valid)[:, 1]

# evaluate the model with F1 score and AUC-ROC
print('F1 score =', f1_score(target_valid, y_predict_valid_lr))
print('AUC-ROC score =', roc_auc_score(target_valid, y_probability_valid_lr))

F1 score = 0.5316455696202531
AUC-ROC score = 0.7965823402289703


#### 6.3.2 Decision Tree

In [45]:
for depth in range(1, 15):
    model_dtree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model_dtree.fit(features_downsampled, target_downsampled)
    predictions_valid_dtree = model_dtree.predict(features_valid)
    probabilities_valid_dtree = model_dtree.predict_proba(features_valid)[:, 1]

    # report both metrics on the validation set for this depth
    print("At max_depth", depth, "F1 score and AUC-SCORE is:")
    print('F1 score =', f1_score(target_valid, predictions_valid_dtree))
    print('AUC-ROC score =', roc_auc_score(target_valid, probabilities_valid_dtree))
    print()

At max_depth 1 F1 score and AUC-SCORE is:
F1 score = 0.48982667671439334
AUC-ROC score = 0.7064760033465324

At max_depth 2 F1 score and AUC-SCORE is:
F1 score = 0.4989733059548255
AUC-ROC score = 0.754864462715324

At max_depth 3 F1 score and AUC-SCORE is:
F1 score = 0.5655577299412915
AUC-ROC score = 0.8086474340981122

At max_depth 4 F1 score and AUC-SCORE is:
F1 score = 0.546400693842151
AUC-ROC score = 0.8460765451141445

At max_depth 5 F1 score and AUC-SCORE is:
F1 score = 0.5748917748917749
AUC-ROC score = 0.8471931913873121

At max_depth 6 F1 score and AUC-SCORE is:
F1 score = 0.5826235093696763
AUC-ROC score = 0.8361902400074124

At max_depth 7 F1 score and AUC-SCORE is:
F1 score = 0.5760765550239234
AUC-ROC score = 0.8316698324281163

At max_depth 8 F1 score and AUC-SCORE is:
F1 score = 0.5719489981785063
AUC-ROC score = 0.8180363921765268

At max_depth 9 F1 score and AUC-SCORE is:
F1 score = 0.5678119349005425
AUC-ROC score = 0.8058547964692448

At max_depth 10 F1 score and 

#### 6.3.3 Random Forest

In [46]:
max_depth_list = list(range(1, 16))
n_estimator_list = [100, 200, 300, 400, 500]

for depth in max_depth_list:
    for n_estimator in n_estimator_list:
        model_rf = RandomForestClassifier(max_depth=depth, n_estimators=n_estimator, random_state=42)

        model_rf.fit(features_downsampled, target_downsampled)
        predictions_valid_rf = model_rf.predict(features_valid)
        probabilities_valid_rf = model_rf.predict_proba(features_valid)[:, 1]

        # report both metrics on the validation set for this combination
        print("At depth", depth, "and n_estimator", n_estimator, "F1 score and AUC-SCORE is:")
        print('F1 score =', f1_score(target_valid, predictions_valid_rf))
        print('AUC-ROC score =', roc_auc_score(target_valid, probabilities_valid_rf))
        print()

At depth 1 and n_estimator 100 F1 score and AUC-SCORE is:
F1 score = 0.5927770859277708
AUC-ROC score = 0.8402746173153073

At depth 1 and n_estimator 200 F1 score and AUC-SCORE is:
F1 score = 0.5696040868454663
AUC-ROC score = 0.8419989807792406

At depth 1 and n_estimator 300 F1 score and AUC-SCORE is:
F1 score = 0.5768321513002364
AUC-ROC score = 0.8341252278939466

At depth 1 and n_estimator 400 F1 score and AUC-SCORE is:
F1 score = 0.5707547169811321
AUC-ROC score = 0.8331196330805266

At depth 1 and n_estimator 500 F1 score and AUC-SCORE is:
F1 score = 0.5813664596273292
AUC-ROC score = 0.8351301414100674

At depth 2 and n_estimator 100 F1 score and AUC-SCORE is:
F1 score = 0.6053169734151329
AUC-ROC score = 0.855453921138475

At depth 2 and n_estimator 200 F1 score and AUC-SCORE is:
F1 score = 0.5981688708036622
AUC-ROC score = 0.8554198562735217

At depth 2 and n_estimator 300 F1 score and AUC-SCORE is:
F1 score = 0.5952380952380952
AUC-ROC score = 0.8526830850231777

At depth 

#### Conclusion

1. After downsampling, the models rank as follows by F1 score on the validation set, from highest to lowest:
   - Random Forest model with depth = 10 and n_estimators = 100, F1 score = 0.635 and AUC-ROC score = 0.88
   - Decision Tree model with depth = 6, F1 score = 0.58 and AUC-ROC score = 0.83
   - Logistic Regression model with F1 score = 0.53 and AUC-ROC score = 0.79

2. Only the Random Forest model reaches the minimum F1 score of 0.59 on the validation set.

3. Comparison of the three models before scaling, after scaling, after class weight adjustment, after upsampling, and after downsampling:
   - Logistic Regression
     - Before scaling: F1 score = 0.11 and AUC-ROC score = 0.67
     - After scaling: F1 score = 0.35 and AUC-ROC score = 0.79
     - After class weight adjustment: F1 score = 0.51 and AUC-ROC score = 0.79
     - After upsampling: F1 score = 0.51 and AUC-ROC score = 0.79
     - After downsampling: F1 score = 0.53 and AUC-ROC score = 0.79
   - Decision Tree
     - Before scaling: F1 score = 0.47 and AUC-ROC score = 0.67
     - After scaling: F1 score = 0.50 and AUC-ROC score = 0.69
     - After class weight adjustment: F1 score = 0.57 and AUC-ROC score = 0.82
     - After upsampling: F1 score = 0.57 and AUC-ROC score = 0.83
     - After downsampling: F1 score = 0.58 and AUC-ROC score = 0.83
   - Random Forest
     - Before scaling: F1 score = 0.61 and AUC-ROC score = 0.87
     - After scaling: F1 score = 0.61 and AUC-ROC score = 0.87
     - After class weight adjustment: F1 score = 0.66 and AUC-ROC score = 0.87
     - After upsampling: F1 score = 0.64 and AUC-ROC score = 0.875
     - After downsampling: F1 score = 0.635 and AUC-ROC score = 0.88

4. Compared with class_weight adjustment, downsampling raises the F1 score only marginally for Logistic Regression (from 0.51 to 0.53) and Decision Tree (from 0.57 to 0.58) and lowers it for Random Forest (from 0.66 to 0.635).
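
The F1 scores listed in point 3 can also be put into a small table to read off the best method per model. This is a minimal sketch that only re-enters the rounded scores reported above. The DataFrame and its labels are illustrative and are not produced by the notebook's models.

```python
import pandas as pd

# Rounded validation F1 scores copied from the comparison above.
f1_by_method = pd.DataFrame(
    {
        "Logistic Regression": [0.11, 0.35, 0.51, 0.51, 0.53],
        "Decision Tree": [0.47, 0.50, 0.57, 0.57, 0.58],
        "Random Forest": [0.61, 0.61, 0.66, 0.64, 0.635],
    },
    index=["before scaling", "after scaling", "class weight",
           "upsampling", "downsampling"],
)

# Method with the highest F1 score for each model.
best_method = f1_by_method.idxmax()

# Whether each model reaches the 0.59 minimum with at least one method.
meets_minimum = (f1_by_method >= 0.59).any()

print(best_method)
print(meets_minimum)
```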

### Conclusion of Improving Model's Quality Step <a id='after_improving_conclusions'></a>

Of the three approaches (class_weight adjustment, upsampling, downsampling), the best F1 score on the validation set (0.66) is achieved by:
- Model: Random Forest
- Method: hyperparameter tuning (class weight adjustment)
- Depth: 10
- n_estimators: 100

## 7. Testing Model on Test Dataset <a id='testing_model'></a>

[Back to Contents](#back)

In [47]:
final_model = RandomForestClassifier(max_depth=10, n_estimators=100, class_weight='balanced', random_state=42)

# train the final model on the original (unsampled) training set
final_model.fit(features_train, target_train)

# predict classes for the validation set
y_predict_valid = final_model.predict(features_valid)

# predicted probability of the positive class for the validation set
y_probability_valid = final_model.predict_proba(features_valid)[:, 1]

# evaluate the final model with F1 score and AUC-ROC
print('on validation set, F1 score =', f1_score(target_valid, y_predict_valid))
print('on validation set, AUC-ROC score =', roc_auc_score(target_valid, y_probability_valid))

on validation set, F1 score = 0.6622516556291391
on validation set, AUC-ROC score = 0.8765584675716112


In [48]:
# evaluate the final model on the test dataset
predicted_test = final_model.predict(features_test)
probabilities_test = final_model.predict_proba(features_test)[:, 1]

print('on dataset test, F1 score =', f1_score(target_test, predicted_test))
print('on dataset test, AUC-ROC score =', roc_auc_score(target_test, probabilities_test))

on dataset test, F1 score = 0.5915966386554622
on dataset test, AUC-ROC score = 0.8517582158570531
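
The F1 score and the AUC-ROC score are not directly comparable: F1 is computed from the class labels returned by `predict`, which for a binary classifier corresponds to a single decision threshold of about 0.5, while AUC-ROC summarizes how well the predicted probabilities rank the two classes across all thresholds. The toy example below, with hand-made labels and scores that are purely illustrative, shows that the F1 score changes with the threshold while the AUC-ROC score stays the same.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

# Hand-made labels and predicted probabilities (illustrative only).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.40, 0.45, 0.70, 0.90])

# AUC-ROC depends only on the ranking of the scores.
auc = roc_auc_score(y_true, scores)

# F1 depends on the threshold used to turn scores into class labels.
f1_default = f1_score(y_true, (scores >= 0.5).astype(int))
f1_lower = f1_score(y_true, (scores >= 0.4).astype(int))

print(auc, f1_default, f1_lower)
```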


# General Conclusion <a id='end'></a>

The F1 score on the test dataset is 0.59, which meets the minimum requirement of 0.59, and the AUC-ROC score is 0.85. Both are lower than on the validation set (F1 score = 0.66 and AUC-ROC score = 0.88).

Therefore, I recommend that Bank Beta use the Random Forest model with max_depth = 10, n_estimators = 100, and class_weight = 'balanced' to predict whether a customer is likely to leave the bank.

[Back to Contents](#back)