# Project 8 - Prediction of Bank Customer Decisions

## Project Description

Customers have been leaving Bank Beta: their number has been decreasing little by little every month. Bank officials realized that it is cheaper to retain existing, loyal customers than to attract new ones.
The task is to predict whether a customer will leave the bank soon. The available data covers clients' past behavior and the history of contract terminations with the bank.
Build a model with the highest possible F1 score. To pass the review, the model needs an F1 score of at least 0.59 on the test dataset.
Additionally, measure the AUC-ROC metric and compare it to the F1 score.

### Steps of The Project
1. Initialization
2. Data Overview
3. Data Pre-Processing
4. Machine Learning Preparation
5. Checking Quality of the Model
6. General Conclusion

### Data Description

The data can be found in the `/datasets/Churn.csv` file.

Features:
- `RowNumber` — row index in the data
- `CustomerId` — Customer ID
- `Surname` — last name
- `CreditScore` — credit score
- `Geography` — country of residence
- `Gender` — gender
- `Age` — age
- `Tenure` — maturity period for customer fixed deposits (years)
- `Balance` — account balance
- `NumOfProducts` — number of bank products used by customers
- `HasCrCard` — whether the customer has a credit card
- `IsActiveMember` — customer activity level
- `EstimatedSalary` — estimated salary

Target:
- `Exited` — whether the customer has left the bank

## Initialization

In [1]:
# Import general-purpose and machine learning libraries

import pandas as pd
import numpy as np

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.utils import shuffle

## Data Overview

In [2]:
data = pd.read_csv('churn.csv')
data.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0


In [3]:
data.info()
data.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           9091 non-null   float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


Unnamed: 0,RowNumber,CustomerId,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
count,10000.0,10000.0,10000.0,10000.0,9091.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
mean,5000.5,15690940.0,650.5288,38.9218,4.99769,76485.889288,1.5302,0.7055,0.5151,100090.239881,0.2037
std,2886.89568,71936.19,96.653299,10.487806,2.894723,62397.405202,0.581654,0.45584,0.499797,57510.492818,0.402769
min,1.0,15565700.0,350.0,18.0,0.0,0.0,1.0,0.0,0.0,11.58,0.0
25%,2500.75,15628530.0,584.0,32.0,2.0,0.0,1.0,0.0,0.0,51002.11,0.0
50%,5000.5,15690740.0,652.0,37.0,5.0,97198.54,1.0,1.0,1.0,100193.915,0.0
75%,7500.25,15753230.0,718.0,44.0,7.0,127644.24,2.0,1.0,1.0,149388.2475,0.0
max,10000.0,15815690.0,850.0,92.0,10.0,250898.09,4.0,1.0,1.0,199992.48,1.0


In [4]:
data.isnull().sum()

RowNumber            0
CustomerId           0
Surname              0
CreditScore          0
Geography            0
Gender               0
Age                  0
Tenure             909
Balance              0
NumOfProducts        0
HasCrCard            0
IsActiveMember       0
EstimatedSalary      0
Exited               0
dtype: int64

**Findings :**

- The dataset has 14 columns: a row number, a customer ID, a surname, two categorical columns and several numeric columns
- It contains 10,000 rows
- Only the `Tenure` column has missing values: 909 rows
- That is about 9% of the rows, so the missing values will be filled in during data pre-processing
- All columns have appropriate data types
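
The share of missing values can be checked from the counts reported by `data.info()`. The sketch below uses a made-up stand-in frame (named `toy`, an assumption for illustration) with the same counts as the real data, not the real file:

```python
import numpy as np
import pandas as pd

# Toy stand-in: 10,000 rows, 909 of them missing `Tenure`, matching the
# counts reported above. The non-missing values themselves are made up.
toy = pd.DataFrame({"Tenure": [np.nan] * 909 + [5.0] * 9091})

# isna() gives a boolean column; its mean is the share of missing rows.
missing_share = toy["Tenure"].isna().mean()
print(f"Missing share: {missing_share:.2%}")
```

On the real notebook data the same expression would be applied to `data['Tenure']`.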

## Data Pre-Processing

### Lower Case Column

In [5]:
data = data.rename(columns=lambda x: x.lower())

In [6]:
data.head()

Unnamed: 0,rownumber,customerid,surname,creditscore,geography,gender,age,tenure,balance,numofproducts,hascrcard,isactivemember,estimatedsalary,exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0


### Unused Column

In [7]:
drop_column = ['customerid', 'rownumber', 'surname']
categoric_column = ['geography', 'gender']
numeric_column = ['creditscore', 'age', 'tenure', 'balance',
                  'estimatedsalary', 'numofproducts', 'hascrcard', 
                  'isactivemember']

The `customerid`, `rownumber`, and `surname` columns are dropped in the next cell because they will not be used in the analysis.

In [8]:
data = data.drop(columns=drop_column)

In [9]:
data.head()

Unnamed: 0,creditscore,geography,gender,age,tenure,balance,numofproducts,hascrcard,isactivemember,estimatedsalary,exited
0,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0


### Handle Missing Value

In [10]:
data['tenure'] = data['tenure'].fillna(value=data['tenure'].median())

In [11]:
data.isnull().sum()

creditscore        0
geography          0
gender             0
age                0
tenure             0
balance            0
numofproducts      0
hascrcard          0
isactivemember     0
estimatedsalary    0
exited             0
dtype: int64

### Encoding Column

In [12]:
data = pd.get_dummies(data, columns=categoric_column, drop_first=True)

The categorical columns `geography` and `gender` are converted into numeric columns using ***one-hot encoding*** so that the models can work with them. With `drop_first=True`, one dummy column per feature is dropped because it is redundant given the others.
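
To make the effect of the encoding concrete, here is a minimal sketch on a three-row toy frame (the rows are made up; only the column names and category values come from the dataset):

```python
import pandas as pd

# Tiny made-up frame with the same categorical columns as the dataset.
toy = pd.DataFrame({
    "geography": ["France", "Spain", "Germany"],
    "gender": ["Female", "Male", "Female"],
})

# drop_first=True drops the first category of each column (in sorted order),
# so "France" and "Female" become the implicit baseline.
encoded = pd.get_dummies(toy, columns=["geography", "gender"], drop_first=True)
print(list(encoded.columns))
```

A row with all dummy columns equal to zero therefore represents a female customer from France.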

In [13]:
data.head()

Unnamed: 0,creditscore,age,tenure,balance,numofproducts,hascrcard,isactivemember,estimatedsalary,exited,geography_Germany,geography_Spain,gender_Male
0,619,42,2.0,0.0,1,1,1,101348.88,1,0,0,0
1,608,41,1.0,83807.86,1,0,1,112542.58,0,0,1,0
2,502,42,8.0,159660.8,3,1,0,113931.57,1,0,0,0
3,699,39,1.0,0.0,2,0,0,93826.63,0,0,0,0
4,850,43,2.0,125510.82,1,1,1,79084.1,0,0,1,0


## Machine Learning Preparation

### Check for Data Balance

In [14]:
data.head(10)

Unnamed: 0,creditscore,age,tenure,balance,numofproducts,hascrcard,isactivemember,estimatedsalary,exited,geography_Germany,geography_Spain,gender_Male
0,619,42,2.0,0.0,1,1,1,101348.88,1,0,0,0
1,608,41,1.0,83807.86,1,0,1,112542.58,0,0,1,0
2,502,42,8.0,159660.8,3,1,0,113931.57,1,0,0,0
3,699,39,1.0,0.0,2,0,0,93826.63,0,0,0,0
4,850,43,2.0,125510.82,1,1,1,79084.1,0,0,1,0
5,645,44,8.0,113755.78,2,1,0,149756.71,1,0,1,1
6,822,50,7.0,0.0,2,1,1,10062.8,0,0,0,1
7,376,29,4.0,115046.74,4,1,0,119346.88,1,1,0,0
8,501,44,4.0,142051.07,2,0,1,74940.5,0,0,0,1
9,684,27,2.0,134603.88,1,1,1,71725.73,0,0,0,1


In [15]:
data['exited'].value_counts()

0    7963
1    2037
Name: exited, dtype: int64

**Findings :**
- The classes are imbalanced: about 20% of customers exited (2,037 of 10,000), which must be taken into account in the analysis
- The analysis must reach an F1 score of at least 0.59
- The final score is measured on predictions for the test data
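
The imbalance is also the reason the project is scored by F1 rather than plain accuracy. As a rough illustration, the sketch below rebuilds labels with the class counts shown above and scores a constant "nobody leaves" baseline (this baseline is an assumption for illustration, not a model from the notebook):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Labels with the class counts reported above: 7963 stayed (0), 2037 exited (1).
target = np.array([0] * 7963 + [1] * 2037)

# Hypothetical baseline that always predicts "stays".
constant_prediction = np.zeros_like(target)

baseline_accuracy = accuracy_score(target, constant_prediction)
# zero_division=0 silences the warning raised when no positives are predicted.
baseline_f1 = f1_score(target, constant_prediction, zero_division=0)

print("Accuracy:", baseline_accuracy)
print("F1 score:", baseline_f1)
```

The baseline looks accurate while never identifying a single exiting customer, which F1 exposes.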

### Train Data, Valid Data, Test Data

In [16]:
train_valid, test = train_test_split(data, test_size=0.15)
train, valid = train_test_split(train_valid, test_size=0.2)

In [17]:
#train
features_train = train.drop(['exited'], axis=1)
target_train = train['exited']

#valid
features_valid = valid.drop(['exited'], axis=1)
target_valid = valid['exited']

#test
features_test = test.drop(['exited'], axis=1)
target_test = test['exited']

In [18]:
features_train.shape, features_valid.shape, features_test.shape

((6800, 11), (1700, 11), (1500, 11))
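
The split above has no `random_state`, so the subsets (and every score below) change from run to run, and the class ratio may differ slightly between subsets. A possible variant, sketched here on a made-up frame (`toy` is an assumption for illustration) with the same 68/17/15 proportions, fixes the seed and stratifies by the target:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 1000 rows with a 20% positive class, similar to `exited`.
toy = pd.DataFrame({
    "feature": np.arange(1000),
    "exited": [1] * 200 + [0] * 800,
})

# First carve out the test set, then split the rest into train and valid.
# stratify keeps the class ratio the same in every subset.
train_valid, test = train_test_split(
    toy, test_size=0.15, stratify=toy["exited"], random_state=12345
)
train, valid = train_test_split(
    train_valid, test_size=0.2, stratify=train_valid["exited"], random_state=12345
)

print(len(train), len(valid), len(test))
print(train["exited"].mean(), valid["exited"].mean(), test["exited"].mean())
```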

### Scaling Features

In [19]:
sscaler = StandardScaler()
features_train[numeric_column] = sscaler.fit_transform(features_train[numeric_column])
features_valid[numeric_column] = sscaler.transform(features_valid[numeric_column])
features_test[numeric_column] = sscaler.transform(features_test[numeric_column])
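
The scaler is fitted on the training features only and then reused for the validation and test features, so no information from those sets leaks into the transformation. A minimal sketch of the same pattern on made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data.
train_values = np.array([[1.0], [2.0], [3.0]])
valid_values = np.array([[4.0]])

# fit() learns the mean and standard deviation from the training data only.
scaler = StandardScaler().fit(train_values)

train_scaled = scaler.transform(train_values)
# The validation value is scaled with the *training* statistics.
valid_scaled = scaler.transform(valid_values)

print(train_scaled.ravel())
print(valid_scaled.ravel())
```

The scaled training column has zero mean, while the validation value lands outside the training range because it is expressed in training units.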

In [20]:
features_train.head()

Unnamed: 0,creditscore,age,tenure,balance,numofproducts,hascrcard,isactivemember,estimatedsalary,geography_Germany,geography_Spain,gender_Male
2322,-0.857129,0.207368,-1.448901,-1.229108,0.791773,0.650072,-1.031982,-1.685019,0,1,0
990,0.051713,-0.175448,0.000426,0.266621,0.791773,0.650072,-1.031982,-0.589894,0,0,1
1252,0.444168,2.791371,-0.361905,0.339828,-0.918537,0.650072,0.969009,-0.670896,0,0,1
7401,-0.175497,0.01596,1.449753,0.845376,0.791773,-1.53829,-1.031982,1.23739,0,0,0
8378,-0.371725,-0.175448,0.362758,0.317907,0.791773,0.650072,0.969009,0.685958,1,0,0


In [21]:
features_valid.head()

Unnamed: 0,creditscore,age,tenure,balance,numofproducts,hascrcard,isactivemember,estimatedsalary,geography_Germany,geography_Spain,gender_Male
6495,-0.340741,0.01596,0.000426,0.109286,2.502082,0.650072,-1.031982,-1.605867,1,0,0
3546,-1.115322,0.398775,0.362758,0.594535,-0.918537,0.650072,-1.031982,-0.18307,0,1,1
9791,1.807431,-0.366855,0.362758,0.671115,-0.918537,0.650072,-1.031982,-1.285197,1,0,0
422,-1.187617,0.111664,0.72509,0.565448,-0.918537,0.650072,-1.031982,-1.602821,1,0,1
8387,0.506134,0.494479,1.087422,1.319215,-0.918537,0.650072,-1.031982,0.124509,0,0,1


In [22]:
features_test.head()

Unnamed: 0,creditscore,age,tenure,balance,numofproducts,hascrcard,isactivemember,estimatedsalary,geography_Germany,geography_Spain,gender_Male
317,0.578428,0.303072,1.449753,-1.229108,0.791773,0.650072,0.969009,-1.243832,0,1,1
6377,0.43384,-0.558263,1.449753,-1.229108,-0.918537,0.650072,-1.031982,0.234802,0,0,1
3830,2.065624,-0.941078,1.812085,-1.229108,0.791773,0.650072,0.969009,1.739214,0,0,0
3041,0.216957,-0.462559,0.72509,0.482706,0.791773,0.650072,0.969009,-1.284405,0,0,1
409,-1.166961,0.207368,-0.724237,0.991778,-0.918537,0.650072,-1.031982,0.117189,1,0,0


## Checking Quality of the Model

### With Default Hyperparameter

In [23]:
logreg = LogisticRegression(random_state=12345)

logreg.fit(features_train, target_train)
predict_logreg_valid = logreg.predict(features_valid)
predict_logreg_test = logreg.predict(features_test)

print('Valid Accuracy')
print('  F1 Score :', f1_score(target_valid, predict_logreg_valid))
print('  AUC ROC :', roc_auc_score(target_valid, predict_logreg_valid))
print('Test Accuracy')
print('  F1 Score :', f1_score(target_test, predict_logreg_test))
print('  AUC ROC :', roc_auc_score(target_test, predict_logreg_test))

Valid Accuracy
  F1 Score : 0.3207126948775056
  AUC ROC : 0.5919170219416972
Test Accuracy
  F1 Score : 0.3059360730593607
  AUC ROC : 0.5828549029228476
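
Note that the cells in this section pass hard class predictions (`predict`) to `roc_auc_score`. AUC-ROC is usually computed from scores or probabilities, so the figures reported here may differ from the value obtained with `predict_proba`. A sketch of both variants on synthetic data (the data and variable names are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the churn features.
X, y = make_classification(
    n_samples=2000, n_features=8, weights=[0.8, 0.2], random_state=12345
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=12345
)

model = LogisticRegression(random_state=12345).fit(X_train, y_train)

# Variant used in this notebook: AUC from 0/1 predictions.
auc_from_labels = roc_auc_score(y_valid, model.predict(X_valid))
# Usual variant: AUC from the predicted probability of the positive class.
auc_from_proba = roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1])

print("AUC from labels:", auc_from_labels)
print("AUC from probabilities:", auc_from_proba)
```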


In [24]:
dectree = DecisionTreeClassifier(random_state=12345)

dectree.fit(features_train, target_train)
predict_dectree_valid = dectree.predict(features_valid)
predict_dectree_test = dectree.predict(features_test)

print('Valid Accuracy')
print('  F1 Score :', f1_score(target_valid, predict_dectree_valid))
print('  AUC ROC :', roc_auc_score(target_valid, predict_dectree_valid))
print('Test Accuracy')
print('  F1 Score :', f1_score(target_test, predict_dectree_test))
print('  AUC ROC :', roc_auc_score(target_test, predict_dectree_test))

Valid Accuracy
  F1 Score : 0.5057471264367814
  AUC ROC : 0.697820019110582
Test Accuracy
  F1 Score : 0.5170278637770898
  AUC ROC : 0.6949369485161768


In [25]:
ranfor = RandomForestClassifier(random_state=12345)

ranfor.fit(features_train, target_train)
predict_ranfor_valid = ranfor.predict(features_valid)
predict_ranfor_test = ranfor.predict(features_test)

print('Valid Accuracy')
print('  F1 Score :', f1_score(target_valid, predict_ranfor_valid))
print('  AUC ROC :', roc_auc_score(target_valid, predict_ranfor_valid))
print('Test Accuracy')
print('  F1 Score :', f1_score(target_test, predict_ranfor_test))
print('  AUC ROC :', roc_auc_score(target_test, predict_ranfor_test))

Valid Accuracy
  F1 Score : 0.5714285714285714
  AUC ROC : 0.7124034771504393
Test Accuracy
  F1 Score : 0.5957446808510638
  AUC ROC : 0.7234601118367194


### With Tuning Hyperparameter

In [26]:
best_score = 0
best_depth = 0

for depth in range (1, 7):
    dtc = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    dtc.fit(features_train, target_train)
    predicted_dtc_valid = dtc.predict(features_valid)
    score = f1_score(target_valid, predicted_dtc_valid)
    if score > best_score:
        best_score = score
        best_depth = depth

print("Best F1 score on the validation set:", best_score, "best_depth:", best_depth)
print('AUC-ROC validation set:', roc_auc_score(target_valid, predicted_dtc_valid))

Best F1 score on the validation set: 0.5704225352112676 best_depth: 5
AUC-ROC validation set: 0.710823861180023
AUC-ROC validation set: 0.710823861180023
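
The loop above prints the AUC-ROC of the last model fitted (depth 6), which is not necessarily the model with the best F1 score; the same pattern recurs in the later tuning cells. One way to keep both metrics tied to the same model is to store the best estimator, sketched here on synthetic data (the data and the `best_model` name are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the churn features.
X, y = make_classification(
    n_samples=2000, n_features=8, weights=[0.8, 0.2], random_state=12345
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=12345
)

best_model, best_score, best_depth = None, -1.0, None
for depth in range(1, 7):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(X_train, y_train)
    score = f1_score(y_valid, model.predict(X_valid))
    if score > best_score:
        # Keep the fitted estimator itself, not just its score.
        best_model, best_score, best_depth = model, score, depth

# AUC-ROC is computed for the same model that achieved the best F1.
best_auc = roc_auc_score(y_valid, best_model.predict_proba(X_valid)[:, 1])
print("Best F1:", best_score, "best_depth:", best_depth, "AUC-ROC:", best_auc)
```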


In [27]:
best_est = 0
best_depth = 0
best_score = 0

for est in range(10, 101, 10):
    for depth in range (1, 7):
        rfc = RandomForestClassifier(random_state=12345, max_depth=depth, n_estimators=est)
        rfc.fit(features_train, target_train)
        predicted_rfc_valid = rfc.predict(features_valid)
        score = f1_score(target_valid, predicted_rfc_valid)
        if score > best_score:
            best_score = score
            best_depth = depth
            best_est = est

print("Best F1 score on the validation set:", best_score, "best_est:", best_est, "best_depth:", best_depth)
print('AUC-ROC validation set:', roc_auc_score(target_valid, predicted_rfc_valid))

Best F1 score on the validation set: 0.5347368421052632 best_est: 40 best_depth: 6
AUC-ROC validation set: 0.6830403561396625


**Findings :**
- None of the models exceeded the required F1 score of 0.59 on the validation set, although the default Random Forest reached 0.596 on the test set
- The closest validation values are 0.571 for the default Random Forest and 0.570 for the Decision Tree at depth 5

Therefore, the class imbalance will be addressed using several approaches

### Data Balancing

#### Class Weight Adjustment

In [28]:
logreg_cw = LogisticRegression(random_state=1, class_weight='balanced', solver='liblinear')

logreg_cw.fit(features_train, target_train)
predict_logreg_cw_valid = logreg_cw.predict(features_valid)
predict_logreg_cw_test = logreg_cw.predict(features_test)

print('Valid Accuracy')
print('  F1 Score :', f1_score(target_valid, predict_logreg_cw_valid))
print('  AUC ROC :', roc_auc_score(target_valid, predict_logreg_cw_valid))

Valid Accuracy
  F1 Score : 0.4704684317718942
  AUC ROC : 0.6971615686639665
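
For reference, `class_weight='balanced'` weights each class by `n_samples / (n_classes * class_count)`. The sketch below reproduces these weights for the training class counts shown later in this notebook (5409 and 1391); the label array is rebuilt from those counts, not taken from the real data:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Labels rebuilt from the training class counts: 5409 zeros and 1391 ones.
y = np.array([0] * 5409 + [1] * 1391)

# Weights that scikit-learn derives for class_weight='balanced'.
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y
)

# The same weights computed by hand from the formula.
manual = len(y) / (2 * np.bincount(y))

print("Class weights:", weights)
```

Errors on the minority class are thus penalized roughly four times as heavily as errors on the majority class.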


In [29]:
best_score = 0
best_depth = 0

for depth in range (1, 7):
    dtc_cw = DecisionTreeClassifier(random_state=12345, max_depth=depth, class_weight='balanced')
    dtc_cw.fit(features_train, target_train)
    predicted_dtc_cw_valid = dtc_cw.predict(features_valid)
    score = f1_score(target_valid, predicted_dtc_cw_valid)
    if score > best_score:
        best_score = score
        best_depth = depth

print("Best F1 score on the validation set:", best_score, "best_depth:", best_depth)
print('AUC-ROC validation set:', roc_auc_score(target_valid, predicted_dtc_cw_valid))

Best F1 score on the validation set: 0.5661252900232018 best_depth: 5
AUC-ROC validation set: 0.767309819779674


In [30]:
best_est = 0
best_depth = 0
best_score = 0

for est in range(10, 101, 10):
    for depth in range (1, 7):
        rfc_cw = RandomForestClassifier(random_state=12345, max_depth=depth, n_estimators=est, class_weight='balanced')
        rfc_cw.fit(features_train, target_train)
        predicted_rfc_cw_valid = rfc_cw.predict(features_valid)
        score = f1_score(target_valid, predicted_rfc_cw_valid)
        if score > best_score:
            best_score = score
            best_depth = depth
            best_est = est

print("Best F1 score on the validation set:", best_score, "best_est:", best_est, "best_depth:", best_depth)
print('AUC-ROC validation set:', roc_auc_score(target_valid, predicted_rfc_cw_valid))

Best F1 score on the validation set: 0.5968331303288673 best_est: 80 best_depth: 6
AUC-ROC validation set: 0.7800720083182023


#### Upsampling

In [31]:
def upsample(features, target, repeat):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)

    features_upsampled, target_upsampled = shuffle(
        features_upsampled, target_upsampled, random_state=12345
    )

    return features_upsampled, target_upsampled


features_upsampled, target_upsampled = upsample(features_train, target_train, 3)


In [32]:
target_train.value_counts(), target_upsampled.value_counts()

(0    5409
 1    1391
 Name: exited, dtype: int64,
 0    5409
 1    4173
 Name: exited, dtype: int64)
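
The `repeat` factor determines the class ratio after upsampling. A quick arithmetic check using the training counts above; the value 4 is shown only for comparison and is not something this notebook ran:

```python
# Training class counts reported above.
negatives, positives = 5409, 1391

shares = {}
for repeat in (3, 4):
    upsampled_positives = positives * repeat
    # Share of the positive class after repeating its rows `repeat` times.
    shares[repeat] = upsampled_positives / (negatives + upsampled_positives)
    print(repeat, upsampled_positives, round(shares[repeat], 3))
```

With `repeat=3` the positive class stays slightly smaller than the negative class, matching the 4173 versus 5409 counts in the output above.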

In [33]:
logreg_up = LogisticRegression(random_state=12345, solver='liblinear')
logreg_up.fit(features_upsampled, target_upsampled)
predicted_logreg_up_valid = logreg_up.predict(features_valid)
predicted_logreg_up_test = logreg_up.predict(features_test)

print('Valid Accuracy')
print('  F1 Score :', f1_score(target_valid, predicted_logreg_up_valid))
print('  AUC ROC :', roc_auc_score(target_valid, predicted_logreg_up_valid))

Valid Accuracy
  F1 Score : 0.48406139315230223
  AUC ROC : 0.697399896687573


In [34]:
best_score = 0
best_depth = 0

for depth in range (1, 7):
    dtc_up = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    dtc_up.fit(features_upsampled, target_upsampled)
    predicted_dtc_up_valid = dtc_up.predict(features_valid)
    score = f1_score(target_valid, predicted_dtc_up_valid)
    if score > best_score:
        best_score = score
        best_depth = depth

print("Best F1 score on the validation set:", best_score, "best_depth:", best_depth)
print('AUC-ROC validation set:', roc_auc_score(target_valid, predicted_dtc_up_valid))

Best F1 score on the validation set: 0.5714285714285714 best_depth: 5
AUC-ROC validation set: 0.7681589326451751


In [35]:
best_est = 0
best_depth = 0
best_score = 0

for est in range(10, 101, 10):
    for depth in range (1, 7):
        rfc_up = RandomForestClassifier(random_state=12345, max_depth=depth, n_estimators=est)
        rfc_up.fit(features_upsampled, target_upsampled)
        predicted_rfc_up_valid = rfc_up.predict(features_valid)
        score = f1_score(target_valid, predicted_rfc_up_valid)
        if score > best_score:
            best_score = score
            best_depth = depth
            best_est = est

print("Best F1 score on the validation set:", best_score, "best_est:", best_est, "best_depth:", best_depth)
print('AUC-ROC validation set:', roc_auc_score(target_valid, predicted_rfc_up_valid))

Best F1 score on the validation set: 0.6091794158553547 best_est: 50 best_depth: 6
AUC-ROC validation set: 0.7656005090243183


#### Downsampling

In [36]:
def downsample(features, target, fraction):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_downsampled = pd.concat(
        [features_zeros.sample(frac=fraction, random_state=12345)]
        + [features_ones]
    )
    target_downsampled = pd.concat(
        [target_zeros.sample(frac=fraction, random_state=12345)]
        + [target_ones]
    )

    features_downsampled, target_downsampled = shuffle(
        features_downsampled, target_downsampled, random_state=12345
    )

    return features_downsampled, target_downsampled


features_downsampled, target_downsampled = downsample(features_train, target_train, 0.3)


In [37]:
target_train.value_counts(), target_downsampled.value_counts()

(0    5409
 1    1391
 Name: exited, dtype: int64,
 0    1623
 1    1391
 Name: exited, dtype: int64)

In [38]:
logreg_down = LogisticRegression(random_state=12345, solver='liblinear')
logreg_down.fit(features_downsampled, target_downsampled)
predicted_logreg_down_valid = logreg_down.predict(features_valid)
predicted_logreg_down_test = logreg_down.predict(features_test)

print('Valid Accuracy')
print('  F1 Score :', f1_score(target_valid, predicted_logreg_down_valid))
print('  AUC ROC :', roc_auc_score(target_valid, predicted_logreg_down_valid))

Valid Accuracy
  F1 Score : 0.47240618101545245
  AUC ROC : 0.6928428431757265


In [39]:
best_score = 0
best_depth = 0

for depth in range (1, 7):
    dtc_down = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    dtc_down.fit(features_downsampled, target_downsampled)
    predicted_dtc_down_valid = dtc_down.predict(features_valid)
    score = f1_score(target_valid, predicted_dtc_down_valid)
    if score > best_score:
        best_score = score
        best_depth = depth

print("Best F1 score on the validation set:", best_score, "best_depth:", best_depth)
print('AUC-ROC validation set:', roc_auc_score(target_valid, predicted_dtc_down_valid))

Best F1 score on the validation set: 0.5649432534678436 best_depth: 6
AUC-ROC validation set: 0.7528981796173005


In [40]:
best_est = 0
best_depth = 0
best_score = 0

for est in range(10, 101, 10):
    for depth in range (1, 7):
        rfc_down = RandomForestClassifier(random_state=12345, max_depth=depth, n_estimators=est)
        rfc_down.fit(features_downsampled, target_downsampled)
        predicted_rfc_down_valid = rfc_down.predict(features_valid)
        score = f1_score(target_valid, predicted_rfc_down_valid)
        if score > best_score:
            best_score = score
            best_depth = depth
            best_est = est

print("Best F1 score on the validation set:", best_score, "best_est:", best_est, "best_depth:", best_depth)
print('AUC-ROC validation set:', roc_auc_score(target_valid, predicted_rfc_down_valid))

Best F1 score on the validation set: 0.5968169761273209 best_est: 60 best_depth: 5
AUC-ROC validation set: 0.7742956021274379


### Best Model

In [43]:
final = RandomForestClassifier(random_state=12345, n_estimators=50, max_depth=6)
final.fit(features_upsampled, target_upsampled)
predict_final = final.predict(features_test)

print('Test Accuracy')
print(' F1 Score :', f1_score(target_test, predict_final))
print(' AUC ROC :', roc_auc_score(target_test, predict_final))

Test Accuracy
 F1 Score : 0.6130500758725342
 AUC ROC : 0.7594403897485674


## General Conclusion

- Balancing the classes noticeably improves both the F1 score and the AUC-ROC value
- In this analysis, upsampling, which replicates rows of the minority class, produces the best validation F1 score
- The model with the best results is the Random Forest with tuned hyperparameters (`n_estimators=50`, `max_depth=6`)
- The F1 scores obtained are 0.609 on validation data and 0.613 on test data
- The reported AUC-ROC values are 0.766 on validation data and 0.759 on test data
- The best model was selected based not only on high F1 and AUC-ROC values but also on a small gap between the validation and test values, which suggests that the model is neither heavily overfitted nor underfitted; here the test score is slightly higher than the validation score