### Modeling part 1

Use only the application train data to train 3 different models. The goal is to classify (and return the probability) of a customer pays the loan on day or not. 

Models we are going to evaluate:
- Logistic Regression (LR): mandatory, but it's good to start simple
- Decision Trees (DT): it's a good white box model, where we can understand better the problem and how variables influences the output
- Artificial Neural Network (ANN): kind of black box, but we can add layers to improve performance

We are going to use the Scikit-learn for modelling

Challenges:

- Unbalanced dataset
- High number of features with high ratio od NaNs
- High dimensional dataset
- Outliers

Methodology:

- Data processing
- Feature extraction
- Split dataset into training and validation
- Hyperparameter tunning
- Test evaluation
- Explanation evaluation

We will first build our models and processing flows into a Notebook, and then we will adapt for Python files for the future API.

Modelling approach:

- As part of my own study, I will test an approach to generate more data with SMOTE, instead of using sampling techniques.
- In my Master's thesis, I'm using Optuna framework to tune hyperparametes. I'm using this here as well, to better understand Optuna.
- We are going to remove large ratio of NaNs variables.
- We are going to use Train - Validation - Test approach. We are not using k-fold technique for time and computation restrictions.
- Finally, I will test the effect of PCA for dimension reduction.

In [156]:
import pandas as pd
import numpy as np
import os
import json
import time

from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

import optuna
from optuna.visualization import plot_optimization_history
from optuna.visualization import plot_param_importances
import plotly


In [157]:
RANDOM_STATE = 42

In [158]:
DATA_FOLDER_FULL_PATH = os.getcwd() + '\\data'

CSVS_PATH_DICTS = {}

for root, dirs, files in os.walk(DATA_FOLDER_FULL_PATH):
    for file in files:
        if file.endswith(".csv"):
            print(f'file name: {file}')
            CSVS_PATH_DICTS[file.split('.csv')[0]] = DATA_FOLDER_FULL_PATH + '\\' + file


print(CSVS_PATH_DICTS)      

file name: application_test.csv
file name: application_train.csv
file name: bureau.csv
file name: bureau_balance.csv
file name: credit_card_balance.csv
file name: HomeCredit_columns_description.csv
file name: installments_payments.csv
file name: POS_CASH_balance.csv
file name: previous_application.csv
file name: sample_submission.csv
{'application_test': 'c:\\Users\\Pichau\\Documents\\Projetos\\desafio_itau\\data\\application_test.csv', 'application_train': 'c:\\Users\\Pichau\\Documents\\Projetos\\desafio_itau\\data\\application_train.csv', 'bureau': 'c:\\Users\\Pichau\\Documents\\Projetos\\desafio_itau\\data\\bureau.csv', 'bureau_balance': 'c:\\Users\\Pichau\\Documents\\Projetos\\desafio_itau\\data\\bureau_balance.csv', 'credit_card_balance': 'c:\\Users\\Pichau\\Documents\\Projetos\\desafio_itau\\data\\credit_card_balance.csv', 'HomeCredit_columns_description': 'c:\\Users\\Pichau\\Documents\\Projetos\\desafio_itau\\data\\HomeCredit_columns_description.csv', 'installments_payments': 'c

In [159]:
df_application_train = pd.read_csv(CSVS_PATH_DICTS['application_train'],
                                   sep = ',',
                                   encoding= 'utf-8')

df_application_train.head(5)

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,0,0,0,,,,,,
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


In [160]:
# Calculate the percentage of missing values in each column
missing_percentage = (df_application_train.isnull().sum() / len(df_application_train)) * 100

for col, value in missing_percentage.items():
    print(f"% of missing data or NaN for {col}: {value}")

% of missing data or NaN for SK_ID_CURR: 0.0
% of missing data or NaN for TARGET: 0.0
% of missing data or NaN for NAME_CONTRACT_TYPE: 0.0
% of missing data or NaN for CODE_GENDER: 0.0
% of missing data or NaN for FLAG_OWN_CAR: 0.0
% of missing data or NaN for FLAG_OWN_REALTY: 0.0
% of missing data or NaN for CNT_CHILDREN: 0.0
% of missing data or NaN for AMT_INCOME_TOTAL: 0.0
% of missing data or NaN for AMT_CREDIT: 0.0
% of missing data or NaN for AMT_ANNUITY: 0.003902299429939092
% of missing data or NaN for AMT_GOODS_PRICE: 0.09040327012692229
% of missing data or NaN for NAME_TYPE_SUITE: 0.42014757195677555
% of missing data or NaN for NAME_INCOME_TYPE: 0.0
% of missing data or NaN for NAME_EDUCATION_TYPE: 0.0
% of missing data or NaN for NAME_FAMILY_STATUS: 0.0
% of missing data or NaN for NAME_HOUSING_TYPE: 0.0
% of missing data or NaN for REGION_POPULATION_RELATIVE: 0.0
% of missing data or NaN for DAYS_BIRTH: 0.0
% of missing data or NaN for DAYS_EMPLOYED: 0.0
% of missing dat

We are going to remove all columns with high ratio, higher than 14%, to keep the BUREAU features. At least in a first trial. For exception we are going to keep the OCCUPATION_TYPE, and fill the NaNs. Again, for a first trial

In [161]:
df_application_train["OCCUPATION_TYPE"].unique()

array(['Laborers', 'Core staff', 'Accountants', 'Managers', nan,
       'Drivers', 'Sales staff', 'Cleaning staff', 'Cooking staff',
       'Private service staff', 'Medicine staff', 'Security staff',
       'High skill tech staff', 'Waiters/barmen staff',
       'Low-skill Laborers', 'Realty agents', 'Secretaries', 'IT staff',
       'HR staff'], dtype=object)

In [162]:
df_application_train["OCCUPATION_TYPE"] = df_application_train["OCCUPATION_TYPE"].fillna("not_informed")
df_application_train["OCCUPATION_TYPE"].unique()

array(['Laborers', 'Core staff', 'Accountants', 'Managers',
       'not_informed', 'Drivers', 'Sales staff', 'Cleaning staff',
       'Cooking staff', 'Private service staff', 'Medicine staff',
       'Security staff', 'High skill tech staff', 'Waiters/barmen staff',
       'Low-skill Laborers', 'Realty agents', 'Secretaries', 'IT staff',
       'HR staff'], dtype=object)

In [163]:
high_nan_columns = []

missing_percentage = (df_application_train.isnull().sum() / len(df_application_train)) * 100

for col, value in missing_percentage.items():
    if value > 14.0:
        high_nan_columns.append(col)
columns_to_keep = [col for col in df_application_train.columns if col not in high_nan_columns]
print(high_nan_columns)
print(columns_to_keep)

['OWN_CAR_AGE', 'EXT_SOURCE_1', 'EXT_SOURCE_3', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI', 'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'TOTALAREA_MODE', 'WALLSMATERIAL_MODE', 'EME

To make easy, let's keep all the original information in variable 'df_application_train' and the processed data into a new variable 'df'

In [164]:
df = df_application_train[columns_to_keep].copy()
print(df.shape)

(307511, 72)


Now, we are going to adopt a new name pattern. For a new transformed column, we will keep in lower case.

Let's separete columns of text that needs to be transform

In [165]:
text_columns = []
for col in df.columns:
    print(f"col {col} - {df[col].dtype}")
    if df[col].dtype == 'object':
        text_columns.append(col)

col SK_ID_CURR - int64
col TARGET - int64
col NAME_CONTRACT_TYPE - object
col CODE_GENDER - object
col FLAG_OWN_CAR - object
col FLAG_OWN_REALTY - object
col CNT_CHILDREN - int64
col AMT_INCOME_TOTAL - float64
col AMT_CREDIT - float64
col AMT_ANNUITY - float64
col AMT_GOODS_PRICE - float64
col NAME_TYPE_SUITE - object
col NAME_INCOME_TYPE - object
col NAME_EDUCATION_TYPE - object
col NAME_FAMILY_STATUS - object
col NAME_HOUSING_TYPE - object
col REGION_POPULATION_RELATIVE - float64
col DAYS_BIRTH - int64
col DAYS_EMPLOYED - int64
col DAYS_REGISTRATION - float64
col DAYS_ID_PUBLISH - int64
col FLAG_MOBIL - int64
col FLAG_EMP_PHONE - int64
col FLAG_WORK_PHONE - int64
col FLAG_CONT_MOBILE - int64
col FLAG_PHONE - int64
col FLAG_EMAIL - int64
col OCCUPATION_TYPE - object
col CNT_FAM_MEMBERS - float64
col REGION_RATING_CLIENT - int64
col REGION_RATING_CLIENT_W_CITY - int64
col WEEKDAY_APPR_PROCESS_START - object
col HOUR_APPR_PROCESS_START - int64
col REG_REGION_NOT_LIVE_REGION - int64
col 

In [166]:
text_columns

['NAME_CONTRACT_TYPE',
 'CODE_GENDER',
 'FLAG_OWN_CAR',
 'FLAG_OWN_REALTY',
 'NAME_TYPE_SUITE',
 'NAME_INCOME_TYPE',
 'NAME_EDUCATION_TYPE',
 'NAME_FAMILY_STATUS',
 'NAME_HOUSING_TYPE',
 'OCCUPATION_TYPE',
 'WEEKDAY_APPR_PROCESS_START',
 'ORGANIZATION_TYPE']

We also going to apply the same categories from our exploratory analysis

In [167]:
def categories_income(income):
    if income < 20000.0:
        return 'Less 20k'
    if income >= 20000.0 and income < 40000.0:
        return '20k - 40k'
    if income >= 40000.0 and income < 60000.0:
        return '40k - 60k'
    if income >= 60000.0 and income < 100000.0:
        return '60k - 100k'
    if income >= 100000.0 and income < 150000.0:
        return '100k - 150k'
    if income >= 150000.0 and income < 200000.0:
        return '150k - 200k'
    if income >= 200000.0 and income < 250000.0:
        return '200k - 250k'
    if income >= 250000.0 and income < 3000000.0:
        return '250 - 300k'
    if income >= 3000000.0 and income < 3500000.0:
        return '300 - 350k'
    if income >= 3500000.0:
        return 'more than 350k'
    else:
        return 'error'


In [168]:
df['incoming_category'] = 0
df['incoming_category'] = df['AMT_INCOME_TOTAL'].apply(categories_income)
text_columns.append('incoming_category')
df['incoming_category'].head(10)

0    200k - 250k
1     250 - 300k
2     60k - 100k
3    100k - 150k
4    100k - 150k
5     60k - 100k
6    150k - 200k
7     250 - 300k
8    100k - 150k
9    100k - 150k
Name: incoming_category, dtype: object

In [169]:
df['age'] = np.floor(df['DAYS_BIRTH'] / 365).astype(int) * (-1)
df['age'].head(5)

0    26
1    46
2    53
3    53
4    55
Name: age, dtype: int32

In [170]:
df['had_children'] = 0
df['had_children'].loc[df['CNT_CHILDREN'] > 0] = 1

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['had_children'].loc[df['CNT_CHILDREN'] > 0] = 1


Let's understand the type of data in each text column we separated

In [171]:
for col in text_columns:
    print(f"{col} - unique values: {df[col].unique()}")

NAME_CONTRACT_TYPE - unique values: ['Cash loans' 'Revolving loans']
CODE_GENDER - unique values: ['M' 'F' 'XNA']
FLAG_OWN_CAR - unique values: ['N' 'Y']
FLAG_OWN_REALTY - unique values: ['Y' 'N']
NAME_TYPE_SUITE - unique values: ['Unaccompanied' 'Family' 'Spouse, partner' 'Children' 'Other_A' nan
 'Other_B' 'Group of people']
NAME_INCOME_TYPE - unique values: ['Working' 'State servant' 'Commercial associate' 'Pensioner' 'Unemployed'
 'Student' 'Businessman' 'Maternity leave']
NAME_EDUCATION_TYPE - unique values: ['Secondary / secondary special' 'Higher education' 'Incomplete higher'
 'Lower secondary' 'Academic degree']
NAME_FAMILY_STATUS - unique values: ['Single / not married' 'Married' 'Civil marriage' 'Widow' 'Separated'
 'Unknown']
NAME_HOUSING_TYPE - unique values: ['House / apartment' 'Rented apartment' 'With parents'
 'Municipal apartment' 'Office apartment' 'Co-op apartment']
OCCUPATION_TYPE - unique values: ['Laborers' 'Core staff' 'Accountants' 'Managers' 'not_informed' 'Dr

We are going to use one-hot-encoder to all. For gender we are going to remove values 'XNA', because we saw on data exploratory step that only 4 persons fill as 'XNA'. For all NaNs we are going to fill as 'not_informed'

In [172]:
df['CODE_GENDER'].value_counts()

CODE_GENDER
F      202448
M      105059
XNA         4
Name: count, dtype: int64

In [173]:
print(df.shape)
df = df.drop(df[df['CODE_GENDER'] == 'XNA'].index)
print(df.shape)

(307511, 75)
(307507, 75)


For one hot encoding we will use pandas dumbies

In [174]:
# for one hot encoding 

df = pd.get_dummies(df, columns = text_columns)

df.head(10)

Unnamed: 0,SK_ID_CURR,TARGET,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,...,ORGANIZATION_TYPE_XNA,incoming_category_100k - 150k,incoming_category_150k - 200k,incoming_category_200k - 250k,incoming_category_20k - 40k,incoming_category_250 - 300k,incoming_category_300 - 350k,incoming_category_40k - 60k,incoming_category_60k - 100k,incoming_category_more than 350k
0,100002,1,0,202500.0,406597.5,24700.5,351000.0,0.018801,-9461,-637,...,False,False,False,True,False,False,False,False,False,False
1,100003,0,0,270000.0,1293502.5,35698.5,1129500.0,0.003541,-16765,-1188,...,False,False,False,False,False,True,False,False,False,False
2,100004,0,0,67500.0,135000.0,6750.0,135000.0,0.010032,-19046,-225,...,False,False,False,False,False,False,False,False,True,False
3,100006,0,0,135000.0,312682.5,29686.5,297000.0,0.008019,-19005,-3039,...,False,True,False,False,False,False,False,False,False,False
4,100007,0,0,121500.0,513000.0,21865.5,513000.0,0.028663,-19932,-3038,...,False,True,False,False,False,False,False,False,False,False
5,100008,0,0,99000.0,490495.5,27517.5,454500.0,0.035792,-16941,-1588,...,False,False,False,False,False,False,False,False,True,False
6,100009,0,1,171000.0,1560726.0,41301.0,1395000.0,0.035792,-13778,-3130,...,False,False,True,False,False,False,False,False,False,False
7,100010,0,0,360000.0,1530000.0,42075.0,1530000.0,0.003122,-18850,-449,...,False,False,False,False,False,True,False,False,False,False
8,100011,0,0,112500.0,1019610.0,33826.5,913500.0,0.018634,-20099,365243,...,True,True,False,False,False,False,False,False,False,False
9,100012,0,0,135000.0,405000.0,20250.0,405000.0,0.019689,-14469,-2019,...,False,True,False,False,False,False,False,False,False,False


In [175]:
columns_to_keep = [col for col in df.columns if col not in text_columns]
columns_to_keep.remove('DAYS_BIRTH') # we are going to use age
columns_to_keep.remove('AMT_INCOME_TOTAL') # we are going to use incoming categories
# guesses on variables might not help
columns_to_keep.remove('REGION_POPULATION_RELATIVE')
columns_to_keep.remove('REGION_RATING_CLIENT')
columns_to_keep.remove('REGION_RATING_CLIENT_W_CITY')
columns_to_keep.remove('HOUR_APPR_PROCESS_START')
df = df[columns_to_keep]

In [176]:
print(f"rows that contains na: {df.isnull().values.ravel().sum()}")
print(f"total rows: {df.shape[0]}")

rows that contains na: 254151
total rows: 307507


In [177]:
float_columns = []
int_columns = []
for col in df.columns:
    print(f"{col} - {df[col].dtypes}")
    if df[col].dtypes == 'float':
        float_columns.append(col)
    else:
        int_columns.append(col)

SK_ID_CURR - int64
TARGET - int64
CNT_CHILDREN - int64
AMT_CREDIT - float64
AMT_ANNUITY - float64
AMT_GOODS_PRICE - float64
DAYS_EMPLOYED - int64
DAYS_REGISTRATION - float64
DAYS_ID_PUBLISH - int64
FLAG_MOBIL - int64
FLAG_EMP_PHONE - int64
FLAG_WORK_PHONE - int64
FLAG_CONT_MOBILE - int64
FLAG_PHONE - int64
FLAG_EMAIL - int64
CNT_FAM_MEMBERS - float64
REG_REGION_NOT_LIVE_REGION - int64
REG_REGION_NOT_WORK_REGION - int64
LIVE_REGION_NOT_WORK_REGION - int64
REG_CITY_NOT_LIVE_CITY - int64
REG_CITY_NOT_WORK_CITY - int64
LIVE_CITY_NOT_WORK_CITY - int64
EXT_SOURCE_2 - float64
OBS_30_CNT_SOCIAL_CIRCLE - float64
DEF_30_CNT_SOCIAL_CIRCLE - float64
OBS_60_CNT_SOCIAL_CIRCLE - float64
DEF_60_CNT_SOCIAL_CIRCLE - float64
DAYS_LAST_PHONE_CHANGE - float64
FLAG_DOCUMENT_2 - int64
FLAG_DOCUMENT_3 - int64
FLAG_DOCUMENT_4 - int64
FLAG_DOCUMENT_5 - int64
FLAG_DOCUMENT_6 - int64
FLAG_DOCUMENT_7 - int64
FLAG_DOCUMENT_8 - int64
FLAG_DOCUMENT_9 - int64
FLAG_DOCUMENT_10 - int64
FLAG_DOCUMENT_11 - int64
FLAG_DOCU

In [178]:
for col in float_columns:
    print(f"{col}: {df[col].unique()}")
    print(f"rows that contains na: {df[col].isnull().values.ravel().sum()}")

AMT_CREDIT: [ 406597.5 1293502.5  135000.  ...  181989.   743863.5 1391130. ]
rows that contains na: 0
AMT_ANNUITY: [24700.5 35698.5  6750.  ... 71986.5 58770.  77809.5]
rows that contains na: 12
AMT_GOODS_PRICE: [ 351000.  1129500.   135000.  ...  453465.   143977.5  743863.5]
rows that contains na: 278
DAYS_REGISTRATION: [ -3648.  -1186.  -4260. ... -16396. -14558. -14798.]
rows that contains na: 0
CNT_FAM_MEMBERS: [ 1.  2.  3.  4.  5.  6.  9.  7.  8. 10. 13. nan 14. 12. 20. 15. 16. 11.]
rows that contains na: 2
EXT_SOURCE_2: [0.26294859 0.62224578 0.55591208 ... 0.13118876 0.26448565 0.2678342 ]
rows that contains na: 660
OBS_30_CNT_SOCIAL_CIRCLE: [  2.   1.   0.   4.   8.  10.  nan   7.   3.   6.   5.  12.   9.  13.
  11.  14.  22.  16.  15.  17.  20.  25.  19.  18.  21.  24.  23.  28.
  26.  29.  27.  47. 348.  30.]
rows that contains na: 1021
DEF_30_CNT_SOCIAL_CIRCLE: [ 2.  0.  1. nan  3.  4.  5.  6.  7. 34.  8.]
rows that contains na: 1021
OBS_60_CNT_SOCIAL_CIRCLE: [  2.   1.   

In [179]:
df['AMT_GOODS_PRICE'].min()


40500.0

In [180]:
df["DAYS_LAST_PHONE_CHANGE"].max()

0.0

How we are going to deal with NaNs values:
- CNT_FAM_MEMBERS fill with 1 (at least the person who applied and only 2 cases)
- AMT_ANNUITY drop rows, only 12
- AMT_GOODS_PRICE use the min value
- BUREAU values, for test, we tried to fill using mode, but didn't work, so go with mean
- DAYS_LAST_PHONE_CHANGE fill with min (1 value only)
- CNT_SOCIAL_CIRCLE variables tried with mode, going with mean
- EXT_SOURCE_2 we will use the avg

In a real problem, this should be align with business experts

Other transformations:

- DAYS_LAST_PHONE_CHANGE multiply by -1
- DAYS_REGISTRATION multiply by -1


In [181]:
df["AMT_REQ_CREDIT_BUREAU_YEAR"] = df['AMT_REQ_CREDIT_BUREAU_YEAR'].fillna(df['AMT_REQ_CREDIT_BUREAU_YEAR'].mean())
df['AMT_REQ_CREDIT_BUREAU_YEAR'].isnull().values.ravel().sum()

0

In [182]:
df["CNT_FAM_MEMBERS"] = df['CNT_FAM_MEMBERS'].fillna(1)
df["AMT_GOODS_PRICE"] = df['AMT_GOODS_PRICE'].fillna(df['AMT_GOODS_PRICE'].min())
df["DAYS_LAST_PHONE_CHANGE"] = df['DAYS_LAST_PHONE_CHANGE'].fillna(df['DAYS_LAST_PHONE_CHANGE'].min())
df["EXT_SOURCE_2"] = df['EXT_SOURCE_2'].fillna(df['EXT_SOURCE_2'].mean())
df["AMT_REQ_CREDIT_BUREAU_HOUR"] = df['AMT_REQ_CREDIT_BUREAU_HOUR'].fillna(df['AMT_REQ_CREDIT_BUREAU_HOUR'].mean())
df["AMT_REQ_CREDIT_BUREAU_DAY"] = df['AMT_REQ_CREDIT_BUREAU_DAY'].fillna(df['AMT_REQ_CREDIT_BUREAU_DAY'].mean())
df["AMT_REQ_CREDIT_BUREAU_WEEK"] = df['AMT_REQ_CREDIT_BUREAU_WEEK'].fillna(df['AMT_REQ_CREDIT_BUREAU_WEEK'].mean())
df["AMT_REQ_CREDIT_BUREAU_MON"] = df['AMT_REQ_CREDIT_BUREAU_MON'].fillna(df['AMT_REQ_CREDIT_BUREAU_MON'].mean())
df["AMT_REQ_CREDIT_BUREAU_QRT"] = df['AMT_REQ_CREDIT_BUREAU_QRT'].fillna(df['AMT_REQ_CREDIT_BUREAU_QRT'].mean())
df["AMT_REQ_CREDIT_BUREAU_YEAR"] = df['AMT_REQ_CREDIT_BUREAU_YEAR'].fillna(df['AMT_REQ_CREDIT_BUREAU_YEAR'].mean())
df["OBS_30_CNT_SOCIAL_CIRCLE"] = df['OBS_30_CNT_SOCIAL_CIRCLE'].fillna(df['OBS_30_CNT_SOCIAL_CIRCLE'].mean())
df["DEF_30_CNT_SOCIAL_CIRCLE"] = df['DEF_30_CNT_SOCIAL_CIRCLE'].fillna(df['DEF_30_CNT_SOCIAL_CIRCLE'].mean())
df["OBS_60_CNT_SOCIAL_CIRCLE"] = df['OBS_60_CNT_SOCIAL_CIRCLE'].fillna(df['OBS_60_CNT_SOCIAL_CIRCLE'].mean())
df["DEF_60_CNT_SOCIAL_CIRCLE"] = df['DEF_60_CNT_SOCIAL_CIRCLE'].fillna(df['DEF_60_CNT_SOCIAL_CIRCLE'].mean())
df['DAYS_LAST_PHONE_CHANGE'] = df['DAYS_LAST_PHONE_CHANGE'].abs()
df['DAYS_REGISTRATION'] = df['DAYS_REGISTRATION'].abs()
df = df[df['AMT_ANNUITY'].notna()]

In [183]:
df['DAYS_LAST_PHONE_CHANGE']

0         1134.0
1          828.0
2          815.0
3          617.0
4         1106.0
           ...  
307506     273.0
307507       0.0
307508    1909.0
307509     322.0
307510     787.0
Name: DAYS_LAST_PHONE_CHANGE, Length: 307495, dtype: float64

In [184]:
for col in int_columns:
    print(f"{col}: {df[col].unique()}")
    print(f"rows that contains na: {df[col].isnull().values.ravel().sum()}")

SK_ID_CURR: [100002 100003 100004 ... 456253 456254 456255]
rows that contains na: 0
TARGET: [1 0]
rows that contains na: 0
CNT_CHILDREN: [ 0  1  2  3  4  7  5  6  8  9 11 12 10 19 14]
rows that contains na: 0
DAYS_EMPLOYED: [  -637  -1188   -225 ... -12971 -11084  -8694]
rows that contains na: 0
DAYS_ID_PUBLISH: [-2120  -291 -2531 ... -6194 -5854 -6211]
rows that contains na: 0
FLAG_MOBIL: [1 0]
rows that contains na: 0
FLAG_EMP_PHONE: [1 0]
rows that contains na: 0
FLAG_WORK_PHONE: [0 1]
rows that contains na: 0
FLAG_CONT_MOBILE: [1 0]
rows that contains na: 0
FLAG_PHONE: [1 0]
rows that contains na: 0
FLAG_EMAIL: [0 1]
rows that contains na: 0
REG_REGION_NOT_LIVE_REGION: [0 1]
rows that contains na: 0
REG_REGION_NOT_WORK_REGION: [0 1]
rows that contains na: 0
LIVE_REGION_NOT_WORK_REGION: [0 1]
rows that contains na: 0
REG_CITY_NOT_LIVE_CITY: [0 1]
rows that contains na: 0
REG_CITY_NOT_WORK_CITY: [0 1]
rows that contains na: 0
LIVE_CITY_NOT_WORK_CITY: [0 1]
rows that contains na: 0
F

Other transformation:

- DAYS_EMPLOYED multiply by -1
- DAYS_ID_PUBLISH multiply by -1
- Add to normalization: age, DAYS_ID_PUBLISH, DAYS_EMPLOYED

In [185]:
df['DAYS_EMPLOYED'] = df['DAYS_EMPLOYED'].abs()
df['DAYS_ID_PUBLISH'] = df['DAYS_ID_PUBLISH'].abs()

Normalization

Initially I wanted to do the normalization by doing the math manually in pandas, but as we will need the process in an API, it would be better to use the scikit-learn normalizer, despite the bias that is introduced (and which according to the scikit-learn documentation does not affect modeling)

In [186]:
float_columns.append('age')
float_columns.append('DAYS_EMPLOYED')
float_columns.append('DAYS_ID_PUBLISH')

In [187]:
scaler = StandardScaler()

df[float_columns] = scaler.fit_transform(df[float_columns].to_numpy())

df.head(10)

Unnamed: 0,SK_ID_CURR,TARGET,CNT_CHILDREN,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH,FLAG_MOBIL,...,ORGANIZATION_TYPE_XNA,incoming_category_100k - 150k,incoming_category_150k - 200k,incoming_category_200k - 250k,incoming_category_20k - 40k,incoming_category_250 - 300k,incoming_category_300 - 350k,incoming_category_40k - 60k,incoming_category_60k - 100k,incoming_category_more than 350k
0,100002,1,0,-0.4781,-0.166152,-0.505835,-0.481125,-0.379841,-0.579158,1,...,False,False,False,True,False,False,False,False,False,False
1,100003,0,0,1.725424,0.592657,1.600584,-0.477174,-1.078707,-1.790854,1,...,False,False,False,False,False,True,False,False,False,False
2,100004,0,0,-1.152887,-1.404649,-1.090275,-0.484079,-0.206117,-0.306874,1,...,False,False,False,False,False,False,False,False,True,False
3,100006,0,0,-0.711433,0.177858,-0.651945,-0.4639,1.375841,-0.369148,1,...,False,True,False,False,False,False,False,False,False,False
4,100007,0,0,-0.213742,-0.361753,-0.067505,-0.463907,-0.191641,0.307256,1,...,False,True,False,False,False,False,False,False,False,False
5,100008,0,0,-0.269655,0.028208,-0.225791,-0.474305,-0.004576,-1.66763,1,...,False,False,False,False,False,False,False,False,True,False
6,100009,0,1,2.389344,0.979202,2.318958,-0.463247,-1.071043,-1.573557,1,...,False,False,True,False,False,False,False,False,False,False
7,100010,0,0,2.313005,1.032604,2.684233,-0.482473,-0.110456,-0.407572,1,...,False,False,False,False,False,True,False,False,False,False
8,100011,0,0,1.044936,0.463498,1.016144,2.133543,0.692871,0.344355,1,...,True,True,False,False,False,False,False,False,False,False
9,100012,0,0,-0.482069,-0.473215,-0.359725,-0.471214,2.682738,0.661026,1,...,False,True,False,False,False,False,False,False,False,False


In [188]:
data_columns = [col for col in df.columns if col != 'TARGET' or col != 'SK_ID_CURR']

y = df['TARGET'].values

X = df[data_columns].values

In [189]:
y

array([1, 0, 0, ..., 0, 1, 0], dtype=int64)

In [190]:
X

array([[100002, 1, 0, ..., False, False, False],
       [100003, 0, 0, ..., False, False, False],
       [100004, 0, 0, ..., False, True, False],
       ...,
       [456253, 0, 0, ..., False, False, False],
       [456254, 1, 0, ..., False, False, False],
       [456255, 0, 0, ..., False, False, False]], dtype=object)

Balancing classes

In [191]:
sm = SMOTE(random_state= RANDOM_STATE)

print(f"Total cases delayed befor: {y.sum()}")

X_res, y_res = sm.fit_resample(X, y)

print(f"Total cases delayed befor: {y_res.sum()}")

Total cases delayed befor: 24825
Total cases delayed befor: 282670


In [206]:
print(( y_res == 0).sum(0))
print(( y_res == 1).sum(0))
print(y_res.shape)

282670
282670
(565340,)


In [193]:
le = LabelEncoder()
le.fit(y_res)
y_res = le.transform(y_res)

Split train and validation

In [194]:
X_train, X_validation, y_train, y_validation = train_test_split(X_res, y_res, test_size=0.20, random_state=RANDOM_STATE)

Objetive functions

In [195]:
def calculate_metrics(y_pred, y_validation):
    """ Function to calculate metrics of calculation
    """
    # our validation is our y_real
    score = roc_auc_score(y_validation, y_pred)

    return score

In [196]:
def objective_logisticregression(trial, X_train, X_validation, y_train, y_validation, random_state = 77):
    """ Objective function for Logistic Regression
    """
    lr_penalty = trial.suggest_categorical('penalty', ['l1','l2', 'elasticnet'])
    lr_c = trial.suggest_float('C',0.1,10.0)
    lr_l1_ratio = trial.suggest_float('l1_ratio', 0.01, 0.99) # only used with elasticnet

    clf = LogisticRegression(penalty = lr_penalty,
                             C = lr_c,
                             l1_ratio= lr_l1_ratio,
                             solver = 'saga',
                             random_state= random_state)
    
    clf.fit(X_train, y_train)

      
    pred = clf.predict(X_validation)
        
    target_metric =  calculate_metrics(pred, y_validation)
    
    return target_metric

In [197]:
def objective_decisiontree(trial, X_train, X_validation, y_train, y_validation, random_state = 77):
    """ Objective function for decision trees
    """
    dt_criterion = trial.suggest_categorical('criterion', ['gini', 'entropy'])
    dt_splitter = trial.suggest_categorical('splitter',['best','random'])
    dt_max_depth = trial.suggest_int('max_depth', 2, 128, log=True)
    dt_min_samples_split = trial.suggest_int('min_samples_split',2, 10)

    clf = DecisionTreeClassifier(criterion= dt_criterion,
                                 splitter= dt_splitter,
                                 max_depth= dt_max_depth,
                                 min_samples_split= dt_min_samples_split,
                                 random_state= random_state)

    
    clf.fit(X_train, y_train)
    # predict
    pred = clf.predict(X_validation)
        
    target_metric =  calculate_metrics(pred, y_validation)
   
    return target_metric

In [198]:
def objective_mlp(trial, X_train, X_validation, y_train, y_validation, random_state = 77):
    """ Objective function for ANN
    """
    # params
    mlp_n_layers = trial.suggest_int('n_layers', 1, 8)
    mlp_layers = []
    for i in range(mlp_n_layers):
        mlp_layers.append(trial.suggest_int(f'n_units_{i}', 1, 10))
    mlp_activation = trial.suggest_categorical('activation', ['identity','logistic','tanh','relu'])
    mlp_alpha = trial.suggest_float('alpha', 0.0001, 0.001, step = 0.0005)
    mlp_learning_rate_init = trial.suggest_float('learning_rate_init',0.0001,0.1, step = 0.005)
    
    # build and test
    clf = MLPClassifier(hidden_layer_sizes = tuple(mlp_layers),
                        activation = mlp_activation,
                        alpha = mlp_alpha,
                        learning_rate_init = mlp_learning_rate_init,
                        random_state = random_state)
    
   
    clf.fit(X_train, y_train)
    # predict
    pred = clf.predict(X_validation)
        
    target_metric =  calculate_metrics(pred, y_validation)
   
    return target_metric

In [199]:
def save_results_from_studies(dict_dfs, dict_resume, save_studies_path):
    """ Function to save results from studies
    """
    for model in dict_dfs:
        print('Saving results of model ', model)
        filename = model + '_study_optuna' + '.csv'
        dict_dfs[model].to_csv(save_studies_path + '\\' + filename, sep = ';')
    
    with open(save_studies_path + '\\' + 'resume_optuna_ml.txt', 'w+') as file:
        file.write(json.dumps(dict_resume)) # use `json.loads` to do the reverse

    return None

In [200]:
def plot_history_and_hyperparameter_importance(study, plot_configs, show_fig = False, save_fig = True,
                                               files_name = ['study_historic.svg', 'hyperparameter_importance.svg', 'hyperparameter_importance_by_time.svg'],
                                               path_to_save = r'C:\Users\Pichau\Documents\Projetos\desafio_itau\src\results\figs'):
    """ Function to create 3 main plots from Optuna study: historical, hyperparameter importance and hyperparameter importance by time consumption

        The parameter importances are returned as a dictionary where the keys consist of parameter names and their values importances. 
        The importances are represented by non-negative floating point numbers, where higher values mean that the parameters are more important. 
        The returned dictionary is of type collections.OrderedDict and is ordered by its values in a descending order. 
        By default, the sum of the importance values are normalized to 1.0.
    """
    # plot optimization history
    fig_history = plot_optimization_history(study)
    if 'history' in plot_configs:
        try:
            fig_history.update_layout(title={'text': plot_configs['history']['title'],
                                              'y':0.9,
                                              'x':0.5,
                                              'xanchor': 'center',
                                              'yanchor': 'top'},
                                     xaxis_title = plot_configs['history']['xaxis_title'],
                                     yaxis_title = plot_configs['history']['yaxis_title'],
                                     )
            fig_history.update_traces(name=plot_configs['history']['objective_value'], selector=dict(name='Objective Value'))
            fig_history.update_traces(name=plot_configs['history']['best_value'], selector=dict(name='Best Value'))
        except:
            pass
    # plot hyperparameter importance
    try:
        fig_importance = plot_param_importances(study)
        if 'importance' in plot_configs:
            try:
                fig_importance.update_layout(title={'text': plot_configs['importance']['title'],
                                                    'y':0.9,
                                                    'x':0.5,
                                                    'xanchor': 'center',
                                                    'yanchor': 'top'},
                                            xaxis_title = plot_configs['importance']['xaxis_title'],
                                            yaxis_title = plot_configs['importance']['yaxis_title'],
                                            )
            except:
                pass
        # plot hyperparameter importance for duration
        fig_duration = optuna.visualization.plot_param_importances(study, target=lambda t: t.duration.total_seconds(), target_name="duration")
        if 'duration' in plot_configs:
            try:
                fig_duration.update_layout(title={'text': plot_configs['duration']['title'],
                                                    'y':0.9,
                                                    'x':0.5,
                                                    'xanchor': 'center',
                                                    'yanchor': 'top'},
                                            xaxis_title = plot_configs['duration']['xaxis_title'],
                                            yaxis_title = plot_configs['duration']['yaxis_title'],
                                            )
            except:
                pass
    except:
        print('Could not plot variance')
        fig_duration = None
        fig_importance = None
    # show plot
    if show_fig:
        fig_history.show()
        if fig_importance != None:
            fig_importance.show()
        if fig_duration != None:
            fig_duration.show()
    # save plots
    print('plotting plot_history_and_hyperparameter_importance')
    if save_fig:
        # history
        history_file = path_to_save + '\\' + files_name[0]
        fig_history.write_image(history_file)
        if fig_importance != None:
            # importance
            importance_file = path_to_save + '\\' + files_name[1]
            fig_importance.write_image(importance_file)
        if fig_duration != None:
            # history
            duration_file = path_to_save + '\\' + files_name[2]
            fig_duration.write_image(duration_file)

    
    return fig_history, fig_importance, fig_duration

In [201]:
def return_standard_plot_configs(model_name = 'Random Forest', target_name = 'roc auc'):
    """ Function to build plot configs dict with standard names
    """
    plot_configs = {}
    # optimization history
    plot_configs['history'] = {}
    plot_configs['history']['title'] = 'Histórico da otimização - {model}'.format(model = model_name)
    plot_configs['history']['xaxis_title'] = 'Estudo (interação)'
    plot_configs['history']['yaxis_title'] = 'Valores de ' + target_name
    plot_configs['history']['objective_value'] = 'Valor objetivo'
    plot_configs['history']['best_value'] = 'Melhor valor'
    # hyperparameter importance
    plot_configs['importance'] = {}
    plot_configs['importance']['title'] = 'Importância dos hiperparâmetros - {model}'.format(model = model_name)
    plot_configs['importance']['xaxis_title'] = 'Importância do hiperparâmetro'
    plot_configs['importance']['yaxis_title'] = 'Hiperparâmetro'
    # hyperparameter importance by time duration
    plot_configs['duration'] = {}
    plot_configs['duration']['title'] = 'Importância dos hiperparâmetros - {model}'.format(model = model_name)
    plot_configs['duration']['xaxis_title'] = 'Importância pelo impacto no tempo de processamento'
    plot_configs['duration']['yaxis_title'] = 'Hiperparâmetro'

    return plot_configs

In [202]:
def create_plots_names_optuna(model, format = '.svg'):
    """ Function to create list of files name given a model prefix name
    """
    prefix_files = ['history_optimization', 'hyperparameter_importance','hyperparameter_importance_duration']
    files_name = []
    for prefix in prefix_files:
        files_name.append(model + '_' + prefix + format)

    return files_name

Configurations of our optimization

In [203]:
n_trials = 100
plot_figures = False
direction = 'maximize'
PATH_TO_SAVE_FIGURES = r'C:\Users\Pichau\Documents\Projetos\desafio_itau\src\results\figs'
PATH_TO_SAVE_RESULTS = r'C:\Users\Pichau\Documents\Projetos\desafio_itau\src\results'
save_experiment = True
show_figures = True
save_figures = True
save_results_studies = True

models_to_study = ['ann' ,'logistic_regression' ,'decision_tree' ]

In [204]:
PATH_TO_SAVE_RESULTS

'C:\\Users\\Pichau\\Documents\\Projetos\\desafio_itau\\src\\results'

In [205]:
# dict to translate name of model to objective function to study
dict_ml_models_to_function = {
    'ann': objective_mlp,
    'logistic_regression': objective_logisticregression,
    'decision_tree': objective_decisiontree
}

# variables to store results
dict_dfs = {}
dict_resume = {}



for model in models_to_study:
# define study
    obj = dict_ml_models_to_function[model]
    if obj == None:
        print('>>>>>>> Skipping model ', model)
        continue
    sampler = optuna.samplers.TPESampler(seed=RANDOM_STATE)
    # measure time
    starting_time = time.time()
    # create study
    print('>>>>>>> Starting study for ', model)
    study = optuna.create_study(direction=direction, sampler= sampler)
    # realize study
    study.optimize(lambda trial: obj(trial, X_train, X_validation, y_train, y_validation, RANDOM_STATE), n_trials=n_trials)
    # store results
    print('>>>>>>> Finished studies with {model} - best result was {result} - best params: {params}'.format(model = model, result = study.best_value, params = study.best_params))
    # measure time
    ending_time = time.time()
    try:
        print(optuna.importance.get_param_importances(study))
    except:
        pass
    # Save results to dataframe
    df_study = study.trials_dataframe()
    dataframe_study_name = f'optuna_results_for_{model}.csv'
    # Plot figures
    if plot_figures:
        print('>>>>>>> Plotting figures')
        # define files name
        files_name_list = create_plots_names_optuna(model= model)
        # get configurations of plot: axis label, title, ...
        plot_configs_dict = return_standard_plot_configs(model_name= model,
                                                         target_name= 'roc_auc_score')
        print(f'files_name_list: {files_name_list}')
        # plot figures
        plot_history_and_hyperparameter_importance(study = study,
                                                    plot_configs = plot_configs_dict,
                                                    show_fig = show_figures, 
                                                    save_fig = save_figures,
                                                    files_name = files_name_list,
                                                    path_to_save = PATH_TO_SAVE_FIGURES)
        df_study.to_csv(PATH_TO_SAVE_FIGURES + '\\' + dataframe_study_name, sep = ';', encoding = 'utf-8')

print('---------------------------------------------------------------------------------')
dict_dfs[model] = study.trials_dataframe(attrs=('number', 'value', 'params', 'state'))
dict_resume[model] = {
    'result': study.best_value,
    'params': study.best_params,
    'time': ending_time - starting_time
}

# Save functions

if save_results_studies:
    print('>>>>>>> Saving results')
    save_results_from_studies(dict_dfs = dict_dfs, 
                                dict_resume = dict_resume, 
                                save_studies_path = PATH_TO_SAVE_RESULTS)

[I 2023-12-23 01:04:06,294] A new study created in memory with name: no-name-6e7fcca1-005c-41dd-a66a-7bed256f8bd8


>>>>>>> Starting study for  ann




[I 2023-12-23 01:04:33,496] Trial 0 finished with value: 0.5 and parameters: {'n_layers': 3, 'n_units_0': 10, 'n_units_1': 8, 'n_units_2': 6, 'activation': 'relu', 'alpha': 0.0006, 'learning_rate_init': 0.07010000000000001}. Best is trial 0 with value: 0.5.
[I 2023-12-23 01:05:37,295] Trial 1 finished with value: 1.0 and parameters: {'n_layers': 1, 'n_units_0': 10, 'activation': 'identity', 'alpha': 0.0001, 'learning_rate_init': 0.050100000000000006}. Best is trial 1 with value: 1.0.
[I 2023-12-23 01:05:55,131] Trial 2 finished with value: 0.5 and parameters: {'n_layers': 4, 'n_units_0': 3, 'n_units_1': 7, 'n_units_2': 2, 'n_units_3': 3, 'activation': 'tanh', 'alpha': 0.0006, 'learning_rate_init': 0.0551}. Best is trial 1 with value: 1.0.
[I 2023-12-23 01:06:20,879] Trial 3 finished with value: 0.5 and parameters: {'n_layers': 1, 'n_units_0': 7, 'activation': 'relu', 'alpha': 0.0006, 'learning_rate_init': 0.0301}. Best is trial 1 with value: 1.0.
[I 2023-12-23 01:06:37,843] Trial 4 fin

>>>>>>> Finished studies with ann - best result was 1.0 - best params: {'n_layers': 1, 'n_units_0': 10, 'activation': 'identity', 'alpha': 0.0001, 'learning_rate_init': 0.050100000000000006}


[I 2023-12-23 02:12:27,840] A new study created in memory with name: no-name-1928bdb3-b719-4ef3-bde5-cdaa7456f8d5


{'n_layers': 0.36627606682006125, 'activation': 0.3593009563973937, 'learning_rate_init': 0.22271283642897766, 'n_units_0': 0.04713676424527457, 'alpha': 0.004573376108292828}
>>>>>>> Starting study for  logistic_regression


[I 2023-12-23 02:13:58,434] Trial 0 finished with value: 0.5 and parameters: {'penalty': 'l2', 'C': 6.026718993550662, 'l1_ratio': 0.1628982676335878}. Best is trial 0 with value: 0.5.
[I 2023-12-23 02:16:36,605] Trial 1 finished with value: 0.5 and parameters: {'penalty': 'elasticnet', 'C': 6.051038616257767, 'l1_ratio': 0.7039111262401245}. Best is trial 0 with value: 0.5.
[I 2023-12-23 02:18:07,035] Trial 2 finished with value: 0.5 and parameters: {'penalty': 'l2', 'C': 2.202157195714934, 'l1_ratio': 0.18818846786295862}. Best is trial 0 with value: 0.5.
[I 2023-12-23 02:20:45,197] Trial 3 finished with value: 0.5 and parameters: {'penalty': 'elasticnet', 'C': 4.376255684556946, 'l1_ratio': 0.29540455739408106}. Best is trial 0 with value: 0.5.
[I 2023-12-23 02:23:23,372] Trial 4 finished with value: 0.5 and parameters: {'penalty': 'l1', 'C': 3.726982248607548, 'l1_ratio': 0.4569485845326952}. Best is trial 0 with value: 0.5.
[I 2023-12-23 02:26:01,128] Trial 5 finished with value: 

>>>>>>> Finished studies with logistic_regression - best result was 0.5 - best params: {'penalty': 'l2', 'C': 6.026718993550662, 'l1_ratio': 0.1628982676335878}
>>>>>>> Starting study for  decision_tree


[I 2023-12-23 05:57:56,905] Trial 0 finished with value: 1.0 and parameters: {'criterion': 'entropy', 'splitter': 'best', 'max_depth': 3, 'min_samples_split': 3}. Best is trial 0 with value: 1.0.
[I 2023-12-23 05:57:58,784] Trial 1 finished with value: 1.0 and parameters: {'criterion': 'entropy', 'splitter': 'random', 'max_depth': 2, 'min_samples_split': 10}. Best is trial 0 with value: 1.0.
[I 2023-12-23 05:58:00,632] Trial 2 finished with value: 1.0 and parameters: {'criterion': 'gini', 'splitter': 'random', 'max_depth': 6, 'min_samples_split': 6}. Best is trial 0 with value: 1.0.
[I 2023-12-23 05:58:04,468] Trial 3 finished with value: 1.0 and parameters: {'criterion': 'gini', 'splitter': 'best', 'max_depth': 6, 'min_samples_split': 5}. Best is trial 0 with value: 1.0.
[I 2023-12-23 05:58:06,345] Trial 4 finished with value: 1.0 and parameters: {'criterion': 'entropy', 'splitter': 'random', 'max_depth': 21, 'min_samples_split': 2}. Best is trial 0 with value: 1.0.
[I 2023-12-23 05:5

>>>>>>> Finished studies with decision_tree - best result was 1.0 - best params: {'criterion': 'entropy', 'splitter': 'best', 'max_depth': 3, 'min_samples_split': 3}
---------------------------------------------------------------------------------
>>>>>>> Saving results
Saving results of model  decision_tree
