This project is an introduction to artificial neural networks: fully connected
networks, hidden layers, activation functions, backpropagation, and dropout.

### Implementation Steps:
1. **Setup:** Create a Jupyter notebook environment and import necessary libraries.
2. **Data Exploration and Preprocessing:** Load the dataset, perform exploratory data analysis, handle missing values, and perform feature engineering.
3. **Model Development:** Implement various models as specified in the mandatory part.
4. **Evaluation:** Compare models based on accuracy and AUC, and create a summary table.
5. **Optimization:** Work on the bonus part to enhance model performance.
6. **Finalization:** Prepare the submission file and documentation for peer review.
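Step 4 compares models on accuracy and AUC. As a minimal sketch of how the two metrics behave for the naive baseline, the snippet below scores a classifier that always predicts the majority class. The toy labels are hypothetical and are not taken from the bank dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical toy labels: 8 negatives and 2 positives (imbalanced, like TARGET).
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# A naive classifier predicts the majority class for every row
# and therefore assigns the same score to every row.
y_pred = np.zeros_like(y_true)
y_score = np.zeros(len(y_true), dtype=float)

naive_accuracy = accuracy_score(y_true, y_pred)  # fraction of correct labels
naive_auc = roc_auc_score(y_true, y_score)       # ranking quality of the scores

print(naive_accuracy, naive_auc)
```

On imbalanced data the accuracy of such a baseline looks high, while its AUC stays at chance level because constant scores cannot rank positives above negatives. This is one reason to report both metrics in the summary table.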

Tabular modeling takes data in the form of a table (like a spreadsheet or CSV). The objective is to predict the value in one column based on the values in the other columns.

## 1. Setup

In [1]:
%%capture
# ! pip install -Uqq fastbook dtreeviz

import fastbook
fastbook.setup_book()

import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
import dtreeviz
import pickle

from fastbook import *
from pandas.api.types import is_string_dtype, is_numeric_dtype, is_categorical_dtype
from fastai.tabular.all import *
from dtreeviz.trees import *
from IPython.display import Image, display_svg, SVG
random_state = 123
import warnings
warnings.filterwarnings('ignore')

In [3]:
def update_summary_table(approach, lib, accuracy, auc):
    """Record the library, accuracy, and AUC of one approach in the global summary_table."""
    # Locate the row(s) whose 'Approach' label matches the given approach.
    row_index = summary_table[summary_table['Approach'] == approach].index

    summary_table.loc[row_index, 'Library Used'] = lib
    summary_table.loc[row_index, 'Accuracy'] = accuracy
    summary_table.loc[row_index, 'AUC Score'] = auc
    return summary_table

summary_table = pd.DataFrame({
    "Approach": ["Baseline (Naive Classifier)", "Random Forest", "Scikit-learn (MLPClassifier)", 
                 "Keras (TensorFlow)", "TensorFlow", "NumPy"],
    "Library Used": ["None", "Scikit-learn", "Scikit-learn", "TensorFlow", "TensorFlow", "NumPy"],
    "Algorithm": ["Naive Classifier", "Random Forest", "MLPClassifier", "Keras Neural Network", 
                  "TensorFlow Neural Network", "Custom Neural Network"],
    "Hyperparameters": ["N/A", "Optimized via Grid Search", "Default / Custom Settings", 
                        "Custom Settings", "Custom Settings", "Custom Implementation"],
    "Accuracy": ["To be filled", "To be filled", "To be filled", "To be filled", "To be filled", "To be filled"],
    "AUC Score": ["To be filled", "To be filled", "To be filled", "To be filled", "To be filled", "To be filled"]
})



summary_table

Unnamed: 0,Approach,Library Used,Algorithm,Hyperparameters,Accuracy,AUC Score
0,Baseline (Naive Classifier),,Naive Classifier,,To be filled,To be filled
1,Random Forest,Scikit-learn,Random Forest,Optimized via Grid Search,To be filled,To be filled
2,Scikit-learn (MLPClassifier),Scikit-learn,MLPClassifier,Default / Custom Settings,To be filled,To be filled
3,Keras (TensorFlow),TensorFlow,Keras Neural Network,Custom Settings,To be filled,To be filled
4,TensorFlow,TensorFlow,TensorFlow Neural Network,Custom Settings,To be filled,To be filled
5,NumPy,NumPy,Custom Neural Network,Custom Implementation,To be filled,To be filled


## 2. Data Exploration and Preprocessing

- Download and prepare the dataset.
- Perform data preprocessing, such as handling missing values and feature selection.
- Split the dataset into training and testing sets with stratification.
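A stratified split keeps the class ratio of the target the same in the training and test sets. The following is a minimal sketch on a hypothetical toy frame (the column `feature` and the 10/90 class balance are assumptions made for illustration), using scikit-learn's `train_test_split` with its `stratify` argument.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 positives and 90 negatives.
toy = pd.DataFrame({
    "feature": np.arange(100, dtype=float),
    "TARGET": [1] * 10 + [0] * 90,
})

X_toy = toy.drop(columns="TARGET")
y_toy = toy["TARGET"]

# stratify=y_toy preserves the 10% positive rate in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=123
)

print(y_train.mean(), y_test.mean())
```

Without `stratify`, a small test set drawn from a heavily imbalanced target can end up with very few positive rows, which makes accuracy and AUC estimates unstable.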

In tabular data, some columns contain numerical data, such as "age", while others contain string values, such as "sex". Numerical columns can be fed to the model directly (with optional preprocessing), but string columns must first be converted to numbers. Because their values correspond to distinct categories, these are called categorical variables; the numerical columns are called continuous variables.
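As a rough illustration of this distinction, the sketch below separates columns by dtype and turns one string column into integer codes. The toy frame and the names `cont_cols`, `cat_cols`, and `sex_codes` are hypothetical; integer codes are only one possible encoding, and one-hot encoding or embeddings are common alternatives.

```python
import pandas as pd

# Hypothetical toy frame with one continuous and one categorical column.
toy = pd.DataFrame({"age": [34, 51, 27], "sex": ["f", "m", "f"]})

# Numeric columns are treated as continuous, everything else as categorical.
cont_cols = toy.select_dtypes(include="number").columns.tolist()
cat_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Convert the string categories to integer codes (categories are sorted: f -> 0, m -> 1).
sex_codes = toy["sex"].astype("category").cat.codes

print(cont_cols, cat_cols, sex_codes.tolist())
```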

In [4]:
# file_path =  '/kaggle/input/bank-data-train-csv'
# df = pd.read_csv(f'{file_path}/bank_data_train.csv', low_memory=False)
file_path = 'data/bank_data_train.csv'
df = pd.read_csv(file_path, low_memory=False)

df.head()

Unnamed: 0,ID,CR_PROD_CNT_IL,AMOUNT_RUB_CLO_PRC,PRC_ACCEPTS_A_EMAIL_LINK,APP_REGISTR_RGN_CODE,PRC_ACCEPTS_A_POS,PRC_ACCEPTS_A_TK,TURNOVER_DYNAMIC_IL_1M,CNT_TRAN_AUT_TENDENCY1M,SUM_TRAN_AUT_TENDENCY1M,AMOUNT_RUB_SUP_PRC,PRC_ACCEPTS_A_AMOBILE,SUM_TRAN_AUT_TENDENCY3M,CLNT_TRUST_RELATION,PRC_ACCEPTS_TK,PRC_ACCEPTS_A_MTP,REST_DYNAMIC_FDEP_1M,CNT_TRAN_AUT_TENDENCY3M,CNT_ACCEPTS_TK,APP_MARITAL_STATUS,REST_DYNAMIC_SAVE_3M,CR_PROD_CNT_VCU,REST_AVG_CUR,CNT_TRAN_MED_TENDENCY1M,APP_KIND_OF_PROP_HABITATION,CLNT_JOB_POSITION_TYPE,AMOUNT_RUB_NAS_PRC,CLNT_JOB_POSITION,APP_DRIVING_LICENSE,TRANS_COUNT_SUP_PRC,APP_EDUCATION,CNT_TRAN_CLO_TENDENCY1M,SUM_TRAN_MED_TENDENCY1M,PRC_ACCEPTS_A_ATM,PRC_ACCEPTS_MTP,TRANS_COUNT_NAS_PRC,APP_TRAVEL_PASS,CNT_ACCEPTS_MTP,CR_PROD_CNT_TOVR,APP_CAR,CR_PROD_CNT_PIL,SUM_TRAN_CLO_TENDENCY1M,APP_POSITION_TYPE,TURNOVER_CC,TRANS_COUNT_ATM_PRC,AMOUNT_RUB_ATM_PRC,TURNOVER_PAYM,AGE,CNT_TRAN_MED_TENDENCY3M,CR_PROD_CNT_CC,SUM_TRAN_MED_TENDENCY3M,REST_DYNAMIC_FDEP_3M,REST_DYNAMIC_IL_1M,APP_EMP_TYPE,SUM_TRAN_CLO_TENDENCY3M,LDEAL_TENOR_MAX,LDEAL_YQZ_CHRG,CR_PROD_CNT_CCFP,DEAL_YQZ_IR_MAX,LDEAL_YQZ_COM,DEAL_YQZ_IR_MIN,CNT_TRAN_CLO_TENDENCY3M,REST_DYNAMIC_CUR_1M,REST_AVG_PAYM,LDEAL_TENOR_MIN,LDEAL_AMT_MONTH,APP_COMP_TYPE,LDEAL_GRACE_DAYS_PCT_MED,REST_DYNAMIC_CUR_3M,CNT_TRAN_SUP_TENDENCY3M,TURNOVER_DYNAMIC_CUR_1M,REST_DYNAMIC_PAYM_3M,SUM_TRAN_SUP_TENDENCY3M,REST_DYNAMIC_IL_3M,CNT_TRAN_ATM_TENDENCY3M,CNT_TRAN_ATM_TENDENCY1M,TURNOVER_DYNAMIC_IL_3M,SUM_TRAN_ATM_TENDENCY3M,DEAL_GRACE_DAYS_ACC_S1X1,AVG_PCT_MONTH_TO_PCLOSE,DEAL_YWZ_IR_MIN,SUM_TRAN_SUP_TENDENCY1M,DEAL_YWZ_IR_MAX,SUM_TRAN_ATM_TENDENCY1M,REST_DYNAMIC_PAYM_1M,CNT_TRAN_SUP_TENDENCY1M,DEAL_GRACE_DAYS_ACC_AVG,TURNOVER_DYNAMIC_CUR_3M,PACK,MAX_PCLOSE_DATE,LDEAL_YQZ_PC,CLNT_SETUP_TENOR,DEAL_GRACE_DAYS_ACC_MAX,TURNOVER_DYNAMIC_PAYM_3M,LDEAL_DELINQ_PER_MAXYQZ,TURNOVER_DYNAMIC_PAYM_1M,CLNT_SALARY_VALUE,TRANS_AMOUNT_TENDENCY3M,MED_DEBT_PRC_YQZ,TRANS_CNT_TENDENCY3M,LDEAL_USED_AMT_AVG_YQZ,REST_DYNAMIC_CC_1M,LDEAL_USED_AMT_AVG_YWZ,TURNOVER_DYNAMIC_CC_1M,AVG_PCT_DEBT_TO_DEAL_AMT,LDEAL_ACT_DAYS_ACC_PCT_AVG,REST_DYNAMIC_CC_3M,MED_DEBT_PRC_YWZ,LDEAL_ACT_DAYS_PCT_TR3,LDEAL_ACT_DAYS_PCT_AAVG,LDEAL_DELINQ_PER_MAXYWZ,TURNOVER_DYNAMIC_CC_3M,LDEAL_ACT_DAYS_PCT_TR,LDEAL_ACT_DAYS_PCT_TR4,LDEAL_ACT_DAYS_PCT_CURR,TARGET
0,146841,0,0.0,,,,,0.0,,,0.0,,,,,,0.0,,,,0.541683,0,156067.339767,,,,0.0,начальник отдела,,0.0,,,,,,0.0,,,0,,0,,,0.0,1.0,1.0,0.0,660,,0,,0.0,0.0,,,,,0,,,,,0.134651,0.0,,,,0.0,0.474134,,0.13191,0.0,,0.0,0.40678,0.101695,0.0,0.483032,,,,,,0.134634,0.0,,,0.442285,K01,,,1.593023,,0.0,,0.0,,0.483032,,0.40678,,0.0,,0.0,,,0.0,,,,,0.0,,,,0
1,146842,0,0.041033,,,,,0.0,0.166667,0.186107,0.244678,,0.670968,,,,0.0,0.666667,,,0.0,0,4278.845817,,,,0.0,,,0.454545,,,,,,0.0,,,0,,0,,,0.0,0.109091,0.410691,0.0,552,,0,,0.0,0.0,,,,,0,,,,,0.239365,0.0,,,,0.0,0.384264,0.6,0.101934,0.0,0.510712,0.0,0.333333,0.166667,0.0,0.2,,,,0.309799,,0.133333,0.0,0.24,,0.515876,102,,,1.587647,,0.0,,0.0,,0.39434,,0.545455,,0.0,,0.0,,,0.0,,,,,0.0,,,,0
2,146843,0,0.006915,0.0,,0.0,0.0,0.0,,,0.0,0.0,,,0.0,0.0,0.0,,0.0,,0.0,0,112837.062817,,,,0.0,ГЕНЕРАЛЬНЫЙ ДИРЕКТОР,,0.0,,,,0.0,0.0,0.0,,0.0,0,,0,,,0.0,0.810811,0.92514,0.0,420,,0,,0.0,0.0,,,,,0,,,,,0.084341,0.0,,,,0.0,0.336136,,0.121041,0.0,,0.0,0.366667,0.133333,0.0,0.431656,,,,,,0.063129,0.0,,,0.522833,102,,,1.587647,,0.0,,0.0,,0.399342,,0.297297,,0.0,,0.0,,,0.0,,,,,0.0,,,,0
3,146844,0,0.0,,,,,0.0,,,0.0,,,,,,0.0,,,,0.005874,0,42902.902883,,,,0.0,МЕНЕДЖЕР ИАО,,0.0,,,,,,0.0,,,0,,0,,,0.0,1.0,1.0,0.0,372,,0,,0.0,0.0,,,,,0,,,,,0.005659,0.0,,,,0.0,0.019648,,5e-06,0.0,,0.0,,,0.0,,,,,,,,0.0,,,0.000189,K01,,,1.583333,,0.0,,0.0,,,,,,0.0,,0.0,,,0.0,,,,,0.0,,,,0
4,146845,0,0.0,,,,,0.0,,,0.0,,,,,,0.0,,,,0.0,0,71906.476533,,,,0.0,,,0.0,,,,,,0.0,,,0,,0,,,0.0,1.0,1.0,0.0,288,,0,,0.0,0.0,,,,,0,,,,,0.166946,0.0,,,,0.0,0.556935,,0.177869,0.0,,0.0,0.62069,0.172414,0.0,0.61161,,,,,,0.200415,0.0,,,0.593648,102,,,1.583333,,0.0,,0.0,,0.61161,,0.62069,,0.0,,0.0,,,0.0,,,,,0.0,,,,0


In [5]:
len(df.columns)

116

In [6]:
constant_or_nan_cols = [col for col in df.columns if df[col].nunique(dropna=True) <= 1]
df.drop(columns=constant_or_nan_cols, inplace=True)
constant_or_nan_cols

['PRC_ACCEPTS_A_EMAIL_LINK',
 'PRC_ACCEPTS_A_POS',
 'PRC_ACCEPTS_A_TK',
 'PRC_ACCEPTS_A_AMOBILE',
 'PRC_ACCEPTS_TK',
 'PRC_ACCEPTS_A_MTP',
 'CNT_ACCEPTS_TK',
 'PRC_ACCEPTS_A_ATM',
 'PRC_ACCEPTS_MTP',
 'CNT_ACCEPTS_MTP']

In [7]:
len(df.columns)

106

In [8]:
df[df.select_dtypes(include=["object"]).columns] = df.select_dtypes(include=["object"]).apply(lambda c: c.str.lower())

In [9]:
df

Unnamed: 0,ID,CR_PROD_CNT_IL,AMOUNT_RUB_CLO_PRC,APP_REGISTR_RGN_CODE,TURNOVER_DYNAMIC_IL_1M,CNT_TRAN_AUT_TENDENCY1M,SUM_TRAN_AUT_TENDENCY1M,AMOUNT_RUB_SUP_PRC,SUM_TRAN_AUT_TENDENCY3M,CLNT_TRUST_RELATION,REST_DYNAMIC_FDEP_1M,CNT_TRAN_AUT_TENDENCY3M,APP_MARITAL_STATUS,REST_DYNAMIC_SAVE_3M,CR_PROD_CNT_VCU,REST_AVG_CUR,CNT_TRAN_MED_TENDENCY1M,APP_KIND_OF_PROP_HABITATION,CLNT_JOB_POSITION_TYPE,AMOUNT_RUB_NAS_PRC,CLNT_JOB_POSITION,APP_DRIVING_LICENSE,TRANS_COUNT_SUP_PRC,APP_EDUCATION,CNT_TRAN_CLO_TENDENCY1M,SUM_TRAN_MED_TENDENCY1M,TRANS_COUNT_NAS_PRC,APP_TRAVEL_PASS,CR_PROD_CNT_TOVR,APP_CAR,CR_PROD_CNT_PIL,SUM_TRAN_CLO_TENDENCY1M,APP_POSITION_TYPE,TURNOVER_CC,TRANS_COUNT_ATM_PRC,AMOUNT_RUB_ATM_PRC,TURNOVER_PAYM,AGE,CNT_TRAN_MED_TENDENCY3M,CR_PROD_CNT_CC,SUM_TRAN_MED_TENDENCY3M,REST_DYNAMIC_FDEP_3M,REST_DYNAMIC_IL_1M,APP_EMP_TYPE,SUM_TRAN_CLO_TENDENCY3M,LDEAL_TENOR_MAX,LDEAL_YQZ_CHRG,CR_PROD_CNT_CCFP,DEAL_YQZ_IR_MAX,LDEAL_YQZ_COM,DEAL_YQZ_IR_MIN,CNT_TRAN_CLO_TENDENCY3M,REST_DYNAMIC_CUR_1M,REST_AVG_PAYM,LDEAL_TENOR_MIN,LDEAL_AMT_MONTH,APP_COMP_TYPE,LDEAL_GRACE_DAYS_PCT_MED,REST_DYNAMIC_CUR_3M,CNT_TRAN_SUP_TENDENCY3M,TURNOVER_DYNAMIC_CUR_1M,REST_DYNAMIC_PAYM_3M,SUM_TRAN_SUP_TENDENCY3M,REST_DYNAMIC_IL_3M,CNT_TRAN_ATM_TENDENCY3M,CNT_TRAN_ATM_TENDENCY1M,TURNOVER_DYNAMIC_IL_3M,SUM_TRAN_ATM_TENDENCY3M,DEAL_GRACE_DAYS_ACC_S1X1,AVG_PCT_MONTH_TO_PCLOSE,DEAL_YWZ_IR_MIN,SUM_TRAN_SUP_TENDENCY1M,DEAL_YWZ_IR_MAX,SUM_TRAN_ATM_TENDENCY1M,REST_DYNAMIC_PAYM_1M,CNT_TRAN_SUP_TENDENCY1M,DEAL_GRACE_DAYS_ACC_AVG,TURNOVER_DYNAMIC_CUR_3M,PACK,MAX_PCLOSE_DATE,LDEAL_YQZ_PC,CLNT_SETUP_TENOR,DEAL_GRACE_DAYS_ACC_MAX,TURNOVER_DYNAMIC_PAYM_3M,LDEAL_DELINQ_PER_MAXYQZ,TURNOVER_DYNAMIC_PAYM_1M,CLNT_SALARY_VALUE,TRANS_AMOUNT_TENDENCY3M,MED_DEBT_PRC_YQZ,TRANS_CNT_TENDENCY3M,LDEAL_USED_AMT_AVG_YQZ,REST_DYNAMIC_CC_1M,LDEAL_USED_AMT_AVG_YWZ,TURNOVER_DYNAMIC_CC_1M,AVG_PCT_DEBT_TO_DEAL_AMT,LDEAL_ACT_DAYS_ACC_PCT_AVG,REST_DYNAMIC_CC_3M,MED_DEBT_PRC_YWZ,LDEAL_ACT_DAYS_PCT_TR3,LDEAL_ACT_DAYS_PCT_AAVG,LDEAL_DELINQ_PER_MAXYWZ,TURNOVER_DYNAMIC_CC_3M,LDEAL_ACT_DAYS_PCT_TR,LDEAL_ACT_DAYS_PCT_TR4,LDEAL_ACT_DAYS_PCT_CURR,TARGET
0,146841,0,0.000000,,0.0,,,0.000000,,,0.0,,,0.541683,0,156067.339767,,,,0.000000,начальник отдела,,0.000000,,,,0.000000,,0,,0,,,0.0,1.000000,1.000000,0.0,660,,0,,0.0,0.0,,,,,0,,,,,0.134651,0.0,,,,0.0,0.474134,,0.131910,0.0,,0.0,0.406780,0.101695,0.0,0.483032,,,,,,0.134634,0.0,,,0.442285,k01,,,1.593023,,0.0,,0.0,,0.483032,,0.406780,,0.0,,0.0,,,0.0,,,,,0.0,,,,0
1,146842,0,0.041033,,0.0,0.166667,0.186107,0.244678,0.670968,,0.0,0.666667,,0.000000,0,4278.845817,,,,0.000000,,,0.454545,,,,0.000000,,0,,0,,,0.0,0.109091,0.410691,0.0,552,,0,,0.0,0.0,,,,,0,,,,,0.239365,0.0,,,,0.0,0.384264,0.600000,0.101934,0.0,0.510712,0.0,0.333333,0.166667,0.0,0.200000,,,,0.309799,,0.133333,0.0,0.240000,,0.515876,102,,,1.587647,,0.0,,0.0,,0.394340,,0.545455,,0.0,,0.0,,,0.0,,,,,0.0,,,,0
2,146843,0,0.006915,,0.0,,,0.000000,,,0.0,,,0.000000,0,112837.062817,,,,0.000000,генеральный директор,,0.000000,,,,0.000000,,0,,0,,,0.0,0.810811,0.925140,0.0,420,,0,,0.0,0.0,,,,,0,,,,,0.084341,0.0,,,,0.0,0.336136,,0.121041,0.0,,0.0,0.366667,0.133333,0.0,0.431656,,,,,,0.063129,0.0,,,0.522833,102,,,1.587647,,0.0,,0.0,,0.399342,,0.297297,,0.0,,0.0,,,0.0,,,,,0.0,,,,0
3,146844,0,0.000000,,0.0,,,0.000000,,,0.0,,,0.005874,0,42902.902883,,,,0.000000,менеджер иао,,0.000000,,,,0.000000,,0,,0,,,0.0,1.000000,1.000000,0.0,372,,0,,0.0,0.0,,,,,0,,,,,0.005659,0.0,,,,0.0,0.019648,,0.000005,0.0,,0.0,,,0.0,,,,,,,,0.0,,,0.000189,k01,,,1.583333,,0.0,,0.0,,,,,,0.0,,0.0,,,0.0,,,,,0.0,,,,0
4,146845,0,0.000000,,0.0,,,0.000000,,,0.0,,,0.000000,0,71906.476533,,,,0.000000,,,0.000000,,,,0.000000,,0,,0,,,0.0,1.000000,1.000000,0.0,288,,0,,0.0,0.0,,,,,0,,,,,0.166946,0.0,,,,0.0,0.556935,,0.177869,0.0,,0.0,0.620690,0.172414,0.0,0.611610,,,,,,0.200415,0.0,,,0.593648,102,,,1.583333,,0.0,,0.0,,0.611610,,0.620690,,0.0,,0.0,,,0.0,,,,,0.0,,,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
355185,590822,0,0.000000,,0.0,0.142857,0.123579,0.000000,1.000000,,0.0,1.000000,,0.000000,0,9697.620867,,,,0.000000,,,0.000000,,,,0.000000,,2,,0,,,0.0,0.428571,0.786104,0.0,516,,0,,0.0,0.0,,,,,0,,,,,0.203823,0.0,,,,0.0,0.651715,,0.119700,0.0,,0.0,0.500000,,0.0,0.566265,0.0,,45.0,,45.0,,0.0,,0.0,0.572322,104,,,8.963872,0.0,0.0,,0.0,,0.659039,,0.785714,,0.0,1.0,0.0,,0.002731,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
355186,590823,0,0.000000,,0.0,,,0.000000,,,0.0,,,0.000000,0,428380.024733,,,,0.262714,,,0.000000,,,,0.184211,,0,,0,,,0.0,0.000000,0.000000,0.0,672,,0,,0.0,0.0,,,,,0,,,,,0.157256,0.0,,,,0.0,0.461491,,0.165218,0.0,,0.0,,,0.0,,,,,,,,0.0,,,0.392381,104,,,8.963872,,0.0,,0.0,,0.652612,,0.500000,,0.0,,0.0,,,0.0,,,,,0.0,,,,0
355187,590825,0,0.041298,,0.0,0.089286,0.065293,0.095187,0.281349,,0.0,0.392857,,0.000000,0,224884.436700,0.125,,,0.031179,,,0.211488,,,0.06917,0.039164,,0,,0,,,0.0,0.000000,0.000000,0.0,372,0.291667,0,0.172479,0.0,0.0,,0.97375,,,0,,,,0.941176,0.172569,0.0,,,,0.0,0.601178,0.530864,0.053596,0.0,0.297856,0.0,,,0.0,,,,,0.145648,,,0.0,0.185185,,0.447377,k01,,,8.966560,,0.0,,0.0,,0.448386,,0.459530,,0.0,,0.0,,,0.0,,,,,0.0,,,,0
355188,590826,0,0.000000,,0.0,,,0.000000,,,0.0,,,0.000000,0,12080.001833,,,,0.282573,руководитель,,0.000000,,,,0.200000,,1,,0,,,0.0,0.800000,0.717427,0.0,540,,0,,0.0,0.0,,,,,0,,,,,0.270810,0.0,,,,0.0,1.000000,,0.379720,0.0,,0.0,1.000000,0.250000,0.0,1.000000,,,,,,0.130435,0.0,,,1.000000,o01,,,8.966560,,0.0,,0.0,,1.000000,,1.000000,,0.0,,0.0,,,0.0,,,,,0.0,,,,0


In [10]:
translation_dict = {
    'близкий ро': 'close relative', 'друг': 'friend', 'отец': 'father',
    'сестра': 'sister', 'сын': 'son', 'мать': 'mother', 'муж': 'husband',
    'брат': 'brother', 'дальний ро': 'distant relative', 'дочь': 'daughter',
    'жена': 'wife', 'mother': 'mother', 'brother': 'brother', 'friend': 'friend',
    'sister': 'sister', 'other': 'other', 'relative': 'relative', 'daughter': 'daughter',
    'son': 'son', 'father': 'father'
}

df['CLNT_TRUST_RELATION'] = df['CLNT_TRUST_RELATION'].map(translation_dict)
df['CLNT_TRUST_RELATION'].unique()

array([nan, 'mother', 'brother', 'friend', 'sister', 'other', 'relative', 'daughter', 'son', 'father', 'close relative', 'husband', 'distant relative', 'wife'], dtype=object)
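One behavior of `Series.map` with a dictionary is worth keeping in mind here: any value that is not a key of the dictionary becomes `NaN`. The sketch below shows this on a hypothetical toy series, together with one possible way to keep untranslated values instead of losing them. The names `relation`, `mapping`, and `kept` are illustrative only.

```python
import pandas as pd

# Hypothetical toy values: one Russian label, one English label,
# one label missing from the dictionary, and one missing value.
relation = pd.Series(["друг", "mother", "unknown value", None])
mapping = {"друг": "friend", "mother": "mother"}

# Values without a dictionary entry are mapped to NaN.
mapped = relation.map(mapping)

# Optional alternative: fall back to the original value when no translation exists.
kept = relation.map(mapping).fillna(relation)

print(mapped.tolist())
print(kept.tolist())
```

This is why the dictionary above also lists the English labels mapped to themselves: without those entries, rows that were already in English would be turned into missing values.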

In [11]:
majority_class = df['TARGET'].value_counts().idxmax()
majority_class

0

In [12]:
# Keep only the rows of the majority class (TARGET == 0);
# the minority-class rows are selected separately below.
df_filtered_0 = df[df['TARGET'] == majority_class]
df_filtered_0['TARGET'].value_counts()

TARGET
0    326265
Name: count, dtype: int64

In [13]:
df_filtered_0.shape

(326265, 106)

In [14]:
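# Drop majority-class rows with 52 or more missing values: with 106 columns,
# a row is kept only if it has at least 106 - 52 + 1 = 55 non-null values.
threshold = 52  # Adjust this threshold based on your requirements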
df_filtered_0 = df_filtered_0.dropna(thresh=df_filtered_0.shape[1] - threshold + 1)
df_filtered_0

Unnamed: 0,ID,CR_PROD_CNT_IL,AMOUNT_RUB_CLO_PRC,APP_REGISTR_RGN_CODE,TURNOVER_DYNAMIC_IL_1M,CNT_TRAN_AUT_TENDENCY1M,SUM_TRAN_AUT_TENDENCY1M,AMOUNT_RUB_SUP_PRC,SUM_TRAN_AUT_TENDENCY3M,CLNT_TRUST_RELATION,REST_DYNAMIC_FDEP_1M,CNT_TRAN_AUT_TENDENCY3M,APP_MARITAL_STATUS,REST_DYNAMIC_SAVE_3M,CR_PROD_CNT_VCU,REST_AVG_CUR,CNT_TRAN_MED_TENDENCY1M,APP_KIND_OF_PROP_HABITATION,CLNT_JOB_POSITION_TYPE,AMOUNT_RUB_NAS_PRC,CLNT_JOB_POSITION,APP_DRIVING_LICENSE,TRANS_COUNT_SUP_PRC,APP_EDUCATION,CNT_TRAN_CLO_TENDENCY1M,SUM_TRAN_MED_TENDENCY1M,TRANS_COUNT_NAS_PRC,APP_TRAVEL_PASS,CR_PROD_CNT_TOVR,APP_CAR,CR_PROD_CNT_PIL,SUM_TRAN_CLO_TENDENCY1M,APP_POSITION_TYPE,TURNOVER_CC,TRANS_COUNT_ATM_PRC,AMOUNT_RUB_ATM_PRC,TURNOVER_PAYM,AGE,CNT_TRAN_MED_TENDENCY3M,CR_PROD_CNT_CC,SUM_TRAN_MED_TENDENCY3M,REST_DYNAMIC_FDEP_3M,REST_DYNAMIC_IL_1M,APP_EMP_TYPE,SUM_TRAN_CLO_TENDENCY3M,LDEAL_TENOR_MAX,LDEAL_YQZ_CHRG,CR_PROD_CNT_CCFP,DEAL_YQZ_IR_MAX,LDEAL_YQZ_COM,DEAL_YQZ_IR_MIN,CNT_TRAN_CLO_TENDENCY3M,REST_DYNAMIC_CUR_1M,REST_AVG_PAYM,LDEAL_TENOR_MIN,LDEAL_AMT_MONTH,APP_COMP_TYPE,LDEAL_GRACE_DAYS_PCT_MED,REST_DYNAMIC_CUR_3M,CNT_TRAN_SUP_TENDENCY3M,TURNOVER_DYNAMIC_CUR_1M,REST_DYNAMIC_PAYM_3M,SUM_TRAN_SUP_TENDENCY3M,REST_DYNAMIC_IL_3M,CNT_TRAN_ATM_TENDENCY3M,CNT_TRAN_ATM_TENDENCY1M,TURNOVER_DYNAMIC_IL_3M,SUM_TRAN_ATM_TENDENCY3M,DEAL_GRACE_DAYS_ACC_S1X1,AVG_PCT_MONTH_TO_PCLOSE,DEAL_YWZ_IR_MIN,SUM_TRAN_SUP_TENDENCY1M,DEAL_YWZ_IR_MAX,SUM_TRAN_ATM_TENDENCY1M,REST_DYNAMIC_PAYM_1M,CNT_TRAN_SUP_TENDENCY1M,DEAL_GRACE_DAYS_ACC_AVG,TURNOVER_DYNAMIC_CUR_3M,PACK,MAX_PCLOSE_DATE,LDEAL_YQZ_PC,CLNT_SETUP_TENOR,DEAL_GRACE_DAYS_ACC_MAX,TURNOVER_DYNAMIC_PAYM_3M,LDEAL_DELINQ_PER_MAXYQZ,TURNOVER_DYNAMIC_PAYM_1M,CLNT_SALARY_VALUE,TRANS_AMOUNT_TENDENCY3M,MED_DEBT_PRC_YQZ,TRANS_CNT_TENDENCY3M,LDEAL_USED_AMT_AVG_YQZ,REST_DYNAMIC_CC_1M,LDEAL_USED_AMT_AVG_YWZ,TURNOVER_DYNAMIC_CC_1M,AVG_PCT_DEBT_TO_DEAL_AMT,LDEAL_ACT_DAYS_ACC_PCT_AVG,REST_DYNAMIC_CC_3M,MED_DEBT_PRC_YWZ,LDEAL_ACT_DAYS_PCT_TR3,LDEAL_ACT_DAYS_PCT_AAVG,LDEAL_DELINQ_PER_MAXYWZ,TURNOVER_DYNAMIC_CC_3M,LDEAL_ACT_DAYS_PCT_TR,LDEAL_ACT_DAYS_PCT_TR4,LDEAL_ACT_DAYS_PCT_CURR,TARGET
1,146842,0,0.041033,,0.0,0.166667,0.186107,0.244678,0.670968,,0.0,0.666667,,0.0,0,4278.845817,,,,0.000000,,,0.454545,,,,0.000000,,0,,0,,,0.0,0.109091,0.410691,0.000000,552,,0,,0.0,0.0,,,,,0,,,,,0.239365,0.000000,,,,0.0,0.384264,0.600000,0.101934,0.000000,0.510712,0.0,0.333333,0.166667,0.0,0.200000,,,,0.309799,,0.133333,0.000000,0.240000,,0.515876,102,,,1.587647,,0.000000,,0.000000,,0.394340,,0.545455,,0.0,,0.0,,,0.0,,,,,0.0,,,,0
5,146846,0,0.077010,,0.0,0.500000,0.098136,0.050708,1.000000,,0.0,1.000000,,0.0,0,0.000000,,,,0.008501,,,0.207547,,0.250000,,0.113208,,0,,0,0.345196,,0.0,0.226415,0.767778,248361.558333,384,,0,,0.0,0.0,,0.649373,,,0,,,,0.500000,0.000000,105180.487383,,,,0.0,0.000000,0.727273,0.000000,0.657494,0.627798,0.0,0.500000,0.166667,0.0,0.854130,,,,0.110611,,0.018453,0.023562,0.181818,,0.000000,105,,,1.593023,,0.700485,,0.033022,,0.780661,,0.698113,,0.0,,0.0,,,0.0,,,,,0.0,,,,0
8,146850,0,0.017322,,0.0,0.333333,0.357143,0.002935,0.357143,,0.0,0.333333,,0.0,0,116791.222983,,,,0.000000,генеральный директор,,0.020408,,,,0.000000,,2,,0,,,0.0,0.448980,0.751444,0.000000,492,,0,,0.0,0.0,,,,,0,,,,,0.707973,0.000000,,,,0.0,0.771877,,0.911780,0.000000,,0.0,0.272727,0.045455,0.0,0.353191,0.0,,20.00,,45.00,0.015648,0.000000,,0.0,0.937303,o01,,,1.590335,0.0,0.000000,,0.000000,,0.297416,,0.326531,,0.0,1.0,0.0,,0.001976,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
9,146851,0,0.067036,,0.0,,,0.426053,,,0.0,,,0.0,0,51312.385183,0.300,,,0.000000,архитектор,,0.318681,,0.200000,0.301623,0.000000,,0,,0,0.147805,,0.0,0.010989,0.093474,0.000000,348,0.500000,0,0.480429,0.0,0.0,,0.512229,,,0,,,,0.400000,0.324353,0.000000,,,,0.0,0.797624,0.344828,0.260106,0.000000,0.703480,0.0,1.000000,,0.0,1.000000,,,,0.578966,,,0.000000,0.241379,,0.657603,o01,,,1.590335,,0.000000,,0.000000,,0.613750,,0.373626,,0.0,,0.0,,,0.0,,,,,0.0,,,,0
10,146853,0,0.248236,,0.0,,,0.231500,,,0.0,,,0.0,0,172001.723267,0.120,,,0.032666,экономист,,0.496703,,0.082192,0.178356,0.085714,,0,,0,0.138603,,0.0,0.065934,0.384842,0.000000,528,0.680000,0,0.684099,0.0,0.0,,0.403204,,,0,,,,0.342466,0.131789,0.000000,,,,0.0,0.405121,0.526549,0.159406,0.000000,0.473186,0.0,0.566667,0.166667,0.0,0.445000,,,,0.124565,,0.113750,0.000000,0.172566,,0.437892,k01,,,9.750000,,0.000000,,0.000000,,0.459285,,0.509890,,0.0,,0.0,,,0.0,,,,,0.0,,,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
355177,590812,0,0.023161,,0.0,,,0.050035,,,0.0,,,0.0,0,329981.887067,,,,0.000000,врач,,0.275862,,1.000000,,0.000000,,0,,0,1.000000,,0.0,0.172414,0.274337,0.000000,576,,0,,0.0,0.0,,1.000000,,,0,,,,1.000000,0.378239,0.000000,,,,0.0,0.604052,1.000000,0.128148,0.000000,1.000000,0.0,1.000000,1.000000,0.0,1.000000,,,,1.000000,,1.000000,0.000000,1.000000,,0.462652,102,,,4.638603,,0.000000,,0.000000,,0.876346,,0.965517,,0.0,,0.0,,,0.0,,,,,0.0,,,,0
355178,590813,0,0.229349,34.0,0.0,,,0.221886,0.100000,sister,0.0,0.500000,m,0.0,0,0.000000,,jo,manager,0.000000,директор,y,0.375000,h,,,0.000000,n,0,y,0,,top_manager,0.0,0.000000,0.000000,30076.208333,444,,1,,0.0,0.0,private,0.228153,,,0,,,,0.375000,0.000000,15370.567550,,,private,0.0,0.000000,0.533333,0.000000,0.299388,0.428659,0.0,,,0.0,,,,28.99,,28.99,,0.075941,,,0.000000,105,,,4.614410,,0.317472,,0.000327,,0.360251,,0.475000,,0.0,0.0,0.0,,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
355183,590819,0,0.053964,76.0,0.0,,,0.052262,,brother,0.0,,d,0.0,0,74671.609133,,other,top_manager,0.005183,инженер-геодезист,n,0.105263,h,0.600000,,0.157895,n,0,y,0,0.745280,top_manager,0.0,0.105263,0.717432,0.000000,420,,0,,0.0,0.0,private,0.745280,,,0,,,,0.600000,0.597644,0.000000,,,private,0.0,0.906678,0.500000,0.157822,0.000000,0.482577,0.0,0.500000,0.500000,0.0,0.118343,,,,0.482577,,0.118343,0.000000,0.500000,,0.805097,k01,,,8.912797,,0.000000,,0.000000,,0.321382,,0.631579,,0.0,,0.0,,,0.0,,,,,0.0,,,,0
355185,590822,0,0.000000,,0.0,0.142857,0.123579,0.000000,1.000000,,0.0,1.000000,,0.0,0,9697.620867,,,,0.000000,,,0.000000,,,,0.000000,,2,,0,,,0.0,0.428571,0.786104,0.000000,516,,0,,0.0,0.0,,,,,0,,,,,0.203823,0.000000,,,,0.0,0.651715,,0.119700,0.000000,,0.0,0.500000,,0.0,0.566265,0.0,,45.00,,45.00,,0.000000,,0.0,0.572322,104,,,8.963872,0.0,0.000000,,0.000000,,0.659039,,0.785714,,0.0,1.0,0.0,,0.002731,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
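
The `thresh` argument of `DataFrame.dropna` is the minimum number of non-null values a row needs in order to be kept, so `shape[1] - threshold + 1` drops every row that has `threshold` or more missing values. A minimal sketch on a hypothetical four-column frame (not the bank data) makes the arithmetic concrete:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: rows have 0, 1, and 3 missing values respectively.
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 2.0, np.nan],
    "c": [3.0, 3.0, np.nan],
    "d": [4.0, 4.0, 4.0],
})

threshold = 3  # rows with 3 or more missing values are dropped

# thresh = 4 - 3 + 1 = 2: a row needs at least 2 non-null values to survive.
kept = toy.dropna(thresh=toy.shape[1] - threshold + 1)

print(kept.index.tolist())
```

Applying the filter only to the majority class, as done above, removes sparse majority rows while leaving every minority-class row in place.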


In [15]:
df_filtered_1 = df[df['TARGET'] != majority_class]
df_filtered_1

Unnamed: 0,ID,CR_PROD_CNT_IL,AMOUNT_RUB_CLO_PRC,APP_REGISTR_RGN_CODE,TURNOVER_DYNAMIC_IL_1M,CNT_TRAN_AUT_TENDENCY1M,SUM_TRAN_AUT_TENDENCY1M,AMOUNT_RUB_SUP_PRC,SUM_TRAN_AUT_TENDENCY3M,CLNT_TRUST_RELATION,REST_DYNAMIC_FDEP_1M,CNT_TRAN_AUT_TENDENCY3M,APP_MARITAL_STATUS,REST_DYNAMIC_SAVE_3M,CR_PROD_CNT_VCU,REST_AVG_CUR,CNT_TRAN_MED_TENDENCY1M,APP_KIND_OF_PROP_HABITATION,CLNT_JOB_POSITION_TYPE,AMOUNT_RUB_NAS_PRC,CLNT_JOB_POSITION,APP_DRIVING_LICENSE,TRANS_COUNT_SUP_PRC,APP_EDUCATION,CNT_TRAN_CLO_TENDENCY1M,SUM_TRAN_MED_TENDENCY1M,TRANS_COUNT_NAS_PRC,APP_TRAVEL_PASS,CR_PROD_CNT_TOVR,APP_CAR,CR_PROD_CNT_PIL,SUM_TRAN_CLO_TENDENCY1M,APP_POSITION_TYPE,TURNOVER_CC,TRANS_COUNT_ATM_PRC,AMOUNT_RUB_ATM_PRC,TURNOVER_PAYM,AGE,CNT_TRAN_MED_TENDENCY3M,CR_PROD_CNT_CC,SUM_TRAN_MED_TENDENCY3M,REST_DYNAMIC_FDEP_3M,REST_DYNAMIC_IL_1M,APP_EMP_TYPE,SUM_TRAN_CLO_TENDENCY3M,LDEAL_TENOR_MAX,LDEAL_YQZ_CHRG,CR_PROD_CNT_CCFP,DEAL_YQZ_IR_MAX,LDEAL_YQZ_COM,DEAL_YQZ_IR_MIN,CNT_TRAN_CLO_TENDENCY3M,REST_DYNAMIC_CUR_1M,REST_AVG_PAYM,LDEAL_TENOR_MIN,LDEAL_AMT_MONTH,APP_COMP_TYPE,LDEAL_GRACE_DAYS_PCT_MED,REST_DYNAMIC_CUR_3M,CNT_TRAN_SUP_TENDENCY3M,TURNOVER_DYNAMIC_CUR_1M,REST_DYNAMIC_PAYM_3M,SUM_TRAN_SUP_TENDENCY3M,REST_DYNAMIC_IL_3M,CNT_TRAN_ATM_TENDENCY3M,CNT_TRAN_ATM_TENDENCY1M,TURNOVER_DYNAMIC_IL_3M,SUM_TRAN_ATM_TENDENCY3M,DEAL_GRACE_DAYS_ACC_S1X1,AVG_PCT_MONTH_TO_PCLOSE,DEAL_YWZ_IR_MIN,SUM_TRAN_SUP_TENDENCY1M,DEAL_YWZ_IR_MAX,SUM_TRAN_ATM_TENDENCY1M,REST_DYNAMIC_PAYM_1M,CNT_TRAN_SUP_TENDENCY1M,DEAL_GRACE_DAYS_ACC_AVG,TURNOVER_DYNAMIC_CUR_3M,PACK,MAX_PCLOSE_DATE,LDEAL_YQZ_PC,CLNT_SETUP_TENOR,DEAL_GRACE_DAYS_ACC_MAX,TURNOVER_DYNAMIC_PAYM_3M,LDEAL_DELINQ_PER_MAXYQZ,TURNOVER_DYNAMIC_PAYM_1M,CLNT_SALARY_VALUE,TRANS_AMOUNT_TENDENCY3M,MED_DEBT_PRC_YQZ,TRANS_CNT_TENDENCY3M,LDEAL_USED_AMT_AVG_YQZ,REST_DYNAMIC_CC_1M,LDEAL_USED_AMT_AVG_YWZ,TURNOVER_DYNAMIC_CC_1M,AVG_PCT_DEBT_TO_DEAL_AMT,LDEAL_ACT_DAYS_ACC_PCT_AVG,REST_DYNAMIC_CC_3M,MED_DEBT_PRC_YWZ,LDEAL_ACT_DAYS_PCT_TR3,LDEAL_ACT_DAYS_PCT_AAVG,LDEAL_DELINQ_PER_MAXYWZ,TURNOVER_DYNAMIC_CC_3M,LDEAL_ACT_DAYS_PCT_TR,LDEAL_ACT_DAYS_PCT_TR4,LDEAL_ACT_DAYS_PCT_CURR,TARGET
26,146871,0,0.020267,,0.0,1.0,1.000000,0.111424,1.000000,mother,0.0,1.000000,,0.994683,0,1049.544317,,,,0.000000,,,0.379310,,,,0.000000,,1,,0,,,0.0,0.000000,0.000000,0.000000,624,,0,,0.0,0.0,,,,,0,,,,,0.548043,0.000000,,,,0.0,0.597867,0.363636,0.375444,0.000000,0.218070,0.0,,,0.0,,0.0,,20.00,0.218070,20.0,,0.000000,0.363636,0.0,0.787479,o01,,,2.101087,0.0,0.000000,,0.000000,,0.655993,,0.551724,,0.0,1.0,0.0,,0.479087,0.0,1.0,0.615412,0.615412,0.5,0.0,0.233333,0.233333,0.233333,1
27,146872,0,0.000000,,0.0,,,0.016017,,friend,0.0,,,0.000000,0,372.021867,,,,0.000000,водитель-экспедитор,,0.333333,,,,0.000000,,2,,0,,,0.0,0.666667,0.983983,0.000000,276,,0,,0.0,0.0,,,,,0,,,,,0.508451,0.000000,,,,0.0,0.929992,1.000000,0.366414,0.000000,1.000000,0.0,0.900000,0.500000,0.0,0.975369,0.0,,45.00,1.000000,45.0,0.369458,0.000000,1.000000,0.0,0.951984,102,,,2.079582,0.0,0.000000,,0.000000,,0.975764,,0.933333,,0.0,1.0,0.0,,0.073967,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.000000,1
30,146875,0,0.000000,,0.0,,,0.000000,,,0.0,,,0.000000,0,50349.755383,,,,0.000000,,,0.000000,,,,0.000000,,1,,0,,,0.0,1.000000,1.000000,503.430000,444,,0,,0.0,0.0,,,,,0,,,,,0.259187,0.000000,,,,0.0,0.978182,,0.389747,0.000000,,0.0,1.000000,0.277778,0.0,1.000000,0.0,,20.00,,20.0,0.394616,0.000000,,0.0,0.998822,k01,,,1.528507,0.0,0.630223,,0.609817,,1.000000,,1.000000,,0.0,1.0,0.0,,0.223881,0.0,1.0,0.430108,0.430108,0.0,0.0,0.000000,0.000000,0.000000,1
59,146914,0,0.000000,,0.0,,,0.000000,,,0.0,,,0.000000,0,0.000000,,,,0.000000,,,0.000000,,,,0.000000,,0,,0,,,0.0,1.000000,1.000000,285864.466233,576,,0,,0.0,0.0,,,,,0,,,,,0.000000,146290.689400,,,,0.0,0.000000,,0.000000,0.546122,,0.0,0.333333,0.083333,0.0,0.483093,,,,,,0.215639,0.158224,,,0.000000,105,,,2.262378,,0.400246,,0.120361,,0.483093,,0.333333,,0.0,,0.0,,,0.0,,,,,0.0,,,,1
63,146918,0,0.136533,,0.0,,,0.142614,0.500549,,0.0,0.333333,,0.000000,0,0.000000,0.190476,,,0.024863,матрос,,0.233945,,0.090909,0.039879,0.013761,,0,,0,0.03504,,0.0,0.279817,0.381626,90712.186667,264,0.619048,0,0.769065,0.0,0.0,,0.086418,,,0,,,,0.181818,0.000000,33337.143867,,,,0.0,0.000000,0.686275,0.000000,0.474820,0.698464,0.0,0.442623,0.098361,0.0,0.349362,,,,0.085932,,0.045710,0.041084,0.235294,,0.000000,105,,,2.262378,,0.397866,,0.025198,,0.381276,,0.513761,,0.0,,0.0,,,0.0,,,,,0.0,,,,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
355153,590782,0,0.000000,,0.0,,,0.000000,,,0.0,,,0.000000,0,306.366667,,,,0.000000,,,0.000000,,,,0.000000,,0,,0,,,0.0,0.000000,0.000000,0.000000,408,,0,,0.0,0.0,,,,,0,,,,,0.817757,0.000000,,,,0.0,0.876246,,0.379053,0.000000,,0.0,,,0.0,,,,,,,,0.000000,,,0.989278,103,,,4.611721,,0.000000,,0.000000,,1.000000,,1.000000,,0.0,,0.0,,,0.0,,,,,0.0,,,,1
355157,590787,1,0.010296,,0.0,0.5,0.639828,0.026800,1.000000,,0.0,1.000000,m,0.000000,0,0.000000,,so,,0.000000,инженер-технолог,,0.187500,s,,,0.000000,,0,,0,,specialist,0.0,0.625000,0.951664,88660.155000,468,1.000000,0,1.000000,0.0,0.0,private,,,,0,,,,,0.000000,8336.016083,,,private,0.0,0.000000,0.777778,0.000000,0.445256,0.519482,0.0,0.566667,0.233333,0.0,0.547468,,,,0.161498,,0.191456,0.141512,0.555556,,0.000000,105,,,8.904732,,0.499913,,0.095497,,0.544260,,0.604167,,0.0,,0.0,,,0.0,,,,,0.0,,,,1
355163,590795,0,0.000000,,0.0,,,0.047988,,brother,0.0,,,0.000000,0,23918.950167,,,,0.012092,менеджер по фандрейзингу,,0.294118,,,,0.176471,,1,,0,,,0.0,0.235294,0.709536,0.000000,444,,1,,0.0,0.0,,,,,0,,,,,0.022978,0.000000,,,,0.0,0.072704,0.200000,0.012033,0.000000,0.211372,0.0,,,0.0,,0.0,,32.99,,45.0,,0.000000,,0.0,0.032374,107,,,3.969248,0.0,0.000000,,0.000000,,0.014125,,0.117647,,0.0,1.0,0.0,,0.009542,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.000000,1
355166,590799,1,0.151001,23.0,0.0,,,0.444095,,,0.0,,t,0.000000,0,1091.564267,1.000000,jo,,0.000000,,n,0.153846,uh,1.000000,1.000000,0.000000,n,0,n,1,1.00000,specialist,0.0,0.000000,0.000000,223.453333,264,1.000000,1,1.000000,0.0,0.0,private,1.000000,,,0,,,,1.000000,0.996395,35.893767,,,private,0.0,0.996395,1.000000,0.551204,0.000000,1.000000,0.0,,,0.0,,,,,1.000000,,,0.000000,1.000000,,0.551204,k01,,,3.963872,,0.000000,,0.000000,,1.000000,,1.000000,,0.0,,0.0,,,0.0,,,,,0.0,,,,1


In [16]:
# Recombine all minority-class rows with the filtered majority-class rows.
X = pd.concat([df_filtered_1, df_filtered_0], axis=0).reset_index(drop=True)
X

Unnamed: 0,ID,CR_PROD_CNT_IL,AMOUNT_RUB_CLO_PRC,APP_REGISTR_RGN_CODE,TURNOVER_DYNAMIC_IL_1M,CNT_TRAN_AUT_TENDENCY1M,SUM_TRAN_AUT_TENDENCY1M,AMOUNT_RUB_SUP_PRC,SUM_TRAN_AUT_TENDENCY3M,CLNT_TRUST_RELATION,REST_DYNAMIC_FDEP_1M,CNT_TRAN_AUT_TENDENCY3M,APP_MARITAL_STATUS,REST_DYNAMIC_SAVE_3M,CR_PROD_CNT_VCU,REST_AVG_CUR,CNT_TRAN_MED_TENDENCY1M,APP_KIND_OF_PROP_HABITATION,CLNT_JOB_POSITION_TYPE,AMOUNT_RUB_NAS_PRC,CLNT_JOB_POSITION,APP_DRIVING_LICENSE,TRANS_COUNT_SUP_PRC,APP_EDUCATION,CNT_TRAN_CLO_TENDENCY1M,SUM_TRAN_MED_TENDENCY1M,TRANS_COUNT_NAS_PRC,APP_TRAVEL_PASS,CR_PROD_CNT_TOVR,APP_CAR,CR_PROD_CNT_PIL,SUM_TRAN_CLO_TENDENCY1M,APP_POSITION_TYPE,TURNOVER_CC,TRANS_COUNT_ATM_PRC,AMOUNT_RUB_ATM_PRC,TURNOVER_PAYM,AGE,CNT_TRAN_MED_TENDENCY3M,CR_PROD_CNT_CC,SUM_TRAN_MED_TENDENCY3M,REST_DYNAMIC_FDEP_3M,REST_DYNAMIC_IL_1M,APP_EMP_TYPE,SUM_TRAN_CLO_TENDENCY3M,LDEAL_TENOR_MAX,LDEAL_YQZ_CHRG,CR_PROD_CNT_CCFP,DEAL_YQZ_IR_MAX,LDEAL_YQZ_COM,DEAL_YQZ_IR_MIN,CNT_TRAN_CLO_TENDENCY3M,REST_DYNAMIC_CUR_1M,REST_AVG_PAYM,LDEAL_TENOR_MIN,LDEAL_AMT_MONTH,APP_COMP_TYPE,LDEAL_GRACE_DAYS_PCT_MED,REST_DYNAMIC_CUR_3M,CNT_TRAN_SUP_TENDENCY3M,TURNOVER_DYNAMIC_CUR_1M,REST_DYNAMIC_PAYM_3M,SUM_TRAN_SUP_TENDENCY3M,REST_DYNAMIC_IL_3M,CNT_TRAN_ATM_TENDENCY3M,CNT_TRAN_ATM_TENDENCY1M,TURNOVER_DYNAMIC_IL_3M,SUM_TRAN_ATM_TENDENCY3M,DEAL_GRACE_DAYS_ACC_S1X1,AVG_PCT_MONTH_TO_PCLOSE,DEAL_YWZ_IR_MIN,SUM_TRAN_SUP_TENDENCY1M,DEAL_YWZ_IR_MAX,SUM_TRAN_ATM_TENDENCY1M,REST_DYNAMIC_PAYM_1M,CNT_TRAN_SUP_TENDENCY1M,DEAL_GRACE_DAYS_ACC_AVG,TURNOVER_DYNAMIC_CUR_3M,PACK,MAX_PCLOSE_DATE,LDEAL_YQZ_PC,CLNT_SETUP_TENOR,DEAL_GRACE_DAYS_ACC_MAX,TURNOVER_DYNAMIC_PAYM_3M,LDEAL_DELINQ_PER_MAXYQZ,TURNOVER_DYNAMIC_PAYM_1M,CLNT_SALARY_VALUE,TRANS_AMOUNT_TENDENCY3M,MED_DEBT_PRC_YQZ,TRANS_CNT_TENDENCY3M,LDEAL_USED_AMT_AVG_YQZ,REST_DYNAMIC_CC_1M,LDEAL_USED_AMT_AVG_YWZ,TURNOVER_DYNAMIC_CC_1M,AVG_PCT_DEBT_TO_DEAL_AMT,LDEAL_ACT_DAYS_ACC_PCT_AVG,REST_DYNAMIC_CC_3M,MED_DEBT_PRC_YWZ,LDEAL_ACT_DAYS_PCT_TR3,LDEAL_ACT_DAYS_PCT_AAVG,LDEAL_DELINQ_PER_MAXYWZ,TURNOVER_DYNAMIC_CC_3M,LDEAL_ACT_DAYS_PCT_TR,LDEAL_ACT_DAYS_PCT_TR4,LDEAL_ACT_DAYS_PCT_CURR,TARGET
0,146871,0,0.020267,,0.0,1.000000,1.000000,0.111424,1.000000,mother,0.0,1.000000,,0.994683,0,1049.544317,,,,0.000000,,,0.379310,,,,0.000000,,1,,0,,,0.0,0.000000,0.000000,0.000000,624,,0,,0.0,0.0,,,,,0,,,,,0.548043,0.000000,,,,0.0,0.597867,0.363636,0.375444,0.000000,0.218070,0.0,,,0.0,,0.0,,20.00,0.218070,20.00,,0.000000,0.363636,0.0,0.787479,o01,,,2.101087,0.0,0.000000,,0.000000,,0.655993,,0.551724,,0.0,1.0,0.0,,0.479087,0.0,1.0,0.615412,0.615412,0.5,0.0,0.233333,0.233333,0.233333,1
1,146872,0,0.000000,,0.0,,,0.016017,,friend,0.0,,,0.000000,0,372.021867,,,,0.000000,водитель-экспедитор,,0.333333,,,,0.000000,,2,,0,,,0.0,0.666667,0.983983,0.000000,276,,0,,0.0,0.0,,,,,0,,,,,0.508451,0.000000,,,,0.0,0.929992,1.000000,0.366414,0.000000,1.000000,0.0,0.900000,0.500000,0.0,0.975369,0.0,,45.00,1.000000,45.00,0.369458,0.000000,1.000000,0.0,0.951984,102,,,2.079582,0.0,0.000000,,0.000000,,0.975764,,0.933333,,0.0,1.0,0.0,,0.073967,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.000000,1
2,146875,0,0.000000,,0.0,,,0.000000,,,0.0,,,0.000000,0,50349.755383,,,,0.000000,,,0.000000,,,,0.000000,,1,,0,,,0.0,1.000000,1.000000,503.430000,444,,0,,0.0,0.0,,,,,0,,,,,0.259187,0.000000,,,,0.0,0.978182,,0.389747,0.000000,,0.0,1.000000,0.277778,0.0,1.000000,0.0,,20.00,,20.00,0.394616,0.000000,,0.0,0.998822,k01,,,1.528507,0.0,0.630223,,0.609817,,1.000000,,1.000000,,0.0,1.0,0.0,,0.223881,0.0,1.0,0.430108,0.430108,0.0,0.0,0.000000,0.000000,0.000000,1
3,146914,0,0.000000,,0.0,,,0.000000,,,0.0,,,0.000000,0,0.000000,,,,0.000000,,,0.000000,,,,0.000000,,0,,0,,,0.0,1.000000,1.000000,285864.466233,576,,0,,0.0,0.0,,,,,0,,,,,0.000000,146290.689400,,,,0.0,0.000000,,0.000000,0.546122,,0.0,0.333333,0.083333,0.0,0.483093,,,,,,0.215639,0.158224,,,0.000000,105,,,2.262378,,0.400246,,0.120361,,0.483093,,0.333333,,0.0,,0.0,,,0.0,,,,,0.0,,,,1
4,146918,0,0.136533,,0.0,,,0.142614,0.500549,,0.0,0.333333,,0.000000,0,0.000000,0.190476,,,0.024863,матрос,,0.233945,,0.090909,0.039879,0.013761,,0,,0,0.03504,,0.0,0.279817,0.381626,90712.186667,264,0.619048,0,0.769065,0.0,0.0,,0.086418,,,0,,,,0.181818,0.000000,33337.143867,,,,0.0,0.000000,0.686275,0.000000,0.474820,0.698464,0.0,0.442623,0.098361,0.0,0.349362,,,,0.085932,,0.045710,0.041084,0.235294,,0.000000,105,,,2.262378,,0.397866,,0.025198,,0.381276,,0.513761,,0.0,,0.0,,,0.0,,,,,0.0,,,,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
210737,590812,0,0.023161,,0.0,,,0.050035,,,0.0,,,0.000000,0,329981.887067,,,,0.000000,врач,,0.275862,,1.000000,,0.000000,,0,,0,1.00000,,0.0,0.172414,0.274337,0.000000,576,,0,,0.0,0.0,,1.000000,,,0,,,,1.000000,0.378239,0.000000,,,,0.0,0.604052,1.000000,0.128148,0.000000,1.000000,0.0,1.000000,1.000000,0.0,1.000000,,,,1.000000,,1.000000,0.000000,1.000000,,0.462652,102,,,4.638603,,0.000000,,0.000000,,0.876346,,0.965517,,0.0,,0.0,,,0.0,,,,,0.0,,,,0
210738,590813,0,0.229349,34.0,0.0,,,0.221886,0.100000,sister,0.0,0.500000,m,0.000000,0,0.000000,,jo,manager,0.000000,директор,y,0.375000,h,,,0.000000,n,0,y,0,,top_manager,0.0,0.000000,0.000000,30076.208333,444,,1,,0.0,0.0,private,0.228153,,,0,,,,0.375000,0.000000,15370.567550,,,private,0.0,0.000000,0.533333,0.000000,0.299388,0.428659,0.0,,,0.0,,,,28.99,,28.99,,0.075941,,,0.000000,105,,,4.614410,,0.317472,,0.000327,,0.360251,,0.475000,,0.0,0.0,0.0,,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0
210739,590819,0,0.053964,76.0,0.0,,,0.052262,,brother,0.0,,d,0.000000,0,74671.609133,,other,top_manager,0.005183,инженер-геодезист,n,0.105263,h,0.600000,,0.157895,n,0,y,0,0.74528,top_manager,0.0,0.105263,0.717432,0.000000,420,,0,,0.0,0.0,private,0.745280,,,0,,,,0.600000,0.597644,0.000000,,,private,0.0,0.906678,0.500000,0.157822,0.000000,0.482577,0.0,0.500000,0.500000,0.0,0.118343,,,,0.482577,,0.118343,0.000000,0.500000,,0.805097,k01,,,8.912797,,0.000000,,0.000000,,0.321382,,0.631579,,0.0,,0.0,,,0.0,,,,,0.0,,,,0
210740,590822,0,0.000000,,0.0,0.142857,0.123579,0.000000,1.000000,,0.0,1.000000,,0.000000,0,9697.620867,,,,0.000000,,,0.000000,,,,0.000000,,2,,0,,,0.0,0.428571,0.786104,0.000000,516,,0,,0.0,0.0,,,,,0,,,,,0.203823,0.000000,,,,0.0,0.651715,,0.119700,0.000000,,0.0,0.500000,,0.0,0.566265,0.0,,45.00,,45.00,,0.000000,,0.0,0.572322,104,,,8.963872,0.0,0.000000,,0.000000,,0.659039,,0.785714,,0.0,1.0,0.0,,0.002731,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0


In [17]:
X.pop('ID')
X

Unnamed: 0,CR_PROD_CNT_IL,AMOUNT_RUB_CLO_PRC,APP_REGISTR_RGN_CODE,TURNOVER_DYNAMIC_IL_1M,CNT_TRAN_AUT_TENDENCY1M,SUM_TRAN_AUT_TENDENCY1M,AMOUNT_RUB_SUP_PRC,SUM_TRAN_AUT_TENDENCY3M,CLNT_TRUST_RELATION,REST_DYNAMIC_FDEP_1M,CNT_TRAN_AUT_TENDENCY3M,APP_MARITAL_STATUS,REST_DYNAMIC_SAVE_3M,CR_PROD_CNT_VCU,REST_AVG_CUR,CNT_TRAN_MED_TENDENCY1M,APP_KIND_OF_PROP_HABITATION,CLNT_JOB_POSITION_TYPE,AMOUNT_RUB_NAS_PRC,CLNT_JOB_POSITION,APP_DRIVING_LICENSE,TRANS_COUNT_SUP_PRC,APP_EDUCATION,CNT_TRAN_CLO_TENDENCY1M,SUM_TRAN_MED_TENDENCY1M,TRANS_COUNT_NAS_PRC,APP_TRAVEL_PASS,CR_PROD_CNT_TOVR,APP_CAR,CR_PROD_CNT_PIL,SUM_TRAN_CLO_TENDENCY1M,APP_POSITION_TYPE,TURNOVER_CC,TRANS_COUNT_ATM_PRC,AMOUNT_RUB_ATM_PRC,TURNOVER_PAYM,AGE,CNT_TRAN_MED_TENDENCY3M,CR_PROD_CNT_CC,SUM_TRAN_MED_TENDENCY3M,REST_DYNAMIC_FDEP_3M,REST_DYNAMIC_IL_1M,APP_EMP_TYPE,SUM_TRAN_CLO_TENDENCY3M,LDEAL_TENOR_MAX,LDEAL_YQZ_CHRG,CR_PROD_CNT_CCFP,DEAL_YQZ_IR_MAX,LDEAL_YQZ_COM,DEAL_YQZ_IR_MIN,CNT_TRAN_CLO_TENDENCY3M,REST_DYNAMIC_CUR_1M,REST_AVG_PAYM,LDEAL_TENOR_MIN,LDEAL_AMT_MONTH,APP_COMP_TYPE,LDEAL_GRACE_DAYS_PCT_MED,REST_DYNAMIC_CUR_3M,CNT_TRAN_SUP_TENDENCY3M,TURNOVER_DYNAMIC_CUR_1M,REST_DYNAMIC_PAYM_3M,SUM_TRAN_SUP_TENDENCY3M,REST_DYNAMIC_IL_3M,CNT_TRAN_ATM_TENDENCY3M,CNT_TRAN_ATM_TENDENCY1M,TURNOVER_DYNAMIC_IL_3M,SUM_TRAN_ATM_TENDENCY3M,DEAL_GRACE_DAYS_ACC_S1X1,AVG_PCT_MONTH_TO_PCLOSE,DEAL_YWZ_IR_MIN,SUM_TRAN_SUP_TENDENCY1M,DEAL_YWZ_IR_MAX,SUM_TRAN_ATM_TENDENCY1M,REST_DYNAMIC_PAYM_1M,CNT_TRAN_SUP_TENDENCY1M,DEAL_GRACE_DAYS_ACC_AVG,TURNOVER_DYNAMIC_CUR_3M,PACK,MAX_PCLOSE_DATE,LDEAL_YQZ_PC,CLNT_SETUP_TENOR,DEAL_GRACE_DAYS_ACC_MAX,TURNOVER_DYNAMIC_PAYM_3M,LDEAL_DELINQ_PER_MAXYQZ,TURNOVER_DYNAMIC_PAYM_1M,CLNT_SALARY_VALUE,TRANS_AMOUNT_TENDENCY3M,MED_DEBT_PRC_YQZ,TRANS_CNT_TENDENCY3M,LDEAL_USED_AMT_AVG_YQZ,REST_DYNAMIC_CC_1M,LDEAL_USED_AMT_AVG_YWZ,TURNOVER_DYNAMIC_CC_1M,AVG_PCT_DEBT_TO_DEAL_AMT,LDEAL_ACT_DAYS_ACC_PCT_AVG,REST_DYNAMIC_CC_3M,MED_DEBT_PRC_YWZ,LDEAL_ACT_DAYS_PCT_TR3,LDEAL_ACT_DAYS_PCT_AAVG,LDEAL_DELINQ_PER_MAX
YWZ,TURNOVER_DYNAMIC_CC_3M,LDEAL_ACT_DAYS_PCT_TR,LDEAL_ACT_DAYS_PCT_TR4,LDEAL_ACT_DAYS_PCT_CURR,TARGET
0,0,0.020267,,0.0,1.000000,1.000000,0.111424,1.000000,mother,0.0,1.000000,,0.994683,0,1049.544317,,,,0.000000,,,0.379310,,,,0.000000,,1,,0,,,0.0,0.000000,0.000000,0.000000,624,,0,,0.0,0.0,,,,,0,,,,,0.548043,0.000000,,,,0.0,0.597867,0.363636,0.375444,0.000000,0.218070,0.0,,,0.0,,0.0,,20.00,0.218070,20.00,,0.000000,0.363636,0.0,0.787479,o01,,,2.101087,0.0,0.000000,,0.000000,,0.655993,,0.551724,,0.0,1.0,0.0,,0.479087,0.0,1.0,0.615412,0.615412,0.5,0.0,0.233333,0.233333,0.233333,1
1,0,0.000000,,0.0,,,0.016017,,friend,0.0,,,0.000000,0,372.021867,,,,0.000000,водитель-экспедитор,,0.333333,,,,0.000000,,2,,0,,,0.0,0.666667,0.983983,0.000000,276,,0,,0.0,0.0,,,,,0,,,,,0.508451,0.000000,,,,0.0,0.929992,1.000000,0.366414,0.000000,1.000000,0.0,0.900000,0.500000,0.0,0.975369,0.0,,45.00,1.000000,45.00,0.369458,0.000000,1.000000,0.0,0.951984,102,,,2.079582,0.0,0.000000,,0.000000,,0.975764,,0.933333,,0.0,1.0,0.0,,0.073967,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.000000,1
2,0,0.000000,,0.0,,,0.000000,,,0.0,,,0.000000,0,50349.755383,,,,0.000000,,,0.000000,,,,0.000000,,1,,0,,,0.0,1.000000,1.000000,503.430000,444,,0,,0.0,0.0,,,,,0,,,,,0.259187,0.000000,,,,0.0,0.978182,,0.389747,0.000000,,0.0,1.000000,0.277778,0.0,1.000000,0.0,,20.00,,20.00,0.394616,0.000000,,0.0,0.998822,k01,,,1.528507,0.0,0.630223,,0.609817,,1.000000,,1.000000,,0.0,1.0,0.0,,0.223881,0.0,1.0,0.430108,0.430108,0.0,0.0,0.000000,0.000000,0.000000,1
3,0,0.000000,,0.0,,,0.000000,,,0.0,,,0.000000,0,0.000000,,,,0.000000,,,0.000000,,,,0.000000,,0,,0,,,0.0,1.000000,1.000000,285864.466233,576,,0,,0.0,0.0,,,,,0,,,,,0.000000,146290.689400,,,,0.0,0.000000,,0.000000,0.546122,,0.0,0.333333,0.083333,0.0,0.483093,,,,,,0.215639,0.158224,,,0.000000,105,,,2.262378,,0.400246,,0.120361,,0.483093,,0.333333,,0.0,,0.0,,,0.0,,,,,0.0,,,,1
4,0,0.136533,,0.0,,,0.142614,0.500549,,0.0,0.333333,,0.000000,0,0.000000,0.190476,,,0.024863,матрос,,0.233945,,0.090909,0.039879,0.013761,,0,,0,0.03504,,0.0,0.279817,0.381626,90712.186667,264,0.619048,0,0.769065,0.0,0.0,,0.086418,,,0,,,,0.181818,0.000000,33337.143867,,,,0.0,0.000000,0.686275,0.000000,0.474820,0.698464,0.0,0.442623,0.098361,0.0,0.349362,,,,0.085932,,0.045710,0.041084,0.235294,,0.000000,105,,,2.262378,,0.397866,,0.025198,,0.381276,,0.513761,,0.0,,0.0,,,0.0,,,,,0.0,,,,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
210737,0,0.023161,,0.0,,,0.050035,,,0.0,,,0.000000,0,329981.887067,,,,0.000000,врач,,0.275862,,1.000000,,0.000000,,0,,0,1.00000,,0.0,0.172414,0.274337,0.000000,576,,0,,0.0,0.0,,1.000000,,,0,,,,1.000000,0.378239,0.000000,,,,0.0,0.604052,1.000000,0.128148,0.000000,1.000000,0.0,1.000000,1.000000,0.0,1.000000,,,,1.000000,,1.000000,0.000000,1.000000,,0.462652,102,,,4.638603,,0.000000,,0.000000,,0.876346,,0.965517,,0.0,,0.0,,,0.0,,,,,0.0,,,,0
210738,0,0.229349,34.0,0.0,,,0.221886,0.100000,sister,0.0,0.500000,m,0.000000,0,0.000000,,jo,manager,0.000000,директор,y,0.375000,h,,,0.000000,n,0,y,0,,top_manager,0.0,0.000000,0.000000,30076.208333,444,,1,,0.0,0.0,private,0.228153,,,0,,,,0.375000,0.000000,15370.567550,,,private,0.0,0.000000,0.533333,0.000000,0.299388,0.428659,0.0,,,0.0,,,,28.99,,28.99,,0.075941,,,0.000000,105,,,4.614410,,0.317472,,0.000327,,0.360251,,0.475000,,0.0,0.0,0.0,,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0
210739,0,0.053964,76.0,0.0,,,0.052262,,brother,0.0,,d,0.000000,0,74671.609133,,other,top_manager,0.005183,инженер-геодезист,n,0.105263,h,0.600000,,0.157895,n,0,y,0,0.74528,top_manager,0.0,0.105263,0.717432,0.000000,420,,0,,0.0,0.0,private,0.745280,,,0,,,,0.600000,0.597644,0.000000,,,private,0.0,0.906678,0.500000,0.157822,0.000000,0.482577,0.0,0.500000,0.500000,0.0,0.118343,,,,0.482577,,0.118343,0.000000,0.500000,,0.805097,k01,,,8.912797,,0.000000,,0.000000,,0.321382,,0.631579,,0.0,,0.0,,,0.0,,,,,0.0,,,,0
210740,0,0.000000,,0.0,0.142857,0.123579,0.000000,1.000000,,0.0,1.000000,,0.000000,0,9697.620867,,,,0.000000,,,0.000000,,,,0.000000,,2,,0,,,0.0,0.428571,0.786104,0.000000,516,,0,,0.0,0.0,,,,,0,,,,,0.203823,0.000000,,,,0.0,0.651715,,0.119700,0.000000,,0.0,0.500000,,0.0,0.566265,0.0,,45.00,,45.00,,0.000000,,0.0,0.572322,104,,,8.963872,0.0,0.000000,,0.000000,,0.659039,,0.785714,,0.0,1.0,0.0,,0.002731,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0


In [18]:
procs = [Categorify, FillMissing, Normalize]
y_names = 'TARGET'
cont_names, cat_names = cont_cat_split(X, 1, dep_var=y_names)
y_block = CategoryBlock()
splits = RandomSplitter(valid_pct=0.2)(range_of(X))

In [19]:
cont_len, cat_len = len(cont_names), len(cat_names)
cont_len, cat_len

(91, 13)

In [20]:
to = TabularPandas(X, procs=procs, cat_names=cat_names, cont_names=cont_names,
                   y_names=y_names, y_block=y_block, splits=splits)
to.show()

Unnamed: 0,CLNT_TRUST_RELATION,APP_MARITAL_STATUS,APP_KIND_OF_PROP_HABITATION,CLNT_JOB_POSITION_TYPE,CLNT_JOB_POSITION,APP_DRIVING_LICENSE,APP_EDUCATION,APP_TRAVEL_PASS,APP_CAR,APP_POSITION_TYPE,APP_EMP_TYPE,APP_COMP_TYPE,PACK,AMOUNT_RUB_CLO_PRC_na,APP_REGISTR_RGN_CODE_na,CNT_TRAN_AUT_TENDENCY1M_na,SUM_TRAN_AUT_TENDENCY1M_na,AMOUNT_RUB_SUP_PRC_na,SUM_TRAN_AUT_TENDENCY3M_na,CNT_TRAN_AUT_TENDENCY3M_na,CNT_TRAN_MED_TENDENCY1M_na,AMOUNT_RUB_NAS_PRC_na,TRANS_COUNT_SUP_PRC_na,CNT_TRAN_CLO_TENDENCY1M_na,SUM_TRAN_MED_TENDENCY1M_na,TRANS_COUNT_NAS_PRC_na,SUM_TRAN_CLO_TENDENCY1M_na,TRANS_COUNT_ATM_PRC_na,AMOUNT_RUB_ATM_PRC_na,CNT_TRAN_MED_TENDENCY3M_na,SUM_TRAN_MED_TENDENCY3M_na,SUM_TRAN_CLO_TENDENCY3M_na,LDEAL_TENOR_MAX_na,LDEAL_YQZ_CHRG_na,DEAL_YQZ_IR_MAX_na,LDEAL_YQZ_COM_na,DEAL_YQZ_IR_MIN_na,CNT_TRAN_CLO_TENDENCY3M_na,LDEAL_TENOR_MIN_na,LDEAL_AMT_MONTH_na,CNT_TRAN_SUP_TENDENCY3M_na,SUM_TRAN_SUP_TENDENCY3M_na,CNT_TRAN_ATM_TENDENCY3M_na,CNT_TRAN_ATM_TENDENCY1M_na,SUM_TRAN_ATM_TENDENCY3M_na,DEAL_GRACE_DAYS_ACC_S1X1_na,AVG_PCT_MONTH_TO_PCLOSE_na,DEAL_YWZ_IR_MIN_na,SUM_TRAN_SUP_TENDENCY1M_na,DEAL_YWZ_IR_MAX_na,SUM_TRAN_ATM_TENDENCY1M_na,CNT_TRAN_SUP_TENDENCY1M_na,DEAL_GRACE_DAYS_ACC_AVG_na,MAX_PCLOSE_DATE_na,LDEAL_YQZ_PC_na,DEAL_GRACE_DAYS_ACC_MAX_na,LDEAL_DELINQ_PER_MAXYQZ_na,CLNT_SALARY_VALUE_na,TRANS_AMOUNT_TENDENCY3M_na,MED_DEBT_PRC_YQZ_na,TRANS_CNT_TENDENCY3M_na,LDEAL_USED_AMT_AVG_YQZ_na,LDEAL_USED_AMT_AVG_YWZ_na,AVG_PCT_DEBT_TO_DEAL_AMT_na,LDEAL_ACT_DAYS_ACC_PCT_AVG_na,MED_DEBT_PRC_YWZ_na,LDEAL_ACT_DAYS_PCT_TR3_na,LDEAL_ACT_DAYS_PCT_AAVG_na,LDEAL_DELINQ_PER_MAXYWZ_na,LDEAL_ACT_DAYS_PCT_TR_na,LDEAL_ACT_DAYS_PCT_TR4_na,LDEAL_ACT_DAYS_PCT_CURR_na,CR_PROD_CNT_IL,AMOUNT_RUB_CLO_PRC,APP_REGISTR_RGN_CODE,TURNOVER_DYNAMIC_IL_1M,CNT_TRAN_AUT_TENDENCY1M,SUM_TRAN_AUT_TENDENCY1M,AMOUNT_RUB_SUP_PRC,SUM_TRAN_AUT_TENDENCY3M,REST_DYNAMIC_FDEP_1M,CNT_TRAN_AUT_TENDENCY3M,REST_DYNAMIC_SAVE_3M,CR_PROD_CNT_VCU,REST_AVG_CUR,CNT_TRAN_MED_TENDENCY1M,AMOUNT_RUB_NAS_PRC,TRANS_COUNT_SUP_PRC,CNT_TR
AN_CLO_TENDENCY1M,SUM_TRAN_MED_TENDENCY1M,TRANS_COUNT_NAS_PRC,CR_PROD_CNT_TOVR,CR_PROD_CNT_PIL,SUM_TRAN_CLO_TENDENCY1M,TURNOVER_CC,TRANS_COUNT_ATM_PRC,AMOUNT_RUB_ATM_PRC,TURNOVER_PAYM,AGE,CNT_TRAN_MED_TENDENCY3M,CR_PROD_CNT_CC,SUM_TRAN_MED_TENDENCY3M,REST_DYNAMIC_FDEP_3M,REST_DYNAMIC_IL_1M,SUM_TRAN_CLO_TENDENCY3M,LDEAL_TENOR_MAX,LDEAL_YQZ_CHRG,CR_PROD_CNT_CCFP,DEAL_YQZ_IR_MAX,LDEAL_YQZ_COM,DEAL_YQZ_IR_MIN,CNT_TRAN_CLO_TENDENCY3M,REST_DYNAMIC_CUR_1M,REST_AVG_PAYM,LDEAL_TENOR_MIN,LDEAL_AMT_MONTH,LDEAL_GRACE_DAYS_PCT_MED,REST_DYNAMIC_CUR_3M,CNT_TRAN_SUP_TENDENCY3M,TURNOVER_DYNAMIC_CUR_1M,REST_DYNAMIC_PAYM_3M,SUM_TRAN_SUP_TENDENCY3M,REST_DYNAMIC_IL_3M,CNT_TRAN_ATM_TENDENCY3M,CNT_TRAN_ATM_TENDENCY1M,TURNOVER_DYNAMIC_IL_3M,SUM_TRAN_ATM_TENDENCY3M,DEAL_GRACE_DAYS_ACC_S1X1,AVG_PCT_MONTH_TO_PCLOSE,DEAL_YWZ_IR_MIN,SUM_TRAN_SUP_TENDENCY1M,DEAL_YWZ_IR_MAX,SUM_TRAN_ATM_TENDENCY1M,REST_DYNAMIC_PAYM_1M,CNT_TRAN_SUP_TENDENCY1M,DEAL_GRACE_DAYS_ACC_AVG,TURNOVER_DYNAMIC_CUR_3M,MAX_PCLOSE_DATE,LDEAL_YQZ_PC,CLNT_SETUP_TENOR,DEAL_GRACE_DAYS_ACC_MAX,TURNOVER_DYNAMIC_PAYM_3M,LDEAL_DELINQ_PER_MAXYQZ,TURNOVER_DYNAMIC_PAYM_1M,CLNT_SALARY_VALUE,TRANS_AMOUNT_TENDENCY3M,MED_DEBT_PRC_YQZ,TRANS_CNT_TENDENCY3M,LDEAL_USED_AMT_AVG_YQZ,REST_DYNAMIC_CC_1M,LDEAL_USED_AMT_AVG_YWZ,TURNOVER_DYNAMIC_CC_1M,AVG_PCT_DEBT_TO_DEAL_AMT,LDEAL_ACT_DAYS_ACC_PCT_AVG,REST_DYNAMIC_CC_3M,MED_DEBT_PRC_YWZ,LDEAL_ACT_DAYS_PCT_TR3,LDEAL_ACT_DAYS_PCT_AAVG,LDEAL_DELINQ_PER_MAXYWZ,TURNOVER_DYNAMIC_CC_3M,LDEAL_ACT_DAYS_PCT_TR,LDEAL_ACT_DAYS_PCT_TR4,LDEAL_ACT_DAYS_PCT_CURR,TARGET
82551,friend,m,rent,self_empl,директор,y,h,n,y,top_manager,private,private,o01,False,False,True,True,False,True,True,True,False,False,True,True,False,True,False,False,True,True,True,True,True,True,True,True,True,True,True,False,False,False,False,False,False,True,False,False,False,False,False,False,True,True,False,True,True,False,True,False,True,False,True,False,False,False,False,False,False,False,False,0.0,0.009109,34.0,0.0,0.285714,0.277421,0.05307,0.705025,0.0,0.666667,0.0,0.0,63435.9375,0.333333,0.008808,0.1941748,0.333333,0.302326,0.07767,1.0,1.0,0.350164,0.0,0.116505,0.428206,0.0,492.0,0.666667,1.0,0.774205,-4.336809e-19,0.0,0.7698,19.0,0.001517,0.0,23.0,0.01726,22.969999,0.666667,0.139563,0.0,12.0,4620.0,0.0,0.261827,0.05,0.183704,0.0,0.155116,0.0,0.25,0.166667,0.0,0.080128,0.0,-0.971537,45.0,0.155116,45.0,0.064103,-3.469447e-18,0.05,0.0,0.275605,-10.451612,0.017041,4.232808,0.0,0.0,0.0,0.0,18866.109375,0.248277,1.0,0.15534,0.359047,0.0,1.0,0.0,0.033644,0.003367,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
105485,#na#,#na#,#na#,#na#,индивидуальный предприниматель,#na#,#na#,#na#,#na#,#na#,#na#,#na#,o01,False,True,False,False,False,False,False,True,False,False,False,True,False,False,False,False,False,False,False,True,True,True,True,True,False,True,True,False,False,False,False,False,True,True,True,False,True,False,False,True,True,True,True,True,True,False,True,False,True,True,True,True,True,True,True,True,True,True,True,0.0,0.070411,54.0,0.0,0.130435,0.115649,0.071236,0.282494,0.0,0.304348,0.0,0.0,49325.410156,0.333333,0.000775,0.3333333,0.266667,0.302326,0.003436,0.0,0.0,0.101889,0.0,0.178694,0.745954,0.0,336.0,0.25,-1.387779e-17,0.205548,-4.336809e-19,0.0,0.824661,19.0,0.001517,0.0,23.0,0.01726,22.969999,0.633333,0.2102,0.0,12.0,4620.0,0.0,0.509496,0.402062,0.05037,0.0,0.314553,0.0,0.596154,0.211538,0.0,0.481511,0.0,-0.971537,45.0,0.175732,45.0,0.090107,-3.469447e-18,0.206186,0.0,0.475829,-10.451612,0.017041,1.125273,0.0,0.0,0.0,0.0,18866.109375,0.472476,1.0,0.446735,0.359047,0.0,1.0,0.0,0.033644,0.009105,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
129551,daughter,m,other,specialist,#na#,n,h,n,y,specialist,private,private,105,False,False,True,True,False,True,True,True,False,False,True,True,False,True,False,False,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,False,True,False,True,False,True,True,False,True,True,False,True,True,False,True,False,True,False,True,False,False,False,False,False,False,False,False,1.0,0.0,77.0,0.0,0.285714,0.277421,0.0,0.705025,0.0,0.666667,0.0,0.0,0.0,0.333333,0.750769,-2.775558e-17,0.333333,0.302326,0.928571,2.0,0.0,0.350164,0.0,0.0,0.0,66126.296875,588.0,0.666667,-1.387779e-17,0.774205,-4.336809e-19,0.0,0.7698,19.0,0.001517,0.0,23.0,0.01726,22.969999,0.666667,0.0,22411.818359,12.0,4620.0,0.0,0.0,0.6,0.0,0.520659,0.591799,0.0,0.578947,0.222222,0.0,0.586652,0.0,-0.971537,45.0,0.205196,45.0,0.193103,0.1232733,0.230769,0.0,0.0,-10.451612,0.017041,6.773123,0.0,0.573985,0.0,0.055654,18866.109375,0.558398,1.0,0.714286,0.359047,0.0,1.0,0.0,0.033644,0.000242,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
92628,#na#,#na#,#na#,#na#,специалист по регистрации лс,#na#,#na#,#na#,#na#,#na#,#na#,#na#,o01,False,True,True,True,False,True,True,False,False,False,False,False,False,False,False,False,False,False,False,True,True,True,True,True,False,True,True,False,False,False,False,False,True,True,True,False,True,False,False,True,True,True,True,True,True,False,True,False,True,True,True,True,True,True,True,True,True,True,True,0.0,0.056931,54.0,0.0,0.285714,0.277421,0.046881,0.705025,0.0,0.666667,0.0,0.0,0.0,0.25,0.0,0.2460317,0.153846,0.131015,0.0,0.0,0.0,0.133901,0.0,0.238095,0.767877,120300.203125,444.0,0.625,-1.387779e-17,0.425898,-4.336809e-19,0.0,0.528438,19.0,0.001517,0.0,23.0,0.01726,22.969999,0.461538,0.0,4951.445801,12.0,4620.0,0.0,0.0,0.354839,0.0,0.458568,0.347282,0.0,0.366667,0.1,0.0,0.601523,0.0,-0.971537,45.0,0.171304,45.0,0.335751,0.1869904,0.096774,0.0,0.0,-10.451612,0.017041,5.899467,0.0,0.561704,0.0,0.281371,18866.109375,0.562409,1.0,0.5,0.359047,0.0,1.0,0.0,0.033644,0.009105,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
47943,#na#,#na#,#na#,#na#,инженер,#na#,#na#,#na#,#na#,#na#,#na#,#na#,301,False,True,True,True,False,True,True,True,False,False,True,True,False,True,False,False,True,True,True,True,True,True,True,True,True,True,True,True,True,False,True,False,False,True,False,True,False,True,True,False,True,True,False,True,True,False,True,False,True,False,True,False,False,False,False,False,False,False,False,0.0,0.0,54.0,0.0,0.285714,0.277421,0.005611,0.705025,0.0,0.666667,0.0,0.0,8735.512695,0.333333,0.0,0.1,0.333333,0.302326,0.0,1.0,0.0,0.350164,0.0,0.6,0.905883,0.0,408.0,0.666667,-1.387779e-17,0.774205,-4.336809e-19,0.0,0.7698,19.0,0.001517,0.0,23.0,0.01726,22.969999,0.666667,0.157411,0.0,12.0,4620.0,0.0,0.510326,0.6,0.077531,0.0,0.591799,0.0,0.166667,0.222222,0.0,0.060793,0.0,-0.971537,45.0,0.205196,45.0,0.193103,-3.469447e-18,0.230769,0.0,0.355282,-10.451612,0.017041,7.689797,0.0,0.0,0.0,0.0,18866.109375,0.055072,1.0,0.1,0.359047,0.0,1.0,0.0,0.033644,0.000405,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
53839,#na#,#na#,#na#,#na#,заместитель директора,#na#,#na#,#na#,#na#,#na#,#na#,#na#,k01,False,True,True,True,False,True,True,False,False,False,False,False,False,False,False,False,False,False,False,True,True,True,True,True,False,True,True,False,False,False,True,False,True,True,True,False,True,True,False,True,True,True,True,True,True,False,True,False,True,True,True,True,True,True,True,True,True,True,True,0.0,0.095906,54.0,0.0,0.285714,0.277421,0.253978,0.705025,0.0,0.666667,0.0,0.0,621261.0,0.181818,0.000806,0.452514,0.4,0.117337,0.005587,0.0,0.0,0.196612,0.0,0.005587,0.008058,0.0,708.0,0.454545,-1.387779e-17,0.377287,-4.336809e-19,0.0,0.196612,19.0,0.001517,0.0,23.0,0.01726,22.969999,0.4,0.152396,0.0,12.0,4620.0,0.0,0.484726,0.567901,0.132157,0.0,0.666586,0.0,1.0,0.222222,0.0,1.0,0.0,-0.971537,45.0,0.263152,45.0,0.193103,-3.469447e-18,0.246914,0.0,0.377823,-10.451612,0.017041,5.904851,0.0,0.0,0.0,0.0,18866.109375,0.371823,1.0,0.541899,0.359047,0.0,1.0,0.0,0.033644,0.009105,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
84654,#na#,#na#,#na#,#na#,директор,#na#,#na#,#na#,#na#,#na#,#na#,#na#,102,False,True,True,True,False,True,True,False,False,False,True,False,False,True,False,False,False,False,False,True,True,True,True,True,False,True,True,False,False,True,True,True,False,True,False,False,False,True,False,False,True,True,False,True,True,False,True,False,True,False,True,False,False,False,False,False,False,False,False,0.0,0.559075,54.0,0.0,0.285714,0.277421,0.224985,0.705025,0.0,0.666667,0.0,0.0,23968.753906,0.111111,0.0,0.4942529,0.333333,0.102468,0.0,1.0,0.0,0.350164,0.0,0.0,0.0,0.0,492.0,0.444444,-1.387779e-17,0.350446,-4.336809e-19,0.0,0.755961,19.0,0.001517,0.0,23.0,0.01726,22.969999,0.4,0.087055,0.0,12.0,4620.0,0.0,0.243562,0.27907,0.099874,0.0,0.274773,0.0,0.578947,0.222222,0.0,0.586652,0.0,-0.971537,45.0,0.177917,45.0,0.193103,-3.469447e-18,0.162791,0.0,0.440385,-10.451612,0.017041,8.375281,0.0,0.0,0.0,0.0,18866.109375,0.646715,1.0,0.367816,0.359047,0.0,1.0,0.0,0.033644,0.049346,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
197826,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,102,False,True,True,True,False,True,True,True,False,False,True,True,False,True,False,False,True,True,True,True,True,True,True,True,True,True,True,True,True,False,False,False,False,True,False,True,False,False,True,False,True,True,False,True,True,False,True,False,True,False,True,False,False,False,False,False,False,False,False,0.0,0.0,54.0,0.0,0.285714,0.277421,0.054621,0.705025,0.0,0.666667,0.0,0.0,4732.006348,0.333333,0.29371,0.05,0.333333,0.302326,0.25,1.0,0.0,0.350164,0.0,0.15,0.566221,0.0,336.0,0.666667,-1.387779e-17,0.774205,-4.336809e-19,0.0,0.7698,19.0,0.001517,0.0,23.0,0.01726,22.969999,0.666667,0.16224,0.0,12.0,4620.0,0.0,0.390946,0.6,0.157868,0.0,0.591799,0.0,0.666667,0.333333,0.0,0.933333,0.0,-0.971537,20.0,0.205196,20.0,0.4,-3.469447e-18,0.230769,0.0,0.664835,-10.451612,0.017041,4.0,0.0,0.0,0.0,0.0,18866.109375,0.543293,1.0,0.35,0.359047,0.0,1.0,0.0,0.033644,0.021053,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
175244,#na#,#na#,#na#,#na#,руковод-ль отдела продаж,#na#,#na#,#na#,#na#,#na#,#na#,#na#,103,False,True,False,False,False,False,False,True,False,False,False,True,False,False,False,False,True,True,False,True,True,True,True,True,False,True,True,False,False,True,True,True,True,True,True,False,True,True,False,True,True,True,True,True,True,False,True,False,True,True,True,True,True,True,True,True,True,True,True,0.0,0.785917,54.0,0.0,0.25,0.262594,0.009701,0.574509,0.0,0.5,0.0,0.0,186492.765625,0.333333,0.001915,0.3970588,0.1875,0.302326,0.014706,0.0,0.0,0.224783,0.0,0.036765,0.114233,0.0,300.0,0.666667,-1.387779e-17,0.774205,-4.336809e-19,0.0,0.336682,19.0,0.001517,0.0,23.0,0.01726,22.969999,0.34375,0.157536,0.0,12.0,4620.0,0.0,0.49596,0.888889,0.212569,0.0,0.719892,0.0,0.578947,0.222222,0.0,0.586652,0.0,-0.971537,45.0,0.30901,45.0,0.193103,-3.469447e-18,0.333333,0.0,0.364532,-10.451612,0.017041,3.660109,0.0,0.0,0.0,0.0,18866.109375,0.3577,1.0,0.647059,0.359047,0.0,1.0,0.0,0.033644,0.009105,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
94048,other,m,jo,specialist,генеральный директор,y,h,n,y,manager,private,private,k01,False,False,True,True,False,True,True,True,False,False,True,True,False,True,False,False,True,True,False,True,True,True,True,True,False,True,True,False,False,True,True,True,False,True,False,True,False,True,True,False,True,True,False,True,True,False,True,False,True,False,True,False,False,False,False,False,False,False,False,1.0,0.012737,77.0,0.0,0.285714,0.277421,0.007683,0.705025,0.0,0.666667,0.0,1.0,12392.05957,0.333333,0.0,0.05882353,0.333333,0.302326,0.0,1.0,0.0,0.350164,0.0,0.294118,0.91672,0.0,396.0,0.666667,-1.387779e-17,0.774205,-4.336809e-19,0.0,1.0,19.0,0.001517,0.0,23.0,0.01726,22.969999,1.0,0.056946,0.0,12.0,4620.0,0.0,0.203718,1.0,0.061321,0.0,1.0,0.0,0.578947,0.222222,0.0,0.586652,0.0,-0.971537,19.9,0.205196,45.0,0.193103,-3.469447e-18,0.230769,0.0,0.117856,-10.451612,0.017041,7.649467,0.0,0.0,0.0,0.0,18866.109375,0.081648,1.0,0.647059,0.359047,0.0,1.0,0.0,0.033644,0.038363,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0


In [21]:
len(to.train),len(to.valid)

(168594, 42148)

In [22]:
y_train_distribution = to.train.ys.value_counts()
y_test_distribution = to.valid.ys.value_counts()
y_train_distribution, y_test_distribution

(TARGET
 0         145408
 1          23186
 Name: count, dtype: int64,
 TARGET
 0         36409
 1          5739
 Name: count, dtype: int64)

In [23]:
from imblearn.over_sampling import SMOTE

# Handling class imbalance with SMOTE
smote = SMOTE(random_state=42)
xs, ys = smote.fit_resample(to.train.xs, to.train.ys)

In [24]:
ys.value_counts()

TARGET
0         145408
1         145408
Name: count, dtype: int64

## 3. Model Development

### 3.1 - Baseline - naive classifier

- Use a naive classifier that predicts the most common class.
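
As a rough illustration of what "predict the most common class" means for accuracy, the sketch below uses only the standard library. The function name `majority_baseline_accuracy` and the toy labels are hypothetical and are not part of the notebook's pipeline.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)


# Hypothetical toy labels with an imbalance similar in spirit to TARGET.
toy_labels = [0] * 9 + [1] * 1
label, acc = majority_baseline_accuracy(toy_labels)
print(label, acc)
```

On imbalanced data this accuracy is simply the share of the majority class, which is why a high baseline accuracy can coexist with an AUC of 0.5.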

In [25]:
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, roc_auc_score


xs, valid_xs, ys, valid_ys = train_test_split(df.iloc[:,:-1], df['TARGET'], test_size=0.2)

dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(xs, ys)
dummy_predictions = dummy_clf.predict(valid_xs)

# Evaluate the baseline model
baseline_accuracy = accuracy_score(valid_ys, dummy_predictions)
baseline_auc = roc_auc_score(valid_ys, dummy_predictions)

baseline_accuracy, baseline_auc

(0.91940933021763, 0.5)

In [26]:
update_summary_table(
    approach='Baseline (Naive Classifier)',
    lib='sklearn',
    accuracy=baseline_accuracy,
    auc=baseline_auc
)

Unnamed: 0,Approach,Library Used,Algorithm,Hyperparameters,Accuracy,AUC Score
0,Baseline (Naive Classifier),sklearn,Naive Classifier,,0.919409,0.5
1,Random Forest,Scikit-learn,Random Forest,Optimized via Grid Search,To be filled,To be filled
2,Scikit-learn (MLPClassifier),Scikit-learn,MLPClassifier,Default / Custom Settings,To be filled,To be filled
3,Keras (TensorFlow),TensorFlow,Keras Neural Network,Custom Settings,To be filled,To be filled
4,TensorFlow,TensorFlow,TensorFlow Neural Network,Custom Settings,To be filled,To be filled
5,NumPy,NumPy,Custom Neural Network,Custom Implementation,To be filled,To be filled


1. **ROC Curve:** ROC stands for Receiver Operating Characteristic. It's a graph that shows the performance of a classification model at various classification thresholds. The x-axis represents the false positive rate (FPR), and the y-axis represents the true positive rate (TPR).

2. **AUC:** AUC is the area under the ROC curve. In simple terms, it's a measure of how well a model can distinguish between classes. The AUC value ranges from 0 to 1, where a higher value indicates better performance.

   - AUC = 1: Perfect classifier (100% true positive rate and 0% false positive rate).
   - AUC = 0.5: Classifier performs no better than random chance (the diagonal line).
   - AUC < 0.5: Worse than random chance.

So an AUC of 0.8, for example, means that a randomly chosen positive instance receives a higher score from the model than a randomly chosen negative instance about 80% of the time.
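
One way to make the AUC concrete is its ranking interpretation: it equals the fraction of (positive, negative) pairs in which the positive instance gets the higher score, with ties counted as half. The sketch below is a minimal pure-Python version of that idea; `pairwise_auc` is a hypothetical helper written for illustration, not the routine scikit-learn uses internally.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
model_scores = [0.1, 0.4, 0.35, 0.8]   # hypothetical predicted probabilities
constant_scores = [0, 0, 0, 0]          # a classifier that always predicts class 0

auc_model = pairwise_auc(y_true, model_scores)
auc_constant = pairwise_auc(y_true, constant_scores)
print(auc_model, auc_constant)
```

A constant prediction ties on every pair, so it scores 0.5. This is also why AUC is more informative when computed from predicted probabilities than from hard 0/1 labels: hard labels leave many pairs tied.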

### 3.2 Random Forest:
- Implement a Random Forest classifier and use grid search for hyperparameter tuning.

Decision tree ensembles, as the name suggests, rely on decision trees. So let's start there! A decision tree asks a series of binary (that is, yes or no) questions about the data. After each question, the data at that part of the tree is split between a "yes" and a "no" branch, as shown in the figure below. After one or more questions, either a prediction can be made on the basis of all previous answers or another question is required.
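
To show how a single yes/no question can be chosen, here is a simplified sketch that searches one numeric feature for the threshold giving the lowest weighted Gini impurity. The helper names (`gini`, `best_split`) and the toy data are assumptions for illustration; a real tree implementation such as scikit-learn's considers all features and uses midpoints between observed values as thresholds.

```python
def gini(labels):
    """Gini impurity of a list of binary (0/1) labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2


def best_split(values, labels):
    """Find the threshold t for the question 'value <= t?' with the lowest weighted impurity."""
    best_threshold, best_impurity = None, float("inf")
    for threshold in sorted(set(values)):
        yes = [y for v, y in zip(values, labels) if v <= threshold]
        no = [y for v, y in zip(values, labels) if v > threshold]
        if not yes or not no:
            continue  # a split must send data down both branches
        weighted = (len(yes) * gini(yes) + len(no) * gini(no)) / len(labels)
        if weighted < best_impurity:
            best_threshold, best_impurity = threshold, weighted
    return best_threshold, best_impurity


# Hypothetical toy feature and binary target.
ages = [22, 25, 31, 40, 52, 60]
target = [1, 1, 1, 0, 0, 0]
threshold, impurity = best_split(ages, target)
print(threshold, impurity)
```

A full tree repeats this search inside each branch until a stopping rule (for example `max_leaf_nodes`) is reached.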


In [None]:
from sklearn.tree import DecisionTreeClassifier
from dtreeviz.trees import *
from IPython.display import Image, display_svg, SVG

from sklearn.metrics import accuracy_score, roc_auc_score

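# sklearn metrics expect (y_true, y_pred); AUC is scored on the predicted probability of class 1
def m_auc(m, xs, y): return roc_auc_score(y, m.predict_proba(xs)[:, 1])
def m_accuracy(m, xs, y): return accuracy_score(y, m.predict(xs))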

m = DecisionTreeClassifier(max_leaf_nodes=4)
m.fit(to.train.xs, to.train.ys);

In [None]:
draw_tree(m, to.train.xs, size=10, leaves_parallel=True, precision=2)

The top node represents the initial model before any splits have been done, when all the data is in one group. This is the simplest possible model. It is the result of asking zero questions, and for a classification tree it will always predict the most common class in the whole dataset.

In this case, the top node shows values of [261123, 23029]. For a classifier these are the numbers of samples in each class (TARGET = 0 and TARGET = 1). The node also reports the impurity of the group (Gini impurity is the default criterion for `DecisionTreeClassifier`) and the number of samples it contains. The final piece of information shown is the decision criterion for the best split that was found, that is, the column and threshold used for the first question.

In [None]:
m_accuracy(m, to.train.xs, to.train.ys), m_accuracy(m, to.valid.xs, to.valid.ys)

In [None]:
# m_auc(m, to.train.xs, to.train.ys), m_auc(m, to.valid.xs, to.valid.ys)

In [None]:
from sklearn.utils.class_weight import compute_class_weight

kwargs = {'max_depth': 16, 'max_features': 'sqrt', 'n_estimators': 200}

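def rf(xs, y, n_estimators=150, min_samples_leaf=10, **kwargs):
    # Forward n_estimators explicitly; otherwise it is captured by the signature and silently dropped
    return RandomForestClassifier(n_jobs=-1, n_estimators=n_estimators, min_samples_leaf=min_samples_leaf,
                                  oob_score=True, **kwargs).fit(xs, y)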

In [None]:
m = rf(to.train.xs, to.train.ys, **kwargs);

In [None]:
m_accuracy(m, to.train.xs, to.train.ys), m_accuracy(m, to.valid.xs, to.valid.ys)

In [None]:
m_auc(m, to.train.xs, to.train.ys), m_auc(m, to.valid.xs, to.valid.ys)

In [None]:
def rf_feat_importance(m, df):
    return pd.DataFrame({'cols':df.columns, 'imp':m.feature_importances_}).sort_values('imp', ascending=False)

In [None]:
fi = rf_feat_importance(m, to.train.xs)
fi[:10]

In [None]:
def plot_fi(fi):
    return fi.plot('cols', 'imp', 'barh', figsize=(12,7), legend=False)

plot_fi(fi[:30]);

In [None]:
from sklearn.model_selection import GridSearchCV

rf_model = RandomForestClassifier(random_state=42, n_jobs=-1)

# Candidate values for each hyperparameter
n_estimators = [100, 200, 250]
max_features = ['log2', 'sqrt']  # 'log' is not a valid option; 'log2' is
max_depth = [8, 16, 24]
min_samples_split = [5, 10]

parameter_grid = {
    'n_estimators': n_estimators,
    'max_features': max_features,
    'max_depth': max_depth,
    'min_samples_split': min_samples_split,
}

grid_search = GridSearchCV(rf_model, parameter_grid, cv=3, scoring='roc_auc', n_jobs=-1)
grid_search.fit(to.train.xs, to.train.ys.iloc[:, 0])


In [None]:
best_rf_model = grid_search.best_estimator_

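# Evaluate
rf_predictions = best_rf_model.predict(to.valid.xs)
rf_accuracy = accuracy_score(to.valid.ys.iloc[:, 0], rf_predictions)
# AUC is a ranking metric, so score it on positive-class probabilities rather than hard labels
rf_probabilities = best_rf_model.predict_proba(to.valid.xs)[:, 1]
rf_auc = roc_auc_score(to.valid.ys.iloc[:, 0], rf_probabilities)
rf_accuracy, rf_auc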

In [None]:
grid_search.best_estimator_

In [None]:
update_summary_table(
    approach='Random Forest',
    lib='sklearn',
    accuracy=rf_accuracy,
    auc=rf_auc
)
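
Because the summary table compares models on AUC, it helps to see how much the input to `roc_auc_score` matters. The sketch below uses six made-up validation rows (the arrays `y_true` and `proba` are hypothetical, not results from this notebook) and scores the same model output twice: once as thresholded labels and once as probabilities.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical ground truth and predicted positive-class probabilities
y_true = np.array([0, 0, 0, 1, 1, 1])
proba = np.array([0.10, 0.20, 0.60, 0.40, 0.70, 0.90])

# Thresholding at 0.5 throws away the ordering within each side of the threshold
hard = (proba > 0.5).astype(int)

auc_hard = roc_auc_score(y_true, hard)    # ranking information lost
auc_proba = roc_auc_score(y_true, proba)  # full ranking used

print(f"AUC on hard labels:   {auc_hard:.3f}")
print(f"AUC on probabilities: {auc_proba:.3f}")
```

Thresholded labels create many ties between positives and negatives, so the two scores generally differ; passing probabilities (or decision scores) keeps AUC comparable across the models in the table.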

### 3.3 MLP Classifier Implementation

In [None]:
from sklearn.neural_network import MLPClassifier

class MyMLPClassifier:
    """Thin wrapper around scikit-learn's MLPClassifier with fixed hyperparameters."""
    def __init__(self, max_iter):
        self.model = MLPClassifier(
            activation='relu', alpha=0.01, hidden_layer_sizes=(50, 50),
            learning_rate='adaptive', solver='adam',
            max_iter=max_iter, random_state=42, verbose=True)

    def train(self, X, y):
        self.model.fit(X, y)

    def evaluate(self, X, y):
        predictions = self.model.predict(X)
        accuracy = accuracy_score(y, predictions)
        # Score AUC on the positive-class probability, not on thresholded labels
        auc = roc_auc_score(y, self.model.predict_proba(X)[:, 1])
        return accuracy, auc

mlp_classifier = MyMLPClassifier(max_iter=50)
# Pass the target as a 1-D Series to avoid scikit-learn's column-vector warning
mlp_classifier.train(to.train.xs, to.train.ys.iloc[:, 0])
mlp_accuracy, mlp_auc = mlp_classifier.evaluate(to.valid.xs, to.valid.ys.iloc[:, 0])
mlp_accuracy, mlp_auc

In [None]:
update_summary_table(
    approach='Scikit-learn (MLPClassifier)',
    lib='sklearn',
    accuracy=mlp_accuracy,
    auc=mlp_auc
)

### 3.4 Keras

In [None]:
import keras
import keras_tuner
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization, Activation
from tensorflow.keras.regularizers import l2
from tensorflow.keras.optimizers import Adam

keras.utils.set_random_seed(42)
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
strategy = tf.distribute.get_strategy() # default strategy that works on CPU and single GPU
print("Number of accelerators: ", strategy.num_replicas_in_sync)

def build_model(hp, opt=None):
    input_features = to.train.xs.shape[1]
    with strategy.scope():
        model = Sequential()


        model.add(Dense(
            units=hp.Int('units_0' , min_value=256, max_value=2048, step=64),
            use_bias=True,
            activation='relu',
            input_shape=(input_features,)))
        model.add(Dropout(hp.Float(f"dropout_rate_0", min_value=0, max_value=0.6, step=0.2)))


        # Dense layers with Batch Normalization and ReLU
        for i in range(1, hp.Int('num_layers', min_value=6, max_value=12)):
            model.add(BatchNormalization())
            model.add(Dense(units=hp.Int('units_' + str(i), min_value=256, max_value=2048, step=512),
                            use_bias=True,
                            kernel_regularizer=l2(0.01)))
            model.add(Activation('relu'))
            model.add(Dropout(hp.Float(f"dropout_rate_{i}", min_value=0, max_value=0.6, step=0.2)))


        # Output layer
        model.add(Dense(1, activation='sigmoid'))

        # Compile the model
        optimizer = opt if opt is not None else Adam(hp.Float('lr', min_value=1e-4, max_value=1e-2, sampling='LOG'))
        model.compile(
            optimizer=optimizer,
            loss='binary_crossentropy',
            metrics=['AUC']
        )
    return model
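
The `Dropout` layers in `build_model` randomly zero a fraction of activations during training and pass inputs through unchanged at inference. As a rough illustration of that mechanism, here is a NumPy sketch of inverted dropout, where surviving activations are scaled by `1 / (1 - rate)` as the Keras `Dropout` layer documents. The function `dropout_forward` and the toy array `acts` are hypothetical names used only for this example.

```python
import numpy as np

def dropout_forward(x, rate, training, rng):
    """Inverted dropout: zero roughly a fraction `rate` of activations and rescale the rest."""
    if not training or rate == 0.0:
        return x  # inference: activations pass through untouched
    keep = 1.0 - rate
    mask = rng.random(x.shape) < keep      # True for the units that survive
    return np.where(mask, x / keep, 0.0)   # rescale so the expected value is unchanged

rng = np.random.default_rng(42)
acts = np.ones((4, 1000))  # toy activations

train_out = dropout_forward(acts, rate=0.4, training=True, rng=rng)
eval_out = dropout_forward(acts, rate=0.4, training=False, rng=rng)

print("fraction zeroed:", float((train_out == 0).mean()))
print("mean activation in training:", float(train_out.mean()))
```

The rescaling keeps the average activation close to its inference-time value, so no correction is needed when the model is later called with `training=False`.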

In [None]:
# Hyperparameter tuning setup
tuner = keras_tuner.RandomSearch(
    hypermodel=build_model,
    objective=keras_tuner.Objective("val_auc", direction="max"),
    max_trials=20,
    executions_per_trial=1,
    overwrite=False,
    directory="/kaggle/working/",
    project_name="hyperparameter_tuning",
    seed=42
)

# Show the search space before starting
tuner.search_space_summary()

# AUC should be maximised, so state the direction explicitly
early_stopping = keras.callbacks.EarlyStopping(monitor='val_auc', mode='max', patience=10, min_delta=1e-3, restore_best_weights=True)

# Run the search on the training split, scoring each trial on the validation split
tuner.search(to.train.xs, to.train.ys, validation_data=(to.valid.xs, to.valid.ys), epochs=70, batch_size=512, callbacks=[early_stopping])

# Display results
tuner.results_summary()

In [None]:
best_hps = tuner.get_best_hyperparameters(1)[0]
best_hps.values

In [None]:
best_hps_values = best_hps.values
file_path = 'best_hps_values.pkl'

with open(file_path, 'wb') as f:
    pickle.dump(best_hps_values, f)

print(f'Best hyperparameter values saved in {file_path}') 

In [None]:
best_model = build_model(best_hps)

# Retrain the best configuration from scratch.

early_stopping = keras.callbacks.EarlyStopping(monitor='val_auc', mode='max', patience=25, min_delta=5e-4, restore_best_weights=True)
history = best_model.fit(to.train.xs, to.train.ys, validation_data=(to.valid.xs, to.valid.ys), epochs=50, batch_size=512, callbacks=[early_stopping])

In [None]:
(fig, ax) = plt.subplots(1, 1, figsize=(10, 8))

ax.plot(history.history["loss"], label="train_loss")
ax.plot(history.history["val_loss"], label="val_loss")
ax.set_title("Training Graphs")
ax.set_xlabel("Epochs")
ax.set_ylabel("Loss")
ax.set_ylim([0, 2.5])
ax.legend()

fig.tight_layout(pad=3.0)
fig.show()

In [None]:
dev_predictions_proba = best_model.predict(to.valid.xs)
dev_predictions = (dev_predictions_proba > 0.5).astype('int')
keras_auc = roc_auc_score(to.valid.ys, dev_predictions_proba)
print('keras AUC:', keras_auc)
keras_acc = accuracy_score(to.valid.ys, dev_predictions)
print('keras ACC:', keras_acc)

keras_acc, keras_auc

In [None]:
summary_table

In [None]:
update_summary_table(
    approach='Keras (TensorFlow)',
    lib='TensorFlow',
    accuracy=keras_acc,
    auc=keras_auc
)

### 3.5 TensorFlow Approach

In [None]:
xs, ys = tf.convert_to_tensor(to.train.xs, dtype=tf.float32), tf.convert_to_tensor(to.train.ys, dtype=tf.float32)
xs_valid, ys_valid = tf.convert_to_tensor(to.valid.xs, dtype=tf.float32), tf.convert_to_tensor(to.valid.ys, dtype=tf.float32)

xs.shape, ys.shape, xs_valid.shape, ys_valid.shape

In [None]:
BATCH_SIZE = 512
train_ds = tf.data.Dataset.from_tensor_slices((xs, ys))
train_ds = train_ds.shuffle(buffer_size=xs.shape[0]).batch(BATCH_SIZE)

valid_ds = tf.data.Dataset.from_tensor_slices((xs_valid, ys_valid))
valid_ds = valid_ds.batch(BATCH_SIZE)  # evaluation does not depend on order, so no shuffling

In [None]:
import tensorflow as tf
from tensorflow.keras import layers, optimizers, losses, metrics

In [None]:
model = tf.keras.Sequential()
for i in range(best_hps_values['num_layers']):
    print(best_hps_values['units_%d' % (i)])
    model.add(BatchNormalization())
    model.add(layers.Dense(best_hps_values['units_%d' % (i)], activation=tf.nn.relu, kernel_regularizer=l2(0.01)))
    model.add(layers.Dropout(best_hps_values['dropout_rate_%d' % (i)]))
model.add(layers.Dense(1, activation=tf.nn.sigmoid))

# Compile the model
model.compile(optimizer=optimizers.Adam(learning_rate=best_hps_values['lr']),
                loss=losses.BinaryCrossentropy(),
                metrics=['accuracy', 'AUC'])

In [None]:
model.build(input_shape=xs.shape)
model.summary()

In [None]:
model.fit(xs, ys, batch_size=512, epochs=50, validation_data=(xs_valid, ys_valid))

In [None]:
def loss_fn(y_true, y_pred):
    # Per-sample binary cross-entropy, averaged over the batch
    loss = tf.keras.losses.binary_crossentropy(y_true, y_pred)
    return tf.reduce_mean(loss)

def one_batch(model, x_batch, y_batch, loss_func, optimizer=None, is_train=False):
    """Run one batch; update the weights only when is_train is True."""
    if is_train:
        with tf.GradientTape() as tape:
            y_pred = model(x_batch, training=True)
            loss = loss_func(y_batch, y_pred)

        # Differentiate only the trainable weights; BatchNormalization moving statistics have no gradient
        gradients = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    else:
        y_pred = model(x_batch, training=False)
        loss = loss_func(y_batch, y_pred)

    return loss

def train_model(model, train_dataset, val_dataset, optimizer, loss_func, epochs=10):
    train_losses, val_losses = [], []

    for epoch in range(epochs):
        train_epoch_loss = tf.keras.metrics.Mean()  # Track epoch-wise loss
        val_epoch_loss = tf.keras.metrics.Mean()

        for x_batch, y_batch in train_dataset:
            train_batch_loss = one_batch(model, x_batch, y_batch, loss_func, optimizer=optimizer, is_train=True)
            train_epoch_loss.update_state(train_batch_loss)

        for x_batch, y_batch in val_dataset:
            val_batch_loss = one_batch(model, x_batch, y_batch, loss_func)
            val_epoch_loss.update_state(val_batch_loss)

        print(f"Epoch {epoch + 1}/{epochs} - Training loss: {train_epoch_loss.result():.4f}, Validation loss: {val_epoch_loss.result():.4f}")

        # Store plain floats so the histories can be plotted or pickled directly
        train_losses.append(float(train_epoch_loss.result()))
        val_losses.append(float(val_epoch_loss.result()))
    
    return train_losses, val_losses


In [None]:
train_losses, val_losses = train_model(
    model,
    train_ds,
    valid_ds,
    optimizer=tf.optimizers.Adam(learning_rate=0.001,weight_decay=0.1),
    loss_func=loss_fn
)
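
For intuition about what `one_batch` does when `is_train=True`, the same forward pass, loss, gradient, and update can be written by hand for the simplest possible case: a single sigmoid unit trained with binary cross-entropy. This is a simplified stand-in, not the network trained above. The gradient is derived manually instead of using `tf.GradientTape`, and the names `X_toy`, `y_toy`, `w`, `b` are assumptions made for the sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy, with clipping to avoid log(0)."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)))

# Toy data: the label is 1 when feature 0 exceeds feature 1
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(256, 3))
y_toy = (X_toy[:, 0] - X_toy[:, 1] > 0).astype(float)

w, b, lr = np.zeros(3), 0.0, 0.5
losses = []
for step in range(50):
    p = sigmoid(X_toy @ w + b)           # forward pass
    losses.append(bce(y_toy, p))         # loss before this update
    grad_z = (p - y_toy) / len(y_toy)    # dLoss/dz for sigmoid followed by BCE
    w -= lr * (X_toy.T @ grad_z)         # chain rule back to the weights
    b -= lr * grad_z.sum()               # and to the bias

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

With all-zero weights every prediction is 0.5, so the first loss equals ln 2; each update then moves the weights against the gradient, which is what `optimizer.apply_gradients` does for every layer of the Keras model.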

In [None]:
dev_predictions_proba = model(xs_valid, training=False)
dev_predictions = (dev_predictions_proba > 0.5).numpy().astype('int')
tensorflow_auc = roc_auc_score(ys_valid, dev_predictions_proba)
print('Tensorflow AUC:', tensorflow_auc)
tensorflow_acc = accuracy_score(ys_valid, dev_predictions)
print('Tensorflow ACC:', tensorflow_acc)

tensorflow_acc, tensorflow_auc

In [None]:
update_summary_table(
    approach='TensorFlow',
    lib='TensorFlow',
    accuracy=tensorflow_acc,
    auc=tensorflow_auc
)

### Most Sophisticated Neural Network

In [None]:
layers = [64, 128, 64]
metrics = [accuracy,RocAucBinary()]
dls = to.dataloaders(shuffle_train=True, device=torch.device('cuda'), bs=512)

# learn = tabular_learner(dls, layers=layers, metrics=metrics, config=tabular_config(ps=0.4))
learn = tabular_learner(dls, layers=layers, metrics=metrics, config=tabular_config(ps=0.4), wd=0.1)
learn.model.to('cuda')

learn.lr_find()

In [None]:
learn.fit(n_epoch=4, lr=0.001)

In [None]:
learn.lr_find()

In [None]:
learn.fit_one_cycle(n_epoch=6, lr_max=1e-5)

In [None]:
learn.model

In [26]:
import nn as nnx
from tensor import Tensor
from tqdm import tqdm
import matplotlib.pyplot as plt

In [27]:
class NeuralNetwork(nnx.Module):
    def __init__(self, input_shape, output_shape):
        super(NeuralNetwork, self).__init__()
        self.dense1 = nnx.Linear(in_features=input_shape, out_features=100)
        self.dense2 = nnx.Linear(100, 50)
        self.dense3 = nnx.Linear(50, 25)
        self.dense4 = nnx.Linear(25, output_shape)
        self.relu = nnx.ReLU()

    def forward(self, x):
        x = self.relu(self.dense1(x))
        x = self.relu(self.dense2(x))
        x = self.relu(self.dense3(x))
        x = self.dense4(x)
        return x

# Custom Dataset class
class MyDataset(nnx.Dataset):
    def __init__(self, X, y):
        self.X = Tensor(X)
        self.y = Tensor(y)
    
    def __len__(self):
        return len(self.X)
    
    def __getitem__(self, index):
        return self.X[index], self.y[index]

In [28]:
def fit(epochs, lr, model, loss_func, opt_fn, train_dl, valid_dl, patience=10, use_early_stoping=False):
    
    recorder = {'tr_loss': [], 'val_loss': [], 'tr_acc': [], 'val_acc': []}
    losses = [[], []]
    best_val_loss = float('inf')
    counter_early_stop = 0
    early_stop = False
    # fig, axs = plt.subplots(1, 1, figsize=(14, 7))
    # p = display(fig,display_id=True)

    opt = opt_fn(model.parameters(), lr=lr)
    
    for epoch in tqdm(range(epochs)):
        model.train()
        train_tot_loss, train_tot_acc, t_count = 0.,0.,0
        for xb,yb in train_dl:
            preds = model(xb)
            # import pdb; pdb.set_trace()
            loss = loss_func(preds, yb)
            loss.backward()
            opt.step()
            opt.zero_grad()

            # Calculate accuracy & loss
            predicted_labels = preds.argmax(axis=1)
            n = len(xb)
            t_count += n
            train_tot_loss += loss.item()*n
            train_tot_acc  += Tensor.accuracy(predicted_labels, yb).item()*n
            recorder['tr_loss'].append(loss.item())
            recorder['tr_acc'].append(Tensor.accuracy(predicted_labels, yb).item())
            losses[0].append(loss.item())

        model.eval()
        val_tot_loss, val_tot_acc,v_count = 0.,0.,0
        for xb,yb in valid_dl:
            preds = model(xb)

            # Compute the batch loss and accuracy once and reuse them
            pred_labels = preds.argmax(axis=1)
            batch_val_loss = loss_func(preds,yb).item()
            batch_val_acc = Tensor.accuracy(pred_labels, yb).item()
            n = len(xb)
            v_count += n
            val_tot_acc  += batch_val_acc*n
            val_tot_loss += batch_val_loss*n
            recorder['val_loss'].append(batch_val_loss)
            recorder['val_acc'].append(batch_val_acc)
            losses[1].append(batch_val_loss)

            
        print(f"epoch {epoch + 1:02d}/{epochs:02d} - loss: {train_tot_loss/t_count:.4f} - acc: {train_tot_acc/t_count:.4f} - val_loss: {val_tot_loss/v_count:.4f} - val_acc: {val_tot_acc/v_count:.4f}")

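        if use_early_stoping:
            # Compare epoch-level validation loss; recorder['val_loss'][-1] is only the last batch
            epoch_val_loss = val_tot_loss/v_count
            if epoch_val_loss < best_val_loss:
                best_val_loss = epoch_val_loss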
                counter_early_stop = 0
                # Save the best model
                model.save_weights(path='best_model.pth')
            else:
                counter_early_stop += 1
                if counter_early_stop >= patience:
                    print("Early stopping triggered")
                    early_stop = True
                    
            if early_stop:
                print("Stopped")
                break
            
    return recorder
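
The patience logic inside `fit` can be isolated into a small helper, which makes it easier to check on its own. The sketch below is one possible way to write it; the class `EarlyStopper` and the list `epoch_val_losses` are hypothetical and use made-up numbers. It assumes it receives one validation loss per epoch.

```python
class EarlyStopper:
    """Track the best validation loss and signal a stop after `patience` epochs without improvement."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float('inf')
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss      # improvement: remember it and reset the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1      # no improvement this epoch
        return self.bad_epochs >= self.patience


# Hypothetical per-epoch validation losses
epoch_val_losses = [0.70, 0.62, 0.60, 0.61, 0.63, 0.60, 0.64, 0.59]

stopper = EarlyStopper(patience=3)
stopped_at = None
for epoch, val_loss in enumerate(epoch_val_losses, start=1):
    if stopper.step(val_loss):
        stopped_at = epoch
        break

print("stopped at epoch:", stopped_at, "best loss:", stopper.best)
```

In this run the loss last improves at epoch 3 and then fails to improve for three epochs in a row, so the loop stops at epoch 6 and never sees the later value of 0.59. That is the usual trade-off when choosing `patience`.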

In [29]:
X_tr, y_tr, X_val, y_val = map(Tensor, (to.train.xs, to.train.ys, to.valid.xs, to.valid.ys))

# Create the neural network
input_shape = X_tr.shape[1]  # number of input features
output_shape = 1  # single output for binary classification

model = NeuralNetwork(input_shape, output_shape)

tr_ds = MyDataset(X_tr, y_tr)
val_ds = MyDataset(X_val, y_val)

# Creating the data loader
bs = 64
tr_dl = nnx.DataLoader(tr_ds, batch_size=bs)
val_dl = nnx.DataLoader(val_ds, batch_size=bs)

In [32]:
ds_path = 'ds.pkl'

with open(ds_path, 'wb') as f:
    pickle.dump((X_tr, y_tr, X_val, y_val), f)

In [30]:
xb, yb = next(iter(tr_dl))
xb.shape, yb.shape

((64, 163), (64, 1))

In [31]:
lr = 0.005
n_epochs = 25
recorder = fit(n_epochs, lr, model, nnx.BinaryCrossEntropyLoss(), nnx.SGD, tr_dl, val_dl)

  0%|     | 0/25 [00:00<?, ?it/s]


AssertionError: 

In [None]:
# plt.figure(figsize=(10, 7))
plt.plot(recorder['tr_loss'], label='Training Loss')
plt.plot(recorder['val_loss'], label='Validation Loss', linestyle='--')
plt.xlabel('Iterations')
plt.ylabel('Loss')
plt.legend()
plt.title('Loss Curve')
plt.show()

plt.figure(figsize=(10, 7))
plt.plot(recorder['tr_acc'], label='Training Accuracy')
plt.plot(recorder['val_acc'], label='Validation Accuracy')
plt.xlabel('Iterations')
plt.ylabel('Accuracy')
plt.legend()
plt.title('Accuracy Curve')
plt.show()

# Save the model weights
model.save_weights(path='model_params.pkl')
print('>>> Saved model weights')






