## Validation procedures
## Tutorials

This project developed a set of classes that support validation procedures when modeling data in supervised learning tasks. This tutorial shows how easy and flexible they are to use.
<br>
<br>
Use cases for the classes presented here are as follows:
* *KfoldsCV*, for performing grid/random search of a Light GBM model.
* *Kfolds_fit*, for performing grid/random search and then fitting an SVM classifier on the entire training data with the best hyper-parameters found.
* *bootstrap_estimation*, for running a large collection of estimations to assess the average and standard deviation of performance metrics, using a regularized logistic regression model.

None of the estimations is meant to be as efficient as possible; the focus is on illustrating how these classes can be used in real-world applications.
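
As a rough illustration of the idea behind *bootstrap_estimation* (not the class's actual implementation, which is not shown in this part of the tutorial), the sketch below resamples observations with replacement and reports the mean and standard deviation of a metric across replications. The names `bootstrap_metric` and `accuracy`, their parameters, and the toy data are hypothetical.

```python
import random
import statistics


def bootstrap_metric(y_true, y_score, metric, n_replications=200, seed=0):
    """Mean and standard deviation of `metric` over bootstrap resamples."""
    rng = random.Random(seed)
    n = len(y_true)
    values = []
    for _ in range(n_replications):
        # Draw n indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        values.append(metric([y_true[i] for i in idx], [y_score[i] for i in idx]))
    return statistics.mean(values), statistics.stdev(values)


def accuracy(y_true, y_score, threshold=0.5):
    """Share of observations whose thresholded score matches the label."""
    hits = sum((s >= threshold) == bool(y) for y, s in zip(y_true, y_score))
    return hits / len(y_true)


# Toy labels and scores (made up for illustration).
y_true = [0, 0, 1, 1, 0, 1, 0, 1, 0, 0]
y_score = [0.1, 0.4, 0.8, 0.6, 0.7, 0.9, 0.2, 0.3, 0.1, 0.5]
avg, std = bootstrap_metric(y_true, y_score, accuracy)
print(f"accuracy: {avg:.3f} +/- {std:.3f}")
```

A full bootstrap estimation would typically refit the model on each resample; here the scores are fixed so that the example stays short.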

--------

This notebook imports the developed classes together with a data pre-processing pipeline, and assesses their functionality by applying several distinct statistical learning methods.

**Summary:**
1. [Libraries](#libraries).
2. [Functions and classes](#functions_classes).
3. [Settings](#settings).
4. [Importing datasets](#imports).
5. [Data pre-processing](#data_pre_proc).
6. [Assessing K-folds CV](#kfolds_assess).
    * [Light GBM](#kfolds_light_gbm).
7. [Assessing K-folds fit](#kfolds_fit_assess).
    * [SVM classifier](#kfolds_fit_svm_class).
8. [Assessing bootstrap estimation](#boot_assess).
    * [Logistic regression](#boot_lr).

<a id='libraries'></a>

## Libraries

In [1]:
import pandas as pd
import numpy as np
import json
import os

from datetime import datetime
import time
import progressbar

from scipy.stats import uniform, norm, randint

from sklearn.linear_model import LogisticRegression, LinearRegression, Lasso
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVC, SVR
from sklearn.metrics import roc_auc_score, average_precision_score, auc, precision_recall_curve, brier_score_loss
from sklearn.metrics import mean_squared_error

# pip install lightgbm
import lightgbm as lgb

<a id='functions_classes'></a>

## Functions and classes

In [2]:
import utils
from utils import loading_data, classify_features, assessing_missings
from utils import data_consistency, running_time, missings_detection, frequency_list

In [3]:
import transformations
from transformations import log_transformation, standard_scale, impute_missing, one_hot_encoding
from transformations import applying_log_transf, applying_standard_scale, treating_missings, applying_one_hot

In [4]:
import kfolds
from kfolds import KfoldsCV, Kfolds_fit

import bootstrap
from bootstrap import bootstrap_estimation

<a id='settings'></a>

## Settings

In [5]:
# Declare whether to export results:
export = True

# Declare whether to log-transform numerical features:
log_transform = True

# Declare whether to standardize numerical features:
standardize = True

# Define the dataset_id:
dataset_id = 2706

<a id='imports'></a>

## Importing datasets

<a id='feats_label'></a>

### Features and label

In [6]:
print('----------------------------------------')
print(f'\033[1mStore {dataset_id}:\033[0m')

df_train = loading_data(path=f'../data_pipeline/Datasets/dataset_{dataset_id}.csv',
                        dtype={'order_id': str, 'store_id': int, 'epoch': str},
                        id_var='order_id')

print('----------------------------------------')
print('\n')

# Accessory variables:
drop_vars = ['y', 'order_amount', 'store_id', 'order_id', 'status', 'epoch', 'date', 'weight']

df_train.head(3)

----------------------------------------
Store 2706:
Shape of df: (14434, 2628).
Number of distinct instances: 14434.
Time period: from 2020-12-31 to 2021-03-31.
----------------------------------------




| | AVGITEMCREATIONTIME() | BILLINGADDRESSCHARRANDOMNESS() | BILLINGADDRESSCHARWORDMODELPROB() | BILLINGADDRESSRANDOMNESS() | BILLINGCITY() | BILLINGCOUNTRY() | BILLINGLARGEAREAREPUTATION() | BILLINGNAMECHARRANDOMNESS() | BILLINGNAMECHARWORDMODELPROB() | BILLINGNAMERANDOMNESS() | ... | ZIPFIRST3REPUTATION() | ZIPFIRST5REPUTATION() | y | order_amount | order_id | status | epoch | store_id | weight | date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | 0.255415 | 0.194991 | 9.357623e-14 | Venustiano carranza | MX | | 0.98304 | 0.034951 | 0.7736941 | ... | 0.06119 | 0.045009 | 0.0 | 1795.16 | 130874044224 | APPROVED | 1609459270000.0 | 2706 | 1.0 | 2020-12-31 21:01:10 |
| 1 | | 0.299488 | 0.117108 | 9.357623e-14 | Coacalco de berriozabal | MX | | 0.351049 | 0.097665 | 2.301321e-09 | ... | 0.091155 | 0.078335 | 0.0 | 4092.23 | 130874049768 | APPROVED | 1609459766000.0 | 2706 | 1.0 | 2020-12-31 21:09:26 |
| 2 | | 0.201846 | 0.204315 | 9.357623e-14 | Hermosillo | MX | | 0.189533 | 0.180326 | 9.357623e-14 | ... | 0.069444 | 0.118511 | 0.0 | 1871.4 | 130874053609 | APPROVED | 1609460116000.0 | 2706 | 1.0 | 2020-12-31 21:15:16 |


#### Train-test split

In [7]:
# Positional split: the first half of the rows is used for training.
n_train = df_train.shape[0] // 2
df_train['train_test'] = 'test'
# Assign through a single .iloc call to avoid chained indexing on a copy:
df_train.iloc[:n_train, df_train.columns.get_loc('train_test')] = 'train'

# Train-test split:
df_test = df_train[df_train.train_test == 'test'].copy()
df_train = df_train[df_train.train_test == 'train'].copy()

# Resetting indices:
df_train.reset_index(drop=True, inplace=True)
df_test.reset_index(drop=True, inplace=True)

drop_vars.append('train_test')


<a id='classif_feat'></a>

### Classifying features

In [8]:
print(f'\033[1mDataset {dataset_id}:\033[0m')
classified_features = classify_features(df_train, drop_vars=drop_vars,
                                        drop_excessive_miss=True, drop_no_var=True,
                                        test_data=df_test)

feats_assess = classified_features['feats_assess']
cat_vars = classified_features['cat_vars']
excessive_miss_train = classified_features['excessive_miss_train']
no_variance = classified_features['no_variance']
cont_vars = classified_features['cont_vars']
binary_vars = classified_features['binary_vars']
    
feats_assess

Dataset 2706:
Initial number of features: 2620.
1377 features were dropped for excessive number of missings!
300 features were dropped for having no variance!
943 remaining features.




| | class | frequency |
|---|---|---|
| 2 | cont_vars | 915 |
| 0 | cat_vars | 14 |
| 1 | binary_vars | 14 |
| 3 | drop_vars | 9 |
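
The log above reports features dropped for an excessive number of missings and for having no variance. How `classify_features` decides this is not shown in the notebook, so the following is only a sketch of one plausible screening rule in plain Python. The function name `screen_features`, the default 95% missing-share threshold, and the toy columns are assumptions made for illustration.

```python
def screen_features(columns, max_missing_share=0.95):
    """Split feature names into kept, excessive-missing and no-variance lists.

    `columns` maps a feature name to its list of values; None marks a missing.
    """
    kept, excessive_miss, no_variance = [], [], []
    for name, values in columns.items():
        observed = [v for v in values if v is not None]
        missing_share = 1 - len(observed) / len(values)
        if missing_share > max_missing_share:
            excessive_miss.append(name)
        elif len(set(observed)) <= 1:
            # A single distinct observed value means the feature cannot discriminate.
            no_variance.append(name)
        else:
            kept.append(name)
    return kept, excessive_miss, no_variance


# Toy columns (hypothetical), screened with a 50% threshold for visibility.
toy = {
    "amount": [10.0, 12.5, None, 7.0],
    "mostly_empty": [None, None, None, None],
    "constant": [1, 1, 1, None],
}
kept, excessive_miss, no_variance = screen_features(toy, max_missing_share=0.5)
print(kept, excessive_miss, no_variance)
```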


<a id='data_pre_proc'></a>

## Data pre-processing

<a id='assessing_missing'></a>

### Assessing missing values

In [9]:
print(f'\033[1mDataset {dataset_id}:\033[0m')
print('\033[1mTraining data:\033[0m')
missings_train = assessing_missings(dataframe=df_train)
print('\n\033[1mTest data:\033[0m')
missings_test = assessing_missings(dataframe=df_test)
print('\n')

missings_train.index.name = 'training_data'
missings_test.index.name = 'test_data'

missings_train.head(10)

Dataset 2706:
Training data:
Number of features with missings: 231 out of 952 features (24.26%).
Average number of missings: 478 out of 7217 observations (6.62%).

Test data:
Number of features with missings: 134 out of 952 features (14.08%).
Average number of missings: 470 out of 7217 observations (6.51%).




| training_data | feature | missings | share |
|---|---|---|---|
| 0 | CUSTNAVCOUNT(t,30min) | 6772 | 0.93834 |
| 1 | CUSTNAVCOUNT(t,1h) | 6751 | 0.93543 |
| 2 | CUSTNAVCOUNT(9,30min) | 6743 | 0.934322 |
| 3 | CUSTNAVENTROPY(COOKIE) | 6737 | 0.93349 |
| 4 | CUSTNAVCOUNT(9,1h) | 6713 | 0.930165 |
| 5 | CUSTNAVCOUNT(t,1d) | 6691 | 0.927117 |
| 6 | CUSTNAVCOUNT(ta,30min) | 6683 | 0.926008 |
| 7 | CUSTNAVCOUNT(ta,1h) | 6655 | 0.922128 |
| 8 | CREDITCARD(TOTAL_AMOUNT,60) | 6604 | 0.915062 |
| 9 | EMAIL(TOTAL_AMOUNT,60) | 6524 | 0.903977 |


<a id='num_transf'></a>

### Transforming numerical features

#### Logarithmic transformation

In [10]:
print('---------------------------------------------------------------------------------------------------------')
print('\033[1mAPPLYING LOGARITHMIC TRANSFORMATION OVER NUMERICAL DATA\033[0m')
print('\n')

print(f'\033[1mDataset {dataset_id}:\033[0m')

# Variables that should not be log-transformed:
not_log = [c for c in df_train.columns if c not in cont_vars]

if log_transform:
    print('\033[1mTraining data:\033[0m')
    df_train = applying_log_transf(dataframe=df_train, not_log=not_log)

    print('\033[1mTest data:\033[0m')
    df_test = applying_log_transf(dataframe=df_test, not_log=not_log)
    print('\n')


else:
    print('\033[1mNo transformation performed!\033[0m')
    print('\n')

print('---------------------------------------------------------------------------------------------------------')
print('\n')

---------------------------------------------------------------------------------------------------------
APPLYING LOGARITHMIC TRANSFORMATION OVER NUMERICAL DATA


Dataset 2706:
Training data:
Number of numerical variables log-transformed: 915.
Test data:
Number of numerical variables log-transformed: 915.


---------------------------------------------------------------------------------------------------------
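
The internals of `applying_log_transf` are not shown here. The later use of `c.replace('L#', '')` suggests that transformed columns are renamed with an `L#` prefix, and the sketch below follows that reading. The choice of `log(1 + x)` as the transformation, the handling of invalid values, and the function name `log_transform` are assumptions.

```python
import math


def log_transform(columns, not_log, prefix="L#"):
    """Apply log(1 + x) to every column not listed in `not_log`.

    Transformed columns are renamed with `prefix`. Missings (None) and values
    at or below -1 are mapped to None, since log(1 + x) is undefined there.
    """
    out = {}
    for name, values in columns.items():
        if name in not_log:
            out[name] = list(values)
            continue
        out[prefix + name] = [
            math.log1p(v) if v is not None and v > -1 else None for v in values
        ]
    return out


# Toy columns (hypothetical): only `n_items` is transformed.
toy = {"order_amount": [1795.16, 4092.23], "n_items": [0, 3, None]}
transformed = log_transform(toy, not_log=["order_amount"])
print(sorted(transformed))
```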




#### Standardizing numerical features

In [11]:
print('---------------------------------------------------------------------------------------------------------')
print('\033[1mAPPLYING STANDARD SCALE TRANSFORMATION OVER NUMERICAL DATA\033[0m')
print('\n')

print(f'\033[1mDataset {dataset_id}:\033[0m')

# Inputs that should not be standardized:
not_stand = [c for c in df_train.columns if c.replace('L#', '') not in cont_vars]

if standardize:
    scaled_data = applying_standard_scale(training_data=df_train, not_stand=not_stand,
                                          test_data=df_test)
    df_train_scaled = scaled_data['training_data']
    df_test_scaled = scaled_data['test_data']

else:
    df_train_scaled = df_train.copy()
    df_test_scaled = df_test.copy()

    print('\033[1mNo transformation performed!\033[0m')

print('\n')
print('---------------------------------------------------------------------------------------------------------')
print('\n')

---------------------------------------------------------------------------------------------------------
APPLYING STANDARD SCALE TRANSFORMATION OVER NUMERICAL DATA


Dataset 2706:
Standard scaling training data...
Standard scaling test data...


---------------------------------------------------------------------------------------------------------
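
`applying_standard_scale` receives both training and test data. A common way to do this, and the one assumed in the sketch below, is to compute the mean and standard deviation on the training data only and reuse them for the test data, so that no test information leaks into the fit. Whether the function does exactly this is not confirmed by the notebook; `fit_scaler` and `standardize` are hypothetical names.

```python
import statistics


def fit_scaler(train_values):
    """Mean and standard deviation of the observed (non-missing) training values."""
    observed = [v for v in train_values if v is not None]
    return statistics.mean(observed), statistics.pstdev(observed)


def standardize(values, mean, std):
    """Scale values with the given statistics, keeping missings as None."""
    if std == 0:
        # A constant column has no spread; map its observed values to zero.
        return [None if v is None else 0.0 for v in values]
    return [None if v is None else (v - mean) / std for v in values]


train = [2.0, 4.0, 6.0, None]
test = [4.0, 8.0]
mean, std = fit_scaler(train)  # statistics come from the training data only
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)
print(train_scaled, test_scaled)
```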




In [12]:
del scaled_data

#### Treating missing values

In [13]:
print('---------------------------------------------------------------------------------------------------------')
print('\033[1mTREATING MISSING VALUES\033[0m')
print('\n')

print(f'\033[1mDataset {dataset_id}:\033[0m')

print('\033[1mTreating missing values of training data...\033[0m')
df_train_scaled = treating_missings(dataframe=df_train_scaled, cat_vars=cat_vars,
                                    drop_vars=drop_vars)

print('\033[1mTreating missing values of test data...\033[0m')
df_test_scaled = treating_missings(dataframe=df_test_scaled, cat_vars=cat_vars,
                                   drop_vars=drop_vars)

print('\n')
print('---------------------------------------------------------------------------------------------------------')
print('\n')

---------------------------------------------------------------------------------------------------------
TREATING MISSING VALUES


Dataset 2706:
Treating missing values of training data...
Treating missing values of test data...


---------------------------------------------------------------------------------------------------------




<a id='categorical_transf'></a>

### Transforming categorical features

#### Creating dummies through one-hot encoding

In [14]:
print(f'\033[1mDataset {dataset_id}:\033[0m')

transf_data = applying_one_hot(df_train_scaled, cat_vars, test_data=df_test_scaled)
df_train_scaled = transf_data['training_data']
df_test_scaled = transf_data['test_data']

print(f'\033[1mShape of df_train_scaled:\033[0m {df_train_scaled.shape}.')
print(f'\033[1mShape of df_test_scaled:\033[0m {df_test_scaled.shape}.')
print('\n')

Dataset 2706:
Number of categorical features: 14
Number of overall selected dummies: 126.
Shape of df_train_scaled: (7217, 1291).
Shape of df_test_scaled: (7217, 1194).




In [15]:
del transf_data

In [16]:
# Assessing missing values (training data):
missings_detection(df_train_scaled, name=f'df_train_scaled (dataset {dataset_id})')

# Assessing missing values (test data):
missings_detection(df_test_scaled, name=f'df_test_scaled (dataset {dataset_id})')

<a id='datasets_structure'></a>

### Datasets structure

In [17]:
print(f'\033[1mDataset {dataset_id}\033[0m:')
df_test_scaled = data_consistency(dataframe=df_train_scaled,
                                  test_data=df_test_scaled)['test_data']

Dataset 2706:
Training and test data are consistent with each other.
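
After one-hot encoding, the training and test data had different numbers of columns (1291 versus 1194), and `data_consistency` returns a test set that matches the training data. Its implementation is not shown, so the sketch below only illustrates the general idea: give the test data exactly the training columns, filling absent ones with a default and dropping extras. In pandas the same effect can be obtained with `df_test.reindex(columns=df_train.columns, fill_value=0)`. The function name `align_columns` and the toy rows are hypothetical.

```python
def align_columns(train_columns, test_rows, fill_value=0):
    """Give every test row exactly the training columns, in training order."""
    return [
        {col: row.get(col, fill_value) for col in train_columns}
        for row in test_rows
    ]


# Toy example: the test row lacks `country_US` and carries an extra `country_BR`.
train_columns = ["amount", "country_MX", "country_US"]
test_rows = [{"amount": 3.2, "country_MX": 1, "country_BR": 0}]
aligned = align_columns(train_columns, test_rows)
print(aligned)
```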


<a id='kfolds_assess'></a>

## Assessing K-folds CV

<a id='kfolds_light_gbm'></a>

### Light GBM

See the [LightGBM documentation](https://lightgbm.readthedocs.io/en/latest/index.html) for details on the library.
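
A note on the search space used in the next cell: SciPy's `uniform(loc, scale)` is uniform on `[loc, loc + scale]`, so `uniform(0.5, 0.5)` covers 0.5 to 1.0 and `uniform(0.0001, 0.1)` covers 0.0001 to 0.1001, while `randint(low, high)` excludes `high`, so `randint(1, 10)` yields integers from 1 to 9. How `KfoldsCV` draws from these objects is not shown; the sketch below mimics a random search with the standard library only, and `sample_params` is a hypothetical helper.

```python
import random


def sample_params(grid, n_samples, seed=0):
    """Draw `n_samples` hyper-parameter combinations from `grid`.

    Each grid entry is either a list of candidate values or a callable
    that takes a random.Random instance and returns one draw.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n_samples):
        draws.append({
            name: spec(rng) if callable(spec) else rng.choice(spec)
            for name, spec in grid.items()
        })
    return draws


# Stand-ins for the SciPy distributions used in the notebook's grid.
grid = {
    "bagging_fraction": lambda rng: rng.uniform(0.5, 1.0),     # uniform(0.5, 0.5)
    "learning_rate": lambda rng: rng.uniform(0.0001, 0.1001),  # uniform(0.0001, 0.1)
    "max_depth": lambda rng: rng.randrange(1, 10),             # randint(1, 10)
    "num_iterations": [100, 250, 500],
}
samples = sample_params(grid, n_samples=10)
print(samples[0])
```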

In [18]:
# Grid of hyper-parameters:
grid_param = {'bagging_fraction': uniform(0.5, 0.5),
              'learning_rate': uniform(0.0001, 0.1),
              'max_depth': randint(1, 10),
              'num_iterations': [100, 250, 500]}

# Creating K-folds CV object:
kfolds = KfoldsCV(task = 'binary', method = 'light_gbm', num_folds = 3, metric = 'roc_auc',
                  random_search = True, n_samples = 10,
                  grid_param = grid_param,
                  default_param = {'bagging_fraction': 0.75,
                                   'learning_rate': 0.01,
                                   'max_depth': 10,
                                   'num_iterations': 500},
                  cost_function='cross_entropy')

# Running K-folds CV:
kfolds.run(inputs = df_train_scaled.drop(drop_vars, axis=1), output = df_train_scaled['y'])

# Defining best tuning hyper-parameter:
best_param = kfolds.best_param



[LightGBM] [Info] Number of positive: 204, number of negative: 4607
[LightGBM] [Info] [cross_entropy:Init]: (metric) labels passed interval [0, 1] check
[LightGBM] [Info] [cross_entropy:Init]: sum-of-weights = 4811.000000
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 51382
[LightGBM] [Info] Number of data points in the train set: 4811, number of used features: 1042
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.042403 -> initscore=-3.117212
[LightGBM] [Info] Start training from score -3.117212










[LightGBM] [Info] Number of positive: 199, number of negative: 4612
[LightGBM] [Info] [cross_entropy:Init]: (metric) labels passed interval [0, 1] check
[LightGBM] [Info] [cross_entropy:Init]: sum-of-weights = 4811.000000
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 51327
[LightGBM] [Info] Number of data points in the train set: 4811, number of used features: 1041
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.041364 -> initscore=-3.143112
[LightGBM] [Info] Start training from score -3.143112










[LightGBM] [Info] Number of positive: 211, number of negative: 4601
[LightGBM] [Info] [cross_entropy:Init]: (metric) labels passed interval [0, 1] check
[LightGBM] [Info] [cross_entropy:Init]: sum-of-weights = 4812.000000
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 51961
[LightGBM] [Info] Number of data points in the train set: 4812, number of used features: 1045
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.043849 -> initscore=-3.082171
[LightGBM] [Info] Start training from score -3.082171

Grid estimation progress: [----                                          ]  10%
Grid estimation progress: [---------                                     ]  20%
Grid estimation progress: [-----------------------                       ]  50%
Grid estimation progress: [-----------------------------------------     ]  90%
Grid estimation progress: [----------------------------------------------] 100%

---------------------------------------------------------------------
K-folds CV outcomes:
Number of data folds: 3.
Number of samples for random search: 10.
Estimation method: light gbm.
Metric for choosing best hyper-parameter: roc_auc.
Best hyper-parameters: {'bagging_fraction': 0.6910856228040706, 'learning_rate': 0.04354226569193867, 'max_depth': 9, 'num_iterations': 500}.
CV performance metric associated with best hyper-parameters: 0.9814.
---------------------------------------------------------------------


------------------------------------
Running time: 1.28 minutes.
Start time: 2021-05-15, 20:39:08
End time: 2021-05-15, 20:40:24
------------------------------------
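
The LightGBM log lines above can be cross-checked by hand. The sketch below assumes that `pavg` is the share of positive labels in a training fold and that `initscore` is its log-odds, which agrees with the logged values. The helper names and the total of 7217 training rows (inferred from the logged training-set sizes of 4811, 4811 and 4812) are illustrative assumptions and not part of the tutorial's classes.

```python
import math


def init_score(num_positive, num_negative):
    """Return (pavg, log-odds of pavg) for a binary training fold."""
    pavg = num_positive / (num_positive + num_negative)
    return pavg, math.log(pavg / (1.0 - pavg))


def train_sizes(num_rows, num_folds):
    """Training-set size per fold when rows are split as evenly as possible."""
    base, extra = divmod(num_rows, num_folds)
    fold_sizes = [base + 1 if i < extra else base for i in range(num_folds)]
    # Each training set is everything except the held-out fold.
    return [num_rows - size for size in fold_sizes]


# Counts taken from the log: 211 positives and 4601 negatives.
pavg, score = init_score(211, 4601)
print(f"pavg={pavg:.6f} -> initscore={score:.6f}")

# Hypothetical total of 7217 rows split into 3 folds.
print(train_sizes(7217, 3))
```

The strongly negative initial score reflects how imbalanced the target is: only about 4% of the rows in each fold are positive.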


In [19]:
# Best tuning hyper-parameters:
kfolds.best_param

{'bagging_fraction': 0.6910856228040706,
 'learning_rate': 0.04354226569193867,
 'max_depth': 9,
 'num_iterations': 500}

In [22]:
# CV metrics:
kfolds.CV_metric.style.set_properties(subset=['tun_param'], **{'width': '300px'})

| | tun_param | cv_roc_auc |
|---|---|---|
| 0 | {'bagging_fraction': 0.9806006231095018, 'learning_rate': 0.05436407008625484, 'max_depth': 4, 'num_iterations': 500} | 0.97874 |
| 1 | {'bagging_fraction': 0.5803051220628379, 'learning_rate': 0.09202315173619957, 'max_depth': 6, 'num_iterations': 500} | 0.980055 |
| 2 | {'bagging_fraction': 0.7458661769821264, 'learning_rate': 0.016686980999181332, 'max_depth': 7, 'num_iterations': 100} | 0.97428 |
| 3 | {'bagging_fraction': 0.5888799588504267, 'learning_rate': 0.024453775140339463, 'max_depth': 8, 'num_iterations': 100} | 0.976057 |
| 4 | {'bagging_fraction': 0.8731974354017087, 'learning_rate': 0.03865857789740874, 'max_depth': 6, 'num_iterations': 250} | 0.98102 |
| 5 | {'bagging_fraction': 0.7125904818716433, 'learning_rate': 0.03671318458793546, 'max_depth': 8, 'num_iterations': 250} | 0.979626 |
| 6 | {'bagging_fraction': 0.9125101295483491, 'learning_rate': 0.06332278424553706, 'max_depth': 1, 'num_iterations': 500} | 0.978052 |
| 7 | {'bagging_fraction': 0.6034188464064448, 'learning_rate': 0.036869428067368995, 'max_depth': 3, 'num_iterations': 100} | 0.975818 |
| 8 | {'bagging_fraction': 0.9952388495538327, 'learning_rate': 0.09710333459138716, 'max_depth': 6, 'num_iterations': 500} | 0.980724 |
| 9 | {'bagging_fraction': 0.6910856228040706, 'learning_rate': 0.04354226569193867, 'max_depth': 9, 'num_iterations': 500} | 0.981443 |
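
The best configuration reported by the K-folds CV outcomes is the sampled configuration with the highest `cv_roc_auc`. A minimal sketch in plain Python, with the scores copied from the CV metrics by row index (the `cv_scores` name is illustrative and is not an attribute of `KfoldsCV`):

```python
# cv_roc_auc per sampled configuration, keyed by row index.
cv_scores = {
    0: 0.97874, 1: 0.980055, 2: 0.97428, 3: 0.976057, 4: 0.98102,
    5: 0.979626, 6: 0.978052, 7: 0.975818, 8: 0.980724, 9: 0.981443,
}

# Pick the row whose CV metric is largest (higher ROC AUC is better).
best_row = max(cv_scores, key=cv_scores.get)
best_score = round(cv_scores[best_row], 4)

print(best_row, best_score)
```

The spread between the worst and best configurations is under 0.01 in ROC AUC, so with only 10 random samples and 3 folds the ranking among the top rows should be read with some caution.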


<a id='kfolds_fit_assess'></a>

## Assessing K-folds fit

<a id='kfolds_fit_svm_class'></a>

### SVM classifier

In [23]:
# Declare grid of hyper-parameters:
params = {'C': [1],
          'kernel': ['poly'],
          'degree': [1, 2, 3, 4],
          'gamma': ['scale']}
params_default = {'C': 1.0, 'kernel': 'poly', 'degree': 1, 'gamma': 'scale'}

# Declare K-folds CV estimation object:
kfolds = Kfolds_fit(task='classification', method='SVM',
                    metric='roc_auc', num_folds=3, pre_selecting=False, random_search=False,
                    grid_param=params, default_param=params_default)

# Running train-test estimation:
kfolds.fit(train_inputs=df_train_scaled.drop(drop_vars, axis=1),
           train_output=df_train_scaled['y'],
           test_inputs=df_test_scaled.drop(drop_vars, axis=1),
           test_output=df_test_scaled['y'])

Grid estimation progress: [----------------------------------------------] 100%

---------------------------------------------------------------------
Train-test estimation outcomes:


Outcomes from K-folds CV estimation:
   Number of data folds: 3.
   Estimation method: SVM.
   Metric for choosing best hyper-parameter: roc_auc.
   Best hyper-parameters: {'C': '1', 'kernel': 'poly', 'degree': '1', 'gamma': 'scale'}.
   CV performance metric associated with best hyper-parameters: 0.9639.


Performance metrics evaluated at test data:
   test_roc_auc = 0.9833
   test_prec_avg = 0.9237
   test_brier = 0.009
---------------------------------------------------------------------


------------------------------------
Running time: 4.14 minutes.
Start time: 2021-05-15, 20:43:04
End time: 2021-05-15, 20:47:12
------------------------------------
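
To make the reported test metrics concrete, the sketch below computes a Brier score and a ROC AUC from scratch on four toy predictions. It assumes that `test_brier` and `test_roc_auc` follow the standard definitions (mean squared error of predicted probabilities, and the share of positive-negative pairs ranked correctly); how `Kfolds_fit` computes them internally is not shown in this tutorial, and the function names and data here are illustrative.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def roc_auc(y_true, y_prob):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and predicted probabilities (not taken from the tutorial data).
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

print(brier_score(y_true, y_prob))
print(roc_auc(y_true, y_prob))
```

Lower is better for the Brier score, while higher is better for ROC AUC. With a heavily imbalanced target, a low Brier score alone says little, because predicting a small probability for every row already yields a small squared error.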


<a id='boot_assess'></a>

## Assessing bootstrap estimation

<a id='boot_lr'></a>

### Logistic regression

In [25]:
# Declare default hyper-parameters:
params_default = {'C': 0.1}

# Declare bootstrap estimation object:
boot_estimations = bootstrap_estimation(task='classification', method='logistic_regression',
                                        metric='roc_auc', num_folds=3, pre_selecting=False, random_search=False,
                                        grid_param=params, default_param=params_default,
                                        cv=False, replacement=True, n_iterations=1000, bootstrap_scores=True)

# Running bootstrap estimation:
boot_estimations.run(train_inputs=df_train_scaled.drop(drop_vars, axis=1),
                     train_output=df_train_scaled['y'],
                     test_inputs=df_test_scaled.drop(drop_vars, axis=1),
                     test_output=df_test_scaled['y'])



---------------------------------------------------------------------------------------------
Bootstrap statistics:
   Number of estimations: 1000.
   avg(roc_auc) = 0.9841
   std(roc_auc) = 0.0016
   avg(prec_avg) = 0.9237
   std(prec_avg) = 0.0051
   avg(brier) = 0.0095
   std(brier) = 0.0005


   Performance metrics based on bootstrap scores:
   roc_auc = 0.9866
   prec_avg = 0.9335
   brier = 0.0088
   Hyper-parameters used in estimations: {'C': 0.1}.
---------------------------------------------------------------------------------------------


------------------------------------
Running time: 16.7 minutes.
Start time: 2021-05-15, 20:50:13
End time: 2021-05-15, 21:06:56
------------------------------------
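
The idea behind the `avg(...)` and `std(...)` statistics can be sketched with the standard library alone: resample the training data with replacement, refit, score on the fixed test data, and summarize the scores. This is not the implementation of `bootstrap_estimation`. The `bootstrap_metric` function, the toy labels, and the constant-probability "model" that stands in for the regularized logistic regression are all assumptions made for illustration.

```python
import random
import statistics


def bootstrap_metric(train_labels, test_labels, n_iterations=1000, seed=0):
    """Resample the training labels with replacement and score each refit."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_iterations):
        # Draw a bootstrap sample of the same size as the training data.
        sample = rng.choices(train_labels, k=len(train_labels))
        # Toy "model": predict the resample's positive rate for every test row.
        prob = sum(sample) / len(sample)
        # Brier score of that constant prediction on the fixed test labels.
        brier = sum((prob - y) ** 2 for y in test_labels) / len(test_labels)
        scores.append(brier)
    return statistics.mean(scores), statistics.stdev(scores), scores


train_labels = [1] * 4 + [0] * 96   # hypothetical imbalanced training labels
test_labels = [1] * 2 + [0] * 48    # hypothetical test labels

avg, std, scores = bootstrap_metric(train_labels, test_labels, n_iterations=200)
print(f"avg(brier) = {avg:.4f}, std(brier) = {std:.4f}")
```

Fixing the seed makes the run reproducible. The standard deviation across iterations gives a rough sense of how sensitive the metric is to the particular training sample drawn.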
