## Validation procedures
## Tutorials

This project led to the development of classes that support validation procedures when modeling data in supervised learning tasks. This tutorial shows how easy and flexible they are to use.
<br>
<br>
Use cases for the classes presented here are as follows:
* *KfoldsCV*, for performing grid/random search of a LightGBM model and an XGBoost model. Pre-selection of features during each of the K-folds estimations for LightGBM is also presented.
* *Kfolds_fit*, for performing grid/random search and fitting an SVM classifier using the entire training data and the best choice of hyper-parameters. The same is applied to a GBM classifier (sklearn), together with parallelization to reduce overall running time. A demonstration of how to use this class to implement XGBoost with early stopping is also available. Finally, logistic regression with pre-selection of features is demonstrated.
* *BootstrapEstimation*, for running a large collection of estimations in order to assess the average and standard deviation of performance metrics, using a regularized logistic regression model.
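
To make the bootstrap idea in the last bullet concrete, here is a minimal sketch using only the standard library. It is not the `BootstrapEstimation` API: the function names are hypothetical, and for brevity it resamples a fixed set of predictions instead of refitting a model on each resample.

```python
import random
import statistics


def bootstrap_metric(y_true, y_pred, metric, n_iterations=200, seed=0):
    """Score `metric` on resamples drawn with replacement; return (mean, std)."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_iterations):
        # Draw n row indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric([y_true[i] for i in idx],
                             [y_pred[i] for i in idx]))
    return statistics.mean(scores), statistics.stdev(scores)


def accuracy(y_true, y_pred):
    """Share of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels and predictions (illustrative values only).
y_true = [0, 1, 1, 0, 1, 0, 1, 1, 0, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 1, 1, 0]

avg, std = bootstrap_metric(y_true, y_pred, accuracy)
print(f'accuracy: {avg:.3f} +/- {std:.3f}')
```

The standard deviation across resamples gives a rough sense of how much the metric would vary on a different sample of the same size.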

Note that none of the estimations is intended to be as efficient as possible; the focus is on illustrating how these classes can be used in real-world applications.
<br>
<br>
The complete collection of learning algorithms covered by the KfoldsCV, Kfolds_fit, and BootstrapEstimation classes is presented below. Each method is followed by its reference library and the hyper-parameters subject to grid or random search. Note that all hyper-parameters are named exactly as they are in their original libraries.
1. Logistic regression from [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) (method='logistic_regression').
    * Main hyper-parameters for tuning: regularization parameter ('C').
2. Linear regression (Lasso) from [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html) (method='lasso').
    * Main hyper-parameters for tuning: regularization parameter ('C').
3. GBM from [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html) (method='gbm').
    * Hyper-parameters for tuning: subsample ('subsample'), maximum depth ('max_depth'), learning rate ('learning_rate'), number of estimators ('n_estimators').
4. GBM from [LightGBM](https://lightgbm.readthedocs.io/en/latest/Parameters.html) (method='light_gbm').
    * Main hyper-parameters for tuning: subsample ('bagging_fraction'), maximum depth ('max_depth'), learning rate ('learning_rate'), number of estimators ('num_iterations').
    * By declaring 'metric' and 'early_stopping_rounds' in the parameters dictionary, both "KfoldsCV" and "Kfolds_fit" can be run with early stopping. For "KfoldsCV", early stopping takes place in each of the k-folds estimations. For "Kfolds_fit", the stopping rule applies both in each of the k-folds estimations and in the final fit on the entire training data.
5. GBM from [XGBoost](https://xgboost.readthedocs.io/en/latest/parameter.html#xgboost-parameters) (method='xgboost').
    * Main hyper-parameters for tuning: subsample ('subsample'), maximum depth ('max_depth'), learning rate ('eta'), number of estimators ('num_boost_round').
    * By declaring 'eval_metric' and 'early_stopping_rounds' in the parameters dictionary, early stopping is also available for XGBoost in both "KfoldsCV" and "Kfolds_fit".
6. Random forest from [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) (method='random_forest').
    * Main hyper-parameters for tuning: number of estimators ('n_estimators'), maximum number of features ('max_features') and minimum number of samples for split ('min_samples_split').
7. SVM from [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) (method='svm').
    * Main hyper-parameters for tuning: regularization parameter ('C'), kernel ('kernel'), polynomial degree ('degree'), gamma ('gamma').
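
To illustrate the difference between grid search and random search over hyper-parameters like those listed above, the sketch below expands a dictionary of candidate lists into every combination and, alternatively, draws a few random combinations. The helper names `expand_grid` and `sample_grid` are hypothetical and the classes may build their candidates differently; in particular, this sketch handles only lists, not `scipy.stats` distributions.

```python
import itertools
import random


def expand_grid(grid):
    """Grid search: every combination of the listed values."""
    keys = list(grid)
    return [dict(zip(keys, values))
            for values in itertools.product(*(grid[k] for k in keys))]


def sample_grid(grid, n_samples, seed=0):
    """Random search: draw each value independently from its list."""
    rng = random.Random(seed)
    return [{k: rng.choice(v) for k, v in grid.items()}
            for _ in range(n_samples)]


# Hyper-parameter names follow sklearn's GradientBoostingClassifier.
grid = {'subsample': [0.75],
        'learning_rate': [0.0001, 0.001, 0.01],
        'max_depth': [1, 3, 5],
        'n_estimators': [500]}

candidates = expand_grid(grid)      # 1 * 3 * 3 * 1 = 9 combinations
samples = sample_grid(grid, n_samples=5)

print(len(candidates))
print(candidates[0])
```

Grid search cost grows with the product of the list lengths, whereas random search cost is fixed by the number of samples.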

--------

This notebook imports the developed classes and applies several distinct statistical learning methods to a binary classification dataset in order to assess the functionality of those classes.

**Summary:**
1. [Libraries](#libraries).
2. [Functions and classes](#functions_classes).
3. [Settings](#settings).
4. [Importing datasets](#imports).
5. [Data pre-processing](#data_pre_proc).
6. [Assessing K-folds CV](#kfolds_assess).
    * [LightGBM](#kfolds_lightgbm).
    * [XGBoost](#kfolds_xgboost).
7. [Assessing K-folds fit](#kfolds_fit_assess).
    * [SVM classifier](#kfolds_fit_svm_class).
    * [Parallel estimation (GBM)](#kfolds_fit_gbm_parallel).
    * [XGBoost with early stopping](#kfolds_fit_xgboost_es).
    * [Logistic regression with pre-selection of features](#kfolds_fit_lr_sel_feats).
8. [Assessing bootstrap estimation](#boot_assess).
    * [Logistic regression](#boot_lr).

<a id='libraries'></a>

## Libraries

In [1]:
import pandas as pd
import numpy as np
import json
import os

from datetime import datetime
import time
import progressbar

from scipy.stats import uniform, norm, randint

<a id='functions_classes'></a>

## Functions and classes

In [2]:
import utils
from utils import loading_data, running_time

In [3]:
import kfolds
from kfolds import KfoldsCV, Kfolds_fit

import bootstrap
from bootstrap import BootstrapEstimation

<a id='settings'></a>

## Settings

In [4]:
# Define the dataset_id:
dataset_id = 2706

<a id='imports'></a>

## Importing datasets

<a id='feats_label'></a>

### Features and label

#### Training data

In [5]:
print('----------------------------------------')
print(f'\033[1mDataset {dataset_id}:\033[0m')

df_train = loading_data(path=f'Datasets/dataset_{dataset_id}_train.csv',
                        dtype={'order_id': str, 'store_id': int, 'epoch': str},
                        id_var='order_id')

print('----------------------------------------')
print('\n')

# Accessory variables:
drop_vars = ['y', 'order_id', 'epoch', 'date']

----------------------------------------
Dataset 2706:
Shape of df: (7217, 1286).
Number of distinct instances: 7217.
Time period: from 2020-12-31 to 2021-02-17.
----------------------------------------




#### Test data

In [6]:
print('----------------------------------------')
print(f'\033[1mDataset {dataset_id}:\033[0m')

df_test = loading_data(path=f'Datasets/dataset_{dataset_id}_test.csv',
                       dtype={'order_id': str, 'store_id': int, 'epoch': str},
                       id_var='order_id')

print('----------------------------------------')
print('\n')

----------------------------------------
Dataset 2706:
Shape of df: (7217, 1286).
Number of distinct instances: 7217.
Time period: from 2021-02-17 to 2021-03-31.
----------------------------------------




<a id='kfolds_assess'></a>

## Assessing K-folds CV

<a id='kfolds_lightgbm'></a>

### LightGBM

Click [here](https://lightgbm.readthedocs.io/en/latest/index.html) for the documentation of the LightGBM library.

In [7]:
# Grid of hyper-parameters:
grid_param = {'bagging_fraction': uniform(0.5, 0.5),
              'learning_rate': uniform(0.0001, 0.1),
              'max_depth': randint(1, 10),
              'num_iterations': [100, 250, 500]}
default_param = {'bagging_fraction': 0.75, 'learning_rate': 0.01, 'max_depth': 10, 'num_iterations': 500}

# Creating K-folds CV object:
kfolds = KfoldsCV(task='binary', method='light_gbm', num_folds=3, metric='roc_auc', shuffle=False,
                  random_search=True, n_samples=10,
                  grid_param=grid_param, default_param=default_param,
                  pre_selecting=False,
                  parallelize=False)

# Running K-folds CV:
kfolds.run(inputs=df_train.drop(drop_vars, axis=1), output=df_train['y'])

# Defining best tuning hyper-parameter:
best_param = kfolds.best_param

Grid estimation progress: [--------------------------------------] 100%

---------------------------------------------------------------------
K-folds CV outcomes:
Number of data folds: 3.
Number of samples for random search: 10.
Estimation method: light gbm.
Metric for choosing best hyper-parameter: roc_auc.
Best hyper-parameters: {'bagging_fraction': 0.6832649920297882, 'learning_rate': 0.08491106396508266, 'max_depth': 7, 'num_iterations': 100}.
CV performance metric associated with best hyper-parameters: 0.9825.
---------------------------------------------------------------------


------------------------------------
Running time: 0.68 minutes.
Start time: 2021-07-18, 13:29:08
End time: 2021-07-18, 13:29:49
------------------------------------
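
The grid in the cell above mixes `scipy.stats` distributions with a plain list. As a reminder of how those distributions are parameterized (this is standard `scipy.stats` behavior and says nothing about how `KfoldsCV` samples internally): `uniform(loc, scale)` covers `[loc, loc + scale]`, and `randint(low, high)` excludes `high`. A short check:

```python
from scipy.stats import randint, uniform

# uniform(loc, scale) is supported on [loc, loc + scale].
bagging_fraction = uniform(0.5, 0.5)    # values in [0.5, 1.0]
learning_rate = uniform(0.0001, 0.1)    # values in [0.0001, 0.1001]

# randint(low, high) draws integers from low to high - 1.
max_depth = randint(1, 10)              # integers 1, 2, ..., 9

# Lower and upper ends of the continuous ranges.
print(bagging_fraction.ppf(0), bagging_fraction.ppf(1))
print(learning_rate.ppf(0), learning_rate.ppf(1))

# A batch of draws confirms that the upper bound of randint is excluded.
depth_draws = max_depth.rvs(size=1000, random_state=0)
print(depth_draws.min(), depth_draws.max())
```

This is consistent with the sampled values reported in this notebook, where `bagging_fraction` stays between 0.5 and 1 and `max_depth` never exceeds 9.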


In [8]:
# Best tuning hyper-parameters:
kfolds.best_param

{'bagging_fraction': 0.6832649920297882,
 'learning_rate': 0.08491106396508266,
 'max_depth': 7,
 'num_iterations': 100}

In [11]:
# CV metrics:
kfolds.CV_metric.sort_values('cv_roc_auc',
                             ascending=False).style.set_properties(subset=['tun_param'], **{'width': '300px'})

| | tun_param | cv_roc_auc |
|---|---|---|
| 1 | {'bagging_fraction': 0.6832649920297882, 'learning_rate': 0.08491106396508266, 'max_depth': 7, 'num_iterations': 100} | 0.982525 |
| 0 | {'bagging_fraction': 0.8608926606533922, 'learning_rate': 0.04263524339823467, 'max_depth': 9, 'num_iterations': 250} | 0.980851 |
| 7 | {'bagging_fraction': 0.720694670763939, 'learning_rate': 0.03540258497810973, 'max_depth': 8, 'num_iterations': 250} | 0.980705 |
| 6 | {'bagging_fraction': 0.8655896154122709, 'learning_rate': 0.07974806260814968, 'max_depth': 5, 'num_iterations': 100} | 0.980403 |
| 8 | {'bagging_fraction': 0.8324012409439627, 'learning_rate': 0.08192613707707953, 'max_depth': 2, 'num_iterations': 100} | 0.980156 |
| 2 | {'bagging_fraction': 0.513212829128326, 'learning_rate': 0.02947968410710311, 'max_depth': 5, 'num_iterations': 100} | 0.978885 |
| 3 | {'bagging_fraction': 0.6725562945020598, 'learning_rate': 0.02627670694304666, 'max_depth': 7, 'num_iterations': 100} | 0.977479 |
| 9 | {'bagging_fraction': 0.7564047931344857, 'learning_rate': 0.005768237590992054, 'max_depth': 5, 'num_iterations': 100} | 0.96622 |
| 4 | {'bagging_fraction': 0.5503319176774703, 'learning_rate': 0.011888356094332886, 'max_depth': 2, 'num_iterations': 100} | 0.964302 |
| 5 | {'bagging_fraction': 0.80911447600028, 'learning_rate': 0.003551497272993853, 'max_depth': 5, 'num_iterations': 100} | 0.962228 |
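
Each `cv_roc_auc` value above comes from a run with `num_folds=3` and `shuffle=False`. The notebook does not show how `KfoldsCV` splits the data, so the following is only a sketch of unshuffled K-fold splitting under the assumption of contiguous folds; `kfold_indices` is a hypothetical helper, not part of the `kfolds` module.

```python
def kfold_indices(n_samples, num_folds):
    """Yield (train_idx, val_idx) pairs for contiguous, unshuffled folds."""
    # The first (n_samples % num_folds) folds receive one extra row.
    base, extra = divmod(n_samples, num_folds)
    start = 0
    for fold in range(num_folds):
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


# Toy example: 7 rows split into 3 folds.
for train_idx, val_idx in kfold_indices(n_samples=7, num_folds=3):
    print('train:', train_idx, 'validation:', val_idx)
```

Every row is used for validation exactly once, and a CV metric is then typically the average of the per-fold scores.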


#### Pre-selecting features

In [12]:
from sklearn.linear_model import LogisticRegression

In [13]:
# Grid of hyper-parameters:
grid_param = {'bagging_fraction': uniform(0.5, 0.5),
              'learning_rate': uniform(0.0001, 0.1),
              'max_depth': randint(1, 10),
              'num_iterations': [100, 250, 500]}
default_param = {'bagging_fraction': 0.75, 'learning_rate': 0.01, 'max_depth': 10, 'num_iterations': 500}

# Parameters for features selection:
selection_params = {
    'method': 'supervised', 'threshold': 0,
    'estimator': LogisticRegression(C=1.0, penalty='l1', solver='liblinear')
}

# Creating K-folds CV object:
kfolds = KfoldsCV(task='binary', method='light_gbm', num_folds=3, metric='roc_auc', shuffle=False,
                  random_search=True, n_samples=10,
                  grid_param=grid_param, default_param=default_param,
                  pre_selecting=True, pre_selecting_params=selection_params,
                  parallelize=False)

# Running K-folds CV:
kfolds.run(inputs=df_train.drop(drop_vars, axis=1), output=df_train['y'])

# Defining best tuning hyper-parameter:
best_param = kfolds.best_param

From 1282 features, 211 were selected!
From 1282 features, 201 were selected!
From 1282 features, 216 were selected!
From 1282 features, 211 were selected!
From 1282 features, 200 were selected!
From 1282 features, 217 were selected!
From 1282 features, 213 were selected!
From 1282 features, 203 were selected!
From 1282 features, 216 were selected!
From 1282 features, 210 were selected!
From 1282 features, 200 were selected!
From 1282 features, 216 were selected!
From 1282 features, 211 were selected!
From 1282 features, 198 were selected!
From 1282 features, 219 were selected!
From 1282 features, 211 were selected!
From 1282 features, 201 were selected!
From 1282 features, 216 were selected!
From 1282 features, 209 were selected!
From 1282 features, 202 were selected!
From 1282 features, 217 were selected!
From 1282 features, 209 were selected!
From 1282 features, 201 were selected!
From 1282 features, 219 were selected!
From 1282 features, 208 were selected!
From 1282 features, 200 were selected!
From 1282 features, 215 were selected!
From 1282 features, 210 were selected!
From 1282 features, 198 were selected!
Grid estimation progress: [--------------------------------------] 100%

From 1282 features, 217 were selected!
---------------------------------------------------------------------
K-folds CV outcomes:
Number of data folds: 3.
Number of samples for random search: 10.
Estimation method: light gbm.
Metric for choosing best hyper-parameter: roc_auc.
Best hyper-parameters: {'bagging_fraction': 0.6393204625867241, 'learning_rate': 0.06452780151570851, 'max_depth': 6, 'num_iterations': 100}.
CV performance metric associated with best hyper-parameters: 0.9832.
---------------------------------------------------------------------


------------------------------------
Running time: 0.86 minutes.
Start time: 2021-07-18, 13:30:52
End time: 2021-07-18, 13:31:44
------------------------------------
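
The `selection_params` above pair an L1-penalized logistic regression with `'threshold': 0`. How `KfoldsCV` applies them is not shown in this notebook. One plausible reading, sketched below on synthetic data, is to fit the estimator and keep the features whose absolute coefficient exceeds the threshold. Both the data and the selection rule are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the training data (not the notebook's dataset).
X, y = make_classification(n_samples=300, n_features=40, n_informative=5,
                           n_redundant=0, random_state=0)

# Same estimator settings as in selection_params above.
estimator = LogisticRegression(C=1.0, penalty='l1', solver='liblinear')
estimator.fit(X, y)

# Assumed rule: keep features whose absolute coefficient exceeds the threshold.
threshold = 0
mask = np.abs(estimator.coef_).ravel() > threshold
X_selected = X[:, mask]

print(f'From {X.shape[1]} features, {int(mask.sum())} were selected!')
```

Because the L1 penalty can drive coefficients exactly to zero, a threshold of 0 drops only the features the model ignores entirely. Refitting the selector within each fold would also explain why the number of selected features varies slightly across the messages above.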


#### Early stopping during each of the K-folds estimations

In [14]:
# Grid of hyper-parameters:
grid_param = {'bagging_fraction': uniform(0.5, 0.5),
              'learning_rate': uniform(0.0001, 0.1),
              'max_depth': randint(1, 10),
              'num_iterations': [250],
              'early_stopping_rounds': [20],
              'metric': ['auc']
             }
default_param = {'bagging_fraction': 0.75, 'learning_rate': 0.01, 'max_depth': 10, 'num_iterations': 500}

# Parameters for features selection:
selection_params = {
    'method': 'supervised', 'threshold': 0,
    'estimator': LogisticRegression(C=1.0, penalty='l1', solver='liblinear')
}

# Creating K-folds CV object:
kfolds = KfoldsCV(task='binary', method='light_gbm', num_folds=3, metric='roc_auc', shuffle=False,
                  random_search=True, n_samples=10,
                  grid_param=grid_param, default_param=default_param,
                  pre_selecting=True, pre_selecting_params=selection_params,
                  parallelize=False)

# Running K-folds CV:
kfolds.run(inputs=df_train.drop(drop_vars, axis=1), output=df_train['y'])

# Defining best tuning hyper-parameter:
best_param = kfolds.best_param

From 1282 features, 210 were selected!
From 1282 features, 200 were selected!
From 1282 features, 217 were selected!
From 1282 features, 207 were selected!
From 1282 features, 198 were selected!
From 1282 features, 217 were selected!
From 1282 features, 210 were selected!
From 1282 features, 202 were selected!
From 1282 features, 217 were selected!
From 1282 features, 212 were selected!
From 1282 features, 201 were selected!
From 1282 features, 217 were selected!
From 1282 features, 212 were selected!
From 1282 features, 197 were selected!
From 1282 features, 216 were selected!
From 1282 features, 212 were selected!
From 1282 features, 202 were selected!
From 1282 features, 216 were selected!
From 1282 features, 210 were selected!
From 1282 features, 201 were selected!
From 1282 features, 217 were selected!
From 1282 features, 214 were selected!
From 1282 features, 199 were selected!
From 1282 features, 218 were selected!
From 1282 features, 210 were selected!
From 1282 features, 201 were selected!
From 1282 features, 219 were selected!
From 1282 features, 210 were selected!
From 1282 features, 202 were selected!
Grid estimation progress: [--------------------------------------] 100%

From 1282 features, 218 were selected!
---------------------------------------------------------------------
K-folds CV outcomes:
Number of data folds: 3.
Number of samples for random search: 10.
Estimation method: light gbm.
Metric for choosing best hyper-parameter: roc_auc.
Best hyper-parameters: {'bagging_fraction': 0.5780735721079513, 'learning_rate': 0.06945847301187266, 'max_depth': 4, 'num_iterations': 250, 'early_stopping_rounds': 20, 'metric': 'auc'}.
CV performance metric associated with best hyper-parameters: 0.9822.
---------------------------------------------------------------------


------------------------------------
Running time: 0.73 minutes.
Start time: 2021-07-18, 13:31:44
End time: 2021-07-18, 13:32:27
------------------------------------
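
The run above sets `'early_stopping_rounds': 20` with `'metric': 'auc'`. The general rule behind such a setting is to stop boosting once the validation metric has not improved for a given number of consecutive rounds. A minimal sketch of that rule for a higher-is-better metric follows; the function name and the score sequence are hypothetical, and the libraries implement this internally.

```python
def early_stopping(val_scores, early_stopping_rounds):
    """Return (stop_round, best_round) for a metric where higher is better.

    Training stops at the first round that is `early_stopping_rounds`
    rounds past the best score seen so far.
    """
    best_score = float('-inf')
    best_round = -1
    for i, score in enumerate(val_scores):
        if score > best_score:
            best_score, best_round = score, i
        elif i - best_round >= early_stopping_rounds:
            return i, best_round
    # The stopping rule never triggered: all rounds were used.
    return len(val_scores) - 1, best_round


# Illustrative validation AUC per boosting round (made-up values).
scores = [0.90, 0.93, 0.95, 0.94, 0.95, 0.949, 0.948]
stop_round, best_round = early_stopping(scores, early_stopping_rounds=3)
print(stop_round, best_round)
```

With a patience of 3, the loop stops at round 5 because the best score was reached at round 2 and no later round improved on it.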


<a id='kfolds_xgboost'></a>

### XGBoost

Click [here](https://xgboost.readthedocs.io/en/latest/index.html) for the documentation of the XGBoost library.

In [15]:
# Grid of hyper-parameters:
grid_param = {'subsample': uniform(0.5, 0.5),
              'eta': uniform(0.0001, 0.1),
              'max_depth': randint(1, 10),
              'num_boost_round': [100, 250, 500]}
default_param = {'subsample': 0.75, 'eta': 0.01, 'max_depth': 10, 'num_boost_round': 100}

# Creating K-folds CV object:
kfolds = KfoldsCV(task='binary:logistic', method='xgboost', num_folds=3, metric='roc_auc', shuffle=False,
                  random_search=True, n_samples=10,
                  grid_param=grid_param, default_param=default_param,
                  pre_selecting=False,
                  parallelize=False)

# Running K-folds CV:
kfolds.run(inputs=df_train.drop(drop_vars, axis=1), output=df_train['y'])

Grid estimation progress: [--------------------------------------] 100%

---------------------------------------------------------------------
K-folds CV outcomes:
Number of data folds: 3.
Number of samples for random search: 10.
Estimation method: xgboost.
Metric for choosing best hyper-parameter: roc_auc.
Best hyper-parameters: {'subsample': 0.9996099353984378, 'eta': 0.05260007383864195, 'max_depth': 3, 'num_boost_round': 250}.
CV performance metric associated with best hyper-parameters: 0.9809.
---------------------------------------------------------------------


------------------------------------
Running time: 8.6 minutes.
Start time: 2021-07-18, 13:32:27
End time: 2021-07-18, 13:41:03
------------------------------------


In [16]:
# CV metrics:
kfolds.CV_metric.sort_values('cv_roc_auc',
                             ascending=False).style.set_properties(subset=['tun_param'], **{'width': '300px'})

| | tun_param | cv_roc_auc |
|---|---|---|
| 7 | {'subsample': 0.9996099353984378, 'eta': 0.05260007383864195, 'max_depth': 3, 'num_boost_round': 250} | 0.980909 |
| 4 | {'subsample': 0.7625820775305374, 'eta': 0.04381120313104465, 'max_depth': 8, 'num_boost_round': 500} | 0.980256 |
| 8 | {'subsample': 0.9983049580837712, 'eta': 0.04993105449021878, 'max_depth': 9, 'num_boost_round': 250} | 0.979666 |
| 2 | {'subsample': 0.5530401948568808, 'eta': 0.0667633300311981, 'max_depth': 5, 'num_boost_round': 500} | 0.978711 |
| 0 | {'subsample': 0.798463742093148, 'eta': 0.06221069321952554, 'max_depth': 3, 'num_boost_round': 250} | 0.978672 |
| 5 | {'subsample': 0.7826607740727265, 'eta': 0.09324628928635169, 'max_depth': 9, 'num_boost_round': 250} | 0.978445 |
| 3 | {'subsample': 0.9475319617512563, 'eta': 0.06452459453363402, 'max_depth': 5, 'num_boost_round': 500} | 0.978408 |
| 1 | {'subsample': 0.941260472119148, 'eta': 0.07852820147476218, 'max_depth': 4, 'num_boost_round': 500} | 0.978156 |
| 9 | {'subsample': 0.6372518708338644, 'eta': 0.07321161675319736, 'max_depth': 6, 'num_boost_round': 100} | 0.976528 |
| 6 | {'subsample': 0.6794803876204119, 'eta': 0.06437992563757859, 'max_depth': 1, 'num_boost_round': 250} | 0.976426 |


#### Early stopping during each of the K-folds estimations

In [17]:
# Grid of hyper-parameters:
grid_param = {'subsample': uniform(0.5, 0.5),
              'eta': uniform(0.0001, 0.1),
              'max_depth': randint(1, 10),
              'num_boost_round': [100, 250, 500],
              'early_stopping_rounds': [20],
              'eval_metric': ['auc']
             }
default_param = {'subsample': 0.75, 'eta': 0.01, 'max_depth': 10, 'num_boost_round': 100}

# Creating K-folds CV object:
kfolds = KfoldsCV(task='binary:logistic', method='xgboost', num_folds=3, metric='roc_auc', shuffle=False,
                  random_search=True, n_samples=10,
                  grid_param=grid_param, default_param=default_param,
                  pre_selecting=False,
                  parallelize=False)

# Running K-folds CV:
kfolds.run(inputs=df_train.drop(drop_vars, axis=1), output=df_train['y'])

Grid estimation progress: [--------------------------------------] 100%

---------------------------------------------------------------------
K-folds CV outcomes:
Number of data folds: 3.
Number of samples for random search: 10.
Estimation method: xgboost.
Metric for choosing best hyper-parameter: roc_auc.
Best hyper-parameters: {'subsample': 0.9453305466279336, 'eta': 0.06991058585004947, 'max_depth': 8, 'num_boost_round': 500, 'early_stopping_rounds': 20, 'eval_metric': 'auc'}.
CV performance metric associated with best hyper-parameters: 0.9805.
---------------------------------------------------------------------


------------------------------------
Running time: 2.49 minutes.
Start time: 2021-07-18, 13:41:03
End time: 2021-07-18, 13:43:33
------------------------------------


<a id='kfolds_fit_assess'></a>

## Assessing K-folds fit

<a id='kfolds_fit_svm_class'></a>

### SVM classifier

In [19]:
# Declare grid of hyper-parameters:
params = {'C': [1],
          'kernel': ['poly'],
          'degree': [1, 2, 3, 4],
          'gamma': ['scale']}
params_default = {'C': 1.0, 'kernel': 'poly', 'degree': 1, 'gamma': 'scale'}
fixed_params = {'probability': True}

# Declare K-folds CV estimation object:
kfolds = Kfolds_fit(task='classification', method='SVM',
                    metric='roc_auc', num_folds=3, random_search=False, shuffle=False,
                    grid_param=params, default_param=params_default, fixed_params=fixed_params,
                    pre_selecting=False,
                    parallelize=False)

# Running train-test estimation:
kfolds.fit(train_inputs=df_train.drop(drop_vars, axis=1),
           train_output=df_train['y'],
           test_inputs=df_test.drop(drop_vars, axis=1),
           test_output=df_test['y'])

Grid estimation progress: [--------------------------------------] 100%

---------------------------------------------------------------------
Train-test estimation outcomes:


Outcomes from K-folds CV estimation:
   Number of data folds: 3.
   Estimation method: SVM.
   Metric for choosing best hyper-parameter: roc_auc.
   Best hyper-parameters: {'C': 1, 'kernel': 'poly', 'degree': 1, 'gamma': 'scale'}.
   CV performance metric associated with best hyper-parameters: 0.9639.


Performance metrics evaluated at test data:
   test_roc_auc = 0.9833
   test_prec_avg = 0.9236
   test_brier = 0.0091
---------------------------------------------------------------------


------------------------------------
Running time: 1.74 minutes.
Start time: 2021-07-18, 13:44:35
End time: 2021-07-18, 13:46:19
------------------------------------
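
The test metrics reported above are `test_roc_auc`, `test_prec_avg` and `test_brier`. Assuming these stand for ROC AUC, average precision and the Brier score (the notebook does not define them), the sketch below computes each from scratch on a toy example. It is meant to show what the numbers measure, not to reproduce how `Kfolds_fit` computes them.

```python
def roc_auc(y_true, y_score):
    """Share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, y_score):
    """Mean of the precision values at the rank of each positive.

    Simplified: tied scores are not grouped.
    """
    ranked = sorted(zip(y_score, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits


def brier(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)


# Toy labels and predicted probabilities (illustrative values only).
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

print(roc_auc(y_true, y_prob))
print(average_precision(y_true, y_prob))
print(brier(y_true, y_prob))
```

Higher is better for the first two metrics, while a lower Brier score indicates better calibrated probabilities.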


<a id='kfolds_fit_gbm_parallel'></a>

### Parallel estimation (GBM)

#### Sequential train-validation estimation

In [22]:
# Declare grid of hyper-parameters:
params = {'subsample': [0.75],
          'learning_rate': [0.0001, 0.001, 0.01],
          'max_depth': [1, 3, 5],
          'n_estimators': [500]}
params_default = {'subsample': 0.75,
                  'learning_rate': 0.01,
                  'max_depth': 10,
                  'n_estimators': 500}
fixed_params = {'warm_start': True}

# Declare K-folds CV estimation object:
train_test_est = Kfolds_fit(task='classification', method='GBM', metric='roc_auc', num_folds=3, shuffle=False,
                            random_search=False,
                            grid_param=params, default_param=params_default, fixed_params=fixed_params,
                            pre_selecting=False,
                            parallelize=False)

# Running train-test estimation:
train_test_est.fit(train_inputs=df_train.drop(drop_vars, axis=1),
                   train_output=df_train['y'],
                   test_inputs=df_test.drop(drop_vars, axis=1),
                   test_output=df_test['y'])

Grid estimation progress: [--------------------------------------] 100%

---------------------------------------------------------------------
Train-test estimation outcomes:


Outcomes from K-folds CV estimation:
   Number of data folds: 3.
   Estimation method: GBM.
   Metric for choosing best hyper-parameter: roc_auc.
   Best hyper-parameters: {'subsample': 0.75, 'learning_rate': 0.01, 'max_depth': 5, 'n_estimators': 500}.
   CV performance metric associated with best hyper-parameters: 0.9609.


Performance metrics evaluated at test data:
   test_roc_auc = 0.9904
   test_prec_avg = 0.9484
   test_brier = 0.0043
---------------------------------------------------------------------


------------------------------------
Running time: 41.26 minutes.
Start time: 2021-07-18, 13:48:03
End time: 2021-07-18, 14:29:19
------------------------------------


#### Parallel train-validation estimation

In [24]:
# Declare grid of hyper-parameters:
params = {'subsample': [0.75],
          'learning_rate': [0.0001, 0.001, 0.01],
          'max_depth': [1, 3, 5],
          'n_estimators': [500]}
params_default = {'subsample': 0.75,
                  'learning_rate': 0.01,
                  'max_depth': 10,
                  'n_estimators': 500}
fixed_params = {'warm_start': True}

# Declare K-folds CV estimation object:
train_test_est = Kfolds_fit(task='classification', method='GBM', metric='roc_auc', num_folds=3, shuffle=False,
                            random_search=False,
                            pre_selecting=False,
                            grid_param=params, default_param=params_default, fixed_params=fixed_params,
                            parallelize=True)

# Running train-test estimation:
train_test_est.fit(train_inputs=df_train.drop(drop_vars, axis=1),
                   train_output=df_train['y'],
                   test_inputs=df_test.drop(drop_vars, axis=1),
                   test_output=df_test['y'])

Grid estimation progress: [--------------------------------------] 100%

---------------------------------------------------------------------
Train-test estimation outcomes:


Outcomes from K-folds CV estimation:
   Number of data folds: 3.
   Estimation method: GBM.
   Metric for choosing best hyper-parameter: roc_auc.
   Best hyper-parameters: {'subsample': 0.75, 'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 500}.
   CV performance metric associated with best hyper-parameters: 0.9601.


Performance metrics evaluated at test data:
   test_roc_auc = 0.9859
   test_prec_avg = 0.9481
   test_brier = 0.0043
---------------------------------------------------------------------


------------------------------------
Running time: 17.41 minutes.
Start time: 2021-07-18, 14:33:32
End time: 2021-07-18, 14:50:57
------------------------------------
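
With `parallelize=True`, the running time reported above fell from 41.26 to 17.41 minutes, roughly 2.4 times faster. The notebook does not show how `Kfolds_fit` parallelizes its work. As a generic illustration of the idea, the sketch below evaluates grid candidates concurrently with the standard library. The `evaluate` function is a hypothetical placeholder that returns a made-up score instead of running cross-validation, and threads are used only to keep the example portable; CPU-bound model fitting would more typically use processes.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor


def evaluate(params):
    """Hypothetical stand-in for one K-folds CV run on `params`.

    Returns a made-up score that peaks at max_depth=3, learning_rate=0.01.
    """
    score = -abs(params['max_depth'] - 3) - abs(params['learning_rate'] - 0.01)
    return params, score


grid = {'learning_rate': [0.0001, 0.001, 0.01],
        'max_depth': [1, 3, 5]}
candidates = [dict(zip(grid, values))
              for values in itertools.product(*grid.values())]

# Each candidate is independent, so the evaluations can run concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(evaluate, candidates))

best_params, best_score = max(results, key=lambda item: item[1])
print(best_params)
```

Because `pool.map` returns results in submission order, the outcome is the same as a sequential loop over the candidates.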


<a id='kfolds_fit_xgboost_es'></a>

### XGBoost with early stopping

#### No early stopping

In [25]:
# Grid of hyper-parameters:
grid_param = {'subsample': [0.75],
              'eta': [0.001, 0.01, 0.1],
              'max_depth': [1, 3, 5],
              'num_boost_round': [200]}
default_param = {'subsample': 0.75, 'eta': 0.01, 'max_depth': 10, 'num_boost_round': 100}

# Creating K-folds CV object:
kfolds = Kfolds_fit(task='binary:logistic', method='xgboost', num_folds=3, metric='roc_auc', shuffle=False,
                    random_search=False,
                    pre_selecting=False,
                    grid_param=grid_param, default_param=default_param,
                    parallelize=False)

# Running train-test estimation:
kfolds.fit(train_inputs=df_train.drop(drop_vars, axis=1), train_output=df_train['y'],
           test_inputs=df_test.drop(drop_vars, axis=1), test_output=df_test['y'])

Grid estimation progress: [--------------------------------------] 100%

---------------------------------------------------------------------
Train-test estimation outcomes:


Outcomes from K-folds CV estimation:
   Number of data folds: 3.
   Estimation method: xgboost.
   Metric for choosing best hyper-parameter: roc_auc.
   Best hyper-parameters: {'subsample': 0.75, 'eta': 0.1, 'max_depth': 5, 'num_boost_round': 200}.
   CV performance metric associated with best hyper-parameters: 0.9793.


Performance metrics evaluated at test data:
   test_roc_auc = 0.9912
   test_prec_avg = 0.9611
   test_brier = 0.0047
---------------------------------------------------------------------


------------------------------------
Running time: 4.15 minutes.
Start time: 2021-07-18, 14:50:57
End time: 2021-07-18, 14:55:05
------------------------------------
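
The grid above contains 1 × 3 × 3 × 1 = 9 combinations of hyper-parameters, and each one is evaluated by K-folds CV. As a rough illustration of how such a grid can be expanded into individual candidate settings, the sketch below uses `itertools.product`. The helper name `expand_grid` is hypothetical and is not part of `Kfolds_fit`, whose internal implementation may differ.

```python
from itertools import product

# Same grid as in the cell above.
grid_param = {'subsample': [0.75],
              'eta': [0.001, 0.01, 0.1],
              'max_depth': [1, 3, 5],
              'num_boost_round': [200]}


def expand_grid(grid):
    """Return one dictionary per combination of the values listed in `grid`.

    Hypothetical helper for illustration only.
    """
    names = list(grid)
    return [dict(zip(names, values))
            for values in product(*(grid[name] for name in names))]


combinations = expand_grid(grid_param)
print(len(combinations))  # 9
print(combinations[0])
```

With 3 folds, a grid of this size implies 9 × 3 = 27 model fits before the final estimation on the entire training data.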


#### Early stopping

In [26]:
# Grid of hyper-parameters:
grid_param = {'subsample': [0.75],
              'eta': [0.001, 0.01, 0.1],
              'max_depth': [1, 3, 5],
              'num_boost_round': [200],
              'eval_metric': ['auc'],
              'early_stopping_rounds': [20]}
default_param = {'subsample': 0.75, 'eta': 0.01, 'max_depth': 10, 'num_boost_round': 100}

# Creating K-folds CV object:
kfolds = Kfolds_fit(task='binary:logistic', method='xgboost', num_folds=3, metric='roc_auc', shuffle=False,
                    random_search=False,
                    pre_selecting=False,
                    grid_param=grid_param, default_param=default_param,
                    parallelize=False)

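# Running K-folds CV (val_inputs/val_output provide the data monitored for early stopping;
# the test set is reused here for illustration only, which can make the test metrics optimistic):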
kfolds.fit(train_inputs=df_train.drop(drop_vars, axis=1), train_output=df_train['y'],
           val_inputs=df_test.drop(drop_vars, axis=1), val_output=df_test['y'],
           test_inputs=df_test.drop(drop_vars, axis=1), test_output=df_test['y'])

Grid estimation progress: [--------------------------------------] 100%

---------------------------------------------------------------------
Train-test estimation outcomes:


Outcomes from K-folds CV estimation:
   Number of data folds: 3.
   Estimation method: xgboost.
   Metric for choosing best hyper-parameter: roc_auc.
   Best hyper-parameters: {'subsample': 0.75, 'eta': 0.1, 'max_depth': 5, 'num_boost_round': 200, 'eval_metric': 'auc', 'early_stopping_rounds': 20}.
   CV performance metric associated with best hyper-parameters: 0.981.


Performance metrics evaluated at test data:
   test_roc_auc = 0.9936
   test_prec_avg = 0.9608
   test_brier = 0.0044
---------------------------------------------------------------------


------------------------------------
Running time: 1.33 minutes.
Start time: 2021-07-18, 14:55:05
End time: 2021-07-18, 14:56:25
------------------------------------
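
With `early_stopping_rounds` set to 20, boosting halts once the validation metric has failed to improve for 20 consecutive rounds, which is consistent with the shorter running time of this cell compared with the run without early stopping. The sketch below illustrates only the stopping rule, applied to a hypothetical list of per-round validation scores. The function name `best_round_with_early_stopping` and the toy scores are assumptions made for illustration; this is not XGBoost or `Kfolds_fit` code.

```python
def best_round_with_early_stopping(val_scores, early_stopping_rounds):
    """Apply an early-stopping rule to a sequence of validation scores.

    val_scores: validation metric after each boosting round, where higher
        is better (as with AUC).
    early_stopping_rounds: number of consecutive rounds without improvement
        after which training stops.

    Returns (best_round, best_score, rounds_run), with rounds indexed from 0.
    """
    best_round, best_score = -1, float("-inf")
    rounds_run = 0
    for rnd, score in enumerate(val_scores):
        rounds_run = rnd + 1
        if score > best_score:
            # New best validation score: remember where it happened.
            best_round, best_score = rnd, score
        elif rnd - best_round >= early_stopping_rounds:
            # No improvement for `early_stopping_rounds` rounds: stop.
            break
    return best_round, best_score, rounds_run


# Hypothetical validation AUC per boosting round.
toy_scores = [0.90, 0.92, 0.95, 0.94, 0.94, 0.93, 0.96]

print(best_round_with_early_stopping(toy_scores, early_stopping_rounds=3))
# (2, 0.95, 6)
print(best_round_with_early_stopping(toy_scores, early_stopping_rounds=20))
# (6, 0.96, 7)
```

The second call shows the trade-off: a larger patience lets training reach a later improvement, at the cost of running more rounds.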




<a id='kfolds_fit_lr_sel_feats'></a>

### Logistic regression with pre-selection of features

In [27]:
# Grid of hyper-parameters:
grid_param = {'C': [0.0001, 0.0003, 0.001, 0.003, 0.01, 0.03, 0.1, 0.25, 0.3, 0.5, 0.75, 1, 3, 10]}
default_param = {'C': 1.0}
fixed_params = {'penalty': 'l1', 'solver': 'liblinear', 'warm_start': True}

# Parameters for feature selection:
selection_params = {
    'method': 'supervised', 'threshold': 0,
    'estimator': LogisticRegression(C=1.0, penalty='l1', solver='liblinear')
}

# Creating K-folds CV object:
kfolds = Kfolds_fit(task='classification', method='logistic_regression', num_folds=3, metric='roc_auc',
                    shuffle=False,
                    random_search=False,
                    grid_param=grid_param, default_param=default_param, fixed_params=fixed_params,
                    pre_selecting=True, pre_selecting_params=selection_params, only_final_selection=False,
                    parallelize=False)

# Running K-folds CV:
kfolds.fit(train_inputs=df_train.drop(drop_vars, axis=1), train_output=df_train['y'],
           test_inputs=df_test.drop(drop_vars, axis=1), test_output=df_test['y'])

Grid estimation progress: [                                      ]   0%

From 1282 features, 208 were selected!
From 1282 features, 201 were selected!

Grid estimation progress: [--                                    ]   7%

From 1282 features, 213 were selected!
From 1282 features, 211 were selected!
From 1282 features, 201 were selected!

Grid estimation progress: [-----                                 ]  14%

From 1282 features, 217 were selected!
From 1282 features, 211 were selected!
From 1282 features, 200 were selected!

Grid estimation progress: [--------                              ]  21%

From 1282 features, 218 were selected!
From 1282 features, 212 were selected!
From 1282 features, 199 were selected!

Grid estimation progress: [----------                            ]  28%

From 1282 features, 215 were selected!
From 1282 features, 210 were selected!
From 1282 features, 199 were selected!
From 1282 features, 219 were selected!

Grid estimation progress: [-------------                         ]  35%

From 1282 features, 211 were selected!
From 1282 features, 198 were selected!
From 1282 features, 219 were selected!

Grid estimation progress: [----------------                      ]  42%

From 1282 features, 211 were selected!
From 1282 features, 201 were selected!
From 1282 features, 216 were selected!

Grid estimation progress: [-------------------                   ]  50%

From 1282 features, 209 were selected!
From 1282 features, 202 were selected!
From 1282 features, 216 were selected!

Grid estimation progress: [---------------------                 ]  57%

From 1282 features, 213 were selected!
From 1282 features, 200 were selected!
From 1282 features, 217 were selected!

Grid estimation progress: [------------------------              ]  64%

From 1282 features, 208 were selected!
From 1282 features, 199 were selected!
From 1282 features, 220 were selected!

Grid estimation progress: [---------------------------           ]  71%

From 1282 features, 206 were selected!
From 1282 features, 202 were selected!
From 1282 features, 215 were selected!

Grid estimation progress: [-----------------------------         ]  78%

From 1282 features, 211 were selected!
From 1282 features, 201 were selected!
From 1282 features, 217 were selected!

Grid estimation progress: [--------------------------------      ]  85%

From 1282 features, 211 were selected!
From 1282 features, 202 were selected!
From 1282 features, 213 were selected!

Grid estimation progress: [-----------------------------------   ]  92%

From 1282 features, 214 were selected!
From 1282 features, 201 were selected!
From 1282 features, 217 were selected!

Grid estimation progress: [--------------------------------------] 100%

From 1282 features, 277 were selected!
---------------------------------------------------------------------
Train-test estimation outcomes:


Outcomes from K-folds CV estimation:
   Number of data folds: 3.
   Estimation method: logistic regression.
   Metric for choosing best hyper-parameter: roc_auc.
   Best hyper-parameters: {'C': 0.1}.
   CV performance metric associated with best hyper-parameters: 0.9719.


Performance metrics evaluated at test data:
   test_roc_auc = 0.9864
   test_prec_avg = 0.9327
   test_brier = 0.0087
---------------------------------------------------------------------


------------------------------------
Running time: 1.7 minutes.
Start time: 2021-07-18, 14:56:25
End time: 2021-07-18, 14:58:07
------------------------------------


In [28]:
# Grid of hyper-parameters:
grid_param = {'C': [0.0001, 0.0003, 0.001, 0.003, 0.01, 0.03, 0.1, 0.25, 0.3, 0.5, 0.75, 1, 3, 10]}
default_param = {'C': 1.0}
fixed_params = {'penalty': 'l1', 'solver': 'liblinear', 'warm_start': True}

# Parameters for feature selection:
selection_params = {
    'method': 'supervised', 'threshold': 0,
    'estimator': LogisticRegression(C=1.0, penalty='l1', solver='liblinear')
}

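# Creating K-folds CV object (same setup as the previous cell, except only_final_selection=True,
# so features are selected once for the final fit rather than within every fold estimation):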
kfolds = Kfolds_fit(task='classification', method='logistic_regression', num_folds=3, metric='roc_auc',
                    shuffle=False,
                    random_search=False,
                    grid_param=grid_param, default_param=default_param, fixed_params=fixed_params,
                    pre_selecting=True, pre_selecting_params=selection_params, only_final_selection=True,
                    parallelize=False)

# Running K-folds CV:
kfolds.fit(train_inputs=df_train.drop(drop_vars, axis=1), train_output=df_train['y'],
           test_inputs=df_test.drop(drop_vars, axis=1), test_output=df_test['y'])

Grid estimation progress: [--------------------------------------] 100%

From 1282 features, 278 were selected!
---------------------------------------------------------------------
Train-test estimation outcomes:


Outcomes from K-folds CV estimation:
   Number of data folds: 3.
   Estimation method: logistic regression.
   Metric for choosing best hyper-parameter: roc_auc.
   Best hyper-parameters: {'C': 0.1}.
   CV performance metric associated with best hyper-parameters: 0.9723.


Performance metrics evaluated at test data:
   test_roc_auc = 0.9864
   test_prec_avg = 0.9327
   test_brier = 0.0087
---------------------------------------------------------------------


------------------------------------
Running time: 0.71 minutes.
Start time: 2021-07-18, 14:58:07
End time: 2021-07-18, 14:58:50
------------------------------------
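
The messages of the form "From 1282 features, N were selected!" report how many inputs survive pre-selection. A plausible reading of `'method': 'supervised'` with `'threshold': 0` is that a feature is kept when the absolute value of its coefficient in the L1-penalized estimator exceeds the threshold. This reading is an assumption about the library's behavior, and both the helper `select_features` and the coefficient values below are hypothetical.

```python
def select_features(coefficients, threshold=0.0):
    """Return the names of features whose |coefficient| exceeds `threshold`.

    Hypothetical helper: it mimics a threshold rule on fitted coefficients
    and is not the selection code used by Kfolds_fit.
    """
    return [name for name, coef in coefficients.items()
            if abs(coef) > threshold]


# Hypothetical coefficients from an L1-penalized fit. The L1 penalty tends to
# set many coefficients to exactly zero, which is what makes threshold 0 useful.
coefficients = {'x1': 0.8, 'x2': 0.0, 'x3': -0.3, 'x4': 0.0, 'x5': 0.05}

selected = select_features(coefficients, threshold=0)
print(f"From {len(coefficients)} features, {len(selected)} were selected!")
# From 5 features, 3 were selected!
```

Under this reading, a lower `C` in the selection estimator (stronger regularization) would zero out more coefficients and therefore keep fewer features.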


<a id='boot_assess'></a>

## Assessing bootstrap estimation

<a id='boot_lr'></a>

### Logistic regression

In [29]:
# Declare grid of hyper-parameters:
params = {'C': [0.1]}
params_default = {'C': 0.1}
fixed_params = {'penalty': 'l1', 'solver': 'liblinear', 'warm_start': True}

# Declare bootstrap estimation object:
boot_estimations = BootstrapEstimation(task='classification', method='logistic_regression',
                                       metric='roc_auc', num_folds=3, shuffle=False,
                                       pre_selecting=False,
                                       random_search=False,
                                       grid_param=params, default_param=params_default, fixed_params=fixed_params,
                                       parallelize=False,
                                       cv=False, replacement=True, n_iterations=1000, bootstrap_scores=True)

# Running bootstrap estimation:
boot_estimations.run(train_inputs=df_train.drop(drop_vars, axis=1),
                     train_output=df_train['y'],
                     test_inputs=df_test.drop(drop_vars, axis=1),
                     test_output=df_test['y'])



---------------------------------------------------------------------------------------------
Bootstrap statistics:
   Number of estimations: 1000.
   avg(roc_auc) = 0.9841
   std(roc_auc) = 0.0017
   avg(prec_avg) = 0.9238
   std(prec_avg) = 0.005
   avg(brier) = 0.0095
   std(brier) = 0.0005


   Performance metrics based on bootstrap scores:
   roc_auc = 0.9865
   prec_avg = 0.9334
   brier = 0.0088
   Hyper-parameters used in estimations: {'C': 0.1}.
---------------------------------------------------------------------------------------------


------------------------------------
Running time: 19.45 minutes.
Start time: 2021-07-18, 14:58:50
End time: 2021-07-18, 15:18:17
------------------------------------
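
The `avg(...)` and `std(...)` figures above summarize a performance metric across many resampled estimations. The sketch below shows the general idea on a small scale using only the standard library. It is a simplification: it resamples already-scored observations with replacement instead of refitting a model in every iteration as `BootstrapEstimation` does, and the names `roc_auc` and `bootstrap_auc` as well as the toy labels and scores are assumptions made for illustration.

```python
import random
import statistics


def roc_auc(labels, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc(labels, scores, n_iterations=1000, seed=0):
    """Return (mean, standard deviation) of ROC AUC over bootstrap samples.

    Requires both classes to be present in `labels` and n_iterations >= 2.
    """
    rng = random.Random(seed)
    n = len(labels)
    aucs = []
    while len(aucs) < n_iterations:
        # Draw n observations with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # AUC is undefined when a sample has a single class.
        aucs.append(roc_auc(ys, [scores[i] for i in idx]))
    return statistics.mean(aucs), statistics.stdev(aucs)


# Hypothetical labels and predicted scores.
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.35, 0.6, 0.4, 0.55, 0.7, 0.8, 0.9]

avg_auc, std_auc = bootstrap_auc(labels, scores, n_iterations=1000)
print(f"avg(roc_auc) = {avg_auc:.4f}")
print(f"std(roc_auc) = {std_auc:.4f}")
```

The standard deviation gives a sense of how much the metric would move under a different draw of the data, which a single train-test split cannot show.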
