## Validation procedures
## Tutorials

This project developed a set of classes that support validation procedures when building models for supervised learning tasks. This tutorial shows how easy and flexible they are to use.
<br>
<br>
Use cases for the classes presented here are as follows:
* *KfoldsCV*, for performing grid/random search of a LightGBM model and an XGBoost model. Pre-selection of features during each of the K-folds estimations is also shown for LightGBM.
* *Kfolds_fit*, for performing grid/random search and then fitting an SVM classifier on the entire training data with the best hyper-parameters. The same is done for a GBM classifier, together with parallelization to reduce overall running time.
* *BootstrapEstimation*, for running a large collection of estimations in order to assess the average and standard deviation of performance metrics, using a regularized logistic regression model.

Note that the estimations are not meant to be as efficient as possible; they focus on illustrating how these classes can be used in real-world applications.
<br>
<br>
The complete collection of learning algorithms covered by the KfoldsCV, Kfolds_fit, and BootstrapEstimation classes is presented below. Each method is followed by its reference library and the hyper-parameters subject to grid or random search. Note that all hyper-parameters are named exactly as they are in their original libraries.
1. Logistic regression (from [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)).
    * Hyper-parameters for tuning: regularization parameter ('C').
<br>
<br>
2. Linear regression (Lasso) (from [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html)).
    * Hyper-parameters for tuning: regularization parameter ('C').
<br>
<br>
3. GBM (from [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html)).
    * Hyper-parameters for tuning: subsample ('subsample'), maximum depth ('max_depth'), learning rate ('learning_rate'), number of estimators ('n_estimators').
<br>
<br>
4. GBM (from [LightGBM](https://lightgbm.readthedocs.io/en/latest/Parameters.html)).
    * Hyper-parameters for tuning: subsample ('bagging_fraction'), maximum depth ('max_depth'), learning rate ('learning_rate'), number of estimators ('num_iterations').
<br>
<br>
5. GBM (from [XGBoost](https://xgboost.readthedocs.io/en/latest/parameter.html#xgboost-parameters)).
    * Hyper-parameters for tuning: subsample ('subsample'), maximum depth ('max_depth'), learning rate ('eta'), number of estimators ('num_boost_round').
<br>
<br>
6. Random forest (from [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)).
    * Hyper-parameters for tuning: number of estimators ('n_estimators'), maximum number of features ('max_features'), and minimum number of samples for split ('min_samples_split').
<br>
<br>
7. SVM (from [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)).
    * Hyper-parameters for tuning: regularization parameter ('C'), kernel ('kernel'), polynomial degree ('degree'), gamma ('gamma').
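
Since the three GBM implementations name the same four hyper-parameters differently, a small lookup table can help when moving a grid from one library to another. The sketch below only restates the names listed above; the `GBM_PARAM_NAMES` table and the `translate_grid` helper are hypothetical and are not part of the classes presented here.

```python
# Hypothetical helper: library-specific names of the four GBM hyper-parameters
# listed above. Neither the table nor the function belongs to KfoldsCV.
GBM_PARAM_NAMES = {
    "sklearn": {"subsample": "subsample", "max_depth": "max_depth",
                "learning_rate": "learning_rate", "n_estimators": "n_estimators"},
    "lightgbm": {"subsample": "bagging_fraction", "max_depth": "max_depth",
                 "learning_rate": "learning_rate", "n_estimators": "num_iterations"},
    "xgboost": {"subsample": "subsample", "max_depth": "max_depth",
                "learning_rate": "eta", "n_estimators": "num_boost_round"},
}


def translate_grid(grid, library):
    """Rename the keys of a grid written with sklearn names for `library`."""
    names = GBM_PARAM_NAMES[library]
    return {names[key]: values for key, values in grid.items()}


sklearn_grid = {"subsample": [0.75], "learning_rate": [0.001, 0.01],
                "max_depth": [1, 3, 5], "n_estimators": [500]}
print(translate_grid(sklearn_grid, "xgboost"))
```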

--------

This notebook imports the developed classes and uses a binary classification dataset to assess their functionalities by applying several distinct statistical learning methods.

**Summary:**
1. [Libraries](#libraries)<a href='#libraries'></a>.
2. [Functions and classes](#functions_classes)<a href='#functions_classes'></a>.
3. [Settings](#settings)<a href='#settings'></a>.
4. [Importing datasets](#imports)<a href='#imports'></a>.
5. [Data pre-processing](#data_pre_proc)<a href='#data_pre_proc'></a>.
6. [Assessing K-folds CV](#kfolds_assess)<a href='#kfolds_assess'></a>.
    * [LightGBM](#kfolds_lightgbm)<a href='#kfolds_lightgbm'></a>.
    * [XGBoost](#kfolds_xgboost)<a href='#kfolds_xgboost'></a>.
<br>
<br>
7. [Assessing K-folds fit](#kfolds_fit_assess)<a href='#kfolds_fit_assess'></a>.
    * [SVM classifier](#kfolds_fit_svm_class)<a href='#kfolds_fit_svm_class'></a>.
    * [Parallel estimation (GBM)](#kfolds_fit_gbm_parallel)<a href='#kfolds_fit_gbm_parallel'></a>.
<br>
<br>
8. [Assessing bootstrap estimation](#boot_assess)<a href='#boot_assess'></a>.
    * [Logistic regression](#boot_lr)<a href='#boot_lr'></a>.

<a id='libraries'></a>

## Libraries

In [1]:
import pandas as pd
import numpy as np
import json
import os

from datetime import datetime
import time
import progressbar

from scipy.stats import uniform, norm, randint

<a id='functions_classes'></a>

## Functions and classes

In [2]:
import utils
from utils import loading_data, running_time

In [3]:
import kfolds
from kfolds import KfoldsCV, Kfolds_fit

import bootstrap
from bootstrap import BootstrapEstimation

<a id='settings'></a>

## Settings

In [4]:
# Define the dataset_id:
dataset_id = 2706

<a id='imports'></a>

## Importing datasets

<a id='feats_label'></a>

### Features and label

#### Training data

In [5]:
print('----------------------------------------')
print(f'\033[1mDataset {dataset_id}:\033[0m')

df_train = loading_data(path=f'Datasets/dataset_{dataset_id}_train.csv',
                        dtype={'order_id': str, 'store_id': int, 'epoch': str},
                        id_var='order_id')

print('----------------------------------------')
print('\n')

# Accessory variables:
drop_vars = ['y', 'order_id', 'epoch', 'date']

----------------------------------------
Dataset 2706:
Shape of df: (7217, 1286).
Number of distinct instances: 7217.
Time period: from 2020-12-31 to 2021-02-17.
----------------------------------------




#### Test data

In [6]:
print('----------------------------------------')
print(f'\033[1mDataset {dataset_id}:\033[0m')

df_test = loading_data(path=f'Datasets/dataset_{dataset_id}_test.csv',
                       dtype={'order_id': str, 'store_id': int, 'epoch': str},
                       id_var='order_id')

print('----------------------------------------')
print('\n')

----------------------------------------
Dataset 2706:
Shape of df: (7217, 1286).
Number of distinct instances: 7217.
Time period: from 2021-02-17 to 2021-03-31.
----------------------------------------




<a id='kfolds_assess'></a>

## Assessing K-folds CV

<a id='kfolds_lightgbm'></a>

### LightGBM

Click [here](https://lightgbm.readthedocs.io/en/latest/index.html) for the documentation of the LightGBM library.

In [18]:
# Grid of hyper-parameters:
grid_param = {'bagging_fraction': uniform(0.5, 0.5),
              'learning_rate': uniform(0.0001, 0.1),
              'max_depth': randint(1, 10),
              'num_iterations': [100, 250, 500]}

# Creating K-folds CV object:
kfolds = KfoldsCV(task = 'binary', method = 'light_gbm', num_folds = 3, metric = 'roc_auc',
                  random_search = True, n_samples = 10,
                  grid_param = grid_param,
                  default_param = {'bagging_fraction': 0.75,
                                   'learning_rate': 0.01,
                                   'max_depth': 10,
                                   'num_iterations': 500})

# Running K-folds CV:
kfolds.run(inputs = df_train.drop(drop_vars, axis=1), output = df_train['y'])

# Defining best tuning hyper-parameter:
best_param = kfolds.best_param

Grid estimation progress: [----------------------------------------------] 100%

---------------------------------------------------------------------
K-folds CV outcomes:
Number of data folds: 3.
Number of samples for random search: 10.
Estimation method: light gbm.
Metric for choosing best hyper-parameter: roc_auc.
Best hyper-parameters: {'bagging_fraction': 0.8564267514042088, 'learning_rate': 0.03307956006799815, 'max_depth': 5, 'num_iterations': 250}.
CV performance metric associated with best hyper-parameters: 0.9818.
---------------------------------------------------------------------


------------------------------------
Running time: 1.07 minutes.
Start time: 2021-05-18, 17:28:52
End time: 2021-05-18, 17:29:56
------------------------------------


In [19]:
# Best tuning hyper-parameters:
kfolds.best_param

{'bagging_fraction': 0.8564267514042088,
 'learning_rate': 0.03307956006799815,
 'max_depth': 5,
 'num_iterations': 250}

In [20]:
# CV metrics:
kfolds.CV_metric.sort_values('cv_roc_auc',
                             ascending=False).style.set_properties(subset=['tun_param'], **{'width': '300px'})

| | tun_param | cv_roc_auc |
|---|---|---|
| 0 | {'bagging_fraction': 0.8564267514042088, 'learning_rate': 0.03307956006799815, 'max_depth': 5, 'num_iterations': 250} | 0.981832 |
| 2 | {'bagging_fraction': 0.6598188679693455, 'learning_rate': 0.06896039726050547, 'max_depth': 8, 'num_iterations': 500} | 0.980826 |
| 8 | {'bagging_fraction': 0.8907666526719143, 'learning_rate': 0.06191495909029268, 'max_depth': 3, 'num_iterations': 100} | 0.980798 |
| 6 | {'bagging_fraction': 0.9430910631497579, 'learning_rate': 0.04818394450453028, 'max_depth': 8, 'num_iterations': 500} | 0.980569 |
| 9 | {'bagging_fraction': 0.9163785959047468, 'learning_rate': 0.09387259961233586, 'max_depth': 6, 'num_iterations': 500} | 0.980061 |
| 5 | {'bagging_fraction': 0.7198296782892253, 'learning_rate': 0.03495206821405517, 'max_depth': 5, 'num_iterations': 100} | 0.978563 |
| 1 | {'bagging_fraction': 0.6188968287608833, 'learning_rate': 0.05093991681484368, 'max_depth': 1, 'num_iterations': 500} | 0.97797 |
| 4 | {'bagging_fraction': 0.9045568719088195, 'learning_rate': 0.02091435858274271, 'max_depth': 7, 'num_iterations': 100} | 0.974292 |
| 3 | {'bagging_fraction': 0.536585497553552, 'learning_rate': 0.04616855762740995, 'max_depth': 1, 'num_iterations': 100} | 0.970582 |
| 7 | {'bagging_fraction': 0.5267165018530952, 'learning_rate': 0.005129053997969813, 'max_depth': 8, 'num_iterations': 100} | 0.958813 |
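
The grid above mixes `scipy.stats` distributions with a plain list. In `scipy.stats`, `uniform(loc, scale)` draws from the interval `[loc, loc + scale]`, so `uniform(0.5, 0.5)` covers 0.5 to 1.0, and `randint(low, high)` excludes `high`, so `randint(1, 10)` yields the integers 1 to 9. The sketch below mimics those ranges with the standard library only, to show what ten random-search candidates might look like. The `sample_candidates` function is hypothetical; the notebook does not show how KfoldsCV draws its samples internally.

```python
import random


def sample_candidates(n_samples, seed=0):
    """Draw random-search candidates over the same ranges as `grid_param`."""
    rng = random.Random(seed)
    candidates = []
    for _ in range(n_samples):
        candidates.append({
            # scipy's uniform(0.5, 0.5) spans [loc, loc + scale] = [0.5, 1.0].
            "bagging_fraction": rng.uniform(0.5, 1.0),
            # uniform(0.0001, 0.1) spans [0.0001, 0.1001].
            "learning_rate": rng.uniform(0.0001, 0.1001),
            # scipy's randint(1, 10) excludes the upper bound: 1, ..., 9.
            "max_depth": rng.randint(1, 9),
            # A plain list is sampled uniformly among its elements.
            "num_iterations": rng.choice([100, 250, 500]),
        })
    return candidates


candidates = sample_candidates(n_samples=10)
for candidate in candidates[:3]:
    print(candidate)
```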


#### Pre-selecting features

In [7]:
# Grid of hyper-parameters:
grid_param = {'bagging_fraction': uniform(0.5, 0.5),
              'learning_rate': uniform(0.0001, 0.1),
              'max_depth': randint(1, 10),
              'num_iterations': [100, 250, 500]}

# Creating K-folds CV object:
kfolds = KfoldsCV(task = 'binary', method = 'light_gbm', num_folds = 3, metric = 'roc_auc',
                  random_search = True, n_samples = 10,
                  grid_param = grid_param,
                  default_param = {'bagging_fraction': 0.75,
                                   'learning_rate': 0.01,
                                   'max_depth': 10,
                                   'num_iterations': 500},
                  pre_selecting=True, pre_selecting_param=1)

# Running K-folds CV:
kfolds.run(inputs = df_train.drop(drop_vars, axis=1), output = df_train['y'])

# Defining best tuning hyper-parameter:
best_param = kfolds.best_param

Grid estimation progress: [----------------------------------------------] 100%

---------------------------------------------------------------------
K-folds CV outcomes:
Number of data folds: 3.
Number of samples for random search: 10.
Estimation method: light gbm.
Metric for choosing best hyper-parameter: roc_auc.
Best hyper-parameters: {'bagging_fraction': 0.9471752089265734, 'learning_rate': 0.034040275610410814, 'max_depth': 7, 'num_iterations': 250}.
CV performance metric associated with best hyper-parameters: 0.9816.
---------------------------------------------------------------------


------------------------------------
Running time: 0.87 minutes.
Start time: 2021-06-13, 18:03:52
End time: 2021-06-13, 18:04:44
------------------------------------


Selected features during each of the K-folds estimations

In [13]:
# Features selected while training with folds 2 and 3:
selected_features = len(kfolds.CV_selected_feat["1"])
all_features = len(df_train.drop(drop_vars, axis=1).columns)

print(f'\033[1m{selected_features} out of {all_features} features were selected during the first estimation:\033[0m')
print('\n')
print(kfolds.CV_selected_feat['1'])

211 out of 1282 features were selected during the first estimation:


['feat_1', 'feat_4', 'feat_5', 'feat_9', 'feat_17', 'feat_55', 'feat_56', 'feat_60', 'feat_62', 'feat_77', 'feat_94', 'feat_111', 'feat_121', 'feat_124', 'feat_128', 'feat_130', 'feat_132', 'feat_133', 'feat_137', 'feat_139', 'feat_142', 'feat_146', 'feat_147', 'feat_149', 'feat_151', 'feat_160', 'feat_163', 'feat_165', 'feat_172', 'feat_174', 'feat_175', 'feat_176', 'feat_180', 'feat_181', 'feat_185', 'feat_191', 'feat_192', 'feat_196', 'feat_198', 'feat_201', 'feat_213', 'feat_215', 'feat_216', 'feat_217', 'feat_221', 'feat_222', 'feat_223', 'feat_227', 'feat_229', 'feat_230', 'feat_232', 'feat_239', 'feat_241', 'feat_242', 'feat_245', 'feat_253', 'feat_254', 'feat_257', 'feat_271', 'feat_278', 'feat_283', 'feat_284', 'feat_286', 'feat_289', 'feat_290', 'feat_294', 'feat_295', 'feat_311', 'feat_313', 'feat_314', 'feat_317', 'feat_319', 'feat_324', 'feat_328', 'feat_330', 'feat_331', 'feat_335', 'feat_337', 
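
The notebook does not say which rule `pre_selecting_param=1` applies, so the following is only a sketch of the general idea of selecting features inside each fold. It assumes, purely for illustration, a variance filter computed on the training folds alone, so that the held-out fold never influences the selection. The helper names, the toy data, and the threshold are all hypothetical; the result is keyed by fold number as a string, mirroring `CV_selected_feat['1']`.

```python
from statistics import pvariance


def kfold_indices(n_rows, num_folds):
    """Assign row i to fold i % num_folds and return one index list per fold."""
    return [list(range(fold, n_rows, num_folds)) for fold in range(num_folds)]


def select_by_variance(rows, feature_names, threshold):
    """Keep features whose variance over `rows` exceeds `threshold`."""
    selected = []
    for position, name in enumerate(feature_names):
        column = [row[position] for row in rows]
        if pvariance(column) > threshold:
            selected.append(name)
    return selected


feature_names = ["feat_1", "feat_2", "feat_3"]
data = [[0.0, 1.0, 5.0], [1.0, 1.0, 5.1], [2.0, 1.0, 4.9],
        [3.0, 1.0, 5.0], [4.0, 1.0, 5.2], [5.0, 1.0, 5.0]]

selected_per_fold = {}
folds = kfold_indices(len(data), num_folds=3)
for k, validation_idx in enumerate(folds, start=1):
    held_out = set(validation_idx)
    # Selection uses only the rows of the other folds, never the held-out fold.
    train_rows = [row for i, row in enumerate(data) if i not in held_out]
    selected_per_fold[str(k)] = select_by_variance(train_rows, feature_names, 0.05)

print(selected_per_fold)
```

Running the selection inside the loop, rather than once on the full training set, keeps the cross-validated metric from being biased by information from the validation fold.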

<a id='kfolds_xgboost'></a>

### XGBoost

Click [here](https://xgboost.readthedocs.io/en/latest/index.html) for the documentation of the XGBoost library.

In [18]:
# Grid of hyper-parameters:
grid_param = {'subsample': uniform(0.5, 0.5),
              'eta': uniform(0.0001, 0.1),
              'max_depth': randint(1, 10),
              'num_boost_round': [100, 250, 500]}

# Creating K-folds CV object:
kfolds = KfoldsCV(task = 'binary:logistic', method = 'xgboost', num_folds = 3, metric = 'roc_auc',
                  random_search = True, n_samples = 10,
                  grid_param = grid_param,
                  default_param = {'subsample': 0.75,
                                   'eta': 0.01,
                                   'max_depth': 10,
                                   'num_boost_round': 100})

# Running K-folds CV:
kfolds.run(inputs = df_train.drop(drop_vars, axis=1), output = df_train['y'])

Grid estimation progress: [----------------------------------------------] 100%

---------------------------------------------------------------------
K-folds CV outcomes:
Number of data folds: 3.
Number of samples for random search: 10.
Estimation method: xgboost.
Metric for choosing best hyper-parameter: roc_auc.
Best hyper-parameters: {'subsample': 0.5134567183342928, 'eta': 0.050805711983733466, 'max_depth': 5, 'num_boost_round': 100}.
CV performance metric associated with best hyper-parameters: 0.9811.
---------------------------------------------------------------------


------------------------------------
Running time: 7.38 minutes.
Start time: 2021-06-13, 10:16:04
End time: 2021-06-13, 10:23:27
------------------------------------


In [19]:
# CV metrics:
kfolds.CV_metric.sort_values('cv_roc_auc',
                             ascending=False).style.set_properties(subset=['tun_param'], **{'width': '300px'})

| | tun_param | cv_roc_auc |
|---|---|---|
| 9 | {'subsample': 0.5134567183342928, 'eta': 0.050805711983733466, 'max_depth': 5, 'num_boost_round': 100} | 0.98109 |
| 7 | {'subsample': 0.6572591434661241, 'eta': 0.04621502203919024, 'max_depth': 7, 'num_boost_round': 500} | 0.980355 |
| 1 | {'subsample': 0.8617633312269194, 'eta': 0.03839326995841612, 'max_depth': 9, 'num_boost_round': 500} | 0.980002 |
| 2 | {'subsample': 0.639408116127177, 'eta': 0.0861505151585871, 'max_depth': 8, 'num_boost_round': 500} | 0.979651 |
| 0 | {'subsample': 0.9676272460997488, 'eta': 0.0371531329992009, 'max_depth': 3, 'num_boost_round': 500} | 0.979312 |
| 4 | {'subsample': 0.6145690448510704, 'eta': 0.04453770667478092, 'max_depth': 7, 'num_boost_round': 250} | 0.979152 |
| 3 | {'subsample': 0.6461978544484204, 'eta': 0.03607199484448666, 'max_depth': 1, 'num_boost_round': 500} | 0.977906 |
| 8 | {'subsample': 0.6389982286899557, 'eta': 0.09207214314802341, 'max_depth': 3, 'num_boost_round': 100} | 0.977694 |
| 6 | {'subsample': 0.8149282408714418, 'eta': 0.078324260950394, 'max_depth': 1, 'num_boost_round': 100} | 0.973234 |
| 5 | {'subsample': 0.9300521825173638, 'eta': 0.014009842509754455, 'max_depth': 7, 'num_boost_round': 100} | 0.954159 |
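
Each `cv_roc_auc` value in the table above is a single number per candidate. A plausible reading, not confirmed by the notebook, is that it is the ROC AUC averaged over the validation folds. The sketch below shows that computation under this assumption, using only the standard library; the toy one-feature model, the interleaved fold assignment, and all function names are illustrative and are not taken from KfoldsCV.

```python
from statistics import mean


def roc_auc(y_true, scores):
    """Chance that a random positive outscores a random negative (ties count 1/2)."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def cv_roc_auc(x, y, num_folds, fit, predict):
    """Average the validation ROC AUC over `num_folds` interleaved folds."""
    fold_scores = []
    for fold in range(num_folds):
        valid = [i for i in range(len(x)) if i % num_folds == fold]
        train = [i for i in range(len(x)) if i % num_folds != fold]
        model = fit([x[i] for i in train], [y[i] for i in train])
        predictions = [predict(model, x[i]) for i in valid]
        fold_scores.append(roc_auc([y[i] for i in valid], predictions))
    return mean(fold_scores)


# Toy one-feature "model": the midpoint between the two class means.
def fit(x_train, y_train):
    pos = mean(v for v, label in zip(x_train, y_train) if label == 1)
    neg = mean(v for v, label in zip(x_train, y_train) if label == 0)
    return ((pos + neg) / 2, 1.0 if pos >= neg else -1.0)


def predict(model, value):
    midpoint, direction = model
    return direction * (value - midpoint)


x = [0.1, 0.9, 0.2, 0.8, 0.4, 0.7, 0.3, 0.6, 0.5, 0.95, 0.15, 0.85]
y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
print(round(cv_roc_auc(x, y, num_folds=3, fit=fit, predict=predict), 4))
```

Repeating this for every candidate and keeping the one with the highest average would reproduce the ranking logic suggested by the sorted table.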


<a id='kfolds_fit_assess'></a>

## Assessing K-folds fit

<a id='kfolds_fit_svm_class'></a>

### SVM classifier

In [21]:
# Declare grid of hyper-parameters:
params = {'C': [1],
          'kernel': ['poly'],
          'degree': [1, 2, 3, 4],
          'gamma': ['scale']}
params_default = {'C': 1.0, 'kernel': 'poly', 'degree': 1, 'gamma': 'scale'}

# Declare K-folds CV estimation object:
kfolds = Kfolds_fit(task='classification', method='SVM',
                    metric='roc_auc', num_folds=3, pre_selecting=False, random_search=False,
                    grid_param=params, default_param=params_default)

# Running train-test estimation:
kfolds.fit(train_inputs=df_train.drop(drop_vars, axis=1),
           train_output=df_train['y'],
           test_inputs=df_test.drop(drop_vars, axis=1),
           test_output=df_test['y'])

Grid estimation progress: [----------------------------------------------] 100%

---------------------------------------------------------------------
Train-test estimation outcomes:


Outcomes from K-folds CV estimation:
   Number of data folds: 3.
   Estimation method: SVM.
   Metric for choosing best hyper-parameter: roc_auc.
   Best hyper-parameters: {'C': '1', 'kernel': 'poly', 'degree': '1', 'gamma': 'scale'}.
   CV performance metric associated with best hyper-parameters: 0.9639.


Performance metrics evaluated at test data:
   test_roc_auc = 0.9833
   test_prec_avg = 0.9237
   test_brier = 0.009
---------------------------------------------------------------------


------------------------------------
Running time: 4.49 minutes.
Start time: 2021-05-18, 17:29:57
End time: 2021-05-18, 17:34:26
------------------------------------
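
The notebook does not define the three test metrics reported above. Assuming `test_brier` is the Brier score and `test_prec_avg` is average precision (an assumption based on the names alone), the sketch below shows how both could be computed from true labels and predicted probabilities. The functions are illustrative and use only the standard library.

```python
def brier_score(y_true, probabilities):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, probabilities)) / len(y_true)


def average_precision(y_true, scores):
    """Mean of the precision values at each positive, ranked by score.

    Assumes all scores are distinct; tied scores would need grouping.
    """
    ranked = sorted(zip(scores, y_true), reverse=True)
    hits, precisions = 0, []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)


y_true = [1, 0, 1, 0]
probabilities = [0.9, 0.8, 0.7, 0.1]
print(round(brier_score(y_true, probabilities), 4))
print(round(average_precision(y_true, probabilities), 4))
```

Lower is better for the Brier score, while ROC AUC and average precision are better when higher, which is worth keeping in mind when choosing the `metric` used to rank hyper-parameters.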


<a id='kfolds_fit_gbm_parallel'></a>

### Parallel estimation (GBM)

#### Sequential train-validation estimation

In [20]:
# Declare grid of hyper-parameters:
params = {'subsample': 0.75,
          'learning_rate': [0.0001, 0.001, 0.01],
          'max_depth': [1, 3, 5],
          'n_estimators': 500}
params_default = {'subsample': 0.75,
                  'learning_rate': 0.01,
                  'max_depth': 10,
                  'n_estimators': 500}

# Declare K-folds CV estimation object:
train_test_est = Kfolds_fit(task='classification', method='GBM',
                            metric='roc_auc', num_folds=3, pre_selecting=False,
                            random_search=False, grid_param=params, default_param=params_default,
                            parallelize=False)

# Running train-test estimation:
train_test_est.fit(train_inputs=df_train.drop(drop_vars, axis=1),
                   train_output=df_train['y'],
                   test_inputs=df_test.drop(drop_vars, axis=1),
                   test_output=df_test['y'])

Grid estimation progress: [----------------------------------------------] 100%

---------------------------------------------------------------------
Train-test estimation outcomes:


Outcomes from K-folds CV estimation:
   Number of data folds: 3.
   Estimation method: GBM.
   Metric for choosing best hyper-parameter: roc_auc.
   Best hyper-parameters: {'subsample': 0.75, 'learning_rate': 0.01, 'max_depth': 5.0, 'n_estimators': 500.0}.
   CV performance metric associated with best hyper-parameters: 0.9592.


Performance metrics evaluated at test data:
   test_roc_auc = 0.9918
   test_prec_avg = 0.9505
   test_brier = 0.0044
---------------------------------------------------------------------


------------------------------------
Running time: 31.08 minutes.
Start time: 2021-06-04, 21:12:51
End time: 2021-06-04, 21:43:56
------------------------------------


#### Parallel train-validation estimation

In [21]:
# Declare grid of hyper-parameters:
params = {'subsample': 0.75,
          'learning_rate': [0.0001, 0.001, 0.01],
          'max_depth': [1, 3, 5],
          'n_estimators': 500}
params_default = {'subsample': 0.75,
                  'learning_rate': 0.01,
                  'max_depth': 10,
                  'n_estimators': 500}

# Declare K-folds CV estimation object:
train_test_est = Kfolds_fit(task='classification', method='GBM',
                            metric='roc_auc', num_folds=3, pre_selecting=False,
                            random_search=False, grid_param=params, default_param=params_default,
                            parallelize=True)

# Running train-test estimation:
train_test_est.fit(train_inputs=df_train.drop(drop_vars, axis=1),
                   train_output=df_train['y'],
                   test_inputs=df_test.drop(drop_vars, axis=1),
                   test_output=df_test['y'])

Grid estimation progress: [----------------------------------------------] 100%

---------------------------------------------------------------------
Train-test estimation outcomes:


Outcomes from K-folds CV estimation:
   Number of data folds: 3.
   Estimation method: GBM.
   Metric for choosing best hyper-parameter: roc_auc.
   Best hyper-parameters: {'subsample': 0.75, 'learning_rate': 0.01, 'max_depth': 3.0, 'n_estimators': 500.0}.
   CV performance metric associated with best hyper-parameters: 0.9592.


Performance metrics evaluated at test data:
   test_roc_auc = 0.9875
   test_prec_avg = 0.9504
   test_brier = 0.0043
---------------------------------------------------------------------


------------------------------------
Running time: 12.93 minutes.
Start time: 2021-06-04, 21:43:56
End time: 2021-06-04, 21:56:52
------------------------------------
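
How `parallelize=True` is implemented is not shown in this notebook. As a rough illustration of the idea, the sketch below evaluates the same 3 x 3 grid of `learning_rate` and `max_depth` values first in a loop and then through a worker pool. The `evaluate` function is a stand-in that returns a made-up score instead of running a real K-folds CV estimation, and the use of `concurrent.futures` is an assumption, not the method used by Kfolds_fit.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def evaluate(candidate):
    """Stand-in for one K-folds CV estimation; returns a made-up score."""
    learning_rate, max_depth = candidate
    return candidate, learning_rate * max_depth


# The 3 x 3 grid of learning_rate and max_depth values from the cells above.
grid = list(product([0.0001, 0.001, 0.01], [1, 3, 5]))

sequential = [evaluate(candidate) for candidate in grid]

# Threads keep the sketch portable; CPU-bound model fitting would normally
# use ProcessPoolExecutor (or joblib) to run on several cores.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(evaluate, grid))

# map() preserves input order, so both runs give the same results.
assert parallel == sequential
best_candidate, best_score = max(parallel, key=lambda item: item[1])
print(best_candidate)

# Speed-up implied by the running times reported above.
print(round(31.08 / 12.93, 2))
```

With the running times reported in the two cells above, the parallel run took 12.93 minutes against 31.08 minutes for the sequential one, a speed-up of roughly 2.4 times.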


<a id='boot_assess'></a>

## Assessing bootstrap estimation

<a id='boot_lr'></a>

### Logistic regression

In [22]:
# Declare the hyper-parameters used in all estimations:
params_default = {'C': 0.1}

# Declare bootstrap estimation object:
boot_estimations = BootstrapEstimation(task='classification', method='logistic_regression',
                                        metric='roc_auc', num_folds=3, pre_selecting=False, random_search=False,
                                        grid_param=params, default_param=params_default,
                                        cv=False, replacement=True, n_iterations=1000, bootstrap_scores=True)

# Running bootstrap estimation:
boot_estimations.run(train_inputs=df_train.drop(drop_vars, axis=1),
                     train_output=df_train['y'],
                     test_inputs=df_test.drop(drop_vars, axis=1),
                     test_output=df_test['y'])



---------------------------------------------------------------------------------------------
Bootstrap statistics:
   Number of estimations: 1000.
   avg(roc_auc) = 0.984
   std(roc_auc) = 0.0017
   avg(prec_avg) = 0.9237
   std(prec_avg) = 0.0048
   avg(brier) = 0.0095
   std(brier) = 0.0004


   Performance metrics based on bootstrap scores:
   roc_auc = 0.9863
   prec_avg = 0.9333
   brier = 0.0088
   Hyper-parameters used in estimations: {'C': 0.1}.
---------------------------------------------------------------------------------------------


------------------------------------
Running time: 18.08 minutes.
Start time: 2021-05-18, 17:34:27
End time: 2021-05-18, 17:52:31
------------------------------------
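
The internals of BootstrapEstimation are not shown here. A minimal sketch of the general technique, assuming that each iteration resamples the training rows with replacement (as `replacement=True` suggests), refits the model, and scores it on the fixed test data, could look as follows. The toy one-feature model stands in for the regularized logistic regression, and every name in the block is hypothetical.

```python
import math
import random
from statistics import mean, stdev


def fit(x, y):
    """Toy model: midpoint between the two class means of one feature."""
    pos = mean(v for v, label in zip(x, y) if label == 1)
    neg = mean(v for v, label in zip(x, y) if label == 0)
    return (pos + neg) / 2


def predict(midpoint, value):
    """Logistic curve centred on the fitted midpoint (slope 10 is arbitrary)."""
    return 1 / (1 + math.exp(-10 * (value - midpoint)))


def brier(y_true, probabilities):
    return sum((p - y) ** 2 for y, p in zip(y_true, probabilities)) / len(y_true)


def bootstrap_brier(x_train, y_train, x_test, y_test, n_iterations, seed=0):
    """Refit on resamples drawn with replacement; score each fit on test data."""
    rng = random.Random(seed)
    n = len(x_train)
    values = []
    while len(values) < n_iterations:
        idx = [rng.randrange(n) for _ in range(n)]
        y_boot = [y_train[i] for i in idx]
        if len(set(y_boot)) < 2:
            continue  # a resample needs both classes to fit the toy model
        model = fit([x_train[i] for i in idx], y_boot)
        values.append(brier(y_test, [predict(model, v) for v in x_test]))
    return values


x_train = [0.1, 0.9, 0.2, 0.8, 0.4, 0.7, 0.3, 0.6]
y_train = [0, 1, 0, 1, 0, 1, 0, 1]
x_test = [0.15, 0.85, 0.35, 0.65]
y_test = [0, 1, 0, 1]

values = bootstrap_brier(x_train, y_train, x_test, y_test, n_iterations=1000)
print(f"avg(brier) = {mean(values):.4f}")
print(f"std(brier) = {stdev(values):.4f}")
```

The average and standard deviation over the resamples play the same role as the `avg(...)` and `std(...)` lines in the output above: they indicate how much a metric moves when the training sample changes.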
