## Grid vs. random searches
## Analysis of results

**Experiments**

Grid search will be compared against random search using logistic regression and GBM as estimation methods. Logistic regression has a single tuning hyper-parameter, the *regularization parameter* $\lambda$, while GBM has the following hyper-parameters to be set: *subsample* $\eta$, *maximum depth* $J$, *learning rate* $v$ and *number of estimators* $M$.
<br>
<br>
Grid search for $\lambda$ will take place on the following set: $\Theta_{\lambda} = [0.0001, 0.0003, 0.001, 0.003, 0.01, 0.03, 0.1, 0.25, 0.3, 0.5, 0.75, 1, 3, 10]$. For random search, the minimum and maximum values of $\Theta_{\lambda}$ define the interval $(0.0001, 10)$, which is split into three sub-intervals, each with its own uniform distribution: $Uniform(0.0001, 0.1)$, $Uniform(0.1, 1)$ and $Uniform(1, 10)$. A total of 10 random samples will be drawn, preserving the same density of values in each sub-interval $(0.0001, 0.1)$, $(0.1, 1)$ and $(1, 10)$ as found in $\Theta_{\lambda}$: four random values from $(0.0001, 0.1)$, four from $(0.1, 1)$ and two from $(1, 10)$.
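
The stratified draw described above can be sketched with the standard library alone. The function name `sample_lambdas`, the seed and the use of `random.Random` are assumptions made for illustration; the project's own scripts may draw the values differently.

```python
import random

# Sub-intervals of (0.0001, 10) and the number of draws taken from each one.
SUB_INTERVALS = [((0.0001, 0.1), 4), ((0.1, 1.0), 4), ((1.0, 10.0), 2)]


def sample_lambdas(seed=0):
    """Draw 10 candidate values of lambda, stratified by sub-interval."""
    rng = random.Random(seed)  # hypothetical seed, used only for reproducibility
    draws = []
    for (low, high), n_draws in SUB_INTERVALS:
        # Each sub-interval has its own uniform distribution.
        draws.extend(rng.uniform(low, high) for _ in range(n_draws))
    return sorted(draws)


lambdas = sample_lambdas()
print(len(lambdas))
```

Fixing the number of draws per sub-interval keeps small values of $\lambda$ represented, which a single $Uniform(0.0001, 10)$ would rarely produce.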
<br>
<br>
When it comes to the definition of $(\eta, J, v, M)$ for GBM estimation, grid search will look over $\Theta = \{0.75\} \times \{1, 3, 5\} \times \{0.0001, 0.01, 0.1\} \times \{100, 250, 500\}$, so $|\Theta| = 27$. In order to keep things comparable, 20 random samples $(\eta, J, v, M)$ will be drawn based on the following distributions for each hyper-parameter: $\eta = 0.75$ will be kept constant, while $J \in \{1, 2, 3, 4, 5\}$ and $M \in \{100, 101, ..., 500\}$ will be drawn by simple random sampling. Finally, $v$ will come from $Uniform(0.0001, 0.1)$.
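
As a rough illustration of the two search spaces for GBM, the sketch below enumerates the grid and draws 20 random candidates. The helper name `sample_gbm_params`, the dictionary keys, the seed and the choice of equal probabilities for $J$ and $M$ are assumptions, not the project's actual implementation.

```python
import itertools
import random

# Grid Theta: subsample x maximum depth x learning rate x number of estimators.
grid = list(itertools.product([0.75], [1, 3, 5], [0.0001, 0.01, 0.1], [100, 250, 500]))


def sample_gbm_params(n_samples=20, seed=0):
    """Draw random candidates (eta, J, v, M) from the distributions in the text."""
    rng = random.Random(seed)  # hypothetical seed, used only for reproducibility
    return [
        {
            "subsample": 0.75,                          # eta is kept constant
            "max_depth": rng.randint(1, 5),             # J in {1, ..., 5}, both ends included
            "learning_rate": rng.uniform(0.0001, 0.1),  # v ~ Uniform(0.0001, 0.1)
            "n_estimators": rng.randint(100, 500),      # M in {100, ..., 500}
        }
        for _ in range(n_samples)
    ]


random_candidates = sample_gbm_params()
print(len(grid), len(random_candidates))  # 27 20
```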

**Assessing results**

Once logistic regression and GBM estimations are done for all 30 distinct datasets, the analysis of results focuses on performance metrics (ROC-AUC, average precision score and Brier score, all evaluated on test data) and running time. **Descriptive statistics** by random search status (True when random search is performed, False when grid search is implemented) indicate which alternative leads to the highest expected performance metrics and to the lowest expected running time, where such averages are taken over all datasets. To make the comparison between random and grid searches more robust, **statistical tests** for paired differences in performance metrics are implemented (both the sign test and the Wilcoxon test). Finally, distributions of performance metrics and running times for random and grid searches are assessed through **data visualization**, namely boxplots of outcomes by random search status.

<a id='main_findings'></a>**Main findings**

For both estimation methods, random search reaches the same level of performance as grid search, but requires less time to do so. This is especially true for more complex settings such as GBM estimation, which has 4 main hyper-parameters, some of them integer-valued and others continuous. It is important to stress that, as indicated during initial discussions, obtaining such good results requires prior investigation to define suitable ranges of values for all hyper-parameters involved.
<br>
1. **Logistic regression:**
    * [Performance metrics](#performance_metrics_lr)<a href='#performance_metrics_lr'></a>: grid search results are slightly better than those for random search, which attains a strictly higher test ROC-AUC on only 11 of the 30 datasets.
    * [Running times](#running_times_lr)<a href='#running_times_lr'></a>: random search takes less time to run than grid search on average (mean of 5.4 versus 7.0 minutes).
    * [Statistical tests](#statistical_tests_lr)<a href='#statistical_tests_lr'></a>: however, there is no evidence of a major advantage of grid search over random search, since the two-sided sign test has a p-value of 0.2, while the Wilcoxon test with null hypothesis $H_0: \mbox{ROC-AUC}_{random} \geq \mbox{ROC-AUC}_{grid}$ has a p-value of 0.08.
    * [Distribution of outcomes](#data_vis_lr)<a href='#data_vis_lr'></a>: the distributions of performance metrics are rather similar between the two alternatives.
<br>
<br>
2. **GBM:**
    * [Performance metrics](#performance_metrics_gbm)<a href='#performance_metrics_gbm'></a>: now, random search results are slightly better than those for grid search, with random search attaining a higher test ROC-AUC on 17 of the 30 datasets.
    * [Running times](#running_times_gbm)<a href='#running_times_gbm'></a>: random search takes less time to run than grid search on average (mean of 140.5 versus 188.2 minutes).
    * [Statistical tests](#statistical_tests_gbm)<a href='#statistical_tests_gbm'></a>: there is no evidence of statistically significant differences of performance metrics between grid and random searches.
    * [Distribution of outcomes](#data_vis_gbm)<a href='#data_vis_gbm'></a>: random search distributions are slightly better.

-----------

This project is based on three notebooks, besides Python scripts for running tests. The first notebook, "1 Grid vs. Random Searches - Experiments Design", discusses both alternatives for defining hyper-parameters and presents the experiments that indicate which of them is more appropriate. Notebook "2 Grid vs. Random Searches - Methodology Development" contains the methodological construction for large-scale tests. In this final notebook, results from these tests are assessed and discussed.

**Summary:**
1. [Libraries](#libraries)<a href='#libraries'></a>.
2. [Functions and classes](#functions_classes)<a href='#functions_classes'></a>.
3. [Importing data](#imports)<a href='#imports'></a>.
4. [Descriptive statistics](#desc_stat)<a href='#desc_stat'></a>.
    * [Logistic regression](#desc_stat_lr)<a href='#desc_stat_lr'></a>.
        * [Information on estimations](#info_estimations_lr)<a href='#info_estimations_lr'></a>.
        * [Performance metrics](#performance_metrics_lr)<a href='#performance_metrics_lr'></a>.
        * [Comparing performance metrics for each dataset](#comp_metrics_lr)<a href='#comp_metrics_lr'></a>.
        * [Running times](#running_times_lr)<a href='#running_times_lr'></a>.
<br>
<br>
    * [GBM](#desc_stat_gbm)<a href='#desc_stat_gbm'></a>.
        * [Information on estimations](#info_estimations_gbm)<a href='#info_estimations_gbm'></a>.
        * [Performance metrics](#performance_metrics_gbm)<a href='#performance_metrics_gbm'></a>.
        * [Comparing performance metrics for each dataset](#comp_metrics_gbm)<a href='#comp_metrics_gbm'></a>.
        * [Running times](#running_times_gbm)<a href='#running_times_gbm'></a>.
<br>
<br>
5. [Statistical tests](#statistical_tests)<a href='#statistical_tests'></a>.
    * [Logistic regression](#statistical_tests_lr)<a href='#statistical_tests_lr'></a>.
    * [GBM](#statistical_tests_gbm)<a href='#statistical_tests_gbm'></a>.
<br>
<br>
6. [Data visualization](#data_vis)<a href='#data_vis'></a>.
    * [Logistic regression](#data_vis_lr)<a href='#data_vis_lr'></a>.
    * [GBM](#data_vis_gbm)<a href='#data_vis_gbm'></a>.

<a id='libraries'></a>

## Libraries

In [1]:
import pandas as pd
import numpy as np
import json
import os

from datetime import datetime
import time

import progressbar
from time import sleep

from scipy.stats import wilcoxon
from statsmodels.stats.descriptivestats import sign_test

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
# print(__version__) # requires version >= 1.9.0

import cufflinks as cf
init_notebook_mode(connected=True)
cf.go_offline()

import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

<a id='functions_classes'></a>

## Functions and classes

In [2]:
from utils import epoch_to_date

<a id='imports'></a>

## Importing data

In [3]:
os.chdir('../')
with open('Datasets/model_assessment.json') as json_file:
    model_assessment = json.load(json_file)

### Outcomes from logistic regression estimations

In [4]:
# Keep only the estimations run with logistic regression:
metrics_lr = pd.DataFrame([model_assessment[e] for e in model_assessment.keys() if
                           model_assessment[e]['method'] == 'logistic_regression'])
metrics_lr['running_time'] = metrics_lr.running_time.apply(lambda x: float(x.split(' minutes')[0]))

print('\033[1mShape of metrics_lr:\033[0m {0}.'.format(metrics_lr.shape))
metrics_lr.head()

Shape of metrics_lr: (60, 20).


| | store_id | n_orders_train | n_orders_test | n_vars | first_date_train | last_date_train | first_date_test | last_date_test | avg_order_amount_train | avg_order_amount_test | log_transform | standardize | method | random_search | n_samples | best_param | test_roc_auc | test_prec_avg | test_brier | running_time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2212 | 3540 | 3610 | 2243 | 2019-12-31 | 2020-03-30 | 2020-03-31 | 2020-05-31 | 155.19648 | 150.540377 | True | True | logistic_regression | False | 10 | {'C': 0.25} | 0.939789 | 0.457001 | 0.0124 | 0.63 |
| 1 | 6966 | 5899 | 2080 | 1827 | 2019-12-31 | 2020-03-30 | 2020-03-31 | 2020-05-31 | 1051.275284 | 1198.870013 | True | True | logistic_regression | False | 10 | {'C': 0.1} | 0.876872 | 0.569408 | 0.055207 | 6.38 |
| 2 | 6256 | 14436 | 8787 | 2033 | 2019-12-31 | 2020-03-30 | 2020-03-31 | 2020-05-31 | 320.155748 | 324.115751 | True | True | logistic_regression | False | 10 | {'C': 0.1} | 0.967475 | 0.534896 | 0.003936 | 2.45 |
| 3 | 5847 | 6599 | 8457 | 3632 | 2019-12-31 | 2020-03-30 | 2020-03-31 | 2020-05-31 | 681.107581 | 579.925495 | True | True | logistic_regression | False | 10 | {'C': 0.25} | 0.936715 | 0.381396 | 0.02543 | 2.05 |
| 4 | 1603 | 9218 | 8942 | 2210 | 2019-12-31 | 2020-03-30 | 2020-03-31 | 2020-05-31 | 179.681607 | 175.679007 | True | True | logistic_regression | False | 10 | {'C': 0.1} | 0.931906 | 0.626619 | 0.033837 | 1.85 |


### Outcomes from GBM estimations

In [5]:
# Keep only the estimations run with GBM:
metrics_gbm = pd.DataFrame([model_assessment[e] for e in model_assessment.keys() if
                            model_assessment[e]['method'] == 'GBM'])
metrics_gbm['running_time'] = metrics_gbm.running_time.apply(lambda x: float(x.split(' minutes')[0]))

print('\033[1mShape of metrics_gbm:\033[0m {0}.'.format(metrics_gbm.shape))
metrics_gbm.head()

Shape of metrics_gbm: (60, 20).


| | store_id | n_orders_train | n_orders_test | n_vars | first_date_train | last_date_train | first_date_test | last_date_test | avg_order_amount_train | avg_order_amount_test | log_transform | standardize | method | random_search | n_samples | best_param | test_roc_auc | test_prec_avg | test_brier | running_time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2212 | 3540 | 3610 | 2243 | 2019-12-31 | 2020-03-30 | 2020-03-31 | 2020-05-31 | 155.19648 | 150.540377 | True | True | GBM | False | 20 | {'subsample': 0.75, 'learning_rate': 0.01, 'ma... | 0.831416 | 0.459043 | 0.011842 | 34.45 |
| 1 | 6966 | 5899 | 2080 | 1827 | 2019-12-31 | 2020-03-30 | 2020-03-31 | 2020-05-31 | 1051.275284 | 1198.870013 | True | True | GBM | False | 20 | {'subsample': 0.75, 'learning_rate': 0.01, 'ma... | 0.880453 | 0.610075 | 0.050491 | 82.22 |
| 2 | 6256 | 14436 | 8787 | 2033 | 2019-12-31 | 2020-03-30 | 2020-03-31 | 2020-05-31 | 320.155748 | 324.115751 | True | True | GBM | False | 20 | {'subsample': 0.75, 'learning_rate': 0.01, 'ma... | 0.953424 | 0.463554 | 0.004351 | 137.05 |
| 3 | 5847 | 6599 | 8457 | 3632 | 2019-12-31 | 2020-03-30 | 2020-03-31 | 2020-05-31 | 681.107581 | 579.925495 | True | True | GBM | False | 20 | {'subsample': 0.75, 'learning_rate': 0.1, 'max... | 0.941362 | 0.625792 | 0.017499 | 143.58 |
| 4 | 1603 | 9218 | 8942 | 2210 | 2019-12-31 | 2020-03-30 | 2020-03-31 | 2020-05-31 | 179.681607 | 175.679007 | True | True | GBM | False | 20 | {'subsample': 0.75, 'learning_rate': 0.1, 'max... | 0.9233 | 0.585341 | 0.036275 | 126.42 |
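
The two import cells differ only in the method name, so the filtering and the parsing of `running_time` could be shared. The sketch below uses plain dictionaries; `parse_minutes`, `records_for` and the two-entry `toy_assessment` (including its keys `run_a` and `run_b`) are hypothetical stand-ins, since the real structure of `model_assessment.json` is only known here through the cells above.

```python
def parse_minutes(text):
    """Convert a string such as '0.63 minutes' into a float number of minutes."""
    return float(text.split(' minutes')[0])


def records_for(assessment, method):
    """Return the records of one estimation method with numeric running times."""
    rows = []
    for record in assessment.values():
        if record['method'] == method:
            row = dict(record)  # copy, so the loaded JSON is left untouched
            row['running_time'] = parse_minutes(record['running_time'])
            rows.append(row)
    return rows


# Hypothetical stand-in for the content of Datasets/model_assessment.json.
toy_assessment = {
    'run_a': {'method': 'logistic_regression', 'random_search': False,
              'running_time': '0.63 minutes'},
    'run_b': {'method': 'GBM', 'random_search': False,
              'running_time': '34.45 minutes'},
}

lr_rows = records_for(toy_assessment, 'logistic_regression')
gbm_rows = records_for(toy_assessment, 'GBM')
print(lr_rows[0]['running_time'], gbm_rows[0]['running_time'])
```

Passing the resulting list to `pd.DataFrame` should give frames equivalent to `metrics_lr` and `metrics_gbm`, assuming every record carries the same fields.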


<a id='desc_stat'></a>

## Descriptive statistics

<a id='desc_stat_lr'></a>

### Logistic regression

<a id='info_estimations_lr'></a>

#### Information on estimations

In [6]:
if metrics_lr.isnull().sum().sum() > 0:
    print('\033[1mProblem - Missing values detected!\033[0m')

In [7]:
print('\033[1mDistribution of estimations by random search status:\033[0m')
metrics_lr.random_search.value_counts()

Distribution of estimations by random search status:


True     30
False    30
Name: random_search, dtype: int64

In [8]:
print('\033[1mDistribution of estimations by best hyper-parameter - grid searches:\033[0m')
metrics_lr[metrics_lr.random_search == False].best_param.value_counts()

Distribution of estimations by best hyper-parameter - grid searches:


{'C': 0.1}     16
{'C': 0.25}     9
{'C': 0.3}      1
{'C': 10.0}     1
{'C': 0.5}      1
{'C': 0.75}     1
{'C': 1}        1
Name: best_param, dtype: int64

In [9]:
print('\033[1mDistribution of best hyper-parameters - random searches:\033[0m')
metrics_lr[metrics_lr.random_search == True].best_param.apply(lambda x: float(x.split(': ')[1].split('}')[0])).describe()

Distribution of best hyper-parameters - random searches:


count    30.000000
mean      0.302869
std       0.310134
min       0.045667
25%       0.091953
50%       0.156493
75%       0.378948
max       1.311701
Name: best_param, dtype: float64

In [10]:
print('\033[1mDistribution of running times - grid searches:\033[0m')
metrics_lr[metrics_lr.random_search == False].running_time.describe()

Distribution of running times - grid searches:


count    30.000000
mean      7.004667
std       9.606706
min       0.400000
25%       1.270000
50%       2.035000
75%       7.122500
max      31.630000
Name: running_time, dtype: float64

In [11]:
print('\033[1mDistribution of running times - random searches:\033[0m')
metrics_lr[metrics_lr.random_search == True].running_time.describe()

Distribution of running times - random searches:


count    30.000000
mean      5.431333
std       7.811835
min       0.400000
25%       1.285000
50%       1.835000
75%       4.992500
max      27.930000
Name: running_time, dtype: float64

<a id='performance_metrics_lr'></a>

#### Performance metrics by random search status

In [12]:
print('\033[1mPerformance metric by random search status:\033[0m')
metrics_lr.groupby('random_search').describe()['test_roc_auc']

Performance metric by random search status:


| random_search | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| False | 30.0 | 0.914307 | 0.097957 | 0.467005 | 0.894108 | 0.943731 | 0.962412 | 0.997625 |
| True | 30.0 | 0.913 | 0.099077 | 0.45639 | 0.8982 | 0.942363 | 0.962425 | 0.997465 |


In [13]:
print('\033[1mPerformance metric by random search status:\033[0m')
metrics_lr.groupby('random_search').describe()['test_prec_avg']

Performance metric by random search status:


| random_search | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| False | 30.0 | 0.564047 | 0.216738 | 0.007624 | 0.450836 | 0.552152 | 0.741836 | 0.99711 |
| True | 30.0 | 0.555859 | 0.221718 | 0.006873 | 0.431145 | 0.5436 | 0.740951 | 0.996833 |


In [14]:
print('\033[1mPerformance metric by random search status:\033[0m')
metrics_lr.groupby('random_search').describe()['test_brier']

Performance metric by random search status:


| random_search | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| False | 30.0 | 0.018937 | 0.023677 | 0.001791 | 0.004936 | 0.008854 | 0.022496 | 0.100069 |
| True | 30.0 | 0.019287 | 0.023663 | 0.001573 | 0.005075 | 0.008737 | 0.025309 | 0.099272 |


[(Main findings)](#main_findings)<a href='#main_findings'></a>

<a id='comp_metrics_lr'></a>

#### Comparing performance metrics for each dataset

In [15]:
# Performance metrics for grid search:
comp_lr = metrics_lr[metrics_lr.random_search == False][['store_id', 'test_roc_auc',
                                                         'test_prec_avg', 'test_brier']]
comp_lr.columns = ['store_id', 'test_roc_auc_grid', 'test_prec_avg_grid', 'test_brier_grid']

# Performance metrics for random search:
comp_lr = comp_lr.merge(metrics_lr[metrics_lr.random_search == True][['store_id', 'test_roc_auc',
                                                                      'test_prec_avg', 'test_brier']],
                        on='store_id', how='inner')
comp_lr.columns = ['store_id', 'test_roc_auc_grid', 'test_prec_avg_grid', 'test_brier_grid',
                   'test_roc_auc_random', 'test_prec_avg_random', 'test_brier_random']
comp_lr.head()

| | store_id | test_roc_auc_grid | test_prec_avg_grid | test_brier_grid | test_roc_auc_random | test_prec_avg_random | test_brier_random |
|---|---|---|---|---|---|---|---|
| 0 | 2212 | 0.939789 | 0.457001 | 0.0124 | 0.936307 | 0.445822 | 0.012654 |
| 1 | 6966 | 0.876872 | 0.569408 | 0.055207 | 0.873687 | 0.563365 | 0.05548 |
| 2 | 6256 | 0.967475 | 0.534896 | 0.003936 | 0.968309 | 0.368424 | 0.004922 |
| 3 | 5847 | 0.936715 | 0.381396 | 0.02543 | 0.934107 | 0.369893 | 0.026916 |
| 4 | 1603 | 0.931906 | 0.626619 | 0.033837 | 0.931209 | 0.626901 | 0.033794 |


In [16]:
# Comparing performance metrics for grid and random searches for each dataset:
comp_lr['roc_auc_random_better'] = comp_lr[['test_roc_auc_grid',
                                            'test_roc_auc_random']].apply(lambda x: 1 if x['test_roc_auc_random'] > x['test_roc_auc_grid'] else 0,
                                                                          axis=1)

comp_lr['prec_avg_random_better'] = comp_lr[['test_prec_avg_grid',
                                             'test_prec_avg_random']].apply(lambda x: 1 if x['test_prec_avg_random'] > x['test_prec_avg_grid'] else 0,
                                                                            axis=1)

comp_lr['brier_random_better'] = comp_lr[['test_brier_grid',
                                          'test_brier_random']].apply(lambda x: 1 if x['test_brier_random'] < x['test_brier_grid'] else 0,
                                                                      axis=1)

In [17]:
print('\033[1mFrequencies of random search better than grid search:\033[0m')
print('\033[1mTest ROC-AUC:\033[0m')
print(comp_lr['roc_auc_random_better'].value_counts())
print('\n')

print('\033[1mTest average precision score:\033[0m')
print(comp_lr['prec_avg_random_better'].value_counts())

print('\n')

print('\033[1mTest Brier score:\033[0m')
print(comp_lr['brier_random_better'].value_counts())

Frequencies of random search better than grid search:
Test ROC-AUC:
0    19
1    11
Name: roc_auc_random_better, dtype: int64


Test average precision score:
0    20
1    10
Name: prec_avg_random_better, dtype: int64


Test Brier score:
0    19
1    11
Name: brier_random_better, dtype: int64
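
The row-wise `apply` calls used for these flags can also be written as vectorized comparisons. The sketch below is one possible alternative, run on the five rows of `comp_lr` displayed earlier (ROC-AUC and Brier score columns only); the frame name `comp` is an assumption used to avoid touching the notebook's own objects.

```python
import pandas as pd

# First five rows of comp_lr, copied from the table displayed earlier.
comp = pd.DataFrame({
    'store_id': [2212, 6966, 6256, 5847, 1603],
    'test_roc_auc_grid': [0.939789, 0.876872, 0.967475, 0.936715, 0.931906],
    'test_roc_auc_random': [0.936307, 0.873687, 0.968309, 0.934107, 0.931209],
    'test_brier_grid': [0.0124, 0.055207, 0.003936, 0.02543, 0.033837],
    'test_brier_random': [0.012654, 0.05548, 0.004922, 0.026916, 0.033794],
})

# Higher is better for ROC-AUC, lower is better for the Brier score.
comp['roc_auc_random_better'] = (
    comp['test_roc_auc_random'] > comp['test_roc_auc_grid']).astype(int)
comp['brier_random_better'] = (
    comp['test_brier_random'] < comp['test_brier_grid']).astype(int)

print(comp[['store_id', 'roc_auc_random_better', 'brier_random_better']])
```

Both versions use strict inequalities, so an exact tie is counted as "random search not better".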


<a id='running_times_lr'></a>

#### Running times

In [18]:
print('\033[1mRunning time by random search status:\033[0m')
metrics_lr.groupby('random_search').describe()['running_time']

Running time by random search status:


| random_search | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| False | 30.0 | 7.004667 | 9.606706 | 0.4 | 1.27 | 2.035 | 7.1225 | 31.63 |
| True | 30.0 | 5.431333 | 7.811835 | 0.4 | 1.285 | 1.835 | 4.9925 | 27.93 |


[(Main findings)](#main_findings)<a href='#main_findings'></a>

<a id='desc_stat_gbm'></a>

### GBM

<a id='info_estimations_gbm'></a>

#### Information on estimations

In [19]:
if metrics_gbm.isnull().sum().sum() > 0:
    print('\033[1mProblem - Missing values detected!\033[0m')

In [20]:
print('\033[1mDistribution of estimations by random search status:\033[0m')
metrics_gbm.random_search.value_counts()

Distribution of estimations by random search status:


True     30
False    30
Name: random_search, dtype: int64

In [21]:
print('\033[1mDistribution of estimations by best hyper-parameter - grid searches:\033[0m')
metrics_gbm[metrics_gbm.random_search == False].best_param.value_counts()

Distribution of estimations by best hyper-parameter - grid searches:


{'subsample': 0.75, 'learning_rate': 0.01, 'max_depth': 5.0, 'n_estimators': 500.0}    9
{'subsample': 0.75, 'learning_rate': 0.1, 'max_depth': 3.0, 'n_estimators': 500.0}     6
{'subsample': 0.75, 'learning_rate': 0.1, 'max_depth': 1.0, 'n_estimators': 500.0}     6
{'subsample': 0.75, 'learning_rate': 0.01, 'max_depth': 3.0, 'n_estimators': 500.0}    3
{'subsample': 0.75, 'learning_rate': 0.1, 'max_depth': 3.0, 'n_estimators': 250.0}     2
{'subsample': 0.75, 'learning_rate': 0.1, 'max_depth': 5.0, 'n_estimators': 500.0}     1
{'subsample': 0.75, 'learning_rate': 0.1, 'max_depth': 1.0, 'n_estimators': 250.0}     1
{'subsample': 0.75, 'learning_rate': 0.01, 'max_depth': 3.0, 'n_estimators': 250.0}    1
{'subsample': 0.75, 'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 500}        1
Name: best_param, dtype: int64

In [22]:
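print('\033[1mDistribution of best hyper-parameters - random searches:\033[0m')
# This cell reads metrics_lr, so the summary below repeats the logistic regression values of C.
# GBM best_param strings hold four hyper-parameters and cannot be parsed by this single-value split.
metrics_lr[metrics_lr.random_search == True].best_param.apply(lambda x: float(x.split(': ')[1].split('}')[0])).describe()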

Distribution of best hyper-parameters - random searches:


count    30.000000
mean      0.302869
std       0.310134
min       0.045667
25%       0.091953
50%       0.156493
75%       0.378948
max       1.311701
Name: best_param, dtype: float64
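
The cell above summarizes `metrics_lr`, and its string split only works when `best_param` holds a single value. For GBM, where each `best_param` string describes four hyper-parameters, one option is to parse the string with `ast.literal_eval`. The sketch below applies this to two strings copied from the grid search output above; the variable names are assumptions.

```python
import ast

# Two best_param strings copied from the grid search output above.
best_params = [
    "{'subsample': 0.75, 'learning_rate': 0.01, 'max_depth': 5.0, 'n_estimators': 500.0}",
    "{'subsample': 0.75, 'learning_rate': 0.1, 'max_depth': 3.0, 'n_estimators': 250.0}",
]

# literal_eval safely turns each string into a dictionary of hyper-parameters.
parsed = [ast.literal_eval(text) for text in best_params]

learning_rates = [params['learning_rate'] for params in parsed]
max_depths = [int(params['max_depth']) for params in parsed]
print(learning_rates, max_depths)
```

Assuming `best_param` is stored as such a string for every GBM record, applying `ast.literal_eval` to the random search rows of `metrics_gbm` and expanding the dictionaries into columns would allow a `describe()` per hyper-parameter.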

In [23]:
print('\033[1mDistribution of running times - grid searches:\033[0m')
metrics_gbm[metrics_gbm.random_search == False].running_time.describe()

Distribution of running times - grid searches:


count     30.000000
mean     188.195333
std      184.413068
min        7.950000
25%       66.807500
50%      121.735000
75%      317.552500
max      688.130000
Name: running_time, dtype: float64

In [24]:
print('\033[1mDistribution of running times - random searches:\033[0m')
metrics_gbm[metrics_gbm.random_search == True].running_time.describe()

Distribution of running times - random searches:


count     30.000000
mean     140.496333
std      142.873932
min        2.600000
25%       42.817500
50%       95.960000
75%      202.920000
max      623.400000
Name: running_time, dtype: float64

<a id='performance_metrics_gbm'></a>

#### Performance metrics by random search status

In [25]:
print('\033[1mPerformance metric by random search status:\033[0m')
metrics_gbm.groupby('random_search').describe()['test_roc_auc']

Performance metric by random search status:


| random_search | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| False | 30.0 | 0.896874 | 0.129526 | 0.269928 | 0.882236 | 0.928344 | 0.951856 | 0.997508 |
| True | 30.0 | 0.90514 | 0.097506 | 0.518819 | 0.886797 | 0.930427 | 0.95783 | 0.997279 |


In [26]:
print('\033[1mPerformance metric by random search status:\033[0m')
metrics_gbm.groupby('random_search').describe()['test_prec_avg']

Performance metric by random search status:


| random_search | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| False | 30.0 | 0.503705 | 0.264976 | 0.004968 | 0.289032 | 0.479844 | 0.687865 | 0.996855 |
| True | 30.0 | 0.51157 | 0.249368 | 0.005877 | 0.371149 | 0.487225 | 0.699803 | 0.995934 |


In [27]:
print('\033[1mPerformance metric by random search status:\033[0m')
metrics_gbm.groupby('random_search').describe()['test_brier']

Performance metric by random search status:


| random_search | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| False | 30.0 | 0.017852 | 0.020895 | 0.002547 | 0.006029 | 0.010977 | 0.016698 | 0.098081 |
| True | 30.0 | 0.017676 | 0.020911 | 0.003637 | 0.005742 | 0.009812 | 0.017447 | 0.098184 |


[(Main findings)](#main_findings)<a href='#main_findings'></a>

<a id='comp_metrics_gbm'></a>

#### Comparing performance metrics for each dataset

In [28]:
# Performance metrics for grid search:
comp_gbm = metrics_gbm[metrics_gbm.random_search == False][['store_id', 'test_roc_auc',
                                                         'test_prec_avg', 'test_brier']]
comp_gbm.columns = ['store_id', 'test_roc_auc_grid', 'test_prec_avg_grid', 'test_brier_grid']

# Performance metrics for random search:
comp_gbm = comp_gbm.merge(metrics_gbm[metrics_gbm.random_search == True][['store_id', 'test_roc_auc',
                                                                      'test_prec_avg', 'test_brier']],
                        on='store_id', how='inner')
comp_gbm.columns = ['store_id', 'test_roc_auc_grid', 'test_prec_avg_grid', 'test_brier_grid',
                   'test_roc_auc_random', 'test_prec_avg_random', 'test_brier_random']
comp_gbm.head()

| | store_id | test_roc_auc_grid | test_prec_avg_grid | test_brier_grid | test_roc_auc_random | test_prec_avg_random | test_brier_random |
|---|---|---|---|---|---|---|---|
| 0 | 2212 | 0.831416 | 0.459043 | 0.011842 | 0.863508 | 0.454041 | 0.012091 |
| 1 | 6966 | 0.880453 | 0.610075 | 0.050491 | 0.881956 | 0.602313 | 0.05228 |
| 2 | 6256 | 0.953424 | 0.463554 | 0.004351 | 0.93859 | 0.398818 | 0.004946 |
| 3 | 5847 | 0.941362 | 0.625792 | 0.017499 | 0.937213 | 0.613035 | 0.017671 |
| 4 | 1603 | 0.9233 | 0.585341 | 0.036275 | 0.921981 | 0.589481 | 0.036168 |


In [29]:
# Comparing performance metrics for grid and random searches for each dataset:
comp_gbm['roc_auc_random_better'] = comp_gbm[['test_roc_auc_grid',
                                            'test_roc_auc_random']].apply(lambda x: 1 if x['test_roc_auc_random'] > x['test_roc_auc_grid'] else 0,
                                                                          axis=1)

comp_gbm['prec_avg_random_better'] = comp_gbm[['test_prec_avg_grid',
                                             'test_prec_avg_random']].apply(lambda x: 1 if x['test_prec_avg_random'] > x['test_prec_avg_grid'] else 0,
                                                                            axis=1)

comp_gbm['brier_random_better'] = comp_gbm[['test_brier_grid',
                                          'test_brier_random']].apply(lambda x: 1 if x['test_brier_random'] < x['test_brier_grid'] else 0,
                                                                      axis=1)

In [30]:
print('\033[1mFrequencies of random search better than grid search:\033[0m')
print('\033[1mTest ROC-AUC:\033[0m')
print(comp_gbm['roc_auc_random_better'].value_counts())
print('\n')

print('\033[1mTest average precision score:\033[0m')
print(comp_gbm['prec_avg_random_better'].value_counts())

print('\n')

print('\033[1mTest Brier score:\033[0m')
print(comp_gbm['brier_random_better'].value_counts())

Frequencies of random search better than grid search:
Test ROC-AUC:
1    17
0    13
Name: roc_auc_random_better, dtype: int64


Test average precision score:
0    17
1    13
Name: prec_avg_random_better, dtype: int64


Test Brier score:
0    18
1    12
Name: brier_random_better, dtype: int64


<a id='running_times_gbm'></a>

#### Running times

In [31]:
print('\033[1mRunning time by random search status:\033[0m')
metrics_gbm.groupby('random_search').describe()['running_time']

Running time by random search status:


| random_search | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| False | 30.0 | 188.195333 | 184.413068 | 7.95 | 66.8075 | 121.735 | 317.5525 | 688.13 |
| True | 30.0 | 140.496333 | 142.873932 | 2.6 | 42.8175 | 95.96 | 202.92 | 623.4 |


[(Main findings)](#main_findings)<a href='#main_findings'></a>
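
To put the running times of both methods on a common scale, the mean values reported above can be turned into relative savings. This is a small arithmetic sketch; the helper `relative_saving` and the dictionary layout are assumptions, while the numbers are the means from the `describe()` outputs.

```python
# Mean running times in minutes, copied from the describe() outputs above.
mean_minutes = {
    'logistic_regression': {'grid': 7.004667, 'random': 5.431333},
    'GBM': {'grid': 188.195333, 'random': 140.496333},
}


def relative_saving(grid_minutes, random_minutes):
    """Share of the grid search running time saved by random search."""
    return (grid_minutes - random_minutes) / grid_minutes


savings = {method: relative_saving(times['grid'], times['random'])
           for method, times in mean_minutes.items()}

for method, saving in savings.items():
    print('{0}: {1:.1%} less time with random search'.format(method, saving))
```

These are ratios of means over the 30 datasets, so they say nothing about the saving on any single dataset.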

<a id='statistical_tests'></a>

## Statistical tests

<a id='statistical_tests_lr'></a>

### Logistic regression

In [32]:
# Paired differences in test ROC-AUC (random minus grid), one per dataset:
diff_lr = [j - i for i, j in zip(comp_lr['test_roc_auc_grid'],
                                 comp_lr['test_roc_auc_random'])]

# P-value of the two-sided sign test:
p_value_sign = sign_test(diff_lr)[1]

# P-value of the one-sided Wilcoxon test (alternative: differences below zero):
p_value_wilcoxon = wilcoxon(diff_lr, alternative='less')[1]

print('\033[1mSign test:\033[0m')
print('H0: ROC-AUC (random) = ROC-AUC (grid)')
print('H1: ROC-AUC (random) != ROC-AUC (grid)')
print('P-value = {0:.4f}'.format(p_value_sign))
print('\n')

print('\033[1mWilcoxon test:\033[0m')
print('H0: ROC-AUC (random) >= ROC-AUC (grid)')
print('H1: ROC-AUC (random) < ROC-AUC (grid)')
print('P-value = {0:.4f}'.format(p_value_wilcoxon))

Sign test:
H0: ROC-AUC (random) = ROC-AUC (grid)
H1: ROC-AUC (random) != ROC-AUC (grid)
P-value = 0.2005


Wilcoxon test:
H0: ROC-AUC (random) >= ROC-AUC (grid)
H1: ROC-AUC (random) < ROC-AUC (grid)
P-value = 0.0794


[(Main findings)](#main_findings)<a href='#main_findings'></a>

<a id='statistical_tests_gbm'></a>

### GBM

In [33]:
# Paired differences in test ROC-AUC (random minus grid), one per dataset:
diff_gbm = [j - i for i, j in zip(comp_gbm['test_roc_auc_grid'],
                                  comp_gbm['test_roc_auc_random'])]

# P-value of the two-sided sign test:
p_value_sign = sign_test(diff_gbm)[1]

# P-value of the one-sided Wilcoxon test (alternative: differences below zero):
p_value_wilcoxon = wilcoxon(diff_gbm, alternative='less')[1]

print('\033[1mSign test:\033[0m')
print('H0: ROC-AUC (random) = ROC-AUC (grid)')
print('H1: ROC-AUC (random) != ROC-AUC (grid)')
print('P-value = {0:.4f}'.format(p_value_sign))
print('\n')

print('\033[1mWilcoxon test:\033[0m')
print('H0: ROC-AUC (random) >= ROC-AUC (grid)')
print('H1: ROC-AUC (random) < ROC-AUC (grid)')
print('P-value = {0:.4f}'.format(p_value_wilcoxon))

Sign test:
H0: ROC-AUC (random) = ROC-AUC (grid)
H1: ROC-AUC (random) != ROC-AUC (grid)
P-value = 0.5847


Wilcoxon test:
H0: ROC-AUC (random) >= ROC-AUC (grid)
H1: ROC-AUC (random) < ROC-AUC (grid)
P-value = 0.8732


[(Main findings)](#main_findings)<a href='#main_findings'></a>
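
The sign test only uses the sign of each paired difference, so its two-sided p-value can be recovered from the win counts reported earlier with an exact binomial calculation. The sketch below assumes that no dataset has an exactly tied ROC-AUC, so that the 30 differences split into 11 positive and 19 negative for logistic regression, and 17 positive and 13 negative for GBM; the function name is hypothetical and this is not the `statsmodels` implementation itself.

```python
from math import comb


def sign_test_pvalue(n_positive, n_negative):
    """Two-sided exact sign test p-value under H0: P(positive) = 0.5."""
    n = n_positive + n_negative
    k = min(n_positive, n_negative)
    # Probability of observing k or fewer of the rarer sign under Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double the tail for a two-sided test, capping the result at 1.
    return min(1.0, 2 * tail)


p_lr = sign_test_pvalue(11, 19)   # random better on 11 of 30 datasets
p_gbm = sign_test_pvalue(17, 13)  # random better on 17 of 30 datasets
print('{0:.4f} {1:.4f}'.format(p_lr, p_gbm))
```

Under the no-ties assumption, both values agree with the sign test p-values printed in this section, which illustrates why 11 wins out of 30 is not enough to reject equality of the two searches.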

<a id='data_vis'></a>

## Data visualization

<a id='data_vis_lr'></a>

### Logistic regression

#### Boxplot of performance metrics by random search status

In [34]:
# Boxplots for the distribution of outcomes by random search status:
fig = make_subplots(rows=1, cols=3,
                    subplot_titles=("Test ROC-AUC", "Test average precision score", "Running time (minutes)"))

# First plot:
fig.add_trace(
    go.Box(x=metrics_lr['random_search'], y=metrics_lr['test_roc_auc'], name='roc_auc'),
    row=1, col=1, secondary_y=False)

# Second plot:
fig.add_trace(
    go.Box(x=metrics_lr['random_search'], y=metrics_lr['test_prec_avg'], name='prec_avg'),
    row=1, col=2, secondary_y=False)

# Third plot:
fig.add_trace(
    go.Box(x=metrics_lr['random_search'], y=metrics_lr['running_time'], name='running_time'),
    row=1, col=3, secondary_y=False)

# Changing layout:
fig.update_layout(
    title_text='Outcomes by random search status',
    width=1100,
    height=450,
    showlegend=False
)

# Changing axes:
fig.update_xaxes(title_text="random_search", row=1, col=1)
fig.update_xaxes(title_text="random_search", row=1, col=2)
fig.update_xaxes(title_text="random_search", row=1, col=3)

fig.show()

[(Main findings)](#main_findings)<a href='#main_findings'></a>

<a id='data_vis_gbm'></a>

### GBM

#### Boxplot of performance metrics by random search status

In [35]:
# Boxplots for the distribution of outcomes by random search status:
fig = make_subplots(rows=1, cols=3,
                    subplot_titles=("Test ROC-AUC", "Test average precision score", "Running time (minutes)"))

# First plot:
fig.add_trace(
    go.Box(x=metrics_gbm['random_search'], y=metrics_gbm['test_roc_auc'], name='roc_auc'),
    row=1, col=1, secondary_y=False)

# Second plot:
fig.add_trace(
    go.Box(x=metrics_gbm['random_search'], y=metrics_gbm['test_prec_avg'], name='prec_avg'),
    row=1, col=2, secondary_y=False)

# Third plot:
fig.add_trace(
    go.Box(x=metrics_gbm['random_search'], y=metrics_gbm['running_time'], name='running_time'),
    row=1, col=3, secondary_y=False)

# Changing layout:
fig.update_layout(
    title_text='Outcomes by random search status',
    width=1100,
    height=450,
    showlegend=False
)

# Changing axes:
fig.update_xaxes(title_text="random_search", row=1, col=1)
fig.update_xaxes(title_text="random_search", row=1, col=2)
fig.update_xaxes(title_text="random_search", row=1, col=3)

fig.show()

[(Main findings)](#main_findings)<a href='#main_findings'></a>