## Tests over GBM hyper-parameters

This notebook presents the results and discussion of tests conducted to study the specification of GBM hyper-parameters.
<br>
First, theoretical aspects of the GBM method and of data modeling are discussed in the sections "Gradient Boosting Model (GBM) and its main hyper-parameters" and "Theoretical framework". Then, the objectives and structure of the tests are presented, followed by the main conclusions drawn from them. The remainder of the notebook contains the full results.
<br>
From these tests over GBM hyper-parameters, two additional studies have emerged: the first compares values of two performance metrics, precision-recall AUC and average precision score (folder "Peformance metrics for classification tasks"), while the second contrasts grid searches using ROC-AUC and average precision score (folder "Grid Search Metrics"). Full results of both studies are presented and discussed in their own notebooks, although an appendix here shows preliminary results for the second study, drawing on the same outcomes from the tests over GBM hyper-parameters.

-----------

#### Gradient Boosting Model (GBM) and its main hyper-parameters

In their standard setting, **Gradient Boosting Models (GBMs)** are tree-based learning algorithms that produce an ensemble of different estimators in order to provide more robust predictions. As with bagging and random forests, GBM predictions are built upon a collection of individual estimators; unlike those methods, however, the learners that compose the GBM ensemble are not independent from each other, since they are fitted sequentially, each one building on its predecessors.
<br>
<br>
Consequently, it follows that GBM estimation requires: i) creating $M$ different (but not independent) estimators; ii) combining them to conceive a final model. A first question that emerges from such procedure is *what data to use at each estimation* $m$? All $N$ training data points available, or just some fraction $\eta$ of randomly picked observations? This hyper-parameter $\eta$ is named **subsample** and it is defined in the interval $(0, 1]$.
<br>
<br>
Once it has been decided whether $\eta = 1$ or $\eta < 1$ (and, in the second case, which specific value $\eta \in (0,1)$ to use), and given that each base learner in the GBM ensemble is a decision tree, a second choice concerns the size of each tree, which can be understood in two ways: i) number of terminal nodes; ii) number of splits. The first definition is how Friedman, Hastie, and Tibshirani (2008) deal with tree sizes, and it is arguably the more intuitive measure of tree size. The number of splits, in turn, can be translated into the depth of a tree. Both ways of defining tree size are closely related, but the number of splits reveals more directly the possible degree of interaction between input variables. Thus, another relevant hyper-parameter for GBM is the **maximum depth** of trees, $max\_depth \in \mathbb{N}_+$, ranging from $max\_depth = 1$, where trees are actually stumps and there is no interaction among the inputs of a given tree, to $max\_depth > 1$, where interaction effects may be captured by single trees.
<br>
<br>
Since GBM is an ensemble model, the contribution of each base learner to the final composite model can also be calibrated. The hyper-parameter that controls this is the **learning rate**, or shrinkage parameter, $v \in (0, 1]$. Setting $v < 1$ leads to a regularized model, and $v \rightarrow 0$ implies slow learning that attenuates the contribution of each tree in the ensemble, thus preventing overfitting caused by idiosyncrasies of the data expressed in a few, but potentially influential, base learners.
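
In symbols, shrinkage can be written as the usual textbook boosting update. The symbols $f_m$ (the ensemble after $m$ trees) and $h_m$ (the tree fitted at step $m$) are introduced here only for illustration and are not used elsewhere in this notebook:

$$f_m(x) = f_{m-1}(x) + v \, h_m(x), \qquad m = 1, \dots, M$$

Under this notation the final model is $f_M(x) = f_0(x) + v \sum_{m=1}^{M} h_m(x)$, so a smaller $v$ scales down the contribution of every tree, which is why small learning rates usually call for a larger $M$.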
<br>
<br>
Finally, once it has been defined how data is used in each estimation, which kind of trees can be estimated, and the weight each of them receives when composing the final model, it is necessary to declare how many of these individual estimators should be constructed. This last hyper-parameter is the **number of estimators**, $n\_estimators \in \mathbb{N}_+$, whose value is typically large, $n\_estimators \geq 100$.
<br>
<br>
**Note:** besides these hyper-parameters specific to ensemble models, there are several others concerning tree construction.

Therefore, the main hyper-parameters to be explored when estimating GBM are:
* Subsample, $\eta$: whether $\eta = 1$ or $\eta < 1$, and which value when $\eta \in (0, 1)$.
* Maximum depth, $max\_depth$: choosing among $\{1, 2, ..., 10\}$.
* Learning rate, $v$: exploring different values smaller than 1.
* Number of estimators, $n\_estimators$.
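
As a minimal sketch of how these four hyper-parameters map onto scikit-learn's `GradientBoostingClassifier`, the snippet below fits one model on synthetic data. The dataset and the specific values are illustrative assumptions, not the data or settings used in the tests.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic, mildly unbalanced binary data (an assumption for illustration only).
X, y = make_classification(
    n_samples=300, n_features=10, weights=[0.8, 0.2], random_state=0
)

gbm = GradientBoostingClassifier(
    subsample=0.8,       # eta: fraction of observations drawn for each tree
    max_depth=3,         # maximum depth of each individual tree
    learning_rate=0.05,  # v: shrinkage applied to each tree's contribution
    n_estimators=100,    # M: number of trees in the ensemble
    random_state=0,
)
gbm.fit(X, y)

# One fitted tree per boosting stage.
print(len(gbm.estimators_))
```

Any `subsample` value below 1 turns the estimator into the stochastic variant of GBM discussed above.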

---------------

#### Theoretical framework

Prior to presenting the objectives and structure of the tests, it is necessary to define some crucial objects concerning model estimation. First of all, a **statistical model** consists of a function $F(X)$ that describes how a response variable $Y$ is determined from inputs $X$ through **parameters** $\gamma$, which ultimately define the model. To this deterministic relationship, an irreducible random error $\epsilon$ is added. How these parameters $\gamma$ relate to inputs $X$ depends on the **statistical learning method** used to estimate them from empirical data. Estimation methods of any level of complexity are governed by **hyper-parameters** $\theta$, which are not estimated from data, but rather set prior to estimation. So, before estimating a model, one should collect data, choose which method to use, and define which values its hyper-parameters will assume.
<br>
<br>
Given a statistical learning method $L$ based on hyper-parameters $\theta^L \in \Theta^L$, where $\Theta^L \subset \mathbb{R}^k$, and parameters $\gamma^L \in \mathbb{R}^p$, the **model space** $\mathcal{M}^L$ is understood to be a $p$-dimensional Euclidean space encompassing all possible values for parameters $\gamma^L$. When an algorithm estimates a model under the specific statistical learning framework $L$, the model probability distribution $P(\gamma^L|\theta^L)$ is a direct function of the hyper-parameters $\theta^L$, which must be defined prior to the estimation of $\gamma^L$.
<br>
This means that how likely it is for some model $\hat{\gamma}^L \in \mathcal{M}^L$ to be estimated depends on the choice of hyper-parameters $\theta^L$ and, of course, on the data being modeled. Note that, depending on the statistical learning method under consideration, this model probability distribution may be degenerate, i.e., there may be no randomness in model estimation. Even so, the hyper-parameters still determine which model $\hat{\gamma}^L$ will be estimated (given the data), only in a deterministic way.
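
A rough way to see this distribution over estimated models is to refit a stochastic GBM under different random seeds while keeping data and hyper-parameters fixed. The sketch below does so on synthetic data; the helper `fitted_probs` and all values are hypothetical and serve only as an illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data (an assumption for illustration only).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)


def fitted_probs(seed, subsample):
    """Fit a GBM with the given seed and return in-sample class-1 probabilities."""
    model = GradientBoostingClassifier(
        subsample=subsample, n_estimators=50, random_state=seed
    )
    return model.fit(X, y).predict_proba(X)[:, 1]


# With subsample < 1, each seed draws different subsamples, so the fitted
# model (and therefore its predictions) varies from seed to seed.
probs = np.array([fitted_probs(seed, subsample=0.5) for seed in range(5)])
spread = probs.std(axis=0).mean()

# With the same seed and the same hyper-parameters, the same model is recovered.
same = np.allclose(fitted_probs(0, 0.5), fitted_probs(0, 0.5))

print(f"average std of predictions across seeds: {spread:.4f}")
print(f"identical predictions for identical seed: {same}")
```

The spread across seeds is a crude empirical view of $P(\gamma^L|\theta^L)$ for one hyper-parameter choice; it says nothing about which choice is best.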
<br>
Note also that the objective, in supervised learning tasks, is to approximate the target function $F(X)$ that defines precisely how the response variable $Y$ relates to inputs $X$. Therefore, the main concern is to choose the best learning method and, then, the best hyper-parameter vector $\theta^{L*} \in \Theta^L$, i.e., the one leading to an expected estimated model $E(\hat{\gamma}^L|\theta^{L*})$ that gets closest to $F(X)$.
<br>
Consequently, finding such $\theta^{L*}$ requires trying out all possible combinations of $\theta_1$, $\theta_2$, ..., $\theta_k$ available in $\Theta^L$. In this sense, saying that some value of $\theta_j$ is the "best" or "appropriate" one is strictly correct only if all other components $\theta_{-j}$ are also set to their best values. Such a *general perspective*, however, is not only infeasible but, more importantly here, also uninformative.
<br>
Therefore, tests such as those whose results are presented and discussed here adopt a *partial perspective*, since they assume all other things being equal. Although their results do not necessarily reveal hyper-parameter values that guarantee the best possible expected performance, they can still point to good strategies for hyper-parameter specification, besides providing evidence of how predictive accuracy relates to the main hyper-parameters of a given statistical learning method.

------------

#### Objectives and structure of tests

Turning the attention back to tests over GBM hyper-parameters, the objectives of the study whose results are presented here are as follows:
1. Contrast theoretical considerations with empirical evidence.
    * For instance, Friedman, Hastie, and Tibshirani (2008) indicate that large trees would hardly lead to better results than smaller ones (page 363 of the 2nd edition). They also point to the relevance of defining a very small learning rate (page 365) and to the possibility of stochastic GBM outperforming standard GBM (page 365), and thus to the possibility of $subsample < 1$ being preferable to $subsample = 1$.
<br>
<br>
2. Explore different ranges of values for the main hyper-parameters, so that appropriate values for each of them can be assessed, again under a partial perspective. With such values at hand, one can perform either:
    1. Grid or random search over a pre-selected set of values for a given hyper-parameter $\theta_j$.
    2. Grid or random search over a set of values for hyper-parameters $\theta_{-j}$, while $\theta_j$ is fixed at some appropriate value.

Concerning the structure of the tests, algorithms from the *sklearn* library were used, while data pre-processing, transformations and validation procedures relied on custom-written code. The response variable was binary, $Y \in \{0,1\}$, for each of the 100 different datasets, all of which are unbalanced to some degree and have different sets of input variables.
<br>
The setting was the following for each hyper-parameter explored:
1. **Subsample:**
    * $\eta \in \{0.75, 0.8, 0.9, 1\}$.
    * $max\_depth = 3$.
    * $learning\_rate = 0.05$.
    * $n\_estimators = 500$.
<br>
<br>
2. **Max depth:**
    * $\eta = 1$.
    * $max\_depth \in \{1, 2, 3, 4, 5, 6, 7, 8, 9, 10\}$.
    * $learning\_rate = 0.1$.
    * $n\_estimators \in \{500, 1000\}$.
<br>
<br>
3. **Learning rate:**
    * First setting: small values:
        * $\eta = 1$.
        * $max\_depth = 3$.
        * $learning\_rate \in \{0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1\}$.
        * $n\_estimators = 500$.
    * Second setting: moderate values:
        * $\eta = 1$.
        * $max\_depth = 3$.
        * $learning\_rate \in \{0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9\}$.
        * $n\_estimators \in \{100, 250, 500\}$.
<br>
<br>
4. **Number of estimators:**
    * $\eta = 1$.
    * $max\_depth = 3$.
    * $learning\_rate \in \{0.05, 0.1\}$.
    * $n\_estimators \in \{100, 250, 500, 750, 1000, 1250, 1500, 1750, 2000\}$.
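
The settings above can be written down as parameter grids, which also makes the cost of the partial perspective explicit. The sketch below uses scikit-learn's `ParameterGrid` to count configurations per experiment and compares the total with the full Cartesian product of all distinct values tested. The dictionary layout and experiment names are assumptions made for illustration; they are not the code used in the tests.

```python
from sklearn.model_selection import ParameterGrid

small_lr = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1]
moderate_lr = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
n_est_grid = [100, 250, 500, 750, 1000, 1250, 1500, 1750, 2000]
subsample_grid = [0.75, 0.8, 0.9, 1.0]
depth_grid = list(range(1, 11))

# One grid per experiment: a single hyper-parameter varies (plus the number
# of estimators or learning rate where the settings say so), the rest is fixed.
experiments = {
    "subsample": {"subsample": subsample_grid, "max_depth": [3],
                  "learning_rate": [0.05], "n_estimators": [500]},
    "max_depth": {"subsample": [1.0], "max_depth": depth_grid,
                  "learning_rate": [0.1], "n_estimators": [500, 1000]},
    "learning_rate_small": {"subsample": [1.0], "max_depth": [3],
                            "learning_rate": small_lr, "n_estimators": [500]},
    "learning_rate_moderate": {"subsample": [1.0], "max_depth": [3],
                               "learning_rate": moderate_lr,
                               "n_estimators": [100, 250, 500]},
    "n_estimators": {"subsample": [1.0], "max_depth": [3],
                     "learning_rate": [0.05, 0.1], "n_estimators": n_est_grid},
}

sizes = {name: len(ParameterGrid(grid)) for name, grid in experiments.items()}
total_partial = sum(sizes.values())  # duplicates across experiments are counted

# Full grid over every distinct value tested for each hyper-parameter.
all_lr = set(small_lr) | set(moderate_lr)
full_grid = len(subsample_grid) * len(depth_grid) * len(all_lr) * len(n_est_grid)

print(sizes)
print(total_partial, full_grid)
```

Per dataset, the partial design amounts to 79 fits, against 6480 for the exhaustive grid over the same values.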

This setting was based on some default values from the sklearn library, such as $max\_depth = 3$ or $learning\_rate = 0.1$, except when these hyper-parameters were themselves the subject of the grid search. Given the trade-off between learning rate and number of estimators, different numbers of estimators were used when exploring moderate learning rates ($0.1 \leq learning\_rate \leq 0.9$). For the same reason, two different learning rates were used when exploring the number of estimators.
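
One inexpensive way to look at the trade-off between learning rate and number of estimators is to fit a single model with the largest number of trees and read off the test metric at intermediate stages with `staged_predict_proba`. The sketch below does this on synthetic data; the data, split and checkpoints are assumptions for illustration and do not reproduce how the tests were run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic unbalanced data (an assumption for illustration only).
X, y = make_classification(
    n_samples=600, n_features=15, weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

checkpoints = [100, 250, 500]
curves = {}
for lr in (0.05, 0.1):
    gbm = GradientBoostingClassifier(
        learning_rate=lr, max_depth=3,
        n_estimators=max(checkpoints), random_state=0,
    )
    gbm.fit(X_tr, y_tr)
    # staged_predict_proba yields predictions after 1, 2, ..., n_estimators trees.
    staged_auc = [
        roc_auc_score(y_te, proba[:, 1])
        for proba in gbm.staged_predict_proba(X_te)
    ]
    curves[lr] = {m: staged_auc[m - 1] for m in checkpoints}

for lr, by_m in curves.items():
    print(lr, {m: round(score, 4) for m, score in by_m.items()})
```

Because boosting is sequential, the first $m$ trees of the larger model coincide with a model trained with $m$ estimators under the same seed, which is what makes this shortcut reasonable when `subsample = 1`.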
<br>
<br>
Regarding the methodology for assessing outcomes, both descriptive statistics and data visualization were applied, with the ROC-AUC statistic as the reference performance metric, although average precision score, precision-recall AUC, Brier score, and binomial deviance were also analyzed.
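
For reference, the metrics named above can be computed with scikit-learn as in the sketch below, using a four-observation toy example. The toy labels and scores are illustrative, and expressing binomial deviance as twice the mean negative log-likelihood is an assumption about the convention, since the exact definition used in the tests is not stated here.

```python
import numpy as np
from sklearn.metrics import (
    auc,
    average_precision_score,
    brier_score_loss,
    log_loss,
    precision_recall_curve,
    roc_auc_score,
)

# Toy labels and predicted probabilities (illustrative values only).
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8])

precision, recall, _ = precision_recall_curve(y_true, y_prob)

metrics = {
    "roc_auc": roc_auc_score(y_true, y_prob),
    "avg_precision": average_precision_score(y_true, y_prob),
    # Area under the precision-recall curve by the trapezoidal rule.
    "pr_auc": auc(recall, precision),
    "brier": brier_score_loss(y_true, y_prob),
    # Assumed convention: deviance = 2 * mean negative log-likelihood.
    "binomial_deviance": 2 * log_loss(y_true, y_prob),
}

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Average precision and trapezoidal precision-recall AUC summarize the same curve in different ways, so they generally differ slightly.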
<br>
<br>
A final remark concerns the generalization of the results presented here. Any conclusion should be taken with proper caution, since the datasets used were limited in nature. Tests with a similar setting applied to more diverse data (a different binary response variable, multiclass classification, regression problems) and even to simulated datasets would contribute greatly to the robustness of the results.
<br>
<br>
Irrespective of the nature of the datasets, some additional procedures could improve the tests. When studying hyper-parameter $\theta_j^L$, instead of setting $\theta_{-j}^L$ to ad-hoc values, as done here, it is possible to first define $\theta_{-j}^L$ through grid or random search, using cross-validation on training data and some reference value for $\theta_j^L$. Then, train-test split validation would yield performance metrics for a grid of $\theta_j^L$ values, using the best values for $\theta_{-j}^L$ obtained during cross-validation. Besides, in order to reduce the variance of results, it would be beneficial to consider only datasets with a sufficiently large number of observations, or else to select features beforehand for small datasets, in order to attenuate dimensionality problems.
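
The two-step procedure suggested above could look roughly like the sketch below, where $\theta_j$ is taken to be `max_depth` and $\theta_{-j}$ consists of `learning_rate` and `subsample`. The data, grids and reference value are assumptions chosen to keep the example small; this is not something that was run as part of the tests.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic unbalanced data (an assumption for illustration only).
X, y = make_classification(
    n_samples=500, n_features=12, weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Step 1: tune theta_{-j} by cross-validation on training data, keeping
# theta_j (here max_depth) at a reference value.
search = GridSearchCV(
    GradientBoostingClassifier(max_depth=3, n_estimators=100, random_state=0),
    param_grid={"learning_rate": [0.05, 0.1], "subsample": [0.8, 1.0]},
    scoring="roc_auc",
    cv=3,
)
search.fit(X_tr, y_tr)

# Step 2: evaluate a grid of theta_j values on the test split, using the
# best theta_{-j} found in step 1.
test_auc = {}
for depth in [1, 2, 3, 4, 5]:
    gbm = GradientBoostingClassifier(
        max_depth=depth, n_estimators=100, random_state=0,
        **search.best_params_,
    )
    gbm.fit(X_tr, y_tr)
    test_auc[depth] = roc_auc_score(y_te, gbm.predict_proba(X_te)[:, 1])

print(search.best_params_)
print({depth: round(score, 4) for depth, score in test_auc.items()})
```

The test split is used only in step 2, so the choice of $\theta_{-j}$ never sees the observations on which $\theta_j$ is compared.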

-----------------
<a id='main_conclusions'></a>

#### Main conclusions

In this section, the main conclusions derived from the analysis of results are presented and discussed.
1. **Subsample:** the smaller the training data, the less faithfully it represents the underlying population. Thus, imposing more randomness on the set of observations used to produce each estimator in the ensemble may reduce the variance component of predictive performance, more than compensating for the increase in bias. Furthermore, using subsampling with big datasets reduces running time while preserving a faithful sample of the data. As a result, $\eta < 1$ is an appropriate choice when defining GBM.
    * Best average performance metrics for $\eta < 1$. In particular, 0.8, 0.9 and 0.75 presented the highest averages of test ROC-AUC, in that order, with quite similar values.
        * [Reference 1](#reference1)<a href='#reference1'></a>: averages of performance metrics by hyper-parameter value.
        * [Reference 2](#reference2)<a href='#reference2'></a>: boxplots of performance metric by subsample value.
    * The cases in which a subsample value smaller than 1 is best are spread across several values of $\eta$, which explains why $\eta = 1$ still has the largest individual share of best performances. Even so, $\eta = 0.9$ and $\eta = 0.75$ have very close shares.
        * [Reference 3](#reference3)<a href='#reference3'></a>: count plot of best hyper-parameter value.
    * $\eta = 1$ tends to be the best hyper-parameter value on datasets that naturally lend themselves to good classification performance.
        * [Reference 4](#reference4)<a href='#reference4'></a>: stripplot of performance metric by best hyper-parameter value.
    * Consequently, $\eta < 1$ seems especially promising for datasets whose classification task is more challenging, such as those with few observations, stronger class imbalance, etc.
        * [Reference 5](#reference5)<a href='#reference5'></a>: average test ROC-AUC by best subsample value (note: this reference does not point to any causality between subsample and performance metric).
        * [Reference 6](#reference6)<a href='#reference6'></a>: average number of observations and average of the response variable by best hyper-parameter value.
        * [Reference 7](#reference7)<a href='#reference7'></a>: count plot of best subsample by quartiles of number of observations.
        * [Reference 8](#reference8)<a href='#reference8'></a>: heatmap of correlation matrices for different subsamples.
    * Although not specific to the subsample hyper-parameter, a positive and concave relationship was found between dataset length and predictive performance.
        * [Reference 9](#performance_data_info)<a href='#performance_data_info'></a>: scatterplot of performance metric against number of observations.
<br>
<br>
2. **Maximum depth:** high values for maximum depth increase model complexity, and depending on the dataset, this enlarged complexity may lead to overfitting due to the capture of interaction effects only present on training data. Consequently, tree size is a hyper-parameter whose range of values is expected to produce very distinct performance metrics. A range of $\{1, 2, 3, 4, 5\}$ seems reasonable to be explored through grid-search in most applications.
    * Clearly, performance was distinctly better for small values of $max\_depth$.
        * References [10](#reference10)<a href='#reference10'></a> and [11](#reference11)<a href='#reference11'></a>: averages of performance metrics.
    * $max\_depth \in \{1, 2\}$ account for more than half of the best hyper-parameter values.
        * [Reference 12](#reference12)<a href='#reference12'></a>: count plot of best hyper-parameter value.
    * Datasets whose best $max\_depth$ is high are likely to have better average performance.
        * [Reference 13](#reference13)<a href='#reference13'></a>: stripplot of performance metric by best hyper-parameter value.
    * Small $max\_depth$ values have particularly good performance with small datasets.
        * [Reference 14](#reference12)<a href='#reference12'></a>: count plot of best hyper-parameter value by quartiles of number of observations.
    * Smaller correlation between performance metric and number of observations for small values of $max\_depth$.
        * [Reference 15](#reference15)<a href='#reference15'></a>: heatmap of correlation matrices for different maximum depth values.
<br>
<br>
3. **Learning rate:** smaller learning rates are expected to produce better results, even if they require a relatively high number of estimators to properly capture the patterns present in the training data. Some good options to explore are $\{0.01, 0.05, 0.1\}$, depending on data length.
    * Indeed, $v \leq 0.1$ has substantially higher performance metrics than $v > 0.1$, irrespective of the number of estimators used.
        * [Reference 16](#reference16)<a href='#reference16'></a>: averages of performance metric by learning rate.
    * Learning rates $v \leq 0.1$ have similar frequencies of best hyper-parameter value. Even so, $v = 0.01$, the smallest value tested, stands out, given its distribution of performance metrics and its share of best hyper-parameter value.
        * References [17](#reference17)<a href='#reference17'></a> and [18](#reference18)<a href='#reference18'></a>: count plot of best hyper-parameter value and boxplot of performance metric by learning rate.
    * Decreasing correlation between performance metric and number of observations, and increasing with number of features across learning rate values.
        * [Reference 19](#reference19)<a href='#reference19'></a>: heatmap of correlation matrices for different learning rate values.
<br>
<br>
4. **Number of estimators:** this hyper-parameter should accommodate sufficiently small learning rates. In general, and similarly to the number of training epochs in neural network modeling, it seems reasonable to adjust the number of estimators so that predictive performance can be optimized, given good choices for the remaining hyper-parameters. In general applications, no more than 500 estimators appear to be sufficient to produce good estimates.
    * Very similar results for a broad range of values.
        * [Reference 20](#reference20)<a href='#reference20'></a>: boxplots of performance metric.
    * Moderate values for the number of estimators prevail as optimal values across the analyzed datasets (e.g., $M \in \{100, 250\}$). Besides, large values accounted for a small share of best hyper-parameters.
        * [Reference 21](#reference21)<a href='#reference21'></a>: count plot of best hyper-parameters.
    * Relevance of early stopping as a validation procedure when estimating GBM: datasets whose best hyper-parameter value is large show more potential to have a good predictive performance.
        * [Reference 22](#reference22)<a href='#reference22'></a>: stripplot of performance metric against number of estimators.
    * Relevance of early stopping as a validation procedure when estimating GBM: small datasets are more likely than larger ones to select small values for the number of estimators; however, even for small datasets, moderate values still prevail.
        * [Reference 23](#reference21)<a href='#reference21'></a>: count plot of best hyper-parameters.
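
Several of the references cited in the conclusions above rely on the same two summaries: the average metric by hyper-parameter value and the share of datasets for which each value is best. The sketch below shows one way to compute them with pandas. The table layout, column names and all numbers are hypothetical placeholders, not results from the tests.

```python
import pandas as pd

# Hypothetical long-format results: one row per (dataset, subsample value).
# All numbers are made-up placeholders, not outcomes of the tests.
results = pd.DataFrame({
    "dataset": ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
    "subsample": [0.8, 0.9, 1.0] * 3,
    "test_roc_auc": [0.71, 0.70, 0.69, 0.90, 0.91, 0.93, 0.62, 0.65, 0.61],
})

# Average performance by hyper-parameter value.
avg_by_value = results.groupby("subsample")["test_roc_auc"].mean()

# Best hyper-parameter value for each dataset (row with the highest metric).
best_rows = results.loc[results.groupby("dataset")["test_roc_auc"].idxmax()]
best_by_dataset = best_rows.set_index("dataset")["subsample"].to_dict()

# How often each value is the best, and the average metric of the datasets
# for which it is the best.
best_counts = best_rows["subsample"].value_counts()
avg_auc_by_best = best_rows.groupby("subsample")["test_roc_auc"].mean()

print(avg_by_value)
print(best_by_dataset)
print(best_counts)
print(avg_auc_by_best)
```

The last summary is descriptive only: a high average among datasets where a value is best does not imply that the value causes the high performance.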

Having these conclusions in mind, two immediate possibilities emerge: i) to use the appropriate values found here in applications where predictive accuracy need not be the best possible, but where reasonably good reference performances are needed for comparison against those obtained after modifying the data modeling in some way; ii) to compare the quality of the results obtained with the appropriate values found here, taking as a benchmark the performances derived from random search (a study to be implemented in the future).
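
For the first possibility, a reference configuration in the spirit of the conclusions might look like the sketch below, with scikit-learn's built-in early stopping (`n_iter_no_change` and `validation_fraction`) capping the number of estimators. The specific values and the synthetic data are assumptions for illustration, not a recommendation derived from the tests.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic unbalanced data (an assumption for illustration only).
X, y = make_classification(
    n_samples=500, n_features=12, weights=[0.8, 0.2], random_state=0
)

baseline = GradientBoostingClassifier(
    subsample=0.8,            # eta < 1
    max_depth=3,              # small trees
    learning_rate=0.05,       # small learning rate
    n_estimators=500,         # upper bound on the number of trees
    validation_fraction=0.2,  # share of training data held out for stopping
    n_iter_no_change=20,      # stop after 20 stages without improvement
    random_state=0,
)
baseline.fit(X, y)

# Number of trees actually kept after early stopping.
print(baseline.n_estimators_)
```

With early stopping enabled, `n_estimators` acts as a ceiling rather than a fixed choice, which fits the finding that moderate values were usually sufficient.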