# [Parameter Tuning in Gradient Boosting (GBM)](https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/)

# [Gradient Boosting Decision tree](https://towardsdatascience.com/gradient-boosted-decision-trees-explained-9259bd8205af)

__Boosting algorithms play a crucial role in managing the bias-variance trade-off. Unlike bagging algorithms, which only control for high variance in a model, boosting addresses both aspects (bias and variance) and is often considered more effective.__


### Difference between boosting and random forest

![](images\gbm_5.PNG)

## 1. How Boosting Works

Boosting is a sequential technique that works on the principle of __ensemble__ learning. It combines a set of __weak learners__ to deliver improved prediction accuracy. At any step t, the model outcomes are weighted based on the outcomes of the previous step t-1: observations predicted correctly are given a lower weight, and misclassified ones are given a higher weight. This scheme is used for classification problems; a similar technique is used for regression.

![](images\gbm1.PNG)
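The reweighting idea can be sketched with a single boosting round. This is a minimal illustration that assumes the classic AdaBoost update rule (the text above does not name a specific scheme); the labels and predictions are made up for the example.

```python
import numpy as np

# Hypothetical labels and weak-learner predictions in {-1, +1} (illustrative only)
y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, -1, 1])   # the second sample is misclassified

# Start from uniform sample weights
weights = np.full(len(y_true), 1 / len(y_true))

# Weighted error of the weak learner and its vote (alpha), as in classic AdaBoost
miss = y_pred != y_true
error = weights[miss].sum()
alpha = 0.5 * np.log((1 - error) / error)

# Misclassified samples are up-weighted, correct ones down-weighted, then renormalised
new_weights = weights * np.exp(-alpha * y_true * y_pred)
new_weights /= new_weights.sum()

print(new_weights)
```

The next weak learner would then be trained with `new_weights`, so it concentrates on the sample the previous learner got wrong.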

### Gradient Boosting

The gradient boosting algorithm sequentially combines weak learners in such a way that each new learner fits the residuals from the previous step, so the model improves at every step. The final model aggregates the results from each step, producing a strong learner.

The __gradient boosted decision trees__ algorithm uses decision trees as weak learners. A loss function is used to compute the residuals. For instance, mean squared error (MSE) can be used for a regression task and logarithmic loss (log loss) can be used for classification tasks. It is worth noting that existing trees in the model do not change when a new tree is added. The added decision tree fits the residuals from the current model.

![](images\gbm_6.PNG)
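The residual-fitting loop can be sketched in a few lines. This is a simplified illustration for regression with squared error, not the implementation scikit-learn uses internally; the synthetic data, the tree depth, and the number of rounds are assumptions chosen for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# Initial model: a constant prediction (the mean minimises squared error)
prediction = np.full_like(y, y.mean())
trees = []
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(20):
    residuals = y - prediction             # negative gradient of 0.5 * (y - F)^2
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)                 # the new tree fits the current residuals
    trees.append(tree)                     # earlier trees are left unchanged
    prediction = prediction + tree.predict(X)
    mse_history.append(np.mean((y - prediction) ** 2))

print("training MSE: %.4f -> %.4f" % (mse_history[0], mse_history[-1]))
```

Each round only adds a tree; the final prediction is the initial constant plus the sum of all tree outputs.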

## 2. GBM Parameters

The overall parameters of this ensemble model can be divided into 3 categories:

1. __Tree-Specific Parameters:__ These affect each individual tree in the model.
2. __Boosting Parameters:__ These affect the boosting operation in the model.
3. __Miscellaneous Parameters:__ Other parameters for overall functioning.


![](images\gbm_2.PNG)

The parameters used for defining a tree are:

1. __min_samples_split__
    - Defines the minimum number of samples (or observations) required in a node for it to be considered for splitting.
    - Used to control over-fitting. Higher values prevent a model from learning relations which might be highly specific to the particular sample selected for a tree.
    - Values that are too high can lead to under-fitting, so it should be tuned using CV.



2. __min_samples_leaf__
    - Defines the minimum samples (or observations) required in a terminal node or leaf.
    - Used to control over-fitting similar to min_samples_split.
    - Generally lower values should be chosen for imbalanced class problems because the regions in which the minority class will be in majority will be very small.
    
    
3. __min_weight_fraction_leaf__
    - Similar to min_samples_leaf but defined as a fraction of the total sum of sample weights instead of an integer count. When no sample weights are given, this equals a fraction of the total number of observations.
    - Only one of #2 and #3 should be defined.
    
4. __max_depth__
    - The maximum depth of a tree.
    - Used to control over-fitting, as a higher depth allows the model to learn relations very specific to a particular sample.
    - Should be tuned using CV.
    - A higher maximum depth tends to cause overfitting, while a lower value tends to cause underfitting.
    
    ![](images\gbm_3.png)
    
5. __max_leaf_nodes__
    - The maximum number of terminal nodes or leaves in a tree.
    - Can be defined in place of max_depth. Since binary trees are created, a depth of ‘n’ would produce a maximum of 2^n leaves.
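    - Older scikit-learn releases ignored max_depth when this was defined; recent releases appear to apply both limits, so check the documentation for your version.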
    
6. __max_features__
    - The number of features to consider while searching for a best split. These will be randomly selected.
    - As a rule of thumb, the square root of the total number of features works well, but we should check up to 30-40% of the total number of features.
    - Higher values can lead to over-fitting but depends on case to case.    
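As a rough sketch of how the tree-specific parameters above are passed to scikit-learn's `GradientBoostingClassifier`, the example below fits a small model on synthetic data and inspects the fitted trees. The dataset and the parameter values are illustrative assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic binary classification data with 16 features (illustrative only)
X, y = make_classification(n_samples=500, n_features=16, random_state=0)

gbm = GradientBoostingClassifier(
    min_samples_split=20,   # a node needs at least 20 samples to be split
    min_samples_leaf=5,     # every leaf keeps at least 5 samples
    max_depth=3,            # each tree is at most 3 levels deep
    max_features="sqrt",    # consider sqrt(16) = 4 random features per split
    n_estimators=20,
    random_state=0,
)
gbm.fit(X, y)

# For binary classification, estimators_ has shape (n_estimators, 1)
depths = [est[0].get_depth() for est in gbm.estimators_]
leaves = [est[0].get_n_leaves() for est in gbm.estimators_]
print("max depth:", max(depths), "| max leaves:", max(leaves))
print("features considered per split:", gbm.max_features_)
```

With `max_depth=3`, no tree can have more than 2^3 = 8 leaves, which matches the relation between depth and leaf count noted under max_leaf_nodes.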
    
Before moving on to other parameters, let's see the overall pseudo-code of the GBM algorithm for 2 classes:

![](images\gbm_4.PNG)


### Boosting Parameters

> A problem with gradient boosted decision trees is that they are quick to learn and overfit training data. One effective way to slow down learning in the gradient boosting model is to use a learning rate, also called shrinkage (or eta in XGBoost).

#### Learning rate and n_estimators

Hyperparameters are key parts of learning algorithms that affect the performance and accuracy of a model. __Learning rate__ and __n_estimators__ are two critical hyperparameters for gradient boosting decision trees. `Learning rate` $\alpha$ simply means how fast the model learns. Each added tree modifies the overall model, and the magnitude of that modification is controlled by the learning rate.

The steps of gradient boosted decision tree algorithms with learning rate introduced:

![](images\gbm_7.PNG)


__The lower the learning rate, the slower the model learns.__ The __advantage__ of a slower learning rate is that the model becomes more robust and generalizes better. In statistical learning, models that learn slowly tend to perform better. However, learning slowly comes at a cost: it takes more time to train the model, which brings us to the other significant hyperparameter.

__n_estimators__ is the number of trees used in the model. If the learning rate is low, we need more trees to train the model. However, we need to be careful when selecting the number of trees, because using too many trees creates a high risk of overfitting.
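The interplay between the two hyperparameters can be seen by comparing training loss at the same number of trees. This is a small sketch on synthetic data; the two learning rates and the tree count are arbitrary values picked for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data (illustrative only)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

fast = GradientBoostingClassifier(learning_rate=0.5, n_estimators=50, random_state=0).fit(X, y)
slow = GradientBoostingClassifier(learning_rate=0.05, n_estimators=50, random_state=0).fit(X, y)

# train_score_[i] is the training loss after i + 1 trees (subsample is 1.0 here)
for n_trees in (10, 50):
    print(
        "trees=%d | loss at lr=0.5: %.4f | loss at lr=0.05: %.4f"
        % (n_trees, fast.train_score_[n_trees - 1], slow.train_score_[n_trees - 1])
    )
```

After the same number of trees, the model with the lower learning rate has reduced the training loss less, which is why it needs more trees to reach a comparable fit.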



#### Note on overfitting

__One key difference between random forests and gradient boosting decision trees is the number of trees used in the model. Increasing the number of trees in random forests does not cause overfitting.__ After some point, the accuracy of the model does not increase by adding more trees, but it is also not negatively affected by adding excessive trees. You still do not want to add an unnecessary number of trees for computational reasons, but there is no risk of overfitting associated with the number of trees in random forests.


However, the number of trees in gradient boosting decision trees is critical in terms of overfitting. Adding too many trees will cause overfitting, so it is important to stop adding trees at some point.
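One way to decide where to stop is to track the loss on held-out data after each added tree. The sketch below uses `staged_predict_proba` on a synthetic, deliberately noisy dataset; the split size, learning rate, and noise level are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Noisy synthetic data (20% of labels flipped) so that overfitting is easy to provoke
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

gbm = GradientBoostingClassifier(learning_rate=0.3, n_estimators=200, random_state=0)
gbm.fit(X_train, y_train)

# Validation log loss after 1, 2, ..., n_estimators trees
val_loss = [log_loss(y_val, proba) for proba in gbm.staged_predict_proba(X_val)]
best_n = int(np.argmin(val_loss)) + 1

print("best number of trees on validation data:", best_n)
print("validation loss at best / at the end: %.4f / %.4f" % (val_loss[best_n - 1], val_loss[-1]))
```

If the validation loss at the end is clearly higher than at `best_n`, the extra trees are fitting noise in the training data rather than improving generalization.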

![](images\gbm_9.PNG)

Apart from these, there are certain miscellaneous parameters which affect overall functionality:

![](images\gbm_10.PNG)




```python
# Import libraries
import pandas as pd
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier  # GBM algorithm
from sklearn.model_selection import cross_val_score  # Additional scikit-learn functions
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import GridSearchCV  # Performing grid search

import matplotlib.pylab as plt
%matplotlib inline
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 12, 4

# pandas defaults
pd.options.display.max_columns = 500
pd.options.display.max_rows = 500
```

```python
def modelfit(alg, dtrain, predictors, performCV=True, printFeatureImportance=True, cv_folds=5):
    """Fit alg on dtrain, then report training metrics, CV AUC, and feature importances."""
    # Fit the algorithm on the data
    alg.fit(dtrain[predictors], dtrain['Disbursed'])

    # Predict on the training set
    dtrain_predictions = alg.predict(dtrain[predictors])
    dtrain_predprob = alg.predict_proba(dtrain[predictors])[:, 1]

    # Perform cross-validation
    if performCV:
        cv_score = cross_val_score(alg, dtrain[predictors], dtrain['Disbursed'], cv=cv_folds, scoring='roc_auc')

    # Print model report
    print("\nModel Report")
    print("Accuracy : %.4g" % accuracy_score(dtrain['Disbursed'].values, dtrain_predictions))
    print("AUC Score (Train): %f" % roc_auc_score(dtrain['Disbursed'], dtrain_predprob))

    if performCV:
        print("CV Score : Mean - %.7g | Std - %.7g | Min - %.7g | Max - %.7g" % (np.mean(cv_score), np.std(cv_score), np.min(cv_score), np.max(cv_score)))

    # Plot feature importances
    if printFeatureImportance:
        feat_imp = pd.Series(alg.feature_importances_, predictors).sort_values(ascending=False)
        feat_imp.plot(kind='bar', title='Feature Importances')
        plt.ylabel('Feature Importance Score')
```


![](images\gbm_11.PNG)

![](images\gbm_12.PNG)

![](images\gbm_13.PNG)

![](images\gbm_14.PNG)

![](images\gbm_15.PNG)

![](images\gbm_16.PNG)

![](images\gbm_17.PNG)

![](images\gbm_178.PNG)

![](images\gbm_18.PNG)

Here, we find that the optimum value of max_features is 7, which is also the square root of the number of features. So our initial value was the best. You might be anxious to check for lower values, and you should if you like. I’ll stay with 7 for now. With this, we have the final tree parameters as:

- min_samples_split: 1200
- min_samples_leaf: 60
- max_depth: 9
- max_features: 7

### Tuning subsample and making models with lower learning rate

The next step would be to try different subsample values. Let's take the values 0.6, 0.7, 0.75, 0.8, 0.85, 0.9.

![](images\gbm_19.PNG)
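A grid search over these subsample values could look roughly like the sketch below. It runs on synthetic data with a small model so it finishes quickly; the dataset, `n_estimators`, and the number of CV folds are assumptions for illustration and differ from the setup used in this walkthrough.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the real training set (illustrative only)
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# The subsample candidates listed above
param_grid = {"subsample": [0.6, 0.7, 0.75, 0.8, 0.85, 0.9]}

search = GridSearchCV(
    estimator=GradientBoostingClassifier(learning_rate=0.1, n_estimators=30, random_state=0),
    param_grid=param_grid,
    scoring="roc_auc",   # same metric as the CV score reported by modelfit
    cv=3,
)
search.fit(X, y)

print("best subsample:", search.best_params_["subsample"])
print("best CV AUC: %.4f" % search.best_score_)
```

On real data, the already tuned tree parameters would be fixed in the estimator so that only subsample varies across the grid.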

Here, we found 0.85 to be the optimum value. Finally, we have all the parameters needed. Now we need to lower the learning rate and increase the number of estimators proportionally. Note that these numbers of trees might not be the optimum values, but they serve as a good benchmark.

As trees increase, it will become increasingly computationally expensive to perform CV and find the optimum values. 

Let's halve the learning rate to 0.05 and double the number of trees to 120.

![](images\gbm_20.PNG)

1. Now let's reduce the learning rate to one-tenth of the original value, i.e. 0.01 with 600 trees ---> cv_mean = 0.8409
2. Let's decrease it to one-twentieth of the original value, i.e. 0.005 with 1200 trees ---> cv_mean = 0.8392
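The proportional scaling used in these steps is simple arithmetic: dividing the learning rate by a factor multiplies the number of trees by the same factor. The sketch below reproduces the pairs from the text, assuming the baseline implied by it (learning rate 0.1 with 60 trees); the variable names are hypothetical.

```python
# Baseline implied by the text: 0.05 is "half" and 120 trees is "twice"
base_learning_rate = 0.1
base_n_estimators = 60

schedule = []
for factor in (2, 10, 20):
    learning_rate = base_learning_rate / factor   # shrink the step size
    n_estimators = base_n_estimators * factor     # grow the ensemble by the same factor
    schedule.append((learning_rate, n_estimators))

for learning_rate, n_estimators in schedule:
    print("learning_rate=%.3f -> n_estimators=%d" % (learning_rate, n_estimators))
```

Each pair can then be passed as `learning_rate` and `n_estimators` to `GradientBoostingClassifier` and scored with cross-validation, keeping in mind that training time grows with the number of trees.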

#### Gradient Boosting Pros and cons

![](images\gbm_8.PNG)