# [Parameter Tuning in XGBoost](https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/)


## The XGBoost Advantage

1. __Regularization__

    - Standard GBM implementations have no regularization, whereas XGBoost does, which helps to reduce overfitting.
    - In fact, XGBoost is also known as a ‘regularized boosting’ technique.
    
2. __Parallel Processing__

    - XGBoost implements parallel processing and is considerably faster than GBM.
    - But hang on, boosting is a sequential process, so how can it be parallelized? Each tree can be built only after the previous one, but nothing stops us from building a single tree using all cores. The parallelism is inside the construction of each tree, not across trees.
    - XGBoost also supports implementation on Hadoop.
    
3. __High Flexibility__

    - XGBoost allows users to define custom optimization objectives and evaluation criteria.
    - This adds a whole new dimension to the model, since the loss and the evaluation metric can be matched to the problem at hand.
    
4. __Handling Missing Values__
    - XGBoost has an in-built routine to handle missing values.
    - The user supplies the value that marks an observation as missing (one that differs from the values of other observations) and passes it as a parameter. When XGBoost encounters a missing value at a node, it tries both branches and learns which path to take for missing values in future.
    
5. __Tree Pruning__
    - A GBM would stop splitting a node when it encounters a split with a negative gain. Thus it is more of a greedy algorithm.
    - XGBoost, on the other hand, makes splits up to the max_depth specified and then prunes the tree backwards, removing splits beyond which there is no positive gain.
    - Another advantage is that sometimes a split with a negative gain, say -2, may be followed by a split with a positive gain of +10. GBM would stop as soon as it encounters the -2. XGBoost will go deeper, see the combined effect of +8, and keep both splits.
    
    
6. __Built-in Cross-Validation__
    - XGBoost allows the user to run a cross-validation at each iteration of the boosting process, so it is easy to get the optimum number of boosting iterations in a single run.
    - This is unlike GBM, where we have to run a grid search and only a limited number of values can be tested.

7. __Continue on Existing Model__
    - The user can start training an XGBoost model from the last iteration of a previous run. This can be a significant advantage in certain applications.
    - The GBM implementation of sklearn also has this feature, so the two are even on this point.
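
The two stopping rules described under Tree Pruning can be compared with a toy example. This is a simplified sketch, not XGBoost's actual implementation: the function names `greedy_depth` and `pruned_depth` are hypothetical, and each number stands for the gain of one split along a single root-to-leaf path.

```python
def greedy_depth(gains):
    """Stop at the first split whose gain is not positive (GBM-style rule)."""
    depth = 0
    for gain in gains:
        if gain <= 0:
            break
        depth += 1
    return depth


def pruned_depth(gains, max_depth):
    """Grow up to max_depth, then walk back from the deepest split and
    drop trailing splits that bring no positive gain."""
    depth = min(len(gains), max_depth)
    while depth > 0 and gains[depth - 1] <= 0:
        depth -= 1
    return depth


gains = [-2, 10]  # the example from the notes: -2 followed by +10

kept_greedy = greedy_depth(gains)
kept_pruned = pruned_depth(gains, max_depth=6)
combined_gain = sum(gains[:kept_pruned])

print(kept_greedy)    # 0: the greedy rule stops at the -2 split
print(kept_pruned)    # 2: growing first and pruning later keeps both splits
print(combined_gain)  # 8
```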
    
    
## XGBoost Parameters
The overall parameters have been divided into 3 categories by XGBoost authors:

1. __General Parameters:__ Guide the overall functioning
2. __Booster Parameters:__ Guide the individual booster (tree/regression) at each step
3. __Learning Task Parameters:__ Guide the optimization performed

### General Parameters

These define the overall functionality of XGBoost.

1. `booster [default=gbtree]`
    - Select the type of model to run at each iteration. It has 2 options:
        - gbtree: tree-based models
        - gblinear: linear models


2. `silent [default=0]`
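    - Silent mode is activated if this is set to 1, i.e. no running messages will be printed.
    - It’s generally good to keep it 0 as the messages might help in understanding the model.
    - Newer XGBoost releases deprecate `silent` in favour of `verbosity` (0 = silent, 1 = warning, 2 = info, 3 = debug).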

3. `nthread [default to maximum number of threads available if not set]`
    - This is used for parallel processing and number of cores in the system should be entered
    - If you wish to run on all cores, value should not be entered and algorithm will detect automatically

### Booster Parameters

1. `eta [default=0.3]`
    - Analogous to learning rate in GBM
    - Makes the model more robust by shrinking the weights on each step
    - Typical final values to be used: 0.01-0.2
    
2. `min_child_weight [default=1]`
    - Defines the minimum sum of weights of all observations required in a child.
    - This is similar to min_samples_leaf in GBM but not exactly. This refers to the min “sum of weights” of observations while GBM has the min “number of observations”.
    - Used to control over-fitting. Higher values prevent a model from learning relations which might be highly specific to the particular sample selected for a tree.
    - Values that are too high can lead to under-fitting, so it should be tuned using CV.
    
    
3. `max_depth [default=6]`
    - The maximum depth of a tree, same as GBM.
    - Used to control over-fitting as higher depth will allow model to learn relations very specific to a particular sample.
    - Should be tuned using CV.
    - Typical values: 3-10
    
    
4. `max_leaf_nodes`
    - The maximum number of terminal nodes or leaves in a tree.
    - Can be defined in place of max_depth. Since binary trees are created, a depth of ‘n’ would produce a maximum of 2^n leaves.
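    - If this is defined, GBM will ignore max_depth. Note that max_leaf_nodes is the sklearn GBM name; XGBoost’s own parameter for limiting the number of leaves is `max_leaves`.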
    
    
5. `gamma [default=0]`
    - A node is split only when the resulting split gives a positive reduction in the loss function. Gamma specifies the minimum loss reduction required to make a split.
    - Makes the algorithm conservative. The values can vary depending on the loss function and should be tuned.
    
    
6. `max_delta_step [default=0]`
    - The maximum delta step allowed for each tree’s weight estimation. If the value is set to 0, there is no constraint. If it is set to a positive value, it can help make the update step more conservative.
    - Usually this parameter is not needed, but it might help in logistic regression when the classes are extremely imbalanced.
    
    
7. `subsample [default=1]`
    - Same as the subsample of GBM. Denotes the fraction of observations to be randomly sampled for each tree.
    - Lower values make the algorithm more conservative and prevent overfitting, but values that are too small might lead to under-fitting.
    - Typical values: 0.5-1
    
8. `colsample_bytree [default=1]`
    - Similar to max_features in GBM. Denotes the fraction of columns to be randomly sampled for each tree.
    - Typical values: 0.5-1
    
9. `colsample_bylevel [default=1]`
    - Denotes the subsample ratio of columns for each split, in each level.
    - Not used often, because subsample and colsample_bytree usually do the job, but it can be explored further if needed.
    
10. `lambda [default=1]`
    - L2 regularization term on weights (analogous to Ridge regression)
    - This is used to handle the regularization part of XGBoost. Though many data scientists don’t use it often, it should be explored to reduce overfitting.
    
11. `alpha [default=0]`
    - L1 regularization term on weight (analogous to Lasso regression)
    - Can be used in case of very high dimensionality so that the algorithm runs faster when implemented
    
12. `scale_pos_weight [default=1]`
    - A value greater than 0 should be used in case of high class imbalance as it helps in faster convergence. A typical starting value is sum(negative instances) / sum(positive instances).
    
    
13. `n_estimators`

    - The number of trees (or rounds) in an XGBoost model is specified to the XGBClassifier or XGBRegressor class in the n_estimators argument. The default in the XGBoost library is 100.
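
To see how `lambda`, `gamma` and `min_child_weight` enter a split decision, the sketch below evaluates the split gain formula from the XGBoost paper for one candidate split. It is a minimal illustration, not library code: `split_gain` and `split_allowed` are hypothetical helper names, `g` and `h` are the sums of first- and second-order gradients in each child, and the numbers are made up.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction of a split, penalised by lambda (L2) and gamma."""
    def score(g, h):
        return g * g / (h + reg_lambda)

    parent = score(g_left + g_right, h_left + h_right)
    children = score(g_left, h_left) + score(g_right, h_right)
    return 0.5 * (children - parent) - gamma


def split_allowed(g_left, h_left, g_right, h_right,
                  reg_lambda=1.0, gamma=0.0, min_child_weight=1.0):
    """A split is kept only if both children carry enough weight
    and the penalised gain is positive."""
    if h_left < min_child_weight or h_right < min_child_weight:
        return False
    return split_gain(g_left, h_left, g_right, h_right, reg_lambda, gamma) > 0


# Made-up gradient statistics for one candidate split
stats = dict(g_left=-4.0, h_left=3.0, g_right=5.0, h_right=4.0)

base_gain = split_gain(**stats)                     # defaults: lambda=1, gamma=0
strong_l2_gain = split_gain(**stats, reg_lambda=10.0)

print(base_gain)                                    # 4.4375
print(strong_l2_gain < base_gain)                   # True: larger lambda shrinks the gain
print(split_allowed(**stats, gamma=5.0))            # False: gain does not clear gamma
print(split_allowed(**stats, min_child_weight=3.5)) # False: left child is too light
```

Raising any of the three parameters makes the same candidate split less likely to survive, which is why the notes describe them as making the algorithm more conservative.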

### Learning Task Parameters
These parameters are used to define the optimization objective and the metric to be calculated at each step.

1. `objective [default=reg:squarederror]`

    - This defines the loss function to be minimized. Mostly used values are:
        - reg:squarederror – regression with squared loss
        - binary:logistic – logistic regression for binary classification, returns predicted probability (not class)
        - multi:softmax – multiclass classification using the softmax objective, returns predicted class (not probabilities). You also need to set an additional __num_class__ parameter defining the number of unique classes.
        - multi:softprob – same as softmax, but returns the predicted probability of each data point belonging to each class.

2. `eval_metric [ default according to objective ]`

    - The metric to be used for validation data.
    - The default values are rmse for regression and error for classification (logloss in XGBoost 1.3 and later).
    - Typical values are:
        - rmse – root mean square error
        - mae – mean absolute error
        - logloss – negative log-likelihood
        - error – binary classification error rate (0.5 threshold)
        - merror – multiclass classification error rate
        - mlogloss – multiclass logloss
        - auc – area under the curve
        
3. `seed [default=0]`
    - The random number seed.
    - Can be used for generating reproducible results and also for parameter tuning.
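
What the different objectives return can be sketched in plain Python. This only illustrates the output transformations, with made-up raw scores and hypothetical helper names (`sigmoid`, `softmax`); it does not call XGBoost.

```python
import math


def sigmoid(margin):
    """Map a raw score to a probability, as binary:logistic does."""
    return 1.0 / (1.0 + math.exp(-margin))


def softmax(scores):
    """Map one raw score per class to class probabilities."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


# binary:logistic returns a probability, not a class.
# The `error` metric turns it into a class with a 0.5 threshold.
raw_margin = 0.8
prob = sigmoid(raw_margin)
label = int(prob > 0.5)

# multi:softprob returns one probability per class (num_class = 3 here);
# multi:softmax returns only the most likely class.
raw_scores = [0.2, 1.5, -0.3]
class_probs = softmax(raw_scores)
predicted_class = class_probs.index(max(class_probs))

print(round(prob, 3), label)
print([round(p, 3) for p in class_probs], predicted_class)
```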



> The xgboost module in Python has an sklearn wrapper called `XGBClassifier`. This allows us to use sklearn’s Grid Search with parallel processing in the same way we did for GBM. The wrapper renames a few parameters:

1. eta –> learning_rate  
2. lambda –> reg_lambda  
3. alpha –> reg_alpha  
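
A small helper can make the renaming explicit when moving a parameter dictionary from the native API to the sklearn wrapper. The helper name `to_sklearn_names` is hypothetical; it applies only the three renames listed above and passes every other key through unchanged.

```python
# Native XGBoost name -> sklearn wrapper name
NATIVE_TO_SKLEARN = {
    "eta": "learning_rate",
    "lambda": "reg_lambda",
    "alpha": "reg_alpha",
}


def to_sklearn_names(native_params):
    """Return a copy of the parameters with wrapper-style names."""
    return {
        NATIVE_TO_SKLEARN.get(name, name): value
        for name, value in native_params.items()
    }


native_params = {"eta": 0.1, "lambda": 1, "alpha": 0, "max_depth": 5}
sklearn_params = to_sklearn_names(native_params)

print(sklearn_params)
# {'learning_rate': 0.1, 'reg_lambda': 1, 'reg_alpha': 0, 'max_depth': 5}
```

Assuming xgboost is installed, the result could then be unpacked into the wrapper, for example `XGBClassifier(**sklearn_params)`.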


#### Reference
- [XGBoost Parameters reference](https://xgboost.readthedocs.io/en/latest/parameter.html#general-parameters)
- [XGBoost Demo Codes (xgboost GitHub repository)](https://github.com/dmlc/xgboost/tree/master/demo/guide-python)

## General Approach for Parameter Tuning

The various steps to be performed are:

1. Choose a relatively __high learning rate__. Generally a learning rate of 0.1 works, but somewhere between 0.05 and 0.3 should work for different problems. Determine the optimum number of trees for this learning rate. XGBoost has a very useful function called `cv` which performs cross-validation at each boosting iteration and thus returns the optimum number of trees required.

2. __Tune tree-specific parameters__ (max_depth, min_child_weight, gamma, subsample, colsample_bytree) for the decided learning rate and number of trees. Note that different sets of parameters can be chosen to define a tree.
3. Tune __regularization parameters__ (lambda, alpha) for xgboost which can help reduce model complexity and enhance performance.
4. __Lower the learning rate__ and decide the optimal parameters.

***

# [Random Forest Hyperparameter Tuning](https://www.analyticsvidhya.com/blog/2020/03/beginners-guide-random-forest-hyperparameter-tuning/)


Random Forest Hyperparameters we’ll be Looking at:
- max_depth
- min_samples_split
- max_leaf_nodes
- min_samples_leaf
- n_estimators
- max_samples (bootstrap sample)
- max_features

> Important: many of these hyperparameters are deprecated.

### max_depth

> The max_depth of a tree in Random Forest is defined as the longest path between the root node and the leaf node:

![](images\rf1.PNG)

Using the max_depth parameter, I can limit up to what depth I want every tree in my random forest to grow.

![](images\rf2.PNG)

In this graph, we can clearly see that as the max depth of the decision tree increases, the performance of the model over the training set increases continuously. On the other hand, as the max_depth value increases, the performance over the test set increases initially, but after a certain point it starts to decrease rapidly.

> Can you think of a reason for this? The tree starts to overfit the training set and therefore is not able to generalize over the unseen points in the test set.
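
A comparison like the one in the graph can be reproduced with scikit-learn. The sketch below is an assumption-laden example, not the code behind the figure: it uses a synthetic dataset from `make_classification`, and the depth values and forest size are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, for illustration only
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Train one forest per depth and record (train accuracy, test accuracy)
scores = {}
for depth in [1, 2, 4, 8, 16]:
    rf = RandomForestClassifier(n_estimators=50, max_depth=depth, random_state=0)
    rf.fit(X_train, y_train)
    scores[depth] = (rf.score(X_train, y_train), rf.score(X_test, y_test))

for depth, (train_acc, test_acc) in scores.items():
    print(f"max_depth={depth:2d}  train={train_acc:.3f}  test={test_acc:.3f}")
```

A growing gap between the two columns as the depth increases is the sign of overfitting discussed above.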

### min_samples_split

> min_samples_split – a parameter that tells the decision tree in a random forest the minimum required number of observations in any given node in order to split it.

The default value of min_samples_split is 2. This means that if any terminal node has at least two observations and is not a pure node, we can split it further into subnodes.

Having a default value as 2 poses the issue that a tree often keeps on splitting until the nodes are completely pure. As a result, the tree grows in size and therefore overfits the data.

![](images\rf3.PNG)

__By increasing the value of min_samples_split, we can reduce the number of splits that happen in the decision tree and therefore prevent the model from overfitting.__ In the above example, if we increase the min_samples_split value from 2 to 6, the tree on the left would then look like the tree on the right.

Now, let’s look at the effect of min_samples_split on the performance of the model. The graph below is plotted considering that all the other parameters remain the same and only the value of min_samples_split is changed:

![](images\rf4.PNG)

As the value of the min_samples_split hyperparameter increases, we can clearly see that for small values of the parameter there is a significant difference between the training score and the test score. But as the value of the parameter increases, the difference between the train score and the test score decreases.

But there’s one thing you should keep in mind. When the parameter value increases too much, there is an overall dip in both the training score and test scores. This is due to the fact that the minimum requirement of splitting a node is so high that there are no significant splits observed. As a result, the random forest starts to underfit.
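
The rule that `min_samples_split` enforces at each node can be written as a one-line check. This is a simplified sketch of the condition only; `can_split` is a hypothetical helper, not part of scikit-learn.

```python
def can_split(labels, min_samples_split=2):
    """A node may be split if it holds enough observations and is not pure."""
    is_pure = len(set(labels)) <= 1
    return len(labels) >= min_samples_split and not is_pure


node = ["a", "a", "b", "b", "a"]  # 5 observations, two classes

print(can_split(node, min_samples_split=2))             # True
print(can_split(node, min_samples_split=6))             # False: fewer than 6 observations
print(can_split(["a", "a", "a"], min_samples_split=2))  # False: the node is pure
```

With the default of 2 almost every impure node qualifies, which is why trees keep growing; at 6 the same node becomes a leaf.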

### n_estimators

We know that a Random Forest algorithm is nothing but a grouping of trees. But how many trees should we consider? That’s a common question new data scientists ask. And it’s a valid one!

We might say that more trees should be able to produce a more generalized result, right? But with a larger number of trees, the training time of the Random Forest model also increases.

In this graph, we can clearly see that the performance of the model sharply increases and then stagnates at a certain level:

![](images\rf9.PNG)

This means that choosing a very large number of estimators in a random forest model is not the best idea. Although more trees will not degrade the model, stopping once the score has levelled off saves computation and prevents the use of a fire extinguisher on your CPU!
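
One way to watch the score level off without retraining from scratch is scikit-learn's `warm_start` option, which keeps the trees already fitted and only adds new ones. The sketch below assumes a synthetic dataset and arbitrary forest sizes; it is not the code behind the graph.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, for illustration only
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# warm_start=True reuses the existing trees on each call to fit
rf = RandomForestClassifier(n_estimators=10, warm_start=True, random_state=0)

test_scores = {}
for n_trees in [10, 25, 50, 100]:
    rf.set_params(n_estimators=n_trees)
    rf.fit(X_train, y_train)  # trains only the additional trees
    test_scores[n_trees] = rf.score(X_test, y_test)

for n_trees, acc in test_scores.items():
    print(f"n_estimators={n_trees:3d}  test={acc:.3f}")
```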

### max_samples

> The max_samples hyperparameter determines what fraction of the original dataset is given to any individual tree.

![](images\rf10.PNG)

We can see that the performance of the model rises sharply and then saturates fairly quickly. Can you figure out what the key takeaway from this visualization is?

It is not necessary to give each decision tree of the Random Forest the full data. Notice that the model performance reaches its maximum when the data provided is less than a 0.2 fraction of the original dataset. That’s quite astonishing!

Although this fraction will differ from dataset to dataset, we can allocate a smaller fraction of bootstrapped data to each decision tree. As a result, the training time of the Random Forest model is reduced drastically.
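
What a fraction of bootstrapped data per tree means can be shown with the standard library alone. This is a sketch of the sampling idea, not scikit-learn's implementation; `bootstrap_indices` is a hypothetical helper and the dataset size is made up.

```python
import random


def bootstrap_indices(n_rows, max_samples, rng):
    """Draw round(n_rows * max_samples) row indices with replacement."""
    n_draw = max(1, round(n_rows * max_samples))
    return rng.choices(range(n_rows), k=n_draw)


rng = random.Random(0)
n_rows = 1000

sample = bootstrap_indices(n_rows, max_samples=0.2, rng=rng)

print(len(sample))       # 200 rows are handed to this tree
print(len(set(sample)))  # distinct rows, at most 200 because draws repeat
```

In scikit-learn the corresponding setting would be `RandomForestClassifier(max_samples=0.2)`, which only takes effect when `bootstrap=True` (the default).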

### max_features

> This is the maximum number of features considered at each split in the trees of a random forest.

We know that random forest chooses a random subset of the features to find the best split. Let’s see how varying this parameter can affect our random forest model’s performance.

![](images\rf11.PNG)

We can see that the performance of the model initially increases as max_features increases. After a certain point, however, the train_score keeps on increasing while the test_score saturates and even starts decreasing towards the end, which clearly means that the model starts to overfit.

In this example, the overall performance of the model is highest when max_features is close to 6. It is a good convention to consider the default value of this parameter, which is set to the square root of the number of features present in the dataset. The ideal number of max_features generally tends to lie close to this value.
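
The square-root default and the random feature subset can be sketched with the standard library. The helper names `sqrt_max_features` and `candidate_features` are hypothetical, and the feature count of 20 is made up for the example.

```python
import math
import random


def sqrt_max_features(n_features):
    """Square-root rule: at least one feature, rounded down."""
    return max(1, int(math.sqrt(n_features)))


def candidate_features(n_features, rng):
    """Pick the random subset of feature indices examined at one split."""
    k = sqrt_max_features(n_features)
    return sorted(rng.sample(range(n_features), k))


rng = random.Random(0)
n_features = 20

k = sqrt_max_features(n_features)
subset = candidate_features(n_features, rng)

print(k)            # 4
print(len(subset))  # 4 distinct feature indices out of 20
```

A fresh subset is drawn for every split, so different splits see different features even inside the same tree.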