[docs] Add details on improving training speed #3628

Merged
merged 10 commits into from
Dec 11, 2020
140 changes: 135 additions & 5 deletions docs/Parameters-Tuning.rst
@@ -36,15 +36,145 @@ To get good results using a leaf-wise tree, these are some important parameters:
For Faster Speed
----------------

- Use bagging by setting ``bagging_fraction`` and ``bagging_freq``
Add More Computational Resources
''''''''''''''''''''''''''''''''

- Use feature sub-sampling by setting ``feature_fraction``
On systems where it is available, LightGBM uses OpenMP to parallelize many operations. The maximum number of threads used by LightGBM is controlled by the parameter ``num_threads``. By default, this will defer to the default behavior of OpenMP (one thread per real CPU core or the value in environment variable ``OMP_NUM_THREADS``, if it is set). For best performance, set this to the number of real CPU cores available.

You might be able to achieve faster training by moving to a machine with more available CPU cores.

Using distributed (multi-machine) training might also reduce training time. See the `Distributed Learning Guide <./Parallel-Learning-Guide.rst>`_ for details.

Use a GPU-enabled version of LightGBM
'''''''''''''''''''''''''''''''''''''

You might find that training is faster using a GPU-enabled build of LightGBM. See the `GPU Tutorial <./GPU-Tutorial.rst>`__ for details.

Grow Shallower Trees
''''''''''''''''''''

The total training time for LightGBM increases with the total number of tree nodes added. LightGBM comes with several parameters that can be used to control the number of nodes per tree.

The suggestions below will speed up training, but might hurt training accuracy.

Decrease ``max_depth``
**********************

This parameter is an integer that controls the maximum distance between the root node of each tree and a leaf node. Decrease ``max_depth`` to reduce training time.

Decrease ``num_leaves``
***********************

LightGBM adds nodes to trees based on the gain from adding that node, regardless of depth. This figure from `the feature documentation <./Features.rst#leaf-wise-best-first-tree-growth>`__ illustrates the process.

.. image:: ./_static/images/leaf-wise.png
:align: center

Because of this growth strategy, it isn't straightforward to use ``max_depth`` alone to limit the complexity of trees. The ``num_leaves`` parameter sets the maximum number of leaves (terminal nodes) per tree. Decrease ``num_leaves`` to reduce training time.
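
As a rough illustration of how the two limits interact: a binary tree whose leaves are at most ``max_depth`` splits away from the root can hold at most ``2 ** max_depth`` leaves, and ``num_leaves`` caps the leaf count directly. The sketch below is plain Python arithmetic, not a LightGBM API, and the helper names are hypothetical.

```python
def max_leaves_for_depth(max_depth: int) -> int:
    """Upper bound on the number of leaves in a binary tree of this depth."""
    return 2 ** max_depth


def effective_leaf_cap(max_depth: int, num_leaves: int) -> int:
    """Return the tighter of the two limits on leaves per tree."""
    # LightGBM documents max_depth <= 0 as "no limit", so only num_leaves applies.
    if max_depth <= 0:
        return num_leaves
    return min(num_leaves, max_leaves_for_depth(max_depth))


# With num_leaves=31, a depth limit of 3 is the binding constraint (8 leaves),
# while a depth limit of 10 is not (the cap stays at 31 leaves).
shallow_cap = effective_leaf_cap(max_depth=3, num_leaves=31)
deep_cap = effective_leaf_cap(max_depth=10, num_leaves=31)
```

Under these assumptions, lowering ``max_depth`` only reduces tree size once ``2 ** max_depth`` falls below ``num_leaves``.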

Increase ``min_gain_to_split``
******************************

When adding a new tree node, LightGBM chooses the split point that has the largest gain. Gain is basically the reduction in training loss that results from adding a split point. By default, LightGBM sets ``min_gain_to_split`` to 0.0, which means "there is no improvement that is too small". However, in practice you might find that very small improvements in the training loss don't have a meaningful impact on the generalization error of the model. Increase ``min_gain_to_split`` to reduce training time.
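
The effect of the threshold can be sketched in a few lines of plain Python. This is a simplified illustration and not LightGBM's implementation: the function name ``choose_split`` and the example gains are hypothetical, and treating a gain exactly equal to the threshold as rejected is an assumption.

```python
from typing import Dict, Optional


def choose_split(candidate_gains: Dict[str, float],
                 min_gain_to_split: float = 0.0) -> Optional[str]:
    """Return the best candidate split, or None if no split is worth making."""
    if not candidate_gains:
        return None
    best = max(candidate_gains, key=candidate_gains.get)
    # Assumption: the gain must strictly exceed the threshold to be accepted.
    if candidate_gains[best] > min_gain_to_split:
        return best
    return None


# Hypothetical gains for two candidate split points at one node.
gains = {"age < 30": 0.02, "income < 50000": 0.15}
accepted = choose_split(gains)                          # default threshold of 0.0
rejected = choose_split(gains, min_gain_to_split=0.5)   # threshold above every gain
```

When every candidate falls below the threshold the node is left as a leaf, which is where the training-time saving comes from.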

Increase ``min_data_in_leaf`` and ``min_sum_hessian_in_leaf``
*************************************************************

Depending on the size of the training data and the distribution of features, it's possible for LightGBM to add tree nodes that only describe a small number of observations. In the most extreme case, consider the addition of a tree node that only a single observation from the training data falls into. This is very unlikely to generalize well, and probably is a sign of overfitting.

This can be prevented indirectly with parameters like ``max_depth`` and ``num_leaves``, but LightGBM also offers parameters to help you directly avoid adding these overly-specific tree nodes.

- ``min_data_in_leaf``: Minimum number of observations that must fall into a tree node for it to be added.
- ``min_sum_hessian_in_leaf``: Minimum sum of the Hessian (second derivative of the objective function evaluated for each observation) for observations in a leaf. For some regression objectives, this is just the minimum number of records that have to fall into each node. For classification objectives, it represents a sum over a distribution of probabilities. See `this Cross Validated answer <https://stats.stackexchange.com/questions/317073/explanation-of-min-child-weight-in-xgboost-algorithm>`_ for a good description of how to reason about values of this parameter.
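
To make the Hessian sum more concrete, here is a minimal sketch under two textbook assumptions: unweighted observations, and the standard forms of the losses (squared error ``0.5 * (y - f) ** 2`` has a second derivative of 1 per observation; log loss with a plain sigmoid has a second derivative of ``p * (1 - p)``). The helper names are hypothetical, and LightGBM's own objectives may scale these values differently.

```python
from typing import Sequence


def hessian_sum_squared_error(n_observations: int) -> float:
    """Each observation contributes a Hessian of 1, so the sum is a row count."""
    return float(n_observations)


def hessian_sum_log_loss(predicted_probabilities: Sequence[float]) -> float:
    """Each observation contributes p * (1 - p), which is largest at p = 0.5."""
    return sum(p * (1.0 - p) for p in predicted_probabilities)


# Four uncertain predictions carry far more Hessian "weight" than
# four confident ones, even though both leaves hold four observations.
uncertain_leaf = hessian_sum_log_loss([0.5, 0.5, 0.5, 0.5])
confident_leaf = hessian_sum_log_loss([0.99, 0.99, 0.99, 0.99])
```

Under these assumptions, a leaf of very confident predictions can fail a ``min_sum_hessian_in_leaf`` check that a leaf of the same size with uncertain predictions would pass.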

Grow Fewer Trees
''''''''''''''''

Decrease ``num_iterations``
***************************

The ``num_iterations`` parameter controls the number of boosting rounds that will be performed. Since LightGBM uses decision trees as the learners, this can also be thought of as "number of trees".

If you try changing ``num_iterations``, change the ``learning_rate`` as well. ``learning_rate`` will not have any impact on training time, but it will impact the training accuracy. As a general rule, if you reduce ``num_iterations``, you should increase ``learning_rate``.

Choosing the right value of ``num_iterations`` and ``learning_rate`` is highly dependent on the data and objective, so these parameters are often chosen from a set of possible values through hyperparameter tuning.

Decrease ``num_iterations`` to reduce training time.

Use Early Stopping
******************

If early stopping is enabled, after each boosting round the model's accuracy is evaluated against a validation set that contains data not available to the training process. That accuracy is then compared to the best accuracy observed in earlier boosting rounds. If the model's accuracy fails to improve for some number of consecutive rounds, LightGBM stops the training process.

That "number of consecutive rounds" is controlled by the parameter ``early_stopping_rounds``. For example, ``early_stopping_rounds=1`` says *the first time accuracy on the validation set does not improve, stop training*.

Set ``early_stopping_rounds`` and provide a validation set to possibly reduce training time.
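
The stopping rule can be sketched in plain Python. This is a simplified illustration rather than LightGBM's implementation: it assumes a single validation metric where lower is better, tracks the best value seen so far, and uses a hypothetical function name and made-up loss values.

```python
from typing import Sequence


def rounds_until_stop(validation_losses: Sequence[float],
                      early_stopping_rounds: int) -> int:
    """Return how many boosting rounds run before the rule halts training."""
    best_loss = float("inf")
    rounds_without_improvement = 0
    for round_number, loss in enumerate(validation_losses, start=1):
        if loss < best_loss:
            # New best score: reset the patience counter.
            best_loss = loss
            rounds_without_improvement = 0
        else:
            rounds_without_improvement += 1
            if rounds_without_improvement >= early_stopping_rounds:
                return round_number
    # The rule never triggered, so every requested round was run.
    return len(validation_losses)


# Hypothetical validation losses for five boosting rounds.
losses = [0.50, 0.40, 0.41, 0.42, 0.39]
stop_with_patience_1 = rounds_until_stop(losses, early_stopping_rounds=1)
stop_with_patience_3 = rounds_until_stop(losses, early_stopping_rounds=3)
```

In this made-up series, a patience of 1 stops at the first non-improving round and never reaches the better score in round 5, which is the trade-off between speed and accuracy that a small ``early_stopping_rounds`` implies.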

Consider Fewer Splits
'''''''''''''''''''''

The parameters described in previous sections control how many trees are constructed and how many nodes are constructed per tree. Training time can be further reduced by reducing the amount of time needed to add a tree node to the model.

The suggestions below will speed up training, but might hurt training accuracy.

Enable Feature Pre-Filtering When Creating Dataset
**************************************************

By default, when a LightGBM ``Dataset`` object is constructed, some features will be filtered out based on the value of ``min_data_in_leaf``.

For a simple example, consider a 1000-observation dataset with a feature called ``feature_1``. ``feature_1`` takes on only two values: 25.0 (995 observations) and 50.0 (5 observations). If ``min_data_in_leaf = 10``, there is no valid split for this feature, because any split on it would leave one of the leaf nodes with only 5 observations.

Instead of reconsidering this feature and then ignoring it every iteration, LightGBM filters this feature out before training, when the ``Dataset`` is constructed.

If this default behavior has been overridden by setting ``feature_pre_filter=False``, set ``feature_pre_filter=True`` to reduce training time.
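
The ``feature_1`` example can be checked with a short sketch. This is plain Python written for illustration, not the check LightGBM performs internally; the function name ``has_valid_split`` is hypothetical, and it works on raw value counts rather than on LightGBM's bins.

```python
from typing import Dict


def has_valid_split(value_counts: Dict[float, int], min_data_in_leaf: int) -> bool:
    """True if some threshold leaves enough observations on both sides."""
    total = sum(value_counts.values())
    observations_on_left = 0
    distinct_values = sorted(value_counts)
    # A threshold above the largest value would leave the right side empty,
    # so only thresholds after the first n-1 distinct values are candidates.
    for value in distinct_values[:-1]:
        observations_on_left += value_counts[value]
        observations_on_right = total - observations_on_left
        if (observations_on_left >= min_data_in_leaf
                and observations_on_right >= min_data_in_leaf):
            return True
    return False


# feature_1 from the text: 995 observations at 25.0 and 5 at 50.0.
feature_1 = {25.0: 995, 50.0: 5}
splittable_at_10 = has_valid_split(feature_1, min_data_in_leaf=10)
splittable_at_5 = has_valid_split(feature_1, min_data_in_leaf=5)
```

The same feature becomes splittable again when ``min_data_in_leaf`` drops to 5, which is why the filtering depends on the value of ``min_data_in_leaf`` in effect when the ``Dataset`` is constructed.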

Decrease ``max_bin`` or ``max_bin_by_feature`` When Creating Dataset
********************************************************************

LightGBM training `buckets continuous features into discrete bins <./Features.rst#optimization-in-speed-and-memory-usage>`_ to improve training speed and reduce memory requirements for training. This binning is done one time during ``Dataset`` construction. The number of splits considered when adding a node is ``O(#feature * #bin)``, so reducing the number of bins per feature can reduce the number of splits that need to be evaluated.

``max_bin`` controls the maximum number of bins that features will be bucketed into. It is also possible to set this maximum feature-by-feature, by passing ``max_bin_by_feature``.

Reduce ``max_bin`` or ``max_bin_by_feature`` to reduce training time.
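
As back-of-the-envelope arithmetic for the ``O(#feature * #bin)`` estimate, the sketch below counts one candidate split per bin per feature. That one-per-bin count is a simplifying assumption, the feature counts are made up, and the helper name is hypothetical; 255 is used because it is the documented default for ``max_bin``.

```python
from typing import Sequence


def candidate_split_count(bins_per_feature: Sequence[int]) -> int:
    """Rough count of split candidates per node: one per bin, per feature."""
    return sum(bins_per_feature)


num_features = 100

# Every feature uses up to 255 bins.
default_bins = candidate_split_count([255] * num_features)

# Lowering the cap to 63 bins for every feature.
reduced_bins = candidate_split_count([63] * num_features)

# A per-feature cap, in the spirit of max_bin_by_feature:
# 10 features keep 255 bins, the other 90 are limited to 15.
mixed_bins = candidate_split_count([255] * 10 + [15] * 90)
```

In this example the per-feature caps cut the candidate count by more than the uniform reduction does, while keeping fine-grained bins on the ten features assumed to matter most.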

Increase ``min_data_in_bin`` When Creating Dataset
**************************************************

Some bins might contain a small number of observations, which might mean that the effort of evaluating that bin's boundaries as possible split points isn't likely to change the final model very much. You can control the granularity of the bins by setting ``min_data_in_bin``.

Increase ``min_data_in_bin`` to reduce training time.

Decrease ``feature_fraction``
*****************************

By default, LightGBM considers all features in a ``Dataset`` during the training process. This behavior can be changed by setting ``feature_fraction`` to a value ``> 0`` and ``<= 1.0``. Setting ``feature_fraction`` to ``0.5``, for example, tells LightGBM to randomly select ``50%`` of features at the beginning of constructing each tree. This reduces the total number of splits that have to be evaluated to add each tree node.

Decrease ``feature_fraction`` to reduce training time.
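
A minimal sketch of per-tree feature sub-sampling, using only the Python standard library. It is not LightGBM's sampling code: the function name is hypothetical, and rounding to the nearest whole feature while keeping at least one is an assumption.

```python
import random
from typing import List, Sequence


def sample_features(feature_names: Sequence[str],
                    feature_fraction: float,
                    seed: int) -> List[str]:
    """Pick the subset of features one tree is allowed to split on."""
    if not 0.0 < feature_fraction <= 1.0:
        raise ValueError("feature_fraction must be in (0.0, 1.0]")
    # Assumption: round to the nearest count and always keep at least one feature.
    n_selected = max(1, round(len(feature_names) * feature_fraction))
    return random.Random(seed).sample(list(feature_names), n_selected)


features = [f"feature_{i}" for i in range(10)]
half_of_features = sample_features(features, feature_fraction=0.5, seed=0)
all_features = sample_features(features, feature_fraction=1.0, seed=0)
```

With half of the features in play, each node of that tree evaluates roughly half as many split candidates.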

Decrease ``max_cat_threshold``
******************************

Review comment (Collaborator Author): I based this paragraph on the explanations in #2261, #2411, and #2919

LightGBM uses a `custom approach for finding optimal splits for categorical features <./Advanced-Topics.html#categorical-feature-support>`_. In this process, LightGBM explores splits that break a categorical feature into two groups. These are sometimes called "k-vs.-rest" splits. Higher ``max_cat_threshold`` values correspond to more split points and larger possible group sizes to search.

Decrease ``max_cat_threshold`` to reduce training time.

Use Less Data
'''''''''''''

Use Bagging
***********

By default, LightGBM uses all observations in the training data for each iteration. It is possible to instead tell LightGBM to randomly sample the training data. This process of training over multiple random samples without replacement is called "bagging".

Set ``bagging_freq`` to an integer greater than 0 to control how often a new sample is drawn. Set ``bagging_fraction`` to a value ``> 0.0`` and ``< 1.0`` to control the size of the sample. For example, ``{"bagging_freq": 5, "bagging_fraction": 0.75}`` tells LightGBM "re-sample without replacement every 5 iterations, and draw samples of 75% of the training data".

Decrease ``bagging_fraction`` to reduce training time.
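
The re-sampling schedule described above can be sketched with the standard library. This is an illustration only and not LightGBM's sampler: the function name is hypothetical, rows are represented by their indices, and truncating the sample size to an integer is an assumption.

```python
import random
from typing import Dict, List


def bagging_schedule(n_rows: int,
                     bagging_fraction: float,
                     bagging_freq: int,
                     num_iterations: int,
                     seed: int = 0) -> Dict[int, List[int]]:
    """Map each boosting iteration to the row indices it trains on."""
    rng = random.Random(seed)
    sample_size = int(n_rows * bagging_fraction)
    rows_by_iteration: Dict[int, List[int]] = {}
    current_rows = list(range(n_rows))  # used as-is when bagging is disabled
    for iteration in range(num_iterations):
        # Draw a fresh sample, without replacement, every bagging_freq iterations.
        if bagging_freq > 0 and iteration % bagging_freq == 0:
            current_rows = sorted(rng.sample(range(n_rows), sample_size))
        rows_by_iteration[iteration] = current_rows
    return rows_by_iteration


# The settings from the text: re-sample every 5 iterations, 75% of the rows.
schedule = bagging_schedule(n_rows=100, bagging_fraction=0.75,
                            bagging_freq=5, num_iterations=10)
```

Iterations 0 through 4 share one 75-row sample and iterations 5 through 9 share the next one, so each tree is built from fewer rows than the full training set.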

- Use small ``max_bin``

- Use ``save_binary`` to speed up data loading in future learning
Save Constructed Datasets with ``save_binary``
''''''''''''''''''''''''''''''''''''''''''''''

- Use parallel learning, refer to `Parallel Learning Guide <./Parallel-Learning-Guide.rst>`__
This only applies to the LightGBM CLI. If you pass parameter ``save_binary``, the training dataset and all validation sets will be saved in a binary format understood by LightGBM. This can speed up training next time, because binning and other work done when constructing a ``Dataset`` does not have to be re-done.


For Better Accuracy
22 changes: 15 additions & 7 deletions docs/Parameters.rst
@@ -312,7 +312,7 @@ Learning Control Parameters

- frequency for bagging

- ``0`` means disable bagging; ``k`` means perform bagging at every ``k`` iteration
- ``0`` means disable bagging; ``k`` means perform bagging at every ``k`` iteration. Every ``k``-th iteration, LightGBM will randomly select ``bagging_fraction * 100 %`` of the data to use for the next ``k`` iterations.

- **Note**: to enable bagging, ``bagging_fraction`` should be set to value smaller than ``1.0`` as well

@@ -322,15 +322,15 @@ Learning Control Parameters

- ``feature_fraction`` :raw-html:`<a id="feature_fraction" title="Permalink to this parameter" href="#feature_fraction">&#x1F517;&#xFE0E;</a>`, default = ``1.0``, type = double, aliases: ``sub_feature``, ``colsample_bytree``, constraints: ``0.0 < feature_fraction <= 1.0``

- LightGBM will randomly select part of features on each iteration (tree) if ``feature_fraction`` smaller than ``1.0``. For example, if you set it to ``0.8``, LightGBM will select 80% of features before training each tree
- LightGBM will randomly select a subset of features on each iteration (tree) if ``feature_fraction`` is smaller than ``1.0``. For example, if you set it to ``0.8``, LightGBM will select 80% of features before training each tree

- can be used to speed up training

- can be used to deal with over-fitting

- ``feature_fraction_bynode`` :raw-html:`<a id="feature_fraction_bynode" title="Permalink to this parameter" href="#feature_fraction_bynode">&#x1F517;&#xFE0E;</a>`, default = ``1.0``, type = double, aliases: ``sub_feature_bynode``, ``colsample_bynode``, constraints: ``0.0 < feature_fraction_bynode <= 1.0``

- LightGBM will randomly select part of features on each tree node if ``feature_fraction_bynode`` smaller than ``1.0``. For example, if you set it to ``0.8``, LightGBM will select 80% of features at each tree node
- LightGBM will randomly select a subset of features on each tree node if ``feature_fraction_bynode`` is smaller than ``1.0``. For example, if you set it to ``0.8``, LightGBM will select 80% of features at each tree node

- can be used to deal with over-fitting

@@ -344,6 +344,8 @@ Learning Control Parameters

- ``extra_trees`` :raw-html:`<a id="extra_trees" title="Permalink to this parameter" href="#extra_trees">&#x1F517;&#xFE0E;</a>`, default = ``false``, type = bool

- can be used to speed up training

- use extremely randomized trees

- if set to ``true``, when evaluating node splits LightGBM will check only one randomly-chosen threshold for each feature
@@ -358,11 +360,13 @@ Learning Control Parameters

- will stop training if one metric of one validation data doesn't improve in last ``early_stopping_round`` rounds

- ``<= 0`` means disable
- can be used to speed up training

- ``<= 0`` means disable (the default)

- ``first_metric_only`` :raw-html:`<a id="first_metric_only" title="Permalink to this parameter" href="#first_metric_only">&#x1F517;&#xFE0E;</a>`, default = ``false``, type = bool

- set this to ``true``, if you want to use only the first metric for early stopping
- LightGBM allows you to provide multiple evaluation metrics. Set this to ``true`` if you want to use only the first metric for early stopping.

- ``max_delta_step`` :raw-html:`<a id="max_delta_step" title="Permalink to this parameter" href="#max_delta_step">&#x1F517;&#xFE0E;</a>`, default = ``0.0``, type = double, aliases: ``max_tree_output``, ``max_leaf_output``

@@ -382,6 +386,8 @@ Learning Control Parameters

- ``min_gain_to_split`` :raw-html:`<a id="min_gain_to_split" title="Permalink to this parameter" href="#min_gain_to_split">&#x1F517;&#xFE0E;</a>`, default = ``0.0``, type = double, aliases: ``min_split_gain``, constraints: ``min_gain_to_split >= 0.0``

- can be used to speed up training

- the minimal gain to perform split

- ``drop_rate`` :raw-html:`<a id="drop_rate" title="Permalink to this parameter" href="#drop_rate">&#x1F517;&#xFE0E;</a>`, default = ``0.1``, type = double, aliases: ``rate_drop``, constraints: ``0.0 <= drop_rate <= 1.0``
@@ -442,7 +448,9 @@ Learning Control Parameters

- used for the categorical features

- limit the max threshold points in categorical features
- limit number of split points considered for categorical features. See `the documentation on how LightGBM finds optimal splits for categorical features <./Features.rst#optimal-split-for-categorical-features>`_ for more details.

- can be used to speed up training

- ``cat_l2`` :raw-html:`<a id="cat_l2" title="Permalink to this parameter" href="#cat_l2">&#x1F517;&#xFE0E;</a>`, default = ``10.0``, type = double, constraints: ``cat_l2 >= 0.0``

@@ -668,7 +676,7 @@ Dataset Parameters

- ``feature_pre_filter`` :raw-html:`<a id="feature_pre_filter" title="Permalink to this parameter" href="#feature_pre_filter">&#x1F517;&#xFE0E;</a>`, default = ``true``, type = bool

- set this to ``true`` to pre-filter the unsplittable features by ``min_data_in_leaf``
- set this to ``true`` (the default) to tell LightGBM to ignore the features that are unsplittable based on ``min_data_in_leaf``

- as dataset object is initialized only once and cannot be changed after that, you may need to set this to ``false`` when searching parameters with ``min_data_in_leaf``, otherwise features are filtered by ``min_data_in_leaf`` firstly if you don't reconstruct dataset object
