## Abstract

* Gradient Boosting Decision Tree (GBDT) is a popular machine learning algorithm with quite a few effective implementations, such as XGBoost and pGBRT. Although many engineering optimizations have been adopted in these implementations, their efficiency and scalability are still unsatisfactory when the feature dimension is high and the data size is large. A major reason is that, for each feature, they need to scan all the data instances to estimate the information gain of all possible split points, which is very time-consuming. To tackle this problem, we propose two novel techniques: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB). With GOSS, we exclude a significant proportion of data instances with small gradients and use only the rest to estimate the information gain. We prove that, since the data instances with larger gradients play a more important role in the computation of information gain, GOSS can obtain quite an accurate estimate of the information gain with a much smaller data size. With EFB, we bundle mutually exclusive features (i.e., features that rarely take nonzero values simultaneously) to reduce the number of features. We prove that finding the optimal bundling of exclusive features is NP-hard, but a greedy algorithm can achieve quite a good approximation ratio (and thus can effectively reduce the number of features without hurting the accuracy of split point determination by much). We call our new GBDT implementation with GOSS and EFB LightGBM. Our experiments on multiple public datasets show that LightGBM speeds up the training process of conventional GBDT by over 20 times in the best case while achieving almost the same accuracy.

## 1 Introduction

* Gradient boosting decision tree (GBDT) [1] is a widely-used machine learning algorithm, due to its efficiency, accuracy, and interpretability. GBDT achieves state-of-the-art performances in many machine learning tasks, such as multi-class classification [2], click prediction [3], and learning to rank [4]. In recent years, with the emergence of big data (in terms of both the number of features and the number of instances), GBDT is facing new challenges, especially in the tradeoff between accuracy and efficiency. Conventional implementations of GBDT need to, for every feature, scan all the data instances to estimate the information gain of all the possible split points. Therefore, their computational complexities will be proportional to both the number of features and the number of instances. This makes these implementations very time consuming when handling big data.


* To tackle this challenge, a straightforward idea is to reduce the number of data instances and the number of features. However, this turns out to be highly non-trivial. For example, it is unclear how to perform data sampling for GBDT. While there are some works that sample data according to their weights to speed up the training process of boosting [5, 6, 7], they cannot be directly applied to GBDT since there is no sample weight in GBDT at all. In this paper, we propose two novel techniques towards this goal, as elaborated below.


* Gradient-based One-Side Sampling (GOSS). While there is no native weight for data instances in GBDT, we notice that data instances with different gradients play different roles in the computation of information gain. In particular, according to the definition of information gain, instances with larger gradients in absolute value (i.e., under-trained instances) contribute more to the information gain. Therefore, when down-sampling the data instances, in order to retain the accuracy of information gain estimation, we should keep the instances with large gradients (e.g., larger than a pre-defined threshold, or among the top percentiles) and only randomly drop instances with small gradients. We prove that such a treatment can lead to a more accurate gain estimation than uniformly random sampling with the same target sampling rate, especially when the value of information gain has a large range.

* Exclusive Feature Bundling (EFB). Usually in real applications, although there are a large number of features, the feature space is quite sparse, which provides us a possibility of designing a nearly lossless approach to reduce the number of effective features. Specifically, in a sparse feature space, many features are (almost) exclusive, i.e., they rarely take nonzero values simultaneously. Examples include the one-hot features (e.g., one-hot word representation in text mining). We can safely bundle such exclusive features. To this end, we design an efficient algorithm by reducing the optimal bundling problem to a graph coloring problem (by taking features as vertices and adding edges for every two features if they are not mutually exclusive), and solving it by a greedy algorithm with a constant approximation ratio.


* We call the new GBDT algorithm with GOSS and EFB LightGBM. Our experiments on multiple public datasets show that LightGBM can accelerate the training process by over 20 times in the best case while achieving almost the same accuracy.

* The remainder of this paper is organized as follows. First, we review GBDT algorithms and related work in Sec. 2. Then, we introduce the details of GOSS in Sec. 3 and EFB in Sec. 4. Our experiments for LightGBM on public datasets are presented in Sec. 5. Finally, we conclude the paper in Sec. 6.

## 2 Preliminaries

### 2.1 GBDT and Its Complexity Analysis

* GBDT is an ensemble model of decision trees, which are trained in sequence [1]. In each iteration, GBDT learns the decision trees by fitting the negative gradients (also known as residual errors).


* The main cost in GBDT lies in learning the decision trees, and the most time-consuming part of learning a decision tree is finding the best split points. One of the most popular algorithms for finding split points is the pre-sorted algorithm [8, 9], which enumerates all possible split points on the pre-sorted feature values. This algorithm is simple and can find the optimal split points; however, it is inefficient in both training speed and memory consumption. Another popular algorithm is the histogram-based algorithm [10, 11, 12], as shown in Alg. 1. Instead of finding the split points on the sorted feature values, the histogram-based algorithm buckets continuous feature values into discrete bins and uses these bins to construct feature histograms during training. Since the histogram-based algorithm is more efficient in both memory consumption and training speed, we develop our work on its basis.

* As shown in Alg. 1, the histogram-based algorithm finds the best split points based on the feature histograms. It costs O(#data × #feature) for histogram building and O(#bin × #feature) for split point finding. Since #bin is usually much smaller than #data, histogram building will dominate the computational complexity. If we can reduce #data or #feature, we will be able to substantially speed up the training of GBDT.
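The two cost terms above can be made concrete with a minimal sketch of histogram-based split finding for a single feature. It assumes the feature values have already been mapped to integer bin indices; the function names `build_histogram` and `best_split_from_histogram` are hypothetical, and the gain formula (sum of squared gradient sums over counts, minus the parent term) is the commonly used variance-style criterion rather than a transcription of Alg. 1.

```python
import numpy as np

def build_histogram(bin_idx, grad, n_bins):
    """Accumulate gradient sums and counts per bin: O(#data) for one feature."""
    grad_sum = np.bincount(bin_idx, weights=grad, minlength=n_bins)
    count = np.bincount(bin_idx, minlength=n_bins)
    return grad_sum, count

def best_split_from_histogram(grad_sum, count):
    """Scan bin boundaries: O(#bin) for one feature. Returns (best_bin, gain)."""
    total_g, total_n = grad_sum.sum(), count.sum()
    best_bin, best_gain = -1, -np.inf
    left_g, left_n = 0.0, 0
    for b in range(len(count) - 1):  # instances with bin <= b go left
        left_g += grad_sum[b]
        left_n += count[b]
        right_n = total_n - left_n
        if left_n == 0 or right_n == 0:
            continue  # skip splits that leave one side empty
        right_g = total_g - left_g
        gain = left_g**2 / left_n + right_g**2 / right_n - total_g**2 / total_n
        if gain > best_gain:
            best_bin, best_gain = b, gain
    return best_bin, best_gain

bin_idx = np.array([0, 0, 1, 1, 2, 2])
grad = np.array([-1.0, -1.0, 1.0, 1.0, 1.0, 1.0])
grad_sum, count = build_histogram(bin_idx, grad, n_bins=3)
best_bin, best_gain = best_split_from_histogram(grad_sum, count)
```

Building the histogram touches every instance once, while the scan only touches the bins, which is why reducing #data or #feature is where the savings are.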

### 2.2 Related Work

* There have been quite a few implementations of GBDT in the literature, including XGBoost [13], pGBRT [14], scikit-learn [15], and gbm in R [16]. Scikit-learn and gbm in R implement the pre-sorted algorithm, and pGBRT implements the histogram-based algorithm. XGBoost supports both the pre-sorted algorithm and the histogram-based algorithm. As shown in [13], XGBoost outperforms the other tools. So, we use XGBoost as our baseline in the experiment section.

* To reduce the size of the training data, a common approach is to down-sample the data instances. For example, in [5], data instances are filtered out if their weights are smaller than a fixed threshold. SGB [20] uses a random subset to train the weak learners in every iteration. In [6], the sampling ratio is dynamically adjusted as training progresses. However, all these works except SGB [20] are based on AdaBoost [21] and cannot be directly applied to GBDT, since there are no native weights for data instances in GBDT. Though SGB can be applied to GBDT, it usually hurts accuracy and thus is not a desirable choice.

* Similarly, to reduce the number of features, it is natural to filter weak features [22, 23, 7, 24]. This is usually done by principal component analysis or projection pursuit. However, these approaches rely heavily on the assumption that features contain significant redundancy, which might not always be true in practice (features are usually designed with their unique contributions, and removing any of them may affect the training accuracy to some degree).

* The large-scale datasets used in real applications are usually quite sparse. GBDT with the pre-sorted algorithm can reduce the training cost by ignoring the features with zero values [13]. However, GBDT with the histogram-based algorithm does not have efficient sparse optimization solutions. The reason is that the histogram-based algorithm needs to retrieve feature bin values (refer to Alg. 1) for each data instance regardless of whether the feature value is zero. It is highly desirable that GBDT with the histogram-based algorithm can effectively leverage this sparsity.

* To address the limitations of previous works, we propose two novel techniques called Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB). More details are introduced in the next sections.

![avatar](pic/pic_1.png)

## 3 Gradient-based One-Side Sampling


* In this section, we propose a novel sampling method for GBDT that can achieve a good balance between reducing the number of data instances and keeping the accuracy for learned decision trees.


### 3.1 Algorithm Description

* In AdaBoost, the sample weight serves as a good indicator for the importance of data instances. However, in GBDT, there are no native sample weights, and thus the sampling methods proposed for AdaBoost cannot be directly applied. Fortunately, we notice that the gradient for each data instance in GBDT provides us with useful information for data sampling. That is, if an instance is associated with a small gradient, the training error for this instance is small and it is already well-trained. A straightforward idea is to discard those data instances with small gradients. However, the data distribution will be changed by doing so, which will hurt the accuracy of the learned model. To avoid this problem, we propose a new method called Gradient-based One-Side Sampling (GOSS).


* GOSS keeps all the instances with large gradients and performs random sampling on the instances with small gradients. To compensate for the influence on the data distribution, GOSS introduces a constant multiplier for the data instances with small gradients when computing the information gain (see Alg. 2). Specifically, GOSS first sorts the data instances according to the absolute value of their gradients and selects the top $a \times 100\%$ instances. Then it randomly samples $b \times 100\%$ instances from the rest of the data. After that, GOSS amplifies the sampled data with small gradients by a constant $\frac{1-a}{b}$ when calculating the information gain. By doing so, we put more focus on the under-trained instances without changing the original data distribution by much.
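The sampling step described above can be sketched as follows. This is an illustration under stated assumptions, not the LightGBM implementation: the function name `goss_sample` is hypothetical, and the random part is assumed to have size $b \times n$ (a fraction of all $n$ instances), under which the multiplier $\frac{1-a}{b}$ is exactly the inverse sampling rate of the small-gradient set. It also assumes $b > 0$.

```python
import numpy as np

def goss_sample(grad, a, b, rng=None):
    """Return (indices, weights) for one GOSS round.

    Keeps the top a*100% instances by |gradient|, draws b*100% of all
    instances at random from the remainder, and up-weights the randomly
    drawn small-gradient instances by (1 - a) / b.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(grad)
    top_n = int(a * n)
    rand_n = int(b * n)
    order = np.argsort(-np.abs(grad))      # descending by |g|
    top_idx = order[:top_n]                # large-gradient instances, all kept
    rest = order[top_n:]                   # small-gradient instances
    rand_idx = rng.choice(rest, size=min(rand_n, len(rest)), replace=False)
    idx = np.concatenate([top_idx, rand_idx])
    weights = np.ones(len(idx))
    weights[top_n:] = (1.0 - a) / b        # compensate for the dropped instances
    return idx, weights

grad = np.array([0.1, -5.0, 0.2, 3.0, -0.3, 0.05, 0.4, -0.15])
idx, weights = goss_sample(grad, a=0.25, b=0.5, rng=np.random.default_rng(0))
```

The returned weights would multiply the gradients of the sampled instances when the histograms are built, so that the small-gradient part is scaled back to roughly its original mass.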

### 3.2 Theoretical Analysis

* GBDT uses decision trees to learn a function from the input space $X^s$ to the gradient space $G$ [1]. Suppose that we have a training set with $n$ i.i.d. instances $\{x_1, \cdots, x_n\}$, where each $x_i$ is a vector with dimension $s$ in space $X^s$. In each iteration of gradient boosting, the negative gradients of the loss function with respect to the output of the model are denoted as $\{g_1, \cdots, g_n\}$. The decision tree model splits each node at the most informative feature (the one with the largest information gain). For GBDT, the information gain is usually measured by the variance after splitting, which is defined below.

* Definition 3.1 Let O be the training dataset on a fixed node of the decision tree. The variance gain of splitting feature j at point d for this node is defined as


![avatar](pic/pic_2.png)

* For feature $j$, the decision tree algorithm selects $d_j^* = \arg\max_d V_j(d)$ and calculates the largest gain $V_j(d_j^*)$. Then, the data are split according to feature $j^*$ (the feature with the largest such gain) at point $d_{j^*}$ into the left and right child nodes.
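As a worked illustration of the selection rule $d_j^* = \arg\max_d V_j(d)$, the sketch below evaluates the variance gain at every candidate threshold of one feature. The definition itself is carried by the image above; the formula coded here, $V_j(d) = \frac{1}{n}\left(\frac{(\sum_{x_{ij} \le d} g_i)^2}{n_l^j(d)} + \frac{(\sum_{x_{ij} > d} g_i)^2}{n_r^j(d)}\right)$, is the form recalled from the paper's Definition 3.1 and should be checked against it. The function names are hypothetical.

```python
import numpy as np

def variance_gain(x_j, grad, d):
    """Variance gain of splitting feature values x_j at threshold d."""
    left = x_j <= d
    n_l = int(left.sum())
    n_r = len(grad) - n_l
    if n_l == 0 or n_r == 0:
        return 0.0  # degenerate split, nothing gained
    g_l = grad[left].sum()
    g_r = grad[~left].sum()
    return (g_l ** 2 / n_l + g_r ** 2 / n_r) / len(grad)

def best_split_point(x_j, grad):
    """Exhaustively search the observed values of x_j for the best threshold."""
    candidates = np.unique(x_j)[:-1]  # splitting at the maximum leaves the right side empty
    if len(candidates) == 0:
        return None, 0.0              # constant feature, no valid split
    gains = [variance_gain(x_j, grad, d) for d in candidates]
    k = int(np.argmax(gains))
    return candidates[k], gains[k]

x_j = np.array([1.0, 2.0, 3.0, 4.0])
grad = np.array([-1.0, -1.0, 1.0, 1.0])
d_star, gain_star = best_split_point(x_j, grad)
```

In this toy example the gradients change sign between the second and third instance, so the threshold separating them gives the largest gain.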

* In our proposed GOSS method, first, we rank the training instances according to the absolute values of their gradients in descending order; second, we keep the top $a \times 100\%$ instances with the larger gradients and get an instance subset $A$; then, for the remaining set $A^c$ consisting of the $(1 - a) \times 100\%$ instances with smaller gradients, we further randomly sample a subset $B$ with size $b \times |A^c|$; finally, we split the instances according to the estimated variance gain $\tilde{V}_j(d)$ over the subset $A \cup B$, i.e.,

![avatar](pic/pic_3.png)

![avatar](pic/pic_4.png)

* Thus, in GOSS, we use the estimated $\tilde{V}_j(d)$ over a smaller instance subset, instead of the accurate $V_j(d)$ over all the instances, to determine the split point, and the computation cost can be largely reduced. More importantly, the following theorem indicates that GOSS will not lose much training accuracy and will outperform random sampling. Due to space restrictions, we leave the proof of the theorem to the supplementary materials.

![avatar](pic/pic_5.png)

* According to the theorem, we have the following discussions: (1) The asymptotic approximation ratio of GOSS is $O\left(\frac{1}{n_l^j(d)} + \frac{1}{n_r^j(d)} + \frac{1}{\sqrt{n}}\right)$. If the split is not too unbalanced (i.e., $n_l^j(d) \ge O(\sqrt{n})$ and $n_r^j(d) \ge O(\sqrt{n})$), the approximation error will be dominated by the second term of Ineq. (2), which decreases to 0 in $O(\sqrt{n})$ with $n \to \infty$. That means when the number of data instances is large, the approximation is quite accurate. (2) Random sampling is a special case of GOSS with $a = 0$. In many cases, GOSS could outperform random sampling, under the condition $C_{0,\beta} > C_{a,\beta-a}$, which is equivalent to $\frac{\alpha_a}{\sqrt{\beta}} > \frac{1 - a}{\sqrt{\beta - a}}$ with $\alpha_a = \max_{x_i \in A \cup A^c} |g_i| / \max_{x_i \in A^c} |g_i|$.

* Next, we analyze the generalization performance of GOSS. We consider the generalization error in GOSS, $e_{gen}^{GOSS}(d) = |\tilde{V}_j(d) - V_{*}(d)|$, which is the gap between the variance gain calculated from the sampled training instances in GOSS and the true variance gain for the underlying distribution. We have $e_{gen}^{GOSS}(d) \le |\tilde{V}_j(d) - V_j(d)| + |V_j(d) - V_{*}(d)| = e_{GOSS}(d) + e_{gen}(d)$. Thus, the generalization error with GOSS will be close to that calculated by using the full data instances if the GOSS approximation is accurate. On the other hand, sampling will increase the diversity of the base learners, which potentially helps to improve the generalization performance.


## 4 Exclusive Feature Bundling

* In this section, we propose a novel method to effectively reduce the number of features.


![avatar](pic/pic_6.png)

* High-dimensional data are usually very sparse. The sparsity of the feature space makes it possible to design a nearly lossless approach to reduce the number of features. Specifically, in a sparse feature space, many features are mutually exclusive, i.e., they never take nonzero values simultaneously. We can safely bundle exclusive features into a single feature (which we call an exclusive feature bundle). With a carefully designed feature scanning algorithm, we can build the same feature histograms from the feature bundles as those from individual features. In this way, the complexity of histogram building changes from O(#data × #feature) to O(#data × #bundle), where #bundle << #feature. We can then significantly speed up the training of GBDT without hurting the accuracy. In the following, we show how to achieve this in detail.

* There are two issues to be addressed. The first one is to determine which features should be bundled together. The second is how to construct the bundle.


* Theorem 4.1 The problem of partitioning features into a smallest number of exclusive bundles is NP-hard.


* Proof: We will reduce the graph coloring problem [25] to our problem. Since graph coloring problem is NP-hard, we can then deduce our conclusion.


* Given any instance $G = (V, E)$ of the graph coloring problem, we construct an instance of our problem as follows. Take each row of the incidence matrix of $G$ as a feature, and get an instance of our problem with $|V|$ features. It is easy to see that an exclusive bundle of features in our problem corresponds to a set of vertices with the same color, and vice versa.

* For the first issue (which features to bundle), we prove in Theorem 4.1 that it is NP-hard to find the optimal bundling strategy, which indicates that it is impossible to find an exact solution within polynomial time. In order to find a good approximation algorithm, we first reduce the optimal bundling problem to the graph coloring problem by taking features as vertices and adding an edge between every two features that are not mutually exclusive; then we use a greedy algorithm that can produce reasonably good results (with a constant approximation ratio) for graph coloring to produce the bundles. Furthermore, we notice that there are usually quite a few features that, although not 100% mutually exclusive, also rarely take nonzero values simultaneously. If our algorithm allows a small fraction of conflicts, we can get an even smaller number of feature bundles and further improve the computational efficiency. By simple calculation, randomly polluting a small fraction of feature values will affect the training accuracy by at most $O([(1 - \gamma)n]^{-2/3})$ (see Proposition 2.1 in the supplementary materials), where $\gamma$ is the maximal conflict rate in each bundle. So, if we choose a relatively small $\gamma$, we will be able to achieve a good balance between accuracy and efficiency.

* Based on the above discussions, we design an algorithm for exclusive feature bundling as shown in Alg. 3. First, we construct a graph with weighted edges, whose weights correspond to the total conflicts between features. Second, we sort the features by their degrees in the graph in descending order. Finally, we check each feature in the ordered list, and either assign it to an existing bundle with a small conflict (controlled by $\gamma$) or create a new bundle. The time complexity of Alg. 3 is $O(\#feature^2)$, and it is processed only once before training. This complexity is acceptable when the number of features is not very large, but may still suffer if there are millions of features. To further improve the efficiency, we propose a more efficient ordering strategy that does not build the graph: ordering by the count of nonzero values, which is similar to ordering by degrees since more nonzero values usually lead to a higher probability of conflicts. Since we only alter the ordering strategy in Alg. 3, the details of the new algorithm are omitted to avoid duplication.
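The greedy procedure described above can be sketched as follows, using the graph-free ordering by nonzero count. This is a hedged illustration rather than a transcription of Alg. 3: the function name `greedy_bundling` is hypothetical, the data are held in a dense array for simplicity, and the per-bundle conflict budget is assumed to be $\gamma \times \#data$ instances.

```python
import numpy as np

def greedy_bundling(X, max_conflict_rate=0.0):
    """Group feature columns of X into bundles with few simultaneous nonzeros."""
    n, _ = X.shape
    nonzero = X != 0                                  # (n, m) boolean mask
    # Visit features with many nonzeros first (stand-in for ordering by degree).
    order = np.argsort(-nonzero.sum(axis=0), kind="stable")
    max_conflicts = int(max_conflict_rate * n)        # assumed budget per bundle
    bundles = []           # feature indices in each bundle
    bundle_masks = []      # rows already occupied by a nonzero in each bundle
    bundle_conflicts = []  # conflicts accumulated so far in each bundle
    for j in order:
        col = nonzero[:, j]
        for k in range(len(bundles)):
            cnt = int(np.count_nonzero(bundle_masks[k] & col))
            if bundle_conflicts[k] + cnt <= max_conflicts:
                bundles[k].append(int(j))
                bundle_masks[k] = bundle_masks[k] | col
                bundle_conflicts[k] += cnt
                break
        else:
            # No existing bundle can take this feature: open a new one.
            bundles.append([int(j)])
            bundle_masks.append(col.copy())
            bundle_conflicts.append(0)
    return bundles

X = np.array([[1, 0, 0, 2],
              [0, 3, 0, 0],
              [0, 0, 4, 0],
              [5, 0, 0, 0]])
strict = greedy_bundling(X)                           # no conflicts allowed
relaxed = greedy_bundling(X, max_conflict_rate=0.25)  # one conflicting row allowed
```

With no conflicts allowed, the last column cannot join the first bundle because both it and column 0 are nonzero in row 0; allowing a single conflict merges all four columns into one bundle.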

* For the second issue, we need a good way of merging the features in the same bundle in order to reduce the corresponding training complexity. The key is to ensure that the values of the original features can be identified from the feature bundles. Since the histogram-based algorithm stores discrete bins instead of continuous values of the features, we can construct a feature bundle by letting exclusive features reside in different bins. This can be done by adding offsets to the original values of the features. For example, suppose we have two features in a feature bundle. Originally, feature A takes values from [0, 10) and feature B takes values from [0, 20). We then add an offset of 10 to the values of feature B so that the refined feature takes values from [10, 30). After that, it is safe to merge features A and B, and use a feature bundle with range [0, 30) to replace the original features A and B. The detailed algorithm is shown in Alg. 4.
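The offset trick in the example above can be sketched in a few lines. The sketch assumes each feature is already discretized into integer bin indices with bin 0 standing for the zero (default) value, and that features in the bundle are exclusive; the function name `merge_exclusive_features` is hypothetical.

```python
import numpy as np

def merge_exclusive_features(bins, num_bins):
    """Merge the binned features of one bundle into a single binned feature.

    bins:     list of integer arrays, one per feature (bin index per instance,
              0 meaning the zero/default value).
    num_bins: number of bins of each feature, in the same order.
    Returns (merged, bin_ranges), where bin_ranges[j] is the offset of feature j.
    """
    bin_ranges = np.concatenate([[0], np.cumsum(num_bins)])
    merged = np.zeros(len(bins[0]), dtype=np.int64)
    for j, f in enumerate(bins):
        nz = f != 0
        merged[nz] = f[nz] + bin_ranges[j]  # shift feature j into its own bin range
    return merged, bin_ranges

# Feature A has 10 bins, feature B has 20 bins, as in the text's example.
feature_a = np.array([3, 0, 0])
feature_b = np.array([0, 7, 0])
merged, bin_ranges = merge_exclusive_features([feature_a, feature_b], [10, 20])
```

Here the value 7 of feature B lands in bin 17 of the bundle, so the original feature and value can be recovered from the bundle bin and `bin_ranges`.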

* The EFB algorithm can bundle many exclusive features into far fewer dense features, which effectively avoids unnecessary computation for zero feature values. In fact, we can also optimize the basic histogram-based algorithm to ignore zero feature values by using a table for each feature to record the data with nonzero values. By scanning the data in this table, the cost of histogram building for a feature changes from O(#data) to O(#non_zero_data). However, this method needs additional memory and computation cost to maintain these per-feature tables throughout the tree growth process. We implement this optimization in LightGBM as a basic function. Note that this optimization does not conflict with EFB, since we can still use it when the bundles are sparse.
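The per-feature nonzero table mentioned above can be illustrated with a small sketch. It assumes bin 0 holds the zero values and that the node-level gradient total is available (it is shared by all features, so it is computed once per node); both function names are hypothetical and this is not LightGBM's internal data structure.

```python
import numpy as np

def build_nonzero_table(bin_idx):
    """Row indices whose bin is nonzero; built once per feature."""
    return np.flatnonzero(bin_idx)

def sparse_histogram(bin_idx, grad, nz_rows, total_grad, n_bins):
    """Gradient histogram that scans only the nonzero rows: O(#non_zero_data)."""
    hist = np.bincount(bin_idx[nz_rows], weights=grad[nz_rows], minlength=n_bins)
    # Everything not seen in the table belongs to the zero bin.
    hist[0] = total_grad - hist.sum()
    return hist

bin_idx = np.array([0, 2, 0, 1, 0, 2])
grad = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
nz_rows = build_nonzero_table(bin_idx)
hist = sparse_histogram(bin_idx, grad, nz_rows, grad.sum(), n_bins=3)
dense = np.bincount(bin_idx, weights=grad, minlength=3)  # O(#data) reference
```

The sparse and dense histograms agree, but the sparse one reads only three of the six rows; the extra cost is keeping `nz_rows` consistent as instances are partitioned into child nodes.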

## 5 Experiments

* In this section, we report the experimental results regarding our proposed LightGBM algorithm. We use five different datasets which are all publicly available. The details of these datasets are listed in Table 1. Among them, the Microsoft Learning to Rank (LETOR) [26] dataset contains 30K web search queries. The features used in this dataset are mostly dense numerical features. The Allstate Insurance Claim [27] and the Flight Delay [28] datasets both contain a lot of one-hot coding features. And the last two datasets are from KDD CUP 2010 and KDD CUP 2012. We directly use the features used by the winning solution from NTU [29, 30, 31], which contains both dense and sparse features, and these two datasets are very large. These datasets are large, include both sparse and dense features, and cover many real-world tasks. Thus, we can use them to test our algorithm thoroughly.


* Our experimental environment is a Linux server with two E5-2670 v3 CPUs (24 cores in total) and 256 GB of memory. All experiments run with multi-threading, and the number of threads is fixed to 16.

### 5.1 Overall Comparison

* We present the overall comparisons in this subsection. XGBoost [13] and LightGBM without GOSS and EFB (called lgb_baseline) are used as baselines. For XGBoost, we used two versions, xgb_exa (pre-sorted algorithm) and xgb_his (histogram-based algorithm). For xgb_his, lgb_baseline, and LightGBM, we used the leaf-wise tree growth strategy [32]. Since xgb_exa only supports the layer-wise growth strategy, we tuned its parameters to let it grow trees similar to those of the other methods. We also tuned the parameters for all datasets towards a better balance between speed and accuracy. We set a = 0.05, b = 0.05 for Allstate, KDD10 and KDD12, and set a = 0.1, b = 0.1 for Flight Delay and LETOR. We set γ = 0 in EFB. All algorithms are run for a fixed number of iterations, and we take the accuracy results from the iteration with the best score.

![avatar](pic/pic_7.png)

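* Table 2: Overall training time cost comparison. LightGBM is lgb_baseline with GOSS and EFB. EFB_only is lgb_baseline with EFB. The values in the table are the average time cost (seconds) for training one iteration.
* Table 3: Overall accuracy comparison on test datasets. AUC is used for classification tasks and NDCG@10 for ranking tasks. SGB is lgb_baseline with Stochastic Gradient Boosting, and its sampling ratio is the same as LightGBM.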

![avatar](pic/pic_8.png)

<center>Figure 1: Time-AUC curve on Flight Delay. Figure 2: Time-NDCG curve on LETOR.</center>

* The training time and test accuracy are summarized in Table 2 and Table 3, respectively. From these results, we can see that LightGBM is the fastest while maintaining almost the same accuracy as the baselines. xgb_exa is based on the pre-sorted algorithm, which is quite slow compared with histogram-based algorithms. Compared with lgb_baseline, LightGBM speeds up training by 21x, 6x, 1.6x, 14x and 13x on the Allstate, Flight Delay, LETOR, KDD10 and KDD12 datasets, respectively. Since xgb_his is quite memory-consuming, it cannot run successfully on the KDD10 and KDD12 datasets due to out-of-memory errors. On the remaining datasets, LightGBM is faster in every case, with a speed-up of up to 9x on the Allstate dataset. The speed-up is calculated based on training time per iteration, since all algorithms converge after a similar number of iterations. To demonstrate the overall training process, we also show the training curves based on wall clock time on Flight Delay and LETOR in Fig. 1 and Fig. 2, respectively. To save space, we put the training curves of the other datasets in the supplementary material.

* On all datasets, LightGBM achieves almost the same test accuracy as the baselines. This indicates that neither GOSS nor EFB hurts accuracy, while both bring a significant speed-up. This is consistent with our theoretical analysis in Sec. 3.2 and Sec. 4.

* LightGBM achieves quite different speed-up ratios on these datasets. The overall speed-up comes from the combination of GOSS and EFB. We break down the contribution and discuss the effectiveness of GOSS and EFB separately in the next sections.

![avatar](pic/pic_9.png)

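* Table 4: Accuracy comparison on the LETOR dataset for GOSS and SGB under different sampling ratios. We ensure that all experiments reach the convergence point by using a large number of iterations with early stopping. The standard deviations across the different settings are small. The settings of a and b for GOSS can be found in the supplementary material.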

### 5.2 Analysis on GOSS

* First, we study the speed-up ability of GOSS. From the comparison of LightGBM and EFB_only (LightGBM without GOSS) in Table 2, we can see that GOSS brings nearly a 2x speed-up on its own while using only 10% - 20% of the data. GOSS learns trees using only the sampled data. However, it retains some computations on the full dataset, such as making predictions and computing the gradients. As a result, the overall speed-up is not linearly correlated with the percentage of sampled data. Nevertheless, the speed-up brought by GOSS is still very significant, and the technique is universally applicable to different datasets.
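
The sub-linear relationship between sampled data and speed-up can be illustrated with a simple Amdahl-style cost model. This is only a sketch and not a model taken from the paper: it assumes that the per-iteration time splits into a part that always runs on the full dataset (prediction and gradient computation) and a tree-learning part whose cost scales with the fraction of sampled data. The function name `goss_speedup` and the 40% full-data share used in the demo are hypothetical.

```python
def goss_speedup(sample_fraction: float, full_data_share: float) -> float:
    """Estimated per-iteration speed-up under a two-part cost model.

    sample_fraction: fraction of instances kept by GOSS (a + b), in (0, 1].
    full_data_share: ASSUMED share of the original iteration time spent on
        work that still runs on the full dataset, in [0, 1].
    """
    if not 0.0 < sample_fraction <= 1.0:
        raise ValueError("sample_fraction must be in (0, 1]")
    if not 0.0 <= full_data_share <= 1.0:
        raise ValueError("full_data_share must be in [0, 1]")
    # Full-data work is unchanged; only the tree-learning part shrinks.
    new_time = full_data_share + (1.0 - full_data_share) * sample_fraction
    return 1.0 / new_time


if __name__ == "__main__":
    # Hypothetical share of 40%; the real share depends on the dataset.
    for fraction in (0.1, 0.2, 0.5):
        print(fraction, round(goss_speedup(fraction, full_data_share=0.4), 2))
```

Under this model, with the hypothetical 40% share, keeping 20% of the data gives a speed-up of roughly 1.9x rather than 5x, and the speed-up can never exceed `1 / full_data_share` no matter how small the sample is.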

* Second, we evaluate the accuracy of GOSS by comparing it with Stochastic Gradient Boosting (SGB) [20]. Without loss of generality, we use the LETOR dataset for the test. We tune the sampling ratio by choosing different a and b in GOSS, and use the same overall sampling ratio for SGB. We run these settings until convergence by using early stopping. The results are shown in Table 4. The accuracy of GOSS is always better than that of SGB when using the same sampling ratio. These results are consistent with our discussion in Sec. 3.2. All the experiments demonstrate that GOSS is a more effective sampling method than stochastic sampling.
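
To make the comparison concrete, the two sampling schemes can be sketched in NumPy as follows. This is a minimal illustration and not the LightGBM implementation: GOSS keeps the top `a` fraction of instances by absolute gradient, uniformly samples a `b` fraction of all instances from the remainder, and amplifies the sampled small-gradient instances by `(1 - a) / b`, while SGB samples uniformly at the same overall ratio `a + b`. The function names `goss_sample` and `sgb_sample` are hypothetical.

```python
import numpy as np


def goss_sample(gradients, a, b, rng):
    """Return (indices, weights) chosen by Gradient-based One-Side Sampling."""
    n = len(gradients)
    top_n = int(round(a * n))
    rand_n = int(round(b * n))
    # Sort instances by absolute gradient, largest first.
    order = np.argsort(-np.abs(gradients))
    top_idx = order[:top_n]
    # Uniformly sample rand_n of the remaining small-gradient instances.
    rest_idx = rng.choice(order[top_n:], size=rand_n, replace=False)
    indices = np.concatenate([top_idx, rest_idx])
    weights = np.ones(len(indices))
    # Amplify the sampled small-gradient instances by (1 - a) / b.
    weights[top_n:] = (1.0 - a) / b
    return indices, weights


def sgb_sample(n, ratio, rng):
    """Uniform subsampling without replacement, as used for the SGB baseline."""
    return rng.choice(n, size=int(round(ratio * n)), replace=False)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grads = rng.normal(size=1000)  # synthetic gradients, for illustration only
    idx, w = goss_sample(grads, a=0.1, b=0.1, rng=rng)
    sgb_idx = sgb_sample(len(grads), ratio=0.2, rng=rng)
    print(len(idx), len(sgb_idx))
```

Both samplers return the same number of instances, so any difference in accuracy comes from which instances are kept and how they are weighted.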

### 5.3 Analysis on EFB

* We check the contribution of EFB to the speed-up by comparing lgb_baseline with EFB_only. The results are shown in Table 2. Here we do not allow conflicts in the bundle finding process (i.e., γ = 0). We find that EFB helps achieve a significant speed-up on those large-scale datasets.

* Please note that lgb_baseline has already been optimized for sparse features, and EFB can still speed up the training by a large factor. This is because EFB merges many sparse features (both one-hot encoded features and implicitly exclusive features) into far fewer features. The basic sparse feature optimization is included in the bundling process. However, EFB does not incur the additional cost of maintaining a nonzero data table for each feature during tree learning. Moreover, since many previously isolated features are bundled together, EFB increases spatial locality and significantly improves the cache hit rate. Therefore, the overall improvement in efficiency is dramatic. Based on the above analysis, EFB is a very effective algorithm for leveraging sparsity in the histogram-based algorithm, and it brings a significant speed-up to the GBDT training process.
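
The merging step described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the LightGBM implementation: features are already binned into integers with bin 0 meaning "zero/default", features are visited in descending order of nonzero count, and with `max_conflicts=0` (corresponding to γ = 0) a feature joins a bundle only if it is never nonzero on the same row as a feature already in that bundle. The function names `greedy_bundles` and `merge_bundle` are hypothetical.

```python
import numpy as np


def greedy_bundles(binned, max_conflicts=0):
    """Greedily group columns whose nonzero rows overlap at most max_conflicts times."""
    nonzero = binned != 0
    # Visit features with more nonzero entries first.
    order = np.argsort(-nonzero.sum(axis=0), kind="stable")
    bundles, occupied, conflicts = [], [], []
    for j in order:
        for k in range(len(bundles)):
            overlap = int(np.sum(occupied[k] & nonzero[:, j]))
            if conflicts[k] + overlap <= max_conflicts:
                bundles[k].append(int(j))
                occupied[k] |= nonzero[:, j]
                conflicts[k] += overlap
                break
        else:
            # No existing bundle can take this feature: start a new one.
            bundles.append([int(j)])
            occupied.append(nonzero[:, j].copy())
            conflicts.append(0)
    return bundles


def merge_bundle(binned, bundle):
    """Merge one bundle into a single column by giving each feature its own bin range."""
    merged = np.zeros(binned.shape[0], dtype=np.int64)
    offset = 0
    for j in bundle:
        col = binned[:, j]
        mask = col != 0
        # Shift nonzero bins so the original features occupy disjoint ranges.
        merged[mask] = col[mask] + offset
        offset += int(col.max())
    return merged


if __name__ == "__main__":
    # Three mutually exclusive sparse columns plus one dense column (toy data).
    binned = np.array([[1, 0, 0, 3],
                       [0, 2, 0, 1],
                       [0, 0, 1, 2],
                       [0, 2, 0, 3]])
    for bundle in greedy_bundles(binned):
        print(bundle, merge_bundle(binned, bundle))
```

In the toy data, the three sparse columns end up in one bundle and the dense column stays alone, so a histogram needs to be built for two columns instead of four. Because each original feature keeps its own bin range inside the merged column, a split on the merged column can still be traced back to the original feature.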


## 6 Conclusion

* In this paper, we have proposed a novel GBDT algorithm called LightGBM, which contains two novel techniques: Gradient-based One-Side Sampling and Exclusive Feature Bundling, which deal with a large number of data instances and a large number of features, respectively. We have performed both theoretical analysis and experimental studies on these two techniques. The experimental results are consistent with the theory and show that, with the help of GOSS and EFB, LightGBM can significantly outperform XGBoost and SGB in terms of computational speed and memory consumption. For future work, we will study the optimal selection of a and b in Gradient-based One-Side Sampling, and continue improving the performance of Exclusive Feature Bundling so that it can deal with a large number of features regardless of whether they are sparse.
