# XGBoost: A Scalable Tree Boosting System

## Abstract
## 摘要

* Tree boosting is a highly effective and widely used machine learning method. In this paper, we describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientists to achieve state-of-the-art results on many machine learning challenges. We propose a novel sparsity-aware algorithm for sparse data and weighted quantile sketch for approximate tree learning. More importantly, we provide insights on cache access patterns, data compression and sharding to build a scalable tree boosting system. By combining these insights, XGBoost scales beyond billions of examples using far fewer resources than existing systems.

* 提升树是一种高效、应用广泛的机器学习方法。在本文中，我们描述了一个可扩展的端到端提升树系统XGBoost，它被数据科学家广泛使用，以在许多机器学习挑战中获得最新的结果。提出了一种新的稀疏数据稀疏感知算法和近似树学习的加权分位数草图算法。更重要的是，我们提供了有关缓存访问模式、数据压缩和分片的见解，以构建一个可伸缩的树提升系统。通过结合这些见解，XGBoost可以比现有系统使用更少资源扩展到数十亿个样本。

## 1. INTRODUCTION
## 1. 导言

* Machine learning and data-driven approaches are becoming very important in many areas. Smart spam classifiers protect our email by learning from massive amounts of spam data and user feedback; advertising systems learn to match the right ads with the right context; fraud detection systems protect banks from malicious attackers; anomaly event detection systems help experimental physicists to find events that lead to new physics. There are two important factors that drive these successful applications: usage of effective (statistical) models that capture the complex data dependencies and scalable learning systems that learn the model of interest from large datasets.

* 机器学习和数据驱动方法在许多领域都变得非常重要。智能垃圾邮件分类器通过从大量垃圾邮件数据和用户反馈中学习来保护我们的电子邮件；广告系统学习将正确的广告与正确的上下文匹配；欺诈检测系统保护银行免受恶意攻击者的攻击；异常事件检测系统帮助实验物理学家找到导致新物理的事件。驱动这些成功应用的两个重要因素是：使用有效（统计）模型捕获复杂的数据依赖关系，以及使用可伸缩学习系统从大型数据集学习感兴趣的模型。

* Among the machine learning methods used in practice, gradient tree boosting [10] is one technique that shines in many applications. Tree boosting has been shown to give state-of-the-art results on many standard classification benchmarks [16]. LambdaMART [5], a variant of tree boosting for ranking, achieves state-of-the-art results for ranking problems. Besides being used as a stand-alone predictor, it is also incorporated into real-world production pipelines for ad click-through rate prediction [15]. Finally, it is the de facto choice of ensemble method and is used in challenges such as the Netflix prize [3].

* 在实际应用的机器学习方法中，梯度提升树[10]1是一种应用广泛的方法。提升树已经被证明在许多标准分类基准上给出了最好的结果[16]。LambdaMART[5]是一种用于排名的提升树的变体，它实现了排名问题的最新结果。除了作为一个独立的预测工具，它还被整合到现实世界的生产流水线中，用于广告点击率预测[15]。最后，它实际上是集成方法的选择，并用于Netflix大奖等挑战中[3]。

* In this paper, we describe XGBoost, a scalable machine learning system for tree boosting. The system is available as an open source package. The impact of the system has been widely recognized in a number of machine learning and data mining challenges. Take the challenges hosted by the machine learning competition site Kaggle for example. Among the 29 challenge winning solutions published at Kaggle’s blog during 2015, 17 solutions used XGBoost. Among these solutions, eight solely used XGBoost to train the model, while most others combined XGBoost with neural nets in ensembles. For comparison, the second most popular method, deep neural nets, was used in 11 solutions. The success of the system was also witnessed in KDDCup 2015, where XGBoost was used by every winning team in the top-10. Moreover, the winning teams reported that ensemble methods outperform a well-configured XGBoost by only a small amount [1].


* 本文描述了一个可扩展的提升树机器学习系统XGBoost。该系统作为一个开源软件包提供。该系统的影响已经被广泛认识到在一些机器学习和数据挖掘的挑战。以机器学习竞赛网站Kaggle举办的挑战赛为例。在2015年Kaggle博客上发布的29个挑战获奖解决方案中，有17个使用了XGBoost。在这些解决方案中，有8个单独使用XGBoost来训练模型，而其他大多数方案将XGBoost与神经网络结合在一起。相比之下，第二流行的方法，深度神经网络，被用于11个解决方案。该系统的成功还体现在KDDCup 2015年，在那里XGBoost被前10名的每一支获胜球队使用。此外，获胜的团队报告说，集成方法的性能仅比配置良好的XGBoost好一小部分[1]。

* These results demonstrate that our system gives state-of-the-art results on a wide range of problems. Examples of the problems in these winning solutions include: store sales prediction; high energy physics event classification; web text classification; customer behavior prediction; motion detection; ad click through rate prediction; malware classification; product categorization; hazard risk prediction; massive online course dropout rate prediction. While domain dependent data analysis and feature engineering play an important role in these solutions, the fact that XGBoost is the consensus choice of learner shows the impact and importance of our system and tree boosting.

* 这些结果表明，我们的系统在一系列问题上提供了最新的结果。这些成功解决方案中的问题示例包括：商店销售预测；高能物理事件分类；web文本分类；客户行为预测；运动检测；广告点击率预测；恶意软件分类；产品分类；危害风险预测；大规模在线课程辍学率预测。虽然领域相关的数据分析和特征工程在这些解决方案中扮演着重要的角色，但是XGBoost是学习者的一致选择，这表明了我们的系统和树提升的影响和重要性。

* The most important factor behind the success of XGBoost is its scalability in all scenarios. The system runs more than ten times faster than existing popular solutions on a single machine and scales to billions of examples in distributed or memory-limited settings. The scalability of XGBoost is due to several important systems and algorithmic optimizations. These innovations include: a novel tree learning algorithm for handling sparse data; and a theoretically justified weighted quantile sketch procedure that enables handling instance weights in approximate tree learning. Parallel and distributed computing makes learning faster, which enables quicker model exploration. More importantly, XGBoost exploits out-of-core computation and enables data scientists to process hundreds of millions of examples on a desktop. Finally, it is even more exciting to combine these techniques to make an end-to-end system that scales to even larger data with the least amount of cluster resources. The major contributions of this paper are listed as follows:

* We design and build a highly scalable end-to-end tree boosting system.

* We propose a theoretically justified weighted quantile sketch for efficient proposal calculation.

* We introduce a novel sparsity-aware algorithm for parallel tree learning.

* We propose an effective cache-aware block structure for out-of-core tree learning.

* XGBoost成功背后最重要的因素是它在所有场景中的可伸缩性。该系统在一台机器上的运行速度比现有的流行解决方案快10倍以上，并在分布式或内存有限的设置下扩展到数十亿个样本。XGBoost的可伸缩性是由于几个重要的系统和算法优化。这些创新包括：一种新的树学习算法用于处理稀疏数据；一种理论上合理的加权分位数素描过程能够在近似树学习中处理实例权重。并行和分布式计算使学习更快，从而使模型探索更快。更重要的是，XGBoost利用了核心计算，使数据科学家能够在桌面上处理数亿个样本。最后，将这些技术结合起来，使端到端系统能够以最少的集群资源扩展到更大的数据，这更令人兴奋。本文的主要贡献如下：

* 我们设计并构建了一个高度可扩展的端到端提升树系统。

* 我们提出了一个理论上合理的加权分位数草图，用于有效的方案计算。

* 提出了一种新的稀疏性感知并行树学习算法。

* 提出了一种有效的cache感知块结构用于核外树学习。

* While there are some existing works on parallel tree boosting [22, 23, 19], the directions such as out-of-core computation, cache-aware and sparsity-aware learning have not been explored. More importantly, an end-to-end system that combines all of these aspects gives a novel solution for real-world use-cases. This enables data scientists as well as researchers to build powerful variants of tree boosting algorithms [7, 8]. Besides these major contributions, we also make additional improvements in proposing a regularized learning objective, which we will include for completeness.

* 虽然已有一些关于并行提升树的研究工作[22，23，19]，但是对于核外计算、cache感知和稀疏感知学习等方面的研究还没有展开。更重要的是，结合所有这些方面的端到端系统为现实世界的用例提供了一个新的解决方案。这使得数据科学家和研究人员能够构建强大的树提升算法变体[7，8]。除了这些主要贡献外，我们还在提出一个正规化学习目标方面做了额外的改进，我们将包括完整性。

* The remainder of the paper is organized as follows. We will first review tree boosting and introduce a regularized objective in Sec. 2. We then describe the split finding methods in Sec. 3 as well as the system design in Sec. 4, including experimental results when relevant to provide quantitative support for each optimization we describe. Related work is discussed in Sec. 5. Detailed end-to-end evaluations are included in Sec. 6. Finally we conclude the paper in Sec. 7.

* 论文的其余部分安排如下。我们将首先回顾提升树并在Sec2中引入一个正则化目标。然后我们在 Sec. 3中描述了分裂发现方法。以及系统设计在Sec. 4中。包括相关的实验结果，为我们描述的每一个优化提供定量支持。相关工作将在Sec. 5讨论。详细的端到端评估包含在Sec6中。最后，我们在Sec. 7中对论文进行了总结。


## 2. TREE BOOSTING IN A NUTSHELL

## 2. 提升树的生长

* We review gradient tree boosting algorithms in this section. The derivation follows from the same idea in the existing literature on gradient boosting. Specifically, the second order method originates from Friedman et al. [12]. We make minor improvements in the regularized objective, which were found helpful in practice.

* 我们在这一节回顾了梯度提升树算法。本文的推导遵循了已有文献关于梯度提升的相同思想。具体来说，二阶方法起源于Friedman等人[12]。我们在规定的目标上做了一些小的改进，这在实践中是有帮助的。

### 2.1 Regularized Learning Objective
### 2.1 正规化学习目标

* For a given data set with $n$ examples and $m$ features $D = \{(x_i, y_i)\}$ $(|D| = n, x_i \in \mathbb{R}^m, y_i \in \mathbb{R})$, a tree ensemble model (shown in Fig. 1) uses $K$ additive functions to predict the output.

* 对于给定n个样本和m个特征D = {(xi,yi)} (|D| = n,xi ∈ Rm,yi ∈ R)，树集成模型（图1所示）使用K加性函数来预测输出。

![avatar](pic/1.png)

* where $\mathcal{F} = \{f(x) = w_{q(x)}\}$ $(q: \mathbb{R}^m \to T, w \in \mathbb{R}^T)$ is the space of regression trees (also known as CART). Here $q$ represents the structure of each tree that maps an example to the corresponding leaf index. $T$ is the number of leaves in the tree. Each $f_k$ corresponds to an independent tree structure $q$ and leaf weights $w$. Unlike decision trees, each regression tree contains a continuous score on each leaf; we use $w_i$ to represent the score on the $i$-th leaf. For a given example, we will use the decision rules in the trees (given by $q$) to classify it into the leaves and calculate the final prediction by summing up the scores in the corresponding leaves (given by $w$). To learn the set of functions used in the model, we minimize the following regularized objective.

* 其中 F = { f(x)=wq(x)}(q:Rm →T,w∈RT)为回归树空间(也称CART)。这里q表示将一个样本映射到相应叶索引的每棵树的结构。T是树的叶子节点数。每个fk对应一个独立的树结构q和叶权值w。与决策树不同的是，每棵回归树都包含对每个叶子节点的连续得分，我们使用wi来表示第i个叶的得分。对于一个给定的样本,我们将使用决策规则在树给出的分类到树叶子节点和计算最终的预测,总结相应的分数。学习模型中使用的函数集,我们最小化以下正规化的目标。

![avatar](pic/2.png)

<center> Figure 1: Tree Ensemble Model. The final prediction for a given example is the sum of predictions from each tree.</center>


<center>图1：树集成模型。对于给定的例子，最后的预测是来自每棵树的预测之和。</center>

* Here $l$ is a differentiable convex loss function that measures the difference between the prediction $\hat{y}_i$ and the target $y_i$. The second term $\Omega$ penalizes the complexity of the model (i.e., the regression tree functions). The additional regularization term helps to smooth the final learnt weights to avoid over-fitting. Intuitively, the regularized objective will tend to select a model employing simple and predictive functions. A similar regularization technique has been used in the Regularized greedy forest (RGF) [25] model. Our objective and the corresponding learning algorithm are simpler than RGF and easier to parallelize. When the regularization parameter is set to zero, the objective falls back to traditional gradient tree boosting.

* 这里l是一个可微的凸损失函数，它测量预测yˆi和目标yi之间的差异。第二项Ω惩罚模型的复杂性（即回归树函数）。附加的正则化项有助于平滑最终学习到的权重，以避免过度拟合。直观地说，正则化目标倾向于选择一个使用简单预测函数的模型。在正则贪婪森林（RGF）[25]模型中使用了类似的正则化技术。我们的目标和相应的学习算法比RGF简单且易于并行化。当正则化参数为零时，目标回归到传统的梯度树增强。

![avatar](pic/3.png)
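
* The additive prediction of Eq. (1) and the regularized objective of Eq. (2) can be sketched in a few lines of Python. This is only an illustration, assuming the penalty $\Omega(f) = \gamma T + \frac{1}{2}\lambda\|w\|^2$ from the XGBoost paper; the `Tree` class, the helper names and the toy trees are hypothetical and not part of any library.

```python
from typing import Callable, List, Sequence


class Tree:
    """A regression tree f(x) = w[q(x)]; q maps an example to a leaf index."""

    def __init__(self, q: Callable[[Sequence[float]], int], w: List[float]):
        self.q = q  # tree structure (decision rules)
        self.w = w  # one continuous score per leaf

    def predict(self, x: Sequence[float]) -> float:
        return self.w[self.q(x)]


def predict(trees: List[Tree], x: Sequence[float]) -> float:
    """Eq. (1): the ensemble prediction is the sum over all K trees."""
    return sum(t.predict(x) for t in trees)


def omega(tree: Tree, gamma: float, lam: float) -> float:
    """Assumed penalty: gamma * (number of leaves) + 0.5 * lam * ||w||^2."""
    return gamma * len(tree.w) + 0.5 * lam * sum(wj * wj for wj in tree.w)


def objective(trees, X, y, loss, gamma, lam) -> float:
    """Eq. (2): training loss plus the complexity penalty of every tree."""
    data_term = sum(loss(predict(trees, x), yi) for x, yi in zip(X, y))
    return data_term + sum(omega(t, gamma, lam) for t in trees)


def squared_loss(y_hat: float, y: float) -> float:
    return (y_hat - y) ** 2


# Two hypothetical depth-1 trees (stumps) on two features.
t1 = Tree(lambda x: 0 if x[0] < 15 else 1, [2.0, -1.0])
t2 = Tree(lambda x: 0 if x[1] > 0 else 1, [0.5, -0.5])
```

* Setting `gamma` and `lam` to zero removes the penalty, which matches the remark that the objective then falls back to traditional gradient tree boosting.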

### 2.2 Gradient Tree Boosting
### 2.2 梯度提升树

* The tree ensemble model in Eq. (2) includes functions as parameters and cannot be optimized using traditional optimization methods in Euclidean space. Instead, the model is trained in an additive manner. Formally, let $\hat{y}_i^{(t)}$ be the prediction of the $i$-th instance at the $t$-th iteration; we will need to add $f_t$ to minimize the following objective.

* 式（2）中的集成树模型包含函数作为参数，在欧氏空间中不能用传统的优化方法进行优化。相反，模型是以加法的方式训练的。形式上，假设yˆ（t）是第t次迭代时第i个样本的预测，我们将需要添加ft以最小化以下目标。

![avatar](pic/4.png)

* This means we greedily add the ft that most improves our model according to Eq. (2). Second-order approximation can be used to quickly optimize the objective in the general setting [12].

* 这意味着我们贪婪地根据公式(2)添加了最能改进我们模型的ft。二阶近似可以用于在一般设置中快速优化目标

![avatar](pic/5.png)

* where ![avatar](pic/6.png)

* are first and second order gradient statistics on the loss function. We can remove the constant terms to obtain the following simplified objective at step t.

* 是关于损失函数的一阶和二阶梯度统计量。在步骤t中，我们可以去掉常数项以获得以下简化目标。

![avatar](pic/7.png)

![avatar](pic/8.png)

<center>Figure 2: Structure Score Calculation. We only need to sum up the gradient and second order gradient statistics on each leaf, then apply the scoring formula to get the quality score.</center>

图2：结构得分计算。我们只需要对每片叶子的梯度和二阶梯度统计量求和，然后应用评分公式得到质量分数。

* Define $I_j = \{i \mid q(x_i) = j\}$ as the instance set of leaf $j$. We can rewrite Eq. (3) by expanding $\Omega$ as follows

* 定义$I_j = \{i|q(X_i) = j\}$作为叶子j的样本集。我们可以通过扩展Ω来重写EQ（3），如下所示

![avatar](pic/9.png)

* For a fixed structure $q(x)$, we can compute the optimal weight $w_j^*$ of leaf $j$ by

* 对于固定结构q（x），我们可以通过

![avatar](pic/10.png)

* and calculate the corresponding optimal value by

* 对于固定结构q（x），我们可以通过

![avatar](pic/11.png)

* Eq (6) can be used as a scoring function to measure the quality of a tree structure q. This score is like the impurity score for evaluating decision trees, except that it is derived for a wider range of objective functions. Fig. 2 illustrates how this score can be calculated.

* 等式（6）可用作衡量树结构q质量的评分函数。该评分类似于评价决策树的杂质评分，只是它是为更广泛的目标函数推导的。图2说明如何计算该分数。
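
* As a hedged illustration of Eqs. (5) and (6), the sketch below sums the gradient statistics per leaf and then applies the closed forms $w_j^* = -G_j / (H_j + \lambda)$ and $-\frac{1}{2}\sum_j G_j^2 / (H_j + \lambda) + \gamma T$, where $G_j$ and $H_j$ are the sums of $g_i$ and $h_i$ over $I_j$. These closed forms are taken from the XGBoost paper; the function names and the toy numbers are hypothetical.

```python
def sum_by_leaf(g, h, leaf_of, n_leaves):
    """Accumulate (G_j, H_j) for every leaf j from per-instance g_i, h_i."""
    stats = [[0.0, 0.0] for _ in range(n_leaves)]
    for gi, hi, j in zip(g, h, leaf_of):
        stats[j][0] += gi
        stats[j][1] += hi
    return [(G, H) for G, H in stats]


def optimal_leaf_weight(G, H, lam):
    """Eq. (5): the weight that minimizes the quadratic objective of one leaf."""
    return -G / (H + lam)


def structure_score(leaf_stats, lam, gamma):
    """Eq. (6): objective value of a fixed tree structure (lower is better)."""
    T = len(leaf_stats)
    return -0.5 * sum(G * G / (H + lam) for G, H in leaf_stats) + gamma * T


# Toy example: three instances, two leaves, unit second-order statistics.
g = [1.0, -2.0, 3.0]
h = [1.0, 1.0, 1.0]
leaf_of = [0, 0, 1]  # q(x_i) for each instance

stats = sum_by_leaf(g, h, leaf_of, n_leaves=2)
weights = [optimal_leaf_weight(G, H, lam=1.0) for G, H in stats]
score = structure_score(stats, lam=1.0, gamma=0.5)
```

* Only the per-leaf sums are needed, which is what Fig. 2 depicts: sum the statistics on each leaf, then apply the scoring formula.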

* Normally it is impossible to enumerate all the possible tree structures $q$. A greedy algorithm that starts from a single leaf and iteratively adds branches to the tree is used instead. Assume that $I_L$ and $I_R$ are the instance sets of the left and right nodes after the split. Letting $I = I_L \cup I_R$, the loss reduction after the split is given by

* 通常不可能枚举所有可能的树结构q。取而代之的是一个贪婪的算法，它从一个叶子开始，迭代地向树中添加分支。假设IL和IR是拆分后左右节点的实例集。Letting I=IL∪IR，则分割后的损失减少量由下式给出

![avatar](pic/12.png)

* This formula is usually used in practice for evaluating the split candidates.

* 这个公式通常在实践中用于评估被分割的候选人。
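
* The loss reduction of Eq. (7) only depends on the summed statistics of the two children. A minimal sketch, assuming the form $\frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma$ from the XGBoost paper; the function name `split_gain` is hypothetical.

```python
def split_gain(GL, HL, GR, HR, lam, gamma):
    """Eq. (7): loss reduction obtained by splitting a leaf into two children."""

    def score(G, H):
        return G * G / (H + lam)

    parent = score(GL + GR, HL + HR)
    return 0.5 * (score(GL, HL) + score(GR, HR) - parent) - gamma


# A split is only worth making when the gain is positive,
# so a larger gamma prunes more candidate splits.
gain_without_penalty = split_gain(-1.0, 2.0, 3.0, 1.0, lam=1.0, gamma=0.0)
gain_with_penalty = split_gain(-1.0, 2.0, 3.0, 1.0, lam=1.0, gamma=2.0)
```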

### 2.3  Shrinkage and Column Subsampling
### 2.3 缩减和列下取样

* Besides the regularized objective mentioned in Sec. 2.1, two additional techniques are used to further prevent over-fitting. The first technique is shrinkage, introduced by Friedman [11]. Shrinkage scales newly added weights by a factor $\eta$ after each step of tree boosting. Similar to a learning rate in stochastic optimization, shrinkage reduces the influence of each individual tree and leaves space for future trees to improve the model. The second technique is column (feature) subsampling. This technique is used in RandomForest [4, 13]. It is implemented in the commercial software TreeNet for gradient boosting, but is not implemented in existing open-source packages. According to user feedback, using column sub-sampling prevents over-fitting even more so than the traditional row sub-sampling (which is also supported). The usage of column sub-samples also speeds up computations of the parallel algorithm described later.

* 除了第2.1节中提到的正则化目标外，还使用了另外两种技术来进一步防止过拟合。第一种技术是由Friedman[11]引入的缩减。在每一步提高树的重量后，缩减将增加一个因子η。与模型优化中的学习率相似，缩减减小了每棵树的影响，为未来树改进模型留下了空间。第二种技术是列(特征)下采样。这种技术在RandomForest中使用[4,13]，它是在一个用于梯度提升的商业软件TreeNet中实现的，但在现有的开源包中没有实现。根据用户反馈，使用列下采样比传统的行下采样(也支持行下采样)更能防止过拟合。列下采样的使用也加速了后面描述的并行算法的计算。
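
* Both techniques are small modifications of the boosting loop. The sketch below is an assumption-laden illustration rather than XGBoost's implementation: `sample_columns` draws the feature subset a tree is allowed to split on, and `boost_update` adds a new tree's outputs scaled by $\eta$. Both function names and the rounding rule for the subset size are hypothetical.

```python
import random


def sample_columns(n_features, colsample, rng):
    """Pick the subset of feature indices that one tree may split on."""
    k = max(1, int(round(colsample * n_features)))
    return sorted(rng.sample(range(n_features), k))


def boost_update(preds, tree_outputs, eta):
    """Shrinkage: add the new tree's outputs scaled by the factor eta."""
    return [p + eta * f for p, f in zip(preds, tree_outputs)]


rng = random.Random(0)
cols = sample_columns(n_features=10, colsample=0.5, rng=rng)
new_preds = boost_update([0.0, 1.0], [2.0, -2.0], eta=0.1)
```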

![avatar](pic/31.png)

## 3. SPLIT FINDING ALGORITHMS

## 3 分割查找算法

### 3.1 Basic Exact Greedy Algorithm

### 3.1 基本精确贪婪算法

* One of the key problems in tree learning is to find the best split as indicated by Eq (7). In order to do so, a split finding algorithm enumerates over all the possible splits on all the features. We call this the exact greedy algorithm. Most existing single machine tree boosting implementations, such as scikit-learn [20], R’s gbm [21] as well as the single machine version of XGBoost support the exact greedy algorithm. The exact greedy algorithm is shown in Alg. 1. It is computationally demanding to enumerate all the possible splits for continuous features. In order to do so efficiently, the algorithm must first sort the data according to feature values and visit the data in sorted order to accumulate the gradient statistics for the structure score in Eq (7).

* 树学习中的一个关键问题是如何找到Eq（7）所表示的最佳分割。为了做到这一点，分割查找算法会枚举所有特征上所有可能的分割。我们称之为精确贪婪算法。大多数现有的单机器树提升实现，如scikit learn[20]、R的gbm[21]以及XGBoost的单机器版本都支持精确的贪婪算法。精确的贪心算法如Alg.1 所示。计算上要求枚举连续特征的所有可能分割。为了有效地实现这一点，算法必须首先根据特征值对数据进行排序，然后访问排序后的数据，以累积Eq（7）中结构得分的梯度统计。
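
* A compact sketch of the exact greedy idea for dense data: sort each feature, scan in sorted order while accumulating $G_L$ and $H_L$, and score every boundary between distinct values with Eq. (7). It is a simplified illustration, not the paper's Alg. 1 verbatim; the function name, the midpoint thresholds and the default parameters are assumptions.

```python
def exact_greedy_split(X, g, h, lam=1.0, gamma=0.0):
    """Return (gain, feature, threshold) of the best split, or (0.0, None, None)."""
    n, m = len(X), len(X[0])
    G, H = sum(g), sum(h)
    parent = G * G / (H + lam)
    best = (0.0, None, None)
    for k in range(m):
        # Sort instance indices by the value of feature k.
        order = sorted(range(n), key=lambda i: X[i][k])
        GL = HL = 0.0
        for pos in range(n - 1):
            i, nxt = order[pos], order[pos + 1]
            GL += g[i]
            HL += h[i]
            if X[i][k] == X[nxt][k]:
                continue  # no valid threshold between equal values
            GR, HR = G - GL, H - HL
            gain = 0.5 * (GL * GL / (HL + lam) + GR * GR / (HR + lam) - parent) - gamma
            if gain > best[0]:
                best = (gain, k, 0.5 * (X[i][k] + X[nxt][k]))
    return best


# Toy data: squared loss with initial prediction 0 gives g_i = -y_i, h_i = 1.
X = [[1.0], [2.0], [3.0], [4.0]]
y = [0.0, 0.0, 10.0, 10.0]
g = [-yi for yi in y]
h = [1.0] * len(y)
best_gain, best_feature, best_threshold = exact_greedy_split(X, g, h)
```

* The sort inside the loop is the expensive step that the column block structure of Sec. 4.1 moves out of the training loop.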

### 3.2 Approximate Algorithm
### 3.2 近似算法

* The exact greedy algorithm is very powerful since it enumerates over all possible splitting points greedily. However, it is impossible to efficiently do so when the data does not fit entirely into memory. Same problem also arises in the distributed setting. To support effective gradient tree boosting in these two settings, an approximate algorithm is needed.

* 精确贪婪算法非常强大，因为它贪婪地搜索所有可能的分裂点。然而，当数据不能完全装入内存时，就不可能有效地这样做。同样的问题也出现在分布式设置中。为了在这两种情况下支持有效的梯度提升树，需要一种近似算法。

* We summarize an approximate framework, which resembles the ideas proposed in the past literature [17, 2, 22], in Alg. 2. To summarize, the algorithm first proposes candidate splitting points according to percentiles of the feature distribution (specific criteria will be given in Sec. 3.3). The algorithm then maps the continuous features into buckets split by these candidate points, aggregates the statistics and finds the best solution among proposals based on the aggregated statistics.

* 我们总结了一个近似的框架，类似于在过去的文献[17,2,22]中提出的想法。综上所述，算法首先根据特征分布的百分位数提出候选分裂点(具体标准将在第3.3节给出)。然后，该算法将连续的特征映射到由候选点分割的桶中，对统计数据进行汇总，并根据汇总的统计数据在建议中找到最佳的解决方案。
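
* The two steps (propose candidates, then aggregate statistics per bucket) can be sketched for a single feature as follows. This is a simplified illustration: candidates come from plain, unweighted percentiles rather than the weighted quantiles of Sec. 3.3, and both function names are hypothetical.

```python
import bisect


def propose_candidates(values, n_buckets):
    """Pick up to n_buckets - 1 split candidates at evenly spaced percentiles."""
    s = sorted(values)
    cands = []
    for b in range(1, n_buckets):
        v = s[(b * len(s)) // n_buckets]
        if not cands or v > cands[-1]:  # keep candidates strictly increasing
            cands.append(v)
    return cands


def best_split_from_buckets(values, g, h, cands, lam=1.0, gamma=0.0):
    """Aggregate g, h into buckets, then evaluate only the candidate thresholds."""
    n_b = len(cands) + 1
    Gb, Hb = [0.0] * n_b, [0.0] * n_b
    for x, gi, hi in zip(values, g, h):
        j = bisect.bisect_right(cands, x)  # bucket j holds cands[j-1] <= x < cands[j]
        Gb[j] += gi
        Hb[j] += hi
    G, H = sum(Gb), sum(Hb)
    parent = G * G / (H + lam)
    best_gain, best_thr = 0.0, None
    GL = HL = 0.0
    for j, thr in enumerate(cands):
        GL += Gb[j]  # left side now holds every x < thr
        HL += Hb[j]
        GR, HR = G - GL, H - HL
        gain = 0.5 * (GL * GL / (HL + lam) + GR * GR / (HR + lam) - parent) - gamma
        if gain > best_gain:
            best_gain, best_thr = gain, thr
    return best_gain, best_thr


values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
g = [0.0, 0.0, 0.0, 0.0, -10.0, -10.0, -10.0, -10.0]
h = [1.0] * 8
cands = propose_candidates(values, n_buckets=4)
approx_gain, approx_thr = best_split_from_buckets(values, g, h, cands)
```

* After bucketing, split evaluation costs one pass over the buckets instead of one pass over all sorted instances.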

* There are two variants of the algorithm, depending on when the proposal is given. The global variant proposes all the candidate splits during the initial phase of tree construction, and uses the same proposals for split finding at all levels. The local variant re-proposes after each split. The global method requires fewer proposal steps than the local method. However, usually more candidate points are needed for the global proposal because candidates are not refined after each split. The local proposal refines the candidates after splits, and can potentially be more appropriate for deeper trees. A comparison of different algorithms on a Higgs boson dataset is given by Fig. 3. We find that the local proposal indeed requires fewer candidates. The global proposal can be as accurate as the local one given enough candidates.

* 该算法有两种变体，取决于何时给出建议。全局变量在树构造的初始阶段提出了所有的候选分割，并在所有层次上使用相同的分割查找建议。局部变体在每次分割后重新提出。全局方法比局部方法需要更少的建议步骤。但是，通常全局方案需要更多的候选点，因为候选点在每次分割后都没有细化。局部建议在分割后细化候选树，可能更适合更深的树。图3给出了不同算法在希格斯玻色子（Higgs boson）数据集上的比较。我们发现，局部变体的确需要较少的候选人。如果有足够的候选方案，全局方案可以和局部方案一样准确。

* Most existing approximate algorithms for distributed tree learning also follow this framework. Notably, it is also possible to directly construct approximate histograms of gradient statistics [22]. It is also possible to use other variants of binning strategies instead of quantiles [17]. The quantile strategy benefits from being distributable and recomputable, which we will detail in the next subsection. From Fig. 3, we also find that the quantile strategy can get the same accuracy as exact greedy given a reasonable approximation level.

* 现有的大多数分布式树学习的近似算法也遵循这个框架。值得注意的是，还可以直接构造梯度统计的近似直方图[22]。也可以使用二进制策略的其他变体而不是分位数[17]。分位数策略受益于可分配和可重新计算，我们将在下一小节详细介绍。从图3中，我们还发现在合理的近似水平下，分位数策略可以获得与精确贪婪策略相同的精度。

* Our system efficiently supports exact greedy for the single machine setting, as well as approximate algorithm with both local and global proposal methods for all settings. Users can freely choose between the methods according to their needs.

* 我们的系统有效地支持对单机设置的精确贪婪，以及对所有设置同时使用局部和全局建议方法的近似算法。用户可以根据自己的需要自由选择方法。

![avatar](pic/13.png)

Figure 3: Comparison of test AUC convergence on Higgs 10M dataset. The eps parameter corresponds to the accuracy of the approximate sketch. This roughly translates to 1 / eps buckets in the proposal. We find that local proposals require fewer buckets, because they refine split candidates.

Higgs-10M数据集上测试AUC收敛性的比较。eps参数对应于近似草图的精度。这大致相当于提案中的1/eps桶。我们发现，地方提案所需的桶数更少，因为它细化了分离的候选人。


### 3.3 Weighted Quantile Sketch
### 3.3 加权分位数示意图

* One important step in the approximate algorithm is to propose candidate split points. Usually percentiles of a feature are used to make candidates distribute evenly on the data. Formally, let the multi-set $D_k = \{(x_{1k}, h_1), (x_{2k}, h_2), \cdots, (x_{nk}, h_n)\}$ represent the $k$-th feature values and second order gradient statistics of each training instance. We can define a rank function $r_k: \mathbb{R} \to [0, +\infty)$ as

* 近似算法的一个重要步骤是提出候选分割点。通常，特征的百分位数用于使候选数据在数据上均匀分布。形式上，让多重集合Dk={（x1k，h1），（x2k，h2）··（xnk，hn）}表示每个训练实例的第k个特征值和二阶梯度统计。我们可以将秩函数rk:R→[0，+∞）定义为

![avatar](pic/15.png)

![avatar](pic/14.png)

Figure 4: Tree structure with default directions. An example will be classified into the default direction when the feature needed for the split is missing.

图4：具有默认方向的树结构。当分割所需的特征丢失时，示例将被分类为默认方向。

* which represents the proportion of instances whose feature value $k$ is smaller than $z$. The goal is to find candidate split points $\{s_{k1}, s_{k2}, \cdots, s_{kl}\}$, such that

* 它表示特征值k小于z的实例的比例。目标是找到候选分割点{sk1，sk2，···skl}，这样

![avatar](pic/16.png)

* Here $\epsilon$ is an approximation factor. Intuitively, this means that there are roughly $1/\epsilon$ candidate points. Here each data point is weighted by $h_i$. To see why $h_i$ represents the weight, we can rewrite Eq. (3) as

* 这里ε是一个近似因子。直观地说，这意味着大约有1/ε的候选点。这里每个数据点都是用hi加权的。为了解释为什么hi代表权重，我们可以将Eq（3）重写为

![avatar](pic/17.png)

* which is exactly weighted squared loss with labels $g_i/h_i$ and weights $h_i$. For large datasets, it is non-trivial to find candidate splits that satisfy the criteria. When every instance has equal weight, an existing algorithm called quantile sketch [14, 24] solves the problem. However, there is no existing quantile sketch for weighted datasets. Therefore, most existing approximate algorithms either resort to sorting on a random subset of the data, which has a chance of failure, or use heuristics that do not have a theoretical guarantee.

* 这就是带有gi/hi和weights hi标签的加权平方损失。对于大型数据集，找到满足条件的候选拆分是非常重要的。当每个实例的权重相等时，一个称为分位数草图[14，24]的现有算法解决了这个问题。然而，加权数据集没有现有的分位数草图。因此，大多数现有的近似算法要么是对有可能失败的随机数据子集进行排序，要么是没有理论保证的启发式算法。

* To solve this problem, we introduced a novel distributed weighted quantile sketch algorithm that can handle weighted data with a provable theoretical guarantee. The general idea is to propose a data structure that supports merge and prune operations, with each operation proven to maintain a certain accuracy level. A detailed description of the algorithm as well as proofs are given in the appendix.

* 为了解决这一问题，我们提出了一种新的分布式加权分位数草图算法，该算法能够在可证明的理论保证下处理加权数据。总体思路是提出一种支持合并和剪枝操作的数据结构，并证明每个操作都能保持一定的精度水平。附录中给出了算法的详细描述和证明。
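
* To make the rank function and the $\epsilon$ criterion concrete, the sketch below computes them exactly on data that fits in memory. It is not the mergeable and prunable sketch described in the paper's appendix, only an offline reference under that assumption; both function names and the greedy selection rule are hypothetical.

```python
def weighted_ranks(values, h):
    """Return the sorted distinct values v and r(v): the share of total h with x < v."""
    total = sum(h)
    agg = {}
    for x, hi in zip(values, h):
        agg[x] = agg.get(x, 0.0) + hi
    distinct = sorted(agg)
    ranks, acc = [], 0.0
    for v in distinct:
        ranks.append(acc / total)
        acc += agg[v]
    return distinct, ranks


def propose_weighted_candidates(values, h, eps):
    """Greedily keep as few candidates as possible with consecutive rank gaps < eps.

    The minimum and maximum values are always kept. If a single value carries
    weight >= eps of the total, the gap next to it cannot be made smaller.
    """
    distinct, ranks = weighted_ranks(values, h)
    cands = [distinct[0]]
    last_rank = ranks[0]
    for i in range(1, len(distinct) - 1):
        # Skipping distinct[i] would open a gap of at least eps, so keep it.
        if ranks[i + 1] - last_rank >= eps:
            cands.append(distinct[i])
            last_rank = ranks[i]
    if len(distinct) > 1:
        cands.append(distinct[-1])
    return cands


values = [1.0, 2.0, 3.0, 4.0, 5.0]
h = [1.0, 1.0, 1.0, 1.0, 6.0]
distinct, ranks = weighted_ranks(values, h)
cands = propose_weighted_candidates(values, h, eps=0.25)
```

* With equal $h_i$ the ranks reduce to ordinary percentiles, which is the unweighted case that existing quantile sketches already handle.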

### 3.4 Sparsity-aware Split Finding
### 3.4 稀疏感知分裂发现

![avatar](pic/19.png)

* In many real-world problems, it is quite common for the input x to be sparse. There are multiple possible causes for sparsity: 1) presence of missing values in the data; 2) frequent zero entries in the statistics; and, 3) artifacts of feature engineering such as one-hot encoding. It is important to make the algorithm aware of the sparsity pattern in the data. In order to do so, we propose to add a default direction in each tree node, which is shown in Fig. 4. When a value is missing in the sparse matrix x, the instance is classified into the default direction. There are two choices of default direction in each branch. The optimal default directions are learnt from the data. The algorithm is shown in Alg. 3. The key improvement is to only visit the non-missing entries Ik. The presented algorithm treats the non-presence as a missing value and learns the best direction to handle missing values. The same algorithm can also be applied when the non-presence corresponds to a user specified value by limiting the enumeration only to consistent solutions.

* 在许多实际问题中，输入x稀疏是很常见的。稀疏性有多种可能的原因：1）数据中缺少值；2）统计中经常出现零项；3）特征工程的结果，比如一个热编码。该算法对数据中稀疏模式的识别具有重要意义。为此，我们建议在每个树节点中添加一个默认方向，如图4所示。当稀疏矩阵x中缺少值时，实例将被分类为默认方向。每个分支中有两个默认方向的选择。从数据中学习最佳默认方向。算法如Alg.3所示。关键的改进是只访问未丢失的条目Ik。该算法将不存在视为缺失值，并学习处理缺失值的最佳方向。当不存在对应于用户指定的值时，也可以应用相同的算法，方法是将枚举限制为一致的解。
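
* The default-direction idea can be sketched for one feature as two scans over the non-missing entries only: an ascending scan that implicitly sends missing values to the right, and a descending scan that sends them to the left. This is a simplified reading of the approach, not the paper's Alg. 3 verbatim; `None` as the missing marker, the function name and the midpoint thresholds are assumptions.

```python
def sparse_split(column, g, h, lam=1.0, gamma=0.0):
    """Return (gain, threshold, default_direction) for one feature column.

    `column` uses None for missing entries. Only non-missing entries are
    visited; missing ones are accounted for through the totals G and H.
    """
    G, H = sum(g), sum(h)
    parent = G * G / (H + lam)
    present = sorted((i for i, v in enumerate(column) if v is not None),
                     key=lambda i: column[i])

    def gain_of(GL, HL, GR, HR):
        return 0.5 * (GL * GL / (HL + lam) + GR * GR / (HR + lam) - parent) - gamma

    best = (0.0, None, None)

    # Missing values go right: left statistics come from present entries only.
    GL = HL = 0.0
    for pos in range(len(present) - 1):
        i, nxt = present[pos], present[pos + 1]
        GL += g[i]
        HL += h[i]
        if column[i] == column[nxt]:
            continue
        gain = gain_of(GL, HL, G - GL, H - HL)
        if gain > best[0]:
            best = (gain, 0.5 * (column[i] + column[nxt]), "right")

    # Missing values go left: right statistics come from present entries only.
    GR = HR = 0.0
    for pos in range(len(present) - 1, 0, -1):
        i, prv = present[pos], present[pos - 1]
        GR += g[i]
        HR += h[i]
        if column[i] == column[prv]:
            continue
        gain = gain_of(G - GR, H - HR, GR, HR)
        if gain > best[0]:
            best = (gain, 0.5 * (column[i] + column[prv]), "left")

    return best


# Missing entries behave like the high-valued instances in this toy example.
column = [1.0, 2.0, None, None, 3.0, 4.0]
g = [0.0, 0.0, -10.0, -10.0, -10.0, -10.0]
h = [1.0] * 6
gain, threshold, default_direction = sparse_split(column, g, h)
```

* The work per feature is proportional to the number of non-missing entries, which is the source of the linear complexity claimed for sparse inputs.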

![avatar](pic/18.png)

Figure 6: Block structure for parallel learning. Each column in a block is sorted by the corresponding feature value. A linear scan over one column in the block is sufficient to enumerate all the split points.

图6：用于并行学习的块结构。块中的每一列都按相应的特征值排序。对块中的一列进行线性扫描就足以枚举所有拆分点。

* To the best of our knowledge, most existing tree learning algorithms are either only optimized for dense data, or need specific procedures to handle limited cases such as categorical encoding. XGBoost handles all sparsity patterns in a unified way. More importantly, our method exploits the sparsity to make computation complexity linear in the number of non-missing entries in the input. Fig. 5 shows the comparison of the sparsity-aware and a naive implementation on an Allstate-10K dataset (description of the dataset given in Sec. 6). We find that the sparsity-aware algorithm runs 50 times faster than the naive version. This confirms the importance of the sparsity-aware algorithm.

* 据我们所知，大多数现有的树学习算法要么只针对密集数据进行优化，要么需要特定的过程来处理有限的情况，例如分类编码。XGBoost以统一的方式处理所有稀疏模式。更重要的是，我们的方法利用了稀疏性，使得计算复杂度与输入中的非缺失条目数成线性关系。图5示出了Allstate-10K数据集上的稀疏感知和朴素实现的比较（数据集的描述以秒为单位）。6） 是的。我们发现稀疏感知算法的运行速度是原始算法的50倍。这证实了稀疏感知算法的重要性。

![avatar](pic/20.png)

Figure 5: Impact of the sparsity aware algorithm on Allstate-10K. The dataset is sparse mainly due to one-hot encoding. The sparsity aware algorithm is more than 50 times faster than the naive version that does not take sparsity into consideration.

稀疏感知算法对Allstate-10K的影响数据集的稀疏主要是由于一个热编码。稀疏感知算法比不考虑稀疏性的原始算法快50倍以上。

## 4. SYSTEM DESIGN
## 4 系统设计

### 4.1 Column Block for Parallel Learning
### 4.1 用于并行学习的列块

* The most time-consuming part of tree learning is to get the data into sorted order. In order to reduce the cost of sorting, we propose to store the data in in-memory units, which we call blocks. Data in each block is stored in the compressed column (CSC) format, with each column sorted by the corresponding feature value. This input data layout only needs to be computed once before training, and can be reused in later iterations.

* 树学习最耗时的部分是将数据排序。为了降低排序的成本，我们建议将数据存储在内存单元中，我们称之为块。每个块中的数据以压缩列（compressed column，CSC）格式存储，每列按相应的特征值排序。这个输入数据布局只需要在训练之前计算一次，并且可以在以后的迭代中重用。

* In the exact greedy algorithm, we store the entire dataset in a single block and run the split search algorithm by linearly scanning over the pre-sorted entries. We do the split finding of all leaves collectively, so one scan over the block will collect the statistics of the split candidates in all leaf branches. Fig. 6 shows how we transform a dataset into the format and find the optimal split using the block structure.

* 在精确贪婪算法中，我们将整个数据集存储在一个块中，并通过对预排序条目进行线性扫描来运行分割搜索算法。我们对所有的叶子进行集体的分割查找，因此在块上进行一次扫描将收集所有叶子分支中的分割候选的统计信息。图6显示了我们如何将数据集转换为该格式，并使用块结构找到最佳分割。

* The block structure also helps when using the approximate algorithms. Multiple blocks can be used in this case, with each block corresponding to a subset of rows in the dataset. Different blocks can be distributed across machines, or stored on disk in the out-of-core setting. Using the sorted structure, the quantile finding step becomes a linear scan over the sorted columns. This is especially valuable for local proposal algorithms, where candidates are generated frequently at each branch. The binary search in histogram aggregation also becomes a linear time merge style algorithm.

* 块结构在使用近似算法时也有帮助。在这种情况下，可以使用多个块，每个块对应于数据集中的行子集。不同的块可以分布在不同的机器上，也可以存储在磁盘上。使用排序结构，分位数查找步骤变成对排序列的线性扫描。这对于在每个分支处频繁生成候选的本地建议算法尤其有价值。直方图聚合中的二值搜索也成为一种线性时间归并型算法。

* Collecting statistics for each column can be parallelized, giving us a parallel algorithm for split finding. Importantly, the column block structure also supports column subsampling, as it is easy to select a subset of columns in a block.

* 收集每一列的统计信息可以并行化，为我们提供了一个并行的拆分查找算法。重要的是，列块结构还支持列子采样，因为很容易在块中选择列的子集。
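
* A minimal sketch of a single column block, assuming a CSC-like layout of three flat lists (`values`, `row_idx`, `col_ptr`) in which every column is sorted by feature value once, up front. The layout details, `None` as the missing marker and both function names are illustrative assumptions, not XGBoost's internal format.

```python
def build_csc_block(X):
    """Build the block once before training; each column is sorted by value."""
    values, row_idx, col_ptr = [], [], [0]
    for k in range(len(X[0])):
        entries = sorted((row[k], i) for i, row in enumerate(X) if row[k] is not None)
        for v, i in entries:
            values.append(v)
            row_idx.append(i)
        col_ptr.append(len(values))  # column k occupies col_ptr[k]:col_ptr[k+1]
    return values, row_idx, col_ptr


def column_prefix_stats(block, k, g, h):
    """One linear scan over column k, yielding (value, G_L, H_L) after each entry."""
    values, row_idx, col_ptr = block
    out = []
    G = H = 0.0
    for p in range(col_ptr[k], col_ptr[k + 1]):
        r = row_idx[p]  # gradient statistics are fetched by row index
        G += g[r]
        H += h[r]
        out.append((values[p], G, H))
    return out


X = [[3.0, None],
     [1.0, 5.0],
     [2.0, 4.0]]
block = build_csc_block(X)
g = [1.0, 2.0, 3.0]
h = [1.0, 1.0, 1.0]
stats_col0 = column_prefix_stats(block, 0, g, h)
```

* The block is reused across boosting iterations with fresh `g` and `h`, and each column can be scanned by a different thread or dropped entirely for column subsampling.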

![avatar](pic/21.png)

<center>
Figure 7: Impact of cache-aware prefetching in exact greedy algorithm. We find that the cache-miss effect impacts the performance on the large datasets (10 million instances). Using cache aware prefetching improves the performance by factor of two when the dataset is large.</center>

<center>图7：精确贪婪算法中缓存感知预取的影响。我们发现缓存未命中效应会影响大数据集（1000万个实例）的性能。当数据集很大时，使用缓存感知预取可将性能提高两倍。</center>

* Time Complexity Analysis. Let $d$ be the maximum depth of the tree and $K$ be the total number of trees. For the exact greedy algorithm, the time complexity of the original sparse-aware algorithm is $O(Kd\|x\|_0 \log n)$. Here we use $\|x\|_0$ to denote the number of non-missing entries in the training data. On the other hand, tree boosting on the block structure only costs $O(Kd\|x\|_0 + \|x\|_0 \log n)$. Here $O(\|x\|_0 \log n)$ is the one-time preprocessing cost that can be amortized. This analysis shows that the block structure helps to save an additional $\log n$ factor, which is significant when $n$ is large. For the approximate algorithm, the time complexity of the original algorithm with binary search is $O(Kd\|x\|_0 \log q)$. Here $q$ is the number of proposal candidates in the dataset. While $q$ is usually between 32 and 100, the log factor still introduces overhead. Using the block structure, we can reduce the time to $O(Kd\|x\|_0 + \|x\|_0 \log B)$, where $B$ is the maximum number of rows in each block. Again we can save the additional $\log q$ factor in computation.

* 时间复杂度分析设d为树的最大深度，K为树的总数。对于行为贪婪算法，原始空间感知算法的时间复杂度为O（Kd∥x∥0 log n）。在这里，我们使用∥x∥0来表示训练数据中未丢失的条目数。另一方面，块结构上的树提升仅花费O（Kd∥x∥0+∥x∥0 log n）。这里O（∥x∥0 log n）是可以摊销的一次性预处理成本。分析表明，块结构有助于节省额外的log n因子，这在n较大时是显著的。对于近似算法，采用二进制搜索的原始算法的时间复杂度为O（Kd∥x∥0 log q）。这里q是数据集中的候选方案数。虽然q通常在32到100之间，但日志系数仍然会带来开销。使用块结构，我们可以将时间减少到O（Kd∥x∥0+∥x∥0 log B），其中B是每个块中的最大行数。同样，我们可以在计算中节省额外的log q因子。

### 4.2 Cache-aware Access
### 4.2 缓存感知访问

![avatar](pic/22.png)

<center>Figure 8: Short range data dependency pattern that can cause stall due to cache miss.</center>

<center>图8：可能由于缓存未命中而导致暂停的短期数据依赖模式。</center>

* While the proposed block structure helps optimize the computation complexity of split finding, the new algorithm requires indirect fetches of gradient statistics by row index, since these values are accessed in order of feature. This is a non-continuous memory access. A naive implementation of split enumeration introduces an immediate read/write dependency between the accumulation and the non-continuous memory fetch operation (see Fig. 8). This slows down split finding when the gradient statistics do not fit into the CPU cache and cache misses occur.

* 虽然这种块结构有助于优化分割查找的计算复杂度，但新算法需要按行索引间接获取梯度统计信息，因为这些值是按特征顺序访问的。这是一个非连续内存访问。拆分枚举的一个简单实现引入了累积和非连续内存获取操作之间的直接读/写依赖性（参见图8）。当渐变统计数据不适合CPU缓存且发生缓存未命中时，这会减慢拆分查找速度。

* For the exact greedy algorithm, we can alleviate the problem by a cache-aware prefetching algorithm. Specifically, we allocate an internal buffer in each thread, fetch the gradient statistics into it, and then perform accumulation in a mini-batch manner. This prefetching changes the direct read/write dependency to a longer dependency and helps to reduce the runtime overhead when the number of rows is large. Figure 7 gives the comparison of the cache-aware vs. non-cache-aware algorithm on the Higgs and Allstate datasets. We find that the cache-aware implementation of the exact greedy algorithm runs twice as fast as the naive version when the dataset is large.

* 对于精确贪婪算法，我们可以通过一个缓存感知的预取算法来缓解这个问题。具体来说，我们在每个线程中分配一个内部缓冲区，将梯度统计信息提取到其中，然后以小批量的方式执行累积。此预取将直接读/写依赖项更改为更长的依赖项，并有助于在中的行数较大时减少运行时开销。图7给出了 Higgs 和 the All-state数据集上缓存感知和非缓存感知算法的比较。我们发现，当数据集较大时，精确贪婪算法的缓存感知实现的运行速度是原始算法的两倍。
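
* The two-phase access pattern can be illustrated as follows: gather the gradient statistics of a mini-batch of rows into a small contiguous buffer, then accumulate from that buffer. Pure Python will not show the cache benefit, so this sketch only demonstrates the control flow and that the result matches direct accumulation; the function name and the batch size are hypothetical.

```python
def accumulate_with_buffer(row_idx, g, h, batch=4):
    """Return running (G, H) sums along row_idx using gather-then-accumulate."""
    buf_g = [0.0] * batch  # per-thread buffer in a real implementation
    buf_h = [0.0] * batch
    prefix = []
    G = H = 0.0
    for start in range(0, len(row_idx), batch):
        chunk = row_idx[start:start + batch]
        # Phase 1: non-contiguous fetches by row index, with no dependency
        # on the running sums.
        for j, r in enumerate(chunk):
            buf_g[j] = g[r]
            buf_h[j] = h[r]
        # Phase 2: accumulation reads only the contiguous buffer.
        for j in range(len(chunk)):
            G += buf_g[j]
            H += buf_h[j]
            prefix.append((G, H))
    return prefix


row_idx = [2, 0, 3, 1, 4]  # rows in the sorted order of some feature
g = [1.0, 2.0, 3.0, 4.0, 5.0]
h = [1.0] * 5
prefix = accumulate_with_buffer(row_idx, g, h, batch=2)
```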

* For approximate algorithms, we solve the problem by choosing a correct block size. We define the block size to be the maximum number of examples contained in a block, as this reflects the cache storage cost of gradient statistics. Choosing an overly small block size results in a small workload for each thread and leads to inefficient parallelization. On the other hand, overly large blocks result in cache misses, as the gradient statistics do not fit into the CPU cache. A good choice of block size balances these two factors. We compared various choices of block size on two data sets. The results are given in Fig. 9. This result validates our discussion and shows that choosing $2^{16}$ examples per block balances the cache property and parallelization.

* 对于近似算法，我们通过选择正确的块大小来解决问题。我们将块大小定义为包含在块中的最大示例数，因为这反映了渐变统计的缓存存储成本。选择过小的块大小会导致每个线程的工作量较小，并导致效率低下的并行化。另一方面，过大的块会导致缓存未命中，因为渐变统计信息不适合CPU缓存。块大小的良好选择平衡了这两个因素。我们比较了两个数据集上块大小的各种选择。结果如图9所示。这一结果验证了我们的讨论，并表明每个块选择216个示例平衡了缓存属性和并行化。

![avatar](pic/32.png)

<center>图9：块的大小对算法的影响.我们发现过小的块会导致效率低下的并行化，而过大的块也会由于缓存未命中而减慢训练速度。</center>
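A back-of-the-envelope calculation makes the cache argument concrete. The sketch below assumes that each example carries one 32-bit gradient and one 32-bit hessian value (8 bytes in total); that storage width is an assumption for illustration, not something stated in the text, and the block sizes listed are illustrative.

```python
BYTES_PER_EXAMPLE = 2 * 4  # assumed: one float32 gradient + one float32 hessian


def gradient_stats_kib(block_size: int) -> float:
    """Memory needed to hold the gradient statistics of one block, in KiB."""
    return block_size * BYTES_PER_EXAMPLE / 1024


for exponent in (12, 16, 20, 24):  # illustrative block sizes
    size = 2 ** exponent
    print(f"2^{exponent} examples -> {gradient_stats_kib(size):.0f} KiB")
```

Under this assumption a block of 2^16 examples needs 512 KiB of gradient statistics, while a block of 2^24 examples needs 128 MiB, which is far more than typical CPU caches hold.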

## 4.3 Blocks for Out-of-core Computation
## 4.3 核外计算块

* One goal of our system is to fully utilize a machine’s resources to achieve scalable learning. Besides processors and memory, it is important to utilize disk space to handle data that does not fit into main memory. To enable out-of-core computation, we divide the data into multiple blocks and store each block on disk. During computation, it is important to use an independent thread to pre-fetch the block into a main memory buffer, so computation can happen in concurrence with disk reading. However, this does not entirely solve the problem since the disk reading takes most of the computation time. It is important to reduce the overhead and increase the throughput of disk IO. We mainly use two techniques to improve the out-of-core computation.

* 我们系统的一个目标是充分利用机器的资源来实现可扩展的学习。除了处理器和内存，利用磁盘空间处理不适合主内存的数据也很重要。为了实现核外计算，我们将数据分成多个块，并将每个块存储在磁盘上。在计算过程中，使用独立线程将块预取到主内存缓冲区非常重要，因此计算可以与磁盘读取同时进行。但是，这并不能完全解决问题，因为磁盘读取占用了大部分计算时间。降低磁盘IO的开销并提高其吞吐量是非常重要的。我们主要使用两种技术来改进核外计算。

* Block Compression: The first technique we use is block compression. The block is compressed by columns and decompressed on the fly by an independent thread when loading into main memory. This trades some computation in decompression for a reduction in disk reading cost. We use a general-purpose compression algorithm for compressing the feature values. For the row index, we subtract the beginning index of the block from the row index and use a 16-bit integer to store each offset. This requires 2^16 examples per block, which is confirmed to be a good setting. In most of the datasets we tested, we achieve roughly a 26% to 29% compression ratio.

* 块压缩：我们使用的第一种技术是块压缩。块按列压缩，加载到主内存时由独立线程动态解压缩。这有助于用解压中的一些计算来换取磁盘读取成本。我们使用通用的压缩算法来压缩特征值。对于行索引，我们用行索引减去块的起始索引，并使用16位整数存储每个偏移量。这要求每个块有2^16个样本，这被证实是一个良好的设置。在我们测试的大多数数据集中，我们获得了大约26%到29%的压缩比。
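The row-index encoding described above can be sketched with Python's `struct` module. The function names and the little-endian layout are assumptions made for this example; the text only specifies that offsets from the block's first row are stored as 16-bit integers, which is why a block can hold at most 2^16 rows.

```python
import struct
from typing import List

BLOCK_SIZE = 1 << 16  # 2^16 rows, so every in-block offset fits in an unsigned 16-bit integer


def encode_offsets(row_indices: List[int], block_start: int) -> bytes:
    """Pack row indices as little-endian uint16 offsets from the block's first row."""
    offsets = [r - block_start for r in row_indices]
    if any(not 0 <= o < BLOCK_SIZE for o in offsets):
        raise ValueError("row index falls outside this block")
    return struct.pack(f"<{len(offsets)}H", *offsets)


def decode_offsets(payload: bytes, block_start: int) -> List[int]:
    """Invert encode_offsets, restoring absolute row indices."""
    count = len(payload) // 2
    return [block_start + o for o in struct.unpack(f"<{count}H", payload)]


block_start = 3 * BLOCK_SIZE  # hypothetical fourth block
rows = [block_start + 5, block_start + 65535, block_start]
packed = encode_offsets(rows, block_start)
assert decode_offsets(packed, block_start) == rows
assert len(packed) == 2 * len(rows)  # 2 bytes per stored index
```

Each stored index takes 2 bytes regardless of how large the absolute row index is.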

* Block Sharding: The second technique is to shard the data onto multiple disks in an alternating manner. A pre-fetcher thread is assigned to each disk and fetches the data into an in-memory buffer. The training thread then alternately reads the data from each buffer. This helps to increase the throughput of disk reading when multiple disks are available.

* 块分片：第二种技术是以交替的方式将数据分片到多个磁盘上。为每个磁盘分配一个预取线程，并将数据提取到内存缓冲区中。然后，训练线程交替地从每个缓冲区读取数据。当多个磁盘可用时，这有助于提高磁盘读取的吞吐量。
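The sharding scheme can be sketched with one thread and one bounded queue per disk. Everything here is a simplified stand-in: the "disks" are in-memory lists, and the names `shard_round_robin`, `prefetch`, and `read_alternately` are hypothetical. The sketch only shows how round-robin placement combined with alternating reads restores the original block order.

```python
import queue
import threading
from typing import Iterator, List

_DONE = object()  # sentinel marking the end of one shard


def shard_round_robin(blocks: List[str], num_disks: int) -> List[List[str]]:
    """Assign block i to disk i % num_disks."""
    return [blocks[d::num_disks] for d in range(num_disks)]


def prefetch(shard: List[str], buffer: "queue.Queue") -> None:
    """Stand-in for a per-disk reader thread: push each block into the buffer."""
    for block in shard:
        buffer.put(block)  # a real reader would load the block from disk here
    buffer.put(_DONE)


def read_alternately(shards: List[List[str]], buffer_size: int = 2) -> Iterator[str]:
    """Start one prefetcher per shard, then consume the buffers in turn."""
    buffers = [queue.Queue(maxsize=buffer_size) for _ in shards]
    for shard, buf in zip(shards, buffers):
        threading.Thread(target=prefetch, args=(shard, buf), daemon=True).start()
    active = list(buffers)
    while active:
        for buf in list(active):
            item = buf.get()  # waits until that disk's prefetcher delivers
            if item is _DONE:
                active.remove(buf)
            else:
                yield item


blocks = [f"block{i}" for i in range(5)]
order = list(read_alternately(shard_round_robin(blocks, num_disks=2)))
assert order == blocks
```

The bounded queue keeps each prefetcher at most `buffer_size` blocks ahead of the consumer, so memory use stays fixed while both readers work in parallel.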

![avatar](pic/23.png)

## 5. RELATED WORKS
## 5 相关工作

* Our system implements gradient boosting [10], which performs additive optimization in functional space. Gradient tree boosting has been successfully used in classification [12], learning to rank [5], structured prediction [8] as well as other fields. XGBoost incorporates a regularized model to prevent overfitting. This resembles previous work on regularized greedy forest [25], but simplifies the objective and algorithm for parallelization. Column sampling is a simple but effective technique borrowed from RandomForest [4]. While sparsity-aware learning is essential in other types of models such as linear models [9], few works on tree learning have considered this topic in a principled way. The algorithm proposed in this paper is the first unified approach to handle all kinds of sparsity patterns.

* 我们的系统实现了梯度提升[10]，它在函数空间中执行加法优化。梯度提升树已成功地应用于分类[12]、学习排序[5]、结构预测[8]等领域。XGBoost采用了一个正则化模型来防止过度拟合。这类似于之前关于正则贪婪森林的工作[25]，但简化了并行化的目标和算法。列抽样是一种简单而有效的技术，它借鉴了RandomForest[4]。虽然稀疏感知学习在其他类型的模型（如线性模型）中是必不可少的[9]，但是很少有关于树学习的工作以一种原则性的方式考虑这个问题。本文提出的算法是第一种处理各种稀疏模式的统一方法。

* There are several existing works on parallelizing tree learning [22, 19]. Most of these algorithms fall into the approximate framework described in this paper. Notably, it is also possible to partition data by columns [23] and apply the exact greedy algorithm. This is also supported in our framework, and techniques such as cache-aware prefetching can be used to benefit this type of algorithm. While most existing works focus on the algorithmic aspect of parallelization, our work improves in two unexplored system directions: out-of-core computation and cache-aware learning. This gives us insights on how the system and the algorithm can be jointly optimized and provides an end-to-end system that can handle large-scale problems with very limited computing resources. We also summarize the comparison between our system and existing open-source implementations in Table 1.

* 已有一些关于并行化树学习的工作[22，19]。这些算法大多属于本文描述的近似框架。值得注意的是，还可以按列[23]划分数据，并应用精确的贪婪算法。我们的框架也支持这一点，并且可以使用诸如缓存感知预取之类的技术来改进这类算法。虽然大多数现有的工作集中在并行化的算法方面，但我们的工作在两个未探索的系统方向上有所改进：核外计算和缓存感知学习。这让我们深入了解了如何对系统和算法进行联合优化，并提供了一个端到端的系统，可以在非常有限的计算资源下处理大规模问题。我们还在表1中总结了我们的系统与现有开源实现之间的比较。

* Quantile summary (without weights) is a classical problem in the database community [14, 24]. However, the approximate tree boosting algorithm reveals a more general problem – finding quantiles on weighted data. To the best of our knowledge, the weighted quantile sketch proposed in this paper is the first method to solve this problem. The weighted quantile summary is also not specific to the tree learning and can benefit other applications in data science and machine learning in the future.

* 分位数汇总（不带权重）是数据库社区中的一个经典问题[14，24]。然而，近似提升树算法揭示了一个更普遍的问题-在加权数据上寻找分位数。据我们所知，本文提出的加权分位数草图是解决这一问题的第一种方法。加权分位数摘要也不是树型学习的特例，有利于将来在数据科学和机器学习中的其他应用。

## 6. END TO END EVALUATIONS
## 6 端到端评估

### 6.1 System Implementation
### 6.1 系统实现

* We implemented XGBoost as an open source package. The package is portable and reusable. It supports various weighted classification and rank objective functions, as well as user-defined objective functions. It is available in popular languages such as Python, R, and Julia, and integrates naturally with language-native data science pipelines such as scikit-learn. The distributed version is built on top of the rabit library for allreduce. The portability of XGBoost makes it available in many ecosystems, instead of only being tied to a specific platform. The distributed XGBoost runs natively on Hadoop, MPI, and Sun Grid Engine. Recently, we also enabled distributed XGBoost on JVM big data stacks such as Flink and Spark. The distributed version has also been integrated into Tianchi, the cloud platform of Alibaba. We believe that there will be more integrations in the future.

* 我们将XGBoost实现为一个开源包。该软件包是可移植和可重用的。支持多种加权分类和排序目标函数，支持自定义目标函数。它以流行语言（如Python、R、Julia）提供，并与语言原生数据科学pipelines（如scikit-learn）自然集成。分布式版本构建在用于allreduce的rabit库之上。XGBoost的可移植性使它在许多生态系统中都可用，而不仅仅是绑定到特定的平台上。分布式XGBoost在Hadoop、MPI和Sun Grid Engine上原生运行。最近，我们还在Flink和Spark等JVM大数据栈上启用了分布式XGBoost。分布式版本也已经集成到阿里巴巴的云平台天池中。我们相信未来会有更多的整合。
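To illustrate what a user-defined objective has to supply, the sketch below computes the per-example first and second derivatives of the logistic loss with respect to the raw margin score. It is a self-contained illustration and does not call the XGBoost API; the function name and argument layout are assumptions, and the way such a function is registered with the library should be checked against the package documentation.

```python
import math
from typing import List, Sequence, Tuple


def logistic_grad_hess(margins: Sequence[float],
                       labels: Sequence[int]) -> Tuple[List[float], List[float]]:
    """First and second derivatives of log loss with respect to the raw margin."""
    grad, hess = [], []
    for m, y in zip(margins, labels):
        p = 1.0 / (1.0 + math.exp(-m))  # predicted probability
        grad.append(p - y)              # d loss / d margin
        hess.append(p * (1.0 - p))      # d^2 loss / d margin^2
    return grad, hess


g, h = logistic_grad_hess([0.0, 2.0], [1, 0])
assert g[0] == -0.5 and h[0] == 0.25
```

Because the boosting step only consumes these two statistics, any twice-differentiable loss can be plugged in the same way.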

### 6.2 Dataset and Setup
### 6.2 数据集和设置

![avatar](pic/24.png)

* We used four datasets in our experiments. A summary of these datasets is given in Table 2. In some of the experiments, we use a randomly selected subset of the data, either because the baselines are slow or to demonstrate the performance of the algorithm with varying dataset size. We use a suffix to denote the size in these cases. For example, Allstate-10K means a subset of the Allstate dataset with 10K instances.

* 我们在实验中使用了四个数据集。这些数据集的摘要见表2。在一些实验中，我们使用随机选择的数据子集，要么是由于基线速度慢，要么是为了演示该算法在不同数据集大小下的性能。在这些情况下，我们用后缀来表示大小。例如，Allstate-10K表示具有10K个实例的Allstate数据集的子集。

* The first dataset we use is the Allstate insurance claim dataset. The task is to predict the likelihood and cost of an insurance claim given different risk factors. In the experiment, we simplified the task to only predict the likelihood of an insurance claim. This dataset is used to evaluate the impact of the sparsity-aware algorithm in Sec. 3.4. Most of the sparse features in this data come from one-hot encoding. We randomly select 10M instances as the training set and use the rest as the evaluation set.

* 我们使用的第一个数据集是Allstate保险索赔数据集。任务是在给定不同风险因素的情况下，预测保险索赔的可能性和成本。在实验中，我们将任务简化为只预测保险索赔的可能性。该数据集用于评估Sec.3.4中稀疏感知算法的影响。此数据中的大多数稀疏特征来自独热编码。我们随机选取1000万个实例作为训练集，其余作为评价集。

* The second dataset is the Higgs boson dataset from high energy physics. The data was produced using Monte Carlo simulations of physics events. It contains 21 kinematic properties measured by the particle detectors in the accelerator. It also contains seven additional derived physics quantities of the particles. The task is to classify whether an event corresponds to the Higgs boson. We randomly select 10M instances as training set and use the rest as evaluation set.

* 第二个数据集是来自高能物理的希格斯玻色子（Higgs boson）数据集。数据是用蒙特卡罗模拟物理事件产生的。它包含由加速器中的粒子探测器测量的21个运动学特性。它还包含了另外七个导出的粒子物理量。任务是对一个事件是否对应于希格斯玻色子进行分类。我们随机选取1000万个实例作为训练集，其余作为评价集。

* The third dataset is the Yahoo! learning to rank challenge dataset [6], which is one of the most commonly used benchmarks for learning to rank algorithms. The dataset contains 20K web search queries, with each query corresponding to a list of around 22 documents. The task is to rank the documents according to their relevance to the query. We use the official train/test split in our experiment.

* 第三个数据集是Yahoo！学习排名挑战数据集[6]，这是学习排名算法中最常用的基准之一。该数据集包含20K个web搜索查询，每个查询对应于大约22个文档的列表。任务是根据查询的相关性对文档进行排序。我们在实验中使用了正式的训练测试集拆分。

* The last dataset is the Criteo terabyte click log dataset. We use this dataset to evaluate the scaling property of the system in the out-of-core and the distributed settings. The data contains 13 integer features and 26 ID features of user, item and advertiser information. Since a tree-based model is better at handling continuous features, we preprocess the data by calculating the statistics of average CTR and count of ID features on the first ten days, replacing the ID features by the corresponding count statistics during the next ten days for training. The training set after preprocessing contains 1.7 billion instances with 67 features (13 integer, 26 average CTR statistics and 26 counts). The entire dataset is more than one terabyte in LibSVM format.

* 最后一个数据集是Criteo TB级点击日志数据集。我们使用这个数据集来评估系统在核外和分布式设置中的扩展特性。数据包含13个整数特征和26个用户、商品和广告商信息的ID特征。由于基于树的模型更适合处理连续特征，所以我们通过计算前10天的平均CTR和ID特征计数来对数据进行预处理，在接下来的10天内用相应的计数统计来代替ID特征进行训练。预处理后的训练集包含17亿个实例，有67个特征（13个整数，26个平均CTR统计和26个计数）。整个数据集的LibSVM格式超过1 TB。
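The preprocessing step above, which replaces each ID feature with count and average-CTR statistics computed on an earlier period, can be sketched for a single ID column as follows. The row layout, the function names, and the default of `(0, 0.0)` for IDs never seen in the statistics period are assumptions made for this example; the text does not say how unseen IDs were handled.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, int]  # (id_value, clicked) for one ID feature; hypothetical layout


def fit_id_stats(rows: Iterable[Row]) -> Dict[str, Tuple[int, float]]:
    """Per ID value: (count, average CTR) over the statistics period."""
    counts: Dict[str, int] = defaultdict(int)
    clicks: Dict[str, int] = defaultdict(int)
    for id_value, clicked in rows:
        counts[id_value] += 1
        clicks[id_value] += clicked
    return {k: (counts[k], clicks[k] / counts[k]) for k in counts}


def encode_ids(ids: Iterable[str],
               stats: Dict[str, Tuple[int, float]]) -> List[Tuple[int, float]]:
    """Replace each ID with its (count, average CTR); unseen IDs get (0, 0.0) here."""
    return [stats.get(i, (0, 0.0)) for i in ids]


first_period = [("ad_a", 1), ("ad_a", 0), ("ad_b", 0), ("ad_a", 1)]
stats = fit_id_stats(first_period)
features = encode_ids(["ad_a", "ad_b", "ad_c"], stats)  # rows from the training period
```

Computing the statistics on one period and applying them to a later one keeps the click labels of the training rows out of their own features.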

* We use the first three datasets for the single machine parallel setting, and the last dataset for the distributed and out-of-core settings. All the single machine experiments are conducted on a Dell PowerEdge R420 with two eight-core Intel Xeon E5-2470 (2.3GHz) processors and 64GB of memory. If not specified, all the experiments are run using all the available cores in the machine. The machine settings of the distributed and the out-of-core experiments will be described in the corresponding sections. In all the experiments, we boost trees with a common setting of maximum depth equal to 8, shrinkage equal to 0.1 and no column subsampling unless explicitly specified. We find similar results when we use other settings of maximum depth.

* 前三个数据集用于单机并行设置，最后一个数据集用于分布式和核外设置。所有的单机实验都是在一台Dell PowerEdge R420上进行的，它有两个8核Intel Xeon（E5-2470）（2.3GHz）和64GB内存。如果未指定，则使用机器中的所有可用内核运行所有实验。分布式和核外实验的机器设置将在相应章节中描述。在所有的实验中，我们将树的最大深度设置为8，收缩率设置为0.1，除非明确指定，否则不进行列子采样。当我们使用其他最大深度设置时，我们可以找到类似的结果。
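For readers who want to reproduce the common setting, it could be written as a parameter dictionary along these lines. The keys follow the parameter names of the XGBoost Python package, and the values come from the paragraph above; this is a sketch and does not include the other options the experiments may have used.

```python
# Common tree-boosting setting described above, as a parameter dictionary.
params = {
    "max_depth": 8,           # maximum tree depth
    "eta": 0.1,               # shrinkage (learning rate)
    "colsample_bytree": 1.0,  # 1.0 means no column subsampling
}
```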

![avatar](pic/25.png)

### 6.3 Classification
### 6.3 分类

* In this section, we evaluate the performance of XGBoost on a single machine using the exact greedy algorithm on Higgs-1M data, by comparing it against two other commonly used exact greedy tree boosting implementations. Since scikit-learn only handles non-sparse input, we choose the dense Higgs dataset for a fair comparison. We use the 1M subset to make scikit-learn finish running in reasonable time. Among the methods in comparison, R’s GBM uses a greedy approach that only expands one branch of a tree, which makes it faster but can result in lower accuracy, while both scikit-learn and XGBoost learn a full tree. The results are shown in Table 3. Both XGBoost and scikit-learn give better performance than R’s GBM, while XGBoost runs more than 10x faster than scikit-learn. In this experiment, we also find that column subsampling gives slightly worse performance than using all the features. This could be due to the fact that there are few important features in this dataset, so we benefit from greedily selecting from all the features.

* 在这一节中，我们使用精确贪婪算法对Higgs-1M数据在一台机器上评估XGBoost的性能，并将其与其他两个常用的精确贪婪树提升实现进行比较。由于scikit-learn只处理非稀疏输入，因此我们选择稠密的Higgs数据集进行公平比较。我们使用1M子集使scikit-learn在合理的时间内完成运行。在比较的方法中，R的GBM使用贪婪的方法，只扩展树的一个分支，这使得它更快，但会导致更低的精度，而scikit-learn和XGBoost都学习完整的树。结果见表3。XGBoost和scikit-learn的性能都比R的GBM好，而XGBoost的运行速度比scikit-learn快10倍以上。在这个实验中，我们还发现列子采样的性能比使用所有特征稍差。这可能是因为这个数据集中只有少数重要的特征，我们可以从在所有特征中贪婪地选择中受益。

### 6.4 Learning to Rank

### 6.4 学习排名

* We next evaluate the performance of XGBoost on the learning to rank problem. We compare against pGBRT [22], the best previously published system on this task. XGBoost runs the exact greedy algorithm, while pGBRT only supports an approximate algorithm. The results are shown in Table 4 and Fig. 10. We find that XGBoost runs faster. Interestingly, subsampling columns not only reduces running time, but also gives slightly higher performance for this problem. This could be due to the fact that the subsampling helps prevent overfitting, which is observed by many of the users.

* 接下来我们将评估XGBoost在学习排序问题上的性能。我们将其与pGBRT[22]进行比较，pGBRT[22]是这项任务之前发布的最好的系统。XGBoost运行精确贪婪算法，而pGBRT只支持近似算法。结果如表4和图10所示。我们发现XGBoost跑得更快。有趣的是，列下采样不仅减少了运行时间，而且为这个问题提供了更高的性能。这可能是因为列采样有助于防止过拟合，这是许多用户观察到的。

![avatar](pic/26.png)

<center>Figure 11: Comparison of out-of-core methods on different subsets of Criteo data. The missing data points are due to running out of disk space. We can find that the basic algorithm can only handle 200M examples. Adding compression gives a 3x speedup, and sharding into two disks gives another 2x speedup. The system runs out of file cache starting from 400M examples. The algorithm really has to rely on disk after this point. The compression+shard method has a less dramatic slowdown when running out of file cache, and exhibits a linear trend afterwards.</center>

<center>图11：Criteo数据的不同子集上的核外方法的比较。缺失的数据点是由于磁盘空间不足。我们发现基本算法只能处理2亿个样本。增加压缩可以带来3倍的加速，而分片到两个磁盘上又可以带来2倍的加速。从4亿个样本开始，系统的文件缓存就用完了。在这一点之后，算法确实必须依赖磁盘。当文件缓存用完时，compression+shard方法的速度减慢得不太明显，并且随后呈现线性趋势。</center>

### 6.5 Out-of-core Experiment
### 6.5 堆外实验

* We also evaluate our system in the out-of-core setting on the Criteo data. We conducted the experiment on one AWS c3.8xlarge machine (32 vcores, two 320 GB SSD, 60 GB RAM). The results are shown in Figure 11. We can find that compression helps to speed up computation by a factor of three, and sharding into two disks gives a further 2x speedup. For this type of experiment, it is important to use a very large dataset to drain the system file cache for a real out-of-core setting. This is indeed our setup. We can observe a transition point when the system runs out of file cache. Note that the transition in the final method is less dramatic. This is due to larger disk throughput and better utilization of computation resources. Our final method is able to process 1.7 billion examples on a single machine.

* 我们还在核外设置下用Criteo数据评估了我们的系统。我们在一台AWS c3.8xlarge机器（32个vcores，两个320 GB SSD，60 GB RAM）上进行了实验。结果如图11所示。我们可以发现，压缩有助于将计算速度提高到三倍，而分片到两个磁盘上又可以进一步带来2倍的加速。对于这种类型的实验，使用非常大的数据集来耗尽系统文件缓存以实现真正的核外设置是很重要的。这确实是我们的设置。当系统用完文件缓存时，我们可以观察到一个转换点。请注意，最终方法中的转换不太剧烈。这是由于更大的磁盘吞吐量和更好的计算资源利用率。我们的最终方法能够在一台机器上处理17亿个样本。

### 6.6 Distributed Experiment
### 6.6 分布式实验

* Finally, we evaluate the system in the distributed setting. We set up a YARN cluster on EC2 with m3.2xlarge machines, which is a very common choice for clusters. Each machine contains 8 virtual cores, 30GB of RAM and two 80GB SSD local disks. The dataset is stored on AWS S3 instead of HDFS to avoid purchasing persistent storage.

* 最后，我们在分布式环境下对系统进行了评估。我们在EC2上用m3.2xlarge机器建立了一个YARN集群，这是集群的一个非常常见的选择。每台机器包含8个虚拟内核、30GB RAM和两个80GB SSD本地磁盘。数据集存储在AWS S3上，而不是HDFS上，以避免购买持久性存储。

![avatar](pic/27.png)

<center>Figure 12: Comparison of different distributed systems on 32 EC2 nodes for 10 iterations on different subsets of Criteo data. XGBoost runs more than 10x faster than Spark per iteration and 2.2x as fast as H2O’s optimized version (however, H2O is slow in loading the data, which gives it a worse end-to-end time). Note that Spark suffers from a drastic slowdown when running out of memory. XGBoost runs faster and scales smoothly to the full 1.7 billion examples with the given resources by utilizing out-of-core computation.</center>

<center>图12：32个EC2节点上不同分布式系统在不同Criteo数据子集上运行10次迭代的比较。XGBoost每次迭代的运行速度比Spark快10倍以上，是H2O优化版本的2.2倍（然而，H2O加载数据的速度很慢，导致端到端时间更差）。请注意，当内存耗尽时，Spark会急剧减速。XGBoost通过利用核外计算，在给定资源的情况下运行更快，并平滑地扩展到全部17亿个样本。</center>

* We first compare our system against two production-level distributed systems: Spark MLLib [18] and H2O. We use 32 m3.2xlarge machines and test the performance of the systems with various input sizes. Both of the baseline systems are in-memory analytics frameworks that need to store the data in RAM, while XGBoost can switch to the out-of-core setting when it runs out of memory. The results are shown in Fig. 12. We can find that XGBoost runs faster than the baseline systems. More importantly, it is able to take advantage of out-of-core computing and smoothly scale to all 1.7 billion examples with the given limited computing resources. The baseline systems are only able to handle a subset of the data with the given resources. This experiment shows the advantage of bringing all the system improvements together to solve a real-world-scale problem. We also evaluate the scaling property of XGBoost by varying the number of machines. The results are shown in Fig. 13. We can find that XGBoost’s performance scales linearly as we add more machines. Importantly, XGBoost is able to handle the entire 1.7 billion examples with only four machines. This shows the system’s potential to handle even larger data.

* 我们首先将我们的系统与两个生产级分布式系统进行比较：Spark MLLib[18]和H2O。我们使用32台m3.2xlarge机器，测试各系统在不同输入规模下的性能。这两个基线系统都是内存分析框架，需要将数据存储在RAM中，而XGBoost在内存不足时可以切换到核外设置。结果如图12所示。我们可以发现XGBoost比基线系统运行得更快。更重要的是，它能够利用核外计算的优势，在给定的有限计算资源下，平滑地扩展到全部17亿个样本。基线系统在给定资源下只能处理数据的一个子集。这个实验显示了将所有的系统改进结合起来并解决实际规模问题的优势。我们还通过改变机器数量来评估XGBoost的扩展特性。结果如图13所示。当我们增加更多的机器时，我们可以发现XGBoost的性能呈线性扩展。重要的是，XGBoost仅用四台机器就可以处理全部17亿条数据。这显示了系统处理更大数据的潜力。

![avatar](pic/28.png)

<center>Figure 13: Scaling of XGBoost with different numbers of machines on the full Criteo dataset of 1.7 billion examples. Using more machines results in more file cache and makes the system run faster, causing the trend to be slightly super-linear. XGBoost can process the entire dataset using as few as four machines, and scales smoothly by utilizing more available resources.</center>

<center>图13：在包含17亿个样本的完整Criteo数据集上，XGBoost随机器数量变化的扩展情况。使用更多的机器会产生更多的文件缓存，并使系统运行得更快，从而使趋势稍微呈超线性。XGBoost只需使用四台机器就可以处理整个数据集，并通过利用更多可用资源平滑地扩展。</center>

## 7. CONCLUSION
## 7. 结论

* In this paper, we described the lessons we learnt when building XGBoost, a scalable tree boosting system that is widely used by data scientists and provides state-of-the-art results on many problems. We proposed a novel sparsity-aware algorithm for handling sparse data and a theoretically justified weighted quantile sketch for approximate learning. Our experience shows that cache access patterns, data compression and sharding are essential elements for building a scalable end-to-end system for tree boosting. These lessons can be applied to other machine learning systems as well. By combining these insights, XGBoost is able to solve real-world-scale problems using a minimal amount of resources.

* 本文介绍了我们在构建XGBoost时所吸取的经验教训，XGBoost是一个可扩展的提升树系统，被数据科学家广泛使用，并在许多问题上提供了最新的结果。我们提出了一种新的稀疏性感知算法来处理稀疏数据，并提出了一种理论上合理的加权分位数草图来进行近似学习。我们的经验表明，缓存访问模式、数据压缩和分片是构建用于树提升的可伸缩端到端系统的基本要素。这些经验教训也可以应用于其他机器学习系统。通过结合这些见解，XGBoost能够使用最少的资源来解决现实世界规模的问题。

## 致谢

* We would like to thank Tyler B. Johnson, Marco Tulio Ribeiro, Sameer Singh, Arvind Krishnamurthy for their valuable feedback. We also sincerely thank Tong He, Bing Xu, Michael Benesty, Yuan Tang, Hongliang Liu, Qiang Kou, Nan Zhu and all other contributors in the XGBoost community. This work was supported in part by ONR (PECASE) N000141010672, NSF IIS 1258741 and the TerraSwarm Research Center sponsored by MARCO and DARPA.

* 我们要感谢Tyler B. Johnson、Marco Tulio Ribeiro、Sameer Singh、Arvind Krishnamurthy提供了宝贵的反馈意见。我们也衷心感谢Tong He、Bing Xu、Michael Benesty、Yuan Tang、Hongliang Liu、Qiang Kou、Nan Zhu和XGBoost社区的所有其他贡献者。这项工作在一定程度上得到了ONR（PECASE）N000141010672、NSF IIS 1258741和由MARCO和DARPA赞助的TerraSwarm研究中心的支持。

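# Key Takeaways

* 1. CART trees with regularization: a penalty on T (the number of leaves) and an L2 penalty on w (the leaf weights).
* 2. The loss function is expanded with a second-order Taylor approximation so that the objective can be optimized quickly.
* 3. Shrinkage: after each boosting iteration, the leaf weights are multiplied by a coefficient. This reduces the influence of each individual tree on the overall model and leaves more room for later trees to learn. Put simply, the model does not fully trust any single tree; it only takes a rough direction from each one.
* 4. Column subsampling prevents overfitting more effectively than traditional row subsampling (which is also supported). It also speeds up the parallel algorithm described later.
* 5. Basic exact greedy algorithm: pre-sorting.
* 6. Approximate algorithm: used when the data does not fit in memory, or in the distributed setting, where the exact greedy algorithm cannot be used.
* 7. Distributed weighted quantile sketch algorithm.
* 8. Sparsity-aware algorithm (missing values, one-hot encoding, etc.).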
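For reference, the first three points of the summary can be written compactly as formulas. The notation below follows the XGBoost paper as far as it goes; the shorthand $G_j$ and $H_j$ for the per-leaf sums is introduced here for brevity, and earlier sections of these notes may present the same expressions in a slightly different form.

$$
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^2
$$

$$
\tilde{\mathcal{L}}^{(t)} = \sum_{i=1}^{n}\left[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)
$$

$$
w_j^{*} = -\frac{G_j}{H_j + \lambda}, \qquad G_j = \sum_{i \in I_j} g_i, \quad H_j = \sum_{i \in I_j} h_i
$$

$$
\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + \eta\, f_t(x_i)
$$

Here $g_i$ and $h_i$ are the first and second derivatives of the loss at the previous prediction, $I_j$ is the set of examples falling into leaf $j$, $T$ is the number of leaves, and $\eta$ is the shrinkage coefficient.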