## 简介

基于树的学习算法被认为是最优秀学习方法之一，它主要用于监督学习。树方法帮助预测模型拥有较高的精度，稳定性、易于解释。与线性模型不同，树模型映射非线性关系相当好。适用于解决手头上任何类型的问题（``分类或回归``）。

决策树、随机森林、gradient boosting等方法被广泛用于各种数据科学问题中。因此，对于每个分析师（包括新人）来说，学习这些算法并用它们建模很重要。

本教程旨在帮助初学者从零开始学习树建模。成功学完本教程后，有望成为一个精通使用树算法、建立预测模型的人。

## Introduction

Tree based learning algorithms are considered among the best and most widely used supervised learning methods. Tree based methods give predictive models high accuracy, stability and ease of interpretation. Unlike linear models, they map non-linear relationships quite well. They can be adapted to either kind of problem at hand (classification or regression).

Methods like decision trees, random forest and gradient boosting are widely used in all kinds of data science problems. Hence, for every analyst (freshers included), it’s important to learn these algorithms and use them for modeling.

This tutorial is meant to help beginners learn tree based modeling from scratch. After successfully completing this tutorial, one is expected to become proficient at using tree based algorithms and building predictive models.

## 目录

- ①、什么是决策树？它是如何工作的？
- ②、回归树和分类树
- ③、一棵树是如何决定在哪里分裂？
- ④、树建模的关键参数是什么？我们如何避免决策树的过拟合？
- ⑤、基于树的模型优于线性模型吗？
- ⑥、Python中使用决策树
- ⑦、树建模的集成方法是什么？
- ⑧、什么是Bagging？它是如何工作的？
- ⑨、什么是随机森林？它是如何工作的？
- ⑩、什么是Boosting？它是如何工作的？
- 11、GBM和Xgboost哪个更强大？
- 12、Python中使用GBM
- 13、Python中使用XGBoost
- 14、在哪里实践？

## Table of Contents
- ①、What is a Decision Tree? How does it work?
- ②、Regression Trees vs Classification Trees
- ③、How does a tree decide where to split?
- ④、What are the key parameters of model building and how can we avoid over-fitting in decision trees?
- ⑤、Are tree based models better than linear models?
- ⑥、Working with Decision Trees in R and Python
- ⑦、What are the ensemble methods of tree based models?
- ⑧、What is Bagging? How does it work?
- ⑨、What is Random Forest? How does it work?
- ⑩、What is Boosting? How does it work?
- 11、Which is more powerful: GBM or Xgboost?
- 12、Working with GBM in R and Python
- 13、Working with Xgboost in R and Python
- 14、Where to Practice?

## 1、什么是决策树？它是如何工作的？

<img align="right" src="figures/tree1.png">

决策树是一种监督学习算法（有一个预定义的目标变量），主要用于分类问题。它适用于离散的和连续的输入和输出变量。该技术中，我们以输入变量中最显著的分割变量/区分变量为依据，将总体或样本分成两个或两个以上的同质组（或者子集）。

**例如：**
假设我们有一个样本，30名学生包含三个变量：性别（男/女），班级（IX/X）和身高（5到6英尺）。这30名学生中有15名空闲时打板球。现在，我想建立一个模型来预测谁会在空闲时打板球。在这个问题上，我们需要根据三个输入变量中最显著的那个变量来分开空闲时打板球的学生。

这正是决策树可以帮助的，它将依据三个变量并识别创造最同质的学生组（互为异构）的变量，来分开学生。在下面的图片中，可以看到与其它两个变量相比，性别能够识别最佳同质组。

## What is a Decision Tree? How does it work?
A decision tree is a type of supervised learning algorithm (one with a pre-defined target variable) that is mostly used in classification problems. It works for both categorical and continuous input and output variables. In this technique, we split the population or sample into two or more homogeneous sets (or sub-populations) based on the most significant splitter / differentiator among the input variables.

**Example:**
Let’s say we have a sample of 30 students with three variables: Gender (Boy/Girl), Class (IX/X) and Height (5 to 6 ft). 15 of these 30 students play cricket in their leisure time. Now, I want to create a model to predict who will play cricket during leisure time. In this problem, we need to segregate the students who play cricket in their leisure time based on the most significant input variable among the three.

This is where a decision tree helps. It segregates the students based on all values of the three variables and identifies the variable that creates the most homogeneous sets of students (sets which are heterogeneous to each other). In the snapshot below, you can see that the variable Gender identifies the most homogeneous sets compared to the other two variables.

<img align="right" src="figures/tree2.png">

如上所述，决策树识别了最重要的变量并给出最佳的同质组人口值。现在问题是，它是如何识别并区分变量？要做到这一点，决策树使用各种算法，这些将在下面的章节讨论。

As mentioned above, a decision tree identifies the most significant variable and its value that gives the most homogeneous sets of population. Now the question which arises is, how does it identify the variable and the split? To do this, a decision tree uses various algorithms, which we will discuss in the following section.
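
To make “most homogeneous sets” concrete, here is a minimal Python sketch. The counts per group are illustrative assumptions (the text above only states that 15 of the 30 students play cricket), and the score used, the share of students who belong to the majority class of their own group, is just a simple stand-in for homogeneity, not one of the split algorithms discussed later.

```python
# Illustrative (assumed) counts: group -> (students in group, students who play cricket).
splits = {
    "Gender": {"Girl": (10, 2), "Boy": (20, 13)},
    "Height": {"< 5.5 ft": (12, 5), ">= 5.5 ft": (18, 10)},
    "Class": {"IX": (14, 6), "X": (16, 9)},
}


def majority_share(groups):
    """Share of students who belong to the majority class of their own group.

    A value of 1.0 would mean every group is perfectly homogeneous.
    """
    total = sum(n for n, _ in groups.values())
    majority = sum(max(played, n - played) for n, played in groups.values())
    return majority / total


# Score every candidate variable and keep the one with the purest groups.
scores = {name: majority_share(groups) for name, groups in splits.items()}
best = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
print("Most homogeneous split:", best)
```

With these assumed counts, Gender gets the highest score (0.70, against roughly 0.57 for Height and Class), which agrees with the conclusion drawn from the snapshot.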

## 决策树的类型

决策树的类型基于目标变量的类型。它可以有两种类型：
-    1.``离散变量决策树``：目标变量为离散（类别）变量的决策树被称为离散变量决策树。**例如**：在上面学生问题的情景中，目标变量是“学生将打板球与否”，即是或否。


-    2.``连续变量决策树``：有连续目标变量的决策树被称为连续变量决策树。
**例如**：假设有一个问题，预测客户是否会支付保险公司的续保费用（是/否）。在这里，客户的收入是一个重要的变量，但保险公司并没有所有客户的收入明细。既然我们知道这是一个重要的变量，就可以根据职业、产品和其它各种变量建立一个决策树来预测客户收入。在这种情况下，预测的是连续变量值。

## Types of Decision Trees
The type of decision tree depends on the type of target variable we have. It can be of two types:

- Categorical Variable Decision Tree: A decision tree which has a categorical target variable is called a categorical variable decision tree.

**Example**: In the above scenario of the student problem, the target variable was “Student will play cricket or not”, i.e. YES or NO.


- Continuous Variable Decision Tree: A decision tree which has a continuous target variable is called a continuous variable decision tree.

**Example**: Let’s say we have a problem to predict whether a customer will pay their renewal premium to an insurance company (yes/no). Here we know that the income of the customer is a significant variable, but the insurance company does not have income details for all customers. Since we know this is an important variable, we can build a decision tree to predict customer income based on occupation, product and various other variables. In this case, we are predicting values for a continuous variable.

## 与决策树相关的重要术语
使用决策树时的基本术语：
-    1.根节点：它代表整个人口或样本，并进一步分为两个或两个以上的同质组。
-    2.分割：它是一个将一个节点划分成两个或两个以上子节点的过程。
-    3.决策节点：当一个子节点进一步分裂成子节点，那么它被称为决策节点。
-    4.叶/终端节点：不分裂的节点被称为叶子或终端节点。
-    5.剪枝：删除决策节点的子节点的过程被称为剪枝。你可以说相反的过程叫分裂。
-    6.分支 /子树：整个树的子部分被称为分支或子树。
-    7.父子节点：一个节点，分为子节点被称为子节点的父节点，而子节点是父节点的孩子。

这些都是常用的决策树术语。我们知道，每种算法都有优缺点，下面是应该知道的重要因素。

## Important Terminology related to Decision Trees
Let’s look at the basic terminology used with Decision trees:

-    1.Root Node: It represents the entire population or sample, which further gets divided into two or more homogeneous sets.
-    2.Splitting: It is the process of dividing a node into two or more sub-nodes.
-    3.Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.
-    4.Leaf/ Terminal Node: A node that does not split is called a leaf or terminal node.
-    5.Pruning: When we remove the sub-nodes of a decision node, the process is called pruning. You can think of it as the opposite of splitting.
-    6.Branch / Sub-Tree: A sub-section of the entire tree is called a branch or sub-tree.
-    7.Parent and Child Node: A node which is divided into sub-nodes is called the parent node of those sub-nodes, whereas the sub-nodes are the children of the parent node.
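
The terms above can be sketched in a few lines of Python. The `Node` class and the `leaves` helper below are hypothetical names made up for this illustration; they are not part of any library.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self):
        # Leaf / terminal node: a node that does not split.
        return not self.children

    def split(self, *names):
        # Splitting: divide this node into two or more sub-nodes.
        self.children = [Node(n) for n in names]
        return self.children

    def prune(self):
        # Pruning: remove the sub-nodes, the opposite of splitting.
        self.children = []


def leaves(node):
    """Collect the names of all leaf nodes in the branch (sub-tree) under `node`."""
    if node.is_leaf():
        return [node.name]
    return [name for child in node.children for name in leaves(child)]


root = Node("root")            # root node: represents the entire sample
a, b = root.split("A", "B")    # root is the parent; A and B are its children
b.split("B1", "B2")            # B splits further, so B is a decision node
leaves_before = leaves(root)   # A, B1 and B2 are the leaves
b.prune()                      # removing B's sub-nodes turns B back into a leaf
leaves_after = leaves(root)

print(leaves_before, leaves_after)
```

Calling `leaves(b)` instead of `leaves(root)` would look only at the branch that starts at `B`.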

These are the terms commonly used for decision trees. As every algorithm has advantages and disadvantages, below are the important factors which one should know.

<img align="right" src="figures/tree3.png">

### 优点
-    1.易于理解：决策树的输出非常容易理解，即使是非分析背景的人。不需要任何统计知识来阅读和解释它们。它的图形表示非常直观，用户可以很容易地关联他们的假设。
-    2.有用的数据探索：决策树是最快的识别最重要变量和两个或两个以上变量之间的关系的方式之一。在决策树的帮助下，我们可以创建新的具有更强能力的变量/特征来预测目标变量。你可以参考文章（提高回归模型力量的技巧）一条技巧。它也可以用于数据探索方面。例如，我们正在研究一个问题，我们有数百个变量的信息，此时决策树将有助于识别最重要的变量。
-    3.较少的数据清理要求：与其它一些建模技术相比，它需要更少的数据清理。在相当程度上，它不受异常值和缺失值的影响。
-    4.数据类型不受限制：它可以处理连续数值和离散变量。
-    5.非参数方法：决策树被认为是一种非参数方法。这意味着决策树对空间分布和分类器结构不作任何假设。

### 缺点
-    1.过拟合：过拟合是决策树模型最实际的困难之一。这个问题通过限定参数模型和修剪得到解决（详细讨论如下）。
-    2.不适合连续变量：在处理连续数值变量时，决策树将变量划分到不同类别时会丢失信息。

### Advantages
-    1.Easy to Understand: Decision tree output is very easy to understand, even for people from a non-analytical background. It does not require any statistical knowledge to read and interpret. Its graphical representation is very intuitive and users can easily relate it to their hypotheses.
-    2.Useful in Data exploration: A decision tree is one of the fastest ways to identify the most significant variables and the relation between two or more variables. With the help of decision trees, we can create new variables / features that have better power to predict the target variable. You can refer to the article (Trick to enhance power of regression model) for one such trick. It can also be used in the data exploration stage. For example, when we are working on a problem where we have information available in hundreds of variables, a decision tree will help to identify the most significant ones.
-    3.Less data cleaning required: It requires less data cleaning compared to some other modeling techniques. To a fair degree, it is not influenced by outliers and missing values.
-    4.Data type is not a constraint: It can handle both numerical and categorical variables.
-    5.Non Parametric Method: A decision tree is considered to be a non-parametric method. This means that decision trees make no assumptions about the space distribution and the classifier structure.
 

### Disadvantages
-    1.Over fitting: Over fitting is one of the most common practical difficulties for decision tree models. This problem is addressed by setting constraints on model parameters and by pruning (discussed in detail below).
-    2.Not fit for continuous variables: While working with continuous numerical variables, a decision tree loses information when it categorizes the variables into different categories.

## 2、回归树和分类树

<img align="right" src="figures/tree4.png">

我们都知道，终端节点（或叶子）位于决策树的底部。这意味着，决策树绘制时通常是颠倒的，叶子在底部，根在顶部（如上所示）。

这两种树的工作几乎相似，我们看看分类树和回归树的主要差异和相似性：
-    1.回归树用于因变量是连续时。分类树用于因变量是离散时。
-    2.对于回归树，训练数据中终端节点得到的值是落在该区域的观测值的平均响应。因此，如果一个未知结果的数据落在该区域，我们将用平均值进行预测。
-    3.对于分类树，训练数据中终端节点得到的值（分类）是落在该区域的观测值的众数。因此，如果一个未知分类的数据落在该区域，我们将用众数进行预测。
-    4.两种树划分预测空间（自变量）为不同且无重叠的区域。为了简单起见，你可以把这些区域想象为高维盒子或箱子。
-    5.两种树遵循自上而下的贪婪方法被称为递归二分分裂。我们叫它“自上而下”，因为当所有的观察都在一个单一的区域时它从树的顶部开始，并依次将预测空间分裂成两个新的分支。它被称作“贪婪”是因为，该算法只关心（寻找最佳可用变量）当前分裂，而不关心未来会带来更好树的分裂。
-    6.这个分裂过程一直持续直到达到一个用户定义的停止标准。例如：一旦每个节点的观测值不到50，我们可以告诉该算法停止。
-    7.在这两种情况下，分裂过程会一直进行到达到停止标准为止，从而得到完全生长的树。但是，完全生长的树可能过拟合数据，导致在未见过的数据上精度较差。这就引出了“修剪”。修剪是用来解决过拟合的技术之一。我们将在以下部分进一步了解它。


## Regression Trees vs Classification Trees
We all know that the terminal nodes (or leaves) lie at the bottom of the decision tree. This means that decision trees are typically drawn upside down, such that the leaves are at the bottom and the root is at the top (shown above).

Both types of tree work in almost the same way. Let’s look at the primary differences and similarities between classification and regression trees:

- Regression trees are used when the dependent variable is continuous. Classification trees are used when the dependent variable is categorical.
- In the case of a regression tree, the value obtained by a terminal node in the training data is the mean response of the observations falling in that region. Thus, if an unseen observation falls in that region, we’ll predict it with the mean value.
- In the case of a classification tree, the value (class) obtained by a terminal node in the training data is the mode of the observations falling in that region. Thus, if an unseen observation falls in that region, we’ll predict it with the mode value.
- Both trees divide the predictor space (independent variables) into distinct and non-overlapping regions. For the sake of simplicity, you can think of these regions as high dimensional boxes.
- Both trees follow a top-down greedy approach known as recursive binary splitting. We call it ‘top-down’ because it begins from the top of the tree, when all the observations are in a single region, and successively splits the predictor space into two new branches down the tree. It is known as ‘greedy’ because the algorithm cares only about the current split (it looks for the best variable available), and not about future splits which might lead to a better tree.
- This splitting process continues until a user defined stopping criterion is reached. For example, we can tell the algorithm to stop once the number of observations per node becomes less than 50.
- In both cases, the splitting process grows the tree until the stopping criterion is reached. But a fully grown tree is likely to overfit the data, leading to poor accuracy on unseen data. This brings in ‘pruning’. Pruning is one of the techniques used to tackle overfitting. We’ll learn more about it in the following section.
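
The points above can be illustrated with a small pure-Python sketch of a regression tree on a single numeric feature. It is a sketch under assumptions, not a reference implementation: the function names (`grow`, `predict`, `sse`), the toy data, and the use of the sum of squared errors to score a split are all choices made for this illustration. A stopping rule of 4 observations stands in for the 50 used in the example above.

```python
from statistics import mean, mode


def sse(targets):
    """Sum of squared errors around the mean (assumed split score for this sketch)."""
    m = mean(targets)
    return sum((t - m) ** 2 for t in targets)


def grow(rows, min_samples):
    """Recursive binary splitting. rows is a list of (x, y) pairs."""
    targets = [y for _, y in rows]
    xs = sorted({x for x, _ in rows})
    # Stopping criterion: fewer than min_samples observations in the node
    # (or no distinct x values left to split on) -> make a leaf.
    if len(rows) < min_samples or len(xs) < 2:
        # Regression leaf: predict the mean response of the region.
        return {"leaf": mean(targets)}

    best = None
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2
        left = [r for r in rows if r[0] <= threshold]
        right = [r for r in rows if r[0] > threshold]
        score = sse([y for _, y in left]) + sse([y for _, y in right])
        # Greedy: only the best split for the current node is considered.
        if best is None or score < best[0]:
            best = (score, threshold, left, right)

    _, threshold, left, right = best
    return {
        "threshold": threshold,
        "left": grow(left, min_samples),    # top-down: recurse into each
        "right": grow(right, min_samples),  # of the two new regions
    }


def predict(tree, x):
    """Walk from the root down to the leaf whose region contains x."""
    while "leaf" not in tree:
        tree = tree["left"] if x <= tree["threshold"] else tree["right"]
    return tree["leaf"]


# Toy data (made up): two clearly separated groups of responses.
rows = [(1, 10.0), (2, 12.0), (3, 11.0), (10, 50.0), (11, 52.0), (12, 51.0)]
tree = grow(rows, min_samples=4)
print(tree["threshold"], predict(tree, 2.5), predict(tree, 100))

# A classification leaf would predict the mode (most frequent class) instead.
majority_class = mode(["yes", "no", "yes"])
print(majority_class)
```

Raising `min_samples` makes the tree stop earlier and stay smaller, while lowering it lets the tree grow closer to a fully grown tree, which is where the overfitting risk mentioned in the last point comes from.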