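# Decision Trees
Decision trees and their ensembles (random forests, GBDT) are popular machine learning methods for classification and regression. Decision trees perform well on many tasks, are relatively easy to interpret, handle both categorical and numerical features, and do not require the data to be normalized or standardized. They are also well suited to ensemble methods: an ensemble of several decision trees is called a decision forest.  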
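## Basic algorithm
A decision tree is built by a greedy algorithm that recursively performs binary partitioning of the feature space. At each node, the partition is chosen by picking the best split from a set of candidate splits, so that the information gain at that node is maximized. In other words, for a dataset $D$ we look for the split $s$ with the largest information gain: $\underset{s}{\operatorname{argmax}}\, IG(D,s)$.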
## Node impurity and information gain
Node impurity measures how homogeneous the labels at a node are. MLlib currently provides two impurity measures for classification (Gini impurity and entropy) and one for regression (variance).

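| Impurity | Task | Formula | Description |
| ---- | ---- | ---- | ---- |
| Gini impurity | Classification | $\sum_{i=1}^{C} f_i(1-f_i)$ | $f_i$ is the frequency of label $i$ at a node and $C$ is the number of unique labels. |
| Entropy | Classification | $\sum_{i=1}^{C} -f_i \log(f_i)$ | $f_i$ is the frequency of label $i$ at a node and $C$ is the number of unique labels. For a single class, the term $l(x_i) = -\log(f_i)$ shrinks as the frequency $f_i$ of that class grows. The converse does not hold for the node as a whole: a small $f_i$ does not by itself imply high impurity, because that depends on how the other classes are distributed. |
| Variance | Regression | $\frac{1}{N} \sum_{i=1}^{N} (y_i - \mu)^2$ | $y_i$ is the label of an instance, $N$ is the number of instances and $\mu$ is the mean given by $\frac{1}{N} \sum_{i=1}^N y_i$. |

Information gain is the difference between the impurity of the parent node and the weighted sum of the impurities of its two child nodes. Suppose a split $s$ partitions the dataset $D$ of size $N$ into two parts $D_{left}$ and $D_{right}$ of sizes $N_{left}$ and $N_{right}$. The information gain is then:  
$IG(D,s) = \operatorname{Impurity}(D) - \frac{N_{left}}{N} \operatorname{Impurity}(D_{left}) - \frac{N_{right}}{N} \operatorname{Impurity}(D_{right})$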
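
The three impurity measures and the information gain formula can be written out in a few lines of plain Scala. This is only a sketch for small in-memory collections, not MLlib's implementation; the object name `ImpuritySketch` and its method names are made up for illustration, and the natural logarithm is assumed for entropy because the text does not fix a base.

```scala
object ImpuritySketch {
  // Relative frequency f_i of each distinct label at a node.
  def frequencies(labels: Seq[Double]): Seq[Double] = {
    val n = labels.size.toDouble
    labels.groupBy(identity).values.map(_.size / n).toSeq
  }

  // Gini impurity: sum of f_i * (1 - f_i) over the distinct labels.
  def gini(labels: Seq[Double]): Double =
    frequencies(labels).map(f => f * (1.0 - f)).sum

  // Entropy: sum of -f_i * log(f_i); natural log is an assumption here.
  def entropy(labels: Seq[Double]): Double =
    frequencies(labels).map(f => -f * math.log(f)).sum

  // Variance of the labels around their mean (0.0 for an empty node).
  def variance(labels: Seq[Double]): Double =
    if (labels.isEmpty) 0.0
    else {
      val mu = labels.sum / labels.size
      labels.map(y => (y - mu) * (y - mu)).sum / labels.size
    }

  // IG = Impurity(D) - N_left/N * Impurity(D_left) - N_right/N * Impurity(D_right)
  def informationGain(impurity: Seq[Double] => Double,
                      left: Seq[Double],
                      right: Seq[Double]): Double = {
    val parent = left ++ right
    val n = parent.size.toDouble
    impurity(parent) -
      left.size / n * impurity(left) -
      right.size / n * impurity(right)
  }
}
```

For example, a node with labels `0, 0, 1, 1` has a Gini impurity of 0.5, and a split that separates the two classes perfectly removes all of it, so `ImpuritySketch.informationGain(ImpuritySketch.gini, Seq(0.0, 0.0), Seq(1.0, 1.0))` evaluates to 0.5.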
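## Split candidates
### Continuous features
For small datasets in single-machine implementations, the split candidates for each continuous feature are typically the unique values of that feature. Some implementations sort the feature values and then use the ordered unique values as split candidates, which speeds up tree construction.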
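
As a rough illustration of the single-machine case, the sketch below takes the sorted unique values of one continuous feature as candidate thresholds and greedily keeps the one with the highest information gain. It is not MLlib code: the names `BestSplitSketch`, `gain`, and `bestSplit` are hypothetical, Gini impurity is used as the impurity measure, and points with a value less than or equal to the threshold are assumed to go to the left child.

```scala
object BestSplitSketch {
  // Gini impurity of the labels at a node.
  def gini(labels: Seq[Double]): Double = {
    val n = labels.size.toDouble
    labels.groupBy(identity).values.map { g =>
      val f = g.size / n
      f * (1.0 - f)
    }.sum
  }

  // Information gain of splitting (featureValue, label) pairs at `threshold`:
  // values <= threshold go to the left child, the rest to the right child.
  def gain(points: Seq[(Double, Double)], threshold: Double): Double = {
    val (left, right) = points.partition(_._1 <= threshold)
    val n = points.size.toDouble
    gini(points.map(_._2)) -
      left.size / n * gini(left.map(_._2)) -
      right.size / n * gini(right.map(_._2))
  }

  // Candidates are the sorted unique feature values, without the largest one
  // (splitting there would leave the right child empty).
  // Returns the best (threshold, gain) pair, or None if no split is possible.
  def bestSplit(points: Seq[(Double, Double)]): Option[(Double, Double)] = {
    val candidates = points.map(_._1).distinct.sorted.dropRight(1)
    if (candidates.isEmpty) None
    else Some(candidates.map(t => (t, gain(points, t))).maxBy(_._2))
  }
}
```

With the four points `(1.0, 0.0), (2.0, 0.0), (3.0, 1.0), (4.0, 1.0)`, the candidates are 1.0, 2.0 and 3.0, and the threshold 2.0 wins because it separates the two labels completely.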
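Sorting feature values is expensive for large distributed datasets. The implementation therefore computes an approximate set of split candidates by taking quantiles over a sample of the data. The ordered splits define "bins", and the maximum number of bins can be set with the `maxBins` parameter.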
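
The idea of deriving thresholds from sample quantiles can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm MLlib actually runs: `QuantileBinsSketch`, `thresholds`, and `binIndex` are hypothetical names, the sample is assumed to fit in memory, and quantiles are read directly from the sorted sample.

```scala
object QuantileBinsSketch {
  // Approximate split thresholds for one feature, taken from a sample.
  // At most maxBins - 1 thresholds are returned, delimiting at most maxBins bins.
  def thresholds(sample: Seq[Double], maxBins: Int): Seq[Double] = {
    require(maxBins >= 2, "maxBins must be at least 2")
    val sorted = sample.sorted.toIndexedSeq
    if (sorted.isEmpty) Seq.empty
    else {
      val n = sorted.size
      (1 until maxBins)
        // value at the k/maxBins quantile of the sorted sample
        .map(k => sorted(math.max(0, (k.toLong * n / maxBins).toInt - 1)))
        .distinct
        // a threshold equal to the maximum would leave the last bin empty
        .filter(_ < sorted.last)
    }
  }

  // Index of the bin a value falls into: values <= thresholds(0) are in bin 0.
  def binIndex(value: Double, thresholds: Seq[Double]): Int =
    thresholds.count(_ < value)
}
```

For a sample of the values 1.0 to 100.0 and `maxBins = 4`, this yields the thresholds 25.0, 50.0 and 75.0, that is, four bins of 25 values each.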
### Categorical features
A categorical feature with $M$ possible values (categories) can give rise to up to $2^{M-1}-1$ split candidates.
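
The count $2^{M-1}-1$ is the number of ways to divide $M$ categories into two non-empty groups when the order of the groups does not matter. The sketch below enumerates them for a small $M$; `CategoricalSplitsSketch` and `splits` are illustrative names, categories are assumed to be encoded as the integers `0` to `M - 1`, and the upper bound of 20 categories is an arbitrary guard against the exponential growth.

```scala
object CategoricalSplitsSketch {
  // Enumerate every division of the categories 0 until m into two non-empty groups.
  // Category 0 always stays in the right group, so each division is listed
  // exactly once, which gives 2^(m-1) - 1 candidates.
  def splits(m: Int): Seq[(Set[Int], Set[Int])] = {
    require(m >= 1 && m <= 20, "kept small: the count grows exponentially")
    val others = (1 until m).toList
    // each non-zero bit mask selects a non-empty subset of the other categories
    (1 until (1 << (m - 1))).map { mask =>
      val left = others.filter(c => (mask & (1 << (c - 1))) != 0).toSet
      val right = (0 until m).toSet -- left
      (left, right)
    }
  }
}
```

With three categories, `CategoricalSplitsSketch.splits(3)` returns the three candidates `{1} | {0, 2}`, `{2} | {0, 1}` and `{1, 2} | {0}`.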
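## Stopping rule
Recursive tree construction stops at a node when any one of the following conditions is met:  
1) The depth of the node equals the `maxDepth` training parameter.  
2) No split candidate yields an information gain greater than `minInfoGain`.  
3) The node is already a leaf and cannot be split further. In MLlib this is the case when no split candidate produces child nodes that each contain at least `minInstancesPerNode` training instances.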
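
The rules above can be read as a single predicate evaluated at every node. The following is a minimal sketch under assumptions: `StoppingRuleSketch`, `Params`, and `shouldStop` are hypothetical names rather than MLlib classes, and the best information gain is passed as an `Option` that is `None` when the node has no valid split candidate.

```scala
object StoppingRuleSketch {
  // Hypothetical container for the two training parameters named in the text.
  case class Params(maxDepth: Int, minInfoGain: Double)

  // Returns true when recursion should stop at the current node.
  // bestGain is None when no valid split candidate exists (rule 3).
  def shouldStop(depth: Int, bestGain: Option[Double], params: Params): Boolean = {
    // rule 1: the depth limit has been reached
    val depthReached = depth >= params.maxDepth
    // rules 2 and 3: no candidate, or none with a gain above minInfoGain
    val noUsefulSplit = bestGain.forall(_ <= params.minInfoGain)
    depthReached || noUsefulSplit
  }
}
```

For instance, with `Params(maxDepth = 5, minInfoGain = 0.0)` a node at depth 2 whose best split has gain 0.3 keeps splitting, while the same node with a best gain of 0.0 becomes a leaf.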

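## Classification
The example below loads a LIBSVM data file, parses it as an RDD of LabeledPoint and then trains a decision tree classifier with Gini impurity as the impurity measure and a maximum tree depth of 5. The test error is computed at the end to measure the accuracy of the model.

In [2]: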
val PATH = "file:///Users/lzz/work/SparkML/"
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.mllib.util.MLUtils

// Load and parse the data file.
val data = MLUtils.loadLibSVMFile(sc, PATH+"data/mllib/sample_libsvm_data.txt")
// Split the data into training and test sets (30% held out for testing)
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

// Train a DecisionTree model.
//  Empty categoricalFeaturesInfo indicates all features are continuous.
val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val impurity = "gini"
val maxDepth = 5
val maxBins = 32

val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
  impurity, maxDepth, maxBins)

// Evaluate model on test instances and compute test error
val labelAndPreds = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val testErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / testData.count()
println("Test Error = " + testErr)
println("Learned classification tree model:\n" + model.toDebugString)

// Save and load model
model.save(sc, "myModelPath")
val sameModel = DecisionTreeModel.load(sc, "myModelPath")

Test Error = 0.08571428571428572
Learned classification tree model:
DecisionTreeModel classifier of depth 1 with 3 nodes
  If (feature 378 <= 71.0)
   Predict: 0.0
  Else (feature 378 > 71.0)
   Predict: 1.0



## Regression
The example below demonstrates how to load a LIBSVM data file, parse it as an RDD of LabeledPoint and then perform regression using a decision tree with variance as an impurity measure and a maximum tree depth of 5. The Mean Squared Error (MSE) is computed at the end to evaluate goodness of fit.

In [4]:
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.mllib.util.MLUtils

// Load and parse the data file.
val data = MLUtils.loadLibSVMFile(sc, PATH + "data/mllib/sample_libsvm_data.txt")
// Split the data into training and test sets (30% held out for testing)
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

// Train a DecisionTree model.
//  Empty categoricalFeaturesInfo indicates all features are continuous.
val categoricalFeaturesInfo = Map[Int, Int]()
val impurity = "variance"
val maxDepth = 5
val maxBins = 32

val model = DecisionTree.trainRegressor(trainingData, categoricalFeaturesInfo, impurity,
  maxDepth, maxBins)

// Evaluate model on test instances and compute test error
val labelsAndPredictions = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val testMSE = labelsAndPredictions.map{ case(v, p) => math.pow((v - p), 2)}.mean()
println("Test Mean Squared Error = " + testMSE)
println("Learned regression tree model:\n" + model.toDebugString)

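// Save and load model. A path different from the classification example is used,
// because saving to a path that already exists normally fails.
model.save(sc, "myRegressionModelPath")
val sameModel = DecisionTreeModel.load(sc, "myRegressionModelPath")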

Test Mean Squared Error = 0.05555555555555555
Learned regression tree model:
DecisionTreeModel regressor of depth 1 with 3 nodes
  If (feature 406 <= 0.0)
   Predict: 0.0
  Else (feature 406 > 0.0)
   Predict: 1.0

