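# Random Forests
Random forests are ensembles of decision trees. They are among the most successful machine learning models for classification and regression. They combine many decision trees to reduce the risk of overfitting. Like decision trees, random forests handle categorical features, extend to the multiclass classification setting, and can capture non-linearities and feature interactions.  
MLlib supports random forests for binary and multiclass classification and for regression, using both continuous and categorical features. MLlib implements random forests using the existing decision tree implementation; see the decision tree chapter for more information on trees.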
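## Basic algorithm
Random forests train each decision tree separately, so training can be done in parallel. The algorithm injects randomness into the training of each tree so that the trees differ from one another. Combining the predictions of all trees reduces the variance of the predictions and improves performance on test data.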
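### Training
The randomness injected into the training process includes:  
* Subsampling the original dataset on each iteration to get a different training set (bootstrapping).
* Considering a different random subset of features to split on at each tree node.

Apart from these randomizations, each decision tree is trained in the same way as an individual decision tree.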
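
The two sources of randomness above can be illustrated without Spark. The following is a minimal sketch in plain Scala, not MLlib's implementation: the helpers `bootstrapIndices` and `candidateFeatures`, the seed, and the toy sizes are all hypothetical names and values chosen for illustration.

```scala
import scala.util.Random

// Hypothetical helper: draw a bootstrap sample of row indices (with replacement).
// `subsamplingRate` scales the sample size relative to the original dataset.
def bootstrapIndices(numRows: Int, subsamplingRate: Double, rng: Random): Seq[Int] = {
  val sampleSize = math.round(numRows * subsamplingRate).toInt
  Seq.fill(sampleSize)(rng.nextInt(numRows))
}

// Hypothetical helper: pick a random subset of feature indices (without replacement)
// to consider as split candidates at one tree node.
def candidateFeatures(numFeatures: Int, numCandidates: Int, rng: Random): Seq[Int] =
  rng.shuffle((0 until numFeatures).toList).take(numCandidates).sorted

val rng = new Random(42L) // fixed seed so the sketch is reproducible
val numRows = 10          // toy dataset size (assumption)
val numFeatures = 8       // toy feature count (assumption)

// Each tree sees its own resampled view of the data ...
val samplesPerTree = (0 until 3).map(_ => bootstrapIndices(numRows, 1.0, rng))
// ... and each node only considers a few randomly chosen features.
val nodeFeatures = candidateFeatures(numFeatures, 3, rng)

samplesPerTree.zipWithIndex.foreach { case (rows, t) =>
  println(s"Tree $t trains on rows: ${rows.mkString(", ")}")
}
println(s"Split candidates at one node: ${nodeFeatures.mkString(", ")}")
```

Because rows are drawn with replacement, some rows typically appear several times in one tree's sample and others not at all, which is what makes the trees differ.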

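### Prediction
To make a prediction on a new instance, a random forest must aggregate the predictions from its set of decision trees. This aggregation is done differently for classification and regression.  
Classification: Majority vote. Each tree's prediction is counted as a vote for one class, and the label is predicted to be the class that receives the most votes.  
Regression: Averaging. Each tree predicts a real value, and the label is predicted to be the average of the tree predictions.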
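
The two aggregation rules can be sketched in a few lines of plain Scala. This is only an illustration of the idea: `majorityVote`, `averagePrediction`, and the sample per-tree predictions are hypothetical and are not part of the MLlib API.

```scala
// Classification: each tree casts one vote; the most frequent class wins.
// Ties are broken arbitrarily in this sketch.
def majorityVote(treePredictions: Seq[Double]): Double =
  treePredictions.groupBy(identity).maxBy { case (_, votes) => votes.size }._1

// Regression: the forest prediction is the mean of the per-tree predictions.
def averagePrediction(treePredictions: Seq[Double]): Double =
  treePredictions.sum / treePredictions.size

val classVotes = Seq(1.0, 0.0, 1.0)        // hypothetical per-tree class predictions
val regressionOutputs = Seq(2.0, 3.0, 4.0) // hypothetical per-tree real-valued predictions

println(s"Predicted class: ${majorityVote(classVotes)}")
println(s"Predicted value: ${averagePrediction(regressionOutputs)}")
```

With these inputs, class 1.0 gets two of three votes, and the regression output is the mean of the three values.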
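## Usage tips
We give a few guidelines for using random forests by discussing the various parameters. Some decision tree parameters are omitted because they are covered in the decision tree chapter.  

The first two parameters are the most important, and tuning them can often improve performance.  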
* numTrees: Number of trees in the forest.
    * Increasing the number of trees will decrease the variance in predictions, improving the model’s test-time accuracy.
    * Training time increases roughly linearly in the number of trees.
* maxDepth: Maximum depth of each tree in the forest.
    * Increasing the depth makes the model more expressive and powerful. However, deep trees take longer to train and are also more prone to overfitting.
    * In general, it is acceptable to train deeper trees when using random forests than when using a single decision tree. One tree is more likely to overfit than a random forest (because of the variance reduction from averaging multiple trees in the forest).
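
The variance reduction from averaging can be seen in a small simulation. The sketch below makes a strong simplifying assumption: each tree's prediction is modelled as the true value plus independent Gaussian noise. Real trees trained on overlapping data are correlated, so the reduction in practice is smaller. The helpers `variance` and `forestPrediction`, the seed, and the trial count are hypothetical.

```scala
import scala.util.Random

// Empirical variance (population form) of a sequence of numbers.
def variance(xs: Seq[Double]): Double = {
  val mean = xs.sum / xs.size
  xs.map(x => (x - mean) * (x - mean)).sum / xs.size
}

// Simulate one forest prediction: the average of `numTrees` noisy tree outputs.
// Assumption: tree errors are independent standard Gaussian noise around the true value.
def forestPrediction(numTrees: Int, trueValue: Double, rng: Random): Double =
  Seq.fill(numTrees)(trueValue + rng.nextGaussian()).sum / numTrees

val rng = new Random(7L) // fixed seed so the sketch is reproducible
val trials = 2000

// Estimate the variance of the forest prediction for several forest sizes.
val varianceByNumTrees = Seq(1, 10, 100).map { n =>
  n -> variance(Seq.fill(trials)(forestPrediction(n, 1.0, rng)))
}

varianceByNumTrees.foreach { case (n, v) =>
  println(f"numTrees = $n%3d  variance of prediction ~ $v%.4f")
}
```

Under the independence assumption the variance shrinks roughly in proportion to 1 / numTrees, while the work per prediction grows linearly with numTrees.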
    
The next two parameters generally do not require tuning, but they can be tuned to speed up training.
* subsamplingRate: This parameter specifies the size of the dataset used for training each tree in the forest, as a fraction of the size of the original dataset. The default (1.0) is recommended, but decreasing this fraction can speed up training.

* featureSubsetStrategy: Number of features to use as candidates for splitting at each tree node. The number is specified as a fraction or function of the total number of features. Decreasing this number will speed up training, but can sometimes impact performance if too low.
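
To make "a fraction or function of the total number of features" concrete, the sketch below maps the strategy names documented for MLlib's `RandomForest` (`"auto"`, `"all"`, `"sqrt"`, `"log2"`, `"onethird"`) to a number of split candidates. The function `numSplitCandidates` is hypothetical, and the rounding rules are an assumption made for illustration rather than a copy of MLlib's code.

```scala
// Sketch: map a featureSubsetStrategy name to a number of split candidates per node.
// The rounding rules here are an assumption for illustration, not MLlib's exact code.
def numSplitCandidates(strategy: String, numFeatures: Int,
                       numTrees: Int, isClassification: Boolean): Int =
  strategy match {
    case "all"      => numFeatures
    case "sqrt"     => math.ceil(math.sqrt(numFeatures.toDouble)).toInt
    case "log2"     => math.max(1, math.ceil(math.log(numFeatures.toDouble) / math.log(2.0)).toInt)
    case "onethird" => math.ceil(numFeatures / 3.0).toInt
    case "auto" =>
      // A single tree uses every feature; a forest uses sqrt (classification)
      // or onethird (regression).
      val resolved =
        if (numTrees == 1) "all" else if (isClassification) "sqrt" else "onethird"
      numSplitCandidates(resolved, numFeatures, numTrees, isClassification)
    case other => throw new IllegalArgumentException(s"Unknown strategy: $other")
  }

val numFeatures = 692 // hypothetical feature count
Seq("all", "sqrt", "log2", "onethird", "auto").foreach { s =>
  println(s"$s -> ${numSplitCandidates(s, numFeatures, numTrees = 3, isClassification = true)}")
}
```

Smaller candidate sets mean fewer splits to evaluate per node, which is where the training speed-up comes from.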

## Examples

### Classification
The example below demonstrates how to load a LIBSVM data file, parse it as an RDD of LabeledPoint and then perform classification using a Random Forest. The test error is calculated to measure the algorithm accuracy.

```scala
// Base directory of the local Spark ML checkout; adjust to your environment.
val PATH = "file:///Users/lzz/work/SparkML/"

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils

// Load and parse the data file.
val data = MLUtils.loadLibSVMFile(sc, PATH + "data/mllib/sample_libsvm_data.txt")
// Split the data into training and test sets (30% held out for testing)
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

// Train a RandomForest model.
//  Empty categoricalFeaturesInfo indicates all features are continuous.
val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 3 // Use more in practice.
val featureSubsetStrategy = "auto" // Let the algorithm choose.
val impurity = "gini"
val maxDepth = 4
val maxBins = 32

val model = RandomForest.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
  numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)

// Evaluate model on test instances and compute test error
val labelAndPreds = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val testErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / testData.count()
println("Test Error = " + testErr)
println("Learned classification forest model:\n" + model.toDebugString)
```


```
Test Error = 0.0
Learned classification forest model:
TreeEnsembleModel classifier with 3 trees

  Tree 0:
    If (feature 344 <= 0.0)
     If (feature 378 <= 71.0)
      Predict: 0.0
     Else (feature 378 > 71.0)
      Predict: 1.0
    Else (feature 344 > 0.0)
     If (feature 523 <= 31.0)
      If (feature 688 <= 0.0)
       Predict: 1.0
      Else (feature 688 > 0.0)
       Predict: 0.0
     Else (feature 523 > 31.0)
      Predict: 0.0
  Tree 1:
    If (feature 433 <= 0.0)
     If (feature 324 <= 38.0)
      Predict: 0.0
     Else (feature 324 > 38.0)
      Predict: 1.0
    Else (feature 433 > 0.0)
     Predict: 1.0
  Tree 2:
    If (feature 463 <= 0.0)
     If (feature 317 <= 0.0)
      If (feature 489 <= 0.0)
       Predict: 0.0
      Else (feature 489 > 0.0)
       Predict: 1.0
     Else (feature 317 > 0.0)
      Predict: 0.0
    Else (feature 463 > 0.0)
     Predict: 1.0
```



### Regression
The example below demonstrates how to load a LIBSVM data file, parse it as an RDD of LabeledPoint and then perform regression using a Random Forest. The Mean Squared Error (MSE) is computed at the end to evaluate goodness of fit.

```scala
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils

// Load and parse the data file (PATH is defined in the classification example).
val data = MLUtils.loadLibSVMFile(sc, PATH + "data/mllib/sample_libsvm_data.txt")
// Split the data into training and test sets (30% held out for testing)
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

// Train a RandomForest model.
//  Empty categoricalFeaturesInfo indicates all features are continuous.
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 3 // Use more in practice.
val featureSubsetStrategy = "auto" // Let the algorithm choose.
val impurity = "variance"
val maxDepth = 4
val maxBins = 32

val model = RandomForest.trainRegressor(trainingData, categoricalFeaturesInfo,
  numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)

// Evaluate model on test instances and compute the test mean squared error
val labelsAndPredictions = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val testMSE = labelsAndPredictions.map { case (v, p) => math.pow(v - p, 2) }.mean()
println("Test Mean Squared Error = " + testMSE)
println("Learned regression forest model:\n" + model.toDebugString)
```

```
Test Mean Squared Error = 0.03831417624521073
Learned regression forest model:
TreeEnsembleModel regressor with 3 trees

  Tree 0:
    If (feature 489 <= 0.0)
     Predict: 0.0
    Else (feature 489 > 0.0)
     Predict: 1.0
  Tree 1:
    If (feature 490 <= 31.0)
     Predict: 0.0
    Else (feature 490 > 31.0)
     Predict: 1.0
  Tree 2:
    If (feature 490 <= 31.0)
     Predict: 0.0
    Else (feature 490 > 31.0)
     Predict: 1.0
```

