## Classification
The example below loads a LIBSVM data file, parses it into an RDD of LabeledPoint, and then trains a Random Forest classifier on it. The test error, computed as the fraction of misclassified test instances, measures the model's accuracy on held-out data.
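The LIBSVM text format mentioned above stores one labeled example per line as `label index1:value1 index2:value2 ...`, with one-based feature indices that MLUtils.loadLibSVMFile converts to zero-based. The following is a minimal sketch of that parsing step in plain Scala, without Spark. The names `LibSvmLineSketch`, `ParsedPoint` and `parseLine` are hypothetical and not part of MLlib, and the sketch assumes well-formed, non-empty lines.

```scala
// Hypothetical helper, not part of MLlib: parses one LIBSVM-formatted line.
// Line format: <label> <index1>:<value1> <index2>:<value2> ...
object LibSvmLineSketch {
  // Label plus a sparse list of (zero-based feature index, value) pairs.
  final case class ParsedPoint(label: Double, features: Seq[(Int, Double)])

  def parseLine(line: String): ParsedPoint = {
    val tokens = line.trim.split("\\s+")
    val label = tokens.head.toDouble
    val features = tokens.tail.toSeq.map { token =>
      val Array(index, value) = token.split(":")
      // Indices in the file are one-based; shift them to zero-based.
      (index.toInt - 1, value.toDouble)
    }
    ParsedPoint(label, features)
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample line, used only for illustration.
    println(parseLine("1 128:51 129:159 130:253"))
  }
}
```

Features that do not appear on a line are implicitly zero, which is why the format suits sparse data.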

In [2]:
val PATH = "file:///Users/lzz/work/SparkML/"

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils

// Load and parse the data file.
val data = MLUtils.loadLibSVMFile(sc, PATH + "data/mllib/sample_libsvm_data.txt")
// Split the data into training and test sets (30% held out for testing)
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

// Train a RandomForest model.
//  Empty categoricalFeaturesInfo indicates all features are continuous.
val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 3 // Use more in practice.
val featureSubsetStrategy = "auto" // Let the algorithm choose ("sqrt" for classification when numTrees > 1).
val impurity = "gini"
val maxDepth = 4
val maxBins = 32

val model = RandomForest.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
  numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)

// Evaluate model on test instances and compute test error
val labelAndPreds = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val testErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / testData.count()
println("Test Error = " + testErr)
println("Learned classification forest model:\n" + model.toDebugString)


Test Error = 0.0
Learned classification forest model:
TreeEnsembleModel classifier with 3 trees

  Tree 0:
    If (feature 344 <= 0.0)
     If (feature 378 <= 71.0)
      Predict: 0.0
     Else (feature 378 > 71.0)
      Predict: 1.0
    Else (feature 344 > 0.0)
     If (feature 523 <= 31.0)
      If (feature 688 <= 0.0)
       Predict: 1.0
      Else (feature 688 > 0.0)
       Predict: 0.0
     Else (feature 523 > 31.0)
      Predict: 0.0
  Tree 1:
    If (feature 433 <= 0.0)
     If (feature 324 <= 38.0)
      Predict: 0.0
     Else (feature 324 > 38.0)
      Predict: 1.0
    Else (feature 433 > 0.0)
     Predict: 1.0
  Tree 2:
    If (feature 463 <= 0.0)
     If (feature 317 <= 0.0)
      If (feature 489 <= 0.0)
       Predict: 0.0
      Else (feature 489 > 0.0)
       Predict: 1.0
     Else (feature 317 > 0.0)
      Predict: 0.0
    Else (feature 463 > 0.0)
     Predict: 1.0
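
To make the printed model easier to read, here is a sketch that rewrites the three trees above by hand as plain Scala functions and combines them by majority vote, which is how the MLlib documentation describes prediction for classification forests. This is an illustration, not MLlib code: `ForestVoteSketch`, `Features` and the sample inputs are hypothetical.

```scala
// A hand-written sketch of the printed forest; not generated by or part of MLlib.
object ForestVoteSketch {
  // Hypothetical representation: maps a feature index to its value.
  type Features = Int => Double

  def tree0(f: Features): Double =
    if (f(344) <= 0.0) { if (f(378) <= 71.0) 0.0 else 1.0 }
    else if (f(523) <= 31.0) { if (f(688) <= 0.0) 1.0 else 0.0 }
    else 0.0

  def tree1(f: Features): Double =
    if (f(433) <= 0.0) { if (f(324) <= 38.0) 0.0 else 1.0 }
    else 1.0

  def tree2(f: Features): Double =
    if (f(463) <= 0.0) {
      if (f(317) <= 0.0) { if (f(489) <= 0.0) 0.0 else 1.0 } else 0.0
    } else 1.0

  // Majority vote: with two classes and three trees there can be no tie.
  def predict(f: Features): Double = {
    val votes = Seq(tree0(f), tree1(f), tree2(f))
    votes.groupBy(identity).maxBy(_._2.size)._1
  }

  def main(args: Array[String]): Unit = {
    // Made-up sparse input: every feature not listed is zero.
    val sample: Features = Map(433 -> 12.0, 463 -> 5.0).withDefaultValue(0.0)
    println(predict(sample))
  }
}
```

For the made-up `sample`, Tree 0 votes for class 0.0 while Trees 1 and 2 vote for 1.0, so the forest predicts 1.0.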



## Regression
The example below loads a LIBSVM data file, parses it into an RDD of LabeledPoint, and then trains a Random Forest regressor on it. The Mean Squared Error (MSE) on the held-out test set is computed at the end to evaluate goodness of fit.
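
For reference, the MSE over $n$ test instances can be written as below. The symbols are notation assumed here rather than taken from the example: $y_i$ is the label of test instance $i$ and $\hat{y}_i$ is the forest's prediction for it.

$$
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
$$

A value of 0 means every prediction matched its label exactly; larger values indicate a worse fit.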

In [4]:
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils

// Load and parse the data file.
val data = MLUtils.loadLibSVMFile(sc, PATH + "data/mllib/sample_libsvm_data.txt")
// Split the data into training and test sets (30% held out for testing)
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

// Train a RandomForest regression model (trainRegressor takes no numClasses argument).
//  Empty categoricalFeaturesInfo indicates all features are continuous.
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 3 // Use more in practice.
val featureSubsetStrategy = "auto" // Let the algorithm choose ("onethird" for regression when numTrees > 1).
val impurity = "variance"
val maxDepth = 4
val maxBins = 32

val model = RandomForest.trainRegressor(trainingData, categoricalFeaturesInfo,
  numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)

// Evaluate model on test instances and compute the test MSE
val labelsAndPredictions = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val testMSE = labelsAndPredictions.map { case (v, p) => math.pow(v - p, 2) }.mean()
println("Test Mean Squared Error = " + testMSE)
println("Learned regression forest model:\n" + model.toDebugString)

Test Mean Squared Error = 0.03831417624521073
Learned regression forest model:
TreeEnsembleModel regressor with 3 trees

  Tree 0:
    If (feature 489 <= 0.0)
     Predict: 0.0
    Else (feature 489 > 0.0)
     Predict: 1.0
  Tree 1:
    If (feature 490 <= 31.0)
     Predict: 0.0
    Else (feature 490 > 31.0)
     Predict: 1.0
  Tree 2:
    If (feature 490 <= 31.0)
     Predict: 0.0
    Else (feature 490 > 31.0)
     Predict: 1.0
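
The sketch below rewrites the three regression trees above by hand and averages their outputs, which is how the MLlib documentation describes prediction for regression forests. It is an illustration only: `ForestAverageSketch`, `Features` and the sample inputs are hypothetical and not part of MLlib.

```scala
// A hand-written sketch of the printed regression forest; not part of MLlib.
object ForestAverageSketch {
  // Hypothetical representation: maps a feature index to its value.
  type Features = Int => Double

  def tree0(f: Features): Double = if (f(489) <= 0.0) 0.0 else 1.0
  def tree1(f: Features): Double = if (f(490) <= 31.0) 0.0 else 1.0
  def tree2(f: Features): Double = if (f(490) <= 31.0) 0.0 else 1.0

  // Regression forests average the per-tree predictions.
  def predict(f: Features): Double = {
    val outputs = Seq(tree0(f), tree1(f), tree2(f))
    outputs.sum / outputs.size
  }

  def main(args: Array[String]): Unit = {
    // Made-up sparse inputs: every feature not listed is zero.
    val allAgree: Features = Map(489 -> 10.0, 490 -> 200.0).withDefaultValue(0.0)
    val disagree: Features = Map(489 -> 10.0).withDefaultValue(0.0)
    println(predict(allAgree)) // all three trees output 1.0
    println(predict(disagree)) // only Tree 0 outputs 1.0
  }
}
```

When the trees disagree, the averaged prediction is a fraction such as one third, even though every leaf predicts 0.0 or 1.0. This is one plausible reason the test MSE above is small but not zero.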

