## Classification
The example below loads a LIBSVM data file, parses it into an RDD of LabeledPoint, and trains a Gradient-Boosted Trees classifier with log loss. The test error, the fraction of misclassified test points, is then computed to measure the model's accuracy.
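
To make the input format concrete, here is a minimal Spark-free sketch of parsing a single LIBSVM line. Each line has the form `<label> <index>:<value> ...` with one-based indices, which are shifted to zero-based on loading. `LibSvmSketch` and `parseLine` are hypothetical names for illustration, not part of MLlib, and the sketch does no validation.

```scala
// Minimal sketch of the LIBSVM text format, independent of Spark.
// `LibSvmSketch` and `parseLine` are hypothetical names, not MLlib APIs.
object LibSvmSketch {
  // Parses one line of the form "<label> <index>:<value> <index>:<value> ...".
  // Indices in the file are one-based; they are shifted to zero-based here,
  // which is the numbering the feature indices follow after loading.
  def parseLine(line: String): (Double, Seq[(Int, Double)]) = {
    val tokens = line.trim.split("\\s+")
    val label = tokens.head.toDouble
    val features = tokens.tail.toSeq.map { tok =>
      val parts = tok.split(":")
      (parts(0).toInt - 1, parts(1).toDouble)
    }
    (label, features)
  }
}
```

For example, `LibSvmSketch.parseLine("1 3:0.5 10:2.0")` yields the label `1.0` and the sparse features `(2, 0.5)` and `(9, 2.0)`.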

In [2]:
val PATH = "file:///Users/lzz/work/SparkML/"
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel
import org.apache.spark.mllib.util.MLUtils

// Load and parse the data file.
val data = MLUtils.loadLibSVMFile(sc, PATH + "data/mllib/sample_libsvm_data.txt")
// Split the data into training and test sets (30% held out for testing)
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

// Train a GradientBoostedTrees model.
//  defaultParams("Classification") selects LogLoss as the loss function.
val boostingStrategy = BoostingStrategy.defaultParams("Classification")
boostingStrategy.numIterations = 3 // Note: Use more iterations in practice.
boostingStrategy.treeStrategy.numClasses = 2
boostingStrategy.treeStrategy.maxDepth = 5
//  Empty categoricalFeaturesInfo indicates all features are continuous.
boostingStrategy.treeStrategy.categoricalFeaturesInfo = Map[Int, Int]()

val model = GradientBoostedTrees.train(trainingData, boostingStrategy)

// Evaluate model on test instances and compute test error
val labelAndPreds = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val testErr = labelAndPreds.filter { case (label, pred) => label != pred }.count().toDouble / testData.count()
println("Test Error = " + testErr)
println("Learned classification GBT model:\n" + model.toDebugString)

Test Error = 0.02702702702702703
Learned classification GBT model:
TreeEnsembleModel classifier with 3 trees

  Tree 0:
    If (feature 434 <= 0.0)
     If (feature 99 <= 0.0)
      Predict: -1.0
     Else (feature 99 > 0.0)
      Predict: 1.0
    Else (feature 434 > 0.0)
     Predict: 1.0
  Tree 1:
    If (feature 434 <= 0.0)
     If (feature 352 <= 246.0)
      If (feature 400 <= 9.0)
       If (feature 124 <= 0.0)
        Predict: -0.4768116880884702
       Else (feature 124 > 0.0)
        Predict: -0.4768116880884703
      Else (feature 400 > 9.0)
       Predict: -0.4768116880884703
     Else (feature 352 > 246.0)
      Predict: 0.4768116880884694
    Else (feature 434 > 0.0)
     If (feature 467 <= 28.0)
      If (feature 518 <= 248.0)
       Predict: 0.47681168808847024
      Else (feature 518 > 248.0)
       Predict: 0.47681168808847024
     Else (feature 467 > 28.0)
      Predict: 0.4768116880884712
  Tree 2:
    If (feature 434 <= 0.0)
     If (feature 242 <= 0.0)
      Predic
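
The leaf values printed for Tree 1 can be reproduced by hand. The sketch below rests on assumptions not stated in the text above: that the log loss has the form 2·log(1 + exp(−2·y·F)) with labels remapped to −1 and +1 (consistent with the ±1.0 leaves of Tree 0), and that the ensemble prediction is a weighted sum of tree outputs thresholded at zero. `GbtClassifierSketch` and its members are hypothetical names, and the weights in the usage note are illustrative.

```scala
// Sketch of how a log-loss GBT ensemble forms a prediction, without Spark.
// `GbtClassifierSketch` and its members are hypothetical names, and the
// loss form and thresholding rule are assumptions, not taken from the source.
object GbtClassifierSketch {
  // Negative gradient of the loss 2 * log(1 + exp(-2 * y * F)) with respect
  // to F, for a label y in {-1, +1} and a current raw score F.
  def negativeGradient(y: Double, score: Double): Double =
    4.0 * y / (1.0 + math.exp(2.0 * y * score))

  // Raw score: weighted sum of the individual tree outputs.
  def rawScore(treeOutputs: Seq[Double], weights: Seq[Double]): Double =
    treeOutputs.zip(weights).map { case (out, w) => out * w }.sum

  // Class label: 1.0 when the raw score is positive, otherwise 0.0.
  def predictClass(treeOutputs: Seq[Double], weights: Seq[Double]): Double =
    if (rawScore(treeOutputs, weights) > 0.0) 1.0 else 0.0
}
```

With a first-tree score of 1.0 and a true label of +1, `GbtClassifierSketch.negativeGradient(1.0, 1.0)` is 4 / (1 + e²) ≈ 0.4768, which agrees with the magnitude of the Tree 1 leaves. Calling `predictClass(Seq(1.0, 0.4768, 0.4768), Seq(1.0, 0.1, 0.1))` returns `1.0`; the weights `1.0, 0.1, 0.1` assume a learning rate of 0.1 for the later trees.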

## Regression
The example below loads a LIBSVM data file, parses it into an RDD of LabeledPoint, and trains a Gradient-Boosted Trees regressor with squared error as the loss. The Mean Squared Error (MSE) on the test set is computed at the end to evaluate goodness of fit.
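
As a reference point for the evaluation step, the MSE is the average of the squared differences between labels and predictions. A minimal Spark-free sketch, using the hypothetical names `MseSketch` and `meanSquaredError`:

```scala
// Spark-free sketch of the Mean Squared Error computation.
// `MseSketch` and `meanSquaredError` are hypothetical names.
object MseSketch {
  // Mean of squared differences over (label, prediction) pairs.
  def meanSquaredError(pairs: Seq[(Double, Double)]): Double = {
    require(pairs.nonEmpty, "need at least one (label, prediction) pair")
    pairs.map { case (label, pred) =>
      val diff = label - pred
      diff * diff
    }.sum / pairs.size
  }
}
```

When both labels and predictions are restricted to 0.0 and 1.0, every squared difference is either 0 or 1, so the MSE coincides with the fraction of points predicted incorrectly.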

In [4]:
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel
import org.apache.spark.mllib.util.MLUtils

// Load and parse the data file.
val data = MLUtils.loadLibSVMFile(sc, PATH + "data/mllib/sample_libsvm_data.txt")
// Split the data into training and test sets (30% held out for testing)
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

// Train a GradientBoostedTrees model.
//  defaultParams("Regression") selects SquaredError as the loss function.
val boostingStrategy = BoostingStrategy.defaultParams("Regression")
boostingStrategy.numIterations = 3 // Note: Use more iterations in practice.
boostingStrategy.treeStrategy.maxDepth = 5
//  Empty categoricalFeaturesInfo indicates all features are continuous.
boostingStrategy.treeStrategy.categoricalFeaturesInfo = Map[Int, Int]()

val model = GradientBoostedTrees.train(trainingData, boostingStrategy)

// Evaluate model on test instances and compute the test mean squared error
val labelsAndPredictions = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val testMSE = labelsAndPredictions.map { case (v, p) => math.pow(v - p, 2) }.mean()
println("Test Mean Squared Error = " + testMSE)
println("Learned regression GBT model:\n" + model.toDebugString)


Test Mean Squared Error = 0.13333333333333333
Learned regression GBT model:
TreeEnsembleModel regressor with 3 trees

  Tree 0:
    If (feature 405 <= 0.0)
     If (feature 99 <= 0.0)
      Predict: 0.0
     Else (feature 99 > 0.0)
      Predict: 1.0
    Else (feature 405 > 0.0)
     Predict: 1.0
  Tree 1:
    Predict: 0.0
  Tree 2:
    Predict: 0.0
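
One plausible reading of why Trees 1 and 2 collapse to a single `Predict: 0.0` leaf is that Tree 0 already fits the training labels exactly, so nothing is left for later trees to correct. The sketch below illustrates this under the assumption that each new tree is fit to the negative gradient of the squared error (y − F)², that is 2·(y − F). `SquaredErrorSketch` and its members are hypothetical names.

```scala
// Sketch of the pseudo-residuals a squared-error GBT fits at each stage.
// `SquaredErrorSketch` and its members are hypothetical names; fitting trees
// to the negative gradient is an assumption about the training procedure.
object SquaredErrorSketch {
  // Negative gradient of (y - F)^2 with respect to the current score F.
  def negativeGradient(label: Double, score: Double): Double =
    2.0 * (label - score)

  // Targets for the next tree, given labels and current ensemble scores.
  def pseudoResiduals(labels: Seq[Double], scores: Seq[Double]): Seq[Double] =
    labels.zip(scores).map { case (y, f) => negativeGradient(y, f) }
}
```

If the scores equal the labels, as in `pseudoResiduals(Seq(0.0, 1.0, 1.0), Seq(0.0, 1.0, 1.0))`, every residual is 0.0 and the next tree has nothing to learn. Since every ensemble prediction is then 0.0 or 1.0, the reported MSE of 0.1333 (which equals 2/15) would simply be the fraction of test points predicted incorrectly, assuming the labels are themselves 0.0 or 1.0.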

