In [1]:
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [2]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('rf').getOrCreate()

In [3]:
# Load and parse the data file, converting it to a DataFrame.
data = spark.read.format("libsvm").load("/FileStore/tables/sample_libsvm_data.txt")

In [4]:
data.show(5)

In [5]:
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])
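A side note on the split above: `randomSplit` takes an optional `seed` argument, so something like `data.randomSplit([0.7, 0.3], seed=42)` should give a repeatable split for the same data and partitioning. The 70/30 sizes are only approximate, because rows are assigned at random one by one. Here is a minimal plain-Python sketch of that idea. The `bernoulli_split` helper is hypothetical and only illustrates the concept; it is not Spark's actual implementation.

```python
import random


def bernoulli_split(rows, train_fraction=0.7, seed=42):
    """Assign each row to train or test with an independent random draw.

    Hypothetical helper for illustration only; not Spark's implementation.
    """
    rng = random.Random(seed)  # fixed seed -> the same split on every run
    train, test = [], []
    for row in rows:
        # Each row lands in train with probability train_fraction.
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(100))
train_a, test_a = bernoulli_split(rows, seed=42)
train_b, test_b = bernoulli_split(rows, seed=42)

# Sizes are close to 70/30 but not guaranteed to hit it exactly.
print(len(train_a), len(test_a))
# The same seed reproduces the same split.
print(train_a == train_b)
```

Without a seed, every run of the notebook trains and tests on different rows, so the test errors below will vary from run to run.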

# Decision Tree Example

This is a quick walkthrough of the Decision Tree example from the Spark documentation:

In [7]:
from pyspark.ml.classification import DecisionTreeClassifier

In [8]:
# Create a Decision Tree classifier with default settings.
dt = DecisionTreeClassifier()

# Fit the model on the training data.
model = dt.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.show(5)

In [9]:
# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g" % (1.0 - accuracy))
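To make the metric concrete: with `metricName="accuracy"`, the evaluator reports the fraction of rows whose prediction equals the label, and the test error is one minus that. Below is a small plain-Python sketch of the same arithmetic. The `error_rate` helper and the example pairs are made up for illustration and are not the notebook's actual output.

```python
def error_rate(pairs):
    """Return 1 - accuracy for an iterable of (prediction, label) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (prediction, label) pair")
    # Count the rows where the prediction matches the true label.
    correct = sum(1 for prediction, label in pairs if prediction == label)
    accuracy = correct / len(pairs)
    return 1.0 - accuracy


# Hypothetical (prediction, label) pairs, not taken from the notebook.
example_pairs = [(1.0, 1.0), (0.0, 0.0), (1.0, 0.0), (0.0, 0.0)]

# 3 of 4 predictions are correct, so accuracy is 0.75 and the error is 0.25.
print("Test Error = %g" % error_rate(example_pairs))
```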

# Random Forest Example

This is a quick walkthrough of the Random Forest example from the Spark documentation:

In [11]:
from pyspark.ml.classification import RandomForestClassifier

In [12]:
# Create a Random Forest classifier with 150 trees.
rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=150)

# Fit the model on the training data.
model = rf.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.show(5)

In [13]:
# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g" % (1.0 - accuracy))

In [14]:
# Feature (variable) importances

# This sample dataset is not a very good example to show this!
model.featureImportances
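The importances come back as a vector with one entry per feature, usually a sparse one that only lists the features the trees actually split on. A quick way to read it is to rank the non-zero entries. The sketch below does that in plain Python. The `top_features` helper and the numbers are hypothetical; with the real model you would presumably pass in `model.featureImportances.indices` and `model.featureImportances.values` (assuming a sparse vector), or work from `model.featureImportances.toArray()`.

```python
def top_features(indices, values, k=3):
    """Return the k (feature_index, importance) pairs with the largest importance."""
    ranked = sorted(zip(indices, values), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical importances, not the notebook's actual output.
indices = [100, 234, 351, 406]
values = [0.1, 0.4, 0.2, 0.3]

print(top_features(indices, values, k=2))
# prints [(234, 0.4), (406, 0.3)]
```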

## Gradient Boosted Trees

Gradient-boosted trees (GBTs) are a popular classification and regression method using ensembles of decision trees. More information about the spark.ml implementation can be found in the documentation section on [GBTs](http://spark.apache.org/docs/latest/ml-classification-regression.html#gradient-boosted-trees-gbts). For more information on the algorithm itself, please see the [spark.mllib documentation on GBTs](http://spark.apache.org/docs/latest/mllib-ensembles.html#gradient-boosted-trees-gbts).

Luckily, Spark makes GBTs very easy to use. It is basically just an import switch:

In [16]:
from pyspark.ml.classification import GBTClassifier

In [17]:
# Create a GBT classifier with 10 boosting iterations.
gbt = GBTClassifier(labelCol="label", featuresCol="features", maxIter=10)

# Fit the model on the training data.
model = gbt.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("prediction", "label", "features").show(5)

In [18]:
# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g" % (1.0 - accuracy))
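Since the three walkthroughs differ only in which classifier gets created, the fit / transform / evaluate steps can be wrapped in one small helper. This is just a sketch: `fit_and_score` is a hypothetical name, and the `MajorityLabel` and `Accuracy` classes are stand-ins so the block runs without a Spark session. They only mimic the `fit`, `transform`, and `evaluate` methods used above.

```python
def fit_and_score(estimator, train, test, evaluator):
    """Fit the estimator, predict on the test set, and return the test error."""
    model = estimator.fit(train)
    predictions = model.transform(test)
    return 1.0 - evaluator.evaluate(predictions)


class MajorityLabel:
    """Stand-in estimator: always predicts the most common training label."""

    def fit(self, train):
        labels = [label for _, label in train]
        self.prediction = max(set(labels), key=labels.count)
        return self

    def transform(self, test):
        # Return (prediction, label) pairs, loosely mirroring a predictions table.
        return [(self.prediction, label) for _, label in test]


class Accuracy:
    """Stand-in evaluator: fraction of predictions that match the label."""

    def evaluate(self, predictions):
        return sum(p == y for p, y in predictions) / len(predictions)


# Hypothetical (features, label) rows, not the libsvm sample data.
train = [([0.0], 1.0), ([1.0], 1.0), ([2.0], 0.0)]
test = [([3.0], 1.0), ([4.0], 0.0)]

print("Test Error = %g" % fit_and_score(MajorityLabel(), train, test, Accuracy()))
```

In the notebook itself, the call would look something like `fit_and_score(GBTClassifier(labelCol="label", featuresCol="features", maxIter=10), trainingData, testData, evaluator)`, and swapping in `DecisionTreeClassifier()` or `RandomForestClassifier(...)` is the only change needed to compare models.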

This data isn't realistic enough to judge the effectiveness of GBT models: it makes them look perfect, instead of just an improvement on ordinary Random Forests.

Let's move on to a more realistic example!