# A random forest classification engine with Spark ML

Dr Jose M. Albornoz, May 2019

In this notebook I will build a classifier using Spark's sample classification dataset

In [1]:
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator  
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator  
import org.apache.spark.mllib.evaluation.MulticlassMetrics  
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics  
import org.apache.spark.ml.classification.RandomForestClassifier  
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit, CrossValidator}  
import org.apache.spark.ml.feature.{VectorAssembler, StringIndexer, OneHotEncoderEstimator}  
import org.apache.spark.ml.linalg.Vectors  
import org.apache.spark.ml.Pipeline  
import org.apache.log4j._  
Logger.getLogger("org").setLevel(Level.ERROR) 

Intitializing Scala interpreter ...

Spark Web UI available at http://DESKTOP-FQ2BOOJ:4040
SparkContext available as 'sc' (version = 2.4.0, master = local[*], app id = local-1558006945508)
SparkSession available as 'spark'


import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit, CrossValidator}
import org.apache.spark.ml.feature.{VectorAssembler, StringIndexer, OneHotEncoderEstimator}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.Pipeline
import org.apache.log4j._


# 1.- Load data

In [2]:
// Prepare training and test data.
val data = spark.read.format("libsvm").load("/home/jmalbornoz/Downloads/spark-2.4.0-bin-hadoop2.7/data/mllib/sample_binary_classification_data.txt")

data: org.apache.spark.sql.DataFrame = [label: double, features: vector]


In [3]:
data.show(5)

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|(692,[127,128,129...|
|  1.0|(692,[158,159,160...|
|  1.0|(692,[124,125,126...|
|  1.0|(692,[152,153,154...|
|  1.0|(692,[151,152,153...|
+-----+--------------------+
only showing top 5 rows



In [4]:
data.count

res2: Long = 100


# 2.- Train-test split

In [5]:
// Splitting the data by create an array of the training and test data
val Array(training, test) = data.select("label","features").randomSplit(Array(0.7, 0.3), seed = 801)

training: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [label: double, features: vector]
test: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [label: double, features: vector]


In [6]:
training.count

res3: Long = 71


In [7]:
test.count

res4: Long = 29


# 3.- Random forest model

I will now create a model object (I’m using a Random Forest Classifier), define a parameter grid (I kept it simple and only varied the number of trees), create a Cross Validator object (here is where we set our scoring metric for training the model) and fit the model.

In [8]:
// create the model
val rf = new RandomForestClassifier()

rf: org.apache.spark.ml.classification.RandomForestClassifier = rfc_162afc2d4152


In [9]:
// create the param grid
val paramGrid = new ParamGridBuilder().addGrid(rf.numTrees,Array(20,50,100)).build()

paramGrid: Array[org.apache.spark.ml.param.ParamMap] =
Array({
	rfc_162afc2d4152-numTrees: 20
}, {
	rfc_162afc2d4152-numTrees: 50
}, {
	rfc_162afc2d4152-numTrees: 100
})


In [10]:
// create cross val object, define scoring metric
val cv = new CrossValidator().
  setEstimator(rf).
  setEvaluator(new BinaryClassificationEvaluator().setMetricName("areaUnderPR")).
  setEstimatorParamMaps(paramGrid).
  setNumFolds(3).
setParallelism(2)

cv: org.apache.spark.ml.tuning.CrossValidator = cv_76a84baed0a1


# 3.- Model training

In [11]:
// You can then treat this object as the model and use fit on it.
val model = cv.fit(training)

model: org.apache.spark.ml.tuning.CrossValidatorModel = cv_76a84baed0a1


In [12]:
model.avgMetrics

res5: Array[Double] = Array(1.0, 1.0, 1.0)


In [13]:
model.bestModel 

res6: org.apache.spark.ml.Model[_] = RandomForestClassificationModel (uid=rfc_162afc2d4152) with 20 trees


# 4.- Model evaluation

This is a little more difficult because the evaluation functionality still mostly resides in the RDD-API for Spark, requiring some different syntax. Let’s begin by getting predictions on our test data and storing them.

In [14]:
val results = model.transform(test).select("features", "label", "prediction")

results: org.apache.spark.sql.DataFrame = [features: vector, label: double ... 1 more field]


In [15]:
results.show

+--------------------+-----+----------+
|            features|label|prediction|
+--------------------+-----+----------+
|(692,[121,122,123...|  0.0|       0.0|
|(692,[122,123,124...|  0.0|       0.0|
|(692,[123,124,125...|  0.0|       0.0|
|(692,[123,124,125...|  0.0|       0.0|
|(692,[124,125,126...|  0.0|       0.0|
|(692,[124,125,126...|  0.0|       0.0|
|(692,[124,125,126...|  0.0|       0.0|
|(692,[126,127,128...|  0.0|       0.0|
|(692,[127,128,129...|  0.0|       0.0|
|(692,[153,154,155...|  0.0|       0.0|
|(692,[154,155,156...|  0.0|       0.0|
|(692,[123,124,125...|  1.0|       1.0|
|(692,[123,124,125...|  1.0|       1.0|
|(692,[123,124,125...|  1.0|       1.0|
|(692,[123,124,125...|  1.0|       1.0|
|(692,[124,125,126...|  1.0|       1.0|
|(692,[125,126,127...|  1.0|       1.0|
|(692,[125,126,127...|  1.0|       1.0|
|(692,[125,126,153...|  1.0|       1.0|
|(692,[126,127,128...|  1.0|       1.0|
+--------------------+-----+----------+
only showing top 20 rows



We will then convert these results to an RDD.

In [16]:
val predictionAndLabels = results.select($"prediction",$"label").as[(Double, Double)].rdd

predictionAndLabels: org.apache.spark.rdd.RDD[(Double, Double)] = MapPartitionsRDD[604] at rdd at <console>:38


We now create our metrics objects and print out the confusion matrix.

In [17]:
// Instantiate a new metrics objects
val bMetrics = new BinaryClassificationMetrics(predictionAndLabels)
val mMetrics = new MulticlassMetrics(predictionAndLabels)
val labels = mMetrics.labels

// Print out the Confusion matrix
println("Confusion matrix:")
println(mMetrics.confusionMatrix)

Confusion matrix:
11.0  0.0   
0.0   18.0  


bMetrics: org.apache.spark.mllib.evaluation.BinaryClassificationMetrics = org.apache.spark.mllib.evaluation.BinaryClassificationMetrics@46fad8a9
mMetrics: org.apache.spark.mllib.evaluation.MulticlassMetrics = org.apache.spark.mllib.evaluation.MulticlassMetrics@7908ae02
labels: Array[Double] = Array(0.0, 1.0)


We will now use the numbers in the confusion matrix to calculate some useful metrics.

In [18]:
// Precision by label
labels.foreach { l =>
  println(s"Precision($l) = " + mMetrics.precision(l))
}

// Recall by label
labels.foreach { l =>
  println(s"Recall($l) = " + mMetrics.recall(l))
}

// False positive rate by label
labels.foreach { l =>
  println(s"FPR($l) = " + mMetrics.falsePositiveRate(l))
}

// F-measure by label
labels.foreach { l =>
  println(s"F1-Score($l) = " + mMetrics.fMeasure(l))
}

Precision(0.0) = 1.0
Precision(1.0) = 1.0
Recall(0.0) = 1.0
Recall(1.0) = 1.0
FPR(0.0) = 0.0
FPR(1.0) = 0.0
F1-Score(0.0) = 1.0
F1-Score(1.0) = 1.0


In [19]:
// Precision by threshold
val precision = bMetrics.precisionByThreshold
precision.foreach { case (t, p) =>
  println(s"Threshold: $t, Precision: $p")
}

// Recall by threshold
val recall = bMetrics.recallByThreshold
recall.foreach { case (t, r) =>
  println(s"Threshold: $t, Recall: $r")
}

// Precision-Recall Curve
val PRC = bMetrics.pr

// F-measure
val f1Score = bMetrics.fMeasureByThreshold
f1Score.foreach { case (t, f) =>
  println(s"Threshold: $t, F-score: $f, Beta = 1")
}

Threshold: 1.0, Precision: 1.0
Threshold: 0.0, Precision: 0.6206896551724138
Threshold: 1.0, Recall: 1.0
Threshold: 0.0, Recall: 1.0
Threshold: 1.0, F-score: 1.0, Beta = 1
Threshold: 0.0, F-score: 0.7659574468085107, Beta = 1


precision: org.apache.spark.rdd.RDD[(Double, Double)] = MapPartitionsRDD[621] at map at BinaryClassificationMetrics.scala:214
recall: org.apache.spark.rdd.RDD[(Double, Double)] = MapPartitionsRDD[622] at map at BinaryClassificationMetrics.scala:214
PRC: org.apache.spark.rdd.RDD[(Double, Double)] = UnionRDD[625] at union at BinaryClassificationMetrics.scala:110
f1Score: org.apache.spark.rdd.RDD[(Double, Double)] = MapPartitionsRDD[626] at map at BinaryClassificationMetrics.scala:214


In [20]:
val beta = 0.5
val fScore = bMetrics.fMeasureByThreshold(beta)
f1Score.foreach { case (t, f) =>
  println(s"Threshold: $t, F-score: $f, Beta = 0.5")
}

// AUPRC
val auPRC = bMetrics.areaUnderPR
println("Area under precision-recall curve = " + auPRC)

// Compute thresholds used in ROC and PR curves
val thresholds = precision.map(_._1)

// ROC Curve
val roc = bMetrics.roc

// AUROC
val auROC = bMetrics.areaUnderROC
println("Area under ROC = " + auROC)

Threshold: 1.0, F-score: 1.0, Beta = 0.5
Threshold: 0.0, F-score: 0.7659574468085107, Beta = 0.5
Area under precision-recall curve = 1.0
Area under ROC = 1.0


beta: Double = 0.5
fScore: org.apache.spark.rdd.RDD[(Double, Double)] = MapPartitionsRDD[627] at map at BinaryClassificationMetrics.scala:214
auPRC: Double = 1.0
thresholds: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[633] at map at <console>:52
roc: org.apache.spark.rdd.RDD[(Double, Double)] = UnionRDD[637] at UnionRDD at BinaryClassificationMetrics.scala:90
auROC: Double = 1.0
