# A random forest classification engine with Spark ML

Dr Jose M. Albornoz, April 2019

In this notebook I will build a classifier that predicts whether a student passes a course based on data accumulated throughout the entire course. I will use the Harvard EdX dataset, available at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/26147

I’m not going to do much feature engineering because I want to focus on the mechanics of training the model in Spark.

In [1]:
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator  
import org.apache.spark.mllib.evaluation.MulticlassMetrics  
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics  
import org.apache.spark.ml.classification.RandomForestClassifier  
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit, CrossValidator}  
import org.apache.spark.ml.feature.{VectorAssembler, StringIndexer, OneHotEncoderEstimator}  
import org.apache.spark.ml.linalg.Vectors  
import org.apache.spark.ml.Pipeline  
import org.apache.log4j._  
Logger.getLogger("org").setLevel(Level.ERROR) 

Intitializing Scala interpreter ...

Spark Web UI available at http://192.168.1.192:4040
SparkContext available as 'sc' (version = 2.4.0, master = local[*], app id = local-1556116701375)
SparkSession available as 'spark'




# 1.- Load training data

In [2]:
val data = spark.read.option("header","true").option("inferSchema","true").format("csv").load("mooc.csv")

data: org.apache.spark.sql.DataFrame = [course_id: string, userid_DI: string ... 18 more fields]


In [3]:
data.show(5)

+--------------------+--------------+----------+------+--------+---------+-----------------+------+---+------+-----+-------------------+-------------------+-------+---------+-----------+---------+------------+-----+---------------+
|           course_id|     userid_DI|registered|viewed|explored|certified|final_cc_cname_DI|LoE_DI|YoB|gender|grade|      start_time_DI|      last_event_DI|nevents|ndays_act|nplay_video|nchapters|nforum_posts|roles|incomplete_flag|
+--------------------+--------------+----------+------+--------+---------+-----------------+------+---+------+-----+-------------------+-------------------+-------+---------+-----------+---------+------------+-----+---------------+
|HarvardX/CB22x/20...|MHxPC130442623|         1|     0|       0|        0|    United States|    NA| NA|    NA|    0|2012-12-19 00:00:00|2013-11-17 00:00:00|   null|        9|       null|     null|           0| null|              1|
| HarvardX/CS50x/2012|MHxPC130442623|         1|     1|       0|        

# 2.- Data pre-processing

## 2.1.- Selection of relevant columns 

Spark ML estimators read their input from two columns, by default named “label” and “features”. Getting there takes a few steps. First, we use the select method to rename the target column to “label” and keep only the relevant columns.

In [4]:
val df = (data.select(data("certified").as("label"), $"registered", $"viewed", $"explored", 
          $"final_cc_cname_DI", $"gender", $"nevents", $"ndays_act", $"nplay_video", $"nchapters", $"nforum_posts"))

df: org.apache.spark.sql.DataFrame = [label: int, registered: int ... 9 more fields]


Wrapping the entire expression in parentheses lets you break it across lines freely; without them, the Scala interpreter may treat a line as a complete statement and fail on the next one.

In [5]:
df.show(5)

+-----+----------+------+--------+-----------------+------+-------+---------+-----------+---------+------------+
|label|registered|viewed|explored|final_cc_cname_DI|gender|nevents|ndays_act|nplay_video|nchapters|nforum_posts|
+-----+----------+------+--------+-----------------+------+-------+---------+-----------+---------+------------+
|    0|         1|     0|       0|    United States|    NA|   null|        9|       null|     null|           0|
|    0|         1|     1|       0|    United States|    NA|   null|        9|       null|      1.0|           0|
|    0|         1|     0|       0|    United States|    NA|   null|       16|       null|     null|           0|
|    0|         1|     0|       0|    United States|    NA|   null|       16|       null|     null|           0|
|    0|         1|     0|       0|    United States|    NA|   null|       16|       null|     null|           0|
+-----+----------+------+--------+-----------------+------+-------+---------+-----------+---------+------------+

In [6]:
df.count

res3: Long = 641138


## 2.2.- One-hot encoding of categorical features

Next we will do some one-hot encoding on our categorical features. This takes a few steps. First we have to use the StringIndexer to convert the strings to integers. Then we have to use the OneHotEncoderEstimator to do the encoding.

In [7]:
// string indexing
val indexer1 = new StringIndexer().
    setInputCol("final_cc_cname_DI").
    setOutputCol("countryIndex").
    setHandleInvalid("keep") 
val indexed1 = indexer1.fit(df).transform(df)

val indexer2 = new StringIndexer().
    setInputCol("gender").
    setOutputCol("genderIndex").
    setHandleInvalid("keep")
val indexed2 = indexer2.fit(indexed1).transform(indexed1)

// one hot encoding
val encoder = new OneHotEncoderEstimator().
  setInputCols(Array("countryIndex", "genderIndex")).
  setOutputCols(Array("countryVec", "genderVec"))
val encoded = encoder.fit(indexed2).transform(indexed2)

indexer1: org.apache.spark.ml.feature.StringIndexer = strIdx_025904f265de
indexed1: org.apache.spark.sql.DataFrame = [label: int, registered: int ... 10 more fields]
indexer2: org.apache.spark.ml.feature.StringIndexer = strIdx_446e86b7c0d5
indexed2: org.apache.spark.sql.DataFrame = [label: int, registered: int ... 11 more fields]
encoder: org.apache.spark.ml.feature.OneHotEncoderEstimator = oneHotEncoder_56d9833e7f4e
encoded: org.apache.spark.sql.DataFrame = [label: int, registered: int ... 13 more fields]


With the *.setHandleInvalid("keep")* option, the indexer does not fail on labels it did not see during fitting (which may happen on a test set, for example); it assigns them all to one additional index placed after the fitted labels.
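To make the two steps concrete, here is a minimal plain-Scala sketch of what string indexing followed by one-hot encoding does to a single column. It is an illustration under simplifying assumptions, not Spark's implementation: the sample values, the alphabetical tie-breaking and the helper names (`sampleGender`, `fittedLabels`, `indexFor`, `oneHotFor`) are all made up for this example.

```scala
// Hypothetical sample of a categorical column (not taken from the dataset).
val sampleGender = Seq("m", "f", "m", "NA", "m", "f")

// Order labels by descending frequency, as StringIndexer does by default.
// Ties are broken alphabetically here, which is a choice made for this sketch.
val fittedLabels: Seq[String] = sampleGender.groupBy(identity).toSeq
  .map { case (label, occurrences) => (label, occurrences.size) }
  .sortBy { case (label, count) => (-count, label) }
  .map(_._1)

val labelToIndex: Map[String, Int] = fittedLabels.zipWithIndex.toMap

// "keep" behaviour: every label unseen during fitting shares one extra index.
def indexFor(value: String): Int = labelToIndex.getOrElse(value, fittedLabels.size)

// One-hot encoding with the last category dropped: one slot per fitted label,
// so the extra "unseen" index is encoded as the all-zeros vector.
def oneHotFor(i: Int): Vector[Double] =
  Vector.tabulate(fittedLabels.size)(j => if (j == i) 1.0 else 0.0)

println(fittedLabels)
println(oneHotFor(indexFor("f")))
println(oneHotFor(indexFor("o")))
```

In this sketch the most frequent label gets index 0, and a value such as "o", absent from the fitted sample, maps to the extra index and therefore to a vector with no active slot.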

In [8]:
encoded.show(5)

+-----+----------+------+--------+-----------------+------+-------+---------+-----------+---------+------------+------------+-----------+--------------+-------------+
|label|registered|viewed|explored|final_cc_cname_DI|gender|nevents|ndays_act|nplay_video|nchapters|nforum_posts|countryIndex|genderIndex|    countryVec|    genderVec|
+-----+----------+------+--------+-----------------+------+-------+---------+-----------+---------+------------+------------+-----------+--------------+-------------+
|    0|         1|     0|       0|    United States|    NA|   null|        9|       null|     null|           0|         0.0|        2.0|(34,[0],[1.0])|(4,[2],[1.0])|
|    0|         1|     1|       0|    United States|    NA|   null|        9|       null|      1.0|           0|         0.0|        2.0|(34,[0],[1.0])|(4,[2],[1.0])|
|    0|         1|     0|       0|    United States|    NA|   null|       16|       null|     null|           0|         0.0|        2.0|(34,[0],[1.0])|(4,[2],[1.0])

## 2.3.- Checking for null values

In [9]:
val nanEvents = encoded.groupBy("nevents").count().orderBy($"count".desc)
val nanEvents1 = nanEvents.withColumn("proportion", $"count"*100/641138)
nanEvents1.show

+-------+------+-------------------+
|nevents| count|         proportion|
+-------+------+-------------------+
|   null|199151| 31.062111433108004|
|      1| 63565|  9.914402203581757|
|      2| 34329|  5.354385483312485|
|      3| 17669| 2.7558809491872265|
|      4| 12217|  1.905518000804819|
|      5|  9850|  1.536330711952809|
|      6|  8480| 1.3226481662294234|
|      7|  7259| 1.1322055470117198|
|      8|  6648|  1.036906251072312|
|      9|  6076| 0.9476898889162707|
|     10|  5621| 0.8767223281103288|
|     11|  5197| 0.8105899197988576|
|     12|  4870| 0.7595868596152466|
|     13|  4476| 0.6981336311371343|
|     14|  4222| 0.6585165752146964|
|     15|  3891| 0.6068896243866375|
|     16|  3808| 0.5939438935143448|
|     17|  3545| 0.5529230836418992|
|     18|  3349|  0.522352442063955|
|     19|  3144|0.49037804653600314|
+-------+------+-------------------+
only showing top 20 rows



nanEvents: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [nevents: int, count: bigint]
nanEvents1: org.apache.spark.sql.DataFrame = [nevents: int, count: bigint ... 1 more field]


In [10]:
val nanNdays = encoded.groupBy("ndays_act").count().orderBy($"count".desc)
val nanNdays1 = nanNdays.withColumn("proportion", $"count"*100/641138)
nanNdays1.show

+---------+------+-------------------+
|ndays_act| count|         proportion|
+---------+------+-------------------+
|        1|209941| 32.745056446506055|
|     null|162743| 25.383458787343756|
|        2| 80625| 12.575295802151798|
|        3| 43081|  6.719458213364361|
|        4| 26813| 4.1820949623949915|
|        5| 18552| 2.8936048089490876|
|        6| 13239| 2.0649220604612424|
|        7| 10281| 1.6035549288920639|
|        8|  8075| 1.2594792384790794|
|        9|  6510| 1.0153820238388616|
|       10|  5324| 0.8303984477600767|
|       11|  4415| 0.6886192988093047|
|       12|  3815| 0.5950357021421285|
|       13|  3323| 0.5182971528750441|
|       14|  2794| 0.4357876151468171|
|       15|  2542|0.39648250454660305|
|       16|  2146| 0.3347173307462668|
|       17|  2002| 0.3122572675461445|
|       18|  1852|0.28886136837935045|
|       19|  1564| 0.2439412419791059|
+---------+------+-------------------+
only showing top 20 rows



nanNdays: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [ndays_act: int, count: bigint]
nanNdays1: org.apache.spark.sql.DataFrame = [ndays_act: int, count: bigint ... 1 more field]


In [11]:
val nanPlayVideo = encoded.groupBy("nplay_video").count().orderBy($"count".desc)
val nanPlayVideo1 = nanPlayVideo.withColumn("proportion", $"count"*100/641138)
nanPlayVideo1.show

+-----------+------+-------------------+
|nplay_video| count|         proportion|
+-----------+------+-------------------+
|       null|457530|   71.3621716385552|
|          1| 16968| 2.6465441137477423|
|          2| 11000| 1.7156992722315632|
|          3|  8371| 1.3056471461682195|
|          4|  6995| 1.0910287644781622|
|          5|  5992| 0.9345881853828661|
|          6|  5373| 0.8380411081545627|
|          7|  4714| 0.7352551244817809|
|          8|  4296| 0.6700585521369814|
|          9|  4076| 0.6357445666923501|
|         10|  3620| 0.5646210332252962|
|         11|  3453| 0.5385735988195989|
|         12|  3187| 0.4970848709638175|
|         13|  2853|0.44499000215242274|
|         14|  2641|0.41192379799668716|
|         15|  2453| 0.3826009377076386|
|         16|  2401|0.37449035932981667|
|         17|  2138| 0.3334695494573711|
|         18|  2085| 0.3252029984184372|
|         19|  1887| 0.2943204115182691|
+-----------+------+-------------------+
only showing top 20 rows

nanPlayVideo: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [nplay_video: int, count: bigint]
nanPlayVideo1: org.apache.spark.sql.DataFrame = [nplay_video: int, count: bigint ... 1 more field]


In [12]:
val nanNChapters = encoded.groupBy("nchapters").count().orderBy($"count".desc)
val nanNChapters1 = nanNChapters.withColumn("proportion", $"count"*100/641138)
nanNChapters1.show

+---------+------+-------------------+
|nchapters| count|         proportion|
+---------+------+-------------------+
|     null|258753|  40.35839398070306|
|      1.0|121837| 19.003241111897907|
|      2.0|110085|  17.17025039851015|
|      3.0| 52296|  8.156746285511076|
|      4.0| 24937|  3.889490250148954|
|      5.0| 13838| 2.1583496844673067|
|      6.0|  8536|  1.331382635251693|
|     12.0|  7987|  1.245753644301227|
|      7.0|  6556| 1.0225567662500117|
|      8.0|  5009| 0.7812670595098091|
|      9.0|  4091| 0.6380841566090296|
|     10.0|  3598| 0.5611896346808332|
|     18.0|  3411| 0.5320227470528965|
|     11.0|  3258| 0.5081589299027667|
|     16.0|  2890| 0.4507609906135652|
|     15.0|  2684|0.41863062242450144|
|     13.0|  2053| 0.3202118732628545|
|     14.0|  2021|0.31522074810727174|
|     17.0|  1877|0.29276068490714946|
|     32.0|   657|0.10247403835055792|
+---------+------+-------------------+
only showing top 20 rows



nanNChapters: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [nchapters: double, count: bigint]
nanNChapters1: org.apache.spark.sql.DataFrame = [nchapters: double, count: bigint ... 1 more field]


For some columns the proportion of null values reaches 71%. We will impute the nulls in the four columns explored above using the median of each column.

In [13]:
// define medians (a relative error of 0 makes approxQuantile compute the exact quantile; nulls are ignored)
val neventsMedianArray = encoded.stat.approxQuantile("nevents", Array(0.5), 0)
val neventsMedian = neventsMedianArray(0)

val ndays_actMedianArray = encoded.stat.approxQuantile("ndays_act", Array(0.5), 0)
val ndays_actMedian = ndays_actMedianArray(0)

val nplay_videoMedianArray = encoded.stat.approxQuantile("nplay_video", Array(0.5), 0)
val nplay_videoMedian = nplay_videoMedianArray(0)

val nchaptersMedianArray = encoded.stat.approxQuantile("nchapters", Array(0.5), 0)
val nchaptersMedian = nchaptersMedianArray(0)

// replace nulls with the column medians
val filled = encoded.na.fill(Map(
  "nevents" -> neventsMedian,
  "ndays_act" -> ndays_actMedian,
  "nplay_video" -> nplay_videoMedian,
  "nchapters" -> nchaptersMedian))

neventsMedianArray: Array[Double] = Array(24.0)
neventsMedian: Double = 24.0
ndays_actMedianArray: Array[Double] = Array(2.0)
ndays_actMedian: Double = 2.0
nplay_videoMedianArray: Array[Double] = Array(18.0)
nplay_videoMedian: Double = 18.0
nchaptersMedianArray: Array[Double] = Array(2.0)
nchaptersMedian: Double = 2.0
filled: org.apache.spark.sql.DataFrame = [label: int, registered: int ... 13 more fields]
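As a small illustration of median imputation, the sketch below applies the same idea to a made-up column in plain Scala. The sample values and the names `sampleEvents`, `medianOfPresent` and `filledSample` are assumptions for this example; the rank rule used for the median (take the element at rank ceil(0.5 * n) of the sorted non-null values) is a simplification and may differ in detail from what approxQuantile does.

```scala
// Hypothetical sample of a count column; None stands for a null cell.
val sampleEvents: Seq[Option[Double]] =
  Seq(Some(1.0), None, Some(40.0), Some(3.0), None, Some(24.0), Some(2.0))

// Median over the non-null values only: sort them and take the element at
// rank ceil(0.5 * n), so the result is always a value present in the data.
def medianOfPresent(xs: Seq[Option[Double]]): Double = {
  val present = xs.flatten.sorted
  require(present.nonEmpty, "column has no non-null values")
  val rank = math.ceil(0.5 * present.size).toInt
  present(math.max(rank - 1, 0))
}

val sampleMedian = medianOfPresent(sampleEvents)

// Replace every missing cell with the median; present values are untouched.
val filledSample: Seq[Double] = sampleEvents.map(_.getOrElse(sampleMedian))

println(s"median = $sampleMedian, filled = $filledSample")
```

Because the median is computed from the non-null values only, a column that is mostly null ends up mostly equal to a single constant after imputation.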


## 2.4.- Construction of features column

We use VectorAssembler to build the features column. As noted earlier, the model reads only the “label” and “features” columns, so we keep just those two.

In [14]:
// Set the input columns as the features we want to use
val assembler = (new VectorAssembler().setInputCols(Array(
  "viewed", "explored", "nevents", "ndays_act", "nplay_video", 
  "nchapters", "nforum_posts", "countryVec", "genderVec")).
   setOutputCol("features"))

// Transform the DataFrame
val output = assembler.transform(filled).select($"label",$"features")

assembler: org.apache.spark.ml.feature.VectorAssembler = vecAssembler_aaa4867dfdb2
output: org.apache.spark.sql.DataFrame = [label: int, features: vector]


In [15]:
output.show

+-----+--------------------+
|label|            features|
+-----+--------------------+
|    0|(45,[2,3,4,5,7,43...|
|    0|(45,[0,2,3,4,5,7,...|
|    0|(45,[2,3,4,5,7,43...|
|    0|(45,[2,3,4,5,7,43...|
|    0|(45,[2,3,4,5,7,43...|
|    0|(45,[0,1,2,3,4,5,...|
|    0|(45,[2,3,4,5,7,43...|
|    0|(45,[0,2,3,4,5,32...|
|    0|(45,[0,2,3,4,5,7,...|
|    0|(45,[0,2,3,4,5,7,...|
|    0|(45,[0,2,3,4,5,7,...|
|    0|(45,[0,2,3,4,5,7,...|
|    0|(45,[0,2,3,4,5,7,...|
|    0|(45,[0,2,3,4,5,7,...|
|    0|(45,[0,2,3,4,5,7,...|
|    0|(45,[0,2,3,4,5,7,...|
|    0|(45,[0,2,3,4,5,7,...|
|    0|(45,[0,2,3,4,5,7,...|
|    0|(45,[0,2,3,4,5,6,...|
|    0|(45,[0,2,3,4,5,9,...|
+-----+--------------------+
only showing top 20 rows
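The sparse vectors shown above can be read with a little arithmetic. The sketch below, in plain Scala, recomputes where each input lands in the 45-slot vector, assuming the assembler concatenates its inputs in the order given and using the vector sizes printed earlier by `encoded.show` (34 for countryVec, 4 for genderVec). The names `inputWidths`, `slotOffsets` and the rest are made up for this example.

```scala
// Width of each assembler input, in order: seven numeric columns of width 1,
// then the two one-hot vectors with the sizes shown by encoded.show.
val inputWidths: Seq[(String, Int)] = Seq(
  "viewed" -> 1, "explored" -> 1, "nevents" -> 1, "ndays_act" -> 1,
  "nplay_video" -> 1, "nchapters" -> 1, "nforum_posts" -> 1,
  "countryVec" -> 34, "genderVec" -> 4)

// Starting offset of each input inside the assembled vector
// (running sum of the widths that come before it).
val slotOffsets: Map[String, Int] =
  inputWidths.map(_._1).zip(inputWidths.map(_._2).scanLeft(0)(_ + _)).toMap

val assembledSize = inputWidths.map(_._2).sum

// First row of the data: countryIndex 0.0 and genderIndex 2.0.
val countrySlot = slotOffsets("countryVec") + 0
val genderSlot = slotOffsets("genderVec") + 2

println(s"size = $assembledSize, country slot = $countrySlot, gender slot = $genderSlot")
```

Under these assumptions the first row should have active slots 2 to 5 (the four imputed counts), 7 (country) and 43 (gender), which agrees with the `(45,[2,3,4,5,7,43...` entry in the output; viewed, explored and nforum_posts are zero for that row and so do not appear.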



## 2.5.- Train-test split

In [16]:
// Split the data into training (70%) and test (30%) sets
val Array(training, test) = output.select("label","features").randomSplit(Array(0.7, 0.3), seed = 801)

training: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [label: int, features: vector]
test: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [label: int, features: vector]


In [17]:
training.count

res10: Long = 449388


In [18]:
test.count

res11: Long = 191750
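The two counts add up to the full 641138 rows, but the training share is about 70.1% rather than exactly 70%. That is expected if each row is assigned to a split by an independent random draw, which is how randomSplit behaves. A minimal plain-Scala sketch of that idea follows; it uses scala.util.Random rather than Spark's sampler, and the row ids and names (`sampleRows`, `trainRows`, `testRows`) are hypothetical.

```scala
import scala.util.Random

// Hypothetical row ids standing in for DataFrame rows.
val sampleRows = (0 until 10000).toVector
val splitRng = new Random(801L)

// Each row independently lands in the first split with probability 0.7,
// so the split sizes are only approximately 70/30.
val (trainRows, testRows) = sampleRows.partition(_ => splitRng.nextDouble() < 0.7)

val trainFraction = trainRows.size.toDouble / sampleRows.size
println(s"train = ${trainRows.size}, test = ${testRows.size}, fraction = $trainFraction")
```

Fixing the seed makes the assignment reproducible, but it does not make the proportions exact.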


# 3.- Random forest model

I will now create a model object (a random forest classifier), define a parameter grid (I kept it simple and only varied the number of trees), create a CrossValidator object (this is where we set the scoring metric used to compare the candidate models) and fit the model.

WARNING: This code will take some time to run! If you have a particularly old / underpowered computer, beware.

In [19]:
// create the model
val rf = new RandomForestClassifier()

rf: org.apache.spark.ml.classification.RandomForestClassifier = rfc_50f023f527ce


In [20]:
// create the param grid
val paramGrid = new ParamGridBuilder().addGrid(rf.numTrees,Array(20,50,100)).build()

paramGrid: Array[org.apache.spark.ml.param.ParamMap] =
Array({
	rfc_50f023f527ce-numTrees: 20
}, {
	rfc_50f023f527ce-numTrees: 50
}, {
	rfc_50f023f527ce-numTrees: 100
})


In [21]:
// create cross val object, define scoring metric
val cv = new CrossValidator().
  setEstimator(rf).
  setEvaluator(new MulticlassClassificationEvaluator().setMetricName("weightedRecall")).
  setEstimatorParamMaps(paramGrid).
  setNumFolds(3).
  setParallelism(2)

cv: org.apache.spark.ml.tuning.CrossValidator = cv_17e12cb8b37e
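A note on the chosen metric: weighted recall averages the per-class recall weighted by each class's share of the rows, which works out to the same number as plain accuracy. The plain-Scala sketch below checks this using the test-set counts from the confusion matrix reported in section 3.2; the case class and value names are made up for this example.

```scala
// Per actual class: number of rows (support) and rows predicted correctly,
// taken from the test-set confusion matrix reported in section 3.2.
case class ClassCounts(label: Double, support: Long, correct: Long)

val classCounts = Seq(
  ClassCounts(0.0, 185595L + 908L, 185595L),
  ClassCounts(1.0, 1960L + 3287L, 3287L))

val totalRows = classCounts.map(_.support).sum.toDouble

// Weighted recall: per-class recall weighted by the class's share of rows.
val weightedRecall =
  classCounts.map(c => (c.support / totalRows) * (c.correct.toDouble / c.support)).sum

// Accuracy: correctly classified rows over all rows.
val accuracy = classCounts.map(_.correct).sum / totalRows

// Score obtained by always predicting the majority class 0.
val majorityBaseline = classCounts.head.support / totalRows

println(s"weighted recall = $weightedRecall, accuracy = $accuracy, baseline = $majorityBaseline")
```

With classes this imbalanced, always predicting "not certified" already scores about 0.973 on this metric, so values around 0.985 are best read against that baseline.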


## 3.1.- Model training

In [22]:
// CrossValidator is itself an estimator, so we call fit on it as we would on a model
val model = cv.fit(training)

model: org.apache.spark.ml.tuning.CrossValidatorModel = cv_17e12cb8b37e


In [23]:
model.avgMetrics

res12: Array[Double] = Array(0.9846214019525908, 0.9847726928730518, 0.9847926896199338)


In [24]:
model.bestModel 

res13: org.apache.spark.ml.Model[_] = RandomForestClassificationModel (uid=rfc_50f023f527ce) with 100 trees
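The entries of avgMetrics line up with the parameter grid, so the best model is simply the grid point with the highest average score. The plain-Scala sketch below redoes that selection from the numbers printed above; the value names are made up for this example, and the fit count assumes one fit per fold and grid point plus a final refit of the winner on the full training set.

```scala
// Grid values and the fold-averaged scores reported by model.avgMetrics.
val gridNumTrees = Array(20, 50, 100)
val reportedAvgMetrics = Array(0.9846214019525908, 0.9847726928730518, 0.9847926896199338)

// For weightedRecall larger is better, so pick the grid point with the maximum.
val bestIndex = reportedAvgMetrics.indexOf(reportedAvgMetrics.max)
val bestNumTrees = gridNumTrees(bestIndex)

// Gap between the best and worst grid points.
val metricSpread = reportedAvgMetrics.max - reportedAvgMetrics.min

// One fit per (fold, grid point), plus the final refit of the best setting.
val foldCount = 3
val totalFits = foldCount * gridNumTrees.length + 1

println(s"best numTrees = $bestNumTrees, spread = $metricSpread, fits = $totalFits")
```

The spread between 20 and 100 trees is under 0.0002, so on this metric the extra trees buy very little.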


## 3.2.- Model evaluation

This is a little more involved because much of the evaluation functionality still lives in Spark's older RDD-based MLlib API, which requires a different syntax. Let’s begin by getting predictions on our test data and storing them.

In [25]:
val results = model.transform(test).select("features", "label", "prediction")

results: org.apache.spark.sql.DataFrame = [features: vector, label: int ... 1 more field]


In [26]:
results.show

+--------------------+-----+----------+
|            features|label|prediction|
+--------------------+-----+----------+
|(45,[0,1,2,3,4,5,...|    0|       0.0|
|(45,[0,1,2,3,4,5,...|    0|       0.0|
|(45,[0,1,2,3,4,5,...|    0|       1.0|
|(45,[0,1,2,3,4,5,...|    0|       1.0|
|(45,[0,1,2,3,4,5,...|    0|       1.0|
|(45,[0,1,2,3,4,5,...|    0|       1.0|
|(45,[0,1,2,3,4,5,...|    0|       0.0|
|(45,[0,1,2,3,4,5,...|    0|       0.0|
|(45,[0,1,2,3,4,5,...|    0|       0.0|
|(45,[0,1,2,3,4,5,...|    0|       0.0|
|(45,[0,1,2,3,4,5,...|    0|       0.0|
|(45,[0,1,2,3,4,5,...|    0|       0.0|
|(45,[0,1,2,3,4,5,...|    0|       0.0|
|(45,[0,1,2,3,4,5,...|    0|       0.0|
|(45,[0,1,2,3,4,5,...|    0|       0.0|
|(45,[0,1,2,3,4,5,...|    0|       0.0|
|(45,[0,1,2,3,4,5,...|    0|       0.0|
|(45,[0,1,2,3,4,5,...|    0|       1.0|
|(45,[0,1,2,3,4,5,...|    0|       0.0|
|(45,[0,1,2,3,4,5,...|    0|       0.0|
+--------------------+-----+----------+
only showing top 20 rows



We will then convert these results to an RDD.

In [27]:
val predictionAndLabels = results.select($"prediction",$"label").as[(Double, Double)].rdd

predictionAndLabels: org.apache.spark.rdd.RDD[(Double, Double)] = MapPartitionsRDD[719] at rdd at <console>:37


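We now create our metrics objects and print out the confusion matrix. In the matrix, rows correspond to actual classes and columns to predicted classes, both ordered by ascending label.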

In [28]:
// Instantiate the metrics objects (binary and multiclass)
val bMetrics = new BinaryClassificationMetrics(predictionAndLabels)
val mMetrics = new MulticlassMetrics(predictionAndLabels)
val labels = mMetrics.labels

// Print out the Confusion matrix
println("Confusion matrix:")
println(mMetrics.confusionMatrix)

Confusion matrix:
185595.0  908.0   
1960.0    3287.0  


bMetrics: org.apache.spark.mllib.evaluation.BinaryClassificationMetrics = org.apache.spark.mllib.evaluation.BinaryClassificationMetrics@7a4a9825
mMetrics: org.apache.spark.mllib.evaluation.MulticlassMetrics = org.apache.spark.mllib.evaluation.MulticlassMetrics@7deee5ac
labels: Array[Double] = Array(0.0, 1.0)


We will now use the numbers in the confusion matrix to calculate some useful metrics.

In [29]:
// Precision by label
labels.foreach { l =>
  println(s"Precision($l) = " + mMetrics.precision(l))
}

// Recall by label
labels.foreach { l =>
  println(s"Recall($l) = " + mMetrics.recall(l))
}

// False positive rate by label
labels.foreach { l =>
  println(s"FPR($l) = " + mMetrics.falsePositiveRate(l))
}

// F-measure by label
labels.foreach { l =>
  println(s"F1-Score($l) = " + mMetrics.fMeasure(l))
}

Precision(0.0) = 0.9895497320785903
Precision(1.0) = 0.7835518474374255
Recall(0.0) = 0.9951314456067731
Recall(1.0) = 0.6264532113588718
FPR(0.0) = 0.3735467886411283
FPR(1.0) = 0.004868554393226919
F1-Score(0.0) = 0.9923327398424844
F1-Score(1.0) = 0.6962507943232367
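The class 1 figures above can be reproduced by hand from the four cells of the confusion matrix. The plain-Scala sketch below does so, reading the matrix with actual classes in rows and predicted classes in columns; the value names are made up for this example.

```scala
// Test-set confusion matrix cells (rows: actual class, columns: predicted class).
val trueNeg = 185595.0  // actual 0, predicted 0
val falsePos = 908.0    // actual 0, predicted 1
val falseNeg = 1960.0   // actual 1, predicted 0
val truePos = 3287.0    // actual 1, predicted 1

// Metrics for class 1 ("certified").
val precision1 = truePos / (truePos + falsePos)
val recall1 = truePos / (truePos + falseNeg)
val fpr1 = falsePos / (falsePos + trueNeg)
val f1Class1 = 2 * precision1 * recall1 / (precision1 + recall1)

println(s"Precision(1.0) = $precision1")
println(s"Recall(1.0) = $recall1")
println(s"FPR(1.0) = $fpr1")
println(s"F1-Score(1.0) = $f1Class1")
```

Read this way, the model misses roughly 37% of the students who actually earned a certificate, even though overall accuracy is high.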


In [30]:
// Precision by threshold
val precision = bMetrics.precisionByThreshold
precision.foreach { case (t, p) =>
  println(s"Threshold: $t, Precision: $p")
}

// Recall by threshold
val recall = bMetrics.recallByThreshold
recall.foreach { case (t, r) =>
  println(s"Threshold: $t, Recall: $r")
}

// Precision-Recall Curve
val PRC = bMetrics.pr

// F-measure
val f1Score = bMetrics.fMeasureByThreshold
f1Score.foreach { case (t, f) =>
  println(s"Threshold: $t, F-score: $f, Beta = 1")
}

Threshold: 1.0, Precision: 0.7835518474374255
Threshold: 0.0, Precision: 0.027363754889178617
Threshold: 0.0, Recall: 1.0
Threshold: 1.0, Recall: 0.6264532113588718
Threshold: 0.0, F-score: 0.053269846748935264, Beta = 1
Threshold: 1.0, F-score: 0.6962507943232367, Beta = 1


precision: org.apache.spark.rdd.RDD[(Double, Double)] = MapPartitionsRDD[738] at map at BinaryClassificationMetrics.scala:214
recall: org.apache.spark.rdd.RDD[(Double, Double)] = MapPartitionsRDD[739] at map at BinaryClassificationMetrics.scala:214
PRC: org.apache.spark.rdd.RDD[(Double, Double)] = UnionRDD[742] at union at BinaryClassificationMetrics.scala:110
f1Score: org.apache.spark.rdd.RDD[(Double, Double)] = MapPartitionsRDD[743] at map at BinaryClassificationMetrics.scala:214


In [31]:
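val beta = 0.5
val fScore = bMetrics.fMeasureByThreshold(beta)
// Print the beta = 0.5 scores held in fScore. The recorded output below came from
// looping over f1Score instead, so its values are the beta = 1 scores.
fScore.foreach { case (t, f) =>
  println(s"Threshold: $t, F-score: $f, Beta = $beta")
}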

// AUPRC
val auPRC = bMetrics.areaUnderPR
println("Area under precision-recall curve = " + auPRC)

// Compute thresholds used in ROC and PR curves
val thresholds = precision.map(_._1)

// ROC Curve
val roc = bMetrics.roc

// AUROC
val auROC = bMetrics.areaUnderROC
println("Area under ROC = " + auROC)

Threshold: 1.0, F-score: 0.6962507943232367, Beta = 0.5
Threshold: 0.0, F-score: 0.053269846748935264, Beta = 0.5
Area under precision-recall curve = 0.6423160306473965
Area under ROC = 0.8107923284828225


beta: Double = 0.5
fScore: org.apache.spark.rdd.RDD[(Double, Double)] = MapPartitionsRDD[744] at map at BinaryClassificationMetrics.scala:214
auPRC: Double = 0.6423160306473965
thresholds: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[750] at map at <console>:51
roc: org.apache.spark.rdd.RDD[(Double, Double)] = UnionRDD[754] at UnionRDD at BinaryClassificationMetrics.scala:90
auROC: Double = 0.8107923284828225
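Only two thresholds (0.0 and 1.0) appear in the outputs above. This is most likely because predictionAndLabels holds the hard 0/1 predictions rather than class probabilities, so the binary metrics see just two distinct scores. With a single operating point, the ROC curve reduces to a two-segment line through (FPR, TPR), and the area under it can be checked by hand. The plain-Scala sketch below does that with the trapezoidal rule, using the class 1 recall and false positive rate printed earlier; the value names are made up for this example.

```scala
// Single operating point of the classifier, from the per-label metrics above.
val tprClass1 = 0.6264532113588718    // Recall(1.0)
val fprClass1 = 0.004868554393226919  // FPR(1.0)

// ROC polyline through the trivial end points and the one operating point.
val rocSketch = Seq((0.0, 0.0), (fprClass1, tprClass1), (1.0, 1.0))

// Trapezoidal rule over consecutive pairs of points.
val areaUnderRocSketch = rocSketch.sliding(2).map {
  case Seq((x0, y0), (x1, y1)) => (x1 - x0) * (y0 + y1) / 2.0
}.sum

println(s"Area under ROC (single operating point) = $areaUnderRocSketch")
```

The result agrees with the reported 0.8108, which supports this reading. Passing the positive-class probability from the model's probability column as the score would presumably give many thresholds and a fuller curve.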
