# Classification with Decision Tree and Naive Bayes

### Importing MLlib libraries 

In [1]:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("Spark SQL basic example").config("spark.some.config.option", "some-value").getOrCreate()
// For implicit conversions like converting RDDs to DataFrames
import spark.implicits._
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassificationModel
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}

### Read data and pre-processing

The **Iris flower data set**, also known as Fisher's Iris data set, is a multivariate data set introduced by Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems" as an example of linear discriminant analysis. The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor), 150 samples in total. Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres.

In [2]:
val rawData = spark.read.format("csv").option("header","true").option("inferSchema", "true").load("data/iris_h.csv")

In [3]:
rawData.printSchema()
rawData.show()

root
 |-- SepalLength: double (nullable = true)
 |-- SepalWidth: double (nullable = true)
 |-- PetalLength: double (nullable = true)
 |-- PetalWidth: double (nullable = true)
 |-- Species: string (nullable = true)

+-----------+----------+-----------+----------+-------+
|SepalLength|SepalWidth|PetalLength|PetalWidth|Species|
+-----------+----------+-----------+----------+-------+
|        5.1|       3.5|        1.4|       0.2| setosa|
|        4.9|       3.0|        1.4|       0.2| setosa|
|        4.7|       3.2|        1.3|       0.2| setosa|
|        4.6|       3.1|        1.5|       0.2| setosa|
|        5.0|       3.6|        1.4|       0.2| setosa|
|        5.4|       3.9|        1.7|       0.4| setosa|
|        4.6|       3.4|        1.4|       0.3| setosa|
|        5.0|       3.4|        1.5|       0.2| setosa|
|        4.4|       2.9|        1.4|       0.2| setosa|
|        4.9|       3.1|        1.5|       0.1| setosa|
|        5.4|       3.7|        1.5|       0.2| setosa|
|

In [4]:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
val assembler = new VectorAssembler().setInputCols(Array("SepalLength","SepalWidth",
"PetalLength","PetalWidth")).setOutputCol("features")
val Data = assembler.transform(rawData)
Data.show()

+-----------+----------+-----------+----------+-------+-----------------+
|SepalLength|SepalWidth|PetalLength|PetalWidth|Species|         features|
+-----------+----------+-----------+----------+-------+-----------------+
|        5.1|       3.5|        1.4|       0.2| setosa|[5.1,3.5,1.4,0.2]|
|        4.9|       3.0|        1.4|       0.2| setosa|[4.9,3.0,1.4,0.2]|
|        4.7|       3.2|        1.3|       0.2| setosa|[4.7,3.2,1.3,0.2]|
|        4.6|       3.1|        1.5|       0.2| setosa|[4.6,3.1,1.5,0.2]|
|        5.0|       3.6|        1.4|       0.2| setosa|[5.0,3.6,1.4,0.2]|
|        5.4|       3.9|        1.7|       0.4| setosa|[5.4,3.9,1.7,0.4]|
|        4.6|       3.4|        1.4|       0.3| setosa|[4.6,3.4,1.4,0.3]|
|        5.0|       3.4|        1.5|       0.2| setosa|[5.0,3.4,1.5,0.2]|
|        4.4|       2.9|        1.4|       0.2| setosa|[4.4,2.9,1.4,0.2]|
|        4.9|       3.1|        1.5|       0.1| setosa|[4.9,3.1,1.5,0.1]|
|        5.4|       3.7|        1.5|  

In [5]:
val labelIndexer = new StringIndexer().setInputCol("Species").setOutputCol("indexedLabel").fit(Data)
// Automatically identify categorical features and index them.
// Features with more than 4 distinct values are treated as continuous.
val featureIndexer = new VectorIndexer().setInputCol("features").setOutputCol("indexedFeatures").setMaxCategories(4).fit(Data)
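
To make the two indexers less of a black box, here is a minimal plain-Scala sketch of the rules they apply. `indexLabels` and `isCategorical` are hypothetical helpers written for this illustration, not Spark APIs. The sketch assumes `StringIndexer`'s default ordering (most frequent label gets index 0.0) and breaks ties alphabetically, which is an assumption: with 50 rows per species every Iris label is tied, so the actual indices depend on how the Spark version in use breaks ties.

```scala
// Hypothetical helper (not part of Spark): mimic StringIndexer's default
// descending-frequency ordering on a plain Scala collection.
def indexLabels(labels: Seq[String]): Map[String, Double] = {
  val counts = labels.groupBy(identity).map { case (label, xs) => (label, xs.size) }
  counts.toSeq
    .sortBy { case (label, count) => (-count, label) } // most frequent first; ties alphabetical (assumption)
    .zipWithIndex
    .map { case ((label, _), idx) => label -> idx.toDouble }
    .toMap
}

// Hypothetical helper: the rule VectorIndexer applies per feature.
// A feature is categorical only if it has at most maxCategories distinct values.
def isCategorical(values: Seq[Double], maxCategories: Int): Boolean =
  values.distinct.size <= maxCategories

// Made-up sample: virginica appears 3 times, setosa twice, versicolor once.
val sample = Seq("virginica", "setosa", "virginica", "versicolor", "virginica", "setosa")
val labelToIndex = indexLabels(sample)
println(labelToIndex)

// Five distinct petal lengths exceed maxCategories = 4, so the feature stays continuous.
println(isCategorical(Seq(1.4, 1.3, 1.5, 1.7, 1.1), maxCategories = 4))
```

Since all four Iris measurements take many more than 4 distinct values, `featureIndexer` should leave them unchanged as continuous features.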


### Split the data into training and test sets (30% held out for testing)

In [6]:
val Array(trainingData, testData) = Data.randomSplit(Array(0.7, 0.3))
println("Training Data")
trainingData.take(100).foreach(println)
println("Test Data")
testData.take(100).foreach(println)

Training Data
[4.3,3.0,1.1,0.1,setosa,[4.3,3.0,1.1,0.1]]
[4.4,2.9,1.4,0.2,setosa,[4.4,2.9,1.4,0.2]]
[4.5,2.3,1.3,0.3,setosa,[4.5,2.3,1.3,0.3]]
[4.6,3.1,1.5,0.2,setosa,[4.6,3.1,1.5,0.2]]
[4.6,3.4,1.4,0.3,setosa,[4.6,3.4,1.4,0.3]]
[4.7,3.2,1.3,0.2,setosa,[4.7,3.2,1.3,0.2]]
[4.7,3.2,1.6,0.2,setosa,[4.7,3.2,1.6,0.2]]
[4.8,3.4,1.6,0.2,setosa,[4.8,3.4,1.6,0.2]]
[4.8,3.4,1.9,0.2,setosa,[4.8,3.4,1.9,0.2]]
[4.9,3.0,1.4,0.2,setosa,[4.9,3.0,1.4,0.2]]
[4.9,3.1,1.5,0.1,setosa,[4.9,3.1,1.5,0.1]]
[4.9,3.1,1.5,0.2,setosa,[4.9,3.1,1.5,0.2]]
[5.0,2.0,3.5,1.0,versicolor,[5.0,2.0,3.5,1.0]]
[5.0,2.3,3.3,1.0,versicolor,[5.0,2.3,3.3,1.0]]
[5.0,3.0,1.6,0.2,setosa,[5.0,3.0,1.6,0.2]]
[5.0,3.3,1.4,0.2,setosa,[5.0,3.3,1.4,0.2]]
[5.0,3.4,1.5,0.2,setosa,[5.0,3.4,1.5,0.2]]
[5.0,3.4,1.6,0.4,setosa,[5.0,3.4,1.6,0.4]]
[5.0,3.5,1.3,0.3,setosa,[5.0,3.5,1.3,0.3]]
[5.0,3.5,1.6,0.6,setosa,[5.0,3.5,1.6,0.6]]
[5.0,3.6,1.4,0.2,setosa,[5.0,3.6,1.4,0.2]]
[5.1,3.3,1.7,0.5,setosa,[5.1,3.3,1.7,0.5]]
[5.1,3.4,1.5,0.2,setosa,[5.1,3.4
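
The split is random per row, so the test set is only approximately 30% of the data: the confusion matrix further down sums to 56 test rows out of 150. The plain-Scala sketch below illustrates the idea with a hypothetical helper, `bernoulliSplit`, which is not Spark code. It also shows why fixing a seed makes the split reproducible. Spark's `randomSplit` accepts a seed as an optional second argument.

```scala
import scala.util.Random

// Hypothetical helper: assign each row independently to one side of the split.
def bernoulliSplit[A](rows: Seq[A], trainFraction: Double, seed: Long): (Seq[A], Seq[A]) = {
  val rng = new Random(seed)
  // Draw one uniform number per row, then partition on it.
  val tagged = rows.map(row => (row, rng.nextDouble()))
  val (train, test) = tagged.partition { case (_, u) => u < trainFraction }
  (train.map(_._1), test.map(_._1))
}

// Stand-in for the 150 Iris rows.
val rows = 1 to 150
val (trainA, testA) = bernoulliSplit(rows, 0.7, seed = 42L)
val (trainB, testB) = bernoulliSplit(rows, 0.7, seed = 42L)

// Sizes are close to 105 / 45 but usually not exact.
println(s"train = ${trainA.size}, test = ${testA.size}")
// The same seed yields the same split.
println(trainA == trainB)
```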

### Train a Decision Tree

Decision trees are widely used since they are easy to interpret, handle categorical features, extend to the multiclass classification setting, do not require feature scaling and are able to capture nonlinearities and feature interactions. Tree ensemble algorithms such as random forests and boosting are among the top performers for classification and regression tasks.
MLlib supports decision trees for binary and multiclass classification and for regression, using both continuous and categorical features. The implementation partitions data by rows, allowing distributed training with millions of instances.
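
As a rough illustration of how a classification tree chooses a split, the sketch below computes Gini impurity, the default impurity measure of `DecisionTreeClassifier`, for a node and for a candidate binary split. The helpers `gini` and `splitImpurity` and the class counts are made up for this example. Spark's actual implementation works on binned features and distributed statistics.

```scala
// Gini impurity of a node, given the number of rows of each class in it.
def gini(classCounts: Seq[Int]): Double = {
  val total = classCounts.sum.toDouble
  if (total == 0.0) 0.0
  else 1.0 - classCounts.map(c => math.pow(c / total, 2)).sum
}

// Weighted impurity of a binary split (lower is better).
def splitImpurity(left: Seq[Int], right: Seq[Int]): Double = {
  val nLeft = left.sum.toDouble
  val nRight = right.sum.toDouble
  val n = nLeft + nRight
  (nLeft / n) * gini(left) + (nRight / n) * gini(right)
}

// Balanced three-class node, like the full Iris data (50 rows per species).
val parent = Seq(50, 50, 50)
// Illustrative split that isolates one class perfectly on the left.
val left = Seq(50, 0, 0)
val right = Seq(0, 50, 50)

println(f"parent gini = ${gini(parent)}%.4f")
println(f"after split = ${splitImpurity(left, right)}%.4f")
```

The tree grows by repeatedly picking the feature and threshold with the largest impurity reduction, here from about 0.667 down to about 0.333.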

In [7]:
val dt = new DecisionTreeClassifier().setLabelCol("indexedLabel").setFeaturesCol("indexedFeatures")
// Convert indexed labels back to original labels.
val labelConverter = new IndexToString().setInputCol("prediction").setOutputCol("predictedLabel").setLabels(labelIndexer.labels)

In [8]:
// Chain indexers and tree in a Pipeline.
val pipeline = new Pipeline().setStages(Array(labelIndexer, featureIndexer, dt, labelConverter))


In [18]:
val model = pipeline.fit(trainingData)
val treeModel = model.stages(2).asInstanceOf[DecisionTreeClassificationModel]
println("Learned classification tree model:\n" + treeModel.toDebugString)

Learned classification tree model:
DecisionTreeClassificationModel (uid=dtc_e0ac8db8e732) of depth 4 with 11 nodes
  If (feature 2 <= 1.9)
   Predict: 2.0
  Else (feature 2 > 1.9)
   If (feature 2 <= 4.8)
    If (feature 3 <= 1.6)
     Predict: 0.0
    Else (feature 3 > 1.6)
     If (feature 0 <= 5.9)
      Predict: 0.0
     Else (feature 0 > 5.9)
      Predict: 1.0
   Else (feature 2 > 4.8)
    If (feature 3 <= 1.5)
     Predict: 0.0
    Else (feature 3 > 1.5)
     Predict: 1.0
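
The debug string is easier to read once written out as ordinary conditionals. The function below is a hand transcription of the tree printed above. `predictIndex` is a name chosen for this sketch. Feature indices follow the assembler order (0 = SepalLength, 1 = SepalWidth, 2 = PetalLength, 3 = PetalWidth), and the return value is the indexed label, not the species name.

```scala
// Hand transcription of the learned tree: returns the indexed label.
def predictIndex(features: Array[Double]): Double =
  if (features(2) <= 1.9) 2.0
  else if (features(2) <= 4.8) {
    if (features(3) <= 1.6) 0.0
    else if (features(0) <= 5.9) 0.0
    else 1.0
  } else {
    if (features(3) <= 1.5) 0.0
    else 1.0
  }

// First row of the data set: petal length 1.4 falls in the first branch.
println(predictIndex(Array(5.1, 3.5, 1.4, 0.2))) // prints 2.0
```

The first branch (petal length at most 1.9) corresponds to setosa, as the test predictions for rows with short petals show. Sepal width (feature 1) is never used by this tree.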



### Test the Decision Tree model

In [10]:
// Make predictions.
val predictions = model.transform(testData)
// Select example rows to display.
predictions.select("predictedLabel", "Species", "features").show(5)
predictions.printSchema()

+--------------+-------+-----------------+
|predictedLabel|Species|         features|
+--------------+-------+-----------------+
|        setosa| setosa|[4.4,3.0,1.3,0.2]|
|        setosa| setosa|[4.4,3.2,1.3,0.2]|
|        setosa| setosa|[4.6,3.2,1.4,0.2]|
|        setosa| setosa|[4.6,3.6,1.0,0.2]|
|        setosa| setosa|[4.8,3.0,1.4,0.1]|
+--------------+-------+-----------------+
only showing top 5 rows

root
 |-- SepalLength: double (nullable = true)
 |-- SepalWidth: double (nullable = true)
 |-- PetalLength: double (nullable = true)
 |-- PetalWidth: double (nullable = true)
 |-- Species: string (nullable = true)
 |-- features: vector (nullable = true)
 |-- indexedLabel: double (nullable = true)
 |-- indexedFeatures: vector (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = true)
 |-- predictedLabel: string (nullable = true)



In [17]:
// Select (prediction, true label) and compute test error.
val evaluator = new MulticlassClassificationEvaluator().setLabelCol("indexedLabel").setPredictionCol("prediction").setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
println("Test Error = " + (1.0 - accuracy))



Test Error = 0.1071428571428571



### Complete evaluation on test set for Decision Tree model

In [12]:
import org.apache.spark.mllib.evaluation.MulticlassMetrics
val predictionRDD = predictions.select("prediction", "indexedLabel").as[(Double,Double)].rdd
val metrics = new MulticlassMetrics(predictionRDD)
// Confusion matrix
println("Confusion matrix:")
println(metrics.confusionMatrix)

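// Overall statistics
// Note: the no-argument precision, recall and fMeasure are deprecated since Spark 2.0
// and removed in Spark 3.0. All three equal the overall accuracy (metrics.accuracy).
val precision = metrics.precision
val recall = metrics.recall // same as true positive rate
val f1Score = metrics.fMeasure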
println("Summary Statistics")
println(s"Precision = $precision")
println(s"Recall = $recall")
println(s"F1 Score = $f1Score")

// Precision by label
val labels = metrics.labels
labels.foreach { l =>
    println(s"Precision($l) = " + metrics.precision(l))
}

// Recall by label
labels.foreach { l =>
    println(s"Recall($l) = " + metrics.recall(l))
}

// False positive rate by label
labels.foreach { l =>
    println(s"FPR($l) = " + metrics.falsePositiveRate(l))
}

// F-measure by label
labels.foreach { l =>
    println(s"F1-Score($l) = " + metrics.fMeasure(l))
}

// Weighted stats
println(s"Weighted precision: ${metrics.weightedPrecision}")
println(s"Weighted recall: ${metrics.weightedRecall}")
println(s"Weighted F1 score: ${metrics.weightedFMeasure}")
println(s"Weighted false positive rate: ${metrics.weightedFalsePositiveRate}")

Confusion matrix:
19.0  2.0   0.0   
4.0   17.0  0.0   
0.0   0.0   14.0  
Summary Statistics
Precision = 0.8928571428571429
Recall = 0.8928571428571429
F1 Score = 0.8928571428571429
Precision(0.0) = 0.8260869565217391
Precision(1.0) = 0.8947368421052632
Precision(2.0) = 1.0
Recall(0.0) = 0.9047619047619048
Recall(1.0) = 0.8095238095238095
Recall(2.0) = 1.0
FPR(0.0) = 0.11428571428571428
FPR(1.0) = 0.05714285714285714
FPR(2.0) = 0.0
F1-Score(0.0) = 0.8636363636363636
F1-Score(1.0) = 0.8500000000000001
F1-Score(2.0) = 1.0
Weighted precision: 0.895308924485126
Weighted recall: 0.8928571428571428
Weighted F1 score: 0.8926136363636363
Weighted false positive rate: 0.06428571428571428
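
All of these numbers can be recomputed by hand from the confusion matrix, which helps when checking what each metric means. The sketch below is plain Scala with hypothetical helper names (`precisionOf`, `recallOf`, `fprOf`, `f1Of`). It takes rows as true labels and columns as predicted labels, which is how `MulticlassMetrics` lays out the matrix.

```scala
// Confusion matrix from the output above: rows = true label, columns = predicted label.
val cm = Array(
  Array(19.0, 2.0, 0.0),
  Array(4.0, 17.0, 0.0),
  Array(0.0, 0.0, 14.0)
)
val n = cm.map(_.sum).sum // total number of test rows
val cmAccuracy = cm.indices.map(i => cm(i)(i)).sum / n

// Precision: correct predictions of class k / all predictions of class k (column sum).
def precisionOf(k: Int): Double = cm(k)(k) / cm.map(row => row(k)).sum
// Recall: correct predictions of class k / all true rows of class k (row sum).
def recallOf(k: Int): Double = cm(k)(k) / cm(k).sum
// False positive rate: rows wrongly predicted as k / all rows that are not k.
def fprOf(k: Int): Double = (cm.map(row => row(k)).sum - cm(k)(k)) / (n - cm(k).sum)
// F1: harmonic mean of precision and recall.
def f1Of(k: Int): Double = {
  val p = precisionOf(k)
  val r = recallOf(k)
  2 * p * r / (p + r)
}
// Weighted precision: per-class precision weighted by each class's share of true rows.
val weightedPrecision = cm.indices.map(k => precisionOf(k) * cm(k).sum / n).sum

println(s"Accuracy = $cmAccuracy")
println(s"Test error = ${1.0 - cmAccuracy}")
cm.indices.foreach { k =>
  println(s"class $k: precision=${precisionOf(k)} recall=${recallOf(k)} fpr=${fprOf(k)} f1=${f1Of(k)}")
}
println(s"Weighted precision = $weightedPrecision")
```

The overall Precision, Recall and F1 Score printed above are identical because each is computed over all classes at once, which reduces to the accuracy: 50 correct predictions out of 56 test rows.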


### Train Naive Bayes

Naive Bayes is a simple multiclass classification algorithm that assumes every pair of features is conditionally independent given the label. It can be trained very efficiently: in a single pass over the training data, it computes the conditional probability distribution of each feature given the label. At prediction time it applies Bayes' theorem to compute the conditional probability distribution of the label given an observation, and predicts the most probable label.
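
The prediction step can be sketched in a few lines of plain Scala. The example below is a toy Bernoulli-style model with two classes and two binary features. The class names, priors and likelihoods are made-up values for illustration, and `posterior` is a hypothetical helper, not Spark's implementation (Spark's `NaiveBayes` defaults to the multinomial model). In a real model the priors and likelihoods would be estimated by counting over the training data.

```scala
// Made-up parameters: prior(c) = P(label = c),
// likelihood(c)(j) = P(feature j is present | label = c).
val prior = Map("a" -> 0.5, "b" -> 0.5)
val likelihood = Map(
  "a" -> Array(0.8, 0.1),
  "b" -> Array(0.3, 0.6)
)

// Posterior P(label | features) under the naive independence assumption.
def posterior(features: Array[Boolean]): Map[String, Double] = {
  // Work in log space so products of many small probabilities do not underflow.
  val logJoint = prior.map { case (label, p) =>
    val logLik = features.zip(likelihood(label)).map {
      case (present, q) => math.log(if (present) q else 1.0 - q)
    }.sum
    label -> (math.log(p) + logLik)
  }
  // Bayes' theorem: normalize the joint probabilities so they sum to 1.
  val maxLog = logJoint.values.max
  val unnormalized = logJoint.map { case (label, lj) => label -> math.exp(lj - maxLog) }
  val z = unnormalized.values.sum
  unnormalized.map { case (label, u) => label -> u / z }
}

// Observation: feature 0 present, feature 1 absent.
val post = posterior(Array(true, false))
val predicted = post.maxBy(_._2)._1
println(post)
println(s"Predicted label: $predicted")
```

For this observation the joint probabilities are 0.5 × 0.8 × 0.9 = 0.36 for class "a" and 0.5 × 0.3 × 0.4 = 0.06 for class "b", so the posterior for "a" is 0.36 / 0.42, about 0.857.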

In [None]:
import org.apache.spark.ml.classification.NaiveBayes
val nb = new NaiveBayes().setLabelCol("indexedLabel").setFeaturesCol("indexedFeatures")
// Convert indexed labels back to original labels.
val labelConverter = new IndexToString().setInputCol("prediction").setOutputCol("predictedLabel").setLabels(labelIndexer.labels)

In [None]:
// Chain indexers and Naive Bayes in a Pipeline.
val pipeline = new Pipeline().setStages(Array(labelIndexer, featureIndexer, nb, labelConverter))

In [None]:
val nbmodel = pipeline.fit(trainingData)

### Test the Naive Bayes model

In [None]:
// Make predictions.
val predictions = nbmodel.transform(testData)
// Select example rows to display.
predictions.select("predictedLabel", "Species", "features").show(5)
predictions.printSchema()

In [None]:
// Select (prediction, true label) and compute test error.
val evaluator = new MulticlassClassificationEvaluator().setLabelCol("indexedLabel").setPredictionCol("prediction").setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
println("Test Error = " + (1.0 - accuracy))

### Complete evaluation on test set for Naive Bayes model

In [None]:
import org.apache.spark.mllib.evaluation.MulticlassMetrics
val predictionRDD = predictions.select("prediction", "indexedLabel").as[(Double,Double)].rdd
val metrics = new MulticlassMetrics(predictionRDD)
// Confusion matrix
println("Confusion matrix:")
println(metrics.confusionMatrix)

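// Overall statistics
// Note: the no-argument precision, recall and fMeasure are deprecated since Spark 2.0
// and removed in Spark 3.0. All three equal the overall accuracy (metrics.accuracy).
val precision = metrics.precision
val recall = metrics.recall // same as true positive rate
val f1Score = metrics.fMeasure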
println("Summary Statistics")
println(s"Precision = $precision")
println(s"Recall = $recall")
println(s"F1 Score = $f1Score")

// Precision by label
val labels = metrics.labels
labels.foreach { l =>
    println(s"Precision($l) = " + metrics.precision(l))
}

// Recall by label
labels.foreach { l =>
    println(s"Recall($l) = " + metrics.recall(l))
}

// False positive rate by label
labels.foreach { l =>
    println(s"FPR($l) = " + metrics.falsePositiveRate(l))
}

// F-measure by label
labels.foreach { l =>
    println(s"F1-Score($l) = " + metrics.fMeasure(l))
}

// Weighted stats
println(s"Weighted precision: ${metrics.weightedPrecision}")
println(s"Weighted recall: ${metrics.weightedRecall}")
println(s"Weighted F1 score: ${metrics.weightedFMeasure}")
println(s"Weighted false positive rate: ${metrics.weightedFalsePositiveRate}")