# Iris data classification using Scala

This analysis is taken from the Apache Spark documentation (https://spark.apache.org/docs/2.1.0/ml-classification-regression.html#random-forest-classifier) and shows the steps used to build a random forest classification model in Scala on the Iris data set.

The README file shows how to configure Jupyter to run this example.

The data set used for this classification is in *LIBSVM* format. More details on LIBSVM can be found here (https://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html#/Q3:_Data_preparation).

The **LIBSVM** data format stores data in a sparse array form, so only non-zero values are stored.
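
To make the sparse layout concrete, here is a minimal sketch in plain Scala (no Spark required) of how a single LIBSVM line, `<label> <index>:<value> ...`, could be parsed. The names `LibsvmSketch`, `SparseRow`, `parseLine`, and `toDense` are hypothetical and exist only for this illustration, and the sample line in the usage note is made up rather than taken from `iris_libsvm.txt`. LIBSVM indices are one-based; the sketch shifts them to zero-based, which matches the zero-based indices visible in the `features` column printed later in this notebook.

```scala
object LibsvmSketch {
  // One parsed record: the label plus parallel collections of
  // zero-based feature indices and their non-zero values.
  final case class SparseRow(label: Double, indices: Vector[Int], values: Vector[Double])

  // Parse one line of the form "<label> <index>:<value> <index>:<value> ...".
  def parseLine(line: String): SparseRow = {
    val tokens = line.trim.split("\\s+")
    val label = tokens.head.toDouble
    val pairs = tokens.tail.map { token =>
      val sep = token.indexOf(':')
      // LIBSVM indices start at 1; shift them so they start at 0.
      (token.substring(0, sep).toInt - 1, token.substring(sep + 1).toDouble)
    }
    SparseRow(label, pairs.map(_._1).toVector, pairs.map(_._2).toVector)
  }

  // Expand a sparse row to a dense vector, filling missing positions with 0.0.
  def toDense(row: SparseRow, size: Int): Vector[Double] = {
    val lookup = row.indices.zip(row.values).toMap
    Vector.tabulate(size)(i => lookup.getOrElse(i, 0.0))
  }
}
```

For example, `LibsvmSketch.parseLine("1 1:-0.5 3:0.25")` keeps only two entries (indices 0 and 2), and `toDense(row, 4)` restores the implicit zeros at positions 1 and 3.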

In [1]:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}

In [3]:
// Load and parse the data file, converting it to a DataFrame.
val data = spark.read.format("libsvm").load("data/iris_libsvm.txt")

data = [label: double, features: vector]



[label: double, features: vector]

In [5]:
// Index labels, adding metadata to the label column.
// Fit on whole dataset to include all labels in index.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(data)
// Automatically identify categorical features, and index them.
// Set maxCategories so features with > 4 distinct values are treated as continuous.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)
  .fit(data)

// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

labelIndexer = strIdx_fc2642244120
featureIndexer = vecIdx_d1ddb36a2306
trainingData = [label: double, features: vector]
testData = [label: double, features: vector]


[label: double, features: vector]
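
The 70/30 split above is random, so the two sets only approximately follow the requested proportions and change from run to run unless a seed is fixed. The following plain-Scala sketch illustrates the idea with one uniform draw per row; it is a simplified stand-in, not Spark's implementation, and `SplitSketch` and its `randomSplit` helper are hypothetical names used only here.

```scala
import scala.util.Random

object SplitSketch {
  // Assign every row to the training or test split using one uniform draw per row.
  // A row goes to training when its draw falls below trainFraction.
  def randomSplit[A](rows: Seq[A], trainFraction: Double, seed: Long): (Seq[A], Seq[A]) = {
    val rng = new Random(seed)
    // Materialise the draws eagerly so the result depends only on the seed and row order.
    val tagged = rows.toVector.map(row => (row, rng.nextDouble()))
    val (train, test) = tagged.partition { case (_, draw) => draw < trainFraction }
    (train.map(_._1), test.map(_._1))
  }
}
```

Calling `SplitSketch.randomSplit(1 to 150, 0.7, 42L)` twice returns the same two sets, while a different seed generally gives a different split. In Spark, a reproducible split can be requested by passing a seed as the second argument, for example `data.randomSplit(Array(0.7, 0.3), 42L)`.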

In [6]:
// Train a RandomForest model.
val rf = new RandomForestClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
  .setNumTrees(10)

// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)

// Chain indexers and forest in a Pipeline.
val pipeline = new Pipeline()
  .setStages(Array(labelIndexer, featureIndexer, rf, labelConverter))

// Train model. This also runs the indexers.
val model = pipeline.fit(trainingData)

rf = rfc_820a84b73246
labelConverter = idxToStr_7903f0ed7588
pipeline = pipeline_e941f714de3c
model = pipeline_e941f714de3c


pipeline_e941f714de3c
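
The pipeline above indexes the labels before training and converts the predicted indices back afterwards. The round trip can be sketched in plain Scala as follows. By default `StringIndexer` gives index 0 to the most frequent label; breaking ties alphabetically is an assumption of this sketch. `LabelIndexSketch`, `fitLabels`, `toIndex`, and `toLabel` are hypothetical names, and the sample labels in the usage note are invented.

```scala
object LabelIndexSketch {
  // Order the distinct labels by descending frequency.
  // Ties are broken alphabetically (an assumption made for this sketch).
  def fitLabels(labels: Seq[String]): Vector[String] =
    labels.groupBy(identity).toVector
      .map { case (label, occurrences) => (label, occurrences.size) }
      .sortBy { case (label, count) => (-count, label) }
      .map(_._1)

  // Forward mapping: a label's index is its position in the fitted ordering.
  def toIndex(fitted: Vector[String], label: String): Double =
    fitted.indexOf(label).toDouble

  // Inverse mapping: look the original label up by its index.
  def toLabel(fitted: Vector[String], index: Double): String =
    fitted(index.toInt)
}
```

With `val fitted = LabelIndexSketch.fitLabels(Seq("2.0", "1.0", "2.0", "3.0", "2.0", "1.0"))`, the label `"2.0"` receives index 0.0 because it occurs most often, and `toLabel(fitted, 0.0)` maps it back. Because the forest is trained on `indexedLabel`, its raw predictions are indices of this kind rather than original labels.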

In [7]:
// Make predictions.
val predictions = model.transform(testData)

// Select example rows to display.
predictions.select("predictedLabel", "label", "features").show(5)

// Select (prediction, true label) and compute test error.
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("indexedLabel")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
println("Test Error = " + (1.0 - accuracy))

val rfModel = model.stages(2).asInstanceOf[RandomForestClassificationModel]
println("Learned classification forest model:\n" + rfModel.toDebugString)

+--------------+-----+--------------------+
|predictedLabel|label|            features|
+--------------+-----+--------------------+
|           1.0|  1.0|(4,[0,1,2,3],[-0....|
|           1.0|  1.0|(4,[0,1,2,3],[-0....|
|           1.0|  1.0|(4,[0,1,2,3],[-0....|
|           1.0|  1.0|(4,[0,1,2,3],[-0....|
|           1.0|  1.0|(4,[0,1,2,3],[-0....|
+--------------+-----+--------------------+
only showing top 5 rows

Test Error = 0.0
Learned classification forest model:
RandomForestClassificationModel (uid=rfc_820a84b73246) with 10 trees
  Tree 0 (weight 1.0):
    If (feature 3 <= -0.625)
     Predict: 0.0
    Else (feature 3 > -0.625)
     If (feature 3 <= 0.375)
      If (feature 3 <= 0.12500015)
       Predict: 2.0
      Else (feature 3 > 0.12500015)
       If (feature 2 <= 0.4067795)
        If (feature 0 <= -0.638889)
         Predict: 1.0
        Else (feature 0 > -0.638889)
         Predict: 2.0
       Else (feature 2 > 0.4067795)
        Predict: 1.0
     Else (feature 3 > 0.37

predictions = [label: double, features: vector ... 6 more fields]
evaluator = mcEval_9ea21a3ed224
accuracy = 1.0
rfModel = RandomForestClassificationModel (uid=rfc_820a84b73246) with 10 trees


RandomForestClassificationModel (uid=rfc_820a84b73246) with 10 trees
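
The reported test error is simply one minus the accuracy, where accuracy is the fraction of test rows whose prediction equals the indexed label. As a worked example, the plain-Scala sketch below computes both quantities from a small list of (prediction, label) pairs. `AccuracySketch` is a hypothetical name, and the pairs in the usage note are invented for illustration, not results from this notebook.

```scala
object AccuracySketch {
  // Fraction of (prediction, label) pairs in which the two values agree.
  def accuracy(pairs: Seq[(Double, Double)]): Double = {
    require(pairs.nonEmpty, "need at least one (prediction, label) pair")
    pairs.count { case (prediction, label) => prediction == label }.toDouble / pairs.size
  }

  // Test error as printed in the notebook: 1.0 - accuracy.
  def testError(pairs: Seq[(Double, Double)]): Double = 1.0 - accuracy(pairs)
}
```

For the pairs `Seq((0.0, 0.0), (1.0, 1.0), (2.0, 1.0), (2.0, 2.0))`, three of four predictions match, so the accuracy is 0.75 and the test error is 0.25. A test error of 0.0, as in the run above, means every row in the held-out set was classified correctly.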