<img src=http://fd.perso.eisti.fr/Logos/TORUS2.png>

A single Random Forest example is enough to illustrate classification algorithms.

"Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set"

(source: https://en.wikipedia.org/wiki/Random_forest)
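
The aggregation step described in the quote can be sketched in a few lines of plain Scala. This is only an illustration of the idea, not Spark's implementation: `majorityVote`, `meanPrediction` and the sample votes are hypothetical names and values chosen for this example.

```scala
// Combine per-tree outputs into one forest prediction (illustrative helpers, not Spark API).

// Classification: return the most frequent class label among the trees' votes.
// Tie-breaking is left unspecified here; an empty input throws.
def majorityVote(votes: Seq[Double]): Double =
  votes.groupBy(identity).maxBy { case (_, sameLabel) => sameLabel.size }._1

// Regression: return the mean of the trees' predictions.
def meanPrediction(predictions: Seq[Double]): Double =
  predictions.sum / predictions.size

// Hypothetical votes from 5 trees: three vote for class 1.0, two for class 0.0.
val treeVotes = Seq(1.0, 0.0, 1.0, 1.0, 0.0)
val forestClass = majorityVote(treeVotes)

// Hypothetical regression outputs from 3 trees.
val forestMean = meanPrediction(Seq(2.0, 4.0, 6.0))
```

With an odd number of votes on a two-class problem, as here, no tie can occur.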

### Read dataset (csv format) from HDFS

Here we use the dataset from https://archive.ics.uci.edu/ml/datasets/Haberman%27s+Survival

The target variable is the survival status (1: the patient survived 5 years or longer; 2: the patient died within 5 years). The descriptive variables are:
- Age of patient at time of operation (numerical) 
- Patient's year of operation (year - 1900, numerical) 
- Number of positive axillary nodes detected (numerical) 

In [ ]:
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

val data = sqlContext.read.format("com.databricks.spark.csv")
              .option("header", "true").option("inferSchema", "true") 
              .load("hdfs://hupi-factory-02-01-01-01/user/hupi/dataset_torusVN/formation4_ML/haberman.csv")

sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@6dd28972
data: org.apache.spark.sql.DataFrame = [age: int, nbYearOperation: int ... 2 more fields]


In [ ]:
data.show()

+---+---------------+-------------+------+
|age|nbYearOperation|nbPosAxillary|status|
+---+---------------+-------------+------+
| 30|             64|            1|     1|
| 30|             62|            3|     1|
| 30|             65|            0|     1|
| 31|             59|            2|     1|
| 31|             65|            4|     1|
| 33|             58|           10|     1|
| 33|             60|            0|     1|
| 34|             59|            0|     2|
| 34|             66|            9|     2|
| 34|             58|           30|     1|
| 34|             60|            1|     1|
| 34|             61|           10|     1|
| 34|             67|            7|     1|
| 34|             60|            0|     1|
| 35|             64|           13|     1|
| 35|             63|            0|     1|
| 36|             60|            1|     1|
| 36|             69|            0|     1|
| 37|             60|            0|     1|
| 37|             63|            0|     1|
+---+---------------+-------------+------+

In [ ]:
// Convert the DataFrame to an RDD of LabeledPoint.
// The model expects class labels 0 and 1, so status 2 (died within 5 years) is recoded as 0.
// Feature order in each vector: nbYearOperation, nbPosAxillary, age.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val data_rdd = data.map(l => (l(3).asInstanceOf[Int].toString.replace("2", "0").toDouble, l(1).asInstanceOf[Int].toDouble, 
                             l(2).asInstanceOf[Int].toDouble, l(0).asInstanceOf[Int].toDouble)).map(l => LabeledPoint(l._1, Vectors.dense(l._2, l._3, l._4))).rdd

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
data_rdd: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint] = MapPartitionsRDD[22] at rdd at <console>:76
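
The `toString.replace("2", "0")` recoding works here because the status column only holds 1 and 2; any other value containing the digit 2 would be silently rewritten. A more explicit alternative, sketched in plain Scala under the assumption that 1 and 2 are the only valid codes (`statusToLabel` is a hypothetical helper, not part of the notebook):

```scala
// Map the dataset's status code to the 0/1 label used by the classifier.
// 1 (survived 5 years or longer) -> 1.0, 2 (died within 5 years) -> 0.0.
// Any other code is rejected instead of being silently converted.
def statusToLabel(status: Int): Double = status match {
  case 1     => 1.0
  case 2     => 0.0
  case other => throw new IllegalArgumentException(s"Unexpected status code: $other")
}

// Example on three status codes as they appear in the status column.
val labels = Seq(1, 2, 1).map(statusToLabel)
```

Failing fast on unexpected codes makes data problems visible before training rather than after.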


In [ ]:
data_rdd.take(50)

res6: Array[org.apache.spark.mllib.regression.LabeledPoint] = Array((1.0,[64.0,1.0,30.0]), (1.0,[62.0,3.0,30.0]), (1.0,[65.0,0.0,30.0]), (1.0,[59.0,2.0,31.0]), (1.0,[65.0,4.0,31.0]), (1.0,[58.0,10.0,33.0]), (1.0,[60.0,0.0,33.0]), (0.0,[59.0,0.0,34.0]), (0.0,[66.0,9.0,34.0]), (1.0,[58.0,30.0,34.0]), (1.0,[60.0,1.0,34.0]), (1.0,[61.0,10.0,34.0]), (1.0,[67.0,7.0,34.0]), (1.0,[60.0,0.0,34.0]), (1.0,[64.0,13.0,35.0]), (1.0,[63.0,0.0,35.0]), (1.0,[60.0,1.0,36.0]), (1.0,[69.0,0.0,36.0]), (1.0,[60.0,0.0,37.0]), (1.0,[63.0,0.0,37.0]), (1.0,[58.0,0.0,37.0]), (1.0,[59.0,6.0,37.0]), (1.0,[60.0,15.0,37.0]), (1.0,[63.0,0.0,37.0]), (0.0,[69.0,21.0,38.0]), (1.0,[59.0,2.0,38.0]), (1.0,[60.0,0.0,38.0]), (1.0,[60.0,0.0,38.0]), (1.0,[62.0,3.0,38.0]), (1.0,[64.0,1.0,38.0]), (1.0,[66.0,0.0,38.0]), (1.0,[66.0...
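
Note that the vectors above are ordered (nbYearOperation, nbPosAxillary, age), which differs from the DataFrame column order. When reading a model printout such as `model.toDebugString`, "feature 1" therefore refers to nbPosAxillary. The small helper below, a sketch with the hypothetical names `featureNames` and `nameFeatures`, substitutes column names for indices, assuming the printout uses the `feature <index>` wording.

```scala
// Order in which the notebook packs the columns into Vectors.dense(...).
val featureNames = Vector("nbYearOperation", "nbPosAxillary", "age")

// Replace every "feature <i>" token with the matching column name.
// An index outside featureNames raises IndexOutOfBoundsException.
def nameFeatures(text: String): String =
  """feature (\d+)""".r.replaceAllIn(text, m => featureNames(m.group(1).toInt))

val readable = nameFeatures("If (feature 1 <= 4.0)")
```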

### Randomly split data_rdd into trainingData and testData

In [ ]:
val Array(trainingData, testData) = data_rdd.randomSplit(Array(0.7, 0.3))

trainingData: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint] = MapPartitionsRDD[23] at randomSplit at <console>:74
testData: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint] = MapPartitionsRDD[24] at randomSplit at <console>:74
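
Because the split is random, the training and test sets (and therefore the reported error) change from run to run. Spark's `randomSplit` accepts an optional `seed` argument, for example `data_rdd.randomSplit(Array(0.7, 0.3), seed = 42L)`, to make the split repeatable. The plain-Scala sketch below illustrates the idea of a seeded split; it is not Spark's implementation, and `seededSplit` is a hypothetical helper.

```scala
import scala.util.Random

// Send each element to the first split with probability `trainFraction`,
// otherwise to the second. A seeded generator makes the result repeatable.
def seededSplit[A](items: Seq[A], trainFraction: Double, seed: Long): (Seq[A], Seq[A]) = {
  val rng = new Random(seed)
  // Draw one uniform number per element, in order, then partition on it.
  val tagged = items.map(item => (item, rng.nextDouble()))
  val (train, test) = tagged.partition { case (_, u) => u < trainFraction }
  (train.map(_._1), test.map(_._1))
}

val rows = (1 to 100).toVector
val (trainA, testA) = seededSplit(rows, 0.7, seed = 42L)
val (trainB, testB) = seededSplit(rows, 0.7, seed = 42L)
```

As with `randomSplit`, the 70/30 proportion is only approximate: each element is assigned independently, so the exact sizes depend on the draws.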


### Build a random forest model 

In [ ]:
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel

val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 10
val featureSubsetStrategy = "auto" // Let the algorithm choose ("sqrt" for a classification forest with more than one tree).
val impurity = "gini"
val maxDepth = 4
val maxBins = 32

val model = RandomForest.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
  numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
numClasses: Int = 2
categoricalFeaturesInfo: scala.collection.immutable.Map[Int,Int] = Map()
numTrees: Int = 10
featureSubsetStrategy: String = auto
impurity: String = gini
maxDepth: Int = 4
maxBins: Int = 32
model: org.apache.spark.mllib.tree.model.RandomForestModel =
TreeEnsembleModel classifier with 10 trees
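
The `impurity = "gini"` setting selects the criterion used to score candidate splits: a node whose samples all belong to one class has impurity 0, and the value grows as the classes mix. As a rough illustration in plain Scala (the helper `gini` and the counts are made up for this example), the Gini impurity is 1 minus the sum of squared class proportions:

```scala
// Gini impurity of a node, given the number of training samples per class in that node.
def gini(classCounts: Seq[Int]): Double = {
  val total = classCounts.sum.toDouble
  if (total == 0) 0.0 // an empty node is treated as pure
  else 1.0 - classCounts.map { c => val p = c / total; p * p }.sum
}

val pure  = gini(Seq(10, 0)) // all samples in one class
val mixed = gini(Seq(5, 5))  // two classes, evenly mixed
```

For two classes the impurity ranges from 0 (pure node) to 0.5 (even mix).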


### Evaluate the model

In [ ]:
// Evaluate model on test instances and compute test error
val labelAndPreds = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val testErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / testData.count()
println("Test Error = " + testErr)
println("Learned classification forest model:\n" + model.toDebugString)

Test Error = 0.2988505747126437
Learned classification forest model:
TreeEnsembleModel classifier with 10 trees

  Tree 0:
    If (feature 1 <= 4.0)
     If (feature 0 <= 63.0)
      If (feature 0 <= 58.0)
       If (feature 1 <= 3.0)
        Predict: 1.0
       Else (feature 1 > 3.0)
        Predict: 0.0
      Else (feature 0 > 58.0)
       If (feature 2 <= 49.0)
        Predict: 1.0
       Else (feature 2 > 49.0)
        Predict: 1.0
     Else (feature 0 > 63.0)
      If (feature 2 <= 61.0)
       If (feature 1 <= 1.0)
        Predict: 1.0
       Else (feature 1 > 1.0)
        Predict: 1.0
      Else (feature 2 > 61.0)
       Predict: 1.0
    Else (feature 1 > 4.0)
     If (feature 0 <= 63.0)
      If (feature 0 <= 58.0)
       If (feature 1 <= 12.0)
        Predict: 0.0
       Else (feature 1 > 12.0)
        Predict: 0.0
      Else (feature 0 > 58.0)
       If (feature 0 <= 60.0)
        Predict: 1.0
       Else (feature 0 > 60.0)
        Predict: 1.0
     Else (feature 0 > 63.0)
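
A single error rate does not show which class the model gets wrong, which matters when one class is much more frequent than the other. The sketch below counts the four outcomes of a binary classifier from `(label, prediction)` pairs, in the same order as `labelAndPreds`, and derives accuracy, precision and recall with 1.0 as the positive class. `BinaryCounts`, `countOutcomes` and the sample pairs are hypothetical and are not results from this notebook. Spark also provides `org.apache.spark.mllib.evaluation.MulticlassMetrics`, which expects `(prediction, label)` pairs, the reverse order.

```scala
// Confusion-matrix counts for a binary problem, with 1.0 as the positive class.
case class BinaryCounts(tp: Int, fp: Int, tn: Int, fn: Int) {
  def accuracy: Double  = (tp + tn).toDouble / (tp + fp + tn + fn)
  def precision: Double = tp.toDouble / (tp + fp)
  def recall: Double    = tp.toDouble / (tp + fn)
}

// Fold over (label, prediction) pairs and increment the matching counter.
def countOutcomes(labelAndPreds: Seq[(Double, Double)]): BinaryCounts =
  labelAndPreds.foldLeft(BinaryCounts(0, 0, 0, 0)) {
    case (c, (1.0, 1.0)) => c.copy(tp = c.tp + 1)
    case (c, (0.0, 1.0)) => c.copy(fp = c.fp + 1)
    case (c, (0.0, 0.0)) => c.copy(tn = c.tn + 1)
    case (c, (1.0, 0.0)) => c.copy(fn = c.fn + 1)
    case (c, _)          => c // ignore pairs outside the 0/1 label set
  }

// Hypothetical (label, prediction) pairs for illustration only.
val sample = Seq((1.0, 1.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0), (0.0, 0.0))
val counts = countOutcomes(sample)
```

On an RDD, the same counts could be obtained by collecting `labelAndPreds` first when the test set is small, as it is here.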
  

In [ ]:
/*
// Save and load the model (savePath is not defined in this notebook; set it to a writable path first)
model.save(sc, savePath)
val sameModel = RandomForestModel.load(sc, savePath)
*/