<img src=http://fd.perso.eisti.fr/Logos/TORUS2.png>

A single Random Forest example is enough to illustrate classification algorithms.

"Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set"

(source : https://en.wikipedia.org/wiki/Random_forest)
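
The voting step described in the quote can be illustrated without Spark. The sketch below is a toy illustration, not Spark's implementation: the three "trees" are hypothetical hand-written rules over (age, year of operation, positive nodes), not trees learned from data, and the names `Features`, `Tree`, and `forestPredict` are introduced only for this example. The forest returns the mode of the individual votes.

```scala
// Toy illustration of forest voting. The "trees" are hypothetical hand-written
// rules, not trees learned from data.
type Features = (Int, Int, Int) // (age, year of operation, positive nodes)
type Tree = Features => Int     // returns a class label: 1 = survived, 0 = died

val trees: Seq[Tree] = Seq(
  { case (_, _, nodes) => if (nodes <= 3) 1 else 0 },
  { case (age, _, _) => if (age <= 50) 1 else 0 },
  { case (_, year, nodes) => if (year >= 60 && nodes <= 10) 1 else 0 }
)

// Classification: return the mode (most frequent vote) of the individual trees.
def forestPredict(trees: Seq[Tree], x: Features): Int =
  trees.map(tree => tree(x)).groupBy(identity).maxBy(_._2.size)._1

val unanimous = forestPredict(trees, (30, 64, 1)) // votes: 1, 1, 1
val majority  = forestPredict(trees, (45, 65, 5)) // votes: 0, 1, 1
println(s"unanimous = $unanimous, majority = $majority")
```

With an odd number of trees and two classes there is never a tie, which keeps the vote well defined in this sketch.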

### Read dataset (csv format) from HDFS

Here we use the dataset from https://archive.ics.uci.edu/ml/datasets/Haberman%27s+Survival

The target variable is the survival status (1: the patient survived 5 years or longer, 2: the patient died within 5 years). The explanatory variables are:
- Age of patient at time of operation (numerical) 
- Patient's year of operation (year - 1900, numerical) 
- Number of positive axillary nodes detected (numerical) 

In [ ]:
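import org.apache.spark.sql._
import org.apache.spark.sql.functions.when  // used for the recoding below
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._                    // enables the $"column" syntax

// Read the CSV, then recode status 2 (died within 5 years) as 0, so that 1 = survived and 0 = died
val data = spark.read.format("com.databricks.spark.csv")
              .option("header", "true").option("inferSchema", "true") 
              .load("hdfs://hupi-factory-02-01-01-01/user/hupi/dataset_torusVN/formation4_ML/haberman.csv")
              .withColumn("status", when($"status"===2, 0).otherwise($"status"))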

import org.apache.spark.sql._
spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@638b040b
data: org.apache.spark.sql.DataFrame = [age: int, nbYearOperation: int ... 2 more fields]


In [ ]:
data.show()

+---+---------------+-------------+------+
|age|nbYearOperation|nbPosAxillary|status|
+---+---------------+-------------+------+
| 30|             64|            1|     1|
| 30|             62|            3|     1|
| 30|             65|            0|     1|
| 31|             59|            2|     1|
| 31|             65|            4|     1|
| 33|             58|           10|     1|
| 33|             60|            0|     1|
| 34|             59|            0|     0|
| 34|             66|            9|     0|
| 34|             58|           30|     1|
| 34|             60|            1|     1|
| 34|             61|           10|     1|
| 34|             67|            7|     1|
| 34|             60|            0|     1|
| 35|             64|           13|     1|
| 35|             63|            0|     1|
| 36|             60|            1|     1|
| 36|             69|            0|     1|
| 37|             60|            0|     1|
| 37|             63|            0|     1|
+---+------

In [ ]:
data.printSchema()

root
 |-- age: integer (nullable = true)
 |-- nbYearOperation: integer (nullable = true)
 |-- nbPosAxillary: integer (nullable = true)
 |-- status: integer (nullable = true)



In [ ]:
// Data analysis: summary statistics (call .show() on the result to display them)
data.describe()

res4: org.apache.spark.sql.DataFrame = [summary: string, age: string ... 3 more fields]


## Pre-processing
To apply the Random Forest model, several pre-processing steps are required.
First, the data must be indexed.

**Vector Assembler:**
All explanatory variables are grouped into a single column. The "features" column holds the values of every variable for each observation.
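
As a minimal sketch of what the assembler produces, the snippet below applies a `VectorAssembler` to two rows copied from the dataset preview above. The local session (`master("local[*]")`) and the names `sample` and `assembled` are assumptions made to keep the example self-contained; in this notebook the existing `spark` session would be reused.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.VectorAssembler

// Assumption: a local session, used only to keep the sketch self-contained.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Two rows copied from the dataset preview.
val sample = Seq((30, 64, 1), (34, 58, 30))
  .toDF("age", "nbYearOperation", "nbPosAxillary")

// Each row gets a new "features" column holding the three values as one vector.
val assembled = new VectorAssembler()
  .setInputCols(Array("age", "nbYearOperation", "nbPosAxillary"))
  .setOutputCol("features")
  .transform(sample)

assembled.select("features").show(false)
```

The position of each value in the vector follows the order passed to `setInputCols`, so position 0 is `age`, position 1 is `nbYearOperation`, and position 2 is `nbPosAxillary`.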

In [ ]:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}


In [ ]:
// Index the target variable
val labelIndexer = new StringIndexer()
  .setInputCol("status")
  .setOutputCol("indexedLabel")
  .fit(data)

labelIndexer: org.apache.spark.ml.feature.StringIndexerModel = strIdx_e0913f8d049a


In [ ]:
// Assemble the explanatory variables into one vector, then index them
val assembler = new VectorAssembler()
  .setInputCols(Array("age", "nbYearOperation", "nbPosAxillary"))
  .setOutputCol("features")

val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)

assembler: org.apache.spark.ml.feature.VectorAssembler = vecAssembler_3a83610f5146
featureIndexer: org.apache.spark.ml.feature.VectorIndexer = vecIdx_9bd4bd6564dd
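
The `setMaxCategories(4)` setting decides which features `VectorIndexer` treats as categorical: a feature with at most 4 distinct values is indexed as categorical, and any other feature is left as continuous. The sketch below shows this on hypothetical toy data (the vectors, the local session, and the names `toyRows`, `toy`, and `indexerModel` are assumptions for illustration only).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.linalg.Vectors

// Assumption: a local session, used only to keep the sketch self-contained.
val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Hypothetical toy data: feature 0 has 2 distinct values, feature 1 has 5.
val toyRows = Seq(
  Vectors.dense(0.0, 10.0), Vectors.dense(1.0, 20.0), Vectors.dense(0.0, 30.0),
  Vectors.dense(1.0, 40.0), Vectors.dense(0.0, 50.0)
)
val toy = spark.createDataFrame(toyRows.map(Tuple1.apply)).toDF("features")

val indexerModel = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)
  .fit(toy)

// categoryMaps lists only the features that were treated as categorical.
println(s"Categorical feature indices: ${indexerModel.categoryMaps.keys.toSeq.sorted}")
```

In the survival dataset, a feature that takes more than 4 distinct values is therefore passed through unchanged by this stage.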


In [ ]:
// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")

labelConverter: org.apache.spark.ml.feature.IndexToString = idxToStr_7aa9872da46b


## Train-Test split
We split the dataset, keeping 70% of the data for the training phase and the remaining 30% for the test phase.
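
Note that `randomSplit` assigns each row to a split at random, so the resulting sizes are only approximately 70% and 30%; fixing the seed makes the split reproducible. The sketch below checks this on a hypothetical stand-in dataset of 1000 rows (the local session and the names `rows`, `train`, and `test` are assumptions for illustration).

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session, used only to keep the sketch self-contained.
val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Hypothetical stand-in for the dataset: 1000 rows with a single `id` column.
val rows = spark.range(1000).toDF("id")
val Array(train, test) = rows.randomSplit(Array(0.7, 0.3), seed = 11L)

// Every row lands in exactly one split, but the sizes are not exactly 700 and 300.
val nTrain = train.count()
val nTest = test.count()
println(s"train = $nTrain, test = $nTest, total = ${nTrain + nTest}")
```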

In [ ]:
// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3), seed = 11L)

trainingData: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [age: int, nbYearOperation: int ... 2 more fields]
testData: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [age: int, nbYearOperation: int ... 2 more fields]


## Build a Random Forest model
The model must first be instantiated.

In [ ]:
// Instantiate a RandomForest classifier (training happens when the pipeline is fit).
val rf = new RandomForestClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
  .setNumTrees(10)
//   .setMaxDepth(4)
//   .setMaxBins(32)
//   .setImpurity("gini")
//   .setFeatureSubsetStrategy("auto")

rf: org.apache.spark.ml.classification.RandomForestClassifier = rfc_30ee4740762f


## Pipeline construction

In [ ]:
// Chain indexers and forest in a Pipeline.
val pipeline = new Pipeline()
  .setStages(Array(labelIndexer, assembler, featureIndexer, rf))

pipeline: org.apache.spark.ml.Pipeline = pipeline_09be084cc8e9


## Training

In [ ]:
// Train model. This also runs the indexers.
val model = pipeline.fit(trainingData)

model: org.apache.spark.ml.PipelineModel = pipeline_09be084cc8e9


## Make predictions

In [ ]:
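// Make predictions.
// The forest predicts label indices, not status values. By default StringIndexer gives
// index 0.0 to the most frequent label, here status 1 (see indexedLabel in the output below),
// so the first withColumn maps index 0 back to status 1 and index 1 back to status 0.
val predictions = model.transform(testData)
                        .withColumn("prediction", when($"prediction" === 0, 1).otherwise(0))
                        .withColumn("prediction", $"prediction".cast("Double"))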

predictions: org.apache.spark.sql.DataFrame = [age: int, nbYearOperation: int ... 8 more fields]
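
The recoding of `prediction` in the cell above relies on knowing which status value received which index. A less fragile option is to read the mapping from the fitted `StringIndexerModel`. The sketch below is one possible way to do it, on hypothetical toy labels; the local session and the names `toy`, `indexerModel`, and `decode` are assumptions for illustration. Depending on the Spark version, `labelsArray(0)` may be the preferred accessor over `labels`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.StringIndexer

// Assumption: a local session, used only to keep the sketch self-contained.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical toy labels: status 1 appears three times, status 0 once.
val toy = Seq(1, 1, 0, 1).toDF("status")

val indexerModel = new StringIndexer()
  .setInputCol("status")
  .setOutputCol("indexedLabel")
  .fit(toy)

// labels(i) is the original value (as a string) that received index i.
val labels: Array[String] = indexerModel.labels
println(labels.mkString(", "))

// Decode a predicted index back to the original status value.
def decode(predictedIndex: Double): Int = labels(predictedIndex.toInt).toInt
```

With this lookup, the decoding stays correct even if the class frequencies change and the indices are assigned differently.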


In [ ]:
predictions.show()

+---+---------------+-------------+------+------------+----------------+----------------+--------------------+--------------------+----------+
|age|nbYearOperation|nbPosAxillary|status|indexedLabel|        features| indexedFeatures|       rawPrediction|         probability|prediction|
+---+---------------+-------------+------+------------+----------------+----------------+--------------------+--------------------+----------+
| 30|             65|            0|     1|         0.0| [30.0,65.0,0.0]| [30.0,65.0,0.0]|[9.35321551300932...|[0.93532155130093...|       1.0|
| 31|             59|            2|     1|         0.0| [31.0,59.0,2.0]| [31.0,59.0,2.0]|           [8.5,1.5]|         [0.85,0.15]|       1.0|
| 34|             60|            0|     1|         0.0| [34.0,60.0,0.0]| [34.0,60.0,0.0]|[9.35321551300932...|[0.93532155130093...|       1.0|
| 34|             60|            1|     1|         0.0| [34.0,60.0,1.0]| [34.0,60.0,1.0]|[9.85321551300932...|[0.98532155130093...|       1.0|

## Analyze the results
We want to obtain a score, and also to print the decision trees built during training.

In [ ]:
// Select (prediction, true label) and compute test error.
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("status")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")

val accuracy = evaluator.evaluate(predictions)
println(s"Test Error = ${(1.0 - accuracy)}")

val rfModel = model.stages(3).asInstanceOf[RandomForestClassificationModel]
println(s"Learned classification forest model:\n ${rfModel.toDebugString}")

Test Error = 0.2954545454545454
Learned classification forest model:
 RandomForestClassificationModel (uid=rfc_b27bebea91dd) with 10 trees
  Tree 0 (weight 1.0):
    If (feature 2 <= 14.0)
     If (feature 0 <= 41.0)
      If (feature 1 <= 59.0)
       If (feature 0 <= 34.0)
        If (feature 2 <= 0.0)
         Predict: 1.0
        Else (feature 2 > 0.0)
         Predict: 0.0
       Else (feature 0 > 34.0)
        Predict: 0.0
      Else (feature 1 > 59.0)
       Predict: 0.0
     Else (feature 0 > 41.0)
      If (feature 0 <= 54.0)
       If (feature 2 <= 1.0)
        If (feature 1 <= 62.0)
         Predict: 0.0
        Else (feature 1 > 62.0)
         Predict: 1.0
       Else (feature 2 > 1.0)
        If (feature 1 <= 61.0)
         Predict: 0.0
        Else (feature 1 > 61.0)
         Predict: 1.0
      Else (feature 0 > 54.0)
       If (feature 1 <= 65.0)
        If (feature 1 <= 64.0)
         Predict: 0.0
        Else (feature 1 > 64.0)
         Predict: 1.0
       Else (featur

In [ ]:
model.stages(3)

res36: org.apache.spark.ml.Transformer = RandomForestClassificationModel (uid=rfc_b27bebea91dd) with 10 trees
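
Accuracy, as computed by the evaluator above, does not show which class the errors come from. A confusion matrix gives that breakdown. The sketch below computes one with a plain `groupBy`, on hypothetical (status, prediction) pairs standing in for the `predictions` DataFrame; the values, the local session, and the name `toyPredictions` are assumptions for illustration, not results from this notebook.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session, used only to keep the sketch self-contained.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical (status, prediction) pairs, not taken from the real test set.
val toyPredictions = Seq(
  (1, 1.0), (1, 1.0), (1, 1.0), (1, 0.0),
  (0, 1.0), (0, 1.0), (0, 0.0)
).toDF("status", "prediction")

// Confusion matrix: one row per (true status, predicted status) pair.
val confusion = toyPredictions.groupBy("status", "prediction").count()
confusion.orderBy("status", "prediction").show()

// Accuracy is the share of rows where the prediction equals the true status.
val correct = toyPredictions.filter($"status" === $"prediction").count()
val accuracy = correct.toDouble / toyPredictions.count()
println(s"accuracy = $accuracy")
```

On the notebook's data, the same `groupBy("status", "prediction")` could be applied directly to `predictions`.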


In [ ]:
/*
// Save and load the fitted pipeline (savePath is a placeholder to define)
import org.apache.spark.ml.PipelineModel
model.write.overwrite().save(savePath)
val sameModel = PipelineModel.load(savePath)
*/

# Exercise
Apply a Logistic Regression classification model and compare the results of the two models. You will need to find metrics that can be compared between them.

Documentation : https://spark.apache.org/docs/latest/ml-classification-regression.html#logistic-regression

Build a pipeline covering every step from pre-processing through to the creation of the model.

In [ ]:
// Prepare the input data


In [ ]:
// Create the model


In [ ]:
// The pipeline


In [ ]:
// Predictions


In [ ]:
// Analyze the results


# Solution