# MSBD 5012 group project

In [None]:
"""     
Student Name:  ZHANG Xinyue                /        Qiao  Shuyu        /       Li Zuoxuan
Student ID:    20750194                    /          20747563         /        20740917
Student Email: xzhangfa@connect.ust.hk     /   sqiaoac@connect.ust.hk  /   zlify@connect.ust.hk
Course Name:   MSBD5012
URL in github: https://github.com/orange-neng/MSBD5012-Forest-type-prediction-exploration
"""

**Description**
In this project, we predict the forest cover type (the predominant kind of tree cover) from strictly cartographic variables (as opposed to remotely sensed data). To carry out the classification task, we first analyze the dataset to inform the later data pre-processing and modeling. We then select several machine learning algorithms from packages such as Sklearn, Keras and Spark MLlib and compare their performance. This report describes the data analysis, data cleaning, data normalization and hyperparameter tuning steps and shows how they affect the final classification accuracy.
* This file applies the decision tree from Spark MLlib.

> ### 1.Load data & LabeledPoint object format
**comment:** 
* `init` returns all values except the last one; the last column is the target.
* The decision tree requires labels to start at 0, so we subtract 1 from the target.
* Each record is converted to a LabeledPoint.

In [None]:
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.regression._

val rawData = sc.textFile("./forest-cover-type-prediction/covtype.data")

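val data = rawData.map { line =>
  // Each record is one comma-separated line of numeric values.
  val values = line.split(',').map(_.toDouble)
  // All columns except the last are features.
  val featureVector = Vectors.dense(values.init)
  // The last column is the cover type (1 to 7); shift it to 0 to 6.
  val label = values.last - 1
  LabeledPoint(label, featureVector)
}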
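
To make the parsing step concrete, here is a minimal plain-Scala sketch (no Spark needed) of what happens to a single record. The sample row is made up for illustration and is much shorter than a real covtype record.

```scala
// Hypothetical, shortened record: three feature values followed by the label.
val line = "2596,51,3,5"

val values = line.split(',').map(_.toDouble)
val features = values.init      // everything except the last element
val label = values.last - 1     // shift the 1-based label to a 0-based one

println(features.mkString("[", ", ", "]"))
println(label)
```

In the Spark cell above, the same two values are wrapped in `Vectors.dense` and `LabeledPoint`.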

> ### 2.Data preprocessing
**comment:** 
* We now divide the data into three disjoint parts: a training set, a cross-validation (CV) set and a test set.
* In the code below, the training set accounts for 80% of the data, and the CV set and the test set each account for 10%:

In [None]:
val Array(trainData, cvData, testData) =
  data.randomSplit(Array(0.8, 0.1, 0.1))
trainData.cache()
cvData.cache()
testData.cache()
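
The split above is random, so the three sets differ from run to run unless a seed is fixed. The following plain-Scala sketch illustrates the general idea of a weighted random split with a fixed seed. It is an illustration under simple assumptions (one uniform draw per item), not a description of how Spark implements `randomSplit`; the item list and the seed value are made up.

```scala
import scala.util.Random

val weights = Array(0.8, 0.1, 0.1)
// Cumulative upper bounds of each bucket (roughly 0.8, 0.9, 1.0).
val upperBounds = weights.scanLeft(0.0)(_ + _).tail

val rng = new Random(42L)          // fixed seed so the split is repeatable
val items = (1 to 1000).toVector   // stand-in for the records

// Draw one uniform number per item and pick the first bucket whose bound exceeds it.
val assigned = items.map { item =>
  val x = rng.nextDouble()
  val idx = upperBounds.indexWhere(x < _)
  // Guard against floating-point rounding in the last bound.
  val bucket = if (idx < 0) weights.length - 1 else idx
  (bucket, item)
}

val Seq(train, cv, test) =
  (0 until 3).map(b => assigned.filter(_._1 == b).map(_._2))

println((train.size, cv.size, test.size))
```

With Spark itself, a seed can be passed as the second argument of `randomSplit` to get a repeatable split.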

> ### 3.Model
**comment:** 
* Build a DecisionTreeModel on the training set with a first, untuned set of parameters (Gini impurity, maximum depth 4, 100 bins), and use the CV set to compute metrics for the resulting model:
* `trainClassifier` tells MLlib to treat the target in each LabeledPoint as a distinct class label rather than as a numeric value.
* The class labels of a DecisionTreeModel start from 0.

In [None]:
import org.apache.spark.mllib.evaluation._
import org.apache.spark.mllib.tree._
import org.apache.spark.mllib.tree.model._
import org.apache.spark.rdd._

def getMetrics(model: DecisionTreeModel, data: RDD[LabeledPoint]): MulticlassMetrics = {
  val predictionsAndLabels = data.map(example =>
    (model.predict(example.features), example.label)
  )
  new MulticlassMetrics(predictionsAndLabels)
}

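// Arguments: training data, number of classes, categorical feature info
// (empty map = treat all features as numeric), impurity, max depth, max bins.
val model = DecisionTree.trainClassifier(
  trainData, 7, Map[Int,Int](), "gini", 4, 100)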

In [None]:
val metrics = getMetrics(model, cvData)
metrics.confusionMatrix

In [None]:
// Overall accuracy on the CV set.
metrics.accuracy

In [None]:
(0 until 7).map(target => (metrics.precision(target), metrics.recall(target))).foreach(println)
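
As a reminder of what the per-class numbers above mean, the sketch below computes precision, recall and accuracy by hand from a small confusion matrix. The 3-class matrix is made up for illustration; it follows the same layout as `confusionMatrix`, with actual classes in rows and predicted classes in columns.

```scala
// Made-up confusion matrix: rows are actual classes, columns are predicted classes.
val confusion = Array(
  Array(8, 2, 0),
  Array(1, 6, 3),
  Array(3, 0, 7)
)

// Precision: correct predictions of class c divided by all predictions of c (column sum).
def precision(c: Int): Double =
  confusion(c)(c).toDouble / confusion.map(_(c)).sum

// Recall: correct predictions of class c divided by all actual members of c (row sum).
def recall(c: Int): Double =
  confusion(c)(c).toDouble / confusion(c).sum

// Accuracy: sum of the diagonal divided by the total number of examples.
val accuracy =
  confusion.indices.map(i => confusion(i)(i)).sum.toDouble / confusion.map(_.sum).sum

(0 until 3).map(c => (precision(c), recall(c))).foreach(println)
println(accuracy)
```

A class can have high recall but low precision (class 0 here) when the model predicts it too often.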

In [None]:
import org.apache.spark.rdd._

def classProbabilities(data: RDD[LabeledPoint]): Array[Double] = {
  val countsByCategory = data.map(_.label).countByValue()
  val counts = countsByCategory.toArray.sortBy(_._1).map(_._2)
  counts.map(_.toDouble / counts.sum)
}

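val trainPriorProbabilities = classProbabilities(trainData)
val cvPriorProbabilities = classProbabilities(cvData)
// Expected accuracy of guessing classes at random in proportion to their
// training-set frequency; a baseline for the decision tree to beat.
trainPriorProbabilities.zip(cvPriorProbabilities).map {
  case (trainProb, cvProb) => trainProb * cvProb
}.sum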
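
The baseline computed above can be checked by hand on a toy example. The sketch below mirrors `classProbabilities` on plain Scala collections; the label lists are made up, and integer labels are used here only to keep the example simple. Like the cell above, it assumes every class appears in both sets, since otherwise `zip` would pair up the wrong classes.

```scala
// Made-up label lists: 3 classes with different frequencies in each set.
val trainLabels = Seq.fill(5)(0) ++ Seq.fill(3)(1) ++ Seq.fill(2)(2)
val cvLabels    = Seq.fill(2)(0) ++ Seq.fill(2)(1) ++ Seq.fill(1)(2)

// Fraction of examples per class, ordered by class label.
def classProbabilities(labels: Seq[Int]): Array[Double] = {
  val counts = labels.groupBy(identity).toArray.sortBy(_._1).map(_._2.size)
  counts.map(_.toDouble / counts.sum)
}

val trainPriors = classProbabilities(trainLabels)  // 0.5, 0.3, 0.2
val cvPriors    = classProbabilities(cvLabels)     // 0.4, 0.4, 0.2

// Chance of a correct guess: sum over classes of P(guess = c) * P(actual = c).
val baseline = trainPriors.zip(cvPriors).map { case (p, q) => p * q }.sum
println(baseline)
```

Here the baseline is 0.5 * 0.4 + 0.3 * 0.4 + 0.2 * 0.2 = 0.36, so a model on this toy data would need to beat 36% accuracy to be better than random guessing.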

In [None]:
DecisionTree.trainClassifier

> ### 4.Parameters tuning
**comment:** 
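* A triple nested loop over impurity, maximum depth and number of bins trains one model per parameter combination and measures its accuracy on the CV set.
* The results are sorted and printed in descending order of the second value (accuracy).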

In [None]:
val evaluations =
  for (impurity <- Array("gini", "entropy");
       depth    <- Array(1, 20);
       bins     <- Array(10, 300))
    yield {
      val model = DecisionTree.trainClassifier(
        trainData, 7, Map[Int,Int](), impurity, depth, bins)
      val predictionsAndLabels = cvData.map(example =>
        (model.predict(example.features), example.label)
      )
      val accuracy =
        new MulticlassMetrics(predictionsAndLabels).accuracy
      ((impurity, depth, bins), accuracy)
    }

evaluations.sortBy(_._2).reverse.foreach(println)
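
The shape of this grid search can be tried out without Spark. In the sketch below, `fakeAccuracy` is a hypothetical stand-in for "train a model and measure its CV accuracy"; its formula is arbitrary and exists only to make the example runnable.

```scala
// Hypothetical scoring function; NOT a real model evaluation.
def fakeAccuracy(impurity: String, depth: Int, bins: Int): Double = {
  val impurityBonus = if (impurity == "entropy") 0.02 else 0.0
  0.5 + 0.01 * depth + 0.0005 * bins + impurityBonus
}

// Three generators give 2 * 2 * 2 = 8 parameter combinations.
val evaluations =
  for (impurity <- Array("gini", "entropy");
       depth    <- Array(1, 20);
       bins     <- Array(10, 300))
    yield ((impurity, depth, bins), fakeAccuracy(impurity, depth, bins))

// Highest score first.
val ranked = evaluations.sortBy { case (_, score) => -score }
ranked.foreach(println)
```

Because each combination trains a full model, the cost grows with the product of the list lengths, which is one reason the grid above uses only two values per parameter.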

> ### 5.Evaluation
**comment:** 
* The last step is to train a model with the selected hyperparameters on the union of the training set and the CV set, and then evaluate it as before, this time on the held-out test set:

In [None]:
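val model = DecisionTree.trainClassifier(
  trainData.union(cvData), 7, Map[Int,Int](), "entropy", 20, 300)
// Evaluate on the held-out test set, which was not used for training or tuning.
getMetrics(model, testData).accuracy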
