# Higgs Boson Machine Learning classification

### Importing MLlib libraries

In [1]:
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.configuration.Algo
import org.apache.spark.mllib.tree.impurity.Entropy
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import sqlContext.implicits._
import org.apache.spark.sql.functions._

### Read data and pre-processing

File descriptions

- training.csv: Training set of 250000 events, with an ID column, 30 feature columns, a weight column and a label column.
 
- test.csv: Test set of 550000 events with an ID column and 30 feature columns. It has no labels, so it is not used in this example; it is only needed to prepare a submission for the Kaggle competition.


For detailed information on the semantics of the features, labels, and weights, see the technical documentation at https://www.kaggle.com/c/higgs-boson/

Some details to get started:

- All variables are floating point, except PRI_jet_num, which is an integer.
- Variables prefixed with PRI (for PRImitives) are “raw” quantities about the bunch collision as measured by the detector.
- Variables prefixed with DER (for DERived) are quantities computed from the primitive features; they were selected by the physicists of ATLAS.
- For some entries, some variables are meaningless or cannot be computed; in this case their value is −999.0, which is outside the normal range of all variables.
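
The −999.0 sentinel described above can be made explicit before modelling. The following is a minimal sketch in plain Scala (no Spark), assuming the first eight feature values of event 100001 as printed in the sample output below; the names `Missing`, `row`, `missingCount` and `asOptions` are hypothetical and not part of the notebook.

```scala
// Sentinel the dataset uses for values that are meaningless or cannot be computed.
val Missing = -999.0

// First eight feature values of event 100001 (copied from the printed sample rows).
val row = Array(160.937, 68.768, 103.235, 48.146, -999.0, -999.0, -999.0, 3.473)

// Count how many of these values are undefined.
val missingCount = row.count(_ == Missing)

// Represent undefined values as None instead of a magic number.
val asOptions: Array[Option[Double]] =
  row.map(v => if (v == Missing) None else Some(v))

println(s"missing values: $missingCount")
```

The rest of this notebook leaves the sentinel in place and lets the decision tree split on it like any other value.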

In [2]:
val rawData = sc.textFile("data/Higgs/training.csv")
rawData.take(5).foreach(println)
rawData.count()

EventId,DER_mass_MMC,DER_mass_transverse_met_lep,DER_mass_vis,DER_pt_h,DER_deltaeta_jet_jet,DER_mass_jet_jet,DER_prodeta_jet_jet,DER_deltar_tau_lep,DER_pt_tot,DER_sum_pt,DER_pt_ratio_lep_tau,DER_met_phi_centrality,DER_lep_eta_centrality,PRI_tau_pt,PRI_tau_eta,PRI_tau_phi,PRI_lep_pt,PRI_lep_eta,PRI_lep_phi,PRI_met,PRI_met_phi,PRI_met_sumet,PRI_jet_num,PRI_jet_leading_pt,PRI_jet_leading_eta,PRI_jet_leading_phi,PRI_jet_subleading_pt,PRI_jet_subleading_eta,PRI_jet_subleading_phi,PRI_jet_all_pt,Weight,Label
100000,138.47,51.655,97.827,27.98,0.91,124.711,2.666,3.064,41.928,197.76,1.582,1.396,0.2,32.638,1.017,0.381,51.626,2.273,-2.414,16.824,-0.277,258.733,2,67.435,2.15,0.444,46.062,1.24,-2.475,113.497,0.00265331133733,s
100001,160.937,68.768,103.235,48.146,-999.0,-999.0,-999.0,3.473,2.078,125.157,0.879,1.414,-999.0,42.014,2.039,-3.011,36.918,0.501,0.103,44.704,-1.916,164.546,1,46.226,0.725,1.158,-999.0,-999.0,-999.0,46.226,2.23358448717,b
100002,-999.0,162.172,125.953,35.635,-999.0,-999.0,-9

250001

In [3]:
val splitlines = rawData.map(lines => {
    lines.split(',')
  })
splitlines.first()

Array(EventId, DER_mass_MMC, DER_mass_transverse_met_lep, DER_mass_vis, DER_pt_h, DER_deltaeta_jet_jet, DER_mass_jet_jet, DER_prodeta_jet_jet, DER_deltar_tau_lep, DER_pt_tot, DER_sum_pt, DER_pt_ratio_lep_tau, DER_met_phi_centrality, DER_lep_eta_centrality, PRI_tau_pt, PRI_tau_eta, PRI_tau_phi, PRI_lep_pt, PRI_lep_eta, PRI_lep_phi, PRI_met, PRI_met_phi, PRI_met_sumet, PRI_jet_num, PRI_jet_leading_pt, PRI_jet_leading_eta, PRI_jet_leading_phi, PRI_jet_subleading_pt, PRI_jet_subleading_eta, PRI_jet_subleading_phi, PRI_jet_all_pt, Weight, Label)

In [4]:
val temp = splitlines.filter(lines => lines(0) != "EventId")
temp.first

Array(100000, 138.47, 51.655, 97.827, 27.98, 0.91, 124.711, 2.666, 3.064, 41.928, 197.76, 1.582, 1.396, 0.2, 32.638, 1.017, 0.381, 51.626, 2.273, -2.414, 16.824, -0.277, 258.733, 2, 67.435, 2.15, 0.444, 46.062, 1.24, -2.475, 113.497, 0.00265331133733, s)

Drop the EventId and Weight columns, and encode the label as a number (s = 1, b = 0)

In [5]:
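val Data = temp.map { col =>
     // The last column is the label: "s" (signal) becomes 1.0, "b" (background) becomes 0.0
     val temp_label = col(col.size - 1)
     val label = if (temp_label == "s") 1.0 else 0.0
     // slice(1, size - 2) skips EventId (first column) and Weight, Label (last two columns)
     val features = col.slice(1, col.size - 2).map(_.toDouble)
     LabeledPoint(label, Vectors.dense(features))
}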
Data.take(5).foreach(println)

(1.0,[138.47,51.655,97.827,27.98,0.91,124.711,2.666,3.064,41.928,197.76,1.582,1.396,0.2,32.638,1.017,0.381,51.626,2.273,-2.414,16.824,-0.277,258.733,2.0,67.435,2.15,0.444,46.062,1.24,-2.475,113.497])
(0.0,[160.937,68.768,103.235,48.146,-999.0,-999.0,-999.0,3.473,2.078,125.157,0.879,1.414,-999.0,42.014,2.039,-3.011,36.918,0.501,0.103,44.704,-1.916,164.546,1.0,46.226,0.725,1.158,-999.0,-999.0,-999.0,46.226])
(0.0,[-999.0,162.172,125.953,35.635,-999.0,-999.0,-999.0,3.148,9.336,197.814,3.776,1.414,-999.0,32.154,-0.705,-2.093,121.409,-0.953,1.052,54.283,-2.186,260.414,1.0,44.251,2.053,-2.028,-999.0,-999.0,-999.0,44.251])
(0.0,[143.905,81.417,80.943,0.414,-999.0,-999.0,-999.0,3.31,0.414,75.968,2.354,-1.285,-999.0,22.647,-1.655,0.01,53.321,-0.522,-3.1,31.082,0.06,86.062,0.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-0.0])
(0.0,[175.864,16.915,134.805,16.405,-999.0,-999.0,-999.0,3.891,16.405,57.983,1.056,-1.385,-999.0,28.209,-2.197,-2.231,29.774,0.798,1.569,2.723,-0.871,53.131,0.0,-999.0,-999.
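
To make the column arithmetic of the parsing cell concrete, here is a small sketch in plain Scala (no Spark) that applies the same `split` and `slice` calls to the first data row printed earlier. The names `line`, `cols`, `label` and `features` are only for illustration.

```scala
// First data row of training.csv, copied from the printed sample
// (33 columns: EventId, 30 features, Weight, Label).
val line =
  "100000,138.47,51.655,97.827,27.98,0.91,124.711,2.666,3.064,41.928,197.76," +
  "1.582,1.396,0.2,32.638,1.017,0.381,51.626,2.273,-2.414,16.824,-0.277,258.733," +
  "2,67.435,2.15,0.444,46.062,1.24,-2.475,113.497,0.00265331133733,s"

val cols = line.split(',')

// Label is the last column: "s" maps to 1.0, anything else to 0.0.
val label = if (cols(cols.length - 1) == "s") 1.0 else 0.0

// slice is end-exclusive: indices 1 to length - 3 are kept,
// which drops EventId, Weight and Label.
val features = cols.slice(1, cols.length - 2).map(_.toDouble)

println(s"columns = ${cols.length}, features = ${features.length}, label = $label")
```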

### Split the data into training and test sets (40% held out for testing)

In [6]:
val splits = Data.randomSplit(Array(0.6, 0.4), seed = 13L)
val trainingData = splits(0).cache()
val testData = splits(1)
println("Training Data")
trainingData.take(5).foreach(println)
println("Test Data")
testData.take(5).foreach(println)

Training Data
(1.0,[138.47,51.655,97.827,27.98,0.91,124.711,2.666,3.064,41.928,197.76,1.582,1.396,0.2,32.638,1.017,0.381,51.626,2.273,-2.414,16.824,-0.277,258.733,2.0,67.435,2.15,0.444,46.062,1.24,-2.475,113.497])
(0.0,[-999.0,162.172,125.953,35.635,-999.0,-999.0,-999.0,3.148,9.336,197.814,3.776,1.414,-999.0,32.154,-0.705,-2.093,121.409,-0.953,1.052,54.283,-2.186,260.414,1.0,44.251,2.053,-2.028,-999.0,-999.0,-999.0,44.251])
(0.0,[143.905,81.417,80.943,0.414,-999.0,-999.0,-999.0,3.31,0.414,75.968,2.354,-1.285,-999.0,22.647,-1.655,0.01,53.321,-0.522,-3.1,31.082,0.06,86.062,0.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-0.0])
(0.0,[89.744,13.55,59.149,116.344,2.636,284.584,-0.54,1.362,61.619,278.876,0.588,0.479,0.975,53.651,0.371,1.329,31.565,-0.884,1.857,40.735,2.237,282.849,3.0,90.547,-2.412,-0.653,56.165,0.224,3.106,193.66])
(1.0,[148.754,28.862,107.782,106.13,0.733,158.359,0.113,2.941,2.545,305.967,3.371,1.393,0.791,28.85,1.113,2.409,97.24,0.675,-0.966,38.421,-1.443,294.074,2.0,123.01
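
`randomSplit` assigns each row to a split at random, so the weights 0.6 and 0.4 are only met approximately. As a quick check, the sketch below uses the two counts printed further down in this notebook (150118 training rows, 99882 test rows); the variable names are hypothetical.

```scala
// Row counts reported later in this notebook by trainingData.count and testData.count.
val trainCount = 150118L
val testCount = 99882L
val total = trainCount + testCount

// Realised split fractions, to compare with the requested 0.6 / 0.4.
val trainFraction = trainCount.toDouble / total
val testFraction = testCount.toDouble / total

println(f"train = $trainFraction%.4f, test = $testFraction%.4f, total = $total")
```

The total matches the 250000 events of training.csv (the raw line count of 250001 includes the header).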

### Train a Decision Tree
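
The tree in this section is trained with the "entropy" impurity. As a rough illustration of what that criterion measures, the sketch below computes the base-2 entropy of a node from its class counts, and the information gain of a candidate split. It is a plain-Scala sketch with hypothetical helper names (`entropy`, `informationGain`), not MLlib's internal implementation.

```scala
// Base-2 Shannon entropy of a node, given the number of samples in each class.
def entropy(counts: Seq[Long]): Double = {
  val total = counts.sum.toDouble
  counts.filter(_ > 0).map { c =>
    val p = c / total
    -p * math.log(p) / math.log(2)
  }.sum
}

// Information gain of splitting `parent` into `left` and `right`:
// parent entropy minus the size-weighted entropy of the children.
def informationGain(parent: Seq[Long], left: Seq[Long], right: Seq[Long]): Double = {
  val n = parent.sum.toDouble
  entropy(parent) - (left.sum / n) * entropy(left) - (right.sum / n) * entropy(right)
}

// A perfectly mixed node has entropy 1 bit; a pure node has entropy 0.
println(entropy(Seq(50L, 50L)))
println(entropy(Seq(100L, 0L)))
```

A split is better the more it reduces entropy, so the tree prefers thresholds that separate signal from background as cleanly as possible.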

In [7]:
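val numClasses = 2                             // signal (1.0) or background (0.0)
val categoricalFeaturesInfo = Map[Int, Int]()  // empty map: all features are treated as continuous
val impurity = "entropy"                       // split criterion
val maxDepth = 3                               // maximum depth of the tree
val maxBins = 10                               // number of bins used to discretize continuous features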
val dtModel = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
  impurity, maxDepth, maxBins)
println(dtModel.toDebugString)

DecisionTreeModel classifier of depth 3 with 15 nodes
  If (feature 0 <= 94.822)
   If (feature 0 <= 83.871)
    If (feature 2 <= 46.551)
     Predict: 0.0
    Else (feature 2 > 46.551)
     Predict: 0.0
   Else (feature 0 > 83.871)
    If (feature 7 <= 2.868)
     Predict: 0.0
    Else (feature 7 > 2.868)
     Predict: 0.0
  Else (feature 0 > 94.822)
   If (feature 1 <= 46.975)
    If (feature 2 <= 116.784)
     Predict: 1.0
    Else (feature 2 > 116.784)
     Predict: 0.0
   Else (feature 1 > 46.975)
    If (feature 2 <= 116.784)
     Predict: 0.0
    Else (feature 2 > 116.784)
     Predict: 0.0
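
The printed tree can be read as nested `if` statements. Only one leaf predicts 1.0, so the whole model collapses to the small function below. This is a hand-written sketch of the debug output above, with a hypothetical name (`predictFromPrintedTree`); the thresholds are copied from the printout. Since EventId was dropped, features 0, 1 and 2 correspond to DER_mass_MMC, DER_mass_transverse_met_lep and DER_mass_vis.

```scala
// Hand translation of the printed depth-3 tree.
// f holds the feature vector in the same order as the LabeledPoint features.
def predictFromPrintedTree(f: Array[Double]): Double =
  if (f(0) <= 94.822) 0.0                               // every leaf under this branch predicts 0.0
  else if (f(1) <= 46.975 && f(2) <= 116.784) 1.0       // the only signal leaf
  else 0.0                                              // all remaining leaves predict 0.0

// First three features of two rows printed in the training sample.
println(predictFromPrintedTree(Array(138.47, 51.655, 97.827)))
println(predictFromPrintedTree(Array(148.754, 28.862, 107.782)))
```

One consequence worth noting: rows where feature 0 carries the −999.0 sentinel always fall into the first branch and are predicted as background.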



In [8]:
val dtTotalCorrect = trainingData.map { point =>
  if (dtModel.predict(point.features) == point.label) 1 else 0
  }.sum

println(dtTotalCorrect)
println(trainingData.count)

117478.0
150118


In [9]:
val dtAccuracy = dtTotalCorrect / trainingData.count
println(dtAccuracy)

0.7825710441119652


### Evaluate on the test set

In [10]:
val dtTotalCorrect = testData.map { point =>
  if (dtModel.predict(point.features) == point.label) 1 else 0
  }.sum
println(dtTotalCorrect)
println(testData.count)

78009.0
99882


In [11]:
val dtAccuracy = dtTotalCorrect / testData.count
println(dtAccuracy)

0.7810115936805431


In [12]:
val predictionAndLabels = testData.map { case LabeledPoint(label, features) =>
  val prediction = dtModel.predict(features)
  (prediction, label)
}

// Instantiate metrics object
val metrics = new MulticlassMetrics(predictionAndLabels)

// Confusion matrix
println("Confusion matrix:")
println(metrics.confusionMatrix)

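// Overall statistics
// Note: the no-argument precision, recall and fMeasure of MulticlassMetrics are
// computed over all classes together, so all three equal the overall accuracy
val precision = metrics.precision
val recall = metrics.recall
val f1Score = metrics.fMeasure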
println("Summary Statistics")
println(s"Precision = $precision")
println(s"Recall = $recall")
println(s"F1 Score = $f1Score")

// Precision by label
val labels = metrics.labels
labels.foreach { l =>
    println(s"Precision($l) = " + metrics.precision(l))
}

// Recall by label
labels.foreach { l =>
    println(s"Recall($l) = " + metrics.recall(l))
}

// False positive rate by label
labels.foreach { l =>
    println(s"FPR($l) = " + metrics.falsePositiveRate(l))
}

// F-measure by label
labels.foreach { l =>
    println(s"F1-Score($l) = " + metrics.fMeasure(l))
}

// Weighted stats
println(s"Weighted precision: ${metrics.weightedPrecision}")
println(s"Weighted recall: ${metrics.weightedRecall}")
println(s"Weighted F1 score: ${metrics.weightedFMeasure}")
println(s"Weighted false positive rate: ${metrics.weightedFalsePositiveRate}")

Confusion matrix:
55072.0  10576.0  
11297.0  22937.0  
Summary Statistics
Precision = 0.7810115936805431
Recall = 0.7810115936805431
F1 Score = 0.7810115936805431
Precision(0.0) = 0.8297849899802618
Precision(1.0) = 0.6844209709664906
Recall(0.0) = 0.8388983670485011
Recall(1.0) = 0.6700064263597593
FPR(0.0) = 0.3299935736402407
FPR(1.0) = 0.1611016329514989
F1-Score(0.0) = 0.8343167925342948
F1-Score(1.0) = 0.6771369949960883
Weighted precision: 0.7799622809143896
Weighted recall: 0.7810115936805431
Weighted F1 score: 0.7804442910933649
Weighted false positive rate: 0.2721068002722826
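
The per-label figures above follow directly from the confusion matrix, whose rows are the actual labels and whose columns are the predicted labels. The sketch below recomputes them in plain Scala from the four printed counts, and adds a majority-class baseline for comparison. All names (`cm`, `precision`, `recall`, `f1`, `majorityBaseline`) are hypothetical helpers, not MLlib calls.

```scala
// Confusion matrix copied from the output above:
// rows = actual label (0.0, 1.0), columns = predicted label (0.0, 1.0).
val cm = Array(Array(55072.0, 10576.0), Array(11297.0, 22937.0))
val total = cm.map(_.sum).sum

// Precision: correct predictions of class k over everything predicted as k (column sum).
def precision(k: Int): Double = cm(k)(k) / cm.map(_(k)).sum

// Recall: correct predictions of class k over everything that actually is k (row sum).
def recall(k: Int): Double = cm(k)(k) / cm(k).sum

// F1: harmonic mean of precision and recall.
def f1(k: Int): Double = {
  val p = precision(k)
  val r = recall(k)
  2 * p * r / (p + r)
}

// Accuracy: the diagonal over the total.
val accuracy = (cm(0)(0) + cm(1)(1)) / total

// Accuracy of always predicting the most frequent actual class.
val majorityBaseline = cm.map(_.sum).max / total

println(f"accuracy = $accuracy%.4f, majority baseline = $majorityBaseline%.4f")
println(f"precision(1) = ${precision(1)}%.4f, recall(1) = ${recall(1)}%.4f, f1(1) = ${f1(1)}%.4f")
```

Comparing the accuracy with the majority baseline gives a sense of how much the depth-3 tree actually learns beyond the class imbalance.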

