# Particle Identification Task Using Gradient Boosted Trees

In this project, I train a classifier to identify the type of a particle. There are six particle types: electron, proton, muon, kaon, pion and ghost. A ghost is either a particle of some type other than the first five or detector noise.

Different particle types leave different responses in the detector systems, or subdetectors. There are five systems: the tracking system, the ring imaging Cherenkov detector (RICH), the electromagnetic and hadron calorimeters, and the muon system.

![pid](Images/pid.jpg)

My aim is to identify a particle type using the responses in the detector systems. 

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas
import numpy
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
import utils

# Read Data File

In [None]:
data = pandas.read_csv('Data/training.csv.gz')

In [None]:
data.head()

### Domain Information

The abbreviations used in the column names stand for:
+ **Spd** : Scintillating Pad Detector
+ **Prs** : Preshower
+ **Ecal** : Electromagnetic Calorimeter
+ **Hcal** : Hadronic Calorimeter
+ **Brem** : Bremsstrahlung; denotes traces of particles that were deflected by the detector.

Column descriptions are as follows:

- ***ID*** : id value for tracks (present only in the test file, for submission purposes)
- ***Label*** : string valued observable denoting particle types. Can take values "Electron", "Muon", "Kaon", "Proton", "Pion" and "Ghost". This column is absent in the test file.
- ***FlagSpd*** : flag (0 or 1), if reconstructed track passes through Spd
- ***FlagPrs*** : flag (0 or 1), if reconstructed track passes through Prs
- ***FlagBrem*** : flag (0 or 1), if reconstructed track passes through Brem
- ***FlagEcal*** : flag (0 or 1), if reconstructed track passes through Ecal
- ***FlagHcal*** : flag (0 or 1), if reconstructed track passes through Hcal
- ***FlagRICH1*** : flag (0 or 1), if reconstructed track passes through the first RICH detector
- ***FlagRICH2*** : flag (0 or 1), if reconstructed track passes through the second RICH detector
- ***FlagMuon*** : flag (0 or 1), if reconstructed track passes through muon stations (Muon)
- ***SpdE*** : energy deposit associated to the track in the Spd
- ***PrsE*** : energy deposit associated to the track in the Prs
- ***EcalE*** : energy deposit associated to the track in the Ecal
- ***HcalE*** : energy deposit associated to the track in the Hcal
- ***PrsDLLbeElectron*** : delta log-likelihood for a particle candidate to be electron using information from Prs
- ***BremDLLbeElectron*** : delta log-likelihood for a particle candidate to be electron using information from Brem
- ***TrackP*** : particle momentum
- ***TrackPt*** : particle transverse momentum
- ***TrackNDoFSubdetector1*** : number of degrees of freedom for track fit using hits in the tracking sub-detector1
- ***TrackQualitySubdetector1*** : chi2 quality of the track fit using hits in the tracking sub-detector1
- ***TrackNDoFSubdetector2*** : number of degrees of freedom for track fit using hits in the tracking sub-detector2
- ***TrackQualitySubdetector2*** : chi2 quality of the track fit using hits in the tracking sub-detector2
- ***TrackNDoF*** : number of degrees of freedom for track fit using hits in all tracking sub-detectors
- ***TrackQualityPerNDoF*** : chi2 quality of the track fit per degree of freedom
- ***TrackDistanceToZ*** : distance between track and z-axis (beam axis)
- ***Calo2dFitQuality*** : quality of the 2d fit of the clusters in the calorimeter 
- ***Calo3dFitQuality*** : quality of the 3d fit in the calorimeter with assumption that particle was electron
- ***EcalDLLbeElectron*** : delta log-likelihood for a particle candidate to be electron using information from Ecal
- ***EcalDLLbeMuon*** : delta log-likelihood for a particle candidate to be muon using information from Ecal
- ***EcalShowerLongitudinalParameter*** : longitudinal parameter of Ecal shower
- ***HcalDLLbeElectron*** : delta log-likelihood for a particle candidate to be electron using information from Hcal
- ***HcalDLLbeMuon*** : delta log-likelihood for a particle candidate to be muon using information from Hcal
- ***RICHpFlagElectron*** : flag (0 or 1) if momentum is greater than threshold for electrons to produce Cherenkov light
- ***RICHpFlagProton*** : flag (0 or 1) if momentum is greater than threshold for protons to produce Cherenkov light
- ***RICHpFlagPion*** : flag (0 or 1) if momentum is greater than threshold for pions to produce Cherenkov light
- ***RICHpFlagKaon*** : flag (0 or 1) if momentum is greater than threshold for kaons to produce Cherenkov light
- ***RICHpFlagMuon*** : flag (0 or 1) if momentum is greater than threshold for muons to produce Cherenkov light
- ***RICH_DLLbeBCK*** : delta log-likelihood for a particle candidate to be background using information from RICH
- ***RICH_DLLbeKaon*** : delta log-likelihood for a particle candidate to be kaon using information from RICH
- ***RICH_DLLbeElectron*** : delta log-likelihood for a particle candidate to be electron using information from RICH
- ***RICH_DLLbeMuon*** : delta log-likelihood for a particle candidate to be muon using information from RICH
- ***RICH_DLLbeProton*** : delta log-likelihood for a particle candidate to be proton using information from RICH
- ***MuonFlag*** : muon flag (whether this track is a muon), determined from the muon stations
- ***MuonLooseFlag*** : muon flag (whether this track is a muon), determined from the muon stations using looser criteria
- ***MuonLLbeBCK*** : log-likelihood for a particle candidate not to be a muon using information from muon stations
- ***MuonLLbeMuon*** : log-likelihood for a particle candidate to be muon using information from muon stations
- ***DLLelectron*** : delta log-likelihood for a particle candidate to be electron using information from all subdetectors
- ***DLLmuon*** : delta log-likelihood for a particle candidate to be muon using information from all subdetectors
- ***DLLkaon*** : delta log-likelihood for a particle candidate to be kaon using information from all subdetectors
- ***DLLproton*** : delta log-likelihood for a particle candidate to be proton using information from all subdetectors
- ***GhostProbability*** : probability for a particle candidate to be ghost track. This variable is an output of classification model used in the tracking algorithm.

Delta log-likelihood in the feature descriptions means the difference between the log-likelihood of the mass hypothesis that a given track was left by some particle (for example, an electron) and the log-likelihood of the mass hypothesis that the track was left by a pion. By this definition DLLpion = 0, which is why there are no such columns. Pions are used as the reference because most tracks (~80%) are left by pions, so in practice we need to discriminate other particles from pions. In other words, the null hypothesis is that the particle is a pion.
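As a small illustration of the definition above, the sketch below computes delta log-likelihoods from a set of per-hypothesis log-likelihoods. The numeric values and the `delta_log_likelihood` helper are made up for this example; they are not taken from the dataset or from the detector software.

```python
# Hypothetical log-likelihoods of one track under each mass hypothesis
# (made-up numbers, for illustration only).
log_likelihoods = {
    "pion": -12.0,
    "kaon": -15.5,
    "electron": -10.0,
}


def delta_log_likelihood(log_likelihoods, hypothesis, reference="pion"):
    """Return log L(hypothesis) - log L(reference)."""
    return log_likelihoods[hypothesis] - log_likelihoods[reference]


# DLL of every hypothesis relative to the pion hypothesis
dll = {name: delta_log_likelihood(log_likelihoods, name) for name in log_likelihoods}
print(dll)
```

In this toy case the pion entry is zero by construction, a positive value means the hypothesis is favoured over the pion hypothesis, and a negative value means it is disfavoured.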

In [None]:
# Classification Labels

set(data.Label)

In [None]:
# Converting Labels into Numerical Factor

data['Class'] = utils.get_class_ids(data.Label.values)
set(data.Class)
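The `utils.get_class_ids` helper is not shown in this notebook. Below is a minimal sketch of what such a label-to-integer conversion could look like. The function name `labels_to_ids`, the order of the classes and the resulting ids are assumptions and may differ from what `utils` actually does.

```python
import numpy

# Assumed class order; the real mapping used by utils.get_class_ids may differ.
PARTICLE_LABELS = ["Ghost", "Electron", "Muon", "Pion", "Kaon", "Proton"]
LABEL_TO_ID = {label: index for index, label in enumerate(PARTICLE_LABELS)}


def labels_to_ids(labels):
    """Map an iterable of particle-name strings to an integer array."""
    return numpy.array([LABEL_TO_ID[label] for label in labels])


example_ids = labels_to_ids(["Pion", "Electron", "Ghost", "Pion"])
print(example_ids)
```

Whatever the actual mapping is, it has to stay fixed between training and prediction, because the columns of `predict_proba` follow the sorted class ids.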

### Training Features

The following set of features describe particle responses in the detector systems:

![features](Images/features.jpeg)

There are also several combined features. The full list follows.

In [None]:
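# Keep every column except the targets; iterating over data.columns
# (rather than a set) keeps the feature order the same between runs
features = [column for column in data.columns if column not in {'Label', 'Class'}]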
features

# Data Splits

In [None]:
training_data, validation_data = train_test_split(data, random_state=11, train_size=0.90)

In [None]:
len(training_data), len(validation_data)

# Gradient Boosted Trees Classifier

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

In [None]:
%%time 
gb = GradientBoostingClassifier(learning_rate=0.1, 
                                n_estimators=50, 
                                subsample=0.3, 
                                random_state=13,
                                min_samples_leaf=100, 
                                max_depth=3)

gb.fit(training_data[features].values, training_data.Class.values)

In [None]:
# Prediction for each track

proba_gb = gb.predict_proba(validation_data[features].values)

In [None]:
# Prediction error on the validation set (log loss)

log_loss(validation_data.Class.values, proba_gb)
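For reference, the multiclass log loss used above is the mean negative log-probability that the model assigns to the true class of each track. The sketch below computes it by hand with NumPy on made-up probabilities. `manual_log_loss` is a hypothetical helper written for this illustration; it assumes the class ids are 0, 1, 2, ... and index the probability columns directly, which is simpler than the label handling in `sklearn.metrics.log_loss`.

```python
import numpy


def manual_log_loss(true_ids, probabilities, eps=1e-15):
    """Mean negative log-probability of the true class."""
    true_ids = numpy.asarray(true_ids)
    # Clip to avoid log(0) for over-confident wrong predictions
    probabilities = numpy.clip(numpy.asarray(probabilities, dtype=float), eps, 1.0)
    # Pick, for every row, the probability assigned to its true class
    picked = probabilities[numpy.arange(len(true_ids)), true_ids]
    return float(-numpy.mean(numpy.log(picked)))


# Made-up example: three tracks, three classes
toy_true = [0, 2, 1]
toy_proba = [[0.70, 0.20, 0.10],
             [0.10, 0.30, 0.60],
             [0.25, 0.50, 0.25]]
toy_loss = manual_log_loss(toy_true, toy_proba)
print(toy_loss)
```

A useful baseline for reading the score: a model that assigns equal probability to each of the six classes has a log loss of ln 6 ≈ 1.79, so a classifier is only informative if it scores below that.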

### Bayesian Optimization

In [None]:
from tqdm import tqdm_notebook
from skopt import Optimizer
from skopt.utils import create_result

In [None]:
search_space = [(0.1, 0.3),  # Learning Rate
                (50, 1000),  # Estimators
                (0.2, 0.5),  # SubSample
                (80, 140),   # Minimum Leaf Node
                (2, 5)       # Max Depth
                ]

In [None]:
def model_loss(params):
    learning_rate, n_estimators, subsample, min_samples_leaf, max_depth = params

    gb = GradientBoostingClassifier(learning_rate = learning_rate, 
                                n_estimators = n_estimators, 
                                subsample = subsample, 
                                random_state = 13,
                                min_samples_leaf = min_samples_leaf, 
                                max_depth = max_depth)

    gb.fit(training_data[features].values, training_data.Class.values)
    proba_gb = gb.predict_proba(validation_data[features].values)
    return gb, log_loss(validation_data.Class.values, proba_gb)

In [None]:
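# Create the optimizer over the search space defined above
opt = Optimizer(search_space)

# Ask for a candidate, evaluate it, and report the loss back
for i in tqdm_notebook(range(10)):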
    next_x = opt.ask()
    _, f_val = model_loss(next_x)
    opt.tell(next_x, f_val)
    
res = create_result(Xi = opt.Xi, 
                    yi = opt.yi, 
                    space = opt.space,
                    rng = opt.rng, 
                    models = opt.models)

### Result Analysis

In [None]:
import skopt.plots

In [None]:
# Convergence Traces

skopt.plots.plot_convergence(res)
print(list(zip(["learning_rate", "n_estimators", "subsample", "min_samples_leaf", "max_depth"], res.x)))

In [None]:
# Cumulative regret traces

skopt.plots.plot_regret(res)
plt.show()

In [None]:
# Pairwise dependence plot of the objective function

skopt.plots.plot_objective(res, dimensions=["learning_rate", "n_estimators", "subsample", "min_samples_leaf", "max_depth"])
plt.show()

In [None]:
# Visualizing the order in which points were sampled.
# The order in which samples were evaluated is encoded in each point’s color.

skopt.plots.plot_evaluations(res, dimensions=["learning_rate", "n_estimators", "subsample", "min_samples_leaf", "max_depth"])
plt.show()

# Result

In [None]:
gb, loss = model_loss(res.x)

In [None]:
print("Best Loss for Gradient Boosted Trees is: ", loss)