# About

In this programming assignment you will train a classifier to identify the type of a particle. There are six particle types: electron, proton, muon, kaon, pion and ghost. A ghost is either a particle of a type other than the first five or detector noise.

Different particle types leave different responses in the detector systems (subdetectors). There are five systems: the tracking system, the ring imaging Cherenkov detector (RICH), the electromagnetic and hadron calorimeters, and the muon system.

![pid](pic/pid.jpg)

Your task is to identify the particle type using the responses in the detector systems.

# Attention

Download the data files from https://github.com/hse-aml/hadron-collider-machine-learning/releases/tag/Week_2

In [None]:
!wget https://github.com/hse-aml/hadron-collider-machine-learning/releases/download/Week_2/test.csv.gz

In [None]:
!wget https://github.com/hse-aml/hadron-collider-machine-learning/releases/download/Week_2/training.csv.gz

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas
import numpy
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
import utils

In [None]:
# Display all columns
pandas.options.display.max_columns = None

In [None]:
grid_search = False

# Download data

Download data used to train classifiers.

### Read training file

In [None]:
data = pandas.read_csv('training.csv.gz')

In [None]:
data.head()

### List of columns in the samples

Here, **Spd** stands for Scintillating Pad Detector, **Prs** for Preshower, **Ecal** for electromagnetic calorimeter and **Hcal** for hadronic calorimeter; **Brem** denotes traces of particles that were deflected by the detector.

- ID - id value for tracks (present only in the test file, for submission purposes)
- Label - string valued observable denoting particle types. Can take values "Electron", "Muon", "Kaon", "Proton", "Pion" and "Ghost". This column is absent in the test file.
- FlagSpd - flag (0 or 1), if reconstructed track passes through Spd
- FlagPrs - flag (0 or 1), if reconstructed track passes through Prs
- FlagBrem - flag (0 or 1), if reconstructed track passes through Brem
- FlagEcal - flag (0 or 1), if reconstructed track passes through Ecal
- FlagHcal - flag (0 or 1), if reconstructed track passes through Hcal
- FlagRICH1 - flag (0 or 1), if reconstructed track passes through the first RICH detector
- FlagRICH2 - flag (0 or 1), if reconstructed track passes through the second RICH detector
- FlagMuon - flag (0 or 1), if reconstructed track passes through muon stations (Muon)
- SpdE - energy deposit associated to the track in the Spd
- PrsE - energy deposit associated to the track in the Prs
- EcalE - energy deposit associated to the track in the Ecal
- HcalE - energy deposit associated to the track in the Hcal
- PrsDLLbeElectron - delta log-likelihood for a particle candidate to be electron using information from Prs
- BremDLLbeElectron - delta log-likelihood for a particle candidate to be electron using information from Brem
- TrackP - particle momentum
- TrackPt - particle transverse momentum
- TrackNDoFSubdetector1  - number of degrees of freedom for track fit using hits in the tracking sub-detector1
- TrackQualitySubdetector1 - chi2 quality of the track fit using hits in the tracking sub-detector1
- TrackNDoFSubdetector2 - number of degrees of freedom for track fit using hits in the tracking sub-detector2
- TrackQualitySubdetector2 - chi2 quality of the track fit using hits in the  tracking sub-detector2
- TrackNDoF - number of degrees of freedom for track fit using hits in all tracking sub-detectors
- TrackQualityPerNDoF - chi2 quality of the track fit per degree of freedom
- TrackDistanceToZ - distance between track and z-axis (beam axis)
- Calo2dFitQuality - quality of the 2d fit of the clusters in the calorimeter 
- Calo3dFitQuality - quality of the 3d fit in the calorimeter with assumption that particle was electron
- EcalDLLbeElectron - delta log-likelihood for a particle candidate to be electron using information from Ecal
- EcalDLLbeMuon - delta log-likelihood for a particle candidate to be muon using information from Ecal
- EcalShowerLongitudinalParameter - longitudinal parameter of Ecal shower
- HcalDLLbeElectron - delta log-likelihood for a particle candidate to be electron using information from Hcal
- HcalDLLbeMuon - delta log-likelihood for a particle candidate to be muon using information from Hcal
- RICHpFlagElectron - flag (0 or 1) if momentum is greater than threshold for electrons to produce Cherenkov light
- RICHpFlagProton - flag (0 or 1) if momentum is greater than threshold for protons to produce Cherenkov light
- RICHpFlagPion - flag (0 or 1) if momentum is greater than threshold for pions to produce Cherenkov light
- RICHpFlagKaon - flag (0 or 1) if momentum is greater than threshold for kaons to produce Cherenkov light
- RICHpFlagMuon - flag (0 or 1) if momentum is greater than threshold for muons to produce Cherenkov light
- RICH_DLLbeBCK  - delta log-likelihood for a particle candidate to be background using information from RICH
- RICH_DLLbeKaon - delta log-likelihood for a particle candidate to be kaon using information from RICH
- RICH_DLLbeElectron - delta log-likelihood for a particle candidate to be electron using information from RICH
- RICH_DLLbeMuon - delta log-likelihood for a particle candidate to be muon using information from RICH
- RICH_DLLbeProton - delta log-likelihood for a particle candidate to be proton using information from RICH
- MuonFlag - muon flag (is this track muon) which is determined from muon stations
- MuonLooseFlag - muon flag (is this track muon) which is determined from muon stations using looser criteria
- MuonLLbeBCK - log-likelihood for a particle candidate to be not muon using information from muon stations
- MuonLLbeMuon - log-likelihood for a particle candidate to be muon using information from muon stations
- DLLelectron - delta log-likelihood for a particle candidate to be electron using information from all subdetectors
- DLLmuon - delta log-likelihood for a particle candidate to be muon using information from all subdetectors
- DLLkaon - delta log-likelihood for a particle candidate to be kaon using information from all subdetectors
- DLLproton - delta log-likelihood for a particle candidate to be proton using information from all subdetectors
- GhostProbability - probability for a particle candidate to be ghost track. This variable is an output of classification model used in the tracking algorithm.

Delta log-likelihood in the feature descriptions means the difference between the log-likelihood of the mass hypothesis that a given track was left by some particle (for example, an electron) and the log-likelihood of the mass hypothesis that the track was left by a pion (so DLLpion = 0, which is why there are no such columns). This is done because most tracks (~80%) are left by pions, and in practice the task is to discriminate other particles from pions. In other words, the null hypothesis is that the particle is a pion.
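As a small illustration of this definition, the sketch below computes a delta log-likelihood from two log-likelihood values. The function name and the numbers are made up for the example and do not come from the dataset.

```python
import numpy as np

def delta_log_likelihood(loglik_hypothesis, loglik_pion):
    """Log-likelihood of a particle hypothesis minus that of the pion hypothesis."""
    return (np.asarray(loglik_hypothesis, dtype=float)
            - np.asarray(loglik_pion, dtype=float))

# Hypothetical log-likelihoods for three tracks (illustrative values only)
loglik_kaon = np.array([-2.0, -5.0, -3.5])
loglik_pion = np.array([-4.0, -1.0, -3.5])

# Positive values favour the kaon hypothesis, negative values favour the pion one
dll_kaon = delta_log_likelihood(loglik_kaon, loglik_pion)

# The pion hypothesis compared with itself is zero by construction
dll_pion = delta_log_likelihood(loglik_pion, loglik_pion)
```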

### Look at the labels set

The training data contains six classes. Each class corresponds to a particle type. Your task is to predict the type of a particle.

In [None]:
set(data.Label)

Convert the particle types into class numbers.

In [None]:
data['Class'] = utils.get_class_ids(data.Label.values)
set(data.Class)
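The implementation of `utils.get_class_ids` is not shown in this notebook. A minimal sketch of such a mapping, assuming the six labels are numbered in alphabetical order (an assumption, check `utils.py` for the actual correspondence), could look like this:

```python
import numpy as np

# Assumed label order (alphabetical); the real mapping is defined in utils.py
LABELS = ['Electron', 'Ghost', 'Kaon', 'Muon', 'Pion', 'Proton']
LABEL_TO_ID = {label: class_id for class_id, label in enumerate(LABELS)}

def labels_to_ids(labels):
    """Hypothetical helper: map label strings to integer class ids."""
    return np.array([LABEL_TO_ID[label] for label in labels])

example_ids = labels_to_ids(['Pion', 'Electron', 'Ghost', 'Pion'])
```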

### Define training features

The following set of features describes particle responses in the detector systems:

![features](pic/features.jpeg)

There are also several combined features. The full list follows.

In [None]:
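# Keep the original column order so that the feature order is the same in every session
features = [column for column in data.columns if column not in ('Label', 'Class')]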
features

### Divide training data into 2 parts

In [None]:
training_data, validation_data = train_test_split(data, random_state=11, train_size=0.90, test_size=0.10)

In [None]:
len(training_data), len(validation_data)

# Sklearn classifier

In this step, your task is to train a **Sklearn** classifier that achieves a low **log loss** value.


TASK: tune the classifier parameters to achieve the lowest **log loss** value you can on the validation sample.

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

In [None]:
gb = GradientBoostingClassifier(learning_rate=0.1, 
                                n_estimators=50, 
                                subsample=0.3, 
                                random_state=13,
                                min_samples_leaf=100,
                                max_depth=3,
                                verbose=3)
gb.fit(training_data[features].values, training_data.Class.values)

### Log loss on the validation sample

In [None]:
# Predict each track
proba_gb = gb.predict_proba(validation_data[features].values)

In [None]:
log_loss(validation_data.Class.values, proba_gb)
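To make the metric concrete: multiclass log loss is the mean negative logarithm of the probability assigned to the true class of each sample. The sketch below checks this on a toy example with three classes (the arrays are made up for illustration).

```python
import numpy as np
from sklearn.metrics import log_loss

# Toy true classes and predicted probabilities (each row sums to 1)
y_true_toy = np.array([0, 1, 2, 1])
proba_toy = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.8, 0.1],
                      [0.2, 0.2, 0.6],
                      [0.3, 0.4, 0.3]])

# Probability assigned to the true class of each sample
p_true = proba_toy[np.arange(len(y_true_toy)), y_true_toy]

manual_log_loss = -np.mean(np.log(p_true))
sklearn_log_loss = log_loss(y_true_toy, proba_toy)
```

Both values agree, and a perfect classifier (probability 1 for every true class) would give a log loss of 0.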

### Parameter search

In [None]:
from sklearn.model_selection import RandomizedSearchCV

In [None]:
param_dist = {'learning_rate': [0.05, 0.1, 0.2], 
              'n_estimators': [25, 50, 75], 
              'subsample': [0.1, 0.3, 0.6], 
              'random_state': [13],
              'min_samples_leaf': [50, 100, 200],
              'max_depth': [1, 3, 6],
              'verbose': [3]}

In [None]:
if grid_search:
    random_search = RandomizedSearchCV(GradientBoostingClassifier(),
                                       param_distributions=param_dist,
                                       # See 
                                       # https://stackoverflow.com/questions/43081251/sklearn-metrics-log-loss-is-positive-vs-scoring-neg-log-loss-is-negative
                                       # https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
                                       scoring='neg_log_loss', 
                                       n_iter=10,
                                       n_jobs=-1,  # NOTE: You should monitor the memory so it doesn't get saturated
                                       refit=True,  # Refit with the best parameters on the whole data set (default)
                                       return_train_score=True,
                                       cv=3)
    
    # NOTE: Here we fit on the data from training.csv.gz and let the RandomizedSearchCV do the splitting
    random_search.fit(data.loc[:, features].values,
                      data.loc[:, 'Class'].values)
    
    print(random_search.best_params_)
    print(random_search.best_score_)
    pandas.DataFrame(random_search.cv_results_)

The random search gives the following best parameters:

```
{'verbose': 3,
 'subsample': 0.3,
 'random_state': 13,
 'n_estimators': 50,
 'min_samples_leaf': 50,
 'max_depth': 6,
 'learning_rate': 0.2}
```

with the score

```
0.5808503911824091
```
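Note on the sign convention: with `scoring='neg_log_loss'`, scikit-learn reports the negated log loss so that larger is better. Cross-validation scores are then non-positive, and the sign has to be flipped before comparing them with `log_loss` values. A small sketch on synthetic data (the dataset and the logistic regression model are stand-ins chosen only to keep the example fast):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic three-class problem, used only for illustration
X_toy, y_toy = make_classification(n_samples=300, n_features=8,
                                   n_informative=4, n_classes=3,
                                   random_state=0)

scores = cross_val_score(LogisticRegression(max_iter=1000),
                         X_toy, y_toy,
                         scoring='neg_log_loss', cv=3)

# Flip the sign to obtain an ordinary (non-negative) log loss
cv_log_loss = -scores.mean()
```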

# Keras neural network

In this step, your task is to train a **Keras** NN classifier that achieves a low **log loss** value.


TASK: tune the classifier parameters to achieve the lowest **log loss** value you can on the validation sample. Data preprocessing may help you improve your score.
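One common preprocessing step for neural networks is feature standardization. The sketch below uses scikit-learn's `StandardScaler` on synthetic arrays; this is one possible choice, not a prescribed solution, and the toy arrays stand in for the real feature matrices.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the training and validation feature matrices
rng = np.random.RandomState(0)
X_train_toy = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
X_valid_toy = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

scaler = StandardScaler()
# Estimate mean and standard deviation on the training part only
X_train_scaled = scaler.fit_transform(X_train_toy)
# Reuse the same statistics for the validation part
X_valid_scaled = scaler.transform(X_valid_toy)
```

In this notebook the same pattern would apply to `training_data[features].values` and `validation_data[features].values`, and later to the test sample.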

In [None]:
from keras.layers import Dense, Activation, Dropout
from keras.callbacks import ModelCheckpoint
from keras.callbacks import EarlyStopping
from keras.models import Sequential
from keras.optimizers import Adam
from keras.utils import np_utils

In [None]:
def nn_model(input_dim):
    model = Sequential()
    model.add(Dense(100, input_dim=input_dim))
    model.add(Activation('tanh'))

    model.add(Dense(6))
    model.add(Activation('softmax'))

    model.compile(loss='categorical_crossentropy', optimizer=Adam())
    return model

In [None]:
nn = nn_model(len(features))
nn.fit(training_data[features].values, 
       np_utils.to_categorical(training_data.Class.values), 
       verbose=1, 
       epochs=5, 
       batch_size=256)
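`np_utils.to_categorical` turns integer class ids into one-hot rows, which is the target format expected by the `categorical_crossentropy` loss. A NumPy sketch of the same idea (the helper name `one_hot` is made up for illustration):

```python
import numpy as np

def one_hot(class_ids, n_classes):
    """Encode integer class ids as rows with a single 1 in the class column."""
    class_ids = np.asarray(class_ids, dtype=int)
    encoded = np.zeros((len(class_ids), n_classes), dtype=float)
    encoded[np.arange(len(class_ids)), class_ids] = 1.0
    return encoded

encoded_example = one_hot([0, 2, 5], n_classes=6)
```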

### Log loss on the validation sample

In [None]:
# Predict each track; with a softmax output layer, predict() returns the class probabilities
proba_nn = nn.predict(validation_data[features].values)

In [None]:
log_loss(validation_data.Class.values, proba_nn)

### Parameter search

In [None]:
def plot_history(history):
    epochs_range = range(len(history.history['val_loss']))
    fig, ax = plt.subplots()
    ax.plot(epochs_range, history.history['val_loss'], label='val loss')
    ax.plot(epochs_range, history.history['loss'], label='train loss')
    ax.grid(True)
    ax.set_xlabel('Epochs')
    ax.set_ylabel('Loss')
    ax.legend(loc='best', fancybox=True, framealpha=0.5)
    plt.show()

In [None]:
early_stopper = EarlyStopping(monitor='val_loss',
                              min_delta=0, 
                              patience=10, 
                              verbose=1, 
                              mode='auto',
                              restore_best_weights=True)
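The `patience` setting means that training stops once the monitored value has not improved for that many consecutive epochs. The pure-Python sketch below is a simplified model of this logic, not the Keras implementation; the function name and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a list of per-epoch validation losses."""
    best_loss = float('inf')
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # Improvement: remember this epoch and reset the counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Training ran to the end without triggering the stop
    return len(val_losses) - 1, best_epoch

toy_val_losses = [0.9, 0.7, 0.6, 0.65, 0.61, 0.62, 0.63]
stop_epoch, best_epoch = early_stopping_epoch(toy_val_losses, patience=3)
```

Here the best loss occurs at epoch 2 and training stops at epoch 5, after three epochs without improvement. With `restore_best_weights=True`, the model would then be reset to the weights of the best epoch.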


In [None]:
def nn_model_v2(input_dim):
    model = Sequential()
    model.add(Dense(180, input_dim=input_dim))
    model.add(Activation('relu'))
    
    model.add(Dropout(rate=.30))
    
    model.add(Dense(180))
    model.add(Activation('relu'))

    model.add(Dropout(rate=.30))
    
    model.add(Dense(6))
    model.add(Activation('softmax'))

    model.compile(loss='categorical_crossentropy', optimizer=Adam())
    return model

In [None]:
checkpointer = ModelCheckpoint(filepath='model_v2.h5',
                               verbose=1,
                               save_best_only=True)

nn_v2 = nn_model_v2(len(features))

history_v2 = nn_v2.fit(training_data[features].values, 
                       np_utils.to_categorical(training_data.Class.values), 
                       validation_data=(validation_data[features].values,
                                        np_utils.to_categorical(validation_data.Class.values)),
                       verbose=1, 
                       epochs=100, 
                       callbacks=[checkpointer, early_stopper],
                       batch_size=256)

In [None]:
plot_history(history_v2)

# Quality metrics

Plot ROC curves and the dependence of signal efficiency on particle momentum and transverse momentum.

In [None]:
proba = proba_gb

In [None]:
utils.plot_roc_curves(proba, validation_data.Class.values)
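The plotting helper lives in `utils` and is not shown here. A common way to get per-class ROC curves for a multiclass classifier is one-vs-rest: each class in turn is treated as signal and all others as background. A sketch on toy data, assuming this is the intended approach:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Toy true classes and predicted probabilities for three classes
y_roc_toy = np.array([0, 0, 1, 1, 2, 2])
proba_roc_toy = np.array([[0.8, 0.1, 0.1],
                          [0.6, 0.3, 0.1],
                          [0.2, 0.7, 0.1],
                          [0.4, 0.5, 0.1],
                          [0.1, 0.2, 0.7],
                          [0.3, 0.3, 0.4]])

roc_points = {}
auc_per_class = {}
for class_id in range(proba_roc_toy.shape[1]):
    # This class is signal, all other classes are background
    is_signal = (y_roc_toy == class_id).astype(int)
    class_score = proba_roc_toy[:, class_id]
    fpr, tpr, _ = roc_curve(is_signal, class_score)
    roc_points[class_id] = (fpr, tpr)
    auc_per_class[class_id] = roc_auc_score(is_signal, class_score)
```

Each `(fpr, tpr)` pair can be passed to `plt.plot` to draw the curve for that class.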

In [None]:
utils.plot_signal_efficiency_on_p(proba, validation_data.Class.values, validation_data.TrackP.values, 60, 50)
plt.show()

In [None]:
utils.plot_signal_efficiency_on_pt(proba, validation_data.Class.values, validation_data.TrackPt.values, 60, 50)
plt.show()
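The exact procedure used by the `utils` plotting functions is not shown in this notebook. One simple way to study how efficiency depends on momentum is to fix a threshold on the predicted signal probability and compute, in each momentum bin, the fraction of true signal tracks that pass it. The helper name, threshold and bin edges below are assumptions made for the example.

```python
import numpy as np

def efficiency_in_bins(signal_proba, spectator, threshold, bin_edges):
    """Fraction of signal tracks with probability above threshold, per spectator bin."""
    signal_proba = np.asarray(signal_proba, dtype=float)
    bin_index = np.digitize(spectator, bin_edges) - 1
    efficiencies = []
    for i in range(len(bin_edges) - 1):
        in_bin = bin_index == i
        if in_bin.sum() == 0:
            efficiencies.append(np.nan)  # no signal tracks in this bin
        else:
            efficiencies.append(np.mean(signal_proba[in_bin] > threshold))
    return np.array(efficiencies)

# Toy signal tracks: predicted signal probability and momentum
toy_signal_proba = np.array([0.9, 0.2, 0.8, 0.7, 0.1, 0.6])
toy_momentum = np.array([1.0, 2.0, 6.0, 7.0, 8.0, 9.0])

toy_efficiency = efficiency_in_bins(toy_signal_proba, toy_momentum,
                                    threshold=0.5, bin_edges=[0.0, 5.0, 10.0])
```

A flat efficiency across bins means the classifier performs equally well at all momenta; dips point to momentum ranges where the particle type is harder to identify.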

# Prepare submission

Select your best classifier and prepare the submission file.

In [None]:
test = pandas.read_csv('test.csv.gz')

In [None]:
best_model = gb

In [None]:
# predict test sample
submit_proba = best_model.predict_proba(test[features])
submit_ids = test.ID

In [None]:
from IPython.display import FileLink
utils.create_solution(submit_ids, submit_proba, filename='submission_file.csv.gz')