## ML Course Challenge
### Predicting reconstruction efficiencies for tau leptons
This is a physics-inspired challenge in which your task is to predict, as accurately as possible, the probability that a certain particle, a tau lepton, will be correctly reconstructed in the detector. We simulated many (hundreds of thousands of) proton-proton collisions; in each of these events, a small number of tau leptons (1, 2, or 3) is produced. Unfortunately, it is quite difficult to detect ("reconstruct") a tau lepton in a particle detector, because tau leptons decay before they have a chance to interact with the detector. We thus only see their decay products, and from these only slightly more than half of the tau leptons are actually reconstructed as tau leptons.

The probability that a given true tau lepton is reconstructed (the "reconstruction efficiency") depends on its properties: how much energy it has, where it hits our detector, and possibly other properties. We therefore often measure the efficiency as a function of, e.g., the transverse momentum ($p_T$) and the position ($\eta$).
The goal in this challenge is to predict this probability.

This seems like a nice opportunity to practice our newly learned skills in data analysis and machine learning! Let's get to work...

In [None]:
### imports
import numpy as np
import pandas as pd
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

In [None]:
### config
path_csv = "https://cernbox.cern.ch/index.php/s/sLWDkDKVkNmlDuw/download" # you may want to make a local copy
#path_csv = "output1.csv.gz"

In [None]:
### load data
df = pd.read_csv(path_csv, sep=";", compression="gzip", comment = "#")
df.describe().T

There are many input features available in the dataset (one row = one tau lepton):
* `mcChannelNumber`, `eventNumber`: unique identifiers for the dataset and collision event
* `N_true_elec`: true number of electrons in the collision event
* `N_true_muon`: true number of muons in the collision event
* `N_true_taus`: true number of tau leptons in the collision event
* `GenWeight`: relative physical probability of this collision event
* `MetTST_met`: measured missing transverse momentum in GeV
* `truth_pt`: true transverse momentum of the tau lepton in GeV
* `truth_eta`, `truth_phi`: true geometrical coordinates of the tau lepton in the detector ($\eta$ is called the [pseudorapidity](https://en.wikipedia.org/wiki/Pseudorapidity))
* `truth_prong`, `truth_neutral`: true number of charged and neutral particles the tau lepton decayed into
* `truth_charge`: charge of the tau lepton (in units of the elementary charge of the electron)
* `dR_min`: geometrical distance of the tau lepton and its reconstruction (if it has been reconstructed, otherwise 999)
* `match_pt`: measured transverse momentum of the tau lepton in GeV (if it has been reconstructed, otherwise -999)
* `dR_min_taujet`: geometrical distance of the tau lepton and the closest jet in the detector (if there is a jet, otherwise 999)
* `TruthMET_met`: true missing transverse momentum in GeV
* `Vtx_n`: number of concurrent proton-proton collisions

Notes:
* Not all of the above features are suited as input features. 
* `dR` is the distance between two objects in the $(\eta, \phi)$ plane, defined as $\Delta R = \sqrt{(\Delta\eta)^2 + (\Delta\phi)^2}$.
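
As a minimal sketch of how such a distance can be computed, the helper below takes the $\eta$ and $\phi$ of two objects and wraps the azimuthal difference into $[-\pi, \pi)$. The function name `delta_r` and the example numbers are illustrative and do not come from the dataset.

```python
import numpy as np

def delta_r(eta1, phi1, eta2, phi2):
    """Angular distance between two objects in the (eta, phi) plane."""
    d_eta = np.asarray(eta1) - np.asarray(eta2)
    # wrap the azimuthal difference into [-pi, pi), so phi = 3.0 and phi = -3.0 count as close
    d_phi = (np.asarray(phi1) - np.asarray(phi2) + np.pi) % (2 * np.pi) - np.pi
    return np.sqrt(d_eta**2 + d_phi**2)

# two objects at the same eta on opposite sides of the phi = +-pi boundary
print(delta_r(0.5, 3.0, 0.5, -3.0))  # roughly 0.28, not 6.0
```

Without the wrapping step, objects on either side of the $\phi = \pm\pi$ boundary would appear to be far apart.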

The target feature:
* `reco_matched`: 1 if the tau lepton has been reconstructed, otherwise 0

Note: This is what we want to predict.

### Some Examples for Illustration

In [None]:
### global definitions
# bin edges
EX, EY = (
    np.linspace(0,   3, 10), # eta
    np.linspace(0, 400,  5)  # pt [GeV]
)
# random seed for train/test split
SEED = 42

In [None]:
### helper function
def GetEff(X, y):
    # returns the mean of y (the efficiency) in bins of |eta| (rows) and pt (columns)
    if isinstance(y, np.ndarray):
        y = pd.Series(y, name = "matched")
    # combine pt, eta columns from X with flag whether tau is reconstructed from y
    df = pd.concat([X[["truth_pt", "truth_eta"]].reset_index(drop = True), y.reset_index(drop = True)], axis=1)
    # compute bins 
    ptbins  = pd.cut(    df["truth_pt"]  , EY)
    etabins = pd.cut(abs(df["truth_eta"]), EX)
    # group in bins
    return df.groupby([etabins, ptbins])[y.name].mean().unstack()

Print the actual efficiencies as a function of two of the input features:

In [None]:
print("Actual efficiencies:")
GetEff(df, df["reco_matched"])

---
We define a subset of features to learn from -- this you can change:

In [None]:
input_features = [u'truth_pt', u'truth_eta', u'truth_phi', u'truth_prong', u'dR_min']

In [None]:
### define training and test datasets
from sklearn.model_selection import train_test_split

X = df[input_features]
y = df["reco_matched"]

X_train, X_test, y_train, y_test = train_test_split(
  X, y, random_state=SEED
)

In [None]:
### fit a BDT
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

### BDT
model = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2),
                           algorithm="SAMME",
                           n_estimators=500)
model.fit(X_train, y_train)
print("Score:", model.score(X_test, y_test))

A score of 1.0? That's perfect! But is it?

In [None]:
# for comparison
print("Actual efficiencies:")
print(GetEff(X_test, y_test))

print()
print("Predicted efficiencies:")
print(GetEff(X_test, model.predict(X_test)))
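
A perfect score is often a sign that one of the inputs encodes the target. According to the feature list, `dR_min` is set to 999 exactly when the tau lepton was not reconstructed. The sketch below mimics this on synthetic toy data (the values are made up and are not the challenge data) and shows that the sentinel value alone reproduces the target.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# toy stand-ins for the real columns (synthetic values, for illustration only)
reco_matched = rng.integers(0, 2, size=n)
dR_min = np.where(reco_matched == 1, rng.uniform(0.0, 0.2, size=n), 999.0)

# a one-line "classifier" that only checks for the sentinel value
leaky_prediction = (dR_min != 999.0).astype(int)
leaky_accuracy = (leaky_prediction == reco_matched).mean()
print("Accuracy from the sentinel alone:", leaky_accuracy)
```

A similar check on the real columns could help decide which features are suitable as inputs.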

### The Twist
What we actually see is that the reconstruction efficiencies depend on the MC dataset number: 

In [None]:
for mcChannelNumber in df["mcChannelNumber"].unique():
    df1 = df[df["mcChannelNumber"] == mcChannelNumber]
    print()
    print("Actual efficiencies for dataset number:", mcChannelNumber)
    print(GetEff(df1, df1["reco_matched"]))

The MC dataset numbers correspond to different ways in which the tau leptons are produced (i.e. which particles decay to produce the tau leptons). The tau leptons themselves, however, are elementary particles (like an electron) and as such have no way of knowing which process produced them. In particular, there should be no dependence on the MC dataset number.

Thus, the challenge is to predict the reconstruction efficiencies from the input features without using the MC dataset numbers such that they match the actual numbers as closely as possible. 
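
One possible way to check that a model does not rely on the production process is to hold out one dataset number at a time: train on the others, then evaluate on the held-out one. The generator below is a sketch of that bookkeeping. The name `leave_one_channel_out` and the toy channel labels are hypothetical.

```python
import numpy as np

def leave_one_channel_out(channels):
    """Yield (channel, train_mask, test_mask) with one channel held out at a time."""
    channels = np.asarray(channels)
    for channel in np.unique(channels):
        test_mask = channels == channel
        yield channel, ~test_mask, test_mask

# toy channel labels (hypothetical numbers, not the real dataset IDs)
toy_channels = np.array([1, 1, 2, 2, 2, 3])
for channel, train_mask, test_mask in leave_one_channel_out(toy_channels):
    print(channel, "train rows:", train_mask.sum(), "test rows:", test_mask.sum())
```

With the real data, the masks could be applied to `df["mcChannelNumber"]` to select the training and test rows.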

The measure we will use is the mean squared difference between the predicted and true efficiencies (in the binning above), evaluated on some of these samples and on some additional ones, and possibly also the AUC for the prediction of `reco_matched` (to be discussed).
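
To make the first measure concrete, here is a minimal sketch of a mean squared difference between binned efficiencies, written with plain NumPy. The helper names and toy numbers are assumptions, and `np.histogram2d` treats bin edges slightly differently from the `pd.cut` call in `GetEff`, so values exactly on an edge may land in a different bin.

```python
import numpy as np

def binned_efficiency(abs_eta, pt, values, eta_edges, pt_edges):
    """Mean of `values` in each (|eta|, pt) bin; NaN where a bin is empty."""
    bins = [eta_edges, pt_edges]
    total, _, _ = np.histogram2d(abs_eta, pt, bins=bins, weights=values)
    count, _, _ = np.histogram2d(abs_eta, pt, bins=bins)
    with np.errstate(invalid="ignore", divide="ignore"):
        return total / count  # 0/0 gives NaN for empty bins

def binned_mse(eff_true, eff_pred):
    """Mean squared difference over the bins populated in both histograms."""
    return np.nanmean((eff_true - eff_pred) ** 2)

# toy example: four taus, two populated bins (made-up numbers)
eta_edges, pt_edges = np.array([0.0, 1.0, 2.0]), np.array([0.0, 100.0, 200.0])
abs_eta = np.array([0.5, 0.5, 1.5, 1.5])
pt = np.array([50.0, 50.0, 150.0, 150.0])
reco = np.array([1.0, 0.0, 1.0, 1.0])       # true outcome
proba = np.array([0.6, 0.6, 0.8, 0.8])      # predicted probability

eff_true = binned_efficiency(abs_eta, pt, reco, eta_edges, pt_edges)
eff_pred = binned_efficiency(abs_eta, pt, proba, eta_edges, pt_edges)
mse = binned_mse(eff_true, eff_pred)
print(mse)
```

Here the true efficiencies in the two populated bins are 0.5 and 1.0, while the predicted ones are 0.6 and 0.8.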

---
Closing with some more convenience functions to test prediction on a particular sample:

In [None]:
def PredictForChannel(model, X_trained, df, mcChannelNumber):
    # test model on channelnumber
    columns = X_trained.columns
    df1 = df[df["mcChannelNumber"] == mcChannelNumber]
    #y_pred = model.predict_proba(df1[columns])[:,1]
    y_pred = model.predict(df1[columns])
    return df1, y_pred

In [None]:
GetEff(*PredictForChannel(model, X_train, df, 397049))
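
The commented-out `predict_proba` line in `PredictForChannel` matters when the output is averaged into an efficiency: averaging hard 0/1 decisions and averaging probabilities generally give different numbers. The toy values below are made up, and the threshold of 0.5 is an assumption for illustration.

```python
import numpy as np

# toy probabilities for five taus in one bin (made-up values)
proba = np.array([0.55, 0.60, 0.65, 0.70, 0.45])
# a hard 0/1 decision, assuming a threshold of 0.5
hard = (proba >= 0.5).astype(int)

eff_from_proba = proba.mean()
eff_from_hard = hard.mean()
print("from probabilities:", eff_from_proba)
print("from hard decisions:", eff_from_hard)
```

In this example the probabilities average to 0.59, while the hard decisions average to 0.8.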

### Try to predict challenge metrics for histogram method:
To have something to compare to, here is how the method of just using a histogram in the variables `truth_pt` and `truth_eta` as our classifier performs.

The prediction is done by looking up the efficiencies from the histogram:

In [None]:
def pred_proba_from_hist(pt, abseta, eff_hist):
    # add overflow and underflow bins
    # (duplicate the first and last one)
    h = eff_hist
    h = np.append(h, h[-1:], axis=0)
    h = np.append(h[0:1], h, axis=0)
    h = np.append(h, h[:,-1:], axis=1)
    h = np.append(h[:,0:1], h, axis=1)

    # bin indices
    eta_idx = np.digitize(abseta, EX)
    pt_idx = np.digitize(pt, EY)
    
    # lookup efficiencies
    effs = np.empty(len(eta_idx))
    for i, (x, y) in enumerate(zip(eta_idx, pt_idx)):
        effs[i] = h[x, y]
    
    return effs
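
The same lookup can also be written without the explicit loop. The sketch below uses `np.pad` with `mode="edge"` to duplicate the outer bins and NumPy fancy indexing for the lookup. The function name `lookup_from_hist` and the toy histogram are illustrative only.

```python
import numpy as np

def lookup_from_hist(x, y, hist, x_edges, y_edges):
    """Look up hist[x-bin, y-bin] per entry; out-of-range values use the nearest edge bin."""
    padded = np.pad(hist, 1, mode="edge")   # duplicate first/last row and column
    x_idx = np.digitize(x, x_edges)         # 0 = underflow, len(x_edges) = overflow
    y_idx = np.digitize(y, y_edges)
    return padded[x_idx, y_idx]

# toy 2x2 histogram with made-up efficiencies
toy_hist = np.array([[0.1, 0.2],
                     [0.3, 0.4]])
x_edges = np.array([0.0, 1.0, 2.0])
y_edges = np.array([0.0, 10.0, 20.0])

looked_up = lookup_from_hist(
    np.array([0.5, 1.5, 5.0]),    # last value is an overflow in x
    np.array([5.0, 15.0, -3.0]),  # last value is an underflow in y
    toy_hist, x_edges, y_edges,
)
print(looked_up)
```

The third entry falls outside the histogram in both directions and is assigned the value of the nearest corner bin.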

In [None]:
hist_train = GetEff(X_train, y_train).values

In [None]:
proba_test = pred_proba_from_hist(X_test.truth_pt, X_test.truth_eta.abs(), hist_train)

In [None]:
def roc_curve_and_auc(y_true, proba):
    # plot the ROC curve and print the area under it
    fpr, tpr, thr = roc_curve(y_true, proba)
    plt.plot(fpr, tpr)
    plt.plot([0, 1], [0, 1], "--", color="black")
    print("AUC:", auc(fpr, tpr))

In [None]:
roc_curve_and_auc(y_test, proba_test)

In [None]:
def mean_squared_error(X, df_mcChannelNumber, proba):
    # compares predicted and actual binned efficiencies separately per dataset number
    proba = np.asarray(proba)
    for mcChannelNumber in df["mcChannelNumber"].unique():
        df1 = df[df["mcChannelNumber"] == mcChannelNumber]
        # select the rows of X and the matching entries of proba for this dataset number
        mask = (df_mcChannelNumber == mcChannelNumber).values
        df2 = X[mask]
        print("Mean squared error for dataset number:", mcChannelNumber)
        actual_hist = GetEff(df1, df1["reco_matched"])
        pred_hist = GetEff(df2, proba[mask])
        squared_diff = ((actual_hist.values - pred_hist.values) ** 2)
        # average over the bins that are populated in both histograms
        mse = np.nanmean(squared_diff)
        print(mse)

In [None]:
# Make sure to use the same random seed as above to get the same splitting
df_mcChannelNumber_train, df_mcChannelNumber_test = train_test_split(df["mcChannelNumber"], random_state=SEED)
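
An alternative to repeating the split with the same seed is to pass all arrays to a single `train_test_split` call, which shuffles them consistently. The sketch below uses toy arrays with hypothetical names to show that the rows stay aligned.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy data: row i of X_toy is [2*i, 2*i + 1], with channel label i + 100
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10) % 2
channel_toy = np.arange(10) + 100   # hypothetical labels, one per row

X_tr, X_te, y_tr, y_te, ch_tr, ch_te = train_test_split(
    X_toy, y_toy, channel_toy, random_state=42
)

# each test row still carries its own channel label
rows_aligned = np.array_equal(X_te[:, 0], 2 * (ch_te - 100))
print("rows aligned:", rows_aligned)
```

In the notebook, this would amount to splitting `X`, `y` and `df["mcChannelNumber"]` together in one call.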

In [None]:
mean_squared_error(X_test, df_mcChannelNumber_test, proba_test)