# Identifying Tau Leptons In A High Energy Particle Collider Experiment

This project aims to use a machine learning algorithm to distinguish tau-jets (from tau lepton decays into pions) from hadronic jets in high-energy electron-proton collisions at the proposed Electron Ion Collider. This distinction is needed to identify tau leptons and effectively search for hypothesized (but extremely rare) electron-to-tau conversions, a phenomenon that would be very strong evidence for physics beyond the Standard Model.

The [Standard Model of Elementary Particle Physics](https://en.wikipedia.org/wiki/Standard_Model) describes the fundamental building blocks of all visible matter in the universe and the forces acting between them. It includes twelve matter particles, called fermions. These fermions are grouped in three generations of 'quarks' and three generations of 'leptons'. Each generation includes two quarks and two leptons, respectively. In addition, the Standard Model contains four 'gauge bosons', which mediate the forces between the fermions, and the Higgs boson. All these fundamental particles of the Standard Model have been observed in experiments. 

<br>

<img src="figures/Standard_Model_of_Elementary_Particles.jpg" alt="The Standard Model of Elementary Particle Physics" style="width: 600px;"/>
<center> (Image source: [Wikipedia](https://en.wikipedia.org/wiki/Standard_Model#/media/File:Standard_Model_of_Elementary_Particles.svg) ) </center>

Over the past decades, the Standard Model has had enormous success in describing data from experiments and making accurate predictions. However, it cannot answer all our questions about the universe, and the search for phenomena and particles [beyond the Standard Model](https://en.wikipedia.org/wiki/Physics_beyond_the_Standard_Model) continues with more dedication than ever. One example of such a 'phenomenon of interest' is the conversion of electrons into tau leptons. According to the Standard Model, the chance that this conversion happens is so small that no current or planned experiment could possibly hope to observe it. And so far, no experiment has seen it. Therefore, if electron-to-tau conversions were measured, they would be very strong evidence for physics beyond the Standard Model.

The proposed Electron Ion Collider (EIC) will collide electrons with protons (and heavier atomic nuclei) at nearly the speed of light. While the main purpose of the EIC is to study the inner structure and dynamics of protons and nuclei, it also provides a new opportunity to search for conversions of electrons into tau leptons. The big challenge here is to find the rare electron-proton collisions in which an electron turns into a tau lepton. If they occur, we would expect about X every X years, compared to an overall rate of X electron-proton collisions per second.

<img src="figures/tau_signature.jpg" alt="Experimental signature of a tau lepton decaying into pions." style="width: 400px;"/>

Most electron-proton collisions at the EIC result in one or more hadron <a href="https://en.wikipedia.org/wiki/Jet_(particle_physics)">jets</a> that are measured by the experiment. If tau leptons are created, they decay before we can measure them directly. However, we can measure their decay products and use that information to reconstruct the original tau lepton. For this study, we focus on tau decays that result in three charged [pions](https://en.wikipedia.org/wiki/Pion) (and a neutral pion and a neutrino, which escapes direct detection). These pions form a characteristic jet-like cone, which is typically narrower and contains fewer particles than the ubiquitous hadron jets. To identify tau leptons, we need to find an effective way to distinguish tau-jets from hadron jets. This project aims to use a machine learning algorithm to accomplish this.
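To make "narrower and fewer particles" concrete, the sketch below computes an energy-weighted angular radius for two toy jets. The function names, the `(energy, eta, phi)` constituent layout, and all numbers are assumptions for illustration only. The definition behind dataset features such as `jetshape_radius` is not given here and may differ.

```python
import math

def delta_r(eta1, phi1, eta2, phi2):
    """Angular distance in the eta-phi plane, with phi wrapped to [-pi, pi)."""
    dphi = (phi1 - phi2 + math.pi) % (2 * math.pi) - math.pi
    return math.hypot(eta1 - eta2, dphi)

def energy_weighted_radius(jet_eta, jet_phi, constituents):
    """Energy-weighted mean distance of the constituents from the jet axis.

    constituents: list of (energy, eta, phi) tuples (hypothetical layout).
    """
    total_energy = sum(e for e, _, _ in constituents)
    weighted = sum(e * delta_r(eta, phi, jet_eta, jet_phi)
                   for e, eta, phi in constituents)
    return weighted / total_energy

# Toy constituents, invented for illustration (not taken from the dataset).
tau_like = [(10.0, 0.02, 0.01), (8.0, -0.03, 0.02), (6.0, 0.01, -0.04)]
hadron_like = [(5.0, 0.20, 0.10), (4.0, -0.25, 0.30), (3.0, 0.10, -0.35),
               (2.0, -0.30, -0.20), (1.0, 0.40, 0.05)]

tau_radius = energy_weighted_radius(0.0, 0.0, tau_like)
hadron_radius = energy_weighted_radius(0.0, 0.0, hadron_like)

print("tau-like:   ", len(tau_like), "constituents, radius", round(tau_radius, 3))
print("hadron-like:", len(hadron_like), "constituents, radius", round(hadron_radius, 3))
```

Quantities of this kind (constituent counts and shape variables) are what a classifier can use to separate the two jet types.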



# The Data

For the simulations used in this study, we assume a model of physics beyond the Standard Model that includes [Leptoquarks](https://en.wikipedia.org/wiki/Leptoquark). Leptoquarks are hypothetical particles that combine properties of quarks and leptons. In addition, they mediate identity changes for charged leptons.


Picture: DIS process
Picture: DIS in detector
Picture: Leptoquark
Picture: Jet

One intriguing category of research on the Standard Model is the search for its limitations and possible extensions.

Search for extensions of the Standard Model (physics beyond the Standard Model, BSM), such as new particles.

Leptoquarks combine lepton and quark properties. One interesting feature of these particles is that they would allow charged leptons to change their identity. Such type changes have been observed within the SM for neutrinos and quarks, but the likelihood for charged leptons to do the same is so small that it’s out of reach of experiments. Therefore, measuring such transitions would be a clear signature for physics beyond the Standard Model. Exciting!

Large experiments consisting of various different types of measurement instruments are used to measure the reaction products, i.e. the particles coming out of each collision. Examples…

In a typical electron-proton collision…

Finding Leptoquarks:
1. Find collisions where the electron disappears, i.e. turns into a different type of particle.
2. Within these collisions, find the ones with a tau particle in the final state.
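The two steps above amount to two successive filters on an event table. A minimal sketch follows, assuming hypothetical boolean columns `has_scattered_electron` and `has_tau_candidate`. These names and the toy rows are invented for illustration and are not columns of the dataset used later.

```python
import pandas as pd

# Toy event table; the column names are hypothetical.
events = pd.DataFrame({
    'event': [0, 1, 2, 3],
    'has_scattered_electron': [True, False, False, True],
    'has_tau_candidate': [False, True, False, True],
})

# Step 1: keep collisions in which no scattered electron is found.
no_electron = events[~events['has_scattered_electron']]

# Step 2: among those, keep collisions with a tau candidate in the final state.
candidates = no_electron[no_electron['has_tau_candidate']]

print(candidates['event'].tolist())
```

In practice, the second flag would come from a tau-jet classifier rather than being known in advance.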



# The Experiment

# Jet Characteristics

# The Data

In [6]:
# import libraries
import pandas as pd
import numpy as np

In [7]:
# read data
data = pd.read_csv('data/LeptoAna_r05_p250_e20.csv')
#data = data.astype('float32')
#data = data.dropna(axis=0)

In [11]:
# replace values: DIS = 0, tau = 1
# note whitespace ' ' before ' DIS' and ' tau'
#map_replace = {
#'jet_type':{
#    ' DIS':0,
#    ' tau':1
#}
#}

#data.replace( map_replace, inplace=True )
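The commented-out cell above hints that string labels in the CSV may carry a leading space (`' DIS'`, `' tau'`). One way to make such a mapping robust is to strip whitespace before mapping. The sketch below uses a small in-memory CSV invented for illustration; the `jet_type` column is taken from the comment above and may not exist in the current file.

```python
import io
import pandas as pd

# Toy CSV with a space after each comma, mimicking the whitespace issue.
csv_text = "jet_id, jet_type\n0, DIS\n1, tau\n2, DIS\n"
toy = pd.read_csv(io.StringIO(csv_text))

# Strip whitespace from the column names and from the string labels,
# then map the labels to integers: DIS = 0, tau = 1.
toy.columns = toy.columns.str.strip()
toy['jet_type'] = toy['jet_type'].str.strip().map({'DIS': 0, 'tau': 1})

print(toy)
```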

In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14363 entries, 0 to 14362
Data columns (total 49 columns):
Row                          14363 non-null int64
event                        14363 non-null int64
evtgen_is_tau                14363 non-null int64
evtgen_tau_etotal            6556 non-null float64
evtgen_tau_eta               6556 non-null float64
evtgen_tau_phi               6556 non-null float64
evtgen_tau_decay_prong       14363 non-null int64
evtgen_tau_decay_hcharged    14363 non-null int64
evtgen_tau_decay_lcharged    14363 non-null int64
evtgen_is_uds                14363 non-null int64
evtgen_uds_etotal            6519 non-null float64
evtgen_uds_eta               6519 non-null float64
evtgen_uds_phi               6519 non-null float64
jet_id                       14363 non-null int64
jet_eta                      14363 non-null float64
jet_phi                      14363 non-null float64
jet_etotal                   14363 non-null float64
jet_etrans                   

In [9]:
data.head(1)

| Unnamed: 0 | Row | event | evtgen_is_tau | evtgen_tau_etotal | evtgen_tau_eta | evtgen_tau_phi | evtgen_tau_decay_prong | evtgen_tau_decay_hcharged | evtgen_tau_decay_lcharged | evtgen_is_uds | ... | jetshape_emcal_econe_r02 | jetshape_emcal_econe_r03 | jetshape_emcal_econe_r04 | jetshape_emcal_econe_r05 | tracks_count_r02 | tracks_count_r04 | tracks_rmax_r02 | tracks_rmax_r04 | tracks_chargesum_r02 | tracks_chargesum_r04 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 0 | 1 | 34.829021 | 0.221435 | 0.760311 | 3 | 3 | 0 | 0 | ... | 4.608803 | 4.749965 | 4.966215 | 5.408717 | 1 | 1 | 0.073429 | 0.073429 | -1 | -1 |


In [10]:
data['evtgen_is_tau'].value_counts()

0    7807
1    6556
Name: evtgen_is_tau, dtype: int64
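The two classes are roughly balanced in this simulated sample. As a quick sanity check, the arithmetic below (plain Python, with the counts copied from the output above) gives the class fractions and the accuracy of a trivial classifier that always predicts the majority class. Any trained model should beat that number.

```python
# Counts copied from the value_counts() output above.
counts = {0: 7807, 1: 6556}

total = sum(counts.values())
fractions = {label: n / total for label, n in counts.items()}

# Accuracy of always predicting the most common class.
majority_baseline = max(fractions.values())

print("total jets:", total)
print("class fractions:", fractions)
print("majority-class baseline accuracy:", majority_baseline)
```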

# Tau-Jet Classification Using Machine Learning

In [7]:
#feature_cols = ['n_Above_0p1', 'eta_average', 'Delta_phi_std', 'tower_energy_sum']
#target_col = 'jet_type'
#feature_cols = ['tracks_count_r04', 'tracks_chargesum_r04', 'tracks_rmax_r04', 'jetshape_radius']
feature_cols = [
#    'jet_eta',
#    'jet_phi',
    'jet_etotal',
    'jet_etrans',
    'jet_ptotal',
    'jet_ptrans',
    'jet_minv',
    'jet_mtrans',
    'jet_ncomp',
    'jet_ncomp_above_0p1',
    'jet_ncomp_above_1',
#    'jet_ncomp_above_10',
    'jet_ncomp_emcal',
    'jetshape_radius',
    'jetshape_rms',
    'jetshape_r90',
    'jetshape_econe_r01',
    'jetshape_econe_r02',
    'jetshape_econe_r03',
    'jetshape_econe_r04',
    'jetshape_econe_r05',
    'jetshape_emcal_radius',
    'jetshape_emcal_rms',
    'jetshape_emcal_r90',
    'jetshape_emcal_econe_r01',
    'jetshape_emcal_econe_r02',
    'jetshape_emcal_econe_r03',
    'jetshape_emcal_econe_r04',
    'jetshape_emcal_econe_r05',
#    'tracks_count_r02',
    'tracks_count_r04',
#    'tracks_rmax_r02',
    'tracks_rmax_r04',
#    'tracks_chargesum_r02',
    'tracks_chargesum_r04']

target_col = 'evtgen_is_tau'

features = data[ feature_cols ]
target = data[ target_col ]
target.value_counts()

0    7807
1    6556
Name: evtgen_is_tau, dtype: int64

In [8]:
from sklearn.model_selection import train_test_split

# create training and testing vars
#X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.8)
#print (X_train.shape, y_train.shape)
#print (X_test.shape, y_test.shape)
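If a held-out split is used instead of cross-validation, note that `test_size` is the fraction reserved for testing, so `test_size=0.8` would train on only 20% of the jets. The sketch below shows a stratified 80/20 split on synthetic data. The arrays `X` and `y` are random stand-ins, not the jet features, and the 0.2 test fraction is an assumption rather than the author's choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 samples, 4 features, binary labels.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 4))
y = (rng.random(1000) < 0.45).astype(int)

# stratify=y keeps the class proportions (nearly) equal in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
print(y_train.mean(), y_test.mean())
```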

In [9]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import cross_val_predict, KFold

# class weights for the commented-out classifiers below
penalty = {
    0: 100,
    1: 1
}

#lr = LogisticRegression(class_weight=penalty)

#lr = DecisionTreeClassifier(class_weight=penalty, max_depth=30)

lr = AdaBoostClassifier(random_state=1)

#lr = RandomForestClassifier(class_weight=penalty, random_state=1, max_depth=20)
#lr = RandomForestClassifier(class_weight='balanced', random_state=1)
#lr = RandomForestClassifier(random_state=1)
# n_splits=3 matches the default of the old KFold(n, n_folds=3) signature
kf = KFold(n_splits=3, shuffle=True, random_state=1)

predictions = cross_val_predict(lr, features, target, cv=kf)
#predictions = cross_val_predict(lr, features, target, cv=10)

#lr.fit( features, target )
#predictions = lr.predict(features)
predictions = pd.Series(predictions)

# False positives.
fp_filter = (predictions == 1) & (target == 0)
fp = len(data[fp_filter])

# True positives.
tp_filter = (predictions == 1) & (target == 1)
tp = len(data[tp_filter])

# False negatives.
fn_filter = (predictions == 0) & (target == 1)
fn = len(data[fn_filter])

# True negatives
tn_filter = (predictions == 0) & (target == 0)
tn = len(data[tn_filter])

# Rates
tpr = tp / (tp + fn)
fpr = fp / (fp + tn)

print( "True positive: "+str(tp))
print( "True negative: "+str(tn))
print( "False positive: "+str(fp))
print( "False negative: "+str(fn))
print( "True Positive Rate: "+str(tpr) )
print( "False Positive Rate: "+str(fpr) )



True positive: 6020
True negative: 7086
False positive: 721
False negative: 536
True Positive Rate: 0.9182428309945089
False Positive Rate: 0.09235301652363263
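The rates above describe a sample with roughly equal numbers of tau and hadron jets. The sketch below recomputes them from the printed counts, adds precision and accuracy, and uses Bayes' rule to estimate how pure a tagged sample would be if tau jets were much rarer. The `purity` helper and the signal fractions passed to it are hypothetical, chosen only to show the trend. They are not estimates for the EIC.

```python
# Counts copied from the cross-validated output above.
tp, tn, fp, fn = 6020, 7086, 721, 536

total = tp + tn + fp + fn
tpr = tp / (tp + fn)          # fraction of tau jets that are tagged
fpr = fp / (fp + tn)          # fraction of hadron jets mistagged as tau
precision = tp / (tp + fp)    # fraction of tagged jets that are real tau jets
accuracy = (tp + tn) / total

def purity(signal_fraction):
    """Expected fraction of true tau jets among tagged jets (Bayes' rule)."""
    signal = tpr * signal_fraction
    background = fpr * (1.0 - signal_fraction)
    return signal / (signal + background)

print("precision:", round(precision, 3))
print("accuracy:", round(accuracy, 3))

# Hypothetical signal fractions, for illustration only.
for s in (0.5, 1e-3, 1e-6):
    print("signal fraction", s, "-> purity", purity(s))
```

With a false positive rate of about 9%, the purity falls quickly as the assumed signal fraction shrinks, which is why suppressing false positives matters so much in a rare-event search.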
