
# <font color="darkred" size="+4">**Particle Physics**</font>
Centrale-Supélec - ST4 - 2024

<font color="darkred" size="+3">**Work on systematics**</font>

<br/>

---
## <font color="darkred">**General Instructions**</font>
---
<br/>

To do this homework, you have to save it in your own "Google Colaboratory" space (see in the "File" menu, "save a copy in your Google Drive"; you will have to create a Gmail account if you don't already have one) so that you can modify it as you wish. You need to practise with Google Colaboratory, as you will be using it for the Challenging Week (Enseignement d'Intégration).

Once you have saved this notebook on your own Google Drive, you can modify it and save it again. If you corrupt anything by mistake, previous versions are recorded automatically and you can still access them to go back. In any case, you can start again from scratch with the default notebook if you experience really weird behaviour with your current version.

In the Help menu above (menu "Aide" in French), you can switch the language of the interface. For these lectures I set it to English.

Last notice: in this notebook we use Greek characters in the code cells. To type these characters yourself, you need to install a Greek keyboard layout on your laptop, or copy and paste the ones provided in the code cells.


## <font color="darkred">**Overview**</font>

> The purpose of this practical exercise is to apply the statistical and systematic methods covered in Lectures 3 and 4.
>
> We will address the following topics:
>
>
> 1.   Simple computations and plots of basic statistical and systematic uncertainties
> 2.   Propagating systematic uncertainties through a pre-trained BDT


## <font color="darkred">**Signal and backgrounds**</font>

When we collect measurements from sensors over a short period, we call them events. Events vary in structure, ranging from raw sensor values to complex derived quantities. Often, we cannot tell with certainty whether the recorded information contains what we were looking for.

We categorize events broadly into two types:

- Signal: desired detection.
- Background: unwanted data collected alongside.

The name of the game is to check if observed proportions of signal and backgrounds are in line with our predictions.

## <font color="darkred">**Prediction framework**</font>

Imagine we have a dataset containing both signal and background events, and we only know the total event count, $n_{\rm obs}$. We have formulated a model to predict the numbers of signal ($S$) and background ($B$) events. Our aim is to verify if our predicted counts match the actual counts in the dataset, particularly focusing on the signal category.

To achieve this, we employ a parametric model

$n_{\rm pred}({\boldsymbol\mu})={\boldsymbol\mu}S+B$

where $S$ and $B$ are fixed, but $\boldsymbol\mu$, called the signal strength, is what we seek to determine.
- If our hypothesis, $n_{\rm pred}=\mu S+B$, aligns with the data, $n_{\rm obs}=S+B$, the total number of recorded events matches our model's prediction, from which we derive a signal strength $\mu=1$.
- If the observed number of events comes from background only, $n_{\rm obs}=B$, then we get $\mu=0$.
- Other values arise because of statistical fluctuations (the observation process being random by nature) or systematic errors in the determination of $S$ and/or $B$.

Statistical fluctuations occur because of the random nature of the measurement process at stake.

Systematic errors come into play because of incomplete knowledge of the experimental conditions and/or of the theoretical prediction framework.

The verification process about possible $\mu$ values is what we will undertake next.

## <font color="darkred">**Statistical modeling**</font>

As we have seen in the lecture, $n_{\rm obs}$ should follow a Poisson distribution:

$${\cal P}(n_{\rm obs}|S, B) = \frac{\left(S+B\right)^{n_{\rm obs}}}{n_{\rm obs}!}e^{-\left(\,S+B\,\right)}$$

In particular we have the first two moments:
- ${\mathbb E}[n_{\rm obs}]= S+B$
- ${\mathbb V}[n_{\rm obs}]= S+B$
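As a quick numerical check of these two moments, here is a minimal sketch that draws Poisson-distributed counts with NumPy and compares the sample mean and variance to $S+B$. The yields, the seed and the name `n_toys` are illustrative assumptions, not values imposed by the exercise.

```python
import numpy as np

# Hypothetical expected yields, chosen only for illustration
S, B = 100, 10_000

# Draw many pseudo-experiments, each one a Poisson count with mean S + B
rng = np.random.default_rng(seed=42)
n_toys = rng.poisson(lam=S + B, size=100_000)

print(f"sample mean     = {n_toys.mean():.1f}")
print(f"sample variance = {n_toys.var():.1f}")
print(f"expected S + B  = {S + B}")
```

Both printed sample values should land close to $S+B$, up to the finite size of the toy sample.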

We start by importing the prerequisite packages and modules.

In [None]:
from matplotlib import pyplot as plt
import numpy as np


# <font color="darkred">**Part 1: On statistical behaviours of uncertainties**</font>


## **<font color="darkred">QUESTION 1</font>**

Match the observed number of events, $n_{\rm obs}$ to the expected number of events from the model

$$n_{\rm exp}=\mu S+B$$


Define the value of $\mu$ as a function of $n_{\rm obs}$, $S$ and $B$

In [None]:
### Type your code here


Without data, in the experiment design phase, we often resort to Monte Carlo simulations by sampling synthetic data around expectations to get realistic simulated datasets.

Any deviation in $n_{\rm obs}$ with respect to $S+B$ will result in deviation in $\mu$ with respect to 1.

Therefore, because $n_{\rm obs}$ fluctuates around its mean value ${\mathbb E}[n_{\rm obs}]=S+B$ following the Poisson distribution, $\mu$ fluctuates around 1.
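This behaviour can be illustrated with a small Monte Carlo sketch: each pseudo-experiment draws $n_{\rm obs}$ from a Poisson distribution and converts it into a signal strength by matching $n_{\rm obs}$ to $\mu S+B$. The yields, the seed and the name `mu_toys` are assumptions made for this illustration only.

```python
import numpy as np

# Hypothetical expected yields, chosen only for illustration
S, B = 100, 10_000

rng = np.random.default_rng(seed=0)
n_obs = rng.poisson(lam=S + B, size=50_000)

# Signal strength of each pseudo-experiment, from n_obs = mu * S + B
mu_toys = (n_obs - B) / S

print(f"mean of mu toys = {mu_toys.mean():.3f}")
print(f"std  of mu toys = {mu_toys.std():.3f}")
```

A histogram of `mu_toys` (for instance with `plt.hist`) gives a visual impression of the spread that the next question asks you to derive analytically.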

## **<font color="darkred">QUESTION 2</font>**

What are:
1. the expectation, ${\mathbb E}[\mu]$,
2. the variance ${\mathbb V}[\mu]$,
3. the standard deviation $\sigma_{\mu}$?

To start with (so as not to bother with random sampling), we first assume that we collect the expected number of events $n_{\rm obs}={\mathbb E}[n_{\rm obs}]=S+B$ (_i.e._ without any statistical fluctuation). This peculiar situation is called the "*Asimov dataset*"<a name="cite_ref-1"></a>[<sup>[^1]</sup>](#cite_note-1). <a name="back_to_citation_1"></a>

To study the evolution of $\sigma_\mu=\sqrt{{\mathbb V}[\mu]}$ with respect to $n_{\rm obs}$, we will consider that the experiment takes data over an increasing amount of time. We will simply assume that $n_{\rm obs}$, $S$ and $B$ are all proportional to some factor $\alpha$.

Take for this application:
1. $S=100$
2. $B=10,000$
3. $\alpha$ from 1 to 1,000

Write code to compute `sigma_mu` as a function of $\alpha$ for fixed $S$ and $B$, and plot `sigma_mu` as a function of $\alpha$:
1. First in linear scale (using `np.linspace` for alpha and plotting with `plt.plot`)
2. Second in log-log scale (using `np.logspace` for alpha and plotting with `plt.loglog`)

Do not forget to name the axes with `plt.xlabel(...)` and `plt.ylabel(...)` when producing graphics.

In [None]:
α = ...

def sigma_mu(α, S=100, B=10_000):
    return ...

# plt.plot(...)


What do you notice in the loglog scale plot? Comment about this.

## **<font color="darkred">QUESTION 3</font>**

**Study the case of signal fraction**: Consider that we can analyse the candidate events better and that we have a handle to purify the signal fraction, $f_S$, in the observed sample $n_{\rm obs}$:
$$n_{\rm obs}= \underbrace{f_S\times n}_{S}  + \underbrace{(1-f_S) \times n}_{B}$$

How does $\sigma_\mu = \sqrt{{\mathbb V}[\mu]}$ evolve with $f_S$ for $n=10,000$? Plot the variation of $\sigma_\mu$ in log-log scale.

In [None]:
### Type your code here



Comment on the signal fraction $f_S$ needed to reach $\sigma_{\mu}=10^{-1}$ with $n=10,000$, and on the $\alpha$ scaling parameter required to reach the same sensitivity. Why does a larger signal fraction lead to a smaller uncertainty on $\mu$?

Compute $\sigma_\mu$ for 3 different configurations:
1. S = 100, n = 10,000, fS = 0.01
2. S = 1,000, n = 10,000, fS = 0.10
3. S = 10,000, n = 1,000,000, fS = 0.01

## **<font color="darkred">QUESTION 4</font>**

Consider now that $f_S=0.2$ is fixed and compute $\sigma_\mu$ as a function of the total number of events $n$ between 2,600 and 250,000.


In [None]:
### Type your code here


Why, with a signal fraction of still only 20%, is it possible to reach a 1% precision with a large dataset of 250,000 events?


# <font color="darkred">**Part 2: Taking into account systematic uncertainties**</font>

On top of deviations from expectations due to random statistical fluctuations, our expected total numbers of signal and background events could have been wrongly calculated for several reasons. In the rest of this section, we consider systematic uncertainties associated with the numbers of signal and background events.

To estimate potential errors on our determination of $\mu$ we will estimate associated uncertainties on $S$ and $B$.

Physically, the total number of signal events $S$ can be written as:

$$S = {\cal L}\times\sigma_S\times\varepsilon_S$$

and the total number of background events $B$ as:

$$B = {\cal L}\times\sigma_B\times\varepsilon_B$$

where,
- ${\cal L}$ is the integrated luminosity (flux) of particles taking part in the interaction process within the detector,
- $\varepsilon_S$ is the signal detection efficiency,
- $\varepsilon_B$ the background detection efficiency,
- $\sigma_S$ the signal cross section,
- $\sigma_B$ the background cross section.

${\cal L}$, $\varepsilon_S$ and $\varepsilon_B$ are not fully determined and are subject to the following respective relative uncertainties:
- $\delta {\cal L} = 1\%$
- $\delta \varepsilon_S = 5\%$
- $\delta \varepsilon_B = 2\%$

## **<font color="darkred">QUESTION 5</font>**

Using linear error propagation on $\mu$, with the above uncertain parameters, determine the total uncertainty $\sigma_\mu$ on $\mu$.
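As a reminder of the technique, here is a minimal sketch of linear error propagation for uncorrelated relative uncertainties, applied to a toy ratio $r=a/b$ rather than to $\mu$. The helper `propagate`, the toy function `ratio` and all numerical values are hypothetical; the derivatives are estimated by finite differences instead of being written analytically.

```python
import numpy as np

def propagate(f, x, rel_unc, h=1e-6):
    """Linear propagation of uncorrelated relative uncertainties through f."""
    x = np.asarray(x, dtype=float)
    rel_unc = np.asarray(rel_unc, dtype=float)
    jac = np.empty_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h * x[i]
        # Central finite difference estimating x_i * df/dx_i
        jac[i] = (f(x + step) - f(x - step)) / (2 * h)
    # Quadrature sum of the individual contributions
    return np.sqrt(np.sum(jac**2 * rel_unc**2))

def ratio(x):
    # Toy function, unrelated to mu: r = a / b
    return x[0] / x[1]

sigma_r = propagate(ratio, x=[10.0, 4.0], rel_unc=[0.03, 0.04])
print(f"{sigma_r:.4f}")  # 0.1250
```

For a ratio, the relative uncertainties add in quadrature (here 3% and 4% give 5% of $r=2.5$), which is what the printed value reflects.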

Below, a functor is defined. A functor is an object that is created with specific internal parameters and can then be called like a function. Each instance is specific to the values provided at its creation, and it can be used as a regular function thanks to the `__call__` attribute of the class.
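To make the pattern concrete before applying it to $\delta\mu$, here is a minimal sketch of a functor on an unrelated toy problem. The class `PowerLaw` and its parameters are hypothetical and serve only to show how aliasing `__call__` to a method works.

```python
class PowerLaw:
    """Callable object: parameters are fixed at creation, then reused on each call."""

    def __init__(self, norm=1.0, index=2.0):
        self.norm, self.index = norm, index

    def evaluate(self, x):
        return self.norm * x**self.index

    # Aliasing __call__ lets an instance be used like a plain function
    __call__ = evaluate

square = PowerLaw(norm=1.0, index=2.0)
cube = PowerLaw(norm=1.0, index=3.0)
print(square(3.0), cube(2.0))  # 9.0 8.0
```

The two instances share the same code but carry different internal parameters, which is the reason for using a functor rather than a single function with many arguments.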

In [None]:
class δμFunctor():
    def __init__(self, δε_S=0.1, δε_B=0.1, δL=0.1):
        self.δε_S, self.δε_B, self.δL = δε_S, δε_B, δL

    def Jacobian(self, S, B):
        n = S+B
        ### WRITE THE CORRECT VALUES THERE:
        return np.array([..., ..., ..., ...])
        ###

    def getUncertainty(self, S, B):
        uncertainties = np.array([1/np.sqrt(S+B), self.δε_S, self.δε_B, self.δL])
        return np.sqrt(sum(self.Jacobian(S, B)**2 * uncertainties**2))

    __call__ = getUncertainty

Calling the `δμFunctor` in the cell below should run flawlessly and give `δμ=0.234`

In [None]:
# We define a specific δμ function here with the provided uncertainties
δμ = δμFunctor(δε_S=.05, δε_B=.02, δL=.01)
# Here the "function" δμ is called with S and B values and computes the total
# uncertainty for the values of S and B provided.
print(δμ(100, 2_500))

We now define a useful printing command to play with $\alpha$ and $f_S$ values instead of $S$ and $B$.

In [None]:
δμ_printer = lambda n, α, fS: print(f"{n = },\t{α = },\t{fS = }\t--->\tδμ = {δμ(n*α*fS, n*α*(1-fS)):.3f}")
δμ_printer(n=10_000, α=1.0, fS=.1)

Which is better?
1. 10,000 events with 200 events of signal?
2. 200 events with 150 events of signal?
3. 100 events with 99 events of signal?

Comment on these computations.

---
# <font color="darkred">**Study of systematics with a trained BDT on test samples**</font>
---


In this section we will not spend time on developing and tuning Machine Learning algorithms. That part of the job is addressed in the ML lectures and practice sessions. Here we are interested in using these algorithms to estimate uncertainties and to compute final outputs on statistical significance such as the AMS (approximate median significance) or the $\delta\mu$ uncertainty.

The initial dataset, model and computations are provided in a dedicated module for this notebook, called `systLib`. Its role is to assist you in estimating statistical and systematic uncertainties on a quantity of interest such as the $\mu$ parameter.

If you are curious, you can look at the code provided in the `systLib.py` file once downloaded. It basically creates classes around code extracted from David's ML practice session.

You will need to download a data file from your professor's Google Drive. To do so, you need to authenticate with your Google account, so when executing the following cell you will be prompted at some point to enter a verification code. To get it, follow the instructions given in the popup windows after executing the next cell.

In [None]:
from pydrive2.auth import GoogleAuth
from pydrive2.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [None]:
# Checking local content before downloading files
!ls -rtlh

In [None]:
# Download the training set from Google Drive
download = drive.CreateFile({'id': '1Gtw9_i4e4XuDZnxLbVct7AqSQYaB7ZST'})
download.GetContentFile('data.csv.gz')
# Download systLib module
download = drive.CreateFile({'id': '1-4ejSiPc2Y4tISgsr6zghEbCPiGe9x1m'})
download.GetContentFile('systLib.py')

In [None]:
# Checking that the data file is now accessible in Colab, and its timestamp
!ls -rtlh

You should see 2 new files:
- `data.csv.gz`
- `systLib.py`

Otherwise, something went wrong along the way. In that case, repeat the previous steps.

We import a module, `systLib`, which has been specifically developed for this notebook.

In [None]:
from systLib import DataObject, Model, make_systematic_datasets # type: ignore

From this module we will use the following objects and function:
- `DataObject` is a class whose instance contains pre-processed data. It is used to get the train and test datasets.
- `Model` is a class used for training models on training data and evaluating them on testing datasets.
- `make_systematic_datasets` is a function used to compute the effect of systematics on the test dataset.


We will use the same dataset that was introduced in the first ML practice session with David. It is a Monte Carlo simulation of a Higgs boson decaying into two W bosons. This is a quite common decay mode of the Higgs boson (21%). We are interested in the process where the two W bosons ($W^+$ and $W^-$) then decay into two oppositely charged leptons and two neutrinos. The neutrinos leave the detector undetected and give rise to an apparent energy imbalance. The study of this channel therefore relies on quantities related to the two charged leptons and the missing transverse energy, MET. The reduced dataset provided here contains only the following observables:
- the two charged leptons transverse momenta and "angles" ($\eta$ and $\phi$)
- the two missing transverse energy angles ($\eta$ and $\phi$)

for a total of 6 features, as in ML practice session #1.



In [None]:
data = DataObject()
ds_train = data.get_train_dataset()
ds_test = data.get_test_dataset()

We assume the following:
- a luminosity systematic error $\delta {\cal L} = 1\%$
- a signal systematic normalization error $\delta \varepsilon_S = 5\%$
- a background systematic normalization error $\delta \varepsilon_B = 2\%$
- a detector response systematic error $\delta R = 2\%$


In [None]:
ds = make_systematic_datasets(ds=ds_test, δε_S=.05, δε_Β=.04, δL=.01, δR=.02)

Now that the datasets have been prepared, we will train some models such as LightGBM, XGBoost, scikit-learn HistGradientBoosting and a RandomForest model.

In [None]:
# Example of 4 models to train and test for systematics
from lightgbm import LGBMClassifier as LGB
from xgboost.sklearn import XGBClassifier as XGB
from sklearn.ensemble import HistGradientBoostingClassifier as HGB
from sklearn.ensemble import RandomForestClassifier as RF

Here is a simple way to define the model, train it on a train dataset and plot its AMS performance on a test dataset.

In [None]:
m = Model(classifier=HGB)
m.train(ds_train)
m.plot(ds_test);

## **<font color="darkred">QUESTION 6</font>**

From the above code examples, try different models such as:
- `HGB`
- `LGB`
- `XGB`
- `RF`

What do you notice? Are there models better than others? Is the training time comparable?

## **<font color="darkred">QUESTION 7</font>**

Loop over the models and extract the AMS (approximate median significance), which we covered at the end of Lecture 4.
The AMS definition is:

$${\rm AMS} = \sqrt{2\left(\left(S+B\right)\log\left(1+\frac{S}{B}\right)-S\right)}$$

This quantity is accessible through a call to `Model.get_ams(data_set, scores)` method.
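Independently of `systLib`, the formula above can be evaluated directly for given yields. The following is a minimal sketch; the function name `ams` and the example yields are assumptions made for illustration and are not part of the provided module.

```python
import numpy as np

def ams(S, B):
    """AMS from the formula above, for signal yield S and background yield B > 0."""
    S = np.asarray(S, dtype=float)
    B = np.asarray(B, dtype=float)
    # log1p(x) computes log(1 + x) accurately when S/B is small
    return np.sqrt(2 * ((S + B) * np.log1p(S / B) - S))

# Hypothetical yields, chosen only for illustration
S, B = 100, 10_000
print(f"AMS       = {ams(S, B):.4f}")
print(f"S/sqrt(B) = {S / np.sqrt(B):.4f}")
```

When $S \ll B$ the AMS is close to the simpler estimate $S/\sqrt{B}$, which is a convenient sanity check on the values returned for each model.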

Fill the code below to perform the requested actions:

In [None]:
x = np.linspace(0.4, 0.9, 100)
models = [HGB, LGB, XGB]
ams_models = []
ams_max_models = []
for model in models:
  m = ...
  ...
  _, ams = m.evaluate(ds_test, x)
  ams_models.append(ams)
  ams_max_models.append(m.getArgMaxAMS(ds_test))

In [None]:
labels = ['HGB', 'LGB', 'XGB']
for ams, label in zip(ams_models, labels):
  plt.plot(x, ams, label=label)
plt.grid(ls='--')
plt.legend()
plt.ylabel('AMS')
plt.xlabel('score')
plt.show()

In [None]:
ams_max_models

You can investigate the shapes of signal and background in two datasets with the `Model.compare` method.

In [None]:
m.compare(ds['test'], ds['sig'], range=(0.5, 1), bins=30, density=False)

We need to compute, for different score values, the δμ of each systematic.

For computational efficiency, it is faster to compute each systematic for all the score values at once.

This means we will get a data structure which is a list of 4 lists of δμ values, one value per score.


In [None]:
δμ_stat = []
n, S, _ = m.get_nSB(ds_test, x)
eps = 1e-12
for _n, _S in zip(n, S):
  δμ_stat.append(np.sqrt(_n)/(_S+eps))  # numerical trick to avoid division by 0
δμ = []
for d in ds.values():
  _, S, B = m.get_nSB(d, x)
  δμ.append([(_n-_B)/(_S+eps) for _n, _S, _B in zip(n ,S, B)])

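δμ is a list of 4 lists, one per systematic, each holding the shifted μ value at every score value. The shift with respect to the nominal value μ0 = 1 is taken when the systematics are combined.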

To compute the total δμ_tot = sqrt(sum(δμ_syst_i^2)) we need to then "transpose" this list of lists.

Here are 3 methods:

1. Using python list comprehension:

`>>> δμT = [[row[i] for row in δμ] for i in range(len(δμ[0]))]`

2. Using map to apply a list grouping of each element of the 4 lists

`>>> δμT = list(map(list, zip(*δμ)))`


3. Using a numpy array conversion from list of list to 2D array, transpose and back to list

`>>> δμT = np.asarray(δμ).T.tolist()`
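The three methods can be compared on a small stand-in list. This sketch uses a hypothetical 4 x 3 list named `rows` in place of the real δμ values, with 4 systematics as rows and 3 score values as columns.

```python
import numpy as np

# Toy stand-in: 4 systematics (rows) evaluated at 3 score values (columns)
rows = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]

t1 = [[row[i] for row in rows] for i in range(len(rows[0]))]  # list comprehension
t2 = list(map(list, zip(*rows)))                              # map + zip
t3 = np.asarray(rows).T.tolist()                              # NumPy transpose

print(t1 == t2 == t3)  # True
print(t1[0])           # [1, 4, 7, 10]
```

After transposition, each inner list gathers the values of all systematics for one score value, which is the layout needed to sum them in quadrature.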


In [None]:
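δμT = np.asarray(δμ).T.tolist()
μ0 = 1  # nominal signal strength
# Quadrature sum of the systematic shifts with respect to μ0, for each score value
δμ_syst_total = [np.sqrt(sum((δμ_ix-μ0)**2 for δμ_ix in δμ_x)) for δμ_x in δμT]
# Statistical and systematic uncertainties combined in quadrature
δμ_tot = [np.sqrt(u**2 + y**2) for u, y in zip(δμ_stat, δμ_syst_total)]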

We finish by plotting the δμ's vs. score values:

In [None]:
plt.plot(x, δμ_stat, label='stat')
plt.plot(x, δμ_syst_total, label='syst')
plt.plot(x, δμ_tot, label='total')
plt.legend(facecolor='white')
plt.xlabel('score')
plt.ylabel(r'$\delta{\mu}$')
plt.grid(ls='--')
plt.axis([min(x), max(x), 0, 0.5])
plt.show()

## **<font color="darkred">QUESTION 8</font>**

Based on the preceding code, define a function to compute the statistical and systematic uncertainties for a given model.

Your job is to build up a table of systematic uncertainties for each model. This table should contain:

- a row per model: LGB, HGB, XGB, RF (last one if possible)
- columns: δμ_stat, δμ_sig, δμ_bkg, δμ_lumi, δμ_det, δμ_syst_tot, δμ_tot



---
# Footnotes




<a name="cite_note-1"></a>[[^1]](#cite_ref-1) In reference to Isaac Asimov's short story *Franchise* (1955). In this story, the outcome of the 2008 US Presidential election is determined by the choice made by a single representative of the entire electorate, an average citizen. The scientific article which first used this name for this reference dataset is [[arXiv: 1007.1727](https://arxiv.org/pdf/1007.1727.pdf)]. The term is now often used to describe this peculiar setup. (Go back to [citation mark](#back_to_citation_1))