# NSBI training

In this chapter, you will train surrogate neural networks that estimate the ratio of the probability densities of an event under two different hypotheses. These networks are referred to as CARL models.

$$
r( x | \theta_1, \theta_2 ) \equiv \frac{p(x | \theta_1)}{p( x | \theta_2)}
$$

We will learn the density ratios of two numerator hypotheses with respect to a common denominator: (1) the signal-only and (2) the signal+background+interference (SBI) processes, each over the background-only process.

$$
r( x | {\rm S}, {\rm B}), \, r( x | {\rm SBI}, {\rm B})
$$
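
The link between a classifier and a density ratio can be checked on a toy case. The sketch below assumes two 1D Gaussian hypotheses (hypothetical stand-ins for the physics processes) and balanced classes, for which the optimal classifier output is $s(x) = p(x | \theta_1) / (p(x | \theta_1) + p(x | \theta_2))$, so that $r = s / (1 - s)$. The function names are illustrative and not part of `nsbi.carl`.

```python
import numpy as np

def gaussian_pdf(x, mean, sigma):
    """Probability density of a 1D Gaussian."""
    return np.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def ratio_from_classifier(s):
    """Convert the output s(x) of a balanced classifier into a density ratio r(x)."""
    return s / (1.0 - s)

# Toy numerator and denominator hypotheses (assumed for illustration only)
x = np.linspace(-2.0, 2.0, 9)
p_num = gaussian_pdf(x, mean=0.5, sigma=1.0)
p_den = gaussian_pdf(x, mean=-0.5, sigma=1.0)

# Output of the ideal classifier trained on balanced classes
s_optimal = p_num / (p_num + p_den)

# The ratio recovered from the classifier agrees with the exact ratio
r_recovered = ratio_from_classifier(s_optimal)
r_exact = p_num / p_den
print(np.allclose(r_recovered, r_exact))
```

A trained network only approximates `s_optimal`, so the recovered ratio is an estimate whose quality depends on the training.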

Refer back to the introductory overview for how these two estimates allow us to estimate the full SBI process under modifications of the Higgs signal strength; we will put this into practice in the last chapter. For now, we focus on the training.

In [None]:
import json

import pandas as pd
import numpy as np

import torch
from torch.utils.data import TensorDataset, DataLoader
torch.set_default_dtype(torch.float32)
torch.set_float32_matmul_precision('medium')
import lightning as L

from physics.simulation import mcfm
from physics.analysis import zz4l, zz2l2v
from physics.hstar import sigstr
from nsbi import carl

import matplotlib, matplotlib.pyplot as plt

## 1. Preparing the training datasets

The training data consists of examples drawn from the two hypotheses whose ratio we wish to estimate. From this point on, they are referred to as the numerator and denominator hypotheses.

In [None]:
data_dir = '/global/cfs/cdirs/trn016/carl_models/'

(events_sig_train, events_sig_val), (events_bkg_train, events_bkg_val) = carl.utils.load_data(data_dir, 'sig_over_bkg')
(events_sbi_train, events_sbi_val), _ = carl.utils.load_data(data_dir, 'sbi_over_bkg')

### 1.(a) Scale the features

The first "usual" thing to do is to scale the features to have $\left< x \right> = 0$ and standard deviation $\sigma_x = 1$, referred to as standard scaling. Perform the following:

1. Scale the features of the *training* data such that the above holds exactly.
2. Scale the features of the *validation* data with exactly the same transformation that was derived from the training data.

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

features_4l = ['l1_pt', 'l1_eta', 'l1_phi', 'l1_energy', 'l2_pt', 'l2_eta', 'l2_phi', 'l2_energy', 'l3_pt', 'l3_eta', 'l3_phi', 'l3_energy', 'l4_pt', 'l4_eta', 'l4_phi', 'l4_energy']

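# Fit the scaler once, on the training events of all hypotheses combined, so that
# every dataset shares one transformation. Refitting per hypothesis would erase the
# differences in mean and spread that the classifier is meant to learn.
scaler.fit(np.concatenate([
    events.kinematics[features_4l].to_numpy()
    for events in (events_sig_train, events_sbi_train, events_bkg_train)
]))

X_sig_train = scaler.transform(events_sig_train.kinematics[features_4l].to_numpy())
X_sig_val = scaler.transform(events_sig_val.kinematics[features_4l].to_numpy())

X_sbi_train = scaler.transform(events_sbi_train.kinematics[features_4l].to_numpy())
X_sbi_val = scaler.transform(events_sbi_val.kinematics[features_4l].to_numpy())

X_bkg_train = scaler.transform(events_bkg_train.kinematics[features_4l].to_numpy())
X_bkg_val = scaler.transform(events_bkg_val.kinematics[features_4l].to_numpy())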
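
To see what the two scaling steps amount to, here is a minimal NumPy sketch on hypothetical toy features (the array names and distributions are made up for illustration). The mean and standard deviation are computed from the training set only and then reused unchanged for the validation set, which is what `fit` followed by `transform` does.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 1000 training and 200 validation events, 3 features each
loc, scale = [50.0, 0.0, -1.0], [20.0, 2.0, 0.5]
X_train = rng.normal(loc=loc, scale=scale, size=(1000, 3))
X_val = rng.normal(loc=loc, scale=scale, size=(200, 3))

# Step 1: derive the scaling from the training data only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)  # population standard deviation (ddof=0)
X_train_scaled = (X_train - mean) / std

# Step 2: apply the very same scaling to the validation data
X_val_scaled = (X_val - mean) / std

print("train:", X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print("val:  ", X_val_scaled.mean(axis=0), X_val_scaled.std(axis=0))
```

The training features come out with zero mean and unit standard deviation up to floating-point precision, while the validation features are only approximately standardized. That is expected: the validation data must not influence the preprocessing.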

### 1.(b) "Balance" the hypotheses

The key to the likelihood ratio trick is to ensure that the neural network sees examples of the two hypotheses that are balanced, i.e. their total rates of occurrence in the training data are the same.

$$
    N(y = 0) = N(y = 1)
$$

For $N = 1$, the neural network of course only sees the relative rate, i.e. the probability, of each event under the different hypotheses. Let's enforce this for each of the datasets:

In [None]:
w_sig_train, w_sig_val = events_sig_train.weights / events_sig_train.weights.sum(), events_sig_val.weights / events_sig_val.weights.sum()
w_sbi_train, w_sbi_val = events_sbi_train.weights / events_sbi_train.weights.sum(), events_sbi_val.weights / events_sbi_val.weights.sum()
w_bkg_train, w_bkg_val = events_bkg_train.weights / events_bkg_train.weights.sum(), events_bkg_val.weights / events_bkg_val.weights.sum()
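
As an illustration of how balanced weights enter the training, the sketch below builds a labelled, weighted toy sample and evaluates a weighted binary cross-entropy. This chapter does not show the loss used inside `nsbi.carl`; the weighted cross-entropy here is an assumption, chosen because it is the standard choice for the likelihood ratio trick. All names and numbers are hypothetical.

```python
import numpy as np

def normalize(weights):
    """Rescale event weights so that they sum to one."""
    return weights / weights.sum()

def weighted_bce(s, y, w):
    """Weighted binary cross-entropy of classifier outputs s against labels y."""
    eps = 1e-12
    s = np.clip(s, eps, 1.0 - eps)  # guard against log(0)
    return -np.sum(w * (y * np.log(s) + (1.0 - y) * np.log(1.0 - s))) / np.sum(w)

# Hypothetical raw weights of numerator (y = 1) and denominator (y = 0) events
w_num = normalize(np.array([0.2, 0.5, 1.3]))
w_den = normalize(np.array([4.0, 1.0, 2.0, 3.0, 10.0]))

y = np.concatenate([np.ones_like(w_num), np.zeros_like(w_den)])
w = np.concatenate([w_num, w_den])

# Both classes now carry the same total weight, N(y = 0) = N(y = 1) = 1
print(w[y == 1].sum(), w[y == 0].sum())

# A classifier that cannot tell the hypotheses apart outputs 0.5 everywhere
loss = weighted_bce(np.full_like(w, 0.5), y, w)
print(loss)
```

With balanced classes, the uninformed classifier gives a loss of $\ln 2$, which is a useful baseline when reading training curves.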

## 2. Building the NN

### 2.(a) NN architecture

Implement the function to specify the layers of a multi-layer perceptron (MLP) with:

1. As many input nodes as there are features,
2. As many hidden layers, and nodes per layer, as desired, all with a ReLU activation function,
3. One output node with a sigmoid activation function.

In [None]:
import torch
from torch import nn

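def nn_layers(n_features, n_layers, n_nodes):
    """Return the list of blocks of an MLP ending in a single sigmoid output."""
    layers = []
    # Input block: maps the features onto the first set of hidden nodes
    layers.append(nn.Sequential(nn.Linear(n_features, n_nodes), nn.ReLU()))
    # n_layers further hidden blocks, each n_nodes wide
    for _ in range(n_layers):
        layers.append(nn.Sequential(nn.Linear(n_nodes, n_nodes), nn.ReLU()))
    # Output block: one node, mapped into (0, 1) by the sigmoid
    layers.append(nn.Sequential(nn.Linear(n_nodes, 1), nn.Sigmoid()))
    return layers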
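
When choosing `n_layers` and `n_nodes`, it can help to know how many trainable parameters the resulting network has. The helper below is a hypothetical addition (not part of `nsbi.carl`) that counts the weights and biases of the linear layers exactly as `nn_layers` arranges them: one input block, `n_layers` hidden blocks, and one output block.

```python
def count_mlp_parameters(n_features, n_layers, n_nodes):
    """Count weights and biases of the MLP laid out as in nn_layers."""
    first = n_features * n_nodes + n_nodes            # input -> first hidden block
    hidden = n_layers * (n_nodes * n_nodes + n_nodes)  # hidden -> hidden blocks
    output = n_nodes * 1 + 1                           # last hidden block -> output node
    return first + hidden + output

# Example with the 16 lepton features and hypothetical hyperparameters
print(count_mlp_parameters(n_features=16, n_layers=2, n_nodes=32))  # 2689
```

The count grows quadratically with `n_nodes` but only linearly with `n_layers`, so widening the network is the more expensive of the two choices.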

## 3. Training the NNs

A `torch/lightning` implementation of everything above has already been prepared for you, and can be launched with a command such as:

```sh
 python -m nsbi.carl fit \
    --data.features '["l1_pt", "l1_eta", "l1_phi", "l1_energy", "l2_pt", "l2_eta", "l2_phi", "l2_energy", "l3_pt", "l3_eta", "l3_phi", "l3_energy", "l4_pt", "l4_eta", "l4_phi", "l4_energy"]' \
    --data.numerator_events '/ptmp/mpp/taepa/higgs-offshell-interpretation/data/zz4l/ggZZ_sbi/analyzed.csv' \
    --data.denominator_events '/ptmp/mpp/taepa/higgs-offshell-interpretation/data/zz4l/ggZZ_sbi/analyzed.csv' \
    --data.denominator_reweight '["sbi","bkg"]' \
    --data.batch_size BATCH_SIZE \
    --model.learning_rate LEARNING_RATE \
    --model.n_layers N_LAYERS \
    --model.n_nodes N_NODES \
    --trainer.max_epochs 500 \
    --trainer.seed_everything SEED
```

The example commands for training the two estimates, $p_{\rm S}(x) / p_{\rm B}(x)$ and $p_{\rm SBI}(x) / p_{\rm B}(x)$, along with some suggested hyperparameters, are available at `scripts/fit-carl-4l.sh`. IMPORTANT: It is *strongly* encouraged (for convenience) to also specify the seed as:
```sh
    --trainer.seed_everything SEED
```
This ensures that the train/validation/test dataset splitting is done consistently across the two NNs. This is not required, but convenient (i.e. you will need to write additional code otherwise) for later chapters!
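
The reason a shared seed gives consistent splits can be shown with a small sketch. The splitting performed inside `nsbi.carl` is not shown in this chapter, so the function, the split fractions, and the use of a NumPy generator below are all assumptions made purely for illustration.

```python
import numpy as np

def split_indices(n_events, seed, fractions=(0.8, 0.1, 0.1)):
    """Shuffle event indices with a seeded generator and cut into train/val/test."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_events)
    n_train = int(fractions[0] * n_events)
    n_val = int(fractions[1] * n_events)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# Two trainings with the same seed pick exactly the same events for each subset
split_a = split_indices(1000, seed=42)
split_b = split_indices(1000, seed=42)
print(all(np.array_equal(a, b) for a, b in zip(split_a, split_b)))

# A different seed yields a different assignment
split_c = split_indices(1000, seed=7)
print(np.array_equal(split_a[0], split_c[0]))
```

With identical splits, an event held out from one network is also held out from the other, so the two ratio estimates can later be combined event by event without extra bookkeeping.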

### Extra note about "reweighting"

You may have noticed that, for the $p_{\rm SBI}(x) / p_{\rm B}(x)$ training, the SBI dataset is specified for both the SBI and B hypotheses! But also notice the extra argument `--data.denominator_reweight '["sbi", "bkg"]'`, which internally reweights the SBI event weights to the B hypothesis, via

$$
 w_{i, \rm SBI \to B} = w_{i, \rm SBI} \times \frac{|\mathcal{M}_{\rm B}(x_i)|^2}{|\mathcal{M}_{\rm SBI}(x_i)|^2}
$$

This helps the training converge faster for the available dataset size, as the network sees exactly the same $x_i$ points under both labels, $y_i = 0$ and $y_i = 1$. This is a prelude to some of the more advanced NSBI methods ([arXiv:1805.12244](https://arxiv.org/abs/1805.12244)), which are beyond the scope of this tutorial.
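
The reweighting can be sketched on a few toy events. Matrix-element reweighting multiplies each event weight by the ratio of the target to the source squared matrix element, here $|\mathcal{M}_{\rm B}|^2 / |\mathcal{M}_{\rm SBI}|^2$. The arrays below are hypothetical numbers; in the tutorial the matrix elements come with the simulated events, and the internal implementation of `--data.denominator_reweight` is not shown here.

```python
import numpy as np

# Hypothetical per-event values for three SBI events
w_sbi = np.array([0.10, 0.25, 0.40])   # event weights under the SBI hypothesis
me2_sbi = np.array([2.0, 5.0, 8.0])    # |M_SBI|^2 evaluated at each x_i
me2_bkg = np.array([1.0, 6.0, 4.0])    # |M_B|^2 evaluated at the same x_i

# Reweight SBI -> B: multiply by target over source squared matrix element
w_bkg = w_sbi * me2_bkg / me2_sbi
print(w_bkg)

# The same x_i now appears under both labels, and the ratio of its two weights
# is the ratio of squared matrix elements at that point
print(w_sbi / w_bkg)
```

Because every $x_i$ carries both a numerator and a denominator weight, the network is not limited by statistical fluctuations between two independently sampled datasets.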
