# NSBI training

In this tutorial, you will train surrogate neural networks that estimate the ratio of the probability densities of an event under two different hypotheses. These are a special kind of classifier known as CARL models.

$$
r( x | \theta_1, \theta_2 ) \equiv \frac{p(x | \theta_1)}{p( x | \theta_2)}
$$
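To see why a classifier can stand in for this ratio, here is a minimal sketch using two one-dimensional Gaussians as toy hypotheses (an assumption made purely for illustration, not the tutorial's data). For balanced classes, an ideal classifier outputs $s(x) = p(x | \theta_1) / (p(x | \theta_1) + p(x | \theta_2))$, so the ratio can be recovered as $r = s / (1 - s)$.

```python
import numpy as np

def gaussian_pdf(x, mean, sigma):
    """Density of a 1D normal distribution."""
    return np.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Toy hypotheses (hypothetical stand-ins for p(x | theta_1) and p(x | theta_2)).
x = np.linspace(-3.0, 3.0, 7)
p_num = gaussian_pdf(x, mean=1.0, sigma=1.0)
p_den = gaussian_pdf(x, mean=0.0, sigma=1.0)

# Output of an ideal classifier trained on balanced samples of the two hypotheses.
s = p_num / (p_num + p_den)

# Likelihood ratio trick: invert the classifier output to get the density ratio.
r_from_classifier = s / (1.0 - s)
r_exact = p_num / p_den
```

In practice the exact densities are unknown, and the trained network plays the role of `s`.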

In [None]:
import json

import pandas as pd
import numpy as np

import torch
from torch.utils.data import TensorDataset, DataLoader
torch.set_default_dtype(torch.float32)
torch.set_float32_matmul_precision('medium')
import lightning as L

from physics.simulation import mcfm
from physics.analysis import zz4l, zz2l2v
from physics.hstar import sigstr
from nsbi import carl

import matplotlib, matplotlib.pyplot as plt

## 1. Preparing the training datasets

The training data consists of examples drawn from the two hypotheses whose ratio we wish to estimate. From this point on, they will be referred to as the numerator and denominator hypotheses, respectively.

In [None]:
data_dir = 'run/h4l'

# we ignore the later "results" objects, which you must obtain for yourself
(events_sig_train, events_sig_val), (events_bkg_train, events_bkg_val) = carl.utils.load_data(data_dir, 'sig_over_bkg')
(events_sbi_train, events_sbi_val), _ = carl.utils.load_data(data_dir, 'sbi_over_bkg')

### 1.(a) Scale the features

The first "usual" thing to do is to scale each feature to have mean $\left< x \right> = 0$ and standard deviation $\sigma_x = 1$, which is referred to as standard scaling. Perform the following:

1. Scale the features of the *training* data such that the above holds exactly.
2. Scale the features of the *validation* data using exactly the same transformation that was derived from the training data.
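The two steps above can be sketched with plain NumPy, assuming randomly generated toy features in place of the real kinematics. This mirrors what a `fit` followed by a `transform` does: the mean and standard deviation come from the training data only and are then reused unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical toy features: 1000 training and 200 validation events, 3 features each.
X_train = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))
X_val = rng.normal(loc=5.0, scale=2.0, size=(200, 3))

# Step 1: derive the scaling from the training data only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
X_train_scaled = (X_train - mean) / std

# Step 2: reuse the same mean and standard deviation for the validation data.
X_val_scaled = (X_val - mean) / std
```

The scaled validation features will only be approximately centered, since their own mean and spread were never used.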

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

features_4l = ['l1_pt', 'l1_eta', 'l1_phi', 'l1_energy', 'l2_pt', 'l2_eta', 'l2_phi', 'l2_energy', 'l3_pt', 'l3_eta', 'l3_phi', 'l3_energy', 'l4_pt', 'l4_eta', 'l4_phi', 'l4_energy']

# IMPLEMENT ME
# Note: each fit_transform() call refits the scaler, so every sample below is
# standardized with the statistics of its own training set.
X_sig_train = scaler.fit_transform(events_sig_train.kinematics[features_4l].to_numpy())
X_sig_val = scaler.transform(events_sig_val.kinematics[features_4l].to_numpy())

X_sbi_train = scaler.fit_transform(events_sbi_train.kinematics[features_4l].to_numpy())
X_sbi_val = scaler.transform(events_sbi_val.kinematics[features_4l].to_numpy())

X_bkg_train = scaler.fit_transform(events_bkg_train.kinematics[features_4l].to_numpy())
X_bkg_val = scaler.transform(events_bkg_val.kinematics[features_4l].to_numpy())

### 1.(b) "Balance" the hypotheses

The key to the likelihood ratio trick is to ensure that the neural network sees examples of the two hypotheses in balanced proportions, i.e. their total rates of occurrence in the training data are the same.

$$
    N(y = 0) = N(y = 1)
$$

For $N = 1$, the neural network then only sees the relative rate, i.e. the probability, of each event under the different hypotheses. Let's enforce this for each of the datasets:
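As a small illustration of this normalization, the sketch below uses made-up event weights (hypothetical values, chosen so that the two hypotheses have very different total rates) and rescales each hypothesis to unit total weight.

```python
import numpy as np

# Hypothetical per-event weights for the two hypotheses.
w_num = np.array([0.2, 0.5, 0.3, 1.0])
w_den = np.array([10.0, 30.0, 20.0])

# Normalize each hypothesis to unit total weight, so that N(y = 1) = N(y = 0) = 1.
w_num_balanced = w_num / w_num.sum()
w_den_balanced = w_den / w_den.sum()
```

Within each hypothesis the relative weights of the events are unchanged; only the overall normalization differs.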

In [None]:
# IMPLEMENT ME
w_sig_train, w_sig_val = events_sig_train.weights / events_sig_train.weights.sum(), events_sig_val.weights / events_sig_val.weights.sum()
w_sbi_train, w_sbi_val = events_sbi_train.weights / events_sbi_train.weights.sum(), events_sbi_val.weights / events_sbi_val.weights.sum()
w_bkg_train, w_bkg_val = events_bkg_train.weights / events_bkg_train.weights.sum(), events_bkg_val.weights / events_bkg_val.weights.sum()

## 2. Building the NN

### 2.(a) NN architecture

Implement the function to specify the layers of a multi-layer perceptron (MLP) with:

1. As many input nodes as there are features.
2. As many hidden layers, and nodes per hidden layer, as desired, all with a ReLU activation function.
3. One output node with a sigmoid activation function.

In [None]:
import torch
from torch import nn

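def nn_layers(n_features, n_layers, n_nodes):
    # IMPLEMENT ME
    layers = []
    # input block: maps the features onto the first set of hidden nodes
    layers.append(nn.Sequential(nn.Linear(n_features, n_nodes), nn.ReLU()))
    # n_layers further hidden blocks, each followed by a ReLU
    for _ in range(n_layers):
        layers.append(nn.Sequential(nn.Linear(n_nodes, n_nodes), nn.ReLU()))
    # output block: a single node mapped into (0, 1) by the sigmoid
    layers.append(nn.Sequential(nn.Linear(n_nodes, 1), nn.Sigmoid()))
    return layers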
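A list of blocks like this still has to be chained into a single callable module before it can process a batch. The sketch below shows one possible way to do that, using a hypothetical helper `build_mlp` (not part of the tutorial's code) that follows the same layout and wraps everything in one `nn.Sequential`. The layer sizes are arbitrary example values.

```python
import torch
from torch import nn

def build_mlp(n_features, n_layers, n_nodes):
    """Hypothetical helper: stack Linear + ReLU blocks and end with a sigmoid output."""
    layers = [nn.Linear(n_features, n_nodes), nn.ReLU()]
    for _ in range(n_layers):
        layers += [nn.Linear(n_nodes, n_nodes), nn.ReLU()]
    layers += [nn.Linear(n_nodes, 1), nn.Sigmoid()]
    return nn.Sequential(*layers)

model = build_mlp(n_features=16, n_layers=2, n_nodes=32)

# Forward pass on a batch of 8 random events with 16 features each.
x = torch.randn(8, 16)
with torch.no_grad():
    yhat = model(x)
```

The output has one value per event, bounded between 0 and 1 by the final sigmoid.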

### 2.(b) Custom loss function

We wish to classify events as belonging to the numerator $(y = 1)$ or the denominator $(y = 0)$ hypothesis. Recall that each event carries a (balanced) weight, rather than counting as a single occurrence. Define a custom binary cross-entropy (BCE) function that is the weighted average over events, where the contribution of each event is:

$$
l_i (f(x_i), y_i) = -w_i (y_i \log (f(x_i)) + (1-y_i) \log(1-f(x_i)))
$$

In [None]:
def weighted_binary_cross_entropy(yhat, y):
    # IMPLEMENT ME
    raise NotImplementedError

Of course, a way to specify this is already available within the `torch` library, and that is what will be used in the end. But we do this to make sure we understand what we are doing ourselves.
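One possible sketch of the per-event formula above is shown here. It assumes the weights are passed as an extra argument `w` (a hypothetical signature, different from the stub above) and that the weighted average is normalized by the total weight. The result is cross-checked against `torch.nn.functional.binary_cross_entropy` with its `weight` argument, on made-up example values.

```python
import torch
import torch.nn.functional as F

def weighted_bce(yhat, y, w):
    """Sum of per-event weighted BCE terms, divided by the total weight.

    `w` is a hypothetical extra argument holding the per-event weights.
    """
    per_event = -w * (y * torch.log(yhat) + (1.0 - y) * torch.log(1.0 - yhat))
    return per_event.sum() / w.sum()

# Made-up predictions, labels, and weights for four events.
yhat = torch.tensor([0.9, 0.2, 0.6, 0.4])
y = torch.tensor([1.0, 0.0, 1.0, 0.0])
w = torch.tensor([0.5, 0.25, 0.5, 0.75])

loss = weighted_bce(yhat, y, w)

# Cross-check: the built-in multiplies each per-event term by `weight`.
reference = F.binary_cross_entropy(yhat, y, weight=w, reduction='sum') / w.sum()
```

This sketch does not guard against `yhat` being exactly 0 or 1, where the logarithm diverges.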