# Training of the efficiency model
##### Notebook tested within the environment `TensorFlow on GPU` available in the docker image [`landerlini/lhcbaf:v0p8`](https://hub.docker.com/r/landerlini/lhcbaf)

This notebook is part of a pipeline. It requires the data preprocessed as defined in the notebook [Preprocessing.ipynb](./Preprocessing.ipynb), while the validation of the trained model is delegated to the notebook [Efficiency-validation.ipynb](./Efficiency-validation.ipynb).

Here, we define the training procedure for the Deep Neural Network model that predicts the class each track is reconstructed as.
As in the preprocessing step, we restrict the classes to:
 * long tracks (traversing the whole detector)
 * upstream tracks (traversing the VELO and the Tracker Turicensis, TT)
 * downstream tracks (traversing the TT and the downstream tracker).
 
We include as a class the "unreconstructed" category which includes both the non-reconstructed particles and those reconstructed as other classes.
 
The neural network we will train is designed to predict the probability that each track is reconstructed as a given track class.
In the deployment of the model we will assign the particle to a single class, by drawing one of the classes above based on the probabilities obtained from the network.
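
As an illustration of the drawing step, the sketch below samples one class from a vector of probabilities by inverting the cumulative distribution. It is a minimal example in plain Python, not the code used in the deployed model; the class names and the probabilities are hypothetical.

```python
import random

# Illustrative class names; the real label names are defined in the preprocessing step.
CLASSES = ["long", "upstream", "downstream", "unreconstructed"]


def draw_class(probabilities, rng):
    """Draw one class index according to the given probabilities."""
    u = rng.random()
    cumulative = 0.0
    for index, p in enumerate(probabilities):
        cumulative += p
        if u < cumulative:
            return index
    # Guard against rounding errors in the cumulative sum
    return len(probabilities) - 1


rng = random.Random(42)
probs = [0.70, 0.05, 0.10, 0.15]  # hypothetical network output for one particle

# Repeating the draw many times recovers the input probabilities as frequencies
draws = [draw_class(probs, rng) for _ in range(20_000)]
frequencies = [draws.count(i) / len(draws) for i in range(len(CLASSES))]
```

Each particle gets exactly one class, and over many particles with the same features the fraction assigned to each class approaches the predicted probability.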

The classes are mutually exclusive: each particle can be assigned to at most one of the reconstruction classes.
Hence, we describe the problem as a multiclass classification with a multinomial probability function and a Categorical Cross-entropy as loss function.
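
To make the loss concrete, the following sketch computes the categorical cross-entropy for one-hot labels in plain Python. It is only meant to show the formula, the mean over examples of $-\sum_k y_k \log p_k$; the clipping constant `eps` and the toy inputs are assumptions made for this example.

```python
import math


def categorical_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean over examples of -sum_k y_k * log(p_k)."""
    total = 0.0
    for labels, probs in zip(y_true, y_pred):
        # Clip the probabilities from below to avoid log(0)
        total -= sum(t * math.log(max(p, eps)) for t, p in zip(labels, probs))
    return total / len(y_true)


# Two toy examples with four mutually exclusive classes
y_true = [[1, 0, 0, 0], [0, 0, 0, 1]]
y_pred = [[0.7, 0.1, 0.1, 0.1], [0.25, 0.25, 0.25, 0.25]]
loss = categorical_cross_entropy(y_true, y_pred)
```

With one-hot labels, only the probability predicted for the true class contributes to each term, so the loss decreases as that probability approaches one.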

## Libraries and environment setup

As for the [training of the acceptance model](./Acceptance.ipynb), we are using here the standard software stack for TensorFlow on GPU.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

import os
from os import environ

## Remove annoying warnings 
environ['TF_CPP_MIN_LOG_LEVEL'] = '3' 
import tensorflow as tf

We ensure the GPU is properly loaded and assigned to TensorFlow as hardware accelerator for the training.

If the GPU is loaded properly, the following code block should result in a string similar to `'/device:GPU:0'`.

In [2]:
tf.test.gpu_device_name()

''

## Loading data 

We are reading the data using our custom implementation of `FeatherReader` streaming the data directly to TensorFlow.
In particular, we are loading:
 * the training dataset to optimize the weights;
 * the validation dataset to monitor possible overtraining, select the model, and tune the regularization techniques and hyper-parameters.

In [3]:
from feather_io import FeatherReader    
data_reader_train =  FeatherReader(environ.get("TRAIN_DATA", "/tmp/efficiency-train"))
train_dataset = data_reader_train.as_tf_dataset()
data_reader_validation =  FeatherReader(environ.get("VALIDATION_DATA", "/tmp/efficiency-validation"))
validation_dataset = data_reader_validation.as_tf_dataset()

We also load a small chunk of data into RAM to ease the model building.

In [4]:
X, y = next(iter(train_dataset.batch(1_000_000)))
y.shape

TensorShape([1000000, 4])

## Model definition

We define the neural network as a deep network with skip connections to mitigate the vanishing-gradient problem.

Note that the activation of the last layer is a [softmax](https://keras.io/api/layers/activations/#softmax-function) as expected by the [Categorical Cross-entropy loss function](https://keras.io/api/losses/probabilistic_losses/#categoricalcrossentropy-class).

Unfortunately, the `scikinC` package that we rely on to deploy these models in Lamarr does not support the `softmax` activation function when it is specified as a string; it must be defined as an independent layer.

In [5]:
from pprint import pprint 

dense_config = dict(
    units=128,
    activation='tanh', 
    kernel_initializer='he_normal', 
    kernel_regularizer=tf.keras.regularizers.L2(1e-3),
)
# Named `inputs` to avoid shadowing Python's built-in `input`
inputs = tf.keras.layers.Input(shape=X.shape[1:])
x = tf.keras.layers.Dense(**dense_config)(inputs)

# Five residual blocks: each adds the output of a dense layer to its own input
for _ in range(5):
    r = tf.keras.layers.Dense(**dense_config)(x)
    x = tf.keras.layers.Add()([x, r])
# Linear output layer producing one logit per class
x = tf.keras.layers.Dense(y.shape[1], activation='linear', kernel_initializer='he_normal')(x)
x = tf.keras.layers.Softmax()(x)  ## needed by scikinC

model = tf.keras.Model(inputs=[inputs], outputs=[x])
pprint (dense_config)
model.summary()

{'activation': 'tanh',
 'kernel_initializer': 'he_normal',
 'kernel_regularizer': <keras.src.regularizers.regularizers.L2 object at 0x7f7676743470>,
 'units': 128}
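
Splitting the output into a linear `Dense` layer followed by a `Softmax` layer, as in the cell above, is mathematically equivalent to a single `Dense` layer with a softmax activation. The sketch below shows in plain Python what the softmax step does to the logits; the logit values are made up for the example.

```python
import math


def softmax(logits):
    """Map a list of real-valued logits to probabilities summing to one."""
    # Subtracting the maximum does not change the result and avoids overflow
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


logits = [2.0, -1.0, 0.5, 0.0]  # hypothetical output of the last linear layer
probabilities = softmax(logits)
```

The output is strictly positive and normalized, so it can be interpreted as a set of probabilities over the four mutually exclusive classes.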


The configuration of the training is standard for the multiclass classification task.

 * [`CategoricalCrossentropy`](https://keras.io/api/losses/probabilistic_losses/#categoricalcrossentropy-class) loss function
 * [`RMSprop`](https://keras.io/api/optimizers/rmsprop/) optimizer

In [6]:
from tensorflow.keras.losses import CategoricalCrossentropy
from tensorflow.keras.optimizers import RMSprop
from training_utils import TimeLimitCallback

Once again, we split the training procedure into two steps: first we train with a very high learning rate for as long as it improves the value of the loss function, then we drastically reduce the learning rate.
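
The "train as long as it improves" criterion is handled in this notebook by the Keras `EarlyStopping` callback monitoring `val_loss`. The sketch below is a simplified, framework-free version of the patience logic, written only to illustrate the idea: it ignores options such as `min_delta`, and the loss values are invented for the example.

```python
def stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training stops, or None if it never stops."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # New best value: reset the counter
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Illustrative validation losses, not taken from an actual training
toy_val_losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.60, 0.59]
stop_with_patience_3 = stopping_epoch(toy_val_losses, patience=3)
stop_with_patience_5 = stopping_epoch(toy_val_losses, patience=5)
```

A larger patience tolerates longer plateaus, which is why a more generous value is reasonable for the low-learning-rate phase, where improvements per epoch are smaller.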

Note that, to limit the effect of local minima in the loss function and ease convergence towards the global minimum at such a high learning rate, we apply a small [smoothing of the labels](https://towardsdatascience.com/what-is-label-smoothing-108debd7ef06).
With label smoothing, the generated output can no longer be interpreted as a probability, which is unacceptable for our purpose.
Hence, we reset the label smoothing to zero for the second (and last) part of the training, performed with a reduced learning rate.
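
To see why smoothing spoils the probabilistic interpretation, the sketch below applies to a one-hot label the transformation described in the Keras documentation of `CategoricalCrossentropy`, namely `y * (1 - label_smoothing) + label_smoothing / n_classes`. The helper name `smooth_labels` is hypothetical and the example is written in plain Python.

```python
def smooth_labels(one_hot, label_smoothing):
    """Mix a one-hot label with the uniform distribution over the classes."""
    n_classes = len(one_hot)
    return [
        y * (1.0 - label_smoothing) + label_smoothing / n_classes
        for y in one_hot
    ]


# Same smoothing value as in the first training phase
smoothed = smooth_labels([1, 0, 0, 0], label_smoothing=0.01)
```

Since the cross-entropy is minimized when the prediction matches the average target, a network trained on smoothed labels is pulled towards $(1 - \epsilon)\,p + \epsilon / K$ rather than towards the true probability $p$, where $\epsilon$ is the smoothing value and $K$ the number of classes. Setting the smoothing back to zero removes this bias.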



In [10]:
model.compile(loss=CategoricalCrossentropy(label_smoothing=0.01), optimizer=RMSprop(1e-2))
early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3)

MAX_EPOCHS = int(environ.get("MAX_EPOCHS", "100"))
PRE_TRAINING_TIME_LIMIT_SECONDS = int(environ.get("PRE_TRAINING_TIME_LIMIT_SECONDS", "60"))
BATCH_SIZE = int(environ.get("BATCH_SIZE", 100_000))

training_data = train_dataset.batch(BATCH_SIZE, drop_remainder=True).repeat().prefetch(tf.data.AUTOTUNE)
validation_data=next(iter(validation_dataset.batch(BATCH_SIZE)))

history = model.fit(
    training_data, 
    epochs=MAX_EPOCHS, 
    validation_data=validation_data, 
    callbacks=[early_stopping, TimeLimitCallback(PRE_TRAINING_TIME_LIMIT_SECONDS)],
    steps_per_epoch=50
)

Epoch 1/100
50/50 ━━━━━━━━━━━━━━━━━━━━ 26s 476ms/step - loss: 3.2808 - val_loss: 1.4424
Epoch 2/100
50/50 ━━━━━━━━━━━━━━━━━━━━ 25s 490ms/step - loss: 1.2548 - val_loss: 0.7614
Epoch 3/100
50/50 ━━━━━━━━━━━━━━━━━━━━ 0s 430ms/step - loss: 0.7585

Training stopped after 74.10s (limit: 60s)
50/50 ━━━━━━━━━━━━━━━━━━━━ 24s 475ms/step - loss: 0.7587 - val_loss: 0.6031


In [12]:
FINE_TUNING_TIME_LIMIT_SECONDS = int(environ.get("FINE_TUNING_TIME_LIMIT_SECONDS", "600"))

model.compile(loss=CategoricalCrossentropy(label_smoothing=0.00), optimizer=RMSprop(1e-3))
early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5)
history_ft = model.fit(
    training_data, 
    epochs=MAX_EPOCHS, 
    validation_data=validation_data, 
    callbacks=[early_stopping, TimeLimitCallback(FINE_TUNING_TIME_LIMIT_SECONDS)],
    steps_per_epoch=50
)

Epoch 1/100
50/50 ━━━━━━━━━━━━━━━━━━━━ 26s 485ms/step - loss: 0.3370 - val_loss: 0.3059
Epoch 2/100
50/50 ━━━━━━━━━━━━━━━━━━━━ 24s 476ms/step - loss: 0.3004 - val_loss: 0.2907
Epoch 3/100
50/50 ━━━━━━━━━━━━━━━━━━━━ 24s 474ms/step - loss: 0.2813 - val_loss: 0.2788
Epoch 4/100
50/50 ━━━━━━━━━━━━━━━━━━━━ 23s 471ms/step - loss: 0.2701 - val_loss: 0.2644
Epoch 5/100
50/50 ━━━━━━━━━━━━━━━━━━━━ 24s 476ms/step - loss: 0.2586 - val_loss: 0.2506
Epoch 6/100
50/50 ━━━━━━━━━━━━━━━━━━━━ 25s 501ms/step - loss: 0.2504 - val_loss: 0.2487
Epoch 7/100
50/50 ━━━━━━━━━━━━━━━━━━━━ 24s 473ms/step - loss: 0.2460 - val_loss: 0.2394
Epoch 8/100
50/50 ━━━━━━━━━━━━━━━━━━━━ 23s 465ms/step - loss: 0.2405 - val_loss: 0.2342
Epoch 9/100
50/50

UnknownError: Graph execution error:

Detected at node PyFunc defined at (most recent call last):
<stack traces unavailable>
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/efficiency-train/08ba5f12.feather'
Traceback (most recent call last):

  File "/usr/local/lib/python3.12/site-packages/tensorflow/python/ops/script_ops.py", line 269, in __call__
    ret = func(*args)
          ^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/tensorflow/python/autograph/impl/api.py", line 643, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/site-packages/tensorflow/python/data/ops/from_generator_op.py", line 198, in generator_py_func
    values = next(generator_state.get_iterator(iterator_id))
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/home/private/lamarr/lb-trksim-train/notebooks/workflow/notebooks/feather_io.py", line 115, in tf_generator
    with open(filename, 'rb') as f:
         ^^^^^^^^^^^^^^^^^^^^

FileNotFoundError: [Errno 2] No such file or directory: '/tmp/efficiency-train/08ba5f12.feather'


	 [[{{node PyFunc}}]]
	 [[IteratorGetNext]] [Op:__inference_one_step_on_iterator_16620]

The two training phases are clearly visible in the plot below, which reports the full history of the training procedure.

In [None]:
plt.plot(history.history['loss'] + history_ft.history['loss'], label="Loss (train)")
plt.plot(history.history['val_loss'] + history_ft.history['val_loss'], label="Loss (validation)")
plt.xlabel("Epoch")
plt.ylabel("Categorical cross-entropy")
plt.yscale('log')
plt.legend()
plt.show()

## A first rough validation (sanity checks)

As done for the [acceptance training](./Acceptance.ipynb), we perform simple and quick checks on the trained model to ensure that it makes sense, while delegating the most important part of the validation to a [dedicated notebook](./Efficiency-validation.ipynb).

First we plot the distribution of the original labels and of the predictions for the various categories. 

In [None]:
head = data_reader_validation.as_dask_dataframe().head(1_000_000, npartitions=-1)
Xv = head[data_reader_validation.features].values
yv = head[data_reader_validation.labels].values
yv_hat = model.predict(Xv, batch_size=len(Xv))

print (yv.sum(axis=1).mean(axis=0))

n_classes = len(data_reader_validation.labels)
plt.figure(figsize=(5*n_classes, 3))

for iVar, varname in enumerate(data_reader_validation.labels, 0):
    plt.subplot(1, n_classes, iVar+1)
    
    bins = np.linspace(0, 1, 11)
    plt.hist(yv[:, iVar], bins=bins, label="Training labels")
    plt.hist(yv_hat[:, iVar], bins=bins, histtype='step', linewidth=2, label="Prediction")
    plt.title(varname.replace("_", " ").capitalize())
    plt.xlabel("Label")
    plt.legend()
    plt.yscale('log')
plt.show()

Then we use the predicted probability of belonging to the `long track` class as a weight, to compare the distribution of candidates reconstructed as long tracks in the detailed simulation with the distribution of candidates expected to be reconstructed as `long tracks` according to Lamarr.

In [None]:
log_p = head['mc_log10_p']
mask_long = head['recoed_as_long'] == 1
w_long = yv_hat[:, data_reader_validation.labels.index('recoed_as_long')]

bins = np.linspace(-4, 4, 121)
denominator, _ = np.histogram(log_p, bins=bins)
true_numerator, _ = np.histogram(log_p[mask_long], bins=bins)
predicted_numerator, _ = np.histogram(log_p, bins=bins, weights=w_long)

plt.hist((bins[1:] + bins[:-1])/2, bins=bins, weights=denominator, label="In acceptance", histtype='step')
plt.hist((bins[1:] + bins[:-1])/2, bins=bins, weights=true_numerator, label="Long tracks (validation)", color='#8e8')
plt.hist((bins[1:] + bins[:-1])/2, bins=bins, weights=predicted_numerator, label="Long tracks (model)", histtype='step', linewidth=2)

plt.xlabel(r"$\log_{10} \left(p / (1 \mathrm{MeV}/c)\right)$")
plt.legend()
plt.show()
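
The comparison above relies on the fact that, if the predicted probabilities are well calibrated, their sum in a bin estimates the expected number of long tracks in that bin. The sketch below reproduces the idea in plain Python on a toy sample; the helper `weighted_histogram` and all the numbers are hypothetical. Unlike `np.histogram`, it excludes entries lying exactly on the last bin edge.

```python
from bisect import bisect_right


def weighted_histogram(values, edges, weights=None):
    """Sum the weights (or count the entries) in each bin [edges[i], edges[i+1])."""
    if weights is None:
        weights = [1.0] * len(values)
    counts = [0.0] * (len(edges) - 1)
    for value, weight in zip(values, weights):
        index = bisect_right(edges, value) - 1
        if 0 <= index < len(counts):  # entries outside the binning are ignored
            counts[index] += weight
    return counts


# Toy sample: log10(p) of five particles and their predicted probability
# of being reconstructed as long tracks
log_p = [0.5, 0.7, 1.2, 1.4, 1.6]
p_long = [0.2, 0.4, 0.9, 0.8, 0.7]
edges = [0.0, 1.0, 2.0]

denominator = weighted_histogram(log_p, edges)
expected_long = weighted_histogram(log_p, edges, weights=p_long)

# Ratio of the two histograms: average predicted efficiency per bin
efficiency = [n / d if d > 0 else 0.0 for n, d in zip(expected_long, denominator)]
```

Dividing the weighted histogram by the unweighted one gives the average predicted efficiency in each bin, which can be compared with the fraction of true long tracks in the same bin.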

## Exporting the model

As a last step, we export the model to the same directory where we stored the preprocessing steps.

In [None]:
import os
default_output_model = "/tmp/models/efficiency/model.keras"
output_model = os.environ.get('OUTPUT_MODEL', default_output_model)
base_dir = os.path.dirname(output_model)
# Create the output directory, including missing parents, if it does not exist yet
if base_dir:
    os.makedirs(base_dir, exist_ok=True)
model.save(output_model)

## Conclusion

In this notebook we trained a model for the track reconstruction efficiency, implemented a very simple sanity check to ensure that the trained model makes sense, and finally we exported it to perform a more complete validation in a dedicated notebook.
