# Machine Learning at CoDaS-HEP 2024, Lesson 3: Main Project

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import sklearn.datasets
import torch
from torch import nn

<br><br><br><br><br>

## Project 1: classify jets with a neural network

### Step 0: Put your name on the results sheet

Go to the [Google Spreadsheet for this course](https://docs.google.com/spreadsheets/d/1nRtNJoxW1i-jCr04ZHUlfv0DU4tMGbyvCZcpXakYedE/edit?usp=sharing) and add your name to the _second_ sheet (tab on the bottom of the window):

<a href="https://docs.google.com/spreadsheets/d/1nRtNJoxW1i-jCr04ZHUlfv0DU4tMGbyvCZcpXakYedE/edit?usp=sharing"><img src="../img/google-sheet-main-project.png" width="700"></a>

Your results (a ROC curve) will go _below_ your name. You own a column of this spreadsheet.

<br><br><br><br><br>

### Step 1: Download and understand the data

We'll use an LHC dataset from an online catalog, [hls4ml_lhc_jets_hlf](https://openml.org/search?type=data&status=active&id=42468).

The full description is online, with references to the paper in which it was published.

Scikit-Learn has a tool for downloading it (the download takes a minute or two).

In [2]:
hls4ml_lhc_jets_hlf = sklearn.datasets.fetch_openml("hls4ml_lhc_jets_hlf")

features, targets = hls4ml_lhc_jets_hlf["data"], hls4ml_lhc_jets_hlf["target"]

View the features (16 numerical properties of jets) as a Pandas DataFrame:

In [3]:
features

,zlogz,c1_b0_mmdt,c1_b1_mmdt,c1_b2_mmdt,c2_b1_mmdt,c2_b2_mmdt,d2_b1_mmdt,d2_b2_mmdt,d2_a1_b1_mmdt,d2_a1_b2_mmdt,m2_b1_mmdt,m2_b2_mmdt,n2_b1_mmdt,n2_b2_mmdt,mass_mmdt,multiplicity
0,-2.935125,0.383155,0.005126,0.000084,0.009070,0.000179,1.769445,2.123898,1.769445,0.308185,0.135687,0.083278,0.412136,0.299058,8.926882,75.0
1,-1.927335,0.270699,0.001585,0.000011,0.003232,0.000029,2.038834,2.563099,2.038834,0.211886,0.063729,0.036310,0.310217,0.226661,3.886512,31.0
2,-3.112147,0.458171,0.097914,0.028588,0.124278,0.038487,1.269254,1.346238,1.269254,0.246488,0.115636,0.079094,0.357559,0.289220,162.144669,61.0
3,-2.666515,0.437068,0.049122,0.007978,0.047477,0.004802,0.966505,0.601864,0.966505,0.160756,0.082196,0.033311,0.238871,0.094516,91.258934,39.0
4,-2.484843,0.428981,0.041786,0.006110,0.023066,0.001123,0.552002,0.183821,0.552002,0.084338,0.048006,0.014450,0.141906,0.036665,79.725777,35.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
829995,-3.575320,0.473246,0.040693,0.005605,0.053711,0.004402,1.319914,0.785488,1.319914,0.211968,0.106151,0.037546,0.315867,0.123637,72.537308,71.0
829996,-2.408292,0.429539,0.040022,0.005620,0.020352,0.000804,0.508506,0.143106,0.508506,0.077383,0.043065,0.011398,0.131738,0.028787,77.263367,30.0
829997,-3.338864,0.467011,0.075235,0.017644,0.097954,0.022681,1.301970,1.285501,1.301970,0.236583,0.110919,0.068624,0.307230,0.183485,136.165955,72.0
829998,-1.535967,0.335411,0.002537,0.000021,0.002692,0.000017,1.061160,0.797847,1.061160,0.175014,0.086063,0.048476,0.271106,0.161818,4.660848,11.0


And some summary statistics for each feature:

In [4]:
features.describe()

,zlogz,c1_b0_mmdt,c1_b1_mmdt,c1_b2_mmdt,c2_b1_mmdt,c2_b2_mmdt,d2_b1_mmdt,d2_b2_mmdt,d2_a1_b1_mmdt,d2_a1_b2_mmdt,m2_b1_mmdt,m2_b2_mmdt,n2_b1_mmdt,n2_b2_mmdt,mass_mmdt,multiplicity
count,830000.0,830000.0,830000.0,830000.0,830000.0,830000.0,830000.0,830000.0,830000.0,830000.0,830000.0,830000.0,830000.0,830000.0,830000.0,830000.0
mean,-2.865343,0.433322,0.037766,0.007995166,0.045608,0.00760947,1.295784,1.083618,1.295784,0.19038,0.090024,0.04246,0.281169,0.143915,75.15361,51.887834
std,0.580389,0.055448,0.029154,0.009402567,0.038657,0.01217365,0.458041,0.730066,0.458041,0.075417,0.036523,0.026396,0.084556,0.080461,55.612557,21.677036
min,-4.759511,0.091104,7.3e-05,4.472011e-08,2e-06,1.472518e-10,0.005866,0.000156,0.005866,0.000213,7.7e-05,2e-06,0.000643,1.8e-05,0.113449,6.0
25%,-3.283773,0.419295,0.009977,0.0003371321,0.015352,0.0004735599,0.976546,0.485602,0.976546,0.125212,0.059285,0.018935,0.213851,0.071025,19.084184,36.0
50%,-2.909453,0.452219,0.037919,0.005950152,0.036848,0.00250109,1.278506,0.983084,1.278506,0.192994,0.089061,0.038755,0.292299,0.13928,80.106373,48.0
75%,-2.493677,0.468801,0.04851,0.0081934,0.062181,0.007816279,1.559999,1.505659,1.559999,0.251016,0.118213,0.062612,0.350496,0.210668,93.843903,64.0
max,-0.438996,0.493779,0.165237,0.07122659,0.219034,0.107914,3.968144,6.408456,3.968144,0.366573,0.187837,0.137693,0.449523,0.337616,573.616516,212.0


You can convert the (830000 row × 16 column) DataFrame into a NumPy array (of shape `(830000, 16)`) with

In [8]:
features.values

array([[-2.93512535e+00,  3.83155316e-01,  5.12587558e-03, ...,
         2.99057871e-01,  8.92688179e+00,  7.50000000e+01],
       [-1.92733514e+00,  2.70698756e-01,  1.58540264e-03, ...,
         2.26661310e-01,  3.88651156e+00,  3.10000000e+01],
       [-3.11214662e+00,  4.58171129e-01,  9.79138538e-02, ...,
         2.89219588e-01,  1.62144669e+02,  6.10000000e+01],
       ...,
       [-3.33886433e+00,  4.67011213e-01,  7.52350464e-02, ...,
         1.83485478e-01,  1.36165955e+02,  7.20000000e+01],
       [-1.53596663e+00,  3.35411340e-01,  2.53672758e-03, ...,
         1.61818489e-01,  4.66084814e+00,  1.10000000e+01],
       [-2.98799491e+00,  4.55647677e-01,  5.21810818e-03, ...,
         2.57964820e-01,  1.15550756e+01,  4.20000000e+01]])

<br><br><br><br><br>

View the target (5 jet categories) as a Pandas Series:

In [5]:
targets

0         g
1         w
2         t
3         z
4         w
         ..
829995    z
829996    w
829997    t
829998    q
829999    g
Name: class, Length: 830000, dtype: category
Categories (5, object): ['g', 'q', 't', 'w', 'z']

The categories are represented as 5 Python strings (`dtype='object'` means Python objects in a NumPy array/Pandas Series).

In [6]:
targets.cat.categories

Index(['g', 'q', 't', 'w', 'z'], dtype='object')

But the 830000-entry Series itself stores (8-bit) integers, each one giving the position of its category in this list.

In [7]:
targets.cat.codes

0         0
1         3
2         2
3         4
4         3
         ..
829995    4
829996    3
829997    2
829998    1
829999    0
Length: 830000, dtype: int8

In [9]:
targets.cat.codes.values

array([0, 3, 2, ..., 2, 1, 0], dtype=int8)
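The relationship between categories and codes can be checked on a small example. This is a minimal sketch using a hand-made Series named `toy` (a hypothetical stand-in, not the real dataset) with the same five labels:

```python
import numpy as np
import pandas as pd

# Hypothetical 5-entry stand-in for `targets`, with the same 5 categories
toy = pd.Series(["g", "w", "t", "z", "q"], dtype="category")

# The categories are stored once, in sorted order
categories = np.asarray(toy.cat.categories)

# Each entry is stored as an 8-bit integer position in `categories`
codes = toy.cat.codes.values

# Indexing the categories by the codes recovers the original labels
recovered = categories[codes]
```

The integer codes are the form a classifier's loss function typically consumes; the strings are only needed again when labeling plots.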

<br><br><br><br><br>

As for the physical meaning of these features and targets, the paper has the full details, but in brief:

* `'g'` means gluon jet (a gluon from the original proton-proton collision hadronized into a jet)
* `'q'` means a light quark hadronized into a jet: up (u), down (d), or strange (s)
* `'t'` means a top (t) quark decayed into a bottom (b) quark and a W boson, which subsequently decayed and hadronized
* `'w'` means a W boson directly from the original proton-proton collision decayed and its constituents hadronized
* `'z'` means the same thing for a Z boson

<img src="../img/JetDataset.png" width="429">

Each of these physical processes produces systematically different jet shapes, as characterized by the 16 input features.

The distributions of these jet shapes overlap, so this is a job for machine learning! (Not just manually chosen cuts, for instance.)

Use the space below for any plotting. It's always good to look at the data (in _some_ way) before trying to fit it.
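One possible way to look at the data is to overlay the distribution of a single feature for each jet category. The helper below, `plot_feature_by_class`, is a hypothetical name introduced for this sketch; it assumes `features` is a DataFrame and `targets` is a categorical Series, as shown above.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def plot_feature_by_class(features, targets, column, bins=100):
    """Overlay normalized histograms of one feature, one curve per category."""
    fig, ax = plt.subplots()
    values = features[column]

    # Use the same bin edges for every category so the curves are comparable
    edges = np.linspace(values.min(), values.max(), bins + 1)

    for category in targets.cat.categories:
        ax.hist(
            values[targets == category],
            bins=edges,
            density=True,       # normalize, so only the shapes are compared
            histtype="step",
            label=category,
        )

    ax.set_xlabel(column)
    ax.set_ylabel("normalized counts")
    ax.legend()
    return ax
```

For example, `plot_feature_by_class(features, targets, "mass_mmdt")` would draw five overlaid mass distributions, which makes it easy to see how much the categories overlap in that one feature.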

<br><br><br><br><br>

### Step 2: Split the data into training, validation, and test samples

For this exercise, put

* 80% of the data into the training sample, which the optimizer will use in its fits
* 10% of the data into the validation sample, which you will look at while developing the model
* 10% of the data into the test sample, which you should not look at until you're done and making the final ROC curve

These data are supposed to be Independent and Identically Distributed (IID), but just in case there are any beginning-of-dataset, end-of-dataset biases, sample them randomly.

Scikit-Learn has [sklearn.model_selection.train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html), PyTorch has [torch.utils.data.random_split](https://pytorch.org/docs/stable/data.html#torch.utils.data.random_split), or you can do it manually.
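As an illustration of the manual option, the sketch below shuffles the row indices once and cuts them into three disjoint pieces. The function name `split_indices` and the seed value are assumptions made for this example.

```python
import numpy as np

def split_indices(num_rows, fractions=(0.8, 0.1, 0.1), seed=12345):
    """Return shuffled, non-overlapping index arrays for train/validation/test."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(num_rows)

    n_train = round(fractions[0] * num_rows)
    n_valid = round(fractions[1] * num_rows)

    train = shuffled[:n_train]
    valid = shuffled[n_train : n_train + n_valid]
    test = shuffled[n_train + n_valid :]   # whatever remains
    return train, valid, test

train_idx, valid_idx, test_idx = split_indices(830000)
```

The same index arrays can then be applied to both the features and the targets, for instance `features.values[train_idx]` and `targets.cat.codes.values[train_idx]`, so that each row keeps its own label.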

<br><br><br><br><br>

### Step 3: Build a classifier neural network

It should have the following architecture:

* take the 16 numerical features as input
* have 3 (fully connected) hidden layers with 32 neurons each
* return probabilities for the 5 output categories

Use any tools you have to improve the quality of the model, but the model should be implemented in PyTorch.

Think about all of the issues covered in [Lesson 2: Issues in Practice](../lesson-2-issues/lecture-slides.ipynb).
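As a possible starting point (not a finished solution), the architecture described above could be written with `nn.Sequential`. The choice of ReLU activations is an assumption of this sketch, and the random `batch` is a stand-in for real jet features.

```python
import torch
from torch import nn

model = nn.Sequential(
    nn.Linear(16, 32), nn.ReLU(),   # hidden layer 1
    nn.Linear(32, 32), nn.ReLU(),   # hidden layer 2
    nn.Linear(32, 32), nn.ReLU(),   # hidden layer 3
    nn.Linear(32, 5),               # one raw score (logit) per category
)

# nn.CrossEntropyLoss applies the softmax internally, so the model
# itself returns logits while training
loss_function = nn.CrossEntropyLoss()

# Stand-in input: 8 jets with 16 features each
batch = torch.randn(8, 16)
logits = model(batch)

# Apply softmax explicitly when probabilities are needed
probabilities = torch.softmax(logits, dim=1)
```

Note that `nn.CrossEntropyLoss` expects the targets as 64-bit integer class indices, so the `int8` codes would need a conversion such as `torch.tensor(codes, dtype=torch.int64)`.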

<br><br><br><br><br>

### Step 4: Monitor the loss function

Plot the loss function versus epoch for the training sample and the validation sample (and _not_ the test sample!).

Do they diverge? If so, what can you do about that?
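A minimal sketch of recording both losses once per epoch is shown below. The function name `fit_and_record`, the Adam optimizer, the learning rate, the full-batch updates, and the random toy data are all assumptions made to keep the example short; a real solution would likely use mini-batches and the actual jet samples.

```python
import torch
from torch import nn
import matplotlib.pyplot as plt

def fit_and_record(model, train_x, train_y, valid_x, valid_y,
                   num_epochs=10, lr=0.01):
    """Train with full-batch updates and record both losses after each epoch."""
    loss_function = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    train_losses, valid_losses = [], []

    for epoch in range(num_epochs):
        model.train()
        optimizer.zero_grad()
        loss = loss_function(model(train_x), train_y)
        loss.backward()
        optimizer.step()

        # Evaluate both samples without tracking gradients
        model.eval()
        with torch.no_grad():
            train_losses.append(loss_function(model(train_x), train_y).item())
            valid_losses.append(loss_function(model(valid_x), valid_y).item())

    return train_losses, valid_losses

# Toy stand-in data: random features and random labels (not the jet dataset)
torch.manual_seed(0)
x = torch.randn(1000, 16)
y = torch.randint(0, 5, (1000,))
toy_model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 5))

train_losses, valid_losses = fit_and_record(
    toy_model, x[:800], y[:800], x[800:], y[800:]
)

fig, ax = plt.subplots()
ax.plot(train_losses, label="training")
ax.plot(valid_losses, label="validation")
ax.set_xlabel("epoch")
ax.set_ylabel("loss")
ax.legend()
```

If the validation curve flattens or rises while the training curve keeps falling, the gap between them is the signal to look for.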

<br><br><br><br><br>

### Step 5: 