# Machine learning at CoDaS-HEP 2024, lesson 1 part 2

<br><br><br><br><br>

## Reminders

As a reminder, a (simple, feed-forward, 4-layer) neural network looks like this:

<img src="../img/artificial-neural-network-layers-2.svg" width="700">

<br><br><br><br><br>

Which is to say, like this:

$$
y_i =
f\left(a_{i,j}^{\mbox{\scriptsize L3--L4}} \cdot
f\left(a_{i,j}^{\mbox{\scriptsize L2--L3}} \cdot
f\left(a_{i,j}^{\mbox{\scriptsize L1--L2}} \cdot x_j + b_i^{\mbox{\scriptsize L1--L2}}\right)
+ b_i^{\mbox{\scriptsize L2--L3}}\right)
+ b_i^{\mbox{\scriptsize L3--L4}}\right)
$$

<br><br><br><br><br>

In code, that means:

In [1]:
import numpy as np

In [11]:
# take 8-dimensional input layer 1 to 7-dimensional hidden layer 2
a_L1_L2 = np.random.normal(0, 1, (7, 8))
b_L1_L2 = np.random.normal(0, 1, (7,))

# take 7-dimensional hidden layer 2 to 9-dimensional hidden layer 3
a_L2_L3 = np.random.normal(0, 1, (9, 7))
b_L2_L3 = np.random.normal(0, 1, (9,))

# take 9-dimensional hidden layer 3 to 6-dimensional output layer 4
a_L3_L4 = np.random.normal(0, 1, (6, 9))
b_L3_L4 = np.random.normal(0, 1, (6,))

def relu(x):
    return np.maximum(0, x)

def model(x):
    layer1 = x
    layer2 = relu(a_L1_L2 @ layer1 + b_L1_L2)
    layer3 = relu(a_L2_L3 @ layer2 + b_L2_L3)
    layer4 = relu(a_L3_L4 @ layer3 + b_L3_L4)
    y = layer4
    return y

Here's the model's output for a sample input:

In [12]:
x = np.random.normal(0, 1, (8,))

model(x)

array([11.03079726, 12.45383983,  0.        ,  0.        ,  7.81129447,
        7.33418936])

<br><br><br><br><br>

Given a large dataset of `x` vectors, an equally large set of expected `y` vectors, and a minimizer, we could train the model by optimizing these parameters:

In [15]:
a_L1_L2

array([[ 0.46584529, -0.79434105,  0.30786216, -0.50507019, -0.03540209,
        -0.24112821,  0.37416534, -0.62789647],
       [ 0.94458063, -0.51744175,  0.05822749, -1.14103328, -0.00535695,
         0.27368702,  0.87765437,  0.86686374],
       [-0.70204751,  0.61642705, -0.9627886 , -0.94822418, -1.14344213,
         0.13787023, -1.19870441, -0.46053546],
       [ 1.2529536 , -0.58980945, -0.44096821, -0.64246417,  0.8853539 ,
         0.44068554,  2.43598694,  0.49854176],
       [-0.14821625, -0.32792798,  0.33920959, -0.52909471, -1.42775839,
         0.39491426, -2.21739509,  0.68555603],
       [-1.44681708, -0.02982129, -1.46738637,  0.42057161,  0.35277722,
        -0.83142961,  2.32882804,  0.28770379],
       [ 1.28867981,  1.08870556,  0.06398847,  0.45545825,  1.19160146,
         1.00476985, -0.00683849, -1.63661002]])

In [16]:
b_L1_L2

array([ 1.65692997,  1.80684596,  1.55930913, -1.32976663, -0.52313649,
       -1.22225077,  0.74303582])

In [17]:
a_L2_L3

array([[ 1.25902691,  0.22452816,  1.36976825, -0.55811491, -0.07347018,
        -0.42207942,  1.80366964],
       [-0.35619402,  1.380792  ,  0.31643495,  0.18425909, -1.81170905,
        -0.49149711,  0.69482636],
       [-1.20662481,  1.19480494,  0.49589162, -0.03139444, -0.20982596,
         2.15567115,  0.9640442 ],
       [-1.43552956, -1.23384725,  0.68979683,  0.79612705, -0.00638356,
         0.60604382,  0.12707776],
       [ 0.60671319,  0.23179956, -0.93240115, -1.20597387, -0.32877326,
        -0.04077577,  0.97144842],
       [-0.58499887,  0.74685522,  0.60902534, -0.14289348,  0.07788466,
         0.01558596, -0.15741029],
       [-0.8650252 , -3.08058742, -0.3268319 , -0.04060389,  0.56433748,
         0.00652574,  1.31594315],
       [ 0.36718713, -2.10683057, -1.34246018,  0.59514127,  0.32629339,
        -0.58318272,  1.0320227 ],
       [-0.08635942,  0.12455025, -0.05816353,  0.61746697,  0.40378234,
         0.53672263,  0.11352741]])

In [18]:
b_L2_L3

array([ 1.68540897,  1.68547654,  0.77596705,  0.92005175, -1.20306912,
        1.4762198 , -1.13197467,  0.78167442,  0.8892578 ])

In [19]:
a_L3_L4

array([[ 0.82445788, -0.33004676,  0.4936911 , -2.34170104,  0.29632171,
        -0.12980095, -0.0303506 ,  0.87969774,  0.20929147],
       [ 1.67042236, -1.23078755,  0.62490481, -1.81355255, -0.82479196,
         0.8193912 , -0.73813887, -0.34035706,  0.88898891],
       [-1.33837677, -0.41860029, -0.38780771, -0.39605987, -2.91797095,
         1.47461802,  0.20972448,  1.29858008, -0.84848694],
       [-1.24581663, -0.66039626, -2.37231689, -0.34789816,  0.92078624,
        -1.76154556,  0.22515589,  1.41809949, -1.04924791],
       [-0.06665433, -1.11116286,  1.12650561,  0.19868453,  0.6702581 ,
         0.26141375, -0.95402739,  0.63671454, -0.53160827],
       [ 0.12500722, -0.19787728,  0.74681658,  0.91134624,  0.69956507,
         0.65845079, -0.47853924, -0.86570753, -0.87268969]])

In [20]:
b_L3_L4

array([ 0.32526209,  0.08141577,  0.74827707, -0.16723844, -0.78073532,
        1.52148635])

such that `model(x)` comes as close as possible to `y`.

Then we could use `model(x_new)` to predict `y` values for new inputs `x_new`, and the predictions would (roughly) reproduce the relationship between `x` and `y` seen in the training dataset.
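A minimizer needs a single scalar function of the parameters to work with. The following is a minimal sketch of one way to set that up, assuming a mean-squared-error loss and a single flat parameter vector. The names `SHAPES`, `unpack`, `model_with`, and `loss` are hypothetical (not part of the lesson), and the dataset is random numbers used only to show the shapes.

```python
import numpy as np

# (rows, columns) of each weight matrix: 8 -> 7 -> 9 -> 6, as in the model above
SHAPES = [(7, 8), (9, 7), (6, 9)]

def relu(x):
    return np.maximum(0, x)

def unpack(params):
    """Split a flat parameter vector into [(a, b), ...], one pair per layer."""
    layers, start = [], 0
    for n_out, n_in in SHAPES:
        a = params[start : start + n_out * n_in].reshape(n_out, n_in)
        start += n_out * n_in
        b = params[start : start + n_out]
        start += n_out
        layers.append((a, b))
    return layers

def model_with(params, x):
    """Same computation as model(x), but with the parameters passed explicitly."""
    layer = x
    for a, b in unpack(params):
        layer = relu(a @ layer + b)
    return layer

def loss(params, xs, ys):
    """Mean squared difference between predictions and expected outputs (an assumed choice)."""
    predictions = np.array([model_with(params, x) for x in xs])
    return np.mean((predictions - ys) ** 2)

num_params = sum(n_out * n_in + n_out for n_out, n_in in SHAPES)

rng = np.random.default_rng(12345)
params = rng.normal(0, 1, num_params)
xs = rng.normal(0, 1, (100, 8))   # placeholder inputs, not real data
ys = rng.normal(0, 1, (100, 6))   # placeholder targets, not real data

print(num_params, loss(params, xs, ys))
```

Training would then mean searching for the `params` that make `loss(params, xs, ys)` as small as possible.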

<br><br><br><br><br>

HEP has a favorite minimizer: MINUIT.

Introduced in 1972 by Fred James, MINUIT computes numerical second derivatives of the function, uses them to jump to where the minimum would be if the function were exactly quadratic, and then repeats from the new point.

It doesn't scale well to large numbers of parameters, and we would have

In [23]:
a_L1_L2.size + b_L1_L2.size + a_L2_L3.size + b_L2_L3.size + a_L3_L4.size + b_L3_L4.size

195

parameters to optimize in this simple example.

Nevertheless, we'll use MINUIT in some early examples, through the `iminuit` Python package.

<img src="https://raw.githubusercontent.com/scikit-hep/iminuit/develop/doc/_static/iminuit_logo.svg" width="300">

In [21]:
import iminuit

<br><br><br><br><br>

As another simplification, note that we don't have to maintain the distinction between matrices of parameters $a_{i,j}$ and vectors of parameters $b_i$:

$$
\left(\begin{array}{c c c c}
a_{1,1} & a_{1,2} & \ldots & a_{1,10} \\
a_{2,1} & a_{2,2} & \ldots & a_{2,10} \\
a_{3,1} & a_{3,2} & \ldots & a_{3,10} \\
a_{4,1} & a_{4,2} & \ldots & a_{4,10} \\
a_{5,1} & a_{5,2} & \ldots & a_{5,10} \\
\end{array}\right) \cdot \left(\begin{array}{c}
x_1 \\
x_2 \\
\vdots \\
x_{10} \\
\end{array}\right) + \left(\begin{array}{c}
b_1 \\
b_2 \\
b_3 \\
b_4 \\
b_5 \\
\end{array}\right)
$$

is the same as

$$
\left(\begin{array}{c c c c c}
a_{1,1} & a_{1,2} & \ldots & a_{1,10} & b_1 \\
a_{2,1} & a_{2,2} & \ldots & a_{2,10} & b_2 \\
a_{3,1} & a_{3,2} & \ldots & a_{3,10} & b_3 \\
a_{4,1} & a_{4,2} & \ldots & a_{4,10} & b_4 \\
a_{5,1} & a_{5,2} & \ldots & a_{5,10} & b_5 \\
\end{array}\right) \cdot \left(\begin{array}{c}
x_1 \\
x_2 \\
\vdots \\
x_{10} \\
1 \\
\end{array}\right)
$$

We can absorb our $b_i$ vectors into a bigger matrix $A_{i,j}$ with the understanding that we concatenate a $1$ at the end of the $x_j$ vector.
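The equivalence above can be checked numerically. This is a small sketch using the same 5×10 shapes as the equations; the variable names `A` and `x_with_1` are chosen here for illustration, and the values are random.

```python
import numpy as np

rng = np.random.default_rng(2024)

a = rng.normal(0, 1, (5, 10))   # matrix of parameters a_{i,j}
b = rng.normal(0, 1, (5,))      # vector of parameters b_i
x = rng.normal(0, 1, (10,))     # input vector x_j

# append b as an extra column of the matrix and 1 as an extra element of x
A = np.column_stack([a, b])     # shape (5, 11)
x_with_1 = np.append(x, 1)      # shape (11,)

separate = a @ x + b
absorbed = A @ x_with_1

# the two results should agree up to floating-point rounding
print(np.allclose(separate, absorbed))
```

The last column of `A` is always multiplied by the appended `1`, so it contributes exactly `b` to the result.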

<br><br><br><br><br>

## What's so special about this linear-nonlinear sandwich?