# Machine Learning at CoDaS-HEP 2024, Lesson 4: Survey of Architectures

In lesson 1, I introduced neural networks and the universal function approximation theorem. A single hidden layer implements _adaptive_ basis functions, more flexible than classic Taylor and Fourier series.

In lesson 2, we talked about issues involved in any fitting procedure, whether multilayered or not (i.e. a pure linear fit).

Lesson 3 was an open-ended project to build your own neural network.

In lesson 4, we will consider a variety of neural network "architectures": ways of building networks to improve learning for different types of problems.

<br><br><br><br><br>

## Why should learning be "deep"?

**Deep learning:** a neural network with 3 or more layers.

<img src="../img/rise-of-deep-learning.svg" width="800">

* 2006‒2007: solved problems in training algorithms that _prevented_ deep learning.
* 2012: AlexNet, a GPU-enabled 8-layer network with ReLU, won the ImageNet competition.
* 2015: ResNet, a GPU-enabled 152+ layer network with skip-connections, won the ImageNet competition.

By 2015, it was clear that networks with many layers have more potential than one big hidden layer.

<br><br><br><br><br>

Why does it work?

One big hidden layer can approximate any shape, by optimizing adaptive basis functions, but

> Shallow networks are very good at memorization, but not so good at generalization.

That is, they have a tendency to overfit.

Why are multiple layers better?

<br><br><br><br><br>

Adaptive basis function:

$$ \psi(x; a, b) = \left\{\begin{array}{c l}
a + b x & \mbox{if } a + b x > 0 \\
0 & \mbox{otherwise} \\
\end{array}\right. $$

Function approximation with one hidden layer:

$$ f_j(x) = \sum_i^{N_1} \psi(x; a_{ij}, b_{ij}) c_{ij} $$
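As a minimal sketch of the two formulas above, the basis function and the one-hidden-layer sum can be written in a few lines of NumPy. For simplicity this assumes a single output (dropping the index $j$), and the parameter values are hypothetical, chosen by hand rather than fitted.

```python
import numpy as np

def psi(x, a, b):
    # Adaptive basis function: a + b*x where that is positive, 0 otherwise (a ReLU).
    return np.maximum(0.0, a + b * x)

def one_hidden_layer(x, a, b, c):
    # x has shape (n_points,); a, b, c have shape (N1,), one entry per hidden neuron.
    # Evaluate every basis function at every x, then take the weighted sum over neurons.
    return psi(x[:, np.newaxis], a, b) @ c

# Hypothetical parameters for N1 = 3 neurons (hand-picked, not the result of a fit).
a = np.array([0.0, -1.0, -2.0])
b = np.array([1.0, 1.0, 1.0])
c = np.array([1.0, -2.0, 2.0])

x = np.linspace(-1, 4, 6)            # the points -1, 0, 1, 2, 3, 4
y = one_hidden_layer(x, a, b, c)     # a zigzag: each neuron contributes one kink
```

Each neuron turns on at its own threshold, so three neurons give a piecewise-linear curve with three kinks. In a real fit, the optimizer would choose `a`, `b`, and `c`.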

Function approximation with two hidden layers:

$$ f_k(x) = \sum_j^{N_2} \psi\left(x; \left[
\sum_i^{N_1} \psi(x; a_{i1}, b_{i1}) c_{i1}
\right], \left[
\sum_i^{N_1} \psi(x; a_{i2}, b_{i2}) c_{i2}
\right]\right) \left[
\sum_i^{N_1} \psi(x; a_{i3}, b_{i3}) c_{i3}
\right] $$

And so on: adaptively adaptive basis functions, then adaptively adaptively adaptive basis functions...

<br><br><br><br><br>

* More neurons in a single layer add wiggles to a fit function.
* More layers effectively fold the space in which a wiggly function can fit the data. Instead of finding individual wiggles, they find symmetries in the data that (probably) correspond to the underlying relationship, rather than noise.

Consider this horseshoe-shaped decision boundary: with two well-chosen folds along the symmetries, it reduces to a simpler curve to fit. Instead of 4 ad-hoc wiggles, it's 2 folds and 1 wiggle.

<img src="../img/deep-learning-by-space-folding.svg" width="800">

Montúfar, Pascanu, Cho, & Bengio, [_On the Number of Linear Regions of Deep Neural Networks_](https://arxiv.org/abs/1402.1869) (2014).
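The folding idea can be illustrated with a one-dimensional toy, sketched below. This is not the construction from the paper or the figure, only an assumed minimal example: two ReLU neurons summed together compute $|x|$, which folds the line at $x = 0$, and a single "wiggle" applied to the folded coordinate then appears on both sides of the fold.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def fold(x):
    # Two ReLU neurons, summed: relu(x) + relu(-x) == |x|.
    # This maps x and -x to the same point, folding the line at x = 0.
    return relu(x) + relu(-x)

def wiggle(u):
    # One more ReLU on the folded coordinate: a single kink at u = 1.
    return relu(u - 1.0)

x = np.linspace(-3, 3, 7)   # the points -3, -2, ..., 3
y = wiggle(fold(x))         # kinks at both x = -1 and x = +1
```

The second layer only had to describe one kink, but the output has two, mirrored about the fold. If the data really are symmetric, the network describes them with fewer independent pieces.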

<br><br><br><br><br>

Roy Keyes's fantastic demo ([with code](https://gist.github.com/jpivarski/f99371614ecaa48ace90a6025d430247)):

<img src="../img/network-layer-space-folding.png" width="800">

A uniform grid on the feature space (left; grid not shown) projected through the first layer's transformation shows what the underlying space looks like (right; grid is gray) before the second layer makes a linear decision boundary.
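The mechanics of projecting a grid through a layer can be sketched as follows. This is not the code from the linked demo: the weights here are hypothetical hand-picked values (in the demo they would come from training), and the choice of `tanh` as the activation is an assumption made for illustration.

```python
import numpy as np

# Hypothetical first-layer parameters (2 inputs -> 2 neurons), not trained values.
W = np.array([[1.0, -1.0],
              [1.0,  1.0]])
b = np.array([0.0, -0.5])

# Uniform grid on the 2D feature space.
xs, ys = np.meshgrid(np.linspace(-1, 1, 21), np.linspace(-1, 1, 21))
grid = np.stack([xs.ravel(), ys.ravel()], axis=1)   # shape (441, 2)

# The first layer's transformation: a linear map followed by a nonlinearity.
# Plotting `hidden` instead of `grid` shows the warped space the next layer sees.
hidden = np.tanh(grid @ W.T + b)                    # shape (441, 2)
```

Drawing the rows and columns of `hidden` as lines would give a distorted grid like the gray one in the figure. A straight decision boundary in that distorted space corresponds to a curved one in the original features.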

<br><br><br><br><br>

This is our first architecture:

<img src="../img/artificial-neural-network-layers-2.svg" width="700">

Neurons in a layer add wiggles to the fitted function; layers add reflections and symmetries that are (probably) real structure.
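In PyTorch, a network of this kind might be sketched as a stack of linear layers separated by activation functions. The sizes below (2 input features, 5 neurons per hidden layer, 1 output) are assumptions chosen for illustration, not values taken from the figure.

```python
import torch
from torch import nn

# Hypothetical sizes: 2 inputs, two hidden layers of 5 neurons each, 1 output.
model = nn.Sequential(
    nn.Linear(2, 5),   # first hidden layer: 5 adaptive basis functions of the inputs
    nn.ReLU(),
    nn.Linear(5, 5),   # second hidden layer: basis functions of the first layer's outputs
    nn.ReLU(),
    nn.Linear(5, 1),   # output layer: linear combination of the last hidden layer
)

x = torch.rand(10, 2)  # 10 random points, only to check that the shapes line up
y = model(x)           # shape (10, 1)

# Count the fitted parameters (weights and biases) across all layers.
num_parameters = sum(p.numel() for p in model.parameters())
```

Widening a layer means increasing the `5`s; deepening the network means adding more `nn.Linear`/`nn.ReLU` pairs.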

<br><br><br><br><br>

## Autoencoders

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import sklearn.datasets
import torch
from torch import nn
from torch import optim
```

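```python
# Download the jet high-level-features dataset from OpenML (cached after the first call).
hls4ml_lhc_jets_hlf = sklearn.datasets.fetch_openml("hls4ml_lhc_jets_hlf")

# Features as 32-bit floats; categorical targets as integer category codes.
features = torch.tensor(hls4ml_lhc_jets_hlf["data"].values).float()
targets = torch.tensor(hls4ml_lhc_jets_hlf["target"].cat.codes.values).long()
```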