In [1]:
import numpy as np
import matplotlib.pyplot as plt

from PIL import Image

import torch
import torchvision
import torchvision.transforms as transforms

import torch.nn as nn
import torch.nn.functional as F

# Lecture 5: Deep Learning

* Convolutional networks and the Imagenet competition
* Dropout, Batch Normalization
* Why do deep networks generalize well?
    * Deep networks easily fit random labels.
    * Implicit regularization and memorization.

_Recommended readings and watching_:

* [Deep Learning](https://www.deeplearningbook.org/) book by Ian Goodfellow, Yoshua Bengio and Aaron Courville
* [Understanding deep learning](https://udlbook.github.io/udlbook/) by Simon J.D. Prince
* [Dive into Deep Learning](https://d2l.ai/), an interactive deep learning book with code, math, and discussions
* You may want to have a look at Andrej Karpathy's [Recipe for Training Neural Networks](https://karpathy.github.io/2019/04/25/recipe/)
* Karpathy's excellent youtube playlist [Neural Networks: Zero to Hero](https://www.youtube.com/watch?v=VMj-3S1tku0&list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ)
* [Awesome Deep Phenomena](https://github.com/MinghuiChen43/awesome-deep-phenomena), a curated list of papers presenting interesting empirical studies and insights on deep learning.

## <center> The mechanics of deep representations </center>

<center><img src="figs/nucleation_resnet.png" width=700/></center>

Modified from [Hierarchical nucleation in deep neural networks](https://proceedings.neurips.cc/paper_files/paper/2020/hash/54f3bc04830d762a3b56a789b6ff62df-Abstract.html) (NeurIPS, 2020)

## Deep learning is _almost_ all about going down gradients

Library we will be using: <img src="figs/pytorch_logo.png" width=80 />

Install pytorch from https://pytorch.org/:

<center><img src="figs/pytorch_install.png" width=300 /></center>

and go through the tutorial [Getting Started with PyTorch](https://docs.pytorch.org/tutorials/beginner/basics/intro.html) to get ready for the exercises.

## A first look at two popular datasets:

### CIFAR:
https://app.activeloop.ai/activeloop/cifar10-train/firstdbf9474d461a19e9333c2fd19b46115348f

### Imagenet:
https://app.activeloop.ai/activeloop/imagenet-train/firstdbf9474d461a19e9333c2fd19b46115348f

# <center> Convolutional networks </center>

* Units in a convolutional layer are organized in **feature maps**.
* Each unit in a feature map is connected to local patches in the feature maps of the previous layer through a **common** set of weights called a **filter bank** (**weight sharing**). Different feature maps in a layer use different filter banks.

  The result of this local weighted sum is then passed through a non-linearity such as a ReLU.
  
* **(max) pooling**: Downsample each feature map (not adaptive).

  Each pooling unit computes the max/avg of a local patch in a feature map

  This induces (approximate) translational invariance.
* Repeat many times...
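
The building block described above can be sketched in a few lines of PyTorch. This is only an illustration: the channel counts, kernel size and input size are assumptions chosen to match CIFAR-sized images, not values taken from a specific architecture.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# one convolution -> ReLU -> max-pooling stage (all sizes are illustrative assumptions)
stage = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1),  # 8 feature maps, shared 3x3 filters
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),  # halves the spatial resolution of each feature map
)

x = torch.randn(4, 3, 32, 32)   # a batch of 4 RGB images of size 32x32
y = stage(x)
print(y.shape)                  # torch.Size([4, 8, 16, 16])

# weight sharing: the parameter count does not depend on the image size
n_params = sum(p.numel() for p in stage.parameters())
print(n_params)                 # 8 * (3*3*3) weights + 8 biases = 224
```

Stacking several such stages gives the "repeat many times" structure of a convolutional network.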

# <center> A bit of history </center>

### The _ancient history_ of deep learning: Fukushima's neocognitron

<center><img src="figs/neocognitron.png" width=700 /></center>

### The _not so ancient history_ of deep learning: LeNet

<center><img src="figs/lenet.svg" width=700 /></center>

## Convolution layer

In one dimension, consider an input $x_j$ and a one-dimensional feature array $y_i$. In a fully connected (FC) layer we have: $y_i =h\left(\sum_j w_{ij}x_j\right)$. With **weight sharing**:
$$y_i = h\left(\sum_{j=i-K}^{i+K} w_{j-i} x_j\right)$$
So $x_j$ and $y_i$ have the same connection as $x_{j+a}$ and $y_{i+a}$ (namely $w_{j-i}$).
##### Have a look at: [class torch.nn.Conv1d](https://docs.pytorch.org/docs/stable/generated/torch.nn.Conv1d.html)
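
As a quick check of the weight-sharing formula, the explicit sum can be compared with `F.conv1d`. This is a minimal sketch under assumptions: the values of `K` and `n` are arbitrary, $h$ is taken to be a ReLU, and inputs outside the array are treated as zero (which is what `padding=K` does).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
K = 2                          # half-width of the filter (assumed value)
n = 10                         # input length (assumed value)
x = torch.randn(n)
w = torch.randn(2 * K + 1)     # w[k] stores w_{j-i} with k = j - i + K

# explicit sum, treating x_j as zero outside the array
y_manual = torch.zeros(n)
for i in range(n):
    for j in range(i - K, i + K + 1):
        if 0 <= j < n:
            y_manual[i] += w[j - i + K] * x[j]
y_manual = torch.relu(y_manual)            # h = ReLU (assumption)

# same computation with the built-in op:
# input has shape (batch, channels, length), weight has shape (out, in, width)
y_torch = torch.relu(F.conv1d(x.view(1, 1, n), w.view(1, 1, -1), padding=K)).view(n)

print(torch.allclose(y_manual, y_torch, atol=1e-5))
```

Note that `F.conv1d` does not flip the filter, so it matches the formula above directly.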

In two dimensions, for an input image $x_{j,j'}$ and a two-dimensional feature array:
$$y_{i,i'} = h\left(\sum_{j=i-K}^{i+K}\sum_{j'=i'-K}^{i'+K} w_{j-i,j'-i'}x_{j,j'}\right)$$

##### Have a look at: [class torch.nn.Conv2d](https://docs.pytorch.org/docs/stable/generated/torch.nn.Conv2d.html)

# Working with dataloaders, images and convolutions in pytorch

In [2]:
# define a simple imshow procedure
def show_images(img):
    img = img / 2 + 0.5     # unnormalize
    npimg = img.numpy()
    plt.imshow(np.transpose(npimg, (1, 2, 0)))
    plt.show()

# load CIFAR dataset

# write your own data folder here and set download = True in the torchvision dataset
data_folder = '/home/ai/repos/multiple_readout/data/'

batch_size = 8

# define transforms and dataloaders
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

trainset = torchvision.datasets.CIFAR10(root=data_folder, train=True,
                                        download=False, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size,
                                          shuffle=True, num_workers=2)

testset = torchvision.datasets.CIFAR10(root=data_folder, train=False,
                                       download=False, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size,
                                         shuffle=False, num_workers=2)

classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

Have a look at some images in the dataset

In [None]:
# plt.figure(figsize=(12,4))
images1, labels1 = next(iter(trainloader))
show_images(torchvision.utils.make_grid(images1))
print("labels:")
print(' '.join(f'{classes[labels1[j]]}' for j in range(batch_size)))

In [None]:
images1.shape

In [None]:
# visualize the red - green - blue content of a single image
plt.figure(figsize=(10,4))
cmaps = ["Reds", "Greens", "Blues"]
plt.subplot(1,4,1)
plt.imshow(np.transpose(images1[0]*0.5+0.5,(1,2,0)))
for ic in range(3):
    plt.subplot(1,4,ic+2)
    plt.imshow(images1[0,ic]*0.5+0.5, cmap=cmaps[ic])
plt.tight_layout();

## Convolution with fixed filters: an example with edge detection

#### A torch rendition of the wiki article [Sobel operator](https://en.wikipedia.org/wiki/Sobel_operator)

In [None]:
# read from file and transform into torch tensor
image = transforms.ToTensor()(Image.open("figs/bikesgray.jpg"))

# show image (first dimension is its unique color channel)
plt.imshow(image[0].numpy(), cmap="gray");

Define the Sobel kernels and use torch to compute convolution

In [None]:
# Gx (horizontal gradient)
Gx = torch.tensor([
    [-1., 0., 1.],
    [-2., 0., 2.],
    [-1., 0., 1.]
])

# Gy (vertical gradient)
Gy = torch.tensor([
    [-1., -2., -1.],
    [ 0.,  0.,  0.],
    [ 1.,  2.,  1.]
])

# stack the two filters and unsqueeze
weights = torch.stack([Gx, Gy])
weights = torch.unsqueeze(weights, 1) # add the input-channel dimension: conv2d expects weights of shape (out_channels, in_channels, kH, kW)

# Apply convolution using the fixed weights
filtered_image = F.conv2d(image, weight=weights, padding=1) # padding = 1 is used to keep dimension equal (size of kernel is 3)

Have a look at the filtered images

In [None]:
plt.figure(figsize=(12,4))
plt.subplot(131)
plt.imshow(filtered_image[0].numpy(), cmap='gray');
plt.subplot(132)
plt.imshow(filtered_image[1].numpy(), cmap='gray');
plt.subplot(133)
# gradient magnitude sqrt(Gx^2 + Gy^2): sum over the two filters before taking the square root
plt.imshow(torch.sqrt((filtered_image**2).sum(0)).numpy(), cmap='gray');

plt.tight_layout();

#### A simpler example with a synthetic image

In [None]:
size = 100
img = torch.zeros(1, size, size, dtype=torch.float32)

start = size // 4
end = 3 * size // 4

img[0, start:end, start + 4:start + 8] = 1.0
img[0, start + 20:start + 24, start:end] = 1.0

plt.imshow(img[0], cmap='gray');

In [None]:
filtered_image = F.conv2d(img, weight=weights, padding=1)

plt.figure(figsize=(12,4))
plt.subplot(131)
plt.imshow(filtered_image[0].numpy(), cmap='gray', origin="lower");
plt.subplot(132)
plt.imshow(filtered_image[1].numpy(), cmap='gray', origin="lower");
plt.subplot(133)
# gradient magnitude sqrt(Gx^2 + Gy^2): sum over the two filters before taking the square root
plt.imshow(torch.sqrt((filtered_image**2).sum(0)).numpy(), cmap='gray', origin="lower");

plt.tight_layout();

Let's look at a horizontal section of the image:

In [None]:
plt.plot(img[0,start]);
plt.plot(filtered_image[0,start]);

#### Food for thought: taking derivatives through integrals

In [None]:
epss = np.linspace(0.1, 0.01, 5)

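def heat(x, eps):
    # Gaussian (heat kernel) of variance eps; it approaches a Dirac delta as eps -> 0
    return np.exp(-0.5 * x**2 / eps) / np.sqrt(2*np.pi*eps)

def diffheat(x, eps):
    # minus the x-derivative of heat(x, eps)
    return x/eps * np.exp(-0.5 * x**2 / eps) / np.sqrt(2*np.pi*eps)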

xs = np.linspace(-3, 3., 500)

for iep, eps in enumerate(epss):
    plt.plot(xs, heat(xs, eps), alpha=0.3, color='black', label="heat" if iep==0 else None)
    plt.plot(xs, diffheat(xs, eps), color='red', label="diff heat" if iep==0 else None)
plt.legend();

## The rise of Convolutional networks: Alexnet and the Imagenet 2012 competition

The ImageNet Large Scale Visual Recognition Challenge (**ILSVRC**) is an annual computer vision competition. The goal is to classify images into 1000 classes. It uses 1.2 million training images, 50,000 validation images, and 150,000 testing images.

<img src="figs/imagenet_labels.png" />

## Alexnet

<img src="figs/imagenet_architecture.png" />

## Alexnet in detail #1

The first 5 layers of weights are convolutional. The last three layers are fully connected.

The first convolutional layer filters the $224\times 224\times 3$ input image with 96 kernels (feature maps) of size $11\times 11\times 3$ with a stride of 4 pixels (this is the distance between the receptive field centers of neighboring neurons in a kernel map).

The second convolutional layer takes as input the (response-normalized and pooled - See [paper](https://papers.nips.cc/paper_files/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html) for details) output of the first convolutional layer and filters it with 256 kernels of size $5\times 5 \times 48$.

The third convolutional layer has 384 kernels of size $3\times 3\times 256$.
The fourth convolutional layer has 384 kernels of size $3\times 3 \times 192$.
The fifth convolutional layer has 256 kernels of size $3\times 3 \times 192$.
(There are no pooling or normalization layers between the third, fourth, and fifth convolutional layers.)

In layers 2, 4, and 5, kernel computation on each GPU takes input only from its own GPU.

## Alexnet in detail #2

The fully-connected layers each have 4096 neurons.

The output of the last fully-connected layer is fed to a 1000-way softmax which produces a distribution over the 1000 class labels.

All non-linearities are **ReLU**.

The network is trained to maximize the **multinomial logistic regression** objective.

It has **60 million parameters** and 650,000 neurons.

Pre-trained available on [pytorch hub](https://pytorch.org/hub/pytorch_vision_alexnet/).
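
The quoted parameter count can be roughly reproduced from the kernel sizes listed above. The sketch below is a back-of-the-envelope count, not the original implementation: it assumes one bias per kernel or output unit, and it assumes that the first fully-connected layer receives a $6\times 6\times 256$ input, a value taken from the paper rather than from these notes.

```python
# (number of kernels, kernel height, kernel width, input depth seen by each kernel)
conv_layers = [
    (96, 11, 11, 3),
    (256, 5, 5, 48),
    (384, 3, 3, 256),
    (384, 3, 3, 192),
    (256, 3, 3, 192),
]
# (inputs, outputs); the 6*6*256 input size of the first FC layer is an assumption from the paper
fc_layers = [(6 * 6 * 256, 4096), (4096, 4096), (4096, 1000)]

conv_params = sum(n * h * w * d + n for n, h, w, d in conv_layers)  # weights + biases
fc_params = sum(i * o + o for i, o in fc_layers)                    # weights + biases
total_params = conv_params + fc_params
print(conv_params, fc_params, total_params)   # 2334080 58631144 60965224
```

Most of the roughly 60 million parameters sit in the fully-connected layers, while the convolutional layers account for only a few percent of the total.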

## Alexnet in detail #3

**Data augmentation**:
During training, each $256\times 256$ image is replaced by 2048 images of size $224\times 224$ (shifting and reflection).
At test time, the output is computed for 5 patches of size $224\times 224$ (four corners and center) and their horizontal reflections (10 patches in total), and the results are averaged.
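
The factor 2048 can be recovered by simple counting. The sketch assumes 32 offsets per axis combined with an optional horizontal reflection; this is an assumption about how the figure is obtained, not something stated above.

```python
full, crop = 256, 224

shifts_per_axis = full - crop            # 32 offsets along each axis (assumed counting convention)
n_variants = shifts_per_axis ** 2 * 2    # 2D translations times horizontal reflection
print(n_variants)                        # 2048
```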

**Dropout** is used in the first two fully-connected layers. Without dropout, there is substantial overfitting. Dropout roughly doubles the number of iterations required to converge.

## Convolutional filters found by Alexnet

Filters resulting from training:

<center><img src="figs/filters_imagenet_architecture.png" /></center>

##### Have a look at [Gabor](https://en.wikipedia.org/wiki/Gabor_filter) filters

## <center> (Deep) learning is all about generalization </center>

#### How can we understand the good generalization performance of deep learning?

Traditionally, good generalization is explained by:
* The structure in the data matches the structure in the model.
* Sufficient data.

Deep networks can essentially learn anything and often have insufficient data. How can we understand their good generalization performance?

Here we look at:
* Dropout, regularization, data augmentation, batch normalization.
* Implicit regularization of GD and memorization.

## Dropout

Combining the predictions of many different models is a very successful way to reduce test errors.

'Dropout' mimics **model averaging** by setting the output of each hidden neuron to zero with a probability of 0.5. The neurons that are dropped out in this way do not contribute to the forward pass and do not participate in backpropagation. Every time an input is presented, the neural network samples a different architecture, but all these architectures share weights.

This technique reduces **complex co-adaptations** of neurons, since a neuron cannot rely on the presence of particular other neurons. It is, therefore, forced to learn **more robust features** that are useful in conjunction with many different random subsets of the other neurons.

At test time, you can either:
* For each test pattern, average the output for random dropout realisations.
* Use all the neurons but multiply their outputs by 0.5.

##### Have a look at [torch.nn.Dropout](https://docs.pytorch.org/docs/stable/generated/torch.nn.Dropout.html) and the original [paper](https://jmlr.org/papers/v15/srivastava14a.html)
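
A small experiment with `nn.Dropout` shows the two regimes. Note that PyTorch rescales the surviving activations by $1/(1-p)$ during training instead of multiplying the outputs by $1-p$ at test time; the two conventions give the same expected activity. The input of ones and its size are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(10000)

drop.train()
y_train = drop(x)        # each entry is zeroed with probability p, survivors are scaled by 1/(1-p)

drop.eval()
y_eval = drop(x)         # dropout is the identity at test time

print(y_train.unique())              # only the values 0 and 2 appear
print(y_train.mean().item())         # close to 1: the expected activity is preserved
print(torch.equal(y_eval, x))        # True
```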

## Batch normalization (BN)

A $T$-layer network:
$$y_k = h\left(\sum_{j_{T-1}} w^{(T)}_{k j_{T-1}}h\left(\sum_{j_{T-2}} w^{(T-1)}_{j_{T-1}j_{T-2}}h\left(\ldots h\left(\sum_{j_0}w_{j_1j_0}^{(1)} x_{j_0}\right)\ldots\right)\right)\right)$$
with $h$ a non-linear activation function.
Or,
$$x^{(t)}=w^{(t)} a^{(t)} \qquad a^{(t+1)}=h\left(x^{(t)}\right) \qquad t=1,\ldots, T$$
with $x^{(t)}, a^{(t)}$ being the vectors of summed input and output neural activity, and $w^{(t)}$ being the $t$-th layer weight matrix.

**Vanishing gradient problem:**
Empirically, it's observed that $w^{(t)}$ tends to grow as a result of learning. Then $h\left(x^{(t)}\right)$ saturates.
The gradient $h'\left(x^{(t)}\right)\to 0$, which slows down learning.
This problem is largest in early layers of deep networks.
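
The saturation effect can be illustrated numerically. This is a toy sketch, assuming $h=\tanh$, random Gaussian weights and activities, and arbitrary layer sizes: as the weights are scaled up, the summed inputs grow and the average slope $h'(x)=1-\tanh^2(x)$ shrinks.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(100)               # activity of the previous layer (toy data)
w = rng.standard_normal((50, 100)) / 10    # weight matrix with entries of typical size 0.1

mean_slopes = []
for scale in [1, 5, 25]:
    x = (scale * w) @ a                    # summed input grows with the weight scale
    slope = 1 - np.tanh(x) ** 2            # h'(x) for h = tanh
    mean_slopes.append(slope.mean())
    print(scale, slope.mean())
```

The printed mean slope decreases with the weight scale, so the gradient signal passed through such a layer is attenuated.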

**Co-dependence of learning in different layers:**
Since $x^{(t)}=w^{(t)}h\left(w^{(t-1)} a^{(t-1)}\right)$, the optimal value of $w^{(t)}$ depends on the optimal values of $w^{(t-1)}$.

## Batch normalization and internal covariate shift

Batch normalization addresses the issue of _Internal Covariate Shift_ by normalizing the inputs to each layer within a mini-batch.

<center><img src="figs/batch_normalization_algorithm.png" width=400></center>

##### Have a look at [torch.nn.BatchNorm2d](https://docs.pytorch.org/docs/stable/generated/torch.nn.BatchNorm2d.html) and the original [paper](https://proceedings.mlr.press/v37/ioffe15.html)
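
The training-mode computation of `nn.BatchNorm2d` can be written out by hand and compared with the layer. This is a minimal sketch with an arbitrary random mini-batch; it only covers the normalization step, with $\gamma$ and $\beta$ at their initial values.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(16, 3, 8, 8) * 4 + 2          # a mini-batch (N, C, H, W) with arbitrary scale and offset

bn = nn.BatchNorm2d(3)                        # gamma = 1 and beta = 0 at initialization
bn.train()
y_torch = bn(x)

# the same normalization written out: statistics per channel, over batch and spatial dimensions
mu = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)   # the biased variance is used for normalizing
y_manual = (x - mu) / torch.sqrt(var + bn.eps)
y_manual = bn.weight.view(1, -1, 1, 1) * y_manual + bn.bias.view(1, -1, 1, 1)

print(torch.allclose(y_torch, y_manual, atol=1e-5))
```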

# BatchNorm in pytorch

Let us extract a new batch of images so that we can test the update of running statistics in the BatchNorm layer

In [13]:
images2, labels2 = next(iter(trainloader))

In [None]:
images1.shape, images2.shape

We will look at the effect of BN after a fully connected layer (without bias for simplicity)

In [15]:
in_features = int(np.prod(images1.shape[-2:]))  # number of pixels per channel, as a plain Python int
out_features = 200
fc = nn.Linear(in_features, out_features, bias=False)

In [None]:
fc.weight.shape

In [19]:
# flatten the image before passing to the FC layer
output_fc1 = fc(images1.view(-1, 3, in_features))
output_fc2 = fc(images2.view(-1, 3, in_features))

In [None]:
# note that channels are computed in parallel
output_fc1.shape, output_fc2.shape

Let us implement the Linear layer ourselves and check

In [None]:
output_fc1_mine = torch.einsum('nci,ki->nck', images1.view(-1, 3, in_features), fc.weight)
output_fc2_mine = torch.einsum('nci,ki->nck', images2.view(-1, 3, in_features), fc.weight)

torch.allclose(output_fc1, output_fc1_mine),\
torch.allclose(output_fc2, output_fc2_mine)

Let's define a BatchNorm object

In [None]:
momentum = 1.
m = nn.BatchNorm1d(3, eps=1e-100, momentum=momentum, affine=True, track_running_stats=True)

The running average is performed using

$$\hat{x}_{\text{new}}=\left(1-\text{momentum}\right)\times\hat{x}_{\text{old}}+\text{momentum}\times x_{t}$$

where $\hat{x}$ stands for the estimate of average and variance.

As they note in Dive into Deep Learning:

`This is somewhat of a misnomer as it has nothing whatsoever to do with the momentum term of optimization`

and it also acts in the opposite way to what you would expect from momentum: a larger value gives more weight to the newest batch statistics!

In [None]:
m.train() # this is needed in order to track the running stats and allow learning of the affine transformation

Pass the two FC outputs through the BN layer and look at the resulting shape

In [None]:
output_bn1 = m(output_fc1)
output_bn2 = m(output_fc2)

output_bn1.shape

Have a look at how the distribution changes

In [None]:
plt.hist(output_fc1.detach().numpy().flatten(), bins=100, density=True, histtype="step", label="after fc");
plt.hist(output_bn1.detach().numpy().flatten(), bins=100, density=True, histtype="step", label="after bn");
plt.legend();

Check the new batch average and variance

In [None]:
output_bn1.mean([0,2]), output_bn1.var([0,2], correction=0) ,\
output_bn2.mean([0,2]), output_bn2.var([0,2], correction=0)

Look at the running stats stored inside BN

In [None]:
m.running_mean, m.running_var

and compute them as a check

In [None]:
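# running_mean starts at 0 and running_var at 1; the running variance uses the unbiased estimator
(1 - momentum) * momentum * output_fc1.mean([0,2]) + momentum * output_fc2.mean([0,2]), \
(1 - momentum)**2 + (1 - momentum) * momentum * output_fc1.var([0,2]) + momentum * output_fc2.var([0,2])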

Instantiate random outputs to backprop through BN and look at the resulting gradients over $\gamma$ and $\beta$ parameters

In [None]:
yhat1 = torch.tanh(output_bn1).sum((1,2))
yhat2 = torch.tanh(output_bn2).sum((1,2))

y1 = torch.randn(output_bn1.shape[0])
y2 = torch.randn(output_bn2.shape[0])

loss = ((yhat1 - y1)**2).mean() + ((yhat2 - y2)**2).mean()
loss.backward()

for name, p in m.named_parameters():
    print(name, p)
    print(p.grad)

Now let's run it at evaluation time on a different set of images

In [None]:
images3, labels3 = next(iter(trainloader))

m.eval()
with torch.no_grad():
    output_fc3 = fc(images3.view(-1, 3, in_features))
    output_bn3 = m(output_fc3)

Note that eval() stops running stats from being updated:

In [None]:
m.running_mean, m.running_var

Check outputs of BN with what you would expect

In [None]:
((output_fc3 - m.running_mean[None,:,None]) / torch.sqrt(m.running_var[None,:,None])).mean((0,2)),\
output_bn3.mean((0,2))

In [None]:
((output_fc3 - m.running_mean[None,:,None]) / torch.sqrt(m.running_var[None,:,None])).var((0,2)),\
output_bn3.var((0,2))

## Batch normalization accelerates training - exhibit #1

BN accelerates training and stabilizes activity distributions.

<center><img src="figs/batch_normalization_mnist.png" width=600></center>

## Batch normalization accelerates training - exhibit #2

BN-Inception architecture achieves better accuracy faster with:
* no dropout
* larger learning rate (x5 or x30)
* faster learning rate decay
* smaller regularization 

<center><img src="figs/batch_normalization_inception.png" width=700></center>

## Deep neural networks easily fit random labels

<center><img src="figs/zhang_random_labels.png" width=600></center>

#### Main conclusions:
* deep networks can easily learn when labels are randomly shuffled
* regularization does not hurt learning with random labels (does not prevent overfitting)
* generalization is good even without regularization (although regularization helps)

##### Have a look at the original [paper](https://arxiv.org/abs/1611.03530) for more details

### Generalization: CIFAR

<center><img src="figs/zhang_table1_cifar.png" width=600></center>

### Generalization: Imagenet

<center><img src="figs/zhang_table2_imagenet.png" width=600></center>

### Learning dynamics and implicit regularization

<center><img src="figs/zhang_learning_dynamics.png" width=800></center>

#### Main conclusions:
* Data augmentation may yield significantly better generalization.
* Early stopping works in some cases (Imagenet) but not in others (CIFAR).
* Batch normalization (BN) may significantly improve convergence and generalization on CIFAR10.

### Implicit regularization of gradient descent

When the number of samples $N$ is smaller than the number of parameters in the network and the network is trained with (Stochastic) Gradient Descent (SGD), one observes **memorization**: the network explicitly represents the training data in some form.

We illustrate this for the problem of linear regression
$$y=\sum_{i=1}^d w_i x_i$$

The problem can be easily generalized to higher-dimensional outputs, non-linear outputs, and more general cost functions: take $y_j^\mu =\phi\left(\sum_{i=1}^d w_{ij} x^{\mu}_i\right)$ with $\phi$ an invertible function (such as a sigmoid), and $L(w)= \sum_\mu F(y^\mu,t^\mu)$ with $F(y,t)\ge 0$ and $F(y,t)=0$ if $y=t$.

### Implicit regularization of gradient descent

Given the data $\{x^{\mu}_i,t^\mu,\mu=1,\ldots,N\}$ and cost:
$$L(w)=\frac{1}{2}\sum_\mu (y^\mu -t^\mu)^2 \qquad y^\mu =\sum_{i=1}^d w_i x^\mu_i$$
A solution $w$ with $L(w)=0$ satisfies:
$$t^\mu=y^\mu =\sum_i w_i x_i^\mu \qquad t = X w$$
with $X_{(N\times d)}=x^\mu_i$ and $t_{(N\times 1)}=t^\mu$.

We have $N$ equations and $d$ unknowns. When $N< d$, there are many solutions.

### Implicit regularization of gradient descent

The gradient is a linear combination of the training data:
$$\frac{\partial L}{\partial w_i}=\sum_\mu (y^\mu -t^\mu) x_i^\mu \qquad \Delta w_i^t = -\eta_t \frac{\partial L}{\partial w_i}$$
Therefore, starting at $w=w_0$, the (S)GD solution is the initial point plus a linear combination of the training data:
$$w_i = w_{0,i}+\sum_{t=1}^T \Delta w_i^t = w_{0,i}+\sum_\mu \alpha^\mu x_i^\mu\qquad w=w_0+X^T \alpha$$
with $\alpha_{(N\times 1)}=\alpha^\mu$. Since also $X w =t$, we obtain:
$$XX^T \alpha =t -X w_0$$
Since $N<d$, $XX^T$ is generically full rank (that is, when the training inputs are linearly independent) and hence invertible.
Thus, gradient descent finds a unique solution:
$$w =w_0+X^T \alpha \qquad \alpha = \left(XX^T\right)^{-1}\left( t -Xw_0\right)$$
but it depends on the initialization $w_0$.
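
The closed-form expression can be checked numerically. The sketch below runs plain (full-batch) gradient descent on a small random problem and compares the result with $w_0+X^T\alpha$; the sizes, learning rate and number of steps are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 20                                   # fewer samples than parameters (toy sizes)
X = rng.standard_normal((N, d))
t = rng.standard_normal(N)
w0 = rng.standard_normal(d)

# plain gradient descent on L(w) = 1/2 * sum_mu (y^mu - t^mu)^2
w = w0.copy()
eta = 0.01
for _ in range(20000):
    w -= eta * X.T @ (X @ w - t)

# closed form: w = w0 + X^T alpha with alpha = (X X^T)^{-1} (t - X w0)
alpha = np.linalg.solve(X @ X.T, t - X @ w0)
w_closed = w0 + X.T @ alpha

print(np.allclose(X @ w, t))          # zero training error
print(np.allclose(w, w_closed))       # GD reaches the predicted solution
```

Changing `w0` changes the solution that gradient descent reaches, while the training error stays at zero.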

### Implicit regularization of gradient descent

When $w_0=0$, the (S)GD solution is also the minimum norm solution:
$$\min_w \frac{1}{2}\|w\|^2 \qquad \text{s.t.} \quad X w = t$$
Introducing Lagrange multipliers, we get:
$$\mathcal{L}=\frac{1}{2}\|w\|^2 +\sum_\mu \lambda^\mu \left( t^\mu - \sum_i w_i x^\mu_i\right)$$
$$\frac{\partial \mathcal{L}}{\partial w_i}= w_i - \sum_\mu \lambda^\mu x^{\mu}_{ i}=0$$
Thus, $w=X^T \lambda$ and $Xw=t$. The solution is identical to the GD solution with $\lambda=\alpha$ and $w_0=0$.

The minimum norm solution is the most _smooth_ solution consistent with training error zero. Therefore, (S)GD on **over-parametrized** problems has an implicit regularization.
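
The minimum-norm property can also be verified numerically. In this sketch (random toy data, arbitrary sizes) the GD solution from $w_0=0$ is compared with the pseudo-inverse solution, and with another interpolating solution obtained by adding a null-space vector of $X$.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 5, 20
X = rng.standard_normal((N, d))
t = rng.standard_normal(N)

# GD solution from w0 = 0: w = X^T (X X^T)^{-1} t
w_gd = X.T @ np.linalg.solve(X @ X.T, t)

# minimum-norm interpolating solution via the pseudo-inverse
w_min = np.linalg.pinv(X) @ t

# another interpolating solution: add any vector from the null space of X
v = rng.standard_normal(d)
v_null = v - np.linalg.pinv(X) @ (X @ v)       # projection of v onto the null space of X
w_other = w_gd + v_null

print(np.allclose(w_gd, w_min))                          # same solution
print(np.allclose(X @ w_other, t))                       # also fits the data exactly
print(np.linalg.norm(w_gd), np.linalg.norm(w_other))     # but has a larger norm
```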

### Gradient descent solution memorizes the training set

With $w_0=0$, the final solution is:
$$y=\sum_i w_i x_i = \sum_{i,\mu} \alpha^\mu x_i^\mu x_i$$
By construction, when $x=x^\mu$, the output $y=t^\mu$.

For $x\approx x^\mu$, the output $y\approx t^\mu$.

The solution 'memorizes' the training set and is expected to give good generalization for data (very) near individual training samples and bad generalization everywhere else.

So the answer to the question "Why do deep neural networks generalize well?" is that they are tested on data near the training samples.

## Memorization in Overparametrized Autoencoders

We can further understand the properties of deep networks by studying iterated autoencoders.

Consider an autoencoder defined by a function $f: \mathbb{R}^d \to \mathbb{R}^d$. After successful training, all training samples satisfy $f(x^\mu)=x^\mu$.

Consider the dynamical system obtained by iterating the autoencoder multiple times: $x_{t+1}=f(x_t)$.
The training examples are **local attractors**, i.e., stable fixed points of the dynamics, if for $x_0$ in a neighborhood of $x^\mu$ we have:
$$x_0 \to x_1=f(x_0) \to x_2 = f(x_1) \to \ldots x_\infty=x^\mu$$

Local stability requires that all eigenvalues $\lambda_i$ of the Jacobian
$$J_{ij}=\frac{\partial f_i}{\partial x_j}(x^\mu)$$
satisfy $|\lambda_i|<1$. If some $|\lambda_i|>1$, the fixed point is unstable.
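
The stability criterion can be tried out on a toy map. The function `f` below is a hypothetical stand-in for a trained autoencoder, constructed by hand so that `x_star` is a fixed point with a contracting Jacobian; it is not a trained network.

```python
import torch

torch.manual_seed(0)
d = 3
x_star = torch.randn(d)                             # plays the role of a training example x^mu
A = torch.randn(d, d)
A = 0.5 * A / torch.linalg.matrix_norm(A, ord=2)    # rescale so that all |eigenvalues| <= 0.5

def f(x):
    # toy map with f(x_star) = x_star and Jacobian A at x_star
    return x_star + A @ torch.tanh(x - x_star)

# Jacobian at the fixed point and its eigenvalues
J = torch.autograd.functional.jacobian(f, x_star)
eigs = torch.linalg.eigvals(J)
print(eigs.abs().max().item())                      # largest |eigenvalue|, below 1

# iterate the map starting in a neighborhood of x_star
x = x_star + 0.1 * torch.randn(d)
for _ in range(50):
    x = f(x)
print(torch.allclose(x, x_star, atol=1e-5))         # the iteration converges to x_star
```

Rescaling `A` so that some eigenvalue has modulus larger than 1 makes the same iteration move away from `x_star`.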

### Memorization in (fully connected) overparametrized Autoencoders - exhibit #1

When $f$ is an overparameterized autoencoder, trained with (S)GD, the training examples are not only fixed points, but also attractors.

<center><img src="figs/autoencoder_swissroll.png" height=2000></center>

### Memorization in (fully connected) overparametrized Autoencoders - exhibit #2

<center><img src="figs/autoencoder_mnist.png" width=1000></center>

### Memorization in (deep convolutional) overparametrized Autoencoders

<center><img src="figs/autoencoder_convolutional.png"></center>

### Memorization and Generalization coexist

**Memorization:** Overparameterized autoencoders exhibit a form of data-dependent self-regularization that encourages solutions that concentrate around the training examples.

<center><img src="figs/autoencoder_robustness.png"></center>

##### Have a look at the [arxiv](https://arxiv.org/abs/1810.10333) and [PNAS](https://www.pnas.org/doi/10.1073/pnas.2005013117) versions.

# <center> Assignments </center>

#### Ex 5.1
Study the performance properties of the convolutional network provided in the introductory [pytorch tutorial](https://docs.pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html).

How is learning affected if instead of ReLU units, tanh() activations are used? What is the reason for this? Compare also at least two different optimizer algorithms.

#### Ex 5.2
Try to outperform the convolutional network of **Ex 5.1** with a multilayer perceptron (MLP) that uses approximately the same number of parameters.
Report your results and explain them.

#### Ex 5.3

Learning convolutions from scratch starting from a fully connected initialization is hard.

Reproduce the emergence of localized receptive fields in a one-hidden-layer fully connected network as described in [this paper](https://www.pnas.org/doi/abs/10.1073/pnas.2201854119). For simplicity, use a 1-dimensional version of the GP and NLGP synthetic data models described in the Methods section. Use a network with $K=30$ hidden neurons, fixed second-layer weights and $\text{erf}(x/\sqrt{2})$ activation function. Feel free to use either a fixed dataset or work in an online learning setting, where new inputs are produced before each gradient step.