# IN3050/4050 - Week 12: Deep Learning Glimpse

## Deep Learning
Deep learning is a big topic, and we can't hope to cover it in any depth here. As an introduction, we will look at some state-of-the-art tools used in both research and industry, in particular PyTorch.

We will also look at one of the biggest problems with deep neural models: the vanishing gradient problem.

### Imports used in the rest of the notebook

In [None]:
import numpy as np
import torch
from torch import nn
import matplotlib
import matplotlib.pyplot as plt

%matplotlib inline

## Part 1) Introduction to PyTorch
[PyTorch](https://pytorch.org/) is a machine learning framework designed specifically for deep learning, but it can be used for all sorts of computations. PyTorch provides highly optimized and accelerated code for most of the operations needed to build state-of-the-art neural networks.

In this course we will only be using PyTorch, but there exist many other deep learning frameworks, e.g. [TensorFlow](https://www.tensorflow.org/), [Flax](https://flax.readthedocs.io/en/latest/), [Caffe](https://caffe.berkeleyvision.org/), and many more.

This [PyTorch tutorial](https://pytorch.org/tutorials/beginner/basics/intro.html) is a useful reference while completing this exercise.

**Note:** We use PyTorch 2.2 for these exercises, but earlier versions should also be fine.

### Simple operations/tensors

#### Tensors
In PyTorch, the basic object is a tensor. Tensors are specialized data structures very similar to vectors and matrices and, consequently, to NumPy arrays. Tensors are used to encode the inputs and outputs of models, as well as the models' parameters. A tensor can be created from a NumPy array and converted back to one. Importantly, PyTorch tensors are optimized for automatic differentiation.

In [None]:
data = np.arange(1, 10).reshape((3, 3))
x = torch.tensor(data)
print("Original tensor:")
print(x)
print(type(x))
print(x.shape)

print("\nConverted to array:")
print(x.numpy())
print(type(x.numpy()))

#### Operations
PyTorch also comes with all of the standard math operations. Again, these look very similar to the ones found in NumPy. At this point you might wonder why we use tensors when they are so similar to NumPy arrays. Later in this exercise we are going to use some of the features of PyTorch tensors that NumPy arrays don't have.
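
As a brief illustration of these operations, the sketch below applies a few elementwise and matrix operations to small tensors. The tensors `u`, `v`, and `M` and their values are made up for this example.

```python
import numpy as np
import torch

# Hypothetical example values, chosen only for illustration.
u = torch.tensor([1.0, 2.0, 3.0])
v = torch.tensor([4.0, 5.0, 6.0])
M = torch.tensor([[1.0, 0.0, 0.0],
                  [0.0, 2.0, 0.0],
                  [0.0, 0.0, 3.0]])

elementwise_sum = u + v    # elementwise addition: 5, 7, 9
elementwise_prod = u * v   # elementwise multiplication: 4, 10, 18
dot = torch.dot(u, v)      # inner product: 1*4 + 2*5 + 3*6 = 32
matvec = M @ u             # matrix-vector product: 1, 4, 9

# The same matrix-vector product in NumPy gives the same numbers.
same_as_numpy = np.allclose(matvec.numpy(), M.numpy() @ u.numpy())

print(elementwise_sum)
print(elementwise_prod)
print(dot)
print(matvec)
print(same_as_numpy)
```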

#### Initialization
At the beginning of training, we often need to randomly initialize the model weights. For example, one can use uniform random values from 0 to 1:

In [None]:
x = torch.empty(4, 6)
random_weights = nn.init.uniform_(x, a=0.0, b=1.0) # Random uniform initializer from PyTorch
random_weights

#### Exercise 1.1)
Use PyTorch tensors to do the following:

1. Create two tensors `x` and `y` with the values `3` and `7`.
2. Multiply the values and assign the result to `z`.
3. Create a matrix `A` with the shape $3 \times 3$ and a column vector `b` with 3 elements using the random uniform initializer with values from $-1$ to $1$.
4. Multiply the matrix and the vector together and assign the result to `c`.
5. Create a Numpy column vector of shape (3, 1) with the values [1, 2, 3] and add these elementwise to `c`.

In [None]:
# 1)
x = None
y = None

# 2)
z = None

# 3)
A = None
b = None

# 4)
c = None
print(c)

# 5) Update c with new values:
d = None

c = c + d
print(c)

#### Exercise 1.2) Activation function
Implement the sigmoid activation function and its derivative using PyTorch. The exponential function is available as `torch.exp(x)`.

Activation:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Gradient:
$$\frac{d}{dx}\sigma(x) = \sigma(x)(1 - \sigma(x))$$
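
For reference, the gradient formula can be obtained by applying the chain rule to the definition of $\sigma$ and then rewriting the result in terms of $\sigma$ itself:

$$\frac{d}{dx}\sigma(x) = \frac{e^{-x}}{(1 + e^{-x})^2} = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} = \sigma(x)(1 - \sigma(x))$$

The last step uses $\frac{e^{-x}}{1 + e^{-x}} = 1 - \frac{1}{1 + e^{-x}} = 1 - \sigma(x)$.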

In [None]:
def my_sigmoid(x):
    """
    Return sigmoid activation of x.
    
    Parameters
    ----------
    x : torch.Tensor
        Tensor to calculate activations for.
    
    Returns
    -------
    a : torch.Tensor
        The activations.
    """
    raise NotImplementedError
    return a

def my_sigmoid_grad(a):
    """
    Returns the gradient of the sigmoid function from the value of the activation.
    
    Parameters
    ----------
    a : torch.Tensor
        Output from my_sigmoid().
    
    Returns
    -------
    grad : torch.Tensor
        Gradient of the sigmoid function.
    """
    raise NotImplementedError
    return grad

We can of course plot the values in PyTorch tensors.

In [None]:
x = torch.tensor(np.linspace(-10, 10, 100))
a = my_sigmoid(x)
grad = my_sigmoid_grad(a)


fig, ax = plt.subplots(1, 2)
ax[0].plot(x, a); ax[0].set_title("Activation")
ax[1].plot(x, grad); ax[1].set_title("Gradient")

### Automatic gradient
PyTorch comes with its own automatic differentiation engine called [Autograd](https://pytorch.org/tutorials/beginner/basics/autogradqs_tutorial.html). Using Autograd, one can get the gradient of a sequence of PyTorch operations without doing any manual differentiation.

#### Exercise 1.3)
Evaluate your function `my_sigmoid()` at all the points of the tensor `x` in the cell below, and use PyTorch automatic differentiation to calculate the gradient of the function at these points. Compare to your own implementation of the gradient.

Check the link to the autograd tutorial above, but the most important points to remember are:
1. You should explicitly tell PyTorch to track gradients for your tensor (`requires_grad=True`)
2. In PyTorch, the actual computation of the gradients happens when you do the backward pass by calling `.backward()` on the last output of your computations. In your case, it will be the original tensor transformed by the sigmoid function. However, PyTorch assumes by default that the backward pass starts from the final *loss* value, which is a scalar, whereas your final output is a vector. To avoid errors, simply convert the final tensor into a scalar before running the backward pass, for example by applying `.sum()` to it.
3. After running the backward pass, the gradient values can be retrieved from the `.grad` property of the respective tensor.
4. Matplotlib can't plot tensors that track gradients directly; convert them to NumPy arrays first (`.detach().numpy()`).
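
The four points above can be sketched on a simpler stand-in function. The example below uses $f(t) = t^2$ rather than the sigmoid, and the tensor `t` and its values are made up for illustration.

```python
import torch

# 1. Ask PyTorch to track gradients for this tensor.
t = torch.tensor([-1.0, 0.0, 2.0], requires_grad=True)

# A simple stand-in function (not the sigmoid): f(t) = t**2.
out = t ** 2

# 2. Reduce the output to a scalar, then run the backward pass.
out.sum().backward()

# 3. The gradient d(sum f)/dt = 2*t is now stored in t.grad: -2, 0, 4.
print(t.grad)

# 4. Detach before converting to NumPy (e.g. for plotting).
t_np = t.detach().numpy()
grad_np = t.grad.numpy()
```

Since the analytical derivative of $t^2$ is $2t$, the printed values are easy to check by hand.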

In [None]:
# Gradients from your own function:
a = my_sigmoid(x)
gradients = my_sigmoid_grad(a)

# Gradients from PyTorch:
pytorch_gradients = None

# Plot of the gradients from my_sigmoid_grad() and Pytorch.
fig, ax = plt.subplots(1, 1)
ax.plot(x, pytorch_gradients, label="Using PyTorch")
ax.plot(x, gradients, '.', label="my_sigmoid_grad")
ax.legend()

If you did everything correctly, the dots should follow the line precisely.

### Layers
PyTorch comes with many implementations of functions useful for deep learning. The `torch.nn` module provides all the building blocks you need to build your own neural network. It is a higher-level API that allows you to create complex models quickly and easily without needing to worry about all the minor details.
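
As a small illustration of one such building block, the sketch below applies a single `nn.Linear` layer to a batch of inputs and reproduces its output by hand. The batch size and layer sizes are arbitrary choices for this example.

```python
import torch
from torch import nn

torch.manual_seed(0)  # make the random initial weights reproducible

layer = nn.Linear(in_features=4, out_features=2)  # weight: (2, 4), bias: (2,)
batch = torch.rand(3, 4)  # 3 hypothetical samples with 4 features each

out = layer(batch)  # shape (3, 2)

# The layer computes an affine map; reproduce it by hand.
manual = batch @ layer.weight.T + layer.bias

print(out.shape)
print(torch.allclose(out, manual, atol=1e-6))
```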


#### Sequential feedforward neural network in PyTorch

There are very few limits on what you can do with PyTorch, but to keep it simple we will restrict our models to "sequential" models. This means that the layers are all stacked on top of each other and there are no loops, skip connections or forks. One layer feeds into the next. These types of models are very easy to define using PyTorch.

Every module in PyTorch subclasses `nn.Module`. A neural network is itself a module that consists of other modules (layers).

Below we define a simple network with an input layer of size 4, two hidden layers of size 2, and an output layer with only one node. All the layers use a simple sigmoid activation function except the last one, where we skip the activation. Notice that we explicitly specify a simple random uniform initializer for the weights of our linear layers (by default PyTorch will use a more advanced initializer, but we want to keep it simple for now).

![Model structure](figures/4_nn.png)

In [None]:
class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear_sigmoid_stack = nn.Sequential(
            nn.Linear(4, 2),
            nn.Sigmoid(),
            nn.Linear(2, 2),
            nn.Sigmoid(),
            nn.Linear(2, 1),
        )
       
    # Forward pass
    def forward(self, x):
        logits = self.linear_sigmoid_stack(x)
        return logits

# Initialization: first check the layer type,
# then apply the desired changes to the weights
def init_uniform(layer):
    if isinstance(layer, nn.Linear):
        nn.init.uniform_(layer.weight)
    
model = NeuralNetwork()
model.apply(init_uniform)
print(model)
nr_params = sum(p.numel() for p in model.parameters())
print(f"Number of parameters in the model: {nr_params}")
print(type(model))
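
The parameter count can be checked by hand: each linear layer has one weight per input-output pair plus one bias per output. The sketch below repeats the layer sizes of the network described above (4, 2, 2, 1); the names `sizes`, `by_hand`, `stack`, and `counted` are introduced only for this example.

```python
from torch import nn

# Same layer sizes as the network above: 4 -> 2 -> 2 -> 1.
sizes = [4, 2, 2, 1]

# Each linear layer has n_in * n_out weights plus n_out biases.
by_hand = sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))

stack = nn.Sequential(
    nn.Linear(4, 2), nn.Sigmoid(),
    nn.Linear(2, 2), nn.Sigmoid(),
    nn.Linear(2, 1),
)
counted = sum(p.numel() for p in stack.parameters())

# (4*2 + 2) + (2*2 + 2) + (2*1 + 1) = 10 + 6 + 3 = 19
print(by_hand, counted)
```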

#### Exercise 1.4)
Use PyTorch to define a class and a function that returns a sequential neural model. The model should have an input size of 2, three hidden layers with two nodes each, and one output layer with two nodes. All layers except the last layer should apply the activation function specified in the function argument. Use the random uniform initializer with values from $-1$ to $1$.

![Model structure](figures/2_nn.png)

In [None]:
class NeuralNetwork(nn.Module):
    def __init__(self, **kwargs):
        raise NotImplementedError
       
    # Forward pass
    def forward(self, x):
        logits = self.linear_stack(x)
        return logits

def build_model(activation=nn.Sigmoid()):
    """
    Return a sequential model.
    
    Parameters
    ----------
    activation : function
        Function from PyTorch activations.
    
    Returns
    -------
    model : PyTorch Sequential model.
    """
    raise NotImplementedError
    return model

## Part 2) Vanishing Gradients
In this section we are going to look at two different ways to cope with the vanishing gradient problem.

- Change of activation function
- "Proper" initialization of weights

There are other techniques that can have a big impact on this problem, but these are the simplest and easiest to start with.

### Activation functions
The sigmoid activation function you implemented above

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

has two properties that make it good for neural networks.
- It's differentiable everywhere.
- Its gradient has a very simple form based on the output of the function.

$$\frac{d}{dx}\sigma(x) = \sigma(x)(1 - \sigma(x))$$

However, there are some issues with this function.

#### Exercise 2.1)
Can you name and explain two problems with this activation function?

**Hint:** What happens when you evaluate the function for high or low values of $x$? And what is the range of the function's output?

#### Exercise 2.2)
Use the function `build_model()` to create a model using the sigmoid activation function.

In the next code block, we run one iteration over a training set and make a "violin" plot of the gradients (the implementation is imported from an external file, feel free to have a look at it). 

What do you see? Give a short description of the plot. Are the results surprising?

You can use your own implementation of the sigmoid from earlier, or you can use PyTorch's sigmoid implementation.

In [None]:
model_sig = build_model(activation=nn.Sigmoid())

from helpers12 import plot_grad_pytorch as plot_grad
fig = plot_grad(model_sig)

### Better activation functions, tanh
#### Exercise 2.3) Can we solve the problems by using the tanh activation function? Why? Why not?

**Answer:**

#### Exercise 2.4)
Create a model using the tanh activation function from PyTorch and plot the gradients again. Comment on the result, and how it differs from the sigmoid case.

In [None]:
model_tanh = build_model(activation=nn.Tanh())

fig = plot_grad(model_tanh)

### Better activation functions, ReLU
The last activation function we are going to take a look at is the Rectified Linear Unit (ReLU). This function is very different from the sigmoid and the tanh.

In [None]:
x = np.linspace(-5, 5, 100)
activation = nn.ReLU()
a = activation(torch.Tensor(x))

fig, ax = plt.subplots()
ax.plot(x, a)
ax.grid()

This function is defined as

$$    f(x)= 
\begin{cases}
    0, & \text{if } x < 0\\
    x, & \text{otherwise}
\end{cases}$$
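
The case distinction can be written out directly and compared with PyTorch's built-in version. This is a minimal sketch; `relu_by_hand` and `x_demo` are names introduced only for this example.

```python
import torch
from torch import nn

def relu_by_hand(x):
    # 0 where x < 0, x otherwise, exactly as in the case distinction.
    return torch.where(x < 0, torch.zeros_like(x), x)

x_demo = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

print(relu_by_hand(x_demo))
print(torch.equal(relu_by_hand(x_demo), nn.ReLU()(x_demo)))
```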

#### Exercise 2.5)
Can this activation function solve some of our problems?

#### Exercise 2.6)
As before, build a new model with this activation function and comment on the gradients.

In [None]:
model_relu = build_model(activation=nn.ReLU())
                    
fig = plot_grad(model_relu)

**Answer:**

#### Dead neurons
The ReLU activation function has a problem that the sigmoid and tanh functions do not have. During training, we can end up in a situation where the input to the activation function in a node is always less than zero. In this case, the gradient going back through that node will ALWAYS be zero, i.e., the node "dies" and no longer takes part in training.
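
The zero gradient for negative inputs can be seen directly with autograd. In this small sketch the pre-activation values in `z` are made up for illustration.

```python
import torch
from torch import nn

# Hypothetical pre-activation values: two negative, two positive.
z = torch.tensor([-3.0, -0.1, 0.2, 4.0], requires_grad=True)

nn.ReLU()(z).sum().backward()

# The gradient is 0 where z < 0 and 1 where z > 0.
print(z.grad)
```

A node whose input is negative for every training example therefore never receives a gradient signal through its activation.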

Other activation functions have been proposed to deal with this, from the simple Leaky-ReLU to the more interesting SELU. We will not explore these here, but when you are training your own models they are worth taking a look at. They are all implemented in PyTorch, see https://pytorch.org/docs/stable/nn.functional.html#non-linear-activation-functions
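
As a rough sketch of how Leaky-ReLU differs, the example below passes a few made-up values through `nn.LeakyReLU` and inspects the gradient. The values in `z_leaky` are chosen only for illustration, and the slope is set explicitly to PyTorch's default of 0.01.

```python
import torch
from torch import nn

# Hypothetical pre-activation values: two negative, two positive.
z_leaky = torch.tensor([-3.0, -0.1, 0.2, 4.0], requires_grad=True)

leaky = nn.LeakyReLU(negative_slope=0.01)  # 0.01 is PyTorch's default slope
leaky(z_leaky).sum().backward()

# Negative inputs receive a small, non-zero gradient equal to the slope;
# positive inputs receive a gradient of 1.
print(z_leaky.grad)
```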

### Initializing weights
Before we start training a neural network, we must initialize the weights and biases of the network. There are several ways we can do this.

#### Exercise 2.7) Random uniform initialization
Imagine you are going to train a network with *many* nodes in the hidden layers using $\tanh$ as an activation function. You decide to initialize the weights of the network using a random uniform distribution in the range [-1, 1]. Can you think of any issues with this approach?

**Hint:** What happens to the sum inside the activation function as the number of neurons in the previous layer increases?

**Answer:**


#### Glorot normal initialization
Glorot initialization is a scheme where the size of the initial weights depends on the number of neurons/weights in a layer. In this variant, the weights are initialized as samples from a normal distribution.

$$W_l \sim \mathcal{N}(\mu=0, \sigma_l)$$

where $\sigma_l$ is of the form

$$\sigma_l = \sqrt{\frac{2}{n_l + n_{l+1}}}$$

Here $n_i$ is the number of neurons in layer $i$. The weight matrix $W_l$ is of size $n_l\times n_{l+1}$.

As the number of neurons and weights **increases**, the "range" of the initial values **decreases** so that the activations are likely to stay in the center range of the activation function where the gradient is large. The expression for $\sigma$ has a theoretical underpinning that we won't look at here, but if you are interested, you can check out the [paper by Xavier Glorot and Yoshua Bengio](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf?hc_location=ufi). In it, they use a uniform distribution as an example, but the results hold for any (reasonable) distribution.

In PyTorch, this initialization method is named `nn.init.xavier_normal_()`
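
To get a feel for the formula, the sketch below initializes the weights of a single linear layer with `nn.init.xavier_normal_()` and compares their empirical standard deviation with $\sqrt{2 / (n_l + n_{l+1})}$. The layer sizes 256 and 512 are arbitrary values chosen for this example, and the default `gain` of 1 is assumed.

```python
import math

import torch
from torch import nn

torch.manual_seed(0)  # reproducible random weights

n_in, n_out = 256, 512  # hypothetical layer sizes
layer = nn.Linear(n_in, n_out)
nn.init.xavier_normal_(layer.weight)  # in-place Glorot normal initialization

expected_std = math.sqrt(2.0 / (n_in + n_out))
empirical_std = layer.weight.std().item()

print(f"expected std:  {expected_std:.4f}")
print(f"empirical std: {empirical_std:.4f}")
```

With this many weights the two numbers should agree closely; larger layers give a smaller standard deviation.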

#### Exercise 2.8)
Define a new function `build_model_glorot()`, that uses Glorot normal initialization in the layers of the model. You can copy and adapt the function you implemented in exercise 1.4.

The docs for the PyTorch Glorot normal initializer can be found [here](https://pytorch.org/docs/stable/nn.init.html#torch.nn.init.xavier_normal_).

In [None]:
def build_model_glorot(activation=nn.Sigmoid()):
    """
    Return a sequential model with Glorot initialization.
    
    Parameters
    ----------
    activation : function
        Function from PyTorch activations.
    
    Returns
    -------
    model : PyTorch Sequential model.
    """
    raise NotImplementedError
    return model

### Gradients with better initialization

#### Sigmoid
Let's take a look at the gradients when we are using a sigmoid activation function and the Glorot normal initialization.

In [None]:
model2_sig = build_model_glorot(activation=nn.Sigmoid())
                    
fig = plot_grad(model2_sig)

Compare this to the figure from exercise 2.2. Notice the scale on the y-axis.

**Answer:**

#### tanh
Now compare the tanh activation function model.

In [None]:
model2_tanh = build_model_glorot(activation=nn.Tanh())
                    
fig = plot_grad(model2_tanh)

**Answer:**

#### ReLU
Finally, compare a new ReLU model.

In [None]:
model2_relu = build_model_glorot(activation=nn.ReLU())
                    
fig = plot_grad(model2_relu)

**Answer:**

## Summary

### Deep Learning Frameworks
In this exercise we have taken a look at the PyTorch framework. Such frameworks take a lot of pain out of creating complex deep learning models, and allow for quick development of efficient code. 

### Vanishing Gradient
Two ways to reduce the problem of vanishing gradients are to use a suitable activation function and a suitable initialization scheme.

The sigmoid (logistic) function has a few problems that we can improve upon with other activation functions, and a proper initialization of the weights can make a huge difference.

We have only scratched the surface of possible activation functions and initializers. There are also many other techniques that are employed to speed up training of neural networks, but these are left for later courses.