# Deep Learning for Beginners - Programming Exercises

by Aline Sindel, Katharina Breininger and Tobias Würfl

Pattern Recognition Lab, Friedrich-Alexander-University Erlangen-Nürnberg, Erlangen, Germany 
# Exercise 3



In [1]:
# minor set-up work
import numpy as np # we will definitely need this

# automatic reloading
%load_ext autoreload
%autoreload 2

%matplotlib inline

## The General Idea of the Framework
<a id='network_description'></a>

Almost all tasks in this programming exercise will revolve around implementing "layers". All layers are derived from the base class defined in the next cell. Each layer needs to implement the methods ```forward``` and ```backward```. We will use the term "layer" to represent any operator in the network that can be considered as a "unit" during forward and backward pass, e.g., a "fully connected layer", an "activation layer" or a "loss layer". 

In ```forward(x)```, the forward pass of the layer is computed by applying the respective operation to the input ```x```. Furthermore, intermediate results necessary to compute the gradients in the backward pass have to be stored. 
In ```backward(error)```, the layer receives the error passed down from the subsequent layer, updates its parameters accordingly and returns the error with respect to its input.

This way, a simple network for classification can be expressed by a list of layer objects. Given an initial input ```x``` and a corresponding ```label```, the forward pass through the network is computed by successively calling ```forward``` for each layer in the list. The respective output is passed as input to the next layer. The very last layer, the "loss" layer, additionally receives the label to compute the loss. To adapt the weights in each layer, we then go backwards through the list, calling ```backward``` on each layer and thereby backpropagating the error through the network. The network is trained by alternating the forward and backward pass through the network while iterating through the training data.

During test-time, only the forward pass through the network is computed to generate a prediction.
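The alternation of forward and backward passes can be sketched as follows. This is only a minimal illustration: the classes `Scale` and `SquaredErrorLoss` and the function `train_step` are hypothetical toy stand-ins that are not part of the exercise framework, and the squared error is used here only because it is easy to differentiate.

```python
import numpy as np

class Scale:
    """Hypothetical toy layer: y = w * x with a single trainable scalar w."""
    def __init__(self, w, learning_rate):
        self.w = w
        self.learning_rate = learning_rate

    def forward(self, x):
        self.x = x                               # stored for the backward pass
        return self.w * x

    def backward(self, error):
        error_x = self.w * error                 # error w.r.t. the input (uses the old w)
        grad_w = np.sum(error * self.x)          # error w.r.t. the parameter
        self.w -= self.learning_rate * grad_w    # parameter update
        return error_x


class SquaredErrorLoss:
    """Hypothetical toy loss layer: L = 0.5 * sum((x - label)^2)."""
    def forward(self, x, label):
        self.x = x
        return 0.5 * np.sum((x - label) ** 2)

    def backward(self, label):
        return self.x - label


def train_step(layers, loss_layer, x, label):
    # forward pass: the output of each layer is the input of the next one
    for layer in layers:
        x = layer.forward(x)
    loss = loss_layer.forward(x, label)
    # backward pass: walk through the list in reverse order
    error = loss_layer.backward(label)
    for layer in reversed(layers):
        error = layer.backward(error)
    return loss


layers = [Scale(0.5, 0.01), Scale(0.5, 0.01)]
loss_layer = SquaredErrorLoss()
x = np.array([1.0, 2.0, 3.0])
label = 2.0 * x                                  # the target function is y = 2x

losses = [train_step(layers, loss_layer, x, label) for _ in range(200)]
```

In this toy setting the product of the two scalars approaches 2 and the recorded loss shrinks from step to step. At test time only the loop over ```forward``` would be executed.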

### Basic notation and terminology

We will work with the following notation and terminology:

- $\mathbf{X}$ and $\mathbf{x}$ represent the input, 
- $\mathbf{W}$ and $\mathbf{w}$ the trainable weights/parameters and
- $\mathbf{Y}$ and $\mathbf{y}$ the output of a layer.
- $L$ represents the loss. Accordingly,
- $E_\mathbf{Y} = \frac{\partial L}{\partial \mathbf{Y}}$ is the error passed down from the subsequent layer,
- $E_\mathbf{W} = \frac{\partial L}{\partial \mathbf{W}}$ the error with respect to the weights and
- $E_\mathbf{X} = \frac{\partial L}{\partial \mathbf{X}}$ is the error with respect to the input.

Note that $x$ and $y$ always have "local" meaning, i.e., with respect to the __current__ layer: the $y$ of one layer is the $x$ of the next layer.


Have a look at the class definitions below and make yourself familiar with the concepts before continuing with the next part of the programming exercise, the fully connected layer.

In [None]:
# %load src/base.py
def enum(*sequential, **named):
    # Enum definition for backward compatibility
    enums = dict(zip(sequential, range(len(sequential))), **named)
    return type('Enum', (), enums)

# Enum to encode which phase a layer is in at the moment.
Phase = enum('train', 'test', 'validation')

class BaseLayer:
    
    def __init__(self):
        self.phase = Phase.train
        
    def forward(self, x):
        """ Return the result of the forward pass of this layer. Save intermediate results
        necessary to compute the gradients in the backward pass. 
        """
        raise NotImplementedError('Base class - method is not implemented')
    
    def backward(self, error):
        """ Update the parameters/weights of this layer (if applicable), 
        and return the gradient with respect to the input.
        """
        raise NotImplementedError('Base class - method is not implemented')

## Fully Connected Layers

Fully connected (FC) layers are the essential building blocks in (multi-layer) perceptrons. Inspired by biological neurons, they are able to represent any connection topology between two layers (without same-layer connections).

<img src="img/ann.png" width="600">

Let's have a look at the forward pass: Given an input vector $\mathbf{x} \in \mathbb{R}^{n}$ to an FC layer, the output $y$ of a single neuron can be described as a weighted sum of the input values plus a bias:
\begin{equation}
y = w_{n+1} + \sum_{j=1}^n w_j x_j ,
\end{equation}

where we collect the weights in a vector $\mathbf{w} \in \mathbb{R}^{n + 1}$.

This is simply a vector-vector multiplication: 

\begin{equation}
y = \begin{pmatrix} 
  w_{1}&\dots&w_{n}&w_{n+1} \end{pmatrix}
\begin{pmatrix} 
  x_{1}    \\ 
  \vdots \\
  x_{n} \\
  1
\end{pmatrix}
\end{equation}

By extending $\mathbf{x}$ with an additional "1", we can include the bias directly in the multiplication. 


Since we want to have a layer able to generate multiple outputs, we need multiple neurons:

<img src="img/fcn.png" width="150">

To achieve this, we extend the weight vector to a matrix to allow for an output vector $\mathbf{y} \in \mathbb{R}^{m}$:

\begin{align}
\begin{pmatrix} 
y_1    \\ 
\vdots \\
y_m
\end{pmatrix} &=
\begin{pmatrix} 
w_{1,1}    & \dots & w_{n,1} & w_{n+1,1} \\
\vdots & \ddots & \vdots & \vdots \\%
w_{1,m}    & \dots & w_{n,m} & w_{n+1,m}
\end{pmatrix}
\begin{pmatrix} 
x_1    \\ 
\vdots \\
x_n	 \\
1
\end{pmatrix}\\
\mathbf{y} &= \mathbf{W}\mathbf{x} 
\end{align}

For batch processing, we can accordingly stack multiple input vectors in a matrix $\mathbf{X}$:

\begin{equation}
\mathbf{Y} = \mathbf{W}\mathbf{X}
\end{equation}
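The bias trick can be checked numerically. The following sketch uses the column layout of the formulas above (one sample per column); all variable names and the random test data are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, b = 3, 2, 4                            # input size, output size, batch size

W_plain = rng.normal(size=(m, n))            # weights without bias
bias = rng.normal(size=(m, 1))               # one bias per output neuron
X_plain = rng.normal(size=(n, b))            # one input vector per column

# explicit version: weighted sum plus bias
Y_explicit = W_plain @ X_plain + bias

# bias trick: append the bias as last column of W and a row of ones to X
W = np.hstack([W_plain, bias])               # shape (m, n + 1)
X = np.vstack([X_plain, np.ones((1, b))])    # shape (n + 1, b)
Y = W @ X                                    # shape (m, b)
```

Both variants produce the same output, so a single matrix multiplication covers weights and biases.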

The weight matrix represents the trainable parameters of the FC layer. To be able to update the parameters, we need the gradient of the loss with respect to these weights.
Given the error with respect to the output $\mathbf{Y}$ of the current layer $\frac{\partial L}{\partial \mathbf{Y}} = E_\mathbf{Y}$, we can compute the gradient with respect to the weights $\frac{\partial L}{\partial \mathbf{W}} = E_\mathbf{W}$ using backpropagation, i.e., the chain rule. To backpropagate the error to the previous layer (and then update the weights there), we further need to compute the error with respect to the inputs $\frac{\partial L}{\partial \mathbf{X}} = E_\mathbf{X}$.

Using the formula of the fully connected layer $\mathbf{Y} = \mathbf{W}\mathbf{X}$, we can compute the wanted gradients:

\begin{align}
\frac{\partial L}{\partial \mathbf{W}} &= \frac{\partial L}{\partial \mathbf{Y}} \frac{\partial \mathbf{Y}}{\partial \mathbf{W}}\\
                              &= E_\mathbf{Y} \mathbf{X}^T\\
\end{align}

\begin{align}
\frac{\partial L}{\partial \mathbf{X}} &= \frac{\partial L}{\partial \mathbf{Y}} \frac{\partial \mathbf{Y}}{\partial \mathbf{X}}\\
                              &= \mathbf{W}^T E_\mathbf{Y}\\
\end{align}
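A finite-difference check is a common way to gain confidence in such formulas. The sketch below assumes a loss that is linear in $\mathbf{Y}$, so that $E_\mathbf{Y}$ is a fixed matrix, and compares $E_\mathbf{Y}\mathbf{X}^T$ and $\mathbf{W}^T E_\mathbf{Y}$ with numerical gradients. The helper `numerical_gradient` is hypothetical and not part of the exercise code.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n1, b = 2, 4, 3                    # n1 = n + 1 (input size including the bias entry)
W = rng.normal(size=(m, n1))
X = rng.normal(size=(n1, b))
E_Y = rng.normal(size=(m, b))         # pretend error from the subsequent layer

def loss(W, X):
    # a loss that is linear in Y = W X, so that dL/dY equals E_Y exactly
    return np.sum(E_Y * (W @ X))

# analytic gradients from the formulas above
E_W = E_Y @ X.T
E_X = W.T @ E_Y

def numerical_gradient(f, A, eps=1e-6):
    # central differences, one matrix entry at a time
    grad = np.zeros_like(A)
    for idx in np.ndindex(A.shape):
        A_plus, A_minus = A.copy(), A.copy()
        A_plus[idx] += eps
        A_minus[idx] -= eps
        grad[idx] = (f(A_plus) - f(A_minus)) / (2 * eps)
    return grad

E_W_num = numerical_gradient(lambda A: loss(A, X), W)
E_X_num = numerical_gradient(lambda A: loss(W, A), X)
```

The analytic and numerical gradients should agree up to round-off, and the shapes of `E_W` and `E_X` match those of `W` and `X`.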

We will use (mini-batch) stochastic gradient descent in this programming exercise, so the update rule for the weights is as follows:

\begin{equation}
\mathbf{W}^{t+1} = \mathbf{W}^{t} - \eta E_{\mathbf{W}^t} \enspace{,}
\end{equation}

where $\eta$ is the learning rate and ${t}$ denotes the iteration.
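As a small illustration of the update rule, the following sketch fits a weight matrix to targets generated by a known matrix. The squared-error loss and the choice of $\eta$ are assumptions made only for this example: the step size is set to the inverse of the largest eigenvalue of $\mathbf{X}\mathbf{X}^T$, which keeps the loss from increasing in this quadratic setting.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n1, b = 2, 3, 8
X = rng.normal(size=(n1, b))
W_true = rng.normal(size=(m, n1))
T = W_true @ X                           # targets generated by a known weight matrix

W = np.zeros((m, n1))                    # W^0
# assumption: eta = 1 / (largest singular value of X)^2
eta = 1.0 / np.linalg.norm(X, 2) ** 2

losses = []
for t in range(100):
    Y = W @ X
    losses.append(0.5 * np.sum((Y - T) ** 2))
    E_Y = Y - T                          # dL/dY for the squared error
    E_W = E_Y @ X.T                      # gradient w.r.t. the weights
    W = W - eta * E_W                    # W^{t+1} = W^t - eta * E_W
```

In practice the learning rate is usually a hand-tuned hyperparameter; a value that is too large can make the loss grow instead of shrink.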


### Implementation task

**Now it is your turn**: In the next cell, implement the methods ```__init__```, ```forward```, ```backward```, and ```get_gradient_weights```, and test your implementation by running the cell after the next. The method ```get_gradient_weights``` should return the gradient with respect to the weights and biases from the last backward pass.

**Note that input and output, and accordingly the respective errors, are actually transposed compared to the formulas above**. This is due to performance reasons and consistency with known frameworks. Make sure to consider this in your implementation.
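To see what the transposition means for the formulas, the following sketch computes the forward pass and both gradients once in the column layout of the equations and once with one sample per row. The variable names ending in `_rows` are chosen only for this illustration, and the sketch is not meant as the reference solution.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, b = 3, 2, 5

# formula layout: one sample per column
W = rng.normal(size=(m, n + 1))
X = np.vstack([rng.normal(size=(n, b)), np.ones((1, b))])
E_Y = rng.normal(size=(m, b))
Y, E_W, E_X = W @ X, E_Y @ X.T, W.T @ E_Y

# transposed layout: one sample per row
X_rows, W_rows, E_Y_rows = X.T, W.T, E_Y.T     # shapes (b, n+1), (n+1, m), (b, m)
Y_rows = X_rows @ W_rows                       # shape (b, m)
E_W_rows = X_rows.T @ E_Y_rows                 # shape (n+1, m)
E_X_rows = E_Y_rows @ W_rows.T                 # shape (b, n+1)
```

Each result in the row layout is the transpose of its counterpart in the column layout, which follows from $(\mathbf{A}\mathbf{B})^T = \mathbf{B}^T\mathbf{A}^T$.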

Furthermore, implement the method ```initialize```. For the moment, take the initializer objects as given, we will return to them later. Just make sure to use them with the correct weight shapes to initialize weights and biases. Implement the update of these parameters as part of the backward pass.
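To make the initializer interface more concrete, here is a sketch of an object with an ```initialize(shape)``` method. The class `ConstantInitializer` is hypothetical, and the shapes `(n, m)` for the weights and `(1, m)` for the bias are assumptions based on the transposed layout; the initializers and shapes used by the actual framework and tests may differ.

```python
import numpy as np

class ConstantInitializer:
    """Hypothetical initializer: fills an array of the requested shape with a constant."""
    def __init__(self, value):
        self.value = value

    def initialize(self, shape):
        return np.full(shape, self.value, dtype=float)


n, m = 3, 2                                               # input size, output size
weights = ConstantInitializer(0.1).initialize((n, m))     # assumed weight shape
bias = ConstantInitializer(0.0).initialize((1, m))        # assumed bias shape
stacked = np.vstack([weights, bias])                      # shape (n + 1, m), bias in the last row
```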

In [None]:
# %load src/layers/fully_connected_0.py
#----------------------------------
# Exercise: Fully connected layers
#----------------------------------
# The original python file can be reloaded by typing %load src/layers/fully_connected_0.py in the first line of this cell.
# After successfully solving this exercise, type the following command in the first line of this cell:
# %%writefile src/layers/fully_connected.py
# This will save the result to a python file, which you will need for the next exercises.

from src.base import BaseLayer, Phase

class FullyConnectedLayer(BaseLayer):
    def __init__(self, input_size, output_size, learning_rate):
        """ A fully connected layer.
            param: input_size (int): dimension n of the input vector
            param: output_size (int): dimension m of the output vector
            param: learning_rate (float): the learning rate of this layer
        """
        # TODO: define the necessary class variables, for this have a look at the input variables and the other functions
        pass

    def forward(self, x):
        """ Compute the forward pass through the layer.
            param: x (np.ndarray): input with shape [b, n] where b is the batch size and n is the input size
            returns (np.ndarray): result of the forward pass, of shape [b, m] where b is the batch size and
                   m is the output size
        """
        # TODO: Implement forward pass of the fully connected layer
        
        # (1) Think about what you need to store during the forward pass to be able to compute the gradients in the backward pass 
        self.X = None  # TODO
        
        # (2) perform the actual forward pass just by matrix multiplication
        # TODO
        
        # return the result        
        pass
    
    def get_gradient_weights(self):
        """ 
        returns (np.ndarray): the gradient with respect to the weights and biases from the last call of backward(...)
        """
        # TODO: Implement the getter method, hint: store the gradient in the backward pass as a class variable, 
        # then you can easily access it here.
        pass
    
    def backward(self, error):
        """ Update the weights of this layer and return the gradient with respect to the previous layer.
            param: error (np.ndarray): of shape [b, m] where b is the batch size and m is the output size
            returns (np.ndarray): the gradient w.r.t. the previous layer, of shape [b, n] where b is the 
                   batch size and n is the input size
        """
        # TODO: Implement backward pass of the fully connected layer
        # Hint: Be careful about the order of applying the update to the weights and the calculation of 
        # the error with respect to the previous layer.
        
        # (1) calculate the error for lower layers using the transposed weights and the error
        # TODO
        
        # (2) compute the gradient with respect to the own parameters (weights and biases)
        # TODO
        
        # (3) store gradient for testing purposes
        # TODO
        
        # (4) update weights using learning rate and gradient
        # TODO
        
        # (5) remove the bias entry from the error computed in (1); it has no meaning for the previous layer
        
        # TODO: return gradient w.r.t. the previous layer
        pass
    
    def initialize(self, weights_initializer, bias_initializer):
        """ Initializes the weights/bias of this layer with the given initializers.
            param: weights_initializer: object providing a method weights_initializer.initialize(weights_shape)
                   which will return initialized weights with the given shape
            param: bias_initializer: object providing a method bias_initializer.initialize(bias_shape) 
                   which will return an initialized bias with the given shape
        """
        # TODO: Implement the initialization using the given initializers. Hint: Stack the weights and bias together 
        # in the weights array.
        pass

In [None]:
# Running the testsuite
%run Tests/TestFullyConnected.py
TestFullyConnected.FullyConnected = FullyConnectedLayer
unittest.main(argv=['first-arg-is-ignored'], exit=False)

## Softmax and Loss Layer

By combining the layers we implemented so far, we can represent a non-linear function of the input. For example, we can compute an output vector with $K$ elements to classify between $K$ classes.

### Softmax
The output of this computation is not further restricted. In many cases, however, it is beneficial if a prediction for the targeted classification has the properties of a probability distribution, i.e., 

\begin{align}
\sum_{k=1}^{K} y_k &= 1 \enspace{,}\\
y_k &\ge 0 \quad \forall k~\text{in}~\{1, ..., K\} \enspace{.}
\end{align}

This makes it easier, for example, to compare the prediction with the ground truth of the classification task.
We can achieve these properties by applying the softmax function as a last activation function. It is defined as: 

\begin{equation}
\mathrm{softmax}(x_k) = \frac{\mathrm{exp}(x_k)}{\sum_{j=1}^{K}\mathrm{exp}(x_j)} \enspace{.}
\end{equation}

However, if the activations in $\mathbf{x}$ are high, $\mathrm{exp}(x_k)$ can become very large, which can cause numerical instabilities. Because the softmax is invariant to adding the same constant to all elements of $\mathbf{x}$, the activations can be shifted by the maximum value of $\mathbf{x}$ before applying the softmax without changing the result:

\begin{equation}
\mathbf{\widetilde{x}} = \mathbf{x} - \mathrm{max}(\mathbf{x}) \enspace{.}
\end{equation}

After the softmax, the predictions of the network have the properties of a probability distribution.
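A possible NumPy sketch of the shifted softmax for a batch with one sample per row is shown below. The function name `softmax` and the test values are chosen for illustration; this is one way to write it, not necessarily the intended solution of the exercise.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax for an array of shape [b, k]."""
    shifted = x - np.max(x, axis=1, keepdims=True)    # largest entry per row becomes 0
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=1, keepdims=True)

x = np.array([[1.0, 2.0, 3.0],
              [1000.0, 1001.0, 1002.0]])              # exp(1000) would overflow without the shift
y = softmax(x)
```

The second row is the first row shifted by a constant, so both rows yield the same probabilities, and every row sums to one.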

### Loss function
To adapt the parameters of the network, we want to know how "well" the network performs compared to a given ground truth (or label) - we need a loss function. Then, we can "train" the network by minimizing this loss by iteratively adapting the weights and biases using our training data.

A common loss function is cross entropy. To compute it, we need the ground truth $\mathbf{y^*}$ in "one-hot"-vector encoding. The ground truth is represented as a vector with $K$ elements where only the value that corresponds to the true class is $\neq 0$:

\begin{equation}
\mathbf{y^*} = 
\begin{pmatrix}
  0 \\
  \vdots\\
  1\\
  \vdots\\
  0
\end{pmatrix}
\end{equation}

Then, the cross entropy loss for a batch of b samples is defined as:

\begin{equation}
L(\mathbf{Y^*},\mathbf{Y}) = - \sum_b \sum_{k=1}^K \ln( y_{b, k} ) y^*_{b, k}
\end{equation}
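For one-hot labels, only the predicted probability of the true class contributes to the sum. The sketch below, with made-up predictions and labels, evaluates the loss once with the full double sum and once by indexing the true class directly.

```python
import numpy as np

# made-up predictions (rows sum to 1) and one-hot labels for a batch of two samples
Y = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])
Y_star = np.array([[1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0]])

# full double sum over batch and classes
loss_full = -np.sum(np.log(Y) * Y_star)

# equivalent: pick the probability of the true class in each row
rows = np.arange(Y.shape[0])
true_class = np.argmax(Y_star, axis=1)
loss_indexed = -np.sum(np.log(Y[rows, true_class]))
```

Both expressions evaluate to $-(\ln 0.7 + \ln 0.1)$. The second sample is misclassified and therefore dominates the loss.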

### Combining both

The softmax activation and the cross entropy loss are frequently combined, and the combination is sometimes called the "SoftMax loss". Together, their gradient with respect to the softmax input $x_k$ has a simple and elegant form:

\begin{equation}
e_k = 
y_k - y^*_k
\end{equation}

for every element of the batch.
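This gradient can be verified with finite differences. The sketch below assumes one-hot labels and inputs of shape [b, k]; the helper functions `softmax` and `softmax_cross_entropy` are illustrative and not the class required in the implementation task.

```python
import numpy as np

def softmax(x):
    shifted = x - np.max(x, axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=1, keepdims=True)

def softmax_cross_entropy(x, labels):
    return -np.sum(np.log(softmax(x)) * labels)

rng = np.random.default_rng(3)
x = rng.normal(size=(4, 3))
labels = np.eye(3)[[0, 2, 1, 2]]        # one-hot labels for four samples

# analytic gradient: prediction minus label
analytic = softmax(x) - labels

# numerical gradient via central differences
eps = 1e-6
numeric = np.zeros_like(x)
for idx in np.ndindex(x.shape):
    x_plus, x_minus = x.copy(), x.copy()
    x_plus[idx] += eps
    x_minus[idx] -= eps
    numeric[idx] = (softmax_cross_entropy(x_plus, labels)
                    - softmax_cross_entropy(x_minus, labels)) / (2 * eps)
```

The two gradients should agree up to round-off. Each row of the analytic gradient sums to zero, because both the prediction and the label sum to one.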

### Implementation task

Implement the softmax function and the cross entropy loss combined in the class ```SoftMaxCrossEntropyLoss```. Since the two functions are combined in ```forward```, additionally implement a function ```predict``` that computes only the softmax of the input. This function can be used during test-time, when we are interested in a prediction for unseen data.

In [None]:
# %load src/layers/softmax_crossentropy_0.py
#----------------------------------
# Exercise: Softmax cross entropy
#----------------------------------
# The original python file can be reloaded by typing %load src/layers/softmax_crossentropy_0.py in the first line of this cell.
# After successfully solving this exercise, type the following command in the first line of this cell:
# %%writefile src/layers/softmax_crossentropy.py
# This will save the result to a python file, which you will need for the next exercises.

from src.base import BaseLayer, Phase

class SoftMaxCrossEntropyLoss(BaseLayer):
    def __init__(self):
        #TODO: define class variable(s)
        pass
    
    def forward(self, x, labels):
        """ Return the cross entropy loss of the input and the labels after applying the softmax to the input. 
            param: x (np.ndarray): input, of shape [b, k] where b is the batch size and k is the input size
            param: labels (np.ndarray): the corresponding labels of the training set in one-hot encoding for 
                   the current input, of the same shape as x
            returns (float): the loss of the current prediction and the label
        """
        # TODO: Implement forward pass
        prediction = None  # TODO: use self.predict here to apply softmax to the network output
        #Then, compute the cross entropy loss (see equation in description)
        #Hint: for the implementation, you can simplify the equation by getting the indices for the labels being 1
        pass
    
    def backward(self, labels):
        """ Return the gradient of the SoftMaxCrossEntropy loss with respect to the previous layer.
            param: labels (np.ndarray): (again) the corresponding labels of the training set for the current input, 
                   of shape [b, k] where b is the batch size and k is the input size
            returns (np.ndarray): the error w.r.t. the previous layer, of shape [b, k] where b is the batch 
                   size and k is the input size
        """
        # TODO: Implement backward pass: compute the gradient of the SoftMax loss (see equation in description)
        # For this, do not use self.prediction directly, but a copy of it
        # Here, too, you can find the positive class via the indices for the labels being 1
        pass
    
    def predict(self, x):
        """ Return the softmax of the input.  This can be interpreted as probabilistic prediction of the class.
            param: x (np.ndarray): input with shape [b, k], where b is the batch size and k is the input size
            returns (np.ndarray): the result softmax(x), of the same shape as x
        """
        # TODO: Implement softmax (see equation in description)
        # Hint: beforehand, shift the activation by the max value.
        # return and store the prediction as class variable (the latter you need for the backward function)
        pass

In [None]:
%run Tests/TestSoftMaxCrossEntropyLoss.py
TestSoftMaxCrossEntropyLoss.SoftMaxCrossEntropyLoss = SoftMaxCrossEntropyLoss
unittest.main(argv=['first-arg-is-ignored'], exit=False)