# Neural Networks

We will use neural networks as a running example of a data model in this course. We need some concrete, tangible models in our discussions, and I feel it is sensible to adopt deep neural networks, because
- Currently (2019) they provide superior prediction performance in a wide range of practical applications. Arguably, deep neural networks are the top performers in a wider range of problems than any other single family of data models.
- The tools for building, testing and deploying state-of-the-art neural networks are increasingly mature. Online learning materials and video tutorials/lectures are abundant.
- The design principle is elegant (though practical implementations have not come close to the ultimate goal of end-to-end learning), which saves much ad hoc data engineering effort.


Nonetheless, it is worth noting that the learning principles we study in this course apply to generic data models, not limited to deep neural networks. For techniques closely coupled with DNN, e.g. skip-layer-connections (see below), we will either introduce them in this lecture or specify explicitly when encountering them.

## Recursive Design

__Review of Linear Model__ Linear model -- allocate each $X$-variable a _weight_ (i.e. coefficient, we will use "weight" following convention in the context of NN), then consider the (or every) desired $Y$-variable to be a simple function of the weighted sum. 
$$
Y = \psi(\sum_i w_i X_i)
$$
If you have multiple $Y$-variables:

$$
Y_1 = \psi(\sum_i w_{1,i} X_i) \\
Y_2 = \psi(\sum_i w_{2,i} X_i) \\
\dots
$$
which is
$$
Y_j = \psi(\sum_i w_{j,i} X_i)
$$
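
As a minimal sketch of the formula above, the weighted sums for all targets can be computed either one target at a time or as a single matrix product. The sizes, the random data and the choice of a sigmoid for $\psi$ are assumptions made only for illustration.

```python
import torch

torch.manual_seed(0)
n_samples, n_inputs, n_outputs = 5, 4, 2   # arbitrary sizes

X = torch.randn(n_samples, n_inputs)   # each row is one sample (X_1 .. X_4)
W = torch.randn(n_outputs, n_inputs)   # W[j, i] plays the role of w_{j,i}

def psi(s):
    # assumed choice of the simple function psi; any elementwise map would do
    return torch.sigmoid(s)

# explicit weighted sums, one target Y_j at a time
Y_loop = torch.stack(
    [psi(sum(W[j, i] * X[:, i] for i in range(n_inputs))) for j in range(n_outputs)],
    dim=1,
)

# the same computation written as a single matrix product
Y_mat = psi(X @ W.T)
print(Y_loop.shape, Y_mat.shape)
```

Both versions should agree up to rounding; the matrix form is how frameworks usually carry out the computation.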

Recall our figure illustrating the computation:
<img src="ref/illu-linear.png" width="500px"/>

The figure below shows an example of a linear model predicting 2 $Y$-variables (recall the discussion in our previous class)
<img src="ref/illu-linear-2y.png" width="400px"/>
where each group of computations can be seen as an independent linear model for its respective target $y$.

### Recursive design: bottom-up

Roughly, neural networks are multiple linear models grouped and stacked together. Let us first re-draw some elements and links in the figure above, which shows a linear model with two prediction targets. We link the prediction computation for both targets to the data attributes, which makes the figure a bit messy but puts more focus on the aspects that matter for the construction of neural networks.

<img src="ref/illu-linear-2ya.png" width="450px"/>

Straightforwardly, we can extend the model to generate more targets, say, $q$. We draw weight in groups as $w_{1,\cdot}$, for all weights associated with the prediction of $y_1$.
<img src="ref/mlp-botup-1.png" width="450px"/>

Then a very natural extension to consider is what if 
> instead of directly taking $y_1, y_2, \dots$ as the model's prediction of the targets, we use them as intermediate features, and take further steps to construct $z_1, z_2, \dots$ as the prediction goal.

The figure below shows the extension. Of course, we need to introduce new sets of model parameters (weights) for the extension, see $u_{1,\cdot}, u_{2,\cdot}, \dots$.

<img src="ref/mlp-botup-2.png" width="450px"/>
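
The stacking step can be sketched in a few lines. This is an illustration under assumed sizes, with `relu` standing in for $\psi$; `W` and `U` hold the weight groups $w_{j,\cdot}$ and $u_{k,\cdot}$ respectively.

```python
import torch

torch.manual_seed(0)
p, q, r = 4, 3, 2          # numbers of x-, y- and z-variables (arbitrary choices)

x = torch.randn(1, p)      # one sample
W = torch.randn(q, p)      # weights w_{j,.} producing y_j
U = torch.randn(r, q)      # weights u_{k,.} producing z_k

psi = torch.relu           # assumed elementwise map

y = psi(x @ W.T)           # intermediate features
z = psi(y @ U.T)           # final predictions built on top of y
print(y.shape, z.shape)
```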

### Recursive design: top-down

Another way of viewing the construction of a neural network model also starts with a linear model. This time we are motivated by the idea of making the data attributes more representative.  So instead of using the raw observable variables, we compose the attributes using "feature-engineering" models, which are also linear (so the same idea applies recursively). The figure below shows the scheme.

<img src="ref/mlp-td.png" width="550px"/>

The figure shows how attribute-2 is replaced by the output of another "lower-level" model.

### Implementation

Using a modern framework such as "PyTorch", we can easily implement a neural network model. Mainly, we are concerned with two aspects:
1. __definition__: to specify the structure of the model: number of layers, number of output units (variables) in each layer, how the inputs are related to the outputs (we have only encountered fully connected layers so far, i.e. each input is connected to each output). We specify the definition in the `__init__`  (constructor) method of a neural network class.
2. __computation__: the `forward` method performs the actual computations using the layers.

<span style="color:gray">OPTIONAL</span>
> One issue might be of interest to careful readers: the _activation_ function (the non-linear map $\psi$) conceptually belongs to the model architecture. However, as there are no __adjustable__ parameters in those maps, they need no learning. Colloquially, one can consider them too simple to be included in the model definition. In a more theoretical view, the parameter-less functions have no effect on where the model is in $\mathcal H$, i.e. the current $h$. However, they do affect $\mathcal H$ itself. 

In [None]:
import torch
import torch.nn as nn

class MyNetwork(nn.Module):
    """
    The computations in a 3 layer neural network. 
    
    SEE ALSO the "MLPClassifier" we had used in `sklearn.neural_network`. 
    Here we will study the inner structure of a neural network.
    
    This construction is slightly more complex than using the simple interface
    of MLPClassifier, but provides much more flexibility.
    """
    def __init__(self):
        super(MyNetwork, self).__init__() # Important: register self as an NN model
        self.linear1 = nn.Linear(in_features=4, out_features=16)
        self.linear2 = nn.Linear(in_features=16, out_features=8)
        self.linear3 = nn.Linear(in_features=8, out_features=3)
        
        # out-features of i-1 must be the same as in-features of i
        
    def forward(self, x):
        # where the actual computation takes place
        h1 = nn.functional.relu(self.linear1(x))
        # "relu" is the elementwise map.
        h2 = nn.functional.relu(self.linear2(h1))
        h3 = nn.functional.relu(self.linear3(h2))
        
        # let's consider h3 be the "logits" of the classes
        return h3
        

Note `relu` is a kind of elementwise "activation", i.e. nonlinear transformation ($\psi$ in the figures above). Let's do a sanity test first.

In [None]:
dummy_data = torch.rand(10, 4) # allocate dummy input for the data model
nn_model = MyNetwork() # create an instance of the model
pred = nn_model(dummy_data) # when used like a function, the model's "forward" method is executed
print("NN transforms data of {}, obtain {}".format(
    dummy_data.shape, pred.shape))

<span style="color:blue">__Discussion__</span>
- Modify the code above to let the model output diagnostic information during the `forward` computation, about the `shape` of each intermediate result.

## Back-propagation Algorithm

### Gradient-based optimisation

Recall the discussion about NN models in the last lecture. We can easily perturb the parameters of an NN model a bit to see if the model behaves better, taking advantage of the fact that as long as we keep the _mathematical form_ of the model parameters intact, we always have a valid neural network. In other words, we can easily "move around" in the _space of neural networks_.

The main question is: where to move? A reasonable argument for an efficient move is to achieve the maximal change of a numerical criterion given fixed step size, say 1.0. More specifically, assume for now we know the following facts:

- If the parameter $w_1$ changes by $\Delta$, the criterion would change $3.0 \Delta$ in response. 
- If the parameter $w_2$ changes by $\Delta$, the criterion would change $4.0 \Delta$ in response. 

Then for a fixed step size 1.0 for the combined adjustment in the 2D space consisting of $w_1$ and $w_2$, the most efficient movement is to let $\Delta w_1 = 0.6$ and $\Delta w_2=0.8$ (or the opposite direction, if the aim is to decrease the criterion).

<span style="color:blue">__Discussion__</span>
Why is that the optimal movement direction? Could you try to come up with something else?
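
A quick numerical check of the two-parameter example can help with the discussion. The sketch below tries many unit-length steps and keeps the one that changes the criterion the most; it is a brute-force illustration under the sensitivities 3.0 and 4.0 stated above, not a proof.

```python
import math

g = (3.0, 4.0)  # sensitivities of the criterion to w_1 and w_2

def change(angle):
    # change of the criterion for a unit-length step in direction `angle`
    dw1, dw2 = math.cos(angle), math.sin(angle)
    return g[0] * dw1 + g[1] * dw2

# try many unit-length directions and keep the best one
angles = [2 * math.pi * k / 3600 for k in range(3600)]
best = max(angles, key=change)
best_step = (math.cos(best), math.sin(best))
print("best step ~ ({:.2f}, {:.2f}), change ~ {:.2f}".format(*best_step, change(best)))
```

The search should land close to the step $(0.6, 0.8)$, with a change of about $5.0$ in the criterion.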


In mathematical terms, let us put together all model parameters (i.e. weights and bias terms in ALL layers) and denote as $\boldsymbol \theta$, and denote the _parameter space_ consisting of all possible $\boldsymbol \theta$ as $\boldsymbol \Theta$. Naturally, starting from an existing neural network $\boldsymbol \theta$, we want to search for a promising movement in $\boldsymbol \Theta$. The figure below shows the process.

<img src="ref/bp0.png" width="500px"/>

The process is better known as "optimisation". For each individual parameter, we carefully calculate its effect on the model's fitness to observed data. The parameter-fitness relation helps us determine the apparently "promising" direction to move the model in $\boldsymbol \Theta$. The direction is the _gradient_ of fitness with respect to the model parameters.

Fortunately, people have found a smart and fast way to compute the gradient for a neural network, despite the fact that a parameter may contribute to the final model output (and thus affect its fitness to data) via a very tortuous route -- consider one weight associated with the input-output in the bottom layer of a 10-layer model.

Before we study how to compute neural network gradients, let's first heed the limitation of such a method as the means of model optimisation.

- Gradient-based methods only suit some model families. At least, the family should come with some parameter space in which each point represents a valid member model of the family, as $\boldsymbol \Theta$ does in the neural networks' case. As a counter example, consider a decision tree model: it is not trivial to search "locally" around an existing decision tree, as certain parameters of the model, e.g. the number of nodes and the structure, cannot be modified continuously. 
- Gradient is a __local__ matter. The direction of adjusting the model remains good only when the adjustment (step-size) is small.
- The _loss_, i.e. "the model's fitness to the observed data", needs to be further scrutinised. i) "Observed data" means the gradient is calculated to improve the __estimated__ performance, which is closely related to the training-test generalisation issue. ii) In general, the complete "observed data" is a large object and can be cumbersome to manipulate in a computer's memory. So each step of adjustment is calculated using a small _batch_ of data. The batch represents an even smaller portion of data. We can only _expect_ the accumulated adjustments to lead to a good model. This is often referred to as "stochastic gradient descent" (SGD).
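
To make the batch-wise adjustment concrete, here is a minimal sketch of stochastic gradient descent on a made-up linear regression problem. The data, the step size of 0.05 and the batch size of 20 are all assumptions chosen for illustration; each step only sees a small random batch, yet the loss on the complete data is expected to drop.

```python
import torch

torch.manual_seed(0)
# synthetic "observed data": y = 2*x1 - 3*x2 + noise (made up for illustration)
X = torch.randn(1000, 2)
y = X @ torch.tensor([2.0, -3.0]) + 0.1 * torch.randn(1000)

theta = torch.zeros(2, requires_grad=True)   # model parameters
step_size = 0.05
batch_size = 20

def loss_fn(xb, yb):
    # mean squared error of the linear prediction
    return ((xb @ theta - yb) ** 2).mean()

initial_loss = loss_fn(X, y).item()
for it in range(200):
    idx = torch.randint(0, X.shape[0], (batch_size,))  # pick a random small batch
    loss = loss_fn(X[idx], y[idx])
    loss.backward()                      # gradient estimated from the batch only
    with torch.no_grad():
        theta -= step_size * theta.grad  # small step against the gradient
    theta.grad.zero_()
final_loss = loss_fn(X, y).item()
print("loss on all data: {:.3f} -> {:.3f}".format(initial_loss, final_loss))
```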

### Computing Gradients via Backpropagation

The principle of calculating the effect of individual parameters on the loss is to skillfully apply the standard chain rule for computing derivatives. The figure below shows this principle of computation.

<img src="ref/bp.png" width="700px"/>

The "EMSG" stands for "error message", the desired change in the _result_ of some computation. EMSG is the information we "back-propagate" through the computational graph. Using EMSG, we compute the "UMSG" (update-message), i.e. how to change the operands in a computation to achieve the desired change. UMSG represents the gradient with respect to the model parameters.
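
The message passing can be traced by hand for a single unit and compared with what the framework computes. In this sketch the EMSG at each node is read as the derivative of the loss with respect to that node's result, which is one interpretation of the figure; the `tanh` map and the squared-error loss are assumptions made for illustration.

```python
import torch

torch.manual_seed(0)
x = torch.randn(3)                        # inputs of one unit
w = torch.randn(3, requires_grad=True)    # its weights
target = torch.tensor(0.5)

# forward pass
s = (w * x).sum()          # weighted sum
y = torch.tanh(s)          # elementwise map (tanh chosen for illustration)
loss = (y - target) ** 2

# backward pass by hand, one chain-rule factor per computation step
emsg_y = 2 * (y - target)          # message arriving at y
emsg_s = emsg_y * (1 - y ** 2)     # passed back through tanh
umsg_w = (emsg_s * x).detach()     # resulting gradient w.r.t. the weights

loss.backward()                    # the framework's answer
print(umsg_w, w.grad)
```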

__Implementation__

Though conceptually simple, the implementation of the bp-algorithm needs painstaking care and is error-prone. Fortunately, mature computation frameworks have been developed and hide much of the details from model builders nowadays.

In "PyTorch", most operations have a two-way implementation of the computation involved -- the forward computation and the backward calculation of the gradients with respect to the operands. Let us compare the "conventional" computation (forward only) and the two-way computation using a simple example.

E.g. the operator "+" and "*" in the following computations are trivially straightforward, and work as expected. 
```python
a = 5 + 3
b = a * 2
```
On the other hand, if we let pytorch handle some operands,

In [None]:
a = torch.Tensor([5.0]).requires_grad_()
b = (a + 3.0) * 2

We write `[5.0]` instead of `5.0` because torch (and most other similar libraries) is designed to handle arrays of numbers in its own managed data type, "Tensor". To construct a tensor, we need a **collection** of data, so we use a one-element array to represent a real number. The call `requires_grad_()` explicitly states that `pytorch` should track this tensor for the backward computation. This explicit step is _unnecessary_ when constructing actual learnable models, because the parameters of the predefined layers are tracked by default.

Let's check the effect of the backward computation:

In [None]:
print("a = {}, grad is {}".format(a.data, a.grad))
b.backward()
print("a = {}, grad is {}".format(a.data, a.grad))

The result `a.grad = [2.0]` means that if `a` changes by 1 unit, `b` will change by 2 units in response.

When using the framework to perform computations stated in a deep neural network model, the backward computation will result in the gradient, the direction toward which to modify the model parameters.

## Important Strategies

Neural networks have a long history in the development of machine intelligence and statistical learning. Though the neural network model is versatile and in theory can express arbitrary relationships, the naive implementation and application usually lead to poor results, for both analytics and prediction tasks. They slowly gained popularity in many of the application areas only after appropriate techniques or model architectures suitable to the applications had matured. 

### Parameter sharing (spatial): convolutional neural networks

One important limitation of neural network models is the number of _adjustable_ parameters involved in all the steps of computation. Consider one particular step, where there are $m$ input variables and $n$ output ones. Each output variable needs a weighted sum of __all__ the input variables, which entails $m$ weights per output and $m \times n$ weights for the whole step. 

Consider the following example of dealing with data of pixels in pictures.

#### Obtaining hand-written digits images

You can ignore the operations for now; we will introduce techniques of dealing with data storage, loading and preprocessing in a later class.

The deliverable of this section is a data loader: when asked, it will yield an (x, y) pair, where x is a batch of images and y the corresponding digits. By "ask", I mean to iterate through the data, using loop operations such as 
```python
for x, y in data_loader:
    ...
```

In [None]:
from torchvision.datasets import MNIST
from torchvision import transforms

# where to store the data
MY_DATA_DIR = "../data"

DATA_TRANSFORM = transforms.ToTensor()

dataset = MNIST(MY_DATA_DIR, train=True, 
                download=True, # if you don't have it already, download
                transform=DATA_TRANSFORM)

data_loader = torch.utils.data.DataLoader(dataset, batch_size=4)

In [None]:
for x, y in data_loader:
    break # stop after having the first loading
    
print("X has a shape of", x.shape, "y:", y)

In [None]:
# To visualise one data sample, you can activate the code below
%matplotlib inline
if False:
    import matplotlib.pyplot as plt
    i = 0 # you can try to look at different samples.
    plt.imshow(x[i, 0].numpy(), cmap='gray')
    plt.show()

#### Motivating and constructing convolutional networks
Now we have 4 samples in an `x`. Consider one sample for now. It has $28 \times 28=784$ variables (which would be multiplied by 3 for images that have 3 channels for RGB colours). See the figure below.

<img src="ref/hw1.png" width="128"/>

To build a linear layer for the image data, each output variable of the layer needs 784 weights to associate with the pixel values (ignore bias for now). If the layer has $n$ output variables, $784n$ weights are needed to build the layer.

Moreover, to serve as an intermediate processing step, which passes information about the data to subsequent steps, $n$ cannot be too small compared to the original size $784$. So the number of parameters in one layer is of the order of $784^2 \approx 600\text{k}$. A network of a dozen layers can easily contain 5 to 10 million parameters -- and keep in mind that the data in question are $28 \times 28$ gray images! 
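
The count can be checked directly on a single fully connected layer. This is a small sketch assuming a layer with as many outputs as inputs ($n = 784$), which is only one possible choice.

```python
import torch.nn as nn

layer = nn.Linear(in_features=784, out_features=784)
n_weights = layer.weight.numel()   # one weight per (input, output) pair
n_biases = layer.bias.numel()      # one bias per output
print("weights:", n_weights, "biases:", n_biases)
```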

In [None]:
# To be concrete let's make a network just for fun:

class FullNetwork(nn.Module):
    def __init__(self):
        super(FullNetwork, self).__init__()
        self.linear1 = nn.Linear(in_features=784, out_features=1024)
        self.linear2 = nn.Linear(in_features=1024, out_features=1024)
        self.linear3 = nn.Linear(in_features=1024, out_features=1024)
        self.linear4 = nn.Linear(in_features=1024, out_features=10)
        # 10 for 10 different classes as the target of the analysis
        
    def forward(self, x):
        batch_size = x.shape[0]
        h = x.view(batch_size, -1) # [m, 1, 28, 28] -> [m, 784] 
        # "flatten" all pixels to process
        h = nn.functional.relu(self.linear1(h))
        h = nn.functional.relu(self.linear2(h))
        h = nn.functional.relu(self.linear3(h))
        h = nn.functional.log_softmax(self.linear4(h), dim=-1)
        return h
        
        
fnet_model = FullNetwork()
h = fnet_model(x) 
print("10 class likelihood for each image", h.shape)

We roughly count the size of the network by saving the model to disk and checking the file size.

In [None]:
# let's count the size of the network
torch.save(fnet_model.state_dict(), "../data/fullnn.pth")

# I got ~ 11 MB.

The idea of a convolutional neural network is both intuitive and intriguing. Let us take a look at a sample of our raw data, i.e. an image, for the inspiration.

<img src="ref/imagedata.png" width="450px">

The above figure illustrates the computation in a neural network, highlighting the fact that the input data is an image and the variables are the pixels. (For less clutter, we focus on the connections and ignore the sum and map ($\psi$) operations.) One characteristic of data in image format is the spatial structure of the input variables. The idea goes like this: if some output unit is good at representing a meaningful visual feature, it would NOT care where the feature appears. So we may apply the same set of weights and let it run over the image plane.

<img src="ref/conv.png" width="450px">



In this way, we reuse a small number of weights to compute a large group of output variables. E.g. we need only to allocate $3\times3=9$ weights to compute one output variable scanning each $3\times3$ area over the image plane. For a $28\times28$ image, there are $26\times26$ valid positions to apply the computation, which results in 676 outputs using only 9 weights!
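
The weight sharing can be observed with a single convolution operation. The sketch below uses one input channel, one output channel and no bias term, which are simplifying assumptions, and a random tensor in place of a real image.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, bias=False)
image = torch.rand(1, 1, 28, 28)     # [batch, channel, height, width]
out = conv(image)

n_shared_weights = conv.weight.numel()
print("weights:", n_shared_weights, "output shape:", tuple(out.shape))
```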

<span style="color:blue">__Discussion__</span>
Why $26 \times 26$?

The saving in model parameters comes at a cost: the output variables are heavily dependent on each other. Thus we employ multiple groups of output variables in each convolution step. The groups are called "channels", suggesting an analogy to the colour channels in the original image.

Fortunately, modern deep neural network frameworks provide all the operations involved in a convolution computation, including the weighted sum, scanning over image plane and the (usually burdensome and error-prone) backward propagation algorithm. Below is an implementation of a simple network in PyTorch.

In [None]:
# To be concrete let's make a small network (3 convolutional layers + 1 linear layer) just for fun:
class ConvNetwork(nn.Module):
    def __init__(self):
        super(ConvNetwork, self).__init__()
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3)
        self.conv2 = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3)
        self.conv3 = nn.Conv2d(in_channels=32, out_channels=32, kernel_size=3)
        self.linear4 = nn.Linear(in_features=32*3*3, out_features=10)
        
    def forward(self, x):
        batch_size = x.shape[0]
        h = nn.functional.relu(nn.functional.max_pool2d(self.conv1(x), kernel_size=2, stride=2))
        h = nn.functional.relu(nn.functional.max_pool2d(self.conv2(h), kernel_size=2, stride=2))
        h = nn.functional.relu(self.conv3(h))
        h = h.view(batch_size, -1)
        h = nn.functional.log_softmax(self.linear4(h), dim=-1)
        return h
        
        
cv_model = ConvNetwork()
h = cv_model(x) 
print("10 class likelihood for each image", h.shape)

In [None]:
# let's count the size of the network
torch.save(cv_model.state_dict(), "../data/cnn.pth")
# I got ~ 70 KB

### Parameter sharing (temporal): Recurrent neural networks

The neural networks (or any data models) we have encountered so far take an independent view of the processing of the data samples. The processing of a sample $x_5$ has nothing to do with the processing of the preceding ones $x_4, x_3, \dots$. The model is stateless. If you are familiar with how web protocols work, those models work like the HTTP protocol. You can imagine the model as an HTTP server, handling stateless connections. When a client sends data to process, the server processes the data and allows the client to take the results. After one session, the processing server completely forgets the client and its data. 

Such a computation framework may work well for tasks such as recognising an image as one object class or another. However, there are practical tasks that require the data model to have a memory and to investigate temporal relations in the data. For example, if the task is to recognise a football player's strategy on the pitch, it would be necessary to take a video clip and examine the frame images containing the player over a period. Another example is in natural language processing, where the meaning of a word must be put in context for appropriate understanding. Using our analogy above, some tasks need to add the functions of "log-in" or "cookies" to the vanilla HTTP server. 

One solution is to introduce _recurrent_ connections in a network architecture. To be concrete, let's check the following modification of our old network:

<img src="ref/rnn1.png" width="200px">

If you try to program in your mind how to compute this network model and find that the red links appear weird, you have got the point of those "recurrent" links. The red links connect the outputs of processing one sample $x$ to the processing of the next $x'$.

To clarify the idea, let us consider an even simpler model as shown below

<img src="ref/rnn2.png" width="220px">

Some example computation steps performed by the network are
4. $y_1^{(4)} \leftarrow \dots $
5. $y_1^{(5)} \leftarrow \psi(w_{1,1} \cdot x_1^{(5)} + u_{1,1} \cdot y_1^{(4)})$
6. $y_1^{(6)} \leftarrow \psi(w_{1,1} \cdot x_1^{(6)} + u_{1,1} \cdot y_1^{(5)})$
7. ...
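
The steps above can be sketched as a short loop. The weight values, the `tanh` map and the zero initial state are hypothetical choices made only so that the recurrence can be run.

```python
import torch

torch.manual_seed(0)
w11, u11 = 0.8, 0.5            # hypothetical values of w_{1,1} and u_{1,1}
xs = torch.randn(6)            # one input x_1^{(t)} per time step
psi = torch.tanh               # assumed choice of the map

y_prev = torch.tensor(0.0)     # assumed initial state before the first step
ys = []
for t in range(len(xs)):
    y_t = psi(w11 * xs[t] + u11 * y_prev)   # uses the output of the previous step
    ys.append(y_t)
    y_prev = y_t
print(torch.stack(ys))
```

Note that the same two weights are reused at every time step; only the input and the previous output change.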

__NB__: Do not confuse the term "recurrent" here (sometimes loosely written as "recursive") with the "recursive design principle" of generic neural networks, which we discussed above. 

#### Backpropagation through time and practical solutions

One immediate question about those recurrent neural networks (RNN) is how to determine the parameters associated with the temporal connections (red ones in figures above), which link $y^{(t-1)}$ to $y^{(t)}$. Recall the gradient-based approach we learned above: the key is to compute how a small change of a weight, say, $u_{1,1}$ in the figure above, affects the final criterion. The final criterion, e.g. the classification errors or prediction accuracy, involves either the output $y_1$ at all times, or that of the ultimate stage. Therefore, to appreciate the main challenge, please consider the contribution of $u_{1,1}$ to some $y_1^{(10)}$. 

<img src="ref/rnn3.png" width="240px"/>

The above figure partially illustrates how the weight of interest $u_{1,1}$ is involved in the outcome of $y_1^{(10)}$. The effect is multifold: $u_{1,1}$ affects the $y$-variable at an earlier moment, $y_1^{(9)}$, which contributes to $y_1^{(10)}$. Moreover, $y_1^{(9)}$ itself is affected by $u_{1,1}$ through its multiplication with an earlier $y_1^{(8)}$. To account for the multifold effect of a parameter in an RNN, we need to apply the chain rule for computing derivatives through all the steps. This is also implemented using the backpropagation algorithm, which is referred to as "backpropagation through time" (BPTT).

Though theoretically sound, BPTT is unstable in practice. The vanilla version of the algorithm rarely works well beyond toy models. 


Actually, all model parameters have a multifold effect in an RNN, including those that are associated with the $x$-$y$ connections, e.g. $w_{1,1}$ in the above example. A key difference is that we can sum up the influence of a change of $w_{1,1}$ OVER (in parallel) all time steps to conclude the total contribution of that change to the final $y_1^{(10)}$. But for $u_{1,1}$, its effects must be considered THROUGH (sequentially) all time steps.
In the limited steps shown in our figure, when a change $\Delta$ of $u_{1,1}$ happens, $y_1^{(8)}$ will change accordingly. In turn, the change of $y_1^{(8)}$ is multiplied by $u_{1,1}$ again when it affects $y_1^{(9)}$. The cumulative multiplicative effect carries on through the time steps as shown in the next figure.

<img src="ref/rnn4.png" width="360px"/>

Note the $y^{(7)}_1$ is only the first visible output variable in the figure, not the start of the entire computation. 

It is not difficult to see that the gradient computation of those "temporal connection weights" will tend to either overflow or underflow in practical computers. To alleviate the impact of consecutive multiplication, people introduced some modulation techniques on the direct effect of $y^{(t-1)}$ on $y^{(t)}$. Two representative techniques are the "gated recurrent unit" (GRU) and "long short-term memory" (LSTM). 
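
The effect of the repeated multiplication can be seen in a stripped-down recurrence. The sketch assumes a linear update $y^{(t)} = u \cdot y^{(t-1)}$ with no input and no non-linear map, and measures how strongly the output after 50 steps depends on the initial value; `grad_through_time` is a hypothetical helper name.

```python
import torch

def grad_through_time(u, steps=50):
    # linear recurrence y_t = u * y_{t-1}: a deliberately simplified
    # assumption that isolates the repeated product
    y0 = torch.tensor(1.0, requires_grad=True)
    y = y0
    for _ in range(steps):
        y = u * y
    y.backward()
    return y0.grad.item()      # derivative of the last y w.r.t. the first

small = grad_through_time(0.5)
large = grad_through_time(1.5)
print("u=0.5: {:.3e}   u=1.5: {:.3e}".format(small, large))
```

A weight slightly below 1 makes the gradient shrink towards zero, and a weight slightly above 1 makes it grow very large, which is the instability the gated units are meant to reduce.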

Briefly, those modified recurrent units are compound ones, which include multiple types of sub-units. Besides the output variable $y$, there are also auxiliary units that control the "green light" for the information from an early moment to proceed to affect the results of a later moment.

<img src="ref/rnn5.png" width="500px"/>

The figure above sketches a GRU network. Please refer to the further reading section for more information on practical RNNs. The implementation of RNN in `pytorch` is straightforward. E.g.

In [None]:
torch.manual_seed(1)
rnn = nn.GRUCell(input_size=16, hidden_size=8)
# a GRU cell, where the a_t has dimension of 8 and input x_t of 16

In [None]:
x = torch.randn(6, 1, 16) # batch of 1, 6 time steps
h = torch.randn(1, 8)

In [None]:
# Then we can perform the calculation.
output = []
for i in range(6):
    h = rnn(x[i], h)
    output.append(h)
    print("Time step {}, input size {}, input-hidden size {}, output {}"
          .format(i, x[i].shape, h.shape, h))

NB: the first dimension of `x` is now NOT the batch samples but the time steps. The second dimension represents the batch samples, which we set to 1, meaning we are dealing with one sequence at a time.

In [None]:
# Alternatively, all time steps can be done using GRU (no "Cell" in the class name)
torch.manual_seed(1)
rnn = nn.GRU(input_size=16, hidden_size=8)
x = torch.randn(6, 1, 16)
h = torch.randn(1, 1, 8)
all_outputs, last_output = rnn(x, h)

# Please compare the result to the above. Note the random-seed.
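
One way to make the comparison explicit, sketched under the assumption that the two modules are given identical parameters by copying them, is to run `nn.GRU` once over the whole sequence and `nn.GRUCell` step by step, then check that the outputs agree.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)
gru = nn.GRU(input_size=16, hidden_size=8)
cell = nn.GRUCell(input_size=16, hidden_size=8)

# copy the parameters so that both modules compute the same function
with torch.no_grad():
    cell.weight_ih.copy_(gru.weight_ih_l0)
    cell.weight_hh.copy_(gru.weight_hh_l0)
    cell.bias_ih.copy_(gru.bias_ih_l0)
    cell.bias_hh.copy_(gru.bias_hh_l0)

x = torch.randn(6, 1, 16)     # [time steps, batch, features]
h0 = torch.randn(1, 1, 8)     # [layers, batch, hidden]

all_outputs, last_output = gru(x, h0)

h = h0[0]                     # GRUCell expects [batch, hidden]
step_outputs = []
for t in range(x.shape[0]):
    h = cell(x[t], h)
    step_outputs.append(h)
step_outputs = torch.stack(step_outputs)   # [time steps, batch, hidden]

print(torch.allclose(all_outputs, step_outputs, atol=1e-5))
```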

### Skip connection

<span style="color:gray">__brief topic__</span>

This technique incorporates the units in lower stages (early computation steps) into the final output via a more direct route. The rationale behind this idea is that the error messages (review the backpropagation part) can be passed more readily to those units, which facilitates the training of the deep neural network.

The implementation of the technique is straightforward -- just add the input $x$ to the activation of a later layer. Of course, this design introduces a new restriction that the number of units in the layer to which we "short-cut" the $x$ from a lower layer must remain the same as that of $x$. 

In [None]:
# Simple implementation of 
# x -> linear(x) -> y
#   |_______________^
#    short-cut link

x = torch.randn(10, 5)
linear = nn.Linear(5, 5)
y = nn.functional.relu(linear(x)) + x
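
The same short-cut can be wrapped in a reusable module. `SkipBlock` is a hypothetical name used only for this sketch; the final lines illustrate that when the layer's weights and biases are all zero, the block simply passes its input through.

```python
import torch
import torch.nn as nn

class SkipBlock(nn.Module):
    # hypothetical helper, for illustration only
    def __init__(self, n_units):
        super().__init__()
        # in- and out-features must match so that the sum below is defined
        self.linear = nn.Linear(n_units, n_units)

    def forward(self, x):
        return nn.functional.relu(self.linear(x)) + x

block = SkipBlock(5)
x = torch.randn(10, 5)
y = block(x)

# with all weights and biases at zero the block reduces to the identity
with torch.no_grad():
    block.linear.weight.zero_()
    block.linear.bias.zero_()
y_identity = block(x)
print(y.shape, torch.equal(y_identity, x))
```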

### Injecting noises: Dropout

Another often adopted simple yet effective technique during __training__ a neural network is _dropout_. Simply put, to perform the computations for a layer with the dropout mechanism, we randomly set some of its input variables to zero.

The simple technique does not make much sense at first glance. However, the random removal of the input variables implicitly employs an exponentially large ensemble. Examine the figure (excerpt from the original paper) below. 

<img src="ref/dropout.png" width="400px">

In each training iteration, we use one random sparse network out of an exponentially large number of possibilities. All the networks share the weights. At the test stage, the dropout operation is deactivated. We are effectively using the average of all the trained sparse nets for the prediction task.
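
The two modes can be observed on an input of all ones, a toy input assumed here so that the effect is easy to read. In training mode, `nn.Dropout` zeroes entries at random and scales the surviving ones by $1/(1-p)$; in evaluation mode it leaves the input unchanged.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(2, 8)

drop.train()
y_train = drop(x)     # entries are zeroed at random; survivors are scaled by 1/(1-p)

drop.eval()
y_eval = drop(x)      # dropout is inactive at test time
print(y_train)
print(y_eval)
```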

Implementation of dropout in `pytorch` is straightforward.

In [None]:
class LinearD(nn.Module):
    def __init__(self):
        super(LinearD, self).__init__()
        self.linear = nn.Linear(5, 3)
        self.dropout = nn.Dropout(0.5)
        
    def forward(self, x):
        x = self.dropout(x)
        
        return self.linear(x)

In [None]:
torch.manual_seed(1)
lin_mod = LinearD()
x = torch.randn(2, 5)

# fix a random seed so we can repeat which input
# variables are dropped.
torch.manual_seed(42)
print("Train Seed 42", lin_mod(x))

# When we put the model into evaluation (test) mode,
# the dropout layer stops working
lin_mod.eval()
torch.manual_seed(42)
print("Eval Seed 42", lin_mod(x))
torch.manual_seed(52)
print("Eval Seed 52", lin_mod(x))

# Set back to train, and we can reproduce the original
# dropout result.
lin_mod.train()
torch.manual_seed(42)
print("Train Seed 42", lin_mod(x))
# And the network's output is affected by the randomly
# dropped variables. Note that we had not changed the
# network weights.
torch.manual_seed(52)
print("Train Seed 52", lin_mod(x))

### Dealing with data distribution shifting among layers

It is well known that most data models rely on some basic implicit assumptions about the statistics of the input they expect to accept. For example, if a predictor variable $X$ has a mean of 0.5 and a variance of 2.0 during the construction of a data model, while during testing the variable's statistics change to mean=5.5 and variance=20.0, it is unlikely that the model can make good use of $X$ due to the shift of its range.

As to deep neural networks, if we view the bottom layers as data feeders to the layers above, the higher-level layers would face the same data-shifting problem as above. More specifically, consider a layer A which is followed by a layer B. When the parameters of A are updated during training, the statistics of its output change accordingly. This can cause instability in the training of layer B.

So a remedy is to normalise the output of an earlier layer so that the statistics of the outputs, i.e. the inputs to the next layer, are stabilised. However, if we forcefully shift the statistics to zero-mean-unit-variance, it would impose too much restriction on the expressive capacity of the layer. Thus two learnable parameters, one for the mean (a shift) and one for the variance (a scale), are introduced for each output variable of a "normalised" layer.
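
The normalisation can be written out by hand and compared with the framework's layer. This sketch assumes training mode, where the statistics come from the current mini-batch, and made-up input values; `bn.weight` and `bn.bias` are the learnable scale and shift.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
h = 3.0 * torch.randn(10, 3) + 5.0      # outputs of some earlier layer (made up)

bn = nn.BatchNorm1d(num_features=3)     # starts with scale 1 and shift 0
bn.train()
out = bn(h)

# the same computation by hand, using the statistics of the mini-batch
mean = h.mean(dim=0)
var = ((h - mean) ** 2).mean(dim=0)     # variance over the batch dimension
h_hat = (h - mean) / torch.sqrt(var + bn.eps)
manual = bn.weight * h_hat + bn.bias    # learnable scale and shift

print(torch.allclose(out, manual, atol=1e-5))
```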

In [None]:
# The implementation in pytorch is simple.
# Please perform further tests using practical data
class LinearN(nn.Module):
    def __init__(self):
        super(LinearN, self).__init__()
        self.linear = nn.Linear(5, 3)
        self.bn = nn.BatchNorm1d(num_features=3)
        
    def forward(self, x):
        return self.bn(self.linear(x))

torch.manual_seed(1)    
bnl = LinearN()

x  = torch.randn(10, 5)
x[:, 1] += 0.5 # perturb the mean a bit
y = bnl(x)
print("Result")
print(y)
print("Mean")
print(y.mean(dim=0))  # mean over the batch dimension
print("Var")
print(y.var(dim=0, unbiased=False))  # variance over the batch dimension
print(list(bnl.parameters()))
# please compare the variance parameter of the batch-norm layer
# with the mini-batch output's statistics.

In [None]:
list(bnl.parameters())

### Optimiser

One important issue we will not cover in this lecture is the optimisation of the network parameters. We have calculated the direction along which to adjust the model parameters. However, that direction is optimal for improving our criterion only within a very small region around the current parameters. Designing a strategy for applying the adjustments is therefore a complex research area. Some strategies are adaptive and vary over the course of training.
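
The effect of the step size can be seen in a one-parameter toy problem. The sketch below is our own illustration and not part of the course code: it minimises $f(w) = (w-3)^2$ by repeatedly moving against the gradient $2(w-3)$. The function name `gradient_descent` and the step-size argument `lr` are hypothetical names. A small step approaches the minimum at $w=3$, whereas a step that is too large leaves the region where the gradient direction is useful and overshoots further with every iteration.

```python
def gradient_descent(lr, steps=50, w=0.0):
    """Minimise f(w) = (w - 3)^2 with a fixed step size lr (illustrative name)."""
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)  # derivative of (w - 3)^2
        w = w - lr * grad       # move against the gradient
    return w


small_step = gradient_descent(lr=0.1)  # distance to the minimum shrinks each step
large_step = gradient_descent(lr=1.1)  # distance to the minimum grows each step

print("lr=0.1 ->", small_step)
print("lr=1.1 ->", large_step)
```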



Let us look at a popular optimiser as a concrete tool for practical exercises. We write a program that learns the parameters of the convolutional model for classifying hand-written digit images, which we studied above. When creating the optimiser, we pass it the parameters to work on:

```python
optimiser = Adam(access_to_model_parameters, **options)
```
One of the most important options is the learning rate `lr`, a proportionality factor that determines the step size of each training iteration (please review the section on "gradient-based optimisation" in this class). It is set to 0.001 in this example.

In each training step, the optimiser first clears the parameter gradients computed in previous steps by calling `optimiser.zero_grad()`. After the loss (i.e. the main criterion to be _minimised_) is computed, the statement `loss.backward()` populates the `grad` field of each parameter again. Finally, the parameters are updated by calling `optimiser.step()`.
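
To give a rough idea of what happens inside the update step of an adaptive optimiser, here is a simplified scalar sketch of the Adam update rule as it is usually described: running averages of the gradient and of its square, with bias correction. The function `adam_minimise` and its arguments are hypothetical names chosen for this illustration, the default values mirror commonly used settings, and the code is not PyTorch's implementation.

```python
import math


def adam_minimise(grad_fn, w, lr=0.001, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=1000):
    """Scalar sketch of Adam; all names here are illustrative assumptions."""
    m = 0.0  # running average of the gradient
    v = 0.0  # running average of the squared gradient
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        # Bias correction: both averages start at zero
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        # The step is scaled by the gradient history, not by the raw gradient
        w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


# Toy usage: minimise (w - 3)^2, whose gradient is 2 * (w - 3)
w_final = adam_minimise(lambda w: 2.0 * (w - 3.0), w=0.0, lr=0.1, steps=500)
print(w_final)
```

Because the update is divided by the root of the averaged squared gradient, the size of each step adapts as training proceeds, which is one example of a strategy that varies during training.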

In [None]:
from torch.optim import Adam
data_loader = torch.utils.data.DataLoader(dataset, batch_size=4)
conv_mod = ConvNetwork()
optimiser = Adam(conv_mod.parameters(), lr=0.001)

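for batch_idx, (x, y) in enumerate(data_loader):
    optimiser.zero_grad()  # clear gradients from the previous step
    pred = conv_mod(x)
    # nll_loss expects log-probabilities (e.g. the output of log_softmax)
    loss = nn.functional.nll_loss(pred, y)
    loss.backward()  # populate the .grad field of each parameter
    optimiser.step()  # update the parameters
    if batch_idx % 500 == 0:
        # .item() extracts the Python number from the scalar loss tensor
        print("[iteration {}] Loss is {}".format(batch_idx, loss.item()))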

<span style="color:blue">__Discussion__</span>
Discuss the meaning of the "loss" in the output of the previous program. How is it connected to the classification performance of the model?

# Summary

## Take-home points
- Recursive design of neural networks: the bottom-up view and the top-down view.
- Implementation of a multi-layer network.
- What a gradient is.
- Limitations of gradient-based optimisation.
- Back-propagation in training.
    
## Programming skills
- setting up `colab`
- `torch.nn.Module`
- super-call in sub-classes
- class and function strings
- simple optimisation procedure

## Further reading
- The [original dropout paper][8].
- A [detailed interpretation][9] of batch normalisation.
- RNN [in details][7]
- RNN [GRU units][5] (and more!)
- RNN [Course on Language Processing][4]
- RNN [Advanced Tutorial and Application][3]

- Currently, there is a trend of replacing recurrent networks with attention mechanisms, which take into account not only the relations between consecutive samples but also arbitrary contextual relations within a sequence. See the [tutorial][6] for this development.

- PyTorch documentation on optimiser families and learning-rate scheduling.

## Course Project - 1
- Follow [tutorial-1][1] and [tutorial-2][2] to build an image classifier. You may find the skills we will introduce in the next class helpful.

- Alternative objective: choose a dataset that interests you and build a classifier for its prediction task.

[1]:https://pytorch.org/tutorials/beginner/blitz/neural_networks_tutorial.html#sphx-glr-beginner-blitz-neural-networks-tutorial-py

[2]:https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html#sphx-glr-beginner-blitz-cifar10-tutorial-py

[3]:https://www.youtube.com/watch?v=6niqTuYFZLQ

[4]:https://www.youtube.com/watch?v=Keqep_PKrY8

[5]:https://www.coursera.org/lecture/nlp-sequence-models/gated-recurrent-unit-gru-agZiL

[6]:https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html

[7]:http://colah.github.io/posts/2015-08-Understanding-LSTMs/

[8]:https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf

[9]:http://mlexplained.com/2018/01/10/an-intuitive-explanation-of-why-batch-normalization-really-works-normalization-in-deep-learning-part-1/