# Lab 2 - Improving your training through batches, learning-rate and initialisation


## 1. Setup (REMINDER)

1. Login to BC4

    ```
    ssh <your_UoB_ID>@bc4login.acrc.bris.ac.uk
    ```
    
2. Clone the repository

    ```
    git clone "https://github.com/COMSM0018-Applied-Deep-Learning/labsheets.git" ~/labsheets
    ```

3. Change to the lab 2 directory:

    ```
    cd ~/labsheets/Lab_2_Training/
    ```
    
4. Make the ```go_interactive.sh``` and ```tensorboard_params.sh``` files executable using the `chmod` command:

    ```
    chmod +x go_interactive.sh tensorboard_params.sh
    ```
   
5. Switch to interactive mode, and note that your session moves from the login node to a reserved GPU node:

    ```
    ./go_interactive.sh 
    ```
    
6. Run the following script. It will print two values: `ipnport=XXXXX` and `ipnip=XX.XXX.X.X`.

    ```
    ./tensorboard_params.sh
    ```
    
    **Write them down, since we will need them to use TensorBoard.**

7. Train the model using the command:
    
    ```
    python cifar_hyperparam.py
    ```
   
8. Open a **new terminal window**, log in using SSH as in step 1, then run:

    ```
    tensorboard --logdir=logs/ --port=<ipnport>
    ```
    
9. Open a **new terminal window** on your machine and type: 
    
    ```
    ssh -N <USER_NAME>@bc4login.acrc.bris.ac.uk -L 6006:<ipnip>:<ipnport>
    ```

10. Open your web browser (use Chrome; Firefox currently has issues with TensorBoard) and go to port 6006 (http://localhost:6006). This should open TensorBoard, and you can navigate through the summaries that we included.


## 2. Batch-based training

In the first part of this lab session we shall see what batch-based training means.  

### What is a batch?

They say there is no such thing as a stupid question, so let's first make sure we all know what a batch is.  
**A batch is a set of data points**.  
For example, in CIFAR-10, a data point is a single image, while a batch of size 16 is a set of 16 images. 

#### What is a training batch?

It is a set of data points we use to train our network.

#### What is batch-based training?

Batch-based training refers to training a neural network using multiple data points at each step, as opposed to a single data point.  
In practice, virtually every real network is trained this way.

### Hang on a minute, didn't our CNN take a single image as input?

Yes, it does. Our first CNN takes one single image as input and outputs the confidence of the network of recognising that image as one of the 10 objects.  
However, when we train our network, we do it considering a batch of images, i.e. we adjust the parameters of the network by looking at multiple images, as opposed to just one.
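To make this concrete, here is a minimal pure-Python sketch of one parameter update computed from a batch. It is not taken from `cifar_hyperparam.py`: the toy model `y = w * x`, the squared-error loss and the helper names are assumptions chosen for illustration. The only point is that the update uses the average of the per-example gradients.

```python
# Toy model: prediction = w * x, loss per example = (w * x - y) ** 2.
def example_gradient(w, x, y):
    """Gradient of the squared error of one example with respect to w."""
    return 2.0 * (w * x - y) * x


def batch_gradient(w, batch):
    """Average the per-example gradients over a batch of (x, y) pairs."""
    return sum(example_gradient(w, x, y) for x, y in batch) / len(batch)


# Four data points generated by y = 2 * x (so the ideal weight is 2).
batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w = 0.0
learning_rate = 0.01

# One update that looks at the whole batch, rather than at a single point.
w_new = w - learning_rate * batch_gradient(w, batch)
print(w_new)
```

The model itself still maps one input to one output; only the update step looks at several examples at once.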

### Why is batch-based training important?

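A gradient computed from a single image is a noisy estimate of the direction that reduces the loss over the whole training set. Averaging the gradient over a batch gives a more reliable estimate, and processing many images at once makes much better use of the GPU.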

### Let's have a look

Let's have a look now at the code in ``cifar_hyperparam.py``, which is very similar to the code from ``Lab_1``. Amongst the parameters (flags) we set, notice now the following:

```python
tf.app.flags.DEFINE_integer('batch-size', 128)
```
As you'll have already guessed, this parameter controls the batch size, i.e. how many images will be used in a mini-batch to train the network.  
Let's see how much this parameter impacts the accuracy of our model.  
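As a rough illustration of what the batch size controls, the sketch below splits a dataset into mini-batches in plain Python. It is not the input pipeline used in `cifar_hyperparam.py`; `iterate_minibatches` and the toy dataset of integers are hypothetical stand-ins for real images.

```python
import random


def iterate_minibatches(data, batch_size, seed=0):
    """Yield shuffled mini-batches of at most batch_size items from data."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)  # shuffle a copy of the indices
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]


# Stand-in for a dataset: 10 "images" represented by the integers 0..9.
dataset = list(range(10))
batches = list(iterate_minibatches(dataset, batch_size=4))

for batch in batches:
    print(len(batch), batch)
```

A smaller batch size means more parameter updates per pass over the data, each based on fewer images.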

### PRACTICAL 2.1: Adjusting the batch size

1. Train your model **as is** using the current batch size and learning rate, and note the **test accuracy**.
2. Set the batch size to 16, and train your model. What is the effect on the **test accuracy**?
3. Now train with a batch size of 256.
4. Save your logs and the training loss CSV (only for your 256 batch size run):

```
Lab_2_<username>.zip
 |----------logs\ 
            |----------exp_bs_256_lr_0.0001_train
                       |----------events.out.tfevents.xxxxxxxxxx.gpuxx.bc4.acrc.priv
            |----------exp_bs_256_lr_0.0001_validate
                       |----------events.out.tfevents.xxxxxxxxxx.gpuxx.bc4.acrc.priv
 |----------run_exp_bs_256_lr_0.0001_train,tag_Loss.csv
 ```

Discuss with others in the lab what happens to the **test accuracy**.

### To conclude

The following table reports the test accuracy obtained by training the very same model with different batch sizes. You should get similar results.

Batch size | Test accuracy  
---------- | -------------------  
1|0.1749
2|0.1796
4|0.1934
8|0.2171
16|0.24
32|0.2427
64|0.2585
128|0.2886
256|0.3015
512|0.3194
1024|0.314
2048|0.3283
**4096**|**0.3474**

![image.png](imgs/accuracyVsBatchSize.png)

What can we deduce from the above results?

It seems that the larger the mini-batch the better, right?  

**Q. What stops us using the whole training set in every training step?**

**Q. Why do people typically use batch sizes of 32, 64 or 128?**

## 3. Learning rate

The learning rate is a hyperparameter that controls how fast we descend the gradient while we train our models. Recall from our lectures the following:  

$$W_{t+1} = W_t - \eta \nabla J(\textbf{x}, W_t)$$

Where:  

* $W_{t+1}$ are the updated model parameters
* $W_t$ are the model parameters at the previous step
* $\eta$ **is the learning rate**
* $\textbf{x}$ is the input data to the model
* $\nabla J$ is the gradient of the loss function $J$

Depending mainly on the input data and the loss function, the learning rate may have a very strong impact on the overall performance of the network.  
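The effect of $\eta$ is easy to see on a toy problem. The sketch below is plain Python and makes its own assumptions: a one-parameter loss $J(w) = (w - 3)^2$ with its minimum at $w = 3$, 50 update steps, and three arbitrarily chosen learning rates. It applies exactly the update rule above.

```python
def gradient_descent(learning_rate, steps=50, w=0.0):
    """Minimise J(w) = (w - 3) ** 2 with plain gradient descent."""
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)          # dJ/dw
        w = w - learning_rate * grad    # W_{t+1} = W_t - eta * grad
    return w


for eta in (1.1, 0.1, 0.001):
    w_final = gradient_descent(eta)
    print("eta = {:<6} final w = {:.4g}".format(eta, w_final))
```

With the largest rate each step overshoots the minimum and the iterates diverge, with the smallest rate the parameter has barely moved after 50 steps, and the middle rate ends up very close to 3.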

In our code, this parameter is controlled by the following flag:  

```python
tf.app.flags.DEFINE_float('learning-rate', 1e-3)
```

Let's fix the batch size to 128 (a good compromise between speed and accuracy) and let's try to run the same model for the same number of steps (10000) with different learning rates.

### PRACTICAL 2.2: Adjusting the learning rate

1. Set the batch size back to 128 
2. Change learning rate to 1.00E-04
3. Train your model and save the relevant logs

```
Lab_2_<username>.zip
 |----------logs\ 
            |----------exp_bs_128_lr_0.0004_train
                       |----------events.out.tfevents.xxxxxxxxxx.gpuxx.bc4.acrc.priv
            |----------exp_bs_128_lr_0.0004_validate
                       |----------events.out.tfevents.xxxxxxxxxx.gpuxx.bc4.acrc.priv
 |----------run_exp_bs_128_lr_0.0004_train,tag_Loss.csv
 ```


### To Conclude

Learning rate | Test accuracy  
---------- | -------------------  
1.00E-01|0.1
1.00E-02|0.1
1.00E-03|0.251
**1.00E-04**|**0.4977**
1.00E-05|0.4198
1.00E-06|0.2998
1.00E-07|0.1338

![image.png](imgs/accuracyVsLearningRate.png)

Did you notice that with a learning rate of 1.00E-04 we managed to reach 49% accuracy?  
Hang on a second, wasn't 35% our best shot with batch size equal to 4096? Well, yes, but in fact in the previous experiments the learning rate was fixed to 1.00E-03.

Let's now have a look at the loss graphs provided by TensorBoard.  

![lossGraphDifferentLearningRates.png](imgs/lossGraphDifferentLearningRates.png)

The above graph plots the evolution of the loss function over time using the different learning rates we saw before.


**Q: Try to explain why we experience the results and the graphs above.**

**Q: What do you think would happen if we train the model with a fixed small learning rate (e.g. 1.00E-05) and double the number of steps?**


## 4. Decaying the learning rate 

In practice, many gradient descent algorithms work with a non-constant learning rate, that is, the learning rate is lowered over time during training.  
The basic concept behind learning rate decay is that we first descend the gradient more quickly when we start training the model, say when we "still have much to learn", and then slow down as we proceed, since learning "new things" after a while requires more time and care.  
There is indeed an analogy with how we human beings learn life: in the first few years, we learn how to walk, talk, read and write, along with so many other skills, at such a high speed. When we grow older, it takes much more effort to learn anything new (but yes, we [_can_ still teach an old dog new tricks](https://en.wiktionary.org/wiki/you_can%27t_teach_an_old_dog_new_tricks)).  

We are using the Adam optimiser as our gradient-based optimisation algorithm. Adam adapts the step size of each parameter individually, and it is often combined with learning rate decay. However, up until now, we have trained our model using the fixed rate defined by `FLAGS.learning_rate`:

```python
train_step = tf.train.AdamOptimizer(FLAGS.learning_rate).minimize(cross_entropy)
```

It's quite simple to change our code so that we decay the learning rate while training. A commonly used way of lowering the learning rate is to apply an exponential decay as follows:  

$$ decayed\_learning\_rate = learning\_rate \cdot {decay\_rate}^{(global\_step \ / \ decay\_steps)} $$

There is a function in TensorFlow to compute the decayed learning rate as defined above:

```python
tf.train.exponential_decay(
    learning_rate,
    global_step,
    decay_steps,
    decay_rate
)
```

### PRACTICAL 2.3: Learning rate decay in `cifar_hyperparam.py`


1. Find the line in your code that sets the learning rate 

    ```python
    train_step = tf.train.AdamOptimizer(FLAGS.learning_rate).minimize(cross_entropy)
    ```
    
2. Comment this line out, and modify it to use the `decayed_learning_rate` variable as follows:

    ```python
    global_step = tf.Variable(0, trainable=False)  # this will be incremented automatically by tensorflow
    decay_steps = 1000  # number of steps over which the learning rate shrinks by a factor of decay_rate
    decay_rate  = 0.8  # the base of our exponential for the decay

    decayed_learning_rate = tf.train.exponential_decay(FLAGS.learning_rate, global_step, decay_steps, decay_rate)

    #train_step = tf.train.AdamOptimizer(FLAGS.learning_rate).minimize(cross_entropy)
    train_step = tf.train.AdamOptimizer(decayed_learning_rate).minimize(cross_entropy, global_step=global_step)
    ```

Note the new argument `global_step=global_step` for the `minimize()` function: this is how we increment the variable `global_step` after every training iteration.  

**Run the model** (you do not need to save any logs at this time)

We can observe a small boost in accuracy.

Let's compare the two loss functions (blue is with learning rate decay, orange is without):

![lossGraphLearningRateDecay.png](imgs/lossGraphLearningRateDecay.png)

  
* **Q: What can we deduce from the above graphs?**

## 5. Batch normalisation

The distribution of each layer's input, i.e. either the input data itself or the output from intermediate layers, plays a crucial role when training (especially) deep neural networks. The main issue is that the distribution of the input data (recall we are using batch-based training) - and thus the distribution of the input to the following layers - varies while we train and while we learn the optimal parameters for our task. This phenomenon is typically referred to as _internal covariate shift_.

It would be helpful for our network if such distributions didn't change much as we observe new input data and generate intermediate layer outputs while training and adjusting the model's parameters.

This is what batch normalisation does by normalising layer inputs. It was proposed by Sergey Ioffe and Christian Szegedy at Google [in this paper](https://arxiv.org/pdf/1502.03167.pdf). Let's see in detail how it works. 

From the paper:  

>By fixing the distribution of the layer inputs as the training progresses, we expect to improve the training speed. It has been long known that the network training converges faster if its inputs are whitened – i.e., linearly transformed to have zero means and unit variances, and decorrelated. As each layer observes the inputs produced by the layers below, it would be advantageous to achieve the same whitening of the inputs of each layer.  

Let $\text{x} = (\textit{x}^{(1)}, \dots, \textit{x}^{(d)})$ be a d-dimensional input vector to a certain layer in our network. We have to normalise each scalar feature independently by making it have mean 0 and variance 1. We do this for each dimension $k$:

$$
\bar{x}^{(k)} = \frac{x^{(k)} - \text{E}[x^{(k)}]}{\sqrt{\text{Var}[x^{(k)}] + \epsilon}}
$$

Where the expectation and variance are computed **over the training batch**, and $\epsilon$ is a small constant added for numerical stability. Let's quote the paper again:

>Such normalization speeds up convergence, even when the features are not decorrelated. Note that simply normalizing each input of a layer may change what the layer can represent. To address this, we make sure that the transformation inserted in the network can represent the identity transform.

To do so, we add for each dimension $k$ a pair of new parameters $\gamma ^{(k)}, \beta ^{(k)}$. The output of the layer to which we are feeding the normalised vector $\bar{\text{x}}$ would then be

$$
y^{(k)} = \gamma ^{(k)} \bar{x}^{(k)} + \beta ^{(k)}
$$

These new parameters $\gamma ^{(k)}, \beta ^{(k)}$ will be learned by the network during training along with the original parameters.  
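The two formulas above can be sketched in a few lines of plain Python. This is an illustration only, not the TensorFlow code used in the lab: the function name `batch_norm`, the toy batch and the value `epsilon=1e-3` are assumptions.

```python
import math


def batch_norm(batch, gamma, beta, epsilon=1e-3):
    """Normalise each feature over the batch, then scale and shift.

    batch is a list of examples; each example is a list of d features.
    gamma and beta are lists of d per-feature parameters.
    """
    n = len(batch)
    d = len(batch[0])
    # Per-feature mean and (biased) variance computed over the batch.
    mean = [sum(x[k] for x in batch) / n for k in range(d)]
    var = [sum((x[k] - mean[k]) ** 2 for x in batch) / n for k in range(d)]
    return [
        [gamma[k] * (x[k] - mean[k]) / math.sqrt(var[k] + epsilon) + beta[k]
         for k in range(d)]
        for x in batch
    ]


# A batch of 4 examples with d = 2 features on very different scales.
batch = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]]
normalised = batch_norm(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])

for row in normalised:
    print(["{:.3f}".format(v) for v in row])
```

With $\gamma = 1$ and $\beta = 0$ both features end up with mean 0 and variance close to 1, even though their original scales differ by a factor of 100. Other values of $\gamma$ and $\beta$ rescale and shift the normalised features.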

### PRACTICAL 2.4: Batch Normalisation in `cifar_hyperparam.py`

1\. Use learning decay code from PRACTICAL 2.3

2\. Set the learning rate to 1.00E-03

3\. Set the batch size to 256

4\. Change the log file names to reflect that you are doing *batch normalisation*

```python
# run_log_dir = os.path.join(FLAGS.log_dir, 'exp_bs_{bs}_lr_{lr}'.format(bs=FLAGS.batch_size, lr=FLAGS.learning_rate))
run_log_dir = os.path.join(FLAGS.log_dir, 'exp_BN_bs_{bs}_lr_{lr}'.format(bs=FLAGS.batch_size, lr=FLAGS.learning_rate))
```
    
**Note: You can change the name of your log files for different runs by changing this variable**
    
5\. Implement Batch Normalisation for the first convolutional layer.

Let's first see what we have so far:

```python
with tf.variable_scope("Conv_1") as scope:
    W_conv1 = weight_variable([5, 5, FLAGS.img_channels, 32])
    b_conv1 = bias_variable([32])
    h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)

    # Pooling layer - downsamples by 2X.
    h_pool1 = max_pool_2x2(h_conv1)
```


Quite simply:

* We first apply the convolution filter to the input images: `conv2d(x_image, W_conv1)`
* We add the bias: `conv2d(x_image, W_conv1) + b_conv1`
* We feed the result of the convolution plus the bias to the ReLU activation function: `tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)`
* Finally, the ReLU output is max-pooled: `h_pool1 = max_pool_2x2(h_conv1)`

Let's see the steps we need to follow to implement batch normalisation:

* We need to calculate the mean and the variance on the training batch.
* We need to normalise the data using the mean and variance.
* We need to feed the normalised data to the ReLU function. 
* We need to learn the new parameters $\gamma$ and $\beta$.

**Note:** The bias `b_conv1` should be omitted: in fact, the effect of this bias would be eliminated when subtracting the batch mean. Instead, the role of the bias will be performed by the new $\beta$ variable.
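To see why the bias can be dropped, here is a short worked step. Write $z$ for one output of the convolution and $b$ for its bias, and assume $b$ is the same for every example in the batch (a sketch under that assumption):

$$
(z + b) - \text{E}[z + b] = z + b - \text{E}[z] - b = z - \text{E}[z], \qquad \text{Var}[z + b] = \text{Var}[z]
$$

Both the numerator and the denominator of the normalisation are unchanged by $b$, so the normalised value is the same with or without it.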

**Hint n.1:** you can calculate mean and variance with tensorflow like this:

```python    
    batch_mean1, batch_var1 = tf.nn.moments(z1, [0])
```

Where `z1` is the result of applying the convolution, i.e. the result of `conv2d(x_image, W_conv1)`.

**Hint n.2:** To learn the new parameters $\gamma$ and $\beta$, we just have to create new tensorflow variables:

```python
gamma1 = tf.Variable(tf.ones([dimension]))   # scale starts at 1
beta1 = tf.Variable(tf.zeros([dimension]))   # shift starts at 0
```

Tensorflow will cleverly add them to the parameters we want to learn. Of course, you have to work out what `dimension` is!


The code below illustrates how we can do it.

```python
with tf.variable_scope("Conv_1"):
    W_conv1 = weight_variable([5, 5, FLAGS.img_channels, 32])

    # Note that the pre-batch-normalisation bias 'b_conv1' is omitted
    # Instead, the role of the bias is performed by the new beta variable
    z1 = conv2d(x_image, W_conv1)

    # Calculate batch mean and variance
    batch_mean1, batch_var1 = tf.nn.moments(z1, [0])

    # Small constant for numerical stability (1e-3 is an assumed value)
    epsilon = 1e-3

    # Apply the initial batch normalising transform
    z1_hat = (z1 - batch_mean1) / tf.sqrt(batch_var1 + epsilon)

    # Create two new parameters, scale and shift (gamma and beta)
    gamma1 = tf.Variable(tf.ones([32]))
    beta1 = tf.Variable(tf.zeros([32]))

    # Scale and shift to obtain the final output of the batch normalisation
    # this value is fed into the activation function (here a ReLU)
    BN1 = gamma1 * z1_hat + beta1        
    h_conv1_bn = tf.nn.relu(BN1) 

    # Pooling layer - downsamples by 2X.
    h_pool1 = max_pool_2x2(h_conv1_bn)
```

6\. Let's use it now for our second layer.


Of course, there is a function in TensorFlow that helps us a bit! The function is the following:

```python
tf.nn.batch_normalization(
    x, # the input data
    mean, # the mean calculated over the batch
    variance, # the variance calculated over the batch
    offset, # the beta parameter
    scale, # the gamma parameter
    variance_epsilon # the epsilon constant for numerical stability
)
```

The function normalises the data ```x``` given ```x```'s mean and variance, along with the new parameters $\gamma$ and $\beta$.

**Hint:** We need to create new $\gamma$ and $\beta$ parameters for this layer, which will have a different dimension with respect to the first layer's ```gamma1, beta1```.

For the second convolutional layer, this is what our code would look like using the TensorFlow helper function for batch normalisation:

```python
with tf.variable_scope("Conv_2"):
    # Second convolutional layer -- maps 32 feature maps to 64.
    W_conv2 = weight_variable([5, 5, 32, 64])
    
    z2 = conv2d(h_pool1, W_conv2)
    
    # Calculate batch mean and variance
    batch_mean2, batch_var2 = tf.nn.moments(z2, [0])
    
    # Create two new parameters, scale and shift (gamma and beta)
    gamma2 = tf.Variable(tf.ones([64]))
    beta2 = tf.Variable(tf.zeros([64]))
    
    # Using Tensorflow built-in BN function this time
    BN2 = tf.nn.batch_normalization(z2, batch_mean2, batch_var2, beta2, gamma2, epsilon)
    
    h_conv2_bn = tf.nn.relu(BN2) 
    
    # Second pooling layer
    h_pool2 = max_pool_2x2(h_conv2_bn)
```

7\. Train your model now with batch normalisation

8\. Save your logs into a new directory ```logs_BN```

```
Lab_2_<username>.zip
 |----------logs\ 
            |----------exp_BN_bs_256_lr_0.0003_train
                       |----------events.out.tfevents.xxxxxxxxxx.gpuxx.bc4.acrc.priv
            |----------exp_BN_bs_256_lr_0.0003_validate
                       |----------events.out.tfevents.xxxxxxxxxx.gpuxx.bc4.acrc.priv
 |----------run_exp_BN_bs_256_lr_0.0003_train,tag_Loss.csv
 ```


After implementing batch normalisation I got **75%** accuracy with learning rate 1.00E-03 and 10000 iterations! That's a 20% boost in accuracy! Quite nice, isn't it? 

## 6. Parameter initialisation (optional)

**NOTE: This is an optional part of the lab and is not required for your portfolio**

Another decisive element amongst the plethora of factors in training a CNN is parameter initialisation, that is how we assign initial values to the network's weights. What we have done so far was to initialise our parameters taking random values sampled from a [truncated normal distribution](https://en.wikipedia.org/wiki/Truncated_normal_distribution). In our code, we use the following snippet:

```python
def weight_variable(shape):
  """weight_variable generates a weight variable of a given shape."""
  initial = tf.truncated_normal(shape, stddev=0.1)
  return tf.Variable(initial, name='weights')
```

Where the function ```tf.truncated_normal()``` is defined as follows:

```python
tf.truncated_normal(
    shape, # shape of the weight matrix we want to generate
    mean=0.0, # mean to generate the distribution
    stddev=1.0, # the standard deviation to generate the distribution    
)
```

For our bias parameters, instead, we initialised them to a constant value:

```python
def bias_variable(shape):
  """bias_variable generates a bias variable of a given shape."""
  initial = tf.constant(0.1, shape=shape)
  return tf.Variable(initial, name='biases')
```

Let's try something silly now, just to realise how important parameter initialisation is here.

**Q: What happens if we initialise our weights with zeros or ones?**

If we run our CNN with these silly initial weights, our accuracy will drop down to 10%, crippling our model and reducing it to a random classifier - a random classifier on 10 classes has a 1/10=10% chance. 

```python
def weight_variable(shape):
  """weight_variable generates a weight variable of a given shape."""
  #initial = tf.truncated_normal(shape, stddev=0.1)
  initial = tf.zeros(shape) # silly zeros
  #initial = tf.ones(shape) # silly ones
  return tf.Variable(initial, name='weights')
```
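One way to see the problem is with a toy network small enough to differentiate by hand. The sketch below is plain Python and is not the lab's CNN: it assumes a hypothetical linear network with two hidden units and no biases, `y = v . (W x)`, trained on a single example with a squared-error loss. The real model has biases, ReLUs and a softmax, so the details differ, but the symmetry argument is the same.

```python
def gradients(W, v, x, target):
    """Gradients of 0.5 * (y - target) ** 2 for y = v . (W x)."""
    h = [sum(w_ji * x_i for w_ji, x_i in zip(row, x)) for row in W]
    y = sum(v_j * h_j for v_j, h_j in zip(v, h))
    err = y - target
    grad_W = [[err * v_j * x_i for x_i in x] for v_j in v]
    grad_v = [err * h_j for h_j in h]
    return grad_W, grad_v


x, target = [1.0, 2.0], 1.0

# All-zero initialisation: every gradient is zero, so nothing is ever learned.
zero_grad_W, zero_grad_v = gradients([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0], x, target)
print(zero_grad_W, zero_grad_v)

# All-ones initialisation: both hidden units receive identical gradients,
# so they stay identical copies of each other after every update.
ones_grad_W, ones_grad_v = gradients([[1.0, 1.0], [1.0, 1.0]], [1.0, 1.0], x, target)
print(ones_grad_W, ones_grad_v)
```

Random initial weights break this symmetry, which is why the hidden units can go on to learn different features.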


### Xavier parameter initialisation

There are a few strategies to generate initial parameters out there. One of the most used is **Xavier**.

This initialiser is designed to keep the scale of the gradients roughly the same in all layers. Weights can be generated using either a uniform or a normal distribution. For the uniform distribution, given a layer $l$ with size $s(l)$, its parameters will be initialised uniformly from the following interval:

$$
\Bigg[-\frac{\sqrt{6}}{\sqrt{s(l) + s(l+1)}}, \frac{\sqrt{6}}{\sqrt{s(l) + s(l+1)}} \Bigg]
$$

Where $s(l+1)$ is the size of the following layer $l+1$. Note that the size of a layer corresponds in this case to the number of columns of its weight matrix. For the normal distribution, initial weights for a layer $l$ are sampled from a zero-mean normal distribution whose variance depends on the sizes of layers $l$ and $l+1$:

$$
\text{Var}[W(l)] = \frac{2}{s(l) + s(l+1)}
$$
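The two formulas describe the same spread: a uniform distribution on $[-a, a]$ has variance $a^2/3$, and with $a = \sqrt{6}/\sqrt{s(l) + s(l+1)}$ this is exactly $2/(s(l) + s(l+1))$. The plain-Python sketch below checks this numerically; the helper names and the layer sizes 256 and 128 are arbitrary choices for illustration, not values from the lab code.

```python
import math
import random


def xavier_uniform_limit(fan_in, fan_out):
    """Half-width of the Xavier uniform interval [-limit, limit]."""
    return math.sqrt(6.0) / math.sqrt(fan_in + fan_out)


def xavier_variance(fan_in, fan_out):
    """Target variance of the Xavier normal distribution."""
    return 2.0 / (fan_in + fan_out)


fan_in, fan_out = 256, 128  # example layer sizes, chosen arbitrarily
limit = xavier_uniform_limit(fan_in, fan_out)

# Draw weights uniformly from [-limit, limit] and measure their variance.
rng = random.Random(0)
weights = [rng.uniform(-limit, limit) for _ in range(100000)]
mean = sum(weights) / len(weights)
sample_var = sum((w - mean) ** 2 for w in weights) / len(weights)

print(limit, sample_var, xavier_variance(fan_in, fan_out))
```

The measured variance of the uniformly sampled weights should come out close to the target variance of the normal variant.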

For a comprehensive explanation of why these magic numbers provide a good parameter initialisation, I recommend you read [the original paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf).

In TensorFlow, using Xavier initialisation is pretty straightforward. All we need to do is to use the initialiser object:

```python
# the initialiser object implementing Xavier initialisation
# we will generate weights from the uniform distribution
xavier_initializer = tf.contrib.layers.xavier_initializer(uniform=True)
```
Then, in our custom functions generating initial weights we simply do:

```python
def weight_variable(shape):
  """weight_variable generates a weight variable of a given shape."""
  #initial = tf.truncated_normal(shape, stddev=0.1)
  #initial = tf.zeros(shape)
  #initial = tf.ones(shape)
      
  #return tf.Variable(initial, name='weights')
  return tf.Variable(xavier_initializer(shape)) # using Xavier initialisation

def bias_variable(shape):
  """bias_variable generates a bias variable of a given shape."""
  #initial = tf.constant(0.1, shape=shape)
  #return tf.Variable(initial, name='biases')

  return tf.Variable(xavier_initializer(shape)) # using Xavier initialisation
```

Now there is no need to have different functions since both do the same thing.

By running our model with Xavier initialisation for 10,000 steps, with batch normalisation and (decaying) learning rate 1.00E-03, we reach 75% accuracy on the test set! Another improvement, yay! 

**What happens if we use the normal distribution for the Xavier initialisation? Try it yourself!**

## 7. Preparing Lab_2 Portfolio

You should by now have the following files, which you can zip under the name `Lab_2_<username>.zip` 

**From your logs, include only the TensorBoard summaries and remove the checkpoints (model.ckpt-* files)**

 ```
 Lab_2_<username>.zip
 |----------logs\ 
            |----------exp_bs_16_lr_0.0001_train
                       |----------events.out.tfevents.xxxxxxxxxx.gpuxx.bc4.acrc.priv
            |----------exp_bs_16_lr_0.0001_validate
                       |----------events.out.tfevents.xxxxxxxxxx.gpuxx.bc4.acrc.priv
            |----------exp_bs_256_lr_0.0001_train
                       |----------events.out.tfevents.xxxxxxxxxx.gpuxx.bc4.acrc.priv
            |----------exp_bs_256_lr_0.0001_validate
                       |----------events.out.tfevents.xxxxxxxxxx.gpuxx.bc4.acrc.priv
            |----------exp_bs_128_lr_0.0004_train
                       |----------events.out.tfevents.xxxxxxxxxx.gpuxx.bc4.acrc.priv
            |----------exp_bs_128_lr_0.0004_validate
                       |----------events.out.tfevents.xxxxxxxxxx.gpuxx.bc4.acrc.priv
            |----------exp_BN_bs_256_lr_0.0003_train
                       |----------events.out.tfevents.xxxxxxxxxx.gpuxx.bc4.acrc.priv
            |----------exp_BN_bs_256_lr_0.0003_validate
                       |----------events.out.tfevents.xxxxxxxxxx.gpuxx.bc4.acrc.priv
 |----------run_exp_bs_16_lr_0.0001_train,tag_Loss.csv
 |----------run_exp_bs_256_lr_0.0001_train,tag_Loss.csv
 |----------run_exp_bs_128_lr_0.0004_train,tag_Loss.csv
 |----------run_exp_BN_bs_256_lr_0.0003_train,tag_Loss.csv
 ```
 
Store this zip safely. You will be asked to upload all your lab portfolios to **SAFE after Week 10** - check SAFE for deadline details.