## 7.310 Advanced architecture

A few advanced architecture patterns underpin state-of-the-art DL models

1. ##### Residual connections

Explained in 7.140

2. ##### Batch normalisation

The output from a layer might have some large values - and these will dominate training

Batch normalisation is like data preprocessing at a hidden layer

=>  normalise layer output to quench unusually high values 

BN operates by keeping an exponential moving average of the batch mean and variance during training, which is then used to normalise at inference time

Assists the optimiser and allows deeper networks

Some very deep networks would be impossible to train without batch normalisation 

The idea, therefore, is to normalise between layers:

```
model.add(layers.Conv2D(32, 3, activation='relu'))
model.add(layers.BatchNormalization())
...
```

The application to dense layers is identical
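
The normalisation step itself can be sketched in NumPy. This is a minimal illustration, not the Keras implementation: the class name `BatchNormSketch`, the `momentum` and `eps` values, and the fixed scale and shift parameters are all assumptions made for the example.

```python
import numpy as np


class BatchNormSketch:
    """Illustrative batch normalisation for a (batch, features) array."""

    def __init__(self, n_features, momentum=0.99, eps=1e-3):
        self.gamma = np.ones(n_features)    # scale, learnable in a real layer
        self.beta = np.zeros(n_features)    # shift, learnable in a real layer
        self.moving_mean = np.zeros(n_features)
        self.moving_var = np.ones(n_features)
        self.momentum = momentum
        self.eps = eps

    def __call__(self, x, training):
        if training:
            # Normalise with the statistics of the current batch.
            mean, var = x.mean(axis=0), x.var(axis=0)
            # Update the exponential moving averages used at inference time.
            m = self.momentum
            self.moving_mean = m * self.moving_mean + (1 - m) * mean
            self.moving_var = m * self.moving_var + (1 - m) * var
        else:
            mean, var = self.moving_mean, self.moving_var
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta


# Synthetic layer output with unusually large values (mean 50, std 10).
rng = np.random.default_rng(0)
activations = rng.normal(loc=50.0, scale=10.0, size=(256, 4))

bn = BatchNormSketch(n_features=4)
normalised = bn(activations, training=True)
```

After the call, each feature column of `normalised` has roughly zero mean and unit standard deviation, so no single large activation dominates the next layer's input.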

3. ##### Depthwise separable convolution

A replacement for `Conv2D` that is

- lighter (fewer trainable weights)
- often a few percentage points better on the task

The idea is to apply a spatial convolution to each input channel independently, under the assumption that
- spatial locations are highly correlated
- different channels are fairly independent

The per-channel outputs are then mixed by a pointwise (1 × 1) convolution

```
model.add(layers.SeparableConv2D(64, 3, activation='relu'))
model.add(layers.SeparableConv2D(128, 3, activation='relu'))
model.add(layers.MaxPooling2D(2))
```

Depthwise separable convolutions are the basis of the high-performing Xception architecture
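
A rough sense of the saving in trainable weights comes from counting parameters. The sketch below assumes a depth multiplier of 1 and one bias per output channel. The helper names are hypothetical, and the 3 × 3 layer mapping 64 channels to 128 is chosen to match the second separable layer in the snippet above.

```python
def conv2d_params(kernel, c_in, c_out):
    """Standard convolution: one k x k x c_in filter per output channel, plus biases."""
    return kernel * kernel * c_in * c_out + c_out


def separable_conv2d_params(kernel, c_in, c_out):
    """Depthwise separable convolution, assuming a depth multiplier of 1."""
    depthwise = kernel * kernel * c_in    # one k x k spatial filter per input channel
    pointwise = c_in * c_out              # 1 x 1 convolution that mixes the channels
    return depthwise + pointwise + c_out  # plus one bias per output channel


standard = conv2d_params(3, 64, 128)
separable = separable_conv2d_params(3, 64, 128)
print(standard, separable)  # 73856 8896
```

Under these assumptions the separable layer needs roughly an eighth of the weights of the equivalent `Conv2D` layer.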

----

## 7.320 Hyperparameter optimisation

DL engineers face many seemingly arbitrary decisions:

- How many layers?
- The size of each layer
- Activation?
- Batch normalise?
- How much dropout?
- Learning rate?
- etc

Apart from intuition, how do they make these decisions?

Is there a way of exploring the hyperparameter space and the space of hypothesis spaces?

Hyperparameter optimisation proceeds as a loop:

1. Choose a set of hyperparameters
2. Build the model
3. Fit to training data and measure performance on validation data
4. If good enough
    - Evaluate on test data
    - Stop
5. Go to 1.

The particular optimisation scheme determines how step 1 is carried out: Bayesian optimisation, genetic algorithms, random search, etc.

But not gradient descent!

Why not?

The hyperparameter space is discrete in some dimensions (e.g. number of layers) - so the loss function on this space is typically neither continuous nor differentiable

Furthermore, the process is computationally very expensive, since every point evaluated requires training a model from scratch
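
The five-step loop can be sketched with random search as the scheme for step 1. Everything named here is an assumption made for illustration: the search space, the function names, and in particular `validation_score`, which is a toy stand-in for building and fitting a real model.

```python
import random

# Hypothetical search space; names and values are illustrative only.
SEARCH_SPACE = {
    "n_layers": [1, 2, 3, 4],
    "units": [16, 32, 64, 128],
    "dropout": [0.0, 0.25, 0.5],
    "learning_rate": [1e-4, 1e-3, 1e-2],
}


def sample_config(rng):
    """Step 1: choose a set of hyperparameters (here, uniformly at random)."""
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}


def validation_score(config):
    """Placeholder for steps 2 and 3.

    A real version would build the model, fit it to the training data and
    return a validation metric. This toy function only makes the loop runnable.
    """
    return -abs(config["n_layers"] - 3) - abs(config["dropout"] - 0.25)


def random_search(n_trials, target, seed=0):
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):          # step 5: go back to step 1
        config = sample_config(rng)
        score = validation_score(config)
        if score > best_score:
            best_config, best_score = config, score
        if best_score >= target:       # step 4: good enough, so stop
            break
    # The test data would be used once, on the returned configuration only.
    return best_config, best_score


best_config, best_score = random_search(n_trials=50, target=0.0)
```

Swapping in a Bayesian or genetic scheme would change only `sample_config`, which would then take the history of earlier trials into account.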

Feature engineering, a costly and difficult human process, has been replaced by deep learning - features are automatically tuned by the feedback signal and not by hand

We can hope the same speed-up will happen for hyperparameter engineering

----

## 7.330 Model ensembling

Ensembling means pooling predictions from a number of independent models

The idea is based on the observation that any model can only grasp part of the truth, but a suitably *diverse* set of models might access much of the truth 

Ensembling generally produces the most competitive ML models

Pooling can proceed by averaging model predictions 

But there is a disadvantage - a poor model can worsen the average, even dragging it below the performance of the best model

The predictions can be weighted, with higher weights to better models

The weights can even be automatically optimised
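
A minimal sketch of the two pooling schemes follows. The prediction values and validation accuracies are hypothetical numbers invented for the example, and using validation accuracy directly as the weight is only one simple choice among many.

```python
def weighted_average(predictions, weights):
    """Pool per-model predictions for the same samples using normalised weights."""
    total = sum(weights)
    n_samples = len(predictions[0])
    return [
        sum(w * p[i] for w, p in zip(weights, predictions)) / total
        for i in range(n_samples)
    ]


# Hypothetical predicted probabilities from three models on four samples.
preds_a = [0.9, 0.2, 0.8, 0.1]
preds_b = [0.8, 0.3, 0.7, 0.2]
preds_c = [0.4, 0.6, 0.5, 0.5]   # a weaker model

# Hypothetical validation accuracies, used here as the weights.
val_accuracy = [0.92, 0.90, 0.70]

all_preds = [preds_a, preds_b, preds_c]
uniform = weighted_average(all_preds, [1, 1, 1])      # plain average
weighted = weighted_average(all_preds, val_accuracy)  # better models count more
```

In this example the weaker model pulls the plain average further from the other two models' predictions than it pulls the weighted one.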

The models in the ensemble need to be as diverse as possible

Avoid ensembling several runs of the same model that differ only in initialisation or in the order of exposure to the data, since such runs are not diverse

A model in the ensemble might be comparatively poor and have low weight, but can nevertheless make a quantitative difference to the overall prediction because it was distant from the other models and provided exclusive information

The concern is for a diversity of models rather than how well the best model performs

----

## 7.340 Wrapping up

- High-performing convnets will utilise one or more of
    - residual connections
    - batch normalisation
    - depthwise separable convolutions


    
- DL engineering requires exploration in hyperparameter space
    - currently guided by intuition
    - automation is desirable
    - human or automatic search should be systematic
    - random search might be the only feasible option for extremely unstructured spaces

- Ensembling with a well-weighted average is very powerful
    - component models should be as dissimilar as possible