# Week 2 Notes

## Why Look At Case Studies

Case studies of famous networks are useful because there has usually been extensive research on them, and implementations are widely available in the open source community.

Engineers can use these well known network architectures with their implementations, or fine-tune/modify trained versions to fit their problems and datasets.

Many days or months of training are often needed to get the weight parameters of these networks. By modifying these pre-trained networks (by transfer learning, for example), fast progress can be made.

Also, there is a level of trust in their implementations, as many researchers will have investigated the implementations and their performance.

A good way to learn to make convnets is to see what others have built before. These case studies link to seminal papers and are intellectually satisfying.

Many of the ideas here have been, and can be, used for applications in areas other than computer vision.

## Classic Networks

There are three classic networks that have had significant impact on the computer vision community. These are:

1. LeNet-5

2. AlexNet

3. VGG-16

We will look at their architecture and link to open source implementations.

### LeNet-5

The LeNet-5 architecture addresses the handwritten digit classification problem of the MNIST dataset. It was one of the first demonstrations of highly competitive performance with convolutional neural networks.

The MNIST database consists of 28 by 28 pixel grayscale images (1 channel); LeNet-5 takes them padded to 32 by 32, giving a 32x32x1 input. The MNIST database contains 60,000 training images and 10,000 testing images.

#### Architecture

Counting each convolution and pooling pair as one layer, the network has 4 hidden layers and an output layer, making it a 5 layer network - hence the name LeNet-5 (the "Le" comes from the first author, Yann LeCun).

![](LeNet_architecture.png)

There are two convolutional layers - each made up of a convolution combined with an average pooling. These are followed by two fully connected layers that link to a 10 output final layer.

Note that the number of filters is also called the number of feature maps. This is related to, but not the same as, the number of channels.

Note this from [samuel](https://github.com/vlfeat/matconvnet/issues/457):


>The number of feature channels and the number of filters refer to different things.

>Suppose X is an input with size W x H x D x N (where N is the size of the batch) to a convolutional layer containing filters F (with size FW x FH x FD x K) in a network .

>The number of feature channels D is the third dimension of the input X here (for example, this is typically 3 at the first input to the network if the input consists of colour images).
The number of filters K is the fourth dimension of F.
The two concepts are closely linked because if the number of filters in a layer is K, it produces an output with K feature channels. So the input to the next layer will have K feature channels.

> There is a detailed explanation in Section 4.1 of the [manual](http://www.vlfeat.org/matconvnet/matconvnet-manual.pdf)
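
To make the distinction concrete, here is a minimal pure-Python sketch of a "valid" convolution (stride 1, no padding, no bias; strictly a cross-correlation, as in most deep learning libraries). The function name `conv2d_valid` and the `(H, W, D)` index order are choices made for this illustration only, not part of any library. The input has D = 2 feature channels, the layer has K = 3 filters, and the output ends up with K = 3 feature channels.

```python
def conv2d_valid(x, filters):
    """Naive 'valid' convolution: stride 1, no padding, no bias.

    x:       nested list indexed x[h][w][d]        -> shape (H, W, D)
    filters: nested list indexed f[k][i][j][d]     -> K filters of shape (FH, FW, FD)
    """
    H, W, D = len(x), len(x[0]), len(x[0][0])
    K, FH, FW = len(filters), len(filters[0]), len(filters[0][0])
    FD = len(filters[0][0][0])
    assert FD == D, "each filter must span all D input channels"
    out_h, out_w = H - FH + 1, W - FW + 1
    out = [[[0.0] * K for _ in range(out_w)] for _ in range(out_h)]
    for k in range(K):
        for r in range(out_h):
            for c in range(out_w):
                out[r][c][k] = sum(
                    x[r + i][c + j][d] * filters[k][i][j][d]
                    for i in range(FH) for j in range(FW) for d in range(D)
                )
    return out


def shape(t):
    """Shape of a 3D nested list."""
    return (len(t), len(t[0]), len(t[0][0]))


# 4 x 4 input with D = 2 feature channels, all ones.
x = [[[1.0, 1.0] for _ in range(4)] for _ in range(4)]
# K = 3 filters, each of size 3 x 3 x 2, all ones.
filters = [[[[1.0, 1.0] for _ in range(3)] for _ in range(3)] for _ in range(3)]

y = conv2d_valid(x, filters)
print(shape(x), "->", shape(y))  # (4, 4, 2) -> (2, 2, 3)
```

The third dimension of the output equals the number of filters, so the next layer's filters would need a depth of 3.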



#### Parameters

##### Arithmetic By Layer Type

For different types of layers, the number of parameters is calculated as:

* Input Layer: 0 parameters.
* Convolutional Layer: (FH * FW * FD + 1) * K parameters (each of the K filters has FH * FW * FD weights plus one bias).
* Average Pooling Layer: 0 parameters.
* Fully Connected Layer:  $(n^{L-1} + 1) \times n^{L}$ parameters (weights plus one bias per output unit).

The tables below provide details. The counts assume that every filter spans all input channels.

| layer_n | H | W | D | Activation Shape (Number Of Neurons) | Activation Size |
|:--:| :--:| :--:| :--:| :--:| :--:|
| 0 | 32| 32| 1| (32, 32, 1)|32 x 32 x 1 = 1024|
| 1 | 28| 28| 6| (28, 28, 6)| 28 x 28 x 6 = 4704|
| 2 | 14| 14| 6| (14, 14, 6)| 14 x 14 x 6 = 1176|
| 3 | 10| 10| 16| (10, 10, 16)| 10 x 10 x 16 = 1600|
| 4 | 5| 5| 16| (5, 5, 16)| 5 x 5 x 16 = 400|
| 5 | 1| 1| 120| (1, 1, 120)| 1 x 1 x 120 = 120|
| 6 | 1| 1| 84| (1, 1, 84)| 1 x 1 x 84 = 84|
| 7 | 1| 1| 10| (1, 1, 10)| 1 x 1 x 10 = 10|


| layer_n | fW x fH x fD | s | p| n_filters (K) | n_parameters |  notes | n_examples |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
|0 | 0 x 0 x 0| 0| 0| 0| 0| Input Image| 60,000|
|1 | 5 x 5 x 1| 1| 0| 6| (5 x 5 x 1 + 1) x 6 = 156| Convolution 1| 60,000|
|2 | 2 x 2 x 6| 2| 0| 6| 0| Average Pooling 1| 60,000|
|3 | 5 x 5 x 6| 1| 0| 16| (5 x 5 x 6 + 1) x 16 = 2,416| Convolution 2| 60,000|
|4 | 2 x 2 x 16| 2| 0| 16| 0| Average Pooling 2| 60,000|
|5 | 5 x 5 x 16| 1| 0| 120| (400 + 1) x 120 = 48,120| Fully Connected 3| 60,000|
|6 | 1 x 1 x 120| 1| 0| 84| (120 + 1) x 84 = 10,164| Fully Connected 4| 60,000|
|7 | 1 x 1 x 84| 1| 0| 10| (84 + 1) x 10 = 850| SoftMax| 60,000|


In total, there are: 156 + 2,416 + 48,120 + 10,164 + 850 = 61,706

Or roughly 60K parameters.
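
As a cross-check of this kind of arithmetic, the sketch below recomputes the activation sizes and parameter counts layer by layer. It assumes that every filter spans all input channels and that each filter and each fully connected unit has one bias; the helper names (`conv_out`, `conv_params`, `fc_params`) are made up for this example.

```python
def conv_out(n, f, s=1, p=0):
    """Output height/width for a window of size f, stride s, padding p."""
    return (n + 2 * p - f) // s + 1


def conv_params(fh, fw, fd, k):
    """fh * fw * fd weights plus one bias, for each of the k filters."""
    return (fh * fw * fd + 1) * k


def fc_params(n_in, n_out):
    """n_in weights plus one bias, for each of the n_out units."""
    return (n_in + 1) * n_out


counts = {}
hw, d = 32, 1                                # 32 x 32 x 1 input

counts["conv1"] = conv_params(5, 5, d, 6)    # 5 x 5 filters, 6 of them
hw, d = conv_out(hw, 5), 6                   # 28 x 28 x 6
hw = conv_out(hw, 2, s=2)                    # average pooling -> 14 x 14 x 6

counts["conv2"] = conv_params(5, 5, d, 16)   # 5 x 5 filters, 16 of them
hw, d = conv_out(hw, 5), 16                  # 10 x 10 x 16
hw = conv_out(hw, 2, s=2)                    # average pooling -> 5 x 5 x 16

flat = hw * hw * d                           # 400 values fed to the first FC layer
counts["fc3"] = fc_params(flat, 120)
counts["fc4"] = fc_params(120, 84)
counts["softmax"] = fc_params(84, 10)

total = sum(counts.values())
print(counts)
print(total)  # 61706
```

Under these assumptions the total comes to 61,706. The original paper connects the second convolution to only a subset of the input feature maps, so its own count for that layer is smaller.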

#### Details

Some details that were popular back then, but less used or relevant now are:

* average pooling

* sigmoid/tanh non-linearities

* non-linearity after pooling

In the original paper, sections 2 and 3 are the most important; the other sections refer to the above and to other ideas that are not widely used anymore - for example graph transformers.

Notice how, generally, the height and width of the activations decrease while the number of channels increases as we move through the layers.

### AlexNet

This network demonstrated the strength of deep learning to the computer vision community in 2012. It produced state of the art results on the heavily studied ImageNet dataset.

#### Architecture

AlexNet has 5 convolutional layers and 3 fully connected layers. It has around 60M parameters.

![](AlexNet_architecture.png)

AlexNet introduced a number of features that were not present in other convolutional networks. These include:

* ReLU activation function

* Using Dropout as a regularizer to deal with overfitting

* Overlapping pooling (pooling windows larger than the stride) to reduce the size of the activations.
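
A minimal pure-Python sketch of two of these ideas, ReLU and overlapping max pooling, is shown below. The window of 3 with stride 2 is the setting usually quoted for AlexNet and is treated here as an assumption; the function names and the toy 5 x 5 input are made up for the illustration.

```python
def relu(v):
    """Rectified linear unit: negative values are clipped to zero."""
    return max(0.0, v)


def max_pool2d(x, window, stride):
    """Max pooling over a 2D list; windows overlap when window > stride."""
    n_rows, n_cols = len(x), len(x[0])
    out_rows = (n_rows - window) // stride + 1
    out_cols = (n_cols - window) // stride + 1
    return [
        [
            max(x[r * stride + i][c * stride + j]
                for i in range(window) for j in range(window))
            for c in range(out_cols)
        ]
        for r in range(out_rows)
    ]


# Toy 5 x 5 feature map with values from -12 to 12.
x = [[float(r * 5 + c - 12) for c in range(5)] for r in range(5)]

activated = [[relu(v) for v in row] for row in x]
pooled = max_pool2d(activated, window=3, stride=2)
print(pooled)  # [[0.0, 2.0], [10.0, 12.0]]
```

With a window of 3 and a stride of 2, neighbouring windows share one row or column, and the 5 x 5 map shrinks to 2 x 2.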

The other important feature was that these networks were trained using GPUs. This made them train much faster than on CPUs alone.

#### Parameters

Here is the parameter table of the AlexNet architecture.


| layer_n | H | W | D | Activation Shape (Number Of Neurons) | Activation Size |
|:--:| :--:| :--:| :--:| :--:| :--:|
| 0 | 32| 32| 1| (32, 32, 1)|32 x 32 x 1 = 1024|


| layer_n | fW x fH x fD | s | p| n_filters (K) | n_parameters |  notes | n_examples |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
|0 | 0 x 0 x 0| 0| 0| 0| 1024| Input Image| 60,000|


### VGG-16

#### Architecture

This architecture is very popular as a base for transfer learning or for building more advanced heads on top of. It has a simple repeated architecture in which the number of channels (feature maps) increases in powers of two, while the height and width reduce by factors of 2.


It does have many more parameters than AlexNet, at around 138M parameters.
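
The parameter count can be reproduced with a few lines of arithmetic. The sketch below assumes the commonly cited VGG-16 configuration, which is not spelled out in these notes: a 224 x 224 x 3 input, five blocks of 3 x 3 "same" convolutions with 64, 128, 256, 512 and 512 filters, a 2 x 2 max pooling after each block, and fully connected layers of 4096, 4096 and 1000 units. The constant and function names are made up for this example.

```python
# (number of 3 x 3 conv layers, filters per layer) for each of the five blocks.
# Assumed from the commonly cited VGG-16 configuration.
VGG16_BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]
FC_UNITS = [4096, 4096, 1000]


def vgg16_param_count(in_hw=224, in_channels=3):
    """Count weights and biases, assuming 'same' convolutions and 2 x 2 pooling."""
    total, hw, d = 0, in_hw, in_channels
    for n_convs, k in VGG16_BLOCKS:
        for _ in range(n_convs):
            total += (3 * 3 * d + 1) * k   # weights plus one bias per filter
            d = k
        hw //= 2                           # 2 x 2 max pooling with stride 2
    n_in = hw * hw * d                     # flattened volume fed to the first FC layer
    for n_out in FC_UNITS:
        total += (n_in + 1) * n_out        # weights plus one bias per unit
        n_in = n_out
    return total


print(f"{vgg16_param_count():,}")  # 138,357,544
```

Most of the parameters sit in the first fully connected layer, which alone accounts for over 100M of the total.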


![](VGG16_architecture.png)

#### Parameters


Here is the parameter table of the VGG-16 architecture.


| layer_n | H | W | D | Activation Shape (Number Of Neurons) | Activation Size |
|:--:| :--:| :--:| :--:| :--:| :--:|
| 0 | 32| 32| 1| (32, 32, 1)|32 x 32 x 1 = 1024|



| layer_n | fW x fH x fD | s | p| n_filters (K) | n_parameters |  notes | n_examples |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
|0 | 0 x 0 x 0| 0| 0| 0| 1024| Input Image| 60,000|


## ResNets

ResNets make use of "skip connections", that is, they allow for a main path as well as a shortcut that links layers more than one connection away. The authors call these skip connected layers "blocks", whereas the layers seen earlier, connected only in sequence, form "plain" networks.

**Plain Blocks**

**ResNet Blocks: Skipped Connections**
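
A minimal sketch of the difference between the two kinds of block, using small fully connected layers in place of convolutions to keep the example short (that substitution, and all names below, are simplifications made for this illustration). The only change in the residual version is that the block's input is added back before the final ReLU.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def dense(v, weights, bias):
    """Fully connected layer: one row of weights and one bias per output unit."""
    return [sum(w * a for w, a in zip(row, v)) + b for row, b in zip(weights, bias)]


def plain_block(x, w1, b1, w2, b2):
    """Two layers applied one after the other."""
    return relu(dense(relu(dense(x, w1, b1)), w2, b2))


def residual_block(x, w1, b1, w2, b2):
    """Same two layers, but the input x skips ahead and is added before the ReLU."""
    z = dense(relu(dense(x, w1, b1)), w2, b2)
    return relu([zi + xi for zi, xi in zip(z, x)])


n = 3
zero_w = [[0.0] * n for _ in range(n)]
zero_b = [0.0] * n
x = [1.0, 2.0, 3.0]

print(plain_block(x, zero_w, zero_b, zero_w, zero_b))     # [0.0, 0.0, 0.0]
print(residual_block(x, zero_w, zero_b, zero_w, zero_b))  # [1.0, 2.0, 3.0]
```

With all weights at zero the plain block loses its input, while the residual block passes it through unchanged (for non-negative inputs), which hints at why adding such blocks is unlikely to hurt a network.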

#### Architecture

#### Parameters



| layer_n | H | W | D | Activation Shape (Number Of Neurons) | Activation Size |
|:--:| :--:| :--:| :--:| :--:| :--:|
| 0 | 32| 32| 1| (32, 32, 1)|32 x 32 x 1 = 1024|



| layer_n | fW x fH x fD | s | p| n_filters (K) | n_parameters |  notes | n_examples |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
|0 | 0 x 0 x 0| 0| 0| 0| 1024| Input Image| 60,000|

## Why ResNets Work

## Network In Network

"Network In A Network" is just applying 1 by 1 filters to convolve volumes. In particular, if a layer has $(H, W, D)$ dimensions then K filters of dimensions $(1, 1, D)$ convolved with it result in a layer $(H, W, K)$.

This has two well known uses:

* to go from a volume to a fully connected layer.

* to "bottleneck" convolve in Inception networks - which reduces the number of parameters needed in Inception networks.
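
The shape change from $(H, W, D)$ to $(H, W, K)$ can be sketched in pure Python as below. The function name `conv1x1` and the toy sizes are assumptions made for the illustration; biases and non-linearities are left out.

```python
def conv1x1(x, filters):
    """Apply K filters of shape (1, 1, D) to a volume x[h][w][d].

    filters[k][d] holds the D weights of filter k.
    Each output pixel is a weighted sum over the channels of the same input pixel.
    """
    return [
        [
            [sum(f[d] * pixel[d] for d in range(len(pixel))) for f in filters]
            for pixel in row
        ]
        for row in x
    ]


H, W, D, K = 4, 5, 6, 2
x = [[[1.0] * D for _ in range(W)] for _ in range(H)]   # volume of shape (4, 5, 6)
filters = [[0.5] * D for _ in range(K)]                 # 2 filters of shape (1, 1, 6)

y = conv1x1(x, filters)
print((len(y), len(y[0]), len(y[0][0])))  # (4, 5, 2)
```

Height and width are untouched; only the channel dimension changes, from D to K.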


## Inception Module Motivation

The motivation for inception blocks is that it is not clear what form a hidden layer should take:

* Should it be a convolution? If so, what size of filter should it take? 1 by 1, 3 by 3 or 5 by 5?

* Should it be a max pooling layer? 

INCEPTION_IMAGE

Well, what if we allow all of these filters to be applied in parallel and concatenate the results, generating a much higher number of feature maps? This is the idea behind the "we need to go deeper" meme from the movie Inception, which also explains the name, although more correctly inception blocks go wider, with many more feature maps.

![](deeper_meme.jpeg)

One problem is that the computational cost (and the number of parameters to be estimated) can explode and become infeasible. A solution to this issue, presented by the authors, is to have a bottleneck layer - which shrinks the volume using 1 by 1 convolutions with a smaller number of filters and then increases it back up by applying the desired filter sizes.

BOTTLENECK_IMAGE

In this way, a layer can train many feature maps, while controlling the number of parameters. This sets us up nicely to look at inception layers and inception networks.
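
A rough cost comparison illustrates the saving. The sizes below (a 28 x 28 x 192 input, 32 filters of size 5 x 5, and a 16 channel bottleneck) are illustrative assumptions rather than values taken from these notes, and biases are ignored.

```python
def conv_cost(h, w, d_in, f, k):
    """Multiplications and weights for a 'same' f x f convolution with k filters."""
    weights = f * f * d_in * k
    multiplications = h * w * weights   # every output position uses every weight once
    return multiplications, weights


H, W, D = 28, 28, 192

# Direct 5 x 5 convolution with 32 filters.
direct = conv_cost(H, W, D, f=5, k=32)

# Bottleneck: 1 x 1 convolution down to 16 channels, then the 5 x 5 convolution.
squeeze = conv_cost(H, W, D, f=1, k=16)
expand = conv_cost(H, W, 16, f=5, k=32)
bottleneck = (squeeze[0] + expand[0], squeeze[1] + expand[1])

print(f"direct:     {direct[0]:,} multiplications, {direct[1]:,} weights")
print(f"bottleneck: {bottleneck[0]:,} multiplications, {bottleneck[1]:,} weights")
```

With these sizes the direct convolution needs about 120M multiplications and 153,600 weights, while the bottleneck version needs about 12.4M multiplications and 15,872 weights, roughly a tenfold reduction for the same output shape.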


## Inception Module Based Network

The Inception blocks produced state of the art performance on the ImageNet competition using a 22 layer architecture called GoogLeNet - named in homage to Yann LeCun's LeNet.

In addition, due to its 22 layers, the network is exposed to vanishing gradients. The authors added auxiliary loss functions on side branches, which use earlier layers to predict the target. These carry less weight than the final loss function and have no inferential value, but they stop vanishing gradients from being a great problem.

### GoogLeNet Architecture

### Parameters


| layer_n | H | W | D | Activation Shape (Number Of Neurons) | Activation Size |
|:--:| :--:| :--:| :--:| :--:| :--:|
| 0 | 32| 32| 1| (32, 32, 1)|32 x 32 x 1 = 1024|



| layer_n | fW x fH x fD | s | p| n_filters (K) | n_parameters |  notes | n_examples |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
|0 | 0 x 0 x 0| 0| 0| 0| 1024| Input Image| 60,000|

## Using Open Source Implementation

One can usually search online for an implementation of a famous neural network architecture and find open source code - usually in your preferred framework. Nowadays the two competing frameworks are:

* TensorFlow is mature, with dozens of APIs, and is traditionally written in a computational graph style. Its eager execution mode is more Pythonic in that it runs line by line without a separate graph compilation step, and can still run on GPUs or specialized chipsets. TensorFlow is supported by [Google](www.google.com).

* PyTorch is popular in research groups and universities because it has fewer APIs and is Pythonic - it needs no compilation of computational graphs and still runs on GPUs. PyTorch is supported by [Facebook](www.facebook.com)


## Transfer Learning

There are many types of transfer learning. All of them start from a pre-trained network (trained on some possibly related dataset), that is, a configuration of weight values and some other data, together with a new supervised learning task. Starting with this, one can do one or more of the following:

* remove final layer(s), add new layers.

* freeze some or all weights in the earlier layers, or in all pre-trained layers.

* unfreeze all weights.

Then, a learning algorithm can use the new architecture with given weights (new weights would be randomly generated, pre-trained weights used where available) and learn from the new dataset to target the supervised learning problem at hand.

There are flags in most of the popular frameworks which allow transfer learning to take place.
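
In PyTorch, for example, freezing is done by setting `requires_grad = False` on a parameter, and in Keras by setting `trainable = False` on a layer. To show the idea without depending on a framework, here is a toy pure-Python sketch: a fixed feature extractor stands in for the pre-trained network, and only a new final layer is trained by gradient descent. All names, weights and the toy task are made up for the illustration.

```python
import math
import random

random.seed(0)

# Stand-in for pre-trained weights: 2 inputs -> 3 ReLU features.
# In practice these would be loaded from a model trained on another dataset.
backbone = [[1.0, -1.0], [-1.0, 1.0], [0.5, 0.5]]
frozen_copy = [row[:] for row in backbone]


def features(x):
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in backbone]


def predict(head_w, head_b, x):
    z = sum(w * f for w, f in zip(head_w, features(x))) + head_b
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(head_w, head_b, data):
    eps = 1e-12
    total = 0.0
    for x, y in data:
        p = predict(head_w, head_b, x)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(data)


# New task: the label is 1 when x0 > x1.
data = []
for _ in range(200):
    x = [random.uniform(-1, 1), random.uniform(-1, 1)]
    data.append((x, 1.0 if x[0] > x[1] else 0.0))

# New final layer, randomly initialised; only these parameters are trained.
head_w = [random.gauss(0, 0.1) for _ in range(3)]
head_b = 0.0
loss_before = log_loss(head_w, head_b, data)

lr = 0.5
for _ in range(200):
    grad_w, grad_b = [0.0] * 3, 0.0
    for x, y in data:
        err = predict(head_w, head_b, x) - y
        grad_w = [g + err * f for g, f in zip(grad_w, features(x))]
        grad_b += err
    head_w = [w - lr * g / len(data) for w, g in zip(head_w, grad_w)]
    head_b -= lr * grad_b / len(data)
    # `backbone` is never updated: its weights are frozen.

loss_after = log_loss(head_w, head_b, data)
print(loss_before, loss_after)
```

Unfreezing would correspond to also computing gradients for `backbone` and updating it, usually with a smaller learning rate.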

### When To Use Transfer Learning

Transfer learning makes sense when the number of inputs to a problem is large and training would take a long time. It also makes good sense when the available data and compute for the problem at hand are small. It helps if the problem to be solved is similar in some respects to the one the pre-trained model was trained for.

Where there is both a lot of data and compute, then transfer learning provides less value to the engineer.

## Data Augmentation

Data augmentation generates more data samples from the existing ones by applying thoughtful transformations, producing new data derived from the original dataset.

### Data Augmentation Techniques

* Mirroring

* Crops

* Rotations

* Shears

* Color Shifting

* PCA Color Shifting

* Gaussian Noise

* Test-time augmentation: apply cropping, mirroring, rotations, shears, color shifting and PCA color shifting to each test example and average the predictions (probabilities for classification or target predictions for regression).
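
A few of these techniques can be sketched in pure Python on a toy grayscale image stored as a list of rows. The function names, the 4 x 4 image and the `toy_model` are hypothetical and only serve the illustration; a real pipeline would use an image library.

```python
import random


def mirror(img):
    """Horizontal flip: reverse each row."""
    return [row[::-1] for row in img]


def random_crop(img, size, rng):
    """Cut a size x size window at a random position."""
    top = rng.randint(0, len(img) - size)
    left = rng.randint(0, len(img[0]) - size)
    return [row[left:left + size] for row in img[top:top + size]]


def add_gaussian_noise(img, sigma, rng):
    """Add zero-mean Gaussian noise and clip to the 0..255 pixel range."""
    return [[min(255.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row] for row in img]


def predict_with_tta(model, img):
    """Test-time augmentation: average predictions over the image and its mirror."""
    preds = [model(variant) for variant in (img, mirror(img))]
    return [sum(p[i] for p in preds) / len(preds) for i in range(len(preds[0]))]


rng = random.Random(0)
img = [[float(r * 4 + c) for c in range(4)] for r in range(4)]  # toy 4 x 4 image

augmented = [mirror(img), random_crop(img, 3, rng), add_gaussian_noise(img, 2.0, rng)]


def toy_model(image):
    """Hypothetical classifier: class 1 probability grows with the top-left pixel."""
    p = image[0][0] / 15.0
    return [1.0 - p, p]


print(predict_with_tta(toy_model, img))
```

Mirroring and cropping keep the label unchanged for most natural images, which is what makes the derived samples usable as extra training data.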

### Parallel Data Augmentation

As images are usually stored to disk and loaded into memory - an efficient way to augment the data is to:

1. Load an image into memory using the CPU - then hand it to a CPU thread.

2. In the thread, generate the augmented data using the methods above. 

3. Follow this by sending all the augmented data to the GPU (or another CPU process) and have the learning algorithm update the parameters of the network. 

4. Go back to 1 and load a new image.

Because loading and augmentation overlap with the parameter updates, this can lead to dramatically faster training of ML applications.
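
The steps above can be sketched with the standard library's `threading` and `queue` modules. This is a toy producer and consumer, with a hypothetical `load_image` standing in for disk reads and a counter standing in for the training step.

```python
import queue
import threading


def load_image(index):
    """Stand-in for reading an image from disk (hypothetical loader)."""
    return [[float(index + r + c) for c in range(4)] for r in range(4)]


def augment(img):
    """Return the image together with its mirrored copy."""
    return [img, [row[::-1] for row in img]]


def producer(n_images, batches):
    """Load and augment images in a background thread."""
    for i in range(n_images):
        batches.put(augment(load_image(i)))  # blocks while the queue is full
    batches.put(None)                        # sentinel: no more data


def train(n_images):
    batches = queue.Queue(maxsize=4)         # bounded, so the loader cannot run far ahead
    worker = threading.Thread(target=producer, args=(n_images, batches), daemon=True)
    worker.start()
    seen = 0
    while True:
        batch = batches.get()
        if batch is None:
            break
        seen += len(batch)                   # a real training step would update the network here
    worker.join()
    return seen


print(train(10))  # 20
```

In CPython, threads mainly help overlap disk reads with computation; for CPU-heavy augmentation, data loaders in the popular frameworks typically use several worker processes instead.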


## State Of Computer Vision

### Deep Learning in Computer Vision

Deep Learning has been successfully applied to get state of the art results in a number of areas, in computer vision and beyond. These include:

* image recognition

* speech recognition

* speech translation

* online advertising

* logistics

* image localization

However, for some applications a lot of specialized architecture choices (hand engineering) need to be made. This is related to the amount of data available.


### Spectrum Of Data Size Vs Hand Engineering

The current state of computer vision spans a spectrum, from learning from enough examples with simple architectures to requiring specialized hand engineering (or 'hacks') to make progress. Below some points are marked on the spectrum.

![](spectrum_data_vs_hand_engineering.png)


One way to get around having little data and avoid too much hand engineering is to make use of transfer learning. Transfer learning can give good results with small data by reusing what was learned on other datasets with related architectures.


### Competition Tips

To do well in competitions, a few tricks are regularly used. They can boost performance by a few percentage points, which can be materially relevant to competitions.

* Ensembling: There is a wide repertoire of ensembling methods, all based around generating multiple models and averaging them in some way.

* Test Case Augmentation: Generate many variations of the test case and average the predictions to get better performance.
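
The simplest form of ensembling averages the class probabilities predicted by several models. A minimal sketch, with made-up probabilities standing in for the outputs of three trained models:

```python
def average_probabilities(predictions):
    """predictions[m][c] is model m's probability for class c."""
    n_models = len(predictions)
    n_classes = len(predictions[0])
    return [sum(p[c] for p in predictions) / n_models for c in range(n_classes)]


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical outputs of three models for one test example.
model_outputs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.3, 0.4, 0.3],
]

ensemble = average_probabilities(model_outputs)
print(argmax(ensemble))  # 1
```

Note that the outputs are averaged, not the weights: averaging the weights of independently trained networks generally does not give a meaningful model.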

### Use Open Source

* Use open source implementations and frameworks - they have been tested and peer reviewed.

* Sometimes existing implementations can be fine-tuned or modified to achieve what you want. 

* Try to contribute when you can.