# Week 2 Notes

## Why Look At Case Studies

Case studies of well-known networks are useful because the open source community has usually researched and implemented them extensively.

Engineers can use these well-known architectures and their implementations directly, or fine-tune/modify trained versions to fit their own problems and datasets.

Training these networks from scratch often takes days or even months. Starting from pre-trained weights and adapting them (with transfer learning, for example) allows fast progress.

There is also a level of trust in these implementations, because many researchers have examined both the code and its performance.

A good way to learn how to build convnets is to study ones that others have built. These case studies also link to seminal papers and are intellectually satisfying.

Many of the ideas here have been, and can be, applied in areas other than computer vision.

## Classic Network

There are three classic networks that have had significant impact on the computer vision community. These are:

1. LeNet-5

2. AlexNet

3. VGG-16

We will look at their architecture and link to open source implementations.

### LeNet-5

The LeNet-5 architecture was designed for handwritten digit classification, the task posed by the MNIST dataset. It was one of the early demonstrations of strong practical performance from convolutional neural networks.

MNIST images are 28 by 28 pixel grayscale images (1 channel). LeNet-5 takes them padded to 32 by 32 pixels, so the network input is 32x32x1. The MNIST database contains 60,000 training images and 10,000 testing images.

#### Architecture

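Counting only the layers that carry weights (two convolutional, two fully connected and one output layer), the network has 4 hidden layers, which makes it a 5 layer network - hence the name LeNet-5 (the "Le" refers to the first author, Yann LeCun).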

![](LeNet_architecture.png)

There are two convolutional layers, each made up of a convolution followed by average pooling. These are followed by two fully connected layers that link to a final layer with 10 outputs.

Note that the number of filters in a layer is also called its number of feature maps. This is related to, but not the same as, the number of channels, as explained in this note from [samuel](https://github.com/vlfeat/matconvnet/issues/457):


>The number of feature channels and the number of filters refer to different things.

>Suppose X is an input with size W x H x D x N (where N is the size of the batch) to a convolutional layer containing filters F (with size FW x FH x FD x K) in a network .

>The number of feature channels D is the third dimension of the input X here (for example, this is typically 3 at the first input to the network if the input consists of colour images).
The number of filters K is the fourth dimension of F.
The two concepts are closely linked because if the number of filters in a layer is K, it produces an output with K feature channels. So the input to the next layer will have K feature channels.

> There is a detailed explanation in Section 4.1 of the [manual](http://www.vlfeat.org/matconvnet/matconvnet-manual.pdf)
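
The relationship between channels and filters in the quote can be checked with a small sketch. This is a naive pure-Python convolution written only for illustration. The function names `conv2d_valid` and `shape` are hypothetical, the batch dimension N is left out, and arrays are indexed height-first (H x W x D) rather than W x H x D as in the quote.

```python
def conv2d_valid(x, filters):
    """Naive convolution (cross-correlation), stride 1, no padding, no bias.

    x:       nested list indexed [h][w][d]       -> shape H x W x D
    filters: nested list indexed [k][fh][fw][fd] -> K filters of FH x FW x FD
    returns: nested list indexed [h][w][k]
    """
    H, W, D = len(x), len(x[0]), len(x[0][0])
    K, FH, FW = len(filters), len(filters[0]), len(filters[0][0])
    FD = len(filters[0][0][0])
    # Each filter must span every input channel.
    assert FD == D, "filter depth must equal the number of input channels"
    out_h, out_w = H - FH + 1, W - FW + 1
    out = [[[0.0] * K for _ in range(out_w)] for _ in range(out_h)]
    for k in range(K):
        for i in range(out_h):
            for j in range(out_w):
                out[i][j][k] = sum(
                    x[i + a][j + b][c] * filters[k][a][b][c]
                    for a in range(FH)
                    for b in range(FW)
                    for c in range(FD)
                )
    return out


def shape(t):
    """Shape of a 3-level nested list."""
    return (len(t), len(t[0]), len(t[0][0]))


# A 6 x 6 input with D = 3 channels and K = 4 filters of size 3 x 3 x 3.
x = [[[1.0] * 3 for _ in range(6)] for _ in range(6)]
filters = [[[[1.0] * 3 for _ in range(3)] for _ in range(3)] for _ in range(4)]
y = conv2d_valid(x, filters)
print(shape(x), "->", shape(y))  # (6, 6, 3) -> (4, 4, 4)
```

The output has K = 4 channels regardless of the input depth D = 3, so the next layer's filters would need a depth of 4.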



#### Parameters

##### Arithmetic By Layer Type

For different types of layers, the number of parameters is calculated as:

* Input Layer: 0 parameters.
* Convolutional Layer: (FH * FW * FD + 1) * K parameters (FH * FW * FD weights plus 1 bias per filter)
* Average Pooling Layer: 0 parameters.
* Fully Connected Layer:  $n^{L-1} * n^{L} + n^{L}$ parameters (weights plus 1 bias per output unit)

The tables below provide the details. The first lists the activation shapes and the second lists the filters and parameter counts. For the fully connected layers, the filter column gives the shape of the input seen by each unit and K is the number of units.

| layer_n | H | W | D | Activation Shape (Number Of Neurons) | Activation Size |
|:--:| :--:| :--:| :--:| :--:| :--:|
| 0 | 32| 32| 1| (32, 32, 1)| 32 x 32 x 1 = 1,024|
| 1 | 28| 28| 6| (28, 28, 6)| 28 x 28 x 6 = 4,704|
| 2 | 14| 14| 6| (14, 14, 6)| 14 x 14 x 6 = 1,176|
| 3 | 10| 10| 16| (10, 10, 16)| 10 x 10 x 16 = 1,600|
| 4 | 5| 5| 16| (5, 5, 16)| 5 x 5 x 16 = 400|
| 5 | 1| 1| 120| (1, 1, 120)| 1 x 1 x 120 = 120|
| 6 | 1| 1| 84| (1, 1, 84)| 1 x 1 x 84 = 84|
| 7 | 1| 1| 10| (1, 1, 10)| 1 x 1 x 10 = 10|


| layer_n | fW x fH x fD | s | p| n_filters (K) | n_parameters |  notes | n_examples |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
|0 | -| -| -| -| 0| Input Image| 60,000|
|1 | 5 x 5 x 1| 1| 0| 6| (5 x 5 x 1 + 1) x 6 = 156| Convolution 1| 60,000|
|2 | 2 x 2 x 6| 2| 0| 6| 0| Average Pooling 1| 60,000|
|3 | 5 x 5 x 6| 1| 0| 16| (5 x 5 x 6 + 1) x 16 = 2,416| Convolution 2| 60,000|
|4 | 2 x 2 x 16| 2| 0| 16| 0| Average Pooling 2| 60,000|
|5 | 5 x 5 x 16| 1| 0| 120| 400 x 120 + 120 = 48,120| Fully Connected 3| 60,000|
|6 | 1 x 1 x 120| 1| 0| 84| 120 x 84 + 84 = 10,164| Fully Connected 4| 60,000|
|7 | 1 x 1 x 84| 1| 0| 10| 84 x 10 + 10 = 850| SoftMax| 60,000|


In total, there are: 156 + 2,416 + 48,120 + 10,164 + 850 = 61,706

Or roughly 60K parameters.
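
The arithmetic can be reproduced with a short script. This is a minimal sketch that counts one bias per filter or output unit and assumes every filter connects to all input channels (a common modern simplification, not necessarily the exact connectivity of the original network). Under those assumptions the total is 61,706. The helper names `conv_params` and `fc_params` are made up for this example.

```python
def conv_params(fh, fw, fd, k):
    """Weights (fh * fw * fd per filter) plus one bias per filter."""
    return (fh * fw * fd + 1) * k


def fc_params(n_in, n_out):
    """Weight matrix (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out


# Layer sizes taken from the LeNet-5 description in these notes.
lenet5 = {
    "conv1": conv_params(5, 5, 1, 6),
    "conv2": conv_params(5, 5, 6, 16),
    "fc3": fc_params(5 * 5 * 16, 120),  # flattened 5 x 5 x 16 input
    "fc4": fc_params(120, 84),
    "softmax": fc_params(84, 10),
}
total = sum(lenet5.values())

for name, n in lenet5.items():
    print(f"{name:8s}{n:>8,d}")
print(f"{'total':8s}{total:>8,d}")
```

Pooling layers do not appear in the dictionary because, in this formulation, they have no trainable parameters.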

#### Details

Some details that were popular back then but are less used or less relevant now:

* average pooling

* sigmoid/tanh non-linearities

* non-linearity after pooling

If you read the original LeNet-5 paper, sections 2 and 3 are the most important. The other sections cover the details above and other ideas that are not widely used anymore, for example graph transformers.

Notice how, generally, the height and width of the activations decrease while the number of channels increases as we move through the layers.
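
The shrinking height and width follow from the usual output-size formula, floor((n + 2p - f) / s) + 1, for an input of size n, filter size f, stride s and padding p. The sketch below traces it through the convolution and pooling layers of LeNet-5 as described in these notes. The function name `out_size` and the layer list format are choices made for this example.

```python
def out_size(n, f, s=1, p=0):
    """Spatial output size of a convolution or pooling layer."""
    return (n + 2 * p - f) // s + 1


# (name, filter size, stride, output channels)
# None for output channels means the layer keeps the channel count (pooling).
layers = [
    ("conv1", 5, 1, 6),
    ("pool1", 2, 2, None),
    ("conv2", 5, 1, 16),
    ("pool2", 2, 2, None),
]

n, d = 32, 1  # 32 x 32 x 1 input
shapes = [(n, n, d)]
for name, f, s, k in layers:
    n = out_size(n, f, s)
    d = d if k is None else k
    shapes.append((n, n, d))
    print(f"{name}: {n} x {n} x {d}")
```

Each convolution without padding trims f - 1 pixels from the height and width, and each 2 x 2 pooling with stride 2 halves them, while the filter count sets the new depth.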

### AlexNet

This network demonstrated the strength of deep learning to the computer vision community in 2012. It produced state-of-the-art results on the widely used ImageNet dataset.

#### Architecture

AlexNet has 5 convolutional layers and 3 fully connected layers. 

![](AlexNet_architecture.png)

AlexNet popularized a number of features that were not common in earlier convolutional networks. These include:

* ReLU activation function

* Dropout as a regularizer to reduce overfitting in the fully connected layers

* Overlapping max pooling (pooling windows larger than the stride), which downsamples the activations and, according to the authors, makes the network slightly harder to overfit.
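
The three ideas in the list can be illustrated on a small 1D signal. This is a rough sketch rather than AlexNet's actual implementation: the pooling is 1D instead of 2D, and the dropout shown is the "inverted" variant (rescaling at training time) that is common today, which is an assumption rather than a description of the original code. All function names are made up for this example.

```python
import random


def relu(x):
    """ReLU: negative values become 0, positive values pass through."""
    return [max(0.0, v) for v in x]


def max_pool_1d(x, size, stride):
    """Max pooling over windows of `size`, moving by `stride`."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, stride)]


def dropout(x, keep_prob, rng):
    """Inverted dropout: zero each value with probability 1 - keep_prob
    and rescale the survivors by 1 / keep_prob."""
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in x]


signal = [-2.0, 1.0, 3.0, -1.0, 0.5, 4.0, -3.0]
activated = relu(signal)

# Overlapping: window 3 > stride 2, so neighbouring windows share an element.
overlapping = max_pool_1d(activated, size=3, stride=2)
# Non-overlapping: window 2 == stride 2.
non_overlapping = max_pool_1d(activated, size=2, stride=2)

dropped = dropout(activated, keep_prob=0.5, rng=random.Random(0))

print(activated)        # [0.0, 1.0, 3.0, 0.0, 0.5, 4.0, 0.0]
print(overlapping)      # [3.0, 3.0, 4.0]
print(non_overlapping)  # [1.0, 3.0, 4.0]
```

Dropout is only applied during training. At test time all units are kept.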

The other important feature was that the network was trained on GPUs, which made training much faster than on CPUs alone.

#### Parameters

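Here is the parameter table of the AlexNet architecture. It treats the network as a single stream (ignoring the split across two GPUs in the original implementation) and uses a 227 x 227 x 3 input, which makes the layer arithmetic work out. For the fully connected layers, the filter column gives the shape of the input seen by each unit and K is the number of units.


| layer_n | H | W | D | Activation Shape (Number Of Neurons) | Activation Size |
|:--:| :--:| :--:| :--:| :--:| :--:|
| 0 | 227| 227| 3| (227, 227, 3)| 227 x 227 x 3 = 154,587|
| 1 | 55| 55| 96| (55, 55, 96)| 55 x 55 x 96 = 290,400|
| 2 | 27| 27| 96| (27, 27, 96)| 27 x 27 x 96 = 69,984|
| 3 | 27| 27| 256| (27, 27, 256)| 27 x 27 x 256 = 186,624|
| 4 | 13| 13| 256| (13, 13, 256)| 13 x 13 x 256 = 43,264|
| 5 | 13| 13| 384| (13, 13, 384)| 13 x 13 x 384 = 64,896|
| 6 | 13| 13| 384| (13, 13, 384)| 13 x 13 x 384 = 64,896|
| 7 | 13| 13| 256| (13, 13, 256)| 13 x 13 x 256 = 43,264|
| 8 | 6| 6| 256| (6, 6, 256)| 6 x 6 x 256 = 9,216|
| 9 | 1| 1| 4096| (1, 1, 4096)| 1 x 1 x 4096 = 4,096|
| 10 | 1| 1| 4096| (1, 1, 4096)| 1 x 1 x 4096 = 4,096|
| 11 | 1| 1| 1000| (1, 1, 1000)| 1 x 1 x 1000 = 1,000|


| layer_n | fW x fH x fD | s | p| n_filters (K) | n_parameters |  notes | n_examples |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
|0 | -| -| -| -| 0| Input Image| ~1.2M|
|1 | 11 x 11 x 3| 4| 0| 96| (11 x 11 x 3 + 1) x 96 = 34,944| Convolution 1| ~1.2M|
|2 | 3 x 3 x 96| 2| 0| 96| 0| Max Pooling 1| ~1.2M|
|3 | 5 x 5 x 96| 1| 2| 256| (5 x 5 x 96 + 1) x 256 = 614,656| Convolution 2| ~1.2M|
|4 | 3 x 3 x 256| 2| 0| 256| 0| Max Pooling 2| ~1.2M|
|5 | 3 x 3 x 256| 1| 1| 384| (3 x 3 x 256 + 1) x 384 = 885,120| Convolution 3| ~1.2M|
|6 | 3 x 3 x 384| 1| 1| 384| (3 x 3 x 384 + 1) x 384 = 1,327,488| Convolution 4| ~1.2M|
|7 | 3 x 3 x 384| 1| 1| 256| (3 x 3 x 384 + 1) x 256 = 884,992| Convolution 5| ~1.2M|
|8 | 3 x 3 x 256| 2| 0| 256| 0| Max Pooling 3| ~1.2M|
|9 | 6 x 6 x 256| 1| 0| 4096| 9,216 x 4,096 + 4,096 = 37,752,832| Fully Connected 6| ~1.2M|
|10 | 1 x 1 x 4096| 1| 0| 4096| 4,096 x 4,096 + 4,096 = 16,781,312| Fully Connected 7| ~1.2M|
|11 | 1 x 1 x 4096| 1| 0| 1000| 4,096 x 1,000 + 1,000 = 4,097,000| SoftMax| ~1.2M|


In total, there are 62,378,344 parameters, or roughly 62 million, most of them in the fully connected layers.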
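
As a cross-check for AlexNet, the sketch below traces activation shapes and parameter counts through a single-stream version of the network. The layer hyperparameters (filter sizes, strides, padding, a 227 x 227 x 3 input) are the commonly quoted values and are assumptions here, not something taken from the figure above. The two-GPU split of the original implementation is ignored, so the convolution counts differ from the grouped original. The names `out_size` and `trace` are hypothetical.

```python
def out_size(n, f, s=1, p=0):
    """Spatial output size: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1


# (name, kind, filter size, stride, padding, output channels or units)
alexnet = [
    ("conv1", "conv", 11, 4, 0, 96),
    ("pool1", "pool", 3, 2, 0, None),
    ("conv2", "conv", 5, 1, 2, 256),
    ("pool2", "pool", 3, 2, 0, None),
    ("conv3", "conv", 3, 1, 1, 384),
    ("conv4", "conv", 3, 1, 1, 384),
    ("conv5", "conv", 3, 1, 1, 256),
    ("pool3", "pool", 3, 2, 0, None),
    ("fc6", "fc", None, None, None, 4096),
    ("fc7", "fc", None, None, None, 4096),
    ("fc8", "fc", None, None, None, 1000),
]


def trace(layers, n, d):
    """Return per-layer (name, output shape, parameter count) and the total."""
    rows, total = [], 0
    for name, kind, f, s, p, k in layers:
        if kind == "conv":
            params = (f * f * d + 1) * k  # weights plus one bias per filter
            n, d = out_size(n, f, s, p), k
        elif kind == "pool":
            params = 0  # pooling has nothing to learn
            n = out_size(n, f, s, p)
        else:  # fully connected: flatten the previous activation
            params = n * n * d * k + k
            n, d = 1, k
        rows.append((name, (n, n, d), params))
        total += params
    return rows, total


rows, total = trace(alexnet, n=227, d=3)
for name, shape, params in rows:
    print(f"{name:6s}{str(shape):>18s}{params:>14,d}")
print(f"total parameters: {total:,d}")
```

Under these assumptions the total comes to about 62 million parameters, with the three fully connected layers accounting for most of them.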


### VGG-16

#### Architecture

#### Parameters

## ResNets

#### Architecture

#### Parameters

## Why ResNets Work

## Network In Network

## Inception Network Motivation

## Inception Network

#### Architecture

#### Parameters

## Using Open Source Implementation

## Transfer Learning

## Data Augmentation

## State Of Computer Vision