# Neural Network Design

[Source](https://github.com/arthurredfern/UT-Dallas-CS-6301-CNNs/blob/master/Lectures/xNNs_050_Design.pdf)

## Keep Your Goals Simple
- The simpler the task, the more training data tends to be available. The more data, the better a neural network will generally perform.
- Handle complex problems with a shallow combination of simple problems.

## CNN Design Overview
- CNNs exploit spatial structure in data: correlations between localized regions of pixels.
- Encoder (tail and body, for feature extraction) + decoder (head, for prediction)
    - Common network structures: Serial, Parallel, Dense, or Residual
    
- __Tail__ - a few specialized layers; weaker features with better localization.
- __Body__ - the layers that define the network; stronger features with worse localization.
- __Head__ - task-specific layers that produce the output of the network.


### How CNNs work (Image Classification)

A classification network takes an image as input and outputs a vector whose largest element indicates the predicted class.

#### Head

The __head__ estimates the dominant object in a specific region by using the feature vector in the corresponding region of the feature maps at the body output.
- __Global average pooling__ (GAP) extracts information from each feature map (the presence of a feature across all areas), and the head uses a linear combination of these global features to predict the dominant object in the image.

Global average pooling or vectorization can be used to convert the output of the body to a 1xN vector for input into the linear layers of the head.

GAP is more popular since it allows arbitrarily sized input images and also reduces the computational complexity of the decoding phase.

The __linear classifier__ uses a softmax layer during training and argmax to select the output during inference.
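
As a minimal sketch of the head described above, the following pure-Python example applies global average pooling to a toy body output, feeds the pooled vector to one linear layer, and then uses softmax (training) or argmax (inference). The feature maps, weights, and function names are all made up for illustration and are not taken from the lecture.

```python
import math

def global_average_pool(feature_maps):
    """Reduce each HxW feature map to a single value (its mean)."""
    return [sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
            for fmap in feature_maps]

def linear(features, weights, biases):
    """One linear layer: each class score is a weighted sum of the features."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, biases)]

def softmax(scores):
    """Turn class scores into probabilities (used during training)."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(scores):
    """Index of the largest score (used during inference)."""
    return max(range(len(scores)), key=lambda i: scores[i])

# Toy body output: 2 feature maps (channels), each 2x2.
body_output = [
    [[1.0, 3.0], [1.0, 3.0]],  # channel 0
    [[0.0, 0.0], [0.0, 4.0]],  # channel 1
]
# Hypothetical weights for a 3-class classifier (3 classes x 2 features).
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
biases = [0.0, 0.0, 0.0]

pooled = global_average_pool(body_output)  # 1xN vector, N = number of channels
scores = linear(pooled, weights, biases)
probs = softmax(scores)                    # training-time output
predicted_class = argmax(scores)           # inference-time output
```

The length of `pooled` depends only on the number of channels, not on the spatial size of the feature maps, which is why a GAP head can accept input images of arbitrary size.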

Different head designs can be created to accomplish different goals, and more than one head can be attached to the end of a tail and body.

#### Tail and Body

The tail and body transform entire regions of the input image into feature vectors across the output feature maps. The __receptive field__ is the area of input pixels that can influence a given output feature vector.

##### Tail
- High compute, high feature map memory, low parameter memory, and lots of spatial redundancy.
- Aggressive downsampling and an aggressive increase in the number of channels.

Common architectures:
- conv 7x7, stride 2, 64 channels -> max pool 3x3, stride 2
- conv 3x3, stride 2, 32 channels -> conv 3x3, stride 1, 64 channels -> conv 3x3, stride 2, 64 channels
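
To illustrate how aggressively these tails downsample, here is a small sketch that tracks the spatial size through each one. The 224x224 input size and the padding values are assumptions (the notes do not specify them); the output-size formula is the standard one for convolution and pooling layers.

```python
def conv_out_size(size, kernel, stride, padding):
    """Spatial output size of a conv or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

def run_tail(size, layers):
    """Apply each (kernel, stride, padding) layer in order."""
    for kernel, stride, padding in layers:
        size = conv_out_size(size, kernel, stride, padding)
    return size

# (kernel, stride, padding); the padding values are assumed, not from the notes.
tail_a = [(7, 2, 3), (3, 2, 1)]             # conv 7x7 s2 -> max pool 3x3 s2
tail_b = [(3, 2, 1), (3, 1, 1), (3, 2, 1)]  # three 3x3 convs, strides 2, 1, 2

size_a = run_tail(224, tail_a)
size_b = run_tail(224, tail_b)
```

Under these assumptions both tails reduce a 224x224 input to 56x56, a 4x reduction in each spatial dimension.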


##### Body
Blocks followed by downsampling. Each downsampling step reduces the memory required for feature maps and loses spatial resolution (pooling throws out pixels), while the number of channels increases.

Common design practices:
- Halve the rows and columns and double the number of channels.
    - Feature-map data volume shrinks by a factor of 2 (1/2 x 1/2 x 2 = 1/2), which reduces compute by a factor of 2 for operations whose cost scales with data volume.
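
The arithmetic behind this practice can be checked with a short sketch. The starting shape of 56x56 with 64 channels and the three downsampling steps are assumptions chosen for illustration.

```python
def downsample_stage(rows, cols, channels):
    """Halve rows and columns, double the channels."""
    return rows // 2, cols // 2, channels * 2

# Assumed feature map shape at the start of the body: (rows, cols, channels).
shapes = [(56, 56, 64)]
for _ in range(3):
    shapes.append(downsample_stage(*shapes[-1]))

# Data volume (number of feature map elements) at each stage.
volumes = [rows * cols * channels for rows, cols, channels in shapes]
```

Each entry in `volumes` is half the previous one, and with these assumed numbers the final feature map is 7x7.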
  
#### Receptive field size at the head
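The receptive field size at the head indicates how much of the input image the classifier has to work with at inference. This matters more for higher-resolution images, where a fixed receptive field covers a smaller fraction of the image.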

__Calculation__
1. Start at the output of the body and set the receptive field to 1.
2. Move backwards through the network.
3. Every time you reach a filter of size F, increase the receptive field size by F - 1.
4. Every time you reach a downsampling step with stride S, multiply the receptive field by S, then subtract S - 1.
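
The steps above can be sketched as a short function. It assumes each layer is described by a hypothetical `(filter_size, stride)` pair and treats a strided layer as a filter followed by a downsample, so moving backwards the downsample rule is applied first and the filter rule second.

```python
def receptive_field(layers):
    """Receptive field at the output of `layers`.

    `layers` is a list of (filter_size, stride) pairs in forward order.
    """
    rf = 1                                  # step 1: start at the output
    for filter_size, stride in reversed(layers):  # step 2: move backwards
        rf = rf * stride - (stride - 1)     # step 4: downsample by `stride`
        rf = rf + (filter_size - 1)         # step 3: filter of size F
    return rf

# The two tail designs listed earlier, as (filter_size, stride) pairs.
tail_a = [(7, 2), (3, 2)]           # conv 7x7 s2 -> max pool 3x3 s2
tail_b = [(3, 2), (3, 1), (3, 2)]   # three 3x3 convs, strides 2, 1, 2

rf_a = receptive_field(tail_a)
rf_b = receptive_field(tail_b)
rf_stack = receptive_field([(3, 1), (3, 1), (3, 1)])  # three stacked 3x3 convs
```

For a stride of 1 the downsample rule leaves the receptive field unchanged, so the same loop handles both strided and unstrided layers.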


### CNN Networks

#### Serial
A sequential set of IO ops (LeNet, AlexNet, VGG, MobileNet V1)
- VGGNet introduced stacked 3x3 conv filters and used vectorization at the head.
- MobileNet V1 uses a cascade of 3x3 spatial and 1x1 channel convs and uses global average pooling at the head.
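
The notes do not give cost figures, but a rough sketch of multiply-accumulate (MAC) counts may help show why the 3x3 spatial plus 1x1 channel cascade is attractive. The layer shape below (14x14, 256 input and output channels) is hypothetical, and the counts ignore bias terms and padding effects.

```python
def standard_conv_macs(rows, cols, c_in, c_out, k):
    """MACs for a standard k x k convolution over a rows x cols output."""
    return rows * cols * c_in * c_out * k * k

def separable_conv_macs(rows, cols, c_in, c_out, k):
    """MACs for a k x k per-channel spatial conv followed by a 1x1 channel conv."""
    spatial = rows * cols * c_in * k * k   # one k x k filter per input channel
    channel = rows * cols * c_in * c_out   # 1x1 conv mixes channels
    return spatial + channel

# Hypothetical layer shape chosen for illustration.
standard = standard_conv_macs(14, 14, 256, 256, 3)
separable = separable_conv_macs(14, 14, 256, 256, 3)
ratio = standard / separable
```

With these assumed numbers the cascade needs between 8 and 9 times fewer MACs than a standard 3x3 convolution of the same shape.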


#### Parallel 
Input split -> parallel ops -> output combine (GoogLeNet, Inception V3 and V4, SqueezeNet).
- GoogLeNet/Inception: parallel paths with different receptive field sizes, referred to as inception modules.

#### Dense
The input is split into a transformation path and an identity path, and the results are concatenated. This adds width to the network with features at different depths (DenseNet).

#### Residual
Instead of concatenating the two paths as in DenseNet, add them together (ResNet).
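
The difference between the dense and residual combine steps can be shown on a toy feature vector. This is only a sketch: `transform` is a hypothetical stand-in for a block's transformation path, and each list element stands for one channel.

```python
def transform(x):
    """Stand-in for the transformation path (hypothetical: doubles each value)."""
    return [2.0 * v for v in x]

def dense_block(x):
    """Dense: concatenate the identity path with the transformation path."""
    return x + transform(x)  # list concatenation, so the width grows

def residual_block(x):
    """Residual: add the identity path to the transformation path."""
    return [a + b for a, b in zip(x, transform(x))]  # width stays the same

x = [1.0, 2.0, 3.0]
dense_out = dense_block(x)
residual_out = residual_block(x)
```

The dense output has twice as many channels as the input, while the residual output keeps the input width, which is why the addition requires both paths to produce matching shapes.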

### Notes for Vision Optimized CNNs
- Target a feature map size of roughly 6x6 to 8x8 before GAP.
- Early body blocks have a high compute cost, while late blocks require more parameter memory.