In [1]:
import tensorflow as tf
from tensorflow.keras.layers import *
print(tf.__version__)

2.1.0


# Design

[Slides](https://github.com/arthurredfern/UT-Dallas-CS-6301-CNNs/blob/master/Lectures/xNNs_050_Design.pdf)


## Keep your goals Simple
- The simpler the task, the more training data is usually available, and the more data, the better a NN will perform.
- Handle complex problems with a shallow combination of simple problems.

## CNN Design Overview
- CNNs exploit spatial structure in data: correlations between localized regions of pixels.
- __Encoder__ (Tail & Body for feature extraction) + __decoder__ (head for prediction)
    - Commonly: Serial, Parallel, Dense, or Residual
    
- __Tail__ - a few specialized layers with weaker features and better localization.
- __Body__ - layers that define the network, with strong features and worse localization.
- __Head__ - task-specific layers that produce the output of the network.


### CNN Architecture (Image Classification)

A classification network takes an image as input and outputs a vector whose largest element indicates the predicted class.

$$TAIL\rightarrow BODY \rightarrow HEAD$$

$$ENCODER\rightarrow DECODER$$

#### Head

The __head__ estimates the dominant object in a specific region by using the feature vector in the corresponding region of the feature maps at the body output.
- __Global average pooling__ (GAP) extracts information from each feature map (the presence of a feature across all areas), and the head uses a linear combination of these global features to predict the dominant object of the image. Either global average pooling or vectorization can be used to convert the output of the body to a 1xN vector for input into the linear layers of the head.

GAP is more popular since it allows arbitrary sized input images and also reduces the computational complexity of the decoding phase.
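
A minimal pure-Python sketch of global average pooling, assuming feature maps are stored as nested lists indexed `[channel][row][col]`. The helper name `global_average_pool` is hypothetical and not part of any library. It illustrates why the pooled vector length depends only on the number of channels, not on the spatial size of the input.

```python
def global_average_pool(feature_maps):
    """Average each channel's 2D map down to a single number."""
    pooled = []
    for channel in feature_maps:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled

# Two channels at 2x2 resolution (assumed toy values)
small = [[[1.0, 3.0], [5.0, 7.0]],
         [[0.0, 0.0], [0.0, 4.0]]]

# Two channels at 3x3 resolution (assumed toy values)
large = [[[2.0] * 3 for _ in range(3)],
         [[1.0] * 3 for _ in range(3)]]

gap_small = global_average_pool(small)
gap_large = global_average_pool(large)

# Both results have one entry per channel, regardless of spatial size
print(len(gap_small), len(gap_large))
```

Vectorization (flattening) of the same two inputs would instead give vectors of length 8 and 18, so the linear layers of the head would need a fixed input resolution.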


Different head designs can be created to accomplish different goals, and more than one head can be attached to the end of a tail and body.

#### Tail and Body

The tail and body transform entire regions of the input image into feature vectors across the output feature maps. The __receptive field__ is the area of input pixels that can influence a given output feature vector.

Each activation in the output of the body relates to the confidence that a class is present in a certain area (the size of the receptive field). Averaging over every region leads to the head's final prediction of the dominant class in the image.


##### Tail
- High compute, high feature map memory, low parameter memory, and lots of spatial redundancy.
- Aggressive down-sampling and an aggressive increase in the number of channels.

Common architectures:
- conv (7x7), stride 2, 64 channels -> max pool (3x3), stride 2
- conv (3x3), stride 2, 32 channels -> conv (3x3), stride 1, 64 channels -> conv (3x3), stride 2, 64 channels


##### Body
Blocks followed by down-sampling. This gives a gradual reduction in the memory required for feature maps and a loss in spatial resolution (pooling throws out pixels), paired with an increase in the number of channels.

Common design practices:
- reduce cols and rows by 1/2 and double the number of channels
    - data volume shrinks by a factor of 2 (reduces compute by factor of 2)
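
The halving of data volume per stage can be checked with a few lines of arithmetic. This is a sketch under assumed numbers: the 56x56x64 starting shape and the helper names are illustrative choices, not values taken from the slides.

```python
def downsample_stage(rows, cols, channels):
    """Halve rows and cols, double channels (the common body pattern)."""
    return rows // 2, cols // 2, channels * 2

def volume(rows, cols, channels):
    """Number of activations in a feature map tensor."""
    return rows * cols * channels

shape = (56, 56, 64)  # assumed example shape at the body input
volumes = [volume(*shape)]
for _ in range(3):
    shape = downsample_stage(*shape)
    volumes.append(volume(*shape))

# Each stage holds half as many activations as the one before it
print(shape, volumes)
```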
  
#### Receptive field size at the head
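The receptive field size at the head shows how much of the input image each feature vector seen by the classifier covers at inference. This matters more for higher-resolution images, where a fixed receptive field covers a smaller fraction of the image.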

__Calculation of Receptive Field__
1. Start at the output of the body and set r.f. = 1.
2. Move backwards through the network.
3. Every time you reach a filter, increase the receptive field size by F - 1 (F is the filter length, e.g. 3 for a 3x3 filter).
4. Every time you reach a down-sample, multiply the receptive field by S, then subtract S - 1 (S is the down-sampling factor, e.g. 2).
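
The steps above can be sketched in plain Python. The function name and the `(filter_size, stride)` layer encoding are assumptions made for this example. A strided layer is treated as a filter followed by a down-sample, so when moving backwards the down-sample rule is applied first and the filter rule second.

```python
def receptive_field(layers):
    """Receptive field at the output of `layers`.

    layers: list of (filter_size, stride) tuples in forward order.
    """
    rf = 1                                   # step 1
    for filter_size, stride in reversed(layers):  # step 2
        rf = rf * stride - (stride - 1)      # step 4: undo the down-sample
        rf = rf + (filter_size - 1)          # step 3: undo the filter
    return rf

# The two common tails listed earlier in these notes
tail_a = [(7, 2), (3, 2)]            # conv 7x7 s2 -> max pool 3x3 s2
tail_b = [(3, 2), (3, 1), (3, 2)]    # three 3x3 convs with strides 2, 1, 2

rf_a = receptive_field(tail_a)
rf_b = receptive_field(tail_b)
rf_stacked = receptive_field([(3, 1), (3, 1)])  # two stacked 3x3 convs

print(rf_a, rf_b, rf_stacked)
```

Under these assumptions both tails end up with the same receptive field, and two stacked 3x3 convolutions see the same 5x5 area as a single 5x5 filter.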


# CNN Architectures

## Serial (LeNet, AlexNet, VGG, MobileNet V1)

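__VGGNet__ introduced stacked 3x3 conv filters and used vectorization at the head.
   - Stacking 3x3 convolutions expands the receptive field with less compute than a single larger filter: two stacked 3x3 layers cover a 5x5 region with 18 weights per channel pair instead of 25.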
    

__[MobileNet V1](https://arxiv.org/pdf/1704.04861.pdf)__ uses a cascade of 3x3 spatial and 1x1 channel convs, with global avg pool at the head.
   - __Depthwise-separable convolutions__. The depthwise convolution has an nxnx1 kernel for each channel of the input. The pointwise convolution has 1x1xc kernels for c channels in the input. [depthwise conv explained](https://towardsdatascience.com/a-basic-introduction-to-separable-convolutions-b99ec3102728)
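
To make the savings concrete, here is a small weight-count comparison between a standard convolution and its depthwise-separable replacement. The helper names and the 64 -> 128 channel example are assumptions for illustration, and biases are ignored.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a kernel x kernel convolution from c_in to c_out channels."""
    return kernel * kernel * c_in * c_out

def depthwise_separable_params(kernel, c_in, c_out):
    """Weights in a depthwise conv followed by a 1x1 pointwise conv."""
    depthwise = kernel * kernel * c_in   # one kernel x kernel filter per input channel
    pointwise = c_in * c_out             # 1x1 filters mixing the channels
    return depthwise + pointwise

std = standard_conv_params(3, 64, 128)
sep = depthwise_separable_params(3, 64, 128)
ratio = sep / std  # equals 1/c_out + 1/kernel**2

print(std, sep, round(ratio, 3))
```

The multiply-accumulate count scales the same way, since both variants are applied at every output position.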

In [None]:
def MobileNetV1(inputs, num_filters, strides=(1,1)):
    """
    Conv block from the MobileNetV1 architecture.
    
    inputs: Tensor - input to the first layer
    num_filters: int - desired number of channels in output
    strides: tuple - strides for the depthwise convolution
    """
    # depthwise conv: one 3x3 filter per input channel
    x = DepthwiseConv2D((3,3), strides=strides, padding='same')(inputs)
    x = BatchNormalization()(x)
    x = ReLU()(x)
    
    # pointwise conv: 1x1 filters that mix channels
    x = Conv2D(num_filters, (1, 1))(x)
    x = BatchNormalization()(x)
    x = ReLU()(x)
    return x

## Parallel (GoogLeNet, Inceptions, SqueezeNet)
Input split $\rightarrow$ parallel ops $\rightarrow$ output combine. 

[Inception overview](https://towardsdatascience.com/a-simple-guide-to-the-versions-of-the-inception-network-7fc52b863202)

__[Inception V1](https://arxiv.org/pdf/1409.4842v1.pdf)__ - The size of the object may differ across images, so the network builds parallel convolution pathways with different receptive fields.

The authors limited the compute by adding 1x1 pointwise convolutions (reducing input channels) before each of the conv branches. The paper references the *Network in Network* paper as the source of pointwise conv.

The authors also include a Dropout layer after GAP with a dropout probability of 0.4.

The authors use auxiliary classifiers: heads inside of the network that predict the output. This attempts to allow clean gradient flow to earlier parts of the network.

<img src="../img/lenet.PNG">

In [None]:
def conv_block(inputs, filters, kernel_size=(3,3), strides=(1,1)):
    """Generic Conv -> BN -> ReLU abstraction"""
    # 'same' padding keeps spatial dims equal across parallel branches
    x = Conv2D(filters, kernel_size, strides=strides, padding='same')(inputs)
    x = BatchNormalization()(x)
    x = ReLU()(x)
    return x

def InceptionV1(inputs, squeeze_dims, out_dims, strides=(1,1)):
    """
    Inception V1 Conv Block (GoogLeNet).
    """
    # 1x1 branch
    x1 = conv_block(inputs, out_dims, kernel_size=(1,1))
    
    # 1x1 squeeze -> 3x3 branch
    x2 = conv_block(inputs, squeeze_dims, kernel_size=(1,1))
    x2 = conv_block(x2, out_dims, kernel_size=(3,3))
    
    # 1x1 squeeze -> 5x5 branch
    x3 = conv_block(inputs, squeeze_dims, kernel_size=(1,1))
    x3 = conv_block(x3, out_dims, kernel_size=(5,5))
    
    # max pool -> 1x1 branch (the conv is applied to the pooled tensor)
    x4 = MaxPool2D(pool_size=(3,3), strides=(1,1), padding='same')(inputs)
    x4 = conv_block(x4, out_dims, kernel_size=(1,1))
    
    # combine the branches along the channel axis
    concat = Concatenate(axis=-1)([x1, x2, x3, x4])
    return concat

__[Inception V2 and V3](https://arxiv.org/pdf/1512.00567.pdf)__

__Inception V2__ focused on:
- avoiding representational bottlenecks caused by extreme compression.
- converting the 5x5 convolution to two stacked 3x3 convolutions.
- reducing compute by factorizing an nxn convolution into an nx1 and a 1xn convolution.
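
The savings from both factorizations can be estimated by counting spatial weights. This sketch assumes every layer keeps the same channel count, so only the kernel shapes matter, and the helper name is hypothetical.

```python
def weights_per_channel_pair(kernel_shapes):
    """Spatial weights per (input channel, output channel) pair for a stack of kernels."""
    return sum(height * width for height, width in kernel_shapes)

single_5x5 = weights_per_channel_pair([(5, 5)])
stacked_3x3 = weights_per_channel_pair([(3, 3), (3, 3)])   # same 5x5 receptive field

single_3x3 = weights_per_channel_pair([(3, 3)])
asymmetric_3 = weights_per_channel_pair([(3, 1), (1, 3)])  # same 3x3 receptive field

single_7x7 = weights_per_channel_pair([(7, 7)])
asymmetric_7 = weights_per_channel_pair([(7, 1), (1, 7)])  # same 7x7 receptive field

print(single_5x5, stacked_3x3, single_3x3, asymmetric_3, single_7x7, asymmetric_7)
```

The relative saving of the nx1 plus 1xn factorization grows with n, which is one reason it is attractive for larger filters.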

In [None]:
def InceptionV2_fig5(inputs, squeeze_dims, out_dims, strides=(1,1)):
    """
    Inception V2 Figure 5
    """
    x1 = conv_block(inputs, squeeze_dims, kernel_size=(1,1))
    
    x2 = conv_block(inputs, squeeze_dims, kernel_size=(1,1))
    x2 = conv_block(x2, out_dims, kernel_size=(3,3))
    
    x3 = conv_block(inputs, squeeze_dims, kernel_size=(1,1))
    x3 = conv_block(x3, out_dims, kernel_size=(3,3))
    x3 = conv_block(x3, out_dims, kernel_size=(3,3))
    
    x4 = MaxPool2D(pool_size=(3,3), strides=(1,1), padding='same')(inputs)
    x4 = conv_block(x4, squeeze_dims, kernel_size=(1,1))  # conv on the pooled tensor
    
    concat = Concatenate(axis=-1)([x1, x2, x3, x4])
    return concat

def InceptionV2_fig6(inputs, squeeze_dims, out_dims, strides=(1,1)):
    """
    Inception V2 figure 6
    """
    x1 = conv_block(inputs, squeeze_dims, kernel_size=(1,1))
    
    x2 = conv_block(inputs, squeeze_dims, kernel_size=(1,1))
    x2 = conv_block(x2, out_dims, kernel_size=(3,1))
    x2 = conv_block(x2, out_dims, kernel_size=(1,3))
    
    x3 = conv_block(inputs, squeeze_dims, kernel_size=(1,1))
    x3 = conv_block(x3, out_dims, kernel_size=(3,1))
    x3 = conv_block(x3, out_dims, kernel_size=(1,3))
    x3 = conv_block(x3, out_dims, kernel_size=(3,1))
    x3 = conv_block(x3, out_dims, kernel_size=(1,3))
    
    x4 = MaxPool2D(pool_size=(3,3), strides=(1,1), padding='same')(inputs)
    x4 = conv_block(x4, squeeze_dims, kernel_size=(1,1))  # conv on the pooled tensor
    
    concat = Concatenate(axis=-1)([x1, x2, x3, x4])
    return concat

__Inception V3__ adds the RMSProp optimizer, factorized 7x7 convolutions, Batch Norm in the auxiliary classifiers, and label smoothing.

__[Inception V4](https://arxiv.org/pdf/1602.07261.pdf)__

## Dense (DenseNet)
The input is split into a transformation path and an identity path, and the results are concatenated (this adds width to the network with features at different depths).

## Residual
Instead of concatenating as in DenseNet, perform an add on the two splits.

__[ResNetV1](https://arxiv.org/pdf/1512.03385.pdf)__ Add the input of a convolutional block to the result of the block's convolutions. These "skip connections" allow networks to train faster by allowing the clean flow of gradients through the network.

In [None]:

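def ResNetV1(inputs, squeeze_dims, expand_dims, strides=(1,1)):
    """Bottleneck residual block (ResNet V1): 1x1 squeeze, 3x3, 1x1 expand."""
    #RESIDUAL PATH
    resid = conv_block(inputs, squeeze_dims, kernel_size=(1,1), strides=strides)
    resid = Conv2D(squeeze_dims, (3,3), padding='same')(resid)
    resid = ReLU()(BatchNormalization()(resid))
    resid = Conv2D(expand_dims, (1,1))(resid)
    resid = BatchNormalization()(resid)  # the final ReLU comes after the add
    #IDENTITY PATH: project with a 1x1 conv when the output shape changes
    if strides != (1,1) or inputs.shape[-1] != expand_dims:
        inputs = Conv2D(expand_dims, (1,1), strides=strides)(inputs)
    #COMBINE
    return ReLU()(Add()([inputs, resid]))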

### Notes for Vision Optimized CNNs
- Target a feature map size between 6x6 and 8x8 before GAP
- Early body blocks will have a high compute cost, while late blocks will require more memory.

## ResNet Architecture

### Sources 
- [Aggregated Residual Transformations for Deep Neural Networks](https://arxiv.org/pdf/1611.05431.pdf)
- [Deep Residual Learning for Image Recognition](https://arxiv.org/pdf/1512.03385.pdf)
- [Identity Mappings in Deep Residual Networks](https://arxiv.org/pdf/1603.05027.pdf)

### Notes
[ResNet V1 Paper](https://arxiv.org/pdf/1512.03385.pdf)

__Intuition behind Residual Connection__

*In this paper, we address the degradation problem by introducing a deep residual learning framework.... we let the stacked nonlinear layers fit another mapping of F(x) := H(x)−x. The original mapping is recast into F(x)+x. We hypothesize that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping. To the extreme, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers.*

*With the residual learning reformulation, if identity mappings are optimal, the solvers may simply drive the weights of the multiple nonlinear layers toward zero to approach identity mappings.*

__What to do during down sampling__

*The identity shortcuts (Eqn.(1)) can be directly used when the input and output are of the same dimensions (solid line shortcuts in Fig. 3). When the dimensions increase (dotted line shortcuts in Fig. 3), we consider two options: (A) The shortcut still performs identity mapping, with extra zero entries padded for increasing dimensions. This option introduces no extra parameter; (B) __The projection shortcut in Eqn.(2) is used to match dimensions (done by 1×1 convolutions)__. For both options, when the shortcuts go across feature maps of two sizes, they are performed with a stride of 2.*

*__Deeper Bottleneck Architectures__. Next we describe our deeper nets for ImageNet. Because of concerns on the training time that we can afford, we modify the building block as a bottleneck design. For each residual function F, we use a stack of 3 layers instead of 2 (Fig. 5). The three layers are 1×1, 3×3, and 1×1 convolutions, where the 1×1 layers are responsible for reducing and then increasing (restoring) dimensions, leaving the 3×3 layer a bottleneck with smaller input/output dimensions. Fig. 5 shows an example, where both designs have similar time complexity.*

[Optimizing Residual Block Architecture](https://arxiv.org/pdf/1603.05027.pdf)

*(i) The feature $x_L$ of any deeper unit $L$ can be represented as the feature $x_l$ of any shallower unit $l$ plus a residual function in a form of $\sum_{i=l}^{L-1} \mathcal{F}$,*

*The additive term of $\frac{\partial \mathcal{E}}{\partial x_L}$ ensures that information is directly propagated back to any shallower unit $l$. Eqn.(5) also suggests that it is unlikely for the gradient $\frac{\partial \mathcal{E}}{\partial x_l}$ to be canceled out for a mini-batch, because in general the term $\frac{\partial}{\partial x_l}\sum_{i=l}^{L-1} \mathcal{F}$ cannot be always -1 for all samples in a mini-batch. This implies that the gradient of a layer does not vanish even when the weights are arbitrarily small.*
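
For reference, the two relations the quoted passage relies on can be written out as follows. This is a restatement of Eqns. (4) and (5) from the linked paper as best recalled, with $\mathcal{W}_i$ denoting the weights of unit $i$ and $\mathcal{E}$ the loss:

$$x_L = x_l + \sum_{i=l}^{L-1} \mathcal{F}(x_i, \mathcal{W}_i)$$

$$\frac{\partial \mathcal{E}}{\partial x_l} = \frac{\partial \mathcal{E}}{\partial x_L}\left(1 + \frac{\partial}{\partial x_l}\sum_{i=l}^{L-1} \mathcal{F}(x_i, \mathcal{W}_i)\right)$$

The constant 1 inside the parentheses is the additive term: it carries $\frac{\partial \mathcal{E}}{\partial x_L}$ back to unit $l$ without passing through any weight layers.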

__TLDR__ Signals in both the forward and backward pass can be passed directly from any unit to any other unit (forward to deeper units, backward to shallower ones).

*We want to make f an identity mapping, which is done by re-arranging
the activation functions (ReLU and/or BN). The original Residual Unit in [1]
has a shape in Fig. 4(a) — BN is used after each weight layer, and ReLU is
adopted after BN except that the last ReLU in a Residual Unit is after elementwise addition (f = ReLU). Fig. 4(b-e) show the alternatives we investigated,
explained as following.*

## ResNeXt Architecture

### Sources
- [Aggregated Residual Transformations for Deep Neural Networks](https://arxiv.org/pdf/1611.05431.pdf)


*Our network is constructed
by repeating a building block that aggregates a set of transformations with the same topology. Our simple design results in a homogeneous, multi-branch architecture that has
only a few hyper-parameters to set. This strategy exposes a
new dimension, which we call “cardinality” (the size of the
set of transformations), as an essential factor in addition to
the dimensions of depth and width*

*In this paper, we present a simple architecture which
adopts VGG/ResNets’ strategy of repeating layers, while
exploiting the split-transform-merge strategy in an easy, extensible way. A module in our network performs a set
of transformations, each on a low-dimensional embedding,
whose outputs are aggregated by summation. We pursuit a
simple realization of this idea — the transformations to be
aggregated are all of the same topology (e.g., Fig. 1 (right)).
This design allows us to extend to any large number of
transformations without specialized designs*
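
To give a feel for cardinality, the sketch below compares the weight count of a plain bottleneck block with an aggregated block built from many narrow branches. The helper names are hypothetical, and the specific settings (256 channels in and out, bottleneck width 64, and 32 branches of width 4) are an assumed example configuration. Biases and batch norm parameters are ignored.

```python
def bottleneck_params(c_in, width, c_out):
    """Weights in a 1x1 -> 3x3 -> 1x1 bottleneck."""
    return c_in * width + 3 * 3 * width * width + width * c_out

def aggregated_params(c_in, cardinality, branch_width, c_out):
    """Weights in `cardinality` parallel bottleneck branches whose outputs are summed."""
    return cardinality * bottleneck_params(c_in, branch_width, c_out)

resnet_block = bottleneck_params(256, 64, 256)
resnext_block = aggregated_params(256, 32, 4, 256)

print(resnet_block, resnext_block)
```

Under these assumptions the two blocks have nearly the same number of weights, so cardinality can be varied without changing the parameter budget much.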


## MobileNet V2 Architecture

- [1](https://arxiv.org/pdf/1801.04381.pdf)

- [2](https://towardsdatascience.com/mobilenetv2-inverted-residuals-and-linear-bottlenecks-8a4362f4ffd5)

__Main Contributions__
- Depthwise Separable Convolutions - reduce the computational complexity of a standard 2D conv by replacing it with two layers. The first is a __3x3 depthwise convolution__ that applies one kernel per input channel (K kernels for K input channels). The second is a __1x1 pointwise convolution__ that produces linear combinations of the channels at each pixel.

- Linear Bottlenecks - TLDR: when you compress your representation with a convolutional operation, do not use a nonlinearity after the compression (especially ReLU, which throws out negative activations). Instead, make bottleneck layers linear operations.

- Inverted Residuals - the channel dimension is expanded by a pointwise conv, processed with a depthwise conv, then compressed by a linear bottleneck, and the skip connection joins the narrow bottleneck layers.

In [17]:
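# Assumed defaults: bn_params and conv_params are not defined anywhere in these notes
bn_params = {}
conv_params = {'padding': 'same'}

def inverted_residual(inputs, expand_channels, squeeze_channels, strides=(1,1)):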
    """
    inputs: Tensor- input to the first layer
    expand_channels: int - depth of the channel dimension after expansion
    squeeze_channels: int-depth of channel dimension after linear bottleneck
    strides: tuple-strides for the first convolution
    
    Inverted residual a la MobileNet V2 note the channel dimension will
     be expanded by pointwise conv, processed with depthwise conv, then 
     compressed by a linear bottleneck
    """
    x = Conv2D(expand_channels, (1, 1), strides=strides, padding='same')(inputs)
    x = BatchNormalization(**bn_params)(x)
    x = ReLU(max_value=6)(x)  # the paper uses a thresholded ReLU (3-bit output)
    
    x = DepthwiseConv2D((3,3), strides=(1,1), **conv_params)(x)
    x = BatchNormalization(**bn_params)(x)
    x = ReLU(max_value=6)(x)
    
    x = Conv2D(squeeze_channels, (1, 1), strides=(1, 1), padding='same')(x)
    x = BatchNormalization(**bn_params)(x) #No activation here (Linear BottleNeck)
    
    # project the shortcut whenever the output shape differs from the input shape
    if strides != (1,1) or inputs.shape[-1] != squeeze_channels:
        inputs = Conv2D(squeeze_channels, (1, 1), strides=strides, padding='same')(inputs)
    
    return Add()([x, inputs])