# Case Studies in Neural Nets

We look at some successful networks to gain insight into how to build NNs. See LeCun et al. (1998), "Gradient-Based Learning Applied to Document Recognition," in [Journal not mentioned].

## LeNet-5 

Built to recognize 32x32 greyscale images of handwritten digits.

32x32x1 --->

CONV: 6 5x5 filters, stride 1
28x28x6 --->

AvPool: f=2, s=2
14x14x6 --->

CONV: 16 5x5 filters, stride 1
10x10x16 --->

AvPool: f=2, s=2
5x5x16 --->

FC : 400 ---> 120 ---> 84 --->
SOFTMAX (10) ---> y-hat
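The sizes in the stack above follow from the usual output-size rule, floor((n + 2p - f) / s) + 1, with no padding. Here is a minimal sketch that traces them; the helper names `conv_out` and `lenet5_shapes` are made up for illustration.

```python
def conv_out(n, f, stride=1, pad=0):
    """Spatial size after a convolution or pooling window: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * pad - f) // stride + 1


def lenet5_shapes(n=32):
    """Trace (height, width, channels) through the LeNet-5 stack listed above."""
    shapes = [(n, n, 1)]
    # (layer name, window size f, stride s, output channels; None keeps the channel count)
    layers = [
        ("CONV", 5, 1, 6),
        ("AvPool", 2, 2, None),
        ("CONV", 5, 1, 16),
        ("AvPool", 2, 2, None),
    ]
    for _name, f, s, channels in layers:
        h, w, c = shapes[-1]
        shapes.append((conv_out(h, f, s), conv_out(w, f, s), channels or c))
    return shapes


shapes = lenet5_shapes()
flat = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
print(shapes)  # [(32, 32, 1), (28, 28, 6), (14, 14, 6), (10, 10, 16), (5, 5, 16)]
print(flat)    # 400, the input size of the first FC layer
```

The same `conv_out` helper reproduces the AlexNet sizes further down, e.g. `conv_out(227, 11, stride=4)` gives 55.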

### Comments

Average pooling is much less common today; max pooling is the usual choice. (My speculation: averages tend to be pretty average and can wash out, especially with f > 2, while keeping the most extreme value gives a sense of where large changes happen.)
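To make the speculation concrete, here is a small NumPy sketch comparing the two pooling modes on a made-up 4x4 feature map. The `pool2d` helper and the input values are illustrative assumptions, not anything from the LeNet-5 paper.

```python
import numpy as np


def pool2d(x, f=2, s=2, mode="max"):
    """Pool a 2-D array with an f x f window and stride s (no padding)."""
    out_h = (x.shape[0] - f) // s + 1
    out_w = (x.shape[1] - f) // s + 1
    out = np.empty((out_h, out_w))
    reduce = np.max if mode == "max" else np.mean
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = reduce(x[i * s:i * s + f, j * s:j * s + f])
    return out


# Hypothetical feature map: mostly zero, with one strong response in two of the four quadrants.
x = np.array([
    [0., 0., 0., 0.],
    [0., 8., 0., 0.],
    [0., 0., 0., 0.],
    [0., 0., 4., 0.],
])
max_pooled = pool2d(x, mode="max")  # [[8, 0], [0, 4]]
avg_pooled = pool2d(x, mode="avg")  # [[2, 0], [0, 1]]
print(max_pooled)
print(avg_pooled)
```

Max pooling keeps the peak responses (8 and 4) intact, while average pooling dilutes each by the three zeros that share its window.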

The original used sigmoid and tanh activations, not ReLU. Computational power limited what the network could do, and the paper spends a fair amount of discussion on how to apply filters to reduce complexity.





## AlexNet

Krizhevsky et al. (2012), "ImageNet Classification with Deep Convolutional Neural Networks."

Input, 227x227x3 (RGB) --->

CONV: 96 11x11 filters, stride 4: 55x55x96 --->

MaxPool: 3x3, s=2: 27x27x96 --->

CONV: 256 5x5 filters, same: 27x27x256 --->

MaxPool: 3x3, s=2: 13x13x256 --->

CONV: 384 3x3 filters, same: 13x13x384 --->

CONV: 384 3x3 filters, same: 13x13x384 ---> (yes, they repeated that)

CONV: 256 3x3 filters, same: 13x13x256 --->

MaxPool: 3x3, s=2: 6x6x256 --->

Flatten: 9216x1 --->

FC: 4096 ---> 4096 ---> Softmax(1000)

### Comments

AlexNet has about 1000 times more parameters than LeNet-5 (roughly 60 million versus roughly 60 thousand). ReLU was the activation.
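The parameter ratio can be checked by hand from the layer lists in these notes. This is a rough sketch that assumes the single-stream layouts written above, a 1000-way AlexNet output, and one bias per filter or unit. The helper names are made up, and the totals differ slightly from the originals, which used other connection schemes.

```python
def conv_params(f, c_in, c_out):
    """Weights plus biases for c_out filters of size f x f x c_in."""
    return f * f * c_in * c_out + c_out


def dense_params(n_in, n_out):
    """Weights plus biases for a fully connected layer."""
    return n_in * n_out + n_out


lenet5 = (
    conv_params(5, 1, 6)
    + conv_params(5, 6, 16)
    + dense_params(400, 120)
    + dense_params(120, 84)
    + dense_params(84, 10)
)

alexnet = (
    conv_params(11, 3, 96)
    + conv_params(5, 96, 256)
    + conv_params(3, 256, 384)
    + conv_params(3, 384, 384)
    + conv_params(3, 384, 256)
    + dense_params(6 * 6 * 256, 4096)
    + dense_params(4096, 4096)
    + dense_params(4096, 1000)
)

print(lenet5)                   # 61706
print(alexnet)                  # 62378344
print(round(alexnet / lenet5))  # 1011
```

Most of AlexNet's parameters sit in the first fully connected layer (9216 x 4096), not in the convolutions.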


In [10]:
import tensorflow as tf
from tensorflow import keras

def alex_net():
    '''
    Returns an uncompiled AlexNet
    '''
    inputs = keras.Input(shape=(227,227,3))
    # No padding on the first convolution: (227 - 11) / 4 + 1 = 55, matching the 55x55x96 above
    x = keras.layers.Conv2D(96, kernel_size=11, strides=4, padding='valid', activation='relu')(inputs)
    x = keras.layers.MaxPool2D(pool_size=3, strides=2)(x)
    print(x.shape)
    x = keras.layers.Conv2D(256, kernel_size=5, strides=1, padding='same', activation='relu')(x)
    x = keras.layers.MaxPool2D(pool_size=3, strides=2)(x)
    print(x.shape)
    x = keras.layers.Conv2D(384, kernel_size=3, strides=1, padding='same', activation='relu')(x)
    print(x.shape)
    x = keras.layers.Conv2D(384, kernel_size=3, strides=1, padding='same', activation='relu')(x)
    print(x.shape)
    x = keras.layers.Conv2D(256, kernel_size=3, strides=1, padding='same', activation='relu')(x)
    print(x.shape)
    x = keras.layers.MaxPool2D(pool_size=3, strides=2)(x)
    x = keras.layers.Flatten()(x)
    x = keras.layers.Dense(4096, activation='relu')(x)
    x = keras.layers.Dense(4096, activation='relu')(x)
    outputs = keras.layers.Dense(1000, activation='softmax')(x)

    model = keras.Model(inputs, outputs, name = 'AlexNet')
    return model
alex_net().summary()

(None, 27, 27, 96)
(None, 13, 13, 256)
(None, 13, 13, 384)
(None, 13, 13, 384)
(None, 13, 13, 256)
Model: "AlexNet"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_10 (InputLayer)        [(None, 227, 227, 3)]     0         
_________________________________________________________________
conv2d_37 (Conv2D)           (None, 55, 55, 96)        34944     
_________________________________________________________________
max_pooling2d_21 (MaxPooling (None, 27, 27, 96)        0         
_________________________________________________________________
conv2d_38 (Conv2D)           (None, 27, 27, 256)       614656    
_________________________________________________________________
max_pooling2d_22 (MaxPooling (None, 13, 13, 256)       0         
_________________________________________________________________
conv2d_39 (Conv2D)           (None, 13, 13, 384)       885120    
__________________________

## VGG-16

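This paper (Simonyan and Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition") focused on simple, uniform filters and pooling rather than the more complex choices of the others. It's later (2015). The network uses only 3x3, s=1, same convolutions and 2x2, s=2 max pooling.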

Input 224x224x3 --->

CONV(64)x2 ---> MAXPOOL --->

CONV(128)x2 ---> MAXPOOL --->

CONV(256)x3 + POOL ---> 

CONV(512)x3  + POOL --->

CONV(512)x3 + POOL --->

Flatten ---> FC(4096)-->FC(4096)--->SOFTMAX

### Comments

There's a lot of simplicity here. The number of filters doubles after each pooling step (64, 128, 256, 512) until it levels off at 512, and the later blocks stack three convolutions instead of two.
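The regular pattern is easy to trace by hand. This is a minimal sketch that assumes the standard VGG-16 configuration, in which the final 512-filter block also has three convolutions. `VGG16_BLOCKS` and `vgg16_trace` are names made up for this example.

```python
# (number of 3x3 'same' convolutions, filters) per block; each block ends with a 2x2, s=2 max pool.
# Assumption: the last block has three convolutions, like the two blocks before it.
VGG16_BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]


def vgg16_trace(size=224):
    """Return the (height, width, channels) volume after each block's pooling layer."""
    trace = []
    for _n_convs, filters in VGG16_BLOCKS:
        size //= 2  # 'same' convolutions keep the size; the pool halves it
        trace.append((size, size, filters))
    return trace


trace = vgg16_trace()
n_conv_layers = sum(n for n, _ in VGG16_BLOCKS)
n_weight_layers = n_conv_layers + 3  # plus FC(4096), FC(4096) and the softmax layer
flat = trace[-1][0] * trace[-1][1] * trace[-1][2]

print(trace)            # [(112, 112, 64), (56, 56, 128), (28, 28, 256), (14, 14, 512), (7, 7, 512)]
print(n_weight_layers)  # 16
print(flat)             # 25088
```

The 13 convolutions plus 3 fully connected layers are where the "16" in the name comes from.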


In [13]:
def conv(filters):
    return keras.layers.Conv2D(filters=filters, kernel_size=3, activation='relu', strides=1, padding='same')

def pool():
    return keras.layers.MaxPool2D(pool_size=2, strides=2)

def VGG_16():
    inputs = keras.Input(shape=(224,224,3))
    x = conv(64)(inputs)
    x = conv(64)(x)
    x = pool()(x)
    x = conv(128)(x)
    x = conv(128)(x)
    x = pool()(x)
    x = conv(256)(x)
    x = conv(256)(x)
    x = conv(256)(x)
    x = pool()(x)
    x = conv(512)(x)
    x = conv(512)(x)
    x = conv(512)(x)
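    x = pool()(x)
    # Fifth block: three more 512-filter convolutions and the final pool (7x7x512)
    x = conv(512)(x)
    x = conv(512)(x)
    x = conv(512)(x)
    x = pool()(x)
    x = keras.layers.Flatten()(x)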
    x = keras.layers.Dense(4096, activation='relu')(x)
    x = keras.layers.Dense(4096, activation='relu')(x)
    x = keras.layers.Dense(1000, activation='softmax')(x)
    return keras.Model(inputs, x, name='VGG-16')
VGG_16().summary()

Model: "VGG-16"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_13 (InputLayer)        [(None, 224, 224, 3)]     0         
_________________________________________________________________
conv2d_53 (Conv2D)           (None, 224, 224, 64)      1792      
_________________________________________________________________
conv2d_54 (Conv2D)           (None, 224, 224, 64)      36928     
_________________________________________________________________
max_pooling2d_28 (MaxPooling (None, 112, 112, 64)      0         
_________________________________________________________________
conv2d_55 (Conv2D)           (None, 112, 112, 128)     73856     
_________________________________________________________________
conv2d_56 (Conv2D)           (None, 112, 112, 128)     147584    
_________________________________________________________________
max_pooling2d_29 (MaxPooling (None, 56, 56, 128)       0    

## ResNet and Residual Blocks

A _residual block_ is the fundamental component of a ResNet. Residual blocks make it practical to train very deep networks, where exploding and vanishing gradients can otherwise become a problem. The seminal paper is He et al. (2015), "Deep Residual Learning for Image Recognition."

Very deep plain networks can actually be worse than their simpler counterparts, even on the training data. ResNet helps to avoid this problem.

### Setting up the Residuals Model

Let's start with a simple, standard model of two layers: we begin with an input activation $a^{[l]}$, which we pass forward to a linear function, $z^{[l+1]} = W^{[l+1]}a^{[l]} + b^{[l+1]}$, and then a non-linear function, $a^{[l+1]} = g(z^{[l+1]})$ (ReLU). This is then the input to another linear/non-linear pair of transformations. We refer to the result of these as $a^{[l+2]}$.

To use residuals, we add the activation $a^{[l]}$ to $z^{[l+2]}$ before applying the non-linearity that produces $a^{[l+2]}$. That is, $$a^{[l+2]} = g(z^{[l+2]} + a^{[l]}).$$

And that's a residual block: a "skip connection" that adds the activation $a^{[l]}$ to $z^{[l+2]}$ when calculating $a^{[l+2]}$.
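As a rough illustration, here is the block written out in NumPy with dense layers standing in for convolutions. The width `n`, the random weights, and the function names are assumptions made for this sketch; the ResNet paper uses convolutional layers.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def residual_block(a_l, W1, b1, W2, b2):
    """Two dense layers with a skip connection from a_l into the second pre-activation."""
    a_l1 = relu(W1 @ a_l + b1)  # a^[l+1] = g(z^[l+1])
    z_l2 = W2 @ a_l1 + b2       # z^[l+2]
    return relu(z_l2 + a_l)     # a^[l+2] = g(z^[l+2] + a^[l])


rng = np.random.default_rng(0)
n = 4  # hypothetical layer width; the sum needs a^[l] and z^[l+2] to have the same size
a_l = relu(rng.normal(size=n))  # non-negative, as if it came out of an earlier ReLU
W1, b1 = rng.normal(size=(n, n)), rng.normal(size=n)
W2, b2 = rng.normal(size=(n, n)), rng.normal(size=n)

a_l2 = residual_block(a_l, W1, b1, W2, b2)
print(a_l2.shape)  # (4,)
```

The skip connection only works as a plain sum when the two shapes match, which is one reason 'same' convolutions are convenient inside residual blocks.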

### Why does this work?

The short answer is that if additional layers don't improve the network, a residual block easily reduces to the identity function, so additional layers can't make the network worse than the shallower one. If I follow correctly, this implies that the early network layers are likely to improve "first" while the later ones can "catch up" once the first layers have a strong
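To spell out the identity argument, assume ReLU activations and write the second layer's weights and bias as $W^{[l+2]}$ and $b^{[l+2]}$ (labels introduced here for the derivation). Expanding the residual block gives

$$a^{[l+2]} = g\left(z^{[l+2]} + a^{[l]}\right) = g\left(W^{[l+2]} a^{[l+1]} + b^{[l+2]} + a^{[l]}\right).$$

If training (for example, through weight decay) drives $W^{[l+2]} \to 0$ and $b^{[l+2]} \to 0$, this collapses to

$$a^{[l+2]} = g\left(a^{[l]}\right) = a^{[l]},$$

where the last step holds because $a^{[l]}$ is itself the output of a ReLU, so $a^{[l]} \ge 0$ and ReLU leaves it unchanged. The two extra layers then simply pass their input through.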

## 1x1, or Network in Network Convolutions

Wait, a 1x1 convolution? Sure--the number of filters is a non-trivial matter. 

Suppose you have a 6x6x30 volume and you convolve it with n filters of size 1x1x30: the result is a 6x6xn volume, which is not a trivial transform. Each output value is a weighted sum across the 30 input channels at one position, passed through the activation. Even if the number of filters equals the number of input channels, it adds non-linearity.
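A 1x1 convolution is the same small fully connected layer applied at every spatial position, which a short NumPy sketch makes visible. The choice of n = 8 filters, the random values, and the `conv1x1` helper are assumptions for illustration.

```python
import numpy as np


def conv1x1(volume, weights, bias):
    """Apply n 1x1 filters to an H x W x C volume, followed by ReLU.

    weights has shape (C, n): column k holds the 1x1xC filter for output channel k.
    """
    z = np.einsum("hwc,cn->hwn", volume, weights) + bias
    return np.maximum(z, 0.0)


rng = np.random.default_rng(0)
volume = rng.normal(size=(6, 6, 30))
n_filters = 8  # hypothetical choice of n
weights = rng.normal(size=(30, n_filters))
bias = np.zeros(n_filters)

out = conv1x1(volume, weights, bias)
print(out.shape)  # (6, 6, 8)
```

Height and width are untouched; only the channel dimension changes, from 30 to n.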

### Bottleneck Layers

A 1x1 CONV filter can also be used to reduce computations.

For example, if you had a large volume, say 28x28x192, and applied a 5x5, f=32 same convolution, you would need about 120M computations: 28x28x32 x 5x5x192.

Instead, we can reduce the volume with a 1x1 convolution and then apply a 5x5, f=32 convolution to it. For example, CONV1 1x1x16, CONV2 5x5x32...

CONV1 requires 28x28x16 x 192 = 2.4M computations.

CONV2 requires 28x28x32 x 5x5x16 = 10M computations.

Together that is about 12.4M computations, roughly a tenth of the original 120M. This doesn't reduce network strength very much if our bottleneck (the 1x1 layer) is "within reason."
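The arithmetic above can be reproduced in a few lines. This sketch counts only multiplications and assumes 'same' padding, so the output stays 28x28. `conv_mults` is a helper invented for this example.

```python
def conv_mults(h, w, c_in, f, c_out):
    """Multiplications for a 'same' convolution: one f x f x c_in dot product per output value."""
    return h * w * c_out * f * f * c_in


direct = conv_mults(28, 28, c_in=192, f=5, c_out=32)   # 5x5 applied straight to the input
squeeze = conv_mults(28, 28, c_in=192, f=1, c_out=16)  # CONV1: 1x1 bottleneck
expand = conv_mults(28, 28, c_in=16, f=5, c_out=32)    # CONV2: 5x5 on the reduced volume

print(direct)                                # 120422400
print(squeeze)                               # 2408448
print(expand)                                # 10035200
print(round(direct / (squeeze + expand), 1))  # 9.7
```

Nearly all the saving comes from the 5x5 filters seeing 16 channels instead of 192.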


## Inception Module

An inception module passes an input activation to multiple filters and concatenates their output for prediction. 

Here's the idea:

- input = 32x32x192.
- bottleneck1 = CONV1x1, f=16
- CONVx5 = bottleneck1 * 5x5, f=32
- bottleneck2 = CONV1x1, f=16
- CONVx3 = bottleneck2 * 3x3, f=64
- CONVx1 = CONV1x1, f=32
- POOL = MaxPool 3x3,s=1,padding='same'
- CONCAT = concatenate(CONVx5,CONVx3,CONVx1,POOL)
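Because every branch uses 'same' padding (and the pool has stride 1), all outputs keep the 32x32 footprint and can be stacked along the channel axis. The sketch below tracks shapes only, using zero arrays as placeholders for the real branch outputs and taking the 5x5 branch to have 32 filters. It does not compute any convolutions.

```python
import numpy as np

h, w, c_in = 32, 32, 192

# Placeholder outputs: zeros with the shape each branch would produce.
branches = {
    "CONVx5": np.zeros((h, w, 32)),   # 1x1 bottleneck (16 channels), then 5x5 with f=32
    "CONVx3": np.zeros((h, w, 64)),   # 1x1 bottleneck (16 channels), then 3x3 with f=64
    "CONVx1": np.zeros((h, w, 32)),   # plain 1x1 with f=32
    "POOL": np.zeros((h, w, c_in)),   # max pooling keeps all 192 input channels
}

concat = np.concatenate(list(branches.values()), axis=-1)
print(concat.shape)  # (32, 32, 320)
```

With the list as written, the pooling branch contributes 192 of the 320 output channels. Inception implementations typically follow that branch with its own 1x1 convolution to cut it down; that step is not in the list above, so it is not in the sketch either.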


## Transfer Learning

Many pretrained networks are available on the internet. These can be used as the starting point for training your own network. For example, if you want to build a Pepper-Eleanor-Ody recognizer for images, you will probably achieve good results using a pretrained network. In steps:

1. Download a pretrained, open-source network (e.g., from GitHub)
1. Freeze all layers except the softmax layer, which you replace with a 4-value softmax output.
1. Train using your own data.
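The steps above can be sketched without any deep learning framework. In this toy NumPy version, a fixed random matrix stands in for the downloaded network (it is not a real pretrained model), the data is random, and only a new 4-way softmax head is trained. All names and sizes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a downloaded network: one frozen feature layer (hypothetical, random weights).
W_frozen = rng.normal(scale=0.5, size=(8, 16))
W_before = W_frozen.copy()


def features(x):
    """Frozen part of the network: these weights are never updated."""
    return np.maximum(x @ W_frozen, 0.0)


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def loss_and_probs(F, y, W_head, b_head):
    p = softmax(F @ W_head + b_head)
    loss = -np.log(p[np.arange(len(y)), y]).mean()
    return loss, p


# Hypothetical toy data: 40 examples, 8 inputs, 4 classes.
X = rng.normal(size=(40, 8))
y = rng.integers(0, 4, size=40)

# New 4-way softmax head: the only trainable parameters.
W_head = np.zeros((16, 4))
b_head = np.zeros(4)

F = features(X)  # computed once, since the frozen layers never change
initial_loss, _ = loss_and_probs(F, y, W_head, b_head)
for _ in range(200):
    _, p = loss_and_probs(F, y, W_head, b_head)
    grad_z = p.copy()
    grad_z[np.arange(len(y)), y] -= 1.0  # gradient of cross-entropy w.r.t. the logits
    grad_z /= len(y)
    W_head -= 0.01 * F.T @ grad_z
    b_head -= 0.01 * grad_z.sum(axis=0)
final_loss, _ = loss_and_probs(F, y, W_head, b_head)

print(initial_loss, final_loss)
```

Because the frozen layers never change, their outputs can be computed once and cached, which is a common trick when only the head is trained. In Keras, the analogous move is to set `trainable = False` on the layers you want frozen before compiling.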

If you have large amounts of data, you can consider freezing fewer of the layers, unfreezing the downstream ones first. For example, you could freeze all but the final FC layers.

Of course, if you have a ton of data and some computational power (or time), you could retrain the whole network.


## Augmentation

You've covered this before: rotation, color shift, mirroring, etc. _PCA (Principal Component Analysis) color augmentation is not something you've heard about before_. It performs a color shift that follows the picture's dominant colors, so the overall tint is preserved: if an RGB image has high G and B values and low R values in most places, it shifts mostly G and B and leaves R nearly alone. It's in the AlexNet paper cited above.
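Here is a minimal NumPy sketch of the PCA color shift. It is a simplification: it computes the principal components from a single image, whereas the AlexNet paper computes them over the RGB values of the whole training set. The function name, the random test image, and the [0, 1] value range are assumptions; the standard deviation of 0.1 for the random draws follows the paper.

```python
import numpy as np


def pca_color_shift(image, rng, sigma=0.1):
    """Shift every pixel along the principal components of the image's RGB values.

    image: float array of shape (H, W, 3) with values in [0, 1].
    """
    pixels = image.reshape(-1, 3)
    cov = np.cov(pixels - pixels.mean(axis=0), rowvar=False)  # 3x3 RGB covariance
    eigvals, eigvecs = np.linalg.eigh(cov)                    # columns of eigvecs are the components
    alphas = rng.normal(0.0, sigma, size=3)                   # one random draw per component
    shift = eigvecs @ (alphas * eigvals)                      # a single RGB offset for the whole image
    return np.clip(image + shift, 0.0, 1.0)


rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))  # hypothetical stand-in for a real photo
augmented = pca_color_shift(image, rng)
print(augmented.shape)  # (8, 8, 3)
```

Components with large eigenvalues (the color directions that vary most in the image) get the largest shifts, which is why the augmentation respects the image's existing color balance.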

## "The State of Computer Vision"

Two sources of knowledge in a learning system:
- Labeled data
- Human-engineered architecture/features

We still don't have a lot of data for every task. Insightful engineering matters most for the tasks where we have less data.