# Convolutional Neural Networks

For more reading refer to: 

- Andrej Karpathy's Stanford course
    - http://cs231n.github.io/convolutional-networks/
- Michael Nielsen's free book
    - http://neuralnetworksanddeeplearning.com/
- Goodfellow, Bengio, Courville's book
    - http://deeplearningbook.org/

# Data Considerations

## Colour
- If you are training a network to classify objects or letters, colour may not be useful to the classifier
- In that case it will be faster to switch from RGB to grayscale
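
A minimal sketch of the RGB-to-grayscale switch, assuming a plain average of the three channels (a weighted luminance formula is another common choice). The helper name `rgb_to_grayscale` and the sample pixels are made up for illustration.

```python
def rgb_to_grayscale(image):
    """Average the three colour channels of an image stored as rows of (r, g, b) tuples."""
    return [[sum(pixel) / 3.0 for pixel in row] for row in image]

# Hypothetical 2x2 RGB image with values in [0, 255]
rgb = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (30, 60, 90)],
]
gray = rgb_to_grayscale(rgb)
print(gray)  # one value per pixel instead of three
```

The network then sees a depth of 1 instead of 3, so the first layer needs a third of the input weights.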

## Statistical invariance
- Similarly, the network should be able to classify an object regardless of where it is in an image
    - translation invariance
- Similarly, in long text, a word's meaning (e.g. 'kitten') does not change depending on where it appears in the text
- Neural networks accomplish this via **Weight Sharing**
    - When two inputs can contain the same information 
    - Train the weights jointly for those inputs
    - For images this will be the fundamental principle behind **Convolutional networks**
    - For text, it will be the basis of **Embeddings** and **Recurrent neural networks**
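
A rough 1D illustration of weight sharing: the same three weights are applied at every position, so a pattern produces the same response wherever it occurs, just shifted. The function `slide`, the weights, and the signals are hypothetical.

```python
def slide(weights, signal):
    """Apply the same weights at every position of a 1D signal (no padding)."""
    n = len(weights)
    return [
        sum(w * x for w, x in zip(weights, signal[i:i + n]))
        for i in range(len(signal) - n + 1)
    ]

weights = [1, 2, 1]                 # hypothetical shared weights
early = [0, 5, 9, 5, 0, 0, 0, 0]    # pattern near the start
late = [0, 0, 0, 0, 5, 9, 5, 0]     # same pattern, shifted right by 3

out_early = slide(weights, early)
out_late = slide(weights, late)
print(max(out_early), out_early.index(max(out_early)))
print(max(out_late), out_late.index(max(out_late)))
```

The peak response has the same value in both cases and moves by exactly the shift of the pattern.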

# Convolutional Networks

- Also known as convnets
- Networks that share their parameters across space
- Images usually have a width, height, and depth
    - Depth = 3 channels for colour
- Take a small patch (blue) of the image (white), and run a small neural network on it with `k` outputs (red)
- This neural network slides across the entire image without changing its weights
- The result is a new output image
    - Different width, height, and **most importantly** a different depth
    - Instead of RGB, we now have `k` channels
- If the patch was the size of the image, it would be the same as a regular neural network
- This smaller patch forces the layer to share weights
- A convnet basically has stacks of convolutions instead of stacks of matrix-multiply layers

<img src="./images/week6/convolution.png" alt="convolution" style="width:250px">    
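
The sliding patch can be sketched in plain Python as below. This is a deliberately naive version with no padding and no kernel flipping (as in most deep learning libraries); the function name `conv2d`, the image, and the filter values are assumptions for illustration, not the TensorFlow implementation.

```python
def conv2d(image, filters, stride=1):
    """Naive convolution without padding: image is H x W x C, filters is k x F x F x C."""
    height, width = len(image), len(image[0])
    size = len(filters[0])
    output = []
    for top in range(0, height - size + 1, stride):
        row = []
        for left in range(0, width - size + 1, stride):
            # The same k filters are reused at every patch position.
            responses = []
            for filt in filters:
                total = 0.0
                for dy in range(size):
                    for dx in range(size):
                        for c, weight in enumerate(filt[dy][dx]):
                            total += weight * image[top + dy][left + dx][c]
                responses.append(total)
            row.append(responses)
        output.append(row)
    return output

# Hypothetical 6x6 RGB image of ones and k=5 filters of size 3x3x3
image = [[[1.0, 1.0, 1.0] for _ in range(6)] for _ in range(6)]
filters = [[[[0.1 * (n + 1)] * 3 for _ in range(3)] for _ in range(3)] for n in range(5)]

result = conv2d(image, filters)
print(len(result), len(result[0]), len(result[0][0]))  # 4 4 5
```

The width and height shrink from 6 to 4, while the depth changes from 3 to `k = 5`.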

- Basic idea is to form a pyramid
- Shallow RGB bottom 
- Apply convolutions that progressively squeeze the spatial dimensions 
- While increasing depth, which corresponds roughly to the semantic complexity of the representation

<img src="./images/week6/convnet.png" alt="pyramid" style="width:400px">  

- At the top, we can put a **classifier**

### Terminology
- `Patches  = Kernels`
- Each pancake layer of depth = 'feature map'
    - RGB = 3 feature maps
    - Map first layer from 3 to K feature maps 
- Stride = # pixels we move the kernel between each step
    - tunable hyperparameter
    - Stride of 1 leads to output roughly the same width and height as input
    - Stride of 2 means it is about half the size
    - **Valid padding**
        - only striding within the boundaries of the image
    - **Same padding**
        - Shortcut that pads outsides with zeros 
        - Results in same output size as input

## Filter depth

- Number of filters in a convolutional layer
- Dictates how many neurons each patch connects to
- If we have `k`, we connect each patch of pixels to `k` neurons in the next layer
- `k` is a tunable hyperparameter
- Most CNNs tend to pick the same starting values

<img src="./images/week6/filter-depth.png" alt="dilterdepth" style="width:200px">  

### Why connect a single patch to multiple neurons in next layer?

- Can capture more interesting characteristics
- Multiple neurons for a given  patch ensures the CNN will learn to capture whatever characteristics are important
- The CNN is not explicitly programmed to look for characteristics
- It **learns on its own**

## Strides, depth, and padding
<img src="./images/week6/strides.png" alt="dilterdepth" style="width:400px">  

- If we start with an input image that is 28x28x3
- Output depth is fixed to `k=8`
- Same padding means the output width and height will be the same
    - achieved by zero-padding
- Valid padding means that, in order for our 3x3 patch to fit over the image, we lose 1 row and 1 column from each side
- Stride of 2 reduces the output size by one half
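
One way to check these sizes is to count the positions where the patch actually fits along one dimension. The sketch below assumes one pixel of zero padding on each side for same padding with a 3x3 patch; real libraries may pad asymmetrically. `count_positions` is a hypothetical helper.

```python
def count_positions(input_size, patch_size, stride, pad_each_side=0):
    """Count where a patch fits along one dimension by listing its start offsets."""
    padded = input_size + 2 * pad_each_side
    starts = range(0, padded - patch_size + 1, stride)
    return len(starts)

size, patch = 28, 3
print(count_positions(size, patch, stride=1))                   # valid, stride 1 -> 26
print(count_positions(size, patch, stride=1, pad_each_side=1))  # same, stride 1  -> 28
print(count_positions(size, patch, stride=2))                   # valid, stride 2 -> 13
print(count_positions(size, patch, stride=2, pad_each_side=1))  # same, stride 2  -> 14
```

With `k=8` the corresponding output volumes are 26x26x8, 28x28x8, 13x13x8, and 14x14x8.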

## Implementation
<img src="./images/week6/cnn-implementation.png" alt="dilterdepth" style="width:400px">  

- With each successive layer, we take **larger strides** and increase `k`
- Effectively reduces the width and height of the representation
- And increases the depth
- In the final step we can add a few fully connected layers
- And add a classifier on top

### Dimensionality
- Given an input of width (or height) `W`
- Filter of width (or height) `F`
- Stride size `S`
- Padding `P`
- `new_height = (input_height - filter_height + 2 * P)/S + 1`
- `new_width = (input_width - filter_width + 2 * P)/S + 1`
- `new_depth = number of filters`
- Output size along each spatial dimension = **`(W-F+2P)/S+1`**
- Volume of the next layer = `new_height * new_width * new_depth`
- Knowing the dimensionality of each layer helps us understand how large our model is and how filter size and stride will affect the overall size of the network


In [1]:
def new_dim(input_dim, filt_dim, pad_size, stride_size):
    # Output size along one spatial dimension
    new_out = (input_dim - filt_dim + 2 * pad_size) / stride_size + 1
    return new_out

new_dim(32,8,1,2)

14.0

In [2]:
import tensorflow as tf
input = tf.placeholder(tf.float32, (None, 32, 32, 3))
filter_weights = tf.Variable(tf.truncated_normal((8, 8, 3, 20)))  # (height, width, input_depth, output_depth)
filter_bias = tf.Variable(tf.zeros(20))
strides = [1, 2, 2, 1]  # (batch, height, width, depth)
padding = 'VALID'
conv = tf.nn.conv2d(input, filter_weights, strides, padding) + filter_bias

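- The above `conv` will have shape `[batch, 13, 13, 20]`, i.e. `[1,13,13,20]` for a single image
    - 4D
- _Not_ `[1,14,14,20]`, because `'VALID'` adds no padding (the 14 above assumed `P = 1`) and the padding algorithm in TF is not the same as the formula we've seen
- Switching `padding` from `'VALID'` to `'SAME'`
    - Output shape `[1,16,16,20]`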

More info on how TF does padding
https://www.tensorflow.org/api_guides/python/nn#Convolution
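
The rule described in the TensorFlow padding notes can be written down as a small helper. `tf_output_size` is a hypothetical name for this sketch, not part of the TensorFlow API; it only reproduces the documented output-size arithmetic for one spatial dimension.

```python
import math

def tf_output_size(input_size, filter_size, stride, padding):
    """Output size along one spatial dimension, following TensorFlow's documented rule."""
    if padding == 'SAME':
        return math.ceil(input_size / stride)
    if padding == 'VALID':
        return math.ceil((input_size - filter_size + 1) / stride)
    raise ValueError("padding must be 'SAME' or 'VALID'")

print(tf_output_size(32, 8, 2, 'VALID'))  # 13
print(tf_output_size(32, 8, 2, 'SAME'))   # 16
```

Note that with `'SAME'` the output size depends only on the input size and the stride, not on the filter size.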

## Weight Sharing
- Fewer parameters to optimize
- Which means faster convergence to a minimum
- At the expense of making the model less flexible
    - This reduced flexibility often acts as a regularizer and helps avoid overfitting, since each weight is shared across many neurons
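
To get a feel for the savings, here is a back-of-the-envelope count for the 32x32x3 input and the 8x8 filters with 20 outputs used earlier, compared with a fully connected layer producing the same 13x13x20 output. The helper names are made up for this sketch.

```python
def conv_params(filter_height, filter_width, in_depth, out_depth):
    """Weights plus one bias per output channel for a convolutional layer."""
    return filter_height * filter_width * in_depth * out_depth + out_depth

def dense_params(num_inputs, num_outputs):
    """Weights plus one bias per output unit for a fully connected layer."""
    return num_inputs * num_outputs + num_outputs

shared = conv_params(8, 8, 3, 20)
unshared = dense_params(32 * 32 * 3, 13 * 13 * 20)
print(shared)    # 3860
print(unshared)  # 10386740
```

The shared-weight layer needs a few thousand parameters where the unshared one needs about ten million.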

# Visualizing CNNs

http://www.matthewzeiler.com/pubs/arxive2013/eccv2014.pdf
https://www.youtube.com/watch?v=ghEmQSxT6tw

## Layer 1
<img src="./images/week6/layer1.png" alt="layer1" style="width:200px">  

- Top left recognizes -45 degree lines
- Top middle recognizes 45 degree lines
- Also contains green blobs and Gabor-like filters

## Layer 2
<img src="./images/week6/layer2.png" alt="layer1" style="width:400px">  

- Visualization of 2nd layer in CNN
- More complex characteristics like circles and stripes, corners
- Left: what the CNN sees based on corresponding images from grid on right
- **The CNN learns to do this on its own**
    - No special instruction for the CNN to focus on more complex features in deeper layers
    
## Layer 3
<img src="./images/week6/layer3.png" alt="layer1" style="width:500px">  

- Third layer picks out complex combinations of features from second layer
- Grids, honeycombs (top left), wheels, faces

## Layer 5
<img src="./images/week6/layer5.png" alt="layer1" style="width:400px">  

- Fifth and final layer (layer 4 is skipped here)
- Last layer picks out highest order ideas
- Dog faces, bird faces, bicycles

# TensorFlow Convolutional Layer

- TensorFlow provides `tf.nn.conv2d()` and `tf.nn.bias_add()` to create our own conv layers
- TF uses a `stride` for each input dimension
    - `[batch, input_height, input_width, input_channels]`
    - We generally set the 1st and 4th elements to 1

In [3]:
#Output depth
k_output = 64

#Image properties
image_width = 10
image_height = 10
color_channels = 3

#Convolution filter
filter_size_width = 5
filter_size_height = 5

#input/image (TF expects [batch, height, width, channels])
input = tf.placeholder(
    tf.float32,
    shape=[None, image_height, image_width, color_channels]
)

#weight and bias (filter shape is [height, width, input_depth, output_depth])
weight = tf.Variable(tf.truncated_normal(
    [filter_size_height, filter_size_width, color_channels,
    k_output]
))
bias = tf.Variable(tf.zeros(k_output))

#Apply convolution
conv_layer = tf.nn.conv2d(input,weight,strides=[1,2,2,1],
                         padding = 'SAME')
#add bias
conv_layer = tf.nn.bias_add(conv_layer,bias)
#apply activation function
conv_layer = tf.nn.relu(conv_layer)

# How to improve convnets

- Pooling
- 1x1 convolutions
- Inception architecture

## Pooling
<img src="./images/week6/pooling.png" alt="pooling" style="width:400px">  

- Striding shifts the filter by a few pixels at a time to reduce the feature map size
    - This is a very aggressive way to downsample
    - Removes a lot of information
- What if we took small strides (say 1 step)
- Then took all the convolutions in a neighbourhood and combined them somehow
    - This is known as **Pooling**

### Max Pooling
<img src="./images/week6/maxpooling.png" alt="max-pooling" style="width:400px">  

- Max pooling is most common
- At every point on the feature map, look at a small neighbourhood around that point, and compute the maximum of the values in it
- This is **parameter-free**!
- Often more accurate
- However, it is more _expensive_ 
    - Due to shorter convolution strides
- More hyperparameters to worry about
    - Pooling Size
    - Pooling Stride
    - They do not have to be the same
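
A small sketch of max pooling on a single feature map, with the pooling size and pooling stride as separate arguments. The function `max_pool` here is a plain-Python stand-in (not `tf.nn.max_pool`) and the 4x4 values are invented.

```python
def max_pool(feature_map, pool_size=2, stride=2):
    """Max pooling over a single 2D feature map (no padding)."""
    height, width = len(feature_map), len(feature_map[0])
    return [
        [
            max(
                feature_map[top + dy][left + dx]
                for dy in range(pool_size)
                for dx in range(pool_size)
            )
            for left in range(0, width - pool_size + 1, stride)
        ]
        for top in range(0, height - pool_size + 1, stride)
    ]

# Hypothetical 4x4 feature map
feature_map = [
    [1, 3, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
print(max_pool(feature_map))                         # [[6, 8], [3, 4]]
print(max_pool(feature_map, pool_size=3, stride=1))  # [[7, 8], [7, 8]]
```

The second call uses a pooling size of 3 with a stride of 1, showing that the two hyperparameters are independent. There are no weights to learn in either case.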

**A common architecture for CNN classifiers is:**
<img src="./images/week6/cnn-layers.png" alt="cnn-layers" style="width:250px">  

- "`LENET-5`" 
    - First successful use was by Yann LeCun in 1998
        - Character recognition        
- "`ALEXNET`"  
    - Alex Krizhevsky 2012
    - Won the ImageNet object recognition challenge in 2012
<img src="./images/week6/example-networks.png" alt="example-networks" style="width:250px">  

### Average Pooling
- Take an average over a window of pixels around a specific location
- Similar to providing a blurred low resolution view of the feature map below
<img src="./images/week6/avg-pooling.png" alt="average-pooling" style="width:250px">  
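
The blurring effect can be seen on a tiny example. This is only a sketch: `average_pool` and the input values are hypothetical, and no padding is applied.

```python
def average_pool(feature_map, pool_size=2, stride=2):
    """Average pooling over a single 2D feature map (no padding)."""
    height, width = len(feature_map), len(feature_map[0])
    area = pool_size * pool_size
    pooled = []
    for top in range(0, height - pool_size + 1, stride):
        row = []
        for left in range(0, width - pool_size + 1, stride):
            window = [
                feature_map[top + dy][left + dx]
                for dy in range(pool_size)
                for dx in range(pool_size)
            ]
            row.append(sum(window) / area)
        pooled.append(row)
    return pooled

# Hypothetical feature map with two sharp spikes
sharp = [
    [0, 0, 0, 0],
    [0, 8, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 4],
]
blurred = average_pool(sharp)
print(blurred)  # [[2.0, 0.0], [0.0, 1.0]]
```

Each spike is spread over its window, so the result is a smoothed, lower-resolution copy of the input.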

# TensorFlow Max Pooling
<img src="./images/week6/ex-max.png" alt="example-max-pooling" style="width:250px">  

- Source Aphex34 (Own work) [CC BY-SA 4.0 (http://creativecommons.org/licenses/by-sa/4.0)], via Wikimedia Commons
- Example max pooling
- 2x2 filter and a stride of 2
- Four colours represent each time the filter was applied to find the maximum value
- Used to reduce the size of the input, and allow the net to focus on only the most important elements
- `tf.nn.max_pool()`

In [4]:
conv_layer = tf.nn.conv2d(input, weight, strides=[1, 2, 2, 1], padding='SAME')
conv_layer = tf.nn.bias_add(conv_layer, bias)
conv_layer = tf.nn.relu(conv_layer)
# Apply Max Pooling
conv_layer = tf.nn.max_pool(
    conv_layer,
    ksize=[1, 2, 2, 1],
    strides=[1, 2, 2, 1],
    padding='SAME')

## Pooling layers today

- Pooling layers have recently fallen out of favor
- Recent datasets are so big and complex that we're more concerned about _underfitting_
- **Dropout is a much better regularizer**
- Pooling results in a loss of information
- Think about max pooling operation as an example
    - We only keep the largest of `n` numbers
    - Disregards `n-1` numbers completely
    
- `new_height = (input_height - filter_height)/S +1`
- `new_width = (input_width - filter_width)/S +1`
- For a pooling layer, the output depth is the same as the input depth
- Pooling operation is applied individually for each depth slice
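
To make the per-slice behaviour concrete, the sketch below pools a 4x4x5 volume with a 2x2 window and a stride of 2, using the size formula above. `max_pool_volume` and the encoded test values are assumptions for illustration only.

```python
def max_pool_volume(volume, pool_size, stride):
    """Max-pool an H x W x D volume, treating each depth slice independently."""
    height, width, depth = len(volume), len(volume[0]), len(volume[0][0])
    out_height = (height - pool_size) // stride + 1
    out_width = (width - pool_size) // stride + 1
    return [
        [
            [
                max(
                    volume[i * stride + dy][j * stride + dx][d]
                    for dy in range(pool_size)
                    for dx in range(pool_size)
                )
                for d in range(depth)
            ]
            for j in range(out_width)
        ]
        for i in range(out_height)
    ]

# Hypothetical 4x4x5 volume whose value encodes (row, column, channel)
volume = [[[100 * r + 10 * c + d for d in range(5)] for c in range(4)] for r in range(4)]
pooled = max_pool_volume(volume, pool_size=2, stride=2)
print(len(pooled), len(pooled[0]), len(pooled[0][0]))  # 2 2 5
print(pooled[0][0])  # [110, 111, 112, 113, 114]
```

The last digit of every output value still matches its channel index, which shows that channels are never mixed.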

In [5]:
input = tf.placeholder(tf.float32, (None, 4, 4, 5))
filter_shape = [1, 2, 2, 1]
strides = [1, 2, 2, 1]
padding = 'VALID'
pool = tf.nn.max_pool(input, filter_shape, strides, padding)  # shape: [None, 2, 2, 5]

# 1x1 convolutions
<img src="./images/week6/onebyone.png" alt="1x1-convolution" style="width:350px">  

- Not really looking at patch of image, just one pixel
- Classic convolution is _basically_ a small classifier for a patch of the image
    - Only a linear classifier
- If we add a 1x1 convolution in the middle
    - Suddenly we have a mini neural network learning over the patch before the linear classifier
- Interspersing convolutions with 1x1 convolutions is a very inexpensive way to make the model deeper and give it more parameters without completely changing the structure
- **Very cheap**
    - 1x1 convolutions <=> matrix multiplies
- We went through pooling and 1x1 convolutions because in recent years there is a new technique that uses them to build nets that are smaller + better
    - Better than simply using a pyramid of convolutions
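
The equivalence between a 1x1 convolution and a matrix multiply can be checked directly. In this sketch the image, the weights, and both helper functions are hypothetical; biases and activations are left out to keep the comparison simple.

```python
def conv1x1(image, weights):
    """1x1 convolution: image is H x W x C_in, weights is C_in x C_out."""
    out_depth = len(weights[0])
    return [
        [
            [sum(pixel[c] * weights[c][o] for c in range(len(pixel))) for o in range(out_depth)]
            for pixel in row
        ]
        for row in image
    ]

def matmul(a, b):
    """Plain matrix product of an N x C_in matrix with a C_in x C_out matrix."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# Hypothetical 2x2 image with 3 channels, mapped to 2 output channels
image = [
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [0, 1, 0]],
]
weights = [[1, 0], [0, 1], [1, 1]]

convolved = conv1x1(image, weights)
# Flatten the pixels into a 4 x 3 matrix and multiply once.
flat = [pixel for row in image for pixel in row]
product = matmul(flat, weights)
print(product)  # [[4, 5], [10, 11], [16, 17], [0, 1]]
```

Flattening `convolved` pixel by pixel gives exactly the rows of `product`.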

# Inception Modules
- At each layer of convnet, we can decide
    - Do we want a convolution?
        - 1x1?
        - 3x3?
        - 5x5?
    - Pooling computation?
- It turns out that all of these are beneficial to the power of the network
- _**So why choose?**_
- **Use them all!**

<img src="./images/week6/inception.png" alt="inception module" style="width:350px">  

- Average pool followed by 1x1
- 1x1 conv
- 1x1 followed by 3x3
- 1x1 followed by 5x5
- Then concatenate all of them at the end
- Can choose the parameters such that the total # of parameters in our model is very small
- Yet the model performs better than if we had a simple convolution
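
A rough weight count shows why the 1x1 convolutions in front of the larger filters keep the module small. All depths below are illustrative assumptions, and the sketch assumes every branch keeps the same width and height so that the outputs can be concatenated along depth.

```python
def conv_weights(kernel_size, in_depth, out_depth):
    """Number of weights in a convolution (biases ignored for simplicity)."""
    return kernel_size * kernel_size * in_depth * out_depth

in_depth = 192  # hypothetical input depth

# Hypothetical output depths for the four branches
branches = {
    'pool_then_1x1': conv_weights(1, in_depth, 32),
    '1x1': conv_weights(1, in_depth, 64),
    '1x1_then_3x3': conv_weights(1, in_depth, 96) + conv_weights(3, 96, 128),
    '1x1_then_5x5': conv_weights(1, in_depth, 16) + conv_weights(5, 16, 32),
}
output_depth = 32 + 64 + 128 + 32  # branches are concatenated along depth

direct_5x5 = conv_weights(5, in_depth, 32)
print(output_depth)              # 256
print(branches['1x1_then_5x5'])  # 15872
print(direct_5x5)                # 153600
```

In this example, reducing the depth to 16 with a 1x1 convolution first makes the 5x5 branch roughly ten times cheaper than applying the 5x5 filters to the full input depth.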

# Convolutional Network in TensorFlow
- Found in the separate file `cnn.py`