# Convolutional Neural Networks

## 1. Local Connectivity

**Multilayer Perceptrons (MLPs):**
* Only use fully connected layers
* Only accept vectors as input

**CNNs:**
* Also use sparsely connected layers
* Also accept matrices as input

Locally connected layers (as used in CNNs) use far fewer parameters than densely (fully) connected layers, are less prone to overfitting, and are better suited to teasing out the patterns contained in image data.
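
As a rough illustration of the parameter savings, the sketch below counts weights and biases for one fully connected layer and one convolutional layer. The input size (32x32x3), the 500 hidden units, and the 16 filters of size 2x2 are assumptions chosen for illustration, and the helper names are hypothetical.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit in a fully connected layer."""
    return n_inputs * n_units + n_units


def conv_params(n_filters, kernel_size, depth_in):
    """Weights plus one bias per filter in a convolutional layer."""
    return n_filters * kernel_size * kernel_size * depth_in + n_filters


# Assumed input: a 32x32 RGB image
height, width, depth = 32, 32, 3

dense_total = dense_params(height * width * depth, 500)
conv_total = conv_params(16, 2, depth)

print(dense_total)  # 1536500
print(conv_total)   # 208
```

The convolutional layer stays small because every filter reuses the same few weights at each window position, whereas the dense layer needs a separate weight for every input-unit pair.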

<img src="//aind-notes.s3-website-eu-west-1.amazonaws.com/img/fcl.png" style="display: inline-block; height: 180px"/>
<img src="//aind-notes.s3-website-eu-west-1.amazonaws.com/img/cl.png" style="display: inline-block; height: 180px"/>


## 2. Convolutional Layers
Conv layers break the image up into smaller pieces, each covered by a convolution window (e.g. the colored areas above).
We slide the **window** horizontally and vertically over the matrix of image pixels.
<img src="//aind-notes.s3-website-eu-west-1.amazonaws.com/img/convlayer.png" style="height: 300px"/>
Instead of representing the weights in the arrows (as above) we'll use a matrix. This weight matrix is called a **filter**. The size of the filter is the same as the window.
<img src="//aind-notes.s3-website-eu-west-1.amazonaws.com/img/convlayer1.png" style="height: 300px"/>
In conv layers we usually use multiple filters to detect distinct features of the input. Each filter produces its own activation map, a simplification of the input that only pays attention to a small set of features.
<img src="//aind-notes.s3-website-eu-west-1.amazonaws.com/img/convlayer2.png" style="display: inline-block; height: 250px" />
<img src="//aind-notes.s3-website-eu-west-1.amazonaws.com/img/convlayer3.png" style="display: inline-block; height: 250px"/>

    The size(height x width) and number of filters are hyperparameters that can be tuned.
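
The sliding-window idea can be sketched in plain Python. This is a minimal illustration, assuming a single-channel image stored as a list of lists, a stride of 1, no padding, and no bias or activation. The function name `convolve2d` and the example filter are hypothetical, not part of Keras.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return the activation map."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Weighted sum of the window whose top-left corner is at (i, j)
            total = sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(k_h)
                for dj in range(k_w)
            )
            row.append(total)
        output.append(row)
    return output


# A 4x4 image with a dark left half and a bright right half
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# A 2x2 filter that responds to dark-to-bright vertical edges
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

activation_map = convolve2d(image, vertical_edge)
for row in activation_map:
    print(row)  # each row is [0, 2, 0]
```

The filter only fires in the middle column, where the window straddles the edge. A different filter applied to the same image would highlight a different feature.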

## 3. Stride and Padding
The stride of a convolution is the amount by which the filter slides across the image. With a stride of 1, the filter slides 1 px at a time (horizontally and vertically), resulting in a conv layer of almost the same size as the input image.
<img src="//aind-notes.s3-website-eu-west-1.amazonaws.com/img/stride.png" style="height: 320px; display: inline-block" />
<img src="//aind-notes.s3-website-eu-west-1.amazonaws.com/img/padding.png" style="height: 320px; display: inline-block" />

It's common to add padding (zeros) around the edges of the image to avoid losing information when applying the filter (e.g. if the filter size and stride don't fit the input size evenly).
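
To see how stride and padding interact, the sketch below pads a small image with zeros and lists the offsets at which a window fits. It is a simplified illustration that assumes symmetric padding on all sides; `zero_pad` and `window_positions` are hypothetical helper names.

```python
def zero_pad(image, pad):
    """Surround a 2D image (list of lists) with `pad` rows/columns of zeros."""
    width = len(image[0]) + 2 * pad
    padded = [[0] * width for _ in range(pad)]
    for row in image:
        padded.append([0] * pad + list(row) + [0] * pad)
    padded.extend([0] * width for _ in range(pad))
    return padded


def window_positions(size, window, stride):
    """Offsets along one axis at which a window of the given size fits completely."""
    return list(range(0, size - window + 1, stride))


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
for row in zero_pad(image, 1):
    print(row)

# A 3-wide window on a 5-wide input
print(window_positions(5, 3, 1))  # [0, 1, 2]        -> output is smaller than the input
print(window_positions(7, 3, 1))  # [0, 1, 2, 3, 4]  -> padding of 1 per side keeps the size at 5
print(window_positions(5, 3, 2))  # [0, 2]           -> a larger stride shrinks the output
```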

### Convolutional Layers in Keras
    from keras.layers import Conv2D
    
    Conv2D(filters, kernel_size, strides, padding, activation="relu", input_shape)
* `filters` - the number of filters
* `kernel_size[number || tuple]` - height and width of the convolution window
* `?strides[number || tuple]` - stride of the convolution (`default=1`)
* `?padding` - `valid` or `same` (`default='valid'`)
* `?activation` - typically use `relu` (`default=None`)
* `?input_shape[tuple]` - only used if this layer is the input layer

## 4. Dimensionality

### Number of Parameters in a Convolutional Layer
The number of parameters in a convolutional layer depends on the supplied values of `filters`, `kernel_size`, and `input_shape`. Let's define a few variables:

* K - the number of filters in the convolutional layer
* F - the height and width of the convolutional filters
* D_in - the depth of the previous layer

Notice that `K = filters`, and `F = kernel_size`. Likewise, `D_in` is the last value in the `input_shape` tuple.

Since there are `F*F*D_in` weights per filter, and the convolutional layer is composed of K filters, the total number of weights in the convolutional layer is `K*F*F*D_in`. Since there is one bias term per filter, the convolutional layer has K biases. Thus, the number of parameters in the convolutional layer is given by **`K*F*F*D_in + K`**.
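
The formula can be checked with a few lines of Python. The layer settings below are illustrative and use the same values as the three-layer example network in section 6; `conv_param_count` is a hypothetical helper name.

```python
def conv_param_count(K, F, D_in):
    """Number of parameters in a conv layer: K*F*F*D_in weights plus K biases."""
    weights = K * F * F * D_in
    biases = K
    return weights + biases


# (K, F, D_in) for three stacked layers; each D_in equals the previous layer's K
layers = [(16, 2, 3), (32, 2, 16), (64, 2, 32)]
counts = [conv_param_count(K, F, D_in) for K, F, D_in in layers]
print(counts)  # [208, 2080, 8256]
```

Note that the count does not depend on the height or width of the input, only on its depth.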

### Shape of a Convolutional Layer
The shape of a convolutional layer depends on the supplied values of `kernel_size`, `input_shape`, `padding`, and `strides`. Let's define a few variables:

* K - the number of filters in the convolutional layer
* F - the height and width of the convolutional filters
* S - the stride of the convolution
* H_in - the height of the previous layer
* W_in - the width of the previous layer

Notice that `K = filters`, `F = kernel_size`, and `S = strides`. Likewise, `H_in` and `W_in` are the first and second value of the `input_shape` tuple, respectively.

The **depth** of the convolutional layer will always equal the number of filters K.

If `padding = 'same'`, then the spatial dimensions of the convolutional layer are the following:

* **height** = ceil(float(`H_in`) / float(`S`))
* **width** = ceil(float(`W_in`) / float(`S`))

If `padding = 'valid'`, then the spatial dimensions of the convolutional layer are the following:

* **height** = ceil(float(`H_in - F` + 1) / float(`S`))
* **width** = ceil(float(`W_in - F` + 1) / float(`S`))
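
The two sets of formulas translate directly into a small helper. This is a sketch of the arithmetic above, not Keras code; `conv_output_shape` is a hypothetical name, and square filters with equal vertical and horizontal stride are assumed.

```python
import math


def conv_output_shape(H_in, W_in, K, F, S=1, padding="valid"):
    """Return (height, width, depth) of a conv layer using the formulas above."""
    if padding == "same":
        height = math.ceil(H_in / S)
        width = math.ceil(W_in / S)
    elif padding == "valid":
        height = math.ceil((H_in - F + 1) / S)
        width = math.ceil((W_in - F + 1) / S)
    else:
        raise ValueError("padding must be 'same' or 'valid'")
    return (height, width, K)


print(conv_output_shape(32, 32, K=16, F=2, S=1, padding="same"))   # (32, 32, 16)
print(conv_output_shape(32, 32, K=16, F=2, S=1, padding="valid"))  # (31, 31, 16)
print(conv_output_shape(32, 32, K=16, F=3, S=2, padding="valid"))  # (15, 15, 16)
```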

## 5. Pooling Layers
Pooling layers often take conv layers as input. We often use many filters in the conv layers (a big stack of activation maps), which means that the dimensionality can get quite large. Higher dimensionality requires more parameters, which can lead to overfitting.

Pooling layers (max pooling, global average pooling, etc.) take the feature maps and reduce them to their most important features.
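
Both pooling variants can be sketched in plain Python for a single feature map. This is a minimal illustration with no padding; `max_pool2d` and `global_average_pool` are hypothetical helper names, and the example values are made up.

```python
def max_pool2d(feature_map, pool_size=2, stride=None):
    """Keep the maximum of each pool_size x pool_size window (no padding)."""
    stride = pool_size if stride is None else stride
    out_h = (len(feature_map) - pool_size) // stride + 1
    out_w = (len(feature_map[0]) - pool_size) // stride + 1
    return [
        [
            max(
                feature_map[i * stride + di][j * stride + dj]
                for di in range(pool_size)
                for dj in range(pool_size)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def global_average_pool(feature_map):
    """Reduce a whole feature map to its mean value."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)


feature_map = [
    [1, 9, 6, 4],
    [5, 4, 7, 8],
    [5, 1, 2, 9],
    [6, 7, 6, 0],
]

print(max_pool2d(feature_map))           # [[9, 8], [7, 9]]
print(global_average_pool(feature_map))  # 5.0
```

Max pooling halves the height and width here, while global average pooling collapses each feature map to a single number.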

<img src="//aind-notes.s3-website-eu-west-1.amazonaws.com/img/pooling.png" style="height: 320px; display: inline-block" />
<img src="//aind-notes.s3-website-eu-west-1.amazonaws.com/img/pooling1.png" style="height: 320px; display: inline-block" />

### Max Pooling Layers in Keras

    from keras.layers import MaxPooling2D
        
    MaxPooling2D(pool_size, strides, padding)
    
* `pool_size[number || tuple]` - height and width of the pooling window
* `?strides[number || tuple]` - vertical and horizontal stride (`default=pool_size`)
* `?padding` - `valid` or `same` (`default='valid'`)

In [1]:

    from keras.models import Sequential
    from keras.layers import MaxPooling2D

    model = Sequential()
    model.add(MaxPooling2D(pool_size=2, strides=2, input_shape=(100, 100, 15)))
    model.summary()

    _________________________________________________________________
    Layer (type)                 Output Shape              Param #   
    max_pooling2d_1 (MaxPooling2 (None, 50, 50, 15)        0         
    Total params: 0
    Trainable params: 0
    Non-trainable params: 0
    _________________________________________________________________


## 6. CNNs for Image Classification

We're currently resizing input images to the same fixed size (e.g. 32x32x3). 

Our CNN architecture is designed with the goal of decreasing the width and height of our input images, while increasing the depth.

* Convolutional layers are used to make the input array deeper
* Max Pooling layers are used to decrease the spatial dimensions: width and height 

<img src="//aind-notes.s3-website-eu-west-1.amazonaws.com/img/cnn.png" style="height: 250px; display: inline-block" />
<img src="//aind-notes.s3-website-eu-west-1.amazonaws.com/img/cnn1.png" style="height: 250px; display: inline-block" />

#### Things to Remember
* Always add `'relu'` activation function to the `Conv2D` layers
* `Dense` layers should also use `'relu'`, except for the final layer
* Always use `'softmax'` in the final layer for multi-class classification, with one unit per class
* We usually gradually increase the `filters` (depth) of the `Conv2D` layers `(16, 32, 64...)`
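
For reference, the two activation functions mentioned above can be written out in plain Python. This is only a sketch of the math on a list of numbers; Keras applies the same functions element-wise on tensors.

```python
import math


def relu(values):
    """Replace negative values with zero."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(values)
    # Subtracting the maximum avoids overflow and does not change the result
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


print(relu([-2.0, 0.5, 3.0]))  # [0.0, 0.5, 3.0]

probabilities = softmax([2.0, 1.0, 0.1])
print(probabilities)
print(sum(probabilities))
```

The largest score receives the largest probability, which is why softmax is the usual choice for the final layer of a multi-class classifier.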

In [2]:

    from keras.models import Sequential
    from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

In [8]:

    def create(max_pooling=False):
        model = Sequential()
        # The convolutional layers (and the optional max pooling layers) are designed to take
        # the input array of image pixels and convert it to an array where all of the spatial
        # information has been squeezed out, and only information encoding the content of the
        # image remains
        model.add(Conv2D(filters=16, kernel_size=2,
                         padding='same', activation='relu', input_shape=(32, 32, 3)))
        if max_pooling:
            model.add(MaxPooling2D(pool_size=2))
        model.add(Conv2D(filters=32, kernel_size=2, padding='same', activation='relu'))
        if max_pooling:
            model.add(MaxPooling2D(pool_size=2))
        model.add(Conv2D(filters=64, kernel_size=2, padding='same', activation='relu'))
        if max_pooling:
            model.add(MaxPooling2D(pool_size=2))
        # The array is then flattened to a vector
        model.add(Flatten())
        # It is followed by two dense layers designed to further elucidate the content
        # of the image.
        model.add(Dense(500, activation='relu'))
        # The final layer has one entry for each object class in the dataset,
        # and has a softmax activation function, so that it returns probabilities
        model.add(Dense(10, activation='softmax'))
        return model

In [9]:

    no_max_pooling = create()
    no_max_pooling.summary()

    _________________________________________________________________
    Layer (type)                 Output Shape              Param #   
    conv2d_13 (Conv2D)           (None, 32, 32, 16)        208       
    _________________________________________________________________
    conv2d_14 (Conv2D)           (None, 32, 32, 32)        2080      
    _________________________________________________________________
    conv2d_15 (Conv2D)           (None, 32, 32, 64)        8256      
    _________________________________________________________________
    flatten_5 (Flatten)          (None, 65536)             0         
    _________________________________________________________________
    dense_9 (Dense)              (None, 500)               32768500  
    _________________________________________________________________
    dense_10 (Dense)             (None, 10)                5010      
    Total params: 32,784,054
    Trainable params: 32,784,054
    Non-trainable params: 0
    ________________________________________________________________

In [10]:

    max_pooling = create(max_pooling=True)
    max_pooling.summary()

    _________________________________________________________________
    Layer (type)                 Output Shape              Param #   
    conv2d_16 (Conv2D)           (None, 32, 32, 16)        208       
    _________________________________________________________________
    max_pooling2d_8 (MaxPooling2 (None, 16, 16, 16)        0         
    _________________________________________________________________
    conv2d_17 (Conv2D)           (None, 16, 16, 32)        2080      
    _________________________________________________________________
    max_pooling2d_9 (MaxPooling2 (None, 8, 8, 32)          0         
    _________________________________________________________________
    conv2d_18 (Conv2D)           (None, 8, 8, 64)          8256      
    _________________________________________________________________
    max_pooling2d_10 (MaxPooling (None, 4, 4, 64)          0         
    _________________________________________________________________
    flatten_6 (Flatten)          (None, 1024)              0         
    __________
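
The summaries above report a flattened size of 65536 without pooling and 1024 with pooling. The effect on the first dense layer can be estimated by hand, since a dense layer has one weight per input-unit pair plus one bias per unit. The figures below are computed from that formula rather than copied from Keras output, and the helper names are hypothetical.

```python
def flattened_size(height, width, depth):
    """Length of the vector produced by flattening a (height, width, depth) array."""
    return height * width * depth


def dense_param_count(n_inputs, n_units):
    """Weights plus one bias per unit in a fully connected layer."""
    return n_inputs * n_units + n_units


without_pooling = dense_param_count(flattened_size(32, 32, 64), 500)
with_pooling = dense_param_count(flattened_size(4, 4, 64), 500)

print(without_pooling)  # 32768500
print(with_pooling)     # 512500
```

Under these assumptions, three rounds of 2x2 max pooling shrink the first dense layer by a factor of roughly 64.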

### Image Augmentation
We want our algorithm to learn an **invariant representation** of the image. When working with image data, we often encounter a lot of irrelevant variation (noise) such as differences in width, height, color, position, etc. We just want to know if the image contains an avocado.  
* Scale invariance: we don't want our model to change its prediction due to the size of the object
* Rotation invariance: we don't want the angle to matter
* Translation invariance: we don't want the position to matter

CNNs have some built-in translation invariance (from max pooling).

Image augmentation is a technique for making our algorithms more statistically invariant. 
___

**The idea is simple:**

If you want your network to be more **rotation invariant**:
* add rotated copies of the training images, created by random rotations

If you want your network to be more **translation invariant**:
* add translated copies of the training images, created by random translations


#### We expand the training set by augmenting the data
Data augmentation also helps us avoid **overfitting**.
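
The idea can be sketched in plain Python on tiny single-channel images stored as lists of lists. This is a simplified illustration: rotations are limited to multiples of 90 degrees, images are assumed to be square, and `translate`, `rotate90`, and `augment` are hypothetical helper names. A real pipeline would typically use a library that supports arbitrary angles and interpolation.

```python
import random


def translate(image, dx, dy, fill=0):
    """Shift a 2D image by (dx, dy) pixels, filling uncovered pixels with `fill`."""
    height, width = len(image), len(image[0])
    shifted = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            new_y, new_x = y + dy, x + dx
            if 0 <= new_y < height and 0 <= new_x < width:
                shifted[new_y][new_x] = image[y][x]
    return shifted


def rotate90(image):
    """Rotate a 2D image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def augment(images, copies=2, max_shift=1, seed=0):
    """Return the original images plus `copies` randomly transformed versions of each."""
    rng = random.Random(seed)
    augmented = list(images)
    for image in images:
        for _copy in range(copies):
            dx = rng.randint(-max_shift, max_shift)
            dy = rng.randint(-max_shift, max_shift)
            candidate = translate(image, dx, dy)
            for _turn in range(rng.randint(0, 3)):
                candidate = rotate90(candidate)
            augmented.append(candidate)
    return augmented


image = [[1, 2], [3, 4]]
print(translate(image, 1, 0))  # [[0, 1], [0, 3]]
print(rotate90(image))         # [[3, 1], [4, 2]]
print(len(augment([image], copies=3)))  # 4
```

Each augmented copy keeps the label of the image it was derived from, so the training set grows without any new labeling effort.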

## 7. Groundbreaking CNN Architectures

* AlexNet: [paper](http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf)

* VGGNet: [paper](https://arxiv.org/pdf/1409.1556.pdf)

* ResNet: [paper](https://arxiv.org/pdf/1512.03385v1.pdf)

* Keras CNN: [examples](https://arxiv.org/pdf/1512.03385v1.pdf)

* Vanishing Gradients: [blog](http://neuralnetworksanddeeplearning.com/chap5.html)

* Benchmarks: [github](https://github.com/jcjohnson/cnn-benchmarks)

* ImageNet Large Scale Visual Recognition Competition (ILSVRC): [page](http://www.image-net.org/challenges/LSVRC/)

<img src="//aind-notes.s3-website-eu-west-1.amazonaws.com/img/alexnet.png" />
<img src="//aind-notes.s3-website-eu-west-1.amazonaws.com/img/vgg.png" />
<img src="//aind-notes.s3-website-eu-west-1.amazonaws.com/img/resnet.png" />

## 8. Visualizing CNNs
If we can learn to better understand how a CNN learns, we can help it to perform better.

* Visualizing how CNNs learn: [page](http://cs231n.github.io/understanding-cnn/)

* App that visualizes CNNs in real-time: [demo](http://cs231n.github.io/understanding-cnn/) [page](http://openframeworks.cc/)

    * other visualization tool: [demo](https://www.youtube.com/watch?v=AgkfIQ4IGaM&t=78s) [video](https://www.youtube.com/watch?v=ghEmQSxT6tw&t=5s)

    * other visualization tool: [page](https://medium.com/merantix/picasso-a-free-open-source-visualizer-for-cnns-d8ed3a35cfc5)

* Keras blog post *#DeepDreams*: [blog](https://blog.keras.io/how-convolutional-neural-networks-see-the-world.html) [video](https://www.youtube.com/watch?v=XatXy6ZhKZw) [demo](https://deepdreamgenerator.com/)

* Interpretability of CNNs: [article](https://blog.openai.com/adversarial-example-research/) [paper](https://arxiv.org/abs/1611.03530)


#### ImageNet
* resources: [paper](http://www.matthewzeiler.com/pubs/arxive2013/eccv2014.pdf)

<img src="//aind-notes.s3-website-eu-west-1.amazonaws.com/img/layer2.png" style="height: 200px" />
<img src="//aind-notes.s3-website-eu-west-1.amazonaws.com/img/layer3.png" style="height: 200px" />
<img src="//aind-notes.s3-website-eu-west-1.amazonaws.com/img/layer5.png" style="height: 400px" />

## 9. Transfer Learning
___
Use the learned understanding from previously trained models and pass it on to our new deep learning model.

### Transfer learning involves taking a **pre-trained** neural network and adapting the neural network to a new, different data set.

Depending on both:
* the size of the new data set, and
* the similarity of the new data set to the original data set

...the approach for using transfer learning will be different. 
#### There are four main cases:

1. new data set is **small**, new data is **similar** to original training data
2. new data set is **small**, new data is **different** from original training data
3. new data set is **large**, new data is **similar** to original training data
4. new data set is **large**, new data is **different** from original training data
<img src="//aind-notes.s3-website-eu-west-1.amazonaws.com/img/transfer.png" style="height: 400px;" />


* the dividing line between a large and small data set is somewhat subjective
* overfitting is a concern when using transfer learning with a small data set.
* images of dogs and images of wolves would be considered similar (common characteristics)
* images of flowers and images of dogs would be considered different

### Demonstration
___
#### Resources
* Systematic analysis of transferability of features: [paper](https://arxiv.org/pdf/1411.1792.pdf)

* Sebastian Thrun's cancer-detecting CNN: [article](http://www.nature.com/articles/nature21056.epdf?referrer_access_token=_snzJ5POVSgpHutcNN4lEtRgN0jAjWel9jnR3ZoTv0NXpMHRAJy8Qn10ys2O4tuP9jVts1q2g1KBbk3Pd3AelZ36FalmvJLxw1ypYW0UxU7iShiMp86DmQ5Sh3wOBhXDm9idRXzicpVoBBhnUsXHzVUdYCPiVV0Slqf-Q25Ntb1SX_HAv3aFVSRgPbogozIHYQE3zSkyIghcAppAjrIkw1HtSwMvZ1PXrt6fVYXt-dvwXKEtdCN8qEHg0vbfl4_m&tracking_referrer=edition.cnn.com)

* Object localization: [paper](http://cnnlocalization.csail.mit.edu/Zhou_Learning_Deep_Features_CVPR_2016_paper.pdf)
[repo](https://github.com/alexisbcook/ResNetCAM-keras)
[video](https://www.youtube.com/watch?v=fZvOy0VXWAI)

* Bottleneck features: [repo](https://github.com/alexisbcook/keras_transfer_cifar10)


___
#### Overview
<img src="//aind-notes.s3-website-eu-west-1.amazonaws.com/img/transfer2.png" style="height: 400px;" />
* the first layer will detect edges in the image
* the second layer will detect shapes
* the third layer will detect higher level features
___
#### Case 1: Small Data Set, Similar Data:

To **avoid overfitting** on the small dataset, the weights of the original network will be held constant.

Since the data sets are similar, images from each data set will have similar higher level features. The pre-trained network contains relevant information about the new data set.

<img src="//aind-notes.s3-website-eu-west-1.amazonaws.com/img/transfer3.png" style="height: 150px; display: inline-block; margin: auto" />
<img src="//aind-notes.s3-website-eu-west-1.amazonaws.com/img/transfer4.png" style="height: 340px; display: inline-block; margin: auto" />

###### Approach:

1. slice off the end of the neural network
2. add a new **fully connected** layer that matches the number of classes in the new data set
3. **randomize** the weights of the new fully connected layer; freeze all the weights from the pre-trained network
4. train the network to update the weights of the new fully connected layer
___
#### Case 2: Small Data Set, Different Data
To **avoid overfitting** on the small dataset, the weights of the original network will be held constant.

Since the data sets are different, they don't share higher level features. We'll only use the low-level features from the pre-trained model.

<img src="//aind-notes.s3-website-eu-west-1.amazonaws.com/img/transfer5.png" style="height: 150px; display: inline-block; margin: auto" />
<img src="//aind-notes.s3-website-eu-west-1.amazonaws.com/img/transfer6.png" style="height: 340px; display: inline-block; margin: auto" />

###### Approach:
1. slice off most of the pre-trained layers, keeping only those near the beginning of the network
2. add a new **fully connected** layer that matches the number of classes in the new data set
3. **randomize** the weights of the new fully connected layer; freeze all the weights from the pre-trained network
4. train the network to update the weights of the new fully connected layer

___
#### Case 3: Large Data Set, Similar Data
Overfitting is not as much of a concern when training on a large data set; therefore, you can re-train all of the weights.

Since the data sets are similar, images from each data set will have similar higher level features. The pre-trained network contains relevant information about the new data set.

<img src="//aind-notes.s3-website-eu-west-1.amazonaws.com/img/transfer7.png" style="height: 150px; display: inline-block; margin: auto" />
<img src="//aind-notes.s3-website-eu-west-1.amazonaws.com/img/transfer8.png" style="height: 340px; display: inline-block; margin: auto" />

###### Approach:
1. remove the last fully connected layer and replace with a layer matching the number of classes in the new data set
2. randomly initialize the weights in the new fully connected layer
3. initialize the rest of the weights using the pre-trained weights
4. re-train the entire neural network

___
#### Case 4: Large Data Set, Different Data
Overfitting is not as much of a concern when training on a large data set; therefore, you can re-train all of the weights.


Even though the data set is different from the training data, initializing the weights from the pre-trained network might make training faster. In that case the approach is the same as for a large, similar data set.

If using the pre-trained network as a starting point does not produce a successful model, another option is to randomly initialize the convolutional neural network weights and train the network from scratch.

<img src="//aind-notes.s3-website-eu-west-1.amazonaws.com/img/transfer9.png" style="height: 150px; display: inline-block; margin: auto" />
<img src="//aind-notes.s3-website-eu-west-1.amazonaws.com/img/transfer10.png" style="height: 340px; display: inline-block; margin: auto" />

###### Approach:
1. remove the last fully connected layer and replace with a layer matching the number of classes in the new data set


* retrain the network from scratch with randomly initialized weights
* alternatively, you could just use the same strategy as "Large Data Set, Similar Data"
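
The four cases can be summarized as a small lookup helper. This is only a sketch of the decision logic in these notes. The function name `transfer_strategy` and the dictionary keys are hypothetical, and the boundary between a "large" and a "small" data set remains a judgment call.

```python
def transfer_strategy(dataset_is_large, data_is_similar):
    """Map one of the four cases to the approach described in these notes."""
    if dataset_is_large:
        # Cases 3 and 4: replace the last layer, start from the pre-trained
        # weights, and re-train the entire network
        return {
            "layers_kept": "all except the last fully connected layer",
            "freeze_pretrained_weights": False,
            "alternative": None if data_is_similar else "train from scratch with random weights",
        }
    # Cases 1 and 2: freeze what is kept and train only the new fully connected layer
    return {
        "layers_kept": (
            "all except the end of the network"
            if data_is_similar
            else "only the layers near the beginning"
        ),
        "freeze_pretrained_weights": True,
        "alternative": None,
    }


for large in (False, True):
    for similar in (True, False):
        print(large, similar, transfer_strategy(large, similar))
```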