# Convolutional Neural Network (CNN)
# What you are expected to learn in today's class
#### [Basic]
#### * CNN properties: Why use a CNN for images? Why is it (sometimes) better than a DNN?
#### * How does a CNN work?
#### * Some well-known CNN architectures

#### [Advanced]
#### * Use TensorFlow to build a CNN
#### * Computer Vision example: CIFAR-10
#### * Take Home Message for training a CNN

## Question: What if we use an ordinary feedforward DNN on an image?

<img src="Picachu&Feedforward&noref.png">

#### Transform the image into a one-dimensional vector.

#### Too many weights >> time-consuming training process

#### Question: Does the top left grid relate to bottom right grid?
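
To get a feel for the "too many weights" problem, here is a rough parameter count for a single fully connected layer. The image size (100 x 100 RGB) and the 1000 hidden units are made-up numbers chosen for illustration; they are not taken from the figure above.

```python
def dense_param_count(n_inputs, n_units):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units


# Hypothetical sizes, chosen only for illustration.
height, width, channels = 100, 100, 3
n_inputs = height * width * channels   # length of the flattened image vector
hidden_units = 1000

first_layer_params = dense_param_count(n_inputs, hidden_units)
print(n_inputs, first_layer_params)
```

Under these assumptions the first layer alone already holds about 30 million parameters, and none of them encode the fact that neighboring pixels are related.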

## Characteristic of Image

<img src="Graph_composition.png">

## Can you recognize these images?

<img src="Patterns_powerful.png">

# Why CNN for Image?

## [Property 1_What] : Some patterns are much smaller than the whole image
### A neuron does not have to see the whole image to discover the pattern
### Connecting to a small region requires fewer parameters

<img src="CNNProperty1_what.png">

## [Property 2_Where] : The same patterns appear in different regions

In [1]:
# <img src="CNNProperty2_where.png">

<img src="CNNProperty2_share_parameter.png" alt="Drawing" style="width: 750px;"/>
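
Properties 1 and 2 together explain why a convolutional layer is so much cheaper than a fully connected one: each filter only looks at a small region, and the same filter weights are reused at every position. The sketch below compares the two counts. The sizes (3x3 filters, 64 filters or units, a 100 x 100 RGB input) are assumptions for illustration only.

```python
def conv_param_count(filter_size, in_channels, n_filters):
    """Parameters of a conv layer: shared filter weights plus one bias per filter."""
    weights_per_filter = filter_size * filter_size * in_channels
    return (weights_per_filter + 1) * n_filters


def dense_param_count(n_inputs, n_units):
    """Parameters of a fully connected layer: one weight per input per unit, plus biases."""
    return n_inputs * n_units + n_units


# Hypothetical sizes: 64 filters of 3x3 on an RGB image
# versus 64 dense units on the flattened 100 x 100 x 3 image.
conv_params = conv_param_count(filter_size=3, in_channels=3, n_filters=64)
dense_params = dense_param_count(n_inputs=100 * 100 * 3, n_units=64)
print(conv_params, dense_params)
```

Note that the conv count does not depend on the image size at all, because the filter is shared across positions.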

## [Property 3_Size] : Sampling the pixels will not change the object
### We can subsample the pixels to make the image smaller
### >>> Fewer parameters for the network to process the image

<img src="CNNProperty3_subsampling.png">
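
A minimal sketch of subsampling with NumPy slicing, assuming we simply keep every second pixel in each direction (real networks usually use pooling or strided convolution instead). The 6 x 6 array stands in for an image.

```python
import numpy as np

# A toy 6 x 6 "image" whose pixel values are just 0..35.
image = np.arange(36).reshape(6, 6)

# Keep every second row and every second column.
subsampled = image[::2, ::2]

print(image.shape, subsampled.shape)
print(subsampled)
```

The result has a quarter of the original pixels, so every later layer has less data to process.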

## Why use CNN for image?
1. What   : Some patterns are much smaller than the whole image
2. Where : The same patterns appear in different regions
3. Size     : Sampling the pixels will not change the object 

# How does Convolution work in Computer Vision (CV)?
0. Convolution, Filter and Feature Map
1. Filter Size 
2. Padding
3. Stride

## Convolution, Filters and Feature Map
### Each output value is the weighted sum of a pixel and its local neighbors, with the weights given by a filter (kernel)
### Slide the "Filter" over every pixel position to perform the convolution and generate the Feature Map

<img src="HowCNNwork_1.png">
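
The sliding-window computation can be sketched in plain NumPy. This is a naive illustration, not how a framework implements it. Like most deep learning libraries, it does not flip the filter (strictly speaking it is a cross-correlation). The function name `conv2d_valid`, the toy image and the vertical-edge filter are all chosen here for illustration.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            # Weighted sum of the pixel neighborhood, weights given by the filter.
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map


# Toy image: bright (1) on the left two columns, dark (0) elsewhere.
image = np.zeros((5, 5))
image[:, :2] = 1.0

# A filter that responds to a bright-to-dark transition from left to right.
kernel = np.array([[1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0]])

feature_map = conv2d_valid(image, kernel)
print(feature_map)
```

The feature map is large where the filter overlaps the edge and zero in the flat region, which is exactly the "pattern detector" behavior described above.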

In [2]:
from IPython.display import HTML
HTML('<img src="HowCNNwork_2_changing.gif">')

## How does a filter work on an image?

<img src="CNN_mouse_filter.png" alt="Drawing" style="width: 700px;"/>

<img src="CNN_mouse_1.png" alt="Drawing" style="width: 750px;"/>

<img src="CNN_mouse_2.png" alt="Drawing" style="width: 750px;"/>

# A classical CNN architecture
### [CNN]
### * Convolutional Layer
### * Activation Layer
### * Pooling Layer
### [DNN]

<img src="CNN_architecture.png" alt="Drawing" style="width: 750px;"/>

# Convolutional Layer and its Properties

## Convolution Property 1: Convolution times

<img src="Convolution_property_1.png" alt="Drawing" style="width: 750px;"/>

### The more times you apply convolution (without padding), the smaller the image becomes.

## Convolution Property 2: Filter size

<img src="Convolution_property_2_filtersizeS.png" alt="Drawing" style="width: 750px;"/>


<img src="Convolution_property_2_filtersizeL.png" alt="Drawing" style="width: 750px;"/>

### Filter size is one of the hyperparameters you generally need to tune.
### A useful observation: it is usually preferable to use smaller filters, but more of them.
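
One common argument for small filters is that stacking them covers the same region with fewer weights. The sketch below assumes stride 1, no biases, and the same number of channels (64, a made-up value) going in and out of every layer.

```python
def receptive_field(filter_size, n_layers):
    """Input region seen by one output value after stacking stride-1 conv layers."""
    return 1 + n_layers * (filter_size - 1)


def conv_weight_count(filter_size, channels, n_layers):
    """Weights (biases ignored) of stacked conv layers with `channels` in and out."""
    return n_layers * filter_size * filter_size * channels * channels


channels = 64  # hypothetical channel count
two_3x3 = conv_weight_count(filter_size=3, channels=channels, n_layers=2)
one_5x5 = conv_weight_count(filter_size=5, channels=channels, n_layers=1)

print(receptive_field(3, 2), receptive_field(5, 1))
print(two_3x3, one_5x5)
```

Two 3x3 layers see the same 5x5 region as a single 5x5 layer, use fewer weights, and add an extra nonlinearity in between.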

## Convolution Property 3: Filter number
### Different filters learn different patterns

In [6]:
from IPython.display import HTML
# HTML('<img src="HowFilterwork_changing.gif">' )
# HTML('<img src="HowFilterwork_changing_resize2.gif">' )
HTML('<img src="HowFilterwork_changing_resize25.gif">' )

## All the numbers in the filter are the network parameters to be learned!

## Convolution property 4: padding
#### padding='valid' means no padding
#### padding='same' means padding is applied (the padded value is zero by default)

<img src="Convolution_property_3_zeropadding_comparison.png" alt="Drawing" style="width: 750px;"/>

### Add additional zeros at the border of the image.
### >> Padding can preserve the spatial dimension of the images.
### >> Padding can preserve more edge information. 
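
Zero padding itself is easy to see with `np.pad`. This sketch assumes a 3x3 filter with stride 1, for which one ring of zeros is enough to keep the output the same size as the input. The 3 x 3 toy image is made up for illustration.

```python
import numpy as np

# Toy 3 x 3 image with values 1..9.
image = np.arange(1, 10).reshape(3, 3)

# Add one ring of zeros around the border.
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)

print(padded.shape)
print(padded)
```

A 3x3 filter sliding over the 5 x 5 padded image yields 5 - 3 + 1 = 3 positions per axis, so the output is again 3 x 3, and the border pixels now sit in the middle of some windows instead of only at their edges.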

## Convolution property 5: stride >> how far the filter moves at each step

<img src="Convolution_property_4_Stride.png" alt="Drawing" style="width: 750px;"/>

## Review of Properties of Convolution Layers 
1. Convolution times >> make the image smaller
2. Filter Size               >> adjust output image's size
3. Padding                  >> used to keep the image's size
4. Stride                      >> make the image smaller
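
The effects in this review can be summarized by the standard output-size formula, floor((n + 2p - k) / s) + 1, for input size n, filter size k, padding p and stride s along one axis. The helper name `conv_output_size` and the 32-pixel input are assumptions for illustration.

```python
def conv_output_size(input_size, filter_size, padding=0, stride=1):
    """Output length along one axis: floor((n + 2p - k) / s) + 1."""
    return (input_size + 2 * padding - filter_size) // stride + 1


# 1. Convolution times: repeated 3x3 convolutions without padding shrink the image.
size = 32
sizes = [size]
for _ in range(3):
    size = conv_output_size(size, filter_size=3)
    sizes.append(size)

# 2. Filter size: a larger filter shrinks the output more.
larger_filter = conv_output_size(32, filter_size=5)

# 3. Padding: one ring of zeros keeps the size for a 3x3 filter.
same_padding = conv_output_size(32, filter_size=3, padding=1)

# 4. Stride: moving two pixels at a time roughly halves the size.
strided = conv_output_size(32, filter_size=3, stride=2)

print(sizes, larger_filter, same_padding, strided)
```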

# Activation Layer

### Timing: Immediately after each conv layer 
### Purpose: nonlinearity
### Examples: ReLU (the usual default choice), tanh and sigmoid
### Rectified Linear Unit (ReLU): f(x) = max(0, x)

<img src="Activation_Layer.png" alt="Drawing" style="width: 850px;"/>

After each conv layer, it is conventional to apply a nonlinear layer (or activation layer) immediately afterward. The purpose of this layer is to introduce nonlinearity into a system that has so far only computed linear operations in the conv layers (just element-wise multiplications and summations). In the past, nonlinear functions like tanh and sigmoid were used, but researchers found that ReLU layers work far better because the network trains a lot faster (thanks to the computational efficiency) without a significant difference in accuracy. ReLU also helps to alleviate the vanishing gradient problem, which is the issue where the lower layers of the network train very slowly because the gradient decreases exponentially through the layers (explaining this is out of the scope of this class, but see https://en.wikipedia.org/wiki/Vanishing_gradient_problem and https://www.quora.com/What-is-the-vanishing-gradient-problem for good descriptions).

The ReLU layer applies the function f(x) = max(0, x) to all of the values in the input volume. In basic terms, this layer just changes all the negative activations to 0. It increases the nonlinear properties of the model and the overall network without affecting the receptive fields of the conv layer.
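
As a minimal sketch, ReLU applied element-wise to a small array of activations (the input values are made up for illustration):

```python
import numpy as np


def relu(x):
    """Element-wise f(x) = max(0, x)."""
    return np.maximum(0, x)


activations = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(activations))
```

Negative activations become 0 and positive ones pass through unchanged, and the output has the same shape as the input.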

Paper co-authored by Geoffrey Hinton (often called the father of deep learning): http://www.cs.toronto.edu/~fritz/absps/reluICML.pdf

# Pooling Layer
### Why do we need pooling layers?
* Reduce the size of the feature maps, and therefore the number of weights in the following layers
* Help reduce overfitting

### Max pooling
* Keeps only the strongest response in each region, i.e. whether the pattern exists there


<img src="Pooling_Layer.png" alt="Drawing" style="width: 800px;"/>
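
A minimal NumPy sketch of max pooling with non-overlapping windows. It assumes the window size equals the stride and that the feature map dimensions are divisible by the window size. The function name `max_pool2d` and the 4 x 4 values are made up for illustration.

```python
import numpy as np


def max_pool2d(feature_map, pool_size=2):
    """Non-overlapping max pooling; height and width must be divisible by pool_size."""
    h, w = feature_map.shape
    assert h % pool_size == 0 and w % pool_size == 0
    # Split rows and columns into blocks, then take the maximum inside each block.
    blocks = feature_map.reshape(h // pool_size, pool_size, w // pool_size, pool_size)
    return blocks.max(axis=(1, 3))


feature_map = np.array([[1, 3, 2, 1],
                        [4, 6, 5, 0],
                        [7, 2, 9, 8],
                        [1, 0, 3, 4]])

pooled = max_pool2d(feature_map)
print(pooled)
```

Each 2 x 2 region is replaced by its largest value, so the 4 x 4 map becomes 2 x 2 while keeping the strongest response per region.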

# How to connect CNN with DNN

<img src="Flatten_cnn_dnn.png" alt="Drawing" style="width: 850px;"/>

<img src="CNN_architecture_layer_car.png">

<img src="CNN_architecture_classify_car.png" alt="Drawing" style="width: 750px;"/>
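
The connection between the convolutional part and the DNN is just a reshape: the stack of feature maps is flattened into one vector, which then feeds a fully connected layer. The shapes below (4 x 4 feature maps with 8 channels, 10 output classes) and the random weights are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical output of the last pooling layer: height x width x channels.
feature_maps = rng.standard_normal((4, 4, 8))

# Flatten: all values are laid out in one long vector.
flat = feature_maps.reshape(-1)

# One fully connected layer on top (random, untrained weights).
n_classes = 10
weights = rng.standard_normal((flat.size, n_classes))
bias = np.zeros(n_classes)
logits = flat @ weights + bias

print(feature_maps.shape, flat.shape, logits.shape)
```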

# ImageNet
* ~14 million labeled images, 20k classes
* Images gathered from Internet
* Human labels via Amazon MTurk

## ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
### Classes : 1000 image categories
### Amount : 1,200k training images, 100k test images

<img src="ImageNet_hard.png" alt="Drawing" style="width: 950px;"/>

# Modern CNN Architecture
### 1. (1998) LeNet-5 {7}
### 2. (2012) AlexNet {8}
### 3. (2014) VGG {16/19}
### 4. (2014) GoogLeNet {22}
### 5. (2015) ResNet {152}

## LeNet-5 { 7 Layers }

<img src="Architecture_LeNet.png" alt="Drawing" style="width: 950px;"/>

### Famous : 5x5 Conv
### Simple Architecture : Conv-Pool-Conv-Pool-FC-FC (has everything: Conv / Nonlinearity / Pooling / Classification)
### Limitation : Computing power was limited in 1998...
### Useful : used by several banks to recognise hand-written numbers on checks, digitized as 32x32 pixel greyscale images
### Demo : https://www.cs.cmu.edu/~aharley/vis/conv/flat.html
### Ref : https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/

<img src="ILSVRC_2012AlexNet.png" alt="Drawing" style="width: 950px;"/>

## AlexNet { 8 Layers }

<img src="Architecture_AlexNet_NN.png" alt="Drawing" style="width: 950px;"/>
<img src="Architecture_AlexNet_Layer.png" alt="Drawing" style="width: 200px;"/>

### Famous
### 1. Data augmentation (e.g. flips, crops, jittering)
### 2. First use of ReLU nonlinearity
### 3. Dropout Layer
### 4. GPU implementation (50x speedup over CPU), trained on two GPUs for a week
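
Two of the augmentations listed above, flipping and cropping, can be sketched with NumPy slicing. This is an illustration only, not AlexNet's actual pipeline. The function names, the 32 x 32 image and the 28 x 28 crop size are assumptions.

```python
import numpy as np


def horizontal_flip(image):
    """Mirror the image left-to-right (image is height x width x channels)."""
    return image[:, ::-1]


def random_crop(image, crop_height, crop_width, rng):
    """Cut a random crop_height x crop_width window out of the image."""
    top = rng.integers(0, image.shape[0] - crop_height + 1)
    left = rng.integers(0, image.shape[1] - crop_width + 1)
    return image[top:top + crop_height, left:left + crop_width]


rng = np.random.default_rng(0)
image = np.arange(32 * 32 * 3).reshape(32, 32, 3)   # toy stand-in for an RGB image

flipped = horizontal_flip(image)
cropped = random_crop(image, 28, 28, rng)
print(flipped.shape, cropped.shape)
```

Each transformed copy shows the same object with a slightly different appearance, which effectively enlarges the training set.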

<img src="dropout_layer_comparison.png" alt="Drawing" style="width: 900px;"/>

## Dropout Layer
### Purpose : Regularization 
### Side Effect : Behaves like an ensemble of many sub-networks
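
A minimal sketch of dropout on a vector of activations. It uses the "inverted dropout" variant, which rescales the surviving units during training so that nothing needs to change at test time. This is an assumption about the variant and may differ from the original AlexNet formulation. The function name and the sizes are made up for illustration.

```python
import numpy as np


def dropout(activations, drop_rate, rng, training=True):
    """Inverted dropout: zero a random subset of units and rescale the rest."""
    if not training or drop_rate == 0.0:
        return activations
    keep_prob = 1.0 - drop_rate
    # Each unit is kept independently with probability keep_prob.
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob


rng = np.random.default_rng(0)
activations = np.ones(1000)

train_out = dropout(activations, drop_rate=0.5, rng=rng, training=True)
test_out = dropout(activations, drop_rate=0.5, rng=rng, training=False)
print(train_out[:10], test_out[:10])
```

During training every forward pass uses a different random sub-network, which is where both the regularization and the ensemble-like effect come from.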

<img src="ILSVRC_2014VGG&GoogLeNet.png" alt="Drawing" style="width: 950px;"/>

## VGG 16 / VGG 19
### Famous: Use lots of 3 x 3 Conv

<img src="Architecture_VGG1619_Layer.png" alt="Drawing" style="width: 400px;"/>

<img src="VGG_Memory&parameter.png" alt="Drawing" style="width: 900px;"/>

## GoogLeNet
### Famous : 
### 1.  22 layers
### 2.  Efficient “Inception” module
### 3.  No FC layers
### 4.  Only 5 million parameters! (12x fewer than AlexNet)

<img src="GoogLeNet_whole.png" alt="Drawing" style="width: 1000px;"/>

<img src="GoogLeNet_inception_naive&1x1.png" alt="Drawing" style="width: 950px;"/>

<img src="GoogLeNet_stacked_Inception.png" alt="Drawing" style="width: 950px;"/>

In [None]:
# <img src="GoogLeNet_inception_concat.png" alt="Drawing" style="width: 750px;"/>

## How does 1 x 1 Convolution work?

<img src="con1x1_1.png" alt="Drawing" style="width: 750px;"/>

<img src="con1x1_2.png" alt="Drawing" style="width: 750px;"/>
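
A 1 x 1 convolution can be read as a small fully connected layer applied independently at every pixel: it mixes the channels but never looks at neighboring positions. The sketch below uses made-up sizes (a 28 x 28 map reduced from 192 to 64 channels) and random, untrained weights.

```python
import numpy as np


def conv1x1(feature_maps, weights, bias):
    """1x1 convolution: the same channel-mixing matrix is applied at every pixel."""
    # (H, W, C_in) @ (C_in, C_out) -> (H, W, C_out)
    return feature_maps @ weights + bias


rng = np.random.default_rng(0)
feature_maps = rng.standard_normal((28, 28, 192))   # hypothetical input volume
weights = rng.standard_normal((192, 64))            # 64 filters of size 1 x 1 x 192
bias = np.zeros(64)

reduced = conv1x1(feature_maps, weights, bias)
print(feature_maps.shape, reduced.shape)
```

The spatial size is unchanged while the depth shrinks, which is why the Inception module can use 1 x 1 convolutions to cut computation before the more expensive 3 x 3 and 5 x 5 filters.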

## ResNet 

<img src="ILSVRC_2015ResNet.png" alt="Drawing" style="width: 850px;"/>

# Complexity

<img src="ILSVRC_complexity_GoogLeNet.png" alt="Drawing" style="width: 950px;"/>

<img src="ILSVRC_complexity_VGG.png" alt="Drawing" style="width: 950px;"/>

<img src="ILSVRC_complexity_AlexNet.png" alt="Drawing" style="width: 950px;"/>

## Review of CNN Property and CNN Architecture

<img src="Review_CNNProperty_Architecture.png">

# Take Home Message for training CNN
### Filter size : 3x3 (Recommendation)
### Filter numbers : increasing with depth (64 >> 128 >> 256)
### Data augmentation : Enlarges the dataset
### Max pooling : Not too much, otherwise you will lose too much information
### Techniques against overfitting : Dropout, L2 regularization, Early stopping

#### Reference
* http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture9.pdf

#### Figure Reference
* http://www.sumiaozhijia.com/touxiang/471.html
* http://122311.com/tag/su-miao/2.html
* ...