# Table of contents
[1. The general convolution neural network](#conv)  
[2. Detailed AlexNet-SuperVision for ImageNet challenge](#AlexNet)  
[3. Mathematical aspect of each layer of AlexNet](#mat)  
[5.4. Convolutional neural networks for object recognition](#conv_net_obj_reg)  

In Week 5, we studied convolutional neural networks (CNNs) through two specific examples:
- handwritten digit recognition (LeNet5).
- 3-D object recognition (AlexNet - SuperVision).

## 1. The general convolution neural network
<a id = "conv"> </a>
- CNNs are commonly made up of 3 layer types:
    - Convolution & non-linearity (rectified linear units).
    - Pooling: (typically) local maximum.
    - Fully connected & non-linearity.
- Common architecture:
    1. Stack a few Conv-ReLU layers, followed by a Pool layer.
    2. Repeat pattern until image has been reduced to a small representation.
    3. Transition to one or more fully connected (FC)-ReLU layers.
    4. The last FC layer computes the output.
        - Stack multiple stages of feature extractors, higher stages compute more global, more invariant features.
        - Classification layer at the end.
- For example, the architecture of LeNet5:
![lenet5](images/lenet5.png)
- Example of the architecture of AlexNet:
![alexnet](images/alexnet.png)
![alexnet_car](images/alexnet_car.png)
The overall pipeline can be summarized as follows:
![overview](images/overview.png)
- CNNs are trained in a supervised way: the convolutional filters are learned by back-propagating the classification error with mini-batch gradient descent:  
    **Loop**:
    1. Sample a batch of training data (~100 images).
    2. Forward pass: compute the loss (averaged over the batch).
    3. Backward pass: compute the gradient.
    4. Update all parameters.  
    **Note**: this is usually called "stochastic gradient descent" (SGD), even though strictly speaking SGD uses a batch size of 1.
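
The loop above can be sketched in plain Python. This is only a minimal illustration on a toy linear regression, not AlexNet's training code: the function name `train_minibatch_sgd`, the toy data, and the hyper-parameter values are assumptions chosen for the example.

```python
import random


def train_minibatch_sgd(data, batch_size=100, lr=0.1, steps=300, seed=0):
    """Fit y = w * x + b by mini-batch gradient descent on squared error.

    Hypothetical toy example: `data` is a list of (x, y) pairs.
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    loss = 0.0
    for _ in range(steps):
        # 1. Sample a batch of training data.
        batch = rng.sample(data, batch_size)
        # 2. Forward pass: compute the loss averaged over the batch.
        # 3. Backward pass: accumulate the gradient of that average loss.
        loss = grad_w = grad_b = 0.0
        for x, y in batch:
            err = (w * x + b) - y
            loss += err ** 2 / batch_size
            grad_w += 2.0 * err * x / batch_size
            grad_b += 2.0 * err / batch_size
        # 4. Update all parameters.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, loss


# Toy data generated from y = 3x + 1 (noise-free, for illustration only).
rng = random.Random(1)
toy_data = [(x, 3.0 * x + 1.0) for x in (rng.uniform(-1, 1) for _ in range(1000))]
w, b, final_loss = train_minibatch_sgd(toy_data)
```

On this toy problem `w` and `b` should approach 3 and 1. In a real CNN the forward and backward passes run through all layers, but the four steps of the loop stay the same.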

## 2. Detailed AlexNet-SuperVision for ImageNet challenge
<a id = "AlexNet"> </a>
- This deep CNN has 7 hidden "weight" layers.
- Entirely supervised.
- More data = good.
- Trained with stochastic gradient descent on two NVIDIA GPUs for about a week (5~6 days).
![dataflow](images/dataflow.png)![dataflow_1](images/dataflow_1.png)
![detailed_1](images/detailed_1.png)
- Max-pooling layers follow only the first, second, and fifth convolutional layers.
![detailed_2](images/detailed_2.png)
- The procedure for training AlexNet:
    1. Error backpropagation.
    2. Iteratively update weight matrix (or tensor) in each layer by a stochastic gradient descent approach.
![training](images/training.png)
- Main techniques that improved AlexNet's performance in the ImageNet Challenge 2012:
    1. Use regularization techniques:
        - data augmentation: image translations, horizontal (left-right) reflections, and extraction of random 224x224 patches from the 256x256 images (in order to get more training data).
![data_aug](images/data_aug.png)
        - dropout techniques in order to combat the problem of overfitting to the training data.
    2. Trained the model using mini-batch stochastic gradient descent, with specific values for momentum and weight decay.
    3. Trained on two NVIDIA GPUs:
![gpus](images/gpus.png)
The GPUs communicate only at certain layers: the third convolutional layer and the first two fully connected layers.
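
The patch extraction and horizontal reflection used for data augmentation can be sketched as below. This is a simplified illustration, not the original implementation: the function name `random_crop_and_flip` and the list-of-rows image layout are assumptions made for the example.

```python
import random


def random_crop_and_flip(image, crop=224, rng=random):
    """Return a random crop x crop patch of `image`, mirrored with probability 0.5.

    `image` is a list of rows, each row a list of pixels (hypothetical layout).
    """
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - crop)   # randint is inclusive on both ends
    left = rng.randint(0, width - crop)
    patch = [row[left:left + crop] for row in image[top:top + crop]]
    if rng.random() < 0.5:
        # Horizontal (left-right) reflection.
        patch = [row[::-1] for row in patch]
    return patch


# A dummy 256x256 "image" whose pixels record their own (row, col) position.
image = [[(r, c) for c in range(256)] for r in range(256)]
patch = random_crop_and_flip(image, rng=random.Random(0))
```

Each call yields a different 224x224 view of the same 256x256 image, which is how the augmentation enlarges the effective training set.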

## 3. Mathematical aspect of each layer of AlexNet
<a id="mat"> </a>
![ovw](images/ovw.png)

### 3.1. Convolution
![convo](images/convo.png)
- input: images 224x224x3 (3 channels are R-G-B)
- convolution filter size: 11x11
- stride: 4
- output: 55x55x96
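With an image width of 224, a filter size of 11 and a stride of 4, (224-11)/4+1 = 54.25 is not an integer; the 55x55 output corresponds to an effective input width of 227 (for example, 224 plus padding): (227-11)/4+1 = 55 $\rightarrow$ the output volume has a spatial area of 55x55.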
- depth (i.e. the number of filters): 96.
$\rightarrow$ output volume has size 55x55x96 = 290,400 neurons.
- each neuron is connected to a region of size 11x11x3 in the input volume $\rightarrow$ 363 weights + 1 bias.
$\rightarrow$ if each neuron had its own parameters, the first layer would need 290,400 * 364 = 105,705,600 parameters (more than 100 million).
- **However**, natural images have the property of being **stationary**:
    - the statistics in one part of the image are the same as in any other part.
    - thus, we can use the same features at all locations.  
    $\rightarrow$ constrain the neurons in each depth slice (of 96 slices) to use the same params $\rightarrow$ run the same filter or kernel over all receptive field windows (i.e. convolve the filter with the input image).
    - AlexNet example:
        - output volume has size 55x55x96 = 290,400 neurons.
        - there are 96 depth slices (96 filters), each with 55x55 neurons: all 55x55 neurons in a slice share the same 11x11x3+1 = 364 params.  
        $\rightarrow$ only 96 * 364 = 34,944 params $\rightarrow$ a dramatic reduction.
![convo_exp](images/convo_exp.png)
![convo_exp_1](images/convo_exp_1.png)
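
The arithmetic of this section can be checked with a short sketch. The helper `conv_output_size` is a hypothetical name introduced here. The sketch also assumes an input width of 227 rather than 224, because (224 - 11)/4 + 1 is not an integer, whereas (227 - 11)/4 + 1 = 55.

```python
def conv_output_size(input_size, filter_size, stride, padding=0):
    """Spatial output size of a convolution along one dimension."""
    span = input_size + 2 * padding - filter_size
    if span % stride != 0:
        raise ValueError("filter does not tile the input evenly")
    return span // stride + 1


# Assumption: effective input width 227, so that the output is 55.
out = conv_output_size(227, 11, 4)
neurons = out * out * 96                         # 55 x 55 x 96 neurons
weights_per_neuron = 11 * 11 * 3 + 1             # 363 weights + 1 bias
params_unshared = neurons * weights_per_neuron   # separate params per neuron
params_shared = 96 * weights_per_neuron          # one filter per depth slice
```

Comparing `params_unshared` (105,705,600) with `params_shared` (34,944) shows how much parameter sharing saves in the first layer.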

### 3.2. Max pooling (downsampling)
- The most common form is a max-pooling layer with a max filter of size 2x2 applied with a stride of 2 (size = 2x2, stride = 2).
    - it downsamples every depth slice in the input by 2 along both width and height.
    - the depth dimension remains unchanged.
- Pooling reduces the spatial size of each depth slice in the output.
$\rightarrow$ fewer params in higher levels in the network.
![pool](images/pool.png)
![pool_demo](images/pool_demo.png)
![pool_1](images/pool_1.png)
- Role of pooling:
    - invariance to small transformations.
    - larger receptive fields (see more of input).
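
A minimal sketch of 2x2 max pooling with stride 2 on a single depth slice is given below. The function name `max_pool_2d` and the 4x4 example values are assumptions for illustration; a real implementation would apply the same operation to every depth slice independently.

```python
def max_pool_2d(feature_map, size=2, stride=2):
    """Max-pool one depth slice given as a list of rows."""
    height, width = len(feature_map), len(feature_map[0])
    out_h = (height - size) // stride + 1
    out_w = (width - size) // stride + 1
    return [
        [
            # Maximum over the size x size window starting at (i*stride, j*stride).
            max(
                feature_map[i * stride + di][j * stride + dj]
                for di in range(size)
                for dj in range(size)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


slice_4x4 = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
pooled = max_pool_2d(slice_4x4)  # [[6, 8], [3, 4]]
```

The 4x4 slice is reduced to 2x2, so width and height are halved while the number of depth slices is unchanged.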

### 3.3. Local Response Normalization
- ReLUs do not require input normalization to keep them from saturating.
- However, the following local normalization scheme still helps generalization.
    - Let $a^i_{x,y}$ be the activity of a neuron computed by applying kernel i at position (x,y) and then applying the ReLU nonlinearity.
    - compute the response-normalized activity, where the sum runs over n adjacent kernel maps at the same spatial position:
![lrn](images/lrn.png)
        - N: the total number of kernels in the layer.
        - n: hyper-parameter, the number of adjacent kernel maps, n = 5.
        - k: hyper-parameter, k = 2.
        - $\alpha$: hyper-parameter, $\alpha = 10^{-4}$.
        - $\beta$: hyper-parameter, $\beta = 0.75$.
    - This aids generalization even though ReLUs don't require it.
    - It is a sort of "brightness normalization".
    - lateral inhibition: creating competition for big activities amongst neuron outputs computed using different kernels.
![lrn_1](images/lrn_1.png)
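
With the symbols defined above, the response-normalized activity is commonly written as

$$b^i_{x,y} = \frac{a^i_{x,y}}{\left(k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \left(a^j_{x,y}\right)^2\right)^{\beta}}$$

The sketch below evaluates this at a single spatial position. The function name `local_response_norm` and the example activities are assumptions, and $n/2$ is taken as integer division (2 for n = 5).

```python
def local_response_norm(a, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """Normalize the activities a[i] of N kernels at one spatial position (x, y)."""
    num_kernels = len(a)
    b = []
    for i in range(num_kernels):
        # Sum over up to n adjacent kernel maps, clipped at the borders.
        lo = max(0, i - n // 2)
        hi = min(num_kernels - 1, i + n // 2)
        sq_sum = sum(a[j] ** 2 for j in range(lo, hi + 1))
        b.append(a[i] / (k + alpha * sq_sum) ** beta)
    return b


# Hypothetical ReLU outputs of 8 kernels at the same position.
activities = [0.0, 1.0, 50.0, 1.0, 0.0, 0.0, 0.0, 2.0]
normalized = local_response_norm(activities)
```

The activity at index 1 sits next to the large activity 50 and is scaled down more strongly than the activity at index 7, whose neighbours are all zero. This is the lateral-inhibition effect.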

### 3.4. Training
- After each convolution layer, whose weights are learned during training, the ReLU non-linearity is applied.
![scheme](images/scheme.png)
- Initialization:
    - all feature extractors are initialized with white Gaussian noise and then learned from the data.
    - $w$: zero-mean Gaussian with standard deviation 0.01.
    - biases: 0 for the 1st and 3rd layers and 1 for the other layers.
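
The initialization scheme can be sketched as follows for the first convolutional layer. The helper `init_layer` and the flat list representation of the weights are assumptions made for illustration.

```python
import random


def init_layer(num_weights, num_biases, bias_value, rng, std=0.01):
    """Draw weights from a zero-mean Gaussian and set every bias to bias_value."""
    weights = [rng.gauss(0.0, std) for _ in range(num_weights)]
    biases = [bias_value] * num_biases
    return weights, biases


rng = random.Random(0)
# First convolutional layer: 96 filters of size 11x11x3, biases set to 0.
conv1_w, conv1_b = init_layer(96 * 11 * 11 * 3, 96, bias_value=0.0, rng=rng)
```
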
- Stochastic gradient descent learning:
    - learning update rule with momentum (damping parameter) and learning rate $\epsilon$:
![update_rule](images/update_rule.png)
        - $v$: momentum variable.
        - $\epsilon$: learning rate, 0.01 initially and reduced 3 times prior to termination.
        - $\langle \cdot \rangle_{D_i}$: average over the $i$-th batch $D_i$ (batch size = 128).
    - Momentum: what is it?
        - gradient descent finds only a local minimum.
            - this is not a problem if $J(w)$ (the loss function) is small at the local minimum. Of course, we do not want to drive $J(w)$ all the way to 0, because that would mean overfitting.
![good](images/good.png)
            - but it is a problem if $J(w)$ is large at a local minimum $w$.
![bad](images/bad.png)
        - momentum: a popular method to escape poor local minima and also to speed up descent in plateau regions.
![momentum](images/momentum.png)
![update](images/update.png)
    - Weight decay (L2 regularization):
        - it is a regularization technique; the name comes from the fact that the weights "decay" at each iteration:
![weight_decay](images/weight_decay.png)
        - **Note**: typically, biases are excluded from regularization.
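
Momentum and weight decay combine into a single update, which for AlexNet is usually written as $v_{i+1} = 0.9\, v_i - 0.0005\, \epsilon\, w_i - \epsilon \left\langle \frac{\partial L}{\partial w}\big|_{w_i} \right\rangle_{D_i}$ and $w_{i+1} = w_i + v_{i+1}$. The sketch below implements one such step. The function name `sgd_momentum_step` and the toy quadratic loss are assumptions. The defaults 0.9 (momentum) and 0.0005 (weight decay) are the values reported for AlexNet and are not stated in the text above.

```python
def sgd_momentum_step(w, v, grad, lr=0.01, momentum=0.9, weight_decay=0.0005):
    """One update of weights `w` and momentum variables `v` (lists of floats).

    `grad` is the loss gradient averaged over the current mini-batch.
    """
    new_v = [
        momentum * vi - weight_decay * lr * wi - lr * gi
        for wi, vi, gi in zip(w, v, grad)
    ]
    new_w = [wi + vi for wi, vi in zip(w, new_v)]
    return new_w, new_v


# Toy loss J(w) = 0.5 * w^2 per weight, so the gradient is simply w.
w, v = [1.0, -2.0], [0.0, 0.0]
for _ in range(200):
    w, v = sgd_momentum_step(w, v, grad=w, lr=0.1)
```

On this toy loss the weights shrink toward 0. The `weight_decay * lr * wi` term is what makes each weight "decay" slightly at every iteration, independently of the data gradient.
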
- Backpropagation error:
