# CS470 Introduction to Artificial Intelligence
## Deep Learning Practice
#### TA. Jonghwa Lee / jongwhoa.lee@kaist.ac.kr
#### Prof. Ho-Jin Choi
#### School of Computing, KAIST

---

# Week 2 - Convolutional Neural Networks (CNNs)

3. **Basics of CNN** <br>
    3-1. Convolutional neural network <br>
    3-2. Layers composing CNN <br>
    3-3. Data Augmentation <br>
    3-4. Image classification using CNN
    
---

## 3. Convolutional Neural Network
### 3-1. Convolutional neural network
- A convolutional neural network (CNN) is a class of deep neural network, most commonly applied to analyzing visual imagery.
- The network consists of multiple layers of small neuron collections, each of which processes a portion of the input image called a receptive field.
    - The connectivity pattern between these neurons is inspired by the organization of the animal visual cortex. <br>
![fnn_vs_cnn](images/fnn_vs_cnn.png)
- The outputs of these collections are then tiled so that their input regions overlap, to obtain a better representation of the original image; this is repeated for every such layer.

#### Example of CNN architecture
![Convolutional neural network](images/cnn-architectures.png)

- A CNN consists of a series of convolutional, non-linear, pooling (downsampling), and fully connected layers.
- The convolutional, non-linear, and pooling layers extract a feature map (or activation map).
- The fully connected layer classifies a target using the extracted feature map (e.g. in image classification, it outputs a single class for the input image or the probabilities of the classes that best describe the image).

### 3-2. Layers composing CNN
#### Convolutional layer
The first layer in a CNN is usually a **convolutional layer**.

![Convolution layer](images/convolution-layer.gif)

<img src=https://taewanmerepo.github.io/2018/01/cnn/conv2.jpg width="400">

##### Convolutional filters

![Convolution filter](images/convolution-filter.png)

- A convolutional filter, much like a **kernel** in image processing, is a small matrix useful for blurring, sharpening, embossing, edge detection, and more.
- This is accomplished by computing a convolution between a filter and an image.
- The main difference **_here_** is that the **convolution matrices are learned**.
    - This means that the convolutional filters are learned as weights while training the CNN model.

##### Convolution
As the filter slides, or convolves, around the input image, it multiplies the values in the filter with the original pixel values of the image (a.k.a. computing element-wise multiplication) and sums up all these products.

![Activation map](images/activation-map.gif)

Now, we repeat this process for every location on the input volume. (Next step would be moving the filter to the right by 1 unit, then right again by 1, and so on.) After sliding the filter over all the locations, we are left with an array of numbers usually called an activation map or feature map.
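
The sliding procedure described above can be sketched in plain Python. This is a minimal illustration, assuming a single-channel image stored as a list of lists, stride 1, and no padding; the name `conv2d` and the example values are hypothetical. As in most deep learning libraries, the filter is not flipped, so strictly speaking this computes a cross-correlation.

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return the feature map."""
    img_h, img_w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out_h, out_w = img_h - k_h + 1, img_w - k_w + 1

    feature_map = []
    for top in range(out_h):
        row = []
        for left in range(out_w):
            # Element-wise multiply the filter with the patch under it, then sum.
            total = 0
            for i in range(k_h):
                for j in range(k_w):
                    total += image[top + i][left + j] * kernel[i][j]
            row.append(total)
        feature_map.append(row)
    return feature_map


# Hypothetical 5x5 binary image and 3x3 filter, chosen only for illustration.
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

feature_map = conv2d(image, kernel)
print(feature_map)  # [[4, 3, 4], [2, 4, 3], [2, 3, 4]]
```

A 5x5 input and a 3x3 filter give a 3x3 feature map, because the filter fits in only three positions along each axis.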

##### Stride
- Stride controls how the filter convolves around the input volume, i.e. by how many units it shifts at each step.
- In the above example, the filter convolves around the input volume by shifting one unit at a time.
- In that case, the stride was implicitly set to 1.
- Stride is normally set so that the size of the output volume is an integer and not a fraction.
 
![Stride - 1](images/stride-1.gif)
![Stride - 2](images/stride-2.gif)
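
To make the "integer and not a fraction" condition concrete, here is a small sketch of the output size along one spatial dimension when no padding is used: `(input_size - kernel_size) / stride + 1`. The function name `conv_output_size` is hypothetical, and raising an error when the stride does not fit is a choice made for illustration; real frameworks typically round down instead.

```python
def conv_output_size(input_size, kernel_size, stride=1):
    """Output length along one dimension for a convolution without padding."""
    span = input_size - kernel_size
    if span < 0:
        raise ValueError("kernel is larger than the input")
    if span % stride != 0:
        # The filter would not land exactly on the last input position.
        raise ValueError("stride does not tile the input evenly")
    return span // stride + 1


size_stride_1 = conv_output_size(7, 3, stride=1)
size_stride_2 = conv_output_size(7, 3, stride=2)
print(size_stride_1, size_stride_2)  # 5 3
```

With a 7-wide input and a 3-wide filter, strides 1 and 2 fit exactly, while stride 3 would give a fractional size and is rejected by this sketch.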

##### Padding
- The size of the feature map is smaller than the input, because the convolution filter needs to be contained in the input. 
- If we want to maintain the same dimensionality, we can use padding to surround the input with zeros.
 
![Padding](images/padding.gif)
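
As a rough sketch of zero padding, the helper below surrounds a 2-D list with a border of zeros. The name `zero_pad` and the example values are hypothetical. For a 3x3 filter with stride 1, a border of width 1 is enough to keep the output the same size as the input.

```python
def zero_pad(image, pad):
    """Surround a 2-D list `image` with `pad` rows and columns of zeros on every side."""
    width = len(image[0]) + 2 * pad
    padded = [[0] * width for _ in range(pad)]           # top border
    for row in image:
        padded.append([0] * pad + list(row) + [0] * pad)  # left and right borders
    padded.extend([0] * width for _ in range(pad))        # bottom border
    return padded


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
padded = zero_pad(image, 1)

# A 3x3 filter with stride 1 fits in (5 - 3 + 1) = 3 positions along each axis,
# so the output has the same 3x3 size as the unpadded input.
kernel_size = 3
output_size = len(padded) - kernel_size + 1
print(len(padded), len(padded[0]), output_size)  # 5 5 3
```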

**More details for computation of convolutional filter (reference material)**

##### High level perspective
Let's briefly talk about what this convolution is actually doing at a high level. Each of these filters can be thought of as a **feature identifier** (e.g. *straight edges, simple colors, curves*).

![Filter is feature identifier](images/filter-is-feature-identifier.png)

###### Visualisation of the Receptive Field
![Example image and filter](images/example-image-and-filter.png)
![Good filter](images/good-filter.png)
![Bad filter](images/bad-filter.png)

Compared to the first value, the second value is much lower! This is because there wasn’t anything in the image section that responded to the curve detector filter.
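
The same effect can be reproduced with small made-up numbers. The sketch below uses a hypothetical 3x3 diagonal-line filter rather than the curve detector in the figures, and the pixel values are invented for illustration: the response is large when the patch contains the shape the filter encodes, and small otherwise.

```python
def filter_response(patch, kernel):
    """Element-wise multiply a patch with a filter of the same size and sum the result."""
    total = 0
    for patch_row, kernel_row in zip(patch, kernel):
        for pixel, weight in zip(patch_row, kernel_row):
            total += pixel * weight
    return total


# Hypothetical filter that responds to a diagonal line (top-left to bottom-right).
diagonal_filter = [
    [30, 0, 0],
    [0, 30, 0],
    [0, 0, 30],
]
# Patch containing the same diagonal line.
matching_patch = [
    [50, 0, 0],
    [0, 50, 0],
    [0, 0, 50],
]
# Patch whose bright pixels lie away from the diagonal.
non_matching_patch = [
    [0, 50, 50],
    [0, 0, 50],
    [0, 0, 0],
]

strong = filter_response(matching_patch, diagonal_filter)
weak = filter_response(non_matching_patch, diagonal_filter)
print(strong, weak)  # 4500 0
```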

#### ReLU (Rectified Linear Units) layer
Conventionally, there are other layers that are interspersed between these convolutional layers.  
(e.g. Input → Conv → ReLU → Conv → ReLU → Pool → Conv → ReLU → Conv → ReLU → Pool → Fully connected)

After each convolutional layer, a **ReLU layer** is applied immediately. The purpose of this layer is to introduce nonlinearity to a system that has basically just been computing linear operations during the convolutional layers (just element-wise multiplications and summations).

#### ReLU function (Recap)
- The **ReLU** function is defined as $f(x) = \max(0, x)$.
![Relu](images/relu50.png)
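
Applied to a feature map, the definition above simply replaces every negative value with zero. A minimal sketch, with hypothetical names and values:

```python
def relu(x):
    """f(x) = max(0, x)"""
    return max(0, x)


def relu_layer(feature_map):
    """Apply ReLU to every element of a 2-D feature map."""
    return [[relu(value) for value in row] for row in feature_map]


feature_map = [
    [-3, 2],
    [0, -1],
    [5, -7],
]
activated = relu_layer(feature_map)
print(activated)  # [[0, 2], [0, 0], [5, 0]]
```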

#### The advantages of the ReLU layer
- In practice it often works better than other nonlinear functions (e.g. tanh and sigmoid), because the network is able to train a lot faster (ReLU is cheap to compute) without a significant loss in accuracy.
- It also helps to alleviate the vanishing gradient problem, which is the issue where the lower layers of the network train very slowly because the gradient decreases exponentially through the layers.
    - The sigmoid function has a maximum slope of 0.25, so during backpropagation the gradient is multiplied by a value less than 1 at every layer. The deeper the network, the more of these factors are multiplied together, and the gradient decreases exponentially. ReLU alleviates this by having a slope of 1 for positive inputs, so during backpropagation the gradient is not shrunk as it passes through active units.  
![sigmoid_tanh_with_derivate](images/sigmoid_tanh_with_derivate.png)
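
The numbers behind this argument can be checked with a short sketch. It is a deliberate simplification: it multiplies only the activation slopes along a chain of layers and ignores the weight matrices, and it assumes the best case for sigmoid, where every pre-activation sits at 0 and the slope reaches its maximum of 0.25. All names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x)), which peaks at 0.25 when x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


depth = 10  # hypothetical number of stacked layers

# Product of the activation slopes along the backward path (weights ignored).
sigmoid_chain = sigmoid_grad(0.0) ** depth
relu_chain = relu_grad(1.0) ** depth

print(sigmoid_chain)  # about 9.5e-07
print(relu_chain)     # 1.0
```

Even in its best case, ten sigmoid layers scale the gradient by less than one millionth, while active ReLU units leave it unchanged.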

#### Pooling layer

After some ReLU layers, it is customary to apply a pooling layer (a.k.a. *downsampling layer*). There are several options for the pooling layer, with **max pooling** being the most popular. (Other options include average pooling and L2-norm pooling.)

![Max pool](images/max-pool.png)
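
A minimal sketch of max pooling, assuming a 2x2 window with stride 2 on a single-channel feature map. The function name and the example values are hypothetical.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Keep the largest value in each `size` x `size` window of a 2-D feature map."""
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1

    pooled = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Collect the values inside the current window.
            window = [
                feature_map[r * stride + i][c * stride + j]
                for i in range(size)
                for j in range(size)
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled


feature_map = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
pooled = max_pool2d(feature_map)
print(pooled)  # [[6, 8], [3, 4]]
```

The 4x4 input shrinks to 2x2: the width and height are halved, while the number of channels (not shown here) would stay the same.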

##### The purpose of the pooling layer
- To reduce the number of parameters
    - The pooling layer drastically reduces the spatial dimensions (the length and the width but not the depth) of the input volume, so the following layers need fewer parameters and less computation.
    - After reducing the dimensions, the more important features of the input image remain.
- To control overfitting
    - Because it reduces the number of parameters and the complexity of the model.

#### Fully-connected layer
The last layer is an important one, namely the fully-connected layer. Basically, a fully-connected layer looks at which high-level features most strongly correlate to a particular class. It has particular weights so that when we compute the products between the weights and the output of the previous layer, we get the correct probabilities for the different classes.
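
The weighted products can be sketched as follows. Two things here are assumptions not stated in the text above: a bias term is added to each class score, and the scores are turned into probabilities with a softmax, which is a common choice. The feature values, weights, and names are made up for illustration.

```python
import math


def fully_connected(features, weights, biases):
    """One score per class: dot product of the class weights with the features, plus a bias."""
    return [
        sum(w * x for w, x in zip(class_weights, features)) + b
        for class_weights, b in zip(weights, biases)
    ]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1 (assumed normalisation)."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical flattened feature map and weights for a 2-class problem.
features = [1.0, 0.0, 2.0]
weights = [
    [0.5, -1.0, 1.0],   # class 0
    [-0.5, 1.0, 0.0],   # class 1
]
biases = [0.0, 0.1]

scores = fully_connected(features, weights, biases)
probabilities = softmax(scores)
print(scores)
print(probabilities)
```

Class 0 receives the higher score because its weights align with the features that are active, so it also receives the higher probability.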

### 3-3. Data Augmentation

#### Data Augmentation  
- In the real world, data is limited.
- With insufficient data, a trained model may suffer from overfitting.
- Data augmentation increases the amount of training data to help prevent overfitting.

#### Horizontal flip

![Horizontal Flip](images/horizontal_flip.png)
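
A horizontal flip reverses each row of the image. The sketch below also includes a randomized variant; the names, the default probability of 0.5, and the `rng` parameter are assumptions made for illustration.

```python
import random


def horizontal_flip(image):
    """Mirror a 2-D image left to right by reversing every row."""
    return [row[::-1] for row in image]


def random_horizontal_flip(image, p=0.5, rng=random):
    """Flip the image with probability `p`; otherwise return a copy of it unchanged."""
    if rng.random() < p:
        return horizontal_flip(image)
    return [list(row) for row in image]


image = [
    [1, 2, 3],
    [4, 5, 6],
]
flipped = horizontal_flip(image)
print(flipped)  # [[3, 2, 1], [6, 5, 4]]

# Each call may or may not flip, so one image yields two possible training samples.
augmented = random_horizontal_flip(image, rng=random.Random(0))
```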

#### Random crop  
- Enlarge the image slightly beyond the input size, then crop it at a random location.
![Random Crop](images/random-crop.png)
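
One way to sketch this is shown below. It assumes the image is enlarged by zero padding; resizing the image would be another option. The function names and the `rng` parameter are hypothetical.

```python
import random


def random_crop(image, crop_h, crop_w, rng=random):
    """Cut a `crop_h` x `crop_w` window out of `image` at a random location."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def pad_and_random_crop(image, pad, rng=random):
    """Enlarge the image with a zero border, then crop back to the original size."""
    height, width = len(image), len(image[0])
    padded_width = width + 2 * pad
    enlarged = [[0] * padded_width for _ in range(pad)]
    enlarged += [[0] * pad + list(row) + [0] * pad for row in image]
    enlarged += [[0] * padded_width for _ in range(pad)]
    return random_crop(enlarged, height, width, rng)


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
# The output keeps the 3x3 input size, but the content is shifted by a random offset.
augmented = pad_and_random_crop(image, pad=1, rng=random.Random(0))
print(augmented)
```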


#### SimCLR
- A powerful and data-efficient framework that relies on data augmentation.
- When SimCLR is trained with only 1% of the ImageNet labels, it achieves 85.8% top-5 accuracy.
![SimCLR architecture](images/simclr-architecture.png)
Chen et al., A simple framework for contrastive learning of visual representations, PMLR, 2020

#### Illustration of SimCLR
<img src = "images/simCLR-gif.gif" align="center" width="60%">


#### ImageNet top-1 accuracy
![SimCLR-Acc](images/SimCLR-perfomance.png)

[Reference for SimCLR](https://amitness.com/2020/03/illustrated-simclr/)

#### Going further: Convolution Arithmetic

If you want to go further with convolution and fully understand how it works, including all the details we omitted in this notebook, I strongly suggest reading the first paper below. The second is the SimCLR paper cited above.
- [A guide to convolution arithmetic for deep learning](https://arxiv.org/abs/1603.07285).
- [A simple framework for contrastive learning of visual representations](http://proceedings.mlr.press/v119/chen20j/chen20j.pdf)
