# CS492 Special Topics in Computer Science: AI Industry and Smart Energy
## Deep Learning Practice 
#### Prof. Ho-Jin Choi
#### School of Computing, KAIST

---

# Week 2 - Convolutional Neural Networks (CNNs)

## Schedule for CNN

3. **Basics of CNN** <br>
    3-1. Convolutional neural network <br>
    3-2. Layers composing CNN <br>
    3-3. Image classification using CNN
    
4. **Transfer learning with pre-trained CNNs**  <br>
    4-1. Several well-known CNN models <br>
    4-2. Tutorial of the pre-trained models using Keras API <br>
    4-3. Feature extraction using pre-trained model <br>
    4-4. Fine tuning our classification model

---

## 3. Basics of CNN
### 3-1. Convolutional neural network (CNN)
- A convolutional neural network (CNN) is a class of deep neural networks, most commonly **<font color='red'>applied to analyzing visual data</font>**.
- The networks consist of multiple layers of small neuron collections which process portions of the input image, called **<font color='red'>receptive fields</font>**.
    - The connectivity pattern between these neurons is inspired by the organization of the animal visual cortex.
- The outputs of these collections are then tiled so that their input regions overlap, to obtain a **<font color='red'>better representation</font>** of the original image; this is repeated for every such layer.

#### Example of CNN structure

<img src=http://playagricola.com/Kaggle/extract.png width="800">

- A CNN consists of a series of <font color='red'>convolutional, non-linear, pooling (downsampling), and fully connected layers</font>.
- The convolutional, non-linear, and pooling layers <font color='red'>extract a feature map (or activation map)</font> from the input.
- The fully connected layer <font color='red'>classifies a target using the extracted feature map</font> (e.g. in image classification, it outputs either a single class for the input image or a probability for each class that describes the image).

### 3-2. Layers composing CNN

#### Convolutional layer
The first layer in a CNN is usually a **convolutional layer**.

<img src=https://qph.fs.quoracdn.net/main-qimg-f0c8518784fdef4130a781181eae1a11 width="400">

**Convolutional filters**

<img src="images/conv_filter_ex.png">

- A convolutional filter, much like a **kernel** in image processing, is a small matrix useful for blurring, sharpening, embossing, edge detection, and more.
- This is accomplished by computing a convolution between a filter and an image.
- The main difference **_here_** is that the **<font color='red'>conv matrices are learned</font>**.
    - This means that the values of the convolutional filters are learned as weights while training the CNN model.

**Convolution** 

As the filter slides, or **<font color='red'>convolves</font>**, around the input image, it multiplies the values in the filter with the original pixel values of the image (i.e. element-wise multiplication) and sums the products to get a single convolved value.

<img src=https://adeshpande3.github.io/assets/ActivationMap.png width="600">

We repeat this process for every location on the input volume: the filter moves to the right by 1 unit, then right again by 1, and so on, row after row. After sliding the filter over all the locations, we are left with an array of numbers usually called an **<font color='red'>activation map</font>** or **<font color='red'>feature map</font>**.

<img src=https://miro.medium.com/max/1800/1*VVvdh-BUKFh2pwDD0kPeRA@2x.gif width="600">
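
The sliding-window computation described above can be sketched in a few lines of NumPy. This is only an illustration, not how a deep learning library implements it: the function name `conv2d_valid` and the small 5x5 input and 3x3 filter are made up for the example, and the filter is applied without flipping (which is what most deep learning frameworks call "convolution").

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` with stride 1 and no padding."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            patch = image[r:r + kh, c:c + kw]   # current receptive field
            out[r, c] = np.sum(patch * kernel)  # element-wise multiply, then sum
    return out

# Toy input and filter (illustrative values only)
image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

feature_map = conv2d_valid(image, kernel)
print(feature_map)  # 3x3 feature map: rows [4, 3, 4], [2, 4, 3], [2, 3, 4]
```

A 5x5 input with a 3x3 filter yields a 3x3 feature map, because the filter fits in only three positions along each axis.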

**Stride and Padding**

Stride and padding are the main parameters of a convolutional layer; changing them modifies the behavior of the layer.

Stride: Controls how far the filter shifts at each step as it convolves around the input volume.
 - In the above example, the filter convolves around the input volume by shifting **one** unit at a time.
 - In that case, the stride was implicitly set to 1.
 - Stride is normally chosen so that the output size is an integer and not a fraction.
 
<img src=https://miro.medium.com/max/1580/1*L4T6IXRalWoseBncjRr4wQ@2x.gif width="500">
<img src=https://miro.medium.com/max/1442/1*4wZt9G7W7CchZO-5rVxl5g@2x.gif width="500">

 
 
Padding: Adds a border of zeros around the input.
 - Without padding, the feature map is smaller than the input, because the convolution filter must fit entirely inside the input.
 - If we want to keep the same spatial size, we can use padding to surround the input with zeros.
 
<img src=https://miro.medium.com/max/2126/1*W2D564Gkad9lj3_6t9I2PA@2x.gif width="500">
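
The effect of stride and padding on the output size can be summarized by the commonly used formula $O = (W - F + 2P)/S + 1$, where $W$ is the input size, $F$ the filter size, $P$ the padding on each side, and $S$ the stride (these symbols are introduced here for illustration). A minimal sketch follows. The helper `conv_output_size` is hypothetical, and raising an error when the stride does not fit is just one possible choice; libraries typically round down instead.

```python
import numpy as np

def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Output width (or height) of a convolution along one dimension."""
    span = input_size - filter_size + 2 * padding
    if span % stride != 0:
        # Assumption for this sketch: reject strides that give a fractional size
        raise ValueError("stride does not fit the padded input evenly")
    return span // stride + 1

print(conv_output_size(5, 3))                       # 3: stride 1, no padding
print(conv_output_size(7, 3, stride=2))             # 3: stride 2
print(conv_output_size(5, 3, stride=1, padding=1))  # 5: zero padding keeps the size

# Zero padding itself can be done with np.pad
x = np.arange(9).reshape(3, 3)
padded = np.pad(x, pad_width=1, mode="constant", constant_values=0)
print(padded.shape)                                 # (5, 5)
```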

**More details for computation of convolutional filter (reference material)**

##### High level perspective

Let’s briefly discuss what this convolution is actually doing at a high level. Each of these filters can be thought of as a **feature identifier** (e.g. for *straight edges, simple colors, curves*).

<img src=https://adeshpande3.github.io/assets/Filter.png width="600">



##### Visualisation of the Receptive Field
<img src=https://adeshpande3.github.io/assets/OriginalAndFilter.png width="600">
<img src=https://adeshpande3.github.io/assets/FirstPixelMulitiplication.png width="600">
<img src=https://adeshpande3.github.io/assets/SecondMultiplication.png width="600">

When the filter is moved to a region that does not contain the curve, the value is much lower! This is because there wasn’t anything in that image section that responded to the curve detector filter. Remember, the output of this conv layer is an activation map.
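
The same idea can be reproduced with a toy example. The sketch below uses a hypothetical 3x3 vertical-edge filter rather than the curve detector from the figures, and the pixel values are invented: a patch that contains the feature produces a large response, while a patch that does not produces a response of zero.

```python
import numpy as np

# Hypothetical filter that responds to a vertical edge (dark left, bright right)
edge_filter = np.array([[-1, 0, 1],
                        [-1, 0, 1],
                        [-1, 0, 1]])

# One patch that contains such an edge, and one flat patch without any edge
patch_with_edge = np.array([[0, 0, 9],
                            [0, 0, 9],
                            [0, 0, 9]])
flat_patch = np.full((3, 3), 5)

# Element-wise multiplication followed by a sum, as in the convolution step
response_edge = int(np.sum(edge_filter * patch_with_edge))
response_flat = int(np.sum(edge_filter * flat_patch))
print(response_edge, response_flat)  # 27 0
```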

#### ReLU (Rectified Linear Units) layer

Now, in a traditional convolutional neural network architecture, there are other layers that are interspersed between these conv layers. <br>
(e.g. Input->Conv->ReLU->Conv->ReLU->Pool->Conv->ReLU->Conv->ReLU->Pool->Fully connected)

After each conv layer, it is convention to apply a *nonlinear layer* (or **activation layer**), most commonly a **ReLU layer**, immediately afterward. The purpose of this layer is to introduce nonlinearity to a system that has basically just been computing linear operations during the conv layers (just element-wise multiplications and summations).

**The advantages of ReLU layer**
- In practice it often works better than other nonlinear functions (e.g. tanh and sigmoid), because the network can train a lot faster (ReLU is cheap to compute) without a significant loss in accuracy.
- It also helps to alleviate the **vanishing gradient problem**, which is the issue where the lower layers of the network train very slowly because the gradient decreases exponentially through the layers.

**ReLU function**
- The **ReLU** function is defined as $f(x) = \max(0, x)$.
- A smooth approximation to the rectifier is the *analytic function*: $f(x) = \ln(1 + e^x)$, which is called the **softplus** function.
- The derivative of softplus is $f'(x) = e^x / (e^x + 1) = 1 / (1 + e^{-x})$, i.e. the **logistic function**.
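
The three functions above are easy to check numerically. The following sketch (function names chosen for this example) implements ReLU, softplus, and the logistic function in NumPy, and compares a finite-difference derivative of softplus against the logistic function.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softplus(x):
    # ln(1 + e^x), written with logaddexp to avoid overflow for large x
    return np.logaddexp(0.0, x)

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5.0, 5.0, 11)

# Central finite difference as a numerical derivative of softplus
h = 1e-5
numeric_grad = (softplus(x + h) - softplus(x - h)) / (2 * h)
max_error = np.max(np.abs(numeric_grad - logistic(x)))

print(relu(np.array([-2.0, 0.0, 3.0])))  # negative inputs are clipped to 0
print(max_error)                          # tiny: the derivative matches the logistic function
```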


**Activation function: Sigmoid vs ReLU** 

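- Various activation functions exist.
- The reason for using an activation function in a deep learning model is to make the model more complex (more expressive).
    - Therefore, non-linear functions are mainly used (e.g., sigmoid, tanh, etc.).
    - If the model consists only of linear operations, it cannot learn complex data (in that case, it ends up equivalent to the model y = wx + b).
    - In other words, the model has to be made more complex to improve performance.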
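
The claim that a purely linear model collapses into a single linear layer can be verified directly. In this sketch the layer sizes and random weights are arbitrary illustrative choices: two stacked linear layers with no activation in between produce exactly the same output as one merged linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two linear layers with arbitrary (illustrative) sizes: 3 -> 4 -> 2
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Stacking the layers without an activation function in between...
two_layers = W2 @ (W1 @ x + b1) + b2

# ...is the same as a single linear layer with merged weights and bias
W = W2 @ W1
b = W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))  # True
```

Inserting a non-linear activation between the two layers is what prevents this merging.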

<img src=https://tariq-hasan.github.io/assets/images/activation_functions.png width="700">

**Sigmoid function**
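- As an activation function, it increases the complexity (expressive power) of each layer.
- As an output function, it outputs a value between 0 and 1 (usually used for binary classification) -> squashing function
    - For the output function, the softmax function is now mainly used (especially for multi-class classification).
- **Problems**
    1. During back propagation, the derivative becomes very small (close to 0), so the weight updates vanish (saturation, vanishing gradient).
      => The gradient shrinks further with every layer it passes through, so the weight updates of the bottom layers, and eventually of all layers, effectively stop.
    1. The output of the activation function is always positive -> the partial derivatives with respect to the weights of the next layer all have the same sign -> the weight updates share one sign (all + or all -; a mix of + and - is not possible).
      => Zigzag updates occur -> learning slows down.
    1. It requires an exponential (`np.exp`), which is slow to compute -> layers that use it, such as convolutional layers, take longer.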
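
The saturation problem can be made concrete with the derivative of the sigmoid, $\sigma'(x) = \sigma(x)(1 - \sigma(x))$, which never exceeds 0.25. The sketch below is a simplification: it ignores the weights and looks only at the factor each sigmoid contributes to the back-propagated gradient, and the depth of 10 layers is an arbitrary choice.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))   # 0.25, the largest value the derivative can take
print(sigmoid_grad(10.0))  # very close to 0: the saturated region

# Best case: every layer sits at x = 0 and contributes a factor of 0.25.
# After n_layers sigmoids the gradient is already scaled down by 0.25 ** n_layers.
n_layers = 10
best_case_factor = sigmoid_grad(0.0) ** n_layers
print(best_case_factor)    # below 1e-6
```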
    
**tanh function**
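- Similar to sigmoid, but its output ranges from -1 to 1 (zero-centered), so the zigzag phenomenon is less severe.
- However, the vanishing gradient problem remains, and it is even slower because it involves more exp operations.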

<img src=https://taewanmerepo.github.io/2017/12/tanh/020.jpg width="700">



**ReLU function**
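- The most widely used activation function, and it usually gives the best performance in practice.
- The simplest function that is still non-linear.
- **The number of saturating regions is reduced from 2 to 1.**
- **Fast, because there is no exp operation** (the loss converges nearly 6 times faster than with sigmoid/tanh)
- Its output is never negative (either 0 or the input itself), so it is not zero-centered and the zigzag phenomenon still occurs.
- Many variants have since been proposed (Leaky ReLU, Parametric ReLU, Exponential Linear Unit, ...).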
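
To illustrate the single saturating region, the sketch below computes the gradient of ReLU, which is 1 for positive inputs and 0 for negative inputs. It also shows Leaky ReLU, one of the variants listed above; the slope `alpha=0.01` is an illustrative default chosen for this example, not a value taken from the lecture.

```python
import numpy as np

def relu_grad(x):
    # 1 where the unit is active, 0 in the (only) saturated region x <= 0
    return (x > 0).astype(float)

def leaky_relu(x, alpha=0.01):
    # alpha is an illustrative slope for negative inputs
    return np.where(x > 0, x, alpha * x)

x = np.array([-10.0, -1.0, 1.0, 10.0])
print(relu_grad(x))   # gradient is 0 for the negative inputs, 1 for the positive ones
print(leaky_relu(x))  # negative inputs are scaled by alpha instead of being zeroed
```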

#### Pooling layer

After some ReLU layers, it is customary to apply a **<font color='red'>pooling layer</font>** (aka *downsampling layer*). In this category, there are also several layer options, with **maxpooling** being the most popular. (There are other options such as average pooling and L2-norm pooling.)

<img src=https://miro.medium.com/max/2344/1*ReZNSf_Yr7Q1nqegGirsMQ@2x.png width="600">


**The purpose of pooling layer**
- Reduce the number of parameters
    - The pooling layer drastically reduces the spatial dimensions (the length and the width, but not the depth) of the input volume, so the layers that follow need fewer parameters and less computation.
    - After the dimensions are reduced, the more important features of the input image remain.
- Control overfitting
    - Fewer parameters mean a less complex model, which is less prone to overfitting.
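
A minimal sketch of max pooling with a 2x2 window and stride 2 is shown below. The function name `max_pool_2x2` and the 4x4 input are made up for the example, and the reshape trick assumes the height and width are even.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a 2-D array with even height and width."""
    h, w = feature_map.shape
    # Split the map into non-overlapping 2x2 blocks, then take the max of each block
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

fm = np.array([[1, 1, 2, 4],
               [5, 6, 7, 8],
               [3, 2, 1, 0],
               [1, 2, 3, 4]])

pooled = max_pool_2x2(fm)
print(pooled)  # 2x2 output: rows [6, 8] and [3, 4]
```

The 4x4 input shrinks to 2x2, so the spatial size is reduced by a factor of four while the largest activation in each region is kept.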

#### Fully connected layer

The last layer is an important one, namely the **Fully Connected Layer**. This is the layer we learned about in the last class.

Basically, an FC layer looks at which high-level features most strongly correlate to a particular class. It has particular weights so that when you compute the products between the weights and the previous layer, you get the correct probabilities for the different classes.
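
As a rough sketch of this computation, the code below multiplies a flattened feature vector by a weight matrix and turns the resulting class scores into probabilities. Several things here are assumptions made for illustration: the sizes (8 features, 3 classes), the random untrained weights, and the use of softmax to obtain probabilities.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
features = rng.normal(size=8)  # flattened feature map (hypothetical size)
W = rng.normal(size=(3, 8))    # one row of weights per class (untrained, random)
b = np.zeros(3)

scores = W @ features + b      # products between the weights and the previous layer
probs = softmax(scores)        # class probabilities that sum to 1
predicted_class = int(np.argmax(probs))

print(probs, predicted_class)
```

In a trained network, the weights `W` would be learned so that the highest probability falls on the correct class.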

#### Going further: Convolution Arithmetic

If you want to go further with convolution and fully understand how it works, including all the details we omitted in this notebook, I strongly suggest reading this terrific paper: [A guide to convolution arithmetic for deep learning](https://arxiv.org/abs/1603.07285).

This paper is also referenced (with animations) in the `theano` main documentation: [convnet tutorial](http://deeplearning.net/software/theano/tutorial/conv_arithmetic.html)