# SEP532 - Artificial Intelligence: Theory and Practice (2022 Spring)
## Deep Learning Practice 
#### Prof. Ho-Jin Choi
#### School of Computing, KAIST

---

# 1. Neural Network Basic 

## Linear model 
Let's consider a binary classification problem.

<img src="https://github.com/keai-kaist//SEP532-2022-Spring/blob/master/Practice_1/imgs/bt_example.png?raw=true" align="center" width="600"/>
To solve the above problem, we can define a simple linear model as follows.

<img src="https://github.com/keai-kaist//SEP532-2022-Spring/blob/master/Practice_1/imgs/linear_model.png?raw=true" align="center" width="600"/>

We can then apply the linear model to binary classification.

<img src="https://github.com/keai-kaist//SEP532-2022-Spring/blob/master/Practice_1/imgs/lm_classification.png?raw=true" align="center" width="600"/>
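
As a minimal sketch of the idea above, the snippet below scores each input with `w.x + b` and assigns class 1 when the score is positive. The threshold at zero, the function name `linear_classify`, and the sample points and parameters are assumptions for illustration; the figures may use a different convention.

```python
import numpy as np

def linear_classify(X, w, b):
    """Return 1 where w.x + b > 0 and 0 otherwise, for each row x of X."""
    scores = X @ w + b              # one linear score per input row
    return (scores > 0).astype(int)

# Hypothetical 2-D points and hand-picked parameters (not taken from the figures).
X = np.array([[2.0, 1.0], [0.5, 0.2], [-1.0, -2.0], [-0.3, -0.1]])
w = np.array([1.0, 1.0])
b = 0.0

labels = linear_classify(X, w, b)
print(labels)  # [1 1 0 0]
```

The decision boundary here is the line `x1 + x2 = 0`; every point on one side gets the same label, which is exactly what limits a linear model.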

#### Issues of linear models
- Most real-world data is not linearly separable
- In other words, no linear model can separate the regions correctly
- Therefore, non-linearity is necessary to model arbitrarily complex functions

<img src="https://github.com/keai-kaist//SEP532-2022-Spring/blob/master/Practice_1/imgs/issues_lm.png?raw=true" align="center" width="700"/>


## Multi-Layer Perceptron (MLP)

#### Injecting non-linearity 
To solve the above issues, we can inject non-linearity into the linear model through non-linear functions such as **Sigmoid, Hyperbolic Tangent (tanh), and Rectified Linear Unit (ReLU).**

<img src="https://github.com/keai-kaist/SEP532-2022-Spring/blob/master/Practice_1/imgs/inject_non_liearity.png?raw=true" align="center" width="600"/>

<img src="https://github.com/keai-kaist/SEP532-2022-Spring/blob/master/Practice_1/imgs/act_functions.png?raw=true" align="center" width="600"/>
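
The three functions named above can be written in a few lines of NumPy. This is a sketch using the standard textbook definitions; the function names are chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    # Squashes any real input into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Squashes any real input into (-1, 1).
    return np.tanh(z)

def relu(z):
    # Keeps positive inputs, sets negative inputs to zero.
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])   # hypothetical pre-activation values
print(sigmoid(z))
print(tanh(z))
print(relu(z))
```

Applying one of these to the linear score, for example `sigmoid(w @ x + b)`, is what turns the linear model into a non-linear unit.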


#### Perceptron: simplified view 
A perceptron is an algorithm used for supervised learning of binary classifiers. Binary classifiers decide whether an input, usually represented by a series of vectors, belongs to a specific class. In short, a perceptron is a single-layer neural network. (Definition from DeepAI)

<img src="https://github.com/keai-kaist/SEP532-2022-Spring/blob/master/Practice_1/imgs/perceptrons.png?raw=true" align="center" width="700"/>


#### Multi-Layer Perceptron 
An MLP consists of at least three layers of nodes: an input layer, a hidden layer, and an output layer.

<img src="https://github.com/keai-kaist/SEP532-2022-Spring/blob/master/Practice_1/imgs/mlp.png?raw=true" align="center" width="700"/>


We stack more layers for two reasons: 1) hidden layers are nonlinear embeddings of the input; 2) the model can embed the data into a linearly separable space.

<img src="https://github.com/keai-kaist/SEP532-2022-Spring/blob/master/Practice_1/imgs/mlp2.png?raw=true" align="center" width="700"/>
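
To make the embedding idea concrete, here is a small sketch on the XOR problem, which no linear model can solve in the input space. The weights below are set by hand (an assumption for illustration, not learned and not taken from the figures): one hidden ReLU layer maps the four inputs to a space where a single linear threshold separates the classes.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# XOR inputs and labels: not linearly separable in the input space.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

# Hand-picked hidden layer: h1 = relu(x1 + x2), h2 = relu(x1 + x2 - 1).
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])        # shape (inputs, hidden units)
b1 = np.array([0.0, -1.0])

# Linear classifier applied in the hidden space: h1 - 2*h2 - 0.5 > 0.
w2 = np.array([1.0, -2.0])
b2 = -0.5

H = relu(X @ W1 + b1)              # nonlinear embedding of the inputs
pred = (H @ w2 + b2 > 0).astype(int)

print(H)     # embedded points: (0,0), (1,0), (1,0), (2,1)
print(pred)  # [0 1 1 0]
```

In the hidden space the two class-1 points collapse onto `(1, 0)`, and a straight line separates them from `(0, 0)` and `(2, 1)`.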

## Deep Neural Network (DNN)
By stacking more and more layers, neural networks gain the capacity to represent and model complex data.

<img src="https://github.com/keai-kaist/SEP532-2022-Spring/blob/master/Practice_1/imgs/dnn.png?raw=true" align="center" width="700"/>

#### Example: recognizing handwritten digits

Example data from MNIST are shown below.

<img src="https://github.com/keai-kaist/SEP532-2022-Spring/blob/master/Practice_1/imgs/handwritten_digits.jpeg?raw=true" align="center" width="600"/>


To recognize handwritten digits using a DNN, we first preprocess the images so they can be fed to the model, and then train the model.

**Data representation**
A gray-scale image is represented as an array of pixel intensities (i.e., flattened into a vector).

<img src="https://github.com/keai-kaist/SEP532-2022-Spring/blob/master/Practice_1/imgs/data_representation.png?raw=true" align="center" width="700"/>
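
A minimal sketch of this step, assuming a 28x28 gray-scale image with 8-bit pixel values as in MNIST. The random image stands in for real data, and scaling to [0, 1] is a common convention rather than something the figure prescribes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder for one MNIST image: 28x28 pixels with intensities 0..255.
image = rng.integers(0, 256, size=(28, 28)).astype(np.uint8)

# Flatten row by row into a 784-dimensional vector and scale to [0, 1].
x = image.reshape(-1).astype(np.float32) / 255.0

print(image.shape, "->", x.shape)  # (28, 28) -> (784,)
```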


**Forward propagation (Embedding viewpoint)**
The vectorized image is propagated through the layers and classified into one of the digits 0 to 9. In a trained DNN, each layer represents a non-linear embedding of the input into a more easily separable space.

<img src="https://github.com/keai-kaist/SEP532-2022-Spring/blob/master/Practice_1/imgs/fp_mnist.png?raw=true" align="center" width="700"/>
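
The forward pass can be sketched as a loop over layers. The layer sizes, the ReLU hidden activation, the softmax output, and the random (untrained) weights below are all assumptions for illustration; the network in the figure may differ.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    z = z - np.max(z)              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def forward(x, params):
    """Propagate x through hidden ReLU layers, then a softmax output layer."""
    a = x
    for W, b in params[:-1]:
        a = relu(W @ a + b)        # nonlinear embedding at each hidden layer
    W, b = params[-1]
    return softmax(W @ a + b)      # probabilities over the 10 digits

rng = np.random.default_rng(0)
sizes = [784, 128, 64, 10]         # hypothetical layer widths
params = [(rng.normal(0.0, 0.01, size=(n_out, n_in)), np.zeros(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]

x = rng.random(784)                # stand-in for a flattened image
probs = forward(x, params)
digit = int(np.argmax(probs))      # predicted class index, 0..9
print(probs.shape, digit)
```

With untrained weights the prediction is meaningless; training adjusts `params` so that the largest probability lands on the correct digit.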


**Recognition viewpoint**
The DNN model recognizes a given handwritten digit by combining the abstract features represented across multiple layers. In other words, a DNN learns a hierarchy of features capturing different levels of abstraction.

<img src="https://github.com/keai-kaist/SEP532-2022-Spring/blob/master/Practice_1/imgs/recognition_viewpoint.png?raw=true" align="center" width="700"/>


It works similarly for other data modalities. For a face image, the layers of the DNN capture hierarchical features such as edges -> facial landmarks -> global face.


<img src="https://github.com/keai-kaist/SEP532-2022-Spring/blob/master/Practice_1/imgs/rv_examples.png?raw=true" align="center" width="700"/>


## Training neural network 

- **Objective:** find a set of parameters that minimizes the error on the dataset.
- Notations
    - Dataset: $\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}), \dots, (x^{(N)},y^{(N)})\}$ with $N$ training examples
    - Parameters: $\{\text{w}^{(1)},\text{w}^{(2)}, \dots, \text{w}^{(L)}\}$ for $L$ layers
    
<img src="https://github.com/keai-kaist/SEP532-2022-Spring/blob/master/Practice_1/imgs/dnn_training1.png?raw=true" align="center" width="600"/>


- **Loss function**
    - Measures the mismatch between the model prediction and the true label.
    - There are many ways to define the degree of mismatch (i.e. misprediction, error).
    - It is the key to “quantifying the performance” of the model on a specific task and dataset.
    
<img src="https://github.com/keai-kaist/SEP532-2022-Spring/blob/master/Practice_1/imgs/loss_func.png?raw=true" align="center" width="600"/>
    
    
- Example of loss function - Mean Squared Error (MSE)
    - Used for regression tasks

<img src="https://github.com/keai-kaist/SEP532-2022-Spring/blob/master/Practice_1/imgs/mse.png?raw=true" align="center" width="600"/>
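
A minimal sketch of MSE, using the common definition (mean of squared differences). The figure may include an extra constant factor such as 1/2; the function name and sample values are illustrative.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Squared errors are 0.25, 0.0 and 1.0, so the mean is 1.25 / 3.
loss = mse([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])
print(loss)
```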
  
  
- Example of loss function - Binary cross entropy
    - Used for binary classification
    - The predicted values can be interpreted as probabilities by applying the sigmoid function (for two classes, this is equivalent to a softmax)

<img src="https://github.com/keai-kaist/SEP532-2022-Spring/blob/master/Practice_1/imgs/bce.png?raw=true" align="center" width="600"/>
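
A minimal sketch of binary cross entropy under the usual definition, with labels in {0, 1} and predicted probabilities of class 1. The clipping constant `eps` is an implementation detail added here to avoid `log(0)`; it is not part of the formula.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

# Hypothetical raw model outputs (logits) turned into probabilities.
logits = np.array([2.0, -1.0, 0.0])
labels = np.array([1, 0, 1])
bce = binary_cross_entropy(labels, sigmoid(logits))
print(bce)

# A maximally uncertain prediction (p = 0.5) costs log(2) per example.
uncertain = binary_cross_entropy([1, 0], [0.5, 0.5])
```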


#### Optimizing the loss function (Back-propagation)
- **Challenges** for optimizing the loss function
    - The loss surface is highly non-convex (and non-concave).
    - In general, it is impossible to find an analytical (closed-form) solution.
    - Reference: https://ratsgo.github.io/deep%20learning/2017/09/25/gradient/
- **Optimization via gradient descent**

<img src="https://github.com/keai-kaist/SEP532-2022-Spring/blob/master/Practice_1/imgs/gradient_descent.png?raw=true" align="center" width="600"/>

- Algorithm (gradient descent)

<img src="https://github.com/keai-kaist/SEP532-2022-Spring/blob/master/Practice_1/imgs/gd_a1.png?raw=true" align="center" width="600"/>

- Algorithm (stochastic gradient descent)

<img src="https://github.com/keai-kaist/SEP532-2022-Spring/blob/master/Practice_1/imgs/gd_a2.png?raw=true" align="center" width="600"/>

- **Algorithm (minibatch stochastic gradient descent)**

<img src="https://github.com/keai-kaist/SEP532-2022-Spring/blob/master/Practice_1/imgs/gd_a3.png?raw=true" align="center" width="600"/>
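
The three variants above differ only in how many examples are used per update. The sketch below runs minibatch SGD on a small linear regression problem with an MSE loss; the data, learning rate, batch size, and function name are assumptions for illustration. Setting `batch_size` to the dataset size gives plain gradient descent, and setting it to 1 gives stochastic gradient descent.

```python
import numpy as np

def minibatch_sgd(X, y, lr=0.05, batch_size=16, epochs=200, seed=0):
    """Fit y ~ X @ w + b by minimizing MSE with minibatch SGD."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        order = rng.permutation(n)                 # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            err = X[idx] @ w + b - y[idx]          # prediction error on the batch
            grad_w = 2.0 * X[idx].T @ err / len(idx)
            grad_b = 2.0 * err.mean()
            w -= lr * grad_w                       # step against the gradient
            b -= lr * grad_b
    return w, b

# Synthetic, noise-free data generated from known parameters.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -3.0]) + 0.5

w, b = minibatch_sgd(X, y)
print(w, b)   # should be close to [2, -3] and 0.5
```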



#### Computing gradients of weights in neural network 
- Chain rule: propagating the gradient across the layers
    - Simplest example: two-layer neural network with one hidden node
    - $\hat{y}=f(x;\text{W})$
    
<img src="https://github.com/keai-kaist/SEP532-2022-Spring/blob/master/Practice_1/imgs/chain_rule.png?raw=true" align="center" width="700"/>
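
The chain rule for this simplest case can be checked numerically. The sketch below assumes a scalar network `y_hat = w2 * sigmoid(w1 * x)` with a squared-error loss; the exact form in the figure may differ, and all values are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grads(x, y, w1, w2):
    # Forward pass
    z = w1 * x
    h = sigmoid(z)                  # the single hidden node
    y_hat = w2 * h
    loss = (y_hat - y) ** 2
    # Backward pass: multiply local derivatives along the path
    dL_dyhat = 2.0 * (y_hat - y)
    dL_dw2 = dL_dyhat * h           # d y_hat / d w2 = h
    dL_dh = dL_dyhat * w2           # d y_hat / d h  = w2
    dL_dz = dL_dh * h * (1.0 - h)   # sigmoid'(z)    = h * (1 - h)
    dL_dw1 = dL_dz * x              # d z / d w1     = x
    return loss, dL_dw1, dL_dw2

x, y, w1, w2 = 1.5, 0.3, 0.8, -1.2  # hypothetical values
loss, g1, g2 = loss_and_grads(x, y, w1, w2)

# Central finite differences as an independent check.
eps = 1e-6
num_g1 = (loss_and_grads(x, y, w1 + eps, w2)[0]
          - loss_and_grads(x, y, w1 - eps, w2)[0]) / (2 * eps)
num_g2 = (loss_and_grads(x, y, w1, w2 + eps)[0]
          - loss_and_grads(x, y, w1, w2 - eps)[0]) / (2 * eps)
print(g1, num_g1)
print(g2, num_g2)
```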
    

- The figure below shows a simple fully-connected network

<img src="https://github.com/keai-kaist/SEP532-2022-Spring/blob/master/Practice_1/imgs/fcn.png?raw=true" align="center" width="700"/>

- Compute the gradients of the weights and biases using the chain rule

<img src="https://github.com/keai-kaist/SEP532-2022-Spring/blob/master/Practice_1/imgs/computing_graidents.png?raw=true" align="center" width="700"/>
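
The same chain-rule computation in matrix form can be sketched as follows. It assumes one sigmoid hidden layer, a linear output layer, and a half squared-error loss; these choices, the layer sizes, and the names are illustrative and may not match the network in the figure.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_backward(x, y, W1, b1, W2, b2):
    # Forward pass
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)
    y_hat = W2 @ a1 + b2
    loss = 0.5 * np.sum((y_hat - y) ** 2)
    # Backward pass
    delta2 = y_hat - y                          # dL/dy_hat
    delta1 = (W2.T @ delta2) * a1 * (1.0 - a1)  # dL/dz1 via the chain rule
    grads = {
        "W2": np.outer(delta2, a1), "b2": delta2,
        "W1": np.outer(delta1, x),  "b1": delta1,
    }
    return loss, grads

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=2)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)

loss, grads = forward_backward(x, y, W1, b1, W2, b2)

# Finite-difference check on a single weight, W1[0, 0].
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (forward_backward(x, y, W1_plus, b1, W2, b2)[0]
           - forward_backward(x, y, W1_minus, b1, W2, b2)[0]) / (2 * eps)
print(grads["W1"][0, 0], numeric)
```

Each gradient has the same shape as the parameter it belongs to, which is a quick sanity check when writing back-propagation by hand.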


## Activation functions
Neural networks are trained by gradient updates. During back-propagation, if the gradient with respect to the weights or the derivative of a nonlinear function is zero, no gradient is passed downstream to the earlier layers. We therefore need to avoid zero gradients. In particular, whether the derivative of the nonlinear function goes to zero depends on the type of activation function.

- Sigmoid function 
    - Pros
        - Bounds the activation value to the range (0, 1)
    - Cons
        - Zero gradient on saturated neurons (vanishing gradient)
        - Outputs are not zero-centered (always positive, in (0, 1)), and the derivative is at most 0.25 -> zigzag problem, slow updates
        - An exponential operation is involved

<img src="https://github.com/keai-kaist/SEP532-2022-Spring/blob/master/Practice_1/imgs/sigmoid.png?raw=true" align="center" width="400"/>

- Hyperbolic Tangent (Tanh) function 
    - Pros
        - Bounds the activation value to the range (-1, 1)
        - Outputs are zero-centered
    - Cons
        - Zero gradient on saturated neurons (vanishing gradient)
        - An exponential operation is involved

<img src="https://github.com/keai-kaist/SEP532-2022-Spring/blob/master/Practice_1/imgs/tanh.png?raw=true" align="center" width="400"/>

- Rectified Linear Unit (ReLU) function 
    - Pros
        - No saturation for positive inputs
        - Easy to compute
    - Cons
        - Outputs are not zero-centered
        - Zero gradient for negative inputs

<img src="https://github.com/keai-kaist/SEP532-2022-Spring/blob/master/Practice_1/imgs/relu.png?raw=true" align="center" width="400"/>

- Other activation functions

<img src="https://github.com/keai-kaist/SEP532-2022-Spring/blob/master/Practice_1/imgs/other_act_funcs.png?raw=true" align="center" width="800"/>
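
The gradient behaviour listed for sigmoid, tanh, and ReLU can be seen by evaluating their derivatives at a few points. This is a sketch using the standard derivative formulas; the sample inputs are arbitrary, and the ReLU derivative at exactly zero is taken to be 0 by convention.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)            # peaks at 0.25 when z = 0

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2    # peaks at 1.0 when z = 0

def relu_grad(z):
    return (z > 0).astype(float)    # 1 for positive inputs, 0 otherwise

z = np.array([-10.0, 0.0, 10.0])
print(sigmoid_grad(z))  # near zero at both ends (saturation)
print(tanh_grad(z))     # near zero at both ends (saturation)
print(relu_grad(z))     # [0. 0. 1.]
```

For large positive or negative inputs the sigmoid and tanh derivatives are almost zero, which is the vanishing-gradient problem; ReLU keeps a gradient of 1 for positive inputs but passes nothing back for negative ones.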