# CS470 Introduction to Artificial Intelligence
## Deep Learning Practice 
#### Prof. Ho-Jin Choi
#### School of Computing, KAIST

---

#  Introduction to Neural Network

1. **Neural Network Basic**  
    1-1. Linear model  
    1-2. Multi-Layer Perceptron  
    1-3. Deep Neural Network  
    1-4. Training neural network
    
---

# 1. Neural Network Basic 

## 1-1. Linear model 
Let's consider a binary classification problem.

<img src="https://github.com/mikodham/CS470/blob/main/Lab1/Apr%2027/imgs/bt_example.png?raw=1" align="center" width="600"/>

To solve the above problem, we can define a simple linear model as follows.

<img src="https://github.com/mikodham/CS470/blob/main/Lab1/Apr%2027/imgs/linear_model.png?raw=1" align="center" width="600"/>

We then apply the linear model to binary classification.

<img src="https://github.com/mikodham/CS470/blob/main/Lab1/Apr%2027/imgs/lm_classification.png?raw=1" align="center" width="600"/>
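As a minimal sketch of this idea, assume the linear model has the common form $f(x) = \text{w}^\top x + b$ and that a point is assigned to the positive class when $f(x) > 0$. The function names and the parameter values below are made up for illustration and are not taken from the figure.

```python
def linear_model(x, w, b):
    """Return the linear score w . x + b for one input vector x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(x, w, b):
    """Assign class 1 if the score is positive, otherwise class 0."""
    return 1 if linear_model(x, w, b) > 0 else 0

# Hypothetical parameters: the decision boundary is the line x1 + x2 = 1.
w, b = [1.0, 1.0], -1.0
points = [[0.2, 0.3], [0.9, 0.8], [2.0, -0.5]]
labels = [classify(p, w, b) for p in points]
print(labels)
```

The first point lies below the line and is assigned class 0; the other two lie above it and are assigned class 1.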

#### Issues of linear models
- Most real-world data is not linearly separable
- In other words, no linear model can separate the regions correctly
- Therefore, non-linearity is necessary to model arbitrarily complex functions 

<img src="https://github.com/mikodham/CS470/blob/main/Lab1/Apr%2027/imgs/issues_lm.png?raw=1" align="center" width="700"/>
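A small experiment can make this limitation concrete. The sketch below uses the XOR problem, a standard example of data that is not linearly separable (it is used here as a stand-in and is not necessarily the data in the figure), and tries every linear rule on a coarse grid of weights. The grid range and step are arbitrary choices.

```python
from itertools import product

# XOR: a classic example of data that is not linearly separable.
xor_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def n_correct(w1, w2, b):
    """Count how many XOR points the rule [w1*x1 + w2*x2 + b > 0] gets right."""
    return sum(
        (1 if w1 * x1 + w2 * x2 + b > 0 else 0) == y
        for (x1, x2), y in xor_data
    )

# Try every (w1, w2, b) on a coarse grid from -2.0 to 2.0 in steps of 0.5.
grid = [i / 2 for i in range(-4, 5)]
best = max(n_correct(w1, w2, b) for w1, w2, b in product(grid, repeat=3))
print(best)  # no linear rule gets all 4 points right
```

No choice of weights classifies all four points correctly; the best linear rule gets three of them.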



## 1-2. Multi-Layer Perceptron (MLP)

#### Injecting non-linearity 
To address the above issues, we can inject non-linearity into the linear model through non-linear functions such as **Sigmoid, Hyperbolic Tangent (tanh), and Rectified Linear Unit (ReLU).**

<img src="https://github.com/mikodham/CS470/blob/main/Lab1/Apr%2027/imgs/inject_non_liearity.png?raw=1" align="center" width="600"/>

<img src="https://github.com/mikodham/CS470/blob/main/Lab1/Apr%2027/imgs/act_functions.png?raw=1" align="center" width="600"/>

If you use sigmoid-like activation functions such as sigmoid and tanh, then after some epochs of training the linear part of each neuron may take values that are very large in magnitude, either positive or negative. Consequently, the input to the sigmoid-like function that adds the non-linearity in each neuron will be far from the center of the function, where the function is nearly flat and its gradient is close to zero.

The sigmoid function maps its input to the range (0, 1), so its output can be interpreted as a probability in binary classification.
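To see what happens when the input of a sigmoid-like function is far from its center, the sketch below defines the three activation functions and evaluates the derivative of the sigmoid at a few inputs. It uses the standard textbook definitions; the helper name `sigmoid_grad` is introduced here for illustration.

```python
import math

def sigmoid(z):
    """Squash z into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Squash z into the range (-1, 1)."""
    return math.tanh(z)

def relu(z):
    """Pass positive values through, clamp negative values to 0."""
    return max(0.0, z)

def sigmoid_grad(z):
    """Derivative of the sigmoid: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# The derivative is largest at the center (z = 0) and shrinks quickly
# as the input moves away from it.
for z in [0.0, 2.0, 10.0]:
    print(z, sigmoid(z), sigmoid_grad(z))
```

At `z = 0` the derivative is 0.25, its maximum; at `z = 10` it is below 0.0001, so a weight update driven by this gradient would be very small.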

#### Perceptron: simplified view 
A perceptron is an algorithm used for supervised learning of binary classifiers. Binary classifiers decide whether an input, usually represented by a series of vectors, belongs to a specific class. In short, a perceptron is a single-layer neural network. (Definition from DeepAI)

<img src="https://github.com/mikodham/CS470/blob/main/Lab1/Apr%2027/imgs/perceptrons.png?raw=1" align="center" width="700"/>
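The following sketch illustrates how a perceptron can be trained as a binary classifier. It uses the classic perceptron learning rule with a step activation, which is one common formulation and not necessarily the exact one in the figure. The logical AND dataset, the learning rate, and the number of epochs are assumptions chosen for illustration.

```python
def predict(x, w, b):
    """Perceptron output: 1 if the weighted sum is positive, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

def train_perceptron(data, epochs=100, lr=1.0):
    """Classic perceptron rule: change w and b only on misclassified samples."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            error = y - predict(x, w, b)   # -1, 0 or +1
            w = [wi + lr * error * xi for wi, xi in zip(w, x)]
            b += lr * error
    return w, b

# Logical AND is linearly separable, so the rule converges on it.
and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(and_data)
print([predict(x, w, b) for x, _ in and_data])
```

Because AND is linearly separable, the learned weights reproduce all four labels. The same rule would never settle on a dataset that is not linearly separable.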


#### Multi-Layer Perceptron 
An MLP consists of at least three layers of nodes: an input layer, a hidden layer, and an output layer. 

<img src="https://github.com/mikodham/CS470/blob/main/Lab1/Apr%2027/imgs/mlp.png?raw=1" align="center" width="700"/>


We stack more layers because 1) hidden layers are non-linear embeddings of the input, and 2) the model can thereby embed the data into a linearly separable space.

<img src="https://github.com/mikodham/CS470/blob/main/Lab1/Apr%2027/imgs/mlp2.png?raw=1" align="center" width="700"/>
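The embedding argument can be illustrated with a tiny MLP whose weights are set by hand rather than learned. In this sketch, a hidden layer with two ReLU units maps the XOR inputs into a space where a single linear rule separates the classes. The weights, the threshold, and the function names are hypothetical choices made for this example.

```python
def relu(z):
    return max(0.0, z)

def hidden(x1, x2):
    """Hidden layer: a non-linear embedding of the 2-D input."""
    h1 = relu(x1 + x2)        # weights (1, 1), bias 0
    h2 = relu(x1 + x2 - 1.0)  # weights (1, 1), bias -1
    return h1, h2

def mlp(x1, x2):
    """Output layer: a linear rule on top of the hidden embedding."""
    h1, h2 = hidden(x1, x2)
    return 1 if h1 - 2.0 * h2 > 0.5 else 0

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
print([hidden(*x) for x in inputs])  # positions in the hidden space
print([mlp(*x) for x in inputs])     # XOR labels
```

In the hidden space the four inputs land on three points, (0, 0), (1, 0) and (2, 1), and the line `h1 - 2*h2 = 0.5` separates the two classes, which no line could do in the original input space.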

## 1-3. Deep Neural Network (DNN)
By stacking more and more layers, neural networks gain the capacity to represent and model complex data.

<img src="https://github.com/mikodham/CS470/blob/main/Lab1/Apr%2027/imgs/dnn.png?raw=1" align="center" width="700"/>


#### Example: recognizing handwritten digits

Example images from the MNIST dataset are shown below.

<img src="https://github.com/mikodham/CS470/blob/main/Lab1/Apr%2027/imgs/handwritten_digits.jpeg?raw=1" align="center" width="600"/>


To recognize handwritten digits using a DNN, we first preprocess the images so that they can be fed to the model, and then train the model.

**Data representation**
A gray-scale image is represented as an array of pixel intensities and flattened into a vector.

<img src="https://github.com/mikodham/CS470/blob/main/Lab1/Apr%2027/imgs/data_representation.png?raw=1" align="center" width="700"/>
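A minimal sketch of this step is shown below. It uses a made-up 3x3 image instead of a real MNIST digit; an MNIST image is 28x28 pixels, so the same flattening would give a vector of length 784. Scaling the intensities to [0, 1] is a common preprocessing choice that is assumed here, not something stated above.

```python
def flatten(image):
    """Turn a 2-D list of pixel rows into a 1-D vector, row by row."""
    return [pixel for row in image for pixel in row]

def normalize(vector, max_value=255):
    """Scale pixel intensities from [0, max_value] to [0, 1]."""
    return [p / max_value for p in vector]

# A made-up 3x3 gray-scale "image" with intensities in 0..255.
image = [
    [0, 128, 255],
    [64, 255, 64],
    [255, 128, 0],
]
x = normalize(flatten(image))
print(len(x))
```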


**Forward propagation (Embedding viewpoint)**
The vectorized image is propagated through the layers and classified as one of the digits 0 to 9. In a trained DNN, each layer represents a non-linear embedding of the input into a more easily separable space. 

<img src="https://github.com/mikodham/CS470/blob/main/Lab1/Apr%2027/imgs/fp_mnist.png?raw=1" align="center" width="700"/>
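The forward pass can be sketched as follows. The layer widths, the ReLU hidden activation, the softmax output, and the random (untrained) weights are all assumptions made for illustration, so the predicted digit is meaningless; the sketch only shows how a 784-dimensional vector flows through the layers to 10 class probabilities.

```python
import math
import random

def dense(x, weights, biases):
    """One fully connected layer: returns W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def relu(v):
    return [max(0.0, z) for z in v]

def softmax(v):
    """Convert scores into probabilities that sum to 1."""
    m = max(v)                            # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in v]
    total = sum(exps)
    return [e / total for e in exps]

def random_layer(n_in, n_out, rng):
    """Untrained layer with small random weights and zero biases."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

rng = random.Random(0)
sizes = [784, 32, 16, 10]                 # hypothetical layer widths
layers = [random_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

x = [rng.random() for _ in range(784)]    # stand-in for a vectorized image
h = x
for weights, biases in layers[:-1]:
    h = relu(dense(h, weights, biases))   # hidden layers: non-linear embeddings
probs = softmax(dense(h, *layers[-1]))    # output layer: one probability per digit
predicted_digit = probs.index(max(probs))
print(predicted_digit, round(sum(probs), 6))
```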


**Recognition viewpoint**
The DNN model recognizes a given handwritten digit by combining the abstracted features represented across multiple layers. In other words, a DNN learns a hierarchy of features that capture different levels of abstraction. 

<img src="https://github.com/mikodham/CS470/blob/main/Lab1/Apr%2027/imgs/recognition_viewpoint.png?raw=1" align="center" width="700"/>

## 1-4. Training neural network 

- **Objective:** find a set of parameters that minimizes the error on the dataset.
- Notations
    - Dataset: $\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}), \dots, (x^{(N)},y^{(N)})\}$ for $N$ training examples
    - Parameters: $\{\text{w}^{(1)},\text{w}^{(2)}, \dots, \text{w}^{(L)}\}$ for $L$ layers
    
   
<img src="https://github.com/mikodham/CS470/blob/main/Lab1/Apr%2027/imgs/dnn_training1.png?raw=1" align="center" width="600"/>
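Written out with this notation, the objective is commonly stated as an average loss over the training set. The per-example loss $\ell$ is a symbol introduced here for illustration, and $\hat{y} = f(x; \text{W})$ denotes the model prediction with $\text{W}$ collecting the parameters of all $L$ layers; the figure above may use a different but equivalent form.

$$
\text{W}^{*} = \arg\min_{\text{W}} \frac{1}{N} \sum_{i=1}^{N} \ell\big(f(x^{(i)}; \text{W}),\, y^{(i)}\big), \qquad \text{W} = \{\text{w}^{(1)}, \text{w}^{(2)}, \dots, \text{w}^{(L)}\}
$$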


- **Loss function**
    - Measures the mismatch between the model prediction and the true label.
    - There are many ways to define the degree of mismatch (i.e. misprediction, error).
    - It is the key to “quantifying the performance” of the model on a specific task and dataset.
    
<img src="https://github.com/mikodham/CS470/blob/main/Lab1/Apr%2027/imgs/loss_func.png?raw=1" align="center" width="600"/>
    
    
- Example of loss function - Mean Squared Error (MSE)
    - Used for regression tasks

<img src="https://github.com/mikodham/CS470/blob/main/Lab1/Apr%2027/imgs/mse.png?raw=1" align="center" width="600"/>
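A minimal sketch of MSE, assuming the usual definition as the average of squared differences (some texts add a factor of 1/2, and the figure may use a different scaling). The sample values are made up.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of the squared differences."""
    n = len(y_true)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n

y_true = [1.0, 2.0, 3.0]
y_pred = [1.5, 2.0, 2.0]
print(mse(y_true, y_pred))  # (0.25 + 0 + 1) / 3
```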
  
  
- Example of loss function - Binary cross entropy 
    - Used for binary classification 
    - The predicted value can be interpreted as a probability by applying the sigmoid function (equivalently, a softmax over the two classes) 

<img src="https://github.com/mikodham/CS470/blob/main/Lab1/Apr%2027/imgs/bce.png?raw=1" align="center" width="600"/>
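A minimal sketch of binary cross entropy, assuming the standard form $-[y \log p + (1-y)\log(1-p)]$ averaged over the examples, where $p$ is the predicted probability of class 1. The clipping constant `eps` and the sample values are illustrative choices.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average of -[y log p + (1 - y) log(1 - p)] over all examples."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

y_true = [1, 0, 1]
confident = [0.9, 0.1, 0.8]   # predictions that agree with the labels
wrong = [0.1, 0.9, 0.2]       # predictions that contradict the labels
print(binary_cross_entropy(y_true, confident))
print(binary_cross_entropy(y_true, wrong))
```

The loss is small when the predicted probabilities agree with the labels and grows quickly when the model is confidently wrong.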


#### Optimizing the loss function (Back-propagation)
- **Challenges** for optimizing the loss function
    - The loss is highly non-convex (and non-concave) in the parameters.
    - In general, no closed-form (analytical) solution exists, so the minimum has to be searched for iteratively.
    
- **Optimization via gradient descent**

<img src="https://github.com/mikodham/CS470/blob/main/Lab1/Apr%2027/imgs/gradient_descent.png?raw=1" align="center" width="600"/>

- Algorithm (gradient descent)

<img src="https://github.com/mikodham/CS470/blob/main/Lab1/Apr%2027/imgs/gd_a1.png?raw=1" align="center" width="600"/>

- Algorithm (stochastic gradient descent)

<img src="https://github.com/mikodham/CS470/blob/main/Lab1/Apr%2027/imgs/gd_a2.png?raw=1" align="center" width="600"/>

- **Algorithm (minibatch stochastic gradient descent)**

<img src="https://github.com/mikodham/CS470/blob/main/Lab1/Apr%2027/imgs/gd_a3.png?raw=1" align="center" width="600"/>
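The three variants differ only in how many examples are used for each gradient step. The sketch below implements minibatch SGD for a one-dimensional linear model with a squared-error loss; setting `batch_size` to the dataset size gives plain gradient descent, and `batch_size=1` gives stochastic gradient descent. The model, the toy data, and all hyperparameter values are assumptions made for illustration and may differ in detail from the algorithms in the figures.

```python
import random

def minibatch_sgd(data, lr=0.1, batch_size=4, epochs=200, seed=0):
    """Fit y ~ w*x + b by minibatch SGD on the mean squared error."""
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)                          # new random order every epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of the batch-averaged squared error w.r.t. w and b.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= lr * grad_w                       # step against the gradient
            b -= lr * grad_b
    return w, b

# Noise-free toy data generated from y = 2x + 1 (made-up values).
data = [(i / 10, 2 * (i / 10) + 1) for i in range(20)]
w, b = minibatch_sgd(data)
print(round(w, 2), round(b, 2))  # should be close to the true values 2 and 1
```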



#### Computing gradients of weights in a neural network 
- Chain rule: propagating the gradient across the layers
    - Simplest example: two-layer neural network with one hidden node
    - $\hat{y}=f(x;\text{W})$
    
<img src="https://github.com/mikodham/CS470/blob/main/Lab1/Apr%2027/imgs/chain_rule.png?raw=1" align="center" width="700"/>
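The chain rule for such a network can be checked numerically. The sketch below assumes a specific instance of the two-layer network with one hidden node: $h = \sigma(w_1 x)$, $\hat{y} = w_2 h$, and loss $\frac{1}{2}(\hat{y} - y)^2$. The sigmoid activation, the squared-error loss, the absence of biases, and the numeric values are assumptions for illustration and may differ from the figure. The analytic gradients are compared against central finite differences.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w1, w2, x, y):
    """Forward pass: x -> h = sigmoid(w1*x) -> y_hat = w2*h -> squared error."""
    h = sigmoid(w1 * x)
    y_hat = w2 * h
    return 0.5 * (y_hat - y) ** 2

def gradients(w1, w2, x, y):
    """Backward pass: apply the chain rule from the loss back to each weight."""
    h = sigmoid(w1 * x)
    y_hat = w2 * h
    dL_dyhat = y_hat - y                 # dL/dy_hat
    dL_dw2 = dL_dyhat * h                # dy_hat/dw2 = h
    dL_dh = dL_dyhat * w2                # dy_hat/dh = w2
    dL_dw1 = dL_dh * h * (1.0 - h) * x   # dh/dw1 = sigmoid'(w1*x) * x
    return dL_dw1, dL_dw2

# Made-up values; compare against central finite differences.
w1, w2, x, y, eps = 0.5, -1.5, 2.0, 1.0, 1e-6
g1, g2 = gradients(w1, w2, x, y)
n1 = (loss(w1 + eps, w2, x, y) - loss(w1 - eps, w2, x, y)) / (2 * eps)
n2 = (loss(w1, w2 + eps, x, y) - loss(w1, w2 - eps, x, y)) / (2 * eps)
print(abs(g1 - n1) < 1e-6, abs(g2 - n2) < 1e-6)
```

The gradient for the first-layer weight reuses `dL_dyhat` and `dL_dh`, which were already computed for the layer above it. This reuse is what propagating the gradient across the layers means in practice.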
