# Deep Learning

## Why did Deep Learning become popular?
 - The appearance of the CNN (Convolutional Neural Network): it has fewer parameters than an MLP because it shares weights.
 - GPUs became cheaper, which shortened training time.
 - Training data has grown significantly because of the Internet.
 - Better activation functions that are simple to compute: the ReLU function greatly alleviated the vanishing gradient problem.
 - More efficient regularization methods for training: weight decay, dropout, etc.
 - Layer-wise pretraining: it made it possible to train deeper networks.
 
## Feature Learning
Before deep learning, feature vectors were extracted by algorithms devised through human trial and error and then fed into a neural network (hand-crafted features). In other words, machine learning itself was limited to classification or regression, and feature extraction was implemented separately and then combined with it. Deep learning, in contrast, can automatically learn hierarchical features across multiple layers: low-level features (edges, corners, etc.) and high-level features (more abstract).

## DMLP(Deep MLP)
<img src="./img/2_DMLP.png" width="60%" height="60%">  


$l^{th}$ layer computation:
$$\mathbf{z}^l = \mathbf{\tau}_l (\mathbf{U}^l \mathbf{z}^{l-1}), \quad 1 \le l \le L$$  
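
The layer computation above can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not a reference implementation: the names `forward`, `weights`, and `activations` are hypothetical, each $\mathbf{U}^l$ is assumed to carry the bias weights in column 0 (matching the index range $0 \le r \le n_{L-1}$ used in the backpropagation formulas), and the layer sizes are arbitrary.

```python
import numpy as np

def relu(s):
    return np.maximum(0.0, s)

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def forward(x, weights, activations):
    """Apply z^l = tau_l(U^l z^{l-1}) for l = 1..L.

    Assumption: U^l has shape (n_l, n_{l-1} + 1) and column 0 multiplies
    the bias node z_0 = 1.
    """
    z = x
    for U, tau in zip(weights, activations):
        z_with_bias = np.concatenate(([1.0], z))  # prepend the bias node
        z = tau(U @ z_with_bias)
    return z

rng = np.random.default_rng(0)
sizes = [4, 5, 3]  # n_0 = 4 inputs, n_1 = 5 hidden nodes, c = 3 outputs
weights = [rng.normal(0.0, 0.1, size=(sizes[l + 1], sizes[l] + 1))
           for l in range(len(sizes) - 1)]
o = forward(rng.normal(size=4), weights, [relu, sigmoid])
print(o.shape)
```

ReLU in the hidden layer and a sigmoid at the output follow the usual choices mentioned in the Backpropagation section.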


### Backpropagation
$L^{th}$ layer(output layer) gradient computation:
$$\delta_k^L = \tau'_L (s_k^L)(y_k - o_k), \quad 1 \le k \le c$$  
$$\frac{\partial J}{\partial u_{kr}^L} = -\delta_k^L z_r^{L-1}, \quad 0 \le r \le n_{L-1}, 1 \le k \le c$$  
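Here $s_k^L$ is the weighted sum entering output node $k$, $y_k$ is the target, and $o_k$ is the network output. Usually $\tau_L$ is the logistic sigmoid, tanh, or softmax function.  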


$l^{th}$ layer (hidden layer, $l = L-1, L-2, \ldots, 1$) gradient computation:
$$\delta_j^l = \tau'_l (s_j^l) \sum_{p=1}^{n_{l+1}} \delta_p^{l+1} u_{pj}^{l+1}, \quad 1 \le j \le n_l$$  
$$\frac{\partial J}{\partial u_{ji}^l} = -\delta_j^l z_i^{l-1}, \quad 0 \le i \le n_{l-1}, 1 \le j \le n_l$$
Usually $\tau_l$ is the ReLU function.  
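
The gradient formulas above can be checked numerically. The sketch below assumes a two-layer network (ReLU hidden layer, sigmoid output) and the squared-error objective $J = \frac{1}{2}\sum_k (y_k - o_k)^2$, which is the objective consistent with the sign convention $\partial J / \partial u = -\delta z$. The function names and layer sizes are hypothetical.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def forward(x, U1, U2):
    z0 = np.concatenate(([1.0], x))                    # input with bias node
    s1 = U1 @ z0
    z1 = np.concatenate(([1.0], np.maximum(0.0, s1)))  # ReLU output with bias node
    s2 = U2 @ z1
    return z0, s1, z1, s2, sigmoid(s2)

def loss(x, y, U1, U2):
    o = forward(x, U1, U2)[-1]
    return 0.5 * np.sum((y - o) ** 2)

def gradients(x, y, U1, U2):
    z0, s1, z1, s2, o = forward(x, U1, U2)
    delta2 = o * (1.0 - o) * (y - o)            # tau'_L(s) (y - o), sigmoid derivative
    grad_U2 = -np.outer(delta2, z1)             # dJ/du^L_kr = -delta_k z_r
    delta1 = (s1 > 0) * (U2[:, 1:].T @ delta2)  # ReLU' times the sum over p; bias column skipped
    grad_U1 = -np.outer(delta1, z0)
    return grad_U1, grad_U2

def numeric_grad(f, U, eps=1e-6):
    """Central-difference gradient of the scalar function f with respect to U."""
    g = np.zeros_like(U)
    for idx in np.ndindex(*U.shape):
        Up, Um = U.copy(), U.copy()
        Up[idx] += eps
        Um[idx] -= eps
        g[idx] = (f(Up) - f(Um)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), np.array([1.0, 0.0])
U1 = rng.normal(size=(4, 4))  # n_1 = 4 hidden nodes, n_0 + 1 = 4 columns
U2 = rng.normal(size=(2, 5))  # c = 2 outputs, n_1 + 1 = 5 columns

grad_U1, grad_U2 = gradients(x, y, U1, U2)
num_U1 = numeric_grad(lambda U: loss(x, y, U, U2), U1)
num_U2 = numeric_grad(lambda U: loss(x, y, U1, U), U2)
print(np.allclose(grad_U1, num_U1, atol=1e-6), np.allclose(grad_U2, num_U2, atol=1e-6))
```

Comparing against finite differences is a common way to catch sign or index mistakes in a hand-written backward pass.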

## Convolution NN
A DMLP has too many parameters because its layers are fully connected, so its computation is expensive. As a result, it learns slowly and is prone to overfitting. A CNN, on the other hand, has a partially connected architecture (each node sees only its receptive field), so it dramatically reduces the complexity of the model while still extracting good features. The input to a DMLP is just a flat vector. A CNN can process tensors with three or more dimensions and inputs of variable size.  

1-dim feature map($s$) computation:
$$s(i) = z \circledast u = \sum_{x=-(h-1)/2}^{(h-1)/2} z(i + x)u(x), \quad \text{where } h \text{ is the kernel size}$$  

2-dim feature map($s$) computation:
$$s(j, i) = z \circledast u = \sum_{y=-(h-1)/2}^{(h-1)/2} \sum_{x=-(h-1)/2}^{(h-1)/2} z(j + y, i + x) u(y, x)$$  
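
A direct, unoptimized NumPy sketch of the two feature-map formulas above. It assumes no padding, so the output is computed only where the kernel fits. Output index 0 therefore corresponds to the first position whose window lies fully inside the data, that is, the formula's $i = (h-1)/2$. The function names are hypothetical.

```python
import numpy as np

def conv1d(z, u):
    """s(i) = sum_x z(i + x) u(x), evaluated only where the kernel fits."""
    h = len(u)
    return np.array([np.dot(z[i:i + h], u) for i in range(len(z) - h + 1)])

def conv2d(z, u):
    """s(j, i) = sum_y sum_x z(j + y, i + x) u(y, x) for a square h*h kernel."""
    h = u.shape[0]
    rows, cols = z.shape[0] - h + 1, z.shape[1] - h + 1
    s = np.empty((rows, cols))
    for j in range(rows):
        for i in range(cols):
            s[j, i] = np.sum(z[j:j + h, i:i + h] * u)
    return s

z1 = np.array([1, 3, 2, 5, 4])
u1 = np.array([1, 0, -1])
print(conv1d(z1, u1))  # [-1 -2 -2]

z2 = np.arange(16).reshape(4, 4)
u2 = np.ones((3, 3))
print(conv2d(z2, u2))  # 2x2 map: 45, 54 / 81, 90
```

Note that the kernel is not flipped, as in the formulas above. In signal-processing terms this is cross-correlation, which is what `np.correlate` computes in one dimension.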

Building Block  
<img src="./img/2_building_block.png" width="60%" height="60%">  

At the edge of the data, the kernel extends beyond the data area, so the convolution cannot be computed there. Thus, if convolution layers are stacked many times, the number of nodes shrinks considerably. Padding solves this problem.  
A bias adds a constant offset to each convolution result.  

We can downsample with a 'stride'. With proper padding at the edges of the data, the input and output have the same size when the stride is 1. In general, with stride=k the kernel is applied at every k-th node, so the output is 1/k the size of the input. For 2-dim data, an $m \times n$ matrix is reduced to ${m \over 2} \times {n \over 2}$ with stride=2.  
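
The effect of padding and stride on the output size can be sketched as below. The size formula $\lfloor (n + 2p - h)/k \rfloor + 1$ (input length $n$, kernel size $h$, padding $p$ per side, stride $k$) is the standard one and is not stated in these notes. The function names and the use of zero padding are assumptions for illustration.

```python
import numpy as np

def output_size(n, h, stride=1, pad=0):
    """Number of output nodes for input length n and kernel size h."""
    return (n + 2 * pad - h) // stride + 1

def conv1d_strided(z, u, stride=1, pad=0):
    zp = np.pad(z, pad, mode="constant")  # zero padding on both ends
    h = len(u)
    return np.array([np.dot(zp[i:i + h], u)
                     for i in range(0, len(zp) - h + 1, stride)])

z = np.arange(8, dtype=float)
u = np.array([1.0, 1.0, 1.0])

same = conv1d_strided(z, u, stride=1, pad=1)  # padding keeps the size at 8
half = conv1d_strided(z, u, stride=2, pad=1)  # stride 2 halves it to 4
print(len(same), len(half))                   # 8 4
print(output_size(8, 3, stride=1, pad=0))     # 6: without padding the map shrinks
```

For 2-dim data the same formula applies to each axis independently.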

In a CNN, you usually apply an activation function, such as ReLU, to the result of a convolution operation, and then apply a pooling operation to the result. There are many kinds of pooling: max pooling, average pooling, weighted average pooling, L2-norm pooling, etc. Pooling removes noise from a feature map that carries overly detailed information and extracts summary statistics from it. Moreover, because the feature map gets smaller, pooling also contributes to speed and memory efficiency. A pooling layer has no learnable parameters and preserves the number of feature maps. Pooling also makes the network insensitive to small shifts.  
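
A minimal sketch of max and average pooling on a single feature map. The 2\*2 window with stride 2 and the function name `pool2d` are assumptions chosen for illustration.

```python
import numpy as np

def pool2d(fmap, size=2, stride=2, op=np.max):
    """Summarize each size*size window of the feature map with `op`."""
    rows = (fmap.shape[0] - size) // stride + 1
    cols = (fmap.shape[1] - size) // stride + 1
    out = np.empty((rows, cols))
    for j in range(rows):
        for i in range(cols):
            window = fmap[j * stride:j * stride + size,
                          i * stride:i * stride + size]
            out[j, i] = op(window)
    return out

fmap = np.array([[1, 2, 0, 1],
                 [3, 4, 1, 0],
                 [0, 1, 9, 8],
                 [2, 0, 7, 6]], dtype=float)

max_pooled = pool2d(fmap)              # [[4, 1], [2, 9]]
avg_pooled = pool2d(fmap, op=np.mean)  # [[2.5, 0.5], [0.75, 7.5]]
print(max_pooled)
print(avg_pooled)
```

The function contains nothing to train, which matches the statement that a pooling layer has no learnable parameters.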

Because convolution is the basic computation of a CNN, a CNN directly inherits the properties of convolution.
1. Translation equivariance: when the signal moves, the movement is reflected in the feature map. $c(t(\mathbf{x})) = t(c(\mathbf{x}))$, where $t$ is the translation operation and $c$ is convolution.
2. Parallel distributed architecture: a CNN has a deep structure, so the range of influence grows with each layer and eventually covers the entire map.  
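
The equivariance property $c(t(\mathbf{x})) = t(c(\mathbf{x}))$ can be checked numerically. The sketch below assumes a circular shift for $t$ and wrap-around padding for $c$, so that the identity holds exactly even at the borders. With zero padding it holds only away from the edges. The function name is hypothetical.

```python
import numpy as np

def circular_conv1d(z, u):
    """1-dim convolution with wrap-around padding (odd kernel size assumed)."""
    h = len(u)
    r = (h - 1) // 2
    zp = np.pad(z, r, mode="wrap")
    return np.array([np.dot(zp[i:i + h], u) for i in range(len(z))])

z = np.array([0.0, 1.0, 4.0, 2.0, 0.0, 0.0, 3.0, 1.0])
u = np.array([1.0, -2.0, 1.0])
shift = 3

shift_then_conv = circular_conv1d(np.roll(z, shift), u)  # c(t(x))
conv_then_shift = np.roll(circular_conv1d(z, u), shift)  # t(c(x))
print(np.allclose(shift_then_conv, conv_then_shift))
```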

### Kernel
The features that are extracted depend on the values in the kernel. So, a CNN uses many kernels to extract a richer set of features.  
<img src="./img/2_multi_feature_map_extraction.png" width="45%" height="45%">  
We should focus on two things in the above picture:
1. The example uses a kernel of size 3, but the actual size is 4 because an element for the bias node is added at the far left.
2. The value of each kernel element is denoted by $u_i^k$. This means the kernel values must be learned by training.  

### Case of CNN
* **AlexNet**  
AlexNet has 5 convolution layers and 3 fully connected layers (8 layers in total) with 290,400-186,624-64,896-64,896-43,264-4,096-4,096-1,000 neurons. The input is a 224\*224 RGB image with 3 channels (a 3\*224\*224 tensor). It uses the ReLU activation function and local response normalization, which adjusts the convolution result of a kernel by taking into account the values of neighboring kernels. To avoid overfitting, it uses regularization methods such as data augmentation (cropping, flipping, PCA) and dropout.  
* **VGGNet**  
VGGNet has 13 convolution layers and 3 fully connected layers (16 layers in total). The core idea of VGGNet is to make the network deeper by using small kernels. Applying a small 3\*3 kernel several times is better than applying a large kernel once. Ignoring intermediate pooling and nonlinear activation functions, applying a 3\*3 kernel three times covers the same receptive field as applying a 7\*7 kernel once. However, applying 3\*3 three times has two advantages. First, if you put a nonlinear activation function between the convolutions, the network performs more nonlinear operations and is therefore more discriminative. Second, assume each convolution takes c feature maps and produces c feature maps. With one large kernel, each of the c kernels has 49c parameters, so $49c^2$ parameters are needed in total, whereas three 3\*3 layers need only $27c^2$ parameters. The number of parameters is therefore reduced to about 55%, which makes the computation roughly twice as fast.  
* **GoogLeNet**  
GoogLeNet uses the inception module, which modifies the idea of NIN (Network In Network, by Lin). NIN performs the forward operation of an MLP. Unlike an ordinary MLP, however, the MLPconv layer in NIN is computed by sliding it over the input like a convolution kernel. NIN also introduces a very important idea, global average pooling. The MLPconv (micro network) generates as many feature maps as there are classes. Global average pooling simply averages the $i^{th}$ feature map and feeds the result to the $i^{th}$ output node. Because this part has no parameters, it cannot overfit. GoogLeNet extends NIN's idea. Unlike NIN, which uses MLPconv as its micro network, GoogLeNet builds its micro network from convolution operations only: it combines the results of 4 kinds of convolution that produce feature maps of the same size but in different numbers.  
* **ResNet**  
The deeper the network, the better its feature representation can be. However, better performance isn't guaranteed. ResNet stacks up to 1,202 layers while avoiding poor performance by using residual learning. $\mathbf{F}(\mathbf{x})$ is called the residual: $\mathbf{F}(\mathbf{x}) = \mathbf{\tau}(\mathbf{x} \circledast \mathbf{w}_1) \circledast \mathbf{w}_2$, where $\mathbf{\tau}$ is the activation function (ReLU) and $\circledast$ is the convolution operation. Residual learning applies the ReLU function to $\mathbf{F}(\mathbf{x}) + \mathbf{x}$, where the added $\mathbf{x}$ comes from the shortcut connection, so $\mathbf{y} = \mathbf{\tau}(\mathbf{F}(\mathbf{x}) + \mathbf{x})$. In deep neural networks, the vanishing gradient problem becomes more serious. With residual learning, however, the shortcut connection gives the gradient a direct path, so it does not shrink to 0 and the vanishing gradient problem is avoided. ResNet uses a global average pooling layer where VGGNet uses fully connected layers. ResNet also applies batch normalization right after every convolution, before the ReLU. If you use batch normalization, you do not need dropout.  
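
Two of the claims in the list above can be checked with a few lines of NumPy: the VGGNet parameter count ($49c^2$ versus $27c^2$) and the ResNet residual computation $\mathbf{y} = \mathbf{\tau}(\mathbf{F}(\mathbf{x}) + \mathbf{x})$. This is only a sketch under simplifying assumptions: the function names are hypothetical, biases are ignored, and the residual block is 1-dimensional, single-channel, zero-padded, and has no batch normalization.

```python
import numpy as np

def stacked_params(kernel, layers, c):
    """Weights in `layers` stacked kernel*kernel convolutions, c maps in and out."""
    return layers * kernel * kernel * c * c

def receptive_field(kernel, layers):
    """Input width seen by one output node after `layers` stride-1 convolutions."""
    return layers * (kernel - 1) + 1

def relu(s):
    return np.maximum(0.0, s)

def conv_same(x, w):
    """Convolution without kernel flip, zero-padded so the length is kept."""
    return np.correlate(x, w, mode="same")

def residual_block(x, w1, w2):
    F = conv_same(relu(conv_same(x, w1)), w2)  # F(x) = tau(x * w1) * w2
    return relu(F + x)                         # y = tau(F(x) + x)

c = 64
print(stacked_params(7, 1, c), stacked_params(3, 3, c))  # 49c^2 vs 27c^2
print(receptive_field(7, 1), receptive_field(3, 3))      # both 7

x = np.array([0.5, 1.0, 2.0, 0.0, 3.0])
zero = np.zeros(3)
y = residual_block(x, zero, zero)  # zero residual: non-negative x passes through
print(y)
```

The last call illustrates why the shortcut helps: even when the convolution weights contribute nothing, the block still passes its (non-negative) input through unchanged.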


## Generative Model
DMLP and CNN are discriminative models. They are trained by supervised learning, which is possible only when the training set provides both feature vectors $\mathbb{X} = \{\mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_n\}$ and labels $\mathbb{Y} = \{\mathbf{y}_1, \mathbf{y}_2, ..., \mathbf{y}_n\}$. The learning algorithm does not need to find the probability distribution of the feature vector $\mathbf{x}$. Just by estimating the conditional probability $\Pr {(y\mid \mathbf{x})}$ correctly, you can solve classification and regression problems with high performance. A generative model, on the other hand, estimates the probability distribution of the vector $\mathbf{x}$ itself. It does not need labels (unsupervised learning).  

### GAN(Generative Adversarial Network)
<img src="./img/2_GAN.png" width="30%" height="30%">  
G is the generator, and D is the discriminator. GAN uses the log likelihood as its objective function instead of MSE (mean squared error).  


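The objective function of D, which D maximizes so that it assigns high probability to real samples and low probability to fake ones:
$$\widehat{\Theta}_D = \underset{\Theta_D}{\text{argmax}}  J_D(\Theta_D)$$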
$$
\begin{alignat}{2}
J_D({\Theta_D}) & = \log (f_D(\mathbf{x}^{real})) + \log (1 - f_D(\mathbf{x}^{fake})) \\
& = \log (f_D(\mathbf{x}^{real})) + \log (1 - f_D(f_G(\mathbf{z})))
\end{alignat}
$$  

The objective function of G:
$$\widehat{\Theta}_G = \underset{\Theta_G}{\text{argmin}}  J_G(\Theta_G)$$
$$
J_G({\Theta_G}) = \log (1 - f_D(f_G(\mathbf{z})))
$$  
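
The two objective functions can be evaluated for given discriminator outputs as in the sketch below. The names `j_d` and `j_g` are hypothetical. Averaging over a batch is an assumption added here (the formulas above are written for a single sample), and the discriminator outputs are made-up numbers rather than results from a trained model.

```python
import numpy as np

def j_d(d_real, d_fake):
    """J_D = log f_D(x_real) + log(1 - f_D(x_fake)), averaged over the batch."""
    return np.mean(np.log(d_real) + np.log(1.0 - d_fake))

def j_g(d_fake):
    """J_G = log(1 - f_D(f_G(z))), averaged over the batch."""
    return np.mean(np.log(1.0 - d_fake))

# Hypothetical discriminator outputs (probability that a sample is real).
sharp_d = j_d(np.array([0.9, 0.9]), np.array([0.1, 0.1]))  # D separates real from fake
blind_d = j_d(np.array([0.5, 0.5]), np.array([0.5, 0.5]))  # D cannot tell them apart
print(sharp_d, blind_d)

caught = j_g(np.array([0.1, 0.1]))  # D recognizes the fakes
fooled = j_g(np.array([0.9, 0.9]))  # D believes the fakes are real
print(caught, fooled)
```

$J_D$ is larger when D separates real from fake samples well, while $J_G$ is smaller when G fools D. The two networks therefore pull the shared term $\log (1 - f_D(f_G(\mathbf{z})))$ in opposite directions.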