# Deep Learning

### Shallow versus Deep Neural Networks

- Shallow NNet: few (hidden) layers
    - more efficient to train
    - simpler structural decisions
    - theoretically powerful enough
- Deep NNet: many layers
    - challenging to train
    - sophisticated structural decisions
    - 'arbitrarily' powerful
    - more 'meaningful'

### Meaningfulness of Deep Learning

- Each layer is responsible for a smaller share of the processing.
- Can handle difficult learning problems on raw features, such as images or audio.

### Challenges and Key Techniques

- Difficult structural decisions:
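    - How should the connections between neurons be structured?
    - subjective, guided by domain knowledge: like **Convolutional NNet** for images
    - Only nearby image pixels are connected to the same neuron in the next layer; distant pixels are not connected, because only nearby pixels are meaningful together.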
- High model complexity:
    - many transforms and many weights $ w $ to handle
    - no big worries if *data is big enough*
    - regularization towards noise tolerance: like 'dropout' (tolerant when the network is corrupted) and 'denoising' (tolerant when the input is corrupted)
- Hard optimization problem:
    - The optimization is hard: the error is not convex, so training may end up in a local minimum.
    - careful initialization to avoid bad local minima: 'pre-training'
- Huge computational complexity:
    - novel hardware/architecture: like mini-batch with GPU
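
To make the "network corrupted" idea behind dropout concrete, here is a minimal NumPy sketch. The function name `dropout`, the parameter `keep_prob`, and the rescaling of the surviving units (often called inverted dropout) are assumptions for illustration; the notes above do not specify a particular variant.

```python
import numpy as np

def dropout(H, keep_prob, rng):
    """Randomly zero hidden activations and rescale the survivors.

    H         : array of hidden-layer activations
    keep_prob : probability that a unit is kept (assumed parameter)
    rng       : a numpy random Generator
    """
    mask = rng.random(H.shape) < keep_prob   # True where the unit survives
    # Dividing by keep_prob keeps the expected activation unchanged.
    return H * mask / keep_prob

rng = np.random.default_rng(0)
H = np.ones((1000, 50))                      # toy activations, all equal to 1
H_drop = dropout(H, keep_prob=0.8, rng=rng)
print(H_drop.mean())                         # close to 1 on average
```

Each training pass sees a different randomly thinned network, which is what pushes the weights toward tolerating a corrupted network.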

### Two-Step Deep Learning Framework

#### STEP ONE

for $ \ell = 1, \cdots, L $, pre-train $ \big\{ w_{ij}^{(\ell)} \big\} $ assuming $ w_{*}^{(1)}, \cdots, w_{*}^{(\ell - 1)} $ fixed.

#### STEP TWO

train with **backprop** on the pre-trained NNet to fine-tune all $ \big\{ w_{ij}^{(\ell)} \big\} $
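
The control flow of step one can be sketched as below. This is only a skeleton under stated assumptions: `pretrain_layerwise` and the callback `pretrain_layer` are hypothetical names, bias terms are omitted, the hidden transform is assumed to be tanh, and `random_init` is a stand-in that does no real pre-training. Step two (backprop fine-tuning of all weights) is not shown.

```python
import numpy as np

def pretrain_layerwise(X, layer_sizes, pretrain_layer):
    """Step one: pre-train layers 1..L in order, earlier layers held fixed.

    X              : (N, d) input data
    layer_sizes    : hidden-layer sizes [d_1, ..., d_L]
    pretrain_layer : hypothetical callback (Z, d_out) -> weights of shape
                     (Z.shape[1], d_out), e.g. an autoencoder trainer
    """
    weights = []
    Z = X                              # output of the already-fixed layers
    for d_out in layer_sizes:
        W = pretrain_layer(Z, d_out)   # only the current layer is trained
        weights.append(W)
        Z = np.tanh(Z @ W)             # feed data through the now-fixed layer
    return weights

rng = np.random.default_rng(0)

def random_init(Z, d_out):
    # Placeholder only: small random weights, NOT real pre-training.
    return rng.normal(scale=0.1, size=(Z.shape[1], d_out))

X = np.random.default_rng(1).normal(size=(20, 8))
weights = pretrain_layerwise(X, [5, 3], random_init)
print([W.shape for W in weights])      # [(8, 5), (5, 3)]
```

Passing the per-layer trainer as a callback keeps the loop independent of which pre-training technique is used for each layer.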

### Information-Preserving Encoding

- weights: feature transform, i.e. encoding.
- good weights: information-preserving encoding; the next layer holds the same information in a different representation.
- information-preserving: the original can be decoded accurately after encoding

### Information-Preserving Neural Network

<img src="imgs/c213-info-preserve-nnet.jpg" style="width:500px" />

- Autoencoder: $ d \to \tilde{d} \to d $ NNet with goal $ g_i(x) \approx x_i $ : learn to approximate **identity function**
- encoding weights: $ w_{ij}^{(1)} $
- decoding weights: $ w_{ji}^{(2)} $

### Basic Autoencoder: Obtaining Initial Weights

$ d \to \tilde{d} \to d $ NNet with error function $ \sum_{i=1}^d \big( g_i(x) - x_i \big)^2 $

- backprop easily applies; shallow and easy to train
- usually $ \tilde{d} \lt d $: compressed representation
- data: $ \big\{ (x_1,y_1 = x_1), \ \ (x_2,y_2 = x_2), \ \cdots,(x_N,y_N = x_N)  \big\} $
    - often categorized as an unsupervised learning technique
- sometimes constrain $ w_{ij}^{(1)} = w_{ji}^{(2)} $ as regularization.
    - makes the gradient calculation more involved

In basic deep learning, the encoding weights $ \big\{  w_{ij}^{(1)} \big\} $ of a basic autoencoder are taken as the **shallowly pre-trained weights**.
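
A minimal NumPy sketch of such a $ d \to \tilde{d} \to d $ autoencoder follows. Several choices here are assumptions rather than part of the notes: tanh hidden units, linear outputs, no bias terms, plain batch gradient descent, untied weights, and a small synthetic data set. The names `train_autoencoder`, `lr`, and `epochs` are hypothetical.

```python
import numpy as np

def train_autoencoder(X, d_tilde, epochs=1000, lr=0.05, seed=0):
    """Train a d -> d_tilde -> d autoencoder with targets y_n = x_n.

    Returns (W1, W2, errors): encoding weights of shape (d, d_tilde),
    decoding weights of shape (d_tilde, d), and the average squared
    reconstruction error recorded at the start of each epoch.
    """
    rng = np.random.default_rng(seed)
    N, d = X.shape
    W1 = rng.normal(scale=0.1, size=(d, d_tilde))
    W2 = rng.normal(scale=0.1, size=(d_tilde, d))
    errors = []
    for _ in range(epochs):
        H = np.tanh(X @ W1)                  # encode
        G = H @ W2                           # decode (linear output)
        R = G - X                            # residuals g_i(x) - x_i
        errors.append(np.mean(np.sum(R ** 2, axis=1)))
        # Backprop of sum_i (g_i(x) - x_i)^2, averaged over the N examples.
        grad_W2 = 2 * H.T @ R / N
        delta_H = (2 * R @ W2.T) * (1 - H ** 2)   # tanh'(s) = 1 - tanh(s)^2
        grad_W1 = X.T @ delta_H / N
        W1 -= lr * grad_W1
        W2 -= lr * grad_W2
    return W1, W2, errors

# Synthetic 6-dimensional data that really lives on 2 latent dimensions.
rng = np.random.default_rng(42)
latent = rng.uniform(-1.0, 1.0, size=(200, 2))
X = latent @ (0.5 * rng.normal(size=(2, 6)))

W1, W2, errors = train_autoencoder(X, d_tilde=2)
print(f"error before: {errors[0]:.4f}, after: {errors[-1]:.4f}")
```

After training, `W1` plays the role of the pre-trained encoding weights for that layer, while `W2` is only needed to measure how much information the encoding preserves.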

Many successful pre-training techniques use 'fancier' autoencoders with different architectures and regularization schemes.

### Regularization in Deep Learning

High model complexity means regularization is needed:

- structural decisions / constraints.
- weight decay or weight elimination regularizers
- early stopping
- add artificial noise

### Denoising Autoencoder

Run a basic autoencoder on the data:

$ \big\{ (\tilde{x}_1,y_1 = x_1), \ \ (\tilde{x}_2,y_2 = x_2), \ \cdots,(\tilde{x}_N,y_N = x_N)  \big\} $

where $ \tilde{x}_n = x_n + $ artificial noise

- often used instead of the basic autoencoder in deep learning.
- useful for data/image processing: $ g(\tilde{x}) $ is a denoised version of $ \tilde{x} $
- effect: 'constrains/regularizes' $ g $ toward noise-tolerant denoising.
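
Building the denoising training set can be sketched as follows. The notes only say "artificial noise", so the Gaussian noise, the `noise_std` parameter, and the function name `make_denoising_pairs` are assumptions for illustration.

```python
import numpy as np

def make_denoising_pairs(X, noise_std=0.1, seed=0):
    """Return (X_noisy, X): corrupted inputs paired with clean targets."""
    rng = np.random.default_rng(seed)
    # Assumed noise model: independent Gaussian noise on every feature.
    X_noisy = X + rng.normal(scale=noise_std, size=X.shape)
    return X_noisy, X

X = np.linspace(-1.0, 1.0, 12).reshape(4, 3)   # toy clean data
X_tilde, Y = make_denoising_pairs(X)
print(X_tilde.shape, Y.shape)                   # (4, 3) (4, 3)
```

The pairs can then be given to any autoencoder trainer that accepts separate inputs and targets: `X_tilde` as the input and `Y` (the clean data) as the target.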