# Deep Learning

### Shallow versus Deep Neural Networks

- Shallow NNet: few (hidden) layers
    - more efficient to train
    - simpler structural decisions
    - theoretically powerful enough
- Deep NNet: many layers
    - challenging to train
    - sophisticated structural decisions
    - 'arbitrarily' powerful
    - more 'meaningful'

### Meaningfulness of Deep Learning

- Each layer carries less of the processing burden.
- Well suited to difficult learning problems on raw features, such as images or audio.

### Challenges and Key Techniques

- Difficult structural decisions:
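    - How should the connections between neurons be structured?
    - subjective with domain knowledge: like **Convolutional NNet** for images
    - Only nearby image pixels are linked to the same next-layer neuron, and distant pixels are not, because only nearby pixels are meaningful together.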
- High model complexity:
    - many transformations and weights $ w $ to handle
    - no big worries if *data is big enough*
    - regularization towards noise tolerance: like 'dropout' (when network corrupted), 'denoising' (when input corrupted)
- Hard optimization problem:
    - a hard optimization problem: not convex, so training may get stuck in a local minimum.
    - careful initialization to avoid bad local minima: 'pre-training'
- Huge computational complexity:
    - novel hardware/architecture: like mini-batch with GPU

### Two-Step Deep Learning Framework

#### STEP ONE

for $ \ell = 1, \cdots, L, \text{ pre-train } \big\{ w_{ij}^{(\ell)} \big\} \text{ assuming } w_{*}^{(1)}, \cdots, w_{*}^{(\ell - 1)} \text{ fixed. } $

#### STEP TWO

train with **backprop** on the pre-trained NNet to fine-tune all $ \big\{ w_{ij}^{(\ell)} \big\} $
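The two steps can be sketched in NumPy as follows. This is a minimal sketch under assumptions that are not in the notes: each layer is pre-trained as a small autoencoder with a tanh hidden layer and a linear output, bias terms are omitted, training is plain full-batch gradient descent, and the helper names `pretrain_layer` and `pretrain_stack` are hypothetical. Step two is only indicated by a comment.

```python
import numpy as np

def pretrain_layer(X, d_tilde, steps=200, lr=0.1, seed=0):
    """Train a d -> d_tilde -> d autoencoder on X; return its encoding weights."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    W1 = rng.normal(scale=0.1, size=(d, d_tilde))  # encoding weights
    W2 = rng.normal(scale=0.1, size=(d_tilde, d))  # decoding weights
    for _ in range(steps):
        H = np.tanh(X @ W1)      # hidden representation
        G = H @ W2               # reconstruction g(x)
        err = G - X              # gradient of 0.5 * squared error w.r.t. G
        grad_W2 = H.T @ err / N
        grad_W1 = X.T @ ((err @ W2.T) * (1.0 - H ** 2)) / N
        W1 -= lr * grad_W1
        W2 -= lr * grad_W2
    return W1

def pretrain_stack(X, layer_sizes):
    """STEP ONE: pre-train layer by layer, keeping earlier layers fixed."""
    weights = []
    inputs = X
    for d_tilde in layer_sizes:
        W = pretrain_layer(inputs, d_tilde)
        weights.append(W)
        inputs = np.tanh(inputs @ W)  # output of the fixed layers feeds the next one
    return weights

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 8))          # hypothetical data: N = 50, d = 8
weights = pretrain_stack(X, [5, 3])
# STEP TWO would run backprop on the full network, initialized with `weights`.
```

Each call to `pretrain_layer` only ever sees the outputs of the layers already trained, which is what "assuming the earlier weights fixed" means in step one.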

### Information-Preserving Encoding

- weights: feature transform, i.e. encoding.
- good weights: information-preserving encoding; next layer same info, with different representation.
- information-preserving: decode accurately after encoding

### Information-Preserving Neural Network

<img src="imgs/c213-info-preserve-nnet.jpg" style="width:500px" />

- Autoencoder: $ d \to \tilde{d} \to d $ NNet with goal $ g_i(x) \approx x_i $ : learn to approximate **identity function**
- encoding weights: $ w_{ij}^{(1)} $
- decoding weights: $ w_{ji}^{(2)} $

### Basic Autoencoder: Obtaining Initial Weights

$ d \to \tilde{d} \to d $ NNet with error function $ \sum_{i=1}^d \big( g_i(x) - x_i \big)^2 $

- backprop easily applies; shallow and easy to train
- usually $ \tilde{d} \lt d $: compressed representation
- data: $ \big\{ (x_1,y_1 = x_1), \ \ (x_2,y_2 = x_2), \ \cdots,(x_N,y_N = x_N)  \big\} $
    - often categorized as unsupervised learning technique
- sometimes constrain $ w_{ij}^{(1)} = w_{ji}^{(2)} $ as regularization.
    - makes the gradient calculation more involved
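The extra work caused by the tied-weight constraint can be seen in a small sketch. It assumes a tanh hidden layer, a linear output, no bias terms, and the error averaged over the $ N $ examples; the function names are hypothetical. Because the same matrix `W` appears in both the encoder and the decoder, its gradient is the sum of two contributions.

```python
import numpy as np

def tied_autoencoder_error(W, X):
    """Average squared reconstruction error with decoder weights tied to W^T."""
    H = np.tanh(X @ W)           # encode: d -> d_tilde
    G = H @ W.T                  # decode with the same weights, transposed
    return np.sum((G - X) ** 2) / X.shape[0]

def tied_autoencoder_grad(W, X):
    """Gradient of the error w.r.t. W: encoder path plus decoder path."""
    N = X.shape[0]
    H = np.tanh(X @ W)
    G = H @ W.T
    D = 2.0 * (G - X) / N                               # d(error)/dG
    grad_decoder = D.T @ H                              # W used as decoding weights
    grad_encoder = X.T @ ((D @ W) * (1.0 - H ** 2))     # W used as encoding weights
    return grad_encoder + grad_decoder

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))                 # hypothetical data: N = 100, d = 6
W = rng.normal(scale=0.1, size=(6, 3))        # d_tilde = 3
errors = []
for _ in range(300):
    errors.append(tied_autoencoder_error(W, X))
    W -= 0.05 * tied_autoencoder_grad(W, X)   # plain gradient descent
```

Without the constraint, each weight matrix would receive only one of the two gradient terms.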

In basic deep learning, the basic autoencoder's encoding weights $ \big\{  w_{ij}^{(1)} \big\} $ are taken as the **shallowly pre-trained weights**.

Many successful pre-training techniques use 'fancier' autoencoders with different architectures & regularization schemes.

### Regularization in Deep Learning

High model complexity means regularization is needed:

- structural decisions / constraints.
- weight decay or weight elimination regularizers
- early stopping
- add artificial noise

### Denoising Autoencoder

run basic autoencoder with data:

$ \big\{ (\tilde{x}_1,y_1 = x_1), \ \ (\tilde{x}_2,y_2 = x_2), \ \cdots,(\tilde{x}_N,y_N = x_N)  \big\} $

where $ \tilde{x}_n = x_n + $ artificial noise

- often used instead of basic autoencoder in deep learning.
- useful for data/image processing: $ g(\tilde{x}) $ a denoising version of $ \tilde{x} $
- effect: 'constrain/regularize' g toward noise-tolerant denoising.
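Building the denoising training set only changes the inputs, not the targets. A minimal sketch, assuming Gaussian noise (the notes do not specify the noise type) and a hypothetical helper name `make_denoising_pairs`:

```python
import numpy as np

def make_denoising_pairs(X, noise_std=0.1, seed=0):
    """Return (noisy inputs, clean targets) for a denoising autoencoder."""
    rng = np.random.default_rng(seed)
    X_noisy = X + rng.normal(scale=noise_std, size=X.shape)  # x_tilde = x + noise
    return X_noisy, X                                        # targets stay y_n = x_n

X = np.arange(12, dtype=float).reshape(4, 3)   # hypothetical clean data
X_tilde, Y = make_denoising_pairs(X)
```

The pairs `(X_tilde, Y)` can then be fed to any basic autoencoder training routine in place of `(X, X)`.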

Artificial noise / hint as regularization: practically also useful for other NNets / models.

## Principal Component Analysis

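The autoencoder discussed so far is a nonlinear encoder. Its hypothesis for the k-th component is:

$$
h_k(x) = \sum_{j=0}^{\tilde{d}} w_{jk}^{(2)} \ \tanh \Big( \sum_{i=0}^d w_{ij}^{(1)} x_i \Big)
$$

Switching to a linear encoder simply removes the $ \tanh(\cdot) $, giving a linear hypothesis for the k-th component: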

$$
h_k(x) = \sum_{j=0}^{\tilde{d}} w_{jk}^{(2)} \ \Big( \sum_{i=0}^d w_{ij}^{(1)} x_i \Big)
$$

Consider a few special conditions, after which the expression becomes:

- exclude $ x_0 $: range of i same as range of k
- constrain $ w_{ij}^{(1)} = w_{ji}^{(2)} = w_{ij} $ : regularization
    - denote $ W = [w_{ij}] \text{ of size } d \times \tilde{d} $
- assume $ \tilde{d} \lt d $ : ensure non-trivial solution


$$
\begin{aligned}
h_k(x) &= \sum_{j=1}^{\tilde{d}} w_{kj} \ \Big( \sum_{i=1}^d w_{ij} x_i \Big) \\
h(x) &= W \ W^T \vec{x}
\end{aligned}
$$

Finding the best hypothesis therefore amounts to optimizing over the matrix $ W $.
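As a quick check that the component-wise sums and the matrix form agree, here is a small NumPy sketch with hypothetical sizes $ d = 5 $, $ \tilde{d} = 2 $ and random values for $ W $ and $ x $:

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_tilde = 5, 2
W = rng.normal(size=(d, d_tilde))   # W = [w_ij], size d x d_tilde
x = rng.normal(size=d)

# Matrix form: h(x) = W W^T x
h_matrix = W @ W.T @ x

# Component form: h_k(x) = sum_j w_kj * (sum_i w_ij x_i)
h_sums = np.zeros(d)
for k in range(d):
    for j in range(d_tilde):
        h_sums[k] += W[k, j] * sum(W[i, j] * x[i] for i in range(d))
```

The inner sum is the encoding $ W^T x $ (a $ \tilde{d} $-dimensional code), and the outer sum is the decoding by $ W $.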

### Linear Autoencoder Error Function

The error to minimize expresses that, after "encoding then decoding", the values should change as little as possible.

$$
E_{in}(h) = E_{in}(W) = \frac{1}{N} \sum_{n=1}^N \Big\Vert x_n - W W^T x_n \Big\Vert^2 \ \ , \ \  W: d \times \tilde{d}
$$
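The error function translates directly into code. A minimal sketch, assuming the examples $ x_n^T $ are stored as the rows of an array `X`; the function name and the tiny data set are hypothetical:

```python
import numpy as np

def linear_autoencoder_error(W, X):
    """E_in(W) = (1/N) * sum_n ||x_n - W W^T x_n||^2, with x_n^T as rows of X."""
    residual = X - X @ W @ W.T
    return float(np.mean(np.sum(residual ** 2, axis=1)))

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
W_keep_two = np.eye(3)[:, :2]   # d = 3, d_tilde = 2: keep the first two coordinates
W_keep_all = np.eye(3)          # sanity check only: nothing is discarded

error_two = linear_autoencoder_error(W_keep_two, X)   # (3^2 + 6^2) / 2 = 22.5
error_all = linear_autoencoder_error(W_keep_all, X)   # 0.0
```

With `W_keep_two`, only the third coordinate is lost, so the error is the average squared third coordinate.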

Eigen-decompose $ W \ W^T = V \ \Gamma \ V^T $  
$ \Gamma $ : the diagonal matrix of eigenvalues produced by the eigen-decomposition

$ V $ is a d &times; d orthogonal matrix: $ V V^T = V^T V = I_d $  

$ \Gamma $ is a d &times; d diagonal matrix with $ \le \tilde{d} $ non-zero entries,  
because $ \Gamma $ comes from the eigen-decomposition of $ W \ W^T $, and $ W $ has size $ d \times \tilde{d} $, so the rank is at most $ \tilde{d} $
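These properties can be checked numerically. The sketch below uses `np.linalg.eigh`, which handles symmetric matrices and returns eigenvalues in ascending order; the sizes and the random $ W $ are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_tilde = 5, 2
W = rng.normal(size=(d, d_tilde))

# Eigen-decompose the symmetric matrix W W^T = V Gamma V^T.
gamma, V = np.linalg.eigh(W @ W.T)   # eigenvalues, eigenvectors as columns of V
Gamma = np.diag(gamma)

# Count the eigenvalues that are non-zero up to rounding error.
num_nonzero = int(np.sum(np.abs(gamma) > 1e-10))
```

Here `num_nonzero` cannot exceed `d_tilde`, because $ W W^T $ has rank at most $ \tilde{d} $.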

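> Eigen-decomposition of a special matrix: the symmetric matrix  
> Any N×N real symmetric matrix has N linearly independent eigenvectors, and these eigenvectors can be orthonormalized into a set of mutually orthogonal vectors of norm 1.  
> So a real symmetric matrix A can be decomposed as $ A = Q \ \Lambda Q^T $  
> where Q is an orthogonal matrix and Λ is a real diagonal matrix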

From the relationship above, the following geometric interpretation can be derived:

$ W W^T x_n = V \Gamma V^T x_n $

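$ V^T x_n $: multiplies by an orthonormal basis, which is a change of coordinates such as a rotation or a reflection  
$ \Gamma $: a diagonal matrix with at most $ \tilde{d} $ non-zero entries; since $ d \gt \tilde{d} $, multiplying by $ \Gamma $ zeroes out some of the coordinate axes and then scales the rest.  
$ V (.) $ : undoes the rotation or reflection, returning to the original coordinates.

$ x_n = V \ I \ V^T x_n $: changing the coordinates of $ x_n $ and then changing back leaves it unchanged.

So the minimization of $ E_{in} $ above becomes an optimization over $ V $ and $ \Gamma $; once these two are known, the optimal $ W $ is known as well.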

### The Optimal $ \Gamma $

$$
\min_{V} \min_{\Gamma} \frac{1}{N} \sum_{n=1}^N \Big\Vert V I V^T x_n - V \Gamma V^T x_n \Big\Vert^2
$$

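Suppose $ V $ is fixed first, and minimize over $ \Gamma $.  
A coordinate transform $ V $, such as a rotation or reflection, does not change length, so the leading $ V $ can be dropped: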

$$
\min_{\Gamma} \sum_{n=1}^N \Big\Vert \big(I - \Gamma \big) V^T x_n \Big\Vert^2
$$

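$ \Gamma $ is a diagonal matrix with at most $\tilde{d}$ non-zero entries, so to make $ ( I - \Gamma ) $ as small as possible we want its diagonal to contain as many zeros as possible,  
that is, as many entries of $ \Gamma $ equal to 1 as possible (at most $ \tilde{d} $ ones). The optimal $ \Gamma $ is therefore: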

$$
\text{optimal } \Gamma =
\begin{bmatrix}
I_{\tilde{d}} & 0 \\
0             & 0
\end{bmatrix}
$$
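To see why ones are the best choice, the objective can be expanded coordinate by coordinate. Write $ z_n = V^T x_n $ and let $ \gamma_i $ be the diagonal entries of $ \Gamma $ (both names are introduced here only for illustration):

$$
\sum_{n=1}^N \Big\Vert \big(I - \Gamma \big) z_n \Big\Vert^2 = \sum_{i=1}^d (1 - \gamma_i)^2 \sum_{n=1}^N z_{n,i}^2
$$

Every term is non-negative, and the $ i $-th term can be driven to zero by choosing $ \gamma_i = 1 $. Since at most $ \tilde{d} $ of the $ \gamma_i $ can be non-zero, the best option is to set $ \tilde{d} $ of them to 1; each remaining coordinate, with $ \gamma_i = 0 $, contributes $ \sum_n z_{n,i}^2 $. Placing the ones in the first $ \tilde{d} $ positions loses nothing, because the choice of which directions come first is left to $ V $.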

Next, plug in the optimal $ \Gamma $ and minimize over $ V $:

$$
\min_V \sum_{n=1}^N \Big\Vert
\begin{bmatrix}
0 & 0 \\
0 & I_{d - \tilde{d}}
\end{bmatrix}
\ V^T \ x_n
\Big\Vert^2
$$

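The minimization above can be read as:  
which dimensions of the vector $ V^T x_n $ should be [discarded] so that the discarded length is [minimized]. Because $ V^T $ preserves the total length, this is equivalent to:  
which dimensions of the vector $ V^T x_n $ should be [kept] so that the kept length is [maximized].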

$$
\max_V \sum_{n=1}^N \Big\Vert
\begin{bmatrix}
I_{\tilde{d}} & 0 \\
0 & 0
\end{bmatrix}
\ V^T \ x_n
\Big\Vert^2
$$

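Start with an example: when $ \tilde{d} = 1 $, only the first row of $ V^T $, written $ v^T $, matters in the computation of $ V^T x_n $.  
The maximization becomes (with $ v $ a unit vector, since $ V $ is orthogonal):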

$$
\max_v \sum_{n=1}^N v^T x_n x_n^T v, \text{ subject to } v^T v = 1
$$

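By the method of Lagrange multipliers, at the optimum the gradient of the objective is parallel (proportional) to the gradient of the constraint, with ratio $ \lambda $.  
The optimal $ v $ satisfies $  \sum_{n=1}^N x_n x_n^T v = \lambda v $

So the optimal $ v $ is the topmost eigenvector of $ X^T X $ (the one with the largest eigenvalue, since the objective value at such a $ v $ equals $ \lambda $), where $ X $ is the $ N \times d $ matrix whose rows are the $ x_n^T $, so that $ X^T X = \sum_{n=1}^N x_n x_n^T $.

In general, for any $ \tilde{d} $, the optimal $ \big\{ v_j \big\}_{j=1}^{\tilde{d}} $ are the $ \tilde{d} $ 'topmost' eigenvectors of $ X^T X $.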
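The $ \tilde{d} = 1 $ case can be checked numerically by comparing the top eigenvector with many random unit vectors. This is only an illustrative sketch: the data, the number of random directions, and the helper name `projected_energy` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: rows are x_n^T, with different spread along each axis.
X = rng.normal(size=(100, 4)) * np.array([3.0, 2.0, 1.0, 0.5])

def projected_energy(v, X):
    """sum_n v^T x_n x_n^T v, which equals ||X v||^2."""
    return float(np.sum((X @ v) ** 2))

# eigh returns eigenvalues in ascending order, so the last one is the largest.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
lam, v_top = eigvals[-1], eigvecs[:, -1]

# Compare against random unit vectors: none should beat the top eigenvector.
random_dirs = rng.normal(size=(1000, 4))
random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)
best_random = max(projected_energy(v, X) for v in random_dirs)
```

The objective value at `v_top` equals the largest eigenvalue `lam`, which is why the 'topmost' eigenvector is the one to pick.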

### Linear autoencoder: projecting to orthogonal patterns w

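Since the optimal $ \Gamma $ keeps only the first $ \tilde{d} $ coordinates, the optimal $ \{ w_j \} $ are the top eigenvectors $ \{ v_j \}_{j=1}^{\tilde{d}} $ of $ X^T X $: the linear autoencoder projects onto the orthogonal patterns $ \{ w_j \} $ that match $ \{ x_n \} $ most.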

### Principal Component Analysis

Linear Autoencoder or PCA

STEP 1 - let $ \overline{x} = avg(x_n) $, and let $ x_n \leftarrow x_n - \overline{x} $ (subtract the mean)

STEP 2 - calculate $ \tilde{d} $ top eigenvectors $ w_1, w_2, \cdots, w_{\tilde{d}} $ of $ X^T X $

STEP 3 - return feature transform $ \Phi(x) = W^T (x - \overline{x}) $, where $ W = [w_1, \cdots, w_{\tilde{d}}] $ is of size $ d \times \tilde{d} $

- Linear autoencoder: $ \max \sum \big( \text{ magnitude after projection} \big)^2 $
- PCA from stat.: $ \max \sum \big( \text{ variance after projection} \big) $
- Both useful for linear dimension reduction.
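The three steps can be sketched in NumPy as follows. This is a minimal sketch, not a reference implementation: the function names `pca_fit` and `pca_transform` and the synthetic data are hypothetical, and $ W $ is stored as a $ d \times \tilde{d} $ matrix whose columns are the top eigenvectors.

```python
import numpy as np

def pca_fit(X, d_tilde):
    """Return the data mean and a d x d_tilde matrix W of top eigenvectors."""
    x_bar = X.mean(axis=0)                          # STEP 1: average of the x_n
    Xc = X - x_bar                                  # STEP 1: subtract the mean
    eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc)    # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:d_tilde]       # STEP 2: largest eigenvalues first
    W = eigvecs[:, top]
    return x_bar, W

def pca_transform(X, x_bar, W):
    """STEP 3: Phi(x) = W^T (x - x_bar), applied to every row of X."""
    return (X - x_bar) @ W

rng = np.random.default_rng(0)
# Hypothetical data: 200 points in 3-D that vary mostly along the first axis.
X = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.2]) + np.array([10.0, -3.0, 2.0])

x_bar, W = pca_fit(X, d_tilde=2)
Z = pca_transform(X, x_bar, W)      # reduced 2-D features
X_rec = Z @ W.T + x_bar             # decode back to 3-D
recon_error = float(np.mean(np.sum((X - X_rec) ** 2, axis=1)))
```

Because the discarded direction is the one with the least spread, `recon_error` stays small even though one of the three dimensions is dropped.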