# Blending and Bagging

When different algorithms and models give us several hypotheses $ g_1, g_2, \cdots, g_T $, how do we combine them into one final $ G $?  
There are a few options:

ONE - select: pick the hypothesis whose validation error $ E_{val}(g_t^{-}) $ is smallest  
$ G(x) = g_{t_*}(x), \ \ t_{*} = argmin_{t \in \{ 1,2,\cdots,T \}} E_{val}(g_t^{-}) $

TWO - mix uniformly: treat every g equally and combine them  
$ G(x) = sign \big( \sum_{t=1}^T 1 \cdot g_t(x) \big) $

THREE - mix non-uniformly: give each g a weight, then combine them  
$ G(x) = sign \big( \sum_{t=1}^T \alpha_t \cdot g_t(x) \big), \ \ \alpha_t \ge 0 $

FOUR - combine conditionally: a user-defined condition $ q_t(x) $ controls how much each g is used at a given input  
$ G(x) = sign \big( \sum_{t=1}^T q_t(x) \cdot g_t(x) \big), \ \ q_t(x) \ge 0 $
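
To make the four forms concrete, here is a minimal Python sketch. The three threshold hypotheses, the validation errors, the weights and the condition functions are all made-up values for illustration, and ties in `sign` are broken toward -1 by assumption.

```python
def sign(v):
    # Assumption: a tie (v == 0) is broken toward -1.
    return 1 if v > 0 else -1

# Hypothetical base hypotheses g_1, g_2, g_3: thresholds on a number x.
hypotheses = [
    lambda x: sign(x - 1.0),
    lambda x: sign(x - 2.0),
    lambda x: sign(x - 3.0),
]

def select(gs, val_errors):
    """ONE: return the single g_t with the smallest validation error."""
    best = min(range(len(gs)), key=lambda t: val_errors[t])
    return gs[best]

def mix_uniform(gs):
    """TWO: one vote per g_t."""
    return lambda x: sign(sum(g(x) for g in gs))

def mix_weighted(gs, alphas):
    """THREE: alpha_t votes for g_t, with alpha_t >= 0."""
    return lambda x: sign(sum(a * g(x) for a, g in zip(alphas, gs)))

def mix_conditional(gs, qs):
    """FOUR: q_t(x) >= 0 votes for g_t, depending on the input x."""
    return lambda x: sign(sum(q(x) * g(x) for q, g in zip(qs, gs)))

G_sel = select(hypotheses, val_errors=[0.30, 0.10, 0.20])  # made-up errors
G_uni = mix_uniform(hypotheses)
G_wgt = mix_weighted(hypotheses, alphas=[3.0, 1.0, 1.0])   # made-up weights
G_cnd = mix_conditional(
    hypotheses,
    qs=[
        lambda x: 1.0 if x < 2 else 0.0,   # use g_1 only for small x
        lambda x: 0.0,                     # never use g_2
        lambda x: 1.0 if x >= 2 else 0.0,  # use g_3 only for large x
    ],
)

x = 1.5
print(G_sel(x), G_uni(x), G_wgt(x), G_cnd(x))  # prints: -1 -1 1 1
```

Each form contains the previous one as a special case: selection is conditional mixing with an indicator for the best $ t $, and uniform mixing is weighted mixing with all $ \alpha_t = 1 $.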

### Uniform Blending (Voting) for Classification

Mix uniformly: treat every g equally and combine them.

$$ G(x) = sign \big( \sum_{t=1}^T 1 \cdot g_t(x) \big) $$

If all the $ g_t $ are very similar, G is not much different from any one of them.

If the $ g_t $ differ a lot, the majority can correct the minority, much like collective intelligence.
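
A small sketch of both cases, using made-up labels and predictions: three hypotheses that each err on a different example are fully corrected by the vote, while three identical hypotheses gain nothing.

```python
y_true = [+1, -1, +1, -1, +1]

# Made-up predictions of three diverse hypotheses on five examples.
preds = [
    [-1, -1, +1, -1, +1],  # g_1 is wrong on example 0
    [+1, +1, +1, -1, +1],  # g_2 is wrong on example 1
    [+1, -1, +1, -1, -1],  # g_3 is wrong on example 4
]

def error_rate(pred, truth):
    return sum(p != t for p, t in zip(pred, truth)) / len(truth)

def uniform_vote(all_preds):
    # Sign of the per-example vote sum; T is odd here, so no ties occur.
    return [1 if sum(col) > 0 else -1 for col in zip(*all_preds)]

individual = [error_rate(p, y_true) for p in preds]   # 0.2 each
blended = error_rate(uniform_vote(preds), y_true)     # 0.0

# Identical hypotheses: the vote just reproduces g_1.
vote_same = uniform_vote([preds[0]] * 3)
```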

### Uniform Blending for Regression

$$
G(x) = \frac{1}{T} \sum_{t=1}^T g_t(x)
$$

The average **could be** more accurate than an individual $ g_t $.
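
One way to see why: at any fixed $ x $, with target value $ f(x) $, the identity $ \frac{1}{T} \sum_t (g_t - f)^2 = \frac{1}{T} \sum_t (g_t - G)^2 + (G - f)^2 $ holds, so the squared error of $ G $ is never larger than the average squared error of the $ g_t $. The sketch below checks this numerically on made-up values (`f_x` and `g_x` are illustrative, not from any dataset).

```python
f_x = 2.0              # hypothetical target value f(x) at one point x
g_x = [1.0, 2.5, 4.0]  # hypothetical predictions g_t(x)
T = len(g_x)

G_x = sum(g_x) / T     # uniform blend G(x)

avg_individual_err = sum((g - f_x) ** 2 for g in g_x) / T  # avg of (g_t - f)^2
spread = sum((g - G_x) ** 2 for g in g_x) / T              # avg of (g_t - G)^2
blend_err = (G_x - f_x) ** 2                               # (G - f)^2

print(avg_individual_err, spread + blend_err)  # both sides of the identity
```

The guarantee is about the average over the $ g_t $: in this example $ g_2 $ alone is exactly as accurate as the blend, so $ G $ beats the typical hypothesis but not necessarily the best one.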

#### Diverse Hypotheses:

With diverse hypotheses, even simple uniform blending can be better than any **single hypothesis**.

## Linear Blending

Given known $ g_t $, each is assigned $ \alpha_t $ ballots.

$$
G(x) = sign \Big( \sum_{t=1}^T \alpha_t g_t(x) \Big), \ \ \alpha_t \ge 0
$$

Computing 'good' $ \alpha_t $: $ \min_{\alpha_t \ge 0} E_{in}(\alpha) $

#### Linear Blending for regression

$$
\min_{\alpha_t \ge 0} \frac{1}{N} \sum_{n=1}^N \Big( y_n - \sum_{t=1}^T \alpha_t g_t(x_n) \Big)^2
$$

#### LinReg + Transformation

$$
\min_{w_i} \frac{1}{N} \sum_{n=1}^N \Big( y_n - \sum_{i=1}^{\tilde{d}} w_i \Phi_i(x_n) \Big)^2
$$

Linear Blending for regression is like two-level learning.

Linear Blending = LinModel + hypotheses as transform + constraints.
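
The "hypotheses as transform" view can be sketched in a few lines: treat $ (g_1(x_n), \cdots, g_T(x_n)) $ as the features of example $ n $ and run ordinary least squares on them. The two hypotheses and the data below are made up, the normal equations are solved by Cramer's rule only because $ T = 2 $, and the constraint $ \alpha_t \ge 0 $ is not enforced (the solution here happens to satisfy it).

```python
def g1(x):          # hypothetical base hypothesis
    return x

def g2(x):          # hypothetical base hypothesis
    return x * x

xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0 * x + 0.5 * x * x for x in xs]  # target is exactly 2*g1 + 0.5*g2

# Level 1: the hypotheses act as a feature transform z_n = (g1(x_n), g2(x_n)).
Z = [(g1(x), g2(x)) for x in xs]

# Level 2: unconstrained least squares, (Z^T Z) alpha = Z^T y, for T = 2.
a11 = sum(z[0] * z[0] for z in Z)
a12 = sum(z[0] * z[1] for z in Z)
a22 = sum(z[1] * z[1] for z in Z)
b1 = sum(z[0] * y for z, y in zip(Z, ys))
b2 = sum(z[1] * y for z, y in zip(Z, ys))
det = a11 * a22 - a12 * a12
alpha = ((b1 * a22 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)

def G(x):
    return alpha[0] * g1(x) + alpha[1] * g2(x)

print(alpha)  # approximately (2.0, 0.5)
```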

### Linear Blending for Binary classification

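Binary classification has only two outputs, yes or no. If a g is known to be wrong 99% of the time, then -g is right 99% of the time. A negative $ \alpha_t $ is therefore still useful, and in practice the constraint $ \alpha_t \ge 0 $ can be dropped.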

### Linear Blending versus Selection

Like selection, blending is in practice done with:

- $ E_{val} $ instead of $ E_{in} $
- $ g_t^- $ obtained by minimizing $ E_{train} $

First obtain $ g_1^-, g_2^-, \cdots, g_T^- $ from the training data $ D_{train} $;

then transform the validation data $ D_{val}: (x_n, y_n) $ into $ \Big( z_n = \Phi^-(x_n), y_n \Big) $, where $ \Phi^-(x) = \big( g_1^-(x), \cdots, g_T^-(x) \big) $.

#### Linear Blending Steps

Linear blending then proceeds as follows:

1. compute $ \alpha = Lin \Big( \ \big\{ \ (z_n, y_n)  \ \big\}  \  \Big) $
2. return $ G_{LINB}(x) = LinH \Big( \ innerprod \big( \ \alpha, \Phi(x)  \ \big)  \  \Big) $

Note that the final step uses $ \Phi(x) = \big( g_1(x), g_2(x), \cdots, g_T(x) \big) $, built from the $ g_t $ trained on all the data,  
not $ \Phi^- $, whose $ g_t^- $ were trained only on $ D_{train} $.
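
The whole procedure can be sketched end to end for regression. Everything below is an illustrative assumption: the two base algorithms (`fit_constant`, `fit_slope`), the noise-free data, the split into the last `n_val` examples, and the use of unconstrained least squares for `Lin`.

```python
def fit_constant(data):
    """Hypothetical base algorithm 1: always predict the mean of y."""
    mean_y = sum(y for _, y in data) / len(data)
    return lambda x: mean_y

def fit_slope(data):
    """Hypothetical base algorithm 2: least-squares line through the origin."""
    w = sum(x * y for x, y in data) / sum(x * x for x, _ in data)
    return lambda x: w * x

def fit_alpha(Z, ys):
    """Lin: unconstrained least squares for T = 2 via the normal equations."""
    a11 = sum(z[0] * z[0] for z in Z)
    a12 = sum(z[0] * z[1] for z in Z)
    a22 = sum(z[1] * z[1] for z in Z)
    b1 = sum(z[0] * y for z, y in zip(Z, ys))
    b2 = sum(z[1] * y for z, y in zip(Z, ys))
    det = a11 * a22 - a12 * a12
    return ((b1 * a22 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)

def linear_blend(data, n_val, algorithms=(fit_constant, fit_slope)):
    d_train, d_val = data[:-n_val], data[-n_val:]
    # g_t^- come from D_train only.
    g_minus = [fit(d_train) for fit in algorithms]
    # Step 1: transform D_val with Phi^- and compute alpha on (z_n, y_n).
    Z = [tuple(g(x) for g in g_minus) for x, _ in d_val]
    alpha = fit_alpha(Z, [y for _, y in d_val])
    # Step 2: retrain on all the data to get Phi, then blend with alpha.
    g_full = [fit(data) for fit in algorithms]
    return alpha, lambda x: sum(a * g(x) for a, g in zip(alpha, g_full))

data = [(float(x), 1.0 + 2.0 * x) for x in range(1, 9)]  # y = 1 + 2x
alpha, G = linear_blend(data, n_val=3)
print(alpha, G(0.0))
```

Because $ \alpha $ was fitted against $ g_t^- $ but applied to $ g_t $, the returned $ G $ does not reproduce the validation fit exactly; the retrained $ g_t $ have seen more data, which is the reason for swapping them in.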

### Any Blending (Stacking)

1. compute $ \tilde{g} = Any \Big( \ \big\{ \ (z_n, y_n)  \ \big\}  \  \Big) $
2. return $ G_{ANYB}(x) = \tilde{g} \Big( \ \Phi(x)  \ \Big) $
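
A minimal sketch of the two steps with a non-linear second-level learner. The base algorithms, the data and the choice of 1-nearest-neighbour as `Any` are assumptions for illustration; any learner that accepts $ (z_n, y_n) $ pairs could take its place.

```python
def fit_constant(data):
    """Hypothetical base algorithm 1: always predict the mean of y."""
    mean_y = sum(y for _, y in data) / len(data)
    return lambda x: mean_y

def fit_slope(data):
    """Hypothetical base algorithm 2: least-squares line through the origin."""
    w = sum(x * y for x, y in data) / sum(x * x for x, _ in data)
    return lambda x: w * x

def fit_1nn(pairs):
    """Hypothetical 'Any' learner: 1-nearest-neighbour in z-space."""
    def predict(z):
        nearest = min(
            pairs,
            key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], z)),
        )
        return nearest[1]
    return predict

def any_blend(data, n_val, algorithms, meta_fit):
    d_train, d_val = data[:-n_val], data[-n_val:]
    g_minus = [fit(d_train) for fit in algorithms]
    # Step 1: g~ = Any({(z_n, y_n)}) with z_n = Phi^-(x_n) on D_val.
    pairs = [(tuple(g(x) for g in g_minus), y) for x, y in d_val]
    g_tilde = meta_fit(pairs)
    # Step 2: G(x) = g~(Phi(x)), with Phi built from g_t trained on all data.
    g_full = [fit(data) for fit in algorithms]
    return lambda x: g_tilde(tuple(g(x) for g in g_full))

data = [(float(x), 1.0 + 2.0 * x) for x in range(1, 9)]  # y = 1 + 2x
G = any_blend(data, n_val=3, algorithms=(fit_constant, fit_slope), meta_fit=fit_1nn)
print(G(8.0), G(1.0))  # prints: 17.0 13.0
```

The second-level model here can only output a $ y $ value it saw in $ D_{val} $, which shows both the flexibility of any blending and why a powerful second level can overfit a small validation set.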


### g - diversity

diversity is important for $ g_t $ aggregation

- diversity by different models: $ g_1 \in H_1, g_2 \in H_2, \cdots $
- diversity by different parameters: GD with $ \eta = 0.01, 0.001, 0.002 $
- diversity by algorithm randomness: PLA run from different starting points gives different $ g_t $.
- diversity by data randomness: Bootstrap (select with replacement)
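
The last item, bootstrapping, is sketched below: each $ g_t $ is trained on a resampled copy of the data, and the results are blended uniformly. The base algorithm `fit_constant`, the data, the seed and the number of rounds are illustrative assumptions.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) examples uniformly at random, with replacement."""
    return [rng.choice(data) for _ in range(len(data))]

def fit_constant(data):
    """Hypothetical base algorithm: always predict the mean of y."""
    mean_y = sum(y for _, y in data) / len(data)
    return lambda x: mean_y

rng = random.Random(0)  # fixed seed so the run is repeatable
data = [(float(x), 1.0 + 2.0 * x) for x in range(1, 9)]

# One resampled dataset, and hence one g_t, per round.
samples = [bootstrap_sample(data, rng) for _ in range(5)]
gs = [fit_constant(s) for s in samples]

def G(x):
    # Uniform blending of the bootstrap-trained hypotheses.
    return sum(g(x) for g in gs) / len(gs)

print([g(0.0) for g in gs], G(0.0))
```

Since sampling is with replacement, a resampled dataset usually repeats some examples and omits others, which is what makes the resulting $ g_t $ differ from one another.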
