# Ensemble Learning / Ensemble Method

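Combine several weak classifiers (each more accurate than random guessing) that are mutually independent and sufficiently diverse.  
- Each weak classifier has its own biases and viewpoint; combining them lets the individual biases cancel out while pooling the viewpoints.  
- Suppose there are 3 independent weak classifiers, each with accuracy 0.6, combined by majority vote. The ensemble is correct when all 3 classifiers are correct or when exactly 2 of them are:  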
$C(3,3)*0.6*0.6*0.6 + C(3,2)*0.6*0.6*0.4 = 0.648$  
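
The same calculation can be generalized to any odd number of classifiers. The following is a minimal sketch, assuming the classifiers are fully independent and share the same accuracy `p`; the function name `majority_vote_accuracy` is made up for this illustration.

```python
from math import comb

def majority_vote_accuracy(n_classifiers: int, p: float) -> float:
    """Probability that a strict majority of independent classifiers is correct."""
    majority = n_classifiers // 2 + 1
    return sum(
        comb(n_classifiers, k) * p**k * (1 - p) ** (n_classifiers - k)
        for k in range(majority, n_classifiers + 1)
    )

# The 3-classifier case reproduces the 0.648 computed by hand above;
# larger ensembles push the accuracy further up.
for n in (3, 11, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

In practice the classifiers are never perfectly independent, so the real gain is smaller than this idealized number suggests.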



### Ensemble Learning
- Bagging - resample the training data  
Random Forest  
- Boosting - reweight the training data + weighted combination of models  
AdaBoost  
Gradient Boosting (e.g., XGBoost)  
- Stacking - blend weak learners  

## Bagging (Bootstrap aggregating)  
- Bootstrap:  
Draw n' out of n data instances (n' ≤ n; the classic bootstrap uses n' = n), usually with replacement  
- Bootstrap aggregating:  
 1. Repeat the bootstrap m times  
 1. Train a model on each sampled dataset  
 1. Combine the models to make a prediction  
- Random Forest:  
 bagging + randomized feature set  
 1. Build many decision tree classifiers (or regressors)   
 each tree is trained on a bootstrap sample of the training data (bagging)  
 each tree uses a random subset of the features (scikit-learn draws the subset anew at every split)  
 1. Combine the predictions of the trees (e.g., averaging or majority voting) 
 
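  Plotting the decision regions of a single decision tree shows distinctly rectangular boundaries; the more trees a random forest has, the smoother the corners of its boundary become.  
  Because a random forest often has hundreds of trees, prediction requires a fair amount of computation.  
  An advantage is that all resampled datasets and randomized feature sets can be drawn up front, so all trees can be trained in parallel.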
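
The bootstrap-aggregating steps above can be sketched in plain Python. This is an illustrative toy under stated assumptions, not how scikit-learn implements it: the base learner is a hand-rolled one-dimensional threshold stump, the data are synthetic, and the names `fit_stump`, `bagging_fit`, and `bagging_predict` are made up for this example.

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """Pick the threshold and direction on a 1-D feature with the fewest training errors."""
    best = None
    for t in sorted(set(xs)):
        for sign in (1, -1):
            errors = sum((sign if x >= t else -sign) != y for x, y in zip(xs, ys))
            if best is None or errors < best[0]:
                best = (errors, t, sign)
    _, t, sign = best
    return lambda x: sign if x >= t else -sign

def bagging_fit(xs, ys, m, rng):
    """Train m models, each on its own bootstrap sample (n draws with replacement)."""
    n = len(xs)
    models = []
    for _ in range(m):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def bagging_predict(models, x):
    """Combine the models by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(200)]
ys = [1 if x + rng.gauss(0, 0.2) > 0 else -1 for x in xs]  # noisy labels around x = 0

models = bagging_fit(xs, ys, m=25, rng=rng)
print(bagging_predict(models, 0.8), bagging_predict(models, -0.8))
```

Each call to `fit_stump` is independent of the others, which is why the models in a bagged ensemble can be trained in parallel.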

## Boosting
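- Assign different weights to different samples  
  Every training instance carries its own weight
- "Weighted" combination of models  
  Each trained weak classifier influences the final prediction in proportion to how well it classifies  
- Each new weak classifier tries to correctly classify the samples that are frequently misclassified, i.e., the hard ones  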
- **AdaBoost (Adaptive Boosting):**  
$\begin{aligned}
&\text{Training:}\\
&\text{Set uniform weights for each instance, i.e., } w_i^{(0)} = \frac{1}{n}\\
&\text{for } k = 1 \text{ to } K:\\
&\quad \text{Train } f_k \text{ by minimizing the (weighted) error}\\
&\quad \text{Compute the weighted error of the training instances using } f_k\\
&\quad \text{Set } \alpha_k \text{, the weight of } f_k \text{, based on the weighted error}\\
&\quad \text{Set } w_i^{(k)} \text{, the weight of each instance, based on the prediction of } f_k
\end{aligned}$  

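 $K$ is a hyperparameter that sets the number of iterations  
 $f_k$ is the weak classifier obtained in iteration $k$  
 $\alpha_k = \frac{1}{2}\log\left(\frac{1-err^{(k)}}{err^{(k)}}\right)$,  
 > $\alpha_k \in (-\infty, \infty)$  
 > $\log(1)=0$: when $err^{(k)}=0.5$, $\alpha_k=0$, meaning the classifier is no better than random guessing  
 > $err^{(k)} \to 0$ gives $\alpha_k \to \infty$: the classifier is highly trustworthy  
 > $err^{(k)} \to 1$ gives $\alpha_k \to -\infty$: predict the opposite of what the classifier says  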
 
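 $err^{(k)} = \sum_i w_i^{(k-1)}\,\xi_k(f_k(x_i),y_i)$,  
 > $\xi_k(f_k(x_i),y_i)$ is the error of $f_k$ on instance $x_i$ (for example, 1 if misclassified and 0 otherwise); it is multiplied by the weight of $x_i$ from the previous iteration and summed over all instances  

 $w_i^{(k)} = \frac{w_i^{(k-1)}~e^{-\alpha_ky_i\hat y_i}}{z^{(k)}}$, where $\hat y_i = f_k(x_i)$ and $y_i, \hat y_i \in \{-1, +1\}$  
 > $z^{(k)} = \sum_i w_i^{(k-1)}e^{-\alpha_ky_i\hat y_i}$ is the normalization term, so the new weights $w_i^{(k)}$ sum to 1  
 > In $exp(-\alpha_ky_i\hat y_i)$, $\alpha_k$ is the classification ability of this round's classifier; the larger it is, the more trustworthy the classifier (assume $\alpha_k > 0$ below)  
 > If $x_i$ is classified correctly, $y_i\hat y_i$ is positive, the exponential lies between 0 and 1, and $w_i$ shrinks  
 > If $x_i$ is misclassified, $y_i\hat y_i$ is negative, the exponential lies between 1 and $\infty$, and $w_i$ grows; the larger $\alpha_k$, the larger the weight  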
 
 Prediction:  
 - $\hat y_{test} = \alpha_1f_1(x_{test}) + \alpha_2f_2(x_{test}) + \dots + \alpha_Kf_K(x_{test})$  
   Predicted classes = $sign(\hat y_{test})$
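
The AdaBoost training loop and prediction rule above can be sketched in plain Python. This is a minimal illustration under stated assumptions: labels are in {-1, +1}, the weak learner is a one-dimensional threshold stump, the error is the 0/1 misclassification indicator, and the names `fit_stump`, `adaboost_fit`, and `adaboost_predict` as well as the toy data are made up for this example.

```python
import math

def fit_stump(xs, ys, w):
    """Return (weighted error, stump) for the 1-D threshold stump with the smallest weighted error."""
    best = None
    for t in sorted(set(xs)):
        for sign in (1, -1):
            err = sum(wi for x, y, wi in zip(xs, ys, w)
                      if (sign if x >= t else -sign) != y)
            if best is None or err < best[0]:
                best = (err, t, sign)
    err, t, sign = best
    return err, (lambda x: sign if x >= t else -sign)

def adaboost_fit(xs, ys, K):
    n = len(xs)
    w = [1.0 / n] * n                            # uniform initial weights w_i^(0)
    ensemble = []                                # list of (alpha_k, f_k)
    for _ in range(K):
        err, f = fit_stump(xs, ys, w)
        err = min(max(err, 1e-10), 1 - 1e-10)    # guard against log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        # Shrink the weights of correctly classified instances, grow the others.
        w = [wi * math.exp(-alpha * y * f(x)) for x, y, wi in zip(xs, ys, w)]
        z = sum(w)                               # normalization term z^(k)
        w = [wi / z for wi in w]
        ensemble.append((alpha, f))
    return ensemble

def adaboost_predict(ensemble, x):
    score = sum(alpha * f(x) for alpha, f in ensemble)
    return 1 if score >= 0 else -1

# Toy data with a +, -, + pattern that no single stump can separate.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, 1]
ensemble = adaboost_fit(xs, ys, K=3)
train_acc = sum(adaboost_predict(ensemble, x) == y for x, y in zip(xs, ys)) / len(xs)
print(train_acc)
```

On this toy set each stump alone misclassifies at least three points, while the weighted vote of three stumps fits all ten.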
 

- **Gradient Boosting for regression:**  
 1. Given D = {($x_1$,$y_1$),($x_2$,$y_2$),...,($x_n$,$y_n$)}  
 1. Train a model $f_1$ to fit D, and let F = $f_1$  
 1. Train a model $f_2$ to fit the residuals given the features  
 i.e. fitting {($x_1$,$y_1-F(x_1)$),($x_2$,$y_2-F(x_2)$),...,($x_n$,$y_n-F(x_n)$)}  
 Let F = $f_1+f_2$  
 1. Repeat the process to get $f_3,f_4,...,f_m$  
 F($x_i$) = $f_1(x_i)+f_2(x_i)+f_3(x_i)+...+f_m(x_i)$  
 
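 This can be viewed as looking for a set of parameters that brings $F(x)$ as close as possible to the actual $y$  
 Let the loss function be J = $\frac{1}{2}\sum_i(y_i - F(x_i))^2$  
 Train the first model $f_1$ and set $F(x_j) = f_1(x_j)$  
 Treat each $F(x_j)$ as a parameter and keep updating it by gradient descent (step size 1 here) until convergence  
 $F^{(k+1)}(x_j) = F^{(k)}(x_j) - \frac{\partial J}{\partial F(x_j)} $  
 Since $\frac{\partial J}{\partial F(x_j)}= -(y_j-F(x_j))$,  
 we get $F^{(k+1)}(x_j) = F^{(k)}(x_j)+(y_j-F^{(k)}(x_j))$, so the residual that each new model fits is exactly the negative gradient of J  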
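 - A single decision tree tackles the problem with one relatively complex tree; Gradient Boosting uses many simple trees  
 - A random forest builds many trees that are independent of each other; Gradient Boosting also uses many simple trees, but each new tree tries to correct the errors of the previous trees  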
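
The residual-fitting procedure for regression described above can be sketched in plain Python. This is a simplified illustration, assuming squared-error loss, a step size of 1 as in the derivation, and a one-split regression stump as the weak learner; the names `fit_regression_stump`, `gradient_boost_fit`, and `gradient_boost_predict` and the sine-curve data are made up for this example.

```python
import math

def fit_regression_stump(xs, targets):
    """Least-squares stump on a 1-D feature: one split, one constant per side."""
    best = None
    for t in sorted(set(xs))[1:]:                # candidate splits between distinct x values
        left = [r for x, r in zip(xs, targets) if x < t]
        right = [r for x, r in zip(xs, targets) if x >= t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x < t else right_mean

def gradient_boost_fit(xs, ys, n_rounds):
    models = []
    F = [0.0] * len(xs)                          # F starts at 0, so f_1 fits y itself
    for _ in range(n_rounds):
        residuals = [y - f for y, f in zip(ys, F)]   # y_i - F(x_i), the negative gradient of J
        model = fit_regression_stump(xs, residuals)
        models.append(model)
        F = [f + model(x) for f, x in zip(F, xs)]    # F <- F + f_k
    return models

def gradient_boost_predict(models, x):
    return sum(model(x) for model in models)

def mean_squared_error(models, xs, ys):
    return sum((y - gradient_boost_predict(models, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [i / 10 for i in range(60)]
ys = [math.sin(x) for x in xs]
for rounds in (1, 5, 50):
    models = gradient_boost_fit(xs, ys, rounds)
    print(rounds, round(mean_squared_error(models, xs, ys), 4))
```

The training error shrinks as more stumps are added, since each stump is fitted to whatever the current ensemble still gets wrong. Library implementations usually multiply each new model by a learning rate smaller than 1 instead of adding it at full strength.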
 

## Stacking  
Take the outputs of several different classifiers and feed them back in as inputs  
to train a new ensemble model  
The output of the ensemble model is the label
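
The idea can be illustrated with a small plain-Python sketch. Everything here is an assumption made for the illustration: the three base classifiers are hand-written rules standing in for trained models, the meta-model is a simple lookup table over their outputs, and the data follow an XOR-style target that none of the base classifiers can solve alone.

```python
import random
from collections import Counter, defaultdict

# Hypothetical base classifiers on 2-D points (stand-ins for trained models).
base_models = [
    lambda p: int(p[0] > 0.5),
    lambda p: int(p[1] > 0.5),
    lambda p: int(p[0] + p[1] > 1.0),
]

def base_outputs(point):
    """Outputs of the base classifiers, used as the meta-model's input features."""
    return tuple(model(point) for model in base_models)

def fit_meta_model(points, labels):
    """Meta-model: for each combination of base outputs, predict the most common label."""
    counts = defaultdict(Counter)
    for point, label in zip(points, labels):
        counts[base_outputs(point)][label] += 1
    table = {key: c.most_common(1)[0][0] for key, c in counts.items()}
    fallback = Counter(labels).most_common(1)[0][0]   # used for unseen combinations
    return lambda point: table.get(base_outputs(point), fallback)

def accuracy(predict, points, labels):
    return sum(predict(p) == y for p, y in zip(points, labels)) / len(points)

rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(400)]
labels = [int((x > 0.5) != (y > 0.5)) for x, y in points]   # XOR-style target
train_pts, test_pts = points[:300], points[300:]
train_y, test_y = labels[:300], labels[300:]

stacked = fit_meta_model(train_pts, train_y)
print([round(accuracy(m, test_pts, test_y), 2) for m in base_models])
print(accuracy(stacked, test_pts, test_y))
```

When the base models are themselves trained, the meta-model is normally fitted on out-of-fold predictions so that it does not simply learn the base models' overfitting; scikit-learn's `StackingClassifier` handles this with internal cross-validation.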

# Random Forest
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
```  
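**[Parameters]**  
- n_estimators: number of trees to grow  
- max_features: number of features randomly drawn when looking for the best split (the "auto" option listed below comes from an older scikit-learn release and has been removed in newer ones; "sqrt" is the equivalent)  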
> If “auto”, then max_features=sqrt(n_features).  
> If “sqrt”, then max_features=sqrt(n_features) (same as “auto”).  
> If “log2”, then max_features=log2(n_features).  
> If None, then max_features=n_features.  

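- max_depth: if None, nodes are expanded until all leaves are pure or contain fewer than min_samples_split samples; any other value caps the depth of the tree  
- min_samples_leaf: (default=1) minimum number of samples required at a leaf node  
- min_samples_split: (default=2) minimum number of samples required to split an internal node  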

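After fitting, the model's feature importances show which features contribute most to the splits (impurity-based importance, normalized to sum to 1):   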
```python
model.feature_importances_
```

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn import metrics
from sklearn.model_selection import train_test_split
%matplotlib inline
```

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.25, random_state=5)
```

```python
# Note: max_features='auto' was removed in newer scikit-learn releases; 'sqrt' is the equivalent for classifiers.
model = RandomForestClassifier(n_estimators=100, criterion='gini', max_depth=None, max_features='auto',
                               min_samples_leaf=1, min_samples_split=2, random_state=123456)
model.fit(X_train, y_train)
```

```
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
            oob_score=False, random_state=123456, verbose=0,
            warm_start=False)
```

```python
predicted = model.predict(X_test)
metrics.accuracy_score(y_pred=predicted, y_true=y_test)
```

```
0.9210526315789473
```

```python
print(data.feature_names)
model.feature_importances_
```

```
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
array([0.07765363, 0.03215788, 0.4090637 , 0.48112478])
```
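
To make the importances easier to read, they can be paired with the feature names and sorted. The sketch below hard-codes the values recorded above so that it runs on its own; with a fitted model one would presumably pass `data.feature_names` and `model.feature_importances_` instead.

```python
feature_names = ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
importances = [0.07765363, 0.03215788, 0.4090637, 0.48112478]  # values recorded above

# Sort features from most to least important.
ranking = sorted(zip(feature_names, importances), key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(f"{name:<20} {score:.3f}")
```

The two petal measurements account for most of the importance, while the sepal measurements contribute little.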

# Gradient Boosting
```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import GradientBoostingRegressor
```  
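**[Parameters]**  
- n_estimators: number of trees (boosting stages) to build  
- loss : {‘deviance’, ‘exponential’}  
loss function to be optimized  
the default is deviance (renamed ‘log_loss’ in newer scikit-learn releases)  
- learning_rate:  (default=0.1)  
shrinks the contribution of each tree by learning_rate  
A smaller value shrinks the weight of each model, so more trees (a larger n_estimators) are needed to reach the same fit; the two parameters trade off against each other  
- criterion : (default=”friedman_mse”)  
The function to measure the quality of a split.  
“friedman_mse” is the mean squared error with improvement score by Friedman  
alternatives: “mse” or “mae”  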

- max_features: number of features randomly drawn when looking for the best split  
> If “auto”, then max_features=sqrt(n_features).  
> If “sqrt”, then max_features=sqrt(n_features) (same as “auto”).  
> If “log2”, then max_features=log2(n_features).  
> If None, then max_features=n_features.  

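- max_depth: (default=3) maximum depth of each individual tree  
- min_samples_leaf: (default=1) minimum number of samples required at a leaf node  
- min_samples_split: (default=2) minimum number of samples required to split an internal node  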

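After fitting, the model's feature importances show which features contribute most to the splits (impurity-based importance, normalized to sum to 1):   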
```python
model.feature_importances_
```

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import GradientBoostingRegressor
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.25, random_state=5)
print(X_train.shape)
```

```
(112, 4)
```


```python
# Note: loss='deviance' was renamed 'log_loss' in newer scikit-learn releases.
model = GradientBoostingClassifier(n_estimators=100, criterion='friedman_mse', loss='deviance',
                                   learning_rate=0.1, max_depth=3, max_features=None,
                                   min_samples_leaf=1, min_samples_split=2, random_state=123456)
model.fit(X_train, y_train)
```

```
GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=100,
              n_iter_no_change=None, presort='auto', random_state=123456,
              subsample=1.0, tol=0.0001, validation_fraction=0.1,
              verbose=0, warm_start=False)
```

```python
predicted = model.predict(X_test)
metrics.accuracy_score(y_pred=predicted, y_true=y_test)
```

```
0.9210526315789473
```

```python
print(data.feature_names)
model.feature_importances_
```

```
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
array([0.02463981, 0.00749258, 0.29868364, 0.66918397])
```