![image.png](images/ensemble_learning.png)

**By default, the base model used by bagging is a decision tree.**
![image.png](images/decision_tree.png)

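**Random forest is a typical application of bagging (with decision trees as the base model), plus some additional optimizations.**  
**The bagging method itself still applies to other (non-tree) base models.**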
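As a minimal sketch of bagging with a non-tree base model, the example below wraps a k-nearest-neighbours classifier. The dataset sizes, `n_neighbors=5`, and the `*_demo` variable names are assumptions chosen only for illustration. The base model is passed positionally because its keyword name has changed between scikit-learn releases.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Small synthetic dataset (sizes are illustrative assumptions)
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Bagging around a KNN base model instead of the default decision tree
knn_bag = BaggingClassifier(KNeighborsClassifier(n_neighbors=5),
                            n_estimators=10, random_state=0)
knn_bag.fit(X_tr, y_tr)

knn_acc = knn_bag.score(X_te, y_te)
print('KNN bagging accuracy: %.3f' % knn_acc)
```

Each fitted member of `knn_bag.estimators_` is a separate `KNeighborsClassifier` trained on its own bootstrap sample.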

In [5]:
# Load libraries
from numpy import mean, std

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import BaggingClassifier

In [9]:
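# Create sample data
"""
make_classification() generates a random classification dataset:
n_samples: number of samples
n_features: total number of features
n_informative: number of informative features
n_redundant: number of redundant features
n_classes: number of classes
"""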
X,y = make_classification(n_samples=1000, n_features=20,
                          n_informative=15, 
                          n_redundant=5, random_state=5)

In [12]:
# Train a bagging ensemble (sampling with replacement)
model = BaggingClassifier(bootstrap=True)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# error_score='raise': raise an exception if fitting an estimator fails
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1,
                          error_score='raise')
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Accuracy: 0.855 (0.033)


In [13]:
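# Use the pasting method (sampling without replacement)
# Note: max_samples keeps its default of 1.0, so every estimator is given the full training set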
model = BaggingClassifier(bootstrap=False)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1,
                          error_score='raise')
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Accuracy: 0.799 (0.044)


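**In general, bagging produces more diverse training subsets than pasting, and the resulting ensemble usually performs better (here, 0.855 versus 0.799 accuracy).**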
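Pasting only yields distinct training subsets when `max_samples` is smaller than the training set. The sketch below is an illustration under assumed values (200 samples, `max_samples=160`, `random_state=0`), not a rerun of the experiment above. It inspects the `estimators_samples_` attribute to confirm that each estimator receives a duplicate-free subset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

# Illustrative dataset; the sizes are assumptions
X_demo, y_demo = make_classification(n_samples=200, n_features=20,
                                     n_informative=15, n_redundant=5,
                                     random_state=5)

# Pasting: each estimator draws 160 of the 200 instances without replacement
paste_clf = BaggingClassifier(n_estimators=10, max_samples=160,
                              bootstrap=False, random_state=0)
paste_clf.fit(X_demo, y_demo)

# estimators_samples_ holds the row indices drawn for each estimator
sample_sets = [np.asarray(idx) for idx in paste_clf.estimators_samples_]
for idx in sample_sets[:3]:
    print('drawn:', len(idx), 'distinct:', len(np.unique(idx)))
```

With `bootstrap=True` the same inspection would show fewer distinct indices than drawn ones, which is where the extra diversity of bagging comes from.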

**Out-of-bag evaluation**  

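With bagging (sampling with replacement) and max_samples equal to the size of the training set, each predictor sees on average only about 63% of the training instances, because some instances are drawn more than once; the remaining 37% or so are never used by that predictor. These out-of-bag (oob) instances can therefore serve as a test set for it.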
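The 63% figure follows from the chance that a given instance is missed in all n draws, (1 - 1/n)^n, which approaches 1/e (about 0.368) as n grows. The short simulation below checks this using only the standard library; the helper name `in_bag_fraction`, the sample size, and the number of repetitions are illustrative choices.

```python
import math
import random

def in_bag_fraction(n, rng):
    """Fraction of distinct instances in one bootstrap sample of size n."""
    drawn = {rng.randrange(n) for _ in range(n)}  # n draws with replacement
    return len(drawn) / n

rng = random.Random(0)
n = 1000
fractions = [in_bag_fraction(n, rng) for _ in range(200)]
avg_fraction = sum(fractions) / len(fractions)

# Probability that a given instance appears at least once in n draws
theory = 1 - (1 - 1 / n) ** n

print('simulated in-bag fraction: %.3f' % avg_fraction)
print('theoretical value:         %.3f' % theory)
print('limit 1 - 1/e:             %.3f' % (1 - math.exp(-1)))
```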

In [14]:
from sklearn.model_selection import train_test_split

In [17]:
train_X, test_X, train_y, test_y = train_test_split(X, y)

In [16]:
bag_clf = BaggingClassifier(
    n_estimators=50, # default=10
    bootstrap=True,
    n_jobs=-1,
    oob_score=True
)

In [18]:
# fit
bag_clf.fit(train_X, train_y)

BaggingClassifier(n_estimators=50, n_jobs=-1, oob_score=True)

In [19]:
# oob score
bag_clf.oob_score_

0.8573333333333333

In [20]:
# evaluate on the held-out test set
from sklearn.metrics import accuracy_score

y_pred = bag_clf.predict(test_X)
accuracy_score(test_y, y_pred)

0.876

**Random subspaces**

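The BaggingClassifier class also supports sampling features, so that each estimator is trained on a random feature subspace.  
max_features: number of features to draw, analogous to max_samples  
bootstrap_features: whether features are drawn with replacement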
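To see which features each estimator receives, one can inspect the fitted `estimators_features_` attribute. The sketch below keeps every training instance (`bootstrap=False`, `max_samples=1.0`) and samples only features, which is how scikit-learn's documentation describes the random subspaces variant. The dataset size, `n_estimators=5`, and `random_state=0` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

# Illustrative dataset; the sizes are assumptions
X_demo, y_demo = make_classification(n_samples=200, n_features=20,
                                     n_informative=15, n_redundant=5,
                                     random_state=5)

subspace_clf = BaggingClassifier(
    n_estimators=5,
    bootstrap=False, max_samples=1.0,          # keep all training instances
    max_features=15, bootstrap_features=False, # draw 15 distinct features
    random_state=0,
)
subspace_clf.fit(X_demo, y_demo)

# Column indices used by each estimator
feature_sets = [sorted(int(f) for f in feats)
                for feats in subspace_clf.estimators_features_]
for feats in feature_sets:
    print(feats)
```

Setting `bootstrap_features=True` would allow the same column to be drawn more than once for a single estimator.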

In [24]:
bag_clf = BaggingClassifier(
    n_estimators=10, # default=10
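    max_features=15, # draw 15 of the 20 features for each estimator
    bootstrap=True, # instances are resampled too, so strictly this is the "random patches" variant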
    n_jobs=-1
)

In [25]:
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(bag_clf, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Accuracy: 0.858 (0.032)


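**With the same number of base models, the random-subspace ensemble shows slightly higher accuracy and a slightly lower standard deviation than plain bagging (0.858 ± 0.032 versus 0.855 ± 0.033).**
(The gap is far smaller than one standard deviation, so this may well be specific to this dataset; still, feature sampling is one more option to try.)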