# Model Ensemble

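This section covers three ensemble methods: Stacking, Blending, and Bagging. Each one comes with both a single-machine and a distributed implementation, and the walkthrough below is organized along those two lines.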

## 1. Stacking

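Stacking is an ensemble learning method that combines several models to obtain better performance. A Stacking model usually has two layers: the first layer trains several different models, and the second layer trains a model on the outputs of the first-layer models to produce the final output. See this [article](https://blog.csdn.net/data_scientist/article/details/78900265) for background.
> Stacking is implemented as StackingClassifier and StackingRegressor. Their parameters are identical, so the examples below use only StackingClassifier.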

### 1.1 Single-machine version

In [1]:
import warnings
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, r2_score
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier,\
    RandomForestRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.linear_model import LogisticRegression, LinearRegression

from model_helper.model_ensemble import StackingClassifier, StackingRegressor, \
BlendingClassifier, BlendingRegressor, BaggingClassifier, BaggingRegressor

warnings.filterwarnings(action="ignore", category=FutureWarning)

In [2]:
X, y = make_classification(n_samples=5000, n_features=20, n_classes=2, random_state=234)

print("X's shape", X.shape)
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.3)

X's shape (5000, 20)


#### Method 1
The simplest usage: pass the number of folds, the base learners, and the meta learner.

In [3]:
clf = StackingClassifier(k_fold=5, base_learner_list=[RandomForestClassifier(), GradientBoostingClassifier(),
                                            DecisionTreeClassifier()],
                         meta_learner=LogisticRegression())
clf.fit(X=train_x, y=train_y)

pred = clf.predict_proba(X=test_x)[:, 1]
auc_val = roc_auc_score(y_true=test_y, y_score=pred)
print(auc_val)

[1]. train base learner...
altogether train 3(base learner model number) × 5(fold num) = 15 models.
1/15(model_index:0, fold_index:0) starts
2/15(model_index:0, fold_index:1) starts
3/15(model_index:0, fold_index:2) starts
4/15(model_index:0, fold_index:3) starts
5/15(model_index:0, fold_index:4) starts
6/15(model_index:1, fold_index:0) starts
7/15(model_index:1, fold_index:1) starts
8/15(model_index:1, fold_index:2) starts
9/15(model_index:1, fold_index:3) starts
10/15(model_index:1, fold_index:4) starts
11/15(model_index:2, fold_index:0) starts
12/15(model_index:2, fold_index:1) starts
13/15(model_index:2, fold_index:2) starts
14/15(model_index:2, fold_index:3) starts
15/15(model_index:2, fold_index:4) starts
[1]. train base learner done, cost 3 seconds.
[2]. get base learner prediction...
[2]. get base learner prediction done, cost 0 seconds.
[3]. train meta learner...
last used model index is [0, 1, 2].
[3]. train meta learner done, cost 0 seconds.
0.9825956598515821
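
The three logged stages can be sketched with scikit-learn alone. This is a minimal illustration of the two-layer idea, not the library's implementation: the smaller dataset, the two base learners, and the choice to refit each base learner on all training rows at prediction time are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=234)
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=0)

base_learners = [RandomForestClassifier(n_estimators=50, random_state=0),
                 DecisionTreeClassifier(random_state=0)]

# Stages [1]-[2]: out-of-fold predictions become the meta features, so no
# training row is scored by a model that saw its label.
meta_train = np.column_stack([
    cross_val_predict(m, train_x, train_y, cv=5, method="predict_proba")[:, 1]
    for m in base_learners
])

# Stage [3]: the meta learner is fitted on the out-of-fold predictions.
meta_learner = LogisticRegression().fit(meta_train, train_y)

# Assumption of this sketch: at prediction time each base learner is refitted
# on all training rows (the library may instead reuse the per-fold models).
meta_test = np.column_stack([
    m.fit(train_x, train_y).predict_proba(test_x)[:, 1] for m in base_learners
])
stacked_auc = roc_auc_score(test_y, meta_learner.predict_proba(meta_test)[:, 1])
print(meta_train.shape, round(stacked_auc, 3))
```

Each column of `meta_train` holds one base learner's positive-class probability, so the meta learner sees as many features as there are base learners.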


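#### Method 2

Base learners can be selected from the given list according to their cross-validation performance. Pass the metric through `metric_func` and the selection callback through `select_base_learner`.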

In [4]:
def selector(model_metrics):
    model_avg_metric = np.array(list(map(lambda x: sum(x) / len(x), model_metrics)))
    return model_avg_metric.argsort()[-2:][::-1]  # indices of the top 2 models, best first

clf = StackingClassifier(base_learner_list=[RandomForestClassifier(), GradientBoostingClassifier(),
                                            DecisionTreeClassifier()],
                         meta_learner=LogisticRegression(), metric_func=roc_auc_score, select_base_learner=selector)
clf.fit(X=train_x, y=train_y)

pred = clf.predict_proba(X=test_x)[:, 1]
auc_val = roc_auc_score(y_true=test_y, y_score=pred)
print(auc_val)

[1]. train base learner...
altogether train 3(base learner model number) × 5(fold num) = 15 models.
1/15(model_index:0, fold_index:0) starts
2/15(model_index:0, fold_index:1) starts
3/15(model_index:0, fold_index:2) starts
4/15(model_index:0, fold_index:3) starts
5/15(model_index:0, fold_index:4) starts
6/15(model_index:1, fold_index:0) starts
7/15(model_index:1, fold_index:1) starts
8/15(model_index:1, fold_index:2) starts
9/15(model_index:1, fold_index:3) starts
10/15(model_index:1, fold_index:4) starts
11/15(model_index:2, fold_index:0) starts
12/15(model_index:2, fold_index:1) starts
13/15(model_index:2, fold_index:2) starts
14/15(model_index:2, fold_index:3) starts
15/15(model_index:2, fold_index:4) starts
[1]. train base learner done, cost 4 seconds.
[2]. get base learner prediction...
average 5 fold metric of every model is [0.9689914937102717, 0.97972802594318, 0.9057468225768597]
[2]. get base learner prediction done, cost 0 seconds.
[3]. train meta learner...
last used model 
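
The `selector` callback can be checked in isolation. The sketch below assumes, as the averaging inside `selector` suggests, that the callback receives one list of per-fold scores per base learner; the score values are made up for illustration.

```python
import numpy as np

def selector(model_metrics):
    # Average each model's per-fold scores, then return the indices of the
    # two best models, best first.
    model_avg_metric = np.array([sum(m) / len(m) for m in model_metrics])
    return model_avg_metric.argsort()[-2:][::-1]

# Hypothetical per-fold AUCs for three base learners (illustrative numbers only).
fold_metrics = [
    [0.96, 0.97, 0.98],   # model 0, mean 0.97
    [0.98, 0.98, 0.98],   # model 1, mean 0.98
    [0.90, 0.91, 0.92],   # model 2, mean 0.91
]
chosen = selector(fold_metrics)
print(chosen.tolist())  # [1, 0]
```

`argsort` sorts in ascending order, so the last two positions are the two best models and the final `[::-1]` puts the best one first.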

#### Method 3
Each base learner can be trained on a randomly sampled subset of the features, set through `feature_fraction`.

In [5]:
clf = StackingClassifier(base_learner_list=[RandomForestClassifier(), GradientBoostingClassifier(),
                                            DecisionTreeClassifier()],
                         meta_learner=LogisticRegression(),
                         feature_fraction=0.8)
clf.fit(X=train_x, y=train_y)

pred = clf.predict_proba(X=test_x)[:, 1]
auc_val = roc_auc_score(y_true=test_y, y_score=pred)
print(auc_val)

[1]. train base learner...
altogether train 3(base learner model number) × 5(fold num) = 15 models.
1/15(model_index:0, fold_index:0) starts
2/15(model_index:0, fold_index:1) starts
3/15(model_index:0, fold_index:2) starts
4/15(model_index:0, fold_index:3) starts
5/15(model_index:0, fold_index:4) starts
6/15(model_index:1, fold_index:0) starts
7/15(model_index:1, fold_index:1) starts
8/15(model_index:1, fold_index:2) starts
9/15(model_index:1, fold_index:3) starts
10/15(model_index:1, fold_index:4) starts
11/15(model_index:2, fold_index:0) starts
12/15(model_index:2, fold_index:1) starts
13/15(model_index:2, fold_index:2) starts
14/15(model_index:2, fold_index:3) starts
15/15(model_index:2, fold_index:4) starts
[1]. train base learner done, cost 3 seconds.
[2]. get base learner prediction...
[2]. get base learner prediction done, cost 0 seconds.
[3]. train meta learner...
last used model index is [0, 1, 2].
[3]. train meta learner done, cost 0 seconds.
0.9821475252120192
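
How a feature fraction might translate into per-learner column subsets can be sketched with NumPy. `sample_feature_indices` is a hypothetical helper written for this illustration; the library's actual sampling logic is not shown in this document.

```python
import numpy as np

def sample_feature_indices(n_features, feature_fraction, rng):
    # Hypothetical helper: draw a sorted subset of column indices without replacement.
    n_keep = max(1, int(round(n_features * feature_fraction)))
    return np.sort(rng.choice(n_features, size=n_keep, replace=False))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))

# One independent subset per base learner; the same columns must be reused
# when that learner later predicts on new data.
subsets = [sample_feature_indices(X.shape[1], 0.8, rng) for _ in range(3)]
views = [X[:, cols] for cols in subsets]
print([v.shape for v in views])  # [(100, 16), (100, 16), (100, 16)]
```

With 20 features and a fraction of 0.8, every learner sees 16 columns, but generally not the same 16, which is what adds diversity among the base learners.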


#### Method 4

The single-machine version can train in multiple processes by setting `enable_multiprocess` and `n_jobs`.

In [6]:
clf = StackingClassifier(base_learner_list=[RandomForestClassifier(), GradientBoostingClassifier(),
                                            DecisionTreeClassifier()],
                         meta_learner=LogisticRegression(),
                         feature_fraction=0.8, enable_multiprocess=True, n_jobs=2)
clf.fit(X=train_x, y=train_y)

pred = clf.predict_proba(X=test_x)[:, 1]
auc_val = roc_auc_score(y_true=test_y, y_score=pred)
print(auc_val)

[1]. train base learner...
altogether train 3(base learner model number) × 5(fold num) = 15 models.
1/15(model_index:0, fold_index:0) starts
2/15(model_index:0, fold_index:1) starts
3/15(model_index:0, fold_index:2) starts
4/15(model_index:0, fold_index:3) starts
5/15(model_index:0, fold_index:4) starts
6/15(model_index:1, fold_index:0) starts
7/15(model_index:1, fold_index:1) starts
8/15(model_index:1, fold_index:2) starts
9/15(model_index:1, fold_index:3) starts
10/15(model_index:1, fold_index:4) starts
11/15(model_index:2, fold_index:0) starts
12/15(model_index:2, fold_index:1) starts
13/15(model_index:2, fold_index:2) starts
14/15(model_index:2, fold_index:3) starts
15/15(model_index:2, fold_index:4) starts
[1]. train base learner done, cost 2 seconds.
[2]. get base learner prediction...
[2]. get base learner prediction done, cost 0 seconds.
[3]. train meta learner...
last used model index is [0, 1, 2].
[3]. train meta learner done, cost 0 seconds.
0.982207987822119


#### Method 5

The base learners generate the meta features in a k-fold fashion, so the k-fold split can be stratified on a given column, as follows:

In [7]:
clf = StackingClassifier(base_learner_list=[RandomForestClassifier(), GradientBoostingClassifier(),
                                            DecisionTreeClassifier()],
                         meta_learner=LogisticRegression(),
                         feature_fraction=0.8)

clf.fit(X=train_x, y=train_y, stratify=True, stratify_col=train_y)

pred = clf.predict_proba(test_x)[:, 1]
auc_val = roc_auc_score(y_true=test_y, y_score=pred)
print(auc_val)

[1]. train base learner...
altogether train 3(base learner model number) × 5(fold num) = 15 models.
1/15(model_index:0, fold_index:0) starts
2/15(model_index:0, fold_index:1) starts
3/15(model_index:0, fold_index:2) starts
4/15(model_index:0, fold_index:3) starts
5/15(model_index:0, fold_index:4) starts
6/15(model_index:1, fold_index:0) starts
7/15(model_index:1, fold_index:1) starts
8/15(model_index:1, fold_index:2) starts
9/15(model_index:1, fold_index:3) starts
10/15(model_index:1, fold_index:4) starts
11/15(model_index:2, fold_index:0) starts
12/15(model_index:2, fold_index:1) starts
13/15(model_index:2, fold_index:2) starts
14/15(model_index:2, fold_index:3) starts
15/15(model_index:2, fold_index:4) starts
[1]. train base learner done, cost 3 seconds.
[2]. get base learner prediction...
[2]. get base learner prediction done, cost 0 seconds.
[3]. train meta learner...
last used model index is [0, 1, 2].
[3]. train meta learner done, cost 0 seconds.
0.9819856988143992
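
The effect of stratification can be seen with scikit-learn's `StratifiedKFold`. Whether `stratify` and `stratify_col` use this class internally is an assumption; the sketch only shows what a stratified split guarantees, on a synthetic label vector.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.array([1] * 300 + [0] * 700)   # 30% positives

# Every validation fold keeps (almost exactly) the overall class ratio.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_pos_rates = [float(y[val_idx].mean()) for _, val_idx in skf.split(X, y)]
print(fold_pos_rates)
```

Keeping the class ratio stable across folds matters most for imbalanced labels, where a plain random split can leave a fold with very few positives.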


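### 1.2 Distributed version

The distributed version is used the same way as the single-machine version, except that two extra parameters, `spark` and `distribute`, must be specified. Only one example is given below.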

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("distribute").enableHiveSupport().getOrCreate()  # a Spark session must be created first


clf = StackingClassifier(k_fold=5, base_learner_list=[RandomForestClassifier(), 
                                                      GradientBoostingClassifier(),
                                                      DecisionTreeClassifier()], 
                         meta_learner=LogisticRegression(), 
                         distribute=True, spark=spark)  # the two extra parameters compared with the single-machine version
clf.fit(X=train_x, y=train_y)  # train

pred = clf.predict_proba(X=test_x)[:, 1]  # predict
auc_val = roc_auc_score(y_true=test_y, y_score=pred)  # evaluate
print(auc_val)

## 2. Blending

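Blending is largely the same as Stacking. The main difference is that the features for the second-stage model are not produced by k-fold cross-validated predictions on the training set. Instead, a holdout set is set aside, for example 10% of the training data, and the second-stage model is trained on the base learners' predictions for that holdout set.

In terms of parameters, the only difference from Stacking is that `k_fold` is replaced by `base_train_size` (the fraction of the data used to train the base learners). Only one example is given below; everything else works exactly as in the Stacking section.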

In [8]:
def selector(model_metrics):
    model_avg_metric = np.array(model_metrics)
    return model_avg_metric.argsort()[-2:][::-1]  # select the two highest-scoring models

clf = BlendingClassifier(base_train_size=0.8,
                         base_learner_list=[RandomForestClassifier(), GradientBoostingClassifier(),
                                            DecisionTreeClassifier()],
                         meta_learner=LogisticRegression(), metric_func=roc_auc_score, select_base_learner=selector,
                         feature_fraction=0.8)
clf.fit(X=train_x, y=train_y)

pred = clf.predict_proba(X=test_x)[:, 1]
auc_val = roc_auc_score(y_true=test_y, y_score=pred)
print(auc_val)

[1]. train base learner...
altogether train 3 models.
base learner used sample num is 2800, fraction is 0.8
1/3(model_index:0) starts
2/3(model_index:1) starts
3/3(model_index:2) starts
[1]. train base learner done, cost 0 seconds.
[2]. get base learner prediction...
model's test set metric is [0.9647576732045398, 0.9819901645514725, 0.9019605436499167].
meta learner used sample num is 700, fraction is 0.2
[2]. get base learner prediction done, cost 0 seconds.
[3]. train meta learner...
last used model index is [1 0].
[3]. train meta learner done, cost 0 seconds.
0.9822897901769598
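
The holdout split reported in the log (2800 rows for the base learners, 700 for the meta learner) follows from `base_train_size=0.8` applied to 3500 training rows. A minimal scikit-learn sketch of the same idea is shown below; it is an illustration on a smaller synthetic dataset with two base learners, not the library's code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=234)

# 80% of the rows train the base learners; the 20% holdout trains the meta learner.
base_x, hold_x, base_y, hold_y = train_test_split(X, y, train_size=0.8, random_state=0)

base_learners = [RandomForestClassifier(n_estimators=50, random_state=0),
                 DecisionTreeClassifier(random_state=0)]
for model in base_learners:
    model.fit(base_x, base_y)

def meta_features(models, data):
    # One column per base learner: its positive-class probability.
    return np.column_stack([m.predict_proba(data)[:, 1] for m in models])

meta_learner = LogisticRegression().fit(meta_features(base_learners, hold_x), hold_y)
print(base_x.shape[0], hold_x.shape[0])  # 800 200
```

Compared with Stacking, each base learner is trained once rather than once per fold, at the cost of giving the meta learner fewer rows to learn from.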


## 3. Bagging

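Bagging is another model ensemble method. Unlike the two-stage training of Stacking and Blending, Bagging needs only one stage of training, after which the predictions of the first-stage models are combined.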

### 3.1 Single-machine version

#### Method 1

In [9]:
clf = BaggingClassifier(base_learner_list=[RandomForestClassifier(), GradientBoostingClassifier(),
                                           DecisionTreeClassifier()])
clf.fit(X=train_x, y=train_y)

pred = clf.predict_proba(X=test_x)[:, 1]
auc_val = roc_auc_score(y_true=test_y, y_score=pred)
print(auc_val)

[1]. train base learner...
altogether train 3 models.
1/3(model_index:0) starts
2/3(model_index:1) starts
3/3(model_index:2) starts
[1]. train base learner done, cost 0 seconds.
train done.
0.9795538570699464


#### Method 2
Add random feature selection and bootstrap sampling.

In [10]:
clf = BaggingClassifier(base_learner_list=[RandomForestClassifier(), GradientBoostingClassifier(),
                                           DecisionTreeClassifier()],
                        feature_fraction=0.8, bootstrap=True)
clf.fit(X=train_x, y=train_y)

pred = clf.predict_proba(X=test_x)[:, 1]
auc_val = roc_auc_score(y_true=test_y, y_score=pred)
print(auc_val)

[1]. train base learner...
altogether train 3 models.
1/3(model_index:0) starts
2/3(model_index:1) starts
3/3(model_index:2) starts
[1]. train base learner done, cost 0 seconds.
train done.
0.9774732319576904
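
Bootstrap sampling draws as many rows as the dataset has, with replacement. The NumPy sketch below illustrates the general technique; how `bootstrap=True` is implemented inside the library is not shown in this document.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 10_000

# A bootstrap sample draws n_rows indices with replacement, so some rows
# appear several times and others not at all.
boot_idx = rng.integers(0, n_rows, size=n_rows)
unique_fraction = np.unique(boot_idx).size / n_rows
print(round(unique_fraction, 3))
```

For large datasets the fraction of distinct rows in a bootstrap sample approaches 1 - 1/e, roughly 63%, so each base learner effectively trains on a different subset of the data.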


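#### Method 3

The methods above produce predictions by simple averaging, which is sufficient in most cases. To give each base learner a weight instead, first evaluate every base learner on the data to obtain a metric, then convert the metrics into weights with a built-in or custom transform to get a weighted prediction. The remaining methods focus on this family of approaches.

The example below uses the mean AUC over 5-fold cross-validation as each base learner's metric, then normalizes the metrics with softmax to obtain the weights for the weighted prediction.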

In [11]:
clf = BaggingClassifier(base_learner_list=[RandomForestClassifier(), GradientBoostingClassifier(),
                                           DecisionTreeClassifier()],
                        feature_fraction=0.8, bootstrap=True,
                        get_model_metric=True, metric_to_weight="softmax", metric_func=roc_auc_score, metric_k_fold=5,
                        predict_strategy="weight")
clf.fit(X=train_x, y=train_y)

pred = clf.predict_proba(X=test_x)[:, 1]
auc_val = roc_auc_score(y_true=test_y, y_score=pred)
print(auc_val)

[1]. train base learner...
altogether train 3 models.
1/3(model_index:0) starts
2/3(model_index:1) starts
3/3(model_index:2) starts
[1]. train base learner done, cost 0 seconds.
[2]. get metric...
[2]. get metric done, cost 3.
train done.
0.9750111589081875
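
The metric-to-weight step can be sketched with NumPy. The `softmax` function, the metric values, and the probability matrix below are assumptions made for illustration; the library's own `"softmax"` option may differ in detail.

```python
import numpy as np

def softmax(values):
    # Subtract the max before exponentiating for numerical stability.
    shifted = np.asarray(values, dtype=float) - np.max(values)
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical mean AUCs for three base learners (illustrative numbers only).
metrics = [0.965, 0.982, 0.902]
weights = softmax(metrics)

# Hypothetical positive-class probabilities, one row per base learner.
probs = np.array([[0.9, 0.2, 0.6],
                  [0.8, 0.1, 0.7],
                  [0.6, 0.4, 0.5]])
weighted_pred = weights @ probs      # weighted average per sample
plain_pred = probs.mean(axis=0)      # unweighted average, for comparison
print(weights.round(3), weighted_pred.round(3), plain_pred.round(3))
```

Because AUC values lie in [0, 1] and are often close together, a plain softmax over them gives weights that stay near uniform, so the weighted prediction tends to differ only slightly from the simple average.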


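#### Method 4

This method is the same as Method 3, except that the metric is computed on a validation set.
70% of the input data is used as the training set and the remaining 30% as the validation set. The AUC on that 30% validation set serves as the metric, and softmax then converts the metrics into weights (a probability distribution).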

In [12]:
clf = BaggingClassifier(base_learner_list=[RandomForestClassifier(), GradientBoostingClassifier(),
                                           DecisionTreeClassifier()],
                        feature_fraction=0.8, bootstrap=False,
                        get_model_metric=True, metric_func=roc_auc_score, metric_base_train_size=0.7,
                        metric_to_weight="softmax", predict_strategy="weight")
clf.fit(X=train_x, y=train_y)

pred = clf.predict_proba(X=test_x)[:, 1]
auc_val = roc_auc_score(y_true=test_y, y_score=pred)
print(auc_val)

[1]. train base learner...
altogether train 3 models.
1/3(model_index:0) starts
2/3(model_index:1) starts
3/3(model_index:2) starts
[1]. train base learner done, cost 0 seconds.
[2]. get metric...
[2]. get metric done, cost 0.
train done.
0.9785571131593315


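#### Method 5

The approach below uses a custom metric-to-weight transform. `metric_sample_size` is the fraction of the full data that is drawn beforehand, and the metric is evaluated on that part only. On large datasets this can noticeably reduce training time.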

In [13]:
def metric_to_weight(metrics):
    model_weight = np.array(metrics)
    model_weight = model_weight / sum(model_weight)
    return model_weight

clf = BaggingClassifier(base_learner_list=[RandomForestClassifier(), GradientBoostingClassifier(),
                                           DecisionTreeClassifier()],
                        feature_fraction=0.8, bootstrap=False, sample_fraction=0.9,
                        get_model_metric=True, metric_sample_size=0.8,
                        metric_func=roc_auc_score, metric_base_train_size=0.7,
                        metric_to_weight=metric_to_weight,
                        predict_strategy="weight", random_state=222)
clf.fit(X=train_x, y=train_y)

pred = clf.predict_proba(X=test_x)[:, 1]
auc_val = roc_auc_score(y_true=test_y, y_score=pred)
print(auc_val)

[1]. train base learner...
altogether train 3 models.
1/3(model_index:0) starts
2/3(model_index:1) starts
3/3(model_index:2) starts
[1]. train base learner done, cost 0 seconds.
[2]. get metric...
[2]. get metric done, cost 0.
train done.
0.9805123672712335
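
A custom transform is just a function from a list of metrics to an array of weights. The sketch below repeats the sum normalization used in the cell above and contrasts it with `sharpened_weight`, a hypothetical alternative that is not part of the library; the metric values are illustrative.

```python
import numpy as np

def metric_to_weight(metrics):
    # Same transform as in the cell above: divide each metric by the total.
    model_weight = np.array(metrics, dtype=float)
    return model_weight / model_weight.sum()

def sharpened_weight(metrics, power=10):
    # Hypothetical alternative: raise metrics to a power before normalizing,
    # which widens the gap between strong and weak learners.
    model_weight = np.array(metrics, dtype=float) ** power
    return model_weight / model_weight.sum()

metrics = [0.965, 0.982, 0.902]   # illustrative AUCs, not library output
w_sum = metric_to_weight(metrics)
w_sharp = sharpened_weight(metrics)
print(w_sum.round(3), w_sharp.round(3))
```

Either function could be passed as `metric_to_weight`, assuming the callback contract is the one the cell above implies: a sequence of metrics in, one weight per base learner out.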


#### Method 6

The single-machine version supports multiprocessing; only two parameters, `enable_multiprocess` and `n_jobs`, need to be set.

In [14]:
def metric_to_weight(metrics):
    model_weight = np.array(metrics)
    model_weight = model_weight / sum(model_weight)
    return model_weight

clf = BaggingClassifier(base_learner_list=[RandomForestClassifier(), GradientBoostingClassifier(),
                                           DecisionTreeClassifier()],
                        feature_fraction=0.8, bootstrap=False, sample_fraction=0.9,
                        get_model_metric=True, metric_sample_size=0.8,
                        metric_func=roc_auc_score, metric_base_train_size=0.7,
                        metric_to_weight=metric_to_weight,
                        predict_strategy="weight", enable_multiprocess=True,
                        n_jobs=2, random_state=222)
clf.fit(X=train_x, y=train_y)

pred = clf.predict_proba(X=test_x)[:, 1]
auc_val = roc_auc_score(y_true=test_y, y_score=pred)
print(auc_val)

[1]. train base learner...
altogether train 3 models.
1/3(model_index:0) starts
2/3(model_index:1) starts
3/3(model_index:2) starts
[1]. train base learner done, cost 0 seconds.
[2]. get metric...
[2]. get metric done, cost 0.
train done.
0.9796018714956138


### 3.2 Distributed version

Usage matches that of Stacking and Blending above: just add the two extra parameters `spark` and `distribute`.