**Blending**  

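Blending is based on a simple idea: **instead of using a simple function (for example, hard voting as in bagging) to aggregate the predictions of all predictors, why not train a model to perform the aggregation?**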

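**Steps of Blending ensemble learning:**  
* 1, Split the training set into two parts: train1 and train2.
* 2, Layer 1: train several predictors on train1.
* 3, Layer 2: feed train2 to the predictors trained in the previous layer and collect their predictions.
* 4, Layer 3: train one predictor that takes the predictions from the previous layer as training data and the labels of train2 as the target variable. The Blending model is then fully trained.
* 5, At prediction time, pass the input sample through the layers; the output of layer 3 is the prediction of the Blending ensemble.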

In [22]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import datasets
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

In [18]:
# Create the data
data, target = make_blobs(n_samples=10000,
                          n_features=2,
                          centers=2,
                          random_state=1,
                          cluster_std=1.0)
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2,
                                                    random_state=1)
# Split the training set into train1 and train2
X_train1, X_train2, y_train1, y_train2 = train_test_split(X_train,
                                                          y_train,
                                                          test_size=0.3,
                                                          random_state=1)

print("The shape of X_train1:",X_train1.shape)
print("The shape of X_train2:",X_train2.shape)
print("The shape of X_test:",X_test.shape)

The shape of X_train1: (5600, 2)
The shape of X_train2: (2400, 2)
The shape of X_test: (2000, 2)


In [30]:
# Define the Blending model

# Input-layer (base) models
layer1 = [SVC(probability=True),
          RandomForestClassifier(n_estimators=5,
                                 n_jobs=-1,
                                 criterion='gini'),
          KNeighborsClassifier()]

# Output-layer (meta) model
layer2 = LinearRegression()

In [31]:
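# Train the Blending model
train2_features = np.zeros((X_train2.shape[0], len(layer1)))  # initialize the validation-set (train2) predictions

# Train the input-layer models on train1
for i, clf in enumerate(layer1):
    clf.fit(X_train1, y_train1)
    # P(class 1) on train2 becomes one meta-feature column per model
    feature = clf.predict_proba(X_train2)[:, 1]
    train2_features[:, i] = feature

# Train the output-layer model on the meta-features
layer2.fit(train2_features, y_train2)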

LinearRegression()

In [34]:
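# Test the Blending model
test_features = np.zeros((X_test.shape[0], len(layer1)))
for i, clf in enumerate(layer1):
    feature = clf.predict_proba(X_test)[:, 1]
    test_features[:, i] = feature
# Note: cross_val_score clones layer2 and refits it on folds of test_features,
# so these are 5-fold R^2 scores of LinearRegression on the test set,
# not the accuracy of the layer2 model fitted on train2 above.
cross_val_score(layer2, test_features, y_test, cv=5)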

array([1., 1., 1., 1., 1.])
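
The scores above come from `cross_val_score`, which clones `layer2` and refits it on folds of the test features, so they do not measure the blender that was fitted on train2. A minimal sketch of scoring the already-fitted blender directly could look like the following. It assumes a smaller sample (1,000 points) so it runs quickly, fixed `random_state` values on the base models, and a 0.5 threshold to turn the continuous `LinearRegression` output into class labels; the helper `meta_features` is a hypothetical name and not part of the notebook.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Assumption: fewer samples than in the notebook, only to keep the sketch fast.
data, target = make_blobs(n_samples=1000, n_features=2, centers=2,
                          random_state=1, cluster_std=1.0)
X_train, X_test, y_train, y_test = train_test_split(
    data, target, test_size=0.2, random_state=1)
X_train1, X_train2, y_train1, y_train2 = train_test_split(
    X_train, y_train, test_size=0.3, random_state=1)

layer1 = [SVC(probability=True, random_state=1),
          RandomForestClassifier(n_estimators=5, random_state=1),
          KNeighborsClassifier()]
layer2 = LinearRegression()


def meta_features(models, X):
    """Hypothetical helper: one column of P(class 1) per fitted base model."""
    return np.column_stack([m.predict_proba(X)[:, 1] for m in models])


# Base models learn from train1; the blender learns from their train2 outputs.
for clf in layer1:
    clf.fit(X_train1, y_train1)
layer2.fit(meta_features(layer1, X_train2), y_train2)

# Score the fitted blender on the untouched test set.
raw_scores = layer2.predict(meta_features(layer1, X_test))
y_pred = (raw_scores >= 0.5).astype(int)  # assumed cut-off for a 0/1 target
blend_accuracy = accuracy_score(y_test, y_pred)
print("Blending accuracy on X_test:", blend_accuracy)
```

Using a classifier such as `LogisticRegression` as the output layer would avoid the manual threshold, since its `predict` already returns class labels.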

**Iris example**

In [56]:
from sklearn.linear_model import LogisticRegression

In [39]:
iris = datasets.load_iris()
data = iris.data
target = iris.target

In [43]:
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2,
                                                    random_state=1)
# Split the training set into train1 and train2
X_train1, X_train2, y_train1, y_train2 = train_test_split(X_train,
                                                          y_train,
                                                          test_size=0.3,
                                                          random_state=1)

In [57]:
# Define the Blending model

# Input-layer (base) models
layer1 = [SVC(probability=True),
          RandomForestClassifier(n_estimators=200,
                                 n_jobs=-1,
                                 criterion='gini'),
          KNeighborsClassifier()]

# Output-layer (meta) model
layer2 = LogisticRegression()

In [58]:
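# Train the Blending model
train2_features = np.zeros((X_train2.shape[0], len(layer1)))  # initialize the validation-set (train2) predictions

# Train the input-layer models on train1
for i, clf in enumerate(layer1):
    clf.fit(X_train1, y_train1)
    # Note: Iris has three classes; [:, 1] keeps only P(class 1) from each model
    feature = clf.predict_proba(X_train2)[:, 1]
    train2_features[:, i] = feature

# Train the output-layer model on the meta-features
layer2.fit(train2_features, y_train2)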

LogisticRegression()

In [59]:
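# Test the Blending model
test_features = np.zeros((X_test.shape[0], len(layer1)))
for i, clf in enumerate(layer1):
    feature = clf.predict_proba(X_test)[:, 1]
    test_features[:, i] = feature
# Note: cross_val_score clones layer2 and refits it on folds of the 30 test
# samples, so these are 5-fold accuracies on the test features,
# not the accuracy of the layer2 model fitted on train2 above.
cross_val_score(layer2, test_features, y_test, cv=5)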

array([0.83333333, 0.83333333, 0.83333333, 0.66666667, 0.66666667])
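
Iris has three classes, so `predict_proba(...)[:, 1]` passes only the probability of class 1 from each base model to the output layer. One possible variant, sketched here under assumptions, stacks all class probabilities from every base model (3 models x 3 classes = 9 meta-features) and scores the fitted `layer2` directly on the test set. The helper `stack_probabilities`, `max_iter=1000`, and the `random_state=1` values on the base models are illustrative choices and not part of the notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=1)
X_train1, X_train2, y_train1, y_train2 = train_test_split(
    X_train, y_train, test_size=0.3, random_state=1)

layer1 = [SVC(probability=True, random_state=1),
          RandomForestClassifier(n_estimators=200, random_state=1),
          KNeighborsClassifier()]
layer2 = LogisticRegression(max_iter=1000)  # assumed iteration budget


def stack_probabilities(models, X):
    """Hypothetical helper: concatenate every model's full class-probability matrix."""
    return np.hstack([m.predict_proba(X) for m in models])


# Base models learn from train1.
for clf in layer1:
    clf.fit(X_train1, y_train1)

# The output layer learns from the base models' predictions on train2.
train2_features = stack_probabilities(layer1, X_train2)
layer2.fit(train2_features, y_train2)

# Evaluate the fitted blender on the held-out test set.
test_features = stack_probabilities(layer1, X_test)
test_accuracy = layer2.score(test_features, y_test)
print("Blending accuracy on the Iris test set:", test_accuracy)
```

With only 30 test samples the accuracy estimate is coarse, so a single number from this split should be read with caution.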