## Stacking

* Regression problem
* Classification problem
* Writing a stacking module

### Regression problem

**Load the modules**

In [1]:
from sklearn.datasets import load_diabetes
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold
from sklearn import metrics
import numpy as np

**Create the training and test sets**

In [2]:
diabetes = load_diabetes()

train_x, train_y = diabetes.data[:400], diabetes.target[:400]
test_x, test_y = diabetes.data[400:], diabetes.target[400:]

**Create the base learners and the meta-learner**

In [4]:
## Base learners
base_learners = []

knn = KNeighborsRegressor(n_neighbors=5)
base_learners.append(knn)

dtr = DecisionTreeRegressor(max_depth=4, random_state=123456)
base_learners.append(dtr)

ridge = Ridge()
base_learners.append(ridge)

## Meta-learner
meta_learner = LinearRegression()


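After initializing the learners, we need to create the meta-data for the training set. We first specify the number of splits (K) with KFold(n_splits=5) and then call KF.split(train_x), which partitions the training set into five folds and returns the train and test indices for each of the five splits. For each split, we train the base learners on the data given by train_indices (four folds) and create the meta-data on the data given by test_indices (the remaining fold). We store each base learner's meta-data in the meta_data array and the corresponding targets in the meta_targets array. Finally, we transpose meta_data to obtain the (instances, features) shape.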
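To make the fold mechanics concrete, here is a minimal sketch on a toy array. The names `toy_x` and `folds` are illustrative and do not appear in the notebook. It shows the index pairs that `KFold.split` yields and checks that every sample lands in exactly one test fold.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 10 samples with a single feature (illustrative only)
toy_x = np.arange(10).reshape(-1, 1)

# KFold does not shuffle by default, so the test folds are contiguous blocks
kf = KFold(n_splits=5)
folds = [(train_idx, test_idx) for train_idx, test_idx in kf.split(toy_x)]

for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)

# Concatenating the test folds visits every sample exactly once, in order
all_test = np.concatenate([test_idx for _, test_idx in folds])
print(all_test)
```

Because the folds are not shuffled, the out-of-fold predictions are written in the original sample order, so meta_targets ends up identical to train_y.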

**Create the meta-data on the training set**

In [12]:
# Create variables to store metadata and their targets

meta_data = np.zeros((len(base_learners), len(train_x)))
meta_targets = np.zeros(len(train_x))

In [13]:
meta_data.shape

(3, 400)

In [14]:
# Create the cross-validation folds
KF = KFold(n_splits=5)
meta_index = 0
for train_indices, test_indices in KF.split(train_x):
    for i in range(len(base_learners)):
        learner = base_learners[i]
        learner.fit(train_x[train_indices], train_y[train_indices])
        predictions = learner.predict(train_x[test_indices])
        meta_data[i][meta_index:meta_index+len(test_indices)] = predictions

    meta_targets[meta_index:meta_index+len(test_indices)] = train_y[test_indices]
    meta_index += len(test_indices)

# Transpose the metadata to be fed into the meta-learner
meta_data = meta_data.transpose()

In [15]:
meta_data

array([[221.        , 186.46031746, 179.44148461],
       [ 83.2       ,  91.72477064,  94.56884758],
       [134.4       , 186.46031746, 165.29144916],
       ...,
       [204.6       , 168.23076923, 160.66683682],
       [117.4       , 168.23076923, 156.86271927],
       [212.        , 168.23076923, 176.6069636 ]])
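The manual loop above produces out-of-fold predictions. As a hedged cross-check, scikit-learn's `cross_val_predict` helper should produce the same values for a deterministic learner. The sketch below compares the two for `Ridge` only; `manual` and `shortcut` are illustrative names, not part of the notebook.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_predict

diabetes = load_diabetes()
train_x, train_y = diabetes.data[:400], diabetes.target[:400]

kf = KFold(n_splits=5)

# Manual out-of-fold predictions: fit on four folds, predict the fifth
manual = np.zeros(len(train_x))
for train_idx, test_idx in kf.split(train_x):
    model = Ridge().fit(train_x[train_idx], train_y[train_idx])
    manual[test_idx] = model.predict(train_x[test_idx])

# The same procedure through scikit-learn's helper
shortcut = cross_val_predict(Ridge(), train_x, train_y, cv=kf)

print(np.allclose(manual, shortcut))
```

Calling `cross_val_predict` once per base learner and stacking the results column-wise would give an array with the same (instances, features) layout as meta_data.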

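For the test set, we do not need to split the data into folds. We simply train the base learners on the entire training set and predict on the test set. We also evaluate each base learner and store its metrics so that we can compare them with the ensemble's performance.

**Create the meta-data on the test set**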

In [16]:
# Create the metadata for the test set and evaluate the base learners
test_meta_data = np.zeros((len(base_learners), len(test_x)))
base_errors = []
base_r2 = []
for i in range(len(base_learners)):
    learner = base_learners[i]
    learner.fit(train_x, train_y)
    predictions = learner.predict(test_x)
    test_meta_data[i] = predictions

    err = metrics.mean_squared_error(test_y, predictions)
    r2 = metrics.r2_score(test_y, predictions)

    base_errors.append(err)
    base_r2.append(r2)

test_meta_data = test_meta_data.transpose()

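Now that we have meta-data for both the training and test sets, we can train the meta-learner on the training meta-data and evaluate it on the test meta-data.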

In [17]:
# Fit the meta-learner on the train set and evaluate it on the test set
meta_learner.fit(meta_data, meta_targets)
ensemble_predictions = meta_learner.predict(test_meta_data)

err = metrics.mean_squared_error(test_y, ensemble_predictions)
r2 = metrics.r2_score(test_y, ensemble_predictions)

# Print the results 
print('ERROR R2 Name')
print('-'*20)
for i in range(len(base_learners)):
    learner = base_learners[i]
    print(f'{base_errors[i]:.1f} {base_r2[i]:.2f} {learner.__class__.__name__}')
print(f'{err:.1f} {r2:.2f} Ensemble')

ERROR R2 Name
--------------------
2697.8 0.51 KNeighborsRegressor
3142.5 0.43 DecisionTreeRegressor
2564.8 0.54 Ridge
2066.6 0.63 Ensemble


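Compared with the best base learner (ridge regression), the ensemble improves R² by more than 16% (0.54 to 0.63) and reduces MSE by nearly 20% (2564.8 to 2066.6). This is a substantial improvement.

### Classification problem

**Load the modules**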
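The percentages can be checked directly from the printed table. The short calculation below uses the rounded figures as printed, so the results are approximate.

```python
# Figures copied from the results table above (rounded as printed)
ridge_mse, ridge_r2 = 2564.8, 0.54
ens_mse, ens_r2 = 2066.6, 0.63

# Relative change with respect to the best base learner (Ridge)
r2_gain = (ens_r2 - ridge_r2) / ridge_r2 * 100
mse_drop = (ridge_mse - ens_mse) / ridge_mse * 100

print(f"R2 improvement: {r2_gain:.1f}%")  # 16.7%
print(f"MSE reduction: {mse_drop:.1f}%")  # 19.4%
```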

In [18]:
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn import metrics
import numpy as np

**Create the training and test sets**

In [19]:
bc = load_breast_cancer()

train_x, train_y = bc.data[:400], bc.target[:400]
test_x, test_y = bc.data[400:], bc.target[400:]

Next, we initialize the base learners and the meta-learner. The MLPClassifier has a single hidden layer of 100 neurons.

**Initialize the base learners and the meta-learner**

In [24]:
## Base learners
base_learners = []

knn = KNeighborsClassifier(n_neighbors=2)
base_learners.append(knn)

dtr = DecisionTreeClassifier(max_depth=4, random_state=123456)
base_learners.append(dtr)

mlpc = MLPClassifier(hidden_layer_sizes=(100,),
                     solver='lbfgs', random_state=123456)
base_learners.append(mlpc)

## Meta-learner
meta_learner = LogisticRegression(solver='lbfgs')

**Create the meta-data on the training set**

In [25]:
meta_data = np.zeros((len(base_learners), len(train_x)))
meta_targets = np.zeros(len(train_x))

# Create the cross-validation folds
KF = KFold(n_splits=5)
meta_index = 0
for train_indices, test_indices in KF.split(train_x):
    for i in range(len(base_learners)):
        learner = base_learners[i]

        learner.fit(train_x[train_indices], train_y[train_indices])
        predictions = learner.predict_proba(train_x[test_indices])[:,0]

        meta_data[i][meta_index:meta_index+len(test_indices)] = predictions

    meta_targets[meta_index:meta_index+len(test_indices)] = train_y[test_indices]
    meta_index += len(test_indices)

# Transpose the metadata to be fed into the meta-learner
meta_data = meta_data.transpose()
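The loop above keeps only column 0 of `predict_proba` as the meta-feature. The sketch below, using a single decision tree for illustration, shows why one column is enough in the binary case: the columns follow the order of `classes_`, and each row sums to 1. The names `clf`, `proba` and `row_sums` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

bc = load_breast_cancer()
train_x, train_y = bc.data[:400], bc.target[:400]

clf = DecisionTreeClassifier(max_depth=4, random_state=123456)
clf.fit(train_x, train_y)

# One row per sample, one column per class, ordered as in clf.classes_
proba = clf.predict_proba(train_x[:5])
print(clf.classes_)
print(proba.shape)

# Each row sums to 1, so column 1 is fully determined by column 0
row_sums = proba.sum(axis=1)
print(np.allclose(row_sums, 1.0))
```

Passing both columns to the meta-learner would therefore add a redundant feature for a two-class problem.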

**Create the meta-data on the test set**

In [26]:
test_meta_data = np.zeros((len(base_learners), len(test_x)))
base_acc = []
for i in range(len(base_learners)):
    learner = base_learners[i]
    learner.fit(train_x, train_y)
    predictions = learner.predict_proba(test_x)[:,0]
    test_meta_data[i] = predictions

    acc = metrics.accuracy_score(test_y, learner.predict(test_x))
    base_acc.append(acc)
test_meta_data = test_meta_data.transpose()

**Train the meta-learner and print the results**

In [27]:
# Fit the meta-learner on the train set and evaluate it on the test set
meta_learner.fit(meta_data, meta_targets)
ensemble_predictions = meta_learner.predict(test_meta_data)

acc = metrics.accuracy_score(test_y, ensemble_predictions)

# Print the results
print('Acc Name')
print('-'*20)
for i in range(len(base_learners)):
    learner = base_learners[i]
    print(f'{base_acc[i]:.2f} {learner.__class__.__name__}')
print(f'{acc:.2f} Ensemble')

Acc Name
--------------------
0.86 KNeighborsClassifier
0.88 DecisionTreeClassifier
0.23 MLPClassifier
0.91 Ensemble


### Writing a stacking module

We can consolidate the code above into a reusable module.

In [31]:
import numpy as np
from sklearn.model_selection import KFold
from copy import deepcopy

In [32]:
class StackingRegressor:

    def __init__(self, learners):
        # Create a list of sizes for each stacking level
        # and a list of deep-copied learners
        self.level_sizes = []
        self.learners = []
        for learning_level in learners:

            self.level_sizes.append(len(learning_level))
            level_learners = []
            for learner in learning_level:
                level_learners.append(deepcopy(learner))
            self.learners.append(level_learners)

    # Creates training meta data for every level and trains each level on the previous level's meta data
    def fit(self, x, y):
        # Create a list of training meta data, one for each stacking level
        # and another one for the targets. For the first level, the actual data
        # is used.
        meta_data = [x]
        meta_targets = [y]
        for i in range(len(self.learners)):
            level_size = self.level_sizes[i]

            # Create the meta data and target variables for this level
            data_z = np.zeros((level_size, len(x)))
            target_z = np.zeros(len(x))

            train_x = meta_data[i]
            train_y = meta_targets[i]

            # Create the cross-validation folds
            KF = KFold(n_splits=5)
            meta_index = 0
            for train_indices, test_indices in KF.split(train_x):
                # Train each learner on the K-1 folds and create
                # meta data for the Kth fold
                for j in range(len(self.learners[i])):

                    learner = self.learners[i][j]
                    learner.fit(train_x[train_indices], train_y[train_indices])
                    predictions = learner.predict(train_x[test_indices])

                    data_z[j][meta_index:meta_index+len(test_indices)] = predictions

                target_z[meta_index:meta_index+len(test_indices)] = train_y[test_indices]
                meta_index += len(test_indices)

            # Add the data and targets to the meta data lists
            data_z = data_z.transpose()
            meta_data.append(data_z)
            meta_targets.append(target_z)


            # Train the learners on the whole previous meta data
            for learner in self.learners[i]:
                learner.fit(train_x, train_y)

    # The predict function. Creates meta data for the test data and returns
    # all of them. The actual predictions can be accessed with meta_data[-1]
    def predict(self, x):

        # Create a list of test meta data, one for each stacking level.
        # The first entry is the actual data.
        meta_data = [x]
        for i in range(len(self.learners)):
            level_size = self.level_sizes[i]

            data_z = np.zeros((level_size, len(x)))

            test_x = meta_data[i]

            # The learners were already fitted on the full data in fit(),
            # so no cross-validation is needed here: each learner simply
            # predicts on the previous level's output
            for j in range(len(self.learners[i])):

                learner = self.learners[i][j]
                predictions = learner.predict(test_x)
                data_z[j] = predictions

            # Add this level's predictions to the meta data list
            data_z = data_z.transpose()
            meta_data.append(data_z)

        # Return the meta data; the final level's predictions can be
        # accessed with meta_data[-1]
        return meta_data

We can now call the module defined above directly.

In [34]:
diabetes = load_diabetes()

train_x, train_y = diabetes.data[:400], diabetes.target[:400]
test_x, test_y = diabetes.data[400:], diabetes.target[400:]

base_learners = []

knn = KNeighborsRegressor(n_neighbors=5)
base_learners.append(knn)

dtr = DecisionTreeRegressor(max_depth=4, random_state=123456)
base_learners.append(dtr)

ridge = Ridge()
base_learners.append(ridge)

meta_learner = LinearRegression()

# Instantiate the stacking regressor
sc = StackingRegressor([[knn,dtr,ridge],[meta_learner]])

# Fit and predict
sc.fit(train_x, train_y)
meta_data = sc.predict(test_x)

# Evaluate base learners and meta-learner
base_errors = []
base_r2 = []
for i in range(len(base_learners)):
    learner = base_learners[i]
    predictions = meta_data[1][:,i]
    err = metrics.mean_squared_error(test_y, predictions)
    r2 = metrics.r2_score(test_y, predictions)
    base_errors.append(err)
    base_r2.append(r2)

err = metrics.mean_squared_error(test_y, meta_data[-1])
r2 = metrics.r2_score(test_y, meta_data[-1])

# Print the results
print('ERROR R2 Name')
print('-'*20)
for i in range(len(base_learners)):
    learner = base_learners[i]
    print(f'{base_errors[i]:.1f} {base_r2[i]:.2f} {learner.__class__.__name__}')
print(f'{err:.1f} {r2:.2f} Ensemble')

ERROR R2 Name
--------------------
2697.8 0.51 KNeighborsRegressor
3142.5 0.43 DecisionTreeRegressor
2564.8 0.54 Ridge
2066.6 0.63 Ensemble


This matches the results of the step-by-step approach above.
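For comparison, scikit-learn (version 0.22 and later) ships its own `sklearn.ensemble.StackingRegressor`. The sketch below is one possible way to set up the same three base learners with it. It is imported under the alias `SklearnStackingRegressor`, an illustrative name chosen to avoid clashing with the class defined above. Its scores are not guaranteed to match the table above exactly, so no expected output is shown.

```python
from sklearn import metrics
from sklearn.datasets import load_diabetes
from sklearn.ensemble import StackingRegressor as SklearnStackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

diabetes = load_diabetes()
train_x, train_y = diabetes.data[:400], diabetes.target[:400]
test_x, test_y = diabetes.data[400:], diabetes.target[400:]

# The built-in class expects (name, estimator) pairs for the base learners
estimators = [
    ('knn', KNeighborsRegressor(n_neighbors=5)),
    ('dtr', DecisionTreeRegressor(max_depth=4, random_state=123456)),
    ('ridge', Ridge()),
]

# cv=5 requests five cross-validation folds for building the meta-data
stack = SklearnStackingRegressor(
    estimators=estimators,
    final_estimator=LinearRegression(),
    cv=5,
)
stack.fit(train_x, train_y)
preds = stack.predict(test_x)

err = metrics.mean_squared_error(test_y, preds)
r2 = metrics.r2_score(test_y, preds)
print(f'{err:.1f} {r2:.2f} Ensemble (sklearn)')
```

Unlike the hand-written class, the built-in version supports only one stacking level per instance. Deeper stacks would need a stacking estimator passed as `final_estimator`.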