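**Simple weighted fusion**:   
   Regression: arithmetic-mean fusion, geometric-mean fusion;      
   Classification: voting   
   General: rank fusion, log fusion   
**Stacking, blending**:   
   Build multi-layer models and fit a further model on the first-layer predictions.   
**Boosting, bagging**:   
   Ensemble methods built on many trees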

#### Fusing regression outputs and classification probabilities:

1. Simple weighted average: fuse the predictions directly

***

In [1]:
test_pre1 = [1.2,3.2,2.1,6.2]
test_pre2 = [0.9,3.1,2.0,5.9]
test_pre3 = [1.1,2.9,2.2,6.0]

y_test_true = [1,3,2,6]

In [2]:
import numpy as np
import pandas as pd

In [3]:
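def Weighted_method(test_pre1,test_pre2,test_pre3,w=(1/3,1/3,1/3)):
    # Weighted sum of three prediction vectors; w holds one weight per model (equal weights by default)
    Weighted_result = w[0]*pd.Series(test_pre1)+w[1]*pd.Series(test_pre2)+w[2]*pd.Series(test_pre3)
    return Weighted_result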

In [4]:
from sklearn import metrics
print('Pred1 MAE:',metrics.mean_absolute_error(y_test_true,test_pre1))
print('Pred2 MAE:',metrics.mean_absolute_error(y_test_true,test_pre2))
print('Pred3 MAE:',metrics.mean_absolute_error(y_test_true,test_pre3))

Pred1 MAE: 0.1750000000000001
Pred2 MAE: 0.07499999999999993
Pred3 MAE: 0.10000000000000009


In [5]:
w = [0.3,0.4,0.3]
Weighted_pre = Weighted_method(test_pre1,test_pre2,test_pre3,w)
print('Weighted_pre MAE:',metrics.mean_absolute_error(y_test_true,Weighted_pre))

Weighted_pre MAE: 0.05750000000000027


In [None]:
# def Mean_method(test_pre1,test_pre2,test_pre3):
#     Mean_result = pd.concat([pd.Series(test_pre1),pd.Series(test_pre2),pd.Series(test_pre3)],axis=1).mean(axis=1)
#     return Mean_result
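
The overview also lists geometric-mean fusion, which the cells above do not show. Below is a minimal sketch that assumes NumPy only and reuses the three toy prediction vectors; the names `arith_mean` and `geo_mean` are illustrative and not part of the original notebook. The geometric mean is usable here only because every prediction is strictly positive.

```python
import numpy as np

# Same toy predictions as in the cells above
test_pre1 = [1.2, 3.2, 2.1, 6.2]
test_pre2 = [0.9, 3.1, 2.0, 5.9]
test_pre3 = [1.1, 2.9, 2.2, 6.0]

preds = np.array([test_pre1, test_pre2, test_pre3])  # shape: (3 models, 4 samples)

# Arithmetic mean: equal-weight average across the models
arith_mean = preds.mean(axis=0)

# Geometric mean: exp of the mean log-prediction (needs strictly positive values)
geo_mean = np.exp(np.log(preds).mean(axis=0))

print('arithmetic mean:', arith_mean)
print('geometric mean: ', geo_mean)
```

For positive inputs the geometric mean never exceeds the arithmetic mean, so it pulls the fused value slightly towards the smaller predictions.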

***

**Stacking fusion (regression)**
***

In [8]:
from sklearn import linear_model
def Stacking_method(train_reg1,train_reg2,train_reg3,y_train_true,test_pre1,test_pre2,test_pre3,model_L2=None):
    # Default second-level model: a fresh linear regression for every call
    if model_L2 is None:
        model_L2 = linear_model.LinearRegression()
    # Fit the second-level model on the first-level predictions for the training set
    model_L2.fit(pd.concat([pd.Series(train_reg1),pd.Series(train_reg2),pd.Series(train_reg3)],axis=1).values,y_train_true)
    Stacking_result = model_L2.predict(pd.concat([pd.Series(test_pre1),pd.Series(test_pre2),pd.Series(test_pre3)],axis=1).values)
    return Stacking_result

In [9]:
train_reg1 = [3.2,8.2,9.1,5.2]
train_reg2 = [2.9,8.1,9.0,4.9]
train_reg3 = [3.1,7.9,9.2,5.0]

y_train_true = [3,8,9,5]

test_pre1 = [1.2,3.2,2.1,6.2]
test_pre2 = [0.9,3.1,2.0,5.9]
test_pre3 = [1.1,2.9,2.2,6.0]

y_test_true = [1,3,2,6]

In [10]:
model_L2 = linear_model.LinearRegression()
Stacking_pre = Stacking_method(train_reg1,train_reg2,train_reg3,y_train_true,
                              test_pre1,test_pre2,test_pre3,model_L2)
print('Stacking_pre MAE:',metrics.mean_absolute_error(y_test_true,Stacking_pre))

Stacking_pre MAE: 0.04213483146067476


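The stacked result improves further on the earlier ones. Note that the second-level stacking model should not be too complex, otherwise it risks overfitting.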
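
One way to see the overfitting risk on this toy data: the second-level linear regression has four free parameters (three coefficients and an intercept) but only four training rows, so it can reproduce the training targets almost exactly. The sketch below reuses the training values from the cell above to check this; the variable names `X_train_L2` and `train_mae` are illustrative.

```python
import numpy as np
from sklearn import linear_model, metrics

# First-level predictions on the training set and the true training targets
train_reg1 = [3.2, 8.2, 9.1, 5.2]
train_reg2 = [2.9, 8.1, 9.0, 4.9]
train_reg3 = [3.1, 7.9, 9.2, 5.0]
y_train_true = [3, 8, 9, 5]

# One column per first-level model
X_train_L2 = np.column_stack([train_reg1, train_reg2, train_reg3])
model_L2 = linear_model.LinearRegression().fit(X_train_L2, y_train_true)

# Error of the second-level model on the very rows it was fitted on
train_mae = metrics.mean_absolute_error(y_train_true, model_L2.predict(X_train_L2))
print('coefficients:', model_L2.coef_, 'intercept:', model_L2.intercept_)
print('training MAE:', train_mae)
```

A training error this close to zero says nothing about generalisation, which is why a simple second-level model and held-out evaluation matter.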

In [11]:
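# Fuse several single models
# NOTE: `data` (feature matrix) and `target` (binary labels) must already exist as NumPy arrays
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier,ExtraTreesClassifier,GradientBoostingClassifier
from sklearn.model_selection import train_test_split,StratifiedKFold
from sklearn.metrics import roc_auc_score

clfs = [LogisticRegression(solver='lbfgs'),
       RandomForestClassifier(n_estimators=5,n_jobs=-1,criterion='gini'),
       ExtraTreesClassifier(n_estimators=5,n_jobs=-1,criterion='gini'),
       ExtraTreesClassifier(n_estimators=5,n_jobs=-1,criterion='entropy'),
       GradientBoostingClassifier(learning_rate=0.05,subsample=0.5,max_depth=6,n_estimators=5)]

# Hold out part of the data as a test set
X,X_predict,y,y_predict = train_test_split(data,target,test_size=0.3,random_state=2020)

dataset_blend_train = np.zeros((X.shape[0],len(clfs)))         # matrix A: out-of-fold predictions, one column per model
dataset_blend_test = np.zeros((X_predict.shape[0],len(clfs)))    # matrix B: test-set predictions, one column per model

# 5-fold stacking
n_splits = 5
skf = StratifiedKFold(n_splits)
folds = list(skf.split(X,y))    # split() returns a one-shot generator, so store the folds for reuse by every model

for j,clf in enumerate(clfs):  # loop over the models
    
    # Train each single model in turn
    dataset_blend_test_j = np.zeros((X_predict.shape[0],n_splits))  # test-set predictions of model j, one column per fold
    for i,(train,test) in enumerate(folds):    # train / validation indices of fold i
        
        # Fit on all folds except fold i, then predict fold i; these predictions become the new feature for fold i
        X_train,y_train,X_test,y_test = X[train],y[train],X[test],y[test]
        clf.fit(X_train,y_train)
        y_submission = clf.predict_proba(X_test)[:,1]
        dataset_blend_train[test,j] = y_submission    # out-of-fold predictions fill column j of A
        dataset_blend_test_j[:,i] = clf.predict_proba(X_predict)[:,1]     # test-set predictions of the fold-i model
        
    # For the test set, use the mean of the k fold models' predictions as the new feature
    dataset_blend_test[:,j] = dataset_blend_test_j.mean(1)     # column j of B
    print('val auc Score: %f'%roc_auc_score(y_predict,dataset_blend_test[:,j]))
    
# Second-level training and prediction
clf = LogisticRegression(solver='lbfgs')
clf.fit(dataset_blend_train,y)
y_submission = clf.predict_proba(dataset_blend_test)[:,1]
print('Val auc Score of Stacking:%f'%(roc_auc_score(y_predict,y_submission)))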

In [12]:
pip install mlxtend

Collecting mlxtend
  Downloading https://files.pythonhosted.org/packages/86/30/781c0b962a70848db83339567ecab656638c62f05adb064cb33c0ae49244/mlxtend-0.18.0-py2.py3-none-any.whl (1.3MB)
Installing collected packages: mlxtend
Successfully installed mlxtend-0.18.0
Note: you may need to restart the kernel to use updated packages.


In [None]:
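from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from mlxtend.classifier import StackingClassifier
clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = ···
clf3 = ···
lr = LogisticRegression()
sclf = StackingClassifier(classifiers=[clf1,clf2,clf3],   # three base classifiers
                          meta_classifier=lr        # one second-level (meta) classifier
                         )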
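
For a version that runs end to end, scikit-learn ships its own `StackingClassifier` (in `sklearn.ensemble`, available from version 0.22). The sketch below swaps it in for the mlxtend class; the iris dataset, the two base learners, and their settings are assumptions chosen only for illustration and are not the elided classifiers from the cell above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumed example data: iris, with a stratified 70/30 split
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=2020, stratify=y)

# Illustrative base learners (not the ones elided in the original cell)
base_learners = [
    ('knn', KNeighborsClassifier(n_neighbors=1)),
    ('rf', RandomForestClassifier(n_estimators=10, random_state=0)),
]

# The final estimator is trained on 5-fold out-of-fold predictions of the base learners
sclf = StackingClassifier(estimators=base_learners,
                          final_estimator=LogisticRegression(max_iter=1000),
                          cv=5)
sclf.fit(X_tr, y_tr)

acc = sclf.score(X_te, y_te)
print('stacking accuracy: %0.2f' % acc)
```

Keeping the final estimator a plain logistic regression follows the earlier advice that the second-level model should stay simple.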

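## Fusion happens at several levels

1). Result-level fusion: weight the predictions according to each model's score. This works best when the models score similarly but their predictions differ substantially.   
2). Feature-level fusion: more precisely, splitting. Partition the features and give the subsets to different models.   
3). Model-level fusion: stacking. Ideally the model types differ from one another to some degree.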
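
Result-level weighting by score could be sketched as follows. The inverse-MAE rule used here is only one possible scheme and is an assumption of this example, as are the variable names. The weights are computed on the same four toy samples that are then evaluated, which is for illustration only; in practice they would come from a separate validation set.

```python
import numpy as np

# Toy targets and the three models' predictions from the weighted-average example
y_true = np.array([1, 3, 2, 6])
preds = np.array([[1.2, 3.2, 2.1, 6.2],
                  [0.9, 3.1, 2.0, 5.9],
                  [1.1, 2.9, 2.2, 6.0]])

# Per-model mean absolute error
mae = np.abs(preds - y_true).mean(axis=1)

# Assumed scheme: weight each model by 1/MAE, normalised to sum to 1
inv = 1.0 / mae
weights = inv / inv.sum()

# Weighted fusion of the three prediction vectors
fused = weights @ preds
fused_mae = np.abs(fused - y_true).mean()

print('per-model MAE:', mae)
print('weights:', weights)
print('fused MAE:', fused_mae)
```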

## Hard voting

In [None]:
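# Hard voting: every model casts one vote with no weighting by importance; the class with the most votes is the prediction
from sklearn import datasets
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.ensemble import RandomForestClassifier,VotingClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

iris = datasets.load_iris()

x = iris.data
y = iris.target
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.3)
# max_depth must be an integer (3 is an assumed value); iris has three classes, hence the multiclass objective
clf1 = XGBClassifier(learning_rate=0.1,n_estimators=150,max_depth=3,min_child_weight=2,subsample=0.7,
                    colsample_bytree=0.6,objective='multi:softprob')
clf2 = RandomForestClassifier(n_estimators=50,max_depth=1,min_samples_split=4,min_samples_leaf=63,oob_score=True)
clf3 = SVC(C=0.1)

# Hard voting
eclf = VotingClassifier(estimators=[('xgb',clf1),('rf',clf2),('svc',clf3)],voting='hard')
for clf,label in zip([clf1,clf2,clf3,eclf],['XGBoost','Random Forest','SVM','Ensemble']):
    scores = cross_val_score(clf,x,y,cv=5,scoring='accuracy')
    print('Accuracy: %0.2f (+/- %0.2f) [%s]'%(scores.mean(),scores.std(),label))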
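
What hard voting computes can be shown by hand on a few labels. This is a minimal NumPy sketch with made-up predictions; the `hard_vote` helper is hypothetical and is not how `VotingClassifier` is implemented internally.

```python
import numpy as np

# Hypothetical class labels predicted by three models (rows) for five samples (columns)
votes = np.array([[0, 1, 2, 1, 0],
                  [0, 1, 1, 1, 2],
                  [2, 1, 2, 0, 2]])

def hard_vote(votes, n_classes):
    # Count the votes per class for each sample, giving shape (n_classes, n_samples)
    counts = np.apply_along_axis(np.bincount, 0, votes, minlength=n_classes)
    # Most frequent class per sample; ties go to the lowest class index
    return counts.argmax(axis=0)

majority = hard_vote(votes, n_classes=3)
print(majority)  # [0 1 2 1 2]
```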

## Soft voting

In [None]:
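# Soft voting: like hard voting, but averages the models' predicted class probabilities;
# weights can be set per model to reflect their different importance
from sklearn import datasets
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.ensemble import RandomForestClassifier,VotingClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

iris = datasets.load_iris()

x = iris.data
y = iris.target
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.3)
# max_depth must be an integer (3 is an assumed value); iris has three classes, hence the multiclass objective
clf1 = XGBClassifier(learning_rate=0.1,n_estimators=150,max_depth=3,min_child_weight=2,subsample=0.7,
                    colsample_bytree=0.6,objective='multi:softprob')
clf2 = RandomForestClassifier(n_estimators=50,max_depth=1,min_samples_split=4,min_samples_leaf=63,oob_score=True)
clf3 = SVC(C=0.1,probability=True)    # soft voting needs predict_proba, so probability estimates are enabled

# Soft voting
eclf = VotingClassifier(estimators=[('xgb',clf1),('rf',clf2),('svc',clf3)],voting='soft',weights=[2,1,1])
clf1.fit(x_train,y_train)
for clf,label in zip([clf1,clf2,clf3,eclf],['XGBoost','Random Forest','SVM','Ensemble']):
    scores = cross_val_score(clf,x,y,cv=5,scoring='accuracy')
    print('Accuracy: %0.2f (+/- %0.2f) [%s]'%(scores.mean(),scores.std(),label))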
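
The arithmetic behind weighted soft voting can be checked by hand. The sketch below uses made-up probabilities for a two-class problem; the array `probas` and the weights are illustrative assumptions.

```python
import numpy as np

# Hypothetical class probabilities, shape (models, samples, classes)
probas = np.array([
    [[0.8, 0.2], [0.3, 0.7]],   # model 1
    [[0.4, 0.6], [0.2, 0.8]],   # model 2
    [[0.4, 0.6], [0.6, 0.4]],   # model 3
])
weights = [2, 1, 1]             # model 1 counts twice as much as the others

# Weighted average of the probabilities across models, then pick the most probable class
avg = np.average(probas, axis=0, weights=weights)
soft = avg.argmax(axis=1)

print(avg)   # [[0.6  0.4 ] [0.35 0.65]]
print(soft)  # [0 1]
```

For the first sample two of the three models prefer class 1, so a hard vote would pick class 1, while the probability average picks class 0 because model 1 is both more confident and more heavily weighted.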