## Ensembling with Stacking

> Stacking (also called meta ensembling) is a model ensembling technique used to combine information from multiple predictive models to generate a new model. Often times the stacked model (also called 2nd-level model) will outperform each of the individual models due its smoothing nature and ability to highlight each base model where it performs best and discredit each base model where it performs poorly. For this reason, stacking is most effective when the base models are significantly different. 

http://blog.kaggle.com/2016/12/27/a-kagglers-guide-to-model-stacking-in-practice/

In [1]:
import pandas as pd
import numpy as np
workdir = "/home/ubuntu/data/"

In [2]:
train = pd.read_csv(workdir+'numerai_training_data.csv')

In [3]:
train.columns

Index([u'feature1', u'feature2', u'feature3', u'feature4', u'feature5',
       u'feature6', u'feature7', u'feature8', u'feature9', u'feature10',
       u'feature11', u'feature12', u'feature13', u'feature14', u'feature15',
       u'feature16', u'feature17', u'feature18', u'feature19', u'feature20',
       u'feature21', u'feature22', u'feature23', u'feature24', u'feature25',
       u'feature26', u'feature27', u'feature28', u'feature29', u'feature30',
       u'feature31', u'feature32', u'feature33', u'feature34', u'feature35',
       u'feature36', u'feature37', u'feature38', u'feature39', u'feature40',
       u'feature41', u'feature42', u'feature43', u'feature44', u'feature45',
       u'feature46', u'feature47', u'feature48', u'feature49', u'feature50',
       u'target'],
      dtype='object')

In [4]:
col_names = {
    "folds" : "fold", 
    "target" : "target", 
    "ids" : "t_id", 
    "features" : ["feature%d" % i for i in range(1, 51)]  # feature1 ... feature50
    }

The first step is to partition the training data into folds.

In [5]:
n_folds = 4

In [6]:
train[col_names["folds"]] = np.random.choice(range(1, n_folds + 1), train.shape[0])

In [7]:
#train.to_csv(workdir+'numerai_training_data_folds.csv', index=False)
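
`np.random.choice` draws each row's fold independently, so fold sizes differ slightly and the split changes on every run. A minimal sketch of one alternative, assuming reproducible folds of near-equal size are wanted. The helper name `assign_folds` and the seed value are illustrative and not part of the original notebook:

```python
import numpy as np

def assign_folds(n_rows, n_folds, seed=0):
    """Return an array of fold labels in 1..n_folds with near-equal fold sizes."""
    rng = np.random.RandomState(seed)         # fixed seed makes the split reproducible
    labels = np.arange(n_rows) % n_folds + 1  # 1, 2, ..., n_folds, 1, 2, ...
    return rng.permutation(labels)            # shuffle so folds do not follow row order

demo_folds = assign_folds(10, 4)
print(np.bincount(demo_folds)[1:])  # number of rows in each fold
```

In the notebook this could be used as `train[col_names["folds"]] = assign_folds(train.shape[0], n_folds)`.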

Each fold in turn serves as the test set while the remaining folds form the training set. The current base model's predictions for the held-out fold are stored in a new column of the training dataset.

In [8]:
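def predict_by_folds(model, model_name, train_df, folds_col, target_col, features_col):
    """Add a column named `model_name` holding out-of-fold predictions of `model`."""
    folds = train_df[folds_col]
    target = train_df[target_col]
    data = train_df[features_col]

    for f in folds.unique():
        train_i = folds != f    # True for rows used to fit, False for the held-out fold
        train1 = data[train_i]
        test1 = data[~train_i]  # `~` is the idiomatic negation of a boolean mask
        y1 = target[train_i]
        # .values replaces .as_matrix(), which was removed in pandas 1.0
        model.fit(train1.values, y1.values.ravel())
        y1_pred = model.predict_proba(test1.values)
        train_df.loc[~train_i, model_name] = y1_pred[:, 1]  # probability of class 1

    return train_df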
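
For comparison, scikit-learn can produce the same kind of out-of-fold predictions with `cross_val_predict`. The sketch below runs on synthetic data (the arrays `X_demo` and `y_demo` are made up for illustration) and assumes a scikit-learn version that provides `sklearn.model_selection`. It is not the code used for the results in this notebook:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic binary problem, only to make the sketch runnable.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Every row is predicted by a model that never saw that row during fitting.
cv = KFold(n_splits=4, shuffle=True, random_state=0)
oof_demo = cross_val_predict(
    LogisticRegression(), X_demo, y_demo, cv=cv, method="predict_proba"
)[:, 1]  # keep the probability of class 1, as predict_by_folds does

print(oof_demo.shape)
```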

Each base model is then fitted on the full training dataset and used to make predictions on the test set. These predictions are stored in a new dataset.

In [9]:
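def make_predictions(model, train_df, test_df, target_col, features_col):
    """Fit `model` on the full training set and return class-1 probabilities for the test set."""
    # .values replaces .as_matrix(), which was removed in pandas 1.0
    X_train = train_df[features_col].values
    y_train = train_df[target_col].values.ravel()
    X_test = test_df[features_col].values
    model.fit(X_train, y_train)
    y_pred = model.predict_proba(X_test)
    y_pred = y_pred[:, 1]  # probability of class 1
    return y_pred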

In [10]:
test = pd.read_csv(workdir+'numerai_tournament_data.csv')

In [13]:
def models_training_and_prediction(models, train_set, test_set, column_names):
    # Copy the id column so new prediction columns are not written to a view of test_set
    l1_predictions = test_set[[column_names["ids"]]].copy()
    train = train_set.copy()
    for model in models:
        model_name = str(model).split("(")[0]  # class name, e.g. "GaussianNB"
        print("Training %s ..." % (model_name))
        # Out-of-fold predictions on the training set (features for the level 2 model)
        train = predict_by_folds(model, model_name, train, column_names["folds"], column_names["target"], column_names["features"])
        # Refit on the full training set and predict the test set
        prediction = make_predictions(model, train_set, test_set, column_names["target"], column_names["features"])
        l1_predictions[model_name] = prediction

    return train, l1_predictions
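
Because the column name is taken from the class name, two base models of the same class (for example two logistic regressions with different `C`) would write to the same column and the second would overwrite the first. A minimal sketch of one way to keep names unique. The helper `unique_model_names` and the stand-in classes `ModelA` and `ModelB` are hypothetical and not part of the original notebook:

```python
from collections import Counter

def unique_model_names(models):
    """Return one column name per model, numbering class names that repeat."""
    names = [type(m).__name__ for m in models]
    totals = Counter(names)  # how often each class name occurs overall
    seen = Counter()         # how often each class name has occurred so far
    unique = []
    for name in names:
        if totals[name] > 1:
            seen[name] += 1
            unique.append("%s_%d" % (name, seen[name]))
        else:
            unique.append(name)
    return unique

# Stand-in classes so the sketch runs without any ML library installed.
class ModelA(object):
    pass

class ModelB(object):
    pass

print(unique_model_names([ModelB(), ModelA(), ModelB()]))
# ['ModelB_1', 'ModelA', 'ModelB_2']
```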

In [24]:
l1_models = []

### Level 1 Models

We'll use 6 models as level 1 (or base) models. These algorithms were chosen to be significantly different from each other because diversity is desirable when stacking. The hyperparameters of each model were individually optimized (in most cases).

In [25]:
#Model 1: XGBoost
from xgboost import XGBClassifier
best_params = {'colsample_bytree': 0.3, 'learning_rate': 0.01, 'min_child_weight': 3.0, 
               'n_estimators': 400, 'subsample': 0.2, 'max_depth': 5, 'gamma': 0.95}
l1_models.append(XGBClassifier(**best_params))

In [26]:
#Model 2: Naive Bayes
from sklearn.naive_bayes import GaussianNB
l1_models.append(GaussianNB())

In [27]:
#Model 3: LogisticRegression
from sklearn.linear_model import LogisticRegression
best_params = {'max_iter':500, 'C':0.1}
l1_models.append(LogisticRegression(**best_params))

In [28]:
#Model 4: Neural Network
from sklearn.neural_network import MLPClassifier
best_params = {
    "hidden_layer_sizes":(100, 100, 50, ), "activation":'tanh', "solver":'adam', 
    "alpha":0.0001, "learning_rate":'invscaling', "learning_rate_init":0.001
}
l1_models.append(MLPClassifier(**best_params))

In [29]:
#Model 5: RF
from sklearn.ensemble import RandomForestClassifier
best_params = {'min_samples_split': 4, 'n_estimators': 1200}
best_params["n_jobs"] =  -1
l1_models.append(RandomForestClassifier(**best_params))

In [30]:
#Model 6: k-NN
from sklearn.neighbors import KNeighborsClassifier
best_params = {'n_neighbors': 800}
best_params["n_jobs"] = -1
l1_models.append(KNeighborsClassifier(**best_params))

In [32]:
%%time
train1, l1_predictions = models_training_and_prediction(l1_models, train, test, col_names)

Training XGBClassifier ...
Training GaussianNB ...
Training LogisticRegression ...
Training MLPClassifier ...
Training RandomForestClassifier ...
Training KNeighborsClassifier ...
CPU times: user 5h 41min 21s, sys: 5min 43s, total: 5h 47min 4s
Wall time: 1h 28min 14s
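
One rough way to check how diverse the base models are is to look at the correlation between their predictions: columns that correlate almost perfectly add little to the stack. The sketch below uses made-up prediction columns, not the notebook's results:

```python
import numpy as np

rng = np.random.RandomState(0)
signal = rng.uniform(size=500)

# Three hypothetical base-model outputs: two near-duplicates and one noisier model.
preds = np.column_stack([
    signal + 0.01 * rng.normal(size=500),
    signal + 0.01 * rng.normal(size=500),
    signal + 0.50 * rng.normal(size=500),
])

corr = np.corrcoef(preds, rowvar=False)  # 3 x 3 correlation matrix between columns
print(np.round(corr, 2))
```

On the notebook's data the equivalent check would be `train1[names].corr()`, where `names` is the list of base-model column names.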


### Level 2 Model

The level 2 model (or stacking model) is trained using the predictions of the base models as features. Here we use a logistic regression classifier (`LogisticRegressionCV`) as the stacking model because it is simple and fast, and it lets us inspect the fitted coefficients of the level 1 models to compare the relative weight of each model in the stack.

In [38]:
l1_model_names = [str(m).split("(")[0] for m in l1_models]

In [39]:
l1_model_names

['XGBClassifier',
 'GaussianNB',
 'LogisticRegression',
 'MLPClassifier',
 'RandomForestClassifier',
 'KNeighborsClassifier']

In [42]:
train2 = train1[l1_model_names]
target = train[col_names["target"]]
test2 = l1_predictions[l1_model_names]

In [44]:
from sklearn.linear_model import LogisticRegressionCV
model2 = LogisticRegressionCV(max_iter=500, n_jobs=-1)
model2.fit(train2, target)

LogisticRegressionCV(Cs=10, class_weight=None, cv=None, dual=False,
           fit_intercept=True, intercept_scaling=1.0, max_iter=500,
           multi_class='ovr', n_jobs=-1, penalty='l2', random_state=None,
           refit=True, scoring=None, solver='lbfgs', tol=0.0001, verbose=0)

In [47]:
list(zip(l1_model_names, model2.coef_[0]))  # list() so the pairs also display on Python 3

[('XGBClassifier', 1.2076147893880145),
 ('GaussianNB', -0.036017735049057129),
 ('LogisticRegression', 1.8289543650577398),
 ('MLPClassifier', 0.29037835977075915),
 ('RandomForestClassifier', 0.32329146264212677),
 ('KNeighborsClassifier', 0.85481792947028812)]
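
The same two-level procedure (out-of-fold base predictions feeding a logistic regression) is also packaged as `StackingClassifier` in scikit-learn 0.22 and later. The sketch below is an alternative on synthetic data with only two small base models, assuming such a scikit-learn version is installed. It is not the code that produced the coefficients above:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Synthetic binary problem, only to make the sketch runnable.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(300, 5))
y_demo = (X_demo[:, 0] - X_demo[:, 1] + 0.5 * rng.normal(size=300) > 0).astype(int)

stack = StackingClassifier(
    estimators=[
        ("nb", GaussianNB()),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    final_estimator=LogisticRegression(),  # level 2 model
    cv=4,                                  # folds used for the out-of-fold predictions
    stack_method="predict_proba",
)
stack.fit(X_demo, y_demo)
stack_proba = stack.predict_proba(X_demo)[:, 1]

# One coefficient per base model, comparable to model2.coef_[0] in the notebook.
print(dict(zip(["nb", "rf"], stack.final_estimator_.coef_[0])))
```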

In [48]:
prediction2 = model2.predict_proba(test2)

In [49]:
results = pd.read_csv(workdir+"example_predictions.csv")
results["probability"] = prediction2[:,1]
results.to_csv(workdir+"submission_stacked_lr_1.csv", index=False)

*submission_stacked_lr_1.csv has a log loss of 0.689*
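
For context on that score, here is a minimal sketch of binary log loss, assuming the leaderboard metric is the usual definition (the notebook does not spell it out). The function name `binary_log_loss` is illustrative. Predicting 0.5 for every row gives ln(2), about 0.693, so lower is better and scores just under 0.693 are only slightly better than an uninformative guess:

```python
import numpy as np

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)  # avoid log(0)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

baseline = binary_log_loss([0, 1, 1, 0], [0.5, 0.5, 0.5, 0.5])
print(round(baseline, 3))  # 0.693, i.e. ln(2)
```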

### Ensembling with Average

In [53]:
prediction3 = test2.mean(axis=1)

In [55]:
results["probability"] = prediction3
results.to_csv(workdir+"submission_ensemble_average_1.csv", index=False)

*submission_ensemble_average_1.csv has a log loss of 0.687*
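
The row-wise mean gives every base model the same weight, whereas the stacked logistic regression learns the weights from the data. A small sketch of the difference on made-up numbers (the prediction values and the weights are hypothetical):

```python
import numpy as np

# Two rows of predictions from three hypothetical base models.
preds = np.array([[0.52, 0.48, 0.50],
                  [0.61, 0.55, 0.58]])

simple = preds.mean(axis=1)  # equal weights, like test2.mean(axis=1)

weights = np.array([0.5, 0.25, 0.25])  # hypothetical weights, chosen by hand
weighted = np.average(preds, axis=1, weights=weights)

print(simple)
print(weighted)
```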

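In this case the plain average scored slightly better than the stacked model (0.687 versus 0.689; lower log loss is better). Sometimes (oftentimes) a simple approach is the best way.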