This notebook is a basic introductory primer on ensembling models, in particular the variant of ensembling known as stacking. In a nutshell, stacking takes the predictions of a few basic machine learning models (regressors) at the first (base) level and then uses another model at the second level to predict the output from those first-level predictions.

I myself am quite a newcomer to the Kaggle scene, and this is my first proper ensembling/stacking script. The material in this notebook borrows heavily from Faron's script as well as Anisotropic's, although it has been ported to ensembles of regressors, whereas those were ensembles of classifiers. Please check out Faron's script here:

Stacking Starter : by [Faron](http://)
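To make the two-level idea from the introduction concrete, here is a minimal sketch on synthetic data using scikit-learn's built-in StackingRegressor (available in scikit-learn 0.22 and later). This is not the approach taken in the rest of the notebook, which builds the out-of-fold predictions by hand; the toy dataset, the choice of Ridge as second-level model and all parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (ExtraTreesRegressor, RandomForestRegressor,
                              StackingRegressor)
from sklearn.linear_model import Ridge

# Synthetic regression data (assumption: stands in for the real features)
X_demo, y_demo = make_regression(n_samples=200, n_features=8,
                                 noise=10.0, random_state=0)

# First level: two base regressors
base_models = [
    ('rf', RandomForestRegressor(n_estimators=20, random_state=0)),
    ('et', ExtraTreesRegressor(n_estimators=20, random_state=0)),
]

# Second level: a simple linear model trained on cross-validated
# predictions of the base regressors (cv=5 folds)
stack = StackingRegressor(estimators=base_models,
                          final_estimator=Ridge(), cv=5)
stack.fit(X_demo, y_demo)

# transform() exposes the first-level predictions, one column per base model
meta_features = stack.transform(X_demo)
print(meta_features.shape)
```

Each column of meta_features plays the same role as one of the out-of-fold prediction columns built manually later in this notebook.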

In [1]:
import pandas as pd
import numpy as np
import re
import sklearn
import xgboost as xgb
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls

import warnings
warnings.filterwarnings('ignore')

# Going to use these 4 base models for the stacking
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor, ExtraTreesRegressor
from sklearn.svm import SVC
from sklearn.cross_validation import KFold

In [2]:
print( "\nReading data from disk ...")
properties = pd.read_csv(r"../input/properties_2016.csv")
train_df = pd.read_csv("../input/train_2016_v2.csv")
test_df = pd.read_csv("../input/sample_submission.csv")
test_df = test_df.rename(columns={'ParcelId': 'parcelid'})

In [3]:
train = train_df.merge(properties, how = 'left', on = 'parcelid')
test = test_df.merge(properties, on='parcelid', how='left')

### Encoding the Variables

In [4]:
from sklearn.preprocessing import LabelEncoder  

lbl = LabelEncoder()

for c in train.columns:
    train[c]=train[c].fillna(0)
    if train[c].dtype == 'object':
        lbl.fit(list(train[c].values))
        train[c] = lbl.transform(list(train[c].values))

for c in test.columns:
    test[c]=test[c].fillna(0)
    if test[c].dtype == 'object':
        lbl.fit(list(test[c].values))
        test[c] = lbl.transform(list(test[c].values))     
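One thing to keep in mind with label encoding is that an encoder fitted separately on train and on test can assign different integer codes to the same category. The sketch below shows one way to keep the codes consistent by fitting a single encoder on the values of both frames. The column name zone, the toy frames and the 'missing' placeholder are hypothetical and only serve the illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frames (hypothetical): one categorical column with a missing value
train_demo = pd.DataFrame({'zone': ['A', 'B', None, 'C']})
test_demo = pd.DataFrame({'zone': ['C', 'C', 'B']})

# Fill gaps with a string placeholder so every value has the same type
train_vals = train_demo['zone'].fillna('missing').astype(str)
test_vals = test_demo['zone'].fillna('missing').astype(str)

# Fit once on the union of both columns, then transform each frame
enc = LabelEncoder()
enc.fit(pd.concat([train_vals, test_vals]))
train_demo['zone'] = enc.transform(train_vals)
test_demo['zone'] = enc.transform(test_vals)

print(dict(zip(enc.classes_, range(len(enc.classes_)))))
```

With a shared encoder, 'C' maps to the same integer in both frames, which matters whenever an encoded column is actually fed to a model.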

## Pearson Correlation Heatmap


Let us generate some correlation plots of the features to see how related one feature is to the next. To do so, we will use the Seaborn plotting package, which allows us to plot heatmaps very conveniently.

### First, Calculate Feature Importance 

We calculate the feature importance first because there are simply too many variables; we only need to concern ourselves with the ones that are useful for our analysis.

In [5]:
from sklearn import model_selection, preprocessing
import xgboost as xgb
import warnings
warnings.filterwarnings("ignore")

train_y = train.logerror.values
train_X = train.drop(["parcelid", "transactiondate", "logerror"], axis=1)


In [6]:
xgb_params = {
    'eta': 0.05,
    'max_depth': 8,
    'subsample': 0.7,
    'colsample_bytree': 0.7,
    'objective': 'reg:linear',
    'eval_metric': 'rmse',
    'silent': 1
}
dtrain = xgb.DMatrix(train_X, train_y, feature_names=train_X.columns.values)
model = xgb.train(dict(xgb_params, silent=0), dtrain, num_boost_round=100)

In [7]:
featureImportance = model.get_fscore()
features = pd.DataFrame()
features['features'] = featureImportance.keys()
features['importance'] = featureImportance.values()


In [8]:
features.sort_values(by=['importance'],ascending=False,inplace=True)
fig,ax= plt.subplots()
fig.set_size_inches(20,10)
plt.xticks(rotation=90)
sns.barplot(data=features.head(15),x="importance",y="features",ax=ax,orient="h", color = "#34495e")

In [9]:
topFeatures = features["features"].tolist()[:20]
corrMatt = train[topFeatures].corr()
mask = np.array(corrMatt)
mask[np.tril_indices_from(mask)] = False


In [10]:
colormap = plt.cm.viridis
plt.figure(figsize=(12,20))
plt.title('Pearson Correlation of Features', y=1.05, size=15)
sns.heatmap(corrMatt,linewidths=0.6,vmax=1.0, mask = mask, square = True, linecolor='white', annot=True)
plt.show()

### Outlier Check

In [11]:
plt.figure(figsize=(12,8))
sns.distplot(train.logerror.values, bins=500, kde=False)
plt.xlabel('logerror', fontsize=12)
plt.show()

In [12]:
train=train[ train.logerror > -0.40 ]
train=train[ train.logerror < 0.419 ]

plt.figure(figsize=(12,8))
sns.distplot(train.logerror.values, bins=50, kde=False)
plt.xlabel('logerror', fontsize=12)
plt.show()

In [13]:
test.head(2)

### Helpers via Python Classes

In the section of code below, we write a class SklearnHelper that wraps the built-in methods (such as fit and predict) common to all the Sklearn regressors behind one interface (train, predict and fit). This cuts out redundancy, as we won't need to write the same methods four times to invoke four different regressors.

In [14]:
# Some useful parameters which will come in handy later on
ntrain = train.shape[0]
ntest = test.shape[0]
SEED = 0 # for reproducibility
NFOLDS = 5 # set folds for out-of-fold prediction
kf = KFold(ntrain, n_folds= NFOLDS, random_state=SEED)

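# Class to extend the Sklearn regressor
class SklearnHelper(object):
    def __init__(self, clf, seed=0, params=None):
        # Copy the dict so the caller's params are not mutated,
        # and tolerate the default params=None
        params = dict(params or {})
        params['random_state'] = seed
        self.clf = clf(**params)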

    def train(self, x_train, y_train):
        self.clf.fit(x_train, y_train)

    def predict(self, x):
        return self.clf.predict(x)
    
    def fit(self,x,y):
        return self.clf.fit(x,y)
    
    def feature_importances(self,x,y):
        # Refit on the given data, then print and return the importances
        importances = self.clf.fit(x,y).feature_importances_
        print(importances)
        return importances

def __init__ : Python's standard name for the class constructor. This means that when you want to create an object (regressor), you have to give it the parameters clf (which sklearn regressor you want), seed (random seed) and params (parameters for the regressor).
The rest of the code consists of methods of the class that simply call the corresponding methods already existing within the sklearn regressors.

### Out-of-Fold Predictions

In [15]:
print(train.shape)
print(test.shape)

As alluded to in the introductory section, stacking uses the predictions of base regressors as input for training a second-level model. However, one cannot simply train the base models on the full training data and then reuse their predictions on that same data as second-level training features. The base models would already have "seen" the targets of those rows, so their predictions would be over-optimistic and the second-level model would overfit. Out-of-fold predictions avoid this: each training row is predicted by a model that was fit on the other folds only, and the test-set predictions are averaged over the folds.

In [16]:
def get_oof(clf, x_train, y_train, x_test):
    oof_train = np.zeros((ntrain,))
    oof_test = np.zeros((ntest,))
    oof_test_skf = np.empty((NFOLDS, ntest))

    for i, (train_index, test_index) in enumerate(kf):
        x_tr = x_train[train_index]
        y_tr = y_train[train_index]
        x_te = x_train[test_index]

        clf.train(x_tr, y_tr)

        oof_train[test_index] = clf.predict(x_te)
        oof_test_skf[i, :] = clf.predict(x_test)

    oof_test[:] = oof_test_skf.mean(axis=0)
    return oof_train.reshape(-1, 1), oof_test.reshape(-1, 1)
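The same out-of-fold scheme can also be written against the newer sklearn.model_selection.KFold API, where the splitter takes n_splits and the fold indices come from its split method. The sketch below is a self-contained illustration on synthetic data; the function name get_oof_sketch and the toy arrays are hypothetical and not part of this notebook's pipeline.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import KFold

def get_oof_sketch(model, x_full, y_full, x_holdout, n_folds=5):
    """Return out-of-fold train predictions and fold-averaged test predictions."""
    oof_train = np.zeros(x_full.shape[0])
    oof_test_folds = np.empty((n_folds, x_holdout.shape[0]))
    splitter = KFold(n_splits=n_folds)
    for i, (fit_idx, val_idx) in enumerate(splitter.split(x_full)):
        # Fit on the other folds only
        model.fit(x_full[fit_idx], y_full[fit_idx])
        # Predict the rows the model has not seen in this fold
        oof_train[val_idx] = model.predict(x_full[val_idx])
        # Predict the whole test set with this fold's model
        oof_test_folds[i, :] = model.predict(x_holdout)
    oof_test = oof_test_folds.mean(axis=0)
    return oof_train.reshape(-1, 1), oof_test.reshape(-1, 1)

# Toy data (assumption: stands in for x_train, y_train and x_test)
rng = np.random.RandomState(0)
x_a = rng.rand(100, 4)
y_a = 2.0 * x_a[:, 0] + rng.normal(scale=0.1, size=100)
x_b = rng.rand(30, 4)

demo_model = ExtraTreesRegressor(n_estimators=20, random_state=0)
oof_a, oof_b = get_oof_sketch(demo_model, x_a, y_a, x_b)
print(oof_a.shape, oof_b.shape)
```

Every training row receives exactly one prediction, made by a model that never saw that row, which is what makes these columns safe to use as second-level training features.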

### Generating our Base First-Level Models

So now let us prepare four learning models as our first-level regressors. These models can all be conveniently invoked via the Sklearn library and are listed as follows:
1. Random Forest regressor
2. Extra Trees regressor
3. AdaBoost regressor
4. Gradient Boosting regressor


In [17]:
# Put in our parameters for said regressors
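# Random Forest parameters
rf_params = {
    'n_jobs': -1,
    'n_estimators': 50,
    # warm_start stays at its default (False) so that each fold trains a
    # fresh forest instead of reusing the trees grown on an earlier fold
    #'max_features': 0.2,
    'max_depth': 6,
    'min_samples_leaf': 2,
    'max_features' : 'sqrt',
    'verbose': 0
}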

# Extra Trees Parameters
et_params = {
    'n_jobs': -1,
    'n_estimators':50,
    #'max_features': 0.5,
    'max_depth': 8,
    'min_samples_leaf': 2,
    'verbose': 0
}

# AdaBoost parameters
ada_params = {
    'n_estimators': 50,
    'learning_rate' : 0.75
}

# Gradient Boosting parameters
gb_params = {
    'n_estimators': 50,
     #'max_features': 0.2,
    'max_depth': 5,
    'min_samples_leaf': 2,
    'verbose': 0
}


Having mentioned objects and classes within the OOP framework, let us now create 4 objects that represent our 4 learning models via the SklearnHelper class we defined earlier.

In [18]:
# Create 4 objects that represent our 4 models
rf = SklearnHelper(clf=RandomForestRegressor, seed=SEED, params=rf_params)
et = SklearnHelper(clf=ExtraTreesRegressor, seed=SEED, params=et_params)
ada = SklearnHelper(clf=AdaBoostRegressor, seed=SEED, params=ada_params)
gb = SklearnHelper(clf=GradientBoostingRegressor, seed=SEED, params=gb_params)
#svc = SklearnHelper(clf=SVC, seed=SEED, params=svc_params)

### Features Checking

In [19]:
feature_names = list(train.columns)
print(np.setdiff1d(train.columns, test.columns))

In [20]:
do_not_include = ['parcelid', 'logerror', 'transactiondate', 'hashottuborspa',
 'propertycountylandusecode',
 'propertyzoningdesc',
 'fireplaceflag',
 'taxdelinquencyflag']

feature_names = [f for f in train.columns if f not in do_not_include]

print("We have %i features."% len(feature_names))
train[feature_names].count()

#### Creating NumPy arrays out of our train and test sets

Great. Having prepared our first-layer base models, we can now ready the training and test data for input into our regressors by generating NumPy arrays out of their original dataframes as follows:

In [21]:
# Create NumPy arrays of the train, test and target (logerror) data to feed into our models
y_train = train['logerror'].values
#train = train.drop(['logerror', 'parcelid', 'transactiondate'], axis=1)


In [22]:
train = train[feature_names]
test = test[feature_names]

In [23]:
print(train.shape)
print(test.shape)

In [24]:
x_train = train.values # Creates an array of the train data
x_test = test.values # Creates an array of the test data

### Output of the First level Predictions

We now feed the training and test data into our 4 base regressors and use the Out-of-Fold prediction function we defined earlier to generate our first level predictions. Allow a handful of minutes for the chunk of code below to run.

In [25]:
# Create our OOF train and test predictions. These base results will be used as new features
et_oof_train, et_oof_test = get_oof(et, x_train, y_train, x_test) # Extra Trees
rf_oof_train, rf_oof_test = get_oof(rf,x_train, y_train, x_test) # Random Forest
ada_oof_train, ada_oof_test = get_oof(ada, x_train, y_train, x_test) # AdaBoost 
gb_oof_train, gb_oof_test = get_oof(gb,x_train, y_train, x_test) # Gradient Boost
#svc_oof_train, svc_oof_test = get_oof(svc,x_train, y_train, x_test) # Support Vector 

print("Training is complete")

### Feature importances generated from the different regressors

In [26]:
rf_feature = rf.feature_importances(x_train,y_train)
et_feature = et.feature_importances(x_train, y_train)
ada_feature = ada.feature_importances(x_train, y_train)
gb_feature = gb.feature_importances(x_train,y_train)


## Second-Level Predictions from the First-level Output

### First-level output as new features

Having obtained our first-level predictions, one can think of them as a new set of features to be used as training data for the next regressor. As in the code below, the new columns are the first-level predictions from our earlier regressors, and we train the next regressor on them.

In [32]:
base_predictions_train = pd.DataFrame( {'RandomForest': rf_oof_train.ravel(),
     'ExtraTrees': et_oof_train.ravel(),
     'AdaBoost': ada_oof_train.ravel(),
      'GradientBoost': gb_oof_train.ravel()
    })
base_predictions_train.head()

### Correlation Heatmap of the Second Level Training set

In [33]:
data = [
    go.Heatmap(
        z= base_predictions_train.astype(float).corr().values ,
        x=base_predictions_train.columns.values,
        y= base_predictions_train.columns.values,
          colorscale='Portland',
            showscale=True,
            reversescale = True
    )
]
py.iplot(data, filename='labelled-heatmap')

### Making the New Training & Testing Sets

In [34]:
x_train = np.concatenate(( et_oof_train, rf_oof_train, ada_oof_train, gb_oof_train), axis=1)
x_test = np.concatenate(( et_oof_test, rf_oof_test, ada_oof_test, gb_oof_test), axis=1)

### Second level learning model via XGBoost

Here we choose the eXtremely famous library for boosted tree learning, XGBoost. It was built to optimize large-scale boosted tree algorithms. For further information about the algorithm, check out the official documentation.
Anyway, we fit an XGBoost model to the first-level train and target data and use the learned model to predict the test data as follows:

### Assignment of Variable

In [35]:
X = x_train
y = y_train
y_mean = np.mean(y_train)

In [36]:
from sklearn.model_selection import train_test_split

Xtr, Xv, ytr, yv = train_test_split(X, y, test_size=0.2, random_state=2000)

dtrain = xgb.DMatrix(Xtr, label=ytr)
dvalid = xgb.DMatrix(Xv, label=yv)

watchlist = [(dtrain, 'train'), (dvalid, 'valid')]

# Try different parameters! My favorite is random search :)
xgb_params = {
    'eta': 0.025,
    'max_depth': 7,
    'subsample': 0.80,
    'objective': 'reg:linear',
    'eval_metric': 'mae',
    'lambda': 0.8,   
    'alpha': 0.4, 
    'base_score': y_mean,
    'silent': 1
}


In [37]:
model_xgb = xgb.train(xgb_params, dtrain, 2000, watchlist, early_stopping_rounds=300,
                  maximize=False, verbose_eval=15)

In [38]:
dtest = xgb.DMatrix(x_test)
predicted_test_xgb = model_xgb.predict(dtest)

### Producing the Submission file

In [None]:
sub = pd.read_csv('../input/sample_submission.csv')
for c in sub.columns[sub.columns != 'ParcelId']:
    sub[c] = predicted_test_xgb

print('Writing csv ...')
sub.to_csv('xgb_stacked.csv', index=False, float_format='%.4f')

### Steps for Further Improvement


As a closing remark, note that the steps taken above show only a very simple way of producing an ensemble stacker. At the highest level of Kaggle competitions you hear of ensembles that involve monstrous combinations of stacked classifiers/regressors, with more than 2 levels of stacking.

The base models here are not optimized; a reduced number of estimators was used because training otherwise takes a long time. I encourage participants to fork the script, run it on their own machine, and make changes as necessary. I think it scores somewhere around 0.65 on the public Leaderboard. It can be used as a reference for building one's own models.

Some additional steps that may be taken to improve one's score could be:
1. Implementing a good cross-validation strategy when training the models, to find optimal parameter values.
2. Introducing a greater variety of base models for learning.
3. Optimizing the parameters of the base learning models.
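As a rough illustration of the first and third steps in the list above, the sketch below runs a small cross-validated random search over a few Random Forest parameters with scikit-learn's RandomizedSearchCV. The synthetic data, the parameter grid and the tiny search budget (n_iter=4, cv=3) are assumptions chosen to keep the example fast; a real search on this dataset would need a larger budget.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data (assumption: stands in for the real training set)
X_s, y_s = make_regression(n_samples=150, n_features=6,
                           noise=5.0, random_state=0)

# Candidate values to sample from (illustrative, not tuned)
param_distributions = {
    'max_depth': [3, 5, 8],
    'min_samples_leaf': [1, 2, 5],
    'max_features': ['sqrt', 1.0],
}

search = RandomizedSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=0),
    param_distributions,
    n_iter=4,                           # number of sampled parameter settings
    cv=3,                               # 3-fold cross-validation per setting
    scoring='neg_mean_absolute_error',  # higher (closer to 0) is better
    random_state=0,
)
search.fit(X_s, y_s)
print(search.best_params_)
```

The selected values in search.best_params_ could then be copied into a dictionary such as rf_params before building the base model.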

The less correlated the base models' predictions are, the more the second-level model can gain from combining them.
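A small numerical illustration of this point, under the simplifying assumption that each base model equals the true target plus zero-mean noise: averaging two models whose errors are independent reduces the error far more than averaging two models whose errors are nearly the same. All names and noise levels below are hypothetical.

```python
import numpy as np

rng = np.random.RandomState(0)
truth = rng.normal(size=10000)

# Two independent noise sources of equal size
noise_a = rng.normal(scale=0.5, size=truth.size)
noise_b = rng.normal(scale=0.5, size=truth.size)

pred_a = truth + noise_a
pred_b_uncorrelated = truth + noise_b
# This model's error is almost a copy of model A's error
pred_b_correlated = truth + 0.9 * noise_a + 0.1 * noise_b

def mae(pred):
    """Mean absolute error against the true target."""
    return np.mean(np.abs(pred - truth))

blend_uncorrelated = (pred_a + pred_b_uncorrelated) / 2
blend_correlated = (pred_a + pred_b_correlated) / 2

print('single model      :', mae(pred_a))
print('uncorrelated blend:', mae(blend_uncorrelated))
print('correlated blend  :', mae(blend_correlated))
```

A plain average is used here instead of a trained second-level model, but the same effect is what a stacker exploits when the base models make different mistakes.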