# Part III: Ensembles and Final Result

## AdaBoost

Train an AdaBoost classifier and compare its performance to the results obtained in Part II, using 10-fold CV.

In [5]:
import noshow_lib.util as utils
import noshow_lib.preprocess as preprocess
import numpy as np

file_config = utils.file_config
train_X, train_y = preprocess.load_train_data(config=file_config)

In [2]:
# AdaBoost code goes here
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

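# Boost a random forest rather than AdaBoost's default depth-1 decision tree.
# Note: `base_estimator` was renamed `estimator` in scikit-learn 1.2.
rf = RandomForestClassifier()
abc = AdaBoostClassifier(base_estimator=rf)
# 10-fold CV, scored by area under the ROC curve
scores_auc = cross_val_score(abc, train_X, train_y, cv=10, scoring='roc_auc', verbose=1)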

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed: 19.5min finished


In [6]:
np.mean(scores_auc)

0.688813816530363

The mean AUC for the original random forest classifier was 0.6922, so AdaBoost (0.6888) is slightly worse.
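
AdaBoost is usually applied to weak learners, and a full random forest is already a strong one, which may be part of why boosting it did not help here. The following is a minimal sketch, on synthetic data rather than the no-show dataset, of AdaBoost with scikit-learn's default base estimator (a depth-1 decision tree). The dataset, `n_estimators`, and the number of folds are illustrative assumptions, not the settings used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption); NOT the no-show dataset.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# With no base estimator given, AdaBoostClassifier boosts depth-1 decision trees.
stump_abc = AdaBoostClassifier(n_estimators=50, random_state=0)

# 5-fold CV scored by ROC AUC, mirroring the evaluation style used above.
demo_scores = cross_val_score(stump_abc, X_demo, y_demo, cv=5, scoring='roc_auc')
print(np.mean(demo_scores))
```

Running the same comparison on `train_X` and `train_y` would show whether stumps do better than a boosted random forest on this problem.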

## xgBoost

Train an xgBoost classifier and compare its performance to the results in Part II using 10-fold CV. `sklearn` includes a gradient boosting model ( http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html ) which you can use. The `xgboost` package ( https://xgboost.readthedocs.io/en/latest/python/python_intro.html ) also has a wrapper you can use with sklearn ( https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn ). The latter is more efficient at training time.

In [14]:
# xgboost code here
import xgboost as xgb
from sklearn.ensemble import GradientBoostingClassifier

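# Note: XGBRegressor is a regressor, so the recorded AUC comes from ranking its
# continuous predictions; xgb.XGBClassifier would be the conventional choice here.
gbc = xgb.XGBRegressor()
scores_auc = cross_val_score(gbc, train_X, train_y, cv=10, scoring='roc_auc', verbose=1)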

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:  1.8min finished


In [12]:
np.mean(scores_auc)

0.73193051389827546

xgBoost did better than all other classifiers, with a mean AUC of 0.7319.
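
The task description also mentions scikit-learn's own `GradientBoostingClassifier` as an alternative. A minimal sketch of evaluating it the same way is shown here, assuming synthetic data in place of the no-show dataset; the data and parameter values are illustrative only, and no result for the real dataset is implied.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption); NOT the no-show dataset.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# Gradient-boosted trees from scikit-learn itself; no xgboost install needed.
gb_clf = GradientBoostingClassifier(n_estimators=100, random_state=0)

# 5-fold CV scored by ROC AUC; the classifier's decision scores are used for ranking.
gb_scores = cross_val_score(gb_clf, X_demo, y_demo, cv=5, scoring='roc_auc')
print(np.mean(gb_scores))
```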

## Stacking

Choose a set of 5 or so classifiers. Write a function that trains an ensemble using stacking.

In [58]:
from sklearn.model_selection import train_test_split
import xgboost as xgb
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

def build_stack_ensemble(X, y):
    """Train five base models and a logistic regression meta-classifier.

    Returns the meta-classifier and a tuple of the fitted base models.
    """
    # create stratified train/validation sets (80/20 split)
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, shuffle=True, stratify=y)

    # train the base models of the ensemble on the train set
    c1 = xgb.XGBRegressor().fit(X_train, y_train)  # yields continuous scores, not labels
    c2 = LinearSVC().fit(X_train, y_train)
    c3 = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
    c4 = RandomForestClassifier(n_estimators=100, criterion='entropy').fit(X_train, y_train)
    c5 = DecisionTreeClassifier().fit(X_train, y_train)
    base_models = (c1, c2, c3, c4, c5)

    # create the new feature matrix for the validation set:
    # one column of predictions per base model
    X_predict = np.stack([m.predict(X_val) for m in base_models]).T

    # train the logistic regression meta-classifier on the new feature matrix
    c = LogisticRegression().fit(X_predict, y_val)

    # return all trained classifiers
    return c, base_models

In [59]:
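# meta_clf: logistic regression meta-classifier; base_clfs: tuple of the five base models
meta_clf, base_clfs = build_stack_ensemble(train_X, train_y)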
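
`build_stack_ensemble` fits the meta-classifier on predictions for a single held-out 20% split. A common variant trains the meta-classifier on out-of-fold predictions instead, so every training row contributes. The sketch below uses scikit-learn's `StackingClassifier` for this, assuming scikit-learn 0.22 or later; the synthetic data and the two base learners are illustrative assumptions, not the ensemble used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption); NOT the no-show dataset.
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)

# Hypothetical, smaller set of base learners chosen for a quick demo.
base_learners = [
    ('rf', RandomForestClassifier(n_estimators=50, random_state=0)),
    ('tree', DecisionTreeClassifier(max_depth=5, random_state=0)),
]

# The meta-classifier is fit on 5-fold out-of-fold predictions of the base learners.
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(),
                           cv=5)

stack_scores = cross_val_score(stack, X_demo, y_demo, cv=5, scoring='roc_auc')
print(np.mean(stack_scores))
```

The hand-written function remains useful for seeing the mechanics; the built-in class mainly saves the bookkeeping of the inner split.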



Use 10-fold cross-validation to measure the performance of your stacked classifier. See the Part II solution for how to roll your own sklearn classifier, along with http://scikit-learn.org/stable/developers/contributing.html#rolling-your-own-estimator

In [101]:
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted

class TheBestEnsemble(BaseEstimator, ClassifierMixin):
    """Stacked ensemble wrapped as an sklearn estimator for use with cross_val_score."""

    def __init__(self):
        pass

    def fit(self, X, y):
        X, y = check_X_y(X, y)

        # c_: meta-classifier, cs_: tuple of fitted base models
        self.c_, self.cs_ = build_stack_ensemble(X, y)
        self.classes_ = self.c_.classes_

        return self

    def _stack_features(self, X):
        # one column of base-model predictions per model in cs_
        check_is_fitted(self, ['c_', 'cs_'])
        X = check_array(X)
        return np.stack([clf.predict(X) for clf in self.cs_]).T

    def predict(self, X):
        return self.c_.predict(self._stack_features(X))

    def predict_proba(self, X):
        return self.c_.predict_proba(self._stack_features(X))

In [105]:
tbe = TheBestEnsemble()

In [103]:
scores_auc = cross_val_score(tbe, train_X, train_y, cv=10, scoring='roc_auc', verbose=1)
print(scores_auc)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[ 0.73809124  0.73509764  0.73755287  0.74551324  0.73276404  0.73774204
  0.73577124  0.73673849  0.74308787  0.72983612]


[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:  7.0min finished


In [104]:
print(np.mean(scores_auc))

0.737219479022


At a mean AUC of 0.7372, the stacked ensemble is only slightly better than xgBoost (0.7319).

## Final Result

Choose a single model based on all previous project steps. Train this model on the complete training dataset and measure its performance on the held-out test set.

Compare to the 10-fold CV estimate you got previously.

My stacked ensemble gave the best CV results, so I will use that.

In [106]:
# final result goes here
tbe = TheBestEnsemble()
tbe.fit(train_X,train_y)

TheBestEnsemble()

In [107]:
test_X, test_y = preprocess.load_test_data(config=file_config)

In [113]:
from sklearn.metrics import roc_auc_score
# AUC on the held-out test set, using the predicted probability of the positive class
test_auc = roc_auc_score(test_y, tbe.predict_proba(test_X)[:, 1])
print(test_auc)

0.740357504247


Wow. The test AUC (0.7404) is actually slightly better than the 10-fold CV estimate on the training data (0.7372).
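
To judge whether that difference is meaningful, it helps to compare it with the fold-to-fold spread of the CV scores. The sketch below copies the ten fold scores and the test AUC from the outputs above; the variable names are hypothetical and the one-standard-deviation check is a rough rule of thumb, not a formal test.

```python
import numpy as np

# Fold scores and test AUC copied from the notebook outputs above.
cv_scores = np.array([0.73809124, 0.73509764, 0.73755287, 0.74551324, 0.73276404,
                      0.73774204, 0.73577124, 0.73673849, 0.74308787, 0.72983612])
test_auc = 0.740357504247

cv_mean = cv_scores.mean()   # mean AUC over the 10 folds
cv_std = cv_scores.std()     # fold-to-fold spread (population standard deviation)
gap = test_auc - cv_mean     # how far the test AUC is above the CV estimate

print(cv_mean, cv_std, gap)
# True when the test AUC is within one standard deviation of the CV mean
print(abs(gap) <= cv_std)
```

The gap is smaller than the spread between folds, so the test result is consistent with the CV estimate rather than a sign of a real improvement.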