# Ensemble of ensembles - model stacking

* **Ensemble with different types of classifiers**: 
  * Different types of classifiers (e.g., logistic regression, decision trees, random forest) are fitted on the same training data
  * Results are combined based on either 
    * majority voting (classification) or 
    * average (regression)
  

* **Ensemble with a single type of classifier (bagging)**: 
  * Bootstrap samples are drawn from the training data 
  * A model of the same type (e.g., a decision tree or a random forest) is fitted on each bootstrap sample 
  * The individual results are combined to create an ensemble. 
  * Suitable for highly flexible models that are prone to overfitting / high variance. 
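
The two ensemble types above can be sketched with scikit-learn's `VotingClassifier` and `BaggingClassifier`. This is a minimal illustration on synthetic data, not the attrition data used later in this notebook; the `*_demo` and `Xd_*` names, the choice of base models, and `n_estimators=50` are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the dataset used in this notebook).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, random_state=0)

# Different types of classifiers fitted on the same data, combined by majority vote.
voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(random_state=0)),
        ("nb", GaussianNB()),
    ],
    voting="hard",
)
voting_clf.fit(Xd_train, yd_train)
voting_acc = voting_clf.score(Xd_test, yd_test)

# A single type of classifier fitted on many bootstrap samples (bagging).
bagging_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=0),
    n_estimators=50,
    bootstrap=True,
    random_state=0,
)
bagging_clf.fit(Xd_train, yd_train)
bagging_acc = bagging_clf.score(Xd_test, yd_test)

print("voting accuracy:  {0:.4f}".format(voting_acc))
print("bagging accuracy: {0:.4f}".format(bagging_acc))
```

An odd number of base models is used for hard voting so that a binary vote cannot tie.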

***

## Combining Method

* **Majority voting or average**: 
  * Classification: Largest number of votes (mode) 
  * Regression problems: Average (mean).
  
  
* **Meta-classifier applied to predicted outcomes**: 
  * Each individual classifier outputs a binary outcome (0 / 1)
  * A meta-classifier is fitted on these outcomes to produce the final prediction. 
  
  
* **Meta-classifier applied to predicted probabilities**: 
  * Each individual classifier outputs a class probability. 
  * A meta-classifier is fitted on these probabilities to produce the final prediction.
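
The combining methods listed above can be illustrated by hand with NumPy. All arrays below are hypothetical outputs of three base models, made up for this sketch; `LogisticRegression` is just one possible meta-classifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical 0/1 votes of three base classifiers for five samples.
votes = np.array([[0, 1, 1],
                  [0, 0, 1],
                  [1, 1, 1],
                  [0, 0, 0],
                  [1, 0, 1]])
y_true = np.array([1, 0, 1, 0, 1])

# Majority voting: for binary votes the mode is 1 when more than half vote 1.
majority = (votes.sum(axis=1) > votes.shape[1] / 2).astype(int)
print(majority)  # [1 0 1 0 1]

# Averaging for regression: mean of the individual predictions per sample.
reg_preds = np.array([[2.0, 3.0, 4.0],
                      [1.0, 1.0, 4.0]])
print(reg_preds.mean(axis=1))  # [3. 2.]

# Meta-classifier on outcomes: the 0/1 votes become the input features.
meta_on_votes = LogisticRegression().fit(votes, y_true)

# Meta-classifier on probabilities: hypothetical P(class 1) from each base model.
probas = np.array([[0.4, 0.7, 0.9],
                   [0.2, 0.3, 0.6],
                   [0.8, 0.9, 0.7],
                   [0.1, 0.2, 0.3],
                   [0.6, 0.4, 0.8]])
meta_on_probas = LogisticRegression().fit(probas, y_true)
print(meta_on_probas.predict(probas))
```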
  

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
df = pd.read_csv("data//WA_Fn-UseC_-HR-Employee-Attrition.csv")
df.pop('EmployeeNumber')
df.pop('Over18')
df.pop('StandardHours')
df.pop('EmployeeCount')
y = df['Attrition']
X = df
X.pop('Attrition')
from sklearn import preprocessing
le = preprocessing.LabelBinarizer()
y = le.fit_transform(y)
ind_BusinessTravel = pd.get_dummies(df['BusinessTravel'], prefix='BusinessTravel')
ind_Department = pd.get_dummies(df['Department'], prefix='Department')
ind_EducationField = pd.get_dummies(df['EducationField'], prefix='EducationField')
ind_Gender = pd.get_dummies(df['Gender'], prefix='Gender')
ind_JobRole = pd.get_dummies(df['JobRole'], prefix='JobRole')
ind_MaritalStatus = pd.get_dummies(df['MaritalStatus'], prefix='MaritalStatus')
ind_OverTime = pd.get_dummies(df['OverTime'], prefix='OverTime')
# Combine the indicator columns with the integer features, column-wise
df1 = pd.concat([ind_BusinessTravel, ind_Department, ind_EducationField, ind_Gender, 
                 ind_JobRole, ind_MaritalStatus, ind_OverTime, df.select_dtypes(['int64'])], axis=1)
df1.dropna(inplace=True)
df1.shape

(1470, 51)
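
As a possible shortcut for the encoding step above, `pd.get_dummies` can encode several columns in one call through its `columns` argument. The sketch below uses a toy frame with made-up values; only the column names are borrowed from the dataset. Unlike the cell above, this keeps every non-encoded column, not only the `int64` ones.

```python
import pandas as pd

# Toy frame with hypothetical values (assumption: not the real data).
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "Gender": ["Female", "Male", "Male"],
    "OverTime": ["Yes", "No", "Yes"],
})

# One call encodes all listed columns; the column name is used as the prefix.
encoded = pd.get_dummies(toy, columns=["Gender", "OverTime"])
print(encoded.columns.tolist())
# ['Age', 'Gender_Female', 'Gender_Male', 'OverTime_No', 'OverTime_Yes']
```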

In [3]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df1, y)
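
The split above is random and unseeded, so the class counts and all results below change from run to run. A minimal sketch of a reproducible, stratified split is shown here, assuming an imbalanced binary target; the labels, `test_size` and `random_state` values are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 840 of class 0 and 160 of class 1.
y_demo = np.array([0] * 840 + [1] * 160)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify keeps the class proportions equal in both parts;
# random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)
print("share of class 1 in train:", y_tr.mean())
print("share of class 1 in test: ", y_te.mean())
```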

In [50]:
lb = preprocessing.LabelBinarizer()
lb.fit(y_train)
y_test.tolist() == lb.transform(y_test).tolist()



True

In [38]:
from sklearn import preprocessing
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import confusion_matrix, roc_auc_score
def print_score(clf, X_train, X_test, y_train, y_test, train=True):
    '''
    v0.1 Follow the scikit-learn library format in terms of input.
    Print the accuracy score, classification report, confusion matrix
    and ROC AUC of the classifier on the training data (train=True)
    or on the test data (train=False).
    '''
    
    if train:
        # training performance
        res_train = clf.predict(X_train)
        print("Train Result:\n")
        print("accuracy score: {0:.4f}\n".format(accuracy_score(y_train, res_train)))
        print("Classification Report: \n {}\n".format(classification_report(y_train, res_train)))
        print("Confusion Matrix: \n {}\n".format(confusion_matrix(y_train, res_train)))
        # Note: ROC AUC is computed from the hard 0/1 predictions, not from probabilities
        print("ROC AUC: {0:.4f}\n".format(roc_auc_score(y_train, res_train)))

        #cv_res = cross_val_score(clf, X_train, y_train, cv=10, scoring='accuracy')
        #print("Average Accuracy: \t {0:.4f}".format(np.mean(cv_res)))
        #print("Accuracy SD: \t\t {0:.4f}".format(np.std(cv_res)))
        
    else:
        # test performance
        res_test = clf.predict(X_test)
        print("Test Result:\n")        
        print("accuracy score: {0:.4f}\n".format(accuracy_score(y_test, res_test)))
        print("Classification Report: \n {}\n".format(classification_report(y_test, res_test)))
        print("Confusion Matrix: \n {}\n".format(confusion_matrix(y_test, res_test)))      
        # Note: ROC AUC is computed from the hard 0/1 predictions, not from probabilities
        print("ROC AUC: {0:.4f}\n".format(roc_auc_score(y_test, res_test)))
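
`print_score` passes hard 0/1 predictions to `roc_auc_score`, which evaluates the ROC curve at a single threshold. `roc_auc_score` also accepts continuous scores such as the predicted probability of the positive class, which uses the full ranking. A minimal sketch of both variants on synthetic data follows; the data, the model choice and all names are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (assumption: not the notebook's dataset).
X_demo, y_demo = make_classification(
    n_samples=600, weights=[0.85, 0.15], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# AUC from hard labels: only one operating point of the ROC curve is used.
auc_from_labels = roc_auc_score(y_te, clf.predict(X_te))

# AUC from probabilities of class 1: the whole ranking is used.
auc_from_probas = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print("ROC AUC from labels:        {0:.4f}".format(auc_from_labels))
print("ROC AUC from probabilities: {0:.4f}".format(auc_from_probas))
```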
        

## Model 1: Decision Tree

In [31]:
from sklearn.tree import DecisionTreeClassifier

In [32]:
tree_clf = DecisionTreeClassifier()
tree_clf.fit(X_train, y_train)

In [39]:
print_score(tree_clf, X_train, X_test, y_train, y_test, train=True)
print_score(tree_clf, X_train, X_test, y_train, y_test, train=False)

Train Result:

accuracy score: 1.0000

Classification Report: 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       921
           1       1.00      1.00      1.00       181

    accuracy                           1.00      1102
   macro avg       1.00      1.00      1.00      1102
weighted avg       1.00      1.00      1.00      1102


Confusion Matrix: 
 [[921   0]
 [  0 181]]

ROC AUC: 1.0000

Test Result:

accuracy score: 0.7908

Classification Report: 
               precision    recall  f1-score   support

           0       0.89      0.87      0.88       312
           1       0.33      0.38      0.35        56

    accuracy                           0.79       368
   macro avg       0.61      0.62      0.61       368
weighted avg       0.80      0.79      0.80       368


Confusion Matrix: 
 [[270  42]
 [ 35  21]]

ROC AUC: 0.6202
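
A training accuracy of 1.0 next to a test accuracy of about 0.79 suggests the unpruned tree has memorized the training data. The commented-out lines in `print_score` hint at cross-validation as a less optimistic estimate; a minimal sketch of that idea on synthetic data is given below. The data and the `cv=10` setting mirror the commented code but are otherwise assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the notebook's dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# Each of the 10 folds is scored by a tree that never saw that fold.
cv_res = cross_val_score(
    DecisionTreeClassifier(random_state=0), X_demo, y_demo, cv=10, scoring="accuracy"
)
print("Average Accuracy: \t {0:.4f}".format(np.mean(cv_res)))
print("Accuracy SD: \t\t {0:.4f}".format(np.std(cv_res)))
```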



## Model 2: Random Forest

In [8]:
from sklearn.ensemble import RandomForestClassifier

In [9]:
rf_clf = RandomForestClassifier()
rf_clf.fit(X_train, y_train.ravel())

In [10]:
print_score(rf_clf, X_train, X_test, y_train, y_test, train=True)
print_score(rf_clf, X_train, X_test, y_train, y_test, train=False)

Train Result:

accuracy score: 1.0000

Classification Report: 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       921
           1       1.00      1.00      1.00       181

    accuracy                           1.00      1102
   macro avg       1.00      1.00      1.00      1102
weighted avg       1.00      1.00      1.00      1102


Confusion Matrix: 
 [[921   0]
 [  0 181]]

ROC AUC: 1.0000

Test Result:

accuracy score: 0.8696

Classification Report: 
               precision    recall  f1-score   support

           0       0.87      0.99      0.93       312
           1       0.83      0.18      0.29        56

    accuracy                           0.87       368
   macro avg       0.85      0.59      0.61       368
weighted avg       0.87      0.87      0.83       368


Confusion Matrix: 
 [[310   2]
 [ 46  10]]

ROC AUC: 0.5861



In [60]:
en_en = pd.DataFrame()
en_en

In [61]:
tree_clf.predict_proba(X_train)

array([[1., 0.],
       [1., 0.],
       [1., 0.],
       ...,
       [1., 0.],
       [1., 0.],
       [1., 0.]])

In [13]:
tree_clf.predict_proba?

Signature: tree_clf.predict_proba(X, check_input=True)
Docstring:
Predict class probabilities of the input samples X.

The predicted class probability is the fraction of samples of the same
class in a leaf.

Parameters
----------
X : {array-like, sparse matrix} of shape (n_samples, n_features)
    The input samples. Internally, it will be converted to
    ``dtype=np.float32`` and if a sparse matrix is provided
    to a sparse ``csr_matrix``.

check_input : bool, default=True
    Allow to bypass several input checking.
    Don't use this parameter unless you know what you're doing.

Returns
-------
proba : ndarray of shape (n_samples, n_classes) or list of n_outputs such arrays if n_outputs > 1
    The class probabilities of the input samples. The order of the
    classes corresponds to that in the attribute :term:`classes_`.

In [56]:
pd.DataFrame(tree_clf.predict_proba(X_train))
pd.DataFrame(y_train)#.reset_index(drop=True)

      0
0     0
1     0
2     0
3     0
4     0
...  ..
1097  1
1098  0
1099  0
1100  0


In [62]:
en_en['tree_clf'] = pd.DataFrame(tree_clf.predict_proba(X_train))[1]
en_en['rf_clf'] =  pd.DataFrame(rf_clf.predict_proba(X_train))[1]
col_name = en_en.columns
en_en = pd.concat([en_en, pd.DataFrame(y_train).reset_index(drop=True)], axis=1)
en_en

      tree_clf  rf_clf    0
0          0.0    0.07    0
1          0.0    0.10    0
2          0.0    0.06    0
3          0.0    0.13    0
4          0.0    0.09    0
...        ...     ...  ...
1097       1.0    0.74    1
1098       0.0    0.07    0
1099       0.0    0.04    0
1100       0.0    0.08    0
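
In the table above, the tree's training probabilities are exactly 0.0 or 1.0 because the base models are predicting rows they were fitted on. A meta-classifier trained on such features sees base models that look far more reliable than they are on new data. One common alternative is to build the meta-features from out-of-fold predictions with `cross_val_predict`. The sketch below shows that approach on synthetic data; it is an illustration under that assumption, not the method used in this notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the notebook's dataset).
X_demo, y_demo = make_classification(n_samples=400, random_state=0)

base_models = {
    "tree_clf": DecisionTreeClassifier(random_state=0),
    "rf_clf": RandomForestClassifier(random_state=0),
}

# For every row, the probability comes from a model fitted on the other folds.
oof = pd.DataFrame({
    name: cross_val_predict(
        model, X_demo, y_demo, cv=5, method="predict_proba"
    )[:, 1]
    for name, model in base_models.items()
})
oof["ind"] = y_demo
print(oof.head())
```

The base models would then be refitted on the full training data before being used to create the test-set features.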


In [15]:
en_en.head()

   tree_clf  rf_clf  0
0       0.0    0.07  0
1       0.0    0.10  0
2       0.0    0.06  0
3       0.0    0.13  0
4       0.0    0.09  0


In [16]:
tmp = list(col_name)
tmp.append('ind')
en_en.columns = tmp

In [17]:
en_en.head()

   tree_clf  rf_clf  ind
0       0.0    0.07    0
1       0.0    0.10    0
2       0.0    0.06    0
3       0.0    0.13    0
4       0.0    0.09    0


## Meta Classifier

In [18]:
from sklearn.linear_model import LogisticRegression

In [19]:
m_clf = LogisticRegression(fit_intercept=False)

In [20]:
m_clf.fit(en_en[['tree_clf', 'rf_clf']], en_en['ind'])
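
scikit-learn also provides `StackingClassifier`, which wraps the same idea: base models produce probabilities, and a final estimator is fitted on them using internal cross-validation. A minimal sketch on synthetic data is shown below; the data, the `cv=5` setting and the variable names are assumptions for the example, and the results will differ from the manual approach in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the notebook's dataset).
X_demo, y_demo = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

stack_clf = StackingClassifier(
    estimators=[
        ("tree_clf", DecisionTreeClassifier(random_state=0)),
        ("rf_clf", RandomForestClassifier(random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    stack_method="predict_proba",  # feed probabilities to the meta-classifier
    cv=5,                          # meta-features come from out-of-fold predictions
)
stack_clf.fit(X_tr, y_tr)
stack_pred = stack_clf.predict(X_te)
print("stacked accuracy: {0:.4f}".format(stack_clf.score(X_te, y_te)))
```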

In [21]:
en_test = pd.DataFrame()

In [22]:
# Probability of class 1 from each base model becomes a feature for the meta-classifier
en_test['tree_clf'] = pd.DataFrame(tree_clf.predict_proba(X_test))[1]
en_test['rf_clf'] =  pd.DataFrame(rf_clf.predict_proba(X_test))[1]
en_test['combined'] = m_clf.predict(en_test[['tree_clf', 'rf_clf']])

In [23]:
col_name = en_test.columns
tmp = list(col_name)
tmp.append('ind')

In [24]:
tmp

['tree_clf', 'rf_clf', 'combined', 'ind']

In [25]:
en_test = pd.concat([en_test, pd.DataFrame(y_test).reset_index(drop=True)], axis=1)

In [26]:
en_test.columns = tmp

In [27]:
print(pd.crosstab(en_test['ind'], en_test['combined']))

combined    0   1
ind              
0         263  49
1          37  19


In [28]:
print(round(accuracy_score(en_test['ind'], en_test['combined']), 4))

0.7663


In [29]:
print(classification_report(en_test['ind'], en_test['combined']))

              precision    recall  f1-score   support

           0       0.88      0.84      0.86       312
           1       0.28      0.34      0.31        56

    accuracy                           0.77       368
   macro avg       0.58      0.59      0.58       368
weighted avg       0.79      0.77      0.78       368
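
The figures in the report can be checked by hand from the crosstab printed earlier (rows are the actual class `ind`, columns the predicted class `combined`). A short sketch of that arithmetic:

```python
import numpy as np

# Counts copied from the crosstab above: rows = actual, columns = predicted.
cm = np.array([[263, 49],
               [37, 19]])
tn, fp, fn, tp = cm.ravel()

accuracy = float((tn + tp) / cm.sum())   # (263 + 19) / 368
precision_1 = float(tp / (tp + fp))      # 19 / (49 + 19)
recall_1 = float(tp / (tp + fn))         # 19 / (37 + 19)

print(round(accuracy, 4), round(precision_1, 2), round(recall_1, 2))
# 0.7663 0.28 0.34
```

The stacked model's test accuracy of 0.7663 is below that of the random forest alone (0.8696) in the runs recorded above.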



***