# Ensemble of Ensembles - Model Stacking

 - **Ensemble with different types of classifiers:**
 
 
     - Different types of classifiers are fitted on the same training data
     - Results are combined based on either:
         - Majority voting (classification) or
         - Average (regression)

 - **Ensemble with a single type of classifier:**
 
 
     - Bootstrap samples are drawn from the training data
     - A separate model is fitted on each bootstrap sample
     - All of the results are combined to create an ensemble
     - Suitable for highly flexible models that are prone to overfitting / high variance
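
The two kinds of ensemble described above can be sketched with scikit-learn's `VotingClassifier` and `BaggingClassifier`. This is a minimal illustration on synthetic data, not the HR dataset used later; the `*_demo` and `Xd_*` names, the choice of base models, and the parameter values are assumptions made for the example. The base estimator is passed to `BaggingClassifier` positionally because its keyword name differs between scikit-learn versions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for this sketch).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, random_state=0)

# Different types of classifiers fitted on the same training data,
# combined by majority ("hard") voting.
voting_clf = VotingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(random_state=0)),
        ("logreg", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier()),
    ],
    voting="hard",
)
voting_clf.fit(Xd_train, yd_train)

# A single type of classifier, each copy fitted on its own bootstrap sample.
bagging_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=0),
    n_estimators=25,
    bootstrap=True,
    random_state=0,
)
bagging_clf.fit(Xd_train, yd_train)

print("voting accuracy: ", voting_clf.score(Xd_test, yd_test))
print("bagging accuracy:", bagging_clf.score(Xd_test, yd_test))
```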

***

### Combining Method
 - **Majority voting or average:**
 
 
     - Classification: Largest number of votes (mode)
     - Regression: Average (mean)
     
 - **Meta-classifier applied to predicted outcomes:**
 
 
     - Each individual classifier outputs a binary label (0/1)
     - A meta-classifier is trained on these labels to produce the final prediction
     
     
 - **Meta-classifier applied to predicted probabilities:**
 
 
     - Each individual classifier outputs class probabilities
     - A meta-classifier is trained on these probabilities to produce the final prediction
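
The simplest combining rules listed above can be written in a few lines of NumPy. The arrays below are made-up predictions from three hypothetical models, used only to show the arithmetic; the last lines show how the same predictions would be arranged as input features for a meta-classifier.

```python
import numpy as np

# Hypothetical 0/1 predictions: one row per classifier, one column per sample.
votes = np.array([
    [1, 0, 1, 1, 0],
    [1, 0, 0, 1, 0],
    [0, 0, 1, 1, 1],
])

# Classification: class 1 wins when more than half of the classifiers vote for it.
majority = (votes.sum(axis=0) > votes.shape[0] / 2).astype(int)

# Hypothetical regression outputs: one row per model, one column per sample.
reg_preds = np.array([
    [2.0, 4.0],
    [3.0, 5.0],
    [4.0, 6.0],
])

# Regression: the ensemble prediction is the mean across models.
average = reg_preds.mean(axis=0)

# For a meta-classifier, each sample becomes a row and each base model a column.
meta_features = votes.T

print(majority)             # [1 0 1 1 0]
print(average)              # [3. 5.]
print(meta_features.shape)  # (5, 3)
```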

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [5]:
import os
#os.chdir('..')
#os.chdir('..')
#os.getcwd()

In [6]:
df = pd.read_csv("data/HR/WA_Fn-UseC_-HR-Employee-Attrition.csv")
df.pop('EmployeeNumber')
df.pop('Over18')
df.pop('StandardHours')
df.pop('EmployeeCount')
y = df['Attrition']
X = df
X.pop('Attrition')
from sklearn import preprocessing
le = preprocessing.LabelBinarizer()
y = le.fit_transform(y)
ind_BusinessTravel = pd.get_dummies(df['BusinessTravel'], prefix='BusinessTravel')
ind_Department = pd.get_dummies(df['Department'], prefix='Department')
ind_EducationField = pd.get_dummies(df['EducationField'], prefix='EducationField')
ind_Gender = pd.get_dummies(df['Gender'], prefix='Gender')
ind_JobRole = pd.get_dummies(df['JobRole'], prefix='JobRole')
ind_MaritalStatus = pd.get_dummies(df['MaritalStatus'], prefix='MaritalStatus')
ind_OverTime = pd.get_dummies(df['OverTime'], prefix='OverTime')
# Indicator (one-hot) columns plus the original integer-typed columns
df1 = pd.concat([ind_BusinessTravel, ind_Department, ind_EducationField, ind_Gender, 
                 ind_JobRole, ind_MaritalStatus, ind_OverTime, df.select_dtypes(['int64'])], axis=1)
df1.dropna(inplace=True)
df1.shape

(1470, 51)
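
The encoding step above builds one indicator frame per categorical column and concatenates them. As a hedged aside, `pd.get_dummies` can also encode several columns in one call through its `columns` argument. The toy frame below is an assumption for illustration and only borrows column names from the dataset; note that the resulting column order (untouched columns first) differs from the cell above.

```python
import pandas as pd

# Toy frame with two categorical columns and one integer column (illustrative only).
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "OverTime": ["Yes", "No", "No"],
    "Age": [41, 49, 37],
})

# Encode the listed columns in place; other columns are kept as they are.
encoded = pd.get_dummies(toy, columns=["Gender", "OverTime"])

print(list(encoded.columns))
print(encoded.shape)
```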

In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df1, y)
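
The split above uses the defaults, so it changes on every run and the class proportions in the two parts can drift. A possible variant, sketched here on made-up data with roughly the same class balance, fixes `random_state` and passes `stratify` so both parts keep the original class ratio. The array names and the 84/16 split are assumptions for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 16% positives (illustrative only).
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 84 + [1] * 16)

# stratify keeps the class ratio similar in both parts;
# random_state makes the split repeatable.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

print("train positive rate:", yd_train.mean())
print("test positive rate: ", yd_test.mean())
```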

In [8]:
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [9]:
def print_score(clf, X_train, y_train, X_test, y_test, train=True):
    '''
    print the accuracy score, classification report and confusion matrix of classifier
    '''
    if train:
        '''
        training performance
        '''
        print("Train Result:\n")
        print("accuracy score: {0:.4f}\n".format(accuracy_score(y_train, clf.predict(X_train))))
        print("Classification Report: \n {}\n".format(classification_report(y_train, clf.predict(X_train))))
        print("Confusion Matrix: \n {}\n".format(confusion_matrix(y_train, clf.predict(X_train))))

        res = cross_val_score(clf, X_train, y_train.ravel(), cv=10, scoring='accuracy')
        print("Average Accuracy: \t {0:.4f}".format(np.mean(res)))
        print("Accuracy SD: \t\t {0:.4f}".format(np.std(res)))
        
    else:
        '''
        test performance
        '''
        print("Test Result:\n")        
        print("accuracy score: {0:.4f}\n".format(accuracy_score(y_test, clf.predict(X_test))))
        print("Classification Report: \n {}\n".format(classification_report(y_test, clf.predict(X_test))))
        print("Confusion Matrix: \n {}\n".format(confusion_matrix(y_test, clf.predict(X_test))))    

## Ensemble with different types of classifiers

### Model 1: Decision Tree

In [10]:
from sklearn.tree import DecisionTreeClassifier

In [11]:
tree_clf = DecisionTreeClassifier()
tree_clf.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [12]:
print_score(tree_clf, X_train, y_train, X_test, y_test, train=True)
print_score(tree_clf, X_train, y_train, X_test, y_test, train=False)

Train Result:

accuracy score: 1.0000

Classification Report: 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       919
           1       1.00      1.00      1.00       183

   micro avg       1.00      1.00      1.00      1102
   macro avg       1.00      1.00      1.00      1102
weighted avg       1.00      1.00      1.00      1102


Confusion Matrix: 
 [[919   0]
 [  0 183]]

Average Accuracy: 	 0.7776
Accuracy SD: 		 0.0254
Test Result:

accuracy score: 0.7826

Classification Report: 
               precision    recall  f1-score   support

           0       0.87      0.87      0.87       314
           1       0.26      0.26      0.26        54

   micro avg       0.78      0.78      0.78       368
   macro avg       0.57      0.57      0.57       368
weighted avg       0.78      0.78      0.78       368


Confusion Matrix: 
 [[274  40]
 [ 40  14]]



### Model 2: Random Forest

In [13]:
from sklearn.ensemble import RandomForestClassifier

In [14]:
rf_clf = RandomForestClassifier()
rf_clf.fit(X_train, y_train.ravel())



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [15]:
print_score(rf_clf, X_train, y_train, X_test, y_test, train=True)
print_score(rf_clf, X_train, y_train, X_test, y_test, train=False)

Train Result:

accuracy score: 0.9819

Classification Report: 
               precision    recall  f1-score   support

           0       0.98      1.00      0.99       919
           1       0.99      0.90      0.94       183

   micro avg       0.98      0.98      0.98      1102
   macro avg       0.99      0.95      0.97      1102
weighted avg       0.98      0.98      0.98      1102


Confusion Matrix: 
 [[918   1]
 [ 19 164]]

Average Accuracy: 	 0.8466
Accuracy SD: 		 0.0179
Test Result:

accuracy score: 0.8614

Classification Report: 
               precision    recall  f1-score   support

           0       0.88      0.97      0.92       314
           1       0.58      0.20      0.30        54

   micro avg       0.86      0.86      0.86       368
   macro avg       0.73      0.59      0.61       368
weighted avg       0.83      0.86      0.83       368


Confusion Matrix: 
 [[306   8]
 [ 43  11]]



### Combine Models into DF

In [16]:
en_en = pd.DataFrame()

In [17]:
tree_clf.predict_proba(X_train)

array([[1., 0.],
       [0., 1.],
       [1., 0.],
       ...,
       [1., 0.],
       [1., 0.],
       [0., 1.]])

In [18]:
en_en['tree_clf'] = pd.DataFrame(tree_clf.predict_proba(X_train))[1]
en_en['rf_clf'] = pd.DataFrame(rf_clf.predict_proba(X_train))[1]
col_name = en_en.columns
en_en = pd.concat([en_en, pd.DataFrame(y_train).reset_index(drop=True)], axis=1)

In [19]:
en_en.head()

   tree_clf  rf_clf  0
0       0.0     0.1  0
1       1.0     0.7  1
2       0.0     0.1  0
3       0.0     0.0  0
4       1.0     0.8  1


In [20]:
tmp = list(col_name)
tmp.append('ind')
en_en.columns = tmp

In [21]:
en_en.head()

   tree_clf  rf_clf  ind
0       0.0     0.1    0
1       1.0     0.7    1
2       0.0     0.1    0
3       0.0     0.0    0
4       1.0     0.8    1


### Meta Classifier

In [22]:
from sklearn.linear_model import LogisticRegression

In [23]:
m_clf = LogisticRegression(fit_intercept=False)

In [24]:
m_clf.fit(en_en[['tree_clf', 'rf_clf']], en_en['ind'])



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=False,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [25]:
en_test = pd.DataFrame()

In [26]:
en_test['tree_clf'] = pd.DataFrame(tree_clf.predict_proba(X_test))[1]
en_test['rf_clf'] = pd.DataFrame(rf_clf.predict_proba(X_test))[1]
col_name = en_test.columns
en_test['combined'] = m_clf.predict(en_test[['tree_clf', 'rf_clf']])

In [27]:
col_name = en_test.columns
tmp = list(col_name)
tmp.append('ind')

In [28]:
tmp

['tree_clf', 'rf_clf', 'combined', 'ind']

In [29]:
en_test = pd.concat([en_test, pd.DataFrame(y_test).reset_index(drop=True)], axis=1)

In [30]:
en_test.columns = tmp

In [31]:
print(pd.crosstab(en_test['ind'], en_test['combined']))

combined    0   1
ind              
0         274  40
1          40  14


In [32]:
print(round(accuracy_score(en_test['ind'], en_test['combined']),4))

0.7826


In [33]:
print(classification_report(en_test['ind'], en_test['combined']))

              precision    recall  f1-score   support

           0       0.87      0.87      0.87       314
           1       0.26      0.26      0.26        54

   micro avg       0.78      0.78      0.78       368
   macro avg       0.57      0.57      0.57       368
weighted avg       0.78      0.78      0.78       368



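Stacking did not help here: the combined model's test accuracy (0.7826) and confusion matrix are identical to the decision tree's, and its accuracy is below the random forest's 0.8614. A likely cause is that the meta-classifier was fitted on the base models' predictions for the training data itself, where the decision tree is perfect (training accuracy 1.0000), so the meta-classifier learns to follow the tree. Fitting the meta-classifier on held-out predictions, tuning the base models, or adding more diverse classifiers may help. It is also possible that a meta-classifier is simply not suitable for this exercise.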
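
One common way to keep a meta-classifier from leaning on an overfitted base model is to train it on out-of-fold predictions, so that each training row is scored by a model that never saw it. The sketch below uses synthetic data rather than the HR dataset; the `*_demo`, `Xd_*`, `oof`, and `meta` names and all parameter values are assumptions for the example, and it is not a claim that this would fix the result above. scikit-learn's `StackingClassifier` packages a similar procedure.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (an assumption for this sketch).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=12, weights=[0.84, 0.16], random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

base_models = {
    "tree_clf": DecisionTreeClassifier(random_state=0),
    "rf_clf": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Out-of-fold P(class 1): every training row is predicted by a model
# fitted on the other folds only.
oof = np.column_stack([
    cross_val_predict(model, Xd_train, yd_train, cv=5, method="predict_proba")[:, 1]
    for model in base_models.values()
])

# The meta-classifier learns from the out-of-fold probabilities.
meta = LogisticRegression()
meta.fit(oof, yd_train)

# Refit each base model on the full training set, then score the test set.
test_features = np.column_stack([
    model.fit(Xd_train, yd_train).predict_proba(Xd_test)[:, 1]
    for model in base_models.values()
])
stacked_pred = meta.predict(test_features)

print("stacked test accuracy:", (stacked_pred == yd_test).mean())
```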

## Ensemble with single type of classifier

In [34]:
df = pd.read_csv("data/HR/WA_Fn-UseC_-HR-Employee-Attrition.csv")

In [35]:
df.Attrition.value_counts() / df.Attrition.count()

No     0.838776
Yes    0.161224
Name: Attrition, dtype: float64

Notice how imbalanced the data is: only about 16% of employees left the company.

In [36]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import AdaBoostClassifier

In [37]:
pd.Series(list(y_train)).value_counts() / pd.Series(list(y_train)).count()

[0]    0.833938
[1]    0.166062
dtype: float64

In [38]:
# Caution: as written, the majority class (0) receives the larger weight.
class_weight = {0:0.839383, 1:0.160617}
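
Class weights meant to offset imbalance are usually inversely proportional to class frequency, so the rarer class gets the larger weight. A minimal sketch with scikit-learn's `compute_class_weight` on a made-up label vector (84 negatives, 16 positives, an assumption chosen to resemble the ratio above) shows the weights that the `"balanced"` heuristic produces. Passing `class_weight="balanced"` directly to `RandomForestClassifier` applies the same heuristic.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels with a similar imbalance (illustrative only).
y_demo = np.array([0] * 84 + [1] * 16)
classes = np.array([0, 1])

# "balanced" weight for a class = n_samples / (n_classes * count of that class)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)
balanced_weight = dict(zip(classes.tolist(), weights.tolist()))

print(balanced_weight)
```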

In [39]:
forest = RandomForestClassifier(class_weight=class_weight)

In [40]:
ada = AdaBoostClassifier(base_estimator=forest, n_estimators=100, ##1000 estimators taking too long 
                        learning_rate=0.5, random_state=42)

In [41]:
import warnings
warnings.filterwarnings("ignore")
ada.fit(X_train, y_train.ravel())

AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=RandomForestClassifier(bootstrap=True,
            class_weight={0: 0.839383, 1: 0.160617}, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
          learning_rate=0.5, n_estimators=100, random_state=42)

In [42]:
print_score(ada, X_train, y_train, X_test, y_test, train=True)
print_score(ada, X_train, y_train, X_test, y_test, train=False)

Train Result:

accuracy score: 1.0000

Classification Report: 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       919
           1       1.00      1.00      1.00       183

   micro avg       1.00      1.00      1.00      1102
   macro avg       1.00      1.00      1.00      1102
weighted avg       1.00      1.00      1.00      1102


Confusion Matrix: 
 [[919   0]
 [  0 183]]

Average Accuracy: 	 0.8540
Accuracy SD: 		 0.0163
Test Result:

accuracy score: 0.8641

Classification Report: 
               precision    recall  f1-score   support

           0       0.87      0.99      0.93       314
           1       0.70      0.13      0.22        54

   micro avg       0.86      0.86      0.86       368
   macro avg       0.78      0.56      0.57       368
weighted avg       0.84      0.86      0.82       368


Confusion Matrix: 
 [[311   3]
 [ 47   7]]



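Slightly higher accuracy than Random Forest alone (0.8641 vs. 0.8614 on the test set, 0.8540 vs. 0.8466 in cross-validation), with a slightly smaller cross-validation standard deviation (0.0163 vs. 0.0179). Recall on the attrition class, however, dropped from 0.20 to 0.13.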

**Add bagging**

In [43]:
bag_clf = BaggingClassifier(base_estimator=ada, n_estimators=50,
                            max_samples=1.0, max_features=1.0, bootstrap=True,
                           bootstrap_features=False, n_jobs=-1,
                           random_state=42)

In [44]:
bag_clf.fit(X_train, y_train.ravel())

BaggingClassifier(base_estimator=AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=RandomForestClassifier(bootstrap=True,
            class_weight={0: 0.839383, 1: 0.160617}, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, m...se=0,
            warm_start=False),
          learning_rate=0.5, n_estimators=100, random_state=42),
         bootstrap=True, bootstrap_features=False, max_features=1.0,
         max_samples=1.0, n_estimators=50, n_jobs=-1, oob_score=False,
         random_state=42, verbose=0, warm_start=False)

In [49]:
print_score(bag_clf, X_train, y_train, X_test, y_test, train=True)
print_score(bag_clf, X_train, y_train, X_test, y_test, train=False)