
# Ensemble Models 
## 2. Ensemble Models with Bagging and RandomForest

Recommended reading:
* https://towardsdatascience.com/holy-grail-for-bias-variance-tradeoff-overfitting-underfitting-7fad64ab5d76?source=post_page

In [241]:
from pandas import read_csv
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier, BaggingClassifier, \
    AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score
from sklearn.utils import shuffle
import warnings
warnings.filterwarnings('ignore')

In [242]:
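def classification_results(y, y_pred, name='', classes=('no', 'yes'), add_rep=False):
    """Print accuracy, a labelled confusion matrix and, optionally, a full report."""
    acc = accuracy_score(y, y_pred)

    # Rows are the true classes, columns are the predicted classes
    cm = pd.DataFrame(confusion_matrix(y, y_pred),
                      index=list(classes),
                      columns=list(classes))

    print(name + ' accuracy: ', round(acc, 4), '\n')
    print(cm, '\n')
    if add_rep:
        print(classification_report(y, y_pred))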


# Introduction
* https://towardsdatascience.com/simple-guide-for-ensemble-learning-methods-d87cc68705a2
* https://towardsdatascience.com/holy-grail-for-bias-variance-tradeoff-overfitting-underfitting-7fad64ab5d76

Ensemble models combine the decisions of multiple models to improve overall performance.

Ensemble learning methods employ a group of models whose combined result is often more accurate than the prediction of any single model in the group.

## Advanced Ensemble techniques
Bagging and Boosting are advanced ensemble techniques. Unlike simple ensemble techniques, they do not merely combine the predictions of other, separately chosen models; rather, they build a new learning algorithm on top of a single base learner. For example, if we choose a decision tree classifier as the base learner, Bagging and Boosting each produce a pool of decision trees.
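
For contrast, a simple ensemble combines the predictions of several different models. The following is only a minimal sketch: it assumes a synthetic dataset from `make_classification` (not the spam data used later) and an arbitrary choice of three base models.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Hard voting: each model casts one vote and the majority class wins
voting = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("dt", DecisionTreeClassifier(max_depth=5, random_state=0)),
        ("knn", KNeighborsClassifier()),
    ],
    voting="hard",
)
voting.fit(X_tr, y_tr)
voting_acc = voting.score(X_te, y_te)
print("Voting accuracy:", round(voting_acc, 4))
```

Bagging and Boosting differ from this in that every member of the pool is the same kind of model.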

> sklearn.ensemble module http://scikit-learn.org/stable/modules/classes.html#module-sklearn.ensemble 

### Bagging (Bootstrap Aggregating)
Bagging is an ensemble method. Here is a rough scheme:

1. Create random subsets (samples) of the training data.
2. Build a model, a decision tree for example, and fit it to each sample.
3. Combine the results of the models using simple ensemble methods such as voting and averaging.



#### Bagging demonstration
For a simple demonstration, let's code a bagging example with a Decision Tree as the base learner. The general idea is:

* apply multiple models to the same problem, and then
* consider all the outcomes for the final decision.

> Such a collection of models is called an **ensemble**, and it has two main flavors:
* averaging and
* boosting.



## Bagging (Bootstrap Aggregation)

* In English, the term bootstrapping means "to get something out of a situation using existing resources".
* In statistics, it refers to randomly resampling the data in order to create a collection of models.

This means that bagging is similar to voting, with the difference that:
* you choose a specific type of model (called the **base model**), and then fit it to subsamples of your data **many times**.
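
To make the resampling idea concrete, here is a small sketch of a single bootstrap sample. It assumes a hypothetical dataset of 1000 rows represented only by its row indices; the names `data_idx` and `boot_idx` are introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
data_idx = np.arange(n)  # stand-in for the row indices of a dataset

# A bootstrap sample: draw n row indices *with replacement*
boot_idx = rng.choice(data_idx, size=n, replace=True)

# Some rows appear several times, others not at all ("out-of-bag" rows)
unique_fraction = len(np.unique(boot_idx)) / n
print("Fraction of distinct rows in the sample:", unique_fraction)
print("Theoretical expectation:", 1 - (1 - 1 / n) ** n)
```

On average a bootstrap sample of this size contains roughly 63% of the distinct rows, so each model in the collection sees a somewhat different view of the data.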



In [273]:
df = read_csv("spambase_csv.csv")
train, test = train_test_split(df, test_size=0.3)
train.shape

(3220, 58)

In [303]:
X_train = train[train.columns[:-1]]
y_train = train[train.columns[-1]]
X_test = test[test.columns[:-1]]
y_test = test[test.columns[-1]]

In [318]:
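# 20 random subsets of 1000 rows (sampled without replacement; use replace=True for a true bootstrap)
dfi = [train.sample(1000) for _ in range(20)]
# Fit one shallow decision tree per subset
Dti = [DecisionTreeClassifier(max_depth=5).fit(d[df.columns[:-1]], d[df.columns[-1]]) for d in dfi]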

In [326]:
predictions = train[train.columns[-1]].to_frame()
test_predictions = test[test.columns[-1]].to_frame()

In [327]:
# Store each tree's predictions in its own column
for i, Dt in enumerate(Dti):
    predictions['Dt' + str(i)] = Dt.predict(train[train.columns[:-1]])
    test_predictions['Dt' + str(i)] = Dt.predict(test[test.columns[:-1]])

In [328]:
len(np.where(predictions[predictions.columns[1:]].sum(axis=1)>=5,1,0)), len(predictions)

(3220, 3220)

In [330]:
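# Predict class 1 when at least 5 of the 20 trees vote for it (a strict majority would need more than 10)
predictions['pred'] = np.where(predictions[predictions.columns[1:]].sum(axis=1)>=5,1,0)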

predictions[::5].head(5)

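| | class | Dt0 | Dt1 | Dt2 | Dt3 | Dt4 | Dt5 | Dt6 | Dt7 | Dt8 | ... | Dt11 | Dt12 | Dt13 | Dt14 | Dt15 | Dt16 | Dt17 | Dt18 | Dt19 | pred |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 740 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | ... | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 |
| 2321 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4010 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 680 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 3751 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |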


In [332]:
test_predictions.head()

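| | class | Dt0 | Dt1 | Dt2 | Dt3 | Dt4 | Dt5 | Dt6 | Dt7 | Dt8 | ... | Dt10 | Dt11 | Dt12 | Dt13 | Dt14 | Dt15 | Dt16 | Dt17 | Dt18 | Dt19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2644 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 3449 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1529 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 4200 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1515 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |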


In [333]:
# Same voting threshold as on the training set: at least 5 of the 20 trees
test_predictions['pred'] = np.where(test_predictions[test_predictions.columns[1:]].sum(axis=1)>=5,1,0)
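
The cells above label a row as class 1 when at least 5 of the 20 trees vote for it. A plain majority vote would instead require more than half of the trees to agree. A minimal sketch of that variant, using a small hypothetical `votes` frame (made up for illustration) rather than the notebook's predictions:

```python
import numpy as np
import pandas as pd

# Hypothetical 0/1 votes of 5 trees for 4 samples (illustration only)
votes = pd.DataFrame(
    [[1, 1, 1, 0, 0],
     [0, 0, 1, 0, 0],
     [1, 1, 1, 1, 1],
     [0, 1, 0, 1, 0]],
    columns=["Dt0", "Dt1", "Dt2", "Dt3", "Dt4"],
)

n_trees = votes.shape[1]
# Predict class 1 only when more than half of the trees vote for it
majority = np.where(votes.sum(axis=1) > n_trees / 2, 1, 0)
print(majority)  # [1 0 1 0]
```

Lowering the threshold below one half, as in the demonstration, trades more false positives for fewer false negatives.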

In [334]:
classification_results(y_train, predictions['pred'], name='Bagging on Train:')
classification_results(y_test, test_predictions['pred'], name='Bagging on Test:')

Bagging on Train: accuracy:  0.9314 

       no   yes
no   1836   123
yes    98  1163 

Bagging on Test: accuracy:  0.9051 

      no  yes
no   751   78
yes   53  499 



### Compare to Decision Tree classifier

In [336]:
dt = DecisionTreeClassifier(max_depth=5).fit(X_train, y_train)
classification_results(y_train, dt.predict(X_train), name='Dt on Train:')
classification_results(y_test, dt.predict(X_test), name='Dt on Test:')

Dt on Train: accuracy:  0.9248 

       no   yes
no   1895    64
yes   178  1083 

Dt on Test: accuracy:  0.9044 

      no  yes
no   788   41
yes   91  461 



## Bagging in scikit-learn - BaggingClassifier

This ensemble method, or meta-classifier, is implemented in scikit-learn by the `BaggingClassifier` class.
Its main arguments are:
* base_estimator - the base algorithm (renamed `estimator` in scikit-learn 1.2)
* n_estimators - the number of learners, each fitted to a subset of the training set

> http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html 
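
Beyond these two arguments, `BaggingClassifier` also exposes `max_samples`, `bootstrap` and `oob_score`. The sketch below is illustrative only: it assumes a synthetic dataset from `make_classification`, and the parameter values are arbitrary choices rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=600, n_features=12, random_state=0)

bag = BaggingClassifier(
    DecisionTreeClassifier(max_depth=5),  # base model, passed positionally
    n_estimators=50,
    max_samples=0.5,   # each tree is fitted to a sample half the size of the data
    bootstrap=True,    # sample rows with replacement (the default)
    oob_score=True,    # score each row using only trees that did not see it
    random_state=0,
)
bag.fit(X, y)
oob_accuracy = bag.oob_score_
print("Number of fitted trees:", len(bag.estimators_))
print("Out-of-bag accuracy:", round(oob_accuracy, 4))
```

The out-of-bag score gives an accuracy estimate without setting aside a separate validation set.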

### Decision tree as a base model

In [347]:
Dt_base = DecisionTreeClassifier(max_depth=5)

In [348]:
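# The base model is passed positionally: the keyword is `base_estimator` in
# older scikit-learn releases and `estimator` from version 1.2 onwards.
Dt_bagging = BaggingClassifier(Dt_base,
                               n_estimators=100, verbose=0)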
Dt_bagging.fit(X_train, y_train)
classification_results(y_train, Dt_bagging.predict(X_train), name='Dt_bagging on Train:')
classification_results(y_test, Dt_bagging.predict(X_test), name='Dt_bagging on Test:')

Dt_bagging on Train: accuracy:  0.9345 

       no   yes
no   1908    51
yes   160  1101 

Dt_bagging on Test: accuracy:  0.9225 

      no  yes
no   799   30
yes   77  475 



> Bagging helps reduce overfitting, since each model in the collection is exposed to only a subset of the training data.

### Random forest 

The random forest technique builds on this bagging concept.
It goes a step further and adds variety to its models, which reduces the variance even more, by also choosing random subsets of the features: each tree is fitted to a bootstrapped sample, and in scikit-learn's implementation only a random subset of the features is considered when making each split.
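
The variance argument can be made concrete with a standard result. It is stated here under a simplifying assumption: the predictions $T_b(x)$ of the $B$ trees are identically distributed with variance $\sigma^2$ and pairwise correlation $\rho$ (these symbols are introduced here for illustration).

$$
\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
$$

Adding more trees shrinks only the second term. Choosing random feature subsets makes the trees less correlated, which lowers $\rho$ and therefore the first term as well.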

> http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html 
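
In scikit-learn the size of the random feature subset is controlled by the `max_features` argument. The sketch below compares two settings on a synthetic dataset (an assumption made for illustration; the numbers say nothing about the spam data).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=600, n_features=16, random_state=0)

scores = {}
for max_features in (None, "sqrt"):
    # None: every split may use all 16 features (plain bagging of trees)
    # "sqrt": every split draws a random subset of 4 features
    forest = RandomForestClassifier(
        n_estimators=50, max_features=max_features, random_state=0
    )
    scores[str(max_features)] = cross_val_score(forest, X, y, cv=5).mean()

for name, score in scores.items():
    print(f"max_features={name}: mean CV accuracy {score:.4f}")
```

Which setting wins depends on the data, so it is worth treating `max_features` as a tunable hyperparameter.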

Let's compare with RandomForest:

In [344]:
Rd = RandomForestClassifier()
Rd.fit(X_train, y_train)
classification_results(y_train, Rd.predict(X_train), name='Rd on Train:')
classification_results(y_test, Rd.predict(X_test), name='Rd on Test:')

Rd on Train: accuracy:  0.9966 

       no   yes
no   1959     0
yes    11  1250 

Rd on Test: accuracy:  0.9413 

      no  yes
no   802   27
yes   54  498 



## Random Forest on big Datasets
Since RandomForest has no option for incremental learning, it is hard to use on large datasets.
But we can use the following trick:

* Split the data into smaller subsets that fit in memory.
* Fit a random forest to each subset.
* Append all the underlying trees together in the `estimators_` member of one of the forests:
```
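# `forests` is the list of fitted random forests, one per subset
for i in range(1, len(forests)):
    forests[0].estimators_.extend(forests[i].estimators_)
forests[0].n_estimators = len(forests[0].estimators_)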
```
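
A runnable sketch of the whole trick follows. It assumes a small synthetic dataset split into three chunks as a stand-in for data that does not fit in memory; the chunk count and forest sizes are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for a dataset too large to fit in memory
X, y = make_classification(n_samples=900, n_features=10, random_state=0)

# Fit one small forest per chunk (here: 3 chunks of 300 rows)
forests = []
for X_chunk, y_chunk in zip(np.array_split(X, 3), np.array_split(y, 3)):
    rf = RandomForestClassifier(n_estimators=10, random_state=0)
    forests.append(rf.fit(X_chunk, y_chunk))

# Merge every tree into the first forest and keep n_estimators consistent
merged = forests[0]
for other in forests[1:]:
    merged.estimators_.extend(other.estimators_)
merged.n_estimators = len(merged.estimators_)

print("Trees in merged forest:", merged.n_estimators)  # 30
print("Accuracy on the full data:", round(merged.score(X, y), 4))
```

Every chunk should contain all the classes; otherwise the trees from different forests would not agree on what each class index means.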