# Introduction to Ensemble Methods in Machine Learning
> This notebook introduces a family of learning algorithms known as **ensemble methods**. **These algorithms are very important!**

> 1.What is Ensemble Method?

> 2.Brief of Voting and Averaging Based Ensemble Method

> 2.1Implementation of Voting Based Ensemble Methods with SK-learn

> 3.Brief of Bagging Ensemble Method

> 3.1Brief of Random Forest Algorithm

> 3.2Implementation of Random Forest with SK-learn

> 4.Brief of Boosting Method

> 4.1Implementation of Adaboost with SK-learn

> 4.2Implementation of Gradient Tree Boosting

> 5.Break it down

### 1.What is Ensemble Method?
* Ensemble methods are learning algorithms that construct **a set of classifiers** and then classify new data points by taking a **weighted vote** of their predictions.

* In practice, we will create **multiple models** and then **combine them** to produce improved results, and it usually produces more accurate solutions than a single model. 

* Some widely known ensemble methods: **Voting & Averaging, Bagging (Random Forest), Boosting (Gradient Boosting)**, and **Stacking (not covered in this notebook)**.

### 2.Brief of Voting and Averaging Based Ensemble Methods.
* **Voting and Averaging** are two of the **easiest ensemble methods**. 
* **Voting** is used for **classification** and **Averaging** is used for **regression**.


**Steps to create this model:**
1. The first step is to create **multiple classification/regression models** from a training dataset. Each base model can be built on the **same dataset with a different algorithm**, or in any other way that makes the models differ.

2. The second step is **voting/averaging to get the final prediction**. There are a few ways to do this: **majority voting, weighted voting, simple averaging, and weighted averaging**.

<img src="asset/Votting.png" width="500" height="500" style="float: left;">
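The four combination rules named in step 2 can be sketched with plain NumPy. This is only an illustration: the predictions and weights below are made-up values, and the helper name `vote` is hypothetical rather than part of any library.

```python
import numpy as np

def vote(class_preds, weights=None):
    """Combine class labels of shape (n_classifiers, n_samples) by (weighted) voting."""
    n_classifiers, n_samples = class_preds.shape
    n_classes = class_preds.max() + 1
    if weights is None:
        weights = np.ones(n_classifiers)  # majority voting: every vote counts once
    scores = np.zeros((n_classes, n_samples))
    for preds, w in zip(class_preds, weights):
        # add this classifier's weight to the class it voted for, per sample
        scores[preds, np.arange(n_samples)] += w
    return scores.argmax(axis=0)  # ties go to the smallest label

# Hypothetical labels from three classifiers for four samples (made-up values)
class_preds = np.array([
    [0, 1, 2, 1],
    [0, 2, 2, 1],
    [1, 1, 2, 0],
])
# Hypothetical outputs from three regressors for two samples (made-up values)
reg_preds = np.array([
    [2.0, 4.0],
    [3.0, 5.0],
    [4.0, 9.0],
])

majority = vote(class_preds)                               # majority voting
weighted = vote(class_preds, weights=[0.2, 0.2, 0.6])      # weighted voting
simple_avg = reg_preds.mean(axis=0)                        # simple averaging
weighted_avg = np.average(reg_preds, axis=0, weights=[0.5, 0.25, 0.25])  # weighted averaging

print(majority, weighted, simple_avg, weighted_avg)
```

With a large enough weight, a single classifier can overrule the other two, which is why the weighted vote here differs from the majority vote on the first and last samples.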

### 2.1 Implementation of Voting Ensemble Methods with SK-learn
SK-Learn API: http://scikit-learn.org/stable/modules/ensemble.html#votingclassifier

In [1]:
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier

In [2]:
iris = datasets.load_iris()
X, y = iris.data[:, 1:3], iris.target

In [3]:
clf1 = LogisticRegression(random_state=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()

In [4]:
# voting='hard' means majority voting; voting='soft' averages the predicted class probabilities
# and is recommended when the base classifiers are well calibrated
eclf = VotingClassifier(estimators=[('logistic', clf1), ('randomForest', clf2), ('gaussianNB', clf3)], voting='soft')

In [5]:
for clf, label in zip([clf1, clf2, clf3, eclf], ['Logistic Regression', 'Random Forest', 'naive Bayes', 'Ensemble']):
    scores = cross_val_score(clf, X, y, cv=4, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]  K-Fold: %s" % (scores.mean(), scores.std(), label, scores))

Accuracy: 0.90 (+/- 0.04) [Logistic Regression]  K-Fold: [ 0.87179487  0.87179487  0.88888889  0.97222222]
Accuracy: 0.93 (+/- 0.03) [Random Forest]  K-Fold: [ 0.92307692  0.97435897  0.88888889  0.94444444]
Accuracy: 0.91 (+/- 0.02) [naive Bayes]  K-Fold: [ 0.8974359   0.8974359   0.91666667  0.94444444]
Accuracy: 0.95 (+/- 0.02) [Ensemble]  K-Fold: [ 0.94871795  0.94871795  0.91666667  0.97222222]
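To make `voting='soft'` less of a black box, the sketch below refits the same kind of ensemble and checks that its probabilities are simply the average of the base models' `predict_proba` outputs. The `max_iter=1000` and `n_estimators=20` settings are assumptions added here to keep the example quick and free of convergence warnings; they are not taken from the cells above.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, VotingClassifier

iris = datasets.load_iris()
X, y = iris.data[:, 1:3], iris.target

eclf = VotingClassifier(
    estimators=[
        ('logistic', LogisticRegression(random_state=1, max_iter=1000)),  # max_iter is an assumption
        ('randomForest', RandomForestClassifier(n_estimators=20, random_state=1)),
        ('gaussianNB', GaussianNB()),
    ],
    voting='soft',
).fit(X, y)

# Average the class probabilities of the fitted base models by hand
manual_proba = np.mean([est.predict_proba(X) for est in eclf.estimators_], axis=0)
# The predicted class is the one with the highest averaged probability
manual_pred = eclf.classes_[manual_proba.argmax(axis=1)]

same_proba = np.allclose(manual_proba, eclf.predict_proba(X))
same_pred = bool((manual_pred == eclf.predict(X)).all())
print(same_proba, same_pred)
```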


### 3.Brief of Bagging Ensemble Methods
* **Bagging** is short for **“Bootstrap Aggregating”**.

* Bagging is a technique used to **reduce the variance of our predictions** by combining the result of multiple classifiers modeled on different sub-samples of the same data set.

* The bagging algorithm has three steps: **first**, create multiple datasets by sampling from the original one; **second**, build a classifier on each dataset; **last**, combine the classifiers by collecting all of their predictions and averaging them (or taking a vote) for the final prediction.

* **Bagging methods** work best with **strong and complex models** (e.g., fully developed decision trees).

* **Random Forest algorithm** uses the bagging technique with some differences.

<img src="asset/Bagging.png" width="500" height="500" style="float: left;">
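The three bagging steps can be written out by hand as a rough sketch. The choices below (the iris data, 25 trees, a 70/30 split, and the variable names) are assumptions made for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

rng = np.random.default_rng(0)
n_models = 25  # assumed ensemble size
models = []
for _ in range(n_models):
    # Step 1: create a bootstrap dataset by sampling rows with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Step 2: build a fully grown decision tree on that dataset
    tree = DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx])
    models.append(tree)

# Step 3: combine the classifiers by averaging their predicted class probabilities
# (this assumes every bootstrap sample contains all three classes)
proba = np.mean([m.predict_proba(X_test) for m in models], axis=0)
pred = proba.argmax(axis=1)
accuracy = (pred == y_test).mean()
print("Bagged trees accuracy: {0:.3f}".format(accuracy))
```

In practice, scikit-learn's `BaggingClassifier` wraps the same idea, so the manual loop is only meant to show what happens inside.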

### 3.1Brief of Random Forest Algorithm
**Pros**:

* This algorithm can solve both types of problems, i.e. **classification and regression**, and gives decent estimates for both.

* One of the benefits of Random Forest that excites me most is its power to **handle large data sets with high dimensionality**. It can handle thousands of input variables and identify the most significant ones, so it is sometimes used as a dimensionality reduction (feature selection) method. Further, the model outputs the importance of each variable, which can be a very handy feature.

* It has an effective method for **estimating missing data** and maintains accuracy when a large proportion of the data are missing.

* It has methods for **balancing errors** in data sets where classes are imbalanced.

* The capabilities of the above can be extended to **unlabeled data**, leading to **unsupervised clustering, data views and outlier detection.**
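As a small illustration of the variable-importance output mentioned above, the sketch below fits a forest on synthetic data and reads `feature_importances_`. The dataset is an assumption: with `shuffle=False`, `make_classification` places the three informative features in the first three columns, so they should receive most of the importance.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumed for illustration): columns 0-2 are informative, 3-7 are noise
X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances, one value per feature, summing to 1
importances = forest.feature_importances_
ranking = importances.argsort()[::-1]  # most important feature first
for i in ranking:
    print("feature {0}: {1:.3f}".format(i, importances[i]))
```

Low-ranked features are candidates for removal, which is the sense in which the forest can help with dimensionality reduction.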

**Cons**:

* It does a **good job at classification** but is **not as good for regression problems**, as it does not give precise continuous predictions. In regression, it cannot predict beyond the range of the target values in the training data, and it may over-fit data sets that are particularly noisy.

* Random Forest can feel like a **black box approach** for statistical modelers: you have very little control over what the model does. **At best, you can try different parameters and random seeds.**
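The regression limitation is easy to see on a toy example. The sketch below assumes a simple linear target `y = 2x` on `[0, 10]` (made-up data) and asks the forest to predict far outside that interval. Because each tree predicts an average of training targets, the output cannot exceed the largest target seen in training.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data (assumed): a straight line, so the true value at x=20 would be 40
X_train = np.linspace(0, 10, 200).reshape(-1, 1)
y_train = 2.0 * X_train.ravel()  # targets lie in [0, 20]

reg = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

# Inputs well outside the training range
X_new = np.array([[15.0], [20.0]])
pred = reg.predict(X_new)

print("largest training target:", y_train.max())
print("predictions outside the range:", pred)
```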

### 3.2Implementation of Random Forest with SK-learn
SK-Learn API: http://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees

SK-Learn Sample Generator API: http://scikit-learn.org/stable/datasets/index.html#sample-generators

In [6]:
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier

In [7]:
# load data
X, y = make_blobs(n_samples=10000, n_features=10, centers=100, random_state=0)

In [8]:
# build Decision Tree Model for comparing with the Random Forest later on
clf = DecisionTreeClassifier(max_depth=None, min_samples_split=2, random_state=0)
scores = cross_val_score(clf, X, y, cv=10)
print("Decision Tree Model Score: {0}".format(scores.mean()))

Decision Tree Model Score: 0.9828000000000001


In [9]:
# build RandomForest Model
clf = RandomForestClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)
scores = cross_val_score(clf, X, y, cv=10)
print("Random Forest Model Score: {0}".format(scores.mean()))                  

Random Forest Model Score: 0.9998000000000001


In [10]:
# build Extra Tree Model
# note: compared with Random Forest, this algorithm usually reduces the variance of the model a bit more,
# at the expense of a slightly greater increase in bias.
clf = ExtraTreesClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)
scores = cross_val_score(clf, X, y, cv=10)
print("Extra Tree Model Score: {0}".format(scores.mean()))  

Extra Tree Model Score: 0.9999


So from the code above, we can see that the Random Forest model scores better than a single Decision Tree model.

### 4.Brief of Boosting Method

* The term “boosting” is used to describe a family of algorithms which are able to **convert weak models to strong models**.

* Boosting incrementally builds an ensemble by **training each model on the same dataset**, but with **the weights of the instances adjusted according to the errors of the previous model**. The main idea is to force the models to focus on the instances that are hard to predict. **Unlike bagging, boosting is a sequential method**, so the models cannot be trained in parallel.

* **Adaboost** is a widely known boosting algorithm.

* A **decision tree** is usually preferred as the **base algorithm for Adaboost**, and in the **sklearn library the default base algorithm for Adaboost is a decision tree** (of depth 1, i.e. a decision stump).

<img src="asset/AdaBoost.png" width="500" height="500" style="float: left;">
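The re-weighting idea can be sketched as a bare-bones discrete AdaBoost for two classes. This is a simplified illustration, not scikit-learn's implementation: the synthetic data, the 30 rounds, and the helper name `ensemble_predict` are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem (assumed); labels are mapped to {-1, +1}
X, y01 = make_classification(n_samples=400, n_features=10, n_informative=5,
                             random_state=0)
y = 2 * y01 - 1

n_rounds = 30                          # assumed number of boosting rounds
w = np.full(len(y), 1.0 / len(y))      # start with equal instance weights
stumps, alphas = [], []
for _ in range(n_rounds):
    # Train a weak model (decision stump) on the same data with the current weights
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    # Weighted error of this round, clipped to keep the logarithm finite
    err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # say of this model in the final vote
    # Increase the weights of misclassified instances, decrease the others
    w *= np.exp(-alpha * y * pred)
    w /= w.sum()
    stumps.append(stump)
    alphas.append(alpha)

def ensemble_predict(X_new):
    """Weighted vote of all stumps (hypothetical helper for this sketch)."""
    score = sum(a * s.predict(X_new) for a, s in zip(alphas, stumps))
    return np.where(score >= 0, 1, -1)

train_acc = (ensemble_predict(X) == y).mean()
print("Training accuracy of the boosted stumps: {0:.3f}".format(train_acc))
```

Each round needs the weights produced by the previous one, which is why the loop cannot be run in parallel.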

### 4.1Implementation of Adaboost with SK-learn
Adaboost SK-Learn API:
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html#sklearn.ensemble.AdaBoostClassifier

In [7]:
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier

In [8]:
# load data
iris = load_iris()
X, y = iris.data, iris.target

In [9]:
clf = AdaBoostClassifier(n_estimators=100)
scores = cross_val_score(clf, X, y)
scores.mean() 

0.95996732026143794

### 4.2Implementation of Gradient Tree Boosting
SK-Learn API: http://scikit-learn.org/stable/modules/ensemble.html#gradient-boosting

The advantages of GBRT are:
* Natural handling of data of mixed type (= heterogeneous features)
* Predictive power
* Robustness to outliers in output space (via robust loss functions)

The disadvantages of GBRT are:
* Scalability: due to the sequential nature of boosting, it can hardly be parallelized.
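The sequential nature is visible in a hand-written sketch of gradient tree boosting for regression. It assumes a squared-error loss, for which each new tree is simply fitted to the residuals of the current ensemble. The noisy sine data, the learning rate, and the tree depth are made-up choices for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data (assumed): a noisy sine curve
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X.ravel()) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1  # assumed shrinkage factor
n_rounds = 50        # assumed number of trees

prediction = np.full_like(y, y.mean())  # start from a constant model
trees = []
mse_history = [np.mean((y - prediction) ** 2)]
for _ in range(n_rounds):
    # For squared error, the negative gradient is the residual
    residuals = y - prediction
    # Each tree depends on the residuals left by all previous trees
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)
    mse_history.append(np.mean((y - prediction) ** 2))

print("Training MSE before boosting: {0:.4f}".format(mse_history[0]))
print("Training MSE after boosting:  {0:.4f}".format(mse_history[-1]))
```

Tree number `k` cannot be fitted until tree `k-1` has updated `prediction`, so only the work inside a single tree can be parallelized.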

In [1]:
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier

In [2]:
# load data
iris = load_iris()
X, y = iris.data, iris.target

In [6]:
clf = GradientBoostingClassifier(n_estimators=100)
scores = cross_val_score(clf, X, y)
scores.mean() 

0.9673202614379085

### 5.Break it down:

* In practice, **ensemble methods** usually produce **more accurate solutions than a single model** would.

* **Voting** is the **easiest ensemble method**. It **creates different classifiers based on the same dataset**, then combines their predictions and gives the final prediction by voting.

* **Bagging methods** **work best with strong and complex models**. They **draw sub-samples of the dataset** and train multiple models of the same algorithm on the different sub-samples to build an ensemble model.

* **Random Forest** is one of the most popular bagging methods. It is also a **black-box algorithm**, so in practice we should try **different hyper-parameters** to train a better model.

* **Boosting methods** convert **weak models to strong models** by training each model on the same dataset with adjusted instance weights. **Adaboost** is a widely known boosting algorithm.

* **Ensemble methods** are often **not preferred in industries** where **interpretability** is more important. However, when **higher accuracy is more important than interpretability**, consider using them.

**Reference**: 

https://www.toptal.com/machine-learning/ensemble-methods-machine-learning

https://www.analyticsvidhya.com/blog/2016/04/complete-tutorial-tree-based-modeling-scratch-in-python/#five

http://scikit-learn.org/stable/modules/ensemble.html

http://web.engr.oregonstate.edu/~tgd/publications/mcs-ensembles.pdf