Sources: [1. Analytics Vidhya](https://www.analyticsvidhya.com/blog/2021/12/a-detailed-guide-to-ensemble-learning/#h2_1),
[2. Analytics Vidhya](https://www.analyticsvidhya.com/blog/2018/06/comprehensive-guide-for-ensemble-models/#h2_1)
# Introduction to Ensemble Learning

Let’s start with an example: you have a question, you ask it to many people, and then you aggregate their answers. In most cases, you would find that the aggregated answer is better than a single expert’s answer. This is called the **wisdom of the crowd**. Similarly, if you aggregate the predictions of a group of models, such as classifiers or regressors, the group often performs better than the best individual model. Such a group of predictors is called an **ensemble**, and the technique is known as **ensemble learning**.

For example, you can train a group of decision tree classifiers, each on a different random subset of the training data. To make a prediction, obtain the predictions of all the individual trees and predict the class that gets the most votes. Such an ensemble of decision trees is called a **Random Forest**, and it is one of the most powerful algorithms in machine learning.
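The wisdom-of-the-crowd effect can be checked with a little arithmetic. The sketch below is a simplified illustration under assumptions of our own: a binary problem, an odd number of voters, every voter correct with the same probability (0.55 is a made-up value), and all voters perfectly independent. The helper name `majority_vote_accuracy` is hypothetical.

```python
from math import comb


def majority_vote_accuracy(n_voters: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent voters is correct."""
    # Smallest number of correct votes that forms a strict majority
    threshold = n_voters // 2 + 1
    # Sum the binomial probabilities of getting `threshold` or more correct votes
    return sum(
        comb(n_voters, k) * p_correct**k * (1 - p_correct) ** (n_voters - k)
        for k in range(threshold, n_voters + 1)
    )


# Each voter is only slightly better than a coin flip (assumed value: 0.55)
for n in (1, 11, 101):
    print(n, round(majority_vote_accuracy(n, 0.55), 3))
```

Under these idealized assumptions, the majority becomes more reliable as voters are added. Real models trained on the same data are correlated, so the gain in practice is usually smaller.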

# 01. Voting Classifiers in Ensemble Learning (Max Voting, Soft Voting)

* Suppose you have several classifiers, each with an accuracy of about 80-85%: for example, a Support Vector Classifier, a Random Forest Classifier, a Logistic Regression Classifier, a K-Nearest Neighbors classifier, and perhaps a couple more. A simple way to create an even better classifier is to combine the predictions of all the classifiers and output the most frequent class. This majority-vote classifier is known as a **Hard Voting Classifier**.

* **01. Max Voting / Majority-vote Classification / Hard Voting Classifier**: The max voting method is generally used for classification problems. In this technique, multiple models make a prediction for each data point, and each model’s prediction counts as a ‘vote’. The prediction made by the majority of the models is used as the final prediction. **For example**, suppose you ask 5 of your colleagues to rate a movie (out of 5): three of them rate it 4 and two of them rate it 5. Since the majority gave a rating of 4, the final rating is 4. You can think of this as taking the mode of all the predictions.

* **02. Soft Voting**: If all the classifiers have a **predict_proba()** method, meaning they can estimate class probabilities, you can output the class with the highest probability averaged over all the classifiers. This technique is known as soft voting. It often performs better than hard voting because it gives more weight to highly confident votes. To use it, replace **“hard” with “soft”** in the voting parameter, and keep in mind that this only works if every classifier can estimate class probabilities.


* Interestingly, a voting classifier often outperforms even the best classifier in the ensemble. In fact, even if each classifier is a weak learner (meaning it does only slightly better than random guessing), the ensemble can still be a strong learner, provided there are enough weak learners and they are sufficiently independent of one another. **An ensemble works best when the classifiers are as independent of each other as possible**, i.e. when they make uncorrelated errors. This is hard to achieve when they are all trained on the same data, because they are then likely to make the same types of errors. The result is many majority votes for the wrong class, which reduces the accuracy of the ensemble. One way to get a diverse set of classifiers is to use very different algorithms: they tend to make different kinds of errors, which improves the accuracy of the ensemble.

* The following code creates and trains a voting classifier composed of three different classifiers: an SVM, a Random Forest, and a Logistic Regression. We will use the moons dataset (scikit-learn’s make_moons) for our implementation.

## Let’s start with our implementation
**Importing necessary libraries:**

In [65]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import accuracy_score

**Creating dataset:**

In [66]:
X, y = make_moons(n_samples=500, noise=0.30)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 20)

**Initializing the models:**

In [67]:
lreg = LogisticRegression()
rf = RandomForestClassifier(n_estimators=100)
svm = SVC()

In [68]:
voting = VotingClassifier(
    estimators=[('logistics_regression', lreg),
                ('random_forest', rf),
                ('support_vector_machine', svm)],
    voting='hard')

Here, voting=‘hard’ means the ensemble follows majority-rule voting, and n_estimators=100 sets the number of trees in the random forest. All other parameters keep their default values.
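To see what the two voting modes compute, here is a toy sketch with NumPy. The probabilities are made up for illustration and do not come from the models above; this is a hand-rolled illustration of the idea, not scikit-learn's internal code.

```python
import numpy as np

# Hypothetical class probabilities from three classifiers for ONE sample
# (columns: class 0, class 1). The numbers are invented for illustration.
probas = np.array([
    [0.55, 0.45],  # classifier A: weakly prefers class 0
    [0.60, 0.40],  # classifier B: weakly prefers class 0
    [0.10, 0.90],  # classifier C: strongly prefers class 1
])

# Hard voting: every classifier casts one vote for its most likely class,
# and the most frequent vote (the mode) wins.
votes = probas.argmax(axis=1)
hard_prediction = np.bincount(votes).argmax()

# Soft voting: average the probabilities first, then pick the most likely class.
mean_proba = probas.mean(axis=0)
soft_prediction = mean_proba.argmax()

print("hard vote:", hard_prediction)
print("soft vote:", soft_prediction, mean_proba)
```

In this toy case the two modes disagree: two lukewarm votes win the hard vote, while the single confident classifier tips the averaged probabilities the other way.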

**Fitting training data:**

In [69]:
voting.fit(X_train, y_train)

VotingClassifier(estimators=[('logistics_regression', LogisticRegression()),
                             ('random_forest', RandomForestClassifier()),
                             ('support_vector_machine', SVC())])

**Now let’s see how this works on the test set:**

In [70]:
for clf in (lreg, rf, svm, voting):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.832
RandomForestClassifier 0.888
SVC 0.888
VotingClassifier 0.88


As we can see, the voting classifier (0.88) performs on par with the best individual models (0.888) and clearly better than logistic regression (0.832). With only three base classifiers and a small test set, the ensemble does not always come out on top. Since make_moons is called without a random_state, the exact numbers will also vary from run to run. To try **soft voting**, replace **“hard” with “soft”**. Keep in mind that every classifier must then be able to estimate class probabilities, so SVC has to be created with probability=True.
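A minimal soft-voting sketch could look like the following. It rebuilds the dataset so that it runs on its own; the random_state values are arbitrary choices of ours for reproducibility, so the accuracy will not match the numbers printed above.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Same kind of dataset as before; the seeds are arbitrary
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=20)

soft_voting = VotingClassifier(
    estimators=[
        ('logistic_regression', LogisticRegression()),
        ('random_forest', RandomForestClassifier(n_estimators=100, random_state=42)),
        # SVC only exposes predict_proba() when probability=True
        ('support_vector_machine', SVC(probability=True, random_state=42)),
    ],
    voting='soft')  # average the predicted class probabilities

soft_voting.fit(X_train, y_train)
print(accuracy_score(y_test, soft_voting.predict(X_test)))
```

Note that probability=True makes SVC slower to train, because it fits an extra calibration step with internal cross-validation.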

# 02. Bagging and Pasting in Ensemble Learning

### 2.1. Bagging

The idea behind bagging is to combine the results of multiple models (for instance, all decision trees) to get a more generalized result. Here’s a question: if you build all the models on the same data and combine them, will it be useful? There is a high chance that these models will give the same result, since they all get the same input. So how can we solve this problem? One of the techniques is **bootstrapping**.

Bootstrapping is a sampling technique in which we create subsets of observations from the original dataset by sampling with replacement.
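As a quick illustration, here is one way to draw a bootstrap sample with NumPy. The ten-element toy dataset and the seed are arbitrary choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily
data = np.arange(10)            # a toy "dataset" of 10 observations

# Sampling WITH replacement: the same observation can be drawn several times
bootstrap_sample = rng.choice(data, size=data.size, replace=True)

print("original   :", data)
print("bootstrap  :", np.sort(bootstrap_sample))
# Observations that were never drawn for this particular sample
print("never drawn:", np.setdiff1d(data, bootstrap_sample))
```

Because the draws are made with replacement, a bootstrap sample typically contains duplicates and leaves some observations out.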

**Bagging** is short for bootstrap aggregating. It is an ensemble method in which we first draw bootstrap samples of the data and train one model on each sample. We then aggregate the models’ predictions with equal weights.

<img src = "https://upload.wikimedia.org/wikipedia/commons/thumb/c/c8/Ensemble_Bagging.svg/1280px-Ensemble_Bagging.svg.png" width = 500>

* Multiple subsets are created from the original dataset, selecting observations with replacement.
* A base model (weak model) is created on each of these subsets.
* The models run in parallel and are independent of each other.
* The final predictions are determined by combining the predictions from all the models.
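The four steps above can be sketched by hand as follows. This is a simplified illustration, not how scikit-learn's BaggingClassifier is implemented internally; the number of models (51, odd to avoid ties) and the seeds are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=20)

rng = np.random.default_rng(101)
n_models = 51  # odd, so a binary majority vote can never tie

# Steps 1-2: draw a bootstrap sample and train one tree on each sample
trees = []
for _ in range(n_models):
    idx = rng.integers(0, len(X_train), size=len(X_train))  # indices drawn with replacement
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx]))

# Steps 3-4: every tree predicts independently, then the votes are combined
all_votes = np.array([tree.predict(X_test) for tree in trees])  # shape (n_models, n_test)
# Labels are 0/1, so a mean above 0.5 means most trees voted for class 1
y_pred = (all_votes.mean(axis=0) > 0.5).astype(int)

print(accuracy_score(y_test, y_pred))
```

Each tree is trained independently of the others, which is why this loop could be run in parallel.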

### 2.2. Pasting

When the subsets are sampled without replacement, the method is known as **pasting**.

Both bagging and pasting train each predictor on a different random subset of the training data. Once all the predictors are trained, the ensemble makes a prediction for new data by combining the predictions of all the predictors, in the same way as a hard voting classifier (i.e. the statistical mode). The ensemble usually has a similar bias but a lower variance than a single predictor trained on the original training data.

For more information on bias and variance, check out this blog [here](https://www.analyticsvidhya.com/blog/2021/06/how-to-get-the-most-out-of-bias-variance-tradeoff/).

A major advantage of bagging and pasting is that the predictors can be trained in parallel, on different CPU cores or even on different servers. This is one of the reasons why both methods are so popular.

## Let’s start with our implementation:
We will be using the same dataset that we created earlier.

**01. Importing necessary libraries:**

In [71]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

**02. Initializing a bagging classifier:**

In [72]:
bagging_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=250,
    max_samples=100, bootstrap=True, random_state=101)

Let’s break down the code a little: **n_estimators=250** trains an ensemble of 250 decision tree classifiers, each on 100 training instances sampled at random (**max_samples=100**). The sampling is done with replacement because we are building a bagging classifier; if you want pasting instead, set **bootstrap=False**. The optional **n_jobs** parameter (not set above) specifies how many CPU cores to use for training and prediction (-1 means all available cores).

**For regression, you can use BaggingRegressor.**

Check out the official documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html) for more information.
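For comparison, a pasting variant that also uses n_jobs might look like the sketch below. It rebuilds the dataset so that it runs on its own; the seeds and the name `pasting_clf` are our own choices, so the accuracy will differ from the numbers in this article.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=20)

pasting_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=250,
    max_samples=100,
    bootstrap=False,   # sample WITHOUT replacement, i.e. pasting
    n_jobs=-1,         # train and predict on all available CPU cores
    random_state=101)

pasting_clf.fit(X_train, y_train)
print(accuracy_score(y_test, pasting_clf.predict(X_test)))
```

With bootstrap=False, max_samples cannot exceed the size of the training set, since no instance can be drawn twice for the same predictor.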

**03. Training the classifier:**

In [73]:
bagging_clf.fit(X_train, y_train)

BaggingClassifier(base_estimator=DecisionTreeClassifier(), max_samples=100,
                  n_estimators=250, random_state=101)

**04. Testing the classifier:**

In [74]:
y_pred = bagging_clf.predict(X_test)
print(accuracy_score(y_test, y_pred))

0.888


The figure below compares a regular decision tree classifier with the bagging classifier we just created, both trained on the moons dataset.

<img src = "https://editor.analyticsvidhya.com/uploads/67391fig.PNG" width=500>

The ensemble generalizes much better than the individual decision tree.
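One rough way to check this numerically is to compare training and test accuracy for both models. The sketch below rebuilds the dataset with arbitrary seeds of our own, so the numbers are only indicative and will not match the outputs shown earlier.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=20)

# A single, fully grown decision tree
tree_clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# The bagging ensemble from this section
bagging_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=250,
    max_samples=100, bootstrap=True, random_state=101).fit(X_train, y_train)

for name, model in (("single tree", tree_clf), ("bagging", bagging_clf)):
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: train={train_acc:.3f}, test={test_acc:.3f}")
```

A large gap between training and test accuracy is a sign of overfitting. A fully grown single tree typically memorizes the training set, while the ensemble usually shows a smaller gap.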

# Out-Of-Bag Evaluation

With bagging, some training instances may be sampled several times for a given predictor, while others may not be sampled at all. By default, BaggingClassifier samples n training instances with replacement (where n is the size of the training set), which means that only about 63% of the training instances are sampled on average for each predictor. The remaining 37% or so that are not sampled are known as out-of-bag (OOB) instances. Note that they are not the same 37% for all predictors.
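Where does the 63% come from? The probability that a given instance is never picked in n draws with replacement is (1 - 1/n)^n, which approaches 1/e (about 0.37) as n grows. The sketch below checks this both with the formula and with a small simulation; n = 375 matches the training set size used here (75% of 500 samples), and the seed and the number of repetitions are arbitrary.

```python
import numpy as np

n = 375  # training set size in this article (75% of 500 samples)

# Probability that a given instance is drawn at least once in n draws
p_in_bag = 1 - (1 - 1 / n) ** n
print("formula:", round(p_in_bag, 3), " limit 1 - 1/e:", round(1 - np.exp(-1), 3))

# Simulation: fraction of distinct instances in each of 1000 bootstrap samples
rng = np.random.default_rng(0)
fractions = [np.unique(rng.integers(0, n, size=n)).size / n for _ in range(1000)]
mean_fraction = float(np.mean(fractions))
print("simulated:", round(mean_fraction, 3))
```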

Since a predictor never sees its OOB instances during training, they can be used to evaluate it, removing the need for a separate validation set.

BaggingClassifier has an oob_score parameter; setting it to True automatically runs an OOB evaluation after training, and the result is stored in the oob_score_ attribute.

### Let’s get a quick look at the implementation:
The libraries and datasets used here are the same.

**Initializing the classifier:**

In [75]:
bagging_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=250,
    bootstrap=True, oob_score=True, random_state=101)

**Training the Classifier:**

In [76]:
bagging_clf.fit(X_train, y_train)

BaggingClassifier(base_estimator=DecisionTreeClassifier(), n_estimators=250,
                  oob_score=True, random_state=101)

**Result:**

In [77]:
bagging_clf.oob_score_

0.912

We get about 91%, which is reasonably close to the test accuracy we achieved earlier (0.888). So by evaluating the model on its OOB instances, we can get a rough estimate of the accuracy we will achieve on the test set.
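To compare the two estimates side by side, a self-contained sketch could look like this. The dataset is rebuilt with arbitrary seeds of our own, so the values will differ from the outputs above. The oob_decision_function_ attribute holds the averaged OOB class probabilities for each training instance.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=20)

bagging_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=250,
    bootstrap=True, oob_score=True, random_state=101)
bagging_clf.fit(X_train, y_train)

oob_acc = bagging_clf.oob_score_                               # estimated from OOB instances only
test_acc = accuracy_score(y_test, bagging_clf.predict(X_test)) # measured on the held-out test set
print("OOB accuracy :", round(oob_acc, 3))
print("test accuracy:", round(test_acc, 3))

# Averaged OOB class probabilities for the first three training instances
print(bagging_clf.oob_decision_function_[:3])
```

The two numbers rarely match exactly, since both are estimates computed on fairly small samples.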