## Ensemble Learning
Ensemble learning, as the name suggests, combines several machine learning models into a single, stronger model. Depending on the technique used, the combination reduces variance, bias, or both, and thereby improves predictive performance.
### Variance

Variance is a measure of how spread out data is. In the context of machine learning, a model has high variance when its predictions on the same test set differ considerably depending on which training set was used to fit it. High variance usually arises because the model has become attuned to the specific nuances of its training data rather than learning the general relationship between input and output. Ideally, we want every machine learning model to have low variance.

### Bias

Bias is the difference between the ground truth and the average value of our predictions. A low bias will indicate that the predictions are very close to the actual values. A high bias implies that the model has oversimplified the relationship between the inputs and outputs, leading to high error rates on test sets, which again is an undesirable outcome.
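
The two definitions above can be written compactly. This is a sketch under assumed notation that does not appear in the original text: $f(x)$ is the true relationship, $\hat{f}(x)$ is the model's prediction, the expectation is taken over different training sets, and $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$.

$$
\mathrm{Bias}\big[\hat{f}(x)\big] = \mathbb{E}\big[\hat{f}(x)\big] - f(x), \qquad
\mathrm{Var}\big[\hat{f}(x)\big] = \mathbb{E}\Big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\Big]
$$

$$
\mathbb{E}\Big[\big(y - \hat{f}(x)\big)^2\Big] = \mathrm{Bias}\big[\hat{f}(x)\big]^2 + \mathrm{Var}\big[\hat{f}(x)\big] + \sigma^2
$$

Under these assumptions the expected squared error splits into a bias term, a variance term, and irreducible noise, which is why ensemble methods target the first two.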

### Simple Methods for Ensemble Learning
**Averaging**

Averaging is a naïve way of doing ensemble learning, but it is also extremely useful. The basic idea is to take the predictions of multiple individual models and average them to generate a final prediction. The assumption is that the errors made by individual learners partly cancel out when their predictions are averaged, giving a model that is superior to any single base model. For averaging to work, the errors of the base models should be uncorrelated, meaning that the individual models should not make the same kinds of mistakes. Using diverse models is therefore critical to keeping the errors uncorrelated.

When we build an ensemble by averaging, we work with the predicted probability of each class rather than the class predictions themselves. The **predict()** method outputs only the class that has the highest probability, whereas the probability of each class is obtained from a separate method called **predict_proba()**.
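
As a minimal sketch of the averaging idea, the example below fits three different classifiers, averages their `predict_proba()` outputs, and picks the class with the highest averaged probability. The synthetic dataset and the choice of base models are assumptions made for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumed, for illustration).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Three deliberately different base models, to encourage uncorrelated errors.
models = [
    LogisticRegression(max_iter=1000),
    KNeighborsClassifier(),
    DecisionTreeClassifier(random_state=0),
]

# Each predict_proba() call returns an array of shape (n_test, n_classes).
probas = [m.fit(X_train, y_train).predict_proba(X_test) for m in models]

# Average the class probabilities across the models.
avg_proba = np.mean(probas, axis=0)

# The final prediction is the class with the highest averaged probability.
ensemble_pred = models[0].classes_[np.argmax(avg_proba, axis=1)]
accuracy = np.mean(ensemble_pred == y_test)
print(f"Averaging ensemble accuracy: {accuracy:.3f}")
```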

**Weighted Averaging**

Weighted averaging is an extension of the averaging method that we saw earlier. The major difference between the two approaches is the way the combined predictions are generated. In weighted averaging, we assign a weight to each model's predictions before combining them. The weights reflect our judgment of how influential each model should be in the ensemble. They are assigned arbitrarily at first and then refined through experimentation: we start with some assumed weights and iterate with different weights for each model to check whether performance improves.
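
The following sketch shows one way to combine predicted probabilities with weights. The weights `[0.5, 0.3, 0.2]`, the dataset, and the base models are arbitrary assumptions used for illustration; in practice the weights would be tuned on a validation set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumed, for illustration).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

models = [
    LogisticRegression(max_iter=1000),
    KNeighborsClassifier(),
    DecisionTreeClassifier(random_state=0),
]
# Hypothetical starting weights, one per model; not tuned.
weights = [0.5, 0.3, 0.2]

probas = [m.fit(X_train, y_train).predict_proba(X_test) for m in models]

# np.average applies one weight per model along axis 0 and normalizes
# by the sum of the weights.
weighted_proba = np.average(probas, axis=0, weights=weights)

weighted_pred = models[0].classes_[np.argmax(weighted_proba, axis=1)]
accuracy = np.mean(weighted_pred == y_test)
print(f"Weighted averaging accuracy: {accuracy:.3f}")
```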

**Max Voting**

The max voting method works on the principle of majority rule. Individual models (or, in ensemble learning jargon, individual learners) are fit on the training set, and their predictions are then generated on the test set. Each individual learner's prediction counts as one vote. For each example in the test set, the class that receives the most votes becomes the final prediction.

When implementing the max voting method with the scikit-learn library, we use a class called **VotingClassifier()**. We provide the individual learners as input to VotingClassifier to create the ensemble model. This ensemble model is then fit on the training set and finally used to predict on the test set.
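
A minimal sketch of max voting with `VotingClassifier` follows. Setting `voting="hard"` selects majority voting on the predicted class labels. The dataset and the three base learners are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumed, for illustration).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Each individual learner is passed as a (name, estimator) pair.
voter = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier()),
        ("tree", DecisionTreeClassifier(random_state=0)),
    ],
    voting="hard",  # majority vote on predicted class labels
)

# Fitting the ensemble fits every individual learner on the training set.
voter.fit(X_train, y_train)

votes = voter.predict(X_test)
accuracy = voter.score(X_test, y_test)
print(f"Max voting accuracy: {accuracy:.3f}")
```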



### Advanced Techniques for Ensemble Learning

In these techniques, the individual models or learners generate predictions and those predictions are used to form the final predictions. The individual models or learners, which generate the first set of predictions, are called **base learners** or **base estimators** and the model, which is a combination of the predictions of the base learners, is called the **meta learner** or **meta estimator**. The way in which the meta learners learn from the base learners differs for each of the advanced techniques.

**Bagging**

Bagging is short for **Bootstrap Aggregating**. In statistics, bootstrapping means drawing samples from the available dataset with replacement. In bagging, multiple subsets of the data are created using bootstrapping. A base learner is fit on each of these subsets and used to generate predictions. The predictions from all the base learners are then aggregated (averaged, or put to a vote in the case of classification) to give the final predictions of the meta learner.

When implementing bagging, we use a class called **BaggingClassifier()**, which is available in the scikit-learn library. Some of the important arguments that are provided when creating an ensemble model include the following:

   - **base_estimator**: This argument defines the base estimator to be used. (In scikit-learn 1.2 and later, it is named **estimator**.)
   - **n_estimators**: This argument defines the number of base estimators that will be used in the ensemble.
   - **max_samples**: This argument defines the maximum size of the bootstrapped sample used to fit each base estimator. It is typically given as a proportion of the training set (0.8, 0.7, and so on).
   - **max_features**: When fitting multiple individual learners, randomly selecting the features used for each one has been found to result in superior performance. The max_features argument indicates how many features to use. For example, if there were 10 features in the dataset and max_features was set to 0.8, then only 8 (0.8 x 10) features would be used to fit each base learner.
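
The arguments above can be put together as in the sketch below. The base estimator is passed positionally as the first argument, an assumption made here so that the example does not depend on whether the installed scikit-learn version names it `base_estimator` or `estimator`. The dataset and the chosen values are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10 features (assumed, for illustration).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

bag = BaggingClassifier(
    DecisionTreeClassifier(),  # base estimator, passed positionally
    n_estimators=50,           # number of base estimators in the ensemble
    max_samples=0.8,           # each learner sees 80% of the training rows
    max_features=0.8,          # each learner sees 8 of the 10 features
    random_state=0,
)
bag.fit(X_train, y_train)

accuracy = bag.score(X_test, y_test)
print(f"Bagging accuracy: {accuracy:.3f}")

# estimators_features_ records which feature indices each learner used.
print(len(bag.estimators_features_[0]))
```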

**Boosting**

The bagging technique, which we discussed in the last section, can be described as a parallel learning technique: each base learner is fit independently of the others, and their predictions are aggregated. Boosting, in contrast, works sequentially. It is built on the principle of correcting the prediction errors of each base learner. The base learners are fit one after the other, with each one trying to correct the errors made by the previous learner, and the process continues until a superior meta learner is created. The steps are as follows:


   - **1** A base learner is fit on a subset of the dataset.
   - **2** Once the model is fit, predictions are made on the entire dataset.
   - **3** The errors in the predictions are identified by comparing them with the actual labels.
   - **4** Those examples that generated the wrong predictions are given larger weights.
   - **5** Another base learner is fit on the dataset where the weights of the wrongly predicted examples in the previous step are altered.
   - **6** This base learner tries to correct the errors of the earlier model and generates its own predictions.
   - **7** Steps 4, 5, and 6 are repeated until a strong meta learner is generated.
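
The steps above can be sketched from scratch as follows. This is a simplified, AdaBoost-style loop written under several assumptions that are not in the original text: labels are recoded to -1 and +1, every learner is a depth-1 decision tree fit on the full dataset with example weights (rather than on a subset), and each learner's vote is scaled by a coefficient `alpha` derived from its weighted error.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data (assumed); recode labels from {0, 1} to {-1, +1}.
X, y01 = make_classification(n_samples=300, n_features=10, random_state=0)
y = 2 * y01 - 1

n_rounds = 10
weights = np.full(len(y), 1.0 / len(y))  # start with equal example weights
learners, alphas = [], []

for _ in range(n_rounds):
    # Fit a weak learner that pays more attention to heavily weighted examples.
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X, y, sample_weight=weights)

    # Predict on the entire dataset and measure the weighted error.
    pred = stump.predict(X)
    err = np.sum(weights[pred != y])

    # Stop if the learner is perfect or no better than chance.
    if err <= 0 or err >= 0.5:
        break

    # Learners with a lower weighted error get a larger say in the final vote.
    alpha = 0.5 * np.log((1 - err) / err)

    # Increase the weights of misclassified examples, decrease the rest,
    # then renormalize so the weights sum to 1.
    weights *= np.exp(-alpha * y * pred)
    weights /= weights.sum()

    learners.append(stump)
    alphas.append(alpha)

# The meta learner is the sign of the alpha-weighted sum of all predictions.
scores = sum(a * m.predict(X) for a, m in zip(alphas, learners))
ensemble_pred = np.sign(scores)
train_accuracy = np.mean(ensemble_pred == y)
print(f"Rounds used: {len(learners)}, training accuracy: {train_accuracy:.3f}")
```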
   
When implementing the boosting technique, one option is the **AdaBoostClassifier()** class in scikit-learn. As with the bagging estimator, two of the important arguments for **AdaBoostClassifier()** are **base_estimator** (named **estimator** in scikit-learn 1.2 and later) and **n_estimators**.
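
A minimal usage sketch of `AdaBoostClassifier` follows. As an assumption made for portability across scikit-learn versions, the base estimator is passed positionally rather than by keyword. The dataset and the choice of a depth-1 decision tree as the base learner are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumed, for illustration).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

boost = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),  # weak base learner, passed positionally
    n_estimators=50,                      # upper limit on sequentially fit learners
    random_state=0,
)
boost.fit(X_train, y_train)

accuracy = boost.score(X_test, y_test)
print(f"AdaBoost accuracy: {accuracy:.3f}")
```

Boosting can stop early, so the number of fitted learners in `boost.estimators_` may be lower than `n_estimators`.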

**Stacking**

Stacking, in principle, works in a similar way to bagging and boosting in that it combines base learners to form a meta learner. However, the approach for getting the meta learners from the base learners differs substantially in stacking. In stacking, the meta learner is fit on the predictions made by the base learners. The stacking algorithm can be explained as follows:

   -  **1** The training set is split into multiple parts, say, five parts.
   -  **2** A base learner (say, KNN) is fit on four parts of the training set and then used to predict on the fifth part. This process is repeated until the base learner has predicted on each of the five parts of the training set. All the predictions generated in this way are collated to get the predictions for the complete training set.
   -  **3** The same base learner is then used to generate predictions on the test set as well.
   -  **4** Steps 2 and 3 are then repeated with a different base learner (say, random forest).
   -  **5** Next enters a new model, which acts as the meta learner (say, logistic regression).
   -  **6** The meta learner is fit on the predictions generated on the training set by the base learners.
   -  **7** Once the meta learner is fit on the training set, the same model is used to predict on the predictions generated on the test set by the base learners.
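
The steps above can be sketched by hand using `cross_val_predict` to collect the out-of-fold predictions. Several details here are assumptions rather than part of the original description: the meta learner is given class probabilities instead of class labels, and each base learner is refit on the full training set before predicting on the test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (assumed, for illustration).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

base_learners = [
    KNeighborsClassifier(),
    RandomForestClassifier(n_estimators=50, random_state=0),
]

train_meta, test_meta = [], []
for learner in base_learners:
    # Steps 1-2: fit on four folds, predict the held-out fold, and repeat
    # until every training example has an out-of-fold prediction.
    oof = cross_val_predict(
        learner, X_train, y_train, cv=5, method="predict_proba"
    )[:, 1]
    train_meta.append(oof)

    # Step 3: refit on the full training set and predict on the test set.
    test_meta.append(learner.fit(X_train, y_train).predict_proba(X_test)[:, 1])

# One column of predictions per base learner (step 4 is the loop above).
train_meta = np.column_stack(train_meta)
test_meta = np.column_stack(test_meta)

# Steps 5-6: the meta learner is fit on the base learners' predictions.
meta = LogisticRegression()
meta.fit(train_meta, y_train)

# Step 7: the meta learner predicts from the test-set predictions.
stacked_pred = meta.predict(test_meta)
accuracy = np.mean(stacked_pred == y_test)
print(f"Stacking accuracy: {accuracy:.3f}")
```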

One implementation of stacking is the **StackingClassifier()** class, which is available from a package called mlxtend. Its main arguments are the models that we assign as base learners and the model that we assign as the meta learner.
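
For readers who do not have mlxtend installed, the sketch below swaps in scikit-learn's own `sklearn.ensemble.StackingClassifier` (available since scikit-learn 0.22), which is a different class from the mlxtend one and takes differently named arguments. The dataset and the choice of models are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (assumed, for illustration).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

stack = StackingClassifier(
    # Base learners, given as (name, estimator) pairs.
    estimators=[
        ("knn", KNeighborsClassifier()),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    # Meta learner, fit on the base learners' cross-validated predictions.
    final_estimator=LogisticRegression(),
    cv=5,  # number of folds used to generate those predictions
)
stack.fit(X_train, y_train)

accuracy = stack.score(X_test, y_test)
print(f"Stacking accuracy: {accuracy:.3f}")
```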