# Ensembling Approaches
> A tutorial on ensembling approaches.
> Disclaimer: Everything in this notebook has been taken from blogs on the internet.

#### Scope of ensemble approaches covered

Ensemble modeling combines the predictions of several models and is a powerful way to improve on the performance of any single one. It usually pays off to apply ensemble learning on top of the individual models you build. Ensemble models have repeatedly been used to good effect in competitions such as Kaggle. The ensembles covered here are:

   - Majority Voting
   - Majority Weighted Voting
   - Simple Average
   - Weighted Average
   - Stacking Variant A
   - Stacking Variant B

![](my_icons/Slide1.GIF)

![](my_icons/Slide2.GIF)

#### Packages to import

In [None]:
from sklearn.ensemble import VotingClassifier
import pandas as pd
# load_boston was removed in scikit-learn 1.2; load_diabetes is a stand-in regression dataset
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
import joblib  # sklearn.externals.joblib was removed from scikit-learn
from xgboost import XGBRegressor
from vecstack import StackingTransformer

![](my_icons/Slide3.GIF)

#### Voting Classifier
- In majority voting, the predicted class label for a particular sample is the class label that represents the majority (mode) of the class labels predicted by each individual classifier.

- E.g., if the prediction for a given sample is

    - classifier 1 -> class 1

    - classifier 2 -> class 1

    - classifier 3 -> class 2

    - the VotingClassifier (with voting='hard') would classify the sample as “class 1” based on the majority class label.

- In the case of a tie, the VotingClassifier selects the class based on ascending sort order of the class labels. E.g., in the following scenario

    - classifier 1 -> class 2

    - classifier 2 -> class 1

    - the class label 1 will be assigned to the sample.

- In contrast to majority voting (hard voting), soft voting returns the class label as the argmax of the sum of predicted probabilities.

- Specific weights can be assigned to each classifier via the weights parameter. When weights are provided, the predicted class probabilities of each classifier are collected, multiplied by that classifier's weight, and averaged.
- The final class label is the one with the highest average probability.


| Classifier   | class 1 | class 2 | class 3 |
|--------------|---------|---------|---------|
| Classifier 1 | 0.2*w1  | 0.5*w1  | 0.3*w1  |
| Classifier 2 | 0.6*w2  | 0.3*w2  | 0.1*w2  |
| Classifier 3 | 0.3*w3  | 0.4*w3  | 0.3*w3  |
| Weighted avg | 0.37    | 0.4     | 0.23    |

- Here, assuming uniform weights (w1 = w2 = w3 = 1), the predicted class label is 2, since it has the highest average probability.
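
The table above can be reproduced with a few lines of plain Python. This is a minimal sketch of the soft-voting rule, not scikit-learn's implementation; the function name soft_vote and the second set of weights are assumptions made for illustration.

```python
# Class probabilities from the table: one row per classifier, one column per class.
probas = [
    [0.2, 0.5, 0.3],  # classifier 1
    [0.6, 0.3, 0.1],  # classifier 2
    [0.3, 0.4, 0.3],  # classifier 3
]
classes = [1, 2, 3]


def soft_vote(probas, classes, weights=None):
    """Return (predicted class, weighted average probability per class)."""
    if weights is None:
        weights = [1.0] * len(probas)  # uniform weights
    total = sum(weights)
    n_classes = len(classes)
    avg = [
        sum(w * row[j] for w, row in zip(weights, probas)) / total
        for j in range(n_classes)
    ]
    best = max(range(n_classes), key=lambda j: avg[j])  # argmax over classes
    return classes[best], avg


label, avg = soft_vote(probas, classes)
print(label, [round(p, 2) for p in avg])  # 2 [0.37, 0.4, 0.23]

# Hypothetical weights that trust classifier 2 three times as much as the others.
label_weighted, _ = soft_vote(probas, classes, weights=[1, 3, 1])
print(label_weighted)  # 1
```

With uniform weights the winner is class 2, as in the table. Giving classifier 2 more weight shifts the decision to class 1, which shows how the weights parameter can change the outcome.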

![](my_icons/Slide4.GIF)

In [None]:
def votingClassifier(models_dict, voting='hard', weights=None):
    # Build the (name, estimator) pairs that VotingClassifier expects.
    estimators = list(models_dict.items())
    # Pass the requested voting mode and weights through unchanged.
    # Soft voting requires every estimator to implement predict_proba.
    model = VotingClassifier(estimators=estimators, voting=voting, weights=weights)
    return model
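
For intuition about what hard voting computes, here is a minimal pure-Python sketch of the majority rule and the tie-breaking behaviour described earlier. It is not scikit-learn's code; the function name hard_vote and the example weights are assumptions made for illustration.

```python
def hard_vote(predictions, weights=None):
    """Return the label with the largest (weighted) vote count.

    Ties are broken in favour of the smallest label, which mirrors the
    ascending-sort-order rule described for VotingClassifier.
    """
    if weights is None:
        weights = [1] * len(predictions)  # every classifier gets one vote
    totals = {}
    for label, weight in zip(predictions, weights):
        totals[label] = totals.get(label, 0) + weight
    top = max(totals.values())
    return min(label for label, total in totals.items() if total == top)


print(hard_vote([1, 1, 2]))                     # majority wins: 1
print(hard_vote([2, 1]))                        # tie, smallest label wins: 1
print(hard_vote([1, 1, 2], weights=[1, 1, 3]))  # weighted majority: 2
```

The last call corresponds to majority weighted voting: the single vote for class 2 carries weight 3 and outweighs the two unit votes for class 1.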

![](my_icons/Slide5.GIF)

#### Average

- As in majority (max) voting, every model makes a prediction for each data point.
- In this method, we take an average of predictions from all the models and use it to make the final prediction.
- Averaging can be used for making predictions in regression problems or while calculating probabilities for classification problems.
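
A minimal sketch of simple averaging for a regression problem follows. The prediction values are made up purely for illustration and do not come from any model in this notebook.

```python
from statistics import mean

# Hypothetical predictions of three regressors for the same four samples.
model_predictions = [
    [3.0, 5.0, 2.5, 7.0],  # model 1
    [2.0, 5.5, 3.0, 8.0],  # model 2
    [4.0, 4.5, 3.5, 6.0],  # model 3
]

# zip(*...) regroups the values per sample, so each sample is averaged across models.
averaged = [mean(sample) for sample in zip(*model_predictions)]
print(averaged)  # [3.0, 5.0, 3.0, 7.0]
```

The same pattern applies to classification when each value is a predicted probability rather than a regression output.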

![](my_icons/Slide6.GIF)

#### Weighted Average
- This is an extension of the averaging method.
- Each model is assigned a weight that defines its importance for the prediction, so more reliable models contribute more to the final result.

![](my_icons/Slide7.GIF)

In [None]:
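def AverageWeightClassifier(predModels, weights=None):
    # predModels: one sequence of predictions per model, all of equal length.
    if weights is None:
        weights = [1] * len(predModels)  # equal weights give the simple average
    total = sum(weights)
    # Weighted average per sample, normalised so the weights need not sum to 1.
    return [sum(p * w for p, w in zip(preds, weights)) / total
            for preds in zip(*predModels)]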

#### Stacking

- Stacked generalization consists of stacking the outputs of individual estimators and using a final estimator (a classifier or a regressor) to compute the final prediction.
- Stacking makes it possible to exploit the strengths of each individual estimator by using their outputs as the input of a final estimator.

- The basic idea is to train several machine learning algorithms on the training dataset and then generate a new dataset from their predictions. This new dataset is then used as input for the combiner machine learning algorithm.


#### Stacking Concept

1. We want to predict the train set and the test set with some 1st level model(s), and then use these predictions as features for the 2nd level model(s).
2. Any model can be used as a 1st level model or a 2nd level model.
3. To avoid overfitting on the train set we use cross-validation, and in each fold we predict only the out-of-fold (OOF) part of the train set, i.e. the part the model was not fitted on.
4. Common practice is to use from 3 to 10 folds.
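
The steps above can be sketched in plain Python. This is a toy illustration under stated assumptions, not how vecstack is implemented: the two first-level models (fit_mean and fit_line), the helper oof_predictions, the interleaved fold assignment, and the data are all hypothetical.

```python
from statistics import mean


def fit_mean(X, y):
    """Toy 1st level model: always predict the mean of its training targets."""
    m = mean(y)
    return lambda x: m


def fit_line(X, y):
    """Toy 1st level model: least-squares line through 1-D inputs."""
    mx, my = mean(X), mean(y)
    sxx = sum((x - mx) ** 2 for x in X)
    slope = sum((x - mx) * (t - my) for x, t in zip(X, y)) / sxx
    return lambda x: my + slope * (x - mx)


def oof_predictions(fit, X, y, n_folds=3):
    """Predict every training sample with a model that never saw it."""
    n = len(X)
    oof = [None] * n
    for k in range(n_folds):
        held_out = set(range(k, n, n_folds))  # indices belonging to fold k
        X_fit = [X[i] for i in range(n) if i not in held_out]
        y_fit = [y[i] for i in range(n) if i not in held_out]
        model = fit(X_fit, y_fit)             # fit on the remaining folds
        for i in held_out:
            oof[i] = model(X[i])              # predict the held-out fold
    return oof


X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]

# One OOF column per 1st level model; each row is a feature vector for the 2nd level.
columns = [oof_predictions(f, X, y) for f in (fit_mean, fit_line)]
S_train = [list(row) for row in zip(*columns)]
print(S_train)
```

Each row of S_train holds one prediction per 1st level model for a single training sample, and this matrix is what a 2nd level model would be fitted on.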

###### Predict test set:

- Variant A: In each fold we also predict the test set, so after all folds are complete we combine the temporary test set predictions made in each fold by taking their mean (for regression or probabilities) or their mode (for class labels).

![](my_icons/Slide8.GIF)

- Variant B: We do not predict the test set during the cross-validation cycle. After all folds are complete we perform an additional step: fit the model on the full train set and predict the test set once. This approach takes more time because it requires one additional fit.

![](my_icons/Slide9.GIF)
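
The difference between the two variants can be sketched in plain Python for a regression task. This is a hedged illustration, not vecstack's implementation: the toy model fit_line, the helper stack_test_predictions, the interleaved fold assignment, and the data are all assumptions.

```python
from statistics import mean


def fit_line(X, y):
    """Toy 1st level model: least-squares line through 1-D inputs."""
    mx, my = mean(X), mean(y)
    sxx = sum((x - mx) ** 2 for x in X)
    slope = sum((x - mx) * (t - my) for x, t in zip(X, y)) / sxx
    return lambda x: my + slope * (x - mx)


def stack_test_predictions(fit, X, y, X_test, variant="A", n_folds=3):
    """Return the test set feature column produced by one 1st level model."""
    n = len(X)
    if variant == "B":
        model = fit(X, y)  # one additional fit on the full train set
        return [model(x) for x in X_test]
    per_fold = []
    for k in range(n_folds):
        keep = [i for i in range(n) if i % n_folds != k]  # leave fold k out
        model = fit([X[i] for i in keep], [y[i] for i in keep])
        per_fold.append([model(x) for x in X_test])       # temporary test predictions
    # Regression: average the temporary predictions across folds.
    return [mean(preds) for preds in zip(*per_fold)]


X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [x ** 2 for x in X]
X_test = [7.0, 8.0]

S_test_A = stack_test_predictions(fit_line, X, y, X_test, variant="A")
S_test_B = stack_test_predictions(fit_line, X, y, X_test, variant="B")
print(S_test_A)
print(S_test_B)
```

The two variants generally give slightly different test features, because Variant A averages models fitted on subsets of the train set while Variant B uses a single model fitted on all of it.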

In [None]:
# Caution! All estimators and parameter values are for demonstration
# only and should not be considered recommendations.
# This is a list of tuples.
# Each tuple contains an arbitrary unique name and an estimator object.
estimators_L1 = [
    ('et', ExtraTreesRegressor(random_state=0, n_jobs=-1, 
                               n_estimators=100, max_depth=3)),
        
    ('rf', RandomForestRegressor(random_state=0, n_jobs=-1, 
                                 n_estimators=100, max_depth=3)),
        
    ('xgb', XGBRegressor(random_state=0, n_jobs=-1, learning_rate=0.1, 
                         n_estimators=100, max_depth=3))
]

#### Initialize stacking transformer

In [None]:
from sklearn.metrics import mean_absolute_error

# The base estimators are regressors, so this is set up as a regression task.
# needs_proba applies only to classification and is therefore omitted.
stack = StackingTransformer(estimators=estimators_L1,   # base estimators
                            regression=True,            # regression task (if you need 
                                                        # classification - set to False)
                            variant='A',                # oof for train set, predict test 
                                                        # set in each fold and find mean
                            metric=mean_absolute_error, # metric: callable
                            n_folds=4,                  # number of folds
                            shuffle=True,               # shuffle the data
                            random_state=0,             # ensure reproducibility
                            verbose=2)                  # print all info

In [None]:
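# X_train, X_test, y_train, y_test are assumed to come from an earlier train_test_split (not shown)
stack = stack.fit(X_train, y_train)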

In [None]:
S_train = stack.transform(X_train)
S_test = stack.transform(X_test)

In [None]:
# Initialize 2nd level estimator
final_estimator = XGBRegressor(random_state=0, n_jobs=-1, learning_rate=0.1, 
                               n_estimators=100, max_depth=3)

# Fit
final_estimator = final_estimator.fit(S_train, y_train)

# Predict
y_pred = final_estimator.predict(S_test)

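# Final prediction score (mean absolute error, since the final estimator is a regressor)
from sklearn.metrics import mean_absolute_error
print('Final prediction score: [%.8f]' % mean_absolute_error(y_test, y_pred))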

In [None]:
# Number of base estimators
# Type: int
stack.n_estimators_

# Scores for each estimator (rows) in each fold (columns)
# Type: 2d numpy array
stack.scores_

# Mean and std for each estimator
# Type: list of tuples
stack.mean_std_

# Mean and std convenient representation using pandas.DataFrame
df = pd.DataFrame.from_records(stack.mean_std_, columns=['name', 'mean', 'std'])
# Sort by column 'mean' (best on the top)
df.sort_values('mean', ascending=True)

![](my_icons/Slide10.GIF)