# Ensemble


<p>Why ensemble? </p>
<p>Finding the best Bias Variance trade off point</p>
<p> Bias:  it doesn't do a good job of bending to the data (underfitting) eg. Linear Regression</p>
<p>Variance: it changes drastically to meet the needs of every point in our dataset. (overfitting)eg. Decision tree</p>
<p>By combining algorithms, we can often build models that perform better by meeting in the middle in terms of bias and variance.These ideas are based on minimizing bias and variance based on mathematical theories, like the central limit theorem. A method that is used to improve ensemble methods is to introduce randomness into high variance algorithms before they are ensembled together this combats the tendency of these algorithms to overfit. Ways of introducing randomness : </p>
<li>Bootstrap the data -sampling the data with replacement and fitting your algorithm to the sampled data.</li>
<li>Subset the features - in each split of a decision tree or with each algorithm used an ensemble only a subset of the total possible features are used.</li>



Ensemble Simple methods:
<li>Max Voting</li>
<li>Averaging</li>
<li>Weighted Averaging</li>







<h6>Max Voting</h6>
<p> The max voting method is generally used for classification problems. In this technique, multiple models are used to make predictions for each data point. The predictions by each model are considered as a ‘vote’. The predictions which we get from the majority of the models are used as the final prediction.</p>


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statistics
import warnings
warnings.filterwarnings("ignore")

from sklearn import preprocessing
from sklearn.model_selection import train_test_split,KFold,StratifiedKFold
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression


In [2]:
#import data
df= pd.read_csv('data.csv')
x=df[['satisfaction_level', 'last_evaluation', 'number_project',
       'average_montly_hours', 'time_spend_company', 'Work_accident',
       'promotion_last_5years', 'Departments _RandD',
       'Departments _accounting', 'Departments _hr', 'Departments _management',
       'Departments _marketing', 'Departments _product_mng',
       'Departments _sales', 'Departments _support', 'Departments _technical', 'salary_low', 'salary_medium']]
y=df['left']


x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=4)
print('Train set:', x_train.shape, y_train.shape)
print('Test set:', x_test.shape, y_test.shape)

Train set: (11999, 18) (11999,)
Test set: (3000, 18) (3000,)


In [3]:
from sklearn.ensemble import VotingClassifier
model1 = LogisticRegression(random_state=1)
model2 = DecisionTreeClassifier(random_state=1)
model3 = KNeighborsClassifier()

model = VotingClassifier(estimators=[('lr', model1), ('dt', model2), ('knn', model3)], voting='hard')
model.fit(x_train,y_train)
max_score = model.score(x_test,y_test)
print(max_score)


0.9673333333333334


<h6> Averaging </h6>
<p> multiple predictions are made for each data point in averaging. In this method, we take an average of predictions from all the models and use it to make the final prediction.Averaging can be used for making predictions in regression problems or while calculating probabilities for classification problems.</p>


In [4]:
#Sample code
model1 = DecisionTreeClassifier()
model2 = KNeighborsClassifier()
model3= LogisticRegression()

model1.fit(x_train,y_train)
model2.fit(x_train,y_train)
model3.fit(x_train,y_train)

pred1=model1.predict_proba(x_test)
pred2=model2.predict_proba(x_test)
pred3=model3.predict_proba(x_test)

finalpred=(pred1+pred2+pred3)/3

<h6>Weighted Averaging</h6>
<p>All models are assigned different weights defining the importance of each model for prediction.</p>

In [6]:
#sample code
model1 = DecisionTreeClassifier()
model2 = KNeighborsClassifier()
model3= LogisticRegression()

model1.fit(x_train,y_train)
model2.fit(x_train,y_train)
model3.fit(x_train,y_train)

pred1=model1.predict_proba(x_test)
pred2=model2.predict_proba(x_test)
pred3=model3.predict_proba(x_test)

finalpred=(pred1*0.3+pred2*0.3+pred3*0.4)


<h4>Ensemble Advanced methods:</h4>
    <li>Bagging</li>
    <li>Boosting</li>
    <li>Stacking</li>

<h6>Bagging & Pasting</h6>
Use the same training algorithm for every predictor but train them on different random subsets of the training set.When the sampling is done with replacement this is called <strong>Bagging:</strong> short for (boostrap aggregating) and when done without replacemnt its called <strong>pasting</strong>


<h4>Random Forest</h4>
<p>Its an ensemble of Decision trees genearted via bagging method. </p>
<p>RandomForestClasifier has all the hyperparameters of a DecisionTreeClassifier to control growth and all hyperparameters of BaggingClassifier to control the ensemble</p>
<p>It automatically calculates feature importance for each feature after training. This can be accesed using <em>feature_importances_ </em>variable </p>

In [7]:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier


In [8]:
iris = load_iris()
clf = RandomForestClassifier(n_estimators=50,n_jobs=1)
clf.fit(iris['data'],iris['target'])

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [9]:
for name, score in zip(iris['feature_names'],clf.feature_importances_):
    print(name,score)

sepal length (cm) 0.08728721120836877
sepal width (cm) 0.020737133325644804
petal length (cm) 0.5068712555248127
petal width (cm) 0.3851043999411737


<h3>Boosting</h3>
<p>It combines several weak learners into a strong learner.It tends to train predictors sequentially each trying to correct its predecessor.Most popular methods,<em> Adaboost </em> (Adaptive Boosting) and <em>Gradient Boosting</em></p>
<p><strong> Adaboost</strong> </p>
<p> first a base classifier is trained and used to make predictions on the training set. The relative weight of hte misclassified training instances is the increased. A second classifier is trained using the updated weughts and again it makes pedictions on the training, weights are updated and so on...Once all the predictors are trained, the ensemble makes predictions, each predictor has different weights depending on their overall accuracy on the weighted training set.</p>
<p><strong>Note:</strong> The major drawback foor sequential trainig is that it can't be parrellized</p>
<p> If Adaboost is overfitting the training set u can reduce the number of estimators or strongly regularize the base estimators.</p>

<strong>Gradient Boosting </strong>
<p>works by sequentially adding predictors to an ensemble each correcting its predecesor and it tries to fit the new predictor to the <em>residual errors</em> made by the previous predictor</p>
<p>It has <em>subsample</em> hyperparameter that specifies the fraction of training instances to be used for training each tree.</p>

<h3>Stacking</h3>
<p>Ths is short for Stacked generalization.</p>
<p> it takes heterogeneous weak learners, learns them in parallel and combines them by training a meta-model to output a prediction based on different weak models prediction. </p>

In [10]:
#Sample code : function that returns the predictions for train and test for each model.

def Stacking(model,train,y,test,n_fold):
    folds=StratifiedKFold(n_splits=n_fold,random_state=1)
    test_pred=np.empty((test.shape[0],1),float)
    train_pred=np.empty((0,1),float)
    for train_indices,val_indices in folds.split(train,y.values):
        x_train,x_val=train.iloc[train_indices],train.iloc[val_indices]
        y_train,y_val=y.iloc[train_indices],y.iloc[val_indices]
        
        model.fit(X=x_train,y=y_train)
        train_pred=np.append(train_pred,model.predict(x_val))
        test_pred=np.append(test_pred,model.predict(test))
    return test_pred.reshape(-1,1),train_pred

In [11]:
# Base model 1 Decision tree
model1 =DecisionTreeClassifier(random_state=1)

test_pred1 ,train_pred1=Stacking(model=model1,n_fold=10, train=x_train,test=x_test,y=y_train)

train_pred1=pd.DataFrame(train_pred1)
test_pred1=pd.DataFrame(test_pred1)

In [14]:
# Base model 2 KNN
model2 = KNeighborsClassifier()

test_pred2 ,train_pred2=Stacking(model=model2,n_fold=10,train=x_train,test=x_test,y=y_train)

train_pred2=pd.DataFrame(train_pred2)
test_pred2=pd.DataFrame(test_pred2)

In [15]:
# Model 3 Logistic regression

df = pd.concat([train_pred1, train_pred2], axis=1)
df_test = pd.concat([test_pred1, test_pred2], axis=1)

model = LogisticRegression(random_state=1)
model.fit(df,y_train)
model.score(df_test, y_test)

ValueError: could not convert string to float: 'Yes'

<strong>Takeaways</strong>
<p>The main takeaways of this post are the following:</p>

<li>ensemble learning is a machine learning paradigm where multiple models (often called weak learners or base models) are trained to solve the same problem and combined to get better performances</li>

<li>the main hypothesis is that if we combine the weak learners the right way we can obtain more accurate and/or robust models</li>

<li>in bagging methods, several instance of the same base model are trained in parallel (independently from each others) on different bootstrap samples and then aggregated in some kind of “averaging” process</li>

<li>the kind of averaging operation done over the (almost) i.i.d fitted models in bagging methods mainly allows us to obtain an ensemble model with a lower variance than its components: that is why base models with low bias but high variance are well adapted for bagging </li>
<li>in boosting methods, several instance of the same base model are trained sequentially such that, at each iteration, the way to train the current weak learner depends on the previous weak learners and more especially on how they are performing on the data</li>
<li>this iterative strategy of learning used in boosting methods, that adapts to the weaknesses of the previous models to train the current one, mainly allows us to get an ensemble model with a lower bias than its components: that is why weak learners with low variance but high bias are well adapted for boosting</li>
<li>in stacking methods, different weak learners are fitted independently from each others and a meta-model is trained on top of that to predict outputs based on the outputs returned by the base models</li>

from https://towardsdatascience.com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205


References:
http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/, https://cseweb.ucsd.edu/~yfreund/papers/IntroToBoosting.pdf, https://www.quora.com/What-is-an-intuitive-explanation-of-Gradient-Boosting , https://www.analyticsvidhya.com/blog/2018/06/comprehensive-guide-for-ensemble-models/

    
  