# Week 10: Bagging and Boosting


Up to now, we have trained a single model on the training set and applied it to the testing data. However, to reduce the variance of our final estimator, we may want to train multiple models and aggregate their results. This makes our final prediction less vulnerable to overfitting and hopefully more accurate. This is the idea behind bagging and boosting. Today in section we will cover bagging and boosting, as well as how to implement them in Python.




## Bagging: Bootstrap Aggregating

Suppose our training data consists of $N$ samples. A bootstrap sample is a sample of size $K$ drawn from our training data with replacement (often $K = N$). The idea behind bagging is to draw $B$ bootstrap samples from our training data, estimate a model on each of them, and then average the resulting models in some fashion (either a simple average or a weighted average).
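As a small illustration of what "drawn with replacement" means, here is a minimal sketch using NumPy on a toy index set. The names `boot_idx` and `oob_idx`, the toy size $N = 10$, and the seed are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen arbitrarily for reproducibility
N = 10                           # size of the (toy) training set
K = N                            # bootstrap sample size; K = N is a common choice

# Draw K row indices from 0..N-1 with replacement
boot_idx = rng.integers(0, N, size=K)

# Rows that were never drawn are "out of bag" for this bootstrap sample
oob_idx = np.setdiff1d(np.arange(N), boot_idx)

print("bootstrap rows:", boot_idx)
print("out-of-bag rows:", oob_idx)
```

Because rows are drawn with replacement, some rows typically appear more than once and others not at all. The rows left out are the ones an out-of-bag error estimate would use.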

Below, we will show how to do this in a very simple fashion with a linear regression model to get the basic idea down. Then, we will use sklearn's built-in bagging estimators to show how this can be generalized. To do so we will use the 'mpg' dataset from seaborn.

In [70]:
import pandas as pd
import numpy as np
import seaborn as sns

mpg = sns.load_dataset("mpg").dropna()
mpg['one'] = 1
mpg.head()

| | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name | one |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | usa | chevrolet chevelle malibu | 1 |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | usa | buick skylark 320 | 1 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | usa | plymouth satellite | 1 |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | usa | amc rebel sst | 1 |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | usa | ford torino | 1 |


We will use the cylinders, displacement, horsepower, weight, and acceleration to try to predict the mpg of the vehicle. To illustrate the usefulness of bagging, we will only make predictions for cars from the model years 1970 and 1971, a small subset of the data. We first fit the model the way we are used to.

In [102]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression


mpg70 = mpg[mpg['model_year']<=71]
X = mpg70[['one','cylinders','displacement','horsepower','weight','acceleration']]
Y = mpg70[['mpg']]

Xtrain, Xtest, Ytrain, Ytest = train_test_split(X,Y, random_state = 1)

wholeModel = LinearRegression(fit_intercept = False)
wholeModel.fit(Xtrain,Ytrain)

Yhat = wholeModel.predict(Xtest)
whole_mse = np.mean((Ytest - Yhat)**2)
print(float(whole_mse))




4.9501068589016715


For each of the $B$ models we want to estimate, we will do the following:

1. Draw a bootstrap sample of size $K$.
2. Estimate the model on our bootstrap sample.
3. Store the coefficients of our model.

To come up with our final model, we will take a weighted average of the model coefficients, weighting each model by the inverse of its out-of-sample MSE.

In [103]:
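def baggedLR(Xtr, Xts, Ytr, Yts, B, K_prop):
    # Bootstrap sample size K is a proportion K_prop of the N training rows
    N = Xtr.shape[0]
    K = round(K_prop*N)
    training = pd.concat([Xtr,Ytr], axis = 1)
    mse = []
    coefficients = []
    for b in range(B):
        # Step 1: draw a bootstrap sample of size K with replacement
        bsample = training.sample(K, replace = True)
        bx = bsample[['one','cylinders','displacement','horsepower','weight','acceleration']]
        by = bsample[['mpg']]
        # Step 2: estimate the model on the bootstrap sample
        bmodel = LinearRegression(fit_intercept=False)
        bmodel.fit(bx,by)
        # Score on the held-out data; this MSE sets the model's weight
        bYhat = bmodel.predict(Xts)
        bmse = float(np.mean((bYhat - Yts)**2))
        # Step 3: store the coefficients and the MSE
        bcoef = bmodel.coef_
        mse.append(bmse)
        coefficients.append(bcoef)
    # Average the coefficients, weighting each model by 1/MSE
    final = np.average(coefficients, axis = 0, weights = 1/np.array(mse))[0]
    return final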

test = baggedLR(Xtrain, Xtest, Ytrain, Ytest, 1000, 1)
print(test)

[ 4.49018213e+01 -1.58077817e+00 -4.20839817e-03 -2.47926297e-03
 -3.44682985e-03 -2.25247906e-01]


We can now evaluate the performance of this model:

In [104]:
baggedHat = np.dot(Xtest, test)
baggedMSE = np.mean((np.array(Ytest).flatten() - baggedHat)**2)
print(baggedMSE)

4.6195007663030045
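Because the weights above come from the MSE on the testing data, the testing data influence the fitted model. One possible alternative, sketched below under stated assumptions, is to weight each bootstrap model by its MSE on the rows left out of its own bootstrap sample (the out-of-bag rows). The function `bagged_lr_oob` and the synthetic data are hypothetical stand-ins, not the mpg example itself.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def bagged_lr_oob(X, y, B=200, seed=0):
    """Bag linear regressions, weighting each by 1 / (out-of-bag MSE)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    coefs, weights = [], []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)         # bootstrap rows (with replacement)
        oob = np.setdiff1d(np.arange(n), idx)    # rows left out of this sample
        if oob.size == 0:                        # nothing left out: skip this draw
            continue
        model = LinearRegression(fit_intercept=False).fit(X[idx], y[idx])
        oob_mse = np.mean((y[oob] - model.predict(X[oob])) ** 2)
        coefs.append(model.coef_)
        weights.append(1.0 / oob_mse)
    # Weighted average of the coefficient vectors
    return np.average(coefs, axis=0, weights=weights)

# Synthetic stand-in data (hypothetical, not the mpg dataset)
rng = np.random.default_rng(1)
X_demo = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
beta_true = np.array([1.0, 2.0, -3.0])
y_demo = X_demo @ beta_true + rng.normal(scale=0.5, size=200)

beta_hat = bagged_lr_oob(X_demo, y_demo)
print(beta_hat)
```

With this weighting the testing data are never touched during fitting, so a test MSE computed afterwards is an honest out-of-sample estimate.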


### The sklearn Bagging Ensemble Method

We can see that bagging has reduced the MSE of our model (from about 4.95 to about 4.62) $^1$. Next we will go over how to use the built-in sklearn bagging classifier and regressor to do this automatically. The sklearn implementation also has some functionality that we did not build into the simple model above.

$^1$ *Though this example is a bit contrived, since the bootstrap models are weighted by their MSE on the same testing data that we then evaluate on.*

In [84]:
from sklearn.ensemble import BaggingClassifier, BaggingRegressor


The basic syntax of the BaggingClassifier or BaggingRegressor is as follows:

`` RegressionModel = BaggingRegressor(base_regression_model, n_estimators = B, bootstrap = True/False, oob_score = True/False, max_samples = K, max_features = m)``

`` ClassifierModel = BaggingClassifier(base_classification_model, n_estimators = B, bootstrap = True/False, oob_score = True/False, max_samples = K, max_features = m)``

In Order:

``base_regression_model/base_classification_model``: The model to fit on each bootstrap sample. Default is a decision tree regressor/classifier. Here we could specify a `LinearRegression()` or a `KNeighborsClassifier()`.

`n_estimators`: Specifies how many models we should aggregate to get our final model. Like $B$ in the above, the number of times we draw a bootstrap sample and estimate the model on that sample. 

`bootstrap`: Whether the samples are drawn with replacement (`True`) or without replacement (`False`). For this class, we will generally keep this at `True`.

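`oob_score`: Whether to use the left-out (out-of-bag) samples to estimate how well the ensemble generalizes. After fitting, the estimate is stored in the `oob_score_` attribute ($R^2$ for a regressor, mean accuracy for a classifier). It does not reweight the bootstrap models: sklearn combines the models' predictions with equal weight.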

`max_samples`: We can think of this as $K$ in the above: the size of each of our bootstrap samples, given either as a number of rows or as a fraction of $N$. (Default is `1.0`, i.e. $N$ rows.)

`max_features`: This is new relative to our hand-written bagging function. It limits the complexity of each of the bootstrap models: if `max_features` is set lower than the total number of features, each model is fit on a random subset of the features.

Let's see how this works below:


In [119]:
sk_bagged = BaggingRegressor(LinearRegression(fit_intercept=False), n_estimators= 100, bootstrap=True, oob_score=True, max_features = 6, random_state=2)

Now that we've set up our bagging model, we can fit it and predict as we're used to:

In [120]:
sk_bagged.fit(Xtrain,np.array(Ytrain).ravel())
sk_bagHat = sk_bagged.predict(Xtest)
sk_bagMSE = np.mean((sk_bagHat-np.array(Ytest).ravel())**2)
print(sk_bagMSE)

4.865413133529548


This does a little bit worse than the bagging estimator we built ourselves (which is to be expected, since our version used the testing data to weight the models), but a bit better than the non-bagged estimator.
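A fitted sklearn bagging model also exposes what it did internally. The sketch below, which uses synthetic stand-in data rather than the mpg dataset (the arrays `X_demo` and `y_demo` and all parameter values are assumptions for illustration), shows the out-of-bag score and the feature subsets drawn when `max_features` is below the number of columns.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (hypothetical, not the mpg dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=300)

bag = BaggingRegressor(
    LinearRegression(),
    n_estimators=50,
    max_samples=0.8,     # each bootstrap sample has 80% as many rows as the data
    max_features=3,      # each model is fit on 3 of the 4 columns
    oob_score=True,
    random_state=0,
)
bag.fit(X_demo, y_demo)

print(bag.oob_score_)                # out-of-bag R^2 for a regressor
print(len(bag.estimators_))          # one fitted model per bootstrap sample
print(bag.estimators_features_[0])   # column indices used by the first model
```

The out-of-bag score gives a rough check on generalization without touching the testing data.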

To see how this works with classification, let's go back to `KNeighborsClassifier` and the diamonds dataset. We are interested in classifying the cut of a diamond.

In [141]:
diamonds = sns.load_dataset('diamonds')[['cut','carat','depth','price']]
diamonds.head()

| | cut | carat | depth | price |
|---|---|---|---|---|
| 0 | Ideal | 0.23 | 61.5 | 326 |
| 1 | Premium | 0.21 | 59.8 | 326 |
| 2 | Good | 0.23 | 56.9 | 327 |
| 3 | Premium | 0.29 | 62.4 | 334 |
| 4 | Good | 0.31 | 63.3 | 335 |


First, to review, we fit the KNeighborsClassifier without bagging.

In [175]:
from sklearn.neighbors import KNeighborsClassifier as KNN

X = diamonds[['carat','depth','price']]
Y = diamonds['cut']
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X,Y, random_state = 0)

knn = KNN(n_neighbors=5)
knn.fit(Xtrain,Ytrain)
yHat = knn.predict(Xtest)
accuracy = np.mean(1*(yHat == Ytest))

print(accuracy)



0.6419725621060437


We now "bag" (bootstrap aggregate) this classifier using the same syntax as above.

In [163]:
bagged_knn = BaggingClassifier(KNN(n_neighbors=5), n_estimators=100, max_samples=0.5, oob_score=True)
bagged_knn.fit(Xtrain,Ytrain)
bagged_yHat = bagged_knn.predict(Xtest)
accuracy = np.mean(1*(bagged_yHat == Ytest))

print(accuracy)



0.6505747126436782


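The bagged classifier does a bit better on the testing data (about 0.651 versus 0.642 accuracy). Since we did not set a `random_state` for the bagged classifier, the exact number will vary slightly from run to run.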

## Boosting: Sequential Model Search

Boosting can be thought of as an extension of the ideas behind bagging. Boosting works by aggregating a collection of simple models into a more complex (and hopefully more accurate) model. The models are fitted successively, with observations that had high errors under earlier models given larger weight in later models.
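To make the reweighting concrete, here is a minimal sketch of a textbook two-class AdaBoost loop with depth-1 decision trees. It is meant only to illustrate the idea and is not sklearn's exact implementation. The synthetic data, the 50 rounds, and all variable names are assumptions for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data (hypothetical); labels are coded as -1 / +1
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 2))
y_demo = np.where(X_demo[:, 0] ** 2 + X_demo[:, 1] ** 2 > 1.4, 1, -1)

n = len(y_demo)
w = np.full(n, 1.0 / n)          # start with equal observation weights
stumps, alphas = [], []

for m in range(50):
    # Fit a simple model using the current observation weights
    stump = DecisionTreeClassifier(max_depth=1).fit(X_demo, y_demo, sample_weight=w)
    miss = stump.predict(X_demo) != y_demo
    err = np.sum(w[miss])                    # weighted error rate (w sums to 1)
    if err == 0 or err >= 0.5:               # perfect, or no better than chance: stop
        break
    alpha = 0.5 * np.log((1 - err) / err)    # this model's say in the final vote
    # Increase the weight of misclassified rows, decrease the rest, renormalize
    w = w * np.exp(alpha * np.where(miss, 1.0, -1.0))
    w = w / w.sum()
    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the alpha-weighted vote over all stumps
votes = sum(a * s.predict(X_demo) for a, s in zip(alphas, stumps))
boosted_acc = np.mean(np.sign(votes) == y_demo)
first_acc = np.mean(stumps[0].predict(X_demo) == y_demo)
print(first_acc, boosted_acc)
```

Each stump on its own is a weak classifier. The weighted vote over many stumps, each fitted with extra emphasis on earlier mistakes, is what gives the aggregate its flexibility.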




In [164]:
from sklearn.ensemble import AdaBoostClassifier, AdaBoostRegressor

The basic syntax of the AdaBoost classifier/regressor is as follows:

``RegressionModel = AdaBoostRegressor(base_estimator, n_estimators, learning_rate, loss)``

``ClassificationModel = AdaBoostClassifier(base_estimator, n_estimators, learning_rate)``

In order:

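`base_estimator`: The simple model that the boosting aggregator ultimately aggregates over (this argument is named `estimator` in scikit-learn 1.2 and later). Here we could specify a `LinearRegression()` or a `DecisionTreeClassifier()`. For the classifier, the model's `fit` must accept a `sample_weight` argument, which rules out `KNeighborsClassifier()`.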

`n_estimators`: Number of times the algorithm will fit a model and update the sample weights for the next model.

`learning_rate`: Governs how strongly each model contributes, and hence how fast the weights are updated after each iteration. This is a tuning parameter that we will have to pick.

`loss`: Only for regression models. Governs the loss used when updating the weights. Options are `'linear'`, `'square'`, or `'exponential'`.

First, let's see how this works on the regression model.



In [165]:
mpg70 = mpg[mpg['model_year']<=71]
X = mpg70[['one','cylinders','displacement','horsepower','weight','acceleration']]
Y = mpg70['mpg']

Xtrain, Xtest, Ytrain, Ytest = train_test_split(X,Y, random_state = 1)

wholeModel = LinearRegression(fit_intercept = False)
wholeModel.fit(Xtrain,Ytrain)

Yhat = wholeModel.predict(Xtest)
whole_mse = np.mean((Ytest - Yhat)**2)
print(float(whole_mse))

4.9501068589016715


In [170]:
boosted_reg = AdaBoostRegressor(LinearRegression(fit_intercept=False), n_estimators=100, learning_rate= 0.8, loss = 'linear', random_state = 2)
boosted_reg.fit(Xtrain, Ytrain)

AdaBoostRegressor(base_estimator=LinearRegression(copy_X=True,
                                                  fit_intercept=False,
                                                  n_jobs=None,
                                                  normalize=False),
                  learning_rate=0.8, loss='linear', n_estimators=100,
                  random_state=2)

We can now predict using our model and evaluate as before

In [171]:
yHat = boosted_reg.predict(Xtest)
boosted_mse = np.mean((Ytest-yHat)**2)
print(boosted_mse)

5.132719985929552


Now let's revisit the classification problem from before:

In [178]:
X = diamonds[['carat','depth','price']]
Y = diamonds['cut']
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X,Y, random_state = 0)

knn = KNN(n_neighbors=5)
knn.fit(Xtrain,Ytrain)
yHat = knn.predict(Xtest)
accuracy = np.mean(1*(yHat == Ytest))

print(accuracy)

0.6419725621060437


And let's try the boosted classifier. If we leave the base estimator unspecified, the boosted classifier will use a depth-1 decision tree, so let's try that. (Boosting won't work out of the box with KNN, since `KNeighborsClassifier` does not accept observation weights when fitting.)

In [182]:
# No base estimator is given, so the default decision tree is boosted (not KNN)
boosted_tree = AdaBoostClassifier(n_estimators=100, learning_rate=0.8, random_state=0)
boosted_tree.fit(Xtrain,Ytrain)
boosted_yHat = boosted_tree.predict(Xtest)
boosted_accuracy = np.mean(1*(boosted_yHat == Ytest))
print(boosted_accuracy)

0.7233222098628105


This does a fair bit better than the original (non-boosted) model, and even does better than the bagged estimator, though part of this could be due to the different base algorithm (decision trees rather than KNN).
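One way to separate the effect of boosting from the effect of switching algorithms is to also score a single, un-boosted tree of the kind AdaBoost uses by default. The sketch below does this on synthetic stand-in data (not the diamonds dataset), and first checks which estimators accept observation weights in `fit`. The data and the dictionary names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.validation import has_fit_parameter

# AdaBoostClassifier needs a base estimator whose fit() accepts sample_weight
knn_ok = has_fit_parameter(KNeighborsClassifier(), "sample_weight")
tree_ok = has_fit_parameter(DecisionTreeClassifier(), "sample_weight")
print(knn_ok, tree_ok)

# Synthetic stand-in data (hypothetical, not the diamonds dataset)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=6, n_informative=4, random_state=0
)
X_tr, X_ts, y_tr, y_ts = train_test_split(X_demo, y_demo, random_state=0)

models = {
    "knn": KNeighborsClassifier(n_neighbors=5),
    "single stump": DecisionTreeClassifier(max_depth=1, random_state=0),
    "boosted stumps": AdaBoostClassifier(n_estimators=100, learning_rate=0.8, random_state=0),
}
# Fit each model and record its mean accuracy on the held-out data
scores = {name: m.fit(X_tr, y_tr).score(X_ts, y_ts) for name, m in models.items()}
print(scores)
```

Comparing "single stump" with "boosted stumps" isolates what boosting adds, while comparing either with "knn" reflects the change of base algorithm.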