# Problem Session 11
## A Concrete Strength Regression problem using Ensembles

In [2]:
## For data handling
import pandas as pd
import numpy as np

## For plotting
import matplotlib.pyplot as plt
import seaborn as sns

## This sets the plot style
## to have a grid on a white background
sns.set_style("whitegrid")

We continue to work with the concrete compressive strength dataset from last week.

In [3]:
df = pd.read_csv('../../data/concrete.csv')

from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(df,
                                     shuffle=True,
                                     random_state=216,
                                     test_size=.2)
df_tt, df_val = train_test_split(df_train,
                                 shuffle=True,
                                 random_state=216,
                                 test_size=.2)

features = df.columns[:-1]
target = df.columns[-1]

##### Model Selection

Train each of the following models on the training set (`df_tt`) using their default parameters, then evaluate each one on the validation set (`df_val`). Which has the smallest "out of the box" mean squared error?

* Linear Regression
* kNN
* Support Vector Machine Regressor
* Random Forest Regressor
* AdaBoost Regressor
* Gradient Boosting Regressor
* XGBoost Regressor

Hint: It is inefficient to copy/paste the four to five lines of code needed to train each model.

I suggest instead storing the instantiated models in a dictionary and using a `for` loop!  You can, of course, use another method if you have a different preference.
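The dictionary-and-loop pattern suggested above could look roughly like the following. This is a minimal sketch, not the intended solution: it uses synthetic data from `make_regression` as a stand-in for `df_tt` and `df_val`, the dictionary keys are arbitrary labels, and XGBoost is left out so the block depends only on scikit-learn (an `XGBRegressor` instance could be added to the dictionary in the same way).

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (AdaBoostRegressor,
                              GradientBoostingRegressor,
                              RandomForestRegressor)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

## Synthetic stand-in data (an assumption); in the notebook you would use
## df_tt[features], df_tt[target], df_val[features], df_val[target]
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=216)
X_tt, X_val, y_tt, y_val = train_test_split(X, y, test_size=.2, random_state=216)

## Instantiated models with default parameters, keyed by a short label
models = {
    'lr': LinearRegression(),
    'knn': KNeighborsRegressor(),
    'svr': SVR(),
    'rf': RandomForestRegressor(random_state=216),
    'ada': AdaBoostRegressor(random_state=216),
    'gb': GradientBoostingRegressor(random_state=216),
}

## Fit each model on the training split and record its validation MSE
val_mses = {}
for name, model in models.items():
    model.fit(X_tt, y_tt)
    val_mses[name] = mean_squared_error(y_val, model.predict(X_val))

## Print the models from lowest to highest validation MSE
for name, mse in sorted(val_mses.items(), key=lambda item: item[1]):
    print(f"{name}: {mse:.2f}")
```

Keeping the fitted models in the dictionary also makes it easy to reuse the best one later without retraining.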

##### Hyperparameter tuning and test set evaluation

Select the model which had the lowest validation error "right out of the box" (ranking by RMSE gives the same order as ranking by MSE).  Use cross-validation to tune its hyperparameters on the combined training and validation set (`df_train`).  Note that if you make your grid of hyperparameters too large, the search might take a very long time to run.

Once you obtain the hyperparameters with the best cross-validation performance, train the model with those hyperparameters on the combined training and validation set.

Evaluate performance on the test set.  Is it comparable to your training set performance?
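The tuning workflow described above can be sketched with `GridSearchCV`. This sketch makes several assumptions: the data is synthetic, `GradientBoostingRegressor` stands in for whichever model won your comparison, and the small grid is chosen only so the search runs quickly. With the default `refit=True`, the best configuration is retrained on all the data passed to `fit`, which plays the role of the combined training and validation set.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

## Synthetic stand-in for df_train / df_test (an assumption)
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=216)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2,
                                                    random_state=216)

## A deliberately small, illustrative grid: 2 x 2 settings x 5 folds = 20 fits
param_grid = {'n_estimators': [50, 100],
              'max_depth': [2, 3]}

grid = GridSearchCV(GradientBoostingRegressor(random_state=216),
                    param_grid=param_grid,
                    scoring='neg_root_mean_squared_error',
                    cv=5)
grid.fit(X_train, y_train)

## sklearn maximizes scores, so the RMSE is stored with a negative sign
cv_rmse = -grid.best_score_

## grid.predict uses the best estimator refit on all of X_train
test_rmse = mean_squared_error(y_test, grid.predict(X_test)) ** 0.5

print(grid.best_params_)
print(f"CV RMSE: {cv_rmse:.2f}, test RMSE: {test_rmse:.2f}")
```

The number of fits grows multiplicatively with each hyperparameter you add to the grid, which is why a large grid can take so long.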

Discussion Prompt:  A construction company is building a bridge and contracted out the specs to an engineer.  The engineer told them they needed to ensure that the concrete compressive strength is at least $50 \textrm{ MPa}$ given the design constraints.  According to your model, the particular mix they are using is predicted to have a compressive strength of $60 \textrm{ MPa}$.

Discuss this situation from a technical, ethical, and legal perspective.  Who else would you want to loop into this conversation?  

Do a little further model assessment to see what the risk is.  For example, are there any instances where the model predicted strength in excess of $55 \textrm{ MPa}$ but the actual strength was less than $50 \textrm{ MPa}$?  What else could you do to assess the risk here?
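One way to start on the check described above is to count held-out cases where the model was confidently above the threshold but the true strength fell short. The arrays below are made-up values in MPa used purely for illustration, not results from the concrete dataset; in the notebook you would substitute the true strengths and your model's predictions on held-out data.

```python
import numpy as np

## Hypothetical true and predicted strengths in MPa (illustrative values only)
y_true = np.array([48.0, 52.0, 61.0, 49.5, 70.0, 45.0])
y_pred = np.array([56.0, 57.0, 60.0, 51.0, 66.0, 58.5])

## Cases the model would have "cleared" with a 55 MPa prediction
flagged = y_pred > 55

## Cleared cases whose true strength was actually below the 50 MPa requirement
risky = flagged & (y_true < 50)

n_flagged = int(flagged.sum())
n_risky = int(risky.sum())
risky_rate = n_risky / n_flagged

## Largest amount by which the model overestimated strength
worst_overprediction = float(np.max(y_pred - y_true))

print(n_risky, n_flagged, risky_rate, worst_overprediction)
```

Looking at the whole distribution of overpredictions, for example its upper quantiles, gives a fuller picture than a single count.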

##### A more interpretable model?

What if we care more about interpretability than we do about making the best predictive model?  One option is to use a generalized additive model (GAM).  Here we use `ExplainableBoostingRegressor` from `interpret` which you can read about [here](https://interpret.ml/docs/ebm.html).  Essentially it is a GAM where each additive component function is learned using gradient boosting.
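As a sketch of what "additive" means here (the notation is ours and is not taken from the `interpret` documentation), a GAM for regression with no interaction terms predicts

$$\hat{y}(x) = \beta_0 + \sum_{j=1}^{p} f_j(x_j),$$

where $p$ is the number of features, $\beta_0$ is an intercept, and each $f_j$ is a function of the single feature $x_j$.  Because each $f_j$ depends on only one feature, it can be plotted on its own, which is where the interpretability comes from.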

In [None]:
from interpret.glassbox import ExplainableBoostingRegressor
from interpret import show
from sklearn.metrics import root_mean_squared_error

ebm = ExplainableBoostingRegressor(interactions=0)

# Note: it is not good practice to evaluate multiple models on the test set.  
# I am only doing so here to showcase the "out of the box" performance of this new model.

ebm.fit(df_train[features], df_train[target])
root_mean_squared_error(df_test[target], ebm.predict(df_test[features]))

The EBM does not generalize quite as well as the tuned model (though it is close!), but it is more interpretable.  Discuss any insights you gain from the following graphs:

In [None]:
show(ebm.explain_global())

I suspect the above will have taken the full hour.  If not, here are some additional questions!

### More Questions about Boosting

1. Give an example of a model which would benefit from boosting.

2. What happens if you use gradient boosting for linear regression with mean squared error loss?


3. Code your own `CustomGradientBoostingRegressor` class:

In [None]:
class CustomGradientBoostingRegressor():
    '''
    Trains a sequence of regressors.  
    Each new regressor has targets which are the residuals of the previous regressor.
    Prediction is performed by summing the predictions of each individual regressor.
    This is only designed to work with MSE loss.
    '''
    def __init__(self, base_estimator, num_estimators = 10, kwargs = None):
        '''
            Parameters:
                base_estimator: An sklearn regression class.
                num_estimators: The number of estimators in the ensemble
                kwargs:  A dictionary of key word arguments to pass to the estimators.
            Attributes:
                self.estimators is a list of estimators instantiated with their kwargs
        '''
        self.base_estimator = 
        self.num_estimators = 
        self.kwargs = kwargs if kwargs else {}
        self.estimators = 
    
    def fit(self, X, y):
        

    def predict(self, X):
        preds = 
        return preds

In [None]:
# Make sure that your class is able to run the following code.
# Does increasing the number of estimators decrease the MSE?

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

X = np.random.normal(size = (100,2))
y = X[:,0]**2 + X[:,0]*X[:,1]

model =  CustomGradientBoostingRegressor(
            base_estimator = DecisionTreeRegressor, 
            num_estimators = 1000, 
            kwargs = {'max_depth' : 1}
            )

model.fit(X,y)

mean_squared_error(y, model.predict(X))
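If you get stuck, here is one possible way to fill in the skeleton; treat it as a sketch rather than the reference solution.  It assumes MSE loss and a learning rate of 1, as in the docstring: each estimator is fit to the residuals left by the estimators before it.  The `random_state` entry in `kwargs` and the seeded `rng` are additions made here so the comparison is reproducible.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

class CustomGradientBoostingRegressor():
    '''One possible completion of the skeleton (MSE loss only).'''
    def __init__(self, base_estimator, num_estimators = 10, kwargs = None):
        self.base_estimator = base_estimator
        self.num_estimators = num_estimators
        self.kwargs = kwargs if kwargs else {}
        # One fresh, unfitted estimator per boosting round
        self.estimators = [base_estimator(**self.kwargs)
                           for _ in range(num_estimators)]

    def fit(self, X, y):
        # The first estimator sees the raw targets; each later one sees
        # whatever the estimators so far have failed to explain
        residuals = np.asarray(y, dtype=float).copy()
        for estimator in self.estimators:
            estimator.fit(X, residuals)
            residuals = residuals - estimator.predict(X)
        return self

    def predict(self, X):
        # The ensemble prediction is the sum of the individual predictions
        preds = np.sum([est.predict(X) for est in self.estimators], axis=0)
        return preds

# Compare training MSE for a few ensemble sizes on the same toy problem
rng = np.random.default_rng(216)
X = rng.normal(size = (100,2))
y = X[:,0]**2 + X[:,0]*X[:,1]

train_mses = {}
for n in (1, 10, 100):
    model = CustomGradientBoostingRegressor(
                base_estimator = DecisionTreeRegressor,
                num_estimators = n,
                kwargs = {'max_depth' : 1, 'random_state' : 0}
                )
    model.fit(X, y)
    train_mses[n] = mean_squared_error(y, model.predict(X))

print(train_mses)
```

Note that this measures training error only; whether more estimators also help on held-out data is a separate question worth checking.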