# Problem Session 11
## A Concrete Strength Regression problem using Ensembles

In [73]:
## For data handling
import pandas as pd
import numpy as np

## For plotting
import matplotlib.pyplot as plt
import seaborn as sns

## This sets the plot style
## to have a grid on a white background
sns.set_style("whitegrid")

We continue to work with the concrete compressive strength dataset from last week.

In [74]:
df = pd.read_csv('../../data/concrete.csv')

from sklearn.model_selection import train_test_split

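## Hold out 20% of the data as the test set
df_train, df_test = train_test_split(df,
                                     shuffle=True,
                                     random_state=216,
                                     test_size=.2)
## Split the remainder into a smaller training set and a validation set
df_tt, df_val = train_test_split(df_train,
                                 shuffle=True,
                                 random_state=216,
                                 test_size=.2)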

features = df.columns[:-1]
target = df.columns[-1]

##### Model Selection

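Train each of the following models on the training set (`df_tt`) using their default parameters, and evaluate each on the validation set (`df_val`). Which has the smallest "out of the box" error? The code reports root mean squared error (RMSE), which ranks the models in the same order as mean squared error.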

* Linear Regression
* kNN
* Support Vector Machine Regressor
* Random Forest Regressor
* AdaBoost Regressor
* Gradient Boosting Regressor
* XGBoost Regressor

In [75]:
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor

from sklearn.metrics import root_mean_squared_error

In [76]:
models = {
    'lr': LinearRegression(),
    'svr': SVR(),
    ## Note: n_neighbors=10 departs from the default of 5
    'knr': KNeighborsRegressor(n_neighbors=10),
    'rf': RandomForestRegressor(),
    'ab': AdaBoostRegressor(),
    'gb': GradientBoostingRegressor(),
    'xgb': XGBRegressor()
}

In [77]:
## Fit on the smaller training set and score on the validation set
rmses = {}
for name, model in models.items():
    model.fit(df_tt[features],df_tt[target])
    rmses[name] = root_mean_squared_error(df_val[target], model.predict(df_val[features]))

In [78]:
rmses

{'lr': 9.806844754869127,
 'svr': 14.466320553798337,
 'knr': 9.606365221544358,
 'rf': 6.126701568806703,
 'ab': 7.341594257077692,
 'gb': 6.123958157860343,
 'xgb': 5.7364912621835495}
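
Once the validation errors are collected in a dictionary, the best model can be picked programmatically rather than by eye. The sketch below uses a hypothetical dictionary `val_rmses`, keyed by full class names, holding the validation RMSEs reported above rounded to two decimals.

```python
# Hypothetical dictionary: validation RMSEs from the run above, rounded to two decimals.
val_rmses = {
    "LinearRegression": 9.81,
    "SVR": 14.47,
    "KNeighborsRegressor": 9.61,
    "RandomForestRegressor": 6.13,
    "AdaBoostRegressor": 7.34,
    "GradientBoostingRegressor": 6.12,
    "XGBRegressor": 5.74,
}

# Sort from smallest to largest validation error
ranking = sorted(val_rmses.items(), key=lambda item: item[1])
best_name, best_rmse = ranking[0]

for name, rmse in ranking:
    print(f"{name:<26} {rmse:6.2f}")
print("Best out of the box:", best_name)
```

The random forest and gradient boosting models are within about 0.4 MPa of XGBoost on this single split, so the ordering of the top three could change with a different random seed.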

##### Hyperparameter tuning and test set evaluation

Select the model that had the lowest validation error "right out of the box". Use cross-validation to tune its hyperparameters on the combined training and validation set (`df_train`). Note that if your grid of hyperparameters is too large, the search might take a very long time to run.

Once you obtain the hyperparameters with the best cross-validation performance, train the model with those hyperparameters on the combined training and validation set.

Evaluate performance on the test set.  Is it comparable to your training set performance?
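
Before launching a grid search, it can help to count how many model fits it implies. The sketch below assumes a hypothetical grid with three tree depths, seven ensemble sizes, and three learning rates, searched with 5-fold cross-validation; the values are illustrative.

```python
from itertools import product

# Hypothetical grid, for illustration only
grid = {
    "max_depth": [4, 5, 6],
    "n_estimators": list(range(100, 800, 100)),  # 100, 200, ..., 700
    "learning_rate": [0.01, 0.1, 1],
}
n_folds = 5

# Every combination of one value per hyperparameter
combinations = list(product(*grid.values()))
n_fits = len(combinations) * n_folds

print(len(combinations), "combinations,", n_fits, "cross-validation fits")
# 63 combinations, 315 cross-validation fits
```

Adding one more hyperparameter with four candidate values would multiply the count by four, which is why grids become expensive quickly.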

In [79]:
from sklearn.model_selection import GridSearchCV

In [80]:
param_grid = {"max_depth":    [4, 5, 6],
              "n_estimators": np.arange(100,800,100),
              "learning_rate": [0.01, 0.1, 1]}

In [81]:
xgb_reg = XGBRegressor()

In [82]:
# Note:  this took about 3 minutes to run on my 2023 MacBook Pro.
search = GridSearchCV(xgb_reg,
                      param_grid,
                      cv=5,
                      scoring='neg_root_mean_squared_error')
search.fit(df_train[features], df_train[target])

print("The best hyperparameters are ",search.best_params_)

The best hyperparameters are  {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': np.int64(700)}


In [83]:
xgb_reg = XGBRegressor(**search.best_params_)

In [84]:
xgb_reg.fit(df_train[features], df_train[target])

In [85]:
root_mean_squared_error(df_test[target],xgb_reg.predict(df_test[features]))

3.3767423613076932

Discussion Prompt:  A construction company is building a bridge and contracted out the specs to an engineer.  The engineer told them that, given the design constraints, the concrete compressive strength must be at least $50 \textrm{ MPa}$.  Your model predicts that the particular mix they are using has a strength of $60 \textrm{ MPa}$.

Discuss this situation from a technical, ethical, and legal perspective.  Who else would you want to loop into this conversation?

Do a little further model assessment to see what the risk is.  For example, are there any instances where the model predicted strength in excess of $55 \textrm{ MPa}$ but the actual strength was less than $50 \textrm{ MPa}$?  What else could you do to assess the risk here?

In [86]:
## Note: df includes the rows used to fit xgb_reg,
## so the errors on those rows are optimistic.
df['preds'] = xgb_reg.predict(df[features])
df[df.preds > 55].sort_values(target)

| Unnamed: 0 | Cement (component 1)(kg in a m^3 mixture) | Blast Furnace Slag (component 2)(kg in a m^3 mixture) | Fly Ash (component 3)(kg in a m^3 mixture) | Water (component 4)(kg in a m^3 mixture) | Superplasticizer (component 5)(kg in a m^3 mixture) | Coarse Aggregate (component 6)(kg in a m^3 mixture) | Fine Aggregate (component 7)(kg in a m^3 mixture) | Age (day) | Concrete compressive strength(MPa, megapascals) | preds |
|---|---|---|---|---|---|---|---|---|---|---|
| 138 | 337.9 | 189.0 | 0.0 | 174.9 | 9.5 | 944.7 | 755.8 | 28 | 49.90 | 57.246658 |
| 484 | 446.0 | 24.0 | 79.0 | 162.0 | 10.3 | 967.0 | 712.0 | 56 | 54.77 | 55.277664 |
| 99 | 469.0 | 117.2 | 0.0 | 137.8 | 32.2 | 852.1 | 840.5 | 7 | 54.90 | 55.006680 |
| 459 | 165.0 | 128.5 | 132.1 | 175.1 | 8.1 | 1005.8 | 746.6 | 100 | 55.02 | 55.959930 |
| 147 | 388.6 | 97.1 | 0.0 | 157.9 | 12.1 | 852.1 | 925.7 | 56 | 55.20 | 55.185753 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 159 | 389.9 | 189.0 | 0.0 | 145.9 | 22.0 | 944.7 | 755.8 | 56 | 79.40 | 79.305260 |
| 0 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1040.0 | 676.0 | 28 | 79.99 | 79.788383 |
| 153 | 323.7 | 282.8 | 0.0 | 183.8 | 10.3 | 942.7 | 659.9 | 56 | 80.20 | 80.080627 |
| 381 | 315.0 | 137.0 | 0.0 | 145.0 | 5.9 | 1130.0 | 745.0 | 28 | 81.75 | 81.847939 |


Out of the $139$ instances where the model predicts a strength greater than $55 \textrm{ MPa}$, only one had an actual strength below $50 \textrm{ MPa}$, and even that was very close at $49.9 \textrm{ MPa}$.  There are no examples of an actual strength below $50 \textrm{ MPa}$ when the prediction was $60 \textrm{ MPa}$ or higher.  This is fairly strong evidence that their mix meets the requirement, but it should not give us complete confidence: most of these rows were also used to train the model, so its errors on them understate the error we should expect on a new mix.
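
One further way to quantify the risk is to ask how often the model overpredicts by more than the available safety margin (here $60 - 50 = 10 \textrm{ MPa}$), using only rows the model was not trained on. The sketch below is a minimal illustration: `overprediction_rate` is a hypothetical helper, and the arrays are made-up numbers, not values from the concrete dataset.

```python
import numpy as np

def overprediction_rate(y_true, y_pred, margin):
    """Fraction of rows where the prediction exceeds the actual value by more than `margin`."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_pred - y_true) > margin))

# Made-up held-out strengths and predictions (MPa), for illustration only
actual = [49.9, 54.8, 62.0, 41.0, 70.5]
predicted = [57.2, 55.3, 60.1, 52.5, 69.0]

rate_10 = overprediction_rate(actual, predicted, margin=10)
rate_5 = overprediction_rate(actual, predicted, margin=5)
print(rate_10, rate_5)
```

In this notebook the helper could be called as `overprediction_rate(df_test[target], xgb_reg.predict(df_test[features]), margin=10)`. With a test set of limited size, a rate of zero still leaves uncertainty about rare failures.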

##### A more interpretable model?

What if we care more about interpretability than we do about making the best predictive model?  One option is to use a generalized additive model (GAM).  Here we use `ExplainableBoostingRegressor` from `interpret` which you can read about [here](https://interpret.ml/docs/ebm.html).  Essentially it is a GAM where each additive component function is learned using gradient boosting.
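
In symbols, a GAM with no interaction terms predicts by adding up one learned function per feature. The notation here is an assumption made for illustration (it is not taken from the `interpret` documentation): $x_1, \dots, x_p$ are the features, $\beta_0$ is an intercept, and $f_j$ is the component function for feature $j$.

$$\hat{y} = \beta_0 + \sum_{j=1}^{p} f_j(x_j)$$

Because each $f_j$ depends on a single feature, it can be plotted on its own, which is what makes the model easy to inspect.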

In [87]:
from interpret.glassbox import ExplainableBoostingRegressor
from interpret import show

In [88]:
ebm = ExplainableBoostingRegressor(interactions=0)
ebm.fit(df_train[features], df_train[target])

In [89]:
root_mean_squared_error(df_test[target], ebm.predict(df_test[features]))

4.894075380492447

The EBM does not generalize quite as well as the tuned XGBoost model (test RMSE of about $4.89$ versus $3.38$), though it is reasonably close, and it is more interpretable.  Discuss any insights you gain from the following graphs:

In [90]:
show(ebm.explain_global())

One interesting insight is that concrete becomes stronger as it ages, but the effect tapers off after about 50 days.  This kind of insight would be difficult to extract from the XGBoost model.

I suspect the above will have taken the full hour.  If not, here are some additional questions!

### More Questions about Boosting

1. Give an example of a model which would benefit from boosting.

Almost any model which has a tendency to under-fit is a good candidate for boosting.  Decision stumps are a classic example.

2. What happens if you use gradient boosting for linear regression with mean squared error loss?

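When we fit an ordinary least squares regression model, the residuals are orthogonal to the span of the feature vectors.  With mean squared error loss, each boosting round fits a new model to the current residuals, and fitting a linear regression to these residuals gives the zero map.  So every round after the first adds nothing: gradient boosting with linear regression as the base learner just reproduces the ordinary least squares fit.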
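
This claim can be checked numerically. The sketch below uses NumPy's least-squares solver in place of `LinearRegression` (an assumption made to keep the example short) on synthetic data: it fits ordinary least squares with an intercept, then fits a second linear model to the residuals and checks that its coefficients are numerically zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 50 rows, 3 features, with a nonlinear target so the residuals are not zero
X = rng.normal(size=(50, 3))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(size=50)

# Design matrix with an intercept column
A = np.column_stack([np.ones(len(X)), X])

# Round 1: ordinary least squares on the original target
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
residuals = y - A @ beta

# Round 2: "boost" by fitting another linear model to the residuals
gamma, *_ = np.linalg.lstsq(A, residuals, rcond=None)

# Check that the residuals are orthogonal to every column of the design matrix
print(np.allclose(A.T @ residuals, 0, atol=1e-8))
# Check that the second-round coefficients vanish
print(np.allclose(gamma, 0, atol=1e-8))
```

The same check fails for a base learner that is not a linear projection, such as a decision stump, which is why boosting helps there.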

3. Code your own `CustomGradientBoostingRegressor` class:

In [91]:
class CustomGradientBoostingRegressor():
    '''
    Trains a sequence of regressors.  
    Each new regressor has targets which are the residuals of the previous regressor.
    Prediction is performed by summing the predictions of each individual regressor.
    This is only designed to work with MSE loss.
   '''
    def __init__(self, base_estimator, num_estimators = 10, kwargs = None):
        '''
            Parameters:
                base_estimator: An sklearn regression class.
                num_estimators: The number of estimators in the ensemble
                kwargs:  A dictionary of key word arguments to pass to the estimators.
            Attributes:
                self.estimators is a list of estimators instantiated with their kwargs
        '''
        ## Resolve kwargs first so that the default kwargs=None does not raise a TypeError
        self.kwargs = kwargs if kwargs else {}
        self.base_estimator = base_estimator(**self.kwargs)
        self.num_estimators = num_estimators
        self.estimators = [base_estimator(**self.kwargs) for _ in range(num_estimators)]
    
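    def fit(self, X, y):
        ## Each estimator is fit to what the earlier estimators left unexplained
        residuals = y
        for estimator in self.estimators:
            estimator.fit(X, residuals)
            residuals = residuals - estimator.predict(X)
        return self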
    
    def predict(self, X):
        preds = np.zeros(len(X))
        for estimator in self.estimators:
            preds += estimator.predict(X)
        return preds

In [92]:
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

X = np.random.normal(size = (100,2))
y = X[:,0]**2 + X[:,0]*X[:,1]

model =  CustomGradientBoostingRegressor(
            base_estimator = DecisionTreeRegressor, 
            num_estimators = 1000, 
            kwargs = {'max_depth' : 1}
            )

model.fit(X,y)

mean_squared_error(y, model.predict(X))

0.0029326862171130423