# Wrapping up Linear Models
> Part 5 of the mangoes_blog project

- branch: master
- toc: true 
- badges: false
- comments: false
- sticky_rank: 5
- author: Huon Fraser
- categories: [mangoes]

In [1]:
#collapse-hide
import pathlib
import pandas as pd
import numpy as np

import sys
sys.path.append('/notebooks/Mangoes/src/')
model_path  = '../models/'

from matplotlib import pyplot as plt

from codetiming import Timer
from sklearn.model_selection import GroupKFold
from scikit_models import *
from skopt.space import Real, Integer
from lwr import LocalWeightedRegression
from sklearn.pipeline import Pipeline

import warnings
warnings.filterwarnings('ignore')


In [None]:
#collapse-hide
mangoes=load_mangoes()

train_data,test_data = train_test_split(mangoes)
train_X, train_y, train_cat = X_y_cat(train_data,min_X=684,max_X=990)
test_X, test_y, test_cat = X_y_cat(test_data,min_X=684,max_X=990)
nrow,ncol=train_X.shape
groups = train_cat['Pop']
splitter=GroupKFold()

## Ensemble methods

We now compare our approach to off-the-shelf ensemble models, which are typically the state of the art for tabular problems. We train off-the-shelf variants of [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html), the default sklearn [Gradient Boosting Regression](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html#sklearn.ensemble.GradientBoostingRegressor) and [XGBoost](https://xgboost.readthedocs.io/en/stable/). A caveat here is that each of these models could probably be optimised further, since we only tune the number of PLS components and leave the ensemble hyperparameters untuned. Initial performance with standardisation preprocessing was disappointing, so we used PLS preprocessing for these experiments.


### Random Forest

In [None]:
from sklearn.ensemble import RandomForestRegressor

model = Pipeline([
    ('scaler', PLSRegression()),
    ('model',RandomForestRegressor())
    ])

space  = [Integer(2,ncol,name='scaler__n_components'),
         ]

opt = Optimiser(space,model,train_X,train_y,splitter=splitter,groups=groups)
model_forest,result_forest = opt.optimise(save_file=model_path+'5_random_forest')

### Scikit-learn Gradient Boosting


In [None]:
from sklearn.ensemble import GradientBoostingRegressor

model = Pipeline([
    ('scaler', PLSRegression()),
    ('model',GradientBoostingRegressor(random_state=0))
    ])


space  = [Integer(2,ncol,name='scaler__n_components'),
         ]

opt = Optimiser(space,model,train_X,train_y,splitter=splitter,groups=groups)
model_boost,result_boost = opt.optimise(save_file=model_path+'5_gradboost')


### XGBoost

In [None]:
import xgboost as xgb
model = Pipeline([
    ('scaler', PLSRegression()),
    ('model',xgb.XGBRegressor(tree_method="gpu_hist"))
    ])
space = [
        Integer(2,ncol,name='scaler__n_components')
        ]

opt = Optimiser(space,model,train_X,train_y,splitter=splitter,groups=groups)
model_xg, result_xg = opt.optimise(save_file=model_path+'5_xgboost')

### Ensembles of PLS-LWR

We've left ensembles until last because ensembling typically improves performance, and any of the models looked at in the previous three parts could be ensembled. We take our previous best model (SG-PLS-LWR) and build a [bagging regressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingRegressor.html) ensemble. This builds 10 copies of the model, each trained on a sample (drawn with replacement) of the training set. Predictions are then made by taking the mean over the ensemble.

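Results are no better than for the non-ensembled version. A possible explanation is that bagging reduces the density of the training data in feature space (a bootstrap sample contains only about 63% of the distinct training instances on average), which interferes with the locally weighted regressions because each local model has fewer distinct nearby instances to fit on.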
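To make the bagging procedure and the density argument concrete, here is a minimal sketch written by hand with NumPy. It is not the sklearn `BaggingRegressor` implementation and does not use our SG-PLS-LWR pipeline: the base model is plain least squares, the data is synthetic, and the names `bagged_predict` and `unique_frac` are made up for this illustration.

```python
import numpy as np

def bagged_predict(X_train, y_train, X_new, n_estimators=10, seed=0):
    """Average the predictions of least-squares models fit on bootstrap samples."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    preds, unique_fracs = [], []
    for _ in range(n_estimators):
        # Draw n row indices with replacement (one bootstrap sample)
        idx = rng.integers(0, n, size=n)
        # Track how many distinct training rows this sample actually contains
        unique_fracs.append(len(np.unique(idx)) / n)
        # Stand-in base model: ordinary least squares without an intercept
        coef, *_ = np.linalg.lstsq(X_train[idx], y_train[idx], rcond=None)
        preds.append(X_new @ coef)
    # The ensemble prediction is the mean over the individual models
    return np.mean(preds, axis=0), float(np.mean(unique_fracs))

# Synthetic stand-in data (not the mango spectra)
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=2000)

y_hat, unique_frac = bagged_predict(X, y, X[:5])
print(unique_frac)  # expected to be close to 1 - 1/e (about 0.632)
```

For a global model like least squares, losing roughly a third of the distinct rows per copy matters little. For a lazy local model, those missing rows are exactly the neighbours a query point would otherwise use.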

In [None]:
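from sklearn.ensemble import BaggingRegressor
from joblib import load

# Load the previously saved best model (SG-PLS-LWR) pipeline
pipe = load(model_path+'4_pp-pls-lwr_model2.joblib')
# Bag 10 copies of the pipeline (the sklearn default), each fit on a bootstrap sample
model = BaggingRegressor(pipe, n_estimators=10)

mse = cross_validate(model,train_X,train_y,splitter=splitter,groups=groups,plot=True)
print(f'Cross-validation MSE: {mse}')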

In [None]:
model, mse_test = evaluate(model,train_X,train_y,test_X,test_y,plot=True)

## Comparing Techniques


So far in this series, we started with a multiple linear regression (LR) and added complexity: feature extraction with partial least squares (PLS), lazy instance weighting with locally weighted regressions (LWR) and preprocessing with Savitzky-Golay (SG) filtering. Adding each of these components incrementally improved performance during cross-validation, although the hyperparameter settings were not always consistent. The final model in this series (SG-PLS-LWR) gave a cross-validation MSE of 0.7223 and a test MSE of 0.7686.
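As a reminder of what the LWR component does, the sketch below predicts a single query point by fitting a weighted linear model on its nearest neighbours. This is an illustration only and not the `LocalWeightedRegression` class used in this series: the function name `lwr_predict`, the tricube kernel, the Euclidean distance and the value of `k` are all assumptions, and the data is a synthetic sine curve.

```python
import numpy as np

def lwr_predict(X_train, y_train, x_query, k=50):
    """Predict one query point from a weighted linear fit on its k nearest neighbours."""
    # Lazy learning: nothing is fit until a query arrives
    dist = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dist)[:k]
    d = dist[nearest]
    # Tricube kernel (an assumed choice): closer neighbours get larger weights
    w = (1 - (d / (d.max() + 1e-12)) ** 3) ** 3
    # Design matrix with an intercept column
    A = np.hstack([np.ones((k, 1)), X_train[nearest]])
    # Weighted least squares: scale rows by sqrt(weight), then solve ordinary least squares
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y_train[nearest] * sw, rcond=None)
    return np.concatenate([[1.0], x_query]) @ coef

# Synthetic stand-in data: a nonlinear target that one global line cannot fit
rng = np.random.default_rng(0)
X = rng.uniform(0, 2 * np.pi, size=(500, 1))
y = np.sin(X[:, 0])

pred = lwr_predict(X, y, np.array([1.0]), k=50)
print(pred, np.sin(1.0))
```

In the models from this series the distances and local fits operate on PLS scores rather than raw inputs, which is why the number of PLS components interacts with the LWR settings.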

When we compared this model to off-the-shelf ensembles (including XGBoost) and a bagging-ensemble extension, we found that these underperformed our model. To round out this part of the series we compare our results to the original Mangoes results achieved by Anderson et al. 

In the table below we compare two models by Anderson et al. (LPLS, their best-performing locally weighted PLS model, and Ensemble, their best ensemble-based model) to the models we have built in this series. Our approach gave substantially better test scores than both models. Without going into too much detail, this is likely due to the Anderson et al. models fixing the number of PLS components to a relatively low number, whereas we kept this hyperparameter flexible.

| Model                    | CV Score | Test Score |
| ------------------------ | :------: | ---------: |
| LR                       | 0.8157   | 1.1147     |
| PLS-LR                   | 0.8116   | -          |
| LWR                      | 0.7868   | -          |
| PLS-LWR                  | 0.7520   | 0.8113     |
| SG-PLS-LWR               | 0.7223   | 0.7686     |
| E(SG-PLS-LWR)            | 0.7236   | 0.7801     |
| Anderson et al. LPLS     | 0.66     | 0.887      |
| Anderson et al. Ensemble | 0.56     | 0.850      |
