In [1]:
from rich import print
import logging

logging.basicConfig(level=logging.INFO)

# Modelling with PyCaret

In the springtime project we use [PyCaret](https://github.com/pycaret/pycaret) to
train, evaluate and compare (machine learning) models. PyCaret is a Python
wrapper around several machine learning libraries and frameworks such as
[scikit-learn](https://scikit-learn.org/stable/) and
[XGBoost](https://xgboost.readthedocs.io/en/latest/).

For a (ML) model to work with PyCaret, it [needs to adhere
to](https://pycaret.gitbook.io/docs/learn-pycaret/faqs#can-i-add-my-own-custom-models-in-pycaret)
the [scikit-learn API](https://scikit-learn.org/stable/developers/develop.html).
This API specifies, for example, that each model must have a `fit` and a
`predict` method. It also specifies the expected structure of the input data.
This is why we need to make a big effort to standardize our datasets.
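To make the API requirement concrete, here is a minimal sketch of a scikit-learn compatible regressor. The class `MeanDayRegressor` and its `offset` parameter are hypothetical and exist only for illustration; they are not part of springtime or PyCaret. The sketch stores hyperparameters unchanged in `__init__`, learns state in `fit` (attributes ending in an underscore), and returns one prediction per row in `predict`.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.utils.validation import check_array, check_is_fitted, check_X_y


class MeanDayRegressor(RegressorMixin, BaseEstimator):
    """Hypothetical toy model: always predict the mean of the training target."""

    def __init__(self, offset=0.0):
        # Hyperparameters are stored as-is so get_params/set_params keep working.
        self.offset = offset

    def fit(self, X, y):
        # Validate the 2D feature matrix and the 1D target.
        X, y = check_X_y(X, y)
        # Learned attributes end with an underscore by convention.
        self.mean_ = float(np.mean(y)) + self.offset
        self.n_features_in_ = X.shape[1]
        return self

    def predict(self, X):
        check_is_fitted(self)
        X = check_array(X)
        # One prediction per input row.
        return np.full(X.shape[0], self.mean_)


# Tiny made-up dataset: two features, target is a day of year.
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([110.0, 120.0, 130.0])

model = MeanDayRegressor().fit(X, y)
predictions = model.predict(X)
print(predictions)
```

Note that the default input validation used here rejects missing values, which is one reason the input data needs a predictable structure.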

## Basic experiment

Let's see what a 'simple' experiment looks like. We'll load the same example
data as before and compare a few 'standard' models. For now, we'll stick to
PyCaret's default settings.


In [2]:
from springtime.datasets import PEP725Phenor, EOBS
from springtime.utils import germany, PointsFromOther, join_dataframes

years = [2000, 2002]

pep725 = PEP725Phenor(
    species="Syringa vulgaris",
    years=years,
    area=germany,
)

eobs = EOBS(
    area=germany,
    years=years,
    variables=["mean_temperature"],
    resample={"frequency": "M", "operator": "mean"},
    points=PointsFromOther(source="pep725"),
)

df_pep725 = pep725.load()
eobs.points.get_points(df_pep725)
df_eobs = eobs.load()
df = join_dataframes([df_pep725, df_eobs])
df.head()

INFO:springtime.datasets.eobs:Locating data
INFO:springtime.datasets.eobs:Looking for variable mean_temperature in period 2000-2002...
INFO:springtime.datasets.eobs:Found /home/peter/.cache/springtime/e-obs/tg_ens_mean_0.1deg_reg_1995-2010_v26.0e.nc


year,geometry,day,mean_temperature|31,mean_temperature|59,mean_temperature|60,mean_temperature|90,mean_temperature|91,mean_temperature|120,mean_temperature|121,mean_temperature|151,mean_temperature|152,...,mean_temperature|243,mean_temperature|244,mean_temperature|273,mean_temperature|274,mean_temperature|304,mean_temperature|305,mean_temperature|334,mean_temperature|335,mean_temperature|365,mean_temperature|366
2000,POINT (10.00000 49.48330),129,0.323548,,3.664483,,5.358709,,9.966999,,14.801293,...,,18.310001,,13.865,,10.093225,,5.732333,,2.525484
2000,POINT (10.00000 50.85000),120,0.943226,,3.795517,,5.660645,,10.001336,,14.421612,...,,17.148067,,13.630667,,9.831611,,5.651332,,2.396452
2000,POINT (10.00000 51.71670),116,1.694194,,4.053448,,5.399354,,10.563,,14.321937,...,,17.53968,,14.346666,,10.444515,,6.55,,3.295161
2000,POINT (10.00000 52.10000),120,2.531935,,4.937242,,5.771289,,10.993333,,14.817741,...,,17.556454,,14.623999,,11.281612,,7.564668,,4.204194
2000,POINT (10.00000 53.08330),121,2.119677,,3.988276,,4.812258,,9.906999,,14.663547,...,,16.573227,,13.515334,,10.37387,,6.470334,,3.10871


In [3]:
from pycaret.regression import RegressionExperiment

df.reset_index(drop=True, inplace=True)  # drop index to avoid sorting errors

exp = RegressionExperiment()
exp.setup(data=df, target="day")
exp.compare_models(["lr", "rf", "dummy"], n_select=3)

,Description,Value
0,Session id,7530
1,Target,day
2,Target type,Regression
3,Original data shape,"(4729, 24)"
4,Transformed data shape,"(4729, 24)"
5,Transformed train set shape,"(3310, 24)"
6,Transformed test set shape,"(1419, 24)"
7,Numeric features,23
8,Rows with missing values,100.0%
9,Preprocess,True


,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
rf,Random Forest Regressor,4.2245,39.7006,6.2849,0.474,0.0513,0.0344,0.816
lr,Linear Regression,5.2901,48.3598,6.9462,0.3565,0.0564,0.043,0.425
dummy,Dummy Regressor,6.7257,75.994,8.7,-0.0041,0.0705,0.0548,0.039


[RandomForestRegressor(n_jobs=-1, random_state=7530),
 LinearRegression(n_jobs=-1),
 DummyRegressor()]

PyCaret trained all three models and ranked the random forest regressor best,
with a mean absolute error of about 4 days and an RMSE of about 6.3 days. This
seems okay, but the low coefficient of determination (R2 of about 0.47) suggests
that the skill is still limited. A more in-depth investigation is needed before
concluding anything from this, but the initial scoring is already a great
start.
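To help read the comparison table, the following sketch computes MAE, RMSE and R2 by hand with the standard textbook definitions. The helper `regression_scores` and the numbers are made up for illustration. It also shows why a model that always predicts the mean scores an R2 of zero when evaluated on the data the mean was computed from.

```python
import numpy as np


def regression_scores(y_true, y_pred):
    """Compute MAE, RMSE and R2 from their textbook definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mae = np.mean(np.abs(residuals))
    rmse = np.sqrt(np.mean(residuals**2))
    # R2 compares the squared error with that of always predicting the mean.
    r2 = 1.0 - np.sum(residuals**2) / np.sum((y_true - y_true.mean()) ** 2)
    return {"MAE": mae, "RMSE": rmse, "R2": r2}


# Made-up observed days of year.
observed = np.array([110.0, 120.0, 130.0, 140.0])

# Baseline: always predict the mean of the observations.
baseline = regression_scores(observed, np.full(4, observed.mean()))

# A model that is off by two days for every sample.
close = regression_scores(observed, observed + np.array([2.0, -2.0, 2.0, -2.0]))

print(baseline)
print(close)
```

In a cross-validated setting the dummy model predicts the mean of the training folds rather than of the held-out fold, which is a likely reason its R2 in the table is slightly negative instead of exactly zero.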

Notice the default settings of PyCaret: it uses 10-fold cross-validation, so the
scores above are averages over the folds, and it can do simple preprocessing
tasks such as simple imputation of missing data.
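As a rough illustration of what these defaults amount to, the sketch below runs 10-fold cross-validation with mean imputation using plain scikit-learn on synthetic data. This is not PyCaret's internal code; the data, the choice of mean imputation and the variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline

# Synthetic data (assumption): 200 samples, 4 features, target around day 120.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = 120 + 3 * X[:, 0] + rng.normal(scale=0.5, size=200)

# Knock out roughly 10% of the feature values to mimic missing data.
X[rng.random(X.shape) < 0.1] = np.nan

# Imputation lives inside the pipeline, so it is fitted on the training folds only.
pipeline = make_pipeline(SimpleImputer(strategy="mean"), LinearRegression())

cv = KFold(n_splits=10)
scores = cross_validate(
    pipeline, X, y, cv=cv, scoring=("neg_mean_absolute_error", "r2")
)

# Average the per-fold scores, as a comparison table would report them.
mean_mae = -scores["test_neg_mean_absolute_error"].mean()
mean_r2 = scores["test_r2"].mean()
print(f"MAE: {mean_mae:.3f}, R2: {mean_r2:.3f}")
```

Keeping the imputer inside the pipeline avoids leaking information from the held-out fold into the training step.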


In [4]:
# TODO: eobs has NANs due to DOY matching of monthly resampled data with leap years.
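One plausible reading of this TODO, judging from the column names in the table above (`mean_temperature|59`, `mean_temperature|60`, and so on), is that monthly values are labelled by the day of year of the month end. That label shifts by one after February in leap years, so leap and regular years fill different columns. The sketch below only illustrates the calendar arithmetic with the standard library; `month_end_doys` is a hypothetical helper, not a springtime function.

```python
import calendar
from datetime import date


def month_end_doys(year):
    """Return the day of year of the last day of each month in `year`."""
    return [
        date(year, month, calendar.monthrange(year, month)[1]).timetuple().tm_yday
        for month in range(1, 13)
    ]


leap = month_end_doys(2000)      # 2000 is a leap year
regular = month_end_doys(2001)   # 2001 is not

# Columns that only one kind of year can fill.
only_leap = sorted(set(leap) - set(regular))
only_regular = sorted(set(regular) - set(leap))
shared = sorted(set(leap) & set(regular))

print("leap year:   ", leap)
print("regular year:", regular)
print("shared:      ", shared)
```

Under this assumption only the January column is shared between the two kinds of year, so every row ends up with missing values in the columns that belong to the other kind. That would be consistent with the "Rows with missing values: 100.0%" reported by the setup.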

## Custom estimators: interpretML

Any model that adheres to the scikit-learn API can be used in PyCaret. For
example, we are interested in using [interpretML](https://interpret.ml/docs/).

Note: interpretML must be installed in your springtime environment for this to work:

```bash
pip install interpret
```


In [5]:
from interpret.glassbox import ExplainableBoostingRegressor

ebm = ExplainableBoostingRegressor()

exp = RegressionExperiment()
exp.setup(data=df, target="day")
exp.compare_models(["lr", "rf", "dummy", ebm], n_select=3)

,Description,Value
0,Session id,7202
1,Target,day
2,Target type,Regression
3,Original data shape,"(4729, 24)"
4,Transformed data shape,"(4729, 24)"
5,Transformed train set shape,"(3310, 24)"
6,Transformed test set shape,"(1419, 24)"
7,Numeric features,23
8,Rows with missing values,100.0%
9,Preprocess,True


,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
3,ExplainableBoostingRegressor,3.9667,40.5677,6.3604,0.4824,0.0553,0.0331,1.048
1,Random Forest Regressor,4.2388,44.1128,6.6274,0.4379,0.0573,0.0353,0.813
0,Linear Regression,5.2808,51.715,7.1857,0.339,0.0613,0.0437,0.045
2,Dummy Regressor,6.6979,78.6172,8.8596,-0.004,0.0743,0.0553,0.05


INFO:interpret.utils._native:EBM lib loading.
INFO:interpret.utils._native:Loading native on Linux | debug = False
INFO:interpret.utils._compressed_dataset:Creating native dataset
INFO:interpret.glassbox._ebm._bin:eval_terms
INFO:interpret.glassbox._ebm._bin:eval_terms
INFO:interpret.glassbox._ebm._bin:eval_terms
INFO:interpret.glassbox._ebm._bin:eval_terms
INFO:interpret.glassbox._ebm._bin:eval_terms
INFO:interpret.glassbox._ebm._bin:eval_terms
INFO:interpret.glassbox._ebm._bin:eval_terms
INFO:interpret.glassbox._ebm._bin:eval_terms
INFO:interpret.utils._compressed_dataset:Creating native dataset
INFO:interpret.glassbox._ebm._ebm:Estimating with FAST
INFO:interpret.glassbox._ebm._bin:eval_terms


[ExplainableBoostingRegressor(),
 RandomForestRegressor(n_jobs=-1, random_state=7202),
 LinearRegression(n_jobs=-1)]

Nice! Our explainable boosting machine outperformed the random forest regressor on MAE (3.97 versus 4.24 days) and R2 (0.48 versus 0.44). This is promising!


## Custom estimators: Mixed-effects models

Another interesting approach to (interpretable) modelling is the use of
mixed-effects models. We found an existing package,
[MERF](https://github.com/manifoldai/merf), that combines a random forest with
random effects. However, the package was not scikit-learn compatible out of the
box. Therefore, we've made an adapted version called DumME, which is fully
scikit-learn compatible and replaces the default model with the scikit-learn
dummy model. You can install this package with

```bash
pip install dumme
```

The nice thing is that you are not tied to the Dummy or Random Forest models for
the fixed effects. You can combine it with any other sklearn-compatible
(regression) model. Here, we create three mixed-effects models and see how they
perform.


In [6]:
import numpy as np
from dumme.dumme import MixedEffectsModel
from pycaret.regression import RegressionExperiment
from sklearn.ensemble import RandomForestRegressor

# We need to add a cluster column, otherwise each unique sample will be treated
# as a cluster and the algorithm will be very slow.
df = df.copy()
df["cluster"] = np.random.randint(0, 3, len(df))

ebm = ExplainableBoostingRegressor()
me_dummy = MixedEffectsModel(max_iterations=2)
me_rf = MixedEffectsModel(RandomForestRegressor(), max_iterations=2)
me_ebm = MixedEffectsModel(ExplainableBoostingRegressor(), max_iterations=2)

exp = RegressionExperiment()
exp.setup(data=df, target="day")

# Notice: We can pass in the fit arguments for DumME, but that breaks the other models.
# exp.compare_models(["lr", "rf", "dummy", ebm, me_dummy, me_rf, me_ebm], n_select=3, fit_kwargs={"cluster_column": "cluster"})

# Therefore, instead, we rely on the default of DumME to use the last column as cluster column
exp.compare_models(["lr", "rf", "dummy", ebm, me_dummy, me_rf, me_ebm], n_select=3)

,Description,Value
0,Session id,3711
1,Target,day
2,Target type,Regression
3,Original data shape,"(4729, 25)"
4,Transformed data shape,"(4729, 25)"
5,Transformed train set shape,"(3310, 25)"
6,Transformed test set shape,"(1419, 25)"
7,Numeric features,24
8,Rows with missing values,100.0%
9,Preprocess,True


,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
3,ExplainableBoostingRegressor,3.9866,40.5575,6.3355,0.5093,0.0548,0.0332,1.106
6,MixedEffectsModel,3.9925,40.6097,6.3389,0.5088,0.0548,0.0333,7.054
1,Random Forest Regressor,4.2263,43.3897,6.5584,0.4742,0.0564,0.0351,0.783
5,MixedEffectsModel,4.2393,43.5772,6.5733,0.4718,0.0566,0.0352,6.469
0,Linear Regression,5.3509,53.1597,7.2676,0.3515,0.0618,0.0442,0.044
2,Dummy Regressor,6.8229,82.6911,9.0721,-0.0067,0.076,0.0564,0.039
4,MixedEffectsModel,6.8274,82.7419,9.0751,-0.0074,0.076,0.0564,5.255


INFO:interpret.utils._compressed_dataset:Creating native dataset
INFO:interpret.glassbox._ebm._bin:eval_terms
INFO:interpret.glassbox._ebm._bin:eval_terms
INFO:interpret.glassbox._ebm._bin:eval_terms
INFO:interpret.glassbox._ebm._bin:eval_terms
INFO:interpret.glassbox._ebm._bin:eval_terms
INFO:interpret.glassbox._ebm._bin:eval_terms
INFO:interpret.glassbox._ebm._bin:eval_terms
INFO:interpret.glassbox._ebm._bin:eval_terms
INFO:interpret.utils._compressed_dataset:Creating native dataset
INFO:interpret.glassbox._ebm._ebm:Estimating with FAST
INFO:interpret.glassbox._ebm._bin:eval_terms
INFO:interpret.utils._compressed_dataset:Creating native dataset
INFO:interpret.glassbox._ebm._bin:eval_terms
INFO:interpret.glassbox._ebm._bin:eval_terms
INFO:interpret.glassbox._ebm._bin:eval_terms
INFO:interpret.glassbox._ebm._bin:eval_terms
INFO:interpret.glassbox._ebm._bin:eval_terms
INFO:interpret.glassbox._ebm._bin:eval_terms
INFO:interpret.glassbox._ebm._bin:eval_terms
INFO:interpret.glassbox._ebm._

[ExplainableBoostingRegressor(),
 MixedEffectsModel(fe_model=ExplainableBoostingRegressor(), max_iterations=2),
 RandomForestRegressor(n_jobs=-1, random_state=3711)]

That's as far as the introductions go.

## Next steps: adding your own (physical) models

While many ML packages already adhere to the scikit-learn API, other methods do
not. In particular, we want to compare our ML-based approach with more
traditional physics-based models. In the next chapter, we will show how to add a
growing-degree day model to our framework.
