In [10]:
from rich import print
import logging

logging.basicConfig()

# Modelling with PyCaret

In the springtime project we use [PyCaret](https://github.com/pycaret/pycaret) to
train, evaluate and compare (machine learning) models. PyCaret is a Python
wrapper around several machine learning libraries and frameworks such as
[scikit-learn](https://scikit-learn.org/stable/) and
[XGBoost](https://xgboost.readthedocs.io/en/latest/).

For a (ML) model to work with PyCaret, it [needs to adhere
to](https://pycaret.gitbook.io/docs/learn-pycaret/faqs#can-i-add-my-own-custom-models-in-pycaret)
the [scikit-learn API](https://scikit-learn.org/stable/developers/develop.html).
This API specifies, for example, that each model must have a `fit` and a
`predict` method. It also specifies the expected structure of the input data.
This is why we need to make a big effort to standardize our datasets.
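
To make the contract concrete, here is a minimal sketch of a regressor that follows the scikit-learn conventions using only the standard library. The class name `MeanRegressor` and its `offset` parameter are made up for illustration. In practice you would probably subclass `sklearn.base.BaseEstimator` and `sklearn.base.RegressorMixin`, which provide `get_params`, `set_params` and `score` for you.

```python
class MeanRegressor:
    """Toy regressor that always predicts the mean of the training target."""

    def __init__(self, offset=0.0):
        # Hyperparameters are stored unchanged, under the same name as the argument.
        self.offset = offset

    def get_params(self, deep=True):
        # Lets tools inspect and clone the estimator.
        return {"offset": self.offset}

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def fit(self, X, y):
        # Learned attributes get a trailing underscore; fit returns self.
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        # One prediction per row of X.
        return [self.mean_ + self.offset for _ in X]


# Hypothetical toy data: rows are samples, columns are features.
X_train = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
y_train = [110, 120, 130]

model = MeanRegressor().fit(X_train, y_train)
predictions = model.predict([[7.0, 8.0], [9.0, 10.0]])
print(predictions)
```

The input layout (one row per sample, one column per feature, and a separate target) is the structure that our standardized datasets need to provide.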

## Example use case

Let's see what a 'simple' experiment looks like. We'll load the same example
data as before, and compare a few 'standard' models. For now, we'll stick to
PyCaret's default settings.


In [8]:
from springtime.datasets import PEP725Phenor, EOBS
from springtime.utils import germany, PointsFromOther, join_dataframes

years = [2000, 2002]

pep725 = PEP725Phenor(
    species="Syringa vulgaris",
    years=years,
    area=germany,
)

eobs = EOBS(
    area=germany,
    years=years,
    variables=["mean_temperature"],
    resample={"frequency": "M", "operator": "mean"},
    points=PointsFromOther(source="pep725"),
)

df_pep725 = pep725.load()
eobs.points.get_points(df_pep725)
df_eobs = eobs.load()
df = join_dataframes([df_pep725, df_eobs])
df.head()

year,geometry,day,mean_temperature|31,mean_temperature|59,mean_temperature|60,mean_temperature|90,mean_temperature|91,mean_temperature|120,mean_temperature|121,mean_temperature|151,mean_temperature|152,...,mean_temperature|243,mean_temperature|244,mean_temperature|273,mean_temperature|274,mean_temperature|304,mean_temperature|305,mean_temperature|334,mean_temperature|335,mean_temperature|365,mean_temperature|366
2000,POINT (10.00000 49.48330),129,0.323548,,3.664483,,5.358709,,9.966999,,14.801293,...,,18.310001,,13.865,,10.093225,,5.732333,,2.525484
2000,POINT (10.00000 50.85000),120,0.943226,,3.795517,,5.660645,,10.001336,,14.421612,...,,17.148067,,13.630667,,9.831611,,5.651332,,2.396452
2000,POINT (10.00000 51.71670),116,1.694194,,4.053448,,5.399354,,10.563,,14.321937,...,,17.53968,,14.346666,,10.444515,,6.55,,3.295161
2000,POINT (10.00000 52.10000),120,2.531935,,4.937242,,5.771289,,10.993333,,14.817741,...,,17.556454,,14.623999,,11.281612,,7.564668,,4.204194
2000,POINT (10.00000 53.08330),121,2.119677,,3.988276,,4.812258,,9.906999,,14.663547,...,,16.573227,,13.515334,,10.37387,,6.470334,,3.10871


In [9]:
from pycaret.regression import RegressionExperiment

df.reset_index(drop=True, inplace=True)  # drop index to avoid sorting errors

exp = RegressionExperiment()
exp.setup(data=df, target="day")
exp.compare_models(["lr", "rf", "dummy"], n_select=3)

,Description,Value
0,Session id,6297
1,Target,day
2,Target type,Regression
3,Original data shape,"(4729, 24)"
4,Transformed data shape,"(4729, 24)"
5,Transformed train set shape,"(3310, 24)"
6,Transformed test set shape,"(1419, 24)"
7,Numeric features,23
8,Rows with missing values,100.0%
9,Preprocess,True


,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
rf,Random Forest Regressor,4.295,42.7326,6.5206,0.455,0.0546,0.0353,0.695
lr,Linear Regression,5.2782,50.1124,7.0621,0.3606,0.0587,0.0432,0.019
dummy,Dummy Regressor,6.7109,78.4569,8.8469,-0.0015,0.0728,0.055,0.019


[RandomForestRegressor(n_jobs=-1, random_state=6297),
 LinearRegression(n_jobs=-1),
 DummyRegressor()]

PyCaret trained all three models and ranked the random forest regressor best,
with a mean absolute error of about 4.3 days and an RMSE of about 6.5 days. This
seems okay, but the low coefficient of determination (R2 of about 0.46) suggests
that the skill is still limited. Obviously, a more in-depth investigation is
necessary to conclude anything from this, but the initial scoring is already a
great start.
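
For reference, the three scores discussed here can be computed as in the sketch below. It uses plain Python on made-up day-of-year values (not the springtime data), and the function names are chosen for illustration only. It also shows why a model that always predicts the mean, like the dummy regressor, gets an R2 of zero on the data that mean was computed from.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual variance / total variance."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Illustrative values only, not taken from the dataset.
observed = [116, 120, 121, 129, 124]
predicted = [118, 119, 124, 125, 124]
baseline = [sum(observed) / len(observed)] * len(observed)  # always the mean

print("MAE: ", mae(observed, predicted))
print("RMSE:", rmse(observed, predicted))
print("R2 (model):   ", r2(observed, predicted))
print("R2 (baseline):", r2(observed, baseline))
```

In the cross-validated table the dummy regressor scores slightly below zero, presumably because the mean of each training fold differs a little from the mean of the held-out fold.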

Notice PyCaret's default settings: it evaluates each model with 10-fold
cross-validation and applies simple preprocessing, such as imputing missing
values. That imputation matters here, because the setup summary reports missing
values in 100% of the rows.
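
Conceptually, these two defaults combine as in the sketch below. This is not PyCaret's actual implementation: the helper names `k_fold_indices` and `impute_mean` are hypothetical, missing values are represented by `None`, and three folds are used instead of ten to keep the toy data small.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train, test) index lists for consecutive, near-equal folds."""
    fold_sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


def impute_mean(train_column, column):
    """Replace None in `column` by the mean of the known training values."""
    known = [v for v in train_column if v is not None]
    fill = sum(known) / len(known)
    return [fill if v is None else v for v in column]


# One toy feature column with gaps.
values = [1.0, None, 3.0, 5.0, None, 7.0]
folds = list(k_fold_indices(len(values), n_folds=3))

train, test = folds[0]
train_column = [values[i] for i in train]
test_column = [values[i] for i in test]

# The fill value comes from the training fold only, so nothing leaks
# from the held-out fold into the preprocessing step.
imputed_test = impute_mean(train_column, test_column)
print(imputed_test)
```

Each model is fitted once per fold on the training indices and scored on the held-out indices; the reported metrics are averages over the folds.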


In [None]:
# TODO: eobs has NANs due to DOY matching of monthly resampled data with leap years.
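
The column labels in the table above (31, 59, 60, 90, 91, ...) suggest where these NaNs come from. The sketch below assumes that each monthly value is labelled with the day of year of the last day of that month, which is our reading of the output rather than a documented behaviour. Under that assumption, a leap year such as 2000 and a regular year such as 2002 only share the label for January.

```python
import calendar
from datetime import date


def month_end_doys(year):
    """Day of year of the last day of each month in the given year."""
    return [
        date(year, month, calendar.monthrange(year, month)[1]).timetuple().tm_yday
        for month in range(1, 13)
    ]


leap = month_end_doys(2000)     # leap year
regular = month_end_doys(2002)  # regular year

print(leap)
print(regular)

# Labels that exist in both years; every other column is NaN for one of them.
shared = sorted(set(leap) & set(regular))
print(shared)
```

Labelling the monthly values by month number (1 to 12) instead of day of year would be one way to make the columns line up across years.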

## Custom estimators: InterpretML

Any model that adheres to the scikit-learn API can be used in PyCaret. For
example, we are interested in using [InterpretML](https://interpret.ml/docs/).

Note: InterpretML must be installed in your springtime environment for this to work:

```bash
pip install interpret
```


In [11]:
from interpret.glassbox import ExplainableBoostingRegressor

ebm = ExplainableBoostingRegressor()

exp = RegressionExperiment()
exp.setup(data=df, target="day")
exp.compare_models(["lr", "rf", "dummy", ebm], n_select=3)

,Description,Value
0,Session id,1017
1,Target,day
2,Target type,Regression
3,Original data shape,"(4729, 24)"
4,Transformed data shape,"(4729, 24)"
5,Transformed train set shape,"(3310, 24)"
6,Transformed test set shape,"(1419, 24)"
7,Numeric features,23
8,Rows with missing values,100.0%
9,Preprocess,True


,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
3,ExplainableBoostingRegressor,4.0475,41.5444,6.4214,0.4858,0.055,0.0337,1.383
1,Random Forest Regressor,4.3391,45.0654,6.6972,0.4397,0.0571,0.036,0.739
0,Linear Regression,5.3167,52.5372,7.2342,0.3472,0.0611,0.0439,0.37
2,Dummy Regressor,6.751,80.9319,8.9787,-0.0047,0.0749,0.0558,0.024


[ExplainableBoostingRegressor(),
 RandomForestRegressor(n_jobs=-1, random_state=1017),
 LinearRegression(n_jobs=-1)]

Nice! In this run, our explainable boosting machine outperformed the random forest regressor (MAE of about 4.0 versus 4.3 days, R2 of 0.49 versus 0.44). The margin is small, but this is promising!


## Custom estimators: next steps

While many ML packages already adhere to the scikit-learn API, ...

We have made modifications to [mixed effects random forest
(MERF)](https://manifoldai.github.io/merf/) and
[PyPhenology](https://github.com/sdtaylor/pyPhenology). In the following
chapters we'll walk through these modifications and explain how they enable the
use of these packages in a coherent framework.
