# Introduction: data and modelling standards

The goal of springtime is to facilitate phenological modelling studies.
Phenology is the study of the timing of life cycle events of plants and animals.
For example, when do the leaves turn green? 🌿

We can model these phenological events, such as spring onset, in various ways,
either using "physics-based" models, statistical methods, or machine
learning techniques. In any case, the problem can be formulated as follows:

$y = f(\vec{x})$

where $y$ is the event, $\vec{x}$ is a set of predictor variables, and $f$ is an
unknown function. $y$ can be the exact date of the event or, for example, a
binary classification of whether the event has occurred. $f(\vec{x})$ could be a
very simple function of latitude, or a complex relationship between various
inputs such as temperature timeseries from weather models, greenness indices
from satellites, and categorical variables such as species, land use, and soil
type.
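To make the two formulations of $y$ concrete, here is a minimal sketch with a toy $f$ that depends on latitude only. The function names and the coefficients are made up for illustration; they are not part of springtime and carry no scientific meaning.

```python
def onset_doy(latitude: float) -> float:
    """Toy f(x): spring onset (day of year) as a linear function of latitude."""
    # Illustrative coefficients only: onset shifts 2.5 days per degree of latitude.
    return 90.0 + 2.5 * (latitude - 40.0)


def has_occurred(latitude: float, day_of_year: int) -> bool:
    """Binary formulation: has spring onset happened by the given day?"""
    return day_of_year >= onset_doy(latitude)


# Regression-style target: the date of the event.
date_target = onset_doy(52.0)

# Classification-style target: whether the event has occurred on day 100.
binary_target = has_occurred(52.0, 100)
```

The same predictor feeds both formulations; only the type of the target changes.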

There are various datasets out there, and several modelling packages as well.
However, they all come with their own quirks. **The goal of springtime is to
harmonize these datasets and modelling packages, such that it becomes easy to
use them together in a single study.**
At the core of this harmonization effort lies standardization of the data
structure. A lot of standardization has already taken place in the machine
learning realm. For example, in scikit-learn, predictors are represented as a
numpy array or pandas dataframe, typically called `X`, and target variables `y`
as a one-dimensional array or series:

```py
model.fit(X, y)
new_y = model.predict(new_X)
```
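A self-contained version of this pattern might look like the sketch below. The synthetic data and the choice of `LinearRegression` are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic predictors: 20 samples, 2 features (hypothetical values).
rng = np.random.default_rng(seed=0)
X = rng.normal(size=(20, 2))

# Synthetic target that depends linearly on the predictors.
y = 120.0 + 5.0 * X[:, 0] - 3.0 * X[:, 1]

model = LinearRegression()
model.fit(X, y)

# Predict for two new samples; the result is a one-dimensional array.
new_X = np.array([[0.0, 0.0], [1.0, 1.0]])
new_y = model.predict(new_X)
```

`X` is two-dimensional (samples by features), while `y` and `new_y` are one-dimensional with one value per sample.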

On top of that, there are packages like pycaret that enable automated training
and comparison of various models. Pycaret takes the predictors and target as a
single dataframe, together with the name of the target column, like so:

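```py
from pycaret.regression import RegressionExperiment

experiment = RegressionExperiment()
experiment.setup(data=Xy, target="y")  # Xy holds the predictors plus the target column "y"
experiment.compare_models(["list of models"])  # placeholder for the model IDs to compare
```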

Internally, pycaret then splits the data again and passes it to scikit-learn
just like above. But it can also pass data to other modelling packages with a
similar interface.
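Conceptually, that split can be sketched with pandas and scikit-learn as follows. This is not pycaret's actual implementation; the column names, the values, and the 70/30 split are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical combined table: two predictors plus the target column "y".
Xy = pd.DataFrame(
    {
        "mean temperature": [4.1, 5.3, 3.8, 6.0, 4.7, 5.5, 3.9, 6.2, 4.4, 5.0],
        "latitude": [52.0, 48.5, 55.1, 45.2, 50.3, 47.8, 54.0, 44.9, 51.2, 49.6],
        "y": [121, 110, 130, 102, 117, 108, 128, 100, 119, 113],
    }
)

# Separate predictors and target by column name.
target = "y"
X = Xy.drop(columns=[target])
y = Xy[target]

# Hold out part of the data for testing (a 70/30 split is assumed here).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=0
)
```

After this step, `X_train` and `y_train` have the shape expected by `model.fit(X, y)`.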

With springtime, **our goal is to make existing phenological data sources and
modelling packages compatible with these standards**. Thus, we need to get our
data sources in the right format, and we need to make sure our models follow the
same interface.


## Dummy use case

We illustrate the "ideal" case with dummy data and models. This example is all
about the data structure and modelling workflow, not (yet) about the science.


```py
# Data format
from springtime import dummy

data = dummy.pycaret_ready(100)
data
```


| year | geometry | spring onset (DOY) | minimum temperature | mean temperature | maximum temperature |
|------|----------|--------------------|---------------------|------------------|---------------------|
| 2000 | POINT (0.22438 0.16026) | 143 | -3.033813 | 0.022000 | 3.039246 |
| 2001 | POINT (-0.65165 -0.31790) | 160 | -2.977279 | 0.043639 | 3.265090 |
| 2002 | POINT (2.21276 -1.31028) | 160 | -2.919174 | -0.043401 | 3.013467 |
| 2003 | POINT (-0.20714 -1.33902) | 121 | -2.872743 | -0.029902 | 2.987468 |
| 2004 | POINT (0.80240 0.02859) | 123 | -2.768337 | -0.003783 | 2.889369 |
| ... | ... | ... | ... | ... | ... |
| 2095 | POINT (-0.78062 0.62873) | 176 | -2.900270 | 0.015071 | 2.859301 |
| 2096 | POINT (-0.12186 0.82847) | 169 | -2.649552 | 0.111426 | 2.564799 |
| 2097 | POINT (-0.42651 0.80629) | 153 | -2.956537 | -0.015596 | 3.426379 |
| 2098 | POINT (0.20776 -0.96359) | 121 | -3.457283 | -0.006289 | 2.840914 |


Typical input data for phenological modelling contains one observation per
year and location. Together, year and location (geometry) form the index of the
dataframe. In this dummy sample, the target variable is the spring onset day of
year (DOY) and, for the purpose of illustration, the predictors are aggregated
temperature features.
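A dataframe with this layout could be assembled by hand roughly as in the sketch below. The values are invented, and the geometries are written as plain WKT strings to keep the example dependency-free; this is an assumption and not necessarily how `dummy.pycaret_ready` builds its output.

```python
import pandas as pd

# Hypothetical observations: (year, geometry, target, three temperature features).
records = [
    (2000, "POINT (4.9 52.4)", 143, -3.0, 0.0, 3.0),
    (2000, "POINT (5.7 51.0)", 138, -2.8, 0.1, 3.1),
    (2001, "POINT (4.9 52.4)", 150, -3.1, -0.1, 2.9),
    (2001, "POINT (5.7 51.0)", 141, -2.9, 0.0, 3.2),
]
columns = [
    "year",
    "geometry",
    "spring onset (DOY)",
    "minimum temperature",
    "mean temperature",
    "maximum temperature",
]

# Year and location together identify one observation, so both go in the index.
data = pd.DataFrame.from_records(records, columns=columns).set_index(
    ["year", "geometry"]
)

# Each (year, geometry) pair should occur only once.
assert data.index.is_unique
```

With the identifiers in the index, every remaining column is either the target or a predictor.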

With this data structure, we can run an automatic model comparison with pycaret:


```py
# Run a model comparison
from pycaret.regression import RegressionExperiment

exp = RegressionExperiment()
exp.setup(data=data, target="spring onset (DOY)")
exp.compare_models(["lr", "rf", "dummy"], n_select=3)
```


| Description | Value |
|-------------|-------|
| Session id | 6916 |
| Target | spring onset (DOY) |
| Target type | Regression |
| Original data shape | (100, 4) |
| Transformed data shape | (100, 4) |
| Transformed train set shape | (70, 4) |
| Transformed test set shape | (30, 4) |
| Numeric features | 3 |
| Preprocess | True |
| Imputation type | simple |


| ID | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
|----|-------|-----|-----|------|----|-------|------|----------|
| dummy | Dummy Regressor | 13.8748 | 269.9634 | 16.1924 | -0.0925 | 0.1091 | 0.0953 | 0.018 |
| lr | Linear Regression | 14.2709 | 276.5581 | 16.3719 | -0.1425 | 0.1103 | 0.0981 | 0.02 |
| rf | Random Forest Regressor | 14.2633 | 305.765 | 17.1991 | -0.2704 | 0.1161 | 0.0977 | 0.124 |


```py
[DummyRegressor(),
 LinearRegression(n_jobs=-1),
 RandomForestRegressor(n_jobs=-1, random_state=6916)]
```

## Summary & next steps

There are various nice packages out there for (phenological) modelling and data
retrieval. We choose pycaret as our "modelling framework", as it can automate
common tasks such as model fitting, scoring, and saving an experiment. Thus, we
need to make sure that all datasets and models of interest are compatible with
pycaret.

In the next steps, we will dive into the harmonization of datasets and models.


## Models

In springtime, we have so far been working with the following modelling packages:

- scikit-learn
- merf
- pyphenology
- interpretML

Since not all these models adhere to the same data structure, we have made modifications to MERF and PyPhenology.
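To give a rough idea of what following the shared interface involves, the sketch below wraps a trivial model in scikit-learn's estimator API. `MeanOnsetRegressor` is a hypothetical toy class invented for this example; it is not the adapter code used for MERF or PyPhenology.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin


class MeanOnsetRegressor(RegressorMixin, BaseEstimator):
    """Hypothetical toy model: always predicts the mean onset seen in training."""

    def fit(self, X, y):
        # Learned state gets a trailing underscore, per scikit-learn convention.
        self.mean_onset_ = float(np.mean(y))
        return self

    def predict(self, X):
        # Return one prediction per row of X.
        return np.full(len(X), self.mean_onset_)


# Usage mirrors any other scikit-learn regressor.
X = np.array([[0.1], [0.2], [0.3]])
y = np.array([120, 130, 140])
model = MeanOnsetRegressor().fit(X, y)
predictions = model.predict(np.array([[0.4], [0.5]]))
```

Any model that exposes `fit(X, y)` and `predict(X)` in this way can be used interchangeably with the scikit-learn models shown earlier.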


## Datasets

Springtime implements utilities to retrieve data from the following sources:

- ...
- ...

Springtime also includes functionality to convert the data to the format specified above.


## Feature extraction and other utilities

Most variables need some sort of pre-processing before they can be used from
within springtime, and springtime includes some functionality to help with this.
For example, ...
