# Adding a (physics-based) model

So far we've only looked at machine learning models. We are very keen to know
how these "smart" approaches compare to more traditional, physics-based models.

[pyPhenology](https://github.com/sdtaylor/pyPhenology) is a nice Python package
with a collection of physics-based models. We would like to compare those
models, ideally within the same PyCaret framework. However, pyPhenology is not
consistent with the scikit-learn API. On the other hand, it is quite possible to
cast its equations into a form that does adhere to these standards.

In this notebook, we will walk you through the steps to create a custom
estimator, following the [scikit-learn
documentation](https://scikit-learn.org/stable/developers/develop.html). We will
show how this is done for [pyPhenology's ThermalTime
model](https://pyphenology.readthedocs.io/en/master/generated/pyPhenology.models.ThermalTime.html#pyPhenology.models.ThermalTime).
At the end of the chapter, you should be able to repeat the trick for the other
pyPhenology models as well.

## The first blow is half the battle

As a starting point, we copied the [scikit-learn project template](https://github.com/scikit-learn-contrib/project-template/blob/a06bc1a701fbb320848e4d5295e4477b596078df/skltemplate/_template.py) and updated it with some of the information from the [pyPhenology ThermalTime](https://github.com/sdtaylor/pyPhenology/blob/d82af2f669364e84be4bf9325a4f4e064d8d3816/pyPhenology/models/thermaltime.py) class. Specifically, we:

- Added
  [RegressorMixin](https://scikit-learn.org/stable/modules/generated/sklearn.base.RegressorMixin.html#sklearn.base.RegressorMixin)
  from scikit-learn. This contains some methods specific to regression estimators.
- Replaced `check_X_y` and `check_array` with the newer `_validate_data()` (see
  [SLEP010](https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest/slep010/proposal.html)).
- Merged the docstrings of ThermalTime and the sklearn template.
- Changed to Google-style docstrings and moved the type information from the docstrings into type hints on the method signatures.
- Set the default values of the parameters to sensible ints instead of valid ranges.


In [50]:
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.utils.validation import check_is_fitted
from numpy.typing import ArrayLike


class ThermalTime(RegressorMixin, BaseEstimator):
    """Thermal Time Model

    The classic growing degree day model using a fixed temperature threshold
    above which forcing accumulates.
    """

    def __init__(self):
        pass

    # TODO: consider adding DOY index series to fit/predict as optional argument
    def fit(self, X: ArrayLike, y: ArrayLike):
        """Fit the model to the available observations.

        Parameters:
            X: 2D Array of shape (n_samples, n_features).
                Daily mean temperatures for each unique site/year (n_samples) and
                for each DOY (n_features). The first feature should correspond to
                the first DOY, and so forth up to (max) 366.
            y: 1D Array of length n_samples
                Observed DOY of the spring onset for each unique site/year.

        Returns:
            Fitted model
        """
        X, y = self._validate_data(X, y)
        # TODO: check additional assumptions about input

        # TODO: convert to proper fit; for now set some default values
        self.t1_: int = 0
        self.T_: int = 5
        self.F_: int = 500

        # `fit` should always return `self`
        return self

    def predict(self, X: ArrayLike):
        """Predict values of y given new predictors

        Parameters:
            X: array-like, shape (n_samples, n_features).
               Daily mean temperatures for each unique site/year (n_samples) and
               for each DOY (n_features). The first feature should correspond to
               the first DOY, and so forth up to (max) 366.

        Returns:
            y: array-like, shape (n_samples,)
               Predicted DOY of the spring onset for each sample in X.
        """
        check_is_fitted(self, ["t1_", "T_", "F_"])
        # reset=False keeps the number of features recorded during fit
        X = self._validate_data(X, reset=False)

        # TODO: Implement real predictions
        return np.ones(X.shape[0], dtype=np.int64)

Phew! That's a big start! Notice that we're not passing in anything during initialization (yet). By scikit-learn convention, `__init__` only stores hyperparameters; anything estimated from the data is set during `fit`. Such fitted attributes can be recognized by their trailing underscore.
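To make the trailing-underscore convention concrete, here is a minimal sketch with a deliberately trivial estimator. `ConstantOnset` and its `shift` hyperparameter are hypothetical names used for illustration only. The sketch uses `check_X_y` and `check_array` instead of `_validate_data()` purely to stay independent of the installed scikit-learn version.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_array, check_is_fitted, check_X_y


class ConstantOnset(RegressorMixin, BaseEstimator):
    """Hypothetical toy regressor that always predicts the mean observed onset."""

    def __init__(self, shift=0):
        # Hyperparameters are stored as-is, without a trailing underscore
        self.shift = shift

    def fit(self, X, y):
        X, y = check_X_y(X, y)
        # Attributes learned from the data get a trailing underscore
        self.mean_onset_ = float(np.mean(y))
        return self

    def predict(self, X):
        # Raises NotFittedError when `mean_onset_` has not been set yet
        check_is_fitted(self, ["mean_onset_"])
        X = check_array(X)
        return np.full(X.shape[0], self.mean_onset_ + self.shift)


model = ConstantOnset()
try:
    model.predict(np.zeros((2, 3)))
except NotFittedError:
    print("predict before fit raises NotFittedError")

model.fit(np.zeros((4, 3)), np.array([40, 50, 60, 50]))
print(model.get_params())  # only the hyperparameters
print(model.mean_onset_)  # the fitted attribute
```

The same pattern applies to `ThermalTime`: `t1_`, `T_` and `F_` only exist after `fit` has been called.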

This class can already be used, although it doesn't
actually fit or predict anything (useful) yet. Next, we need to:

- Provide an implementation for the predict method
- Provide an implementation for the fit method
- Consider doing more validation of the input, since now we simply assume that
  the data is in ordered columns from DOY 1 up to (max) 366.
- Consider allowing an additional argument to fit and predict that contains the
  column indices in case they're not neatly formatted from 1 to (max) 366. This
  is allowed by scikit-learn as long as it's an optional argument.
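
The last two points could look roughly like the sketch below. The helper name `sort_columns_by_doy` and its optional `doy` argument are assumptions made for illustration; they are not part of pyPhenology or scikit-learn.

```python
import numpy as np


def sort_columns_by_doy(X, doy=None):
    """Return X with its columns ordered from DOY 1 up to n_features.

    Hypothetical helper: `doy` holds the day of year of each column of X.
    If `doy` is None, the columns are assumed to be in order already.
    """
    X = np.atleast_2d(np.asarray(X, dtype=float))
    if doy is None:
        return X

    doy = np.asarray(doy)
    if doy.shape != (X.shape[1],):
        raise ValueError("doy must have one entry per column of X")

    order = np.argsort(doy)
    expected = np.arange(1, X.shape[1] + 1)
    if not np.array_equal(doy[order], expected):
        raise ValueError("doy must contain each day from 1 to n_features once")

    return X[:, order]


# Columns arrive in the order DOY 3, DOY 1, DOY 2
X_shuffled = np.array([[30.0, 10.0, 20.0]])
print(sort_columns_by_doy(X_shuffled, doy=[3, 1, 2]))
```

A check like this could be called at the start of both `fit` and `predict`, so that the rest of the model can rely on the column order.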

However, before we proceed, let's see whether we already adhere to the
scikit-learn API.

## Checking sklearn compliance

Scikit-learn provides a nice compliance checker. With a bit of extra code we can
print out which tests it fails (see below).


In [51]:
from sklearn.utils.estimator_checks import check_estimator

# This bit of code allows us to run the checks in a notebook
checks = check_estimator(ThermalTime(), generate_only=True)
passed_checks = 0
failed_checks = 0
for estimator, check in checks:
    name = check.func.__name__
    try:
        check(estimator)
        passed_checks += 1
    except Exception as exc:
        print(f"Check {name} failed with exception: {exc}")
        failed_checks += 1
print(f"Passed checks: {passed_checks}, failed checks: {failed_checks}")

Check check_regressors_train failed with exception: 
Check check_regressors_train failed with exception: 
Check check_regressors_train failed with exception: 
Passed checks: 37, failed checks: 3


So far, so good. Most of the checks passed, and if we dive deep into what's
being checked, we can figure out that the others failed because the predictions
were not that good. That makes sense...
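
To see why constant predictions score poorly, recall that the `score` method inherited from `RegressorMixin` returns the coefficient of determination, R². Presumably the failing check compares that score against a minimum value. The sketch below computes R² by hand for made-up onset dates; the numbers are illustrative only.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot


y_true = np.array([40.0, 50.0, 60.0, 70.0])  # made-up onset dates

# Our placeholder predict method returns 1 for every sample
print(r_squared(y_true, np.ones_like(y_true)))

# Even the best possible constant (the mean) only reaches R² = 0
print(r_squared(y_true, np.full_like(y_true, y_true.mean())))
```

Any constant prediction therefore ends up at or below zero, far from what a regression check would accept.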

## Using pytest and source files

While it is possible to do all this in a notebook, a neater and more convenient
way is to use pytest. To this end:

- Store the class definition above in a new file called `thermaltime.py`
- Create a new file called `test_thermaltime.py` and add the following content

  ```py
  from thermaltime import ThermalTime

  from sklearn.utils.estimator_checks import parametrize_with_checks

  @parametrize_with_checks([ThermalTime(),])
  def test_sklearn_compatible_estimator(estimator, check):
      check(estimator)
  ```

- Install pytest: `pip install pytest`
- Run pytest: `pytest test_thermaltime.py`

## Implementing predict

We'll start by implementing the predict method. This is relatively
straightforward. We'll write a simple function that takes both X and the
parameters, and returns the expected DOY. For ease of reference, we copied the
docstrings from above. This implementation should be exactly the same as in
pyPhenology.


In [90]:
# Note: Copy this function to your file thermaltime.py


def thermaltime(X, t1: int = 0, T: int = 5, F: int = 500):
    """Make prediction with the thermaltime model.

    X: array-like, shape (n_samples, n_features).
       Daily mean temperatures for each unique site/year (n_samples) and for
       each DOY (n_features). The first feature should correspond to
       the first DOY, and so forth up to (max) 366.
    t1: The DOY at which forcing accumulation begins (should be within [-67,298])
    T: The threshold above which forcing accumulates (should be within [-25,25])
    F: The total forcing units required (should be within [0,1000])
    """
    # This allows us to pass both 1D and 2D arrays of temperature
    # Copying X to safely modify it later on (may not be necessary, but readable)
    X_2d = np.atleast_2d(np.copy(X))

    # Exclude days before the start of the growing season
    X_2d = X_2d[:, int(t1) :]

    # Exclude days with temperature below threshold
    X_2d[X_2d < T] = 0

    # Accumulate remaining data
    S = np.cumsum(X_2d, axis=-1)

    # Find first entry that exceeds the total forcing units required.
    doy = np.argmax(S > F, axis=-1)

    # Add t1 back to the result
    return doy + t1

Let's see how we can use this model:


In [91]:
# 10 degrees every day:
X_test = np.ones(365) * 10

# Predicted spring onset:
thermaltime(X_test)

array([50])

In [92]:
# Also check for 2D X inputs:
X_test = np.ones((10, 365)) * 10
thermaltime(X_test)

array([50, 50, 50, 50, 50, 50, 50, 50, 50, 50])

Good, it seems this works nicely, both for individual predictions and for 2D arrays of inputs.
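
One edge case is worth keeping in mind: `np.argmax` returns 0 when no entry is `True`, so a year that never accumulates enough forcing would be reported as an onset on day `t1`. The sketch below shows one possible guard. The function name `thermaltime_with_guard` and the choice of `NaN` as a "not reached" marker are assumptions, not something taken from pyPhenology.

```python
import numpy as np


def thermaltime_with_guard(X, t1=0, T=5, F=500):
    """Same logic as thermaltime, but returns NaN when F is never exceeded."""
    X_2d = np.atleast_2d(np.array(X, dtype=float))
    X_2d = X_2d[:, int(t1):]
    X_2d[X_2d < T] = 0
    S = np.cumsum(X_2d, axis=-1)

    exceeded = S > F
    doy = np.argmax(exceeded, axis=-1) + t1

    # Only trust argmax for rows where the threshold is actually exceeded
    return np.where(exceeded.any(axis=-1), doy, np.nan)


cold_year = np.ones(365) * 3  # always below the threshold T=5
warm_year = np.ones(365) * 10

print(thermaltime_with_guard(cold_year))  # threshold never reached
print(thermaltime_with_guard(warm_year))  # same result as before
```

Whether such years should yield `NaN`, a sentinel value, or an error is a design choice that also affects how the `fit` method scores candidate parameters.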

## Adding tests

These quick checks are super useful! We can quickly add a few more and add them to our test file (`test_thermaltime.py`). Note: also copy the `thermaltime` function from above to your file `thermaltime.py`.


In [93]:
# Note: Copy these tests to your file test_thermaltime.py, uncomment the imports, and remove the bottom part.

# Note: these imports must be uncommented in your test file
# import numpy as np
# from thermaltime import ThermalTime, thermaltime


def test_1d_base_case():
    # 10 degrees every day:
    X_test = np.ones(365) * 10
    assert thermaltime(X_test) == 50


def test_late_growing_season():
    # If the growing season starts later, the spring onset is later as well.
    X_test = np.ones(365) * 10
    assert thermaltime(X_test, t1=10) == 60


def test_higher_threshold():
    # If the total accumulated forcing required is higher, spring onset is later.
    X_test = np.ones(365) * 10
    assert thermaltime(X_test, F=600) == 60


def test_exclude_cold_days():
    # If some days are below the minimum growing T, spring onset is later.
    X_test = np.ones(365) * 10
    X_test[[1, 4, 8, 12, 17, 24, 29, 33, 38, 42]] = 3
    assert thermaltime(X_test) == 60


def test_lower_temperature_threshold():
    # If the minimum growing T is lower, fewer days are excluded. However, the
    # accumulated temperature rises more slowly.
    X_test = np.ones(365) * 10

    X_test[[1, 4, 8, 12, 17, 24, 29, 33, 38, 42]] = 5
    assert thermaltime(X_test, T=2) == 55


def test_2d():
    # Should be able to predict for multiple samples at once
    X_test = np.ones((10, 365)) * 10
    expected = np.ones(10) * 50
    result = thermaltime(X_test)
    assert np.all(result == expected)


# Note: The following lines are not needed in your test file. Pytest will
# automatically call all functions starting with "test_".
test_1d_base_case()
test_late_growing_season()
test_higher_threshold()
test_exclude_cold_days()
test_lower_temperature_threshold()
test_2d()

After you've copied the code to your files, you can run pytest again to check that all new tests pass.

Now that we're confident our new predict function works, the last thing we need to do is update the predict method on the class. Change it to look like this:

```py
    def predict(self, X: ArrayLike):
        """Predict values of y given new predictors

        Parameters:
            X: array-like, shape (n_samples, n_features).
               Daily mean temperatures for each unique site/year (n_samples) and
               for each DOY (n_features). The first feature should correspond to
               the first DOY, and so forth up to (max) 366.

        Returns:
            y: array-like, shape (n_samples,)
               Predicted DOY of the spring onset for each sample in X.
        """
        check_is_fitted(self, ["t1_", "T_", "F_"])
        # reset=False keeps the number of features recorded during fit
        X = self._validate_data(X, reset=False)

        return thermaltime(X, self.t1_, self.T_, self.F_)
```


## Implementing the `fit` method

Now that we can make predictions, we can also think about optimizing the parameters of the model. Our aim is to minimize the difference between the predictions based on the training data `X` and the target data `y`. Let's first prepare some training data to test the method once we have it.


In [94]:
# Prepare some training data. Take a base temperature of 10 degrees and add a
# random fluctuation on top of it. The corresponding spring onset should
# correlate with the random temperature fluctuations.

temp_signal = np.random.randn(10, 365)
X_train = np.ones((10, 365)) * 10 + temp_signal * 10
y_train = np.ones(10) * 50 + temp_signal.mean(axis=1) * 100

# Check that the values are within somewhat realistic ranges
print(X_train.min(), X_train.max())
print(y_train.min(), y_train.max())

-24.74260126249088 43.93986197940313
44.559769673387 57.8558220233387


First we try scipy's `curve_fit`. It appears to have exactly the signature we want: it calls the model as `f(xdata, *params)`.


In [100]:
from scipy.optimize import curve_fit

initial_guess = [0, 5, 500]
lower_bounds = [-67, -25, 0]
upper_bounds = [298, 25, 1000]

curve_fit(
    thermaltime, X_train, y_train, p0=initial_guess, bounds=(lower_bounds, upper_bounds)
)

(array([  3.54942573,   5.        , 500.        ]),
 array([[10.81891166,  0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        ]]))

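Unfortunately, this gives terrible fits: only `t1` moved away from its initial
guess, while `T` and `F` stayed exactly at 5 and 500. In addition, `curve_fit`
doesn't like integer parameters (see https://stackoverflow.com/a/22861933).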
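
A likely explanation is that `curve_fit` estimates derivatives numerically when no Jacobian is supplied, and our model output is piecewise constant in `T` and `F`. The sketch below mimics this with a simple forward difference; `finite_difference_jacobian` and the step size `eps` are illustrative assumptions, not what scipy does internally.

```python
import numpy as np


def thermaltime(X, t1=0, T=5, F=500):
    """Same model as above, repeated to keep this example self-contained."""
    X_2d = np.atleast_2d(np.copy(X))
    X_2d = X_2d[:, int(t1):]
    X_2d[X_2d < T] = 0
    S = np.cumsum(X_2d, axis=-1)
    return np.argmax(S > F, axis=-1) + t1


def finite_difference_jacobian(X, params, eps=1e-6):
    """Forward-difference estimate of d(prediction)/d(parameter)."""
    params = np.asarray(params, dtype=float)
    base = thermaltime(X, *params)
    columns = []
    for i in range(len(params)):
        shifted = params.copy()
        shifted[i] += eps
        columns.append((thermaltime(X, *shifted) - base) / eps)
    return np.column_stack(columns)


# Three identical samples with 10 degrees every day
X_const = np.ones((3, 365)) * 10
jacobian = finite_difference_jacobian(X_const, [0, 5, 500])

# Columns correspond to t1, T and F
print(jacobian.round(3))
```

Only the `t1` column is non-zero, because `t1` is added back to the result as a float. A tiny change in `T` or `F` does not move the predicted day at all, so a gradient-based optimizer has nothing to follow for those two parameters.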

This is probably what the developers of pyPhenology found as well. They use a slightly different approach, based on scipy's global optimizers. Let's try that, then.
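
As a first sketch, we can minimize the root-mean-square error between predicted and observed onset with `scipy.optimize.differential_evolution`, one of scipy's global optimizers. This is not necessarily the exact setup pyPhenology uses. The loss function `rmse_loss`, the synthetic data, the rounding of `t1` and the restriction of `t1` to non-negative values (negative values would slice from the end of the array) are all assumptions made for this example.

```python
import numpy as np
from scipy.optimize import differential_evolution


def thermaltime(X, t1=0, T=5, F=500):
    """Same model as above, repeated to keep this example self-contained."""
    X_2d = np.atleast_2d(np.copy(X))
    X_2d = X_2d[:, int(t1):]
    X_2d[X_2d < T] = 0
    S = np.cumsum(X_2d, axis=-1)
    return np.argmax(S > F, axis=-1) + t1


def rmse_loss(params, X, y):
    """Root-mean-square error between predicted and observed onset."""
    t1, T, F = params
    # Round t1 so that the start day is always a whole day
    predictions = thermaltime(X, int(round(t1)), T, F)
    return float(np.sqrt(np.mean((predictions - y) ** 2)))


# Synthetic data generated by the model itself with known parameters
rng = np.random.default_rng(0)
X_demo = 10 + 5 * rng.standard_normal((10, 365))
y_demo = thermaltime(X_demo, t1=10, T=3, F=400)

# Bounds for t1, T and F; t1 is kept non-negative (assumption, see above)
bounds = [(0, 298), (-25, 25), (0, 1000)]

result = differential_evolution(
    rmse_loss,
    bounds,
    args=(X_demo, y_demo),
    seed=0,
    maxiter=50,
    polish=False,  # the loss is piecewise constant, so skip gradient polishing
)
print(result.x, result.fun)
```

Because different parameter combinations can produce the same predictions, the optimizer may settle on values other than the ones used to generate the data while still reaching a low error.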


Also check out the [scikit-learn project template](https://github.com/scikit-learn-contrib/project-template/).
