# Adding new datasets

Here we present a walkthrough for adding your own datasets. It is not very
detailed (yet), and some of the steps are fairly advanced. If you need help
with any of them, don't hesitate to reach out via GitHub.

## Base dataset

Springtime provides a semi-structured approach to loading datasets. Each dataset
is represented as a class with `download`, `raw_load` and `load` methods. This
structure is defined in an abstract base class that can be imported from the
package. To create your own dataset class, inherit from it and implement the
three methods, as illustrated below:


```python
from springtime.datasets.abstract import Dataset


class MyNewDataset(Dataset):
    # You need to define a unique name
    dataset = "my-new-dataset"

    # Add any other attributes needed to define your dataset, e.g.
    species: list[str] = []

    def download(self):
        """Download the data."""
        # your implementation here

        # Check whether the data already exists; if so, don't download it
        # again unless CONFIG.force_override is True

        # Return the path(s) to the downloaded/existing data
        return []

    def raw_load(self):
        """Load the data with minimal modification."""
        paths = self.download()
        data = ...
        return data

    def load(self):
        """Load the harmonized dataset.

        This should do everything to adhere to the springtime dataset standard
        format, i.e. a geopandas dataframe with a year and geometry column and
        other relevant features also in columns.
        """
        raw_data = self.raw_load()
        harmonized_data = ...
        return harmonized_data
```
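
The "check if it already exists" step in `download` can be sketched roughly as follows. This is a minimal, standalone illustration that uses only the standard library. The helper `fetch_if_missing`, its `force_override` argument and the fake fetch function are hypothetical names made up for this example; they are not part of springtime, where the flag would come from `CONFIG.force_override`.

```python
from pathlib import Path
from tempfile import TemporaryDirectory


def fetch_if_missing(target: Path, fetch, force_override: bool = False) -> Path:
    """Write the bytes returned by `fetch()` to `target` unless it is cached.

    Hypothetical helper: `fetch` is any zero-argument callable returning bytes.
    """
    if target.exists() and not force_override:
        return target  # reuse the existing file, skip the download
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(fetch())
    return target


calls = []


def fake_fetch() -> bytes:
    """Stand-in for a real download; records how often it is called."""
    calls.append(1)
    return b"year,doy\n2000,120\n"


with TemporaryDirectory() as tmp:
    path = Path(tmp) / "my-new-dataset" / "data.csv"
    fetch_if_missing(path, fake_fetch)  # file is missing: fetches
    fetch_if_missing(path, fake_fetch)  # file exists: uses the cache
    fetch_if_missing(path, fake_fetch, force_override=True)  # fetches again
    n_fetches = len(calls)
```

In a real dataset class, the body of `download` would perform this check for each file and return the resulting path(s).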

## Examples

While developing new datasets, it can be useful to look at the source code of
existing datasets. You can browse it
[here](https://github.com/phenology/springtime/tree/main/src/springtime/datasets).

## Pydantic

Good to know: the base `Dataset` uses [pydantic](https://docs.pydantic.dev/latest/) for
runtime validation and (de)serialization to/from recipes. You may want to read
up on its documentation.
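
To give a feel for what this means in practice, here is a small sketch that uses a plain pydantic `BaseModel` rather than springtime's own `Dataset`. The class `ExampleDataset` and its fields are hypothetical and only serve as illustration; the real base class may add further validation.

```python
from pydantic import BaseModel, ValidationError


class ExampleDataset(BaseModel):
    """Hypothetical stand-in for a dataset class, not part of springtime."""

    dataset: str = "example-dataset"
    species: list[str] = []
    years: list[int] = [2000, 2010]


# Valid input is accepted and stored on the instance
ok = ExampleDataset(species=["Syringa vulgaris"], years=[2001, 2005])

# Invalid input is rejected at construction time
error_raised = False
try:
    ExampleDataset(years=["not-a-year"])
except ValidationError:
    error_raised = True

# Round trip through a plain dictionary, similar in spirit to a recipe entry
as_dict = dict(ok)
restored = ExampleDataset(**as_dict)
```

Because attributes are declared with type annotations, mistakes in a recipe tend to surface as a validation error when the dataset object is created, rather than later during loading.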

## Utils

Several datasets need to do very similar operations, such as resampling. To avoid
duplication, such functions can be generalized and shared between datasets. A
couple of generalized functions are available in `springtime.utils`.
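
As an illustration of the kind of operation that lends itself to sharing, the sketch below averages a time series per calendar year with pandas. The function `yearly_mean` and the column names are assumptions made for this example; it is not the implementation in `springtime.utils`, so check that module for the functions that actually exist.

```python
import pandas as pd


def yearly_mean(df: pd.DataFrame, time_column: str = "datetime") -> pd.DataFrame:
    """Average all remaining columns per calendar year (hypothetical helper)."""
    years = pd.to_datetime(df[time_column]).dt.year.rename("year")
    return df.drop(columns=[time_column]).groupby(years).mean().reset_index()


daily = pd.DataFrame(
    {
        "datetime": ["2000-01-01", "2000-07-01", "2001-01-01"],
        "temperature": [2.0, 18.0, 4.0],
    }
)

# One row per year, with a "year" column as in the harmonized format
yearly = yearly_mean(daily)
```

Keeping such a function independent of any particular dataset makes it straightforward to reuse from the `load` method of several dataset classes.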

## Adding your dataset to springtime

It probably makes sense to start developing your dataset class in a notebook or
a simple Python script. However, it would be much nicer if you could make your
dataset part of the springtime package. To this end, first have a look at the
[contributing guide](../../develop).

After cloning the source code and making an editable installation, you can add
your dataset class to a new file in the datasets folder.

## Registering your dataset

To make sure your dataset is recognized by springtime, you have to add it to the
list of known datasets in
[`src/springtime/datasets/__init__.py`](https://github.com/phenology/springtime/blob/main/src/springtime/datasets/__init__.py).
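
The general idea behind such a list can be sketched as a lookup from the unique `dataset` name to the class. Everything below is hypothetical (the stub classes, `KNOWN_DATASETS` and `dataset_from_recipe`); the actual mechanism in `__init__.py` may look quite different, so treat the file itself as the reference.

```python
class ExistingDataset:
    """Stub standing in for a dataset that is already registered."""

    dataset = "existing-dataset"

    def __init__(self, **options):
        self.options = options


class MyNewDataset(ExistingDataset):
    """Stub for the newly added dataset; only the unique name differs."""

    dataset = "my-new-dataset"


# Hypothetical registry: maps each unique name to its class
KNOWN_DATASETS = {cls.dataset: cls for cls in (ExistingDataset, MyNewDataset)}


def dataset_from_recipe(entry: dict):
    """Build a dataset object from a recipe-like dictionary."""
    options = dict(entry)
    name = options.pop("dataset")
    if name not in KNOWN_DATASETS:
        raise KeyError(f"Unknown dataset: {name!r}")
    return KNOWN_DATASETS[name](**options)


ds = dataset_from_recipe({"dataset": "my-new-dataset", "species": ["lilac"]})
```

This also shows why the `dataset` name has to be unique: it is the key that decides which class is instantiated.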

## Testing

To guard against regressions, we have a couple of [tests for each
dataset](https://github.com/phenology/springtime/tree/main/tests/datasets). When
you add a new dataset, it is probably a good idea to copy the tests of an
existing dataset and adapt them to your needs.
