In [1]:
from rich import print


## Datasets

In the previous section we saw that one of the main goals of springtime is
to harmonize datasets from different sources. Here, we walk through an example
with data from PEP725 and E-OBS to show how this is done.


### PEP725

**Prerequisites: phenor**

To retrieve data from PEP725, we use an existing library called [phenor](https://bluegreen-labs.github.io/phenor/). Phenor is written in R, so you need an R installation with phenor available. If you already have R on your system, you can install phenor like so:

```R
devtools::install_github("bluegreen-labs/phenor@v1.3.1")
```

If you don't have R, springtime provides a conda environment file that contains
most of the necessary dependencies, so you can create a conda environment like
so:

```sh
# Obtain the environment file
curl -o environment.yml https://raw.githubusercontent.com/phenology/springtime/main/environment.yml

# Create and activate the new environment
mamba env create --file environment.yml
conda activate springtime

# Install phenor in R
Rscript -e 'devtools::install_github("bluegreen-labs/phenor", upgrade="never")'
```

**PEP725 credentials**

To authenticate with the PEP725 data servers, you need an account, and you need
to store your credentials in a file called
`~/.config/springtime/pep725_credentials.txt`, with your email address on the
first line and your password on the second line. This path can be changed in the
springtime configuration, but the default works without further setup.
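
The credentials file can also be created from Python. The following is a minimal sketch, assuming the two-line layout described above; the helper name `write_pep725_credentials` is hypothetical and not part of springtime.

```python
from pathlib import Path


def write_pep725_credentials(path, email, password):
    """Write the email address on line 1 and the password on line 2."""
    path = Path(path).expanduser()
    # Create the parent directory (e.g. ~/.config/springtime) if it is missing.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(f"{email}\n{password}\n")
    # Restrict access to the current user, since the file holds a password.
    path.chmod(0o600)
    return path
```

To use it, call `write_pep725_credentials("~/.config/springtime/pep725_credentials.txt", email, password)` with your own PEP725 account details.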


#### Springtime's dataset interface

Springtime provides a semi-standardized interface for working with datasets. In
this case, we will be using `PEP725Phenor` as the dataset class. You can see the full documentation of this class [here](https://springtime.readthedocs.io/en/latest/reference/springtime/datasets/insitu/pep725/).

Let's create a dataset with all PEP725 observations of the species "Syringa vulgaris":


In [2]:
from springtime.datasets.insitu.pep725 import PEP725Phenor

dataset = PEP725Phenor(species="Syringa vulgaris", phenophase=11)
print(dataset)


Notice that the credential_file has been configured automatically, and that there are some other fields that we can set. Before we dive into details about what those options mean, we will need to retrieve the data. We can do this with the `download` method.


In [3]:
dataset.download()


File already exists: /home/peter/.cache/springtime/PEP725/Syringa vulgaris.csv


If everything went well, the data should have been downloaded to some location
like `/home/username/.cache/springtime`. Springtime will skip the download if
the data is already present.

You can inspect the file on disk, but for transparency springtime provides a
`raw_load` method that loads the data more or less without modification.


In [4]:
dataset.raw_load()


Unnamed: 0,pep_id,bbch,year,day,country,species,national_id,lon,lat,alt,name
0,6446,60,1991,130,AT,Syringa vulgaris,5120,14.4167,48.2167,225,ASTEN
1,6446,60,1984,137,AT,Syringa vulgaris,5120,14.4167,48.2167,225,ASTEN
2,6446,60,1969,124,AT,Syringa vulgaris,5120,14.4167,48.2167,225,ASTEN
3,6446,60,1989,107,AT,Syringa vulgaris,5120,14.4167,48.2167,225,ASTEN
4,6446,60,1990,112,AT,Syringa vulgaris,5120,14.4167,48.2167,225,ASTEN
...,...,...,...,...,...,...,...,...,...,...,...
173752,19283,60,2004,125,UK,Syringa vulgaris,964298,-3.7330,58.5670,-1,964298
173753,19283,60,2002,130,UK,Syringa vulgaris,964298,-3.7330,58.5670,-1,964298
173754,19283,60,2003,113,UK,Syringa vulgaris,964298,-3.7330,58.5670,-1,964298
173755,19285,60,2002,125,UK,Syringa vulgaris,968311,-3.5170,58.6000,-1,968311


As you can see, there are various columns in the data, only a few of which are relevant for us. The "day" column contains the day of year of the event. The event, in this case, is given in the 'bbch' column, which contains phenophases according to the BBCH scale. For example, phenophase 60 means "beginning of flowering". To see all possible options, have a look at http://www.pep725.eu/pep725_phase.php.
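
To make the "day" column more tangible, here is a small sketch that converts a year and a day of year into a calendar date. It assumes the day of year is 1-based (day 1 is 1 January); the function name is illustrative only.

```python
from datetime import date, timedelta


def day_of_year_to_date(year: int, day: int) -> date:
    """Convert a 1-based day of year into a calendar date."""
    return date(year, 1, 1) + timedelta(days=day - 1)


# First row of the raw table above: year 1991, day 130.
print(day_of_year_to_date(1991, 130))  # 1991-05-10
```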

Note that this data is already interesting, but it doesn't completely conform to our standard yet. The `load` method, as opposed to `raw_load`, does some additional work to parse the data into a format that we can easily combine with other datasets.


In [5]:
dataset.load()


year,geometry,day
1988,POINT (15.86660 44.80000),85
1981,POINT (15.86660 44.80000),83
1989,POINT (15.86660 44.80000),80
1985,POINT (15.86660 44.80000),94
2014,POINT (15.86660 44.80000),77
...,...,...
2000,POINT (18.25000 49.11670),105
2010,POINT (18.25000 49.11670),106
2004,POINT (18.25000 49.11670),110
2001,POINT (18.25000 49.11670),116


Notice that the year and geometry have been converted to index columns. We only
retained the "day" column, as this will be the variable that we are trying to
predict. The latitude and longitude have been combined into a "geometry" column
in geopandas format.

We can influence the behaviour of the `load` method, for example to select an area and years of interest. To this end, we need to modify the dataset.


In [16]:
area = {
    "name": "Germany",
    "bbox": [
        5.98865807458,
        47.3024876979,
        15.0169958839,
        54.983104153,
    ],
}
dataset = PEP725Phenor(species="Syringa vulgaris", years=[2000, 2002], area=area)
print(dataset)
df_pep725 = dataset.load()
print(df_pep725)
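
Conceptually, the `area` and `years` options restrict the observations to a bounding box and a range of years. The sketch below illustrates that idea in plain Python. It assumes that `bbox` is ordered as [west, south, east, north] and that `years` is an inclusive [start, end] range; the `Observation` type, the `select` function and the sample rows are hypothetical, and this is not how springtime implements the selection.

```python
from typing import NamedTuple


class Observation(NamedTuple):
    year: int
    lon: float
    lat: float
    day: int


def select(observations, bbox, years):
    """Keep observations inside bbox and within the inclusive year range."""
    west, south, east, north = bbox
    first, last = years
    return [
        obs
        for obs in observations
        if west <= obs.lon <= east
        and south <= obs.lat <= north
        and first <= obs.year <= last
    ]


germany = [5.98865807458, 47.3024876979, 15.0169958839, 54.983104153]
# Illustrative rows only.
observations = [
    Observation(2001, 10.15, 54.3333, 120),  # inside the box, year in range
    Observation(1991, 10.15, 54.3333, 130),  # inside the box, year too early
    Observation(2001, -3.7330, 58.5670, 125),  # outside the box
]
print(select(observations, germany, (2000, 2002)))
```

Only the first observation passes both the spatial and the temporal filter.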


#### Dataset as recipe

You may wonder why we pass these additional arguments to the dataset itself; why not pass them directly to the load function? Part of the reason is standardization: most datasets need to know about the area and time already for downloading anything. By making it part of the dataset definition, datasets from several sources become more alike.

Another advantage of this model is that it allows us to export springtime datasets as "recipes".


In [7]:
dataset.as_recipe()


dataset: PEP725Phenor
years:
- 2000
- 2002
species: Syringa vulgaris
area:
  name: Germany
  bbox:
  - 5.98865807458
  - 47.3024876979
  - 15.0169958839
  - 54.983104153



These recipes are a `yaml` representation of the dataset definition, and they allow us to store our workflows in a standardized way.
Springtime can read and execute these recipes, either in an interactive (Jupyter) session or from the command line. We will come back to this later, but the idea is that recipes help make data loading more reproducible and easier to automate.
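
The reason a recipe can be executed is that it stores everything needed to rebuild the dataset object: the class name plus its options. The sketch below illustrates that round trip with a plain dictionary standing in for the parsed YAML. `ToyDataset`, `REGISTRY` and `from_recipe` are hypothetical names made up for this illustration; springtime's own implementation may look quite different.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class ToyDataset:
    """Hypothetical stand-in for a springtime dataset class."""

    species: str
    years: tuple
    area: dict = field(default_factory=dict)

    def as_recipe(self) -> dict:
        # Record the class name so a reader knows which class to rebuild.
        return {"dataset": type(self).__name__, **asdict(self)}


# Hypothetical registry mapping recipe names to classes.
REGISTRY = {"ToyDataset": ToyDataset}


def from_recipe(recipe: dict):
    """Rebuild a dataset from the mapping produced by as_recipe."""
    options = dict(recipe)  # copy, so the caller's recipe is left intact
    cls = REGISTRY[options.pop("dataset")]
    return cls(**options)


original = ToyDataset("Syringa vulgaris", (2000, 2002), {"name": "Germany"})
recipe = original.as_recipe()
rebuilt = from_recipe(recipe)
print(rebuilt == original)  # True
```

Because the options live on the dataset rather than on the `load` call, the recipe alone is enough to reproduce the same selection.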


### E-OBS

Now that we have observations (our target variables for the modelling part), we
need some predictor variables as well. Here, we will use E-OBS.

E-OBS is a gridded dataset, but we want to retrieve only those points for which
we have observations, so let's extract those.


In [17]:
# Note: this is an adaptation of pointsfromother.
# TODO Refactor such that we can import it easily
# TODO Add reset_index to pointsfromother or don't set_index after all.
points = [(p.x, p.y) for p in df_pep725.reset_index().geometry.unique()]
points


[(13.2333, 47.7833),
 (14.8833, 48.6833),
 (13.1, 47.9667),
 (12.15, 47.5667),
 (14.6333, 48.3),
 (14.1333, 47.5333),
 (14.0833, 47.4833),
 (13.3667, 48.3333),
 (14.2833, 48.3),
 (13.9833, 48.5667),
 (14.3333, 48.45),
 (13.95, 48.7),
 (13.65, 47.3667),
 (13.05, 47.7833),
 (13.6333, 47.4167),
 (14.45, 47.5667),
 (14.8833, 48.35),
 (14.9167, 48.2833),
 (14.9167, 47.8),
 (13.2167, 47.4),
 (14.1333, 48.5833),
 (13.7667, 48.0833),
 (14.55, 48.3),
 (14.5333, 48.35),
 (12.6833, 47.5667),
 (14.45, 48.0667),
 (13.3167, 47.7),
 (10.15, 54.3333),
 (10.05, 54.3167),
 (9.7, 54.4833),
 (9.81667, 54.4333),
 (8.8, 54.4),
 (8.95, 54.3167),
 (10.6833, 53.95),
 (10.6667, 53.9333),
 (9.56667, 54.7333),
 (9.9, 54.75),
 (9.95, 54.7167),
 (9.51667, 54.75),
 (10.6333, 53.4833),
 (10.7, 53.6167),
 (10.7833, 53.7),
 (10.6167, 53.75),
 (10.4833, 53.5),
 (8.96667, 54.6167),
 (9.13333, 54.6667),
 (8.65, 54.5167),
 (9.18333, 54.25),
 (8.91667, 54.2167),
 (11.2, 54.4333),
 (11.1, 54.5),
 (11.0833, 54.2333),
 (10.833

To download E-OBS we don't use an existing library. Instead, we download the data files directly from the [source](https://surfobs.climate.copernicus.eu/dataaccess/access_eobs.php).


In [21]:
from springtime.datasets.meteo.eobs import EOBS

ds_eobs = EOBS(years=["2000", "2002"])
print(ds_eobs)
ds_eobs.download()


The data comes in netCDF format, so we need to do some more tweaking to reformat and extract the relevant data.
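
One common way to extract point values from a regular grid is to pick the grid cell whose centre is nearest to each point. The sketch below shows the index arithmetic for that approach. The grid origin used in the example is hypothetical, and springtime's actual extraction may differ.

```python
def nearest_cell(lon, lat, lon0, lat0, resolution):
    """Return the (column, row) index of the grid cell nearest to a point.

    lon0 and lat0 are the coordinates of the centre of cell (0, 0).
    """
    col = round((lon - lon0) / resolution)
    row = round((lat - lat0) / resolution)
    return col, row


def cell_centre(col, row, lon0, lat0, resolution):
    """Return the (lon, lat) coordinates of a cell centre."""
    return lon0 + col * resolution, lat0 + row * resolution


# Hypothetical grid: first cell centred on (5.0, 47.0), spacing 0.25 degrees.
col, row = nearest_cell(5.740135, 47.751076, lon0=5.0, lat0=47.0, resolution=0.25)
print(col, row)  # 3 3
print(cell_centre(col, row, 5.0, 47.0, 0.25))  # (5.75, 47.75)
```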


In [None]:
from springtime import dummy

# The PEP725 observations loaded above serve as the observations here.
df_obs = df_pep725
dummy.generate_predictors(df_obs)


Unnamed: 0,year,geometry,1,2,3,4,5,6,7,8,...,356,357,358,359,360,361,362,363,364,365
68208,2000,POINT (6.65000 49.75000),1.116463,-0.688353,-0.503553,1.774375,0.049324,1.090155,0.140801,0.472000,...,0.689849,2.348531,0.125374,1.365846,0.739939,-0.591394,-0.221024,-0.203230,-1.039899,-1.220383
68318,2002,POINT (6.65000 49.73330),-0.239879,-1.033331,0.675147,-0.519626,-0.403926,-1.167334,0.534352,-0.761932,...,0.122577,1.063429,0.158820,1.022882,1.018519,-0.252565,0.255319,-0.468462,-0.278847,0.950993
68326,2001,POINT (6.65000 49.73330),-0.410449,-1.374016,0.758016,0.024483,1.300947,0.545808,0.524207,0.261131,...,-1.508852,1.226687,-0.316282,2.493663,0.934022,-0.676068,1.225582,1.119418,0.177450,-1.246418
68330,2000,POINT (6.65000 49.73330),-0.538925,-0.337236,-1.161303,2.183539,-0.980944,1.878990,-0.931485,-0.611007,...,-0.748790,-0.814699,-0.561182,0.266305,-0.525563,-0.216269,-0.466161,0.203338,1.568758,-0.842692
68491,2000,POINT (6.96667 49.88330),-0.771925,0.835432,-0.220990,1.429999,1.042128,1.262007,-0.751430,-0.710847,...,1.466735,-1.352892,0.050918,-1.056669,0.290464,0.237181,-0.321397,0.201766,-0.260834,0.083581
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
123435,2000,POINT (6.90000 49.46670),0.067772,0.299906,0.012934,-0.768966,-0.604711,0.356127,-0.521823,-0.516673,...,-0.819023,0.561336,-0.725051,0.385341,0.600767,-0.467161,0.295440,1.310074,-1.735785,-0.060573
123447,2002,POINT (6.90000 49.46670),-0.686435,-0.710115,1.456732,-0.634069,0.323016,1.117101,1.274012,2.809542,...,0.021878,0.616872,-0.310768,0.274349,-1.471396,-0.026675,-0.571190,0.849287,1.336973,-0.736312
123830,2002,POINT (7.00000 49.60000),1.326328,-0.597952,0.495801,0.373001,0.546661,0.832158,0.651126,-2.646906,...,-0.315807,0.256395,-0.521293,0.439544,-1.429150,-0.052931,0.022595,-0.000893,1.344362,0.262532
123840,2000,POINT (7.00000 49.60000),0.789579,0.688247,-0.083570,-1.071885,1.528925,0.820053,-0.524620,-1.121226,...,-1.525891,-0.987999,1.252437,-0.153178,-0.941270,-0.017011,1.251560,2.762789,-0.187803,-0.208674


In [None]:
from springtime.datasets.meteo.eobs import EOBSMultiplePoints

datasource = EOBSMultiplePoints(
    points=((5.740135, 47.751076), (4.740135, 46.751076)),
    # points=get_points(df_obs),
    product_type="ensemble_mean",
    grid_resolution="0.1deg",
    years=[2000, 2016],
    variables=["mean_temperature", "maximum_temperature", "minimum_temperature"],
    resample={"frequency": "month", "operator": "median"},
)
datasource.download()
df_meteo = datasource.load()
# df = datasource.load(resample={'frequency': 'month', 'operator': 'median'})


Loading E-OBS for 2 points for YearRange(start=2000, end=2016)
Loading mean_temperature for 1995-2010
Loaded mean_temperature for 1995-2010 in 18.23507595062256 seconds
Loading maximum_temperature for 1995-2010
Loaded maximum_temperature for 1995-2010 in 13.043226480484009 seconds
Loading minimum_temperature for 1995-2010
Loaded minimum_temperature for 1995-2010 in 19.799267053604126 seconds
Loading mean_temperature for 2011-2022
Loaded mean_temperature for 2011-2022 in 19.479981184005737 seconds
Loading maximum_temperature for 2011-2022
Loaded maximum_temperature for 2011-2022 in 13.864587545394897 seconds
Loading minimum_temperature for 2011-2022
Loaded minimum_temperature for 2011-2022 in 10.049194574356079 seconds


In [None]:
df_meteo


Unnamed: 0,geometry,datetime,mean_temperature,maximum_temperature,minimum_temperature
4018,POINT (4.74014 46.75108),2000-01-01,4.23,6.77,2.04
0,POINT (5.74014 47.75108),2000-01-01,2.43,3.85,0.93
4019,POINT (4.74014 46.75108),2000-01-02,5.26,6.78,3.48
1,POINT (5.74014 47.75108),2000-01-02,3.57,4.00,2.79
4020,POINT (4.74014 46.75108),2000-01-03,4.87,8.84,0.40
...,...,...,...,...,...
2189,POINT (5.74014 47.75108),2016-12-29,0.28,2.80,-2.40
4382,POINT (4.74014 46.75108),2016-12-30,-2.67,-2.06,-2.95
2190,POINT (5.74014 47.75108),2016-12-30,-4.44,-2.09,-6.17
4383,POINT (4.74014 46.75108),2016-12-31,-2.83,-0.94,-3.69
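
The `resample={"frequency": "month", "operator": "median"}` option passed to `EOBSMultiplePoints` above asks for daily values to be aggregated per month. As a rough illustration of what a monthly median means, here is a plain-Python sketch; the function name and the sample values are made up and are not taken from E-OBS or from springtime's implementation.

```python
from collections import defaultdict
from datetime import date
from statistics import median


def monthly_median(daily):
    """Aggregate (date, value) pairs into {(year, month): median value}."""
    groups = defaultdict(list)
    for day, value in daily:
        groups[(day.year, day.month)].append(value)
    return {month: median(values) for month, values in groups.items()}


# Illustrative values only; not taken from E-OBS.
daily = [
    (date(2000, 1, 1), 2.0),
    (date(2000, 1, 2), 4.0),
    (date(2000, 1, 3), 9.0),
    (date(2000, 2, 1), 1.0),
    (date(2000, 2, 2), 3.0),
]
print(monthly_median(daily))  # {(2000, 1): 4.0, (2000, 2): 2.0}
```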


In [None]:
springtime.join(df_obs, df_meteo)


NameError: name 'springtime' is not defined

In [None]:
# Create an instance of data
dataset = EOBSMultiplePoints(
    points=((5.740135, 47.751076), (4.740135, 46.751076)),
    years=[2010, 2011],
    grid_resolution="0.25deg",
)
dataset.download()
df_meteo = dataset.load()
df_meteo


Downloading E-OBS variable mean_temperature for 2011-2022 period from https://knmi-ecad-assets-prd.s3.amazonaws.com/ensembles/data/Grid_0.25deg_reg_ensemble/tg_ens_mean_0.25deg_reg_2011-2022_v26.0e.nc to /home/peter/.cache/springtime/e-obs/tg_ens_mean_0.25deg_reg_2011-2022_v26.0e.nc
Loading E-OBS for 2 points for YearRange(start=2010, end=2011)
Loading mean_temperature for 2011-2022
Loaded mean_temperature for 2011-2022 in 0.5430669784545898 seconds


Unnamed: 0,geometry,datetime,mean_temperature
365,POINT (4.74014 46.75108),2011-01-01,-0.43
0,POINT (5.74014 47.75108),2011-01-01,-1.37
366,POINT (4.74014 46.75108),2011-01-02,0.63
1,POINT (5.74014 47.75108),2011-01-02,0.23
367,POINT (4.74014 46.75108),2011-01-03,-1.19
...,...,...,...
362,POINT (5.74014 47.75108),2011-12-29,3.21
728,POINT (4.74014 46.75108),2011-12-30,4.99
363,POINT (5.74014 47.75108),2011-12-30,3.86
729,POINT (4.74014 46.75108),2011-12-31,7.13
