In [1]:
from rich import print

## Datasets

In the previous section we've seen that one of the main goals of springtime is
to harmonize datasets from different sources. Here, we walk through an example
with data from PEP725 and EOBS to show how this is done.


### PEP725

**Prerequisites: phenor**

To retrieve data from PEP725, we use an existing library called [phenor](https://bluegreen-labs.github.io/phenor/). Phenor is written in R, so you need a working R installation with phenor. If you already have R on your system, you can install phenor like so:

```R
devtools::install_github("bluegreen-labs/phenor@v1.3.1")
```

If you don't have R, springtime provides a conda environment file that contains
most of the necessary dependencies, so you can create a conda environment like
so:

```sh
# Obtain the environment file
curl -o environment.yml https://raw.githubusercontent.com/phenology/springtime/main/environment.yml

# Create and activate the new environment
mamba env create --file environment.yml
conda activate springtime

# Install phenor in R
Rscript -e 'devtools::install_github("bluegreen-labs/phenor", upgrade="never")'
```

**PEP725 credentials**

To authenticate with the PEP725 data servers, you need an account, and you need
to store your credentials in a file called
`~/.config/springtime/pep725_credentials.txt`: email address on the first line,
password on the second line. This path can be changed in the springtime
configuration, but the default should be fine in most cases.
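
A minimal sketch of creating this file from Python, assuming the default path and the two-line layout described above. The helper name `write_credentials` and the placeholder values are hypothetical; they are not part of springtime.

```python
from pathlib import Path


def write_credentials(path: Path, email: str, password: str) -> Path:
    """Write a two-line credentials file: email first, password second."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(f"{email}\n{password}\n")
    # Restrict access to the current user, since the file holds a password.
    path.chmod(0o600)
    return path


# Default location mentioned above; adjust if you changed the springtime config.
default_path = Path.home() / ".config" / "springtime" / "pep725_credentials.txt"
# Uncomment and fill in your own account details (placeholders shown here):
# write_credentials(default_path, "you@example.org", "your-password")
```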


#### Springtime's dataset interface

Springtime provides a semi-standardized interface for working with datasets. In
this case, we will be using `PEP725Phenor` as the dataset class. You can see the full documentation of this class [here](https://springtime.readthedocs.io/en/latest/reference/springtime/datasets/insitu/pep725/).

Let's create a dataset with all PEP725 observations of the species "Syringa vulgaris":


In [2]:
from springtime.datasets.insitu.pep725 import PEP725Phenor

dataset = PEP725Phenor(species="Syringa vulgaris", phenophase=11)
print(dataset)

Notice that the credential_file has been configured automatically, and that there are some other fields that we can set. Before we dive into details about what those options mean, we will need to retrieve the data. We can do this with the `download` method.


In [3]:
print(PEP725Phenor(species="Syringa vulgaris", years=[2000, 2002]).to_recipe())

In [4]:
dataset.download()

File already exists: /home/peter/.cache/springtime/PEP725/Syringa vulgaris.csv


If everything went well, the data should have been downloaded to some location
like `/home/username/.cache/springtime`. Springtime will skip the download if
the data is already present.
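
The skip-if-present behaviour follows a common caching pattern. The sketch below illustrates the idea with the standard library only; `fetch_if_missing` and its `fetch` callback are hypothetical names, and this is not springtime's actual implementation.

```python
from pathlib import Path
from typing import Callable


def fetch_if_missing(target: Path, fetch: Callable[[Path], None]) -> bool:
    """Call `fetch(target)` only when `target` does not exist yet.

    Returns True if a download was triggered, False if the cache was used.
    """
    if target.exists():
        print(f"File already exists: {target}")
        return False
    target.parent.mkdir(parents=True, exist_ok=True)
    fetch(target)
    return True
```

Calling the function twice with the same target runs the callback only the first time, which is why repeated calls are cheap.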

You can inspect the file on disk, but for transparency springtime provides a
`raw_load` method that loads the data more or less without modification.


In [5]:
dataset.raw_load()

,pep_id,bbch,year,day,country,species,national_id,lon,lat,alt,name
0,6446,60,1991,130,AT,Syringa vulgaris,5120,14.4167,48.2167,225,ASTEN
1,6446,60,1984,137,AT,Syringa vulgaris,5120,14.4167,48.2167,225,ASTEN
2,6446,60,1969,124,AT,Syringa vulgaris,5120,14.4167,48.2167,225,ASTEN
3,6446,60,1989,107,AT,Syringa vulgaris,5120,14.4167,48.2167,225,ASTEN
4,6446,60,1990,112,AT,Syringa vulgaris,5120,14.4167,48.2167,225,ASTEN
...,...,...,...,...,...,...,...,...,...,...,...
173752,19283,60,2004,125,UK,Syringa vulgaris,964298,-3.7330,58.5670,-1,964298
173753,19283,60,2002,130,UK,Syringa vulgaris,964298,-3.7330,58.5670,-1,964298
173754,19283,60,2003,113,UK,Syringa vulgaris,964298,-3.7330,58.5670,-1,964298
173755,19285,60,2002,125,UK,Syringa vulgaris,968311,-3.5170,58.6000,-1,968311


As you can see, there are various columns in the data, only a few of which are relevant for us. The "day" column contains the day of year of the event. The event, in this case, is given in the 'bbch' column, which contains phenophases according to the BBCH scale. For example, phenophase 60 means "beginning of flowering". To see all possible options, have a look at http://www.pep725.eu/pep725_phase.php.
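
To make the "day" column more tangible, a day of year can be converted to a calendar date with the standard library. This is only an illustrative helper (`doy_to_date` is a hypothetical name); springtime itself keeps the day of year as a number.

```python
from datetime import date, timedelta


def doy_to_date(year: int, doy: int) -> date:
    """Convert a 1-based day of year to a calendar date."""
    return date(year, 1, 1) + timedelta(days=doy - 1)


# First row of the raw table above: year 1991, day 130
print(doy_to_date(1991, 130))  # 1991-05-10
```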

Note that this data is already interesting, but it doesn't completely conform to our standard yet. The `load` method, as opposed to `raw_load`, does some additional work to parse the data into a format that we can easily combine with other datasets.


In [6]:
dataset.load().reset_index(drop=True)

,year,geometry,day
0,1988,POINT (15.86660 44.80000),85
1,1981,POINT (15.86660 44.80000),83
2,1989,POINT (15.86660 44.80000),80
3,1985,POINT (15.86660 44.80000),94
4,2014,POINT (15.86660 44.80000),77
...,...,...,...
1426,2000,POINT (18.25000 49.11670),105
1427,2010,POINT (18.25000 49.11670),106
1428,2004,POINT (18.25000 49.11670),110
1429,2001,POINT (18.25000 49.11670),116


Notice that the year and geometry have been converted to index columns, and that
we retained only the "day" column, as this will be the variable that we are
trying to predict. The latitude and longitude have been combined into a
"geometry" column in geopandas format.

We can influence the behaviour of the `load` method to select an area and years of interest, for example. To this end, we need to modify the dataset.


In [7]:
germany = {
    "name": "Germany",
    "bbox": [
        5.98865807458,
        47.3024876979,
        15.0169958839,
        54.983104153,
    ],
}
dataset = PEP725Phenor(species="Syringa vulgaris", years=[2000, 2002], area=germany)
print(dataset)
df_pep725 = dataset.load()
df_pep725

,year,geometry,day
0,2001,POINT (13.23330 47.78330),130
1,2000,POINT (13.23330 47.78330),131
2,2002,POINT (13.23330 47.78330),132
3,2002,POINT (14.88330 48.68330),122
4,2000,POINT (14.88330 48.68330),123
...,...,...,...
4718,2002,POINT (11.98330 50.70000),130
4719,2000,POINT (11.98330 50.70000),121
4720,2001,POINT (11.98330 50.70000),133
4721,2002,POINT (11.90000 50.65000),138


#### Dataset as recipe

You may wonder why we pass these additional arguments to the dataset itself rather than directly to the `load` function. Part of the reason is standardization: most datasets already need to know the area and time period before they can download anything. By making these part of the dataset definition, datasets from several sources become more alike.

Another advantage of this model is that it allows us to export springtime datasets as "recipes".


In [8]:
recipe = dataset.to_recipe()
print(recipe)

These recipes are a `yaml` representation of the dataset definition. With their succinct and readable format, they can be stored and shared in a standardized way. We can easily load them again:


In [9]:
from springtime.datasets import load_dataset

reloaded_ds = load_dataset(recipe)
reloaded_ds == dataset

True

Moreover, springtime can read and execute these recipes from the command line as well. We will come back to this later, but the idea is that recipes can help to make data loading more reproducible and easier to automate.
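
To illustrate why keeping all options in the dataset definition makes such a round trip possible, here is a toy sketch. `ToyDataset` is a hypothetical stand-in, not a springtime class, and it uses JSON from the standard library in place of YAML to stay dependency-free.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ToyDataset:
    """Hypothetical stand-in for a dataset definition (not a springtime class)."""

    species: str
    years: tuple[int, int]
    area: dict = field(default_factory=dict)

    def to_recipe(self) -> str:
        # The definition is plain data, so it serializes directly.
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def from_recipe(cls, text: str) -> "ToyDataset":
        raw = json.loads(text)
        raw["years"] = tuple(raw["years"])  # JSON has no tuple type
        return cls(**raw)


original = ToyDataset("Syringa vulgaris", (2000, 2002), {"name": "Germany"})
reloaded = ToyDataset.from_recipe(original.to_recipe())
print(reloaded == original)  # True
```

Because the definition holds everything needed to reproduce the data, comparing the reloaded object with the original is a simple field-by-field equality check.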


### E-OBS

Now that we have observations (our target variables for the modelling part), we
need some predictor variables as well. Here, we will use e-obs.


For downloading e-obs we don't make use of an existing library. Instead, we simply download the data files directly from the [source](https://surfobs.climate.copernicus.eu/dataaccess/access_eobs.php).


In [10]:
from springtime.datasets import EOBS

ds_eobs = EOBS(
    years=["2000", "2002"],  # pyright: ignore (https://t.ly/gukmj)
    variables=[
        "mean_temperature",
        "minimum_temperature",
    ],
    area=germany,
)
print(ds_eobs)
ds_eobs.download()

[PosixPath('/home/peter/.cache/springtime/e-obs/tg_ens_mean_0.1deg_reg_1995-2010_v26.0e.nc'),
 PosixPath('/home/peter/.cache/springtime/e-obs/tn_ens_mean_0.1deg_reg_1995-2010_v26.0e.nc')]

The data comes in netCDF format, so we represent the raw data as an xarray object.


In [11]:
ds = ds_eobs.raw_load()
ds

,Array,Chunk
Bytes,7.14 GiB,127.75 MiB
Shape,"(5844, 465, 705)","(1525, 120, 183)"
Dask graph,64 chunks in 2 graph layers,64 chunks in 2 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray


As you can see, the raw EOBS data span a larger domain and longer time period than we specified. The servers don't offer more fine-grained downloads. Thus, the first thing that the `load` function will do is extract the years and area specified in the dataset definition.
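
The area selection can be pictured as a simple mask on the coordinate axes. The sketch below uses plain NumPy on a hypothetical 1-degree grid (much coarser than E-OBS) and assumes the bounding box order is `[min_lon, min_lat, max_lon, max_lat]`, as the `germany` values suggest. It is not the code that `load` actually runs.

```python
import numpy as np

# Bounding box from the `germany` definition used in this notebook
bbox = [5.98865807458, 47.3024876979, 15.0169958839, 54.983104153]
min_lon, min_lat, max_lon, max_lat = bbox

# Hypothetical 1-degree grid, only to keep the example small
lon = np.arange(0.5, 30.5, 1.0)
lat = np.arange(40.5, 60.5, 1.0)

# Keep only the grid coordinates that fall inside the bounding box
lon_mask = (lon >= min_lon) & (lon <= max_lon)
lat_mask = (lat >= min_lat) & (lat <= max_lat)

print(lon[lon_mask].tolist())
print(lat[lat_mask].tolist())
```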

Normally, all the raw data will be stored in your springtime cache directory. This makes it easy to load other years or areas without re-downloading. However, if you want to save on disk space, you can set `minimize_cache` to true.


Clearly, we need to do some more tweaking to reformat and extract the relevant
data, in order to match our standardized data format.

Firstly, notice that E-OBS has a time dimension with more than one record
per year, whereas our target data has only one row for each
year/location. Thus, we need to reshape and/or aggregate the data.

Secondly, we need to extract only those points that are of interest. In this process, we choose the E-OBS grid cell that is closest to each observation, recognizing that it might not be the exact same point. However, in order to join the datasets later on, we will use the input coordinates in the final dataframe.
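
The nearest-grid-cell idea can be sketched with NumPy as follows. The grid below is a hypothetical regular 0.1 degree grid roughly covering the Germany bounding box, and `nearest_index` is an illustrative helper; springtime may perform the lookup differently.

```python
import numpy as np

# Hypothetical regular 0.1 degree grid (cell centres), not read from E-OBS
grid_lon = 5.95 + 0.1 * np.arange(92)   # 5.95 ... 15.05
grid_lat = 47.25 + 0.1 * np.arange(79)  # 47.25 ... 55.05


def nearest_index(grid: np.ndarray, values: np.ndarray) -> np.ndarray:
    """Index of the nearest grid coordinate for every value."""
    return np.abs(grid[None, :] - values[:, None]).argmin(axis=1)


# Two station coordinates taken from the PEP725 table above
station_lon = np.array([13.2333, 14.8833])
station_lat = np.array([47.7833, 48.6833])

i_lon = nearest_index(grid_lon, station_lon)
i_lat = nearest_index(grid_lat, station_lat)

# Grid coordinates that would be used for the lookup; the final dataframe
# would still carry the original station coordinates.
print(grid_lon[i_lon], grid_lat[i_lat])
```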

**Dealing with time**

We start with the time dimension. While it is not impossible to work with daily data, for this example we first resample it to monthly means. Then, we'll split the time dimension in two: year and day of year.


In [12]:
# TODO: move to easier path?
from springtime.datasets.meteo.eobs import split_time
import numpy as np

ds = ds.resample(time="M").mean()  # [1]
ds = split_time(ds)
ds

# [1] see https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases for a full list

,Array,Chunk
Bytes,460.20 MiB,1.93 MiB
Shape,"(465, 705, 16, 23)","(120, 183, 1, 23)"
Dask graph,256 chunks in 785 graph layers,256 chunks in 785 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray


**Extracting points / alignment with observations**

Next, recall that E-OBS is a gridded dataset, whereas we only need the points for which
we have observations, so let's extract those. Two utility functions are available for this: `extract_points` and `extract_records`. The difference is that `extract_records` also takes the year index into account.
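
The difference between the two approaches can be shown on a toy array with plain NumPy indexing. This is a conceptual sketch with made-up values; it does not call the springtime functions.

```python
import numpy as np

# Toy data cube: 3 years x 4 locations (value = year_index * 10 + location)
years = np.array([2000, 2001, 2002])
data = np.array([[y * 10 + loc for loc in range(4)] for y in range(3)])

# "Points" extraction: pick locations, keep every year
point_idx = np.array([0, 2])
by_points = data[:, point_idx]  # shape (3, 2): all years for each point

# "Records" extraction: pair each observation's year with its location
obs_year = np.array([2000, 2002, 2002])
obs_loc = np.array([0, 2, 3])
year_idx = np.searchsorted(years, obs_year)
by_records = data[year_idx, obs_loc]  # shape (3,): one value per observation

print(by_points.shape, by_records.tolist())  # (3, 2) [0, 22, 23]
```

Extracting by points returns the full years-by-locations product, whereas extracting by records returns exactly one value per observation.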

Let's illustrate this starting with a few points:


In [13]:
import geopandas as gpd
from springtime.datasets.meteo.eobs import extract_points

points_pep725 = gpd.GeoSeries(gpd.points_from_xy(x=[0, 5, 7], y=[5, 10, 12]))
extract_points(ds, points_pep725)

,Array,Chunk
Bytes,4.31 kiB,276 B
Shape,"(3, 16, 23)","(3, 1, 23)"
Dask graph,16 chunks in 787 graph layers,16 chunks in 787 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray


We've received 3 points, as expected. Notice that we've made a little effort to pass our points as a geopandas array. This makes it very easy to pass in the points from our pep725 dataframe:


In [14]:
# Use points from pep725
extract_points(ds, df_pep725.geometry)

,Array,Chunk
Bytes,2.50 MiB,159.92 kiB
Shape,"(1780, 16, 23)","(1780, 1, 23)"
Dask graph,16 chunks in 787 graph layers,16 chunks in 787 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray


That's very convenient! However, we ended up with 1780 unique locations \* 16 years = 28480 records, far more than the 4723 rows in the observation dataframe. That's because the observations are not taken at the same locations each year. Instead, we want to make sure that the PEP725 and E-OBS data are collocated. To this end, we can use the `extract_records` function:


In [15]:
# TODO move to better path
from springtime.datasets.meteo.eobs import extract_records

ds = extract_records(ds, df_pep725)
ds

,Array,Chunk
Bytes,424.33 kiB,424.33 kiB
Shape,"(4723, 23)","(4723, 23)"
Dask graph,1 chunks in 787 graph layers,1 chunks in 787 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray


At this stage, most of the heavy lifting is done, and the size of the total dataset is substantially reduced. Now, we can convert our data to a dataframe.


In [16]:
df_eobs = ds.to_dataframe()
df_eobs

index,doy,year,mean_temperature,minimum_temperature,geometry
0,31,2001,-0.027742,-3.352580,POINT (13.2333 47.7833)
0,59,2001,1.162500,-1.963571,POINT (13.2333 47.7833)
0,60,2001,,,POINT (13.2333 47.7833)
0,90,2001,5.345483,1.442258,POINT (13.2333 47.7833)
0,91,2001,,,POINT (13.2333 47.7833)
...,...,...,...,...,...
4722,305,2000,10.458709,7.613226,POINT (11.9 50.65)
4722,334,2000,,,POINT (11.9 50.65)
4722,335,2000,5.600334,2.783000,POINT (11.9 50.65)
4722,365,2000,,,POINT (11.9 50.65)


Notice that the DOY is still an index column. Since we want only one record per location/year, we can move the DOY index into the columns and combine it with the variable name. Effectively, this means we treat the mean temperature for each month as a separate predictor.
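
The empty values in the table above (for example at DOY 60 and 91 for 2001) are consistent with month-end labels shifting by one day in leap years. A short standard-library sketch, assuming each monthly value is labelled with the last day of its month (`month_end_doys` is a hypothetical helper):

```python
import calendar
from datetime import date


def month_end_doys(year: int) -> list[int]:
    """Day of year of the last day of each month in `year`."""
    return [
        date(year, month, calendar.monthrange(year, month)[1]).timetuple().tm_yday
        for month in range(1, 13)
    ]


print(month_end_doys(2001)[:4])  # [31, 59, 90, 120]
print(month_end_doys(2000)[:4])  # [31, 60, 91, 121]
```

Under this assumption, a non-leap year fills the columns 31, 59, 90, ... while a leap year fills 31, 60, 91, ..., so each row has values in only one of the two sets.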

The EOBS loader has this built in, so we can do:


In [17]:
df_eobs = ds_eobs._to_dataframe(ds)
df_eobs

,year,geometry,mean_temperature|31,mean_temperature|59,mean_temperature|60,mean_temperature|90,mean_temperature|91,mean_temperature|120,mean_temperature|121,mean_temperature|151,...,minimum_temperature|243,minimum_temperature|244,minimum_temperature|273,minimum_temperature|274,minimum_temperature|304,minimum_temperature|305,minimum_temperature|334,minimum_temperature|335,minimum_temperature|365,minimum_temperature|366
0,2001,POINT (13.23330 47.78330),-0.027742,1.162500,,5.345483,,5.289666,,14.228063,...,13.442905,,6.644666,,9.290646,,-1.518667,,-6.672903,
1,2000,POINT (13.23330 47.78330),-2.406774,,1.763793,,2.866452,,9.612332,,...,,13.460967,,9.547999,,6.900968,,1.890333,,-0.024516
2,2002,POINT (13.23330 47.78330),-0.748709,3.706428,,4.863225,,6.443333,,13.560322,...,12.848707,,7.639001,,4.716774,,2.709000,,-1.445806,
3,2002,POINT (14.88330 48.68330),-2.181936,3.039643,,3.765162,,6.581666,,14.010002,...,12.012580,,6.130666,,2.722903,,0.613667,,-4.385161,
4,2000,POINT (14.88330 48.68330),-4.247742,,1.692069,,2.593548,,9.577000,,...,,10.741290,,7.121333,,5.788710,,-0.049667,,-3.313871
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4718,2002,POINT (11.98330 50.70000),0.469032,4.536786,,4.693226,,6.893667,,13.538388,...,14.494839,,8.478333,,4.596774,,2.489667,,-3.375161,
4719,2000,POINT (11.98330 50.70000),0.211290,,3.563103,,4.662581,,10.020000,,...,,12.347098,,10.091666,,7.811936,,2.919333,,0.433226
4720,2001,POINT (11.98330 50.70000),0.044839,2.016071,,3.662258,,6.862000,,13.639998,...,13.560323,,8.660666,,8.577096,,0.671333,,-3.940323,
4721,2002,POINT (11.90000 50.65000),0.109032,4.131786,,4.365160,,6.540333,,13.248710,...,14.109676,,8.086999,,4.349354,,2.244334,,-3.562258,


Finally, our e-obs data have the exact same format as the PEP725 observations. The next step will be to merge the dataframes together.


In [18]:
from springtime.utils import join_dataframes

join_dataframes([df_pep725, df_eobs])

year,geometry,day,mean_temperature|31,mean_temperature|59,mean_temperature|60,mean_temperature|90,mean_temperature|91,mean_temperature|120,mean_temperature|121,mean_temperature|151,mean_temperature|152,...,minimum_temperature|243,minimum_temperature|244,minimum_temperature|273,minimum_temperature|274,minimum_temperature|304,minimum_temperature|305,minimum_temperature|334,minimum_temperature|335,minimum_temperature|365,minimum_temperature|366
2000,POINT (10.00000 49.48330),129,0.323548,,3.664483,,5.358709,,9.966999,,14.801293,...,,12.718388,,9.747333,,7.286451,,2.840333,,0.369677
2000,POINT (10.00000 50.85000),120,0.943226,,3.795517,,5.660645,,10.001336,,14.421612,...,,11.400322,,9.931665,,6.439678,,2.687333,,-0.179032
2000,POINT (10.00000 51.71670),116,1.694194,,4.053448,,5.399354,,10.563000,,14.321937,...,,12.211291,,10.457333,,6.996452,,3.725999,,1.026129
2000,POINT (10.00000 52.10000),120,2.531935,,4.937242,,5.771289,,10.993333,,14.817741,...,,12.660645,,11.131998,,8.224839,,5.002000,,2.268064
2000,POINT (10.00000 53.08330),121,2.119677,,3.988276,,4.812258,,9.906999,,14.663547,...,,11.897419,,10.317667,,7.165806,,3.831666,,0.932581
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2002,POINT (9.96667 50.15000),120,0.033871,5.121071,,5.777419,,8.711332,,13.648706,,...,14.005805,,7.406667,,4.896451,,3.722333,,-1.399677,
2002,POINT (9.96667 50.95000),131,0.667097,5.050714,,4.759677,,7.534667,,13.160967,,...,13.421612,,6.840666,,4.556774,,3.135667,,-2.176774,
2002,POINT (9.96667 52.81670),131,2.531290,4.775357,,4.818065,,7.735334,,14.055484,,...,15.073547,,9.269666,,4.023226,,2.149333,,-3.318065,
2002,POINT (9.98333 49.76670),118,0.221613,5.803572,,6.415483,,9.338666,,14.178065,,...,14.762579,,8.864333,,5.864517,,4.532332,,-0.294839,
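
Conceptually, the join aligns both tables on their shared (year, geometry) index. The sketch below shows that idea with plain pandas on made-up values, using strings in place of real geometries; `join_dataframes` itself may work differently.

```python
import pandas as pd

index_cols = ["year", "geometry"]

# Toy stand-ins for the observation and predictor tables (values are made up)
obs = pd.DataFrame(
    {
        "year": [2000, 2001],
        "geometry": ["POINT (10 50)", "POINT (10 50)"],
        "day": [120, 125],
    }
).set_index(index_cols)

pred = pd.DataFrame(
    {
        "year": [2000, 2001],
        "geometry": ["POINT (10 50)", "POINT (10 50)"],
        "mean_temperature|31": [0.5, 1.2],
    }
).set_index(index_cols)

# Align on the shared index; rows missing from one table would become NaN
joined = obs.join(pred, how="outer")
print(joined)
```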


### Summary

Bringing everything together, we can reduce this whole notebook to a few lines of code.


In [19]:
from springtime.datasets import EOBS, PEP725Phenor
from springtime.utils import PointsFromOther, join_dataframes

germany = {
    "name": "Germany",
    "bbox": [
        5.98865807458,
        47.3024876979,
        15.0169958839,
        54.983104153,
    ],
}

pep725 = PEP725Phenor(
    species="Syringa vulgaris",
    years=[2000, 2002],
    area=germany,
)

eobs = EOBS(
    area=germany,
    years=["2000", "2002"],
    variables=["mean_temperature", "minimum_temperature"],
    resample={"frequency": "M", "operator": "mean"},
    points=PointsFromOther(source="pep725"),
)

# Load and join data frames
df_pep725 = pep725.load()
eobs.points.get_points(df_pep725)
df_eobs = eobs.load()
df_joined = join_dataframes([df_pep725, df_eobs])

### Bonus: from recipe to workflow

We've already had a sneak preview of yaml recipes above. We can also combine the two datasets into what we call a "workflow". Such workflows can also be represented in recipes.


In [20]:
from springtime.main import Workflow, Session

workflow = Workflow(datasets={"pep725": pep725, "eobs": eobs})
print(workflow.to_recipe())

The workflows can be executed in an interactive python session, or from the command line with

```bash
springtime recipe_pep_eobs.yaml
```

This will create a fresh output directory for the loaded/joined data and save it there.
From an interactive session, it works like this. We set log level to info to get more updates about the progress.


In [21]:
import logging

logging.basicConfig(level=logging.INFO)

session = Session()
workflow.execute(session)

INFO:springtime.main:Dataset pep725 loaded with 4723 rows
INFO:springtime.datasets.meteo.eobs:Locating data
INFO:springtime.datasets.meteo.eobs:Found /home/peter/.cache/springtime/e-obs/tg_ens_mean_0.1deg_reg_1995-2010_v26.0e.nc, skipping download
INFO:springtime.datasets.meteo.eobs:Found /home/peter/.cache/springtime/e-obs/tn_ens_mean_0.1deg_reg_1995-2010_v26.0e.nc, skipping download
INFO:springtime.main:Dataset eobs loaded with 4723 rows
INFO:springtime.main:Datasets joined to shape: (4729, 47)
INFO:springtime.main:Data saved to: /tmp/output/data.csv


### To do

- What to do with keep_grid_location?
- Different kind of transpose ds to df (keep time as series?)
- Update examples


In [None]:
# Old eobs examples (need to be updated)

from springtime.datasets.meteo.eobs import EOBS

datasource = EOBS(
    product_type="elevation", variables=["land_surface_elevation"], years=[2000, 2002]
)
datasource.download()
ds = datasource.load()


from springtime.datasets.meteo.eobs import EOBSSinglePoint

datasource = EOBSSinglePoint(
    point=[5, 50],
    product_type="ensemble_mean",
    grid_resolution="0.25deg",
    years=[2000, 2002],
)
datasource.download()
df_eobs = datasource.load()


from springtime.datasets.meteo.eobs import EOBSSinglePoint

datasource = EOBSSinglePoint(
    point=[5, 50],
    product_type="elevation",
    variables=["land_surface_elevation"],
    years=[2000, 2002],
)
datasource.download()
df_eobs = datasource.load()

from springtime.datasets.meteo.eobs import EOBSMultiplePoints

datasource = EOBSMultiplePoints(
    points=[
        [5, 50],
        [5, 55],
    ],
    product_type="ensemble_mean",
    grid_resolution="0.25deg",
    years=[2000, 2002],
)
datasource.download()
df_eobs = datasource.load()


from springtime.datasets.meteo.eobs import EOBSBoundingBox

dataset = EOBSBoundingBox(
    years=[2000, 2002],
    area={"name": "amsterdam", "bbox": [4, 50, 5, 55]},
    grid_resolution="0.25deg",
)
dataset.download()
df_eobs = dataset.load()