# E-OBS

Gridded meteorological observations over Europe from [E-OBS](https://surfobs.climate.copernicus.eu/dataaccess/access_eobs.php)

---


## Basic download & raw data

To download E-OBS we don't use an existing library. Instead, we download the data files directly from the [source](https://surfobs.climate.copernicus.eu/dataaccess/access_eobs.php).


In [2]:
from springtime.datasets import EOBS
from springtime.utils import germany

ds_eobs = EOBS(
    years=["2000", "2002"],  # pyright: ignore (https://t.ly/gukmj)
    variables=[
        "mean_temperature",
        "minimum_temperature",
    ],
    area=germany,
)
print(ds_eobs)
ds_eobs.download()

[PosixPath('/home/peter/.cache/springtime/e-obs/tg_ens_mean_0.1deg_reg_1995-2010_v26.0e.nc'),
 PosixPath('/home/peter/.cache/springtime/e-obs/tn_ens_mean_0.1deg_reg_1995-2010_v26.0e.nc')]

The data comes in netCDF format, so we represent the raw data as an xarray object.


In [3]:
ds = ds_eobs.raw_load()
ds

Dask array summary for each of the two variables:

| | Array | Chunk |
|---|---|---|
| Bytes | 7.14 GiB | 127.75 MiB |
| Shape | (5844, 465, 705) | (1525, 120, 183) |
| Dask graph | 64 chunks in 2 graph layers | |
| Data type | float32 numpy.ndarray | |


### Minimizing cache

As you can see, the raw E-OBS data span a larger domain and a longer time period than we specified. The servers don't offer more fine-grained downloads. Thus, the first thing that the `load` function does is extract the years and area specified in the dataset definition.

Normally, all the raw data will be stored in your springtime cache directory. This makes it easy to load other years or areas without re-downloading. However, if you want to save on disk space, you can set `minimize_cache` to true.
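
To get a feel for how much the raw files exceed the request, the sizes reported by `raw_load` can be reproduced with a little arithmetic. This is a back-of-the-envelope sketch based only on the shape shown above and the 1995-2010 period in the file names. The area subset is left out because the extent of `germany` is not shown here.

```python
from datetime import date

# Shape reported by raw_load, with time as the first axis.
# float32 values take 4 bytes each.
n_time, n_rows, n_cols = 5844, 465, 705
bytes_per_value = 4

raw_gib = n_time * n_rows * n_cols * bytes_per_value / 2**30
print(f"raw array: {raw_gib:.2f} GiB")  # raw array: 7.14 GiB

# The time axis matches daily data for 1995-2010 (period taken from the file names).
days_in_file = (date(2011, 1, 1) - date(1995, 1, 1)).days
print(days_in_file == n_time)  # True

# Requesting 2000-2002 keeps only a fraction of the time steps.
days_requested = (date(2003, 1, 1) - date(2000, 1, 1)).days
print(f"time subset keeps {days_requested / days_in_file:.1%} of the time steps")  # 18.8%
```

Selecting the years alone already removes more than 80% of each array, before any spatial subsetting.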

## Additional options for `load`

Clearly, we need to do some more tweaking to reformat and extract the relevant data, in order to match our standardized data format.

Firstly, notice that E-OBS has a time dimension with many records per year, whereas phenological datasets typically have only one unique row for each year/location. Thus, we need to reshape and/or aggregate the data.

Secondly, we need to extract only those points that are of interest. Typically, we will first download observations (e.g. PEP725) and then extract the corresponding grid points from E-OBS.

### Dealing with time

We start with the time dimension. While it is not impossible to work with daily data, for this example we first resample it to monthly means. Then, we split the time dimension in two: year and day of year.


In [4]:
# TODO: move to easier path?
from springtime.datasets.meteo.eobs import split_time

# Monthly means; "M" is a pandas offset alias, for a full list see
# https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases
ds = ds.resample(time="M").mean()
# Split the time dimension into year and day of year
ds = split_time(ds)
ds

Dask array summary for each of the two variables:

| | Array | Chunk |
|---|---|---|
| Bytes | 460.20 MiB | 1.93 MiB |
| Shape | (465, 705, 16, 23) | (120, 183, 1, 23) |
| Dask graph | 256 chunks in 785 graph layers | |
| Data type | float32 numpy.ndarray | |
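
The last axis of the new shape has 23 entries rather than 12. A plausible reading, consistent with the `doy` values that appear in the dataframes further down (31, 59, 60, 90, 91, ...), is that each monthly value is labelled with the day of year of its month end, which shifts by one from February onwards in leap years. The sketch below reproduces that count with the standard library. `month_end_doy` is a hypothetical helper, not part of springtime.

```python
import calendar
from datetime import date


def month_end_doy(year: int, month: int) -> int:
    """Day of year of the last day of the given month."""
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, last_day).timetuple().tm_yday


common = [month_end_doy(2001, m) for m in range(1, 13)]  # non-leap year
leap = [month_end_doy(2000, m) for m in range(1, 13)]  # leap year

# Only 31 January has the same day of year in both kinds of year.
all_doys = sorted(set(common) | set(leap))
print(common[:3], leap[:3])  # [31, 59, 90] [31, 60, 91]
print(len(all_doys))  # 23
```

Under this reading, each year fills only 12 of the 23 `doy` slots, which would explain the missing values in the dataframes below.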


### Extracting points / alignment with observations

Next, recall that E-OBS is a gridded dataset, but we want to retrieve only those points for which we have observations, so let's extract those. Two utility functions are available for this: `extract_points` and `extract_records`. The difference is that `extract_records` also takes the year index into account.

Let's illustrate this starting with a few points:


In [5]:
import geopandas as gpd
from springtime.datasets.meteo.eobs import extract_points

points_pep725 = gpd.GeoSeries(gpd.points_from_xy(x=[0, 5, 7], y=[5, 10, 12]))
extract_points(ds, points_pep725)

Dask array summary for each of the two variables:

| | Array | Chunk |
|---|---|---|
| Bytes | 4.31 kiB | 276 B |
| Shape | (3, 16, 23) | (3, 1, 23) |
| Dask graph | 16 chunks in 787 graph layers | |
| Data type | float32 numpy.ndarray | |


We've received 3 points, as expected. Notice that we've made a little effort to pass our points as a geopandas `GeoSeries`. This makes it very easy to reuse points from other datasets. For example:


In [6]:
from springtime.datasets import PEP725Phenor

df_pep725 = PEP725Phenor(
    species="Syringa vulgaris",
    years=[2000, 2002],
    area=germany,
).load()

# Use points from pep725
extract_points(ds, df_pep725.geometry)

Dask array summary for each of the two variables:

| | Array | Chunk |
|---|---|---|
| Bytes | 2.50 MiB | 159.92 kiB |
| Shape | (1780, 16, 23) | (1780, 1, 23) |
| Dask graph | 16 chunks in 787 graph layers | |
| Data type | float32 numpy.ndarray | |


That's very convenient! However, we ended up with 1780 unique locations \* 16 years = 28480 records, far more than the 4723 rows in the observation dataframe. That's because observations are not taken at the same locations each year, while `extract_points` returns every year for every location. What we want instead is collocated PEP725 and E-OBS data. To this end, we can use the `extract_records` function:


In [7]:
from springtime.datasets.meteo.eobs import extract_records

ds = extract_records(ds, df_pep725)
ds

Dask array summary for each of the two variables:

| | Array | Chunk |
|---|---|---|
| Bytes | 424.33 kiB | 424.33 kiB |
| Shape | (4723, 23) | (4723, 23) |
| Dask graph | 1 chunks in 787 graph layers | |
| Data type | float32 numpy.ndarray | |
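
The difference between the two helpers comes down to which combinations of year and location are kept. The plain-Python sketch below counts both for a handful of made-up observations. It illustrates the idea only and does not reproduce the springtime implementation.

```python
# Hypothetical observations as (year, (lon, lat)); not every site reports every year.
observations = [
    (2000, (13.2, 47.8)),
    (2001, (13.2, 47.8)),
    (2002, (13.2, 47.8)),
    (2000, (14.9, 48.7)),
    (2002, (14.9, 48.7)),
    (2001, (11.9, 50.7)),
]
grid_years = range(1995, 2011)  # years assumed to be present in the gridded data

# Point extraction: every unique location for every year in the grid.
locations = {point for _, point in observations}
n_point_values = len(locations) * len(grid_years)

# Record extraction: only the (year, location) pairs that were actually observed.
records = set(observations)
n_record_values = len(records)

print(n_point_values, n_record_values)  # 48 6
```

The same effect, at a larger scale, takes the 28480 combinations from the previous step down to the 4723 rows of the observation dataframe.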


In this process, we choose the E-OBS grid cell that is closest to each observation, recognizing that it might not be the exact same point. However, so that the datasets can be joined later on, the final dataframe retains the input coordinates.
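
Nearest-cell selection on a regular grid can be sketched as follows. The grid origin and size are illustrative assumptions, the 0.1 degree spacing is taken from the file names, and `nearest_cell` is a hypothetical helper rather than a springtime function.

```python
def nearest_cell(value: float, origin: float, step: float, size: int) -> int:
    """Index of the coordinate closest to `value` on a regular axis."""
    index = round((value - origin) / step)
    # Clamp so that points just outside the grid map to the edge cell.
    return min(max(index, 0), size - 1)


# Assumed axes: cell centres at origin + i * step.
lon_origin, lat_origin, step = 5.05, 47.05, 0.1
n_lon, n_lat = 100, 80

obs_lon, obs_lat = 13.2333, 47.7833  # first observation in the dataframe below
i = nearest_cell(obs_lon, lon_origin, step, n_lon)
j = nearest_cell(obs_lat, lat_origin, step, n_lat)

# Coordinates of the selected cell centre.
cell_lon = lon_origin + i * step
cell_lat = lat_origin + j * step
print((i, j), (round(cell_lon, 2), round(cell_lat, 2)))
```

The selected cell centre lies within half a grid step of the observation. The record itself would still carry the original coordinates, so it can be joined back to the observations.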

At this stage, most of the heavy lifting is done, and the size of the total dataset is substantially reduced. Now, we can convert our data to a dataframe.


In [8]:
df_eobs = ds.to_dataframe()
df_eobs

| index | doy | year | mean_temperature | minimum_temperature | geometry |
|---|---|---|---|---|---|
| 0 | 31 | 2001 | -0.027742 | -3.352580 | POINT (13.2333 47.7833) |
| 0 | 59 | 2001 | 1.162500 | -1.963571 | POINT (13.2333 47.7833) |
| 0 | 60 | 2001 | | | POINT (13.2333 47.7833) |
| 0 | 90 | 2001 | 5.345483 | 1.442258 | POINT (13.2333 47.7833) |
| 0 | 91 | 2001 | | | POINT (13.2333 47.7833) |
| ... | ... | ... | ... | ... | ... |
| 4722 | 305 | 2000 | 10.458709 | 7.613226 | POINT (11.9 50.65) |
| 4722 | 334 | 2000 | | | POINT (11.9 50.65) |
| 4722 | 335 | 2000 | 5.600334 | 2.783000 | POINT (11.9 50.65) |
| 4722 | 365 | 2000 | | | POINT (11.9 50.65) |


Notice that the DOY is still an index column. Since we want only one record per location/year, we can pivot the DOY level into columns and combine it with the variable name. Effectively, this means we treat the monthly mean of each variable as a separate predictor.
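
The reshaping can be sketched in plain Python. The rows below use made-up values in the same layout as the dataframe above, and the `variable|doy` column naming mimics the loader output. The code is an illustration, not the springtime implementation.

```python
import math

variables = ("mean_temperature", "minimum_temperature")

# Long format with hypothetical values: one row per (record index, doy).
long_rows = [
    # index, doy, year, mean_temperature, minimum_temperature
    (0, 31, 2001, 0.0, -3.0),
    (0, 59, 2001, 1.0, -2.0),
    (1, 31, 2000, -2.0, -5.0),
    (1, 60, 2000, 2.0, -1.0),
]

# Wide format: one row per record, one column per variable/doy combination.
wide = {}
for index, doy, year, *values in long_rows:
    row = wide.setdefault(index, {"year": year})
    for name, value in zip(variables, values):
        row[f"{name}|{doy}"] = value

# Give every record the same columns; combinations that never occur become NaN.
columns = sorted(
    {key for row in wide.values() for key in row if key != "year"},
    key=lambda key: (key.split("|")[0], int(key.split("|")[1])),
)
for row in wide.values():
    for column in columns:
        row.setdefault(column, math.nan)

print(columns[:3])
print(wide[1]["mean_temperature|59"])  # nan
```

Record 1 belongs to a leap year, so it has a value at day 60 but not at day 59, and the opposite holds for record 0.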

The `EOBS` loader has this built in under the hood, so that we can do:


In [9]:
df_eobs = ds_eobs._to_dataframe(ds)
df_eobs

Unnamed: 0,year,geometry,mean_temperature|31,mean_temperature|59,mean_temperature|60,mean_temperature|90,mean_temperature|91,mean_temperature|120,mean_temperature|121,mean_temperature|151,...,minimum_temperature|243,minimum_temperature|244,minimum_temperature|273,minimum_temperature|274,minimum_temperature|304,minimum_temperature|305,minimum_temperature|334,minimum_temperature|335,minimum_temperature|365,minimum_temperature|366
0,2001,POINT (13.23330 47.78330),-0.027742,1.162500,,5.345483,,5.289666,,14.228063,...,13.442905,,6.644666,,9.290646,,-1.518667,,-6.672903,
1,2000,POINT (13.23330 47.78330),-2.406774,,1.763793,,2.866452,,9.612332,,...,,13.460967,,9.547999,,6.900968,,1.890333,,-0.024516
2,2002,POINT (13.23330 47.78330),-0.748709,3.706428,,4.863225,,6.443333,,13.560322,...,12.848707,,7.639001,,4.716774,,2.709000,,-1.445806,
3,2002,POINT (14.88330 48.68330),-2.181936,3.039643,,3.765162,,6.581666,,14.010002,...,12.012580,,6.130666,,2.722903,,0.613667,,-4.385161,
4,2000,POINT (14.88330 48.68330),-4.247742,,1.692069,,2.593548,,9.577000,,...,,10.741290,,7.121333,,5.788710,,-0.049667,,-3.313871
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4718,2002,POINT (11.98330 50.70000),0.469032,4.536786,,4.693226,,6.893667,,13.538388,...,14.494839,,8.478333,,4.596774,,2.489667,,-3.375161,
4719,2000,POINT (11.98330 50.70000),0.211290,,3.563103,,4.662581,,10.020000,,...,,12.347098,,10.091666,,7.811936,,2.919333,,0.433226
4720,2001,POINT (11.98330 50.70000),0.044839,2.016071,,3.662258,,6.862000,,13.639998,...,13.560323,,8.660666,,8.577096,,0.671333,,-3.940323,
4721,2002,POINT (11.90000 50.65000),0.109032,4.131786,,4.365160,,6.540333,,13.248710,...,14.109676,,8.086999,,4.349354,,2.244334,,-3.562258,


## Summary

We started with a `raw_load` of the E-OBS data. After going through all the nitty-gritty details, we can appreciate all the work that happens under the hood when we call `load` directly:


In [12]:
eobs = EOBS(
    area=germany,
    years=["2000", "2002"],
    variables=["mean_temperature", "minimum_temperature"],
    resample={"frequency": "M", "operator": "mean"},
    points=[(5, 10), (10, 12)],
)
eobs.load()

Unnamed: 0,year,geometry,mean_temperature|31,mean_temperature|59,mean_temperature|60,mean_temperature|90,mean_temperature|91,mean_temperature|120,mean_temperature|121,mean_temperature|151,...,minimum_temperature|243,minimum_temperature|244,minimum_temperature|273,minimum_temperature|274,minimum_temperature|304,minimum_temperature|305,minimum_temperature|334,minimum_temperature|335,minimum_temperature|365,minimum_temperature|366
0,2000,POINT (5.00000 10.00000),2.223871,,5.508276,,7.324193,,11.011667,,...,,14.570322,,11.682666,,8.179032,,4.663,,3.367419
1,2000,POINT (10.00000 12.00000),-4.589999,,-1.344483,,-0.575161,,4.090333,,...,,9.272257,,5.804999,,3.015162,,-2.132333,,-2.679032
2,2001,POINT (5.00000 10.00000),4.325484,4.837501,,8.908065,,8.607333,,16.301291,...,14.453549,,8.990667,,9.785806,,0.957333,,-1.507742,
3,2001,POINT (10.00000 12.00000),-2.454194,-1.650714,,2.299355,,0.978,,10.014839,...,9.411936,,2.642333,,5.125806,,-4.241666,,-9.656453,
4,2002,POINT (5.00000 10.00000),2.929678,6.716786,,8.146128,,10.153666,,12.974515,...,13.938711,,9.912333,,7.014516,,5.327667,,3.429355,
5,2002,POINT (10.00000 12.00000),-2.547742,0.474286,,1.406774,,3.338666,,8.437097,...,8.149031,,4.204667,,1.624516,,-1.128333,,-3.478387,


We can also represent this dataset as a recipe for easy sharing and reproducibility.


In [13]:
print(eobs.to_recipe())