# Extract ERA5 Site Data and Generate Samples
## ERA5 Site Data
For each site in the LFMC sample data, extract the full ERA5 time series, gap-fill it, and save it to a CSV file. Note: if an output CSV file already exists, it is assumed to be correct and is not overwritten.
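The "do not overwrite existing files" behaviour can be sketched as a small cache check. This is only an illustration: `load_or_extract`, `fake_extract`, and the `<site_id>.csv` file naming are hypothetical and are not taken from `timeseries_extractor`.

```python
import os
import tempfile

import pandas as pd


def load_or_extract(site_id, out_dir, extract_fn):
    """Return site data, reusing the site's CSV file if it already exists.

    Hypothetical helper: the file naming and the extract_fn callback are
    assumptions made for this sketch.
    """
    path = os.path.join(out_dir, f"{site_id}.csv")
    if os.path.exists(path):
        # An existing file is trusted as-is and is never overwritten
        return pd.read_csv(path)
    data = extract_fn(site_id)
    data.to_csv(path, index=False)
    return data


calls = []


def fake_extract(site_id):
    """Stand-in for a slow GEE extraction; records each call."""
    calls.append(site_id)
    return pd.DataFrame({"date": ["2000-03-01"], "temp_mean": [280.0]})


with tempfile.TemporaryDirectory() as tmp:
    first = load_or_extract("C2_1", tmp, fake_extract)   # extracts and saves
    second = load_or_extract("C2_1", tmp, fake_extract)  # reads the saved CSV
```

Because the second call finds the CSV file, the extraction function runs only once. This is why a failed run can be repeated cheaply as long as the existing site files are kept.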

## ERA5 Sample Data
For each sample, extract the ERA5 time series. The time series length is set by `ERA5_TS_LENGTH`. A sample is rejected if its full time series cannot be extracted (its start or end falls outside the full site time series). The extracted ERA5 data are combined into a single dataframe and saved.
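A minimal sketch of the window logic, assuming the time series ends `ERA5_TS_OFFSET` days before the sampling date and spans `ERA5_TS_LENGTH` daily values. The helper names `sample_window` and `is_extractable` are hypothetical; the real logic lives in `data_extract_utils` and may differ in detail.

```python
from datetime import date, timedelta


def sample_window(sampling_date, ts_length=365, ts_offset=1, ts_freq=1):
    """Return (first_day, last_day) of the time series for one sample."""
    last_day = sampling_date - timedelta(days=ts_offset)
    first_day = last_day - timedelta(days=(ts_length - 1) * ts_freq)
    return first_day, last_day


def is_extractable(sampling_date, site_start, site_end, **kwargs):
    """A sample is kept only if its whole window lies inside the site data."""
    first_day, last_day = sample_window(sampling_date, **kwargs)
    return site_start <= first_day and last_day <= site_end


# Site time series covers 2000-03-01 to 2018-12-31 (assumed from the defaults)
site_start, site_end = date(2000, 3, 1), date(2018, 12, 31)

window = sample_window(date(2006, 4, 10))
kept = is_extractable(date(2006, 4, 10), site_start, site_end)
rejected = not is_extractable(date(2000, 12, 1), site_start, site_end)
```

Here a sample collected on 2006-04-10 uses the 365 days from 2005-04-10 to 2006-04-09, while a sample collected on 2000-12-01 is rejected because its window would start before the site data begin.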

## Input Files
- `LFMC_europe_sites.csv` and `LFMC_europe_samples.csv` created by the `Extract Auxiliary Data.ipynb` notebook.

## Output Files
- The extracted sites data are created in `ERA5_DIR` (by default located in `DATA_DIR/GEE_DIR`). The directory will contain a CSV file for each site.
- The extracted ERA5 data for all samples are created in `DATASETS_DIR`. The file name includes the time series length (e.g. 365days) of the extracted ERA5 data, so with the default settings it is `europe_era5_365days.csv`.

## Notes
1. This notebook should be run after running the `Extract Auxiliary Data.ipynb` notebook.
2. The generated name for `GEE_DIR` includes the projection and scale of the extracted GEE data. 
3. It will take about 10 hours to run if there are no existing site extracts.
4. Data for 67724 samples will be extracted. 1051 samples will not be extracted as they were collected before March 2002 (so less than 1 year of MODIS data is available).
5. There will be 4 invalid sites ('C6_340', 'C6_698', 'C6_935', and 'C16_5'), but occasionally extraction from GEE will fail for another site. If this happens, re-run the notebook (keep the existing site CSV files so they are not re-extracted).

## Initialisation

In [1]:
import os
os.environ["HDF5_DISABLE_VERSION_CHECK"] = "1"
import numpy as np
import pandas as pd
import re
import time
from datetime import datetime
from datetime import timedelta

import initialise
import common
from data_extract_utils import get_sample_data, extract_timeseries_data, sort_key
from timeseries_extractor import GeeTimeseriesReduceExtractor

### Connect to GEE

In [2]:
import ee
ee.Initialize()

### Program parameters and constants

GEE Parameters
- Weather product is ECMWF/ERA5_LAND/HOURLY - hourly weather data from ERA5
- Scale/proj set to convert to MODIS resolution/projection
- Start date is 2000-03-01 (1 March 2000), to match MODIS data availability

In [3]:
START_DATE = "2000-03-01"
END_DATE = "2019-01-01"     # Final day retrieved will be 2018-12-31

# ERA5 time series constants
ERA5_TS_LENGTH = 365
ERA5_TS_OFFSET = 1      # time series ends this many days before the sampling date
ERA5_TS_FREQ = 1        # days between consecutive elements in the time series

# ERA5 data details
PRODUCT = "ECMWF/ERA5_LAND/HOURLY"
BANDS = {
    'pre_tot': 'total_precipitation_max',
    'temp_mean': 'temperature_2m_mean',
    'temp_min': 'temperature_2m_min',
    'temp_max': 'temperature_2m_max',
    'dewpt_mean': 'dewpoint_temperature_2m_mean',
    'pet_min': 'potential_evaporation_hourly_min',
    'pet_max': 'potential_evaporation_hourly_max',
}
# Extract data using MODIS scale and projection
SCALE = 463.3127
PROJ = "SR-ORG:6974"

EARLIEST_SAMPLE = datetime.strptime(START_DATE, '%Y-%m-%d') + timedelta(
    days=ERA5_TS_LENGTH * ERA5_TS_FREQ + ERA5_TS_OFFSET - 1)
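As a worked check of `EARLIEST_SAMPLE` with the default settings, the sketch below repeats the arithmetic with literal values. It assumes, as the constant comments state, that the time series ends `ERA5_TS_OFFSET` days before the sampling date.

```python
from datetime import datetime, timedelta

start = datetime.strptime("2000-03-01", "%Y-%m-%d")
ts_length, ts_freq, ts_offset = 365, 1, 1

# Same expression as EARLIEST_SAMPLE: 365 * 1 + 1 - 1 = 365 days after the start
earliest_sample = start + timedelta(days=ts_length * ts_freq + ts_offset - 1)

# First day needed by a sample collected on that date
last_day = earliest_sample - timedelta(days=ts_offset)
first_day = last_day - timedelta(days=(ts_length - 1) * ts_freq)

print(earliest_sample.date())  # 2001-03-01
print(first_day.date())        # 2000-03-01
```

A sample collected on 2001-03-01 therefore needs data from exactly 2000-03-01 onwards, which is the first day retrieved.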

### Directories and Files

In [4]:
# Directories
GEE_DIR = f"GEE_{common.PROJ.replace(':', '-')}_{int(common.SCALE)}"
ERA5_DIR = os.path.join(common.LFMC_DATA_DIR, GEE_DIR, "ERA5")

# File Names
SITES_INPUT = os.path.join(common.DATASETS_DIR, "LFMC_europe_sites.csv")
SAMPLES_INPUT = os.path.join(common.DATASETS_DIR, "LFMC_europe_samples.csv")
SAMPLES_OUTPUT = os.path.join(common.DATASETS_DIR, f"europe_samples_{ERA5_TS_LENGTH * ERA5_TS_FREQ}days.csv")
ERA5_OUTPUT = os.path.join(common.DATASETS_DIR, f"europe_era5_{ERA5_TS_LENGTH * ERA5_TS_FREQ}days.csv")

# Create output directories if necessary
os.makedirs(ERA5_DIR, exist_ok=True)

## Generate the ERA5 sample data
For each site
- Get the weather data at MODIS proj/scale
- Then get the sample data for each sample at the site
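The band names in `BANDS` end in `_mean`, `_min`, and `_max`, which suggests that the hourly ERA5 values are reduced to one set of statistics per day. The pandas sketch below mimics that reduction locally on synthetic data. It is an analogy only and makes no claim about how `GeeTimeseriesReduceExtractor` calls Earth Engine.

```python
import numpy as np
import pandas as pd

# Two days of synthetic hourly 2 m temperatures (K); values are made up
hours = pd.Timestamp("2006-04-09") + pd.to_timedelta(np.arange(48), unit="h")
temps = pd.Series(280.0 + np.arange(48) % 24, index=hours, name="temperature_2m")

# Reduce each day's 24 hourly values to mean, min and max
daily = temps.resample("D").agg(["mean", "min", "max"])

# Name the columns in the same "<band>_<reducer>" style used in BANDS
daily.columns = [f"temperature_2m_{stat}" for stat in daily.columns]
```

Each row of `daily` then holds one day's statistics, in the same shape as one element of the extracted daily time series.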

In [5]:
sites = pd.read_csv(SITES_INPUT, float_precision="high")
samples = pd.read_csv(SAMPLES_INPUT, float_precision="high")

In [6]:
reducers = [ee.Reducer.mean(), ee.Reducer.min(), ee.Reducer.max()]
era5_extractor = GeeTimeseriesReduceExtractor(PRODUCT, BANDS, reducers, START_DATE, END_DATE,
                                              gap_fill=False, dir_name=ERA5_DIR)
era5_extractor.set_proj_scale(PROJ, SCALE)
era5_data, valid_data, invalid_pixels, invalid_sites = extract_timeseries_data(
    era5_extractor, sites, samples, EARLIEST_SAMPLE, ERA5_TS_OFFSET, ERA5_TS_LENGTH, ERA5_TS_FREQ)

Processing site C2_1
Extracting data for C2_1 (lat: 38.92292 long: -1.71649)
Processing site C2_2
Extracting data for C2_2 (lat: 38.30625 long: -2.15313)
Processing site C2_3
Extracting data for C2_3 (lat: 38.20625 long: -2.30925)
Processing site C2_4
Extracting data for C2_4 (lat: 36.75625 long: -5.43723)
Processing site C2_5
Extracting data for C2_5 (lat: 36.76042 long: -5.38552)
Processing site C2_6
Extracting data for C2_6 (lat: 42.29375 long: -0.90971)
Processing site C2_7
Extracting data for C2_7 (lat: 42.28542 long: -0.89833)
Processing site C2_8
Extracting data for C2_8 (lat: 42.27708 long: -1.01647)
Processing site C2_9
Extracting data for C2_9 (lat: 42.40208 long: -0.92821)
Processing site C2_10
Extracting data for C2_10 (lat: 41.25208 long: -1.2165)
Processing site C2_11
Extracting data for C2_11 (lat: 41.19792 long: -1.18227)
Processing site C2_12
Extracting data for C2_12 (lat: 41.25625 long: -1.29971)
Processing site C2_13
Extracting data for C2_13 (lat: 41.31042 long: -1

Summary of sites/samples not extracted

In [7]:
print(f'Invalid sites: {len(invalid_sites)}; Invalid pixels: {len(invalid_pixels)}')
print(invalid_sites)
print(invalid_pixels)

Invalid sites: 3; Invalid pixels: 0
['C9_28', 'C9_42', 'C12_1']
[]


## Save and display sample weather data

In [8]:
era5_data = pd.DataFrame(era5_data)
ts_days = (ERA5_TS_LENGTH - 1) * ERA5_TS_FREQ
era5_data.columns = ["ID"] + [f'{day-ERA5_TS_OFFSET:04}_{band+1}'
                               for day in range(-ts_days, 1, ERA5_TS_FREQ)
                               for band in range(len(BANDS))]
era5_data.sort_values('ID', inplace=True, key=lambda x: x.apply(sort_key))
era5_data

Unnamed: 0,ID,-365_1,-365_2,-365_3,-365_4,-365_5,-365_6,-365_7,-364_1,-364_2,...,-002_5,-002_6,-002_7,-001_1,-001_2,-001_3,-001_4,-001_5,-001_6,-001_7
0,C2_1_1,4.994869e-06,279.343393,272.426605,286.665955,266.562732,-0.000982,-1.172237e-04,0.000000e+00,282.091725,...,275.727264,-0.000940,1.974404e-07,0.000002,288.199837,280.332153,295.517288,274.865705,-0.000963,-3.344938e-05
1,C2_1_3,4.700086e-05,300.064088,293.295334,305.981644,278.305990,-0.001138,-1.177415e-04,4.699528e-05,299.458803,...,284.632844,-0.001179,-3.071502e-06,0.000166,299.828926,291.924576,307.572403,284.558547,-0.001129,-1.158565e-05
2,C2_2_1,0.000000e+00,281.677337,274.818909,288.502975,270.162066,-0.000892,-7.554889e-06,1.297291e-06,281.923733,...,276.670230,-0.000796,5.029142e-07,0.000039,287.716304,280.806351,295.198883,277.632065,-0.000785,5.029142e-07
3,C2_2_10,1.722574e-06,299.227581,290.983368,305.792938,278.115561,-0.001068,-4.016422e-05,1.710653e-06,298.724834,...,281.555777,-0.001087,-1.929700e-05,0.000095,299.176420,290.457779,306.713028,285.018345,-0.001055,-3.539026e-07
4,C2_3_1,0.000000e+00,279.570078,272.320862,286.348679,269.395220,-0.000818,-6.139278e-06,1.297291e-06,280.326321,...,276.276025,-0.000768,8.456409e-07,0.000139,285.884516,279.189163,292.888336,276.765692,-0.000767,1.762062e-06
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8393,C15_1_13,5.412996e-05,294.831380,288.597534,302.681396,285.030111,-0.001073,-2.433546e-06,0.000000e+00,295.770727,...,290.111422,-0.001096,-4.036352e-06,0.001074,298.003263,292.430466,302.726273,286.349009,-0.001196,-5.979091e-06
8394,C15_1_19,2.782345e-05,298.152774,292.087646,305.539581,288.400694,-0.001074,-1.419336e-06,1.761168e-04,297.875079,...,285.737535,-0.001460,-1.625158e-06,0.000059,295.294369,289.159363,301.848465,281.938951,-0.001150,-1.290627e-04
8395,C15_1_25,2.311292e-03,297.696341,291.804367,305.228455,286.585228,-0.001150,-9.424984e-07,2.746779e-03,295.381376,...,291.939104,-0.000916,-1.621619e-05,0.001561,297.682663,293.108673,303.832428,285.431698,-0.001314,-3.438257e-05
8396,C15_1_31,8.583069e-07,300.132219,294.407043,307.781067,288.305840,-0.001104,-2.764165e-06,8.523463e-07,298.516801,...,282.525814,-0.001012,-2.685934e-06,0.008602,293.761960,291.439362,297.265640,290.789291,-0.000722,1.836568e-06
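The column labels in the table above follow the pattern `<day offset>_<band number>`. The sketch below reproduces the naming expression from the cell as a standalone function so it can be tried with small values. The function name `column_names` is introduced here for illustration only.

```python
def column_names(ts_length=365, ts_offset=1, ts_freq=1, n_bands=7):
    """Build the ERA5 column labels: 'ID', then one label per day and band."""
    ts_days = (ts_length - 1) * ts_freq
    return ["ID"] + [f"{day - ts_offset:04}_{band + 1}"
                     for day in range(-ts_days, 1, ts_freq)
                     for band in range(n_bands)]


# Small case: 3 days, 2 bands
small = column_names(ts_length=3, n_bands=2)
print(small)
# ['ID', '-003_1', '-003_2', '-002_1', '-002_2', '-001_1', '-001_2']

# Default case: 365 days x 7 bands, plus the ID column
full = column_names()
print(len(full), full[1], full[-1])  # 2556 -365_1 -001_7
```

The day offset is relative to the sampling date, so `-001` is the day before sampling (because `ERA5_TS_OFFSET` is 1), and the band number follows the order of the entries in `BANDS`.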


In [9]:
era5_data.to_csv(ERA5_OUTPUT, index=False, float_format='%.4g')
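The `float_format='%.4g'` argument writes each value with four significant digits, which keeps the output file small at the cost of precision. A short sketch of the effect, using a few values from the first row of the table and hypothetical column names `a`, `b`, and `c`:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["C2_1_1"],
    "a": [279.343393],     # temperature-like value
    "b": [4.994869e-06],   # very small value, written in exponent form
    "c": [-0.000982],      # small negative value, written in fixed form
})

# With no path given, to_csv returns the CSV text instead of writing a file
csv_lines = df.to_csv(index=False, float_format="%.4g").splitlines()
print(csv_lines[1])  # C2_1_1,279.3,4.995e-06,-0.000982
```

Values read back from the saved file will therefore differ slightly from the in-memory dataframe shown in the notebook.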

Save and display the valid samples

In [10]:
valid_samples = samples[valid_data].sort_values('ID', key=lambda x: x.apply(sort_key))
valid_samples.to_csv(SAMPLES_OUTPUT, index=False)
valid_samples

Unnamed: 0,ID,Latitude,Longitude,Sampling date,Sampling year,Land Cover,LFMC value,Site,Czone1,Czone2,Czone3,Day_sin,Day_cos,Long_sin,Long_cos,Lat_norm,Elevation,Slope,Aspect_sin,Aspect_cos
317,C2_1_1,38.92292,-1.71649,2006-04-10,2006,"Mosaic natural vegetation (tree, shrub, herbac...",74.33500,C2_1,B,BS,BSk,-0.99111,0.13301,-0.02995,0.99955,0.71624,0.15258,0.16544,0.94004,0.34108
318,C2_1_3,38.92292,-1.71649,2006-07-25,2006,"Mosaic natural vegetation (tree, shrub, herbac...",87.42500,C2_1,B,BS,BSk,0.37771,0.92592,-0.02995,0.99955,0.71624,0.15258,0.16544,0.94004,0.34108
501,C2_2_1,38.30625,-2.15313,2006-04-11,2006,"Tree cover, needleleaved, evergreen, closed to...",86.10800,C2_2,B,BS,BSk,-0.98868,0.15006,-0.03757,0.99929,0.71281,0.16236,0.15632,0.22862,0.97352
502,C2_2_10,38.30625,-2.15313,2006-07-25,2006,"Tree cover, needleleaved, evergreen, closed to...",60.51667,C2_2,B,BS,BSk,0.37771,0.92592,-0.03757,0.99929,0.71281,0.16236,0.15632,0.22862,0.97352
513,C2_3_1,38.20625,-2.30925,2006-04-11,2006,Mosaic tree and shrub (>50%) / herbaceous cove...,110.14500,C2_3,B,BS,BSk,-0.98868,0.15006,-0.04029,0.99919,0.71226,0.19286,0.06576,-0.99408,0.10867
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
109,C15_1_13,41.34375,1.05726,2017-06-27,2017,"Tree cover, broadleaved, deciduous, closed to ...",106.92565,C15_1,C,Cs,Csb,-0.09454,0.99552,0.01845,0.99983,0.72969,0.12533,0.20084,-0.57996,0.81464
110,C15_1_19,41.34375,1.05726,2017-07-26,2017,"Tree cover, broadleaved, deciduous, closed to ...",84.77083,C15_1,C,Cs,Csb,0.39359,0.91929,0.01845,0.99983,0.72969,0.12533,0.20084,-0.57996,0.81464
111,C15_1_25,41.34375,1.05726,2017-08-09,2017,"Tree cover, broadleaved, deciduous, closed to ...",79.33530,C15_1,C,Cs,Csb,0.60162,0.79878,0.01845,0.99983,0.72969,0.12533,0.20084,-0.57996,0.81464
112,C15_1_31,41.34375,1.05726,2017-09-05,2017,"Tree cover, broadleaved, deciduous, closed to ...",69.21358,C15_1,C,Cs,Csb,0.89584,0.44438,0.01845,0.99983,0.72969,0.12533,0.20084,-0.57996,0.81464
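In the output above, site `C2_…` sorts before `C15_…`, which a plain string sort would not do. The implementation of `sort_key` in `data_extract_utils` is not shown in this notebook, so the sketch below uses a hypothetical `natural_key` that compares the numeric parts of an ID as integers. It is one possible way to get this ordering, not necessarily the one used.

```python
import re


def natural_key(sample_id):
    """Hypothetical sort key: the numeric parts of the ID as a tuple of ints.

    The letter prefix is ignored, which assumes all IDs share the same prefix.
    """
    return tuple(int(part) for part in re.findall(r"\d+", sample_id))


ids = ["C15_1_13", "C2_1_3", "C2_2_10", "C2_1_1", "C2_2_1"]

by_string = sorted(ids)
by_number = sorted(ids, key=natural_key)

print(by_string)  # ['C15_1_13', 'C2_1_1', 'C2_1_3', 'C2_2_1', 'C2_2_10']
print(by_number)  # ['C2_1_1', 'C2_1_3', 'C2_2_1', 'C2_2_10', 'C15_1_13']
```

Sorting both the ERA5 data and the valid samples with the same key keeps the two output files in the same row order.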
