# Extract ERA5 Site Data and Generate Samples
## ERA5 Site Data
For each site in the LFMC sample data, extract the full time series of ERA5 data and save it to a CSV file. The extractor in this notebook is created with `gap_fill=False`, so no gap-filling is applied. Note: if an output CSV file already exists, it is assumed to be correct and is not overwritten.
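
The skip-if-exists behaviour described above can be sketched as follows. This is a minimal illustration, not the code in `timeseries_extractor`: `load_or_extract`, `fake_extract`, and the `<site>.csv` file naming are hypothetical, and the real extraction would query GEE instead of returning fixed rows.

```python
import csv
import os
import tempfile


def load_or_extract(site, out_dir, extract_fn):
    """Return the rows for a site, reusing its CSV file if one already exists."""
    path = os.path.join(out_dir, f"{site}.csv")  # assumed naming: one file per site
    if os.path.exists(path):
        # An existing file is assumed to be correct: read it, do not re-extract
        with open(path, newline="") as f:
            return list(csv.reader(f))
    rows = extract_fn(site)
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)
    return rows


calls = []


def fake_extract(site):
    # Stand-in for a GEE request (hypothetical data)
    calls.append(site)
    return [["date", "value"], ["2000-03-01", "1.5"]]


with tempfile.TemporaryDirectory() as tmp:
    first = load_or_extract("C4_1", tmp, fake_extract)
    second = load_or_extract("C4_1", tmp, fake_extract)  # served from the CSV file

print(calls)  # the extractor ran only once
```

Under this scheme, a re-run after a failed extraction only has to fetch the sites whose CSV files are missing.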

## ERA5 Sample Data
For each sample, extract the ERA5 time series that precedes the sampling date. The time series length is set by `ERA5_TS_LENGTH`. A sample is rejected if the full time series cannot be extracted, i.e. its start or end falls outside the full site time series. The extracted ERA5 data are combined into a single dataframe and saved.
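
As a rough sketch of the windowing and rejection rule, the snippet below computes the first and last day of a sample's time series from the notebook's default constants. The function `sample_window` is hypothetical, and the exact test applied inside `extract_timeseries_data` is an assumption.

```python
from datetime import datetime, timedelta

START_DATE = datetime(2000, 3, 1)   # first day of the site time series
END_DATE = datetime(2019, 1, 1)     # exclusive; last day is 2018-12-31
TS_LENGTH, TS_OFFSET, TS_FREQ = 365, 1, 1


def sample_window(sampling_date):
    """Return (first_day, last_day) of the window, or None if it does not fit."""
    last_day = sampling_date - timedelta(days=TS_OFFSET)
    first_day = last_day - timedelta(days=(TS_LENGTH - 1) * TS_FREQ)
    if first_day < START_DATE or last_day >= END_DATE:
        return None  # sample rejected: window falls outside the site series
    return first_day, last_day


window = sample_window(datetime(2005, 6, 20))
too_early = sample_window(datetime(2001, 2, 28))
print(window, too_early)
```

With these defaults, a sample collected on 2005-06-20 uses the days 2004-06-20 to 2005-06-19, and the first sampling date that yields a complete window is 2001-03-01.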

## Input Files
- `LFMC_sites.csv` and `LFMC_samples.csv` created by the `Extract Auxiliary Data.ipynb` notebook.

## Output Files
- The extracted sites data are created in `ERA5_DIR` (by default located in `DATA_DIR/GEE_DIR`). The directory will contain a CSV file for each site.
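- The extracted ERA5 data for all samples are created in `DATASETS_DIR`. The file name includes the time series length (e.g. `365days`) of the extracted ERA5 data, so with the default settings it is `era5_365days.csv`.
- The samples with valid ERA5 data are saved in `DATASETS_DIR`, using the same naming scheme (`samples_365days.csv` with the default settings).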

## Notes
1. This notebook should be run after running the `Extract Auxiliary Data.ipynb` notebook.
2. The generated name for `GEE_DIR` includes the projection and scale of the extracted GEE data. 
3. It will take about 10 hours to run if there are no existing site extracts.
4. Data for 67724 samples will be extracted. 1051 samples will not be extracted as they were collected before March 2002 (so less than 1 year of MODIS data is available).
5. There will be 4 invalid sites ('C6_340', 'C6_698', 'C6_935', and 'C16_5'), but occasionally extraction from GEE will fail for another site. If this happens, re-run the notebook, keeping the existing site CSV files so they are not re-extracted.

## Initialisation

In [1]:
import os
os.environ["HDF5_DISABLE_VERSION_CHECK"] = "1"
import numpy as np
import pandas as pd
import re
import time
from datetime import datetime
from datetime import timedelta

import initialise
import common
from data_extract_utils import get_sample_data, extract_timeseries_data, sort_key
from timeseries_extractor import GeeTimeseriesReduceExtractor

### Connect to GEE

In [2]:
import ee
ee.Initialize()

### Program parameters and constants

GEE Parameters
- Weather product is ECMWF/ERA5_LAND/HOURLY - hourly weather data from ERA5
- Scale/proj set to convert to MODIS resolution/projection
- Start date is 2000-03-01 (1 March 2000), to match MODIS data availability

In [3]:
START_DATE = "2000-03-01"
END_DATE = "2019-01-01"     # Final day retrieved will be 2018-12-31

# ERA5 time series constants
ERA5_TS_LENGTH = 365
ERA5_TS_OFFSET = 1      # time series ends this many days before the sampling date
ERA5_TS_FREQ = 1        # days between consecutive elements in the time series

# ERA5 data details
PRODUCT = "ECMWF/ERA5_LAND/HOURLY"
BANDS = {
    'pre_tot': 'total_precipitation_max',
    'temp_mean': 'temperature_2m_mean',
    'temp_min': 'temperature_2m_min',
    'temp_max': 'temperature_2m_max',
    'dewpt_mean': 'dewpoint_temperature_2m_mean',
    'pet_min': 'potential_evaporation_hourly_min',
    'pet_max': 'potential_evaporation_hourly_max',
}
# Extract data using MODIS scale and projection
SCALE = 463.3127
PROJ = "SR-ORG:6974"

EARLIEST_SAMPLE = datetime.strptime(START_DATE, '%Y-%m-%d') + timedelta(
    days=ERA5_TS_LENGTH * ERA5_TS_FREQ + ERA5_TS_OFFSET - 1)

### Directories and Files

In [4]:
# Directories
GEE_DIR = f"GEE_{common.PROJ.replace(':', '-')}_{int(common.SCALE)}"
ERA5_DIR = os.path.join(common.LFMC_DATA_DIR, GEE_DIR, "ERA5")

# File Names
SAMPLES_INPUT = os.path.join(common.DATASETS_DIR, "samples_455days.csv")
SAMPLES_OUTPUT = os.path.join(common.DATASETS_DIR, f"samples_{ERA5_TS_LENGTH * ERA5_TS_FREQ}days.csv")
ERA5_OUTPUT = os.path.join(common.DATASETS_DIR, f"era5_{ERA5_TS_LENGTH * ERA5_TS_FREQ}days.csv")

# Create output directories if necessary
os.makedirs(ERA5_DIR, exist_ok=True)
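
To make the generated names concrete, the sketch below rebuilds them from the projection, scale and time series length. It assumes `common.PROJ` and `common.SCALE` hold the same values as the notebook's `PROJ` and `SCALE`, and it uses a temporary directory as a placeholder for `common.LFMC_DATA_DIR`.

```python
import os
import tempfile

PROJ = "SR-ORG:6974"      # assumed equal to common.PROJ
SCALE = 463.3127          # assumed equal to common.SCALE
TS_LENGTH, TS_FREQ = 365, 1

# ':' is replaced so the name is a valid directory name on all platforms
gee_dir = f"GEE_{PROJ.replace(':', '-')}_{int(SCALE)}"
era5_file = f"era5_{TS_LENGTH * TS_FREQ}days.csv"

with tempfile.TemporaryDirectory() as data_dir:  # placeholder for LFMC_DATA_DIR
    era5_dir = os.path.join(data_dir, gee_dir, "ERA5")
    os.makedirs(era5_dir, exist_ok=True)  # no error if it already exists
    created = os.path.isdir(era5_dir)

print(gee_dir, era5_file)
```

With the values above, the directory is named `GEE_SR-ORG-6974_463` and the sample output file is `era5_365days.csv`.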

## Generate the ERA5 sample data
For each site
- Get the weather data at MODIS proj/scale
- Then get the sample data for each sample at the site

In [5]:
sites = pd.read_csv(common.LFMC_SITES, float_precision="high")
samples = pd.read_csv(common.LFMC_SAMPLES, float_precision="high")

In [6]:
reducers = [ee.Reducer.mean(), ee.Reducer.min(), ee.Reducer.max()]
era5_extractor = GeeTimeseriesReduceExtractor(PRODUCT, BANDS, reducers, START_DATE, END_DATE,
                                              gap_fill=False, dir_name=ERA5_DIR)
era5_extractor.set_proj_scale(PROJ, SCALE)
era5_data, valid_data, invalid_pixels, invalid_sites = extract_timeseries_data(
    era5_extractor, sites, samples, EARLIEST_SAMPLE, ERA5_TS_OFFSET, ERA5_TS_LENGTH, ERA5_TS_FREQ)

Processing site C4_1
Processing site C4_2
Processing site C4_3
Processing site C4_4
Processing site C4_5
Processing site C6_1
Processing site C6_2
Processing site C6_3
Processing site C6_4
Processing site C6_5
Processing site C6_6
Processing site C6_7
Processing site C6_8
Processing site C6_9
Processing site C6_10
Processing site C6_11
Processing site C6_12
Processing site C6_13
Processing site C6_14
Processing site C6_16
Processing site C6_17
Processing site C6_18
Processing site C6_19
Processing site C6_20
Processing site C6_21
Processing site C6_22
Processing site C6_23
Processing site C6_24
Processing site C6_25
Processing site C6_26
Processing site C6_27
Processing site C6_28
Processing site C6_29
Processing site C6_30
Processing site C6_31
Processing site C6_32
Processing site C6_33
Processing site C6_34
Processing site C6_35
Processing site C6_36
Processing site C6_37
Processing site C6_38
Processing site C6_39
Processing site C6_40
Processing site C6_41
Processing site C6_42
...
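
The three reducers passed to the extractor (mean, min, max) presumably collapse the hourly ERA5 values into daily statistics, which is why the names in `BANDS` end in `_mean`, `_min` or `_max`. The plain-Python sketch below illustrates that idea only; it does not use the Earth Engine API, the hourly values are made up, and the internals of `GeeTimeseriesReduceExtractor` are not shown in this notebook.

```python
from statistics import mean

# Hypothetical hourly values for one site and one day (kelvin)
hourly = {
    "temperature_2m": [288.0, 290.5, 295.0, 291.5],
}
reducers = {"mean": mean, "min": min, "max": max}

# One output per (band, reducer) pair, named <band>_<reducer>
daily = {
    f"{band}_{name}": fn(values)
    for band, values in hourly.items()
    for name, fn in reducers.items()
}
print(daily)
```

Only the band and reducer combinations listed in `BANDS` are kept in the notebook, so each day contributes 7 values per site.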

Summary of sites/samples not extracted

In [7]:
print(f'Invalid sites: {len(invalid_sites)}; Invalid pixels: {len(invalid_pixels)}')
print(invalid_sites)
print(invalid_pixels)

Invalid sites: 4; Invalid pixels: 0
['C6_340', 'C6_698', 'C6_935', 'C16_5']
[]


## Save and display sample weather data

In [8]:
era5_data = pd.DataFrame(era5_data)
ts_days = (ERA5_TS_LENGTH - 1) * ERA5_TS_FREQ
era5_data.columns = ["ID"] + [f'{day-ERA5_TS_OFFSET:04}_{band+1}'
                               for day in range(-ts_days, 1, ERA5_TS_FREQ)
                               for band in range(len(BANDS))]
era5_data.sort_values('ID', inplace=True, key=lambda x: x.apply(sort_key))
era5_data

Unnamed: 0,ID,-365_1,-365_2,-365_3,-365_4,-365_5,-365_6,-365_7,-364_1,-364_2,...,-002_5,-002_6,-002_7,-001_1,-001_2,-001_3,-001_4,-001_5,-001_6,-001_7
0,C4_1_1,0.000542,296.530792,291.436172,301.042404,277.281698,-0.000895,-2.011657e-07,5.423814e-04,293.205321,...,275.639860,-0.001018,6.780028e-07,8.583068e-07,295.475432,289.114777,303.882629,273.889772,-0.001021,-6.426126e-06
1,C4_1_2,0.000921,297.345078,292.379990,301.693054,279.785858,-0.000923,-3.129244e-07,1.575947e-04,297.258537,...,274.502177,-0.000914,-3.017485e-06,2.458692e-06,297.341183,291.287109,304.700317,273.564700,-0.000885,-2.980232e-07
2,C4_1_3,0.000803,299.459848,294.304749,304.683365,283.649661,-0.000859,-3.762543e-07,2.530980e-03,299.739579,...,274.384192,-0.001027,-5.409122e-06,0.000000e+00,304.793328,298.170731,310.831070,274.564700,-0.000883,-8.635223e-06
3,C4_1_4,0.000379,298.654336,292.756805,305.252792,272.641654,-0.000859,-2.585351e-06,8.523463e-07,299.803421,...,280.926898,-0.000868,-2.544373e-06,3.432226e-05,300.006526,293.719437,305.549316,281.398877,-0.000799,-8.344650e-07
4,C4_1_5,0.002115,292.097139,287.816391,300.834045,281.832570,-0.000536,-2.741814e-06,2.950550e-03,289.521110,...,276.744818,-0.000780,-2.171844e-06,2.847910e-04,298.592454,293.327850,303.655106,278.496980,-0.000834,-2.708286e-06
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
67711,C13_4_14,0.000215,293.617204,286.537247,301.238647,279.187840,-0.003312,-5.494803e-06,2.151787e-04,292.207192,...,271.797415,-0.003569,-7.301569e-06,9.264946e-05,293.928433,284.252731,302.596359,276.540686,-0.003648,-2.317503e-05
67712,C13_4_15,0.000002,291.288058,282.811432,299.470810,271.133790,-0.006421,-5.368143e-06,8.523463e-07,292.185450,...,265.889236,-0.004061,-9.324402e-06,0.000000e+00,287.922465,281.608810,295.046280,268.582804,-0.004099,-6.858259e-06
67713,C13_4_16,0.000034,291.169561,280.568604,300.066238,275.212795,-0.003411,5.513430e-07,5.960464e-08,290.831379,...,271.548433,-0.004121,-7.260591e-06,1.250207e-04,288.774312,281.650787,298.668213,272.269825,-0.003459,-1.125783e-05
67714,C13_4_17,0.004198,283.734919,279.935745,287.508652,277.565645,-0.001839,-2.250075e-06,4.202329e-03,282.264055,...,272.110679,-0.005692,-7.748604e-07,2.550205e-05,286.259954,278.274933,294.291885,272.549385,-0.005157,-2.026558e-06
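
The column names in the table above follow the pattern `<day offset>_<band number>`. The sketch below applies the same naming expression to a much shorter series (3 days, 2 bands) so the result is easy to read; the short lengths are illustrative only.

```python
TS_LENGTH, TS_OFFSET, TS_FREQ = 3, 1, 1   # short series for illustration
N_BANDS = 2

ts_days = (TS_LENGTH - 1) * TS_FREQ
columns = ["ID"] + [
    f"{day - TS_OFFSET:04}_{band + 1}"     # e.g. -003_1: 3 days before sampling, band 1
    for day in range(-ts_days, 1, TS_FREQ)
    for band in range(N_BANDS)
]
print(columns)
# ['ID', '-003_1', '-003_2', '-002_1', '-002_2', '-001_1', '-001_2']
```

With the notebook defaults (365 days and 7 bands) the same expression yields 365 × 7 + 1 = 2556 columns, ordered from the oldest day to the day before sampling.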


In [9]:
era5_data.to_csv(ERA5_OUTPUT, index=False, float_format='%.4g')
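
The `float_format='%.4g'` argument writes each value with 4 significant digits, which keeps the CSV file compact. The snippet below applies the same printf-style format to a few values taken from the table above, to show what precision is retained.

```python
values = [296.530792, 0.000542, -2.011657e-07, 8.583068e-07]

# Same printf-style format string that is passed to to_csv
formatted = ['%.4g' % v for v in values]
print(formatted)
# ['296.5', '0.000542', '-2.012e-07', '8.583e-07']
```

Temperatures in kelvin are therefore stored to the nearest 0.1 K, while very small values switch to exponent notation.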

Save and display the valid samples

In [14]:
valid_samples = samples[valid_data].sort_values('ID', key=lambda x: x.apply(sort_key))
valid_samples.to_csv(SAMPLES_OUTPUT, index=False)
valid_samples

Unnamed: 0,ID,Latitude,Longitude,Sampling date,Sampling year,Land Cover,LFMC value,Site,Czone1,Czone2,Czone3,Day_sin,Day_cos,Long_sin,Long_cos,Lat_norm,Elevation,Slope,Aspect_sin,Aspect_cos
163,C4_1_1,40.21458,-112.21868,2005-06-20,2005,Shrubland,156.76300,C4_1,B,BS,BSk,-0.21352,0.97694,-0.92575,-0.37814,0.72341,0.26207,0.01972,-0.03023,0.99954
164,C4_1_2,40.21458,-112.21868,2005-07-05,2005,Shrubland,128.27700,C4_1,B,BS,BSk,0.04302,0.99907,-0.92575,-0.37814,0.72341,0.26207,0.01972,-0.03023,0.99954
165,C4_1_3,40.21458,-112.21868,2005-07-21,2005,Shrubland,92.48200,C4_1,B,BS,BSk,0.31311,0.94972,-0.92575,-0.37814,0.72341,0.26207,0.01972,-0.03023,0.99954
166,C4_1_4,40.21458,-112.21868,2005-08-08,2005,Shrubland,82.09300,C4_1,B,BS,BSk,0.58779,0.80902,-0.92575,-0.37814,0.72341,0.26207,0.01972,-0.03023,0.99954
167,C4_1_5,40.21458,-112.21868,2005-08-23,2005,Shrubland,78.95300,C4_1,B,BS,BSk,0.77488,0.63210,-0.92575,-0.37814,0.72341,0.26207,0.01972,-0.03023,0.99954
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
60,C13_4_14,46.89791,-113.43535,2012-08-28,2012,"Tree cover, needleleaved, evergreen, closed (>...",102.44207,C13_4,B,BS,BSk,0.83593,0.54884,-0.91751,-0.39771,0.76054,0.21077,0.07241,0.53863,0.84254
61,C13_4_15,46.89791,-113.43535,2012-09-04,2012,"Tree cover, needleleaved, evergreen, closed (>...",88.76436,C13_4,B,BS,BSk,0.89584,0.44438,-0.91751,-0.39771,0.76054,0.21077,0.07241,0.53863,0.84254
62,C13_4_16,46.89791,-113.43535,2012-09-11,2012,"Tree cover, needleleaved, evergreen, closed (>...",88.79382,C13_4,B,BS,BSk,0.94276,0.33347,-0.91751,-0.39771,0.76054,0.21077,0.07241,0.53863,0.84254
63,C13_4_17,46.89791,-113.43535,2012-09-18,2012,"Tree cover, needleleaved, evergreen, closed (>...",81.72345,C13_4,B,BS,BSk,0.97601,0.21772,-0.91751,-0.39771,0.76054,0.21077,0.07241,0.53863,0.84254
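
The valid samples are selected with a boolean mask and then sorted by ID. The implementation of `sort_key` (imported from `data_extract_utils`) is not shown in this notebook; the sketch below uses a hypothetical `natural_key` that orders IDs numerically, which is consistent with the displayed order (`C4_...` before `C13_...`). The IDs and mask are made up.

```python
import re


def natural_key(sample_id):
    """Split an ID such as 'C13_4_14' into a tuple of integers: (13, 4, 14)."""
    return tuple(int(n) for n in re.findall(r"\d+", sample_id))


ids = ["C13_4_14", "C4_1_2", "C6_10_1", "C4_1_1", "C6_2_1"]
valid = [True, True, False, True, True]   # hypothetical validity mask

kept = [i for i, ok in zip(ids, valid) if ok]
ordered = sorted(kept, key=natural_key)
print(ordered)
# ['C4_1_1', 'C4_1_2', 'C6_2_1', 'C13_4_14']
```

A plain string sort would place `C13_4_14` before `C4_1_1`, which is why a numeric key is needed.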


In [15]:
EARLIEST_SAMPLE

datetime.datetime(2001, 3, 1, 0, 0)