# Extract ERA5 Site Data and Generate Samples
## ERA5 Site Data
For each site in the LFMC sample data, extract the full time series of ERA5 data, gap-fill it, and save it to a CSV file. Note: if the output CSV files already exist, they are assumed to be correct and are not overwritten.
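
The "extract only if the CSV is missing" behaviour can be sketched as below. This is a minimal illustration, not the extractor's actual code: `load_or_extract` and `fake_extract` are hypothetical names, and the rows are made-up data standing in for a GEE request.

```python
import csv
import os
import tempfile


def load_or_extract(csv_path, extract_fn):
    """Return (rows, extracted); extract_fn is called only if csv_path is missing."""
    if os.path.exists(csv_path):
        # An existing file is trusted as-is and never refreshed.
        with open(csv_path, newline="") as f:
            return list(csv.reader(f)), False
    rows = extract_fn()
    with open(csv_path, "w", newline="") as f:
        csv.writer(f).writerows(rows)
    return rows, True


calls = []


def fake_extract():
    # Hypothetical stand-in for a slow GEE extraction.
    calls.append(1)
    return [["date", "temp_mean"], ["2000-03-01", "295.1"]]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "C10_1.csv")
    first_rows, first_extracted = load_or_extract(path, fake_extract)
    second_rows, second_extracted = load_or_extract(path, fake_extract)
```

The second call reads the saved file instead of extracting again, which is why a stale or corrupt site CSV has to be deleted by hand before re-running.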

## ERA5 Sample Data
For each sample, extract the ERA5 time series. The time series length is set by `ERA5_TS_LENGTH`. A sample is rejected if the full time series cannot be extracted (its start or end falls outside the full site time series). The extracted ERA5 data are combined into a single dataframe and saved.
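
The windowing and rejection rule can be sketched with plain dates. This is an assumption-laden illustration: `sample_window` is a hypothetical helper, the site series is assumed to be sorted, and the offset and frequency arguments mirror the `ERA5_TS_OFFSET` and `ERA5_TS_FREQ` constants defined later in the notebook.

```python
from datetime import date, timedelta


def sample_window(site_dates, sample_date, ts_length, ts_offset=1, ts_freq=1):
    """Return the dates in the sample's window, or None if it does not fit."""
    end = sample_date - timedelta(days=ts_offset)
    start = end - timedelta(days=(ts_length - 1) * ts_freq)
    # Reject the sample if the window starts or ends outside the site series.
    if start < site_dates[0] or end > site_dates[-1]:
        return None
    return [start + timedelta(days=i * ts_freq) for i in range(ts_length)]


# Hypothetical site series covering 10 consecutive days.
site_dates = [date(2000, 3, 1) + timedelta(days=i) for i in range(10)]

ok = sample_window(site_dates, date(2000, 3, 6), ts_length=5)
too_early = sample_window(site_dates, date(2000, 3, 4), ts_length=5)
```

Here `ok` spans 1 March to 5 March 2000, while `too_early` is `None` because its window would start before the first day of the site series.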

## Input Files
- `LFMC_sites.csv` and `LFMC_samples.csv` created by the `Extract Auxiliary Data.ipynb` notebook.

## Output Files
- The extracted sites data are created in `ERA5_DIR` (by default located in `DATA_DIR/GEE_DIR`). The directory will contain a CSV file for each site.
- The extracted ERA5 data for all samples are created in `DATASETS_DIR`. The file name includes the length of the extracted ERA5 time series (e.g. `365days`); with the settings in this notebook, the file is `australia_era5_365days.csv`.

## Notes
1. This notebook should be run after running the `Extract Auxiliary Data.ipynb` notebook.
2. The generated name for `GEE_DIR` includes the projection and scale of the extracted GEE data. 
3. It will take about 10 hours to run if there are no existing site extracts.
4. Data for 67724 samples will be extracted. 1051 samples will not be extracted as they were collected before March 2001 (so less than 1 year of MODIS data is available).
5. There will be 4 invalid sites ('C6_340', 'C6_698', 'C6_935', and 'C16_5'), but occasionally extraction from GEE will fail for another site. If this happens re-run the notebook (keep the existing site CSV files so they are not re-extracted).

## Initialisation

In [1]:
import os
os.environ["HDF5_DISABLE_VERSION_CHECK"] = "1"
import numpy as np
import pandas as pd
import re
import time
from datetime import datetime
from datetime import timedelta

import initialise
import common
from data_extract_utils import get_sample_data, extract_timeseries_data, sort_key
from timeseries_extractor import GeeTimeseriesReduceExtractor

### Connect to GEE

In [2]:
import ee
ee.Initialize()

### Program parameters and constants

GEE Parameters
- Weather product is ECMWF/ERA5_LAND/HOURLY - hourly weather data from ERA5
- Scale/proj set to convert to MODIS resolution/projection
- Start date is 1 March 2000 (`2000-03-01`), to match MODIS data availability
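
As a rough illustration of how hourly data can become the daily `mean`/`min`/`max` bands named in `BANDS`, the sketch below reduces one day of hourly values in plain Python. It assumes the extractor aggregates hourly images to daily values and names the outputs `<band>_<reducer>`; the notebook does not show the extractor's internals, and the hourly values here are made up.

```python
from statistics import mean

# Hypothetical hourly 2 m temperatures (K) for one day at one site.
hourly = {"temperature_2m": [280.0 + h % 12 for h in range(24)]}

# Plain-Python stand-ins for the mean, min, and max reducers.
reducers = {"mean": mean, "min": min, "max": max}

# One output per (band, reducer) pair, named "<band>_<reducer>".
daily = {f"{band}_{name}": fn(values)
         for band, values in hourly.items()
         for name, fn in reducers.items()}
```

With these inputs, `daily` holds `temperature_2m_mean`, `temperature_2m_min`, and `temperature_2m_max`, matching the naming pattern of the values in `BANDS`.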

In [3]:
START_DATE = "2000-03-01"
END_DATE = "2019-01-01"     # Final day retrieved will be 2018-12-31

# ERA5 time series constants
ERA5_TS_LENGTH = 365
ERA5_TS_OFFSET = 1      # ts end this number of days before the sampling date
ERA5_TS_FREQ = 1        # days between consecutive elements in the ts

# ERA5 data details
PRODUCT = "ECMWF/ERA5_LAND/HOURLY"
BANDS = {
    'pre_tot': 'total_precipitation_max',
    'temp_mean': 'temperature_2m_mean',
    'temp_min': 'temperature_2m_min',
    'temp_max': 'temperature_2m_max',
    'dewpt_mean': 'dewpoint_temperature_2m_mean',
    'pet_min': 'potential_evaporation_hourly_min',
    'pet_max': 'potential_evaporation_hourly_max',
}
# Extract data using MODIS scale and projection
SCALE = 463.3127
PROJ = "SR-ORG:6974"

EARLIEST_SAMPLE = datetime.strptime(START_DATE, '%Y-%m-%d') + timedelta(
    days=ERA5_TS_LENGTH * ERA5_TS_FREQ + ERA5_TS_OFFSET - 1)
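
As a worked check of the `EARLIEST_SAMPLE` formula, the sketch below recomputes it with the constants from this cell and derives the window that a sample on that date would use. The names `earliest`, `window_end`, and `window_start` are introduced here for illustration only.

```python
from datetime import datetime, timedelta

START_DATE = "2000-03-01"
ERA5_TS_LENGTH = 365
ERA5_TS_OFFSET = 1
ERA5_TS_FREQ = 1

earliest = datetime.strptime(START_DATE, "%Y-%m-%d") + timedelta(
    days=ERA5_TS_LENGTH * ERA5_TS_FREQ + ERA5_TS_OFFSET - 1)

# Window for a sample collected on the earliest allowed date.
window_end = earliest - timedelta(days=ERA5_TS_OFFSET)
window_start = window_end - timedelta(days=(ERA5_TS_LENGTH - 1) * ERA5_TS_FREQ)

print(earliest.date(), window_start.date(), window_end.date())
# 2001-03-01 2000-03-01 2001-02-28
```

A sample collected on 1 March 2001 therefore uses exactly the first 365 days of the extracted series; any earlier sample would need data from before `START_DATE`.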

### Directories and Files

In [4]:
# Directories
GEE_DIR = f"GEE_{common.PROJ.replace(':', '-')}_{int(common.SCALE)}"
ERA5_DIR = os.path.join(common.LFMC_DATA_DIR, GEE_DIR, "ERA5")

# File Names
SITES_INPUT = os.path.join(common.DATASETS_DIR, "LFMC_australia_sites.csv")
SAMPLES_INPUT = os.path.join(common.DATASETS_DIR, "australia_samples_365days.csv")
#SAMPLES_OUTPUT = os.path.join(common.DATASETS_DIR, f"australia_samples_{ERA5_TS_LENGTH * ERA5_TS_FREQ}days.csv")
ERA5_OUTPUT = os.path.join(common.DATASETS_DIR, f"australia_era5_{ERA5_TS_LENGTH * ERA5_TS_FREQ}days.csv")

# Create output directories if necessary
os.makedirs(ERA5_DIR, exist_ok=True)
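
To make the generated directory name concrete, the sketch below builds it from the `PROJ` and `SCALE` values defined in this notebook. This assumes `common.PROJ` and `common.SCALE` hold the same values, which is not shown here; `"data"` is a placeholder for `common.LFMC_DATA_DIR`.

```python
import os

# Assumed values: copied from this notebook's PROJ and SCALE constants.
proj = "SR-ORG:6974"
scale = 463.3127

# ':' is replaced so the name is a valid directory name on all platforms.
gee_dir = f"GEE_{proj.replace(':', '-')}_{int(scale)}"
era5_dir = os.path.join("data", gee_dir, "ERA5")  # "data" is a placeholder root

print(gee_dir)
# GEE_SR-ORG-6974_463
```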

## Generate the ERA5 sample data
For each site
- Get the weather data at MODIS proj/scale
- Then get the sample data for each sample at the site

In [6]:
sites = pd.read_csv(SITES_INPUT, float_precision="high")
samples = pd.read_csv(SAMPLES_INPUT, float_precision="high")

In [7]:
reducers = [ee.Reducer.mean(), ee.Reducer.min(), ee.Reducer.max()]
era5_extractor = GeeTimeseriesReduceExtractor(PRODUCT, BANDS, reducers, START_DATE, END_DATE,
                                              gap_fill=False, dir_name=ERA5_DIR)
era5_extractor.set_proj_scale(PROJ, SCALE)
era5_data, valid_data, invalid_pixels, invalid_sites = extract_timeseries_data(
    era5_extractor, sites, samples, EARLIEST_SAMPLE, ERA5_TS_OFFSET, ERA5_TS_LENGTH, ERA5_TS_FREQ)

Processing site C10_1
Extracting data for C10_1 (lat: -37.63542 long: 144.22103)
Processing site C10_2
Extracting data for C10_2 (lat: -35.40625 long: 149.80151)
Processing site C10_3
Extracting data for C10_3 (lat: -35.41875 long: 149.78896)
Processing site C10_4
Extracting data for C10_4 (lat: -38.22708 long: 145.56676)
Processing site C10_5
Extracting data for C10_5 (lat: -33.68125 long: 117.61153)
Processing site C10_6
Extracting data for C10_6 (lat: -35.36875 long: 149.0574)
Processing site C10_7
Extracting data for C10_7 (lat: -42.84792 long: 147.48628)
Processing site C10_8
Extracting data for C10_8 (lat: -38.08958 long: 145.43017)
Processing site C10_9
Extracting data for C10_9 (lat: -26.16458 long: 121.56263)
Processing site C10_10
Extracting data for C10_10 (lat: -35.27708 long: 149.19474)
Processing site C10_11
Extracting data for C10_11 (lat: -35.26875 long: 150.40931)
Processing site C10_12
Extracting data for C10_12 (lat: -35.30625 long: 149.17192)
Processing site C10_13


Summary of sites/samples not extracted

In [8]:
print(f'Invalid sites: {len(invalid_sites)}; Invalid pixels: {len(invalid_pixels)}')
print(invalid_sites)
print(invalid_pixels)

Invalid sites: 0; Invalid pixels: 0
[]
[]


## Save and display sample weather data

In [9]:
era5_data = pd.DataFrame(era5_data)
ts_days = (ERA5_TS_LENGTH - 1) * ERA5_TS_FREQ
era5_data.columns = ["ID"] + [f'{day-ERA5_TS_OFFSET:04}_{band+1}'
                               for day in range(-ts_days, 1, ERA5_TS_FREQ)
                               for band in range(len(BANDS))]
era5_data.sort_values('ID', inplace=True, key=lambda x: x.apply(sort_key))
era5_data

Unnamed: 0,ID,-365_1,-365_2,-365_3,-365_4,-365_5,-365_6,-365_7,-364_1,-364_2,...,-002_5,-002_6,-002_7,-001_1,-001_2,-001_3,-001_4,-001_5,-001_6,-001_7
0,C10_1_1,0.000061,295.124693,288.766220,302.282272,279.088717,-0.004818,-0.000309,0.010432,283.935483,...,278.557932,-0.003289,-0.000044,0.000133,283.820129,279.266205,291.084976,278.758999,-0.002144,-0.000261
1,C10_1_2,0.000297,287.837313,281.145538,296.361023,279.966503,-0.003230,0.000005,0.000297,286.158151,...,278.902625,-0.002179,-0.000180,0.000842,287.112688,281.363922,293.051132,281.518106,-0.002685,0.000010
2,C10_1_3,0.002852,294.199064,288.245087,299.953400,286.361894,-0.003820,-0.000072,0.011684,289.021126,...,277.902976,-0.002715,-0.000005,0.001687,286.639353,280.360550,291.988602,278.997329,-0.003486,-0.000112
3,C10_1_4,0.014677,287.186193,282.757324,293.467300,283.127290,-0.001828,-0.000306,0.003240,286.878009,...,278.146975,-0.003139,0.000003,0.000001,290.996375,284.195877,298.594116,278.461547,-0.003820,-0.000046
4,C10_2_1,0.002991,289.597839,284.514587,294.820389,281.867422,-0.001600,-0.000059,0.005482,288.243148,...,284.211318,-0.001603,-0.000043,0.000674,287.517785,285.034653,292.274506,283.220676,-0.000770,-0.000090
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
385,C18_3_22,0.020727,289.265651,287.339340,292.840866,287.830176,-0.001270,0.000007,0.012755,288.694913,...,281.862825,-0.003697,-0.000107,0.001624,285.524158,282.708603,288.827255,282.448569,-0.001894,-0.000067
386,C18_3_25,0.001329,288.429486,283.362030,295.446060,279.188215,-0.004617,-0.000037,0.001361,285.515853,...,279.982619,-0.003146,-0.000006,0.000013,288.006959,281.540192,293.852280,280.038822,-0.003724,0.000002
387,C18_3_28,0.005883,291.174055,286.284088,299.219818,286.980419,-0.003154,0.000006,0.005883,290.862527,...,282.789017,-0.005019,-0.000047,0.001885,288.619006,281.176453,296.258453,279.316615,-0.004001,-0.000006
388,C18_3_31,0.009584,277.928763,275.981705,280.640457,276.250906,-0.001021,0.000008,0.001479,278.339953,...,279.069406,-0.000386,0.000005,0.002139,279.247279,274.524612,285.500626,275.517584,-0.001956,0.000003
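
The column names in the table above follow the pattern `<days before sampling>_<band number>`. The sketch below applies the same naming expression to a deliberately small case (3 days, 2 bands) so the pattern is easy to read; the small values are illustrative and not the notebook's settings.

```python
ts_length, ts_offset, ts_freq = 3, 1, 1   # small illustrative values
n_bands = 2

ts_days = (ts_length - 1) * ts_freq
columns = ["ID"] + [f"{day - ts_offset:04}_{band + 1}"
                    for day in range(-ts_days, 1, ts_freq)
                    for band in range(n_bands)]
print(columns)
# ['ID', '-003_1', '-003_2', '-002_1', '-002_2', '-001_1', '-001_2']
```

With the notebook's settings (365 days, 7 bands) this yields `-365_1` through `-001_7`. The displayed values are consistent with band numbers following the order of the `BANDS` dictionary, for example `_1` for precipitation and `_2` for mean temperature in kelvin.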


In [10]:
era5_data.to_csv(ERA5_OUTPUT, index=False, float_format='%.4g')
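
The `float_format='%.4g'` argument keeps four significant digits, so the saved CSV is less precise than the dataframe displayed above. The sketch below applies the same format string to a few values taken from the first displayed row.

```python
values = [295.124693, 0.000061, -0.004818, 0.010432]
formatted = ["%.4g" % v for v in values]
print(formatted)
# ['295.1', '6.1e-05', '-0.004818', '0.01043']
```

Temperatures in kelvin are therefore stored to one decimal place, while very small values switch to exponent notation.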

Save and display the valid samples

In [11]:
# valid_samples = samples[valid_data].sort_values('ID', key=lambda x: x.apply(sort_key))
# valid_samples.to_csv(SAMPLES_OUTPUT, index=False)
# valid_samples