# Extract MODIS Site Data and Generate Samples
## MODIS Site Data
For each site in the LFMC sample data, extract the full time series of MODIS reflectance and snow-cover data and save it to CSV files. Note: if the output CSV files already exist, they are assumed to be correct and are not overwritten. The data are gap-filled after being saved but before being used to create the MODIS sample data.
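
The gap-filling method itself is not shown in this notebook (it lives inside `GeeTimeseriesExtractor`). As a rough illustration only, the sketch below assumes gaps are filled by linear interpolation in time on a complete daily calendar; the function name `gap_fill` and the column name `band1` are hypothetical.

```python
import numpy as np
import pandas as pd


def gap_fill(site_df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing daily values (assumed method: linear interpolation in time)."""
    # Reindex to a complete daily calendar so missing days become NaN rows
    full_index = pd.date_range(site_df.index.min(), site_df.index.max(), freq="D")
    filled = site_df.reindex(full_index)
    # Interpolate interior gaps, then pad any gaps at the very start or end
    return filled.interpolate(method="time").ffill().bfill()


# Toy site series: one NaN value and two missing calendar days
dates = pd.to_datetime(["2001-01-01", "2001-01-02", "2001-01-05"])
site = pd.DataFrame({"band1": [100.0, np.nan, 400.0]}, index=dates)
filled = gap_fill(site)
print(filled)
```

Because the saved CSV files hold the raw extract, a different gap-filling strategy could be tried later without re-extracting from GEE.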

## MODIS Sample Data
For each sample, extract the time series of MODIS reflectance data. The time series length is determined by the `MODIS_TS_LENGTH` value. The sample is rejected if the full time series cannot be extracted (its start or end falls outside the full site time series). The extracted MODIS data are combined into a single DataFrame and saved. Another LFMC sample dataset containing only the valid samples is also created.
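
The rejection rule can be sketched as a simple date-window check. This is a minimal illustration, not the actual logic of `extract_timeseries_data`: it assumes the window ends `ts_offset` days before the sampling date and spans `ts_length` steps of `ts_freq` days. The helper names `sample_window` and `is_valid_sample` are hypothetical.

```python
from datetime import date, timedelta


def sample_window(sample_date, ts_length=365, ts_offset=1, ts_freq=1):
    """Return (first_day, last_day) of the time series for one sample."""
    last_day = sample_date - timedelta(days=ts_offset)
    first_day = last_day - timedelta(days=(ts_length - 1) * ts_freq)
    return first_day, last_day


def is_valid_sample(sample_date, site_start, site_end, **window_args):
    """A sample is valid only if its whole window lies inside the site series."""
    first_day, last_day = sample_window(sample_date, **window_args)
    return site_start <= first_day and last_day <= site_end


site_start, site_end = date(2000, 3, 1), date(2018, 12, 31)
print(sample_window(date(2008, 10, 20)))
print(is_valid_sample(date(2001, 3, 1), site_start, site_end))   # window starts on site_start
print(is_valid_sample(date(2001, 2, 28), site_start, site_end))  # window starts too early
```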

## Input Files
- `LFMC_sites.csv` and `samples_365days.csv` created by the `Extract Auxiliary Data.ipynb` and `Extract ERA5 DATA.ipynb` notebooks.

## Output Files
- The extracted sites data are created in `MODIS_DIR` (by default `MCD43A4` located in `LFMC_DATA_DIR/GEE_DIR`). The directory will contain a CSV file for each site.
- The extracted MODIS data for each sample and the updated samples data are created in `DATASETS_DIR`. File names include the time series length (e.g. 730days) of the extracted MODIS data. So, with the default settings they are `modis_730days.csv` and `samples_730days.csv`.

## Notes
1. This notebook should be run after running the `Extract Auxiliary Data.ipynb` and `Extract ERA5 DATA.ipynb` notebooks.
2. The generated name for `GEE_DIR` includes the projection and scale of the extracted GEE data. 
3. It will take about 8.5 hours to run if there are no existing site extracts.
4. Data for 67724 samples will be extracted. If the samples input is the ERA5 extract sample output, there should be no invalid or discarded samples.
5. There should be no invalid sites, but occasionally extraction from GEE will fail for a site. If this happens, re-run the notebook (keep the existing site CSV files so they are not re-extracted).
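
The re-run behaviour described in note 5 relies on existing site CSV files being reused. A minimal sketch of that pattern follows; the helper `load_or_extract`, the stand-in `fake_extract`, and the file name `C10_1.csv` are hypothetical and only illustrate the idea, not the extractor's real implementation.

```python
import os
import tempfile


def load_or_extract(csv_path, extract_fn):
    """Return CSV text, calling extract_fn only when csv_path does not exist."""
    if os.path.exists(csv_path):
        # Existing files are assumed correct and are never overwritten
        with open(csv_path, encoding="utf-8") as f:
            return f.read()
    text = extract_fn()
    with open(csv_path, "w", encoding="utf-8") as f:
        f.write(text)
    return text


calls = []


def fake_extract():
    """Stand-in for a slow GEE request; records how often it is called."""
    calls.append(1)
    return "date,band1\n2001-01-01,634\n"


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "C10_1.csv")
    first = load_or_extract(path, fake_extract)   # extracts and saves
    second = load_or_extract(path, fake_extract)  # reads the saved file

print(len(calls))
```

With this pattern, a failed site simply has no file, so a second run only repeats the extractions that did not complete.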

In [1]:
import os
os.environ["HDF5_DISABLE_VERSION_CHECK"] = "1"
import numpy as np
import pandas as pd
import re
import time
from datetime import datetime
from datetime import timedelta

import initialise
import common
from data_extract_utils import extract_timeseries_data, sort_key
from timeseries_extractor import GeeTimeseriesExtractor

### Program parameters and constants

GEE Parameters
- Reflectance product is MCD43A4 - daily reflectance using 16-day composites
- Scale set to use native MODIS resolution
- Extracts two years of data for each sample to allow for a 1-year time series with a 1-year lead time

In [2]:
START_DATE = "2000-03-01"
END_DATE = "2019-01-01"  # Final day retrieved will be 2018-12-31

# MODIS time series constants
MODIS_TS_LENGTH = 365
MODIS_TS_OFFSET = 1
MODIS_TS_FREQ = 1

# MODIS data details
PRODUCT = "MODIS/006/MCD43A4"
BANDS = ["Nadir_Reflectance_Band1",
         "Nadir_Reflectance_Band2",
         "Nadir_Reflectance_Band3",
         "Nadir_Reflectance_Band4",
         "Nadir_Reflectance_Band5",
         "Nadir_Reflectance_Band6",
         "Nadir_Reflectance_Band7"]
# Extract data using default MODIS scale and projection
SCALE = 463.3127
PROJ = "SR-ORG:6974"

EARLIEST_SAMPLE = datetime.strptime(START_DATE, '%Y-%m-%d') + timedelta(
    days=MODIS_TS_LENGTH * MODIS_TS_FREQ + MODIS_TS_OFFSET - 1)
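
As a worked example of the `EARLIEST_SAMPLE` arithmetic, the sketch below wraps the same formula in a hypothetical helper, `earliest_sample`, and evaluates it for a one-year and a two-year time series starting from the `START_DATE` used here.

```python
from datetime import datetime, timedelta


def earliest_sample(start_date, ts_length, ts_offset=1, ts_freq=1):
    """Earliest sampling date whose full time series fits after start_date."""
    start = datetime.strptime(start_date, "%Y-%m-%d")
    return start + timedelta(days=ts_length * ts_freq + ts_offset - 1)


one_year = earliest_sample("2000-03-01", 365)
two_years = earliest_sample("2000-03-01", 730)
print(one_year.date(), two_years.date())  # 2001-03-01 2002-03-01
```

Samples collected before this date cannot have a complete time series and would be rejected.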

### Directories and Files

In [6]:
# Sub-directories for GEE extracts
GEE_DIR = f"GEE_{common.PROJ.replace(':', '-')}_{int(common.SCALE)}"
MODIS_DIR = os.path.join(common.LFMC_DATA_DIR, GEE_DIR, "MCD43A4")

# File Names
SITES_INPUT = os.path.join(common.DATASETS_DIR, "LFMC_australia_sites.csv")
SAMPLES_INPUT = os.path.join(common.DATASETS_DIR, "LFMC_australia_samples.csv")
SAMPLES_OUTPUT = os.path.join(common.DATASETS_DIR, f"australia_samples_{MODIS_TS_LENGTH * MODIS_TS_FREQ}days.csv")
MODIS_OUTPUT = os.path.join(common.DATASETS_DIR, f"australia_modis_{MODIS_TS_LENGTH * MODIS_TS_FREQ}days.csv")

# Create output directories if necessary
os.makedirs(MODIS_DIR, exist_ok=True)
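
To show what the generated directory name looks like, here is a small sketch of the `GEE_DIR` naming rule. It assumes `common.PROJ` and `common.SCALE` hold the same values as the `PROJ` and `SCALE` constants defined above; the helper name `gee_dir_name` is hypothetical.

```python
def gee_dir_name(proj, scale):
    """Build the GEE sub-directory name from a projection and scale."""
    # ':' is replaced because it is not a valid file-name character on all platforms
    return f"GEE_{proj.replace(':', '-')}_{int(scale)}"


dir_name = gee_dir_name("SR-ORG:6974", 463.3127)
print(dir_name)  # GEE_SR-ORG-6974_463
```

Encoding the projection and scale in the name keeps extracts made with different settings in separate directories.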

## Main Processing
Connect to GEE

In [4]:
import ee
ee.Initialize()

### Generate MODIS sample data

For each site, get the sample data for each sample at the site

Note: This gets the data using the default MODIS scale and projection. To change the scale and/or projection, add calls to the `GeeTimeseriesExtractor.setProjScale` method.

In [9]:
sites = pd.read_csv(SITES_INPUT, float_precision="high")
samples = pd.read_csv(SAMPLES_INPUT, float_precision="high")
modis_extractor = GeeTimeseriesExtractor(PRODUCT, BANDS, START_DATE, END_DATE,
                                         dir_name=MODIS_DIR)
modis_data, valid_data, invalid_pixels, invalid_sites = extract_timeseries_data(
    modis_extractor, sites, samples, EARLIEST_SAMPLE,
    MODIS_TS_OFFSET, MODIS_TS_LENGTH, MODIS_TS_FREQ)

Processing site C10_1
Processing site C10_2
Processing site C10_3
Processing site C10_4
Processing site C10_5
Processing site C10_6
Processing site C10_7
Processing site C10_8
Processing site C10_9
Processing site C10_10
Processing site C10_11
Processing site C10_12
Processing site C10_13
Processing site C10_14
Processing site C10_15
Processing site C10_16
Processing site C10_17
Processing site C10_18
Processing site C10_19
Processing site C10_20
Processing site C10_21
Processing site C11_1
Processing site C11_2
Processing site C11_3
Processing site C11_4
Processing site C11_5
Processing site C11_6
Processing site C11_7
Processing site C11_8
Processing site C11_9
Processing site C11_10
Processing site C11_11
Extracting data for C11_11 (lat: -37.47708 long: 145.23275)
Processing site C11_12
Processing site C11_13
Processing site C11_14
Processing site C11_15
Processing site C11_16
Processing site C11_17
Processing site C11_18
Processing site C18_1
Processing site C18_2
Processing site C

Summary of sites/samples not extracted

In [10]:
print(f'Invalid sites: {len(invalid_sites)}; Invalid pixels: {len(invalid_pixels)}; ')
print(invalid_sites)
print(invalid_pixels)

Invalid sites: 0; Invalid pixels: 0; 
[]
[]


### Save Results
Save and display sample reflectance data

In [11]:
modis_data = pd.DataFrame(modis_data)
ts_days = (MODIS_TS_LENGTH - 1) * MODIS_TS_FREQ
modis_data.columns = ["ID"] + [f'{day-MODIS_TS_OFFSET:04}_{band+1}'
                               for day in range(-ts_days, 1, MODIS_TS_FREQ)
                               for band in range(len(BANDS))]
modis_data.sort_values('ID', inplace=True, key=lambda x: x.apply(sort_key))
modis_data.to_csv(MODIS_OUTPUT, index=False)
modis_data

Unnamed: 0,ID,-365_1,-365_2,-365_3,-365_4,-365_5,-365_6,-365_7,-364_1,-364_2,...,-002_5,-002_6,-002_7,-001_1,-001_2,-001_3,-001_4,-001_5,-001_6,-001_7
0,C10_1_1,634,4030,304,723,3911,2288,1020,657,4036,...,3765,2372,1142,798,3840,384,870,3766,2379,1137
1,C10_1_2,589,3731,292,706,3628,2208,1145,615,3725,...,4010,3073,1856,1084,3162,538,909,3994,3128,1880
2,C10_1_3,716,3315,335,708,3402,2244,1053,735,3306,...,3619,2987,2058,1233,2750,512,1069,3709,2989,2136
3,C10_1_4,1220,2742,616,971,3695,3484,2211,1216,2710,...,3441,3203,1963,957,2573,481,796,3438,3182,1929
4,C10_2_1,995,3537,498,939,4321,3406,1987,1036,3510,...,4381,3702,1989,1504,3363,567,1030,4415,3718,2062
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
383,C18_3_22,447,1875,248,424,2297,1596,808,450,1947,...,2611,1612,812,482,2106,261,444,2556,1655,834
384,C18_3_25,410,2114,225,399,2308,1450,684,409,2117,...,2452,1553,763,408,2056,204,363,2482,1528,765
385,C18_3_28,520,2190,229,469,2477,1524,706,519,2188,...,2339,1532,732,435,2034,241,415,2371,1533,734
386,C18_3_31,420,1895,231,385,2086,1440,639,420,1887,...,2435,1305,656,354,2014,179,340,2462,1310,663
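
The column names in the table above follow the pattern `<day>_<band>`, where `<day>` is the zero-padded number of days before the sampling date. The sketch below reproduces the naming expression from the cell above in a hypothetical helper, `modis_column_names`, using a tiny time series so the full list is easy to read.

```python
def modis_column_names(n_bands, ts_length, ts_offset=1, ts_freq=1):
    """Generate the ID column plus one column per (day, band) pair."""
    ts_days = (ts_length - 1) * ts_freq
    return ["ID"] + [f"{day - ts_offset:04}_{band + 1}"
                     for day in range(-ts_days, 1, ts_freq)
                     for band in range(n_bands)]


small = modis_column_names(n_bands=2, ts_length=3)
print(small)
# ['ID', '-003_1', '-003_2', '-002_1', '-002_2', '-001_1', '-001_2']

full = modis_column_names(n_bands=7, ts_length=365)
print(len(full), full[1], full[-1])  # 2556 -365_1 -001_7
```

The width of 4 in the format specifier includes the minus sign, which is why the day offsets appear as three digits.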


Save and display the valid samples

In [12]:
valid_samples = samples[valid_data].sort_values('ID', key=lambda x: x.apply(sort_key))
valid_samples.to_csv(SAMPLES_OUTPUT, index=False)
valid_samples

Unnamed: 0,ID,Latitude,Longitude,Sampling date,Sampling year,LC Category,Land Cover,LFMC value,Site,Czone1,...,Czone3,Day_sin,Day_cos,Long_sin,Long_cos,Lat_norm,Elevation,Slope,Aspect_sin,Aspect_cos
322,C10_1_1,-37.63542,144.22103,2008-10-20,2008,Grassland,Grassland,260.57000,C10_1,C,...,Cfb,0.94560,-0.32534,0.58466,-0.81128,0.29091,0.08333,0.02654,-0.54559,0.83805
323,C10_1_2,-37.63542,144.22103,2008-11-10,2008,Grassland,Grassland,162.34000,C10_1,C,...,Cfb,0.76941,-0.63875,0.58466,-0.81128,0.29091,0.08333,0.02654,-0.54559,0.83805
324,C10_1_3,-37.63542,144.22103,2008-12-01,2008,Grassland,Grassland,132.66000,C10_1,C,...,Cfb,0.49378,-0.86959,0.58466,-0.81128,0.29091,0.08333,0.02654,-0.54559,0.83805
325,C10_1_4,-37.63542,144.22103,2009-01-19,2009,Grassland,Grassland,95.81000,C10_1,C,...,Cfb,-0.30492,-0.95238,0.58466,-0.81128,0.29091,0.08333,0.02654,-0.54559,0.83805
40,C10_2_1,-35.40625,149.80151,2006-01-05,2006,Agriculture,Mosaic cropland (>50%) / natural vegetation (t...,63.00000,C10_2,C,...,Cfb,-0.06880,-0.99763,0.50300,-0.86429,0.30330,0.11424,0.02920,-0.80444,0.59404
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
246,C18_3_22,-35.60625,148.86310,2015-12-23,2015,Forest,"Tree cover, broadleaved, evergreen, closed to ...",163.75463,C18_3,C,...,Cfb,0.15431,-0.98802,0.51708,-0.85593,0.30219,0.21129,0.16830,-0.98912,-0.14709
247,C18_3_25,-35.60625,148.86310,2016-01-18,2016,Forest,"Tree cover, broadleaved, evergreen, closed to ...",126.33867,C18_3,C,...,Cfb,-0.28848,-0.95749,0.51708,-0.85593,0.30219,0.21129,0.16830,-0.98912,-0.14709
248,C18_3_28,-35.60625,148.86310,2016-02-16,2016,Forest,"Tree cover, broadleaved, evergreen, closed to ...",136.38340,C18_3,C,...,Cfb,-0.71166,-0.70253,0.51708,-0.85593,0.30219,0.21129,0.16830,-0.98912,-0.14709
249,C18_3_31,-35.60625,148.86310,2016-09-02,2016,Forest,"Tree cover, broadleaved, evergreen, closed to ...",145.09527,C18_3,C,...,Cfb,0.88001,0.47495,0.51708,-0.85593,0.30219,0.21129,0.16830,-0.98912,-0.14709
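
Both tables are ordered by sample ID using `sort_key`, which is imported from `data_extract_utils` and not shown here. The output suggests a natural ordering in which numeric parts compare as numbers, so `C10_2` precedes `C10_10`. The sketch below is one possible way to build such a key; `natural_key` is a hypothetical stand-in, not the actual `sort_key`.

```python
import re


def natural_key(sample_id):
    """Split an ID into text and number parts so numbers compare numerically."""
    parts = [p for p in re.split(r"(\d+)", sample_id) if p]
    # Tag each part so integers are never compared directly with strings
    return [(0, int(p)) if p.isdigit() else (1, p) for p in parts]


ids = ["C10_10_1", "C10_2_1", "C10_1_2", "C9_1_1"]
plain_order = sorted(ids)
natural_order = sorted(ids, key=natural_key)
print(plain_order)
print(natural_order)
```

A plain string sort would place `C10_10_1` before `C10_1_2`, which is why the notebook applies a key function element-wise through `sort_values(key=...)`.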
