# Extract MODIS Site Data and Generate Samples
## MODIS Site Data
For each site in the LFMC sample data, extract the full time series of MODIS reflectance and snow-cover data and save them to CSV files. Note: if the output CSV files already exist, they are assumed to be correct and are not overwritten. The data are gap-filled after being saved but before being used to create the MODIS sample data.
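
The skip-if-exists behaviour can be sketched as follows. This is a minimal illustration, not the `GeeTimeseriesExtractor` implementation: `load_or_extract`, the `extract` callback and the `id` index label are hypothetical, and the real extractor may organise its files differently.

```python
import os

import pandas as pd


def load_or_extract(site_name, dir_name, extract):
    """Return the data for a site, reusing an existing CSV file if there is one.

    `extract` is a hypothetical callback that returns a DataFrame for the site.
    """
    file_name = os.path.join(dir_name, f"{site_name}.csv")
    if os.path.exists(file_name):
        # An existing file is assumed to be correct and is not overwritten
        return pd.read_csv(file_name, index_col="id", parse_dates=True)
    data = extract(site_name)
    data.to_csv(file_name, index_label="id")
    return data
```

With this pattern, re-running after a failure only repeats the extraction for sites that have no CSV file yet.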
## MODIS Sample Data
For each sample, extract the time series of MODIS reflectance data. The time series length is set by `MODIS_TS_LENGTH`. A sample is rejected if the full time series cannot be extracted (its start or end falls outside the full site time series) or if the snow-cover data show the pixel was snow-covered on the sampling date. The extracted MODIS data are combined into a single DataFrame and saved. A second LFMC sample dataset containing only the valid samples is also created.
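
As an illustration of the first rejection rule, the sketch below takes a fixed-length window ending `offset` days before the sampling date and returns `None` when that window falls outside the site time series. It is an assumption about how such a window could be taken from a daily, date-indexed DataFrame, not the `get_sample_data` code in `data_extract_utils`; `extract_window` is a hypothetical name.

```python
from datetime import timedelta

import numpy as np
import pandas as pd


def extract_window(date_str, site_df, offset=1, length=365, freq=1):
    """Return `length` rows ending `offset` days before the sample date, flattened."""
    end = pd.Timestamp(date_str) - timedelta(days=offset)
    start = end - timedelta(days=(length - 1) * freq)
    # Reject the sample if the window starts or ends outside the site time series
    if start < site_df.index[0] or end > site_df.index[-1]:
        return None
    # Select the daily rows in the window, then keep every `freq`-th row
    window = site_df.loc[start:end].iloc[::freq]
    # Flatten day by day, so all bands for one day are adjacent
    return window.to_numpy().flatten()


# Toy site data: 30 days, 2 bands
toy_site = pd.DataFrame(np.arange(60).reshape(30, 2),
                        index=pd.date_range("2000-01-01", periods=30))
print(extract_window("2000-01-11", toy_site, length=5))  # 5 days x 2 bands
print(extract_window("2000-01-03", toy_site, length=5))  # None: starts too early
```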
## Input Files
- `LFMC_sites.csv` and `LFMC_samples.csv` created by the `Extract Auxiliary Data.ipynb` notebook.

## Output Files
- The extracted site data are saved in `MODIS_DIR` and `SNOW_DIR` (by default located in `DATA_DIR/GEE_DIR`). Both directories contain a CSV file for each site.
- The extracted MODIS data for each sample and the updated samples data are saved in `FINAL_DIR` (by default the Datasets sub-directory of `DATA_DIR/EXTRACT_NAME`). File names include the time series length of the extracted MODIS data (e.g. `365days`), so with the default settings they are `modis_365days.csv` and `samples_365days.csv`.

## Notes
1. This notebook should be run after running the `Extract Auxiliary Data.ipynb` notebook.
2. The generated name for `GEE_DIR` includes the projection and scale of the extracted GEE data. 
3. It will take about 8.5 hours to run if there are no existing site extracts.
4. 824 samples will not be extracted due to snow cover.
5. 1005 samples will not be extracted as they were collected before March 2001 (so less than one year of MODIS data is available).
6. There should be no invalid sites, but occasionally extraction from GEE will fail for a site. If this happens, re-run the notebook (keep the existing site CSV files so they are not re-extracted).

In [None]:
import os
import numpy as np
import pandas as pd
import re
import time
from datetime import datetime
from datetime import timedelta

import initialise
import common
from timeseries_extractor import GeeTimeseriesExtractor
from data_extract_utils import get_sample_data, sort_key

### Program parameters and constants

GEE parameters: the current settings extract the data needed to reproduce the first paper
- Reflectance product is MCD43A4 - daily reflectance, each value derived from a 16-day retrieval window
- Snow cover product is MOD10A1 - daily snow cover as MOD10A2 is not in GEE.
- Scale set to use native MODIS resolution

In [None]:
# MODIS time series constants
MODIS_TS_LENGTH = 365
MODIS_TS_OFFSET = 1
MODIS_TS_FREQ = 1

# MODIS data details
PRODUCT = "MODIS/006/MCD43A4"
BANDS = ["Nadir_Reflectance_Band1",
         "Nadir_Reflectance_Band2",
         "Nadir_Reflectance_Band3",
         "Nadir_Reflectance_Band4",
         "Nadir_Reflectance_Band5",
         "Nadir_Reflectance_Band6",
         "Nadir_Reflectance_Band7"]
SNOW_PRODUCT = "MODIS/006/MOD10A1"
SNOW_BANDS = ["NDSI_Snow_Cover"]

EARLIEST_SAMPLE = datetime.strptime(common.START_DATE, '%Y-%m-%d') + timedelta(
    days=MODIS_TS_LENGTH * MODIS_TS_FREQ + MODIS_TS_OFFSET)
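
For intuition about `EARLIEST_SAMPLE`, here is the same arithmetic with a hypothetical start date of 2000-03-01. The actual value comes from `common.START_DATE`, which is not shown in this notebook.

```python
from datetime import datetime, timedelta

start_date = "2000-03-01"  # hypothetical; the notebook uses common.START_DATE
ts_length, ts_offset, ts_freq = 365, 1, 1

earliest_sample = datetime.strptime(start_date, "%Y-%m-%d") + timedelta(
    days=ts_length * ts_freq + ts_offset)
print(earliest_sample.date())  # 2001-03-02
```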

### Directories and Files

In [None]:
# Sub-directories for GEE extracts
GEE_DIR = f"GEE_{common.PROJ.replace(':', '-')}_{int(common.SCALE)}"
MODIS_DIR = os.path.join(common.LFMC_DATA_DIR, GEE_DIR, "MCD43A4")
SNOW_DIR = os.path.join(common.LFMC_DATA_DIR, GEE_DIR, "MOD10A1")

# File Names
SAMPLES_OUTPUT = os.path.join(common.DATASETS_DIR, f"samples_{MODIS_TS_LENGTH * MODIS_TS_FREQ}days.csv")
MODIS_OUTPUT = os.path.join(common.DATASETS_DIR, f"modis_{MODIS_TS_LENGTH * MODIS_TS_FREQ}days.csv")

# Create output directories if necessary
os.makedirs(MODIS_DIR, exist_ok=True)
os.makedirs(SNOW_DIR, exist_ok=True)

## Main Processing
Connect to GEE

In [None]:
import ee
ee.Initialize()

Check if a sample was collected in snow conditions

In [None]:
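def is_snow_sample(date_str, bands_df, snow_df):
    """Return True if the snow-cover value on the sampling date is at least 10."""
    # bands_df is unused; only the snow-cover band on the sampling date is checked
    sample_date = datetime.strptime(date_str, '%Y-%m-%d')
    return snow_df[SNOW_BANDS[0]][sample_date] >= 10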
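
A small, self-contained illustration of the same threshold check, using made-up snow-cover values. Treating a missing date as "no snow" via `Series.get` is an assumption added for this sketch and is not part of `is_snow_sample`; `snow_on` is a hypothetical name.

```python
import pandas as pd

SNOW_THRESHOLD = 10  # same threshold as is_snow_sample

# Made-up NDSI snow-cover values for three days
toy_snow = pd.Series([0.0, 45.0, float("nan")],
                     index=pd.to_datetime(["2015-01-01", "2015-01-02", "2015-01-03"]),
                     name="NDSI_Snow_Cover")


def snow_on(date_str, snow_series, threshold=SNOW_THRESHOLD):
    """Return True if the snow-cover value on the given date reaches the threshold."""
    value = snow_series.get(pd.Timestamp(date_str))
    # A missing date (None) or a NaN value counts as "no snow" in this sketch
    return bool(value is not None and value >= threshold)


for day in ["2015-01-01", "2015-01-02", "2015-01-03", "2015-01-10"]:
    print(day, snow_on(day, toy_snow))
```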

### Generate MODIS sample data

For each site
- Get the reflectance and snow data
- Then get the sample data for each sample at the site

Note: With the current settings, this extracts the data at the native MODIS scale and projection. To change the scale and/or projection, change the arguments passed to the `GeeTimeseriesExtractor.set_proj_scale` method.

In [None]:
sites = pd.read_csv(common.LFMC_SITES, float_precision="high")
samples = pd.read_csv(common.LFMC_SAMPLES, float_precision="high")
modis_extractor = GeeTimeseriesExtractor(PRODUCT, BANDS, common.START_DATE, common.END_DATE,
                                         dir_name=MODIS_DIR)
modis_extractor.set_proj_scale(common.PROJ, common.SCALE)
snow_extractor = GeeTimeseriesExtractor(SNOW_PRODUCT, SNOW_BANDS, common.START_DATE, common.END_DATE,
                                        dir_name=SNOW_DIR)
snow_extractor.set_proj_scale(common.PROJ, common.SCALE)
modis_data = []
valid_data = [False] * samples.shape[0]
invalid_pixels = []
snow_pixels = []
invalid_sites = []
for site_idx, site in sites.iterrows():
    print(f'Processing site {site.Site}')
    site_samples = samples[samples.Site == site.Site]
    try:
        modis_df = modis_extractor.get_and_save_data(site)
        snow_df = snow_extractor.get_and_save_data(site)
    except Exception:  # a bare except would also swallow KeyboardInterrupt
        print(f'Failed to extract data for {site.Site}')
        invalid_sites.append(site.Site)
        continue
    for index, sample in site_samples.iterrows():
        if is_snow_sample(sample["Sampling date"], modis_df, snow_df):
            print(f'Snow pixel: {sample["Sampling date"]}')
            snow_pixels.append(index)
        else:
            sample_data = get_sample_data(sample["Sampling date"], modis_df,
                                          MODIS_TS_OFFSET, MODIS_TS_LENGTH, MODIS_TS_FREQ)
            if sample_data is None or np.isnan(sample_data.sum()):
                invalid_pixels.append(index)
            else:
                modis_data.append([sample.ID] + list(sample_data))
                valid_data[index] = True

Summary of sites/samples not extracted

In [None]:
print(f'Invalid sites: {len(invalid_sites)}; Invalid pixels: {len(invalid_pixels)}; '
      f'Snow pixels: {len(snow_pixels)}')
print(invalid_sites)
print(invalid_pixels)
print(snow_pixels)

### Save Results
Save and display sample reflectance data

In [None]:
modis_data = pd.DataFrame(modis_data)
ts_days = (MODIS_TS_LENGTH - 1) * MODIS_TS_FREQ
modis_data.columns = ["ID"] + [f'{day-MODIS_TS_OFFSET:04}_{band+1}'
                               for day in range(-ts_days, 1, MODIS_TS_FREQ)
                               for band in range(len(BANDS))]
modis_data.sort_values('ID', inplace=True, key=lambda x: x.apply(sort_key))
modis_data.to_csv(MODIS_OUTPUT, index=False)
modis_data
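
To see what the generated column names look like, the same expression can be run with a short, hypothetical configuration (3 days, 2 bands) in place of the notebook settings:

```python
ts_length, ts_offset, ts_freq = 3, 1, 1  # hypothetical values for illustration
n_bands = 2

ts_days = (ts_length - 1) * ts_freq
columns = ["ID"] + [f"{day - ts_offset:04}_{band + 1}"
                    for day in range(-ts_days, 1, ts_freq)
                    for band in range(n_bands)]
print(columns)
# ['ID', '-003_1', '-003_2', '-002_1', '-002_2', '-001_1', '-001_2']
```

Each name is the day relative to the sampling date (zero-padded, negative) followed by the band number, with all bands for one day adjacent.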

Save and display the valid samples

In [None]:
valid_samples = samples[valid_data].sort_values('ID', key=lambda x: x.apply(sort_key))
valid_samples.to_csv(SAMPLES_OUTPUT, index=False)
valid_samples