# Extract PRISM Site Data and Generate Samples
## PRISM Site Data
For each site in the LFMC sample data, extract the full PRISM time series, gap-fill it, and save it to a CSV file. Note: if the output CSV files already exist, they are assumed to be correct and are not overwritten.
## PRISM Sample Data
For each sample, extract the PRISM time series. The time series length is set by `PRISM_TS_LENGTH`. A sample is rejected if the full time series cannot be extracted (its start or end falls outside the full site time series). The extracted PRISM data are combined into a single dataframe and saved. Another LFMC sample dataset containing only the valid samples can also be created.
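
The rejection rule above can be illustrated with a small sketch. It is not the code used by `extract_timeseries_data`; the function name `sample_window` and its arguments are hypothetical, and the defaults mirror the notebook's `PRISM_TS_*` constants.

```python
from datetime import date, timedelta


def sample_window(sample_date, site_start, site_end,
                  ts_length=365, ts_offset=1, ts_freq=1):
    """Return (first_day, last_day) of a sample's time series window.

    Hypothetical helper: returns None when the window is not fully
    contained in the site time series, i.e. the sample is rejected.
    """
    # The window ends ts_offset days before the sampling date
    last_day = sample_date - timedelta(days=ts_offset)
    # ts_length elements spaced ts_freq days apart span this many days
    first_day = last_day - timedelta(days=(ts_length - 1) * ts_freq)
    if first_day < site_start or last_day > site_end:
        return None
    return first_day, last_day


# Assumed site coverage, for illustration only
site_start, site_end = date(2000, 3, 1), date(2020, 12, 31)
print(sample_window(date(2001, 3, 1), site_start, site_end))   # accepted
print(sample_window(date(2001, 2, 28), site_start, site_end))  # None (rejected)
```

With these assumed dates, a sample taken one day earlier than 1 March 2001 is rejected because its window would start before the site series does.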

## Input Files
- `LFMC_sites.csv` and `samples_365days.csv` created by the `Extract Auxiliary Data.ipynb` and `Extract MODIS DATA.ipynb` notebooks.

## Output Files
- The extracted sites data are created in `PRISM_DIR` (by default located in `DATA_DIR/GEE_DIR`). The directory will contain a CSV file for each site.
- The extracted PRISM data for all samples are created in `FINAL_DIR` (by default the Datasets sub-directory of `DATA_DIR/EXTRACT_NAME`). The file name includes the time series length (e.g. 365days) of the extracted PRISM data, so with the default settings it is `prism_365days.csv`.

## Notes
1. This notebook should be run after running the `Extract Auxiliary Data.ipynb` and `Extract MODIS DATA.ipynb` notebooks.
2. The generated name for `GEE_DIR` includes the projection and scale of the extracted GEE data. 
3. It will take about 6.5 hours to run if there are no existing site extracts.
4. There should be no invalid sites or samples, but occasionally extraction from GEE will fail for a site. If this happens, re-run the notebook (keep the existing site CSV files so they are not re-extracted).
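
The "keep existing files" behaviour in note 4 can be sketched as follows. This is an assumption about the general pattern, not the actual logic inside `GeeTimeseriesExtractor`; the helper `extract_if_missing`, the `<site_id>.csv` naming, and the `extract_fn` callback are all hypothetical.

```python
import os
import tempfile


def extract_if_missing(site_id, out_dir, extract_fn):
    """Write the site's CSV only if it does not already exist.

    Hypothetical helper: returns (path, extracted), where extracted is
    False when an existing file was reused.
    """
    path = os.path.join(out_dir, f"{site_id}.csv")  # assumed file naming
    if os.path.exists(path):
        # An existing file is assumed to be correct and is left untouched
        return path, False
    with open(path, "w") as f:
        f.write(extract_fn(site_id))
    return path, True


calls = []


def fake_extract(site_id):
    # Stand-in for a slow GEE request; records each call
    calls.append(site_id)
    return "date,ppt\n2000-03-01,0.0\n"


with tempfile.TemporaryDirectory() as tmp:
    first = extract_if_missing("C1_1", tmp, fake_extract)
    second = extract_if_missing("C1_1", tmp, fake_extract)  # simulated re-run
    print(first[1], second[1], len(calls))  # True False 1
```

On a re-run, only sites whose files are missing trigger a new extraction, which is why a failed run can be resumed cheaply.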

In [None]:
import os
import numpy as np
import pandas as pd
import re
import time
from datetime import datetime
from datetime import timedelta

import initialise
import common
from timeseries_extractor import GeeTimeseriesExtractor
from data_extract_utils import get_sample_data, extract_timeseries_data, sort_key

### Program parameters and constants

GEE Parameters
- Weather product is OREGONSTATE/PRISM/AN81d - daily weather data from PRISM group
- Scale/proj set to convert to MODIS resolution/projection
- Start date is 01/03/2000, to match MODIS data availability

In [None]:
# PRISM time series constants
PRISM_TS_LENGTH = 365
PRISM_TS_OFFSET = 1      # ts end this number of days before the sampling date
PRISM_TS_FREQ = 1        # days between consecutive elements in the ts

# PRISM data details
PRODUCT = "OREGONSTATE/PRISM/AN81d"
BANDS = ["ppt", "tmean", "tmin", "tmax", "tdmean", "vpdmin", "vpdmax"]

EARLIEST_SAMPLE = datetime.strptime(common.START_DATE, '%Y-%m-%d') + timedelta(
    days=PRISM_TS_LENGTH * PRISM_TS_FREQ + PRISM_TS_OFFSET - 1)
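
As a worked example of the `EARLIEST_SAMPLE` arithmetic, the sketch below repeats the calculation with an assumed start date of `2000-03-01`. The real value comes from `common.START_DATE`, which is not shown here, and the function name `earliest_sample` is hypothetical.

```python
from datetime import datetime, timedelta


def earliest_sample(start_date, ts_length, ts_offset, ts_freq):
    """First sampling date whose full time series fits after start_date."""
    return datetime.strptime(start_date, "%Y-%m-%d") + timedelta(
        days=ts_length * ts_freq + ts_offset - 1)


# Assumed start date, for illustration only
earliest = earliest_sample("2000-03-01", ts_length=365, ts_offset=1, ts_freq=1)
print(earliest.date())  # 2001-03-01
```

With the default constants, the time series for a sample taken on this date runs from the start date to the day before sampling, so any earlier sample cannot be fully extracted.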

### Directories and Files

In [None]:
# Directories
GEE_DIR = f"GEE_{common.PROJ.replace(':', '-')}_{int(common.SCALE)}"
PRISM_DIR = os.path.join(common.LFMC_DATA_DIR, GEE_DIR, "PRISM")

# File Names
SAMPLES_INPUT = os.path.join(common.DATASETS_DIR, "samples_365days.csv")
PRISM_OUTPUT = os.path.join(common.DATASETS_DIR, f"prism_{PRISM_TS_LENGTH * PRISM_TS_FREQ}days.csv")

# Create output directories if necessary
os.makedirs(PRISM_DIR, exist_ok=True)
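
To show how the directory name encodes the projection and scale, here is a minimal sketch using made-up values. `EPSG:4326` and `500.0` are assumptions for illustration; the real values come from `common.PROJ` and `common.SCALE`. The helper `gee_dir_name` is hypothetical.

```python
import os
import tempfile


def gee_dir_name(proj, scale):
    """Build the GEE directory name from a projection code and a scale."""
    # ':' is replaced because it is not a safe character in directory names
    return f"GEE_{proj.replace(':', '-')}_{int(scale)}"


name = gee_dir_name("EPSG:4326", 500.0)  # assumed projection and scale
print(name)  # GEE_EPSG-4326_500

with tempfile.TemporaryDirectory() as data_dir:
    prism_dir = os.path.join(data_dir, name, "PRISM")
    # exist_ok=True makes the call safe to repeat on later runs
    os.makedirs(prism_dir, exist_ok=True)
    os.makedirs(prism_dir, exist_ok=True)
    created = os.path.isdir(prism_dir)
    print(created)  # True
```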


Connect to GEE

In [None]:
import ee
ee.Initialize()

Generate the PRISM sample data:

For each site
- Get the PRISM weather data at MODIS proj/scale
- Then get the sample data for each sample at the site

In [None]:
sites = pd.read_csv(common.LFMC_SITES, float_precision="high")
samples = pd.read_csv(SAMPLES_INPUT, float_precision="high")
prism_extractor = GeeTimeseriesExtractor(PRODUCT, BANDS, common.START_DATE, common.END_DATE,
                                         gap_fill=False, dir_name=PRISM_DIR)
prism_extractor.set_proj_scale(common.PROJ, common.SCALE)
prism_data, valid_data, invalid_pixels, invalid_sites = extract_timeseries_data(
    prism_extractor, sites, samples, EARLIEST_SAMPLE, PRISM_TS_OFFSET, PRISM_TS_LENGTH, PRISM_TS_FREQ)

Summary of sites/samples not extracted

In [None]:
print(f'Invalid sites: {len(invalid_sites)}; Invalid pixels: {len(invalid_pixels)}')
print(invalid_sites)
print(invalid_pixels)

Save and display sample meteorological data

In [None]:
prism_data = pd.DataFrame(prism_data)
ts_days = (PRISM_TS_LENGTH - 1) * PRISM_TS_FREQ
prism_data.columns = ["ID"] + [f'{day-PRISM_TS_OFFSET:04}_{band+1}'
                               for day in range(-ts_days, 1, PRISM_TS_FREQ)
                               for band in range(len(BANDS))]
prism_data.sort_values('ID', inplace=True, key=lambda x: x.apply(sort_key))
prism_data
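
The column naming scheme is easier to see on a small case. The sketch below applies the same comprehension to a 3-day time series with 2 bands; these sizes are illustrative only, and the function name `column_names` is hypothetical.

```python
def column_names(ts_length, ts_offset, ts_freq, n_bands):
    """Build 'ID' plus one '<day>_<band>' column per day and band."""
    ts_days = (ts_length - 1) * ts_freq
    return ["ID"] + [f"{day - ts_offset:04}_{band + 1}"
                     for day in range(-ts_days, 1, ts_freq)
                     for band in range(n_bands)]


cols = column_names(ts_length=3, ts_offset=1, ts_freq=1, n_bands=2)
print(cols)
# ['ID', '-003_1', '-003_2', '-002_1', '-002_2', '-001_1', '-001_2']
```

Each name is the day relative to the sampling date (negative, zero-padded to four characters including the sign) followed by the 1-based band index. With the notebook's defaults, the columns run from `-365_1` to `-001_7`.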

In [None]:
prism_data.to_csv(PRISM_OUTPUT, index=False, float_format='%.3f')

Save and display the valid samples
- This code should only be run if the PRISM data for some sites/samples could not be extracted.
- If it is needed, the data for the invalid samples also needs to be removed from the MODIS data.

In [None]:
# SAMPLES_OUTPUT = os.path.join(common.DATASETS_DIR, f"samples_{PRISM_TS_LENGTH * PRISM_TS_FREQ}days_prism.csv")
# valid_samples = samples[valid_data].sort_values('ID', key=lambda x: x.apply(sort_key))
# valid_samples.to_csv(SAMPLES_OUTPUT, index=False)
# valid_samples