# Extract PRISM Site Data and Generate Samples
## PRISM Site Data
For each site in the LFMC sample data, extract the full time series of PRISM data, gap-fill it, and save it to a CSV file. Note: if an output CSV file already exists, it is assumed to be correct and is not overwritten.
## PRISM Sample Data
For each sample, extract the PRISM time series. The time series length is set by `PRISM_TS_LENGTH`. A sample is rejected if its full time series cannot be extracted (that is, the start or end falls outside the full site time series). The extracted PRISM data are combined into a single dataframe and saved. Another LFMC sample dataset containing only the valid samples is also created.
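
The rejection rule can be sketched as follows. This is a minimal illustration, not the notebook's `extract_timeseries_data` implementation: the function name `extract_window` and the dict-based site series are assumptions made for the example.

```python
from datetime import date, timedelta

def extract_window(site_series, sample_date, ts_length=365, ts_offset=1, ts_freq=1):
    """Return the values for the window ending `ts_offset` days before
    `sample_date`, or None if any required date is missing from the site series.

    `site_series` is a hypothetical stand-in for a site's time series:
    a dict mapping each date to its value.
    """
    end = sample_date - timedelta(days=ts_offset)
    start = end - timedelta(days=(ts_length - 1) * ts_freq)
    wanted = [start + timedelta(days=i * ts_freq) for i in range(ts_length)]
    if any(day not in site_series for day in wanted):
        return None  # reject: the window extends outside the site series
    return [site_series[day] for day in wanted]

# Toy site series: 10 consecutive days starting 2000-03-01
series = {date(2000, 3, 1) + timedelta(days=i): float(i) for i in range(10)}
accepted = extract_window(series, date(2000, 3, 6), ts_length=5)  # window fits
rejected = extract_window(series, date(2000, 3, 3), ts_length=5)  # starts too early
```

Here `accepted` holds the five values for 1 to 5 March, while `rejected` is `None` because the window would start before the site series does.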

## Input Files
- `LFMC_sites.csv` and `samples_365days.csv` created by the `Extract Auxiliary Data.ipynb` and `Extract MODIS DATA.ipynb` notebooks.

## Output Files
- The extracted site data are written to `PRISM_DIR` (by default located in `DATA_DIR/GEE_DIR`). The directory will contain one CSV file for each site.
- The extracted PRISM data for all samples are written to `FINAL_DIR` (by default, the Datasets sub-directory of `DATA_DIR/EXTRACT_NAME`). The file name includes the time series length (e.g. 365days) of the extracted PRISM data, so with the default settings it is `prism_365days.csv`.

## Notes
1. This notebook should be run after running the `Extract Auxiliary Data.ipynb` and `Extract MODIS DATA.ipynb` notebooks.
2. The generated name for `GEE_DIR` includes the projection and scale of the extracted GEE data. 
3. It will take about 6.5 hours to run if there are no existing site extracts.
4. There should be no invalid sites or samples, but occasionally extraction from GEE will fail for a site. If this happens, re-run the notebook, keeping the existing site CSV files so they are not re-extracted.
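
The re-run behaviour described in note 4 relies on skipping sites whose CSV file already exists. A minimal sketch of that pattern is shown below. The helper `load_or_extract`, the `<site>.csv` file naming, and the fake extractor are all assumptions for illustration; the actual logic lives in `GeeTimeseriesExtractor`.

```python
import os
import tempfile

def load_or_extract(site_id, out_dir, extract_fn):
    """Return (csv_path, was_extracted).

    `extract_fn` (hypothetical) is only called when no CSV exists for the site,
    so existing files are treated as correct and never overwritten.
    """
    path = os.path.join(out_dir, f"{site_id}.csv")  # assumed naming scheme
    if os.path.exists(path):
        return path, False
    csv_text = extract_fn(site_id)
    with open(path, "w", newline="") as f:
        f.write(csv_text)
    return path, True

calls = []  # records which sites were actually extracted

def fake_extract(site_id):
    # Stand-in for a slow GEE request
    calls.append(site_id)
    return "date,ppt\n2000-03-01,0.0\n"

with tempfile.TemporaryDirectory() as tmp:
    _, first = load_or_extract("C4_1", tmp, fake_extract)   # extracts and saves
    _, second = load_or_extract("C4_1", tmp, fake_extract)  # reuses the saved file
```

On the second call the extractor is not invoked, which is why a re-run after a partial failure only has to fetch the sites that are still missing.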

In [1]:
import os
import numpy as np
import pandas as pd
import re
import time
from datetime import datetime
from datetime import timedelta

import initialise
import common
from timeseries_extractor import GeeTimeseriesExtractor
from data_extract_utils import get_sample_data, extract_timeseries_data, sort_key

### Program parameters and constants

GEE Parameters
- Weather product is OREGONSTATE/PRISM/AN81d - daily weather data from PRISM group
- Scale/proj set to convert to MODIS resolution/projection
- Start date is 01/03/2000, to match MODIS data availability

In [2]:
# PRISM time series constants
PRISM_TS_LENGTH = 365
PRISM_TS_OFFSET = 1      # ts end this number of days before the sampling date
PRISM_TS_FREQ = 1        # days between consecutive elements in the ts

# PRISM data details
PRODUCT = "OREGONSTATE/PRISM/AN81d"
BANDS = ["ppt", "tmean", "tmin", "tmax", "tdmean", "vpdmin", "vpdmax"]

EARLIEST_SAMPLE = datetime.strptime(common.START_DATE, '%Y-%m-%d') + timedelta(
    days=PRISM_TS_LENGTH * PRISM_TS_FREQ + PRISM_TS_OFFSET - 1)
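
As a worked example of the `EARLIEST_SAMPLE` arithmetic with the default constants: the sketch below assumes `common.START_DATE` is `"2000-03-01"` (reading the start date above as day/month/year), which is not confirmed by this notebook.

```python
from datetime import datetime, timedelta

START_DATE = "2000-03-01"  # assumed value of common.START_DATE
PRISM_TS_LENGTH, PRISM_TS_OFFSET, PRISM_TS_FREQ = 365, 1, 1

start = datetime.strptime(START_DATE, "%Y-%m-%d")
earliest_sample = start + timedelta(
    days=PRISM_TS_LENGTH * PRISM_TS_FREQ + PRISM_TS_OFFSET - 1)

# Time series window for a sample taken on the earliest allowed date
window_end = earliest_sample - timedelta(days=PRISM_TS_OFFSET)
window_start = window_end - timedelta(days=(PRISM_TS_LENGTH - 1) * PRISM_TS_FREQ)

print(earliest_sample.date(), window_start.date(), window_end.date())
```

Under these assumptions the earliest usable sampling date is 2001-03-01, and its 365-day window runs from 2000-03-01 to 2001-02-28, so the window starts exactly on the first day of available data.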

### Directories and Files

In [3]:
# Directories
GEE_DIR = f"GEE_{common.PROJ.replace(':', '-')}_{int(common.SCALE)}"
PRISM_DIR = os.path.join(common.LFMC_DATA_DIR, GEE_DIR, "PRISM")

# File Names
SAMPLES_INPUT = os.path.join(common.DATASETS_DIR, "samples_365days.csv")
PRISM_OUTPUT = os.path.join(common.DATASETS_DIR, f"prism_{PRISM_TS_LENGTH * PRISM_TS_FREQ}days.csv")

# Create output directories if necessary
os.makedirs(PRISM_DIR, exist_ok=True)
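
To illustrate how the `GEE_DIR` name is built from the projection and scale, here is a small sketch. The projection and scale values and the `data` root directory are hypothetical; the real ones come from `common.PROJ`, `common.SCALE` and `common.LFMC_DATA_DIR`.

```python
import os

def gee_dir_name(proj, scale):
    """Build the directory name the same way as the GEE_DIR expression."""
    # The colon in the projection code is replaced with a hyphen
    # (presumably to keep the name valid as a directory name), and the
    # scale is truncated to an integer.
    return f"GEE_{proj.replace(':', '-')}_{int(scale)}"

name_a = gee_dir_name("EPSG:4326", 500.0)        # hypothetical values
name_b = gee_dir_name("SR-ORG:6974", 463.3127)   # hypothetical values
prism_dir = os.path.join("data", name_a, "PRISM")
print(name_a, name_b, prism_dir)
```

Note that `int()` truncates rather than rounds, so two scales that differ only in their fractional part map to the same directory.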


Connect to GEE

In [4]:
import ee
ee.Initialize()

Generate the PRISM sample data:

For each site
- Get the PRISM weather data at MODIS proj/scale
- Then get the sample data for each sample at the site

In [5]:
sites = pd.read_csv(common.LFMC_SITES, float_precision="high")
samples = pd.read_csv(SAMPLES_INPUT, float_precision="high")
prism_extractor = GeeTimeseriesExtractor(PRODUCT, BANDS, common.START_DATE, common.END_DATE,
                                         gap_fill=False, dir_name=PRISM_DIR)
prism_extractor.set_proj_scale(common.PROJ, common.SCALE)
prism_data, valid_data, invalid_pixels, invalid_sites = extract_timeseries_data(
    prism_extractor, sites, samples, EARLIEST_SAMPLE, PRISM_TS_OFFSET, PRISM_TS_LENGTH, PRISM_TS_FREQ)

Processing site C4_1
Processing site C4_2
Processing site C4_3
Processing site C4_4
Processing site C4_5
Processing site C6_1
Processing site C6_2
Processing site C6_3
Processing site C6_4
Processing site C6_5
Processing site C6_6
Processing site C6_7
Processing site C6_8
Processing site C6_9
Processing site C6_10
Processing site C6_11
Processing site C6_12
Processing site C6_13
Processing site C6_14
Processing site C6_16
Processing site C6_17
Processing site C6_18
Processing site C6_19
Processing site C6_20
Processing site C6_21
Processing site C6_22
Processing site C6_23
Processing site C6_24
Processing site C6_25
Processing site C6_26
Processing site C6_27
Processing site C6_28
Processing site C6_29
Processing site C6_30
Processing site C6_31
Processing site C6_32
Processing site C6_33
Processing site C6_34
Processing site C6_35
Processing site C6_36
Processing site C6_37
Processing site C6_38
Processing site C6_39
Processing site C6_40
Processing site C6_41
Processing site C6_42
Pr

Summary of sites/samples not extracted

In [6]:
print(f'Invalid sites: {len(invalid_sites)}; Invalid pixels: {len(invalid_pixels)}')
print(invalid_sites)
print(invalid_pixels)

Invalid sites: 0; Invalid pixels: 0
[]
[]


Save and display sample meteorological data

In [7]:
prism_data = pd.DataFrame(prism_data)
ts_days = (PRISM_TS_LENGTH - 1) * PRISM_TS_FREQ
prism_data.columns = ["ID"] + [f'{day-PRISM_TS_OFFSET:04}_{band+1}'
                               for day in range(-ts_days, 1, PRISM_TS_FREQ)
                               for band in range(len(BANDS))]
prism_data.sort_values('ID', inplace=True, key=lambda x: x.apply(sort_key))
prism_data

Unnamed: 0,ID,-365_1,-365_2,-365_3,-365_4,-365_5,-365_6,-365_7,-364_1,-364_2,...,-002_5,-002_6,-002_7,-001_1,-001_2,-001_3,-001_4,-001_5,-001_6,-001_7
0,C4_1_1,0.000,19.181002,9.138,29.225000,2.761,4.577,34.377998,0.000,15.612000,...,2.324,2.142,26.906000,0.000,19.244001,7.757000,30.733000,-1.192,7.623,39.293999
1,C4_1_2,0.000,19.182001,9.776,28.589001,0.280,6.004,33.238998,0.000,20.149000,...,0.787,6.008,34.004002,0.000,20.738001,8.927000,32.549999,1.033,6.802,43.630001
2,C4_1_3,0.106,22.410002,11.989,32.832001,7.399,3.191,40.682999,2.056,21.825001,...,1.864,11.652,60.967999,0.000,27.316002,16.101999,38.530998,3.543,13.631,61.556000
3,C4_1_4,0.000,21.671001,10.949,32.393002,-1.210,7.321,42.755001,0.000,23.180000,...,6.118,6.084,45.021000,0.000,24.412001,14.290000,34.535000,5.910,7.033,46.325001
4,C4_1_5,1.118,13.986001,8.879,19.094000,5.555,2.281,13.716000,0.036,15.522000,...,3.389,6.698,39.536999,5.135,22.674002,14.260000,31.089001,4.224,8.342,37.481998
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
66941,C13_4_14,0.000,16.835001,8.450,25.222000,8.925,1.019,21.334999,0.000,18.342001,...,1.700,1.551,34.394001,0.083,21.123001,8.113000,34.134998,8.382,1.042,43.375999
66942,C13_4_15,0.000,16.171001,1.513,30.830999,-1.002,1.443,40.081001,0.000,16.585001,...,-5.181,2.223,29.249001,0.000,13.284000,0.760000,25.808001,-2.967,1.859,29.101999
66943,C13_4_16,0.000,18.229000,5.212,31.246000,3.250,1.614,38.695000,0.000,17.530001,...,1.346,1.997,27.805000,0.000,13.436001,3.855000,23.017000,-0.485,2.391,23.080000
66944,C13_4_17,3.320,9.257000,-0.051,18.566000,4.774,0.200,12.979000,0.000,9.228001,...,1.065,1.251,26.354000,0.000,13.158001,-0.076000,26.393000,-0.126,0.752,29.087000
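
The column labels in the table above encode the day offset relative to the sampling date and the 1-based band index in `BANDS` order (so `-365_1` is `ppt` 365 days before sampling and `-001_7` is `vpdmax` on the day before sampling). The sketch below reproduces the naming logic from the cell above in a standalone function; the name `prism_columns` is introduced here only for illustration.

```python
def prism_columns(ts_length, ts_offset, ts_freq, n_bands):
    """Generate column labels of the form '<day>_<band>' plus a leading 'ID'."""
    ts_days = (ts_length - 1) * ts_freq
    return ["ID"] + [f"{day - ts_offset:04}_{band + 1}"
                     for day in range(-ts_days, 1, ts_freq)
                     for band in range(n_bands)]

# Small case: 3 days, 2 bands
small = prism_columns(ts_length=3, ts_offset=1, ts_freq=1, n_bands=2)
print(small)

# Notebook defaults: 365 days, 7 bands
full = prism_columns(ts_length=365, ts_offset=1, ts_freq=1, n_bands=7)
print(len(full), full[1], full[-1])
```

With the notebook defaults this gives one `ID` column plus 365 × 7 = 2555 data columns, running from `-365_1` to `-001_7`.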


In [8]:
prism_data.to_csv(PRISM_OUTPUT, index=False, float_format='%.3f')

Save and display the valid samples
- This code should only be run if the PRISM data for some sites/samples could not be extracted.
- If it is needed, the data for the invalid samples also needs to be removed from the MODIS data.

In [9]:
# SAMPLES_OUTPUT = os.path.join(common.DATASETS_DIR, f"samples_{PRISM_TS_LENGTH * PRISM_TS_FREQ}days_prism.csv")
# valid_samples = samples[valid_data].sort_values('ID', key=lambda x: x.apply(sort_key))
# valid_samples.to_csv(SAMPLES_OUTPUT, index=False)
# valid_samples