# v2.3 run RNN Class with Spatial Training

This notebook is a guide to using the RNN code in this project. It walks through the core functionality: pre-processing the data, setting model hyperparameters, structuring data to feed into the RNN, and evaluating prediction error with spatiotemporal cross-validation.

## Setup

For clarity, some functions are imported again in the code cells of the sections where they are used, but everything the notebook needs is imported in this setup cell.

In [8]:
# Standard library
import copy
import logging
import os.path as osp
import pickle
import time

# Third-party packages
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import yaml

# Project modules
import reproducibility
from data_funcs import rmse, to_json, combine_nested, build_train_dict
from moisture_models import run_augmented_kf
from moisture_rnn import RNNParams, RNNData, RNN, rnn_data_wrap
from moisture_rnn_pkl import pkl2train
from utils import (print_dict_summary, print_first, str2time, logging_setup,
                   hash2, read_yml, read_pkl, retrieve_url, Dict)

In [6]:
from IPython.display import Markdown, display

# Helper function to make documentation a little prettier
def print_markdown_docstring(func):
    display(Markdown(f"```python\n{func.__doc__}\n```"))

## Acquiring Data

This project expects its input data as nested dictionaries with a particular structure. These dictionaries are produced by the process `build_fmda_dicts` within the `wrfxpy` branch `develop-72-jh`, and the resulting files are staged remotely as `pickle` files on the OpenWFM Demo site. The data consist of ground-based observations from RAWS stations and atmospheric data from the HRRR weather model interpolated to the location of each RAWS site. The data were collected by specifying a time period and a spatial bounding box; all RAWS with fuel moisture content (FMC) sensors within those bounds and that time frame were included.

<mark>NOTE: as of 2024-10-22 the wrfxpy code still needs to be merged with the latest changes from Angel. The code that makes fmda dictionaries shouldn't depend much on other changes within wrfxpy.</mark>

The first step is to retrieve the files with the function `retrieve_url`. It calls `wget` as a subprocess and saves the file to a target path if the file doesn't already exist there. You can force a download with the `force_download` argument. The function documentation is printed below, then the function is called with f-strings to keep the code concise.

In [5]:
print_markdown_docstring(retrieve_url)

```python

    Downloads a file from a specified URL to a destination path.

    Parameters:
    -----------
    url : str
        The URL from which to download the file.
    dest_path : str
        The destination path where the file should be saved.
    force_download : bool, optional
        If True, forces the download even if the file already exists at the destination path.
        Default is False.

    Warnings:
    ---------
    Prints a warning if the file extension of the URL does not match the destination file extension.

    Raises:
    -------
    AssertionError:
        If the download fails and the file does not exist at the destination path.

    Notes:
    ------
    This function uses the `wget` command-line tool to download the file. Ensure that `wget` is 
    installed and accessible from the system's PATH.

    Prints:
    -------
    A message indicating whether the file was downloaded or if it already exists at the 
    destination path.
    
```

In [7]:
filename = "fmda_rocky_202403-05_f05.pkl"
retrieve_url(
    url = f"https://demo.openwfm.org/web/data/fmda/dicts/{filename}", 
    dest_path = f"data/{filename}")

Target data already exists at data/fmda_rocky_202403-05_f05.pkl
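
The download-if-missing behavior described in the docstring can be sketched as follows. This is only an illustration under stated assumptions, not the project's actual `retrieve_url` implementation: the name `fetch_if_missing` is hypothetical, and the sketch assumes `wget` is available on the system's PATH.

```python
import os
import subprocess

def fetch_if_missing(url, dest_path, force_download=False):
    """Hypothetical stand-in for retrieve_url; returns True if a download ran."""
    # Skip the download when the file is already present, unless forced
    if os.path.exists(dest_path) and not force_download:
        print(f"Target data already exists at {dest_path}")
        return False
    # Make sure the destination directory exists before calling wget
    os.makedirs(os.path.dirname(dest_path) or ".", exist_ok=True)
    # -O writes the downloaded content to dest_path
    subprocess.run(["wget", "-O", dest_path, url], check=True)
    assert os.path.exists(dest_path), f"Download failed: {dest_path}"
    return True
```

Returning a boolean is a design choice of this sketch; it lets a caller tell a fresh download apart from a reused local copy.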


### Exploring the Nested Dictionary Structure 

The data dictionaries have the following structure:

* Top-level keys are RAWS station IDs followed by an additional string related to the time period.
* For each RAWS site, there are 3 subdictionaries holding different types of data that pertain to that location.
    - A `loc` subdictionary with static information about the physical location of the RAWS site. This includes the station ID, longitude, latitude, elevation, and two grid coordinates named "pixel_x" and "pixel_y". <mark>These will be renamed to "grid_coordinate" in the future.</mark> The grid coordinates correspond to the transformation of the lon/lat coordinates of the RAWS site onto the regular HRRR grid.
    - A `RAWS` subdictionary that includes at least FMC observations and the associated times returned by Synoptic. These times may not line up perfectly with the requested regular hours. In addition to the FMC data, any available ground-based sensor data for variables relevant to FMC were collected. These data are intended to be used to validate the accuracy of the interpolated HRRR data.
    - A `HRRR` subdictionary that includes atmospheric variables relevant to FMC. The formatted table below shows the variables used by this project. <mark>More variables will be collected in the future.</mark>
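
To make the layout concrete, here is a minimal sketch of a toy dictionary with the same three-level structure and a loop that walks it. The `loc` values are copied from the CPTC2 summary printed in this notebook; the arrays are made up for illustration, and the assumption that keys follow the pattern `<station ID>_<YYYYMM>` is inferred from the key names shown in this notebook.

```python
import numpy as np

# Toy dictionary mimicking the described layout; array values are made up
toy = {
    "CPTC2_202403": {
        "loc": {"STID": "CPTC2", "lat": 38.45964, "lon": -109.04731,
                "elev": 8124, "pixel_x": 565.3953218828111,
                "pixel_y": 509.89701435338947},
        "RAWS": {"fm": np.array([8.2, np.nan, 9.1]),
                 "temp": np.array([270.1, 271.3, 272.0])},
        "HRRR": {"temp": np.array([269.8, 271.0, 272.4]),
                 "rh": np.array([55.0, 52.1, 49.7])},
    }
}

for key, case in toy.items():
    # Assumption: keys look like "<station ID>_<YYYYMM>"
    stid, period = key.rsplit("_", 1)
    print(stid, period, case["loc"]["lat"], case["loc"]["lon"])
    # Report the shape of every array stored for this station
    for source in ("RAWS", "HRRR"):
        for name, values in case[source].items():
            print(f"  {source}.{name}: shape {values.shape}")
```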

In [10]:
dat.keys()

dict_keys(['CPTC2_202403', 'CHAC2_202403', 'CHRC2_202403', 'DYKC2_202403', 'LKGC2_202403', 'CCEC2_202403', 'RDKC2_202403', 'RFRC2_202403', 'SAWC2_202403', 'WLCC2_202403', 'CCRU1_202403', 'HSRU1_202403', 'YLSU1_202403', 'BRLW4_202403', 'GTGW4_202403', 'SAWW4_202403', 'SPKW4_202403', 'ESPC2_202403', 'MRFC2_202403', 'PKLC2_202403', 'BRAU1_202403', 'NLPU1_202403', 'TS010_202403', 'CUHC2_202403', 'BAWC2_202403', 'BTAC2_202403', 'SOPC2_202403', 'BMOC2_202403', 'CYNC2_202403', 'TR223_202403', 'TCTM8_202403', 'TR337_202403', 'VLRW4_202403', 'TR383_202403', 'LEIW4_202403', 'TR390_202403', 'TS001_202403', 'HSYN1_202403', 'HRSN1_202403', 'SBFN1_202403', 'DOHS2_202403', 'BKFS2_202403', 'CRRS2_202403', 'NMOS2_202403', 'RDCS2_202403', 'TGSK1_202403', 'QNRK1_202403', 'RESN1_202403', 'VRFN1_202403', 'KSHC2_202403', 'TR563_202403', 'CGLK1_202403', 'MRLS2_202403', 'TR755_202403', 'DEOI4_202403', 'CCYC2_202403', 'HBOM8_202403', 'TR937_202403', 'TR956_202403', 'PINS2_202403', 'DVLW4_202403', 'TS040_202403

In [11]:
dat['CPTC2_202403'].keys()

dict_keys(['loc', 'RAWS', 'HRRR'])

In [12]:
dat['CPTC2_202403']['loc'].keys()

dict_keys(['STID', 'lat', 'lon', 'elev', 'pixel_x', 'pixel_y'])

In [9]:
dat = read_pkl(f"data/{filename}")
print_dict_summary(dat)

loading file data/fmda_rocky_202403-05_f05.pkl
CPTC2_202403
     loc
           STID : CPTC2
           lat : 38.45964
           lon : -109.04731
           elev : 8124
           pixel_x : 565.3953218828111
           pixel_y : 509.89701435338947
     RAWS
          temp: NumPy array of shape (2206,), min: 265.37199999999996, max: 298.15
          fm: NumPy array of shape (2206,), min: nan, max: nan
          precip_accum: NumPy array of shape (2206,), min: 234.442, max: 329.692
          rh: NumPy array of shape (2206,), min: nan, max: nan
          solar: NumPy array of shape (2206,), min: nan, max: nan
          wind: NumPy array of shape (2206,), min: 0.0, max: 9.836
          time_raws: NumPy array of shape (2206,), type object
           hours : 2206
          rain: NumPy array of shape (2206,), min: nan, max: nan
          time: NumPy array of shape (2208,), type object
          Ed: NumPy array of shape (2206,), min: nan, max: nan
          Ew: NumPy array of shape (2206,), m
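
Several RAWS arrays in the summary above report `min: nan, max: nan`. That is most likely because those arrays contain missing values, since NumPy's plain reductions propagate NaN. A minimal sketch, using made-up values rather than values from the data file, of how one might inspect the valid range and count the gaps:

```python
import numpy as np

# Made-up FMC values with gaps; not taken from the data file
fm = np.array([8.2, np.nan, 9.1, 10.4, np.nan])

# A plain reduction propagates NaN
print(np.min(fm))

# NaN-aware reductions give the range of the valid observations
print(np.nanmin(fm), np.nanmax(fm))

# Number of missing observations
n_missing = int(np.isnan(fm).sum())
print(n_missing)
```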

In [14]:
band_df_hrrr = pd.DataFrame({
    'Band': [616, 620, 624, 628, 629, 661, 561, 612, 643],
    'hrrr_name': ['TMP', 'RH', 'WIND', 'PRATE', 'APCP',
                  'DSWRF', 'SOILW', 'CNWAT', 'GFLUX'],
    'dict_name': ['temp', 'rh', 'wind', 'rain', 'precip_accum',
                  'solar', 'soilm', 'canopyw', 'groundflux'],
    # One description per band, in the same order as the columns above
    'descr': ['2m Temperature [K]',
              '2m Relative Humidity [%]',
              '10m Wind Speed [m/s]',
              'surface Precip. Rate [kg/m^2/s]',
              'surface Total Precipitation [kg/m^2]',
              'surface Downward Short-Wave Radiation Flux [W/m^2]',
              '0.0m below ground Volumetric Soil Moisture Content [Fraction]',
              'Plant Canopy Surface Water [kg/m^2]',
              'surface Ground Heat Flux [W/m^2]']
})

band_df_hrrr

| | Band | hrrr_name | dict_name | descr |
|---|---|---|---|---|
| 0 | 616 | TMP | temp | 2m Temperature [K] |
| 1 | 620 | RH | rh | 2m Relative Humidity [%] |
| 2 | 624 | WIND | wind | 10m Wind Speed [m/s] |
| 3 | 628 | PRATE | rain | surface Precip. Rate [kg/m^2/s] |
| 4 | 629 | APCP | precip_accum | surface Total Precipitation [kg/m^2] |
| 5 | 661 | DSWRF | solar | surface Downward Short-Wave Radiation Flux [W/... |
| 6 | 561 | SOILW | soilm | 0.0m below ground Volumetric Soil Moisture Con... |
| 7 | 612 | CNWAT | canopyw | Plant Canopy Surface Water [kg/m^2] |
| 8 | 643 | GFLUX | groundflux | surface Ground Heat Flux [W/m^2] |
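
A table like this can double as a lookup between HRRR variable names, band numbers, and the dictionary keys used in this project. A minimal sketch, rebuilding only three rows of the table so that the block runs on its own; the variable names `bands`, `hrrr_to_dict`, and `band_of` are illustrative and not part of the project code:

```python
import pandas as pd

# Three rows copied from the band table, enough to show the lookups
bands = pd.DataFrame({
    "Band": [616, 620, 624],
    "hrrr_name": ["TMP", "RH", "WIND"],
    "dict_name": ["temp", "rh", "wind"],
})

# Map HRRR variable names to the keys used in the data dictionaries
hrrr_to_dict = dict(zip(bands["hrrr_name"], bands["dict_name"]))
print(hrrr_to_dict["RH"])

# Look up the band number for a given dictionary key
band_of = bands.set_index("dict_name")["Band"]
print(int(band_of["wind"]))
```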


## Formatting Data