# v2.3 run RNN Class with Spatial Training

This notebook is a guide to using the RNN code in this project. It walks through the core functionality: pre-processing data, setting model hyperparameters, structuring data to feed into the RNN, and evaluating prediction error with spatiotemporal cross-validation.

## Setup

For clarity, some functions are imported again in the code cells of the sections where they are used, but the core imports are collected in this setup cell.

In [None]:
import numpy as np
from utils import print_dict_summary, print_first, str2time, logging_setup
import pickle
import logging
import os.path as osp
from moisture_rnn_pkl import pkl2train
from moisture_rnn import RNNParams, RNNData, RNN, rnn_data_wrap
from utils import hash2, read_yml, read_pkl, retrieve_url, Dict
import reproducibility
from data_funcs import rmse, to_json, combine_nested, build_train_dict
from moisture_models import run_augmented_kf
import copy
import pandas as pd
import matplotlib.pyplot as plt
import yaml
import time

In [None]:
from IPython.display import Markdown, display

# Helper function to make documentation a little prettier
def print_markdown_docstring(func):
    display(Markdown(f"```python\n{func.__doc__}\n```"))

## Acquiring Data

This project expects input data as nested dictionaries with a particular structure. These dictionaries are produced by the process `build_fmda_dicts` within the `wrfxpy` branch `develop-72-jh`. The files are staged remotely as `pickle` files on the OpenWFM Demo site. The data consist of ground-based observations from RAWS stations and atmospheric data from the HRRR weather model interpolated to the location of each RAWS site. The data were collected by specifying a time period and a spatial bounding box; all RAWS stations with fuel moisture content (FMC) sensors within those bounds and time frame were included.

<mark>NOTE: as of 2024-10-22 the wrfxpy code still needs to be merged with the latest changes from Angel. The code that makes fmda dictionaries shouldn't depend much on other changes within wrfxpy.</mark>

The first step is to retrieve the files. The function is called `retrieve_url` and lives in the Python module `utils`. The `utils` functions are meant to be general purpose, not specific to this project. `retrieve_url` calls `wget` as a subprocess and saves the file to a target directory if the file doesn't already exist. You can force a download with a function argument. The function documentation is printed below; the function is then called with f-strings to keep the code concise.

In [None]:
print_markdown_docstring(retrieve_url)

In [None]:
filename = "fmda_rocky_202403-05_f05.pkl"
retrieve_url(
    url = f"https://demo.openwfm.org/web/data/fmda/dicts/{filename}", 
    dest_path = f"data/{filename}")
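
The download-if-missing behavior can be sketched with the standard library alone. This is only an illustration and not the project's `retrieve_url`, which calls `wget`; the function name `fetch_if_missing` and its `force` argument are hypothetical.

```python
import os
import urllib.request

def fetch_if_missing(url, dest_path, force=False):
    """Download url to dest_path unless the file already exists.

    Hypothetical stand-in for illustration; the project's retrieve_url
    calls wget as a subprocess instead of using urllib.
    """
    if os.path.exists(dest_path) and not force:
        return False  # nothing downloaded, existing file kept
    dest_dir = os.path.dirname(dest_path)
    if dest_dir:
        os.makedirs(dest_dir, exist_ok=True)  # create target directory if needed
    urllib.request.urlretrieve(url, dest_path)
    return True  # file was downloaded
```

Returning a boolean makes it easy to log whether a cached file was reused or a fresh copy was fetched.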

### Exploring the Nested Dictionary Structure 

The data dictionaries have the following structure:

* Top-level keys are RAWS station IDs followed by a string related to the time period (for example, `CPTC2_202403`).
* For each of the RAWS sites, there are 3 subdictionaries consisting of different types of data that pertain to that location.
    - A `loc` subdictionary that consists of static information about the physical location of the RAWS site. This includes station ID name, longitude, latitude, elevation, and two grid coordinates named "pixel_x" and "pixel_y". <mark>This will be renamed to "grid_coordinate" in the future</mark>. These correspond to the transformation of the lon/lat coordinates from the RAWS site onto the regular HRRR grid.
    - A `RAWS` subdictionary that includes at least FMC observations and the associated times returned by Synoptic. These times may not line up perfectly with the requested regular hours. In addition to the FMC data, any available ground-based sensor data for variables relevant to FMC were collected. These data are intended to be used to validate the accuracy of the interpolated HRRR data.
    - A `HRRR` subdictionary that includes atmospheric variables relevant to FMC. The formatted table below shows the variables used by this project, where band numbers come from [NOAA documentation](https://www.nco.ncep.noaa.gov/pmb/products/hrrr/hrrr.t00z.wrfprsf00.grib2.shtml). <mark>More variables will be collected in the future</mark>. The HRRR subdictionary is organized into forecast hours. Each forecast hour subdictionary should have the same variables, just at different times from the HRRR forecast.
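
To make the layout concrete, here is a toy dictionary with the same three-part nesting and a small helper that walks it. Every value is made up, and the station key, the `RAWS` key names (`fm`, `time_raws`), and the forecast-hour key names (`f01`, `f03`) are assumptions for illustration, not the actual keys in the project files.

```python
# Toy example of the nested layout; all values and most key names are placeholders
toy = {
    "STATION_202403": {                       # station ID plus time-period string
        "loc": {"STID": "STATION", "lon": -105.0, "lat": 40.0,
                "elev": 1600.0, "pixel_x": 10, "pixel_y": 20},
        "RAWS": {"fm": [10.1, 10.4, 10.2],    # FMC observations (key name assumed)
                 "time_raws": ["t0", "t1", "t2"]},
        "HRRR": {                             # one subdictionary per forecast hour
            "f01": {"temp": [280.0, 281.0, 282.0], "rh": [40.0, 38.0, 35.0]},
            "f03": {"temp": [280.5, 281.5, 282.5], "rh": [41.0, 39.0, 36.0]},
        },
    }
}

def describe(d, indent=0):
    """Return one line per key, showing nesting depth and leaf length."""
    lines = []
    for key, val in d.items():
        if isinstance(val, dict):
            lines.append(" " * indent + f"{key}:")
            lines.extend(describe(val, indent + 2))
        elif isinstance(val, list):
            lines.append(" " * indent + f"{key}: list of length {len(val)}")
        else:
            lines.append(" " * indent + f"{key}: {val!r}")
    return lines

print("\n".join(describe(toy)))
```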

In [None]:
dat = read_pkl(f"data/{filename}")

# Print top level keys, each corresponds to a RAWS site
dat.keys()

In [None]:
# Check structure within 
dat['CPTC2_202403'].keys()

In [None]:
print_dict_summary(dat['CPTC2_202403'])

In [None]:
# Print dataframe used to organize HRRR band retrievals
band_df_hrrr = pd.DataFrame({
    'Band': [616, 620, 624, 628, 629, 661, 561, 612, 643],
    'hrrr_name': ['TMP', 'RH', "WIND", 'PRATE', 'APCP',
                  'DSWRF', 'SOILW', 'CNWAT', 'GFLUX'],
    'dict_name': ["temp", "rh", "wind", "rain", "precip_accum",
                 "solar", "soilm", "canopyw", "groundflux"],
    # One description per band, in the same order as the columns above
    'descr': ['2m Temperature [K]', 
              '2m Relative Humidity [%]', 
              '10m Wind Speed [m/s]',
              'surface Precip. Rate [kg/m^2/s]',
              'surface Total Precipitation [kg/m^2]',
              'surface Downward Short-Wave Radiation Flux [W/m^2]',
              '0.0m below ground Volumetric Soil Moisture Content [Fraction]',
              'Plant Canopy Surface Water [kg/m^2]',
              'surface Ground Heat Flux [W/m^2]']
})

band_df_hrrr

## Data Processing - Reading and Cleaning Data

The `build_train_dict` function reads the previously described dictionary and processes it in a few ways. The function lives in the `data_funcs` python module, which is intended to include code that is specific to the particular formatting decisions of this project. The `build_train_dict` function can receive some important parameters that control how it processes the data:

* `params_data`: this is a configuration dictionary read from a file. An example file is saved in this project as `params_data.yaml`. It includes hyperparameters related to data filtering, which control how suspect data is flagged and filtered.
* `atm_source`: this specifies the subdictionary source for the atmospheric data. Currently this is one of "HRRR" or "RAWS".
* `forecast_step`: this specifies which HRRR forecast hour should be used. At the 0th hour, the HRRR weather model is very smooth and there is no accumulated precipitation yet. Within `wrfxpy`, the 3rd forecast hour is used.
* `spatial`: controls whether the separate locations are combined into a single dictionary. A reason not to combine them is to analyze timeseries at single locations more easily, for example to run the ODE+KF physical model of FMC.

The `build_train_dict` function performs the following operations:

* Reads a list of file names.
* Extracts FMC and all possible modeling variables. This includes:
    * Extracting static variables, like elevation, and repeating them for each hour of the timeseries to fit a tabular data format for machine learning.
    * Calculating derived features like hour of day and day of year.
    * Calculating hourly precipitation (mm/hr) from accumulated precipitation.
* Temporally interpolates RAWS data, including FMC, to line it up in time with the HRRR data. The HRRR data is always on a regular hourly interval, but the RAWS data can have missing values or return values not exactly on the hour requested.
* Shifts the atmospheric data by the given forecast hour (`forecast_step`). For example, to build a timeseries starting at 3pm using the 3hr HRRR forecast data, you would start your data with the 3hr forecast from noon.
* Performs a series of data filtering steps:
    * If specified, the total timeseries within the input dictionary is broken up into chunks of a specified number of `hours`. This makes the data filtering much easier: we want continuous timeseries for training the RNN models, so if stretches of RAWS data are missing it is easier to break the whole timeseries into smaller pieces and filter out the bad ones.
    * Physically reasonable min and max values for various variables are applied as filters.
    * Two main parameters control what is fully excluded from the training data:
        * `max_intp_time`: this is the maximum number of hours that is allowed for temporal interpolation. Any RAWS site with a longer stretch of missing data will be flagged and removed.
        * `zero_lag_threshold`: this is the maximum number of hours where there can be zero change in a variable before it is flagged as a broken sensor and values are set to NaN for that period.
        * NOTE: since this is training data for a model where ample input data is available, we err on the side of aggressively filtering out suspect data, since more can always be collected if volume is an issue. It is possible that sensors break nonrandomly, for example with more missing data in a particular season of the year. This merits further study.
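
The two exclusion rules in the list above can be sketched with NumPy. This is a minimal illustration under stated assumptions, not the project's implementation: the function names `longest_nan_run` and `flag_flat_runs` are hypothetical, and a "flat run" is assumed here to mean more than `threshold` consecutive identical hourly values.

```python
import numpy as np

def longest_nan_run(x):
    """Length of the longest run of consecutive NaN values in a 1-d array."""
    longest = current = 0
    for is_missing in np.isnan(x):
        current = current + 1 if is_missing else 0
        longest = max(longest, current)
    return longest

def flag_flat_runs(x, threshold):
    """Return a copy of x with NaN wherever the value stayed exactly constant
    for more than `threshold` consecutive samples (a stuck-sensor heuristic)."""
    x = np.asarray(x, dtype=float)
    out = x.copy()
    start = 0
    for i in range(1, len(x) + 1):
        # a run ends at the end of the array or when the value changes
        if i == len(x) or x[i] != x[start]:
            if i - start > threshold:
                out[start:i] = np.nan
            start = i
    return out

fm = np.array([10.0, 10.5, np.nan, np.nan, 11.0, 11.0, 11.0, 11.0, 12.0])
print(longest_nan_run(fm))        # longest gap, to compare with max_intp_time
print(flag_flat_runs(fm, 3))      # the four repeated 11.0 values become NaN
```

A series whose longest gap exceeds `max_intp_time` would be dropped rather than interpolated, while flat runs are only masked, so the surrounding valid data can still be used.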

In [None]:
params_data = read_yml("params_data.yaml") 
params_data

In [None]:
from data_funcs import build_train_dict

file_paths = [f"data/{filename}"]  # build_train_dict reads a list of file paths

In [None]:
train = build_train_dict(
    input_file_paths = file_paths, 
    atm_source="HRRR", 
    params_data = params_data, 
    forecast_step = 3,
    spatial=True, 
    verbose=True
)

In [None]:
# Print Data dictionary keys at the end of the process
train.keys()

## RNN Parameters Custom Classes

This project uses a few custom classes. The `RNNParams` custom class makes modeling easier and provides checks to avoid errors from incompatible models and data. It takes a dictionary as input. Dictionaries are used because they map easily onto the structure of a json or yaml file, two commonly used file formats for storing parameter lists. The parameters include a number of hyperparameters related to model architecture and data formatting decisions. The `RNNParams` object is needed to pre-process data for the RNN model, since it specifies things like the percentages of data used for train/validation/test. To use this custom class, read in the yaml file, for example with the `read_yml` utility function in this project, and create an `RNNParams` object from it.

These are some of the required elements of an input dictionary to make an `RNNParams` object, along with the internal checks associated with them:

* `features_list`: list of features by name intended to be used for modeling. See `features_list` from the previously processed object `train` for a list of all possible feature names.
    * Internally, a value `n_features` is calculated as the length of this list. This can only be done internally, and changing the features list automatically changes `n_features` to avoid the situation where there is any mismatch.
* `batch_size`, `timesteps`: these parameters control the input data shape. They must be integers.
    * Along with `features_list`, these parameters control the input layer to the model. The input data to the model will thus have shape `(batch_size, timesteps, n_features)`.
* `hidden_layers`, `hidden_units`, `hidden_activation`: each is a list that controls hidden layer specifications. Internal checks ensure that they are the same length. Layer type is one of "rnn" (simple RNN layer), "lstm", "attention", "dropout", or "dense". The units value specifies the number of cells and should be None for attention and dropout layers. The activation type is one of TensorFlow's recognized activations, including 'linear', 'tanh', and 'sigmoid'. Similarly, the activation type should be None for attention and dropout layers.
* `output_layer`, `output_activation`, `output_dimension`: Currently the output is a dense layer with 1 cell and linear activation. This is typical for a regression problem where the outcome is a single continuous scalar. Increasing `output_dimension` would require changing the target data structure, but this could be done if you want to output multiple time steps or values from multiple locations.
* `return_sequences`: whether the final recurrent layer should return an entire sequence or only the last time step in the input sequence. This is a tunable hyperparameter. Typically, False leads to better predictions for sequence-to-scalar networks, but True is likely required for sequence-to-sequence networks (not tested yet).
* `time_fracs`, `space_fracs`: these are lists that control the percentage of data used for cross-validation train/validation/test splits. Each must be a list of 3 floats that sum to 1. `time_fracs` partitions the data based on time, so train must precede validation in time, and validation must precede test in time. `space_fracs` randomly samples physical locations. A physical location should only be included in one of the train/validation/test sets.
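
The kinds of checks listed above can be sketched as a plain function over a dictionary. This is a hedged illustration only: `check_params` is a hypothetical name, the example values are made up, and the actual `RNNParams` class may run different or additional checks.

```python
def check_params(p):
    """Run consistency checks on a hyperparameter dictionary and
    return a copy with n_features derived from features_list."""
    p = dict(p)
    n_layers = len(p["hidden_layers"])
    if not (len(p["hidden_units"]) == len(p["hidden_activation"]) == n_layers):
        raise ValueError("hidden_layers, hidden_units, hidden_activation must have equal length")
    for key in ("batch_size", "timesteps"):
        if not isinstance(p[key], int):
            raise TypeError(f"{key} must be an integer")
    for key in ("time_fracs", "space_fracs"):
        fracs = p[key]
        if len(fracs) != 3 or abs(sum(fracs) - 1.0) > 1e-9:
            raise ValueError(f"{key} must be 3 values that sum to 1")
    p["n_features"] = len(p["features_list"])  # always derived, never set by hand
    return p

example = {
    "features_list": ["temp", "rh", "wind"],   # illustrative feature names
    "batch_size": 32,
    "timesteps": 12,
    "hidden_layers": ["lstm", "dropout", "dense"],
    "hidden_units": [32, None, 16],
    "hidden_activation": ["tanh", None, "relu"],
    "time_fracs": [0.8, 0.1, 0.1],
    "space_fracs": [0.8, 0.1, 0.1],
}
checked = check_params(example)
print(checked["n_features"])
```

Deriving `n_features` inside the check, rather than accepting it as input, is what rules out a mismatch between the feature list and the model's input layer.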

In [None]:
from moisture_rnn import RNNParams

In [None]:
file = read_yml("params.yaml", subkey = "rnn")
params = RNNParams(file)
params.update({
    'learning_rate': 0.0001
}) # update some params here for illustrative purposes

## RNN Data Custom Class

Using the input dictionary and the parameters discussed previously, we create a custom class `RNNData` which controls data scaling and restructuring. The important methods for this class are:

* `train_test_split`: this splits into train/validation/test sets based on both space and time. This should be run before scaling data, so that only training data is used to scale test data to avoid data leakage. NOTE: the main data `X` and `y` are still organized as lists of ndarrays at this point. This is to make handling spatial locations easier, but it might be desirable to switch these to higher dimensional arrays or pandas dataframes.
* `scale_data`: this applies the given scaler, either MinMax or Standard (Gaussian). The scaler is fit on the training data and applied to the validation and test data.
* `batch_reshape`: this method combines the lists of input and target data into 3-d arrays, based on the format `(batch_size, timesteps, n_features)`. This method uses a data structuring technique that allows stateful RNN models to be trained with data from multiple timeseries. For more information see the FMDA with Recurrent Neural Networks document, chapter XX <mark> add link </mark>
* `print_hashes`: this runs a utility `hash_ndarray` on all internal data in the object, producing a hash string for each data array. These hashes can be compared across runs to check reproducibility.
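
The order of operations described above (split, then scale with training statistics, then reshape, then hash) can be sketched for a single location. All function names here are hypothetical, and the windowing is a simple overlapping-window approach, not the stateful batching technique that `batch_reshape` uses.

```python
import hashlib
import numpy as np

def minmax_fit(X_train):
    """Per-feature min and max computed from the training data only."""
    return X_train.min(axis=0), X_train.max(axis=0)

def minmax_apply(X, lo, hi):
    """Scale X with statistics that were fit on the training data."""
    return (X - lo) / (hi - lo)

def make_windows(X, timesteps):
    """Stack overlapping windows into shape (n_samples, timesteps, n_features)."""
    n = X.shape[0] - timesteps + 1
    return np.stack([X[i:i + timesteps] for i in range(n)])

def hash_array(a):
    """Short hex digest of the array's bytes, for checking reproducibility."""
    return hashlib.md5(np.ascontiguousarray(a).tobytes()).hexdigest()[:8]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # one location: 100 hours, 3 features
X_train, X_test = X[:80], X[80:]         # split in time before scaling

lo, hi = minmax_fit(X_train)
X_train_s = minmax_apply(X_train, lo, hi)
X_test_s = minmax_apply(X_test, lo, hi)  # may fall outside [0, 1]

batches = make_windows(X_train_s, timesteps=12)
print(batches.shape)                     # (69, 12, 3)
print(hash_array(batches))
```

Because the test data is scaled with the training minimum and maximum, scaled test values can fall slightly outside `[0, 1]`; that is expected and is the price of avoiding leakage.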

In [None]:
from moisture_rnn import RNNData

In [None]:
# Set random seeds, affects random sample of locations
reproducibility.set_seed(123)

rnn_dat = RNNData(
    train, # input dictionary
    scaler=params['scaler'],  # data scaling type
    features_list = params['features_list'] # features for predicting outcome
)

In [None]:
rnn_dat.train_test_split(   
    time_fracs = params['time_fracs'], # Percent of total time steps used for train/val/test
    space_fracs = params['space_fracs'] # Percent of total timeseries used for train/val/test
)

In [None]:
rnn_dat.scale_data()

In [None]:
rnn_dat.batch_reshape(
    timesteps = params['timesteps'], # Timesteps aka sequence length for RNN input data. 
    batch_size = params['batch_size'], # Number of samples of length timesteps for a single round of grad. descent
    start_times = np.zeros(len(rnn_dat.loc['train_locs']))
)    

In [None]:
rnn_dat.print_hashes()

## RNN Model Class

### Building a Model

The `RNN` custom class streamlines building a model with different layers and makes training and prediction easier. It requires an `RNNParams` object as input to initialize. Several processes call a utility `hash_weights`, which produces a hash value for the model weights, which are stored as a list of ndarrays.

On initialization, the `RNN` object builds and compiles two neural networks based on the input hyperparameters. One network is used when calling `.fit`, which we will call the "training network". The training network restricts the input data shape to `(batch_size, timesteps, n_features)`. After fitting, the weights are copied over into another neural network, called the "prediction network", which is identical except that the input shape is relaxed to `(None, None, n_features)`. Two networks are used because certain training schemes, particularly stateful ones, require a consistent batch size across samples. When it comes time for prediction, however, we want to be able to predict at an arbitrary number of locations and an arbitrary number of timesteps. That is the purpose of the prediction network. The prediction network is not intended to be used for training; it always receives its weights copied over from the training network. For more information on train versus prediction networks, see Geron 2019 chapter 16 <mark> add cite </mark>. To illustrate this method we will redefine some parameters and examine the resulting networks.

To run `.fit`, you must set the random seed using the `reproducibility.py` module, which collects all the various types of random seeds that need setting in this project. In this project, tensorflow is configured to run deterministically to ensure reproducibility, so the random seed must be set or tensorflow throws errors.

In [None]:
from moisture_rnn import RNN, rnn_data_wrap
import reproducibility

In [None]:
params.update({
    'hidden_layers': ['dense', 'lstm', 'dense', 'dropout'],
    'hidden_units': [64, 32, 16, None],
    'hidden_activation': ['relu', 'tanh', 'relu', None],
    'return_sequences': False
})

In [None]:
reproducibility.set_seed(123)
model = RNN(params)

In [None]:
model.model_train.summary()

In [None]:
model.model_predict.summary()

Notice how in the training model, since we set `return_sequences` to False, the output shape loses a dimension. The final dense layer outputs a single value for each sample in the batch, so the output shape is `(batch_size, 1)`. For the prediction model, each layer accepts None for the first two dimensions. In practice, we use this to predict at a certain number of locations for an arbitrary number of timesteps. In both cases, the number of trainable parameters is the same. This shows the utility of using two separate models: we can leverage sophisticated training mechanisms that restrict the input data shape, then copy the weights over to a more flexible network that is easier to use for forecasting.
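
One way to see why the parameter counts match is that a recurrent layer's weight shapes depend only on the number of features and the number of units; the same weights are reused at every time step and for every sample in the batch. The NumPy sketch below is a simplified simple-RNN layer written for illustration (the function names are hypothetical and this is not the Keras implementation).

```python
import numpy as np

def init_simple_rnn(n_features, units, rng):
    """Weights of one simple recurrent layer; shapes depend only on
    n_features and units, never on batch size or sequence length."""
    return {
        "W_x": rng.normal(size=(n_features, units)),  # input-to-hidden
        "W_h": rng.normal(size=(units, units)),       # hidden-to-hidden
        "b": np.zeros(units),                         # bias
    }

def simple_rnn_forward(X, w):
    """Run the same cell over every time step; returns the last hidden state."""
    batch, timesteps, _ = X.shape
    h = np.zeros((batch, w["b"].shape[0]))
    for t in range(timesteps):
        h = np.tanh(X[:, t, :] @ w["W_x"] + h @ w["W_h"] + w["b"])
    return h

rng = np.random.default_rng(0)
w = init_simple_rnn(n_features=3, units=4, rng=rng)
n_params = sum(v.size for v in w.values())
print(n_params)                                   # 3*4 + 4*4 + 4 = 32

# Same weights, two different input shapes
out_train = simple_rnn_forward(rng.normal(size=(32, 12, 3)), w)
out_pred = simple_rnn_forward(rng.normal(size=(5, 200, 3)), w)
print(out_train.shape, out_pred.shape)            # (32, 4) (5, 4)
```

The batch and time dimensions only appear in the shapes of the data and the hidden state, so fixing them (training network) or leaving them as None (prediction network) does not change what has to be learned.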

<mark> Question for Jan: </mark> help me understand the linear algebra of why this works and why it's the same number of parameters.

### Running the Model

Internally, the `RNN` class has a `.fit` and a `.predict` method that access the relevant internal models. The fit method also sets up callbacks that control aspects of training, such as early stopping based on validation data error. Additionally, the fit method automatically sets the weights of the prediction model at the end.
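
As a rough illustration of early stopping on validation error, the sketch below applies a simple patience rule to a list of per-epoch validation losses. The function name and the `patience` argument are assumptions for illustration; the callbacks actually configured by `.fit` may use different criteria.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 0-indexed, for a list of
    per-epoch validation losses under a simple patience rule."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch      # no improvement for `patience` epochs
    return len(val_losses) - 1, best_epoch

losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.9]
print(early_stopping_epoch(losses, patience=3))   # (5, 2)
```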

We call `.fit` below. Note that this method will access internal hyperparameters that were used to initialize the object, such as the number of epochs and the batch size.

In [None]:
test_epochs = model.fit(
    rnn_dat.X_train, 
    rnn_dat.y_train,
    validation_data = (rnn_dat.X_val, rnn_dat.y_val),
    plot_history = True, # plots train/validation loss by epoch
    verbose_fit = True, # prints epochs updates
    return_epochs = True # returns the epoch used for early stopping. Used for hyperparam tuning
)

print(f"{test_epochs=}")

Next, we show that the fitted training model weights are identical to the prediction model weights. Then, we predict new values using the prediction model. The shape of the test data will be `(n_locations, n_times, n_features)`. This mirrors the earlier input format, but for the training model `batch_size` and `timesteps` were tunable hyperparameters. Here `n_locations` and `n_times` can be any integer values and are determined by the user based on their forecasting needs.

In [None]:
from utils import hash_weights

print(f"Fitted Training Model Weights Hash: {hash_weights(model.model_train)}")
print(f"Prediction Model Weights Hash:      {hash_weights(model.model_predict)}")

In [None]:
# Show test data format, (n_loc, n_times, n_features)
print(f"Number of Locations in Test Set:   {len(rnn_dat.loc['test_locs'])}")
print(f"Number of Features used in Model:  {model.params['n_features']}")

print(f"X_test shape:                      {rnn_dat.X_test.shape}")
print(f"y_test shape:                      {rnn_dat.y_test.shape}")


In [None]:
preds = model.predict(
    rnn_dat.X_test
)

print(f"{preds.shape = }")

Finally, we calculate the RMSE for each location. If desired, you could calculate the overall RMSE, but we are choosing to group by location and then average the results at the end. This methodology prioritizes accuracy across space, and avoids the situation where large errors at one location get masked by small errors at the other locations. We use a utility `rmse_3d` for this purpose which calculates means and squares across a 3d array in the proper way.

In [None]:
from utils import rmse_3d

print(f"{rmse_3d(preds, rnn_dat.y_test) = }")
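
The difference between averaging per-location RMSE values and computing one overall RMSE can be seen in a small NumPy example. This is a sketch of the idea under the assumption that arrays are shaped `(n_locations, n_times, 1)`; `rmse_by_location` is a hypothetical name and is not necessarily how `rmse_3d` is implemented.

```python
import numpy as np

def rmse_by_location(preds, y):
    """RMSE for each location of arrays shaped (n_locations, n_times, 1)."""
    sq_err = (preds - y) ** 2
    return np.sqrt(sq_err.mean(axis=(1, 2)))

# Two locations: small errors at the first, large errors at the second
y = np.zeros((2, 4, 1))
preds = np.zeros((2, 4, 1))
preds[0] += 1.0
preds[1] += 3.0

per_loc = rmse_by_location(preds, y)
overall = np.sqrt(((preds - y) ** 2).mean())
print(per_loc)          # [1. 3.]
print(per_loc.mean())   # 2.0, mean of the location RMSEs
print(overall)          # overall RMSE, sqrt(5)
```

Keeping the per-location values around, rather than only the average, makes it possible to spot individual stations where the model performs poorly.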

The `RNN` class has a method `run_model` which combines these steps based on just an input `RNNData` object. It prints out a lot of other information related to parameter configurations. We will reinitialize the model to show reproducibility. The method returns a list of model predictions for each test location and an RMSE associated with that location. Compare the printed weight hashes to before to ensure they match.

In [None]:
reproducibility.set_seed(123)
model = RNN(params)
m, errs = model.run_model(rnn_dat)

In [None]:
print(f"{errs.mean() = }")