# Build Machine Learning Dataset


The data retrieval process loops through a range of dates, retrieves and joins RAWS, HRRR, and other data sources, and saves the results to a local directory.

This notebook describes reading that data, applying the final set of quality control filters, and formatting the result so it can be fed into the various models used in this project.

## Setup

In [1]:
import os.path as osp
from datetime import datetime, timezone
from dateutil.relativedelta import relativedelta
import synoptic
import json
import sys
import numpy as np
import polars as pl
import pandas as pd
from sklearn.metrics import mean_squared_error
sys.path.append('../src')
from utils import Dict, read_yml, read_pkl, str2time, print_dict_summary, time_range, rename_dict
import models.moisture_models as mm
from models.moisture_models import XGB, LM
import ingest.RAWS as rr
import ingest.HRRR as ih
import data_funcs 


In [None]:
start = "2023-01-01T00:00:00Z"
end = "2023-01-06T23:00:00Z"

In [None]:
params_data = Dict(read_yml("../etc/params_data.yaml"))
print_dict_summary(params_data)

## Retrieve Data

The retrieved data is a nested dictionary with a top-level key for each RAWS station and subkeys for RAWS observations, atmospheric data (HRRR), geographic info, etc.

This format is used because the FMC models in this project require different data formatting. The ODE+KF physics-based model is run pointwise and does not incorporate info from other locations. The static ML models have the least restrictive input data structure, and all observations can be combined into one set of tabular data. The RNN models require input data structured as (batch_size, timesteps, features). Thus, it is simpler to keep the data separated by location and recombine it in various ways at the modeling step. Data filters for suspect RAWS sensors are applied in the next step rather than here, because raw data retrieval should not depend on hyperparameter choices related to data filters. It is easier to collect everything and apply filters later.
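As a rough illustration of why the per-location layout is convenient, the sketch below builds a toy nested dictionary and recombines it two ways. The station IDs, the `loc` subkey, and all values are made up for illustration. Only the `data` subkey mirrors what is used later in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical two-station dictionary; IDs, subkeys, and values are illustrative only
toy_dict = {
    "STA1": {
        "loc": {"lon": -105.0, "lat": 40.0, "elev": 1600.0},
        "data": pd.DataFrame({"Ed": [15.0, 15.5, 16.0, 16.2],
                              "rain": [0.0, 0.0, 0.1, 0.0],
                              "fm": [10.0, 10.2, 10.9, 11.0]}),
    },
    "STA2": {
        "loc": {"lon": -106.5, "lat": 39.2, "elev": 2800.0},
        "data": pd.DataFrame({"Ed": [18.0, 18.1, 17.9, 17.5],
                              "rain": [0.0, 0.2, 0.0, 0.0],
                              "fm": [12.0, 12.8, 12.6, 12.3]}),
    },
}

# Static ML view: stack every station into one table, keeping the station ID as a column
tabular = pd.concat(
    [d["data"].assign(stid=stid) for stid, d in toy_dict.items()],
    ignore_index=True,
)

# RNN view: one sequence per station, stacked to (batch_size, timesteps, features)
features = ["Ed", "rain"]
rnn_array = np.stack([d["data"][features].to_numpy() for d in toy_dict.values()])

print(tabular.shape)    # rows from both stations in one table
print(rnn_array.shape)  # (stations, timesteps, features)
```

The same dictionary feeds both views without re-reading any files, which is the main reason to delay recombination until the modeling step.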

In [None]:
paths = ["../data/rocky_fmda/202301/fmda_20230101.pkl", 
         "../data/rocky_fmda/202301/fmda_20230102.pkl",
         "../data/rocky_fmda/202301/fmda_20230103.pkl",
         "../data/rocky_fmda/202301/fmda_20230104.pkl",
         "../data/rocky_fmda/202301/fmda_20230105.pkl",
         "../data/rocky_fmda/202301/fmda_20230106.pkl"
        ]

In [None]:
import importlib
import data_funcs
importlib.reload(data_funcs)
from data_funcs import combine_fmda_files

In [None]:
raws_dict = data_funcs.combine_fmda_files(paths, save_path="../data/test_data/test_fmda_combined.pkl")

## Build ML Dataset

Filter data and merge RAWS and HRRR and other sources. The file `etc/params_data.yaml` has hyperparameters related to filtering data. The steps include:

- Determine atmospheric data source. Intended to be "HRRR" for production, but "RAWS" used for research purposes.
- Combine atmospheric data predictors with FMC
- Break timeseries into 72 hour periods, adding a column "st_period" starting at 0 (see README for info on why 72)
- Apply data filters to each 72 hour period of RAWS data and remove the periods that fail from the samples. HRRR data should already be QC'ed, so it is not filtered.
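The period-splitting step can be sketched as below. This is a minimal illustration, not the implementation in `build_ml_data`: the helper name `add_st_period` and the column name `date_time` are assumptions, and the real function may anchor periods differently or drop partial periods.

```python
import numpy as np
import pandas as pd

def add_st_period(df, time_col="date_time", hours=72):
    """Label each row with the index (starting at 0) of the `hours`-long period it falls in.

    Hypothetical helper: periods are anchored at the first timestamp in `df`.
    """
    elapsed = df[time_col] - df[time_col].min()
    elapsed_hours = elapsed.dt.total_seconds() // 3600
    out = df.copy()
    out["st_period"] = (elapsed_hours // hours).astype(int)
    return out

# 150 hourly rows: two full 72 hour periods plus a 6 hour remainder
start = pd.Timestamp("2023-01-01T00:00:00Z")
toy = pd.DataFrame({
    "date_time": start + pd.to_timedelta(np.arange(150), unit="h"),
    "fm": np.linspace(8.0, 12.0, 150),
})
toy = add_st_period(toy)
counts = toy["st_period"].value_counts().sort_index()
print(counts)
```

Filters can then be applied per group with `toy.groupby("st_period")`, so that a suspect stretch removes only its own period rather than the whole station.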

In [None]:
params_data

In [None]:
ml_dict = data_funcs.build_ml_data(raws_dict, hours=params_data.hours, 
                                   max_linear_time = params_data.max_linear_time, 
                                   save_path = "../data/test_data/test_ml_dat.pkl")

In [None]:
len(raws_dict.keys())

In [None]:
len(ml_dict.keys())

In [None]:
from utils import hash_ndarray
hash_ndarray(ml_dict["RFRC2"]["data"]["fm"].to_numpy())

## Setup CV

In [None]:
train_times, val_times, test_times = data_funcs.cv_time_setup("2023-01-05T00:00:00Z", 
                                                train_hours=48*2, forecast_hours=48)

In [None]:
stids = [*ml_dict.keys()]

tr_sts, val_sts, te_sts = data_funcs.cv_space_setup(stids, random_state=42)

In [None]:
print(val_sts)

In [None]:
print(te_sts)

In [None]:
train = data_funcs.get_sts_and_times(ml_dict, tr_sts, train_times)

In [None]:
val = data_funcs.get_sts_and_times(ml_dict, val_sts, val_times)

In [None]:
test = data_funcs.get_sts_and_times(ml_dict, te_sts, test_times)

## ODE+KF Data

* Run on 72 hour stretches (24 spinup, 48 val)
* Get test station list used by other models
* For those test stations, use `get_sts_and_times` accounting for the spinup period
    * So adjust test times by subtracting 24 hours to account for spinup
 
Function `get_ode_data` wraps the `get_sts_and_times` function... 
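The time adjustment described above can be sketched as follows. This is only an illustration of the arithmetic under assumed inputs (an hourly 48 hour test window with a made-up start date). It does not reproduce the internals of `get_ode_data`.

```python
import numpy as np
import pandas as pd

spinup_hours = 24

# Hypothetical 48 hour test window at hourly resolution
test_times = pd.Timestamp("2023-01-05T00:00:00Z") + pd.to_timedelta(np.arange(48), unit="h")

# Prepend the 24 hours immediately before the test window for model spinup
spinup_times = test_times[0] + pd.to_timedelta(np.arange(-spinup_hours, 0), unit="h")
ode_times = spinup_times.append(test_times)

print(ode_times[0], ode_times[-1], len(ode_times))
```

The resulting 72 hour stretch lines up with the `hours=72, h2=24` arguments passed to `run_model`, where the first 24 hours are spinup and the remaining 48 are evaluated.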

In [None]:
ode_data = data_funcs.get_ode_data(ml_dict, te_sts, test_times)

In [None]:
ode = mm.ODE_FMC()
m, errs = ode.run_model(ode_data, hours=72, h2=24)

In [None]:
print(f"RMSE Over Test Period: {errs}")

## Static ML Data

Within each of the train/val/test sets, combine all observations into a single table without regard to time ordering. In other words, the rows can be shuffled in any order because observations are treated as independent in time.

Data is stored as a custom class `StaticMLData` defined in `models/moisture_models.py`. A custom class is used to organize data scaling and inverse scaling. A scaler should be fit using only the training data, and then applied to the val and test data to avoid data leakage. This is done internally in the StaticMLData class. 
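The leakage-safe scaling pattern looks roughly like the sketch below. It uses scikit-learn's `StandardScaler` on synthetic arrays as an assumption. The scaler actually used inside `StaticMLData` may differ.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the train/val/test predictor arrays
rng = np.random.default_rng(0)
X_train = rng.normal(loc=10.0, scale=2.0, size=(200, 3))
X_val = rng.normal(loc=11.0, scale=2.0, size=(50, 3))
X_test = rng.normal(loc=12.0, scale=2.0, size=(50, 3))

# Fit on training data only, then reuse the same statistics everywhere
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)
X_test_s = scaler.transform(X_test)

# Inverse scaling recovers the original training values
X_train_back = scaler.inverse_transform(X_train_s)

print(X_train_s.mean(axis=0))  # approximately zero
print(X_val_s.mean(axis=0))    # generally nonzero, since val was not used to fit
```

Only the training set is guaranteed to have zero mean after scaling. A nonzero mean in the scaled val and test sets is expected and is not a sign of a bug.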

In [None]:
dat = data_funcs.StaticMLData(train, val, test)

In [None]:
print(dat.X_train[:, 0].mean())

In [None]:
dat.scale_data()

In [None]:
tr, v, te = dat.inverse_scale(save_changes=False)

In [None]:
print(dat.X_train[:, 0].mean())

In [None]:
print(tr[:, 0].mean())

### Fitting Static Models

Using the `StaticMLData` custom class above, fit and predict with some static ML models.

In [None]:
xgb_model = XGB(mm.xgb_params)
m, err = xgb_model.run_model(dat)
print(f"XGBoost Test RMSE: {err}")

In [None]:
lm_model = LM(mm.lm_params)
m, err = lm_model.run_model(dat)
print(f"LM Test RMSE: {err}")

## RNN Data

For training RNNs (simple, LSTM, GRU included), the data must be structured as `(batch_size, timesteps, features)`. So a single "sample" in this context is a timeseries of length `timesteps` and dimensionality `features`. RNNs can be trained with varying numbers of timesteps and batch sizes, which is often useful in the context of natural language processing. However, when running an RNN in "stateful" mode, which maintains the dependence between different samples from the same location, the data must have a consistent number of timesteps and batch size across all inputs. Further, when using static features like lon/lat or elevation, it is desirable to have samples from different locations within the same batch. Otherwise, if a batch is constructed with samples all from the same location, the static features will have zero variance within that batch and the model cannot learn any relationship between the static features and the outcome variable from that batch.
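The zero-variance issue can be seen with a small synthetic example. The elevations and batch size below are made up, and the interleaving shown is just one simple way to mix locations, not necessarily the ordering used by the project code.

```python
import numpy as np

# Hypothetical static feature: one elevation value per location
elev = np.array([1500.0, 2100.0, 2800.0, 3300.0])
samples_per_loc = 4
batch_size = 4

# Samples ordered location by location: each batch holds a single location
static_by_loc = np.repeat(elev, samples_per_loc)
# Samples interleaved: each batch holds one sample from every location
static_interleaved = np.tile(elev, samples_per_loc)

# Variance of the static feature within each batch
var_by_loc = static_by_loc.reshape(-1, batch_size).var(axis=1)
var_interleaved = static_interleaved.reshape(-1, batch_size).var(axis=1)

print(var_by_loc)          # all zeros: nothing to learn from elevation in any batch
print(var_interleaved)     # positive in every batch
```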

Data is stored in a custom class `RNNData` defined in `models/moisture_rnn.py`. A custom class is used to organize scaling as well as batch construction. 

Steps:
* Remove data from train/test/val shorter than needed length of timeseries
    * For non-stateful models, sequences of data must be greater than or equal to the `timesteps` hyperparameter. (So for a given loc, we can have stretches of data of length timesteps over any period defined as the train times)
    * For stateful models, we need continuity across samples. We therefore discard any locations where the observations are not continuous over a certain length of time. (So for a given loc, we need samples of length `timesteps` that line up in time) 
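The two filtering rules above can be sketched as follows. The helper names `hours_are_consecutive` and `keep_long_enough` are hypothetical and written only for this illustration. The project's own `is_consecutive_hours` in `utils` may behave differently.

```python
import numpy as np

def hours_are_consecutive(times):
    """Return True if every step between successive timestamps is exactly one hour."""
    times = np.asarray(times, dtype="datetime64[h]")
    return bool(np.all(np.diff(times) == np.timedelta64(1, "h")))

def keep_long_enough(times_by_loc, timesteps, stateful=False):
    """Keep locations with at least `timesteps` observations.

    If `stateful`, additionally require that the observations are hourly with no gaps.
    """
    kept = {}
    for loc, times in times_by_loc.items():
        if len(times) < timesteps:
            continue
        if stateful and not hours_are_consecutive(times):
            continue
        kept[loc] = times
    return kept

t0 = np.datetime64("2023-01-01T00", "h")
gap_offsets = np.concatenate([np.arange(0, 12), np.arange(14, 26)])
toy_times = {
    "A": t0 + np.arange(24).astype("timedelta64[h]"),   # 24 consecutive hours
    "B": t0 + gap_offsets.astype("timedelta64[h]"),     # 24 hours with a 2 hour gap
    "C": t0 + np.arange(6).astype("timedelta64[h]"),    # too short for timesteps=12
}

print(sorted(keep_long_enough(toy_times, timesteps=12)))
print(sorted(keep_long_enough(toy_times, timesteps=12, stateful=True)))
```

For simplicity this sketch rejects a whole location when any gap is present. A less strict version could instead split the series at each gap and keep the continuous stretches that are long enough.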

In [None]:
features_list = ["Ed", "Ew", "rain"]
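df = train["AENC2"]["data"][features_list].copy()  # copy so adding a column does not touch the source data
df["ind"] = np.arange(len(df))  # running index, useful for checking sample order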
X = df.to_numpy()
y = train["AENC2"]["data"]["fm"].to_numpy().reshape(-1, 1)
times = train["AENC2"]["times"]
print(df.shape)

In [None]:
from utils import is_consecutive_hours

In [None]:
is_consecutive_hours(times)

In [None]:
def staircase(X, y, timesteps=12, method="consecutive", return_sequences=False, verbose=True):
    """
    NON-STATEFUL method. From the input arrays, extract samples of length
    `timesteps` and discard any leftover data of shorter length.
    Non-stateful since samples of length timesteps for a given location
    need not be directly ordered in time. Allows for getting samples of 
    length timesteps for a given location over any time within train/val/test 
    window.

    If consecutive, samples of length timesteps taken in order, so no overlap
        num samples will be total_times // timesteps
    If sliding, samples of length timesteps taken while shifting one step, so lots of overlap
        num samples will be total_times - timesteps + 1

    Args: 
        - X: numpy ndarray of dims (total_times, n_features)
        - y: numpy ndarray of response values, dims (total_times, 1) or (total_times,)
        - timesteps: (int) number of time steps in a single sample
        - method: (str) one of "sliding" or "consecutive"
        - return_sequences: (bool) whether to return the entire sequence of y values for the sample or only the
                            last time. If False, y has dims (n_samples, 1). If True, (n_samples, timesteps)
        - verbose: (bool) whether to print the output shapes
    Returns: 
        - X_samples: either shape (total_times // timesteps, timesteps, n_features) OR (total_times - timesteps + 1, timesteps, n_features) 
        depending on consecutive or sliding method
        - y_samples: Either shape (num_samples, 1) or (num_samples, timesteps), 
        where num_samples determined by X shape
    """
    
    total_times, features = X.shape
    
    if method == "sliding":
        nsamples = total_times - timesteps + 1
        X_samples = np.lib.stride_tricks.sliding_window_view(X, (timesteps, features)).squeeze(axis=1)
        y_samples = np.lib.stride_tricks.sliding_window_view(y.squeeze(), timesteps)


    elif method == "consecutive":
        nsamples = total_times // timesteps  # Only full batches
        X_samples = X[:nsamples * timesteps].reshape(nsamples, timesteps, features)
        y_samples = y[:nsamples * timesteps].reshape(nsamples, timesteps)
    else:
        raise ValueError("Method must be either 'consecutive' or 'sliding'.")

    if not return_sequences:
        y_samples = y_samples[:, -1].reshape(-1, 1)  # Keep only the last timestep
    
    if verbose:
        print('staircase: shape X = ',X_samples.shape)
        print('staircase: shape y = ',y_samples.shape)
        print('staircase: timesteps=',timesteps)
        print('staircase: return_sequences=',return_sequences)        

    return X_samples, y_samples

In [None]:
X1, y1 = staircase(X, y, timesteps=12, method="consecutive", return_sequences=False)

In [None]:
X1.shape

In [None]:
y1.shape

In [None]:
X2, y2 = staircase(X, y, timesteps=12, method="sliding", return_sequences=False)

In [None]:
X2.shape

In [None]:
y2.shape
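As a quick sanity check of the sample counts stated in the `staircase` docstring, the sketch below repeats the two windowing operations on a small synthetic array. The array sizes are arbitrary. The NumPy calls are the same ones used in the function.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

total_times, n_features, timesteps = 50, 3, 12
X_toy = np.arange(total_times * n_features, dtype=float).reshape(total_times, n_features)

# Consecutive: non-overlapping windows, leftover rows at the end are dropped
n_consec = total_times // timesteps
X_consec = X_toy[: n_consec * timesteps].reshape(n_consec, timesteps, n_features)

# Sliding: a new window starts at every time step
X_slide = sliding_window_view(X_toy, (timesteps, n_features)).squeeze(axis=1)

print(X_consec.shape)  # (total_times // timesteps, timesteps, n_features)
print(X_slide.shape)   # (total_times - timesteps + 1, timesteps, n_features)
```

With 50 time steps and `timesteps=12`, the consecutive method discards the last 2 rows, while the sliding method uses every row at the cost of heavily overlapping samples.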

In [None]:
features_list = ["Ed", "Ew", "rain"]
y_col="fm"
hours = 36
# y_list = [d["data"][y_col].values for d in train.values()]
# X_list = [d["data"][features_list].values for d in train.values()]

In [None]:
# Get lists of X, y and times
X_list = [d["data"].values for d in train.values()]
y_list = [d["data"][y_col].values for d in train.values()]
times_list = [d["times"] for d in train.values()]

In [None]:
from utils import is_consecutive_hours

In [None]:
import warnings

def build_training_batches(X_list, y_list, batch_size, timesteps=12, hours = 36,
                           method="consecutive", return_sequences=False, start_times="zeros", verbose=True):
    """
    Construct data for RNN training (and validation data) with format (batch_size, timesteps, features) 
    Runs staircase with given params, then interlaces the data so that a single batch has samples from different
    locations and thus can learn relationships for features that are static for a given location

    Args:
        - X_list: (list) list of numpy ndarrays of predictors
        - y_list: (list) list of numpy ndarrays of response data
        - batch_size: (int) number of samples of length timesteps to include in a single iteration of weight updates
        - timesteps: (int) number of discrete time steps that defines a single sample
        - hours: (int) Minimum length, in hours, of a set of samples. Any set of samples shorter than hours will be discarded. 
            For stateful structure, hidden state will be maintained for number of samples N where N*timesteps=hours
            NOTE, hours should be divisible by timesteps
            If Non-stateful structure, set hours equal to timesteps
        - method: one of "consecutive" or "sliding"
        - start_times: if "zeros" all samples start at time 0. (Only one for now)
    Returns:
        XX, yy: tuple of structured predictors and outcomes variables. 
            XX shape will be (num_samples, timesteps, features), where num_samples determined by batch size and input X length
            yy shape will be (num_samples, 1) OR (num_samples, timesteps) if return sequences
    """

    if method != "consecutive":
        raise ValueError("Only method='consecutive' is implemented so far")
    if hours % timesteps != 0:
        warnings.warn(f"Input hours {hours} not divisible by input timesteps {timesteps}, may lead to unexpected behavior")

    

In [None]:
# Apply staircase to each list element
X_samples_list, y_samples_list = zip(*[staircase(X, y, timesteps=12, method="consecutive", verbose=False, return_sequences=False) for X, y in zip(X_list, y_list)])
X_samples_list, y_samples_list = list(X_samples_list), list(y_samples_list)

In [None]:
from data_funcs import MLData

In [None]:
class RNNData(MLData):
    """
    Custom class to handle RNN data. Performs data scaling and stateful batch structuring.
    In this context, a single "sample" from RNNData is a timeseries with dimensionality (timesteps, n_features)
    """
    def _setup_data(self, train, val, test, y_col="fm", verbose=True):
        """
        Combines DataFrames under 'data' keys for train, val, and test. 
        Batch structure using staircase functions.

        Creates numpy ndarrays X_train, y_train, X_val, y_val, X_test, y_test
        """
        if verbose:
            print(f"Subsetting input data to {self.features_list}")        

In [None]:
ml_dict["RLAS2"].keys()

In [None]:
ml_dict["RLAS2"]["data"]

In [None]:
features_list = ["Ed", "Ew", "rain"]
y_col="fm"
y_list = [d["data"][y_col].values for d in train.values()]
X_list = [d["data"][features_list].values for d in train.values()]

In [None]:
len(y_list)

In [None]:
import matplotlib.pyplot as plt
plt.plot(y_list[0])

In [None]:
len(X_list)

In [None]:
X_list[0].shape

In [None]:
type(X_list[0])

In [None]:
import importlib
import models.moisture_rnn
importlib.reload(models.moisture_rnn)
from models.moisture_rnn import staircase_spatial, batch_setup

In [None]:
XX, yy, n_seqs = staircase_spatial(
    X_list, y_list, batch_size = 32, timesteps=12,
    start_times = "zeros"
)

In [None]:
n_loc = len(y_list) # assuming each list entry for y is a separate location
loc_ids = np.arange(n_loc)
start_times = np.zeros(n_loc)
batch_size = 32

In [None]:
loc_batch, t_batch =  batch_setup(loc_ids, batch_size), batch_setup(start_times, batch_size)

In [None]:
len(loc_batch)

In [None]:
loc_batch[0]

In [None]:
loc_batch[1]

In [None]:
hours = min(len(yi) for yi in y_list)
print(hours)

In [None]:
from models.moisture_rnn import staircase_2

In [None]:
# Loop over batches and construct with staircase_2
timesteps = 12  # same value passed to staircase_spatial above
Xs = []
ys = []
for i in range(0, len(loc_batch)):
    locs_i = loc_batch[i]
    ts = t_batch[i]
    for j in range(0, len(locs_i)):
        t0 = int(ts[j])
        tend = t0 + hours
        # Create RNNData Dict
        # Subset data to given location and time from t0 to t0+hours
        k = locs_i[j] # Used to account for fewer locations than batch size
        # Index the per-location lists (X and y above hold a single station only)
        X_temp = X_list[k][t0:tend,:]
        y_temp = y_list[k][t0:tend].reshape(-1,1)

        # Format sequences
        Xi, yi = staircase_2(
            X_temp, 
            y_temp, 
            timesteps = timesteps, 
            batch_size = 1,  # note: using 1 here to format sequences for a single location, not same as target batch size for training data
            verbose=False)
    
        Xs.append(Xi)
        ys.append(yi)   