# Build Machine Learning Dataset


The data retrieval process loops through a range of dates, retrieves and joins RAWS, HRRR, and other data sources, and saves the results to a local directory.

This notebook describes how to read that data, apply the final set of quality control filters, and format the result so it can be fed into the various models used in this project.

## Setup

In [None]:
import os.path as osp
from datetime import datetime, timezone
from dateutil.relativedelta import relativedelta
import synoptic
import json
import sys
import numpy as np
import polars as pl
import pandas as pd
sys.path.append('../src')
from utils import Dict, read_yml, read_pkl, str2time, print_dict_summary, time_range, rename_dict
import models.moisture_models as mm
import ingest.RAWS as rr
import ingest.HRRR as ih
import data_funcs 

In [None]:
start = "2023-01-01T00:00:00Z"
end = "2023-01-06T23:00:00Z"

In [None]:
params_data = Dict(read_yml("../etc/params_data.yaml"))
print_dict_summary(params_data)

## Retrieve Data

The retrieved data is a nested dictionary. Each top-level key corresponds to a RAWS station, with subkeys for RAWS observations, atmospheric data (HRRR), geographic info, etc.

This format is used because the FMC models in this project require differently structured input. The ODE+KF physics-based model is run pointwise and does not incorporate info from other locations. The static ML models have the least restrictive input data structure, and all observations can be pooled into one set of tabular data. The RNN models require input data with the shape (batch_size, timesteps, features). Thus, it is simpler to keep the data for each location separate and recombine it in various ways at the modeling step. Data filters for suspect RAWS sensors are also deferred to the next step: raw data retrieval should not depend on hyperparameter choices related to data filters, so it is easier to collect everything and apply filters later.
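
As a rough illustration of the three layouts described above, the sketch below starts from a per-station dictionary of hourly arrays and builds a single-station view, a pooled table, and a 3D array. The station IDs, array sizes, and values are made up for the example, and this is not the code in `data_funcs`.

```python
import numpy as np

# Hypothetical per-station data: 6 hourly steps, 2 features each
stations = {
    "ST1": np.arange(12, dtype=float).reshape(6, 2),
    "ST2": np.arange(12, 24, dtype=float).reshape(6, 2),
}

# Pointwise (ODE+KF style): one station at a time
single = stations["ST1"]

# Static ML: stack every observation from every station into one table
pooled = np.concatenate(list(stations.values()), axis=0)

# RNN: (batch_size, timesteps, features), here one sequence per station
sequences = np.stack(list(stations.values()), axis=0)

print(single.shape, pooled.shape, sequences.shape)
```

Keeping the stations separate until this point means each layout is a one-line recombination of the same dictionary.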

In [None]:
paths = ["../data/rocky_fmda/202301/fmda_20230101.pkl", 
         "../data/rocky_fmda/202301/fmda_20230102.pkl",
         "../data/rocky_fmda/202301/fmda_20230103.pkl",
         "../data/rocky_fmda/202301/fmda_20230104.pkl",
         "../data/rocky_fmda/202301/fmda_20230105.pkl",
         "../data/rocky_fmda/202301/fmda_20230106.pkl"
        ]

In [None]:
# Reload data_funcs so that edits to the module are picked up without restarting the kernel
import importlib
import data_funcs
importlib.reload(data_funcs)
from data_funcs import combine_fmda_files

In [None]:
raws_dict = data_funcs.combine_fmda_files(paths, save_path="../data/test_data/test_fmda_combined.pkl")

## Build ML Dataset

Filter the data and merge RAWS, HRRR, and other sources. The file `etc/params_data.yaml` holds the hyperparameters related to data filtering. The steps are:

- Determine the atmospheric data source. This is intended to be "HRRR" for production, but "RAWS" is used for research purposes.
- Combine the atmospheric predictors with FMC.
- Break each timeseries into 72 hour periods, adding a column "st_period" that starts at 0 (see the README for info on why 72).
- Apply data filters to the RAWS data in each 72 hour period and remove flagged periods from the samples. HRRR data should already be QC'ed, so it is not filtered.
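
The period-labeling step can be sketched as follows. This is a minimal illustration, not the implementation in `build_ml_data`: the helper `add_periods`, the column names `date_time` and `fm`, and the constant FMC values are assumptions made for the example.

```python
import pandas as pd

def add_periods(df, hours=72):
    """Label consecutive `hours`-long blocks with an integer index starting at 0."""
    df = df.sort_values("date_time").reset_index(drop=True)
    # Hours elapsed since the first observation
    elapsed = (df["date_time"] - df["date_time"].iloc[0]) / pd.Timedelta(hours=1)
    df["st_period"] = (elapsed // hours).astype(int)
    return df

# Hypothetical hourly series covering the date range used in this notebook
times = pd.date_range("2023-01-01T00:00:00Z", "2023-01-06T23:00:00Z",
                      freq=pd.Timedelta(hours=1))
df = add_periods(pd.DataFrame({"date_time": times, "fm": 10.0}))

# Number of hourly rows in each period
counts = df.groupby("st_period").size()
print(counts)
```

Once every row carries a period label, a filter can be evaluated per period and an entire period dropped if it fails.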

In [None]:
params_data

In [None]:
ml_dict = data_funcs.build_ml_data(raws_dict, hours=params_data.hours,
                                   max_linear_time=params_data.max_linear_time,
                                   save_path="../data/test_data/test_ml_dat.pkl")

In [None]:
len(raws_dict.keys())

In [None]:
len(ml_dict.keys())

## Setup CV

In [None]:
train_times, val_times, test_times = data_funcs.cv_time_setup("2023-01-05T00:00:00Z", 
                                                train_hours=48*2, forecast_hours=48)

In [None]:
stids = [*ml_dict.keys()]

tr_sts, val_sts, te_sts = data_funcs.cv_space_setup(stids, random_state=42)

In [None]:
print(val_sts)

In [None]:
print(te_sts)

In [None]:
train = data_funcs.get_sts_and_times(ml_dict, tr_sts, train_times)

In [None]:
val = data_funcs.get_sts_and_times(ml_dict, val_sts, val_times)

In [None]:
test = data_funcs.get_sts_and_times(ml_dict, te_sts, test_times)

## ODE+KF Data

* Run on 72 hour stretches (24 hours of spinup, 48 hours of val)
* Get the test station list used by the other models
* For those test stations, use `get_sts_and_times`, accounting for the spinup period
    * That is, subtract 24 hours from the start of the test times so the spinup is included
 
Function `get_ode_data` wraps the `get_sts_and_times` function... 
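
To make the spinup adjustment concrete, here is a minimal sketch of extending a 48 hour test window back by 24 hours. The helper `add_spinup` and the test window below are assumptions for illustration. The internals of `get_ode_data` may differ.

```python
import pandas as pd

def add_spinup(times, spinup_hours=24):
    """Return hourly times covering `times` plus `spinup_hours` before its start."""
    times = pd.DatetimeIndex(times)
    start = times.min() - pd.Timedelta(hours=spinup_hours)
    return pd.date_range(start, times.max(), freq=pd.Timedelta(hours=1))

# Hypothetical 48 hour test window
test_times = pd.date_range("2023-01-05T00:00:00Z", periods=48,
                           freq=pd.Timedelta(hours=1))
ode_times = add_spinup(test_times)

print(len(ode_times), ode_times[0], ode_times[-1])
```

The resulting 72 hour stretch matches the `hours=72, h2=24` split: the first 24 steps are spinup and the remaining 48 line up with the test times used by the other models.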

In [None]:
ode_data = data_funcs.get_ode_data(ml_dict, te_sts, test_times)

In [None]:
ode = mm.ODE_FMC()
m, errs = ode.run_model(ode_data, hours=72, h2=24)

In [None]:
print(f"RMSE Over Test Period: {errs}")

## Static ML Data

Pool all training data from every station into a single table.
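
A minimal sketch of the pooling step, assuming the training data is a dictionary of per-station DataFrames. The structure returned by `get_sts_and_times` is not shown here, so the station IDs, column names, and values below are hypothetical.

```python
import pandas as pd

# Hypothetical per-station training frames keyed by station ID
train_frames = {
    "ST1": pd.DataFrame({"temp": [280.0, 281.0], "rh": [40.0, 42.0], "fm": [10.0, 10.5]}),
    "ST2": pd.DataFrame({"temp": [275.0, 276.0], "rh": [60.0, 58.0], "fm": [14.0, 13.8]}),
}

# Stack all stations into one table, keeping the station ID as a column
pooled = pd.concat(
    [df.assign(stid=stid) for stid, df in train_frames.items()],
    ignore_index=True,
)

# Feature matrix and target vector for a static ML model
X = pooled[["temp", "rh"]].to_numpy()
y = pooled["fm"].to_numpy()

print(X.shape, y.shape)
```

Keeping `stid` in the pooled table makes it possible to trace a prediction back to its station even though the model ignores location.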