# Build Machine Learning Dataset


The data retrieval process loops through a range of dates, retrieves and joins RAWS, HRRR, and other data sources, and saves the results to a local directory.

This notebook describes how to read that data, apply the final set of quality control filters, and format the result so it can be fed into the various models used in this project.

## Setup

In [None]:
import os.path as osp
from datetime import datetime, timezone
from dateutil.relativedelta import relativedelta
import synoptic
import json
import sys
import numpy as np
import polars as pl
import pandas as pd
import pickle
from sklearn.metrics import mean_squared_error
sys.path.append('../src')
import reproducibility
from utils import Dict, read_yml, read_pkl, str2time, print_dict_summary, time_range, rename_dict
# import models.moisture_models as mm
import models.moisture_rnn as mrnn
from models.moisture_ode import ODE_FMC
from models.moisture_static import XGB, LM
import ingest.RAWS as rr
import ingest.HRRR as ih
import data_funcs 

In [None]:
start = "2023-01-01T00:00:00Z"
end = "2023-01-06T23:00:00Z"

In [None]:
params_data = Dict(read_yml("../etc/params_data.yaml"))
print_dict_summary(params_data)

## Retrieve Data

The retrieved data is a nested dictionary. Each top-level key corresponds to a RAWS station, with subkeys for the RAWS observations, atmospheric data (HRRR), geographic info, etc.

This format is used because the FMC models in this project require different data structures. The ODE+KF physics-based model is run pointwise and does not incorporate information from other locations. The static ML models have the least restrictive input structure, and all observations can be combined into one set of tabular data. The RNN models require input structured as `(batch_size, timesteps, features)`. It is therefore simpler to keep the data separate by location and recombine it in various ways at the modeling step.

Data filters for suspect RAWS sensors are applied in the next step rather than during retrieval. The raw data retrieval should not depend on hyperparameter choices related to data filters, so it is easier to collect everything and apply filters later.
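
As a rough illustration of why the per-station layout is convenient, the sketch below builds a toy nested dictionary and recombines it into a tabular array and a 3D sequence array. The station IDs, the subkey names (`RAWS`, `HRRR`, `loc`), and the variable names are hypothetical placeholders, not necessarily the keys produced by `data_funcs.combine_fmda_files`.

```python
import numpy as np

# Hypothetical stations and keys, for illustration only
rng = np.random.default_rng(0)
n_hours = 6
toy_dict = {
    st: {
        "RAWS": {"fm": rng.uniform(5, 30, n_hours)},
        "HRRR": {
            "rh": rng.uniform(10, 90, n_hours),
            "temp": rng.uniform(270, 300, n_hours),
        },
        "loc": {"elev": elev},
    }
    for st, elev in [("ST1", 1500.0), ("ST2", 2100.0)]
}

def station_features(d):
    """Stack the predictors for one station into shape (timesteps, features)."""
    n = len(d["HRRR"]["rh"])
    elev = np.full(n, d["loc"]["elev"])  # static feature repeated over time
    return np.column_stack([d["HRRR"]["rh"], d["HRRR"]["temp"], elev])

# Static ML view: every hourly observation becomes one row of a single table
X_tabular = np.vstack([station_features(d) for d in toy_dict.values()])

# RNN view: one sequence per station, shape (samples, timesteps, features)
X_seq = np.stack([station_features(d) for d in toy_dict.values()])

print(X_tabular.shape, X_seq.shape)
```

The same per-station arrays feed both views, which is the point of deferring the recombination to the modeling step.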

In [None]:
# paths = ["../data/rocky_fmda/202301/fmda_20230101.pkl", 
#          "../data/rocky_fmda/202301/fmda_20230102.pkl",
#          "../data/rocky_fmda/202301/fmda_20230103.pkl",
#          "../data/rocky_fmda/202301/fmda_20230104.pkl",
#          "../data/rocky_fmda/202301/fmda_20230105.pkl",
#          "../data/rocky_fmda/202301/fmda_20230106.pkl"
#         ]
paths = data_funcs.sort_files_by_date("../data/rocky_fmda/202301")

In [None]:
raws_dict = data_funcs.combine_fmda_files(paths, save_path="../data/test_data/test_fmda_combined.pkl")

## Build ML Dataset

Filter data and merge RAWS and HRRR and other sources. The file `etc/params_data.yaml` has hyperparameters related to filtering data. The steps include:

- Determine atmospheric data source. Intended to be "HRRR" for production, but "RAWS" used for research purposes.
- Combine atmospheric data predictors with FMC
- Break timeseries into 72-hour periods, adding a column "st_period" starting at 0 (see README for info on why 72)
- Apply data filters to the RAWS data within each 72-hour period and remove periods that fail the filters from the samples. HRRR data should already be QC'ed, so it is not filtered.
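
The period-splitting step can be sketched as follows for a single station. This is a minimal illustration on a toy, gap-free hourly series. The column names `date_time` and `fm` are assumptions, and dropping incomplete periods stands in for the real filters, which `data_funcs.build_ml_data` may implement differently (for example, by accounting for gaps via `max_linear_time`).

```python
import numpy as np
import pandas as pd

hours = 72  # period length; the notebook reads this from params_data.hours

# Toy hourly series for one station (hypothetical values)
times = pd.date_range("2023-01-01T00:00:00Z", periods=150, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"date_time": times, "fm": np.linspace(8.0, 12.0, len(times))})

# Integer period index starting at 0: rows 0-71 -> 0, rows 72-143 -> 1, ...
df["st_period"] = np.arange(len(df)) // hours

# Illustrative filter: keep only periods with a full set of observations
counts = df.groupby("st_period")["fm"].transform("count")
df_full = df[counts == hours]

print(df_full["st_period"].unique())
```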

In [None]:
params_data

In [None]:
ml_dict = data_funcs.build_ml_data(raws_dict, hours=params_data.hours, 
                                   max_linear_time = params_data.max_linear_time, 
                                   save_path = "../data/test_data/test_ml_dat.pkl")

In [None]:
len(raws_dict.keys())

In [None]:
len(ml_dict.keys())

## Setup CV

Steps:
* Determine time ranges for train/val/test
* Get stations with data availability in those periods
* Sample stations for train/val/test


Different stations will have different gaps of data availability for the train/val/test time periods. When selecting stations for inclusion in those periods, we use the following methodology:
* Let $N$ be the total number of stations that returned data over the combined train/val/test times
* Let $N_t$ be the number of stations included in each of the validation and test sets, chosen as the nearest integer to 10\% of $N$
* Starting with the test time period, we select $N_t$ of the stations with data availability in that period. There may be fewer than $N$ stations with data availability in the test period, but we select $N_t$ if possible
* Then, we select $N_t$ stations for inclusion in the validation set, excluding any of the $N_t$ stations included in the test set
* Finally, we use all remaining stations that were not included in either the validation or test sets for the training set, so at most $N-2\cdot N_t$ stations are included in the training set

This methodology means the number of stations in the training set varies and is sometimes less than $N-2\cdot N_t$. We fix the number of stations in the test and validation sets and allow the number in the training set to vary, because we want accuracy metrics to be calculated consistently for those periods. If fewer stations have data available for a certain period, we want that to be reflected in a smaller training set and presumably less accurate metrics on the test set.
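
The sampling order described above can be sketched as follows. The helper `sample_stations` and its arguments are hypothetical and written only for illustration. The actual logic lives in `data_funcs.cv_data_wrap` and may differ in its details.

```python
import numpy as np

def sample_stations(avail_train, avail_val, avail_test, frac=0.1, random_state=42):
    """Hypothetical helper. Each avail_* is a set of station IDs with data in that period."""
    rng = np.random.default_rng(random_state)
    N = len(set(avail_train) | set(avail_val) | set(avail_test))
    N_t = int(round(frac * N))  # stations per val/test set

    # Test set first: up to N_t stations with data in the test period
    test = set(rng.permutation(sorted(avail_test))[:N_t].tolist())

    # Validation set next, excluding stations already used for testing
    val_pool = sorted(set(avail_val) - test)
    val = set(rng.permutation(val_pool)[:N_t].tolist())

    # Training set: whatever remains with data in the training period
    train = set(avail_train) - test - val
    return train, val, test

# Toy example: 20 stations, two of which have no data in the training period
all_sts = [f"ST{i:02d}" for i in range(20)]
train_sts, val_sts, test_sts = sample_stations(
    avail_train=set(all_sts[:18]), avail_val=set(all_sts), avail_test=set(all_sts)
)
print(len(train_sts), len(val_sts), len(test_sts))
```

In this toy case $N = 20$ and $N_t = 2$, so the training set holds at most 16 stations and fewer when sampled stations overlap less with the two that lack training data.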

In [None]:
train, val, test = data_funcs.cv_data_wrap(ml_dict, "2023-01-29T00:00:00Z", 
                train_hours = 24*28, forecast_hours=48,
                random_state=42)

## ODE+KF Data

* Run on 72-hour stretches (24 hours of spinup, 48 hours of validation)
* Get test station list used by other models
* For those test stations, use `get_sts_and_times` accounting for the spinup period
    * So adjust test times by subtracting 24 hours to account for spinup
 
Function `get_ode_data` wraps the `get_sts_and_times` function... 
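
The spinup adjustment amounts to simple time arithmetic, sketched below. The start time is a hypothetical example, and this illustrates only the 24-hour shift, not the internals of `get_ode_data` or `get_sts_and_times`.

```python
from datetime import datetime, timedelta, timezone

spinup_hours = 24
forecast_hours = 48

# Hypothetical hourly test times (48 hours)
test_start = datetime(2023, 1, 29, tzinfo=timezone.utc)
test_times = [test_start + timedelta(hours=h) for h in range(forecast_hours)]

# Shift the start back by the spinup period so the ODE run covers 72 hours
ode_start = test_times[0] - timedelta(hours=spinup_hours)
ode_times = [ode_start + timedelta(hours=h)
             for h in range(spinup_hours + forecast_hours)]

print(ode_times[0].isoformat(), len(ode_times))
```

With this layout, the first 24 entries of `ode_times` are spinup and the remaining 48 line up with `test_times`.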

In [None]:
te_sts = [*test.keys()]
test_times = test[te_sts[0]]["times"]

ode_data = data_funcs.get_ode_data(ml_dict, te_sts, test_times)

In [None]:
ode = ODE_FMC()
m, errs = ode.run_model(ode_data, hours=72, h2=24)

In [None]:
print(f"RMSE Over Test Period: {errs}")

## Static ML Data

Within each of the train/val/test sets, all observations are pooled together without regard to time ordering. In other words, the rows can be shuffled into any order, since observations are treated as independent in time.

Data is stored in a custom class `StaticMLData` defined in `models/moisture_models.py`. A custom class is used to organize data scaling and inverse scaling. A scaler should be fit using only the training data and then applied to the val and test data to avoid data leakage. This is done internally in the `StaticMLData` class.
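
The leakage-free scaling pattern can be sketched with scikit-learn on synthetic arrays. `StandardScaler` is an assumption here, since the scaler actually used inside `StaticMLData` is not shown in this notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the train/val/test feature matrices
rng = np.random.default_rng(42)
X_train = rng.normal(50.0, 10.0, size=(200, 3))
X_val = rng.normal(55.0, 10.0, size=(40, 3))
X_test = rng.normal(45.0, 10.0, size=(40, 3))

# Fit on the training data only, so val/test statistics never leak in
scaler = StandardScaler().fit(X_train)

# Apply the same fitted transform to all three sets
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)
X_test_s = scaler.transform(X_test)

# Inverse scaling recovers the original units
X_train_back = scaler.inverse_transform(X_train_s)

print(X_train_s.mean(axis=0).round(6), X_val_s.mean(axis=0).round(2))
```

Only the scaled training features are centered at zero. The val and test means are generally nonzero, which is expected when the scaler has not seen those sets.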

In [None]:
dat = data_funcs.StaticMLData(train, val, test)

In [None]:
dat.scale_data()

In [None]:
tr, v, te = dat.inverse_scale(save_changes=False)

In [None]:
print(dat.X_train[:, 0].mean())

In [None]:
print(tr[:, 0].mean())

In [None]:
dat.scale_data()

### Fitting Static Models

Using the `StaticMLData` custom class above, fit and predict with some static ML models.

In [None]:
xgb_model = XGB(random_state=42)
xgb_model.fit(dat.X_train, dat.y_train)
print("~"*50)
err_val = xgb_model.test_eval(dat.X_val, dat.y_val)
err = xgb_model.test_eval(dat.X_test, dat.y_test)

In [None]:
lm_model = LM()
lm_model.fit(dat.X_train, dat.y_train)

print("~"*50)
err_val = lm_model.test_eval(dat.X_val, dat.y_val)
err = lm_model.test_eval(dat.X_test, dat.y_test)

## RNN Data

For training RNNs (simple RNN, LSTM, and GRU are included), the data must be structured as `(batch_size, timesteps, features)`. A single "sample" in this context is a timeseries of length `timesteps` and dimensionality `features`. RNNs can be trained with varying numbers of timesteps and varying batch sizes, which is often useful in natural language processing. However, when running an RNN in "stateful" mode, which maintains the dependence between different samples from the same location, the data must have a consistent number of timesteps and a consistent batch size across all inputs. Further, when using static features like lon/lat or elevation, it is desirable to have samples from different locations within the same batch. If a batch is constructed with samples all from the same location, the static features have zero variance within that batch and the model cannot learn any relationship between the static features and the outcome variable from it.

Data is stored in a custom class `RNNData` defined in `models/moisture_rnn.py`. A custom class is used to organize scaling as well as batch construction. 
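
A minimal sketch of the `(batch_size, timesteps, features)` layout with mixed-location batches is shown below, using NumPy on synthetic data. The window length, toy sizes, and the random shuffle are assumptions for illustration. The batch construction inside `RNNData` may differ, and a plain shuffle like this one is not suitable for stateful training, where sample order matters.

```python
import numpy as np

rng = np.random.default_rng(0)
timesteps, n_features = 48, 3
n_stations, hours_per_station = 4, 144  # toy sizes

samples, station_ids = [], []
for st in range(n_stations):
    # Synthetic hourly series for one station, shape (hours, features)
    series = rng.normal(size=(hours_per_station, n_features))
    # Cut into non-overlapping windows of length `timesteps`
    for start in range(0, hours_per_station - timesteps + 1, timesteps):
        samples.append(series[start:start + timesteps])
        station_ids.append(st)

X = np.stack(samples)               # (n_samples, timesteps, features)
station_ids = np.array(station_ids)

# Shuffle samples so that each batch tends to mix several stations
perm = rng.permutation(len(X))
X, station_ids = X[perm], station_ids[perm]

batch_size = 4
batches = [X[i:i + batch_size] for i in range(0, len(X), batch_size)]

print(X.shape, len(batches), batches[0].shape)
```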

In [None]:
dat = mrnn.RNNData(train, val, test,
                  method="random", random_state=None)

In [None]:
# Save Test Data

with open("../data/test_data/test_rnn_dat.pkl", 'wb') as handle:
    pickle.dump(dat, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [None]:
dat.X_train.shape

In [None]:
np.mean(dat.X_train, axis=(0,1))

In [None]:
np.mean(dat.X_val, axis=(0,1))

In [None]:
np.mean(dat.X_test, axis=(0,1))

In [None]:
dat.scale_data()

In [None]:
np.mean(dat.X_train, axis=(0,1))

In [None]:
np.mean(dat.X_val, axis=(0,1))

In [None]:
np.mean(dat.X_test, axis=(0,1))

In [None]:
a, b, c = dat.inverse_scale()

In [None]:
np.mean(a, axis=(0,1))

In [None]:
np.mean(b, axis=(0,1))

In [None]:
np.mean(c, axis=(0,1))

In [None]:
dat.y_train.shape

In [None]:
dat.X_val.shape

In [None]:
dat.y_val.shape

In [None]:
dat.X_test.shape

In [None]:
dat.y_test.shape

### Fitting RNN Models

In [None]:
import importlib
import models.moisture_rnn
importlib.reload(models.moisture_rnn)
import models.moisture_rnn as mrnn

In [None]:
params = mrnn.params_models["rnn"]
params.update({
    'stateful': False,
    'return_sequences': True,
    'hidden_units': [20, 20, None], 
    'batch_size': 32,
    'timesteps': None,
    'epochs':100,
    'random_state': 42
})
params

rnn = mrnn.RNN_Flexible(n_features=dat.n_features, params=params)

In [None]:
dat.X_train.shape

In [None]:
dat.y_train.shape

In [None]:
type(dat.y_val[0,0,0])

In [None]:
type(dat.y_train[0,0,0])

In [None]:
rnn.fit(dat.X_train, dat.y_train, 
        validation_data=(dat.X_val, dat.y_val),
        batch_size = params["batch_size"],
        epochs = 3,
        verbose_fit = True
       )

In [None]:
rnn.test_eval(dat.X_test, dat.y_test)

In [None]:
p = rnn.predict(dat.X_test)
# RMSE computed directly, with arguments in (y_true, y_pred) order
np.sqrt(mean_squared_error(dat.y_test.flatten(), p.flatten()))

In [None]:
dat.X_train.shape