# Build Machine Learning Dataset


The data retrieval process loops through a range of dates, retrieves and joins RAWS, HRRR, and other data sources, and saves the result to a local directory.

This notebook describes the process of reading that data, applying the final set of quality control filters, and formatting it so it can be fed into the various models used in this project.

## Setup

In [None]:
import os.path as osp
from datetime import datetime, timezone
from dateutil.relativedelta import relativedelta
import synoptic
import json
import sys
import numpy as np
import polars as pl
import pandas as pd
from sklearn.metrics import mean_squared_error
sys.path.append('../src')
import reproducibility
from utils import Dict, read_yml, read_pkl, str2time, print_dict_summary, time_range, rename_dict
import models.moisture_models as mm
import models.moisture_rnn as mrnn
from models.moisture_models import XGB, LM
import ingest.RAWS as rr
import ingest.HRRR as ih
import data_funcs 

In [None]:
start = "2023-01-01T00:00:00Z"
end = "2023-01-06T23:00:00Z"

In [None]:
params_data = Dict(read_yml("../etc/params_data.yaml"))
print_dict_summary(params_data)

## Read Data


In [None]:
dat = read_pkl("../data/test_data/test_ml_dat.pkl")

In [None]:
params_data

## Setup CV

Steps:
* Determine time ranges for train/val/test
* Get stations with data availability in those periods
* Sample stations for train/val/test


Stations have different gaps in data availability across the train/val/test time periods. When selecting stations for those periods, we use the following methodology:
* Let $N$ be the total number of stations that returned data over the combined train/val/test times
* Let $N_t$ be the number of stations included in each of the validation and test sets, chosen to be the nearest integer to 10\% of $N$
* Starting with the test time period, we select $N_t$ of the stations with data availability in that period. There may be fewer than $N$ stations with data availability in the test period, but we select $N_t$ if possible
* Then, we select $N_t$ stations for the validation set, excluding any of the stations already included in the test set
* Finally, any remaining stations that were not included in the validation or test sets are used for the training set, so the training set contains at most $N-2\cdot N_t$ stations

With this methodology, the number of stations in the training set varies and is sometimes less than $N-2\cdot N_t$. We fix the number of stations in the test and validation sets and allow the number of stations in the training set to vary. This is because we want accuracy metrics to be calculated consistently for those periods. If fewer stations have data availability for a certain period, we want that to be reflected in a smaller training set and presumably less accurate metrics on the test set.
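
The station selection steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of `data_funcs.cv_space_setup`: the function name `sample_stations`, its arguments, and the synthetic station IDs are all hypothetical.

```python
import numpy as np

def sample_stations(all_stations, val_available, test_available,
                    frac=0.1, random_state=42):
    """Hypothetical sketch of the train/val/test station split."""
    rng = np.random.default_rng(random_state)
    n = len(all_stations)
    n_t = int(round(frac * n))  # N_t: nearest integer to 10% of N

    # Test set: sample up to N_t stations with data in the test period
    test_pool = sorted(test_available)
    test_sts = rng.choice(test_pool, size=min(n_t, len(test_pool)),
                          replace=False).tolist()

    # Validation set: same idea, excluding stations already in the test set
    val_pool = sorted(set(val_available) - set(test_sts))
    val_sts = rng.choice(val_pool, size=min(n_t, len(val_pool)),
                         replace=False).tolist()

    # Training set: all remaining stations, at most N - 2*N_t of them
    train_sts = sorted(set(all_stations) - set(test_sts) - set(val_sts))
    return train_sts, val_sts, test_sts

# Synthetic example: 20 stations, some missing from the val or test period
stations = [f"ST{i:02d}" for i in range(20)]
train_sts, val_sts, test_sts = sample_stations(
    stations, val_available=stations[:15], test_available=stations[5:])
print(len(train_sts), len(val_sts), len(test_sts))  # 16 2 2
```

In this sketch the validation and test sets each keep a fixed size of $N_t$ whenever enough stations are available, and only the training set absorbs the variation.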

In [None]:
train_times, val_times, test_times = data_funcs.cv_time_setup("2023-01-29T00:00:00Z", 
                                                train_hours=24*28, forecast_hours=48)

In [None]:
tr_sts, val_sts, te_sts = data_funcs.cv_space_setup(dat, 
                                                    val_times=val_times, 
                                                    test_times=test_times, 
                                                    random_state=42)

In [None]:
print(val_sts)

In [None]:
print(te_sts)

In [None]:
train = data_funcs.get_sts_and_times(dat, tr_sts, train_times)

In [None]:
val = data_funcs.get_sts_and_times(dat, val_sts, val_times)

In [None]:
test = data_funcs.get_sts_and_times(dat, te_sts, test_times)

## Batching Data for RNN

In [None]:
for st in train:
    print(train[st]["data"].shape)

In [None]:
# Get all samples with staircase

X, y, t = mrnn.staircase_dict(train, sequence_length=12)

In [None]:
for Xi in X:
    print(Xi.shape)
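
For intuition about the shapes printed above, here is a small sketch of how a single station's time series could be turned into fixed-length sequences. It assumes that "staircase" refers to overlapping sliding windows shifted by one time step; the helper `staircase_windows` and the synthetic arrays are hypothetical and are not the project's `mrnn.staircase_dict`.

```python
import numpy as np

def staircase_windows(X, y, sequence_length=12):
    """Hypothetical sliding-window builder for one station.

    X has shape (n_times, n_features), y has shape (n_times, 1).
    """
    n_samples = X.shape[0] - sequence_length + 1
    # Sample i covers time steps i, i+1, ..., i+sequence_length-1
    X_seq = np.stack([X[i:i + sequence_length] for i in range(n_samples)])
    y_seq = np.stack([y[i:i + sequence_length] for i in range(n_samples)])
    return X_seq, y_seq

# Synthetic station: 48 hourly observations with 5 features
X = np.random.default_rng(0).normal(size=(48, 5))
y = np.arange(48, dtype=float).reshape(-1, 1)

X_seq, y_seq = staircase_windows(X, y, sequence_length=12)
print(X_seq.shape, y_seq.shape)  # (37, 12, 5) (37, 12, 1)
```

Under this assumption, sample 0 and sample 12 cover adjacent, non-overlapping windows of the same series.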

### Stateful batching

The algorithm would be too complex if it had to handle a variable number of samples at different times, so for now every location starts at time index 0.

In [None]:
def stateful_batches(X_list, y_list, batch_size = 32, timesteps=12, 
                           return_sequences=False, start_times="zeros", verbose=True):
    """
    Construct data for RNN training (and validation data) with format (batch_size, timesteps, features) 
    Intended to be run on train set and validation set (if using)

    Given list of staircase structured data, i.e. output of staircase_dict, create batches by getting samples from
    each list element, so samples within a batch are from different physical locations.

    If start_times is zeros, in the first batch, and any new batch with all new locations, select the 0th (aka first in python)
    sample to build for the batch.

    Args:
        - X_list: (list) list of numpy ndarrays of predictors
        - y_list: (list) list of numpy ndarrays of response data
        - batch_size: (int) number of samples of length timesteps to include in a single iteration of weight updates
        - timesteps: (int) number of discrete time steps that defines a single sample
        - return_sequences: (bool) Whether to include all response y values for timesteps, or just last step
        - start_times: if "zeros" all samples start at time 0. (Only one for now)
    Returns:
        XX, yy: tuple of structured predictors and outcomes variables. 
            XX shape will be (num_samples, timesteps, features), where num_samples determined by batch size and input X length
            yy shape will be (num_samples, 1) OR (num_samples, timesteps) if return sequences
    """

    # Run some checks
    if len(X_list) != len(y_list):
        raise ValueError(f"Mismatch data. {len(X_list)=}, {len(y_list)=}. Check they were created together")
    if len(X_list) < batch_size:
        raise ValueError(f"Batch size greater than number of locations. Method not implemented for this, try a smaller batch size. {len(X_list)=}, {batch_size=}.")

    # Set up return objects    
    X_batches = []
    y_batches = []
    loc_batches = []
    t_batches = []
    
    # Set up indices for first batch
    loc_index = np.arange(batch_size)
    loc_counter = loc_index.max() # used to iterate to new locations
    loc_resets = []
    X_set = set(np.arange(len(X_list)))
    # t_index0 = np.arange(batch_size) # used to reset times on new location
    # t_index = np.arange(batch_size)
    t_index0 = np.zeros(batch_size)
    t_index = np.zeros(batch_size)
    
    b = 0 # batch index     
    run = True
    while run:
        if verbose:
            print("~"*75)
            print(f"Batch {b}:")
            print(f"Location Indices: {loc_index}")
            print(f"Time Indices: {t_index}")
        
        # Get data
        X_batch = np.array([X_list[loc][int(t)] for loc, t in zip(loc_index, t_index)])
        y_batch = np.array([y_list[loc][int(t)] for loc, t in zip(loc_index, t_index)])
        if not return_sequences:
            y_batch = y_batch[:, -1, :] # Get last time step of sequence

        # Save batch info by appending
        X_batches.append(X_batch.copy())
        y_batches.append(y_batch.copy())
        t_batches.append(t_index.copy())
        loc_batches.append(loc_index.copy())
        
        # Update indices for next iteration
        t_index += timesteps # iterate time index by timesteps param

        # Check times and locations, adjust if needed
        for i in range(0, len(loc_index)):
            loci = loc_index[i]
            ti = t_index[i]
            Xi = X_list[loci]
            if Xi.shape[0] <= ti:
                # Condition triggered that requested time index is 
                # greater than samples available for given location
                # So iterate location index and reset time to t_index0
                t_index[i] = t_index0[i]
                loc_counter += 1
                new_loc_i = loc_counter % len(X_list)
                loc_resets.append(loc_index[i].copy()) # Keep track of which locations get reset

                if not X_set - set(loc_resets):
                    # Condition triggered once every location has been reset at least once
                    # Indicates we have cycled through all locations, STOP
                    if verbose:
                        print(f"Stopping at batch {b}")
                    run = False
                    break
                loc_index[i] = new_loc_i
                if verbose:
                    print(f"Changing location {i} index to: {new_loc_i}")
                    print(f"    With Time index to: {t_index0[i]}")
        
        b += 1 # iterate batch


    # return np.array(X_batches), np.array(y_batches), t_batches, loc_batches
    return np.concatenate(X_batches, axis=0), np.concatenate(y_batches, axis=0), t_batches, loc_batches

## RNN Data

For training RNNs (simple RNN, LSTM, and GRU included), the data must be structured as `(batch_size, timesteps, features)`. A single "sample" in this context is a time series of length `timesteps` and dimensionality `features`. RNNs can be trained with varying numbers of timesteps and varying batch sizes, which is often useful in natural language processing. However, when running an RNN in "stateful" mode, which maintains the dependence between different samples from the same location, the data must have a consistent number of timesteps and a consistent batch size across all inputs. Further, when using static features like lon/lat or elevation, it is desirable to have samples from different locations within the same batch. If a batch is constructed with samples all from the same location, the static features have zero variance within that batch and the model cannot learn any relationship between the static features and the outcome variable from that batch.
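
The point about static features can be checked with a small synthetic example. The arrays, the elevation values, and the variable names below are made up for illustration and do not come from the project data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_locs, n_samples, timesteps = 4, 6, 12

# Feature 0 varies in time; feature 1 is static per location (e.g. elevation)
elevation = np.array([100.0, 850.0, 1500.0, 2300.0])
X_by_loc = []
for loc in range(n_locs):
    dynamic = rng.normal(size=(n_samples, timesteps, 1))
    static = np.full((n_samples, timesteps, 1), elevation[loc])
    X_by_loc.append(np.concatenate([dynamic, static], axis=-1))

# Batch built from a single location
batch_same = X_by_loc[0][:4]
# Batch built with one sample from each location
batch_mixed = np.stack([X_by_loc[loc][0] for loc in range(n_locs)])

print(batch_same.shape, batch_mixed.shape)  # (4, 12, 2) (4, 12, 2)
print("static variance, single location:", batch_same[..., 1].var())
print("static variance, mixed locations:", batch_mixed[..., 1].var())
```

Both batches have the required `(batch_size, timesteps, features)` shape, but only the mixed batch carries any variation in the static feature.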

Data is stored in a custom class `RNNData` defined in `models/moisture_rnn.py`. A custom class is used to organize scaling as well as batch construction. 
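
As a rough sketch of what scaling 3D RNN inputs can look like, the example below standardizes each feature with statistics computed from the training set only and then inverts the transformation. This assumes z-score scaling; the scaler actually used by `RNNData.scale_data` may differ, and the helper names `scale` and `inverse_scale` here are hypothetical stand-ins rather than the class methods.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic (samples, timesteps, features) arrays with two features
X_train = rng.normal(loc=[10.0, 300.0], scale=[2.0, 15.0], size=(64, 12, 2))
X_val = rng.normal(loc=[10.0, 300.0], scale=[2.0, 15.0], size=(16, 12, 2))

# Per-feature statistics over samples and timesteps, training set only
mu = X_train.mean(axis=(0, 1))
sigma = X_train.std(axis=(0, 1))

def scale(X):
    """Standardize each feature with the training mean and std."""
    return (X - mu) / sigma

def inverse_scale(X_scaled):
    """Undo the standardization."""
    return X_scaled * sigma + mu

X_train_s, X_val_s = scale(X_train), scale(X_val)
print(np.mean(X_train_s, axis=(0, 1)))  # approximately 0 for each feature
print(np.mean(X_val_s, axis=(0, 1)))    # near 0, but not exactly
print(np.allclose(inverse_scale(X_train_s), X_train))
```

Reusing the training statistics for the validation and test sets avoids leaking information from those sets into the scaling step, which is why their scaled means are generally close to zero but not exactly zero.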

In [None]:
dat = mrnn.RNNData(train, val, test,
                  method="random", random_state=None)

In [None]:
dat.X_train.shape

In [None]:
np.mean(dat.X_train, axis=(0,1))

In [None]:
np.mean(dat.X_val, axis=(0,1))

In [None]:
np.mean(dat.X_test, axis=(0,1))

In [None]:
dat.scale_data()

In [None]:
np.mean(dat.X_train, axis=(0,1))

In [None]:
np.mean(dat.X_val, axis=(0,1))

In [None]:
np.mean(dat.X_test, axis=(0,1))

In [None]:
dat.y_val.shape

In [None]:
dat.y_test.shape

In [None]:
a, b, c = dat.inverse_scale()

In [None]:
np.mean(a, axis=(0,1))

In [None]:
np.mean(b, axis=(0,1))

In [None]:
np.mean(c, axis=(0, 1))

In [None]:
dat.y_train.shape