# Building Samples from Irregularly Sampled Timeseries

This notebook proposes a way to build samples for sequence-learning tasks from an observed timeseries with irregular sampling intervals. The FMC data from the Carlson field study were sampled twice daily, and we want to train models that predict FMC hourly or even subhourly.

Input sequences for training remain 48 hours of hourly weather data. Response-variable sequences also have length 48, with a placeholder value (-9999) wherever FMC was not observed. The loss function masks out the placeholders, so the loss is computed on roughly 4 observations per 48-hour sequence.
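The masking idea can be sketched in NumPy as follows. This is only an illustration under stated assumptions: `masked_mse` is a hypothetical helper, not the loss used for training, and mean squared error is assumed as the underlying metric.

```python
import numpy as np

def masked_mse(y_true, y_pred, mask_val=-9999):
    """Mean squared error over observed time steps only (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    observed = y_true != mask_val  # True where a label exists
    if not observed.any():
        return float("nan")  # no labels in this sequence
    sq_err = (y_true - y_pred) ** 2
    return float(sq_err[observed].mean())

# Toy 48-hour sequence with 4 observed labels (made-up values)
y_true = np.full(48, -9999.0)
y_true[[5, 17, 29, 41]] = [10.0, 12.0, 11.0, 13.0]
y_pred = np.full(48, 11.0)

loss = masked_mse(y_true, y_pred)
print(loss)  # 1.5: only the 4 observed steps contribute
```

The 44 placeholder steps do not affect the result; without the mask, the -9999 values would dominate the error.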

Stride length is a hyperparameter. A stride of 1 shifts the 48-hour window one hour at a time, which maximizes the number of samples but makes them highly correlated. In principle it could be tuned, but we opt for the average sampling interval of the response variable, 12 hours.
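To make the trade-off concrete, the number of windows for a given stride can be counted directly. The sketch below uses a hypothetical helper `n_windows` and a made-up 30-day hourly record; with a stride of 12, consecutive 48-hour windows share 36 of their 48 hours.

```python
def n_windows(n_steps, seq_length=48, stride_length=12):
    """Count full windows of seq_length that fit in n_steps (hypothetical helper)."""
    if n_steps < seq_length:
        return 0
    return (n_steps - seq_length) // stride_length + 1

n_steps = 24 * 30  # assumed 30-day hourly record, for illustration only
counts = {stride: n_windows(n_steps, stride_length=stride) for stride in (1, 12, 48)}
print(counts)  # {1: 673, 12: 57, 48: 15}
```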

## Setup

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from src.utils import time_intp, read_yml, Dict
from src.reproducibility import set_seed
import joblib

In [None]:
weather = pd.read_excel("data/processed_data/dvdk_weather.xlsx")
fm = pd.read_excel("data/processed_data/ok_100h.xlsx")

conf = Dict(read_yml("etc/thesis_config.yaml"))
params = Dict(read_yml("models/params.yaml"))
scaler = joblib.load("models/scaler.joblib")

## Build Sparse Response-Variable Sequences

Given a set of sparsely observed FM values, get the weather data at each observation time and over a lookback period before it.

Steps:
- Define train/val/test periods
  - The Carlson data has only 1 location, so a spatial holdout is not possible
- For train, build 48-hour sequences of weather and response, filling missing response values with the mask value -9999
- For val and test, use all data and keep it in sequential order

In [None]:
# Combine weather and fm, fill na, add geographic features
df = weather.merge(
    fm[["utc_rounded", "utc_prov", "fm100"]],
    left_on="utc",
    right_on="utc_rounded",
    how="left"
).drop(columns="utc_rounded")

df["elev"] = conf.ok_elev
df["lon"] = conf.ok_lon
df["lat"] = conf.ok_lat

df["fm100"] = df["fm100"].fillna(-9999)
df[["utc", "utc_prov", "fm100", "lon", "lat", "elev"]].head(5)

In [None]:
# Split times
X_train = df[(df.utc >= conf.train_start) & (df.utc <= conf.train_end)][params.features_list]
y_train = df[(df.utc >= conf.train_start) & (df.utc <= conf.train_end)]["fm100"].to_numpy()

X_val = df[(df.utc >= conf.val_start) & (df.utc <= conf.val_end)][params.features_list]
y_val = df[(df.utc >= conf.val_start) & (df.utc <= conf.val_end)]["fm100"].to_numpy()

X_test = df[(df.utc >= conf.f_start) & (df.utc <= conf.f_end)][params.features_list]
y_test = df[(df.utc >= conf.f_start) & (df.utc <= conf.f_end)]["fm100"].to_numpy()

print(f"{X_train.shape=}")
print(f"{y_train.shape=}")
print(f"{X_val.shape=}")
print(f"{y_val.shape=}")
print(f"{X_test.shape=}")
print(f"{y_test.shape=}")

In [None]:
# Scale using saved scaler object from RNN, reshape val and test to 3d array
X_train_scaled = scaler.transform(X_train)

XX_val = scaler.transform(X_val)
XX_val = XX_val.reshape(1, *XX_val.shape)

XX_test = scaler.transform(X_test)
XX_test = XX_test.reshape(1, *XX_test.shape)

In [None]:
# Check for consistent 1-hour time steps: expect a single unique diff of 1 hour (the leading NaT is dropped)

print(df.utc.diff().unique()[1:])
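The same check can be turned into a pass/fail test rather than a printed value. A minimal sketch, assuming the timestamps can be converted to `datetime64` values; `is_regular_hourly` is a hypothetical helper and the timestamps below are made up.

```python
import numpy as np

def is_regular_hourly(times):
    """True if every consecutive pair of timestamps is exactly 1 hour apart."""
    times = np.asarray(times, dtype="datetime64[ns]")
    steps = np.diff(times)  # timedelta64 differences between neighbors
    return bool(np.all(steps == np.timedelta64(1, "h")))

# Six consecutive hourly timestamps (illustrative only)
regular = np.arange(
    np.datetime64("2023-01-01T00"),
    np.datetime64("2023-01-01T06"),
    np.timedelta64(1, "h"),
)
gappy = np.delete(regular, 3)  # drop one hour to create a 2-hour gap

print(is_regular_hourly(regular))
print(is_regular_hourly(gappy))
```

For the notebook's data this could be called as `is_regular_hourly(df.utc.to_numpy())`, assuming `df.utc` holds datetime values.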

In [None]:
def build_training_batches_univariate(X, y, seq_length=48, stride_length=12, mask_val=-9999):
    """
    Build fixed-length sequence samples from an hourly univariate time series.

    Inputs
    - X: array-like, shape (N, n_features)
    - y: array-like, shape (N,)
         Response with missing labels encoded as mask_val.
    - seq_length: int
         Sequence length (e.g., 48 hours).
    - stride_length: int
         Number of time steps to shift the sequence window between samples.
    - mask_val: float
         Sentinel value indicating missing y.

    Returns 
    - X: np.ndarray, shape (n_samples, seq_length, n_features)
    - y: np.ndarray, shape (n_samples, seq_length, 1)
    - mask: np.ndarray of bool, shape (n_samples, seq_length)  (True where observed, False where missing)
    """

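    # Coerce array-likes (lists, DataFrames) to NumPy so .ndim and slicing work
    X = np.asarray(X)
    y = np.asarray(y)

    # Checks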
    if X.ndim != 2:
        raise ValueError(f"X must be 2D (N, n_features). Got shape {X.shape}")
    if y.ndim != 1:
        raise ValueError(f"y must be 1D (N,). Got shape {y.shape}")
    if len(X) != len(y):
        raise ValueError(f"X and y must have same length. Got {len(X)} and {len(y)}")
    if seq_length <= 0:
        raise ValueError("seq_length must be > 0")
    if stride_length <= 0:
        raise ValueError("stride_length must be > 0")
    N = len(y)
    if N < seq_length:
        raise ValueError(f"Need N >= seq_length. Got N={N}, seq_length={seq_length}")

    X_list = []
    y_list = []
    mask_list = []
    for start in range(0, N - seq_length + 1, stride_length):
        X_i = X[start:start + seq_length, :]
        y_i = y[start:start + seq_length]
        mask_i = y_i != mask_val  # True where y is observed
        X_list.append(X_i)
        y_list.append(y_i)
        mask_list.append(mask_i)

    XX = np.array(X_list)
    yy = np.array(y_list)[..., np.newaxis]  # add trailing response dimension
    mask = np.array(mask_list)
    return XX, yy, mask

In [None]:
X_train_samples, y_train_samples, masks = build_training_batches_univariate(X=X_train_scaled, y=y_train)

In [None]:
print(f"{X_train_samples.shape=}")
print(f"{y_train_samples.shape=}")
print(f"{masks.shape=}")

In [None]:
# Check masking: print the full first response sequence, then only its observed values
print(y_train_samples[0,:,0])
print(y_train_samples[0,masks[0],0])
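As a further sanity check, the number of observed labels per window can be counted. The sketch below uses toy data with a label exactly every 12 hours (an idealization of twice-daily sampling) and builds the windows with NumPy's `sliding_window_view` instead of the function above, purely as an independent cross-check.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

mask_val = -9999
seq_length, stride_length = 48, 12

# Toy response: 96 hourly steps, one made-up observation every 12 hours
y_toy = np.full(96, float(mask_val))
y_toy[::12] = 15.0

# All 48-step windows, then keep every 12th start position
windows = sliding_window_view(y_toy, seq_length)[::stride_length]
toy_masks = windows != mask_val

obs_per_window = toy_masks.sum(axis=1)
print(windows.shape)   # (5, 48)
print(obs_per_window)  # [4 4 4 4 4]
```

With the real data, `masks.sum(axis=1)` gives the same summary; counts should hover around 4 and may vary where observations are missing or irregularly timed.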