# Cross-Validation Setup 

The purpose of this notebook is to split the collected data into training and test sets.

To get reliable estimates of the forecast error of a spatiotemporal model, care must be taken to avoid data leakage between the training and test sets. See: https://github.com/jh-206/FRAMSC-2024---FMDA-Data-and-CV-Methods/blob/main/Spatiotemporal%20Cross%20Validation.ipynb

In [None]:
import sys
sys.path.append('src')
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import inspect
from sklearn.model_selection import train_test_split
# Local Modules
from utils import make_st_map_interactive
from data_funcs import train_test_split_spacetime
import reproducibility

In [None]:
df = pd.read_pickle("data/rocky_2023_05-09.pkl")

In [None]:
# make_st_map_interactive(df)

In [None]:
df

## Spatiotemporal CV

For a meaningful analysis of forecast error for a spatiotemporal model, the test set must consist of locations that were not included in the training set, observed at times later than the training period. To conduct this split, we use a custom function, `train_test_split_spacetime`, that mimics the return format of the typical `sklearn` function `train_test_split` while accounting for relationships in space and time.
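To make the idea concrete, the following is a minimal sketch of one way such a split could work. It is not the source of `train_test_split_spacetime` (which is printed in the next cell). The function name `split_spacetime_sketch`, its parameters, and the toy data are hypothetical, and the column names `stid` and `date` are assumed to match the ones used later in this notebook. For brevity the sketch returns two data frames rather than the four-part `X_train, X_test, y_train, y_test` format.

```python
import numpy as np
import pandas as pd


def split_spacetime_sketch(df, test_frac_locs=0.2, test_frac_time=0.2, seed=0):
    """Hypothetical illustration only, not the repository's implementation."""
    rng = np.random.default_rng(seed)

    # Hold out a random subset of stations as test locations
    stations = sorted(df["stid"].unique())
    n_test_locs = max(1, int(round(test_frac_locs * len(stations))))
    test_stations = rng.choice(stations, size=n_test_locs, replace=False)

    # Reserve the final portion of the time range for testing
    times = df["date"].drop_duplicates().sort_values()
    n_test_times = max(1, int(round(test_frac_time * len(times))))
    cutoff = times.iloc[len(times) - n_test_times]

    is_test_loc = df["stid"].isin(test_stations)
    # Train: training stations before the cutoff
    train = df[~is_test_loc & (df["date"] < cutoff)]
    # Test: held-out stations at or after the cutoff
    test = df[is_test_loc & (df["date"] >= cutoff)]
    return train, test


# Toy data: 5 stations, 10 daily observations each
toy = pd.DataFrame({
    "stid": np.repeat(["A", "B", "C", "D", "E"], 10),
    "date": np.tile(pd.date_range("2023-05-01", periods=10, freq="D"), 5),
    "fm": np.arange(50, dtype=float),
})
train, test = split_spacetime_sketch(toy)
print(len(train), len(test))
```

Note that rows from training stations after the cutoff, and from test stations before it, are discarded in this sketch. That loss of data is the price of keeping the two sets separated in both space and time.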

In [None]:
# Print function
print(inspect.getsource(train_test_split_spacetime))

In [None]:
reproducibility.set_seed(4)
X_train, X_test, y_train, y_test = train_test_split_spacetime(df)

In [None]:
print(f"Number of Training Observations: {X_train.shape[0]}")
print(f"Number of Training Locations: {len(X_train.stid.unique())}")
print(f"Time range Train: {X_train.date.min().strftime('%Y-%m-%d %H:%M:%S'), X_train.date.max().strftime('%Y-%m-%d %H:%M:%S')}")
print("~"*50)
print(f"Number of Test Observations: {X_test.shape[0]}")
print(f"Number of Test Locations: {len(X_test.stid.unique())}")
print(f"Time range Test: {X_test.date.min().strftime('%Y-%m-%d %H:%M:%S'), X_test.date.max().strftime('%Y-%m-%d %H:%M:%S')}")
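
Beyond reading the printed summary, the two requirements stated above (no shared locations, test times after training times) can be checked programmatically. The helper below is a hedged sketch: `check_no_leakage` is a hypothetical name, not part of `data_funcs`, and it assumes the `stid` and `date` columns used in the summary cell. It is demonstrated on toy frames so that it runs on its own.

```python
import pandas as pd


def check_no_leakage(X_train, X_test, loc_col="stid", time_col="date"):
    """Hypothetical helper: report spatial and temporal overlap between splits."""
    # Locations that appear in both sets indicate spatial leakage
    shared = sorted(set(X_train[loc_col]) & set(X_test[loc_col]))
    # Every test time should be strictly later than every training time
    test_after_train = bool(X_test[time_col].min() > X_train[time_col].max())
    return {"shared_locations": shared, "test_after_train": test_after_train}


# Toy frames standing in for X_train and X_test
train_toy = pd.DataFrame({
    "stid": ["A", "A", "B"],
    "date": pd.to_datetime(["2023-05-01", "2023-05-02", "2023-05-01"]),
})
test_toy = pd.DataFrame({
    "stid": ["C", "C"],
    "date": pd.to_datetime(["2023-05-03", "2023-05-04"]),
})
clean_report = check_no_leakage(train_toy, test_toy)

# A deliberately leaky test set: reuses station A at a training-period time
leaky_toy = pd.DataFrame({
    "stid": ["A"],
    "date": pd.to_datetime(["2023-05-02"]),
})
leaky_report = check_no_leakage(train_toy, leaky_toy)
print(clean_report)
print(leaky_report)
```

In this notebook the same call would be `check_no_leakage(X_train, X_test)`, assuming the split keeps the `stid` and `date` columns in both frames.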