# Creating the test and validation datasets

- **Test dataset**: data held out until the very end and used as the final test of the chosen model.
- **Validation dataset**: data used between training steps to evaluate intermediate results and adjust the next step, for example when tuning hyperparameters or comparing candidate models.

In regular regression or classification, we usually sample a few records at random and set them aside. When dealing with time series, however, we need to respect the temporal order of the data: a model must not be trained on observations that come after the ones it is evaluated on. A best practice is therefore to set aside the most recent part of the dataset as the test data.
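To illustrate why a random holdout is a problem for time series, here is a minimal sketch on a synthetic daily series. The data and the names `series`, `random_test`, and `chrono_test` are illustrative assumptions, not part of this notebook. It checks whether any training row is dated after the earliest test row.

```python
import numpy as np
import pandas as pd

# Synthetic daily series standing in for real data (illustrative only).
idx = pd.date_range("2024-01-01", periods=100, freq="D")
series = pd.DataFrame({"y": np.arange(100, dtype=float)}, index=idx)

# Random holdout: test rows are scattered across the whole period.
random_test = series.sample(n=20, random_state=0)
random_train = series.drop(random_test.index)

# Chronological holdout: the last 20 days form the test set.
chrono_train, chrono_test = series.iloc[:-20], series.iloc[-20:]

# Leakage check: is any training row dated after the earliest test row?
random_leaks = random_train.index.max() > random_test.index.min()
chrono_leaks = chrono_train.index.max() > chrono_test.index.min()
print(f"random split trains on future data: {random_leaks}")
print(f"chronological split trains on future data: {chrono_leaks}")
```

With the chronological holdout, every training date precedes every test date, which mirrors how a forecast is actually used.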

Another rule of thumb is to make the validation and test datasets the same size, so that the conditions under which we make key modeling decisions on the validation data are as close as possible to those of the final evaluation on the test data.
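The equal-size rule of thumb can be sketched with a small helper that cuts two windows of the same length from the end of a series. The function `tail_split` and its `horizon` parameter are hypothetical names introduced for this example, and the demo frame is synthetic.

```python
import pandas as pd


def tail_split(df: pd.DataFrame, horizon: int):
    """Split a time-indexed frame into train, validation, and test parts.

    The last `horizon` rows become the test set and the `horizon` rows
    just before them become the validation set (hypothetical helper).
    """
    if horizon <= 0 or 2 * horizon >= len(df):
        raise ValueError("horizon must be positive and leave rows for training")
    df = df.sort_index()  # make sure rows are in chronological order
    train = df.iloc[: -2 * horizon]
    val = df.iloc[-2 * horizon : -horizon]
    test = df.iloc[-horizon:]
    return train, val, test


# Synthetic 90-day frame, used only to demonstrate the helper.
idx = pd.date_range("2024-01-01", periods=90, freq="D")
demo = pd.DataFrame({"y": range(90)}, index=idx)
train_demo, val_demo, test_demo = tail_split(demo, horizon=14)
print(len(train_demo), len(val_demo), len(test_demo))
```

Splitting by calendar month, as done later in this notebook, gives windows of 31 and 28 days, which is close to equal while keeping the boundaries easy to read.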

In [19]:
import pandas as pd
import numpy as np

In [20]:
treated_data_rep = r'../0_Data/wrangled/'
HKDaily_AGG = pd.read_pickle(treated_data_rep + "HKDaily_AGG.pkl")
daily_temp = HKDaily_AGG[['TX']]
# Keep the 10-year period 2015-2024 plus January and February 2025 (validation and test)
daily_temp_10 = daily_temp.loc[daily_temp.index > '2014/12/31']
np.unique(daily_temp_10.index.year)

array([2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023, 2024, 2025],
      dtype=int32)

The observations run from 1 January 2015 to 28 February 2025. We can therefore use February 2025 as the final test dataset and January 2025 as the validation dataset, leaving the 10 years from 2015 to 2024 as the training dataset.

In [21]:
test_mask = (daily_temp_10.index.year==2025) & (daily_temp_10.index.month==2)
val_mask = (daily_temp_10.index.year==2025) & (daily_temp_10.index.month==1)

In [24]:
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning
# that pandas raises when writing into a slice with .loc
daily_temp_10 = daily_temp_10.assign(unique_id=1.0)

In [26]:
daily_temp_10.columns = ['y', 'unique_id']

In [27]:
train = daily_temp_10[~(val_mask|test_mask)]
val = daily_temp_10[val_mask]
test = daily_temp_10[test_mask]
print(f"# of Training samples: {len(train)} | # of Validation samples: {len(val)} | # of Test samples: {len(test)}")
print(f"Max Date in Train: {train.index.max()} | Min Date in Validation: {val.index.min()} | Min Date in Test: {test.index.min()}")

# of Training samples: 3653 | # of Validation samples: 31 | # of Test samples: 28
Max Date in Train: 2024-12-31 00:00:00 | Min Date in Validation: 2025-01-01 00:00:00 | Min Date in Test: 2025-02-01 00:00:00
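The counts printed above can be cross-checked without the original pickle file. The sketch below rebuilds the same masks on a synthetic stand-in for `daily_temp_10` (one row per day over the same period, with placeholder values) and asserts that the three parts are disjoint, complete, and chronologically ordered. The names `frame`, `train_s`, `val_s`, and `test_s` are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for daily_temp_10: one row per day, placeholder values.
idx = pd.date_range("2015-01-01", "2025-02-28", freq="D")
frame = pd.DataFrame({"y": np.zeros(len(idx)), "unique_id": 1.0}, index=idx)

# Same masks as in the notebook: February 2025 for test, January 2025 for validation.
test_mask = (frame.index.year == 2025) & (frame.index.month == 2)
val_mask = (frame.index.year == 2025) & (frame.index.month == 1)
train_s = frame[~(val_mask | test_mask)]
val_s = frame[val_mask]
test_s = frame[test_mask]

# The three parts must cover every row exactly once ...
assert len(train_s) + len(val_s) + len(test_s) == len(frame)
assert not (val_mask & test_mask).any()
# ... and must follow one another in time without overlap.
assert train_s.index.max() < val_s.index.min()
assert val_s.index.max() < test_s.index.min()

print(len(train_s), len(val_s), len(test_s))
```

If the real data had missing days, the counts would differ from this gap-free reference, so the comparison doubles as a quick completeness check.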


In [28]:
train.to_parquet("../0_Data/wrangled/daily_temp_train.parquet")
val.to_parquet("../0_Data/wrangled/daily_temp_val.parquet")
test.to_parquet("../0_Data/wrangled/daily_temp_test.parquet")