# Train/Test/Val Split Analysis

This notebook analyzes the final train/validation/test split data for the energy forecasting models.

## Configuration
- **Dataset**: `building` (TrainDatasetBuilding)
- **Split Method**: `time` (time-based split, 80/10/10 ratio)
- **Resolutions**: Daily and Hourly

## Overview

The train/test/val split is performed in `src/energy_forecast/utils/train_test_val_split.py`:
- **Time-based split**: Each building's time series is split chronologically (80% train, 10% val, 10% test)
- Series too short to create at least one training example (< lag_in + lag_out) are discarded
- Split preserves temporal order within each building
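
The split rule above can be illustrated with a minimal sketch. This is not the code in `train_test_val_split.py`: the helper name `split_series_chronologically`, the flooring of the split boundaries, the order of the validation and test segments, and the choice to check only the train segment against `lag_in + lag_out` are all assumptions made for illustration.

```python
import numpy as np


def split_series_chronologically(n_rows, lag_in, lag_out, ratios=(0.8, 0.1, 0.1)):
    """Return (train, val, test) index arrays for one series, or None if too short.

    Hypothetical helper: boundaries are floored, the segment order is assumed,
    and a series is rejected when its train segment is shorter than
    lag_in + lag_out.
    """
    train_end = int(n_rows * ratios[0])
    val_end = train_end + int(n_rows * ratios[1])
    if train_end < lag_in + lag_out:
        return None  # cannot form a single training window
    idx = np.arange(n_rows)  # rows are assumed to be sorted by time
    return idx[:train_end], idx[train_end:val_end], idx[val_end:]


# Toy example: one 100-day series and one 10-day series with the daily settings
long_split = split_series_chronologically(100, lag_in=7, lag_out=7)
short_split = split_series_chronologically(10, lag_in=7, lag_out=7)
print([len(part) for part in long_split])
print(short_split)
```

Because the indices are sliced from a sorted range, every training index precedes every held-out index within the same series, which is the property the time-based split relies on.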

In [1]:
import sys
sys.path.insert(0, '..')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from loguru import logger

from src.energy_forecast.dataset import TrainDatasetBuilding
from src.energy_forecast.utils.train_test_val_split import get_train_test_val_split

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (14, 6)

2025-12-31 12:43:31.619 | INFO     | src.energy_forecast.config:<module>:15 - PROJ_ROOT path is: /home/marja/PycharmProjects/energy-forecast-wahl

## Daily Dataset (7-day forecast)

In [2]:
# Configuration for daily dataset
config_daily = {
    "dataset": "building",
    "res": "daily",
    "interpolate": 1,
    "lag_in": 7,   # 7 days of historical data
    "lag_out": 7,  # 7 days forecast
    "n_in": 7,
    "n_out": 7,
    "energy": "all",
    "train_test_split_method": "time",
    "scale_mode": "individual",
    "scaler": "standard",
    "feature_code": 10  # FEATURE_SET_10 for building features
}

print("Loading daily dataset...")
ds_daily = TrainDatasetBuilding(config_daily)
ds_daily.load_feat_data(interpolate=True)
ds_daily.preprocess()

print(f"\nDataset shape before split: {ds_daily.df.shape}")
print(f"Number of unique buildings: {ds_daily.df['id'].n_unique()}")

Loading daily dataset...
2025-12-31 12:43:33.936 | INFO     | src.energy_forecast.dataset:preprocess:606 - Training Features: ['heated_area', 'wpgt', 'weekend', 'tsun', 'holiday', 'pres', 'tavg', 'daily_avg', 'wspd', 'tmax', 'tmin', 'hum_avg', 'hum_max', 'typ_0', 'primary_energy_gas', 'typ_1', 'primary_energy_district heating', 'wdir', 'hum_min', 'typ_4', 'diff', 'typ_2', 'prcp', 'day_of_month_sin', 'weekday_sin', 'day_of_month_cos', 'weekday_cos']

Dataset shape before split: (56955, 48)
Number of unique buildings: 132


In [3]:
# Perform train/test/val split
print("Performing train/test/val split...")
ds_daily = get_train_test_val_split(ds_daily)

print(f"\nSplit completed!")
print(f"Number of discarded series (too short): {len(ds_daily.discarded_ids)}")
print(f"Remaining series: {ds_daily.df['id'].n_unique() - len(ds_daily.discarded_ids)}")

Performing train/test/val split...
2025-12-31 12:43:34.089 | INFO     | src.energy_forecast.utils.train_test_val_split:train_test_split_time_based:85 - Removed 40 series because they were too short
2025-12-31 12:43:34.089 | INFO     | src.energy_forecast.utils.train_test_val_split:train_test_split_time_based:87 - Remaining series: 92
2025-12-31 12:43:34.129 | INFO     | src.energy_forecast.utils.train_test_val_split:get_train_test_val_split:236 - Train data shape: (43343, 26)
2025-12-31 12:43:34.129 | INFO     | src.energy_forecast.utils.train_test_val_split:get_train_test_val_split:237 - Test data shape: (5402, 26)
2025-12-31 12:43:34.129 | INFO     | src.energy_forecast.utils.train_test_val_split:get_train_test_val_split:238 - Valid

### Date Ranges Analysis

In [4]:
# Analyze date ranges for train/val/test splits
import polars as pl

# Get dataframe with datetime
df_daily = ds_daily.df.filter(~pl.col("id").is_in(ds_daily.discarded_ids))

# Get train/val/test data using indices
train_df = df_daily.filter(pl.col("index").is_in(ds_daily.train_idxs))
val_df = df_daily.filter(pl.col("index").is_in(ds_daily.val_idxs))
test_df = df_daily.filter(pl.col("index").is_in(ds_daily.test_idxs))

# Compute date ranges
date_ranges = pd.DataFrame({
    'Split': ['Train', 'Validation', 'Test'],
    'Start Date': [
        train_df['datetime'].min(),
        val_df['datetime'].min(),
        test_df['datetime'].min()
    ],
    'End Date': [
        train_df['datetime'].max(),
        val_df['datetime'].max(),
        test_df['datetime'].max()
    ],
    'Total Days': [
        (train_df['datetime'].max() - train_df['datetime'].min()).days,
        (val_df['datetime'].max() - val_df['datetime'].min()).days,
        (test_df['datetime'].max() - test_df['datetime'].min()).days
    ],
    'Total Samples': [len(train_df), len(val_df), len(test_df)]
})

print("Daily Dataset - Date Ranges per Split:\n")
print(date_ranges.to_string(index=False))

Daily Dataset - Date Ranges per Split:

     Split Start Date   End Date  Total Days  Total Samples
     Train 2017-03-21 2023-08-01        2324          43343
Validation 2017-11-08 2023-09-19        2141           5361
      Test 2017-10-13 2023-08-19        2136           5402


### Per-Series Statistics

In [16]:
# Count unique series per split
train_series = train_df.select('id').unique()
val_series = val_df.select('id').unique()
test_series = test_df.select('id').unique()

print(f"Unique series per split:")
print(f"  Train: {len(train_series)} series")
print(f"  Validation: {len(val_series)} series")
print(f"  Test: {len(test_series)} series")
print(f"  Total: {len(train_series)} series (same buildings across all splits)")

# Compute statistics per series for all splits
train_stats = train_df.group_by('id').agg([
    pl.len().alias('count'),
    pl.col('datetime').min().alias('start'),
    pl.col('datetime').max().alias('end')
])

val_stats = val_df.group_by('id').agg([
    pl.len().alias('count'),
    pl.col('datetime').min().alias('start'),
    pl.col('datetime').max().alias('end')
])

test_stats = test_df.group_by('id').agg([
    pl.len().alias('count'),
    pl.col('datetime').min().alias('start'),
    pl.col('datetime').max().alias('end')
])

# Display statistics
stats_summary = pd.DataFrame({
    'Split': ['Train', 'Validation', 'Test'],
    'Min': [train_stats['count'].min(), val_stats['count'].min(), test_stats['count'].min()],
    'Max': [train_stats['count'].max(), val_stats['count'].max(), test_stats['count'].max()],
    'Mean': [train_stats['count'].mean(), val_stats['count'].mean(), test_stats['count'].mean()],
    'Median': [train_stats['count'].median(), val_stats['count'].median(), test_stats['count'].median()]
})

print(f"\nPer-series sample statistics:")
print(stats_summary.to_string(index=False))

Unique series per split:
  Train: 92 series
  Validation: 92 series
  Test: 92 series
  Total: 92 series (same buildings across all splits)

Per-series sample statistics:
     Split  Min  Max       Mean  Median
     Train  124 1333 471.119565   384.5
Validation   15  166  58.271739    47.5
      Test   15  166  58.717391    48.0
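
The "same buildings across all splits" note is printed from equal counts, which does not by itself prove that the three id sets are identical. A set comparison makes the check explicit. The sketch below is a minimal illustration with a hypothetical helper `compare_split_ids` and toy ids; it is not part of the project code.

```python
def compare_split_ids(train_ids, val_ids, test_ids):
    """Return the ids that are missing from at least one split (empty set if none)."""
    id_sets = [set(train_ids), set(val_ids), set(test_ids)]
    in_any = set.union(*id_sets)
    in_all = set.intersection(*id_sets)
    return in_any - in_all


# Toy ids (hypothetical): equal counts, but building "d" replaces "c" in test
mismatch = compare_split_ids(["a", "b", "c"], ["a", "b", "c"], ["a", "b", "d"])
print(sorted(mismatch))

# Identical id sets give an empty result
matching = compare_split_ids(["a", "b"], ["b", "a"], ["a", "b"])
print(matching)
```

With the notebook's polars frames, the inputs could be obtained with `train_df['id'].unique().to_list()` and the equivalent calls on `val_df` and `test_df`.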


### Dataset Reduction Analysis

The data goes through several preprocessing steps that reduce the total samples:
1. **Original interpolated data**: Raw dataset with all buildings and timestamps
2. **Lag feature creation**: Adding `lag_in` historical and `lag_out` future columns creates NaN values at the start/end of each series
3. **Null removal**: Rows with NaN lag features are dropped
4. **Short series filtering**: Buildings with insufficient data after split are discarded

Let's compare the original dataset size to the final preprocessed size.
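
Steps 2 and 3 can be reproduced on a toy series to see how many rows the lag columns cost. The sketch below uses pandas `shift` and `dropna`. The helper `add_lag_columns` and the column names `hist_k` and `target_k` are hypothetical, and the project's own preprocessing may build its lag features differently.

```python
import numpy as np
import pandas as pd


def add_lag_columns(series, lag_in, lag_out):
    """Build lag_in history columns and lag_out target columns, then drop NaN rows."""
    frame = pd.DataFrame({"y": series})
    for k in range(1, lag_in + 1):
        frame[f"hist_{k}"] = frame["y"].shift(k)      # value k steps in the past
    for k in range(lag_out):
        frame[f"target_{k}"] = frame["y"].shift(-k)   # value k steps ahead (0 = current row)
    return frame.dropna()


# Toy series of 30 consecutive daily values with the daily settings
toy = pd.Series(np.arange(30, dtype=float))
windowed = add_lag_columns(toy, lag_in=7, lag_out=7)
rows_lost = len(toy) - len(windowed)
print(f"{len(toy)} rows -> {len(windowed)} rows ({rows_lost} lost)")
```

Under these assumptions a contiguous series loses `lag_in + lag_out - 1` rows: the first `lag_in` rows lack history and the last `lag_out - 1` rows lack future values.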

In [18]:
# Load original interpolated dataset (before preprocessing)
import polars as pl
original_daily = pl.read_csv('../data/processed/dataset_interpolate_daily_feat.csv')

print("Dataset Size Comparison (Daily):\n")
print(f"Original interpolated dataset: {len(original_daily):,} samples")
print(f"After preprocessing (with lag features): {len(ds_daily.df):,} samples")
print(f"  Reduction: {len(original_daily) - len(ds_daily.df):,} samples ({((len(original_daily) - len(ds_daily.df)) / len(original_daily) * 100):.1f}%)")
print(f"\nAfter discarding {len(ds_daily.discarded_ids)} short series: {len(df_daily):,} samples")
print(f"  Total reduction: {len(original_daily) - len(df_daily):,} samples ({((len(original_daily) - len(df_daily)) / len(original_daily) * 100):.1f}%)")

print(f"\n{'='*70}")
print("Final Windowed Samples per Split:")
print(f"{'='*70}")
print(f"Train samples: {ds_daily.X_train.shape[0]:,} samples × {ds_daily.X_train.shape[1]} features")
print(f"Validation samples: {ds_daily.X_val.shape[0]:,} samples × {ds_daily.X_val.shape[1]} features")
print(f"Test samples: {ds_daily.X_test.shape[0]:,} samples × {ds_daily.X_test.shape[1]} features")
print(f"\nTarget shape:")
print(f"Train targets: {ds_daily.y_train.shape[0]:,} samples × {ds_daily.y_train.shape[1]} timesteps")
print(f"Validation targets: {ds_daily.y_val.shape[0]:,} samples × {ds_daily.y_val.shape[1]} timesteps")
print(f"Test targets: {ds_daily.y_test.shape[0]:,} samples × {ds_daily.y_test.shape[1]} timesteps")

# Calculate per-split percentages
total_final = len(ds_daily.X_train) + len(ds_daily.X_val) + len(ds_daily.X_test)
print(f"\nSplit distribution:")
print(f"  Train: {len(ds_daily.X_train)/total_final*100:.1f}%")
print(f"  Validation: {len(ds_daily.X_val)/total_final*100:.1f}%")
print(f"  Test: {len(ds_daily.X_test)/total_final*100:.1f}%")

Dataset Size Comparison (Daily):

Original interpolated dataset: 103,999 samples
After preprocessing (with lag features): 56,955 samples
  Reduction: 47,044 samples (45.2%)

After discarding 40 short series: 54,106 samples
  Total reduction: 49,893 samples (48.0%)

Final Windowed Samples per Split:
Train samples: 43,343 samples × 26 features
Validation samples: 5,361 samples × 26 features
Test samples: 5,402 samples × 26 features

Target shape:
Train targets: 43,343 samples × 7 timesteps
Validation targets: 5,361 samples × 7 timesteps
Test targets: 5,402 samples × 7 timesteps

Split distribution:
  Train: 80.1%
  Validation: 9.9%
  Test: 10.0%


## Hourly Dataset (72-hour forecast)

In [21]:
# Configuration for hourly dataset
config_hourly = {
    "dataset": "building",
    "res": "hourly",
    "interpolate": 1,
    "lag_in": 72,   # 72 hours (3 days) of historical data
    "lag_out": 72,  # 72 hours (3 days) forecast
    "n_in": 72,
    "n_out": 72,
    "energy": "all",
    "train_test_split_method": "time",
    "scale_mode": "individual",
    "scaler": "standard",
    "feature_code": 15  # FEATURE_SET_15 for hourly building features
}

print("Loading hourly dataset...")
ds_hourly = TrainDatasetBuilding(config_hourly)
ds_hourly.load_feat_data(interpolate=True)
ds_hourly.preprocess()

print(f"\nDataset shape before split: {ds_hourly.df.shape}")
print(f"Number of unique buildings: {ds_hourly.df['id'].n_unique()}")

Loading hourly dataset...
2025-12-31 12:59:02.977 | INFO     | src.energy_forecast.dataset:preprocess:606 - Training Features: ['heated_area', 'wpgt', 'coco', 'weekend', 'tsun', 'holiday', 'pres', 'prcp', 'daily_avg', 'wspd', 'temp', 'dwpt', 'typ_0', 'primary_energy_gas', 'rhum', 'typ_1', 'primary_energy_district heating', 'wdir', 'typ_4', 'diff', 'typ_2', 'snow', 'day_of_month_sin', 'weekday_sin', 'day_of_month_cos', 'weekday_cos']

Dataset shape before split: (142085, 176)
Number of unique buildings: 44


In [22]:
# Perform train/test/val split
print("Performing train/test/val split...")
ds_hourly = get_train_test_val_split(ds_hourly)

print(f"\nSplit completed!")
print(f"Number of discarded series (too short): {len(ds_hourly.discarded_ids)}")
print(f"Remaining series: {ds_hourly.df['id'].n_unique() - len(ds_hourly.discarded_ids)}")

Performing train/test/val split...
2025-12-31 12:59:07.283 | INFO     | src.energy_forecast.utils.train_test_val_split:train_test_split_time_based:85 - Removed 19 series because they were too short
2025-12-31 12:59:07.283 | INFO     | src.energy_forecast.utils.train_test_val_split:train_test_split_time_based:87 - Remaining series: 25
2025-12-31 12:59:07.430 | INFO     | src.energy_forecast.utils.train_test_val_split:get_train_test_val_split:236 - Train data shape: (108849, 25)
2025-12-31 12:59:07.430 | INFO     | src.energy_forecast.utils.train_test_val_split:get_train_test_val_split:237 - Test data shape: (13603, 25)
2025-12-31 12:59:07.430 | INFO     | src.energy_forecast.utils.train_test_val_split:get_train_test_val_split:238 - Val

### Date Ranges Analysis

In [23]:
# Analyze date ranges for train/val/test splits (hourly)
df_hourly = ds_hourly.df.filter(~pl.col("id").is_in(ds_hourly.discarded_ids))

# Get train/val/test data using indices
train_df_h = df_hourly.filter(pl.col("index").is_in(ds_hourly.train_idxs))
val_df_h = df_hourly.filter(pl.col("index").is_in(ds_hourly.val_idxs))
test_df_h = df_hourly.filter(pl.col("index").is_in(ds_hourly.test_idxs))

# Compute date ranges
date_ranges_h = pd.DataFrame({
    'Split': ['Train', 'Validation', 'Test'],
    'Start Date': [
        train_df_h['datetime'].min(),
        val_df_h['datetime'].min(),
        test_df_h['datetime'].min()
    ],
    'End Date': [
        train_df_h['datetime'].max(),
        val_df_h['datetime'].max(),
        test_df_h['datetime'].max()
    ],
    'Total Days': [
        (train_df_h['datetime'].max() - train_df_h['datetime'].min()).days,
        (val_df_h['datetime'].max() - val_df_h['datetime'].min()).days,
        (test_df_h['datetime'].max() - test_df_h['datetime'].min()).days
    ],
    'Total Hours': [
        int((train_df_h['datetime'].max() - train_df_h['datetime'].min()).total_seconds() / 3600),
        int((val_df_h['datetime'].max() - val_df_h['datetime'].min()).total_seconds() / 3600),
        int((test_df_h['datetime'].max() - test_df_h['datetime'].min()).total_seconds() / 3600)
    ],
    'Total Samples': [len(train_df_h), len(val_df_h), len(test_df_h)]
})

print("Hourly Dataset - Date Ranges per Split:\n")
print(date_ranges_h.to_string(index=False))

Hourly Dataset - Date Ranges per Split:

     Split          Start Date            End Date  Total Days  Total Hours  Total Samples
     Train 2021-10-01 04:00:00 2023-08-29 16:00:00         697        16740         108849
Validation 2021-12-21 05:00:00 2023-09-22 08:00:00         640        15363          13591
      Test 2021-12-12 05:00:00 2023-09-14 22:00:00         641        15401          13603


### Per-Series Statistics

In [24]:
# Count unique series per split
train_series_h = train_df_h.select('id').unique()
val_series_h = val_df_h.select('id').unique()
test_series_h = test_df_h.select('id').unique()

print(f"Unique series per split:")
print(f"  Train: {len(train_series_h)} series")
print(f"  Validation: {len(val_series_h)} series")
print(f"  Test: {len(test_series_h)} series")
print(f"  Total: {len(train_series_h)} series (same buildings across all splits)")

# Compute statistics per series for all splits
train_stats_h = train_df_h.group_by('id').agg([
    pl.len().alias('count'),
    pl.col('datetime').min().alias('start'),
    pl.col('datetime').max().alias('end')
])

val_stats_h = val_df_h.group_by('id').agg([
    pl.len().alias('count'),
    pl.col('datetime').min().alias('start'),
    pl.col('datetime').max().alias('end')
])

test_stats_h = test_df_h.group_by('id').agg([
    pl.len().alias('count'),
    pl.col('datetime').min().alias('start'),
    pl.col('datetime').max().alias('end')
])

# Display statistics
stats_summary_h = pd.DataFrame({
    'Split': ['Train', 'Validation', 'Test'],
    'Min': [train_stats_h['count'].min(), val_stats_h['count'].min(), test_stats_h['count'].min()],
    'Max': [train_stats_h['count'].max(), val_stats_h['count'].max(), test_stats_h['count'].max()],
    'Mean': [train_stats_h['count'].mean(), val_stats_h['count'].mean(), test_stats_h['count'].mean()],
    'Median': [train_stats_h['count'].median(), val_stats_h['count'].median(), test_stats_h['count'].median()]
})

print(f"\nPer-series sample statistics:")
print(stats_summary_h.to_string(index=False))

Unique series per split:
  Train: 25 series
  Validation: 25 series
  Test: 25 series
  Total: 25 series (same buildings across all splits)

Per-series sample statistics:
     Split  Min   Max    Mean  Median
     Train 1431 11898 4353.96  2732.0
Validation  178  1487  543.64   341.0
      Test  179  1487  544.12   341.0


### Dataset Reduction Analysis

In [25]:
# Load original interpolated dataset (before preprocessing)
original_hourly = pl.read_csv('../data/processed/dataset_interpolate_hourly_feat.csv')

print("Dataset Size Comparison (Hourly):\n")
print(f"Original interpolated dataset: {len(original_hourly):,} samples")
print(f"After preprocessing (with lag features): {len(ds_hourly.df):,} samples")
print(f"  Reduction: {len(original_hourly) - len(ds_hourly.df):,} samples ({((len(original_hourly) - len(ds_hourly.df)) / len(original_hourly) * 100):.1f}%)")
print(f"\nAfter discarding {len(ds_hourly.discarded_ids)} short series: {len(df_hourly):,} samples")
print(f"  Total reduction: {len(original_hourly) - len(df_hourly):,} samples ({((len(original_hourly) - len(df_hourly)) / len(original_hourly) * 100):.1f}%)")

print(f"\n{'='*70}")
print("Final Windowed Samples per Split:")
print(f"{'='*70}")
print(f"Train samples: {ds_hourly.X_train.shape[0]:,} samples × {ds_hourly.X_train.shape[1]} features")
print(f"Validation samples: {ds_hourly.X_val.shape[0]:,} samples × {ds_hourly.X_val.shape[1]} features")
print(f"Test samples: {ds_hourly.X_test.shape[0]:,} samples × {ds_hourly.X_test.shape[1]} features")
print(f"\nTarget shape:")
print(f"Train targets: {ds_hourly.y_train.shape[0]:,} samples × {ds_hourly.y_train.shape[1]} timesteps")
print(f"Validation targets: {ds_hourly.y_val.shape[0]:,} samples × {ds_hourly.y_val.shape[1]} timesteps")
print(f"Test targets: {ds_hourly.y_test.shape[0]:,} samples × {ds_hourly.y_test.shape[1]} timesteps")

# Calculate per-split percentages
total_final_h = len(ds_hourly.X_train) + len(ds_hourly.X_val) + len(ds_hourly.X_test)
print(f"\nSplit distribution:")
print(f"  Train: {len(ds_hourly.X_train)/total_final_h*100:.1f}%")
print(f"  Validation: {len(ds_hourly.X_val)/total_final_h*100:.1f}%")
print(f"  Test: {len(ds_hourly.X_test)/total_final_h*100:.1f}%")

Dataset Size Comparison (Hourly):

Original interpolated dataset: 834,313 samples
After preprocessing (with lag features): 142,085 samples
  Reduction: 692,228 samples (83.0%)

After discarding 19 short series: 136,043 samples
  Total reduction: 698,270 samples (83.7%)

Final Windowed Samples per Split:
Train samples: 108,849 samples × 25 features
Validation samples: 13,591 samples × 25 features
Test samples: 13,603 samples × 25 features

Target shape:
Train targets: 108,849 samples × 72 timesteps
Validation targets: 13,591 samples × 72 timesteps
Test targets: 13,603 samples × 72 timesteps

Split distribution:
  Train: 80.0%
  Validation: 10.0%
  Test: 10.0%


## Summary Comparison

Key insights from the train/test/val split analysis:

In [27]:
# Summary comparison
# Calculate total reduction from original interpolated datasets
total_final_daily = len(ds_daily.X_train) + len(ds_daily.X_val) + len(ds_daily.X_test)
total_final_hourly = len(ds_hourly.X_train) + len(ds_hourly.X_val) + len(ds_hourly.X_test)

reduction_daily = ((len(original_daily) - total_final_daily) / len(original_daily) * 100)
reduction_hourly = ((len(original_hourly) - total_final_hourly) / len(original_hourly) * 100)

summary = pd.DataFrame({
    'Metric': [
        'Forecast Horizon',
        'Lag In (History)',
        'Lag Out (Future)',
        'Remaining Series',
        'Discarded Series',
        'Train Samples (Windowed)',
        'Val Samples (Windowed)',
        'Test Samples (Windowed)',
        'Total Features',
        'Total Reduction (%)'
    ],
    'Daily': [
        '7 days',
        '7 days',
        '7 days',
        len(train_series),
        len(ds_daily.discarded_ids),
        f"{len(ds_daily.X_train):,}",
        f"{len(ds_daily.X_val):,}",
        f"{len(ds_daily.X_test):,}",
        ds_daily.X_train.shape[1],
        f"{reduction_daily:.1f}%"
    ],
    'Hourly': [
        '72 hours (3 days)',
        '72 hours (3 days)',
        '72 hours (3 days)',
        len(train_series_h),
        len(ds_hourly.discarded_ids),
        f"{len(ds_hourly.X_train):,}",
        f"{len(ds_hourly.X_val):,}",
        f"{len(ds_hourly.X_test):,}",
        ds_hourly.X_train.shape[1],
        f"{reduction_hourly:.1f}%"
    ]
})

print("\n" + "="*80)
print("SUMMARY: Daily vs Hourly Dataset Comparison")
print("="*80 + "\n")
print(summary.to_string(index=False))
print("\n" + "="*80)


SUMMARY: Daily vs Hourly Dataset Comparison

                  Metric  Daily            Hourly
        Forecast Horizon 7 days 72 hours (3 days)
        Lag In (History) 7 days 72 hours (3 days)
        Lag Out (Future) 7 days 72 hours (3 days)
        Remaining Series     92                25
        Discarded Series     40                19
Train Samples (Windowed) 43,343           108,849
  Val Samples (Windowed)  5,361            13,591
 Test Samples (Windowed)  5,402            13,603
          Total Features     26                25
     Total Reduction (%)  48.0%             83.7%



### Key Observations

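1. **Time-based Split**: The 80/10/10 split is applied to each building separately, so temporal order is preserved within every building's time series, which suits time series forecasting evaluation. Because buildings cover different periods, the calendar ranges of the three splits overlap across buildings: in the daily data, the train split runs until 2023-08-01 while the test split starts on 2017-10-13.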

2. **Sliding Window Effect**: Converting raw time series to sliding window format reduces the number of samples:
   - Each window needs `lag_in` + `lag_out` consecutive timesteps
   - First `lag_in` and last `lag_out-1` samples cannot form complete windows
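   - Larger windows lose more rows at the series boundaries: `lag_in + lag_out - 1` is 143 rows per contiguous series for hourly and 13 for daily. These boundary losses are small next to the reductions measured above (83.0% hourly, 45.2% daily), so they are unlikely to explain the reductions on their own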

3. **Series Filtering**: Buildings with insufficient data (< `lag_in` + `lag_out` samples in any split) are discarded to ensure every building can contribute at least one training example.

4. **Feature Dimensionality**:
   - Daily dataset uses FEATURE_SET_10 (building metadata, daily weather, and calendar features), giving 26 input features
   - Hourly dataset uses FEATURE_SET_15 (building metadata, hourly weather, and calendar features), giving 25 input features

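5. **Temporal Coverage**: The two datasets cover different periods at different granularities. The daily data spans 2017-03-21 to 2023-09-19 (about 6.5 years), while the hourly data spans 2021-10-01 to 2023-09-22 (about 2 years). Together they support short-term (hourly, 72-hour horizon) and medium-term (daily, 7-day horizon) forecasting.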
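
Observation 1 can be checked directly: within each building, the train segment should end before either held-out segment begins, and the validation and test segments should not overlap. The sketch below runs that check on toy data with pandas. The building ids, dates, the `split` column, and the order in which validation and test rows are assigned are all hypothetical, so the check is written without assuming which held-out segment comes first.

```python
import pandas as pd

# Toy data: two buildings with different recording periods (hypothetical values)
rows = []
for building_id, start in [("a", "2017-03-21"), ("b", "2021-01-01")]:
    dates = pd.date_range(start, periods=100, freq="D")
    for i, date in enumerate(dates):
        split = "train" if i < 80 else "val" if i < 90 else "test"
        rows.append({"id": building_id, "datetime": date, "split": split})
df = pd.DataFrame(rows)

# First and last timestamp of every split, one row per building
bounds = df.groupby(["id", "split"])["datetime"].agg(["min", "max"]).unstack("split")

# Train must end before both held-out segments start
train_first = (
    (bounds[("max", "train")] < bounds[("min", "val")])
    & (bounds[("max", "train")] < bounds[("min", "test")])
)
# Validation and test must not overlap, whichever comes first
held_out_disjoint = (
    (bounds[("max", "val")] < bounds[("min", "test")])
    | (bounds[("max", "test")] < bounds[("min", "val")])
)
print(train_first.all(), held_out_disjoint.all())

# Across buildings the calendar ranges still overlap: the earliest test date
# (building a) lies before the latest train date (building b)
global_ranges = df.groupby("split")["datetime"].agg(["min", "max"])
overlap = global_ranges.loc["test", "min"] < global_ranges.loc["train", "max"]
print(overlap)
```

The last check shows why split-level date ranges can overlap even when every individual building is split in strict chronological order.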