# GSPHAR Model Input Inspection

This notebook examines the inputs fed into the GSPHAR model: their tensor shapes, the transformations that produce them, and their actual values. It traces how the raw time series is processed and structured on its way to the model.

In [None]:
# Import libraries
import os
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import torch
from torch.utils.data import DataLoader

# Add the parent directory to the path to import from the GSPHAR package
sys.path.insert(0, os.path.abspath('..'))

# Import from local modules
from config import settings
from src.data import load_data, split_data, create_lagged_features
from src.utils.date_aware_dataset import IndexMappingDataset

# Set plot style
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 12

## 1. Load and Prepare Data

In [None]:
# Load cryptocurrency realized volatility data
data_file = '../data/rv5_sqrt_38_crypto.csv'

# Custom loading for crypto data which has 'Open Time' as the date column
try:
    data = pd.read_csv(data_file, parse_dates=['Open Time'])
    data.set_index('Open Time', inplace=True)
    data = data * 100  # Scale by 100 to match the convention in load_data
except Exception as e:
    print(f"Error loading cryptocurrency data with custom loader: {e}")
    print("Falling back to standard load_data function")
    data = load_data(data_file)

# Display basic information about the data
print(f"Data shape: {data.shape}")
print(f"Date range: {data.index[0]} to {data.index[-1]}")
print(f"Number of cryptocurrencies: {data.shape[1]}")
print(f"Cryptocurrencies: {', '.join(data.columns.tolist()[:5])}...")

# Display the first few rows
data.head()

In [None]:
# Split data into train and test sets
train_dataset_raw, test_dataset_raw = split_data(data, train_ratio=0.8)

print(f"Train data shape: {train_dataset_raw.shape}")
print(f"Train date range: {train_dataset_raw.index[0]} to {train_dataset_raw.index[-1]}")
print(f"Test data shape: {test_dataset_raw.shape}")
print(f"Test date range: {test_dataset_raw.index[0]} to {test_dataset_raw.index[-1]}")
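
How `split_data` divides the rows is not shown in this notebook. For time series the usual choice is a chronological cut, so that the test set lies strictly after the training set. A minimal sketch of that idea, assuming a date-indexed DataFrame (the helper name `chronological_split` is hypothetical and is not the project's implementation):

```python
import numpy as np
import pandas as pd


def chronological_split(frame: pd.DataFrame, train_ratio: float = 0.8):
    """Split a time-indexed frame into earlier (train) and later (test) rows."""
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must lie strictly between 0 and 1")
    frame = frame.sort_index()             # enforce time order before cutting
    cut = int(len(frame) * train_ratio)    # number of rows assigned to training
    return frame.iloc[:cut], frame.iloc[cut:]


# Toy data: 10 daily observations for two made-up series
dates = pd.date_range("2024-01-01", periods=10, freq="D")
toy = pd.DataFrame({"A": np.arange(10.0), "B": np.arange(10.0) * 2}, index=dates)

train_part, test_part = chronological_split(toy, train_ratio=0.8)
print(train_part.shape, test_part.shape)
print(train_part.index[-1] < test_part.index[0])
```

Cutting by position rather than sampling at random keeps future observations out of the training set.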

In [None]:
# Select a subset of cryptocurrencies for a more focused analysis.
# The candidates are ten large-cap pairs; any that are missing from the
# dataset are dropped below.
all_cryptos = data.columns.tolist()
top_cryptos_candidates = ['BTCUSDT', 'ETHUSDT', 'BNBUSDT', 'XRPUSDT', 'ADAUSDT', 
                          'DOGEUSDT', 'DOTUSDT', 'LTCUSDT', 'LINKUSDT', 'XLMUSDT']

# Filter to only include cryptos that are in the dataset
top_cryptos = [crypto for crypto in top_cryptos_candidates if crypto in all_cryptos]
print(f"Using these cryptocurrencies: {top_cryptos}")

# Filter data to include only selected cryptocurrencies
train_data_subset = train_dataset_raw[top_cryptos]
test_data_subset = test_dataset_raw[top_cryptos]

# Get market indices
market_indices_list = top_cryptos

# Set prediction horizon
h = 5  # 5-day ahead prediction

## 2. Create Lagged Features

Next we create the lagged features used as model input. Every later step consumes this table, so its column layout determines the tensor shapes seen downstream.

In [None]:
# Create lagged features
try:
    train_dataset = create_lagged_features(
        train_data_subset,
        market_indices_list,
        h,
        settings.LOOK_BACK_WINDOW
    )
    test_dataset = create_lagged_features(
        test_data_subset,
        market_indices_list,
        h,
        settings.LOOK_BACK_WINDOW
    )
    
    print(f"Train dataset with lagged features shape: {train_dataset.shape}")
    print(f"Test dataset with lagged features shape: {test_dataset.shape}")
    
    # Display the column names to see the lagged features
    print("\nFirst 20 column names:")
    print(train_dataset.columns[:20])
    
    # Display the first few rows of the dataset with lagged features
    print("\nFirst few rows of the dataset with lagged features:")
    display(train_dataset.head())
except Exception as e:
    print(f"Error creating lagged features: {e}")
    raise  # Later cells depend on train_dataset and test_dataset
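
The internals of `create_lagged_features` are not shown here. As a rough illustration of what a lagged-feature table can look like, the sketch below builds one with `DataFrame.shift`. The helper name `make_lagged_frame` and the `<column>_<lag>` naming are assumptions made for this example; the project's function also takes the horizon `h`, which this sketch leaves out.

```python
import pandas as pd


def make_lagged_frame(frame: pd.DataFrame, look_back: int) -> pd.DataFrame:
    """Append columns `<name>_<k>` holding the value k rows earlier, k = 1..look_back."""
    pieces = [frame]
    for lag in range(1, look_back + 1):
        shifted = frame.shift(lag)  # row t now holds the value from row t - lag
        shifted.columns = [f"{col}_{lag}" for col in frame.columns]
        pieces.append(shifted)
    # The first `look_back` rows lack a full history, so drop them
    return pd.concat(pieces, axis=1).dropna()


dates = pd.date_range("2024-01-01", periods=6, freq="D")
toy = pd.DataFrame({"BTC": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}, index=dates)

lagged = make_lagged_frame(toy, look_back=2)
print(lagged)
```

Each surviving row carries its own value together with the two preceding ones, which is the property the later windowing step relies on.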

## 3. Create Date-Aware Datasets

Next we wrap the lagged features in date-aware datasets, which arrange them into the tensor structures the model expects.

In [None]:
# Create datasets directly using IndexMappingDataset
try:
    lag_list = list(range(1, settings.LOOK_BACK_WINDOW + 1))
    
    train_dataset_indexed = IndexMappingDataset(
        train_dataset,
        lag_list,
        h
    )
    
    test_dataset_indexed = IndexMappingDataset(
        test_dataset,
        lag_list,
        h
    )
    
    # Create dataloaders
    batch_size = settings.BATCH_SIZE
    dataloader_train = DataLoader(
        train_dataset_indexed,
        batch_size=batch_size,
        shuffle=True
    )
    
    dataloader_test = DataLoader(
        test_dataset_indexed,
        batch_size=batch_size,
        shuffle=False
    )
    
    print(f"Train dataset length: {len(train_dataset_indexed)}")
    print(f"Test dataset length: {len(test_dataset_indexed)}")
    print(f"Number of batches in train dataloader: {len(dataloader_train)}")
    print(f"Number of batches in test dataloader: {len(dataloader_test)}")
except Exception as e:
    print(f"Error creating dataloaders: {e}")
    raise  # Later cells depend on dataloader_train and dataloader_test
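
`IndexMappingDataset` comes from the project and its internals are not shown here. To make the idea concrete, the sketch below implements a map-style dataset over a plain NumPy array that returns three lag windows and a target per sample. The class name `LagWindowDataset`, the oldest-row-first window order, and the `[time, features]` layout are assumptions for illustration. Any object with `__len__` and `__getitem__` like this one can be wrapped by a PyTorch `DataLoader`.

```python
import numpy as np


class LagWindowDataset:
    """Map-style dataset yielding (x_lag1, x_lag5, x_lag22, y) windows."""

    LONGEST_LAG = 22  # the 22-step window needs this many past rows

    def __init__(self, values: np.ndarray, horizon: int):
        self.values = values      # shape [time, features]
        self.horizon = horizon

    def __len__(self) -> int:
        # Each sample needs LONGEST_LAG past rows and `horizon` future rows
        return max(0, len(self.values) - self.LONGEST_LAG - self.horizon + 1)

    def __getitem__(self, idx: int):
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        t = idx + self.LONGEST_LAG          # row index of the first target step
        past = self.values[:t]
        x_lag1 = past[-1:]                  # row t-1
        x_lag5 = past[-5:]                  # rows t-5 .. t-1
        x_lag22 = past[-22:]                # rows t-22 .. t-1
        y = self.values[t:t + self.horizon] # rows t .. t+horizon-1
        return x_lag1, x_lag5, x_lag22, y


values = np.arange(60, dtype=np.float32).reshape(30, 2)  # 30 steps, 2 features
dataset = LagWindowDataset(values, horizon=5)
x1, x5, x22, target = dataset[0]
print(len(dataset), x1.shape, x5.shape, x22.shape, target.shape)
```

With 30 rows, a 22-step history, and a 5-step horizon, only 4 complete samples fit, which is why dataset lengths are shorter than the underlying table.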

## 4. Inspect Model Inputs

Now let's examine the actual tensors that are fed into the model. This will give us a clear understanding of the input shapes and values.

In [None]:
# Get a single batch from the test dataloader
for batch in dataloader_test:
    x_lag1, x_lag5, x_lag22, y = batch
    break  # Just get the first batch

# Print shapes
print("\n===== MODEL INPUT SHAPES =====\n")
print(f"x_lag1 shape: {x_lag1.shape}")
print(f"x_lag5 shape: {x_lag5.shape}")
print(f"x_lag22 shape: {x_lag22.shape}")
print(f"y shape: {y.shape}")

# Explain the dimensions
print("\n===== DIMENSION EXPLANATION =====\n")
print(f"x_lag1: [batch_size={x_lag1.shape[0]}, time_steps={x_lag1.shape[1]}, features={x_lag1.shape[2]}]")
print(f"x_lag5: [batch_size={x_lag5.shape[0]}, time_steps={x_lag5.shape[1]}, features={x_lag5.shape[2]}]")
print(f"x_lag22: [batch_size={x_lag22.shape[0]}, time_steps={x_lag22.shape[1]}, features={x_lag22.shape[2]}]")
print(f"y: [batch_size={y.shape[0]}, prediction_horizon={y.shape[1]}, features={y.shape[2]}]")
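
The prints above assume every tensor is three-dimensional with the layout `[batch, time, features]`. A small guard can make that assumption explicit before a batch reaches the model. The helper `check_batch_shapes` below is a hypothetical sketch, shown on NumPy arrays so that it runs on its own; it only reads `.shape`, so PyTorch tensors work as well.

```python
import numpy as np


def check_batch_shapes(x_lag1, x_lag5, x_lag22, y, horizon):
    """Raise ValueError unless the batch follows the [batch, time, features] layout."""
    expected_steps = {"x_lag1": 1, "x_lag5": 5, "x_lag22": 22, "y": horizon}
    arrays = {"x_lag1": x_lag1, "x_lag5": x_lag5, "x_lag22": x_lag22, "y": y}
    reference = None  # (batch, features) taken from the first array
    for name, array in arrays.items():
        shape = tuple(array.shape)
        if len(shape) != 3:
            raise ValueError(f"{name} has {len(shape)} dimensions, expected 3")
        batch, steps, features = shape
        if steps != expected_steps[name]:
            raise ValueError(
                f"{name} has {steps} time steps, expected {expected_steps[name]}"
            )
        if reference is None:
            reference = (batch, features)
        elif (batch, features) != reference:
            raise ValueError(f"{name} disagrees on batch size or feature count")
    return reference


# Toy batch: 4 samples, 10 features, horizon of 5
result = check_batch_shapes(
    np.zeros((4, 1, 10)), np.zeros((4, 5, 10)),
    np.zeros((4, 22, 10)), np.zeros((4, 5, 10)), horizon=5,
)
print(result)
```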

In [None]:
# Function to safely display tensor data
def display_tensor_sample(tensor, name, sample_idx=0, max_rows=5, max_cols=10):
    print(f"\n===== {name} TENSOR SAMPLE =====\n")
    print(f"Shape: {tensor.shape}")
    
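    # Convert to numpy for easier handling (detach and move to CPU first so
    # tensors that track gradients or live on a GPU do not raise)
    sample = tensor[sample_idx].detach().cpu().numpy()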
    
    # Create a DataFrame with appropriate columns
    if sample.ndim == 1:
        # 1D tensor - just show as a single row
        print(f"First {min(max_cols, len(sample))} values: {sample[:max_cols]}")
        if len(sample) > max_cols:
            print(f"Last {min(max_cols, len(sample))} values: {sample[-max_cols:]}")
    elif sample.ndim == 2:
        # 2D tensor - show as a table with row and column indices
        rows, cols = sample.shape
        
        # Show a subset of the data
        display_rows = min(rows, max_rows)
        display_cols = min(cols, max_cols)
        
        print(f"Showing first {display_rows} rows and first {display_cols} columns:")
        subset = sample[:display_rows, :display_cols]
        
        # Create a DataFrame for better display
        df = pd.DataFrame(subset)
        display(df)
        
        if cols > max_cols:
            print(f"... and {cols - max_cols} more columns")
        if rows > max_rows:
            print(f"... and {rows - max_rows} more rows")
    else:
        print(f"Cannot display tensor with {sample.ndim} dimensions")

# Display tensor samples
display_tensor_sample(x_lag1, "x_lag1 (MOST RECENT LAG)")
display_tensor_sample(x_lag5, "x_lag5 (5-DAY LAGS)")
display_tensor_sample(x_lag22, "x_lag22 (22-DAY LAGS)")
display_tensor_sample(y, "y (TARGET VALUES)")

In [None]:
# Print statistics for each input tensor
print("\n===== INPUT TENSOR STATISTICS =====\n")
for name, tensor in [("x_lag1", x_lag1), ("x_lag5", x_lag5), ("x_lag22", x_lag22), ("y", y)]:
    print(f"{name} mean: {tensor.mean().item():.4f}, std: {tensor.std().item():.4f}, "
          f"min: {tensor.min().item():.4f}, max: {tensor.max().item():.4f}")
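
The statistics above pool every feature into a single number. When the series differ in scale, a per-feature view can be more informative. The sketch below uses a random NumPy array as a stand-in for a `[batch, time, features]` tensor and reduces over the batch and time axes only. The helper name `per_feature_stats` is hypothetical; `ddof=1` is chosen to match the unbiased default of `torch.std`.

```python
import numpy as np
import pandas as pd


def per_feature_stats(array: np.ndarray, names=None) -> pd.DataFrame:
    """Summarize a [batch, time, features] array with one row per feature."""
    flat = array.reshape(-1, array.shape[-1])  # merge the batch and time axes
    return pd.DataFrame(
        {
            "mean": flat.mean(axis=0),
            "std": flat.std(axis=0, ddof=1),   # sample standard deviation
            "min": flat.min(axis=0),
            "max": flat.max(axis=0),
        },
        index=names,
    )


rng = np.random.default_rng(0)
toy_batch = rng.normal(size=(4, 5, 3))         # 4 samples, 5 steps, 3 features
table = per_feature_stats(toy_batch, names=["A", "B", "C"])
print(table.round(4))
```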

## 5. Visualize Input Data

Let's visualize the input data to better understand the structure and patterns.

In [None]:
# Visualize a single feature across all lags
feature_idx = 0  # Just use the first feature
sample_idx = 0   # First sample in the batch

plt.figure(figsize=(14, 8))

# Extract values for the selected feature
lag1_values = x_lag1[sample_idx, :, feature_idx].numpy().flatten()
lag5_values = x_lag5[sample_idx, :, feature_idx].numpy().flatten()
lag22_values = x_lag22[sample_idx, :, feature_idx].numpy().flatten()
target_values = y[sample_idx, :, feature_idx].numpy().flatten()

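# Create x-axis values (days relative to the prediction point).
# Assumes each lag window is ordered oldest to newest along the time axis.
# Lengths come from the tensors so the x and y arrays always match.
lag1_x = list(range(-len(lag1_values), 0))  # Most recent lag
lag5_x = list(range(-len(lag5_values), 0))  # 5-day lags
lag22_x = list(range(-len(lag22_values), 0))  # 22-day lags
target_x = list(range(len(target_values)))  # Future values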

# Plot
plt.plot(lag22_x, lag22_values, 'bo-', label='22-day lags', alpha=0.7)
plt.plot(lag5_x, lag5_values, 'go-', label='5-day lags', linewidth=2)
plt.plot(lag1_x, lag1_values, 'ro-', label='1-day lag', markersize=8)
plt.plot(target_x, target_values, 'mo-', label='target values', linewidth=2)

plt.axvline(x=0, color='k', linestyle='--', label='prediction point')
plt.title(f'Input and Target Values for Feature {feature_idx}', fontsize=16)
plt.xlabel('Time (days relative to prediction point)', fontsize=14)
plt.ylabel('Value', fontsize=14)
plt.legend(fontsize=12)
plt.grid(True)
plt.tight_layout()
plt.show()

In [None]:
# Visualize multiple features for the most recent lag
sample_idx = 0  # First sample in the batch
num_features = min(10, x_lag1.shape[2])  # Show up to 10 features

plt.figure(figsize=(14, 8))

# Extract values for multiple features
for i in range(num_features):
    feature_values = x_lag1[sample_idx, 0, i].item()
    plt.bar(i, feature_values, alpha=0.7)

plt.title('Most Recent Lag Values for Multiple Features', fontsize=16)
plt.xlabel('Feature Index', fontsize=14)
plt.ylabel('Value', fontsize=14)
plt.grid(True, axis='y')
plt.tight_layout()
plt.show()

In [None]:
# Visualize the correlation between the three lag scales, computed across features
sample_idx = 0  # First sample in the batch
num_features = min(20, x_lag1.shape[2])  # Use up to 20 features

# Average each lag window over its time axis. All three windows end at t-1,
# so picking one time step from each could select the same observation three times.
lag1_features = x_lag1[sample_idx, :, :num_features].mean(dim=0).numpy()
lag5_features = x_lag5[sample_idx, :, :num_features].mean(dim=0).numpy()  # 5-day average
lag22_features = x_lag22[sample_idx, :, :num_features].mean(dim=0).numpy()  # 22-day average

# Create a DataFrame with these features
feature_df = pd.DataFrame({
    'lag1': lag1_features,
    'lag5': lag5_features,
    'lag22': lag22_features
})

# Calculate and visualize the correlation matrix
corr_matrix = feature_df.corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1, center=0)
plt.title('Correlation Between Different Lag Features', fontsize=16)
plt.tight_layout()
plt.show()

## 6. Summary of Model Input Structure

Based on our inspection, here's a summary of the GSPHAR model's input structure:

1. **Input Tensors**:
   - `x_lag1`: Shape [batch_size, 1, num_features] - Most recent lag (t-1)
   - `x_lag5`: Shape [batch_size, 5, num_features] - 5-day lags (t-5 to t-1)
   - `x_lag22`: Shape [batch_size, 22, num_features] - 22-day lags (t-22 to t-1)

2. **Target Tensor**:
   - `y`: Shape [batch_size, horizon, num_features] - Future values (t to t+horizon-1)

3. **Data Transformation Process**:
   - Raw time series data → Split into train/test → Create lagged features → Create date-aware datasets → Generate batches

4. **Key Characteristics**:
   - Multi-scale approach captures both short-term and long-term patterns
   - Date-aware datasets maintain the connection between tensor indices and actual dates
   - Correlation between the lag scales, measured across features, can be read from the Section 5 heatmap

This structure gives the GSPHAR model access to temporal dependencies at three horizons, and to cross-cryptocurrency relationships, for volatility forecasting.
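
For intuition about the multi-scale inputs: in the classical HAR (heterogeneous autoregressive) model, the daily, weekly, and monthly components are plain averages of the last 1, 5, and 22 observations. The sketch below computes those averages from three lag windows with NumPy. It only illustrates the three scales; how GSPHAR itself combines the windows is not shown in this notebook, and the helper name `har_components` is hypothetical.

```python
import numpy as np


def har_components(x_lag1, x_lag5, x_lag22):
    """Average each [batch, time, features] window over its time axis."""
    daily = x_lag1.mean(axis=1)
    weekly = x_lag5.mean(axis=1)
    monthly = x_lag22.mean(axis=1)
    return daily, weekly, monthly


# Toy history: one feature rising linearly from 1 to 22, oldest value first
history = np.arange(1.0, 23.0).reshape(1, 22, 1)  # [batch=1, time=22, features=1]
daily, weekly, monthly = har_components(history[:, -1:], history[:, -5:], history)
print(daily.item(), weekly.item(), monthly.item())  # 22.0 20.0 11.5
```

On a rising series the shorter windows sit closer to the latest value, which shows how the three scales respond at different speeds to recent changes.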