# Homework: Latent Neural ODE for Bond Trading Volume Prediction

## Overview

In this homework, you will implement a **Latent Neural ODE** model to predict trading volume for Apple corporate bonds. Bond trades occur at irregular intervals, which makes this a natural application for Neural ODEs: they model dynamics in continuous time, so the gaps between observations do not have to be uniform.

**Learning Objectives:**
- Understand how to preprocess irregular time series data
- Implement key components of a Latent ODE model
- Train and evaluate the model against baselines
- Interpret results in a financial context

**Data:** TRACE prints for Apple (AAPL) corporate bonds since 2025-01-01

**Task:** Given a bond's historical daily trading volumes at irregular time points, predict the volume traded on the next day that bond trades.

---

## Grading Rubric

| Part | Description | Points |
|------|-------------|--------|
| 1 | Data Loading & Exploration | 10 |
| 2 | Data Preprocessing | 15 |
| 3 | Model Architecture | 25 |
| 4 | Training | 20 |
| 5 | Evaluation & Baselines | 15 |
| 6 | Interpretation Questions | 15 |
| **Total** | | **100** |

---

**Instructions:**
- Fill in code where you see `# TODO: Your code here`
- Do not modify provided code unless instructed
- Answer all interpretation questions in the designated markdown cells
- Run all cells to ensure your code works before submission

In [None]:
# Required imports (DO NOT MODIFY)
import torch
import torch.nn as nn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from torchdiffeq import odeint
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

---
## Part 1: Data Loading & Exploration (10 points)

### 1.1 Load the Data

In [None]:
# Load the Apple bonds TRACE data (PROVIDED)
df = pd.read_csv('apple_bonds_trace_prints_since_20250101.csv', index_col=0)

print(f"Total trades loaded: {len(df)}")
print(f"\nDate range: {df['trd_exctn_dt'].min()} to {df['trd_exctn_dt'].max()}")
print(f"\nUnique CUSIPs: {df['cusip_id'].nunique()}")

# Display relevant columns
display_cols = ['cusip_id', 'trd_exctn_dt', 'trd_exctn_tm', 'ascii_rptd_vol_tx', 
                'rptg_party_type', 'contra_party_type', 'side', 'rptd_pr']
df[display_cols].head(10)

### 1.2 Filter to Dealer-to-Customer (D2C) Trades (5 points)

In TRACE data:
- `rptg_party_type`: Type of reporting party (D=Dealer, C=Customer)
- `contra_party_type`: Type of counterparty (D=Dealer, C=Customer)

A D2C trade has one party as Dealer and the other as Customer.
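
As a generic illustration of combining conditions in pandas, here is a small sketch on a made-up frame. The column names `left` and `right` are hypothetical and are not the TRACE columns. The point it shows is that element-wise `&` and `|` need parentheses around each comparison, because they bind more tightly than `==`.

```python
import pandas as pd

# Toy frame with hypothetical columns; values mimic party-type codes.
toy = pd.DataFrame({
    "left":  ["D", "C", "D", "C"],
    "right": ["C", "D", "D", "C"],
})

# Each comparison is parenthesized: & and | bind tighter than ==.
mixed_mask = ((toy["left"] == "D") & (toy["right"] == "C")) | (
    (toy["left"] == "C") & (toy["right"] == "D")
)

mixed = toy[mixed_mask]
print(mixed_mask.tolist())
```

Only the first two toy rows pair a `D` with a `C`; the `D`/`D` and `C`/`C` rows are excluded.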

In [None]:
# TODO: Create a boolean mask for D2C trades (5 points)
# A D2C trade is when:
# - rptg_party_type is 'D' AND contra_party_type is 'C', OR
# - rptg_party_type is 'C' AND contra_party_type is 'D'

d2c_mask = # TODO: Your code here

df_d2c = df[d2c_mask].copy()
print(f"D2C trades: {len(df_d2c)} ({len(df_d2c)/len(df)*100:.1f}% of total)")

### 1.3 Remove Odd Lots (5 points)

Odd lots are trades with par value less than $100,000. These are typically retail trades and may have different dynamics.
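
One detail worth checking before filtering is how text volumes convert to numbers. The sketch below uses made-up strings, including a capped-style entry such as `1MM+`; whether this particular file contains such entries is an assumption to verify, not something stated here. It shows that `errors='coerce'` turns unparseable values into `NaN`, and that `NaN` fails any threshold comparison.

```python
import pandas as pd

# Made-up text volumes; "1MM+" stands in for any non-numeric entry.
raw = pd.Series(["250000", "1MM+", "50000", None])
vol = pd.to_numeric(raw, errors="coerce")

# NaN compares False against any threshold, so unparseable rows
# are dropped by a >= filter along with the small trades.
keep = vol >= 100_000
print(vol.tolist())
print(keep.tolist())
```

If many rows come back as `NaN`, a size filter will silently discard them, so it may be worth counting `vol.isna().sum()` first.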

In [None]:
# Convert volume to numeric
df_d2c['volume'] = pd.to_numeric(df_d2c['ascii_rptd_vol_tx'], errors='coerce')

# TODO: Remove odd lots (trades with volume < 100,000) (5 points)
df_filtered = # TODO: Your code here

print(f"After removing odd lots (<100k): {len(df_filtered)} trades")
print(f"Removed {len(df_d2c) - len(df_filtered)} odd lot trades")

---
## Part 2: Data Preprocessing (15 points)

### 2.1 Aggregate at Daily Level (10 points)

We need to aggregate multiple trades on the same day for the same bond.
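
For readers who have not used multi-column aggregation before, the following is a minimal sketch on a toy frame. The column names (`bond`, `day`, `size`, `price`) and the output names are hypothetical. It uses pandas named aggregation, which is one of several equivalent ways to express a sum, a mean, and a count per group.

```python
import pandas as pd

toy = pd.DataFrame({
    "bond": ["A", "A", "A", "B"],
    "day": pd.to_datetime(["2025-01-02", "2025-01-02", "2025-01-03", "2025-01-02"]),
    "size": [200_000, 300_000, 150_000, 1_000_000],
    "price": [99.0, 101.0, 100.5, 98.0],
})

# as_index=False keeps the group keys as ordinary columns.
daily = (
    toy.groupby(["bond", "day"], as_index=False)
       .agg(total_size=("size", "sum"),
            mean_price=("price", "mean"),
            n_prints=("price", "count"))
)
print(daily)
```

Whichever form you choose, note that the cell that follows assigns five column names positionally, so the group keys need to be ordinary columns (not the index) and the aggregated columns need to come out in the listed order.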

In [None]:
# Convert date to datetime
df_filtered['date'] = pd.to_datetime(df_filtered['trd_exctn_dt'])

# TODO: Aggregate by CUSIP and date (10 points)
# Group by 'cusip_id' and 'date'
# Compute:
# - 'volume': sum of volumes
# - 'rptd_pr': mean of prices
# - 'msg_seq_nb': count of trades

daily_trades = # TODO: Your code here

# Rename columns
daily_trades.columns = ['cusip_id', 'date', 'total_volume', 'avg_price', 'num_trades']

print(f"Daily aggregated trades: {len(daily_trades)}")
daily_trades.head(10)

### 2.2 Select CUSIPs for Analysis

In [None]:
# Select top 2 most active CUSIPs (PROVIDED)
cusip_activity = daily_trades.groupby('cusip_id').agg({
    'date': 'count',
    'total_volume': 'sum'
}).reset_index()
cusip_activity.columns = ['cusip_id', 'trading_days', 'total_volume']
cusip_activity = cusip_activity.sort_values('trading_days', ascending=False)

print("Top 10 Most Active CUSIPs:")
print(cusip_activity.head(10))

selected_cusips = cusip_activity.head(2)['cusip_id'].tolist()
print(f"\nSelected CUSIPs: {selected_cusips}")

# Filter to selected CUSIPs
df_selected = daily_trades[daily_trades['cusip_id'].isin(selected_cusips)].copy()
df_selected = df_selected.sort_values(['cusip_id', 'date']).reset_index(drop=True)

### 2.3 Visualize Trading Patterns

In [None]:
# Visualization (PROVIDED)
fig, axes = plt.subplots(len(selected_cusips), 2, figsize=(14, 5*len(selected_cusips)))
if len(selected_cusips) == 1:
    axes = axes.reshape(1, -1)

for i, cusip in enumerate(selected_cusips):
    cusip_data = df_selected[df_selected['cusip_id'] == cusip].copy()
    
    ax = axes[i, 0]
    ax.bar(cusip_data['date'], cusip_data['total_volume'] / 1e6, alpha=0.7)
    ax.set_xlabel('Date')
    ax.set_ylabel('Volume (Millions $)')
    ax.set_title(f'{cusip} - Daily Trading Volume')
    ax.tick_params(axis='x', rotation=45)
    ax.grid(True, alpha=0.3)
    
    ax = axes[i, 1]
    time_gaps = cusip_data['date'].diff().dt.days.dropna()
    ax.hist(time_gaps, bins=20, edgecolor='black', alpha=0.7)
    ax.axvline(time_gaps.mean(), color='red', linestyle='--', 
               label=f'Mean: {time_gaps.mean():.1f} days')
    ax.set_xlabel('Days Between Trades')
    ax.set_ylabel('Frequency')
    ax.set_title(f'{cusip} - Trade Interval Distribution')
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.suptitle('Apple Bond Trading Patterns', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

### 2.4 Prepare Sequences for Training (5 points)
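
Each sequence gets its own time axis, rescaled so that the first observation sits at 0 and the last target sits at 1. The sketch below illustrates that rescaling on hypothetical trade days; the variable names are illustrative and do not match the function that follows.

```python
import numpy as np

# Hypothetical trade days: five observations plus one target.
obs_days = np.array([3, 4, 8, 9, 15])
target_days = np.array([17])

t0 = obs_days[0]
span = target_days[-1] - t0          # 14 days in this toy case

# Shift so the window starts at 0, then divide by its total length.
obs_scaled = (obs_days - t0) / span
target_scaled = (target_days - t0) / span
print(obs_scaled, target_scaled)
```

Because the scaling is per window, the same calendar gap maps to different normalized gaps in different sequences; the model only ever sees relative spacing within a window.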

In [None]:
def prepare_sequences(df, cusip, lookback=10, horizon=1):
    """
    Prepare sequences for Latent ODE training.
    
    Args:
        df: DataFrame with date and total_volume
        cusip: CUSIP to filter
        lookback: Number of past observations to use
        horizon: Number of future points to predict
    Returns:
        sequences: List of dictionaries
        scaler: Fitted StandardScaler
    """
    cusip_data = df[df['cusip_id'] == cusip].copy()
    cusip_data = cusip_data.sort_values('date').reset_index(drop=True)
    
    # Convert dates to numeric (days since first date)
    first_date = cusip_data['date'].min()
    cusip_data['days'] = (cusip_data['date'] - first_date).dt.days
    
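    # Normalize volumes. Note: the scaler is fit on the full series, so
    # statistics from the later validation/test periods leak into training.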
    scaler = StandardScaler()
    cusip_data['volume_scaled'] = scaler.fit_transform(cusip_data[['total_volume']])
    
    sequences = []
    n = len(cusip_data)
    
    # TODO: Create sequences (5 points)
    # For each valid starting position i (from lookback to n - horizon, inclusive):
    # 1. Get observation data: cusip_data.iloc[i-lookback:i]
    # 2. Get target data: cusip_data.iloc[i:i+horizon]
    # 3. Normalize time to [0, 1] within the sequence
    # 4. Append dictionary with t_obs, x_obs, t_target, x_target, volume_raw, date
    
    for i in range(lookback, n - horizon + 1):
        obs_data = cusip_data.iloc[i-lookback:i]
        target_data = cusip_data.iloc[i:i+horizon]
        
        # Normalize time to [0, 1]
        t_start = obs_data['days'].iloc[0]
        t_end = target_data['days'].iloc[-1]
        t_range = t_end - t_start
        
        if t_range > 0:
            # TODO: Compute normalized times
            t_obs = # TODO: Your code here
            t_target = # TODO: Your code here
            
            sequences.append({
                't_obs': t_obs.astype(np.float32),
                'x_obs': obs_data['volume_scaled'].values.astype(np.float32),
                't_target': t_target.astype(np.float32),
                'x_target': target_data['volume_scaled'].values.astype(np.float32),
                'volume_raw': target_data['total_volume'].values,
                'date': target_data['date'].values
            })
    
    return sequences, scaler

# Prepare sequences
lookback = 10
all_sequences = {}
scalers = {}

for cusip in selected_cusips:
    sequences, scaler = prepare_sequences(df_selected, cusip, lookback=lookback)
    all_sequences[cusip] = sequences
    scalers[cusip] = scaler
    print(f"{cusip}: Created {len(sequences)} sequences")

**Question 2.1 (Part of 15 points):** Why do we aggregate trades at the daily level instead of using individual trade-level data? What information might we lose?

**Your Answer:** 

*[Write your answer here]*

---
## Part 3: Model Architecture (25 points)

### 3.1 ODE Function (PROVIDED)
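
To build intuition for what an ODE solver does with an irregular time grid, here is a dependency-light sketch. It is a stand-in for illustration only: the notebook itself uses `odeint` from `torchdiffeq` with the adaptive `dopri5` method, whereas this sketch uses fixed-step Euler and a hand-written decay function instead of a learned network. The names `euler_evolve` and `decay` are hypothetical.

```python
import numpy as np

def euler_evolve(f, z0, t_grid, substeps=100):
    """Integrate dz/dt = f(z) across an irregular time grid with fixed-step Euler."""
    z = np.asarray(z0, dtype=float)
    states = [z.copy()]
    for t_prev, t_next in zip(t_grid[:-1], t_grid[1:]):
        h = (t_next - t_prev) / substeps   # step size adapts to each gap
        for _ in range(substeps):
            z = z + h * f(z)
        states.append(z.copy())
    return np.stack(states)

# Toy autonomous dynamics (exponential decay), not the learned network.
decay = lambda z: -0.5 * z

t_grid = np.array([0.0, 0.1, 0.7, 1.0])    # unevenly spaced query times
traj = euler_evolve(decay, [1.0], t_grid)
exact = np.exp(-0.5 * t_grid)
print(traj[:, 0])
print(exact)
```

The state is defined at every real-valued time, so a long gap simply means integrating further before reading it out. That is the property the latent ODE relies on.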

In [None]:
# Latent dynamics function (PROVIDED)
class ODEFunc(nn.Module):
    """Defines the latent dynamics dz/dt = f(z); t is required by odeint but unused (autonomous)."""
    def __init__(self, latent_dim, hidden_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, latent_dim)
        )
    
    def forward(self, t, z):
        return self.net(z)

### 3.2 Encoder (10 points)

The encoder processes the irregular time series and outputs parameters of the latent distribution.
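
The shape bookkeeping for the time-gap feature can be sketched on a toy batch. NumPy is used here only for brevity, and the arrays are made up; the same slicing and concatenation ideas carry over to torch tensors.

```python
import numpy as np

# Toy batch: 2 sequences, 4 observations each (values are made up).
t = np.array([[0.0, 0.1, 0.4, 0.5],
              [0.0, 0.3, 0.35, 0.9]])
x = np.arange(8, dtype=float).reshape(2, 4, 1)

# Gap since the previous observation; prepending the first column
# makes the first gap exactly zero.
gaps = np.diff(t, axis=1, prepend=t[:, :1])

# Append the gap as a second feature channel: [batch, seq_len, 2].
features = np.concatenate([x, gaps[..., None]], axis=-1)
print(gaps)
print(features.shape)
```

A recurrent cell has no built-in notion of elapsed time, so the gap channel is the only way it can tell a trade one day later from a trade two weeks later.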

In [None]:
class Encoder(nn.Module):
    """RNN encoder for irregular time series."""
    def __init__(self, input_dim=1, hidden_dim=32, latent_dim=8):
        super().__init__()
        # GRU takes input_dim + 1 (for time delta)
        self.gru = nn.GRU(input_dim + 1, hidden_dim, batch_first=True)
        self.fc_mean = nn.Linear(hidden_dim, latent_dim)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)
    
    def forward(self, x, t):
        """
        Encode observations into latent distribution.
        
        Args:
            x: Observations [batch, seq_len, 1]
            t: Time points [batch, seq_len]
        Returns:
            mean: Mean of latent distribution [batch, latent_dim]
            logvar: Log variance [batch, latent_dim]
        """
        # TODO: Compute time deltas (10 points)
        # Time delta at position i is t[i] - t[i-1]
        # First delta should be 0
        t_delta = torch.zeros_like(t)
        # TODO: Fill in t_delta[:, 1:] with differences
        t_delta[:, 1:] = # TODO: Your code here
        
        # Concatenate observations with time deltas
        # x shape: [batch, seq_len, 1]
        # t_delta shape: [batch, seq_len]
        # Result should be [batch, seq_len, 2]
        x_with_time = # TODO: Your code here
        
        # Pass through GRU
        _, h = self.gru(x_with_time)  # h shape: [1, batch, hidden_dim]
        h = h.squeeze(0)  # [batch, hidden_dim]
        
        # Map to latent distribution parameters
        mean = self.fc_mean(h)
        logvar = self.fc_logvar(h)
        
        return mean, logvar

### 3.3 Decoder (PROVIDED)

In [None]:
# Decoder (PROVIDED)
class Decoder(nn.Module):
    """Decode latent state to observation."""
    def __init__(self, latent_dim=8, hidden_dim=32, output_dim=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim)
        )
    
    def forward(self, z):
        return self.net(z)

### 3.4 Complete Latent ODE Model (15 points)
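
The reparameterization step can be sanity-checked numerically. The sketch below, written in NumPy with made-up parameters, draws many samples and confirms that they have the requested mean and standard deviation. It is an illustration of the idea, not the model code.

```python
import numpy as np

rng = np.random.default_rng(0)

mean = np.array([1.0, -2.0])
logvar = np.array([0.0, np.log(0.25)])   # variances 1.0 and 0.25

std = np.exp(0.5 * logvar)               # standard deviations 1.0 and 0.5
eps = rng.standard_normal((100_000, 2))  # noise drawn independently of mean/std
z = mean + std * eps

print(z.mean(axis=0), z.std(axis=0))
```

All randomness lives in `eps`, so `z` is a deterministic, differentiable function of `mean` and `logvar`. In an autograd framework that is what lets gradients reach the encoder.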

In [None]:
class LatentODEForecaster(nn.Module):
    """Latent ODE for time series forecasting."""
    def __init__(self, input_dim=1, hidden_dim=32, latent_dim=8):
        super().__init__()
        self.encoder = Encoder(input_dim, hidden_dim, latent_dim)
        self.ode_func = ODEFunc(latent_dim, hidden_dim)
        self.decoder = Decoder(latent_dim, hidden_dim, input_dim)
        self.latent_dim = latent_dim
    
    def reparameterize(self, mean, logvar):
        """
        Reparameterization trick for VAE.
        Sample z = mean + std * epsilon, where epsilon ~ N(0, I)
        """
        # TODO: Implement reparameterization trick (5 points)
        # 1. Compute std from logvar: std = exp(0.5 * logvar)
        # 2. Sample epsilon from standard normal
        # 3. Return mean + std * epsilon
        
        std = # TODO: Your code here
        eps = # TODO: Your code here
        return # TODO: Your code here
    
    def forward(self, x_obs, t_obs, t_pred):
        """
        Forward pass: encode, evolve latent state, decode.
        
        Args:
            x_obs: Observations [batch, obs_len, 1]
            t_obs: Observation times [batch, obs_len]
            t_pred: Prediction times [batch, pred_len]
        Returns:
            predictions: [batch, pred_len, 1]
            mean, logvar: Latent distribution parameters
        """
        # Encode observations
        mean, logvar = self.encoder(x_obs, t_obs)
        
        # Sample initial latent state
        z0 = self.reparameterize(mean, logvar)
        
        # Get the last observation time for each sample
        t_last_obs = t_obs[:, -1]  # [batch]
        batch_size = x_obs.shape[0]
        pred_len = t_pred.shape[1]
        
        # TODO: Evolve latent state and decode (10 points)
        # For each sample in batch:
        # 1. Create time grid from t_last_obs to t_pred times
        # 2. Solve ODE using odeint
        # 3. Decode the predicted latent states
        
        predictions = torch.zeros(batch_size, pred_len, 1).to(x_obs.device)
        
        for b in range(batch_size):
            # Create time grid: start from last observation, then prediction times
            t_grid = # TODO: Your code here (concatenate t_last_obs[b:b+1] with t_pred[b])
            
            # Solve ODE
            z_traj = odeint(self.ode_func, z0[b:b+1], t_grid, method='dopri5')
            # z_traj shape: [len(t_grid), 1, latent_dim]
            
            # Skip first time point (t_last_obs), keep predictions
            z_pred = z_traj[1:].squeeze(1)  # [pred_len, latent_dim]
            
            # Decode
            predictions[b] = # TODO: Your code here
        
        return predictions, mean, logvar
    
    def predict(self, x_obs, t_obs, t_pred):
        """Deterministic prediction using mean of latent distribution."""
        self.eval()
        with torch.no_grad():
            mean, _ = self.encoder(x_obs, t_obs)
            z0 = mean  # Use mean instead of sampling
            
            batch_size = x_obs.shape[0]
            t_last_obs = t_obs[:, -1]
            pred_len = t_pred.shape[1]
            
            predictions = torch.zeros(batch_size, pred_len, 1).to(x_obs.device)
            
            for b in range(batch_size):
                t_grid = torch.cat([t_last_obs[b:b+1], t_pred[b]])
                z_traj = odeint(self.ode_func, z0[b:b+1], t_grid, method='dopri5')
                z_pred = z_traj[1:].squeeze(1)
                predictions[b] = self.decoder(z_pred)
        
        return predictions

**Question 3.1 (Part of 25 points):** Why do we include time deltas as input features to the encoder? What information does this provide?

**Your Answer:**

*[Write your answer here]*

---
## Part 4: Training (20 points)

### 4.1 Batch Preparation (5 points)

In [None]:
def prepare_batch(sequences, device='cpu'):
    """
    Prepare a batch of sequences for training.
    
    Args:
        sequences: List of sequence dictionaries
        device: Device to put tensors on
    Returns:
        x_obs, t_obs, t_target, x_target as tensors
    """
    batch_size = len(sequences)
    obs_len = len(sequences[0]['t_obs'])
    
    # Initialize tensors
    x_obs = torch.zeros(batch_size, obs_len, 1)
    t_obs = torch.zeros(batch_size, obs_len)
    t_target = torch.zeros(batch_size, 1)
    x_target = torch.zeros(batch_size, 1, 1)
    
    # TODO: Fill tensors from sequences (5 points)
    for i, seq in enumerate(sequences):
        x_obs[i, :, 0] = # TODO: Your code here
        t_obs[i, :] = # TODO: Your code here
        t_target[i, 0] = # TODO: Your code here
        x_target[i, 0, 0] = # TODO: Your code here
    
    return (x_obs.to(device), t_obs.to(device), 
            t_target.to(device), x_target.to(device))

### 4.2 Loss Function (10 points)
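
The closed-form KL term between a diagonal Gaussian and a standard normal can be checked against two cases with known answers. This NumPy sketch assumes one particular convention (sum over latent dimensions, then mean over the batch), and the helper name `kl_to_standard_normal` is hypothetical.

```python
import numpy as np

def kl_to_standard_normal(mean, logvar):
    """KL( N(mean, diag(exp(logvar))) || N(0, I) ), averaged over the batch."""
    per_sample = -0.5 * np.sum(1 + logvar - mean**2 - np.exp(logvar), axis=1)
    return per_sample.mean()

zeros = np.zeros((3, 4))

# Posterior identical to the prior: the divergence vanishes.
kl_same = kl_to_standard_normal(zeros, zeros)

# Unit mean shift in each of 4 dimensions: 0.5 per dimension.
kl_shifted = kl_to_standard_normal(np.ones((3, 4)), zeros)
print(kl_same, kl_shifted)
```

The term is zero only when the encoder outputs exactly the prior, and it grows as the posterior mean moves away from zero or the variance moves away from one.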

In [None]:
def compute_vae_loss(pred, target, mean, logvar, kl_weight=0.01):
    """
    Compute VAE loss: reconstruction + KL divergence.
    
    Args:
        pred: Predictions [batch, pred_len, 1]
        target: Targets [batch, pred_len, 1]
        mean: Latent mean [batch, latent_dim]
        logvar: Latent log variance [batch, latent_dim]
        kl_weight: Weight for KL term
    Returns:
        total_loss, mse_loss, kl_loss
    """
    # TODO: Compute MSE reconstruction loss (5 points)
    mse_loss = # TODO: Your code here (use nn.functional.mse_loss)
    
    # TODO: Compute KL divergence (5 points)
    # KL(q(z) || p(z)) where p(z) = N(0, I)
    # Formula: -0.5 * sum(1 + logvar - mean^2 - exp(logvar))
    # Average over batch
    kl_loss = # TODO: Your code here
    
    total_loss = mse_loss + kl_weight * kl_loss
    
    return total_loss, mse_loss, kl_loss

### 4.3 Training Loop (5 points)

In [None]:
def train_model(model, train_seqs, val_seqs, epochs=100, batch_size=16, lr=0.001, kl_weight=0.01):
    """Train the Latent ODE model."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=10, factor=0.5)
    
    train_losses = []
    val_losses = []
    
    n_train_batches = len(train_seqs) // batch_size
    
    for epoch in range(epochs):
        model.train()
        epoch_loss = 0
        
        indices = np.random.permutation(len(train_seqs))
        
        for i in range(n_train_batches):
            batch_idx = indices[i*batch_size:(i+1)*batch_size]
            batch_seqs = [train_seqs[j] for j in batch_idx]
            
            x_obs, t_obs, t_target, x_target = prepare_batch(batch_seqs, device)
            
            optimizer.zero_grad()
            
            # TODO: Forward pass and compute loss (5 points)
            pred, mean, logvar = # TODO: Your code here
            loss, mse, kl = # TODO: Your code here
            
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            
            epoch_loss += loss.item()
        
        train_losses.append(epoch_loss / n_train_batches)
        
        # Validation
        model.eval()
        with torch.no_grad():
            x_obs, t_obs, t_target, x_target = prepare_batch(val_seqs, device)
            pred = model.predict(x_obs, t_obs, t_target)
            val_mse = nn.functional.mse_loss(pred, x_target).item()
        val_losses.append(val_mse)
        
        scheduler.step(val_mse)
        
        if (epoch + 1) % 20 == 0:
            print(f"Epoch {epoch+1}/{epochs} | Train Loss: {train_losses[-1]:.6f} | Val MSE: {val_mse:.6f}")
    
    return train_losses, val_losses

In [None]:
# Train on first CUSIP (PROVIDED setup)
cusip = selected_cusips[0]
sequences = all_sequences[cusip]

# Split data
n_total = len(sequences)
n_train = int(0.7 * n_total)
n_val = int(0.15 * n_total)

train_seqs = sequences[:n_train]
val_seqs = sequences[n_train:n_train+n_val]
test_seqs = sequences[n_train+n_val:]

print(f"Training on {cusip}")
print(f"Train: {len(train_seqs)}, Val: {len(val_seqs)}, Test: {len(test_seqs)}")

# Create model
model = LatentODEForecaster(input_dim=1, hidden_dim=32, latent_dim=8).to(device)
print(f"Model parameters: {sum(p.numel() for p in model.parameters())}")

# Train
train_losses, val_losses = train_model(
    model, train_seqs, val_seqs,
    epochs=150, batch_size=8, lr=0.005, kl_weight=0.001
)

In [None]:
# Plot training curves (PROVIDED)
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].plot(train_losses)
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].set_title('Training Loss')
axes[0].grid(True, alpha=0.3)

axes[1].plot(val_losses, color='orange')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('MSE')
axes[1].set_title('Validation MSE')
axes[1].grid(True, alpha=0.3)

plt.suptitle(f'Latent ODE Training for {cusip}', fontsize=14)
plt.tight_layout()
plt.show()

**Question 4.1 (Part of 20 points):** What is the role of the KL weight (kl_weight) in the loss function? What happens if we set it too high or too low?

**Your Answer:**

*[Write your answer here]*

---
## Part 5: Evaluation & Baselines (15 points)

### 5.1 Evaluation Function

In [None]:
def evaluate_model(model, test_seqs, scaler):
    """Evaluate model on test set."""
    model.eval()
    predictions = []
    actuals = []
    
    with torch.no_grad():
        for seq in test_seqs:
            x_obs = torch.tensor(seq['x_obs']).unsqueeze(0).unsqueeze(-1).to(device)
            t_obs = torch.tensor(seq['t_obs']).unsqueeze(0).to(device)
            t_target = torch.tensor(seq['t_target']).unsqueeze(0).to(device)
            
            pred = model.predict(x_obs, t_obs, t_target)
            pred_scaled = pred.squeeze().cpu().numpy()
            
            pred_original = scaler.inverse_transform([[pred_scaled]])[0, 0]
            actual_original = seq['volume_raw'][0]
            
            predictions.append(pred_original)
            actuals.append(actual_original)
    
    return np.array(predictions), np.array(actuals)

# Evaluate Latent ODE
predictions, actuals = evaluate_model(model, test_seqs, scalers[cusip])
print(f"Latent ODE - Test samples: {len(predictions)}")

### 5.2 Compute Metrics (5 points)
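
Absolute and percentage errors can rank the same predictions very differently. The toy numbers below are made up to show how a single small-volume day can dominate a percentage-based average even when its dollar error is no larger than the others.

```python
import numpy as np

actual = np.array([1_000_000.0, 2_000_000.0, 100_000.0])
pred = np.array([1_100_000.0, 1_800_000.0, 300_000.0])

abs_err = np.abs(pred - actual)
pct_err = abs_err / np.abs(actual) * 100

print(abs_err)
print(pct_err)
print(abs_err.mean(), pct_err.mean())
```

The third day contributes the same dollar error as the second but a percentage error twenty times larger, which is worth keeping in mind when comparing models on percentage metrics alone.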

In [None]:
def compute_metrics(predictions, actuals):
    """
    Compute evaluation metrics.
    
    Returns dict with MSE, RMSE, MAE, MAPE
    """
    mse = np.mean((predictions - actuals) ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(predictions - actuals))
    
    # TODO: Compute MAPE (Mean Absolute Percentage Error) (5 points)
    # MAPE = mean(|pred - actual| / |actual|) * 100
    mape = # TODO: Your code here
    
    return {'MSE': mse, 'RMSE': rmse, 'MAE': mae, 'MAPE': mape}

metrics_ode = compute_metrics(predictions, actuals)
print(f"Latent ODE Metrics for {cusip}:")
for metric, value in metrics_ode.items():
    if metric in ['MSE', 'RMSE', 'MAE']:
        print(f"  {metric}: ${value:,.0f}")
    else:
        print(f"  {metric}: {value:.2f}%")

### 5.3 GRU Baseline (10 points)

Implement a standard GRU model (without ODE) for comparison.

In [None]:
class GRUBaseline(nn.Module):
    """
    Standard GRU baseline without ODE dynamics.
    Takes observations and directly predicts the next value.
    """
    def __init__(self, input_dim=1, hidden_dim=32, output_dim=1):
        super().__init__()
        # TODO: Define GRU and output layer (10 points)
        # GRU should take input_dim + 1 (for time delta) similar to encoder
        self.gru = # TODO: Your code here
        self.fc_out = # TODO: Your code here
    
    def forward(self, x, t):
        """
        Args:
            x: Observations [batch, seq_len, 1]
            t: Time points [batch, seq_len]
        Returns:
            Prediction [batch, 1]
        """
        # Compute time deltas
        t_delta = torch.zeros_like(t)
        t_delta[:, 1:] = t[:, 1:] - t[:, :-1]
        
        # Concatenate
        x_with_time = torch.cat([x, t_delta.unsqueeze(-1)], dim=-1)
        
        # TODO: Pass through GRU and output layer
        _, h = # TODO: Your code here
        h = h.squeeze(0)
        out = # TODO: Your code here
        
        return out

In [None]:
# Train GRU baseline (PROVIDED training loop)
gru_model = GRUBaseline(input_dim=1, hidden_dim=32, output_dim=1).to(device)
optimizer_gru = torch.optim.Adam(gru_model.parameters(), lr=0.005)

print("Training GRU Baseline...")
for epoch in range(150):
    gru_model.train()
    indices = np.random.permutation(len(train_seqs))
    epoch_loss = 0
    n_batches = len(train_seqs) // 8
    
    for i in range(n_batches):
        batch_idx = indices[i*8:(i+1)*8]
        batch_seqs = [train_seqs[j] for j in batch_idx]
        x_obs, t_obs, _, x_target = prepare_batch(batch_seqs, device)
        
        optimizer_gru.zero_grad()
        pred = gru_model(x_obs, t_obs)
        loss = nn.functional.mse_loss(pred, x_target.squeeze(-1))
        loss.backward()
        optimizer_gru.step()
        epoch_loss += loss.item()
    
    if (epoch + 1) % 50 == 0:
        print(f"Epoch {epoch+1}/150 | Loss: {epoch_loss/n_batches:.6f}")

# Evaluate GRU baseline
gru_model.eval()
gru_preds = []
with torch.no_grad():
    for seq in test_seqs:
        x_obs = torch.tensor(seq['x_obs']).unsqueeze(0).unsqueeze(-1).to(device)
        t_obs = torch.tensor(seq['t_obs']).unsqueeze(0).to(device)
        pred = gru_model(x_obs, t_obs)
        pred_scaled = pred.squeeze().cpu().numpy()
        pred_original = scalers[cusip].inverse_transform([[pred_scaled]])[0, 0]
        gru_preds.append(pred_original)

gru_preds = np.array(gru_preds)
metrics_gru = compute_metrics(gru_preds, actuals)
print(f"\nGRU Baseline Metrics:")
for metric, value in metrics_gru.items():
    if metric in ['MSE', 'RMSE', 'MAE']:
        print(f"  {metric}: ${value:,.0f}")
    else:
        print(f"  {metric}: {value:.2f}%")

### 5.4 Simple Baselines (PROVIDED)

In [None]:
# Last Value Baseline
last_preds = []
for seq in test_seqs:
    last_scaled = seq['x_obs'][-1]
    pred = scalers[cusip].inverse_transform([[last_scaled]])[0, 0]
    last_preds.append(pred)
last_preds = np.array(last_preds)
metrics_last = compute_metrics(last_preds, actuals)

# Moving Average Baseline (mean of the last 3 observations, not 3 calendar days)
ma_preds = []
for seq in test_seqs:
    ma_scaled = np.mean(seq['x_obs'][-3:])
    pred = scalers[cusip].inverse_transform([[ma_scaled]])[0, 0]
    ma_preds.append(pred)
ma_preds = np.array(ma_preds)
metrics_ma = compute_metrics(ma_preds, actuals)

# Comparison table
print("\n" + "="*65)
print(f"{'Model':<20} {'RMSE ($)':<15} {'MAE ($)':<15} {'MAPE (%)':<10}")
print("-"*65)
print(f"{'Last Value':<20} {metrics_last['RMSE']:>12,.0f} {metrics_last['MAE']:>12,.0f} {metrics_last['MAPE']:>8.2f}")
print(f"{'Moving Avg (3)':<20} {metrics_ma['RMSE']:>12,.0f} {metrics_ma['MAE']:>12,.0f} {metrics_ma['MAPE']:>8.2f}")
print(f"{'GRU Baseline':<20} {metrics_gru['RMSE']:>12,.0f} {metrics_gru['MAE']:>12,.0f} {metrics_gru['MAPE']:>8.2f}")
print(f"{'Latent ODE':<20} {metrics_ode['RMSE']:>12,.0f} {metrics_ode['MAE']:>12,.0f} {metrics_ode['MAPE']:>8.2f}")
print("="*65)

### 5.5 Visualization (PROVIDED)

In [None]:
# Visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Predicted vs Actual
ax = axes[0, 0]
ax.scatter(actuals/1e6, predictions/1e6, alpha=0.6, s=50, label='Latent ODE')
ax.scatter(actuals/1e6, gru_preds/1e6, alpha=0.6, s=50, label='GRU', marker='x')
max_val = max(actuals.max(), predictions.max(), gru_preds.max()) / 1e6
ax.plot([0, max_val], [0, max_val], 'r--', linewidth=2)
ax.set_xlabel('Actual (Millions $)')
ax.set_ylabel('Predicted (Millions $)')
ax.set_title('Predicted vs Actual')
ax.legend()
ax.grid(True, alpha=0.3)

# Time series
ax = axes[0, 1]
ax.plot(actuals/1e6, 'b-', label='Actual', linewidth=2)
ax.plot(predictions/1e6, 'r--', label='Latent ODE', linewidth=2)
ax.plot(gru_preds/1e6, 'g:', label='GRU', linewidth=2)
ax.set_xlabel('Test Sample')
ax.set_ylabel('Volume (Millions $)')
ax.set_title('Predictions Over Test Set')
ax.legend()
ax.grid(True, alpha=0.3)

# Error distribution - Latent ODE
ax = axes[1, 0]
errors_ode = ((predictions - actuals) / actuals) * 100
ax.hist(errors_ode, bins=20, edgecolor='black', alpha=0.7)
ax.axvline(0, color='red', linestyle='--', linewidth=2)
ax.set_xlabel('Percentage Error (%)')
ax.set_ylabel('Frequency')
ax.set_title('Latent ODE Error Distribution')
ax.grid(True, alpha=0.3)

# Error distribution - GRU
ax = axes[1, 1]
errors_gru = ((gru_preds - actuals) / actuals) * 100
ax.hist(errors_gru, bins=20, edgecolor='black', alpha=0.7)
ax.axvline(0, color='red', linestyle='--', linewidth=2)
ax.set_xlabel('Percentage Error (%)')
ax.set_ylabel('Frequency')
ax.set_title('GRU Baseline Error Distribution')
ax.grid(True, alpha=0.3)

plt.suptitle(f'Model Comparison for {cusip}', fontsize=14)
plt.tight_layout()
plt.show()

### 5.6 Train on Second CUSIP (PROVIDED)

In [None]:
# Train on second CUSIP
if len(selected_cusips) > 1:
    cusip2 = selected_cusips[1]
    sequences2 = all_sequences[cusip2]
    
    n_total2 = len(sequences2)
    n_train2 = int(0.7 * n_total2)
    n_val2 = int(0.15 * n_total2)
    
    train_seqs2 = sequences2[:n_train2]
    val_seqs2 = sequences2[n_train2:n_train2+n_val2]
    test_seqs2 = sequences2[n_train2+n_val2:]
    
    print(f"\nTraining on {cusip2}")
    print(f"Train: {len(train_seqs2)}, Val: {len(val_seqs2)}, Test: {len(test_seqs2)}")
    
    model2 = LatentODEForecaster(input_dim=1, hidden_dim=32, latent_dim=8).to(device)
    train_losses2, val_losses2 = train_model(
        model2, train_seqs2, val_seqs2,
        epochs=150, batch_size=8, lr=0.005, kl_weight=0.001
    )
    
    predictions2, actuals2 = evaluate_model(model2, test_seqs2, scalers[cusip2])
    metrics2 = compute_metrics(predictions2, actuals2)
    
    print(f"\nLatent ODE Metrics for {cusip2}:")
    for metric, value in metrics2.items():
        if metric in ['MSE', 'RMSE', 'MAE']:
            print(f"  {metric}: ${value:,.0f}")
        else:
            print(f"  {metric}: {value:.2f}%")

---
## Part 6: Interpretation Questions (15 points)

Answer each question in 3-5 sentences.

### Question 6.1 (4 points)

Why is irregular sampling problematic for traditional time series models like ARIMA? How do Neural ODEs address this challenge?

**Your Answer:**

*[Write your answer here]*

### Question 6.2 (4 points)

How would you modify this model to predict the average traded **price** (instead of volume) for the next trading day? What changes would be needed in the preprocessing and model?

**Your Answer:**

*[Write your answer here]*

### Question 6.3 (3 points)

Compare the results between the two CUSIPs. What factors might explain any differences in model performance?

**Your Answer:**

*[Write your answer here]*

### Question 6.4 (4 points)

How would you adapt the preprocessing and model to predict **dealer buy** and **dealer sell** volumes separately? What insights might this provide about market dynamics?

**Your Answer:**

*[Write your answer here]*

---
## Submission Instructions

1. Ensure all code cells run without errors
2. Make sure all TODO sections are completed
3. Answer all interpretation questions
4. Save this notebook with your answers
5. Submit the completed `.ipynb` file

**Good luck!**