Using main.ipynb for training Transformers was too chaotic for me. This file will contain ONLY the code that is absolutely necessary for working out how to train the Transformer model.

In [1]:
# if "preprocessing" folder in current folders -> cd back to original folder
%cd /content
import os
if os.path.exists("bsc-thesis"):
  # if bsc-thesis folder already exists; completely remove
  !rm -rf bsc-thesis

branch = "main"
!git clone --branch $branch https://github.com/maviddoerdijk/bsc-thesis.git
%cd bsc-thesis/src
%ls

/content
Cloning into 'bsc-thesis'...
remote: Enumerating objects: 508, done.
remote: Counting objects: 100% (252/252), done.
remote: Compressing objects: 100% (197/197), done.
remote: Total 508 (delta 113), reused 153 (delta 46), pack-reused 256 (from 1)
Receiving objects: 100% (508/508), 23.30 MiB | 14.42 MiB/s, done.
Resolving deltas: 100% (260/260), done.
Filtering content: 100% (14/14), 1.75 GiB | 63.81 MiB/s, done.
/content/bsc-thesis/src
backtesting/  data/      main.ipynb  models/         utils/
config/       external/  main.py     preprocessing/


In [2]:
!pip install ta
!pip install prophet
!pip install pykalman
!pip install PyWavelets
!pip install curl-cffi

Collecting ta
  Downloading ta-0.11.0.tar.gz (25 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: ta
  Building wheel for ta (setup.py) ... done
  Created wheel for ta: filename=ta-0.11.0-py3-none-any.whl size=29412 sha256=5de33609c134a155343d376f9211c66a71d01660a1b590d20c305febf6b090c4
  Stored in directory: /root/.cache/pip/wheels/a1/d7/29/7781cc5eb9a3659d032d7d15bdd0f49d07d2b24fec29f44bc4
Successfully built ta
Installing collected packages: ta
Successfully installed ta-0.11.0
Collecting pykalman
  Downloading pykalman-0.10.1-py2.py3-none-any.whl.metadata (9.5 kB)
Collecting scikit-base<0.13.0 (from pykalman)
  Downloading scikit_base-0.12.2-py3-none-any.whl.metadata (8.8 kB)
Downloading pykalman-0.10.1-py2.py3-none-any.whl (248 kB)
Downloading scikit_base-0.12.2-py3-none-any.whl (142 kB)

In [3]:
# Module imports
import pandas as pd
import numpy as np
from typing import Optional, Callable, Dict, Any
from sklearn.preprocessing import MinMaxScaler
from matplotlib import pyplot as plt
import torch
from torch.utils.data import Dataset, DataLoader
from torch.optim import AdamW
from tqdm.auto import tqdm
import torch.nn as nn
import itertools

# Custom Imports
from models.statistical_models import create_dataset
from data.data_collection import gather_data
from data.scraper import load_cached_etf_tickers
from preprocessing.cointegration import find_cointegrated_pairs
from preprocessing.data_preprocessing import filter_pairs_data
from preprocessing.technical_indicators import combine_pairs_data
from models.statistical_models import default_normalize
from preprocessing.wavelet_denoising import wav_den
from preprocessing.filters import step_1_filter_remove_nans, step_2_filter_liquidity
from backtesting.trading_strategy import trade
from backtesting.utils import calculate_return_uncertainty

## caching imports
from data.data_collection_cache import gather_data_cached, _get_filename
from utils.visualization import plot_return_uncertainty, plot_comparison, plot_train_val_loss


# Any other changes to be made throughout the entire notebook
plt.style.use('seaborn-v0_8')

In [4]:
# NOTE: all the functions used here are explained in much more detail in src/main.ipynb; this notebook focuses only on getting the Transformer model to work as I wish.
startDateStr = '2008-10-01'
endDateStr = '2018-10-02' # the documentation says endDateStr is exclusive for both yahoofinance and the original code, but printing the shapes showed otherwise
instrumentIdsNASDAQandNYSE = load_cached_etf_tickers()
data = gather_data(startDateStr, endDateStr, instrumentIdsNASDAQandNYSE)
data_close_filtered_1, data_open_filtered_1, data_high_filtered_1, data_low_filtered_1, data_vol_filtered_1, data_original_format_filtered_1 = step_1_filter_remove_nans(data['close'], data['open'], data['high'], data['low'], data['vol'], data)
data_close_filtered_2, data_open_filtered_2, data_high_filtered_2, data_low_filtered_2, data_vol_filtered_2, data_original_format_filtered_2 = step_2_filter_liquidity(data_close_filtered_1, data_open_filtered_1, data_high_filtered_1, data_low_filtered_1, data_vol_filtered_1, data_original_format_filtered_1)

scores, pvalues, pairs = find_cointegrated_pairs(data_original_format_filtered_2)
pairs_data = {key:value[1]  for (key, value) in pairs.items()}
pairs_data = sorted(pairs_data.items(), key=lambda x: x[1])
pairs_data_filtered = filter_pairs_data(pairs_data) # filter on cointegration so that the first entry in the list is the most strongly cointegrated pair
# Extract the most strongly cointegrated pair
ticker_a, ticker_b = pairs_data_filtered[0][0][0], pairs_data_filtered[0][0][1]
pairs_timeseries_df = combine_pairs_data(data_close_filtered_2, data_open_filtered_2, data_high_filtered_2, data_low_filtered_2, data_vol_filtered_2, ticker_a, ticker_b)
# Note about pairs_timeseries_df: the timeseries output on which we should train are found in the key "Spread_Close"
# But, also the input features are the following keys: ['S1_rsi', 'S2_rsi', 'S1_mfi', 'S2_mfi', 'S1_adi', 'S2_adi', 'S1_vpt', 'S2_vpt', 'S1_atr', 'S2_atr', 'S1_bb_ma', 'S2_bb_ma', 'S1_adx', 'S2_adx', 'S1_ema', 'S2_ema', 'S1_macd', 'S2_macd', 'S1_dlr', 'S2_dlr']

[*********************100%***********************]  777 of 777 completed
ERROR:yfinance:
494 Failed downloads:
ERROR:yfinance:['FINE', 'KPDD', 'INFR', 'UCYB', 'NERD', 'MRAL', 'BSMT', 'IBGL', 'IBTM', 'PLTU', 'BOTT', 'IBTP', 'IWTR', 'LDSF', 'ORCX', 'DLLL', 'DUKX', 'MYCK', 'PYPG', 'HLAL', 'ELFY', 'FEAT', 'AIRL', 'UTWY', 'TXUE', 'BSCX', 'EWJV', 'METD', 'TSYY', 'XFIX', 'DVQQ', 'HYDR', 'AVXC', 'FTGS', 'UMMA', 'TARK', 'QDTY', 'SIXG', 'PCMM', 'IONX', 'QQQS', 'SARK', 'EHLS', 'BEEZ', 'SMCL', 'PDBA', 'TEKX', 'ENDW', 'BTFX', 'MYCL', 'AMZZ', 'PALU', 'RDTY', 'AVGX', 'BSCW', 'GPIX', 'IBTG', 'EVYM', 'UPGR', 'OOQB', 'ABCS', 'PLTD', 'SMST', 'PATN', 'CA', 'PTIR', 'IBTO', 'QRMI', 'BUFM', 'IBTJ', 'GLCR', 'MYCI', 'MCHS', 'QOWZ', 'DAPP', 'PSWD', 'NCIQ', 'DMAT', 'BRRR', 'ERET', 'JMID', 'BULD', 'IBTQ', 'VRTL', 'LEXI', 'MYMI', 'BSMP', 'BEEX', 'SOLT', 'BELT', 'BSMU', 'TDI', 'ALLW', 'PBQQ', 'FDNI', 'CORO', 'FICS', 'ASMG', 'BSCT', 'IBOT', 'TSLQ', 'BABX', 'SUSL', 'WINC', 'TQQY', 'RKLX', 'BAFE', 'NIKL', 'ORR', 'USXF

Completed 820 pairs


In [5]:
pairs_data_filtered

[(('PFF', 'IGSB'), np.float64(3.0163860840139413e-06)),
 (('USIG', 'PPH'), np.float64(2.621351855041312e-05)),
 (('USIG', 'QTEC'), np.float64(5.673295148821592e-05)),
 (('USIG', 'IFGL'), np.float64(9.623595997163074e-05)),
 (('IGSB', 'SHY'), np.float64(0.00011659037067853098)),
 (('IGSB', 'PID'), np.float64(0.00013065834446531597)),
 (('USIG', 'PHO'), np.float64(0.0002540997299264048)),
 (('USIG', 'IGF'), np.float64(0.0003383020756817305)),
 (('USIG', 'ACWI'), np.float64(0.0005322750082700492)),
 (('IGSB', 'EMB'), np.float64(0.0005964786506590314)),
 (('IGSB', 'BBH'), np.float64(0.0006129783629473792)),
 (('SHV', 'BBH'), np.float64(0.0007265501368268785)),
 (('IFGL', 'IGSB'), np.float64(0.0007860628026761489)),
 (('SHV', 'PEY'), np.float64(0.0008390644259399934)),
 (('SHV', 'RTH'), np.float64(0.0009203870168499974)),
 (('IGSB', 'PEY'), np.float64(0.0009728025177510877)),
 (('PFF', 'IGIB'), np.float64(0.0010572018262853824)),
 (('USIG', 'ACWX'), np.float64(0.0013155551359995995)),
 (('S

In [6]:
# save variable `data` somewhere
# import pickle
# with open('data_2010_10_01_2024_10_02_142_failed.pkl', 'wb') as f:
#     pickle.dump(data, f)
pick_pairs_data_position = 0
ticker_a, ticker_b = pairs_data_filtered[pick_pairs_data_position][0][0], pairs_data_filtered[pick_pairs_data_position][0][1]
pairs_timeseries_df = combine_pairs_data(data_close_filtered_2, data_open_filtered_2, data_high_filtered_2, data_low_filtered_2, data_vol_filtered_2, ticker_a, ticker_b)

In [7]:
# Set a bunch of variables based on the existing function `execute_kalman_workflow` (Note: Some are changed already)
pairs_timeseries: pd.DataFrame = pairs_timeseries_df
target_col: str = "Spread_Close"
burn_in: int = 30 # we remove the first 30 elements, because the largest window used for technical indicators is
train_frac: float = 0.90
dev_frac: float = 0.05   # remaining part is test
look_back: int = 20
batch_size: int = 64
denoise_fn: Optional[Callable[[pd.Series], np.ndarray]] = wav_den
scaler_factory: Callable[..., MinMaxScaler] = MinMaxScaler
scaler_kwargs: Optional[Dict[str, Any]] = {"feature_range": (0, 1)}
normalise_fn: Callable[[pd.Series], pd.Series] = default_normalize
delta: float = 1e-3
obs_cov_reg: float = 2.
trans_cov_avg: float = 0.01
obs_cov_avg: float = 1.
return_datasets: bool = False
verbose: bool = False
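
The fractions above imply a chronological 90/5/5 split, after which each split is cut into windows of `look_back` rows. A minimal sketch of that arithmetic, assuming 2,488 rows remain after the burn-in (a figure inferred from the split sizes printed further down, not stated directly in the notebook); `split_sizes` and `n_windows` are hypothetical helpers written for this illustration:

```python
def split_sizes(total_len: int, train_frac: float = 0.90, dev_frac: float = 0.05):
    """Return (train, dev, test) row counts for a chronological split."""
    train_size = int(total_len * train_frac)
    dev_size = int(total_len * dev_frac)
    test_size = total_len - train_size - dev_size  # remainder goes to test
    return train_size, dev_size, test_size


def n_windows(n_rows: int, look_back: int = 20) -> int:
    """Number of (window, next-step target) samples a split of n_rows yields."""
    return max(n_rows - look_back, 0)


sizes = split_sizes(2488)  # 2488 is an assumed row count, see the note above
print(sizes)                          # (2239, 124, 125)
print([n_windows(n) for n in sizes])  # [2219, 104, 105]
```

With `look_back = 20`, the dev and test splits each keep only about a hundred samples, which is worth bearing in mind when reading their MSE values.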

In [12]:


def execute_transformer_workflow(
  pairs_timeseries: pd.DataFrame,
  target_col: str = "Spread_Close",
  burn_in: int = 30, # we remove the first 30 elements, because the largest window used for technical indicators is
  train_frac: float = 0.90,
  dev_frac: float = 0.05,   # remaining part is test
  look_back: int = 20,
  batch_size: int = 64,
  epochs: int = 400,
  patience: int = 150,
  denoise_fn: Optional[Callable[[pd.Series], np.ndarray]] = wav_den,
  scaler_factory: Callable[..., MinMaxScaler] = MinMaxScaler,
  scaler_kwargs: Optional[Dict[str, Any]] = {"feature_range": (0, 1)},
  normalise_fn: Callable[[pd.Series], pd.Series] = default_normalize,
  delta: float = 1e-3,
  obs_cov_reg: float = 2.,
  trans_cov_avg: float = 0.01,
  obs_cov_avg: float = 1.,
  return_datasets: bool = False,
  verbose: bool = False,
  add_technical_indicators: bool = True,
  result_parent_dir: str = "data/results",
  filename_base: str = "data_begindate_enddate_hash.pkl",
  pair_tup_str: str = "(?,?)" # Used for showing which tuple was used in plots, example: "(QQQ, SPY)"
):
  if target_col not in pairs_timeseries.columns:
    raise KeyError(f"pairs_timeseries must contain {target_col}")

  # burn the first 30 elements
  pairs_timeseries_burned = pairs_timeseries.iloc[burn_in:].copy()

  total_len = len(pairs_timeseries_burned)
  train_size = int(total_len * train_frac)
  dev_size   = int(total_len * dev_frac)
  test_size  = total_len - train_size - dev_size # not used, but for clarity

  train = pairs_timeseries_burned[:train_size]
  dev   = pairs_timeseries_burned[train_size:train_size+dev_size] # aka validation
  test  = pairs_timeseries_burned[train_size+dev_size:]

  train_multivariate = train.copy()
  dev_multivariate   = dev.copy() # only for completeness
  test_multivariate  = test.copy() # only for completeness

  if verbose:
      print(f"Split sizes — train: {len(train)}, dev: {len(dev)}, test: {len(test)}")

  if denoise_fn is not None: # denoise using wavelet denoising
      train = pd.DataFrame({col: denoise_fn(train[col]) for col in train.columns}) # TODO: unsure whether dev and test should also be denoised?

  x_scaler = scaler_factory(**(scaler_kwargs or {})) # important: the scaler learns parameters, so separate objects must be created for x and y; `or {}` guards against scaler_kwargs=None
  y_scaler = scaler_factory(**scaler_kwargs)

  if not add_technical_indicators:
      train = train[[target_col]]
      dev = dev[[target_col]]
      test = test[[target_col]]

  # We want a sliding window in our dataset
  # TODO: defining this function should not be part of workflow, but imported from a custom module
  def create_sliding_dataset(mat: np.ndarray,
                            x_scaler: MinMaxScaler,
                            y_scaler: MinMaxScaler,
                            look_back: int = 20):
      """
      X  -> (samples, look_back, features)
      y  -> (samples, 1)   — the next-step Spread_Close (just 1 day in advance)
      """
      X, y = [], []
      for i in range(len(mat) - look_back):
          X.append(mat[i : i + look_back, :]) # window
          y.append(mat[i + look_back, 0]) # value right after the window
      X, y = np.array(X), np.array(y).reshape(-1, 1)

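      # scale per feature
      # NOTE: fit_transform refits both scalers on every call, so the dev and test
      # splits are scaled with their own min/max rather than the training statistics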
      X_scaled = x_scaler.fit_transform(
          X.reshape(-1, X.shape[-1])
      ).reshape(X.shape)
      y_scaled = y_scaler.fit_transform(y)

      return X, X_scaled, y, y_scaled

  trainX_raw, trainX_scaled, trainY_raw, trainY_scaled = create_sliding_dataset(
      train.values, x_scaler=x_scaler, y_scaler=y_scaler, look_back=look_back) # trainX_scaled.shape: (2219, 20, 34) // [(t - look_back), look_back, features]
  devX_raw,   devX_scaled,   devY_raw,   devY_scaled   = create_sliding_dataset(
      dev.values,  x_scaler=x_scaler, y_scaler=y_scaler, look_back=look_back)
  testX_raw,  testX_scaled,  testY_raw,  testY_scaled  = create_sliding_dataset(
      test.values, x_scaler=x_scaler, y_scaler=y_scaler, look_back=look_back)


  # use pytorch Dataset class
  class SlidingWindowDataset(Dataset):
      def __init__(self, X: np.ndarray, y: np.ndarray):
          #  cast to float32 once to avoid repeated conversions
          self.X = torch.tensor(X, dtype=torch.float32)      # (N, L, F)
          self.y = torch.tensor(y, dtype=torch.float32)      # (N, 1)

      def __len__(self):
          return self.X.shape[0]

      def __getitem__(self, idx):
          return self.X[idx], self.y[idx]                    # each X: (L, F)

  train_ds = SlidingWindowDataset(trainX_scaled, trainY_scaled)
  dev_ds   = SlidingWindowDataset(devX_scaled, devY_scaled)
  test_ds  = SlidingWindowDataset(testX_scaled, testY_scaled)

  train_loader = DataLoader(train_ds, batch_size=batch_size,
                            shuffle=True,  drop_last=True,  num_workers=0)
  dev_loader   = DataLoader(dev_ds,   batch_size=batch_size,
                            shuffle=False, drop_last=False, num_workers=0)
  test_loader  = DataLoader(test_ds,  batch_size=batch_size,
                            shuffle=False, drop_last=False, num_workers=0)

  print(next(iter(train_loader))[0].shape)   # torch.Size([64, 20, 34]) //  (batch_size, look_back, features)

  class TimeSeriesTransformerv1(nn.Module):
    """
    This version (v1) uses:
    * learnable positional embeddings (simple, so no RoPE and no sinusoidal)
    * only an encoder, followed by a regression head that maps the encoder output of shape (seq_len, d_model) to a single value: the Spread_Close prediction
    """
    def __init__(
        self,
        n_features: int,
        seq_len: int,
        d_model: int = 128,
        nhead: int = 8,
        num_layers: int = 4,
        dropout: float = 0.1,
    ):
        super().__init__()
        self.seq_len = seq_len

        # token projection (linear layer)
        self.input_proj = nn.Linear(n_features, d_model)

        # learnable positional embedding  (1, seq_len, d_model)
        self.pos_emb = nn.Parameter(torch.randn(1, seq_len, d_model))

        # encoder (important part)
        enc_layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=nhead,
            dim_feedforward=d_model * 4,
            dropout=dropout,
            batch_first=True, # keeps (batch, seq, dim)
        )
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)

        # regression head (mainly helps in getting to the right output format)
        self.head = nn.Sequential(
            nn.Flatten(start_dim=1), # (batch, seq_len*d_model)
            nn.Linear(seq_len * d_model, 128),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(128, 1),
        )

    def forward(self, x): # x: (batch, seq_len, n_features)
        x = self.input_proj(x) + self.pos_emb
        x = self.encoder(x) # (batch, seq_len, d_model)
        return self.head(x) # (batch, 1)

  n_features = trainX_scaled.shape[-1]

  DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
  model  = TimeSeriesTransformerv1(
              n_features=n_features,
              seq_len=look_back,
              d_model=128,
              nhead=8,
              num_layers=4,
              dropout=0.1).to(DEVICE)

  criterion = nn.MSELoss()
  optimizer = AdamW(model.parameters(), lr=3e-4, weight_decay=1e-2)

  EPOCHS = epochs
  PATIENCE = patience

  # implement the early stopping logic manually
  best_val = float("inf")
  epochs_no_improve = 0
  print_per_n = 10

  # save train_loss and val_loss to lists for plotting
  train_losses = []
  val_losses = []

  for epoch in range(1, EPOCHS + 1):
      model.train()
      running_loss = 0.0
      for x_batch, y_batch in train_loader:
          x_batch, y_batch = x_batch.to(DEVICE), y_batch.to(DEVICE)
          optimizer.zero_grad()
          preds = model(x_batch).squeeze(-1)
          loss  = criterion(preds, y_batch.squeeze(-1))
          loss.backward()
          optimizer.step()
          running_loss += loss.item() * x_batch.size(0)
      train_loss = running_loss / len(train_loader.dataset) # epoch loss = running loss / N samples
      train_losses.append(train_loss)

      model.eval()
      running_loss_val = 0.0
      with torch.no_grad():
          for x_batch, y_batch in dev_loader:
              x_batch, y_batch = x_batch.to(DEVICE), y_batch.to(DEVICE)
              preds  = model(x_batch).squeeze(-1)
              running_loss_val += criterion(preds, y_batch.squeeze(-1)).item() * x_batch.size(0)
      val_loss = running_loss_val / len(dev_loader.dataset) # again, epoch loss = running loss / N samples
      val_losses.append(val_loss)

      # print losses in a pretty way
      if epoch % print_per_n == 0:
        print(f"Epoch {epoch:03d} | train MSE {train_loss:.6f} | val MSE {val_loss:.6f}")


      # manual early stopping logic
      if val_loss < best_val - 1e-5: # require an improvement of at least 1e-5, so that negligible changes do not reset the patience counter
          best_val = val_loss
          epochs_no_improve = 0
          torch.save(model.state_dict(), "best_transformer.pt")
      else:
          epochs_no_improve += 1
          if epochs_no_improve >= PATIENCE:
              print("Early stopping triggered.")
              break

  # Now, let's run the model on the test set
  # make sure we're in eval mode
  model.eval()

  all_preds, all_targets = [], []
  with torch.no_grad(): # note for myself: torch.no_grad() disables gradient tracking, so the intermediate activations needed for backprop are not kept in memory; we only need those during learning, not during inference
      for x_test_batch, y_test_batch in test_loader:
        x_test_batch = x_test_batch.to(DEVICE)
        preds = model(x_test_batch) # make predictions using model
        # transform the preds and targets back to numpy, as these need to be inverted with the scaler, which expects numpy not tensors
        preds = preds.cpu().numpy()
        y_test_batch = y_test_batch.cpu().numpy()
        all_preds.append(preds)
        all_targets.append(y_test_batch)

  # maybe too much explanation here, but y_hat and y_true respectively represent the predicted and ground truth values
  y_hat_scaled = np.concatenate(all_preds).reshape(-1, 1)
  y_true_scaled = np.concatenate(all_targets).reshape(-1, 1)

  y_hat = y_scaler.inverse_transform(y_hat_scaled)
  y_true = y_scaler.inverse_transform(y_true_scaled)

  test_mse = np.mean((y_hat - y_true) ** 2)
  print(f"Test MSE  : {test_mse:.6f}")

  ## Trading
  test_s1_shortened = test_multivariate['S1_close'].iloc[look_back:]
  test_s2_shortened = test_multivariate['S2_close'].iloc[look_back:] # use multivariate versions, so we can still access cols like 'S1_close' and 'S2_close'
  test_index_shortened = test_multivariate.index[look_back:] # officially doesn't really matter whether to use `test_multivariate` or `test`, but do it like this for consistency
  forecast_test_shortened_series = pd.Series(y_hat.squeeze(), index=test_index_shortened)
  gt_test_shortened_series = pd.Series(y_true.squeeze(), index=test_index_shortened)

  spread_gt_series = pd.Series(y_true.squeeze(), index=test_index_shortened)
  gt_returns = trade(
      S1 = test_s1_shortened,
      S2 = test_s2_shortened,
      spread = spread_gt_series,
      window_long = 30,
      window_short = 5,
      position_threshold = 1.0,
      clearing_threshold = 0.5
  )
  gt_yoy = ((gt_returns[-1] / gt_returns[0])**(365 / len(gt_returns)) - 1)

  ## Trading: Mean YoY
  min_position = 2.00
  max_position = 4.00
  min_clearing = 0.30
  max_clearing = 0.70
  position_thresholds = np.linspace(min_position, max_position, num=10)
  clearing_thresholds = np.linspace(min_clearing, max_clearing, num=10)
  yoy_mean, yoy_std = calculate_return_uncertainty(test_s1_shortened, test_s2_shortened, forecast_test_shortened_series, position_thresholds=position_thresholds, clearing_thresholds=clearing_thresholds)


  ## The variables that should be returned, according to what was returned by the `execute_kalman_workflow` func:
  # give same output as was originally the case
  if add_technical_indicators:
    current_result_dir = filename_base.replace(".pkl", "_transformer")
  else:
    current_result_dir = filename_base.replace(".pkl", "_transformer_without_ta")
  result_dir = os.path.join(result_parent_dir, current_result_dir)
  if not os.path.exists(result_dir):
      os.makedirs(result_dir)

  ### Plotting #####
  # 1. Train val loss
  train_val_loss_filename = plot_train_val_loss(train_losses, val_losses, workflow_type="Transformer", pair_tup_str=pair_tup_str, result_dir=result_dir, verbose=verbose, filename_base=filename_base)

  # 2. yoy returns
  yoy_returns_filename = plot_return_uncertainty(test_s1_shortened, test_s2_shortened, forecast_test_shortened_series, test_index_shortened, look_back, position_thresholds=position_thresholds, clearing_thresholds=clearing_thresholds, verbose=verbose, result_dir=result_dir, filename_base=filename_base)

  # 3. predicted vs actual spread plot
  predicted_vs_actual_spread_filename = plot_comparison(gt_test_shortened_series, forecast_test_shortened_series, test_index_shortened, workflow_type="Transformer", pair_tup_str=pair_tup_str, verbose=verbose, result_dir=result_dir, filename_base=filename_base)

  ### Plotting #####
  plot_filenames = {
      "yoy_returns": yoy_returns_filename,
      "predicted_vs_actual_spread": predicted_vs_actual_spread_filename,
      "train_val_loss": train_val_loss_filename
  }
  output: Dict[str, Any] = dict(
      val_mse=val_losses[-1],
      test_mse=test_mse,
      yoy_mean=yoy_mean,
      yoy_std=yoy_std,
      gt_yoy=gt_yoy,
      result_parent_dir=result_parent_dir,
      plot_filenames=plot_filenames
  )

  results_str = f"""
Validation MSE: {output['val_mse']}
Test MSE: {output['test_mse']}
YOY Returns: {output['yoy_mean'] * 100:.2f}%
YOY Std: +- {output['yoy_std'] * 100:.2f}%
GT Yoy: {output['gt_yoy'] * 100:.2f}%
Plot filepath parent dir: {output['result_parent_dir']}
Plot filenames: {output['plot_filenames']}
  """
  with open(os.path.join(result_dir, "results.txt"), "w") as f:
      f.write(results_str)
  if verbose:
    print(results_str)

  if return_datasets:
      output.update(
          dict(train=train, dev=dev, test=test,
                datasets=dict(
                    train=(trainX_raw, trainX_scaled, trainY_raw, trainY_scaled),
                    dev  =(devX_raw,   devX_scaled,   devY_raw,   devY_scaled),
                    test =(testX_raw,  testX_scaled,  testY_raw,  testY_scaled)
                ))

      )
  return output

output = execute_transformer_workflow(pairs_timeseries_df, verbose=True, result_parent_dir="data/results", filename_base=_get_filename(startDateStr, endDateStr, instrumentIdsNASDAQandNYSE), pair_tup_str=f"({ticker_a},{ticker_b})", epochs=20)
output_without_tas = execute_transformer_workflow(pairs_timeseries_df, verbose=True, add_technical_indicators=False, result_parent_dir="data/results", filename_base=_get_filename(startDateStr, endDateStr, instrumentIdsNASDAQandNYSE), pair_tup_str=f"({ticker_a},{ticker_b})", epochs=20)

Split sizes — train: 2239, dev: 124, test: 125
torch.Size([64, 20, 34])
Epoch 010 | train MSE 0.007178 | val MSE 0.027906
Epoch 020 | train MSE 0.006499 | val MSE 0.024379
Test MSE  : 0.058034
Saved plot to data/results/data_2008_10_01_2018_10_02_4416cb3b_transformer/data_2008_10_01_2018_10_02_4416cb3b_train_val_loss.png
Saved plot to data/results/data_2008_10_01_2018_10_02_4416cb3b_transformer/data_2008_10_01_2018_10_02_4416cb3b_plot_thresholds.png
Saved plot to data/results/data_2008_10_01_2018_10_02_4416cb3b_transformer/data_2008_10_01_2018_10_02_4416cb3b_groundtruth_comparison.png

Validation MSE: 0.024378877133131027
Test MSE: 0.05803408473730087
YOY Returns: 8.13%
YOY Std: +- 0.40%
GT Yoy: -7.44%
Plot filepath parent dir: data/results
Plot filenames: {'yoy_returns': 'data_2008_10_01_2018_10_02_4416cb3b_plot_thresholds.png', 'predicted_vs_actual_spread': 'data_2008_10_01_2018_10_02_4416cb3b_groundtruth_comparison.png', 'train_val_loss': 'data_2008_10_01_2018_10_02_4416cb3b_train_v
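
`create_sliding_dataset` above calls `fit_transform` for every split, so each split is scaled by its own range. A common alternative is to fit on the training split only and reuse those statistics for dev and test. The sketch below illustrates the idea with a small NumPy stand-in; `MinMax` is a hypothetical class written for this example (scikit-learn's `MinMaxScaler` offers the same `fit` / `transform` / `inverse_transform` separation), and the arrays are made-up numbers:

```python
import numpy as np


class MinMax:
    """Hypothetical stand-in for a column-wise (0, 1) min-max scaler."""

    def fit(self, a: np.ndarray) -> "MinMax":
        self.lo = a.min(axis=0)
        self.span = a.max(axis=0) - self.lo
        self.span[self.span == 0] = 1.0  # avoid division by zero on constant columns
        return self

    def transform(self, a: np.ndarray) -> np.ndarray:
        return (a - self.lo) / self.span

    def inverse_transform(self, a: np.ndarray) -> np.ndarray:
        return a * self.span + self.lo


train_y = np.array([[10.0], [12.0], [14.0]])  # made-up training targets
test_y = np.array([[13.0], [16.0]])           # made-up test targets

scaler = MinMax().fit(train_y)            # statistics come from the training split only
test_scaled = scaler.transform(test_y)    # values may fall outside [0, 1]
restored = scaler.inverse_transform(test_scaled)
print(test_scaled.ravel(), restored.ravel())
```

Scaled test values above 1 or below 0 are expected with this approach; they simply mean the test period left the range seen during training.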

In [None]:
def trade(
    S1: pd.Series,
    S2: pd.Series,
    spread: pd.Series, # model-predicted spread, used for the strategy
    window_long: int, # long time-span moving-average window (e.g. 60)
    window_short: int, # short time-span moving-average window, also used for the rolling stdev below (e.g. 5)
    position_threshold: float = 1.0,
    clearing_threshold: float = 0.5
) -> list: # cumulative cash balance per timestep; the last element is the simulated profit-and-loss
    ma_long = spread.rolling(window=window_long, center=False).mean()
    ma_short = spread.rolling(window=window_short, center=False).mean()
    std = spread.rolling(window=window_short, center=False).std()
    zscore = (ma_long - ma_short)/std

    cash, qty_s1, qty_s2 = 0.0, 0, 0
    returns = []

    for i in range(len(spread)):
      # go through zscore for each timestep
      if zscore.iloc[i] > position_threshold: # option 1: sell short
        cash += S1.iloc[i] - S2.iloc[i] * spread.iloc[i]
        qty_s1 -= 1
        qty_s2 += spread.iloc[i] # following the classic pairs-trade: this buys β shares in the case that spread_t = S1_t + β * S2_t
      elif zscore.iloc[i] < -position_threshold: # option 2: buy long
        cash -= S1.iloc[i] - S2.iloc[i] * spread.iloc[i]
        qty_s1 += 1
        qty_s2 -= spread.iloc[i] # same as above, but the other way around
      elif abs(zscore.iloc[i]) < clearing_threshold: # option 3: go neutral, clearing all positions
        cash += qty_s1 * S1.iloc[i] - qty_s2 * S2.iloc[i] # Closing on S1 is gained (we make a profit if we have a positive amount of S1, which we sell), closing on S2 is lost (we need to buy back the shorted positions, losing cash)
        qty_s1, qty_s2 = 0, 0
      returns.append(cash)
    return returns

spread_pred_series = pd.Series(y_hat.squeeze(), index=test.index[look_back:])  # align lengths
spread_gt_series = pd.Series(y_true.squeeze(), index=test.index[look_back:])
returns = trade(
    S1 = test['S1_close'].iloc[look_back:],
    S2 = test['S2_close'].iloc[look_back:],
    spread = spread_pred_series,
    window_long = 30,
    window_short = 5,
    position_threshold = 1.0,
    clearing_threshold = 0.5
)
print("Simulated P&L:", returns[-1])

# plot returns
plt.plot(returns)
plt.show()

# plot predicted vs actual spreads
plt.plot(spread_pred_series, label='Predicted Spread')
plt.plot(spread_gt_series, label='Actual Spread')
plt.legend()
plt.show()
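
The entry and exit signals in `trade` come from a z-score built from two moving averages of the spread. A small sketch of just that signal, using made-up spread values and deliberately tiny windows so the numbers can be checked by hand:

```python
import pandas as pd

spread = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 7.0])  # made-up spread values
window_long, window_short = 4, 2                    # tiny windows, for illustration only

ma_long = spread.rolling(window=window_long).mean()
ma_short = spread.rolling(window=window_short).mean()
std = spread.rolling(window=window_short).std()     # sample std over the short window, as in `trade`
zscore = (ma_long - ma_short) / std

# The first window_long - 1 entries are NaN because the long window is not full yet
print(zscore)
```

Comparisons against NaN evaluate to False, so none of the three branches in `trade` fires during this warm-up period and the cash balance stays unchanged.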

In [None]:
def trade(
    S1: pd.Series,
    S2: pd.Series,
    spread: pd.Series, # model-predicted spread for the strategy
    window_long: int,
    window_short: int,
    initial_cash: float = 100000,
    position_threshold: float = 1.0,
    clearing_threshold: float = 0.5,
    risk_fraction: float = 0.1
) -> list:
    ma_long = spread.rolling(window=window_long, center=False).mean()
    ma_short = spread.rolling(window=window_short, center=False).mean()
    std = spread.rolling(window=window_short, center=False).std()
    zscore = (ma_long - ma_short)/std

    cash = initial_cash
    qty_s1 = 0
    qty_s2 = 0
    returns = [initial_cash]
    position = 0 # 0: neutral, 1: long, -1: short

    for i in range(len(spread)):
        price_s1 = S1.iloc[i]
        price_s2 = S2.iloc[i]
        beta = spread.iloc[i]
        equity = cash + qty_s1 * price_s1 - qty_s2 * price_s2

        # Enter short spread (short S1, long beta S2)
        if position == 0 and zscore.iloc[i] > position_threshold:
            position = -1
            position_size = equity * risk_fraction
            qty_s1 = -position_size / price_s1
            qty_s2 = (position_size * beta) / price_s2
            cash -= (qty_s1 * price_s1 - qty_s2 * price_s2)

        # Enter long spread (long S1, short beta S2)
        elif position == 0 and zscore.iloc[i] < -position_threshold:
            position = 1
            position_size = equity * risk_fraction
            qty_s1 = position_size / price_s1
            qty_s2 = - (position_size * beta) / price_s2
            cash -= (qty_s1 * price_s1 - qty_s2 * price_s2)

        # Exit to neutral when spread reverts
        elif position != 0 and abs(zscore.iloc[i]) < clearing_threshold:
            cash += qty_s1 * price_s1 - qty_s2 * price_s2
            qty_s1 = 0
            qty_s2 = 0
            position = 0

        equity = cash + qty_s1 * price_s1 - qty_s2 * price_s2
        returns.append(equity)

    return returns

spread_pred_series = pd.Series(y_hat.squeeze(), index=test.index[look_back:])  # align lengths
spread_gt_series = pd.Series(y_true.squeeze(), index=test.index[look_back:])
returns = trade(
    S1 = test['S1_close'].iloc[look_back:],
    S2 = test['S2_close'].iloc[look_back:],
    spread = spread_pred_series,
    window_long = 30,
    window_short = 5,
    position_threshold = 1.0,
    clearing_threshold = 0.5
)
print("Simulated P&L:", returns[-1])
print(f"Return % over period: {100*((returns[-1] - returns[0]) / returns[0])}%")
print(f"Return % YoY: {((returns[-1] / returns[0])**(365/len(returns)) - 1) * 100}")

# plot returns
plt.plot(returns)
plt.show()

# plot predicted vs actual spreads
plt.plot(spread_pred_series, label='Predicted Spread')
plt.plot(spread_gt_series, label='Actual Spread')
plt.legend()
plt.show()
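
The year-over-year figure printed above compounds the total return over the test period to a 365-step year. A small worked example of that formula with a made-up equity curve; `annualised_return` and its `periods_per_year` parameter are names introduced here, and whether 365 or roughly 252 trading days is the better base for daily bars is left open:

```python
def annualised_return(equity: list, periods_per_year: int = 365) -> float:
    """Compound the total return of an equity curve to one year."""
    total_growth = equity[-1] / equity[0]
    return total_growth ** (periods_per_year / len(equity)) - 1


equity = [100_000.0] * 72 + [101_000.0]  # 73 points, 1% total gain (made up)
print(f"{annualised_return(equity) * 100:.2f}%")  # 5.10%
```

Because the exponent is `365 / len(equity)`, short test windows amplify small gains or losses considerably, which is one reason the annualised numbers in this notebook can look large.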

To get a more reliable plot, we will test out a set of different threshold combinations to be able to plot uncertainty / standard deviation in our returns.
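
A minimal sketch of that idea: run the strategy once per (position, clearing) threshold pair and summarise the resulting equity curves by their pointwise mean and standard deviation. `toy_strategy` is a made-up placeholder standing in for `trade`, so the numbers carry no meaning:

```python
import itertools

import numpy as np


def toy_strategy(position_threshold: float, clearing_threshold: float, n_steps: int = 5):
    """Placeholder for `trade`: returns a deterministic fake equity curve."""
    drift = position_threshold - clearing_threshold
    return [100.0 + drift * t for t in range(n_steps)]


position_thresholds = np.linspace(2.0, 4.0, num=3)  # 2.0, 3.0, 4.0
clearing_thresholds = np.linspace(0.3, 0.7, num=3)  # 0.3, 0.5, 0.7

# One row per threshold combination, one column per timestep
curves = np.vstack([
    toy_strategy(p, c)
    for p, c in itertools.product(position_thresholds, clearing_thresholds)
])

mean_curve = curves.mean(axis=0)  # pointwise mean across combinations
std_curve = curves.std(axis=0)    # pointwise spread across combinations
print(curves.shape, mean_curve[-1], std_curve[0])
```

The band between `mean_curve - std_curve` and `mean_curve + std_curve` is what a call such as `plt.fill_between` would shade.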



In [None]:
def plot_with_uncertainty(position_thresholds=None, clearing_thresholds=None,
                          long_windows=None, short_windows=None):
  # Build the parameter grid from whichever pair of ranges was supplied
  if position_thresholds is not None and clearing_thresholds is not None:
      threshold_combinations = list(itertools.product(position_thresholds, clearing_thresholds))
      param_type = 'thresholds'
  elif long_windows is not None and short_windows is not None:
      threshold_combinations = list(itertools.product(long_windows, short_windows))
      param_type = 'windows'
  else:
      raise ValueError("Must specify either (position_thresholds and clearing_thresholds) or (long_windows and short_windows)")

  all_returns = []

  for a, b in threshold_combinations:
      if param_type == 'thresholds':
          returns = trade(
              S1=test['S1_close'].iloc[look_back:],
              S2=test['S2_close'].iloc[look_back:],
              spread=spread_pred_series,
              window_long=30,
              window_short=5,
              position_threshold=a,
              clearing_threshold=b
          )
          # print(f"Returns for (pt={a},ct={b}) -> {returns[-1]}")
      else:
          returns = trade(
              S1=test['S1_close'].iloc[look_back:],
              S2=test['S2_close'].iloc[look_back:],
              spread=spread_pred_series,
              window_long=a,
              window_short=b,
              position_threshold=0.8,
              clearing_threshold=0.2
          )
          # print(f"Returns for (wl={a},ws={b}) -> {returns[-1]}")

      all_returns.append(returns)

  # turn into numpy
  returns_array = np.vstack([np.array(r) for r in all_returns])

  # mean and stdev for plotting
  mean_returns = returns_array.mean(axis=0)
  std_returns = returns_array.std(axis=0)
  time_axis_series = test.index[look_back - 1:]

  std_dev_pct = (std_returns / mean_returns[0]) * 100


  if param_type == "thresholds":
      print(f"position threshold ({min(position_thresholds):.2f}-{max(position_thresholds):.2f}), "
                f"clearing threshold ({min(clearing_thresholds):.2f}-{max(clearing_thresholds):.2f})")
  else:
      print(f"short window ({min(short_windows)}-{max(short_windows)}), "
                f"long window ({min(long_windows)}-{max(long_windows)})")
  print(f"Return % over period: {100 * ((mean_returns[-1] - mean_returns[0]) / mean_returns[0]):.2f}% ± {std_dev_pct[-1]:.2f}%")
  print(f"Return % YoY (mean and std dev): {((mean_returns[-1] / mean_returns[0])**(365 / len(mean_returns)) - 1) * 100:.2f}% ± {((std_returns[-1] / mean_returns[0]) * np.sqrt(365 / len(mean_returns))) * 100:.2f}%")

  plt.figure(figsize=(10, 6))
  plt.plot(time_axis_series, mean_returns, label='Mean Strategy Returns')
  plt.fill_between(time_axis_series, mean_returns - std_returns, mean_returns + std_returns, alpha=0.3, label='±1 Std Dev')
  plt.xlabel('Time')
  plt.ylabel('Cumulative Return')
  if param_type == "thresholds":
      plt.title(f"Trading Strategy Returns - position threshold ({min(position_thresholds):.2f}-{max(position_thresholds):.2f}), "
                f"clearing threshold ({min(clearing_thresholds):.2f}-{max(clearing_thresholds):.2f})")
  else:
      plt.title(f"Trading Strategy Returns - short window ({min(short_windows)}-{max(short_windows)}), "
                f"long window ({min(long_windows)}-{max(long_windows)})")
  plt.legend()
  plt.show()

threshold_ranges = [
    (3.0, 3.5, 0.6, 0.7),
    (1.0, 2.0, 0.1, 0.3),
    (1.5, 2.5, 0.2, 0.4),
    (2.0, 3.0, 0.3, 0.5),
    (2.5, 3.5, 0.4, 0.6),
    (3.0, 4.0, 0.5, 0.7),
    (4.0, 5.0, 0.6, 0.8),
    (1.0, 3.0, 0.2, 0.6),
    (2.0, 4.0, 0.3, 0.7),
    (3.0, 5.0, 0.4, 0.9),
    (1.5, 4.5, 0.1, 0.5),
    (1.2, 2.8, 0.2, 0.4),
    (2.3, 3.7, 0.3, 0.6),
    (3.1, 4.2, 0.4, 0.7),
    (1.8, 3.6, 0.5, 0.8),
    (2.5, 4.9, 0.6, 0.9),
    (1.0, 5.0, 0.1, 1.0),
    (2.0, 2.5, 0.3, 0.4),
    (3.0, 3.5, 0.5, 0.6),
    (4.0, 4.5, 0.6, 0.7),
    (1.5, 3.0, 0.2, 0.3)
]


for min_position, max_position, min_clearing, max_clearing in threshold_ranges:
  position_thresholds = np.linspace(min_position, max_position, num=10)
  clearing_thresholds = np.linspace(min_clearing, max_clearing, num=10)
  plot_with_uncertainty(position_thresholds, clearing_thresholds)

Some example outputs to clarify the shape of the data:

`len(trainX_untr)`
```
2238
```

`len(trainX_untr[0])`
```
34
```

Context: The 34 features consist of
* 10 technical indicators for both S1 and S2 (total 20)
* S1_close/open/high/low/volume, same for S2 (total 10)
* Pair spreads: close, open, high, low (total 4)


`trainX_untr[0]`
```
array([ 2.76970068e+01,  4.91006247e+01,  2.89730484e+01,  4.91027293e+01,
        2.89891358e+01,  4.91343834e+01,  2.60513431e+01,  4.87784465e+01,
        5.03546207e+05,  7.43386097e+03,  4.49718063e+01,  5.82671806e+01,
        5.75577766e+01,  8.28358144e+01,  7.59406546e+01,  3.92425336e+02,
        1.05376719e+05, -9.14930577e+03,  1.56861725e+00,  7.63660812e-01,
        2.85638197e+01,  4.83799321e+01,  8.30115861e+00,  2.65305580e+01,
        2.84687992e+01,  4.85740999e+01,  2.27306259e-01,  4.38795633e-02,
       -1.31033890e+00,  1.00654755e-01, -6.89227038e+00, -5.61453974e+00,
       -5.72867758e+00, -8.18838280e+00])
```

`len(trainY_untr)`
```
2238
```


`len(trainY_untr[0])`
```
1
```

`trainY_untr[0]`
```
array([27.81830352])
```

`trainY_untr[:20]`
```
[array([27.81830352]),
 array([27.42025825]),
 array([25.9191175]),
 array([22.98625305]),
 array([20.5661885]),
 array([21.23151271]),
 array([24.11603916]),
 array([25.605551]),
 array([26.16966699]),
 array([26.46204422]),
 array([25.33065673]),
 array([25.72835342]),
 array([25.91998167]),
 array([25.70591191]),
 array([25.83366537]),
 array([26.33152235]),
 array([26.35160811]),
 array([26.2352556]),
 array([26.03820719]),
 array([25.75521362])]
```

`trainX_sliding.shape` (when using look_back=20)

```
(2219, 20, 34)
```
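
The first dimension follows from the window arithmetic: 2238 samples with `look_back=20` give 2238 - 20 + 1 = 2219 overlapping windows. A minimal sketch that reproduces this shape is shown below. It assumes the windows are consecutive slices of the untransformed feature matrix. The helper name `make_sliding_windows` is hypothetical, and the notebook's own windowing code may be implemented differently.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def make_sliding_windows(X, look_back):
    """Return overlapping windows of shape (n - look_back + 1, look_back, n_features).

    Hypothetical helper, not the notebook's implementation.
    """
    X = np.asarray(X)
    # sliding_window_view appends the window axis last: (n_windows, n_features, look_back)
    windows = sliding_window_view(X, window_shape=look_back, axis=0)
    # Move the window axis to position 1 so each sample is (look_back, n_features)
    return np.moveaxis(windows, -1, 1)


# Dummy data with the same dimensions as trainX_untr
dummy_X = np.arange(2238 * 34, dtype=float).reshape(2238, 34)
dummy_windows = make_sliding_windows(dummy_X, look_back=20)
print(dummy_windows.shape)
```

`sliding_window_view` returns a read-only view rather than a copy, so call `.copy()` on the result if the windows need to be modified in place.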
