# MONGY: Training `PatchTSMixer` on Financial Candlestick Data
## Direct forecasting example

This notebook demonstrates the use of a `PatchTSMixer` model for a multivariate time series forecasting task. It depends on the Hugging Face [transformers](https://github.com/huggingface/transformers) library. For details on the model architecture, refer to the [TSMixer paper](https://arxiv.org/abs/2306.09364).

In [1]:
# Standard
import os
import random

# Third Party
from transformers import (
    EarlyStoppingCallback,
    PatchTSMixerConfig,
    PatchTSMixerForPrediction,
    Trainer,
    TrainingArguments,
)
import numpy as np
import pandas as pd
import torch

# First Party
from tsfm_public.toolkit.dataset import ForecastDFDataset
from tsfm_public.toolkit.time_series_preprocessor import TimeSeriesPreprocessor

In [2]:
# Set seed for reproducibility
SEED = 42
torch.manual_seed(SEED)
random.seed(SEED)
np.random.seed(SEED)

In [3]:
import warnings

warnings.filterwarnings("ignore", category=UserWarning, module="torch.utils.data.dataloader")

## Load and prepare datasets

In the next cell, please adjust the following parameters to suit your application:
- `dataset_path`: path to a local .csv file, or the web address of a CSV file, for the data of interest. Data is loaded with pandas, so anything supported by [`pd.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) is supported.
- `timestamp_column`: column name containing timestamp information, use None if there is no such column
- `id_columns`: List of column names specifying the IDs of different time series. If no ID column exists, use []
- `forecast_columns`: List of columns to be modeled
- `context_length`: The amount of historical data used as input to the model. Windows of the input time series data with length equal to
context_length will be extracted from the input dataframe. In the case of a multi-time series dataset, the context windows will be created
so that they are contained within a single time series (i.e., a single ID).
- `forecast_horizon`: Number of time steps to forecast into the future.
- `train_start_index`, `train_end_index`: the start and end indices in the loaded data which delineate the training data.
- `valid_start_index`, `valid_end_index`: the start and end indices in the loaded data which delineate the validation data.
- `test_start_index`, `test_end_index`: the start and end indices in the loaded data which delineate the test data.
- `patch_length`: The patch length for the `PatchTSMixer` model. Recommended to have a value so that `context_length` is divisible by it.
- `num_workers`: Number of dataloader workers in the PyTorch dataloader.
- `batch_size`: Batch size.

The data is first loaded into a pandas dataframe and split into training, validation, and test parts. Then the pandas dataframes are converted
to the appropriate torch dataset needed for training.
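
As a rough illustration of the windowing described above, the sketch below counts how many (context, forecast) windows fit inside each individual series when windows never cross an ID boundary. It assumes a stride of one and that a series shorter than one full window contributes nothing. The helper `count_windows` and the series lengths are hypothetical and are not part of `tsfm_public`, whose dataset classes may treat short series differently (for example by padding).

```python
from typing import Dict, Hashable


def count_windows(
    series_lengths: Dict[Hashable, int],
    context_length: int,
    forecast_horizon: int,
) -> Dict[Hashable, int]:
    """Count sliding windows per series, never crossing a series boundary."""
    window_size = context_length + forecast_horizon
    # A series of length n yields n - window_size + 1 windows (or none if too short)
    return {
        series_id: max(n - window_size + 1, 0)
        for series_id, n in series_lengths.items()
    }


# Hypothetical row counts per (ticker, date_string) series
series_lengths = {
    ("AAPL", "2023-01-03"): 630,
    ("V", "2023-01-03"): 243,
    ("V", "2023-01-04"): 200,
}
windows = count_windows(series_lengths, context_length=240, forecast_horizon=3)
print(windows)
print("total windows:", sum(windows.values()))
```

With a 240-step context and a 3-step horizon, a series needs at least 243 rows to produce a single training example.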

In [4]:
# We want to set up our context, horizon, and patch size based on our task. We want to use
# 4 hours of lookback to start, in order to predict the next 3 minutes of candles. Regarding
# patch length, we know that we will want a larger patch size, so we will start with 64 as
# a base case assumption
context_length = 60 * 4  # This will give us 4 hours of lookback (4 hours * 60 min per hour)
forecast_horizon = 3  # This will give us 3 minutes of predictions

In [5]:
# Load the Dataset from the CSV file
DATA_DIR = "/home/ubuntu/verb-workspace/data"

TRAIN_DATASET = f"{DATA_DIR}/1min-candles-train-w-CANDLES.csv"
VALID_DATASET = f"{DATA_DIR}/1min-candles-valid-w-CANDLES.csv"
TEST_DATASET = f"{DATA_DIR}/1min-candles-test-w-CANDLES.csv"

timestamp_col = 't'

train_data = pd.read_csv(
    TRAIN_DATASET,
    parse_dates=[timestamp_col]
)

valid_data = pd.read_csv(
    VALID_DATASET,
    parse_dates=[timestamp_col]
)

test_data = pd.read_csv(
    TEST_DATASET,
    parse_dates=[timestamp_col]
)


In [6]:
# Check for NaN values
assert sum(train_data.isna().sum().to_list()) == 0
assert sum(valid_data.isna().sum().to_list()) == 0
assert sum(test_data.isna().sum().to_list()) == 0

In [7]:
train_data

Unnamed: 0,ticker,date_string,t,targ_o,targ_h,targ_l,targ_c,targ_v,targ_red,targ_green,obs_vwap,cont_market_open,cont_market_extended
0,AAPL,2023-01-03,2023-01-03 05:30:00-05:00,130.80,130.8000,130.800,130.800,0.0,0,0,0.0000,0,1
1,AAPL,2023-01-03,2023-01-03 05:31:00-05:00,130.80,130.8000,130.800,130.800,0.0,0,0,0.0000,0,1
2,AAPL,2023-01-03,2023-01-03 05:32:00-05:00,130.80,130.8000,130.800,130.800,0.0,0,0,0.0000,0,1
3,AAPL,2023-01-03,2023-01-03 05:33:00-05:00,130.80,130.8000,130.800,130.800,235.0,0,0,130.8009,0,1
4,AAPL,2023-01-03,2023-01-03 05:34:00-05:00,130.80,130.8000,130.800,130.800,0.0,0,0,0.0000,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
937435,V,2023-11-17,2023-11-17 15:55:00-05:00,249.65,249.7277,249.620,249.680,20222.0,0,1,249.6721,1,0
937436,V,2023-11-17,2023-11-17 15:56:00-05:00,249.67,249.7700,249.670,249.705,24402.0,0,1,249.7157,1,0
937437,V,2023-11-17,2023-11-17 15:57:00-05:00,249.71,249.7600,249.670,249.725,29366.0,0,1,249.7169,1,0
937438,V,2023-11-17,2023-11-17 15:58:00-05:00,249.73,249.7300,249.655,249.660,29316.0,1,0,249.7011,1,0


In [8]:

id_columns = ['ticker', 'date_string']
forecast_columns = ['targ_o', 'targ_c', 'targ_h', 'targ_l', 'targ_v', 'targ_red', 'targ_green']
observable_columns = ['obs_vwap']
control_columns = ['cont_market_open', 'cont_market_extended']

train_tsp = TimeSeriesPreprocessor(
    timestamp_column=timestamp_col,
    id_columns=id_columns,
    target_columns=forecast_columns,
    observable_columns=observable_columns,
    control_columns=control_columns,
    scaling=True,
)
train_tsp.train(train_data)
print("Done Train")

valid_tsp = TimeSeriesPreprocessor(
    timestamp_column=timestamp_col,
    id_columns=id_columns,
    target_columns=forecast_columns,
    observable_columns=observable_columns,
    control_columns=control_columns,
    scaling=True,
)
valid_tsp.train(valid_data)
print("Done Valid")

test_tsp = TimeSeriesPreprocessor(
    timestamp_column=timestamp_col,
    id_columns=id_columns,
    target_columns=forecast_columns,
    observable_columns=observable_columns,
    control_columns=control_columns,
    scaling=True,
)
test_tsp.train(test_data)
print("Done Test")


Done Train
Done Valid
Done Test


In [9]:
train_dataset = ForecastDFDataset(
    train_tsp.preprocess(train_data),
    id_columns=id_columns,
    target_columns=forecast_columns,
    control_columns=control_columns,
    observable_columns=observable_columns,
    context_length=context_length,
    prediction_length=forecast_horizon,
)
valid_dataset = ForecastDFDataset(
    valid_tsp.preprocess(valid_data),
    id_columns=id_columns,
    target_columns=forecast_columns,
    control_columns=control_columns,
    observable_columns=observable_columns,
    context_length=context_length,
    prediction_length=forecast_horizon,
)
test_dataset = ForecastDFDataset(
    test_tsp.preprocess(test_data),
    id_columns=id_columns,
    target_columns=forecast_columns,
    control_columns=control_columns,
    observable_columns=observable_columns,
    context_length=context_length,
    prediction_length=forecast_horizon,
)



In [10]:
from typing import Tuple
# Indices for accessing the OHLC values in the tensors
I_OPEN = 0
I_CLOSE = 1
I_HIGH = 2
I_LOW = 3
I_RED = 5
I_GREEN = 6


def theta_pnl(x: torch.Tensor, y_pred: torch.Tensor, y_obs: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
    # Create the series of closes
    real_candle_closes = torch.cat((x[..., I_CLOSE], y_obs[..., I_CLOSE]), dim=-1)
    forecasted_candle_closes = torch.cat((x[..., I_CLOSE], y_pred[..., I_CLOSE]), dim=-1)
    
    # Compute pnls
    real_pnl_long = real_candle_closes[..., -3:] - real_candle_closes[..., 0:3]
    forecasted_pnl_long = forecasted_candle_closes[..., -3:] - forecasted_candle_closes[..., 0:3]
    real_pnl_short = real_candle_closes[..., 0:3] - real_candle_closes[..., -3:]
    forecasted_pnl_short = forecasted_candle_closes[..., 0:3] - forecasted_candle_closes[..., -3:]

    # For each candle, compute long/short position
    is_long = real_pnl_long > 0

    real_pnl = torch.where(is_long, real_pnl_long, real_pnl_short)
    forecasted_pnl = torch.where(is_long, forecasted_pnl_long, forecasted_pnl_short)

    pnl_ae = torch.abs(real_pnl - forecasted_pnl)
    pnl_se = torch.square(real_pnl - forecasted_pnl)

    return pnl_se, pnl_ae

def pnl_factor(y_pred: torch.Tensor, y_obs: torch.Tensor) -> torch.Tensor:
    pred_reds = y_pred[..., I_RED]
    pred_greens = y_pred[..., I_GREEN]
    pred_is_green = pred_greens > pred_reds

    real_reds = y_obs[..., I_RED]
    real_greens = y_obs[..., I_GREEN]
    real_is_green = real_greens > real_reds

    factor = torch.sum(pred_is_green != real_is_green) / 3
    # print(f"Predicted Is Green: {pred_is_green}")
    # print(f"Real Is Green: {real_is_green}")
    # print(f"Factor: {factor}")

    return factor

def custom_loss(x: torch.Tensor, y_pred: torch.Tensor, y_obs: torch.Tensor) -> torch.Tensor:
    # Compute PNL residual for each candle
    pnl_se, pnl_ae = theta_pnl(x, y_pred, y_obs)

    # Compute MSE and MAE of the OHLCV data
    mse = torch.nn.functional.mse_loss(y_pred, y_obs)
    mae = torch.nn.functional.l1_loss(y_pred, y_obs)

    # Compute the constant term for candle color accuracy
    # (note: `factor` is computed here but is not currently used in the returned loss)
    factor = pnl_factor(y_pred, y_obs)
    
    custom_mse = torch.mean(mse + torch.mean(pnl_se))
    custom_mae = torch.mean(mae + torch.mean(pnl_ae))

    return (custom_mse + custom_mae) / 2
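
To make the slicing in `theta_pnl` easier to follow, here is a small NumPy sketch of the same arithmetic on a toy series of closes. NumPy is used in place of torch only to keep the example light, and the helper `pnl_errors` and all numbers are hypothetical. As in the cell above, the "long" PnL is the last three closes of the concatenated series (the forecast window) minus the first three closes of the context window, and the "short" PnL is its negative.

```python
import numpy as np


def pnl_errors(context_close, pred_close, obs_close, horizon=3):
    """Squared and absolute PnL errors, mirroring the slicing used in theta_pnl."""
    # Append the observed / predicted closes to the context closes
    real = np.concatenate([context_close, obs_close], axis=-1)
    forecast = np.concatenate([context_close, pred_close], axis=-1)

    # Long PnL: last `horizon` closes minus the first `horizon` closes
    real_long = real[..., -horizon:] - real[..., :horizon]
    forecast_long = forecast[..., -horizon:] - forecast[..., :horizon]

    # The position side is chosen from the real data; short PnL is the negated long PnL
    is_long = real_long > 0
    real_pnl = np.where(is_long, real_long, -real_long)
    forecast_pnl = np.where(is_long, forecast_long, -forecast_long)

    residual = real_pnl - forecast_pnl
    return residual ** 2, np.abs(residual)


# One toy sample: 5 context closes and 3 forecast closes
context_close = np.array([[10.0, 11.0, 12.0, 13.0, 14.0]])
obs_close = np.array([[15.0, 14.0, 16.0]])
pred_close = np.array([[14.5, 14.5, 15.0]])

pnl_se, pnl_ae = pnl_errors(context_close, pred_close, obs_close)
print(pnl_se)  # squared PnL residuals per forecast step
print(pnl_ae)  # absolute PnL residuals per forecast step
```

Because the context closes are shared between the real and forecasted series, the residual reduces to the difference between observed and predicted closes, with the sign determined by the real position side.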

In [28]:
from typing import Optional

class MongyModel(PatchTSMixerForPrediction):
    
    def forward(
        self,
        past_values: torch.Tensor,
        future_values: Optional[torch.Tensor] = None,
        observed_mask: Optional[torch.Tensor] = None,
        output_hidden_states: Optional[bool] = False,
        return_loss: bool = True,
        return_dict: Optional[bool] = None,
    ):
        # Call the parent class's forward method to get the model's outputs
        outputs = super().forward(
            past_values,
            observed_mask=observed_mask,
            future_values=future_values,
            output_hidden_states=output_hidden_states,
            return_loss=False,  # Set return_loss to False to prevent the built-in loss computation
            return_dict=return_dict,
        )

        # Snap the candles to the correct opening positions before computing the loss.
        # This is the "training wheels" for the head: by handling the portion of the
        # task that we can solve directly, we severely limit the task that is posed
        # to the model.
        _outputs = outputs.prediction_outputs
        
        last_context_close = past_values[..., -1, I_CLOSE]
        first_candle_open = _outputs[..., 0, I_OPEN]
        first_candle_delta = last_context_close - first_candle_open
        first_candle_delta = first_candle_delta.unsqueeze(-1).unsqueeze(-1)
        _outputs[..., 0:4] = _outputs[..., 0:4] + first_candle_delta


        first_candle_close = _outputs[..., 0, I_CLOSE]
        second_candle_open = _outputs[..., 1, I_OPEN]
        second_candle_delta = first_candle_close - second_candle_open
        second_candle_delta = second_candle_delta.unsqueeze(-1).unsqueeze(-1)
        _outputs[..., -2:, 0:4] = _outputs[..., -2:, 0:4] + second_candle_delta

        second_candle_close = _outputs[..., 1, I_CLOSE]
        third_candle_open = _outputs[..., 2, I_OPEN]
        third_candle_delta = second_candle_close - third_candle_open
        third_candle_delta = third_candle_delta.unsqueeze(-1)
        _outputs[..., -1, 0:4] = _outputs[..., -1, 0:4] + third_candle_delta

        # Apply your custom loss function
        loss_val = None
        if future_values is not None and return_loss:
            loss_val = custom_loss(
                past_values[..., self.prediction_channel_indices],    
                outputs.prediction_outputs[..., self.prediction_channel_indices],
                future_values[..., self.prediction_channel_indices]
            )
            outputs.loss = loss_val

        if not return_dict:
            output = (outputs.prediction_outputs,) + outputs[2:]
            return ((loss_val,) + output) if loss_val is not None else output

        return outputs
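
The candle snapping inside `forward` can be sketched outside the model as a small loop. The version below is a NumPy illustration rather than the torch code used above: the helper `snap_candles` and the toy values are hypothetical, and it assumes the channel order from `forecast_columns` (open, close, high, low first). Each forecast candle, together with all later ones, is shifted so that its open equals the previous candle's close.

```python
import numpy as np

I_OPEN, I_CLOSE = 0, 1  # channel positions, following forecast_columns


def snap_candles(last_close, preds, n_price=4):
    """Shift price channels so every candle opens at the previous close.

    last_close: (batch,) close of the final context candle
    preds:      (batch, horizon, channels) raw forecasts
    """
    out = preds.copy()  # work on a copy instead of modifying in place
    prev_close = last_close
    for t in range(out.shape[1]):
        delta = prev_close - out[:, t, I_OPEN]  # gap to close, shape (batch,)
        # Move candle t and all later candles by the same offset (price channels only)
        out[:, t:, :n_price] += delta[:, None, None]
        prev_close = out[:, t, I_CLOSE]
    return out


# Toy forecast: columns are open, close, high, low, volume, red, green
last_close = np.array([100.0])
preds = np.array([[
    [99.0, 101.0, 102.0, 98.0, 5.0, 0.0, 1.0],
    [100.0, 99.0, 100.5, 98.5, 6.0, 1.0, 0.0],
    [103.0, 104.0, 104.5, 102.5, 7.0, 0.0, 1.0],
]])

snapped = snap_candles(last_close, preds)
print(snapped[0, :, I_OPEN])   # opens after snapping
print(snapped[0, :, I_CLOSE])  # closes after snapping
```

Only the four price channels move; volume and the red/green indicators are left untouched, and the shape of each candle (close minus open, high minus low) is preserved.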

## Training `PatchTSMixer` From Scratch

Adjust the following model parameters according to need.
- `d_model` (`int`, *optional*, defaults to 8):
    Hidden dimension of the model. Recommended to set it as a multiple of `patch_length` (i.e. 2-8X of
    `patch_length`). A larger value indicates a more complex model.
- `expansion_factor` (`int`, *optional*, defaults to 2):
    Expansion factor to use inside the MLP. Recommended range is 2-5. A larger value indicates a more complex model.
- `num_layers` (`int`, *optional*, defaults to 3):
    Number of layers to use. Recommended range is 3-15. A larger value indicates a more complex model.
- `mode` (`str`, either `common_channel` or `mix_channel`):
    How channels are processed. `common_channel` models each channel independently, while `mix_channel` adds explicit mixing across channels.
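
To make the interaction between `context_length`, `patch_length`, and the patch stride concrete, the sketch below applies the usual sliding-window count to the values used in this notebook (a 240-step context with a patch length of 16). The helper `num_patches` is hypothetical, and it is an assumption that this matches how the model derives its patch count internally.

```python
def num_patches(context_length: int, patch_length: int, patch_stride: int) -> int:
    """Number of patches produced by sliding a window of patch_length over the context."""
    return (max(context_length, patch_length) - patch_length) // patch_stride + 1


context_length = 240
patch_length = 16

# Stride of 1: heavily overlapping patches
overlapping = num_patches(context_length, patch_length, patch_stride=1)
# Stride equal to the patch length: non-overlapping patches
non_overlapping = num_patches(context_length, patch_length, patch_stride=patch_length)

# Hidden size chosen as a multiple of the patch length (5x here)
d_model = 5 * patch_length

print(overlapping, non_overlapping, d_model)
```

A stride of 1 produces many more patches per window than a non-overlapping stride, which increases the sequence length the mixer layers have to process.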

In [29]:
patch_length = 16
stride_length = 1

prediction_channel_indices = train_tsp.prediction_channel_indices
num_input_channels = train_tsp.num_input_channels

config = PatchTSMixerConfig(
    # Dataset Kwargs
    context_length=context_length,
    prediction_length=forecast_horizon,
    prediction_channel_indices=prediction_channel_indices,
    patch_length=patch_length,
    num_input_channels=num_input_channels,
    patch_stride=stride_length,

    # Model Kwargs
    d_model=5 * patch_length,
    num_layers=4,
    expansion_factor=3,
    dropout=0.5,
    head_dropout=0.7,
    mode="mix_channel",
    scaling=None,
)
model = MongyModel(config=config)

## Training Run Summaries

**Run 1** (N/A):
This run used the full year of data and served as a baseline to establish that the `mix_channel` mode is more effective for our task. All subsequent runs use only the first three months of data instead. Thus, while the loss for this run is lower, that is not indicative of the parameters being a better fit; it is just a result of having a larger dataset.

**Run 2** (0.108476):
This was the first run in which only the first two months of data were used as the training set. March was then split in half to form the validation and test sets. Additionally, the context window was expanded to include the last four hours of data. While this wasn't explicitly compared against a shorter context window on the same dataset, the results of the paper strongly suggest that this approach performs better.

**Run 3** (0.108230):
This run increased the `num_layers` argument from 3 to 5. The additional layers give the model more capacity to perceive complex patterns in the financial data. The model is larger as a result, but the hope is that it will better capture the nuances of the highly complex financial data it is being trained on.

**Run 4** (0.107247):
This run further increased the `num_layers` argument from 5 to 10, adding more layers to capture more of the complex patterns in the financial dataset.

_NOTE_: Increasing `num_layers` does not seem to provide meaningful additional benefit in this training task, and it has the side effect of significantly increasing inference time. As a result, we are making the decision to keep `num_layers = 3`.

---

**Run 5** (0.108397):
The `num_layers` argument has been reset to 3, which returns our baseline to _Run 2_. The `expansion_factor` has been increased from 3 to 4. This yielded a slight decrease in validation loss relative to _Run 2_, so a second experiment may be worthwhile, but it is likely best to test patching next.

**Run 6** ()

In [31]:
# Compute the run number
run_num = "snap_candle_1"
save_dir = f"./checkpoints/run_{run_num}"

# Check if save_dir exists
assert not os.path.exists(save_dir), "Please update the run_num to avoid overwriting checkpoints!"

num_workers = 10  # p3.2xlarge instance has 12 vCPUs

gradient_accumulation_steps = 1  # Number of batches accumulated before each optimizer step
batch_size = 64  # Size of each batch sent to the GPU
eval_batch_size = 256
num_steps = 500

# Calculations
# =======================
# effective_batch_size = batch_size * gradient_accumulation_steps = 64 * 1 = 64
# examples_per_evaluation = num_steps * effective_batch_size = 500 * 64 = 32,000

train_args = TrainingArguments(
    output_dir=f"{save_dir}/output/",
    overwrite_output_dir=True,
    learning_rate=0.00001,
    num_train_epochs=100,
    do_eval=True,
    evaluation_strategy="steps",
    eval_steps=num_steps,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=eval_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    eval_accumulation_steps=250,
    dataloader_num_workers=num_workers,
    report_to="tensorboard",
    save_strategy="steps",
    save_steps=num_steps,
    logging_strategy="steps",
    logging_steps=num_steps,
    save_total_limit=3,
    logging_dir=f"{save_dir}/logs/",  # Make sure to specify a logging directory
    load_best_model_at_end=True,  # Load the best model when training ends
    metric_for_best_model="eval_loss",  # Metric to monitor for early stopping
    greater_is_better=False,  # For loss
    label_names=["future_values"], # The names of the "ground truth" values to compare predictions against
)

# Create a new early stopping callback with faster convergence properties
early_stopping_callback = EarlyStoppingCallback(
    early_stopping_patience=5,  # Number of evaluations with no improvement after which to stop
    early_stopping_threshold=0.001,  # Minimum change in the metric that counts as an improvement
)

trainer = Trainer(
    model=model,
    args=train_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    callbacks=[early_stopping_callback],
)

print("Doing forecasting training on FULL Dataset")
trainer.train()

Doing forecasting training on FULL Dataset


Step,Training Loss,Validation Loss
500,2.258,0.549904
1000,1.7474,0.535117
1500,1.3834,0.527811
2000,1.1197,0.522586
2500,0.9427,0.517735
3000,0.8187,0.514733
3500,0.7342,0.512584
4000,0.6793,0.511679
4500,0.6385,0.510089
5000,0.6167,0.509438
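
The validation losses above can be run through a simplified model of patience-based early stopping to see whether training would halt within these ten evaluations. This is a sketch of the general technique, not the `transformers` implementation: the function `first_stop_index` is hypothetical, and it assumes an evaluation counts as an improvement only when the loss falls below the best value seen so far by more than the threshold.

```python
from typing import List, Optional


def first_stop_index(losses: List[float], patience: int, threshold: float) -> Optional[int]:
    """Return the index of the evaluation that triggers early stopping, or None."""
    best = None
    stale = 0  # consecutive evaluations without a sufficient improvement
    for i, loss in enumerate(losses):
        if best is None or (loss < best and abs(loss - best) > threshold):
            stale = 0
        else:
            stale += 1
        # Track the best loss even when the gain is below the threshold
        if best is None or loss < best:
            best = loss
        if stale >= patience:
            return i
    return None


# Validation losses from the table above (evaluated every 500 steps)
validation_losses = [
    0.549904, 0.535117, 0.527811, 0.522586, 0.517735,
    0.514733, 0.512584, 0.511679, 0.510089, 0.509438,
]
stop_at = first_stop_index(validation_losses, patience=5, threshold=0.001)
print(stop_at)

# A hypothetical plateau, where every gain is smaller than the threshold
plateau = [0.5000, 0.4999, 0.4998, 0.4997, 0.4996, 0.4995]
plateau_stop = first_stop_index(plateau, patience=5, threshold=0.001)
print(plateau_stop)
```

Under these assumptions, the logged losses never accumulate five consecutive stale evaluations, while the plateau example stops at its sixth evaluation (index 5).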


TrainOutput(global_step=7000, training_loss=0.9469972577776228, metrics={'train_runtime': 2573.9075, 'train_samples_per_second': 22430.643, 'train_steps_per_second': 350.479, 'total_flos': 1.07288810496e+16, 'train_loss': 0.9469972577776228, 'epoch': 0.7759671876732069})
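
The numbers in the `TrainOutput` above can be combined with the training arguments for a quick sanity check of the step arithmetic. The sketch below assumes that `epoch` is the fraction of an epoch completed at `global_step`; the variable names are illustrative, and the values are copied from the cells and output above.

```python
# Values taken from the training cell and the TrainOutput above
global_step = 7000
epoch_fraction = 0.7759671876732069
batch_size = 64
gradient_accumulation_steps = 1
eval_steps = 500
patience = 5

# Examples consumed per optimizer step
effective_batch_size = batch_size * gradient_accumulation_steps

# Optimizer steps in one full pass over the training windows
steps_per_epoch = round(global_step / epoch_fraction)

# Upper bound on training windows per epoch (the last batch may be partial)
max_windows_per_epoch = steps_per_epoch * effective_batch_size

# Training examples seen between two consecutive evaluations
examples_per_evaluation = eval_steps * effective_batch_size

# Evaluations per epoch, and the patience window expressed in optimizer steps
evals_per_epoch = steps_per_epoch / eval_steps
patience_in_steps = patience * eval_steps

print(steps_per_epoch, max_windows_per_epoch, examples_per_evaluation)
print(round(evals_per_epoch, 2), patience_in_steps)
```

Since early stopping patience is counted in evaluation calls rather than epochs, a patience of 5 with an evaluation every 500 steps corresponds to 2,500 optimizer steps, well under a single epoch here.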

In [None]:
trainer.evaluate(test_dataset)