# MONGY: Training `PatchTSMixer` on Financial Candlestick Data
## Direct forecasting example

This notebook demonstrates the use of a `PatchTSMixer` model for a multivariate time series forecasting task. It depends on the Hugging Face [transformers](https://github.com/huggingface/transformers) library. For details of the model architecture, refer to the [TSMixer paper](https://arxiv.org/abs/2306.09364).

In [1]:
# Standard
import os
import random

# Third Party
from transformers import (
    EarlyStoppingCallback,
    PatchTSMixerConfig,
    PatchTSMixerForPrediction,
    Trainer,
    TrainingArguments,
)
import numpy as np
import pandas as pd
import torch

# First Party
from tsfm_public.toolkit.dataset import ForecastDFDataset
from tsfm_public.toolkit.time_series_preprocessor import TimeSeriesPreprocessor
from tsfm_public.toolkit.util import select_by_index, train_test_split

In [2]:
# Set seed for reproducibility
SEED = 42
torch.manual_seed(SEED)
random.seed(SEED)
np.random.seed(SEED)

In [3]:
import warnings

warnings.filterwarnings("ignore", category=UserWarning, module="torch.utils.data.dataloader")

## Load and prepare datasets

In the next cell, please adjust the following parameters to suit your application:
- `dataset_path`: Path to a local .csv file, or the web address of a csv file, containing the data of interest. Data is loaded with pandas, so anything supported by [`pd.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) is supported.
- `timestamp_column`: Name of the column containing timestamp information. Use None if there is no such column.
- `id_columns`: List of column names specifying the IDs of the different time series. If no ID column exists, use [].
- `forecast_columns`: List of columns to be modeled.
- `context_length`: The amount of historical data used as input to the model. Windows of length `context_length` are extracted from the input dataframe. For a dataset with multiple time series, each context window is contained within a single time series (i.e., a single ID).
- `forecast_horizon`: Number of time steps to forecast into the future.
- `train_start_index`, `train_end_index`: The start and end indices in the loaded data which delineate the training data.
- `valid_start_index`, `valid_end_index`: The start and end indices in the loaded data which delineate the validation data.
- `test_start_index`, `test_end_index`: The start and end indices in the loaded data which delineate the test data.
- `patch_length`: The patch length for the `PatchTSMixer` model. Choose a value that divides `context_length` evenly.
- `num_workers`: Number of workers used by the PyTorch dataloader.
- `batch_size`: Batch size.

The data is first loaded into a Pandas dataframe and split into training, validation, and test parts. The dataframes are then converted to the appropriate torch datasets needed for training.
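
As a rough illustration of the windowing described for `context_length`, the sketch below counts how many (context, forecast) windows fit in each series when a window never crosses an ID boundary. It assumes windows slide one time step at a time; `count_windows` and the series lengths are hypothetical, and the actual `ForecastDFDataset` implementation may differ in its details.

```python
def count_windows(series_length: int, context_length: int, forecast_horizon: int) -> int:
    """Number of full windows in one series, assuming a slide of one time step."""
    window = context_length + forecast_horizon
    # A series shorter than one full window contributes no samples.
    return max(series_length - window + 1, 0)


# Hypothetical series lengths, keyed by ID (e.g. one series per ticker and day).
series_lengths = {"full_day": 8640, "short_day": 1500}
context_length = 1440
forecast_horizon = 120

windows_per_series = {
    series_id: count_windows(length, context_length, forecast_horizon)
    for series_id, length in series_lengths.items()
}
total_windows = sum(windows_per_series.values())
print(windows_per_series, total_windows)
```

Under these assumptions, any series shorter than `context_length + forecast_horizon` yields no training samples at all, which is worth keeping in mind when the ID columns split the data into short series.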

In [4]:
# Set the context, horizon, and patch size for our task. The data uses 10-second candles
# (6 per minute). We start with 4 hours of lookback in order to predict the next 20 minutes
# of candles. For patching, we start with a patch length of 16 and a stride of 8
# (overlapping patches) as a base case assumption.
context_length = 6 * 60 * 4  # This will give us 4 hours of lookback (6 candles per min * 60 min per hour)
forecast_horizon = 6 * 20 # This will give us 20 minutes of predictions
patch_length = 16
stride_length = 8
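
To sanity-check these settings, the sketch below converts the window sizes back into wall-clock time and counts how many patches the context window is cut into. It assumes 10-second candles (as the file names suggest) and uses the standard sliding-window count for patches; the library's own computation may differ in edge cases.

```python
CANDLE_SECONDS = 10  # assumption: one candle every 10 seconds

context_length = 6 * 60 * 4
forecast_horizon = 6 * 20
patch_length = 16
stride_length = 8

# Convert window sizes from candles back into time.
lookback_hours = context_length * CANDLE_SECONDS / 3600
horizon_minutes = forecast_horizon * CANDLE_SECONDS / 60

# Sliding-window count: the first patch, plus one more per full stride that still fits.
num_patches = (context_length - patch_length) // stride_length + 1
# Time steps at the end of the context that no patch would cover.
leftover_steps = (context_length - patch_length) % stride_length

print(f"lookback: {lookback_hours} h, horizon: {horizon_minutes} min")
print(f"patches: {num_patches}, uncovered steps: {leftover_steps}")
```

A non-zero `leftover_steps` would mean the most recent candles fall outside every patch, which is one reason to pick a patch length and stride that tile `context_length` exactly.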

In [5]:
import gc

# Load the Dataset from the CSV file
TRAIN_DATASET = "/home/jack/data/10s-candles-train.csv"
VALID_DATASET = "/home/jack/data/10s-candles-valid.csv"
TEST_DATASET = "/home/jack/data/10s-candles-test.csv"

timestamp_col = 't'

train_data = pd.read_csv(
    TRAIN_DATASET,
    parse_dates=[timestamp_col]
)

gc.collect()

valid_data = pd.read_csv(
    VALID_DATASET,
    parse_dates=[timestamp_col]
)

gc.collect()

test_data = pd.read_csv(
    TEST_DATASET,
    parse_dates=[timestamp_col]
)

gc.collect()


0

In [6]:
train_data.head()

Unnamed: 0,t,targ_o,targ_h,targ_l,targ_c,targ_v,ticker,date_string
0,2023-01-03 09:00:00+00:00,130.28,130.95,130.28,130.95,4233.0,AAPL,2023-01-03
1,2023-01-03 09:00:10+00:00,130.98,131.0,130.93,130.93,744.0,AAPL,2023-01-03
2,2023-01-03 09:00:20+00:00,130.98,131.0,130.93,130.93,0.0,AAPL,2023-01-03
3,2023-01-03 09:00:30+00:00,130.98,131.0,130.93,130.93,0.0,AAPL,2023-01-03
4,2023-01-03 09:00:40+00:00,130.98,130.98,130.98,130.98,223.0,AAPL,2023-01-03


In [7]:
valid_data.head()

Unnamed: 0,t,targ_o,targ_h,targ_l,targ_c,targ_v,ticker,date_string
0,2023-02-07 00:00:00+00:00,151.77,151.77,151.74,151.74,0.0,AAPL,2023-02-07
1,2023-02-07 00:00:10+00:00,151.77,151.77,151.74,151.74,0.0,AAPL,2023-02-07
2,2023-02-07 00:00:20+00:00,151.77,151.77,151.74,151.74,0.0,AAPL,2023-02-07
3,2023-02-07 00:00:30+00:00,151.77,151.77,151.74,151.74,0.0,AAPL,2023-02-07
4,2023-02-07 00:00:40+00:00,151.77,151.77,151.74,151.74,0.0,AAPL,2023-02-07


In [8]:
test_data.head()

Unnamed: 0,t,targ_o,targ_h,targ_l,targ_c,targ_v,ticker,date_string
0,2023-02-24 00:00:00+00:00,148.94,148.94,148.94,148.94,0.0,AAPL,2023-02-24
1,2023-02-24 00:00:10+00:00,148.94,148.94,148.94,148.94,0.0,AAPL,2023-02-24
2,2023-02-24 00:00:20+00:00,148.94,148.94,148.94,148.94,0.0,AAPL,2023-02-24
3,2023-02-24 00:00:30+00:00,148.94,148.94,148.94,148.94,0.0,AAPL,2023-02-24
4,2023-02-24 00:00:40+00:00,148.94,148.94,148.94,148.94,0.0,AAPL,2023-02-24


In [9]:
# Forward-fill missing values, then verify that no NaN values remain
train_data.ffill(inplace=True)
valid_data.ffill(inplace=True)
test_data.ffill(inplace=True)

assert sum(train_data.isna().sum().to_list()) == 0
assert sum(valid_data.isna().sum().to_list()) == 0
assert sum(test_data.isna().sum().to_list()) == 0

In [10]:

id_columns = ['ticker', 'date_string']
forecast_columns = ['targ_o', 'targ_c', 'targ_h', 'targ_l', 'targ_v']

train_tsp = TimeSeriesPreprocessor(
    timestamp_column=timestamp_col,
    id_columns=id_columns,
    target_columns=forecast_columns,
    scaling=True,
)
train_tsp.train(train_data)

gc.collect()
print("Done Train")

valid_tsp = TimeSeriesPreprocessor(
    timestamp_column=timestamp_col,
    id_columns=id_columns,
    target_columns=forecast_columns,
    scaling=True,
)
valid_tsp.train(valid_data)

gc.collect()
print("Done Valid")

test_tsp = TimeSeriesPreprocessor(
    timestamp_column=timestamp_col,
    id_columns=id_columns,
    target_columns=forecast_columns,
    scaling=True,
)
test_tsp.train(test_data)
print("Done Test")

gc.collect()

Done Train
Done Valid
Done Test


0

In [11]:
train_dataset = ForecastDFDataset(
    train_tsp.preprocess(train_data),
    id_columns=id_columns,
    target_columns=forecast_columns,
    context_length=context_length,
    prediction_length=forecast_horizon,
)
valid_dataset = ForecastDFDataset(
    valid_tsp.preprocess(valid_data),
    id_columns=id_columns,
    target_columns=forecast_columns,
    context_length=context_length,
    prediction_length=forecast_horizon,
)
test_dataset = ForecastDFDataset(
    test_tsp.preprocess(test_data),
    id_columns=id_columns,
    target_columns=forecast_columns,
    context_length=context_length,
    prediction_length=forecast_horizon,
)

gc.collect()

0

In [12]:
del train_data
del valid_data
del test_data

gc.collect()

0

## Training `PatchTSMixer` From Scratch

Adjust the following model parameters according to need.
- `d_model` (`int`, *optional*, defaults to 8):
    Hidden dimension of the model. Recommended to set it as a multiple of patch_length (i.e. 2-8X of
    patch_len). Larger value indicates more complex model.
- `expansion_factor` (`int`, *optional*, defaults to 2):
    Expansion factor to use inside MLP. Recommended range is 2-5. Larger value indicates more complex model.
- `num_layers` (`int`, *optional*, defaults to 3):
    Number of layers to use. Recommended range is 3-15. Larger value indicates more complex model.
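- `mode` (`str`, either `common_channel` or `mix_channel`):
    Mixer mode. `common_channel` models each channel independently, while `mix_channel` adds explicit mixing across channels.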

In [14]:
config = PatchTSMixerConfig(
    context_length=context_length,
    prediction_length=forecast_horizon,
    patch_length=patch_length,
    num_input_channels=len(forecast_columns),
    patch_stride=stride_length,
    d_model=2 * patch_length,
    num_layers=3,
    expansion_factor=4,
    dropout=0.5,
    head_dropout=0.7,
    mode="mix_channel",
    scaling="std",
)
model = PatchTSMixerForPrediction(config=config)

## Training Run Summaries

**Run 1** (N/A):
This run used the full year of data and served as a baseline to establish that the `mix_channel` mode is more effective for our task. All subsequent runs instead use only the first three months of data as training data. Thus, while the loss for this run is lower, this does not indicate that the parameters are a better fit; it is just a result of having a larger dataset.

**Run 2** (0.108476):
This run was the first in which only the first two months of data were used as the training set. March was then split in half to form the validation and test sets. Additionally, the context window was expanded to include the last four hours of data. While this wasn't explicitly compared against a shorter context window on the same dataset, the results of the paper strongly suggest that a longer context window performs better.

**Run 3** (0.108230):
This run increased the `num_layers` argument from 3 to 5. The additional layers give the model more capacity to perceive complex patterns in the financial data. This results in a larger model, but will hopefully allow it to better capture the nuances of the highly complex financial data it is being trained on.

**Run 4** (0.107247):
This run further increased the `num_layers` argument from 5 to 10, adding more layers to capture more of the complex patterns in the financial dataset.

_NOTE_: Increasing `num_layers` does not seem to help much in this training task, and it has the side effect of significantly increasing inference time. As a result, we are making the decision to keep `num_layers = 3`.
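
To put the `num_layers` comparison in perspective, the sketch below expresses each run's loss as a relative change against Run 2. It assumes the figures in parentheses are validation losses measured on the same validation set; `relative_improvement_pct` is a hypothetical helper, not part of any library.

```python
def relative_improvement_pct(baseline: float, loss: float) -> float:
    """Percentage by which `loss` is lower than `baseline` (negative if it is higher)."""
    return (baseline - loss) / baseline * 100


# Losses taken from the run summaries above, keyed by `num_layers`.
loss_by_num_layers = {3: 0.108476, 5: 0.108230, 10: 0.107247}
baseline_loss = loss_by_num_layers[3]

for num_layers, loss in loss_by_num_layers.items():
    change = relative_improvement_pct(baseline_loss, loss)
    print(f"num_layers={num_layers:>2}: loss={loss:.6f} ({change:+.2f}% vs Run 2)")
```

Going from 3 to 10 layers lowers the loss by roughly 1%, which is the kind of gain that has to be weighed against the extra inference time.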

---

**Run 5** (0.108397):
The `num_layers` argument has been reset to 3, which returns our baseline to _Run 2_. The `expansion_factor` has been increased from 3 to 4. This yielded a slight decrease in validation loss, so it is potentially worth running a second experiment, but it is likely best to test patching instead.

**Run 6** ()

In [15]:
# Set the run number manually; it names the checkpoint directory for this run
run_num = 14
save_dir = f"./checkpoints/run_{run_num}"

# Check if save_dir exists
assert not os.path.exists(save_dir), "Please update the run_num to avoid overwriting checkpoints!"

num_workers = 2  # g4dn instance has 4 vCPUs
batch_size = 256  # Per-device batch size sent to the GPU
num_steps = 500
logging_num_steps = 100

train_args = TrainingArguments(
    output_dir=f"{save_dir}/output/",
    overwrite_output_dir=True,
    learning_rate=0.0001,
    num_train_epochs=100,
    do_eval=True,
    evaluation_strategy="steps",
    eval_steps=num_steps,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    gradient_accumulation_steps=8,
    eval_accumulation_steps=250,
    dataloader_num_workers=num_workers,
    report_to="tensorboard",
    save_strategy="steps",
    save_steps=num_steps,
    logging_strategy="steps",
    logging_steps=logging_num_steps,
    save_total_limit=3,
    logging_dir=f"{save_dir}/logs/",  # Make sure to specify a logging directory
    load_best_model_at_end=True,  # Load the best model when training ends
    metric_for_best_model="eval_loss",  # Metric to monitor for early stopping
    greater_is_better=False,  # For loss
    label_names=["future_values"], # The names of the "ground truth" values to compare predictions against
)

# Stop training early once the validation loss stops improving
early_stopping_callback = EarlyStoppingCallback(
    early_stopping_patience=5,  # Number of consecutive evaluations (one every `eval_steps` steps) with no improvement before stopping
    early_stopping_threshold=0.001,  # Minimum change in eval_loss that counts as an improvement
)

trainer = Trainer(
    model=model,
    args=train_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    callbacks=[early_stopping_callback],
)

print("Doing forecasting training on FULL Dataset")
trainer.train()

Doing forecasting training on FULL Dataset


Step,Training Loss,Validation Loss
500,0.3652,0.360038
1000,0.3522,0.353682
1500,0.3513,0.350495
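
To make the early stopping settings concrete, the sketch below replays a list of validation losses through a simplified version of the patience rule: an evaluation counts as an improvement only if the loss beats the previous best by more than the threshold. `evaluations_until_stop` is a hypothetical helper written for illustration, not the library's implementation.

```python
def evaluations_until_stop(eval_losses, patience, threshold):
    """Return the 1-based evaluation at which training would stop, or None if it never does."""
    best = None
    evals_without_improvement = 0
    for i, loss in enumerate(eval_losses, start=1):
        improved = best is None or (loss < best and abs(loss - best) > threshold)
        if improved:
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
        # Track the best loss seen so far, even when the gain is below the threshold.
        if best is None or loss < best:
            best = loss
        if evals_without_improvement >= patience:
            return i
    return None


# Validation losses from the training log above (one evaluation every 500 steps).
logged_losses = [0.360038, 0.353682, 0.350495]
print(evaluations_until_stop(logged_losses, patience=5, threshold=0.001))

# A made-up plateau: each gain after the first evaluation is below the threshold.
plateau_losses = [0.3500, 0.3495, 0.3493, 0.3492, 0.3491, 0.3490]
print(evaluations_until_stop(plateau_losses, patience=5, threshold=0.001))
```

Because patience is counted in evaluations, a patience of 5 with `eval_steps=500` corresponds to 2,500 optimizer steps without a sufficient improvement, and each optimizer step covers `256 * 8 = 2048` samples per device once gradient accumulation is taken into account.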


In [106]:
trainer.evaluate(test_dataset)



{'eval_loss': 0.2793983221054077,
 'eval_runtime': 32.6588,
 'eval_samples_per_second': 11915.05,
 'eval_steps_per_second': 93.114,
 'epoch': 4.0}

## If we want to train from scratch for a few specific forecast channels