# MONGY: Training `PatchTSMixer` on Financial Candlestick Data
## Direct forecasting example

This notebook demonstrates the usage of a `PatchTSMixer` model for a multivariate time series forecasting task. It depends on the HuggingFace [transformers](https://github.com/huggingface/transformers) repo. For details of the model architecture, refer to the [TSMixer paper](https://arxiv.org/abs/2306.09364).

In [1]:
# Standard
import os
import random

# Third Party
from transformers import (
    EarlyStoppingCallback,
    PatchTSMixerConfig,
    PatchTSMixerForPrediction,
    Trainer,
    TrainingArguments,
)
import numpy as np
import pandas as pd
import torch

# First Party
from tsfm_public.toolkit.dataset import ForecastDFDataset
from tsfm_public.toolkit.time_series_preprocessor import TimeSeriesPreprocessor
from tsfm_public.toolkit.util import select_by_index

In [2]:
# Set seed for reproducibility
SEED = 42
torch.manual_seed(SEED)
random.seed(SEED)
np.random.seed(SEED)

In [3]:
import warnings

warnings.filterwarnings("ignore", category=UserWarning, module="torch.utils.data.dataloader")

## Load and prepare datasets

In the next cell, please adjust the following parameters to suit your application:
- `dataset_path`: path to a local .csv file, or web address of a csv file, containing the data of interest. Data is loaded with pandas, so anything supported by [`pd.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) is supported.
- `timestamp_column`: column name containing timestamp information; use None if there is no such column.
- `id_columns`: list of column names specifying the IDs of different time series. If no ID column exists, use [].
- `forecast_columns`: list of columns to be modeled.
- `context_length`: the amount of historical data used as input to the model. Windows of the input time series data with length equal to `context_length` will be extracted from the input dataframe. In the case of a multi-time series dataset, the context windows will be created so that they are contained within a single time series (i.e., a single ID).
- `forecast_horizon`: number of time stamps to forecast in the future.
- `train_start_index`, `train_end_index`: the start and end indices in the loaded data which delineate the training data.
- `valid_start_index`, `valid_end_index`: the start and end indices in the loaded data which delineate the validation data.
- `test_start_index`, `test_end_index`: the start and end indices in the loaded data which delineate the test data.
- `patch_length`: the patch length for the `PatchTSMixer` model. It is recommended to choose a value that divides `context_length` evenly.
- `num_workers`: number of workers in the PyTorch dataloader.
- `batch_size`: batch size.

The data is first loaded into a Pandas dataframe and split into training, validation, and test parts. Then the pandas dataframes are converted to the appropriate torch dataset needed for training.
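
To make the roles of `context_length` and `forecast_horizon` concrete, here is a minimal sketch of how a stride-1 sliding window carves a series into (context, target) pairs. The helper names `count_windows` and `window_bounds` are hypothetical, and this is only an illustration of the windowing idea, not the `ForecastDFDataset` implementation.

```python
def count_windows(series_length: int, context_length: int, forecast_horizon: int) -> int:
    """Number of (context, target) pairs a stride-1 sliding window yields."""
    usable = series_length - context_length - forecast_horizon + 1
    return max(usable, 0)


def window_bounds(start: int, context_length: int, forecast_horizon: int):
    """Return (context_start, context_end, target_end) as half-open row positions."""
    context_end = start + context_length
    return start, context_end, context_end + forecast_horizon


# Toy numbers (hypothetical): 10 rows, 4 rows of context, 2 rows to forecast.
n_windows = count_windows(10, 4, 2)
first = window_bounds(0, 4, 2)
last = window_bounds(n_windows - 1, 4, 2)
print(n_windows, first, last)
```

Each window needs `context_length + forecast_horizon` consecutive rows, so a series shorter than that yields no samples at all.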

In [4]:
# We want to set up our context, horizon, and patch size based on our task. We want to use
# 4 hours of lookback to start, in order to predict the next 20 minutes of candles. Regarding
# patch length, we know that we will want a larger patch size, so we will start with 64 as
# a base case assumption
context_length = 6 * 60 * 4  # This will give us 4 hours of lookback (6 candles per min * 60 min per hour * 4 hours)
forecast_horizon = 6 * 20  # This will give us 20 minutes of predictions
patch_length = 64
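
The arithmetic behind these values can be checked with a short sketch. It assumes 10-second candles (as the timestamps in the data suggest) and uses the usual strided-window count for patches; `num_patches` and `covered_length` are hypothetical helpers, not functions from `transformers`.

```python
CANDLE_SECONDS = 10  # assumption: one candle every 10 seconds

candles_per_minute = 60 // CANDLE_SECONDS
context_length = candles_per_minute * 60 * 4   # 4 hours of candles
forecast_horizon = candles_per_minute * 20     # 20 minutes of candles
patch_length = 64


def num_patches(sequence_length: int, patch_length: int, patch_stride: int) -> int:
    """Count strided patches that fit entirely inside the sequence."""
    return (max(sequence_length, patch_length) - patch_length) // patch_stride + 1


def covered_length(sequence_length: int, patch_length: int, patch_stride: int) -> int:
    """Number of time steps spanned by those patches."""
    n = num_patches(sequence_length, patch_length, patch_stride)
    return patch_length + patch_stride * (n - 1)


for stride in (patch_length // 2, patch_length):
    n = num_patches(context_length, patch_length, stride)
    unused = context_length - covered_length(context_length, patch_length, stride)
    print(f"stride={stride}: {n} patches, {unused} candles not covered")
```

Note that 1440 is not a multiple of 64. With a stride of half the patch length the patches still tile the whole context, whereas a stride equal to the patch length leaves 32 candles outside any patch.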

In [5]:
# Load the Dataset from the CSV file
DATASET_PATH = "/home/jack/data/10s-candles-2023.csv"
timestamp_col = 't'

full_dataset = pd.read_csv(
    DATASET_PATH,
    parse_dates=[timestamp_col]
)

In [6]:
print(full_dataset.shape)
full_dataset.head()

(31160015, 10)


Unnamed: 0,t,targ_o,targ_h,targ_l,targ_c,targ_v,ticker,market_state_MarketState.CLOSED,market_state_MarketState.EXTENDED,market_state_MarketState.OPEN
0,2023-01-03 09:00:00+00:00,130.28,130.95,130.28,130.95,4233.0,AAPL,0,1,0
1,2023-01-03 09:00:10+00:00,130.98,131.0,130.93,130.93,744.0,AAPL,0,1,0
2,2023-01-03 09:00:20+00:00,130.98,131.0,130.93,130.93,0.0,AAPL,0,1,0
3,2023-01-03 09:00:30+00:00,130.98,131.0,130.93,130.93,0.0,AAPL,0,1,0
4,2023-01-03 09:00:40+00:00,130.98,130.98,130.98,130.98,223.0,AAPL,0,1,0


In [7]:
# Now we want to trim down the dataframe, to only include the AAPL data, and we will
# additionally remove the market_state columns
data = full_dataset.loc[full_dataset['ticker'] == 'AAPL']
data = data.drop(columns=["ticker", "market_state_MarketState.CLOSED", "market_state_MarketState.OPEN", "market_state_MarketState.EXTENDED"])

data = data.ffill()

In [8]:
# Check for NaN values
data.isna().sum()

t         0
targ_o    0
targ_h    0
targ_l    0
targ_c    0
targ_v    0
dtype: int64

In [9]:
# Before we set the explicit train, validation, and test indices, let's retrieve these
# indices by searching over the 't' column

# Find the position of the first row at or after midnight UTC on March 1, 2023
mar_1_2023 = '2023-03-01'
mar_1_2023_index = data[timestamp_col].searchsorted(pd.to_datetime(mar_1_2023).tz_localize('UTC'))

mar_15_2023 = '2023-03-15'
mar_15_2023_index = data[timestamp_col].searchsorted(pd.to_datetime(mar_15_2023).tz_localize('UTC'))

apr_1_2023 = '2023-04-01'
apr_1_2023_index = data[timestamp_col].searchsorted(pd.to_datetime(apr_1_2023).tz_localize('UTC'))
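
`searchsorted` performs a binary search, so it relies on the timestamp column being sorted in ascending order, and it returns a position rather than an index label. The sketch below shows the same idea with the standard library's `bisect_left`, which matches the default left-sided behaviour; the toy timestamps are hypothetical.

```python
from bisect import bisect_left
from datetime import datetime, timedelta, timezone

# Six hypothetical 10-second candles straddling midnight UTC on March 1, 2023.
start = datetime(2023, 2, 28, 23, 59, 30, tzinfo=timezone.utc)
timestamps = [start + timedelta(seconds=10 * i) for i in range(6)]

cutoff = datetime(2023, 3, 1, tzinfo=timezone.utc)

# Position of the first timestamp that is >= cutoff (requires sorted input).
split_index = bisect_left(timestamps, cutoff)

before_cutoff = timestamps[:split_index]
from_cutoff = timestamps[split_index:]
print(split_index, before_cutoff[-1], from_cutoff[0])
```

Slicing with the returned position gives a half-open split: every row before the cutoff lands on one side and the cutoff row itself starts the other.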

In [10]:
id_columns = [] # since we only have one ticker, this can remain empty for now
forecast_columns = ["targ_o", "targ_h", "targ_l", "targ_c", "targ_v"]
train_start_index = None  # None indicates beginning of dataset
train_end_index = mar_1_2023_index

# we shift the start of the validation period back by context length so that
# the first validation timestamp is immediately following the training data
valid_start_index = train_end_index - context_length
# we will end the validation set at the same date that the test set begins
valid_end_index = mar_15_2023_index

# we shift the start of the test period back by context length so that
# the first test timestamp is immediately following the validation data
test_start_index = valid_end_index - context_length
test_end_index = apr_1_2023_index  # end the test set at April 1, 2023
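
The reason for shifting each split start back by `context_length` can be seen in a small sketch. The numbers and the helper `first_target_position` are hypothetical; the point is only that the first forecast target of a split sits `context_length` rows after the split's first row.

```python
def first_target_position(split_start: int, context_length: int) -> int:
    """Row position of the first forecast target in a split that starts at split_start."""
    return split_start + context_length


# Hypothetical numbers chosen for readability.
train_end_index = 1_000
context_length = 100

# Shifted start: the first validation target lands exactly at the end of training.
valid_start_index = train_end_index - context_length
shifted = first_target_position(valid_start_index, context_length)

# Unshifted start: the first context_length rows could never be forecast targets.
unshifted = first_target_position(train_end_index, context_length)

print(shifted, unshifted)
```

With the shift, the overlapping rows serve only as model input for the first few windows, so no validation or test targets come from the preceding split.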

In [11]:

train_data = select_by_index(
    data,
    id_columns=id_columns,
    start_index=train_start_index,
    end_index=train_end_index,
)
valid_data = select_by_index(
    data,
    id_columns=id_columns,
    start_index=valid_start_index,
    end_index=valid_end_index,
)
test_data = select_by_index(
    data,
    id_columns=id_columns,
    start_index=test_start_index,
    end_index=test_end_index,
)

tsp = TimeSeriesPreprocessor(
    timestamp_column=timestamp_col,
    id_columns=id_columns,
    target_columns=forecast_columns,
    scaling=True,
)
tsp.train(train_data)

TimeSeriesPreprocessor {
  "categorical_encoder": null,
  "conditional_columns": [],
  "context_length": 64,
  "control_columns": [],
  "encode_categorical": true,
  "feature_extractor_type": "TimeSeriesPreprocessor",
  "freq": "0 days 00:00:10",
  "frequency_mapping": {
    "10_minutes": 3,
    "15_minutes": 4,
    "half_hourly": 1,
    "hourly": 2,
    "oov": 0
  },
  "id_columns": [],
  "observable_columns": [],
  "prediction_length": null,
  "processor_class": "TimeSeriesPreprocessor",
  "scaler_dict": {},
  "scaler_type": "standard",
  "scaling": true,
  "scaling_id_columns": [],
  "static_categorical_columns": [],
  "target_columns": [
    "targ_o",
    "targ_h",
    "targ_l",
    "targ_c",
    "targ_v"
  ],
  "target_scaler_dict": {
    "0": {
      "copy": true,
      "feature_names_in_": [
        "targ_o",
        "targ_h",
        "targ_l",
        "targ_c",
        "targ_v"
      ],
      "mean_": [
        143.1905872291718,
        143.2082004746137,
        143.180698882
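
The preprocessor output above reports a `standard` scaler fitted on the training split. As a rough sketch of what that implies, the example below fits a mean and population standard deviation on hypothetical training prices and reuses them for validation data. The helper names and values are illustrative assumptions, not the `TimeSeriesPreprocessor` internals.

```python
from statistics import fmean, pstdev


def fit_standard_scaler(values):
    """Return the (mean, population std) pair learned from training values."""
    return fmean(values), pstdev(values)


def transform(values, mean, std):
    """Apply already-fitted statistics to any split."""
    return [(v - mean) / std for v in values]


# Hypothetical closing prices.
train_close = [130.0, 131.0, 132.0, 133.0]
valid_close = [134.0, 135.0]

mean, std = fit_standard_scaler(train_close)
scaled_train = transform(train_close, mean, std)
scaled_valid = transform(valid_close, mean, std)  # reuses the training statistics
print(mean, round(std, 4), [round(v, 3) for v in scaled_valid])
```

Fitting on the training split only keeps information from the validation and test periods out of the scaling statistics.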

In [12]:
train_dataset = ForecastDFDataset(
    tsp.preprocess(train_data),
    id_columns=id_columns,
    target_columns=forecast_columns,
    context_length=context_length,
    prediction_length=forecast_horizon,
)
valid_dataset = ForecastDFDataset(
    tsp.preprocess(valid_data),
    id_columns=id_columns,
    target_columns=forecast_columns,
    context_length=context_length,
    prediction_length=forecast_horizon,
)
test_dataset = ForecastDFDataset(
    tsp.preprocess(test_data),
    id_columns=id_columns,
    target_columns=forecast_columns,
    context_length=context_length,
    prediction_length=forecast_horizon,
)

## Training `PatchTSMixer` From Scratch

Adjust the following model parameters according to need.
- `d_model` (`int`, *optional*, defaults to 8):
    Hidden dimension of the model. Recommended to set it as a multiple of `patch_length` (i.e. 2-8x
    `patch_length`). A larger value indicates a more complex model.
- `expansion_factor` (`int`, *optional*, defaults to 2):
    Expansion factor to use inside the MLP. Recommended range is 2-5. A larger value indicates a more complex model.
- `num_layers` (`int`, *optional*, defaults to 3):
    Number of layers to use. Recommended range is 3-15. A larger value indicates a more complex model.
- `mode` (`str`):
    Either `common_channel` (channels are modeled independently) or `mix_channel` (the model also mixes information across channels).
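
To get a feel for how these settings scale the model, the sketch below computes a few layer widths. It assumes that each mixing MLP expands its input width by `expansion_factor`, which is how the TSMixer paper describes the mixer blocks; `mixer_widths` is a hypothetical helper and the figures are illustrative rather than read from the library.

```python
def mixer_widths(context_length, patch_length, patch_stride, d_model, expansion_factor):
    """Rough widths implied by a configuration (illustrative only)."""
    num_patches = (context_length - patch_length) // patch_stride + 1
    return {
        "num_patches": num_patches,
        "d_model": d_model,
        # Assumption: hidden width = input width * expansion_factor.
        "feature_mlp_hidden": d_model * expansion_factor,
        "patch_mlp_hidden": num_patches * expansion_factor,
    }


patch_length = 64
widths = mixer_widths(
    context_length=1440,
    patch_length=patch_length,
    patch_stride=patch_length // 2,
    d_model=2 * patch_length,
    expansion_factor=4,
)
print(widths)
```

Under that assumption, raising `expansion_factor` widens every mixing MLP, while raising `num_layers` stacks more of them, which is why both are described as increasing model complexity.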

In [13]:
config = PatchTSMixerConfig(
    context_length=context_length,
    prediction_length=forecast_horizon,
    patch_length=patch_length,
    num_input_channels=len(forecast_columns),
    patch_stride=int(patch_length / 2),
    d_model=2 * patch_length,
    num_layers=3,
    expansion_factor=4,
    dropout=0.5,
    head_dropout=0.7,
    mode="mix_channel",
    scaling="std",
)
model = PatchTSMixerForPrediction(config=config)

# Training Run Summaries

**Run 1** (N/A):
This run used the full year of data, and served as a baseline to establish that the `mix_channel` mode is more effective for our task. All subsequent runs instead use only the first three months of data (two months for training, with March split between validation and test). Thus, while the loss for this run is lower, that is not indicative of the parameters being a better fit; it is just a result of having a larger dataset.

**Run 2** (0.108476):
This run was the first in which only the first two months of data were used as a training set. March was then split in half to form the validation and test sets. Additionally, the context window was expanded to include the last four hours of data. While this wasn't explicitly compared against a shorter context window on the same dataset, the results of the paper strongly suggest that this approach yields better performance.

**Run 3** (0.108230):
This run increased the `num_layers` argument from 3 to 5. The additional layers give the model more capacity to perceive complex patterns in the financial data. This results in a larger model, but will hopefully allow it to better capture the nuances of the highly complex financial data it is being trained on.

**Run 4** (0.107247):
This run further increased the `num_layers` argument from 5 to 10, adding more layers to capture more of the complex patterns in the financial dataset.

_NOTE_: Increasing `num_layers` does not seem to provide meaningful additional benefit in this training task, and has the side effect of significantly increasing the inference time. As a result, we are making the decision to keep `num_layers = 3`.

---

**Run 5** (0.108397):
The `num_layers` argument has been reset to a value of 3, which returns our baseline to _Run 2_. The `expansion_factor` has been increased from 3 to 4. This yielded a slight decrease in validation loss, so it is potentially worth running a second experiment, but it is likely best to test patching instead.
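
To put the run-to-run differences in perspective, the sketch below compares the losses quoted in parentheses above against Run 2. It assumes those figures are validation losses measured on the same split; the variable names are illustrative.

```python
# Losses quoted in the run summaries (assumed to be validation losses).
val_loss = {2: 0.108476, 3: 0.108230, 4: 0.107247, 5: 0.108397}

baseline = val_loss[2]
relative_change_pct = {
    run: 100.0 * (loss - baseline) / baseline for run, loss in val_loss.items()
}

for run, pct in relative_change_pct.items():
    print(f"Run {run}: loss={val_loss[run]:.6f}, change vs Run 2 = {pct:+.3f}%")

best_run = min(val_loss, key=val_loss.get)
print("Lowest loss: Run", best_run)
```

All of the changes are on the order of one percent or less, which is consistent with the decision not to pay the extra inference cost of deeper models.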

**Run 6** ()

In [15]:
# Compute the run number
run_num = 8
save_dir = f"./checkpoints/run_{run_num}"

# Check if save_dir exists
assert not os.path.exists(save_dir), "Please update the run_num to avoid overwriting checkpoints!"

num_workers = 8  # g4dn instance has 4 vCPUs
batch_size = 256  # Number of samples per batch sent to the GPU
num_steps = 100

train_args = TrainingArguments(
    output_dir=f"{save_dir}/output/",
    overwrite_output_dir=True,
    learning_rate=0.0001,
    num_train_epochs=100,
    do_eval=True,
    evaluation_strategy="steps",
    eval_steps=num_steps,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=int(batch_size // 2), # Cut batch size down for eval
    gradient_accumulation_steps=4,
    dataloader_num_workers=num_workers,
    report_to="tensorboard",
    save_strategy="steps",
    save_steps=num_steps,
    logging_strategy="steps",
    logging_steps=num_steps,
    save_total_limit=3,
    logging_dir=f"{save_dir}/logs/",  # Make sure to specify a logging directory
    load_best_model_at_end=True,  # Load the best model when training ends
    metric_for_best_model="eval_loss",  # Metric to monitor for early stopping
    greater_is_better=False,  # For loss
    label_names=["future_values"],
)

# Create a new early stopping callback with faster convergence properties
early_stopping_callback = EarlyStoppingCallback(
    early_stopping_patience=5,  # Number of evaluations with no improvement after which to stop
    early_stopping_threshold=0.001,  # Minimum improvement required to consider as improvement
)

trainer = Trainer(
    model=model,
    args=train_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    callbacks=[early_stopping_callback],
)

print("Doing forecasting training on AAPL Dataset")
trainer.train()

Doing forecasting training on AAPL Dataset


Step,Training Loss,Validation Loss
50,0.2282,0.127111
100,0.2277,0.124301
150,0.1933,0.122547
200,0.1976,0.12135
250,0.1801,0.120407
300,0.1868,0.119624
350,0.1977,0.119097


TrainOutput(global_step=350, training_loss=0.20163015638078963, metrics={'train_runtime': 518.6755, 'train_samples_per_second': 94024.294, 'train_steps_per_second': 91.772, 'total_flos': 1.83485288448e+16, 'train_loss': 0.20163015638078963, 'epoch': 0.7345225603357818})
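
The interaction between `early_stopping_patience` and `early_stopping_threshold` can be illustrated with the validation losses logged above. The function below is a simplified sketch of a patience/threshold rule, written for this example rather than copied from `transformers`: an evaluation only resets the counter if it beats the best loss so far by more than the threshold.

```python
def evaluations_without_improvement(losses, threshold):
    """Count consecutive evaluations that failed to beat the best loss by > threshold."""
    best = None
    counter = 0
    for loss in losses:
        if best is None or (loss < best and abs(loss - best) > threshold):
            counter = 0
        else:
            counter += 1
        # Track the best loss seen so far, even for small improvements.
        if best is None or loss < best:
            best = loss
    return counter


# Validation losses from the training log above.
logged = [0.127111, 0.124301, 0.122547, 0.12135, 0.120407, 0.119624, 0.119097]

patience = 5
counter = evaluations_without_improvement(logged, threshold=0.001)
should_stop = counter >= patience
print(counter, should_stop)
```

Under this rule the last three evaluations each improve by less than 0.001, so they count against the patience budget even though the loss is still falling. A smaller threshold would treat them as progress.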

In [106]:
trainer.evaluate(test_dataset)



{'eval_loss': 0.2793983221054077,
 'eval_runtime': 32.6588,
 'eval_samples_per_second': 11915.05,
 'eval_steps_per_second': 93.114,
 'epoch': 4.0}

## If we want to train from scratch for a few specific forecast channels

In [14]:
forecast_channel_indices = [
    -4,
    -1,
]  # add the channel indices (i.e., the column number) for which the model should forecast
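
Negative channel indices count from the end of the channel list. Assuming the channel order follows `forecast_columns` as defined earlier, the sketch below resolves which columns `[-4, -1]` would refer to.

```python
forecast_columns = ["targ_o", "targ_h", "targ_l", "targ_c", "targ_v"]
forecast_channel_indices = [-4, -1]

# Python-style negative indexing: -1 is the last column, -4 is the second one here.
selected_columns = [forecast_columns[i] for i in forecast_channel_indices]

# Equivalent non-negative positions.
positive_indices = [i % len(forecast_columns) for i in forecast_channel_indices]

print(selected_columns, positive_indices)
```

If the intent is to forecast different columns, for example the open and close prices, the indices would need to be adjusted accordingly.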

In [15]:
config = PatchTSMixerConfig(
    context_length=context_length,
    prediction_length=forecast_horizon,
    patch_length=patch_length,
    num_input_channels=len(forecast_columns),
    patch_stride=patch_length,
    d_model=48,
    num_layers=3,
    expansion_factor=3,
    dropout=0.5,
    head_dropout=0.7,
    mode="common_channel",
    scaling="std",
    prediction_channel_indices=forecast_channel_indices,
)
model = PatchTSMixerForPrediction(config=config)

In [16]:
trainer = Trainer(
    model=model,
    args=train_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    callbacks=[early_stopping_callback],
)

print("\n\nDoing forecasting training on Etth1/train")
trainer.train()



Doing forecasting training on Etth1/train


Epoch,Training Loss,Validation Loss
1,0.2753,0.496316
2,0.2312,0.485542
3,0.2182,0.478069
4,0.2099,0.470516
5,0.2064,0.47701
6,0.2026,0.474555
7,0.2006,0.474283
8,0.1983,0.472296
9,0.196,0.464579
10,0.1948,0.467563


TrainOutput(global_step=4284, training_loss=0.20359806520264356, metrics={'train_runtime': 60.2954, 'train_samples_per_second': 13322.735, 'train_steps_per_second': 417.942, 'total_flos': 1268896459751424.0, 'train_loss': 0.20359806520264356, 'epoch': 17.0})

In [17]:
trainer.evaluate(test_dataset)

{'eval_loss': 0.1160622164607048,
 'eval_runtime': 0.7379,
 'eval_samples_per_second': 3774.245,
 'eval_steps_per_second': 119.258,
 'epoch': 17.0}

#### Sanity check: Compute number of forecasting channels

In [18]:
output = trainer.predict(test_dataset)

In [19]:
output.predictions[0].shape

(2785, 96, 2)
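
The shape reads as (number of test windows, forecast steps, forecast channels). A small check along these lines can make the sanity test explicit; `check_prediction_shape` is a hypothetical helper and the expected values are passed in by the caller.

```python
def check_prediction_shape(shape, prediction_length, channel_indices):
    """Return True if the horizon and channel dimensions match the configuration."""
    _, horizon, channels = shape
    return horizon == prediction_length and channels == len(channel_indices)


recorded_shape = (2785, 96, 2)
forecast_channel_indices = [-4, -1]

# The last dimension matches the two selected channels.
matches_recorded = check_prediction_shape(recorded_shape, 96, forecast_channel_indices)

# With forecast_horizon = 120, as set earlier in this notebook, the middle
# dimension of a fresh run would be expected to be 120 rather than 96.
matches_current_config = check_prediction_shape(recorded_shape, 120, forecast_channel_indices)

print(matches_recorded, matches_current_config)
```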