 # Patch Time Series Transformer in HuggingFace - Getting Started

In this blog, we provide examples on how to get started with PatchTST. We first demonstrate the forecasting capability of `PatchTST` on the Electricity data. We will then demonstrate the transfer learning capability of `PatchTST` by using the previously trained model to do zero-shot forecasting on the electrical transformer (ETTh1) dataset.

The `PatchTST` model was proposed in [A Time Series is Worth 64 Words: Long-term Forecasting with Transformers](https://arxiv.org/abs/2211.14730) by Yuqi Nie, Nam H. Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam.

We will demonstrate the transfer learning capability of the `PatchTST` model.
We will pretrain the model for a forecasting task on a `source` dataset. Then, we will use the
pretrained model for zero-shot forecasting on a `target` dataset. The zero-shot forecasting
 performance denotes the `test` performance of the model in the `target` domain, without any
 training on the target domain. Subsequently, we will do linear probing and then fine-tuning of
 the pretrained model on the `train` part of the target data, and will validate the forecasting
 performance on the `test` part of the target data.

 `Blog authors`: Arindam Jati, Vijay Ekambaram, Nam Nguyen, Wesley Gifford and Kashif Rasul


## Quick overview of PatchTST

At a high level, the model vectorizes time series into patches of a given size and encodes the resulting sequence of vectors via a Transformer, which then outputs a forecast of the requested prediction length via an appropriate head.

The model is based on two key components: (i) segmentation of time series into subseries-level patches which are served as input tokens to Transformer; (ii) channel-independence where each channel contains a single univariate time series that shares the same embedding and Transformer weights across all the series. The patching design naturally has three-fold benefit: local semantic information is retained in the embedding; computation and memory usage of the attention maps are quadratically reduced given the same look-back window; and the model can attend longer history. Our channel-independent patch time series Transformer (PatchTST) can improve the long-term forecasting accuracy significantly when compared with that of SOTA Transformer-based models.
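
The patching step described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names `num_patches` and `make_patches` are hypothetical, the series is a toy list, and the context window is assumed to be at least one patch long. It shows how a look-back window of 512 steps with non-overlapping patches of length 16 yields 32 input tokens, and how that shrinks the size of the attention map.

```python
def num_patches(context_length, patch_length, patch_stride):
    """Number of patches extracted from one context window.

    Assumes context_length >= patch_length.
    """
    return (context_length - patch_length) // patch_stride + 1


def make_patches(series, patch_length, patch_stride):
    """Split a univariate series (one channel) into subseries-level patches."""
    n = num_patches(len(series), patch_length, patch_stride)
    return [
        series[i * patch_stride : i * patch_stride + patch_length]
        for i in range(n)
    ]


# Toy univariate channel with a look-back window of 512 steps.
series = list(range(512))
patches = make_patches(series, patch_length=16, patch_stride=16)

# Attention compares every token with every other token, so the size of
# the attention map grows with the square of the number of tokens.
pointwise_attention_entries = len(series) ** 2
patched_attention_entries = len(patches) ** 2
reduction = pointwise_attention_entries // patched_attention_entries

print(len(patches), reduction)
```

With a stride equal to the patch length, the number of tokens drops by a factor of 16 and the attention map by a factor of 16 squared, which is the quadratic reduction mentioned above.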

In addition, PatchTST has a modular design to seamlessly support masked time series pre-training as well as direct time series forecasting.

## Installation
This demo needs Hugging Face [`transformers`](https://github.com/huggingface/transformers) for the main modeling tasks, and IBM `tsfm` for auxiliary data pre-processing.
We can install both by cloning the `tsfm` repository and following the steps below.

1. Clone the public IBM Time Series Foundation Model Repository [`tsfm`](https://github.com/ibm/tsfm).

    ```bash
    git clone git@github.com:IBM/tsfm.git
    cd tsfm
    ```
2. Install `tsfm`. This will also install Hugging Face `transformers`.
    ```bash
    pip install .
    ```
3. Test it with the following commands in a `python` terminal.
    ```python
    from transformers import PatchTSTConfig
    from tsfm_public.toolkit.dataset import ForecastDFDataset
    ```

## Part 1: Forecasting on the Electricity dataset

Here we train a PatchTST model directly on the Electricity data (available from https://github.com/zhouhaoyi/Informer2020), and evaluate its performance.

In [1]:
# Standard
import os
import random

# Third Party
from transformers import (
    EarlyStoppingCallback,
    PatchTSTConfig,
    PatchTSTForPrediction,
    Trainer,
    TrainingArguments,
)
import numpy as np
import pandas as pd
import torch

# First Party
from tsfm_public.toolkit.dataset import ForecastDFDataset
from tsfm_public.toolkit.time_series_preprocessor import TimeSeriesPreprocessor
from tsfm_public.toolkit.util import select_by_index

 ## Set seed

In [2]:
SEED = 2023
torch.manual_seed(SEED)
random.seed(SEED)
np.random.seed(SEED)

 ## Load and prepare datasets

 In the next cell, please adjust the following parameters to suit your application:
 - `PRETRAIN_AGAIN`: Set this to `True` if you want to perform pretraining again. Note that this might take some time depending on GPU availability. Otherwise, the already pretrained model will be used.
 - `dataset_path`: path to local .csv file, or web address to a csv file for the data of interest. Data is loaded with pandas, so anything supported by
   `pd.read_csv` is supported: (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html).
 - `timestamp_column`: column name containing timestamp information, use None if there is no such column
 - `id_columns`: List of column names specifying the IDs of different time series. If no ID column exists, use []
 - `forecast_columns`: List of columns to be modeled
 - `context_length`: The amount of historical data used as input to the model. Windows of the input time series data with length equal to `context_length` will be extracted from the input dataframe. In the case of a multi-time series dataset, the context windows will be created so that they are contained within a single time series (i.e., a single ID).
 - `forecast_horizon`: Number of timestamps to forecast in future.
 - `train_start_index`, `train_end_index`: the start and end indices in the loaded data which delineate the training data.
 - `valid_start_index`, `valid_end_index`: the start and end indices in the loaded data which delineate the validation data.
 - `test_start_index`, `test_end_index`: the start and end indices in the loaded data which delineate the test data.
 - `patch_length`: The patch length for the `PatchTST` model. It is recommended to choose a value that evenly divides `context_length`.
 - `num_workers`: Number of dataloader workers in the PyTorch dataloader.
 - `batch_size`: Batch size.

 The data is first loaded into a Pandas dataframe and split into training, validation, and test parts. Then the Pandas dataframes are converted
 to the appropriate torch dataset needed for training.

In [8]:
PRETRAIN_AGAIN = True
# The ECL data is available from https://github.com/zhouhaoyi/Informer2020
dataset_path = "~/data/ECL.csv"
timestamp_column = "date"
id_columns = []

context_length = 512
forecast_horizon = 96
patch_length = 16
num_workers = 16  # Reduce this if you have low number of CPU cores
batch_size = 64  # Adjust according to GPU memory

In [9]:
if PRETRAIN_AGAIN:
    data = pd.read_csv(
        dataset_path,
        parse_dates=[timestamp_column],
    )
    forecast_columns = list(data.columns[1:])

    # get split
    num_train = int(len(data) * 0.7)
    num_test = int(len(data) * 0.2)
    num_valid = len(data) - num_train - num_test
    border1s = [
        0,
        num_train - context_length,
        len(data) - num_test - context_length,
    ]
    border2s = [num_train, num_train + num_valid, len(data)]

    train_start_index = border1s[0]  # 0, i.e. the beginning of the dataset
    train_end_index = border2s[0]

    # we shift the start of the evaluation period back by context length so that
    # the first evaluation timestamp is immediately following the training data
    valid_start_index = border1s[1]
    valid_end_index = border2s[1]

    test_start_index = border1s[2]
    test_end_index = border2s[2]

    train_data = select_by_index(
        data,
        id_columns=id_columns,
        start_index=train_start_index,
        end_index=train_end_index,
    )
    valid_data = select_by_index(
        data,
        id_columns=id_columns,
        start_index=valid_start_index,
        end_index=valid_end_index,
    )
    test_data = select_by_index(
        data,
        id_columns=id_columns,
        start_index=test_start_index,
        end_index=test_end_index,
    )

    tsp = TimeSeriesPreprocessor(
        timestamp_column=timestamp_column,
        id_columns=id_columns,
        input_columns=forecast_columns,
        output_columns=forecast_columns,
        scaling=True,
    )
    tsp.train(train_data)
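
The split arithmetic in the cell above can be checked in isolation. The sketch below reproduces the same 70/10/20 border computation as a small function; the function name `split_borders` and the row count `26_304` are illustrative assumptions (the row count is not read from the file), so substitute `len(data)` for your own dataset.

```python
def split_borders(n_rows, context_length, train_frac=0.7, test_frac=0.2):
    """Return (start, end) index pairs for the train, validation and test splits.

    The validation and test starts are shifted back by context_length so that
    the first forecast in each split immediately follows the previous split.
    """
    num_train = int(n_rows * train_frac)
    num_test = int(n_rows * test_frac)
    num_valid = n_rows - num_train - num_test

    starts = [0, num_train - context_length, n_rows - num_test - context_length]
    ends = [num_train, num_train + num_valid, n_rows]
    return list(zip(starts, ends))


# Illustrative row count (an assumption), with the context length used above.
borders = split_borders(n_rows=26_304, context_length=512)
for name, (start, end) in zip(["train", "valid", "test"], borders):
    print(f"{name}: rows {start} to {end} ({end - start} rows)")
```

Note that the validation and test ranges overlap the preceding split by exactly `context_length` rows: those rows are only used as model input, never as forecast targets.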

In [10]:
if PRETRAIN_AGAIN:
    train_dataset = ForecastDFDataset(
        tsp.preprocess(train_data),
        id_columns=id_columns,
        timestamp_column="date",
        input_columns=forecast_columns,
        output_columns=forecast_columns,
        context_length=context_length,
        prediction_length=forecast_horizon,
    )
    valid_dataset = ForecastDFDataset(
        tsp.preprocess(valid_data),
        id_columns=id_columns,
        timestamp_column="date",
        input_columns=forecast_columns,
        output_columns=forecast_columns,
        context_length=context_length,
        prediction_length=forecast_horizon,
    )
    test_dataset = ForecastDFDataset(
        tsp.preprocess(test_data),
        id_columns=id_columns,
        timestamp_column="date",
        input_columns=forecast_columns,
        output_columns=forecast_columns,
        context_length=context_length,
        prediction_length=forecast_horizon,
    )

 ## Configure the PatchTST model

 The settings below control the different components in the PatchTST model.
  - `num_input_channels`: the number of input channels (or dimensions) in the time series data. This is
    automatically set to the number of forecast columns.
  - `context_length`: As described above, the amount of historical data used as input to the model.
  - `patch_length`: The length of the patches extracted from the context window (of length `context_length`).
  - `patch_stride`: The stride used when extracting patches from the context window.
  - `random_mask_ratio`: The fraction of input patches that are completely masked for the purpose of pretraining the model.
  - `d_model`: Dimension of the transformer layers.
  - `num_attention_heads`: The number of attention heads for each attention layer in the Transformer encoder.
  - `num_hidden_layers`: The number of encoder layers.
  - `ffn_dim`: Dimension of the intermediate (often referred to as feed-forward) layer in the encoder.
  - `dropout`: Dropout probability for all fully connected layers in the encoder.
  - `head_dropout`: Dropout probability used in the head of the model.
  - `pooling_type`: Pooling of the embedding. `"mean"`, `"max"` and `None` are supported.
  - `channel_attention`: Activate channel attention block in the Transformer to allow channels to attend each other.
  - `scaling`: Whether to scale the input targets via "mean" scaler, "std" scaler or no scaler if `None`. If `True`, the
    scaler is set to `"mean"`.
  - `loss`: The loss function for the model corresponding to the `distribution_output` head. For parametric
    distributions it is the negative log likelihood (`"nll"`) and for point estimates it is the mean squared
    error `"mse"`.
  - `pre_norm`: Normalization is applied before self-attention if pre_norm is set to `True`. Otherwise, normalization is
    applied after residual block.
  - `norm_type`: Normalization at each Transformer layer. Can be `"batchnorm"` or `"layernorm"`.
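
To make the `scaling` option more concrete, here is a minimal sketch of what standardizing a context window per channel and mapping a forecast back to the original units could look like. It is an illustration of the general idea only, not the library's implementation: the function names `std_scale` and `inverse_scale`, the `eps` floor, and the toy values are all assumptions.

```python
from statistics import fmean, pstdev


def std_scale(context, eps=1e-5):
    """Standardize one channel's context window.

    Returns the scaled values together with the location and scale that
    were used, so that forecasts can be mapped back afterwards.
    """
    loc = fmean(context)
    scale = max(pstdev(context), eps)  # eps guards against constant windows
    return [(x - loc) / scale for x in context], loc, scale


def inverse_scale(values, loc, scale):
    """Map values from the standardized space back to the original units."""
    return [v * scale + loc for v in values]


# Toy context window for a single channel.
context = [10.0, 12.0, 14.0, 16.0]
scaled, loc, scale = std_scale(context)
restored = inverse_scale(scaled, loc, scale)

print(loc, round(scale, 4))
```

Because each channel is scaled using statistics from its own context window, channels with very different magnitudes can share the same Transformer weights.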

For full details on the parameters, refer to the [documentation](https://huggingface.co/docs/transformers/main/en/model_doc/patchtst). We recommend that you only adjust the values in the next cell.

In [11]:
if PRETRAIN_AGAIN:
    config = PatchTSTConfig(
        num_input_channels=len(forecast_columns),
        context_length=context_length,
        patch_length=patch_length,
        patch_stride=patch_length,
        prediction_length=forecast_horizon,
        random_mask_ratio=0.4,
        d_model=128,
        num_attention_heads=16,
        num_hidden_layers=3,
        ffn_dim=256,
        dropout=0.2,
        head_dropout=0.2,
        pooling_type=None,
        channel_attention=False,
        scaling="std",
        loss="mse",
        pre_norm=True,
        norm_type="batchnorm",
    )
    model = PatchTSTForPrediction(config)

In [12]:
# Sanity check: print any keys in our config that a default config does not have
if PRETRAIN_AGAIN:
    new_conf = model.config.__class__()

    for k, v in config.__dict__.items():
        if k not in new_conf.__dict__:
            print(k)

 ## Train model

 Trains the PatchTST model based on the direct forecasting strategy.

In [14]:
if PRETRAIN_AGAIN:
    training_args = TrainingArguments(
        output_dir="./checkpoint/patchtst/electricity/pretrain/output/",
        overwrite_output_dir=True,
        # learning_rate=0.001,
        num_train_epochs=100,
        do_eval=True,
        evaluation_strategy="epoch",
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        dataloader_num_workers=1,
        report_to="tensorboard",
        save_strategy="epoch",
        logging_strategy="epoch",
        save_total_limit=3,
        logging_dir="./checkpoint/patchtst/electricity/pretrain/logs/",  # Make sure to specify a logging directory
        load_best_model_at_end=True,  # Load the best model when training ends
        metric_for_best_model="eval_loss",  # Metric to monitor for early stopping
        greater_is_better=False,  # For loss
        label_names=["future_values"],
    )

    # Create the early stopping callback
    early_stopping_callback = EarlyStoppingCallback(
        early_stopping_patience=10,  # Number of epochs with no improvement after which to stop
        early_stopping_threshold=0.0001,  # Minimum improvement required to consider as improvement
    )

    # define trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=valid_dataset,
        callbacks=[early_stopping_callback],
    )

    # pretrain
    trainer.train()

Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 0.2686 | 0.148226 |
| 2 | 0.1731 | 0.132843 |
| 3 | 0.1593 | 0.125739 |
| 4 | 0.1522 | 0.12299 |
| 5 | 0.1476 | 0.120243 |
| 6 | 0.1439 | 0.119615 |
| 7 | 0.141 | 0.117554 |
| 8 | 0.1387 | 0.11617 |
| 9 | 0.1371 | 0.115255 |
| 10 | 0.1357 | 0.115005 |
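
The interplay between `early_stopping_patience` and `early_stopping_threshold` can be simulated without running any training. The sketch below is a simplified model of patience-based early stopping on a loss that should decrease; the function name `stopping_epoch` and the loss values are hypothetical, and the rule that the running best is updated on any improvement (even one below the threshold) is an assumption about how the best metric is tracked.

```python
def stopping_epoch(val_losses, patience, threshold):
    """Return the 1-based epoch at which training would stop, or None.

    An epoch resets the patience counter only if the loss improves on the
    best value so far by more than `threshold`.
    """
    best = None
    counter = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if best is None or (loss < best and abs(loss - best) > threshold):
            counter = 0  # sufficiently large improvement
        else:
            counter += 1  # no improvement, or improvement too small
        # Assumption: the running best follows any improvement.
        if best is None or loss < best:
            best = loss
        if counter >= patience:
            return epoch
    return None


# Hypothetical validation losses: two tiny improvements, then a regression.
hypothetical_losses = [1.0, 0.9, 0.8995, 0.8993, 0.9, 0.91]
print(stopping_epoch(hypothetical_losses, patience=3, threshold=0.001))  # stops at epoch 5
```

A larger threshold makes small gains count as "no improvement", so training stops sooner once the validation loss flattens out.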


## Evaluate model on the test set of the `source` domain

While this is not the target metric to judge in this task, it provides a reasonable check that the pretrained model has trained properly.

In [15]:
if PRETRAIN_AGAIN:
    results = trainer.evaluate(test_dataset)
    print("Test result:")
    print(results)

Test result:
{'eval_loss': 0.13068610429763794, 'eval_runtime': 11.3197, 'eval_samples_per_second': 456.286, 'eval_steps_per_second': 7.156, 'epoch': 41.0}


The MSE of `0.1307` is very close to the value reported for the Electricity dataset in the original PatchTST paper.

## Save model

In [16]:
if PRETRAIN_AGAIN:
    save_dir = "patchtst/electricity/model/pretrain/"
    os.makedirs(save_dir, exist_ok=True)
    trainer.save_model(save_dir)

## Part 2: Transfer Learning from Electricity to ETTh1


In this section, we will demonstrate the transfer learning capability of the `PatchTST` model.
We use the model pretrained on the Electricity dataset to do zero-shot forecasting on the ETTh1 dataset.


In transfer learning, we pretrain the model for a forecasting task on a `source` dataset. Then, we use the
 pretrained model for zero-shot forecasting on a `target` dataset. The zero-shot forecasting
 performance denotes the `test` performance of the model in the `target` domain, without any
 training on the target domain. Subsequently, we do linear probing and then fine-tuning of
 the pretrained model on the `train` part of the target data, and validate the forecasting
 performance on the `test` part of the target data. In this example, the source dataset is the Electricity dataset and the target dataset is ETTh1.



## Transfer learning on `ETTh1` data

All evaluations are on the `test` part of the `ETTh1` data.

Step 1: Directly evaluate the Electricity-pretrained model. This is the zero-shot performance.

Step 2: Evaluate after doing linear probing.

Step 3: Evaluate after doing full fine-tuning.

 ## Load ETTh1 data

In [48]:
dataset = "ETTh1"

In [49]:
print(f"Loading target dataset: {dataset}")
dataset_path = f"https://raw.githubusercontent.com/zhouhaoyi/ETDataset/main/ETT-small/{dataset}.csv"
timestamp_column = "date"
id_columns = []
forecast_columns = ["HUFL", "HULL", "MUFL", "MULL", "LUFL", "LULL", "OT"]
train_start_index = None  # None indicates beginning of dataset
train_end_index = 12 * 30 * 24

# we shift the start of the evaluation period back by context length so that
# the first evaluation timestamp is immediately following the training data
valid_start_index = 12 * 30 * 24 - context_length
valid_end_index = 12 * 30 * 24 + 4 * 30 * 24

test_start_index = 12 * 30 * 24 + 4 * 30 * 24 - context_length
test_end_index = 12 * 30 * 24 + 8 * 30 * 24

Loading target dataset: ETTh1


In [50]:
data = pd.read_csv(
    dataset_path,
    parse_dates=[timestamp_column],
)

train_data = select_by_index(
    data,
    id_columns=id_columns,
    start_index=train_start_index,
    end_index=train_end_index,
)
valid_data = select_by_index(
    data,
    id_columns=id_columns,
    start_index=valid_start_index,
    end_index=valid_end_index,
)
test_data = select_by_index(
    data,
    id_columns=id_columns,
    start_index=test_start_index,
    end_index=test_end_index,
)

tsp = TimeSeriesPreprocessor(
    timestamp_column=timestamp_column,
    id_columns=id_columns,
    input_columns=forecast_columns,
    output_columns=forecast_columns,
    scaling=True,
)
tsp.train(train_data)

TimeSeriesPreprocessor {
  "context_length": 64,
  "feature_extractor_type": "TimeSeriesPreprocessor",
  "id_columns": [],
  "input_columns": [
    "HUFL",
    "HULL",
    "MUFL",
    "MULL",
    "LUFL",
    "LULL",
    "OT"
  ],
  "output_columns": [
    "HUFL",
    "HULL",
    "MUFL",
    "MULL",
    "LUFL",
    "LULL",
    "OT"
  ],
  "prediction_length": null,
  "processor_class": "TimeSeriesPreprocessor",
  "scale_outputs": false,
  "scaler_dict": {
    "0": {
      "copy": true,
      "feature_names_in_": [
        "HUFL",
        "HULL",
        "MUFL",
        "MULL",
        "LUFL",
        "LULL",
        "OT"
      ],
      "mean_": [
        7.937742245659508,
        2.0210386567335163,
        5.079770601157927,
        0.7461858799957015,
        2.781762386375555,
        0.7884531235540096,
        17.1282616982271
      ],
      "n_features_in_": 7,
      "n_samples_seen_": 8640,
      "scale_": [
        5.812749409143771,
        2.0901046504076,
        5.518793579

In [51]:
train_dataset = ForecastDFDataset(
    tsp.preprocess(train_data),
    id_columns=id_columns,
    input_columns=forecast_columns,
    output_columns=forecast_columns,
    context_length=context_length,
    prediction_length=forecast_horizon,
)
valid_dataset = ForecastDFDataset(
    tsp.preprocess(valid_data),
    id_columns=id_columns,
    input_columns=forecast_columns,
    output_columns=forecast_columns,
    context_length=context_length,
    prediction_length=forecast_horizon,
)
test_dataset = ForecastDFDataset(
    tsp.preprocess(test_data),
    id_columns=id_columns,
    input_columns=forecast_columns,
    output_columns=forecast_columns,
    context_length=context_length,
    prediction_length=forecast_horizon,
)
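
To see how many training examples each split produces, the sketch below counts sliding windows for the ETTh1 index ranges defined earlier. It assumes that the dataset class slides a window of `context_length + forecast_horizon` rows forward one step at a time within each split; this is an assumption about its behavior, and the helper name `count_windows` is hypothetical.

```python
def count_windows(n_rows, context_length, forecast_horizon):
    """Number of (context, forecast) windows in a split of n_rows rows,
    assuming the window is moved forward one row at a time."""
    return max(n_rows - context_length - forecast_horizon + 1, 0)


context_length = 512
forecast_horizon = 96

# Same index arithmetic as in the ETTh1 loading cell (hourly data).
train_end = 12 * 30 * 24
valid_end = train_end + 4 * 30 * 24
test_end = train_end + 8 * 30 * 24

segments = {
    "train": (0, train_end),
    "valid": (train_end - context_length, valid_end),
    "test": (valid_end - context_length, test_end),
}

windows = {
    name: count_windows(end - start, context_length, forecast_horizon)
    for name, (start, end) in segments.items()
}
print(windows)
```

Under this assumption, a split must contain at least `context_length + forecast_horizon` rows to yield a single example, which is why the evaluation splits start `context_length` rows early.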

## Zero-shot forecasting on ETTh1

In [52]:
print("Loading pretrained model")
finetune_forecast_model = PatchTSTForPrediction.from_pretrained(
    "patchtst/electricity/model/pretrain/", num_input_channels=len(forecast_columns), head_dropout=0.7
)
print("Done")

Loading pretrained model
Done


In [53]:
finetune_forecast_args = TrainingArguments(
    output_dir="./checkpoint/patchtst/transfer/finetune/output/",
    overwrite_output_dir=True,
    learning_rate=0.0001,
    num_train_epochs=100,
    do_eval=True,
    evaluation_strategy="epoch",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    dataloader_num_workers=num_workers,
    report_to="tensorboard",
    save_strategy="epoch",
    logging_strategy="epoch",
    save_total_limit=3,
    logging_dir="./checkpoint/patchtst/transfer/finetune/logs/",  # Make sure to specify a logging directory
    load_best_model_at_end=True,  # Load the best model when training ends
    metric_for_best_model="eval_loss",  # Metric to monitor for early stopping
    greater_is_better=False,  # For loss
    label_names=["future_values"],
)

# Create a new early stopping callback with faster convergence properties
early_stopping_callback = EarlyStoppingCallback(
    early_stopping_patience=10,  # Number of epochs with no improvement after which to stop
    early_stopping_threshold=0.001,  # Minimum improvement required to consider as improvement
)

finetune_forecast_trainer = Trainer(
    model=finetune_forecast_model,
    args=finetune_forecast_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    callbacks=[early_stopping_callback],
)

print("\n\nDoing zero-shot forecasting on target data")
result = finetune_forecast_trainer.evaluate(test_dataset)
print("Target data zero-shot forecasting result:")
print(result)





Doing zero-shot forecasting on target data


Target data zero-shot forecasting result:
{'eval_loss': 0.3702967166900635, 'eval_runtime': 0.6424, 'eval_samples_per_second': 4335.15, 'eval_steps_per_second': 68.491}


With direct zero-shot forecasting, we get an MSE of 0.370, which is close to the state-of-the-art result reported in the original PatchTST paper.

 ## Target data linear probing

We can do a quick linear probing on the `train` part of the target data to see any possible `test` performance improvement.
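
To get a feel for how small the probed part of the model is, here is a rough parameter count for the forecasting head under the configuration used in Part 1. This is back-of-the-envelope arithmetic under an assumption that is not confirmed by the text above: that with `pooling_type=None` the head flattens all patch embeddings of a channel and applies a single linear layer that is shared across channels. The helper name `linear_params` is hypothetical.

```python
def linear_params(in_features, out_features):
    """Parameter count of a dense layer: weight matrix plus bias vector."""
    return in_features * out_features + out_features


# Values taken from the Part 1 configuration.
context_length, patch_length, patch_stride = 512, 16, 16
d_model = 128
forecast_horizon = 96

num_patches = (context_length - patch_length) // patch_stride + 1

# Assumption: the head sees the flattened (num_patches x d_model) embedding
# of one channel and maps it directly to the forecast horizon.
head_in_features = num_patches * d_model
head_params = linear_params(head_in_features, forecast_horizon)

print(num_patches, head_in_features, head_params)
```

If that assumption holds, linear probing only updates a few hundred thousand weights while the frozen backbone is left untouched, which helps explain why it runs quickly on a small target dataset.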


In [54]:
# Freeze the backbone of the model
for param in finetune_forecast_trainer.model.model.parameters():
    param.requires_grad = False

print("\n\nLinear probing on the target data")
finetune_forecast_trainer.train()
print("Evaluating")
result = finetune_forecast_trainer.evaluate(test_dataset)
print("Target data head/linear probing result:")
print(result)



Linear probing on the target data


| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 0.3719 | 0.67616 |
| 2 | 0.3589 | 0.665527 |
| 3 | 0.3523 | 0.660444 |
| 4 | 0.3478 | 0.657923 |
| 5 | 0.3441 | 0.657768 |
| 6 | 0.3404 | 0.657644 |
| 7 | 0.3384 | 0.657339 |
| 8 | 0.3364 | 0.657935 |
| 9 | 0.335 | 0.658182 |
| 10 | 0.334 | 0.658212 |


Checkpoint destination directory ./checkpoint/patchtst/transfer/finetune/output/checkpoint-126 already exists and is non-empty.Saving will proceed but saved results may be invalid.


Evaluating


Target data head/linear probing result:
{'eval_loss': 0.35606929659843445, 'eval_runtime': 0.6589, 'eval_samples_per_second': 4226.514, 'eval_steps_per_second': 66.774, 'epoch': 14.0}


With simple linear probing, the MSE decreased from 0.370 to 0.356, beating the originally reported results!

In [55]:
save_dir = f"patchtst/electricity/model/transfer/{dataset}/model/linear_probe/"
os.makedirs(save_dir, exist_ok=True)
finetune_forecast_trainer.save_model(save_dir)

save_dir = f"patchtst/electricity/model/transfer/{dataset}/preprocessor/"
os.makedirs(save_dir, exist_ok=True)
tsp.save_pretrained(save_dir)

['patchtst/electricity/model/transfer/ETTh1/preprocessor/preprocessor_config.json']

Next, let's check if we can get additional improvements by doing a full fine-tune.

 ## Target data `ETTh1` full fine-tune

We can do a full model fine-tune (instead of probing the last linear layer as shown above) on the `train` part of the target data to see a possible `test` performance improvement.

In [56]:
# Reload the model
finetune_forecast_model = PatchTSTForPrediction.from_pretrained(
    "patchtst/electricity/model/pretrain/", num_input_channels=len(forecast_columns), dropout=0.7, head_dropout=0.7
)
finetune_forecast_trainer = Trainer(
    model=finetune_forecast_model,
    args=finetune_forecast_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    callbacks=[early_stopping_callback],
)
print("\n\nFinetuning on the target data")
finetune_forecast_trainer.train()
print("Evaluating")
result = finetune_forecast_trainer.evaluate(test_dataset)
print("Target data full finetune result:")
print(result)





Finetuning on the target data


| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 0.3317 | 0.698792 |
| 2 | 0.3054 | 0.748235 |
| 3 | 0.2893 | 0.784631 |
| 4 | 0.2767 | 0.831685 |
| 5 | 0.2659 | 0.852125 |
| 6 | 0.2572 | 0.900339 |
| 7 | 0.2469 | 0.936061 |
| 8 | 0.2388 | 0.909557 |
| 9 | 0.2308 | 0.923318 |
| 10 | 0.2224 | 0.970081 |


Evaluating


Target data full finetune result:
{'eval_loss': 0.35959485173225403, 'eval_runtime': 0.6277, 'eval_samples_per_second': 4437.137, 'eval_steps_per_second': 70.102, 'epoch': 11.0}


In this case, full fine-tuning does not bring much improvement on the ETTh1 dataset. For other datasets, there may be substantial improvements. Let's save the model anyway.

In [57]:
save_dir = f"patchtst/electricity/model/transfer/{dataset}/model/fine_tuning/"
os.makedirs(save_dir, exist_ok=True)
finetune_forecast_trainer.save_model(save_dir)

## Summary

In this blog, we presented a step-by-step guide on training PatchTST for tasks related to forecasting and transfer learning. We intend to facilitate the seamless integration of the PatchTST HF model for your forecasting use cases. We trust that this content serves as a useful resource to expedite your adoption of PatchTST. Thank you for tuning in to our blog, and we hope you find this information beneficial for your projects.