<a href="https://colab.research.google.com/github/khauzenberger/pytorch-projects/blob/main/1_long_term_forecasting_with_transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# The project

In a first step, I replicate the paper "A Time Series is Worth 64 Words: Long-term Forecasting with Transformers" by Yuqi Nie, Nam H. Nguyen, Phanwadee Sinthong and Jayant Kalagnanam (see https://arxiv.org/abs/2211.14730).

In a second step, I apply the key designs of the paper to forecast macroeconomic time series.

## Key designs, model overview, and key concept

**Patching**: Each time series is segmented into subseries-level patches, which serve as the input tokens to the Transformer.
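
To make the idea concrete, here is a minimal sketch of patching with `Tensor.unfold`. The tensor sizes, `patch_len` and `stride` are illustrative assumptions, not the paper's settings for any dataset, and the sketch skips the padding that the paper applies to the end of the series before patching.

```python
import torch

# Hypothetical sizes, chosen only for illustration
batch_size, num_channels, seq_len = 16, 7, 104
patch_len, stride = 16, 8

# Dummy batch shaped (batch, channels, time steps)
x = torch.randn(batch_size, num_channels, seq_len)

# Slide a window of length patch_len along the time axis in steps of stride
patches = x.unfold(dimension=-1, size=patch_len, step=stride)

# Without padding, the number of patches is (seq_len - patch_len) // stride + 1
num_patches = (seq_len - patch_len) // stride + 1

# Resulting layout: (batch, channels, num_patches, patch_len)
print(patches.shape, num_patches)
```

Each patch then plays the role that a single word token plays in a language model, which shortens the sequence the Transformer has to attend over.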

**Channel-independence**: Each channel contains a single univariate time series, and all series share the same embedding and Transformer weights.

The figure below sketches the **PatchTST model** and its Transformer backbones for supervised and self-supervised learning.
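
One way to picture channel-independence is to fold the channel axis into the batch axis, so that a single embedding layer sees every channel as its own univariate sample. The sketch below assumes already-patched input and a made-up embedding size `d_model`; it illustrates the weight sharing only and is not the official implementation.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for illustration
batch_size, num_channels, num_patches, patch_len = 16, 7, 12, 16
d_model = 128  # assumed embedding size

# Dummy patched batch shaped (batch, channels, num_patches, patch_len)
patches = torch.randn(batch_size, num_channels, num_patches, patch_len)

# Fold channels into the batch axis: each channel becomes a univariate sample
univariate = patches.reshape(batch_size * num_channels, num_patches, patch_len)

# A single linear layer embeds the patches of every channel (shared weights)
embed = nn.Linear(patch_len, d_model)
tokens = embed(univariate)

# Resulting layout: (batch * channels, num_patches, d_model)
print(tokens.shape)
```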

![picture](https://raw.githubusercontent.com/yuqinie98/PatchTST/main/pic/model.png)

**Transformer-based models**: A Transformer is a deep learning model built on the self-attention mechanism, which weights each part of the input data according to its significance.

Transformers are designed for sequential input data, but they process the whole sequence at once rather than step by step. The attention mechanism provides context for any position in the input sequence.

Because transformers process the entire input all at once, they allow for more parallelization than recurrent neural networks (RNNs) and therefore reduce training times.
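
As a rough illustration of the attention mechanism described above, the following sketch computes single-head scaled dot-product self-attention by hand. The random projection matrices stand in for learned weights, and all sizes are assumptions made for the example.

```python
import math
import torch

torch.manual_seed(0)

# Hypothetical sizes: 12 tokens (e.g. patches), each of dimension 8
seq_len, d_model = 12, 8
x = torch.randn(seq_len, d_model)

# Random projections standing in for learned query/key/value weights
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
q, k, v = x @ w_q, x @ w_k, x @ w_v

# Similarity of every token with every other token, scaled by sqrt(d_model)
scores = q @ k.T / math.sqrt(d_model)

# Each row becomes a probability distribution over all positions
weights = torch.softmax(scores, dim=-1)

# Every output token is a weighted mix of all value vectors
out = weights @ v
print(weights.shape, out.shape)
```

All rows of `weights` are computed in one matrix product, which is where the parallelism over sequence positions comes from.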

# Replication

In the following I provide step-by-step instructions to get from our inputs to the desired outputs.

While the paper comes with an official implementation (https://github.com/yuqinie98/PatchTST), there's no learning in just copying and pasting it. That's why I will start more or less from scratch, going through the steps in the section on paper replication in Daniel Bourke's ZTM course "PyTorch for deep learning" (https://github.com/khauzenberger/pytorch-deep-learning).

## Get data from the paper

While the paper experiments with eight different datasets, I will focus only on the smallest one: the influenza-like illnesses (ILI) dataset. 

In [1]:
import requests
from pathlib import Path

# Setup path to data folder
data_path = Path("data/")

# If the data folder doesn't exist, create it and download the data...
if data_path.is_dir():
    print(f"{data_path} directory exists.")
else:
    print(f"Did not find {data_path} directory, creating one...")
    data_path.mkdir(parents=True, exist_ok=True)
    
    # Download influenza-like illness data from my Github repo
    print("Downloading influenza-like illness data from my Github repo...")
    request = requests.get("https://github.com/khauzenberger/pytorch-projects/raw/main/data/national_illness.csv")
    request.raise_for_status()  # stop here instead of saving an error page as CSV
    with open(data_path / "national_illness.csv", "wb") as f:
        f.write(request.content)

Did not find data directory, creating one...
Downloading influenza-like illness data from my Github repo...


## Create Datasets and DataLoaders

In [2]:
import os
import pandas as pd
from sklearn.preprocessing import StandardScaler
from torch.utils.data import DataLoader

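def create_dataloaders(data_path:str, file_name:str, seq_len:int, pred_len:int, 
                       batch_size:int=1, scale:bool=False, features:str="S",
                       target:str="OT"):
  """
  Read the CSV, split it chronologically and wrap each split in a DataLoader.
  Note: pred_len is not used yet, and seq_len only shifts the start of the
  test and validation splits back by one look-back window.
  """
  scaler = StandardScaler()
  df_raw = pd.read_csv(os.path.join(data_path,file_name))
  
  # Reorder the columns to: date, remaining features, target
  cols = list(df_raw.columns)
  cols.remove(target)
  cols.remove("date")
  df_raw = df_raw[["date"] + cols + [target]]

  # Chronological split: 70% train, 20% test, remainder for validation
  num_train = int(len(df_raw) * 0.7)
  num_test = int(len(df_raw) * 0.2)
  num_vali = len(df_raw) - num_train - num_test

  # Start and end row of each split, in the order train, test, validation
  border1s = [0, num_train - seq_len, len(df_raw) - num_vali - seq_len]
  border2s = [num_train, num_train + num_test, len(df_raw)]

  if features == "M" or features == "MS":
    # Use every column except the date
    cols_data = df_raw.columns[1:]
    df_data = df_raw[cols_data]
  elif features == "S":
    # Use the target column only
    df_data = df_raw[[target]]
  else:
    raise ValueError(f"features must be 'M', 'MS' or 'S', got {features!r}")

  if scale:
    # Fit the scaler on the training rows only, then transform all rows
    train_data = df_data[border1s[0]:border2s[0]]
    scaler.fit(train_data.values)
    data = scaler.transform(df_data.values)
  else:
    data = df_data.values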

  train_data = data[border1s[0]:border2s[0]]
  test_data = data[border1s[1]:border2s[1]]
  vali_data = data[border1s[2]:border2s[2]]

  train_dataloader = DataLoader(
      train_data,
      batch_size=batch_size,
      shuffle=True)
  test_dataloader = DataLoader(
      test_data,
      batch_size=batch_size,
      shuffle=False)
  vali_dataloader = DataLoader(
      vali_data,
      batch_size=batch_size,
      shuffle=False)

  return train_dataloader, test_dataloader, vali_dataloader
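
As written, each `DataLoader` above wraps a raw 2-D array, so one item is a single time step and `pred_len` never comes into play. A possible next step is a `Dataset` that yields (look-back window, forecast target) pairs. The class below is a hypothetical helper sketched on synthetic data; its name and layout are my own assumptions and it is not taken from the official implementation.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class SlidingWindowDataset(Dataset):
  """Hypothetical helper: yields (look-back window, forecast target) pairs."""

  def __init__(self, data: np.ndarray, seq_len: int, pred_len: int):
    # data is expected as (time steps, channels)
    self.data = torch.as_tensor(data, dtype=torch.float32)
    self.seq_len = seq_len
    self.pred_len = pred_len

  def __len__(self) -> int:
    # Number of positions where a full input and target window both fit
    return max(0, len(self.data) - self.seq_len - self.pred_len + 1)

  def __getitem__(self, idx: int):
    split = idx + self.seq_len
    x = self.data[idx:split]                    # (seq_len, channels)
    y = self.data[split:split + self.pred_len]  # (pred_len, channels)
    return x, y


# Synthetic stand-in for one split of the real data: 300 time steps, 7 channels
data = np.random.randn(300, 7)
dataset = SlidingWindowDataset(data, seq_len=104, pred_len=24)
loader = DataLoader(dataset, batch_size=16, shuffle=False)

x_batch, y_batch = next(iter(loader))
print(len(dataset), x_batch.shape, y_batch.shape)
```

With such a dataset, extending the test and validation borders back by `seq_len` rows makes sense: the first window of each split then has a full look-back without predicting any training rows.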

## Instance normalization

This technique helps mitigate the distribution shift between the training and testing data.
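
Before the full module, here is a stripped-down sketch of the idea without learnable parameters: every series in the batch is normalized with its own mean and standard deviation over time, and the same statistics are used later to map values back to the original scale. The input layout (batch, time steps, channels) and all sizes are assumptions for the example.

```python
import torch

eps = 1e-5

# Dummy batch shaped (batch, time steps, channels), with a non-zero level and scale
x = torch.randn(16, 104, 7) * 3.0 + 10.0

# Statistics per instance and per channel, computed over the time axis
mean = x.mean(dim=1, keepdim=True)
stdev = torch.sqrt(x.var(dim=1, keepdim=True, unbiased=False) + eps)

# Normalize before the model ...
x_norm = (x - mean) / stdev

# ... and apply the inverse transform to return to the original scale
x_restored = x_norm * stdev + mean

print(x_norm.mean(dim=1).abs().max(), (x_restored - x).abs().max())
```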

In [7]:
import torch
import torch.nn as nn

class InstanceNorm(nn.Module):
  def __init__(self, num_features:int, eps=1e-5,  affine=True, 
               subtract_last=False):    
    """
    :param num_features: the number of features or channels
    :param eps: a value added for numerical stability
    :param affine: if True, InstanceNorm has learnable affine parameters
    :param subtract_last: if True, subtract the last time step instead of the mean
    """    
    super(InstanceNorm, self).__init__()
    self.num_features = num_features
    self.eps = eps
    self.affine = affine
    self.subtract_last = subtract_last
    if self.affine:
      self._init_params()

  def forward(self, x, mode:str):
    if mode == "norm":
      self._get_statistics(x)
      x = self._normalize(x)
    elif mode == "denorm":
      x = self._denormalize(x)
    else: raise NotImplementedError
    return x

  def _init_params(self):
    # initialize InstanceNorm params: (C,)
    self.affine_weight = nn.Parameter(torch.ones(self.num_features))
    self.affine_bias = nn.Parameter(torch.zeros(self.num_features))

  def _get_statistics(self, x):
    # Reduce over the time dimension; x is expected as (batch, seq_len, channels)
    dim2reduce = tuple(range(1, x.ndim-1))
    if self.subtract_last:
      self.last = x[:,-1,:].unsqueeze(1)
    else:
      self.mean = torch.mean(x, dim=dim2reduce, keepdim=True).detach()
    # The standard deviation is computed in both modes
    self.stdev = torch.sqrt(torch.var(x, dim=dim2reduce, keepdim=True, unbiased=False) + self.eps).detach()

  def _normalize(self, x):
    if self.subtract_last:
      x = x - self.last
    else:
      x = x - self.mean
    x = x / self.stdev
    if self.affine:
      x = x * self.affine_weight
      x = x + self.affine_bias
    return x

  def _denormalize(self, x):
    # Undo the steps of _normalize in reverse order
    if self.affine:
      x = x - self.affine_bias
      x = x / (self.affine_weight + self.eps*self.eps)
    x = x * self.stdev
    if self.subtract_last:
      x = x + self.last
    else:
      x = x + self.mean
    return x

In [9]:
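# InstanceNorm expects input shaped (batch, seq_len, channels), so the channel
# axis must come last to match num_features. Passing torch.randn(16,7,96)
# raises a RuntimeError because the affine parameters cannot be broadcast.
random_data = torch.randn(16,96,7)
instnorm = InstanceNorm(num_features=7)
instnorm(random_data,"norm")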

In [5]:
# Setup path to data folder and file name
data_path = Path("data/")
file_name = "national_illness.csv"

# Sequence length (look-back window)
seq_len = 104

# Prediction length (forecast horizon)
pred_len = 24

# Batch size
batch_size = 16

# Features of model: M  -> multivariate predict multivariate
#                    S  -> univariate predict univariate
#                    MS -> multivariate predict univariate
features = "M"

train_dataloader, test_dataloader, vali_dataloader = create_dataloaders(
    data_path=data_path,
    file_name=file_name,
    seq_len=seq_len,
    batch_size=batch_size,
    pred_len=pred_len,
    features=features)

Batch size 16


In [None]:
test_dataloader

<torch.utils.data.dataloader.DataLoader at 0x7f7e42603220>
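
Displaying the loader only prints the object's repr. To see what it actually yields, pull one batch with `next(iter(...))`. The sketch below uses a synthetic array in place of the ILI data, so the sizes are assumptions; the point is the shape of a batch when a `DataLoader` wraps a raw 2-D array.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader

# Synthetic stand-in shaped (time steps, channels)
data = np.random.randn(200, 7)
loader = DataLoader(data, batch_size=16, shuffle=False)

# Each dataset item is one row, so a batch stacks 16 individual time steps
batch = next(iter(loader))
print(type(batch), batch.shape, batch.dtype)
```

A batch here has shape (batch_size, channels) and contains no look-back window, which is worth keeping in mind before feeding it to a model that expects (batch, seq_len, channels).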