# Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting
- AI Hub Notebook: [AI Hub - Interpretable Multi-Horizon Time Series Forecasting with TFT](https://aihub.cloud.google.com/p/products%2F9f39ad8d-ad81-4fd9-8238-5186d36db2ec)
- Arxiv Paper: [Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting](https://arxiv.org/pdf/1912.09363.pdf)
- UCI ML Repository Data: [PEMS-SF Dataset](https://archive.ics.uci.edu/ml/datasets/PEMS-SF)

**FAQs**  
**What is multi-horizon forecasting?**  
Multi-horizon forecasting (MHF) is the prediction of variables of interest at multiple future time steps. It often involves a `complex mix of inputs` - including `static covariates`, known `future inputs` and other `exogenous time series` that are only observed in the past - without any prior information on how they interact with the target.

**What is new in this paper?**  
Many DL architectures have been published for multi-horizon forecasting problems, but most of them are `black box` models. This paper brings insights into how a model uses the full range of inputs present in practical scenarios.

**What is a `Temporal Fusion Transformer(TFT)`?**  
A TFT is a novel attention-based architecture that combines high-performance multi-horizon forecasting with interpretable insights into temporal dynamics.

**What powers the TFT architecture?**  
To learn temporal relationships at different scales, TFT uses `Recurrent Neural Network (RNN)` layers for local processing and interpretable self-attention layers for long-term dependencies.
It utilizes specialized components to select relevant features and a series of gating layers to suppress unnecessary components, enabling high performance in a wide range of scenarios.

**What are the key DL concepts used?**
- Recurrent Neural Networks (RNN)
- Long Short Term Memory Networks (LSTM)
- Attention Based Models
- Transformer Based Models

**What challenges does this paper specifically address that were not considered before?**
- Not assuming that all exogenous inputs are known into the future
- Not neglecting static covariates
- Designing networks with suitable inductive biases 
- Considering heterogeneity of forecasting inputs

**Does this paper come with source code?**  
Yes, the authors implemented it on 3 real-world datasets and demonstrated the significance of this architecture.

## Introduction
MHF is the prediction of variables of interest at multiple future time steps, and it is a very important problem in time series ML. Most time series problems are framed as predicting one time step ahead. For example, 
- Given a series of observations of events collected over a period of time, `what could be the next event?`
- Given a list of daily closing prices for a given stock, `what is the price at the end of next day?`
- From the observations of the monthly purchasing power of a household, `what could be the purchase amount for the next month?`
- Given the units of electricity consumed on an hourly basis, `what is the power consumption by 10AM?`

In the case of MHF, the questions highlighted above extend to multiple time steps:
- what could be the next event? $\Longrightarrow$ `what could be the next 4 subsequent events?`
- what is the price at the end of next day? $\Longrightarrow$ `what is the end of day price for tomorrow, the day after tomorrow and the day after that?`
- what could be the purchase amount for the next month? $\Longrightarrow$ `what could be the purchase amount for the next 6 months?`
- what is the power consumption by 10AM? $\Longrightarrow$ `what is the power consumption by 10AM, 11AM, Noon and 1PM?`
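
The difference between the two framings can be sketched as a change in the shape of the training targets. The snippet below is a minimal illustration, not code from the paper: `make_windows`, `lookback`, `horizon` and the toy `hourly_load` series are all hypothetical names chosen for this example.

```python
import numpy as np

def make_windows(series, lookback, horizon):
    """Slice a 1-D series into (past window, future targets) pairs."""
    series = np.asarray(series, dtype=float)
    n_samples = len(series) - lookback - horizon + 1
    if n_samples <= 0:
        raise ValueError("series is too short for this lookback and horizon")
    past = np.stack([series[i : i + lookback] for i in range(n_samples)])
    future = np.stack(
        [series[i + lookback : i + lookback + horizon] for i in range(n_samples)]
    )
    return past, future

# Toy stand-in for hourly electricity readings (assumed data, 10 points).
hourly_load = np.arange(10.0)

# One step ahead: every sample has a single target value.
x_one, y_one = make_windows(hourly_load, lookback=3, horizon=1)

# Multi-horizon: every sample has four consecutive target values.
x_multi, y_multi = make_windows(hourly_load, lookback=3, horizon=4)

print(x_one.shape, y_one.shape)      # (7, 3) (7, 1)
print(x_multi.shape, y_multi.shape)  # (4, 3) (4, 4)
```

With `horizon=4` each row of `y_multi` plays the role of "10AM, 11AM, Noon and 1PM" in the last example above.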

#### Data Sources of MHF
MHF models have access to a variety of data sources. The heterogeneity of these sources, combined with little information about their interactions, makes MHF a challenging problem. The sources include
- Information about the future (e.g. upcoming holidays)
- Exogenous time series (e.g. historical customer footprint)
- Static metadata (e.g. geo-location of the store)

#### Usual Challenges 

- A common problem with autoregressive models is that they assume all exogenous inputs are known into the future
- Static covariates are often neglected and simply concatenated with time-dependent features at each step
- Most architectures are `black box`, where forecasts are controlled by complex nonlinear interactions between many parameters
- Trustworthiness of the model is questioned due to the opaque nature

Further, commonly used explainability methods (LIME and SHAP) for DNNs are not well suited to temporal data.
- LIME and SHAP do not consider the time ordering of input features
- In LIME, surrogate models are constructed independently for each data point
- In SHAP, features are considered independently for neighboring time steps

Such post-hoc approaches might lead to poor explanation quality, as dependencies between time steps are typically significant in time series.

- Attention-based architectures have inherent interpretability for sequential data (e.g. language or speech).
- Language and speech datasets are usually univariate, whereas temporal datasets are multivariate. Applying an attention-based model to such datasets is novel, but the heterogeneity of the data is still a challenge

### Temporal Fusion Transformers(TFT)
The TFT proposed in this paper is an attention-based DNN architecture for MHF that achieves high performance while enabling interpretability.
The novel ideas it incorporates, which account for the full range of potential inputs and temporal relationships, are
1. Static covariate encoders, which encode context vectors for use in other parts of the network
2. Gating mechanisms throughout and sample-dependent variable selection to minimize the contributions of irrelevant inputs
3. A sequence-to-sequence layer to locally process known and observed inputs
4. A temporal self-attention decoder to learn any long-term dependencies present in the dataset - This facilitates interpretability by identifying,  
    a. Globally important variables for the prediction problem  
    b. Persistent temporal patterns  
    c. Significant events  
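
To make the second idea in the list above more concrete, here is a deliberately simplified NumPy sketch of a gate and of sample-dependent variable selection. It is an assumption-laden illustration rather than the paper's layers: the weights are random and untrained, and the names `gated_linear_unit`, `select_variables`, `w_gate`, `w_value` and `w_select` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def gated_linear_unit(x, w_gate, w_value):
    """A gate in (0, 1) scales a linear transform; a gate near 0 suppresses it."""
    return sigmoid(x @ w_gate) * (x @ w_value)

def select_variables(features, w_select):
    """features: (batch, n_vars, d). Returns the weighted sum and the weights.

    The weights are computed from the sample itself, so each sample can
    emphasize different variables (sample-dependent selection).
    """
    batch, n_vars, d = features.shape
    flat = features.reshape(batch, n_vars * d)
    weights = softmax(flat @ w_select, axis=-1)              # (batch, n_vars)
    combined = (weights[:, :, None] * features).sum(axis=1)  # (batch, d)
    return combined, weights

# Assumed toy sizes: 2 samples, 3 input variables, 4-dimensional embeddings.
batch, n_vars, d = 2, 3, 4
features = rng.normal(size=(batch, n_vars, d))
w_select = rng.normal(size=(n_vars * d, n_vars))
w_gate = rng.normal(size=(d, d))
w_value = rng.normal(size=(d, d))

combined, weights = select_variables(features, w_select)
gated = gated_linear_unit(combined, w_gate, w_value)

print(weights.sum(axis=1))  # each row of selection weights sums to 1
print(gated.shape)          # (2, 4)
```

Because the selection weights are non-negative and sum to one per sample, they can be read directly as a per-sample measure of how much each variable contributes.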
    
While conventional methods assume that the target alone is fed into the prediction recursion loop and ignore numerous useful time-varying inputs from the 2nd time step onwards, TFT explicitly accounts for the diversity of inputs. This is done by naturally handling static covariates and time-varying inputs.

#### Time Series Interpretability with Attention
Attention mechanisms are used in the following areas to identify the salient portions of the input for each instance, using the magnitude of attention weights:
- Translation[17]
- Image Classification[22]
- Tabular Learning[23]  

With interpretability motivations, attention has also been adapted for time series[7, 12, 24] using LSTM[25] and Transformer-based architectures. However, this was done without considering the importance of static covariates.

TFT alleviates the static covariate problem by using separate encoder-decoder attention at each step, on top of the self-attention that determines the contribution of temporal inputs.
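
The idea of reading attention weights as a salience signal can be illustrated with plain scaled dot-product attention. This is a generic sketch, not TFT's interpretable multi-head attention: the one-hot keys, the query and the names used here are assumptions made so the result is easy to inspect.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(queries, keys, values):
    """Return the attended context and the attention weights."""
    d_k = keys.shape[-1]
    scores = queries @ keys.T / np.sqrt(d_k)  # (n_queries, n_keys)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ values, weights

rng = np.random.default_rng(0)
n_past, d = 5, 8

# Assumed toy setup: one key per past time step, mutually orthogonal.
keys = np.eye(n_past, d)
values = rng.normal(size=(n_past, d))

# A query aligned with past step 3, so that step should dominate.
query = 4.0 * keys[3:4]

context, weights = scaled_dot_product_attention(query, keys, values)

print(weights.round(3))       # one weight per past time step
print(int(weights.argmax()))  # 3: the most salient past step
```

The weight vector has one entry per past time step, which is what makes it usable as a per-instance importance profile over time.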

Post-hoc interpretability methods are applied to pre-trained black-box models and are often based on distilling into a surrogate interpretable model or decomposing into feature attributions. They are not designed to take into account the time ordering of inputs, limiting their use for complex time series data.

**Feature Selection Methods**
- Inherently interpretable modeling approaches build components for feature selection directly into the architecture
- For time series forecasting, they are based on explicitly quantifying time-dependent variable contributions
- Interpretable Multi-Variable LSTMs[27] partition the hidden state such that each variable contributes uniquely to its own memory segment, and weight memory segments to determine variable contributions
- Schemes that combine temporal importance and variable selection compute a single contribution coefficient based on attention weights

TFT is designed to analyze global temporal relationships in the input data and allows users to interpret the global behavior of the model on the whole dataset, specifically in identifying any persistent patterns (e.g. seasonality or lag effects) and regimes present.

### Related Work
- Traditional Multi Horizon Forecasting Methods[18, 19]
- Iterated approaches using autoregressive models[9, 6, 12]
    - They are one-step-ahead prediction models, with multi-step predictions obtained by recursively feeding predictions into future inputs
    - LSTM-based approaches include DeepAR[9], which uses stacked LSTM layers to generate parameters of one-step-ahead Gaussian predictive distributions, and Deep State-Space Models[6], which use LSTMs to generate parameters of a predefined linear state-space model with predictive distributions produced via Kalman filtering
    - A Transformer-based architecture[12] further uses convolutional layers for local processing and a sparse attention mechanism to increase the receptive field during forecasting
    - These methods assume that all variables other than the target are known at forecast time, so that only the target is fed recursively into future inputs
- Direct methods based on sequence-to-sequence models[10, 11]
    - Direct methods explicitly generate forecasts for multiple predefined horizons at each time step, relying on a seq2seq architecture
    - They use LSTM encoders to summarize past inputs and a variety of methods to generate future predictions
    - MQRNN[10] uses LSTM or convolutional encoders to generate context vectors to feed into an MLP for each horizon
    - A multi-modal attention mechanism is used with LSTM encoders to construct context vectors for a bi-directional LSTM decoder
    - Yet interpretability remains challenging
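
The contrast between iterated and direct methods listed above can be sketched with ordinary least squares standing in for the neural networks. Everything here is an assumption made for illustration (the linear models, the noiseless sine series, and the helper names `make_windows`, `fit_linear`, `predict_linear` and `iterated_forecast`); it shows only how the forecasts are produced, not how any cited model works internally.

```python
import numpy as np

def make_windows(series, lookback, horizon):
    """Slice a 1-D series into (past window, future targets) pairs."""
    n = len(series) - lookback - horizon + 1
    past = np.stack([series[i : i + lookback] for i in range(n)])
    future = np.stack(
        [series[i + lookback : i + lookback + horizon] for i in range(n)]
    )
    return past, future

def fit_linear(x, y):
    """Least-squares fit of y ~ [x, 1] @ coef."""
    design = np.hstack([x, np.ones((len(x), 1))])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def predict_linear(coef, x):
    design = np.hstack([x, np.ones((len(x), 1))])
    return design @ coef

def iterated_forecast(coef_one_step, window, horizon):
    """Roll a one-step model forward, feeding each prediction back as input."""
    window = np.array(window, dtype=float)
    preds = []
    for _ in range(horizon):
        next_val = predict_linear(coef_one_step, window[None, :])[0, 0]
        preds.append(next_val)
        window = np.append(window[1:], next_val)  # drop oldest, add prediction
    return np.array(preds)

# Assumed toy data: a noiseless sine wave with a period of 24 steps.
t = np.arange(200)
series = np.sin(2 * np.pi * t / 24)
lookback, horizon = 24, 4

x1, y1 = make_windows(series, lookback, 1)
xh, yh = make_windows(series, lookback, horizon)
one_step_model = fit_linear(x1, y1)  # predicts 1 step
direct_model = fit_linear(xh, yh)    # predicts all 4 horizons at once

last_window = series[-lookback:]
iterated = iterated_forecast(one_step_model, last_window, horizon)
direct = predict_linear(direct_model, last_window[None, :])[0]

print(iterated.shape, direct.shape)  # (4,) (4,)
```

The iterated path calls the one-step model `horizon` times and depends on its own earlier outputs, while the direct path emits all horizons in a single call.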
    
**Others**  
- Post-hoc explanation methods (LIME, SHAP, RL-LIM)[15, 16, 26]
- Inherently interpretable models[27, 24]
- Methods combining temporal importance and variable selection[24]

## Multi-Horizon Forecasting

### Inputs and Targets
In a given time series dataset, at each time step $t \in [0, T_i]$
- $I$ is the number of unique entities.
- Each entity $i$ is associated with a set of static covariates, i.e. $\mathbf{s}_i \in \mathbb{R}^{m_s}$
- Inputs $\boldsymbol{\chi}_{i,t} \in \mathbb{R}^{m_\chi}$
- Targets $y_{i,t} \in \mathbb{R}$

Time dependent inputs are divided into 2 categories  
 
$$\boldsymbol{\chi}_{i,t} = [\mathbf{z}^T_{i,t}, \mathbf{x}^T_{i,t}]^T$$  

**Observed Inputs**  
Measured at each step and are unknown beforehand  
$$\mathbf{z}_{i,t} \in \mathbb{R}^{m_z}$$

**Known Inputs**  
Inputs that are pre-determined (e.g. day of week at time $t$)  
$$\mathbf{x}_{i,t} \in \mathbb{R}^{m_x}$$
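
As a rough illustration of this notation, the arrays below lay out static covariates, observed inputs, known inputs and targets for a few entities, and build the combined time-dependent input by concatenation. The sizes, the random data, the name `m_z` for the observed-input dimension, and the forecast-time slicing with `t0` and `tau` are all assumptions made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

n_entities = 3            # I: number of unique entities
n_steps = 10              # time steps per entity (same length for all here)
m_s, m_z, m_x = 2, 4, 3   # static, observed and known input dimensions

static = rng.normal(size=(n_entities, m_s))             # s_i
observed = rng.normal(size=(n_entities, n_steps, m_z))  # z_{i,t}
known = rng.normal(size=(n_entities, n_steps, m_x))     # x_{i,t}
targets = rng.normal(size=(n_entities, n_steps))        # y_{i,t}

# Combined time-dependent input: observed and known inputs stacked per step.
chi = np.concatenate([observed, known], axis=-1)

# At a forecast time t0 with horizon tau, observed inputs exist only up to
# t0, while known inputs (e.g. day of week) extend through t0 + tau.
t0, tau = 6, 3
past_observed = observed[:, : t0 + 1]
known_past_and_future = known[:, : t0 + tau + 1]

print(chi.shape)                    # (3, 10, 7)
print(past_observed.shape)          # (3, 7, 4)
print(known_past_and_future.shape)  # (3, 10, 3)
```

Static covariates have no time axis at all, which is one reason simply concatenating them with time-dependent features at every step is a poor fit.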