# Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting

## 1. Introduction

Long Sequence Time-Series Forecasting (LSTF) plays a crucial role in various domains, including energy management, finance, and health monitoring. Predicting long time-series sequences requires models capable of capturing long-range dependencies while maintaining computational efficiency. Traditional forecasting models, such as Recurrent Neural Networks (RNNs) and vanilla Transformers, struggle with scalability due to their high computational complexity and memory consumption. To address these limitations, the **Informer** model introduces three primary innovations:

1. **ProbSparse Self-Attention** – Reduces the quadratic time and memory cost of vanilla self-attention to O(L log L) by computing full attention only for the most informative queries.
2. **Self-Attention Distilling** – A technique that compresses redundant information, allowing for improved memory efficiency and better long-range dependency modeling.
3. **Generative Style Decoding** – Instead of sequentially generating outputs, the Informer predicts entire sequences in a single forward pass, significantly improving inference speed.

## 2. Methodology

### 2.1. Efficient Self-Attention Mechanism

Vanilla Transformers incur **O(L²) time and memory** per layer because self-attention scores every query against every key. The Informer introduces **ProbSparse Self-Attention**, which computes full attention only for a small subset of dominant queries, reducing computational demands while largely preserving accuracy.
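
To make the gap concrete, the following minimal Python sketch counts query-key scores for full attention versus a ProbSparse-style scheme with $u = \lceil c \ln L \rceil$ active queries. The function name `attention_score_counts` and the default `c = 5` are illustrative assumptions, and the count covers only the final attention step, not the cost of choosing which queries to keep.

```python
import math


def attention_score_counts(L, c=5):
    """Count query-key scores for a sequence of length L.

    `c` is a hypothetical sampling factor; u = ceil(c * ln L) queries stay active.
    """
    u = min(L, math.ceil(c * math.log(L)))
    return {
        "full": L * L,          # every query scored against every key
        "probsparse": u * L,    # only the u active queries are scored against all keys
        "active_queries": u,
    }
```

For `L = 96` this keeps 23 of the 96 queries active, and the ratio between the two counts widens as `L` grows because `u` grows only logarithmically.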

The standard self-attention mechanism is computed as:

$$A(Q,K,V) = \text{softmax} \left( \frac{QK^T}{\sqrt{d}} \right) V$$

where $Q, K, V$ are the query, key, and value matrices, and $d$ is the feature dimension. Informer observes that only a few queries produce attention distributions far from uniform, so it keeps the **top-u** queries ranked by a sparsity measurement and computes attention for those alone:

$$A(Q,K,V) = \text{softmax} \left( \frac{\bar{Q}K^T}{\sqrt{d}} \right) V$$

where $\bar{Q}$ has the same size as $Q$ but contains only the top-u queries, with $u$ proportional to $\ln L$.

This sparsity-aware mechanism enables efficient dependency alignment while preserving predictive quality.
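
The selection rule can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the reference implementation: it ranks queries by a max-minus-mean score (an approximation of the sparsity measurement), it computes all scores exactly instead of sampling keys, so it shows the selection logic rather than the speedup, and it assigns the mean of $V$ to non-selected queries. The name `probsparse_attention` and the list-of-lists layout are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def probsparse_attention(Q, K, V, u):
    """Q: L_Q x d, K: L_K x d, V: L_K x d_v (lists of lists); u: number of active queries.

    Returns (output rows, sorted indices of the active queries).
    """
    scale = math.sqrt(len(Q[0]))
    # Scaled dot-product scores for every query-key pair (no key sampling in this sketch).
    scores = [[dot(q, k) / scale for k in K] for q in Q]
    # Max-minus-mean score: large when a query's attention is far from uniform.
    sparsity = [max(row) - sum(row) / len(row) for row in scores]
    active = sorted(range(len(Q)), key=lambda i: sparsity[i], reverse=True)[:u]
    # Assumption: "lazy" queries receive the mean of V.
    mean_v = [sum(col) / len(V) for col in zip(*V)]
    out = [list(mean_v) for _ in Q]
    for i in active:
        weights = softmax(scores[i])
        out[i] = [sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))]
    return out, sorted(active)
```

Setting `u = len(Q)` recovers ordinary softmax attention for every row, which gives a quick sanity check for the sketch.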

### 2.2. Self-Attention Distilling

To further enhance efficiency, Informer employs a hierarchical approach that **progressively distills attention layers**:

- Redundant combinations in the attention feature map are condensed, focusing computation on dominant features.
- By applying **max-pooling with stride 2** after each attention block, the sequence length is halved layer by layer, reducing memory consumption from **O(L²) to O((2-ε)L log L)**.

The downsampling in distillation is computed as:

$$X_{j+1} = \text{MaxPool} \left( \text{ELU} ( \text{Conv1D}([X_j]_{\text{AB}}) ) \right),$$

where $[X_j]_{\text{AB}}$ is the output of the attention block applied to the layer-$j$ feature map $X_j$, $\text{Conv1D}$ is a one-dimensional convolution over the time dimension, and $\text{ELU}$ is the activation function.
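
The downsampling step alone can be sketched for a single-channel sequence as below. The kernel weights, the pooling window of 3 with stride 2 and padding 1, and all function names are assumptions chosen for illustration; the attention computation that precedes this step is omitted.

```python
import math


def elu(x, alpha=1.0):
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def conv1d_same(x, kernel):
    """Zero-padded 1-D convolution (odd-length kernel) that preserves len(x)."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(w * padded[i + k] for k, w in enumerate(kernel)) for i in range(len(x))]


def max_pool1d(x, size=3, stride=2, pad=1):
    """Max-pooling that roughly halves the sequence length when stride is 2."""
    padded = [-math.inf] * pad + list(x) + [-math.inf] * pad
    return [max(padded[i:i + size]) for i in range(0, len(padded) - size + 1, stride)]


def distill(x, kernel=(0.25, 0.5, 0.25)):
    """One distilling step: Conv1D -> ELU -> MaxPool."""
    return max_pool1d([elu(v) for v in conv1d_same(x, kernel)])
```

Applying `distill` twice to a length-8 sequence yields lengths 4 and then 2, which mirrors how each encoder layer sees a shorter input than the one before it.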

### 2.3. Generative Style Decoding

Traditional Transformers rely on **dynamic decoding**, which sequentially generates outputs, leading to slow inference speeds and cumulative error propagation. Informer replaces this with a **single-step generative decoder**, which predicts entire sequences at once. The decoder input is given as:

$$X_{\text{de}} = \text{Concat}(X_{\text{token}}, X_0)$$

where $X_{\text{token}}$ is a known segment of the sequence that serves as the start token and $X_0$ is a zero-filled placeholder for the target values. The decoder's attention layers process $X_{\text{de}}$, and a fully connected layer then produces the final output, summarized as:

$$Y = \text{FCN}(X_{\text{de}})$$

This non-autoregressive approach eliminates dependence on previous predictions, ensuring robustness against error accumulation.
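
Building the decoder input is simple enough to show directly. In this sketch, `history` is a list of feature rows, and the names `build_decoder_input`, `label_len`, and `pred_len` are hypothetical; the decoder layers themselves are not modeled.

```python
def build_decoder_input(history, label_len, pred_len):
    """Concatenate the last `label_len` known rows with `pred_len` rows of zeros."""
    if not 0 < label_len <= len(history):
        raise ValueError("label_len must be between 1 and len(history)")
    width = len(history[0])
    # X_token: the most recent known segment, copied so the caller's data is not aliased.
    token = [list(row) for row in history[len(history) - label_len:]]
    # X_0: zero placeholder rows, one per step to be predicted.
    placeholder = [[0.0] * width for _ in range(pred_len)]
    return token + placeholder
```

Because all `pred_len` target positions are present in the input at once, a single forward pass can fill them in together rather than one step at a time.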

## 3. Experimental Results

The Informer model is evaluated on the following real-world datasets:

- **ETTh1, ETTh2, ETTm1** – Electricity Transformer Temperature datasets containing energy consumption and transformer temperature data.
- **ECL (Electricity Consuming Load)** – A dataset tracking hourly electricity consumption for 321 clients.
- **Weather** – Climate-based dataset with hourly weather conditions across multiple locations.

## 4. Key Findings

1. **Computation Efficiency**

   - Informer significantly reduces **training and inference times** due to optimized self-attention and decoding mechanisms.
   - It scales well, handling **sequences 10× longer** than traditional Transformer models.

2. **Prediction Accuracy**

   - Informer consistently achieves lower **Mean Squared Error (MSE)** and **Mean Absolute Error (MAE)** compared to LSTM, ARIMA, Prophet, DeepAR, Reformer, and LogTrans.
   - The model maintains high performance across varying forecast lengths, demonstrating **robust long-term predictive power**.

3. **Scalability & Memory Usage**

   - Compared to standard Transformers, Informer’s **self-attention distilling** and **ProbSparse mechanism** drastically cut memory requirements.
   - It enables real-world deployment in large-scale forecasting applications where traditional models fail due to resource constraints.
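
For reference, the two error metrics named above can be computed as in this small sketch. The function names are illustrative, and the inputs are flat lists of equal length.

```python
def mse(y_true, y_pred):
    """Mean Squared Error: average of squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean Absolute Error: average of absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```

MSE penalizes large misses more heavily than MAE, so reporting both gives a fuller picture of forecast quality over long horizons.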

## 5. Practical Applications

Informer’s efficiency and scalability make it ideal for multiple real-world applications, including:

- **Smart Grid Energy Forecasting** – Predicting electricity demand to optimize power distribution.
- **Financial Market Prediction** – Forecasting stock prices and economic trends for investment strategies.
- **Climate and Environmental Modeling** – Long-term weather pattern forecasting for disaster prevention and resource planning.
- **Healthcare Time-Series Analysis** – Predicting patient vitals and disease progression over extended periods.

## 6. Conclusion

Informer represents a significant advance in **long-sequence time-series forecasting**, addressing the efficiency bottlenecks of vanilla Transformer models. By combining **ProbSparse Self-Attention**, **Self-Attention Distilling**, and **Generative Style Decoding**, it lowers prediction error relative to the compared baselines while reducing time complexity and memory usage. The reported experiments support its scalability and effectiveness across multiple real-world datasets, making Informer a strong candidate for large-scale time-series forecasting.

## References

[1] Zhou, H., Zhang, S., Peng, J., Zhang, S., Li, J., Xiong, H., & Zhang, W. (2020, December 14). Informer: Beyond efficient transformer for long sequence Time-Series forecasting. arXiv.org. https://arxiv.org/abs/2012.07436