## XFormers

- Transformers are behind the most successful LLMs

- There is a need to increase the size of the transformers

- Scaling laws and Emergent properties

- In this session, we will look at various ways that transformers are being optimized for their performance 

Please refer to [1] for a general survey of efficient transformers. This session aims to give a sense of the various ways in which the attention mechanism can be made efficient.

[[1]](https://arxiv.org/abs/2009.06732) Tay et al. (2022) Efficient Transformers: A Survey 

### Modes of Transformer
- The most common reference for transformers is Vaswani et al. (2017).

<img width=750 src="imgs/transformer-modes.png">

- **Encoder-only**: Used for classification tasks 
- **Decoder-only**: Used for autoregressive tasks where the next prediction needs to be based only on the present and past predictions
- **Encoder-Decoder**: Document summarization, language translation, etc. 

### Computational Complexity
- Transformers use self-attention, with $\mathbf{Q}, \mathbf{K}, \mathbf{V} \in \mathbb{R}^{n \times d}$. Here, $n$ is the sequence length and $d$ is the dimension, with $n \gg d$.

$$\mathbf{c} = \text{Softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d}}\right)\mathbf{V}$$

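- Computational complexity: $\mathcal{O}(n^2 d)$ time and $\mathcal{O}(n^2)$ memory for the attention matrix, i.e., quadratic in the sequence length $n$.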

<img width=750 src="imgs/transformer-complexity.png">


- Most of the engineering advances in attention mechanism **aim to reduce this complexity**: We would want to increase $n$ to a very high value.
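To make the quadratic term concrete, here is a minimal NumPy sketch of the scaled dot-product attention formula above. The function names and the toy sizes (`n = 8`, `d = 4`) are illustrative assumptions, and the sketch leaves out multiple heads, learned projections, and batching.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q, K, V: arrays of shape (n, d).
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # (n, n): the quadratic term
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # (n, d)

rng = np.random.default_rng(0)
n, d = 8, 4  # toy sizes, chosen only for illustration
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)
```

The `(n, n)` array `scores` is the object that most of the methods in this session try to shrink, approximate, or avoid storing.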

### Reducing computational complexity 

- Reduce the sequence length

- Process segments recurrently

- Improve the calculation of attention scores

- Memory level optimization: Reduce memory access times

- Alternatives to attention mechanisms

    
**Note**: Some mechanisms do not support the causal masking that a decoder requires. As a result, they might not be suitable for auto-regressive applications.
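For reference, the masking in question can be sketched as follows: position $i$ may only attend to positions $j \le i$. This is a minimal NumPy illustration with assumed names and toy sizes, not tied to any particular efficient-attention method.

```python
import numpy as np

def causal_attention(Q, K, V):
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    # Mask positions j > i so that token i cannot attend to the future.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # The diagonal is never masked, so every row has a finite maximum.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((6, 4)) for _ in range(3))
out, weights = causal_attention(Q, K, V)
```

A method that mixes or compresses keys across the whole sequence before computing scores makes it harder to guarantee that no future position leaks into `weights`.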

### Attention mechanisms

Tay et al. (2022) Efficient Transformers: A Survey

<img width=750 src="imgs/attn_complexity.png">


[[1]](https://arxiv.org/abs/2009.06732) Tay et al. (2022) Efficient Transformers: A Survey

### Reduce the sequence length

- Only compute attention using a segment of the sequence or any smaller representations of this sequence

- Several ways to define segments
    * **Split the entire sequence into blocks** and **use convolution within each block** to reduce keys and values [1, 2]
    
    <img width=750 src="imgs/mem-compressed-attn.png">    
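The key and value reduction can be sketched with a strided convolution along the sequence axis. The kernel size, the stride, and the random weights `Wk` and `Wv` are assumptions for illustration, and the block splitting is omitted.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def strided_conv(X, W):
    # X: (n, d); W: (s, d, d), with kernel size = stride = s (an assumption).
    s, d, _ = W.shape
    n = X.shape[0]
    assert n % s == 0, "sketch assumes n is a multiple of the stride"
    windows = X.reshape(n // s, s, d)             # non-overlapping windows
    return np.einsum("jtd,tde->je", windows, W)   # (n // s, d)

def compressed_attention(Q, K, V, Wk, Wv):
    Kc, Vc = strided_conv(K, Wk), strided_conv(V, Wv)
    scores = Q @ Kc.T / np.sqrt(Q.shape[-1])      # (n, n // s), not (n, n)
    return softmax(scores) @ Vc, scores.shape

rng = np.random.default_rng(0)
n, d, s = 12, 4, 3  # toy sizes
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
Wk, Wv = (0.1 * rng.standard_normal((s, d, d)) for _ in range(2))
out, score_shape = compressed_attention(Q, K, V, Wk, Wv)
print(score_shape)
```

The queries keep their full length, so the output still has one row per token while the score matrix shrinks by the stride factor.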

    
[[1]](https://arxiv.org/abs/1801.10198) Liu et al. (2018), Generating Wikipedia by Summarizing Long Sequences

[[2]](https://arxiv.org/abs/1802.05751) Parmar et al. (2018), Image Transformer


### Reduce the sequence length

- Only compute attention using a segment of the sequence or any smaller representations of this sequence

- Several ways to define segments 
    * **Do partial calculations based on pre-defined sparse patterns**, e.g., Sparse Transformers [1], Longformer [2], ETC [3], BigBird [4]. Image: Sparse Transformer
    
    <img width=750 src="imgs/sparse-transformer.png">
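A pre-defined sparse pattern is just a fixed boolean mask over the $n \times n$ score matrix. The sketch below builds a simplified causal pattern that combines a local window with strided positions, loosely in the spirit of the strided pattern in the figure; the function name and the `window` and `stride` values are assumptions.

```python
import numpy as np

def strided_sparse_mask(n, window, stride):
    i = np.arange(n)[:, None]   # query positions
    j = np.arange(n)[None, :]   # key positions
    causal = j <= i                      # never look at the future
    local = (i - j) < window             # the most recent `window` tokens
    strided = (i - j) % stride == 0      # every `stride`-th earlier token
    return causal & (local | strided)

n = 16
mask = strided_sparse_mask(n, window=4, stride=4)
dense_causal = n * (n + 1) // 2
density = mask.sum() / dense_causal
print(mask.astype(int))
print(f"fraction of causal entries kept: {density:.2f}")
```

Only the entries where `mask` is true need to be computed, so the saving grows as $n$ increases relative to `window` and `stride`.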
        

[[1]](https://arxiv.org/abs/1904.10509) Child et al. (2019) Generating Long Sequences with Sparse Transformers

[[2]](https://arxiv.org/abs/2004.05150) Beltagy et al. (2020) Longformer: The Long-Document Transformer

[[3]](https://arxiv.org/abs/2004.08483) Ainslie et al. (2020) ETC: Encoding Long and Structured Inputs in Transformers

[[4]](https://arxiv.org/abs/2007.14062) Zaheer et al. (2020) Big Bird: Transformers for Longer Sequences




### Reduce the sequence length

- Only compute attention using a segment of the sequence or any smaller representations of this sequence

- Several ways to define segments
    * **Low rank projection to a smaller sequence length $k < n$** of the keys and values, e.g., Linformer [1], Set Transformer [2]. Set Transformer uses induced set attention blocks (ISAB) with learnable inducing points $\mathbf{I} \in \mathbb{R}^{m \times d}$.
    
    <img width=250 src="imgs/ISAB.png">    
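The low-rank projection idea can be sketched in the Linformer style: two matrices of shape `(k, n)` project the keys and values along the sequence axis before attention. The names `E` and `F`, the random initialization, and the toy sizes are assumptions; in a trained model these projections would be learned.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def low_rank_attention(Q, K, V, E, F):
    # E, F: (k, n) projections applied along the sequence axis.
    Kp, Vp = E @ K, F @ V                        # (k, d) each
    scores = Q @ Kp.T / np.sqrt(Q.shape[-1])     # (n, k) instead of (n, n)
    return softmax(scores) @ Vp, scores.shape

rng = np.random.default_rng(0)
n, d, k = 64, 8, 16  # toy sizes with k < n
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
E, F = (rng.standard_normal((k, n)) / np.sqrt(n) for _ in range(2))
out, score_shape = low_rank_attention(Q, K, V, E, F)
print(score_shape)
```

For fixed `k`, the cost of the score matrix grows linearly in `n`.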

    
[[1]](https://arxiv.org/abs/2006.04768) Wang et al. (2020) Linformer: Self-Attention with Linear Complexity

[[2]](https://arxiv.org/abs/1810.00825) Lee et al. (2019) Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks

### Reduce the sequence length

- Only compute attention using a segment of the sequence or any smaller representations of this sequence

- Several ways to define segments 
    * **Learn to identify blocks in sequences** based on some mechanism, e.g., Routing Transformer uses clustering [1], Reformer uses locality-sensitive hashing [2]
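As a rough illustration of the hashing route, the sketch below assigns vectors to buckets with an angular locality-sensitive hash: project onto random directions and take the argmax over the projections and their negations. Attention would then be restricted to tokens that share a bucket. The function name and sizes are assumptions, and the sorting, chunking, and multi-round hashing of a full implementation are omitted.

```python
import numpy as np

def lsh_buckets(X, n_buckets, rng):
    # X: (n, d). n_buckets is assumed to be even.
    d = X.shape[-1]
    R = rng.standard_normal((d, n_buckets // 2))   # random directions
    proj = X @ R                                   # (n, n_buckets // 2)
    # Vectors pointing in similar directions tend to share the same argmax.
    return np.argmax(np.concatenate([proj, -proj], axis=-1), axis=-1)

rng = np.random.default_rng(0)
X = rng.standard_normal((32, 8))
X[1] = 2.0 * X[0]   # same direction as X[0], different norm
buckets = lsh_buckets(X, n_buckets=8, rng=rng)
print(buckets)
```

Because the hash depends only on direction, `X[0]` and `X[1]` land in the same bucket.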
    
[[1]](https://arxiv.org/abs/2003.05997) Roy et al. (2020) Efficient Content-Based Sparse Attention with Routing Transformers

[[2]](https://arxiv.org/abs/2001.04451) Kitaev et al. (2020) Reformer: The Efficient Transformer

### Process segments recurrently

- Transformer-XL [1] processes segments recurrently, caching the hidden states of one segment and reusing them as extra context for the next segment

- Compressive Transformer [2] extends Transformer-XL by compressing older memories into a secondary memory instead of discarding them
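Segment-level recurrence can be sketched for a single attention layer as follows. Each segment attends over its own tokens plus a cached memory from the previous segment. The function name, `seg_len`, and `mem_len` are assumptions, and the sketch leaves out causal masking, learned projections, relative position encodings, and the compression step of the Compressive Transformer.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def segment_recurrent_attention(X, seg_len, mem_len):
    # X: (n, d). mem_len >= 1 is assumed.
    n, d = X.shape
    memory = np.zeros((0, d))        # empty memory before the first segment
    outputs, context_sizes = [], []
    for start in range(0, n, seg_len):
        seg = X[start:start + seg_len]
        context = np.concatenate([memory, seg], axis=0)
        scores = seg @ context.T / np.sqrt(d)   # (seg_len, mem + seg_len)
        outputs.append(softmax(scores) @ context)
        context_sizes.append(context.shape[0])
        memory = context[-mem_len:]  # cache the most recent states
    return np.concatenate(outputs), context_sizes

rng = np.random.default_rng(0)
X = rng.standard_normal((12, 4))
out, context_sizes = segment_recurrent_attention(X, seg_len=4, mem_len=4)
print(context_sizes)
```

Each score matrix has at most `seg_len * (mem_len + seg_len)` entries, independent of the total length `n`.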

[[1]](https://arxiv.org/abs/1901.02860) Dai et al. (2019) Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

[[2]](https://arxiv.org/abs/1911.05507) Rae et al. (2019) Compressive Transformers for Long-Range Sequence Modelling

### Improve the calculation of attention scores

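- Improve the attention computation by the use of kernels
    * Performer [1] approximates the softmax kernel with random feature maps
    * Linear Transformer [2] replaces the softmax with a dot product of kernel feature maps, $\phi(\mathbf{Q})\left(\phi(\mathbf{K})^T\mathbf{V}\right)$, so the cost grows linearly in $n$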
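The kernel trick behind linear attention is a reordering of matrix products. The sketch below uses the feature map $\phi(x) = \text{elu}(x) + 1$ from the Linear Transformer paper and checks that the linear-order computation matches the quadratic one; function names and sizes are assumptions, and the causal variant is not shown.

```python
import numpy as np

def phi(x):
    # elu(x) + 1: equals x + 1 for x > 0 and exp(x) otherwise, so it is positive.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V              # (d, d): summarizes all keys and values once
    Z = Kp.sum(axis=0)         # (d,): normalizer
    return (Qp @ KV) / (Qp @ Z)[:, None]

def quadratic_reference(Q, K, V):
    A = phi(Q) @ phi(K).T      # (n, n) similarity matrix
    return (A / A.sum(axis=-1, keepdims=True)) @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((32, 4)) for _ in range(3))
fast = linear_attention(Q, K, V)
slow = quadratic_reference(Q, K, V)
print(np.allclose(fast, slow))
```

`linear_attention` never forms an `(n, n)` array; its intermediate results have shapes `(d, d)` and `(d,)`.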
    

[[1]](https://arxiv.org/abs/2009.14794) Choromanski et al. (2020) Rethinking Attention with Performers

[[2]](https://arxiv.org/abs/2006.16236) Katharopoulos et al. (2020) Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention

### Memory level optimization: Flash Attention [1] 

- Aims to raise the achieved floating point operations per second (FLOPS) by reducing the number of reads and writes between slow GPU high-bandwidth memory (HBM) and fast on-chip SRAM
    
<img src="imgs/flash-1.png">


[[1]](https://arxiv.org/abs/2205.14135) Dao et al. (2022) FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness 

[[Github Repo]](https://github.com/Dao-AILab/flash-attention) 

[[Blog]](https://gordicaleksa.medium.com/eli5-flash-attention-5c44017022ad) Aleksa Gordic, ELI5: FlashAttention

### Memory level optimization: Flash Attention [1] 

- Computes attention scores in blocks (tiling), thereby avoiding storing the full $n \times n$ attention matrix in slow memory

<img src="imgs/flash-mem.png">
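The arithmetic that makes block-wise computation exact is an online softmax: running row maxima and row sums are rescaled as each new block of keys and values arrives. The NumPy sketch below illustrates only this arithmetic under assumed names and sizes. It tiles over keys and values but not queries, and it says nothing about the actual GPU memory behavior.

```python
import numpy as np

def tiled_attention(Q, K, V, block):
    n, d = Q.shape
    out = np.zeros((n, V.shape[1]))
    row_max = np.full(n, -np.inf)   # running max of the scores per query
    row_sum = np.zeros(n)           # running softmax denominator per query
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = Q @ Kb.T / np.sqrt(d)                  # (n, block) tile of scores
        new_max = np.maximum(row_max, s.max(axis=1))
        scale = np.exp(row_max - new_max)          # rescale old accumulators
        p = np.exp(s - new_max[:, None])
        row_sum = row_sum * scale + p.sum(axis=1)
        out = out * scale[:, None] + p @ Vb
        row_max = new_max
    return out / row_sum[:, None]

def reference_attention(Q, K, V):
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    p = np.exp(s - s.max(axis=1, keepdims=True))
    return (p / p.sum(axis=1, keepdims=True)) @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((10, 4)) for _ in range(3))
tiled = tiled_attention(Q, K, V, block=4)
exact = reference_attention(Q, K, V)
print(np.allclose(tiled, exact))
```

The result matches standard attention up to floating point error, which is why this family of methods is described as exact rather than approximate.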

[[1]](https://arxiv.org/abs/2205.14135) Dao et al. (2022) FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness 


[[Github Repo]](https://github.com/Dao-AILab/flash-attention) 

[[Blog]](https://gordicaleksa.medium.com/eli5-flash-attention-5c44017022ad) Aleksa Gordic, ELI5: FlashAttention

### Memory level optimization: Flash Attention-2 [1] 

- Flash Attention-2 [1] builds on Flash Attention with better parallelism and work partitioning on the GPU


<img width=500 src="imgs/flash_attention.png">

[[1]](https://t.co/E5FZ3j1mDB) Tri Dao (2023) FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

[[Github Repo]](https://github.com/Dao-AILab/flash-attention) 

[[Blog]](https://gordicaleksa.medium.com/eli5-flash-attention-5c44017022ad) Aleksa Gordic, ELI5: FlashAttention

### Alternative architectures

- Hyena architecture [1]: 
    * Subquadratic complexity. 
    * Uses learnable convolutions and recurrences

- MLP-Mixer [2]: 
    * Uses neither convolutions nor attention
    * Only uses MLPs 

- Synthesizers [3]: 
    * Similar to MLP-Mixers
    * Approximates attention scores using MLPs

- Many more ... 
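To show how tokens can exchange information without attention, here is a minimal sketch of a token-mixing MLP in the spirit of MLP-Mixer. The weights `W1` and `W2`, the hidden width `h`, and the ReLU are assumptions made for brevity; layer normalization and the channel-mixing MLP are omitted.

```python
import numpy as np

def token_mixing(X, W1, W2):
    # X: (n, d). W1: (h, n) and W2: (n, h) act along the token axis,
    # so every channel is mixed across tokens with the same weights.
    H = np.maximum(W1 @ X, 0.0)   # (h, d)
    return X + W2 @ H             # (n, d), with a residual connection

rng = np.random.default_rng(0)
n, d, h = 16, 8, 32  # toy sizes
X = rng.standard_normal((n, d))
W1 = rng.standard_normal((h, n)) / np.sqrt(n)
W2 = rng.standard_normal((n, h)) / np.sqrt(h)
out = token_mixing(X, W1, W2)
print(out.shape)
```

The weight shapes depend on `n`, so this kind of layer assumes a fixed sequence length, unlike attention.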


[[1]](https://arxiv.org/pdf/2302.10866) Poli et al. (2023) Hyena Hierarchy: Towards Larger Convolutional Language Models

[[2]](https://arxiv.org/abs/2105.01601) Tolstikhin et al. (2021) MLP-Mixer: An all-MLP Architecture for Vision

[[3]](https://arxiv.org/abs/2005.00743) Tay et al. (2020) Synthesizer: Rethinking Self-Attention in Transformer Models

### Other optimizations

- **Reducing the complexity of FFN following MHA**
    * FFN is computationally expensive
    * Mixture-of-Experts (MoE) approaches assume that different experts specialize in different regions of the input space, and therefore route each token to only a few experts
    * Switch Transformer [1], GShard [2], etc. 
    
- **Weight sharing**: Sharing parameters across layers results in smaller models, e.g., Universal Transformers [3], ALBERT [4]

- **Mixed precision and quantization**: Lower-precision weights and activations reduce memory costs, e.g., Q-BERT [5], quantization-aware training of transformers [6]

- **Knowledge Distillation**: Learning smaller (faster) models from the output of the larger models, e.g., DistilBERT [7], TinyBERT [8]
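The Mixture-of-Experts routing from the list above can be sketched with top-1 routing: a gate scores every expert for each token, and only the best-scoring expert's FFN runs for that token. The function name, weight shapes, and sizes are assumptions, and practical concerns such as expert capacity limits and load-balancing losses are omitted.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def top1_moe(X, Wg, experts):
    # X: (n, d); Wg: (d, n_experts); experts: list of (W1, W2) FFN weights.
    probs = softmax(X @ Wg)            # gate probabilities per token
    choice = probs.argmax(axis=-1)     # one expert per token
    out = np.zeros_like(X)
    for e, (W1, W2) in enumerate(experts):
        idx = np.where(choice == e)[0]
        if idx.size == 0:
            continue                   # this expert received no tokens
        hidden = np.maximum(X[idx] @ W1, 0.0)
        # Scale by the gate probability so the router affects the output.
        out[idx] = probs[idx, e][:, None] * (hidden @ W2)
    return out, choice

rng = np.random.default_rng(0)
n, d, h, n_experts = 10, 4, 8, 3  # toy sizes
X = rng.standard_normal((n, d))
Wg = rng.standard_normal((d, n_experts))
experts = [(rng.standard_normal((d, h)), rng.standard_normal((h, d)))
           for _ in range(n_experts)]
out, choice = top1_moe(X, Wg, experts)
print(choice)
```

Each token pays for one expert's FFN regardless of how many experts exist, which is how parameter count can grow without a matching growth in per-token compute.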


[[1]](https://arxiv.org/abs/2101.03961) Fedus et al. (2021) Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

[[2]](https://arxiv.org/abs/2006.16668) Lepikhin et al. (2020) GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

[[3]](https://arxiv.org/abs/1807.03819) Dehghani et al. (2018) Universal Transformers

[[4]](https://arxiv.org/abs/1909.11942) Lan et al. (2019) ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

[[5]](https://arxiv.org/abs/1909.05840) Shen et al. (2019) Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT

[[6]](https://arxiv.org/abs/2004.07320) Fan et al. (2020) Training with Quantization Noise for Extreme Model Compression

[[7]](https://arxiv.org/abs/1910.01108) Sanh et al. (2019) DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

[[8]](https://arxiv.org/abs/1909.10351) Jiao et al. (2019) TinyBERT: Distilling BERT for Natural Language Understanding