### Usage of attention mechanism

- The attention mechanism has evolved over the years

- It has been used in several applications

- In this session, we will look at some of the ways in which the attention mechanism has been used in various domains

- Several surveys exist, each defining its own taxonomy [1, 2, 3, 4]

- This session follows the taxonomy from Brauwers et al. [1]
    * Simplified for the ease of explanation
    * Applications are not covered



[[1]](https://arxiv.org/abs/2203.14263) Brauwers et al. (2022) A General Survey on Attention Mechanisms in Deep Learning

[[2]](https://link.springer.com/article/10.1007/s41095-022-0271-y) Guo et al. (2022) Attention mechanisms in computer vision: A survey

[[3]](https://dl.acm.org/doi/abs/10.1145/3465055) Chaudhari et al. (2021) An Attentive Survey of Attention Models

[[4]](https://arxiv.org/abs/1811.05544) Hu (2018) An Introductory Survey on Attention Mechanisms in NLP Problems

### Attention mechanism

<img width=500, src="imgs/attention.png">

**Inputs:**
- Features: $\textbf{F} \in \mathbb{R}^{n_{F}\times d}$
- Query: $\textbf{Q} \in \mathbb{R}^{n_{Q}\times d}$
    * Given a query, which features are important and how should they be combined?
    * Optional: a self-attention mechanism learns the query itself

### Attention mechanism

<img width=500, src="imgs/attention.png">

**Outputs:**
- Context: $\textbf{C} \in \mathbb{R}^{n_{c}\times d}$
    * Use it as **Output** for the downstream task
    * Use it as **Query** for another attention mechanism
    * Use it as **Features** for another attention mechanism


### Attention mechanism
<img width=500, src="imgs/attention.png">

**Processing blocks:**
- **Feature to Keys and Values** ($\textbf{A}_1$): Usually simple MLPs that convert features into keys and values
    * most often a single linear transformation
    * low computational complexity

- **Score**:
    * combines queries with keys
    * what are relevant keys to answer a query?
    * high computational complexity

- **Align**:
    * given attention scores, how to combine values?
    * low computational complexity, but potential for memory bandwidth optimization
 
- **Combining values** ($\textbf{A}_2$):
    * mostly, elementwise multiplication of the values by the attention weights, followed by a sum
    * low computational complexity
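
The four processing blocks can be sketched end to end for a single query. This is a minimal illustration, not a reference implementation: it assumes linear maps for $\textbf{A}_1$, a dot-product score, and a softmax alignment, and the names `attention`, `W_k`, `W_v` and the toy data are hypothetical.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(e):
    """Numerically stable softmax over a list of scores."""
    m = max(e)
    exps = [math.exp(x - m) for x in e]
    total = sum(exps)
    return [x / total for x in exps]

def attention(q, F, W_k, W_v):
    # A_1: linear maps turn each feature vector into a key and a value.
    K = [matvec(W_k, f) for f in F]
    V = [matvec(W_v, f) for f in F]
    # Score: one scalar per key (assumed here to be a plain dot product).
    e = [sum(qi * ki for qi, ki in zip(q, k)) for k in K]
    # Align: normalise the scores into attention weights.
    a = softmax(e)
    # A_2: weight each value by its attention weight and sum over the values.
    d = len(V[0])
    c = [sum(a[i] * V[i][j] for i in range(len(V))) for j in range(d)]
    return a, c

# Hypothetical toy data: 3 features of dimension 2, identity projections.
F = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
I2 = [[1.0, 0.0], [0.0, 1.0]]
q = [1.0, 0.0]
a, c = attention(q, F, I2, I2)
print(a)  # features 0 and 2 match the query equally well
print(c)
```

The score step is the only one that touches every query-key pair, which is where the higher computational cost comes from.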

### Attention mechanisms by Feature Modality

- **Single feature**: A single data modality, e.g., text, video, audio

- **Multiple features**: Multiple data modalities, e.g., combinations of text, video, audio, tabular features

<img src="imgs/multimodal.png">



### Attention mechanisms by Feature Modality

- **Multiple features**: How to combine features from various attention mechanisms?
    * Alternating co-attention [1]

    <img width=500 src="imgs/alternating_co-attn.png">
    
[[1]](https://arxiv.org/pdf/1606.00061) Lu et al. (2016) Hierarchical Question-Image Co-Attention for Visual Question Answering

### Attention mechanisms by Feature Modality

- **Multiple features**: How to combine features from various attention mechanisms?
    * Interactive co-attention [1]

    <img width=500 src="imgs/interactive_co-attn.png">
    
[[1]](https://arxiv.org/abs/1709.00893) Ma et al. (2017) Interactive Attention Networks for Aspect-Level Sentiment Classification

### Attention mechanisms by Feature Modality

- **Multiple features**: How to combine features from various attention mechanisms?
    * Rotatory attention [1]

    <img width=500 src="imgs/rotary-attn.png">    
    
[[1]](https://arxiv.org/abs/1802.00892) Zheng et al. (2018) Left-Center-Right Separated Neural Network for Aspect-based Sentiment Analysis with Rotatory Attention

### Attention mechanisms by Feature Modality

- **Multiple features**: How to combine features from various attention mechanisms?
    * Hierarchical attention [1]

    <img width=500 src="imgs/hierarchical_attn.png">
    
[[1]](https://arxiv.org/abs/1806.00723) Wu et al. (2018) A Hierarchical Attention Model for Social Contextual Image Recommendation


### Attention mechanisms by Queries (Type)

- **Standard Attention**

<img width=500 src="imgs/standard-attn.png">

### Attention mechanisms by Queries (Type)

- **Self-attention**: Learn the query so that the queries are specialized for the downstream task

<img width=500 src="imgs/learn-query.png">

### Attention mechanisms by Queries (Type)

- **No query**: There is a constant query asking: which features are important?

<img width=500 src="imgs/constant-query.png">

### Attention mechanisms by Queries (Multiplicity)

- Multi-head attention: Uses several queries from the same features. 
    * Attention mechanisms in parallel


- Forming queries from the output of other attention mechanisms
    * Alternating Co-attention
    * Interactive Co-attention
    * Rotatory attention
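
Multi-head attention can be sketched as several independent attention mechanisms run over the same features, with their contexts concatenated. The sketch below assumes dot-product scoring and softmax alignment; the per-head projection matrices and the helper names `head` and `multi_head` are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(W, x):
    return [dot(row, x) for row in W]

def softmax(e):
    m = max(e)
    exps = [math.exp(x - m) for x in e]
    s = sum(exps)
    return [x / s for x in exps]

def head(q, F, W_q, W_k, W_v):
    """One attention head with its own query, key and value projections."""
    qh = matvec(W_q, q)
    K = [matvec(W_k, f) for f in F]
    V = [matvec(W_v, f) for f in F]
    a = softmax([dot(qh, k) for k in K])
    return [sum(a[i] * V[i][j] for i in range(len(V))) for j in range(len(V[0]))]

def multi_head(q, F, heads):
    """Run every head on the same features and concatenate the contexts."""
    context = []
    for W_q, W_k, W_v in heads:
        context.extend(head(q, F, W_q, W_k, W_v))
    return context

# Hypothetical toy setup: d = 2, two heads that each project to 1 dimension.
F = [[1.0, 0.0], [0.0, 1.0]]
q = [1.0, 1.0]
heads = [
    ([[1.0, 0.0]], [[1.0, 0.0]], [[1.0, 0.0]]),  # head 1 looks at dimension 0
    ([[0.0, 1.0]], [[0.0, 1.0]], [[0.0, 1.0]]),  # head 2 looks at dimension 1
]
c = multi_head(q, F, heads)
print(c)  # one context entry per head
```

Because the heads do not depend on each other, they can be evaluated in parallel; the loop here is only for readability.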

### Attention mechanisms by Processing functions (Scoring - scalar)

- Computes the score between query and keys, i.e., $score(\mathbf{q}, \mathbf{k}) \in \mathbb{R}$


- Additive (Concatenate) [1] 

$$ score(\mathbf{q}, \mathbf{k}) = \mathbf{w}^T \times activation(\mathbf{W} [\mathbf{q}, \mathbf{k}] + \mathbf{b}) $$

- Multiplicative (Dot-product) [2]

$$ score(\mathbf{q}, \mathbf{k}) = \mathbf{q}^T \times \mathbf{k} $$

- Scaled Multiplicative [3], where $d_k$ is the dimension of the keys

$$ score(\mathbf{q}, \mathbf{k}) = \frac{\mathbf{q}^T \times \mathbf{k}}{\sqrt{d_k}} $$
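
The three scoring functions above can be compared on a toy example. This is a sketch under stated assumptions: `tanh` is an assumed choice for the activation, and the parameters `W`, `b`, `w` are hypothetical values rather than trained weights.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def additive_score(q, k, W, b, w):
    """w^T activation(W [q, k] + b), with tanh as the assumed activation."""
    x = q + k  # list concatenation implements [q, k]
    hidden = [math.tanh(dot(row, x) + bi) for row, bi in zip(W, b)]
    return dot(w, hidden)

def multiplicative_score(q, k):
    """q^T k."""
    return dot(q, k)

def scaled_multiplicative_score(q, k):
    """q^T k / sqrt(d_k), where d_k is the key dimension."""
    return dot(q, k) / math.sqrt(len(k))

q = [1.0, 2.0, 3.0, 4.0]
k = [1.0, 1.0, 1.0, 1.0]
# Hypothetical parameters: 2 hidden units over the 8-dimensional [q, k].
W = [[0.1] * 8, [-0.1] * 8]
b = [0.0, 0.0]
w = [1.0, -1.0]

print(multiplicative_score(q, k))         # 10.0
print(scaled_multiplicative_score(q, k))  # 5.0
print(additive_score(q, k, W, b, w))
```

The additive form carries learnable parameters, while the two multiplicative forms have none and reduce to a single matrix product when applied to many queries and keys at once.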

[[1]](https://arxiv.org/abs/1409.0473) Bahdanau et al. (2014) Neural Machine Translation by Jointly Learning to Align and Translate

[[2]](https://arxiv.org/abs/1508.04025) Luong et al. (2015) Effective Approaches to Attention-based Neural Machine Translation

[[3]](https://arxiv.org/abs/1706.03762) Vaswani et al. (2017) Attention is all you need

### Attention mechanisms by Processing functions (Scoring - scalar)

- Computes the score between query and keys, i.e., $score(\mathbf{q}, \mathbf{k}) \in \mathbb{R}$

- General [1, 2, 3]

$$ score(\mathbf{q}, \mathbf{k}) = \mathbf{k}^T \times \mathbf{W} \times \mathbf{q} $$

$$ score(\mathbf{q}, \mathbf{k}) = \mathbf{k}^T \times (\mathbf{W} \times \mathbf{q}  + \mathbf{b})$$

$$ score(\mathbf{q}, \mathbf{k}) = activation(\mathbf{k}^T \times \mathbf{W} \times \mathbf{q}  + \mathbf{b})$$

- Miscellaneous [4]

$$ score(\mathbf{q}, \mathbf{k}) = \lvert\lvert \mathbf{q}-\mathbf{k} \rvert\rvert_2$$

$$ score(\mathbf{q}, \mathbf{k}) = \frac{\mathbf{q}^{T}\mathbf{k}}{\|\mathbf{q}\|\|\mathbf{k}\|}$$
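
A small sketch of the general and miscellaneous scores follows. The weight matrix `W`, the scalar bias `b`, and the use of `tanh` as the activation are assumptions made for illustration only.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(W, x):
    return [dot(row, x) for row in W]

def general_score(q, k, W):
    """k^T W q."""
    return dot(k, matvec(W, q))

def biased_general_score(q, k, W, b):
    """k^T (W q + b), with b a vector."""
    return dot(k, [x + bi for x, bi in zip(matvec(W, q), b)])

def activated_general_score(q, k, W, b):
    """activation(k^T W q + b), with b a scalar and tanh as assumed activation."""
    return math.tanh(general_score(q, k, W) + b)

def euclidean_score(q, k):
    """||q - k||_2."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, k)))

def cosine_score(q, k):
    """q^T k / (||q|| ||k||)."""
    return dot(q, k) / (math.sqrt(dot(q, q)) * math.sqrt(dot(k, k)))

q = [3.0, 4.0]
k = [4.0, 3.0]
W = [[0.0, 1.0], [1.0, 0.0]]  # hypothetical weights: swaps the two entries of q

print(general_score(q, k, W))  # 25.0
print(euclidean_score(q, k))   # 1.4142135623730951
print(cosine_score(q, k))      # 0.96
```

Note that the Euclidean form grows as the query and key move apart, whereas the other scores grow as they become more similar, so its sign needs care before it is passed to an alignment function.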



[[1]](https://arxiv.org/abs/1508.04025) Luong et al. (2015) Effective Approaches to Attention-based Neural Machine Translation

[[2]](https://arxiv.org/abs/1606.02245) Sordoni et al. (2016) Iterative Alternating Neural Attention for Machine Reading

[[3]](https://arxiv.org/abs/1709.00893) Ma et al. (2017) Interactive Attention Networks for Aspect-Level Sentiment Classification

[[4]](https://arxiv.org/abs/1410.5401) Graves et al. (2014) Neural Turing Machines


### Attention mechanisms by Processing functions (Alignment)

- How should the scores be combined to process values and produce a context vector?

- Assuming 1 query and $n$ keys and values, $\mathbf{e} = score(\mathbf{q}, \mathbf{K}) \in \mathbb{R}^n$, i.e., one score for each key $\mathbf{k}$

- $\mathbf{a} = \text{Alignment}(\mathbf{e}) \in \mathbb{R}^n$ defines how to combine the values $\mathbf{V} \in \mathbb{R}^{n \times d}$ to form the context vector

$$\mathbf{c} = \mathbf{a}^T\mathbf{V}$$

- **Soft alignment/global alignment** [1]

$$\mathbf{a}_i = \frac{\exp(\mathbf{e}_i)}{\sum_j{\exp(\mathbf{e}_j)}}$$


- **Hard alignment**: Select one Value [2]

$$\mathbf{a} \sim \text{Multinomial}(1, \mathbf{p}), \quad \mathbf{p}_i = \frac{\exp(\mathbf{e}_i)}{\sum_j{\exp(\mathbf{e}_j)}}$$
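
The difference between the two alignments can be sketched as follows: soft alignment blends all values, while hard alignment samples a single one. Drawing the sample with `random.choices` and seeding the generator are choices made for this illustration, and the helper names are hypothetical.

```python
import math
import random

def softmax(e):
    m = max(e)
    exps = [math.exp(x - m) for x in e]
    s = sum(exps)
    return [x / s for x in exps]

def soft_alignment(e):
    """Every value receives a non-zero weight."""
    return softmax(e)

def hard_alignment(e, rng):
    """Sample one index from softmax(e) and return it as a one-hot vector."""
    probs = softmax(e)
    idx = rng.choices(range(len(e)), weights=probs, k=1)[0]
    return [1.0 if i == idx else 0.0 for i in range(len(e))]

def context(a, V):
    """c = a^T V."""
    return [sum(a[i] * V[i][j] for i in range(len(V))) for j in range(len(V[0]))]

e = [2.0, 0.0, 1.0]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rng = random.Random(0)  # seeded only to make the sketch repeatable

a_soft = soft_alignment(e)
a_hard = hard_alignment(e, rng)
c_soft = context(a_soft, V)  # a blend of all three values
c_hard = context(a_hard, V)  # exactly one row of V
print(c_soft, c_hard)
```

The sampling step in the hard variant is not differentiable, which is why it is usually trained differently from the soft variant.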


[[1]](https://arxiv.org/abs/1508.04025) Luong et al. (2015) Effective Approaches to Attention-based Neural Machine Translation

[[2]](https://arxiv.org/abs/1502.03044) Xu et al. (2015) Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

### Attention mechanisms by Processing functions (Alignment)

- How should the scores be combined to process values and produce a context vector?

- **Local alignment**: The softmax distribution is calculated over only a subset of the attention scores.
    * Fix the midpoint, $p$, and compute the softmax scores over the window around $p$, i.e., values too far from $p$ are ignored and only those close to $p$ are taken into account.
    
    $$\mathbf{a}_i = \frac{\exp(\mathbf{e}_i)}{\sum_{j=p-D}^{p+D}\exp(\mathbf{e}_j)}, \quad p-D \le i \le p+D $$

    * Monotonic alignment: Fix $p$ to the position of the token being predicted
 
    * Predictive alignment: Learn $p$
    
[[1]](https://arxiv.org/abs/1508.04025) Luong et al. (2015) Effective Approaches to Attention-based Neural Machine Translation
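
Local alignment can be sketched as a softmax restricted to a window of half-width $D$ around $p$. The function name `local_alignment` is hypothetical, and clipping the window at the sequence boundaries is an assumption made so that the example also works near the edges.

```python
import math

def local_alignment(e, p, D):
    """Softmax over positions p-D..p+D; all other weights are zero."""
    lo = max(0, p - D)            # clip the window at the left boundary
    hi = min(len(e) - 1, p + D)   # clip the window at the right boundary
    window = e[lo:hi + 1]
    m = max(window)
    exps = [math.exp(x - m) for x in window]
    s = sum(exps)
    a = [0.0] * len(e)
    for offset, x in enumerate(exps):
        a[lo + offset] = x / s
    return a

e = [5.0, 1.0, 1.0, 1.0, 5.0]
a = local_alignment(e, p=2, D=1)
print(a)  # positions 0 and 4 get weight 0 despite their high scores
```

With monotonic alignment, `p` would be set to the position of the token being predicted; with predictive alignment, `p` would come from a learned function.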

### Attention mechanisms by Processing functions (Score - multidimensional)

- Attention scores are multidimensional, i.e., scores define how to combine each dimension in the values [1]

- Given keys $\mathbf{K} \in \mathbb{R}^{n \times d}$ and values $\mathbf{V} \in \mathbb{R}^{n \times d}$, there is one score per key and per dimension:

$$score(\mathbf{q}, \mathbf{K}) \in \mathbb{R}^{n \times d}$$

$$\mathbf{c} = \mathbf{1}^T(score(\mathbf{q}, \mathbf{K}) \odot \mathbf{V})$$
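
The multidimensional case can be sketched with an $n \times d$ matrix of weights, one per key and per dimension. Normalising the scores with a softmax over the $n$ rows of each column is an assumption of this sketch (the formula above applies the scores directly), and the score matrix `S` is hypothetical toy data.

```python
import math

def column_softmax(S):
    """Softmax over the n rows, separately for each of the d columns."""
    n, d = len(S), len(S[0])
    A = [[0.0] * d for _ in range(n)]
    for j in range(d):
        m = max(S[i][j] for i in range(n))
        exps = [math.exp(S[i][j] - m) for i in range(n)]
        total = sum(exps)
        for i in range(n):
            A[i][j] = exps[i] / total
    return A

def multidim_context(A, V):
    """c = 1^T (A * V): elementwise product, then a sum over the n rows."""
    n, d = len(V), len(V[0])
    return [sum(A[i][j] * V[i][j] for i in range(n)) for j in range(d)]

# Hypothetical scores for n = 2 keys and d = 2 dimensions:
# dimension 0 prefers key 0, dimension 1 prefers key 1.
S = [[10.0, 0.0], [0.0, 10.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
A = column_softmax(S)
c = multidim_context(A, V)
print(c)  # close to [1.0, 4.0]: each dimension picks a different value row
```

Unlike scalar scoring, each dimension of the context vector can attend to a different key.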

[[1]](https://arxiv.org/abs/1709.04696) Shen et al. (2017) DiSAN: Directional Self-Attention Network for RNN/CNN-Free Language Understanding