# Attention



## Single-Head Attention

$$Attention(Q, K, V) = softmax \left( \frac{Q K^T}{\sqrt{d}} \right) V$$

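- originally introduced for machine translation
- $\sqrt{d}$: scales the dot products down; for large $d$ they grow large and push softmax into regions where gradients vanish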

$\textbf{Key Components}$:
- Q (query, to match others): vector representing the current word
- K (key, to be matched): memory (all the words generated before)
- V (value, info to be extracted): can be the same as the key

<img src="https://cdn-images-1.medium.com/max/900/1*NIxhlMqHFyhllBm4v1j2-A.png" alt="attention" style="width:600px;"/>

$\textbf{Steps}$:
1. Compute similarity between query and source
  - $e^{\langle q, k \rangle}$: similarity between query and key
    - dot product: $Sim(Query, Key_i) = Query \cdot Key_i$
    - cosine similarity: $Sim(Query, Key_i) = \frac{Query \cdot Key_i}{||Query|| * ||Key_i||}$
    - multi-layer perceptron: $Sim(Query, Key_i) = MLP(Query, Key_i)$
2. Apply softmax to all the similarities
  - compute the attention weight of each word (weights sum to 1)
  $$\alpha^{\langle q, k_i \rangle} = softmax(e^{\langle q, k_i \rangle}) = \frac{exp(e^{\langle q, k_i \rangle})}{\sum_{j=1}^{L_x} exp(e^{\langle q, k_j \rangle})}$$
3. Compute attention between query and source
  $$Attention(Query, Source) = \sum_{i=1}^{L_x} \alpha^{\langle q, k_i \rangle} * Value_i$$
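
The three steps above can be sketched in plain Python for a single query vector. This is a minimal illustration, not a reference implementation: it assumes dot-product similarity scaled by $\sqrt{d}$ (as in the formula at the top of this section), and the toy keys and values are made up for the example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Attention output for one query over a list of key/value vectors."""
    d = len(query)
    # Step 1: scaled dot-product similarity between the query and each key.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    # Step 2: softmax turns the similarities into weights that sum to 1.
    weights = softmax(scores)
    # Step 3: weighted sum of the value vectors.
    dim_v = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)]
    return output, weights

# Toy data (arbitrary values, for illustration only).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
output, weights = attention([1.0, 0.0], keys, values)
```

The query aligns with the first and third keys, so those two values receive equal, larger weights than the second.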

## Multi-Head Attention

$\textbf{Steps}$:
1. Apply a linear transformation to Q, K, V before attention
  - W matrices: project each of Q, K, V from dimension d to d'
  $$\tilde{V}_i = Attention(QW_i^Q, KW_i^K, VW_i^V)$$
2. Concatenate (merge by column) the outputs of all $a$ heads, then project with $W^O$
  $$MultiHead(Q, K, V) = Concat(\tilde{V}_1, ...,\tilde{V}_a) W^O$$
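
A rough sketch of the two steps above, using plain Python lists as matrices. The helper names (`matmul`, `random_matrix`, `multi_head`), the sizes, and the random projection weights are all assumptions for illustration; a real model would learn the weights and use a tensor library.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    """Apply a numerically stable softmax to every row."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(x - mx) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention(q, k, v):
    """softmax(Q K^T / sqrt(d')) V for one head."""
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[x / math.sqrt(d) for x in row] for row in matmul(q, k_t)]
    return matmul(softmax_rows(scores), v)

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def multi_head(q, k, v, heads, w_o):
    """heads holds one (W_q, W_k, W_v) projection triple per head."""
    outputs = [
        attention(matmul(q, w_q), matmul(k, w_k), matmul(v, w_v))
        for w_q, w_k, w_v in heads
    ]
    # Concatenate per-head outputs column-wise, then project with W^O.
    concat = [sum((head[i] for head in outputs), []) for i in range(len(q))]
    return matmul(concat, w_o)

rng = random.Random(0)
n, d, a = 4, 8, 2            # sequence length, model dimension, number of heads
d_head = d // a              # per-head dimension d'
x = random_matrix(n, d, rng)
heads = [tuple(random_matrix(d, d_head, rng) for _ in range(3)) for _ in range(a)]
w_o = random_matrix(a * d_head, d, rng)
y = multi_head(x, x, x, heads, w_o)   # self-attention: Q = K = V = x
```

Each head works in the smaller dimension d', and the final projection maps the concatenated heads back to dimension d.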
  
<img src="https://lilianweng.github.io/lil-log/assets/images/transformer.png" alt="multi-head attention" style="width:800px;"/>

$\textbf{Different types of attention}$
- Encoder-Decoder attention
  - Query: output of the previous decoder layer
  - Source: output of the encoder
  - Example: T5
- Encoder self-attention
  - Query and Source: output of the previous encoder layer
  - Example: BERT
- Decoder masked self-attention
  - Query and Source: output of the previous decoder layer
  - Example: GPT

| Architecture       | Pros                                         | Cons                                         |
|-------------------|--------------------------------------------|---------------------------------------------|
| **Encoder-Decoder (Seq2Seq)** | ✅ Great for sequence-to-sequence tasks (translation, summarization)  <br> ✅ Contextualized bidirectional understanding <br> ✅ More flexible for structured output generation | ❌ Computationally expensive (both encoder & decoder) <br> ❌ Slower inference than decoder-only models |
| **Encoder-Only**  | ✅ Deep bidirectional context understanding <br> ✅ Efficient for classification & retrieval tasks <br> ✅ Supports parallel processing | ❌ Cannot generate text <br> ❌ Limited to discriminative tasks |
| **Decoder-Only**  | ✅ Highly efficient for text generation <br> ✅ Scales well for large datasets <br> ✅ Works well for open-ended generation (chatbots, coding) | ❌ Lacks deep bidirectional context <br> ❌ Struggles with tasks requiring full-sentence understanding <br> ❌ Prone to hallucination |



## Masking
- used when generating the next word from the previous words
- prevents the model from attending to future words

$$\text{masked attention}(Q, K, V) = softmax \left(\frac{Q K^T}{\sqrt{d}} + M \right) V$$
- M: mask matrix with 0 at allowed positions and $-\infty$ at future positions
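
A small sketch of how a causal mask of 0 and $-\infty$ acts inside softmax. The function names are hypothetical, and the scores are set to all zeros so that the effect of the mask alone is visible.

```python
import math

NEG_INF = float("-inf")

def causal_mask(n):
    """M[i][j] = 0 if position i may attend to j (j <= i), else -inf."""
    return [[0.0 if j <= i else NEG_INF for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Add the mask to the scores, then apply softmax row by row."""
    out = []
    for row, mask_row in zip(scores, mask):
        masked = [s + m for s, m in zip(row, mask_row)]
        mx = max(masked)
        exps = [math.exp(x - mx) for x in masked]   # exp(-inf) is 0.0
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

scores = [[0.0] * 4 for _ in range(4)]   # uniform scores, for illustration
weights = masked_softmax(scores, causal_mask(4))
```

Row `i` of `weights` spreads its mass evenly over positions `0..i` and assigns exactly zero weight to later positions.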

## Positional Embedding

$$PE_{position, 2i} = \sin \left( \frac{position}{10000^{2i/d}} \right), PE_{position, 2i+1} = \cos \left( \frac{position}{10000^{2i/d}} \right)$$
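
The sinusoidal formula above can be tabulated directly. This sketch assumes an even dimension `d`; the function name and the sizes are chosen only for the example.

```python
import math

def positional_encoding(max_len, d):
    """Return a max_len x d table of sinusoidal encodings (d assumed even)."""
    pe = [[0.0] * d for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(d // 2):
            angle = pos / (10000 ** (2 * i / d))
            pe[pos][2 * i] = math.sin(angle)       # even indices use sine
            pe[pos][2 * i + 1] = math.cos(angle)   # odd indices use cosine
    return pe

pe = positional_encoding(max_len=50, d=8)
```

Position 0 encodes to alternating 0 and 1, and higher dimension indices vary more slowly with position.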

## Comparison of Self-Attention and CNN/RNN
- Self-Attention: parallelizable (GPU), but has no built-in position information
- RNN: hard to parallelize; needs $O(n)$ sequential steps
- CNN: parallelizable, but needs stacked layers to connect distant positions

| Layer Type | Complexity per Layer | Sequential Operations | Max Path Length |
| --- | --- | --- | --- |
| Self-Attention | $O(n^2 * d)$ | $O(1)$ | $O(1)$ |
| RNN | $O(n * d^2)$ | $O(n)$ | $O(n)$ |
| CNN | $O(k * n * d^2)$ | $O(1)$ | $O(\log_k (n))$ |
| Self-Attention (restricted) | $O(r * n * d)$ | $O(1)$ | $O(n / r)$ |

- n = sequence length
- d = representation dimension
- k = kernel size of CNN
- r = neighbor size
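
To make the table concrete, the leading-order terms can be evaluated for specific sizes. This is only arithmetic on the expressions in the table with constants dropped; the values of `n`, `d`, `k`, and `r` are arbitrary choices for illustration.

```python
def per_layer_cost(n, d, k=3, r=16):
    """Leading-order operation counts per layer (constant factors dropped)."""
    return {
        "self_attention": n * n * d,
        "rnn": n * d * d,
        "cnn": k * n * d * d,
        "self_attention_restricted": r * n * d,
    }

short = per_layer_cost(n=100, d=512)       # sentence-length input
long = per_layer_cost(n=10_000, d=512)     # very long input
```

With these numbers, self-attention is cheaper than the recurrent layer when `n < d` and more expensive when `n > d`, which is where the restricted variant helps.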

<p float="middle">
  <img src="images/Transformer/Transformer.png" width="350" />
  <img src="images/Transformer/Cross_attention.png" width="350" /> 
</p>

# Transformer

| Autoregressive Transformer | Non-autoregressive Transformer |
| --- | --- |
|![image.png](images/Transformer/AT_Decoder.png)|![image.png](images/Transformer/NAT_Decoder.png)|
| Outputs one word after another | Outputs the entire sequence at once |
| Advantage: better performance than NAT | Advantage: parallel decoding, controllable output length |
| Disadvantage: decoding cannot be parallelized | Disadvantage: multi-modality problem |
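
The autoregressive column can be sketched as a loop in which each output is fed back as input, which is why decoding cannot run in parallel. `TOY_TABLE` and `toy_next_token` are hypothetical stand-ins for a trained decoder, and greedy selection is assumed for simplicity.

```python
def greedy_decode(next_token, start="<s>", end="</s>", max_len=10):
    """Generate tokens one at a time, feeding each output back as input."""
    sequence = [start]
    while len(sequence) < max_len:
        token = next_token(sequence)   # depends on everything generated so far
        sequence.append(token)
        if token == end:
            break
    return sequence

# Hypothetical stand-in for a trained decoder: a lookup on the last token.
TOY_TABLE = {"<s>": "machine", "machine": "learning", "learning": "</s>"}

def toy_next_token(sequence):
    return TOY_TABLE[sequence[-1]]

result = greedy_decode(toy_next_token)
```

A non-autoregressive decoder would instead predict all positions in one step, which requires deciding the output length some other way.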

Transformer Slides: 
- 2019: https://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2019/Lecture/Transformer%20(v5).pdf
- 2021: https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/seq2seq_v9.pdf

# Transfer Learning (Pretrain + Fine-tune)

## BERT
- architecture: bidirectional self-attention transformer (encoder transformer)
- activation function: Gaussian Error Linear Unit
  $$GELU(x) \approx 0.5 x \left(1 + \tanh \left(\sqrt{\frac{2}{\pi}} (x + 0.044715 x^3) \right) \right)$$
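
As a quick check, the commonly used tanh approximation of GELU (with coefficient $\sqrt{2/\pi}$) can be compared against the exact definition $x \Phi(x)$, where $\Phi$ is the standard normal CDF. The function names below are chosen for this example.

```python
import math

def gelu_exact(x):
    """GELU(x) = x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

points = [-3.0, -1.0, 0.0, 1.0, 3.0]
errors = [abs(gelu_exact(x) - gelu_tanh(x)) for x in points]
```

The two versions agree closely on this range, which is why the cheaper tanh form is often used in practice.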
  
<img src="https://pytorch.org/assets/images/bert1.png" alt="BERT" style="width:800px;"/>

$\textbf{Masked Language Model (MLM)}$
- the man went to [mask1], he bought a [mask2] of milk
- $\hat{x}$: masked sequence
$$p(x_{1:T}) \approx \prod_{t=1}^T p(x_t | \hat{x})^{\textbf{1}(\text{token } t \text{ is masked})}$$
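
In log form, the indicator in the exponent means that only masked positions contribute to the objective. The sketch below assumes the model's predictions are given as per-position dictionaries of token probabilities; these numbers are invented for illustration and do not come from a real model.

```python
import math

def masked_log_likelihood(token_probs, targets, is_masked):
    """Sum of log p(x_t | masked sequence) over masked positions only."""
    total = 0.0
    for probs, target, masked in zip(token_probs, targets, is_masked):
        if masked:   # plays the role of the indicator 1(token is masked)
            total += math.log(probs[target])
    return total

# Hypothetical predictions for three positions; the last two were masked.
token_probs = [
    {"the": 0.9, "a": 0.1},
    {"store": 0.5, "park": 0.5},
    {"gallon": 0.25, "lot": 0.75},
]
targets = ["the", "store", "gallon"]
is_masked = [False, True, True]
log_lik = masked_log_likelihood(token_probs, targets, is_masked)
```

Training maximizes this quantity (or equivalently minimizes its negative); the unmasked first position is ignored.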

$\textbf{Next Sentence Prediction (NSP)}$
- binary classification
- sentence relationship: 
  - Question Answering (QA)
  - Natural Language Inference (NLI)
- example:
  - sentence1: the man went to the store
  - sentence2: he bought a gallon of milk
  - Label: IsNext (1)

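- span prediction (QA fine-tuning): $\overrightarrow{S}$ and $\overrightarrow{E}$ are learned start and end vectors, $\overrightarrow{T_i}$ is the final hidden vector of token $i$, and the softmax is taken over all tokens $i$
$$P_i^{start} = softmax(\overrightarrow{S} \cdot \overrightarrow{T_i}), P_i^{end} = softmax(\overrightarrow{E} \cdot \overrightarrow{T_i})$$
- objective: maximize the log-likelihood of the correct start and end positions
$$\max \sum_{(x_s, b_s, e_s) \in D} \left(\log(P_{b_s}^{start}) + \log(P_{e_s}^{end})\right)$$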
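
A minimal sketch of the start/end probabilities and the per-example objective above. The token vectors and the start/end vectors are tiny made-up values; in a real model they would be the final hidden states and learned parameters.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def span_probabilities(token_vectors, start_vec, end_vec):
    """Softmax over tokens of S . T_i (start) and E . T_i (end)."""
    start = softmax([dot(start_vec, t) for t in token_vectors])
    end = softmax([dot(end_vec, t) for t in token_vectors])
    return start, end

def span_log_likelihood(start, end, b, e):
    """log P_b^start + log P_e^end for a gold span (b, e)."""
    return math.log(start[b]) + math.log(end[e])

# Toy values for illustration only.
token_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
start, end = span_probabilities(token_vectors, start_vec=[2.0, 0.0], end_vec=[0.0, 2.0])
score = span_log_likelihood(start, end, b=0, e=2)
```

Summing `span_log_likelihood` over the training set gives the objective to maximize.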

$\textbf{Token representation}$:
- token embedding
- segment embedding
- position embedding

## GPT
- semi-supervised learning
- unidirectional auto-regressive generative language model
$$p(x_{1:T}) = \prod_{t=1}^T p(x_t | x_{1:t-1})$$

$\textbf{Pretrain}$: unsupervised learning with a standard left-to-right language model objective (predict the next word from the preceding words)
- covers all layers in model: embedding, transformer, output layer, etc.
- why unsupervised pre-training is needed: it helps generalization (robustness) in deep learning
- Reference: https://www.jmlr.org/papers/volume11/erhan10a/erhan10a.pdf
- multi-layer transformer decoder (unidirectional self-attention)
- k: window size
$$L_1(s) = \sum_{t=1}^T log(p(w_t | w_{t-k}, ..., w_{t-1};\Theta))$$
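
The objective $L_1$ sums the log-probability of each token given at most $k$ preceding tokens. In the sketch below, `cond_prob` is a hypothetical callable standing in for the transformer decoder, and `uniform_model` is a trivial placeholder used only to make the example checkable.

```python
import math

def lm_log_likelihood(tokens, cond_prob, k):
    """Sum over t of log p(w_t | w_{t-k}, ..., w_{t-1})."""
    total = 0.0
    for t, token in enumerate(tokens):
        context = tuple(tokens[max(0, t - k):t])   # at most k previous tokens
        total += math.log(cond_prob(token, context))
    return total

def uniform_model(token, context, vocab_size=4):
    """Placeholder model: every token is equally likely."""
    return 1.0 / vocab_size

tokens = ["a", "b", "c", "a"]
l1 = lm_log_likelihood(tokens, uniform_model, k=2)
```

For the first few tokens the context is shorter than `k`, since fewer than `k` tokens precede them.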

$\textbf{Fine-tune}$: supervised learning
- adds a task-specific output layer $W_y$; the pretrained layers may also be updated or partly frozen, depending on the task
- tasks: natural language inference, question answering, sentence similarity, classification
- $h^{(i)}$: output of final layer transformer
$$L_2(s^{(i)}, y^{(i)}) = log(p(y^{(i)}|s^{(i)}))$$
$$p(y^{(i)}|s^{(i)}) = softmax(W_{y^{(i)}}^T h^{(i)})$$

$\textbf{Loss function}$:
$$L_3(s^{(i)}, y^{(i)}) = L_2(s^{(i)}, y^{(i)}) + \lambda L_1(s^{(i)})$$
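
A small numeric sketch of the fine-tuning objective: a softmax classification head on the final hidden state gives $L_2$, and the language-model term $L_1$ is added with weight $\lambda$. The hidden state, weights, the value of `l1`, and `lam` are all made-up numbers for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def classification_log_prob(h, w_y, label):
    """log p(y | s) with p = softmax(W_y^T h); w_y[c] is the weight vector of class c."""
    logits = [sum(w * x for w, x in zip(w_c, h)) for w_c in w_y]
    return math.log(softmax(logits)[label])

def combined_objective(l2, l1, lam):
    """L_3 = L_2 + lambda * L_1."""
    return l2 + lam * l1

h = [1.0, -1.0]                    # hypothetical final-layer output
w_y = [[1.0, 0.0], [0.0, 1.0]]     # hypothetical weights for two classes
l2 = classification_log_prob(h, w_y, label=0)
l1 = -3.0                          # assumed language-model log-likelihood
l3 = combined_objective(l2, l1, lam=0.5)
```

Keeping the language-model term during fine-tuning acts as an auxiliary objective alongside the supervised one.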

## XLNet
- permuted language model

$\textbf{Loss function}$:
- expected log-likelihood over factorization orders (permutations)
- $Z_T$: set of all permutations of the index sequence $[1, ..., T]$
$$\underset{\theta}{\mathrm{max}} \; E_{z \sim Z_T} \left(\sum_{t=1}^T log(p_{\theta}(x_{z_t} | x_{z_{<t}})) \right)$$
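
In a permutation language model, each token is predicted from the tokens that come before it in a sampled factorization order, not from its left neighbors in the sentence. The sketch below only lists which positions are visible at each step; the function name and the example order are assumptions for illustration.

```python
def factorization_steps(tokens, order):
    """For each step t, return (target index, target token, visible context).

    The context maps original positions to tokens for every index that
    precedes the target in the factorization order.
    """
    steps = []
    for t, idx in enumerate(order):
        context = {j: tokens[j] for j in order[:t]}
        steps.append((idx, tokens[idx], context))
    return steps

tokens = ["a", "b", "c"]
steps = factorization_steps(tokens, order=[2, 0, 1])
```

With the order `[2, 0, 1]`, token `"b"` at position 1 is predicted last and sees both `"a"` and `"c"`, so it gets context from both sides.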

$\textbf{Architecture}$: two-stream self-attention
- query stream
- content stream

# Fine-Tuning

| Fine-Tuning Technique | Description | Pros | Cons | Use Cases |
|----------------------|-------------|------|------|-----------|
| **Full Fine-Tuning** | Updates all model weights | ✅ High performance <br> ✅ Best for domain-specific models | ❌ Expensive <br> ❌ Risk of catastrophic forgetting | Specialized chatbots, medical/legal AI |
| **LoRA** | Adds trainable low-rank matrices | ✅ Memory-efficient <br> ✅ Works on large models | ❌ Less effective for extreme domain shifts | Domain adaptation, cost-effective fine-tuning |
| **Adapters** | Inserts small trainable layers | ✅ Multi-task flexibility <br> ✅ Lower cost | ❌ Adds inference latency | Multi-task learning, quick adaptation |
| **Prefix Tuning** | Optimizes a small set of continuous vectors | ✅ Lightweight <br> ✅ No need to modify model weights | ❌ Limited adaptation | Task-specific tuning with minimal resources |
| **Prompt Tuning** | Learns continuous prompt embeddings | ✅ Extremely parameter-efficient | ❌ Works best with instruction-tuned models | Efficient tuning of API-based LLMs |
| **RLHF** | Fine-tunes based on human feedback | ✅ Improves safety & alignment | ❌ Expensive <br> ❌ Requires human labeling | Chatbots, AI alignment |
| **RLAIF** | Fine-tunes using AI feedback | ✅ Scalable alternative to RLHF | ❌ Reinforces model biases | Large-scale alignment |
| **Multi-Task Fine-Tuning** | Trains on multiple tasks simultaneously | ✅ Efficient for general-purpose AI | ❌ Risk of task interference | General NLP assistants |
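
As an illustration of the LoRA row, the sketch below adds a trainable low-rank update to a frozen weight matrix. It assumes a row-vector convention (`x @ W`), zero initialization of one factor so that training starts from the pretrained behavior, and tiny made-up sizes; the names are hypothetical.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_forward(x, w, a, b, scale=1.0):
    """Compute x W + scale * (x A) B, where W is frozen and A, B are trainable."""
    base = matmul(x, w)
    update = matmul(matmul(x, a), b)
    return [[p + scale * q for p, q in zip(rb, ru)] for rb, ru in zip(base, update)]

rng = random.Random(0)
d_in, d_out, r = 8, 8, 2
w = [[rng.gauss(0, 1) for _ in range(d_out)] for _ in range(d_in)]   # frozen weights
a = [[rng.gauss(0, 0.01) for _ in range(r)] for _ in range(d_in)]    # trainable, random init
b = [[0.0] * d_out for _ in range(r)]                                # trainable, zero init
x = [[1.0] * d_in]

full_params = d_in * d_out          # parameters updated by full fine-tuning
lora_params = r * (d_in + d_out)    # parameters updated by the low-rank factors
```

Because `b` starts at zero, the initial output matches the frozen layer, and the number of trainable parameters grows with `r` instead of with `d_in * d_out`.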


# References

- Attention: https://arxiv.org/pdf/1706.03762.pdf
- BERT: https://arxiv.org/pdf/1810.04805.pdf
- XLNet: https://arxiv.org/pdf/1906.08237.pdf
- GPT: https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf