# Attention

## Single-Head Attention

$$Attention(Q, K, V) = softmax \left( \frac{Q K^T}{\sqrt{d}} \right) V$$
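- application: machine translation
- $\sqrt{d}$: scales the dot products down by the key dimension $d$ so they do not grow large and push the softmax into regions where gradients vanish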
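
The effect of the $\sqrt{d}$ scaling can be illustrated with a small NumPy sketch. The dimension, the number of keys, and the random inputs are assumptions chosen for illustration only. Without scaling, the softmax tends to put almost all of its weight on one key, which is the saturated regime where gradients become very small.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)      # fixed seed, for reproducibility only
d = 256                             # assumed query/key dimension
q = rng.standard_normal(d)          # one query vector
K = rng.standard_normal((8, d))     # eight key vectors

raw_scores = K @ q                        # spread grows roughly like sqrt(d)
scaled_scores = raw_scores / np.sqrt(d)   # spread stays close to 1

max_weight_raw = softmax(raw_scores).max()
max_weight_scaled = softmax(scaled_scores).max()

# The unscaled softmax is more peaked than the scaled one.
print(max_weight_raw, max_weight_scaled)
```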

$\textbf{Key Components}$:
- Q (query, from the decoder): vector representing the current word
- Source (from the encoder):
  - K (key): memory (all the words generated before)
  - V (value): can be the same as the key

<img src="https://cdn-images-1.medium.com/max/900/1*NIxhlMqHFyhllBm4v1j2-A.png" alt="attention" style="width:600px;"/>

$\textbf{Steps}$:
1. Compute similarity between query and source
  - $e^{\langle q, k \rangle}$: similarity between query and key
    - dot product: $Sim(Query, Key_i) = Query \cdot Key_i$
    - cosine similarity: $Sim(Query, Key_i) = \frac{Query \cdot Key_i}{||Query|| \cdot ||Key_i||}$
    - multi-layer perceptron: $Sim(Query, Key_i) = MLP(Query, Key_i)$
2. Apply softmax to all the similarities
  - gives an attention weight for each word (the weights sum to 1)
  $$\alpha^{\langle q, k_i \rangle} = softmax(e^{\langle q, k_i \rangle}) = \frac{exp(e^{\langle q, k_i \rangle})}{\sum_{j=1}^{L_x} exp(e^{\langle q, k_j \rangle})}$$
3. Compute attention between query and source as the weighted sum of the values
  $$Attention(Query, Source) = \sum_{i=1}^{L_x} \alpha^{\langle q, k_i \rangle} * Value_i$$
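
The three steps above can be sketched in NumPy as follows. This is a minimal illustration, assuming the scaled dot product as the similarity function; the shapes and random inputs are made up for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention.

    Q: (L_q, d) queries, K: (L_x, d) keys, V: (L_x, d_v) values.
    Returns the output (L_q, d_v) and the attention weights (L_q, L_x).
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # step 1: similarity of each query to each key
    weights = softmax(scores, axis=-1)   # step 2: weights over the source, rows sum to 1
    return weights @ V, weights          # step 3: weighted sum of the values

rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 4))   # 2 queries of dimension 4
K = rng.standard_normal((3, 4))   # 3 source positions
V = rng.standard_normal((3, 5))   # values of dimension 5

out, weights = attention(Q, K, V)
print(out.shape)             # (2, 5)
print(weights.sum(axis=-1))  # each row sums to 1
```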

## Multi-Head Attention

$\textbf{Steps}$:
1. Apply a linear transformation to Q, K, V before attention
  - $W_i^Q, W_i^K, W_i^V$: project each matrix from dimension $d$ to $d'$ for head $i$
  $$\tilde{V}_i = Attention(QW_i^Q, KW_i^K, VW_i^V)$$
2. Concatenate (merge by column) the outputs of all $a$ heads, then project with $W^O$
  $$MultiHead(Q, K, V) = Concat(\tilde{V}_1, ...,\tilde{V}_a) W^O$$
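
A minimal NumPy sketch of the two steps, assuming $a = 2$ heads, $d = 8$, and $d' = d / a$. The random projection matrices stand in for learned parameters and are purely illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for a single head."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

def multi_head(Q, K, V, W_q, W_k, W_v, W_o):
    """W_q, W_k, W_v: one (d, d') matrix per head; W_o: (a * d', d)."""
    heads = [
        attention(Q @ wq, K @ wk, V @ wv)   # step 1: project, then attend
        for wq, wk, wv in zip(W_q, W_k, W_v)
    ]
    return np.concatenate(heads, axis=-1) @ W_o   # step 2: concat by column, project

rng = np.random.default_rng(0)
d, a = 8, 2            # assumed model dimension and number of heads
d_head = d // a        # d' in the notation above
Q = rng.standard_normal((3, d))   # 3 query positions
K = rng.standard_normal((5, d))   # 5 source positions
V = rng.standard_normal((5, d))

W_q = [rng.standard_normal((d, d_head)) for _ in range(a)]
W_k = [rng.standard_normal((d, d_head)) for _ in range(a)]
W_v = [rng.standard_normal((d, d_head)) for _ in range(a)]
W_o = rng.standard_normal((a * d_head, d))

out = multi_head(Q, K, V, W_q, W_k, W_v, W_o)
print(out.shape)  # (3, 8)
```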
  
<img src="https://lilianweng.github.io/lil-log/assets/images/transformer.png" alt="multi-head attention" style="width:800px;"/>

$\textbf{Different types of attention}$
- Encoder-Decoder attention
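  - Query: output of the previous decoder layer
  - Source (keys and values): output of the encoder
- Encoder self-attention
  - Query and Source: output of the previous encoder layer
- Decoder masked self-attention
  - Query and Source: output of the previous decoder layer, with future positions masked out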
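
The masking used in decoder self-attention can be sketched as below. As a simplifying assumption, the learned Q, K, V projections are omitted and the input `X` is used directly for all three; the point is only that position $i$ cannot attend to positions after $i$.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(X):
    """Self-attention over X (L, d) where position i only sees positions <= i."""
    L, d = X.shape
    scores = X @ X.T / np.sqrt(d)
    future = np.triu(np.ones((L, L), dtype=bool), k=1)  # True strictly above the diagonal
    scores = np.where(future, -np.inf, scores)          # future positions get zero weight
    weights = softmax(scores)
    return weights @ X, weights

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))   # 4 positions, dimension 8 (assumed sizes)
out, weights = masked_self_attention(X)
print(np.round(weights, 2))       # lower-triangular attention weights
```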

# Transfer Learning (Pretrain + Fine-tune)

## BERT
- architecture: bidirectional self-attention transformer (encoder transformer)
- activation function: Gaussian Error Linear Unit
  $$GELU(x) \approx 0.5 x \left(1 + tanh \left(\sqrt{\frac{2}{\pi}} (x + 0.044715 x^3) \right) \right)$$
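
As a quick check, the sketch below compares the exact GELU, $x \, \Phi(x)$ written with the error function, against the tanh approximation with constant $\sqrt{2/\pi}$. The function names are chosen for this example only.

```python
import math

def gelu_exact(x):
    """GELU(x) = x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

test_points = (-2.0, -0.5, 0.0, 0.5, 2.0)
for x in test_points:
    print(x, gelu_exact(x), gelu_tanh(x))
```

The two columns should agree closely at every test point.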
  
<img src="https://pytorch.org/assets/images/bert1.png" alt="BERT" style="width:800px;"/>

$\textbf{Masked Language Model (MLM)}$
- example: the man went to [mask1], he bought a [mask2] of milk
- $\hat{x}$: masked sequence
$$p(x_{1:T}) \approx \prod_{t=1}^T p(x_t | \hat{x})^{\textbf{1}(\text{token } t \text{ is masked})}$$
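
The role of the indicator in the exponent is easiest to see in log space: only masked positions contribute to the objective. In the sketch below the masked positions and the per-token probabilities are hypothetical numbers, not real model outputs.

```python
import math

tokens = ["the", "man", "went", "to", "the", "store"]
masked_positions = {3, 5}   # hypothetical choice of positions to mask

# x_hat: the corrupted sequence the model actually sees
x_hat = ["[MASK]" if i in masked_positions else tok
         for i, tok in enumerate(tokens)]

# Hypothetical values of p(x_t | x_hat) for the true token at each position
p_true = [0.9, 0.8, 0.7, 0.5, 0.9, 0.25]

# Indicator in the exponent: unmasked positions contribute log(p ** 0) = 0
log_likelihood = sum(
    math.log(p) for i, p in enumerate(p_true) if i in masked_positions
)

print(x_hat)
print(log_likelihood)   # log(0.5) + log(0.25)
```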

$\textbf{Next Sentence Prediction (NSP)}$
- binary classification
- models the relationship between sentences, which matters for downstream tasks such as:
  - Question Answering (QA)
  - Natural Language Inference (NLI)
- example:
  - sentence1: the man went to the store
  - sentence2: he bought a gallon of milk
  - Label: IsNext (1)

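$\textbf{Fine-tuning for Question Answering}$
- $\overrightarrow{S}$, $\overrightarrow{E}$: learned start and end vectors
- $\overrightarrow{T_i}$: final hidden vector of token $i$
- probability that token $i$ is the start or the end of the answer span (softmax over all tokens $j$)
$$P_i^{start} = \frac{exp(\overrightarrow{S} \cdot \overrightarrow{T_i})}{\sum_j exp(\overrightarrow{S} \cdot \overrightarrow{T_j})}, \quad P_i^{end} = \frac{exp(\overrightarrow{E} \cdot \overrightarrow{T_i})}{\sum_j exp(\overrightarrow{E} \cdot \overrightarrow{T_j})}$$
- objective: maximize the log-likelihood of the correct start and end positions $(b_s, e_s)$
$$max \sum_{(x_s, b_s, e_s) \in D} \left(log(P_{b_s}^{start}) + log(P_{e_s}^{end})\right)$$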
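
The start and end probabilities and the per-example objective can be sketched as follows. The hidden vectors, the start/end vectors, and the gold span indices are random or hypothetical placeholders, not outputs of a trained model.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

rng = np.random.default_rng(0)
n_tokens, hidden = 6, 4                       # assumed sizes
T = rng.standard_normal((n_tokens, hidden))   # final hidden vector T_i per token
S = rng.standard_normal(hidden)               # start vector
E = rng.standard_normal(hidden)               # end vector

p_start = softmax(T @ S)   # P_i^start over all tokens
p_end = softmax(T @ E)     # P_i^end over all tokens

b_s, e_s = 2, 4            # hypothetical gold start and end indices
log_likelihood = np.log(p_start[b_s]) + np.log(p_end[e_s])
print(log_likelihood)      # the term summed over the dataset D
```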

$\textbf{Token representation}$:
- token embedding
- segment embedding
- position embedding
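
BERT builds each input vector by summing the three embeddings. The sketch below shows this for the sentence pair from the NSP example, assuming whitespace tokenization, a toy vocabulary, and random embedding tables in place of WordPiece and learned parameters.

```python
import numpy as np

sentence1 = "the man went to the store".split()
sentence2 = "he bought a gallon of milk".split()

# [CLS] sentence1 [SEP] sentence2 [SEP]
tokens = ["[CLS]"] + sentence1 + ["[SEP]"] + sentence2 + ["[SEP]"]
segment_ids = [0] * (len(sentence1) + 2) + [1] * (len(sentence2) + 1)
position_ids = list(range(len(tokens)))

# Toy vocabulary built from the tokens themselves (illustrative only)
vocab = {tok: i for i, tok in enumerate(dict.fromkeys(tokens))}
token_ids = [vocab[tok] for tok in tokens]

rng = np.random.default_rng(0)
hidden = 8                                              # assumed embedding size
token_emb = rng.standard_normal((len(vocab), hidden))
segment_emb = rng.standard_normal((2, hidden))
position_emb = rng.standard_normal((len(tokens), hidden))

# Input representation: element-wise sum of the three embeddings
x = token_emb[token_ids] + segment_emb[segment_ids] + position_emb[position_ids]
print(x.shape)  # (15, 8)
```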

## GPT
- semi-supervised learning: unsupervised pretraining followed by supervised fine-tuning
- auto-regressive generative language model
$$p(x_{1:T}) = \prod_{t=1}^T p(x_t | x_{1:t-1})$$

$\textbf{Pretrain}$: unsupervised learning
- multi-layer transformer decoder (unidirectional self-attention)
- k: size of the context window
$$L_1(s) = \sum_{t=1}^T log(p(w_t | w_{t-k}, ..., w_{t-1};\Theta))$$
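
The windowed objective can be sketched in plain Python. The `uniform_model` function is a hypothetical stand-in for $p(w_t | w_{t-k}, ..., w_{t-1}; \Theta)$ that ignores its context; a real model would be a transformer decoder.

```python
import math

def context_windows(tokens, k):
    """Yield (context, target) pairs; context holds up to k preceding tokens."""
    for t, target in enumerate(tokens):
        yield tokens[max(0, t - k):t], target

def uniform_model(context, target, vocab_size):
    """Hypothetical placeholder model: uniform over the vocabulary."""
    return 1.0 / vocab_size

tokens = ["he", "bought", "a", "gallon", "of", "milk"]
k = 3   # assumed window size

pairs = list(context_windows(tokens, k))
L1 = sum(math.log(uniform_model(ctx, w, vocab_size=6)) for ctx, w in pairs)

for ctx, w in pairs:
    print(ctx, "->", w)
print(L1)   # 6 * log(1/6) for the uniform placeholder
```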

$\textbf{Fine-tune}$: supervised learning
- task-specific: the input format and the output layer depend on the task
- $h^{(i)}$: output of the final transformer layer for input $s^{(i)}$
$$L_2(s^{(i)}, y^{(i)}) = log(p(y^{(i)}|s^{(i)}))$$
$$p(y^{(i)}|s^{(i)}) = softmax(W_{y^{(i)}}^T h^{(i)})$$

$\textbf{Loss function}$:
$$L_3(s^{(i)}, y^{(i)}) = L_2(s^{(i)}, y^{(i)}) + \lambda L_1(s^{(i)})$$
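
A small numeric sketch of how the fine-tuning terms combine. The hidden vector, the output matrix, the gold label, the language-model log-likelihood, and the value of $\lambda$ are all assumed values for illustration.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

rng = np.random.default_rng(0)
hidden, n_classes = 4, 3                        # assumed sizes
h = rng.standard_normal(hidden)                 # h^(i): final-layer output
W_y = rng.standard_normal((hidden, n_classes))  # output layer weights
y = 1                                           # hypothetical gold label

p_y = softmax(W_y.T @ h)    # p(y | s) over all classes
L2 = np.log(p_y[y])         # supervised log-likelihood

L1 = -12.5                  # hypothetical language-model log-likelihood of s
lam = 0.5                   # assumed weight of the auxiliary LM term

L3 = L2 + lam * L1
print(L2, L3)
```

Keeping the language-model term during fine-tuning acts as an auxiliary objective, with $\lambda$ controlling how much it contributes.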

## XLNet
- permutation language model

$\textbf{Loss function}$:
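- maximize the expected log-likelihood over all permutations of the factorization order
- $Z_T$: set of all permutations of the index sequence $[1, ..., T]$; $z_t$: the $t$-th element of permutation $z$
$$\underset{\theta}{\mathrm{max}} \; E_{z \sim Z_T} \left(\sum_{t=1}^T log(p_{\theta}(x_{z_t} | x_{z_{1:t-1}})) \right)$$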
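
To make the factorization orders concrete, the sketch below enumerates all permutations for $T = 3$ and lists which positions each target may condition on. It only illustrates the conditioning structure; no model is involved, and the helper name `factorization` is made up for this example.

```python
from itertools import permutations

T = 3
positions = list(range(1, T + 1))
orders = list(permutations(positions))   # Z_T: all T! factorization orders

def factorization(z):
    """For order z, return (target position, positions it conditions on)."""
    return [(z[t], sorted(z[:t])) for t in range(len(z))]

for z in orders:
    print(z, factorization(z))
```

For the order `(3, 1, 2)`, position 3 is predicted with no context, position 1 sees position 3, and position 2 sees positions 1 and 3. Averaged over all orders, every token is predicted from context on both sides.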

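$\textbf{Architecture}$: two-stream self-attention
- query stream: sees the position of the target token but not its content
- content stream: sees both the position and the content of the token, as in standard self-attention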
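
The difference between the two streams comes down to their attention masks. The sketch below builds both masks for one factorization order, using 0-based positions; the function name and the boolean-mask convention are assumptions made for this example.

```python
import numpy as np

def two_stream_masks(z):
    """Build attention masks for factorization order z (0-based positions).

    mask[i, j] is True when position i may attend to position j.
    """
    T = len(z)
    rank = np.empty(T, dtype=int)
    rank[list(z)] = np.arange(T)                    # rank[pos]: step at which pos is predicted
    query_mask = rank[:, None] > rank[None, :]      # only positions earlier in the order
    content_mask = rank[:, None] >= rank[None, :]   # earlier positions plus itself
    return content_mask, query_mask

content, query = two_stream_masks((2, 0, 1))
print(query.astype(int))
print(content.astype(int))
```

The content mask is the query mask plus the diagonal: the content stream may look at its own token, while the query stream must predict it without seeing it.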

# References

- Attention: https://arxiv.org/pdf/1706.03762.pdf
- BERT: https://arxiv.org/pdf/1810.04805.pdf
- XLNet: https://arxiv.org/pdf/1906.08237.pdf
- GPT: https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf