# Chapter 4: NLP

[ref](https://www.borealisai.com/research-blogs/tutorial-14-transformers-i-introduction/)

## 0. Pre-transformer

### 0.1 RNN

\begin{align}
z^{(l)} &= Wx^{(l)} + b \\
h^{(l)} &= \sigma(z^{(l)}) \\
z^{(l+1)} &= Uh^{(l)} + Vx^{(l+1)} \label{rnn_eq}\tag{Eq 0.1}
\end{align}
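
The recurrence above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: it assumes the superscript $(l)$ indexes the position in the sequence and that $\sigma$ is the logistic sigmoid. The function name `rnn_forward` and all shapes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_forward(xs, W, b, U, V):
    """Apply the recurrence of Eq 0.1 to a sequence xs of shape (T, d_in)."""
    z = W @ xs[0] + b                    # first pre-activation: z = W x + b
    hs = []
    for t in range(len(xs)):
        h = sigmoid(z)                   # h = sigma(z)
        hs.append(h)
        if t + 1 < len(xs):
            z = U @ h + V @ xs[t + 1]    # next pre-activation: U h + V x_next
    return np.stack(hs)

rng = np.random.default_rng(0)
T, d_in, d_h = 5, 3, 4                   # illustrative sizes
xs = rng.normal(size=(T, d_in))
W = rng.normal(size=(d_h, d_in))
b = np.zeros(d_h)
U = rng.normal(size=(d_h, d_h))
V = rng.normal(size=(d_h, d_in))

hs = rnn_forward(xs, W, b, U, V)         # one hidden state per step, shape (T, d_h)
```

The same `U` and `V` are reused at every step, which is what lets the hidden state carry information along the sequence.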

### 0.2 LSTM

An LSTM consists of the following components:

**3 variables**:
- $c_t$: cell state to persist memory
- $x_t$: input to the current LSTM cell
- $h_{t-1}$: embedded output of the previous LSTM cell

**3 gates**:
- $f_t$: forget gate, which decides how much of the previous cell state to forget
- $i_t$: input gate, which decides how the new input is incorporated into the *new state*
- $o_t$: output gate, which decides whether the output is passed to the next cell

\begin{align}
c_t &= f_t \times c_{t-1} + i_t \times \tanh(w_c[h_{t-1}, x_t] + b_c) \\
h_t &= o_t \times \tanh(c_t)
\end{align}

In turn, the three gates are computed as

\begin{align}
f_t &= \sigma(w_f[h_{t-1}, x_t] + b_f) \\
i_t &= \sigma(w_i[h_{t-1}, x_t] + b_i) \\
o_t &= \sigma(w_o[h_{t-1}, x_t] + b_o)
\end{align}

In a nutshell, 

|$i_t$|$f_t$|behavior|
|-----|-----|--------|
|0  |1  |remember and use the previous state|
|1  |1  |add input to the previous state|
|0  |0  |erase the previous state|
|1  |0  |overwrite state with new input|
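
The LSTM equations and the gate table can be checked with a small NumPy sketch. It assumes $\times$ means element-wise multiplication and $[h_{t-1}, x_t]$ means concatenation. The helper `make_params` is hypothetical: it sets all weights to zero and uses large biases purely to force the gates close to 0 or 1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    hx = np.concatenate([h_prev, x_t])              # [h_{t-1}, x_t]
    f_t = sigmoid(p["w_f"] @ hx + p["b_f"])         # forget gate
    i_t = sigmoid(p["w_i"] @ hx + p["b_i"])         # input gate
    o_t = sigmoid(p["w_o"] @ hx + p["b_o"])         # output gate
    c_tilde = np.tanh(p["w_c"] @ hx + p["b_c"])     # candidate state
    c_t = f_t * c_prev + i_t * c_tilde
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

def make_params(d_h, d_in, b_f, b_i, b_c=0.5):
    # Hypothetical helper: zero weights, biases chosen to saturate the gates.
    zeros = np.zeros((d_h, d_h + d_in))
    return {
        "w_f": zeros, "b_f": np.full(d_h, b_f),
        "w_i": zeros, "b_i": np.full(d_h, b_i),
        "w_o": zeros, "b_o": np.zeros(d_h),
        "w_c": zeros, "b_c": np.full(d_h, b_c),
    }

d_h, d_in = 3, 2
x_t = np.ones(d_in)
h_prev = np.zeros(d_h)
c_prev = np.array([0.2, -0.4, 0.9])

# i_t ~ 0, f_t ~ 1: remember the previous state
_, c_remember = lstm_step(x_t, h_prev, c_prev, make_params(d_h, d_in, b_f=50.0, b_i=-50.0))
# i_t ~ 1, f_t ~ 0: overwrite the state with the new candidate
_, c_overwrite = lstm_step(x_t, h_prev, c_prev, make_params(d_h, d_in, b_f=-50.0, b_i=50.0))
```

In the first case `c_remember` stays at `c_prev`; in the second `c_overwrite` equals the candidate `tanh(b_c)`, matching the first and last rows of the table.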

## 1. Transformers

### 1.1 Training objectives

When a transformer-based model is trained, it maximizes the log-likelihood of observing some sequences. Depending on the exact form of the likelihood, there are three types of objectives:


#### Autoencoding (AE)

The likelihood is factorized as the product of probability of predicting all masked tokens $\vec{x}_m$ conditioned on all the unmasked tokens (corrupted sequence):

\begin{equation}
\max_\theta \log P_\theta(\vec{x}_m | \vec{x}_{\overline{m}}) \approx \sum_{t=1}^T m_t \log P_\theta(x_t |\vec{x}_{\overline{m}}) = \sum_{t=1}^T m_t \log \frac{\exp(H(x_t|\vec{x}_{\overline{m}}))}{\sum_{x'} \exp(H(x'|\vec{x}_{\overline{m}}))}
\end{equation}

where $m_t$ is one if the $t$-th token is masked, and zero otherwise. $H(x_t|\vec{x}_{\overline{m}})$ is some energy function of token $x_t$ given the corrupted sequence $\vec{x}_{\overline{m}}$.

This is the Masked Language Modeling (MLM) objective used to train BERT. BERT's other pre-training task is Next Sentence Prediction (NSP).
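
As a rough sketch of the masked objective, the snippet below evaluates $\sum_t m_t \log P_\theta(x_t|\cdot)$ from a matrix of energies. The array `logits` stands in for $H(x'|\cdot)$ at every position; in practice it would come from the encoder, so the values here are purely illustrative.

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)   # for numerical stability
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def masked_log_likelihood(logits, targets, mask):
    """logits: (T, vocab) energies, targets: (T,) token ids, mask: (T,) with 1 = masked."""
    log_probs = log_softmax(logits)
    token_log_probs = log_probs[np.arange(len(targets)), targets]
    return float((mask * token_log_probs).sum())            # only masked positions count

T, vocab = 5, 4
logits = np.zeros((T, vocab))            # uniform prediction at every position
targets = np.array([0, 3, 1, 2, 0])
mask = np.array([0, 1, 0, 1, 0])         # positions 1 and 3 are masked

ll = masked_log_likelihood(logits, targets, mask)
```

With uniform predictions, each of the two masked positions contributes $\log(1/4)$; the unmasked positions contribute nothing.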

#### Autoregressive (AR)

The likelihood is factorized as the product of probability of predicting the target token at each position $x_t$ conditioned on preceding tokens in the sequence:

\begin{equation}
\max_\theta \log P_\theta(x_1,\dots,x_T) \approx \sum_{t=1}^T \log P_\theta(x_t|x_1,\dots,x_{t-1}) = \sum_{t=1}^T \log \frac{\exp(H(x_t|x_1,\dots,x_{t-1}))}{\sum_{x'} \exp(H(x'|x_1,\dots,x_{t-1}))}
\end{equation}
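
To make the autoregressive factorization concrete, here is a toy sketch. It makes a strong simplifying assumption: the energy $H$ depends only on the immediately preceding token (a bigram-style table), whereas a real transformer conditions on the whole prefix. The names `energy` and `start_energy` are hypothetical.

```python
import itertools
import numpy as np

def log_softmax(v):
    v = v - v.max()
    return v - np.log(np.exp(v).sum())

def ar_log_likelihood(tokens, energy, start_energy):
    """sum_t log P(x_t | x_<t) for a toy model that only looks at x_{t-1}."""
    total = log_softmax(start_energy)[tokens[0]]         # first token has no context
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        total += log_softmax(energy[prev])[cur]          # softmax over the vocabulary
    return float(total)

rng = np.random.default_rng(0)
vocab, T = 3, 3
energy = rng.normal(size=(vocab, vocab))                 # energy[prev, cur]
start_energy = rng.normal(size=vocab)

# Summing the sequence probability over all vocab**T sequences should give 1.
total_prob = sum(
    np.exp(ar_log_likelihood(seq, energy, start_energy))
    for seq in itertools.product(range(vocab), repeat=T)
)
```

Because each conditional is a normalized softmax, the product of conditionals defines a valid distribution over whole sequences, which is why `total_prob` comes out as 1.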

#### Permutation objective

Similar to the AR objective, the likelihood is factorized as the product of the probabilities of predicting the target token at each position $x_t$, conditioned on the preceding tokens in a **permuted** ordering of the sequence:

\begin{equation}
\max_\theta \log P_\theta(x_1,..x_T) \approx \max_\theta \mathbb{E}_{z\sim Z_T}\Big[\sum_{t=1}^T \log P_\theta(x_{z_t}|x_{z_{<t}}) \Big]
\end{equation}

where $z$ denotes a permutation in the set $Z_T$, which contains all possible permutations of the text sequence $x$ of length $T$. The $t$-th token of the permuted sequence $z$ is denoted by $x_{z_t}$, and the tokens preceding the $t$-th position are denoted by $x_{z_{<t}}$. Despite the permutation, the ordering of words is still somewhat intact thanks to (relative) positional encoding. However, too much shuffling breaks this mechanism, so an *attention mask* is used instead. For details, see:

[overview of different transformer models](https://medium.com/@yulemoon/an-in-depth-look-at-the-transformer-based-models-22e5f5d17b6b#:~:text=Below%20are%20examples%20of%20models,decoder%2Donly%2C%20or%20both%3A&text=Encoder%2Donly%20Models%3A%20BERT%3B&text=Decoder%2Donly%20Models%3A%20GPT%2C,(Transformer%2DXL%2C%20permutation)%3B&text=Encoder%2DDecoder%20Models%3A%20T5%2C%20BART.)

[overview of XLnet](https://www.borealisai.com/research-blogs/understanding-xlnet/)
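
One way to picture the attention-mask idea: the tokens stay in their original positions, and the factorization order $z$ only decides which positions each token may attend to. The sketch below builds such a mask. It illustrates the principle only and is not XLNet's actual two-stream implementation; the function name is hypothetical.

```python
import numpy as np

def permutation_attention_mask(z):
    """mask[i, j] is True when position i may attend to position j,
    i.e. when j comes before i in the factorization order z."""
    z = np.asarray(z)
    rank = np.empty_like(z)
    rank[z] = np.arange(len(z))          # rank[pos] = step at which pos is predicted
    return rank[None, :] < rank[:, None]

z = [2, 0, 3, 1]                          # predict position 2 first, then 0, 3, 1
mask = permutation_attention_mask(z)

# The identity order recovers the usual (strictly) causal mask.
causal = permutation_attention_mask([0, 1, 2, 3])
```

Here position 2 is predicted first and sees nothing, while position 1 is predicted last and sees positions 2, 0 and 3.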

### 1.2 Tokenizer

### 1.3 Positional encoding

On graphs, where no natural ordering of nodes exists, multiple alternatives
were proposed to such positional encodings. While we defer discussing
these alternatives for later, we note that one promising direction involves a
realisation that the positional encodings used in Transformers can be directly
related to the discrete Fourier transform (DFT), and hence to the eigenvectors
of the graph Laplacian of a “circular grid”. Hence, Transformers’ positional
encodings are implicitly representing our assumption that input nodes are
connected in a grid. For more general graph structures, one may simply
use the Laplacian eigenvectors of the (assumed) graph—an observation
exploited by Dwivedi and Bresson (2020) within their empirically powerful
Graph Transformer model.

[laplacian - sinusoidal position encoding](https://math.stackexchange.com/questions/3853424/what-does-the-value-of-eigenvectors-of-a-graph-laplacian-matrix-mean)
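
The link between sinusoids and the Laplacian of a circular grid can be verified numerically. The sketch below builds the Laplacian of a cycle graph with `n` nodes and checks that sampled cosine and sine waves are eigenvectors with eigenvalue $2 - 2\cos(2\pi k/n)$. The values of `n` and `k` are arbitrary choices for illustration.

```python
import numpy as np

n = 16
idx = np.arange(n)

# Adjacency of a cycle graph: node j is connected to j+1 and j-1 (mod n).
A = np.zeros((n, n))
A[idx, (idx + 1) % n] = 1
A[idx, (idx - 1) % n] = 1
L = 2 * np.eye(n) - A                    # Laplacian = degree matrix - adjacency

k = 3                                    # frequency index
v_cos = np.cos(2 * np.pi * k * idx / n)
v_sin = np.sin(2 * np.pi * k * idx / n)
lam = 2 - 2 * np.cos(2 * np.pi * k / n)  # corresponding eigenvalue

residual_cos = np.abs(L @ v_cos - lam * v_cos).max()
residual_sin = np.abs(L @ v_sin - lam * v_sin).max()
```

Both residuals vanish up to floating-point error, so each sine/cosine pair at a given frequency spans an eigenspace of the cycle-graph Laplacian.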

### 1.4 Residual connection and layer normalization

Layer normalization normalizes each node using the mean and standard deviation of all the nodes in the same layer. This is different from batch normalization, which normalizes each node using the mean and standard deviation of that feature across all data in the batch.

The basic inspiration behind Batch Normalization is the long-known observation that training in neural networks works best when the inputs are centered around zero with respect to the bias. The reason for this is that it prevents neurons from saturating and gradients from vanishing in deep nets. In the absence of such centering, changes in parameters in lower layers can give rise to saturation effects in higher layers, and vanishing gradients. The idea of Batch Normalization is to introduce additional new “BatchNorm” layers that standardize the inputs by the mean and variance of the mini-batch.

Layer normalization is used in the transformer because the statistics of language data exhibit large fluctuations across the batch dimension, and this leads to instability in batch normalization. 
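
The difference between the two normalizations comes down to the axis over which the statistics are computed. The sketch below assumes an input of shape `(batch, features)` and omits the learnable scale and shift that real layers usually include.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Statistics over the feature axis: each sample is normalized on its own.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def batch_norm(x, eps=1e-5):
    # Statistics over the batch axis: each feature is normalized across samples.
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(8, 16))   # (batch, features)

ln = layer_norm(x)
bn = batch_norm(x)

# Layer norm of one sample does not depend on the rest of the batch.
ln_single = layer_norm(x[:1])
```

Since `layer_norm` never looks across the batch axis, its output for a given sample is unaffected by fluctuations in the other samples, unlike `batch_norm`.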

### 1.5 Fine-tuning by specific downstream tasks

[Improving Language Understanding by Generative Pre Training](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf)

Fine-tuning updates the model weights after adding an MLP layer plus an appropriate output layer at the end: for example, a sigmoid for binary classification such as sentiment prediction, or a softmax for multi-class classification such as NER. Fine-tuning is done using supervised training data.

[ref](https://rpradeepmenon.medium.com/a-deep-dive-into-fine-tuning-of-large-language-models-96f7029ac0e1#:~:text=In%20contrast%20to%20few%2Dshot,the%20model's%20weights%20and%20biases.)
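
A forward-pass-only sketch of such a task head is shown below. Everything here is hypothetical: `h` stands in for the pooled output of a pre-trained model, the weights are random rather than learned, and the training loop on supervised data is omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mlp_head(h, W1, b1, W2, b2):
    hidden = np.tanh(h @ W1 + b1)        # added MLP layer
    return hidden @ W2 + b2              # logits for the output layer

rng = np.random.default_rng(0)
d_model, d_hidden, n_classes = 8, 16, 3
h = rng.normal(size=(4, d_model))        # stand-in for model outputs, batch of 4

W1 = rng.normal(size=(d_model, d_hidden)) * 0.1
b1 = np.zeros(d_hidden)
W_bin = rng.normal(size=(d_hidden, 1)) * 0.1
W_multi = rng.normal(size=(d_hidden, n_classes)) * 0.1

# Binary task (e.g. sentiment): one logit passed through a sigmoid.
p_binary = sigmoid(mlp_head(h, W1, b1, W_bin, np.zeros(1)))[:, 0]
# Multi-class task: one logit per class passed through a softmax.
p_multi = softmax(mlp_head(h, W1, b1, W_multi, np.zeros(n_classes)))
```

Only the output layer changes between the two tasks; the MLP layer and the underlying model are shared.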

## 2. Model summary

[ref](https://arxiv.org/pdf/2304.13712.pdf)

- BERT
    - encoders only
    - bidirectional
    - pre-trained using autoencoder (AE) objective, i.e. MLM and NSP
    - downstream tasks are usually not about predicting masked words, so fine-tuning the model is a must. Fine-tune the model according to one of the four downstream tasks:
        - Sentence Pair Classification Tasks (e.g., semantic similarity between two sentences)
        - Single Sentence Classification Tasks (e.g., sentiment analysis)
        - SQuAD (Question-Answering)
        - Named Entity Tagging
    - Limitations
        - if two words are masked, the prediction of one does not leverage information of the prediction of the other word
        
- GPT
    - decoders only
    - autoregressive (AR) objective, either left-to-right or right-to-left
    - downstream tasks are usually the same as how the model is pre-trained, so fine-tuning is optional. Again, fine-tune the model according to one of the four downstream tasks:
        - Classification
        - Entailment
        - Similarity
        - Multiple-choice
    - 
        
- XLNet
    - decoders only
    - permutation objective. Tokens are permuted in all possible ways (or a sample of them), and the AR objective is applied to each permutation.
    - Token ordering is still preserved using (relative) positional encoding (Transformer XL). However, this breaks if tokens are shuffled too much, so an attention mask is used instead
    - segment recurrence mechanism to learn long-range dependency, beyond 512-2048 tokens (Transformer XL)
        
    
- RoBERTa

The difference between an encoder and a decoder is mainly in how they are trained. The former uses the AE objective (MLM, NSP) and the latter the AR objective (with a softmax head). With the head removed, a decoder can act as a word embedder.


## 3. Variants

### 3.1 Dense multi-head attention

### 3.2 Sparse transformers (factorized attention)

[Ref2](https://arxiv.org/pdf/1906.04341.pdf) 

We first explore generally how BERT’s attention heads behave. We find that there are common patterns in their behavior, such as attending to
fixed positional offsets or attending broadly over
the whole sentence. A surprisingly large amount
of BERT’s attention focuses on the delimiter token [SEP], which we argue is used by the model
as a sort of no-op. Generally we find that attention
heads in the same layer tend to behave similarly.

[Ref3](https://arxiv.org/pdf/1905.07799.pdf) notes that while dense attention allows each head to attend over the full context, many attention heads specialize to consider only local context, while others consider the entire available sequence. The authors propose leveraging this observation by using a variant of self-attention that allows the model to select its context size.

## 4. Time and memory complexity

[Ref1](https://www.pragmatic.ml/a-survey-of-methods-for-incorporating-long-term-context/)

## 5. Kernelizing attention computation

The attention mechanism can be written as

\begin{equation}
h_i^{(l+1)} = \sum_{j}a_{ij} V^{(l)} h_j^{(l)}
\end{equation}

where the attention function $a_{ij}$ is 

\begin{equation}
a_{ij} = \text{softmax}_j(Q^{(l)} h_i^{(l)} \cdot K^{(l)} h_j^{(l)})
\end{equation}

The attention function can be generalized to a kernel function $K(x_i, x_j)$, which is a dot product of the normalized eigenfunctions $h_l(x) \equiv \sqrt{\lambda_l}\phi_l(x)$

\begin{equation}
a_{ij} = K(Qx_i, Kx_j) = \langle \vec{h}(Qx_i), \vec{h}(Kx_j)\rangle = \sum_{l=1}^\infty h_l(Qx_i)h_l(Kx_j)
\end{equation}

Writing the attention function as a dot product has the advantage that, when an autoregressive model is used, $h_i^{(l+1)}$ can be updated sequentially by reusing the results computed for $h_{i-1}^{(l+1)}$. The downside is that $\vec{h}$ is usually infinite-dimensional, so we are effectively not taking advantage of the kernel trick (which avoids explicit computation of the dot product). The sequential updating property induced by casting the attention function as a dot product of non-linear functions $h$ makes a kernelized decoder similar to an RNN. The comparison will be exact if $\vec{h}$ is infinite-dimensional.
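
The sequential update can be illustrated with a finite-dimensional feature map. The sketch below is written under several assumptions that are not in the text above: the feature map is `elu(x) + 1` (one common choice in the linear-attention literature), the attention weights are normalized by their sum instead of a softmax, and `q`, `k`, `v` are already the projected queries, keys and values.

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1: a positive, finite-dimensional stand-in for the eigenfunctions h_l
    return np.where(x > 0, x + 1.0, np.exp(x))

def causal_attention_quadratic(q, k, v):
    scores = feature_map(q) @ feature_map(k).T       # (T, T) kernel values
    scores = np.tril(scores)                         # position i only sees j <= i
    return (scores @ v) / scores.sum(axis=1, keepdims=True)

def causal_attention_recurrent(q, k, v):
    T, d_v = v.shape
    S = np.zeros((q.shape[1], d_v))                  # running sum of phi(k_j) v_j^T
    z = np.zeros(q.shape[1])                         # running sum of phi(k_j)
    out = np.zeros((T, d_v))
    for i in range(T):
        phi_k = feature_map(k[i])
        S += np.outer(phi_k, v[i])                   # reuse everything from step i-1
        z += phi_k
        phi_q = feature_map(q[i])
        out[i] = (phi_q @ S) / (phi_q @ z)
    return out

rng = np.random.default_rng(0)
T, d_k, d_v = 6, 4, 3
q = rng.normal(size=(T, d_k))
k = rng.normal(size=(T, d_k))
v = rng.normal(size=(T, d_v))

out_quadratic = causal_attention_quadratic(q, k, v)
out_recurrent = causal_attention_recurrent(q, k, v)
```

Both functions return the same result, but the recurrent version carries a fixed-size state `(S, z)` from one position to the next, which is the RNN-like behaviour described above.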


## 6. Attention as memory retrieval

Given a set of memory attractors (key) $\{\vec{\xi}_p\}_{p=1}^P$, and an arbitrary (query) configuration $\vec{s}$, a generalized Hopfield model has the following functional form of energy

\begin{equation}
E = -\sum_{p=1}^PF(\vec{\xi}_p\cdot \vec{s})
\end{equation}

The classical Hopfield model is recovered when $F(s) = s^2$. For $F(s) = \exp(s)$, the energy function becomes

\begin{equation}
E = -\sum_{p=1}^P \exp(\vec{\xi}_p\cdot \vec{s})
\end{equation}

Using an exponential function is equivalent to capturing higher-order (up to infinite-order) interactions of the spins. For example, the Ising model has only two-point interactions, whereas an exponential function includes interactions of all orders (through its Taylor expansion). Considering $\vec{s}$ to be a continuous variable instead of $s \in \{-1,1\}$, one adds a quadratic term in $\vec{s}$ to control its norm. The energy function becomes

\begin{equation}
E = -\log \sum_{p=1}^P \exp(\vec{\xi}_p\cdot \vec{s}) + \frac{1}{2}\vec{s}^T\vec{s}
\end{equation}

where the first term is known as the *log-sum-exp function*. The EM updating equation becomes

\begin{equation}
\vec{s}^{(t+1)} = \sum_{p=1}^P \text{softmax}_p(\vec{\xi}_p\cdot \vec{s}^{(t)})\, \vec{\xi}_p  \label{lse_hopfield_em}\tag{Eq. 6.1}
\end{equation}

\ref{lse_hopfield_em} is reminiscent of how a hidden state in an attention layer is calculated:

\begin{align}
h^{(l+1)}_i &= \sum_{j}w_{ij} V^{(l)} h^{(l)}_j \\
h^{(l+1)}_i &= \sum_{j}\text{softmax}_j(K^{(l)}h^{(l)}_j \cdot Q^{(l)}h^{(l)}_i) V^{(l)} h^{(l)}_j
\end{align}
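
The retrieval behaviour of the update in Eq. 6.1 can be tried out numerically. The sketch below reads the update as a softmax-weighted sum over the stored patterns (softmax taken over $p$). The random $\pm 1$ patterns, the noise level and the sizes are arbitrary choices for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def hopfield_update(xi, s):
    """One update step: s_new = sum_p softmax_p(xi_p . s) xi_p, with xi of shape (P, d)."""
    return softmax(xi @ s) @ xi

rng = np.random.default_rng(0)
P, d = 5, 64
xi = rng.choice([-1.0, 1.0], size=(P, d))     # stored patterns (keys)
s = xi[2] + 0.3 * rng.normal(size=d)          # noisy version of pattern 2 (query)

s_new = hopfield_update(xi, s)
retrieved = int(np.argmax(xi @ s_new))        # index of the closest stored pattern
```

Replacing `xi` by the keys (and values) and `s` by a query gives the attention computation in the equations above: a single update step acts as a soft lookup of the stored pattern closest to the query.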

## NLP cookbook

### EDA

- word frequency analysis
- sentence length analysis
- average word length analysis

How do these statistics affect the choice of model, from the perspective of memory complexity and learning long-range dependencies?
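
A minimal sketch of these three statistics using only the standard library is shown below. The regular-expression tokenizer and the two-sentence corpus are simplifying assumptions; a real pipeline would use the tokenizer of the chosen model.

```python
from collections import Counter
import re

def eda_summary(sentences):
    # Naive tokenization: lowercase, keep runs of letters, digits and apostrophes.
    tokenized = [re.findall(r"[a-z0-9']+", s.lower()) for s in sentences]
    words = [w for sent in tokenized for w in sent]
    return {
        "word_freq": Counter(words),
        "sentence_lengths": [len(sent) for sent in tokenized],
        "avg_word_length": sum(len(w) for w in words) / len(words),
    }

corpus = ["The cat sat on the mat.", "A dog barked."]
summary = eda_summary(corpus)

most_common = summary["word_freq"].most_common(1)   # the most frequent word and its count
longest = max(summary["sentence_lengths"])          # longest sentence, in words
```

The distribution of sentence lengths is the statistic most directly tied to the context window a model needs.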

### Choice of models

- suitability to the task
- time and space complexity

## Relevant knowledge

### Cosine similarity and the curse of dimensionality

Euclidean distance suffers from the curse of dimensionality because data points in high dimensions are roughly equidistant from each other, i.e., the ratio of the minimum to the maximum distance among all pairs of data points approaches one in high dimensions.

The reason cosine similarity can be used to measure the similarity of words embedded in high dimensions is that word vectors usually have zero (or small) weight in many dimensions. Therefore, even though the vectors live in a high-dimensional space, the space they actually span is much smaller.

Note that if data is uniformly distributed on the surface of an n-sphere, the likelihood of two random vectors being near orthogonal approaches one when $n \to \infty$.
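
Both effects can be observed in a small simulation. The sketch below compares a low-dimensional and a high-dimensional case, using uniform points in the unit cube for the distance ratio and normalized Gaussian vectors (which are uniformly distributed on the sphere) for the cosine. The sample size and dimensions are arbitrary choices.

```python
import numpy as np

def min_max_distance_ratio(points):
    sq = (points ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * points @ points.T   # squared pairwise distances
    iu = np.triu_indices(len(points), k=1)                   # each pair once
    dist = np.sqrt(np.maximum(d2[iu], 0.0))
    return dist.min() / dist.max()

def mean_abs_cosine(vectors):
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    cos = unit @ unit.T
    iu = np.triu_indices(len(vectors), k=1)
    return np.abs(cos[iu]).mean()

rng = np.random.default_rng(0)
results = {}
for dim in (2, 1000):
    uniform_points = rng.uniform(size=(200, dim))
    gaussian_dirs = rng.normal(size=(200, dim))
    results[dim] = (
        min_max_distance_ratio(uniform_points),   # moves toward 1 as dim grows
        mean_abs_cosine(gaussian_dirs),           # moves toward 0 as dim grows
    )
```

In the high-dimensional case the distance ratio moves toward one while the average absolute cosine moves toward zero, i.e. random vectors become nearly equidistant and nearly orthogonal.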

Why do word vectors tend to be sparse?