# Data Preprocessing Techniques

Data preprocessing involves cleaning, normalizing, and converting user prompts or raw text into a format that models can process effectively. It enhances model performance and helps standardize textual data for analysis.

## Common Preprocessing Techniques

1. **Tokenization**  
   Breaking down text into smaller units called tokens, which can be words, subwords, or characters. This step ensures the text is manageable for the model. Tokenization approaches may vary by language, as some languages (e.g., Chinese) require special segmentation.

2. **Stemming**  
   Reducing words to their root or base form by stripping suffixes and prefixes using heuristic rules. Stemming does not guarantee meaningful root forms (e.g., *flies* → *fli*), as it relies on simple rules rather than linguistic context.

3. **Lemmatization**  
   Reducing a word to its base or dictionary form (lemma) using linguistic analysis. Unlike stemming, lemmatization considers the grammatical role and context of a word, making it more accurate but computationally expensive.  

   *Example*: *am, are, is* → *be* (lemma).

4. **Normalization**  
   Converting text into a standardized form to ensure consistency and improve model interpretability. Common normalization steps include:  
   - Lowercasing text (e.g., *HELLO* → *hello*).  
   - Removing punctuation or special characters.  
   - Removing stop words (e.g., *the, is, and*), though this can sometimes hurt performance in tasks where stop words carry meaning.

5. **Part-of-Speech (POS) Tagging**  
   Assigning grammatical categories (e.g., nouns, verbs, adjectives) to each word in a sentence. While not strictly a preprocessing step, POS tagging can support other tasks like lemmatization or dependency parsing by providing grammatical structure and improving semantic understanding.

<img src="https://miro.medium.com/v2/resize:fit:1024/1*pzjECYWP8WOWhwfCjebZVw.png" width=700>

**Note**: These techniques are often combined to tailor preprocessing to the specific requirements of the NLP task. Each technique has its own strengths and trade-offs, which should be carefully considered for optimal results.
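The techniques above can be chained into a small pipeline. The following is a minimal sketch using only the standard library; the stop-word list, the suffix rules, and the function names are assumptions made up for illustration, not a real stemmer or tokenizer.

```python
import re

# Hypothetical, deliberately tiny stop-word list and suffix rules (illustration only)
STOP_WORDS = {"the", "is", "and", "a", "an", "of", "on"}
SUFFIXES = ("ing", "ies", "es", "s", "ed")


def tokenize(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def naive_stem(token):
    """Strip the first matching suffix; like real stemmers, this can yield non-words."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, normalize (lowercase, no punctuation), drop stop words, then stem."""
    tokens = tokenize(text)
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [naive_stem(t) for t in tokens]


print(preprocess("The cats are chasing the flies!"))
# ['cat', 'are', 'chas', 'fli']
```

Note how the rule-based stemmer produces *fli* and *chas*, which are not dictionary words; a lemmatizer would need linguistic resources to return *fly* and *chase*.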

# Feature Extraction

Feature extraction involves transforming raw input data into numerical representations that retain meaningful information, enabling models to process and analyze the data. In NLP, feature extraction techniques help capture the importance, relationships, and patterns of words or phrases in a corpus.

## Common Feature Extraction Techniques

1. **Bag of Words (BoW)**  
   Bag of Words represents text as a collection of its words, ignoring grammar and word order but retaining word frequency. It is simple and quick to compute, but the lack of contextual understanding and high-dimensional sparsity can limit its effectiveness. Additionally, non-informative terms (e.g., stop words) may appear frequently and dominate the representation.  

   <img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/0*sK9pIrYSFfqlsDbC.png" width=500>

2. **Term Frequency-Inverse Document Frequency (TF-IDF)**  
   TF-IDF evaluates the importance of a term in a document relative to a collection of documents. It balances local importance (frequency within a document) and global rarity (occurrence across all documents).  
   - **Term Frequency (TF)**: Measures the frequency of a term in a document (note that the TF score can be different for the same word in different documents).  
   - **Inverse Document Frequency (IDF)**: Reduces the weight of commonly occurring terms that appear across many documents, emphasizing rare but important terms.  
   - **TF-IDF**: Combines TF and IDF to assign higher scores to terms that are frequent in a document but rare in the corpus (each document will have its own TF-IDF score)

   **Formulas**:  

   $$TF(t,d)=\frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}$$  

   $$IDF(t, D) = \log{\frac{\text{Total number of documents in } D}{\text{Number of documents in which term } t \text{ appears} + 1}}$$  

   $$\text{TF-IDF}(t, d, D) = TF(t,d) \times IDF(t, D)$$  

   <img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/0*bK8wXF-TtjQpCiVo.png" width=500>

3. **N-Grams**  
   N-grams are contiguous sequences of `n` items (e.g., characters, words, or tokens) extracted from text. They capture small amounts of context depending on the value of `n`, with larger `n` representing longer sequences and more context. However, increasing `n` can lead to sparsity and computational challenges. N-grams are commonly used in text representation, feature extraction, and language modeling.

   <img src="https://cdn.botpenguin.com/assets/website/N_Gram_feb7947286.png" width=300>

4. **Word Embeddings**  
   Word embeddings are powerful feature extraction techniques that represent words as dense, continuous vectors whose dimensionality is much smaller than the vocabulary size. These vectors capture semantic meaning, syntactic properties, and relationships between words. Unlike sparse representations (e.g., BoW, TF-IDF), embeddings leverage distributional semantics to model words based on their context. Popular embedding methods include Word2Vec, GloVe, and contextual embeddings like BERT, which capture word meanings in different contexts.
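As a rough illustration of the sparse techniques above, the sketch below computes Bag of Words counts, TF-IDF scores (using the TF and IDF formulas exactly as written in this section, including the `+ 1` in the IDF denominator), and word n-grams. The three-sentence corpus and the function names are made up for the example.

```python
import math
from collections import Counter

# Toy corpus (an assumption for illustration), already lowercased
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
tokenized = [d.split() for d in docs]


def bag_of_words(tokens):
    """Word counts for one document; grammar and word order are ignored."""
    return Counter(tokens)


def tf(term, tokens):
    """Term frequency: occurrences of the term divided by the document length."""
    return tokens.count(term) / len(tokens)


def idf(term, corpus):
    """Inverse document frequency with +1 in the denominator, as in the formula above."""
    df = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / (df + 1))


def tf_idf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)


def ngrams(tokens, n):
    """Contiguous word n-grams as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(bag_of_words(tokenized[0]))
print(tf_idf("mat", tokenized[0], tokenized))  # rare term: positive score
print(tf_idf("the", tokenized[0], tokenized))  # appears everywhere: score below zero
print(ngrams(tokenized[0], 2))
```

With this particular IDF variant, a term that occurs in every document gets a slightly negative weight, because the denominator exceeds the number of documents. Libraries often use differently smoothed variants, so scores may not match across implementations.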


# One-hot encoding
Each word in the vocabulary is represented as a unique vector, with all entries set to 0 except the one at the word's index, which is set to 1.

<img src="https://miro.medium.com/v2/resize:fit:1400/1*GsKLFAlzoNeIKo-1gz1h_Q.png">

Issues:
1. In a one-hot representation, words are treated as isolated symbols with no relation to other words
2. The vectors become very high-dimensional when the number of words is large
3. Inefficient computation (most entries are 0)
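A small sketch of the first issue, assuming a made-up four-word vocabulary: the dot product between the one-hot vectors of any two different words is always 0, so the representation carries no notion of similarity.

```python
# Hypothetical toy vocabulary
vocab = ["cat", "dog", "mat", "sat"]
word_to_index = {w: i for i, w in enumerate(vocab)}


def one_hot(word):
    """Vector of zeros with a single 1 at the word's vocabulary index."""
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


print(one_hot("dog"))                       # [0, 1, 0, 0]
print(dot(one_hot("cat"), one_hot("dog")))  # 0: "cat" and "dog" look unrelated
print(dot(one_hot("cat"), one_hot("cat")))  # 1
```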

# Word embeddings
Word embeddings are essentially a way to convert words into numerical representations (vectors) in a continuous, dense, low dimensional vector space. The goal is to capture the semantic meaning of words such that the distance and direction between vectors reflect the similarity and relationships among the corresponding words. Word embeddings are used as inputs to the models

<img src="https://miro.medium.com/v2/resize:fit:1056/1*GkJpulpSAIm6GTeC1dVR_w.png" width=500>

Some common features of good word embeddings are
1. Dense vectors: the word vectors are dense and continuous with mainly non-zero elements
2. Lower dimensionality: the embedding size is usually a lot smaller than the vocabulary size, allowing faster computation
3. Semantic relationship: words with similar meaning will have similar embeddings in the vector space

## Embedding similarity
The distance and direction between vectors represent their relationship

Given two word embeddings $e_1$, $e_2$, the similarity between the two is calculated as

$$\text{Similarity} = \frac{e_1 \cdot e_2}{||e_1|| ||e_2||}$$

The similarity value is between -1 and 1, where 1 means the two vectors point in the same direction (most similar), -1 means they point in opposite directions, and 0 means they are orthogonal (no relationship)
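The cosine similarity formula can be sketched in a few lines of plain Python. The two-dimensional vectors below are made-up values chosen to hit the three reference cases, not real embeddings.

```python
import math


def cosine_similarity(e1, e2):
    """Dot product of the two vectors divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(e1, e2))
    norm1 = math.sqrt(sum(a * a for a in e1))
    norm2 = math.sqrt(sum(b * b for b in e2))
    return dot / (norm1 * norm2)


print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))    # same direction
print(cosine_similarity([1.0, 2.0], [-1.0, -2.0]))  # opposite direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))    # orthogonal
```

Because the norms are divided out, only the direction of the vectors matters, not their length.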

<img src="https://miro.medium.com/v2/resize:fit:1400/1*sXNXYfAqfLUeiDXPCo130w.png" width=600>

## Embedding matrix
For a given word $w$ and its one-hot encoding vector $o_w$, the embedding matrix $E$ is a matrix that maps its 1-hot representation $o_w$ to its embedding $e_w$ as follows:

$$e_w = Eo_w$$

$E$: the embedding matrix, whose number of rows equals the number of embedding features and whose number of columns equals the vocabulary size of the one-hot encoding dictionary
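A minimal NumPy sketch of this mapping, with arbitrary toy sizes and random values standing in for a trained matrix: multiplying $E$ by a one-hot vector simply selects one column of $E$, which is why implementations usually index into the matrix instead of performing the multiplication.

```python
import numpy as np

rng = np.random.default_rng(0)
num_features, vocab_size = 3, 5                   # toy sizes (assumptions)
E = rng.normal(size=(num_features, vocab_size))   # rows: features, columns: words

w = 2                                             # index of some word in the vocabulary
o_w = np.zeros(vocab_size)
o_w[w] = 1.0                                      # one-hot vector for word w

e_w = E @ o_w                                     # e_w = E o_w
lookup = E[:, w]                                  # direct column lookup

print(np.allclose(e_w, lookup))                   # True
```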

<img src="https://miro.medium.com/v2/resize:fit:1400/1*Bq6lIOdjCK172I1V04RwLA.png" width=500>


## Apply word embedding
Word embeddings are widely used for transfer learning

Steps:
1. Train a model to learn word embeddings from a large corpus of text or use a pre-trained embedding
2. Use the word embedding on a smaller training set to complete the given task
3. (Optional) Fine tune the word embedding based on the transfer learning

# Learning word embedding
Note: the individual components of the learned word embeddings are not necessarily interpretable, since the axes chosen by the algorithm do not necessarily align with interpretable axes

## Word2vec
Word2vec is a framework aimed at learning word embeddings through a neural network by estimating the likelihood that a given word is surrounded by other words

There are two types of word2vec models: continuous bag of words (CBOW) and skip-gram. CBOW learns the word embeddings by trying to predict a target word given its surrounding context words, while skip-gram learns the embeddings by trying to predict the surrounding context words given a target word
 
<img src="https://community.alteryx.com/t5/image/serverpage/image-id/45458iDEB69E518EBA3AD9/image-size/large?v=v2&px=999" width=500>

### Continuous Bag of Words (CBOW)
CBOW predicts the probability of a target word occurring, given the surrounding context words. Essentially, CBOW trains a model to produce similar embeddings for words that appear in similar contexts by pulling related word vectors closer together in the embedding space and pushing unrelated ones further apart

#### Steps for Training Word Embeddings with CBOW:

1. **Corpus and Vocabulary**:  
   Start with a text corpus and construct a vocabulary of size $V$. Each word is assigned a unique index and represented as a one-hot vector $o_i$ with dimension $V$

2. **Context Window and Embedding Size**:  
   Define a context window of size $C$. For each center word, its context consists of $2C$ surrounding words (C to the left and C to the right, excluding the center). Also define an embedding dimension $N$, which is the size of all the word vectors.

3. **Generate Training Examples**:  
   Slide a window of size $2C+1$ across the corpus. For each window, use the $2C$ context words as input and the center word as the prediction target.

4. **Embedding Lookup**:  
   For each one-hot encoded context word $o_i$, use an embedding matrix $E$ with dimension $(V, N)$ to obtain its word vector. The embedding matrix converts a one-hot vector into a unique word embedding by looking up the corresponding row:

   $$e_i = o_i^\top E$$

   After retrieving all $2C$ context vectors, stack them as:
   $$
   e = [e_1, e_2, \dots, e_{2C}]
   $$
   
   where $e$ has the dimension of $(2C, N)$

5. **Aggregate Context Embeddings**:  
   Average all the context embeddings inside $e$ to obtain a single vector:

   $$
   h = \frac{1}{2C} \sum_{i=1}^{2C}e_i
   $$
   
   where $h$ has the dimension of $N$ and it captures the average context based on all context words

6. **Output Layer**:  
   Multiply the averaged vector $h$ with an output weight matrix $W$ with dimension $(N, V)$ to compute unnormalized scores (logits) over the vocabulary:

   $$
   u = h^\top W
   $$
   where $u$ has the dimension of $V$

7. **Softmax**:  
   Apply the softmax function to convert logits into a probability distribution over the vocabulary:

   $$
   p_j = \frac{\exp(u_j)}{\sum_{k=0}^{V-1} \exp(u_k)}
   $$
   
   Note: the probabilities over the entire vocabulary sum to 1, i.e. $\sum^{V-1}_{i=0}p_i = 1$

8. **Cross Entropy Loss**:  
   Use cross entropy to calculate the loss

   $$
   L = -\sum_{i=0}^{V-1} y_i \log(p_i) = -\log(p_{\text{true}})
   $$
   
   For cross entropy loss, only one word in the vocabulary has a ground truth of one, so only this term contributes to the loss, and the words with a ground truth of zero do not contribute to it directly. However, the loss still implicitly penalizes misclassification, because the softmax probabilities sum to 1: a weight update that increases the probability of the correct word must decrease the probabilities of the other words.

9. **Training and Embedding Extraction**:  
   Use backpropagation to update both $E$ (the embedding matrix) and $W$. After training, we can extract the embedding matrix $E$ from the model and use it to convert one-hot vectors into word embeddings.
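The nine steps above can be sketched in NumPy for a single training example. This is a toy illustration under stated assumptions: the sizes, learning rate, word indices, and function names are made up, and plain gradient descent is used as the optimizer.

```python
import numpy as np

rng = np.random.default_rng(0)
V, N, C = 6, 4, 1                         # vocabulary size, embedding size, window (toy values)
E = rng.normal(scale=0.1, size=(V, N))    # embedding matrix, shape (V, N)
W = rng.normal(scale=0.1, size=(N, V))    # output weight matrix, shape (N, V)


def cbow_forward(context_ids, E, W):
    e = E[context_ids]                    # (2C, N): row lookup, equivalent to o_i^T E
    h = e.mean(axis=0)                    # (N,): averaged context embedding
    u = h @ W                             # (V,): logits
    p = np.exp(u - u.max())               # softmax, shifted for numerical stability
    p /= p.sum()
    return h, p


def cbow_step(context_ids, target_id, E, W, lr=0.1):
    """One gradient-descent update; returns the loss before the update."""
    h, p = cbow_forward(context_ids, E, W)
    loss = -np.log(p[target_id])          # cross entropy with a one-hot target
    du = p.copy()
    du[target_id] -= 1.0                  # dL/du = p - y
    dW = np.outer(h, du)                  # (N, V)
    dh = W @ du                           # (N,), computed before W is changed
    W -= lr * dW
    # every context row receives dh / 2C; subtract.at also handles repeated ids
    np.subtract.at(E, context_ids, lr * dh / len(context_ids))
    return loss


context_ids, target_id = [1, 3], 2        # hypothetical word indices
losses = [cbow_step(context_ids, target_id, E, W) for _ in range(200)]
print(losses[0], losses[-1])              # the loss should shrink on this example
```

After training on a real corpus, the rows of `E` would be used as the word embeddings.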

<img src="https://mlarchive.com/wp-content/uploads/2024/01/New-Project-1024x595.png" width=500>

### Skip-gram
The Skip-Gram model is another Word2Vec model that learns word embeddings by predicting surrounding context words given a center (target) word. It is essentially the inverse of CBOW. The model is trained such that words that appear in similar contexts have similar vector representations.

#### Steps for Training Word Embeddings with Skip-Gram:
The steps for training skip-gram are very similar to that of CBOW, but with the following differences

1. In step 3 (Generate Training Examples), we reverse the input and label by making the target word as the input and the $2C$ context words become the prediction targets.

2. Step 5 (Aggregate Context Embeddings) in CBOW can be skipped since there's only one input word, so there is no need to average the embeddings and the dimension of $h$ is $N$

3. In step 8 (Cross Entropy Loss), the ground truth labels are the $2C$ context words, so instead of counting the loss for a single word as in CBOW, the total loss for each target word is the sum of the negative log-probabilities of its context words, where
   $$
   L = \sum_{i=0}^{2C-1} -\log(p_{\text{context}_i})
   $$
    Then, we use this loss to back propagate and train the embedding matrix
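A minimal sketch of the skip-gram loss for one center word, assuming the same $(V, N)$ embedding matrix and $(N, V)$ output matrix as in CBOW. The sizes and word indices are toy values chosen for illustration.

```python
import numpy as np


def softmax(u):
    z = np.exp(u - u.max())               # shift for numerical stability
    return z / z.sum()


def skipgram_loss(center_id, context_ids, E, W):
    """Sum of negative log-probabilities of all context words given the center word."""
    h = E[center_id]                      # (N,): single input word, no averaging needed
    p = softmax(h @ W)                    # (V,): one distribution shared by all context slots
    return -np.sum(np.log(p[context_ids]))


rng = np.random.default_rng(0)
V, N = 6, 4                               # toy sizes (assumptions)
E = rng.normal(scale=0.1, size=(V, N))
W = rng.normal(scale=0.1, size=(N, V))

print(skipgram_loss(2, [1, 3], E, W))     # roughly 2 * log(6) for near-uniform predictions
```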

<img src="https://aegis4048.github.io/images/featured_images/skip-gram.png" width=700>

### CBOW VS Skip-gram

<img src="https://miro.medium.com/v2/resize:fit:1136/format:webp/1*x6aahsfT5wtqL6x5Xj-cEg.png" width=500>

### Negative sampling
In practice, computing the softmax probability over a large vocabulary size is very expensive. To address this, **negative sampling** is often used to approximate the full softmax efficiently. Instead of predicting the correct word among all vocabulary items, the model only focuses on predicting

1. The positive sample (actual target word)

2. A few negative samples (random words assumed to be incorrect)

Essentially, negative sampling simplifies a multi-class classification problem to a series of binary classification problems

#### Steps Applying Negative Sampling to CBOW and Skip-Gram

The previous steps for CBOW and Skip-Gram (Steps 1 to 6) remain the same as before. The only modification is in step 7 (softmax) and step 8 (cross-entropy loss), where we replace the full softmax with a more efficient negative sampling procedure.

* For **CBOW**, instead of calculating the probability with softmax:

$$
p_j = \frac{\exp(u_j)}{\sum_{k=0}^{V-1} \exp(u_k)}
$$

we compute a binary classification objective using the sigmoid function. For each training example, we have:
- One positive sample (the true center word)
- $K$ negative samples (randomly chosen words other than the true center word)

This works well because, in a large vocabulary, randomly sampled words are highly unlikely to be semantically related to the true context.

The loss for one CBOW example becomes:

$$
L = -\log \sigma(h \cdot v'_{\text{target}}) - \sum_{k=1}^{K} \log \sigma(-h \cdot v'_{\text{neg}_k})
$$

$h$: the averaged context embedding

$v'_{\text{target}}$: the output embedding of the ground truth label (relevant word)

$v'_{\text{neg}_k}$: the output embeddings of each negative sampled words (non-relevant words)

$\sigma$: the sigmoid function

Essentially, the loss function uses dot products to score how likely a given word is related to the context. It encourages high dot products for true center words and low dot products for unrelated (negative) words. This approach is much more computationally efficient than full softmax, as it only requires evaluating a few word pairs per example.


* For **Skip-Gram**, the process is similar, but repeated for each context word. For each training pair (center word, context word), we compute:

$$
L = -\log \sigma(v_{\text{center}} \cdot v'_{\text{context}}) - \sum_{k=1}^{K} \log \sigma(-v_{\text{center}} \cdot v'_{\text{neg}_k})
$$

$v_{\text{center}}$: the embedding of the center word (input)

$v'_{\text{context}}$: the output embedding of the true context word

$v'_{\text{neg}_k}$: the embeddings of negative samples

Since Skip-Gram generates one training pair for each context word (there are $2C$ context words), the total loss for one center word is the sum over all context words in its window:

$$
L_{\text{total}} = \sum_{j=1}^{2C} \left[-\log \sigma(v_{\text{center}} \cdot v'_{\text{context}_j}) - \sum_{k=1}^{K} \log \sigma(-v_{\text{center}} \cdot v'_{\text{neg}_{j,k}})\right]
$$

The key difference between CBOW and Skip-Gram with negative sampling is that CBOW computes the loss once per training example (with the center word as the target), whereas Skip-Gram computes the loss $2C$ times, once for each context word within the context window
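The negative sampling loss for one (input, positive word) pair can be sketched as follows. The same function covers both cases: pass the averaged context vector $h$ for CBOW or $v_{\text{center}}$ for Skip-Gram. Drawing negatives uniformly at random is a simplification assumed here; the original word2vec implementation samples from a smoothed unigram distribution instead.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def negative_sampling_loss(h, v_pos, v_negs):
    """h: (N,) input vector, v_pos: (N,) positive output vector, v_negs: (K, N)."""
    pos_term = -np.log(sigmoid(h @ v_pos))                   # push the true pair together
    neg_term = -np.sum(np.log(sigmoid(-(v_negs @ h))))       # push the K sampled pairs apart
    return pos_term + neg_term


rng = np.random.default_rng(0)
V, N, K = 10, 4, 3                         # toy sizes (assumptions)
W_out = rng.normal(scale=0.1, size=(V, N)) # output embeddings, one row per word
h = rng.normal(scale=0.1, size=N)

target_id = 2
candidates = [i for i in range(V) if i != target_id]
neg_ids = rng.choice(candidates, size=K, replace=False)      # uniform sampling (simplification)

print(negative_sampling_loss(h, W_out[target_id], W_out[neg_ids]))
```

Only $K + 1$ dot products are evaluated per example, instead of the $V$ logits needed for a full softmax.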


## GloVe
GloVe (Global Vectors) is another model that learns word embeddings, in this case from global co-occurrence statistics. The key idea of GloVe is that words that appear in similar contexts tend to have similar meanings. So, instead of learning by predicting words like Word2Vec, GloVe counts how often two words appear together in the corpus (co-occurrence). A high co-occurrence count suggests that two words are closely related, whereas a low count suggests that they are not

GloVe uses a co-occurrence matrix $X$, where each $X_{i,j}$ denotes the number of times that a target word $i$ appears in the context of the word $j$

Its cost function, summed over all word pairs in a vocabulary of size $V$, is
$$J(\theta) = \frac{1}{2}\sum_{i, j=1}^{V}f(X_{i,j})(\theta_i^Te_j + b_i + b'_j - \log(X_{i,j}))^2$$

$f$: a weighting function such that $f(X_{i,j}) = 0$ when $X_{i,j} = 0$ (do not add loss if two words are not in context)

Initially, we initialize $e$ and $\theta$ randomly. After training, given the symmetry that $e$ and $\theta$ play in this model, the final word embedding $e_w^{final} = \frac{e_w + \theta_w}{2}$
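The sketch below builds a co-occurrence matrix from a toy sequence of word ids and evaluates the GloVe cost in its standard form, where $\log(X_{i,j})$ is subtracted from the model's prediction. The weighting function uses the values suggested in the GloVe paper ($x_{max} = 100$, $\alpha = 0.75$); the corpus, sizes, and function names are assumptions for illustration.

```python
import numpy as np


def cooccurrence_matrix(token_ids, vocab_size, window):
    """X[i, j] = number of times word j appears within `window` positions of word i."""
    X = np.zeros((vocab_size, vocab_size))
    for pos, i in enumerate(token_ids):
        lo, hi = max(0, pos - window), min(len(token_ids), pos + window + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos != pos:
                X[i, token_ids[ctx_pos]] += 1
    return X


def weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f: zero at x = 0, capped at 1 for frequent pairs."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)


def glove_cost(X, theta, e, b, b_prime):
    mask = X > 0                                   # pairs that never co-occur add no loss
    log_X = np.log(np.where(mask, X, 1.0))         # placeholder log(1) = 0 where X == 0
    err = theta @ e.T + b[:, None] + b_prime[None, :] - log_X
    return 0.5 * np.sum(weight(X) * mask * err ** 2)


token_ids = [0, 1, 0, 2]                           # hypothetical tiny corpus of word ids
X = cooccurrence_matrix(token_ids, vocab_size=3, window=1)

rng = np.random.default_rng(0)
theta = rng.normal(scale=0.1, size=(3, 2))         # target-word vectors
e = rng.normal(scale=0.1, size=(3, 2))             # context-word vectors
print(X)
print(glove_cost(X, theta, e, np.zeros(3), np.zeros(3)))
```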

# Visualize word embeddings in low dimension:
## t-SNE (t-distributed Stochastic Neighbor Embedding)
t-SNE is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space

<img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/1*p6W2UNZvc14eOEHY4lw1zw.png" width=500>

# Recurrent neural network
RNNs are specialized for tasks that involve sequential inputs, such as speech and language. RNNs are called recurrent because they perform the same task for every element of a sequence, with the output depending on the previous computations

Compared to normal neural networks, RNNs have the advantages of
* Managing different input and output lengths
* Sharing features learned across different positions of the sequence 
* Processing data based on its order and context

## Notations
$x^{(i)<t>}$: the input from the $i$th training example at the time step $t$

$y^{(i)<t>}$: the output for the $i$th training example at the time step $t$ (not necessarily a fixed size)

$T_x^{(i)}$: size of the input of the $i$th training example

$T_y^{(i)}$: size of the output of the $i$th training example

$a^{<t>}$: the activation at the time step $t$ ($a^{<0>} = \vec{0}$)

## Representing words
Each word is represented as a one-hot vector based on a dictionary

## Architecture
<img src="https://miro.medium.com/v2/resize:fit:1400/1*SKGAqkVVzT6co-sZ29ze-g.png">

The image shows the unfolded version of an RNN. An actual RNN uses one cell repeatedly

At each time step, the network takes in the input from the current time step, $x^{<t>}$, and the activation from the previous time step, $a^{<t-1>}$, to produce the output $y^{<t>}$ and pass the activation $a^{<t>}$ to the next time step. Thus, an RNN cell has 2 inputs and 2 outputs

### Forward propagation
$$a^{<t>} = g_1(W_{aa}a^{<t-1>} + W_{ax}x^{<t>} + b_a)$$
$$y^{<t>} = g_2(W_{ya}a^{<t>} + b_y)$$

RNN cell architecture
<img src="https://global.discourse-cdn.com/dlai/original/3X/2/c/2cd9b38764a152e508d90650b5a365599c6347f8.png">

The activation $a^{<t-1>}$ contains the information from previous timesteps, which represents the context. The input $x^{<t>}$ represents the information from the current timestep. The RNN cell combines them to produce an activation that contains both the context and the current information. This activation serves as the context for the next timestep

For the output layer, the RNN transforms the activation at this timestep and applies a softmax function to it. The softmax function gives the probability of each word being the next word, $y^{<t>}$. In general, we choose the entry with the highest probability as the next word (to introduce randomness, we may instead sample among the words with high probability)

Note: in RNN, tanh activation is used most often
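A minimal NumPy sketch of forward propagation through a single RNN cell, assuming $g_1 = \tanh$ and $g_2 = \text{softmax}$, with the output computed from the current activation $a^{<t>}$. The sizes, random weights, and word ids are toy values, not a trained model.

```python
import numpy as np


def softmax(z):
    z = np.exp(z - z.max())                # shift for numerical stability
    return z / z.sum()


def rnn_step(x_t, a_prev, params):
    """One time step: returns the new activation a_t and the output distribution y_t."""
    W_aa, W_ax, b_a, W_ya, b_y = params
    a_t = np.tanh(W_aa @ a_prev + W_ax @ x_t + b_a)
    y_t = softmax(W_ya @ a_t + b_y)
    return a_t, y_t


rng = np.random.default_rng(0)
n_x, n_a = 5, 3                            # vocabulary size (one-hot inputs), hidden size
params = (
    rng.normal(size=(n_a, n_a)),           # W_aa
    rng.normal(size=(n_a, n_x)),           # W_ax
    np.zeros(n_a),                         # b_a
    rng.normal(size=(n_x, n_a)),           # W_ya
    np.zeros(n_x),                         # b_y
)

a = np.zeros(n_a)                          # a^{<0>} = 0
for word_id in [0, 3, 1]:                  # hypothetical input sequence of word ids
    x = np.zeros(n_x)
    x[word_id] = 1.0                       # one-hot input
    a, y = rnn_step(x, a, params)          # the same cell (same weights) is reused each step

print(y)                                   # probability of each word being next
```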

## Loss function
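$$L(\hat{y}, y) = \sum^{T_y}_{t=1} -y^{<t>}\log(\hat{y}^{<t>}) - (1 - y^{<t>})\log(1 - \hat{y}^{<t>})$$

The binary cross-entropy (BCE) loss is applied at each timestep to measure the difference between the predicted probability, $\hat{y}^{<t>}$, and the true label, $y^{<t>}$, and the per-timestep losses are summed over the sequence. The true label contains one entry with value 1 and the rest are 0s (one-hot encoding), so the expression is applied element-wise; with a softmax output, the categorical cross-entropy $-\sum_k y_k^{<t>}\log(\hat{y}_k^{<t>})$ is the more common choice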

## Different types of RNN
1. Many to many: many inputs and many outputs (e.g. translation)
2. Many to one: many inputs and only one output (e.g. movie rating based on description)
3. One to many: one input and many outputs (e.g. music generation)

<img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/0*qUcWLuiBgpFVACYE.png">

## Training language model
1. Collect large corpus (body) of text as the training set
2. Tokenize the training text into one-hot vectors based on the dictionary (UNK for unknown words)
3. Feed the tokenized input to the network one by one and predict the probability of the next word ($x^{<1>} = \vec{0}$)
4. Construct the cost function based on the predicted probabilities and perform gradient descent (backpropagation through time) to update the parameters

## Generate sequence  (one to many)
In order to generate text, we can randomly sample a word based on the probability distribution of the softmax output at the time step, $y^{<t>}$. Then, the sampled word is fed as the input of the next time step, where $x^{<t+1>} = y^{<t>}$, to generate a sequence

Initially, we start with $a^{<0>} = \vec{0}$ and $x^{<1>} = \vec{0}$
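The sampling loop can be sketched as below, assuming a vanilla RNN cell with tanh activation and a softmax output. The weights are random rather than trained, so the generated ids are meaningless; the point is only the feedback of each sampled word into the next step.

```python
import numpy as np


def sample_sequence(params, n_x, n_a, length, rng):
    """Generate `length` word ids by feeding each sampled word back as the next input."""
    W_aa, W_ax, b_a, W_ya, b_y = params
    a = np.zeros(n_a)                      # a^{<0>} = 0
    x = np.zeros(n_x)                      # x^{<1>} = 0
    ids = []
    for _ in range(length):
        a = np.tanh(W_aa @ a + W_ax @ x + b_a)
        logits = W_ya @ a + b_y
        p = np.exp(logits - logits.max())
        p /= p.sum()                       # softmax distribution over the vocabulary
        idx = int(rng.choice(n_x, p=p))    # sample instead of taking the argmax
        ids.append(idx)
        x = np.zeros(n_x)
        x[idx] = 1.0                       # the sampled word becomes the next input
    return ids


rng = np.random.default_rng(0)
n_x, n_a = 5, 3                            # toy vocabulary and hidden sizes (assumptions)
params = (
    rng.normal(size=(n_a, n_a)),
    rng.normal(size=(n_a, n_x)),
    np.zeros(n_a),
    rng.normal(size=(n_x, n_a)),
    np.zeros(n_x),
)
print(sample_sequence(params, n_x, n_a, length=8, rng=rng))
```

In practice, generation usually stops when an end-of-sequence token is sampled rather than after a fixed length.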

<img src="https://media5.datahacker.rs/2020/09/59-1-1024x410.jpg" width=700>

## Vanishing gradient and solutions
For long sequential data, traditional RNNs suffer from vanishing gradients, which makes the model slower to train and makes it difficult to learn long-term dependencies (the context of a long sentence).

As backpropagation moves backward from the output layer toward the input layer (and, in an RNN, back through earlier time steps), the gradients often get smaller and smaller and approach zero, which eventually leaves the weights of the early layers nearly unchanged. This is caused by the saturating nature of some activation functions

### Solutions
1. Proper initialization of weights (Xavier initialization)
2. Use Non-saturating activation function: LeakyReLU (non-zero gradient)
3. Batch normalization: normalize the activation to stabilize activations and ensure gradients remain within a reasonable range
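4. Gradient clipping: force the gradient to be in a certain range (the range requires tuning); note that this mainly guards against the related exploding gradient problem rather than vanishing gradients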
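As an illustration of the last item, here is a sketch of one common variant, clipping by global norm: if the combined norm of all gradients exceeds a threshold, every gradient is rescaled by the same factor. The threshold and the function name are assumptions; clipping each value to a fixed interval is another option.

```python
import numpy as np


def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so that their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))   # leave small gradients untouched
    return [g * scale for g in grads], total_norm


grads = [np.array([3.0, 4.0]), np.array([[12.0]])]      # made-up gradients, global norm 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm_before, norm_after)
```

Rescaling all gradients together preserves the direction of the update, which per-value clipping does not.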

# Gated recurrent unit (GRU)
GRU mitigates the vanishing gradient problem by capturing long-term dependencies with a memory cell controlled by 2 gates: an update gate and a relevance gate. The update gate decides how much past information to keep versus overwrite, and the relevance gate determines how much past information is used when forming the new candidate memory

Note: $c^{<t>}$ and $a^{<t>}$ are the same in the context of GRU, so there's only 2 inputs and 2 outputs for each GRU cell

<img src="https://stanford.edu/~shervine/teaching/cs-230/illustrations/gru-ltr.png?00f278f71b4833d32a87ed53d86f251c" width=500>


$$c^{<t>} = a^{<t>}$$

$$\tilde c^{<t>} = tanh(W_{cc}(\Gamma_r * c^{<t-1>}) + W_{cx}x^{<t>} + b_c)$$

$$\Gamma_u = \sigma(W_{uc}c^{<t-1>} + W_{ux}x^{<t>} + b_u)$$

$$\Gamma_r = \sigma(W_{rc}c^{<t-1>} + W_{rx}x^{<t>} + b_r)$$

$$c^{<t>} = \Gamma_u * \tilde c^{<t>} + (1 - \Gamma_u) * c^{<t - 1>}$$

$c^{<t>}$: the actual memory content at time step $t$. Initially, we can set $c^{<0>} = a^{<0>}$. Note that $c^{<t>}$ is a vector rather than a scalar, implying it captures multiple dependencies at the same time

$\tilde c^{<t>}$: the current memory content at time step $t$, which is the candidate for updating the actual memory content. Note that the current memory content depends on the previous actual memory content, $c^{<t-1>}$, the input at this time step, $x^{<t>}$, and the relevance gate, $\Gamma_r$

$\Gamma_r$: the relevance gate that decides how relevant the previous memory content, $c^{<t-1>}$, is for computing the current memory content $\tilde c^{<t>}$. $\Gamma_r$ depends on the previous memory content, $c^{<t-1>}$, and the input at this time step, $x^{<t>}$

$\Gamma_u$: the update gate that decides whether the actual memory content, $c^{<t>}$, will be updated to the calculated current memory content, $\tilde c^{<t>}$. $\Gamma_u$ depends on the previous memory content, $c^{<t-1>}$, and the input at this time step, $x^{<t>}$

$c^{<t>} = \Gamma_u * \tilde c^{<t>} + (1 - \Gamma_u) * c^{<t - 1>}$: the function that decides whether the value $c^{<t>}$ will be updated to the value of $\tilde c^{<t>}$. $*$ denotes element-wise multiplication, so $c^{<t>}$, $\tilde c^{<t>}$, $\Gamma_u$, and $\Gamma_r$ must have the same dimensions

Note: since $\Gamma_r$ and $\Gamma_u$ are calculated with a sigmoid function, their values lie between 0 and 1 and are often close to either 0 or 1, which indicates relevant if $\Gamma_r \approx 1$ or irrelevant if $\Gamma_r \approx 0$, and update the value if $\Gamma_u \approx 1$ or do not update the value if $\Gamma_u \approx 0$. This update can be partial since the gates act element-wise, so some entries can be updated while others are kept

If $\Gamma_u \approx 1$, $c^{<t>} = \tilde c^{<t>}$

If $\Gamma_u \approx 0$, $c^{<t>} = c^{<t-1>}$
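The GRU equations above translate almost line by line into NumPy. This is a sketch with random, untrained weights and made-up sizes; the parameter names mirror the subscripts used in the equations.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_params(n_x, n_c, rng):
    """Random weights for the update (u), relevance (r), and candidate (c) computations."""
    p = {}
    for gate in "urc":
        p[f"W_{gate}c"] = rng.normal(scale=0.5, size=(n_c, n_c))
        p[f"W_{gate}x"] = rng.normal(scale=0.5, size=(n_c, n_x))
        p[f"b_{gate}"] = np.zeros(n_c)
    return p


def gru_step(x_t, c_prev, p):
    gamma_u = sigmoid(p["W_uc"] @ c_prev + p["W_ux"] @ x_t + p["b_u"])   # update gate
    gamma_r = sigmoid(p["W_rc"] @ c_prev + p["W_rx"] @ x_t + p["b_r"])   # relevance gate
    c_tilde = np.tanh(p["W_cc"] @ (gamma_r * c_prev) + p["W_cx"] @ x_t + p["b_c"])
    # element-wise blend of the candidate and the previous memory
    return gamma_u * c_tilde + (1.0 - gamma_u) * c_prev


rng = np.random.default_rng(0)
n_x, n_c = 4, 3                            # toy input and memory sizes (assumptions)
params = init_params(n_x, n_c, rng)
x_t = rng.normal(size=n_x)
c_prev = rng.normal(scale=0.5, size=n_c)

print(gru_step(x_t, c_prev, params))
```

Setting the update-gate bias to a large negative value forces $\Gamma_u \approx 0$, in which case the cell simply copies $c^{<t-1>}$ forward; this is the mechanism that lets information survive over many time steps.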


## LSTM
Long Short-Term Memory networks (LSTMs) address the difficulty that traditional RNNs have with long-term dependencies by using a system of gates that control how information flows through the network, deciding what to keep and what to forget over extended sequences

The LSTM cell takes in 3 inputs and produces 3 outputs. The inputs are the previous cell state (the information stored at the end of the previous time step), the previous hidden state (the activation from the previous time step), and the input at the current time step, $x^{<t>}$

The hidden state and the current timestep input are very similar to those of a traditional RNN. The cell state acts like a "memory" that carries information forward using basic operations like addition and multiplication, remembering important information and forgetting unimportant information. This is controlled by 3 gates: the update gate, the forget gate, and the output gate

<img src="https://stanford.edu/~shervine/teaching/cs-230/illustrations/lstm-ltr.png?4539fbbcbd9fabfd365936131c13476c" width=500>

$$\tilde c^{<t>} = \tanh(W_{ca}a^{<t-1>} + W_{cx}x^{<t>} + b_c)$$

$$\Gamma_u = \sigma(W_{ua}a^{<t-1>} + W_{ux}x^{<t>} + b_u)$$

$$\Gamma_f = \sigma(W_{fa}a^{<t-1>} + W_{fx}x^{<t>} + b_f)$$

$$\Gamma_o = \sigma(W_{oa}a^{<t-1>} + W_{ox}x^{<t>} + b_o)$$

$$c^{<t>} = \Gamma_u * \tilde c^{<t>} + \Gamma_f * c^{<t - 1>}$$

$$a^{<t>} = \Gamma_o * c^{<t>}$$

$\tilde c^{<t>}$: in LSTM, the current memory content depends on the activation from the previous time step, $a^{<t-1>}$, and the input at the current time step, $x^{<t>}$

$\Gamma_u$, $\Gamma_f$, $\Gamma_o$: the update gate, forget gate, and output gate; all depend on the activation from the previous time step, $a^{<t-1>}$, and the input at the current time step, $x^{<t>}$

$c^{<t>} = \Gamma_u * \tilde c^{<t>} + \Gamma_f * c^{<t - 1>}$: the function that decides whether to update the memory content or not. Compared to the GRU equation, this equation is more flexible because it uses two gates to update the memory content; this means we can not only decide whether to update the memory content to $\tilde c^{<t>}$ or not, but can also keep both $\tilde c^{<t>}$ and $c^{<t - 1>}$ by adding them when $\Gamma_u \approx 1$ and $\Gamma_f \approx 1$

$a^{<t>} = \Gamma_o * c^{<t>}$: the activation $a^{<t>}$ is a filtered version of the cell state $c^{<t>}$ (many formulations apply $\tanh$ to $c^{<t>}$ before filtering)

Although the forget gate and update gate both have values between 0 and 1, they do not necessarily add up to 1 (they are independent), which gives the model more flexibility

In general, LSTM is more powerful and flexible than the GRU but requires more computational power
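A NumPy sketch of one LSTM step following the equations in this section, including $a^{<t>} = \Gamma_o * c^{<t>}$ as written here (many references use $\Gamma_o * \tanh(c^{<t>})$ instead). Weights are random and sizes are toy values chosen for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_params(n_x, n_a, rng):
    """Random weights for the candidate (c) and the update (u), forget (f), output (o) gates."""
    p = {}
    for gate in "cufo":
        p[f"W_{gate}a"] = rng.normal(scale=0.5, size=(n_a, n_a))
        p[f"W_{gate}x"] = rng.normal(scale=0.5, size=(n_a, n_x))
        p[f"b_{gate}"] = np.zeros(n_a)
    return p


def lstm_step(x_t, a_prev, c_prev, p):
    c_tilde = np.tanh(p["W_ca"] @ a_prev + p["W_cx"] @ x_t + p["b_c"])
    gamma_u = sigmoid(p["W_ua"] @ a_prev + p["W_ux"] @ x_t + p["b_u"])
    gamma_f = sigmoid(p["W_fa"] @ a_prev + p["W_fx"] @ x_t + p["b_f"])
    gamma_o = sigmoid(p["W_oa"] @ a_prev + p["W_ox"] @ x_t + p["b_o"])
    c_t = gamma_u * c_tilde + gamma_f * c_prev     # two independent gates
    a_t = gamma_o * c_t                            # variant used in this document
    return a_t, c_t


rng = np.random.default_rng(0)
n_x, n_a = 4, 3                                    # toy sizes (assumptions)
params = init_params(n_x, n_a, rng)
x_t = rng.normal(size=n_x)
a_prev = np.zeros(n_a)
c_prev = rng.normal(scale=0.5, size=n_a)

a_t, c_t = lstm_step(x_t, a_prev, c_prev, params)
print(a_t, c_t)
```

Unlike the GRU, the weights on the candidate and on the previous memory are not tied to sum to 1, so the cell can keep the old memory and add new content at the same time.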

# Different types of RNNs
### Deep RNN
The Deep RNN is constructed by stacking multiple layers of RNN together. In this architecture, every RNN layer predicts the sequence of outputs to send to the next RNN layer instead of predicting a single output value. Then the final RNN layer predicts the single output

Note: we can start processing the next layer as soon as the current layer produces an output for the current time step, and there is no need to wait for the entire sequence to be processed by the current layer

<img src="https://stanford.edu/~shervine/teaching/cs-230/illustrations/deep-rnn-ltr.png?f57da6de44ddd4709ad3b696cac6a912" width=300>

### CNN-RNN
CNN-RNN architecture is a combination of CNN and RNN architectures. It first uses the CNN network layer to extract the essential features from the input and then send them to the RNN layer to support sequence prediction. An example application for this architecture is generating textual descriptions for the input image

<img src="https://media.springernature.com/lw1200/springer-static/image/art%3A10.1007%2Fs11063-024-11687-w/MediaObjects/11063_2024_11687_Fig3_HTML.png" width=500>

### Encoder-decoder RNN (Seq2Seq)
The encoder-decoder RNN architecture has an encoder that converts the input to an intermediate encoder vector. A decoder then transforms the intermediate encoder vector into the final result. An application of this model is machine translation

<img src="https://miro.medium.com/v2/resize:fit:1400/1*1JcHGUU7rFgtXC_mydUA_Q.jpeg" width=500>

### Bidirectional RNN
A bidirectional RNN connects two RNN layers together, one running in the forward direction and the other in the backward direction. With this architecture, the output layer can get information from the past and the future simultaneously. In general, the forward and backward layers process the sequence independently, and their outputs are combined to produce the final output. An application of this model is sentiment classification

<img src="https://stanford.edu/~shervine/teaching/cs-230/illustrations/bidirectional-rnn-ltr.png?e3e66fae56ea500924825017917b464a" width=400>

# Sequence to sequence model
A sequence-to-sequence (Seq2Seq) model transforms one sequence into another sequence (e.g. translation, text generation, summarization). Seq2Seq models are particularly useful in tasks where the model needs to process and understand the entire input sequence before generating an output sequence

For example, in language translation, the model reads the entire input sentence first to capture its holistic meaning (context, grammar, and semantics) and then generates the translated sentence. A word-by-word translation often fails because many languages have different grammatical structures and idiomatic expressions that require contextual understanding to produce a meaningful translation

Seq2Seq models address this challenge by encoding the entire input sequence into a fixed-length context vector (via the encoder), and then decoding this vector to generate the output sequence step-by-step (via the decoder). This allows the model to handle the input and output sequences of different lengths effectively.

The basic sequence-to-sequence model has an encoder and a decoder. The encoder takes in the input sequence, converts it to a fixed-length context encoding that captures the meaning of the input sequence, and feeds it to the decoder. The decoder takes in the context encoding and generates a sequence based on it. Note that the output at each timestep of the encoder network is discarded

Note: the encoder can be a CNN, RNN, or other architecture depending on the input data

<img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/0*RSdvbRGBJnLgX4E0.png" width=700>

In principle, during training the decoder should feed its prediction from the previous timestep into the next timestep as context for predicting the next word. At the beginning of training, however, this causes problems: the untrained network outputs essentially random predictions, and using them as context makes the later predictions in the sequence even less accurate, which leads to very slow convergence and model instability. Therefore, during training we use teacher forcing: the ground-truth label from the previous timestep is always used as the input for the next timestep. This is similar to training each timestep individually, which allows the model to learn faster. Once the model is trained (at test time), we switch back to using the model's prediction at the previous timestep as the input for the next timestep.
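The difference between the two modes comes down to which token is fed into the decoder at each step. The sketch below is a simplified illustration only: `decoder_inputs`, the `<s>` start token, and the example tokens are hypothetical, and the "predictions" are a fixed list rather than the output of a real decoder.

```python
def decoder_inputs(targets, predictions, teacher_forcing, start_token="<s>"):
    """Return the token fed into the decoder at each timestep.

    targets:     ground-truth output tokens y_1 ... y_T
    predictions: tokens the model predicted at each timestep
    """
    # The first input is always a start token (an assumption of this sketch).
    inputs = [start_token]
    # The input at step t is the token from step t-1:
    # the ground truth under teacher forcing, otherwise the model's own output.
    previous = targets if teacher_forcing else predictions
    inputs.extend(previous[:-1])
    return inputs


targets = ["je", "suis", "etudiant"]
predictions = ["il", "suis", "la"]  # hypothetical output of an untrained model

train_inputs = decoder_inputs(targets, predictions, teacher_forcing=True)
test_inputs = decoder_inputs(targets, predictions, teacher_forcing=False)
```

With teacher forcing, the wrong first prediction (`"il"`) never reaches the second timestep, so an early mistake does not contaminate the rest of the sequence during training.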

Compared with the earlier language model, whose initial input activation is $\vec 0$, the decoder of a Seq2Seq model is a conditional language model: it generates the sequence with maximum probability (the most likely sequence) given the encoding as the condition.

### Issues with Seq2Seq model and solutions
1. The amount of "memory" that the model can capture from the input sequence depends on the size of the context vector. With a small context vector, the model may not be able to capture the entire context of the input, especially for long input sequences. One solution is to use a deep RNN architecture, which allows the context vector to capture more information.

2. For generation, we look for the sentence with maximum probability given the encoding, $P(y^{<1>}, ..., y^{<T_y>}|encoding)$. Greedy search does not work well here because it only maximizes the probability of each word given the previous words, rather than the probability of the entire generated sequence. One solution is beam search, which keeps several candidate sequences and approximately maximizes the probability of the whole output sentence instead of each individual word.

# How a Seq2Seq model picks the next word

## Greedy search
Greedy search means the model always picks the token with the highest predicted probability. Although greedy search is simple and fast, it has some key issues:

1. Maximizing the probability at the token level does not guarantee the highest probability at the sentence level.
2. Always picking the highest-probability token makes generation deterministic: given the same input, the model always produces the same output.
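Both issues can be seen on a toy example. The sketch below assumes a hypothetical "model", `NEXT_TOKEN_PROBS`, in which the next-token distribution depends only on the previous token; a real decoder would condition on the encoding and on the whole prefix.

```python
# Hypothetical toy "model": next-token probabilities given only the previous token.
NEXT_TOKEN_PROBS = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "</s>": 0.2},
    "a": {"cat": 0.1, "dog": 0.9},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def greedy_decode(start="<s>", end="</s>", max_len=10):
    """Always pick the most probable next token; return (tokens, sequence probability)."""
    tokens, prob = [], 1.0
    current = start
    for _ in range(max_len):
        dist = NEXT_TOKEN_PROBS[current]
        current = max(dist, key=dist.get)  # highest-probability token
        prob *= dist[current]
        if current == end:
            break
        tokens.append(current)
    return tokens, prob


greedy_tokens, greedy_prob = greedy_decode()

# Probability of the sequence "a dog </s>", which greedy search never explores.
alternative_prob = 0.4 * 0.9 * 1.0
```

Greedy search commits to `"the"` because it is the best first token, yet the sequence starting with `"a"` ends up with a higher overall probability (`alternative_prob` is larger than `greedy_prob`). Calling `greedy_decode()` again returns exactly the same result, which illustrates the determinism.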


## Beam search
Beam search is used to find the most likely output sequence, $y$, given the input encoding, $x$.

Steps:
1. Define a beam width, $B$.
2. At the first timestep, select the top $B$ words with the highest probability.
3. Feed each kept word/sequence into the next timestep and compute the combined probability of every possible extension. Keep only the top $B$ sequences with the highest combined probability and drop the rest.
4. Repeat step 3 until every kept sequence has ended or a maximum length is reached.

<img src="https://stanford.edu/~shervine/teaching/cs-230/illustrations/beam-search-en.png?3515955a2324591070618dd85812d5d7" width=900>

$B$: the beam width, which determines how many of the most likely sequences are kept at each step. Larger values of $B$ yield better results but are slower and use more memory. Smaller values of $B$ give worse results but are less computationally intensive. A typical value for $B$ is around 10.
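The beam search steps can be sketched in a few lines. This is a minimal illustration under stated assumptions: `NEXT_TOKEN_PROBS` is a hypothetical toy model whose next-token distribution depends only on the previous token, and scores are accumulated as log-probabilities without length normalization.

```python
import math

# Hypothetical toy "model": next-token probabilities given only the previous token.
NEXT_TOKEN_PROBS = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "</s>": 0.2},
    "a": {"cat": 0.1, "dog": 0.9},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def beam_search(beam_width, start="<s>", end="</s>", max_len=10):
    """Return the kept hypotheses as (log_probability, tokens), best first."""
    beams = [(0.0, [start])]
    for _ in range(max_len):
        candidates = []
        for log_p, tokens in beams:
            if tokens[-1] == end:
                # Finished hypotheses are carried over unchanged.
                candidates.append((log_p, tokens))
                continue
            # Extend the hypothesis with every possible next token.
            for token, p in NEXT_TOKEN_PROBS[tokens[-1]].items():
                candidates.append((log_p + math.log(p), tokens + [token]))
        # Keep only the top-B hypotheses by combined (log) probability.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(tokens[-1] == end for _, tokens in beams):
            break
    return beams


best_log_p, best_tokens = beam_search(beam_width=2)[0]
greedy_like_tokens = beam_search(beam_width=1)[0][1]
```

With `beam_width=1` the procedure reduces to greedy search and returns the `"the cat"` sequence; with `beam_width=2` the `"a"` prefix survives the first step, and the more probable `"a dog"` sequence is found.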

### Length normalization
The combined probability is calculated by multiplying the probability of each word given the previous words, and beam search looks for the sequence that maximizes it:

$$\hat y = \text{argmax}_y \prod_{t=1}^{T_y}P(y^{<t>}|x, y^{<1>}, ..., y^{<t-1>})$$

$T_y$: the length of the output sequence

$x$: the context vector from the encoder

$y^{<t>}$: the prediction of the model at timestep $t$

Thus, the longer the sentence, the smaller the combined probability, which can cause numerical underflow. To prevent this, we use the normalized log-likelihood objective:

$$\hat y = \text{argmax}_y \frac{1}{(T_y)^{\alpha}} \sum_{t=1}^{T_y}\log P(y^{<t>}|x, y^{<1>}, ..., y^{<t-1>})$$

Because the objective is now a sum of log-probabilities rather than a product of probabilities, numerical underflow is avoided. Since each probability is at most 1, the log values are negative (or zero): a more negative objective indicates a smaller probability, and a less negative objective indicates a greater probability.

Since longer sequences have smaller probabilities, the model tends to prefer shorter sequences. To counter this, we apply the term $\frac{1}{(T_y)^{\alpha}}$, which averages the log-probability over the sequence length so that long sequences can compete with short ones. $\alpha$ is a softening exponent, usually set between 0.5 and 1: $\alpha = 1$ gives full length normalization and $\alpha = 0$ gives none.
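A small numerical sketch of both effects (underflow and the bias toward short sequences). The function name `normalized_score` and the per-token probabilities are made up for illustration.

```python
import math


def normalized_score(token_probs, alpha=0.7):
    """Length-normalized log-likelihood: (1 / T_y^alpha) * sum_t log P(y_t | ...)."""
    log_likelihood = sum(math.log(p) for p in token_probs)
    return log_likelihood / (len(token_probs) ** alpha)


# Multiplying many small probabilities underflows to 0.0 in floating point,
# while the sum of their logs stays representable.
underflowed_product = math.prod([0.1] * 400)
log_sum = sum(math.log(0.1) for _ in range(400))

# Hypothetical per-token probabilities for a short and a long candidate.
short = [0.5, 0.5]                  # product = 0.25
long = [0.7, 0.7, 0.7, 0.7, 0.7]    # product is about 0.168

raw_prefers_short = normalized_score(short, alpha=0.0) > normalized_score(long, alpha=0.0)
normalized_prefers_long = normalized_score(long, alpha=0.7) > normalized_score(short, alpha=0.7)
```

Without normalization (`alpha=0.0`) the short candidate wins simply because it has fewer factors; with `alpha=0.7` the long candidate, whose individual tokens are more probable, scores higher.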

### Beam search error analysis
Beam search error analysis helps determine whether a badly generated sequence is caused by the RNN or by the beam search algorithm, since beam search can miss the optimal sequence because of its limited search space.

Suppose $\hat y$ is a bad sequence generated by the model and $y$ is a good target sequence. We can then use the RNN to calculate the probabilities $P(\hat y|x)$ and $P(y|x)$.

* If $P(\hat y|x) \geq P(y|x)$: the model assigns the bad sequence a probability at least as high as the good one, so the RNN itself is at fault. This can be addressed by using a different architecture, applying regularization, or getting more training data.

* If $P(\hat y|x) < P(y|x)$: the model assigns the good sequence a higher probability than the bad one, so the RNN scores the good sequence well but beam search failed to find it. This can be addressed by increasing the beam width.

Note: if length normalization is applied, the normalized objective should be compared instead of the raw probability.
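In practice this comparison is repeated over many bad examples to see which component is responsible more often. A minimal sketch, where `attribute_error` and the scores in `cases` are hypothetical:

```python
from collections import Counter


def attribute_error(p_generated, p_target):
    """Decide which component to blame for one bad output.

    p_generated: P(y_hat | x), the score of the sequence beam search returned
    p_target:    P(y | x), the score of the good target sequence
    (use the normalized objectives instead if length normalization is applied)
    """
    if p_generated >= p_target:
        return "model"        # the RNN itself prefers the bad sequence
    return "beam search"      # the RNN prefers the good sequence, but search missed it


# Hypothetical (P(y_hat | x), P(y | x)) pairs for three bad examples.
cases = [(0.02, 0.05), (0.04, 0.01), (0.03, 0.08)]
blame = Counter(attribute_error(generated, target) for generated, target in cases)
```

If most errors are attributed to beam search, increasing the beam width is the more promising fix; otherwise the effort is better spent on the model.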

Beam search is a better option than greedy search, but it still has key issues:

1. Even though beam search explores multiple candidates, it is still greedy in nature. It focuses on the sequences with the highest cumulative probabilities, which leads to low-diversity output.
2. Computation time and memory usage grow linearly with $B$.

## Sampling
Sampling is one of the most commonly used decoding methods in modern language models because it introduces controlled randomness, which helps improve the diversity and creativity of generated outputs.

Sampling works by adding randomness when picking the next token. When the model selects the next token, it samples one at random, in proportion to its softmax probability. This means a less probable token may still be chosen, just with a lower chance.
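A minimal sketch of sampling in proportion to probability, using the standard library's `random.choices`. The distribution `probs` is a made-up example, and the fixed seed is only there to make the run repeatable.

```python
import random


def sample_token(probs, rng):
    """Draw one token at random, in proportion to its probability."""
    tokens = list(probs)
    weights = [probs[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded generator so the example is reproducible
probs = {"cat": 0.7, "dog": 0.25, "zebra": 0.05}  # hypothetical softmax output

draws = [sample_token(probs, rng) for _ in range(10_000)]
frequencies = {token: draws.count(token) / len(draws) for token in probs}
```

Over many draws the empirical frequencies approach the model's probabilities, so `"zebra"` is picked rarely but not never.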

### Hallucination and solutions
While sampling enables greater diversity and creativity in language models, it also increases the risk of hallucination, which here means the model generates outputs that are grammatically incorrect, illogical, or nonsensical.

This happens because, by its nature, sampling gives every word in the vocabulary a chance of being picked, no matter how improbable it is. Although low-probability tokens can contribute to creative outputs, they may also produce a sentence that doesn't make sense. The lower the probability of a selected word, the more likely it is to break the coherence or accuracy of the generated output.

#### Top-k sampling
Top-k sampling reduces hallucination by only allowing the model to sample the next token from the $k$ most probable words, which prevents the model from picking low-probability words that don't make sense.
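A sketch of the filtering step, assuming the kept probabilities are renormalized to sum to 1 before sampling (a common choice, not something specified above). `top_k_filter` and `probs` are illustrative names and values.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize their probabilities."""
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {token: p / total for token, p in top}


probs = {"cat": 0.5, "dog": 0.3, "bird": 0.15, "zebra": 0.05}  # hypothetical
filtered = top_k_filter(probs, k=2)
```

The next token would then be sampled from `filtered`, which contains only `"cat"` and `"dog"` with their relative proportions preserved.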

#### Top-p sampling
Top-p sampling (also called nucleus sampling) selects the smallest set of tokens whose cumulative probability reaches or exceeds $p$, where $p$ is typically between 0.8 and 0.9, and samples the next token from that set.

Although both top-k and top-p sampling only select from a small set of candidate tokens to reduce hallucination, they are still stochastic because the selection within that set is random.
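A sketch of the top-p filtering step, again assuming the kept probabilities are renormalized before sampling. `top_p_filter` and `probs` are illustrative names and values.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most probable tokens whose cumulative
    probability reaches p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


probs = {"cat": 0.5, "dog": 0.3, "bird": 0.15, "zebra": 0.05}  # hypothetical
nucleus = top_p_filter(probs, p=0.9)
```

Unlike top-k, the number of kept tokens adapts to the distribution: a peaked distribution keeps only a few tokens, while a flat one keeps many.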

#### Temperature
The temperature, $t$, is a hyperparameter that controls the creativity and diversity of the generated output by adjusting the sharpness of the softmax distribution. A high temperature makes the model more creative by making less probable words more likely to be picked, and a low temperature makes the model less creative by concentrating probability on the most probable words.

With temperature $t$, the softmax function for computing the probability of each token in a vocabulary of size $V$ becomes

$$p(w_i) = \frac{\exp(\frac{z_i}{t})}{\sum^{V}_{j = 1}\exp(\frac{z_j}{t})}$$

$w_i$: the $i$th token in the vocabulary

$z_i$: the score (logit) that the model assigns to the $i$th token before the softmax

$V$: the size of the vocabulary

$p(w_i)$: the probability of selecting the $i$th token

* $t = 1$: the formula is the same as normal softmax, nothing changes

* $t < 1$: makes the distribution "sharper", so the model is less likely to select less probable words (closer to greedy), which means less diversity but safer output

* $t > 1$: makes the distribution "wider" (flatter), so the model is more likely to select less probable words, which means more diversity but may cause hallucination
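The three cases can be checked numerically. In this sketch, `logits` stands for the raw scores the model assigns to each token before the softmax; the values are made up.

```python
import math


def softmax_with_temperature(logits, t):
    """Softmax over logits divided by the temperature t."""
    scaled = [z / t for z in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three tokens

sharp = softmax_with_temperature(logits, t=0.5)   # t < 1
normal = softmax_with_temperature(logits, t=1.0)  # plain softmax
flat = softmax_with_temperature(logits, t=2.0)    # t > 1
```

As $t$ decreases, the probability of the top token grows and the tail shrinks; as $t$ increases, probability mass moves from the top token toward the less probable ones.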

These sampling methods can be used in combination to achieve better results

# BLEU Score (Bilingual Evaluation Understudy)

The **BLEU (Bilingual Evaluation Understudy) score** is a metric used to evaluate the quality of machine-translated text by comparing it to one or more reference translations. It measures **n-gram precision** while incorporating a penalty for overly short translations.

### 1. N-gram Precision

BLEU evaluates how many n-grams in the generated translation appear in the reference translation. The **n-gram precision** is calculated as:

$$
p_n = \frac{\sum_{\text{n-gram} \in \hat{y}} \text{count}_{\text{clip}}(\text{n-gram})}{\sum_{\text{n-gram} \in \hat{y}} \text{count}(\text{n-gram})}
$$

where:

- $p_n$ = **n-gram precision** (i.e., the fraction of predicted n-grams that appear in the reference translation).
- $\hat{y}$ = generated (candidate) translation.
- $\sum_{\text{n-gram} \in \hat{y}}$ = sum over all contiguous n-grams in the generated sequence.
- $\text{count}(\text{n-gram})$ = number of times the n-gram occurs in the generated sequence.
- $\text{count}_{\text{clip}}(\text{n-gram})$ = **clipped count**, which limits the count of an n-gram to the maximum number of times it appears in **any single reference translation** (prevents artificially high precision due to repeated words).
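A minimal sketch of clipped n-gram precision on whitespace-tokenized text. The function names are illustrative; the example sentences are the well-known degenerate case where a candidate repeats one word.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, references, n):
    """Clipped n-gram precision p_n of a candidate against one or more references."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # Clip limit: the maximum count of each n-gram in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


candidate = "the the the the the the the".split()
references = [
    "the cat is on the mat".split(),
    "there is a cat on the mat".split(),
]
p1 = clipped_precision(candidate, references, n=1)
```

Without clipping, every `"the"` in the candidate would count as a match and the unigram precision would be 7/7. With clipping, `"the"` is credited at most twice (its count in the first reference), giving 2/7.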

### 2. Combined BLEU Score

The overall BLEU score is calculated using **a weighted geometric mean of n-gram precisions** and a **brevity penalty** to penalize translations that are too short.

$$
\text{BLEU} = \text{BP} \cdot \exp \left( \sum_{k=1}^{N} w_k \log p_k \right)
$$

where:

- $w_k = \frac{1}{N}$ (uniform weight for each precision score when using up to $N$-grams, typically $N=4$).
- $p_k$ = precision for n-grams of size $k$.
- **Brevity Penalty (BP)**:

$$
\text{BP} =
\begin{cases} 
1 & \text{if } L_c \geq L_r \\
e^{(1 - L_r / L_c)} & \text{if } L_c < L_r
\end{cases}
$$

where:

- $L_c$ = length of the generated (candidate) translation.
- $L_r$ = length of the closest reference translation.
- This penalty discourages overly short translations, which might artificially inflate n-gram precision.
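Putting the pieces together, the following sketch computes sentence-level BLEU with uniform weights $w_k = 1/N$ and no smoothing. It is an illustration of the formulas in this section, not a replacement for an established BLEU implementation; the function names and the tie-breaking rule for the closest reference length are assumptions.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=4):
    """Unsmoothed BLEU with uniform weights over 1..max_n grams."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        max_ref = Counter()
        for ref in references:
            for gram, count in ngram_counts(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if clipped == 0 or total == 0:
            return 0.0  # log(0) is undefined; unsmoothed BLEU is 0 in this case
        log_precisions.append(math.log(clipped / total))

    l_c = len(candidate)
    # Closest reference length (assumption: ties go to the shorter reference).
    l_r = min((len(ref) for ref in references), key=lambda length: (abs(length - l_c), length))
    bp = 1.0 if l_c >= l_r else math.exp(1 - l_r / l_c)
    return bp * math.exp(sum(log_precisions) / max_n)


reference = "the cat is on the mat".split()
perfect = bleu(reference, [reference])
truncated = bleu("the cat is on".split(), [reference])
```

The truncated candidate matches the reference on every n-gram it contains, so all its precisions are 1; only the brevity penalty, $e^{1 - 6/4}$, lowers its score.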

### 3. Interpreting BLEU Scores

- BLEU scores range from **0 to 1** (higher is better).
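- **Typical scores** (rough rules of thumb; actual values depend heavily on the test set, the number of references, and tokenization):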
  - **0.6 - 0.7**: Considered **good** machine translation.
  - **0.8 - 1.0**: Indicates potential **overfitting** (rare in practical use cases).
  - **< 0.3**: Poor translation quality.

# Sentiment classification
Sentiment classification predicts the sentiment expressed by a given sentence.

To make predictions, we first convert every word in the sentence from a one-hot vector into an embedding. Then we take the average or sum of the embeddings and feed it into a softmax unit for classification. A drawback of this method is that it ignores word order, so it is not good at catching negations, especially multiple negations in the same sentence.

Another method is to feed the word embeddings of the sentence into an RNN one timestep at a time and feed the activation from the last timestep into a softmax unit for classification.

<img src="https://www.tensorflow.org/static/text/tutorials/images/bidirectional.png" width=400>

# Attention Mechanism
Issue: In traditional **Seq2Seq models** (such as those using RNNs or LSTMs), the **context vector** has a fixed size and serves as the only source of information for the decoder. This can lead to **information loss**, especially for long input sequences, since the decoder must rely solely on this compressed representation.

Solution: The **attention mechanism** allows the model to **dynamically focus on different parts of the input sequence** at each decoding step. Instead of encoding all input information into a single fixed-size vector, the encoder keeps a hidden state for each input position, which better retains the input information. When decoding, attention assigns **different weights** to the hidden states of the inputs based on their relevance to the current decoding step, so although the model keeps the context of every input, it focuses only on the important, relevant parts at each step.

## Advantages of attention mechanism

1. **Handling Long-Range Dependencies**  
   - The model can refer to any part of the input sequence, regardless of length, improving translation quality and other sequence tasks.

2. **Better Contextual Understanding**  
   - By selectively attending to relevant words, the model can generate more **coherent and contextually appropriate** outputs.

3. **Improved Interpretability**  
   - The attention weights show which parts of the input are most influential, making the model’s decisions more transparent.

4. **Parallelization (only in Transformer models, not in RNN-based models)**  
   - Unlike RNN-based attention, **self-attention in Transformers** allows for **parallel computation**, making training much faster.

## How attention works
The attention mechanism allows each encoder **time step** to output its own **hidden state**, $h_{t}$, as part of the **context information**, rather than producing a single fixed-size context vector at the end. These hidden states are similar to those in any sequential model, where each hidden state contains information from **all previous inputs up to that point**. Once computed, the hidden-state values remain **fixed** throughout the decoding process.

At the end of encoding, we have **$T$ fixed hidden states** from the encoder, where $T$ is the length of the input sequence. These hidden states serve as **inputs to the attention mechanism**, which assigns different weights to them and computes a **weighted sum** to form a **context vector** at each decoding step.

After encoding, the last hidden state of the encoder is fed as the initial hidden state of the decoder, similar to the traditional Seq2Seq model. At each time step, the decoder produces its own hidden state, and we compute a **score** for each encoder hidden state by taking the dot product or some learned transformation between the decoder hidden state and each encoder hidden state. These scores represent the relevance of each input word for generating the current output. A higher score indicates that a particular input word is more important, and the attention mechanism will assign it a higher weight when generating the current output.

Next, we pass all the scores through a softmax function to normalize them into attention weights, ensuring that each weight is between 0 and 1, and that their sum equals 1. These attention weights determine how much each encoder hidden state contributes to the final context vector used by the decoder.

Finally, we calculate the context vector by computing the weighted sum of the encoder hidden states using the attention weights. This sum forms the context vector for the decoder at the current timestep, which contains a focused representation of the input. The decoder then uses the context vector and its hidden state to calculate an output. This method overcomes the information bottleneck of a single fixed-size context vector by giving the decoder access to all the encoder hidden states.

<img src="https://miro.medium.com/v2/resize:fit:1100/format:webp/0*C8OjldkhqO6aTHyG.png">

## Mathematical explanation
First, the encoder calculates a hidden state for each input timestep, which we refer to as $h_{t}$, where $t = 0, 1, ..., T$. After encoding, we have an array of hidden states $[h_0, h_1, ..., h_T]$.

At each decoding step, the decoder takes its previous hidden state, $h_{s-1}$, and the previously predicted word, $y_{s-1}$ (the ground-truth label at training time), to compute its current hidden state, where
$$h_{s} = f(h_{s-1}, y_{s-1})$$

Note: for the first decoder timestep, the input hidden state is the last hidden state of the encoder and the input word is just a zero vector.

After obtaining the current decoder hidden state, we use this value to calculate a score for each encoder hidden state, where
$$e^{(s)}_{t} = score(h_s, h_t)$$

$e^{(s)}_{t}$: how much the decoder at step $s$ should focus on the encoder hidden state $h_t$. You can think of this as two timelines: one for the encoder, indexed by $t = 0, 1, ..., T$, and another for the decoder, indexed by $s = 0, 1, ..., T'$.

Different types of attention mechanisms use different score functions. For each decoder timestep $s$, we repeat the score calculation once for every encoder hidden state to obtain $e^{(s)} = [e^{(s)}_{0}, e^{(s)}_{1}, ..., e^{(s)}_{T}]$.

After obtaining the scores, we normalize them with a softmax layer to get the attention weights, where
$$\alpha^{(s)}_{t} = \frac{\exp(e^{(s)}_{t})}{\sum_{t'=0}^{T} \exp(e^{(s)}_{t'})}$$

$\alpha^{(s)}_{t}$: the attention weight on the $t$th encoder hidden state

$e^{(s)}_{t}$: the score with respect to the $t$th encoder hidden state

$\sum_{t'=0}^{T} \exp(e^{(s)}_{t'})$: the sum of the exponentiated scores over all encoder hidden states

Again, we repeat this calculation for every encoder hidden state to obtain the attention weights $\alpha^{(s)} = [\alpha^{(s)}_{0}, \alpha^{(s)}_{1}, ..., \alpha^{(s)}_{T}]$. Each $\alpha^{(s)}_{t}$ tells the decoder how much it should focus on the $t$th encoder hidden state when generating the output. All attention weights are between 0 and 1, and they sum to 1. A value closer to 0 indicates that the input should receive little focus, while a value closer to 1 means the input is highly attended to.

Then, we calculate the weighted sum of the encoder hidden states using the attention weights to obtain the context vector for decoder timestep $s$, where
$$c_s = \sum^{T}_{t=0}{\alpha^{(s)}_{t} h_t}$$

$c_s$: the context vector at decoder timestep $s$

Finally, the decoder can make a prediction at this timestep by using its hidden state and the context vector, where
$$y_s = g(h_s, c_s)$$

$y_s$: the decoder prediction at timestep $s$

In its basic form (for example, with a dot-product score), the attention mechanism is not a neural network and has no parameters to learn. It is only there to help the encoder and decoder learn better. However, learned parameters can be added to improve the model's capability.
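One decoding step of this parameter-free attention can be sketched as follows. The sketch assumes the score function is a plain dot product and uses made-up two-dimensional hidden states; `attention_context` is an illustrative name.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    largest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(decoder_state, encoder_states):
    """Return (attention weights, context vector) for one decoder timestep."""
    # Score every encoder hidden state against the decoder hidden state.
    scores = [dot(decoder_state, h_t) for h_t in encoder_states]
    # Normalize the scores into attention weights.
    weights = softmax(scores)
    # Context vector: weighted sum of the encoder hidden states.
    dim = len(encoder_states[0])
    context = [
        sum(w * h_t[i] for w, h_t in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context


encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical h_0, h_1, h_2
decoder_state = [2.0, 0.0]                              # hypothetical h_s
weights, context = attention_context(decoder_state, encoder_states)
```

The first and third encoder states align with the decoder state and receive equal, larger weights; the second is nearly ignored, so the context vector is dominated by the other two.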

<img src="https://media.geeksforgeeks.org/wp-content/uploads/20200603211336/attn.png">

In the image, a feed-forward network is added, which transforms the target (decoder) hidden state into a representation, $A$, that is compatible with the attention mechanism. It takes the target hidden state from the previous step, written $h_{t-1}$ in the image's notation, and applies a linear transformation followed by a non-linear activation function (e.g., ReLU) to obtain the new representation $A$. In other words, instead of using the hidden state directly, a neural network maps it to a new representation that serves as the input to the attention layer. The formulas above still hold if the hidden state is replaced with $A$.

# Different types of attentions
## 1. Bahdanau Attention (Additive Attention)

Bahdanau attention, also known as **additive attention**, is very similar to the general attention mechanism introduced above. The key difference is that instead of using the **current decoder hidden state** $h_s$ to compute attention scores, Bahdanau attention uses the **previous decoder hidden state** $h_{s-1}$ and each **encoder hidden state** $h_t$ to compute a compatibility score, $e_t$.

The attention score for each encoder hidden state is computed using a small feedforward neural network:

$$e_t = v_a^\top \tanh(W_a h_{s-1} + W_b h_t)$$

where:
- $h_{s-1}$: the **previous decoder hidden state**,
- $h_t$: the **encoder hidden state at time $t$**,
- $W_a$, $W_b$: **learnable weight matrices**,
- $v_a$: a **learnable weight vector** that projects the result into a scalar score. This learned scoring function is more expressive and flexible than a simple dot product, which improves the model's capability.

The rest of the attention operations, such as applying **softmax** to compute the attention weights and computing the **context vector**, remain the same as in general attention. Finally, the context vector is concatenated with the input at the current decoder timestep, $y_{s-1}$ (the previous prediction), and the result is fed to the decoder. This is **one of the earliest attention mechanisms**.
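The additive score itself is a one-hidden-layer network evaluated once per encoder hidden state. A minimal sketch with hand-picked (not learned) parameter values, chosen only so the result is easy to verify:

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def additive_score(h_prev, h_t, W_a, W_b, v_a):
    """e_t = v_a^T tanh(W_a h_{s-1} + W_b h_t)."""
    projected = zip(matvec(W_a, h_prev), matvec(W_b, h_t))
    hidden = [math.tanh(a + b) for a, b in projected]
    return sum(v * h for v, h in zip(v_a, hidden))


# Hypothetical parameters; in a real model these are learned during training.
W_a = [[1.0, 0.0], [0.0, 1.0]]
W_b = [[1.0, 0.0], [0.0, 1.0]]
v_a = [1.0, 1.0]

h_prev = [0.5, -0.5]  # previous decoder hidden state h_{s-1}
h_t = [0.5, 0.5]      # one encoder hidden state
score = additive_score(h_prev, h_t, W_a, W_b, v_a)
```

With identity matrices the hidden layer is $\tanh([1.0, 0.0])$, so the score equals $\tanh(1)$. Because $W_a$ and $W_b$ project both states into a shared space, the decoder and encoder hidden states do not need to have the same size, unlike with a plain dot product.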

## 2. Luong Attention (Multiplicative Attention)
Multiplicative attention builds upon Bahdanau attention but uses a simpler, more efficient scoring method based on dot products.

Whereas additive attention uses the previous decoder hidden state, $h_{s-1}$, multiplicative attention uses the decoder hidden state at the current timestep, $h_s$, to compute the score, so the score is simply calculated as the dot product, where

$$e_t = h_s^{\top} h_t$$

- $h_s$: the current decoder hidden state
- $h_t$: the $t$th encoder hidden state

Then, the calculation of the context vector remains the same. Finally, instead of using the context vector as an input to the RNN cell, we pass it directly to the output layer, where it is combined with the decoder hidden state to predict the output.

This mechanism follows the same steps as the general attention mechanism, but with a different way of computing attention scores. Compared to additive attention, this method is computationally more efficient due to the use of simple dot products, making it faster and more scalable for large models.

## 3. Self attention
The mechanisms introduced up to this point work well, but they share the following two issues.

### 1. Slow Processing in Sequence Models
So far, all the attention mechanisms we have seen operate within sequence models such as RNNs, GRUs, or LSTMs. These models are necessary because traditional neural networks cannot capture dependencies between time steps in sequential data. However, sequence models cannot fully utilize parallel computation due to their inherently sequential processing nature. This leads to slow training times, especially for very large models.

### 2. Static Word Embeddings
Word embeddings, such as those in Word2Vec or GloVe, are powerful in capturing semantic relationships between words. However, these embeddings are static—once trained, their vector representations remain fixed, regardless of the context in which a word appears.

This creates a problem when words have multiple meanings. For example:

* In "He went to the bank to withdraw money", the word "bank" refers to a financial institution.
* In "He sat by the bank of the river", the word "bank" refers to a landform next to water.

Since static word embeddings assign only one vector representation per word, they capture an average meaning rather than adapting to the context. This limitation prevents traditional models from understanding word ambiguity and polysemy (words with multiple meanings).

Self-attention is designed to address these two issues by
* allowing the model to process all tokens simultaneously, leveraging vectorized matrix operations instead of step-by-step recurrence
* applying attention to take a traditional embedding, $e^{<t>}$, as input and output a more refined, contextualized embedding, $A^{<t>}$, which significantly improves the model's ability to "understand" the text

### How self attention works
Self attention operates on three major components: the query matrix, $Q$, the key matrix, $K$, and the value matrix, $V$.

Query ($Q$): the query is a transformed representation of each word in the sequence. It is used to compare against all Key vectors to determine how much attention the model should assign to every word, including itself.

Key ($K$): the key is a transformed representation of each word that allows it to be compared with Queries. It encodes how relevant each word is to others when a Query is searching for context. The Key itself is only used for comparison, not for passing forward information.

Value ($V$): a learned projection of the input embedding, which contains the actual information that will be passed forward in the network. After computing attention weights using the Query and Key, we compute a weighted sum of the Value vectors to generate a refined, context-aware representation of each word.

Note: the key and query have no intrinsic meaning; they are projections of the input word embeddings, optimized purely for computing attention scores. In standard Transformers, the key and query must have the same dimension, and they are compared by taking the dot product between each query vector and each key vector. A high dot product between a Query and a Key indicates that the corresponding word is important and should receive more attention. A low dot product means that the word is less relevant and should be attended to less.

The reason we use two separate transformations (Q and K) is not because Q and K have explicit, predefined meanings, but because it makes the model more powerful by allowing it to learn two different transformations instead of just one. There is no fundamental mathematical requirement to have two separate transformations—you could, in theory, use just one transformation (i.e., compare Q with Q). However, separating Q and K makes the model more flexible and allows it to learn richer representations.

### Process of self attention
First, we convert all the words in the input sequence into their static word embeddings and stack them into a matrix $X$, given by
$$X = [x_0, x_1, ..., x_T]$$
where $x_0, x_1, ...$ are the word embeddings of the individual words.

Unlike RNNs, which process word embeddings one at a time (one word per timestep), self-attention takes in the entire embedding matrix, $X$, and processes all the word embeddings at once.

Then, we calculate a query matrix, a key matrix, and a value matrix for the entire embedding matrix, where

$$Q = X W_Q$$

$$K = X W_K$$

$$V = X W_V$$

$Q$: the query matrix, where $Q = [q_0, q_1, ..., q_T]$. $q_t$ is the query vector for the $t$th word, so the query matrix essentially stacks the query vectors of all input tokens. In general, each query vector has far fewer dimensions than the word embedding. The query matrix has dimension $(T, d_q)$, where $T$ is the number of input tokens and $d_q$ is the size of each query vector, which depends on the architecture.

$K$: the key matrix, where $K = [k_0, k_1, ..., k_T]$. $k_t$ is the key vector for the $t$th word. All key vectors have the same dimension as the query vectors, so the key matrix has dimension $(T, d_k)$ with $d_q = d_k$ ($d_k$ is the size of each key vector).

$V$: the value matrix, where $V = [v_0, v_1, ..., v_T]$. $v_t$ is the learned projection from the word embedding to the value vector of the $t$th word, which stores the information that will be selectively attended to and passed forward after applying attention weights. $V$ has dimension $(T, d_v)$, where $d_v$ is the size of each value vector. Typically, $d_v = d_{model}$, where $d_{model}$ is the size of each word embedding.

$W_Q, W_K, W_V$: learnable matrices that map the input word embeddings to $Q, K, V$

After obtaining the query, key, and value matrices for the input sequence, we can calculate the attention weights from the query and key matrices, where

$$A = softmax(\frac{QK^{\top}}{\sqrt{d_k}})$$

$A$: the attention weight matrix, where $A = [a_0, a_1, ..., a_T]$. $A$ is a $T \times T$ matrix, where $T$ is the number of input tokens, and $a_t$ is the row of attention weights for the $t$th word, which indicates how much attention this word should pay to every word in the sequence (including itself). A higher attention weight means the corresponding Value vector contributes more to the output representation, while a lower attention weight means it contributes less. **Each row** $a_t$ of the attention weight matrix sums to 1. Note that this matrix is not symmetric, because the attention from word A to word B is not necessarily the same as the attention from word B to word A.

<img src="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7dd5ec8-4912-4e0d-8265-b335c08d2958_1760x1126.png" width=500>

$QK^{\top}$: a $T \times T$ similarity score matrix, the matrix product of the query matrix and the transposed key matrix. Each entry is the dot product between one query vector and one key vector and represents how much one word should attend to another. This is the attention score matrix before normalization by the softmax function.

$\sqrt{d_k}$: the square root of the dimension of the key/query space, which scales the dot products to prevent extremely large values. Without scaling, large dot-product values would cause the softmax function to produce highly skewed attention distributions, making the model focus too much on a few tokens and leading to unstable training.

Finally, we use the attention weight matrix and the value matrix to calculate the refined, contextual word embedding specifically for this input sequence, where

$$Z = AV$$

$Z$: the refined embedding matrix for the input sequence, where each row is the update to be made to the corresponding original word embedding based on the context, where

$$X' = X + Z$$

$X'$: the updated, context-aware embedding

$X$: the original input embedding

$Z$: the attention output, which encodes contextual relationships between words. It is combined with the original embedding $X$ through a residual connection to produce the final context-aware representation.

Note: $Z$ and $X$ must have the same dimension for the residual connection to work. If $d_v$ does not equal $d_{model}$, an additional linear layer is used to make the dimensions match.
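The whole process fits in a short sketch. It uses plain Python lists so it runs without extra libraries, identity matrices as stand-ins for the learned $W_Q, W_K, W_V$, and a made-up embedding matrix with three tokens and $d_{model} = 2$; a real implementation would use a tensor library and learned weights.

```python
import math


def matmul(A, B):
    """Matrix product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(M):
    return [list(col) for col in zip(*M)]


def softmax_rows(M):
    """Apply softmax to each row independently."""
    result = []
    for row in M:
        largest = max(row)
        exps = [math.exp(x - largest) for x in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
    d_k = len(K[0])
    # Scaled similarity scores, shape (T, T).
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, transpose(K))]
    A = softmax_rows(scores)   # attention weights, each row sums to 1
    Z = matmul(A, V)           # proposed update for every token
    # Residual connection: X' = X + Z (requires d_v == d_model).
    X_new = [[x + z for x, z in zip(x_row, z_row)] for x_row, z_row in zip(X, Z)]
    return A, Z, X_new


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # hypothetical embeddings, T = 3
identity = [[1.0, 0.0], [0.0, 1.0]]         # stand-in for learned weights
A, Z, X_new = self_attention(X, identity, identity, identity)
```

Even with identical query and key projections, `A` is not symmetric, because each row is normalized separately by the softmax.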

<img src="https://sebastianraschka.com/images/blog/2023/self-attention-from-scratch/summary.png" width=700>

In summary, self attention takes in all the static word embeddings, looks at all the input words, and adjusts the static embedding of each word to refine its meaning based on the given context, which
* captures long-range dependencies in a sequence more effectively than RNNs and LSTMs.
* allows parallel computation, enabling faster training.


## 4. Multihead attention
Multihead attention is an extension of self attention by performing self attention in parallel $h$ times, where $h$ is the number of heads. Each self attention head has its own Q, K, V mapping to capture different contextual relationships in the sequence to propose a update to improve the original word embedding based on the context. By processing information from multiple perspectives, multi-head attention allows the model to better understand dependencies between words.

After each multihead attention layer, we have $h$ refined embedding matrices, denoted by $z_0, z_1, ..., z_{h-1}$, where $z_i$ is the output of the $i$th self-attention head. These outputs are concatenated into a single matrix, given by

$$Z = [z_0, z_1, ..., z_{h-1}]$$

<img src="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff406d55e-990a-4e3b-be82-d966eb74a3e7_1766x1154.png" width=700>

Then, a linear transformation is applied to project this concatenated output, $Z$, back to the model’s embedding size $d_{model}$, where

$$Z' = Concat(z_0, z_1, ..., z_{h-1}) W_o = Z W_o$$

$W_o$: a learned linear mapping that projects $Z$ to the same dimension as the word embedding

Finally, this proposed change will be added to the original embedding to obtain one refined embedding, $X'$, where

$$X' = X + Z'$$

Multihead attention is powerful because it
1. enables the model to capture multiple relationships within the sequence, improving its ability to focus on different parts of the data.
2. allows better representation of complex dependencies and context.
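
The concatenate-then-project step described in this section can be sketched as follows. The head count, the per-head dimension `d_head = d_model // h`, the random weights, and the function name `multihead_attention` are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multihead_attention(X, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) tuples, one tuple per head."""
    outputs = []
    for W_q, W_k, W_v in heads:
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        outputs.append(A @ V)                 # z_i with shape (n, d_head)
    Z = np.concatenate(outputs, axis=-1)      # (n, h * d_head)
    Z_proj = Z @ W_o                          # back to (n, d_model)
    return X + Z_proj                         # residual connection

rng = np.random.default_rng(0)
n, d_model, h = 5, 16, 4                      # toy sizes
d_head = d_model // h
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(h)]
W_o = rng.normal(size=(h * d_head, d_model))
X = rng.normal(size=(n, d_model))
X_new = multihead_attention(X, heads, W_o)
```

Real implementations usually compute all heads with one batched matrix multiplication instead of a Python loop; the loop is kept here because it mirrors the $z_0, ..., z_{h-1}$ notation.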

In Transformer architectures, we use multiple layers of multi-head attention because the refined word embeddings may reveal new relationships that were not apparent in earlier layers. By iteratively applying multi-head attention, the model progressively enhances its understanding of contextual dependencies and word meanings to improve the overall performance

After each attention layer, the updated word embeddings can be processed in parallel because all dependencies, contextual relationships, and relative positions between words have already been captured and encoded within them

## 5. Cross attention
Cross-Attention is a mechanism built upon self-attention and is commonly used in encoder-decoder architectures. Unlike self-attention, which applies attention in only one input sequence, cross-attention allows one sequence to attend to another sequence, capturing dependencies between them.

Self-attention captures dependencies within a single sequence by allowing each token to attend to all other tokens in the sequence. Cross-attention, on the other hand, captures dependencies between two sequences, enabling one sequence (e.g., the target sentence being generated) to attend to another sequence (e.g., the source sentence). Essentially, it serves as a connection between the encoder and decoder.

### Use cases
For traditional sequence generation tasks, self-attention/multihead attention is sufficient since there are no dependencies between sources (there is only one source). Cross-attention is useful when the output sequence depends on an external context, such as in:

* Machine translation: Output depends on the source sentence
* Image captioning: Output depends on visual features from an image
* Text summarization: Output depends on the input document
* Dialogue systems: Response depends on the context of the conversation

### How cross attention works
The operations behind cross-attention are exactly the same as in self/multihead attention; the only difference is where the Query, Key, and Value come from. Cross-attention requires an encoder and a decoder, where the encoder looks at the input sequence (e.g., a sentence in English) and the decoder looks at the sequence generated before the current timestep (e.g., a partially generated sentence in French), both using self/multihead attention blocks.

Encoder output: refined word embedding for the input sequences

Decoder output: refined word embeddings for the generated sequences based on the input sequence

The operations behind cross-attention are mathematically the same as self/multi-head attention—the key difference is how the Query, Key, and Value are sourced.
* Self-attention: Q, K, and V come from the same sequence.
* Cross-attention: Q comes from the decoder, while K and V come from the encoder output.

Cross-attention is commonly used in encoder-decoder architectures, where:
* The encoder processes the input sequence (e.g., an English sentence) using self-attention and outputs refined embeddings.
* The decoder uses self-attention to model dependencies within the partially generated sequence (eg. a partially generated French sentence) before the current timestep

* Encoder output: Refined word embeddings for the input (source) sequence

* Decoder output: First output refined word embeddings for the generated (target) sequence, then use this and the encoder output to predict the next word

In practice, the encoder only needs to process the input sequence once, since the input sequence does not change and its refined word embeddings therefore remain the same. The decoder, however, needs to run every time a new word is generated in order to predict the next word.

### Process of cross attention
Once the encoder and decoder have produced their refined word embeddings, the cross-attention layer maps the encoder output to the Key and Value matrices and the decoder output to the Query matrix. It then uses Q, K, and V to perform the same operations as self-attention, generating refined embeddings that tell the decoder how to align its outputs with the encoder's output at a given timestep.

In this case, the Q and K matrices will have size $(m, d_q)$ and $(n, d_k)$ respectively where $n$ is the size of the input sequence and $m$ is the size of the output sequence, so $m \neq n$ in general, but $d_q = d_k$

$QK^{\top}$: instead of comparing each word with all other words within one sequence, this matrix compares each word in one sequence to the words in the other sequence. The output dimension is $(m, n)$, where each entry $(i, j)$ tells how much attention the decoder token at position $i$ should give to the encoder token at position $j$

After the softmax function, we obtain the attention weight matrix $A$ of size $(m, n)$. The final context vector, $Z$, is obtained by
$$Z = AV$$
where V has the size of $(n,  d_v)$, so $Z$ has the dimension of $(m, d_v)$.

The context vector, $Z$, tells the decoder what information to incorporate from the encoder to refine the current hidden states. The final refined, decoder hidden state is given by
$$H' = H + Z$$
where $H'$ is the new hidden state and $H$ is the original hidden state

The final output of the cross attention layer is $H'$ (a refined decoder hidden state), which tells the decoder how to use the encoder information in the next prediction
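
The shape bookkeeping above is easy to check with a small sketch. The sizes $m = 3$, $n = 6$, the random inputs, and the function name `cross_attention` are assumptions for illustration; `H` stands in for the decoder hidden states and `E` for the encoder output.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(H, E, W_q, W_k, W_v):
    """Queries come from the decoder (H); keys and values from the encoder (E)."""
    Q = H @ W_q                                   # (m, d_k)
    K = E @ W_k                                   # (n, d_k)
    V = E @ W_v                                   # (n, d_v)
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))   # (m, n)
    Z = A @ V                                     # (m, d_v)
    return H + Z, A                               # H' = H + Z

rng = np.random.default_rng(0)
m, n, d_model = 3, 6, 8                           # target length, source length
H = rng.normal(size=(m, d_model))
E = rng.normal(size=(n, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
H_new, A = cross_attention(H, E, W_q, W_k, W_v)
```

Row $i$ of `A` is a distribution over the $n$ source positions, so it can be read as a soft alignment for target position $i$.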

<img src="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa16d2bd0-984d-4224-9c4c-2f0cc144a599_1520x1108.png" width=700>

Essentially, the cross-attention layer combines and aligns the information from two sources to capture the dependencies between them.

# Transformer
Putting everything together, the Transformer combines the encoder-decoder architecture with attention mechanisms, which gives it the following advantages:
1. Parallel computation
2. Effective capture of long-range dependencies
3. Preservation of information about input order

## Architecture
<img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/0*hMpd6D6yc7L7eY7s.png" width=800>

The transformer model follows an encoder-decoder architecture that uses self-attention and cross-attention mechanisms. The encoder processes the entire input sequence in parallel, while the decoder is autoregressive, generating output tokens one by one. The encoder and decoder are each built from a stack of $N$ identical encoder/decoder blocks.

<img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/0*icgq7WfgAryCABqv.png" width=500>


### Positional encoding
One issue with the encoder processing all embeddings in parallel is that the self-attention module cannot capture the order of words in a sentence. Therefore, we need positional encodings to provide information about the location of each word in the sentence.

Simply assigning position numbers to each word (word0, word1, etc.) is one solution, but it has several issues:
1. Unbounded Growth of Numbers: large position numbers do not generalize well, making it difficult for the model to learn meaningful position representations
2. Discrete Numbers: discrete position numbers do not encode smooth positional relationships, making it harder for the model to understand gradual positional changes
3. Inability to Capture Relative Positions: position numbers only encode absolute positions, not relative distances, because the model treats them as arbitrary numerical values rather than meaningful spatial relationships. This means the model cannot easily determine how far apart words are in the sequence, which is crucial for understanding sentence structure.

Therefore, we use sine and cosine positional encoding, which is bounded, continuous, and periodic, enabling the model to capture both absolute and relative positions effectively. The positional encoding is given by

$$PE_{(pos, 2i)} = \sin{(\frac{pos}{10000^{\frac{2i}{d}}})}$$

$$PE_{(pos, 2i + 1)} = \cos{(\frac{pos}{10000^{\frac{2i}{d}}})}$$

$pos$: the position index of the word in the sequence, which is a discrete integer ranging from 0 to sequence length - 1

$i$: the index of the encoding dimension pair, ranging from 0 to $d/2 - 1$

$d$: size of the word embedding, which equals $d_{\text{model}}$

Positional encoding generates values for even dimensions (index $2i$) using the sine function and odd dimensions (index $2i + 1$) using the cosine function. The final encoding has the same size as the word embedding, $d_\text{model}$.
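
A minimal NumPy sketch of this encoding, with sine on the even dimensions and cosine on the odd ones. The function name `positional_encoding` and the sizes used are assumptions for illustration, and the sketch assumes $d$ is even.

```python
import numpy as np

def positional_encoding(seq_len, d):
    """Sinusoidal positional encoding of shape (seq_len, d); assumes d is even."""
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(d // 2)[None, :]             # (1, d/2), index of each sin/cos pair
    angles = pos / (10000 ** (2 * i / d))      # (seq_len, d/2)
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(angles)               # even dimensions 2i
    pe[:, 1::2] = np.cos(angles)               # odd dimensions 2i + 1
    return pe

pe = positional_encoding(seq_len=50, d=16)
```

At position 0 every sine entry is 0 and every cosine entry is 1, and all values stay within $[-1, 1]$ no matter how long the sequence is.
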
<img src="https://machinelearningmastery.com/wp-content/uploads/2022/01/PE3.png" width=500>

Sine and cosine waves oscillate, meaning they repeat values periodically. However, they still capture distance and relationships effectively due to two key properties:
1. Different frequencies for different dimensions (multi-scale representation)
2. Relative position information is preserved through phase differences

#### Multi-Scale Representation
Each position $p$ is encoded with multiple sine and cosine waves whose frequencies depend on $i$. Using different frequencies captures both the local and global position information of each word. This makes every position unique even though individual sin/cos values repeat periodically.

* Lower-frequency components (when $i$ is large) have a longer period and capture global position information, like whether a word is near the start or end
* Higher-frequency components (when $i$ is small) have a shorter period and capture local position information, like how close two words are to each other

#### Phase Differences Capture Relative Position
For a fixed frequency (when $i$ is fixed), the encoding at position $p + k$ is a fixed linear transformation (a rotation of the sine/cosine pair) of the encoding at position $p$, where

$$PE(p + k) = M_k \, PE(p)$$

for any position $p$ and separation $k$, and the matrix $M_k$ depends only on $k$, not on $p$. As a consequence, the dot product $PE(p) \cdot PE(p + k)$ also depends only on $k$. Therefore, the model can infer relative distances between words from their input embeddings, regardless of where the words sit in the sequence.
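
One way to see that relative offsets are recoverable is a quick numerical check: the dot product between the encodings of two positions that are $k$ apart comes out the same wherever the pair sits in the sequence. The sketch below assumes the standard sine/cosine encoding with even $d$; the function name and the chosen positions are illustrative.

```python
import numpy as np

def positional_encoding(seq_len, d):
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d))
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

d, k = 32, 7
pe = positional_encoding(seq_len=100, d=d)

# Dot product between encodings k positions apart, measured at several p.
dots = np.array([pe[p] @ pe[p + k] for p in (0, 10, 50, 90)])

# Closed form: sum over frequencies of cos(w_i * k), independent of p.
freqs = 1.0 / (10000 ** (2 * np.arange(d // 2) / d))
expected = np.cos(freqs * k).sum()
```

The identity follows from $\sin a \sin b + \cos a \cos b = \cos(a - b)$ applied to each sine/cosine pair.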

<img src="https://miro.medium.com/v2/resize:fit:1012/format:webp/1*JdYCgMl1NXshwi3GfUFSSg.png" width=500>

Finally, the positional encoding is added to the original word embedding to form the input word embeddings for the attention layer. This ensures that the new input word embeddings (size = $d_{\text{model}}$) not only encode the semantic meaning of words but also incorporate their global, local, and relative positional information within the sequence, ensuring the model understands word order and relationship effectively

### Masked Multihead Attention Layer
In the decoder of the Transformer architecture, there is a masked multihead attention layer, also called causal attention. This is needed because, during training, we feed the entire target sentence (ground truth) into the decoder at once using the teacher forcing method. However, this also means the decoder could cheat by attending to future ground truth tokens.

Therefore, at each timestep, we mask out all future tokens to prevent the decoder from seeing them. This ensures that the decoder can only attend to earlier ground truth tokens, mimicking the real scenario during inference, where it must use only previously generated tokens to predict the next token.

To mask out all future tokens, we first compute the similarity score matrix ($QK^{\top}$) as usual. Then, for each query position, we add $-\infty$ to the scores of all future tokens. After applying softmax, the attention weights for those future tokens become $e^{-\infty} = 0$, meaning the model does not attend to or use future tokens for prediction.
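
A small sketch of the masking step, assuming a single head and random toy inputs; the helper name `causal_mask` is hypothetical. Future positions are the entries strictly above the diagonal of the score matrix.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_mask(T):
    # True strictly above the diagonal marks future positions to block.
    return np.triu(np.ones((T, T), dtype=bool), k=1)

rng = np.random.default_rng(0)
T, d_k = 4, 8                                  # toy sizes
Q = rng.normal(size=(T, d_k))
K = rng.normal(size=(T, d_k))

scores = Q @ K.T / np.sqrt(d_k)
scores = np.where(causal_mask(T), -np.inf, scores)   # hide future tokens
A = softmax(scores)                            # lower-triangular weights
```

The first row of `A` puts all of its weight on the first token, and every row still sums to 1 because the diagonal entry is never masked.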

<img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/0*QYFua-iIKp5jZLNT.png" width=400><img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/0*3ykVCJ9okbgB0uUR.png" width=400>

Masking ensures that attention to future tokens is always zero, but each token still computes attention over past tokens, producing a meaningful context-aware output based solely on previous context. This context-aware output from the masked attention layer is then added to the original word embedding via a residual connection. This ensures that the static word embeddings are refined using only past context, maintaining autoregressive behavior during both training and inference.

Thus, the primary purpose of causal attention is to prevent the model from using future tokens during training, thereby simulating inference-time behavior, where ground truth tokens are not available

### Encoder
The encoder takes in the input sequence as tokens and processes them in parallel to output a refined, context-aware embedding. Thus, the encoder is not an autoregressive model

The actual encoder consists of $N$ encoder blocks, where each block contains:
* A Multihead Self-Attention layer
* A Residual Connection followed by Layer Normalization
* A Fully Connected Feed-Forward layer
* Another Residual Connection followed by Layer Normalization

<img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/0*Blt8BYGjFAqXyajX.png" width=700>

First, the input is fed into the multihead attention layer, which computes how the embeddings should be updated based on the sentence context. The residual connection then adds this contextual information back to the original embedding, producing a more refined embedding.

Next, this refined embedding is normalized using LayerNorm and then fed into the fully connected layer. The fully connected layer is point-wise, meaning that each token embedding is passed through the same MLP independently, using shared weights. This is because self-attention already captures contextual information between tokens, so the MLP's role is to refine each token’s representation individually. It also introduces additional non-linearity and modeling capacity, allowing the model to learn more abstract and expressive patterns.

Finally, the output from the fully connected layer is added back to the input through a residual connection, followed by LayerNorm, producing the final refined representation.

Note: Residual connections and LayerNorm help stabilize gradients during training by preventing vanishing gradients and ensuring smoother optimization.
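
The encoder block described above can be sketched as follows. This is a simplified illustration under several assumptions: a single attention head instead of multiple heads, a ReLU feed-forward layer, LayerNorm without its learnable scale and shift, and random toy weights. The names `layer_norm` and `encoder_block` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each token vector; learnable scale/shift omitted for brevity.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def encoder_block(X, W_q, W_k, W_v, W1, b1, W2, b2):
    # 1) Self-attention, then residual connection and LayerNorm.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    X = layer_norm(X + A @ V)
    # 2) Point-wise feed-forward: the same weights are applied to every token.
    hidden = np.maximum(0, X @ W1 + b1)        # ReLU
    return layer_norm(X + hidden @ W2 + b2)    # residual + LayerNorm

rng = np.random.default_rng(0)
n, d_model, d_ff = 4, 8, 32                    # toy sizes
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.1, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.1, np.zeros(d_model)
out = encoder_block(X, W_q, W_k, W_v, W1, b1, W2, b2)
```

Because the output has the same shape as the input, the same function can be applied $N$ times in a row (with different weights per block) to form the full encoder.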

The output of each encoder block is a set of vectors, where each vector represents a token in the input sequence, now enriched with contextual information. We stack multiple encoder blocks because deeper layers can capture higher-level relationships that emerge from the refined embeddings, improving contextual representation. The output of the final encoder layer is a set of context-aware token embeddings, which are passed to the decoder for sequence generation.

### Decoder
The decoder takes in the output of the encoder and all previously generated tokens to predict the next token in an autoregressive manner.

Similar to the encoder, the decoder also contains $N$ identical blocks, where each block contains
* A Masked Multihead Self-Attention Layer (attends to previous tokens only)
* A Residual Connection followed by Layer Normalization
* A Multihead Cross-Attention Layer (attends to encoder output)
* Another Residual Connection followed by Layer Normalization
* A Fully Connected Feed-Forward Layer (MLP)
* Another Residual Connection followed by Layer Normalization

<img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/0*GH8TpFxjIwwbHCr4.png" width=400>

First, the decoder adds positional encoding to the input word embeddings — which are the ground truth tokens during training (via teacher forcing) or previously generated tokens during inference.

Then, the decoder passes these embeddings through a masked multihead self-attention layer, which masks out all future tokens to prevent the model from “cheating.” This attention layer computes context-aware embeddings, where each token can only attend to itself and earlier tokens in the sequence. The output is then added back to the input through a residual connection, normalized by a LayerNorm layer, and passed along.

Next, the output of the masked self-attention is used as the Query in a cross-attention layer, while the Key and Value come from the encoder’s output. This cross-attention mechanism computes how each token in the partially generated target sequence should attend to the encoder's representations of the input sequence. This allows the decoder to align and incorporate relevant context from the source sentence into the target-side embeddings.

The output of the cross-attention layer is added back through a residual connection and passed through another LayerNorm, followed by a point-wise feed-forward network (MLP) with the same structure as in the encoder, and then another residual connection and LayerNorm. This stack refines each token’s embedding independently while adding non-linearity and keeping training stable.

After passing through $N$ decoder blocks (each using the same encoder output), the final embeddings are passed into a linear layer, followed by a softmax, to produce a probability distribution over the target vocabulary — from which the next token is predicted.

During inference, only the last token’s refined embedding from the decoder is used to predict the next word, ensuring the input to the linear layer has the expected shape (i.e., a single vector of size $d_{model}$)

During training, the process is fully parallelized. The decoder produces one refined embedding per token in the decoder input (the shifted target sequence), and each embedding is used to predict the next word at its corresponding position. This allows the model to predict the entire target sentence in a single forward pass, while still preserving autoregressive behavior using a causal attention mask.
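
A compact sketch of one decoder block followed by the output projection may help tie these steps together. It is a simplification under stated assumptions: single-head attention, no biases, LayerNorm without learnable parameters, and random toy weights. The names `attention`, `decoder_block`, `mats`, and the parameter dictionary `p` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def attention(Q_in, KV_in, W_q, W_k, W_v, mask=None):
    Q, K, V = Q_in @ W_q, KV_in @ W_k, KV_in @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    if mask is not None:
        scores = np.where(mask, -np.inf, scores)   # blocked positions get weight 0
    return softmax(scores) @ V

def decoder_block(Y, E, p):
    T = Y.shape[0]
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    Y = layer_norm(Y + attention(Y, Y, *p["self"], mask=future))  # masked self-attention
    Y = layer_norm(Y + attention(Y, E, *p["cross"]))              # cross-attention
    hidden = np.maximum(0, Y @ p["W1"])                           # point-wise MLP
    return layer_norm(Y + hidden @ p["W2"])

rng = np.random.default_rng(0)
d, d_ff, vocab = 8, 32, 20                         # toy sizes

def mats(count, shape):
    return [rng.normal(size=shape) * 0.1 for _ in range(count)]

p = {"self": mats(3, (d, d)), "cross": mats(3, (d, d)),
     "W1": mats(1, (d, d_ff))[0], "W2": mats(1, (d_ff, d))[0]}
W_out = rng.normal(size=(d, vocab))                # final linear layer
E = rng.normal(size=(6, d))                        # stand-in encoder output
Y = rng.normal(size=(3, d))                        # embedded target prefix

H = decoder_block(Y, E, p)
probs_all = softmax(H @ W_out)    # training: one next-token distribution per position
probs_next = probs_all[-1]        # inference: only the last position is used
```

Because of the causal mask, changing a later token in `Y` leaves the outputs at earlier positions unchanged, which is what makes the parallel training pass equivalent to step-by-step prediction.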

#### Difference between causal and cross attention
You can think of the masked self-attention layer in the decoder as computing a context-aware embedding based on the target sentence itself — specifically, the tokens that have already been generated or provided (i.e., past tokens only).

On the other hand, the cross-attention layer computes a context-aware embedding by incorporating information from both the target-side tokens (via Query) and the source-side tokens (via Key and Value) — effectively aligning and integrating context from the input sentence into the target sentence representation.

## Training
Transformer training is not strictly autoregressive in implementation, as parallelization is used for efficiency. However, it remains autoregressive in principle, since each token is predicted using only the previous tokens.

#### 1. **Tokenization & Special Tokens**
Tokenize the raw training sequences — both the source sentence (e.g., in English) and the target sentence (e.g., in French).  
Add special tokens such as `<sos>` (start of sentence), `<eos>` (end of sentence), and `<pad>` (for padding). Padding is added at the end of a sequence to ensure every sequence in a batch has the same length as the longest sequence in the batch.
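
As a rough illustration of this step, the sketch below wraps each already-tokenized sentence in `<sos>`/`<eos>` and right-pads the batch. The function name `pad_batch` and the word-level tokens are assumptions; real pipelines typically work with integer token ids from a subword tokenizer.

```python
PAD, SOS, EOS = "<pad>", "<sos>", "<eos>"

def pad_batch(sequences):
    """Add <sos>/<eos> to each sequence, then right-pad to the longest length."""
    wrapped = [[SOS] + seq + [EOS] for seq in sequences]
    max_len = max(len(seq) for seq in wrapped)
    return [seq + [PAD] * (max_len - len(seq)) for seq in wrapped]

batch = pad_batch([["I", "like", "cats"], ["Hello"]])
# batch[0] == ["<sos>", "I", "like", "cats", "<eos>"]
# batch[1] == ["<sos>", "Hello", "<eos>", "<pad>", "<pad>"]
```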

#### 2. **Word Embedding**
Convert each token into a word embedding using a learned embedding matrix.  
Typically, **separate embedding matrices** are used for the source and target sequences (e.g., one for English and one for French) to account for different vocabularies.

Note: we treat all `<pad>` like a normal token and convert all of them into a fixed embedding

#### 3. **Add Positional Encoding**
Since the Transformer lacks recurrence or convolution, inject **positional information** into the embeddings by adding a deterministic positional encoding (e.g., sine/cosine functions) to each token's embedding, for both the source and target sequences.

#### 4. **Encoder Processing**
Feed the source sequence embeddings into the **encoder**, which applies a stack of self-attention and feedforward layers.  
This process is **fully parallelized**, meaning all tokens in the input sequence are processed simultaneously.  
The encoder outputs **context-aware representations** of each input token.

**Note:** All `<pad>` tokens are processed like real tokens: they pass through the embedding, positional encoding, and attention layers just like any other token. However, we apply a padding mask inside the attention mechanism to ensure that real tokens do not attend to `<pad>` tokens. This is done by setting all attention scores toward `<pad>` tokens to $-\infty$ before the softmax operation, so the corresponding attention weights become zero.

While `<pad>` tokens may still compute attention to real tokens, their influence is negligible since their outputs are not used during loss calculation or decoding. Keeping them in the computation slightly increases the cost, but it allows for efficient parallelization using fixed-size tensors — which makes the overall training process faster and simpler.
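
A padding mask can be sketched like this. A common way to implement it is to set the scores of the `<pad>` columns to $-\infty$ before the softmax so that the resulting weights are exactly zero. The token list and random Q/K matrices are toy assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

tokens = ["I", "like", "cats", "<pad>", "<pad>"]
is_pad = np.array([t == "<pad>" for t in tokens])   # (n,) True at pad positions

rng = np.random.default_rng(0)
n, d_k = len(tokens), 4
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))

scores = Q @ K.T / np.sqrt(d_k)
# Broadcast the mask over rows: every query ignores the pad *columns*.
scores = np.where(is_pad[None, :], -np.inf, scores)
A = softmax(scores)
```

The pad rows of `A` are still computed (pad positions attend to real tokens), which matches the description above: those outputs simply go unused.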


#### 5. **Decoder Processing with Teacher Forcing**
During training, the target sequence embeddings (from the ground truth) are passed into the **decoder** using **teacher forcing**.  
This means we input the full target sequence (shifted by one token) into the decoder, rather than using the model’s previous predictions.
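
The "shifted by one token" idea can be made concrete with a tiny example. The French sentence is made up for illustration; the decoder input drops the final token and the labels drop the first, so each input position is paired with the token that follows it.

```python
target = ["<sos>", "Je", "suis", "étudiant", "<eos>"]

decoder_input = target[:-1]   # what the decoder sees
labels = target[1:]           # what it should predict at each position

pairs = list(zip(decoder_input, labels))
# [("<sos>", "Je"), ("Je", "suis"), ("suis", "étudiant"), ("étudiant", "<eos>")]
```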

The decoder uses a **masked (causal) self-attention layer**, where a triangular mask ensures that each token can only attend to itself and previous tokens — not future ones. This enforces autoregressive behavior.

> **Example:**  
> In the attention mask matrix below, each row shows which tokens a particular position can attend to:
> 
> - The first row (e.g., "Your") can only attend to itself.  
> - The second row (e.g., "Journey") can attend to "Your" and "Journey".  
> - This pattern continues through the sequence.  
> 
> These masked inputs are then combined with the encoder outputs using **cross-attention** to inform decoding with both source and target context.

Note: similarly, we add a padding mask to ensure that `<pad>` tokens are not attended to by real tokens, so the padding has no impact on the model's results.

<img src="https://drek4537l1klr.cloudfront.net/raschka/v-7/Figures/ch03__image037.png" width="500">

#### 6. **Prediction & Parallelization**
The decoder outputs the full predicted target sentence, where each token is predicted independently based on its own masked context and encoder-derived information.  

Although each token is technically predicted **autoregressively**, the use of **masking and teacher forcing** during training enables **parallel processing** of all target tokens at once to speed up training.

Note: the model also computes predictions for `<pad>` positions, but these predictions are ignored during loss calculation, so they don't affect learning.

#### 7. **Loss Function & Backpropagation**
After the decoder predicts all tokens in the target sequence, the model compares each prediction individually to the corresponding ground truth token using a cross-entropy loss. This loss measures how well the predicted probability distribution matches the actual label.

For each position $t$, the loss for the $t$-th token in the sequence is

$$ \mathcal{L}_t = -\sum_{j=1}^{V} y_{t,j} \cdot \log(\hat{y}_{t,j}) $$

$V$: total vocabulary size

$\hat{y}_t$: the predicted probabilities for each word (probabilities sum to 1)

$y_t$: one-hot encoded ground truth token


Because sequences are padded to the same length for batching, some positions in the ground truth labels are `<pad>` tokens, which shouldn't contribute to the loss.

Therefore, we define a binary mask $m_t \in \{0, 1\}$ for each position $t$:
* $m_t = 1$ if the target token is **not** a `<pad>`
* $m_t = 0$ if the token is a `<pad>`

Then, we apply the mask when calculating the average loss of a sequence. Essentially, this mask ensures that if the ground truth label at a position is `<pad>`, that position is excluded, so only real tokens contribute to the loss and to the gradients computed during backpropagation.

The average loss is given by

$$ \mathcal{L} = \frac{\sum_{t=1}^{T} m_t \cdot \mathcal{L}_t}{\sum_{t=1}^{T} m_t} $$

Finally, we use the computed loss to calculate the gradient and perform backpropagation
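
The masked average loss can be sketched in NumPy as follows. Since $y_t$ is one-hot, the per-token loss reduces to the negative log of the probability assigned to the correct token. The function name `masked_cross_entropy`, the three-word vocabulary, and the choice of id 2 as `<pad>` are assumptions for the example.

```python
import numpy as np

def masked_cross_entropy(probs, targets, pad_id):
    """probs: (T, V) predicted distributions; targets: (T,) ground truth ids."""
    T = targets.shape[0]
    per_token = -np.log(probs[np.arange(T), targets])   # L_t for each position
    mask = (targets != pad_id).astype(float)            # m_t
    return (mask * per_token).sum() / mask.sum()

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])
targets = np.array([0, 1, 2])      # id 2 plays the role of <pad> here
loss = masked_cross_entropy(probs, targets, pad_id=2)
```

Here the third position is padding, so the result is the average of $-\log 0.7$ and $-\log 0.8$ only.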


## Inference
At inference time, the model has no access to the ground truth target sequence, so it generates tokens **autoregressively** — predicting one token at a time based on all previously generated tokens.

#### 1. **Tokenization, Word Embedding, and Positional Encoding**
Similar to training, the model first **tokenizes** the input source sequence (e.g., a sentence in English), converts the tokens into **word embeddings**, and adds **positional encodings** to inject order information.  
At this stage, the process applies only to the **encoder** input, since there is no ground truth target sequence to feed the decoder during inference.


#### 2. **Encoder Processing**
The encoder processes the entire input sequence in parallel using self-attention and feedforward layers, producing **context-aware representations** for each token in the source sequence.

This step is identical to training, where all tokens are processed simultaneously.


#### 3. **Decoder Processing**
The decoder starts with an initial input of the special `<sos>` (start-of-sequence) token. It combines this input with the encoder’s output through the **cross-attention layer** to predict the next token.

Once the next token is generated:
- It is **appended to the decoder input sequence**
- The updated sequence is passed back into the decoder to generate the next token

This process repeats **one token at a time**, with the decoder continually attending to both the **previously generated tokens** (via causal masking in self-attention) and the **encoder output** (via cross-attention), until an `<eos>` (end-of-sequence) token is generated or a maximum length is reached.
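
The generation loop can be sketched independently of any particular model. In the sketch below, `next_token_fn` is a hypothetical stand-in for "run the decoder on the current prefix and pick the most likely next token" (greedy decoding), and `toy_next_token` is a made-up rule used only so the loop can be run end to end.

```python
def greedy_decode(next_token_fn, source, max_len=20):
    """Generate tokens one at a time until <eos> or max_len is reached."""
    generated = ["<sos>"]
    while len(generated) < max_len:
        token = next_token_fn(source, generated)   # uses all previous tokens
        generated.append(token)
        if token == "<eos>":
            break
    return generated

# Hypothetical stand-in for a trained model: "translates" by upper-casing
# the source tokens one by one, then emits <eos>.
def toy_next_token(source, generated):
    step = len(generated) - 1
    return source[step].upper() if step < len(source) else "<eos>"

output = greedy_decode(toy_next_token, ["je", "suis"])
# output == ["<sos>", "JE", "SUIS", "<eos>"]
```

Greedy selection is only one option; sampling or beam search would replace the single "pick the most likely token" step while keeping the same loop structure.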

Note: During inference there are no future tokens to hide from the position being predicted. However, when the decoder re-processes the whole generated prefix at every step, the causal mask is still applied so that the representations of earlier positions are computed the same way as during training. In optimized decoding strategies such as cached decoding (where key and value tensors from previous steps are stored and reused), only the newest token acts as a query, and every cached position already lies in its past, so no future positions need to be masked at that step.


# LLM Architectures
There are multiple variations of transformer architectures in NLP, each tailored for different purposes.

## Seq2Seq (Encoder-Decoder) Models
Seq2Seq models use both the encoder and decoder components of the traditional transformer architecture. They are effective at both interpreting and generating text. This architecture is commonly used in tasks where the model needs to process and generate sequences - such as machine translation or text summarization - especially when attending to multiple input elements

<img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/1*O1zaE60vTfrDE0w4WTU5Ww.png" width=500>

## AutoEncoding Models (Encoder-only)
Encoder-only models utilize only the encoder portion of the transformer. They excel at understanding and analyzing input text, making them well-suited for tasks such as sentiment analysis, text classification, or named entity recognition. These models are designed to capture contextual relationships within the input without generating new sequences.

<img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/1*-ZK1N7ATPw4v0p93aSOnIQ.png" width=500>

## AutoRegressive Models (Decoder-only)
Decoder-only models rely solely on the decoder component of the transformer. They are designed for text generation, where each token is generated based on the previously generated ones. These models, such as GPT, are optimized for tasks like story generation, chatbots, or code completion. However, since they lack an explicit encoder, their ability to deeply understand long input contexts is more limited compared to encoder-based models.

<img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/1*YV-zoM7wVTh_2yyPIIxqWA.png" width=500>

# Reinforcement Learning From Human Feedback

# Direct 

# Agentic LLMs

# Resources
* CS230: https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks#