# <center> MSBA 6461: Advanced AI for Natural Language Processing </center>
<center> Summer 2025, Mochen Yang </center>

## <center> Transformer Architecture </center>

# Table of Contents
1. [Transformer Architecture](#transformer)
    - [What is the Transformer Architecture?](#transformer_intro)
    - [Self-Attention](#transformer_components)
        - [Multi-Head Attention](#multihead)
        - [Feedforward Neural Network](#FFNN)
    - [Other Components of Transformer](#transformer_other)
        - [Positional Encoding](#transformer_other_pe)
        - [Layer Normalization and Residual Connection](#transformer_other_lnrc)
        - [Putting Everything Together](#transformer_all)
    - [Encoder vs. Decoder](#transformer_encoder_decoder)
1. [Additional Resources](#resource)

# Transformer <a name="transformer"></a>

The transformer architecture is arguably one of the most important deep learning architectures we have right now. It is the bedrock of virtually all large language models on the market. It has been applied to representation learning tasks for many types of data, including text, images, video, and time series. In addition to its wide applicability, it is also responsible for many state-of-the-art results in AI. The goal of this notebook is to offer an in-depth yet accessible exposition of the transformer architecture (mostly based on [this seminal paper](https://arxiv.org/pdf/1706.03762)) with small-scale demonstrations (for actual implementations, please refer to ```pytorch/Transformer.ipynb```).

## What is the Transformer Architecture? <a name="transformer_intro"></a>

The transformer architecture we will discuss here largely follows the same encoder-decoder structure as the RNN-based sequence-to-sequence models we discussed before, but it completely throws away the RNNs in the encoder and decoder, and uses only (a particular kind of) attention mechanism combined with fully-connected feed-forward neural networks (i.e., non-recurrent).

<font color="red">But why would you want to throw away the RNNs?</font> One of the key reasons is computational complexity. In an RNN, computations have to be done sequentially (e.g., processing one word after another), which prevents parallelization across positions. As a result, large-scale tasks with RNNs may become very slow. As you will see, most of the computations in a transformer (especially the self-attention component) can be done in parallel.

There are a number of technical components to a transformer architecture (see figure below), including self-attention, positional encoding, layer normalization, and residual connection. I will explain the intuition behind these components, with an emphasis on the self-attention mechanism. 

![Transformer Architecture](images/transformer.png)

image credit: [Attention is all You Need](https://arxiv.org/pdf/1706.03762.pdf) (Figure 1)

## Self-Attention <a name="transformer_components"></a>

The attention mechanism that we discussed before can be thought of as a "layer" that sits between an encoder and a decoder, which allows the decoder RNN to "pay attention to" different positions of the encoder hidden states. Because the attention layer is between encoder and decoder, it is often referred to as **cross-attention**. The transformer architecture relies on a twist of this attention mechanism, namely **self-attention**.

![Self-Attention Visual Illustration](images/self_attention.png)

image credit: [Self-Attention For Generative Models](https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/slides/cs224n-2019-lecture14-transformers.pdf)

You can think of self-attention as a mechanism that applies to an input sequence _itself_ (like the visualization above), in order to generate a representation of the sequence that encodes information about how different words in the sequence are related to each other. In a (non-rigorous) sense, it allows the representation of the input sequence to contain information about "interactions" among different words in the sequence. Importantly, the entire process of calculating the self-attention representation of an input sequence does NOT involve any RNNs or word-by-word recurrence. That's the point of the transformer: it is a highly parallel architecture.

Now let's get technical about self-attention. Given a sequence of tokens $(e_1, \ldots, e_T)$ where $e_t$ is the embedding representation (dimension = $D$) of the $t$-th token, the self-attention mechanism seeks to "associate" each token with all the other tokens and incorporate those associations into the (attention-enriched) representation of the token. Specifically, self-attention based on dot-product transforms $e_t$ to
$$e_t^{Attn} = softmax\left( \frac{e_t \cdot e_1}{\sqrt{D}}, \ldots, \frac{e_t \cdot e_T}{\sqrt{D}} \right) \cdot (e_1, \ldots, e_T)$$
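where $\cdot$ is the dot-product operation. In other words, $e_t^{Attn}$ is a weighted average of all token embeddings in the sequence, with weights given by the softmax of the scaled dot products between $e_t$ and each token. $\sqrt{D}$ is a scaling factor based on the embedding dimension that keeps the dot products from "blowing up" when the dimension is high ($e_t \cdot e_i$ tends to grow in magnitude as $D$ increases), which would otherwise push the softmax toward extremely peaked weights. If you re-write the above in matrix terms, you will see that it's basically the dot-product attention mechanism where key ($K$), query ($Q$), and value ($V$) are all the same input embeddings.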
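The formula above can be sketched in a few lines of plain Python. This is only a minimal illustration: the function names (`softmax`, `dot`, `self_attention`) and the toy embeddings are made up for this example, and the explicit loop over tokens is for readability only (real implementations use matrix operations so that all positions are processed in parallel).

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of numbers."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def self_attention(embeddings):
    """Dot-product self-attention where query = key = value = embeddings."""
    D = len(embeddings[0])
    outputs = []
    for e_t in embeddings:
        # Scaled dot product of token t with every token in the sequence.
        scores = [dot(e_t, e_i) / math.sqrt(D) for e_i in embeddings]
        weights = softmax(scores)
        # Weighted average of all token embeddings.
        outputs.append(
            [sum(w * e_i[d] for w, e_i in zip(weights, embeddings)) for d in range(D)]
        )
    return outputs

# Toy sequence with T = 3 tokens and D = 2 (made-up numbers).
toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(toy)
for row in attended:
    print([round(v, 3) for v in row])
```

Because each output is a weighted average, every attended embedding stays within the range spanned by the original embeddings.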

### Multi-Head Attention <a name="multihead"></a>

To let the model capture several different kinds of relationships among tokens at the same time, people often use something called **Multi-Head Self-Attention**. The high-level idea is that you project $Q, K, V$ multiple times with trainable weight matrices, apply self-attention to each projection (these "heads" can be computed in parallel), then concatenate the results together. More technical details below.

For cleaner notation, let's pack all embeddings of the sequence into a matrix of shape $(T, D)$ (i.e., one token embedding per row). The above (single-head) attention mechanism can be represented in the following matrix format:
$$ Attention(Q, K, V) = softmax\left(\frac{QK^\top}{\sqrt{D}} \right) V $$
where $K^\top$ is the transpose of $K$, the softmax is applied to each row, and $K = Q = V$ are all the same embedding matrix.

Then, with multi-head attention, we first project the query, key, and value matrices into lower-dimensional embedding matrices. This is done by multiplying them with separate, head-specific weight matrices $W_i^Q$, $W_i^K$, $W_i^V$. Consider, for example, a 4-head self-attention; then each of these weight matrices has shape $(D, D/4)$. Next, for each head $i \in \{1,2,3,4\}$, we compute the regular self-attention as:
$$ head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)$$
(In the original paper, the scaling factor inside each head is the square root of the projected dimension, here $\sqrt{D/4}$.)

Finally, the 4 heads, each of shape $(T, D/4)$, are concatenated along the embedding dimension into a $(T, D)$ matrix, followed by another projection by a weight matrix $W^O$ of shape $(D, D)$, to produce the final multi-head self-attention embeddings:
$$ MultiHead(Q, K, V) = Concat(head_1, head_2, head_3, head_4) W^O $$
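To make the shapes concrete, here is a small pure-Python sketch of the three steps (project, attend per head, concatenate and project again). All weights are random placeholders rather than trained values, and the helper names (`matmul`, `attention`, `multi_head_self_attention`, etc.) are invented for this illustration. One choice to note: inside each head, the code scales by the square root of the head dimension ($D/H$), as in the original paper.

```python
import math
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax_rows(S):
    """Apply a numerically stable softmax to each row of S."""
    out = []
    for row in S:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        out.append([x / total for x in exps])
    return out

def attention(Q, K, V):
    """Scaled dot-product attention; scales by sqrt of the key dimension."""
    d_k = len(K[0])
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, transpose(K))]
    return matmul(softmax_rows(scores), V)

def random_matrix(rows, cols, rng):
    """Placeholder for trainable weights (random, not learned)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def multi_head_self_attention(E, heads_weights, W_O):
    # One attention computation per head, each on its own projection of E.
    head_outputs = [
        attention(matmul(E, W_Q), matmul(E, W_K), matmul(E, W_V))
        for (W_Q, W_K, W_V) in heads_weights
    ]
    # Concatenate the heads along the embedding dimension: (T, D/H) * H -> (T, D).
    concat = [sum((head[t] for head in head_outputs), []) for t in range(len(E))]
    return matmul(concat, W_O)

rng = random.Random(0)
T, D, H = 5, 8, 4  # sequence length, embedding dimension, number of heads
E = random_matrix(T, D, rng)
heads_weights = [tuple(random_matrix(D, D // H, rng) for _ in range(3)) for _ in range(H)]
W_O = random_matrix(D, D, rng)
out = multi_head_self_attention(E, heads_weights, W_O)
print(len(out), len(out[0]))  # same shape as the input: T rows, D columns
```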

### Feedforward Neural Network <a name="FFNN"></a>

After (multi-head) self-attention, the transformed embeddings go through a feedforward neural network for additional non-linear transformations. The network applies a linear layer with ReLU activation, followed by a second linear projection, to each position separately:
$$e_t^{Attn+FFNN} = b_2 + W_2 \, ReLU(b_1 + W_1 e_t^{Attn})$$
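As a quick sanity check of the formula, the sketch below applies it to a single token embedding with hand-picked weights. The weights, the hidden size of 3, and the function names are arbitrary choices for illustration only; in a real model $W_1, b_1, W_2, b_2$ are learned.

```python
def relu(x):
    """Element-wise ReLU."""
    return [max(0.0, v) for v in x]

def linear(W, b, x):
    """Compute b + W x, where W is a list of rows and x is a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + bj for row, bj in zip(W, b)]

def ffnn(x, W1, b1, W2, b2):
    """Feed-forward network: b2 + W2 ReLU(b1 + W1 x)."""
    return linear(W2, b2, relu(linear(W1, b1, x)))

# Toy example: D = 2, hidden size = 3 (made-up weights).
W1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 1.0, 1.0], [0.0, 0.0, 2.0]]
b2 = [0.5, -0.5]

e_attn = [1.0, -2.0]
# Hidden layer: W1 e = [1, -2, 1], and ReLU zeroes out the negative entry.
print(ffnn(e_attn, W1, b1, W2, b2))  # [2.5, 1.5]
```

The same weights are applied to every position in the sequence, so this step is also trivially parallel across tokens.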

In a transformer architecture, the encoder and the decoder each contain several "blocks" (note that the original transformer paper calls these "layers"). Each block contains a self-attention component and a fully-connected feed-forward neural net. These blocks are stacked, meaning the outputs of one block become the inputs of the next block. In other words, from the original input tokens to the final embedding representations, you go through several rounds of self-attention and non-linear transformation.

## Other Components of Transformer <a name="transformer_other"></a>

In addition to self-attention, the transformer architecture also uses several other technical elements, such as positional encoding, layer normalization, and residual connection. Below is some optional content on these elements. The [Additional Resources](#resource) section lists articles you can read for more information, and for a detailed demonstration of how to implement a transformer model.

### Positional Encoding <a name="transformer_other_pe"></a>


Remember that we throw away the encoder and decoder RNNs, and rely only on self-attention to generate representations of the sequences? Without the sequential RNNs, the model no longer knows the order of words in the input or output. To counter this loss of information, we try to encode the position of a word in a sequence into the embedding, using **Positional Encoding**. The positional encoding for each word at each position is another vector of the same dimension as the word embedding.

In the original paper that proposed transformer, the positional encoding is calculated as follows:
$$PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/D}}\right)$$
$$PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/D}}\right)$$
where $pos$ is a particular position in a sequence and $i \in \{0, \ldots, D/2 - 1\}$ is a running index over pairs of embedding dimensions. <font color="red">What does it mean? Let me explain with a small example.</font>

Suppose you have an input sequence of 5 words, $(e_1,\ldots, e_5)$, and each $e_t$ is a $4$-dimensional embedding (i.e. $D = 4$). Now you want to also encode the positions of each word. For the sake of demonstration, let's say you want to encode the second position, i.e., $pos=2$. You would use the formula above to compute the following:
- Set $i=0$: $PE(2, 0) = \sin(\frac{2}{10000^0})=\sin(2) \approx 0.91$ and $PE(2, 1) = \cos(\frac{2}{10000^0})=\cos(2) \approx -0.42$;
- Set $i=1$: $PE(2, 2) = \sin(\frac{2}{10000^{0.5}})=\sin(0.02) \approx 0.02$ and $PE(2, 3) = \cos(\frac{2}{10000^{0.5}})=\cos(0.02) \approx 1.00$. Stop here because your embedding only has 4 dimensions.

Then, the embedding with positional encoding for the second word in this sequence will become:
$$e_2 + [0.91, -0.42, 0.02, 1.00]$$
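The small example above can be reproduced with a few lines of Python. This is a minimal sketch that assumes $D$ is even; the function name `positional_encoding` is made up for this illustration.

```python
import math

def positional_encoding(pos, D):
    """Sinusoidal positional encoding vector for one position (D assumed even)."""
    pe = []
    for i in range(D // 2):
        angle = pos / (10000 ** (2 * i / D))
        pe.append(math.sin(angle))  # dimension 2i
        pe.append(math.cos(angle))  # dimension 2i + 1
    return pe

pe_2 = positional_encoding(2, 4)
rounded = [round(v, 2) for v in pe_2]
print(rounded)  # [0.91, -0.42, 0.02, 1.0]
```

Adding this vector element-wise to $e_2$ gives the positionally encoded embedding of the second word.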

This works because, after injecting the positional encoding, _the second word in this sequence will have a different embedding than the same word appearing at a different position in a different sequence_. Essentially, this allows the embedding to contain position-specific information that can help learning. Finally, why use trigonometric functions? It's mostly for mathematical convenience (the authors of the original paper hypothesized that sinusoids make it easy for the model to attend by relative position), and it works in practice.

<font color="blue">If you are comfortable with trigonometry... </font> Basically, the above positional encoding function adds a position-specific vector of the following form:
$$\left[\sin\left( \frac{pos}{10000^0} \right), \cos\left( \frac{pos}{10000^0} \right), \sin\left( \frac{pos}{10000^{2/D}} \right), \cos\left( \frac{pos}{10000^{2/D}} \right), \ldots, \sin\left( \frac{pos}{10000^{(D-2)/D}} \right), \cos\left( \frac{pos}{10000^{(D-2)/D}} \right) \right]$$
The wavelengths of these sine and cosine functions form a geometric progression from $2\pi$ up to almost $10000 \cdot 2\pi$. The fast-oscillating dimensions separate nearby positions while the slow-oscillating dimensions separate far-apart positions, so the vector differs across positions even in very long sequences.
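One commonly cited convenience of the sinusoidal form can be sketched with the angle-addition formulas. Writing $\omega_i = 1/10000^{2i/D}$ (a shorthand introduced here, not used elsewhere in this notebook), the encoding of position $pos + k$ is a rotation of the encoding of position $pos$ that depends only on the offset $k$:

$$
\begin{bmatrix} \sin(\omega_i (pos + k)) \\ \cos(\omega_i (pos + k)) \end{bmatrix}
=
\begin{bmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{bmatrix}
\begin{bmatrix} \sin(\omega_i \, pos) \\ \cos(\omega_i \, pos) \end{bmatrix}
$$

So, for a fixed offset $k$, moving from one position to another is a linear transformation of the positional encoding, which is plausibly easy for a model to pick up.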

### Layer Normalization and Residual Connection <a name="transformer_other_lnrc"></a>

Both layer normalization and residual connection are tricks in deep learning to aid with training large / deep networks. Their intuitions are as follows:

1. **Layer Normalization** performs a standardization (i.e., $\frac{x - E(x)}{SD(x)}$) over the inputs in a given layer, so that the "normalized" inputs have mean 0 and sd 1. In the transformer architecture, the standardization is done separately for each token, across the $D$ dimensions of its embedding (typically followed by a learnable rescaling and shift). Within each block of the original transformer, it is applied twice: once to the output of the self-attention component and once to the output of the feed-forward component, each time after the residual connection is added.
2. **Residual Connection** allows the original inputs to a layer to directly contribute to the outputs of that layer _in addition_ to any transformations imposed by the layer (i.e., allowing the inputs to "skip" the transformations). Informally, consider some inputs $X$ to a hidden layer in an MLP that applies a nonlinear transformation $f()$. Without residual connection, the outputs from this layer would be $f(X)$. With residual connection, it will be $X + f(X)$. <font color="red">Why do this?</font> Because it allows the gradient (during training) to directly connect with the original inputs $X$ in addition to through $f(X)$.
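Both tricks are simple enough to sketch in a few lines. The code below follows the common convention of normalizing each token's embedding across its features, omits the learnable rescaling and shift for brevity, and adds a small constant `eps` to avoid division by zero. The function names and the toy "layer" (doubling the input) are made up for illustration.

```python
import math

def layer_norm(x, eps=1e-5):
    """Standardize one token's embedding across its features."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def with_residual(x, f):
    """Return x + f(x), letting the input skip the transformation f."""
    return [a + b for a, b in zip(x, f(x))]

token = [1.0, 2.0, 3.0, 4.0]
normalized = layer_norm(token)
print([round(v, 3) for v in normalized])

# A toy "layer" that doubles its input: the output is x + 2x.
doubled_plus_skip = with_residual(token, lambda x: [2.0 * v for v in x])
print(doubled_plus_skip)  # [3.0, 6.0, 9.0, 12.0]
```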

### Putting Everything Together <a name="transformer_all"></a>

Putting everything together, what actually goes on inside each transformer block (using the encoder side as an example) is the following: suppose $E$ represents the matrix of (positionally encoded) embedding inputs to the block. It first goes through (multi-head) self-attention:
$$E' = \text{self-attention}(E)$$
Then, after applying the residual connection and layer normalization, you get:
$$E'' = \text{LayerNorm}(E + E')$$
Next, it goes through the feed-forward neural net:
$$E''' = FFNN(E'')$$
Finally, apply residual connection and layer normalization again:
$$E'''' = \text{LayerNorm}(E'' + E''')$$
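The four steps can be chained into a single function. The sketch below is deliberately simplified and rests on several assumptions: it uses single-head self-attention without learned projections, random (untrained) feed-forward weights, layer normalization without learnable parameters, and a row-vector convention (each token is a row, so the feed-forward weights multiply on the right). All function names are invented for this illustration.

```python
import math
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def add(A, B):
    """Element-wise sum of two matrices (used for residual connections)."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def self_attention(E):
    """Single-head dot-product self-attention with Q = K = V = E."""
    D = len(E[0])
    weights = []
    for row in matmul(E, [list(col) for col in zip(*E)]):
        scaled = [s / math.sqrt(D) for s in row]
        m = max(scaled)
        exps = [math.exp(s - m) for s in scaled]
        total = sum(exps)
        weights.append([x / total for x in exps])
    return matmul(weights, E)

def ffnn(E, W1, b1, W2, b2):
    """Position-wise feed-forward net applied to every row of E."""
    hidden = [[max(0.0, h + b) for h, b in zip(row, b1)] for row in matmul(E, W1)]
    return [[o + b for o, b in zip(row, b2)] for row in matmul(hidden, W2)]

def layer_norm(E, eps=1e-5):
    """Standardize each row (token) across its features."""
    out = []
    for row in E:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def encoder_block(E, W1, b1, W2, b2):
    E1 = self_attention(E)            # E'
    E2 = layer_norm(add(E, E1))       # E''
    E3 = ffnn(E2, W1, b1, W2, b2)     # E'''
    return layer_norm(add(E2, E3))    # E''''

rng = random.Random(0)
T, D, hidden = 4, 6, 12  # toy sizes
E = [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(T)]
W1 = [[rng.gauss(0.0, 0.1) for _ in range(hidden)] for _ in range(D)]
b1 = [0.0] * hidden
W2 = [[rng.gauss(0.0, 0.1) for _ in range(D)] for _ in range(hidden)]
b2 = [0.0] * D
block_out = encoder_block(E, W1, b1, W2, b2)
print(len(block_out), len(block_out[0]))  # the block preserves the (T, D) shape
```

Because the output has the same shape as the input, blocks of this kind can be stacked by feeding one block's output into the next.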

## Encoder vs. Decoder <a name="transformer_encoder_decoder"></a>

Although both the encoder and the decoder follow roughly the same stacked architecture, they have some important differences that are worth clarifying. For concreteness, let's consider a translation task (like the English-to-Spanish translation task discussed in the "sequence-to-sequence modeling" lecture). Suppose the input sequence (in English) is $(e_1, \ldots, e_T)$ and the output sequence (in Spanish) is $(s_1, \ldots, s_{T'})$.

The first difference is in the details of self-attention. In the encoder, the input sequence goes through a **bidirectional** self-attention transformation (as described above), in the sense that every position in the sequence can attend to every other position in the sequence. However, in the decoder, the (output) sequence goes through a **masked** self-attention (also called **causal** self-attention), where position $t$ can only attend to positions $i \leq t$ but not to future positions. This is because, at inference time, we do not know future tokens in the decoding process. The masked self-attention is achieved by replacing the values inside the softmax that correspond to illegitimate pairs with $-\infty$ (so the corresponding weights become 0 after softmax). Taking the 3rd position of the decoder sequence as an example, the "masked" softmax values would be $softmax\left( \frac{s_3 \cdot s_1}{\sqrt{D}}, \frac{s_3 \cdot s_2}{\sqrt{D}}, \frac{s_3 \cdot s_3}{\sqrt{D}}, -\infty, \ldots, -\infty \right)$.
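The masking trick can be illustrated with a tiny sketch. The scores below are made-up numbers standing in for the scaled dot products of one decoder position with all positions, and the function name `masked_softmax` is hypothetical. Note that the code uses 0-based indexing, so the 3rd position corresponds to `t = 2`.

```python
import math

def masked_softmax(scores, t):
    """Softmax over scores, with every position after index t (0-based) masked out."""
    masked = [s if i <= t else -math.inf for i, s in enumerate(scores)]
    m = max(masked)
    exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [x / total for x in exps]

# Made-up scaled dot products for the 3rd position of a length-5 sequence.
scores = [0.5, 1.0, -0.3, 2.0, 0.1]
weights = masked_softmax(scores, t=2)
print([round(w, 3) for w in weights])  # the last two weights are exactly 0.0
```

Even though the 4th score (2.0) is the largest, it receives zero weight because it belongs to a future position.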

The second difference is that the decoder sequence is allowed to attend to the encoder sequence via a standard **cross-attention** mechanism (but not vice versa). Specifically, after all the transformations (across multiple blocks) applied to the input sequence, the encoder emits a final sequence representation. In each decoder block, the embeddings are allowed to attend to all positions of this encoded sequence: the queries come from the decoder, while the keys and values come from the encoder outputs. This is implemented in the same way as discussed in the "Attention Mechanism" lecture.
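As a rough sketch of cross-attention, the code below lets each decoder position (the query) attend over all encoder outputs (the keys and values), following the query/key/value roles in the original paper. It omits the learned projection matrices and uses made-up toy numbers; the function name `cross_attention` is hypothetical.

```python
import math

def cross_attention(decoder_states, encoder_outputs):
    """Each decoder position attends over all encoder positions (no masking)."""
    D = len(encoder_outputs[0])
    attended = []
    for q in decoder_states:
        # Queries come from the decoder; keys and values from the encoder.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(D) for k in encoder_outputs]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [x / total for x in exps]
        attended.append(
            [sum(w * v[d] for w, v in zip(weights, encoder_outputs)) for d in range(D)]
        )
    return attended

# Toy shapes: T = 3 encoder positions, T' = 2 decoder positions, D = 2.
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_states = [[0.5, 0.5], [2.0, -1.0]]
context = cross_attention(decoder_states, encoder_outputs)
print(len(context), len(context[0]))  # one context vector per decoder position
```

The output has one row per decoder position, even though the information it summarizes comes entirely from the encoder sequence.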

The third difference is the output. The outputs of encoder are sequence embedding representations, whereas the outputs of decoder are probability predictions over vocabulary (to predict the next token). 

# Additional Resources <a name="resource"></a>

- Original research paper that proposed the transformer architecture: [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf);
- Original paper on self-attention: [Long Short-Term Memory-Networks for Machine Reading](https://arxiv.org/pdf/1601.06733.pdf);
- Additional articles to learn about self-attention: [Illustrated: Self-Attention](https://towardsdatascience.com/illustrated-self-attention-2d627e33b20a), [Introduction of Self-Attention Layer in Transformer](https://medium.com/lsc-psd/introduction-of-self-attention-layer-in-transformer-fc7bff63f3bc);
- Additional articles on other components in a transformer: [Layer Normalization](https://arxiv.org/abs/1607.06450), [Normalization Techniques in Deep Neural Networks](https://medium.com/techspace-usict/normalization-techniques-in-deep-neural-networks-9121bf100d8), [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385);
- Implementation of Transformer: [Transformer model for language understanding](https://www.tensorflow.org/tutorials/text/transformer);
- [Transformer for text classification](https://keras.io/examples/nlp/text_classification_with_transformer/);
- Andrej Karpathy's [YouTube Tutorial](https://www.youtube.com/watch?v=kCc8FmEb1nY).