# <center> MSBA 6461: Advanced AI for Natural Language Processing </center>
<center> Summer 2025, Mochen Yang </center>

## <center> Transformer Architecture </center>

# Table of Contents
1. [Transformer Architecture](#transformer)
    - [What is the Transformer Architecture?](#transformer_intro)
    - [Self-Attention](#transformer_components)
        - [Multi-Head Attention](#multihead)
        - [Feedforward Neural Network](#FFNN)
    - [Other Components of Transformer](#transformer_other)
        - [Positional Encoding](#transformer_other_pe)
        - [Layer Normalization and Residual Connection](#transformer_other_lnrc)
        - [Putting Everything Together](#transformer_all)
    - [Encoder vs. Decoder](#transformer_encoder_decoder)
    - [Temperature](#transformer_temperature)
1. [Transformer Implementation: A Step-by-Step Explanation](#implementation)
1. [Additional Resources](#resource)

# Transformer <a name="transformer"></a>

The transformer architecture is arguably one of the most important deep learning architectures we have right now. It is the bedrock of virtually all large language models on the market. It has been applied to representation learning tasks for various different types of data, including text, image, video, time series, etc. In addition to its wide applicability, it is also responsible for many state-of-the-art results / performances in AI. The goal of this notebook is to offer an in-depth yet accessible exposition of the transformer architecture (mostly based on [this seminal paper](https://arxiv.org/pdf/1706.03762)) with small-scale demonstrations (for actual implementations, please refer to ```pytorch/Transformer.ipynb```).

## What is the Transformer Architecture? <a name="transformer_intro"></a>

The transformer architecture we will discuss here largely follows the same encoder-decoder structure as the RNN-based sequence-to-sequence models discussed before, but seeks to completely throw away the RNNs for encoder/decoder, and only uses (a particular kind of) attention mechanism combined with fully-connected feed-forward neural networks (i.e., non-recurrent).

<font color="red">But why would you want to throw away the RNNs?</font> One of the key reasons is computational complexity. In an RNN, computations have to be done sequentially (e.g., processing one word after another), which prohibits parallelization. As a result, large-scale tasks with RNNs may become very slow. As you will see, most of the computations in a transformer (especially the self-attention component) can be done in a parallel manner.

There are a number of technical components to a transformer architecture (see figure below), including self-attention, positional encoding, layer normalization, and residual connection. I will explain the intuition behind these components, with an emphasis on the self-attention mechanism. 

![Transformer Architecture](images/transformer.png)

image credit: [Attention is all You Need](https://arxiv.org/pdf/1706.03762.pdf) (Figure 1)

## Self-Attention <a name="transformer_components"></a>

The attention mechanism that we discussed before can be thought of as a "layer" that sits between an encoder and a decoder, which allows the decoder RNN to "pay attention to" different positions of the encoder hidden states. Because the attention layer is between encoder and decoder, it is often referred to as **cross-attention**. The transformer architecture relies on a twist of this attention mechanism, namely **self-attention**.

![Self-Attention Visual Illustration](images/self_attention.png)

image credit: [Self-Attention For Generative Models](https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/slides/cs224n-2019-lecture14-transformers.pdf)

You can think of self-attention as a mechanism that applies to an input sequence _itself_ (like the visualization above), in order to generate a representation of the sequence that encodes information about how different words in the sequence are related to each other. In a (non-rigorous) sense, it allows the representation of the input sequence to contain information about "interactions" among different words in the sequence. Importantly, the entire process of calculating self-attention representation of an input sequence does NOT involve any RNNs or word-by-word recurrence. That's the point of transformer - it is a highly parallel architecture.

Now let's get technical about self-attention. Given a sequence of tokens $(e_1, \ldots, e_T)$ where $e_t$ is the embedding representation (dimension = $D$) of the $t$-th token, the self-attention mechanism seeks to "associate" each token with all the other tokens and incorporate those associations into the (attention-enriched) representation of the token. Specifically, self-attention based on dot-product transforms $e_t$ to
$$e_t^{Attn} = softmax\left( \frac{e_t \cdot e_1}{\sqrt{D}}, \ldots, \frac{e_t \cdot e_T}{\sqrt{D}} \right) \cdot (e_1, \ldots, e_T)$$
where $\cdot$ is the dot-product operation. $\sqrt{D}$ is a scaling parameter based on the embedding dimension to make sure that the dot products don't "blow up" when dimension is high ($e_t \cdot e_i$ tends to grow as $D$ increases). If you re-write the above in matrix terms, you will see that it's basically the dot-product attention mechanism where key ($K$), query ($Q$), and value ($V$) are all the same input embeddings.
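
To make the formula concrete, here is a minimal NumPy sketch of dot-product self-attention where the key, query, and value are all the same embedding matrix. The function names (`softmax`, `self_attention`) and the randomly generated toy embeddings are assumptions for illustration only, not part of the course code.

```python
import numpy as np

def softmax(scores):
    # subtract the row-wise max for numerical stability (does not change the result)
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def self_attention(E):
    """E has shape (T, D): one token embedding per row."""
    D = E.shape[1]
    scores = E @ E.T / np.sqrt(D)   # (T, T): scaled dot products e_t . e_i
    weights = softmax(scores)       # row t holds the attention weights of token t
    return weights @ E, weights     # (T, D) attention-enriched embeddings

# toy sequence (random numbers, for illustration): T = 5 tokens, D = 4
rng = np.random.default_rng(0)
E = rng.normal(size=(5, 4))
E_attn, attn_weights = self_attention(E)
print(E_attn.shape)                 # (5, 4)
print(attn_weights.sum(axis=1))     # each row of weights sums to 1
```

Note that all $T$ rows are computed with two matrix multiplications and no loop over positions, which is where the parallelism comes from.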

### Multi-Head Attention <a name="multihead"></a>

To enable even more parallelism, people often use something called a **Multi-Head Self-Attention**. The high-level idea is you project $Q, K, V$ multiple times with trainable weight matrices, apply the self-attention, then concatenate the results together. More technical details below.

For better notations, let's pack all embeddings of the sequence into a matrix of shape $(T, D)$ (i.e., one token embedding per row). The above (single-head) attention mechanism can be represented in the following matrix format:
$$ Attention(Q, K, V) = softmax\left(\frac{QK'}{\sqrt{D}} \right) V $$
where $K = Q = V$ are all the same embedding matrix.

Then, with multi-head attention, we will first project the key, query, and value matrices into lower-dimensional embedding matrices. This is done by multiplying them with separate weight matrices $W^K$, $W^Q$, $W^V$. Consider, for example, a 4-head self-attention, then the shape of the three weight matrices would be $(D, D/4)$. Next, for each head $i \in \{1,2,3,4\}$, we will compute the regular self-attention as:
$$ head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)$$

Finally, the 4 heads are concatenated together, followed by another projection by weight matrix, $W^O$ of shape $(D, D)$, to produce the final multi-head self-attention embeddings:
$$ MultiHead(Q, K, V) = (head_1, head_2, head_3, head_4) W^O $$
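
The three steps above (project, attend, concatenate) can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the weights are random rather than trained, the function names are hypothetical, and each head scales by the square root of its own dimension ($D/4$ here), which is the convention used in the original paper.

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]                               # per-head dimension
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head_self_attention(E, W_Q, W_K, W_V, W_O):
    # one (T, D/H) output per head, computed independently of the other heads
    heads = [attention(E @ wq, E @ wk, E @ wv) for wq, wk, wv in zip(W_Q, W_K, W_V)]
    # concatenate the heads back to (T, D), then apply the output projection
    return np.concatenate(heads, axis=1) @ W_O

rng = np.random.default_rng(0)
T, D, H = 5, 8, 4                                   # 5 tokens, 8 dimensions, 4 heads
E = rng.normal(size=(T, D))
W_Q = rng.normal(size=(H, D, D // H))               # one (D, D/4) matrix per head
W_K = rng.normal(size=(H, D, D // H))
W_V = rng.normal(size=(H, D, D // H))
W_O = rng.normal(size=(D, D))
mh_out = multi_head_self_attention(E, W_Q, W_K, W_V, W_O)
print(mh_out.shape)                                 # (5, 8)
```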

### Feedforward Neural Network <a name="FFNN"></a>

After (multi-head) self-attention, the transformed embeddings will go through a feedforward neural network for additional non-linear transformations. The network uses RELU activation, followed by a linear projection:
$$e_t^{Attn+FFNN} = b_2 + W_2 RELU(b_1 + W_1 e_t^{Attn})$$
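
As a rough sketch, the feedforward step can be written in NumPy as below. The hidden dimension `D_ff`, the random weights, and the function name `ffnn` are assumptions for illustration. Because tokens are stored as rows here, the formula's $W e_t$ becomes a right-multiplication `E @ W`.

```python
import numpy as np

def ffnn(E_attn, W1, b1, W2, b2):
    hidden = np.maximum(E_attn @ W1 + b1, 0.0)   # RELU(b_1 + W_1 e_t), one row per token
    return hidden @ W2 + b2                      # b_2 + W_2 (...)

rng = np.random.default_rng(0)
T, D, D_ff = 5, 4, 16
E_attn = rng.normal(size=(T, D))
W1, b1 = rng.normal(size=(D, D_ff)), np.zeros(D_ff)
W2, b2 = rng.normal(size=(D_ff, D)), np.zeros(D)

ffnn_out = ffnn(E_attn, W1, b1, W2, b2)
print(ffnn_out.shape)                            # (5, 4)
# the same weights are applied to every position independently:
print(np.allclose(ffnn(E_attn[:1], W1, b1, W2, b2), ffnn_out[:1]))
```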

In a transformer architecture, the encoder and the decoder each contain several "blocks" (note that the original transformer paper calls these "layers"). Each block contains a self-attention component and a fully-connected feed-forward neural net. These blocks are stacked, meaning the outputs of a previous block become the inputs of the next block. In other words, from the original input tokens to the final embedding representations, you will go through several rounds of self-attention and non-linear transformation.

## Other Components of Transformer <a name="transformer_other"></a>

In addition to self-attention, the transformer architecture also uses several other technical elements, such as positional encoding, layer normalization, and residual connection. Below is some optional content on these elements. The [Additional Resources](#resource) section lists articles you can read for more information, and for a detailed demonstration of how to implement a transformer model.

### Positional Encoding <a name="transformer_other_pe"></a>


Remember that we throw away the encoder and decoder RNNs, and only rely on self-attention to generate representations of the sequences? Without the sequential RNNs, the model now does not know the sequence of words in the input or output. To counter this loss of information, we try to encode the position of a word in a sequence into the embedding, using **Positional Encoding**. The positional encoding for each word at each position is another vector of the same dimension as the word embedding.

In the original paper that proposed transformer, the positional encoding is calculated as follows:
$$PE(pos, 2i) = \sin(\frac{pos}{10000^{2i/D}})$$
$$PE(pos, 2i+1) = \cos(\frac{pos}{10000^{2i/D}})$$
where $pos$ is a particular position in a sequence and $i \in \{0, \ldots, D/2 - 1\}$ is a running index. <font color="red">What does it mean? Let me explain with a small example.</font> 

Suppose you have an input sequence of 5 words, $(e_1,\ldots, e_5)$, and each $e_t$ is a $4$-dimensional embedding (i.e. $D = 4$). Now you want to also encode the positions of each word. For the sake of demonstration, let's say you want to encode the second position, i.e., $pos=2$. You would use the formula above to compute the following:
- Set $i=0$, $PE(2, 0) = \sin(\frac{2}{10000^0})=\sin(2) \approx 0.91$ and $PE(2, 1) = \cos(\frac{2}{10000^0})=\cos(2) \approx -0.42$;
- Set $i=1$, $PE(2, 2) = \sin(\frac{2}{10000^{0.5}}) = \sin(0.02) \approx 0.02$ and $PE(2, 3) = \cos(\frac{2}{10000^{0.5}})=\cos(0.02) \approx 1.00$. Stop here because your embedding only has 4 dimensions.
Then, the embedding with positional encoding for the second word in this sequence will become:
$$e_2 + [0.91, -0.42, 0.02, 1.00]$$
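
The small example can be checked with a few lines of plain Python. The helper name `positional_encoding` is hypothetical; it simply evaluates the two formulas for every $i$ from $0$ to $D/2 - 1$.

```python
import math

def positional_encoding(pos, D):
    pe = []
    for i in range(D // 2):
        angle = pos / (10000 ** (2 * i / D))
        pe.append(math.sin(angle))   # dimension 2i
        pe.append(math.cos(angle))   # dimension 2i + 1
    return pe

pe_2 = positional_encoding(pos=2, D=4)
print([round(v, 2) for v in pe_2])   # [0.91, -0.42, 0.02, 1.0]
```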

This works because, after injecting the positional encoding, _the second word in this sequence will have a different embedding than the same word appearing at a different position in a different sequence_. Essentially, this allows the embedding to contain position-specific information that can help learning. Finally, why use trigonometric functions? It's mostly for mathematical convenience and it works in practice.

<font color="blue">If you are comfortable with trigonometry... </font> Basically, the above positional encoding function adds a position-specific vector of the following form:
$$\left[\sin\left( \frac{pos}{10000^0} \right), \cos\left( \frac{pos}{10000^0} \right), \sin\left( \frac{pos}{10000^{2/D}} \right), \cos\left( \frac{pos}{10000^{2/D}} \right), \ldots, \sin\left( \frac{pos}{10000^{(D-2)/D}} \right), \cos\left( \frac{pos}{10000^{(D-2)/D}} \right) \right]$$
Due to the shapes of sine and cosine functions, this vector will be different for $pos \in \{1, \ldots, 10000\}$, thereby allowing you to differentiate positions up to length 10000.
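
For a quick numerical check, the sketch below builds the whole positional encoding matrix in a vectorized way and counts how many rows are distinct. The sizes (`max_len=100`, `D=8`) and the function name are arbitrary choices for illustration, and positions are indexed from 0 here.

```python
import numpy as np

def positional_encoding_matrix(max_len, D):
    positions = np.arange(max_len)[:, None]             # (max_len, 1)
    freqs = 10000.0 ** (-2 * np.arange(D // 2) / D)     # (D/2,): 1 / 10000^(2i/D)
    pe = np.zeros((max_len, D))
    pe[:, 0::2] = np.sin(positions * freqs)             # even dimensions use sine
    pe[:, 1::2] = np.cos(positions * freqs)             # odd dimensions use cosine
    return pe

pe = positional_encoding_matrix(max_len=100, D=8)
num_distinct = len(np.unique(pe.round(6), axis=0))      # distinct rows after rounding
print(pe.shape, num_distinct)                           # (100, 8) 100
```

Every position gets its own vector, which is what lets the model tell positions apart.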

### Layer Normalization and Residual Connection <a name="transformer_other_lnrc"></a>

Both layer normalization and residual connection are tricks in deep learning to aid with training large / deep networks. Their intuitions are as follows:

1. **Layer Normalization** performs a standardization (i.e., $\frac{x - E(x)}{SD(x)}$) over the inputs in a given layer, so that the "normalized" inputs have mean 0 and sd 1. In the transformer architecture, the standardization is applied to each token's embedding separately, across its $D$ dimensions. Within each block, the outputs of the self-attention and of the feed-forward layers (after adding the residual connection) each go through a layer normalization operation. As a result, each normalized embedding has mean 0 and sd 1 (before the trainable scale and shift that standard implementations also apply).
2. **Residual Connection** allows the original inputs to a layer to directly contribute to the outputs of that layer _in addition_ to any transformations imposed by the layer (i.e., allowing the inputs to "skip" the transformations). Informally, consider some inputs $X$ to a hidden layer in an MLP that applies a nonlinear transformation $f()$. Without residual connection, the outputs from this layer would be $f(X)$. With residual connection, it will be $X + f(X)$. <font color="red">Why do this?</font> Because it allows the gradient (during training) to directly connect with the original inputs $X$ in addition to through $f(X)$.
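
Both tricks fit in a few lines of NumPy. The sketch below is a simplification under stated assumptions: it normalizes each token's embedding across its feature dimension (as PyTorch's `nn.LayerNorm(d_model)` does), it omits the trainable scale and shift, and `f` is an arbitrary stand-in for a layer's transformation.

```python
import numpy as np

def layer_norm(X, eps=1e-5):
    # standardize each row (one token embedding) across its features
    mean = X.mean(axis=-1, keepdims=True)
    var = X.var(axis=-1, keepdims=True)
    return (X - mean) / np.sqrt(var + eps)   # eps avoids division by zero

def f(X):
    # hypothetical nonlinear transformation, for illustration only
    return np.tanh(X)

rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=2.0, size=(5, 8))   # 5 tokens, 8 dimensions

residual_out = X + f(X)                 # residual connection: inputs "skip" f
normed = layer_norm(residual_out)
print(normed.mean(axis=1).round(6))     # approximately 0 for every token
print(normed.std(axis=1).round(3))      # approximately 1 for every token
```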

### Putting Everything Together <a name="transformer_all"></a>

Putting everything together, what actually goes on inside each transformer block (using the encoder side as an example) is the following: suppose $E$ represents the matrix of (positionally encoded) embedding inputs to the block. It first goes through (multi-head) self-attention:
$$E' = \text{self-attention}(E)$$
Then, apply residual connection and layer normalization, you get:
$$E'' = \text{LayerNorm}(E + E')$$
Next, it goes through the feed-forward neural net:
$$E''' = FFNN(E'')$$
Finally, apply residual connection and layer normalization again:
$$E'''' = \text{LayerNorm}(E'' + E''')$$
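
The four steps can be strung together in a short NumPy sketch. To keep it brief, it makes several simplifying assumptions: single-head self-attention with no trainable projections, layer normalization without trainable scale and shift, and random feed-forward weights. All function names are hypothetical.

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def self_attention(E):
    return softmax(E @ E.T / np.sqrt(E.shape[1])) @ E

def layer_norm(X, eps=1e-5):
    mean = X.mean(axis=-1, keepdims=True)
    var = X.var(axis=-1, keepdims=True)
    return (X - mean) / np.sqrt(var + eps)

def ffnn(X, W1, b1, W2, b2):
    return np.maximum(X @ W1 + b1, 0.0) @ W2 + b2

def encoder_block(E, W1, b1, W2, b2):
    E1 = self_attention(E)             # E'    = self-attention(E)
    E2 = layer_norm(E + E1)            # E''   = LayerNorm(E + E')
    E3 = ffnn(E2, W1, b1, W2, b2)      # E'''  = FFNN(E'')
    return layer_norm(E2 + E3)         # E'''' = LayerNorm(E'' + E''')

rng = np.random.default_rng(0)
T, D, D_ff = 5, 8, 16
E = rng.normal(size=(T, D))
W1, b1 = rng.normal(size=(D, D_ff)), np.zeros(D_ff)
W2, b2 = rng.normal(size=(D_ff, D)), np.zeros(D)

block_out = encoder_block(E, W1, b1, W2, b2)
print(block_out.shape)                 # (5, 8): same shape as the input
```

Because the output has the same shape as the input, blocks can be stacked by feeding `block_out` into another block (with its own weights in a real model).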

## Encoder vs. Decoder <a name="transformer_encoder_decoder"></a>

Although both encoder and decoder follow roughly the same stacked architecture, they have some important differences that are worth clarifying. For concreteness, let's consider a translation task (like the English-to-Spanish translation task discussed in the "sequence-to-sequence modeling" lecture). Suppose the input sequence (in English) is $(e_1, \ldots, e_T)$ and the output sequence (in Spanish) is $(s_1, \ldots, s_{T'})$.

The first difference is in the details of self-attention. In the encoder, the input sequence goes through a **bidirectional** self-attention transformation (as described above), in the sense that every position in the sequence can attend to every other position in the sequence. However, in the decoder, the input sequence goes through a **masked** self-attention (also called **causal** self-attention), where position $t$ can only attend to positions $i \leq t$ but not to future positions. This is because, during inference time, we will not know future tokens in the decoding process. The masked self-attention is achieved by replacing the values inside softmax that correspond to illegitimate pairs with $-\infty$ (which becomes 0 after softmax). Taking the 3rd position of the decoder sequence as an example, the "masked" softmax values would be $softmax\left( \frac{e_3 \cdot e_1}{\sqrt{D}}, \frac{e_3 \cdot e_2}{\sqrt{D}}, \frac{e_3 \cdot e_3}{\sqrt{D}}, -\infty, \ldots, -\infty \right)$.
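
The masking trick can be seen in a small NumPy sketch. The function name and the random toy embeddings are assumptions for illustration; the point is only that scores for future positions are set to $-\infty$ before the softmax, so their attention weights become exactly 0.

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)                    # exp(-inf) evaluates to 0
    return exp / exp.sum(axis=-1, keepdims=True)

def causal_attention_weights(E):
    T, D = E.shape
    scores = E @ E.T / np.sqrt(D)
    # True above the diagonal, i.e., where the attended position i is after t
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores)

rng = np.random.default_rng(0)
E = rng.normal(size=(4, 3))                  # 4 tokens, 3 dimensions
causal_weights = causal_attention_weights(E)
print(causal_weights.round(2))               # zeros above the diagonal
```

Row $t$ of the printed matrix only has non-zero weights for positions $i \leq t$, and each row still sums to 1.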

The second difference is that the decoder sequence is allowed to attend to the encoder sequence via a standard **cross-attention** mechanism (but not vice versa). Specifically, after all the transformations (across multiple blocks) applied to the input sequence, the encoder will emit a final sequence representation. In each decoder block, the embeddings are allowed to attend to all positions of this encoded sequence. This is implemented in the same way as discussed in the "Attention Mechanism" lecture.

The third difference is the output. The outputs of encoder are sequence embedding representations, whereas the outputs of decoder are probability predictions over vocabulary (to predict the next token). 

## Temperature <a name="transformer_temperature"></a>

During decoding (also referred to as "inference"), **temperature** ($T$, not to be confused with the sequence length used earlier) is an important parameter that controls how deterministic vs. random the generation process is. Technically, the predicted logits are divided by $T$ before the softmax transformation, so a lower temperature makes the predicted distribution more peaked and a higher temperature makes it flatter. Roughly speaking, lower / higher temperature typically leads to more deterministic / random output. Specifically,

- When $T = 0$, the decoder will simply output the token with the highest predicted probability (this is the limiting case as $T \to 0$). This is completely deterministic;
- When $T > 0$, the decoder will sample from the predicted distribution over tokens, thereby producing non-deterministic outputs. There are a few ways to sample:
    - _Top-k_: select the $k$ tokens that receive the highest predicted logit values (i.e., values before softmax transformation), then apply the softmax only on those $k$ tokens to obtain sampling probabilities;
    - _Nucleus Sampling_ (also called _top-p_): select the smallest set of highest-probability tokens whose cumulative probability reaches a threshold $p$, then apply the softmax on the logits of these tokens to obtain sampling probabilities.
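
The sampling strategies can be sketched in NumPy as follows. The function names and the toy logits are made up for illustration, and the sketch assumes the common convention that temperature enters by dividing the logits by $T$ before the softmax.

```python
import numpy as np

def softmax(z):
    exp = np.exp(z - z.max())
    return exp / exp.sum()

def temperature_probs(logits, temperature):
    # assumes temperature > 0; temperature -> 0 approaches picking the argmax
    return softmax(logits / temperature)

def top_k_probs(logits, k):
    keep = np.argsort(logits)[-k:]                 # indices of the k largest logits
    probs = np.zeros_like(logits)
    probs[keep] = softmax(logits[keep])            # softmax over the kept tokens only
    return probs

def nucleus_probs(logits, p):
    probs = softmax(logits)
    order = np.argsort(probs)[::-1]                # tokens from most to least likely
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1    # smallest prefix whose mass reaches p
    keep = order[:cutoff]
    out = np.zeros_like(probs)
    out[keep] = softmax(logits[keep])
    return out

logits = np.array([2.0, 1.0, 0.5, -1.0])           # toy logits over a 4-token vocabulary
for temp in (0.5, 1.0, 2.0):
    print(temp, temperature_probs(logits, temp).round(3))   # lower temperature = more peaked
print(top_k_probs(logits, k=2).round(3))
print(nucleus_probs(logits, p=0.8).round(3))

# sampling the next token from one of these distributions
rng = np.random.default_rng(0)
next_token = rng.choice(len(logits), p=top_k_probs(logits, k=2))
```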

# Transformer Implementation: A Step-by-Step Explanation <a name="implementation"></a>

In this part, we build a transformer model for a simple task. The goal is to understand the implementation of different components in a transformer, and how they are put together. The demonstration seeks to provide step-by-step explanations of the transformer architecture.

The **task** is to predict a numerical output based on a number of input features. If we treat a number as a sequence of digits (where each digit is a token), then this task is essentially a sequence-to-sequence prediction task. Even though such a "numeric prediction" task is typically not where transformers are applied, it does offer several advantages as a tutorial / demonstration: (1) the vocabulary is very restricted (the 10 single digits and a blank space, plus start and end tokens) and (2) each input can be represented as a fixed length sequence, thereby removing the need for padding / masking.

We will use ```pytorch``` for this demonstration, because it offers an off-the-shelf ```transformer``` module. Its documentation is available on [this page](https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html).

In [None]:
import numpy as np
import torch
import torch.nn as nn
from torch.nn import Transformer

As the first step, let's simulate the data used for training and evaluation:
- $X_1$, ..., $X_{10}$: 10 numerical input features, each randomly sampled from a uniform distribution.
- $Y = \frac{1}{10} \sum_i X_i$: the numerical output is simply the average value.
- $N=5000$: 5000 samples, 4000 for training and 1000 for evaluation.

In [27]:
# set random seed for reproducibility
np.random.seed(123)
X = np.random.uniform(size = (5000, 10))
Y = np.mean(X, axis = 1)
X_train = X[:4000]
X_test = X[4000:]
Y_train = Y[:4000]
Y_test = Y[4000:]
print(X_train.shape, Y_train.shape, X_test.shape, Y_test.shape)

(4000, 10) (4000,) (1000, 10) (1000,)


Importantly, ```pytorch``` does not take these raw values / arrays as input. We need to tokenize them and convert them into indices in the vocabulary.

In [16]:
# vocab has single-digits, space, start, end
VOCAB = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ' ', 's', 'e']
# for simplicity, we restrict each input/output number to 8 digits
MAX_DIGITS = 8

With such a restricted vocabulary, tokenizing each number is the same as splitting it into a sequence of single digits. Note that, because both inputs and outputs take values between 0 and 1, every number starts with "0." (followed by 8 decimal digits). Therefore, as a further simplification, we don't need to keep track of the "0." for each number.
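
To see what this tokenization looks like for a single number, here is a small stand-alone sketch. The helper names `encode_number` and `decode_indices` are hypothetical (they are not part of the dataset class), and the vocabulary is repeated so the snippet runs on its own.

```python
# same vocabulary as above, repeated so that this snippet is self-contained
VOCAB = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ' ', 's', 'e']

def encode_number(value):
    # keep 8 decimal digits, drop the "0." prefix, then add the start / end tokens
    digits = '{:.8f}'.format(value)[2:]
    return [VOCAB.index(c) for c in 's' + digits + 'e']

def decode_indices(indices):
    # map indices back to characters, drop start / end tokens, restore the "0." prefix
    chars = ''.join(VOCAB[i] for i in indices)
    return float('0.' + chars.strip('se'))

idx = encode_number(0.544199352975335)
print(idx)                   # [11, 5, 4, 4, 1, 9, 9, 3, 5, 12]
print(decode_indices(idx))   # 0.54419935
```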

The following ```CustomDataset``` class performs basic processing and tokenization of input features and output values. It will allow us to convert the raw numpy arrays ```X``` and ```Y``` into a format that can be ingested by the transformer model.

In [25]:
class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, X, Y, vocab):
        self.X = X
        self.Y = Y
        self.vocab = vocab
        # the "index" method is defined below
        self.X_indexed = self.index(X, 'X')
        self.Y_indexed = self.index(Y, 'Y')

    # The "index" method converts either an input vector or an output value to a sequence of token indices
    def index(self, data, type):
        data_indexed = []
        for row in data:
            if type == 'Y':
                # in this case, row is a scalar, we convert it to a string and remove the "0." prefix
                # the '{:.8f}'.format(...) part ensures the number has 8 digits after the decimal point, and converts it to a string
                # the '[2:]' part removes the "0." prefix
                row_str = '{:.8f}'.format(row)[2:]
            if type == 'X':
                # in this case, we do the same processing to each feature value, then concatenate them to a longer sequence, separated by blank spaces
                row_str = ' '.join(['{:.8f}'.format(x)[2:] for x in row])
            # also need to prepend 's' and append 'e' to the sequence
            row_str = 's' + row_str + 'e'
            # convert to indices in vocabulary
            row_idx = [self.vocab.index(c) for c in row_str]
            data_indexed.append(row_idx)
        return np.array(data_indexed)

    def __len__(self):
        # this is a required method in custom dataset classes, it should return size of data (i.e., number of rows)
        return len(self.X_indexed)

    def __getitem__(self, idx):
        # this is also a required method, it should return the item at the given index
        src = torch.tensor(self.X_indexed[idx], dtype=torch.long)
        tgt = torch.tensor(self.Y_indexed[idx], dtype=torch.long)
        return src, tgt

Now, we can create the datasets that can be used for training and evaluation:

In [32]:
train_dataset = CustomDataset(X_train, Y_train, VOCAB)
test_dataset = CustomDataset(X_test, Y_test, VOCAB)
print(len(train_dataset), len(test_dataset))

4000 1000


Let's also print out the first data point to see (remember the values you see are indices in the vocabulary):

In [36]:
print("raw inputs:", X_train[0])
print("raw output:", Y_train[0])
print("tokenized input sequence:", train_dataset[0][0])
print("tokenized output sequence:", train_dataset[0][1])

raw inputs: [0.69646919 0.28613933 0.22685145 0.55131477 0.71946897 0.42310646
 0.9807642  0.68482974 0.4809319  0.39211752]
raw output: 0.544199352975335
tokenized input sequence: tensor([11,  6,  9,  6,  4,  6,  9,  1,  9, 10,  2,  8,  6,  1,  3,  9,  3,  3,
        10,  2,  2,  6,  8,  5,  1,  4,  5, 10,  5,  5,  1,  3,  1,  4,  7,  7,
        10,  7,  1,  9,  4,  6,  8,  9,  7, 10,  4,  2,  3,  1,  0,  6,  4,  6,
        10,  9,  8,  0,  7,  6,  4,  2,  0, 10,  6,  8,  4,  8,  2,  9,  7,  4,
        10,  4,  8,  0,  9,  3,  1,  9,  0, 10,  3,  9,  2,  1,  1,  7,  5,  2,
        12])
tokenized output sequence: tensor([11,  5,  4,  4,  1,  9,  9,  3,  5, 12])


We are now ready to construct the transformer model. This includes several modules:
- A ```TokenEmbedding``` class that projects each token to its (trainable) embedding representation;
- A ```PositionalEncoding``` class that adds the positional encoding to the token embeddings;
- A ```Seq2SeqTransformer``` that implements the actual transformer architecture.

We will do them one at a time.

In [37]:
class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size, d_model):
        """
        :param vocab_size: the size of the vocabulary
        :param d_model: the embedding dimension
        """
        super(TokenEmbedding, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.d_model = d_model

    def forward(self, tokens):
        """
        :param tokens: the input tensor with shape (batch_size, seq_len)
        :return: the tensor after token embedding with shape (batch_size, seq_len, d_model)
        """
        return self.embedding(tokens)

In [69]:
# see for yourself: if you apply the TokenEmbedding module to the first input sequence in the training set, you should get a tensor of shape (1, seq_len, d_model)
# unsqueeze(0) here adds a batch dimension, so the input tensor conforms to the (batch_size, seq_len) shape
test_input = train_dataset[0][0].unsqueeze(0)
test_emb = TokenEmbedding(len(VOCAB), 512)(test_input)
test_emb.size()

torch.Size([1, 91, 512])

In [None]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=100):
        """
        :param d_model: the embedding dimension
        :param max_len: the maximum length of the sentence
        """
        super(PositionalEncoding, self).__init__()
        # setting max_len to 100 here, because the largest input sequence is 91 tokens long (10 * 8 digits + 9 spaces + 1 start + 1 end), so 100 is enough
        # initialize the positional encoding, pe.shape = (max_len, d_model)
        pe = torch.zeros(max_len, d_model)
        # generate a tensor of shape (max_len, 1), with values from 0 to max_len - 1, to represent all unique positions
        # the unsqueeze(1) operation adds a dimension after the first dimension, so the shape changes from (max_len,) to (max_len, 1)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        # calculate scaling factors for each dimension of the positional encoding, see the formula in the first section of this notebook
        scaling_factors = torch.tensor([10000.0 ** (-2 * i / d_model) for i in range(d_model // 2)])
        # now populate the positional encoding tensor with values, even indices use sine functions, odd indices use cosine functions
        pe[:, 0::2] = torch.sin(position * scaling_factors)  # pe[:, 0::2].shape = (max_len, d_model/2)
        pe[:, 1::2] = torch.cos(position * scaling_factors)  # pe[:, 1::2].shape = (max_len, d_model/2)
        # add a batch dimension to the positional encoding tensor so that it's compatible with the input tensor. pe.shape = (1, max_len, d_model)
        pe = pe.unsqueeze(0)
        # register the positional encoding tensor as a buffer, so that it will be stored as part of the model's "states" and won't be updated during training
        # this is desirable because we don't want the positional encoding to be trained, we want it to be fixed
        self.register_buffer('pe', pe)

    def forward(self, x):
        """
        :param x: the input tensor with shape (batch_size, seq_len, d_model)
        :return: the tensor after adding positional encoding with shape (batch_size, seq_len, d_model)
        """
        # for a given input tensor x, add the positional encoding to it
        # x.size(1) is the size of the second dimension of x, i.e., the sequence length, so we only take the encodings for positions that exist in x
        x = x + self.pe[:, :x.size(1)]
        return x

In [74]:
# see for yourself:
test_emb_with_pe = PositionalEncoding(512)(test_emb)
test_emb_with_pe.size()

torch.Size([1, 91, 512])

Next we have the actual ```Seq2SeqTransformer``` module. Things like multi-head attention, feed-forward layers, layer normalization, and residual connections are all encapsulated in pytorch's ```Transformer``` module, which makes it very straightforward to build.

In [75]:
class Seq2SeqTransformer(nn.Module):
    def __init__(self, d_model, nhead, num_encoder_layers, num_decoder_layers, dim_feedforward, vocab_size):
        """
        :param d_model: the embedding dimension
        :param nhead: the number of heads in multi-head attention
        :param num_encoder_layers: the number of blocks in the encoder
        :param num_decoder_layers: the number of blocks in the decoder
        :param dim_feedforward: the dimension of the feedforward network
        :param vocab_size: the size of the vocabulary (shared by source and target sequences)
        """
        super(Seq2SeqTransformer, self).__init__()
        # note that, in many other tasks (e.g., translation), you need two different token embeddings for the source and target languages
        # here, however, because both input and output use the same vocabulary, we can use the same token embedding for both
        self.tok_emb = TokenEmbedding(vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(d_model)
        # the transformer model is constructed with the Transformer module, which takes care of all the details
        # the batch_first=True argument means the input and output tensors are of shape (batch_size, seq_len, d_model)
        self.transformer = Transformer(d_model, nhead, num_encoder_layers, num_decoder_layers, dim_feedforward, batch_first=True)
        # the generator is a simple linear layer that projects the transformer output to the vocabulary size
        # it generates the logits for each token in the vocabulary, will be used for computing loss and making predictions
        self.generator = nn.Linear(d_model, vocab_size)

    def forward(self, src, tgt):
        """
        :param src: the token indices fed to the encoder (required), with shape (batch_size, src_len)
        :param tgt: the token indices fed to the decoder (required), with shape (batch_size, tgt_len)
        :return: the logits over the vocabulary, with shape (batch_size, tgt_len, vocab_size)
        Note: the only mask used here is the causal mask for tgt, which is built inside this method;
        source masks and padding masks are not needed because all sequences have fixed lengths
        """
        # separately embed the source and target sequences
        src_emb = self.positional_encoding(self.tok_emb(src))
        tgt_emb = self.positional_encoding(self.tok_emb(tgt))
        # Important: we don't need any masks for source sequence, or any padding masks, nor do we need a mask for decoder attending to the encoder
        # but we do need a mask for the target sequence -- this is a "causal mask", which prevents the decoder from attending to subsequent tokens during training
        tgt_mask = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        outs = self.transformer(src_emb, tgt_emb, tgt_mask=tgt_mask)
        return self.generator(outs)
    
    # The model also has an encode method and a decode method
    # the encode method takes the source sequence and produces the context vectors (which pytorch calls "memory")
    # the decode method takes the target sequence and the context vectors, and produces the output sequence
    def encode(self, src):
        """
        :param src: the token indices fed to the encoder (required), with shape (batch_size, src_len)
        :return: the encoder output tensor ("memory") with shape (batch_size, src_len, d_model)
        """
        return self.transformer.encoder(self.positional_encoding(self.tok_emb(src)))
    
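    def decode(self, tgt, memory):
        """
        :param tgt: the token indices fed to the decoder (required), with shape (batch_size, tgt_len)
        :param memory: the sequence from the last layer of the encoder (required), with shape (batch_size, src_len, d_model)
        :return: the decoder output tensor with shape (batch_size, tgt_len, d_model)
        """
        # apply the same causal mask as in forward(), so that decoding at inference time matches what the model saw during training
        tgt_mask = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        return self.transformer.decoder(self.positional_encoding(self.tok_emb(tgt)), memory, tgt_mask=tgt_mask)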

We can now start the actual training and evaluation process.

In [76]:
# specify model parameters and training parameters
VOCAB_SIZE = len(VOCAB)
EMB_SIZE = 256
NHEAD = 4
FFN_HID_DIM = 128
BATCH_SIZE = 32
NUM_ENCODER_LAYERS = 3
NUM_DECODER_LAYERS = 3
NUM_EPOCHS = 25

In [77]:
# instantiate the model
model = Seq2SeqTransformer(EMB_SIZE, NHEAD, NUM_ENCODER_LAYERS, NUM_DECODER_LAYERS, FFN_HID_DIM, VOCAB_SIZE)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

In [78]:
# Create DataLoader for batching
# for eval_loader, we load data one at a time for better demonstration of what happens -- in practice you can also batch it
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
eval_loader = torch.utils.data.DataLoader(test_dataset, batch_size=1, shuffle=False)

In [None]:
# training
for epoch in range(NUM_EPOCHS):
    # start model training
    model.train()
    # initialize total loss for the epoch
    total_loss = 0
    for src, tgt in train_loader:
        optimizer.zero_grad()        
        # Separate the input and target sequences for teacher forcing
        # tgt_input has everything except the last token
        # tgt_output has everything except the first token
        tgt_input = tgt[:, :-1]
        tgt_output = tgt[:, 1:]
        # Forward pass with teacher forcing, logits has shape (batch_size, seq_len, vocab_size)
        logits = model(src, tgt_input)
        # Calculate loss. The .reshape(-1, vocab_size) call flattens the logits to (batch_size * seq_len, vocab_size)
        outputs = logits.reshape(-1, logits.shape[-1])
        # also flatten the ground truth outputs to shape (batch_size * seq_len)
        tgt_out = tgt_output.reshape(-1)
        loss = criterion(outputs, tgt_out)
        total_loss += loss.item()
        loss.backward()
        optimizer.step()
    print(f"Epoch: {epoch}, Training Loss: {total_loss}")
    
    # monitor loss on the test set
    model.eval()
    test_loss = 0      
    with torch.no_grad():
        for src, tgt in eval_loader:
            encoder_output = model.encode(src)
            # decoding starts with the "start" token
            tgt_idx = [VOCAB.index('s')]
            pred_num = '0.'
            for i in range(MAX_DIGITS):
                # prepare the input tensor for the decoder, adding the batch dimension
                decoder_input = torch.LongTensor(tgt_idx).unsqueeze(0)
                # the decoder output has shape (1, seq_len, d_model) and the last position in sequence is the prediction for next token
                decoder_output = model.decode(decoder_input, encoder_output)
                # the predicted logits has shape (1, seq_len, vocab_size)
                logits = model.generator(decoder_output)
                # calculate test loss based on most recent token prediction, that is logits[:, -1, :]
                # the ground truth for step i is tgt[0][i + 1], because tgt[0][0] is the "start" token
                test_loss += criterion(logits[:, -1, :], tgt[0][i + 1].unsqueeze(0)).item()
                # the actual predicted token is the one with highest logit score
                # here, .argmax(2) makes sure the max is taken on the last dimension, which is the vocabulary dimension, and [:, -1] makes sure that we are looking at the last position in the sequence
                pred_token = logits.argmax(2)[:,-1].item()
                # stop before appending if the model predicts the "end" token, so that 'e' does not end up in pred_num
                if pred_token == VOCAB.index('e'):
                    break
                # append the predicted token to target sequence as you go
                tgt_idx.append(pred_token)
                pred_num += VOCAB[pred_token]
            # Convert the predicted sequence to a number - if you want, you can use it to compute other metrics such as RMSE
            try:
                pred_num = float(pred_num)  # Convert the accumulated string to a float
            except ValueError:
                pred_num = 0.0  # Handle any conversion errors gracefully
    print("Test Loss: ", test_loss)
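
As mentioned in the comments above, the predicted numbers can also be used to compute other metrics such as RMSE. Below is a minimal sketch; the `rmse` helper is hypothetical and the numbers are made up for illustration (they are not model outputs). In the loop above, you would collect each `pred_num` and the corresponding true value into two lists.

```python
import math

def rmse(predictions, targets):
    # root mean squared error between two equally long lists of numbers
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# made-up values, for illustration only
example_preds = [0.50, 0.40, 0.60]
example_truth = [0.55, 0.40, 0.50]
print(rmse(example_preds, example_truth))
```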

I have also put the entire pipeline into a separate script ```Transformer.py``` under the ```pytorch``` folder.

# Additional Resources <a name="resource"></a>

- Original research paper that proposed the transformer architecture: [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf);
- Original paper on self-attention: [Long Short-Term Memory-Networks for Machine Reading](https://arxiv.org/pdf/1601.06733.pdf);
- Additional articles to learn about self-attention: [Illustrated: Self-Attention](https://towardsdatascience.com/illustrated-self-attention-2d627e33b20a), [Introduction of Self-Attention Layer in Transformer](https://medium.com/lsc-psd/introduction-of-self-attention-layer-in-transformer-fc7bff63f3bc);
- Additional articles on other components in a transformer: [Layer Normalization](https://arxiv.org/abs/1607.06450), [Normalization Techniques in Deep Neural Networks](https://medium.com/techspace-usict/normalization-techniques-in-deep-neural-networks-9121bf100d8), [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385);
- Implementation of Transformer: [Transformer model for language understanding](https://www.tensorflow.org/tutorials/text/transformer);
- [Transformer for text classification](https://keras.io/examples/nlp/text_classification_with_transformer/)
- Andrej Karpathy's [YouTube Tutorial](https://www.youtube.com/watch?v=kCc8FmEb1nY)