# A Tutorial on the Transformer Neural Network
> Alex Judge and Harrison Prosper<br>
> Florida State University, Spring 2023 (closely follows the Annotated Transformer[1])<br>
 > __Updated__: July 4, 2023 for Terascale 2023, DESY, Hamburg, Germany


## Introduction

This tutorial describes a sequence-to-sequence (seq2seq) neural network, called the __transformer__[2], which can be used to translate one sequence of tokens to another. The tutorial follows closely the Annotated Transformer[1].

The seq2seq model
consists of three parts:

  1. The embedding layers: encode the tokens and their relative positions within sequences.
  1. The transformer layer[2]: implements the syntactic and semantic analysis.
  1. The output layer: makes a probabilistic prediction for the next token in the output sequence given the input sequence and the current output sequence. 

__Tensor Convention__
We follow the convention used in the Annotated Transformer[1] in which the batch is the first dimension in all tensors.


## Sequence to Sequence Model 

### Introduction
A transformer-based seq2seq model comprises an `encoder` and a `decoder`. The encoder embeds every token in the source sequence, $\boldsymbol{x}$, together with its ordinal value,  in a vector space. The vectors are processed with a chain of algorithms called __attention__ and the transformed vectors together with the current target sequence, $\boldsymbol{t}$, or current predicted sequence, $\boldsymbol{y}$, are sent to the decoder which embeds the targets in another vector space. The target vectors are likewise processed with a chain of attention algorithms, while the target vectors and those from the encoder are processed with another attention algorithm. Finally, the decoder assigns a weight to every token in the target vocabulary. Using a greedy strategy, one chooses the next output token to be the one with the largest weight, that is, one chooses the most probable token. The model is __auto-regressive__: the predicted token is appended to the existing predicted output sequence and the model is called again with the same source and the updated output. The procedure repeats until either the maximum output sequence length is reached or the end-of-sequence (EOS) token is predicted as the most probable token.


### Attention

When we translate from one sequence of symbols to another sequence of symbols, for example from one natural language to another,  the meaning of the sequences is encoded in the symbols, their relative order, and the degree to which a given symbol is related to the other symbols. Consider the phrases "the white house" and "la maison blanche". In order to obtain a correct translation it is important for the model to encode the fact that "la" and "maison" are strongly related, while "the" and "house" are less so. It is also important for the model to encode the strong relationship between "the" and "la", between "house" and "maison", and between "white" and "blanche". That is, the model needs to *pay attention to* grammatical and semantic facts. At least that's what humans do.

The need for the model to pay attention to relevant linguistic facts is the basis of the so-called [attention mechanism](https://nlp.seas.harvard.edu/annotated-transformer/). In the encoding stage, the model associates with every token a vector that tries to capture the strength of the token's relationship to other tokens. Since this association mechanism operates within the same sequence it is referred to as __self attention__. Ideally, self attention will note the fact that "la" and "maison" are strongly coupled and, ideally, that the relative positions of "maison" and "blanche" are also strongly coupled as are the relative positions of "white" and "house". In the decoding stage of the model, in addition to the self attention, this time over the target sequences, another attention mechanism should pay attention to the fact that "the" and "la", "house" and "maison", and "white" and "blanche" are strongly coupled. At a minimum, therefore, we expect a successful seq2seq model to (somehow) model self attention in both the encoding and decoding phases and source to target attention in the decoding phase. The optimal way to implement this is not known; however, the transformer model implements an attention mechanism, described next, which empirically appears to be highly effective.


### Prediction
As noted, the transformer is trained, and used, *auto-regressively*: given source, i.e., input, sequence $\boldsymbol{x} = x_0, x_1,\cdots, x_{n-1}$ of length $n$ tokens, and current output sequence $\boldsymbol{y}_k = y_0, y_1,\cdots, y_{k-1}$ of length $k$ tokens, the model approximates a discrete conditional probability distribution over the target vocabulary of size $m$ tokens,

$$p_{ij} \equiv p(y_{ij} | \boldsymbol{x}, \boldsymbol{y}_k), \quad i = 0, \cdots, k, \quad j = 0,\cdots, m-1 .$$

For a sequence of size $k$ over a target vocabulary of $m$ tokens, there are $m^k$ possible "sentences". Ideally, we want to find the most probable. Alas, we have a bit of a computational problem. For a sequence of size $k=200$ tokens and a target vocabulary of size $m = 28$ tokens, there are $28^{200} \sim 2.7\times10^{289}$ possible sentences. Even at a trillion probability calculations per second, an exhaustive search would be an utterly futile undertaking because it would take far longer to complete than the current age of the universe ($\sim 4 \times 10^{17}$ s)! Obviously, we need to use a heuristic strategy.

The simplest strategy is the __greedy strategy__ in which we consider only the last predicted probability distribution, that is, distribution $k+1$ and choose the next token to be the most probable.  
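
The greedy, auto-regressive loop can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the tutorial's model: `toy_next_token_probs` is a hypothetical stand-in for a trained transformer (it simply copies the source and then emits `<eos>`), and the ids `SOS = 1` and `EOS = 2` follow the vocabularies used in this tutorial.

```python
from typing import Callable, List, Sequence

SOS, EOS = 1, 2  # ids of <sos> and <eos>, as in the tutorial's vocabularies


def greedy_decode(next_token_probs: Callable[[Sequence[int], Sequence[int]], List[float]],
                  source: Sequence[int],
                  max_len: int) -> List[int]:
    """Repeatedly append the most probable next token to the output."""
    output = [SOS]
    while len(output) < max_len:
        probs = next_token_probs(source, output)
        # greedy choice: index of the largest probability
        token = max(range(len(probs)), key=probs.__getitem__)
        output.append(token)
        if token == EOS:
            break
    return output


def toy_next_token_probs(source, output):
    """Hypothetical stand-in for a trained model: copy the source, then emit <eos>."""
    vocab_size = 6
    step = len(output) - 1                      # number of tokens predicted so far
    target = source[step] if step < len(source) else EOS
    probs = [0.02] * vocab_size
    probs[target] = 1.0 - 0.02 * (vocab_size - 1)
    return probs


print(greedy_decode(toy_next_token_probs, [3, 4, 5], max_len=10))
# prints [1, 3, 4, 5, 2]
```

The loop stops either when `<eos>` is chosen or when the output reaches `max_len` tokens, mirroring the two stopping conditions described earlier.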

A better strategy is __beam search__ in which at each prediction stage we keep track of the $n$ "best" sequences so far where by best we mean the $n$ most probable sequences so far. At the end we pick the most probable sequence among the $n$.
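
A small sketch may help to show how beam search can find a more probable sequence than the greedy choice. Everything here is illustrative: `toy_next_token_probs` is a hypothetical model defined by a fixed probability table, and `beam_width` plays the role of the number of sequences kept at each stage.

```python
import math

SOS, EOS = 1, 2  # ids of <sos> and <eos>, as in the tutorial's vocabularies


def beam_search(next_token_probs, source, beam_width, max_len):
    """Keep the beam_width most probable partial sequences at each step."""
    beams = [(0.0, [SOS])]          # (log probability, token sequence)
    for _ in range(max_len - 1):
        candidates = []
        for logp, seq in beams:
            if seq[-1] == EOS:
                candidates.append((logp, seq))   # finished: carry forward unchanged
                continue
            probs = next_token_probs(source, seq)
            for token, p in enumerate(probs):
                if p > 0.0:
                    candidates.append((logp + math.log(p), seq + [token]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(seq[-1] == EOS for _, seq in beams):
            break
    return beams[0][1]              # most probable sequence found


def toy_next_token_probs(source, output):
    """Hypothetical model: a fixed table of next-token probabilities."""
    table = {
        (SOS,):   [0.0, 0.0, 0.0, 0.6, 0.4],
        (SOS, 3): [0.0, 0.0, 0.0, 0.5, 0.5],
        (SOS, 4): [0.0, 0.0, 0.0, 0.9, 0.1],
    }
    # any sequence not in the table ends with certainty
    return table.get(tuple(output), [0.0, 0.0, 1.0, 0.0, 0.0])


print(beam_search(toy_next_token_probs, [], beam_width=1, max_len=5))  # greedy
print(beam_search(toy_next_token_probs, [], beam_width=2, max_len=5))
```

With `beam_width=1` the search reduces to the greedy strategy and commits to token 3 first (probability 0.6), ending with a sequence of probability 0.3. With `beam_width=2` it also keeps the sequence starting with token 4 and finds `[1, 4, 3, 2]`, whose probability is 0.4 × 0.9 = 0.36.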


### References
  1. [Annotated Transformer](https://nlp.seas.harvard.edu/annotated-transformer/)
  1.  [Attention is all you need](https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf)


In [2]:
import sys

try: 
    from google.colab import drive
    drive.mount('/content/gdrive')
    
    BASE = '/content/gdrive/My Drive/transformer'
    sys.path.append(BASE)
    
    print('\nRunning in Google Colab\n')
    
    gpu_info = !nvidia-smi
    gpu_info = '\n'.join(gpu_info)
    if gpu_info.find('failed') >= 0:
        print('Not connected to a GPU')
    else:
        print(gpu_info)
    
except:
    BASE = '.'   
    print('\nRunning locally')

def pathname(filename):
    return f'{BASE:s}/{filename:s}'

import os, re
import numpy as np
import random
import math
import time

import torch
import torch.nn as nn
import torch.optim as optim

import matplotlib as mp
import matplotlib.pyplot as plt
%matplotlib inline

DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'\nComputational device: {str(DEVICE):s}')



Running locally

Computational device: cpu


In [3]:
SEED = 314159
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

## Read Sequence Data

In [4]:
%run "{BASE}"/dataloader.ipynb

MAX_SEQ_LEN = 200
BATCH_SIZE  = 128
FTRAIN      = 17
FVALID      = 2
FTEST       = 1

filename = pathname('seq2seq_series.txt')
delimit  = '|'

dloader  = DataLoader(filename, delimit, 
                      max_seq_len=MAX_SEQ_LEN, 
                      batch_size=BATCH_SIZE, 
                      ftrain=FTRAIN, 
                      fvalid=FVALID, 
                      ftest=FTEST)

train_data, valid_data, test_data = dloader.data_splits()

read sequences
     0 -----------------------------------------------------------------------------------


(exp(3*a*x) - sin(c*x)/sinh(g*x))*exp(-a*x)

1 + x*(a*c/g + 2*a) + x**2*(-a**2*c/(2*g) + 2*a**2 + c**3/(6*g) + c*g/6) + x**3*(a**3*c/(6*g) + 4*a**3/3 - a*c**3/(6*g) - a*c*g/6) + x**4*(-a**4*c/(24*g) + 2*a**4/3 + a**2*c**3/(12*g) + a**2*c*g/12 - c**5/(120*g) - c**3*g/36 - 7*c*g**3/360) + x**5*(a**5*c/(120*g) + 4*a**5/15 - a**3*c**3/(36*g) - a**3*c*g/36 + a*c**5/(120*g) + a*c**3*g/36 + 7*a*c*g**3/360) - c/g + O(x**6)

 15414 -----------------------------------------------------------------------------------


exp(d*x) + sinh(f*x) - tanh(d*x) + tanh(g*x)/tanh(a*x)

g/a + 1 + x**2*(a*g/3 + d**2/2 - g**3/(3*a)) + x**3*(d**3/2 + f**3/6) + x**4*(-a**3*g/45 - a*g**3/9 + d**4/24 + 2*g**5/(15*a)) + x**5*(-d**5/8 + f**5/120) + f*x + O(x**6)

 30828 -----------------------------------------------------------------------------------


exp(2*d*x)*tan(c*x)**3

x**5*(c**5 + 2*c**3*d**2) + c**3*x**3 + 2*c**3*d*x**4 + O(x**6)


source vocabulary
{'<pad>': 0, '<sos>': 1, '<eos>': 2, '(': 3, ')': 4, '*': 5, '**': 6, '+': 7, '-': 8, '/': 9, '0': 10, '1': 11, '2': 12, '3': 13, '4': 14, '5': 15, '6': 16, '7': 17, '8': 18, '9': 19, 'a': 20, 'b': 21, 'c': 22, 'cos': 23, 'cosh': 24, 'd': 25, 'exp': 26, 'f': 27, 'g': 28, 'sin': 29, 'sinh': 30, 'tan': 31, 'tanh': 32, 'x': 33}

target vocabulary
{'<pad>': 0, '<sos>': 1, '<eos>': 2, '(': 3, ')': 4, '*': 5, '**': 6, '+': 7, '-': 8, '/': 9, '0': 10, '1': 11, '2': 12, '3': 13, '4': 14, '5': 15, '6': 16, '7': 17, '8': 18, '9': 19, 'O(x**6)': 20, 'a': 21, 'b': 22, 'c': 23, 'd': 24, 'f': 25, 'g': 26, 'x': 27}

tokenize
 46000
 46000
delimit and pad training data
 33000
delimit and pad validation data
  3000
delimit test data but do not pad
  1000
avg. source sequence length:       29
std. source sequence length:       10
     source sequence length:       61
     source vocabulary size:       34

avg. target sequence length:      142
std. target sequence length:      212
    

## The Model

The transformer comprises an encoder and decoder, each of which consists of one or more processing layers.

### Encoder

The encoder does the following:
 1. Each token in the source (input) sequence is encoded as a vector $\boldsymbol{t}$ in a space of __emb_dim__ dimensions. 
 1. The position of each token is also encoded as a vector $\boldsymbol{p}$ in a vector space of the same dimension as $\boldsymbol{t}$.  Both the token and position embeddings are trainable.
 1. Each token is associated with a third vector: $\boldsymbol{v} = \lambda \boldsymbol{t} + \boldsymbol{p}$, where the scale factor $\lambda = \sqrt{\text{emb_dim}}$.  

The vectors $\boldsymbol{v}$ are processed through $N$ *encoder layers*.

Since the source sequences are padded so that they are all of equal length, a method is needed to ensure that the pad tokens are ignored. This is done using masks.
The source mask, `src_mask`, has value 1 if the token in the source is *not* a `<pad>` token and 0 otherwise. The source mask is used in the encoder layers to mask the `<pad>` tokens in the multi-head attention mechanisms.
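
The code that builds `src_mask` is not shown in this section. The following is one common construction, offered as a sketch under that assumption: the `<pad>` id of 0 is taken from the vocabularies above, the token ids are illustrative, and the all-zero `scores` tensor merely stands in for real attention scores.

```python
import torch

PAD = 0  # id of the <pad> token in the tutorial's vocabularies

# two padded source sequences of token ids: [batch_size=2, src_len=5]
src = torch.tensor([[1, 20, 5, 33, 2],
                    [1, 33, 2,  0, 0]])

# True where the token is not <pad>; the two singleton dimensions let the
# mask broadcast over attention heads and query positions
src_mask = (src != PAD).unsqueeze(1).unsqueeze(2)
# src_mask: [batch_size, 1, 1, src_len]

# illustrative attention scores: [batch_size, n_heads=1, query_len=5, key_len=5]
scores  = torch.zeros(2, 1, 5, 5)
scores  = scores.masked_fill(src_mask == 0, -1e10)
weights = torch.softmax(scores, dim=-1)

print(src_mask.shape)
print(weights[1, 0, 0])   # the two <pad> columns receive (numerically) zero weight
```

Because the masked scores are set to a very large negative number before the softmax, the `<pad>` positions contribute essentially nothing to the weighted average over value vectors.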

In [5]:
class Encoder(nn.Module):
    
    def __init__(self, 
                 vocab_size,      # vocabulary size (of source)
                 emb_dim,         # dimension of token embedding space
                 n_layers,        # number of encoding layers
                 n_heads,         # number of attention heads
                 ff_dim,          # dimension of feed-forward network
                 dropout,         # dropout probability
                 device,          # computational device
                 max_len):        # maximum number of tokens/sequence
        
        super().__init__()

        # cache computational device
        self.device = device
        
        # represent each of the 'vocab_size' possible tokens 
        # by a vector of size 'emb_dim'
        self.tok_embedding = nn.Embedding(vocab_size, emb_dim)
        
        # represent the position of each token by a vector of size 'emb_dim'.
        # 'max_len' is the maximum length of a sequence.
        self.pos_embedding = nn.Embedding(max_len, emb_dim)
        
        # create 'n_layers' encoding layers
        self.layers = nn.ModuleList([EncoderLayer(emb_dim, 
                                                  n_heads, 
                                                  ff_dim,
                                                  dropout, 
                                                  device) 
                                     for _ in range(n_layers)])
        
        self.dropout= nn.Dropout(dropout)
        
        # factor by which to scale token embedding vectors
        self.scale  = torch.sqrt(torch.FloatTensor([emb_dim])).to(device)
        
    def forward(self, src, src_mask):
        # src      : [batch_size, src_len]         (shape of src)
        # src_mask : [batch_size, 1, 1, src_len]   (shape of src_mask)
        
        batch_size, src_len = src.shape
  
        # ---------------------------------------
        # input embedding 
        # ---------------------------------------
        src = self.tok_embedding(src)
        # src: [batch_size, src_len, emb_dim]
        
        # ---------------------------------------
        # position embedding
        # ---------------------------------------
        # create a row tensor, p, with entries [0, 1,..., src_len-1]
        pos = torch.arange(0, src_len)
        # pos: [src_len]
        
        # 1. add a dimension at position 0 (for batch size)
        # 2. repeat p, once per row, 'batch_size' times, so that 
        #    we obtain
        # pos = |p|
        #       |p|
        #        :
        #       |p|
        # 3. send to computational device
        once_per_row = 1
        pos = pos.unsqueeze(0).repeat(batch_size, once_per_row).to(self.device)
        # pos: [batch_size, src_len]
        
        pos = self.pos_embedding(pos)
        # pos: [batch_size, src_len, emb_dim]
        
        # linearly combine token and token position embeddings
        # could try replacing this by a feed-forward network
        src = src * self.scale + pos
        # src: [batch_size, src_len, emb_dim]
        
        # is this really necessary?
        src = self.dropout(src)
        
        # pass embedding vectors through encoding layers
        # Note: the entire sequence is processed (in principle) in parallel
        for layer in self.layers:
            src = layer(src, src_mask) 
            # src: [batch_size, src_len, emb_dim]
            
        return src

Alternative (non-trained) position embedding
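
For reference, the fixed encoding computed by the class in the next cell can be written, for a token at position $\text{pos}$ and with $d =$ __emb_dim__ (assumed even), as

\begin{align}
P_{\text{pos}, \, 2j}   & = \sin\left(\text{pos} \, / \, 10000^{2j/d}\right), \\
P_{\text{pos}, \, 2j+1} & = \cos\left(\text{pos} \, / \, 10000^{2j/d}\right), \quad j = 0, \cdots, d/2 - 1 .
\end{align}

The symbol $P$ is introduced here only for illustration; in the code the same quantity is stored in the buffer `pos_encoding`.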

In [6]:
class PositionEmbedding(nn.Module):
    
    def __init__(self,
                 emb_dim: int,           # dimension of embedding space (must be even)
                 max_len: int,           # max_len: maximum length of sequence
                 dropout: float):        # dropout probability
        
        super(PositionEmbedding, self).__init__() # initialize parent class
        
        # den = 10000^(-2j / emb_dim), j = 0, 1, ..., emb_dim/2 - 1
        den = torch.exp(-torch.arange(0, emb_dim, 2) * math.log(10000) / emb_dim)
        # [emb_dim/2]    #  a row-wise vector
        
        pos = torch.arange(0, max_len).reshape(max_len, 1)
        # [max_len, 1]   # a column-wise vector
        
        # compute outer product of pos and den
        # x_mn = pos_m * den_n
        x   = pos * den
        # [max_len, emb_dim/2]
        
        pos_encoding = torch.zeros((max_len, emb_dim))
        # [max_len, emb_dim]
        
        # set every other column starting at column 0
        pos_encoding[:, 0::2] = torch.sin(x)
        
        # set every other column starting at column 1
        pos_encoding[:, 1::2] = torch.cos(x)
        
        # use unsqueeze 0 to place a third dimension (batch) in position 0
        pos_encoding = pos_encoding.unsqueeze(0)
        # [1, max_len, emb_dim]
               
        # registering a tensor as a buffer tells PyTorch that the tensor
        # is not to be changed during optimization.
        self.register_buffer('pos_encoding', pos_encoding)

    def forward(self, seq_len: int):
        # make sure sequence length of position encoding matches
        # that of token sequences.
        p = self.pos_encoding[:, :seq_len, :]
        #p: [1, seq_len, emb_dim]
        return p

### Encoder Layer

 1. Pass the source tensor and its mask to the *multi-head attention layer*.
 1. Apply a residual connection and [Layer Normalization](https://arxiv.org/abs/1607.06450). 
 1. Apply a position-wise feed-forward network (two linear layers with a ReLU in between).
 1. Apply a residual connection and layer normalization.

In [7]:
class EncoderLayer(nn.Module):
    
    def __init__(self, 
                 emb_dim, 
                 n_heads, 
                 ff_dim,  
                 dropout, 
                 device):
        
        super().__init__()
        
        self.self_attention       = MultiHeadAttention(emb_dim, 
                                                       n_heads, 
                                                       dropout, 
                                                       device)
        self.self_attention_norm  = nn.LayerNorm(emb_dim)
        
        self.feedforward          = Feedforward(emb_dim, ff_dim, dropout)
        self.feedforward_norm     = nn.LayerNorm(emb_dim)
        
        self.dropout              = nn.Dropout(dropout)
        
    def forward(self, src, src_mask):
        # src      : [batch_size, src_len, emb_dim]
        # src_mask : [batch_size, 1, 1, src_len] 
          
        # ------------------------------------------
        # self attention over embedded source
        # ------------------------------------------
        # distinguish between src and src_ as the 
        # former is needed later for a residual connection
        src_ = self.self_attention(src, src, src, src_mask)
        # src_: [batch_size, src_len, emb_dim]
        
        # is this useful?
        src_ = self.dropout(src_)
        
        # ------------------------------------------
        # add residual connection then layer norm.
        # ------------------------------------------
        # distinguish between src and src+src_ as the
        # former is later needed for a residual connection
        src  = self.self_attention_norm(src + src_)
        # src: [batch_size, src_len, emb_dim]
        
        src_ = self.feedforward(src)
        # src_: [batch_size, src_len, emb_dim]
        
        src_ = self.dropout(src_)

        # add residual connection and layer norm
        src  = self.feedforward_norm(src + src_)
        # src: [batch_size, src_len, emb_dim]
        
        return src

### Multi-Head Attention Layer


Attention in the transformer model is defined by the matrix expression

\begin{align}
    \textrm{Attention}(Q, K, V) & = \textrm{Softmax}\left(\frac{Q K^T}{\sqrt{d}} \right) V,
\end{align}

where $Q$ is called the `query`, $K$ the `key`, $V$ the `value`, and $d$ is the dimension of the query and key vectors. With a single attention head, $d =$ __emb_dim__, the dimension of the vectors that represent the tokens. In practice, the vectors are split into __n_heads__ pieces each of size __head_dim__ $= \text{emb_dim} / \text{n_heads}$, where __n_heads__ is the number of so-called __attention heads__, and the attention is computed separately for each piece, in which case $d =$ __head_dim__ (this is the scale used in the code below). (It is often stated that each `head` can pay attention to different aspects of a sequence. However, at our current level of understanding of how functions with millions of parameters truly work, such statements should be taken with a liberal pinch of salt.)
In self attention, the query, key, and value tensors are derived from the same tensor via separate linear transformations of that tensor (see Attention Algorithm below). The coefficients of the linear functions are free parameters to be set by the training algorithm.  The number of rows in $Q$, $K$, and $V$, namely, __query_len__,  __key_len__, and __value_len__, respectively, is equal to the sequence length __seq_len__. For target/source attention, the query is a linear function of the target tensor while the key and value tensors are linear functions of the source tensor, where, again, the coefficients are free parameters to be fitted during training.

We first describe the attention mechanism mathematically and then follow with an algorithmic description that closely follows 
the description in the [Annotated Transformer](https://nlp.seas.harvard.edu/annotated-transformer/). It is to be understood that every operation described below is performed for a batch of sequences. Therefore, when we refer to a matrix we really mean a batch of matrices. 

First consider the matrix product $Q K^T$ in component form, where summation over repeated indices (the Einstein convention) is implied,

\begin{align}
A_{qk} 
& = Q_{q h} \, [K^T]_{hk}, \nonumber\\
& \quad q=1,\cdots, \text{query_len}, \,\, h = 1, \cdots, \text{head_dim}, \,\, k = 1, \cdots, \text{key_len} .
\end{align}

When the matrix $A$ is scaled and a softmax function is applied along the key length dimension (here, horizontally, that is, to each row) the result is another matrix $W$ whose row elements, by construction, sum to unity. The matrix $W$ is then multiplied by $V$ to yield

\begin{align}
    \text{Attention}_{qh}  
    & = W_{qk} V_{kh}. 
\end{align}

Since tokens are represented by vectors, it is instructive to think of the attention computation geometrically.   Each row, $i$, of $Q$, $K$, and $V$ can be regarded as the vectors $\boldsymbol{q}_i$, $\boldsymbol{k}_i$, and $\boldsymbol{v}_i$, respectively, associated with token $i$. Consider a sequence with __seq_len__ = 2. We can write $Q$, $K$, and $V$ as

\begin{align}
Q & = \left[\begin{matrix} \boldsymbol{q}_1 \\ \boldsymbol{q}_2 \end{matrix}\right], \\
K & = \left[\begin{matrix} \boldsymbol{k}_1 \\ \boldsymbol{k}_2 \end{matrix}\right], \text{ and} \\
V & = \left[\begin{matrix} \boldsymbol{v}_1 \\ \boldsymbol{v}_2 \end{matrix}\right] ,
\end{align}

and $A = Q K^T$ as the matrix of dot products

\begin{align}
A & = \left[\begin{matrix} \boldsymbol{q}_1 \\ \boldsymbol{q}_2 \end{matrix}\right] 
\left[\begin{matrix} \boldsymbol{k}_1^T & \boldsymbol{k}_2^T \end{matrix}\right] ,
\nonumber\\
& = \left[
\begin{matrix} 
\boldsymbol{q}_1\cdot\boldsymbol{k}_1 & \boldsymbol{q}_1\cdot \boldsymbol{k}_2 \\ 
\boldsymbol{q}_2\cdot\boldsymbol{k}_1 & \boldsymbol{q}_2\cdot \boldsymbol{k}_2
\end{matrix}
\right] .
\end{align}

The matrix $A$ can be interpreted as a measure of the degree to which the $\boldsymbol{q}$ and $\boldsymbol{k}$ vectors are aligned. Presumably, the more aligned the two vectors the stronger the relationship between the  tokens they represent. Because of the use of the dot product, the degree of alignment depends both on the angle between the vectors as well as on their magnitudes. Consequently, two vectors can be more strongly aligned than a vector's alignment with itself! 

After the scaling and softmax operations on $A$, tokens 1 and 2 become associated with vectors $\boldsymbol{w}_1 =  (w_{11}, w_{12})$ and $\boldsymbol{w}_2 =  (w_{21}, w_{22})$, respectively, where

\begin{align}
    w_{ij} & = \frac{\exp\left(\boldsymbol{q}_i \cdot \boldsymbol{k}_j \, / \, \sqrt{d}\right)}
    {\sum_{k = 1}^2 \exp\left(\boldsymbol{q}_i \cdot \boldsymbol{k}_k \, / \, \sqrt{d}\right)} .
\end{align}

These vectors lie in the line segment $[\boldsymbol{p}_1, \boldsymbol{p}_2]$ depicted in the figure below. The line segment is a simplex (specifically, a 1-simplex) that is embedded in a vector space of dimension __seq_len__.  In this vector space, tokens 1 and 2 are represented by the orthogonal unit vectors $\boldsymbol{u}_1$ and $\boldsymbol{u}_2$, respectively. For a sequence of length $n$, the vectors $\boldsymbol{w}_i$, $i = 1,\cdots, n$ lie in the $(n-1)$-simplex and, again, each coordinate unit vector $\boldsymbol{u}_i$ represents a token.  

![simplex](simplex.png)

In the figure above, notice that vector $\boldsymbol{w}_2$ is closer to $\boldsymbol{u}_1$ than it is to $\boldsymbol{u}_2$, indicating that token 2 is more strongly aligned with token 1 than token 2 is aligned with itself, while the converse is true of token 1. The attention vector for token $i$ in the transformer model is simply the weighted average

\begin{align}
    \text{Attention}_i & = w_{i1}  \boldsymbol{v}_1 + w_{i2} \boldsymbol{v}_2
\end{align}

of the so-called value vectors $\boldsymbol{v}_1$ and $\boldsymbol{v}_2$.  The upshot of this construction is that the attention vectors can be viewed as linear functions of the source or target tensors with coefficients that depend non-linearly on the source or target. Consequently, the attention adapts to the sequences as, presumably, it should.
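
The two-token construction can be checked numerically. The sketch below uses NumPy and made-up values for $\boldsymbol{q}_i$, $\boldsymbol{k}_i$, and $\boldsymbol{v}_i$ (they are not taken from the figure); the value vectors are chosen to be unit vectors so that each attention vector coincides with the weight vector $\boldsymbol{w}_i$ on the simplex.

```python
import numpy as np

d = 2  # dimension of the query, key, and value vectors

# rows are the vectors q_i, k_i, v_i of tokens 1 and 2 (illustrative values)
Q = np.array([[1.0, 0.0],
              [0.0, 1.0]])
K = np.array([[1.0, 3.0],
              [2.0, 0.0]])
V = np.array([[1.0, 0.0],
              [0.0, 1.0]])

A = Q @ K.T                      # A_ij = q_i . k_j
S = A / np.sqrt(d)               # scaled scores

# softmax along the key dimension: each row of W sums to one
E = np.exp(S - S.max(axis=1, keepdims=True))
W = E / E.sum(axis=1, keepdims=True)

attention = W @ V                # row i is w_i1 * v_1 + w_i2 * v_2

print(A)
print(W)
print(attention)
```

With these values $A$ has rows $(1, 2)$ and $(3, 0)$, so each token receives a larger weight for the other token than for itself, which illustrates that the alignment depends on magnitudes as well as angles.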


### Attention Algorithm

Now we describe the transformer attention mechanism algorithmically, again following closely the description in the [Annotated Transformer](https://nlp.seas.harvard.edu/annotated-transformer/), but with some notational changes.

#### Step 1
As noted, the attention mechanism starts with three tensors, $Q_\text{in}$, $K_\text{in}$, and $V_\text{in}$, of shapes __[batch_size, query_len, emb_dim]__, __[batch_size, key_len, emb_dim]__, and __[batch_size, value_len, emb_dim]__, respectively, with __value_len=key_len__. (emb_dim is called hid_dim in the [Annotated Transformer](https://nlp.seas.harvard.edu/annotated-transformer/)).  For self attention, $Q_\text{in}$, $K_\text{in}$, and $V_\text{in}$ are the same tensor, while for target/source attention $Q_\text{in}$ is associated with the target tensor and $K_\text{in}$ and $V_\text{in}$ with the source tensor.

Three trainable linear layers, $f_Q$, $f_K$, $f_V$, are defined, each of shape __[emb_dim, emb_dim]__, which yield the so-called `query`, `key`, and `value` tensors

\begin{align}
    Q & = f_Q(\boldsymbol{Q_\text{in}}),\\
    K & = f_K(\boldsymbol{K_\text{in}}), \text{ and}\\
    V & = f_V(\boldsymbol{V_\text{in}}).
\end{align}

Each tensor $Q$, $K$, and $V$ is the same shape as $Q_\text{in}$, $K_\text{in}$, and $V_\text{in}$, respectively. 


#### Step 2
Tensors $Q$, $K$, and $V$ are reshaped by first splitting the embedding dimension, __emb_dim__, into __n_heads__ blocks of size __head_dim = emb_dim / n_heads__ so that their shapes become __[batch_size, -1, n_heads, head_dim]__, where the __-1__ pertains to __query_len__, __key_len__, or __value_len__, whose value is determined at runtime.

#### Step 3
Dimensions 1 and 2 of the tensors $Q$, $K$, and $V$ are permuted (`Tensor.permute(0, 2, 1, 3)`) so that we now have __[batch_size, n_heads, -1, head_dim]__. Tensor $K$ is further permuted (`Tensor.permute(0, 1, 3, 2)`) to shape __[batch_size, n_heads, head_dim, -1]__ so that it represents $K^T$.

#### Step 4
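Tensor $A = Q K^T$ is computed using `torch.matmul` applied to $Q$ and the permuted $K$, and scaled by $1 \, / \, \sqrt{d}$ (the code below uses $d =$ __head_dim__). If a mask is supplied, the elements of $A$ at masked positions are replaced by a large negative number ($-10^{10}$ in the code below) so that they receive essentially zero weight. A softmax is then applied to the last dimension of $A$, that is, the key/value length dimension, yielding the tensor $W$ of shape __[batch_size, n_heads, query_len, key_len]__.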


#### Step 5
$\text{Attention} = W V$ is computed, yielding a tensor of shape 
__[batch_size, n_heads, query_len, head_dim]__.

#### Step 6
The __n_heads__ and __query_len__ dimensions of `Attention` are transposed (`Tensor.permute(0, 2, 1, 3)`) to shape __[batch_size, query_len, n_heads, head_dim]__ and forced to be contiguous in memory (`contiguous()`).

#### Step 7
The __n_heads__ and __head_dim__ dimensions are merged using `Attention.view(batch_size, -1, emb_dim)`, which concatenates the attention heads into a single `MultiHeadAttention` tensor.

#### Step 8
Finally, the merged `MultiHeadAttention` tensor is pushed through a trainable linear layer of shape __[emb_dim, emb_dim]__ to output a tensor of shape __[batch_size, -1, emb_dim]__, where __-1__ is the sequence length.
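
The shape bookkeeping in Steps 2 to 7 can be traced with a short sketch. It is a simplified illustration only: random tensors stand in for the outputs of Step 1, the trainable linear layers of Steps 1 and 8 and the optional mask are omitted, the helper `split_heads` is a name introduced here, and the sizes are arbitrary. The scale $\sqrt{\text{head_dim}}$ follows the `MultiHeadAttention` class in this tutorial.

```python
import torch

batch_size, seq_len, emb_dim, n_heads = 2, 5, 8, 4
head_dim = emb_dim // n_heads

torch.manual_seed(0)
Q = torch.randn(batch_size, seq_len, emb_dim)   # stand-ins for the Step 1 outputs
K = torch.randn(batch_size, seq_len, emb_dim)
V = torch.randn(batch_size, seq_len, emb_dim)


def split_heads(x):
    # Steps 2 and 3: [batch, len, emb_dim] -> [batch, n_heads, len, head_dim]
    return x.view(batch_size, -1, n_heads, head_dim).permute(0, 2, 1, 3)


Qh, Kh, Vh = split_heads(Q), split_heads(K), split_heads(V)

# Step 4: scaled scores, then softmax over the key dimension
A = torch.matmul(Qh, Kh.permute(0, 1, 3, 2)) / head_dim ** 0.5
W = torch.softmax(A, dim=-1)             # [batch, n_heads, query_len, key_len]

# Step 5: weighted average of the value vectors, per head
attention = torch.matmul(W, Vh)          # [batch, n_heads, query_len, head_dim]

# Steps 6 and 7: merge the heads back into the embedding dimension
merged = attention.permute(0, 2, 1, 3).contiguous().view(batch_size, -1, emb_dim)

print(W.shape, attention.shape, merged.shape)
```

Note that splitting into heads and merging again is a pure rearrangement: applying Steps 6 and 7 directly to `split_heads(Q)` returns `Q` unchanged.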


### Comments
It is claimed that the  algorithm above captures the notion of "paying attention to" token-token associations both within the same sequence and across sequences and that each attention head "pays attention to" a different aspect of the sequences. All such claims should be taken with a pinch of salt for at least two reasons.
First, it is far from clear that this computation aligns with our intuitive understanding of  that notion and, second, the computation is nested through multiple attention layers. Therefore, whatever the attention layers are doing, it is distributed over multiple layers in a highly non-linear way. 

It is, however, undeniable that the transformer has yielded amazing results. Therefore, we are forced to concede that, in practice,  whatever is going on in the attention layers the algorithm works wonders!


In [8]:
class MultiHeadAttention(nn.Module):
    
    def __init__(self, emb_dim, n_heads, dropout, device):
        
        super().__init__()
        
        # emb_dim must be a multiple of n_heads
        assert emb_dim % n_heads == 0
        
        self.emb_dim  = emb_dim
        self.n_heads  = n_heads
        self.head_dim = emb_dim // n_heads
        
        self.linear_Q = nn.Linear(emb_dim, emb_dim)
        self.linear_K = nn.Linear(emb_dim, emb_dim)
        self.linear_V = nn.Linear(emb_dim, emb_dim)
        self.linear_O = nn.Linear(emb_dim, emb_dim)
        
        self.dropout  = nn.Dropout(dropout)
        
        self.scale    = torch.sqrt(torch.FloatTensor([self.head_dim])).to(device)
        
    def forward(self, query, key, value, mask = None):
        # query  : [batch_size, query_len, emb_dim]
        # key    : [batch_size, key_len,   emb_dim]
        # value  : [batch_size, value_len, emb_dim]
        
        batch_size, _, emb_dim = query.shape
        assert emb_dim == self.emb_dim
        
        Q = self.linear_Q(query)
        # Q: [batch_size, query_len, emb_dim]
        
        K = self.linear_K(key)
        # K: [batch_size, key len,   emb_dim]
        
        V = self.linear_V(value)
        # V: [batch_size, value_len, emb_dim]
        
        # split vectors of size emb_dim into 'n_heads' vectors of size 'head_dim'
        # and then permute dimensions 1 and 2
        Q = Q.view(batch_size, -1, self.n_heads, self.head_dim).permute(0, 2, 1, 3)
        # Q: [batch_size, n_heads, query_len, head_dim]
        
        K = K.view(batch_size, -1, self.n_heads, self.head_dim).permute(0, 2, 1, 3)
        # K: [batch_size, n_heads, key_len,   head_dim]        
        
        V = V.view(batch_size, -1, self.n_heads, self.head_dim).permute(0, 2, 1, 3)
        # V: [batch_size, n_heads, value_len, head_dim]
          
        # transpose K (by permuting key_len and head_dim), then
        # compute QK^T/scale
        A = torch.matmul(Q, K.permute(0, 1, 3, 2)) / self.scale
        # A: [batch_size, n_heads, query_len, key_len]
        
        if mask is not None:
            A = A.masked_fill(mask == 0, -1e10)
        
        # apply softmax to the last dimension (i.e, to key len)
        # WARNING: W is referred to as 'attention' in Annotated Transformer!
        W = torch.softmax(A, dim=-1)     
        # W: [batch_size, n_heads, query_len, key_len]
        
        # not sure why dropout is useful here
        W = self.dropout(W)
        
        # compute attention: softmax(QK^T/scale)V = WV
        attention  = torch.matmul(W, V)
        # attention: [batch_size, n_heads, query_len, head_dim]
        
        # permute n heads and query len and make sure the tensor 
        # is contiguous in memory...
        attention = attention.permute(0, 2, 1, 3).contiguous()
        # attention: [batch_size, query_len, n_heads, head_dim]
        
        # ... and concatenate the n heads into a single multi-head 
        # attention tensor
        attention = attention.view(batch_size, -1, self.emb_dim)
        # attention: [batch_size, query_len, emb_dim]
        
        output    = self.linear_O(attention)
        # output: [batch_size, query_len, emb_dim]

        return output

In [9]:
# Written by ChatGPT v3.5!
def group_sort(input):
    """
    Splits the input tensor into groups of size 2 and sorts each group independently.

    Args:
        input (torch.Tensor): The input tensor to be sorted.

    Returns:
        torch.Tensor: The sorted tensor, with elements grouped and sorted in ascending order.
    """
    # Reshape the input tensor into groups of size 2
    grouped_tensor = input.view(-1, 2)

    # Sort each group individually using torch.sort
    sorted_groups, _ = torch.sort(grouped_tensor)
    
    # Flatten the sorted groups tensor
    sorted_tensor = sorted_groups.reshape(input.shape)

    return sorted_tensor

### Feedforward Layer

In [10]:
class Feedforward(nn.Module):
    
    def __init__(self, emb_dim, ff_dim, dropout):
        
        super().__init__()
        
        self.linear_1 = nn.Linear(emb_dim, ff_dim)
        
        self.linear_2 = nn.Linear(ff_dim, emb_dim)
        
        self.dropout  = nn.Dropout(dropout)
        
    def forward(self, x):
        # x: [batch_size, seq_len, emb_dim]
        
        x = self.linear_1(x)
        x = torch.relu(x)
        x = self.dropout(x)
        # x: [batch_size, seq_len, ff_dim]
        
        x = self.linear_2(x)
        # x: [batch_size, seq_len, emb_dim]
        
        return x

### Decoder

The decoder takes the encoded representation of the source sequence and the target sequence, or current predicted output sequence, and predicts, probabilistically, the next output token. 

The decoder has two multi-head attention layers: a *masked multi-head attention layer* over the target sequence, and a multi-head attention layer which uses the decoder representation as the query and the encoder representation as the key and value.

__Note__: In PyTorch, the `nn.CrossEntropyLoss` loss function applies the (log-)softmax internally to the raw outputs (logits), so the decoder does not have a softmax layer; it returns logits.
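
As a quick check of this note, the sketch below compares `nn.CrossEntropyLoss` applied to raw logits with an explicit log-softmax followed by the negative log-likelihood loss. The tensor sizes and target codes are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits  = torch.randn(4, 7)            # [n_tokens, vocab_size], raw decoder outputs
targets = torch.tensor([1, 0, 6, 3])   # [n_tokens], token codes

# cross entropy applied directly to the logits
loss_ce  = nn.CrossEntropyLoss()(logits, targets)

# the same quantity computed in two explicit steps
loss_nll = F.nll_loss(F.log_softmax(logits, dim=-1), targets)

print(torch.allclose(loss_ce, loss_nll))   # True
```

A softmax is therefore needed only when probabilities are wanted explicitly, for example at prediction time.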

In [11]:
class Decoder(nn.Module):
    
    def __init__(self, 
                 vocab_size,   # size of target vocabulary
                 emb_dim, 
                 n_layers, 
                 n_heads, 
                 ff_dim, 
                 dropout, 
                 device,
                 max_len):
        
        super().__init__()
        
        self.device = device
        
        self.tok_embedding = nn.Embedding(vocab_size, emb_dim)
        
        self.pos_embedding = nn.Embedding(max_len, emb_dim)
        
        self.layers  = nn.ModuleList([DecoderLayer(emb_dim, 
                                                   n_heads, 
                                                   ff_dim, 
                                                   dropout, 
                                                   device)
                                     for _ in range(n_layers)])
        
        self.linear  = nn.Linear(emb_dim, vocab_size)
        
        self.dropout = nn.Dropout(dropout)
        
        self.scale   = torch.sqrt(torch.FloatTensor([emb_dim])).to(device)
        
    def forward(self, trg, src, trg_mask, src_mask):
        # trg      : [batch_size, trg_len]
        # src      : [batch_size, src_len, emb_dim]
        # trg_mask : [batch_size, 1, trg_len, trg_len]
        # src_mask : [batch_size, 1, 1, src_len]
                
        batch_size, trg_len = trg.shape
        
        # see Encoder for comments
        pos = torch.arange(0, trg_len).unsqueeze(0).repeat(batch_size, 1).to(self.device)                  
        # pos: [batch_size, trg_len]
            
        trg = self.tok_embedding(trg) * self.scale + self.pos_embedding(pos)
        # trg: [batch_size, trg_len, emb_dim]
        
        trg = self.dropout(trg)
        
        # send the same input source to every decoding layer, with the
        # input target entering the first layer and its output target 
        # entering the second layer etc.
        for layer in self.layers:
            trg = layer(trg, src, trg_mask, src_mask)
            # trg: [batch_size, trg_len, emb_dim]
        
        output = self.linear(trg)
        # output: [batch_size, trg_len, vocab_size]
            
        return output

### Decoder Layer

The decoder layer has two multi-head attention layers, `self_attention` and `attention`. The former applies the attention algorithm to the target sequences, while the latter applies the algorithm between the target and source sequences.

In [12]:
class DecoderLayer(nn.Module):
    
    def __init__(self, 
                 emb_dim, 
                 n_heads, 
                 ff_dim, 
                 dropout, 
                 device):
        
        super().__init__()
        
        self.self_attention      = MultiHeadAttention(emb_dim, n_heads, dropout, device)
        self.self_attention_norm = nn.LayerNorm(emb_dim)
        
        self.attention           = MultiHeadAttention(emb_dim, n_heads, dropout, device)
        self.attention_norm      = nn.LayerNorm(emb_dim)
        
        self.feedforward         = Feedforward(emb_dim, ff_dim, dropout)
        self.feedforward_norm    = nn.LayerNorm(emb_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, trg, src, trg_mask, src_mask):
        # trg      : [batch size, trg len, emb dim]
        # src      : [batch size, src len, emb dim]
        # trg_mask : [batch size, 1, trg len, trg len]
        # src_mask : [batch size, 1, 1, src len]
        
        # compute attention over embedded target sequences.
        # distinguish between trg and trg_, since the former 
        # is needed later for residual connections.
        trg_ = self.self_attention(trg, trg, trg, trg_mask)
        # trg_: [batch_size, trg_len, emb_dim]
        
        trg_ = self.dropout(trg_)
        
        # residual connection and layer norm
        # ?? trg has not had the target mask applied, so the
        # residual connection must surely dilute the effect
        # of the masked tensor trg_ ??
        trg  = self.self_attention_norm(trg + trg_)
        # trg: [batch_size, trg_len, emb_dim]
            
        # target/source attention
        trg_ = self.attention(trg, src, src, src_mask)
        # trg_: [batch_size, trg_len, emb_dim]
        
        trg_ = self.dropout(trg_)
        
        # residual connection and layer norm
        trg  = self.attention_norm(trg + trg_)      
        # trg: [batch_size, trg_len, emb_dim]
        
        trg_ = self.feedforward(trg)
        # trg_: [batch_size, trg_len, emb_dim]
        
        # apply dropout to the feedforward output (not to the residual)
        trg_ = self.dropout(trg_)
        
        # residual and layer norm
        trg  = self.feedforward_norm(trg + trg_)
        # trg: [batch_size, trg_len, emb_dim]
        
        return trg

### Seq2Seq

The `Seq2Seq` model encapsulates the encoder and decoder and handles the creation of the source and target masks.

The source mask, as described above, masks out `<pad>` tokens: the mask is 1 where the token is *not* a `<pad>` token and 0 if it is. The mask is then unsqueezed so it can be correctly broadcast to tensors of shape **_[batch_size, n_heads, seq_len, seq_len]_**, which appear in the multi-head attention mechanism.
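
The broadcasting described above can be checked with a small sketch. The pad code `PAD = 0`, the token codes, and the tensor sizes are hypothetical values chosen for illustration.

```python
import torch

PAD = 0  # hypothetical pad code, for illustration only
src = torch.tensor([[5, 7, 2, PAD, PAD],
                    [4, 9, 8, 3,   2]])   # [batch_size=2, src_len=5]

# True where the token is not a pad token
src_mask = (src != PAD).unsqueeze(1).unsqueeze(2)
print(src_mask.shape)    # torch.Size([2, 1, 1, 5])

# attention scores have shape [batch_size, n_heads, query_len, key_len]
scores = torch.zeros(2, 8, 5, 5)

# the two singleton dimensions broadcast over heads and query positions
masked = scores.masked_fill(src_mask == 0, -1e10)
print(masked[0, 0, 0])   # the last two entries are -1e10
print(masked[1, 0, 0])   # unpadded sequence: nothing is masked
```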

The target mask also includes a mask for the `<pad>` tokens. Next, we create a *subsequent* mask, `trg_sub_mask`, using `torch.tril`. This creates a lower-triangular matrix in which the elements above the diagonal are zero and the elements on and below the diagonal are one. For example, for a target comprising 5 tokens the `trg_sub_mask` will look like this:

$$\begin{matrix}
1 & 0 & 0 & 0 & 0\\
1 & 1 & 0 & 0 & 0\\
1 & 1 & 1 & 0 & 0\\
1 & 1 & 1 & 1 & 0\\
1 & 1 & 1 & 1 & 1\\
\end{matrix}$$

When the mask is applied to a target sequence (a column vector), each token in the target sequence, which corresponds to a row, is associated with the target tokens with non-zero column entries in that row. For example, the first token of the target sequence has the mask **_[1, 0, 0, 0, 0]_**. Therefore, that token is associated with the first target token. The second token of the target sequence has the mask **_[1, 1, 0, 0, 0]_**; therefore, that token is associated with both the first and second target tokens. 

The pad mask causes the model to ignore `<pad>` tokens, which makes sense since they contain no useful information. The subsequent mask causes the model, for a given token in a target sequence, to ignore the tokens that come after it. This also makes sense since the goal is to have the model predict the next token given the preceding tokens. If every target token were visible from every position in the target sequence, the model could simply copy the next token and would never learn to predict it.

The target mask is the logical AND of the pad and subsequent masks.
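
A minimal sketch of this combination for one padded target sequence is shown below. The pad code `PAD = 0` and the token codes are hypothetical.

```python
import torch

PAD = 0  # hypothetical pad code, for illustration only
trg = torch.tensor([[1, 5, 6, PAD, PAD]])    # [batch_size=1, trg_len=5]
trg_len = trg.shape[1]

trg_pad_mask = (trg != PAD).unsqueeze(1).unsqueeze(2)
# trg_pad_mask: [1, 1, 1, 5]

trg_sub_mask = torch.tril(torch.ones(trg_len, trg_len)).bool()
# trg_sub_mask: [5, 5]

# logical AND, broadcast to [1, 1, 5, 5]
trg_mask = trg_pad_mask & trg_sub_mask

print(trg_mask[0, 0].int())
# tensor([[1, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0],
#         [1, 1, 1, 0, 0],
#         [1, 1, 1, 0, 0],
#         [1, 1, 1, 0, 0]], dtype=torch.int32)
```

The pad mask removes columns (keys) only. Rows that belong to padded positions are still computed, but their predictions do not contribute to the loss when the loss function is created with `ignore_index` set to the pad code, as is done later in this tutorial.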

In [14]:
class Seq2Seq(nn.Module):
    
    def __init__(self, 
                 encoder, 
                 decoder, 
                 src_pad, 
                 trg_pad, 
                 device):
        
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        self.src_pad = src_pad
        self.trg_pad = trg_pad
        self.device  = device
        
    def make_src_mask(self, src):
        # src: [batch_size, src_len]
        
        src_mask = (src != self.src_pad).unsqueeze(1).unsqueeze(2)
        # [batch_size, 1, 1, src_len]

        return src_mask
    
    def make_trg_mask(self, trg):
        # trg: [batch size, trg len]
        
        trg_len = trg.shape[1]
            
        trg_pad_mask = (trg != self.trg_pad).unsqueeze(1).unsqueeze(2)
        # trg_pad_mask: [batch_size, 1, 1, trg_len]
        
        trg_sub_mask = torch.tril(torch.ones((trg_len, trg_len), 
                                             device=self.device)).bool()
        # trg_sub_mask: [trg_len, trg_len]
            
        # logical AND of the two masks
        trg_mask = trg_pad_mask & trg_sub_mask
        # trg_mask: [batch_size, 1, trg_len, trg_len]
        
        return trg_mask

    def forward(self, src, trg):
        # src: [batch_size, src_len]
        # trg: [batch_size, trg_len]
                
        src_mask = self.make_src_mask(src)
        # src_mask: [batch_size, 1, 1, src_len]
        
        trg_mask = self.make_trg_mask(trg)
        # trg_mask: [batch_size, 1, trg_len, trg_len]
        
        src      = self.encoder(src, src_mask)
        # src: [batch_size, src_len, emb_dim]
                
        # the decoder will encode the target sequences 
        # before applying the attention layers.
        output   = self.decoder(trg, src, trg_mask, src_mask)
        # output: [batch_size, trg_len, trg_vocab_size]
        
        return output

## Training the Seq2Seq Model

We can now define our encoder and decoder. This model is significantly smaller than the transformers used in research today, but it can be run on a single NVIDIA V100 GPU in an hour or so.

### Training Loop

We want the model to predict the `<eos>` token, thereby terminating its predicted sequence. Therefore, we slice off the `<eos>` token from the end of the target:

$$\begin{align*}
\text{trg} &= [sos, x_1, x_2, x_3, eos]\\
\text{trg[:-1]} &= [sos, x_1, x_2, x_3],
\end{align*}$$

where the $x_i$ denote target sequence tokens other than `<sos>` and `<eos>`. The sliced targets are fed into the model to get a predicted sequence. If all goes well, the model should predict the `<eos>` token, thereby terminating the predicted sequence:

$$\begin{align*}
\text{output} &= [y_1, y_2, y_3, eos],
\end{align*}$$

where the $y_i$ are the predicted target sequence tokens. The loss is computed using the original `trg` tensor with the `<sos>` token sliced off:

$$\begin{align*}
\text{output} &= [y_1, y_2, y_3, eos]\\
\text{trg[1:]} &= [x_1, x_2, x_3, eos]
\end{align*}$$
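
The slicing can be sketched on a small padded batch. The token codes `PAD`, `SOS`, `EOS`, the vocabulary size, and the random logits that stand in for the model output are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

PAD, SOS, EOS = 0, 1, 2      # hypothetical token codes
vocab_size = 6

trg = torch.tensor([[SOS, 3, 4, 5,   EOS],
                    [SOS, 3, 4, EOS, PAD]])   # [batch_size=2, trg_len=5]

trg_in  = trg[:, :-1]    # decoder input: last column sliced off
trg_out = trg[:, 1:]     # tokens to be predicted: first column sliced off

# stand-in for model(src, trg_in): random logits of the right shape
torch.manual_seed(0)
output = torch.randn(trg_in.shape[0], trg_in.shape[1], vocab_size)
# output: [batch_size, trg_len - 1, vocab_size]

# positions whose target is PAD are excluded from the average
criterion = nn.CrossEntropyLoss(ignore_index=PAD)
loss = criterion(output.reshape(-1, vocab_size), trg_out.reshape(-1))

print(trg_in.tolist())    # [[1, 3, 4, 5], [1, 3, 4, 2]]
print(trg_out.tolist())   # [[3, 4, 5, 2], [3, 4, 2, 0]]
print(loss.item())
```

For the shorter, padded sequence the column that is sliced off is a `<pad>` rather than the `<eos>` token, so `<eos>` remains in the decoder input. The corresponding target is `<pad>`, which the loss ignores.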

In [15]:
def train(model, optimizer, loss_fn, dataloader,
          niterations, dictfile, 
          batch_size, pad_code,
          traces, 
          lossfile=pathname('losses.txt'),
          valid_size=256,
          step=100):
    
    train_data, valid_data, _ = dataloader.data_splits()
    dataloader.set_batch_size(batch_size)
    
    xx, yy_t, yy_v = traces
    
    v_min = 1.e20 # minimum validation loss
    
    def compute_loss(x, t):
        # x: [batch_size, src_seq_len]
        # t: [batch_size, trg_seq_len]
       
        # slice off EOS token from all targets
        y = model(x, t[:,:-1])
        # [batch_size, trg_seq_len - 1, trg_vocab_size]
        
        trg_vocab_size = y.shape[-1]
        
        y_out = y.reshape(-1, trg_vocab_size)
        # [batch_size * (trg_seq_len - 1), trg_vocab_size]
        
        # slice off SOS token from targets
        t_out = t[:, 1:].reshape(-1)
        # [batch_size * (trg_seq_len - 1)]
        
        loss  = loss_fn(y_out, t_out).mean()

        return loss
  
    def validate(ii):
        
        model.eval()
        
        with torch.no_grad():  # no need to compute gradients during validation
                
            x, t   = dataloader.get_batch(train_data, ii, batch_size=valid_size)
            t_loss = compute_loss(x, t).item()
                
            x, t   = dataloader.get_batch(valid_data, ii, batch_size=valid_size)
            v_loss = compute_loss(x, t).item()
            if len(xx) < 1:
                xx.append(0)
            else:
                xx.append(xx[-1]+step)
            yy_t.append(t_loss)
            yy_v.append(v_loss)
            
        return t_loss, v_loss
    
    timeleft = TimeLeft(niterations)
    
    for ii in range(niterations):
        
        model.train()
        
        src, tgt = dataloader.get_batch(train_data, ii)
        
        loss     = compute_loss(src, tgt)

        optimizer.zero_grad()     # zero gradients
        
        loss.backward()           # compute gradients

        torch.nn.utils.clip_grad_norm_(model.parameters(), 1)

        optimizer.step()          # make a single step in average loss
           
        if (ii % step == 0) or (ii >= niterations-1):
        
            t_loss, v_loss = validate(ii)
            
            line = f'{t_loss:12.6f}|{v_loss:12.6f}|{np.exp(v_loss):12.6f}'

            # append to the loss file and make sure it is closed afterwards
            with open(lossfile, 'a') as f:
                f.write(f'{ii:8d} {t_loss:12.6f} {v_loss:12.6f}\n')
            
            if v_loss < v_min:
                v_min = v_loss
                # save best model so far
                torch.save(model.state_dict(), dictfile) 

            timeleft(ii, line)

    print()

    torch.save(model.state_dict(), 'model.pt')

    return xx, yy_t, yy_v

def train_by_epoch(model, iterator, optimizer, criterion, clip):
    
    model.train()
    
    epoch_loss = 0
    
    for i, batch in enumerate(iterator):
        
        print(f'\r\tbatch: {i:10d}', end='')
        
        src = batch.src
        # [batch_size, src_len]
        
        trg = batch.trg
        # [batch_size, trg_len]
        
        optimizer.zero_grad()
        
        # slice off EOS token from targets
        output = model(src, trg[:,:-1])
        # [batch_size, trg_len - 1, output_dim]
            
        output_dim = output.shape[-1]
            
        output = output.contiguous().view(-1, output_dim)
        # [batch_size * (trg_len - 1), output_dim]
        
        # slice off SOS token
        trg = trg[:,1:].contiguous().view(-1)    
        # [batch size * (trg len - 1)]
            
        loss = criterion(output, trg)
        
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        optimizer.step()
        
        epoch_loss += loss.item()
        
    print() 
    return epoch_loss / len(iterator)

def evaluate_by_epoch(model, iterator, criterion):
    
    model.eval()
    
    epoch_loss = 0
    
    with torch.no_grad():
    
        for i, batch in enumerate(iterator):

            src = batch.src
            trg = batch.trg
            
            output = model(src, trg[:,:-1])
            # [batch size, trg len - 1, output dim]
            
            output_dim = output.shape[-1]
            
            output = output.contiguous().view(-1, output_dim)
            # [batch size * (trg len - 1), output dim]
            
            trg = trg[:,1:].contiguous().view(-1)
            # [batch size * (trg len - 1)]
            
            loss = criterion(output, trg)

            epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [16]:
def plot_average_loss(traces, ftsize=18, filename=pathname('fig_loss.png')):
    
    xx, yy_t, yy_v = traces
    
    # create an empty figure
    fig = plt.figure(figsize=(5, 5))
    fig.tight_layout()
    
    # add a subplot to it
    nrows, ncols, index = 1,1,1
    ax  = fig.add_subplot(nrows,ncols,index)

    ax.set_title("Average loss")
    
    ax.plot(xx, yy_t, 'b', lw=2, label='Training')
    ax.plot(xx, yy_v, 'r', lw=2, label='Validation')

    ax.set_xlabel('Iterations', fontsize=ftsize)
    ax.set_ylabel('average loss', fontsize=ftsize)
    ax.set_xscale('log')
    ax.set_yscale('log')
    ax.grid(True, which="both", linestyle='-')
    ax.legend(loc='upper right')
    
    plt.savefig(filename)
    plt.show()

In [17]:
MAX_SRC_LEN= dloader.SRC_SEQ_LEN
INPUT_DIM  = dloader.SRC_VOCAB_SIZE

MAX_TRG_LEN= dloader.TGT_SEQ_LEN
OUTPUT_DIM = dloader.TGT_VOCAB_SIZE

EMB_DIM    = 200

ENC_LAYERS = 4
ENC_HEADS  = 8
ENC_FF_DIM = 1024
ENC_DROPOUT= 0.1

DEC_LAYERS = 4
DEC_HEADS  = 8
DEC_FF_DIM = 1024
DEC_DROPOUT= 0.1

enc = Encoder(INPUT_DIM, 
              EMB_DIM, 
              ENC_LAYERS, 
              ENC_HEADS, 
              ENC_FF_DIM, 
              ENC_DROPOUT, 
              DEVICE, 
              MAX_SRC_LEN)

dec = Decoder(OUTPUT_DIM, 
              EMB_DIM, 
              DEC_LAYERS, 
              DEC_HEADS, 
              DEC_FF_DIM, 
              DEC_DROPOUT, 
              DEVICE, 
              MAX_TRG_LEN)

PAD_CODE = dloader.PAD
SOS_CODE = dloader.SOS
EOS_CODE = dloader.EOS

model    = Seq2Seq(enc, dec, PAD_CODE, PAD_CODE, DEVICE).to(DEVICE)

def initialize_weights(m):
    if hasattr(m, 'weight') and m.weight.dim() > 1:
        nn.init.xavier_uniform_(m.weight.data)
        
model.apply(initialize_weights)
print(model)

print(f'The model has {number_of_parameters(model):,} trainable parameters')

Seq2Seq(
  (encoder): Encoder(
    (tok_embedding): Embedding(34, 200)
    (pos_embedding): Embedding(61, 200)
    (layers): ModuleList(
      (0-3): 4 x EncoderLayer(
        (self_attention): MultiHeadAttention(
          (linear_Q): Linear(in_features=200, out_features=200, bias=True)
          (linear_K): Linear(in_features=200, out_features=200, bias=True)
          (linear_V): Linear(in_features=200, out_features=200, bias=True)
          (linear_O): Linear(in_features=200, out_features=200, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (self_attention_norm): LayerNorm((200,), eps=1e-05, elementwise_affine=True)
        (feedforward): Feedforward(
          (linear_1): Linear(in_features=200, out_features=1024, bias=True)
          (linear_2): Linear(in_features=1024, out_features=200, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (feedforward_norm): LayerNorm((200,), eps=1e-05, elementwise_affine=True)
        (

In [18]:
import os

VERSION     = '06-11-23-v1'
DICTFILE    = pathname(f'seq2seq_series-{VERSION:s}.pt')
LOSSFILE    = pathname(f'seq2seq_losses-{VERSION:s}.txt')

# remove any existing loss file (portable alternative to 'rm -rf')
if os.path.exists(LOSSFILE):
    os.remove(LOSSFILE)

traces=([], [], [])

In [20]:
LOAD        = True
TRAIN       = False

if LOAD:
    # load best model
    model.load_state_dict(torch.load(DICTFILE, map_location=torch.device(DEVICE)))

if TRAIN:
    BATCH_SIZE    = 128
    LEARNING_RATE = 5e-4
    NITERATIONS   = 20000
    STEP          =   100

    optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

    criterion = nn.CrossEntropyLoss(ignore_index=PAD_CODE)
    
    traces    = train(model, optimizer, criterion, dloader,
                      niterations=NITERATIONS, 
                      dictfile=DICTFILE,
                      batch_size=BATCH_SIZE, 
                      pad_code=PAD_CODE, 
                      traces=traces, 
                      lossfile=LOSSFILE, 
                      step=STEP)
    
    plot_average_loss(traces)

## Testing Model

Our test data are tokenized, coded, and bracketed with the `<sos>` and `<eos>` codes. The translation steps are as follows:

- convert list of (coded) source tokens, `src`, to the tensor, `src_`, and add a batch dimension to it so that the source is of the correct shape, namely, `[batch size, src seq len]` with `batch size = 1`;
- create the source mask (though, strictly speaking, this is not needed here since, unlike the training and validation data, the test data are unpadded);
- feed the source and its mask into the encoder;
- create a list to hold the predicted tokens initialized with the `<sos>` token;
- while the maximum length has not been reached:
  - convert the current predicted output sequence into a tensor with a batch dimension;
  - create the target mask;
  - feed the current output, the encoder output, and both masks into the decoder;
  - take the most probable next token from the decoder output at the last position;
  - append the predicted token to the current output sequence;
  - break if the predicted token is `<eos>`;
- return the predicted sequence of token codes, which still includes the leading `<sos>` code; the codes are converted to tokens afterwards with `stringify`.

In [22]:
def translate(src, model, 
              max_len=256, 
              sos=SOS_CODE, 
              device=DEVICE, 
              threshold=0.50):
    
    from copy import copy
    
    def execute(trg, src_, src_mask):
        
        trg_ = torch.tensor(trg).unsqueeze(0).to(device)

        trg_mask = model.make_trg_mask(trg_)
        
        with torch.no_grad():
            output = model.decoder(trg_, src_, trg_mask, src_mask)
            # output: [batch_size, trg_seq_len, trg_vocab_size]
            # batch_size = 1 since src is a single sequence.
            # apply softmax over the vocabulary dimension of the
            # last target position
            probs  = torch.softmax(output[:,-1], dim=1)
            # probs: [batch_size, trg_vocab_size]
            
        # return the 2 token codes with the largest probabilities
        # for the last target position
        token_probs, token_codes = torch.topk(probs, k=2)
        token_probs = token_probs.t()
        token_codes = token_codes.t()

        token_code0 = token_codes[0,-1].item()
        token_code1 = token_codes[1,-1].item()
        
        token_prob0 = token_probs[0,-1].item()
        token_prob1 = token_probs[1,-1].item()

        return token_code0, token_code1, token_prob0, token_prob1
    
    model.eval()

    src_ = torch.LongTensor(src).unsqueeze(0).to(device)
    
    src_mask = model.make_src_mask(src_)
    
    with torch.no_grad():
        src_ = model.encoder(src_, src_mask)

    trg0 = [sos]
    
    # greedy decoding: at each step keep only the most probable token
    for i in range(max_len):
        
        code0, code1, prob0, prob1 = execute(trg0, src_, src_mask)

        code = code0
            
        trg0.append(code)
            
        if code == EOS_CODE:
            break
            
    return trg0

In [23]:
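class Translate():
    # NOTE: the original version of this class was unfinished; this is
    # one possible completion as a simple beam search.
    def __init__(self, model, max_len=256, sos=SOS_CODE, 
                 eos=EOS_CODE, device=DEVICE):
        self.model   = model
        self.max_len = max_len
        self.sos     = sos
        self.eos     = eos
        self.device  = device
        
    def exec_model(self, trg, src_, src_mask):
        # return the log-probabilities of the next token given trg
        trg_ = torch.tensor(trg).unsqueeze(0).to(self.device)

        trg_mask = self.model.make_trg_mask(trg_)

        with torch.no_grad():
            output = self.model.decoder(trg_, src_, trg_mask, src_mask)
            # output: [batch_size, trg_seq_len, trg_vocab_size]

        return torch.log_softmax(output[0, -1], dim=-1)
        
    def __call__(self, src, beam_size=2):
        self.model.eval()

        src_ = torch.LongTensor(src).unsqueeze(0).to(self.device)
    
        src_mask = self.model.make_src_mask(src_)
    
        with torch.no_grad():
            src_ = self.model.encoder(src_, src_mask)

        # each beam is a pair (token codes, summed log-probability)
        beams = [([self.sos], 0.0)]
        
        for i in range(self.max_len):
            candidates = []
            for trg, score in beams:
                if trg[-1] == self.eos:
                    # finished sequences are carried over unchanged
                    candidates.append((trg, score))
                    continue
                logp = self.exec_model(trg, src_, src_mask)
                values, codes = torch.topk(logp, beam_size)
                for v, c in zip(values.tolist(), codes.tolist()):
                    candidates.append((trg + [c], score + v))

            # keep the beam_size most probable sequences
            beams = sorted(candidates, key=lambda b: b[1], 
                           reverse=True)[:beam_size]

            if all(trg[-1] == self.eos for trg, _ in beams):
                break

        # return the most probable sequence (including the SOS code)
        return beams[0][0]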

In [25]:
srcs, tgts = dloader.test_data
MAX_LEN    = dloader.TGT_SEQ_LEN
PRINT_MISTAKES = False

# load best model
model.load_state_dict(torch.load(DICTFILE, map_location=torch.device(DEVICE)))

N = 400
M = 0
F = 0.0

for i, (src, tgt) in enumerate(zip(srcs[:N], tgts[:N])):   

    src_ = stringify(src[1:-1], dloader.src_code2token)
    
    tgt_ = stringify(tgt[1:-1], dloader.tgt_code2token)
    
    out  = translate(src, model, 
                     max_len=MAX_LEN, 
                     sos=SOS_CODE, 
                     device=DEVICE)
    
    out_ = stringify(out[1:-1], dloader.tgt_code2token)
    
    tgt_ = tgt_.replace('<pad>','')

    if out_ == tgt_:
        M += 1
    else:
        if PRINT_MISTAKES:
            print()
            print(tgt_)
            print()
            print(out_)
            print()
            print('-'*91)

    # update the running accuracy after every sequence, not only after a match
    F = M / (i+1)

    print(f'\r{i:8d}\taccuracy: {F:8.3f}', end='')

dF = math.sqrt(F*(1-F)/N)
print(f'\r{i:8d}\taccuracy: {F:8.3f} +/- {dF:.3f}')

     399	accuracy:    0.868 +/- 0.017
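
The quoted uncertainty is the binomial standard error, $\sqrt{F(1-F)/N}$, of the fraction $F$ of exactly matching sequences. The arithmetic can be reproduced as follows. The number of matches, `M = 347`, is an assumption inferred from the printed accuracy; it is not stated in the output above.

```python
import math

N = 400   # number of test sequences used above
M = 347   # ASSUMED number of exact matches, inferred from 0.868 * 400

F  = M / N                         # fraction of exact matches
dF = math.sqrt(F * (1 - F) / N)    # binomial standard error

print(f'accuracy: {F:.4f} +/- {dF:.3f}')
```

This estimate treats the test sequences as independent trials with a common success probability, which is the assumption behind the formula.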


In [26]:
def compute_loss_from_lists(x, t, model, avloss, device):
    
    model.eval()
    
    if isinstance(x, list):
        x = torch.tensor(x)
        t = torch.tensor(t)

    x = x.unsqueeze(0).to(device)
    t = t.unsqueeze(0).to(device)

    with torch.no_grad():  # gradients are not needed for evaluation
        # slice off EOS token from targets
        y = model(x, t[:,:-1])
        # [batch_size, trg_seq_len - 1, trg_vocab_size]

        trg_vocab_size = y.shape[-1]

        y_out = y.reshape(-1, trg_vocab_size)
        # [batch_size * (trg_seq_len - 1), trg_vocab_size]

        # slice off SOS token from targets
        t_out = t[:, 1:].reshape(-1)
        # [batch_size * (trg_seq_len - 1)]

        loss  = avloss(y_out, t_out).mean().item()

    return loss

In [27]:
criterion = nn.CrossEntropyLoss(ignore_index=PAD_CODE)
N = 400
aloss = 0.0
for i, (src, tgt) in enumerate(zip(srcs[:N], tgts[:N])):
    aloss += compute_loss_from_lists(src, tgt, model, criterion, DEVICE)
aloss /= N
print(f'<loss>: {aloss:10.4f}')

<loss>:     0.0169
