<a href="https://colab.research.google.com/github/januverma/transformers-stuff/blob/main/GPT_Language_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook, we will train a GPT model on text data. It is a tutorial on using a transformer-based architecture for the language modeling task. In a previous notebook, I implemented a [GPT transformer encoder](https://colab.research.google.com/drive/1mjOkkZ4C5Oxy7QmumnyT9HUZYDzIEJNs?usp=sharing) from scratch using PyTorch; check it out for a detailed description of the various components of the transformer architecture and the implementation choices. Here, we will use a benchmark dataset to build data processing and training pipelines. The goal is to build expertise in training GPT-like language models on any dataset.  

We will use the [`wikitext-2` data](https://blog.salesforceairesearch.com/the-wikitext-long-term-dependency-language-modeling-dataset/#download), which contains text extracted from the set of verified Good and Featured articles on Wikipedia. This tutorial closely follows the [official PyTorch language modeling tutorial](https://pytorch.org/tutorials/beginner/transformer_tutorial.html). Here, instead of loading the data through `torchdata`, we will work with the raw files. Also, this notebook uses my own implementation of the transformer rather than the PyTorch `TransformerEncoder`.  

Let's download the data. 

In [1]:
! wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip

--2023-01-02 09:02:23--  https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.214.56, 52.217.106.166, 52.217.38.142, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.214.56|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4475746 (4.3M) [application/zip]
Saving to: ‘wikitext-2-v1.zip’


2023-01-02 09:02:26 (2.74 MB/s) - ‘wikitext-2-v1.zip’ saved [4475746/4475746]



In [2]:
! unzip wikitext-2-v1.zip

Archive:  wikitext-2-v1.zip
   creating: wikitext-2/
  inflating: wikitext-2/wiki.test.tokens  
  inflating: wikitext-2/wiki.valid.tokens  
  inflating: wikitext-2/wiki.train.tokens  


In [3]:
! ls -l wikitext-2

total 12872
-rw-rw---- 1 root root  1256449 Aug 15  2016 wiki.test.tokens
-rw-rw---- 1 root root 10797148 Aug 15  2016 wiki.train.tokens
-rw-rw---- 1 root root  1121681 Aug 15  2016 wiki.valid.tokens


The data is already separated into training, validation and test sets. 

In [4]:
data_dir = 'wikitext-2/'

Let's take a quick look at the data. 

In [5]:
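# Print the first six lines of the training file.
# The `with` block closes the file automatically, so no explicit close() is needed.
with open(data_dir + '/wiki.train.tokens') as f:
  for i, line in enumerate(f):
    if i > 5:
      break
    print('line : {}'.format(i))
    print(line.rstrip())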

line : 0

line : 1
 = Valkyria Chronicles III =
line : 2

line : 3
 Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . <unk> the same fusion of tactical and real @-@ time gameplay as its predecessors , the story runs parallel to the first game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " <unk> Raven " .
line : 4
 The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard features of the series , it also underwent multiple adjustments , such as making the ga

The data has a specific format, which includes empty lines and lines containing titles of the articles. One of the goals of this notebook is to build data processing pipelines. First, the data is filtered to remove empty sentences and title sentences. This is a choice to simplify the workflow, not a statement on the usefulness of the titles.

In [6]:
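def build_dataset(data_path):
  """ Filters the data by removing empty lines and title lines. """
  text_data = []
  # Open the file in a `with` block so that it is closed after reading.
  with open(data_path) as f:
    for line in f:
      line = line.rstrip()
      # Title lines look like ' = Title =', so they start with ' ='.
      if len(line) > 0 and not line.startswith(' ='):
        text_data.append(line)
  return text_data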

In [7]:
train_path = data_dir + '/wiki.train.tokens'
valid_path = data_dir + '/wiki.valid.tokens'
test_path =  data_dir + '/wiki.test.tokens'

train_iter = build_dataset(train_path)
valid_iter = build_dataset(valid_path)
test_iter = build_dataset(test_path)

print('\n Samples in train data: {} \n Samples in valid data: {} \n Samples in test data: {}'.format(len(train_iter), len(valid_iter), len(test_iter)))


 Samples in train data: 17556 
 Samples in valid data: 1841 
 Samples in test data: 2183


In [8]:
train_iter[0]

' Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . <unk> the same fusion of tactical and real @-@ time gameplay as its predecessors , the story runs parallel to the first game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " <unk> Raven " .'

Let's start by importing the relevant libraries.

In [9]:
import math
import numpy as np

import torch
from torch import nn, Tensor
import torch.nn.functional as F
from torch.utils.data import dataset

In [10]:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

In [11]:
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence

We will use the `basic_english` tokenizer from `torchtext` to break the sentences into tokens. 

In [12]:
tokenizer = get_tokenizer('basic_english')

`tokenizer` ingests a sentence or free-flowing text and returns a list of lowercased tokens. 

In [13]:
tokenizer('This is just an example')

['this', 'is', 'just', 'an', 'example']

In [14]:
def yield_tokens(data_iter):
  """ iterator for tokenizer """
  for text in data_iter:
      yield tokenizer(text)

In [15]:
train_iter[0]

' Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . <unk> the same fusion of tactical and real @-@ time gameplay as its predecessors , the story runs parallel to the first game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " <unk> Raven " .'

In [16]:
for x in yield_tokens(train_iter):
  print(x)
  break

['senjō', 'no', 'valkyria', '3', '<unk>', 'chronicles', '(', 'japanese', '戦場のヴァルキュリア3', ',', 'lit', '.', 'valkyria', 'of', 'the', 'battlefield', '3', ')', ',', 'commonly', 'referred', 'to', 'as', 'valkyria', 'chronicles', 'iii', 'outside', 'japan', ',', 'is', 'a', 'tactical', 'role', '@-@', 'playing', 'video', 'game', 'developed', 'by', 'sega', 'and', 'media', '.', 'vision', 'for', 'the', 'playstation', 'portable', '.', 'released', 'in', 'january', '2011', 'in', 'japan', ',', 'it', 'is', 'the', 'third', 'game', 'in', 'the', 'valkyria', 'series', '.', '<unk>', 'the', 'same', 'fusion', 'of', 'tactical', 'and', 'real', '@-@', 'time', 'gameplay', 'as', 'its', 'predecessors', ',', 'the', 'story', 'runs', 'parallel', 'to', 'the', 'first', 'game', 'and', 'follows', 'the', 'nameless', ',', 'a', 'penal', 'military', 'unit', 'serving', 'the', 'nation', 'of', 'gallia', 'during', 'the', 'second', 'europan', 'war', 'who', 'perform', 'secret', 'black', 'operations', 'and', 'are', 'pitted', 'against'

`torchtext` provides a routine to build a vocabulary from the tokens of a text corpus. The vocabulary maps tokens (or sequences of tokens) to integers (or sequences of integers). 

In [17]:
vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>"])

In [18]:
len(vocab)

28781

Our text corpus has 28781 unique tokens. 

In [19]:
vocab['<unk>']

0

In [20]:
vocab(['here', 'is', 'an', 'example'])

[1274, 22, 29, 610]

In [21]:
for x in yield_tokens(train_iter):
  print(vocab(x))
  break

[19940, 82, 3864, 86, 0, 3880, 21, 772, 28711, 2, 6121, 3, 3864, 4, 1, 4969, 86, 20, 2, 1809, 1006, 7, 13, 3864, 3880, 897, 623, 959, 2, 22, 8, 5729, 300, 11, 580, 244, 66, 445, 18, 13644, 5, 872, 3, 2466, 16, 1, 1755, 5709, 3, 153, 6, 248, 354, 6, 959, 2, 23, 22, 1, 234, 66, 6, 1, 3864, 92, 3, 0, 1, 154, 4447, 4, 5729, 5, 719, 11, 57, 2557, 13, 41, 7026, 2, 1, 331, 1074, 3178, 7, 1, 37, 66, 5, 1667, 1, 11144, 2, 8, 19640, 311, 1065, 2059, 1, 1693, 4, 18950, 54, 1, 99, 25293, 112, 51, 1913, 1653, 285, 605, 5, 33, 13539, 117, 1, 2283, 1065, 0, 14659, 3]
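
To make the token-to-id mapping concrete, here is a toy, pure-Python sketch of the idea. It is not how `torchtext` implements its vocabulary; the class name `ToyVocab` and its fallback to the `<unk>` id for unseen tokens are assumptions made for illustration. In `torchtext`, a comparable fallback is configured with `vocab.set_default_index(vocab['<unk>'])`; without a default index, looking up an unseen token raises an error.

```python
from collections import Counter

class ToyVocab:
    """Toy token-to-id mapping: specials first, then tokens by descending frequency."""

    def __init__(self, token_lists, specials=("<unk>",)):
        counts = Counter(tok for toks in token_lists for tok in toks)
        for special in specials:
            counts.pop(special, None)
        # Most frequent tokens get the smallest ids; ties are broken alphabetically.
        ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
        self.itos = list(specials) + ordered
        self.stoi = {tok: idx for idx, tok in enumerate(self.itos)}
        # Assumption for this sketch: unseen tokens map to the id of '<unk>'.
        self.default_index = self.stoi["<unk>"]

    def __call__(self, tokens):
        return [self.stoi.get(tok, self.default_index) for tok in tokens]

    def __len__(self):
        return len(self.itos)

toy_vocab = ToyVocab([["the", "game", "is", "the", "third"], ["the", "game"]])
print(toy_vocab(["the", "game", "zebra"]))  # 'zebra' was never seen, so it maps to id 0
```

As in the real vocabulary above, frequent tokens such as `the` end up with small ids, and `<unk>` has id `0`.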


## Data processing pipeline 
`sentence --> tokens --> integer ids`

- First, the raw data is tokenized, and then each token is mapped to its corresponding integer id. 
- The integer sequences of the tokens are converted to a single flat tensor. 
- Appropriate integer sequences are then created from the flat tensor to be used for training the transformer model.

In [22]:
def data_process(text_iter):
    """Converts raw text into a flat Tensor."""
    data = [torch.tensor(vocab(item), dtype=torch.long) for item in yield_tokens(text_iter)]
    return torch.cat(tuple(filter(lambda t: t.numel() > 0, data)))

In [23]:
train_data = data_process(train_iter)

In [24]:
train_data.shape

torch.Size([2005189])

In [25]:
train_data[0]

tensor(19940)

In [26]:
valid_data = data_process(valid_iter)
valid_data.shape

torch.Size([209859])

In [27]:
test_data = data_process(test_iter)
test_data.shape

torch.Size([236503])

## Device : GPU

In [28]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

## Batches 
- Divide each dataset into `batch_size` parallel sequences of equal length, dropping the leftover tokens that do not fit. 

In [29]:
def batchify(data, batch_size):
    """Divides the data into batch_size separate sequences, removing extra elements
    that wouldn't cleanly fit.

    Args:
        data: Tensor, shape [N]
        batch_size: int, batch size

    Returns:
        Tensor of shape [batch_size, N // batch_size]
    """
    seq_len = data.size(0) // batch_size
    data = data[:seq_len * batch_size]
    data = data.view(batch_size, seq_len).contiguous()
    return data.to(device)

In [30]:
batch_size = 16
train_data = batchify(train_data, batch_size)
train_data.shape

torch.Size([16, 125324])
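
To see what `batchify` does to the token stream, here is a small pure-Python sketch that applies the same trimming and reshaping logic to a list of letters. The helper `batchify_list` is a hypothetical stand-in for illustration, not part of the notebook's pipeline.

```python
def batchify_list(tokens, batch_size):
    """List-based analogue of batchify: trim the tail, then split into batch_size rows."""
    seq_len = len(tokens) // batch_size
    trimmed = tokens[:seq_len * batch_size]
    return [trimmed[row * seq_len:(row + 1) * seq_len] for row in range(batch_size)]

letters = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
rows = batchify_list(letters, 4)
for row in rows:
    print("".join(row))  # ABCDEF, GHIJKL, MNOPQR, STUVWX ('Y' and 'Z' are dropped)

# The same arithmetic for the training set: row length and number of dropped tokens.
n_tokens, n_rows = 2005189, 16
print(n_tokens // n_rows, n_tokens % n_rows)  # 125324 5
```

Each row is a contiguous slice of the original stream, so the model treats the 16 rows as independent sequences; only 5 of the roughly 2 million training tokens are discarded.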

In [31]:
valid_data = batchify(valid_data, batch_size)
valid_data.shape

torch.Size([16, 13116])

In [32]:
test_data = batchify(test_data, batch_size)
test_data.shape

torch.Size([16, 14781])

## Get batches of integer sequences for training. 
- Further break the data into sequences of integers of length `<= max_seq_len`. 

In [33]:
max_seq_len = 30

In [34]:
def get_batch(sequence, i):
    """
    Args:
        sequence: Tensor, shape [batch_size, full_seq_len]
        i: int, start position of the chunk along the sequence dimension

    Returns:
        tuple (data, target), where data has shape [batch_size, seq_len] and
        target has shape [batch_size * seq_len ]
    """
    seq_len = min(max_seq_len, sequence.size(1) - 1 - i)
    data = sequence[:, i:i+seq_len]
    target = sequence[:, i+1:i+1+seq_len].reshape(-1)
    return data, target

Let's do a quick test of the working of the above function. 

In [35]:
for batch, i in enumerate(range(0, train_data.size(1) - 1, max_seq_len)):
  source, targets = get_batch(train_data, i)
  print(source.shape)
  print(targets.shape)
  break

torch.Size([16, 30])
torch.Size([480])


In [36]:
source[0]

tensor([19940,    82,  3864,    86,     0,  3880,    21,   772, 28711,     2,
         6121,     3,  3864,     4,     1,  4969,    86,    20,     2,  1809,
         1006,     7,    13,  3864,  3880,   897,   623,   959,     2,    22],
       device='cuda:0')

In [37]:
targets[:35]

tensor([   82,  3864,    86,     0,  3880,    21,   772, 28711,     2,  6121,
            3,  3864,     4,     1,  4969,    86,    20,     2,  1809,  1006,
            7,    13,  3864,  3880,   897,   623,   959,     2,    22,     8,
           65,     0,   202,    23,    13], device='cuda:0')
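
The relationship between inputs and targets can be sketched with plain lists: the target is the input shifted by one position, flattened row by row, and the final chunk is shorter when the sequence length is not a multiple of `max_seq_len`. The helper `get_batch_list` below is a hypothetical list-based analogue of `get_batch`, written only for illustration.

```python
def get_batch_list(rows, i, max_seq_len):
    """List-based analogue of get_batch: inputs plus next-token targets (flattened)."""
    full_len = len(rows[0])
    seq_len = min(max_seq_len, full_len - 1 - i)
    data = [row[i:i + seq_len] for row in rows]
    # Flatten row by row, mirroring reshape(-1) on a [batch_size, seq_len] tensor.
    target = [tok for row in rows for tok in row[i + 1:i + 1 + seq_len]]
    return data, target

toy_rows = [[10, 11, 12, 13, 14, 15],
            [20, 21, 22, 23, 24, 25]]

for i in range(0, len(toy_rows[0]) - 1, 4):
    data, target = get_batch_list(toy_rows, i, max_seq_len=4)
    print(i, data, target)
# 0 [[10, 11, 12, 13], [20, 21, 22, 23]] [11, 12, 13, 14, 21, 22, 23, 24]
# 4 [[14], [24]] [15, 25]
```

This is why `targets[:35]` above starts with the first row shifted by one token and then continues with the second row after 30 entries.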

## Model
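 We will now define our model. This part is taken verbatim from the previous [notebook](https://colab.research.google.com/drive/1mjOkkZ4C5Oxy7QmumnyT9HUZYDzIEJNs#scrollTo=XkVScG_mOdUd); check it out for more details. At a high level, a transformer layer is defined as a sequential model in which both the self-attention and the feed-forward sublayers are wrapped in a residual connection followed by layer normalization:

$$X \rightarrow X + \text{MultiHeadSelfAttention}(X)$$
$$X \rightarrow \text{LayerNorm}(X)$$
$$X \rightarrow X + \text{FeedForwardNetwork}(X)$$
$$X \rightarrow \text{LayerNorm}(X)$$ 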

In [38]:
class MultiHeadSelfAttention(nn.Module):
    '''
    Implements MHSA using the PyTorch MultiheadAttention Layer.
    '''
    def __init__(self, hidden_dim, num_heads, dropout):
        '''
        Arguments:
            hidden_dim: Dimension of the output of the self-attention.
            num_heads: Number of heads for the multi-head attention. 
            dropout: Dropout probability for the self-attention. If `0.0` then no dropout will be used.
            
        Returns:
            A tensor of shape `num_tokens x hidden_size` containing output of the MHSA for each token.
        '''
        super().__init__()
        if hidden_dim % num_heads != 0:
            print('The hidden size {} is not a multiple of the number of heads {}'.format(hidden_dim, num_heads))
        self.attention_layer = nn.MultiheadAttention(hidden_dim, num_heads, dropout=dropout, batch_first=True)
    def forward(self, x, key_padding_mask=None, attention_mask=None):
        '''
        Arguments:
            x: Tensor containing input token embeddings.
            key_padding_mask: Mask indicating which elements within the input sequence to be considered as padding and ignored for the computation of self-attention scores.  
            attention_mask: Mask indicating which relative positions are allowed to attend.  
        '''
        return self.attention_layer(query=x, key=x, value=x, key_padding_mask=key_padding_mask, attn_mask=attention_mask)


In [39]:
class FeedForward(nn.Module):
    '''
    Implements the feed-forward component of the transformer model.
    '''
    def __init__(self, input_dim, hidden_dim, dropout=0.0):
        '''
        Arguments:
            input_dim: Dimension of the token embedding, i.e. the output of the MHSA layer.
            hidden_dim: Dimension of the inner projection of this feed-forward layer.
            dropout: Dropout probability to use for the projected activations. If `0.0` then no dropout will be used.
        Returns:
            A tensor of shape `num_tokens x input_dim` containing projections for each token.
        '''
        super().__init__()
        self.activation = nn.ReLU()
        self.dropout = nn.Dropout(dropout)
        self.layer_1 = nn.Linear(input_dim, hidden_dim)
        self.layer_2 = nn.Linear(hidden_dim, input_dim)
    def forward(self, x):
        x = self.layer_1(x)
        x = self.activation(x)
        x = self.dropout(x)
        x = self.layer_2(x)
        return x

In [40]:
class TransformerLayerNorm(nn.Module):
    '''
    Implements LayerNorm for self-attention and feed-forward networks.

    Arguments:
        input_dim: Input dimension.
    
    Returns:
        A normalized tensor of the same dimension as the input. 
    '''
    def __init__(self, input_dim):
        super().__init__()
        self.layer_norm = nn.LayerNorm(input_dim)
    def forward(self, x):
        x = x.to(self.layer_norm.weight.dtype)
        return self.layer_norm(x) 

In [41]:
class TransformerLayer(nn.Module):
    '''
    A transformer layer, which is a sequential model consisting of self-attention, residual connection, layer norm, feed-forward projection, residual connection, layer norm. 
    
    Arguments:
        hidden_dim: Hidden dimension transformer layers.  
        num_heads: Number of attention heads. 
        attn_dropout: Dropout for MHSA layers. 
        ffn_dropout: Dropout for feed-forward layers.
    Returns:
        x: A tensor containing the output representation of each token. 
        attn_weights: A tensor of shape `num_tokens x num_tokens` containing the attention weights. 
    '''
    def __init__(self, hidden_dim, num_heads, attn_dropout=0.0, ffn_dropout=0.0):
        super().__init__()
        self.attn_layer = MultiHeadSelfAttention(hidden_dim, num_heads, dropout=attn_dropout)
        self.ffn_layer = FeedForward(hidden_dim, hidden_dim, dropout=ffn_dropout)
        # Note: this single LayerNorm (and its parameters) is applied after both sublayers.
        self.layer_norm = TransformerLayerNorm(hidden_dim)
    def forward(self, x, key_padding_mask=None, attention_mask=None):
        attn_out, attn_weights = self.attn_layer(x, key_padding_mask, attention_mask)
        x = self.layer_norm(x + attn_out)
        ffn_out = self.ffn_layer(x)
        x = self.layer_norm(x + ffn_out)
        return x, attn_weights

We also need positional encoding to make the network sequence aware. Here sinusoidal PEs are used.

In [42]:
class PositionalEncoding(nn.Module):
    '''
    Implements the sinusoidal positional encoding for the input tokens. 

    Arguments:
        embed_dim: Dimension of the positional encoding, should be the same as input token embedding. 
        dropout: Dropout probability to be used for positional encoding. 
        max_len: Maximum length of the input token sequences. 
    Returns:
      A tensor containing positional embeddings for each token.
    '''
    def __init__(self, embed_dim, dropout=0.1, max_len=5000):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, embed_dim, 2) * (-math.log(10000.0)/embed_dim))
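        # Shape [1, max_len, embed_dim] so that it broadcasts over batch-first inputs.
        pe = torch.zeros(1, max_len, embed_dim)
        pe[0, :, 0::2] = torch.sin(position * div_term)
        pe[0, :, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe)
    def forward(self, x):
        # x has shape [batch_size, seq_len, embed_dim]; index the encoding by
        # sequence position (dim 1), not by batch index (dim 0).
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)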
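
For reference, the sinusoidal encoding assigns to position `pos` the value `sin(pos / 10000^(2i/d))` in even dimension `2i` and `cos(pos / 10000^(2i/d))` in odd dimension `2i + 1`, where `d` is the embedding dimension. A minimal pure-Python sketch for a single position, using the same `exp`/`log` form of the frequency term as the class above (the function name `sinusoidal_encoding` is an illustrative assumption, and `embed_dim` is assumed to be even):

```python
import math

def sinusoidal_encoding(position, embed_dim):
    """Sinusoidal encoding vector for one position (embed_dim is assumed to be even)."""
    vec = []
    for i in range(0, embed_dim, 2):
        # Equivalent to position / 10000 ** (i / embed_dim).
        angle = position * math.exp(i * (-math.log(10000.0) / embed_dim))
        vec.append(math.sin(angle))  # even dimension i
        vec.append(math.cos(angle))  # odd dimension i + 1
    return vec

print(sinusoidal_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
print(sinusoidal_encoding(1, 4))
```

Every position gets a distinct vector, and low dimensions oscillate quickly while high dimensions change slowly with the position.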

In [43]:
class TransformerEncoder(nn.Module):
    '''
    Transformer Encoder which is composed of a stack of TransformerLayers. 

    Arguments:
        num_layers: Number of Transformer layers in the encoder. 
        hidden_dim: Hidden dimension of the transformer layers.  
        num_heads: Number of heads. 
        attn_dropout: Dropout for MHSA layers. 
        ffn_dropout: Dropout for feed-forward layers.
    '''
    def __init__(self, num_layers, hidden_dim, num_heads, attn_dropout=0.0, ffn_dropout=0.0):
        super().__init__()
        self.layers = nn.ModuleList([TransformerLayer(hidden_dim, num_heads, attn_dropout, ffn_dropout) for _ in range(num_layers)])
        self.attn_weights = []
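    def forward(self, x, key_padding_mask=None, attention_mask=None):
        # Keep only the weights of the most recent forward pass, detached so the
        # list does not hold on to the autograd graph of every batch seen so far.
        self.attn_weights = []
        for layer in self.layers:
            x, weights = layer(x, key_padding_mask, attention_mask)
            self.attn_weights.append(weights.detach())
        return x
    def get_attention_weights(self):
        if len(self.attn_weights) != 0:
            return self.attn_weights
        else:
            print("The model hasn't been run yet")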

In [44]:
class MyGPT(nn.Module):
    '''
    GPT-like language model.

    Arguments:
        vocab_size: Size of vocabulary.
        embed_dim: Dimension of the input token embedding, also used as the hidden dimension of the transformer layers. 
        num_layers: Number of Transformer layers in the encoder. 
        num_heads: Number of heads. 
        attn_dropout: Dropout for MHSA layers. 
        ffn_dropout: Dropout for feed-forward layers.
    '''
    def __init__(self, vocab_size, embed_dim, num_layers, num_heads, attn_dropout, ffn_dropout):
        super(MyGPT, self).__init__()
        self.embedding_layer = nn.Embedding(vocab_size, embed_dim)
        self.pos_encoder = PositionalEncoding(embed_dim)
        self.transformer_encoder = TransformerEncoder(num_layers, embed_dim, num_heads, attn_dropout, ffn_dropout)
        self.decoder = nn.Linear(embed_dim, vocab_size)
        self.embed_layer_norm = nn.LayerNorm(embed_dim)
        self.init_weights()

    def init_weights(self):
        init_range = 0.5
        self.decoder.weight.data.uniform_(-init_range, init_range)
        self.decoder.bias.data.zero_()
        
    
    def forward(self, seqs, attn_mask=None, key_padding_mask=None):
        embedded_seq = self.embedding_layer(seqs)
        embedded_seq = self.pos_encoder(embedded_seq)
        embedded_seq = self.embed_layer_norm(embedded_seq)
        out = self.transformer_encoder(x=embedded_seq, key_padding_mask=key_padding_mask, attention_mask=attn_mask)
        results = self.decoder(out)
        return results
      
    def get_attn_weights(self):
      return self.transformer_encoder.get_attention_weights()

## Attention Mask
For an autoregressive language model, the self-attention layers may only attend to earlier positions in the sequence; looking into the future is not allowed. We therefore need an attention mask that prevents peeking into the future. It is a boolean upper-triangular matrix whose entries above the diagonal are `True`, marking the positions that must not be attended to. 

In [45]:
def generate_square_subsequent_mask(sz):
    """
    Generates a boolean upper-triangular matrix: True above the diagonal
    (future positions, blocked), False on and below it. Shape [sz, sz].
    """
    return torch.triu(torch.ones(sz, sz, dtype=torch.bool), diagonal=1)

generate_square_subsequent_mask(max_seq_len).shape

torch.Size([30, 30])
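
To visualize the mask, here is a small pure-Python sketch that builds the same pattern for a sequence of length 4. The function name `causal_mask` is an illustrative assumption; the convention that `True` means "blocked" matches how `nn.MultiheadAttention` interprets a boolean `attn_mask`.

```python
def causal_mask(size):
    """True marks a (query, key) pair to block: the key lies after the query."""
    return [[key > query for key in range(size)] for query in range(size)]

for row in causal_mask(4):
    # 'x' = blocked (future position), '.' = allowed (current or earlier position)
    print(" ".join("x" if blocked else "." for blocked in row))
# . x x x
# . . x x
# . . . x
# . . . .
```

Row `q` describes what the token at position `q` may look at: itself and everything before it.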

## Training Pipeline

In [46]:
import time

In [47]:
def train(data, max_seq_len, model, criterion, optimizer, device):
    """
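    Trains the model for one epoch and returns the mean loss and accuracy.
    Relies on the globals `ntokens` and `scheduler` defined in the Training section.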
    """
    losses = []
    accuracies = []
    running_losses = []
    model.train()
    num_batches = data.size(1) // max_seq_len
    start_time = time.time()
    attn_mask = generate_square_subsequent_mask(max_seq_len)

    for i, batch in enumerate(range(0, data.size(1) - 1, max_seq_len)):
        seqs, out = get_batch(data, batch)
        seq_len = seqs.size(1)
        if seq_len != max_seq_len:
          attn_mask = attn_mask[:seq_len, :seq_len]
        seqs = seqs.to(device)
        attn_mask = attn_mask.to(device)
        out = out.to(device)

        logits = model(seqs, attn_mask)
        J = criterion(logits.view(-1, ntokens), out)
        losses.append(J.item())
        running_losses.append(J.item())
        optimizer.zero_grad()
        J.backward()
        nn.utils.clip_grad_norm_(model.parameters(), 0.5)
        optimizer.step()
        accuracies.append(out.eq(logits.view(-1, ntokens).detach().argmax(dim=1)).float().mean())

        if i%500 == 0:
          lr = scheduler.get_last_lr()[0]
          ms_per_batch = (time.time() - start_time) * 1000 / 500
          print('|{:5d}/{:5d} batches done | time elapsed {:8.3f} | lr {} | loss {:8.3f}'.format(i, num_batches, ms_per_batch, lr, torch.tensor(running_losses).mean()))
          running_losses = []
          start_time = time.time()
    
    epoch_acc = torch.tensor(accuracies).mean()
    epoch_loss = torch.tensor(losses).mean()
    return epoch_loss, epoch_acc

In [48]:
def evaluate(data, max_seq_len, model, criterion, device):
    """
    Evaluates the model on the given data and returns the mean loss.
    """
    losses = []
    accuracies = []
    model.eval()
    attn_mask = generate_square_subsequent_mask(max_seq_len)
    with torch.no_grad():
      for i, batch in enumerate(range(0, data.size(1) - 1, max_seq_len)):
          seqs, out = get_batch(data, batch)
          seq_len = seqs.size(1)
          if seq_len != max_seq_len:
            attn_mask = attn_mask[:seq_len, :seq_len]
          seqs = seqs.to(device)
          attn_mask = attn_mask.to(device)
          out = out.to(device)

          logits = model(seqs, attn_mask)
          J = criterion(logits.view(-1, ntokens), out)
          losses.append(J.item())

    epoch_loss = torch.tensor(losses).mean()
    return epoch_loss

## Training

In [49]:
ntokens = len(vocab)

In [50]:
model = MyGPT(vocab_size=ntokens, embed_dim=200, num_layers=2, num_heads=2, attn_dropout=0.2, ffn_dropout=0.2).to(device)

In [51]:
criterion = nn.CrossEntropyLoss()
lr = 5.0  # learning rate
optimizer = torch.optim.SGD(model.parameters(), lr=lr)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.95)

As in the official PyTorch tutorial, we are using `SGD` as the optimizer, whereas the original GPT and Transformer papers use `Adam`. 

Following standard practice, we use a learning-rate scheduler that decays the learning rate by `5%` after every epoch. There are, of course, more sophisticated choices of schedulers and of their hyperparameters. 
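
The effect of this schedule can be worked out by hand: with a step size of one epoch and `gamma=0.95`, the learning rate after `k` completed epochs is `5.0 * 0.95**k`. A minimal sketch of that arithmetic (the helper `step_lr` is a hypothetical illustration, not the PyTorch scheduler itself):

```python
def step_lr(initial_lr, gamma, epochs_completed, step_size=1):
    """Learning rate under step decay after a given number of completed epochs."""
    return initial_lr * gamma ** (epochs_completed // step_size)

for epoch in range(1, 4):
    # Learning rate in effect during each of the first three epochs.
    print(epoch, round(step_lr(5.0, 0.95, epoch - 1), 4))
```

The first epoch runs at `5.0` and the second at `4.75`, which is what the training log reports.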

There are other nuances involved with achieving a stable training of transformer-based models. We intend to explore those in separate notebooks. 

In [52]:
epochs = 10
max_seq_len = 30
for epoch in range(1, epochs + 1):
    epoch_start_time = time.time()
    train(train_data, max_seq_len, model, criterion, optimizer, device)
    val_loss = evaluate(valid_data, max_seq_len, model, criterion, device)
    val_ppl = math.exp(val_loss)
    elapsed = time.time() - epoch_start_time
    print('-' * 89)
    print(f'| end of epoch {epoch:3d} | time: {elapsed:5.2f}s | '
          f'valid loss {val_loss:5.2f} | valid ppl {val_ppl:8.2f}')
    print('-' * 89)
    scheduler.step()

|    0/ 4177 batches done | time elapsed    6.308 | lr 5.0 | loss   18.448
|  500/ 4177 batches done | time elapsed   10.619 | lr 5.0 | loss    8.017
| 1000/ 4177 batches done | time elapsed   10.538 | lr 5.0 | loss    6.926
| 1500/ 4177 batches done | time elapsed   10.685 | lr 5.0 | loss    6.744
| 2000/ 4177 batches done | time elapsed   10.783 | lr 5.0 | loss    6.612
| 2500/ 4177 batches done | time elapsed   10.711 | lr 5.0 | loss    6.537
| 3000/ 4177 batches done | time elapsed   10.785 | lr 5.0 | loss    6.484
| 3500/ 4177 batches done | time elapsed   10.866 | lr 5.0 | loss    6.446
| 4000/ 4177 batches done | time elapsed   10.924 | lr 5.0 | loss    6.392
-----------------------------------------------------------------------------------------
| end of epoch   1 | time: 49.78s | valid loss  6.25 | valid ppl   516.34
-----------------------------------------------------------------------------------------
|    0/ 4177 batches done | time elapsed    0.022 | lr 4.75 | loss    6

This concludes the training process. Our model is a simple one, with only 2 transformer layers and 2 heads. Compare that with the original GPT architecture, which has 12 transformer layers, each with 12 attention heads. We trained the model for only a few epochs; in practice, such models need to be trained on much larger datasets and for many more epochs to get a decent language model. This notebook is for illustration purposes only. 

On the test set, the model performance can be computed as follows. 

In [53]:
test_loss = evaluate(test_data, max_seq_len, model, criterion, device)

In [54]:
test_ppl = math.exp(test_loss)
print('=' * 89)
print(f'| End of training | test loss {test_loss:5.2f} | '
      f'test ppl {test_ppl:8.2f}')
print('=' * 89)

| End of training | test loss  5.96 | test ppl   388.11
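
Perplexity is the exponential of the mean cross-entropy loss (in nats). As a rough reference point, a model that assigns equal probability to every token in the vocabulary has a loss of `log(vocab_size)` and therefore a perplexity equal to the vocabulary size. A minimal sketch of this relationship, using the vocabulary size found earlier (the helper name `perplexity` is an illustrative assumption):

```python
import math

def perplexity(mean_cross_entropy):
    """Perplexity corresponding to a mean cross-entropy loss measured in nats."""
    return math.exp(mean_cross_entropy)

vocab_size = 28781
uniform_loss = math.log(vocab_size)  # loss of a uniform guess over the vocabulary
print(round(uniform_loss, 2), round(perplexity(uniform_loss)))  # 10.27 28781
```

Against this baseline of 28781, a test perplexity in the hundreds shows the model has learned a good deal about the token distribution, even though it is far from a strong language model.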


This is the second notebook in my transformer series, where we trained a GPT-like language model from scratch. In the next one, I'll discuss techniques for better and more efficient training.  