# Chapter 14: Train A Transformer to Generate Text



The GPT-2 transformer we created in Chapter 3 is a large language model with 1.5 billion parameters. Once we load the pretrained weights, the model generates text that reads much like human writing. However, training a model of that size requires supercomputing resources, which most people don't have. 

In this chapter, we'll build a much smaller transformer and train it from scratch on the text of *The Old Man and the Sea*, which we used in Chapter 12. You'll see that the trained model generates better text than the LSTM model that we used in Chapter 12. 

The main purpose of this chapter is to learn how to build a language model from scratch, train it on data, and use it to generate text. 

Start a new cell in ch14.ipynb and execute the following lines of code in it:

In [1]:
import os

os.makedirs("files/ch14", exist_ok=True)

# 1. Tokenization with Torchtext
We built our own vocabulary and word indexes from scratch in Chapter 12 based on a raw text file. That exercise showed the steps involved in word tokenization. 

In this chapter, we'll use the built-in tokenizer in the Torchtext library. We'll use the raw text file of The Old Man and the Sea by Ernest Hemingway. Go to https://archive.org/stream/TheOldManAndTheSea-Eng-Ernest/oldmansea_djvu.txt, download the raw text file, and save it as *oldmansea.txt* in the folder /Desktop/ai/files/ch14/ on your computer. Make sure you remove the top and bottom paragraphs that are not part of the original book. 

## 1.1. Clean Up the Text
Take a look at the raw text file you just saved. You'll notice that we need to remove some information from it. For example, the top of each section carries a section number such as -24-, followed by "The Old Man and the Sea Asiaing.com". The file also contains page numbers such as "[117]". We therefore load the raw text file and remove this information, as follows:

In [2]:
with open("files/ch14/oldmansea.txt","r") as f:
    text=f.read()
text=text.replace("The Old Man and the Sea", "")    
text=text.replace("Asiaing.com", "")   
text=text.replace("/'", '"') 
text=text.replace("\n", ' ') 
for x in "0123456789]-[":
    text=text.replace(f"{x}", "")
print(text[:1000])

  He was an old man who fished alone in a skiff in the Gulf Stream and he had gone  eighty four days now without taking a fish. In the first forty days a boy had been with him.  But after forty days without a fish the boy's parents had told him that the old man was  now definitely and finally salao, which is the worst form of unlucky, and the boy had gone  at their orders in another boat which caught three good fish the first week. It made the  boy sad to see the old man come in each day with his skiff empty and he always went  down to help him carry either the coiled lines or the gaff and harpoon and the sail that  was furled around the mast. The sail was patched with flour sacks and, furled, it looked  like the flag of permanent defeat.   The old man was thin and gaunt with deep wrinkles in the back of his neck. The  brown blotches of the benevolent skin cancer the sun brings from its  reflection on the  tropic sea were on his cheeks. The blotches ran well down the sides of his face 

The line break (\n) is treated as part of the text, so we have replaced line breaks with white spaces. We'll use the Torchtext tokenizer later, which automatically converts uppercase letters to lowercase and separates punctuation from words. 
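
To see the effect of these replacements, the following minimal sketch applies the same steps to a made-up snippet. The sample string and the helper name `clean` are illustrative assumptions, not part of the chapter's code. Note that removing every hyphen also joins hyphenated words, which is a side effect of this simple approach.

```python
def clean(text):
    # Same replacements as the cleaning cell above, wrapped in a hypothetical helper
    text = text.replace("The Old Man and the Sea", "")
    text = text.replace("Asiaing.com", "")
    text = text.replace("/'", '"')
    text = text.replace("\n", " ")
    for x in "0123456789]-[":
        text = text.replace(x, "")
    return text

# Made-up snippet imitating a section header and a page number
sample_text = ("-24-\nThe Old Man and the Sea Asiaing.com\n"
               "He caught a well-fed fish [117] today.")
cleaned = clean(sample_text)
print(cleaned.split())
```

The section number, the site name, and the page number disappear, and "well-fed" becomes "wellfed" because the hyphen is stripped along with the dashes around section numbers.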

We can now save the cleaned up text as follows:

In [3]:
with open("files/ch14/cleaned_text.txt","w") as f:
    f.write(text)

## 1.2. Torchtext Tokenizer
First, we load the cleaned text file that we just saved and split it into words on spaces. 

In [4]:
with open("files/ch14/cleaned_text.txt","r") as f:
    text=f.read()
words=text.split(" ") 
print(words[:100])

['', '', 'He', 'was', 'an', 'old', 'man', 'who', 'fished', 'alone', 'in', 'a', 'skiff', 'in', 'the', 'Gulf', 'Stream', 'and', 'he', 'had', 'gone', '', 'eighty', 'four', 'days', 'now', 'without', 'taking', 'a', 'fish.', 'In', 'the', 'first', 'forty', 'days', 'a', 'boy', 'had', 'been', 'with', 'him.', '', 'But', 'after', 'forty', 'days', 'without', 'a', 'fish', 'the', "boy's", 'parents', 'had', 'told', 'him', 'that', 'the', 'old', 'man', 'was', '', 'now', 'definitely', 'and', 'finally', 'salao,', 'which', 'is', 'the', 'worst', 'form', 'of', 'unlucky,', 'and', 'the', 'boy', 'had', 'gone', '', 'at', 'their', 'orders', 'in', 'another', 'boat', 'which', 'caught', 'three', 'good', 'fish', 'the', 'first', 'week.', 'It', 'made', 'the', '', 'boy', 'sad', 'to']


We then import the Torchtext tokenizer as follows:

In [5]:
from torchtext.data.utils import get_tokenizer
from collections import Counter
from torchtext.vocab import vocab

tokenizer = get_tokenizer('basic_english')

We create a vocabulary object, *vocabulary*, by passing the token counts to Torchtext's *vocab()* function:

In [6]:
counter = Counter()
for word in words:
    counter.update(tokenizer(word))
vocabulary = vocab(counter, min_freq=1,
   specials=('<unk>', '<BOS>', '<EOS>', '<PAD>'))

To use the vocabulary, we can convert an example sentence below to indexes, like so:

In [7]:
example="Today is a great day."
idx=[]
for w in example.split():
    idx += vocabulary(tokenizer(w))
print(idx)

[271, 47, 13, 522, 68, 28]


We can also use the *lookup_token()* method to convert indexes back to text:

In [8]:
txt=[vocabulary.lookup_token(i) for i in idx]
print(" ".join(txt))

today is a great day .


So the vocabulary object we built works properly. We'll use these methods later in the chapter to generate text by converting generated integer numbers back to text. 
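
To make the idea behind the vocabulary concrete, here is a minimal plain-Python sketch of a token-to-index mapping with special tokens placed first. It is a simplified stand-in, not Torchtext's implementation. The names `itos`, `stoi`, `encode`, and `decode` are assumptions made for illustration, and mapping unknown words to `<unk>` is a choice made in this sketch only.

```python
from collections import Counter

specials = ["<unk>", "<BOS>", "<EOS>", "<PAD>"]
tokens = "the old man and the sea .".split()

# Count tokens; Counter keeps first-seen order
counter = Counter(tokens)

# Index-to-string list: specials first, then each distinct token
itos = specials + [t for t in counter if t not in specials]
# String-to-index dictionary, the inverse of itos
stoi = {t: i for i, t in enumerate(itos)}

def encode(words):
    # Unknown words fall back to <unk> in this sketch
    return [stoi.get(w, stoi["<unk>"]) for w in words]

def decode(ids):
    return [itos[i] for i in ids]

ids = encode("the old sea".split())
print(ids, decode(ids))
```

Encoding followed by decoding returns the original tokens, which is the same round trip performed with *vocabulary()* and *lookup_token()* above.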

## 1.3. Create Batches for Training
We first create a PyTorch dataset based on the text file *cleaned_text.txt* and the vocabulary object we created in the last subsection:

In [9]:
import torch

data=[torch.tensor(vocabulary(tokenizer(w)),
         dtype=torch.long) for w in words]
data=torch.cat(tuple(filter(lambda t:t.numel()>0,data)))
print(data.shape)

torch.Size([29725])


The text file is converted into 29,725 indexes, one per token (punctuation marks count as tokens). We'll break the data into smaller sequences: each sequence will have 256 tokens in it and each batch contains 32 sequences. 

In [10]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
seq_len = 256

batches=[]
i=0
while True:
    x=data[i:i+seq_len]
    y=data[i+1:i+seq_len+1]
    batches.append((x,y))
    i+=1
    if i+seq_len+1>=len(data):
        break

In [11]:
from torch.utils.data import DataLoader

batch_size = 32
loader=DataLoader(batches,batch_size=batch_size,
                  shuffle=True)

We'll print out one batch and examine the output:

In [12]:
x,y=next(iter(loader))
print(x.shape,y.shape)

torch.Size([32, 256]) torch.Size([32, 256])


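The shapes confirm that each batch contains 32 pairs of sequences, each 256 tokens long. By construction, y is x shifted one position to the right in the text: the target at each position is the token that follows the corresponding input token. That's exactly what we intend to do. We'll use x as features and y as targets, so the model learns to predict the next token based on the tokens that precede it. 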
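
The relationship between x and y is easier to see on a tiny example. The sketch below wraps the same sliding-window loop in a hypothetical helper, `make_windows`, and runs it on ten stand-in indexes with a sequence length of 4. It then repeats the window-count arithmetic for the sizes used in this chapter.

```python
def make_windows(data, seq_len):
    # Mirrors the while-loop above: slide forward one token at a time
    windows = []
    i = 0
    while True:
        x = data[i:i + seq_len]
        y = data[i + 1:i + seq_len + 1]
        windows.append((x, y))
        i += 1
        if i + seq_len + 1 >= len(data):
            break
    return windows

toy = list(range(10))            # stand-in for token indexes
pairs = make_windows(toy, seq_len=4)
x0, y0 = pairs[0]
print(x0, y0)
print(len(pairs))

# Same arithmetic for the chapter's sizes: 29,725 tokens, seq_len 256, batch size 32
n_windows = 29725 - 256 - 1
n_batches = -(-n_windows // 32)  # ceiling division
print(n_windows, n_batches)
```

The first pair is `[0, 1, 2, 3]` and `[1, 2, 3, 4]`, so each target is the token that follows its input. With the chapter's settings the loop produces 29,468 overlapping windows, or 921 batches per epoch.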

# 2. Build and Train the Transformer
We'll build an encoder transformer from scratch and train it by using the data we prepared in the last section. 

## 2.1. The Encoder Transformer 
We first define the *PositionalEncoding()* class based on the definition provided in the PyTorch documentation: 

In [13]:
import math
import torch
from torch import nn, Tensor
from torch.nn import TransformerEncoder,TransformerEncoderLayer

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout = 0.1, max_len = 5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)
        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2)\
                             * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, 1, d_model)
        pe[:, 0, 0::2] = torch.sin(position * div_term)
        pe[:, 0, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe)
    def forward(self, x: Tensor) -> Tensor:
        x = x + self.pe[:x.size(0)]
        return self.dropout(x)    
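
The buffer `pe` built above follows the sinusoidal scheme: even dimensions hold a sine and odd dimensions hold a cosine of the position scaled by `div_term`, which equals 10000 raised to the power -k/d_model for each even dimension k. The sketch below recomputes the encoding for a single position in plain Python. The helper name `positional_encoding` is an illustrative assumption.

```python
import math

def positional_encoding(pos, d_model):
    # Hypothetical helper: encoding vector for one position, in plain Python
    pe = [0.0] * d_model
    for k in range(0, d_model, 2):
        # Same quantity as div_term above; equals 10000 ** (-k / d_model)
        div_term = math.exp(k * (-math.log(10000.0) / d_model))
        pe[k] = math.sin(pos * div_term)          # even dimensions use sine
        if k + 1 < d_model:
            pe[k + 1] = math.cos(pos * div_term)  # odd dimensions use cosine
    return pe

print(positional_encoding(0, 4))
print(positional_encoding(1, 4))
```

Position 0 always maps to alternating zeros and ones, and later positions trace sine and cosine waves whose frequencies fall as the dimension index grows. This gives every position a distinct pattern.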

We define the hyperparameters we use in the model:

In [14]:
# size of vocabulary
ntokens = len(vocabulary) 
# embedding dimension
emsize = 256
# dimension of the feedforward network
d_hid = 256   
nlayers = 2 
nhead = 2  
dropout = 0.2  

The encoder transformer is defined as below:

In [15]:
class Model(nn.Module):
    def __init__(self, ntoken, d_model, nhead, d_hid,
                 nlayers, dropout=0.5):
        super().__init__() 
        self.model_type="Transformer"
        self.pos_encoder=PositionalEncoding(d_model,dropout)
        encoder_layers = TransformerEncoderLayer(d_model,
                                 nhead, d_hid, dropout)
        self.transformer_encoder = TransformerEncoder(
            encoder_layers, nlayers)
        self.embedding=nn.Embedding(ntoken,d_model)
        self.d_model=d_model
        self.linear=nn.Linear(d_model,ntoken)
        self.init_weights()
        
    def init_weights(self):
        initrange = 0.1
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.linear.bias.data.zero_()
        self.linear.weight.data.uniform_(-initrange, initrange)        

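    def forward(self,src):
        # src arrives batch-first as [batch, seq_len]; PositionalEncoding
        # and the encoder layers expect [seq_len, batch, d_model]
        src_mask=nn.Transformer.generate_square_subsequent_mask(
            src.shape[1]).to(src.device)
        src=self.embedding(src).transpose(0,1)*math.sqrt(self.d_model)
        src=self.pos_encoder(src)
        output=self.transformer_encoder(src,src_mask)
        # project to vocabulary size, back to [batch, seq_len, ntoken]
        output=self.linear(output).transpose(0,1)
        return output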
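
The mask created in *forward()* is what makes the model predict each token from earlier tokens only. It is a square matrix with zeros on and below the diagonal and negative infinity above it, so position i can attend only to positions 0 through i. The plain-Python sketch below builds the same pattern. The helper name `causal_mask` is an assumption for illustration.

```python
def causal_mask(size):
    # Row i is the query position, column j is the key position.
    # 0.0 leaves the attention score unchanged; -inf removes it after softmax.
    return [[0.0 if j <= i else float("-inf") for j in range(size)]
            for i in range(size)]

for row in causal_mask(4):
    print(row)
```

The first row allows attention to the first token only, while the last row allows attention to every token in the sequence.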

## 2.2. Create the Model
We first instantiate a model as follows:

In [16]:
import math

model=Model(ntokens,emsize,nhead,d_hid,nlayers,dropout)
model=model.to(device)
print(model)

Model(
  (pos_encoder): PositionalEncoding(
    (dropout): Dropout(p=0.2, inplace=False)
  )
  (transformer_encoder): TransformerEncoder(
    (layers): ModuleList(
      (0-1): 2 x TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
        )
        (linear1): Linear(in_features=256, out_features=256, bias=True)
        (dropout): Dropout(p=0.2, inplace=False)
        (linear2): Linear(in_features=256, out_features=256, bias=True)
        (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.2, inplace=False)
        (dropout2): Dropout(p=0.2, inplace=False)
      )
    )
  )
  (embedding): Embedding(2532, 256)
  (linear): Linear(in_features=256, out_features=2532, bias=True)
)
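
As a rough check on the model's size, the parameters can be counted by hand from the layer shapes printed above. The sketch assumes PyTorch's default layouts: a combined query, key, and value projection inside the attention module, and a bias on every linear layer.

```python
ntokens, d_model, d_hid, nlayers = 2532, 256, 256, 2

embedding = ntokens * d_model
attn_in = 3 * d_model * d_model + 3 * d_model   # query, key, value projections
attn_out = d_model * d_model + d_model          # out_proj
linear1 = d_model * d_hid + d_hid
linear2 = d_hid * d_model + d_model
norms = 2 * (2 * d_model)                       # two LayerNorms, weight and bias each
per_layer = attn_in + attn_out + linear1 + linear2 + norms
output_head = d_model * ntokens + ntokens
total = embedding + nlayers * per_layer + output_head
print(f"{total:,}")
```

The total is about 2.1 million parameters, a small fraction of the 1.5 billion in GPT-2. You can compare the figure with `sum(p.numel() for p in model.parameters())` on the instantiated model.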


The optimizer and the loss function are as follows:

In [17]:
lr=0.0001
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
loss_func = nn.CrossEntropyLoss()

We'll train the Model next.

# 3. Train the Model

We train the model for 100 epochs, as follows (the output shown below covers the first 31 epochs):

In [18]:
model.train()  
for i in range(100):
    tloss = 0.
    for idx, (x,y) in enumerate(loader):
        x,y=x.to(device),y.to(device)
        output = model(x)
        output_flat = output.reshape(-1, ntokens)
        loss = loss_func(output_flat, y.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(),1)
        optimizer.step()
        tloss += loss.item()
    print(f'epoch {i+1} loss {tloss/(idx+1)}')

epoch 1 loss 4.291328947396543
epoch 2 loss 3.286152108355014
epoch 3 loss 3.1568637826155372
epoch 4 loss 3.1182432759727123
epoch 5 loss 3.098952170691972
epoch 6 loss 3.0861858009127143
epoch 7 loss 3.0772623860486554
epoch 8 loss 3.0703098667819906
epoch 9 loss 3.064477465181216
epoch 10 loss 3.0598204001282765
epoch 11 loss 3.0560653173444585
epoch 12 loss 3.0522191242338135
epoch 13 loss 3.049309714738761
epoch 14 loss 3.0470319146312668
epoch 15 loss 3.044534145297238
epoch 16 loss 3.0425154681314472
epoch 17 loss 3.040149181078104
epoch 18 loss 3.0384829846836716
epoch 19 loss 3.0369212252050475
epoch 20 loss 3.0358088901842843
epoch 21 loss 3.034229785171575
epoch 22 loss 3.0324920336923173
epoch 23 loss 3.0317186941670804
epoch 24 loss 3.030616597166279
epoch 25 loss 3.0292223848556206
epoch 26 loss 3.028349311033367
epoch 27 loss 3.0273971511020723
epoch 28 loss 3.026491512822535
epoch 29 loss 3.0256204840673555
epoch 30 loss 3.0248236821865286
epoch 31 loss 3.02412745592777

If you are using a GPU, training takes half an hour or so. If you use a CPU only, it may take several hours, depending on your hardware. 
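
To put the loss values printed above in context, it helps to compare them with a uniform-guess baseline. A model that assigns equal probability to all 2,532 tokens has a cross-entropy loss equal to the natural log of the vocabulary size. The exponential of the loss, called perplexity, can be read as the effective number of tokens the model is choosing among. The sketch below does this arithmetic using the last loss value shown above.

```python
import math

ntokens = 2532
uniform_loss = math.log(ntokens)     # loss of a model that guesses uniformly

final_loss = 3.02412745592777        # epoch 31 loss from the output above
perplexity = math.exp(final_loss)

print(round(uniform_loss, 2), round(perplexity, 1))
```

The baseline loss is about 7.84, so a loss near 3.02 means the model has narrowed its choice from 2,532 tokens to roughly 20 on average.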

Next, we save the model on the local computer:

In [19]:
torch.save(model.state_dict(),"files/ch14/txtTrans.pth")

# 4. Use the Trained Model to Generate Text
We can use the trained model to generate text. We first define the following *sample()* function:

In [20]:
import torch.nn.functional as F

def sample(model, prompt="The old man and the", length=250, top_k=40):
    model.eval()
    idx=[]
    for w in prompt.lower().split():
        idx += vocabulary(tokenizer(w))
    xs = torch.tensor(idx).unsqueeze(0)   
    xs = xs.to(device)
    x=xs
    for i in range(0, length):
        if x.size(1)>200:x=x[:,-200:]
        logits = model(x)
        logits = logits[:, -1, :]
        v,_=torch.topk(logits,top_k)
        logits[logits < v[:, [-1]]] = -float('Inf')
        probs = F.softmax(logits, dim=-1)
        idx_next = torch.multinomial(probs, num_samples=1)
        x = torch.cat((x, idx_next), dim=1)
        xs = torch.cat((xs, idx_next), dim=1)        
    txt=[vocabulary.lookup_token(i) for i in xs[0]]
    return " ".join(txt) 

In the *sample()* function, we give a prompt so the function knows where to start. The default prompt is "The old man and the". You can also specify the length of the text you want to generate; the default is 250 tokens. At each step, the function feeds the most recent tokens (at most 200) to the trained model, keeps only the *top_k* most likely next tokens (40 by default), and samples one of them in proportion to its probability. It then adds the sampled token to the text. The function repeats the process until the text reaches the desired length. 
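
The top-k step inside *sample()* can be illustrated without PyTorch. The sketch below sets every logit below the k-th largest to negative infinity, applies softmax to what remains, and draws one index according to the resulting probabilities. The helper name `top_k_sample` and the example logits are assumptions for illustration.

```python
import math
import random

def top_k_sample(logits, top_k, rng=random):
    # Threshold: the k-th largest logit
    kth = sorted(logits, reverse=True)[top_k - 1]
    # Anything below the threshold gets -inf, hence probability zero
    filtered = [l if l >= kth else float("-inf") for l in logits]
    # Softmax, subtracting the maximum for numerical stability
    m = max(filtered)
    exps = [math.exp(l - m) for l in filtered]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw one index in proportion to its probability
    idx = rng.choices(range(len(logits)), weights=probs, k=1)[0]
    return idx, probs

idx, probs = top_k_sample([2.0, 1.0, 0.1, -1.0], top_k=2,
                          rng=random.Random(0))
print(idx, [round(p, 3) for p in probs])
```

Only the two highest-scoring entries keep a nonzero probability, so unlikely tokens are never sampled while the output still varies from run to run.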

We then reload the trained model as follows:

In [21]:
model.load_state_dict(torch.load("files/ch14/txtTrans.pth"))

<All keys matched successfully>

Let's generate a passage with the model by using "The old man and the" as the prompt. 

In [22]:
print(sample(model, prompt="The old man and the"))

the old man and the old man ' ll fight is a cramp , he could not find him with coast and i ' s shivering increased as though i think about him and hard pull him though he could bring any more lions on the line was an accident . the line showed glowing below the line that were the first . they came guickly and they were the boy . it well and , and i did not much betting and the old man was . it . when he was no way slowly through his jaws . he saw the old man and the skiff forever . the skiff and his arm , he is my house where all colds and he had come in the terrace and he was no fear both badly but there was going more and he said . he is what you have . i ' s shack was more noble thing it , he had better and the surface at an airplane until something to take you sleep and would like to fight them as though , and the mast up the shark tore the boy had settled himself to look into the skiff . then he thought . make a beer cans and a piece and the skiff , the old man knew he was almost 

Since we trained the transformer on a document with fewer than 30,000 words, the model can only capture the statistical patterns of words in this small document. The text generated above is not coherent, but it does echo the vocabulary and style of the training text. The main purpose of this chapter is for you to learn how to create a transformer and use data to train it from scratch. 