# Chapter 12: Text Generation with Word Tokenization



In this chapter, we'll generate text with a different approach: word-level tokenization. Instead of tokenizing characters, you'll tokenize individual words (or parts of words). You won't represent each token as a one-hot variable (a vector with value 1 in one place and 0 everywhere else), because the vocabulary now contains thousands of distinct words (13,000 in this chapter's dataset) instead of fewer than 100 characters. Rather, you'll use the embedding layer in PyTorch to encode the words efficiently. With that, you'll train an LSTM to learn the relationships among words from a large amount of text. This is a powerful tool in the real world because it lets you automatically mine unstructured data such as social media posts, customer reviews, and analyst reports.
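To see why an embedding layer is cheaper than one-hot encoding, the following minimal sketch uses plain Python lists rather than PyTorch. The tiny vocabulary size, the table values, and the function names are all made up for illustration. It shows that multiplying a one-hot vector by a weight table gives the same result as simply picking one row of the table, which is conceptually what a trainable lookup table such as `nn.Embedding` does.

```python
# Toy illustration (not the chapter's model): a 5-word vocabulary and
# 3-dimensional embeddings, written with plain lists.
vocab_size, embed_dim = 5, 3

# Hypothetical embedding table: one row of numbers per word index.
table = [[float(10 * row + col) for col in range(embed_dim)]
         for row in range(vocab_size)]

def one_hot(index, size):
    """Return a list with 1.0 at `index` and 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def one_hot_times_table(index):
    """Multiply the one-hot row vector by the table (the slow way)."""
    vec = one_hot(index, vocab_size)
    return [sum(vec[r] * table[r][c] for r in range(vocab_size))
            for c in range(embed_dim)]

def embedding_lookup(index):
    """Pick the row directly, with no multiplication at all."""
    return table[index]

word_index = 3
print(one_hot_times_table(word_index))
print(embedding_lookup(word_index))
```

Both calls print the same row. With a 13,000-word vocabulary, the one-hot route would multiply by 12,999 zeros for every word, while the lookup only reads one row.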

When it comes to text generation with the trained model, the approach is similar to what we did in Chapter 11 with the character-level LSTM. You feed a prompt to the model to predict the next word, append the prediction to the end of the prompt to form a new prompt, and repeat the process until the text reaches a certain length.
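The generation loop just described can be sketched without any neural network. In the hedged example below, the dictionary `next_word` is a hypothetical stand-in for a trained model (it maps the last word to a fixed next word), and `generate()` is an illustrative name, not the function used later in this chapter.

```python
# Hypothetical stand-in for a trained model: last word -> next word.
next_word = {"anna": "was", "was": "in", "in": "the",
             "the": "house", "house": "."}

def generate(prompt, length):
    """Extend the prompt one word at a time until it has `length` words."""
    text = prompt.lower().split(" ")
    while len(text) < length:
        # "Predict" the next word from the current end of the text
        prediction = next_word.get(text[-1], ".")
        # The prediction becomes part of the new prompt
        text.append(prediction)
    return " ".join(text)

print(generate("Anna", 6))
```

A real model replaces the dictionary lookup with a forward pass, but the append-and-repeat structure stays the same.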

Start a new cell in ch12.ipynb and execute the following lines of code in it:

In [1]:
import os

os.makedirs("files/ch12", exist_ok=True)

# 1. Word-Level Tokenization
We'll use the text file of Anna Karenina in one of Carlos Lara's GitHub repositories. Go to the link https://github.com/LeanManager/NLP-PyTorch/tree/master/data to download the text file and save it as *anna.txt* in the folder /Desktop/ai/files/ch12/ on your computer. 

## 1.1. Clean Up the Text
First, we load the data and print out the first 100 words to get a feel for the dataset.

In [2]:
with open("files/ch12/anna.txt","r") as f:
    text=f.read()
words=text.split(" ") 
print(words[:100])

['Chapter', '1\n\n\nHappy', 'families', 'are', 'all', 'alike;', 'every', 'unhappy', 'family', 'is', 'unhappy', 'in', 'its', 'own\nway.\n\nEverything', 'was', 'in', 'confusion', 'in', 'the', "Oblonskys'", 'house.', 'The', 'wife', 'had\ndiscovered', 'that', 'the', 'husband', 'was', 'carrying', 'on', 'an', 'intrigue', 'with', 'a', 'French\ngirl,', 'who', 'had', 'been', 'a', 'governess', 'in', 'their', 'family,', 'and', 'she', 'had', 'announced', 'to\nher', 'husband', 'that', 'she', 'could', 'not', 'go', 'on', 'living', 'in', 'the', 'same', 'house', 'with', 'him.\nThis', 'position', 'of', 'affairs', 'had', 'now', 'lasted', 'three', 'days,', 'and', 'not', 'only', 'the\nhusband', 'and', 'wife', 'themselves,', 'but', 'all', 'the', 'members', 'of', 'their', 'family', 'and\nhousehold,', 'were', 'painfully', 'conscious', 'of', 'it.', 'Every', 'person', 'in', 'the', 'house\nfelt', 'that', 'there', 'was', 'no', 'sense']


Line breaks (\n) are treated as part of the words, so we need to replace them with spaces. We also need to convert all words to lower case so that *The* and *the* are treated as the same word. Further, punctuation marks need spaces around them so that they are separated from words. To decide which characters need attention, we'll print out all unique characters in the text as follows:

In [3]:
print(set(text.lower()))

{'g', '*', '$', 'p', ')', 'e', '(', 'w', '4', '6', "'", 'o', ',', 'l', '"', 'r', 'i', 'u', 'd', 'q', ':', 'n', '2', '9', '&', 'f', '.', '0', 'y', '/', '?', '3', ';', '8', 'x', '1', 'v', 'b', '_', '%', 's', 'j', ' ', 'z', 'm', 't', 'h', 'a', 'c', 'k', '!', '5', '7', '\n', '-', '`', '@'}


We can then go over each character and see if we need to do something about it. The following are the steps to clean up the text:

In [4]:
clean_text=text.lower().replace("\n", " ")
clean_text=clean_text.replace("-", " ")
for x in ",.:;?!$()/_&%*@'`":
    clean_text=clean_text.replace(f"{x}", f" {x} ")
clean_text=clean_text.replace('"', ' " ') 
clean_text=clean_text.replace("     ", " ")
clean_text=clean_text.replace("    ", " ")
clean_text=clean_text.replace("   ", " ")
clean_text=clean_text.replace("  ", " ")
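To check what these cleaning steps do, here is a small sketch that applies the same idea to a short string. The helper name `clean()` is hypothetical, and it collapses whitespace with `" ".join(out.split())` instead of the repeated `replace()` calls above, which is an assumption of this sketch: unlike the cell above, it also strips leading and trailing spaces.

```python
def clean(raw, punctuation=",.:;?!$()/_&%*@'`\""):
    """Lower-case, pad punctuation with spaces, collapse whitespace."""
    out = raw.lower().replace("\n", " ").replace("-", " ")
    for mark in punctuation:
        out = out.replace(mark, f" {mark} ")
    # split() with no argument splits on any run of whitespace
    return " ".join(out.split())

sample_text = ("Happy families are all alike; every unhappy family\n"
               "is unhappy in its own way.")
print(clean(sample_text).split(" "))
```

After cleaning, *alike;* becomes the two tokens *alike* and *;*, and the line break no longer glues two words together.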

We can now save the cleaned up text as follows:

In [5]:
with open("files/ch12/cleaned_up.txt","w") as f:
    f.write(clean_text)

## 1.2. Preprocess the Data
We first create a PyTorch dataset based on the text file *cleaned_up.txt*. 

In [6]:
import torch
from collections import Counter
from torch.utils.data import Dataset

class Data(Dataset):
    def __init__(self,seq_len=50):
        super().__init__()
        self.text=self.get_text()
        self.words=self.get_unique_words()
        self.int_to_word={k:v for k,v in enumerate(self.words)}
        self.word_to_int={v:k for k,v in enumerate(self.words)}        
        self.wordidx=[self.word_to_int[w] for w in self.text]  
        self.seq_len=seq_len
    def get_text(self):
        with open("files/ch12/cleaned_up.txt","r") as f:
            text=f.read()
        return text.split(" ")
    def get_unique_words(self):
        word_counts = Counter(self.text)
        return sorted(word_counts, key=word_counts.get,
                      reverse=True)
    def __len__(self):
        return len(self.wordidx) - self.seq_len

    def __getitem__(self, i):
        return (
        torch.tensor(self.wordidx[i:i+self.seq_len]),
        torch.tensor(self.wordidx[i+1:i+self.seq_len+1]),
        )  

We can now instantiate the dataset and see its properties:

In [7]:
data=Data()

We'll check the length of the training data and the number of unique words in it, as follows:

In [8]:
text_length=len(data.text)
num_unique_words=len(data.words)
print(f"the text length is {text_length} word")
print(f"there are {num_unique_words} num_unique_words")

the text length is 440905 word
there are 13000 num_unique_words


We'll print out the first 20 words in the original text and then look at the individual words and their corresponding index numbers:

In [9]:
print(data.text[0:20])
print([data.word_to_int[w] for w in data.text[0:20]])

['chapter', '1', 'happy', 'families', 'are', 'all', 'alike', ';', 'every', 'unhappy', 'family', 'is', 'unhappy', 'in', 'its', 'own', 'way', '.', 'everything', 'was']
[208, 670, 283, 3024, 82, 31, 2461, 35, 202, 690, 365, 38, 690, 10, 234, 147, 166, 1, 149, 12]
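The mapping between words and indexes can be reproduced on a toy sentence. The sketch below mirrors the logic of `get_unique_words()`, `word_to_int`, and `int_to_word` from the `Data` class, but the sentence and the variable names `tokens`, `encoded`, and `decoded` are made up for illustration.

```python
from collections import Counter

tokens = "the cat sat on the mat . the cat slept .".split(" ")
counts = Counter(tokens)

# Most frequent word first, mirroring get_unique_words() above
vocab = sorted(counts, key=counts.get, reverse=True)
word_to_int = {w: i for i, w in enumerate(vocab)}
int_to_word = {i: w for i, w in enumerate(vocab)}

# Encode the text as integers, then decode it back
encoded = [word_to_int[w] for w in tokens]
decoded = [int_to_word[i] for i in encoded]

print(vocab[0], word_to_int["the"])
print(encoded)
print(decoded == tokens)
```

Because the vocabulary is sorted by frequency, small index numbers correspond to common tokens. That is why the period has index 1 and *in* has index 10 in the output above.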


## 1.3.  Create Batches

We'll organize the text into different batches so that we can feed them to the model to train the LSTM network. 

In [10]:
import torch
import numpy as np
from torch import nn, optim
from torch.utils.data import DataLoader

loader = DataLoader(data, batch_size=32, shuffle=True)

We print out a batch to have a look. Because the loader shuffles the data, the rows you see will differ from those shown below:

In [11]:
x,y=next(iter(loader))
print(x)
print(y)
print(x.shape)

tensor([[ 208,  670,  283,  ...,   52,    9,  936],
        [ 670,  283, 3024,  ...,    9,  936,   10],
        [ 283, 3024,   82,  ...,  936,   10,   71],
        ...,
        [ 129,   18, 2187,  ...,  186,    6,  965],
        [  18, 2187,   11,  ...,    6,  965,   18],
        [2187,   11,    2,  ...,  965,   18,   58]])
tensor([[ 670,  283, 3024,  ...,    9,  936,   10],
        [ 283, 3024,   82,  ...,  936,   10,   71],
        [3024,   82,   31,  ...,   10,   71,  365],
        ...,
        [  18, 2187,   11,  ...,    6,  965,   18],
        [2187,   11,    2,  ...,  965,   18,   58],
        [  11,    2,  159,  ...,   18,   58, 2188]])
torch.Size([32, 50])


The above results indicate that y is x shifted by one word: each row of y starts one position later in the text than the corresponding row of x, so y[i] is the word that follows x[i]. That's exactly what we intend to do. We'll use x as features and y as targets. By using this training data, the model learns to predict the next word at every position in the sequence.
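The relationship between x and y is easier to see on a short list. The sketch below copies the slicing logic of `__getitem__()` and `__len__()`, using the first eight index numbers printed earlier and a sequence length of 4 instead of 50 (an assumption made only to keep the output short). The helper name `get_pair()` is hypothetical.

```python
# First eight word indexes from the text ("chapter 1 happy families ...")
wordidx = [208, 670, 283, 3024, 82, 31, 2461, 35]
seq_len = 4  # shortened from 50 for display

def get_pair(i):
    """Return (features, targets) for the window starting at position i."""
    x = wordidx[i:i + seq_len]
    y = wordidx[i + 1:i + seq_len + 1]
    return x, y

# Same formula as __len__(): the last window still needs one target word
num_samples = len(wordidx) - seq_len
for i in range(num_samples):
    print(get_pair(i))
```

In every pair, the target list is the feature list with its first word dropped and the next word in the text appended.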

# 2. Build the LSTM Model
We'll use the built-in LSTM layer in PyTorch to create the model.

## 2.1. The Model Structure
We first import needed modules:

In [12]:
import torch.nn.functional as F
device="cuda" if torch.cuda.is_available() else "cpu"

We then define a *WordLSTM()* class to represent the model.

In [13]:
class WordLSTM(nn.Module):
    def __init__(self, input_size=128, n_embed=128,
             n_layers=3, drop_prob=0.2):
        super().__init__()
        # input_size: dimension of each word embedding (LSTM input)
        # n_embed: size of the LSTM hidden state
        self.input_size=input_size
        self.drop_prob = drop_prob
        self.n_layers = n_layers
        self.n_embed = n_embed
        vocab_size=len(data.words)
        # the embedding output must match the LSTM input size
        self.embedding=nn.Embedding(vocab_size,self.input_size)
        self.lstm = nn.LSTM(input_size=self.input_size,
            hidden_size=self.n_embed,
            num_layers=self.n_layers,
            dropout=self.drop_prob,batch_first=True)
        # the LSTM output has n_embed features per time step
        self.fc = nn.Linear(self.n_embed, vocab_size)

    def forward(self, x, hc):
        embed=self.embedding(x)
        x, hc = self.lstm(embed, hc)
        x = self.fc(x)
        return x, hc      
        
    def init_hidden(self, n_seqs):
        weight = next(self.parameters()).data
        return (weight.new(self.n_layers,
                           n_seqs, self.n_embed).zero_(),
                weight.new(self.n_layers,
                           n_seqs, self.n_embed).zero_()) 

## 2.2. Create the Model
We first instantiate a model as follows:

In [14]:
model=WordLSTM().to(device)
print(model)

WordLSTM(
  (embedding): Embedding(13000, 128)
  (lstm): LSTM(128, 128, num_layers=3, batch_first=True, dropout=0.2)
  (fc): Linear(in_features=128, out_features=13000, bias=True)
)
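As a rough check on the size of this model, the following sketch counts the trainable parameters by hand from the shapes printed above. It assumes PyTorch's usual LSTM parameterization (four gates per layer, each with input weights, recurrent weights, and two bias vectors). The variable and function names are illustrative only.

```python
vocab_size, embed_dim, hidden, n_layers = 13000, 128, 128, 3

# Embedding: one row of embed_dim numbers per word
embedding_params = vocab_size * embed_dim

def lstm_layer_params(input_dim, hidden_dim):
    """Four gates, each with input weights, recurrent weights, two biases."""
    return 4 * (input_dim * hidden_dim
                + hidden_dim * hidden_dim
                + 2 * hidden_dim)

# The first layer reads embeddings; later layers read hidden states
lstm_params = (lstm_layer_params(embed_dim, hidden)
               + (n_layers - 1) * lstm_layer_params(hidden, hidden))

# Final linear layer: weights plus one bias per vocabulary word
fc_params = hidden * vocab_size + vocab_size

total = embedding_params + lstm_params + fc_params
print(embedding_params, lstm_params, fc_params, total)
```

Under these assumptions, the embedding layer and the output layer account for most of the parameters, because both scale with the vocabulary size.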


The optimizer and the loss function are as follows:

In [15]:
lr=0.0001
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
loss_func = nn.CrossEntropyLoss()

We'll train the model next.

# 3. Train the Model
We first define some hyperparameter values and get ready for training:

In [16]:
n_seqs=32
model.train()

WordLSTM(
  (embedding): Embedding(13000, 128)
  (lstm): LSTM(128, 128, num_layers=3, batch_first=True, dropout=0.2)
  (fc): Linear(in_features=128, out_features=13000, bias=True)
)

We then train the model for 10 epochs, as follows:

In [17]:
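for epoch in range(10):
    tloss=0
    # start each epoch with zeroed hidden and cell states
    sh,sc = model.init_hidden(n_seqs)
    for i, (x,y) in enumerate(loader):
        # skip the last batch if it has fewer than n_seqs sequences
        if x.shape[0]==n_seqs:
            inputs, targets = x.to(device), y.to(device)
            optimizer.zero_grad()
            output, (sh,sc) = model(inputs, (sh,sc))
            # CrossEntropyLoss expects (batch, classes, seq_len)
            loss = loss_func(output.transpose(1,2),targets)
            # detach the states so gradients don't flow across batches
            sh,sc=sh.detach(),sc.detach()
            loss.backward()
            # clip the gradient norm at 5 to avoid exploding gradients
            nn.utils.clip_grad_norm_(model.parameters(), 5)
            optimizer.step()
            tloss+=loss.item()
        if (i+1)%1000==0:
            print(f"at epoch {epoch} iteration {i+1} "
                  f"average loss = {tloss/(i+1)}")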

If you are using a GPU, it takes an hour or so to train. If you use a CPU only, it may take several hours, depending on your hardware.
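To interpret the loss values printed during training, it helps to know what an untrained model would score. The sketch below computes the cross-entropy for a single prediction in plain Python and evaluates it for a model that assigns the same score to all 13,000 words. The function name `cross_entropy()` and the target index 42 are illustrative. It covers one prediction only, whereas `nn.CrossEntropyLoss()` with its default settings averages this quantity over all positions in the batch.

```python
import math

def cross_entropy(logits, target):
    """Negative log of the softmax probability assigned to `target`."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[target]

vocab_size = 13000
uniform_logits = [0.0] * vocab_size  # every word equally likely
baseline = cross_entropy(uniform_logits, target=42)
print(round(baseline, 2))  # about 9.47, which is ln(13000)
```

An average loss well below this baseline suggests that the model has learned something about which words tend to follow which.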

Next, we save the model on the local computer:

In [18]:
torch.save(model.state_dict(),"files/ch12/wordLSTM.pth")

# 4. Use the Trained Model to Generate Text
We can use the trained model to generate text. We first define the following sample() function:

In [19]:
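def sample(model, prompt="Anna", length=200):
    model.eval()
    text = prompt.lower().split(' ')
    sh,sc = model.init_hidden(1)
    # no gradients are needed when generating text
    with torch.no_grad():
        for i in range(0, length):
            # feed the whole prompt once; afterwards feed only the
            # newest word, since (sh,sc) already summarize the rest
            words = text if i==0 else text[-1:]
            x = torch.tensor([[data.word_to_int[w] for w in words]])
            inputs = x.to(device)
            output, (sh,sc) = model(inputs, (sh,sc))
            logits = output[0][-1]
            p = nn.functional.softmax(logits, dim=0).detach().cpu().numpy()
            # sample the next word from the predicted distribution
            idx = np.random.choice(len(logits), p=p)
            text.append(data.int_to_word[idx])

    return " ".join(text)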

In the *sample()* function, we give a prompt so the function knows where to start. The default prompt is "Anna". You can also specify how many words to generate; the default is 200. The function uses the trained model to predict the next word based on the existing text and then adds the predicted word to the text. It repeats the process until it has generated the requested number of words.
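The step that turns the model's output into a word can be sketched in plain Python. In the example below, the five-word vocabulary and the scores in `logits` are made up. `softmax()` converts the scores into probabilities, and `random.Random(0).choices()` plays the role that `np.random.choice()` plays in *sample()*. The greedy choice is shown only for comparison; the *sample()* function above draws at random instead.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["anna", "and", "the", "doctor", "."]  # toy vocabulary
logits = [0.5, 1.0, 2.0, 3.0, 0.1]             # made-up scores

probs = softmax(logits)

# Draw a word at random, weighted by its probability
rng = random.Random(0)
drawn = rng.choices(vocab, weights=probs, k=1)[0]

# For comparison: always take the single most likely word
greedy = vocab[probs.index(max(probs))]

print([round(p, 3) for p in probs])
print(drawn, greedy)
```

Sampling from the distribution, rather than always taking the most likely word, is why the same prompt can produce a different passage each time you run it.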

We then reload the trained model as follows:

In [20]:
model.load_state_dict(torch.load("files/ch12/wordLSTM.pth"))

<All keys matched successfully>

Let's generate a passage with the model by using "Anna and the" as the prompt. 

In [21]:
print(sample(model, prompt='Anna and the'))  

anna and the doctor could cut through the forces of frank them made great special extracts in the affair . cord were particularly well listening to his brother ' s face . " what wicked and old life with kitty , she used to say , with which she had met her husband , which was stronger . anna arkadyevna , to hear the bitterness of his farming . the peasants steeplechase , both , and herself , for having sacrificed her share . she was not so bored since she saw it , not in his any never thought of his brother . he used to without indifferent to the laborer , the same expression of the hall bright and effective " he said , with his hat from his damp hands . the grass @ something . the remainder had merrily . 17 sports . general for the rest of the dinner on russia , or ceasing , with everyone that suggested who was not going to feel completely right . his feelings was difficult to get the end of the imperial moment while , in spite of the same mode of view of the sisters she clambered away

Notice that everything is in lower case, since we converted all upper-case letters to lower case when processing the text to reduce the number of unique words. Also notice that there is a space before and after each punctuation mark, because we separated punctuation from words during training.

The generated text above is not bad for an hour of training! Most of the sentences are grammatically correct. It's not as good as the text generated by, say, ChatGPT, but you have learned how to create a language model based on word-level tokenization and how to train an LSTM to generate text. In the next chapter, we'll discuss how to train a transformer, the type of model used by ChatGPT.