# Chapter 8: Text Generation with Recurrent Neural Networks (RNNs)


This chapter covers

* The idea behind RNNs and why they can handle sequential data 
* Character tokenization, word tokenization, and subword tokenization
* How word embedding works
* Building and training an RNN to generate text 
* Using temperature and top-K sampling to control the creativeness of text generation

So far in this book, we have discussed how to generate shapes, numbers, and images. Starting from this chapter, we’ll focus mainly on text generation. Generating text is often considered the holy grail of generative AI for several compelling reasons. Human language is incredibly complex and nuanced. It involves not only understanding grammar and vocabulary but also context, tone, and cultural references. Successfully generating coherent and contextually appropriate text is a significant challenge that requires deep understanding and processing of language. 

As humans, we primarily communicate through language. AI that can generate human-like text can interact more naturally with users, making technology more accessible and user-friendly. Text generation has many applications, from automating customer service responses to creating entire articles, scripting for games and movies, aiding in creative writing, and even building personal assistants. The potential impact across industries is enormous.

In this chapter, we’ll make our first attempt at building and training models to generate text. You’ll learn to tackle three main challenges in modeling text generation. First, text is sequential data: its elements are arranged in a specific order, and that order carries meaning. Predicting outcomes for sequences is challenging because rearranging the elements changes what they mean. Second, text exhibits long-range dependencies: the meaning of a certain part of the text depends on elements that appeared much earlier in the text (e.g., 200 words ago). Understanding and modeling these long-range dependencies is essential for generating coherent text. Finally, human language is ambiguous and context dependent. Training a model to understand nuances, sarcasm, idioms, and cultural references to generate contextually accurate text is challenging.

You'll explore a specific neural network designed for handling sequential data, such as text or time series: the recurrent neural network (RNN). Traditional neural networks, such as feedforward neural networks or fully connected networks, treat each input independently. This means that the network processes each input separately, without considering any relationship or order between different inputs. In contrast, RNNs are specifically designed to handle sequential data. In an RNN, the output at a given time step depends not only on the current input but also on previous inputs. This allows RNNs to maintain a form of memory, capturing information from previous time steps to influence the processing of the current input. 
This sequential processing makes RNNs suitable for tasks where the order of the inputs matters, such as language modeling, where the goal is to predict the next word in a sentence based on previous words. We’ll focus on one variant of RNN, Long Short-Term Memory (LSTM) networks, which can recognize both short-term and long-term data patterns in sequential data like text. LSTM models use a hidden state to capture information in previous time steps. Therefore, a trained LSTM model can produce coherent text based on the context. 
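
To make the idea of a hidden state concrete, here is a minimal sketch of a plain recurrent cell in pure Python. It is not an LSTM and not the model built later in this chapter; the helper names `rnn_step` and `run_rnn` and the weights `w_x`, `w_h`, and `b` are made up for illustration. The only point is that each step mixes the current input with the state left behind by earlier inputs.

```python
import math

def rnn_step(x, h, w_x=0.5, w_h=0.8, b=0.0):
    """One step of a scalar recurrent cell (illustrative weights, not trained)."""
    # the new state depends on the current input x and the previous state h
    return math.tanh(w_x * x + w_h * h + b)

def run_rnn(inputs, h0=0.0):
    """Feed the inputs one at a time, carrying the hidden state forward."""
    h = h0
    for x in inputs:
        h = rnn_step(x, h)
    return h

forward = run_rnn([1.0, 2.0, 3.0])
backward = run_rnn([3.0, 2.0, 1.0])
print(forward, backward)  # same inputs, different order
```

Feeding the same three numbers in reverse order leaves the cell in a different final state, which is the sense in which the order of the inputs matters to a recurrent network.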

The style of the generated text depends on the training data. Additionally, as we plan to train a model from scratch for text generation, text length is a crucial factor. The text needs to be long enough for the model to learn and mimic a particular writing style, yet short enough to avoid excessive computational demands during training. For these reasons, we’ll use the text from the novel Anna Karenina to train an LSTM model. Since neural networks like an LSTM cannot accept text as input directly, you’ll learn to break down text into tokens, a process known as tokenization. In this chapter, tokens are individual words and punctuation marks; they can also be parts of words, as you’ll see in later chapters. You’ll then create a dictionary to map each unique token to an integer (i.e., an index). Based on this dictionary, you’ll convert the text into a long sequence of integers, ready to be fed into a neural network. 

You’ll use sequences of indexes of a certain length as the input to train the LSTM model. You shift the sequence of inputs by one token to the right and use it as the output: you are effectively training the model to predict the next word in a sentence. This is the so-called sequence-to-sequence prediction problem in natural language processing (NLP) and you’ll see it again in later chapters. 
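
The shift-by-one idea can be sketched on a toy list of indexes. The helper `make_pairs` and the numbers in `toy_ids` are hypothetical and exist only for this illustration; the chapter's own data preparation appears in section 3.2.

```python
def make_pairs(token_ids, seq_len):
    """Build (input, target) pairs where the target is the input shifted by one."""
    pairs = []
    for n in range(len(token_ids) - seq_len):
        x = token_ids[n:n + seq_len]           # seq_len consecutive tokens
        y = token_ids[n + 1:n + seq_len + 1]   # the same window, one token later
        pairs.append((x, y))
    return pairs

toy_ids = [10, 11, 12, 13, 14, 15]
pairs = make_pairs(toy_ids, seq_len=4)
for x, y in pairs:
    print(x, "->", y)
```

At every position, the target is simply the token that follows the corresponding input token, so the model is asked to predict the next token many times per sequence.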

Once the LSTM is trained, you’ll use it to generate text one token at a time based on previous tokens in the sequence as follows: You feed a prompt (part of a sentence such as “Anna and the”) to the trained model. The model then predicts a probability distribution over the next token, and a token selected from that distribution is appended to your prompt. The updated prompt serves again as the input, and the model is used once more to predict the next token. This iterative process continues until the prompt reaches a certain length. This approach is similar to the mechanism employed by more advanced generative models like ChatGPT (though ChatGPT is not an LSTM). You’ll see the trained LSTM model generate text that is largely grammatical and matches the style of the original novel. 
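
The loop described above can be sketched without any neural network. In the snippet below, `toy_next_token` is a hypothetical stand-in for a trained model (a fixed lookup on the last token), and `generate_tokens` is an illustrative helper, not the function used later in this chapter.

```python
def toy_next_token(tokens):
    """Hypothetical stand-in for a trained model: looks only at the last token."""
    table = {"anna": "and", "and": "the", "the": "prince",
             "prince": "smiled", "smiled": "."}
    return table.get(tokens[-1], ".")

def generate_tokens(prompt, next_token_fn, length):
    """Extend the prompt one token at a time until it has `length` tokens."""
    tokens = prompt.lower().split(" ")
    while len(tokens) < length:
        # the growing sequence is fed back in to pick the next token
        tokens.append(next_token_fn(tokens))
    return tokens

result = generate_tokens("Anna and the", toy_next_token, length=6)
print(" ".join(result))
```

A real model replaces the lookup table with a learned probability distribution over the whole vocabulary, but the feed, predict, append cycle stays the same.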

Finally, you’ll also learn how to control the creativeness of the generated text by using temperature and top-K sampling. Temperature controls the randomness of the predictions of the trained model. A high temperature makes the generated text more creative, while a low temperature makes the text more confident and predictable. Top-K sampling is a method where you select the next token from the K most probable tokens, rather than from the entire vocabulary. A small value of K restricts each step to highly likely tokens, which makes the generated text less creative and more coherent.

The primary goal of this chapter is not necessarily to generate the most coherent text possible, which, as mentioned earlier, presents substantial challenges. Instead, our objective is to demonstrate the limitations of RNNs, thereby setting the stage for the introduction of Transformers in subsequent chapters. More importantly, this chapter establishes the basic principles of text generation, including tokenization, word embedding, sequence prediction, temperature settings, and top-K sampling. Consequently, in the later chapters, you will have a solid understanding of the fundamentals of NLP. This foundation will allow us to concentrate on other, more advanced, aspects of NLP, such as how the attention mechanism functions and the architecture of Transformers.

# 1.    Introduction to recurrent neural networks (RNNs)
# 2.	Fundamentals of Natural Language Processing (NLP)

In [1]:
# character tokenization
text="It is unbelievably good!"
tokens=list(text)
print(tokens)

['I', 't', ' ', 'i', 's', ' ', 'u', 'n', 'b', 'e', 'l', 'i', 'e', 'v', 'a', 'b', 'l', 'y', ' ', 'g', 'o', 'o', 'd', '!']


In [2]:
# exercise 8.1
text="Hi, there!"
tokens=list(text)
print(tokens)

['H', 'i', ',', ' ', 't', 'h', 'e', 'r', 'e', '!']


In [3]:
# word tokenization
text="It is unbelievably good!"
text=text.replace("!"," !")
tokens=text.split(" ")
print(tokens)

['It', 'is', 'unbelievably', 'good', '!']


In [4]:
# exercise 8.2
text="Hi, there!"
for x in list(",!"):
    text=text.replace(f"{x}",f" {x}")
tokens=text.split(" ")
print(tokens)

['Hi', ',', 'there', '!']
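
The cells above cover character and word tokenization. For completeness, here is a rough sketch of what a subword split could look like. It uses a greedy longest-match rule against a tiny hand-written vocabulary; both the helper `subword_tokenize` and `toy_vocab` are assumptions made for this illustration. Real subword tokenizers learn their vocabulary from data and are not implemented this way.

```python
def subword_tokenize(word, vocab):
    """Greedy longest-match split; unknown stretches fall back to single characters."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # shrink the candidate piece until it is in the vocabulary
        # or only one character is left
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        pieces.append(word[start:end])
        start = end
    return pieces

toy_vocab = {"un", "believ", "ably", "good"}  # hand-picked, not learned
print(subword_tokenize("unbelievably", toy_vocab))
```

Splitting a long word into reusable pieces keeps the vocabulary smaller than word tokenization while producing shorter sequences than character tokenization.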


# 3.	Prepare data to train the LSTM model
We'll use the text file of Anna Karenina in one of Carlos Lara's GitHub repositories. Go to the link https://github.com/LeanManager/NLP-PyTorch/tree/master/data to download the text file and save it as *anna.txt* in the folder /files/ on your computer. After that, open the file and delete everything after line 39888, which says "END OF THIS PROJECT GUTENBERG EBOOK ANNA KARENINA." 
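
If you prefer not to edit the file by hand, the trimming step can be sketched in code. The helper `truncate_after_marker` is hypothetical, and the marker string is taken from the sentence above; the exact wording and punctuation of that line in your copy of the file is an assumption, so check it before relying on this.

```python
def truncate_after_marker(raw_text, marker):
    """Keep everything up to and including the first line that contains marker."""
    lines = raw_text.split("\n")
    for i, line in enumerate(lines):
        if marker in line:
            return "\n".join(lines[:i + 1])
    return raw_text  # marker not found: leave the text unchanged

# assumed marker text, copied from the instructions above
END_MARKER = "END OF THIS PROJECT GUTENBERG EBOOK ANNA KARENINA"
demo = "last line of the novel\n" + END_MARKER + "\nlicense text\nmore license text"
trimmed = truncate_after_marker(demo, END_MARKER)
print(trimmed)
```

To apply it to the real file, you would read *anna.txt* into a string, pass it through the function, and write the result back.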

## 3.1	Download and clean up the text


In [5]:
with open("files/anna.txt","r") as f:
    text=f.read()
words=text.split(" ") 
print(words[:20])

['Chapter', '1\n\n\nHappy', 'families', 'are', 'all', 'alike;', 'every', 'unhappy', 'family', 'is', 'unhappy', 'in', 'its', 'own\nway.\n\nEverything', 'was', 'in', 'confusion', 'in', 'the', "Oblonskys'"]


In [6]:
print(set(text.lower()))

{'f', 'd', '"', 'h', 'b', '1', '3', '6', '8', 'o', 'i', '_', 'r', ',', ')', '9', 'a', 'j', '`', '5', '0', 'y', 'q', 'x', '2', 'g', 'z', 'p', "'", '?', 'm', 'v', ' ', '\n', 'e', '-', ':', 'w', 'n', '.', ';', '4', 'k', '(', 'u', '!', '7', 'c', 's', 't', 'l'}


In [7]:
clean_text=text.lower().replace("\n", " ")
clean_text=clean_text.replace("-", " ")
for x in ",.:;?!$()/_&%*@'`":
    clean_text=clean_text.replace(f"{x}", f" {x} ")
clean_text=clean_text.replace('"', ' " ') 
text=clean_text.split()

In [8]:
from collections import Counter   
word_counts = Counter(text)    

# get unique words
words=sorted(word_counts, key=word_counts.get,
                      reverse=True) 
print(words[:10])

[',', '.', 'the', '"', 'and', 'to', 'of', 'he', "'", 'a']


In [9]:
text_length=len(text)
num_unique_words=len(words)
print(f"the text contains {text_length} words")
print(f"there are {num_unique_words} unique tokens")  
word_to_int={v:k for k,v in enumerate(words)} 
int_to_word={k:v for k,v in enumerate(words)}
print({k:v for k,v in word_to_int.items() if k in words[:10]})
print({k:v for k,v in int_to_word.items() if v in words[:10]})

the text contains 437098 words
there are 12778 unique tokens
{',': 0, '.': 1, 'the': 2, '"': 3, 'and': 4, 'to': 5, 'of': 6, 'he': 7, "'": 8, 'a': 9}
{0: ',', 1: '.', 2: 'the', 3: '"', 4: 'and', 5: 'to', 6: 'of', 7: 'he', 8: "'", 9: 'a'}


In [10]:
print(text[0:20])
wordidx=[word_to_int[w] for w in text]  
print([word_to_int[w] for w in text[0:20]])  

['chapter', '1', 'happy', 'families', 'are', 'all', 'alike', ';', 'every', 'unhappy', 'family', 'is', 'unhappy', 'in', 'its', 'own', 'way', '.', 'everything', 'was']
[208, 2755, 280, 2981, 83, 31, 2419, 35, 202, 685, 362, 38, 685, 10, 236, 147, 166, 1, 149, 12]


In [11]:
# exercise 8.3
print(word_to_int["anna"])

## 3.2	Create batches of training data


In [12]:
import torch

seq_len=100  
xys=[]
for n in range(0, len(wordidx)-seq_len-1):
    x = wordidx[n:n+seq_len]
    y = wordidx[n+1:n+seq_len+1]
    xys.append((torch.tensor(x),(torch.tensor(y))))

In [13]:
from torch.utils.data import DataLoader

torch.manual_seed(42)
batch_size=32
loader = DataLoader(xys, batch_size=batch_size, shuffle=True)

x,y=next(iter(loader))
print(x)
print(y)
print(x.shape,y.shape)

tensor([[  39,   31,    2,  ...,  688,  142,    7],
        [ 156, 5293,    0,  ...,   38,  330,    0],
        [   3,   97,    0,  ...,    0, 1774,   34],
        ...,
        [  16,  156,    9,  ...,  113,    5,  533],
        [   3,    4,   31,  ...,   98,    5,   98],
        [ 289,   19,   23,  ...,    9,  828,  550]])
tensor([[  31,    2, 2727,  ...,  142,    7,    0],
        [5293,    0,   16,  ...,  330,    0,    3],
        [  97,    0,    4,  ..., 1774,   34,    3],
        ...,
        [ 156,    9,  489,  ...,    5,  533,   27],
        [   4,   31,   25,  ...,    5,   98,    1],
        [  19,   23,    1,  ...,  828,  550,    1]])
torch.Size([32, 100]) torch.Size([32, 100])


# 4. Build and Train the LSTM Model


## 4.1	Build an LSTM model


In [14]:
import torch
from torch import nn
device="cuda" if torch.cuda.is_available() else "cpu"
class WordLSTM(nn.Module):
    def __init__(self, input_size=128, n_embed=128,
             n_layers=3, drop_prob=0.2):
        super().__init__()
        self.input_size=input_size
        self.drop_prob = drop_prob
        self.n_layers = n_layers
        self.n_embed = n_embed
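        vocab_size=len(word_to_int)
        # each token index maps to a vector of length input_size,
        # which must match the input size expected by the LSTM
        self.embedding=nn.Embedding(vocab_size,input_size)
        self.lstm = nn.LSTM(input_size=self.input_size,
            hidden_size=self.n_embed,
            num_layers=self.n_layers,
            dropout=self.drop_prob,batch_first=True)
        # the LSTM outputs vectors of length n_embed (its hidden size),
        # which are projected to one logit per token in the vocabulary
        self.fc = nn.Linear(n_embed, vocab_size)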
    def forward(self, x, hc):
        embed=self.embedding(x)
        x, hc = self.lstm(embed, hc)
        x = self.fc(x)
        return x, hc        
    def init_hidden(self, n_seqs):
        weight = next(self.parameters()).data
        return (weight.new(self.n_layers,
                           n_seqs, self.n_embed).zero_(),
                weight.new(self.n_layers,
                           n_seqs, self.n_embed).zero_()) 

In [15]:
model=WordLSTM().to(device)
print(model)

WordLSTM(
  (embedding): Embedding(12778, 128)
  (lstm): LSTM(128, 128, num_layers=3, batch_first=True, dropout=0.2)
  (fc): Linear(in_features=128, out_features=12778, bias=True)
)
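
The `Embedding(12778, 128)` layer in the printout above can be thought of as a trainable lookup table with one row of 128 numbers per token index. The sketch below mimics the lookup in pure Python with made-up sizes; `table` and `embed` are illustrative names, and the random values stand in for weights that training would adjust.

```python
import random

random.seed(0)
toy_vocab_size, toy_n_embed = 5, 3   # hypothetical sizes; the model uses 12778 and 128
# the embedding table: one row of toy_n_embed numbers per token index
table = [[random.uniform(-1, 1) for _ in range(toy_n_embed)]
         for _ in range(toy_vocab_size)]

def embed(indexes):
    """Replace each integer index with its row in the table."""
    return [table[i] for i in indexes]

vectors = embed([2, 0, 2])
print(len(vectors), len(vectors[0]))   # sequence length, embedding size
print(vectors[0] == vectors[2])        # the same token always gets the same vector
```

The LSTM never sees the integer indexes themselves, only the vectors looked up for them.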


In [16]:
lr=0.0001
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
loss_func = nn.CrossEntropyLoss()

## 4.2	Train the LSTM model


In [17]:
model.train()

for epoch in range(50):
    tloss=0
    sh,sc = model.init_hidden(batch_size)
    for i, (x,y) in enumerate(loader):    
        if x.shape[0]==batch_size:
            inputs, targets = x.to(device), y.to(device)
            optimizer.zero_grad()
            output, (sh,sc) = model(inputs, (sh,sc))
            loss = loss_func(output.transpose(1,2),targets)
            sh,sc=sh.detach(),sc.detach()
            loss.backward()
            nn.utils.clip_grad_norm_(model.parameters(), 5)
            optimizer.step()
            tloss+=loss.item()
        if (i+1)%1000==0:
            # adjacent f-strings avoid embedding indentation in the message
            print(f"at epoch {epoch} iteration {i+1} "
                  f"average loss = {tloss/(i+1)}")
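
To help read the loss values printed during training: with its default settings, `nn.CrossEntropyLoss` averages, over all positions, the negative log of the probability the model assigns to the correct next token. The sketch below computes that quantity for a single position in pure Python. The helper `cross_entropy` and the toy distribution are illustrative; the vocabulary size is the 12778 unique tokens reported earlier.

```python
import math

def cross_entropy(probs, target):
    """Negative log of the probability assigned to the correct next token."""
    return -math.log(probs[target])

vocab_size = 12778                       # unique tokens found in section 3.1
uniform = [1.0 / vocab_size] * vocab_size
uniform_loss = cross_entropy(uniform, target=2)
print(round(uniform_loss, 3))            # loss of a model that guesses uniformly

confident = [0.7, 0.2, 0.1]              # toy 3-token distribution
confident_loss = cross_entropy(confident, target=0)
print(round(confident_loss, 3))          # lower loss when the right token is likely
```

A model that has learned nothing would score close to the uniform value, so the average loss should fall well below it as training progresses.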

In [18]:
import pickle

torch.save(model.state_dict(),"files/wordLSTM.pth")
with open("files/word_to_int.p","wb") as fb:    
    pickle.dump(word_to_int, fb) 

# 5.	Generate text with the trained LSTM model


## 5.1	Generate text by predicting the next token

In [19]:
# Skip this if you have trained your own model
# If you use my model, you should run this code cell
# download and unzip https://gattonweb.uky.edu/faculty/lium/gai/wordLSTM.zip

import pickle
# map_location lets weights saved on a GPU load on a CPU-only machine
model.load_state_dict(torch.load("files/wordLSTM.pth",
                                 map_location=device))
with open("files/word_to_int.p","rb") as fb:    
    word_to_int = pickle.load(fb)      
int_to_word={v:k for k,v in word_to_int.items()}

In [20]:
import numpy as np
@torch.no_grad()  # inference only, so skip gradient tracking
def sample(model, prompt, length=200):
    model.eval()
    text = prompt.lower().split(' ')
    hc = model.init_hidden(1)
    length = length - len(text)
    for i in range(0, length):
        # if the text has at most seq_len tokens, use all of it to predict 
        if len(text)<=seq_len:
            x = torch.tensor([[word_to_int[w] for w in text]])
        # otherwise use the last seq_len tokens to predict
        else:
            x = torch.tensor([[word_to_int[w] for w in text[-seq_len:]]])            
        inputs = x.to(device)
        output, hc = model(inputs, hc)
        logits = output[0][-1]
        p = nn.functional.softmax(logits, dim=0).detach().cpu().numpy()
        idx = np.random.choice(len(logits), p=p)
        text.append(int_to_word[idx])
    text=" ".join(text)
    for m in ",.:;?!$()/_&%*@'`":
        text=text.replace(f" {m}", f"{m} ")
    text=text.replace('"  ', '"')   
    text=text.replace("'  ", "'")  
    text=text.replace('" ', '"')   
    text=text.replace("' ", "'")     
    return text  

In [21]:
import torch
torch.manual_seed(42)
np.random.seed(42)
print(sample(model, prompt='Anna and the prince'))  

anna and the prince did not forget what he had not spoken.  when the softening barrier was not so long as he had talked to his brother,  all the hopelessness of the impression.  "official tail,  a man who had tried him,  though he had been able to get across his charge and locked close,  and the light round the snow was in the light of the altar villa.  the article in law levin was first more precious than it was to him so that if it was most easy as it would be as the same.  this was now perfectly interested.  when he had got up close out into the sledge,  but it was locked in the light window with their one grass,  and in the band of the leaves of his projects,  and all the same stupid woman,  and really,  and i swung his arms round that thinking of bed.  a little box with the two boys were with the point of a gleam of filling the boy,  noiselessly signed the bottom of his mouth,  and answering them took the red


## 5.2	Temperature and top-K sampling in text generation

In [22]:
@torch.no_grad()  # inference only, so skip gradient tracking
def generate(model, prompt, top_k=None, 
             length=200, temperature=1):
    model.eval()
    text = prompt.lower().split(' ')
    hc = model.init_hidden(1)
    length = length - len(text)    
    for i in range(0, length):
        # if the text has at most seq_len tokens, use all of it to predict 
        if len(text)<=seq_len:
            x = torch.tensor([[word_to_int[w] for w in text]])
        # otherwise use the last seq_len tokens to predict
        else:
            x = torch.tensor([[word_to_int[w] for w in text[-seq_len:]]])    
        inputs = x.to(device)
        output, hc = model(inputs, hc)
        logits = output[0][-1]
        # scale the logits with the temperature 
        logits = logits/temperature
        p = nn.functional.softmax(logits, dim=0).detach().cpu()    
        if top_k is None:
            idx = np.random.choice(len(logits), p=p.numpy())
        # top-K sampling
        else:
            ps, tops = p.topk(top_k)
            ps=ps/ps.sum()
            idx = np.random.choice(tops, p=ps.numpy())          
        text.append(int_to_word[idx])
    text=" ".join(text)
    for m in ",.:;?!$()/_&%*@'`":
        text=text.replace(f" {m}", f"{m} ")
    text=text.replace('"  ', '"')   
    text=text.replace("'  ", "'")  
    text=text.replace('" ', '"')   
    text=text.replace("' ", "'")     
    return text   
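
The effect of dividing the logits by the temperature, as `generate()` does above, can be seen on a toy example. The sketch below is pure Python; `softmax_with_temperature` and `toy_logits` are illustrative names and values, not part of the chapter's model.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide the logits by the temperature, then apply softmax."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                            # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

A temperature below 1 sharpens the distribution toward the most likely token, while a temperature above 1 flattens it so that less likely tokens are sampled more often.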

In [23]:
# next token using default setting
prompt="I ' m not going to see"
torch.manual_seed(42)
np.random.seed(42)
for _ in range(10):
    print(generate(model, prompt, top_k=None, 
         length=len(prompt.split(" "))+1, temperature=1)) 

i'm not going to see you
i'm not going to see those
i'm not going to see me
i'm not going to see you
i'm not going to see her
i'm not going to see her
i'm not going to see the
i'm not going to see my
i'm not going to see you
i'm not going to see me


In [24]:
# next token using conservative predictions
prompt="I ' m not going to see"
torch.manual_seed(42)
np.random.seed(42)
for _ in range(10):
    print(generate(model, prompt, top_k=3, 
         length=len(prompt.split(" "))+1, temperature=0.5)) 

i'm not going to see you
i'm not going to see the
i'm not going to see her
i'm not going to see you
i'm not going to see you
i'm not going to see you
i'm not going to see you
i'm not going to see her
i'm not going to see you
i'm not going to see her
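
The conservative run above restricts sampling to the three most probable tokens. The sketch below shows the same filter-and-renormalize step on a toy distribution in pure Python; `top_k_filter` and `toy_probs` are illustrative, and `random.choices` stands in for the `np.random.choice` call used in `generate()`.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable indexes and renormalize their probabilities."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    return top, [probs[i] / total for i in top]

toy_probs = [0.5, 0.3, 0.1, 0.06, 0.04]
indexes, renormalized = top_k_filter(toy_probs, k=3)
print(indexes, [round(p, 3) for p in renormalized])

random.seed(0)
draws = random.choices(indexes, weights=renormalized, k=10)
print(draws)   # only the top-3 indexes can appear
```

Tokens outside the top K get zero probability, which is why the conservative run never produces unusual continuations.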


In [25]:
torch.manual_seed(42)
np.random.seed(42)
print(generate(model, prompt='Anna and the prince',
               top_k=3,
               temperature=0.5)) 

anna and the prince had no milk.  but,  "answered levin,  and he stopped.  "i've been skating to look at you all the harrows,  and i'm glad. . .  ""no,  i'm going to the country.  ""no,  it's not a nice fellow.  ""yes,  sir.  ""well,  what do you think about it?  ""why,  what's the matter?  ""yes,  yes,  "answered levin,  smiling,  and he went into the hall.  "yes,  i'll come for him and go away,  "he said,  looking at the crumpled front of his shirt.  "i have not come to see him,  "she said,  and she went out.  "i'm very glad,  "she said,  with a slight bow to the ambassador's hand.  "i'll go to the door.  "she looked at her watch,  and she did not know what to say


In [26]:
# exercise 8.4
torch.manual_seed(0)
np.random.seed(0)
print(generate(model, prompt='Anna and the nurse',
               top_k=10,
               temperature=0.6)) 

In [27]:
# next token using creative predictions
prompt="I ' m not going to see"
torch.manual_seed(42)
np.random.seed(42)
for _ in range(10):
    print(generate(model, prompt, top_k=None, 
         length=len(prompt.split(" "))+1, temperature=2)) 

i'm not going to see them
i'm not going to see scarlatina
i'm not going to see behind
i'm not going to see us
i'm not going to see it
i'm not going to see it
i'm not going to see a
i'm not going to see misery
i'm not going to see another
i'm not going to see seryozha


In [28]:
torch.manual_seed(42)
np.random.seed(42)
print(generate(model, prompt='Anna and the prince',
               top_k=None,
               temperature=2)) 

anna and the prince took sheaves covered suddenly people.  "pyotr marya borissovna,  propped mihail though her son will seen how much evening her husband;  if tomorrow she liked great time too.  "adopted heavens details for it women from this terrible,  admitting this touching all everything ill with flirtation shame consolation altogether:  ivan only all the circle with her honorable carriage in its house dress,  beethoven ashamed had the conversations raised mihailov stay of close i taste work?  "on new farming show ivan nothing.  hat yesterday if interested understand every hundred of two with six thousand roubles according to women living over a thousand:  snetkov possibly try disagreeable schools with stake old glory mysterious one have people some moral conclusion,  got down and then their wreath.  darya alexandrovna thought inwardly peaceful with varenka out of the listen from and understand presented she was impossible anguish.  simply satisfied with staying after presence came

In [29]:
# exercise 8.5
torch.manual_seed(0)
np.random.seed(0)
print(generate(model, prompt='Anna and the nurse',
               top_k=10000,
               temperature=2)) 