# Chapter 9: Text Generation with Character-Level LSTM



Starting with this chapter, we'll discuss text generation. Text is a sequence: the order of the individual characters or words matters. If you switch the positions of the elements in a sequence, you change its meaning, and this dependence on order is what makes sequences hard to predict. The letters in a character-based sequence are arranged in a particular order that cannot be changed without changing the text. This is in contrast to non-sequential data such as cross-sectional data. Imagine you have 1000 loan applications from different individuals and want to predict whether each loan will be approved based on applicant characteristics such as income, gender, and so on. You can switch the positions of the applications of, say, John Smith and Maria Garcia, and this won't affect the prediction outcome.

In this chapter, you'll learn in detail about a special type of neural network that can handle sequential data such as text or time series: recurrent neural networks (RNNs). In most neural networks, the connections between neurons go in one direction. The network starts with the input layer, goes through a series of hidden layers, and finally reaches the output layer. In contrast, RNNs contain recurrent connections: the hidden state computed at one position in the sequence is fed back into the network as an additional input at the next position, so information from earlier elements can influence later predictions.
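
The recurrence can be sketched in a few lines of NumPy. This is only a minimal illustration of a single-layer vanilla RNN, not the model built later in this chapter; the sizes and the parameter names (`W_xh`, `W_hh`, `b_h`) are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 4, 3, 5   # illustrative sizes

# Hypothetical parameters of a single-layer vanilla RNN
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

def rnn_forward(xs, h0):
    """Apply h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h) at every time step."""
    h = h0
    states = []
    for x_t in xs:                       # process the sequence in order
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states)

xs = rng.normal(size=(seq_len, input_size))
states = rnn_forward(xs, np.zeros(hidden_size))
print(states.shape)   # (5, 3): one hidden state per time step
```

Because each hidden state depends on the previous one, feeding the same elements in a different order generally produces different states, which is how the network becomes sensitive to ordering.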

One shortcoming of plain RNNs is that their memory is relatively short. They can detect short patterns in a sequence, but they struggle to learn patterns that span many steps. You'll use a special type of RNN, the long short-term memory (LSTM) network, to address this problem. As the name suggests, LSTM models can capture both short-term and long-term patterns in the data.
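
For reference, the gating mechanism of an LSTM cell is usually written as follows. This is the standard textbook formulation rather than anything specific to this chapter's code, and the notation is introduced here for illustration: $x_t$ is the input at step $t$, $h_t$ the hidden state, $c_t$ the cell state, $\sigma$ the sigmoid function, and $\odot$ element-wise multiplication.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f [h_{t-1}, x_t] + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\left(W_i [h_{t-1}, x_t] + b_i\right) && \text{(input gate)} \\
g_t &= \tanh\left(W_g [h_{t-1}, x_t] + b_g\right) && \text{(candidate values)} \\
o_t &= \sigma\left(W_o [h_{t-1}, x_t] + b_o\right) && \text{(output gate)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

The cell state $c_t$ is updated additively, which lets information (and gradients) persist across many more time steps than the hidden state of a plain RNN.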

You'll treat text as a sequence of characters. The LSTM model learns the statistical patterns in the training dataset. After training, you can ask the model to predict the next character based on the prompt. You then add the prediction to the end of the prompt to form the new prompt. You repeat the process until the text reaches a certain length. 
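
The predict-append-repeat loop can be illustrated without any neural network. The sketch below swaps the LSTM for a simple table of character-pair counts; the function names `fit_bigram` and `generate` and the toy corpus are assumptions made for this example only.

```python
import random
from collections import Counter, defaultdict

def fit_bigram(text):
    """Count, for every character, which characters follow it."""
    counts = defaultdict(Counter)
    for current, following in zip(text, text[1:]):
        counts[current][following] += 1
    return counts

def generate(counts, prompt, length, seed=0):
    """Repeatedly predict one character and append it to the prompt."""
    rng = random.Random(seed)
    out = prompt
    while len(out) < length:
        followers = counts.get(out[-1])
        if not followers:        # no known successor for the last character
            break
        chars, weights = zip(*followers.items())
        out += rng.choices(chars, weights=weights)[0]
    return out

corpus = "to be or not to be that is the question"
bigram_counts = fit_bigram(corpus)
generated = generate(bigram_counts, prompt="to", length=30)
print(generated)
```

The LSTM in this chapter plays the role of the count table: it supplies the probabilities for the next character, but it conditions on the whole prompt rather than on the last character alone.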

Start a new cell in ch11.ipynb and execute the following lines of code in it:

In [1]:
import os

os.makedirs("files/ch11", exist_ok=True)

# 1. Character-Level Tokenization
We'll use the complete works of William Shakespeare from Project Gutenberg. Download the text file from https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt and save it as *shakespeare.txt* in the folder /Desktop/ai/files/ch11/ on your computer. Remove the top and bottom paragraphs from the text file so that it contains only the works of Shakespeare, not the descriptions added by Project Gutenberg.

The code in this chapter is adapted from two wonderful GitHub repositories: Andrej Karpathy's char-rnn https://github.com/karpathy/char-rnn and Carlos Lara's Character-Level LSTM in PyTorch https://github.com/LeanManager/NLP-PyTorch.

## 1.1. Preprocess the Data
We first load the text file to Python and see how many individual characters the text contains:

In [2]:
with open("files/ch11/shakespeare.txt","r") as f:
    file=f.read()
characters=set(file)  
vocab_size=len(characters)
print(vocab_size)

84


The output above shows that there are 84 different characters in the text, including punctuation marks.

Next, we create two dictionaries: one maps a character to an integer and the other an integer to the character. 

In [3]:
char_to_int={}
idx=0
for char in characters:
    char_to_int[char]=idx
    idx+=1
print(char_to_int)

{'D': 0, "'": 1, '-': 2, '_': 3, '>': 4, 'Y': 5, 'j': 6, 'O': 7, 'm': 8, '|': 9, '2': 10, '\n': 11, 'L': 12, 'A': 13, 'g': 14, 'K': 15, 'M': 16, 'H': 17, '!': 18, 'Q': 19, 'V': 20, 'x': 21, 'B': 22, 'u': 23, 'F': 24, 'r': 25, '0': 26, '}': 27, 'a': 28, 'T': 29, 'w': 30, 'X': 31, 'y': 32, '<': 33, '8': 34, 'W': 35, ',': 36, '5': 37, 's': 38, 'U': 39, 't': 40, 'I': 41, ';': 42, '(': 43, 'f': 44, ']': 45, 'q': 46, 'p': 47, 'h': 48, '"': 49, 'R': 50, 'c': 51, 'G': 52, 'e': 53, 'i': 54, 'S': 55, '6': 56, 'v': 57, '[': 58, 'Z': 59, '.': 60, 'J': 61, 'z': 62, '9': 63, 'E': 64, ':': 65, '4': 66, 'b': 67, 'k': 68, 'n': 69, 'C': 70, 'd': 71, '3': 72, '?': 73, '`': 74, '&': 75, 'o': 76, '1': 77, '7': 78, ')': 79, 'l': 80, 'N': 81, ' ': 82, 'P': 83}


We'll switch the keys and values in the dictionary *char_to_int* and create a new dictionary *int_to_char*, like so:

In [4]:
int_to_char={v:k for k,v in char_to_int.items()}
print(int_to_char)

{0: 'D', 1: "'", 2: '-', 3: '_', 4: '>', 5: 'Y', 6: 'j', 7: 'O', 8: 'm', 9: '|', 10: '2', 11: '\n', 12: 'L', 13: 'A', 14: 'g', 15: 'K', 16: 'M', 17: 'H', 18: '!', 19: 'Q', 20: 'V', 21: 'x', 22: 'B', 23: 'u', 24: 'F', 25: 'r', 26: '0', 27: '}', 28: 'a', 29: 'T', 30: 'w', 31: 'X', 32: 'y', 33: '<', 34: '8', 35: 'W', 36: ',', 37: '5', 38: 's', 39: 'U', 40: 't', 41: 'I', 42: ';', 43: '(', 44: 'f', 45: ']', 46: 'q', 47: 'p', 48: 'h', 49: '"', 50: 'R', 51: 'c', 52: 'G', 53: 'e', 54: 'i', 55: 'S', 56: '6', 57: 'v', 58: '[', 59: 'Z', 60: '.', 61: 'J', 62: 'z', 63: '9', 64: 'E', 65: ':', 66: '4', 67: 'b', 68: 'k', 69: 'n', 70: 'C', 71: 'd', 72: '3', 73: '?', 74: '`', 75: '&', 76: 'o', 77: '1', 78: '7', 79: ')', 80: 'l', 81: 'N', 82: ' ', 83: 'P'}
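
Note that the order in which Python iterates over a set of strings is not guaranteed to be the same in a different session, so the integers you see may differ from the ones printed above. This matters if you save a trained model and reload it in a new session, because the model's weights are tied to one particular mapping. One possible way to make the mapping reproducible is to sort the characters first. The sketch below uses a short stand-in string, and the `demo_` names are hypothetical so that they do not overwrite the dictionaries created above.

```python
sample_text = "hello world"   # stand-in for the Shakespeare text

# sorted() fixes the order, so the mapping is identical in every session
demo_vocab = sorted(set(sample_text))
demo_char_to_int = {ch: i for i, ch in enumerate(demo_vocab)}
demo_int_to_char = {i: ch for ch, i in demo_char_to_int.items()}

demo_encoded = [demo_char_to_int[ch] for ch in sample_text]
demo_decoded = "".join(demo_int_to_char[i] for i in demo_encoded)

print(demo_vocab)
print(demo_encoded)
assert demo_decoded == sample_text   # encoding then decoding is lossless
```

The same idea would apply to the full text by building the vocabulary with `sorted(set(file))` and passing that list wherever the character set is used.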


We'll encode the text to integer numbers so that we can further change them to one-hot variables later before feeding them to the model. 

In [5]:
encoded_text=[char_to_int[char] for char in file]
print(encoded_text[0:20])

[77, 56, 26, 63, 11, 11, 29, 17, 64, 82, 55, 7, 81, 81, 64, 29, 55, 11, 11, 67]


Now the text is encoded as integers. The result above shows the first 20 elements in the encoded file. We'll also define a *onehot_encode()* function to change an integer into a one-hot variable with a depth equal to the number of characters.

In [6]:
import numpy as np

def onehot_encode(arr, depth=vocab_size):
    one_hot = np.zeros((np.multiply(*arr.shape), depth),
                       dtype=np.float32)
    one_hot[np.arange(one_hot.shape[0]), arr.flatten()] = 1.
    one_hot = one_hot.reshape((*arr.shape, depth))
    return one_hot
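
To see what one-hot encoding produces, here is a small sketch on a toy array. It uses a different technique (indexing into an identity matrix with `np.eye`) that gives the same kind of result as *onehot_encode()*; the function name `onehot_demo` and the vocabulary size of 4 are assumptions for this example.

```python
import numpy as np

def onehot_demo(arr, depth):
    """One-hot encode integer indices by looking up rows of an identity matrix."""
    return np.eye(depth, dtype=np.float32)[arr]

toy = np.array([[0, 2, 1],
                [3, 3, 0]])               # 2 sequences, 3 characters each
toy_onehot = onehot_demo(toy, depth=4)    # pretend the vocabulary has 4 characters

print(toy_onehot.shape)    # (2, 3, 4)
print(toy_onehot[0, 1])    # [0. 0. 1. 0.] -> position 2 is "hot"
```

Each integer becomes a vector of length `depth` containing a single 1, so a batch of shape (sequences, steps) turns into an array of shape (sequences, steps, depth).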

## 1.2.  Create Batches

We'll organize the text into different batches so that we can feed them to the model to train the LSTM network. 

In [7]:
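def create_batch(arr, n_seqs, n_steps):
    # Number of characters consumed by one batch
    batch_size = n_seqs * n_steps
    n_batches = len(arr)//batch_size
    # Drop trailing characters that do not fill a complete batch
    arr = arr[:n_batches * batch_size]
    # One row per sequence; each row is a contiguous slice of the text
    arr = arr.reshape((n_seqs, -1))
    for n in range(0, arr.shape[1], n_steps):
        x = arr[:, n:n+n_steps]
        # Targets are the inputs shifted by one character
        y = np.zeros_like(x)
        try:
            y[:, :-1], y[:, -1] = x[:, 1:], arr[:, n+n_steps]
        except IndexError:
            # Last batch: there is no next column, so wrap to the first one
            y[:, :-1], y[:, -1] = x[:, 1:], arr[:, 0]
        yield x, y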

To make sure that the function works as expected, you can set *n_seqs* to 5 and *n_steps* to 10. That is, each batch contains five sequences and each sequence has 10 characters: 

In [8]:
import numpy as np

batch=create_batch(np.array(encoded_text),5,10)
x,y=next(batch)
print(x)
print(y)

[[77 56 26 63 11 11 29 17 64 82]
 [31 29 82  7 24 82 41 12 12 41]
 [76 67 80 53 82 38 47 54 25 54]
 [53 28 57 53 82 48 54  8 11 82]
 [32 82 48 54 38 82 67 53 71 36]]
[[56 26 63 11 11 29 17 64 82 55]
 [29 82  7 24 82 41 12 12 41 81]
 [67 80 53 82 38 47 54 25 54 40]
 [28 57 53 82 48 54  8 11 82 82]
 [82 48 54 38 82 67 53 71 36 11]]


The results above show that y is x shifted by one character: each entry of y is the character that immediately follows the corresponding entry of x in the text. We'll use x as features and y as targets. By training on data like this, the model learns to predict the next character based on the characters before it.
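
The relationship between x and y is easier to see on a toy input where the "characters" are simply the integers 0 to 19. The sketch below repeats the *create_batch()* logic so that it runs on its own; the variable names `toy` and `batches` are chosen for this example.

```python
import numpy as np

def create_batch(arr, n_seqs, n_steps):
    # Same logic as the function defined earlier in this section
    batch_size = n_seqs * n_steps
    n_batches = len(arr) // batch_size
    arr = arr[:n_batches * batch_size]
    arr = arr.reshape((n_seqs, -1))
    for n in range(0, arr.shape[1], n_steps):
        x = arr[:, n:n + n_steps]
        y = np.zeros_like(x)
        try:
            y[:, :-1], y[:, -1] = x[:, 1:], arr[:, n + n_steps]
        except IndexError:
            y[:, :-1], y[:, -1] = x[:, 1:], arr[:, 0]
        yield x, y

toy = np.arange(20)     # 0, 1, ..., 19 stand in for encoded characters
batches = list(create_batch(toy, n_seqs=2, n_steps=5))

for xb, yb in batches:
    print(xb)
    print(yb)
    print()
```

With these settings the text is split into two rows of ten values, and each batch takes five columns. In the first batch every target is the input plus one. In the last batch there is no next column, so the final target in each row wraps around to the first value of that row (0 and 10 here).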

# 2. Build and Train the LSTM Model
We'll use the built-in LSTM layer in PyTorch to create the model.

## 2.1. The Model Structure
We first import needed modules:

In [9]:
from torch import nn
import torch
import torch.nn.functional as F
device="cuda" if torch.cuda.is_available() else "cpu"

We then define a *CharLSTM()* class to represent the model.

In [10]:
class CharLSTM(nn.Module):
    def __init__(self, vocab, n_steps=100, n_hidden=512,
             n_layers=2, drop_prob=0.5, lr=0.001):
        super().__init__()
        self.drop_prob = drop_prob
        self.n_layers = n_layers
        self.n_hidden = n_hidden
        self.lr = lr
        self.vocab = vocab
        self.int2char = dict(enumerate(self.vocab))
        self.char2int = {v:k for k,v in self.int2char.items()}
        self.lstm = nn.LSTM(len(self.vocab), n_hidden, n_layers, 
                        dropout=drop_prob, batch_first=True)
        self.dropout = nn.Dropout(drop_prob)
        self.fc = nn.Linear(n_hidden, len(self.vocab))
        self.init_weights()
        
    def forward(self, x, hc):
        x, (h, c) = self.lstm(x, hc)
        x = self.dropout(x)
        x = x.reshape(x.size()[0]*x.size()[1], self.n_hidden)
        x = self.fc(x)
        return x, (h, c)      

    def predict(self, char, h=None, top_k=None):        
        if h is None:
            h = self.init_hidden(1)
        x = np.array([[self.char2int[char]]])
        x = onehot_encode(x, len(self.vocab))
        inputs = torch.from_numpy(x).to(device)   
        h = tuple([each.data for each in h])
        out, h = self.forward(inputs, h)
        p = F.softmax(out, dim=1).data
        p = p.cpu()
        if top_k is None:
            top_ch = np.arange(len(self.vocab))
        else:
            p, top_ch = p.topk(top_k)
            top_ch = top_ch.numpy().squeeze()
        p = p.numpy().squeeze()
        num = np.random.choice(top_ch, p=p/p.sum())  
        return self.int2char[num], h
    
    def init_weights(self):
        self.fc.bias.data.fill_(0)
        self.fc.weight.data.uniform_(-1, 1)
        
    def init_hidden(self, n_seqs):
        weight = next(self.parameters()).data
        return (weight.new(self.n_layers,
                           n_seqs, self.n_hidden).zero_(),
                weight.new(self.n_layers,
                           n_seqs, self.n_hidden).zero_())  
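
To follow what *forward()* does to the data, it can help to trace the tensor shapes through a stand-alone LSTM layer and linear layer. The sketch below uses the layer sizes from this chapter (84 characters, 512 hidden units, 2 layers) but a small, arbitrary batch of 4 sequences with 10 steps each; the variable names are chosen for this example and the inputs are zeros rather than real one-hot data.

```python
import torch
from torch import nn

vocab_len, n_hidden, n_layers = 84, 512, 2   # sizes used in this chapter
n_seqs_demo, n_steps_demo = 4, 10            # small illustrative batch

lstm = nn.LSTM(vocab_len, n_hidden, n_layers, batch_first=True)
fc = nn.Linear(n_hidden, vocab_len)

demo_inputs = torch.zeros(n_seqs_demo, n_steps_demo, vocab_len)  # stand-in for one-hot input
h0 = torch.zeros(n_layers, n_seqs_demo, n_hidden)                # initial hidden state
c0 = torch.zeros(n_layers, n_seqs_demo, n_hidden)                # initial cell state

out, (h_n, c_n) = lstm(demo_inputs, (h0, c0))
logits = fc(out.reshape(n_seqs_demo * n_steps_demo, n_hidden))

print(out.shape)      # torch.Size([4, 10, 512])
print(h_n.shape)      # torch.Size([2, 4, 512])
print(logits.shape)   # torch.Size([40, 84])
```

With `batch_first=True` the LSTM output has one 512-dimensional vector per character position. Flattening the first two dimensions gives one row of 84 scores per character, which is the layout the loss function expects when the targets are flattened in the same way.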

## 2.2. Create the Model
We first instantiate a model as follows:

In [11]:
model=CharLSTM(characters).to(device)
print(model)

CharLSTM(
  (lstm): LSTM(84, 512, num_layers=2, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.5, inplace=False)
  (fc): Linear(in_features=512, out_features=84, bias=True)
)


The optimizer and the loss function are as follows:

In [12]:
lr=0.001
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
loss_func = nn.CrossEntropyLoss()
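
To put the loss values reported during training into context, it helps to know what an untrained guess would score. With its default settings, *nn.CrossEntropyLoss()* averages the negative natural-log probability assigned to the correct character, so a model that spreads its probability evenly over all 84 characters has a loss of ln(84), about 4.43. The sketch below computes this baseline and converts a loss into perplexity; the helper name `perplexity` and the example loss of 1.2 are assumptions for illustration.

```python
import math

n_chars = 84                         # number of distinct characters found earlier
uniform_loss = math.log(n_chars)     # cross-entropy of a uniform guess, in nats
print(f"uniform-guess loss: {uniform_loss:.3f}")

def perplexity(loss):
    """Convert an average cross-entropy (in nats) to perplexity."""
    return math.exp(loss)

print(f"perplexity at loss 1.2: {perplexity(1.2):.2f}")
```

Perplexity can be read as the effective number of equally likely characters the model is choosing among at each step, so lower values mean more confident predictions.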

We'll train the model next.

# 3. Train the Model
We first define some hyperparameter values and get ready for training:

In [13]:
n_seqs=128
n_steps=100
data=np.array(encoded_text)
model.train()

CharLSTM(
  (lstm): LSTM(84, 512, num_layers=2, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.5, inplace=False)
  (fc): Linear(in_features=512, out_features=84, bias=True)
)

We then train the model for 30 epochs, as follows:

In [14]:
for epoch in range(30):
    tloss=0
    h = model.init_hidden(n_seqs)
    n = 0
    for x, y in create_batch(data, n_seqs, n_steps):
        x = onehot_encode(x, vocab_size)
        inputs, targets = torch.from_numpy(x), torch.from_numpy(y)
        inputs, targets = inputs.to(device), targets.to(device)
        h = tuple([each.data for each in h])
        model.zero_grad()
        output, h = model(inputs, h)
        loss = loss_func(output, 
         targets.view(n_seqs*n_steps).long())
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), 1)
        optimizer.step()
        tloss+=loss.item()
        n+=1
    print(f"at epoch {epoch} total loss = {tloss/n}")

at epoch 0 total loss = 2.215922442604514
at epoch 1 total loss = 1.718651362026439
at epoch 2 total loss = 1.5548691931892844
at epoch 3 total loss = 1.4578687614553114
at epoch 4 total loss = 1.3951661878473618
at epoch 5 total loss = 1.3497756054822136
at epoch 6 total loss = 1.3160600796867818
at epoch 7 total loss = 1.2894528963986565
at epoch 8 total loss = 1.2677818831275491
at epoch 9 total loss = 1.2501716072419111
at epoch 10 total loss = 1.235118774806752
at epoch 11 total loss = 1.2218814403870526
at epoch 12 total loss = 1.21080896209268
at epoch 13 total loss = 1.200096674526439
at epoch 14 total loss = 1.1929191334107343
at epoch 15 total loss = 1.1838588605207556
at epoch 16 total loss = 1.1758920436746934
at epoch 17 total loss = 1.1689885487275966
at epoch 18 total loss = 1.1626772577622357
at epoch 19 total loss = 1.1572373996061438
at epoch 20 total loss = 1.151656112951391
at epoch 21 total loss = 1.1466644763946534
at epoch 22 total loss = 1.1412639940486236
at epoch 23 total loss = 1.13

If you are using a GPU, training takes 30 minutes or so. If you use a CPU only, it may take an hour or so, depending on your hardware.
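
The training loop calls *nn.utils.clip_grad_norm_()* with a maximum norm of 1 before each optimizer step. This rescales the gradients whenever their combined norm exceeds the limit, a common precaution against exploding gradients in recurrent networks. The sketch below shows the effect on a hypothetical tiny linear layer with a deliberately inflated loss; none of these names come from the chapter's model.

```python
import torch
from torch import nn

torch.manual_seed(0)
tiny_layer = nn.Linear(10, 10)                 # hypothetical stand-in model
demo_loss = (tiny_layer(torch.randn(8, 10)) * 100).pow(2).mean()  # deliberately large
demo_loss.backward()

# clip_grad_norm_ rescales the gradients in place and returns the norm
# measured before clipping
norm_before = nn.utils.clip_grad_norm_(tiny_layer.parameters(), max_norm=1.0)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in tiny_layer.parameters()))

print(float(norm_before))
print(float(norm_after))
```

After clipping, the combined gradient norm is at most the chosen maximum, while the direction of the update is unchanged.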

Next, we save the model on the local computer:

In [15]:
torch.save(model.state_dict(),"files/ch11/LSTMchar.pth")

# 4. Use the Trained Model to Generate Text
We can use the trained model to generate text. We first define the following *sample()* function:

In [16]:
def sample(model, size, prompt='The', top_k=None):
    model.to(device)
    model.eval()
    chars = [ch for ch in prompt]
    h = model.init_hidden(1)
    for ch in prompt:
        char, h = model.predict(ch, h, top_k=top_k)
    chars.append(char)
    for j in range(size):
        char, h = model.predict(chars[-1], h, top_k=top_k)
        chars.append(char)
    return ''.join(chars)
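
The *top_k* argument is passed on to *predict()*, which keeps only the k most probable characters, renormalizes their probabilities, and samples among them. The NumPy sketch below illustrates that idea in isolation; the function name `sample_top_k` and the five-entry probability vector are assumptions made for this example.

```python
import numpy as np

def sample_top_k(probs, k, rng):
    """Sample an index from the k most probable entries of probs."""
    top_idx = np.argsort(probs)[-k:]                 # indices of the k largest values
    top_p = probs[top_idx] / probs[top_idx].sum()    # renormalize so they sum to 1
    return int(rng.choice(top_idx, p=top_p))

rng = np.random.default_rng(0)
probs = np.array([0.02, 0.40, 0.03, 0.30, 0.25])     # hypothetical next-character distribution
draws = [sample_top_k(probs, k=3, rng=rng) for _ in range(1000)]

print(sorted(set(draws)))   # only indices 1, 3 and 4 can appear
```

Restricting the choice to the top few candidates removes unlikely characters that would otherwise occasionally be sampled, at the cost of less varied output.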

We then reload the model:

In [17]:
# map_location lets weights saved on a GPU be loaded on a CPU-only machine
model.load_state_dict(torch.load("files/ch11/LSTMchar.pth",
                                 map_location=device))

<All keys matched successfully>

Finally, we call the *sample()* function and use *The man* as the prompt to generate text up to 1000 characters.

In [18]:
print(sample(model, 1000, prompt='The man', top_k=5))

The manners that his pains
      Shall hear thee at your powers, which he is stay.

            Enter CORIOLANUS, ALONSO, and, as to the Duke.
                Alexandria. The COUNT'S palace

Enter ARMADO, the COSTARD

  CASSIO. Why, madam!
  CLOTEN. Thou wert not for my fortune; thy son, sir.
  SEBASTIAN. Why, what as you do think?
    I am sent in the poor comfort and this friends,
    And see his happiness is as he charge.
    Who can say so to your lieutenant?
  CORIOLANUS. He had to hold this to the pain of silence.
  ANGELO. I shall be sorry, to the proud credit her war.
ANTIPHOLUS OF EPHESUS. And, she is nothing.

                        Enter CORIOLANUS

    He is my ludge, I will say the King's death.
  PROTEUS. What, mother, sir?
  PAROLLES. I do deliver her.  
  BARDOLPH. A gentleman all a milk to make my hist till herself and her her
    will to me, my good lord, it is any steps with a most, a maid
    and stop of man is a song and to th' encounter of Salisbory.
  CADE. Is i

The results are actually not bad. The model certainly has grasped the Shakespearean style!