In [None]:
"""
The MIT License (MIT)
Copyright (c) 2021 NVIDIA
Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
"""


This code example demonstrates how to use an LSTM-based neural network and beam search to do text autocompletion. More context for this code example can be found in the section "Programming Example: Using LSTM for Text Autocompletion" in Chapter 11 in the book Learning Deep Learning by Magnus Ekman (ISBN: 9780137470358).


The initialization code is shown in the first code snippet. Apart from the import statements, we need to provide the path to the text file to use for training. We also define two variables, WINDOW_LENGTH and WINDOW_STEP, which are used to control the process of splitting up this text file into multiple training examples. The other three variables control the beam-search algorithm and are described shortly. The text used to train the model is assumed to be in the file ../data/frankenstein.txt.

In [None]:
!pip install --upgrade silabs-mltk


In [13]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader
from sklearn.model_selection import train_test_split
import numpy as np
from utilities import train_model

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
EPOCHS = 32
BATCH_SIZE = 256
INPUT_FILE_NAME = '../data/frankenstein.txt'
WINDOW_LENGTH = 40
WINDOW_STEP = 3
BEAM_SIZE = 8
NUM_LETTERS = 11
MAX_LENGTH = 50


The next code snippet opens and reads the content of the file, converts it to lowercase, replaces newlines with spaces, and replaces double spaces with single spaces. To make it easy to one-hot encode each character, we assign a unique integer index to each character. This is done by first creating a sorted list of unique characters, so that the index assignment is the same from run to run. Once we have that list, we can loop over it and assign an incrementing index to each character. We do this twice to create one dictionary (a hash table) that maps from character to index and a reverse dictionary from index to character. These will come in handy later when we want to convert text into one-hot encoded input to the network as well as when we want to convert the network output into characters. Finally, we initialize a variable encoding_width with the count of unique characters, which will be the width of each one-hot encoded vector that represents a character.

In [14]:
# Open the input file and read its content.
with open(INPUT_FILE_NAME, 'r', encoding='utf-8-sig') as file:
    text = file.read()

# Make lower-case and remove newline and extra spaces.
text = text.lower()
text = text.replace('\n', ' ')
text = text.replace('  ', ' ')

# Encode characters as indices.
# Sorting makes the mapping reproducible (set order can vary between runs).
unique_chars = sorted(set(text))
char_to_index = {ch: index for index, ch in enumerate(unique_chars)}
index_to_char = {index: ch for index, ch in enumerate(unique_chars)}
encoding_width = len(char_to_index)
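
The mapping described above can be tried on a short string before running it on the whole book. The following is a minimal sketch: sample_text and the other sample_ names are made up for illustration, and sorting the characters is an assumption made here so that the indices are predictable.

```python
# Toy illustration of the character-to-index mapping.
# sample_text is a made-up string, not taken from the training file.
sample_text = 'the  body\nof the'
sample_text = sample_text.lower().replace('\n', ' ').replace('  ', ' ')

# Assumption: sort the unique characters so the indices are reproducible.
sample_chars = sorted(set(sample_text))
sample_char_to_index = {ch: index for index, ch in enumerate(sample_chars)}
sample_index_to_char = {index: ch for index, ch in enumerate(sample_chars)}
sample_width = len(sample_char_to_index)

# Round trip: text -> indices -> text.
encoded = [sample_char_to_index[ch] for ch in sample_text]
decoded = ''.join(sample_index_to_char[i] for i in encoded)
print(sample_chars)
print(encoded)
print(decoded)
```

The round trip through the two dictionaries returns the cleaned-up string, which is the property the later encoding and decoding steps rely on.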


The next step is to create training examples from the text file. This is done by the next code snippet. Each training example will consist of a sequence of characters and a target output value of a single character immediately following the input characters. We create these input examples using a sliding window of length WINDOW_LENGTH. Once we have created one training example, we slide the window by WINDOW_STEP positions and create the next training example. We add the input examples to one list and the output values to another. All of this is done by the first for loop.

We then create a single array holding all the input examples and another array holding the target values. The input array holds the data in one-hot encoded form, so each character is represented by a dimension of size encoding_width. The target array instead holds the integer index of each target character, which is the form that nn.CrossEntropyLoss expects. We first allocate space for the two arrays and then fill in the values using a nested for loop.


In [15]:
# Create training examples.
fragments = []
targets = []
for i in range(0, len(text) - WINDOW_LENGTH, WINDOW_STEP):
    fragments.append(text[i: i + WINDOW_LENGTH])
    targets.append(text[i + WINDOW_LENGTH])

# One-hot encode the inputs; store each target as an integer class index.
X = np.zeros((len(fragments), WINDOW_LENGTH, encoding_width), dtype=np.float32)
y = np.zeros(len(fragments), dtype=np.int64)
for i, fragment in enumerate(fragments):
    for j, char in enumerate(fragment):
        X[i, j, char_to_index[char]] = 1
    target_char = targets[i]
    y[i] = char_to_index[target_char]
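
To see what the sliding window produces, it can help to run the same loops on a tiny input. This is a sketch with made-up values (a ten-character string, a window length of 4, and a step of 3); all toy_ names are hypothetical and exist only for this illustration.

```python
import numpy as np

# Made-up stand-ins for the text, WINDOW_LENGTH and WINDOW_STEP.
toy_text = 'abcdefghij'
toy_window_length = 4
toy_window_step = 3
toy_char_to_index = {ch: i for i, ch in enumerate(sorted(set(toy_text)))}
toy_width = len(toy_char_to_index)

# Slide the window over the text, collecting inputs and targets.
toy_fragments, toy_targets = [], []
for i in range(0, len(toy_text) - toy_window_length, toy_window_step):
    toy_fragments.append(toy_text[i: i + toy_window_length])
    toy_targets.append(toy_text[i + toy_window_length])

# One-hot encode the inputs; keep targets as integer indices.
toy_X = np.zeros((len(toy_fragments), toy_window_length, toy_width),
                 dtype=np.float32)
toy_y = np.zeros(len(toy_fragments), dtype=np.int64)
for i, fragment in enumerate(toy_fragments):
    for j, char in enumerate(fragment):
        toy_X[i, j, toy_char_to_index[char]] = 1
    toy_y[i] = toy_char_to_index[toy_targets[i]]

print(toy_fragments, toy_targets)
print(toy_X.shape, toy_y)
```

Each row of toy_X contains exactly one 1 per timestep, and each entry of toy_y is the index of the character that follows the corresponding window.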


We then split the dataset into a training dataset and a test dataset using the train_test_split function from scikit-learn. We withhold only 5% of the examples as the test dataset. Given that we are mostly interested in inspecting the resulting autocompletions in this example, we could have skipped creating a test dataset altogether, but we create it anyway for good practice.

We then convert the arrays into tensors and create Dataset objects.


In [16]:
# Split into training and test set.
train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.05, random_state=0)

# Create Dataset objects.
trainset = TensorDataset(torch.from_numpy(train_X), torch.from_numpy(train_y))
testset = TensorDataset(torch.from_numpy(test_X), torch.from_numpy(test_y))
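
The following sketch shows how a TensorDataset pairs inputs with targets and how a DataLoader splits it into mini-batches. The toy_ names and sizes are made up, and the assumption that train_model wraps the datasets in DataLoader objects in a similar way is not confirmed by this notebook.

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset, DataLoader

# Stand-in data: 10 examples, 4 timesteps, 6-wide one-hot vectors.
toy_X = np.zeros((10, 4, 6), dtype=np.float32)
toy_y = np.arange(10, dtype=np.int64) % 6
toy_set = TensorDataset(torch.from_numpy(toy_X), torch.from_numpy(toy_y))

# Assumption: a training helper would iterate over batches like this.
toy_loader = DataLoader(toy_set, batch_size=4, shuffle=False)
batch_shapes = [(tuple(inputs.shape), tuple(targets.shape))
                for inputs, targets in toy_loader]
print(len(toy_set), batch_shapes)
```

With 10 examples and a batch size of 4, the last batch is smaller than the others, which is also what happens at the end of each epoch when the dataset size is not a multiple of BATCH_SIZE.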


We are now ready to build our model. From the perspective of training our model, it will look similar to the book sales prediction example, but we use a deeper model consisting of two LSTM layers (indicated by the argument num_layers to the nn.LSTM object). We want both LSTM layers to use a dropout value of 0.2. However, the nn.LSTM implementation does not apply dropout to the top layer, so we stack a separate Dropout layer on top of the LSTM object.

Just as in c9e1_rnn_book_sales, we have to create a custom layer to retrieve the correct set of outputs from the nn.LSTM object. The return value from LSTM is slightly more complex in that it returns both the output state and the internal cell state of each layer, so we now need yet another index to select only the output state. That is, the first index (1) selects the tuple of final states rather than the per-timestep outputs, the second index (0) indicates that we want the output state instead of the cell state, and the third index (1) indicates that we want the output of the second LSTM layer.
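
The indexing can be checked on a small LSTM with made-up dimensions. This is a sketch, not part of the model itself: the sizes (5 input features, 7 hidden units, batch of 3, 4 timesteps) are arbitrary, and dropout is left out so the comparison is deterministic.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Arbitrary toy sizes; only num_layers=2 and batch_first=True mirror the model.
lstm = nn.LSTM(input_size=5, hidden_size=7, num_layers=2, batch_first=True)
x = torch.randn(3, 4, 5)  # (batch, timesteps, features)

result = lstm(x)
output, (h_n, c_n) = result  # per-timestep outputs and final states

# Same indexing as the custom layer: states tuple -> h_n -> second layer.
picked = result[1][0][1]

print(output.shape)  # (batch, timesteps, hidden)
print(h_n.shape)     # (num_layers, batch, hidden)
print(picked.shape)  # (batch, hidden)
```

For a unidirectional LSTM, the selected tensor holds the same values as the top layer's output at the last timestep, output[:, -1, :].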

We end with a fully connected layer with one neuron per character, because we will be predicting probabilities for discrete entities (characters). The layer outputs raw logits rather than probabilities: the softmax function is not part of the model, because nn.CrossEntropyLoss applies it internally. Categorical cross-entropy is the recommended loss function for multicategory classification.
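
The relationship between the logits, softmax, and the loss function can be verified numerically. The sketch below uses made-up logits and targets for two examples and three classes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Made-up logits for 2 examples and 3 classes, plus target class indices.
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])
targets = torch.tensor([0, 2])

# CrossEntropyLoss works directly on logits and integer targets.
loss_a = nn.CrossEntropyLoss()(logits, targets)

# Equivalent: log-softmax followed by negative log-likelihood.
loss_b = F.nll_loss(F.log_softmax(logits, dim=1), targets)

# Probabilities are only needed at prediction time.
probs = F.softmax(logits, dim=1)
print(loss_a.item(), loss_b.item())
print(probs.sum(dim=1))
```

This is why the model can omit the softmax during training and why it has to be applied explicitly when the outputs are interpreted as probabilities.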

Finally, we train the model for 32 epochs with a mini-batch size of 256.


In [18]:
# Define model.
class LastTimestep(nn.Module):
    def forward(self, inputs):
        return inputs[1][0][1] # Return hidden state for last timestep.

model = nn.Sequential(
    nn.LSTM(encoding_width, 128, num_layers=2, dropout=0.2, batch_first=True),
    LastTimestep(),
    nn.Dropout(0.2), # Add this since PyTorch LSTM does not apply dropout to top layer.
    nn.Linear(128, encoding_width)
)

# Loss function and optimizer
optimizer = torch.optim.Adam(model.parameters())
loss_function = nn.CrossEntropyLoss()

# Train the model.
train_model(model, device, EPOCHS, BATCH_SIZE, trainset, testset,
            optimizer, loss_function, 'acc')


Epoch 1/32 loss: 2.8277 - acc: 0.2136 - val_loss: 2.4480 - val_acc: 0.2893
Epoch 2/32 loss: 2.3474 - acc: 0.3175 - val_loss: 2.2391 - val_acc: 0.3304
Epoch 3/32 loss: 2.2058 - acc: 0.3495 - val_loss: 2.1310 - val_acc: 0.3609
Epoch 4/32 loss: 2.1119 - acc: 0.3751 - val_loss: 2.0574 - val_acc: 0.3807
Epoch 5/32 loss: 2.0315 - acc: 0.3975 - val_loss: 1.9690 - val_acc: 0.3979
Epoch 6/32 loss: 1.9585 - acc: 0.4173 - val_loss: 1.9036 - val_acc: 0.4227
Epoch 7/32 loss: 1.8999 - acc: 0.4324 - val_loss: 1.8534 - val_acc: 0.4375
Epoch 8/32 loss: 1.8451 - acc: 0.4470 - val_loss: 1.7977 - val_acc: 0.4470
Epoch 9/32 loss: 1.7991 - acc: 0.4603 - val_loss: 1.7547 - val_acc: 0.4607
Epoch 10/32 loss: 1.7576 - acc: 0.4722 - val_loss: 1.7179 - val_acc: 0.4647
Epoch 11/32 loss: 1.7220 - acc: 0.4821 - val_loss: 1.6843 - val_acc: 0.4766
Epoch 12/32 loss: 1.6881 - acc: 0.4908 - val_loss: 1.6526 - val_acc: 0.4870
Epoch 13/32 loss: 1.6595 - acc: 0.4988 - val_loss: 1.6279 - val_acc: 0.4968
Epoch 14/32 loss: 1.6

[0.5805607504826255, 0.5477120535714286]

The next step is to implement the beam search algorithm to predict text. Consult the section "Beam Search" in Chapter 11 for more details about beam search. In our implementation, each beam is represented by a tuple with three elements. The first element is the logarithm of the cumulative probability for the current sequence of characters. The second element is the string of characters. The third element is a one-hot encoded version of the string of characters. A reasonable question is why we store the logarithm of the cumulative probability instead of just the cumulative probability. Because each probability is small and many of them are multiplied together, there is a risk that the limited precision of computer arithmetic results in underflow. This is addressed by instead computing the logarithm of the probability, in which case the multiplication is converted to an addition. For a small number of characters, this is not necessary, but we do it anyway for good practice.
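
The underflow risk is easy to reproduce with the standard library. In this sketch the sequence length (400) and the per-step probability (0.01) are made-up values chosen so that the product falls below what a 64-bit float can represent.

```python
import math

# Made-up example: 400 steps, each with probability 0.01.
step_probs = [0.01] * 400

# Multiplying the probabilities underflows to exactly 0.0.
product = math.prod(step_probs)

# Adding the logarithms stays well within floating-point range.
log_sum = sum(math.log(p) for p in step_probs)

print(product)
print(log_sum)
```

Once the product has collapsed to zero, all sequences look equally improbable and can no longer be ranked, whereas the sums of logarithms remain comparable.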

We start by creating a single beam with an initial sequence of characters ('the body ') and set the initial probability to 1.0, which is stored as its logarithm, 0.0. The one-hot encoded version of the string is created by the first loop. We add this beam to a list named beams.

This is followed by a nested loop that uses the trained model to do predictions according to the beam-search algorithm. We extract the one-hot encoding representation of each beam and create a NumPy array with multiple input examples. There is one input example per beam. During the first iteration, there is only a single input example. During the remaining iterations, there will be BEAM_SIZE number of examples.

We convert the input to a tensor, move it to the device (the GPU if one is available), and feed it to the model. We also need to explicitly apply softmax to the output because the softmax operation is not included in the model itself. This results in one softmax vector per beam. The softmax vector contains one probability per character in the vocabulary. For each beam, we create BEAM_SIZE new beams, each consisting of the characters from the original beam concatenated with one more character. We choose the most probable characters when creating the beams. Because we store log probabilities, the score for each new beam is computed by adding the logarithm of the probability for the added character to the current score of the beam, which is equivalent to multiplying the probabilities.

Once we have created BEAM_SIZE beams for each existing beam, we sort the list of new beams according to their probabilities. We then discard all but the top BEAM_SIZE beams. This represents the pruning step. For the first iteration, this does not result in any pruning because we started with a single beam, and this beam resulted in just BEAM_SIZE beams. For all remaining iterations, we will end up with BEAM_SIZE * BEAM_SIZE beams and discard most of them.
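
A single expand-and-prune step can be traced by hand on a tiny example. The sketch below uses a made-up three-character vocabulary, a beam size of 2, and hard-coded next-character probabilities in place of a trained model. It selects the top candidates with np.argsort, which is a substitute for the repeated argmax used in the implementation in this notebook.

```python
import numpy as np

toy_beam_size = 2
vocab = ['a', 'b', 'c']

# Two existing beams as (log probability, string) pairs.
toy_beams = [(np.log(0.6), 'a'), (np.log(0.4), 'b')]

# Made-up next-character distributions, one row per beam.
next_probs = np.array([[0.50, 0.30, 0.20],
                       [0.15, 0.05, 0.80]])

# Expand: each beam yields toy_beam_size candidates.
candidates = []
for (log_prob, letters), probs in zip(toy_beams, next_probs):
    top_indices = np.argsort(probs)[::-1][:toy_beam_size]
    for idx in top_indices:
        candidates.append((log_prob + np.log(probs[idx]),
                           letters + vocab[idx]))

# Prune: keep only the toy_beam_size most probable candidates.
candidates.sort(key=lambda tup: tup[0], reverse=True)
pruned = candidates[:toy_beam_size]

for log_prob, letters in pruned:
    print(letters, round(float(np.exp(log_prob)), 2))
```

Note that the surviving beams do not all descend from the beam that was most probable before the step: 'bc' (0.4 * 0.8 = 0.32) overtakes 'aa' (0.6 * 0.5 = 0.30).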

The loop runs for a fixed number of iterations, after which we print the generated predictions. See the section "Programming Example: Using LSTM for Text Autocompletion" in Chapter 11 for examples of predictions generated by an equivalent TensorFlow implementation.


In [19]:
# Create initial single beam represented by triplet
# (log probability, string, one-hot encoded string).
letters = 'the body '
one_hots = []
for i, char in enumerate(letters):
    x = np.zeros(encoding_width)
    x[char_to_index[char]] = 1
    one_hots.append(x)
beams = [(np.log(1.0), letters, one_hots)]

# Predict NUM_LETTERS into the future.
model.eval()  # Make sure dropout is disabled during prediction.
for i in range(NUM_LETTERS):
    minibatch_list = []
    # Create minibatch from one-hot encodings, and predict.
    for triple in beams:
        minibatch_list.append(triple[2])
    minibatch = np.array(minibatch_list, dtype=np.float32)
    inputs = torch.from_numpy(minibatch)
    inputs = inputs.to(device)
    with torch.no_grad():  # Gradients are not needed for inference.
        outputs = model(inputs)
    # The model outputs logits, so apply softmax to get probabilities.
    outputs = F.softmax(outputs, dim=1)
    y_predict = outputs.cpu().numpy()

    new_beams = []
    for j, softmax_vec in enumerate(y_predict):
        triple = beams[j]
        # Create BEAM_SIZE new beams from each existing beam.
        for k in range(BEAM_SIZE):
            char_index = np.argmax(softmax_vec)
            new_prob = triple[0] + np.log(softmax_vec[char_index])
            new_letters = triple[1] + index_to_char[char_index]
            x = np.zeros(encoding_width)
            x[char_index] = 1
            new_one_hots = triple[2].copy()
            new_one_hots.append(x)
            new_beams.append((new_prob, new_letters, new_one_hots))
            softmax_vec[char_index] = 0
    # Prune tree to only keep BEAM_SIZE most probable beams.
    new_beams.sort(key=lambda tup: tup[0], reverse=True)
    beams = new_beams[0:BEAM_SIZE]
for item in beams:
    print(item[1])
    

the body which i had
the body of the coun
the body of the cott
the body of the dest
the body of the more
the body which i hav
the body of the morn
the body of the comp
