A Text Generation Model is a type of Natural Language Processing (NLP) model that automatically generates human-like text. It can produce coherent and contextually relevant text based on the input text.

For the task of text generation, we can use the Tiny Shakespeare dataset for two reasons:

1. It’s available in the format of dialogues, so we will learn how to generate text in the form of dialogues.

2. Usually, we need huge textual datasets for building text generation models. The Tiny Shakespeare dataset is small (about 1 MiB) but is already available in TensorFlow Datasets, so we don’t need to download any dataset externally.

In [None]:
# import necessary python libraries
import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np

# load the Tiny Shakespeare dataset
dataset, info = tfds.load('tiny_shakespeare', with_info=True, as_supervised=False)

Downloading and preparing dataset Unknown size (download: Unknown size, generated: 1.06 MiB, total: 1.06 MiB) to /root/tensorflow_datasets/tiny_shakespeare/1.0.0...


Dataset tiny_shakespeare downloaded and prepared to /root/tensorflow_datasets/tiny_shakespeare/1.0.0. Subsequent calls will reuse this data.


Our dataset contains data in a textual format. Language models need numerical data, so we’ll convert the text to sequences of integers. We’ll also create sequences for training:

In [None]:
# get the text from the dataset
text = next(iter(dataset['train']))['text'].numpy().decode('utf-8')

# create a mapping from unique characters to indices
vocab = sorted(set(text))
char2idx = {char: idx for idx, char in enumerate(vocab)}
idx2char = np.array(vocab)

# numerically represent the characters
text_as_int = np.array([char2idx[c] for c in text])

# create training examples and targets
seq_length = 100
examples_per_epoch = len(text) // (seq_length + 1)

# create training sequences
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

sequences = char_dataset.batch(seq_length + 1, drop_remainder=True)
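
As a small illustration of the mapping step, the sketch below applies the same idea to a short sample string instead of the dataset. It is written in plain Python so it runs without TensorFlow; the sample string and the `toy_` names are assumptions made for this example and are not part of the code above.

```python
# Toy version of the character-to-index mapping, on a hypothetical sample string.
toy_text = "to be, or not to be"

# The sorted unique characters form the vocabulary.
toy_vocab = sorted(set(toy_text))
toy_char2idx = {ch: i for i, ch in enumerate(toy_vocab)}
toy_idx2char = list(toy_vocab)

# Encode the text as integers, then decode it back to check the round trip.
toy_encoded = [toy_char2idx[ch] for ch in toy_text]
toy_decoded = "".join(toy_idx2char[i] for i in toy_encoded)

print(toy_vocab)
print(toy_encoded[:5])
print(toy_decoded == toy_text)
```

Because every character maps to exactly one index and back, decoding the encoded text recovers the original string.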

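Each sequence holds seq_length + 1 characters. We now split it into an input (every character except the last) and a target (the same characters shifted forward by one position, so each input character is paired with the character that follows it), using the map method to apply a simple function to each sequence: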

In [None]:
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)
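
To make the chunking and shifting concrete, here is a minimal plain-Python sketch on a short list of integers. The helper names `toy_chunks` and `toy_split` are hypothetical stand-ins for `batch(..., drop_remainder=True)` and `split_input_target`, and the tiny sequence length is chosen only for readability.

```python
def toy_chunks(ids, size):
    """Split ids into consecutive chunks of `size`, dropping any remainder."""
    n_chunks = len(ids) // size
    return [ids[i * size:(i + 1) * size] for i in range(n_chunks)]


def toy_split(chunk):
    """Input is everything but the last item; target is everything but the first."""
    return chunk[:-1], chunk[1:]


toy_ids = list(range(11))   # stands in for text_as_int
toy_seq_length = 4          # the notebook uses 100

toy_sequences = toy_chunks(toy_ids, toy_seq_length + 1)
toy_pairs = [toy_split(chunk) for chunk in toy_sequences]

for inp, tgt in toy_pairs:
    print(inp, "->", tgt)
```

The eleventh value does not fill a complete chunk, so it is dropped, which mirrors what `drop_remainder=True` does.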

Now, we’ll shuffle the dataset and pack it into training batches:

In [None]:
# batch size and buffer size
BATCH_SIZE = 64
BUFFER_SIZE = 10000

dataset = (
    dataset
    .shuffle(BUFFER_SIZE)
    .batch(BATCH_SIZE, drop_remainder=True)
    .prefetch(tf.data.AUTOTUNE)
)
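
The arithmetic behind these settings can be sketched in plain Python. The corpus length of one million characters below is an assumption used only for illustration (the real value is `len(text)`), and `toy_batches_per_epoch` is a hypothetical helper.

```python
def toy_batches_per_epoch(n_chars, seq_length, batch_size):
    """Count full sequences and full batches when remainders are dropped."""
    n_sequences = n_chars // (seq_length + 1)
    n_batches = n_sequences // batch_size
    return n_sequences, n_batches


# Assumed corpus size of 1,000,000 characters; seq_length and batch size as in the notebook.
toy_n_sequences, toy_n_batches = toy_batches_per_epoch(1_000_000, 100, 64)

print(toy_n_sequences)  # number of 101-character sequences
print(toy_n_batches)    # number of full batches per epoch
```

Under that assumption there are fewer sequences than the shuffle buffer holds (10000), in which case the buffer covers the whole dataset and the shuffle is uniform over all sequences.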

Now, we’ll build a simple Recurrent Neural Network with three layers: an embedding layer, an LSTM layer, and a dense output layer that produces one logit per character in the vocabulary:

In [None]:
# length of the vocabulary
vocab_size = len(vocab)

# the embedding dimension
embedding_dim = 256

# number of RNN units
rnn_units = 1024

def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim, batch_input_shape=[batch_size, None]),
        tf.keras.layers.LSTM(rnn_units, return_sequences=True, stateful=True, recurrent_initializer='glorot_uniform'),
        tf.keras.layers.Dense(vocab_size)
    ])
    return model

model = build_model(vocab_size, embedding_dim, rnn_units, BATCH_SIZE)
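
As a rough check on the size of this model, the parameter counts of the three layers can be worked out by hand. The sketch below uses the standard formulas for embedding, LSTM (with bias), and dense layers; the vocabulary size of 65 is an assumption for illustration, so the figures should be compared against `model.summary()` with the real `len(vocab)`.

```python
def toy_param_counts(vocab_size, embedding_dim, rnn_units):
    """Parameter counts for Embedding -> LSTM -> Dense, assuming biases are used."""
    embedding = vocab_size * embedding_dim
    # An LSTM has 4 gates, each with input weights, recurrent weights, and a bias.
    lstm = 4 * (rnn_units * (embedding_dim + rnn_units) + rnn_units)
    dense = rnn_units * vocab_size + vocab_size
    return embedding, lstm, dense


# vocab_size=65 is an assumed value; embedding_dim and rnn_units match the notebook.
toy_embedding, toy_lstm, toy_dense = toy_param_counts(65, 256, 1024)
toy_total = toy_embedding + toy_lstm + toy_dense

print(toy_embedding, toy_lstm, toy_dense, toy_total)
```

Almost all of the parameters sit in the LSTM layer, which is why `rnn_units` has the largest effect on model size and training time.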

We’ll now choose an optimizer and a loss function to compile the model:

In [None]:
def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

model.compile(optimizer='adam', loss=loss)
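
For intuition about what this loss computes at a single position, here is a minimal plain-Python sketch of sparse categorical cross-entropy on raw logits. The function name `toy_sparse_ce_from_logits` and the example logits are assumptions for illustration; the Keras function applies the same idea to every character position in the batch.

```python
import math


def toy_sparse_ce_from_logits(label, logits):
    """Negative log-probability of `label` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]


# A confident, correct prediction gives a small loss.
toy_loss_good = toy_sparse_ce_from_logits(0, [2.0, 1.0, 0.1])

# Equal logits mean a uniform guess, so the loss is log(number of classes).
toy_loss_uniform = toy_sparse_ce_from_logits(1, [0.0, 0.0, 0.0])

print(round(toy_loss_good, 4))
print(round(toy_loss_uniform, 4))
```

The uniform case gives a useful sanity check: a model that guesses uniformly over a vocabulary of size V has a loss of log(V), so a loss near that value means the model has learned little so far.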

We’ll now train the model:

In [None]:
import os

# directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'

# name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True
)

# train the model
EPOCHS = 2
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])

Epoch 1/2
Epoch 2/2


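After training, we can use the model to generate text. The LSTM layer is stateful, so its batch size is fixed when the model is built. To generate from a single seed string, we first rebuild the model with a batch size of 1 and then restore the weights from the latest checkpoint: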

In [None]:
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)
model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
model.build(tf.TensorShape([1, None]))

Now, to generate text, we’ll input a seed string, predict the next character, and then add it back to the input, continuing this process to generate longer text:

In [None]:
def generate_text(model, start_string):
    num_generate = 1000

    input_eval = [char2idx[s] for s in start_string]
    input_eval = tf.expand_dims(input_eval, 0)

    text_generated = []

    model.reset_states()
    for i in range(num_generate):
        predictions = model(input_eval)
        predictions = tf.squeeze(predictions, 0)

        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1, 0].numpy()
        input_eval = tf.expand_dims([predicted_id], 0)

        text_generated.append(idx2char[predicted_id])

    return (start_string + ''.join(text_generated))

print(generate_text(model, start_string=u"QUEEN: So, lets end this"))

QUEEN: So, lets end thisUJ:$$PQzQQJ$JJH$&HX$J$ZX&QX$J3KX&H&PK&P&jJJKQq$QJX$$QQJJx$Uz$$DQVZQZVJK$Q$XZXJQ$$UK$$Q&K3QGMj$Q&VK$&&$$$Q&$$V&JQ$QM$QMQJJ&Z$X&KP&QF&3JMxXQKQQjPJzQ&vUz$;z$&qS$zHZMJ$Q$VXjKjCqP&QTQv$zjzX3zQ$&j$q3Q&AJJzP&$$Jj&Q$GKx3QJJ?vAj$K&QKP&jBJ$Q-$K&J$z$T$$jJz$JX$QjQA&JHz$3xvQzQjK&X3Jj&EKZQ$-&J$JV3$Q$$$UV3WXZQNZJ&U$M$Q&zKMJQ&JJV&$K$V&FJJJ&$$$QQE$$Q$zBPzzqzQ$U&-Q$J!KAZxq$&JQQKK&3QZQ&XND$JJ!3$$QjQ&&zG3JQPMzJPQZj$$QUz&WXQ$$$XZHQZF$Q$R$$$MzM$KQ3$XV$KzB$$VKMJZK$JXJ&QQ3V&WQ$XJF$RZ$XzK3$$JJZQ$Y$JK&Q&XX33$$jMKJZ$jXQBJVZVYQ$UVXJj$DKWJ$QMGXK&QZKJ$$Q$PQ&$LM$jqXXEqQ$KFWKPQ$q$J&$JF&$JjJ$F&$$Q$KzX$Q&$LQzMQJJ$P&&&G$K$JV&K$K$ZGZ$F$XQKJ$v$QZX$K$QQUJ$KXFQJjNMKKQKQ$QUDXV3SZK$QXQ$Q3KZ3VWQ$KKVZ$zQDEL3JJJ&ZV3$q&Qjg$GJHjQJVMX$JJU$Q$VKF$3$KjKxQ$J$$JJVXXJ$Q$J&KZQ$j$VQ$VYQJ$Q$J$$KSJCz&Q$D$$Y$KV$&$$$QKQ$j$j3$&3KCJPQKKzzQKQ&&J$JXkQXZJJJ$JXQKkXQJJJKXX&$JJKKVX$KPXzHU3MVBKVVQj$z$&&DqV$qBZK$$QXx$zK&Q&XOMQMVQJFJ$V&qQ$$$$XQjK$MVUqVj&D&Y&&VQQZV&UM$XZJzJJPX$X$$Q$V$H$Q$BBMKjXK$QjX$$PZQ$Q3J&VP&$$XJZUZQM$Z$Jz3YJQBX

The generate_text function in the above code uses a trained Recurrent Neural Network model to generate a sequence of text, starting with a given seed phrase (start_string). It converts the seed phrase into a sequence of numeric indices, feeds these indices into the model, and then iteratively generates new characters, each time using the model’s most recent output as the input for the next step. This process continues for a specified number of iterations (num_generate), resulting in a stream of text that extends from the initial seed.

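The function samples each character at random from the model’s predicted distribution rather than always picking the most likely one, which keeps the generated text varied. The final output is the seed phrase followed by the newly generated characters. With enough training, this output typically reflects the style and content of the training data; the sample above, produced after only two epochs, is still mostly noise, so in practice the model would likely need to be trained for more epochs.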
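
The sample-and-feed-back loop can also be sketched without TensorFlow. In the plain-Python example below, `toy_next_logits` is a hypothetical stand-in for the trained model: it strongly prefers the index that follows the last one, cycling through three symbols. Unlike the stateful LSTM, which is fed only the newest character and keeps its own memory, this toy function receives the whole history; everything here is an illustration under those assumptions rather than the notebook's implementation.

```python
import math
import random


def toy_softmax(logits):
    """Convert logits to probabilities, subtracting the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_sample_id(logits, rng):
    """Draw one index at random according to softmax(logits)."""
    probs = toy_softmax(logits)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


def toy_next_logits(ids):
    """Hypothetical model: strongly prefers (last id + 1) modulo 3."""
    logits = [0.0, 0.0, 0.0]
    logits[(ids[-1] + 1) % 3] = 50.0
    return logits


def toy_generate(next_logits, start_ids, num_generate, rng):
    """Repeatedly sample the next id and append it to the sequence."""
    ids = list(start_ids)
    for _ in range(num_generate):
        ids.append(toy_sample_id(next_logits(ids), rng))
    return ids


toy_output = toy_generate(toy_next_logits, [0], 5, random.Random(0))
print(toy_output)
```

When one logit dominates, sampling behaves almost like always picking the most likely index; when the logits are close together, the output varies from run to run unless the random generator is seeded.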