# Text generation with an RNN

The goal of this notebook is to train a Recurrent Neural Network (RNN), specifically a character-level LSTM model, to generate text in the style of William Shakespeare's plays. The model is trained on a dataset of lines from several of Shakespeare's works and is then used to produce new text one character at a time.



In [1]:
#@title Import libraries

import os
import time

import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds

import matplotlib.pyplot as plt

## Load Tiny Shakespeare Dataset

We begin by loading the `tiny_shakespeare` dataset, which contains 40,000 lines drawn from a variety of William Shakespeare's plays. The dataset was featured in Andrej Karpathy's blog post, ["The Unreasonable Effectiveness of Recurrent Neural Networks"](http://karpathy.github.io/2015/05/21/rnn-effectiveness/), where it is used to demonstrate character-level text generation with recurrent neural networks.


In [2]:
dataset = tfds.load(name='tiny_shakespeare')

Downloading and preparing dataset Unknown size (download: Unknown size, generated: 1.06 MiB, total: 1.06 MiB) to /root/tensorflow_datasets/tiny_shakespeare/1.0.0...



Dataset tiny_shakespeare downloaded and prepared to /root/tensorflow_datasets/tiny_shakespeare/1.0.0. Subsequent calls will reuse this data.


## Explore and Preprocess Text

The text is preprocessed by building a vocabulary and mapping each character to a unique integer ID, because the network operates on numbers rather than raw characters.

---

Creating the vocabulary means collecting every unique character in the training text. This set defines the symbols the model can read and emit, and each character in it is assigned a unique ID.

In [3]:
train_text = next(iter(dataset["train"].map(lambda x: tf.strings.unicode_split(x['text'], 'UTF-8'))))

vocabulary = np.unique(train_text)

print(f"Created vocabulary has {len(vocabulary)} unique characters")

Created vocabulary has 65 unique characters


The `ids_from_chars` and `chars_from_ids` layers, both instances of Keras's `StringLookup`, map characters to numerical IDs and back. The forward mapping encodes the training text, and the inverse mapping decodes generated IDs into characters.

In [4]:
ids_from_chars = tf.keras.layers.StringLookup(
    vocabulary=vocabulary, mask_token=None
)

chars_from_ids = tf.keras.layers.StringLookup(
    vocabulary=vocabulary, mask_token=None, invert=True
)
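As a rough illustration of what the two lookup layers do, here is a minimal pure-Python sketch. The helper names (`build_lookup`, `encode`, `decode`) are hypothetical and are not part of Keras. The sketch assumes the `StringLookup` behavior used above with `mask_token=None`: index 0 is reserved for the out-of-vocabulary token `[UNK]`, and the known characters follow in sorted order.

```python
# Pure-Python sketch of a character <-> ID lookup (illustrative only,
# not the Keras implementation).
OOV_TOKEN = "[UNK]"  # index 0 is reserved for unknown characters


def build_lookup(text):
    """Return (vocab list, char -> id dict) for the characters in `text`."""
    vocab = [OOV_TOKEN] + sorted(set(text))
    char_to_id = {ch: i for i, ch in enumerate(vocab)}
    return vocab, char_to_id


def encode(text, char_to_id):
    """Map each character to its ID; unknown characters map to 0."""
    return [char_to_id.get(ch, 0) for ch in text]


def decode(ids, vocab):
    """Map IDs back to characters and join them into a string."""
    return "".join(vocab[i] for i in ids)


vocab, char_to_id = build_lookup("hello world")
ids = encode("hello", char_to_id)
print(ids)                        # [4, 3, 5, 5, 6]
print(decode(ids, vocab))         # hello
print(encode("hex", char_to_id))  # [4, 3, 0]  ('x' was never seen)
```

Under this assumption, the lookup built from the 65-character vocabulary would have 66 entries in total, which is why the model later takes its vocabulary size from `ids_from_chars.get_vocabulary()` rather than from `len(vocabulary)`.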

The `get_input_target_sequences` function splits the encoded text into chunks of `SEQ_LENGTH + 1` characters and turns each chunk into an input-target pair, where the target is the input shifted forward by one character. At every position the model therefore learns to predict the next character, which is the training signal a Recurrent Neural Network (RNN) needs to pick up patterns and dependencies within sequences.

In [5]:
SEQ_LENGTH = 256


def get_input_target_sequences(text_data):
    ids = ids_from_chars(text_data)
    ids_dataset = tf.data.Dataset.from_tensor_slices(ids)
    sequences = ids_dataset.batch(SEQ_LENGTH + 1, drop_remainder=True)

    return sequences.map(lambda seq: (seq[:-1], seq[1:]))

def ids_to_text(ids):
    return tf.strings.reduce_join(chars_from_ids(ids), axis=-1)

The next code snippet demonstrates an example of an input-target pair after preprocessing.

In [6]:
for input_example, target_example in get_input_target_sequences(train_text).take(1):
    print("Input :", ids_to_text(input_example).numpy())
    print("Target:", ids_to_text(target_example).numpy())

Input : b'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou are all resolved rather to die than to famish?\n\nAll:\nResolved. resolved.\n\nFirst Citizen:\nFirst, you know Caius Marcius is chief enemy to the people.\n\nAll:\n'
Target: b'irst Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou are all resolved rather to die than to famish?\n\nAll:\nResolved. resolved.\n\nFirst Citizen:\nFirst, you know Caius Marcius is chief enemy to the people.\n\nAll:\nW'
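The shift between input and target is easier to see on a short sequence. The following pure-Python sketch mirrors the chunk-and-shift logic with a sequence length of 4 instead of 256. The function name `make_input_target_pairs` is a hypothetical stand-in for the `tf.data` pipeline above, not part of it.

```python
def make_input_target_pairs(ids, seq_length):
    """Split `ids` into chunks of seq_length + 1 and shift each by one.

    Incomplete trailing chunks are dropped, like drop_remainder=True.
    """
    chunk = seq_length + 1
    pairs = []
    for start in range(0, len(ids) - chunk + 1, chunk):
        window = ids[start:start + chunk]
        pairs.append((window[:-1], window[1:]))  # (input, target)
    return pairs


pairs = make_input_target_pairs(list("First Citizen"), seq_length=4)
for inp, tgt in pairs:
    print("".join(inp), "->", "".join(tgt))
```

The 13-character string yields two pairs, `Firs -> irst` and ` Cit -> Citi`, and the leftover `zen` is discarded because it does not fill a whole chunk.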


## Build the model

This section builds and configures the Recurrent Neural Network (RNN) used to generate Shakespeare-like text. The architecture is a stack of three layers (embedding, LSTM, and dense) that maps a sequence of character IDs to a prediction of the next character at every position.

The model starts with an **embedding layer**. This layer converts integer character IDs into dense vectors of fixed size (`embedding_dim`), which lets the model learn a useful representation of each character.

Following the embedding layer, an **LSTM** (Long Short-Term Memory) layer is employed. LSTMs are well-suited for handling sequential data, and in this case, they help the model learn the temporal dependencies within the input sequences.

The output of the LSTM layer is fed into a **dense layer**. This layer maps the LSTM output at each time step to a vector of vocabulary size, giving one logit (unnormalized score) per character. The logits are converted to probabilities later, inside the loss function during training and during sampling at generation time.

In [7]:
class RNNModel(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, rnn_units):
        super().__init__()

        # No fixed input length is declared: the layer accepts sequences of any length.
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)

        self.lstm = tf.keras.layers.LSTM(
            units=rnn_units, return_sequences=True, return_state=True
        )

        self.dense = tf.keras.layers.Dense(vocab_size)

    def call(self, inputs, states=None, return_state=False, training=False):
        x = self.embedding(inputs, training=training)

        if states is None:
            states = self.lstm.get_initial_state(x)

        x, *states = self.lstm(x, initial_state=states, training=training)
        x = self.dense(x, training=training)

        return (x, states) if return_state else x

In [8]:
EMBEDDING_DIM = 512
RNN_UNITS = 1024


model = RNNModel(
    vocab_size=len(ids_from_chars.get_vocabulary()),
    embedding_dim=EMBEDDING_DIM,
    rnn_units=RNN_UNITS
)
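To get a feel for the size of this model, the parameter count of each layer can be worked out by hand. The sketch below is an estimate under two stated assumptions: the vocabulary has 66 entries (the 65 characters reported earlier plus the `[UNK]` token), and the LSTM uses the standard layout of four gates, each with an input kernel, a recurrent kernel, and one bias vector. The helper functions are hypothetical and only do arithmetic.

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding vector per vocabulary entry.
    return vocab_size * embedding_dim


def lstm_params(input_dim, units):
    # Four gates, each with an input kernel, a recurrent kernel, and a bias.
    return 4 * (input_dim * units + units * units + units)


def dense_params(input_dim, output_dim):
    # Weight matrix plus one bias per output.
    return input_dim * output_dim + output_dim


VOCAB_SIZE = 66       # assumption: 65 characters + "[UNK]"
EMBEDDING_DIM = 512   # as configured above
RNN_UNITS = 1024      # as configured above

counts = {
    "embedding": embedding_params(VOCAB_SIZE, EMBEDDING_DIM),
    "lstm": lstm_params(EMBEDDING_DIM, RNN_UNITS),
    "dense": dense_params(RNN_UNITS, VOCAB_SIZE),
}
for name, n in counts.items():
    print(f"{name:>9}: {n:,}")
print(f"{'total':>9}: {sum(counts.values()):,}")
```

With these assumptions almost all of the roughly 6.4 million parameters sit in the LSTM layer, so `RNN_UNITS` is the setting that most affects model size.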

## Train the model

The model is trained on the prepared dataset, adjusting its parameters to predict the next character of Shakespearean text.

Training minimizes the Sparse Categorical Crossentropy loss. It suits this task because every position has exactly one correct class (the next character) and the targets are stored as integer IDs rather than one-hot vectors.

### Compile the model

The model is compiled with the Adam optimizer and the Sparse Categorical Crossentropy loss. Because the dense layer outputs raw logits rather than probabilities, the loss is created with `from_logits=True`.

In [9]:
loss = tf.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer="adam", loss=loss)
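For a single position, sparse categorical cross-entropy computed from logits is the negative log-probability that the softmax assigns to the correct class. The following pure-Python sketch shows that calculation for one prediction. The function name is hypothetical, and the Keras loss additionally averages this value over all positions and batch elements.

```python
import math


def sparse_cce_from_logits(logits, target_id):
    """Return -log(softmax(logits)[target_id]) using a stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_id]


# A model with no preference (all logits equal) over 66 classes.
uniform_loss = sparse_cce_from_logits([0.0] * 66, target_id=3)

# A model that favors the correct class (index 0) over two others.
confident_loss = sparse_cce_from_logits([2.0, 0.0, 0.0], target_id=0)

print(f"uniform over 66 classes: {uniform_loss:.4f}")
print(f"ln(66)                 : {math.log(66):.4f}")
print(f"confident and correct  : {confident_loss:.4f}")
```

The uniform case equals `ln(66)`, about 4.19, which is a useful reference point: assuming a 66-entry vocabulary, a training loss well below that value suggests the model has learned something beyond guessing.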

### Create training batches

In [10]:
BATCH_SIZE = 32
BUFFER_SIZE = 10000

train_dataset = (
    get_input_target_sequences(train_text)
    .shuffle(BUFFER_SIZE)
    .batch(BATCH_SIZE, drop_remainder=True)
    .prefetch(tf.data.AUTOTUNE)
)
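The number of training steps per epoch follows directly from the text length, the sequence length, and the batch size. The sketch below works through that arithmetic. The helper `batches_per_epoch` is hypothetical, and the text length of 1,000,000 characters is a round illustrative figure, not the measured size of the training split.

```python
def batches_per_epoch(num_chars, seq_length, batch_size):
    """Return (number of sequences, number of full batches).

    Both divisions round down, mirroring drop_remainder=True.
    """
    num_sequences = num_chars // (seq_length + 1)
    return num_sequences, num_sequences // batch_size


# Hypothetical text length; SEQ_LENGTH and BATCH_SIZE as in the notebook.
num_sequences, num_batches = batches_per_epoch(
    num_chars=1_000_000, seq_length=256, batch_size=32
)
print(num_sequences, num_batches)  # 3891 121
```

In this illustration the number of sequences is smaller than `BUFFER_SIZE`, in which case the shuffle buffer holds the whole dataset and the shuffle is uniform.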

### Configure checkpoints

A checkpoint callback saves the model's weights at the end of every epoch. If training is interrupted, the model can be restored from the most recent checkpoint instead of starting over.

In [11]:
CHECKPOINT_DIR = "drive/MyDrive/tmp/training_checkpoints"
INITIAL_EPOCH = 15

checkpoint_prefix = os.path.join(CHECKPOINT_DIR, "ckpt_{epoch}")

checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix, save_weights_only=True
)

if INITIAL_EPOCH > 0:
    # Resume from previously saved weights. To resume from a checkpoint written
    # by the callback above instead, use:
    # model.load_weights(os.path.join(CHECKPOINT_DIR, f"ckpt_{INITIAL_EPOCH}"))
    model.load_weights(os.path.join("Weights", "Character_Level_Shakespeare_Text_Generation_using_Encoder_Decoder_Model"))

### Training process

In [13]:
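EPOCHS = 15

# When INITIAL_EPOCH equals EPOCHS, fit() runs no epochs and the loaded weights are used as-is.
history = model.fit(train_dataset, initial_epoch=INITIAL_EPOCH, epochs=EPOCHS, callbacks=[checkpoint_callback])

if INITIAL_EPOCH < EPOCHS:
    # One loss value is recorded per trained epoch, so the x-axis needs
    # EPOCHS - INITIAL_EPOCH points.
    plt.plot(range(INITIAL_EPOCH + 1, EPOCHS + 1), history.history["loss"], "o--k")
    plt.grid(True)
    plt.ylabel("loss")
    plt.xlabel("epoch");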

## Generate text

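A text generation class, `RNNTextGenerator`, wraps the trained model. Its `generate_one_step` method divides the logits for the last position by `temperature`, masks out the `[UNK]` token, and samples the next character. The `generate_text` method calls it repeatedly, starting from a given seed string and carrying the LSTM state from one step to the next.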

In [20]:
class RNNTextGenerator(tf.keras.Model):
    def __init__(self, model, chars_from_ids, ids_from_chars, temperature=1.):
        super().__init__()
        self.temperature = temperature
        self.model = model
        self.chars_from_ids = chars_from_ids
        self.ids_from_chars = ids_from_chars

        skip_ids = self.ids_from_chars(["[UNK]"])[:, None]
        sparse_mask = tf.SparseTensor(
            values=[-float("inf")] * len(skip_ids),
            indices=skip_ids,
            dense_shape=[len(ids_from_chars.get_vocabulary())]
        )
        self.prediction_mask = tf.sparse.to_dense(sparse_mask)

    @tf.function
    def generate_one_step(self, inputs, states=None):
        input_chars = tf.strings.unicode_split(inputs, "UTF-8")
        input_ids = self.ids_from_chars(input_chars).to_tensor()

        predicted_logits, states = self.model(
            inputs=input_ids, states=states, return_state=True
        )
        predicted_logits = predicted_logits[:, -1, :]
        predicted_logits = predicted_logits / self.temperature

        predicted_logits = predicted_logits + self.prediction_mask

        predicted_ids = tf.random.categorical(predicted_logits, num_samples=1)
        predicted_ids = tf.squeeze(predicted_ids, axis=-1)

        predicted_chars = self.chars_from_ids(predicted_ids)

        return predicted_chars, states

    def generate_text(self, start_char, text_length=1000):
        next_char = tf.constant([start_char])
        result = [next_char]

        states = None
        for n in range(text_length):
            next_char, states = self.generate_one_step(
                next_char, states=states
            )
            result.append(next_char)

        return tf.strings.join(result)
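The sampling step can be illustrated without TensorFlow. The sketch below is a simplified pure-Python stand-in for what `generate_one_step` does with the logits: scale by temperature, let a `-inf` entry act as a mask, and draw one ID from the resulting distribution. The function names and the four example logits are hypothetical.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by `temperature`."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]  # exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]


def sample_id(logits, temperature=1.0, rng=random):
    """Draw one index according to the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Index 0 plays the role of "[UNK]": a -inf logit gives it zero probability.
logits = [-math.inf, 2.0, 1.0, 0.0]

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])

rng = random.Random(0)
samples = [sample_id(logits, temperature=1.0, rng=rng) for _ in range(200)]
print(sorted(set(samples)))  # index 0 never appears
```

Lower temperatures concentrate probability on the highest-scoring character, which tends to give more repetitive but more conservative text. Higher temperatures flatten the distribution and produce more varied, and usually more error-prone, output.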


The next cell creates a text generator, seeds it with "ROMEO:", generates 1,000 characters with the trained RNN model, and prints the result followed by a visual separator.

In [21]:
gen = RNNTextGenerator(model, chars_from_ids, ids_from_chars)
pred_text = gen.generate_text(start_char="ROMEO:")
print(pred_text[0].numpy().decode("utf-8"), "\n\n" + "_" * 80)

ROMEO:
You beard sheer, and 'twixt with my soul toorselves:
On whose causes of my wife; I charge him,
We speak us was; such fairing gravest side,
Like in that slow her subject, let my taste
These sly settles are these nobles of the state,
Who world's conspire of whom merchings in thee:
Have alwhy to strike her turns that love I pride;
And in this nature blunt can thanks thou live?

GLOUCESTER:
Twlen? 'tis no time, Poor Gentlemon, to-ba
Besides thy mistress senatolicary.
Go poison, other tears will shrister me it marriage.

GLOUCESTER:
So do I'll melt your buptors of no mooe.

First Watchman:
What then?
What comfortable things, protectoc And the queen'?
Be jetter than the night of sucher souls
Which you do do, he is so, bunishments,
You mean to Coriola in; catchation!
We are so stilg hath done this further scope.
And wear to incles his majesty arrowf
And all escuping embraced water spinits:
The chiefix struck hour in the eyes,
Traitor were borne for a liege.
Ine worth as much, and I hav

## Analysis of the generated text

The generated text appears to mimic the style of William Shakespeare's plays, capturing some of the characteristic language and themes found in his works. Here's an analysis:

1. **Language Style:**
   - The text is written in a Shakespearean style, featuring archaic language, poetic expressions, and intricate sentence structures.
   - It includes recognizable Shakespearean terms like "thou," "thee," and "majesty," contributing to the authenticity of the language.

2. **Characters:**
   - The presence of characters such as "ROMEO" and "GLOUCESTER" aligns with the theatrical and dramatic nature of Shakespeare's plays.
   - The dialogue format with character names preceding their lines is a common structure in Shakespearean scripts.

3. **Themes:**
   - Words associated with love, nobility, and politics (for example "love," "wife," "marriage," "nobles of the state," "majesty," and "Traitor") appear in the text, echoing vocabulary that is common in Shakespeare's plays.
   - These are surface-level echoes of the training data rather than developed themes, since the model predicts one character at a time and does not plan a plot.

4. **Creative Output:**
   - The text is new output shaped by the patterns learned from the training dataset of Shakespearean plays; it is not copied from it line by line.
   - While the text does not follow a storyline, it reproduces the surface form of Shakespeare's writing: verse-like line lengths, capitalized line starts, and speaker labels.

5. **Imitated Versatility:**
   - The generator switches between speakers, moving from ROMEO to GLOUCESTER to First Watchman, which reproduces the dialogue structure of the training data.
   - Whether each speaker's lines differ in tone is hard to judge from a sample this short.

6. **Incoherence:**
   - Many phrases and sentences lack clear meaning, and some tokens are not English words at all (for example "senatolicary," "buptors," and "catchation"). This is not uncommon for a character-level model, which has to learn spelling, grammar, and meaning from individual characters.

In conclusion, the generated text mimics the surface style of Shakespearean plays. Its vocabulary, layout, and speaker structure align with the training data, even though much of the content is not meaningful.