# Text Generation with RNNs

In this notebook, we will build a simple Recurrent Neural Network (RNN) using Keras to generate text sequences in the style of Shakespeare. We will use a dataset of Shakespeare's works, preprocess the text, train the RNN, and generate new text based on the learned patterns.

In [None]:
import numpy as np
import random
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, SimpleRNN
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [None]:
# Download the dataset
!wget -q -O shakespeare.txt https://www.gutenberg.org/cache/epub/100/pg100.txt

# Load the dataset ('utf-8-sig' drops the byte-order mark at the start of the file)
with open('shakespeare.txt', 'r', encoding='utf-8-sig') as file:
    text = file.read()

# Display the first 1000 characters
print(text[:1000])

﻿The Project Gutenberg eBook of The Complete Works of William Shakespeare
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.

Title: The Complete Works of William Shakespeare

Author: William Shakespeare

Release date: January 1, 1994 [eBook #100]
                Most recently updated: January 18, 2024

Language: English



*** START OF THE PROJECT GUTENBERG EBOOK THE COMPLETE WORKS OF WILLIAM SHAKESPEARE ***
﻿The Complete Works of William Shakespeare

by William Shakespeare




                    Contents

    THE SONNETS
    ALL’S WELL THAT ENDS WELL
    THE TRAGEDY OF ANTONY AND CLEOPATRA
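
The printed sample shows that the file begins with Project Gutenberg's header rather than Shakespeare's text, so the model would also train on that boilerplate. A minimal sketch of one way to trim it is given below. The start marker is taken from the output above; the end marker is an assumption modeled on it, and `strip_gutenberg_boilerplate` is a hypothetical helper, not part of the original notebook.

```python
# Marker seen in the printed sample above.
START_MARKER = "*** START OF THE PROJECT GUTENBERG EBOOK"
# Assumed counterpart at the end of the file; verify it against the actual download.
END_MARKER = "*** END OF THE PROJECT GUTENBERG EBOOK"


def strip_gutenberg_boilerplate(raw_text):
    """Return the text between the start and end markers, if they are present."""
    # Remove byte-order marks, which can appear at the start of the file and mid-text.
    raw_text = raw_text.replace("\ufeff", "")

    start = raw_text.find(START_MARKER)
    if start != -1:
        # Skip the rest of the marker line itself.
        newline = raw_text.find("\n", start)
        raw_text = raw_text[newline + 1:] if newline != -1 else ""

    end = raw_text.find(END_MARKER)
    if end != -1:
        raw_text = raw_text[:end]

    return raw_text.strip()


sample = (
    "\ufeffLicense header\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "\ufeffBody text\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License footer"
)
print(strip_gutenberg_boilerplate(sample))
```

If a marker is missing, the function leaves that side of the text untouched, so it is safe to call on files that have already been trimmed.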
   

We preprocess the text for training a character-level RNN. Keras's `Tokenizer` with `char_level=True` assigns a unique integer index to each character, so the model learns patterns at the granularity of individual characters. By default the tokenizer lowercases the text, and its indices start at 1 (index 0 is reserved), which is why the number of output classes used later is `total_chars + 1`.

The text is then converted into a sequence of integers, one per character. Each training input is a window of 40 consecutive characters, and the target is the character that immediately follows it. The window slides forward one character at a time, so the RNN learns to predict the next character of a sequence, which is what text generation requires.

The inputs `X` and targets `y` are converted into NumPy arrays, and the targets are then one-hot encoded to match the `categorical_crossentropy` loss used for training.
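
To make the character indexing concrete, the sketch below mimics in plain Python how a character-level tokenizer could assign indices: lowercase the text, rank characters by frequency, and number them from 1. It is an illustration only, not Keras's implementation, and `build_char_index` and `encode` are hypothetical helper names.

```python
from collections import Counter


def build_char_index(text):
    """Map each character to a 1-based index, most frequent first (0 stays reserved)."""
    counts = Counter(text.lower())
    return {char: i for i, (char, _) in enumerate(counts.most_common(), start=1)}


def encode(text, char_index):
    """Replace each known character by its index; unknown characters are skipped."""
    return [char_index[c] for c in text.lower() if c in char_index]


char_index = build_char_index("hello world")
encoded = encode("hello", char_index)
num_classes = len(char_index) + 1  # +1 because index 0 is never assigned

print(char_index)
print(encoded)      # [3, 4, 1, 1, 2]
print(num_classes)  # 9
```

The most frequent character (`l`) receives index 1, and eight distinct characters lead to nine output classes once the reserved index 0 is counted.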

In [None]:
# Tokenization
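tokenizer = Tokenizer(char_level=True)  # Character-level tokenization (lowercases text by default)
tokenizer.fit_on_texts([text])  # Pass a list of texts; here, a single document
total_chars = len(tokenizer.word_index)  # Number of distinct characters (indices start at 1)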

# Convert text to sequences of integers
sequences = tokenizer.texts_to_sequences([text])[0]

# Define sequence length
sequence_length = 40

# Create input-output pairs
X = []
y = []
for i in range(0, len(sequences) - sequence_length):
    X.append(sequences[i:i + sequence_length])
    y.append(sequences[i + sequence_length])

# Convert to NumPy arrays
X = np.array(X)
y = np.array(y)

# One-hot encode the output variable
y = tf.keras.utils.to_categorical(y, num_classes=total_chars + 1)

print(f"Number of sequences: {X.shape[0]}")

Number of sequences: 5378624


With a window of 5 characters instead of 40, and `_` standing for a space, `X` would contain sequences like:
```
[h, e, l, l, o]
[e, l, l, o, _]
[l, l, o, _, w]
```
and so on, where each letter is replaced by its corresponding integer index.

`y` is a one-hot encoded array in which each row marks the character that follows the corresponding sequence in `X`.
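
The windowing and one-hot encoding described above can be checked on a toy sequence. The sketch below uses NumPy only; the integer sequence is a hypothetical encoding of "hello world", and `make_windows` and `one_hot` are illustrative helpers rather than the notebook's own code.

```python
import numpy as np


def make_windows(encoded, window):
    """Slide a fixed-size window over the sequence; the target is the next element."""
    n_pairs = len(encoded) - window
    X = np.array([encoded[i:i + window] for i in range(n_pairs)])
    y = np.array([encoded[i + window] for i in range(n_pairs)])
    return X, y


def one_hot(indices, num_classes):
    """Turn integer class indices into one-hot rows."""
    return np.eye(num_classes, dtype=np.float32)[indices]


encoded = [3, 4, 1, 1, 2, 5, 6, 2, 7, 1, 8]  # hypothetical indices for "hello world"
X_demo, y_demo = make_windows(encoded, window=5)
y_demo_onehot = one_hot(y_demo, num_classes=9)

print(X_demo.shape, y_demo_onehot.shape)  # (6, 5) (6, 9)
print(X_demo[0], y_demo[0])               # [3 4 1 1 2] 5
```

A sequence of length 11 with a window of 5 yields 11 - 5 = 6 input-target pairs. The one-hot array has one column per class, so its memory footprint grows with both the number of windows and the vocabulary size.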

In [None]:
def build_rnn_model(input_shape, num_classes):
    """
    Build a simple RNN model using Keras Sequential API.

    Parameters:
    - input_shape: Shape of the input data (sequence_length)
    - num_classes: Number of unique characters (output size)

    Returns:
    - model: Compiled RNN model
    """
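    model = Sequential()
    model.add(tf.keras.Input(shape=(input_shape,)))  # Fixed-length sequences of character indices
    model.add(Embedding(input_dim=num_classes, output_dim=32))  # Maps each character index to a dense 32-dimensional vector
    model.add(SimpleRNN(128, return_sequences=False))  # Returns only the final hidden state
    model.add(Dense(num_classes, activation='softmax'))  # Probability distribution over the next character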

    model.compile(optimizer=RMSprop(learning_rate=0.01), loss='categorical_crossentropy', metrics=['accuracy'])

    return model

# Build the model
input_shape = X.shape[1]  # sequence_length
num_classes = total_chars + 1
rnn_model = build_rnn_model(input_shape, num_classes)
rnn_model.summary()
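
As a cross-check for the summary, the parameter count of this architecture can be worked out by hand. The sketch below applies the standard formulas for embedding, simple recurrent, and dense layers; `count_parameters` is a hypothetical helper, and `num_classes=100` is a placeholder value, not the vocabulary size of this dataset.

```python
def count_parameters(num_classes, embedding_dim=32, rnn_units=128):
    """Trainable parameters per layer for Embedding -> SimpleRNN -> Dense."""
    # Embedding: one vector of size embedding_dim per class index.
    embedding = num_classes * embedding_dim
    # SimpleRNN: input weights + recurrent weights + bias, for each unit.
    simple_rnn = rnn_units * (embedding_dim + rnn_units + 1)
    # Dense: weight matrix plus one bias per output class.
    dense = rnn_units * num_classes + num_classes
    return {
        "embedding": embedding,
        "simple_rnn": simple_rnn,
        "dense": dense,
        "total": embedding + simple_rnn + dense,
    }


# Placeholder vocabulary size; substitute total_chars + 1 from the notebook.
print(count_parameters(num_classes=100))
```

The recurrent layer's 20,608 parameters do not depend on the vocabulary size; only the embedding and output layers scale with `num_classes`.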



In [None]:
# Train the model
history = rnn_model.fit(X, y, validation_split=0.2, epochs=20, batch_size=128, verbose=1)

Epoch 1/20
33617/33617 ━━━━━━━━━━━━━━━━━━━━ 229s 7ms/step - accuracy: 0.3564 - loss: 2.2180 - val_accuracy: 0.3504 - val_loss: 2.1908
Epoch 2/20
33617/33617 ━━━━━━━━━━━━━━━━━━━━ 211s 6ms/step - accuracy: 0.3872 - loss: 2.0872 - val_accuracy: 0.3712 - val_loss: 2.2234
Epoch 3/20
33617/33617 ━━━━━━━━━━━━━━━━━━━━ 214s 6ms/step - accuracy: 0.3916 - loss: 2.0767 - val_accuracy: 0.3714 - val_loss: 2.2418
Epoch 4/20
33617/33617 ━━━━━━━━━━━━━━━━━━━━ 217s 6ms/step - accuracy: 0.3958 - loss: 2.0652 - val_accuracy: 0.3589 - val_loss: 2.3231
Epoch 5/20
33617/33617 ━━━━━━━━━━━━━━━━━━━━ 206s 6ms/step - accuracy: 0.3962 - loss: 2.0628 - val_accuracy: 0.3721 - val_loss: 2.2488
Epoch 6/20
33617/33617 ━━━━━━━━━━━━━━━━━━━━ 212s 6ms/step - accuracy: 0.3983 - loss: 2.0592 - val_accuracy: 0.3868 - val_loss:

The `generate_text` function uses the trained RNN to extend a given seed text. At each step it tokenizes the last `sequence_length` characters of the text generated so far (left-padding with zeros if the text is shorter), feeds them to the model, and appends the character with the highest predicted probability. Repeating this for the requested number of characters builds the output one character at a time. Because the most probable character is always chosen (greedy decoding), generation is deterministic and can settle into repetitive loops.

In [None]:
def generate_text(model, tokenizer, seed_text, length=200, sequence_length=40):
    """
    Generate text using a trained model and a seed text.

    Parameters:
    - model: Trained RNN model
    - tokenizer: Tokenizer object used for preprocessing
    - seed_text: Initial text to start generating from
    - length: Number of characters to generate
    - sequence_length: Number of characters the model expects as input

    Returns:
    - generated_text: Generated text string
    """
    generated_text = seed_text
    for _ in range(length):
        # Tokenize the last `sequence_length` characters; shorter inputs are left-padded with 0
        tokenized_sequence = tokenizer.texts_to_sequences([generated_text[-sequence_length:]])[0]
        tokenized_sequence = pad_sequences([tokenized_sequence], maxlen=sequence_length)

        # Predict the next character (greedy decoding: take the most probable index)
        predicted_probs = model.predict(tokenized_sequence, verbose=0)[0]
        predicted_char_index = int(np.argmax(predicted_probs))

        # Map the index back to its character; index 0 is reserved and has no character
        generated_text += tokenizer.index_word.get(predicted_char_index, '')

    return generated_text

# Generate text using a seed
seed_text = "To be, or not to be, that is the question: "
generated_text = generate_text(rnn_model, tokenizer, seed_text)
print("Generated Text:\n", generated_text)

Generated Text:
 To be, or not to be, that is the question: if fail fail fail fail fail fail fail fail fail fail fail fail fail fail fail fail fail fail fail fail fail fail fail fail fail fail fail fail fail fail fail fail fail fail fail fail fail fail fail fa
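
The repeated "fail" in this output is typical of always picking the most probable character. One common alternative, which this notebook does not use, is to sample the next character from a temperature-scaled version of the predicted distribution. The sketch below shows that idea with NumPy; `sample_with_temperature` is a hypothetical helper, and the probability vector is made up for illustration.

```python
import numpy as np


def sample_with_temperature(probs, temperature=1.0, rng=None):
    """Sample an index from `probs` after rescaling by `temperature`.

    Low temperatures approach argmax; high temperatures approach uniform sampling.
    """
    if rng is None:
        rng = np.random.default_rng()
    probs = np.asarray(probs, dtype=np.float64)
    # Rescale in log space; the small constant avoids log(0).
    logits = np.log(probs + 1e-12) / temperature
    logits -= logits.max()  # Improves numerical stability before exponentiating
    scaled = np.exp(logits)
    scaled /= scaled.sum()
    return int(rng.choice(len(scaled), p=scaled))


demo_probs = [0.1, 0.7, 0.2]  # made-up model output for three characters
rng = np.random.default_rng(0)
print(sample_with_temperature(demo_probs, temperature=0.01, rng=rng))  # 1
print(sample_with_temperature(demo_probs, temperature=1.0, rng=rng))
```

Inside `generate_text`, such a function could stand in for the `np.argmax` call, with the temperature controlling the trade-off between repetition and randomness.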


## Exercises

* Explore how different sequence lengths affect the model's performance and the coherence of the generated text.
* Replace the simple RNN layer with a bidirectional LSTM or GRU layer to improve model performance and text generation quality.