# Lab | Text Generation from Shakespeare's Sonnets

This notebook explores text generation using a deep learning model trained on Shakespeare's sonnets.

The objective is to create a neural network capable of generating text sequences that mimic the style and language of Shakespeare.

By utilizing a Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) layers, this project aims to demonstrate how a model can learn and replicate the complex patterns of early modern English. 

The dataset used consists of Shakespeare's sonnets, which are preprocessed and tokenized to serve as input for the model.

Throughout this notebook, you will see the steps taken to prepare the data, build and train the model, and evaluate its performance in generating text. 

This lab provides a hands-on approach to understanding the intricacies of natural language processing (NLP) and the potential of machine learning in creative text generation.

Let's import the necessary libraries.

In [1]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import regularizers
import tensorflow.keras.utils as ku 
import numpy as np

Let's get the data!

In [2]:
import requests
url = 'https://raw.githubusercontent.com/martin-gorner/tensorflow-rnn-shakespeare/master/shakespeare/sonnets.txt'
resp = requests.get(url)
with open('sonnets.txt', 'wb') as f:
    f.write(resp.content)

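# Read the file back as text; the context manager closes the file handle
with open('sonnets.txt', encoding='utf-8') as f:
    data = f.read()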

corpus = data.lower().split("\n")

Step 1: Initialise a tokenizer and fit it on the corpus variable using .fit_on_texts

In [3]:
from tensorflow.keras.preprocessing.text import Tokenizer

# Step 1: Initialize the tokenizer and fit it on the corpus
tokenizer = Tokenizer()  # Initialize the tokenizer

# Fit the tokenizer on the corpus
tokenizer.fit_on_texts(corpus)

# You can check the word index (dictionary of words and their corresponding token values)
word_index = tokenizer.word_index
print("Word index:\n", word_index)

# Convert the corpus into sequences of integers
sequences = tokenizer.texts_to_sequences(corpus)

# Display the tokenized sequences
print("Tokenized sequences:\n", sequences)

# Explanation:
#Tokenizer(): Initializes the Keras tokenizer.
#fit_on_texts(): Fits the tokenizer on the corpus, which means it builds a vocabulary based on the text data and assigns a unique integer to each word.
#word_index: This dictionary contains the mapping of words to their assigned integer tokens.
#texts_to_sequences(): This converts the corpus into sequences of integers based on the tokenization.


Word index:
Tokenized sequences:
 [[878], [], [], [], [3, 2, 313, 1375, 4], [118, 1376, 878], [1377, 1378, 1379, 23, 1380], [1, 8, 517], [1381, 30], [126, 186, 278, 635, 1382], [2, 98, 879], [1383, 7], [1384, 279], [880, 880], [], [], [6], [], [34, 418, 881, 166, 214, 518], [8, 882, 135, 353, 103, 156, 199], [16, 22, 2, 883, 61, 30, 48, 636], [25, 314, 637, 103, 200, 25, 280], [16, 10, 884, 3, 62, 85, 215, 53], [1385, 9, 1386, 638, 11, 123, 1387, 1388], [201, 17, 1389, 64, 519, 202], [119, 9, 1390, 3, 9, 47, 123, 136, 281], [10, 8, 54, 63, 2, 419, 315, 420], [1, 313, 1391, 3, 2, 1392, 421], [216, 62, 85, 885, 1393, 9, 886], [1, 314, 887, 888, 316, 7, 1394], [257, 2, 94, 36, 354, 29, 1395, 21], [3, 639, 2, 419, 355, 30, 2, 640, 1, 19], [], [1396], [], [27, 1397, 889, 46, 1398, 9, 282], [1, 1399, 283, 1400, 7, 9, 135, 1401], [9, 1402, 179, 1403, 20, 1404, 35, 63], [49, 21, 17, 890, 641, 4, 891, 127, 892], [38, 81, 1405, 64, 23, 9, 51, 202], [64, 23, 2, 258, 4, 9, 893, 145], [3, 95, 216, 

Step 2: Calculate the Vocabulary Size

Let's figure out how many unique words are in your corpus. This will be the size of your vocabulary.

Calculate the length of tokenizer.word_index, add 1 to it and store it in a variable called total_words.

In [4]:
# Step 2: Calculate the vocabulary size
total_words = len(tokenizer.word_index) + 1  # Add 1 because index 0 is reserved for padding
print("Total words (vocabulary size):", total_words)

# Explanation:
# tokenizer.word_index: This is a dictionary where the keys are the unique words from the corpus and the values are the corresponding integer indices assigned to them.
# len(tokenizer.word_index): This calculates the number of unique words in the corpus.
# Adding 1: The Keras Tokenizer assigns word indices starting at 1 and leaves 0 free for padding, so the Embedding and output layers need len(word_index) + 1 slots to cover indices 0 through len(word_index).
# The vocabulary size is stored in the variable total_words and printed.

Total words (vocabulary size): 3375
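
To make the "+ 1" concrete, here is a minimal pure-Python sketch of a frequency-ranked word index. It only mimics the behaviour described above (indices start at 1, most frequent word first) and is not the Keras implementation; `build_word_index` and the toy corpus are hypothetical names used for illustration.

```python
from collections import Counter

def build_word_index(lines):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for line in lines for word in line.lower().split())
    # most_common() sorts by descending count; index 0 is deliberately left unused
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

toy_corpus = ["the rose and the thorn", "the rose fades"]
toy_word_index = build_word_index(toy_corpus)

# +1 reserves index 0 for padding, so valid indices run from 0 to len(toy_word_index)
toy_total_words = len(toy_word_index) + 1

print(toy_word_index)
print("Vocabulary size including padding index:", toy_total_words)
```

Because the largest word index equals `len(toy_word_index)`, any layer sized by vocabulary needs one extra slot to hold index 0.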


Create an empty list called input_sequences.

For each sentence in your corpus, convert the text into a sequence of integers using the tokenizer.
Then, generate n-gram sequences from these tokens.

Store the result in the list input_sequences.

In [5]:
# Initialize an empty list to hold input sequences
input_sequences = []

# Loop over each sentence in the corpus
for line in corpus:
    # Convert the sentence into a sequence of integers using the tokenizer
    token_list = tokenizer.texts_to_sequences([line])[0]
    
    # Generate n-grams for the sequence
    for i in range(1, len(token_list)):
        # Take the first i+1 tokens of the line as one training sequence
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)

# Display the first few input sequences to verify
print("First few input sequences:", input_sequences[:5])

# tokenizer.texts_to_sequences([line])[0]: Converts each sentence in the corpus into a sequence of integers (tokens).
# for i in range(1, len(token_list)): Loops over positions 1 to len(token_list) - 1, so every generated sequence has at least two tokens (one input word and one target word).
# n_gram_sequence = token_list[:i+1]: This slice keeps the first i+1 tokens of the sentence, producing progressively longer prefixes.
# input_sequences.append(n_gram_sequence): Appends each n-gram sequence to the input_sequences list.

First few input sequences: [[3, 2], [3, 2, 313], [3, 2, 313, 1375], [3, 2, 313, 1375, 4], [118, 1376]]
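
The prefix construction can be checked in isolation. The sketch below is a pure-Python restatement of the inner loop, applied to one tokenized line from the corpus; `prefix_ngrams` is a hypothetical helper name, not part of Keras.

```python
def prefix_ngrams(token_list):
    """Return every prefix of token_list that contains at least two tokens."""
    return [token_list[:i + 1] for i in range(1, len(token_list))]

toy_tokens = [3, 2, 313, 1375, 4]

# Each prefix later becomes one training example: all but the last token
# are the input, and the last token is the word to predict.
for seq in prefix_ngrams(toy_tokens):
    print(seq[:-1], "->", seq[-1])
```

A line with a single token (or an empty line) yields no sequences, which is why the blank lines and one-word lines in the corpus contribute nothing to `input_sequences`.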


Calculate the length of the longest sequence in input_sequences. Assign the result to a variable called max_sequence_len.

Now pad the sequences using pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre').
Convert it to a numpy array and assign the result back to our variable called input_sequences.

In [6]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

# Step 1: Calculate the length of the longest sequence
max_sequence_len = max([len(seq) for seq in input_sequences])
print("Max sequence length:", max_sequence_len)

# Step 2: Pad the sequences
input_sequences = pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre')

# Step 3: Convert the result to a numpy array
# (pad_sequences already returns a NumPy array, so this only makes the type explicit)
input_sequences = np.array(input_sequences)

# Verify the padded sequences
print("Padded input sequences (shape):", input_sequences.shape)

# max([len(seq) for seq in input_sequences]): This finds the length of the longest sequence in input_sequences.
# pad_sequences(): Pads all sequences to the same length, which is max_sequence_len. The padding='pre' option pads sequences from the start (front).
# np.array(input_sequences): Converts the padded sequences into a NumPy array for efficient computation and compatibility with deep learning models.
# After running this code, input_sequences will contain the padded sequences with equal length, ready for use in model.

Max sequence length: 11
Padded input sequences (shape): (15484, 11)
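
For intuition, pre-padding can be sketched in a few lines of pure Python. This is an illustrative stand-in for `pad_sequences(..., padding='pre')`, not the Keras implementation; `pad_pre` is a hypothetical name, and the truncation rule (keep the last `maxlen` items) is assumed to match the Keras default.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad seq with `value` up to maxlen; keep the last maxlen items if longer."""
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

toy_sequences = [[3, 2], [3, 2, 313], [118, 1376, 878, 5]]
toy_maxlen = max(len(s) for s in toy_sequences)
toy_padded = [pad_pre(s, toy_maxlen) for s in toy_sequences]

for row in toy_padded:
    print(row)
```

Padding at the front keeps the most recent words, and the target word, aligned at the right-hand end of every row.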


Prepare Predictors and Labels

Split the sequences into two parts:

- Predictors: All elements from input_sequences except the last one.
- Labels: The last element of each sequence in input_sequences.

In [7]:
# Split the sequences into predictors and labels
predictors = input_sequences[:, :-1]  # All elements except the last one
labels = input_sequences[:, -1]       # The last element of each sequence

# Verify the shapes of predictors and labels
print("Predictors shape:", predictors.shape)
print("Labels shape:", labels.shape)

# input_sequences[:, :-1]: This slices all rows of input_sequences and keeps all columns except the last one (these are your predictors).
# input_sequences[:, -1]: This extracts the last column of each row (these are your labels).
# By splitting your input sequences this way, you will have:

# predictors: All elements except the last one from each sequence.
# labels: The last element of each sequence.
# This structure will allow you to train your model with the sequences as inputs (predictors) and the next word in the sequence as the target (labels).

Predictors shape: (15484, 10)
Labels shape: (15484,)
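
The same split can be traced on a tiny example. The sketch below uses plain Python lists rather than NumPy slicing, and the rows are made-up padded sequences, so treat it as an illustration only.

```python
# Three hypothetical pre-padded rows of length 4
toy_padded = [
    [0, 0, 3, 2],
    [0, 3, 2, 313],
    [3, 2, 313, 1375],
]

# Everything except the last column is the input; the last column is the target
toy_predictors = [row[:-1] for row in toy_padded]
toy_labels = [row[-1] for row in toy_padded]

for x, y in zip(toy_predictors, toy_labels):
    print(x, "->", y)
```

Each row loses one column, which is why the predictors have width `max_sequence_len - 1`.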


One-Hot Encode the Labels:

Convert the labels (which are integers) into one-hot encoded vectors. 

Ensure the length of these vectors matches the total number of unique words in your vocabulary.

Use ku.to_categorical() on labels with num_classes = total_words

Assign the result back to our variable labels.

In [8]:
from tensorflow.keras.utils import to_categorical

# One-hot encode the labels
labels = to_categorical(labels, num_classes=total_words)

# Verify the shape of the labels after one-hot encoding
print("Shape of one-hot encoded labels:", labels.shape)

# To one-hot encode the labels (which are currently integers), you will use keras.utils.to_categorical(). 
# This function will convert the integer labels into one-hot encoded vectors. 
# The number of classes in this encoding should match the total number of unique words in your vocabulary (total_words).
# to_categorical(labels, num_classes=total_words): Converts the integer labels into one-hot encoded vectors, 
# where each label is represented as a vector with a length equal to total_words (the total number of unique words in the vocabulary).
# The num_classes parameter ensures that the length of each one-hot encoded vector matches the size of the vocabulary.

Shape of one-hot encoded labels: (15484, 3375)
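
To see what one-hot labels mean for the loss, here is a small pure-Python sketch. It shows that categorical cross-entropy against a one-hot target reduces to the negative log of the probability assigned to the true word. The helper names are hypothetical, and the probabilities are invented for illustration.

```python
import math

def one_hot(label, num_classes):
    """Return a vector of zeros with a single 1.0 at position `label`."""
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec

def categorical_crossentropy(target, probs):
    """Cross-entropy between a one-hot target and a predicted distribution."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

toy_probs = [0.1, 0.2, 0.6, 0.1]   # made-up softmax output over 4 words
toy_target = one_hot(2, num_classes=4)

toy_loss = categorical_crossentropy(toy_target, toy_probs)
print(toy_target)
print(toy_loss, -math.log(toy_probs[2]))
```

Since only the true class contributes, Keras also provides a `sparse_categorical_crossentropy` loss that accepts the integer labels directly; this lab does not use it, but it avoids materialising a (15484, 3375) label matrix.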


# Initialize the Model

Start by creating a Sequential model.

Add Layers to the Model:

Embedding Layer: The first layer is an embedding layer. It converts word indices into dense vectors of fixed size (100 in this case). Set the input length to the maximum sequence length minus one, which corresponds to the number of previous words the model will consider when predicting the next word.

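Bidirectional LSTM Layer: Add a Bidirectional LSTM layer with 150 units. This layer reads the input window of previous words in both directions (left to right and right to left). Set return_sequences=True so that the full sequence of outputs, not just the final one, is passed to the next LSTM layer.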

Dropout Layer: Add a dropout layer with a rate of 0.2 to prevent overfitting by randomly setting 20% of the input units to 0 during training.

LSTM Layer: Add a second LSTM layer with 100 units. This layer processes the sequence and passes its output to the next layer.

Dense Layer (Intermediate): Add a dense layer with half the total number of words as units, using ReLU activation. A regularization term (L2) is added to prevent overfitting.

Dense Layer (Output): The final dense layer has as many units as there are words in the vocabulary, with a softmax activation function to output a probability distribution over all words.

In [9]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dropout, Dense
from tensorflow.keras.regularizers import l2

# Initialize the model
model = Sequential()

# Add an embedding layer
model.add(Embedding(input_dim=total_words, output_dim=100, input_length=max_sequence_len-1))

# Add a Bidirectional LSTM layer with 150 units
model.add(Bidirectional(LSTM(150, return_sequences=True)))

# Add a Dropout layer to prevent overfitting
model.add(Dropout(0.2))

# Add another LSTM layer with 100 units
model.add(LSTM(100))

# Add a Dense layer with half the total number of words and L2 regularization
model.add(Dense(total_words // 2, activation='relu', kernel_regularizer=l2(0.01)))

# Add the final output layer (Dense) with softmax activation
model.add(Dense(total_words, activation='softmax'))

# Print the model summary
model.summary()

# 1 Embedding Layer:
# Converts word indices into dense vectors of fixed size (100).
# input_length=max_sequence_len-1: The input length is set to the maximum sequence length minus one since the last word will be used as the target.

# 2 Bidirectional LSTM Layer:
# Adds a bidirectional LSTM layer with 150 units that reads the input sequence in both directions (left to right and right to left).
# return_sequences=True: Ensures that this layer returns the entire sequence, not just the final output, so it can be passed to the next LSTM layer.

# 3 Dropout Layer:
# Dropout helps prevent overfitting by randomly setting 20% of the input units to 0 during training.

# 4 LSTM Layer:
# A second LSTM layer with 100 units that processes the output from the previous layer.

# 5 Dense Layer (Intermediate):
# A Dense layer with half the number of words as units and ReLU activation, with L2 regularization to prevent overfitting.

# 6 Dense Layer (Output):
# The final output layer has total_words units (same as the vocabulary size) and uses softmax activation to output probabilities for each word in the vocabulary.



# Compile the Model:

Compile the model using categorical crossentropy as the loss function, the Adam optimizer for efficient training, and accuracy as the metric to evaluate during training.

In [15]:
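# Build with an explicit input shape (batch dimension left as None)
# so that model.summary() can report output shapes and parameter counts
model.build(input_shape=(None, max_sequence_len - 1))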

In [16]:
# Step 9: Compile the model
model.compile(loss='categorical_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy'])

# Print Model Summary:

Use model.summary() to print a summary of the model, which shows the layers, their output shapes, and the number of parameters.

In [17]:
model.summary()
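
The parameter counts that model.summary() reports can be cross-checked by hand. The sketch below assumes the layer sizes described earlier, a vocabulary of 3375, standard LSTM layers with biases, and a Bidirectional wrapper that concatenates its two directions (the Keras default). `lstm_params` and `dense_params` are hypothetical helper names.

```python
def lstm_params(input_dim, units):
    """4 gates, each with input weights, recurrent weights and a bias vector."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """Weight matrix plus one bias per unit."""
    return input_dim * units + units

vocab_size = 3375       # total_words from the earlier step
embedding_dim = 100

param_counts = {
    "embedding": vocab_size * embedding_dim,
    "bidirectional_lstm": 2 * lstm_params(embedding_dim, 150),  # forward + backward
    "lstm": lstm_params(2 * 150, 100),                          # input is the concatenation
    "dense_hidden": dense_params(100, vocab_size // 2),
    "dense_output": dense_params(vocab_size // 2, vocab_size),
}

for name, count in param_counts.items():
    print(f"{name:>20}: {count:,}")
print(f"{'total':>20}: {sum(param_counts.values()):,}")
```

Under these assumptions the output layer holds most of the parameters, because it connects 1687 hidden units to every word in the vocabulary.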

# Now train the model for 50 epochs and assign it to a variable called history.

Training the model for 50 epochs should get you around 40% accuracy.

You can train the model for as many epochs as your time and computing constraints allow. Ideally, train it for more than 50 epochs.

That way you will get better text generation at the end.

However, don't waste your time.
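
As a quick sanity check on training time, the number of optimisation steps per epoch follows from the dataset size and the batch size. The sketch below assumes the 15,484 training sequences produced earlier and the Keras default batch size of 32, since `model.fit` is called without a `batch_size` argument.

```python
import math

num_samples = 15484   # rows in predictors
batch_size = 32       # Keras default when batch_size is not passed to model.fit
vocab_size = 3375     # total_words

steps_per_epoch = math.ceil(num_samples / batch_size)
chance_accuracy = 1 / vocab_size  # accuracy of guessing uniformly at random

print("Steps per epoch:", steps_per_epoch)
print("Steps for 50 epochs:", 50 * steps_per_epoch)
print(f"Chance-level accuracy: {chance_accuracy:.5f}")
```

Multiplying the per-step time shown in the progress bar by the total number of steps gives a rough estimate of how long a full run will take.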

In [12]:
history = model.fit(predictors, labels, epochs=50, verbose=1)

Epoch 1/50
484/484 ━━━━━━━━━━━━━━━━━━━━ 95s 156ms/step - accuracy: 0.0184 - loss: 7.3654
Epoch 2/50
484/484 ━━━━━━━━━━━━━━━━━━━━ 58s 107ms/step - accuracy: 0.0208 - loss: 6.4966
Epoch 3/50
484/484 ━━━━━━━━━━━━━━━━━━━━ 49s 102ms/step - accuracy: 0.0235 - loss: 6.3957
Epoch 4/50
484/484 ━━━━━━━━━━━━━━━━━━━━ 48s 98ms/step - accuracy: 0.0291 - loss: 6.3089
Epoch 5/50
484/484 ━━━━━━━━━━━━━━━━━━━━ 50s 104ms/step - accuracy: 0.0341 - loss: 6.1814
Epoch 6/50
484/484 ━━━━━━━━━━━━━━━━━━━━ 80s 99ms/step - accuracy: 0.0377 - loss: 6.1035
Epoch 7/50
484/484 ━━━━━━━━━━━━━━━━━━━━ 49s 101ms/step - accuracy: 0.0411 - loss: 6.0515
Epoch 8/50
484/484 ━━━━━━━━━━━━━━━━━━━━ 49s 100ms/step - accuracy: 0.0428 - loss: 5.9788
Epoch 9/50
484

# Use plt from matplotlib to plot the training accuracy over epochs and the loss over epochs

First you will have to get the accuracy and loss data over epochs. You can do this by reading the history attribute of the History object that model.fit returned.

In [18]:
import matplotlib.pyplot as plt

# Step 11: Plot training accuracy and loss over epochs

# Extract the history data for accuracy and loss
accuracy = history.history['accuracy']
loss = history.history['loss']
epochs = range(1, len(accuracy) + 1)

# Plot accuracy over epochs
plt.figure(figsize=(12, 6))

# Plot accuracy
plt.subplot(1, 2, 1)
plt.plot(epochs, accuracy, 'b', label='Training Accuracy')
plt.title('Training Accuracy over Epochs')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

# Plot loss over epochs
plt.subplot(1, 2, 2)
plt.plot(epochs, loss, 'r', label='Training Loss')
plt.title('Training Loss over Epochs')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.tight_layout()
plt.show()

ModuleNotFoundError: No module named 'matplotlib'

# Generate text with the model based on a seed text

Now you will create two variables:

- seed_text = 'Write the text you want the model to use as a starting point to generate the next words'
- next_words = number_of_words_you_want_the_model_to_generate

Replace number_of_words_you_want_the_model_to_generate with an actual integer.

In [19]:
seed_text = "Shall I compare thee to a summer's day"
next_words = 50 

Now create a loop that runs based on the next_words variable and generates new text based on your seed_text input string. Print the full text with the generated text at the end.

This time you don't get detailed instructions.

Have fun!

In [20]:
for _ in range(next_words):
    # Tokenize the current seed text
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    
    # Pad the tokenized sequence to match the input length
    token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
    
    # Predict the next word (highest probability)
    predicted = model.predict(token_list, verbose=0)
    predicted_word_index = int(np.argmax(predicted, axis=-1)[0])

    # Look up the word for the predicted index (index 0 is padding and has no word)
    word = tokenizer.index_word.get(predicted_word_index)
    if word is not None:
        seed_text += " " + word  # Append the predicted word to the seed text

print(seed_text)

Shall I compare thee to a summer's day bail prove end one subjects worth ' bear mine eyes prove woe prove cross prove life alone bear thee alone be thee true art so true one show well tongue something eyes to thee swearing tend tend grow good pride grow fall and thou prove crime prove thine eyes still
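
The generation loop can be packaged as a reusable function so that trying several seeds does not require copying the cell. The sketch below is a pure-Python illustration of the same greedy procedure; `generate_tokens` is a hypothetical helper, and `toy_predict_probs` is a made-up stand-in for `model.predict` so that the example runs without a trained model.

```python
def generate_tokens(seed_tokens, next_words, predict_probs, window):
    """Greedy generation: repeatedly append the most probable next token."""
    tokens = list(seed_tokens)
    for _ in range(next_words):
        # Keep only the last `window` tokens and left-pad with 0 if needed
        context = tokens[-window:]
        context = [0] * (window - len(context)) + context
        probs = predict_probs(context)
        # argmax over the predicted distribution
        tokens.append(max(range(len(probs)), key=probs.__getitem__))
    return tokens

def toy_predict_probs(context):
    """Stand-in for the model: always favours (last token + 1) mod 5."""
    probs = [0.05] * 5
    probs[(context[-1] + 1) % 5] = 0.8
    return probs

print(generate_tokens([1, 2], next_words=4, predict_probs=toy_predict_probs, window=3))
```

With the real model, `predict_probs` would wrap `model.predict` on a padded array and `window` would be `max_sequence_len - 1`; the token ids would then be mapped back to words through the tokenizer.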


Experiment with at least 3 different seed_text strings and see what happens!

In [22]:
seed_text = "Life is dark"
next_words=10


In [23]:
for _ in range(next_words):
    # Tokenize the current seed text
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    
    # Pad the tokenized sequence to match the input length
    token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
    
    # Predict the next word (highest probability)
    predicted = model.predict(token_list, verbose=0)
    predicted_word_index = int(np.argmax(predicted, axis=-1)[0])

    # Look up the word for the predicted index (index 0 is padding and has no word)
    word = tokenizer.index_word.get(predicted_word_index)
    if word is not None:
        seed_text += " " + word  # Append the predicted word to the seed text

print(seed_text)

Life is dark that beauty is in men worth here remain bred past


In [24]:
seed_text = "when will i die? "
next_words=10

In [25]:
for _ in range(next_words):
    # Tokenize the current seed text
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    
    # Pad the tokenized sequence to match the input length
    token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
    
    # Predict the next word (highest probability)
    predicted = model.predict(token_list, verbose=0)
    predicted_word_index = int(np.argmax(predicted, axis=-1)[0])

    # Look up the word for the predicted index (index 0 is padding and has no word)
    word = tokenizer.index_word.get(predicted_word_index)
    if word is not None:
        seed_text += " " + word  # Append the predicted word to the seed text

print(seed_text)

when will i die?  in this art past hell sense thought rare gone doth


In [26]:
seed_text = "the moon is shining bright "
next_words=10

In [27]:
for _ in range(next_words):
    # Tokenize the current seed text
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    
    # Pad the tokenized sequence to match the input length
    token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
    
    # Predict the next word (highest probability)
    predicted = model.predict(token_list, verbose=0)
    predicted_word_index = int(np.argmax(predicted, axis=-1)[0])

    # Look up the word for the predicted index (index 0 is padding and has no word)
    word = tokenizer.index_word.get(predicted_word_index)
    if word is not None:
        seed_text += " " + word  # Append the predicted word to the seed text

print(seed_text)

the moon is shining bright  and unrespected calls me back doth lie in you back
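
All of the cells above pick the single most probable word at each step (greedy decoding), which tends to produce repeated words. One common alternative, not used in this lab, is to sample from the predicted distribution after rescaling it with a temperature. The sketch below is a pure-Python illustration with invented probabilities; `sample_with_temperature` is a hypothetical helper name.

```python
import math
import random

def sample_with_temperature(probs, temperature=1.0, rng=random):
    """Sample an index from probs after sharpening or flattening by temperature."""
    # Work in log space; the floor guards against log(0) for vanishing probabilities
    logits = [math.log(max(p, 1e-12)) / temperature for p in probs]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]  # unnormalised, numerically stable
    return rng.choices(range(len(probs)), weights=weights, k=1)[0]

toy_probs = [0.1, 0.7, 0.2]      # made-up softmax output over 3 words
rng = random.Random(0)           # seeded so runs are repeatable

cold = [sample_with_temperature(toy_probs, temperature=0.01, rng=rng) for _ in range(20)]
warm = [sample_with_temperature(toy_probs, temperature=1.0, rng=rng) for _ in range(20)]
print("temperature 0.01:", cold)
print("temperature 1.0: ", warm)
```

A temperature close to 0 behaves like argmax, while a temperature of 1.0 samples from the model's distribution unchanged; values in between trade repetition for variety.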
