# Lab | Text Generation from Shakespeare's Sonnets

This notebook explores the fascinating domain of text generation using a deep learning model trained on Shakespeare's sonnets. 

The objective is to create a neural network capable of generating text sequences that mimic the style and language of Shakespeare.

By utilizing a Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) layers, this project aims to demonstrate how a model can learn and replicate the complex patterns of early modern English. 

The dataset used consists of Shakespeare's sonnets, which are preprocessed and tokenized to serve as input for the model.

Throughout this notebook, you will see the steps taken to prepare the data, build and train the model, and evaluate its performance in generating text. 

This lab provides a hands-on approach to understanding the intricacies of natural language processing (NLP) and the potential of machine learning in creative text generation.

Let's import the necessary libraries.

In [14]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import regularizers
import tensorflow.keras.utils as ku 
import numpy as np

Let's get the data!

In [15]:
import requests
url = 'https://raw.githubusercontent.com/martin-gorner/tensorflow-rnn-shakespeare/master/shakespeare/sonnets.txt'
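resp = requests.get(url)
resp.raise_for_status()  # Stop early if the download failed
with open('sonnets.txt', 'wb') as f:
    f.write(resp.content)

# Read the file back and close the handle when done
with open('sonnets.txt', encoding='utf-8') as f:
    data = f.read()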

corpus = data.lower().split("\n")


Step 1: Initialise a tokenizer and fit it on the corpus variable using .fit_on_texts

In [16]:
# Your code here :

from tensorflow.keras.preprocessing.text import Tokenizer

# Initialize the tokenizer, which splits text into words and assigns each unique word an integer ID
tokenizer = Tokenizer()

# Fit the tokenizer on the corpus to build the word-to-index vocabulary
tokenizer.fit_on_texts(corpus)

Step 2: Calculate the Vocabulary Size

Let's figure out how many unique words are in your corpus. This will be the size of your vocabulary.

Calculate the length of tokenizer.word_index, add 1 to it and store it in a variable called total_words.

In [17]:
# Your code here :

# Calculate the vocabulary size
total_words = len(tokenizer.word_index) + 1  # +1 because word indices start at 1 and index 0 is reserved for padding

# Print the vocabulary size
print(f"Total Vocabulary Size: {total_words}")

Total Vocabulary Size: 3375


Create an empty list called input_sequences.

For each sentence in your corpus, convert the text into a sequence of integers using the tokenizer.
Then, generate n-gram sequences from these tokens.

Store the result in the list input_sequences.

In [18]:
# Your code here :

# Create an empty list to store input sequences
input_sequences = []

# For each line in the corpus
for line in corpus:
    # Convert the line into a sequence of integers
    token_list = tokenizer.texts_to_sequences([line])[0] 

    # Generate n-gram sequences
    for i in range(1, len(token_list) + 1):
        n_gram_sequence = token_list[:i]  # Take the first 'i' tokens as an n-gram
        input_sequences.append(n_gram_sequence)  # Append to the input_sequences list

# Print the first few input sequences to confirm
print(input_sequences[:5])

[[878], [3], [3, 2], [3, 2, 313], [3, 2, 313, 1375]]


Calculate the length of the longest sequence in input_sequences. Assign the result to a variable called max_sequence_len.

Now pad the sequences using pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre').
Convert it to a numpy array and assign the result back to our variable called input_sequences.

In [19]:
# Your code here :

from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

# Step 1: Find the length of the longest sequence in input_sequences
max_sequence_len = max([len(seq) for seq in input_sequences])

# Step 2: Pad the sequences to the length of the longest sequence
input_sequences = pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre')

# Step 3: Make sure the result is a NumPy array (pad_sequences already returns one, so this is a no-op safeguard)
input_sequences = np.array(input_sequences)

# Print to check
print(f"Max sequence length: {max_sequence_len}")
print(f"Padded sequences shape: {input_sequences.shape}")

Max sequence length: 11
Padded sequences shape: (17805, 11)


Prepare Predictors and Labels

Split the sequences into two parts:

- Predictors: All elements from input_sequences except the last one.
- Labels: The last element of each sequence in input_sequences.

In [20]:
# Your code here :

# Step 1: Split the sequences into predictors and labels
predictors = input_sequences[:, :-1]  # All elements except the last one
labels = input_sequences[:, -1]       # The last element of each sequence

# Print the shape of predictors and labels to check
print(f"Predictors shape: {predictors.shape}")
print(f"Labels shape: {labels.shape}")


Predictors shape: (17805, 10)
Labels shape: (17805,)
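
To make the prefix, padding, and split steps concrete, here is a small pure-Python sketch that traces one short token list through all three. The helper names prefix_sequences and pad_pre are hypothetical and only mimic the behaviour used above; the token IDs are taken from the printed sample.

```python
def prefix_sequences(token_ids):
    """Return every prefix of the token list, from length 1 to full length."""
    return [token_ids[:i] for i in range(1, len(token_ids) + 1)]


def pad_pre(seq, maxlen, value=0):
    """Left-pad with `value`; if too long, keep only the last `maxlen` items."""
    return [value] * (maxlen - len(seq)) + list(seq)[-maxlen:]


tokens = [3, 2, 313, 1375]            # one tokenized line
prefixes = prefix_sequences(tokens)   # n-gram style prefixes
maxlen = max(len(p) for p in prefixes)
padded = [pad_pre(p, maxlen) for p in prefixes]

# Everything except the last column predicts the last column
predictors = [row[:-1] for row in padded]
labels = [row[-1] for row in padded]

for x, y in zip(predictors, labels):
    print(x, "->", y)
```

Note that the length-one prefix produces a predictor row made only of padding, so the model is asked to guess the first word of a line from no context.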


One-Hot Encode the Labels :

Convert the labels (which are integers) into one-hot encoded vectors. 

Ensure the length of these vectors matches the total number of unique words in your vocabulary.

Use ku.to_categorical() on labels with num_classes = total_words

Assign the result back to our variable labels.

In [21]:
# Your code here :

from tensorflow.keras.utils import to_categorical

# Step 1: One-hot encode the labels
labels = to_categorical(labels, num_classes=total_words)

# Print to check the shape of the labels
print(f"Labels shape (one-hot encoded): {labels.shape}")


Labels shape (one-hot encoded): (17805, 3375)
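
As a rough illustration of what one-hot encoding does to each label, the pure-Python sketch below builds the vectors by hand for a toy vocabulary of 5 classes. The function one_hot is a hypothetical helper, not part of Keras.

```python
def one_hot(label, num_classes):
    """Return a vector of zeros with a single 1.0 at position `label`."""
    vector = [0.0] * num_classes
    vector[label] = 1.0
    return vector


toy_labels = [3, 2, 4]
encoded = [one_hot(label, num_classes=5) for label in toy_labels]

for label, vector in zip(toy_labels, encoded):
    print(label, "->", vector)
```

Each row has as many entries as there are classes, which is why the real labels array above has 3375 columns.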


# Initialize the Model

Start by creating a Sequential model.

Add Layers to the Model:

Embedding Layer: The first layer is an embedding layer. It converts word indices into dense vectors of fixed size (100 in this case). Set the input length to the maximum sequence length minus one, which corresponds to the number of previous words the model will consider when predicting the next word.

Bidirectional LSTM Layer: Add a Bidirectional LSTM layer with 150 units. This layer reads the sequence in both directions, so each position sees context from the words before and after it. Set return_sequences=True so that the next LSTM layer receives the full sequence of outputs.

Dropout Layer: Add a dropout layer with a rate of 0.2 to prevent overfitting by randomly setting 20% of the input units to 0 during training.

LSTM Layer: Add a second LSTM layer with 100 units. It reads the full sequence produced by the previous layer and passes only its final output to the next layer.

Dense Layer (Intermediate): Add a dense layer with half the total number of words as units, using ReLU activation. A regularization term (L2) is added to prevent overfitting.

Dense Layer (Output): The final dense layer has as many units as there are words in the vocabulary, with a softmax activation function to output a probability distribution over all words.

In [9]:
model = Sequential([

    # Your code here :
    
])
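
One possible way to fill in the cell above, offered as a sketch rather than the reference solution. It hard-codes total_words = 3375 and max_sequence_len = 11 from the earlier outputs so that it runs on its own. Two details are assumptions that the instructions do not specify: the L2 factor of 0.01, and the use of an Input layer to fix the sequence length instead of the Embedding layer's input_length argument, which recent Keras releases warn about.

```python
from tensorflow.keras import Input, regularizers
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional
from tensorflow.keras.models import Sequential

total_words = 3375       # vocabulary size printed earlier
max_sequence_len = 11    # longest sequence printed earlier

model = Sequential([
    # Each sample holds the previous words only (sequence length minus the label)
    Input(shape=(max_sequence_len - 1,)),
    # Map each word index to a dense 100-dimensional vector
    Embedding(total_words, 100),
    # Read the sequence in both directions and keep every time step
    Bidirectional(LSTM(150, return_sequences=True)),
    # Randomly zero 20% of the inputs during training
    Dropout(0.2),
    # Summarize the sequence into a single 100-dimensional vector
    LSTM(100),
    # Intermediate layer with half the vocabulary size; L2 factor is an assumption
    Dense(total_words // 2, activation='relu',
          kernel_regularizer=regularizers.l2(0.01)),
    # One probability per word in the vocabulary
    Dense(total_words, activation='softmax'),
])

print(model.output_shape)
```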

# Compile the Model:

Compile the model using categorical crossentropy as the loss function, the Adam optimizer for efficient training, and accuracy as the metric to evaluate during training.

In [10]:
# Your code here :
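
A minimal sketch of the compile step described above. The tiny two-layer model is a hypothetical stand-in so the snippet runs on its own; in the lab you would call compile on your own model. Using Adam with its default learning rate is an assumption.

```python
from tensorflow.keras import Input
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam

# Hypothetical stand-in model; replace with the model built earlier
model = Sequential([Input(shape=(4,)), Dense(3, activation='softmax')])

model.compile(
    loss='categorical_crossentropy',  # matches one-hot encoded labels
    optimizer=Adam(),                 # default learning rate (assumption)
    metrics=['accuracy'],             # reported after every epoch
)
```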

# Print Model Summary:

Use model.summary() to print a summary of the model, which shows the layers, their output shapes, and the number of parameters.

In [11]:
# Your code here :
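
To help read the parameter column of the summary, the sketch below recomputes the expected counts by hand. It assumes the architecture exactly as described above (embedding size 100, Bidirectional LSTM with 150 units, LSTM with 100 units, a dense layer with total_words // 2 units) and total_words = 3375; the helper functions are hypothetical.

```python
def lstm_params(input_dim, units):
    """An LSTM has 4 gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """A dense layer has one weight per input-unit pair plus one bias per unit."""
    return input_dim * units + units


total_words = 3375

counts = {
    "embedding": total_words * 100,
    # Two independent LSTMs, one per direction
    "bidirectional_lstm": 2 * lstm_params(100, 150),
    # Input is the concatenation of both directions: 2 * 150 features
    "lstm": lstm_params(2 * 150, 100),
    "dense_hidden": dense_params(100, total_words // 2),
    "dense_output": dense_params(total_words // 2, total_words),
}
total = sum(counts.values())

for name, value in counts.items():
    print(f"{name:>20}: {value:,}")
print(f"{'total':>20}: {total:,}")
```

Most of the parameters sit in the final dense layer, because its size grows with the product of the intermediate layer width and the vocabulary size.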

# Now train the model for 50 epochs and assign the result to a variable called history.

Training the model with 50 epochs should get you around 40% accuracy.

You can train the model for as many epochs as your time and computing constraints allow. Ideally, train it for more than 50 epochs.

That way you will get better text generation at the end.

However, don't waste your time.

In [12]:
# Your code here :
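
The sketch below shows the shape of a training call and of the object it returns. Everything in it is a stand-in chosen to run in seconds: a toy model, random data, and 3 epochs. In the lab, the equivalent call would be along the lines of history = model.fit(predictors, labels, epochs=50, verbose=1).

```python
import numpy as np
from tensorflow.keras import Input
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.models import Sequential

# Toy sizes and random data, for illustration only
vocab_size, seq_len, n_samples = 20, 5, 64
rng = np.random.default_rng(0)
toy_predictors = rng.integers(1, vocab_size, size=(n_samples, seq_len))
toy_labels = np.eye(vocab_size)[rng.integers(1, vocab_size, size=n_samples)]

toy_model = Sequential([
    Input(shape=(seq_len,)),
    Embedding(vocab_size, 8),
    LSTM(16),
    Dense(vocab_size, activation='softmax'),
])
toy_model.compile(loss='categorical_crossentropy', optimizer='adam',
                  metrics=['accuracy'])

# fit() returns a History object; its .history dict holds one value per epoch
history = toy_model.fit(toy_predictors, toy_labels, epochs=3, verbose=0)
print(sorted(history.history.keys()))
```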

# Use plt from matplotlib to plot the training accuracy over epochs and the loss over epochs

First you will have to get the accuracy and loss values for each epoch. You can read them from the history.history dictionary of the object returned by model.fit(), under the keys 'accuracy' and 'loss'.

In [13]:
# Your code here :
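
A possible plotting helper, assuming a dictionary shaped like history.history with 'accuracy' and 'loss' keys. The function name plot_history is hypothetical, and the numbers passed to it are placeholders that exist only to make the sketch runnable; they are not results from this lab.

```python
import matplotlib.pyplot as plt


def plot_history(history_dict):
    """Plot training accuracy and loss side by side, one point per epoch."""
    epochs = range(1, len(history_dict["loss"]) + 1)
    fig, (ax_acc, ax_loss) = plt.subplots(1, 2, figsize=(12, 4))

    ax_acc.plot(epochs, history_dict["accuracy"])
    ax_acc.set_title("Training accuracy")
    ax_acc.set_xlabel("Epoch")
    ax_acc.set_ylabel("Accuracy")

    ax_loss.plot(epochs, history_dict["loss"])
    ax_loss.set_title("Training loss")
    ax_loss.set_xlabel("Epoch")
    ax_loss.set_ylabel("Loss")

    fig.tight_layout()
    return fig


# Placeholder values only; in the lab, pass history.history instead
placeholder = {"accuracy": [0.02, 0.05, 0.09], "loss": [7.1, 6.4, 6.0]}
fig = plot_history(placeholder)
```

In a notebook the figure renders inline at the end of the cell; in a script, call plt.show() afterwards.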

# Generate text with the model based on a seed text

Now you will create two variables :

- seed_text = 'Write the text you want the model to use as a starting point to generate the next words'
- next_words = number_of_words_you_want_the_model_to_generate

Please replace number_of_words_you_want_the_model_to_generate with an actual integer.

In [14]:
# Your code here :

Now create a loop that runs based on the next_words variable and generates new text based on your seed_text input string. Print the full text with the generated text at the end.

This time you don't get detailed instructions.

Have fun!

In [15]:
# Your code here :
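
If you get stuck, here is one way the loop could look: a greedy sketch that always picks the most probable next word. The function generate_text and the AlwaysSameWord class are hypothetical; the class is a stand-in for a trained model so the snippet runs without training, and its output is meaningless. In the lab, pass your own model, tokenizer, and max_sequence_len.

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer


def generate_text(model, tokenizer, seed_text, next_words, max_sequence_len):
    """Greedily append up to `next_words` predicted words to `seed_text`."""
    text = seed_text
    for _ in range(next_words):
        # Encode the running text and left-pad it to the predictor length
        token_list = tokenizer.texts_to_sequences([text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len - 1,
                                   padding='pre')
        # The model returns one probability per vocabulary index
        probabilities = model.predict(token_list, verbose=0)[0]
        predicted_id = int(np.argmax(probabilities))
        # Index 0 is the padding slot and has no word attached
        word = tokenizer.index_word.get(predicted_id)
        if word is None:
            break
        text += " " + word
    return text


class AlwaysSameWord:
    """Hypothetical stand-in for a trained model: always favours index 1."""

    def __init__(self, vocab_size):
        self.vocab_size = vocab_size

    def predict(self, x, verbose=0):
        probs = np.zeros((len(x), self.vocab_size))
        probs[:, 1] = 1.0
        return probs


toy_tokenizer = Tokenizer()
toy_tokenizer.fit_on_texts(["shall i compare thee to a summer's day"])
stand_in = AlwaysSameWord(len(toy_tokenizer.word_index) + 1)

demo = generate_text(stand_in, toy_tokenizer, "shall i",
                     next_words=3, max_sequence_len=11)
print(demo)
```

Greedy selection tends to repeat itself; sampling from the predicted probabilities instead of taking the argmax is a common variation to try.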

Experiment with at least 3 different seed_text strings and see what happens!

In [16]:
# Your code here :