# Generating Text with Frankenstein Using PyTorch


**Disclaimer**

The text generation model you will build in this project is trained exclusively on the ebook [_Frankenstein_](https://www.gutenberg.org/cache/epub/84/pg84-images.html) by Mary Shelley, made freely available by [Project Gutenberg](https://gutenberg.org/). Be aware that the generated text may contain biases or portray perspectives prevalent at the time of the original writing, so please use discretion when interpreting it.

**Setup - Import Libraries**

Run the cell below to import the libraries and set the random seed.

In [1]:
import numpy as np
import torch
torch.manual_seed(1) # set random seed --do not change!

<torch._C.Generator at 0x117459990>

## Task Group 1 - Import Text Dataset



### Task 1

The text file `'frankenstein.txt'` located in the `'datasets'` directory contains an ebook of [_Frankenstein_](https://www.gutenberg.org/cache/epub/84/pg84-images.html) by Mary Shelley, made freely available by [Project Gutenberg](https://gutenberg.org/). We'll use this text exclusively to train our text generation model in this project.

Begin by using the `with` and `open` statements to open and read the text file to the variable `frankenstein`.

In [3]:
with open('datasets/frankenstein.txt', 'r', encoding = 'utf-8') as f:
    frankenstein = f.read()

### Task 2

**Note**: Due to hardware constraints, we'll only use a small portion of the full text to train our text generation model.

The novel begins with a series of letters from one of the main characters. These letters create a frame narrative that foreshadows key themes and provides context for the main story.

Let's extract the first letter by slicing the text in `frankenstein` from the character starting at position `1380` (inclusive) to the character ending at position `8230` (exclusive). 

Save the extracted letter to the variable `first_letter_text` and print it. Feel free to read the letter to get a sense of the text that our model will be trained on.

In [4]:
first_letter_text = frankenstein[1380:8230]
print(first_letter_text)

Letter 1

_To Mrs. Saville, England._


St. Petersburgh, Dec. 11th, 17—.


You will rejoice to hear that no disaster has accompanied the
commencement of an enterprise which you have regarded with such evil
forebodings. I arrived here yesterday, and my first task is to assure
my dear sister of my welfare and increasing confidence in the success
of my undertaking.

I am already far north of London, and as I walk in the streets of
Petersburgh, I feel a cold northern breeze play upon my cheeks, which
braces my nerves and fills me with delight. Do you understand this
feeling? This breeze, which has travelled from the regions towards
which I am advancing, gives me a foretaste of those icy climes.
Inspirited by this wind of promise, my daydreams become more fervent
and vivid. I try in vain to be persuaded that the pole is the seat of
frost and desolation; it ever presents itself to my imagination as the
region of beauty and delight. There, Margaret, the sun is for ever
visible, its broad disk

## Task Group 2 - Tokenization and Create Vocabulary

Next, let's preprocess the text of the first letter by tokenizing it into individual **character-based** tokens.

We'll also create the vocabulary (and inverse vocabulary) containing the collection of unique tokens mapped to unique token IDs.

### Task 3

Use `list()` to tokenize the text in `first_letter_text` into character-based tokens. Save the tokenized text to the variable `tokenized_text`.

Print the number of tokens in the tokenized text of the first letter.

In [5]:
tokenized_text = list(first_letter_text)
print(len(tokenized_text))

6850


<details><summary style="display:list-item; font-size:16px; color:blue;">How many tokens are in the tokenized text of the first letter?</summary>

The first letter in _Frankenstein_ contains 6,850 (character-based) tokens.

</details>

### Task 4

Next, let's start creating the vocabulary that maps each token to a unique token ID. 

Using `tokenized_text`, create a sorted list of unique tokens. Python sorts characters by Unicode code point, so, for example, uppercase letters come before lowercase letters. Save the list to the variable `unique_char_tokens`.

In [6]:
unique_char_tokens = sorted(list(set(tokenized_text)))
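
To see what this ordering looks like, here is a minimal sketch on a short made-up string (the string and the variable names `sample` and `unique_sorted` are illustrative assumptions, not part of the project):

```python
# Hypothetical sample text, used only to illustrate the sort order
sample = "Hi, hi!"

# set() removes duplicates; sorted() orders characters by Unicode code point
unique_sorted = sorted(set(sample))

print(unique_sorted)
# [' ', '!', ',', 'H', 'h', 'i']
```

The space and punctuation marks come first, then the uppercase `'H'`, then the lowercase letters, which matches the layout of the vocabulary built in the next task.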

### Task 5

Now, create the vocabulary by assigning each unique token a token ID based on its positional index in the sorted list `unique_char_tokens`. Save the vocabulary to the variable `c2ix`.

Print the vocabulary.

In [7]:
c2ix = {ch:i for i, ch in enumerate(unique_char_tokens)}
print(c2ix)

{'\n': 0, ' ': 1, '!': 2, ',': 3, '-': 4, '.': 5, '1': 6, '7': 7, ':': 8, ';': 9, '?': 10, 'A': 11, 'B': 12, 'D': 13, 'E': 14, 'F': 15, 'G': 16, 'H': 17, 'I': 18, 'J': 19, 'L': 20, 'M': 21, 'N': 22, 'O': 23, 'P': 24, 'R': 25, 'S': 26, 'T': 27, 'U': 28, 'W': 29, 'Y': 30, '_': 31, 'a': 32, 'b': 33, 'c': 34, 'd': 35, 'e': 36, 'f': 37, 'g': 38, 'h': 39, 'i': 40, 'j': 41, 'k': 42, 'l': 43, 'm': 44, 'n': 45, 'o': 46, 'p': 47, 'q': 48, 'r': 49, 's': 50, 't': 51, 'u': 52, 'v': 53, 'w': 54, 'x': 55, 'y': 56, 'z': 57, '—': 58, '’': 59}


<details><summary style="display:list-item; font-size:16px; color:blue;">What does the vocabulary consist of?</summary>

The vocabulary consists of special characters (like whitespaces and punctuation marks), numbers, uppercase characters, and lowercase characters.

</details>

### Task 6

Let's obtain the vocabulary size (the number of unique tokens). Save the vocabulary size to the variable `vocab_size`.

In [8]:
vocab_size = len(c2ix)
print(vocab_size)

60


<details><summary style="display:list-item; font-size:16px; color:blue;">What is the size of the vocabulary?</summary>

The vocabulary contains **60** unique character-based tokens, drawn from the 6,850 tokens in the tokenized text of the first letter in _Frankenstein_.

</details>

### Task 7

Using the vocabulary `c2ix`, create the inverse vocabulary that maps each token ID back to its character-based token. Save the inverse vocabulary to the variable `ix2c`.

In [9]:
ix2c = {ix:ch for ch, ix in c2ix.items()}

### Task 8

Now, let's use the vocabulary `c2ix` to map each token in the tokenized text to its token ID. Save the resulting list of token IDs to the variable `tokenized_id_text`.

In [10]:
tokenized_id_text = [c2ix[ch] for ch in tokenized_text]
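
As a sanity check on the vocabulary and its inverse, the sketch below runs the same steps on a short phrase and confirms that encoding followed by decoding returns the original text. The names `c2ix_demo`, `ix2c_demo`, `ids`, and `decoded` are illustrative and kept separate from the project variables.

```python
# Short phrase used only for illustration
text = "hear that"
tokens = list(text)

# Vocabulary and inverse vocabulary, built the same way as c2ix and ix2c
c2ix_demo = {ch: i for i, ch in enumerate(sorted(set(tokens)))}
ix2c_demo = {i: ch for ch, i in c2ix_demo.items()}

# Encode characters to token IDs, then decode the IDs back to characters
ids = [c2ix_demo[ch] for ch in tokens]
decoded = "".join(ix2c_demo[i] for i in ids)

print(ids)      # [3, 2, 1, 4, 0, 5, 3, 1, 5]
print(decoded)  # hear that
```

The round trip only works for characters that are in the vocabulary; any other character would raise a `KeyError` during encoding.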

## Task Group 3 - Create Sequences for the Features and Labels

Since we're training an LSTM to generate text, we'll need to preprocess the tokenized text and create sequences of tokens for the features and their corresponding labels.

### Task 9

Start by importing the PyTorch utility classes `Dataset` and `DataLoader` from the `torch.utils.data` module.

In [11]:
from torch.utils.data import Dataset, DataLoader

### Task 10

Create a class named `TextDataset` that utilizes the `Dataset` class such that:

1. The `__init__` method should have the following:
    - takes in the variables `tokenized_text` and `seq_length` as inputs 
    - the attribute `self.tokenized_text` that is assigned with the variable `tokenized_text`
    - the attribute `self.seq_length` that is assigned with the variable `seq_length`
  
2. The `__len__` method that counts the number of features available for training:
    - Hint: this method should return the difference between the length of the tokenized text and the sequence length
  
3. The `__getitem__` method should have the following:
    - takes in the input variable `idx` that helps to index the tokenized text
    - the variable `features` as a tensor that creates sequences from the tokenized text using the index and sequence length
    - the variable `labels` that is created by shifting the features by one token to the right
    - the method should return the `features` and `labels`

In [12]:
class TextDataset(Dataset):
    def __init__(self, tokenized_text, seq_length):
        self.tokenized_text = tokenized_text
        self.seq_length = seq_length

    def __len__(self):
        return len(self.tokenized_text) - self.seq_length

    def __getitem__(self, idx):
        features = torch.tensor(self.tokenized_text[idx: idx + self.seq_length])
        labels = torch.tensor(self.tokenized_text[idx + 1: idx + self.seq_length + 1])
        return features, labels
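
To make the indexing concrete, here is a plain-Python sketch of the same slicing on a tiny made-up list of token IDs (no tensors involved; `token_ids`, `window`, and the sequence length of `3` are illustrative assumptions):

```python
# Eight hypothetical token IDs: 10, 11, ..., 17
token_ids = list(range(10, 18))
seq_length = 3

def window(idx):
    """Mirror the slicing used in TextDataset.__getitem__."""
    features = token_ids[idx: idx + seq_length]
    labels = token_ids[idx + 1: idx + seq_length + 1]  # same window, shifted by one
    return features, labels

# Same count as TextDataset.__len__: the last window still has a full label slice
num_sequences = len(token_ids) - seq_length

pairs = [window(i) for i in range(num_sequences)]
for features, labels in pairs:
    print(features, "->", labels)
# [10, 11, 12] -> [11, 12, 13]
# [11, 12, 13] -> [12, 13, 14]
# [12, 13, 14] -> [13, 14, 15]
# [13, 14, 15] -> [14, 15, 16]
# [14, 15, 16] -> [15, 16, 17]
```

Each label is the token that follows the feature at the same position, so the model learns to predict the next character at every step of the sequence.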

### Task 11

Let's define a sequence length of `48` tokens and save the integer to the variable `seq_length`.

Now use the `TextDataset` class in the previous task to create and store the dataset of features and labels using the tokenized text `tokenized_id_text` and sequence length `seq_length`. Save the created sequences to the variable `dataset`.

In [13]:
seq_length = 48
dataset = TextDataset(tokenized_id_text, seq_length)

### Task 12

Next, let's use the `DataLoader` utility class to create an iterable that allows us to train our LSTM using one batch at a time.

Let's define a batch size of `36` and save the integer to the variable `batch_size`.

Next, use the `DataLoader` class to create the iterable using the following inputs:
- `dataset` containing the sequences for the features and labels
- `batch_size` specifying the batch size
- setting `shuffle=True` to shuffle the sequences to improve training

In [15]:
batch_size = 36
dataloader = DataLoader(dataset, batch_size = batch_size, shuffle = True)
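
The arithmetic below estimates how many batches this produces per epoch. It is a sketch that assumes the `DataLoader` default of keeping the final, smaller batch (`drop_last=False`); the variable names are illustrative.

```python
import math

num_tokens = 6850   # tokens in the first letter
seq_length = 48
batch_size = 36

# Number of (features, labels) pairs, as returned by TextDataset.__len__
num_sequences = num_tokens - seq_length

# Batches per epoch when the last partial batch is kept
num_batches = math.ceil(num_sequences / batch_size)

# Size of the final batch
last_batch_size = num_sequences - (num_batches - 1) * batch_size

print(num_sequences, num_batches, last_batch_size)
# 6802 189 34
```

Because the final batch holds fewer than `36` sequences, any code that depends on the batch size should read it from the batch itself rather than from `batch_size`.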

## Task Group 4 - Build and Train the LSTM Network

Now that our text data has been tokenized and preprocessed, let's start building and training the LSTM.

### Task 13

Start by importing the `torch.nn` module with the alias `nn`.

In [16]:
import torch.nn as nn

### Task 14

Let's now create a class for our LSTM model that will be trained to generate text using character-based tokens.

Start by defining a class named `CharacterLSTM` that inherits from `nn.Module` (the base class for neural networks in PyTorch) with the following:

A. The `__init__` method initializing the following components:
- `super(CharacterLSTM, self).__init__()` : used for proper initialization purposes
- `self.embedding` : an embedding layer that creates an embedding for each token in the vocabulary, with `48` embedding dimensions
- `self.lstm` : an LSTM layer whose input size matches the embedding dimension and whose hidden size is `96`. Be sure to set `batch_first=True`
- `self.linear` : a linear layer whose input size matches the hidden size of the LSTM layer and whose output size equals the vocabulary size


B. The `forward` method takes in the input `x`, and the hidden/cell states `states`. Use the inputs to define the forward method in the following order:
   1. pass the input x through the embedding layer
   2. pass the embedding output along with the previous states to the LSTM layer and return the output and the updated states
   3. pass the LSTM output to the linear layer
   4. reshape the linear layer output to the shape `(batch_size * seq_length, vocab_size)` so that it lines up with the flattened labels
   5. return the reshaped output and the updated states

C. The `init_state` method, which creates fresh hidden and cell states for each new batch during training and returns:
- `hidden` : a tensor of zeros for the hidden state with the shape `(1, batch_size, 96)`
- `cell` : a tensor of zeros for the cell state with the shape `(1, batch_size, 96)`
- Note: the shape of the tensors for each state corresponds to `(1, batch_size, hidden_size)`.

In [23]:
class CharacterLSTM(nn.Module):
    def __init__(self):
        super(CharacterLSTM, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim = 48) 
        self.lstm = nn.LSTM(input_size = 48, hidden_size = 96, batch_first = True)
        self.linear = nn.Linear(96, vocab_size)

    def forward(self, x, states):
        x = self.embedding(x)
        out, states = self.lstm(x, states)
        out = self.linear(out)
        out = out.reshape(-1, out.size(2))
        return out, states

    def init_state(self, batch_size):
        hidden = torch.zeros(1, batch_size, 96)
        cell = torch.zeros(1, batch_size, 96)
        return hidden, cell
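
To follow how the tensor shapes change through the forward pass, the sketch below builds the three layers on their own and pushes a batch of random token IDs through them. The batch size of `4` and the random input are assumptions for illustration; the layer sizes are the ones used in this project.

```python
import torch
import torch.nn as nn

vocab_size, embedding_dim, hidden_size = 60, 48, 96  # sizes used in this project
batch_size, seq_length = 4, 48                       # small illustrative batch

embedding = nn.Embedding(vocab_size, embedding_dim)
lstm = nn.LSTM(embedding_dim, hidden_size, batch_first=True)
linear = nn.Linear(hidden_size, vocab_size)

# Random token IDs standing in for a real batch of features
x = torch.randint(0, vocab_size, (batch_size, seq_length))
states = (torch.zeros(1, batch_size, hidden_size),
          torch.zeros(1, batch_size, hidden_size))

emb = embedding(x)                         # (batch, seq, embedding_dim)
out, states = lstm(emb, states)            # (batch, seq, hidden_size)
logits = linear(out)                       # (batch, seq, vocab_size)
flat = logits.reshape(-1, logits.size(2))  # (batch * seq, vocab_size)

shapes = [tuple(t.shape) for t in (emb, out, logits, flat)]
print(shapes)
# [(4, 48, 48), (4, 48, 96), (4, 48, 60), (192, 60)]
```

The final reshape gives one row of scores per character position, which is the layout the loss function expects once the labels are flattened in the same way.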

### Task 15

Now, let's create an instance of the `CharacterLSTM` class and save it to the variable `lstm_model`.


In [24]:
lstm_model = CharacterLSTM()
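
For a sense of scale, the model's trainable parameters can be counted by hand. This sketch assumes PyTorch's single-layer `nn.LSTM` layout: four gates, each with input and hidden weights, plus two bias vectors.

```python
vocab_size, embedding_dim, hidden_size = 60, 48, 96

# Embedding: one vector of size embedding_dim per token in the vocabulary
embedding_params = vocab_size * embedding_dim

# LSTM: 4 gates, each with input weights, hidden weights, and two bias vectors
lstm_params = 4 * hidden_size * (embedding_dim + hidden_size) + 2 * 4 * hidden_size

# Linear: weight matrix plus one bias per output
linear_params = hidden_size * vocab_size + vocab_size

total_params = embedding_params + lstm_params + linear_params
print(embedding_params, lstm_params, linear_params, total_params)
# 2880 56064 5820 64764
```

With the model instantiated, the same total should be obtainable from `sum(p.numel() for p in lstm_model.parameters())`.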

### Task 16

Next, let's set up the loss function. 

Create an instance of the **multiclass cross-entropy loss function** and save it to the variable `loss`.

In [25]:
loss = nn.CrossEntropyLoss()
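
To help interpret the loss values printed during training, here is a plain-Python sketch of cross-entropy for a single prediction. The helper `cross_entropy` is a hypothetical stand-in written for illustration; `nn.CrossEntropyLoss` applies the same formula to raw scores and, by default, averages over all positions.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of the target class under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

vocab_size = 60

# A model with no preference among the 60 tokens: loss equals ln(60)
uniform_loss = cross_entropy([0.0] * vocab_size, target=7)

# A model that scores the correct token much higher than the rest
confident_logits = [0.0] * vocab_size
confident_logits[7] = 5.0
confident_loss = cross_entropy(confident_logits, target=7)

print(round(uniform_loss, 4))  # 4.0943
print(confident_loss < uniform_loss)  # True
```

A loss near `4.09` therefore corresponds to guessing uniformly over the vocabulary, and lower values mean the model is putting more probability on the correct next character.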

### Task 17

Now, let's set up the optimizer. 

Import the `torch.optim` module with the alias `optim`.

Create an instance of the `Adam` optimizer using the parameters from the instantiated model `lstm_model` with a learning rate of `0.015`. Save the optimizer to the variable `optimizer`.


In [26]:
import torch.optim as optim

optimizer = optim.Adam(lstm_model.parameters(), lr = 0.015)

### Task 18

With all of the LSTM components built and initialized, let's train the LSTM model to generate text.

Create a training loop that trains the network for `5` epochs.

Within each epoch, loop through each batch of features and labels in `dataloader` such that at each iteration:

1. Reset the gradients
2. Reset the hidden and cell states
3. Apply the forward pass (that returns the output and updates the states)
4. Calculate the loss
5. Compute the gradients
6. Update the weights and biases

Be sure to print out the loss every epoch.

In [27]:
num_epochs = 5

for epoch in range(num_epochs):
    for features, labels in dataloader:
        optimizer.zero_grad()
        states = lstm_model.init_state(features.size(0))
        outputs, states = lstm_model(features, states)
        CEloss = loss(outputs, labels.view(-1))
        CEloss.backward()
        optimizer.step()

    # CEloss holds the loss of the last batch processed in this epoch
    print(f'Epoch [{epoch + 1}/{num_epochs}], CELoss: {CEloss.item():.4f}')

Epoch [1/5], CELoss: 1.1466
Epoch [2/5], CELoss: 0.6431
Epoch [3/5], CELoss: 0.4771
Epoch [4/5], CELoss: 0.4184
Epoch [5/5], CELoss: 0.3901
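
Note that the value printed for each epoch is the loss of the last batch only, so it can fluctuate from epoch to epoch. A smoother alternative is to average the loss over all batches in the epoch. The sketch below shows the idea with made-up numbers; in the training loop, the list would be filled with `CEloss.item()` after each batch.

```python
# Hypothetical per-batch losses collected during one epoch
batch_losses = [2.10, 1.40, 1.05, 0.90]

last_batch_loss = batch_losses[-1]                     # what the loop above reports
mean_epoch_loss = sum(batch_losses) / len(batch_losses)  # average over the epoch

print(f'last batch: {last_batch_loss:.4f}, epoch mean: {mean_epoch_loss:.4f}')
# last batch: 0.9000, epoch mean: 1.3625
```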


## Task Group 5 - Generate Text

Let's now generate text from the trained LSTM!

### Task 19

Now, create a starting prompt to provide context for the model to generate relevant text from.

For example, let's see if the trained model can accurately generate the beginning portion of the first letter:

```md
You will rejoice to hear that no disaster has accompanied the
commencement of an enterprise which you have regarded with such evil
forebodings. I arrived here yesterday, and my first task is to assure
my dear sister of my welfare and increasing confidence in the success
of my undertaking.
```

Let's use the first five words `"You will rejoice to hear"` as the starting prompt. Save the string to the variable `starting_prompt`.

In [28]:
starting_prompt = "You will rejoice to hear"

### Task 20

Next, we'll need to tokenize the starting prompt into token IDs. This will allow the model to read the starting prompt, update the hidden and cell states, and generate relevant text.

Use the vocabulary `c2ix` to map each character in the starting prompt to its token ID. Save the mapping to the variable `tokenized_id_prompt` as a 2D PyTorch tensor (built from a list of lists) with shape `(1, prompt_length)`, so that the first dimension is the batch dimension.

In [29]:
tokenized_id_prompt = torch.tensor([[c2ix[ch] for ch in starting_prompt]])
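
Because the vocabulary only contains the 60 characters that appear in the first letter, a prompt with any other character (for example `'Z'`) would raise a `KeyError`. The sketch below shows one way to check a prompt up front. The helper `encode_prompt` and the toy vocabulary are hypothetical and not part of the project.

```python
def encode_prompt(prompt, vocab):
    """Map each character to its token ID, failing clearly on unknown characters."""
    missing = sorted({ch for ch in prompt if ch not in vocab})
    if missing:
        raise ValueError(f"characters not in vocabulary: {missing}")
    return [vocab[ch] for ch in prompt]

# Toy vocabulary built from the prompt text itself, for illustration only
toy_vocab = {ch: i for i, ch in enumerate(sorted(set("You will rejoice to hear")))}

ids = encode_prompt("You will", toy_vocab)
print(ids)
# [1, 9, 12, 0, 13, 6, 8, 8]
```

With the project vocabulary, the same helper could be called as `encode_prompt(starting_prompt, c2ix)` before wrapping the result in a tensor.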

<details><summary style="display:list-item; font-size:16px; color:blue;">Here is what the tokenized prompt should look like:</summary>

```py
tokenized_id_prompt = torch.tensor([[30, 46, 52,  1, 54, 40, 43, 43,  1, 49, 36, 41, 46, 40, 34, 36,  1, 51,
                                     46,  1, 39, 36, 32, 49]])
```

</details>

### Task 21

First, let's set the model to evaluation mode using `.eval()`.

In [30]:
lstm_model.eval()

CharacterLSTM(
  (embedding): Embedding(60, 48)
  (lstm): LSTM(48, 96, batch_first=True)
  (linear): Linear(in_features=96, out_features=60, bias=True)
)

### Task 22

Let's generate the next `500` character-based tokens from the trained LSTM. Save the integer `500` to the variable `num_generated_chars`.

Next, let's begin creating the text generation loop. 

Inside a `with torch.no_grad():` block, initialize fresh hidden and cell states using the model's `init_state(1)` method (a batch size of 1) and save them to the variable `states`.

Create a `for` loop that generates one character per iteration (using `range(num_generated_chars)`) with the following:

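1. Pass the current input through the forward pass to get the output and the updated states (the input is the tokenized prompt on the first iteration and the most recently generated token afterwards)
2. Use `torch.argmax` on the output for the last position to select the token ID with the highest output score
3. Use the inverse vocabulary `ix2c` to map the selected token ID to its character-based token
4. Append the generated character to the starting prompt
5. Wrap the selected token ID in a tensor of shape `(1, 1)` to use as the input for the next iteration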

Lastly, print the starting prompt with the newly generated text.

In [33]:
num_generated_chars = 500

with torch.no_grad():
    states = lstm_model.init_state(1)
    for _ in range(num_generated_chars):
        output, states = lstm_model(tokenized_id_prompt, states)
        predicted_id = torch.argmax(output[-1, :], dim = -1).item()
        predicted_char = ix2c[predicted_id]
        starting_prompt += predicted_char
        tokenized_id_prompt = torch.tensor([[predicted_id]])

print(starting_prompt)

You will rejoice to hear that no disaster has accompanied the whale-fishing. I do not intend to
sail until that my father’s dying injunction of beauty every region hitherto discovered
solitudes. What may not be expected the wondrous power which addination as the
whole of our good Uncle Thomas’ library. My education was now
adm commence this
laborious voyage with the soulade a poet and firm; but my faming. I
can, even now, remember that a
history of all the voyages which have
been made in the prospect of arriving at the again a sead with your led my study
day and night, and my familiarity with the soury of eternal ing a
perpetual splendour. There—for with your led sister, how
can I answer this
laborious voyage with the soulade a poet and firm; but my faming. I
can, even now, remember that a
history of all the voyages which have
been made in the prospect of arriving at the North Sea;
I voluntarily endured cold, famine, thirst, and want of sleep; which
wille, its broad disk just skirting 
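
The selection step in the loop can be illustrated without a model. In the sketch below, `output` is a made-up table of scores with one row per input position and one column per vocabulary entry. Only the last row is used, because it holds the scores for the character that follows the input. The three-token vocabulary and the scores are assumptions for illustration.

```python
# Hypothetical three-token inverse vocabulary
ix2c_demo = {0: " ", 1: "a", 2: "t"}

# One row of scores per input position, one column per vocabulary entry
output = [
    [0.1, 2.0, 0.3],
    [1.5, 0.2, 0.1],
    [0.0, 0.4, 3.1],  # last position: scores for the next character
]

# Greedy decoding: take the index of the highest score in the last row
last_scores = output[-1]
predicted_id = max(range(len(last_scores)), key=last_scores.__getitem__)
predicted_char = ix2c_demo[predicted_id]

print(predicted_id, repr(predicted_char))
# 2 't'
```

In the generation loop above, the output has one row per prompt character on the first iteration and a single row on every later iteration, so `output[-1, :]` always refers to the most recent position.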

<details><summary style="display:list-item; font-size:16px; color:blue;">How is the generated text from the trained LSTM?</summary>

Recall, here is the first paragraph that our text generation model attempted to regenerate given the starting prompt `"You will rejoice to hear"`:
    
```md
You will rejoice to hear that no disaster has accompanied the
commencement of an enterprise which you have regarded with such evil
forebodings. I arrived here yesterday, and my first task is to assure
my dear sister of my welfare and increasing confidence in the success
of my undertaking.
```
    
The generated text from the character-based LSTM shows promising results. It learned some aspects of the underlying text: it correctly reproduced the opening phrase `"You will rejoice to hear that no disaster has accompanied the"` before drifting to a different passage of the letter (`"whale-fishing"`).

Although it deviates from the original text from that point on, it still maintains some coherence and grammar. The tone and thematic style remain consistent with the original text, and the LSTM even attempts to generate novel content, although it also produces non-words (such as `"soulade"`) and repeats some passages verbatim.

</details>

That's the end of our project on building a text generation model based on Mary Shelley's _Frankenstein_! There's definitely a lot of room for improvement and we encourage you to use your skills to explore different techniques to enhance the language model. 

Here are some areas for improvement:
- use the full text (or gather multiple outside texts)
- use a larger embedding size (GPT-3 uses an embedding dimension of 12,288!)
- modify the neural network architecture (add more neurons, layers, activation functions, etc.)
- increase the number of epochs for training
- test different optimizers and learning rates
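
As a starting point for the architecture experiments, the sketch below shows one possible variant of the model with the sizes exposed as constructor arguments. The class name `ConfigurableCharLSTM`, its defaults, and the example sizes are hypothetical choices, not part of the project.

```python
import torch
import torch.nn as nn

class ConfigurableCharLSTM(nn.Module):
    """Hypothetical variant of CharacterLSTM with adjustable sizes."""

    def __init__(self, vocab_size, embedding_dim=48, hidden_size=96, num_layers=1):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_size,
                            num_layers=num_layers, batch_first=True)
        self.linear = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, states):
        out, states = self.lstm(self.embedding(x), states)
        out = self.linear(out)
        return out.reshape(-1, out.size(2)), states

    def init_state(self, batch_size):
        # One hidden and one cell state per LSTM layer
        shape = (self.num_layers, batch_size, self.hidden_size)
        return torch.zeros(shape), torch.zeros(shape)

# Example: a wider, two-layer model on a random batch of 3 sequences of length 10
model = ConfigurableCharLSTM(vocab_size=60, embedding_dim=64,
                             hidden_size=128, num_layers=2)
x = torch.randint(0, 60, (3, 10))
logits, (hidden, cell) = model(x, model.init_state(3))
print(tuple(logits.shape), tuple(hidden.shape))
# (30, 60) (2, 3, 128)
```

Larger settings increase training time and memory use, so it is worth changing one size at a time and comparing the resulting loss.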

You might want to consider building and training larger language models on your own device or cloud platform with greater memory.

Happy coding!