# DGM Lab 2 - Autoregressive models

In this Lab session we will implement recurrent neural networks (RNNs) in PyTorch. We will look at how your data should be structured in order to train such RNNs, and how we can use the networks as generators of new sequences of data.

## Preparing the data

For the first part of this lab session, we will use textual data, since this will allow us to have faster training times, more efficient models and more enjoyable results. On the one hand, training RNNs on images is generally resource-heavy because of their high dimensionality and complex structure. On the other hand, we cannot just use vanilla RNN architectures for images, since these will produce unsatisfying results &ndash; you should rather resort to architectures should as [PixelRNN](https://arxiv.org/abs/1601.06759) or [DRAW](https://arxiv.org/abs/1502.04623), but these are challenging to train and out of scope for this lab session. And as will become clear later on in this course, autoregressive models are generally not the preferred family of models to generate images.

We will use the data from (almost) all of **Shakespeare's plays**. We have already assembled them together in a single txt file by crawling [Project Gutenberg](https://www.gutenberg.org/). The entire file is approximately 6.1 MB large, and can be downloaded into the `content` folder of this Colab notebook by executing the following script:

In [None]:
!wget https://raw.githubusercontent.com/cedricdeboom/character-level-rnn-datasets/master/datasets/shakespeare.txt

--2025-05-24 16:14:35--  https://raw.githubusercontent.com/cedricdeboom/character-level-rnn-datasets/master/datasets/shakespeare.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6347705 (6.1M) [text/plain]
Saving to: ‘shakespeare.txt’


2025-05-24 16:14:35 (84.3 MB/s) - ‘shakespeare.txt’ saved [6347705/6347705]



In the first part of this lab session, we will train a so-called **character-level RNN**. This means that we will operate on the character level of the text, and not on the word level, i.e. a single token is a separate character in the text and we will generate the text one character at a time. We do this again to optimize the computational footprint and efficiency of the model, since there are much less unique characters than unique words in a text.

### Assignment 1

Since neural networks only operate on numerical data, we have to be able to convert our text into such numerical format.

1. Calculate the length (in no. of characters) of the entire dataset.
2. Create a collection of all the unique characters in the dataset and calculate its size.
3. Inspect the collection of unique characters. Are there any strange or unwanted characters? Remove them from the collection.
4. Create two data structures that can be used to map each unique character onto a unique integer index, and vice versa. These will be used to convert between a text and a sequence of numbers.
5. Since the dataset is not that large, we will keep the entire dataset in memory for quick access. Store the data as a single numerical (NumPy) array, thereby making use of the char-to-index map you created before. Make sure the array stores integers and not floats.

If you want to further increase the efficiency of the model (and reduce the data dimensionality), you can first convert the entire dataset to lowercase letters, but this is not obligatory.

In [None]:
import numpy as np

def calculate_dataset_length(file_path):
    """Calculates the length (in number of characters) of the entire dataset.

    Args:
        file_path: The path to the text file.

    Returns:
        The length of the dataset in characters.
    """
    try:
        with open(file_path, 'r', encoding='utf-8') as file:
            text = file.read()
            return len(text)
    except FileNotFoundError:
        print(f"Error: File not found at {file_path}")
        return -1


def calculate_unique_chars(file_path):
    """
    Creates a collection of all the unique characters in the dataset and calculates its size.
    """
    try:
        with open(file_path, 'r', encoding='utf-8') as file:
            text = file.read()
            unique_chars = set(text)
            return unique_chars, len(unique_chars)
    except FileNotFoundError:
        print(f"Error: File not found at {file_path}")
        return None, -1


def remove_unwanted_chars(unique_chars):
    unwanted_chars = {
        '\x0c',  # Form feed
        '\u200b' # Zero width space
        '\ufeff'
        '$'

    }
    cleaned_chars = unique_chars - unwanted_chars
    return cleaned_chars

def create_char_mappings(chars):
    char_to_idx = {ch: i for i, ch in enumerate(chars)}
    idx_to_char = {i: ch for i, ch in enumerate(chars)}
    return char_to_idx, idx_to_char

def create_numerical_array(file_path, char_to_idx):
    with open(file_path, 'r', encoding='utf-8') as f:
        text = f.read()
    numerical_data = np.array([char_to_idx[char] for char in text if char in char_to_idx], dtype=np.int32)
    return numerical_data


# Example usage
file_path = 'shakespeare.txt'
dataset_length = calculate_dataset_length(file_path)
unique_chars, unique_chars_size = calculate_unique_chars(file_path)
cleaned_unique_chars = remove_unwanted_chars(unique_chars)

char_to_idx, idx_to_char = create_char_mappings(cleaned_unique_chars)
numerical_array = create_numerical_array(file_path, char_to_idx)


if dataset_length != -1:
    print(f"The length of the dataset is: {dataset_length} characters")
if unique_chars_size != -1:
    print(f"The collection of unique characters is: {unique_chars}")
    print(f"The size of the unique character set is: {unique_chars_size}")
print(f"The cleaned collection of unique characters is: {cleaned_unique_chars}")
print(f"The size of the cleaned unique character set is: {len(cleaned_unique_chars)}")
print(f"The numerical array shape is: {numerical_array.shape}")
print(f"First 20 elements of the numerical array: {numerical_array[:20]}")


The length of the dataset is: 6347703 characters
The collection of unique characters is: {'p', '\n', 'b', 'l', 'j', 'V', ':', 'v', 'T', '"', '}', 'Y', '8', 'S', 'x', 'I', 'K', ';', 'Z', '?', ']', 'W', 'A', '5', 'P', 'c', '(', 's', 'G', 'E', 'M', 'd', '6', '[', '&', 'f', 'k', 't', 'C', '-', 'q', ' ', 'X', 'o', 'i', '4', 'Q', '1', 'H', '.', 'J', ')', 'w', '<', 'D', '3', 'n', 'z', 'h', 'r', 'm', '2', '7', 'a', '\ufeff', 'R', '_', ',', 'F', 'U', '!', '$', '9', 'u', "'", 'e', '0', 'B', 'N', 'L', 'y', 'g', 'O'}
The size of the unique character set is: 83
The cleaned collection of unique characters is: {'4', 'Q', '1', 'p', '\n', 'H', 'b', '.', 'l', 'J', ')', 'j', 'w', 'V', '<', ':', 'D', 'v', 'T', '3', '"', '}', 'n', 'Y', 'z', '8', 'S', 'x', 'I', 'h', 'r', 'm', '2', 'K', '7', ';', 'Z', 'a', '?', ']', 'W', '\ufeff', 'R', 'A', '5', 'P', '_', 'c', '(', ',', 's', 'G', 'E', 'M', 'd', 'F', '6', 'U', '[', '&', 'f', '!', '$', '9', 'k', 'u', 't', "'", 'e', '0', 'C', 'B', '-', 'N', 'q', ' ', 'X', 'L', 

## Representing and batching the data

When we represent characters as integers, we implicitly define an ordering between the characters: character 0 is close to character 1, but farther away from character 35. This is unwanted: the distance between each character in the feature space should ideally be the same. For this reason we will use a so-called **one-hot encoding** of each character. This is a vector of all zeros, except for a 1 at the index position of the considered character. For example: if we have 4 unique characters, character 0 is one-hot encoded as `[1,0,0,0]`, character 1 as `[0,1,0,0]`, character 2 as `[0,0,1,0]` and character 3 as `[0,0,0,1]`.

As you know, neural networks are (traditionally) trained with stochastic gradient descent. This means that the data must be delivered in batches at the input of the network. For sequential data this means that the data becomes 3-dimensional: if $B$ is the batch size, $T$ is the sequence length and $D$ is the data dimensionality, then each batch has shape $(B, T, N)$. Note that this immediately implies that within a batch, all sequences must have the same length $T$. Does is done by either:

 * Only taking chunks of length $T$ from the dataset to fill up the batch, or
 * Also taking chunks of length $\leq T$, and filling up the remainder of the sequences with invalid data (padding).

Since our entire dataset is essentially one long sequence, we can always sample chunks of length $T$, so we will take the (simpler) first approach.

### Assignment 2

Create a custom Dataset class for the Shakespeare data (see lab 1 for more details). For this purpose, write appropriate `__init__`, `__len__` and `__getitem__` methods. Make sure that you can specify the desired sequence length $T$ and data dimensionality $D$. You can take different approaches:

 * Divide the dataset in chunks of equal length $T$ e.g. by following the truncated backpropagation through-time (TBPTT) parameterizations (see lecture). You can then calculate how many sequences your dataset counts.
 * Return a random chunk whenever `__getitem__` is called. This is more versatile than the approach above and leads to smoother loss minimization, but you lose the notion of an "epoch". In this case, you can return whatever (large) number you want in the `__len__` method (e.g. `sys.maxsize` returns a large integer number).

Remember that you also have to return the training target labels for a sequence in the `__getitem__` method! Think again about what the task is ("predict the next character in a sequence") and then decide what the labels should be. Again you have different possibilities (see lecture: single-loss training vs. multi-loss training). Let's pick **multi-loss training** for now: in that case the target labels should also be a sequence.

Make sure that the input data is one-hot encoded (this is not needed for the target labels) and that your data has dtype float32 (single precision; which is the PyTorch default for neural network parameters)!

Initialize an instance of a DataLoader and test your Dataset class.

In [None]:
import torch
from torch.utils.data import Dataset, DataLoader

class ShakespeareDataset(Dataset):
    def __init__(self, data, sequence_length, char_to_idx):
        super(ShakespeareDataset, self).__init__()
        self.data = data
        self.sequence_length = sequence_length
        self.char_to_idx = char_to_idx
        self.num_chars = len(char_to_idx)
        self.num_sequences = (len(data) - 1) // sequence_length

    def __len__(self):
        return self.num_sequences

    def __getitem__(self, idx):
        start_idx = idx * self.sequence_length
        end_idx = start_idx + self.sequence_length
        sequence = self.data[start_idx:end_idx]
        target = self.data[start_idx+1:end_idx+1]

        # One-hot encode the input sequence
        one_hot_sequence = torch.zeros(self.sequence_length, self.num_chars, dtype=torch.float32)
        for i, char_idx in enumerate(sequence):
            one_hot_sequence[i, char_idx] = 1

        return one_hot_sequence, torch.tensor(target, dtype=torch.long)

# Example usage
sequence_length = 10

dataset = ShakespeareDataset(numerical_array, sequence_length, char_to_idx)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

# Test the dataset
for i, (input_data, target_data) in enumerate(dataloader):
    print(f"Batch {i+1}:")
    print("Input data shape:", input_data.shape)
    print("Target data shape:", target_data.shape)
    if i >= 2:  # Check the first 3 batches for demonstration
      break


Batch 1:
Input data shape: torch.Size([32, 10, 83])
Target data shape: torch.Size([32, 10])
Batch 2:
Input data shape: torch.Size([32, 10, 83])
Target data shape: torch.Size([32, 10])
Batch 3:
Input data shape: torch.Size([32, 10, 83])
Target data shape: torch.Size([32, 10])


## Building the model

We will now build the recurrent neural network in PyTorch that will be trained to predict the next character in a given sequence of characters. We will leverage GRU layers as the main building blocks, but you can experiment with other layers as well. You can find more information on Pytorch's recurrent layers on [https://pytorch.org/docs/master/nn.html#recurrent-layers](https://pytorch.org/docs/master/nn.html#recurrent-layers).

### Assignment 3

A diagram of the envisioned architecture is shown in the picture below. Implement this architecture by inheriting from `torch.nn.Module`, as was explained in Lab 1. Test your module by getting a batch of data, feed it trough the network, and check if a sensible output is produced (especially if you have implemented the softmax - see remark 2 below - check if your output is properly normalized).

![Image of RNN architecture](./RNNarchitecture.png)

The one-hot encoded input sequence is fed through two GRU layers with 128 hidden states. The outputs of the last GRU layer are fed through two dense layers. The first dense layer has a (leaky) ReLU output activation (but you can experiment with other ones as well). The final dense layer has an output dimensionality of $D$ and calculates a softmax over all possible characters. The parallel arrows indicate that each layer calculates an output at every time step. Of course, the dense layers do not have a recurrent nature and do not process a sequence in its entirety: each input of the dense layer is processed separately of the others and leads to its own output. But it was cleaner to draw the diagram this way.

**Important remark 1**: The Pytorch recurrent layers have a `batch_first` argument that you ideally set to `True`. That way, the first data dimension is considered the batch dimension.

**Important remark 2**: For the softmax nonlinearity, you can either choose to leave it out of the model during training and offload its computation to the CrossEntropyLoss object later (as done in Lab 1). Or you can choose to attach the log-softmax nonlinearity and make it part of the model. This will have implications on the loss function that will be used during training (see later). It's up to you, but please make sure that you know what you are doing!

**Important remark 3**: The initial hidden state $\mathbf{h}_0$ of your GRU layers is initialized as a vector of zeroes, as a PyTorch default. It is, however, possible to parameterize the initial hidden states and train these parameters; if you are interested in this, you can search through the documentation or on the internet how you can achieve this (not obligatory).

In [None]:
import torch.nn as nn

class RNNModel(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNNModel, self).__init__()
        self.hidden_size = hidden_size
        self.gru1 = nn.GRU(input_size, hidden_size, batch_first=True)
        self.gru2 = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.dense1 = nn.Linear(hidden_size, hidden_size)
        self.relu = nn.LeakyReLU()
        self.dense2 = nn.Linear(hidden_size, output_size)
        self.log_softmax = nn.LogSoftmax(dim=2) # Log-softmax for numerical stability


    def forward(self, x):
        # Initialize hidden states (zeros by default)
        h0_gru1 = torch.zeros(1, x.size(0), self.hidden_size).to(x.device)
        h0_gru2 = torch.zeros(1, x.size(0), self.hidden_size).to(x.device)

        # GRU layers
        out_gru1, _ = self.gru1(x, h0_gru1)
        out_gru2, _ = self.gru2(out_gru1, h0_gru2)

        # Dense layers
        out_dense1 = self.relu(self.dense1(out_gru2))
        out_dense2 = self.dense2(out_dense1)
        out = self.log_softmax(out_dense2) # Apply log-softmax

        return out

# Example instantiation and testing
input_size = len(cleaned_unique_chars)
hidden_size = 128
output_size = len(cleaned_unique_chars)

model = RNNModel(input_size, hidden_size, output_size)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Get a batch of data
for batch_idx, (data, target) in enumerate(dataloader):
    data = data.to("cuda" if torch.cuda.is_available() else "cpu")
    # Feed the data through the network
    output = model(data)
    print("Output shape:", output.shape)
    # Check output normalization (log-softmax)

    print("Sum of probabilities along last dim (should be ~1):", torch.exp(output))

    break # Inspect only the first batch


Output shape: torch.Size([128, 100, 83])
Sum of probabilities along last dim (should be ~1): tensor([[[0.0123, 0.0122, 0.0110,  ..., 0.0125, 0.0119, 0.0123],
         [0.0122, 0.0122, 0.0110,  ..., 0.0126, 0.0119, 0.0123],
         [0.0121, 0.0122, 0.0110,  ..., 0.0126, 0.0118, 0.0123],
         ...,
         [0.0120, 0.0122, 0.0111,  ..., 0.0127, 0.0117, 0.0123],
         [0.0120, 0.0122, 0.0111,  ..., 0.0127, 0.0117, 0.0123],
         [0.0120, 0.0122, 0.0111,  ..., 0.0127, 0.0117, 0.0123]],

        [[0.0123, 0.0122, 0.0110,  ..., 0.0125, 0.0120, 0.0124],
         [0.0122, 0.0122, 0.0110,  ..., 0.0125, 0.0119, 0.0124],
         [0.0122, 0.0122, 0.0110,  ..., 0.0126, 0.0119, 0.0123],
         ...,
         [0.0122, 0.0122, 0.0112,  ..., 0.0125, 0.0117, 0.0122],
         [0.0121, 0.0121, 0.0112,  ..., 0.0125, 0.0118, 0.0123],
         [0.0122, 0.0120, 0.0112,  ..., 0.0124, 0.0118, 0.0123]],

        [[0.0123, 0.0122, 0.0110,  ..., 0.0125, 0.0119, 0.0123],
         [0.0122, 0.0122, 0.01

## Training the model

Now that we have the dataset and the model ready, it is time to train our very first character-level RNN!

### Assignment 4

Write an optimization procedure to train the RNN from assignment 3 using the data and batching strategy from assignments 1 and 2. We advise you to use the **Adam** optimizer with learning rate 0.001, which has become one of the default optimizers in deep learning, especially if you don't want to spend much time figuring out an effective learning rate schedule for plain SGD (which generally can lead to better optimizations). Pick a large enough batch size, e.g. 128, and a sequence length $T$ of around 100. Around 50 epochs of 1000 batches should be enough for now to train this model until "reasonable" convergence. Make sure to put the model and each batch on the GPU.

**Important remark**: those of you who left out the log-softmax from the model, will need to use a CrossEntropyLoss. If you did use a log-softmax, you need a NLLLoss. Please read the documentation carefully regarding the use of these loss functions. Since we use **multi-loss training** we have a classification target at each time step, i.e. $T$ different loss values for each entry in the batch. Find out how to *correctly(!)* combine these losses into a single number (they can be averaged or summed, but make sure this is done along the correct axis). Another option (instead of using the built-in loss functions) is to write the loss criterion yourself in a separate function; this can be a nice exercise.

In [None]:
import torch.optim as optim
from torch.nn import NLLLoss

# Hyperparameters
sequence_length = 100
batch_size = 128
num_epochs = 50
num_batches = 1000
learning_rate = 0.001

# Create the dataset and dataloader
dataset = ShakespeareDataset(numerical_array, sequence_length, char_to_idx)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)


# Move the model and loss function to the GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
criterion = NLLLoss().to(device) # Using NLLLoss since log_softmax is in the model
optimizer = optim.Adam(model.parameters(), lr=learning_rate)


# Training loop
for epoch in range(num_epochs):
    for batch_idx in range(num_batches):
        # Get a batch of data and move it to the GPU
        data, target = next(iter(dataloader))
        data, target = data.to(device), target.to(device)

        # Zero the gradients
        optimizer.zero_grad()

        # Forward pass
        output = model(data)

        # Calculate the loss
        loss = criterion(output.view(-1, output.shape[-1]), target.view(-1)) # Reshape for NLLLoss

        # Backward pass and optimization
        loss.backward()
        optimizer.step()

        # Print progress
        if batch_idx % 100 == 0:
          print(f"Epoch [{epoch+1}/{num_epochs}], Batch [{batch_idx+1}/{num_batches}], Loss: {loss.item():.4f}")


Epoch [1/50], Batch [1/1000], Loss: 4.4450
Epoch [1/50], Batch [101/1000], Loss: 3.1894
Epoch [1/50], Batch [201/1000], Loss: 2.7972
Epoch [1/50], Batch [301/1000], Loss: 2.3104
Epoch [1/50], Batch [401/1000], Loss: 2.1366
Epoch [1/50], Batch [501/1000], Loss: 1.9986
Epoch [1/50], Batch [601/1000], Loss: 1.9903
Epoch [1/50], Batch [701/1000], Loss: 1.9193
Epoch [1/50], Batch [801/1000], Loss: 1.8410
Epoch [1/50], Batch [901/1000], Loss: 1.8616
Epoch [2/50], Batch [1/1000], Loss: 1.7894
Epoch [2/50], Batch [101/1000], Loss: 1.7654
Epoch [2/50], Batch [201/1000], Loss: 1.7016
Epoch [2/50], Batch [301/1000], Loss: 1.7049
Epoch [2/50], Batch [401/1000], Loss: 1.7143
Epoch [2/50], Batch [501/1000], Loss: 1.6743
Epoch [2/50], Batch [601/1000], Loss: 1.6255
Epoch [2/50], Batch [701/1000], Loss: 1.6529
Epoch [2/50], Batch [801/1000], Loss: 1.5805
Epoch [2/50], Batch [901/1000], Loss: 1.6172
Epoch [3/50], Batch [1/1000], Loss: 1.6049
Epoch [3/50], Batch [101/1000], Loss: 1.5545
Epoch [3/50], Ba

## Generating text

In the previous assignment, our RNN was trained to predict the next character in a given sequence. We will now use this trained RNN in generator mode to produce text on its own. Since we have used multi-loss training, we have multiple options for the sampling strategy: either progressive sampling or windowed sampling, which each have their own benefits and flaws (see lecture). Also, it often depends on the framework that you use (in our case PyTorch) which of the sampling strategies is more easy to implement.

### Assignment 5

We will implement **progressive sampling**. We will start from a seed sequence of 100 characters after which we let the RNN generate the subsequent characters ad libitum (you can choose how many characters you want the RNN to generate). We also want to be able to tune the randomness in the model by means of a **softmax temperature** parameter. The plan is as follows:

 1. Sample a random seed sequence of $T$ characters from the dataset and feed it through the RNN.
 2. Consider the softmax output of the final time step, and use it to sample the next character in the sequence (`torch.multinomial` might come in handy).
 3. Feed the recently sampled character through the RNN, but make sure that the RNN starts from the last hidden state of step 1! For this to work, you will need to change the model definition such that you can specify the initial hidden state, and such that the final hidden state can be captured when a sequence is fed through the RNN. Look in the documentation on how to achieve this.
 4. Iterate from step 2 until enough characters are sampled.

Test your sampler with different temperature values and different seed sequences. If all went right, observe that the model has learned to create words, separated by spaces, and that it has learned that there are character roles which are often written in uppercase letters. Also observe that the RNN is pretty good at generating language on a low level, but that it fails to produce coherent texts on a larger scale.

**Important remark 1**: make sure that your input data still has a batch dimension. If this is not the case, take a look at the `unsqueeze` function to add extra dimensions to your data.

**Important remark 2**: you will have to change the `forward`-method of the model class such that a softmax temperature can be specified as an extra argument.

**Important remark 3**: since you will be changing the model class, it is advised to store the parameters of the already trained RNN. You can then load these parameters into your updated model object.

**Important remark 4**: if you specify a random seed, your results will be reproducible, which might be handy.

In [None]:
class RNNModel_with_temperature(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNNModel_with_temperature, self).__init__()
        self.hidden_size = hidden_size
        self.gru1 = nn.GRU(input_size, hidden_size, batch_first=True)
        self.gru2 = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.dense1 = nn.Linear(hidden_size, hidden_size)
        self.relu = nn.LeakyReLU()
        self.dense2 = nn.Linear(hidden_size, output_size)
        self.log_softmax = nn.LogSoftmax(dim=2)

    def forward(self, x, temperature=1.0, hidden=None):
        # Initialize hidden states if not provided
        if hidden is None:
          h0_gru1 = torch.zeros(1, x.size(0), self.hidden_size).to(x.device)
          h0_gru2 = torch.zeros(1, x.size(0), self.hidden_size).to(x.device)
          hidden = (h0_gru1, h0_gru2)

        h0_gru1, h0_gru2 = hidden
        out_gru1, h1 = self.gru1(x, h0_gru1)
        out_gru2, h2 = self.gru2(out_gru1, h0_gru2)

        out_dense1 = self.relu(self.dense1(out_gru2))
        out_dense2 = self.dense2(out_dense1)
        out = self.log_softmax(out_dense2 / temperature)  # Apply temperature to softmax

        return out, (h1, h2)



# Example usage
seed_length = 100
generation_length = 500
temperature = 0.8  # Example temperature

# Sample a random seed sequence
start_index = np.random.randint(0, len(numerical_array) - seed_length)
seed_sequence = numerical_array[start_index : start_index + seed_length]


# Convert to one-hot encoded tensor
seed_tensor = torch.zeros(1, seed_length, len(char_to_idx), dtype=torch.float32).to(device) #batch size 1
for i, char_idx in enumerate(seed_sequence):
    seed_tensor[0, i, char_idx] = 1


# Generate text
generated_text = list(seed_sequence) #start with the seed
hidden = None

model_temp = RNNModel_with_temperature(len(char_to_idx), 128, len(char_to_idx)).to(device)
model_temp.load_state_dict(model.state_dict()) #load the weights

with torch.no_grad():
    for _ in range(generation_length):
        output, hidden = model_temp(seed_tensor, temperature=temperature, hidden=hidden)
        last_output = output[0, -1, :]  # Get output for the last character
        probabilities = torch.exp(last_output) #remove log
        next_char_index = torch.multinomial(probabilities, 1).item()
        generated_text.append(next_char_index)

        # Prepare input for the next step
        next_char_one_hot = torch.zeros(1, 1, len(char_to_idx), dtype=torch.float32).to(device)
        next_char_one_hot[0, 0, next_char_index] = 1
        seed_tensor = torch.cat((seed_tensor, next_char_one_hot), dim=1)[:,1:,:] #remove the first element

# Convert generated text indices to characters
generated_text_chars = [idx_to_char[idx] for idx in generated_text]
print("".join(generated_text_chars))


e how I will undo myself:
    I give this heavy weight from off my head,  
    And this unwieldy sces1&CUf&<Z:zZO?VebCLLD(H9&[QB9BMvBax5-<Seph_,b.﻿J7yO9nv0rK3i,pcdqYo's&}UhlxO&Z﻿G(HQO﻿4Y7lioErvLq;PK)haJroni"k8b'.'GF]DW7LHcm;$
_:qMEgfh5A-$R6]uw﻿EfHadZlvCYuaD?w[V.Xb]C&﻿-gji,1Hnn,yQgId!m!﻿UEaSN6cNmyvr"G}63ARHM?[]o﻿HbNrFcTT!O"077jXTto-OlRSORVGyIJ_z[rPXd?;He)4D?x(MeOzP;xa]Dw?DmEvKX1EXVqm"[D:;8﻿PhDrfoyV7U1habeOO4]hlEmjxNB5?﻿4sCdy:r9z$gmU9&dO[gP!m'.6gTh"
(oB-PFCm9etqAORv;x.RrdHpHkOSFfHs)0[Jr86Ms;<qUHJtVr8PVZT4Wc.6s_'GL&E0bYdwR](Wz.crKJL1lHuUh
6﻿dDN66FEATgVqIon"c:iBClrH﻿_﻿00einII[qJmx"lV?Drry;wr﻿fPjBp


## Evaluating generative language models

Generative models can be evaluated by looking at either the likelihood of the predictions, or by looking at the sample quality. Both methods have their pros and cons &ndash; and as we will see in the lecture about generative adversarial networks, they are not necessarily correlated! &ndash; and usually a combination of the two is used in literature.

In the previous assignment we have looked at (subjective) sample quality, in the following assignment we will look at a likelihood-based metric. For this metric, it is important that we calculate it on a separate test set. After all, at this point we don't know if the RNN has learned all training data by heart, or if it can truly generalise on unseen data from the true data distribution.

A popular performance measure for language models is the **perplexity**:
$$\text{perplexity} = \exp\left(  \frac{-\sum_{t=1}^N \log p(x_t \vert x_{1:t-1})}{N} \right)$$
To calculate perplexity, we feed the entire test set ($N$ tokens in total) through the RNN and we calculate the log-likelihood of each ground-truth token. Lower perplexity means better model performance on unseen data.

### Assignment 6

1. Split the original dataset into a train and test set. Take around 10,000 characters for the test set (e.g. the last part of the dataset).

2. Write a routine that calculates perplexity on this held-out test set. Start by feeding in the first character of the test set into the RNN, record the log-likelihood, and then iterate by going through the entire test sequence. Think about how you can make this routine as fast and efficient as possible (where do you put the data (cpu/gpu), when do you convert it to one-hot, etc.).

3. Retrain the RNN and record perplexity on the test set after every epoch. Visualize the train and test metrics on a Tensorboard. Do you observe signs of overfitting? Or can we manage more than 50 training epochs?

In [None]:
from torch.utils.tensorboard import SummaryWriter
import math


# 1. Split the dataset
test_set_size = 10000
train_data = numerical_array[:-test_set_size]
test_data = numerical_array[-test_set_size:]

# 2. Perplexity Calculation
def calculate_perplexity(model, data, sequence_length, char_to_idx, idx_to_char, device):
    model.eval()
    with torch.no_grad():
        total_log_likelihood = 0
        n = len(data)
        for i in range(0, n - sequence_length, sequence_length):
            input_seq = data[i:i + sequence_length]
            target_seq = data[i + 1:i + sequence_length + 1]

            # One-hot encode the input sequence
            input_tensor = torch.zeros(1, sequence_length, len(char_to_idx), dtype=torch.float32).to(device)
            for j, char_idx in enumerate(input_seq):
                input_tensor[0, j, char_idx] = 1

            output,_ = model(input_tensor)

            # Calculate log-likelihood
            log_likelihood = 0
            for j in range(sequence_length):
                log_likelihood += output[0, j, target_seq[j]]

            total_log_likelihood += log_likelihood
        perplexity = torch.exp(-total_log_likelihood / n)
        return perplexity.item()



# 3. Retrain and Monitor
model = RNNModel_with_temperature(len(char_to_idx), 128, len(char_to_idx)).to(device)
writer = SummaryWriter()  # Initialize TensorBoard writer

# Hyperparameters (adjust as needed)
learning_rate = 0.001
num_epochs = 100  # Increased epochs to monitor overfitting

optimizer = optim.Adam(model.parameters(), lr=learning_rate)

for epoch in range(num_epochs):
    model.train()
    total_loss = 0
    for batch_idx, (data, target) in enumerate(dataloader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output,_ = model(data)
        loss = criterion(output.view(-1, output.size(2)), target.view(-1))
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    average_loss = total_loss / len(dataloader)
    test_perplexity = calculate_perplexity(model, test_data, sequence_length, char_to_idx, idx_to_char, device)
    # Log to TensorBoard
    writer.add_scalar('Loss/train', average_loss, epoch)
    writer.add_scalar('Perplexity/test', test_perplexity, epoch)

    print(f"Epoch [{epoch+1}/{num_epochs}], Train Loss: {average_loss:.4f}, Test Perplexity: {test_perplexity:.4f}")

writer.close()


Epoch [1/100], Train Loss: 2.5074, Test Perplexity: 8.2390
Epoch [2/100], Train Loss: 1.8112, Test Perplexity: 6.8154
Epoch [3/100], Train Loss: 1.6502, Test Perplexity: 6.1889
Epoch [4/100], Train Loss: 1.5588, Test Perplexity: 5.6442
Epoch [5/100], Train Loss: 1.5005, Test Perplexity: 5.3268
Epoch [6/100], Train Loss: 1.4605, Test Perplexity: 5.1542
Epoch [7/100], Train Loss: 1.4316, Test Perplexity: 4.9880
Epoch [8/100], Train Loss: 1.4093, Test Perplexity: 4.8534
Epoch [9/100], Train Loss: 1.3915, Test Perplexity: 4.7394
Epoch [10/100], Train Loss: 1.3772, Test Perplexity: 4.7462
Epoch [11/100], Train Loss: 1.3651, Test Perplexity: 4.6667
Epoch [12/100], Train Loss: 1.3552, Test Perplexity: 4.6508
Epoch [13/100], Train Loss: 1.3457, Test Perplexity: 4.5431
Epoch [14/100], Train Loss: 1.3381, Test Perplexity: 4.5709
Epoch [15/100], Train Loss: 1.3314, Test Perplexity: 4.5080
Epoch [16/100], Train Loss: 1.3253, Test Perplexity: 4.4918
Epoch [17/100], Train Loss: 1.3197, Test Perplexi

There are no signs of overfitting. In fact, the test perplexity is not improving but so is the training loss.

## Extra ideas

You are now finished with the lab session. If you want to do some more experiments, you could try:

 * Alter the model architecture: try LSTM instead of GRU, add or remove some of the recurrent layers, play with the dimensionality of the layers, etc.
 * Insert a so-called [embedding layer](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html#torch.nn.Embedding) after the input layer. This layer takes integers as input, so you will need to get rid of the one-hot representations. Alternatively, you could insert a dense layer (without bias) after the one-hot input layer, which will behave as an embedding layer (think about what calculation a dense layer actually performs, and how it behaves if the input is a one-hot vector: is there any difference with an embedding layer?).
 * Try windowed sampling instead of progressive sampling (this is usually **easier** to implement, but leads to slower sampling times and has a more limited receptive field).
 * Try top-K sampling, nucleus sampling or beam search.
 * Try single-loss training, or try multi-loss training but weight the loss linearly across the entire sequence, i.e. attach little weight to the first tokens, and more weight to tokens later in the sequence.
 * Try temporal convolutions.
 * Try generating text on a word level instead of the character level. For this to work well you will need to do some preprocessing of the text first: tokenization, removal of punctuation, conversion of all characters to lowercase, etc. Then proceed by making an indexed vocabulary, but replace all the words that occur less than e.g. 10 times by an `<UNK>` token (for "unknown"). Also alter the architecture of the model by including an embedding layer after the input layer, which will generally lead to better results. Training time of such a model will be longer than a character-level model.

As part of our further exploration, we experimented with adding an embedding layer after the input layer. Additionally, we investigated the use of LSTM layers in place of GRU layers.

The performance of this model appears to be very similar to that of the previous one.

In [None]:
import torch.nn as nn
import torch.optim as optim
from torch.nn import NLLLoss
from torch.utils.tensorboard import SummaryWriter
import math

# Redefine the Dataset class to handle numerical input directly
class ShakespeareDatasetNumerical(Dataset):
    def __init__(self, data, sequence_length):
        super(ShakespeareDatasetNumerical, self).__init__()
        self.data = data
        self.sequence_length = sequence_length
        # We subtract 1 because the target is the next character after the sequence
        self.num_sequences = (len(data) - 1) // sequence_length

    def __len__(self):
        return self.num_sequences

    def __getitem__(self, idx):
        start_idx = idx * self.sequence_length
        end_idx = start_idx + self.sequence_length
        sequence = self.data[start_idx:end_idx]
        target = self.data[start_idx+1:end_idx+1]

        # Input sequence is now numerical (integer indices)
        return torch.tensor(sequence, dtype=torch.long), torch.tensor(target, dtype=torch.long)

# Redefine the Model class to use Embedding and LSTM
class LSTMModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_size, num_layers, output_size):
        super(LSTMModel, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers

        # Embedding layer
        self.embedding = nn.Embedding(vocab_size, embedding_dim)

        # LSTM layers
        self.lstm = nn.LSTM(embedding_dim, hidden_size, num_layers, batch_first=True)

        # Dense layers
        self.dense1 = nn.Linear(hidden_size, hidden_size)
        self.relu = nn.LeakyReLU()
        self.dense2 = nn.Linear(hidden_size, output_size)
        self.log_softmax = nn.LogSoftmax(dim=2) # Log-softmax for numerical stability

    def forward(self, x, temperature=1.0, hidden=None):
        # Get embeddings from the input numerical indices
        embedded = self.embedding(x)

        # Initialize hidden states if not provided
        if hidden is None:
            h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(x.device)
            c0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(x.device)
            hidden = (h0, c0)

        # LSTM layers
        out_lstm, hidden = self.lstm(embedded, hidden)

        # Dense layers (applied to the output of the last LSTM layer at each time step)
        out_dense1 = self.relu(self.dense1(out_lstm))
        out_dense2 = self.dense2(out_dense1)
        out = self.log_softmax(out_dense2 / temperature)  # Apply temperature to softmax

        return out, hidden

# Example usage with the new model and dataset
vocab_size = len(cleaned_unique_chars)
embedding_dim = 64 # Choose an embedding dimension
hidden_size = 128
num_layers = 2 # Using 2 LSTM layers as in the original GRU model
output_size = vocab_size

model_lstm = LSTMModel(vocab_size, embedding_dim, hidden_size, num_layers, output_size)

device = "cuda" if torch.cuda.is_available() else "cpu"
model_lstm.to(device)

# Create the new dataset and dataloader
sequence_length = 100
batch_size = 128

# Split the data again for training and testing
test_set_size = 10000
train_data_numerical = numerical_array[:-test_set_size]
test_data_numerical = numerical_array[-test_set_size:]

train_dataset_numerical = ShakespeareDatasetNumerical(train_data_numerical, sequence_length)
train_dataloader_numerical = DataLoader(train_dataset_numerical, batch_size=batch_size, shuffle=True)

# Test the new dataset
for i, (input_data_num, target_data_num) in enumerate(train_dataloader_numerical):
    print(f"Batch {i+1} (Numerical):")
    print("Input data shape:", input_data_num.shape)
    print("Target data shape:", target_data_num.shape)
    if i >= 2:
        break


# Testing the new LSTM model
for batch_idx, (data_num, target_num) in enumerate(train_dataloader_numerical):
    data_num = data_num.to(device)
    # Feed the data through the network
    output, hidden = model_lstm(data_num)
    print("Output shape:", output.shape)
    print("Hidden state shape:", hidden[0].shape)
    print("Cell state shape:", hidden[1].shape)

    break # Inspect only the first batch


Batch 1 (Numerical):
Input data shape: torch.Size([128, 100])
Target data shape: torch.Size([128, 100])
Batch 2 (Numerical):
Input data shape: torch.Size([128, 100])
Target data shape: torch.Size([128, 100])
Batch 3 (Numerical):
Input data shape: torch.Size([128, 100])
Target data shape: torch.Size([128, 100])
Output shape: torch.Size([128, 100, 83])
Hidden state shape: torch.Size([2, 128, 128])
Cell state shape: torch.Size([2, 128, 128])


In [None]:
# Define the criterion
criterion = NLLLoss().to(device) # Still using NLLLoss because log_softmax is in the model

# Training Loop with the new LSTM model and numerical data
num_epochs = 100
num_batches_per_epoch = len(train_dataloader_numerical) # Use the actual number of batches
learning_rate = 0.001

optimizer = optim.Adam(model_lstm.parameters(), lr=learning_rate)

# Perplexity calculation for the numerical data model
def calculate_perplexity_lstm(model, data, sequence_length, vocab_size, device):
    model.eval()
    with torch.no_grad():
        total_log_likelihood = 0
        n = len(data)
        hidden = None # Initialize hidden state for the start of the test data
        for i in range(0, n - 1, sequence_length):
            input_seq = data[i:i + sequence_length]
            target_seq = data[i + 1:i + sequence_length + 1]

            # Input is numerical
            input_tensor = torch.tensor(input_seq, dtype=torch.long).unsqueeze(0).to(device) # Add batch dimension

            output, _ = model(input_tensor) # Discard the returned hidden state

            # Calculate log-likelihood
            log_likelihood = 0
            # Need to make sure target_seq is same length as output sequence
            current_target_seq = target_seq[:output.size(1)]
            for j in range(len(current_target_seq)):
                 # output[0, j, :] is the log-probabilities for the j-th character in the sequence
                 # current_target_seq[j] is the index of the ground truth character
                 log_likelihood += output[0, j, current_target_seq[j]]


            total_log_likelihood += log_likelihood

        # Ensure we don't divide by zero if data is too short
        # The total number of characters we made predictions for is n - 1 (since the first char has no preceding char)
        if n-1 <= 0:
            return float('inf') # Perplexity is infinite for empty sequence
        # Calculate perplexity over the total number of characters for which we have a target
        perplexity = torch.exp(-total_log_likelihood / (n - 1))
        return perplexity.item()

writer_lstm = SummaryWriter()

for epoch in range(num_epochs):
    model_lstm.train()
    total_loss = 0

    for batch_idx, (data_num, target_num) in enumerate(train_dataloader_numerical):
        data_num, target_num = data_num.to(device), target_num.to(device)

        # The LSTM forward method handles initializing hidden=None
        hidden = None

        optimizer.zero_grad()

        # Forward pass, passing the hidden state (which is None at the start of each batch)
        output, hidden = model_lstm(data_num, hidden=hidden)

        # Calculate the loss (reshape to match NLLLoss input)
        # output is (batch_size, sequence_length, vocab_size)
        # target is (batch_size, sequence_length)
        loss = criterion(output.view(-1, output.size(2)), target_num.view(-1))

        # Backward pass and optimization
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    average_loss = total_loss / len(train_dataloader_numerical)

    # Calculate test perplexity
    test_perplexity = calculate_perplexity_lstm(model_lstm, test_data_numerical, sequence_length, vocab_size, device)


    # Log to TensorBoard
    writer_lstm.add_scalar('Loss/train_lstm', average_loss, epoch)
    writer_lstm.add_scalar('Perplexity/test_lstm', test_perplexity, epoch)

    print(f"Epoch [{epoch+1}/{num_epochs}], Train Loss: {average_loss:.4f}, Test Perplexity: {test_perplexity:.4f}")

writer_lstm.close()

Epoch [1/100], Train Loss: 2.4629, Test Perplexity: 8.4416
Epoch [2/100], Train Loss: 1.8111, Test Perplexity: 6.9408
Epoch [3/100], Train Loss: 1.6692, Test Perplexity: 6.1177
Epoch [4/100], Train Loss: 1.5832, Test Perplexity: 5.7152
Epoch [5/100], Train Loss: 1.5260, Test Perplexity: 5.4252
Epoch [6/100], Train Loss: 1.4867, Test Perplexity: 5.2244
Epoch [7/100], Train Loss: 1.4567, Test Perplexity: 5.0714
Epoch [8/100], Train Loss: 1.4331, Test Perplexity: 4.9682
Epoch [9/100], Train Loss: 1.4132, Test Perplexity: 4.8947
Epoch [10/100], Train Loss: 1.3968, Test Perplexity: 4.7932
Epoch [11/100], Train Loss: 1.3835, Test Perplexity: 4.6715
Epoch [12/100], Train Loss: 1.3717, Test Perplexity: 4.6651
Epoch [13/100], Train Loss: 1.3615, Test Perplexity: 4.6458
Epoch [14/100], Train Loss: 1.3523, Test Perplexity: 4.5736
Epoch [15/100], Train Loss: 1.3442, Test Perplexity: 4.5546
Epoch [16/100], Train Loss: 1.3370, Test Perplexity: 4.5454
Epoch [17/100], Train Loss: 1.3303, Test Perplexi

In [None]:
# Text Generation with the LSTM model
def generate_text_lstm(model, seed_sequence_numerical, generation_length, char_to_idx, idx_to_char, temperature, device):
    model.eval()
    generated_text_indices = list(seed_sequence_numerical)
    input_sequence = torch.tensor(seed_sequence_numerical, dtype=torch.long).unsqueeze(0).to(device) # Add batch dimension

    hidden = None # Start with a fresh hidden state for generation

    with torch.no_grad():
        # Process the initial seed sequence to get the last hidden state
        _, hidden = model(input_sequence, hidden=hidden)

        # Now generate character by character
        # The input for the next step is just the last generated character
        last_char_index = generated_text_indices[-1]
        input_for_next_step = torch.tensor([last_char_index], dtype=torch.long).unsqueeze(0).to(device) # (Batch=1, Sequence=1)

        for _ in range(generation_length):
            output, hidden = model(input_for_next_step, temperature=temperature, hidden=hidden)

            # Get the output for the single time step (sequence length 1)
            probabilities = torch.exp(output[0, -1, :]) # output is (1, 1, vocab_size)

            # Sample the next character
            next_char_index = torch.multinomial(probabilities, 1).item()
            generated_text_indices.append(next_char_index)

            # Prepare input for the next step
            input_for_next_step = torch.tensor([next_char_index], dtype=torch.long).unsqueeze(0).to(device)

    # Convert generated text indices to characters
    generated_text_chars = [idx_to_char[idx] for idx in generated_text_indices]
    return "".join(generated_text_chars)

# Example usage of text generation with LSTM
seed_length = 100
generation_length = 500
temperature = 0.8

# Sample a random seed sequence from the numerical training data
start_index = np.random.randint(0, len(train_data_numerical) - seed_length)
seed_sequence_numerical_gen = train_data_numerical[start_index : start_index + seed_length]

generated_text_lstm_output = generate_text_lstm(model_lstm, seed_sequence_numerical_gen, generation_length, char_to_idx, idx_to_char, temperature, device)
print("\nGenerated Text (LSTM):")
lines = generated_text_lstm_output.split("\n")
for line in lines:
    print(line)



Generated Text (LSTM):
  TOUCHSTONE. Thus men may grow wiser every day. It is the first time
    that ever I heard breakinghs, the sickness here my heart,
    Out of great methourn were at offect'st propose.
  MARCUS. My son of Demetrius is in this part,
    And assur'd them upon thee, and rest doth hang
    The men and prodition hath moon, and been here
    The the heavens, that Alencon, Pray, hell!
    Not what to leave any men that have you be
    A woman's head and says not, under me,
    Patience will go forth any present ears
    Very meanting the house of Cardinal.
  OLIVIA. What would this face to the Cardin
