## TC 5033
### Text Generation

<br>

#### Activity 4: Building a Simple LSTM Text Generator using WikiText-2
<br>

- Objective:
    - Gain a fundamental understanding of Long Short-Term Memory (LSTM) networks.
    - Develop hands-on experience with sequence data processing and text generation in PyTorch. Given the simplicity of the model, the amount of data, and the available computing resources, the text you generate will not replace ChatGPT, and the results will most likely not make a lot of sense. The purpose is purely academic: to understand text generation using RNNs.
    - Enhance code comprehension and documentation skills by commenting on provided starter code.
    
<br>

- Instructions:
    - Code Understanding: Begin by thoroughly reading and understanding the code. Comment each section/block of the provided code to demonstrate your understanding. For this, you are encouraged to add cells with experiments to improve your understanding.

    - Model Overview: The starter code includes an LSTM model setup for sequence data processing. Familiarize yourself with the model architecture and its components. Once you are familiar with the provided model, feel free to change the model to experiment.

    - Training Function: Implement a function to train the LSTM model on the WikiText-2 dataset. This function should feed the training data into the model and perform backpropagation. 

    - Text Generation Function: Create a function that accepts starting text (seed text) and a specified total number of words to generate. The function should use the trained model to generate a continuation of the input text.

    - Code Commenting: Ensure that all the provided starter code is well-commented. Explain the purpose and functionality of each section, indicating your understanding.

    - Submission: Submit your Jupyter Notebook with all sections completed and commented. Include a markdown cell with the full names of all contributing team members at the beginning of the notebook.
    
<br>

- Evaluation Criteria:
    - Code Commenting (60%): The clarity, accuracy, and thoroughness of comments explaining the provided code. You are suggested to use markdown cells for your explanations.

    - Training Function Implementation (20%): The correct implementation of the training function, which should effectively train the model.

    - Text Generation Functionality (10%): A working function is provided in comments. You are free to use it as long as you make sure you understand it, and you may improve it as you see fit. The minimum expected is to provide comments for the given function.

    - Conclusions (10%): Provide some final remarks specifying the differences you notice between this model and the one used for classification tasks. Also comment on changes you made to the model, hyperparameters, and any other information you consider relevant. Finally, please provide 3 examples of generated texts.



#### Import libraries

In [1]:
import numpy as np
#PyTorch libraries
import torch
import torchtext
from torchtext.datasets import WikiText2
# Dataloader library
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.dataset import random_split
# Libraries to prepare the data
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.data.functional import to_map_style_dataset
# neural layers
from torch import nn
from torch.nn import functional as F
import torch.optim as optim
from tqdm import tqdm

import random

In [2]:
# Use GPU if available: CUDA first, then MPS (mac with apple silicon), otherwise CPU
device = 'cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu'
print(device)

mps


### Load the data

In [3]:
train_dataset, val_dataset, test_dataset = WikiText2()

#### Get the tokenizer

The tokenizer function will be used to convert the text into a list of tokens.

In [4]:
tokeniser = get_tokenizer('basic_english')
def yield_tokens(data):
    for text in data:
        yield tokeniser(text)

The following block of code will create a vocabulary of tokens from the train dataset.

The parameter `specials` is used to specify special tokens that will be added to the vocabulary. In this case, we are adding four special tokens:

 * `<unk>`: for unknown words, not found in the training dataset.
 * `<pad>`: for padding sequences to the same length.
 * `<bos>`: for marking the beginning of a sequence.
 * `<eos>`: for marking the end of a sequence.
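
To make the role of the special tokens and of the default index concrete, here is a minimal pure-Python sketch of the same idea. `build_toy_vocab` and `lookup` are hypothetical helpers written only for illustration; they imitate the behaviour described above (specials first, unknown words mapped to `<unk>`) and are not torchtext's implementation. The alphabetical tie-breaking is an assumption made so that the result is deterministic.

```python
from collections import Counter

SPECIALS = ["<unk>", "<pad>", "<bos>", "<eos>"]

def build_toy_vocab(token_lists, specials=SPECIALS):
    """Map tokens to indices: specials first, then tokens by descending frequency."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Sort by frequency; break ties alphabetically so the order is deterministic.
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    itos = list(specials) + [t for t in ordered if t not in specials]
    return {tok: idx for idx, tok in enumerate(itos)}

def lookup(stoi, token, default="<unk>"):
    """Return the index of `token`, falling back to the index of `<unk>`."""
    return stoi.get(token, stoi[default])

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
stoi = build_toy_vocab(corpus)

print([stoi[s] for s in SPECIALS])                  # [0, 1, 2, 3]
print([lookup(stoi, t) for t in ["the", "zebra"]])  # [5, 0]: "zebra" is unknown
```

Any word that never appeared in the training text ends up at index 0, which is the effect `vocab.set_default_index(vocab["<unk>"])` is meant to have in the real vocabulary.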

In [5]:
# Build the vocabulary
vocab = build_vocab_from_iterator(yield_tokens(train_dataset), specials=["<unk>", "<pad>", "<bos>", "<eos>"])
#set unknown token at position 0
vocab.set_default_index(vocab["<unk>"])

#### Process the data

The function `data_process` returns the input tensor and the target tensor that will be used to train the LSTM model.

Arguments: 
* `raw_text_iter` iterator over the text dataset.
* `seq_length` length of the sequence for the LSTM model, default = 50.

First, the function will tokenize each item using `tokeniser`, then convert the tokens to their corresponding indices in the vocabulary using `vocab`, and then convert the resulting list of indices to a tensor.

Then, the function will remove empty tensors and concatenate the remaining tensors into a single tensor using `torch.cat`.

Finally, the function will return two tensors. The first (the input) is created by trimming the `data` tensor to a length that is a multiple of the sequence length and reshaping it into a 2D tensor. The second (the target) is created by shifting the `data` tensor by one position, so that each target is the token that follows the corresponding input token, trimming it to the same length as the first tensor and reshaping it into a 2D tensor.
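
The trimming and shifting described above can be illustrated on a short list of integers standing in for token indices. This is a minimal sketch using plain Python lists rather than tensors; `make_pairs` is a hypothetical helper, not part of the starter code.

```python
def make_pairs(data, seq_length):
    """Split a flat token list into (input, target) rows, targets shifted by one."""
    # Largest multiple of seq_length that still leaves one token for the last target.
    n = (len(data) - 1) // seq_length * seq_length
    inputs = [data[i:i + seq_length] for i in range(0, n, seq_length)]
    targets = [data[i + 1:i + 1 + seq_length] for i in range(0, n, seq_length)]
    return inputs, targets

tokens = list(range(8))          # stand-in for vocabulary indices 0..7
x, y = make_pairs(tokens, seq_length=3)
print(x)   # [[0, 1, 2], [3, 4, 5]]
print(y)   # [[1, 2, 3], [4, 5, 6]]
```

Each row of `y` is the matching row of `x` moved one token ahead, which is what lets the model learn to predict the next word at every position.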


In [6]:
seq_length = 50
def data_process(raw_text_iter, seq_length = 50):
    data = [torch.tensor(vocab(tokeniser(item)), dtype=torch.long) for item in raw_text_iter]  # tokenize the text and convert to tensor
    data = torch.cat(tuple(filter(lambda t: t.numel() > 0, data)))                             # remove empty tensors

    # largest multiple of seq_length that still leaves one token for the shifted target;
    # this also handles the case where the data length is an exact multiple of seq_length
    n = (data.size(0) - 1) // seq_length * seq_length

    return (data[:n].view(-1, seq_length),         # inputs: tokens 0 .. n-1, one sequence per row
            data[1:n + 1].view(-1, seq_length))    # targets: the same tokens shifted by one position

# Create tensors for the training, validation and test sets
x_train, y_train = data_process(train_dataset, seq_length)
x_val, y_val = data_process(val_dataset, seq_length)
x_test, y_test = data_process(test_dataset, seq_length)

The `TensorDataset` class is used to create a dataset of tensors. The train dataset will be used to create a data loader that will feed the data to the LSTM model during training.

*Note: For this particular application of text generation, we will not use the validation or test datasets.*

In [7]:
train_dataset = TensorDataset(x_train, y_train)      
val_dataset = TensorDataset(x_val, y_val)
test_dataset = TensorDataset(x_test, y_test)

In [8]:
print(len(train_dataset), len(val_dataset), len(test_dataset))

40999 4288 4837


#### Create data loaders

In [9]:
batch_size = 64  # choose a batch size that fits your computation resources
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, drop_last=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=True, drop_last=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=True, drop_last=True)
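
As a quick sanity check on the loader settings, the number of batches per epoch follows directly from the dataset size printed earlier. The sketch below is plain arithmetic and assumes the values shown in this notebook (40999 training sequences, `batch_size = 64`, `drop_last=True`).

```python
num_sequences = 40999   # len(train_dataset) printed above
batch_size = 64

full_batches, leftover = divmod(num_sequences, batch_size)
print(full_batches)   # 640 batches per epoch with drop_last=True
print(leftover)       # 39 sequences are skipped in each epoch
```

Dropping the incomplete batch matters in this notebook because the hidden state is created with `model.init_hidden(batch_size)`, which assumes every batch has exactly `batch_size` sequences. Since `shuffle=True`, the skipped sequences are likely to differ from one epoch to the next.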

### Build the LSTM model

In [10]:
# Define the LSTM model
# Feel free to experiment
class LSTMModel(nn.Module):                                                        # inherit from PyTorch's nn.Module
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers):             # initialize the class
        super(LSTMModel, self).__init__()                                            # call the constructor of the parent class
        self.embeddings = nn.Embedding(vocab_size, embed_size)                       # embedding layer
        self.hidden_size = hidden_size                                               # hidden size parameter
        self.num_layers = num_layers                                                 # number of layers parameter
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)   # LSTM layer
        self.fc = nn.Linear(hidden_size, vocab_size)                                 # fully connected layer maps output to vocabulary size

    def forward(self, text, hidden):                      # forward pass method  
        embeddings = self.embeddings(text)                  # convert input text to embeddings
        output, hidden = self.lstm(embeddings, hidden)      # pass embeddings and hidden state to LSTM layer
        decoded = self.fc(output)                           # pass output of LSTM layer to fully connected layer
        return decoded, hidden                              # return output and hidden state

    def init_hidden(self, batch_size):                                                 # hidden state initialization method
        return (torch.zeros(self.num_layers, batch_size, self.hidden_size).to(device),   # initialize hidden state to zeros
                torch.zeros(self.num_layers, batch_size, self.hidden_size).to(device))   # initialize cell state to zeros



vocab_size = len(vocab)  # vocabulary size
emb_size = 50            # embedding size
neurons = 300            # hidden size of the LSTM, i.e. # of hidden units per layer
num_layers = 2           # the number of nn.LSTM layers
model = LSTMModel(vocab_size, emb_size, neurons, num_layers)  # instantiate the model


#### Review model architecture:

In [17]:
model  

LSTMModel(
  (embeddings): Embedding(28785, 50)
  (lstm): LSTM(50, 300, num_layers=2, batch_first=True)
  (fc): Linear(in_features=300, out_features=28785, bias=True)
)

Our LSTM model will consist of the following components:

1. An embedding layer that will convert the input tokens into dense vectors of size 50.
2. Two LSTM layers with 300 hidden units each.
3. A fully connected layer that will map the output of the LSTM layer to the vocabulary size, which is 28785 tokens.
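
To get a feel for where the model's capacity sits, the number of trainable parameters can be worked out from the sizes listed above. This is a sketch under the assumption that each `nn.LSTM` layer holds four gates, each with input weights, recurrent weights and two bias vectors, which is how PyTorch lays out its LSTM parameters.

```python
vocab_size, embed_size, hidden_size, num_layers = 28785, 50, 300, 2

def lstm_layer_params(input_size, hidden_size):
    # 4 gates, each with input weights, recurrent weights and two bias vectors
    return 4 * (input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)

embedding_params = vocab_size * embed_size
lstm_params = lstm_layer_params(embed_size, hidden_size) + sum(
    lstm_layer_params(hidden_size, hidden_size) for _ in range(num_layers - 1)
)
fc_params = hidden_size * vocab_size + vocab_size   # weights plus bias

total = embedding_params + lstm_params + fc_params
print(f"embedding: {embedding_params:,}")   # embedding: 1,439,250
print(f"lstm:      {lstm_params:,}")        # lstm:      1,144,800
print(f"fc:        {fc_params:,}")          # fc:        8,664,285
print(f"total:     {total:,}")              # total:     11,248,335
```

Most of the parameters sit in the final linear layer because it has to produce one score per vocabulary token. On the live model, the total can be cross-checked with `sum(p.numel() for p in model.parameters())`.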

### Training function

In [11]:
def train(model, epochs, optimiser):
    '''
    The following are possible instructions you may want to consider for this function.
    This is only a guide and you may change, add or remove whatever you consider appropriate
    as long as you train your model correctly.
        - loop through specified epochs
        - loop through dataloader
        - don't forget to zero grad!
        - place data (both input and target) in device
        - init hidden states e.g. hidden = model.init_hidden(batch_size)
        - run the model
        - compute the cost or loss
        - backpropagation
        - Update parameters
        - Print all the information you consider helpful
    
    ''' 
    
    model = model.to(device=device)                         # move the model parameters to GPU
    model.train()                                           # put the model in training mode
    
    for epoch in range(epochs):
        for i, (data, targets) in enumerate((train_loader)):            # loop through the data loader
            data = data.to(device=device)                               # place input data in device
            targets = targets.to(device=device)                         # place targets in device
            
            hidden = model.init_hidden(batch_size)                      # initialize hidden state vectors
            outputs, hidden = model(data, hidden)                       # perform the forward pass
            loss = F.cross_entropy(outputs.view(-1, vocab_size), targets.view(-1))  # compute the loss  

            optimiser.zero_grad()                  # reset gradients to zero
            loss.backward()                        # perform the backward pass
            optimiser.step()                       # update the parameters


        print(f'Epoch: {epoch+1}, Loss: {loss.item()}')   # loss of the last batch in this epoch
            

#### Train the model

In [12]:
# Call the train function
# Hyperparameters
lr = 0.0005             # learning rate
epochs = 15             # number of epochs  

#Adam optimizer
optimiser = optim.Adam(model.parameters(), lr=lr)
# Train the model
train(model, epochs, optimiser)

Epoch: 1, Loss: 6.827858924865723
Epoch: 2, Loss: 6.5476813316345215
Epoch: 3, Loss: 6.337130069732666
Epoch: 4, Loss: 6.299001693725586
Epoch: 5, Loss: 6.213036060333252
Epoch: 6, Loss: 6.04749870300293
Epoch: 7, Loss: 5.978874683380127
Epoch: 8, Loss: 5.8547139167785645
Epoch: 9, Loss: 5.894222259521484
Epoch: 10, Loss: 5.832911491394043
Epoch: 11, Loss: 5.6278886795043945
Epoch: 12, Loss: 5.5529561042785645
Epoch: 13, Loss: 5.569584846496582
Epoch: 14, Loss: 5.3070068359375
Epoch: 15, Loss: 5.308558464050293
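
One way to read these numbers is to compare them with the loss of a model that guesses uniformly over the vocabulary, and to convert the loss into perplexity (the exponential of the cross-entropy). The sketch below assumes the vocabulary size and final loss printed in this notebook; note that the printed loss comes from a single training batch, so it is only a rough indication.

```python
import math

vocab_size = 28785
final_loss = 5.308558464050293        # last value printed above (one training batch)

uniform_loss = math.log(vocab_size)   # cross-entropy of a uniform guess, in nats
perplexity = math.exp(final_loss)     # effective number of equally likely choices

print(f"uniform-guess loss: {uniform_loss:.2f}")
print(f"perplexity after training: {perplexity:.0f}")
```

The training loss is well below the uniform baseline, so the model has learned something about word statistics, but a perplexity in the hundreds is consistent with generated text that is only loosely coherent.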


### Text generation function

The following function will be used to generate text using the trained LSTM model. It takes four arguments:

* `model`: the trained LSTM model.
* `start_text`: the initial text to start the generation.
* `num_words`: the number of words to generate.
* `temperature`: a parameter used to control the randomness of the generated text.

The function first sets the model to evaluation mode, tokenizes `start_text` into the list `words`, and initializes the hidden state of the LSTM with a batch size of 1.

The `for` loop runs once per word to generate. In each iteration, the following steps are performed:

1. The input tokens are converted to a tensor of their corresponding indices in the vocabulary.
2. The tensor is fed into the model, together with the hidden state from the previous iteration, to get the output and the updated hidden state.
3. The logits for the last position are extracted; they score every vocabulary token as a candidate for the next word.
4. The logits are divided by the `temperature` parameter and passed through softmax to get probabilities; the resulting tensor is detached, moved to the CPU and converted to a NumPy array.
5. A word index is sampled at random according to these probabilities.
6. The selected word is appended to the `words` sequence.

After the loop, the words in `words` are joined with spaces to produce the generated text.
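
The effect of `temperature` in the softmax and sampling steps can be seen on a handful of made-up logits. This is a minimal pure-Python sketch; `softmax_with_temperature` and `sample_index` are hypothetical helpers that mirror what `F.softmax` and `np.random.choice` do in the notebook's function, and the logit values are invented for illustration.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities after dividing them by the temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                               # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(logits, temperature=1.0, rng=random):
    """Draw one index at random according to the tempered probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]                          # invented scores for three tokens
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])

print(sample_index(logits, temperature=0.5, rng=random.Random(0)))
```

Temperatures below 1 concentrate the probability on the highest-scoring token, which gives more conservative text, while temperatures above 1 flatten the distribution and make the output more varied.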


In [13]:
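def generate_text(model, start_text, num_words, temperature=1.0):
    model.eval()                            # evaluation mode
    words = tokeniser(start_text)           # tokenize the seed text
    hidden = model.init_hidden(1)           # hidden and cell states for a batch of 1
    next_input = words                      # the first step feeds the whole seed text

    with torch.no_grad():                   # no gradients are needed for generation
        for _ in range(num_words):
            x = torch.tensor([[vocab[word] for word in next_input]], dtype=torch.long, device=device)
            y_pred, hidden = model(x, hidden)
            last_word_logits = y_pred[0][-1]                      # logits for the next word
            p = (F.softmax(last_word_logits / temperature, dim=0).detach()).to(device='cpu').numpy()
            word_index = np.random.choice(len(last_word_logits), p=p)   # sample the next word
            words.append(vocab.lookup_token(word_index))
            next_input = words[-1:]         # earlier tokens are already encoded in the hidden state

    return ' '.join(words)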
    
    

# Generate some text
print(generate_text(model, start_text="I like", num_words=100))


i like pneumonia and a phrase on all cosmetics , adrien and folk similar , killing in 1895 . by the blue dance system , trinsey , 2 @ . @ 3 , buildup up the 2004 aviation of the anime raid . for their pitt , a 2010 team suggests that several cemetery ' s occasions for her call emerging . they showed a talking by eight , veronica and crickets of ireland . his recommendation residues , using organ his delay [ tom ] ) . a collection , stating that he feared the role in this flyby together in


In [14]:
start_text = "Hello my name is"

print(generate_text(model, start_text=start_text, num_words=50))

hello my name is the scientific rock . = = background = = on his week after its album began twice from the <unk> . additionally , the tropical final length of u , <unk> <unk> , crew <unk> ჻ and additional features <unk> <unk> . = = thompson and jews of 160 ,


In [15]:
start_text = "In the city of"

print(generate_text(model, start_text=start_text, num_words=30))

in the city of us days during other german level , and south centuries until 754 the space was voted the cathedral compared in late 1975 . the hurricane are a special determined using


In [16]:
start_text = "The president said"

print(generate_text(model, start_text=start_text, num_words=20))

the president said their calls dropped about pursuit ( 48 ) ' s twenty @-@ century community . the main beliefs were improved


#### Conclusion