## TC 5033
### Text Generation

<br>

#### Activity 4: Building a Simple LSTM Text Generator using WikiText-2
<br>

- Objective:
    - Gain a fundamental understanding of Long Short-Term Memory (LSTM) networks.
    - Develop hands-on experience with sequence data processing and text generation in PyTorch. Given the simplicity of the model, the amount of data, and the available computing resources, the text you generate will not replace ChatGPT, and the results will most likely not make much sense. The purpose is purely academic: to understand text generation using RNNs.
    - Enhance code comprehension and documentation skills by commenting on provided starter code.
    
<br>

- Instructions:
    - Code Understanding: Begin by thoroughly reading and understanding the code. Comment each section/block of the provided code to demonstrate your understanding. You are encouraged to add cells with experiments to improve your understanding.

    - Model Overview: The starter code includes an LSTM model setup for sequence data processing. Familiarize yourself with the model architecture and its components. Once you are familiar with the provided model, feel free to change the model to experiment.

    - Training Function: Implement a function to train the LSTM model on the WikiText-2 dataset. This function should feed the training data into the model and perform backpropagation. 

    - Text Generation Function: Create a function that accepts starting text (seed text) and a specified total number of words to generate. The function should use the trained model to generate a continuation of the input text.

    - Code Commenting: Ensure that all the provided starter code is well-commented. Explain the purpose and functionality of each section, indicating your understanding.

    - Submission: Submit your Jupyter Notebook with all sections completed and commented. Include a markdown cell with the full names of all contributing team members at the beginning of the notebook.
    
<br>

- Evaluation Criteria:
    - Code Commenting (60%): The clarity, accuracy, and thoroughness of comments explaining the provided code. You are suggested to use markdown cells for your explanations.

    - Training Function Implementation (20%): The correct implementation of the training function, which should effectively train the model.

    - Text Generation Functionality (10%): A working function is provided in comments. You are free to use it as long as you make sure to understand it, and you may improve it as you see fit. The minimum expected is to provide comments for the given function.

    - Conclusions (10%): Provide some final remarks specifying the differences you notice between this model and the one used for classification tasks. Also comment on the changes you made to the model and hyperparameters, and on any other information you consider relevant. Finally, provide 3 examples of generated text.



In [1]:
import numpy as np
#PyTorch libraries
import torch
import torchtext
from torchtext.datasets import WikiText2
# Dataloader library
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.dataset import random_split
# Libraries to prepare the data
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.data.functional import to_map_style_dataset
# neural layers
from torch import nn
from torch.nn import functional as F
import torch.optim as optim
from tqdm import tqdm

import random

In [3]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

'cuda'

In [4]:
train_dataset, val_dataset, test_dataset = WikiText2()

In [5]:
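# Assumption: the tokeniser definition is missing from this cell; 'basic_english' is a common choice
tokeniser = get_tokenizer('basic_english')

def yield_tokens(data):
    """Take an iterable of strings and yield each one as a list of tokens."""
    for text in data:
        yield tokeniser(text)  # uses the module-level tokeniser defined above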

In [6]:
# Build the vocabulary
vocab = build_vocab_from_iterator(yield_tokens(train_dataset), specials=["<unk>", "<pad>", "<bos>", "<eos>"])
# Map any out-of-vocabulary token to the index of <unk>
vocab.set_default_index(vocab["<unk>"])
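
To make the role of the special tokens and the default index concrete, here is a minimal pure-Python sketch of a vocabulary lookup. It is a simplified stand-in, not torchtext's implementation: `build_toy_vocab`, `lookup`, and the toy corpus are hypothetical names introduced only for illustration, and the ordering of equally frequent tokens may differ from what `build_vocab_from_iterator` produces.

```python
from collections import Counter

def build_toy_vocab(token_lists, specials=("<unk>", "<pad>", "<bos>", "<eos>")):
    """Return (token_to_index, index_to_token): specials first, then tokens by frequency."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    index_to_token = list(specials) + [
        tok for tok, _ in counts.most_common() if tok not in specials
    ]
    token_to_index = {tok: idx for idx, tok in enumerate(index_to_token)}
    return token_to_index, index_to_token

def lookup(token_to_index, token, default_token="<unk>"):
    # Unknown tokens fall back to the index of the default token,
    # which is the effect of calling set_default_index on the real vocab
    return token_to_index.get(token, token_to_index[default_token])

# Hypothetical toy corpus, already tokenised
toy_corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "end"]]
toy_stoi, toy_itos = build_toy_vocab(toy_corpus)

print(lookup(toy_stoi, "the"))      # 4: the most frequent token follows the four specials
print(lookup(toy_stoi, "giraffe"))  # 0: out-of-vocabulary, mapped to <unk>
```

Without a default index, looking up a word that never appeared in the training split would raise an error instead of returning the `<unk>` index.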

In [7]:
# Define a constant for the sequence length
seq_length = 50

# Define a function to process the raw text data into tensors
def data_process(raw_text_iter, seq_length = 50):
    # Convert each text item into a tensor of vocabulary indices using the tokeniser
    data = [torch.tensor(vocab(tokeniser(item)), dtype=torch.long) for item in raw_text_iter]
    # Concatenate the tensors into one large tensor, and filter out any empty tensors
    data = torch.cat(tuple(filter(lambda t: t.numel() > 0, data))) #remove empty tensors
    # Return a tuple of two tensors: one for the input sequences and one for the target sequences
    # The input sequences are the original data divided into chunks of seq_length
    # The target sequences are the same chunks shifted one token ahead (the next-token labels)
    # One token is held back so every input position has a target, even when the
    # data length is an exact multiple of seq_length (or one more than a multiple)
    n = (data.size(0) - 1) // seq_length * seq_length
    return (data[:n].view(-1, seq_length),
            data[1:n + 1].view(-1, seq_length))

# Create input/target tensors for the training, validation, and test sets
x_train, y_train = data_process(train_dataset, seq_length)
x_val, y_val = data_process(val_dataset, seq_length)
x_test, y_test = data_process(test_dataset, seq_length)
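
The chunk-and-shift idea in `data_process` is easier to see on a tiny example. The sketch below reproduces it with plain Python lists instead of tensors; `make_input_target_pairs` and `toy_ids` are hypothetical names used only here.

```python
def make_input_target_pairs(token_ids, seq_length):
    """Split a flat list of token ids into (input, target) chunks of seq_length."""
    # Hold back one token so that every input position has a next-token target
    usable = (len(token_ids) - 1) // seq_length * seq_length
    inputs = [token_ids[i:i + seq_length] for i in range(0, usable, seq_length)]
    targets = [token_ids[i + 1:i + 1 + seq_length] for i in range(0, usable, seq_length)]
    return inputs, targets

toy_ids = list(range(10))  # stand-in for a stream of vocabulary indices
toy_inputs, toy_targets = make_input_target_pairs(toy_ids, seq_length=3)

print(toy_inputs)   # [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
print(toy_targets)  # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

Each target row is its input row moved one token ahead, so position `k` of the target is the token the model should predict after seeing positions `0..k` of the input.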

In [8]:
# Wrap each (input, target) tensor pair in a TensorDataset
train_dataset = TensorDataset(x_train, y_train)
val_dataset = TensorDataset(x_val, y_val)
test_dataset = TensorDataset(x_test, y_test)

In [9]:
batch_size = 64  # choose a batch size that fits your computation resources
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, drop_last=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=True, drop_last=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=True, drop_last=True)

In [11]:
# Define the LSTM model
# Feel free to experiment
# Define the LSTMModel class
class LSTMModel(nn.Module):
    # Define the constructor method
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers):
        # Call the parent class constructor
        super(LSTMModel, self).__init__()
        # Define the embedding layer that maps the vocabulary indices to the embedding vectors
        self.embeddings = nn.Embedding(vocab_size, embed_size)
        # Define the hidden size of the LSTM cells
        self.hidden_size = hidden_size
        # Define the number of layers of the LSTM network
        self.num_layers = num_layers
        # Define the LSTM layer that takes the embeddings as input and outputs the hidden states
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        # Define the linear layer that maps the hidden states to the vocabulary size
        self.fc = nn.Linear(hidden_size, vocab_size)

    # Define the forward pass method
    def forward(self, text, hidden):
        # Get the embeddings of the input text
        embeddings = self.embeddings(text)
        # Pass the embeddings and the hidden states to the LSTM layer
        output, hidden = self.lstm(embeddings, hidden)
        # Pass the output of the LSTM layer to the linear layer
        decoded = self.fc(output)
        # Return the decoded output and the hidden states
        return decoded, hidden

    # Define the method to initialize the hidden states
    def init_hidden(self, batch_size):
        # Return a tuple of two tensors of zeros with the shape (num_layers, batch_size, hidden_size)
        return (torch.zeros(self.num_layers, batch_size, self.hidden_size).to(device),
                torch.zeros(self.num_layers, batch_size, self.hidden_size).to(device))



vocab_size = len(vocab) # vocabulary size
emb_size = 100 # embedding size
neurons = 128 # hidden size of the LSTM, i.e. number of hidden units per layer
num_layers = 1 # the number of nn.LSTM layers
model = LSTMModel(vocab_size, emb_size, neurons, num_layers)
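
When commenting the model, it can help to trace tensor shapes through the three layers. The following is a small sketch with toy sizes (all `toy_*` values are arbitrary assumptions, much smaller than the ones used above) that mirrors the embedding, LSTM, and linear layers of `LSTMModel`.

```python
import torch
from torch import nn

# Toy sizes chosen only for illustration
toy_vocab_size, toy_embed, toy_hidden, toy_layers = 30, 8, 16, 1
toy_batch, toy_seq = 4, 5

embedding = nn.Embedding(toy_vocab_size, toy_embed)
lstm = nn.LSTM(toy_embed, toy_hidden, toy_layers, batch_first=True)
fc = nn.Linear(toy_hidden, toy_vocab_size)

# A batch of random token ids and zero-initialised hidden and cell states
token_ids = torch.randint(0, toy_vocab_size, (toy_batch, toy_seq))
h0 = torch.zeros(toy_layers, toy_batch, toy_hidden)
c0 = torch.zeros(toy_layers, toy_batch, toy_hidden)

embedded = embedding(token_ids)              # (batch, seq, embed)
output, (hn, cn) = lstm(embedded, (h0, c0))  # output: (batch, seq, hidden)
logits = fc(output)                          # (batch, seq, vocab)

print(embedded.shape, output.shape, hn.shape, logits.shape)
```

Note that `batch_first=True` only changes the layout of the input and `output`; the hidden and cell states keep the shape `(num_layers, batch, hidden)`.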


In [None]:
def train(model, epochs, optimiser):
    '''
    The following are possible instructions you may want to consider for this function.
    This is only a guide, and you may change, add, or remove whatever you consider appropriate
    as long as you train your model correctly.
        - loop through specified epochs
        - loop through dataloader
        - don't forget to zero grad!
        - place data (both input and target) in device
        - init hidden states e.g. hidden = model.init_hidden(batch_size)
        - run the model
        - compute the cost or loss
        - backpropagation
        - Update parameters
        - Print any information you consider helpful
    
    '''
    
    
    model = model.to(device=device)
    model.train()
    
    for epoch in range(epochs):

        for i, (data, targets) in enumerate(train_loader):
            
            # TO COMPLETE
            pass  # placeholder so the cell parses until the loop body is written

In [None]:
# Call the train function
loss_function = nn.CrossEntropyLoss()
lr = 0.0005
epochs = 5
optimiser = optim.Adam(model.parameters(), lr=lr)
train(model, epochs, optimiser)
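
The steps listed in the `train` docstring can be sketched on a toy problem as follows. This is one possible arrangement under stated assumptions, not the expected solution: `TinyLSTM`, `train_sketch`, and the random token data are hypothetical, and here `init_hidden` takes the device as an argument, unlike the starter model.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Toy stand-ins (assumptions): a tiny vocabulary and random token ids
toy_vocab_size, toy_seq_length, toy_batch_size = 20, 5, 4
tokens = torch.randint(0, toy_vocab_size, (toy_seq_length * 16 + 1,))
x = tokens[:-1].view(-1, toy_seq_length)  # inputs
y = tokens[1:].view(-1, toy_seq_length)   # next-token targets
toy_loader = DataLoader(TensorDataset(x, y), batch_size=toy_batch_size,
                        shuffle=True, drop_last=True)

class TinyLSTM(nn.Module):
    def __init__(self, vocab_size, embed_size=8, hidden_size=16, num_layers=1):
        super().__init__()
        self.num_layers, self.hidden_size = num_layers, hidden_size
        self.embeddings = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, text, hidden):
        output, hidden = self.lstm(self.embeddings(text), hidden)
        return self.fc(output), hidden

    def init_hidden(self, batch_size, device):
        shape = (self.num_layers, batch_size, self.hidden_size)
        return (torch.zeros(shape, device=device), torch.zeros(shape, device=device))

def train_sketch(model, loader, epochs, optimiser, loss_function, device="cpu"):
    model = model.to(device)
    model.train()
    epoch_losses = []
    for epoch in range(epochs):
        running_loss = 0.0
        for data, targets in loader:
            data, targets = data.to(device), targets.to(device)
            optimiser.zero_grad()                              # clear old gradients
            hidden = model.init_hidden(data.size(0), device)   # fresh state per batch
            logits, hidden = model(data, hidden)               # (batch, seq, vocab)
            # CrossEntropyLoss expects (N, C) logits and (N,) class indices
            loss = loss_function(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
            loss.backward()                                    # backpropagation
            optimiser.step()                                   # parameter update
            running_loss += loss.item()
        epoch_losses.append(running_loss / len(loader))
        print(f"epoch {epoch + 1}: mean loss {epoch_losses[-1]:.4f}")
    return epoch_losses

toy_model = TinyLSTM(toy_vocab_size)
toy_losses = train_sketch(toy_model, toy_loader, epochs=3,
                          optimiser=torch.optim.Adam(toy_model.parameters(), lr=0.01),
                          loss_function=nn.CrossEntropyLoss())
```

Flattening the logits to `(batch * seq, vocab)` treats every position of every sequence as one classification example, which is the main difference from a classifier that produces a single label per sequence.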

In [None]:
def generate_text(model, start_text, num_words, temperature=1.0):
    '''
    model.eval()
    words = tokeniser(start_text)
    hidden = model.init_hidden(1)
    for i in range(0, num_words):
        x = torch.tensor([[vocab[word] for word in words[i:]]], dtype=torch.long, device=device)
        y_pred, hidden = model(x, hidden)
        last_word_logits = y_pred[0][-1]
        p = (F.softmax(last_word_logits / temperature, dim=0).detach()).to(device='cpu').numpy()
        word_index = np.random.choice(len(last_word_logits), p=p)
        words.append(vocab.lookup_token(word_index))

    return ' '.join(words)
    '''
    
    pass

# Generate some text
print(generate_text(model, start_text="I like", num_words=100))
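
The commented-out function divides the logits by `temperature` before the softmax. A minimal pure-Python sketch of that step, with hypothetical helper names and made-up logits, shows how the temperature reshapes the sampling distribution.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [value / temperature for value in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - max_scaled) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

def sample_index(logits, temperature=1.0, rng=random):
    """Draw one index at random, weighted by the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

toy_logits = [2.0, 1.0, 0.1]  # made-up scores for a three-word vocabulary
for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, temperature)
    print(temperature, [round(p, 3) for p in probs])

print(sample_index(toy_logits, temperature=0.8))
```

Temperatures below 1 concentrate probability on the highest-scoring word, which gives more repetitive text; temperatures above 1 flatten the distribution and give more varied, less coherent text.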
