# <center>Recurrent Neural Networks</center>
## <center>Inclass Project 3 - MA4144</center>
### <center>Ishara Dilshan Rajarathna - 200500L</center>

This project contains multiple tasks to be completed; some require written answers. Open a markdown cell below each question that requires a written answer and type your answer there. Questions that require written answers are given in blue font. Almost all written questions are open-ended and do not have a correct or wrong answer. You are free to give your opinions, but please keep your answers within the context of the question.

After finishing the project, run the entire notebook once and **save the notebook as a PDF** (File menu -> Save and Export Notebook As -> PDF). You are **required to upload both this ipynb file and the PDF on moodle**.

***

## Outline of the project

The aim of the project is to build an RNN model that suggests autocompletions of half-typed words. You may have seen this in many day-to-day applications, such as typing an email or a text message. For example, if you type the four letters "univ", the application may suggest autocompleting it to "university".

![Autocomplete](https://d33v4339jhl8k0.cloudfront.net/docs/assets/5c12e83004286304a71d5b72/images/66d0cb106eb51e63b8f9fbc6/file-gBQe016VYt.gif)

We will train an RNN to suggest possible autocompletions given $3$ - $4$ starting letters. That is, if we input the string "univ", we hope to see outputs like "university", "universal", etc.

For this we will use a text file (wordlist.txt) containing 10,000 common English words (you'll find the file on the moodle link). The list of words will be the "**vocabulary**" for our model.

We will use the Python **torch library** to implement our autocomplete model. 

***


Use the cell below to include any imports.

In [41]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt
import random

## Section 1: Preparing the vocabulary

In [42]:
WORD_SIZE = 13

**Q1.** In the following cell provide code to load the text file (each word is in a newline), then extract the words (in lowercase) into a list.

For practical reasons of training the model we will only use words that are longer than $3$ letters and that have a maximum length of WORD_SIZE (this will be a constant we set at the beginning - you can change this and experiment with different WORD_SIZEs). As seen above it is set to $13$.

So out of the extracted list of words, filter out those words that do not match our criteria on word length.

To train our model it is convenient to have words/strings of equal length. We will convert every word to length WORD_SIZE by adding underscores to the end of the word if it is initially shorter than WORD_SIZE. For example, we will convert the word "university" (word length 10) into "university___" (word length 13). In your code include this conversion as well.

Store the processed strings of length WORD_SIZE in a list called vocab.

In [43]:
#TODO
def load_vocab(file_path):
    vocab = []
    
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            word = line.strip().lower()
            
            # Filter out words shorter than or equal to 3 or longer than WORD_SIZE
            if 3 < len(word) <= WORD_SIZE:
                padded_word = word.ljust(WORD_SIZE, '_')
                vocab.append(padded_word)
    
    return vocab

vocab = load_vocab('wordlist.txt')
#print(f"Vocabulary size: {len(vocab)}")
#print(f"First 10 words: {vocab[:10]}")
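A quick way to sanity-check the filtering and padding rules without reading the file is to apply them to a small in-memory list. The sketch below assumes a hypothetical helper `filter_and_pad` (not part of the notebook) that mirrors the logic of `load_vocab`.

```python
WORD_SIZE = 13  # same constant as in the notebook

def filter_and_pad(words, word_size=WORD_SIZE):
    """Hypothetical helper: keep words with 3 < len <= word_size, right-pad with '_'."""
    padded = []
    for raw in words:
        word = raw.strip().lower()
        if 3 < len(word) <= word_size:
            padded.append(word.ljust(word_size, "_"))
    return padded

# "The" and "cat" are too short, "internationalization" is too long.
sample_words = ["The", "University", "cat", "united", "internationalization"]
sample_vocab = filter_and_pad(sample_words)
print(sample_vocab)  # ['university___', 'united_______']
```

Every surviving string has exactly WORD_SIZE characters, which is the property the training loop relies on.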


<font color='blue'>In the above explanation it was mentioned "for practical reasons of training the model we will only use words that are longer than $3$ letters and that have a certain maximum length". In your opinion, what could those practical reasons be? Will it help to build a better model?</font>

Very short words, such as prepositions, articles, and conjunctions, often carry limited semantic information and can introduce noise rather than useful patterns. On the other hand, excessively long words are typically rare, which makes them less useful for learning generalized representations and can unnecessarily increase the model's complexity and memory usage. By discarding these extremes, we simplify the dataset, reduce the vocabulary size, and help the model focus on learning from more informative and representative word structures. This leads to more efficient training and potentially better generalization.
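One practical cost of a large maximum length can be made concrete. Because every word is padded to WORD_SIZE, each word costs WORD_SIZE - 1 prediction steps, and a growing share of the targets are just the padding symbol. The sketch below uses a hypothetical helper `padding_fraction` and a made-up three-word list to illustrate the trend; it is not a measurement on wordlist.txt.

```python
def padding_fraction(words, word_size):
    """Fraction of next-character targets that are the '_' padding symbol."""
    kept = [w for w in words if 3 < len(w) <= word_size]
    # Each padded word yields word_size - 1 (input, target) pairs.
    total_targets = len(kept) * (word_size - 1)
    # The target is the padded word without its first character,
    # so a word of length n contributes word_size - n padding targets.
    pad_targets = sum(word_size - len(w) for w in kept)
    return pad_targets / total_targets if total_targets else 0.0

toy_words = ["math", "united", "university"]  # illustrative only
for size in (10, 13, 20):
    print(size, round(padding_fraction(toy_words, size), 3))
```

On this toy list the padding share rises as the maximum length grows, so more of the computation and of the loss is spent on predicting underscores.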

**Q2.** To input words into the model, we will need to convert each letter/character into a number. As we have seen above, the only characters in our list vocab will be the underscore and lowercase English letters. So we will convert these $27$ characters into numbers as follows: underscore -> $0$, 'a' -> $1$, 'b' -> $2$, $\cdots$, 'z' -> $26$. In the following cell,

(i) Implement a method called char_to_num, that takes in a valid character and outputs its numerical assignment.

(ii) Implement a method called num_to_char, that takes in a valid number from $0$ to $26$ and outputs the corresponding character.

(iii) Implement a method called word_to_numlist, that takes in a word from our vocabulary and outputs a (torch) tensor of numbers that corresponds to each character in the word in that order. For example: the word "united_______" will be converted to tensor([21, 14,  9, 20,  5,  4,  0,  0,  0,  0,  0,  0,  0]). You are encouraged to use your char_to_num method for this.

(iv) Implement a method called numlist_to_word, that does the opposite of the above described word_to_numlist: given a tensor of numbers from $0$ to $26$, it outputs the corresponding word. You are encouraged to use your num_to_char method for this.

Note: As mentioned, since we are using the torch library we will be using tensors instead of the usual Python lists or numpy arrays. Tensors are the list equivalent in torch. Torch models only accept tensors as input and they output tensors.

In [44]:
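def char_to_num(char):
    # Map '_' -> 0 and 'a'..'z' -> 1..26; reject anything else
    # (including non-ASCII lowercase letters, which would exceed 26).
    if char == '_':
        num = 0
    elif len(char) == 1 and 'a' <= char <= 'z':
        num = ord(char) - ord('a') + 1
    else:
        raise ValueError(f"Invalid character: {char!r}")

    return num

def num_to_char(num):
    # Inverse of char_to_num: 0 -> '_' and 1..26 -> 'a'..'z'.
    if num == 0:
        char = '_'
    elif 0 < num <= 26:
        char = chr(num + ord('a') - 1)
    else:
        raise ValueError(f"Number out of range: {num}")

    return char

def word_to_numlist(word):
    # Encode each character of the word as one entry of an integer tensor.
    numlist = torch.tensor([char_to_num(i) for i in word])

    return numlist

def numlist_to_word(numlist):
    # Decode a tensor of character indices back into a string.
    word = ''.join(num_to_char(i.item()) for i in numlist)

    return word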
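A round-trip check is a simple way to confirm that encoding and decoding are inverses of each other. The sketch below is self-contained, so it redefines compact stand-ins for `char_to_num` and `num_to_char` (without the input validation) and reuses the "united_______" example from the question.

```python
import torch

def char_to_num(char):
    # Compact stand-in: '_' -> 0, 'a'..'z' -> 1..26 (no validation here).
    return 0 if char == "_" else ord(char) - ord("a") + 1

def num_to_char(num):
    # Compact stand-in for the inverse mapping.
    return "_" if num == 0 else chr(num + ord("a") - 1)

word = "united_______"
numlist = torch.tensor([char_to_num(c) for c in word])
restored = "".join(num_to_char(n.item()) for n in numlist)

print(numlist)           # tensor([21, 14,  9, 20,  5,  4,  0,  0,  0,  0,  0,  0,  0])
print(numlist.dtype)     # torch.int64
print(restored == word)  # True
```

The integer dtype matters later: `nn.Embedding` expects integer indices, so the tensor can be fed to the embedding layer without conversion.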

<font color='blue'>We convert letters into numbers based on their alphabetical order. I claim that this is a very bad way to encode data such as letters to be fed into learning models; please write your explanation for or against my claim. If you are searching for reasons, the keyword 'categorical data' may be useful. Although the letters in our case are not treated as categorical data, the same reasons as for categorical data are applicable. Even if my claim is valid, in the end it won't matter due to something called "embedding layers" that we will use in our model. What is an embedding layer? What is its purpose? Explain.</font>

**Answer** 

Encoding letters as numbers based on their alphabetical order is a poor approach for feeding data into learning models. This is because such numerical representations impose an artificial ordinal relationship between characters that doesn't actually exist. For example, this encoding suggests that 'b' is closer to 'a' than to 'z', which may mislead the model into learning patterns based on this false ordering.

This is similar to problems faced when using numerical labels for categorical data. In such cases, treating categories as numbers introduces arbitrary relationships that can hurt model performance. Letters, like categories, are discrete and non-numeric by nature, so treating them as ordinal values is inappropriate.

However, this issue is mitigated by the use of embedding layers in neural networks. An embedding layer maps each input token (in our case, a letter index) to a learned dense vector of fixed size. These vectors capture meaningful relationships between inputs through training and allow the model to learn distributed representations, rather than relying on raw numeric indices.

The purpose of an embedding layer is to transform sparse or index-based inputs into a continuous vector space where similar inputs (e.g., characters that often appear in similar contexts) can have similar representations. This makes the model more powerful and capable of generalizing better from the training data.
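The contrast between integer codes, one-hot vectors, and embeddings can be checked numerically. The sketch below, with an illustrative embedding size of 4 chosen for this example, shows that integer codes put 'a' much closer to 'b' than to 'z', that one-hot vectors keep all letters equally far apart, and that an `nn.Embedding` lookup gives the same result as multiplying a one-hot vector by the layer's weight matrix.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
ALPHABET_SIZE = 27
EMBED_DIM = 4  # illustrative size, not the one used for the trained model

indices = torch.tensor([1, 2, 26])  # 'a', 'b', 'z'

# Integer codes impose an ordering: |a - b| = 1 but |a - z| = 25.
gap_ab = abs(indices[0] - indices[1]).item()
gap_az = abs(indices[0] - indices[2]).item()

# One-hot vectors remove that ordering: every pair is equally distant.
one_hot = F.one_hot(indices, num_classes=ALPHABET_SIZE).float()
dist_ab = torch.dist(one_hot[0], one_hot[1]).item()
dist_az = torch.dist(one_hot[0], one_hot[2]).item()

# An embedding lookup selects rows of a trainable weight matrix,
# which is the same as one_hot @ weight but without the sparse product.
embed = nn.Embedding(ALPHABET_SIZE, EMBED_DIM)
looked_up = embed(indices)
via_matmul = one_hot @ embed.weight

print(gap_ab, gap_az)                          # 1 25
print(dist_ab == dist_az)                      # True
print(looked_up.shape)                         # torch.Size([3, 4])
print(torch.allclose(looked_up, via_matmul))   # True
```

Since the index is only used to select a row, the numeric value of the code never enters the computation as a magnitude, which is why the ordering concern does not carry over to the model.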

## Section 2: Implementing the Autocomplete model

We will implement an LSTM-based RNN model. The [video tutorial](https://www.youtube.com/watch?v=tL5puCeDr-o) will be useful. Our model will have only one hidden layer, but feel free to extend it with more layers after the project for your own experiments.

Our model will contain all the training and prediction methods as a single package in a class (autocompleteModel), which we will define and implement below.

In [45]:
LEARNING_RATE = 0.005

In [46]:
class autocompleteModel(nn.Module):

    #Constructor
    def __init__(self, alphabet_size, embed_dim, hidden_size, num_layers):
        super().__init__()

        #Set the input parameters to self parameters
        self.alphabet_size = alphabet_size
        self.embed_dim = embed_dim
        self.hidden_size = hidden_size
        self.num_layers = num_layers

        #Initialize the layers in the model:
        #1 embedding layer, 1 - LSTM cell (hidden layer), 1 fully connected layer with linear activation
        self.embed_layer = nn.Embedding(num_embeddings=self.alphabet_size, embedding_dim=self.embed_dim)
        self.LSTM_cell = nn.LSTMCell(input_size=self.embed_dim, hidden_size=self.hidden_size)
        self.fc = nn.Linear(in_features=self.hidden_size, out_features=self.alphabet_size)

    #Feedforward
    def forward(self, character, hidden_state, cell_state):
        #Perform feedforward in order
        #1. Embed the input (one character represented by a number)
        #2. Feed the embedded output to the LSTM cell
        #3. Feed the LSTM output to the fully connected layer to obtain the output
        #4. return the output, and both the hidden state and cell state from the LSTM cell output
        embedding = self.embed_layer(character)
        hidden_state, cell_state = self.LSTM_cell(embedding, (hidden_state, cell_state))
        output = self.fc(hidden_state)
        
        return output, hidden_state, cell_state

    #Initialize the first hidden state and cell state (for the start of a word) as zero tensors of required length.
    def initial_state(self):
        h0 = torch.zeros(1, self.hidden_size)
        c0 = torch.zeros(1, self.hidden_size)
        
        return (h0, c0)

    #Train the model in epochs given the vocab, the training will be fed in batches of batch_size
    def trainModel(self, vocab, epochs = 5, batch_size = 100):
        #Convert the model into train mode
        self.train()

        #Set the optimizer (ADAM), you may need to provide the model parameters  and learning rate
        optimizer = optim.Adam(self.parameters(), LEARNING_RATE)

        #Keep a log of the loss at the end of each training cycle.
        loss_log = []

        for e in range(epochs):
            random.shuffle(vocab)
            num_iter = len(vocab) // batch_size

            for i in range(num_iter):
                # Set the loss to zero, initialize the optimizer with zero_grad at the beginning of each training cycle.
                batch_loss = 0
                optimizer.zero_grad()

                # extract the batch
                vocab_batch = vocab[i * batch_size:(i + 1) * batch_size]

                for word in vocab_batch:
                    # Initialize the hidden state and cell state at the start of each word.
                    hidden_state, cell_state = self.initial_state()

                    # Convert the word into a tensor of number and create input and target from the word
                    #Input will be the first WORD_SIZE - 1 characters and target is the last WORD_SIZE - 1 characters
                    input_tensor = word_to_numlist(word[:WORD_SIZE-1])
                    target_tensor = word_to_numlist(word[1:WORD_SIZE])

                    #Loop through each character (as a number) in the word
                    for c in range(WORD_SIZE - 1):
                        # Feed the cth character to the model (feedforward) and compute the loss (use cross entropy in torch)
                        output, hidden_state, cell_state = self.forward(input_tensor[c].unsqueeze(0), hidden_state, cell_state)
                        loss = nn.functional.cross_entropy(output, target_tensor[c].view(1))
                        batch_loss += loss
                    
                # Compute the average loss per word in the batch and perform backpropagation (.backward())
                batch_loss /= len(vocab_batch)
                batch_loss.backward()
                    
                #Update model parameters using the optimizer
                optimizer.step()

                #Update the loss_log 
                loss_log.append(batch_loss.item())

            print("Epoch: ", e)

        # Plot a graph of the variation of the loss.
        plt.figure(figsize=(6,4))
        plt.plot(loss_log)
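        # loss_log holds one value per batch, so the x-axis counts training iterations, not epochs
        plt.xlabel('Training iteration (batch)')
        plt.ylabel('Loss')
        plt.title('Training Loss vs Iterations')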
        plt.show()

    #Perform autocomplete given a sample of strings (typically 3-5 starting letters)
    def autocomplete(self, sample):
        #Convert the model into evaluation mode
        self.eval()
        completed_list = []

        # In the following loop for each sample item initialize hidden and cell states, then predict the remaining characters
        #You will have to convert the output into a softmax (you may use your softmax method from the last project) probability distribution, then use torch.multinomial 
        for literal in sample:
            # Initialize the hidden state and cell state at the start of each word.
            hidden_state, cell_state = self.initial_state()

            # Convert the word into a tensor of number
            input_tensor = word_to_numlist(literal)
            predicted = literal

            # Warm up the hidden and cell states on all given characters except the last one
            for p in range(len(literal) - 1):
                init_input = input_tensor[p].unsqueeze(0)
                _, hidden_state, cell_state = self.forward(init_input, hidden_state, cell_state)
            
            init_input = input_tensor[-1].unsqueeze(0)
            
            # generating sequence
            for g in range(WORD_SIZE - len(literal)):
                # generate
                output, hidden_state, cell_state = self.forward(init_input, hidden_state, cell_state)
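                output_prob = nn.functional.softmax(output, dim = 1)
                # Sample the next character index from the distribution (a random draw, not an argmax)
                top_1 = torch.multinomial(output_prob, 1)[0]
                
                # prediction: convert the one-element tensor to a plain int before decoding
                pred_char = num_to_char(top_1.item())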
                predicted += pred_char
                init_input = torch.tensor(char_to_num(pred_char)).unsqueeze(0)

            completed_list.append(predicted)
            
        return completed_list
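To make the tensor shapes in `forward` and the sampling step in `autocomplete` easier to follow, here is a minimal sketch of a single time step outside the class. The layer sizes are small illustrative values chosen for this example, and the layers are freshly initialised (untrained), so the sampled character is meaningless; only the shapes and the flow of data matter.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
ALPHABET_SIZE, EMBED_DIM, HIDDEN_SIZE = 27, 8, 16  # illustrative sizes

embed = nn.Embedding(ALPHABET_SIZE, EMBED_DIM)
cell = nn.LSTMCell(EMBED_DIM, HIDDEN_SIZE)
fc = nn.Linear(HIDDEN_SIZE, ALPHABET_SIZE)

# Zero initial states for a batch of one word.
h = torch.zeros(1, HIDDEN_SIZE)
c = torch.zeros(1, HIDDEN_SIZE)
char = torch.tensor([21])  # index of 'u'; shape (1,) acts as a batch of one

with torch.no_grad():  # no gradients are needed for prediction
    x = embed(char)                        # (1, EMBED_DIM)
    h, c = cell(x, (h, c))                 # each (1, HIDDEN_SIZE)
    logits = fc(h)                         # (1, ALPHABET_SIZE), unnormalised scores
    probs = torch.softmax(logits, dim=1)   # (1, ALPHABET_SIZE), sums to 1
    sampled = torch.multinomial(probs, num_samples=1)  # (1, 1), random draw
    greedy = probs.argmax(dim=1)           # (1,), most likely index

print(x.shape, h.shape, logits.shape, sampled.shape, greedy.shape)
```

Sampling with `torch.multinomial` can return a different completion on each call, while `argmax` would always return the same one; the notebook's `autocomplete` uses the sampling variant.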

## Section 3: Using and evaluating the model

(i) Initialize and train autocompleteModels using different embedding dimensions and hidden layer sizes. Use different learning rates, epochs, batch sizes. Train the best model you can.

(ii) Evaluate it on different samples of partially filled in words to test your model. Eg: ["univ", "math", "neur", "engin"] etc.

(iii) Assign your best model to the variable best_model. This model will be tested against random inputs (the first 3-4 letters of common English words). **This will be the main contributor for your score in this project**.
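One way to organise the experiments in (i) is to enumerate the combinations up front. The sketch below only builds the list of configurations; the candidate values are hypothetical placeholders, not recommended settings.

```python
from itertools import product

# Hypothetical candidate values for illustration only.
embed_dims = [32, 64]
hidden_sizes = [128, 256]
learning_rates = [0.005, 0.001]

# One dictionary per combination of hyperparameters.
configs = [
    {"embed_dim": e, "hidden_size": h, "learning_rate": lr}
    for e, h, lr in product(embed_dims, hidden_sizes, learning_rates)
]

print(len(configs))  # 8
for cfg in configs[:2]:
    print(cfg)
```

Note that `trainModel` reads the module-level `LEARNING_RATE` when it creates the optimizer, so the learning rate from each configuration would have to be assigned to that variable before training the corresponding model.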

In [47]:
best_model = None


In [48]:
import urllib.request

url = "https://raw.githubusercontent.com/isharadilshanra/Deeplearning/main/RNN/best_model_weights.pth"
save_path = "best_model_weights.pth"

urllib.request.urlretrieve(url, save_path)

('best_model_weights.pth', <http.client.HTTPMessage at 0x290d4ade5d0>)

In [49]:
alphabet_size = 27
embed_dim = 64
hidden_size = 256
num_layers = 1

model = autocompleteModel(alphabet_size, embed_dim, hidden_size, num_layers)
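# weights_only=True restricts loading to tensors and plain containers (supported in recent PyTorch releases)
model.load_state_dict(torch.load("best_model_weights.pth", weights_only=True))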
model.eval()


autocompleteModel(
  (embed_layer): Embedding(27, 64)
  (LSTM_cell): LSTMCell(64, 256)
  (fc): Linear(in_features=256, out_features=27, bias=True)
)

In [50]:

def evaluate_autocomplete_model(model, word_segments):
    predictions = model.autocomplete(word_segments)
    
    for segment, prediction in zip(word_segments, predictions):
        processed_prediction = prediction.replace("_", "")
        print(f'word segment : {segment} || predicted word : {processed_prediction}')
    
    print(predictions)

    # Save the model as best model (can later be modified to track accuracy too)
    best_model = model
    
    return best_model

In [51]:

word_seg = ["comp", "phys", "stat", "bio", "therm", "mech", "electr", "astro", "chem"]

best_model = evaluate_autocomplete_model(model, word_seg)


word segment : comp || predicted word : comprece
word segment : phys || predicted word : physic
word segment : stat || predicted word : statul
word segment : bio || predicted word : bios
word segment : therm || predicted word : thermal
word segment : mech || predicted word : mechanism
word segment : electr || predicted word : electricity
word segment : astro || predicted word : astronomy
word segment : chem || predicted word : chem
['comprece_____', 'physic_______', 'statul_______', 'bios_________', 'thermal______', 'mechanism____', 'electricity__', 'astronomy____', 'chem_________']
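Since completions are sampled, a simple quantitative check is the share of predictions that are real vocabulary entries. The sketch below assumes a hypothetical helper `in_vocab_rate` and a tiny stand-in vocabulary; with the real `vocab` list from Section 1 the numbers would differ.

```python
def in_vocab_rate(predictions, vocab):
    """Return (fraction of predictions found in vocab, list of those predictions)."""
    vocab_set = set(vocab)  # set membership is O(1) per lookup
    hits = [p for p in predictions if p in vocab_set]
    rate = len(hits) / len(predictions) if predictions else 0.0
    return rate, hits

# A few padded predictions copied from the output above.
predictions = ["comprece_____", "physic_______", "thermal______", "mechanism____"]
# Stand-in vocabulary for illustration; it is NOT the content of wordlist.txt.
toy_vocab = ["thermal______", "mechanism____", "university___"]

rate, hits = in_vocab_rate(predictions, toy_vocab)
print(rate)  # 0.5
print(hits)  # ['thermal______', 'mechanism____']
```

Because the model output is random, averaging this rate over several calls to `autocomplete` gives a steadier picture than a single run.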
