# PALS0039 Introduction to Deep Learning for Speech and Language Processing  

### **Assessment Coursework: Autocompletion Task**

#### Student ID: NVMH6

#### Word Count:

# Introduction

As highlighted in the task description, humans can complete words, sentences, and sounds even when parts are missing or masked by noise. Text-editing software emulates this by providing text suggestions. This coursework involves building such an autocompletion system, a technology that enhances efficiency and is prevalent in everyday text-based communication like texting and document writing.

# Data Wrangling & Environment Setup

The data pre-processing section is split into the following code blocks:

1. Load relevant libraries and functions
2. Create a function to load the text files from the URLs available on Moodle
3. Create a TextCharacterDataset class that inherits from PyTorch's Dataset class
4. Load and inspect the data

In [2]:
import torch
from torch.utils.data import Dataset
import requests



## Load URL function

This is a basic helper function that downloads the text from a URL and reports any download error instead of crashing. I have decided to keep it separate from the Dataset class so that the class is not overloaded with another function.

In [4]:
# Define function that will load the text file from the provided url
def load_text_from_url(url):
    """
    Loads text content from a given URL.

    Args:
        url: The URL of the text file.

    Returns:
        The text content as a string, or None if the download fails.
    """
    try:
        response = requests.get(url, timeout=30)  # Time out rather than hang on a stalled connection
        response.raise_for_status()  # Raise an exception for bad status codes
        return response.text
    except requests.exceptions.RequestException as e:
        print(f"Error downloading text from {url}: {e}")
        return None



## Defining the TextCharacterDataset class

I have decided to implement a subclass of PyTorch's Dataset class, which keeps the pre-processing needed to train my model efficient and organised. To train an RNN, I must turn all the available text files into contexts and target tokens to predict. I will be training a character-based model, meaning a model that predicts the next three characters from the characters in the preceding context window.

To do this, I select a fixed context window. Keeping the context length consistent across my training data means that examples can be stacked into batches of the same shape, which makes training much faster. For this reason, I truncate each example to a set length and add a special \<PAD> token to the beginning of the context if it is shorter than the stated context length.

Because the text is read one character at a time, I can build both my char2index dictionary and my characters list (the text encoded as indices) while loading each file.
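The windowing described above can be sketched on a toy string. This is a minimal illustration rather than the class used in this notebook: the helper names `build_vocab` and `make_example` are hypothetical, and the context size of 5 is chosen only to keep the output readable.

```python
def build_vocab(text, pad_token="<PAD>"):
    """Map each distinct character to an integer, reserving 0 for padding."""
    char2index = {pad_token: 0}
    for char in text:
        if char not in char2index:
            char2index[char] = len(char2index)
    return char2index


def make_example(encoded, start, context_size, n_targets=3, pad_index=0):
    """Return (context, targets) for the window that begins at `start`."""
    end = start + context_size
    context = encoded[start:end]
    targets = encoded[end:end + n_targets]
    # Left-pad so that every context has exactly `context_size` entries
    context = [pad_index] * (context_size - len(context)) + context
    return context, targets


text = "hello world"
char2index = build_vocab(text)
encoded = [char2index[c] for c in text]

# First training example: 5 context characters, then the 3 characters to predict
context, targets = make_example(encoded, start=0, context_size=5)

# Number of complete windows: length - context_size - 3 + 1
n_examples = len(encoded) - 5 - 3 + 1

# A context shorter than the window (here "he") is padded on the left
padded_context, _ = make_example(encoded[:2], start=0, context_size=5)

print(context, targets, n_examples, padded_context)
```

Here "hello" is the context and " wo" is the target, and an 11-character string yields 4 complete examples, which is the same arithmetic used in `__len__`.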

In [8]:
# Define a class that can be initiated with relevant text files
class TextCharacterDataset(Dataset):

    def __init__(self, text_urls, context_size):
        # store the URLs of the text files
        self.text_urls = text_urls
        # set context size
        self.context_size = context_size
        # call load_data to populate self.char2index and self.characters
        self.load_data()

    def load_data(self):
        """
        Loads and processes text from the specified URLs.
        """
        self.char2index = {"<PAD>": 0}
        self.characters = []
        next_index = 1
        # Iterate through each url
        for url in self.text_urls:
            text = load_text_from_url(url)  # Get text from URL
            if text:  # Only process if text was loaded successfully
                for char in text:
                    if char not in self.char2index:
                        self.char2index[char] = next_index
                        next_index += 1
                # Extend the characters list
                self.characters.extend([self.char2index[char] for char in text])
            else:
                print(f"Skipping URL: {url} due to download error.")

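    def __len__(self):
        # Number of complete windows (context plus 3 targets); never negative, even if no text was loaded
        return max(0, len(self.characters) - self.context_size - 3 + 1)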

    def __getitem__(self, index):
        start_idx = max(0, index)
        end_idx = start_idx + self.context_size
        context = self.characters[start_idx:end_idx]

        target_start_idx = end_idx
        target_end_idx = target_start_idx + 3
        targets = self.characters[target_start_idx:target_end_idx]

        # Pad context if necessary
        if len(context) < self.context_size:
            context = [self.char2index["<PAD>"]] * (self.context_size - len(context)) + context

        # Pad targets if necessary
        if len(targets) < 3:
            targets = targets + [self.char2index["<PAD>"]] * (3 - len(targets))

        input_ids = torch.LongTensor(context)
        target_ids = torch.LongTensor(targets)

        return input_ids, target_ids

In [9]:
# Create a list of the 7 books available on Project Gutenberg
text_urls = [
    "https://www.gutenberg.org/cache/epub/345/pg345.txt",
    "https://www.gutenberg.org/cache/epub/84/pg84.txt",
    "https://www.gutenberg.org/cache/epub/74/pg74.txt",
    "https://www.gutenberg.org/cache/epub/1342/pg1342.txt",
    "https://www.gutenberg.org/cache/epub/1727/pg1727.txt",
    "https://www.gutenberg.org/cache/epub/2701/pg2701.txt",
    "https://www.gutenberg.org/cache/epub/3207/pg3207.txt"
]

# Set a context size window
context_size = 50
# Creating an instance of Dataset class
dataset = TextCharacterDataset(text_urls, context_size)

# Checking length of Dataset
print("Dataset size:", len(dataset))

# Inspecting the context and targets of example 10
input_ids, target_ids = dataset[10]
print("Input:", input_ids)
print("Target:", target_ids)

Dataset size: 5737424
Input: tensor([10, 11,  5, 12, 13, 11,  4, 14, 15,  4,  7, 16,  5,  4, 17,  8,  8, 18,
         5,  8, 19,  5, 20,  7, 21, 10, 13, 22, 21, 23, 24,  5,  5,  5,  5, 23,
        24,  2,  3, 25, 26,  5,  4, 15,  8,  8, 18,  5, 25, 26])
Target: tensor([ 5, 19,  8])
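The tensors above are hard to read as raw indices. One way to inspect them, sketched here under the assumption that the vocabulary is a plain dictionary like char2index, is to invert the mapping and join the characters back into a string. The names `invert_vocab` and `decode` are hypothetical helpers and the toy vocabulary stands in for the real one.

```python
def invert_vocab(char2index):
    """Build the reverse mapping from index to character."""
    return {index: char for char, index in char2index.items()}


def decode(indices, index2char, pad_token="<PAD>"):
    """Turn a sequence of indices back into text, dropping padding."""
    chars = (index2char[i] for i in indices)
    return "".join(c for c in chars if c != pad_token)


# Toy vocabulary standing in for dataset.char2index (assumption)
toy_char2index = {"<PAD>": 0, "c": 1, "a": 2, "t": 3}
toy_index2char = invert_vocab(toy_char2index)

decoded = decode([0, 0, 1, 2, 3], toy_index2char)
print(decoded)
```

With the real dataset, the same helpers could be applied to `input_ids.tolist()` and `target_ids.tolist()` after inverting `dataset.char2index`.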


# Model Implementation

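I have selected an LSTM model from the RNN family because its gating mechanism lets it retain context over longer spans than a plain RNN, which helps it capture long-term dependencies. I believe that this will be highly beneficial to the autocompletion task, where the next characters often depend on words that appeared much earlier in the context.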

This section is broken down into the following code blocks:

1. Defining the LSTM network class
2. Training the model
3. Model evaluation
4. Creating a user-friendly interface for the model
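As a rough illustration of the first step, the sketch below shows one possible shape for a character-level LSTM that reads a context of indices and outputs scores for the next three characters. It is an assumption-laden example, not the network used in this coursework: the class name `CharLSTM`, the layer sizes, and the choice of a single linear head producing all three predictions at once are all hypothetical.

```python
import torch
import torch.nn as nn


class CharLSTM(nn.Module):
    """Hypothetical character-level LSTM predicting the next n_targets characters."""

    def __init__(self, vocab_size, embed_dim=32, hidden_dim=64, n_targets=3, pad_index=0):
        super().__init__()
        self.vocab_size = vocab_size
        self.n_targets = n_targets
        # padding_idx keeps the <PAD> embedding fixed at zero
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_index)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # One linear layer emits scores for all target positions at once
        self.head = nn.Linear(hidden_dim, n_targets * vocab_size)

    def forward(self, input_ids):
        embedded = self.embedding(input_ids)      # (batch, context, embed_dim)
        _, (hidden, _) = self.lstm(embedded)      # hidden: (1, batch, hidden_dim)
        logits = self.head(hidden[-1])            # (batch, n_targets * vocab_size)
        return logits.view(-1, self.n_targets, self.vocab_size)


# Dummy batch shaped like the dataset output: 4 contexts of 50 indices, 3 targets each
vocab_size = 30  # placeholder; in practice this would be len(dataset.char2index)
model = CharLSTM(vocab_size)
dummy_inputs = torch.randint(0, vocab_size, (4, 50))
dummy_targets = torch.randint(1, vocab_size, (4, 3))

logits = model(dummy_inputs)
# Flatten the target positions so cross-entropy sees one prediction per row
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), dummy_targets.reshape(-1)
)
print(logits.shape, loss.item())
```

Predicting all three characters from the final hidden state keeps the sketch short; an alternative design would predict one character at a time and feed each prediction back into the model.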

# Limitations & Conclusion