<h1>Overview</h1>

This notebook explains what a language model is and how we can build a simple one. Since the release of <b>ChatGPT</b>, you have likely heard or read a lot about language models, especially Large Language Models (LLMs for short).
<br/>
<b>ChatGPT</b>, which is built on top of LLMs like GPT-4, has become a notable example in the field. In this notebook, I will show you how to construct a language model from scratch. As we proceed step by step, I will also discuss some of the considerations and challenges involved in building and training such models, and how experts in the field have addressed them.

<h3>What is a language model?</h3>

A language model is a probability distribution over sequences of tokens drawn from a specific vocabulary. For example, if the vocabulary is the set of English words (plus punctuation), then the sentence 'The sky is blue.' is one such sequence of tokens.
<br/> <br/>
<b>What does it mean when we say, 'A language model is a probability distribution,' and how can we utilize this concept?</b>
<br/> <br/>
Mathematically speaking, given a vocabulary $V$, for every sequence of tokens $x_1, \dots, x_m$, where each token $x_i \in V$, a language model is defined by the probability distribution $p(x_1, x_2, \dots, x_m)$. In other words, $p(x_1, x_2, \dots, x_m)$ tells us how likely a sequence of tokens is to be observed. Of course, we expect this distribution to assign a high probability to correct sequences and a low probability to incorrect or meaningless ones. For example, we expect $p(\cdot)$ to give a higher probability to 'The sky is blue' than to 'A sky was the blue'.
<br/><br/>
Now let's delve into some mathematics to see what we can derive from $p(\cdot)$.
<br/>
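Using the chain rule of probability, and writing $x_{1:m}$ as shorthand for $x_1, x_2, \dots, x_m$, we can rewrite $p(x_1, x_2, \dots, x_m)$ as follows:
$$p(x_{1:m}) = p(x_1)\,p(x_2 \mid x_1)\,p(x_3 \mid x_1, x_2) \cdots p(x_m \mid x_{1:m-1}) = \prod_{i=1}^{m} p(x_i \mid x_{1:i-1}).$$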
Here, each term in the multiplication represents the conditional probability of the current token given the previous tokens. 
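<br/><br/>
To make the factorization concrete, here is a minimal sketch using a made-up toy model. The table <code>COND_PROB</code> and all of its numbers are hypothetical and chosen only for illustration. For simplicity, each conditional looks only at the previous token (a bigram simplification that the chain rule itself does not require), and <code>"&lt;s&gt;"</code> is an assumed start-of-sequence marker.

```python
import math

# Hypothetical conditional probabilities p(next | previous); "<s>" marks the start.
# These numbers are invented for illustration only.
COND_PROB = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"sky": 0.5, "blue": 0.1, "is": 0.4},
    "a": {"sky": 0.3, "blue": 0.2, "is": 0.5},
    "sky": {"is": 0.7, "was": 0.3},
    "is": {"blue": 0.8, "the": 0.2},
    "was": {"the": 0.9, "blue": 0.1},
    "blue": {"sky": 1.0},
}


def sequence_log_prob(tokens):
    """Sum log p(x_i | x_{i-1}) over the sequence (bigram simplification)."""
    log_prob = 0.0
    previous = "<s>"
    for token in tokens:
        # Transitions missing from the table get probability 0 (log-probability -inf).
        p = COND_PROB.get(previous, {}).get(token, 0.0)
        log_prob += math.log(p) if p > 0 else float("-inf")
        previous = token
    return log_prob


good = sequence_log_prob(["the", "sky", "is", "blue"])
bad = sequence_log_prob(["a", "sky", "was", "the", "blue"])
print(math.exp(good), math.exp(bad))
```

Working in log space turns the product into a sum, which avoids numerical underflow when sequences get long. With these toy numbers, 'the sky is blue' scores higher than 'a sky was the blue', which is the behaviour we want from a language model.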
<br/>
Knowing the conditional probability $p(x_i \mid x_{1:i-1})$ means that, given the sequence $x_1, \dots, x_{i-1}$, we can sample the next token $x_i$ from the vocabulary, append it to the sequence, then sample another one, and so on. Sampling tokens one after another means that we are <b><i>generating</i></b> a sequence of tokens or, in other words, <b><i>generating</i></b> text.
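<br/><br/>
The sampling loop described above can be sketched in a few lines. This is only an illustration under stated assumptions: <code>NEXT_TOKEN_PROBS</code> is a hypothetical hand-written table standing in for a real model, it conditions on the previous token only, and <code>"&lt;s&gt;"</code> and <code>"&lt;/s&gt;"</code> are assumed start and end markers.

```python
import random

# Hypothetical next-token distributions p(next | previous), invented for illustration.
NEXT_TOKEN_PROBS = {
    "<s>": {"the": 1.0},
    "the": {"sky": 0.7, "sea": 0.3},
    "sky": {"is": 1.0},
    "sea": {"is": 1.0},
    "is": {"blue": 0.8, "grey": 0.2},
    "blue": {"</s>": 1.0},
    "grey": {"</s>": 1.0},
}


def generate(rng, max_tokens=10):
    """Sample tokens one at a time until the end marker or the length limit."""
    tokens = []
    previous = "<s>"
    for _ in range(max_tokens):
        candidates = NEXT_TOKEN_PROBS[previous]
        # Draw the next token in proportion to its conditional probability.
        token = rng.choices(list(candidates), weights=list(candidates.values()), k=1)[0]
        if token == "</s>":
            break
        tokens.append(token)
        previous = token
    return tokens


rng = random.Random(0)
sample = generate(rng)
print(" ".join(sample))
```

A real language model replaces the lookup table with a neural network that computes the next-token distribution from the whole preceding sequence, but the generation loop has the same shape.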
<br/><br/>
<b>Now, one key question to ask is, 'Do we know this probability distribution?' If not, 'Can we estimate it?'</b>
<br/><br/>
The answer to the first question is 'No! We do not.' The answer to the latter one is 'Yes! We can.' Here, Deep Neural Networks, especially Transformers, come to the rescue.



<b>Summary</b>
<ul>
  <li>A language model is a probability distribution over sequences of tokens drawn from a specific vocabulary.</li>
  <li>Mathematically, this probability distribution is defined by $p(x_1, x_2, \dots, x_m) = \prod_{i=1}^{m} p(x_i \mid x_{1:i-1})$.</li>
  <li>We do not know this probability distribution, but we can estimate it by training a deep neural network.</li>
  <li>The network that estimates the language model can then be used to generate meaningful text.</li>
</ul>

<h3>Analyzing Sequential Data Using Deep Neural Networks</h3>


NLP tasks, such as language modeling and machine translation, require analyzing data sequentially. Traditional feedforward neural networks are not capable of handling dependencies in sequential data, as they lack any memory of previous inputs. This is where RNNs (Recurrent Neural Networks), and later Transformers, come in. Unlike traditional feedforward networks, RNNs maintain a hidden state that acts as a form of memory, enabling them to capture information about previous inputs in the sequence.

This characteristic makes RNNs well-suited for various tasks. For example, they are ideal for text generation and machine translation. Due to their ability to model temporal dependencies, they can be used for forecasting stock prices, weather, and other time-dependent phenomena. The sequential nature of audio signals makes RNNs valuable for speech recognition tasks. Additionally, they can analyze sequences of video frames to recognize actions or gestures over time.
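
The hidden-state idea can be sketched with a single vanilla RNN cell written out by hand. This is a minimal illustration, not the model trained later in this notebook: the sizes, the random weights <code>W_xh</code>, <code>W_hh</code>, <code>b_h</code>, and the random inputs are all assumptions made for the example.

```python
import torch

torch.manual_seed(0)

input_size, hidden_size, seq_len = 4, 3, 5

# Randomly initialised parameters of one vanilla RNN cell (illustrative only).
W_xh = torch.randn(input_size, hidden_size) * 0.1
W_hh = torch.randn(hidden_size, hidden_size) * 0.1
b_h = torch.zeros(hidden_size)

inputs = torch.randn(seq_len, input_size)  # one sequence of 5 input vectors
hidden = torch.zeros(hidden_size)          # the "memory" starts empty

hidden_states = []
for x_t in inputs:
    # The new hidden state depends on the current input AND the previous hidden state.
    hidden = torch.tanh(x_t @ W_xh + hidden @ W_hh + b_h)
    hidden_states.append(hidden)

hidden_states = torch.stack(hidden_states)
print(hidden_states.shape)  # torch.Size([5, 3])
```

Note that the loop cannot be parallelized across time steps, because each step needs the hidden state produced by the previous one.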

However, traditional RNNs suffer from vanishing or exploding gradients, which makes them difficult to train on long sequences. This led to the development of more advanced RNN variants, such as Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs), which are designed to better capture long-range dependencies in sequential data. Even so, LSTMs and GRUs still process tokens one step at a time, which limits how well training can be parallelized and can make extremely complex dependencies hard to model. In contrast, Transformers use a self-attention mechanism that lets them directly model relationships between all parts of the input sequence, regardless of distance. Because all positions can be processed in parallel, training is more efficient, which can give Transformers a significant advantage in tasks that require a nuanced understanding of context and long-range interactions between elements in a sequence.

Transformers are a type of neural network architecture introduced by Vaswani et al. in the paper "<a href="https://arxiv.org/abs/1706.03762">Attention Is All You Need</a>". As opposed to sequential models like RNNs, LSTMs, and GRUs, Transformers are able to process an entire sequence of data simultaneously rather than sequentially.
The core of the Transformer architecture is the attention mechanism, which allows the model to weigh the importance of different parts of the input (e.g. different tokens in a sequence of tokens) when processing each individual element (e.g., a token). This enables the Transformer to capture complex relationships and dependencies between all parts of the input, regardless of their distance from each other in the sequence.
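
As a rough sketch of the attention mechanism, the snippet below computes single-head scaled dot-product self-attention, $\mathrm{softmax}(QK^\top / \sqrt{d})\,V$, for a short sequence. The sizes, the random embeddings, and the random projection matrices <code>W_q</code>, <code>W_k</code>, <code>W_v</code> are assumptions for illustration; a real Transformer learns these projections, uses several heads, and, when used as a language model, masks out future positions.

```python
import math

import torch

torch.manual_seed(0)

seq_len, d_model = 4, 8
x = torch.randn(seq_len, d_model)  # one embedding vector per token

# Illustrative random projection matrices for queries, keys, and values.
W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

q, k, v = x @ W_q, x @ W_k, x @ W_v

# Every token scores every other token in a single matrix product.
scores = q @ k.T / math.sqrt(d_model)
weights = torch.softmax(scores, dim=-1)  # each row sums to 1
attended = weights @ v                   # weighted mix of value vectors per token

print(weights.shape, attended.shape)
```

Row $i$ of <code>weights</code> says how much token $i$ attends to each token in the sequence, so distant tokens are reachable in one step, and the whole computation is a handful of matrix products that run in parallel.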

The original Transformer model consists of an encoder and a decoder, each composed of several layers of self-attention and feed-forward neural networks. The encoder processes the input sequence (e.g., an English sentence), and the decoder generates the output (e.g., the equivalent French translation of the input), with attention mechanisms at each layer allowing the model to focus on different parts of the input as needed.

The parallel processing nature of the Transformer architecture makes it highly efficient for training on modern GPUs, and its flexibility and scalability have led to its adoption in a wide variety of natural language processing tasks, including machine translation, text generation, and language understanding. Transformers have served as the foundation for many state-of-the-art models, such as BERT and the GPT series.

In this post, I will not be delving further into the architecture of Transformers. Interested readers can refer to <a href='http://jalammar.github.io/illustrated-transformer/'>this post</a> by Jay Alammar, which provides an amazing explanation and illustration of Transformers.

<b>Summary</b>
<ul>
  <li>Traditional feedforward neural networks are not capable of handling dependencies in sequential data, as they lack any memory of previous inputs.</li>
  <li>RNNs are able to maintain a hidden state that acts as a form of memory, enabling them to capture information about previous inputs in the sequence.</li>
  <li>Traditional RNNs suffered from problems like vanishing or exploding gradients, making them difficult to train on long sequences. </li>
    <li>LSTMs and GRUs are more advanced types of RNNs which are designed to better capture long-range dependencies in sequential data. </li>
    <li>LSTMs and GRUs still suffer from the inherent sequential nature of their processing, which limits their parallelization efficiency and potentially affects their ability to model extremely complex dependencies. </li>
    <li>Transformers, as opposed to RNNs, LSTMs, and GRUs, are able to process an entire sequence of data in parallel rather than sequentially.</li>
    <li>The core of the Transformer is the attention mechanism, which enables it to capture complex relationships and dependencies between all parts of the input, regardless of their distance from each other in the sequence.</li>
</ul>

<h3>Building and Training a Language Model </h3>


Now that we have learned what a language model is theoretically, it is time to gain some hands-on experience by implementing a real yet simple language model. 

Here is the roadmap:
<ul>
    <li>First, we will build a simple language model using RNNs. Although modern language models usually have Transformers at their core, I believe that exploring both approaches will be a valuable learning experience.
</li>
    <li>Second, I will alter the model to use Transformers instead. 
</li>
    <li>Finally, in a subsequent post, I will further expand the model by introducing the concept of Mixture of Experts and Switch Transformers.  </li>
</ul>


<h3>Dataset and Preprocessing</h3>

In this post, we will use a publicly available dataset to train our small language model. WikiText-103, Penn Treebank (PTB), Text8, and Gutenberg are a few examples of such datasets. Specifically, we will use Penn Treebank (PTB), which is available through the built-in datasets of torchtext, the text-processing companion library of PyTorch.

When dealing with NLP tasks, every input text to the model must be broken down into a sequence of tokens. In other words, we must tokenize the text. SentencePiece and Byte-Pair Encoding (BPE) are more advanced tokenization techniques that work well for various NLP tasks. However, to keep this post focused and concise, I will explain them in more detail in another post. Here, we will use the most basic tokenization method, which simply splits the text on whitespace, word by word.
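
Before turning to the library code, here is a library-free sketch of what word-level tokenization and vocabulary building amount to. The three-sentence corpus, the <code>min_freq</code> threshold of 2, and the <code>&lt;unk&gt;</code> placeholder for rare or unseen words are assumptions made for this example.

```python
from collections import Counter

corpus = [
    "The sky is blue",
    "The sea is blue",
    "The sky is grey",
]

# Word-level tokenization: lowercase, then split on whitespace.
tokenized = [sentence.lower().split() for sentence in corpus]

# Keep only tokens that occur at least min_freq times; the rest map to <unk>.
min_freq = 2
counts = Counter(token for sentence in tokenized for token in sentence)
vocab = ["<unk>"] + sorted(t for t, c in counts.items() if c >= min_freq)
token_to_id = {token: idx for idx, token in enumerate(vocab)}


def encode(tokens):
    """Map tokens to integer ids, falling back to the <unk> id."""
    return [token_to_id.get(token, token_to_id["<unk>"]) for token in tokens]


print(vocab)                                 # ['<unk>', 'blue', 'is', 'sky', 'the']
print(encode("the sky is green".split()))    # [4, 3, 2, 0]
```

The model never sees strings, only these integer ids, which is why every token must either be in the vocabulary or be mapped to a placeholder.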

In [5]:
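import torch
import torch.nn as nn
import torch.optim as optim

# Field, BPTTIterator, and PennTreebank.splits belong to torchtext's legacy API.
# It lived under torchtext.legacy in torchtext 0.9 to 0.11 and was removed in 0.12,
# which is why a recent torchtext raises
# "ModuleNotFoundError: No module named 'torchtext.legacy'".
# This notebook therefore needs an older torchtext release.
try:
    from torchtext.legacy.data import Field, BPTTIterator
    from torchtext.legacy.datasets import PennTreebank
except ImportError:  # torchtext < 0.9 exposes the same classes at the top level
    from torchtext.data import Field, BPTTIterator
    from torchtext.datasets import PennTreebank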

In [None]:
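# Define the tokenizer: split the text on whitespace
tokenizer = lambda x: x.split()

# Define the field that describes how to process the text
TEXT = Field(sequential=True, tokenize=tokenizer, lower=True)

# Load the Penn Treebank dataset (train/validation/test splits)
train_data, valid_data, test_data = PennTreebank.splits(TEXT)

# Build the vocabulary from the training data only;
# words seen fewer than 3 times are mapped to <unk>
TEXT.build_vocab(train_data, min_freq=3)

# Use the GPU when available and fall back to the CPU otherwise
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Define iterators that yield batches shaped (sequence length, batch size)
train_iter, valid_iter, test_iter = BPTTIterator.splits(
    (train_data, valid_data, test_data),
    batch_size=64,
    bptt_len=30,
    device=device,
)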
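
To see what a BPTT-style iterator produces, here is a simplified, library-free sketch of the idea. It is not torchtext's actual implementation: it ignores the batch dimension, and the ten-token "corpus" and <code>bptt_len</code> of 4 are assumptions for the example. The key point is that each target window is the input window shifted one token ahead.

```python
# Stand-in for an encoded corpus: token ids 0, 1, ..., 9.
token_ids = list(range(10))
bptt_len = 4

chunks = []
for start in range(0, len(token_ids) - 1, bptt_len):
    end = min(start + bptt_len, len(token_ids) - 1)
    text = token_ids[start:end]            # what the model reads
    target = token_ids[start + 1:end + 1]  # same window shifted one step ahead
    chunks.append((text, target))

for text, target in chunks:
    print(text, "->", target)
```

At every position the model is asked to predict the token that comes next, which is exactly the conditional $p(x_i \mid x_{1:i-1})$ from the first section.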


<h3>An RNN-Based Language Model</h3>

In [4]:
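class LanguageModel(nn.Module):
    """Embedding -> LSTM -> linear layer that scores every vocabulary token."""

    def __init__(self, vocab_size, embedding_dim, hidden_size, output_size):
        # Note: output_size is currently unused; the final layer always maps to vocab_size.
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # batch_first=True means inputs must be shaped (batch, seq_len).
        # BPTTIterator yields (seq_len, batch) tensors, so transpose them before calling the model.
        self.rnn = nn.LSTM(embedding_dim, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, x):
        embedded = self.embedding(x)    # (batch, seq_len, embedding_dim)
        output, _ = self.rnn(embedded)  # (batch, seq_len, hidden_size)
        # Project the hidden state at every time step, not just the last one
        output = self.fc(output)        # (batch, seq_len, vocab_size)
        return output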
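
As a quick sanity check of the architecture above, the sketch below runs one forward pass on random token ids and computes the next-token cross-entropy loss. To keep the snippet self-contained it defines <code>TinyLanguageModel</code>, a hypothetical compact stand-in with the same three layers; the sizes and the random "data" are assumptions for illustration, and no training happens here.

```python
import torch
import torch.nn as nn


class TinyLanguageModel(nn.Module):
    """Compact stand-in with the same layers as the model above."""

    def __init__(self, vocab_size, embedding_dim, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.LSTM(embedding_dim, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, x):
        output, _ = self.rnn(self.embedding(x))
        return self.fc(output)


torch.manual_seed(0)
vocab_size, batch_size, seq_len = 50, 2, 6
model = TinyLanguageModel(vocab_size, embedding_dim=16, hidden_size=32)

# Random token ids standing in for real text, shaped (batch, seq_len + 1).
tokens = torch.randint(0, vocab_size, (batch_size, seq_len + 1))
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # predict each next token

logits = model(inputs)  # (batch, seq_len, vocab_size)

# Flatten batch and time so each position is one classification example.
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1)
)
print(logits.shape, loss.item())
```

Applying a softmax to the logits at position $i$ gives the model's estimate of $p(x_{i+1} \mid x_{1:i})$, and minimizing this loss over a corpus is how that estimate is fitted. For an untrained model the loss should be in the neighbourhood of $\ln(\text{vocab\_size})$, since its predictions are close to uniform.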