<h1>Overview</h1>

This document explains what a language model is and how we can build a simple one. Since the release of <b>ChatGPT</b>, you have likely heard or read a lot about language models, especially Large Language Models (LLMs for short).
<br/>
<b>ChatGPT</b>, which is built on top of LLMs such as GPT-4, has become a notable example in the field. In this notebook, I will show you how to construct a language model from scratch. As we proceed step by step, I'll also discuss some of the considerations and challenges involved in building and training such models, and how experts in the field have addressed them.

<h3>What is a language model?</h3>

A language model is a probability distribution over sequences of tokens drawn from a specific vocabulary. For example, if we take the vocabulary to be the set of English words, then a sequence of tokens could be the sentence 'The sky is blue.'
<br/> <br/>
<b>What does it mean when we say, 'A language model is a probability distribution,' and how can we utilize this concept?</b>
<br/> <br/>
Mathematically speaking, given a vocabulary $V$, a language model assigns a probability $p(x_1,x_2,...,x_m)$ to every sequence of tokens $x_1,...,x_m$, where each token $x_i \in V$. In other words, $p(x_1,x_2,...,x_m)$ tells us how likely a sequence of tokens is to be observed. Of course, we expect this probability distribution to assign a high probability to well-formed sequences and a low probability to incorrect or meaningless ones. For example, we expect $p(.)$ to give a higher probability to 'The sky is blue' than to 'A sky was the blue'.
<br/><br/>
Now let's delve into some mathematics to see what we can derive from $p(.)$.
<br/>
Using the chain rule of probability, we can rewrite $p(x_1,x_2,...,x_m)$ as follows:
$$p(x_{1:m})=p(x_1)\,p(x_2 \mid x_1)\,p(x_3 \mid x_1,x_2)\cdots p(x_m \mid x_{1:m-1})=\prod_{i=1}^m p(x_i \mid x_{1:i-1}).$$
Here, each term in the product is the conditional probability of the current token given all previous tokens, and $x_{1:i-1}$ is shorthand for $x_1,...,x_{i-1}$ (an empty context when $i=1$).
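<br/><br/>
To make the chain rule concrete, here is a minimal sketch that multiplies the conditional probabilities of a sequence one token at a time. The table <code>COND_PROBS</code> and the function <code>sequence_probability</code> are hypothetical names introduced only for this illustration, and the probability values are made up rather than estimated from data.

```python
import math

# Hypothetical toy conditionals p(x_i | x_{1:i-1}), keyed by the context
# (the tuple of previous tokens). The numbers are made up for illustration.
COND_PROBS = {
    (): {"The": 0.6, "A": 0.4},
    ("The",): {"sky": 0.5, "sea": 0.5},
    ("The", "sky"): {"is": 0.8, "was": 0.2},
    ("The", "sky", "is"): {"blue": 0.9, "green": 0.1},
}

def sequence_probability(tokens):
    """Apply the chain rule: multiply p(x_i | x_{1:i-1}) over all positions."""
    prob = 1.0
    for i, token in enumerate(tokens):
        context = tuple(tokens[:i])  # x_{1:i-1}; empty for the first token
        # Contexts or tokens missing from the toy table get probability 0.
        prob *= COND_PROBS.get(context, {}).get(token, 0.0)
    return prob

p_blue = sequence_probability(["The", "sky", "is", "blue"])
p_green = sequence_probability(["The", "sky", "is", "green"])
print(p_blue, p_green)
```

With this toy table, 'The sky is blue' receives a higher probability than 'The sky is green' only because the last conditional term is larger; a real model would have to supply these conditionals for every possible context.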
<br/>
Knowing the conditional probability $p(x_i \mid x_{1:i-1})$ means that, given $x_{1:i-1}$, we can sample the next token $x_i$ from the vocabulary, then sample another one, and so on. Sampling tokens one after another means that we are <b><i>generating</i></b> a sequence of tokens, or in other words, we are <b><i>generating</i></b> text.
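<br/><br/>
The sampling loop described above can be sketched as follows. This is only an illustration under stated assumptions: the table <code>NEXT_TOKEN_PROBS</code> and the function <code>generate</code> are hypothetical, the probabilities are made up, and generation simply stops when the toy table has no entry for the current context.

```python
import random

# Hypothetical toy conditionals p(x_i | x_{1:i-1}), keyed by the context
# (the tuple of tokens generated so far). The numbers are made up.
NEXT_TOKEN_PROBS = {
    (): {"The": 1.0},
    ("The",): {"sky": 0.7, "sea": 0.3},
    ("The", "sky"): {"is": 1.0},
    ("The", "sea"): {"is": 1.0},
    ("The", "sky", "is"): {"blue": 0.9, "grey": 0.1},
    ("The", "sea", "is"): {"blue": 0.6, "calm": 0.4},
}

def generate(rng, max_tokens=10):
    """Sample tokens one after another, each conditioned on all previous ones."""
    tokens = []
    while len(tokens) < max_tokens:
        dist = NEXT_TOKEN_PROBS.get(tuple(tokens))
        if dist is None:  # the toy table defines no continuation: stop
            break
        candidates = list(dist.keys())
        weights = list(dist.values())
        # Draw the next token x_i according to p(x_i | x_{1:i-1}).
        next_token = rng.choices(candidates, weights=weights, k=1)[0]
        tokens.append(next_token)
    return tokens

sample = generate(random.Random(0))  # seeded generator for repeatability
print(" ".join(sample))
```

Each call draws one token at a time and appends it to the context before drawing the next, which is exactly the one-after-another sampling that turns a conditional distribution into a text generator.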
<br/><br/>
<b>Now, one key question to ask is, 'Do we know this probability distribution?' And if not, 'Can we estimate it?'</b>
<br/><br/>
The answer to the first question is 'No! We do not.' The answer to the latter one is 'Yes! We can.' Here, Deep Neural Networks, especially Transformers, come to the rescue.



<b>Summary</b>
<ul>
  <li>A language model is a probability distribution over sequences of tokens drawn from a specific vocabulary</li>
  <li>Mathematically, this probability distribution factorizes as $p(x_1,x_2,...,x_m) =\prod_{i=1}^m p(x_i \mid x_{1:i-1})$</li>
  <li>We do not know this probability distribution, but we can estimate it by building a special deep neural network</li>
  <li>The model that estimates this distribution can then be used to generate meaningful text</li>
</ul>