We will be building our own LLM from scratch, because we can! This notebook aims to be a step-by-step guide, but who knows, I might get tired, and then it's just a bunch of random code that nobody understands. Anyway, let's start.

The first thing we build is a tokenizer. Think of tokens as the smallest units our model can interact with. For simplicity, a token can be thought of as a word in a sentence (I know a character is the smallest unit of a sentence, but you get the point).

The simplest tokenizer one can think of is converting each character to its ASCII value. So,

 ```tokenize("ace") = [97,99,101] and similarly deTokenize([97,99,101]) = "ace"```


A tokenizer converts our input (in this case, text) into tokens. So essentially it is a function that takes text and returns an array of integers (each token is mapped to an integer value).
So, let's get into it. I will try to make this guide as extensive as possible.
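The character-level idea above can be sketched in a few lines. This is a minimal illustration, not the tokenizer we build below; the names `tokenize` and `de_tokenize` are made up for this example, and `ord`/`chr` work on Unicode code points, which coincide with ASCII values for plain English text.

```python
def tokenize(text):
    """Map every character to its code point (same as ASCII for plain English text)."""
    return [ord(ch) for ch in text]


def de_tokenize(ids):
    """Map every code point back to its character and glue them together."""
    return "".join(chr(i) for i in ids)


ids = tokenize("ace")
print(ids)               # [97, 99, 101]
print(de_tokenize(ids))  # ace
```

The catch is that every character becomes its own token, so sequences get long and the model has to learn what a word is from scratch.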

Dataset - Take any publicly available book of your liking. For now, I have taken this Shakespeare gist (https://gist.githubusercontent.com/blakesanie/dde3a2b7e698f52f389532b4b52bc254/raw/76fe1b5e9efcf0d2afdfd78b0bfaa737ad0a67d3/shakespeare.txt) as data.

In [1]:
import os

# Ensure the directory exists
os.makedirs("/content/sample_data", exist_ok=True)

file_path = "/content/sample_data/shakespeare.txt"

# Download the file if it doesn't exist
if not os.path.exists(file_path):
    print(f"Downloading {os.path.basename(file_path)}...")
    !wget -q -O {file_path} https://gist.githubusercontent.com/blakesanie/dde3a2b7e698f52f389532b4b52bc254/raw/76fe1b5e9efcf0d2afdfd78b0bfaa737ad0a67d3/shakespeare.txt
    print("Download complete.")

# Read the file and store the content in raw_text
with open(file_path, "r", encoding="utf-8") as f:
    raw_text = f.read()

print("Total number of character:", len(raw_text))
print(raw_text[:99])

Downloading shakespeare.txt...
Download complete.
Total number of character: 5436475
  From fairest creatures we desire increase,
  That thereby beauty's rose might never die,
  But as


In [2]:
import re # For Regex

# Okay, so we are now implementing our tokenizer
# Step 1 - Split the text on punctuation, double dashes and whitespace; the capturing group makes re.split keep the separators as items
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
# For the purpose of our example, we drop all whitespace items (this is a case-by-case decision; for instance, whitespace matters in Python code, so we can't ignore it if the dataset contains Python code)
preprocessed = [item.strip() for item in preprocessed if item.strip()] # preprocessed now holds all the split tokens
print(preprocessed[:30])

['From', 'fairest', 'creatures', 'we', 'desire', 'increase', ',', 'That', 'thereby', 'beauty', "'", 's', 'rose', 'might', 'never', 'die', ',', 'But', 'as', 'the', 'riper', 'should', 'by', 'time', 'decease', ',', 'His', 'tender', 'heir', 'might']


Now that we have split the text into words for our tokenizer, it's time to map them to integer values.

For simplicity, we will just sort the unique items in preprocessed and use each item's index as its integer ID.

This sorted array essentially gives us the mapping from each token ID back to its human-readable form. We will call it the vocabulary.

Note that both the splitting and the vocabulary creation differ between commercial models. We are just trying to get the intuition of the flow, rather than the exact functions being used.
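To see the sort-and-index idea on something small enough to check by hand, here is a sketch on a toy sentence. The sentence and the names `sample`, `tokens` and `toy_vocab` are made up for illustration; the regex is the same one used above.

```python
import re

sample = "the cat sat, the cat ran."

# Same split as above: keep punctuation as separate items, drop whitespace
tokens = [t for t in re.split(r'([,.:;?_!"()\']|--|\s)', sample) if t.strip()]

# Sort the unique tokens and use each token's position as its ID
toy_vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
ids = [toy_vocab[t] for t in tokens]

print(toy_vocab)  # {',': 0, '.': 1, 'cat': 2, 'ran': 3, 'sat': 4, 'the': 5}
print(ids)        # [5, 2, 4, 0, 5, 2, 3, 1]
```

Punctuation sorts before letters, which is why `,` and `.` get the smallest IDs.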

In [3]:
all_words = sorted(set(preprocessed)) #set to remove duplicate words
vocab_size = len(all_words)

print(vocab_size)

34191


In [4]:
vocab = {token:integer for integer,token in enumerate(all_words)} # Create the vocabulary as a hashmap/dict. The all_words list alone would work, but a dict gives us O(1) average-case token-to-ID lookup

We will need to both encode text into token IDs and decode token IDs back into text in the future, so we will create a tokenizer class that gives us both functions.

Encoding looks up each token's ID in vocab.

For decoding, we will create an inverse of vocab, where each key becomes the value and each value becomes the key, for faster lookup.
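A minimal sketch of the inverse-map idea, using a hand-written four-entry vocabulary (`toy_vocab` and `inverse_vocab` are hypothetical names, not part of the class below):

```python
toy_vocab = {",": 0, "cat": 1, "sat": 2, "the": 3}

# Swap keys and values so IDs can be looked up just as quickly as tokens
inverse_vocab = {idx: tok for tok, idx in toy_vocab.items()}

ids = [toy_vocab[t] for t in ["the", "cat", "sat"]]  # encode
words = [inverse_vocab[i] for i in ids]              # decode

print(ids)    # [3, 1, 2]
print(words)  # ['the', 'cat', 'sat']
```

The swap only works because every token has a unique ID, so no two keys collapse into one.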

In [5]:
class TokenizerV1:
    # Initialize the tokenizer with a vocabulary. Passing it in lets the same class be reused with a different vocab
    def __init__(self, vocab):
        self.word_to_id_dict = vocab # Map used for encoding
        self.id_to_word_dict = {i:s for s,i in vocab.items()} # Reverse map, used for decoding

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)

        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        ids = [self.word_to_id_dict[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.id_to_word_dict[i] for i in ids]) # Look up the token for each id and join the tokens with spaces
        # The join also put a space in front of every punctuation token
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text) # Remove the whitespace in front of the listed punctuation characters
        return text

In [6]:
tokenizer = TokenizerV1(vocab)

text = """"From fairest creatures we desire increase,
  That thereby beauty's rose might never die,"""
ids = tokenizer.encode(text)
print(ids)

[1, 3516, 17246, 14371, 33301, 15157, 20576, 8, 8124, 30894, 11200, 5, 27669, 27515, 22889, 23644, 15292, 8]


In [7]:
tokenizer.decode(ids)

'" From fairest creatures we desire increase, That thereby beauty\' s rose might never die,'

But what if we encounter a word that is not in the vocabulary? Then encode will fail with a KeyError. To handle those cases, we will use 2 special tokens:
1. <|unk|> for "unknown" words and
2. <|endoftext|> for "end of text"

<|endoftext|> is needed because LLMs are trained on multiple data sources, so it helps the LLM tell where one source ends and the next begins.
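The fallback itself is tiny. Here is a sketch on a hand-written vocabulary, assuming the two special tokens are already in it; `encode_words` is a hypothetical helper and not the class method defined below.

```python
toy_vocab = {"cat": 0, "the": 1, "<|unk|>": 2, "<|endoftext|>": 3}


def encode_words(words, vocab):
    """Look up each word, falling back to the <|unk|> id when the word is missing."""
    unk_id = vocab["<|unk|>"]
    return [vocab.get(word, unk_id) for word in words]


print(encode_words(["the", "dog", "<|endoftext|>", "the", "cat"], toy_vocab))
# [1, 2, 3, 1, 0]   ("dog" is not in the vocab, so it becomes <|unk|>)
```

Note the cost: once "dog" has been mapped to `<|unk|>`, decoding can never bring the original word back.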

For our updated tokenizer, we first need to add the two tokens to the vocab

In [8]:
all_words = sorted(set(preprocessed))
all_words.extend(["<|unk|>", "<|endoftext|>"])

vocab = {token:integer for integer,token in enumerate(all_words)}
len(vocab.items())

34193

Now, the first intuition might be that updating the vocab is enough and the new tokens will be handled automatically, but it isn't: we need to make a few changes to the encode function as well (decode stays as it was). I am going to write a new version `TokenizerV2` for this, but feel free to continue in the previous one as well

In [9]:
class TokenizerV2:
    def __init__(self, vocab):
        self.word_to_id_dict = vocab # Updated Map used for encoding
        self.id_to_word_dict = {i:s for s,i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)

        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        # The item list is ready, so this is where out-of-vocab items get swapped for <|unk|>
        preprocessed = [
            item if item in self.word_to_id_dict
            else "<|unk|>" for item in preprocessed
        ]
        ids = [self.word_to_id_dict[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.id_to_word_dict[i] for i in ids]) # Look up the token for each id and join the tokens with spaces
        # The join also put a space in front of every punctuation token
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text) # Remove the whitespace in front of the listed punctuation characters
        return text

In [10]:
tokenizer = TokenizerV2(vocab)

text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."

text = " <|endoftext|> ".join((text1, text2))

print(text)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.


In [11]:
tokenizer.encode(text)

[34191,
 8,
 15758,
 34147,
 21881,
 34191,
 262,
 34192,
 4320,
 30855,
 34191,
 34191,
 24067,
 30855,
 24572,
 76]

In [12]:
tokenizer.decode(tokenizer.encode(text))

'<|unk|>, do you like <|unk|>? <|endoftext|> In the <|unk|> <|unk|> of the palace.'

The <|unk|> and <|endoftext|> tokens are called special context tokens, and GPT uses <|endoftext|> in its working.
Other special context tokens include:

1. **[BOS] (beginning of sequence)**: This token marks the start of a text. It signifies to the LLM where a piece of content begins.

2. **[EOS] (end of sequence)**: This token is positioned at the end of a text, and is especially useful when concatenating multiple unrelated texts, similar to <|endoftext|>. For instance, when combining two different Wikipedia articles or books, the [EOS] token indicates where one article ends and the next one begins.

3. **[PAD] (padding)**: When training LLMs with batch sizes larger than one, the batch might contain texts of varying lengths. To ensure all texts have the same length, the shorter texts are extended or "padded" using the [PAD] token, up to the length of the longest text in the batch.

We will not worry about these for now, but you get the idea! You can add them to your implementation.
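If you do want to try padding, the idea fits in a few lines. This is only a sketch: `pad_batch` is a hypothetical helper, and `pad_id=0` is an arbitrary choice for the example; in a real vocabulary you would reserve a dedicated ID for [PAD].

```python
def pad_batch(batch, pad_id):
    """Right-pad every sequence with pad_id up to the length of the longest one."""
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]


batch = [[5, 6, 7], [8], [9, 10]]
print(pad_batch(batch, pad_id=0))
# [[5, 6, 7], [8, 0, 0], [9, 10, 0]]
```

Every row now has the same length, which is what lets a batch be stacked into a single tensor.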

We saw a simple word-based tokenization, where each word is a token. The downside is that the vocab must contain every word as a separate token.
This might seem logical at first, but it creates problems: the vocab has to be large, and any word outside the vocabulary cannot be represented.

Also, we lose the relation between different words. For instance, take the sentence "small smaller light lighter big bigger". With word-based tokenization we need 6 tokens (one per word), and they will be completely unrelated, even though "small" and "smaller" clearly share a root.

Hence, to solve this problem, we use subword-based tokenization. (There is character-based tokenization as well, but that has its own challenges; you can read about them.)

**What we will implement is a Byte Pair Encoder.**

a. Paper for BPE - https://www.derczynski.com/papers/archive/BPE_Gage.pdf

b. GPT uses tiktoken as its encoder, so you can directly use that as well (https://github.com/openai/tiktoken)

We will use tiktoken as our BPE tokenizer for this notebook, but we implement our own BPE tokenizer as well, which can be found here. So, choose whichever you prefer.
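To get a feel for what BPE does before handing the job to tiktoken, here is a rough sketch of its core loop: count adjacent symbol pairs, then merge the most frequent pair into one new symbol. This is a character-level toy under simplifying assumptions, not tiktoken's implementation and not the notebook's own BPE tokenizer; `most_frequent_pair` and `merge_pair` are hypothetical names.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most common one."""
    pairs = Counter()
    for symbols in words:
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += 1
    return pairs.most_common(1)[0][0]


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged_words = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged_words.append(out)
    return merged_words


# Start from single characters and apply a few merges
words = [list(w) for w in ["small", "smaller", "light", "lighter"]]
for _ in range(3):
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print("merged", pair, "->", words)
```

After enough merges, shared pieces such as "small" or "light" become single tokens while rarer endings stay as separate pieces, which is the relation between words that word-level tokenization throws away.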

In [13]:
! pip3 install tiktoken



In [14]:
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2") # Replaces our earlier tokenizer = TokenizerV2(vocab)

In [15]:
text = (
    "Hello, do you like tea? <|endoftext|> In the sunlit terraces"
     "of someunknownPlace."
)

integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})

print(integers)

strings = tokenizer.decode(integers)

print(strings)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 1659, 617, 34680, 27271, 13]
Hello, do you like tea? <|endoftext|> In the sunlit terracesof someunknownPlace.


Now that we have our encoded data, we will create input-target pairs (also called input-output pairs). In simpler words, LLMs don't need separately labelled data: the targets come from the text itself, since the target for any input is simply the token that follows it.

If the sentence is "I am learning LLM from scratch", we can build training examples like this:

a. Pass 1: "I" -> "am" (target)

b. Pass 2: "I am" -> "learning" (target)

c. Pass 3: "I am learning" -> "LLM" (target)

Did you notice how the previous target, once appended to the input, becomes part of the input for the next target? This is called the auto-regressive technique.
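The passes above can be generated mechanically. Here is a small sketch that treats whole words as tokens for readability (the real pipeline below works on token IDs); `words` and `pairs` are names made up for the example.

```python
words = "I am learning LLM from scratch".split()

# Every prefix of the sentence is an input; the word right after it is the target
pairs = [(words[:i], words[i]) for i in range(1, len(words))]

for context, target in pairs:
    print(" ".join(context), "->", target)
```

A sentence of 6 words yields 5 training pairs, with no manual labelling needed.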

In [16]:
encoded_text = tokenizer.encode(raw_text)
print(len(encoded_text))

1836425


One of the easiest and most intuitive ways to create the input-target pairs for the next-word prediction task is to create two variables, x and y, where x contains the input tokens and y contains the targets, which are the inputs shifted by 1:

In [17]:
context_size = 4 # length of the input
# A context_size of 4 means the model looks at a sequence of at most 4 tokens to predict the next token in the sequence.
# For token IDs [1, 2, 3, 4, 5], the input x is the first 4 tokens [1, 2, 3, 4], and the target y is the same window shifted by one: [2, 3, 4, 5]

for i in range(1, context_size+1):
    context = encoded_text[:i]
    desired = encoded_text[i]

    print(context, "---->", desired)

for i in range(1, context_size+1):
    context = encoded_text[:i]
    desired = encoded_text[i]

    print(tokenizer.decode(context), "---->", tokenizer.decode([desired]))

[220] ----> 3574
[220, 3574] ----> 37063
[220, 3574, 37063] ----> 301
[220, 3574, 37063, 301] ----> 8109
  ---->  From
  From ---->  faire
  From faire ----> st
  From fairest ---->  creatures


Now we have created input-target pairs that will be helpful in training LLMs. But these are currently just printed lists. We need to organize them in a structured way that will also be useful to us later.

So, we will create something called a data loader. A data loader iterates over the input dataset and returns the inputs and targets as PyTorch tensors, which can be thought of as multidimensional arrays.

Our data loader will generate 2 tensors:
1. An input tensor holding the tokens the LLM gets to see
2. A target tensor holding the tokens the LLM should predict

We will use a sliding window approach to implement this. This is what we will do:

1. We will create an input tensor (assume a 2D array for now), where each row is a list of `context_length` tokens (called `max_length` in the code below).
  
    For instance, if the token list after encoding is [1,2,3,4,5,6,7,8], `context_length` is 2 and the window moves 2 tokens at a time, then `input_tensor = [[1,2],[3,4],[5,6]]`

2. We will create an output tensor, where each output row is the same as the input row moved ahead by 1 index.

     For the above `input_tensor`, `output_tensor = [[2,3],[4,5],[6,7]]`. The last window [7,8] is left out because there is no 9th token to complete its target row.

Simple Algorithm

Step 1: Tokenize the entire text

Step 2: Use a sliding window to chunk the book into sequences of max_length

Step 3: Return the total number of rows in the dataset (`__len__`)

Step 4: Return a single row from the dataset (`__getitem__`)
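Before wrapping this in a PyTorch `Dataset`, it may help to see the sliding window on plain Python lists. This is a sketch using the toy token list from the example above; `sliding_windows` is a hypothetical helper that follows the same loop bounds as the class below.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (inputs, targets), where each target row is its input row shifted by one token."""
    inputs, targets = [], []
    # Stop early enough that every target row still has max_length tokens
    for i in range(0, len(token_ids) - max_length, stride):
        inputs.append(token_ids[i:i + max_length])
        targets.append(token_ids[i + 1:i + max_length + 1])
    return inputs, targets


inputs, targets = sliding_windows([1, 2, 3, 4, 5, 6, 7, 8], max_length=2, stride=2)
print(inputs)   # [[1, 2], [3, 4], [5, 6]]
print(targets)  # [[2, 3], [4, 5], [6, 7]]
```

With `stride` equal to `max_length` the windows do not overlap; a smaller stride such as 1 produces more, overlapping rows from the same text.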

In [18]:
import torch
from torch.utils.data import Dataset, DataLoader

class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Use a sliding window to chunk the book into sequences of max_length
        i = 0
        while i < len(token_ids) - max_length:
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))
            i+=stride

    # Needed for pytorch dataloader
    def __len__(self):
        return len(self.input_ids)

    # Needed for pytorch dataloader
    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]


In [19]:
def create_dataloader_v1(txt, batch_size=4, max_length=256,
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):

    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader

Convert dataloader into a Python iterator to fetch the next entry via Python's built-in next() function

In [20]:
import torch
print("PyTorch version:", torch.__version__)
dataloader = create_dataloader_v1(
    raw_text, batch_size=1, max_length=4, stride=1, shuffle=False
)

data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)

PyTorch version: 2.9.0+cpu
[tensor([[  220,  3574, 37063,   301]]), tensor([[ 3574, 37063,   301,  8109]])]


Now that we have our DataLoader, we can convert the token IDs in our tensors into vector embeddings to be served to the LLM.


Let's build an intuition for the conversion to vector embeddings. Let's say you want to generate the sentence "I am learning LLM from scratch"

This is in the form of words, but computers need numbers to work on. So we convert sentences into token IDs, as seen above. But why can't we use these token IDs directly?

When we convert words or subwords into token IDs, we lose the relation between them. For instance, "cat" and "kitten" are related, but if we just pass the token IDs, we can't figure out that they are related.

So, to fix this, we represent each token as a vector of numbers, where each number corresponds to some feature. This is called a vector embedding. For instance, if the features I decide to take are ["is_pet", "can_fly"], I can convert "dog" into the vector [90, 2]. Then all words with a high "is_pet" value end up close together, and similarly for "can_fly". In practice the model learns the features itself during training, so they are rarely this interpretable.
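To make "related words end up close together" concrete, here is a sketch that compares hand-made feature vectors with cosine similarity. Only the [90, 2] vector for "dog" comes from the text above; the values for "cat" and "eagle" are invented for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means they point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Features are [is_pet, can_fly]; the cat and eagle values are made up
features = {"dog": [90, 2], "cat": [85, 1], "eagle": [5, 95]}

print(round(cosine_similarity(features["dog"], features["cat"]), 3))
print(round(cosine_similarity(features["dog"], features["eagle"]), 3))
```

The dog/cat score comes out close to 1 and the dog/eagle score close to 0, a distinction that bare token IDs cannot express.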

There are multiple pre-trained vector embeddings available online that you can use, or you can train your own. One such set is [Google's word2vec](https://huggingface.co/fse/word2vec-google-news-300). You can use it if you don't feel like training your own vector embedding.

The next block illustrates how to use it; you can skip it if you are training your own.

In [23]:
! pip install gensim

import gensim.downloader as api
model = api.load("word2vec-google-news-300")


Collecting gensim
  Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.4 kB)
Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.9 MB)
Installing collected packages: gensim
Successfully installed gensim-4.4.0

In [24]:
word_vectors = model
print(word_vectors['computer'])

print(word_vectors.most_similar(positive=['king', 'woman'], negative=['man'], topn=5)) #Find most similar word to king + woman - man which should be something around 'queen'

[ 1.07421875e-01 -2.01171875e-01  1.23046875e-01  2.11914062e-01
 -9.13085938e-02  2.16796875e-01 -1.31835938e-01  8.30078125e-02
  2.02148438e-01  4.78515625e-02  3.66210938e-02 -2.45361328e-02
  2.39257812e-02 -1.60156250e-01 -2.61230469e-02  9.71679688e-02
 -6.34765625e-02  1.84570312e-01  1.70898438e-01 -1.63085938e-01
 -1.09375000e-01  1.49414062e-01 -4.65393066e-04  9.61914062e-02
  1.68945312e-01  2.60925293e-03  8.93554688e-02  6.49414062e-02
  3.56445312e-02 -6.93359375e-02 -1.46484375e-01 -1.21093750e-01
 -2.27539062e-01  2.45361328e-02 -1.24511719e-01 -3.18359375e-01
 -2.20703125e-01  1.30859375e-01  3.66210938e-02 -3.63769531e-02
 -1.13281250e-01  1.95312500e-01  9.76562500e-02  1.26953125e-01
  6.59179688e-02  6.93359375e-02  1.02539062e-02  1.75781250e-01
 -1.68945312e-01  1.21307373e-03 -2.98828125e-01 -1.15234375e-01
  5.66406250e-02 -1.77734375e-01 -2.08984375e-01  1.76757812e-01
  2.38037109e-02 -2.57812500e-01 -4.46777344e-02  1.88476562e-01
  5.51757812e-02  5.02929

So, to create a vector embedding, we need 2 things: the vocabulary size (the number of distinct token IDs) and the number of features per vector.

For instance, the smallest GPT-2 model uses a vocabulary of 50257 tokens and 768 features per token, so its embedding can be thought of as a 2D array where each row corresponds to one of the 50257 tokens and each column holds that token's value for one of the 768 features.
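A quick back-of-the-envelope calculation shows how big that 2D array is. The 4 bytes per value assumes 32-bit floats, which is an assumption for this sketch rather than something stated above.

```python
vocab_size = 50257     # rows: one per token
embedding_dim = 768    # columns: one per feature

num_parameters = vocab_size * embedding_dim
size_mib = num_parameters * 4 / (1024 ** 2)  # assuming 4 bytes (float32) per value

print(f"{num_parameters:,} values, roughly {size_mib:.0f} MiB")
```

So the token embedding table alone holds about 38.6 million trainable numbers.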

How to generate vector embeddings?
1. Assign random values to the vector entries
2. Optimize the embedding weights as the LLM is trained. This means that as the LLM trains, the vector embeddings get trained too.

In [26]:
# We will not train our embedding with 768 features like GPT; for now, we will take just 3 features.
# We could use the input tensor generated above, or a much smaller vocabulary of, let's say, 6 words.
# With the full Shakespeare input tensor, 3 features would not be enough to separate every word properly,
# hence, I will be using 6 words and 3 features for now. However, if you want to go full-fledged, you can use the 300 features of
# Google's word2vec or even the 768 of GPT. Remember that the more parameters there are, the more energy and resources training takes

input_ids = torch.tensor([2, 3, 5, 1])

vocab_size = 6 # size of vocabulary
output_dim = 3 # number of features

torch.manual_seed(123) # Fix the random seed so the initial weights are reproducible
embedding_layer = torch.nn.Embedding(vocab_size, output_dim) # Creates the embedding table with random initial values.

print(embedding_layer.weight)

Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)


See that the weight of the embedding layer is a 6x3 matrix, as we wanted. It acts as a lookup table, so we can now look up the vector for any token ID.
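Why a lookup table? Picking row `i` of the weight matrix gives the same result as multiplying a one-hot vector for token `i` by that matrix; the lookup just skips the wasted multiplications by zero. Here is a pure-Python sketch with a made-up 4x3 weight table (`one_hot` and `row_times_matrix` are hypothetical helpers):

```python
# 4 tokens x 3 features, values invented for the example
weights = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]


def one_hot(index, size):
    """Vector of zeros with a single 1.0 at `index`."""
    return [1.0 if j == index else 0.0 for j in range(size)]


def row_times_matrix(vec, matrix):
    """Multiply a row vector by a matrix."""
    return [sum(v * row[col] for v, row in zip(vec, matrix)) for col in range(len(matrix[0]))]


token_id = 2
print(row_times_matrix(one_hot(token_id, 4), weights))  # [0.7, 0.8, 0.9]
print(weights[token_id])                                # [0.7, 0.8, 0.9]
```

Both lines print the same row, so an embedding layer can be read as a linear layer applied to one-hot inputs, implemented as an index operation.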

In [27]:
print(embedding_layer(torch.tensor([3])))
print(embedding_layer(input_ids))

tensor([[-0.4015,  0.9666, -1.1481]], grad_fn=<EmbeddingBackward0>)
tensor([[ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-2.8400, -0.7849, -1.4096],
        [ 0.9178,  1.5810,  1.3010]], grad_fn=<EmbeddingBackward0>)


I know, it's getting tiring now, but if you are still reading this, we are just about to finish the data pre-processing part. (You thought OpenAI was burning money on nothing?) Anyway, our words now carry meaning, but what if I have 2 sentences:

1. I am learning LLM from scratch
2. LLM is learnt by me from scratch

You can see that LLM appears in both sentences, but its position changes its role. If we embed both sentences, though, the vector embedding for LLM will be identical, even though it serves a different purpose in each. Hence, we will now add position information to our vector embeddings.

There are 2 ways to do it:
1. Absolute position - We assign a fixed embedding to each position in the sequence and add it to the token embedding to get the new vector embedding. The positional vector has the same dimension as the token embedding.

2. Relative position - We encode how far apart 2 words are instead. This way, the model can cope even if the sentence length differs from what it saw in the training set.

Absolute position works just fine for our purposes; GPT-2, for instance, uses learned absolute positional embeddings.
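Here is a toy sketch of the absolute approach: every position gets its own vector, which is added to the token vector element by element. All the numbers and the `embed_sentence` helper are invented for illustration; in the real model both tables are learned.

```python
# Made-up 2-feature embeddings
token_embedding = {"LLM": [1.0, 0.0], "learning": [0.0, 1.0]}
position_embedding = [[0.1, 0.1], [0.2, 0.2], [0.3, 0.3]]  # one row per position


def embed_sentence(words):
    """Add the position vector to each word's token vector, element by element."""
    return [
        [t + p for t, p in zip(token_embedding[word], position_embedding[pos])]
        for pos, word in enumerate(words)
    ]


first = embed_sentence(["learning", "LLM"])   # LLM at position 1
second = embed_sentence(["LLM", "learning"])  # LLM at position 0

print(first[1])   # vector for LLM in the first sentence
print(second[0])  # vector for LLM in the second sentence
```

The two printed vectors differ even though the token is the same, which is the position signal the model needs.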

In [29]:
vocab_size = 50257
output_dim = 256

token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim) # Create an embedding, this time with GPT's vocabulary size (and 256 features per token)

In [30]:
# Generate input_ids with dataloader

max_length = 4
dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=max_length,
    stride=max_length, shuffle=False
)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)

In [32]:
print("Token IDs:\n", inputs)
print("\nInputs shape:\n", inputs.shape)

Token IDs:
 tensor([[  220,  3574, 37063,   301],
        [ 8109,   356,  6227,  2620],
        [   11,   198,   220,  1320],
        [12839,  8737,   338,  8278],
        [ 1244,  1239,  4656,    11],
        [  198,   220,   887,   355],
        [  262,   374,  9346,   815],
        [  416,   640, 12738,   589]])

Inputs shape:
 torch.Size([8, 4])


In [33]:
# Remember that the token embedding layer's weight looks like [[256 values], [256 values], ...], so we extract the 256 values for each token ID and get a 3D tensor

token_embeddings = token_embedding_layer(inputs) # create embedding for each token id
print(token_embeddings.shape)

torch.Size([8, 4, 256])


Now let's create the positional embedding. At any given point, we process at most `context_length` tokens, so we can create the positional embedding as a `context_length` x `output_dim` matrix, where each row holds the feature values for one position.

In [35]:
context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)
pos_embeddings = pos_embedding_layer(torch.arange(max_length))
print(pos_embeddings.shape)
input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)

torch.Size([4, 256])
torch.Size([8, 4, 256])


And we are done, we are finally done with the data pre-processing, and the result can be sent to the LLM for training. We will continue with LLM training as we move ahead, but for a moment, step back and make a mind map of what we have done till now. It's a lot for data pre-processing.

If you are following till now, it means you are having fun. And if you are having fun, it's important that you get what we have done till now. And yeah, as I am writing this, I haven't implemented my own BPE tokenizer yet, but I will update this as soon as I do. So watch out! Anyway, onto LLM training.

Here's a small recap of the steps done till now:

1. Tokenization (using the BPE algorithm and tiktoken)
2. Creating input-target pairs
3. Creating token embeddings
4. Creating positional embeddings
5. Creating input embeddings = token + positional