<a href="https://colab.research.google.com/github/joshuwaifo/A-Bible-Pre-trained-Transformer-Model/blob/main/BPETokeniser_Tiktoken_BibleGPT_8.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The Andrej Karpathy material ends here; the notes below follow the book Build a Large Language Model (from Scratch) by Sebastian Raschka

In [1]:
!wget https://raw.githubusercontent.com/tushortz/variety-bible-text/master/bibles/nasb.txt

--2024-08-12 07:51:39--  https://raw.githubusercontent.com/tushortz/variety-bible-text/master/bibles/nasb.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4685837 (4.5M) [text/plain]
Saving to: ‘nasb.txt’


2024-08-12 07:51:40 (69.8 MB/s) - ‘nasb.txt’ saved [4685837/4685837]



Tokeniser

- encode method: takes in natural text, splits it into individual tokens, and converts the tokens into token IDs via a vocabulary (tokenizer.encode(text))

- decode method: takes in token IDs, converts them back into text tokens, and concatenates the text tokens into natural text (tokenizer.decode(ids))

Vocabulary and Inverse vocabulary
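
The encode and decode methods described above can be sketched with a toy word-level tokeniser. This is only an illustration: the class name `SimpleTokenizer`, the regex used for splitting, and the one-sentence vocabulary are assumptions made for the example, not the tokeniser used later in this notebook.

```python
import re

class SimpleTokenizer:
    """Toy word-level tokeniser; illustrative only."""

    def __init__(self, vocab):
        self.str_to_int = vocab                              # vocabulary: token -> ID
        self.int_to_str = {i: s for s, i in vocab.items()}   # inverse vocabulary: ID -> token

    @staticmethod
    def split(text):
        # Split into words and keep each punctuation mark as its own token
        return re.findall(r"\w+|[^\w\s]", text)

    def encode(self, text):
        return [self.str_to_int[token] for token in self.split(text)]

    def decode(self, ids):
        text = " ".join(self.int_to_str[i] for i in ids)
        # Remove the space that join() placed before punctuation
        return re.sub(r"\s+([^\w\s])", r"\1", text)


sample = "In the beginning God created the heavens and the earth."
vocab = {token: idx for idx, token in enumerate(sorted(set(SimpleTokenizer.split(sample))))}
tokenizer_demo = SimpleTokenizer(vocab)
ids = tokenizer_demo.encode(sample)
print(ids)
print(tokenizer_demo.decode(ids))
```

The vocabulary maps each unique token to an integer, and the inverse vocabulary is the same mapping reversed, which is what makes decoding possible.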

In [3]:
with open("nasb.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
print("Total number of character:", len(raw_text))
print(raw_text[:99])

Total number of character: 4623633
In the beginning God created the heavens and the earth. -- genesis 1:1
.
The earth was formless and


Extend vocabulary with additional special tokens

Add special tokens to a vocabulary to deal with certain contexts

- <|unk|> token to represent new and unknown words that were not part of the training data and thus not a part of the existing vocabulary

- <|endoftext|> token to separate two unrelated text sources
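
A minimal sketch of how the two special tokens could be added to a word-level vocabulary. The helper names (`split_tokens`, `encode_with_unk`) and the tiny vocabulary are hypothetical and exist only for this illustration.

```python
import re

UNK, EOT = "<|unk|>", "<|endoftext|>"

def split_tokens(text):
    # Keep the special token intact; otherwise split into words and punctuation
    return re.findall(r"<\|endoftext\|>|\w+|[^\w\s]", text)

known_text = "In the beginning God created the heavens and the earth."
tokens = sorted(set(split_tokens(known_text)))
tokens.extend([EOT, UNK])                      # special tokens appended to the vocabulary
vocab_special = {tok: idx for idx, tok in enumerate(tokens)}

def encode_with_unk(text):
    # Words missing from the vocabulary fall back to the <|unk|> ID
    return [vocab_special.get(tok, vocab_special[UNK]) for tok in split_tokens(text)]

two_sources = "In the beginning " + EOT + " the unknownword earth."
print(encode_with_unk(two_sources))
```

Here `unknownword` is not in the vocabulary, so it is mapped to the ID of `<|unk|>`, while `<|endoftext|>` keeps its own ID and marks the boundary between the two sources.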

In [14]:
# Byte Pair Encoding tokenisation scheme used by Llama, GPT-3 etc.
!pip install tiktoken
from importlib.metadata import version
import tiktoken
print("tiktoken version:", version("tiktoken"))

Collecting tiktoken
  Downloading tiktoken-0.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Downloading tiktoken-0.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
Installing collected packages: tiktoken
Successfully installed tiktoken-0.7.0
tiktoken version: 0.7.0


Independent text source

<|endoftext|> tokens act as markers signalling the start or end of a particular segment, which helps the LLM process independent text sources without treating them as one continuous text

Prepend it (add it to the start) to each subsequent text source

Depending on the LLM, other special tokens can include

- [BOS]: Beginning of sequence

- [EOS]: End of sequence

- [PAD]: padding
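
As a rough sketch of how such tokens are typically used, the snippet below wraps each sequence in [BOS] and [EOS] and right-pads a batch with [PAD] so that all rows have the same length. The ID values and the function name `wrap_and_pad` are hypothetical; real IDs depend on the tokeniser.

```python
# Hypothetical special-token IDs; real values depend on the tokeniser
BOS_ID, EOS_ID, PAD_ID = 0, 1, 2

def wrap_and_pad(sequences, pad_to):
    """Add [BOS]/[EOS] around each sequence, then right-pad with [PAD]."""
    batch = []
    for seq in sequences:
        wrapped = [BOS_ID] + list(seq) + [EOS_ID]
        if len(wrapped) > pad_to:
            raise ValueError("sequence longer than pad_to")
        batch.append(wrapped + [PAD_ID] * (pad_to - len(wrapped)))
    return batch

padded = wrap_and_pad([[10, 11, 12], [20]], pad_to=6)
print(padded)
```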


A byte pair encoding (BPE) tokeniser doesn't use an <|unk|> token

In [15]:
tokenizer = tiktoken.get_encoding("gpt2")

BPE tokenisers break down unknown words into subwords and individual characters

This allows the BPE tokeniser to parse any word, so it never needs to replace unknown words with a special token like <|unk|>
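
To make the subword fallback concrete, here is a toy BPE sketch that learns merges from a handful of words and then splits an unseen word. It is a simplified, character-level illustration under stated assumptions: the training words, the number of merges, and all function names are made up for the example, and the real GPT-2 tokeniser works on bytes with a much larger, pre-trained merge table.

```python
from collections import Counter

def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def learn_merges(words, num_merges):
    """Learn BPE merges by repeatedly merging the most frequent adjacent pair."""
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in corpus:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        corpus = [apply_merge(symbols, best) for symbols in corpus]
    return merges

def bpe_split(word, merges):
    """Split a word into subwords by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols

merges = learn_merges(["beginning", "begin", "being", "earth", "heaven"], num_merges=6)
pieces = bpe_split("beginxq", merges)
print(pieces)
```

The characters `x` and `q` never occur in the training words, so they stay as single-character pieces, while the familiar part of the word is covered by merged subwords. No piece has to be replaced by an unknown-word token.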

In [16]:
# convert string into token ids
text = "Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace."
integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(integers)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 286, 617, 34680, 27271, 13]


Call the BPE tokeniser on the unknown string "Akwirw ier" and print the individual token IDs

In [19]:
resulting_integers = tokenizer.encode("Akwirw ier")
print(resulting_integers)

[33901, 86, 343, 86, 220, 959]


Call decode on each of the resulting integers to map every token ID to its token text

In [22]:
# note that the input to decode has to be a list even if it is a list of one integer element
tokens = [tokenizer.decode([integer]) for integer in resulting_integers]
print(tokens)

['Ak', 'w', 'ir', 'w', ' ', 'ier']


Call the decode method on the token IDs (resulting integers) to reconstruct the original input

In [24]:
original_input = tokenizer.decode(resulting_integers)
print(original_input)

Akwirw ier


During training, we mask out all tokens that come after the target, i.e. everything past the next token the model should predict
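
A short sketch of the resulting next-token prediction tasks, using the first five token IDs of the corpus (the same IDs appear in the first data loader batch printed further down in this notebook). The variable names are illustrative.

```python
# First five token IDs of the corpus ("In the beginning God created")
sample_ids = [818, 262, 3726, 1793, 2727]
context_size = 4

x = sample_ids[:context_size]          # inputs
y = sample_ids[1:context_size + 1]     # targets: the inputs shifted by one position

pairs = []
for i in range(1, context_size + 1):
    context = sample_ids[:i]           # tokens the model is allowed to see
    target = sample_ids[i]             # next token it should predict
    pairs.append((context, target))
    print(context, "---->", target)
```

Everything to the right of the arrow, and everything after it in the text, is hidden from the model for that prediction step.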



In [25]:
# Encode the Bible using the BPE tokeniser
enc_text = tokenizer.encode(raw_text)
print(len(enc_text))

1249848


The data set contains 1,249,848 tokens (about 1.25 million)
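
As a quick sanity check, the token count can be compared with the character count printed earlier to estimate how many characters one BPE token covers on average for this text. The two numbers are copied from the cell outputs in this notebook.

```python
num_chars = 4_623_633    # len(raw_text), printed earlier in this notebook
num_tokens = 1_249_848   # len(enc_text)

chars_per_token = num_chars / num_tokens
print(f"{chars_per_token:.2f} characters per token")
```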

In [26]:
# tensor containing the inputs: x
# tensor containing the targets: y

# use PyTorch's built in Dataset and DataLoader classes

import torch
from torch.utils.data import Dataset, DataLoader

#  defines how individual rows are fetched from the dataset
class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

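        # Tokenise the entire text once
        token_ids = tokenizer.encode(txt)

        # Slide a window of max_length tokens over the text in steps of stride;
        # the target is the input window shifted one token to the right
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))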

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]


Data loader to generate batches with input-output pairs

In [28]:
def create_dataloader_v1(txt, batch_size=4, max_length=256,
        stride=128, shuffle=True, drop_last=True, num_workers=0):
    tokenizer = tiktoken.get_encoding("gpt2")
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers  # pass the argument through instead of hard-coding 0
    )

    return dataloader
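
A back-of-the-envelope sketch of how many samples and batches the default settings would produce on this corpus. It only mirrors the `range(...)` expression used in `GPTDatasetV1`; the helper name `num_windows` is made up for the example, and the token count is the one reported above.

```python
def num_windows(num_tokens, max_length, stride):
    # Mirrors range(0, num_tokens - max_length, stride) in GPTDatasetV1.__init__
    return len(range(0, num_tokens - max_length, stride))

num_tokens = 1_249_848                      # token count reported above
samples = num_windows(num_tokens, max_length=256, stride=128)
batches = samples // 4                      # batch_size=4 with drop_last=True
print(samples, batches)
```

With `drop_last=True`, any final incomplete batch is discarded, which is why integer division gives the number of batches per epoch.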

Test data loader on Bible txt

In [29]:
dataloader = create_dataloader_v1(
    raw_text, batch_size=1, max_length=4, stride=1, shuffle=False)
data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch) # looks good

[tensor([[ 818,  262, 3726, 1793]]), tensor([[ 262, 3726, 1793, 2727]])]


If the stride equals the input window size (max_length), consecutive input windows do not overlap

A stride of 1 moves the input window by 1 position

The context size equals max_length here, which roughly corresponds to the number of timesteps
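
The effect of the stride can be seen on a toy sequence. The sketch below slices windows the same way `GPTDatasetV1` does, but on plain Python lists; the function name `windows` and the stand-in IDs 0 to 9 are assumptions for the illustration.

```python
def windows(token_ids, max_length, stride):
    """Return (input, target) windows the way GPTDatasetV1 slices them."""
    out = []
    for i in range(0, len(token_ids) - max_length, stride):
        out.append((token_ids[i:i + max_length], token_ids[i + 1:i + max_length + 1]))
    return out

toy_ids = list(range(10))                         # stand-in token IDs 0..9
overlapping = windows(toy_ids, max_length=4, stride=1)
disjoint = windows(toy_ids, max_length=4, stride=4)

print([x for x, _ in overlapping])   # each window starts one position later
print([x for x, _ in disjoint])      # windows share no input tokens
```

A small stride yields more, heavily overlapping samples from the same text, while a stride equal to `max_length` uses every token as an input at most once.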

In [None]:
# Preparing the input text for an LLM involves

# Input text: "This is an example."
# tokenising text: | This | is | an | example | . |
# converting text tokens to token IDs: | 40134 | 2052 | 133 | 389 | 12 |
# converting token IDs into token embedding vectors
# creating input token embeddings

# GPT-like decoder only transformer
# Postprocessing steps
# Output text




Example

In [30]:
input_token_ids = torch.tensor([2, 3, 5, 1])


vocab_size = 6
output_dim = 3


torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
print(embedding_layer.weight) # embedding layer's underlying weight matrix

Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)


Weight matrix of embedding layer contains small random values

These values are optimised during LLM training, together with the rest of the model's weights

Weight matrix has 6 rows and 3 columns

One row for each of the possible 6 tokens in the vocabulary

One column for each of the three embedding dimensions

In [34]:
# apply embedding layer to a token id
print(embedding_layer(torch.tensor([3])))

tensor([[-0.4015,  0.9666, -1.1481]], grad_fn=<EmbeddingBackward0>)


This is identical to the 4th row of the weight matrix above (index 3, since indexing starts at 0)

Essentially it is a look up operation that retrieves rows from the embedding layer's weight matrix via a token id
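
The lookup view can be checked against the equivalent "one-hot vector times weight matrix" formulation. The sketch below uses plain Python lists so it runs without PyTorch; the weight values are copied from the embedding layer printed above, and the function names are illustrative.

```python
# Weight matrix copied from the embedding layer printed above (6 tokens x 3 dimensions)
weights = [
    [ 0.3374, -0.1778, -0.1690],
    [ 0.9178,  1.5810,  1.3010],
    [ 1.2753, -0.2010, -0.1606],
    [-0.4015,  0.9666, -1.1481],
    [-1.1589,  0.3255, -0.6315],
    [-2.8400, -0.7849, -1.4096],
]

def lookup(token_id):
    # Row selection: what the embedding layer does for a single token ID
    return weights[token_id]

def one_hot_matmul(token_id):
    # The same result written as a one-hot vector multiplied by the weight matrix
    one_hot = [1.0 if row == token_id else 0.0 for row in range(len(weights))]
    return [
        sum(one_hot[row] * weights[row][col] for row in range(len(weights)))
        for col in range(len(weights[0]))
    ]

print(lookup(3))
print(one_hot_matmul(3))
```

Both calls return the same row, but the lookup avoids multiplying by all the zeros in the one-hot vector.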

In [36]:
# Apply embedding layer to all 4 input ids defined earlier
print(embedding_layer(input_token_ids))

tensor([[ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-2.8400, -0.7849, -1.4096],
        [ 0.9178,  1.5810,  1.3010]], grad_fn=<EmbeddingBackward0>)


This verifies the idea of the lookup operation mentioned previously

In [None]:
# Continue from Figure 2.16