## Reading a short story text sample into Python

Reference code from [LLMs-from-scratch](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/ch02.ipynb)

### Step 1: Creating Tokens

Read the file, then print the total number of characters and the first 99 characters.

In [20]:
import os

In [21]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

print("Total no. of characters :::", len(raw_text))
print(raw_text[:99])

Total no. of characters ::: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


_**NB:** When working with LLMs it's common to process gigabytes of text, as opposed to the few kilobytes we're working with here._

Use `re.split` to split the text on whitespace. The capturing group in the pattern keeps the whitespace characters in the result as separate items.

In [22]:
import re

text = "Hello, world this is a text."
result = re.split(r'(\s)', text)

print(result)

['Hello,', ' ', 'world', ' ', 'this', ' ', 'is', ' ', 'a', ' ', 'text.']


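Now modify the pattern to also split on commas and periods. The empty strings and whitespace-only items this produces are filtered out in the next cell.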

In [23]:
result = re.split(r'([,.]|\s)', text)

print(result)

['Hello', ',', '', ' ', 'world', ' ', 'this', ' ', 'is', ' ', 'a', ' ', 'text', '.', '']


In [24]:
result = [item for item in result if item.strip()]

print(result)

['Hello', ',', 'world', 'this', 'is', 'a', 'text', '.']


_**NB:** Removing whitespace isn't mandatory. Keeping it can be useful when training models on text where whitespace carries meaning, such as Python code, where indentation matters._
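
To make the trade-off concrete, here is a minimal sketch (the variable names are illustrative and not from the reference code). If the whitespace tokens are kept, joining the tokens reproduces the original string exactly, whereas the filtered list loses the spacing.

```python
import re

text = "Hello, world. Is this  a test?"

# Keep every piece, dropping only the empty strings re.split produces
# between adjacent delimiters.
with_ws = [t for t in re.split(r'([,.]|\s)', text) if t != ""]

# Drop whitespace-only pieces as well, as done above.
without_ws = [t for t in with_ws if t.strip()]

print("".join(with_ws) == text)     # whitespace kept: lossless round trip
print("".join(without_ws) == text)  # whitespace removed: spacing is lost
```

The second comparison fails because the double space and every single space are gone, so the original layout cannot be recovered from the tokens alone.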

Let's modify the code to now handle other kinds of punctuation marks.

In [25]:
text = "Hello, world. Is this-- a test?"

result = re.split(r'([,.:;?_!"()\']|--|\s)',text)
result = [item for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']


In [26]:
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)',raw_text)
preprocessed = [item for item in preprocessed if item.strip()]
print(preprocessed[:30])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']


Calculate the total number of tokens

In [27]:
print(len(preprocessed))

4690


### Step 2: Creating Token IDs

We've already tokenised the text and stored the tokens in the `preprocessed` variable.

Now we'll remove duplicates and sort the remaining tokens so that we can build a vocabulary dictionary.

In [28]:
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)

print(vocab_size)

1130


Now we create a dictionary that maps each unique token to an integer ID.

In [29]:
vocab = {token:index for index, token in enumerate(all_words)}

List the first 51 entries (IDs 0 to 50) of the dictionary to inspect the token-to-ID mapping. The same mapping is used later to convert new text into token IDs.

In [30]:
for i, item in enumerate(vocab.items()):
    print(item)
    if i >= 50:
        break

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)
('Begin', 20)
('Burlington', 21)
('But', 22)
('By', 23)
('Carlo', 24)
('Chicago', 25)
('Claude', 26)
('Come', 27)
('Croft', 28)
('Destroyed', 29)
('Devonshire', 30)
('Don', 31)
('Dubarry', 32)
('Emperors', 33)
('Florence', 34)
('For', 35)
('Gallery', 36)
('Gideon', 37)
('Gisburn', 38)
('Gisburns', 39)
('Grafton', 40)
('Greek', 41)
('Grindle', 42)
('Grindles', 43)
('HAD', 44)
('Had', 45)
('Hang', 46)
('Has', 47)
('He', 48)
('Her', 49)
('Hermia', 50)
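
The order of the entries above comes from Python's default string sort, which compares Unicode code points: punctuation sorts before uppercase letters, and uppercase before lowercase. A small sketch on a made-up token list (not taken from the story) shows the effect:

```python
toy_tokens = ["the", "The", ",", "a", "!", "the", "HAD"]

# set() removes the duplicate "the"; sorted() orders by code point.
toy_vocab = {token: index for index, token in enumerate(sorted(set(toy_tokens)))}

print(toy_vocab)
```

Because the IDs follow the sort order, the same set of tokens always produces the same mapping.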


The vocabulary above lets us encode tokens as IDs. To turn a model's output back into text we also need the inverse mapping, from token ID back to token.

The Python class below provides both an `encode` and a `decode` method.

In [31]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)',text)
        preprocessed = [item for item in preprocessed if item.strip()]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

Instantiate the tokenizer and encode a passage from the story.

In [32]:
tokenizer = SimpleTokenizerV1(vocab)

input_text = """"It's the last he painted, you know," 
           Mrs. Gisburn said with pardonable pride."""
ids = tokenizer.encode(input_text)
print(ids)

[1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]


Decode the integers back to text

In [33]:
tokenizer.decode(ids)

'" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.'

In [34]:
tokenizer.decode(tokenizer.encode(input_text))

'" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.'
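
The round trip is close to the original but not exact: `decode` joins every token with a space and then only removes the space *before* certain punctuation marks, so a space survives after an opening quote or an apostrophe. The sketch below reproduces that step on a hand-written token list (the tokens are made up for illustration):

```python
import re

tokens = ['"', 'It', "'", 's', 'fine', ',', '"', 'he', 'said', '.']

# Step 1: join all tokens with single spaces.
joined = " ".join(tokens)

# Step 2: remove whitespace that precedes the listed punctuation marks,
# using the same pattern as the decode method.
cleaned = re.sub(r'\s+([,.?!"()\'])', r'\1', joined)

print(joined)
print(cleaned)
```

The cleaned string still reads `" It' s` rather than `"It's`, which matches the behaviour seen in the decoded output above.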

To make things interesting, let's try to encode text containing words that aren't in the vocabulary.

In [35]:
text = "Hello, do you like tea?"
print(tokenizer.encode(text))

KeyError: 'Hello'

`Hello` isn't in the vocabulary, so the dictionary lookup raises a `KeyError`. LLM tokenizers need a way of dealing with unknown words, and so will ours.

### Adding Special Context Tokens

After implementing the simple tokenizer and seeing that it can receive words that aren't in its vocabulary, we need a way to handle them, and also a way to tell the LLM where one text ends.

This is done with special context tokens. For example, `<|endoftext|>` marks the end of a text, such as the boundary between two unrelated documents, while `<|unk|>` stands in for a word that's not in the vocabulary.

In [36]:
all_tokens = sorted(set(preprocessed))
# Extend the tokens with the special context tokens
all_tokens.extend(["<|endoftext|>", "<|unk|>"])

vocab = {token:index for index, token in enumerate(all_tokens)}
print(len(vocab))

1132


Print the last 5 items in the vocabulary

In [37]:
for i, item in enumerate(list(vocab.items())[-5:]):
    print(item)

('younger', 1127)
('your', 1128)
('yourself', 1129)
('<|endoftext|>', 1130)
('<|unk|>', 1131)


In [38]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)',text)
        # Remove the whitespaces
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        # Replace the unknown tokens with a special context token
        preprocessed = [
            item if item in self.str_to_int
            else "<|unk|>" for item in preprocessed
        ]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

In [39]:
tokenizer = SimpleTokenizerV1(vocab)

text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."

text = " <|endoftext|> ".join((text1, text2))

print("Text being fed to the encoder >>>>> ", text)

Text being fed to the encoder >>>>>  Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.


In [40]:
tokenizer.encode(text)

[1131, 5, 355, 1126, 628, 975, 10, 1130, 55, 988, 956, 984, 722, 988, 1131, 7]

In [41]:
tokenizer.decode(tokenizer.encode(text))

'<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.'

The above shows that we can handle words that don't exist in the original corpus: we augment the vocabulary with special tokens and map every unknown word to `<|unk|>`.

Other special tokens that are sometimes used include:
- BOS - Beginning of Sequence
- EOS - End of Sequence
- PAD - Padding
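
As a rough illustration of how such tokens are typically used, the sketch below wraps two sequences of token IDs in BOS and EOS markers and pads the shorter one so that both have the same length. The ID values and the helper name `pad_batch` are made up for this example and do not come from any real tokenizer.

```python
# Hypothetical special-token IDs, chosen only for this illustration.
BOS_ID, EOS_ID, PAD_ID = 0, 1, 2

def pad_batch(sequences):
    """Wrap each sequence in BOS/EOS and right-pad to a common length."""
    wrapped = [[BOS_ID] + seq + [EOS_ID] for seq in sequences]
    longest = max(len(seq) for seq in wrapped)
    return [seq + [PAD_ID] * (longest - len(seq)) for seq in wrapped]

batch = pad_batch([[11, 12, 13], [21]])
for row in batch:
    print(row)
```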

GPT doesn't use any of the tokens above. It only uses `<|endoftext|>` to mark the end of a sequence, and instead of an unknown-word token it breaks words that aren't in the vocabulary into smaller subword units using byte pair encoding, which we'll look at next.

## Byte Pair Encoding/Tokenisation (BPE)

BPE is a subword tokenization algorithm.

Implementing BPE from scratch is fairly involved, so we'll use the open-source library `tiktoken`, which provides the BPE tokenizer used for GPT-2.
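
To give a feel for the idea, here is a deliberately simplified sketch of the core BPE training loop on a toy string: count adjacent symbol pairs, merge the most frequent pair into a new symbol, and repeat. It works on characters rather than bytes, and the function names are invented for this example; it is not how tiktoken is implemented.

```python
from collections import Counter

def most_frequent_pair(symbols):
    """Return the adjacent pair that occurs most often (first seen wins ties)."""
    pairs = Counter(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(symbols, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

symbols = list("low lower lowest")
for step in range(2):
    pair = most_frequent_pair(symbols)
    symbols = merge_pair(symbols, pair)
    print(f"merge {step + 1}: {pair} -> {symbols}")
```

After two merges the frequent fragment `low` has become a single symbol, while the rarer endings are still split into characters. This is why a BPE tokenizer can represent any unseen word as a sequence of smaller known pieces.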

In [43]:
# pip install tiktoken

In [None]:
# importlib is part of the Python standard library, so it doesn't need to be installed

In [44]:
import importlib.metadata
import tiktoken

print("tiktoken version:", importlib.metadata.version("tiktoken"))

tiktoken version: 0.11.0


We now instantiate the BPE tokenizer, much as we instantiated the simple word-based one earlier, passing the name of the encoding for the model we'd like to use.

In [46]:
tokenizer = tiktoken.get_encoding("gpt2")

In [47]:
text = (
    "Hello, do you like tea? <|endoftext|> In the sunlit terraces"
     "of someunknownPlace."
)

integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(integers)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 1659, 617, 34680, 27271, 13]


In [None]:
strings = tokenizer.decode(integers)
print(strings)

The BPE tokenizer used to train GPT-2 assigns the `<|endoftext|>` special token the largest token ID, 50256. Since IDs start at 0, the vocabulary contains 50,257 tokens in total.

**Here's how the BPE tokenizer allows us to deal with unknown vocabulary tokens.**

In [48]:
integers = tokenizer.encode("lpowenof yeupp ppewkr")
print(integers)

strings = tokenizer.decode(integers)
print(strings)

[34431, 322, 268, 1659, 9838, 7211, 9788, 413, 38584]
lpowenof yeupp ppewkr


## Creating Input-Target Pairs

We'll implement a data loader that fetches the input-target pairs using a sliding window approach.

We first tokenize The Verdict short story we worked with earlier using the BPE algorithm

In [49]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

tokenizer = tiktoken.get_encoding("gpt2")
encoded_text = tokenizer.encode(raw_text)
print(len(encoded_text))

5145


Drop the first 50 tokens from the encoded text to get a sample to work with. We can change this later.

In [50]:
encoded_sample = encoded_text[50:]

A simple way to create input-target pairs is to keep two arrays: the inputs in `X` and the targets in `Y`, where `Y` is `X` shifted by one position.

When the input is `X[1..i]`, the target is `Y[i]` (counting from 1):
```
X = [1, 2, 3, 4]
Y = [2, 3, 4, 5]
```

The context size determines how many tokens are included in the input. It's the length of the token sequence the model is trained to look at in order to predict the next token.

In [51]:
context_size = 4 # length of input

x = encoded_sample[:context_size]
y = encoded_sample[1:context_size + 1] #target output shifted by one

print(f"input: {x}")
print(f"output:     {y}")

input: [290, 4920, 2241, 287]
output:     [4920, 2241, 287, 257]


Since the targets are just the inputs shifted by one position, pairing each prefix of the input with the token that follows it gives the next-word prediction tasks below:

In [52]:
for i in range(1, context_size+1):
    context = encoded_sample[:i]
    desired = encoded_sample[i]
    
    print(context, "---->", desired)

[290] ----> 4920
[290, 4920] ----> 2241
[290, 4920, 2241] ----> 287
[290, 4920, 2241, 287] ----> 257


Everything to the left of the arrow is the input the LLM would receive, and the token to the right is what the LLM is supposed to predict.

Now let's repeat the same example, decoding the token IDs back into text to give a more practical feel.

In [53]:
for i in range(1, context_size+1):
    context = encoded_sample[:i]
    desired = encoded_sample[i]
    
    print(tokenizer.decode(context), "---->", tokenizer.decode([desired]))

 and ---->  established
 and established ---->  himself
 and established himself ---->  in
 and established himself in ---->  a


Next, we'll create the input-target pairs in a more efficient and structured way using a data loader. We'll use PyTorch tensors, which can be thought of as multidimensional arrays. Tensors allow batches to be processed in parallel, which the plain Python lists above don't.

### Implementing a Data Loader

DataLoaders allow us to process data in a more efficient way.

To implement efficient data loaders we collect the inputs in a tensor `x`, where each row represents one input context. A second tensor `y` holds the corresponding targets, which are simply the inputs shifted by one position.

Step 1: Tokenize the entire dataset.

Step 2: Use a sliding window to chunk the book into overlapping sequences of `max_length`.

Step 3: Return the total number of rows in the dataset (`__len__`).

Step 4: Return a single row from the dataset (`__getitem__`).

In [54]:
# Verify if we have PyTorch
print(importlib.metadata.version("torch"))

2.2.2


In [55]:
import torch  # needed for torch.tensor below
from torch.utils.data import Dataset, DataLoader

class GPTDatasetV1(Dataset):
    # stride determines how much we slide.
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        #Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i+max_length]
            output_chunk = token_ids[i+1:i+max_length+1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(output_chunk))

    def __len__(self):
        return len(self.input_ids)

    # Called by the PyTorch DataLoader to fetch a single (input, target) pair
    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

The number of tokens in each row is equal to the context size (`max_length`).

Now we use the GPTDatasetV1 to load the inputs in batches via a PyTorch DataLoader:

Step 1: Initialize the tokenizer

Step 2: Create dataset

Step 3: Create the data loader. `drop_last=True` drops the last batch if it's shorter than the specified `batch_size`, to prevent loss spikes during training.

Step 4: `num_workers` sets the number of CPU processes to use for loading and pre-processing the data.

In [56]:
def create_dataloader_v1(txt, batch_size=4, max_length=256, 
                         stride=128, shuffle=True, drop_last=True, 
                         num_workers=0):
    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers,
    )

    return dataloader
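
The effect of `drop_last` on the number of batches can be worked out with plain arithmetic. The helper below is a hypothetical stand-in written for this note, not part of PyTorch; it assumes the documented behaviour that an incomplete final batch is either kept or dropped.

```python
import math

def num_batches(num_samples, batch_size, drop_last):
    """Number of batches a loader would yield for a dataset of num_samples rows."""
    if drop_last:
        return num_samples // batch_size        # incomplete final batch is discarded
    return math.ceil(num_samples / batch_size)  # incomplete final batch is kept

# Example: 10 rows in batches of 4 leaves 2 rows over.
print(num_batches(10, 4, drop_last=True))
print(num_batches(10, 4, drop_last=False))
```

With `drop_last=True` the two leftover rows are never seen in that epoch, which is the price paid for keeping every batch the same size.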

Let's test the data loader with a batch size of 1 for an LLM with a context size of 4 to see how the `GPTDatasetV1` class and the `create_dataloader_v1` function work together.

In [57]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

Convert the data loader into a Python iterator to fetch the next entry via Python's built-in `next()` function.

In [None]:
import torch

context_length = 4

# A stride of 1 is for demonstration purposes. Setting it to context_length avoids overlapping windows, which helps reduce overfitting.
dataloader = create_dataloader_v1(
    raw_text, batch_size=1, max_length=context_length, stride=1, shuffle=False
)

print("Dataloader ::: ", dataloader)
data_iter = iter(dataloader)

first_batch = next(data_iter)
print("first_batch :::: ", first_batch)

second_batch = next(data_iter)
print("second_batch :::: ", second_batch)

Dataloader :::  <torch.utils.data.dataloader.DataLoader object at 0x10d80d890>
first_batch ::::  [tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]
second_batch ::::  [tensor([[ 367, 2885, 1464, 1807]]), tensor([[2885, 1464, 1807, 3619]])]


The batch size is chosen to suit the training setup.

We can also create batched outputs by increasing `batch_size`. Here we also increase the stride to 4, matching `max_length`, so that the input windows don't overlap, since more overlap could lead to increased overfitting. Keeping the stride equal to the context length uses every token without skipping any words and without repeating any input tokens.

In [63]:
dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride=4, shuffle=False)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)

Inputs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Targets:
 tensor([[  367,  2885,  1464,  1807],
        [ 3619,   402,   271, 10899],
        [ 2138,   257,  7026, 15632],
        [  438,  2016,   257,   922],
        [ 5891,  1576,   438,   568],
        [  340,   373,   645,  1049],
        [ 5975,   284,   502,   284],
        [ 3285,   326,    11,   287]])
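
To see how `max_length` and `stride` interact without involving PyTorch, the sliding window can be sketched on a plain list of integers standing in for token IDs. The function name `sliding_windows` is made up for this sketch; the loop bounds follow the same pattern as `GPTDatasetV1`.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs; each target is its input shifted by one."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[i:i + max_length]
        targets = token_ids[i + 1:i + max_length + 1]
        pairs.append((inputs, targets))
    return pairs

token_ids = list(range(10))  # stand-in for real token IDs

overlapping = sliding_windows(token_ids, max_length=4, stride=1)
disjoint = sliding_windows(token_ids, max_length=4, stride=4)

print(len(overlapping), overlapping[:2])
print(len(disjoint), disjoint)
```

With a stride of 1, consecutive inputs share three of their four tokens. With a stride equal to `max_length`, no input token is repeated, at the cost of producing fewer rows.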
