# Understanding LLM Input Data

### 1. Tokenizing text
In this section, we tokenize text, which means breaking it into smaller units, such as individual words and punctuation characters.

<img src="./metadata/02.png" alt="tokenization example" style="display: block; margin: 0 auto; width:700px; height:auto;" />

In [2]:
import torch 
import tiktoken

Load the raw text, a public-domain dataset, to tokenize.

In [3]:
with open("./datasets/the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
    
print("Total number of characters:", len(raw_text))
print(raw_text[:99])

Total number of characters: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


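- The goal is to tokenize and embed this text so that it can be fed into an LLM
- Let's develop a simple tokenizer based on some sample text that can later be applied to our text dataset
- Let's use a regex to split the text on punctuation, double dashes, and whitespace; note that the list comprehension below removes only empty strings, so whitespace characters remain as separate tokens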

In [4]:
import re
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item for item in preprocessed if item]
print(preprocessed[:38])

['I', ' ', 'HAD', ' ', 'always', ' ', 'thought', ' ', 'Jack', ' ', 'Gisburn', ' ', 'rather', ' ', 'a', ' ', 'cheap', ' ', 'genius', '--', 'though', ' ', 'a', ' ', 'good', ' ', 'fellow', ' ', 'enough', '--', 'so', ' ', 'it', ' ', 'was', ' ', 'no', ' ']


In [5]:
print("Number of tokens:", len(preprocessed))

Number of tokens: 8405
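
The token count above includes whitespace-only entries. As a minimal sketch on a made-up sample sentence (the names `sample`, `kept`, and `dropped` are illustrative and not part of the notebook), the following compares keeping whitespace tokens with dropping them via `strip()`:

```python
import re

# Hypothetical sample sentence, used only for illustration
sample = "Hello, world. Is this-- a test?"

# Same split pattern as above: the capturing group keeps the delimiters
pieces = re.split(r'([,.:;?_!"()\']|--|\s)', sample)

# Variant 1: drop only empty strings (whitespace tokens are kept)
kept = [item for item in pieces if item]

# Variant 2: also drop whitespace-only strings
dropped = [item.strip() for item in pieces if item.strip()]

print(kept)
print(dropped)
```

Dropping whitespace shortens the token list and the vocabulary, at the cost of losing exact spacing and indentation when decoding; which variant is preferable depends on the application.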


### 2. Converting tokens into token IDs

- Next, we convert the text tokens into token IDs that we can process via embedding layers later
- For this, we need to build a vocabulary
- The vocabulary contains all the unique tokens in the input text, each mapped to an integer ID


<img src="./metadata/03.png" alt="tokenization example" style="display: block; margin: 0 auto; width:700px; height:auto;" />

In [6]:
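# Note: set() discards the order produced by sorted(), so the token-to-ID mapping
# below is arbitrary and can change between runs; sorted(set(preprocessed)) would
# give a reproducible, alphabetically ordered vocabulary
all_words = set(sorted(preprocessed))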
vocab_size = len(all_words)

print(" The vocab size is : " , vocab_size)

vocab = {token:integer for integer,token in enumerate(all_words)}

for i,t in enumerate(vocab.items()):
    print(t)
    if i>20:
        break

 The vocab size is :  1132
('loathing', 0)
('axioms', 1)
('brown', 2)
('big', 3)
('got', 4)
('luxury', 5)
('persuasively', 6)
('heard', 7)
('learned', 8)
('man', 9)
('much', 10)
('reminded', 11)
('dragged', 12)
('transmute', 13)
('diagnosis', 14)
('gave', 15)
('mighty', 16)
('Ah', 17)
('deprecatingly', 18)
('hardly', 19)
('profusion', 20)
('Only', 21)
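
Because the IDs above come from iterating over a `set`, their assignment is arbitrary. A minimal sketch of a reproducible alternative, assuming a small hypothetical token list in place of `preprocessed` (the names `tokens` and `toy_vocab` are illustrative):

```python
# Hypothetical token list standing in for `preprocessed`
tokens = ["the", "cat", ",", "the", "hat", "."]

# De-duplicate first, then sort, so IDs follow a stable order
unique_tokens = sorted(set(tokens))
toy_vocab = {token: idx for idx, token in enumerate(unique_tokens)}

print(toy_vocab)
```

Sorting after de-duplication means the same text always yields the same IDs, which matters if encoded data or a trained embedding layer is reused across sessions.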


Let's now put it all together into a tokenizer class
- The `encode` function turns text into token IDs
- The `decode` function turns token IDs back into text

In [7]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}

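    def encode(self, text):
        # Use the same split pattern as the vocabulary-building step (including ':' and ';')
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        # Drop empty and whitespace-only pieces
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove the space inserted before punctuation characters
        text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)
        return text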

In [8]:
tokenizer = SimpleTokenizerV1(vocab)

text = """"It's the last he painted, you know," 
           Mrs. Gisburn said with pardonable pride."""
ids = tokenizer.encode(text)
print("Encoded ids " , ids)

# Decode back 
decoded_text = tokenizer.decode(ids)
print("Decoded Text : ", decoded_text)

Encoded ids  [1068, 906, 650, 947, 518, 1114, 926, 520, 487, 355, 294, 487, 1068, 985, 914, 250, 308, 408, 74, 376, 914]
Decoded Text :  " It' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.
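
One limitation of `SimpleTokenizerV1` is that `encode` looks each token up directly in the vocabulary, so any token absent from the training text raises a `KeyError`. The sketch below, on a hypothetical toy vocabulary, shows one common workaround: reserving an `<|unk|>` entry as a fallback. The class name `ToyTokenizerWithUnk` and the `<|unk|>` token are illustrative assumptions, not part of the notebook.

```python
import re

class ToyTokenizerWithUnk:
    """Illustrative variant of SimpleTokenizerV1 with an unknown-token fallback."""

    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        pieces = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        pieces = [p.strip() for p in pieces if p.strip()]
        unk_id = self.str_to_int["<|unk|>"]
        # Fall back to the <|unk|> ID instead of raising KeyError
        return [self.str_to_int.get(p, unk_id) for p in pieces]

    def decode(self, ids):
        text = " ".join(self.int_to_str[i] for i in ids)
        return re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)

# Hypothetical toy vocabulary with a reserved unknown token at the end
toy_tokens = sorted({"Hello", ",", "do", "you", "like", "tea", "?"}) + ["<|unk|>"]
toy_vocab = {tok: idx for idx, tok in enumerate(toy_tokens)}
toy_tokenizer = ToyTokenizerWithUnk(toy_vocab)

ids = toy_tokenizer.encode("Hello, do you like coffee?")
print(ids)
print(toy_tokenizer.decode(ids))
```

The fallback keeps encoding from failing, but every unknown word collapses to the same ID, so the original word cannot be recovered on decoding.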


### 3. Byte pair encoding

- GPT-2 used byte pair encoding (BPE) as its tokenizer
- It allows the model to break down words that aren't in its predefined vocabulary into smaller subword units or even individual characters, enabling it to handle out-of-vocabulary words
- For instance, if GPT-2's vocabulary doesn't have the word "unfamiliarword," it might tokenize it as ["unfam", "iliar", "word"] or some other subword breakdown, depending on its trained BPE merges
- In this lecture, we are using the BPE tokenizer from OpenAI's open-source tiktoken library, which implements its core algorithms in Rust to improve computational performance
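
To make the idea of "trained BPE merges" more concrete, here is a deliberately simplified, character-level sketch of the merge step: count adjacent symbol pairs and fuse the most frequent one. It is an illustration only, not tiktoken's implementation (GPT-2's tokenizer works on bytes and uses a pre-trained merge table); the toy corpus and the helper names `most_frequent_pair` and `merge_pair` are assumptions made for this example.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most common one."""
    counts = Counter()
    for symbols in words:
        counts.update(zip(symbols, symbols[1:]))
    return counts.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged_words = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged_words.append(out)
    return merged_words

# Hypothetical toy corpus, split into characters
corpus = [list(w) for w in ["low", "lower", "lowest", "slow"]]

# Apply two merge steps and show how frequent pairs become single symbols
for step in range(2):
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    print(f"merge {step + 1}: {pair} -> {corpus}")
```

After enough merges, frequent words become single tokens while rare words remain split into smaller pieces, which is why an unseen string can still be encoded.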

In [9]:
import tiktoken
tiktoken.__version__

'0.12.0'

In [10]:
tokenizer = tiktoken.get_encoding("gpt2")
text = (
    "Hello, do you like tea? <|endoftext|> In the sunlit terraces"
     "of someunknownPlace."
)

integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(integers)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 1659, 617, 34680, 27271, 13]


In [11]:
strings = tokenizer.decode(integers)
print(strings)

Hello, do you like tea? <|endoftext|> In the sunlit terracesof someunknownPlace.


In [12]:
# breaking down unknown words
tokenizer.encode("SDFGHJKLIUYTSD", allowed_special={"<|endoftext|>"})

[50, 8068, 17511, 41, 42, 31271, 52, 56, 4694, 35]

### 4. Data sampling with a sliding window
- Now, let's look at how to build the data loader for LLM training
- We train LLMs to generate one token at a time, so we prepare the training data such that the next token in a sequence is the target to predict
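
A minimal sketch of this input-target relationship, using made-up token IDs rather than the encoded dataset (the names `token_ids`, `context_size`, and `pairs` are illustrative):

```python
# Hypothetical token IDs standing in for an encoded text
token_ids = [10, 20, 30, 40, 50, 60]
context_size = 4  # illustrative window length

# Inputs and targets are the same window, offset by one position
x = token_ids[:context_size]
y = token_ids[1:context_size + 1]
print("x:", x)
print("y:", y)

# Each prefix of x predicts the token that follows it
pairs = [(x[:i], y[i - 1]) for i in range(1, context_size + 1)]
for context, target in pairs:
    print(context, "->", target)
```

A single window of length `context_size` therefore yields `context_size` prediction tasks, one per position.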

<img src="./metadata/04.png" alt="tokenization example" style="display: block; margin: 0 auto; width:700px; height:auto;" />

**Creating a data loader class**

In [13]:
import torch 
import tiktoken
from torch.utils.data import Dataset, DataLoader

In [19]:
class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.target_ids = []
        self.input_ids = []

        # Tokenize the entire dataset
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids)-max_length, stride):
            input_chunk = token_ids[i: i + max_length]
            # Targets are the inputs shifted one position to the right
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

In [31]:
tokenizer = tiktoken.get_encoding("gpt2")
with open("./datasets/the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

max_length = 256
stride = 128
dataset = GPTDatasetV1(raw_text, tokenizer, max_length, stride)

In [32]:
# Create dataloader
dataloader = DataLoader(
    dataset,
    batch_size=4,
    shuffle=True,
    drop_last=True,
    num_workers=0
)
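
To see how `max_length`, `stride`, and `drop_last` interact, the following sketch reproduces the window arithmetic of `GPTDatasetV1` in plain Python. The sizes are hypothetical values chosen for readability, not taken from the dataset, and the helper `window_starts` is an illustrative name; the sketch assumes `stride <= max_length`.

```python
def window_starts(num_tokens, max_length, stride):
    """Start indices produced by range(0, num_tokens - max_length, stride)."""
    return list(range(0, num_tokens - max_length, stride))

# Hypothetical sizes chosen for readability
num_tokens, max_length, stride, batch_size = 20, 8, 4, 2

starts = window_starts(num_tokens, max_length, stride)
overlap = max_length - stride             # tokens shared by consecutive windows
full_batches = len(starts) // batch_size  # drop_last=True discards a partial batch

print("window starts:", starts)
print("overlap per window:", overlap)
print("full batches:", full_batches)
```

A smaller stride produces more, heavily overlapping samples; setting `stride` equal to `max_length` removes the overlap entirely.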