### Read file content and split into words
**Try reading file content:**

In [2]:
with open("sample_text.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
print("Total number of characters:", len(raw_text))
print(raw_text[:99])

Total number of characters: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


**Split text on whitespace, punctuation, and double-dash characters:**

In [3]:
import re

preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
# drop empty strings and whitespace-only items
preprocessed = [item for item in preprocessed if item.strip()]

print (preprocessed[:30])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']
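
The parentheses around the pattern matter here: `re.split` returns the matched delimiters only when the pattern contains a capturing group. The short sketch below contrasts the two behaviors on a made-up sentence (`sample` is an assumption for illustration, not text from the file).

```python
import re

# Hypothetical sample sentence, chosen only to exercise the pattern.
sample = "Hello, world. Is this-- a test?"

# Without a capturing group, the delimiters are discarded.
no_group = re.split(r'[,.?]|--|\s', sample)

# With a capturing group, the delimiters come back as separate items.
with_group = re.split(r'([,.?]|--|\s)', sample)

# Drop empty strings and whitespace-only items, as in the cell above.
tokens = [item for item in with_group if item.strip()]
print(tokens)
# ['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']
```

Keeping the punctuation as separate tokens is what later lets the vocabulary assign each punctuation mark its own ID.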


### Convert tokens into token IDs
**Create a vocabulary**

In [4]:
all_tokens = sorted(list(set(preprocessed)))

all_tokens.extend([ "<|endoftext|>", "<|unk|>" ])

vocab_size = len(all_tokens)
print (vocab_size)

1132


In [5]:
vocab = { token:integer for integer, token in enumerate(all_tokens) }
for i, item in enumerate(vocab.items()):
    print (item)
    if i >= 20:
        break

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)
('Begin', 20)


In [6]:
for i, item in enumerate(list(vocab.items())[-5:]):
    print (item)

('younger', 1127)
('your', 1128)
('yourself', 1129)
('<|endoftext|>', 1130)
('<|unk|>', 1131)


**Implement a simple text tokenizer**

In [7]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        # create an inverse vocabulary that maps token ids back to the original text tokens
        self.int_to_str = { i:s for s, i in vocab.items() }

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        # strip surrounding whitespace and drop empty items
        preprocessed = [ item.strip() for item in preprocessed if item.strip() ]
        # replace unknown words by <|unk|> tokens
        preprocessed = [ item if item in self.str_to_int else "<|unk|>" for item in preprocessed ]
        ids = [ self.str_to_int[s] for s in preprocessed ]
        return ids

    def decode(self, ids):
        # convert token ids back into text
        text = " ".join([ self.int_to_str[i] for i in ids ])
        # remove spaces before the specified punctuation
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text
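
Note that the substitution in `decode` only removes whitespace that comes before a punctuation mark. The sketch below applies the same regular expression to a hand-written token list (the `tokens` values are illustrative and are not looked up in the vocabulary) to show that an opening quote or an apostrophe therefore keeps a space after it.

```python
import re

# Hypothetical tokens, as they might come back from the int_to_str lookup.
tokens = ['"', 'It', "'", 's', 'the', 'last', ',', 'you', 'know', '.', '"']

joined = " ".join(tokens)

# Same substitution as in SimpleTokenizerV1.decode:
# remove whitespace that precedes a punctuation mark.
cleaned = re.sub(r'\s+([,.?!"()\'])', r'\1', joined)
print(cleaned)
# " It' s the last, you know."
```

The round trip through `encode` and `decode` is therefore not guaranteed to reproduce the original spacing exactly.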


**Test the tokenizer**

In [8]:
tokenizer = SimpleTokenizerV1(vocab)
text = """"It's the last he painted, you know,"
       Mrs. Gisburn said with pardonable pride."""
ids = tokenizer.encode(text)
print (ids)

[1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]


**Turn the ids back to text**

In [9]:
print (tokenizer.decode(ids))

" It' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.


In [10]:
text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."
text = " <|endoftext|> ".join((text1, text2))

print (tokenizer.encode(text))

[1131, 5, 355, 1126, 628, 975, 10, 1130, 55, 988, 956, 984, 722, 988, 1131, 7]


In [11]:
print (tokenizer.decode(tokenizer.encode(text)))

<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.


### Byte pair encoding


In [12]:
%pip install tiktoken


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [13]:
from importlib.metadata import version
import tiktoken

print ("tiktoken version:", version("tiktoken"))

tiktoken version: 0.12.0


In [14]:
tokenizer = tiktoken.get_encoding("gpt2")

text = ("Hello, do you like tea? <|endoftext|> In the sunlit terraces of the someunknownplace.")

integers = tokenizer.encode(text, allowed_special={ "<|endoftext|>" })

print (integers)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 286, 262, 617, 34680, 5372, 13]


**Convert token IDs back to text**

In [15]:
strings = tokenizer.decode(integers)
print (strings)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of the someunknownplace.


Try the BPE tokenizer from the tiktoken library on the unknown words “Akwirw ier” and
print the individual token IDs. Then, call the decode function on each of the resulting
integers in this list to reproduce the mapping shown in figure 2.11. Lastly, call the
decode method on the token IDs to check whether it can reconstruct the original
input, “Akwirw ier.”

In [16]:
integers = tokenizer.encode("Akwirw ier")
print (integers)

[33901, 86, 343, 86, 220, 959]


In [17]:
strings = tokenizer.decode(integers)
print (strings)

Akwirw ier


In [18]:
for token in integers:
    print (f'{token} == "{tokenizer.decode([token])}"')

33901 == "Ak"
86 == "w"
343 == "ir"
86 == "w"
220 == " "
959 == "ier"


Byte pair encoding builds its vocabulary by iteratively merging frequent characters into subwords and frequent subwords into words. BPE starts by adding all individual characters to its vocabulary (“a,” “b,” etc.). In the next stage, it merges character combinations that frequently occur together into subwords. For example, “d” and “e” may be merged into the subword “de,” which is common in many English words such as “define,” “depend,” “made,” and “hidden.” The merges are determined by a frequency cutoff.
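
The merging idea can be sketched in a few lines of plain Python. This is a toy, character-level illustration and not how tiktoken implements BPE: the corpus, its word counts, and the helper names `most_frequent_pair` and `merge_pair` are all hypothetical, and instead of a frequency cutoff the loop simply performs a fixed number of merges.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs over all words and return the most common one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical toy corpus: word -> count. Each word starts as a tuple of characters.
corpus = {"define": 3, "depend": 2, "made": 2, "hidden": 1}
words = {tuple(word): count for word, count in corpus.items()}

merges = []
for _ in range(3):  # fixed number of merges, an assumption made for this sketch
    pair = most_frequent_pair(words)
    if pair is None:
        break
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges[0])  # ('d', 'e'), the pair that occurs most often in this corpus
```

In this corpus the pair (“d”, “e”) occurs in every word, so it is merged first, matching the “de” example above.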

### Data sampling with a sliding window

In [19]:
enc_text = tokenizer.encode(raw_text)
print (len(enc_text))

5145


In [20]:
# remove the first 50 tokens from the dataset for demonstration purposes,
# as it results in a slightly more interesting text passage in the next steps:
enc_sample = enc_text[50:]

One of the easiest and most intuitive ways to create the input–target pairs for the
next-word prediction task is to create two variables, x and y, where x contains the input
tokens and y contains the targets, which are the inputs shifted by 1:

In [21]:
# the context size determines how many tokens are included in the input
context_size = 4
x = enc_sample[:context_size]
y = enc_sample[1:context_size + 1]

print(f"x: {x}")
print(f"y:      {y}")

x: [290, 4920, 2241, 287]
y:      [4920, 2241, 287, 257]


By processing the inputs along with the targets, which are the inputs shifted by one
position, we can create the next-word prediction tasks (see figure 2.12), as follows:

In [22]:
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(context, "---->", desired)

[290] ----> 4920
[290, 4920] ----> 2241
[290, 4920, 2241] ----> 287
[290, 4920, 2241, 287] ----> 257


Everything left of the arrow (---->) refers to the input an LLM would receive, and
the token ID on the right side of the arrow represents the target token ID that the
LLM is supposed to predict. Let’s repeat the previous code but convert the token IDs
into text:

In [23]:
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(tokenizer.decode(context), "---->", tokenizer.decode([desired]))

 and ---->  established
 and established ---->  himself
 and established himself ---->  in
 and established himself in ---->  a


We’ve now created the input–target pairs that we can use for LLM training.

### Implement an efficient data loader that iterates over the input dataset and returns the inputs and targets as PyTorch tensors

Text sample:

-----------
"In the heart of the city stood the old library, a relic from a bygone era ..."

```
Tensor containing the inputs (x)
-------------------------------
x = tensor([
      ["In",   "the",  "heart", "of"  ],
      ["the",  "city", "stood", "the" ],
      ["old",  "library", ",",  "a"   ],
      ...
    ])

Tensor containing the targets (y)
---------------------------------
y = tensor([
      ["the",  "heart", "of",   "the"   ],   <- next words
      ["city", "stood", "the",  "old"   ],
      ["library", ",",  "a",    "relic" ],
      ...
    ])
```
(Each row in `y` is the corresponding row in `x` shifted by one token, so every entry in `y` is the token that follows the matching entry in `x`.)
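
The illustration above can be reproduced with plain Python lists before any tensors are involved. In this sketch the sentence is split into words by hand (the `words` list is an assumption; the real pipeline works on token IDs), and the stride equals the block size so the rows do not overlap.

```python
# Hand-split version of the sample sentence (illustrative only).
words = ["In", "the", "heart", "of", "the", "city", "stood", "the",
         "old", "library", ",", "a", "relic", "from", "a", "bygone", "era"]

block_size = 4  # tokens per row
stride = 4      # step between rows; equal to block_size means no overlap

x_rows, y_rows = [], []
for i in range(0, len(words) - block_size, stride):
    x_rows.append(words[i:i + block_size])          # inputs
    y_rows.append(words[i + 1:i + block_size + 1])  # inputs shifted by one

for x, y in zip(x_rows, y_rows):
    print(x, "->", y)
```

The first three rows match the `x` and `y` tensors sketched above.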

In [24]:
# preview the index ranges the sliding window will produce
token_len = 1000   # number of tokens in the text
block_size = 256   # length of each input sequence
stride = 128       # number of positions the window advances between blocks

for i in range(0, token_len - block_size, stride):
    print(f"input - {i}:{i + block_size}, target - {i + 1}:{i + block_size + 1}")

input - 0:256, target - 1:257
input - 128:384, target - 129:385
input - 256:512, target - 257:513
input - 384:640, target - 385:641
input - 512:768, target - 513:769
input - 640:896, target - 641:897
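
The number of windows follows directly from the loop bounds. Because the target needs one extra token beyond the input, the last valid start index is `token_len - block_size - 1`, which is why the range stops at `token_len - block_size`. The helper below (`num_windows` is a hypothetical name introduced for this sketch) computes the count without running the loop.

```python
import math

def num_windows(token_len, block_size, stride):
    """Number of (input, target) windows produced by
    range(0, token_len - block_size, stride)."""
    return max(0, math.ceil((token_len - block_size) / stride))

print(num_windows(1000, 256, 128))  # 6, matching the six lines printed above
```

The same formula gives the dataset length for any combination of text length, block size, and stride.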


In [25]:
%pip install torch numpy


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [30]:
import torch
from torch.utils.data import Dataset, DataLoader

class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, block_size, stride):
        self.input_ids = []
        self.target_ids = []

        # tokenize the entire text
        token_ids = tokenizer.encode(txt)

        # use sliding window to chunk the book into overlapping sequences of block_size
        for i in range(0, len(token_ids) - block_size, stride):
            input_chunk = token_ids[i:i + block_size]
            target_chunk = token_ids[i + 1 : i + block_size + 1]

            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    # return the total number of rows in the dataset
    def __len__(self):
        return len(self.input_ids)

    # return a single row from the dataset
    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]


A data loader to generate batches of input–target pairs

In [31]:
def create_dataloader_v1(txt, batch_size=4, block_size=256, stride=128,
                         shuffle=True, drop_last=True, num_workers=0):
    tokenizer = tiktoken.get_encoding("gpt2")
    dataset = GPTDatasetV1(txt, tokenizer, block_size=block_size, stride=stride)
    dataloader = DataLoader(dataset,
                            batch_size=batch_size,
                            shuffle=shuffle,
                            drop_last=drop_last,
                            num_workers=num_workers)
    return dataloader

Let's test the data loader with a batch size of 1, a context size of 4, and a stride of 1:

In [34]:
dataloader = create_dataloader_v1(raw_text, batch_size=1, block_size=4, stride=1, shuffle=False)
data_iter = iter(dataloader)

first_batch = next(data_iter)
print(first_batch)

second_batch = next(data_iter)
print(second_batch)

[tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]
[tensor([[ 367, 2885, 1464, 1807]]), tensor([[2885, 1464, 1807, 3619]])]


In [36]:
dataloader = create_dataloader_v1(raw_text, batch_size=8, block_size=4, stride=4, shuffle=False)
data_iter = iter(dataloader)

inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)



Inputs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Targets:
 tensor([[  367,  2885,  1464,  1807],
        [ 3619,   402,   271, 10899],
        [ 2138,   257,  7026, 15632],
        [  438,  2016,   257,   922],
        [ 5891,  1576,   438,   568],
        [  340,   373,   645,  1049],
        [ 5975,   284,   502,   284],
        [ 3285,   326,    11,   287]])
