<a href="https://colab.research.google.com/github/linhoangce/llm_from_scratch/blob/main/chapter2_preprocess_text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 2.2 Tokenizing Text

In [9]:
# load text
with open('the_verdict.txt', 'r', encoding='utf-8') as f:
  raw_text = f.read()

print(f'Total num of characters: {len(raw_text)}')
raw_text[:99]

Total num of characters: 20479


'I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no '

In [10]:
import re

text = "Hello, work. this, is a test."
result = re.split(r'(\s)', text)
result

['Hello,', ' ', 'work.', ' ', 'this,', ' ', 'is', ' ', 'a', ' ', 'test.']

In [11]:
# split on whitespaces (\s), commas, and periods ([,.])
result = re.split(r'([,.]|\s)', text)
result

['Hello',
 ',',
 '',
 ' ',
 'work',
 '.',
 '',
 ' ',
 'this',
 ',',
 '',
 ' ',
 'is',
 ' ',
 'a',
 ' ',
 'test',
 '.',
 '']

In [12]:
# drop empty strings and whitespace-only tokens
result = [item for item in result if item.strip()]
result

['Hello', ',', 'work', '.', 'this', ',', 'is', 'a', 'test', '.']

### **NOTE**

When developing a simple tokenizer, whether we should encode whitespaces as separate characters or just remove them depends on our application and its requirements. Removing whitespaces reduces the memory and computing requirements. However, keeping whitespaces can be useful if we train models that are sensitive to the exact structure of the text (for example, Python code, which is sensitive to indentation and spacing).
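
To make this trade-off concrete, the following sketch tokenizes a short indented snippet both ways. It reuses the comma/period/whitespace pattern from the cells above; the sample string and variable names are illustrative assumptions. Keeping the whitespace tokens allows the input to be rebuilt exactly, while the stripped token list loses the line break and indentation.

```python
import re

# Illustrative sample: indentation matters in Python source.
snippet = "if x:\n    return y"

# Split on commas, periods, and whitespace; drop only the empty strings.
with_ws = [t for t in re.split(r'([,.]|\s)', snippet) if t != '']
# Additionally drop whitespace-only tokens, as in the cell above.
without_ws = [t for t in with_ws if t.strip()]

print(len(with_ws), len(without_ws))
print(''.join(with_ws) == snippet)      # whitespace kept: lossless round trip
print(' '.join(without_ws) == snippet)  # whitespace removed: layout is lost
```

The shorter token list is cheaper to process, but only the whitespace-preserving list can reproduce the original layout.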

In [13]:
# handle additional punctuation and special characters
text = "Hello, world. Is this-- a test?"
result = re.split('([,.:;?_!"()\']|--|\\s)', text)
result = [item for item in result if item.strip()]
result

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']

In [14]:
# apply this tokenizer to the text
preprocessed = re.split('([,.:;?_!"()\']|--|\\s)', raw_text)
preprocessed = [item for item in preprocessed if item.strip()]
len(preprocessed), preprocessed[:20]

(4690,
 ['I',
  'HAD',
  'always',
  'thought',
  'Jack',
  'Gisburn',
  'rather',
  'a',
  'cheap',
  'genius',
  '--',
  'though',
  'a',
  'good',
  'fellow',
  'enough',
  '--',
  'so',
  'it',
  'was'])

## 2.3 Converting tokens into token IDs

In [15]:
# create a list of all unique tokens
# and sort them alphabetically
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)
vocab_size

1130

In [16]:
vocab = {token:integer for integer,token in enumerate(all_words)}
for i, item in enumerate(vocab.items()):
  print(item)
  if i >= 50:
    break

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)
('Begin', 20)
('Burlington', 21)
('But', 22)
('By', 23)
('Carlo', 24)
('Chicago', 25)
('Claude', 26)
('Come', 27)
('Croft', 28)
('Destroyed', 29)
('Devonshire', 30)
('Don', 31)
('Dubarry', 32)
('Emperors', 33)
('Florence', 34)
('For', 35)
('Gallery', 36)
('Gideon', 37)
('Gisburn', 38)
('Gisburns', 39)
('Grafton', 40)
('Greek', 41)
('Grindle', 42)
('Grindles', 43)
('HAD', 44)
('Had', 45)
('Hang', 46)
('Has', 47)
('He', 48)
('Her', 49)
('Hermia', 50)


In [17]:
class SimpleTokenizerV1:
  """
  1. Stores the vocabulary as an instance attribute for access in
  the encode and decode methods
  2. Creates an inverse vocabulary that maps token IDs back to
  the original text tokens
  3. Processes input text into token IDs
  4. Converts token IDs back into text
  5. Removes spaces before the specified punctuation
  """
  def __init__(self, vocab):
    self.str_to_int = vocab # 1
    self.int_to_str = {i:s for s,i in vocab.items()} # 2

  def encode(self, text): # 3
    preprocessed = re.split('([,.:;?_!"()\']|--|\\s)', text)
    preprocessed = [
        item.strip() for item in preprocessed if item.strip()
    ]
    ids = [self.str_to_int[s] for s in preprocessed]
    return ids

  def decode(self, ids): # 4
    text = " ".join(self.int_to_str[i] for i in ids)
    text = re.sub(r'\s+([,.?!"()\'])', r'\1', text) # 5
    return text

In [18]:
tokenizer = SimpleTokenizerV1(vocab)
text = """"It's the last he painted, you know,"
        Mrs. Gisburn said with pardonable pride."""
ids = tokenizer.encode(text)
ids

[1,
 56,
 2,
 850,
 988,
 602,
 533,
 746,
 5,
 1126,
 596,
 5,
 1,
 67,
 7,
 38,
 851,
 1108,
 754,
 793,
 7]

In [19]:
tokenizer.decode(ids)

'" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.'

In [20]:
# test example
text = "Hello, do you like tea?"
tokenizer.encode(text)

KeyError: 'Hello'

## 2.4 Adding special context tokens

In [21]:
# Adding two special tokens
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])
vocab = {token:integer for integer, token in enumerate(all_tokens)}

len(vocab.items())

1132

In [22]:
for i, item in enumerate(list(vocab.items())[-5:]):
  print(item)

('younger', 1127)
('your', 1128)
('yourself', 1129)
('<|endoftext|>', 1130)
('<|unk|>', 1131)


In [30]:
class SimpleTokenizerV2:
  """
  1. Replaces unknown words by <|unk|> tokens
  2. Removes spaces before the specified punctuation
  """
  def __init__(self, vocab):
    self.str_to_int = vocab
    self.int_to_str = { i:s for s,i in vocab.items()}

  def encode(self, text):
    preprocessed = re.split('([,.:;?_!"()\']|--|\\s)', text)
    preprocessed = [
        item.strip() for item in preprocessed if item.strip()
    ]
    preprocessed = [item if item in self.str_to_int
                    else "<|unk|>" for item in preprocessed] # 1

    ids = [self.str_to_int[s] for s in preprocessed]
    return ids

  def decode(self, ids):
    text = " ".join(self.int_to_str[i] for i in ids)
    text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text) # 2
    return text


In [34]:
text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."
text = " <|endoftext|> ".join((text1, text2))
text

'Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.'

In [35]:
tokenizer = SimpleTokenizerV2(vocab)
tokenizer.encode(text)

[1131, 5, 355, 1126, 628, 975, 10, 1130, 55, 988, 956, 984, 722, 988, 1131, 7]

In [36]:
tokenizer.decode(tokenizer.encode(text))

'<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.'

## 2.5 Byte pair encoding

In [37]:
!pip install tiktoken -q

In [38]:
import tiktoken

tiktoken.__version__

'0.12.0'

In [40]:
tokenizer = tiktoken.get_encoding('gpt2')

In [41]:
text = (
    "Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace."
)
integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
integers

[15496,
 11,
 466,
 345,
 588,
 8887,
 30,
 220,
 50256,
 554,
 262,
 4252,
 18250,
 8812,
 2114,
 286,
 617,
 34680,
 27271,
 13]

In [42]:
string = tokenizer.decode(integers)
string

'Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace.'

In [43]:
word = "Akwirw ier"
tokenizer.encode(word)

[33901, 86, 343, 86, 220, 959]

In [44]:
tokenizer.decode(tokenizer.encode(word))

'Akwirw ier'
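
The round trip above works because byte pair encoding can always fall back to smaller units. The following is a toy sketch of the merge idea behind BPE, not tiktoken's implementation: it starts from single characters and repeatedly merges the most frequent adjacent pair. The function names (`most_frequent_pair`, `merge_pair`), the sample word list, and the number of merges are assumptions chosen for illustration.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words; return the most common."""
    pairs = Counter()
    for symbols in words:
        pairs.update(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in `symbols` with the joined symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

corpus = ["lower", "lowest", "low", "slow"]
words = [list(w) for w in corpus]  # start from individual characters

for _ in range(2):  # two merge steps, enough to form the subword "low"
    pair = most_frequent_pair(words)
    words = [merge_pair(w, pair) for w in words]
    print(pair, words)
```

Because every word starts as a sequence of single characters, a word never seen during merging can still be represented, at worst one character at a time. GPT-2's tokenizer applies the same idea at the byte level with a much larger, pre-trained merge table.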

## 2.6 Data sampling with a sliding window

In [46]:
# tokenize "The Verdict" with BPE tokenizer
tokenizer = tiktoken.get_encoding('gpt2')

with open('the_verdict.txt', 'r', encoding='utf-8') as f:
  raw_text = f.read()

enc_text = tokenizer.encode(raw_text)
len(enc_text)

5145

In [49]:
enc_sample = enc_text[50:]

In [52]:
# context_size determines how many tokens are in input
context_size = 4
x = enc_sample[:context_size]
y = enc_sample[1:context_size+1]
print(f'x: {x}')
print(f'y:      {y}')

x: [290, 4920, 2241, 287]
y:      [4920, 2241, 287, 257]


In [53]:
# create next-word prediction tasks
for i in range(1, context_size+1):
  context = enc_sample[:i]
  desired = enc_sample[i]
  print(context, '------>', desired)

[290] ------> 4920
[290, 4920] ------> 2241
[290, 4920, 2241] ------> 287
[290, 4920, 2241, 287] ------> 257


In [56]:
# convert token IDs into text
for i in range(1, context_size+1):
  context = enc_sample[:i]
  desired = enc_sample[i]
  print(tokenizer.decode(context), '----->', tokenizer.decode([desired]))

 and ----->  established
 and established ----->  himself
 and established himself ----->  in
 and established himself in ----->  a


In [57]:
import torch
from torch.utils.data import Dataset, DataLoader

class GPTDatasetV1(Dataset):
  """
  1. Tokenizes the entire text
  2. Uses a sliding window to chunk the book into overlapping sequences of max_length
  3. Returns the total number of rows in the dataset
  4. Returns a single row from the dataset
  """
  def __init__(self, text, tokenizer, max_length, stride):
    self.input_ids = []
    self.target_ids = []

    token_ids = tokenizer.encode(text) # 1

    for i in range(0, len(token_ids) - max_length, stride): # 2
      input_chunk = token_ids[i : i+max_length]
      target_chunk = token_ids[i+1 : i+max_length+1]
      self.input_ids.append(torch.tensor(input_chunk))
      self.target_ids.append(torch.tensor(target_chunk))

  def __len__(self):
    return len(self.input_ids)

  def __getitem__(self, idx):
    return self.input_ids[idx], self.target_ids[idx]
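
The windowing loop in `GPTDatasetV1.__init__` can be examined in isolation. The sketch below mirrors that loop on a plain list of integers standing in for token IDs (the helper name `sliding_windows` and the toy ID range are assumptions), so the effect of `max_length` and `stride` is visible without PyTorch or a tokenizer.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs; each target is its input shifted by one."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        pairs.append((token_ids[i:i + max_length],
                      token_ids[i + 1:i + max_length + 1]))
    return pairs

toy_ids = list(range(10))  # stand-in for real token IDs

for stride in (1, 4):
    pairs = sliding_windows(toy_ids, max_length=4, stride=stride)
    print(f'stride={stride}: {len(pairs)} pairs, '
          f'first={pairs[0]}, last={pairs[-1]}')
```

With `stride` equal to `max_length`, consecutive inputs do not overlap; a smaller stride produces more samples that share tokens with their neighbors.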


In [59]:
def create_dataloader_v1(text, batch_size=4, max_length=256,
                         stride=128, shuffle=True,
                         drop_last=True, num_workers=0):
  """
  1. Initializes the tokenizer
  2. Creates dataset
  3. drop_last=True drops the last batch if it is shorter than
   the specified batch_size to prevent loss spikes during training
  """
  tokenizer = tiktoken.get_encoding('gpt2')
  dataset = GPTDatasetV1(text, tokenizer, max_length, stride)
  dataloader = DataLoader(
      dataset,
      batch_size=batch_size,
      shuffle=shuffle,
      drop_last=drop_last,
      num_workers=num_workers
  )
  return dataloader

with open('the_verdict.txt', 'r', encoding='utf-8') as f:
  raw_text = f.read()

dataloader = create_dataloader_v1(raw_text, batch_size=1,
                                  max_length=4, stride=1,
                                  shuffle=False)
data_iter = iter(dataloader)
first_batch = next(data_iter)
first_batch

[tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]

In [61]:
dataloader = create_dataloader_v1(raw_text, batch_size=8,
                                  max_length=8, stride=4,
                                  shuffle=False)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print(f'Inputs: {inputs}')
print(f'Targets: {targets}')

Inputs: tensor([[   40,   367,  2885,  1464,  1807,  3619,   402,   271],
        [ 1807,  3619,   402,   271, 10899,  2138,   257,  7026],
        [10899,  2138,   257,  7026, 15632,   438,  2016,   257],
        [15632,   438,  2016,   257,   922,  5891,  1576,   438],
        [  922,  5891,  1576,   438,   568,   340,   373,   645],
        [  568,   340,   373,   645,  1049,  5975,   284,   502],
        [ 1049,  5975,   284,   502,   284,  3285,   326,    11],
        [  284,  3285,   326,    11,   287,   262,  6001,   286]])
Targets: tensor([[  367,  2885,  1464,  1807,  3619,   402,   271, 10899],
        [ 3619,   402,   271, 10899,  2138,   257,  7026, 15632],
        [ 2138,   257,  7026, 15632,   438,  2016,   257,   922],
        [  438,  2016,   257,   922,  5891,  1576,   438,   568],
        [ 5891,  1576,   438,   568,   340,   373,   645,  1049],
        [  340,   373,   645,  1049,  5975,   284,   502,   284],
        [ 5975,   284,   502,   284,  3285,   326,    11, 

## 2.7 Creating token embeddings

In [62]:
input_ids = torch.tensor([2, 3, 5, 1])
vocab_size = 6
output_dim = 3

torch.manual_seed(123)

# create embedding layer initialized with random values of shape
# (vocab_size, output_dim)
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
embedding_layer.weight

Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)

In [63]:
# look up the embedding vector for a single token ID (row 3 of the weight matrix)
embedding_layer(torch.tensor([3]))

tensor([[-0.4015,  0.9666, -1.1481]], grad_fn=<EmbeddingBackward0>)

In [64]:
embedding_layer(input_ids)

tensor([[ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-2.8400, -0.7849, -1.4096],
        [ 0.9178,  1.5810,  1.3010]], grad_fn=<EmbeddingBackward0>)
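
Each output row above is simply the row of the weight matrix selected by the token ID. This is mathematically equivalent to multiplying a one-hot vector by the weight matrix, which the sketch below checks in plain Python using the weight values printed earlier. The helper names are illustrative assumptions; PyTorch's `Embedding` layer performs the lookup directly rather than through a matrix product.

```python
# Weight values copied from the embedding_layer.weight output above.
weights = [
    [ 0.3374, -0.1778, -0.1690],
    [ 0.9178,  1.5810,  1.3010],
    [ 1.2753, -0.2010, -0.1606],
    [-0.4015,  0.9666, -1.1481],
    [-1.1589,  0.3255, -0.6315],
    [-2.8400, -0.7849, -1.4096],
]

def one_hot(index, size):
    """Vector of zeros with a single 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def vec_mat(vec, mat):
    """Multiply a row vector by a matrix given as a list of rows."""
    return [sum(v * row[col] for v, row in zip(vec, mat))
            for col in range(len(mat[0]))]

input_ids = [2, 3, 5, 1]
for token_id in input_ids:
    via_matmul = vec_mat(one_hot(token_id, len(weights)), weights)
    assert via_matmul == weights[token_id]  # same row either way
    print(token_id, via_matmul)
```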

## 2.8 Encoding word positions

A minor shortcoming of LLMs is that their self-attention mechanism has no notion of position or order for the tokens within a sequence. The embedding layer introduced earlier maps the same token ID to the same vector representation, regardless of where that token ID is positioned in the input sequence.

In principle, the deterministic, position-independent embedding of the token ID is good for reproducibility purposes. However, since the self-attention mechanism of LLMs itself is also position-agnostic, it is helpful to inject additional position information into the LLM.

To achieve this, we can use two broad categories of position-aware embeddings: relative positional embeddings and absolute positional embeddings. Absolute positional embeddings are directly associated with specific positions in a sequence. For each position in the input sequence, a unique embedding is added to the token's embedding to convey its exact location. For instance, the first token will have a specific positional embedding, the second token another distinct embedding, and so on.

Instead of focusing on the absolute position of a token, the emphasis of relative positional embeddings is on the relative position or distance between tokens. This means the model learns the relationships in terms of "how far apart" rather than "at which exact position". The advantage here is that the model can generalize better to sequences of varying lengths, even if it hasn't seen such lengths during training.

Both types of positional embeddings aim to augment the capacity of LLMs to understand the order and relationships between tokens, ensuring more accurate and context-aware predictions. The choice between them often depends on the specific application and the nature of the data being processed.
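
The difference between the two schemes can be illustrated by what each one indexes on. In the sketch below (the sequence length and variable names are illustrative assumptions), an absolute scheme keys one embedding on each token's index, while a relative scheme keys on the offset between a query position and a key position. Any two positions that are equally far apart share the same offset.

```python
seq_len = 4

# Absolute: one index per position, used to look up one vector per position.
absolute_positions = list(range(seq_len))

# Relative: offset j - i between query position i and key position j.
relative_offsets = [[j - i for j in range(seq_len)] for i in range(seq_len)]

print(absolute_positions)
for row in relative_offsets:
    print(row)
```

The notebook cells that follow use the absolute scheme, with one learned vector per position.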

In [75]:
vocab_size = 50257
output_dim = 256
token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

max_length = 4
dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=max_length,
    stride=max_length, shuffle=False
)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print(f'Token IDs:\n{inputs}\n'
      f'Targets:\n{targets}\n')

# the data batch consists of 8 text samples with 4 tokens each
print(f'Inputs shape: {inputs.shape}')

Token IDs:
tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])
Targets:
tensor([[  367,  2885,  1464,  1807],
        [ 3619,   402,   271, 10899],
        [ 2138,   257,  7026, 15632],
        [  438,  2016,   257,   922],
        [ 5891,  1576,   438,   568],
        [  340,   373,   645,  1049],
        [ 5975,   284,   502,   284],
        [ 3285,   326,    11,   287]])

Inputs shape: torch.Size([8, 4])


In [67]:
# use embedding layer to embed these token IDs
# into 256-dimensional vectors
token_embeddings = token_embedding_layer(inputs)
# 8x4x256: each token ID is embedded as a 256-dimensional vector
token_embeddings.shape

torch.Size([8, 4, 256])

In [76]:
# create an absolute positional embedding layer with
# the same embedding dimension as token_embedding_layer
context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length,
                                         output_dim)
pos_embeddings = pos_embedding_layer(torch.arange(context_length))
pos_embeddings.shape

torch.Size([4, 256])

In [77]:
pos_embeddings

tensor([[-0.1465, -0.0941, -0.2197,  ...,  0.6289,  0.3811, -1.1052],
        [ 1.0556, -0.1906, -0.3616,  ...,  0.4743, -0.4925, -1.7694],
        [-1.6037, -0.0578, -0.7502,  ...,  1.1309,  0.5436,  1.1704],
        [-0.0675,  0.9043,  1.0599,  ..., -1.8826,  0.1242, -1.4644]],
       grad_fn=<EmbeddingBackward0>)

In [78]:
input_embeddings = token_embeddings + pos_embeddings
input_embeddings.shape

torch.Size([8, 4, 256])
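
The addition above relies on broadcasting: the `[4, 256]` positional tensor is added to each of the 8 samples in the `[8, 4, 256]` token tensor. A small pure-Python sketch with made-up numbers (all values and names below are illustrative assumptions) shows the consequence: the same token ID yields different input embeddings at different positions.

```python
# Toy tables: 3 token IDs and 2 positions, embedding dimension 2.
token_table = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.5, 0.5]}
pos_table = [[0.25, 0.0], [0.0, 0.25]]

batch = [[2, 2], [0, 1]]  # two samples, two token IDs each

# The same pos_table is reused for every sample in the batch.
input_embeddings = [
    [[t + p for t, p in zip(token_table[tok], pos_table[pos])]
     for pos, tok in enumerate(sample)]
    for sample in batch
]

# Token ID 2 appears at positions 0 and 1 of the first sample.
print(input_embeddings[0][0])
print(input_embeddings[0][1])
```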

## Summary

* LLMs require textual data to be converted into numerical vectors, known as embeddings, since they can't process raw text. Embeddings transform discrete data (like words or images) into continuous vector spaces, making them compatible with neural network operations.

* As the first step, raw text is broken into tokens, which can be words or characters. Then, the tokens are converted into integer representations, termed token IDs.

* Special tokens, such as `<|unk|>` and `<|endoftext|>`, can be added to enhance the model's understanding and handle various contexts, such as unknown words or marking the boundary between unrelated texts.

* The byte pair encoding (BPE) tokenizer used for LLMs like GPT-2 and GPT-3 can efficiently handle unknown words by breaking them down into subword units or individual characters.

* We use a sliding window approach on tokenized data to generate input-target pairs for LLM training.

* Embedding layers in PyTorch function as a lookup operation, retrieving vectors corresponding to token IDs. The resulting embedding vectors provide continuous representations of tokens, which is crucial for training deep learning models like LLMs.

* While token embeddings provide consistent vector representations for each token, they lack a sense of the token's position in a sequence. To rectify this, two main types of positional embeddings exist: absolute and relative.

**Procedure**: input text -> broken into individual tokens (words, subwords, or characters) -> converted into token IDs using a vocabulary -> converted into embedding vectors -> added to positional embeddings of the same size -> input embeddings as input for the main LLM layers.