## Tokenization

### Creating Tokens

In [None]:
with open("verdict.txt", "r", encoding="utf-8-sig") as f:
  raw_text=f.read()

print(f"Total number of characters: {len(raw_text)}")
print(f"The first 100 characters: {raw_text[:100]}")

In [None]:
import re

# Splitting based on white-spaces
text="Hello, world. This, is a text."
result=re.split(r'(\s)', text)
print(result)

# Splitting based on white-spaces, commas and full-stops
result=re.split(r'([,.]|\s)', text)
print(result)

Here, even the whitespace characters are treated as tokens, so we first have to remove them from the result. For now, we drop whitespace to reduce memory consumption. Later we will look at a tokenization scheme that keeps whitespace.

In [None]:
# item.strip() is an empty string (falsy) for whitespace-only items
result=[item for item in result if item.strip()]
print(result)
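Whether to keep whitespace is a design choice. The following is a minimal sketch, using the same sample sentence, that compares both options: dropping whitespace shortens the token list, while keeping it lets us rebuild the input exactly. The variable names `with_spaces` and `without_spaces` are made up for this illustration.

```python
import re

sample = "Hello, world. This, is a text."

# Keep every non-empty piece produced by the split, including whitespace tokens.
with_spaces = [tok for tok in re.split(r'([,.]|\s)', sample) if tok]

# Drop whitespace-only pieces, as in the cell above.
without_spaces = [tok for tok in with_spaces if tok.strip()]

print(len(with_spaces), with_spaces)
print(len(without_spaces), without_spaces)

# Joining the whitespace-preserving tokens restores the input exactly.
print("".join(with_spaces) == sample)
```

The shorter list is cheaper to store and process, but the original spacing can no longer be recovered from it.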

In [None]:
text="Hello, world. Is this-- a test?"

# Splitting on all kinds of punctuation
result=re.split(r'([,.;:?_!"()\']|--|\s)', text)
result=[item.strip() for item in result if item.strip()]
print(result)

In [None]:
# Applying on our verdict.txt
pre_processed_text=re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
pre_processed_text=[item.strip() for item in pre_processed_text if item.strip()]

print(pre_processed_text[:30])

In [None]:
print(len(pre_processed_text))

### Creating Token IDs

In [None]:
all_words=sorted(set(pre_processed_text))
vocab_size=len(all_words)

print(vocab_size)

In [None]:
vocabulary={token:integer for integer, token in enumerate(all_words)}

In [None]:
for i, item in enumerate(vocabulary.items()):
  print(item)

  if i>=20:
    break

### Tokenizer Class

In [None]:
class SimpleTokenizerV1:
  def __init__(self, vocab):
    self.str_to_int=vocab
    self.int_to_str={i:s for s, i in vocab.items()}

  def encode(self, text):
    pre_processed=re.split(r'([,.;:?_!"()\']|--|\s)', text)
    pre_processed=[item.strip() for item in pre_processed if item.strip()]
    ids=[self.str_to_int[s] for s in pre_processed]
    return ids
  
  def decode(self, ids):
    text=" ".join([self.int_to_str[i] for i in ids])
    # Removing spaces before the specified punctuation marks
    text=re.sub(r'\s+([,.:;?_!"()\'])', r'\1', text)
    return text

In [None]:
tokenizer=SimpleTokenizerV1(vocabulary)

# This text is present in the dataset
text=""""It's the last he painted, you know,"
         Mrs. Gisburn said with pardonable pride."""

ids=tokenizer.encode(text)
print(ids)

In [None]:
text=tokenizer.decode(ids)
print(text)
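Note that `decode` is only an approximate inverse of `encode`. The sketch below replays the two decoding steps (join with spaces, then remove spaces before punctuation) on a hand-written token list, so it runs without the vocabulary. The token list is an assumption chosen for illustration.

```python
import re

# Tokens as the splitting regex would produce them for
# "Hello, world. Is this-- a test?"
tokens = ["Hello", ",", "world", ".", "Is", "this", "--", "a", "test", "?"]

# Step 1: join every token with a single space.
joined = " ".join(tokens)

# Step 2: remove the whitespace that the join placed before punctuation marks.
cleaned = re.sub(r'\s+([,.:;?_!"()\'])', r'\1', joined)

print(joined)   # Hello , world . Is this -- a test ?
print(cleaned)  # Hello, world. Is this -- a test?
```

The double dash is not part of the character class in the substitution, so `this--` comes back as `this --`. The round trip preserves the tokens but not necessarily the original spacing.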

In [None]:
# Text which is not present in the dataset
text="Hello, I am Kushal"

ids=tokenizer.encode(text)
print(ids)

<pre>
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[26], line 4
      1 # Text which is not present in the dataset
      2 text = "Hello, I am Kushal"
----> 4 ids = tokenizer.encode(text)
      5 print(ids)

Cell In[23], line 9, in SimpleTokenizerV1.encode(self, text)
      7 pre_processed = re.split(r'([,.;:?_!"()\']|--|\s)', text)
      8 pre_processed = [item.strip() for item in pre_processed if item.strip()]
----> 9 ids = [self.str_to_int[s] for s in pre_processed]
     10 return ids

KeyError: 'Hello'
</pre>

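The word `Hello` does not occur in the training text, so it has no entry in the vocabulary and the lookup raises a `KeyError`. A large and diverse dataset makes such errors less frequent, but to handle unknown words gracefully we use `special context tokens`.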
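Before encoding, it can help to check which words would trigger this error. The following is a small sketch with a hypothetical helper, `find_unknown_tokens`, and a tiny stand-in vocabulary (`toy_vocab`) so that it runs on its own. Neither name is part of the notebook.

```python
import re

def find_unknown_tokens(text, vocab):
    """Return the tokens of `text` that have no entry in `vocab`."""
    pieces = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    tokens = [p.strip() for p in pieces if p.strip()]
    return [t for t in tokens if t not in vocab]

# Tiny stand-in vocabulary (assumption); in the notebook this would be `vocabulary`.
toy_vocab = {tok: i for i, tok in enumerate(sorted({",", "I", "am", "the"}))}

unknown = find_unknown_tokens("Hello, I am Kushal", toy_vocab)
print(unknown)  # ['Hello', 'Kushal']
```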

### Special Context Tokens

`<|unk|>`: Stands in for tokens that are not present in the vocabulary.  
`<|endoftext|>`: Separates multiple, unrelated text sources when they are concatenated.

In [None]:
all_tokens=sorted(list(set(pre_processed_text)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])

vocab={token:integer for integer, token in enumerate(all_tokens)}

In [None]:
len(vocab)

In [None]:
for item in list(vocab.items())[-5:]:
  print(item)

In [None]:
class SimpleTokenizerV2:
  def __init__(self, vocab):
    self.str_to_int=vocab
    self.int_to_str={i:s for s, i in vocab.items()}

  def encode(self, text):
    pre_processed=re.split(r'([,.:;?_!"()\']|--|\s)', text)
    pre_processed=[item.strip() for item in pre_processed if item.strip()]
    pre_processed=[
      item if item in self.str_to_int
      else "<|unk|>" for item in pre_processed
    ]
    ids=[self.str_to_int[s] for s in pre_processed]
    return ids
  
  def decode(self, ids):
    text=" ".join([self.int_to_str[i] for i in ids])
    # Removing spaces before the specified punctuation marks
    text=re.sub(r'\s+([,.:;?_!"()\'])', r'\1', text)
    return text

In [None]:
tokenizer=SimpleTokenizerV2(vocab)

text1="Hello, do you like tea?"
text2="In the sunlight terraces of the palace."

text=" <|endoftext|> ".join((text1, text2))
print(text)

In [None]:
ids=tokenizer.encode(text)
print(ids)

In [None]:
decoded_text=tokenizer.decode(ids)
print(decoded_text)

Depending on the LLM, some tokenizers use additional special tokens:

`[BOS] (beginning of sequence)`: This token marks the start of a text. It signals to the LLM where a piece of content begins.  
`[EOS] (end of sequence)`: This token is placed at the end of a text and is useful when concatenating multiple unrelated texts, similar to `<|endoftext|>`.  
`[PAD] (padding)`: When training LLMs with batch sizes larger than one, the batch might contain texts of varying lengths. To ensure all texts have the same length, the shorter texts are padded with the `[PAD]` token up to the length of the longest text in the batch.

- The tokenizer used for GPT models does not need any of the tokens mentioned above. For simplicity, it uses only `<|endoftext|>`.

- To deal with out-of-vocabulary words, GPT does not use `<|unk|>`. Instead, it uses a `byte pair encoding` (BPE) tokenizer, which breaks words down into subword units.
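The padding idea described above can be sketched in a few lines. This is only an illustration: the helper `pad_batch`, the token IDs, and the choice of `0` as the padding ID are all assumptions, not values from any real tokenizer.

```python
def pad_batch(batch, pad_id):
    """Right-pad every sequence in `batch` to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in batch]

PAD_ID = 0  # hypothetical ID reserved for the padding token

# Three token-ID sequences of different lengths (made-up IDs).
batch = [[5, 8, 2], [7], [4, 9]]

padded = pad_batch(batch, PAD_ID)
print(padded)  # [[5, 8, 2], [7, 0, 0], [4, 9, 0]]
```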

### Byte Pair Encoding

`tiktoken` is a fast BPE tokenizer for use with OpenAI's models.

In [None]:
# `import importlib` alone does not guarantee that the metadata submodule is loaded
import importlib.metadata
import tiktoken

print(importlib.metadata.version("tiktoken"))

In [None]:
tokenizer=tiktoken.get_encoding('gpt2')

In [None]:
text=(
  "Hello, do you like tea? <|endoftext|> In the sunlight terraces "
  "of someunknownPlace."
)

integers=tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(integers)

In [None]:
strings=tokenizer.decode(integers)
print(strings)

- The `<|endoftext|>` token is assigned a relatively large token ID, 50256.

- The BPE tokenizer has a vocabulary size of 50257, with `<|endoftext|>` assigned the largest token ID.

- The BPE tokenizer encodes and decodes unknown words such as `someunknownPlace` correctly.

- The algorithm underlying BPE breaks down words that are not in its predefined vocabulary into smaller subword units, or even individual characters, which solves the `out-of-vocabulary` problem.

In [None]:
integers=tokenizer.encode("Akwirw ier")
print(integers)

strings=tokenizer.decode(integers)
print(strings)
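To give some intuition for how BPE arrives at subword units, here is a toy sketch of its merge procedure: start from single characters and repeatedly merge the most frequent adjacent pair. This is not tiktoken's implementation. GPT-2's tokenizer works on bytes and applies a pre-trained merge table, whereas the helpers `most_frequent_pair` and `merge_pair` and the four-word corpus below are made up for illustration.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words; return the most common one."""
    pairs = Counter()
    for symbols in words:
        pairs.update(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

# Toy corpus (assumption): each word starts as a list of single characters.
corpus = ["low", "lower", "lowest", "slow"]
words = [list(w) for w in corpus]

for _ in range(2):  # perform two merge steps
    pair = most_frequent_pair(words)
    words = [merge_pair(w, pair) for w in words]
    print(pair, words)
```

After two merges the frequent fragment `low` has become a single symbol, while the rarer endings remain split into characters. A word never seen during training can still be represented from such pieces, which is why no `<|unk|>` token is needed.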