In [6]:
# pip install -r requirements.txt


# **Byte-Pair Encoding**

GPT-2 uses byte-pair encoding (BPE) as its tokenizer. BPE allows the model to break down words that aren't in its predefined vocabulary into smaller subword units or even individual characters, enabling it to handle out-of-vocabulary words.

# **How does it achieve this without using <|unk|> tokens?**

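### The BPE algorithm ensures that the most common words in the vocabulary are represented as a single token, while rare words are broken down into two or more subwords. If the tokenizer encounters an unfamiliar word during tokenization, it falls back to these smaller subword units, and ultimately to individual bytes, so no <|unk|> token is needed.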

### For instance, if GPT-2's vocabulary doesn't have the word "analytical-nikita.io", it might tokenize it as ["analytical", "-", "nik", "ita", ".", "io"] or some other subword breakdown, depending on its trained BPE merges.
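
To make the idea of "trained BPE merges" concrete, here is a minimal sketch of how merge rules can be learned from a toy word list. It is an illustration only and not GPT-2's actual implementation: it works on characters rather than bytes, skips GPT-2's pre-tokenization step, and the names `learn_bpe_merges`, `merge_pair`, and `toy_words` are made up for this example. Ties between equally frequent pairs are broken alphabetically, which is an assumption chosen to keep the result deterministic.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy version)."""
    # Each distinct word becomes a tuple of single characters, weighted by frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Pick the most frequent pair; ties go to the alphabetically largest pair.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        # Apply the new merge rule to every word in the corpus.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
        merges.append(best)
    return merges


toy_words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
print(learn_bpe_merges(toy_words, 4))
# [('s', 't'), ('e', 'st'), ('o', 'w'), ('l', 'ow')]
```

After four merges the frequent word "low" has become a single symbol, while rarer words are still split into several pieces. A production tokenizer repeats this process tens of thousands of times on a large corpus.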

### The original BPE tokenizer can be found here: https://github.com/openai/gpt-2/blob/master/src/encoder.py

### Since implementing BPE from scratch can be relatively complicated, this demonstration uses the BPE tokenizer from OpenAI's open-source tiktoken library (https://github.com/openai/tiktoken), which implements its core algorithms in Rust to improve computational performance.

## tiktoken is a fast BPE tokeniser for use with OpenAI's models.

# Importing Dataset
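In this implementation I am using The Awakening by Kate Chopin, a novel first published in 1899 and now in the public domain, so there is no copyright restriction on it. The cells below also load a small Shakespeare dialogue dataset from the Hugging Face Hub, whose first record is used for the encoding example.

Note: It is recommended to be aware and respectful of existing copyrights and people's privacy while preparing datasets for training LLMs.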

In [84]:
from datasets import load_dataset

# Load dataset
dataset = load_dataset("pierre-pessarossi/tiny_shakespeare_dialogue")
train_data = dataset['train']


In [96]:
# Despite the name, df holds the "text" column as a sequence of strings, not a DataFrame
df = train_data["text"]


In [100]:
df[0]

'\n    <s>[INST] <<SYS>>\n    You are William Shakespeare. Complete the following dialogue:\n    <</SYS>>>\n\n    \n\nFirst Citizen:\nBefore we proceed any further, hear me speak. [/INST] \n\nAll:\nSpeak, speak. </s>'

In [90]:

import os
import urllib.request

# Download the raw Markdown text of the book once and cache it locally
file_path = "the-awkening.txt"
if not os.path.exists(file_path):
    url = ("https://raw.githubusercontent.com/mlschmitt/classic-books-markdown/main/Kate%20Chopin/The%20Awakening.md")
    urllib.request.urlretrieve(url, file_path)

with open(file_path, "r", encoding="utf-8") as f:
    input_text = f.read()

# Importing tiktoken

In [91]:

# ! pip3 install tiktoken

In [92]:

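# Import the metadata submodule explicitly; a bare "import importlib" does not guarantee it is loaded
from importlib.metadata import version
import tiktoken

print("tiktoken version:", version("tiktoken"))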

tiktoken version: 0.9.0


In [93]:


tokenizer = tiktoken.get_encoding("gpt2")

With the GPT-2 BPE tokenizer from tiktoken instantiated, let's prepare some sample input text that joins two sentences with the <|endoftext|> token:

In [94]:

text1 = "Hello, you're learning data science with analytical-nikita.io"
text2 = "In the sunlit terraces of the palace."

text = " <|endoftext|> ".join((text1, text2))

# Encoding
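The tokenizer has an encode method that splits text into tokens and maps each token to an integer ID via the vocabulary. The allowed_special argument lists the special tokens, such as <|endoftext|>, that may be encoded as single tokens when they appear in the input.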

In [98]:

token_ids = tokenizer.encode(df[0], allowed_special={"<|endoftext|>"})

print(token_ids)


[198, 220, 220, 220, 1279, 82, 36937, 38604, 60, 9959, 50, 16309, 4211, 198, 220, 220, 220, 921, 389, 3977, 22197, 13, 13248, 262, 1708, 10721, 25, 198, 220, 220, 220, 1279, 3556, 50, 16309, 33409, 628, 220, 220, 220, 220, 198, 198, 5962, 22307, 25, 198, 8421, 356, 5120, 597, 2252, 11, 3285, 502, 2740, 13, 46581, 38604, 60, 220, 198, 198, 3237, 25, 198, 5248, 461, 11, 2740, 13, 7359, 82, 29]


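The <|endoftext|> token is assigned a relatively large token ID, namely 50256. It does not appear in the output above because the Shakespeare sample does not contain <|endoftext|>.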

In fact, the GPT-2 BPE tokenizer, whose vocabulary was also used for the original GPT-3 models, has a total vocabulary size of 50,257, with <|endoftext|> being assigned the largest token ID.
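
The figure of 50,257 can be reproduced with simple arithmetic: the GPT-2 vocabulary consists of 256 base byte tokens, 50,000 learned merges, and one special token. The short sketch below just spells that out; the constant names are chosen here for illustration and are not part of tiktoken.

```python
# Components of the GPT-2 vocabulary (constant names are illustrative only)
NUM_BYTE_TOKENS = 256      # one token for every possible byte value
NUM_MERGES = 50_000        # subword tokens created by learned BPE merges
NUM_SPECIAL_TOKENS = 1     # <|endoftext|>

vocab_size = NUM_BYTE_TOKENS + NUM_MERGES + NUM_SPECIAL_TOKENS
endoftext_id = vocab_size - 1  # IDs start at 0, so the last token has the largest ID

print(vocab_size)    # 50257
print(endoftext_id)  # 50256
```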



# Decoding
We can then convert the token IDs back into text using the decode method, similar to the SimpleTokenizer that we built earlier:

In [99]:
output_strings = tokenizer.decode(token_ids)

print(output_strings)


    <s>[INST] <<SYS>>
    You are William Shakespeare. Complete the following dialogue:
    <</SYS>>>

    

First Citizen:
Before we proceed any further, hear me speak. [/INST] 

All:
Speak, speak. </s>


The algorithm underlying BPE breaks down words like analytical-nikita.io that aren't in its predefined vocabulary into smaller subword units or even individual characters.

This enables it to handle out-of-vocabulary words.
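
The following sketch illustrates why this works for any input. It starts from the UTF-8 bytes of the text, so every string has a valid representation, and then applies merge rules in priority order. The function name `bpe_tokenize` and the tiny merge table are hypothetical and chosen so that "nikita" splits into "nik" and "ita"; GPT-2's real merge table is far larger and may split the word differently.

```python
def bpe_tokenize(text, merges):
    """Split `text` into byte-level BPE tokens using an ordered list of merge rules."""
    # Start from individual UTF-8 bytes, so every possible input is covered.
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    # Rules learned earlier get a lower rank and are applied first.
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    while len(tokens) > 1:
        pairs = set(zip(tokens, tokens[1:]))
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no known merge applies any more
        merged = []
        i = 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


# Hypothetical merge table, for illustration only
toy_merges = [(b"n", b"i"), (b"i", b"t"), (b"ni", b"k"), (b"it", b"a")]

print(bpe_tokenize("nikita", toy_merges))  # [b'nik', b'ita']
print(bpe_tokenize("ñ", toy_merges))       # [b'\xc3', b'\xb1']
```

The second call shows the fallback: no merge rule covers "ñ", so it stays as its two UTF-8 bytes, and joining the tokens and decoding them as UTF-8 still recovers the original text.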