# Notebook to test various tokenizers

## Basic BPE Tokenizer without any special tokens

Let's test the basic BPE tokenizer, which has no special tokens. <br>
Let's first train it on a small amount of text (a few paragraphs) and check that it works correctly. <br>
As a sanity check, let's compare the results with the tiktoken tokenizer using the gpt2 encoding. <br>
Let's start with a vocab_size of 500.

In [1]:
from basic_bpe import BasicBPE
from tiktoken import get_encoding

In [2]:
bpe = BasicBPE(500)

In [3]:
tiktoken_gpt2 = get_encoding('gpt2')

Let's test on the opening paragraphs of Richard Feynman's Wikipedia page (https://en.wikipedia.org/wiki/Richard_Feynman).

In [4]:
text = """Richard Phillips Feynman (/ˈfaɪnmən/; May 11, 1918 – February 15, 1988) was an American theoretical physicist, known for his work in the path integral formulation of quantum mechanics, the theory of quantum electrodynamics, the physics of the superfluidity of supercooled liquid helium, as well as his work in particle physics for which he proposed the parton model. For his contributions to the development of quantum electrodynamics, Feynman received the Nobel Prize in Physics in 1965 jointly with Julian Schwinger and Shin'ichirō Tomonaga.

Feynman developed a widely used pictorial representation scheme for the mathematical expressions describing the behavior of subatomic particles, which later became known as Feynman diagrams. During his lifetime, Feynman became one of the best-known scientists in the world. In a 1999 poll of 130 leading physicists worldwide by the British journal Physics World, he was ranked the seventh-greatest physicist of all time.[1]

He assisted in the development of the atomic bomb during World War II and became known to the wider public in the 1980s as a member of the Rogers Commission, the panel that investigated the Space Shuttle Challenger disaster. Along with his work in theoretical physics, Feynman has been credited with pioneering the field of quantum computing and introducing the concept of nanotechnology. He held the Richard C. Tolman professorship in theoretical physics at the California Institute of Technology.

Feynman was a keen popularizer of physics through both books and lectures, including a 1959 talk on top-down nanotechnology called There's Plenty of Room at the Bottom and the three-volume publication of his undergraduate lectures, The Feynman Lectures on Physics. Feynman also became known through his autobiographical books Surely You're Joking, Mr. Feynman! and What Do You Care What Other People Think?, and books written about him such as Tuva or Bust! by Ralph Leighton and the biography Genius: The Life and Science of Richard Feynman by James Gleick."""

In [5]:
text

"Richard Phillips Feynman (/ˈfaɪnmən/; May 11, 1918 – February 15, 1988) was an American theoretical physicist, known for his work in the path integral formulation of quantum mechanics, the theory of quantum electrodynamics, the physics of the superfluidity of supercooled liquid helium, as well as his work in particle physics for which he proposed the parton model. For his contributions to the development of quantum electrodynamics, Feynman received the Nobel Prize in Physics in 1965 jointly with Julian Schwinger and Shin'ichirō Tomonaga.\n\nFeynman developed a widely used pictorial representation scheme for the mathematical expressions describing the behavior of subatomic particles, which later became known as Feynman diagrams. During his lifetime, Feynman became one of the best-known scientists in the world. In a 1999 poll of 130 leading physicists worldwide by the British journal Physics World, he was ranked the seventh-greatest physicist of all time.[1]\n\nHe assisted in the develo

### Train

In [6]:
bpe.train(text)

In [7]:
# merges
bpe.merges

{(101, 32): 256,
 (116, 104): 257,
 (110, 32): 258,
 (115, 32): 259,
 (105, 99): 260,
 (257, 256): 261,
 (100, 32): 262,
 (111, 114): 263,
 (111, 102): 264,
 (97, 110): 265,
 (264, 32): 266,
 (105, 110): 267,
 (44, 32): 268,
 (101, 114): 269,
 (97, 108): 270,
 (97, 116): 271,
 (97, 258): 272,
 (101, 99): 273,
 (101, 108): 274,
 (121, 110): 275,
 (121, 32): 276,
 (70, 101): 277,
 (104, 105): 278,
 (277, 275): 279,
 (279, 109): 280,
 (112, 104): 281,
 (121, 115): 282,
 (282, 260): 283,
 (115, 116): 284,
 (97, 114): 285,
 (280, 272): 286,
 (101, 262): 287,
 (267, 103): 288,
 (97, 259): 289,
 (105, 258): 290,
 (105, 111): 291,
 (101, 110): 292,
 (265, 262): 293,
 (114, 101): 294,
 (270, 32): 295,
 (281, 283): 296,
 (110, 111): 297,
 (116, 117): 298,
 (97, 109): 299,
 (46, 32): 300,
 (32, 261): 301,
 (32, 266): 302,
 (111, 109): 303,
 (278, 259): 304,
 (115, 268): 305,
 (111, 112): 306,
 (97, 32): 307,
 (49, 57): 308,
 (119, 258): 309,
 (104, 32): 310,
 (119, 105): 311,
 (111, 117): 312,
 (

In [8]:
# vocabulary
bpe.vocab

{0: b'\x00',
 1: b'\x01',
 2: b'\x02',
 3: b'\x03',
 4: b'\x04',
 5: b'\x05',
 6: b'\x06',
 7: b'\x07',
 8: b'\x08',
 9: b'\t',
 10: b'\n',
 11: b'\x0b',
 12: b'\x0c',
 13: b'\r',
 14: b'\x0e',
 15: b'\x0f',
 16: b'\x10',
 17: b'\x11',
 18: b'\x12',
 19: b'\x13',
 20: b'\x14',
 21: b'\x15',
 22: b'\x16',
 23: b'\x17',
 24: b'\x18',
 25: b'\x19',
 26: b'\x1a',
 27: b'\x1b',
 28: b'\x1c',
 29: b'\x1d',
 30: b'\x1e',
 31: b'\x1f',
 32: b' ',
 33: b'!',
 34: b'"',
 35: b'#',
 36: b'$',
 37: b'%',
 38: b'&',
 39: b"'",
 40: b'(',
 41: b')',
 42: b'*',
 43: b'+',
 44: b',',
 45: b'-',
 46: b'.',
 47: b'/',
 48: b'0',
 49: b'1',
 50: b'2',
 51: b'3',
 52: b'4',
 53: b'5',
 54: b'6',
 55: b'7',
 56: b'8',
 57: b'9',
 58: b':',
 59: b';',
 60: b'<',
 61: b'=',
 62: b'>',
 63: b'?',
 64: b'@',
 65: b'A',
 66: b'B',
 67: b'C',
 68: b'D',
 69: b'E',
 70: b'F',
 71: b'G',
 72: b'H',
 73: b'I',
 74: b'J',
 75: b'K',
 76: b'L',
 77: b'M',
 78: b'N',
 79: b'O',
 80: b'P',
 81: b'Q',
 82: b'R',
 83: b'
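
The vocabulary above can be derived from the merge table: ids 0 to 255 are the raw byte values, and every merged id is the concatenation of its two parents. The `build_vocab` method of `BasicBPE` is not shown in this notebook, so the following is only a minimal sketch of how such a step might work; `build_vocab_sketch` is a hypothetical name, and the real implementation may differ.

```python
# Hypothetical reconstruction; the real build_vocab in basic_bpe may differ.
def build_vocab_sketch(merges: dict[tuple[int, int], int]) -> dict[int, bytes]:
    # Ids 0..255 map to the single raw bytes.
    vocab = {i: bytes([i]) for i in range(256)}
    # Process merges in increasing id order, so both parents of a merge
    # already exist in the vocabulary when the merge is reached.
    for (left, right), idx in sorted(merges.items(), key=lambda kv: kv[1]):
        vocab[idx] = vocab[left] + vocab[right]
    return vocab


# The first six merges copied from the output of bpe.merges above.
first_merges = {
    (101, 32): 256,
    (116, 104): 257,
    (110, 32): 258,
    (115, 32): 259,
    (105, 99): 260,
    (257, 256): 261,
}
vocab = build_vocab_sketch(first_merges)
print(vocab[256], vocab[257], vocab[261])  # b'e ' b'th' b'the '
```

Note that id 261 is built from two earlier merges (257 and 256), which is how longer tokens such as `b'the '` emerge.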

Let's try a very simple encode and decode example.

In [9]:
enc = bpe.encode("Feynman served as doctoral advisor to 30 students.")

In [10]:
enc

[286,
 115,
 269,
 118,
 287,
 289,
 100,
 111,
 99,
 116,
 263,
 295,
 97,
 100,
 118,
 105,
 115,
 429,
 325,
 32,
 51,
 48,
 32,
 284,
 117,
 100,
 328,
 115,
 46]

In [11]:
# Decode
bpe.decode(enc)

'Feynman served as doctoral advisor to 30 students.'
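
According to the class source shown later in this notebook, `decode` joins the bytes of all tokens first and only then decodes them as UTF-8 with `errors="replace"`. The order matters because tokens are byte sequences, so a token boundary can fall inside a multi-byte character. The snippet below is a small standalone illustration using plain Python bytes, not the tokenizer itself.

```python
# "ō" (as in "Shin'ichirō") takes two bytes in UTF-8.
raw = "ō".encode("utf-8")
print(raw)  # b'\xc5\x8d'

# Decoding each byte on its own cannot succeed;
# errors="replace" substitutes U+FFFD for each invalid piece.
pieces = [bytes([b]).decode("utf-8", errors="replace") for b in raw]
print(pieces == ["\ufffd", "\ufffd"])  # True

# Joining the bytes before decoding restores the character.
joined = raw.decode("utf-8", errors="replace")
print(joined)  # ō
```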

In [16]:
def compare(text, ids, tokenizer):
    """Print our tokenizer's tokens next to the tiktoken gpt2 tokens for the same text."""
    print(f"Text: {text}")
    print("My Encoder: ")
    for i, idx in enumerate(ids):
        try:
            print(f'{i+1} : {tokenizer.vocab[idx]}')
        except KeyError:
            # The id is missing from the tokenizer's vocabulary
            print(f'{i+1} : UNK')

    print("TikToken gpt2 encoder")
    for i, idx in enumerate(tiktoken_gpt2.encode(text)):
        try:
            print(f'{i+1} : {tiktoken_gpt2.decode_single_token_bytes(idx)}')
        except KeyError:
            print(f'{i+1} : UNK')

In [17]:
# For comparison with the tiktoken tokenizer, let's see how each tokenizer splits the sentence
sent = "Feynman served as doctoral advisor to 30 students."

compare(sent, enc, bpe)

Text: Feynman served as doctoral advisor to 30 students.
My Encoder: 
1 : b'Feynman '
2 : b's'
3 : b'er'
4 : b'v'
5 : b'ed '
6 : b'as '
7 : b'd'
8 : b'o'
9 : b'c'
10 : b't'
11 : b'or'
12 : b'al '
13 : b'a'
14 : b'd'
15 : b'v'
16 : b'i'
17 : b's'
18 : b'or '
19 : b'to'
20 : b' '
21 : b'3'
22 : b'0'
23 : b' '
24 : b'st'
25 : b'u'
26 : b'd'
27 : b'ent'
28 : b's'
29 : b'.'
TikToken gpt2 encoder
1 : b'Fe'
2 : b'yn'
3 : b'man'
4 : b' served'
5 : b' as'
6 : b' doctoral'
7 : b' advisor'
8 : b' to'
9 : b' 30'
10 : b' students'
11 : b'.'
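
Two things explain the difference. The gpt2 encoding has a vocabulary of 50257 tokens learned from a far larger corpus, while our tokenizer has 500 tokens learned from four paragraphs. In addition, GPT-2 splits the text into chunks with a regular expression before applying BPE, so merges never cross a chunk boundary; that is why its tokens start with a space (`b' served'`) while ours can end with one (`b'ed '`). The sketch below shows the idea with a simplified pattern. `SPLIT_PATTERN` and `pre_split` are hypothetical names, and the pattern is only an approximation of the GPT-2 one, which uses Unicode property classes that need the third-party `regex` module.

```python
import re

# Simplified approximation of the GPT-2 split pattern using only the
# standard library. Alternatives, in order: common contractions, letters,
# digits, punctuation (including "_"), and whitespace.
SPLIT_PATTERN = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d"
    r"| ?[^\W\d_]+| ?\d+| ?(?:[^\s\w]|_)+|\s+(?!\S)|\s+"
)


def pre_split(text: str) -> list[str]:
    """Split text into chunks; BPE merges would then run inside each chunk."""
    return SPLIT_PATTERN.findall(text)


chunks = pre_split("Feynman served as doctoral advisor to 30 students.")
print(chunks)
# ['Feynman', ' served', ' as', ' doctoral', ' advisor', ' to', ' 30', ' students', '.']
```

The chunks concatenate back to the original text, so the split itself loses nothing.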


#### Let's try a bigger text now

Let's encode another passage from Feynman's Wikipedia page.

In [None]:
text2 = """Surely You're Joking, Mr. Feynman!
In the 1960s, Feynman began thinking of writing an autobiography, and he began granting interviews to historians. In the 1980s, working with Ralph Leighton (Robert Leighton's son), he recorded chapters on audio tape that Ralph transcribed. The book was published in 1985 as Surely You're Joking, Mr. Feynman! and became a best-seller.[174]

Gell-Mann was upset by Feynman's account in the book of the weak interaction work, and threatened to sue, resulting in a correction being inserted in later editions.[175] This incident was just the latest provocation in decades of bad feeling between the two scientists. Gell-Mann often expressed frustration at the attention Feynman received;[176] he remarked: "[Feynman] was a great scientist, but he spent a great deal of his effort generating anecdotes about himself."[177]

Feynman has been criticized for a chapter in the book entitled "You Just Ask Them?", where he describes how he learned to seduce women at a bar he went to in the summer of 1946. A mentor taught him to ask a woman if she would sleep with him before buying her anything. He describes seeing women at the bar as "bitches" in his thoughts, and tells a story of how he told a woman named Ann that she was "worse than a whore" after Ann persuaded him to buy her sandwiches by telling him he could eat them at her place, but then, after he bought them, saying they actually could not eat together because another man was coming over. Later on that same evening, Ann returned to the bar to take Feynman to her place.[178][179][180] Feynman states at the end of the chapter that this behaviour was not typical of him: \"So it worked even with an ordinary girl! But no matter how effective the lesson was, I never really used it after that. I didn't enjoy doing it that way. But it was interesting to know that things worked much differently from how I was brought up.\""""

In [None]:
text2

In [None]:
enc1 = bpe.encode(text2)

In [None]:
#decode
bpe.decode(enc1)

In [None]:
assert text2 == bpe.decode(enc1)
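
The assertion shows that the round trip is lossless, but not how much the tokenizer compresses the text. One simple measure is the number of UTF-8 bytes per token. The helper below is a hypothetical addition, not part of `basic_bpe`, and the ids in the toy call are made up.

```python
def compression_ratio(text: str, ids: list[int]) -> float:
    """Return UTF-8 bytes per token; 1.0 means no merge was applied."""
    if not ids:
        raise ValueError("ids must not be empty")
    return len(text.encode("utf-8")) / len(ids)


# Toy check with made-up ids: 6 bytes represented by 3 tokens.
ratio = compression_ratio("banana", [300, 301, 97])
print(ratio)  # 2.0
```

In this notebook the call would be `compression_ratio(text2, enc1)`.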

So the round trip works on a small amount of text. <br>
Now let's see how the tokenizer handles Python code. <br>
Let's check how it works on its own source code :) <br>

```
class BasicBPE(Tokenizer):
    def __init__(self, vocab_size: int=32000) -> None:
        super().__init__()
        assert vocab_size > 256, f"Vocabulary size should be more than 256."
        self.vocab_size = vocab_size
        self.num_merges = vocab_size - 256

    def train(self, text: str) -> None:
        text_bytes = bytes(text, encoding="utf-8")
        ids = list(text_bytes)
        for i in range(self.num_merges):
            stats = get_stats(ids)
            pair = stats.most_common(1)[0][0]
            idx = 256 + i
            ids = merge(ids, pair, idx)
            self.merges[pair] = idx

        self.build_vocab()

    def encode(self, text: str) -> List[int]:
        text_bytes = text.encode("utf-8")
        ids = list(text_bytes)

        while(len(ids) >= 2):
            stats = get_stats(ids)
            pair = min(stats, key=lambda p: self.merges.get(p, float("inf")))
            if pair not in self.merges:
                break
            idx = self.merges[pair]
            ids = merge(ids, pair, idx)

        return ids
    
    def decode(self, ids: List[int]) -> str:
        text_bytes = b"".join(self.vocab[idx] for idx in ids)
        text = text_bytes.decode("utf-8", errors="replace")
        return text
```
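
The class relies on two helpers, `get_stats` and `merge`, whose source is not shown in this notebook. The following is a minimal sketch of what they might look like, assuming `get_stats` returns a `collections.Counter` of adjacent id pairs (which would be consistent with the `most_common` call in `train`) and `merge` replaces each occurrence of a pair with the new id. The actual helpers may differ.

```python
from collections import Counter


def get_stats(ids: list[int]) -> Counter:
    """Count how often each adjacent pair of ids occurs."""
    return Counter(zip(ids, ids[1:]))


def merge(ids: list[int], pair: tuple[int, int], idx: int) -> list[int]:
    """Replace non-overlapping occurrences of `pair` with `idx`, left to right."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(idx)
            i += 2  # skip both members of the merged pair
        else:
            out.append(ids[i])
            i += 1
    return out


# One training step on a toy string.
ids = list("aaabdaaabac".encode("utf-8"))
pair = get_stats(ids).most_common(1)[0][0]
merged = merge(ids, pair, 256)
print(pair)    # (97, 97)
print(merged)  # [256, 97, 98, 100, 256, 97, 98, 97, 99]
```

`train` repeats this step `num_merges` times, and `encode` replays the learned merges in the order they were created by always picking the pair with the lowest merge id.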

In [None]:
my_code = """
class BasicBPE(Tokenizer):
    def __init__(self, vocab_size: int=32000) -> None:
        super().__init__()
        assert vocab_size > 256, f"Vocabulary size should be more than 256."
        self.vocab_size = vocab_size
        self.num_merges = vocab_size - 256

    def train(self, text: str) -> None:
        text_bytes = bytes(text, encoding="utf-8")
        ids = list(text_bytes)
        for i in range(self.num_merges):
            stats = get_stats(ids)
            pair = stats.most_common(1)[0][0]
            idx = 256 + i
            ids = merge(ids, pair, idx)
            self.merges[pair] = idx

        self.build_vocab()

    def encode(self, text: str) -> List[int]:
        text_bytes = text.encode("utf-8")
        ids = list(text_bytes)

        while(len(ids) >= 2):
            stats = get_stats(ids)
            pair = min(stats, key=lambda p: self.merges.get(p, float("inf")))
            if pair not in self.merges:
                break
            idx = self.merges[pair]
            ids = merge(ids, pair, idx)

        return ids
    
    def decode(self, ids: List[int]) -> str:
        text_bytes = b"".join(self.vocab[idx] for idx in ids)
        text = text_bytes.decode("utf-8", errors="replace")
        return text
"""

In [None]:
# Let's encode
my_code_enc = bpe.encode(my_code)

In [None]:
# Let's decode and see what we get
bpe.decode(my_code_enc)

In [None]:
my_code

In [None]:
# Let's check how the code is tokenized
print("Tokens: ")
for i, idx in enumerate(my_code_enc):
    try:
        print(f'{i+1} : {bpe.vocab[idx]}')
    except KeyError:
        # The id is missing from the vocabulary
        print(f'{i+1} : UNK')

We can see that:
1. The tokenizer has no notion of language keywords such as self, def and class; it mostly splits them into individual characters.
2. The tokenizer has no notion of tabs or indentation.
3. This raises a question: would training the tokenizer on more Python code improve the result?
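
The first observation can also be checked directly against a vocabulary instead of reading the token list by eye. The helper below is a hypothetical sketch that assumes the vocabulary is a dict from id to bytes, as in the `bpe.vocab` output above; the toy vocabulary is made up for illustration.

```python
def whole_token_keywords(vocab: dict[int, bytes], keywords: list[str]) -> dict[str, bool]:
    """Report, per keyword, whether some token equals it (ignoring surrounding spaces)."""
    stripped = {token.strip(b" ") for token in vocab.values()}
    return {kw: kw.encode("utf-8") in stripped for kw in keywords}


# Made-up vocabulary: "def" and "return" exist as whole tokens, "class" does not.
toy_vocab = {0: b"d", 1: b"e", 2: b"f", 3: b"def ", 4: b" return"}
found = whole_token_keywords(toy_vocab, ["def", "return", "class"])
print(found)  # {'def': True, 'return': True, 'class': False}
```

In this notebook the call would be `whole_token_keywords(bpe.vocab, ["self", "def", "class"])`.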

In [None]:
# Let's train the tokenizer on code and then see how it behaves on plain text.
# Find all Python files under the parent directory,
# keep one for testing and read all the others into a single training string.
import glob

# Sort the paths so that the train/test split is the same on every run
all_py = sorted(glob.glob("../**/*.py", recursive=True))
train_py, test_py = all_py[:-1], all_py[-1]

In [None]:
train_py, test_py

In [None]:
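train_text = ""

# Concatenate all training files into one string
for fp in train_py:
    with open(fp, 'r', encoding="utf-8") as f:
        train_text += f.read()

In [None]:
# Read the single held-out file
with open(test_py, 'r', encoding="utf-8") as f:
    test_text = f.read()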

Now let's train a simple tokenizer with vocab size = 2500

In [None]:
bpe_code = BasicBPE(2500)

In [None]:
bpe_code.train(train_text)

In [None]:
# Let's encode the test text now
code_enc = bpe_code.encode(test_text)

In [None]:
bpe_code.decode(code_enc)

In [None]:
# Let's check how the test code is tokenized
print("Tokens: ")
for i, idx in enumerate(code_enc):
    try:
        print(f'{i+1} : {bpe_code.vocab[idx]}')
    except KeyError:
        # The id is missing from the vocabulary
        print(f'{i+1} : UNK')

The tokenizer now starts to tokenize some keywords such as import, return and def as whole tokens. <br>
There are still some odd tokens that span several lines, for example ") -> None:\n        super().__init__()\n " <br>
With more training data, the tokenizer may improve.
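
Tokens like this arise because the basic BPE merges any frequent byte pair, including pairs that cross a line break. To find them without scanning the whole token list, one could filter the vocabulary for tokens that contain a newline between non-whitespace content. `multiline_tokens` is a hypothetical helper and the toy vocabulary is made up, apart from the odd token quoted above.

```python
def multiline_tokens(vocab: dict[int, bytes]) -> list[bytes]:
    """Return tokens with a newline inside their non-whitespace content, longest first."""
    hits = [tok for tok in vocab.values() if b"\n" in tok.strip()]
    return sorted(hits, key=len, reverse=True)


toy_vocab = {
    256: b"def ",
    257: b"\n    ",  # pure indentation, not reported
    258: b") -> None:\n        super().__init__()\n ",
}
odd = multiline_tokens(toy_vocab)
print(len(odd))  # 1
```

In this notebook the call would be `multiline_tokens(bpe_code.vocab)`.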

#### Let's try the code-trained tokenizer on plain text

In [None]:
enc_text = bpe_code.encode(text)

In [None]:
# Let's check how the plain text is tokenized
print("Tokens: ")
for i, idx in enumerate(enc_text):
    try:
        print(f'{i+1} : {bpe_code.vocab[idx]}')
    except KeyError:
        # The id is missing from the vocabulary
        print(f'{i+1} : UNK')

Most of the text is tokenized at the character level, since the merges were learned from Python code rather than English prose.
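
To put a number on "mostly", one could measure the share of tokens that are still single raw bytes. In `BasicBPE` merged tokens get ids from 256 upwards, so any id below 256 is an unmerged byte. `raw_byte_fraction` is a hypothetical helper; for illustration it is applied here to the ids printed earlier for the sentence encoded with the first tokenizer, not to `enc_text`.

```python
def raw_byte_fraction(ids: list[int]) -> float:
    """Return the fraction of token ids below 256, i.e. single unmerged bytes."""
    if not ids:
        raise ValueError("ids must not be empty")
    return sum(1 for i in ids if i < 256) / len(ids)


# Ids copied from the earlier output of
# bpe.encode("Feynman served as doctoral advisor to 30 students.")
example_ids = [286, 115, 269, 118, 287, 289, 100, 111, 99, 116, 263, 295,
               97, 100, 118, 105, 115, 429, 325, 32, 51, 48, 32, 284, 117,
               100, 328, 115, 46]
fraction = raw_byte_fraction(example_ids)
print(round(fraction, 2))  # 0.66
```

In this notebook the call would be `raw_byte_fraction(enc_text)`; a value close to 1.0 would confirm character-level tokenization.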

In [None]:
bpe_code.decode(enc_text)