# 2 Byte-Pair Encoding (BPE) Tokenizer

### Problem (unicode1): Understanding Unicode

In [1]:
# (a) '\x00'. chr(0) is the null character (U+0000), a control character with no printable glyph, so Python displays it as the hex escape of its code point
chr(0)

'\x00'

In [2]:
# (b) __repr__() returns a string containing the escaped form '\x00' (quotes included), whereas printing emits the raw, invisible character
chr(0).__repr__()

"'\\x00'"

In [3]:
# (c) this is the string representation of the value
chr(0)

'\x00'

In [4]:
print(chr(0)) # nothing visible is printed: \x00 is written to the output but has no glyph

 


In [5]:
"this is a test" + chr(0) + "string" # the value of the string, where \x00 is used

'this is a test\x00string'

In [6]:
print("this is a test" + chr(0) + "string") # again, \x00 is written to the output but has no visible glyph

this is a test string
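
The behaviour above can be checked directly. The following is a minimal sketch, using only the standard library, which confirms that the null character is a real one-character string and that `print` does write it to the output stream even though nothing visible appears. The variable names are illustrative.

```python
import io
from contextlib import redirect_stdout

nul = chr(0)

# chr(0) is a one-character string whose code point is 0 (U+0000, NUL).
print(len(nul), ord(nul))

# repr() escapes the character; the string itself holds the raw character.
print(repr(nul))
print(nul == "\x00")

# Capture what print() writes: the NUL character is emitted, just not visible.
buffer = io.StringIO()
with redirect_stdout(buffer):
    print("this is a test" + nul + "string")
captured = buffer.getvalue()
print("\x00" in captured)
```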


### Problem (unicode2): Unicode Encodings

In [7]:
# (a) UTF-8 uses fewer bytes to encode this string than UTF-16 and UTF-32

test_string = "你好hello"
# Single quotes inside the f-strings keep this runnable on Python versions before 3.12
print(f"utf-8: {test_string.encode('utf-8')}")
print(f"utf-8 len: {len(test_string.encode('utf-8'))}")
print(f"utf-16 len: {len(test_string.encode('utf-16'))}")
print(f"utf-32 len: {len(test_string.encode('utf-32'))}")

utf-8: b'\xe4\xbd\xa0\xe5\xa5\xbdhello'
utf-8 len: 11
utf-16 len: 16
utf-32 len: 32
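
Part of the gap comes from the byte-order mark: Python's `utf-16` and `utf-32` codecs prepend a BOM (2 and 4 bytes respectively), while the explicit little-endian variants do not. The sketch below compares the lengths with and without the BOM and lists the per-character cost in UTF-8. The `lengths` and `per_char` names are illustrative.

```python
test_string = "你好hello"

# "utf-16" and "utf-32" prepend a byte-order mark (BOM); the "-le" variants do not.
codecs_to_compare = ("utf-8", "utf-16", "utf-16-le", "utf-32", "utf-32-le")
lengths = {codec: len(test_string.encode(codec)) for codec in codecs_to_compare}
for codec, n_bytes in lengths.items():
    print(f"{codec:>9}: {n_bytes} bytes")

# Per-character cost in UTF-8: 3 bytes for each CJK character here, 1 for each ASCII letter.
per_char = [(ch, len(ch.encode("utf-8"))) for ch in test_string]
print(per_char)
```

Even without the BOM, UTF-16 and UTF-32 spend 2 and 4 bytes on every ASCII character, which is why UTF-8 stays the most compact for mostly-English text.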


In [8]:
# (b) Some Unicode characters are encoded as multiple bytes in UTF-8. This function decodes each byte on its own, so it fails on any multi-byte character.

def decode_utf8_bytes_to_str_wrong(bytestring: bytes) -> str:
    return "".join([bytes([b]).decode("utf-8") for b in bytestring])

decode_utf8_bytes_to_str_wrong(b'\xe4\xbd\xa0\xe5\xa5\xbdhello')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 0: unexpected end of data
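
For contrast, a minimal sketch of a version that handles multi-byte characters: decode the whole byte string in one call so that the bytes of each character stay together. The function name `decode_utf8_bytes_to_str` is a hypothetical counterpart to the function above.

```python
def decode_utf8_bytes_to_str(bytestring: bytes) -> str:
    # Decode the whole byte string at once so multi-byte sequences stay together.
    return bytestring.decode("utf-8")

encoded = "你好hello".encode("utf-8")
decoded = decode_utf8_bytes_to_str(encoded)
print(decoded)

# The byte-by-byte approach only works when every character is a single byte (ASCII).
ascii_only = "".join(bytes([b]).decode("utf-8") for b in b"hello")
print(ascii_only)
```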

In [None]:
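# (c) According to the UTF-8 spec, \xe4 (binary 11100100) is the lead byte of a 3-byte sequence and must be followed by continuation bytes whose binary representation starts with the prefix bits 10. The second \xe4 starts with 1110, so this two-byte sequence does not decode.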
b'\xe4\xe4'.decode("utf-8")
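
The bit patterns can be inspected directly. The sketch below prints the binary form of the bytes involved and wraps the failing decode in a `try`/`except`. The helper names `bits` and `is_continuation` are illustrative, not part of any library.

```python
def bits(byte: int) -> str:
    # Render a byte as eight binary digits.
    return format(byte, "08b")

def is_continuation(byte: int) -> bool:
    # UTF-8 continuation bytes have the form 10xxxxxx.
    return (byte & 0b1100_0000) == 0b1000_0000

print(bits(0xE4), is_continuation(0xE4))  # lead byte of a 3-byte sequence
print(bits(0xBD), is_continuation(0xBD))  # a valid continuation byte

try:
    b"\xe4\xe4".decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError as err:
    decoded_ok = False
    print(err.reason)
```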

### Problem (train_bpe_tinystories): BPE Training on TinyStories

#### Part (a) 
Training took 46 seconds, with 42 seconds on pretokenization and 3 seconds on tokenization. Peak memory usage was 152M.

The longest tokens in the vocabulary, at 15 bytes each, are "Ġresponsibility", "Ġaccomplishment", and "Ġdisappointment". They make sense: "Ġ" is how a leading space is displayed in the GPT-2-style byte-to-character mapping, so the tokens correspond to " responsibility", " accomplishment", and " disappointment" in English.
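
One way to find the longest tokens is to compare byte lengths across the vocabulary. This is a minimal sketch, assuming the vocabulary is a mapping from token id to token bytes. The toy `vocab` below is hypothetical and is not the trained vocabulary.

```python
# Hypothetical toy vocabulary: token id -> token bytes.
vocab: dict[int, bytes] = {
    0: b"a",
    1: b" the",
    2: b" responsibility",
    3: b" accomplishment",
    4: b" cat",
}

# Length is measured in bytes, so the leading space counts toward it.
max_len = max(len(token) for token in vocab.values())
longest = sorted(token for token in vocab.values() if len(token) == max_len)
print(max_len, longest)
```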

#### Part (b)

Based on the Scalene profile below, regex pretokenization (splitting documents into pre-tokens) took the most time.

![Scalene profile of BPE training on TinyStories](tinystories_profile.png)
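
To make the profiled step concrete, here is a minimal sketch of regex pretokenization that counts pre-tokens. The pattern is a simplified, ASCII-only stand-in written for the standard-library `re` module. It is an assumption for illustration and not the pattern used in the actual run, which would typically rely on Unicode character classes from the third-party `regex` package.

```python
import re
from collections import Counter

# Simplified stand-in for a GPT-2-style pretokenization pattern (an assumption):
# optional leading space + letters, digits, or punctuation, else a whitespace run.
PATTERN = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")

def pretokenize(text: str) -> Counter:
    # Count occurrences of each pre-token, stored as UTF-8 bytes.
    counts: Counter = Counter()
    for match in PATTERN.finditer(text):
        counts[match.group().encode("utf-8")] += 1
    return counts

counts = pretokenize("the cat and the dog and the bird")
print(counts.most_common(3))
```

Each document is scanned independently, so this step is a natural candidate for splitting the corpus into chunks and processing them in parallel.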


### Problem (train_bpe_expts_owt): BPE Training on OpenWebText

#### Part (a)

The longest tokens in the vocabulary are "ÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤ" and "----------------------------------------------------------------", each with 64 bytes.

They seem to make sense: the former corresponds to "ÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂ", which seems to appear often in the text as a separator or filler, and the latter is a common separator.
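
The correspondence between the displayed token and its raw bytes can be reproduced with a byte-to-character table. The sketch below is modelled on the GPT-2 byte-to-unicode convention and assumes that this is how the tokens above were rendered. The helper names `bytes_to_unicode` and `to_bytes` are illustrative.

```python
def bytes_to_unicode() -> dict[int, str]:
    # Printable bytes map to themselves; all others are shifted to code points >= 256.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    mapping = {b: chr(b) for b in printable}
    offset = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + offset)
            offset += 1
    return mapping

# Invert the table to go from displayed characters back to raw bytes.
unicode_to_byte = {ch: b for b, ch in bytes_to_unicode().items()}

def to_bytes(display_token: str) -> bytes:
    return bytes(unicode_to_byte[ch] for ch in display_token)

# "Ġ" (U+0120) stands for the space byte 0x20.
print(to_bytes("Ġresponsibility"))

# Each displayed pair "Ãĥ" / "ÃĤ" is two raw bytes that decode to one character.
raw = to_bytes("ÃĥÃĤ" * 16)
print(len(raw), raw.decode("utf-8"))
```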

#### Part (b)

Similarities: both vocabularies are dominated by English words.

Contrasts: the OpenWebText vocabulary is more diverse, with more non-ASCII Unicode sequences (though these are still rare) and many variants of separators.