#### 2.1 The Unicode Standard

In [1]:
ord('牛')

29275

In [2]:
chr(29275)

'牛'

##### Problem (`unicode1`): Understanding Unicode (1 point)

_(a) What Unicode character does chr(0) return?_

_Answer_: It returns a one-character `str` containing the null character (NUL), whose Unicode code point is `0`. Because this character is non-printable, Python displays it with the hexadecimal escape sequence `\x00`.

In [3]:
chr(0), len(chr(0)), type(chr(0))

('\x00', 1, str)

_(b) How does this character's string representation (`__repr__()`) differ from its printed representation?_

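_Answer_: `repr` shows the quoted escape sequence `'\x00'`, because the null character has no visible glyph. Printing the character writes the actual NUL character to the output, which renders as nothing visible.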

In [4]:
print(repr(chr(0)))

'\x00'


In [5]:
print(chr(0))  # the null character is not visible when printed

 


In [6]:
repr(chr(0)), len(repr(chr(0))), type(repr(chr(0)))

("'\\x00'", 6, str)

_(c) What happens when this character occurs in text? It may be helpful to play around with the
following in your Python interpreter and see if it matches your expectations:_

In [7]:
chr(0)

'\x00'

In [8]:
print(chr(0))

 


In [9]:
"this is a test" + chr(0) + "string"

'this is a test\x00string'

In [10]:
print("this is a test" + chr(0) + "string")

this is a test string
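Although nothing is visible when the string is printed, the null character is still part of the string. A minimal sketch of how one might check this (the variable names are only for illustration):

```python
s = "this is a test" + chr(0) + "string"

# The null character counts toward the length and can be located like any other character.
length = len(s)              # 14 + 1 + 6 characters
position = s.index("\x00")   # index right after "this is a test"
contains_null = "\x00" in s

# Removing it explicitly yields the text without the invisible character.
cleaned = s.replace("\x00", "")
print(length, position, contains_null, repr(cleaned))
```

So the character is silently carried along in text, which is why `repr` is the safer way to inspect strings that may contain it.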


#### 2.2 Unicode Encodings

In [11]:
test_string = "hello! こんにちは!"
test_string

'hello! こんにちは!'

In [12]:
utf8_encoded = test_string.encode("utf-8")
utf8_encoded

b'hello! \xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf!'

In [13]:
print(type(utf8_encoded))

<class 'bytes'>


In [14]:
type(utf8_encoded[10]), type(utf8_encoded[10:12])  # Indexing a `bytes` object returns an `int`; slicing returns `bytes`.

(int, bytes)

In [15]:
print(list(utf8_encoded))

[104, 101, 108, 108, 111, 33, 32, 227, 129, 147, 227, 130, 147, 227, 129, 171, 227, 129, 161, 227, 129, 175, 33]


In [16]:
print(len(test_string))

13


In [17]:
print(len(utf8_encoded))

23


In [18]:
print(utf8_encoded.decode("utf-8"))

hello! こんにちは!


##### Problem (`unicode2`): Unicode Encodings (3 points)

_(a) What are some reasons to prefer training our tokenizer on UTF-8 encoded bytes, rather than
UTF-16 or UTF-32? It may be helpful to compare the output of these encodings for various
input strings._

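_Answer_: UTF-16 and UTF-32 use at least 2 and 4 bytes per character, respectively, so the same text becomes a longer byte sequence (here 28 and 56 bytes versus 23 for UTF-8), and much of it is zero padding for ASCII-range characters. If we instead treated each 16-bit or 32-bit code unit as a base symbol, the initial vocabulary would have 2^16 or 2^32 entries rather than the 2^8 = 256 of byte-level UTF-8. A smaller base vocabulary and shorter sequences mean faster training and less memory. UTF-8 is also ASCII-compatible and needs no byte-order mark.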

In [19]:
2**8, 2**16, 2**32

(256, 65536, 4294967296)

In [20]:
test_string.encode("utf-16"), test_string.encode("utf-32")

(b'\xff\xfeh\x00e\x00l\x00l\x00o\x00!\x00 \x00S0\x930k0a0o0!\x00',
 b'\xff\xfe\x00\x00h\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00!\x00\x00\x00 \x00\x00\x00S0\x00\x00\x930\x00\x00k0\x00\x00a0\x00\x00o0\x00\x00!\x00\x00\x00')

In [21]:
print(list(test_string.encode("utf-16")))

[255, 254, 104, 0, 101, 0, 108, 0, 108, 0, 111, 0, 33, 0, 32, 0, 83, 48, 147, 48, 107, 48, 97, 48, 111, 48, 33, 0]


In [22]:
len(test_string.encode("utf-16")), len(test_string.encode("utf-32"))

(28, 56)
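To compare the encodings on several inputs, as the question suggests, a small helper can tabulate byte counts. This is a sketch: `encoded_sizes` is a hypothetical name, and it uses the `-le` codec variants, which omit the byte-order mark, so its counts are 2 and 4 bytes lower than the `utf-16` and `utf-32` results above.

```python
def encoded_sizes(text: str) -> dict:
    """Return the number of bytes `text` occupies in each encoding (no BOM)."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(text.encode(enc)) for enc in encodings}

# ASCII-only text, mixed text, and a character outside the Basic Multilingual Plane.
for sample in ["hello", "hello! こんにちは!", "😀"]:
    print(repr(sample), encoded_sizes(sample))
```

UTF-8 is smallest for ASCII-heavy text, while UTF-16 can be smaller for text dominated by characters that need three UTF-8 bytes.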

_(b) Consider the following (incorrect) function, which is intended to decode a UTF-8 byte string into a Unicode string. Why is this function incorrect? Provide an example of an input byte string that yields incorrect results._

In [23]:
def decode_utf8_bytes_to_str_wrong(bytestring: bytes):
    return "".join([bytes([b]).decode("utf-8") for b in bytestring])

In [24]:
decode_utf8_bytes_to_str_wrong("hello".encode("utf-8"))

'hello'

_(c) Give a two byte sequence that does not decode to any Unicode character(s)._

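_Answer_: UTF-8 can combine multiple bytes to form one character. In those cases the bytes must follow a fixed pattern: a lead byte announces the sequence length, and every following byte must be a continuation byte of the form `10xxxxxx`. A two-byte sequence that breaks this pattern, such as `b'\xc3\x28'` (a two-byte lead followed by the ASCII byte `(`), does not decode to any Unicode character. For the same reason, the individual bytes of a multi-byte character do not decode on their own.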
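One way to test candidate two-byte sequences is to attempt a strict decode and catch the error. The helper name `is_valid_utf8` is hypothetical and the candidates are only illustrative:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes as UTF-8 without errors."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

candidates = {
    b"\xc3\xa9": "lead byte + continuation byte",
    b"\xc3\x28": "lead byte + ASCII byte",
    b"\x80\x80": "two continuation bytes with no lead",
    b"\xff\xfe": "bytes that never occur in UTF-8",
}
for seq, description in candidates.items():
    print(seq, description, is_valid_utf8(seq))
```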

In [25]:
test = "😀".encode("utf-8")
list(test)

[240, 159, 152, 128]

In [26]:
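type(test), type(test[0]), type(bytes([test[0]]))

# Indexing a bytes object returns an int holding the byte value at that position.
# Wrapping that int in a list, as in bytes([b]), turns it back into a one-byte bytes object
# (bytes(240) would instead create 240 zero bytes).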

(bytes, int, bytes)

In [27]:
print(test)  # b'\xf0\x9f\x98\x80'

b'\xf0\x9f\x98\x80'


In [28]:
try:
    decode_utf8_bytes_to_str_wrong(test)
except UnicodeDecodeError as e:
    print("UnicodeDecodeError:", e)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 0: unexpected end of data
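The error above arises because `0xf0` is only the first byte of a four-byte sequence, so decoding it in isolation fails. A minimal sketch of two alternatives follows: decoding the whole byte string at once, and using the standard library's incremental decoder when bytes arrive in pieces. Both function names are hypothetical.

```python
import codecs

def decode_utf8_bytes_to_str(bytestring: bytes) -> str:
    # Decode the full sequence so multi-byte characters stay together.
    return bytestring.decode("utf-8")

def decode_utf8_chunks(chunks) -> str:
    # The incremental decoder buffers incomplete sequences between calls.
    decoder = codecs.getincrementaldecoder("utf-8")()
    parts = [decoder.decode(chunk) for chunk in chunks]
    parts.append(decoder.decode(b"", final=True))  # raises if bytes are left over
    return "".join(parts)

emoji_bytes = "😀".encode("utf-8")
print(decode_utf8_bytes_to_str(emoji_bytes))
print(decode_utf8_chunks([emoji_bytes[:2], emoji_bytes[2:]]))
```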


In [29]:
test = "é".encode("utf-8")
print(test)  # b'\xc3\xa9'

b'\xc3\xa9'


In [84]:
"e".encode("utf-8"), "é".encode("utf-8")

(b'e', b'\xc3\xa9')

In [30]:
## To add: 
## immutable nature of bytes object
## max and min byte size of different unicode encodings
## bytearrays
## Unicode pattern for multi-byte characters
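A possible starting point for some of these notes, offered as a sketch rather than a finished treatment: the helper `utf8_byte_kind` is a hypothetical name and classifies a byte only by its leading bits, and the second half shows that `bytes` is immutable while `bytearray` is not.

```python
def utf8_byte_kind(b: int) -> str:
    """Classify a byte value by its leading bits (does not check full validity)."""
    if b < 0x80:
        return "ascii"         # 0xxxxxxx
    if b < 0xC0:
        return "continuation"  # 10xxxxxx
    if b < 0xE0:
        return "lead-2"        # 110xxxxx
    if b < 0xF0:
        return "lead-3"        # 1110xxxx
    if b < 0xF8:
        return "lead-4"        # 11110xxx
    return "invalid"

kinds = [utf8_byte_kind(b) for b in "😀".encode("utf-8")]
print(kinds)

# bytes objects are immutable: item assignment raises TypeError.
data = b"abc"
try:
    data[0] = 65
    mutated = True
except TypeError:
    mutated = False

# bytearray is the mutable counterpart.
buf = bytearray(data)
buf[0] = 65
print(mutated, buf)
```

For the size note: UTF-8 uses 1 to 4 bytes per code point, UTF-16 uses 2 or 4, and UTF-32 always uses 4.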

#### 2.4 BPE Tokenizer Training

In [31]:
# Source: https://github.com/openai/tiktoken/pull/234/files
PAT = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

In [32]:
import regex as re
print(re.findall(PAT, "Hello, world! C'est la vie. 牛. বাংলা আমার ভাষা।"))

['Hello', ',', ' world', '!', ' C', "'", 'est', ' la', ' vie', '.', ' 牛', '.', ' ব', 'াং', 'ল', 'া', ' আম', 'া', 'র', ' ভ', 'া', 'ষ', 'া।']
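The Bangla words are split mid-word because the pattern's `\p{L}+` branch matches only letters, while Bangla vowel signs belong to the mark categories and so fall into the `[^\s\p{L}\p{N}]+` branch. A small sketch using the standard library's `unicodedata` makes the categories visible (the helper name `categories` is hypothetical):

```python
import unicodedata

def categories(text: str) -> list:
    """Pair each character's code point with its Unicode general category."""
    return [(f"U+{ord(ch):04X}", unicodedata.category(ch)) for ch in text]

# Categories starting with "L" are letters; those starting with "M" are marks.
for code_point, category in categories("বাংলা"):
    print(code_point, category)
```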


In [33]:
max([("A", "B"), ("A", "C"), ("B", "ZZ"), ("BA", "A")])

('BA', 'A')
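Python compares tuples lexicographically, element by element, which is why `('BA', 'A')` is the maximum here. The same ordering can break ties between equally frequent pairs when the key is `(count, pair)`. A short sketch with made-up counts:

```python
pairs = [("A", "B"), ("A", "C"), ("B", "ZZ"), ("BA", "A")]
# "BA" > "B" because a string is greater than any of its proper prefixes.
largest = max(pairs)

# Illustrative counts: two pairs tie at 9, so the lexicographically greater pair wins.
counts = {(b"e", b"s"): 9, (b"s", b"t"): 9, (b"w", b"e"): 8}
best_pair = max(counts, key=lambda pair: (counts[pair], pair))
print(largest, best_pair)
```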

##### Example (`bpe_example`): BPE training example

In [34]:
f = """low low low low low
lower lower widest widest widest
newest newest newest newest newest newest"""

# try the same with Bangla

In [35]:
print(f)

low low low low low
lower lower widest widest widest
newest newest newest newest newest newest


In [36]:
from cs336_basics.bpe_example import pretokenize, count_pairs, merge_pair, train_bpe

In [37]:
word_freqs = pretokenize(f)
word_freqs

{(b'l', b'o', b'w'): 5,
 (b'l', b'o', b'w', b'e', b'r'): 2,
 (b'w', b'i', b'd', b'e', b's', b't'): 3,
 (b'n', b'e', b'w', b'e', b's', b't'): 6}

In [38]:
pairs = count_pairs(word_freqs)
print(pairs)

{(b'l', b'o'): 7, (b'o', b'w'): 7, (b'w', b'e'): 8, (b'e', b'r'): 2, (b'w', b'i'): 3, (b'i', b'd'): 3, (b'd', b'e'): 3, (b'e', b's'): 9, (b's', b't'): 9, (b'n', b'e'): 6, (b'e', b'w'): 6}


In [39]:
# merge_pair(pairs, word_freqs)

In [40]:
# Get the most frequent pair
best_pair = max(pairs.items(), key=lambda x: (x[1], x[0]))
pair, count = best_pair

print(f"Most frequent pair: {pair} with count {count}")

Most frequent pair: (b's', b't') with count 9


In [41]:
# Now merge that specific pair
new_word_freqs = merge_pair(word_freqs, pair)  # Pass 'pair', not 'pairs'
print(new_word_freqs)

{(b'l', b'o', b'w'): 5, (b'l', b'o', b'w', b'e', b'r'): 2, (b'w', b'i', b'd', b'e', b'st'): 3, (b'n', b'e', b'w', b'e', b'st'): 6}


In [42]:
merges = train_bpe(f, 6)

In [43]:
merges

[(b's', b't'),
 (b'e', b'st'),
 (b'o', b'w'),
 (b'l', b'ow'),
 (b'w', b'est'),
 (b'n', b'e')]
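The helpers imported from `cs336_basics.bpe_example` are not shown in this notebook. The following is a minimal sketch of how such functions might look, assuming whitespace pre-tokenization and ties broken in favor of the lexicographically greater pair; the `_sketch` names are hypothetical and this is not necessarily the module's actual implementation.

```python
from collections import Counter

def pretokenize_sketch(text: str) -> dict:
    # Split on whitespace and represent each word as a tuple of single bytes.
    counts = Counter(text.split())
    return {tuple(bytes([b]) for b in w.encode("utf-8")): n for w, n in counts.items()}

def count_pairs_sketch(word_freqs: dict) -> dict:
    # Count adjacent token pairs, weighted by word frequency.
    pairs = Counter()
    for word, freq in word_freqs.items():
        for left, right in zip(word, word[1:]):
            pairs[(left, right)] += freq
    return dict(pairs)

def merge_pair_sketch(word_freqs: dict, pair: tuple) -> dict:
    # Replace every occurrence of `pair` with the concatenated token.
    merged, result = pair[0] + pair[1], {}
    for word, freq in word_freqs.items():
        new_word, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                new_word.append(merged)
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        key = tuple(new_word)
        result[key] = result.get(key, 0) + freq
    return result

def train_bpe_sketch(text: str, num_merges: int) -> list:
    word_freqs, merges = pretokenize_sketch(text), []
    for _ in range(num_merges):
        pairs = count_pairs_sketch(word_freqs)
        if not pairs:
            break
        # Highest count wins; ties go to the lexicographically greater pair.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        word_freqs = merge_pair_sketch(word_freqs, best)
    return merges

corpus = "low low low low low\nlower lower widest widest widest\nnewest newest newest newest newest newest"
print(train_bpe_sketch(corpus, 6))
```

Under these assumptions the sketch yields the same six merges as the cell above.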

In [44]:
chr(200).encode(), bytes([chr(200).encode()[0]])

(b'\xc3\x88', b'\xc3')

In [45]:
len(chr(200).encode()), chr(200)

(2, 'È')

#### 2.5 Experimenting with BPE Tokenizer

In [None]:
from cs336_basics.tobenamed import train_bpe#, get_all_pretokens, get_pretoken_counter

In [47]:
input_path = "data/TinyStoriesV2-GPT4-valid.txt"
vocab_size = 10_000
special_tokens = ["<|endoftext|>"]
# num_processes = 20
# split_special_token = "<|endoftext|>"

In [48]:
pre_tokens, pre_tokens_bytes = train_bpe(input_path, vocab_size, special_tokens)

In [79]:
pre_tokens

['u',
 ' don',
 "'t",
 ' have',
 ' to',
 ' be',
 ' scared',
 ' of',
 ' the',
 ' loud',
 ' dog',
 ',',
 ' I',
 "'ll",
 ' protect',
 ' you',
 '".',
 ' The',
 ' mole',
 ' felt',
 ' so',
 ' safe',
 ' with',
 ' the',
 ' little',
 ' girl',
 '.',
 ' She',
 ' was',
 ' very',
 ' kind',
 ' and',
 ' the',
 ' mole',
 ' soon',
 ' came',
 ' to',
 ' trust',
 ' her',
 '.',
 ' He',
 ' leaned',
 ' against',
 ' her',
 ' and',
 ' she',
 ' kept',
 ' him',
 ' safe',
 '.',
 ' The',
 ' mole',
 ' had',
 ' found',
 ' his',
 ' best',
 ' friend',
 '.',
 '\n',
 '<|',
 'endoftext',
 '|>',
 '\n',
 'Once',
 ' upon',
 ' a',
 ' time',
 ',',
 ' in',
 ' a',
 ' warm',
 ' and',
 ' sunny',
 ' place',
 ',',
 ' there',
 ' was',
 ' a',
 ' big',
 ' pit',
 '.',
 ' A',
 ' little',
 ' boy',
 ' named',
 ' Tom',
 ' liked',
 ' to',
 ' play',
 ' near',
 ' the',
 ' pit',
 '.',
 ' One',
 ' day',
 ',',
 ' Tom',
 ' lost',
 ' his',
 ' red',
 ' ball',
 '.',
 ' He',
 ' was',
 ' very',
 ' sad',
 '.',
 '\n',
 'Tom',
 ' asked',
 ' his',
 ' frie

In [49]:
# chunk = get_chunks(input_path, split_special_token, num_processes)
# stories = split_chunk(chunk, special_tokens)
# all_pretokens = [token for story in stories for token in re.findall(PAT, story)]

In [50]:
# all_pretokens[:10]

In [51]:
# del PAT

In [52]:
# PAT

In [53]:
# all_tokens = get_all_pretokens(input_path, vocab_size = 10)

In [54]:
# pretoken_encodings = get_pretoken_counter(all_tokens)
# # pretoken_encodings

In [55]:
# sorted_freq = dict(sorted(pretoken_encodings.items(), key=lambda item: item[1], reverse=True))
# # sorted_freq

In [56]:
# sorted_freq[(b' ', b't', b'h', b'e')]

In [57]:
# chunk = get_chunks(input_path = 'data/TinyStoriesV2-GPT4-valid.txt', 
#            split_special_token = "<|endoftext|>", 
#            num_processes = 20)

In [58]:
# from multiprocessing import Pool, cpu_count

# cpu_count()

In [59]:
# PAT

In [60]:
# special_tokens = ["<|endoftext|>"]
# stories = split_chunk(chunk, special_tokens)
# all_pretokens = [token for story in stories for token in re.findall(PAT, story)]
# len(all_pretokens)

In [61]:
# from collections import defaultdict
# freq = defaultdict(int)

# for pretoken in all_pretokens:
#     byte_tuple = tuple(bytes([b]) for b in pretoken.encode('utf-8'))
#     freq[byte_tuple] += 1

In [62]:
# sorted_freq = dict(sorted(freq.items(), key=lambda item: item[1], reverse=True))
# sorted_freq

In [63]:
# len(chunk)
# print(chunk)

In [64]:
# from cs336_basics.pretokenization_example import find_chunk_boundaries

In [65]:
# file_path = 'data/TinyStoriesV2-GPT4-valid.txt'

# ## Usage
# with open(file_path, "rb") as f:
#     num_processes = 10
#     boundaries = find_chunk_boundaries(f, num_processes, b"<|endoftext|>")
#     print(boundaries)
#     # The following is a serial implementation, but you can parallelize this
#     # by sending each start/end pair to a set of processes.
#     for start, end in zip(boundaries[:-1], boundaries[1:]):
#         f.seek(start)
#         chunk = f.read(end - start).decode("utf-8", errors="ignore")
#         # Run pre-tokenization on your chunk and store the counts for each pre-token

In [66]:
# # chunk
# end

In [67]:
# chunk.find("<|endoftext|>")

In [68]:
# b"<|endoftext|>"

In [69]:
# "<|endoftext|>".encode()

In [70]:
# chunk[720:780]

In [71]:


# special_tokens = ["<|endoftext|>"]#, "<|startoftext|>"]

In [72]:
# result = re.split(special_tokens[0], chunk[:800])
# result

In [73]:
# pattern = "|".join(re.escape(token) for token in special_tokens)
# print(pattern)

In [74]:
# result = re.split(pattern, chunk)
# len(result)
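The commented-out cells above split the text on special tokens before pre-tokenizing, so that a token such as `<|endoftext|>` is never broken into pieces like `'<|'`, `'endoftext'`, `'|>'`. A minimal sketch of that step follows; the function name `split_on_special_tokens` is hypothetical, and it uses the standard library `re` module rather than `regex`.

```python
import re

def split_on_special_tokens(text: str, special_tokens: list) -> list:
    """Split `text` at every special token and drop empty segments."""
    if not special_tokens:
        return [text] if text else []
    # Escape each token so characters like "|" are matched literally;
    # try longer tokens first in case one token is a prefix of another.
    ordered = sorted(special_tokens, key=len, reverse=True)
    pattern = "|".join(re.escape(token) for token in ordered)
    return [segment for segment in re.split(pattern, text) if segment]

sample = "best friend.\n<|endoftext|>\nOnce upon a time"
print(split_on_special_tokens(sample, ["<|endoftext|>"]))
```

Each returned segment can then be pre-tokenized independently, so no pre-token spans a document boundary.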

In [75]:
# print(result[0])

In [76]:
# type(b"<|endoftext|>".decode('utf-8'))

In [77]:
# print(chunk[:810])

In [78]:
# print(chunk[:2000])