Unicode is a text encoding standard that maps characters to integer code points.

Hexadecimal: the digits 0-9 represent 0-9 and the letters A-F represent 10-15. Hexadecimal literals start with 0x.

In [1]:
ord("牛"),chr(29275)
# ord() converts a single Unicode character
# into its integer code point.
# chr() converts an integer Unicode code point
# into a string containing the corresponding character.

(29275, '牛')
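
As a small illustration of how hexadecimal notation relates to code points, the sketch below writes the code point of 牛 in hex and reads it back. The variable names are made up for this example.

```python
# Illustrative sketch: relate a code point to its hexadecimal form.
code_point = ord("牛")          # the decimal code point
hex_form = hex(code_point)      # hexadecimal string with the 0x prefix
from_hex = int("725B", 16)      # parse hex digits (case-insensitive) back to an int
from_escape = "\u725b"          # a \uXXXX escape takes four hex digits

print(code_point, hex_form, from_hex, from_escape)
```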

### Problem (unicode1): Understanding Unicode (1 point)

What Unicode character does chr(0) return?

In [2]:
# a
chr(0)

'\x00'

How does this character’s string representation (__repr__()) differ from its printed representation?

In [3]:
repr("hello     \n\ta")

"'hello     \\n\\ta'"

In [4]:
print("hello    \n\ta")

hello    
	a


In [5]:
# repr() returns the string as a Python literal:
# it wraps the text in quotes (single quotes by default)
# and shows escape sequences such as \n and \t literally, as a backslash plus a letter.

# print() writes the characters themselves,
# so the escape sequences take effect:
# "\n" becomes a real line break and "\t" a real tab.
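
A minimal sketch of the difference, using a made-up string `s`: `repr()` produces a quoted literal with visible escapes, while the string itself holds one tab character.

```python
# Illustrative sketch: repr() vs. the characters a string actually contains.
s = "tab\there"
literal = repr(s)        # quoted, with the tab shown as backslash + t

print(len(s))            # the tab counts as a single character
print(literal)
print(s)                 # the tab is rendered as whitespace

# repr() switches to double quotes when the text itself contains a single quote.
quoted = repr("it's")
print(quoted)
```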

What happens when this character occurs in text? It may be helpful to play around with the following in your Python interpreter and see if it matches your expectations:  
\>>> chr(0) 

\>>> print(chr(0)) 

\>>> "this is a test" + chr(0) + "string" 

\>>> print("this is a test" + chr(0) + "string")

In [6]:
chr(0)

'\x00'

In [7]:
print(chr(0))

 


In [8]:
"this is a test" + chr(0) +"string"

'this is a test\x00string'

In [9]:
print("this is a test" + chr(0) +"string")

this is a test string


In [10]:
# chr(0) is the Unicode code point 0, also called the null character (NUL).
# It is not a visible character, so printing it shows nothing on screen,
# although it is still part of the string.
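
To check that the NUL character really stays in the string even though printing shows nothing, a short sketch (variable names are illustrative):

```python
# Illustrative sketch: the NUL character is invisible when printed but still present.
s = "this is a test" + chr(0) + "string"

length = len(s)                  # 14 + 1 + 6 characters
position = s.index("\x00")       # where the NUL character sits
parts = s.split("\x00")          # it can be used like any other character

print(length, position, parts)
```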

## 2.2 Unicode Encodings

In [11]:
test_string = "hello! こんにちは!"
test_string

'hello! こんにちは!'

In [12]:
utf_8 = test_string.encode("utf-8")
utf_8,len(utf_8)

(b'hello! \xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf!', 23)

In [13]:
type(utf_8)

bytes

In [14]:
list(utf_8),len(list(utf_8))

([104,
  101,
  108,
  108,
  111,
  33,
  32,
  227,
  129,
  147,
  227,
  130,
  147,
  227,
  129,
  171,
  227,
  129,
  161,
  227,
  129,
  175,
  33],
 23)

In [15]:
utf_8.decode("utf-8")
# convert back from bytes to a Unicode string

'hello! こんにちは!'

### Problem (unicode2): Unicode Encodings (3 points)

(a) What are some reasons to prefer training our tokenizer on UTF-8 encoded bytes, rather than UTF-16 or UTF-32? It may be helpful to compare the output of these encodings for various input strings.

In [16]:
test_string

'hello! こんにちは!'

In [17]:
utf_8 = test_string.encode("utf-8")
utf_16 = test_string.encode("utf-16")
utf_32 = test_string.encode("utf-32")
utf_8,len(utf_8),utf_16,len(utf_16),utf_32,len(utf_32)

(b'hello! \xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf!',
 23,
 b'\xff\xfeh\x00e\x00l\x00l\x00o\x00!\x00 \x00S0\x930k0a0o0!\x00',
 28,
 b'\xff\xfe\x00\x00h\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00!\x00\x00\x00 \x00\x00\x00S0\x00\x00\x930\x00\x00k0\x00\x00a0\x00\x00o0\x00\x00!\x00\x00\x00',
 56)

In [18]:
unicode_8 = utf_8.decode("utf-8")
unicode_16 = utf_16.decode("utf-16")
unicode_32 = utf_32.decode("utf-32")
unicode_8,unicode_16,unicode_32

('hello! こんにちは!', 'hello! こんにちは!', 'hello! こんにちは!')

In [19]:
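# Because in UTF-8 every ASCII (e.g. English) character takes only one byte, whereas
# UTF-16 uses 2 or 4 bytes per character and UTF-32 always uses 4. For mostly-ASCII text
# that costs more memory and fills the output with zero bytes (see the \x00 runs above).
# So we choose UTF-8 for its compatibility and storage efficiency.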
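
The size difference can be measured directly. This sketch uses a hypothetical helper, `encoded_sizes`, and the little-endian codec names `utf-16-le` and `utf-32-le` so that no byte-order mark is added (the outputs above include one, which is why they are 2 and 4 bytes longer).

```python
# Illustrative sketch: compare encoded lengths without the byte-order mark.
def encoded_sizes(text):
    """Byte length of `text` under each encoding."""
    return {name: len(text.encode(name)) for name in ("utf-8", "utf-16-le", "utf-32-le")}

ascii_sizes = encoded_sizes("hello")
mixed_sizes = encoded_sizes("hello! こんにちは!")

# Number of zero bytes UTF-32 spends on five ASCII characters.
zero_bytes = "hello".encode("utf-32-le").count(0)

print(ascii_sizes, mixed_sizes, zero_bytes)
```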

(b) Consider the following (incorrect) function, which is intended to decode a UTF-8 byte string into a Unicode string. Why is this function incorrect? Provide an example of an input byte string that yields incorrect results.

In [20]:
def decode_utf8_bytes_to_str_wrong(bytestring: bytes): 
    return "".join([bytes([b]).decode("utf-8") for b in bytestring]) 
decode_utf8_bytes_to_str_wrong("你好".encode("utf-8"))

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 0: unexpected end of data

In [None]:
# For example, when the input is English text, each character is
# represented by one byte, so the function happens to work.
# But a Chinese character is represented by multiple bytes, e.g. "你" is
# the three bytes [0xe4, 0xbd, 0xa0]. Decoding them one at a time
# raises an error, because no single byte is a complete character.
# UTF-8 is a variable-length encoding: a character does not always occupy exactly one byte!
# A UTF-8 byte stream cannot be decoded one byte at a time;
# it must be decoded as a whole, and the decoder groups the bytes into characters.
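
To make the variable-length point concrete, the sketch below lists how many bytes a few sample characters need. It also shows one way to feed bytes one at a time correctly, using the standard library's incremental decoder, which buffers incomplete characters. The sample characters are arbitrary.

```python
import codecs

# Illustrative sketch: UTF-8 uses 1 to 4 bytes per character.
samples = ["a", "\u00e9", "你", "\U0001F600"]
widths = [len(ch.encode("utf-8")) for ch in samples]

# An incremental decoder keeps partial characters in a buffer,
# so it can safely receive the stream byte by byte.
decoder = codecs.getincrementaldecoder("utf-8")()
pieces = [decoder.decode(bytes([b])) for b in "你好".encode("utf-8")]
pieces.append(decoder.decode(b"", final=True))
decoded = "".join(pieces)

print(widths, pieces, decoded)
```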

In [21]:
def decode_utf8_bytes_to_str(bytestring: bytes):
    return bytestring.decode("utf-8")
decode_utf8_bytes_to_str("你好".encode("utf-8"))
# to fix the error, decode the whole byte string at once

'你好'

In [22]:
text = "hello"
text.encode("utf-8"),bytes("hello".encode('utf-8'))
# when the input is already a bytes object, bytes() returns it with the same contents

(b'hello', b'hello')

(c) Give a two byte sequence that does not decode to any Unicode character(s).

In [23]:
for bad_bytes in [b'\x80\x80', b'\xc2\x20', b'\xc0\xaf']:
    try:
        print(bad_bytes.decode('utf-8'))
    except UnicodeDecodeError as e:
        print(f"{bad_bytes}: {e}")


b'\x80\x80': 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
b'\xc2 ': 'utf-8' codec can't decode byte 0xc2 in position 0: invalid continuation byte
b'\xc0\xaf': 'utf-8' codec can't decode byte 0xc0 in position 0: invalid start byte


In [24]:
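# In UTF-8 the first byte of a character signals its length:
# 0xxxxxxx (1 byte), 110xxxxx (2 bytes), 1110xxxx (3 bytes), 11110xxx (4 bytes),
# and every following (continuation) byte must look like 10xxxxxx.
# \x80 is 10000000, a continuation byte, so no character can start with it.
# \xc2 starts a 2-byte character, but \x20 (00100000) is not a continuation byte.
# \xc0 could only begin an "overlong" encoding of an ASCII character, so it is never a valid start byte.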
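
The bit patterns above can be checked with a few shifts. `classify_byte` is a hypothetical helper written for this note. It only looks at the leading bits, so it does not catch every invalid byte (for example, it reports \xc0 as a 2-byte lead even though decoders reject it).

```python
# Illustrative sketch: classify a byte by its leading bits.
def classify_byte(b: int) -> str:
    if b >> 7 == 0b0:
        return "single"          # 0xxxxxxx
    if b >> 6 == 0b10:
        return "continuation"    # 10xxxxxx
    if b >> 5 == 0b110:
        return "lead-2"          # 110xxxxx
    if b >> 4 == 0b1110:
        return "lead-3"          # 1110xxxx
    if b >> 3 == 0b11110:
        return "lead-4"          # 11110xxx
    return "invalid"

kinds = [classify_byte(b) for b in b"\x80\xc2\x20\xe4\xf0\xff"]
print(kinds)
```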

## 2.3 Subword Tokenization

Byte-pair encoding (BPE) is a compression algorithm that iteratively replaces (“merges”) the most frequent pair of bytes with a single, new unused index. Note that this algorithm adds subword tokens to our vocabulary to maximize the compression of our input sequences: if a word occurs in our input text enough times, it’ll be represented as a single subword unit.

“Vocabulary initialization” 
The vocabulary is a one-to-one mapping from bytestring tokens to integer IDs. There are 256 possible byte values, so the initial vocabulary has size 256.
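
A minimal sketch of such an initial vocabulary, assuming the ID of each byte is simply its value (the names `vocab` and `token_to_id` are illustrative):

```python
# Illustrative sketch: one entry per possible byte value.
vocab = {i: bytes([i]) for i in range(256)}           # ID -> bytestring token
token_to_id = {token: i for i, token in vocab.items()}  # bytestring token -> ID

print(len(vocab), vocab[104], token_to_id[b"h"])
```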

“Pre-tokenization”
Counting how often bytes occur next to each other directly over the raw corpus, and then merging starting from the most frequent pair, is computationally expensive, because every merge needs another full pass over the corpus. Merging bytes directly across the corpus may also produce tokens that differ only in punctuation (e.g., dog vs. dog!). To avoid both problems we first pre-tokenize, i.e. apply a coarse-grained tokenization over the corpus, and count the resulting pre-tokens. For example, if the pre-token 'text' appears 10 times, then when we count how often the characters ‘t’ and ‘e’ appear next to each other, we see that ‘text’ has ‘t’ and ‘e’ adjacent and can increment their count by 10 instead of looking through the corpus.
Since we’re training a byte-level BPE model, each pre-token is represented as a sequence of UTF-8 bytes.
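
The counting shortcut can be sketched as follows. The pre-token counts here are made-up numbers, not taken from a real corpus.

```python
from collections import Counter

# Illustrative sketch: weight each adjacent byte pair by its pre-token's frequency.
pretoken_counts = {"text": 10, "test": 3}   # hypothetical counts

pair_counts = Counter()
for pretoken, count in pretoken_counts.items():
    # Represent the pre-token as a sequence of single-byte bytes objects.
    symbols = [bytes([b]) for b in pretoken.encode("utf-8")]
    for pair in zip(symbols, symbols[1:]):
        pair_counts[pair] += count

print(pair_counts.most_common(2))
```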

In [25]:
PAT = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

import regex as re

re.findall(PAT,"some text that i'll pre-tokenize")

['some', ' text', ' that', ' i', "'ll", ' pre', '-', 'tokenize']

When using it in your code, however, you should use re.finditer to avoid storing the pre-tokenized words as you construct your mapping from pre-tokens to their counts.
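
A sketch of counting pre-tokens lazily with `finditer`. To keep it runnable with only the standard library, it uses `SIMPLE_PAT`, a simplified stand-in for `PAT` that replaces the `\p{L}` and `\p{N}` classes (which need the third-party `regex` package) with rough `re` equivalents. That substitution is an assumption of this sketch, and the two patterns can differ on some input (for example underscores).

```python
import re
from collections import Counter

# Simplified stand-in for PAT that works with the standard-library `re` module.
SIMPLE_PAT = r"""'(?:[sdmt]|ll|ve|re)| ?[^\W\d_]+| ?\d+| ?[^\s\w]+|\s+(?!\S)|\s+"""

def count_pretokens(text: str) -> Counter:
    """Count pre-tokens (as UTF-8 bytes) without building a list of all matches."""
    counts = Counter()
    for match in re.finditer(SIMPLE_PAT, text):
        counts[match.group().encode("utf-8")] += 1
    return counts

counts = count_pretokens("low low low lower lowest")
print(counts)
```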

"Compute BPE merges"
Now that we’ve converted our input text into pre-tokens and represented each pre-token as a sequence of UTF-8 bytes, we can compute the BPE merges.

At a high level, the BPE algorithm iteratively counts every pair of bytes and identifies the pair with the highest frequency (“A”, “B”). Every occurrence of this most frequent pair (“A”, “B”) is then merged, i.e., replaced with a new token “AB”. This new merged token is added to our vocabulary; as a result, the final vocabulary after BPE training is the size of the initial vocabulary (256 in our case), plus the number of BPE merge operations performed during training.
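
One merge step might look like the sketch below. `merge_pair` is a hypothetical helper, and the pair to merge is chosen by hand here rather than by counting.

```python
# Illustrative sketch: apply one merge to a token sequence and grow the vocabulary.
def merge_pair(tokens, pair):
    """Replace every adjacent occurrence of `pair` with the concatenated token."""
    merged = pair[0] + pair[1]
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return tuple(out)

vocab = {i: bytes([i]) for i in range(256)}
tokens = tuple(bytes([b]) for b in b"text")

pair = (b"t", b"e")                      # assume this pair was the most frequent
vocab[len(vocab)] = pair[0] + pair[1]    # the new token gets the next free ID
tokens = merge_pair(tokens, pair)

print(tokens, len(vocab))
```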

For efficiency during BPE training, we do not consider pairs that cross pre-token boundaries (e.g., in "some text", nothing is merged across the boundary between the pre-tokens "some" and " text"). When computing merges, deterministically break ties in pair frequency by preferring the lexicographically greater pair. For example, if the pairs (“A”, “B”), (“A”, “C”), (“B”, “ZZ”), and (“BA”, “A”) all have the highest frequency, we’d merge (“BA”, “A”), which is the lexicographically greatest (Python’s max() picks it).
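
The tie-breaking rule can be expressed with `max()` and a key that compares the count first and the pair itself second. The counts below are invented for illustration.

```python
# Illustrative sketch: pick the most frequent pair, breaking ties lexicographically.
pair_counts = {
    (b"A", b"B"): 5,
    (b"A", b"C"): 5,
    (b"B", b"ZZ"): 5,
    (b"BA", b"A"): 5,
    (b"Z", b"Z"): 4,   # lexicographically greater, but less frequent
}

best = max(pair_counts, key=lambda pair: (pair_counts[pair], pair))
print(best)
```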

"Special tokens"
Often, some strings (e.g., <|endoftext|>) are used to encode metadata (e.g., boundaries between documents). When encoding text, it’s often desirable to treat some strings as “special tokens” that should never be split into multiple tokens (i.e., will always be preserved as a single token). For example, the end-of-sequence string <|endoftext|> should always be preserved as a single token (i.e., a single integer ID), so we know when to stop generating from the language model. These special tokens must be added to the vocabulary, so they have a corresponding fixed token ID.
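
A sketch of giving special tokens fixed IDs. Appending them right after the 256 byte entries is just one possible convention, assumed here for illustration.

```python
# Illustrative sketch: add special tokens to the vocabulary with fixed IDs.
vocab = {i: bytes([i]) for i in range(256)}
special_tokens = ["<|endoftext|>"]

for token in special_tokens:
    vocab[len(vocab)] = token.encode("utf-8")   # next free ID

special_ids = {vocab[i].decode("utf-8"): i for i in range(256, len(vocab))}
print(special_ids)
```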

Note that even a single byte is a bytes object in Python. There is no byte type in Python to represent a single byte, just as there is no char type in Python to represent a single character.
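
A short sketch of what this means in practice: indexing a bytes object yields an int, while slicing or `bytes([...])` yields a length-1 bytes object.

```python
# Illustrative sketch: single bytes in Python.
data = b"abc"

first_int = data[0]              # indexing gives an int
first_bytes = data[0:1]          # slicing gives a bytes object of length 1
rebuilt = bytes([first_int])     # wrap the int in a list to get bytes back
singles = [bytes([b]) for b in data]

print(first_int, first_bytes, rebuilt, singles)
```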

In [30]:
import os
import requests

os.makedirs('data', exist_ok=True)

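def download(url, filename):
    # Stream the response so the large file is never held fully in memory
    with requests.get(url, stream=True) as response:
        response.raise_for_status()  # fail loudly on HTTP errors
        with open(filename, 'wb') as f:
            for chunk in response.iter_content(chunk_size=1 << 20):
                f.write(chunk)
    print(f"Downloaded {filename}")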

download(
    'https://huggingface.co/datasets/roneneldan/TinyStories/resolve/main/TinyStoriesV2-GPT4-train.txt',
    'data/TinyStoriesV2-GPT4-train.txt'
)
download(
    'https://huggingface.co/datasets/roneneldan/TinyStories/resolve/main/TinyStoriesV2-GPT4-valid.txt',
    'data/TinyStoriesV2-GPT4-valid.txt'
)


Downloaded data/TinyStoriesV2-GPT4-train.txt
Downloaded data/TinyStoriesV2-GPT4-valid.txt


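Before pre-tokenization, we should strip all special tokens out of the corpus (for example by splitting the text on them), so that no pairs are counted across the boundaries they mark.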
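
One way to do this, sketched with the standard-library `re` module: escape each special token and split on the alternation, then pre-tokenize each chunk separately. The sample text is made up.

```python
import re

# Illustrative sketch: split the corpus on special tokens before pre-tokenizing.
special_tokens = ["<|endoftext|>"]
split_pattern = "|".join(re.escape(token) for token in special_tokens)

text = "Doc one.<|endoftext|>Doc two."
chunks = re.split(split_pattern, text)

print(chunks)
```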

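BPE training: at each step, find the most frequent pair of adjacent bytes and merge it into a new symbol.
Naive approach: after every merge, rescan the whole corpus and recount the frequency of every byte pair.
However, a merge only affects the places related to the merged pair. For example, after merging ab -> x, the only counts that change are those at positions that originally contained ab, together with the new pairs that x forms with its neighbors (such as xa or bx). Everything else is unchanged and does not need to be recounted. We can therefore maintain a hash table that stores the count of each byte pair and, after each merge, update only the counts of the affected pairs instead of traversing the entire corpus.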
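
A minimal sketch of this incremental update, with hypothetical helper names and made-up word counts. Only words that contain the merged pair have their pair counts patched. For simplicity the sketch still loops over all words to find them; a fuller version would also keep an index from each pair to the words containing it.

```python
from collections import Counter

def count_pairs(word_counts):
    """Full count of adjacent pairs, weighted by how often each word occurs."""
    counts = Counter()
    for word, count in word_counts.items():
        for pair in zip(word, word[1:]):
            counts[pair] += count
    return counts

def merge_word(word, pair):
    """Replace each occurrence of `pair` in `word` with the concatenated symbol."""
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            out.append(pair[0] + pair[1])
            i += 2
        else:
            out.append(word[i])
            i += 1
    return tuple(out)

def apply_merge(word_counts, pair_counts, pair):
    """Merge `pair` in every word and patch `pair_counts` in place."""
    new_word_counts = {}
    for word, count in word_counts.items():
        old_pairs = list(zip(word, word[1:]))
        if pair in old_pairs:
            new_word = merge_word(word, pair)
            for p in old_pairs:                      # remove this word's old pairs
                pair_counts[p] -= count
            for p in zip(new_word, new_word[1:]):    # add its pairs after the merge
                pair_counts[p] += count
            word = new_word
        new_word_counts[word] = new_word_counts.get(word, 0) + count
    for p in [p for p, c in pair_counts.items() if c <= 0]:
        del pair_counts[p]                           # drop pairs that no longer occur
    return new_word_counts

word_counts = {
    (b"l", b"o", b"w"): 5,
    (b"l", b"o", b"w", b"e", b"r"): 2,
    (b"n", b"e", b"w"): 3,
}
pair_counts = count_pairs(word_counts)               # counted once, up front
word_counts = apply_merge(word_counts, pair_counts, (b"l", b"o"))
print(pair_counts)
```

Comparing the patched table against a fresh `count_pairs(word_counts)` is a handy sanity check while developing.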