Starting with Llama 3, Meta switched its tokenization from sentencepiece to tiktoken.

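Differences between sentencepiece and tiktoken:
1. sentencepiece
   1. Fix a target vocab size in advance.
   2. Add every single character that appears in the corpus to the vocabulary.
   3. Grow the vocabulary to the target size in a loop, adding one token per round. The rule for the new token: under the current encoding, pick the adjacent token pair with the highest frequency and add it as a new token. (Common letter combinations inside words, such as th and ng, are gradually added to the vocab, followed by longer combinations such as ing and scr. This lets the encoding handle words that never appeared in the corpus.)
   4. When encoding a string, greedy matching can be used: match against the longest token.
   5. In essence: character-level BPE with byte fallback. The base unit is the character; out-of-vocabulary words fall back to characters, and characters missing from the vocabulary fall back to bytes.
2. tiktoken
   1. The algorithm is essentially the same as sentencepiece's.
   2. The difference is that tiktoken takes bytes as the base tokens: it processes text as a sequence of bytes rather than a sequence of characters.
   3. sentencepiece operates on characters, not on the UTF-8 encoding. For example, "你好" is 2 characters but 6 bytes, b"\xe4\xbd\xa0\xe5\xa5\xbd". Suppose the token "你好" is in both the sentencepiece vocabulary and the tiktoken vocabulary. sentencepiece then needs 1 merge, ("你", "好"), whereas tiktoken needs 5 merges: (b"\xe4", b"\xbd"), (b"\xe4\xbd", b"\xa0"), (b"\xe5", b"\xa5"), (b"\xe5\xa5", b"\xbd"), (b"\xe4\xbd\xa0", b"\xe5\xa5\xbd").
   4. tiktoken and sentencepiece handle out-of-vocabulary input differently. Suppose "佰" (b"\xe4\xbd\xb0") is not in the vocabulary. sentencepiece falls back to the sequence ("<0xE4>", "<0xBD>", "<0xB0>"), whereas tiktoken may produce the sequence (b"\xe4\xbd", b"\xb0").
   5. sentencepiece trains on text line by line, whereas tiktoken treats the whole training corpus as one very long string. So in a vocabulary trained by sentencepiece every newline (\n) is a separate token (<0x0A>), while a vocabulary trained by tiktoken can contain tokens that join several newlines together.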
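
The merge counts in the "你好" example can be checked with a toy BPE trainer. This is only a minimal sketch of the pair-merging loop described above, not the actual sentencepiece or tiktoken implementation; `train_bpe` is a hypothetical helper, and its tie-breaking between equally frequent pairs is arbitrary, so the order of the byte-level merges may differ from the order listed above even though the count is the same.

```python
from collections import Counter


def train_bpe(sequences, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair.

    `sequences` is a list of symbol lists (1-character strings or 1-byte
    bytes objects). Returns (merges in learned order, final sequences).
    """
    seqs = [list(s) for s in sequences]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break  # every sequence is already a single token
        (left, right), _count = pairs.most_common(1)[0]
        merges.append((left, right))
        merged = left + right
        new_seqs = []
        for seq in seqs:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
                    out.append(merged)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_seqs.append(out)
        seqs = new_seqs
    return merges, seqs


text = "你好"
char_symbols = list(text)                                  # 2 symbols
byte_symbols = [bytes([b]) for b in text.encode("utf-8")]  # 6 symbols

char_merges, char_seqs = train_bpe([char_symbols], num_merges=10)
byte_merges, byte_seqs = train_bpe([byte_symbols], num_merges=10)

print(len(char_merges), char_seqs)  # merges needed at the character level
print(len(byte_merges), byte_seqs)  # merges needed at the byte level
```

Starting from characters, one merge is enough to reach the single token "你好"; starting from bytes, five merges are needed, because each merge shortens the 6-symbol sequence by exactly one.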


In [None]:
text = "你好"
utf8_bytes = text.encode('utf-8')
print(utf8_bytes)

In [None]:
text = "佰"
utf8_bytes = text.encode('utf-8')
print(utf8_bytes)

In [None]:
raw_bytes = b'\xe4\xbd\xb0'  # renamed so the built-in `bytes` is not shadowed
text = raw_bytes.decode('utf-8')
print(text)
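
To connect these byte values to the fallback behavior described earlier, here is a small sketch of sentencepiece-style byte fallback. `encode_with_byte_fallback` and `toy_vocab` are hypothetical names for illustration; a real tokenizer matches multi-character pieces as well, which this sketch leaves out.

```python
def encode_with_byte_fallback(text, vocab):
    """Character-level lookup; unknown characters become <0xNN> byte pieces."""
    pieces = []
    for ch in text:
        if ch in vocab:
            pieces.append(ch)
        else:
            # Fall back to one piece per UTF-8 byte of the unknown character.
            pieces.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return pieces


toy_vocab = {"你", "好"}  # hypothetical vocabulary that lacks "佰"
fallback_pieces = encode_with_byte_fallback("你好佰", toy_vocab)
print(fallback_pieces)
# ['你', '好', '<0xE4>', '<0xBD>', '<0xB0>']
```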

For reference, the composition of the Qwen tokenizer:

![img_1.png](../images/img_qwen_tokenize.png)

In [None]:
from transformers import AutoTokenizer
tokenizer_qwen2 = AutoTokenizer.from_pretrained("../model/Qwen2-7B")

import tiktoken
enc = tiktoken.get_encoding("cl100k_base")

In [None]:
print(tokenizer_qwen2.encode("00"))
print(enc.encode("00"))

print(tokenizer_qwen2.encode("ith"))
print(enc.encode("ith"))

print(tokenizer_qwen2.decode([411]))
print(enc.encode("out"))

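Starting from token id 410, the GPT-4 vocabulary and the Qwen vocabulary fall out of alignment, and the offset grows as the token id increases. It appears that Qwen deliberately removed some tokens.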
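
A growing offset is what one would expect if tokens were removed from an otherwise identical list. The sketch below shows this on toy data only; `first_divergence`, `id_shift`, and both token lists are made up for illustration and say nothing about which tokens Qwen actually dropped.

```python
def first_divergence(tokens_a, tokens_b):
    """Return the first id at which the two id-ordered token lists differ."""
    for i, (a, b) in enumerate(zip(tokens_a, tokens_b)):
        if a != b:
            return i
    return None


def id_shift(tokens_a, tokens_b):
    """For every token present in both lists, return id_in_a - id_in_b."""
    index_b = {tok: i for i, tok in enumerate(tokens_b)}
    return {tok: i - index_b[tok] for i, tok in enumerate(tokens_a) if tok in index_b}


reference = ["a", "b", "c", "d", "e", "f", "g"]  # toy "full" vocabulary
pruned = ["a", "b", "d", "e", "g"]               # same list with "c" and "f" removed

divergence = first_divergence(reference, pruned)
shifts = id_shift(reference, pruned)
print(divergence)  # 2
print(shifts)      # {'a': 0, 'b': 0, 'd': 1, 'e': 1, 'g': 2}
```

Each removed token pushes every later token one id further out of alignment, so the shift only ever grows.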

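At the time of writing, Meta had not officially explained how it arrived at the 128K vocabulary. It seems fairly safe to conclude, however, that the vocabulary was built in a way similar to Qwen's: starting from GPT-4's 100K vocabulary and extending it by merging in a vocabulary trained with sentencepiece.

The evidence is as follows: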

In [None]:
tokenizer_llama3 = AutoTokenizer.from_pretrained("../model/Meta-Llama-3-8B")
print(len(tokenizer_llama3))

for i in range(100000):
    if enc.decode([i]) != tokenizer_llama3.decode([i], clean_up_tokenization_spaces=False):
        print(i)
        print(enc.decode([i]))
        print(tokenizer_llama3.decode([i], clean_up_tokenization_spaces=False))
        
print("done")

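Next, try merging vocabularies: merge the llama3 vocabulary with the chatglm3 vocabulary.

The llama3 model files ship only the fast tokenizer, whose vocabulary file is tokenizer.json; no slow tokenizer (that is, vocab.txt) is provided.

The chatglm3 model files, by contrast, ship tokenizer.model.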

In [None]:
tokenizer_chatglm = AutoTokenizer.from_pretrained("../model/chatglm3-6b", trust_remote_code=True)
print(len(tokenizer_chatglm))
print(tokenizer_chatglm.added_tokens_decoder)

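The merge follows these rules:

- 0 - 127999: the original llama3 vocabulary, 128000 tokens in total. In the llama3 vocabulary, ids from 128000 onward are added_tokens.
- 128000 - x: Chinese tokens from the chatglm3 vocabulary, deduplicated against the first 128000 tokens.
- x+1 - y: the special tokens of the original chatglm3, added as added_tokens.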
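
The id layout in these rules can be sketched on a toy vocabulary. This is an illustration under assumed names (`merge_vocab`, `base`, and the sample tokens are all hypothetical), not the merge code used later in this notebook.

```python
def merge_vocab(base_vocab, new_tokens, special_tokens):
    """Append deduplicated new tokens, then special tokens, after the base ids."""
    merged = dict(base_vocab)
    next_id = max(merged.values()) + 1
    for tok in new_tokens:
        if tok not in merged:  # dedupe against the base vocab and earlier additions
            merged[tok] = next_id
            next_id += 1
    added = {}
    for tok in special_tokens:  # special tokens take the ids after the new tokens
        if tok not in merged and tok not in added:
            added[tok] = next_id
            next_id += 1
    return merged, added


base = {"a": 0, "b": 1, "ab": 2}
merged, added = merge_vocab(base, ["ab", "你", "好", "你"], ["<|eot|>", "<|pad|>"])
print(merged)  # {'a': 0, 'b': 1, 'ab': 2, '你': 3, '好': 4}
print(added)   # {'<|eot|>': 5, '<|pad|>': 6}
```

Keeping the base ids unchanged means the existing embedding rows stay valid, and only the rows for the appended ids need to be initialized.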

In [None]:
import json
import copy
# Read every file as UTF-8 explicitly so the result does not depend on the platform default.
with open("../model/Meta-Llama-3-8B/tokenizer.json", encoding="utf-8") as f:
    llama3_tokenizer_json = json.load(f)
with open("../model/Meta-Llama-3-8B/config.json", encoding="utf-8") as f:
    llama3_config_json = json.load(f)
with open("../model/Meta-Llama-3-8B/tokenizer_config.json", encoding="utf-8") as f:
    llama3_tokenizer_config = json.load(f)
with open("../model/Meta-Llama-3-8B/special_tokens_map.json", encoding="utf-8") as f:
    llama3_special_tokens_map = json.load(f)
with open("../model/Meta-Llama-3-8B/generation_config.json", encoding="utf-8") as f:
    llama3_generation_config = json.load(f)

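Read the SentencePiece tokenizer model file from the given path. The model file holds the tokenizer's training result, including the subword units and their frequencies or probabilities.

Parse the model file into a ModelProto instance, so that the details of the model, such as the subword vocabulary and the frequencies, can be accessed through it afterwards.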



In [None]:
import sentencepiece.sentencepiece_model_pb2 as model

# Parse the SentencePiece model file into a ModelProto instance.
chatglm3_tokenizer_sp = model.ModelProto()
with open("../model/chatglm3-6b/tokenizer.model", "rb") as f:
    chatglm3_tokenizer_sp.ParseFromString(f.read())

vocab_size = len(chatglm3_tokenizer_sp.pieces)
print(vocab_size)

# Dump every piece, one per line, into a plain-text vocab file.
vocab_list = [piece.piece for piece in chatglm3_tokenizer_sp.pieces]
vocab_file_path = "../model/chatglm3-6b/vocab.txt"

with open(vocab_file_path, "w", encoding="utf-8") as f:
    for token in vocab_list:
        f.write(token + "\n")

print(f"Vocabulary saved to {vocab_file_path}")

In [None]:
from hanzidentifier import is_simplified
def is_contain_chinese(word):
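    """
    Check whether a string contains a Chinese character.

    Relies on hanzidentifier.is_simplified, so a character only counts
    when it is compatible with simplified Chinese.
    :param word: the string to check
    :return: bool, True if such a character is found, False otherwise
    """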

    for char in word:
        if is_simplified(char):
            return True
    return False
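
If the hanzidentifier dependency is unavailable, a rough stand-in is a Unicode range check. This sketch is an assumption-laden alternative, not equivalent to the function above: `contains_cjk_ideograph` is a hypothetical name, it only covers the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), and it does not distinguish simplified from traditional characters.

```python
def contains_cjk_ideograph(word):
    """Return True if any character lies in the basic CJK Unified Ideographs block."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in word)


print(contains_cjk_ideograph("hello"))   # False
print(contains_cjk_ideograph("新华社"))   # True
print(contains_cjk_ideograph("▁新"))     # True
print(contains_cjk_ideograph("，"))      # False: full-width punctuation is outside the block
```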

In [None]:
print(tokenizer_chatglm.special_tokens_map)

In [47]:
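# Despite its name, this list starts out with the Llama 3 special tokens;
# chatglm3 user-defined pieces are appended to it below.
chatglm3_special_tokens = ["<|begin_of_text|>", "<|end_of_text|>", "<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>"]
chatglm3_normal_tokens = []
for index, piece in enumerate(chatglm3_tokenizer_sp.pieces):
    if piece.type == 4:  # USER_DEFINED piece
        # Skip pieces that Llama 3 already encodes as a single token.
        # Note: encode() may also prepend a BOS token, depending on the tokenizer config.
        if len(tokenizer_llama3.encode(piece.piece)) == 1:
            pass
        else:
            if piece.piece not in {"<human>", "<bot>", "<|im_start|>", "<|im_end|>"}:
                chatglm3_special_tokens.append(piece.piece)
    elif piece.type == 1:  # NORMAL piece
        if index >= 32000:
            # Keep every normal piece from index 32000 onward.
            chatglm3_normal_tokens.append(piece.piece)
        else:
            # Below index 32000, keep only pieces that hanzidentifier accepts as simplified Chinese.
            if is_simplified(piece.piece):
                chatglm3_normal_tokens.append(piece.piece)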

print(len(chatglm3_special_tokens))
print(len(chatglm3_normal_tokens))
# print(chatglm3_normal_tokens)

5
33160


In [45]:
llama3_vocab = copy.deepcopy(llama3_tokenizer_json["model"]["vocab"])
# Largest id in the base vocab; new tokens are appended after it.
last_token_index = max(llama3_vocab.values())
print(last_token_index)

127999


In [50]:
# Build a mapping from bytes to Unicode strings
# Copied from transformers.models.gpt2.tokenization_gpt2.bytes_to_unicode
def bytes_to_unicode():
    """
    Returns list of utf-8 byte and a mapping to unicode strings. We specifically avoids mapping to whitespace/control
    characters the bpe code barfs on.

    The reversible bpe codes work on unicode strings. This means you need a large # of unicode characters in your vocab
    if you want to avoid UNKs. When you're at something like a 10B token dataset you end up needing around 5K for
    decent coverage. This is a significant percentage of your normal, say, 32K bpe vocab. To avoid that, we want lookup
    tables between utf-8 bytes and unicode strings.
    """
    bs = (
            list(range(ord("!"), ord("~") + 1)) + list(range(ord("¡"), ord("¬") + 1)) + list(
        range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(2 ** 8):
        if b not in bs:
            bs.append(b)
            cs.append(2 ** 8 + n)
            n += 1
    cs = [chr(n) for n in cs]
    return dict(zip(bs, cs))
byte_encoder = bytes_to_unicode()
print(byte_encoder)

# Copied from transformers.models.gpt2.tokenization_gpt2.GPT2Tokenizer._tokenize
def tokenize(token):
    """Tokenize a string."""
    token_list = [byte_encoder[b] for b in token.encode("utf-8")]
    token = "".join(token_list)  # Maps all our bytes to unicode strings, avoiding control tokens of the BPE (spaces in our case)
    return token


{33: '!', 34: '"', 35: '#', 36: '$', 37: '%', 38: '&', 39: "'", 40: '(', 41: ')', 42: '*', 43: '+', 44: ',', 45: '-', 46: '.', 47: '/', 48: '0', 49: '1', 50: '2', 51: '3', 52: '4', 53: '5', 54: '6', 55: '7', 56: '8', 57: '9', 58: ':', 59: ';', 60: '<', 61: '=', 62: '>', 63: '?', 64: '@', 65: 'A', 66: 'B', 67: 'C', 68: 'D', 69: 'E', 70: 'F', 71: 'G', 72: 'H', 73: 'I', 74: 'J', 75: 'K', 76: 'L', 77: 'M', 78: 'N', 79: 'O', 80: 'P', 81: 'Q', 82: 'R', 83: 'S', 84: 'T', 85: 'U', 86: 'V', 87: 'W', 88: 'X', 89: 'Y', 90: 'Z', 91: '[', 92: '\\', 93: ']', 94: '^', 95: '_', 96: '`', 97: 'a', 98: 'b', 99: 'c', 100: 'd', 101: 'e', 102: 'f', 103: 'g', 104: 'h', 105: 'i', 106: 'j', 107: 'k', 108: 'l', 109: 'm', 110: 'n', 111: 'o', 112: 'p', 113: 'q', 114: 'r', 115: 's', 116: 't', 117: 'u', 118: 'v', 119: 'w', 120: 'x', 121: 'y', 122: 'z', 123: '{', 124: '|', 125: '}', 126: '~', 161: '¡', 162: '¢', 163: '£', 164: '¤', 165: '¥', 166: '¦', 167: '§', 168: '¨', 169: '©', 170: 'ª', 171: '«', 172: '¬', 174: 
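
The mapping is reversible, which is what allows a byte-level vocabulary to be stored as ordinary Unicode strings in tokenizer.json. The sketch below restates the GPT-2 mapping in compact form and adds an inverse; `to_vocab_key` and `from_vocab_key` are hypothetical helper names used only for this illustration.

```python
def bytes_to_unicode():
    """GPT-2 style byte -> printable Unicode character table (compact restatement)."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(2 ** 8):
        if b not in bs:
            # Whitespace/control bytes get shifted to code points from 256 upward.
            bs.append(b)
            cs.append(2 ** 8 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))


byte_encoder = bytes_to_unicode()
byte_decoder = {ch: b for b, ch in byte_encoder.items()}  # inverse table


def to_vocab_key(text):
    """Map text to the string form used as a key in a byte-level BPE vocab."""
    return "".join(byte_encoder[b] for b in text.encode("utf-8"))


def from_vocab_key(key):
    """Invert to_vocab_key: recover the original text from a vocab key."""
    return bytes(byte_decoder[ch] for ch in key).decode("utf-8")


key = to_vocab_key("新华社")
print(key, len(key))        # 3 characters become 9 mapped bytes
print(from_vocab_key(key))  # round-trips back to the original text
print(to_vocab_key(" "))    # the space byte maps to 'Ġ'
```

Each Chinese character occupies three positions in the key, one per UTF-8 byte, so slicing a key by index slices bytes rather than characters.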

In [51]:
def extract_merges(vocab):
    """
    By default will return vocab and merges with respect to their order, by sending `vocab_scores` we're going to
    order the merges with respect to the piece scores instead.
    """

    vocab_scores, reverse = vocab, False
    # Merges
    merges = []
    
    for merge, piece_score in vocab_scores.items():
        local = []
        for index in range(1, len(merge)):
            piece_l, piece_r = merge[:index], merge[index:]
            if piece_l in vocab and piece_r in vocab:
                local.append((piece_l, piece_r, piece_score))
        local = sorted(local, key=lambda x: (vocab[x[0]], vocab[x[1]]))
        merges.extend(local)

    merges = sorted(merges, key=lambda val: val[2], reverse=reverse)
    merges = [" ".join([val[0], val[1]]) for val in merges]
    return merges
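
To see what this function yields, here is a condensed restatement of the same logic run on a toy vocabulary. The name `derive_merges` and the toy vocab are illustrative only; the ids double as scores, as in the function above.

```python
def derive_merges(vocab):
    """Derive BPE merge rules from a token -> id vocab, ordered by the merged token's id."""
    merges = []
    for token, rank in vocab.items():
        # Every way to split the token into two pieces that both exist in the vocab.
        splits = [(token[:i], token[i:], rank)
                  for i in range(1, len(token))
                  if token[:i] in vocab and token[i:] in vocab]
        splits.sort(key=lambda s: (vocab[s[0]], vocab[s[1]]))
        merges.extend(splits)
    merges.sort(key=lambda s: s[2])
    return [f"{left} {right}" for left, right, _ in merges]


toy_vocab = {"a": 0, "b": 1, "c": 2, "ab": 3, "abc": 4, "xyz": 5}
toy_merges = derive_merges(toy_vocab)
print(toy_merges)
# ['a b', 'ab c']
```

No merge is produced for "xyz", because none of its two-way splits has both halves in the vocab. Such a token can never be reached by merging, which is why intermediate tokens sometimes have to be added alongside a long token.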

In [55]:
for token in chatglm3_normal_tokens:
    token_key = tokenize(token)
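    # Special case: a token that merges can never assemble has no reason to exist, so for
    # tokens that chatglm3 splits into 3 pieces an intermediate prefix token is added as well.
    # Note: token_key[:2] takes the first two byte-mapped characters (two UTF-8 bytes),
    # not the first two Chinese characters.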
    if len(tokenizer_chatglm.tokenize(token)) == 3:
        # print(token)
        extra_token_key = token_key[:2]
        if extra_token_key not in llama3_vocab:
            last_token_index += 1
            llama3_vocab[extra_token_key] = last_token_index
    if token_key not in llama3_vocab:
        last_token_index += 1
        llama3_vocab[token_key] = last_token_index
print(last_token_index)

156439


In [54]:
token = "新华社"
print(tokenizer_chatglm.tokenize(token))
print(tokenize(token))

token = "新型冠状病毒"
print(tokenizer_chatglm.tokenize(token))
print(tokenize(token))

['▁新', '华', '社']
æĸ°åįİç¤¾
['▁新', '型', '冠状病毒']
æĸ°åŀĭåĨłçĬ¶çĹħæ¯Ĵ


In [58]:
print(tokenizer_chatglm.eos_token)
print(tokenizer_chatglm.eos_token_id)

print(tokenizer_llama3.eos_token)
print(tokenizer_llama3.eos_token_id)

</s>
2
<|end_of_text|>
128001
