# 1. Overview

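Tokenization is a basic task in NLP, and most NLP tasks involve it in some form. A tokenizer is the component that performs this splitting.  
No single tokenization algorithm works for every scenario, which is why the algorithms keep being refined. There is **no one unified method** for all scenarios because most tokenization algorithms have to balance the following three issues:  
**1) The OOV (Out-of-Vocabulary) problem:** how to handle words that are not in the vocabulary, or how to tokenize so that such words do not arise in the first place;  
**2) Tokenization granularity:** coarse-grained tokens carry more specific semantics and shorten the sequence, while fine-grained tokens largely avoid the OOV problem and shrink the vocabulary, so the algorithm has to choose a granularity;  
**3) Ambiguity:** there is no single standard for tokenization, and different scenarios call for different granularities.  
Existing tokenization methods can be organized along two dimensions:  
**1) By method:** dictionary-matching methods, statistical-model methods, and deep-learning methods  
**2) By granularity:** word, subword, char  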
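
The granularity and OOV trade-off described above can be seen on a toy example. The sketch below is only an illustration: the two sentences and the `<unk>` placeholder are made-up assumptions, and whitespace splitting stands in for a real word-level tokenizer.

```python
# Toy corpus (hypothetical) used to build the vocabularies.
train_text = "the cat sat on the mat"
# Unseen text containing a word ("cats") that never appeared in training.
test_text = "the cats sat"

# Word-level: the vocabulary is the set of whitespace-separated words.
word_vocab = set(train_text.split())
word_tokens = [w if w in word_vocab else "<unk>" for w in test_text.split()]

# Character-level: the vocabulary is the set of individual characters.
char_vocab = set(train_text)
char_tokens = [c if c in char_vocab else "<unk>" for c in test_text]

print("word vocab size:", len(word_vocab), "tokens:", word_tokens)
print("char vocab size:", len(char_vocab), "sequence length:", len(char_tokens))
```

Word-level tokenization gives a short sequence but maps the unseen word to `<unk>`; character-level tokenization has no unknown token here but produces a much longer sequence. Subword methods sit between these two extremes.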

**Chinese and English tokenization**
https://easyai.tech/ai-definition/tokenization/

**Tokenization methods:** https://zhuanlan.zhihu.com/p/620603105  
**1) char-based:** character-level tokenization, suited to Chinese; see https://arxiv.org/pdf/1905.05526  
**2) word-based:** word-level tokenization, suited to English  
**3) subword:** a middle ground between char-based and word-based that splits text into word roots (English) or phrases (Chinese)
- A. Byte Pair Encoding (BPE)  https://github.com/rsennrich/subword-nmt  
  
  - **minBPE**: https://github.com/karpathy/minbpe  
    [Video walkthrough by the author](https://www.bilibili.com/video/BV1BH4y1N7JA/?spm_id_from=333.337.search-card.all.click&vd_source=6616c1ef2d5d1b0f463724e69d204363)  
    [Video walkthrough (Chinese community)](https://www.bilibili.com/video/BV12x4y1t75q/?spm_id_from=333.337.search-card.all.click&vd_source=6616c1ef2d5d1b0f463724e69d204363)  
  - Widely used in large language models because it __handles the OOV problem effectively__ and tends to keep common roots and affixes intact
  - More recently, BPE has evolved into **Byte-level BPE (BBPE)**
- B. Unigram
- C. WordPiece  
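
To make the byte-level idea concrete: a byte-level tokenizer starts from the 256 possible byte values rather than from characters, so any string can be represented without an unknown token. The helper name `to_byte_ids` below is hypothetical and only illustrates that starting point; it is not part of any library.

```python
def to_byte_ids(text):
    """Map text to its UTF-8 bytes, the base alphabet of byte-level BPE."""
    return list(text.encode("utf-8"))

for sample in ["hello", "分词", "🙂"]:
    ids = to_byte_ids(sample)
    # Every id is in range(256), regardless of the script or symbol.
    print(repr(sample), "chars:", len(sample), "bytes:", len(ids))
    # The mapping is lossless: decoding the bytes restores the text.
    assert bytes(ids).decode("utf-8") == sample
```

ASCII letters take one byte each, while CJK characters and emoji take several, so byte-level sequences for non-English text start out longer before any merges are learned.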

## 1.1 minBPE

**minbpe/base.py:**   
Implements the **Tokenizer class**, which is the base class. It contains the **train, encode, and decode** stubs, **save/load** functionality, and a few common utility functions. This class is not meant to be used directly, but rather to be inherited from.  
**minbpe/basic.py:**  
**Implements the BasicTokenizer**, the simplest implementation of the BPE algorithm that runs directly on text.  
**minbpe/regex.py:**  
Implements the RegexTokenizer that further **splits the input text by a regex pattern**, which is a preprocessing stage that splits up the input text by categories (think: letters, numbers, punctuation) before tokenization. This ensures that no merges will happen across category boundaries. This was introduced in the GPT-2 paper and continues to be in use as of GPT-4. This class also handles special tokens, if any.  
**minbpe/gpt4.py:**  
Implements the **GPT4Tokenizer**. This class is a light wrapper around the RegexTokenizer (described above) that **exactly reproduces the tokenization of GPT-4 in the [tiktoken](https://github.com/openai/tiktoken) library**. The wrapping handles some details around recovering the exact merges in the tokenizer, and the handling of some unfortunate (and likely historical?) 1-byte token permutations.
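
The regex pre-splitting stage described for the RegexTokenizer can be sketched with the standard library. The pattern below is a deliberately simplified, ASCII-only assumption made for illustration; it is not the actual GPT-2 or GPT-4 split pattern, and `pre_split` and `pair_counts` are hypothetical helper names.

```python
import re

# Simplified category pattern (assumption): letters, digits, other symbols,
# each optionally preceded by one space, plus runs of whitespace.
SPLIT_PATTERN = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+")

def pre_split(text):
    """Split text into chunks so that no chunk mixes categories."""
    return SPLIT_PATTERN.findall(text)

def pair_counts(chunks):
    """Count adjacent byte pairs inside each chunk, never across chunks."""
    counts = {}
    for chunk in chunks:
        ids = list(chunk.encode("utf-8"))
        for pair in zip(ids, ids[1:]):
            counts[pair] = counts.get(pair, 0) + 1
    return counts

text = "Hello world123!!"
chunks = pre_split(text)
print(chunks)

# The pair ("d", "1") spans a letter/digit boundary, so it is never counted
# once the text is pre-split, and therefore can never be merged.
boundary_pair = (ord("d"), ord("1"))
print(boundary_pair in pair_counts([text]), boundary_pair in pair_counts(chunks))
```

Because merge candidates are counted per chunk, a pair that straddles a category boundary never appears in the statistics, which is how the pre-splitting prevents merges across letters, numbers, and punctuation.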

In [14]:
!git clone https://github.com/karpathy/minbpe
!pip install --upgrade pip
try:
    import tiktoken
except ImportError:
    !pip install tiktoken

fatal: destination path 'minbpe' already exists and is not an empty directory.
Collecting tiktoken
  Downloading tiktoken-0.7.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Collecting regex>=2022.1.18 (from tiktoken)
  Using cached regex-2024.7.24-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (40 kB)
Downloading tiktoken-0.7.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
Using cached regex-2024.7.24-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (790 kB)
Installing collected packages: regex, tiktoken
Successfully installed regex-2024.7.24 tiktoken-0.7.0


### 1.1.1 quick start

In [33]:
from minbpe.minbpe import BasicTokenizer
tokenizer = BasicTokenizer()
text = "aaabdaaabac"
tokenizer.train(text, 256 + 3) # 256 are the byte tokens, then do 3 merges
print(tokenizer.encode(text))
# [258, 100, 258, 97, 99]
print(tokenizer.decode([258, 100, 258, 97, 99]))
# aaabdaaabac
tokenizer.save("toy")
# writes two files: toy.model (for loading) and toy.vocab (for viewing)

[258, 100, 258, 97, 99]
aaabdaaabac
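
The toy result above can be traced with a few lines of plain Python. This is a minimal sketch of the BPE training loop written for illustration, not minbpe's source; the function names are hypothetical, and it assumes ties between equally frequent pairs are broken in favor of the pair seen first.

```python
def get_stats(ids):
    """Count how often each adjacent pair of token ids occurs."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, new_id):
    """Replace each left-to-right occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, vocab_size):
    """Start from raw bytes and learn (vocab_size - 256) merges."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for new_id in range(256, vocab_size):
        stats = get_stats(ids)
        if not stats:
            break  # nothing left to merge
        pair = max(stats, key=stats.get)  # most frequent pair, first seen wins ties
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges

def decode(ids, merges):
    """Expand merged ids back into bytes, then into text."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():  # dict order is merge order
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")

ids, merges = train_bpe("aaabdaaabac", 256 + 3)
print(merges)  # {(97, 97): 256, (256, 97): 257, (257, 98): 258}
print(ids)     # [258, 100, 258, 97, 99]
print(decode(ids, merges))  # aaabdaaabac
```

The three merges are `aa` → 256, then `aa` + `a` → 257, then `aaa` + `b` → 258, which is why both occurrences of `aaab` collapse into the single token 258 while `d`, `a`, and `c` remain as the raw bytes 100, 97, and 99.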


### 1.1.2 inference: GPT-4 comparison