https://huggingface.co/learn/nlp-course/chapter6/4#normalization-and-pre-tokenization

# Tokenizers used by different transformer models

**WordPiece** is the tokenization algorithm Google developed to pretrain BERT. It has since been reused in quite a few Transformer models based on BERT, such as DistilBERT and MobileBERT.

It is a subword tokenization algorithm that starts with individual characters and merges them to form subwords. Rather than simply merging the most frequent pair, it scores each pair by the pair's frequency divided by the product of the frequencies of its two parts, so it prioritizes pairs whose individual parts are less frequent in the vocabulary. Merging continues until the desired vocabulary size is reached.
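
The pair-selection rule can be illustrated on a toy corpus. This is a minimal sketch, assuming the commonly described score `freq(pair) / (freq(first) * freq(second))`; the word counts and helper names are hypothetical, and Google's original training code is not public, so treat it as an illustration rather than the reference implementation.

```python
from collections import Counter

# Hypothetical toy corpus: word -> number of occurrences.
word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}


def split_word(word):
    # Start from characters; "##" marks a piece that continues a word.
    return [word[0]] + ["##" + ch for ch in word[1:]]


def pair_statistics(word_freqs):
    token_freq = Counter()
    pair_freq = Counter()
    for word, freq in word_freqs.items():
        pieces = split_word(word)
        for piece in pieces:
            token_freq[piece] += freq
        for pair in zip(pieces, pieces[1:]):
            pair_freq[pair] += freq
    # WordPiece-style score: pair frequency normalized by its parts.
    scores = {
        pair: freq / (token_freq[pair[0]] * token_freq[pair[1]])
        for pair, freq in pair_freq.items()
    }
    return pair_freq, scores


pair_freq, scores = pair_statistics(word_freqs)
most_frequent = max(pair_freq, key=pair_freq.get)
best_pair = max(scores, key=scores.get)

print(most_frequent)  # ('##u', '##g'): the most frequent pair
print(best_pair)      # ('##g', '##s'): the pair WordPiece would merge first
```

The most frequent pair is not selected here because `##u` and `##g` are each very common on their own, while `##s` is rare outside the pair `##g` + `##s`.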

The "##" prefix means that the token continues the previous one, so it is attached to it without a space when decoding (reversing the tokenization).
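
As a rough illustration of that rule, the sketch below joins a list of WordPiece tokens back into text. The function name `join_wordpieces` is hypothetical, and the real `decode` method also handles special tokens and spacing around punctuation, which this sketch does not attempt.

```python
def join_wordpieces(tokens):
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            # Continuation piece: glue it onto the previous word.
            words[-1] += token[2:]
        else:
            # Piece that starts a new word.
            words.append(token)
    return " ".join(words)


tokens = ["fun", "with", "hugging", "face", "token", "##izer", "##s"]
text = join_wordpieces(tokens)
print(text)  # fun with hugging face tokenizers
```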

In [1]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

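Notice how the text is normalized: it is converted to lowercase, and accents (diacritical marks on characters) are removed. The repeated spaces are dropped as well, and the emoji, which is not in the vocabulary, becomes `[UNK]`.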

In [None]:
tokenizer.tokenize("How’re things going? How     are y’all dōin’, darling? 😁")

['how',
 '’',
 're',
 'things',
 'going',
 '?',
 'how',
 'are',
 'y',
 '’',
 'all',
 'doin',
 '’',
 ',',
 'darling',
 '?',
 '[UNK]']

#### Encoding = Tokenizing + Converting to IDs

In [2]:
tokens = tokenizer.tokenize("Fun with Hugging Face tokenizers")

tokens

['fun', 'with', 'hugging', 'face', 'token', '##izer', '##s']

Going from tokens to input IDs:

In [3]:
ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

[4569, 2007, 17662, 2227, 19204, 17629, 2015]
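
Both steps can be imitated without the library. The sketch below assumes the greedy longest-match-first lookup that WordPiece is generally described as using at inference time, followed by a plain dictionary lookup for the IDs. The tiny `vocab` is hypothetical: it contains only the tokens and IDs visible in the outputs above, plus `[UNK]` with the ID it has in `bert-base-uncased`.

```python
# Hypothetical toy vocabulary; the IDs are copied from the output above.
vocab = {
    "[UNK]": 100, "fun": 4569, "with": 2007, "hugging": 17662,
    "face": 2227, "token": 19204, "##izer": 17629, "##s": 2015,
}


def wordpiece_split(word, vocab, unk="[UNK]"):
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word becomes the unknown token.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


def encode(text, vocab):
    tokens = []
    # Simplified normalization and pre-tokenization: lowercase, split on spaces.
    for word in text.lower().split():
        tokens.extend(wordpiece_split(word, vocab))
    return tokens, [vocab[token] for token in tokens]


tokens, ids = encode("Fun with Hugging Face tokenizers", vocab)
print(tokens)  # ['fun', 'with', 'hugging', 'face', 'token', '##izer', '##s']
print(ids)     # [4569, 2007, 17662, 2227, 19204, 17629, 2015]
```

A word that cannot be assembled from the vocabulary, such as the emoji in the earlier example, falls through to `[UNK]`.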


#### Decoding = Going from IDs to words

In [4]:
decoded_string = tokenizer.decode(ids)

decoded_string

'fun with hugging face tokenizers'

**Byte-Pair Encoding (BPE)** was initially developed as an algorithm to compress texts, and then used by OpenAI for tokenization when pretraining the GPT model.

It is a subword tokenization algorithm which starts with individual characters and merges them together to form subwords. At any step during the tokenizer training, the BPE algorithm will search for the most frequent pair (two consecutive tokens in a word) of existing tokens. That most frequent pair is the one that will be merged, and we rinse and repeat for the next step.
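
The training loop described above can be sketched in a few lines. This is a minimal character-level illustration on a hypothetical toy corpus (the word counts and function names are made up for the example); real BPE tokenizers such as GPT-2's work on bytes and learn tens of thousands of merges.

```python
from collections import Counter

# Hypothetical toy corpus: word -> number of occurrences.
word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}

# Every word starts out split into single characters.
splits = {word: list(word) for word in word_freqs}


def count_pairs(splits, word_freqs):
    pairs = Counter()
    for word, freq in word_freqs.items():
        symbols = splits[word]
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs


def merge_pair(pair, splits):
    left, right = pair
    merged = {}
    for word, symbols in splits.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                out.append(left + right)  # replace the pair with one token
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[word] = out
    return merged


merges = []
for _ in range(3):  # three merge steps are enough for the illustration
    pairs = count_pairs(splits, word_freqs)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    merges.append(best)
    splits = merge_pair(best, splits)

print(merges)  # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
print(splits["hugs"])  # ['hug', 's']
```

In practice the loop runs until the vocabulary reaches the target size, and the ordered list of merges is what gets saved (the `merges.txt` file for GPT-2).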

### Let's use the GPT-2 tokenizer

It will split on whitespace and punctuation as well, but it will keep the spaces and replace them with a Ġ symbol, enabling it to recover the original spaces if we decode the tokens.

Also note that unlike the BERT tokenizer, this tokenizer does not ignore repeated spaces: in the second example below, each extra space shows up as its own Ġ token.

In [5]:
tokenizer = AutoTokenizer.from_pretrained("gpt2")

tokenizer.tokenize("Fun with Hugging Face tokenizers")

['Fun', 'Ġwith', 'ĠHug', 'ging', 'ĠFace', 'Ġtoken', 'izers']

In [6]:
tokenizer.tokenize("How’re things going? How     are y’all dōin’, darling? 😁")

['How',
 'âĢ',
 'Ļ',
 're',
 'Ġthings',
 'Ġgoing',
 '?',
 'ĠHow',
 'Ġ',
 'Ġ',
 'Ġ',
 'Ġ',
 'Ġare',
 'Ġy',
 'âĢ',
 'Ļ',
 'all',
 'Ġd',
 'Åį',
 'in',
 'âĢ',
 'Ļ',
 ',',
 'Ġdarling',
 '?',
 'ĠðŁĺ',
 'ģ']
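
The odd-looking pieces such as `âĢ`, `Ļ` and `Ġ` in this output come from GPT-2's byte-level BPE: the text is first converted to UTF-8 bytes, and each byte is displayed as a printable character. The sketch below rebuilds that byte-to-character table in the way GPT-2's reference encoder is usually described and then reverses it. The helper names are illustrative, not a library API.

```python
def bytes_to_unicode():
    # Bytes that are already printable characters map to themselves.
    printable = (
        list(range(0x21, 0x7F))     # "!" .. "~"
        + list(range(0xA1, 0xAD))   # "¡" .. "¬"
        + list(range(0xAE, 0x100))  # "®" .. "ÿ"
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Remaining bytes (space, control bytes, ...) are moved to 256 and up.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


byte_encoder = bytes_to_unicode()
byte_decoder = {ch: b for b, ch in byte_encoder.items()}


def decode_byte_level(tokens):
    # Map every displayed character back to its byte, then decode as UTF-8.
    raw = bytes(byte_decoder[ch] for ch in "".join(tokens))
    return raw.decode("utf-8")


mapped = "".join(byte_encoder[b] for b in "How’re things".encode("utf-8"))
print(mapped)  # HowâĢĻreĠthings

restored = decode_byte_level(["How", "âĢ", "Ļ", "re", "Ġthings"])
print(restored)  # How’re things
```

The space byte lands on `Ġ`, and the three UTF-8 bytes of the curly apostrophe become `â`, `Ģ` and `Ļ`. Because every possible byte has a symbol, this tokenizer never needs an unknown token, which is why the emoji is split into byte pieces instead of becoming `[UNK]`.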

**SentencePiece (Unigram algorithm) tokenization.** The Unigram model starts with a large set of subword candidates and iteratively prunes it. At each step, a loss function computed over the training corpus identifies the subwords whose removal would increase the loss the least; those are discarded, and the most useful subwords are kept.

It starts from a large vocabulary and removes tokens from it until it reaches the desired vocabulary size.
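
One pruning step can be sketched as follows. This is a simplified illustration under several assumptions: the toy corpus is hypothetical, the starting vocabulary is simply every substring of every word, token probabilities are raw relative counts that are not re-estimated after a removal, and each word is scored by its single best segmentation. Real Unigram training also re-fits the probabilities with an EM procedure and removes a percentage of tokens per round.

```python
import math

# Hypothetical toy corpus: word -> number of occurrences.
word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}


def initial_log_probs(word_freqs):
    counts = {}
    for word, freq in word_freqs.items():
        for i in range(len(word)):
            for j in range(i + 1, len(word) + 1):
                counts[word[i:j]] = counts.get(word[i:j], 0) + freq
    total = sum(counts.values())
    return {token: math.log(count / total) for token, count in counts.items()}


def best_segmentation(word, log_probs):
    # Viterbi search: best[i] holds (score, start of last piece) for word[:i].
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in log_probs and best[start][0] > -math.inf:
                score = best[start][0] + log_probs[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    pieces, end = [], len(word)
    while end > 0:
        start = best[end][1]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1], best[len(word)][0]


def corpus_loss(log_probs, word_freqs):
    # Negative log-likelihood of the corpus under the best segmentations.
    return sum(
        -freq * best_segmentation(word, log_probs)[1]
        for word, freq in word_freqs.items()
    )


def removal_costs(log_probs, word_freqs):
    base = corpus_loss(log_probs, word_freqs)
    costs = {}
    for token in log_probs:
        if len(token) == 1:
            continue  # keep single characters so every word stays tokenizable
        reduced = {t: lp for t, lp in log_probs.items() if t != token}
        costs[token] = corpus_loss(reduced, word_freqs) - base
    return costs


log_probs = initial_log_probs(word_freqs)
costs = removal_costs(log_probs, word_freqs)
least_useful = min(costs, key=costs.get)  # candidate to prune first
print(least_useful, costs[least_useful])
```

Tokens that never appear in a best segmentation cost nothing to remove, so they are pruned first; tokens such as `hug`, whose removal forces a less likely split, are kept.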

### Let's use the T5-small tokenizer

Like the GPT-2 tokenizer, this one keeps spaces and replaces them with a specific symbol (▁), but its pre-tokenization step splits only on whitespace, not punctuation. Also note that it adds a space by default at the beginning of the sentence and collapses the repeated spaces between "How" and "are" into one.



In [7]:
tokenizer = AutoTokenizer.from_pretrained("t5-small")

tokenizer.tokenize("Fun with Hugging Face tokenizers")

['▁Fun', '▁with', '▁Hug', 'ging', '▁Face', '▁token', 'izer', 's']
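
Because every space is stored inside the tokens as `▁`, the original text can be recovered by simple concatenation. A minimal sketch, with a hypothetical helper name; the real `decode` method also deals with special tokens:

```python
def join_sentencepiece(tokens):
    # Concatenate the pieces, turn the "▁" markers back into spaces,
    # and drop the space that was added at the start of the sentence.
    return "".join(tokens).replace("▁", " ").lstrip(" ")


tokens = ["▁Fun", "▁with", "▁Hug", "ging", "▁Face", "▁token", "izer", "s"]
text = join_sentencepiece(tokens)
print(text)  # Fun with Hugging Face tokenizers
```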

In [8]:
tokenizer.tokenize("How’re things going? How     are y’all dōin’, darling? 😁")

['▁How',
 '’',
 're',
 '▁things',
 '▁going',
 '?',
 '▁How',
 '▁are',
 '▁',
 'y',
 '’',
 'all',
 '▁',
 'd',
 'ō',
 'in',
 '’',
 ',',
 '▁dar',
 'ling',
 '?',
 '▁',
 '😁']