# Comparing Trained LLM Tokenizers

In [1]:
# !pip install "transformers>=4.46.1"

## Tokenizing Text

Tokenize `"Hello World!"` with the tokenizer of the `bert-base-cased` model.

In [4]:
from transformers import AutoTokenizer

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


In [5]:
sentence = "Hello World!" 

# Load the pretrained tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [7]:
# Apply the tokenizer to the sentence and extract the token IDs
token_ids = tokenizer(sentence).input_ids
# Print the token ids
print("Token IDs:", token_ids)

Token IDs: [101, 8667, 1291, 106, 102]


Decoding the tokens

In [8]:
# Use `token_id` rather than `id` to avoid shadowing the built-in
for token_id in token_ids:
    print(f"Token : {token_id} - {tokenizer.decode(token_id)}")

Token : 101 - [CLS]
Token : 8667 - Hello
Token : 1291 - World
Token : 106 - !
Token : 102 - [SEP]
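
Conceptually, each ID is an index into the tokenizer's vocabulary, and the BERT tokenizer wraps the sequence in `[CLS]` and `[SEP]`. The following is only a minimal sketch of that lookup, assuming a hypothetical five-entry vocabulary copied from the output above. The names `toy_vocab`, `encode`, and `decode` are illustrative and not part of `transformers`.

```python
# Hypothetical mini-vocabulary: only the five entries printed above.
toy_vocab = {"[CLS]": 101, "Hello": 8667, "World": 1291, "!": 106, "[SEP]": 102}
id_to_token = {i: t for t, i in toy_vocab.items()}

def encode(tokens):
    """Map tokens to IDs and wrap them in [CLS] ... [SEP]."""
    return [toy_vocab["[CLS]"]] + [toy_vocab[t] for t in tokens] + [toy_vocab["[SEP]"]]

def decode(token_ids):
    """Map each ID back to its token string."""
    return [id_to_token[i] for i in token_ids]

ids = encode(["Hello", "World", "!"])
print(ids)
print(decode(ids))
```

A real tokenizer first has to split the raw string into tokens. This sketch skips that step and starts from an already split list.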


## Visualizing Tokenization

In [45]:
colors = [
    '102;194;165', '252;141;98', '141;160;203',
    '231;138;195', '166;216;84', '255;217;47'
]

In [46]:
def show_tokens(sentence: str, tokenizer_name: str):
    """Print the tokens of `sentence`, each highlighted in a different color."""

    # Load the tokenizer and tokenize the input
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    token_ids = tokenizer(sentence).input_ids

    # Extract vocabulary length
    print(f"Vocab length: {len(tokenizer)}")

    # Print a colored list of tokens: black text (30) on a 24-bit
    # background color (48;2;R;G;B), cycling through `colors`
    for idx, t in enumerate(token_ids):
        print(
                f'\x1b[0;30;48;2;{colors[idx % len(colors)]}m' +
                tokenizer.decode(t) +
                '\x1b[0m',
                end=' '
        )
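
The strings in `colors` read as `R;G;B` triplets. One way to turn such a triplet into a highlight is the 24-bit ANSI background sequence `ESC[48;2;R;G;Bm`, reset with `ESC[0m`. The sketch below assumes that interpretation. The helper names `colorize` and `strip_ansi` are illustrative, and whether the colors actually render depends on the terminal or notebook front end.

```python
import re

def colorize(text: str, rgb: str) -> str:
    """Wrap text in black foreground (30) on a 24-bit background (48;2;R;G;B)."""
    return f"\x1b[0;30;48;2;{rgb}m{text}\x1b[0m"

# Matches SGR sequences such as ESC[0m or ESC[0;30;48;2;102;194;165m
ANSI_SGR = re.compile(r"\x1b\[[0-9;]*m")

def strip_ansi(text: str) -> str:
    """Remove color codes, e.g. before saving output to a plain-text file."""
    return ANSI_SGR.sub("", text)

colored = colorize("Hello", "102;194;165")
print(colored)              # highlighted where ANSI colors are supported
print(strip_ansi(colored))  # plain text again
```

Stripping the codes is useful when the output is exported somewhere that does not interpret escape sequences.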

Sample text to test other tokenizers

In [47]:
text = """
English and CAPITALIZATION
🎵 鸟
show_tokens False None elif == >= else: two tabs:"    " Three tabs: "       "
12.0*50=600
"""

#### tokenizer 1 : `bert-base-cased`

In [48]:
show_tokens(sentence=text, tokenizer_name='bert-base-cased')

Vocab length: 28996
[CLS] English and CA ##PI ##TA ##L ##I ##Z ##AT ##ION [UNK] [UNK] show _ token ##s F ##als ##e None el ##if = = > = else : two ta ##bs : " " Three ta ##bs : " " 12 . 0 * 50

#### tokenizer 2 : `bert-base-uncased`

In [49]:
show_tokens(text, "bert-base-uncased")

Vocab length: 30522
[CLS] english and capital ##ization [UNK] [UNK] show _ token ##s false none eli ##f = = > = else : two tab ##s : " " three tab ##s : " " 12 . 0 * 50 = 600 [SEP]
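
The `##` pieces in both outputs come from WordPiece, which BERT applies as a greedy longest-match-first search over its vocabulary. The uncased model also lowercases the text first, which is why `CAPITALIZATION` splits into two pieces there but eight in the cased model. The sketch below illustrates the idea on hypothetical toy vocabularies (`cased_vocab`, `uncased_vocab`) that hold only pieces seen in the outputs above. It is not the library's implementation.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into WordPiece tokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabularies, limited to pieces seen in the outputs above
cased_vocab = {"CA", "##PI", "##TA", "##L", "##I", "##Z", "##AT", "##ION"}
uncased_vocab = {"capital", "##ization"}

print(wordpiece("CAPITALIZATION", cased_vocab))
print(wordpiece("CAPITALIZATION".lower(), uncased_vocab))
print(wordpiece("鸟", cased_vocab))
```

With these toy vocabularies the character `鸟` has no matching piece, so it falls back to `[UNK]`, as in both outputs above.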