# Lesson 5: Comparing Trained LLM Tokenizers

In this notebook, you will work with the tokenizers of several LLMs and explore how each one approaches tokenization differently.

## Setup

In [1]:
# Warning control

import warnings

warnings.filterwarnings("ignore")

## Tokenizing Text

In this section, you will tokenize the sentence "Hello World!" using the tokenizer of the [`bert-base-cased` model](https://huggingface.co/google-bert/bert-base-cased). 

Let's import the `AutoTokenizer` class, define the sentence to tokenize, and instantiate the tokenizer.

<p style="background-color:#fff1d7; padding:15px; "> <b>FYI: </b> The transformers library has a set of Auto classes, like AutoConfig, AutoModel, and AutoTokenizer. Given a model name or path, an Auto class picks the matching model-specific class for you.</p>

In [2]:
from transformers import AutoTokenizer

In [3]:
# Define the sentence to tokenize
sentence = "Hello World!"

In [4]:
# Load the pretrained tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

You'll now apply the tokenizer to the sentence. The tokenizer splits the sentence into tokens and returns the ID of each token.

In [5]:
# Apply the tokenizer to the sentence and extract the token IDs
token_ids = tokenizer(sentence).input_ids

In [6]:
print(token_ids)

[101, 8667, 1291, 106, 102]


To map each token ID to its corresponding token, you can use the `decode` method of the tokenizer.

In [7]:
for token_id in token_ids:
    print(tokenizer.decode(token_id))

[CLS]
Hello
World
!
[SEP]
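
The round trip above can be mimicked with a toy lookup table. The sketch below is not how the BERT tokenizer is implemented; it only hard-codes the five token/ID pairs printed above and wraps the input in `[CLS]` and `[SEP]`, as the output shows. `TOY_VOCAB`, `toy_encode`, and `toy_decode` are hypothetical names, and the input is assumed to be already split into tokens.

```python
# Toy vocabulary: only the five (token, ID) pairs printed above.
TOY_VOCAB = {"[CLS]": 101, "Hello": 8667, "World": 1291, "!": 106, "[SEP]": 102}
ID_TO_TOKEN = {token_id: token for token, token_id in TOY_VOCAB.items()}

def toy_encode(tokens):
    """Map pre-split tokens to IDs, adding [CLS] and [SEP] around them."""
    body = [TOY_VOCAB[t] for t in tokens]
    return [TOY_VOCAB["[CLS]"]] + body + [TOY_VOCAB["[SEP]"]]

def toy_decode(token_ids):
    """Map each ID back to its token string."""
    return [ID_TO_TOKEN[i] for i in token_ids]

ids = toy_encode(["Hello", "World", "!"])
print(ids)
print(toy_decode(ids))
```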


## Loading GPT Models

In [29]:
tokenizer = AutoTokenizer.from_pretrained("Xenova/gpt-4")

OSError: There was a specific connection error when trying to load Xenova/gpt-4:
401 Client Error: Unauthorized for url: https://huggingface.co/Xenova/gpt-4/resolve/main/config.json (Request ID: Root=1-67c8aac0-38215b1b7d93523a276a4b1a;313a562d-bb81-42c7-9aad-4d2163d2a4d1)

Invalid credentials in Authorization header

In [30]:
from transformers import GPT2TokenizerFast

In [31]:
tokenizer = AutoTokenizer.from_pretrained("gpt2")
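
The failed `Xenova/gpt-4` load followed by the switch to `gpt2` can be folded into one helper. This is a minimal sketch: `load_first_available` is a hypothetical name, and the loader is passed in as an argument so the logic can be tried without network access. It assumes, as the traceback above shows, that a failed load raises `OSError`.

```python
def load_first_available(names, loader):
    """Return (name, tokenizer) for the first name that `loader` can load.

    `loader` is any callable that takes a model name and raises OSError
    on failure, for example AutoTokenizer.from_pretrained.
    """
    failed = []
    for name in names:
        try:
            return name, loader(name)
        except OSError:
            failed.append(name)  # remember which candidates did not load
    raise OSError(f"None of the tokenizers could be loaded: {failed}")


# Stand-in loader for illustration only: it mimics the behaviour seen above,
# where "Xenova/gpt-4" fails and "gpt2" succeeds.
def fake_loader(name):
    if name == "Xenova/gpt-4":
        raise OSError("401 Client Error: Unauthorized")
    return f"<tokenizer for {name}>"

name, tok = load_first_available(["Xenova/gpt-4", "gpt2"], fake_loader)
print(name)
```

In the notebook itself, the call would look like `load_first_available(["Xenova/gpt-4", "gpt2"], AutoTokenizer.from_pretrained)`.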

## Visualizing Tokenization

In this section, you'll wrap the code of the previous section in the function `show_tokens`. The function takes a text and a model name, prints the vocabulary length of the tokenizer, and then prints a colored list of the tokens.

In [8]:
# A list of colors in RGB for representing the tokens
colors = [
    '102;194;165', '252;141;98', '141;160;203',
    '231;138;195', '166;216;84', '255;217;47'
]

In [25]:
def show_tokens(sentence: str, tokenizer_name: str, change: str="background"):
    """ Show the tokens each separated by a different color """
    
    assert change in ["foreground", "background"], "Expected either foreground or background"
    
    change_val = 38 if change == "foreground" else 48
    
    # Load the tokenizer and tokenize the input
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    token_ids = tokenizer(sentence).input_ids

    # Extract vocabulary length
    print(f"Vocabulary length: {len(tokenizer)}")

    # Print a colored list of tokens
    for idx, t in enumerate(token_ids):
        print(
            f'\x1b[{change_val};2;{colors[idx % len(colors)]}m' +
            tokenizer.decode(t) +
            '\x1b[0m',
            end=' '
        )
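
To see what string `show_tokens` actually sends to the output, the coloring step can be separated from the tokenizer. The sketch below uses a hypothetical helper, `colorize`, that takes already-decoded token strings; it reuses the same escape-sequence layout and color cycling as the function above.

```python
COLORS = [
    '102;194;165', '252;141;98', '141;160;203',
    '231;138;195', '166;216;84', '255;217;47'
]

def colorize(tokens, change="background"):
    """Wrap each token in a 24-bit ANSI color sequence, cycling through COLORS."""
    if change not in ("foreground", "background"):
        raise ValueError("Expected either foreground or background")
    code = 38 if change == "foreground" else 48  # 38 = text color, 48 = background
    return [
        f"\x1b[{code};2;{COLORS[i % len(COLORS)]}m{tok}\x1b[0m"
        for i, tok in enumerate(tokens)
    ]

pieces = colorize(["[CLS]", "Hello", "World", "!", "[SEP]"])
print(" ".join(pieces))   # shown in color by a terminal or notebook
print(repr(pieces[0]))    # the raw escape sequence around the first token
```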

Here's the text that you'll use to explore the different tokenization strategies of each model.

In [10]:
text = """
English and CAPITALIZATION
🎵 鸟
show_tokens False None elif == >= else: two tabs:"    " Three tabs: "       "
12.0*50=600
"""

You'll now use the tokenizer of `bert-base-cased` again and compare its tokenization strategy to that of `Xenova/gpt-4`.

**bert-base-cased**

In [26]:
show_tokens(text, "bert-base-cased")

Vocabulary length: 28996
[48;2;102;194;165m[CLS][0m [48;2;252;141;98mEnglish[0m [48;2;141;160;203mand[0m [48;2;231;138;195mCA[0m [48;2;166;216;84m##PI[0m [48;2;255;217;47m##TA[0m [48;2;102;194;165m##L[0m [48;2;252;141;98m##I[0m [48;2;141;160;203m##Z[0m [48;2;231;138;195m##AT[0m [48;2;166;216;84m##ION[0m [48;2;255;217;47m[UNK][0m [48;2;102;194;165m[UNK][0m [48;2;252;141;98mshow[0m [48;2;141;160;203m_[0m [48;2;231;138;195mtoken[0m [48;2;166;216;84m##s[0m [48;2;255;217;47mF[0m [48;2;102;194;165m##als[0m [48;2;252;141;98m##e[0m [48;2;141;160;203mNone[0m [48;2;231;138;195mel[0m [48;2;166;216;84m##if[0m [48;2;255;217;47m=[0m [48;2;102;194;165m=[0m [48;2;252;141;98m>[0m [48;2;141;160;203m=[0m [48;2;231;138;195melse[0m [48;2;166;216;84m:[0m [48;2;255;217;47mtwo[0m [48;2;102;194;165mta[0m [48;2;252;141;98m##bs[0m [48;2;141;160;203m:[0m [48;2;231;138;195m"[0m [48;2;166;216;84m"[0m [48;2;255;217;47mThree[0m [48;2;102;194;165

In [27]:
show_tokens(text, "bert-base-cased", change="foreground")

Vocabulary length: 28996
[38;2;102;194;165m[CLS][0m [38;2;252;141;98mEnglish[0m [38;2;141;160;203mand[0m [38;2;231;138;195mCA[0m [38;2;166;216;84m##PI[0m [38;2;255;217;47m##TA[0m [38;2;102;194;165m##L[0m [38;2;252;141;98m##I[0m [38;2;141;160;203m##Z[0m [38;2;231;138;195m##AT[0m [38;2;166;216;84m##ION[0m [38;2;255;217;47m[UNK][0m [38;2;102;194;165m[UNK][0m [38;2;252;141;98mshow[0m [38;2;141;160;203m_[0m [38;2;231;138;195mtoken[0m [38;2;166;216;84m##s[0m [38;2;255;217;47mF[0m [38;2;102;194;165m##als[0m [38;2;252;141;98m##e[0m [38;2;141;160;203mNone[0m [38;2;231;138;195mel[0m [38;2;166;216;84m##if[0m [38;2;255;217;47m=[0m [38;2;102;194;165m=[0m [38;2;252;141;98m>[0m [38;2;141;160;203m=[0m [38;2;231;138;195melse[0m [38;2;166;216;84m:[0m [38;2;255;217;47mtwo[0m [38;2;102;194;165mta[0m [38;2;252;141;98m##bs[0m [38;2;141;160;203m:[0m [38;2;231;138;195m"[0m [38;2;166;216;84m"[0m [38;2;255;217;47mThree[0m [38;2;102;194;165

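In the output above, a `##` prefix marks a piece that continues the previous token rather than starting a new word. A minimal sketch of gluing such pieces back together follows; `merge_wordpieces` is a hypothetical helper, not a `transformers` function, and the token list is transcribed from the printed output.

```python
def merge_wordpieces(tokens):
    """Join pieces that start with '##' onto the preceding token."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]   # continuation: append without the marker
        else:
            words.append(tok)      # start of a new word
    return words

pieces = ["English", "and", "CA", "##PI", "##TA", "##L", "##I", "##Z", "##AT", "##ION"]
print(merge_wordpieces(pieces))
```
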
**Optional - bert-base-uncased**

You can also try the uncased version of the BERT model and compare the vocabulary length and tokenization strategy of the two BERT versions.

In [28]:
show_tokens(text, "bert-base-uncased", change="foreground")

Vocabulary length: 30522
[38;2;102;194;165m[CLS][0m [38;2;252;141;98menglish[0m [38;2;141;160;203mand[0m [38;2;231;138;195mcapital[0m [38;2;166;216;84m##ization[0m [38;2;255;217;47m[UNK][0m [38;2;102;194;165m[UNK][0m [38;2;252;141;98mshow[0m [38;2;141;160;203m_[0m [38;2;231;138;195mtoken[0m [38;2;166;216;84m##s[0m [38;2;255;217;47mfalse[0m [38;2;102;194;165mnone[0m [38;2;252;141;98meli[0m [38;2;141;160;203m##f[0m [38;2;231;138;195m=[0m [38;2;166;216;84m=[0m [38;2;255;217;47m>[0m [38;2;102;194;165m=[0m [38;2;252;141;98melse[0m [38;2;141;160;203m:[0m [38;2;231;138;195mtwo[0m [38;2;166;216;84mtab[0m [38;2;255;217;47m##s[0m [38;2;102;194;165m:[0m [38;2;252;141;98m"[0m [38;2;141;160;203m"[0m [38;2;231;138;195mthree[0m [38;2;166;216;84mtab[0m [38;2;255;217;47m##s[0m [38;2;102;194;165m:[0m [38;2;252;141;98m"[0m [38;2;141;160;203m"[0m [38;2;231;138;195m12[0m [38;2;166;216;84m.[0m [38;2;255;217;47m0[0m [38;2;102;194;165m*

In [13]:
print("\x1b[31m\"red\"\x1b[0m")

[31m"red"[0m


In [16]:
print("\x1b[31mred\x1b[0m")

[31mred[0m


In [17]:
print("\x1b[31m" + "red" + "\x1b[0m")

[31mred[0m


In [18]:
print("\x1b[38;2;255;0;0m" + "red" + "\x1b[0m")

[38;2;255;0;0mred[0m


In [19]:
print("\x1b[48;2;255;0;0m" + "red" + "\x1b[0m")

[48;2;255;0;0mred[0m


In [14]:
print("\x1b[31m\"blue\"\x1b[0m")

[31m"blue"[0m

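The experiments above all share one shape: the escape character `\x1b`, a `[`, semicolon-separated numbers, and a closing `m`. Code `31` selects red text, `38;2;R;G;B` sets a 24-bit text color, `48;2;R;G;B` sets the background, and `0` resets the style. As a sketch of how to recover the plain text from such a string, the hypothetical helper `strip_ansi` below removes those sequences with a regular expression; it only handles color sequences of this form.

```python
import re

# ESC [ <numbers separated by ';'> m
ANSI_COLOR = re.compile(r"\x1b\[[0-9;]*m")

def strip_ansi(text):
    """Remove ANSI color sequences and return the plain text."""
    return ANSI_COLOR.sub("", text)

colored = "\x1b[38;2;255;0;0m" + "red" + "\x1b[0m"
print(repr(colored))
print(strip_ansi(colored))
```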

**GPT-4**

In [34]:
show_tokens(text, "Xenova/gpt-4")

OSError: There was a specific connection error when trying to load Xenova/gpt-4:
401 Client Error: Unauthorized for url: https://huggingface.co/Xenova/gpt-4/resolve/main/config.json (Request ID: Root=1-67c9ebf2-53f183807fbe259465dc21b6;f45d2e73-b9b7-43ea-91d5-e1e70ffdbfe9)

Invalid credentials in Authorization header

In [32]:
import transformers

In [33]:
transformers.__version__

'4.41.2'

In [35]:
tokenizer = AutoTokenizer.from_pretrained("Xenova/gpt-4")

OSError: There was a specific connection error when trying to load Xenova/gpt-4:
401 Client Error: Unauthorized for url: https://huggingface.co/Xenova/gpt-4/resolve/main/config.json (Request ID: Root=1-67c9ed90-3af06d053819b9e31a6e6a1b;13637a0f-abe0-48d9-8fa0-8e7e3e8a0c16)

Invalid credentials in Authorization header

### Optional Models to Explore

You can also explore the tokenization strategy of other models. The following is a suggested list. Make sure to consider the following features when you're doing your comparison:
- Vocabulary length
- Special tokens
- Tokenization of the tabs, special characters and special keywords

**gpt2**

In [36]:
show_tokens(text, "gpt2", change="foreground")

Vocabulary length: 50257
[38;2;102;194;165m
[0m [38;2;252;141;98mEnglish[0m [38;2;141;160;203m and[0m [38;2;231;138;195m CAP[0m [38;2;166;216;84mITAL[0m [38;2;255;217;47mIZ[0m [38;2;102;194;165mATION[0m [38;2;252;141;98m
[0m [38;2;141;160;203m�[0m [38;2;231;138;195m�[0m [38;2;166;216;84m�[0m [38;2;255;217;47m �[0m [38;2;102;194;165m�[0m [38;2;252;141;98m�[0m [38;2;141;160;203m
[0m [38;2;231;138;195mshow[0m [38;2;166;216;84m_[0m [38;2;255;217;47mt[0m [38;2;102;194;165mok[0m [38;2;252;141;98mens[0m [38;2;141;160;203m False[0m [38;2;231;138;195m None[0m [38;2;166;216;84m el[0m [38;2;255;217;47mif[0m [38;2;102;194;165m ==[0m [38;2;252;141;98m >=[0m [38;2;141;160;203m else[0m [38;2;231;138;195m:[0m [38;2;166;216;84m two[0m [38;2;255;217;47m tabs[0m [38;2;102;194;165m:"[0m [38;2;252;141;98m [0m [38;2;141;160;203m [0m [38;2;231;138;195m [0m [38;2;166;216;84m "[0m [38;2;255;217;47m Three[0m [38;2;102;194;165m tabs[0m [3

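The `�` symbols in the GPT-2 output are not unknown tokens. GPT-2 uses byte-level BPE, so a character such as the emoji or the Chinese character can be split into tokens that each hold only part of its UTF-8 bytes, and decoding such a fragment on its own yields the replacement character. The sketch below shows the effect with plain Python bytes; `decode_bytewise` is a hypothetical helper that splits into single bytes for simplicity and does not reproduce GPT-2's actual token boundaries.

```python
def decode_bytewise(char):
    """Decode each UTF-8 byte of `char` on its own, as if it were a separate token."""
    raw = char.encode("utf-8")
    return [bytes([b]).decode("utf-8", errors="replace") for b in raw]

for char in ["🎵", "鸟"]:
    raw = char.encode("utf-8")
    # Fragments decode to the replacement character; the full byte string does not.
    print(len(raw), decode_bytewise(char), raw.decode("utf-8"))
```
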
**Flan-T5-small**

In [37]:
show_tokens(text, "google/flan-t5-small", change="foreground")

Vocabulary length: 32100
[38;2;102;194;165mEnglish[0m [38;2;252;141;98mand[0m [38;2;141;160;203mCA[0m [38;2;231;138;195mPI[0m [38;2;166;216;84mTAL[0m [38;2;255;217;47mIZ[0m [38;2;102;194;165mATION[0m [38;2;252;141;98m[0m [38;2;141;160;203m<unk>[0m [38;2;231;138;195m[0m [38;2;166;216;84m<unk>[0m [38;2;255;217;47mshow[0m [38;2;102;194;165m_[0m [38;2;252;141;98mto[0m [38;2;141;160;203mken[0m [38;2;231;138;195ms[0m [38;2;166;216;84mFal[0m [38;2;255;217;47ms[0m [38;2;102;194;165me[0m [38;2;252;141;98mNone[0m [38;2;141;160;203m[0m [38;2;231;138;195me[0m [38;2;166;216;84ml[0m [38;2;255;217;47mif[0m [38;2;102;194;165m=[0m [38;2;252;141;98m=[0m [38;2;141;160;203m>[0m [38;2;231;138;195m=[0m [38;2;166;216;84melse[0m [38;2;255;217;47m:[0m [38;2;102;194;165mtwo[0m [38;2;252;141;98mtab[0m [38;2;141;160;203ms[0m [38;2;231;138;195m:[0m [38;2;166;216;84m"[0m [38;2;255;217;47m"[0m [38;2;102;194;165mThree[0m [38;2;252;141;98mtab[

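One quick way to compare the tokenizers is to count how many tokens each one spends on the same piece of text. The sketch below transcribes, from the outputs printed so far, the tokens each tokenizer produced for `English and CAPITALIZATION` and counts them. The dictionary is hand-copied for illustration, and special tokens such as `[CLS]` are left out.

```python
# Tokens for "English and CAPITALIZATION", copied from the outputs above.
first_line_tokens = {
    "bert-base-cased": ["English", "and", "CA", "##PI", "##TA", "##L", "##I", "##Z", "##AT", "##ION"],
    "bert-base-uncased": ["english", "and", "capital", "##ization"],
    "gpt2": ["English", " and", " CAP", "ITAL", "IZ", "ATION"],
    "google/flan-t5-small": ["English", "and", "CA", "PI", "TAL", "IZ", "ATION"],
}

token_counts = {name: len(tokens) for name, tokens in first_line_tokens.items()}

# Print the tokenizers from fewest to most tokens.
for name, count in sorted(token_counts.items(), key=lambda item: item[1]):
    print(f"{name:22} {count} tokens")
```
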
**Starcoder 2 - 15B**

In [38]:
show_tokens(text, "bigcode/starcoder2-15b", change="foreground")

OSError: There was a specific connection error when trying to load bigcode/starcoder2-15b:
401 Client Error: Unauthorized for url: https://huggingface.co/bigcode/starcoder2-15b/resolve/main/config.json (Request ID: Root=1-67c9f548-0687fa1858caa8563b84e556;c143c81c-b158-43c9-8bbb-2df5bf7d53d2)

Invalid credentials in Authorization header

**Phi-3**

In [39]:
show_tokens(text, "microsoft/Phi-3-mini-4k-instruct", change="foreground")

OSError: There was a specific connection error when trying to load microsoft/Phi-3-mini-4k-instruct:
401 Client Error: Unauthorized for url: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct/resolve/main/config.json (Request ID: Root=1-67c9f59c-57e29ab743d8aabf70f8907f;1f4c5494-2ae0-464e-85d5-de25487fa228)

Invalid credentials in Authorization header

**Qwen2 - Vision-Language Model**

In [40]:
show_tokens(text, "Qwen/Qwen2-VL-7B-Instruct", change="foreground")

OSError: There was a specific connection error when trying to load Qwen/Qwen2-VL-7B-Instruct:
401 Client Error: Unauthorized for url: https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct/resolve/main/config.json (Request ID: Root=1-67c9f632-4e02d07a169c4dad45a2c96f;f79f0606-fc38-4f01-ac3e-30468d654f1e)

Invalid credentials in Authorization header