# Tokenizer

In this notebook, we will:

    I. present what a tokenizer is used for
    II. show how to use the tokenizer tool of transformers
    III. compare the fast and slow tokenizers
    IV. cover some other notes

## I. Use of tokenizer

The tokenizer is used to:

* tokenize texts: convert text into units (words, n-grams, syllables, prefixes... or a mix of them)

* construct a vocabulary: based on the tokens, build a vocabulary that maps each unit to a number

* numericalize the text units: replace each unit with its id in the vocabulary

* pad / truncate: pad the short texts and truncate the long ones. This ensures that no text exceeds the maximum
  length of the model and that all texts in a batch have the same length
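
The four steps above can be sketched in a few lines of plain Python. This is a toy illustration only: it splits on whitespace instead of using subwords, and the names `build_vocab` and `encode` are made up for this example (they are not part of transformers).

```python
# Toy illustration only: real tokenizers use subword vocabularies, not whitespace splitting.
PAD, UNK = "[PAD]", "[UNK]"

def build_vocab(texts):
    """Assign an integer id to every distinct whitespace-separated unit."""
    vocab = {PAD: 0, UNK: 1}
    for text in texts:
        for unit in text.lower().split():
            vocab.setdefault(unit, len(vocab))
    return vocab

def encode(text, vocab, max_length):
    """Tokenize, numericalize, then truncate or pad to exactly max_length ids."""
    ids = [vocab.get(unit, vocab[UNK]) for unit in text.lower().split()]
    ids = ids[:max_length]                                  # truncate the long texts
    return ids + [vocab[PAD]] * (max_length - len(ids))     # pad the short ones

corpus = ["we can save it", "we want"]
vocab = build_vocab(corpus)
batch = [encode(t, vocab, max_length=3) for t in corpus + ["we want cryptocurrency"]]
print(vocab)
print(batch)
```

The first text is truncated, the second is padded, and the unseen word in the third text falls back to the unknown id, so every row of `batch` has the same length.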

## II. Usage

How to use:

1) import

2) load: from_pretrained

3) save: save_pretrained

4) tokenize: tokenize

5) inspect vocabulary: vocab

6) converting: convert_tokens_to_ids / convert_ids_to_tokens

7) padding / truncate

8) outputs: input_ids, token_type_ids, attention_mask...

In [1]:
# 1) import
###########

# if we don't know which tokenizer to use, we can just use AutoTokenizer

# it automatically loads the tokenizer class that corresponds to the model

# Sometimes, we have to import the correct tokenizer ourselves: this will be explained later

from transformers import AutoTokenizer


In [2]:
# 2) load
#########

# The first time we load a model from the HF repository, the files are downloaded from the HF site.

# By default, the downloaded files are stored in: ~/.cache/huggingface/hub/models--google-bert--bert-base-uncased/snapshots/

# If we load this model again, it will be loaded from the local cache folder, not from the HF site.

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

tokenizer

BertTokenizerFast(name_or_path='google-bert/bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

In [3]:
# 3) save
#########

# We can save it anywhere we want by providing a directory

tokenizer.save_pretrained("./tmp/my_tokenizer")


('./tmp/my_tokenizer/tokenizer_config.json',
 './tmp/my_tokenizer/special_tokens_map.json',
 './tmp/my_tokenizer/vocab.txt',
 './tmp/my_tokenizer/added_tokens.json',
 './tmp/my_tokenizer/tokenizer.json')

In [4]:
# and reload it later from the local directory

tokenizer = AutoTokenizer.from_pretrained("./tmp/my_tokenizer")

In [5]:
# 4) tokenize
#############

text = "We can save any cryptocurrency we want."

tokens = tokenizer.tokenize(text)

print(tokens)

['we', 'can', 'save', 'any', 'crypt', '##oc', '##ur', '##ren', '##cy', 'we', 'want', '.']
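
The `##` pieces come from WordPiece, which is usually described as a greedy longest-match-first search over the vocabulary. Below is a minimal sketch of that idea, assuming a tiny hypothetical vocabulary (`toy_vocab`) chosen so that the split matches the output above; the real tokenizer also normalizes the text and handles punctuation, which is left out here.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1                          # try a shorter candidate
        if piece is None:
            return [unk]                      # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical mini-vocabulary, not the real BERT vocabulary.
toy_vocab = {"we", "want", "crypt", "##oc", "##ur", "##ren", "##cy"}
print(wordpiece("cryptocurrency", toy_vocab))
print(wordpiece("want", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```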


In [6]:
# 5) vocabulary
# the "##" prefix marks a sub-word piece that continues a word rather than starting one

tokenizer.vocab


{'inputs': 20407,
 'disguised': 17330,
 'nasa': 9274,
 'decorate': 29460,
 '##土': 30327,
 'shamrock': 28782,
 '##posed': 19155,
 'cesare': 26708,
 'basha': 26074,
 'ernest': 8471,
 'isa': 18061,
 'occult': 27906,
 '##り': 30212,
 '##γ': 29721,
 'dub': 12931,
 'inevitable': 13418,
 'tow': 15805,
 'voltage': 10004,
 'francisco': 3799,
 'limits': 6537,
 'yiddish': 20112,
 'lastly': 22267,
 'yielding': 21336,
 'savoy': 16394,
 'cupboard': 25337,
 'kyle': 7648,
 '##rmed': 29540,
 'fallen': 5357,
 'parentheses': 27393,
 'blue': 2630,
 '[unused26]': 27,
 'jaw': 5730,
 'favoured': 16822,
 'dwellings': 16707,
 'write': 4339,
 'beginning': 2927,
 '##mount': 20048,
 '[unused236]': 241,
 '×': 1095,
 'rectangular': 10806,
 'abandonment': 22290,
 'bates': 11205,
 'tome': 21269,
 'yearning': 29479,
 '##trick': 22881,
 'battalion': 4123,
 'elevator': 7764,
 '##oco': 24163,
 'racers': 25791,
 '##llen': 12179,
 'carr': 12385,
 'suspicions': 17817,
 'ibm': 9980,
 '##pipe': 24548,
 'fluctuations': 28892,
 

In [7]:
# vocabulary size 

tokenizer.vocab_size

30522

In [8]:
# 6) converting
###############

## token -> ids

ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)


[2057, 2064, 3828, 2151, 19888, 10085, 3126, 7389, 5666, 2057, 2215, 1012]


In [9]:
## ids -> tokens

tokens = tokenizer.convert_ids_to_tokens(ids)
print(tokens)


['we', 'can', 'save', 'any', 'crypt', '##oc', '##ur', '##ren', '##cy', 'we', 'want', '.']


In [10]:
## tokens -> text

text = tokenizer.convert_tokens_to_string(tokens)
print(text)

we can save any cryptocurrency we want.
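
To see why the `##` pieces disappear in the string above, here is a rough sketch of the merging idea. The helper `merge_wordpieces` is hypothetical; the real decoder also cleans up spacing around punctuation, which is only approximated here for the period.

```python
def merge_wordpieces(tokens):
    """Join tokens with spaces, then glue ## continuation pieces onto the previous token."""
    text = " ".join(tokens).replace(" ##", "")
    # Crude stand-in for punctuation clean-up: only handles a space before a period.
    return text.replace(" .", ".")

tokens = ['we', 'can', 'save', 'any', 'crypt', '##oc', '##ur', '##ren', '##cy', 'we', 'want', '.']
print(merge_wordpieces(tokens))
```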


In [11]:
## text -> ids

# encode converts the text directly to ids:

# ids = tokenizer.encode(text)

# but by default it adds 2 special tokens, at the start (CLS) and at the end (SEP) of the text,

# and we get the result:

# [101, 2057, 2064, 3828, 2151, 19888, 10085, 3126, 7389, 5666, 2057, 2215, 1012, 102]

# If we don't want to add those special tokens, we use a parameter:

ids = tokenizer.encode(text, add_special_tokens=False)
print(ids)

[2057, 2064, 3828, 2151, 19888, 10085, 3126, 7389, 5666, 2057, 2215, 1012]


In [12]:
## ids -> text

# decode converts the ids directly to text
#
# text = tokenizer.decode(ids)
#
# if the ids contain the special tokens CLS and SEP (e.g. ids from tokenizer.encode(text)), they show up in the text:
# '[CLS] we can save any cryptocurrency we want. [SEP]'

# If we don't want those special tokens, we use a parameter:

text = tokenizer.decode(ids, skip_special_tokens=True)
print(text)


we can save any cryptocurrency we want.


In [13]:
# 7) padding / truncate
#######################

## padding

ids = tokenizer.encode(text, padding="max_length", max_length=20)
print(ids)

[101, 2057, 2064, 3828, 2151, 19888, 10085, 3126, 7389, 5666, 2057, 2215, 1012, 102, 0, 0, 0, 0, 0, 0]


In [14]:
## truncate

# max_length includes the added special tokens

ids = tokenizer.encode(text, max_length=5, truncation=True)
print(ids)

[101, 2057, 2064, 3828, 102]


In [15]:
# 8) outputs
############

## single text data 

# it outputs a dict of 3 fields:
# - input_ids
# - token_type_ids
# - attention_mask

# Depending on the input text and the arguments, the output fields can change.
# Sometimes we have to create this structure ourselves;
# at other times we have to complete this structure by adding some fields.

toks = tokenizer.encode_plus(text, padding="max_length", max_length=20)
print(toks)

{'input_ids': [101, 2057, 2064, 3828, 2151, 19888, 10085, 3126, 7389, 5666, 2057, 2215, 1012, 102, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]}


In [16]:
# or, more simply, we can call the tokenizer directly

toks = tokenizer(text, padding="max_length", max_length=20)
print(toks)

{'input_ids': [101, 2057, 2064, 3828, 2151, 19888, 10085, 3126, 7389, 5666, 2057, 2215, 1012, 102, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]}
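
The relation between the three fields can be reproduced by hand. The sketch below is an illustration for a single BERT-style sequence, not the library's implementation; `build_encoding` is a hypothetical helper, and it assumes `max_length` is at least 2 so that both special tokens fit.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # ids taken from the tokenizer printout above

def build_encoding(token_ids, max_length):
    """Add [CLS]/[SEP], truncate to max_length, pad on the right, and build the masks."""
    body = token_ids[: max_length - 2]        # leave room for the two special tokens
    input_ids = [CLS_ID] + body + [SEP_ID]
    n_pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * n_pad,
        "token_type_ids": [0] * max_length,   # single segment, so every position is 0
        "attention_mask": [1] * len(input_ids) + [0] * n_pad,  # 1 = real token, 0 = padding
    }

ids = [2057, 2064, 3828, 2151, 19888, 10085, 3126, 7389, 5666, 2057, 2215, 1012]
enc = build_encoding(ids, max_length=20)
print(enc)
print(build_encoding(ids, max_length=5)["input_ids"])
```

With `max_length=20` this reproduces the dict printed above, and with `max_length=5` it reproduces the truncated ids from the truncation cell.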


In [17]:
## batched data

# The tokenizer can take a list of texts; each dict field then contains one list per text

texts = [
    "it outputs a dict of 3 elements.",
    "I got the same issue.",
    "In my case, it worked fine before.",
]

toks = tokenizer(texts)
print(toks)

{'input_ids': [[101, 2009, 27852, 1037, 4487, 6593, 1997, 1017, 3787, 1012, 102], [101, 1045, 2288, 1996, 2168, 3277, 1012, 102], [101, 1999, 2026, 2553, 1010, 2009, 2499, 2986, 2077, 1012, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
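
Note that the three lists above have different lengths (11, 8 and 11), so they cannot be stacked into one tensor as they are. In transformers, passing `padding=True` to the tokenizer call pads every text to the longest one in the batch. The plain-Python sketch below shows what that amounts to; `pad_batch` is a hypothetical helper written for this illustration.

```python
def pad_batch(batch_ids, pad_id=0):
    """Right-pad every sequence to the length of the longest one in the batch."""
    longest = max(len(seq) for seq in batch_ids)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in batch_ids]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in batch_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# input_ids copied from the batched output above
batch_ids = [
    [101, 2009, 27852, 1037, 4487, 6593, 1997, 1017, 3787, 1012, 102],
    [101, 1045, 2288, 1996, 2168, 3277, 1012, 102],
    [101, 1999, 2026, 2553, 1010, 2009, 2499, 2986, 2077, 1012, 102],
]
padded = pad_batch(batch_ids)
print([len(row) for row in padded["input_ids"]])
print(padded["attention_mask"][1])
```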


In [18]:
## single data vs batched data

# batched data is faster to tokenize
# so we should tokenize the data in batches whenever possible

In [19]:
%%time
for i in range(2000):
    tokenizer(text)

CPU times: user 302 ms, sys: 3.78 ms, total: 306 ms
Wall time: 301 ms


In [20]:
%%time
toks = tokenizer([text] * 2000)

CPU times: user 444 ms, sys: 221 ms, total: 664 ms
Wall time: 86.3 ms


## III. Fast vs. Slow

In [21]:
# by default, the fast tokenizer is loaded

# The loaded tokenizer is called "BertTokenizerFast", and its attribute "is_fast" is True.

# The fast tokenizer is based on a Rust implementation, which is expected to be faster than the slow version,
# which is based on Python

tokenizer_fast = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
tokenizer_fast

BertTokenizerFast(name_or_path='google-bert/bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

In [22]:
# to load the slow tokenizer, we set the argument "use_fast" to False.

# The loaded tokenizer is called "BertTokenizer", and its attribute "is_fast" is False.

tokenizer_slow = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased", use_fast=False)
tokenizer_slow

BertTokenizer(name_or_path='google-bert/bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

In [23]:
%%time
# time the fast tokenizer on single data
for i in range(2000):
    tokenizer_fast(text)

CPU times: user 293 ms, sys: 6.53 ms, total: 299 ms
Wall time: 293 ms


In [24]:
%%time
# time the slow tokenizer on single data
for i in range(2000):
    tokenizer_slow(text)

CPU times: user 1.39 s, sys: 4.21 ms, total: 1.4 s
Wall time: 1.39 s


In [25]:
%%time
# time the fast tokenizer on batched data
toks = tokenizer_fast([text] * 2000)

CPU times: user 548 ms, sys: 652 ms, total: 1.2 s
Wall time: 232 ms


In [26]:
%%time
# time the slow tokenizer on batched data
toks = tokenizer_slow([text] * 2000)

CPU times: user 1.27 s, sys: 3.49 ms, total: 1.27 s
Wall time: 1.27 s


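So, in summary (approximate wall time in ms for 2000 texts, taken from the runs above):

|      | single (ms) | batched (ms) |
|------|-------------|--------------|
| fast | 293         | 232          |
| slow | 1390        | 1270         |

The batched time of the fast tokenizer varies noticeably between runs (86.3 ms in the earlier batched test).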


## IV. Some Notes

### offset mapping

In [27]:

# Only the fast tokenizer can return this extra information

# by setting the argument "return_offsets_mapping" to True, we get an extra element in the output
# called "offset_mapping".

# "offset_mapping" is a list of tuples, one per token: the first value is the start position of the token
# and the second is its end position (exclusive).

# Here, a position is a character index in the text. Special tokens get (0, 0).

toks = tokenizer_fast(text, return_offsets_mapping=True)
toks

{'input_ids': [101, 2057, 2064, 3828, 2151, 19888, 10085, 3126, 7389, 5666, 2057, 2215, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'offset_mapping': [(0, 0), (0, 2), (3, 6), (7, 11), (12, 15), (16, 21), (21, 23), (23, 25), (25, 28), (28, 30), (31, 33), (34, 38), (38, 39), (0, 0)]}

In [28]:
# if we look at the word_ids, we get some repeated indices:
# this is because the tokenization of the text is not solely based on words.

# The tokenized text is ['we', 'can', 'save', 'any', 'crypt', '##oc', '##ur', '##ren', '##cy', 'we', 'want', '.']
# obtained before using tokenizer.tokenize(text)

# The word "cryptocurrency" is the 5th word in the text, so its word index is 4 (indices start at 0).
# It is tokenized into 5 tokens, which is why the number 4 is repeated 5 times in the word_ids.
# The special tokens [CLS] and [SEP] do not belong to any word, so their word id is None.

# token    word_id   offset_mapping
# [CLS]    None      (0, 0)
# we       0         (0, 2)
# can      1         (3, 6)
# save     2         (7, 11)
# any      3         (12, 15)
# crypt    4         (16, 21)
# ##oc     4         (21, 23)
# ##ur     4         (23, 25)
# ##ren    4         (25, 28)
# ##cy     4         (28, 30)
# we       5         (31, 33)
# want     6         (34, 38)
# .        7         (38, 39)
# [SEP]    None      (0, 0)

toks.word_ids()

[None, 0, 1, 2, 3, 4, 4, 4, 4, 4, 5, 6, 7, None]
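
One way to combine the two outputs is to merge the offsets of all tokens that share a word id, which gives back the character span of each whole word. The sketch below uses the values printed above as plain lists, so it runs without the tokenizer; `word_spans` is a hypothetical helper, and `text` is assumed to be the lowercased string produced by the earlier decode cell.

```python
text = "we can save any cryptocurrency we want."
# Values copied from the outputs above
offsets = [(0, 0), (0, 2), (3, 6), (7, 11), (12, 15), (16, 21), (21, 23),
           (23, 25), (25, 28), (28, 30), (31, 33), (34, 38), (38, 39), (0, 0)]
word_ids = [None, 0, 1, 2, 3, 4, 4, 4, 4, 4, 5, 6, 7, None]

def word_spans(offsets, word_ids):
    """Merge the offsets of tokens sharing a word id into one (start, end) span per word."""
    spans = {}
    for word_id, (start, end) in zip(word_ids, offsets):
        if word_id is None:          # special tokens such as [CLS] and [SEP]
            continue
        first_start = spans.get(word_id, (start, end))[0]
        spans[word_id] = (first_start, end)   # keep the first start, extend the end
    return spans

spans = word_spans(offsets, word_ids)
print(spans[4], text[spans[4][0]:spans[4][1]])
print([text[s:e] for s, e in spans.values()])
```

This kind of mapping is what makes offsets useful for tasks such as named-entity recognition, where token-level predictions have to be projected back onto the original characters.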

### special tokenizer

In [29]:
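# For some tokenizers whose code lives in the model repository rather than in transformers,
# we have to add the argument trust_remote_code=True (after reviewing that code);
# otherwise, an error is reported, as shown below.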

tokenizer = AutoTokenizer.from_pretrained("internlm/internlm-chat-7b")

ValueError: Loading internlm/internlm-chat-7b requires you to execute the configuration file in that repo on your local machine. Make sure you have read the code there to avoid malicious use, then set the option `trust_remote_code=True` to remove this error.