## Tokenizers

#### Training a tokenizer from scratch

In [2]:
import torch

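The tokenizer is built on a byte-pair encoding (BPE) model. `[UNK]` is the token emitted for any input the learned vocabulary cannot represent.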

In [3]:
from tokenizers import Tokenizer
from tokenizers.models import BPE

In [4]:
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

In [5]:
tokenizer

Tokenizer(version="1.0", truncation=None, padding=None, added_tokens=[], normalizer=None, pre_tokenizer=None, post_processor=None, decoder=None, model=BPE(dropout=None, unk_token="[UNK]", continuing_subword_prefix=None, end_of_word_suffix=None, fuse_unk=False, byte_fallback=False, ignore_merges=False, vocab={}, merges=[]))

## Trainer

In [9]:
from tokenizers.trainers import BpeTrainer

trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])


In [10]:
trainer

BpeTrainer(BpeTrainer(min_frequency=0, vocab_size=30000, show_progress=True, special_tokens=[AddedToken(content="[UNK]", single_word=False, lstrip=False, rstrip=False, normalized=False, special=True), AddedToken(content="[CLS]", single_word=False, lstrip=False, rstrip=False, normalized=False, special=True), AddedToken(content="[SEP]", single_word=False, lstrip=False, rstrip=False, normalized=False, special=True), AddedToken(content="[PAD]", single_word=False, lstrip=False, rstrip=False, normalized=False, special=True), AddedToken(content="[MASK]", single_word=False, lstrip=False, rstrip=False, normalized=False, special=True)], limit_alphabet=None, initial_alphabet=[], continuing_subword_prefix=None, end_of_word_suffix=None, max_token_length=None, words={}))

We could train our tokenizer right now, but it wouldn’t be optimal. Without a pre-tokenizer that will split our inputs into words, we might get tokens that overlap several words: for instance we could get an "it is" token since those two words often appear next to each other. Using a pre-tokenizer will ensure no token is bigger than a word returned by the pre-tokenizer. Here we want to train a subword BPE tokenizer, and we will use the easiest pre-tokenizer possible by splitting on whitespace.
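To see why word boundaries matter, here is a minimal pure-Python sketch of the pair-counting step that BPE repeats while learning merges. It is not the `tokenizers` implementation: `count_pairs` and the sample text are made up for illustration, and `str.split()` only approximates what a whitespace pre-tokenizer does.

```python
from collections import Counter

def count_pairs(sequences):
    """Count adjacent symbol pairs inside each sequence (one BPE counting step)."""
    pairs = Counter()
    for symbols in sequences:
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += 1
    return pairs

text = "it is what it is"

# No pre-tokenizer: the whole text is one sequence, so spaces are symbols too.
raw_pairs = count_pairs([list(text)])

# Whitespace-style pre-tokenization: each word is its own sequence.
word_pairs = count_pairs([list(word) for word in text.split()])

# Pairs that include a space can later grow into tokens spanning two words.
crossing = {pair: n for pair, n in raw_pairs.items() if " " in pair}
print(crossing)
print(word_pairs.most_common(2))
```

Without pre-tokenization, pairs such as `("t", " ")` are counted and can be merged, which is how a token like "it is" could appear. Once the text is split into words first, no pair ever contains a space.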

## Pre-tokenizer

In [11]:
from tokenizers.pre_tokenizers import Whitespace

tokenizer.pre_tokenizer = Whitespace()

In [12]:
file_paths = [
    "/content/roman_01.txt",
    "/content/roman_02.txt",
]

In [13]:
tokenizer.train(file_paths, trainer)

In [14]:
tokenizer.save("roman_tokenizer_01.json")

## Loading our tokenizer from JSON

In [15]:
tokenizer = Tokenizer.from_file("/content/roman_tokenizer_01.json")

In [16]:
output = tokenizer.encode("Namaskar tapailai kasto chha")

In [17]:
output.tokens

['[UNK]', 'ama', 's', 'kar', 'tapailai', 'kasto', 'ch', 'ha']

In [18]:
output.ids

[0, 2270, 68, 1629, 7115, 3404, 1786, 1562]

In [19]:
# The offsets show which part of the input produced the [UNK] token at index 0

print(output.offsets[0])

(0, 1)


In [20]:
"Namaskar tapailai kasto chha"[0:1]

'N'
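The same lookup can be generalized to every unknown token in an encoding. The sketch below is plain Python. `unknown_spans` is a hypothetical helper, and every offset other than `(0, 1)` is reconstructed by hand from the token strings above rather than copied from the tokenizer's output.

```python
def unknown_spans(text, tokens, offsets, unk_token="[UNK]"):
    """Return the substrings of `text` that were encoded as the unknown token."""
    return [
        text[start:end]
        for token, (start, end) in zip(tokens, offsets)
        if token == unk_token
    ]

text = "Namaskar tapailai kasto chha"
tokens = ['[UNK]', 'ama', 's', 'kar', 'tapailai', 'kasto', 'ch', 'ha']
# Only (0, 1) comes from the run above; the rest are hand-derived for illustration.
offsets = [(0, 1), (1, 4), (4, 5), (5, 8), (9, 17), (18, 23), (24, 26), (26, 28)]

print(unknown_spans(text, tokens, offsets))
```

With a real `Encoding`, `output.tokens` and `output.offsets` can be passed in directly. Here the only unknown piece is the capital `N`, which suggests that this character was not in the vocabulary learned from the training files.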

## Post Processing
To have the tokenizer add special tokens such as **"[CLS]"** or **"[SEP]"** automatically, we attach a post-processor. The template needs the ID of each special token, so we look one up first.

In [21]:
tokenizer.token_to_id("[SEP]")

2

In [22]:
from tokenizers.processors import TemplateProcessing

tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)

`$A` stands for the first sequence and `$B` for the second. A `:1` suffix sets the type ID of that piece to 1; pieces without a suffix get type ID 0.

In [23]:
output = tokenizer.encode("Namaskar tapailai kasto chha", "tapai ke gardei hunu hunxa")

In [24]:
output.tokens

['[CLS]',
 '[UNK]',
 'ama',
 's',
 'kar',
 'tapailai',
 'kasto',
 'ch',
 'ha',
 '[SEP]',
 'tapai',
 'ke',
 'garde',
 'i',
 'hunu',
 'hun',
 'x',
 'a',
 '[SEP]']

In [25]:
output.ids

[1,
 0,
 2270,
 68,
 1629,
 7115,
 3404,
 1786,
 1562,
 2,
 2067,
 1637,
 12560,
 58,
 2125,
 1900,
 73,
 50,
 2]

In [26]:
output.type_ids

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
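The type IDs above follow directly from the pair template. As a rough illustration, the plain-Python sketch below rebuilds the same layout by hand. `apply_pair_template` is a hypothetical helper, not part of `tokenizers`, and it hard-codes the `[CLS] $A [SEP] $B:1 [SEP]:1` template instead of parsing it.

```python
def apply_pair_template(tokens_a, tokens_b):
    """Mimic the template "[CLS] $A [SEP] $B:1 [SEP]:1" for already-tokenized input."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # [CLS], $A and the first [SEP] get type ID 0; $B and the final [SEP] get 1.
    type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, type_ids

# Token lists copied from the encoding shown above.
tokens_a = ['[UNK]', 'ama', 's', 'kar', 'tapailai', 'kasto', 'ch', 'ha']
tokens_b = ['tapai', 'ke', 'garde', 'i', 'hunu', 'hun', 'x', 'a']

tokens, type_ids = apply_pair_template(tokens_a, tokens_b)
print(type_ids)
```

Eight tokens in each sequence give ten zeros followed by nine ones, matching `output.type_ids`.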

In [28]:
tokenizer.save("/content/roman_tokenizer_1.json")

## Encoding a batch of sentences at once

In [29]:
tokenizer = Tokenizer.from_file("/content/roman_tokenizer_1.json")

In [30]:
output = tokenizer.encode_batch(["yo chai auta", "ani yo chai another"])

In [42]:
output_batch = tokenizer.encode_batch([["auta", "arko"]])

In [43]:
output_batch

[Encoding(num_tokens=5, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])]

In [45]:
output_batch[0].tokens

['[CLS]', 'auta', '[SEP]', 'arko', '[SEP]']

In [39]:
print(output[0])

Encoding(num_tokens=5, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])


In [40]:
output[0].tokens, output[0].ids

(['[CLS]', 'yo', 'chai', 'auta', '[SEP]'], [1, 1595, 2475, 5104, 2])

In [41]:
for i in output:
  print(i.tokens, i.ids)

['[CLS]', 'yo', 'chai', 'auta', '[SEP]'] [1, 1595, 2475, 5104, 2]
['[CLS]', 'ani', 'yo', 'chai', 'a', 'no', 'ther', '[SEP]'] [1, 2096, 1595, 2475, 50, 1736, 18593, 2]


## Padding a batch to its longest sentence

In [46]:
tokenizer.enable_padding(pad_id=3, pad_token="[PAD]")

In [47]:
output = tokenizer.encode_batch(["auta", "arko ali laamo"])

In [48]:
for i in output:
  print(i.tokens, i.ids)

['[CLS]', 'auta', '[SEP]', '[PAD]', '[PAD]', '[PAD]'] [1, 5104, 2, 3, 3, 3]
['[CLS]', 'arko', 'ali', 'la', 'amo', '[SEP]'] [1, 2292, 3596, 1570, 26073, 2]


In [51]:
# Attention mask: 1 marks real tokens, 0 marks padding

output[0].attention_mask

[1, 1, 1, 0, 0, 0]
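To make the relationship between padding and the attention mask explicit, here is a small plain-Python sketch that reproduces the result above from the unpadded IDs. `pad_batch` is a hypothetical helper written for this note, and it assumes right-side padding to the longest sequence in the batch, which is what the output above shows.

```python
def pad_batch(batch_ids, pad_id):
    """Right-pad every ID list to the longest one and build matching attention masks."""
    max_len = max(len(ids) for ids in batch_ids)
    padded, masks = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        padded.append(ids + [pad_id] * n_pad)
        masks.append([1] * len(ids) + [0] * n_pad)
    return padded, masks

# Unpadded IDs taken from the batch above; 3 is the ID of [PAD].
batch = [[1, 5104, 2], [1, 2292, 3596, 1570, 26073, 2]]
padded, masks = pad_batch(batch, pad_id=3)

for ids, mask in zip(padded, masks):
    print(ids, mask)
```

The longest sentence is left unchanged, so its mask is all ones; only the shorter sentence gains `[PAD]` entries and zeros in its mask.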

## Using a Pretrained Tokenizer

In [52]:
from tokenizers import Tokenizer

In [53]:
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

In [54]:
tokenizer.encode("Namaskar tapailai kasto chha").tokens

['[CLS]',
 'nam',
 '##ask',
 '##ar',
 'tap',
 '##ail',
 '##ai',
 'ka',
 '##sto',
 'ch',
 '##ha',
 '[SEP]']
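Two things differ from our own tokenizer's output: `bert-base-uncased` lowercases the input, and it uses WordPiece, where a `##` prefix marks a piece that continues the previous word. The sketch below glues those pieces back into words in plain Python. `merge_wordpieces` is a hypothetical helper for illustration only; the library's own decoder should be preferred in real code.

```python
def merge_wordpieces(tokens, prefix="##", special=("[CLS]", "[SEP]")):
    """Join WordPiece continuation pieces back into whole words, dropping special tokens."""
    words = []
    for token in tokens:
        if token in special:
            continue
        if token.startswith(prefix) and words:
            # Continuation piece: append it to the word being built.
            words[-1] += token[len(prefix):]
        else:
            words.append(token)
    return words

# Tokens copied from the output above.
tokens = ['[CLS]', 'nam', '##ask', '##ar', 'tap', '##ail', '##ai',
          'ka', '##sto', 'ch', '##ha', '[SEP]']

print(merge_wordpieces(tokens))
```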