# Tokenizers

Doc: https://huggingface.co/docs/transformers/main_classes/tokenizer

Common functions:

- **tokenizer.tokenize(text)**: This function tokenizes a single input text into a list of tokens. It splits the text into subword units or word pieces, depending on the tokenizer's underlying tokenization algorithm.

- **tokenizer.encode(text)**: This function encodes a single input text into a sequence of token IDs. It tokenizes the text and maps each token to its corresponding token ID in the vocabulary.

- **tokenizer.decode(tokens)**: This function decodes a list of token IDs back into a single text string. It converts the token IDs into their corresponding tokens and concatenates them into a text string.

- **tokenizer.batch_encode_plus(texts)**: This function encodes a list of input texts into tokenized sequences with additional information. It returns a dictionary containing the encoded sequences, attention masks, and other relevant information.

- **tokenizer.batch_decode(tokens)**: This function decodes a list of tokenized sequences (lists of token IDs) back into a list of text strings. It performs the decoding operation on multiple sequences at once.

- **tokenizer.pad(encoded_inputs)**: This method pads already-encoded sequences to a common length (the longest sequence in the batch or a given maximum). It adds padding tokens so that all sequences have the same length for batch processing.

- **tokenizer.special_tokens_map**: This attribute provides a mapping of special tokens used by the tokenizer. It includes tokens like [CLS], [SEP], [UNK], and [PAD], which are commonly used in transformer-based models.

In [14]:
!apt-get install tree

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following NEW packages will be installed:
  tree
0 upgraded, 1 newly installed, 0 to remove and 14 not upgraded.
Need to get 43.0 kB of archives.
After this operation, 115 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu focal/universe amd64 tree amd64 1.8.0-1 [43.0 kB]
Fetched 43.0 kB in 0s (636 kB/s)
Selecting previously unselected package tree.
(Reading database ... 123069 files and directories currently installed.)
Preparing to unpack .../tree_1.8.0-1_amd64.deb ...
Unpacking tree (1.8.0-1) ...
Setting up tree (1.8.0-1) ...
Processing triggers for man-db (2.9.1-1) ...


In [4]:
!pip install -q datasets evaluate transformers[sentencepiece]


## Tokenization

- Tokenization is the process of splitting text into words (or parts of words, punctuation symbols, etc.), usually called tokens.
- Tokenizers convert text into input data for the model.
- When you call the tokenizer directly on a sentence, you get back inputs that are ready to pass to your model.

Unknown tokens
- Tokens that are not in the vocabulary.
- Often represented as "[UNK]".

## Encoding

- Encoding is the conversion of text to numbers.
- Encoding is done in a two-step process: the tokenization, followed by the conversion to input IDs.

## Decoding

- Decoding converts token IDs (index numbers) back to text.
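
The two-step encoding and the reverse decoding can be sketched in plain Python. This is a minimal illustration, not how the Hugging Face tokenizers are implemented: the five-entry vocabulary and the helper names `tokenize`, `convert_tokens_to_ids`, `encode`, and `decode` below are made up for the example.

```python
# Toy vocabulary (an assumption for illustration); real tokenizers load theirs from a checkpoint.
vocab = {"[UNK]": 0, "i": 1, "am": 2, "a": 3, "programmer": 4}
id_to_token = {idx: tok for tok, idx in vocab.items()}

def tokenize(text):
    """Step 1: split the text into tokens (lowercased whitespace split)."""
    return text.lower().split()

def convert_tokens_to_ids(tokens):
    """Step 2: map each token to its ID, falling back to [UNK]."""
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]

def encode(text):
    """Encoding = tokenization followed by conversion to IDs."""
    return convert_tokens_to_ids(tokenize(text))

def decode(ids):
    """Decoding = map IDs back to tokens and join them into a string."""
    return " ".join(id_to_token[i] for i in ids)

ids = encode("I am a pilot")
print(ids)          # [1, 2, 3, 0]
print(decode(ids))  # i am a [UNK]
```

The word "pilot" is not in the toy vocabulary, so it is mapped to the unknown token and cannot be recovered when decoding.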

## 1. Word-based Tokenization
- Breaking down a piece of text into individual words, where each word represents a separate token.

- Pros:
  - Simple and easy to implement

- Cons:
  - Not suitable for all languages (some, such as Japanese, do not use spaces to separate words)
  - May not handle all types of words correctly (e.g. contractions such as "don't")
  - Huge vocabulary (English alone has over 500,000 words)

- Library: spaCy, Moses
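
As a rough sketch of the contraction problem, here is a naive regex-based word tokenizer using only the standard library. The function name `word_tokenize` is hypothetical and the rule is far simpler than what spaCy or Moses apply.

```python
import re

def word_tokenize(text):
    # \w+ matches a run of letters/digits; [^\w\s] matches a single punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

print(word_tokenize("I don't code."))
# ['I', 'don', "'", 't', 'code', '.']
```

The contraction is split into three pieces that do not correspond to meaningful words, which is why rule-based tokenizers need special cases for them.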

In [6]:
text = "I am a programmer".split()

print(text)

['I', 'am', 'a', 'programmer']


In [3]:
tokenized_text = "トムはプログラマーです。".split()

print(tokenized_text)

['トムはプログラマーです。']


## 2. Character-based tokenization

- Breaking down a piece of text into individual characters, where each character represents a separate token.

- Pros:
  - The vocabulary is much smaller.
  - There are far fewer out-of-vocabulary (unknown) tokens, since unusual or rare words can still be built from characters.

- Cons:
  - Can lose some contextual information
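
A quick way to see the trade-off is to compare both splits on the same sentence. This is only a sketch in plain Python; the variable names are chosen for the example.

```python
text = "I am a programmer"

word_tokens = text.split()      # word-based: one token per word
char_tokens = list(text)        # character-based: one token per character (spaces included)
char_vocab = sorted(set(char_tokens))

print(len(word_tokens))  # 4
print(len(char_tokens))  # 17
print(len(char_vocab))   # 9
```

The character vocabulary stays tiny, but the sequence becomes several times longer and each token carries less meaning on its own.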

## 3. Subword-based Tokenization
- Breaking down a piece of text into smaller units of meaning, known as subwords.

- Pros:
  - Can handle complex morphology
  - Can improve performance

- Cons:
  - May require additional preprocessing

- Techniques:
  - Byte-level BPE (GPT-2)
  - WordPiece (BERT)
  - SentencePiece or Unigram, as used in several multilingual models

In [5]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

# tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


In [7]:
text = "I am a programmer".split()

tokenizer(text)

{'input_ids': [[101, 146, 102], [101, 1821, 102], [101, 170, 102], [101, 23981, 102]], 'token_type_ids': [[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0]], 'attention_mask': [[1, 1, 1], [1, 1, 1], [1, 1, 1], [1, 1, 1]]}

### 3.1 BPE (Byte Pair Encoding)

- Initially developed as an algorithm to compress text.

- Iteratively merges the most frequent pair of consecutive bytes or characters in a corpus until a predefined vocabulary size is reached.
- BPE relies on a pre-tokenizer that splits the training data into words.

- Used by: GPT, GPT-2, RoBERTa, XLM, BART, DeBERTa, FlauBERT

- GPT has a vocabulary size of 40,478 since they have 478 base characters and chose to stop training after 40,000 merges.



- Paper: https://arxiv.org/abs/1508.07909

- Ref: https://huggingface.co/learn/nlp-course/chapter6/5?fw=pt
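
The merge loop can be sketched in a few lines of plain Python. This is a simplified character-level illustration on a made-up word-frequency table, not the byte-level implementation used by GPT-2; the function names and the `num_merges` stopping rule (instead of a target vocabulary size) are assumptions for the example.

```python
from collections import Counter

def get_pair_counts(splits, word_freqs):
    """Count how often each pair of adjacent symbols occurs in the corpus."""
    pairs = Counter()
    for word, freq in word_freqs.items():
        symbols = splits[word]
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, splits):
    """Replace every occurrence of `pair` with the merged symbol, in place."""
    a, b = pair
    for word, symbols in splits.items():
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        splits[word] = merged

def train_bpe(word_freqs, num_merges):
    """Learn `num_merges` merge rules starting from single characters."""
    splits = {word: list(word) for word in word_freqs}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(splits, word_freqs)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        merge_pair(best, splits)
    return merges, splits

# Hypothetical pre-tokenized corpus: word -> frequency
word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
merges, splits = train_bpe(word_freqs, num_merges=3)
print(merges)          # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
print(splits["hugs"])  # ['hug', 's']
```

The final vocabulary is the set of base characters plus one new token per merge, which is how a count such as 478 + 40,000 = 40,478 arises.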

In [23]:
from transformers import AutoTokenizer

text = "I am a programmer."

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)

tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)

# Print the tokens
print(len(tokens))
print(tokens)
print(ids)

5
['I', 'Ġam', 'Ġa', 'Ġprogrammer', '.']
[40, 716, 257, 24292, 13]


In [24]:
# Decoding

decoded_string = tokenizer.decode(ids)

print(decoded_string)

I am a programmer.


### 3.2 WordPiece

- Developed by Google to pretrain BERT.
- Iteratively merges the pair with the highest score, where the score is the frequency of the pair divided by the product of the frequencies of its parts.
- The algorithm therefore prioritizes merging pairs whose individual parts are less frequent in the vocabulary.
- WordPiece only saves the final vocabulary, not the merge rules learned.

- Used by: BERT, DistilBERT, MobileBERT
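
The scoring rule can be illustrated with a small sketch. The corpus is made up, the `##` prefix marks word-internal pieces as in BERT vocabularies, and `wordpiece_scores` is a hypothetical helper rather than part of any library.

```python
from collections import Counter

def wordpiece_scores(splits, word_freqs):
    """Score each adjacent pair as freq(pair) / (freq(first) * freq(second))."""
    token_freqs, pair_freqs = Counter(), Counter()
    for word, freq in word_freqs.items():
        symbols = splits[word]
        for symbol in symbols:
            token_freqs[symbol] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_freqs[(a, b)] += freq
    return {
        (a, b): freq / (token_freqs[a] * token_freqs[b])
        for (a, b), freq in pair_freqs.items()
    }, pair_freqs

# Hypothetical corpus: word -> frequency
word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
# Initial split: first character as-is, the rest prefixed with "##"
splits = {w: [w[0]] + ["##" + c for c in w[1:]] for w in word_freqs}

scores, pair_freqs = wordpiece_scores(splits, word_freqs)
most_frequent = max(pair_freqs, key=pair_freqs.get)
best_scored = max(scores, key=scores.get)
print(most_frequent)  # ('##u', '##g')
print(best_scored)    # ('##g', '##s')
```

Plain BPE would merge `('##u', '##g')` first because it is the most frequent pair, whereas the WordPiece score prefers `('##g', '##s')` because `##s` is rare on its own.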


In [21]:
from transformers import AutoTokenizer

text = "I am a programmer."

model_name = "distilbert-base-multilingual-cased"  # distilbert-base-multilingual-cased
tokenizer = AutoTokenizer.from_pretrained(model_name)

tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)

# Print the tokens
print(len(tokens))
print(tokens)
print(ids)

6
['I', 'am', 'a', 'programme', '##r', '.']
[101, 146, 10392, 169, 19611, 10129, 119, 102]


### 3.3 Unigram

- The Unigram algorithm is often used in SentencePiece
- SentencePiece uses a more efficient algorithm called Enhanced Suffix Array (ESA) to create the initial vocabulary.
- Starts from a big vocabulary and removes tokens from it until it reaches the desired vocabulary size.
- Iteratively removes the tokens whose removal least increases the loss.
- Computing this for every token is a very costly operation.
- Used by: ALBERT, mT5, mBART, Big Bird, and XLNet
- Ref: https://huggingface.co/learn/nlp-course/chapter6/7?fw=pt
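
The idea of "impact on the loss" can be sketched as follows, assuming the loss is the negative log-likelihood of the corpus under each word's most probable segmentation. The token frequencies and corpus are made up, every single character is assumed to be in the vocabulary, and `segment`, `corpus_loss`, and `loss_without` are hypothetical helpers, not SentencePiece APIs.

```python
import math

def segment(word, token_freqs):
    """Return the most probable segmentation of `word` and its log-probability."""
    total = sum(token_freqs.values())
    # best[i] = (best log-probability of word[:i], start index of its last token)
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in token_freqs and best[start][0] > -math.inf:
                score = best[start][0] + math.log(token_freqs[piece] / total)
                if score > best[end][0]:
                    best[end] = (score, start)
    # Walk back through the stored start indices to recover the tokens
    tokens, end = [], len(word)
    while end > 0:
        start = best[end][1]
        tokens.append(word[start:end])
        end = start
    return tokens[::-1], best[-1][0]

def corpus_loss(word_freqs, token_freqs):
    """Negative log-likelihood of the corpus under the best segmentations."""
    return sum(-freq * segment(word, token_freqs)[1] for word, freq in word_freqs.items())

# Hypothetical vocabulary frequencies and corpus
token_freqs = {"h": 15, "u": 36, "g": 20, "hu": 15, "ug": 20,
               "s": 5, "hug": 15, "p": 17, "n": 16, "un": 16}
word_freqs = {"hug": 10, "hugs": 5, "pun": 12}

def loss_without(token):
    """Corpus loss after dropping one token from the vocabulary."""
    reduced = {t: f for t, f in token_freqs.items() if t != token}
    return corpus_loss(word_freqs, reduced)

print(segment("hugs", token_freqs)[0])  # ['hug', 's']
print(loss_without("hu") < loss_without("hug"))  # True
```

Dropping `"hu"` barely matters because no best segmentation uses it, while dropping `"hug"` forces worse segmentations, so `"hu"` is the better candidate for removal in this toy setting.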

In [22]:
from transformers import AutoTokenizer

text = "I am a programmer."

model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)

tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)

# Print the tokens
print(len(tokens))
print(tokens)
print(ids)

7
['▁I', '▁am', '▁', 'a', '▁programme', 'r', '.']
[27, 183, 3, 9, 2486, 52, 5, 1]


## Saving tokenizer

In [13]:
save_folder = "/content/tokenizer"

tokenizer.save_pretrained(save_folder)

('/content/tokenizer/tokenizer_config.json',
 '/content/tokenizer/special_tokens_map.json',
 '/content/tokenizer/spiece.model',
 '/content/tokenizer/added_tokens.json',
 '/content/tokenizer/tokenizer.json')

In [15]:
!tree -h "/content/tokenizer"

/content/tokenizer
├── [2.1K]  special_tokens_map.json
├── [773K]  spiece.model
├── [2.3K]  tokenizer_config.json
└── [2.3M]  tokenizer.json

0 directories, 4 files


## Multiple sequences

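- Batching allows the model to work when you feed it multiple sentences.
- Transformers models expect a batch of sequences by default, so even a single sequence must be wrapped in a batch dimension.
- The hidden states of the base model have three dimensions (batch size, sequence length, hidden size); the classification head used below returns logits of shape (batch size, number of labels).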

In [82]:
import torch
from torch.nn.functional import softmax
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I am a programmer."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)

input_ids = torch.tensor([ids])
print("Input IDs:", input_ids)

output = model(input_ids)

print("Output:", output)
print("Output Logits:", output.logits)
print("Softmax probabilities:", softmax(output.logits, dim=1))
print("Output dimensions:", output.logits.size())
print("Batch size:", output.logits.size(0))
print("Number of labels:", output.logits.size(-1))

Input IDs: tensor([[ 1045,  2572,  1037, 20273,  1012]])
Output: SequenceClassifierOutput(loss=None, logits=tensor([[ 2.1330, -1.7802]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
Output Logits: tensor([[ 2.1330, -1.7802]], grad_fn=<AddmmBackward0>)
Softmax probabilities: tensor([[0.9804, 0.0196]], grad_fn=<SoftmaxBackward0>)
Output dimensions: torch.Size([1, 2])
Batch size: 1
Number of labels: 2


## Padding

- Padding adds a special padding token to the shorter sequences so that the batch forms a rectangular tensor.
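
What padding does to a batch can be sketched without any library. The helper `pad_batch` is hypothetical, and the default padding ID of 0 is an assumption that happens to match the DistilBERT checkpoint used in this notebook.

```python
def pad_batch(batch_ids, pad_token_id=0):
    """Pad every sequence on the right to the length of the longest one."""
    max_len = max(len(ids) for ids in batch_ids)
    return [ids + [pad_token_id] * (max_len - len(ids)) for ids in batch_ids]

batch = [[200, 200, 200], [200, 200]]
padded = pad_batch(batch)
print(padded)  # [[200, 200, 200], [200, 200, 0]]
```

After padding, every row has the same length, so the nested list can be converted to a tensor.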

In [28]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
print(model(torch.tensor(batched_ids)).logits)

tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward0>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
tensor([[ 1.5694, -1.3895],
        [ 1.3374, -1.2163]], grad_fn=<AddmmBackward0>)


In [29]:
# Get padding ID

tokenizer.pad_token_id

0

In [71]:
sequences = ["I am a programmer in Japan.", "So do I!"]

model_inputs = tokenizer(sequences)  # No padding by default: each sequence keeps its own length
print("\n padding=default")
print(len(model_inputs["input_ids"][0])) # Length
print(len(model_inputs["input_ids"][1])) # Length
print(tokenizer.decode(model_inputs["input_ids"][0])) # Decode
print(tokenizer.decode(model_inputs["input_ids"][1])) # Decode

# Will pad the sequences up to the length of the longest sequence in the batch
model_inputs = tokenizer(sequences, padding="longest")
print("\n padding=longest")
print(len(model_inputs["input_ids"][0])) # Length
print(len(model_inputs["input_ids"][1])) # Length
print(tokenizer.decode(model_inputs["input_ids"][0])) # Decode
print(tokenizer.decode(model_inputs["input_ids"][1])) # Decode

# Will pad the sequences up to the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, padding="max_length")
print("\n padding=max_length")
print(len(model_inputs["input_ids"][0])) # Length
print(len(model_inputs["input_ids"][1])) # Length
print(tokenizer.decode(model_inputs["input_ids"][0])) # Decode
print(tokenizer.decode(model_inputs["input_ids"][1])) # Decode

# Will pad the sequences up to the specified max length
model_inputs = tokenizer(sequences, padding="max_length", max_length=6)
print("\n padding=max_length, max_length=6")
print(len(model_inputs["input_ids"][0])) # Length
print(len(model_inputs["input_ids"][1])) # Length
print(tokenizer.decode(model_inputs["input_ids"][0])) # Decode
print(tokenizer.decode(model_inputs["input_ids"][1])) # Decode


 padding=default
9
6
[CLS] i am a programmer in japan. [SEP]
[CLS] so do i! [SEP]

 padding=longest
9
9
[CLS] i am a programmer in japan. [SEP]
[CLS] so do i! [SEP] [PAD] [PAD] [PAD]

 padding=max_length
512
512
[CLS] i am a programmer in japan. [SEP] [PAD] [PAD] [PAD] ... (the remaining [PAD] tokens and the rest of this output are truncated)

### Truncate sequences

In [56]:
# Will truncate the sequences that are longer than the specified max length
model_inputs = tokenizer(sequences, max_length=6, truncation=True)

print("\n  max_length=6, truncation=True")
print(tokenizer.decode(model_inputs["input_ids"][0])) # Decode
print(tokenizer.decode(model_inputs["input_ids"][1])) # Decode


  max_length=6, truncation=True
[CLS] i am a programmer [SEP]
[CLS] so do i! [SEP]


### Converting to framework tensors

In [60]:
# Returns PyTorch tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")
print(type(model_inputs["input_ids"]))

# Returns TensorFlow tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="tf")
print(type(model_inputs["input_ids"]))

# Returns NumPy arrays
model_inputs = tokenizer(sequences, padding=True, return_tensors="np")
print(type(model_inputs["input_ids"]))

<class 'torch.Tensor'>
<class 'tensorflow.python.framework.ops.EagerTensor'>
<class 'numpy.ndarray'>


## Special tokens
- Special tokens refer to specific tokens that have special meanings in natural language processing tasks. They are often used to handle various aspects of language modeling, such as indicating the start and end of a sentence, marking out-of-vocabulary words, separating sentences, or representing padding and masking.

In [64]:
sequence = "I am a programmer."

model_inputs = tokenizer(sequence)
print(model_inputs["input_ids"])

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

# Decode
print(tokenizer.decode(model_inputs["input_ids"]))
print(tokenizer.decode(ids))

[101, 1045, 2572, 1037, 20273, 1012, 102]
[1045, 2572, 1037, 20273, 1012]
[CLS] i am a programmer. [SEP]
i am a programmer.


## Attention masks

- Attention masks are tensors with the exact same shape as the input IDs tensor, filled with 0s and 1s: 1s indicate that the corresponding tokens should be attended to, and 0s indicate that the corresponding tokens should be ignored.

In [73]:
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)

tensor([[ 1.5694, -1.3895],
        [ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)


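Notice that the last value of the second sequence is a padding ID, which has a 0 in the attention mask. With the mask applied, the logits for the second sequence match those obtained when it was passed to the model on its own.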
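
Why a masked position stops influencing the result can be sketched with a masked softmax over raw attention scores. This is a simplified illustration in plain Python, not the model's actual attention code; `masked_softmax` is a hypothetical helper and it assumes the mask contains at least one 1.

```python
import math

def masked_softmax(scores, mask):
    """Softmax over `scores` where positions with mask 0 receive zero weight."""
    # Masked positions are set to -inf, so exp() turns them into 0
    masked = [s if m == 1 else -math.inf for s, m in zip(scores, mask)]
    peak = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 3.0]  # made-up attention scores; the last position is padding

unmasked = masked_softmax(scores, [1, 1, 1])
masked = masked_softmax(scores, [1, 1, 0])
alone = masked_softmax(scores[:2], [1, 1])

print(masked[2])             # 0.0
print(masked[:2] == alone)   # True
```

With the mask, the padded position gets zero weight and the remaining weights are identical to those of the unpadded sequence, which mirrors the matching logits in the output above.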

## Longer sequences
- With Transformer models, there is a limit to the length of the sequences we can pass to the models. Most models handle sequences of up to 512 or 1024 tokens.
- There are two solutions to this problem:
  - Use a model with a longer supported sequence length.
  - Truncate your sequences.
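
Truncation itself is simple to sketch: keep as many tokens as fit and leave room for the special tokens. The helper `truncate_with_special_tokens` is hypothetical, and the default IDs 101 and 102 are assumed to be the `[CLS]` and `[SEP]` IDs, as in the BERT-style outputs shown earlier.

```python
def truncate_with_special_tokens(ids, max_length, cls_id=101, sep_id=102):
    """Cut `ids` (without special tokens) so the final sequence fits in `max_length`."""
    body = ids[: max_length - 2]  # reserve two slots for [CLS] and [SEP]
    return [cls_id] + body + [sep_id]

ids = [1045, 2572, 1037, 20273, 1012]  # "i am a programmer ." without special tokens
print(truncate_with_special_tokens(ids, max_length=6))
# [101, 1045, 2572, 1037, 20273, 102]
```

Sequences that already fit are returned unchanged apart from the added special tokens.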

## Reference:

- Hugging Face NLP Course
  - https://huggingface.co/learn/nlp-course/chapter1/1

- DataCamp: An Introduction to Using Transformers and Hugging Face
  - https://www.datacamp.com/tutorial/an-introduction-to-using-transformers-and-hugging-face