# Normalization and pre-tokenization

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [None]:
!pip install datasets evaluate transformers[sentencepiece]

## Normalization

The normalization step involves some general cleanup, such as removing needless whitespace, lowercasing, and/or removing accents. If you're familiar with Unicode normalization (such as NFC or NFKC), this is also something the tokenizer may apply.
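To make the Unicode forms mentioned above concrete, here is a minimal sketch that uses only Python's standard `unicodedata` module (it does not go through the tokenizer). The variable names are illustrative. It shows that NFC merges a base letter and a combining accent into one code point, while NFKC additionally folds compatibility characters such as the "ﬁ" ligature.

```python
import unicodedata

composed = "\u00e9"      # "é" as one precomposed code point
decomposed = "e\u0301"   # "e" followed by a combining acute accent

# The two strings render identically but compare unequal until normalized.
same_before = composed == decomposed
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed

# NFC leaves the "ﬁ" ligature (U+FB01) alone; NFKC expands it to "f" + "i".
ligature = "\ufb01"
nfc_ligature = unicodedata.normalize("NFC", ligature)
nfkc_ligature = unicodedata.normalize("NFKC", ligature)

print(same_before, same_after_nfc)       # False True
print(len(nfc_ligature), nfkc_ligature)  # 1 fi
```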

The ü§ó Transformers tokenizer has an attribute called backend_tokenizer that provides access to the underlying tokenizer from the ü§ó Tokenizers library:

In [1]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(type(tokenizer.backend_tokenizer))

<class 'tokenizers.Tokenizer'>


The `normalizer` attribute of the `tokenizer` object has a `normalize_str()` method that we can use to see how the normalization is performed:

In [3]:
print(tokenizer.backend_tokenizer.normalizer.normalize_str("Héllò hôw are ü?"))

hello how are u?


In this example, since we picked the `bert-base-uncased` checkpoint, the normalization applied lowercasing and removed the accents.
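The same two effects can be approximated in plain Python, which may help show what "removing accents" means at the character level. This is a sketch under assumptions, not the checkpoint's actual normalizer (which also handles other cleanup): the helper name `lowercase_and_strip_accents` is hypothetical, and it relies on decomposing with NFD and then dropping combining marks (Unicode category `Mn`).

```python
import unicodedata

def lowercase_and_strip_accents(text: str) -> str:
    """Rough stand-in for an uncased normalizer: lowercase, then drop accents."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text.lower())
    # Category "Mn" (nonspacing mark) covers the combining accents to discard.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

normalized = lowercase_and_strip_accents("Héllò hôw are ü?")
print(normalized)  # hello how are u?
```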

## Pre-tokenization

In [None]:
tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello, how are  you?")

[('Hello', (0, 5)), (',', (5, 6)), ('how', (7, 10)), ('are', (11, 14)), ('you', (16, 19)), ('?', (19, 20))]
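The output above splits on whitespace and punctuation and keeps character offsets into the original string, so the double space is skipped but still reflected in the offsets. A minimal sketch of that behavior with the standard `re` module follows. The function name is hypothetical and the regular expression is only an approximation (for instance, Python's `\w` treats `_` as a word character, which the real pre-tokenizer may not).

```python
import re

def whitespace_punct_pre_tokenize(text):
    """Split into runs of word characters and single punctuation marks."""
    # m.span() gives (start, end) offsets into the original, unmodified text.
    return [(m.group(), m.span()) for m in re.finditer(r"\w+|[^\w\s]", text)]

bert_like = whitespace_punct_pre_tokenize("Hello, how are  you?")
print(bert_like)
```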

In [None]:
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello, how are  you?")

[('Hello', (0, 5)), (',', (5, 6)), ('Ġhow', (6, 10)), ('Ġare', (10, 14)), ('Ġ', (14, 15)), ('Ġyou', (15, 19)),
 ('?', (19, 20))]
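Unlike the BERT pre-tokenizer, this one keeps the spaces: each space is attached to the following word and displayed as `Ġ` (U+0120), and the extra space in the double space becomes its own piece. The sketch below imitates that with the standard `re` module. It is an assumption-laden simplification, not the library's implementation: the pattern is loosely modeled on GPT-2's splitting rule, omits its contraction cases, and the names `GPT2_LIKE` and `byte_level_like_pre_tokenize` are hypothetical.

```python
import re

# Optional leading space + letters, digits, or other symbols; then leftover whitespace.
GPT2_LIKE = re.compile(r" ?[^\W\d_]+| ?\d+| ?(?:[^\s\w]|_)+|\s+(?!\S)|\s+")

def byte_level_like_pre_tokenize(text):
    """Keep spaces attached to the next word and render them as U+0120."""
    return [
        (m.group().replace(" ", "\u0120"), m.span())
        for m in GPT2_LIKE.finditer(text)
    ]

gpt2_like = byte_level_like_pre_tokenize("Hello, how are  you?")
print(gpt2_like)
```

Because no character is dropped, the offsets here are contiguous, which is what makes the original text recoverable from the pieces.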

In [None]:
tokenizer = AutoTokenizer.from_pretrained("t5-small")
tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello, how are  you?")

[('▁Hello,', (0, 6)), ('▁how', (7, 10)), ('▁are', (11, 14)), ('▁you?', (16, 20))]
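Here the split happens on whitespace only, so punctuation stays attached to its word, and each piece is prefixed with `▁` (U+2581), including the first one. A rough sketch of that behavior, assuming a plain whitespace split is a fair approximation (the function name is hypothetical, and this is not the SentencePiece-based implementation itself):

```python
import re

def metaspace_like_pre_tokenize(text):
    """Split on whitespace only and mark every piece with a leading U+2581."""
    # Offsets cover the non-space run; the marker is added to the piece, not the text.
    return [("\u2581" + m.group(), m.span()) for m in re.finditer(r"\S+", text)]

t5_like = metaspace_like_pre_tokenize("Hello, how are  you?")
print(t5_like)
```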