# **Fast tokenizer:**

Significant speed-up with batched tokenization:
When tokenizing a batch of texts, the PreTrainedTokenizerFast class provides a significant speed-up, because it is backed by the Rust implementation in the Hugging Face tokenizers library.
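
A batched call simply passes a list of strings in one go. The sketch below is a minimal illustration, assuming a hypothetical toy vocabulary built in memory so that nothing has to be downloaded; the names `vocab`, `backend` and `toy_tokenizer` are illustrative and not part of the original notebook, and it relies on the `tokenizers` package that is installed alongside transformers.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from transformers import PreTrainedTokenizerFast

# Hypothetical toy vocabulary, for illustration only.
vocab = {"[PAD]": 0, "[UNK]": 1, "fast": 2, "tokenizers": 3, "are": 4, "quick": 5}

# Rust-backed tokenizer: WordPiece model with whitespace pre-tokenization.
backend = Tokenizer(WordPiece(vocab=vocab, unk_token="[UNK]"))
backend.pre_tokenizer = Whitespace()

# Wrap the backend so it exposes the PreTrainedTokenizerFast interface.
toy_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=backend, unk_token="[UNK]", pad_token="[PAD]"
)

# Batched tokenization: one call for the whole list, padded to the longest text.
batch = ["fast tokenizers are quick", "tokenizers are fast"]
batch_encoding = toy_tokenizer(batch, padding=True)

print("Fast tokenizer:", toy_tokenizer.is_fast)
for sentence, ids in zip(batch, batch_encoding["input_ids"]):
    print(sentence, "->", ids)
```

The same call shape applies to a tokenizer loaded with AutoTokenizer.from_pretrained. How large the speed-up is depends on the batch size and the hardware, so it is worth timing on your own data.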

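Mapping between the original string and the token space:
The PreTrainedTokenizerFast class also offers methods to map between the original string (characters and words) and the token space, for example to get the index of the token that contains a given character, or the span of characters that corresponds to a given token.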

# **Example:**

In [None]:
!pip install transformers

In [2]:
from transformers import AutoTokenizer

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

# Sample text
text = "Hugging Face is creating a tool that democratizes AI."

# Tokenize the text
encoding = tokenizer(text, return_offsets_mapping=True)

# Print tokens and their offsets
tokens = encoding.tokens()
offsets = encoding.offset_mapping

for token, offset in zip(tokens, offsets):
    start, end = offset
    print(f"Token: {token}, Text: '{text[start:end]}'")

# Example of getting the token comprising a given character
char_index = 5
token_index = encoding.char_to_token(char_index)
print(f"Character at index {char_index} is part of token: {tokens[token_index]}")

# Example of getting the span of characters corresponding to a given token
token_index = 3  # token 'is' (index 0 is the special token [CLS])
char_span = encoding.token_to_chars(token_index)
print(f"Token at index {token_index} spans characters: {text[char_span[0]:char_span[1]]}")


Token: [CLS], Text: ''
Token: hugging, Text: 'Hugging'
Token: face, Text: 'Face'
Token: is, Text: 'is'
Token: creating, Text: 'creating'
Token: a, Text: 'a'
Token: tool, Text: 'tool'
Token: that, Text: 'that'
Token: democrat, Text: 'democrat'
Token: ##izes, Text: 'izes'
Token: ai, Text: 'AI'
Token: ., Text: '.'
Token: [SEP], Text: ''
Character at index 5 is part of token: hugging
Token at index 3 spans characters: is
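
The output above covers the character-level mapping; the word-level mapping can be pictured from the same token list. The following is a rough pure-Python sketch, not the library's algorithm: `approximate_word_ids` is a hypothetical helper that assumes BERT's WordPiece convention, where a token starting with `##` continues the previous word. With the real encoding, `encoding.word_ids()` should report the equivalent mapping.

```python
# Tokens copied from the output above.
tokens = ["[CLS]", "hugging", "face", "is", "creating", "a", "tool", "that",
          "democrat", "##izes", "ai", ".", "[SEP]"]
SPECIAL_TOKENS = {"[CLS]", "[SEP]"}


def approximate_word_ids(tokens):
    """Give each token the index of its word; special tokens get None."""
    word_ids = []
    current_word = -1
    for token in tokens:
        if token in SPECIAL_TOKENS:
            word_ids.append(None)
        elif token.startswith("##"):
            # Continuation piece: belongs to the same word as the previous token.
            word_ids.append(current_word)
        else:
            # A token without the "##" prefix starts a new word.
            current_word += 1
            word_ids.append(current_word)
    return word_ids


word_ids = approximate_word_ids(tokens)
for token, word_id in zip(tokens, word_ids):
    print(f"{token:>10} -> word {word_id}")
```

Here `democrat` and `##izes` share word index 7, which is how the two pieces are tied back to the single word "democratizes".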


# **Explanation:**

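Initialization: AutoTokenizer.from_pretrained loads the tokenizer for the bert-base-uncased checkpoint, and use_fast=True requests the fast (Rust-backed) implementation.

Tokenization: The text is tokenized with return_offsets_mapping=True, so the encoding also contains the (start, end) character offsets of each token.

Tokens and Offsets: For each token, the corresponding text span is printed using the offsets. The special tokens [CLS] and [SEP] map to an empty span because they do not come from the input text. The tokens are lowercased because the checkpoint is uncased, while the spans keep the original casing.

Character to Token: The char_to_token method maps a character index in the original text to the index of the token that contains it. Here character 5 (the 'n' in 'Hugging') falls inside token 1, hugging.

Token to Character Span: The token_to_chars method maps a token index to the span of characters it covers in the original text. Here token 3 covers characters 13 to 15, which is 'is'.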
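
The two lookups described above can be pictured as simple searches over the offset list. The following is a minimal pure-Python sketch, not the library's implementation: `build_offsets`, `find_token_for_char` and `chars_for_token` are hypothetical helpers, and the text pieces are copied from the "Text:" column of the output.

```python
text = "Hugging Face is creating a tool that democratizes AI."

# Surface pieces copied from the "Text:" column of the output;
# the empty strings stand for the special tokens [CLS] and [SEP].
pieces = ["", "Hugging", "Face", "is", "creating", "a", "tool", "that",
          "democrat", "izes", "AI", ".", ""]


def build_offsets(text, pieces):
    """Recover (start, end) character offsets by scanning the text left to right."""
    offsets, cursor = [], 0
    for piece in pieces:
        if not piece:
            offsets.append((0, 0))  # special tokens cover no characters
            continue
        start = text.index(piece, cursor)
        end = start + len(piece)
        offsets.append((start, end))
        cursor = end
    return offsets


def find_token_for_char(offsets, char_index):
    """Return the index of the token whose span contains char_index, else None."""
    for token_index, (start, end) in enumerate(offsets):
        if start <= char_index < end:
            return token_index
    return None  # e.g. whitespace between tokens


def chars_for_token(offsets, token_index):
    """Return the (start, end) character span covered by a token."""
    return offsets[token_index]


offsets = build_offsets(text, pieces)
print(find_token_for_char(offsets, 5))   # 1, the token for 'Hugging'
start, end = chars_for_token(offsets, 3)
print((start, end), repr(text[start:end]))  # (13, 15) 'is'
print(find_token_for_char(offsets, 7))   # None, character 7 is a space
```

In this sketch a character that belongs to no token, such as a space, yields None, so the result is worth checking before using it as an index.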