<a href="https://colab.research.google.com/github/mshojaei77/NLP-Journey/blob/main/ch1/Hugging_Face_Tokenizers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hugging Face Tokenizers for Beginners

Hugging Face tokenizers are powerful tools used in natural language processing (NLP) to convert text into a format that machine learning models can understand. Let's explore how they work and why they're so useful!

## The Concept

Imagine you're translating a book into a secret code that a computer can understand. That's essentially what a tokenizer does! It takes text and breaks it down into smaller pieces (tokens) that a machine learning model can process.

Hugging Face tokenizers are special because they:
1. Are very fast
2. Handle various languages and special characters well
3. Can be customized for different tasks
4. Work seamlessly with popular NLP models

Let's see how to use a Hugging Face tokenizer!

In [2]:
# First, we need to install the transformers library
!pip install -q transformers

from transformers import AutoTokenizer

# Load a pre-trained tokenizer (in this case, BERT)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

<sub>Read about the bert-base-uncased pre-trained tokenizer and model here: https://huggingface.co/google-bert/bert-base-uncased</sub>

## Using the Tokenizer

Now that we have our tokenizer, let's see it in action!

In [12]:
text = "Ancient Persia, with its sun-warmed landscapes and rich cultural heritage, was a powerful civilization"

# Tokenize the text
output = tokenizer(text)

print("Tokenized output:")
print(output)


Tokenized output:
{'input_ids': [101, 3418, 16667, 1010, 2007, 2049, 3103, 1011, 17336, 12793, 1998, 4138, 3451, 4348, 1010, 2001, 1037, 3928, 10585, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


## What just happened?

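The tokenizer did several things:
1. It lowercased the text (this is an uncased model) and split it into tokens
2. It converted each token to its ID in the model's vocabulary
3. It added the special tokens `[CLS]` (ID 101) at the start and `[SEP]` (ID 102) at the end
4. It returned `token_type_ids` and an `attention_mask` alongside the `input_ids`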

Let's break this down further:

In [13]:
# Get the tokens
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)

# Convert tokens to IDs
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print("Input IDs:", input_ids)

# Decode back to text
decoded_text = tokenizer.decode(input_ids)
print("Decoded text:", decoded_text)

Tokens: ['ancient', 'persia', ',', 'with', 'its', 'sun', '-', 'warmed', 'landscapes', 'and', 'rich', 'cultural', 'heritage', ',', 'was', 'a', 'powerful', 'civilization']
Input IDs: [3418, 16667, 1010, 2007, 2049, 3103, 1011, 17336, 12793, 1998, 4138, 3451, 4348, 1010, 2001, 1037, 3928, 10585]
Decoded text: ancient persia, with its sun - warmed landscapes and rich cultural heritage, was a powerful civilization
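
The splitting step can be approximated in a few lines of plain Python. This is only a sketch, not the library's implementation: it assumes that lowercasing plus a regex split on punctuation is close enough to BERT's pre-tokenization for this particular sentence, and `toy_vocab` is a hypothetical stand-in that holds just a few token-to-ID pairs copied from the output above.

```python
import re

text = ("Ancient Persia, with its sun-warmed landscapes and rich cultural "
        "heritage, was a powerful civilization")

def basic_split(s):
    """Lowercase, then split into runs of word characters or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", s.lower())

# Hypothetical toy vocabulary: token -> ID pairs copied from the output above.
toy_vocab = {
    "ancient": 3418, "persia": 16667, ",": 1010, "with": 2007,
    "its": 2049, "sun": 3103, "-": 1011, "warmed": 17336,
}

tokens = basic_split(text)
first_ids = [toy_vocab[t] for t in tokens[:8]]

print(tokens[:8])   # ['ancient', 'persia', ',', 'with', 'its', 'sun', '-', 'warmed']
print(first_ids)    # [3418, 16667, 1010, 2007, 2049, 3103, 1011, 17336]
```

The lowercasing and the spaces around the hyphen also explain why the decoded text is not identical to the original input: the round trip through tokens is lossy.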


## Special Features of Hugging Face Tokenizers

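1. **Subword Tokenization**: BERT's tokenizer uses WordPiece, which splits a word that is not in its vocabulary into smaller known pieces and marks each continuation piece with `##`. This lets the model represent words it hasn't seen before. Note that the split of `sun-warmed` into `sun`, `-`, and `warmed` is a different step: punctuation is separated out before WordPiece runs.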

2. **Handling Multiple Sentences**: Let's see how it handles multiple sentences and pads them to the same length.

In [14]:
sentences = ["Ancient Persia, with its sun-warmed landscapes and rich cultural heritage, was a powerful civilization",
             "that significantly influenced the development of art and architecture in the Middle East and beyond."]

# Tokenize with padding
padded_output = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

print("Padded output:")
print(padded_output)


Padded output:
{'input_ids': tensor([[  101,  3418, 16667,  1010,  2007,  2049,  3103,  1011, 17336, 12793,
          1998,  4138,  3451,  4348,  1010,  2001,  1037,  3928, 10585,   102],
        [  101,  2008,  6022,  5105,  1996,  2458,  1997,  2396,  1998,  4294,
          1999,  1996,  2690,  2264,  1998,  3458,  1012,   102,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]])}


The three arguments do the following:

- `padding=True` pads the shorter sequences with the padding token (ID 0 for this tokenizer) so that every sequence in the batch is as long as the longest one.
- `truncation=True` cuts off any tokens beyond the model's maximum input length (512 tokens for `bert-base-uncased`).
- `return_tensors="pt"` returns PyTorch tensors instead of Python lists, so PyTorch must be installed.

This preprocessing step is crucial for batch processing, because it ensures that all inputs to the model are consistent in shape and size.
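
To make padding and the resulting mask concrete, here is a minimal pure-Python sketch. The helper `pad_batch` is hypothetical and written only for illustration; it is not how the library does it. In particular, its truncation is a naive slice, whereas the real tokenizer keeps the closing `[SEP]` token when it truncates.

```python
def pad_batch(batch, pad_id=0, max_length=None):
    """Optionally truncate each sequence, then right-pad all of them to the longest one."""
    if max_length is not None:
        # Naive truncation (assumption for this sketch): simply drop trailing IDs.
        batch = [seq[:max_length] for seq in batch]
    longest = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in batch]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in batch]
    return input_ids, attention_mask

batch = [[101, 3418, 16667, 102], [101, 2008, 102]]
ids, mask = pad_batch(batch)

print(ids)   # [[101, 3418, 16667, 102], [101, 2008, 102, 0]]
print(mask)  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

The mask has the same shape as the IDs, which is why the two tensors in the padded output above line up position by position.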

3. **Attention Masks**: The tokenizer also provides attention masks, which tell the model which positions hold real tokens (1) and which hold padding (0), so that padding can be ignored.

4. **Special Tokens**: Let's look at how special tokens are handled.

In [15]:
print("Special tokens:")
print(f"CLS token: {tokenizer.cls_token}, ID: {tokenizer.cls_token_id}")
print(f"SEP token: {tokenizer.sep_token}, ID: {tokenizer.sep_token_id}")
print(f"PAD token: {tokenizer.pad_token}, ID: {tokenizer.pad_token_id}")

Special tokens:
CLS token: [CLS], ID: 101
SEP token: [SEP], ID: 102
PAD token: [PAD], ID: 0


Special tokens are reserved tokens that models such as BERT use to mark the structure of the input:

- `[CLS]` (ID 101) is placed at the beginning of each sequence. In classification tasks, the model's output for this token is interpreted as a summary of the entire sequence.
- `[SEP]` (ID 102) separates segments of text, such as two sentences in one sequence, and marks the end of a single sentence.
- `[PAD]` (ID 0) is used for padding: it fills the extra space in shorter sequences so that all sequences in a batch have the same length.

These special tokens help the model understand the structure of the input data and perform tasks more effectively.
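
The layout these tokens produce can be sketched without the library. The helper `build_inputs` below is hypothetical; it assumes the BERT convention of `[CLS] A [SEP]` for one segment and `[CLS] A [SEP] B [SEP]` for a pair, with `token_type_ids` set to 0 for the first segment and 1 for the second. The IDs are the ones printed earlier for `bert-base-uncased`.

```python
CLS_ID, SEP_ID = 101, 102  # IDs printed above for bert-base-uncased

def build_inputs(ids_a, ids_b=None):
    """Lay out [CLS] A [SEP] (and optionally B [SEP]) with matching segment IDs."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID]
    token_type_ids = [0] * len(input_ids)
    if ids_b is not None:
        input_ids += ids_b + [SEP_ID]
        token_type_ids += [1] * (len(ids_b) + 1)
    return input_ids, token_type_ids

single = build_inputs([3418, 16667])              # "ancient persia"
pair = build_inputs([3418, 16667], [2001, 3928])  # plus "was powerful"

print(single)  # ([101, 3418, 16667, 102], [0, 0, 0, 0])
print(pair)    # ([101, 3418, 16667, 102, 2001, 3928, 102], [0, 0, 0, 0, 1, 1, 1])
```

This is why `token_type_ids` was all zeros in the earlier outputs: every example so far contained a single segment.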

## Why are Hugging Face Tokenizers Special?

1. **Speed**: They're implemented in Rust, making them very fast.
2. **Flexibility**: They can be used with various models and tasks.
3. **Pre-trained vocabularies**: Popular models ship with the tokenizer they were trained with, saving time and effort.
4. **Customization**: You can train a new tokenizer or add tokens for specific tasks or domains.

## Real-world Applications

Hugging Face tokenizers are used in many NLP tasks, including:
- Text classification
- Named Entity Recognition
- Question Answering
- Machine Translation

They're a crucial part of the pipeline in modern NLP systems.

In [16]:
# Let's see how the tokenizer handles a more complex sentence
complex_sentence = "The year is 2023, and AI is advancing rapidly! 🚀"
complex_output = tokenizer(complex_sentence)

print("Tokens for complex sentence:")
print(tokenizer.convert_ids_to_tokens(complex_output['input_ids']))

Tokens for complex sentence:
['[CLS]', 'the', 'year', 'is', '202', '##3', ',', 'and', 'ai', 'is', 'advancing', 'rapidly', '!', '[UNK]', '[SEP]']


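As you can see, the tokenizer keeps punctuation as separate tokens and splits the number `2023` into the subword pieces `202` and `##3`. The emoji, however, is not in BERT's vocabulary, so it is replaced with the unknown token `[UNK]` and its meaning is lost.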
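
The `202` / `##3` split and the `[UNK]` in the output above both follow from how WordPiece looks up pieces: it repeatedly takes the longest prefix of the remaining word that exists in the vocabulary, and gives up with `[UNK]` if nothing matches. Below is a minimal sketch of that greedy longest-match-first idea. The function and the four-entry `toy_vocab` are hypothetical and only illustrate the technique; the real vocabulary has about 30,000 entries.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched, so the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary for illustration.
toy_vocab = {"the", "year", "202", "##3"}

print(wordpiece("2023", toy_vocab))  # ['202', '##3']
print(wordpiece("year", toy_vocab))  # ['year']
print(wordpiece("🚀", toy_vocab))    # ['[UNK]']
```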

## Conclusion

Hugging Face tokenizers are powerful tools that bridge the gap between human-readable text and machine-processable data. They're an essential component in modern NLP pipelines, allowing models to understand and generate human language more effectively.

By using pre-trained tokenizers, you can quickly get started with advanced NLP tasks without having to worry about the intricacies of text preprocessing. As you delve deeper into NLP, understanding how these tokenizers work will help you fine-tune your models and tackle more complex language tasks.