<a href="https://colab.research.google.com/github/mshojaei77/AdvancedWebScraper/blob/main/ch1/Custom_Tokenizer_Implementation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Custom Persian Tokenizer Implementation with Hugging Face

In this notebook, we'll create a custom tokenizer for the Persian language using Hugging Face's tokenizers library. We'll train it in Google Colab on a large corpus of Persian text loaded from the Hugging Face Hub. The notebook checks for a T4 GPU, but tokenizer training itself runs on the CPU.

## Concepts We'll Cover:

1. **Datasets**: Using Hugging Face's datasets library to load large-scale text data.
2. **Tokenizers**: Understanding and implementing advanced tokenization techniques.
3. **Persian Language Processing**: Addressing the unique challenges of tokenizing Persian text.
4. **Runtime Check**: Checking the Colab GPU, while noting that tokenizer training itself runs on the CPU.
5. **Subword Tokenization**: Implementing Byte-Pair Encoding (BPE) for effective subword tokenization.

Let's begin!


In [None]:
# Install necessary libraries
!pip install -q datasets tokenizers transformers

import torch
from datasets import load_dataset
from huggingface_hub import notebook_login
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing
from tqdm.auto import tqdm

# Check if GPU is available
print("GPU Available:", torch.cuda.is_available())
print("GPU Name:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "N/A")


## Loading the OSCAR Dataset

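The OSCAR (Open Super-large Crawled ALMAnaCH coRpus) dataset is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus. Note that the cell below loads the `RohanAiLab/persian_daily_news` dataset from the Hugging Face Hub rather than OSCAR's Persian subset; the rest of the notebook only relies on a `train` split with a `text` column.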

**Concept: Large-scale Datasets**
Large datasets like OSCAR are crucial for training effective NLP models and tokenizers. They provide a diverse range of text that helps capture the nuances and variations in language use.

In [None]:
# Load the Persian Daily News dataset
dataset = load_dataset("RohanAiLab/persian_daily_news", trust_remote_code=True)




In [None]:
print(dataset)
print(f"Number of samples: {len(dataset['train'])}")
first_row = dataset['train'][0]
print(first_row)

Dataset({
    features: ['id', 'text'],
    num_rows: 6562796
})
Number of samples: 6562796
{'id': 0, 'text': 'امشب بارون شروع کرد به باریدن...خوب هم بارید...ساعتای یه ربع به یازده به سرم زد که برم بیرون و دور پارک یه دوری بزنم و برگردم...لباس پوشیدم و گوشی و هدفون رو برداشتم و زدم بیرون...قبل این که برم بیرون ، مامان گفت چتر با خودت بردار! گفتم نمی خواد...فکر کنم بارون ناراحت میشه وقتی چتر استفاده می کنیم!مگه نه!؟ مگه ما وقتی خودمون ، یه هدیه ای به بقیه میدیم ، یه آهنگیو واسه کسی میفرستیم و اون طرف ، پسش میده یا قبولش نمی کنه ، ناراحت نمیشیم!؟ بارون هم همینه پس!\nموقع قدم زدن می خواستم آهنگ گوش بدم که دیدم صدای بارون قشنگ تره...به صدای بارون گوش دادم...صدایی که این موقع شب ، خیلی خوب آدمو پرت میکنه توی بعضی خیالات...\nیاد سه سال قبل افتادم...یادمه توی همین بهمن ماه ، برف اومد ، اون سال من کنکور داشتم و طبقه ی بالا درس می خوندم...همون شبی که بعد شیش سال ، برف بارید ، لباس پوشیدم و رفتم بیرون تا قدم بزنم...خیلی شب خوبی شد...می بینی که هنوزم که هنوزه یادمه ...حتی شاید تک تک لحظه هاشو...ی

## Preparing the Training Data

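We'll feed the text to the trainer in batches to keep memory use manageable. Note that `batch_iterator` below walks the entire training split; to train on a subset, restrict the range of indices it iterates over.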

**Concept: Data Sampling**
When dealing with very large datasets, it's often practical to use a subset for training. This reduces computational requirements while still providing enough data for effective learning.
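
One way to draw such a subset is sketched below. This is a minimal illustration, not the notebook's method: `sample_batches`, `toy_records`, and the parameter values are hypothetical names chosen for the example, and a plain list of dicts stands in for the real dataset split.

```python
import random

def sample_batches(records, sample_size, batch_size=1000, seed=0):
    """Yield lists of texts from a random subset of `records`.

    `records` is any indexable sequence of dicts with a "text" key; here it is
    a plain list standing in for a dataset split (an assumption of this sketch).
    """
    rng = random.Random(seed)
    # Never ask for more rows than exist
    sample_size = min(sample_size, len(records))
    # Sort the sampled indices so rows are read in their original order
    indices = sorted(rng.sample(range(len(records)), sample_size))
    for start in range(0, sample_size, batch_size):
        yield [records[i]["text"] for i in indices[start:start + batch_size]]

# Stand-in data: 10 tiny "documents"
toy_records = [{"text": f"doc {i}"} for i in range(10)]
batches = list(sample_batches(toy_records, sample_size=5, batch_size=2))
print(batches)
```

With a Hugging Face `Dataset`, a similar effect should be achievable with `shuffle(seed=...)` followed by `select(range(n))` on the split, which avoids building the index list by hand.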

In [None]:
# Function to yield batches of text
def batch_iterator(batch_size=1000):
    for i in range(0, len(dataset['train']), batch_size):
        yield dataset['train'][i:i+batch_size]['text']

## Setting Up the Tokenizer

We'll use a Byte-Pair Encoding (BPE) tokenizer, with the unknown token represented as [UNK].
The pre-tokenizer is set to Whitespace(), which splits text into runs of word characters and runs of punctuation. Words are therefore separated at whitespace, punctuation marks become their own pieces, and BPE merges never cross these boundaries.
The BpeTrainer is then configured with the special tokens [UNK], [CLS], [SEP], [PAD], and [MASK], which are commonly used by transformer models.
The vocab_size parameter is set to 30,000, the maximum number of tokens in the final vocabulary; it can be adjusted to the needs of the application.
The min_frequency parameter is set to 10, meaning that a pair of symbols must occur at least 10 times in the training data before it is merged into a new token. This filters out rare tokens and can keep the vocabulary below vocab_size.
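
To make the roles of `vocab_size` and `min_frequency` concrete, here is a toy BPE training loop in pure Python. It is a simplified sketch, not the `tokenizers` implementation: `train_toy_bpe` is a hypothetical helper, special tokens are ignored, ties between equally frequent pairs are broken arbitrarily, and Latin-script words are used only to keep the example easy to read.

```python
from collections import Counter

def train_toy_bpe(words, vocab_size, min_frequency):
    """Learn BPE merges from a list of words (toy version, no special tokens)."""
    # Each distinct word is a tuple of symbols, weighted by how often it occurs
    corpus = Counter(tuple(w) for w in words)
    vocab = {ch for word in corpus for ch in word}
    merges = []
    while len(vocab) < vocab_size:
        # Count adjacent symbol pairs across the corpus
        pairs = Counter()
        for word, count in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself (arbitrary choice)
        (a, b), freq = max(pairs.items(), key=lambda kv: (kv[1], kv[0]))
        if freq < min_frequency:
            break  # no remaining pair is frequent enough to merge
        merges.append((a, b))
        vocab.add(a + b)
        # Rewrite every word, replacing the pair with the merged symbol
        new_corpus = Counter()
        for word, count in corpus.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and word[i] == a and word[i + 1] == b:
                    out.append(a + b)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_corpus[tuple(out)] += count
        corpus = new_corpus
    return vocab, merges

words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
vocab, merges = train_toy_bpe(words, vocab_size=30, min_frequency=3)
print(len(vocab), merges)
```

In this toy run the loop stops before reaching `vocab_size`, because the remaining pairs occur fewer than `min_frequency` times. Lowering `vocab_size` instead makes the size limit the stopping condition.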



In [None]:
# Initialize a tokenizer with BPE model
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

# Set up the custom pre-tokenizer
tokenizer.pre_tokenizer = Whitespace()

# Set up trainer
trainer = BpeTrainer(
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    vocab_size=30000,  # Adjust based on your needs
    min_frequency=10,
    show_progress=True
)

## Training the Tokenizer

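Now we'll train our tokenizer on the Persian dataset. Training in the `tokenizers` library runs on the CPU, so the GPU does not speed up this step. The `length` argument only tells the trainer how many texts to expect so that it can show meaningful progress.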



In [None]:
# Train the tokenizer
tokenizer.train_from_iterator(batch_iterator(), trainer=trainer, length=len(dataset['train']))

print("Vocabulary size:", tokenizer.get_vocab_size())
print("Sample of vocabulary:", list(tokenizer.get_vocab().items())[:10])

## Post-Processing

Let's add post-processing steps to handle special tokens correctly.

**Concept: Special Tokens**
Special tokens like [CLS] (Classification) and [SEP] (Separator) are used in many transformer models to denote the start and end of sequences or to separate segments in tasks like sentence pair classification.
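
The effect of the two templates used in the next cell can be sketched in plain Python. `apply_template` is a hypothetical helper written for illustration only; it mimics what `"[CLS] $A [SEP]"` and `"[CLS] $A [SEP] $B:1 [SEP]:1"` describe, where the `:1` suffix assigns type id 1 to the second segment.

```python
def apply_template(tokens_a, tokens_b=None):
    """Mimic "[CLS] $A [SEP]" and "[CLS] $A [SEP] $B:1 [SEP]:1" on token lists.

    Returns (tokens, type_ids); type id 0 marks the first segment, 1 the second.
    """
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    type_ids = [0] * len(tokens)
    if tokens_b is not None:
        second = list(tokens_b) + ["[SEP]"]
        tokens += second
        type_ids += [1] * len(second)
    return tokens, type_ids

single = apply_template(["سلام", "دنیا"])
pair = apply_template(["سلام"], ["دنیا"])
print(single)
print(pair)
```

The type ids let a model tell the two segments of a sentence pair apart, which is why the second `[SEP]` is also tagged with `:1` in the template.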

In [None]:
# Add post-processing to handle special tokens
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)

## Testing the Tokenizer

Let's test our newly trained tokenizer on some Persian text.

**Concept: Tokenization in Persian**
Persian, an Indo-Iranian language, has some characteristics that make tokenization challenging:
1. It uses a modified Arabic script.
2. Words are separated by spaces, but many compound and inflected forms are joined with a zero-width non-joiner (ZWNJ, U+200C) instead of a space, and some are written with no separator at all.
3. It has complex morphology, with many prefixes and suffixes.
By learning subword units, our BPE tokenizer can represent frequent prefixes, suffixes, and stems as their own tokens, which should help with these challenges.
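
The spacing issue is easy to see with the zero-width non-joiner (ZWNJ, U+200C), which Persian orthography uses inside words such as می‌شود. The snippet below is a small standard-library illustration; the variable names are hypothetical and it does not involve the trained tokenizer.

```python
import unicodedata

ZWNJ = "\u200c"  # zero-width non-joiner

word = "می" + ZWNJ + "شود"          # one word, written with a ZWNJ inside
sentence = "این " + word + " خوب"

# str.split() splits on whitespace only; ZWNJ is a format character, not whitespace
pieces = sentence.split()
print(len(pieces), unicodedata.category(ZWNJ), unicodedata.name(ZWNJ))

# The two halves of the word, if a tool chooses to split on ZWNJ
halves = word.split(ZWNJ)
print(halves)
```

Because the ZWNJ is invisible and is not whitespace, whitespace-based splitting keeps such a word in one piece; whether its two halves end up as separate tokens then depends on the pre-tokenizer and on the merges learned during training.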

In [None]:
# Test sentences
persian_sentences = [
    "سلام دنیا",  # "Hello World"
    "پردازش زبان طبیعی بسیار جالب است",  # "Natural Language Processing is very interesting"
    "این یک جمله‌ی طولانی‌تر برای آزمایش توکنایزر است که شامل کلمات کم‌تر رایج هم می‌شود"  # "This is a longer sentence to test the tokenizer that also includes less common words"
]

for sentence in persian_sentences:
    output = tokenizer.encode(sentence)
    print(f"Original: {sentence}")
    print(f"Tokens: {output.tokens}")
    print(f"IDs: {output.ids}")
    print()

## Saving the Tokenizer

We can now save our trained tokenizer for future use.

**Concept: Serialization**
Saving the tokenizer allows us to reuse it without retraining, ensuring consistent tokenization across different runs or even different projects.

In [None]:
tokenizer.save("persian_bpe_tokenizer.json")

print("Tokenizer saved successfully!")

## Pushing a tokenizer to Hugging Face
Pushing the tokenizer to the Hugging Face Hub lets you share it with the broader community, so that others can use it in their own natural language processing work. The `push_to_hub` method comes from the Hugging Face `transformers` library, so a tokenizer trained with `tokenizers` is first wrapped in a `transformers` class such as `PreTrainedTokenizerFast`; it can then be uploaded to your Hub repository without a separate manual save step.
This makes collaboration easier and keeps your tokenizer readily accessible to anyone who wants to train or fine-tune models with it.
All it takes is a login and a few lines of code.

In [None]:
from transformers import PreTrainedTokenizerFast

notebook_login()

# tokenizers.Tokenizer has no push_to_hub; wrap it in a transformers fast tokenizer first
hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer, unk_token="[UNK]", cls_token="[CLS]",
    sep_token="[SEP]", pad_token="[PAD]", mask_token="[MASK]",
)
model_id = 'mshojaei77/persian-bpe-tokenizer'
hf_tokenizer.push_to_hub(model_id)
print('Tokenizer pushed to Hugging Face successfully!')

## Conclusion

We've successfully created and trained a custom Persian tokenizer using a large Persian text corpus and Hugging Face's tokenizers library. This tokenizer is now ready to be used in various NLP tasks involving Persian text.

Key Takeaways:
1. We used a large-scale Persian dataset so that our tokenizer learns from a diverse range of text.
2. We implemented Byte-Pair Encoding, which is effective for subword tokenization in morphologically rich languages like Persian.
3. We checked the Colab runtime's GPU, although tokenizer training itself runs on the CPU.
4. We added special tokens and post-processing steps to make our tokenizer compatible with transformer models.
5. We tested the tokenizer on various Persian sentences to verify its effectiveness.

This custom tokenizer can now be used in conjunction with Persian language models or for preprocessing Persian text in various NLP tasks such as text classification, named entity recognition, or machine translation.