
# Exercise: Working with Hugging Face Tokenizers

### Part 1: Basic Tokenization with Hugging Face Tokenizer


1. Install the Hugging Face `transformers` library. In a notebook cell, prefix the command with `!` so that it runs as a shell command:
   ```bash
   pip install transformers
   ```

2. Choose a pre-trained tokenizer from Hugging Face’s model hub (e.g., `bert-base-uncased` or `gpt2`) and tokenize a piece of text:
   
   **Task**: Load the tokenizer and tokenize the sentence: `"T5 is the greatest data science boot-camp!"`

   Below is a code block where you can perform this task:
    

In [None]:
from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize the sentence
sentence = "T5 is the greatest data science boot-camp!"
tokens = tokenizer.tokenize(sentence)

# Print the tokens
print(tokens)


['t', '##5', 'is', 'the', 'greatest', 'data', 'science', 'boot', '-', 'camp', '!']
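
The `##` prefix in the output above marks a piece that continues the previous word rather than starting a new one. The following is a minimal sketch of greedy longest-match-first subword splitting, the general idea behind WordPiece-style tokenizers. It is not the library's implementation, and `wordpiece_tokenize` and `toy_vocab` are hypothetical names with a made-up vocabulary chosen only for illustration.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one lowercased word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that do not start the word carry the "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word becomes unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary (an assumption, not BERT's real vocabulary).
toy_vocab = {"t", "##5", "boot", "camp", "great", "##est"}

print(wordpiece_tokenize("t5", toy_vocab))        # ['t', '##5']
print(wordpiece_tokenize("greatest", toy_vocab))  # ['great', '##est']
print(wordpiece_tokenize("xyz", toy_vocab))       # ['[UNK]']
```

With this toy vocabulary `greatest` is split in two, whereas the real output above keeps it whole, which suggests that `bert-base-uncased` stores `greatest` as a single vocabulary entry.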





### Part 2: Encoding and Decoding

3. Use the same tokenizer to encode the sentence (convert to token IDs) and then decode it back to text.

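   **Task**: Encode the sentence with `tokenizer.encode`, print the resulting IDs, and decode them with `tokenizer.decode`. With this tokenizer, `encode` also adds the special tokens `[CLS]` and `[SEP]`, so they appear in the decoded text.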

   Below is a code block where you can perform this task:
    

In [None]:
# Encode the sentence
input_ids = tokenizer.encode(sentence)

# Print the encoded IDs
print(input_ids)

# Decode the encoded IDs back to text
decoded_sentence = tokenizer.decode(input_ids)

# Print the decoded sentence
print(decoded_sentence)


[101, 1056, 2629, 2003, 1996, 4602, 2951, 2671, 9573, 1011, 3409, 999, 102]
[CLS] t5 is the greatest data science boot - camp! [SEP]
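
To see how the ID list lines up with the tokens from Part 1, the sketch below pairs the two printed outputs by position and rebuilds the text by hand. It uses only the values shown in this exercise. The helper `merge_wordpieces` and the variable names are hypothetical, and this is a simplified illustration rather than what `tokenizer.decode` does internally.

```python
# Values copied from the outputs printed in Part 1 and Part 2.
observed_tokens = ['t', '##5', 'is', 'the', 'greatest', 'data', 'science',
                   'boot', '-', 'camp', '!']
observed_ids = [101, 1056, 2629, 2003, 1996, 4602, 2951, 2671, 9573, 1011,
                3409, 999, 102]

# The decoded text shows [CLS] first and [SEP] last, so 101 and 102 are their IDs.
CLS_ID, SEP_ID = 101, 102

# Drop the special tokens; the remaining IDs match the tokens one to one.
content_ids = [i for i in observed_ids if i not in (CLS_ID, SEP_ID)]
id_to_token = dict(zip(content_ids, observed_tokens))
id_to_token.update({CLS_ID: "[CLS]", SEP_ID: "[SEP]"})


def merge_wordpieces(pieces):
    """Glue '##' continuation pieces onto the previous word and join with spaces."""
    words = []
    for piece in pieces:
        if piece.startswith("##") and words:
            words[-1] += piece[2:]
        else:
            words.append(piece)
    return " ".join(words)


all_tokens = [id_to_token[i] for i in observed_ids]
print(all_tokens)
print(merge_wordpieces(all_tokens))
```

The sketch prints `camp !` with a space, while the library's output above shows `camp!`, so the real `decode` evidently applies extra spacing cleanup that is omitted here. To avoid the special tokens with the real tokenizer, `encode` accepts `add_special_tokens=False` and `decode` accepts `skip_special_tokens=True`.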



### Bonus Challenge

4. **Custom Tokenizer**: Use Hugging Face’s `tokenizers` library to train a custom tokenizer on a dataset.
   
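   You are provided with a dataset containing multiple sentences: the `dataset` list defined in the last cell of this exercise. Train a custom tokenizer using these sentences.

   Below is a code block that trains a BPE tokenizer on a small sample list, `sentences`. To complete the challenge, run the `dataset` cell first and pass `dataset` to `train_from_iterator` in place of `sentences`.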
    

In [None]:
!pip install tokenizers

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Sample dataset
sentences = [
    "This is a sample sentence.",
    "Another sentence for training.",
    "Let's train a custom tokenizer!"
]

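# Initialize a BPE tokenizer; pieces outside the vocabulary map to [UNK].
# Note: this rebinds `tokenizer`, replacing the BERT tokenizer from Part 1.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
# Split the text into words and punctuation before BPE merges are learned
tokenizer.pre_tokenizer = Whitespace()

# Train the tokenizer, reserving the special tokens in the vocabulary
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train_from_iterator(sentences, trainer)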

# Save the tokenizer
tokenizer.save("custom_tokenizer.json")
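
For intuition about what BPE training learns, here is a minimal pure-Python sketch of the core loop: count adjacent symbol pairs, then merge the most frequent pair everywhere. It is a simplified illustration, not the `tokenizers` implementation. The toy corpus and the helper names `count_pairs` and `merge_pair` are assumptions made for this example.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


# Hypothetical toy corpus; each word starts as a tuple of single characters.
corpus = ["low", "low", "lower", "lowest"]
words = dict(Counter(tuple(word) for word in corpus))

for step in range(2):
    pairs = count_pairs(words)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    words = merge_pair(words, best)
    print(f"merge {step + 1}: {best}")

print(words)
```

After two merges the characters `l`, `o`, `w` have been fused into the single symbol `low`, which is how frequent character sequences end up as vocabulary entries.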




In [None]:

# Provided dataset for the bonus challenge
dataset = [
    "Transformers are amazing for NLP tasks.",
    "Tokenization is essential for language models.",
    "Byte Pair Encoding is a great subword tokenization algorithm.",
    "Hugging Face makes it easy to work with pre-trained models.",
    "Data science is the key to unlocking insights from data."
]
print(dataset)


['Transformers are amazing for NLP tasks.', 'Tokenization is essential for language models.', 'Byte Pair Encoding is a great subword tokenization algorithm.', 'Hugging Face makes it easy to work with pre-trained models.', 'Data science is the key to unlocking insights from data.']
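
One possible way to complete the bonus challenge is sketched below: the same training steps as in the sample cell, applied to the provided `dataset`, followed by encoding a test sentence. The variable name `custom_tokenizer` and the test sentence are assumptions for this example. It assumes the `tokenizers` package is installed.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# The dataset provided for the bonus challenge.
dataset = [
    "Transformers are amazing for NLP tasks.",
    "Tokenization is essential for language models.",
    "Byte Pair Encoding is a great subword tokenization algorithm.",
    "Hugging Face makes it easy to work with pre-trained models.",
    "Data science is the key to unlocking insights from data.",
]

# Same setup as the sample cell, under a separate name so that the
# BERT tokenizer from Part 1 is not overwritten.
custom_tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
custom_tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
custom_tokenizer.train_from_iterator(dataset, trainer)

# Encode a hypothetical test sentence built from words seen during training.
encoding = custom_tokenizer.encode("Tokenization is essential for data science.")
print(encoding.tokens)
print(encoding.ids)
print("Vocabulary size:", custom_tokenizer.get_vocab_size())
```

Because the corpus is tiny, the learned vocabulary is small and specific to these five sentences. Text containing characters that never appeared in training would be mapped to `[UNK]`.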
