# Popular tokenizers in NLP: Hugging Face Tokenizer
The Hugging Face Transformers library provides tokenizers tailored to different pre-trained models, such as BERT, GPT-2, and T5. Each tokenizer reproduces the vocabulary and preprocessing that its model was trained with. Let's discuss how to use these tokenizers:


1. BERT (WordPiece):
- `tokenizer.tokenize(text)`: Splits the text into subword tokens. A `##` prefix marks a token that continues the preceding token within the same word instead of starting a new word.
- `tokenizer.convert_tokens_to_ids(tokens)`: Converts tokens to their corresponding integer IDs in the vocabulary.
- `tokenizer(text, return_tensors='pt')`: Encodes the input text into PyTorch tensors, including the input IDs, token type IDs, and attention mask. It also adds the special `[CLS]` and `[SEP]` tokens (IDs 101 and 102) around the sequence.

In [1]:
from transformers import BertTokenizer

# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Text to be tokenized
text = "Tokenizers are great!"

# Tokenize the text
tokens = tokenizer.tokenize(text)
print(tokens)
# Output: ['token', '##izer', '##s', 'are', 'great', '!']

# Convert tokens to token IDs
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(token_ids)
# Output: [19204, 17629, 2015, 2024, 2307, 999]

# Use tokenizer directly to encode text
encoded_input = tokenizer(text, return_tensors='pt')
print(encoded_input)
# Output: {'input_ids': tensor([[  101, 19204, 17629,  2015,  2024,  2307,   999,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}



['token', '##izer', '##s', 'are', 'great', '!']
[19204, 17629, 2015, 2024, 2307, 999]
{'input_ids': tensor([[  101, 19204, 17629,  2015,  2024,  2307,   999,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}
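
To make the `##` convention concrete, here is a minimal sketch of greedy longest-match-first splitting, the general idea behind WordPiece tokenization. It is not the library's implementation: the function name `wordpiece_tokenize` and the tiny `toy_vocab` are hypothetical, and the real BERT tokenizer also lowercases text, splits punctuation, and uses a vocabulary of roughly 30,000 entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word (illustrative only)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of the same word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, not the real bert-base-uncased vocabulary.
toy_vocab = {"token", "##izer", "##s", "are", "great", "!"}

tokens = wordpiece_tokenize("tokenizers", toy_vocab)
print(tokens)

unknown = wordpiece_tokenize("xyz", toy_vocab)
print(unknown)
```

With this toy vocabulary, "tokenizers" splits into the same three pieces as in the BERT output above, while a word with no matching pieces falls back to the unknown token.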


2. Byte-Pair Encoding (BPE) Tokenizer:

BPE is a popular subword tokenization technique used by models like GPT-2 and RoBERTa, both of which apply it at the byte level. It involves:

- Starting from individual characters (or bytes) and iteratively merging the most frequent pair of adjacent symbols until the vocabulary reaches a target size.
- Handling rare and unseen words by breaking them into known subwords.
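
The merge procedure described above can be sketched in a few lines of plain Python. This is a simplified, character-level illustration under stated assumptions, not GPT-2's byte-level implementation: the function name `learn_bpe_merges` and the word frequencies are hypothetical, and ties between equally frequent pairs are broken by first occurrence.

```python
from collections import Counter


def learn_bpe_merges(word_counts, num_merges):
    """Learn BPE merges from a {word: frequency} dictionary (toy version)."""
    # Start with each word as a tuple of single characters.
    corpus = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Replace each occurrence of the best pair with one merged symbol.
        new_corpus = {}
        for symbols, count in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + count
        corpus = new_corpus
    return merges, corpus


# Hypothetical word frequencies, chosen only for illustration.
word_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, corpus = learn_bpe_merges(word_counts, num_merges=3)
print(merges)
print(corpus)
```

Each merge adds one new symbol to the vocabulary, so the number of merges controls the final vocabulary size. Frequent character sequences such as "est" become single tokens, while rarer words stay split into smaller pieces.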

In [2]:
from transformers import GPT2Tokenizer

# Load the GPT-2 tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

text = "Machine learning is fascinating!"

# Tokenize and encode the text
encoded_input = tokenizer(text)
print(encoded_input)
# Output: {'input_ids': [37573, 4673, 318, 13899, 0], 'attention_mask': [1, 1, 1, 1, 1]}


{'input_ids': [37573, 4673, 318, 13899, 0], 'attention_mask': [1, 1, 1, 1, 1]}


**Special Considerations for Tokenizers**

- **Handling Unknown Words:** Some tokenizers map words missing from the vocabulary to an unknown token (`[UNK]` in BERT). Subword tokenizers like BPE and WordPiece mitigate this by breaking such words into smaller components that exist in the vocabulary. Byte-level BPE, as used by GPT-2, can represent any input string this way.
- **Padding and Batch Tokenization:** When preparing a batch of inputs, tokenizers can pad the shorter sequences so that all sequences have the same length, which is required to stack them into a single tensor. The attention mask marks padding positions with 0 so the model can ignore them.

In [7]:
# Batch encoding with padding
# GPT-2 defines no padding token, so reuse the end-of-sequence token (ID 50256)
tokenizer.pad_token = tokenizer.eos_token

texts = ["Hello world!", "Transformers are great!"]
encoded_inputs = tokenizer(texts, padding=True, return_tensors='pt', truncation=True, max_length=10)
print(encoded_inputs)
# Output: {'input_ids': tensor([[15496,   995,     0, 50256, 50256],
#                               [41762,   364,   389,  1049,     0]]),
#          'attention_mask': tensor([[1, 1, 1, 0, 0],
#                                    [1, 1, 1, 1, 1]])}


{'input_ids': tensor([[15496,   995,     0, 50256, 50256],
        [41762,   364,   389,  1049,     0]]), 'attention_mask': tensor([[1, 1, 1, 0, 0],
        [1, 1, 1, 1, 1]])}
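
To show what `padding=True` and `truncation=True` do to the batch, here is a minimal pure-Python sketch of right-padding with an attention mask. The helper `pad_batch` is hypothetical and stands in for logic the library handles internally. The token IDs are copied from the GPT-2 output above, where 50256 is the end-of-sequence ID reused for padding.

```python
def pad_batch(sequences, pad_id, max_length=None):
    """Right-pad lists of token IDs and build matching attention masks."""
    if max_length is not None:
        # Truncation: drop any tokens beyond max_length.
        sequences = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token IDs for "Hello world!" and "Transformers are great!" from the output above.
batch = pad_batch(
    [[15496, 995, 0], [41762, 364, 389, 1049, 0]],
    pad_id=50256,
    max_length=10,
)
print(batch)
```

The sequences are padded only up to the longest one in the batch, not up to `max_length`, which matches the five-column tensors in the output above.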


In [11]:
# Print the first ten tokens in the tokenizer's vocabulary with their IDs
vocab = tokenizer.get_vocab()
print(list(vocab.items())[:10])

[('!', 0), ('"', 1), ('#', 2), ('$', 3), ('%', 4), ('&', 5), ("'", 6), ('(', 7), (')', 8), ('*', 9)]
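
Since `get_vocab()` returns a mapping from token to ID, looking up a token by its ID requires the inverse mapping. The sketch below builds one from the ten pairs printed above. The names `vocab_sample` and `id_to_token` are illustrative, and in practice the tokenizer's own `convert_ids_to_tokens` method does this lookup over the full vocabulary.

```python
# First ten (token, id) pairs copied from the printed output above.
vocab_sample = dict([
    ('!', 0), ('"', 1), ('#', 2), ('$', 3), ('%', 4),
    ('&', 5), ("'", 6), ('(', 7), (')', 8), ('*', 9),
])

# Invert the mapping so that IDs can be looked up directly.
id_to_token = {token_id: token for token, token_id in vocab_sample.items()}

ids = [0, 2, 9]
decoded = [id_to_token[i] for i in ids]
print(decoded)
```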
