-------------------------
Tokenization: Demonstrates how a piece of text is split into individual tokens according to the rules of a pre-trained Transformer model's tokenizer (here, the BERT tokenizer). This shows how text is processed before it is fed into the model.

Token IDs Conversion: Shows how each token is mapped to its numerical ID in the tokenizer's vocabulary. This step matters because the model computes on numerical IDs, not on raw text.

Encoding: Illustrates how the whole text is converted in one call into inputs the model can use directly: tensors of token IDs (with the special tokens added), together with the token type IDs and the attention mask.

Decoding: Demonstrates how numerical token IDs are converted back into human-readable text, which helps when interpreting the model's outputs.

-----------------------------

In [1]:
from transformers import BertTokenizer

In [2]:
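# Load the pre-trained tokenizer for bert-base-uncased.
# cache_dir is optional; it sets the local folder where the downloaded files are stored.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased',
                                          cache_dir=r'D:\AI-DATASETS\07-Hugging-Face-Data')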

In [3]:
# Sample text to tokenize
text = "Hello, how are you doing today, unforgettable, undesirable, Chat-Masala, dosa?"

In [4]:
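# Split the text into tokens; the uncased model lowercases the input,
# and a '##' prefix marks a piece that continues the previous token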
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)

Tokens: ['hello', ',', 'how', 'are', 'you', 'doing', 'today', ',', 'un', '##for', '##get', '##table', ',', 'und', '##es', '##ira', '##ble', ',', 'chat', '-', 'mas', '##ala', ',', 'dos', '##a', '?']
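
The '##' prefix in the output above marks a piece that continues the previous token, so rare words such as "unforgettable" are stored as several vocabulary entries. As a minimal sketch of how the pieces relate to the original words, the snippet below glues continuation pieces back together. The helper name merge_wordpieces is hypothetical (it is not part of the transformers library), and the token list is copied from the output above.

```python
# Tokens copied from the notebook output above.
tokens = ['hello', ',', 'how', 'are', 'you', 'doing', 'today', ',',
          'un', '##for', '##get', '##table', ',',
          'und', '##es', '##ira', '##ble', ',',
          'chat', '-', 'mas', '##ala', ',', 'dos', '##a', '?']


def merge_wordpieces(tokens):
    """Hypothetical helper: append '##' continuation pieces to the previous token."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # drop the '##' marker and extend the last word
        else:
            words.append(tok)
    return words


words = merge_wordpieces(tokens)
print(words)
# ['hello', ',', 'how', 'are', 'you', 'doing', 'today', ',', 'unforgettable', ',',
#  'undesirable', ',', 'chat', '-', 'masala', ',', 'dosa', '?']
```

Punctuation stays as separate tokens, and the original capitalization is not recoverable because the uncased model lowercases the text before splitting it.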


In [5]:
# Map each token to its integer ID in the tokenizer's vocabulary
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print("Token IDs:", token_ids)

Token IDs: [7592, 1010, 2129, 2024, 2017, 2725, 2651, 1010, 4895, 29278, 18150, 10880, 1010, 6151, 2229, 7895, 3468, 1010, 11834, 1011, 16137, 7911, 1010, 9998, 2050, 1029]
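
Conceptually, this conversion is a dictionary lookup from token string to integer ID. The sketch below illustrates the idea with a tiny hand-built vocabulary. The name convert_tokens_to_ids_sketch and the toy_vocab dictionary are hypothetical; the IDs for the ordinary tokens are copied from the output above, and the ID 100 for '[UNK]' is an assumption about the bert-base-uncased vocabulary that is not shown in this notebook.

```python
# Toy vocabulary: a small excerpt, not the full BERT vocabulary.
# IDs for ordinary tokens are copied from the notebook output above.
# The '[UNK]' ID of 100 is an assumption about bert-base-uncased.
toy_vocab = {
    "[UNK]": 100,
    "hello": 7592,
    ",": 1010,
    "how": 2129,
    "are": 2024,
    "you": 2017,
}


def convert_tokens_to_ids_sketch(tokens, vocab, unk_token="[UNK]"):
    """Hypothetical helper: look up each token, falling back to the unknown-token ID."""
    unk_id = vocab[unk_token]
    return [vocab.get(tok, unk_id) for tok in tokens]


ids = convert_tokens_to_ids_sketch(["hello", ",", "how", "are", "you", "dosa"], toy_vocab)
print(ids)  # [7592, 1010, 2129, 2024, 2017, 100]
```

In the notebook itself "dosa" never reaches the lookup as a single string, because the tokenization step has already split it into 'dos' and '##a', both of which are in the vocabulary.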


In [6]:
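# Encode the text in one call: tokenize, add the special tokens [CLS] and [SEP],
# and return PyTorch tensors (input_ids, token_type_ids, attention_mask)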
encodings = tokenizer(text, return_tensors='pt')
print("Encodings:", encodings)

Encodings: {'input_ids': tensor([[  101,  7592,  1010,  2129,  2024,  2017,  2725,  2651,  1010,  4895,
         29278, 18150, 10880,  1010,  6151,  2229,  7895,  3468,  1010, 11834,
          1011, 16137,  7911,  1010,  9998,  2050,  1029,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1]])}
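
Comparing this output with the token IDs printed earlier shows what the one-call encoding adds: ID 101 ([CLS]) at the start and ID 102 ([SEP]) at the end, plus two arrays of the same length. The following sketch rebuilds the three fields with plain Python lists for this single, unpadded sentence. It is an illustration of the relationship, not how the library builds them internally; the constant names CLS_ID and SEP_ID are chosen here for readability.

```python
# Token IDs copied from the convert_tokens_to_ids output earlier in the notebook.
token_ids = [7592, 1010, 2129, 2024, 2017, 2725, 2651, 1010, 4895, 29278,
             18150, 10880, 1010, 6151, 2229, 7895, 3468, 1010, 11834, 1011,
             16137, 7911, 1010, 9998, 2050, 1029]

# Special-token IDs as they appear in the encodings output above.
CLS_ID, SEP_ID = 101, 102

# Wrap the sentence in [CLS] ... [SEP].
input_ids = [CLS_ID] + token_ids + [SEP_ID]

# One sentence, no padding: every position is a real token in segment 0.
attention_mask = [1] * len(input_ids)
token_type_ids = [0] * len(input_ids)

print(len(token_ids), len(input_ids))  # 26 28
print(input_ids[:3], input_ids[-3:])   # [101, 7592, 1010] [2050, 1029, 102]
```

The attention mask would contain zeros only at padded positions, and the token type IDs would contain ones only for the second sentence of a sentence pair; neither case occurs here.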


In [7]:
# Decode the first (and only) sequence of token IDs back to text;
# special tokens such as [CLS] and [SEP] are kept by default
decoded_text = tokenizer.decode(encodings['input_ids'][0])
print("Decoded Text:", decoded_text)

Decoded Text: [CLS] hello, how are you doing today, unforgettable, undesirable, chat - masala, dosa? [SEP]
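
The decoded text is not identical to the input: it is lowercased, the hyphen in "Chat-Masala" comes back as " - ", and the special tokens are included. A rough sketch of why is given below, assuming decoding amounts to joining tokens with spaces, removing the '##' markers, and tidying the space before common punctuation. The function decode_sketch is hypothetical and only approximates the library's behavior for this example.

```python
# Tokens from the tokenization step, wrapped in the special tokens seen in the encoding.
tokens = ['[CLS]', 'hello', ',', 'how', 'are', 'you', 'doing', 'today', ',',
          'un', '##for', '##get', '##table', ',',
          'und', '##es', '##ira', '##ble', ',',
          'chat', '-', 'mas', '##ala', ',', 'dos', '##a', '?', '[SEP]']

SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def decode_sketch(tokens, skip_special_tokens=False):
    """Hypothetical approximation of decoding for WordPiece tokens."""
    if skip_special_tokens:
        tokens = [t for t in tokens if t not in SPECIAL_TOKENS]
    # Join with spaces, then merge continuation pieces into the previous token.
    text = " ".join(tokens).replace(" ##", "")
    # Remove the space before common punctuation; hyphens are left untouched.
    for mark in (",", "?", ".", "!"):
        text = text.replace(" " + mark, mark)
    return text


print(decode_sketch(tokens))
# [CLS] hello, how are you doing today, unforgettable, undesirable, chat - masala, dosa? [SEP]
print(decode_sketch(tokens, skip_special_tokens=True))
# hello, how are you doing today, unforgettable, undesirable, chat - masala, dosa?
```

With the real tokenizer, passing skip_special_tokens=True to tokenizer.decode drops [CLS] and [SEP] from the result in the same way.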
