
**Pre-trained Tokenizer**


In [1]:
from transformers import AutoTokenizer

In [2]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") # Load a pre-trained tokenizer for a specific model



In [3]:
sentence = "Quantum physics reveals the strange behavior of particles at tiny scales, which challenges our understanding of reality."

In [4]:
tokens = tokenizer.tokenize(sentence)  # Tokenization: split the text into tokens (words or subwords)
print(tokens)

['Quantum', 'physics', 'reveals', 'the', 'strange', 'behavior', 'of', 'particles', 'at', 'tiny', 'scales', ',', 'which', 'challenges', 'our', 'understanding', 'of', 'reality', '.']
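
In the output above every word happens to be in the vocabulary, so nothing is split. BERT's tokenizer uses WordPiece, which breaks out-of-vocabulary words into subword pieces by greedy longest-match-first lookup. The block below is a minimal sketch of that idea, not the library's implementation; `TOY_VOCAB` and the `wordpiece` helper are hypothetical and far smaller than the real vocabulary.

```python
# Hypothetical toy vocabulary, only for illustration.
TOY_VOCAB = {"quantum", "quant", "##um", "physics", "token", "##ization", "##s", "[UNK]"}


def wordpiece(word, vocab=TOY_VOCAB, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a '##' prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


print(wordpiece("quantum"))        # ['quantum']
print(wordpiece("tokenizations"))  # ['token', '##ization', '##s']
print(wordpiece("xyz"))            # ['[UNK]']
```

The `##` prefix marks a piece that continues the previous one, which is how the pieces can later be glued back into the original word.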


In [5]:
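# Calling the tokenizer directly tokenizes the text, adds the special tokens,
# and maps each token to its integer ID in the model vocabulary.
# The result is dict-like, with input_ids, token_type_ids, and attention_mask.
encoding = tokenizer(sentence)
print(encoding)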

{'input_ids': [101, 25231, 7094, 7189, 1103, 4020, 4658, 1104, 9150, 1120, 4296, 9777, 117, 1134, 7806, 1412, 4287, 1104, 3958, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [6]:
token_ids = tokenizer.convert_tokens_to_ids(tokens)  # Map the tokens to IDs; no special tokens are added here
print(token_ids)

[25231, 7094, 7189, 1103, 4020, 4658, 1104, 9150, 1120, 4296, 9777, 117, 1134, 7806, 1412, 4287, 1104, 3958, 119]
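
Comparing the two ID lists in this notebook, the full encoding is the plain token IDs wrapped in `101` and `102`, plus two lists of the same length. The sketch below rebuilds that structure by hand for a single sentence. It is an illustration only: `build_single_sequence` is a hypothetical helper, and the IDs are copied from the outputs in this notebook.

```python
CLS_ID, SEP_ID = 101, 102  # IDs that decode to [CLS] and [SEP] for bert-base-cased

# IDs of the 19 plain tokens, copied from the convert_tokens_to_ids output.
plain_ids = [25231, 7094, 7189, 1103, 4020, 4658, 1104, 9150, 1120, 4296,
             9777, 117, 1134, 7806, 1412, 4287, 1104, 3958, 119]


def build_single_sequence(ids):
    """Wrap one sentence's IDs in [CLS] ... [SEP] and add the companion lists."""
    input_ids = [CLS_ID] + ids + [SEP_ID]
    return {
        "input_ids": input_ids,
        "token_type_ids": [0] * len(input_ids),  # a single sentence is all segment 0
        "attention_mask": [1] * len(input_ids),  # no padding, so every position is real
    }


encoding_sketch = build_single_sequence(plain_ids)
print(encoding_sketch["input_ids"][0], encoding_sketch["input_ids"][-1])  # 101 102
print(len(encoding_sketch["input_ids"]))  # 21
```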


In [7]:
decoded = tokenizer.decode(token_ids)  # Decoding: convert token IDs back to readable text
print(decoded)

Quantum physics reveals the strange behavior of particles at tiny scales, which challenges our understanding of reality.


In [8]:
tokenizer.decode(101)  # Special token [CLS]: added at the beginning of every sequence (used for sequence-level tasks such as classification)

'[CLS]'

In [9]:
tokenizer.decode(102)  # Special token [SEP]: added at the end of a sequence (and between the two sentences in pair tasks)

'[SEP]'

In [10]:
# Pad the sequence to a fixed length of 25 and return PyTorch tensors.
# The attention mask marks real tokens with 1 and padding with 0, so the model can ignore the padding.
inputs = tokenizer(sentence, padding="max_length", max_length=25, return_tensors="pt")
print(inputs["input_ids"])
print(inputs["attention_mask"])

tensor([[  101, 25231,  7094,  7189,  1103,  4020,  4658,  1104,  9150,  1120,
          4296,  9777,   117,  1134,  7806,  1412,  4287,  1104,  3958,   119,
           102,     0,     0,     0,     0]])
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
         0]])
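
The padded output above can be reproduced by hand: append the pad ID until the sequence reaches the target length, and build a mask with 1 for real tokens and 0 for padding. This is a minimal sketch in plain Python lists; `pad_to_length` is a hypothetical helper, and a pad ID of 0 is assumed because that is the value the tensor above is padded with.

```python
def pad_to_length(input_ids, max_length, pad_id=0):
    """Right-pad a list of IDs and return (padded_ids, attention_mask)."""
    # This sketch does no truncation: longer sequences come back unchanged.
    n_pad = max(0, max_length - len(input_ids))
    padded = input_ids + [pad_id] * n_pad
    mask = [1] * len(input_ids) + [0] * n_pad
    return padded, mask


# The 21 IDs (including 101 and 102) from the encoding earlier in this notebook.
ids = [101, 25231, 7094, 7189, 1103, 4020, 4658, 1104, 9150, 1120, 4296,
       9777, 117, 1134, 7806, 1412, 4287, 1104, 3958, 119, 102]

padded_ids, attention_mask = pad_to_length(ids, max_length=25)
print(padded_ids[-5:])      # [102, 0, 0, 0, 0]
print(attention_mask[-5:])  # [1, 0, 0, 0, 0]
```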


In [11]:
# Tokenize the same sentence with a different model's tokenizer (RoBERTa)

In [12]:
tokenizer2 = AutoTokenizer.from_pretrained("roberta-base")

In [14]:
encoding2 = tokenizer2(sentence)  # Full encoding; unlike the BERT output, it contains no token_type_ids
print(encoding2)

{'input_ids': [0, 44572, 783, 17759, 7441, 5, 7782, 3650, 9, 16710, 23, 5262, 21423, 6, 61, 2019, 84, 2969, 9, 2015, 4, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [15]:
tokens2 = tokenizer2.tokenize(sentence)  # Byte-level BPE: 'Ġ' marks a token that was preceded by a space
print(tokens2)

['Quant', 'um', 'Ġphysics', 'Ġreveals', 'Ġthe', 'Ġstrange', 'Ġbehavior', 'Ġof', 'Ġparticles', 'Ġat', 'Ġtiny', 'Ġscales', ',', 'Ġwhich', 'Ġchallenges', 'Ġour', 'Ġunderstanding', 'Ġof', 'Ġreality', '.']
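
Because `Ġ` stands in for the space before a token, the RoBERTa tokens above can be turned back into the sentence by concatenating them and swapping `Ġ` for a space. The sketch below only handles that one marker, which is enough for this plain ASCII sentence; the real byte-level BPE maps every byte to a printable character, so treat this as an illustration rather than a general decoder.

```python
# Tokens copied from the RoBERTa tokenize output in this notebook.
tokens2 = ['Quant', 'um', 'Ġphysics', 'Ġreveals', 'Ġthe', 'Ġstrange', 'Ġbehavior',
           'Ġof', 'Ġparticles', 'Ġat', 'Ġtiny', 'Ġscales', ',', 'Ġwhich',
           'Ġchallenges', 'Ġour', 'Ġunderstanding', 'Ġof', 'Ġreality', '.']

# Tokens without 'Ġ' ('um', ',', '.') attach directly to the previous token.
text = "".join(tokens2).replace("Ġ", " ")
print(text)
```

Note that `Quantum` is split into `Quant` + `um` here, while the BERT tokenizer kept it as one token: each model has its own vocabulary.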


In [19]:
tokenizer2.decode(0)   # RoBERTa special token: <s>
                       # Marks the start of a sequence (similar to BERT's [CLS])

'<s>'

In [20]:
tokenizer2.decode(2) # RoBERTa special token: </s>
                     # Marks the end of a sequence or separates sentences (similar to BERT's [SEP])

'</s>'
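
The special tokens decoded in this notebook differ per model, both in text and in ID. A small lookup table makes the comparison explicit and lets us strip them from an ID list by hand. This is a sketch with hypothetical names (`SPECIAL_TOKENS`, `strip_special`); it lists only the IDs decoded above. With the library itself, `decode` accepts `skip_special_tokens=True` for the same purpose.

```python
# Special-token IDs as decoded in this notebook (not a complete list).
SPECIAL_TOKENS = {
    "bert-base-cased": {101: "[CLS]", 102: "[SEP]"},
    "roberta-base": {0: "<s>", 2: "</s>"},
}


def strip_special(ids, model_name):
    """Drop the listed special-token IDs for the given model."""
    special = SPECIAL_TOKENS[model_name]
    return [i for i in ids if i not in special]


# input_ids copied from the RoBERTa encoding in this notebook.
roberta_ids = [0, 44572, 783, 17759, 7441, 5, 7782, 3650, 9, 16710, 23, 5262,
               21423, 6, 61, 2019, 84, 2969, 9, 2015, 4, 2]

stripped = strip_special(roberta_ids, "roberta-base")
print(len(roberta_ids), len(stripped))  # 22 20
```

The 20 remaining IDs line up one-to-one with the 20 tokens printed by `tokenizer2.tokenize(sentence)`.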