<a href="https://colab.research.google.com/github/premkumarkora/Tokenizers_in_HuggingFace_Models/blob/main/Tokenizers_HuggingFace_Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**AutoTokenizer** is a generic class that automatically loads the correct tokenizer for any pretrained Hugging Face transformer model.

It handles splitting text into subword tokens, adding special tokens (e.g., [CLS], [SEP]), and creating attention masks.
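To make the special tokens and the attention mask concrete, here is a minimal pure-Python sketch. It is a toy, not the Hugging Face implementation: `TOY_VOCAB` and `encode_with_specials` are hypothetical names invented for illustration, and the vocabulary has only eight entries.

```python
# Toy illustration only: a hypothetical 8-entry vocabulary, not a real model's.
TOY_VOCAB = {
    "[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "[UNK]": 3,
    "hello": 4, "world": 5, "token": 6, "##izers": 7,
}


def encode_with_specials(pieces, max_length):
    """Wrap subword pieces in [CLS]/[SEP], pad to max_length, build the mask."""
    ids = [TOY_VOCAB["[CLS]"]]
    # Unknown pieces fall back to the [UNK] ID.
    ids += [TOY_VOCAB.get(piece, TOY_VOCAB["[UNK]"]) for piece in pieces]
    ids.append(TOY_VOCAB["[SEP]"])
    if len(ids) > max_length:
        raise ValueError("sequence is longer than max_length")

    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [1] * len(ids)
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [TOY_VOCAB["[PAD]"]] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,
    }


encoded = encode_with_specials(["hello", "world"], max_length=6)
print(encoded["input_ids"])       # [1, 4, 5, 2, 0, 0]
print(encoded["attention_mask"])  # [1, 1, 1, 1, 0, 0]
```

A real tokenizer does the same bookkeeping for you, with the model's own special tokens and padding rules.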

Depending on the model, it uses different algorithms like WordPiece, Byte-Pair Encoding (BPE), or SentencePiece under the hood.
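As a rough illustration of what a subword algorithm does, the sketch below implements greedy longest-match-first splitting in the style of WordPiece inference. It is a simplified toy under stated assumptions: `wordpiece_tokenize` and `TOY_VOCAB` are hypothetical, the vocabulary is hand-written rather than learned, and real tokenizers add normalization and pre-tokenization steps that are omitted here.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word becomes the unknown token.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical hand-written vocabulary, for illustration only.
TOY_VOCAB = {"token", "##izer", "##izers", "##s", "un", "##believ", "##able"}

print(wordpiece_tokenize("tokenizers", TOY_VOCAB))    # ['token', '##izers']
print(wordpiece_tokenize("unbelievable", TOY_VOCAB))  # ['un', '##believ', '##able']
print(wordpiece_tokenize("xyz", TOY_VOCAB))           # ['[UNK]']
```

BPE and Unigram reach a similar result by different routes (applying learned merges, or picking the most probable segmentation), which is why the vocabulary files differ between model families.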

You initialize it with `tokenizer = AutoTokenizer.from_pretrained("model-name")`, which downloads and caches the tokenizer configuration and vocab files.

After loading, use `tokenizer(text, return_tensors="pt")` to convert raw text into model-ready input IDs (and reverse with `tokenizer.decode()`).

Every pretrained model on Hugging Face comes paired with a tokenizer that mirrors how it was trained:

**Model-specific vocab and rules**
Each model repository on the Hub includes tokenizer files (vocabulary, merges/rules, special-token mappings) exactly matching what the model saw during pretraining.

**Algorithm varies by architecture**
BERT-style models typically use WordPiece, GPT-style models use Byte-Pair Encoding, and others may use SentencePiece or Unigram; the AutoTokenizer you load knows which under-the-hood algorithm to pull in.

**Shared across variants**
Different checkpoints of the same architecture (e.g. bert-base-uncased vs. bert-large-uncased) share the same tokenizer type and vocab, but fine-tuned or multilingual variants may have expanded or modified vocabularies.

**Consistency is crucial**
Using the exact tokenizer used at pretraining ensures that token-to-ID mappings match the model’s learned embeddings. A mismatched tokenizer produces incorrect inputs and degrades performance.
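The sketch below shows why a mismatch matters, using two made-up three-word vocabularies. `VOCAB_A`, `VOCAB_B`, `encode`, and `decode` are hypothetical stand-ins for illustration and are not part of any library.

```python
# Two hypothetical vocabularies with the same tokens but different IDs.
VOCAB_A = {"the": 0, "cat": 1, "sat": 2}
VOCAB_B = {"sat": 0, "the": 1, "cat": 2}


def encode(words, vocab):
    """Map each word to its ID in the given vocabulary."""
    return [vocab[word] for word in words]


def decode(ids, vocab):
    """Map each ID back to its word in the given vocabulary."""
    id_to_token = {token_id: token for token, token_id in vocab.items()}
    return [id_to_token[token_id] for token_id in ids]


ids = encode(["the", "cat", "sat"], VOCAB_A)
print(ids)                   # [0, 1, 2]
print(decode(ids, VOCAB_A))  # ['the', 'cat', 'sat']
print(decode(ids, VOCAB_B))  # ['sat', 'the', 'cat']
```

A model trained with `VOCAB_B` would read the IDs produced by `VOCAB_A` as a different sentence, which is the same failure that occurs when a model is paired with the wrong tokenizer.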

**AutoTokenizer manages this for you**
When you call

`tokenizer = AutoTokenizer.from_pretrained("model-name")`

it fetches the right tokenizer files for that model, so you don’t have to specify vocab paths or algorithms manually.

In [None]:
from google.colab import userdata
from huggingface_hub import login
from transformers import AutoTokenizer

The next cell retrieves your Hugging Face access token from Colab’s stored user data and logs in to the Hugging Face Hub. Setting `add_to_git_credential=True` also saves the token to your Git credentials for seamless access to private repos.

In [None]:
hf_token = userdata.get('HF_TOKEN')
login(hf_token, add_to_git_credential=True)

# Accessing Llama 3.1 from Meta

In [None]:
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3.1-8B', trust_remote_code=True)

In [None]:
text = "You want to see how the Tokenizers work in Llama3.1 and the family of 3.1"
tokens = tokenizer.encode(text)
tokens

Find the number of tokens generated by Llama 3.1, then compare it with the number of characters and words in the text

In [None]:
len(tokens)

In [None]:
len(text)

In [None]:
len(text.split())
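The three lengths above (tokens, characters, words) are easier to compare as ratios. The helper below is a small sketch: `tokenization_stats` is a hypothetical name, and the token IDs in the usage line are placeholders standing in for the list returned by `tokenizer.encode(text)`, not real Llama 3.1 output.

```python
def tokenization_stats(text, token_ids):
    """Summarize how densely a text was tokenized."""
    n_tokens = len(token_ids)
    n_chars = len(text)
    n_words = len(text.split())
    return {
        "tokens": n_tokens,
        "characters": n_chars,
        "words": n_words,
        "chars_per_token": n_chars / n_tokens,
        "tokens_per_word": n_tokens / n_words,
    }


# Placeholder IDs for illustration; pass tokenizer.encode(text) in the notebook.
stats = tokenization_stats("one two three", [101, 102, 103, 104])
print(stats)
```

In the notebook you would call `tokenization_stats(text, tokens)` with the variables defined in the cells above.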

Decoding the tokens

In [None]:
tokenizer.decode(tokens)


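`tokenizer.batch_decode(...)` expects a batch of sequences. Passing it the flat list `tokens` makes it treat each token ID as its own one-token sequence, so it returns a list of strings, one per token, rather than a single reconstructed sentence. That makes it a handy way to inspect token boundaries:

1. Each token ID is converted into its corresponding string piece (including special tokens like `<|begin_of_text|>`).
2. The pieces are returned separately, so you can see where a word such as "Tokenizers" is split into subword fragments.
3. To reconstruct the original text as a single string, use `tokenizer.decode(tokens)` instead.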


In [None]:
tokenizer.batch_decode(tokens)


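The method `tokenizer.get_added_vocab()` returns a dictionary mapping each added token string to its assigned token ID. "Added" covers tokens that sit on top of the base vocabulary: those you add yourself (via `tokenizer.add_tokens(...)`) and those that ship with the tokenizer, such as special tokens like `<|begin_of_text|>`. For Llama 3.1 you should therefore expect to see its special tokens here; a tokenizer with no added tokens at all returns an empty dict.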


In [None]:
tokenizer.get_added_vocab()

In [None]:
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3.1-8B-Instruct', trust_remote_code=True)

Many models have a variant that has been trained for use in chat.
These are typically labelled with the word "Instruct" at the end of the model name.
They have been trained to expect prompts in a particular format that includes system, user, and assistant messages.

1. **Define the chat history**
   You create `messages`, a list of dictionaries where each entry has a `"role"` (like `system` or `user`) and its `"content"`.

2. **Build a single prompt string**
   `apply_chat_template(...)` takes that list and stitches it into one long string, inserting special tokens around each role (e.g. `<|start_header_id|>system<|end_header_id|>`) plus metadata lines (knowledge cut-off date, today’s date).

3. **Add the “assistant” cue**
   By setting `add_generation_prompt=True`, it tacks on the final `<|start_header_id|>assistant…` marker so the model knows “now it’s your turn to talk.”

4. **Result**
   When you `print(prompt)`, you see one continuous text blob that the model can consume directly—complete with role markers, context dates, and a generation slot for the assistant’s reply.
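The steps above can be sketched by hand to show what a chat template does. This is a simplified approximation, not the actual Llama 3.1 template: `render_chat` is a hypothetical helper, it uses Llama 3 style markers, and it leaves out the date metadata lines that the real template inserts.

```python
def render_chat(messages, add_generation_prompt=True):
    """Stitch role/content messages into one prompt string (simplified)."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn is wrapped in a role header and closed with an end marker.
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # An open assistant header tells the model it is its turn to reply.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


example_messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Tell a light-hearted joke"},
]
print(render_chat(example_messages))
```

In practice, prefer `tokenizer.apply_chat_template(...)`, since each model family defines its own markers and the tokenizer carries the exact template the model was trained with.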


In [None]:
messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Tell a light-hearted joke for a room of Data Scientists"}
  ]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)

Phi3 from Microsoft

Qwen2 from Alibaba Cloud

StarCoder2 from BigCode (ServiceNow + Hugging Face + NVIDIA)

In [None]:
PHI3_MODEL_NAME = "microsoft/Phi-3-mini-4k-instruct"
QWEN2_MODEL_NAME = "Qwen/Qwen2-7B-Instruct"
STARCODER2_MODEL_NAME = "bigcode/starcoder2-3b"

In [None]:
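# Load the Phi-3 tokenizer
phi3_tokenizer = AutoTokenizer.from_pretrained(PHI3_MODEL_NAME)

text = "I am excited to show Tokenizers in action for PHI3 Model"
# Token IDs from the Llama 3.1 tokenizer loaded earlier, for comparison
print(tokenizer.encode(text))
print()
# Phi-3 token IDs, decoded one by one to show how Phi-3 splits the text
tokens = phi3_tokenizer.encode(text)
print(phi3_tokenizer.batch_decode(tokens))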

In [None]:
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
print()
print(phi3_tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))

# Qwen

In [None]:
qwen2_tokenizer = AutoTokenizer.from_pretrained(QWEN2_MODEL_NAME)

text = "I am excited to show Tokenizers in action to my LLM engineers"
print(tokenizer.encode(text))
print()
print(phi3_tokenizer.encode(text))
print()
print(qwen2_tokenizer.encode(text))

In [None]:
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
print()
print(phi3_tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
print()
print(qwen2_tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))

In [None]:
starcoder2_tokenizer = AutoTokenizer.from_pretrained(STARCODER2_MODEL_NAME, trust_remote_code=True)
code = """
def hello_world(person):
  print("Hello", person)
"""
# Encode the snippet, then print each token ID next to the text it decodes to
tokens = starcoder2_tokenizer.encode(code)
for token in tokens:
  print(f"{token}={starcoder2_tokenizer.decode(token)}")