## Tokenization Examples
### Without Chat Template
Plain text is tokenized directly, with no role markers, as is done for pre-training data.

In [24]:
import os
from transformers import AutoTokenizer

# Hugging Face access token, needed for gated models such as Llama 3
HF_API_KEY = os.getenv('HF_API_KEY')

# Initialize the tokenizer. The output below shows Llama 3 token IDs
# (128000 = <|begin_of_text|>), so the Llama 3 tokenizer is the active one here.
# tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
# tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B", token=HF_API_KEY)

# Define the plain-text inputs (no chat roles involved)
texts = ["Lorem ipsum dolor sit amet.", "Ein Beispieltext."]
# Apply the tokenizer to the batch of texts
tokenized_prompt = tokenizer(texts)

# Print the tokenized inputs
print("Token IDs: ", tokenized_prompt['input_ids'])
print("Tokens: ", tokenizer.convert_ids_to_tokens(tokenized_prompt['input_ids'][0]))
print("Tokens: ", tokenizer.convert_ids_to_tokens(tokenized_prompt['input_ids'][1]))


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Token IDs:  [[128000, 33883, 27439, 24578, 2503, 28311, 13], [128000, 54850, 80292, 1342, 13]]
Tokens:  ['<|begin_of_text|>', 'Lorem', 'Ġipsum', 'Ġdolor', 'Ġsit', 'Ġamet', '.']
Tokens:  ['<|begin_of_text|>', 'Ein', 'ĠBeispiel', 'text', '.']
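The `Ġ` prefix in the tokens above (and `Ċ` in the later output) comes from byte-level BPE, which maps every byte to a printable character so that spaces and newlines survive inside token strings. The following is a minimal sketch of that mapping and its inverse, assuming the tokenizer uses the GPT-2 style byte table, which the `Ġ`/`Ċ` markers suggest. `tokens_to_text` is a hypothetical helper written for illustration, not part of `transformers`; in practice `tokenizer.decode` does this work.

```python
from typing import Dict, Iterable


def bytes_to_unicode() -> Dict[int, str]:
    """Build the byte -> printable character table used by GPT-2 style byte-level BPE."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Non-printable bytes (space, newline, ...) are shifted past U+00FF
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


# Inverse table: printable character -> original byte
UNICODE_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def tokens_to_text(
    tokens: Iterable[str],
    special_tokens: Iterable[str] = ("<|begin_of_text|>",),
) -> str:
    """Hypothetical helper: turn byte-level token strings back into plain text."""
    skip = set(special_tokens)
    joined = "".join(t for t in tokens if t not in skip)
    raw = bytes(UNICODE_TO_BYTE[ch] for ch in joined)
    return raw.decode("utf-8", errors="replace")


print(tokens_to_text(["<|begin_of_text|>", "Lorem", "\u0120ipsum", "\u0120dolor",
                      "\u0120sit", "\u0120amet", "."]))
# prints: Lorem ipsum dolor sit amet.
```

Here `\u0120` is `Ġ`, the character that stands in for the space byte; `\u010a` (`Ċ`) plays the same role for a newline.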


### With Chat Template
Messages are first rendered through a chat template, as is done in instruction tuning for chatbots.

In [25]:
import os
from transformers import AutoTokenizer

# Initialize the tokenizer. The output below was produced with the Llama 3 base
# tokenizer, which defines no chat template of its own (see the warning in the output).
# tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
# tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B", token=os.getenv('HF_API_KEY'))


# Define the chat messages
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]

# Apply the chat template but do not tokenize the result yet
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Now, tokenize the prompt to see how it looks in its tokenized form
tokenized_prompt = tokenizer(prompt, return_tensors="pt")
print("Token IDs: ", tokenized_prompt['input_ids'])
print("Tokens: ", tokenizer.convert_ids_to_tokens(tokenized_prompt['input_ids'][0]))


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

No chat template is defined for this tokenizer - using a default chat template that implements the ChatML format. If the default is not appropriate for your model, please set `tokenizer.chat_template` to an appropriate template. See https://huggingface.co/docs/transformers/main/chat_templating for more information.
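The warning above says the tokenizer fell back to the ChatML format. As a rough illustration of what that rendering step produces before tokenization, here is a minimal pure-Python sketch. `render_chatml` is a hypothetical helper written for this example, not a `transformers` function; its layout is inferred from the token output in this cell (`<|im_start|>`, role, newline, content, `<|im_end|>`, newline).

```python
from typing import Dict, List


def render_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = False) -> str:
    """Hypothetical helper: render chat messages as a ChatML-style prompt string."""
    parts = []
    for message in messages:
        # Each turn is wrapped in <|im_start|>role ... <|im_end|>
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a friendly chatbot who always responds in the style of a pirate"},
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
print(render_chatml(messages, add_generation_prompt=True))
```

In the token list printed by this cell, the `<|im_start|>` and `<|im_end|>` markers are split into several ordinary tokens (`<`, `|`, `im`, `_start`, ...), which suggests they are not registered as special tokens in this vocabulary. A tokenizer that ships its own chat template would typically encode its role markers as single tokens.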



Token IDs:  tensor([[128000,     27,     91,    318,   5011,     91,     29,   9125,    198,
           2675,    527,    264,  11919,   6369,   6465,    889,   2744,  31680,
            304,    279,   1742,    315,    264,  55066,     27,     91,    318,
           6345,     91,    397,     27,     91,    318,   5011,     91,     29,
            882,    198,   4438,   1690,  59432,    649,    264,   3823,   8343,
            304,    832,  11961,  76514,     91,    318,   6345,     91,    397,
             27,     91,    318,   5011,     91,     29,  78191,    198]])
Tokens:  ['<|begin_of_text|>', '<', '|', 'im', '_start', '|', '>', 'system', 'Ċ', 'You', 'Ġare', 'Ġa', 'Ġfriendly', 'Ġchat', 'bot', 'Ġwho', 'Ġalways', 'Ġresponds', 'Ġin', 'Ġthe', 'Ġstyle', 'Ġof', 'Ġa', 'Ġpirate', '<', '|', 'im', '_end', '|', '>Ċ', '<', '|', 'im', '_start', '|', '>', 'user', 'Ċ', 'How', 'Ġmany', 'Ġhelicopters', 'Ġcan', 'Ġa', 'Ġhuman', 'Ġeat', 'Ġin', 'Ġone', 'Ġsitting', '?<', '|', 'im', '_end', '|', '>Ċ', '<'