# Special tokens in Llama 4

A prompt is the input to an LLM. It contains the user input, special tokens, and optionally a context (chat history and/or external data). The model only sees tokens and never deals directly with text.

Llama 4 supports the following list of special tokens:

**General tokens:**
* <|begin_of_text|>: Specifies the start of the prompt
* <|header_start|>: Start of a role for a particular message.
* <|header_end|>: End of the role for a particular message.
* <|eot|>: End of turn. Represents when the model has finished interacting with the user input.

  **NOTE**: In Llama 3, similar general tokens are used (but three of the four were renamed in Llama 4):
    * <|begin_of_text|>
    * <|start_header_id|>
    * <|end_header_id|>
    * <|eot_id|>

  You'll see the detailed comparison of Llama 4 and 3 in the examples below.
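
As a quick illustration of the renaming in the note above, here is a minimal sketch that maps the Llama 3 general token names to their Llama 4 counterparts. The helper `rename_general_tokens` and the example string are hypothetical and hand-written for this illustration; they are not output from a real chat template, and the real templates may differ in more than token names.

```python
# Llama 3 -> Llama 4 names for the general special tokens listed above.
# <|begin_of_text|> is unchanged, so it needs no entry.
LLAMA3_TO_LLAMA4 = {
    "<|start_header_id|>": "<|header_start|>",
    "<|end_header_id|>": "<|header_end|>",
    "<|eot_id|>": "<|eot|>",
}

def rename_general_tokens(llama3_prompt: str) -> str:
    """Rewrite Llama 3 general token names to Llama 4 names (illustration only)."""
    for old, new in LLAMA3_TO_LLAMA4.items():
        llama3_prompt = llama3_prompt.replace(old, new)
    return llama3_prompt

# Hypothetical hand-written fragment in Llama 3 style, not real template output.
example = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>Hi<|eot_id|>"
converted = rename_general_tokens(example)
print(converted)
```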

**Image tokens:**
* <|image_start|>: Start of the image data in the prompt.
* <|image_end|>: End of the image data in the prompt.
* <|patch|>: Represents subsets of the input image. Larger images have more patch tokens in the prompt.
* <|tile_x_separator|>: Separates the x tiles of an image.
* <|tile_y_separator|>: Separates the y tiles of an image.
* <|image|>: Separates the regular-sized image tokens from a downsized version of it that fits in a single tile.

Llama 4 supports the same 4 roles (`system`, `user`, `assistant`, `ipython`) as Llama 3:

1. system: Sets the context in which to interact with Llama. The system prompt typically includes rules or guidelines that help the model respond effectively.
2. user: Represents the human interacting with Llama. The user prompt includes the specific user inputs, commands, or questions.
3. assistant: Represents Llama generating a response to the user.
4. ipython: Represents the output of a tool call when sent back to Llama.
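
To make the roles and general tokens concrete, here is a minimal hand-rolled sketch of how a list of messages could be laid out as a raw Llama 4 prompt. The function `build_llama4_prompt` is hypothetical, and the two newlines placed after each header are an assumption; the authoritative string is whatever the model's own chat template produces.

```python
def build_llama4_prompt(messages, add_generation_prompt=True):
    """Lay out messages with the Llama 4 general tokens (illustrative sketch)."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn: role header, content, then the end-of-turn token.
        # The "\n\n" after the header is an assumption about whitespace.
        parts.append(
            f"<|header_start|>{message['role']}<|header_end|>\n\n"
            f"{message['content']}<|eot|>"
        )
    if add_generation_prompt:
        # An open assistant header asks the model to write the next turn.
        parts.append("<|header_start|>assistant<|header_end|>\n\n")
    return "".join(parts)

messages = [
    {"role": "system", "content": "Respond in French."},
    {"role": "user", "content": "Best quote in Godfather."},
]
prompt = build_llama4_prompt(messages)
print(prompt)
```

Building the string by hand like this is only useful for understanding the layout; in practice the chat template should be left to generate it.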

We'll use Hugging Face's transformers library to generate the raw tokens of a prompt to Llama 4.

## Load API Keys

In [None]:
import os
from utils import get_llama_api_key, get_llama_base_url, get_together_api_key
from utils import get_hf_access_token

llama_api_key = get_llama_api_key()
llama_base_url = get_llama_base_url()
together_api_key = get_together_api_key()
hf_access_token = get_hf_access_token()

from utils import llama4, llama4_together
from transformers import AutoProcessor

#from transformers import AutoTokenizer


# Using Hugging Face transformers

We'll use the transformers library and its AutoProcessor class to find out the raw prompt of an input message. We'll also use 3 Llama models to compare the raw prompts of text input messages: Llama 4, Llama 3.2 (vision model) and Llama 3.3.

In [None]:
#!pip install -U "transformers>=4.51.0"

In [None]:
import os
#from google.colab import userdata


model_llama4_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
model_llama33_id = "meta-llama/Llama-3.3-70B-Instruct"
model_llama32_id = "meta-llama/Llama-3.2-90B-Vision-Instruct"

processor_llama4 = AutoProcessor.from_pretrained(model_llama4_id)
processor_llama33 = AutoProcessor.from_pretrained(model_llama33_id)
processor_llama32 = AutoProcessor.from_pretrained(model_llama32_id)

# Comparing Llama 4 and 3 raw text prompts



Let's first use a plain text user input with no system prompt and see its raw prompt with special tokens using Llama 3 and Llama 4. You can see the difference in the raw prompts each processor produces.

In [None]:
messages=[{
    "role": "user",
    "content": "Best quote in Godfather."
}]

raw_prompt = processor_llama4.apply_chat_template(messages,
    tokenize=False,
    add_generation_prompt=True)
raw_prompt

In [None]:
raw_prompt = processor_llama33.apply_chat_template(messages,
    tokenize=False,
    add_generation_prompt=True)
raw_prompt

In [None]:
raw_prompt = processor_llama32.apply_chat_template(messages,
    tokenize=False,
    add_generation_prompt=True)
raw_prompt

Below is also a plain text input but with a system prompt:

In [None]:
messages = [
    {"role": "system", "content": "Respond in French."},
    {"role": "user", "content": "Best quote in Godfather."},
]

raw_input_prompt = processor_llama4.apply_chat_template(messages,
    tokenize=False,
    add_generation_prompt=True)
raw_input_prompt

In [None]:
raw_input_prompt = processor_llama33.apply_chat_template(messages,
    tokenize=False,
    add_generation_prompt=True)
raw_input_prompt

In [None]:
raw_input_prompt = processor_llama32.apply_chat_template(messages,
    tokenize=False,
    add_generation_prompt=True)
raw_input_prompt

You can see that Llama 4 simplifies the naming of the general tokens as well as the default System prompt used in Llama 3.

# Finding out raw prompt for long input

Long context is another main capability. Will long text input lead to the same raw prompt? It should and it does.

Let's get the text of the novel The Adventures of Tom Sawyer, which has more than 412K characters and, as you'll see in Lesson 5 Long Context, about 105K tokens.
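
As a rough sanity check on those two figures, the sketch below computes the implied average number of characters per token. Both inputs are the rounded values quoted above, so the result is only an estimate.

```python
# Rounded figures quoted in the text above; the exact counts will differ.
num_chars = 412_000   # "more than 412K characters"
num_tokens = 105_000  # "about 105K tokens"

chars_per_token = num_chars / num_tokens
print(f"{chars_per_token:.1f} characters per token on average")
```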

In [None]:
#!wget https://www.gutenberg.org/cache/epub/74/pg74.txt

In [None]:
with open("pg74.txt", "r", encoding='utf-8') as file:
    tom = file.read()
len(tom)

In [None]:
messages = [
    {"role": "user", "content": f"what's the first great law of human action Tom discovered in the book: {tom}"},
]

raw_input_prompt = processor_llama4.apply_chat_template(messages,
    tokenize=False,
    add_generation_prompt=True)
raw_input_prompt[:200], raw_input_prompt[-200:]

So `raw_input_prompt`, after tokenization, becomes the input tokens to Llama. You'll see Llama 4's response to the query above in the Long Context lesson.

# Deep dive into Llama 4 image tokens

Now let's see a user message with a text question on the Llama repo image shown in Lesson 2:

In [None]:
url = "https://raw.githubusercontent.com/meta-llama/llama-models/refs/heads/main/Llama_Repo.jpeg"

import requests
from PIL import Image
from io import BytesIO
import matplotlib.pyplot as plt

def display_image(image_url):
    response = requests.get(image_url)
    img = Image.open(BytesIO(response.content))
    plt.imshow(img)
    plt.axis('off')
    plt.show()

display_image(url)

In [None]:
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Describe the image below.",
            },
            {"type": "image", "url": url},
        ],
    },
]

inputs = processor_llama4.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)
inputs.keys()

The values for the keys `input_ids` and `pixel_values` represent the encoded raw prompt of the messages and the transformed image data of the image(s) in the messages, respectively.

When querying Llama 4 with an image (like using the `messages` above), the following image processing steps are performed behind the scenes:

1. A dynamic image transformation that divides the input image into 336×336 pixel tiles;

2. A global tile (created by resizing the entire input image to 336×336 pixels) is appended after the local tiles to provide a global view of the input image.

The Llama repo image we used above is 768x768 pixels. Because 768/336≈2.29, 3 tiles are needed to cover the image in each direction, giving a total of 3*3=9 local tiles. Adding the one global tile, we expect 10 tiles of 336x336 to represent the image data, as shown below in the `pixel_values` of `inputs`:

In [None]:
inputs.pixel_values.size()
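
The tile arithmetic described above can be sketched in a few lines. This is a simplified model under stated assumptions, not the processor's actual algorithm: the function `expected_pixel_values_shape` is hypothetical, it uses plain ceiling division, it treats an image that fits in one tile as producing only the global tile, and it simply raises an error when the 16-local-tile cap mentioned later in this lesson would be exceeded, rather than modeling the resize.

```python
import math

TILE = 336      # tile edge in pixels
MAX_LOCAL = 16  # maximum number of local tiles

def expected_pixel_values_shape(width, height):
    """Estimate the pixel_values shape for one image (simplified sketch)."""
    cols = math.ceil(width / TILE)   # tiles needed horizontally
    rows = math.ceil(height / TILE)  # tiles needed vertically
    local = cols * rows
    if local > MAX_LOCAL:
        # The real preprocessing resizes the image; that is not modeled here.
        raise ValueError(f"{local} local tiles exceeds the cap of {MAX_LOCAL}")
    # A single-tile image is represented by the global tile alone;
    # otherwise one global tile is appended to the local tiles.
    total = 1 if local == 1 else local + 1
    return (total, 3, TILE, TILE)

print(expected_pixel_values_shape(768, 768))
```

For the 768x768 image this gives 3 columns by 3 rows, so 9 local tiles plus the global tile.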

To convert the `messages` above to Llama 4 raw input prompt tokens, we need to decode the `input_ids` to `raw_prompt`, which has the following content (formatted differently for better readability), with both the general tokens and the image tokens we introduced earlier:

  <|begin_of_text|><|header_start|>user<|header_end|>Describe the image below.\
  <|image_start|>\
  <|patch|>...<|patch|><|tile_x_separator|>\
  <|patch|>...<|patch|><|tile_x_separator|>\
  <|patch|>...<|patch|><|tile_y_separator|>\
  <|patch|>...<|patch|><|tile_x_separator|>\
  <|patch|>...<|patch|><|tile_x_separator|>\
  <|patch|>...<|patch|><|tile_y_separator|>\
  <|patch|>...<|patch|><|tile_x_separator|>\
  <|patch|>...<|patch|><|tile_x_separator|>\
  <|patch|>...<|patch|><|tile_y_separator|>\
  <|image|><|patch|>...<|patch|><|image_end|>\
  <|eot|><|header_start|>assistant<|header_end|>

The `raw_prompt` has 6 <|tile_x_separator|>'s and 3 <|tile_y_separator|>'s. There are 144 <|patch|>'s between two consecutive tile separators, and between <|image|> and <|image_end|>. Each tile is 336x336 pixels and each patch is 28x28 pixels, hence each tile contains 144 patches (336/28=12; 12*12=144).

Remember that:

* <|image_start|>: Start of the image data in the prompt.
* <|image_end|>: End of the image data in the prompt.
* <|patch|>: Represents subsets of the input image.
* <|tile_x_separator|>: Separates the x tiles of an image.
* <|tile_y_separator|>: Separates the y tiles of an image.
* <|image|>: Separates the regular-sized image tokens from a downsized version of it that fits in a single tile.
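
To check the counts above without loading a model, the sketch below assembles the image portion of a prompt for a grid of tiles, following the layout shown earlier, and then counts the tokens. The function `image_token_block` is hypothetical and only mirrors the layout described in this lesson; the single-tile branch assumes that such an image is represented by the global tile alone.

```python
PATCHES_PER_TILE = (336 // 28) ** 2  # 12 * 12 = 144

def image_token_block(cols, rows):
    """Build the image token layout for a cols x rows tile grid (sketch)."""
    tile = "<|patch|>" * PATCHES_PER_TILE
    parts = ["<|image_start|>"]
    if cols * rows > 1:
        for _ in range(rows):
            for col in range(cols):
                parts.append(tile)
                # x separator between tiles in a row, y separator ends the row.
                last_in_row = col == cols - 1
                parts.append(
                    "<|tile_y_separator|>" if last_in_row else "<|tile_x_separator|>"
                )
    # The global (downsized) tile always follows <|image|>.
    parts.append("<|image|>" + tile + "<|image_end|>")
    return "".join(parts)

block = image_token_block(3, 3)
print(block.count("<|tile_x_separator|>"),
      block.count("<|tile_y_separator|>"),
      block.count("<|patch|>"))
```

In general a grid with `cols` columns and `rows` rows yields `rows * (cols - 1)` x separators and `rows` y separators.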





Now let's see the raw prompt based on the knowledge above. We replace each run of 144 <|patch|>'s with <|patch|>...<|patch|> to make it more readable:

In [None]:
raw_prompt = processor_llama4.tokenizer.batch_decode(inputs["input_ids"])
raw_prompt[0].replace("<|patch|>"*144, "<|patch|>...<|patch|>")

To visualize the image with its 6 <|tile_x_separator|>'s, 3 <|tile_y_separator|>'s, and 144 <|patch|>'s per tile:

In [None]:
from PIL import Image, ImageDraw, ImageFont
import matplotlib.pyplot as plt
import numpy as np

# Set the dimensions
width, height = 768, 768
tile_size = 336
patch_size = 28

# Create a new image
img = Image.new('RGB', (width, height), color='white')
draw = ImageDraw.Draw(img)

# Divide the image into tiles
for i in range(0, width, tile_size):
    for j in range(0, height, tile_size):
        # Draw a rectangle for each tile
        draw.rectangle((i, j, i + tile_size, j + tile_size), outline='black')

        # Divide each tile into patches
        for x in range(i, i + tile_size, patch_size):
            for y in range(j, j + tile_size, patch_size):
                # Draw a rectangle for each patch
                draw.rectangle((x, y, x + patch_size, y + patch_size), outline='gray')

# Add separator lines with text
font = ImageFont.load_default()
font = font.font_variant(size=28)

for i in range(tile_size, width, tile_size):
    draw.line((i, 0, i, height), fill='black')
    draw.text((i - 150, height // 5), '<|tile_x_separator|>', font=font, fill='blue')
    draw.text((i - 150, height // 1.5), '<|tile_x_separator|>', font=font, fill='blue')
for j in range(tile_size, height, tile_size):
    draw.line((0, j, width, j), fill='black')
    draw.text((width // 2, j - 20), '<|tile_y_separator|>', font=font, fill='blue')
draw.line((0, height - 10, width, height - 10), fill='black')
draw.text((width // 2, height - 40), '<|tile_y_separator|>', font=font, fill='blue')


# Add additional texts
draw.text((10, 10), 'Image Size: 768x768', font=font, fill='black')
draw.text((10, 40), 'Tile Size: 336x336; # of Tiles: 9', font=font, fill='black')
draw.text((10, 70), 'Patch Size: 28x28; # of Patches per Tile: 144 (12x12)', font=font, fill='black')

# Convert the image to a numpy array
img_array = np.array(img)

# Display the image using matplotlib
plt.imshow(img_array)
plt.axis('off')  # Turn off the axis
plt.show()


To take a look at the whole inputs value:

In [None]:
inputs

The `attention_mask`, which has the same shape as `input_ids`, is a binary mask used to control the attention mechanism, allowing Llama 4 to focus on specific parts of the input data when generating output.

In [None]:
inputs['attention_mask'].size(), inputs['pixel_values'].size(), inputs['input_ids'].size()
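
To illustrate what a binary attention mask encodes, here is a small pure-Python sketch of padding a batch of token id sequences. The token ids are made up, and the helper `pad_batch`, the right-side padding, and the `pad_id=0` default are assumptions chosen for the illustration. For a single prompt with no padding, like the one in this lesson, the mask would typically be all ones.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad sequences to equal length and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * pad)
        # 1 marks a real token to attend to, 0 marks padding to ignore.
        attention_mask.append([1] * len(seq) + [0] * pad)
    return input_ids, attention_mask

# Made-up token ids for two prompts of different lengths.
ids, mask = pad_batch([[5, 6, 7], [8]])
print(ids)
print(mask)
```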

## Small image

If the image size is 336x336 or smaller, no <|tile_x_separator|> or <|tile_y_separator|> will be needed. Let's first resize the Llama repo image to a small 300x300 one.

In [None]:
import requests
from PIL import Image
from io import BytesIO

response = requests.get(url)
img = Image.open(BytesIO(response.content)).resize((300, 300))
img.save("small.jpg")
img.size

In [None]:
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "describe the image.",
            },
            {"type": "image", "url": "small.jpg"},
        ],
    },
]

inputs = processor_llama4.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)


You can see there's only one tile, the global tile, in the generated raw prompt and pixel_values.

In [None]:
inputs.pixel_values.size()

In [None]:
raw_prompt = processor_llama4.tokenizer.batch_decode(inputs["input_ids"])
raw_prompt[0].replace("<|patch|>"*144, "<|patch|>...<|patch|>")

Just 144 patches between <|image_start|><|image|> and <|image_end|> tokens for the small image.

## Larger images

If an image is too large, resizing may be needed because the maximum number of local tiles is 16. When you use a Llama cloud or local inference provider, though, the image preprocessing, including any resizing, is taken care of under the hood.

Let's try two larger images.

In [None]:
url1 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
url2 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/cat_style_layout.png"

display_image(url1)
display_image(url2)

The image at url1 (rabbit) is 2048x2688. 2048/336≈6.10, so covering it at full size would need more than the maximum of 16 local tiles; the image is therefore resized, and then 16 local tiles are generated.

The image at url2 (cat) is 1024x1024. 1024/336≈3.05, so 4x4=16 local tiles are needed (no resizing required).

A global tile will finally be appended to the 16 tiles, making the size of the pixels for both image urls [17, 3, 336, 336].


In [None]:
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "describe the image.",
            },
            {"type": "image", "url": url1},
        ],
    },
]

inputs = processor_llama4.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)
inputs.pixel_values.size()

The raw_prompt has 12 <|tile_x_separator|>'s and 4 <|tile_y_separator|>'s, and there are 144 <|patch|>'s between two consecutive tile separators:

In [None]:
raw_prompt = processor_llama4.tokenizer.batch_decode(inputs["input_ids"])
raw_prompt[0].replace("<|patch|>"*144, "<|patch|>...<|patch|>")

The same pixel_values size, 12 <|tile_x_separator|>'s, and 4 <|tile_y_separator|>'s are generated for the second image:

In [None]:
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "describe the image.",
            },
            {"type": "image", "url": url2},
        ],
    },
]

inputs = processor_llama4.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)
inputs.pixel_values.size()

In [None]:
raw_prompt = processor_llama4.tokenizer.batch_decode(inputs["input_ids"])
raw_prompt[0].replace("<|patch|>"*144, "<|patch|>...<|patch|>")

## Multiple images

Let's now see what the pixel_values and raw prompt format is for a query with multiple images.

In [None]:
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "describe the image.",
            },
            {"type": "image", "url": url1},
            {"type": "image", "url": url2},
        ],
    },
]

inputs = processor_llama4.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)
inputs.pixel_values.size()

So the size is just the sum of the sizes for the two single images. Note that there are two <|image|>'s in the raw prompt because the input messages contain two images.

In [None]:
raw_prompt = processor_llama4.tokenizer.batch_decode(inputs["input_ids"])
raw_prompt[0].replace("<|patch|>"*144, "<|patch|>...<|patch|>")
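
The multi-image bookkeeping can be written down as a short calculation. This sketch takes the per-image tile counts from the two single-image runs above (17 tiles each) as given and assumes, as described in this lesson, that every tile contributes 144 <|patch|> tokens.

```python
# Tiles per image (16 local + 1 global), taken from the single-image runs above.
tiles_per_image = [17, 17]
PATCHES_PER_TILE = 144

total_tiles = sum(tiles_per_image)
expected_shape = (total_tiles, 3, 336, 336)
total_patch_tokens = total_tiles * PATCHES_PER_TILE

print(expected_shape)
print(total_patch_tokens)
```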

Let's now take a quick look at how to do tool calling in Llama 4.

# Tool calling in Llama 4

Below is Meta's recommended system prompt for tool calling in Llama 4. You'll need to define your own `available_functions` for your use case.

In [None]:
available_functions = """
[
    {
        "name": "get_weather",
        "description": "Get weather info for places",
        "parameters": {
            "type": "dict",
            "required": [
                "city"
            ],
            "properties": {
                "city": {
                    "type": "string",
                    "description": "The name of the city to get the weather for"
                },
                "metric": {
                    "type": "string",
                    "description": "The metric for weather. Options are: celsius, fahrenheit",
                    "default": "celsius"
                }
            }
        }
    }
]
"""

system_prompt = """
You are a helpful assistant and an expert in function composition. You can answer general questions using your internal knowledge OR invoke functions when necessary. Follow these strict guidelines:

1. FUNCTION CALLS:
- ONLY use functions that are EXPLICITLY listed in the function list below
- If NO functions are listed (empty function list []), respond ONLY with internal knowledge or "I don't have access to [Unavailable service] information"
- If a function is not in the list, respond ONLY with internal knowledge or "I don't have access to [Unavailable service] information"
- If ALL required parameters are present AND the query EXACTLY matches a listed function's purpose: output ONLY the function call(s)
- Use exact format: [
  {
    "name": "<tool_name_foo>",
    "parameters": {
      "<param1_name>": "<param1_value>",
      "<param2_name>": "<param2_value>"
    }
  }
]
Examples:
CORRECT: [
  {
    "name": "get_weather",
    "parameters": {
      "location": "Vancouver"
    }
  },
  {
    "name": "calculate_route",
    "parameters": {
      "start": "Boston",
      "end": "New York"
    }
  }
] <- Only if get_weather and calculate_route are in function list

INCORRECT: [
  {
    "name": "population_projections",
    "parameters": {
      "country": "United States",
      "years": 20
    }
  }
]}] <- Bad json format
INCORRECT: Let me check the weather: [
  {
    "name": "get_weather",
    "parameters": {
      "location": "Vancouver"
    }
  }]
INCORRECT: [
  {
    "name": "get_events",
    "parameters": {
      "location": "Singapore"
    }
  }] <- If function not in list

2. RESPONSE RULES:
- For pure function requests matching a listed function: ONLY output the function call(s)
- For knowledge questions: ONLY output text
- For missing parameters: ONLY request the specific missing parameters
- For unavailable services (not in function list): output ONLY with internal knowledge or "I don't have access to [Unavailable service] information". Do NOT execute a function call.
- If the query asks for information beyond what a listed function provides: output ONLY with internal knowledge about your limitations
- NEVER combine text and function calls in the same response
- NEVER suggest alternative functions when the requested service is unavailable
- NEVER create or invent new functions not listed below

3. STRICT BOUNDARIES:
- ONLY use functions from the list below - no exceptions
- NEVER use a function as an alternative to unavailable information
- NEVER call functions not present in the function list
- NEVER add explanatory text to function calls
- NEVER respond with empty brackets
- Use proper Python/JSON syntax for function calls
- Check the function list carefully before responding

4. TOOL RESPONSE HANDLING:
- When receiving tool responses: provide concise, natural language responses
- Don't repeat tool response verbatim
- Don't add supplementary information

Here is a list of functions in JSON format that you can invoke:
""" + available_functions

In [None]:
user_prompt = "What is the weather in SF and Seattle?"

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]

The Llama 4 response to the message above is a JSON object for tool calling.

In [None]:
from llama_api_client import LlamaAPIClient

client = LlamaAPIClient(api_key=os.environ['LLAMA_API_KEY'])

response = client.chat.completions.create(
  model="Llama-4-Maverick-17B-128E-Instruct-FP8", # Llama-4-Scout-17B-16E-Instruct-FP8
  messages=messages,
  temperature=0
)

print(response.completion_message.content.text)
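
Once the model returns tool calls in the format requested by the system prompt, your application has to parse them and run the matching functions. Below is a minimal sketch of that step. The `get_weather` stub, the `TOOLS` registry, the `run_tool_calls` helper, and the sample `model_output` string are all hypothetical and written for this illustration; the sample is not an actual model response.

```python
import json

def get_weather(city, metric="celsius"):
    """Hypothetical stub; a real version would query a weather service."""
    return f"(stub) weather for {city} in {metric}"

# Registry mapping tool names to local Python callables.
TOOLS = {"get_weather": get_weather}

# Hypothetical model output in the format the system prompt asks for.
model_output = """[
  {"name": "get_weather", "parameters": {"city": "San Francisco"}},
  {"name": "get_weather", "parameters": {"city": "Seattle"}}
]"""

def run_tool_calls(text):
    """Parse a JSON list of tool calls and execute each known tool."""
    results = []
    for call in json.loads(text):
        tool = TOOLS.get(call["name"])
        if tool is None:
            # Never execute a tool the application did not register.
            results.append({"response": f"unknown tool: {call['name']}"})
            continue
        results.append({"response": tool(**call["parameters"])})
    return results

results = run_tool_calls(model_output)
print(json.dumps(results, indent=2))
```

Looking tools up in an explicit registry, rather than evaluating the model's text, keeps the model from invoking anything the application did not deliberately expose.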


To see the raw prompt that was input to Llama 4 for it to return the response above, do this. Notice all the special tokens added around the system prompt and the user prompt:

In [None]:
raw_prompt = processor_llama4.apply_chat_template(messages,
    tokenize=False,
    add_generation_prompt=True)
print(raw_prompt)

Assuming we have the following tool calling result, we can update the messages accordingly:

In [None]:
tool_call_response = """
[
  {
    "response": "Sunny 75"
  },
  {
    "response": "Rainy 65"
  }
]
"""

updated_messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
    {"role": "assistant", "content": response.completion_message.content.text},
    {"role": "user", "content": tool_call_response}
]

To see the final response of Llama 4 based on the tool calling result:

In [None]:
final_response = client.chat.completions.create(
  model="Llama-4-Maverick-17B-128E-Instruct-FP8",
  messages=updated_messages,
  temperature=0
)

print(final_response.completion_message.content.text)

Let's take one more look at the raw prompt input to Llama 4 for it to return the final response above:

In [None]:
raw_prompt = processor_llama4.apply_chat_template(updated_messages,
    tokenize=False,
    add_generation_prompt=True)
print(raw_prompt)

In the next lesson, Image Grounding and Understanding, we'll see 8 examples of using Llama 4, with the special tokens and prompt format working behind the scenes.