# Getting Started with Chat Templates for Text LLMs

**Chat templates** are part of the tokenizer for text-only LLMs or processor for multimodal LLMs. They specify how to convert conversations, represented as lists of messages, into a single tokenizable string in the format that the model expects.

To start with, we use the `mistralai/Mistral-7B-Instruct-v0.1` model as an example:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

In [None]:
# chat template
chat = [
    {'role': 'user', 'content': 'Hello, how are you?'},
    {'role': 'assistant', 'content': "I'm doing great. How can I help you today?"},
    {'role': 'user', 'content': "I'd like to show off how chat templates work!"}
]

tokenizer.apply_chat_template(chat, tokenize=False)

Note how the tokenizer has added the control tokens `[INST]` and `[/INST]` to indicate the start and end of user messages (but not assistant messages!), and how the entire chat is condensed into a single string.

If we set `tokenize=True`, the string will also be tokenized for us:

In [None]:
tokenizer.apply_chat_template(chat, tokenize=True)

Now we swap in the `HuggingFaceH4/zephyr-7b-beta` model instead:

In [None]:
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

In [None]:
# chat template
chat = [
    {'role': 'user', 'content': 'Hello, how are you?'},
    {'role': 'assistant', 'content': "I'm doing great. How can I help you today?"},
    {'role': 'user', 'content': "I'd like to show off how chat templates work!"}
]

tokenizer.apply_chat_template(chat, tokenize=False)

Both Zephyr and Mistral-Instruct were fine-tuned from the same base model, `Mistral-7B-v0.1`. However, they were trained with totally different chat formats. Without chat templates, we would have to write manual formatting code for each model.
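To make the cost of manual formatting concrete, here is a minimal sketch of what hand-written formatters for the two models might look like. The function names are hypothetical, and the exact whitespace and special tokens are approximations of the formats printed above, not the models' official templates.

```python
# Hand-written formatters approximating two chat formats (illustrative only).

def format_mistral_style(chat, bos="<s>", eos="</s>"):
    """Wrap user turns in [INST] ... [/INST]; close assistant turns with EOS."""
    text = bos
    for message in chat:
        if message["role"] == "user":
            text += f"[INST] {message['content']} [/INST]"
        elif message["role"] == "assistant":
            text += f"{message['content']}{eos} "
    return text


def format_zephyr_style(chat, eos="</s>"):
    """Prefix every turn with a <|role|> header and close it with EOS."""
    return "".join(
        f"<|{message['role']}|>\n{message['content']}{eos}\n" for message in chat
    )


chat = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great."},
]
mistral_text = format_mistral_style(chat)
zephyr_text = format_zephyr_style(chat)
print(mistral_text)
print(zephyr_text)
```

Every new model family would need another function like these, which is exactly the duplication that `apply_chat_template` removes.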

## Chat templates

Once we have built a list of messages, we just need to pass it to the `apply_chat_template()` method. When the formatted chat will be used as input for model generation, it is also a good idea to set `add_generation_prompt=True` to add a generation prompt.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = 'HuggingFaceH4/zephyr-7b-beta'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map='auto')

In [None]:
messages = [
    {
        'role': 'system',
        'content': 'You are a friendly chatbot who always responds in the style of a pirate'
    },
    {
        'role': 'user',
        'content': 'How many helicopters can a human eat in one sitting?'
    }
]

tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors='pt'
)
print(tokenizer.decode(tokenized_chat[0]))

Now that our input is formatted correctly for Zephyr, we can use the model to generate a response to the user's question:

In [None]:
outputs = model.generate(tokenized_chat, max_new_tokens=128)
print(tokenizer.decode(outputs[0]))

## Pipeline for Chat templates

In [None]:
from transformers import pipeline

pipe = pipeline('text-generation', 'HuggingFaceH4/zephyr-7b-beta')

In [None]:
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]

# print the assistant's response
print(pipe(messages, max_new_tokens=128)[0]['generated_text'][-1])

## "Generation prompt" in `add_generation_prompt`

The `add_generation_prompt` argument in the `apply_chat_template` method tells the template to add tokens that indicate the start of a bot response.

In [None]:
messages = [
    {"role": "user", "content": "Hi there!"},
    {"role": "assistant", "content": "Nice to meet you!"},
    {"role": "user", "content": "Can I ask a question?"}
]

In [None]:
# If setting `add_generation_prompt=False`
tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)

In [None]:
# If setting `add_generation_prompt=True`
tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

This time, we have added the tokens that indicate the start of a bot response. This ensures that when the model generates text, it will write a bot response instead of doing something unexpected, like continuing the user's message.
Chat models are still language models that are trained to continue text, which is why we need to guide them with appropriate control tokens.

Not all models require generation prompts. Some models, like LLaMA, do not have any special token before bot responses. In these cases, the `add_generation_prompt` argument will have no effect.
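The effect of the flag can be sketched in plain Python without loading a tokenizer. The `render_zephyr_like` function below is a hypothetical stand-in for a Zephyr-style template; the exact whitespace of the real template may differ.

```python
def render_zephyr_like(messages, add_generation_prompt=False, eos="</s>"):
    """Render messages in a Zephyr-like layout (a stand-in for a real template)."""
    text = ""
    for message in messages:
        text += f"<|{message['role']}|>\n{message['content']}{eos}\n"
    if add_generation_prompt:
        # Open an assistant turn so the model answers instead of extending the user turn.
        text += "<|assistant|>\n"
    return text


messages = [{"role": "user", "content": "Can I ask a question?"}]
without_prompt = render_zephyr_like(messages)
with_prompt = render_zephyr_like(messages, add_generation_prompt=True)
print(repr(without_prompt))
print(repr(with_prompt))
```

The two strings differ only by the trailing assistant header, which is all a generation prompt is.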

## `continue_final_message`

When passing a list of messages, we can choose to format the chat so the model will continue the final message in the chat instead of starting a new one. This is done by removing any end-of-sequence tokens that indicate the end of the final message, so that the model will simply extend the final message when it begins to generate text. This is useful for "prefilling" the model's response.

In [None]:
chat = [
    {"role": "user", "content": "Can you format the answer in JSON?"},
    {"role": "assistant", "content": '{"name": "'}, # partial response
]

formatted_chat = tokenizer.apply_chat_template(
    chat,
    tokenize=True,
    return_dict=True,
    return_tensors='pt',  # generate() expects tensors, not Python lists
    continue_final_message=True,
)
model.generate(**formatted_chat)

The model will generate text that continues the JSON string rather than starting a new message, which can be very useful for improving the accuracy of the model's instruction-following.

`add_generation_prompt` and `continue_final_message` cannot be used at the same time.
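A rough sketch of the idea, using a hypothetical Zephyr-like renderer rather than the library's actual implementation: the rendered chat is cut right after the content of the final message, and the two flags are rejected when combined.

```python
def render(messages, add_generation_prompt=False, continue_final_message=False, eos="</s>"):
    """Toy renderer showing how a final message can be left open for prefilling."""
    if add_generation_prompt and continue_final_message:
        raise ValueError(
            "add_generation_prompt and continue_final_message cannot be used together"
        )
    text = ""
    for message in messages:
        text += f"<|{message['role']}|>\n{message['content']}{eos}\n"
    if continue_final_message:
        # Drop everything after the final message's content, including its EOS token.
        final_content = messages[-1]["content"]
        text = text[: text.rindex(final_content) + len(final_content)]
    if add_generation_prompt:
        text += "<|assistant|>\n"
    return text


chat = [
    {"role": "user", "content": "Can you format the answer in JSON?"},
    {"role": "assistant", "content": '{"name": "'},  # partial response
]
prefilled = render(chat, continue_final_message=True)
print(repr(prefilled))
```

Because the string now ends in the middle of the assistant turn, the next generated token extends that turn.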

## Can I use chat templates in training?

It is recommended that we apply the chat template as a preprocessing step for our dataset. After that, we can continue as with any other language model training task.

When training, we should set `add_generation_prompt=False`, because the added tokens that prompt an assistant response are not helpful during training.

In [None]:
from transformers import AutoTokenizer
from datasets import Dataset

tokenizer = AutoTokenizer.from_pretrained('HuggingFaceH4/zephyr-7b-beta')

In [None]:
chat1 = [
    {"role": "user", "content": "Which is bigger, the moon or the sun?"},
    {"role": "assistant", "content": "The sun."}
]
chat2 = [
    {"role": "user", "content": "Which is bigger, a virus or a bacterium?"},
    {"role": "assistant", "content": "A bacterium."}
]

dataset = Dataset.from_dict(
    {'chat': [chat1, chat2]}
)
dataset = dataset.map(
    lambda x: {'formatted_chat': tokenizer.apply_chat_template(
        x['chat'],
        tokenize=False,
        add_generation_prompt=False
    )}
)

print(dataset['formatted_chat'][0])

From here we can continue training like we would with a standard language modeling task, using the `formatted_chat` column.

By default, some tokenizers add special tokens like `<bos>` and `<eos>` to text they tokenize. Chat templates should already include all the special tokens they need, and so additional special tokens will often be incorrect or duplicated, which will hurt model performance.

Therefore, if we format text with `apply_chat_template(tokenize=False)`, we should set the argument `add_special_tokens=False` when we tokenize that text later. If we use `apply_chat_template(tokenize=True)`, we do not need to worry about this.
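The duplication problem can be illustrated with a toy tokenizer. Everything in this sketch is hypothetical (a whitespace splitter that prepends `<s>` by default); it only mimics the behaviour described above and is not how a real tokenizer works internally.

```python
BOS = "<s>"


def toy_tokenize(text, add_special_tokens=True):
    """Hypothetical whitespace tokenizer that prepends BOS by default."""
    tokens = text.split()
    return [BOS] + tokens if add_special_tokens else tokens


# Pretend this string came from apply_chat_template(..., tokenize=False):
formatted = "<s> [INST] Hi there! [/INST]"

duplicated = toy_tokenize(formatted)                         # BOS appears twice
correct = toy_tokenize(formatted, add_special_tokens=False)  # BOS appears once
print(duplicated)
print(correct)
```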

# Multimodal Chat Templates for Vision and Audio LLMs

Multimodal models provide richer, more interactive experiences, and understanding how to effectively combine these inputs with our templates is the key.

## Image inputs

For models such as **LLaVA**, the prompts can be formatted as below. The `content` now is a list containing either a text or an image `type`.

In [None]:
from transformers import AutoProcessor

model_id = 'llava-hf/llava-onevision-qwen2-0.5b-ov-hf'
processor = AutoProcessor.from_pretrained(model_id)

In [None]:
messages = [
    {
        'role': 'system',
        'content': [{
                'type': 'text',
                'text': "You are a friendly chatbot who always responds in the style of a pirate."
        }]
    },
    {
        'role': 'user',
        'content': [
            {'type': 'image'},
            {'type': 'text', 'text': "What are these?"}
        ]
    }
]

formatted_prompt = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=False
)
print(formatted_prompt)

### Image paths or URLs

To incorporate images into our chat templates, we can pass them as file paths or URLs.

In [None]:
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "llava-hf/llava-onevision-qwen2-0.5b-ov-hf"
model = LlavaOnevisionForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

In [None]:
messages = [
    {
        'role': 'system',
        'content': [{
            'type': 'text',
            'text': "You are a friendly chatbot who always responds in the style of a pirate."
        }]
    },
    {
        'role': 'user',
        'content': [
            {'type': 'image', 'url': 'http://images.cocodataset.org/val2017/000000039769.jpg'},
            {'type': 'text', 'text': 'What are these?'}
        ]
    }
]

processed_chat = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors='pt'
)
print(processed_chat.keys())

This dictionary is ready to be passed to `model.generate()` to generate text.

In [None]:
model.generate(**processed_chat)

## Video inputs

### Sampling with a fixed number of frames

The `num_frames` parameter is passed to the `apply_chat_template()` method and controls how many frames to sample uniformly from the video.

Each model checkpoint has a maximum frame count it was trained with, and exceeding this limit can significantly impact generation quality.
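Uniform sampling itself is simple to sketch. The helper below is a hypothetical illustration of picking evenly spaced frame indices; the rounding rule used by the library's video loader may differ.

```python
def uniform_frame_indices(total_frames, num_frames):
    """Pick `num_frames` indices spread evenly across a video of `total_frames` frames."""
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    num_frames = min(num_frames, total_frames)  # cannot sample more frames than exist
    if num_frames == 1:
        return [0]
    step = (total_frames - 1) / (num_frames - 1)
    return [round(i * step) for i in range(num_frames)]


indices = uniform_frame_indices(total_frames=300, num_frames=4)
print(indices)
```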

We can also choose which backend is used to load the video. In this example, we use `decord`.

In [None]:
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "llava-hf/llava-onevision-qwen2-0.5b-ov-hf"
model = LlavaOnevisionForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

In [None]:
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a friendly chatbot who always responds in the style of a pirate"}],
    },
    {
      "role": "user",
      "content": [
            {"type": "video", "url": "https://test-videos.co.uk/vids/bigbuckbunny/mp4/h264/720/Big_Buck_Bunny_720_10s_10MB.mp4"},
            {"type": "text", "text": "What do you see in this video?"},
        ],
    },
]

processed_chat = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors='pt',
    num_frames=32,
    video_load_backend='decord',
)
print(processed_chat.keys())

In [None]:
model.generate(**processed_chat)

### Sampling with FPS

When working with long videos, we want to sample more frames for better representation. Instead of a fixed number of frames, we can specify `video_fps`, which determines how many frames per second to extract. For example, if a video is **10 seconds long** and we set `video_fps=2`, the model will sample **20 frames** (2 per second, uniformly spaced).
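The arithmetic can be checked with a short sketch. `fps_frame_indices` is a hypothetical helper, and the 30 fps native frame rate is an assumption chosen for the example; the library's own index computation may differ in detail.

```python
def fps_frame_indices(total_frames, native_fps, video_fps):
    """Sample roughly `video_fps` frames per second of video, evenly spaced."""
    duration = total_frames / native_fps          # length of the video in seconds
    count = int(duration * video_fps)             # how many frames to keep
    count = max(1, min(count, total_frames))      # stay within the available frames
    step = total_frames / count
    return [int(i * step) for i in range(count)]


# A 10-second clip recorded at 30 fps has 300 frames; video_fps=2 keeps 20 of them.
indices = fps_frame_indices(total_frames=300, native_fps=30, video_fps=2)
print(len(indices), indices[:3])
```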

In [None]:
# keep the same chat messages as above
processed_chat = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors='pt',
    video_fps=32,
    video_load_backend='decord',
)
print(processed_chat.keys())

In [None]:
model.generate(**processed_chat)

### Custom frame sampling with a function

Not all models sample frames **uniformly** - some require more complex logic to determine which frames to use.

We can **customize** frame selection:
* use the `sample_indices_fn` argument to pass a **callable function** for sampling
* if provided, this function **overrides** the standard `num_frames` and `fps` methods
* it receives all the arguments passed to `load_video` and must return **valid frame indices** to sample.

We should use `sample_indices_fn` when
* we need a custom sampling strategy (e.g., **adaptive frame selection** instead of uniform sampling)
* our model prioritizes **key moments** in a video rather than evenly spaced frames

In [None]:
# example
def sample_indices_fn(metadata, **kwargs):
    # samples only the first and the second frame
    return [0, 1]

processed_chat = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors='pt',
    sample_indices_fn=sample_indices_fn,
    video_load_backend='decord',
)
print(processed_chat.keys())

By using `sample_indices_fn`, we have **full control** over frame selection, making our model **more adaptable** to different video scenarios.

In [None]:
model.generate(**processed_chat)
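A slightly richer sampler might concentrate frames toward the end of a clip. The sketch below is only an illustration: `FakeVideoMetadata` is a hypothetical stand-in for whatever metadata object the loader passes in, and the assumption that it exposes a `total_num_frames` attribute should be checked against the installed library version.

```python
from dataclasses import dataclass


@dataclass
class FakeVideoMetadata:
    """Hypothetical stand-in for the metadata passed to sample_indices_fn."""
    total_num_frames: int


def tail_weighted_indices(metadata, num_samples=5, **kwargs):
    """Return sorted frame indices that get denser toward the end of the video."""
    last = metadata.total_num_frames - 1
    if num_samples < 2 or last < 1:
        return [0]
    indices = set()
    for i in range(num_samples):
        t = i / (num_samples - 1)           # 0.0 .. 1.0
        position = 1 - (1 - t) ** 2         # grows quickly at first, slowly near the end
        indices.add(int(position * last))
    return sorted(indices)


indices = tail_weighted_indices(FakeVideoMetadata(total_num_frames=100))
print(indices)
```

A function with this signature could then be passed as `sample_indices_fn`, provided the attribute assumption holds.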

### List of image frames as video

We can pass a list of image file paths, and the processor will automatically concatenate them into a video. We need to make sure that all images have the same size, as they are assumed to be from the same video.

In [None]:
frames_paths = ["/path/to/frame0.png", "/path/to/frame5.png", "/path/to/frame10.png"]
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a friendly chatbot who always responds in the style of a pirate"}],
    },
    {
      "role": "user",
      "content": [
            {"type": "video", "path": frames_paths},
            {"type": "text", "text": "What do you see in this video?"},
        ],
    },
]

processed_chat = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
)
print(processed_chat.keys())

## Multimodal conversational pipeline

In [None]:
# OpenAI conversation format
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What is in this image?",
            },
            {
                "type": "image_url",
                "image_url": {"url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
            },
        ],
    }
]

## Best practices for multimodal template configuration

To add a custom chat template for our multimodal LLM, we can create our template using [**Jinja**](https://jinja.palletsprojects.com/en/stable/templates/) and set it with `processor.chat_template`.

In some cases, we may want our template to handle a list of content from multiple modalities, while still supporting a plain string for text-only inference. Here is an example of how we can achieve that, using the `Llama-Vision` chat template:
```python
{% for message in messages %}
{% if loop.index0 == 0 %}{{ bos_token }}{% endif %}
{{ '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n' }}
{% if message['content'] is string %}
{{ message['content'] }}
{% else %}
{% for content in message['content'] %}
{% if content['type'] == 'image' %}
{{ '<|image|>' }}
{% elif content['type'] == 'text' %}
{{ content['text'] }}
{% endif %}
{% endfor %}
{% endif %}
{{ '<|eot_id|>' }}
{% endfor %}
{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}
```
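The branching in this template can be mirrored in plain Python to see why both input shapes produce the same prompt. This is a sketch of the logic only: the function name is hypothetical, and the default `bos_token` value is an assumption made for the example.

```python
def render_llama_vision_like(messages, bos_token="<|begin_of_text|>", add_generation_prompt=False):
    """Python mirror of the template: content may be a string or a list of typed parts."""
    text = bos_token
    for message in messages:
        text += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        content = message["content"]
        if isinstance(content, str):
            text += content                      # text-only shortcut
        else:
            for part in content:
                if part["type"] == "image":
                    text += "<|image|>"          # placeholder later filled by the processor
                elif part["type"] == "text":
                    text += part["text"]
        text += "<|eot_id|>"
    if add_generation_prompt:
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text


as_string = render_llama_vision_like([{"role": "user", "content": "Hi!"}])
as_list = render_llama_vision_like(
    [{"role": "user", "content": [{"type": "text", "text": "Hi!"}]}]
)
with_image = render_llama_vision_like(
    [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Hi!"}]}]
)
print(as_string == as_list)
```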

# Expanding Chat Templates with Tools and Documents

In addition to the required `messages` argument we need to pass to `apply_chat_template`, we can pass any keyword argument to `apply_chat_template` and it will be accessible inside the template.

There are some common use-cases, such as passing tools for function calling, or documents for retrieval-augmented generation.

## Tool use / function calling

"Tool use" LLMs can choose to call functions as external tools before generating an answer. When passing tools to a tool-use model, we can simply pass a list of functions to the `tools` argument:

```python
from datetime import datetime

def current_time():
    """Get the current local time as a string"""
    return str(datetime.now())

def multiply(a: float, b: float):
    """A function that multiplies two numbers
    
    Args:
        a: the first number to multiply
        b: the second number to multiply
    """
    return a * b


# define tools as a list
tools = [current_time, multiply]

model_input = tokenizer.apply_chat_template(
    messages,
    tools=tools
)
```


For the tools to work correctly, we should write our functions in the format above, so that they can be parsed correctly as tools:
* the function should have a descriptive name
* every argument must have a type hint
* the function must have a docstring in the standard **Google style** (in other words, an initial function description followed by an `Args:` block that describes the arguments, unless the function does not have any arguments.)
* do not include types in the `Args:` block. Type hints should go in the function header instead.
* the function can have a return type and a `Returns:` block in the docstring. However these are optional because most tool-use models ignore them.
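The rules above can be approximated with a small pre-flight check built on the standard `inspect` module. `check_tool_function` is a hypothetical helper with deliberately crude docstring matching; it is not part of `transformers` and does not replicate the library's parser.

```python
import inspect


def check_tool_function(fn):
    """Return a list of problems that would likely break tool parsing."""
    problems = []
    doc = inspect.getdoc(fn)
    params = inspect.signature(fn).parameters
    if not doc:
        problems.append("missing docstring")
    for name, param in params.items():
        if param.annotation is inspect.Parameter.empty:
            problems.append(f"argument '{name}' has no type hint")
        if doc and f"{name}:" not in doc:
            problems.append(f"argument '{name}' is not described in the docstring")
    if params and doc and "Args:" not in doc:
        problems.append("docstring has no 'Args:' block")
    return problems


def multiply(a: float, b: float):
    """A function that multiplies two numbers

    Args:
        a: the first number to multiply
        b: the second number to multiply
    """
    return a * b


def bad_tool(a, b: float):
    return a * b


good_report = check_tool_function(multiply)
bad_report = check_tool_function(bad_tool)
print(good_report)
print(bad_report)
```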

### Complete tool use example

We will use the 8B `Hermes-2-Pro` model. If we have more memory, we can try a larger model like **Command-R** or **Mixtral-8x22B**.

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

checkpoint = 'NousResearch/Hermes-2-Pro-Llama-3-8B'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    torch_dtype=torch.float16,
    device_map='auto',
)

Define some tools:

In [None]:
def get_current_temperature(location: str, unit: str) -> float:
    """Get the current temperature at a location

    Args:
        location: The location to get the temperature for, in the format "City, Country"
        unit: The unit to return the temperature in. (Choices: ['celsius', 'fahrenheit'])

    Returns:
        The current temperature at the specified location in the specified units
    """
    return 22. # dummy returns


def get_current_wind_speed(location: str) -> float:
    """Get the current wind speed in km/hr at a given location.

    Args:
        location: The location to get the wind speed for, in the format "City, Country"

    Returns:
        The current wind speed in km/hr at the specified location
    """
    return 6. # dummy returns


tools = [get_current_temperature, get_current_wind_speed]

Now we can set up the conversation:

In [None]:
messages = [
    {
        'role': 'system',
        'content': "You are a bot that responds to weather queries. You should reply with the unit used in the queried location."
    },
    {
        'role': 'user',
        'content': "Hey, what's the temperature in Paris right now?"
    }
]

In [None]:
inputs = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors='pt'
)

inputs = {k: v.to(model.device) for k, v in inputs.items()}
outputs = model.generate(
    **inputs,
    max_new_tokens=128
)
print(tokenizer.decode(outputs[0][len(inputs['input_ids'][0]): ]))

The model has called the function with valid arguments, in the format requested by the function docstring.

In [None]:
# complete chat history
print(tokenizer.decode(outputs[0]))

Now we can append the model's tool call to the conversation:

In [None]:
tool_call = {
    "name": "get_current_temperature",
    "arguments": {"location": "Paris, France", "unit": "celsius"}
}
messages.append(
    {
        "role": "assistant",
        "tool_calls": [{"type": "function", "function": tool_call}]
    }
)

Note that here the tool call is a dictionary, whereas in the OpenAI API the tool call's `arguments` are passed as a JSON string.

Now that we have added the tool call to the conversation, we can call the function and append the result to the conversation.

In [None]:
messages.append(
    {
        "role": "tool",
        "name": "get_current_temperature",
        "content": "22.0" # dummy returns
    }
)
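In the cell above the tool result is hard-coded. In a real application we would look up the requested function and run it. The sketch below shows one possible way to do that; `TOOL_REGISTRY` and `run_tool_call` are hypothetical names, and the temperature function is the same dummy as before.

```python
def get_current_temperature(location: str, unit: str) -> float:
    """Dummy tool that always reports the same temperature."""
    return 22.0


# Map tool names to the Python callables that implement them.
TOOL_REGISTRY = {"get_current_temperature": get_current_temperature}


def run_tool_call(tool_call, registry):
    """Execute a tool call dict and build the matching 'tool' message."""
    fn = registry[tool_call["name"]]
    result = fn(**tool_call["arguments"])
    return {"role": "tool", "name": tool_call["name"], "content": str(result)}


tool_call = {
    "name": "get_current_temperature",
    "arguments": {"location": "Paris, France", "unit": "celsius"},
}
tool_message = run_tool_call(tool_call, TOOL_REGISTRY)
print(tool_message)
```

The returned dictionary has the same shape as the message appended above, so it could be passed to `messages.append(...)` directly.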

Some model architectures, notably Mistral/Mixtral, also require a `tool_call_id`, which should be 9 randomly generated alphanumeric characters, assigned to the `id` key of the tool call dictionary. The same ID should also be assigned to the `tool_call_id` key of the tool response dictionary, so that tool calls can be matched to tool responses.

For Mistral/Mixtral models, the code above should be:

In [None]:
tool_call_id = "9Ae3bDc2F"  # Random ID, 9 alphanumeric characters
tool_call = {
    "name": "get_current_temperature",
    "arguments": {"location": "Paris, France", "unit": "celsius"}
}
messages.append(
    {
        "role": "assistant",
        "tool_calls": [{"type": "function", "id": tool_call_id, "function": tool_call}]
    }
)

messages.append(
    {
        "role": "tool",
        "tool_call_id": tool_call_id,
        "name": "get_current_temperature",
        "content": "22.0"
    }
)
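The ID above is hard-coded for readability. One possible way to generate such an ID, using only the standard library (`make_tool_call_id` is a hypothetical helper name):

```python
import secrets
import string


def make_tool_call_id(length=9):
    """Return a random alphanumeric ID of the given length."""
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))


tool_call_id = make_tool_call_id()
print(tool_call_id)
```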

Finally, let the assistant read the function outputs and continue chatting with the user:

In [None]:
inputs = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors='pt'
)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][len(inputs["input_ids"][0]):]))

In [None]:
print(tokenizer.decode(outputs[0]))

### Understanding the schemas

Each function we pass to the `tools` argument of `apply_chat_template` is converted into a [**JSON schema**](https://json-schema.org/learn/getting-started-step-by-step).

These schemas are passed to the model chat template. The tool-use models do not see our functions directly, and they never see the actual code inside them. What they care about is the function **definitions** and the **arguments** they need to pass to them - they care about what the tools do and how to use them, not how they work!

Generating JSON schemas to pass to the template should be automatic and invisible as long as our functions follow the specification above. If we encounter any problems or we want more control over the conversion, we can handle the conversion manually.

Example of a manual schema conversion:

In [None]:
from transformers.utils import get_json_schema

def multiply(a: float, b: float):
    """A function that multiplies two numbers

    Args:
        a: the first number to multiply
        b: the second number to multiply
    """
    return a * b

schema = get_json_schema(multiply)
schema

{'type': 'function',
 'function': {'name': 'multiply',
  'description': 'A function that multiplies two numbers',
  'parameters': {'type': 'object',
   'properties': {'a': {'type': 'number',
     'description': 'the first number to multiply'},
    'b': {'type': 'number', 'description': 'the second number to multiply'}},
   'required': ['a', 'b']}}}
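To build intuition for where this dictionary comes from, here is a heavily simplified sketch of the conversion using `inspect`. It is not the library's implementation: `sketch_json_schema` and `PY_TO_JSON` are hypothetical names, only four basic types are handled, and argument descriptions from the docstring are omitted.

```python
import inspect

# Minimal mapping from Python annotations to JSON schema type names.
PY_TO_JSON = {int: "integer", float: "number", str: "string", bool: "boolean"}


def sketch_json_schema(fn):
    """Build a bare-bones tool schema from a function's signature and docstring."""
    doc = inspect.getdoc(fn) or ""
    description = doc.split("\n", 1)[0].strip()   # first docstring line
    properties, required = {}, []
    for name, param in inspect.signature(fn).parameters.items():
        properties[name] = {"type": PY_TO_JSON[param.annotation]}
        if param.default is inspect.Parameter.empty:
            required.append(name)                 # no default value means required
    return {
        "type": "function",
        "function": {
            "name": fn.__name__,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


def multiply(a: float, b: float):
    """A function that multiplies two numbers

    Args:
        a: the first number to multiply
        b: the second number to multiply
    """
    return a * b


sketch = sketch_json_schema(multiply)
print(sketch["function"]["parameters"])
```

The type hints supply the `type` fields and the signature supplies `required`, which is why missing hints prevent a schema from being generated.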

We can edit these schemas, or even write them from scratch ourselves without using `get_json_schema` at all.

JSON schemas can be passed directly to the `tools` argument of `apply_chat_template`.

The more complex our schemas, the more likely the model is to get confused when dealing with them. We should keep function signatures simple where possible, with arguments (especially complex, nested arguments) kept to a minimum.

In [None]:
# A simple function that takes no arguments
current_time = {
  "type": "function",
  "function": {
    "name": "current_time",
    "description": "Get the current local time as a string.",
    "parameters": {
      'type': 'object',
      'properties': {}
    }
  }
}

# A more complete function that takes two numerical arguments
multiply = {
  'type': 'function',
  'function': {
    'name': 'multiply',
    'description': 'A function that multiplies two numbers',
    'parameters': {
      'type': 'object',
      'properties': {
        'a': {
          'type': 'number',
          'description': 'The first number to multiply'
        },
        'b': {
          'type': 'number', 'description': 'The second number to multiply'
        }
      },
      'required': ['a', 'b']
    }
  }
}

model_input = tokenizer.apply_chat_template(
    messages,
    tools = [current_time, multiply]
)

## Retrieval-augmented generation

**"Retrieval-augmented generation" (RAG)** LLMs can search a corpus of documents for information before responding to a query. This allows models to vastly expand their knowledge base beyond their limited context size. The template for RAG models should accept a `documents` argument. This should be a list of documents, where each document is a single dict with `title` and `contents` keys, both of which are strings (the exact key names depend on the model's template; the Command-R example below uses `text`). Because this format is much simpler than the JSON schemas used for tools, no helper functions are needed.

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = 'CohereForAI/c4ai-command-r-v01-4bit'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map='auto',
)
device = model.device

In [None]:
# 1a. Define conversation input
conversation = [
    {'role': 'user', 'content': 'What has Man always dreamed of?'}
]

# 1b. Define documents for RAG
documents = [
    {
        'title': 'The Moon: Our Age-Old Foe',
        'text': 'Man has dreamed of destroying the moon. In this essay, I shall...'
    },
    {
        "title": "The Sun: Our Age-Old Friend",
        "text": "Although often underappreciated, the sun provides several notable benefits..."
    }
]

# 2. Tokenize conversation and documents using a RAG template, returning PyTorch tensors
input_ids = tokenizer.apply_chat_template(
    conversation=conversation,
    documents=documents,
    chat_template='rag',
    tokenize=True,
    add_generation_prompt=True,
    return_tensors='pt'
).to(device)

# 3. Generate a response
gen_tokens = model.generate(
    input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.3
)

# 4. Decode and print the generated text along with generation prompt
gen_text = tokenizer.decode(gen_tokens[0])
gen_text

To check whether a model supports the `documents` input, we can run `print(tokenizer.chat_template)` and see whether the `documents` variable is used anywhere.
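This check can also be done programmatically with a simple substring search. `template_mentions` is a hypothetical helper; it accepts either a single template string or a dictionary of named templates, and the template strings in the example are made up for illustration.

```python
def template_mentions(chat_template, variable="documents"):
    """Return True if any template source mentions `variable`."""
    if chat_template is None:
        return False
    if isinstance(chat_template, dict):
        sources = list(chat_template.values())   # several named templates
    else:
        sources = [chat_template]                # one template string
    return any(variable in source for source in sources)


plain = "{% for message in messages %}{{ message['content'] }}{% endfor %}"
rag = "{% for doc in documents %}{{ doc['title'] }}{% endfor %}"

print(template_mentions(plain))
print(template_mentions({"default": plain, "rag": rag}))
```

With a real tokenizer, we would pass `tokenizer.chat_template` as the first argument. A substring match can give false positives, so reading the template remains the more reliable check.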

# Advanced Usage and Customizing Our Chat Templates

## Mechanism behind chat templates

The chat template for a model is stored in the `tokenizer.chat_template` attribute.

Here is a simplified version of the `Zephyr` chat template:
```python
{%- for message in messages %}
    {{- '<|' + message['role'] + '|>\n' }}
    {{- message['content'] + eos_token }}
{%- endfor %}
{%- if add_generation_prompt %}
    {{- '<|assistant|>\n' }}
{%- endif %}
```
This is a [**Jinja template**](https://jinja.palletsprojects.com/en/3.1.x/templates/). Jinja is a templating language that allows us to write simple code that generates text. Its code and syntax resemble Python. In pure Python, this template would look like this:
```python
for message in messages:
    print(f'<|{message["role"]}|>')
    print(message['content'] + eos_token)
if add_generation_prompt:
    print('<|assistant|>')
```

The template does three things:
* For each message, print the role enclosed in `<|` and `|>`, like `<|user|>` or `<|assistant|>`.
* Next, print the content of the message, followed by the end-of-sequence token `eos_token`.
* Finally, if `add_generation_prompt` is set, print the assistant token, so that the model knows to start generating an assistant response.

Jinja also supports more flexible and complex patterns. The following Jinja template formats inputs similarly to the way LLaMA formats them (note that the real LLaMA template includes handling for default system messages and slightly different system message handling in general):
```python
{%- for message in messages %}
    {%- if message['role'] == 'user' %}
        {{- bos_token + '[INST] ' + message['content'] + ' [/INST]' }}
    {%- elif message['role'] == 'system' %}
        {{- '<<SYS>>\\n' + message['content'] + '\\n<</SYS>>\\n\\n' }}
    {%- elif message['role'] == 'assistant' %}
        {{- ' '  + message['content'] + ' ' + eos_token }}
    {%- endif %}
{%- endfor %}
```
This template adds specific tokens like `[INST]` and `[/INST]` based on the role of each message. User, assistant, and system messages are distinguishable to the model because of the tokens they are wrapped in.

## Creating chat template

To create a chat template, we just write a Jinja template and set `tokenizer.chat_template`. For example, we could take the LLaMA template above and add `"[ASST]"` and `"[/ASST]"` to assistant messages:
```python
{%- for message in messages %}
    {%- if message['role'] == 'user' %}
        {{- bos_token + '[INST] ' + message['content'].strip() + ' [/INST]' }}
    {%- elif message['role'] == 'system' %}
        {{- '<<SYS>>\\n' + message['content'].strip() + '\\n<</SYS>>\\n\\n' }}
    {%- elif message['role'] == 'assistant' %}
        {{- '[ASST] '  + message['content'] + ' [/ASST]' + eos_token }}
    {%- endif %}
{%- endfor %}
```
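The `tokenizer.chat_template` attribute is saved in the `tokenizer_config.json` file when the tokenizer is saved or pushed to the Hub. For example, we can load an existing template, modify it, and upload the result: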
```python
template = tokenizer.chat_template
template = template.replace("SYS", "SYSTEM")  # Change the system token
tokenizer.chat_template = template  # Set the new template
tokenizer.push_to_hub("model_name")  # Upload your new template to the Hub!
```

## Models with multiple templates

Some models use different templates for different use cases. For example, they might use one template for normal chat and another for tool use or retrieval-augmented generation. In these cases, `tokenizer.chat_template` is a dictionary. This can cause some confusion, so where possible we recommend using a single template for all use cases.

When a tokenizer has multiple templates, `tokenizer.chat_template` will be a `dict`, where each key is the name of a template. The `apply_chat_template` method has special handling for certain template names.
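The selection logic can be sketched roughly as follows. This is an approximation for illustration, not the library's code: `pick_template` is a hypothetical function, and it assumes the special names are `default` (the usual fallback) and `tool_use` (preferred when tools are passed).

```python
def pick_template(chat_template, name=None, tools=None):
    """Choose a template string from a single template or a dict of named templates."""
    if isinstance(chat_template, str):
        return chat_template                     # only one template, nothing to choose
    if name is not None:
        return chat_template[name]               # explicitly requested, e.g. "rag"
    if tools is not None and "tool_use" in chat_template:
        return chat_template["tool_use"]         # tools were passed, prefer the tool template
    if "default" in chat_template:
        return chat_template["default"]
    raise ValueError("no template name given and no 'default' template found")


# Placeholder strings stand in for real Jinja template sources.
templates = {"default": "DEFAULT", "tool_use": "TOOLS", "rag": "RAG"}
print(pick_template(templates))
print(pick_template(templates, tools=[{"type": "function"}]))
print(pick_template(templates, name="rag"))
```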

## Choosing a template

When setting the template for a model that's already been trained for chat, we should ensure that the template exactly matches the message formatting that the model saw during training, or else we will probably experience performance degradation. This is true even if we are training the model further.

If we are training a model from scratch, or fine-tuning a base language model for chat, on the other hand, we have a lot of freedom to choose an appropriate template. LLMs are smart enough to learn to handle lots of different input formats.

One popular choice is the `ChatML` format:
```python
{%- for message in messages %}
    {{- '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n' }}
{%- endfor %}
```
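This template can be extended with support for generation prompts. Note that the version below does not add BOS or EOS tokens; if our model expects them, we must add them to the template ourselves, because `apply_chat_template` does not add special tokens automatically: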
```python
{%- if not add_generation_prompt is defined %}
    {%- set add_generation_prompt = false %}
{%- endif %}
{%- for message in messages %}
    {{- '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n' }}
{%- endfor %}
{%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n' }}
{%- endif %}
```

## Modifying chat templates

Jinja templates in transformers are identical to Jinja templates elsewhere. The conversation history is accessible inside our template as a variable called `messages`.

### Trimming whitespace

By default, Jinja will print any whitespace that comes before or after a block. This can be a problem for chat templates, which generally want to be very precise with whitespace. To avoid this, we strongly recommend writing our templates like this:
```python
{%- for message in messages %}
    {{- message['role'] + message['content'] }}
{%- endfor %}
```
rather than like this:
```python
{% for message in messages %}
    {{ message['role'] + message['content'] }}
{% endfor %}
```
Adding `-` to the opening delimiter (`{%-` or `{{-`) strips any whitespace, including newlines and indentation, that comes before the block.
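
The difference is easy to see by rendering both versions directly with the `jinja2` package. This is a minimal sketch that assumes a default `jinja2.Environment`. The environment that `transformers` builds internally may be configured differently, so treat the output as illustrative only.

```python
from jinja2 import Environment

messages = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello"},
]

# Same template body, with and without the whitespace-stripping "-".
trimmed_source = (
    "{%- for message in messages %}\n"
    "    {{- message['role'] + message['content'] }}\n"
    "{%- endfor %}"
)
untrimmed_source = (
    "{% for message in messages %}\n"
    "    {{ message['role'] + message['content'] }}\n"
    "{% endfor %}"
)

env = Environment()  # default settings: no automatic block trimming
trimmed = env.from_string(trimmed_source).render(messages=messages)
untrimmed = env.from_string(untrimmed_source).render(messages=messages)

# repr() makes newlines and indentation visible.
print(repr(trimmed))
print(repr(untrimmed))
```

The trimmed version renders the roles and contents back to back, whereas the untrimmed version carries the template's own newlines and indentation into the output.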

### Special variables

* `messages` contains the chat history as a list of message dicts.
* `tools` contains a list of tools in JSON schema format. Will be `None` or undefined if no tools are passed.
* `documents` contains a list of documents in the format `{"title": "Title", "contents": "Contents"}`, used for retrieval-augmented generation. Will be `None` or undefined if no documents are passed.
* `add_generation_prompt` is a `bool` that is `True` if the user has requested a generation prompt, and `False` otherwise. If this is set, our template should add the header for an assistant message to the end of the conversation. If our model does not have a specific header for assistant messages, we can ignore this flag.
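* Special tokens like `bos_token` and `eos_token`. These are extracted from `tokenizer.special_tokens_map`, so the exact tokens available inside each template depend on the parent tokenizer.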
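
The sketch below shows how a template might consume several of these variables at once. It renders with the `jinja2` package directly. The `<doc>` tags, the `role: content` layout, and the `<s>` BOS token are all made up for illustration and do not correspond to any particular model.

```python
from jinja2 import Environment

# Hypothetical template: prints the BOS token, optional documents, then messages.
source = r"""{{- bos_token }}
{%- if documents %}
    {%- for doc in documents %}
        {{- '<doc>' + doc['title'] + ': ' + doc['contents'] + '</doc>\n' }}
    {%- endfor %}
{%- endif %}
{%- for message in messages %}
    {{- message['role'] + ': ' + message['content'] + '\n' }}
{%- endfor %}"""

template = Environment().from_string(source)
messages = [{"role": "user", "content": "What is Jinja?"}]
documents = [{"title": "Jinja", "contents": "A templating engine."}]

# `documents` is left undefined in the first call and passed in the second.
without_docs = template.render(messages=messages, bos_token="<s>")
with_docs = template.render(messages=messages, documents=documents, bos_token="<s>")

print(without_docs)
print(with_docs)
```

Because an undefined variable is falsy in an `if` test, the same template works whether or not `documents` is passed.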

### Callable functions

Inside our templates, we can call:
* `raise_exception(msg)`: Raises a `TemplateException`. This is useful for debugging.
* `strftime_now(format_str)`: Equivalent to `datetime.now().strftime(format_str)`.
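
To experiment with these functions outside of `transformers`, we can register stand-ins on a plain `jinja2` environment. The two Python functions below are assumptions written for this sketch, not the library's implementations. In particular, the sketch raises `jinja2`'s `TemplateError`, and the role check is a made-up example.

```python
from datetime import datetime

from jinja2 import Environment
from jinja2.exceptions import TemplateError


def raise_exception(message):
    # Assumption: surface template problems as a Jinja TemplateError.
    raise TemplateError(message)


def strftime_now(format_str):
    return datetime.now().strftime(format_str)


env = Environment()
# Functions stored in `globals` are callable from inside any template.
env.globals["raise_exception"] = raise_exception
env.globals["strftime_now"] = strftime_now

template = env.from_string(r"""{{- 'Today is ' + strftime_now('%Y-%m-%d') + '\n' }}
{%- for message in messages %}
    {%- if message['role'] not in ['user', 'assistant'] %}
        {{- raise_exception('Unsupported role: ' + message['role']) }}
    {%- endif %}
    {{- message['role'] + ': ' + message['content'] + '\n' }}
{%- endfor %}""")

print(template.render(messages=[{"role": "user", "content": "Hi"}]))

try:
    template.render(messages=[{"role": "narrator", "content": "Hi"}])
except TemplateError as err:
    print("Template rejected the chat:", err)
```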

### Writing generation prompts

If our model expects a header for assistant messages, then our template must support adding the header when `add_generation_prompt` is set.

Here is an example of a template that formats messages ChatML-style, with generation prompt support:
```python
{{- bos_token }}
{%- for message in messages %}
    {{- '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n' }}
{%- endfor %}
{%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n' }}
{%- endif %}
```

The exact content of the assistant header will depend on our specific model, but it should always be the string that represents the start of an assistant message. That way, if a user applies our template with `add_generation_prompt=True` and then generates text, the model will write an assistant response.
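
To see the effect of the flag, we can render a ChatML-style template twice with the `jinja2` package. This is only a sketch: the `<s>` BOS token is a placeholder, and real tokenizers supply their own special tokens when `apply_chat_template` is called.

```python
from jinja2 import Environment

chatml = r"""{{- bos_token }}
{%- for message in messages %}
    {{- '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n' }}
{%- endfor %}
{%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n' }}
{%- endif %}"""

template = Environment().from_string(chatml)
messages = [{"role": "user", "content": "Hi there!"}]

# "<s>" is a placeholder BOS token chosen for this example.
without_prompt = template.render(
    messages=messages, bos_token="<s>", add_generation_prompt=False
)
with_prompt = template.render(
    messages=messages, bos_token="<s>", add_generation_prompt=True
)

print(without_prompt)
print(with_prompt)
```

The second string ends with an open assistant header, so a model continuing from it starts writing the assistant's reply rather than, say, extending the user's message.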

### Writing and debugging larger templates

Templates for new models and for features like tool use and RAG can be very long. In that case, it is easier to edit them in a separate file. We can extract a chat template to a file:
```python
with open("template.jinja", "w") as f:
    f.write(tokenizer.chat_template)
```
Or load the edited template back into the tokenizer:
```python
with open("template.jinja") as f:
    tokenizer.chat_template = f.read()
```
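
Before loading an edited template back, it can help to check that it still parses. The sketch below uses `jinja2`'s `Environment.parse` for this. The helper name `check_template` is hypothetical, and the two inline strings stand in for the contents of a file such as `template.jinja`.

```python
from jinja2 import Environment
from jinja2.exceptions import TemplateSyntaxError


def check_template(source):
    """Hypothetical helper: return None if `source` parses, else an error string."""
    try:
        Environment().parse(source)
    except TemplateSyntaxError as err:
        return f"line {err.lineno}: {err.message}"
    return None


good = "{%- for m in messages %}{{- m['content'] }}{%- endfor %}"
bad = "{%- for m in messages %}{{- m['content'] }}{%- endif %}"  # wrong closing tag

print(check_template(good))
print(check_template(bad))
```

A check like this only catches syntax errors, such as a `for` closed with `endif`. It does not tell us whether the rendered text matches the format the model expects.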

## Writing templates for tools

The whole point of chat templates is to allow code to be transferable across models. Deviating from the standard tools API means users will have to write custom code to use tools with our model, so we should stick to the standard API where possible.

The standard API consists of the following elements.

### Tool definitions

The template should expect that the variable `tools` will either be null (`None` or undefined) or a list of JSON schema dicts.

Example of JSON schema:
```yaml
{
  "type": "function",
  "function": {
    "name": "multiply",
    "description": "A function that multiplies two numbers",
    "parameters": {
      "type": "object",
      "properties": {
        "a": {
          "type": "number",
          "description": "The first number to multiply"
        },
        "b": {
          "type": "number",
          "description": "The second number to multiply"
        }
      },
      "required": ["a", "b"]
    }
  }
}
```
The following code then handles tools in our chat template:
```python
{%- if tools %}
    {%- for tool in tools %}
        {{- '<tool>' + tool['function']['name'] + '\n' }}
        {%- for argument in tool['function']['parameters']['properties'] %}
            {{- argument + ': ' + tool['function']['parameters']['properties'][argument]['description'] + '\n' }}
        {%- endfor %}
        {{- '\n</tool>' }}
    {%- endfor %}
{%- endif %}
```
The specific tokens and tool descriptions our template renders should be chosen to match the ones our model was trained with. There is no requirement that our model understands JSON schema input, only that our template can translate JSON schema into our model's format.
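
As a quick check, we can render a tool-definition template of this shape against the `multiply` schema with the `jinja2` package. This is a sketch: the `<tool>` tags are the illustrative ones used in this section, and each `for` loop is closed with `endfor`.

```python
from jinja2 import Environment

tool_template = r"""{%- if tools %}
    {%- for tool in tools %}
        {{- '<tool>' + tool['function']['name'] + '\n' }}
        {%- for argument in tool['function']['parameters']['properties'] %}
            {{- argument + ': ' + tool['function']['parameters']['properties'][argument]['description'] + '\n' }}
        {%- endfor %}
        {{- '\n</tool>' }}
    {%- endfor %}
{%- endif %}"""

multiply_schema = {
    "type": "function",
    "function": {
        "name": "multiply",
        "description": "A function that multiplies two numbers",
        "parameters": {
            "type": "object",
            "properties": {
                "a": {"type": "number", "description": "The first number to multiply"},
                "b": {"type": "number", "description": "The second number to multiply"},
            },
            "required": ["a", "b"],
        },
    },
}

template = Environment().from_string(tool_template)
rendered = template.render(tools=[multiply_schema])
print(rendered)

# With no tools passed, the template renders nothing.
print(repr(template.render(tools=None)))
```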

### Tool calls

Tool calls, if present, are a list attached to a message with the `"assistant"` role. Note that `tool_calls` is always a list, even though most tool-calling models only support a single tool call at a time, which means the list will usually have only one element.

Example of tool calls:
```yaml
{
  "role": "assistant",
  "tool_calls": [
    {
      "type": "function",
      "function": {
        "name": "multiply",
        "arguments": {
          "a": 5,
          "b": 6
        }
      }
    }
  ]
}
```
A common pattern for handling them:
```python
{%- if message['role'] == 'assistant' and 'tool_calls' in message %}
    {%- for tool_call in message['tool_calls'] %}
        {{- '<tool_call>' + tool_call['function']['name'] + '\n' + tool_call['function']['arguments']|tojson + '\n</tool_call>' }}
    {%- endfor %}
{%- endif %}
```
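
The pattern can be tried out with the `jinja2` package, as sketched below. One assumption is worth flagging: Jinja's built-in `tojson` filter returns HTML-safe markup, which would escape the surrounding `<tool_call>` tags when concatenated, so this sketch swaps in plain `json.dumps`. The `<tool_call>` tags themselves are illustrative.

```python
import json

from jinja2 import Environment

env = Environment()
# Assumption: we want plain JSON text, so replace Jinja's HTML-safe `tojson`.
env.filters["tojson"] = json.dumps

call_template = env.from_string(r"""{%- for message in messages %}
    {%- if message['role'] == 'assistant' and 'tool_calls' in message %}
        {%- for tool_call in message['tool_calls'] %}
            {{- '<tool_call>' + tool_call['function']['name'] + '\n' + tool_call['function']['arguments']|tojson + '\n</tool_call>' }}
        {%- endfor %}
    {%- endif %}
{%- endfor %}""")

messages = [
    {"role": "user", "content": "What is 5 times 6?"},
    {
        "role": "assistant",
        "tool_calls": [
            {
                "type": "function",
                "function": {"name": "multiply", "arguments": {"a": 5, "b": 6}},
            }
        ],
    },
]

rendered_call = call_template.render(messages=messages)
print(rendered_call)
```

This template only renders tool calls, so the user message is skipped. A complete template would also handle ordinary messages.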

### Tool responses

Tool responses have a simple format: each one is a message dict with
* the `"tool"` role,
* a `"name"` key giving the name of the called function, and
* a `"content"` key containing the result of the tool call.

Example of tool response:
```yaml
{
  "role": "tool",
  "name": "multiply",
  "content": "30"
}
```
If our model does not expect the function name to be included in the tool response, then we can render it as:
```python
{%- if message['role'] == 'tool' %}
    {{- "<tool_result>" + message['content'] + "</tool_result>" }}
{%- endif %}
```
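
To close the loop, the sketch below turns a tool call dict into a tool response message in plain Python. The registry `TOOLS` and the helper `run_tool_call` are hypothetical names introduced for this example. A real application would also need to handle unknown tool names and failed calls.

```python
def multiply(a, b):
    """A function that multiplies two numbers."""
    return a * b


# Hypothetical registry mapping tool names to Python callables.
TOOLS = {"multiply": multiply}


def run_tool_call(tool_call):
    """Execute one tool call dict and wrap the result as a tool message."""
    function = tool_call["function"]
    result = TOOLS[function["name"]](**function["arguments"])
    # The result is stored as a string under the "content" key.
    return {"role": "tool", "name": function["name"], "content": str(result)}


tool_call = {
    "type": "function",
    "function": {"name": "multiply", "arguments": {"a": 5, "b": 6}},
}

response = run_tool_call(tool_call)
print(response)
```

The resulting dict can be appended to the chat history, after the assistant message that contains the tool call, before the template is applied again.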