# Llama 3.1 Tool Use Exploration

This notebook explores the agentic capabilities in the Llama 3.1 instruction models. Our goal is to show how the tool system works under the hood, so you have a clear understanding of its capabilities. For deployment options, you may want to consider the [`llama-agentic-system`](https://github.com/meta-llama/llama-agentic-system) or [`llama-stack`](https://github.com/meta-llama/llama-stack) repos, which provide higher-level wrappers and convenience tools to build complete solutions.

We'll start with just the bare [`meta-llama/Meta-Llama-3.1-8B-Instruct`](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) model and show how to complete tool invocations.

In [1]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

In [2]:
model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0"
)


## Built-in Tools

The Llama 3.1 models were trained with knowledge about a few built-in tools, which include `brave_search` for Web browsing and `wolfram_alpha` for mathematical and computational queries. This means that the model already knows when it's a good time to use these tools, and how to interpret the results from the corresponding services.

You must opt in to use these tools; otherwise, the model will act as if it knew nothing about them. To enable them, start the system prompt with the two lines shown below: `Environment: ipython`, followed by a `Tools:` line listing the tools the model can use. The rest of the system prompt can be configured at will.

We use the chat template built into the tokenizer to convert our conversation (the system prompt plus a query) into a text input for the model.

In [56]:
system_prompt = """\
Environment: ipython
Tools: brave_search, wolfram_alpha

You are a helpful assistant"""

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Who are the contenders for the next USA presidential election in November?"},
]

tokenized = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

In [57]:
print(tokenized)

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

Environment: ipython
Tools: brave_search, wolfram_alpha

You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

Who are the contenders for the next USA presidential election in November?<|eot_id|><|start_header_id|>assistant<|end_header_id|>




Note that default dates are automatically added by the template as well.
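To make the structure of the rendered prompt explicit, here is a minimal sketch in plain Python that mirrors the output printed above. It is not the tokenizer's actual template (which is written in Jinja and handles more cases); the function name `render_llama31_prompt` and its parameters are hypothetical, and the date header is copied from the printed output.

```python
def render_llama31_prompt(messages, date_string="26 Jul 2024", add_generation_prompt=True):
    """Approximate the text produced by the chat template for simple conversations."""

    def header(role):
        # Every message starts with a role header followed by a blank line.
        return f"<|start_header_id|>{role}<|end_header_id|>\n\n"

    prompt = "<|begin_of_text|>"
    for message in messages:
        content = message["content"]
        if message["role"] == "system":
            # The template prepends the knowledge cutoff and today's date.
            content = (
                "Cutting Knowledge Date: December 2023\n"
                f"Today Date: {date_string}\n\n" + content
            )
        # Each message is closed with the end-of-turn token.
        prompt += header(message["role"]) + content + "<|eot_id|>"
    if add_generation_prompt:
        # An open assistant header invites the model to start its response.
        prompt += header("assistant")
    return prompt


example = [
    {"role": "system", "content": "Environment: ipython\nTools: brave_search, wolfram_alpha\n\nYou are a helpful assistant"},
    {"role": "user", "content": "Who are the contenders for the next USA presidential election in November?"},
]
print(render_llama31_prompt(example))
```

In practice, always use `tokenizer.apply_chat_template`; this sketch only illustrates where the headers, dates, and `<|eot_id|>` separators end up.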

Tokenization follows the particularities of the Llama 3.1 conversation template, with the delimiters and separators that were used during training. This is essential to preserve the quality of responses. Once we have verified that the input looks fine, we can actually run the tokenization to pass the numerical input ids to the model.

In [58]:
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

In [59]:
outputs = model.generate(
    input_ids,
    max_new_tokens=256,
)
response = outputs[0][input_ids.shape[-1]:]
assistant_response = tokenizer.decode(response, skip_special_tokens=False)
print(assistant_response)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|python_tag|>brave_search.call(query="USA presidential election contenders 2024")<|eom_id|>


Based on our input, the model correctly recognized the need to search the web for an answer. The `<|python_tag|>` token is simply the marker the model was trained to emit when a tool call is needed; it does not imply that the call has to be made in Python. The response ends with `<|eom_id|>` (end of message) rather than `<|eot_id|>` (end of turn), which signals that the model expects the tool output before it continues. We are free to use whatever mechanism we want to invoke the tool. All that matters is that it processes the `query` and returns the results in the format expected by the model.
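An application needs to extract the tool name and arguments from a response like the one above before it can dispatch the call. The following is one possible sketch, assuming the `tool_name.call(key="value")` shape seen in this output; the helper name `parse_builtin_tool_call` is hypothetical and not part of any library. It parses the arguments with `ast` so that nothing generated by the model is ever executed.

```python
import ast
import re

# Matches strings such as: brave_search.call(query="...")
TOOL_CALL_RE = re.compile(r"^(?P<name>\w+)\.call\((?P<args>.*)\)$", re.DOTALL)


def parse_builtin_tool_call(text):
    """Return (tool_name, kwargs) for a built-in tool call, or None otherwise."""
    body = text.strip()
    if not body.startswith("<|python_tag|>"):
        return None  # plain text answer, no tool call requested
    body = body.removeprefix("<|python_tag|>")
    for end_token in ("<|eom_id|>", "<|eot_id|>"):
        body = body.removesuffix(end_token)
    match = TOOL_CALL_RE.match(body.strip())
    if match is None:
        return None
    try:
        # Parse the argument list as a Python call expression without running it.
        call = ast.parse(f"f({match['args']})", mode="eval").body
        kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
    except (SyntaxError, ValueError):
        return None
    return match["name"], kwargs


sample = '<|python_tag|>brave_search.call(query="USA presidential election contenders 2024")<|eom_id|>'
print(parse_builtin_tool_call(sample))
```

Returning `None` for anything that does not look like a tool call lets the caller treat the text as a regular answer.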

To complete the query, we are going to use Meta's reference implementation of the tool, which we took from [the llama-stack repo](https://github.com/meta-llama/llama-stack/blob/f2e18826b6f9c0122eb714f3bda404da19d2355b/llama_toolchain/agentic_system/meta_reference/tools/builtin.py#L88). We do it this way because:
- It's already implemented.
- We must ensure the output format is exactly the one expected by the model. Using the reference implementation guarantees that's the case.

Before we proceed, note that this is **not** yet the final response from the model. Tool calling happens without user intervention, as an intermediate step that our application has to handle. If this code were part of a chat service, we'd display a progress indicator and probably a message saying that we are browsing the web to retrieve additional information. The user is never shown the `brave_search.call` response. What follows is what our code has to do to complete the query.

Since we are relying on the official implementation by Meta, we import the reference tool code from the `llama_toolchain` package in the llama-stack repo:

In [60]:
from llama_toolchain.agentic_system.meta_reference.tools.builtin import BraveSearchTool

The Brave search tool requires an API key to work. You can sign up for a free subscription (a credit card is required) [here](https://api.search.brave.com/register). It has a rate limit of 1 query per second, which is enough for development.

_Never_ hard-code API keys or tokens in source code; it's very insecure. I put my key in an environment variable because I'm writing this notebook locally. If you run it in Google Colab, you can add your key as a secret instead.

In [61]:
import os
brave = BraveSearchTool(os.environ["BRAVE_API_KEY"])
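If the environment variable is missing, `os.environ[...]` raises a bare `KeyError`. A small helper can fail with a clearer message instead; this is just a sketch, and the name `get_api_key` is hypothetical.

```python
import os


def get_api_key(name="BRAVE_API_KEY"):
    """Read an API key from the environment and fail with a helpful message if absent."""
    key = os.environ.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"Environment variable {name} is not set. "
            "Export it in your shell (or add it as a secret) before running this notebook."
        )
    return key
```

With this helper, the tool could be created as `BraveSearchTool(get_api_key())`.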

In [62]:
brave.get_name()

'brave_search'

The API to run the tool is [here](https://github.com/meta-llama/llama-stack/blob/f2e18826b6f9c0122eb714f3bda404da19d2355b/llama_toolchain/agentic_system/meta_reference/tools/builtin.py#L42), but it relies on higher-level message types that we are not using. Instead, we'll invoke the backend implementation directly, as follows:

In [63]:
results = await brave.run_impl("USA presidential election contenders 2024")
results

'{"query": "USA presidential election contenders 2024", "top_k": [{"title": "2024 United States presidential election - Wikipedia", "url": "https://en.wikipedia.org/wiki/2024_United_States_presidential_election", "description": "Jason Palmer, who surprised many ... president since Ted Kennedy in 1980. However, he suspended his campaign on May 15, <strong>2024</strong>. On March 12, <strong>2024</strong>, Biden secured a majority of delegates, becoming the presumptive Democratic nominee. Despite securing the nomination, Biden faced significant opposition ...", "type": "search_result"}, {"title": "Who Is Favored To Win The 2024 Presidential Election? | FiveThirtyEight", "url": "https://projects.fivethirtyeight.com/2024-election-forecast/", "description": "538\\u2019s <strong>2024</strong> <strong>presidential</strong> <strong>election</strong> forecast model showing Democrat Kamala Harris\\u2019s and Republican Donald Trump\\u2019s chances of winning.", "type": "search_result"}, {"title"

The response from the tool is a string that must be appended to the model's assistant turn, as follows:

In [64]:
assistant_turn = assistant_response + "\n\n" + results
print(assistant_turn)

<|python_tag|>brave_search.call(query="USA presidential election contenders 2024")<|eom_id|>

{"query": "USA presidential election contenders 2024", "top_k": [{"title": "2024 United States presidential election - Wikipedia", "url": "https://en.wikipedia.org/wiki/2024_United_States_presidential_election", "description": "Jason Palmer, who surprised many ... president since Ted Kennedy in 1980. However, he suspended his campaign on May 15, <strong>2024</strong>. On March 12, <strong>2024</strong>, Biden secured a majority of delegates, becoming the presumptive Democratic nominee. Despite securing the nomination, Biden faced significant opposition ...", "type": "search_result"}, {"title": "Who Is Favored To Win The 2024 Presidential Election? | FiveThirtyEight", "url": "https://projects.fivethirtyeight.com/2024-election-forecast/", "description": "538\u2019s <strong>2024</strong> <strong>presidential</strong> <strong>election</strong> forecast model showing Democrat Kamala Harris\u2019s a

In [65]:
messages.append({
    "role": "assistant", "content": assistant_turn
})

So our full conversation now consists of three messages: the system prompt, the user query, and an assistant turn that contains the tool call followed by the results from the tool. We send all this back to the model, while our progress indicator would still be running on the user's UI.

In [66]:
messages

[{'role': 'system',
  'content': 'Environment: ipython\nTools: brave_search, wolfram_alpha\n\nYou are a helpful assistant'},
 {'role': 'user',
  'content': 'Who are the contenders for the next USA presidential election in November?'},
 {'role': 'assistant',
  'content': '<|python_tag|>brave_search.call(query="USA presidential election contenders 2024")<|eom_id|>\n\n{"query": "USA presidential election contenders 2024", "top_k": [{"title": "2024 United States presidential election - Wikipedia", "url": "https://en.wikipedia.org/wiki/2024_United_States_presidential_election", "description": "Jason Palmer, who surprised many ... president since Ted Kennedy in 1980. However, he suspended his campaign on May 15, <strong>2024</strong>. On March 12, <strong>2024</strong>, Biden secured a majority of delegates, becoming the presumptive Democratic nominee. Despite securing the nomination, Biden faced significant opposition ...", "type": "search_result"}, {"title": "Who Is Favored To Win The 2024

We tokenize and use the model to generate additional text. Note that, in this case, we are setting `add_generation_prompt` to `False`. If we used `True`, the tokenizer would append the tokens that invite the model to start an assistant response. But we are still in the middle of an assistant response that is in the process of being finalized, so we don't need those tokens now.

In [67]:
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=False,
    return_tensors="pt"
).to(model.device)

In [68]:
outputs = model.generate(
    input_ids,
    max_new_tokens=256,
)
response = outputs[0][input_ids.shape[-1]:]
assistant_response = tokenizer.decode(response, skip_special_tokens=False)
print(assistant_response)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|start_header_id|>assistant<|end_header_id|>

The contenders for the next USA presidential election in November 2024 are:

* Kamala Harris (Democrat)
* Donald Trump (Republican)
* Jason Palmer (who suspended his campaign on May 15, 2024)
* Joe Biden (presumptive Democratic nominee)

Note: The list of contenders may not be exhaustive, as the search results only provide information on the most notable candidates.<|eot_id|>


If you skim over the search results returned by the tool, it's understandable that the model gave this response! Also, if the search query results are different when you run this notebook, you may get different results.

In any case, this is the **final response for this assistant turn**, to be presented to the user. If we want to continue the conversation, we need to replace the last assistant turn with this message.

In [70]:
# Strip the header and end-of-turn tokens: the chat template adds them again.
final_text = (
    assistant_response
    .removeprefix("<|start_header_id|>assistant<|end_header_id|>")
    .removesuffix("<|eot_id|>")
    .strip()
)
# Messages must be dicts with a role; a bare string would break the chat template.
messages[-1] = {"role": "assistant", "content": final_text}
messages

[{'role': 'system',
  'content': 'Environment: ipython\nTools: brave_search, wolfram_alpha\n\nYou are a helpful assistant'},
 {'role': 'user',
  'content': 'Who are the contenders for the next USA presidential election in November?'},
 {'role': 'assistant',
  'content': 'The contenders for the next USA presidential election in November 2024 are:\n\n* Kamala Harris (Democrat)\n* Donald Trump (Republican)\n* Jason Palmer (who suspended his campaign on May 15, 2024)\n* Joe Biden (presumptive Democratic nominee)\n\nNote: The list of contenders may not be exhaustive, as the search results only provide information on the most notable candidates.'}]
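The steps in this section (generate, detect the tool call, run the tool, append the results, generate again, store the final answer) can be collected into a single function. The sketch below is one way to do it under several assumptions: `generate` is a hypothetical wrapper around `apply_chat_template` plus `model.generate` that returns decoded text, `tools` maps tool names to async callables such as `brave.run_impl`, and tool calls have a single `query` argument. The demo uses stand-in functions so that it runs without a model or an API key.

```python
import asyncio
import re

# Assumes the single-argument shape seen above: tool_name.call(query="...")
CALL_RE = re.compile(r'<\|python_tag\|>(\w+)\.call\(query="(.*)"\)<\|eom_id\|>', re.DOTALL)


def clean(text):
    """Remove the assistant header and end-of-turn token from generated text."""
    text = text.strip().removeprefix("<|start_header_id|>assistant<|end_header_id|>")
    return text.strip().removesuffix("<|eot_id|>").strip()


async def run_turn(messages, generate, tools, max_tool_calls=3):
    """Run one assistant turn, resolving tool calls until a final answer appears."""
    base_len = len(messages)
    response = generate(messages, add_generation_prompt=True)
    for _ in range(max_tool_calls):
        match = CALL_RE.search(response)
        if match is None:
            break  # no tool call, so this is the final answer
        tool_name, query = match.groups()
        results = await tools[tool_name](query)
        # Intermediate turn: the tool call followed by the tool output.
        messages.append({"role": "assistant", "content": response + "\n\n" + results})
        response = generate(messages, add_generation_prompt=False)
    # Replace intermediate turns with the final answer shown to the user.
    del messages[base_len:]
    messages.append({"role": "assistant", "content": clean(response)})
    return messages[-1]["content"]


# Stand-ins for the model and the search tool, for demonstration only.
def fake_generate(messages, add_generation_prompt):
    if messages[-1]["role"] == "user":
        return '<|python_tag|>brave_search.call(query="demo")<|eom_id|>'
    return "<|start_header_id|>assistant<|end_header_id|>\n\nFinal answer.<|eot_id|>"


async def fake_search(query):
    return '{"query": "%s", "top_k": []}' % query


conversation = [{"role": "user", "content": "demo question"}]
# In a notebook, use `await run_turn(...)` instead of asyncio.run(...).
answer = asyncio.run(run_turn(conversation, fake_generate, {"brave_search": fake_search}))
print(answer)
```

Capping the number of tool calls keeps a misbehaving model from looping forever; if the cap is reached, the last raw response is stored as-is, which a real application would want to handle explicitly.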

----

## Creating a Custom Tool

## Unified Tool Use

From https://huggingface.co/blog/unified-tool-use

----