[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gsarti/ik-nlp-tutorials/blob/main/notebooks/W6T_Advanced_Prompting_Generation.ipynb)

In [None]:
# Run in Notebook to install local packages
!pip install torch transformers bitsandbytes accelerate rank_bm25 outlines datasets

# Advanced Prompting and Generation with 🤗 Transformers

*This notebook is based on the [HuggingFace Developer Guide](https://huggingface.co/docs/transformers/chat_templating) as well as this [introduction on agents](https://huggingface.co/docs/smolagents/conceptual_guides/intro_agents)*

## Setting up

Let's load the model and the tokenizer that we will be using for this tutorial. We will be using the Qwen2.5-1.5B-Instruct model from the 🤗 [HuggingFace Hub](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct), as it's lightweight and can use tools, which we will showcase later in this tutorial. We will load it using 8-bit quantization, to further reduce the VRAM requirements while maintaining relatively good performance.

In [2]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

checkpoint = "Qwen/Qwen2.5-1.5B-Instruct"

# Configure 8-bit quantization. We use this to save VRAM, as we don't have a lot available.
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True  # Enables 8-bit quantization
)

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    quantization_config=bnb_config,  # Apply BitsAndBytesConfig
    device_map="cuda"   # Assign to GPU
)
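
As a rough back-of-the-envelope check of what 8-bit loading buys us, the sketch below estimates the memory taken by the weights alone. It assumes roughly 1.5 billion parameters (read off the model name, not measured) and ignores activations, the KV cache, and any layers the quantization backend keeps in higher precision. <code>weight_memory_gb</code> is a hypothetical helper, not part of any library.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory needed for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 1.5e9  # assumption: ~1.5B parameters, taken from the model name

for label, bits in [("float32", 32), ("float16/bfloat16", 16), ("int8", 8)]:
    print(f"{label:>17}: ~{weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

Under these assumptions, going from 16-bit to 8-bit weights roughly halves the weight footprint.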

## Chat Templates

An increasingly common use case for LLMs is chat. In a chat context, rather than continuing a single string of text (as is the case with a standard language model), the model instead continues a conversation that consists of one or more messages, each of which includes a role, like “user” or “assistant”, as well as message text.

Much like tokenization, different models expect very different input formats for chat. This is the reason chat templates exist. Chat templates are part of the tokenizer. They specify how to convert conversations, represented as lists of messages, into a single tokenizable string in the format that the model expects.

Let’s make this concrete with a quick example using the previously loaded model:

In [3]:
chat = [
  {"role": "system", "content": "You are a helpful assistant"},
  {"role": "user", "content": "Hello, how are you?"},
  {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
  {"role": "user", "content": "I'd like to show off how chat templating works!"},
]

tokenizer.apply_chat_template(chat, tokenize=False)

"<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nHello, how are you?<|im_end|>\n<|im_start|>assistant\nI'm doing great. How can I help you today?<|im_end|>\n<|im_start|>user\nI'd like to show off how chat templating works!<|im_end|>\n"

Notice how the tokenizer has added the control tokens <code><|im_start|></code> and <code><|im_end|></code> to indicate the start and end of messages, and the entire chat is condensed into a single string. If we use <code>tokenize=True</code>, which is the default setting, that string will also be tokenized for us.

Now, if we had used a different model, such as [Mistral's 7B model](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1), the output would have been different:


```
<s> [INST] You are a helpful assistant
Hello, how are you? [/INST] I'm doing great. How can I help you today?</s>
[INST] I'd like to show off how chat templating works! [/INST]
```


The two models were trained with totally different chat formats. Without chat templates, you would have to write manual formatting code for each model, and it’s very easy to make minor errors that hurt performance! Chat templates handle the details of formatting for you, allowing you to write universal code that works for any model.



> **Good to Know**: Models fine-tuned using the same base model could still use different chat formats!



### How do I use chat templates?

As you can see in the example above, chat templates are easy to use. Simply build a list of messages, with role and content keys, and then pass it to the <code>apply_chat_template()</code> method. Once you do that, you’ll get output that’s ready to go! When using chat templates as input for model generation, it’s also a good idea to use <code>add_generation_prompt=True</code> to add a generation prompt.

Here’s an example of preparing input for <code>model.generate()</code>, using the model we previously loaded.


In [3]:
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
 ]
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")
print(tokenizer.decode(tokenized_chat[0])) # This will yield a string in the input format that our model expects.

<|im_start|>system
You are a friendly chatbot who always responds in the style of a pirate<|im_end|>
<|im_start|>user
How many helicopters can a human eat in one sitting?<|im_end|>
<|im_start|>assistant



Now that our input is formatted correctly for our model, we can use the model to generate a response to the user’s question:

In [4]:
outputs = model.generate(tokenized_chat, max_new_tokens=128)
print(tokenizer.decode(outputs[0]))

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


<|im_start|>system
You are a friendly chatbot who always responds in the style of a pirate<|im_end|>
<|im_start|>user
How many helicopters can a human eat in one sitting?<|im_end|>
<|im_start|>assistant
Ahoy there! That's quite an odd question, matey. Helicopters are not edible fare for humans, and even if they were, it would be more than just a few to swallow all at once. Let me tell you, sailors, our food is mostly fruits, veggies, fish, and other maritime delicacies. Now, what sea do ye wish to sail upon?<|im_end|>


> **Good to Know**: The <code>add_generation_prompt</code> argument tells the template to add tokens that indicate the start of a bot response, by simply appending <code><|im_start|>assistant</code>. This ensures that when the model generates text it will write a bot response instead of doing something unexpected, like continuing the user’s message. Remember, chat models are still just language models - they’re trained to continue text, and chat is just a special kind of text to them! You need to guide them with appropriate control tokens, so they know what they’re supposed to be doing.



### What does <code>continue_final_message</code> do?
When passing a list of messages to <code>apply_chat_template</code>, you can choose to format the chat so the model will continue the final message in the chat instead of starting a new one. This is done by removing any end-of-sequence tokens that indicate the end of the final message, so that the model will simply extend the final message when it begins to generate text. This is useful for “prefilling” the model’s response.

Here’s an example:


In [None]:
messages = [
    {"role": "user", "content": "Can you format the answer in JSON?"},
    {"role": "assistant", "content": '{"name": "'},
]

tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, continue_final_message=True, return_tensors="pt").to("cuda")
outputs = model.generate(tokenized_chat, max_new_tokens=128)
print(tokenizer.decode(outputs[0]))

<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
Can you format the answer in JSON?<|im_end|>
<|im_start|>assistant
{"name": "Qwen", "role": "helpful assistant"}<|im_end|>


The model will generate text that continues the JSON string, rather than starting a new message. This approach can be very useful for improving the accuracy of the model’s instruction-following when you know how you want it to start its replies.

> **Good to Know**: Because <code>add_generation_prompt</code> adds the tokens that start a new message, and <code>continue_final_message</code> removes any end-of-message tokens from the final message, it does not make sense to use them together. As a result, you’ll get an error if you try!
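
To make the effect of the two flags concrete, here is a minimal pure-Python sketch of a ChatML-style renderer. It is an illustration only: the real template is a Jinja template shipped with the tokenizer and handles more cases (for example, the default system prompt visible in the output above, or tools). <code>render_chatml</code> is a hypothetical helper, not a 🤗 Transformers function.

```python
def render_chatml(messages, add_generation_prompt=False, continue_final_message=False):
    """Render messages in a ChatML-like layout (simplified; no default system prompt, no tools)."""
    if add_generation_prompt and continue_final_message:
        raise ValueError("add_generation_prompt and continue_final_message are mutually exclusive")
    parts = []
    for i, message in enumerate(messages):
        is_last = i == len(messages) - 1
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}")
        # Leave the final message open so that generation continues it.
        if not (continue_final_message and is_last):
            parts.append("<|im_end|>\n")
    if add_generation_prompt:
        # Open a new assistant turn for the model to fill in.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "user", "content": "Can you format the answer in JSON?"},
    {"role": "assistant", "content": '{"name": "'},
]

prefilled = render_chatml(messages, continue_final_message=True)
new_turn = render_chatml(messages[:1], add_generation_prompt=True)
print(prefilled)
print(new_turn)
```

With <code>continue_final_message=True</code> the string ends with the prefilled <code>{"name": "</code>, while <code>add_generation_prompt=True</code> ends with an empty assistant header.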


## Tool use / function calling

“Tool use” LLMs can choose to call functions as external tools before generating an answer. When passing tools to a tool-use model, you can simply pass a list of functions to the <code>tools</code> argument:



In [None]:
from datetime import datetime

def current_time():
    """Get the current local time as a string."""
    return str(datetime.now())

def multiply(a: float, b: float):
    """
    A function that multiplies two numbers

    Args:
        a: The first number to multiply
        b: The second number to multiply
    """
    return a * b

tools = {
    "current_time": current_time,
    "multiply": multiply
}
model_input = tokenizer.apply_chat_template(
    messages,
    tools=list(tools.values())
)

> **Good to Know**:  In order for this to work correctly, you should write your functions in the format above, so that they can be parsed correctly as tools. Specifically, you should follow these rules:
1. The function should have a descriptive name
2. Every argument must have a type hint (e.g., "a: float")
3. The function must have a docstring in the standard Google style (in other words, an initial function description followed by an <code>Args:</code> block that describes the arguments, unless the function does not have any arguments)
4. Do not include types in the Args: block. In other words, write <code>a: The first number to multiply</code>, not <code>a (int): The first number to multiply</code>. Type hints should go in the function header instead.
5. The function can have a return type and a <code>Returns:</code> block in the docstring. However, these are optional because most tool-use models ignore them.
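
To see why these rules matter, here is a simplified sketch of how a function's signature and docstring could be turned into a JSON-schema-like description. It is not the conversion that 🤗 Transformers performs internally. <code>sketch_tool_schema</code> and the small <code>JSON_TYPES</code> mapping are assumptions made for illustration, using only the standard library.

```python
import inspect
import re
from typing import get_type_hints

# Assumption: a deliberately small mapping; any other type is reported as "object".
JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def sketch_tool_schema(func):
    """Build a simplified schema for `func` from its type hints and Google-style docstring."""
    doc = inspect.getdoc(func) or ""
    description, _, args_block = doc.partition("Args:")
    args_block = args_block.partition("Returns:")[0]
    # Collect "name: description" lines from the Args: block.
    arg_docs = dict(re.findall(r"^\s*(\w+):\s*(.+)$", args_block, flags=re.MULTILINE))
    hints = get_type_hints(func)
    properties, required = {}, []
    for name, param in inspect.signature(func).parameters.items():
        if name not in hints:
            raise TypeError(f"Argument '{name}' of {func.__name__} needs a type hint")
        properties[name] = {
            "type": JSON_TYPES.get(hints[name], "object"),
            "description": arg_docs.get(name, ""),
        }
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {
        "type": "function",
        "function": {
            "name": func.__name__,
            "description": description.strip(),
            "parameters": {"type": "object", "properties": properties, "required": required},
        },
    }


def multiply(a: float, b: float):
    """
    A function that multiplies two numbers

    Args:
        a: The first number to multiply
        b: The second number to multiply
    """
    return a * b


schema = sketch_tool_schema(multiply)
print(schema)
```

A missing type hint or an <code>Args:</code> line that does not follow the <code>name: description</code> pattern leaves the sketch without a type or description for that argument, which is the kind of gap the rules above are meant to prevent.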

### Passing tool results to the model
The sample code above is enough to list the available tools for your model, but what happens if it wants to actually use one? If that happens, you should:

1. Parse the model’s output to get the tool name(s) and arguments.
2. Add the model’s tool call(s) to the conversation.
3. Call the corresponding function(s) with those arguments.
4. Add the result(s) to the conversation.
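
The four steps above can be sketched as a single helper. It assumes the model wraps each call in <code>&lt;tool_call&gt;</code> tags containing JSON with <code>name</code> and <code>arguments</code> keys (the format Qwen2.5 produces in this notebook). Other models use other formats, and <code>handle_tool_calls</code> is a hypothetical name.

```python
import json
import re


def handle_tool_calls(model_output, tools, messages):
    """Parse tool calls, record them, run the functions, and record the results."""
    # Step 1: parse every <tool_call>...</tool_call> block into a dict.
    raw_calls = re.findall(r"<tool_call>(.*?)</tool_call>", model_output, re.DOTALL)
    calls = [json.loads(raw.strip()) for raw in raw_calls]
    if not calls:
        return messages
    # Step 2: add the model's tool call(s) to the conversation.
    messages.append({
        "role": "assistant",
        "tool_calls": [{"type": "function", "function": call} for call in calls],
    })
    for call in calls:
        func = tools.get(call["name"])
        # Step 3: call the function, or report an unknown tool back to the model.
        result = func(**call["arguments"]) if func else f"Error: unknown tool {call['name']}"
        # Step 4: add the result to the conversation.
        messages.append({"role": "tool", "name": call["name"], "content": str(result)})
    return messages


def multiply(a: float, b: float) -> float:
    return a * b


example_output = '<tool_call>\n{"name": "multiply", "arguments": {"a": 2.5, "b": 4}}\n</tool_call>'
conversation = handle_tool_calls(example_output, {"multiply": multiply}, [])
print(conversation)
```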

### A complete tool use example
Let’s walk through a tool use example, step by step.

First, let’s define our tools:

In [None]:
def get_current_temperature(location: str, unit: str) -> float:
    """
    Get the current temperature at a location.

    Args:
        location: The location to get the temperature for, in the format "City, Country"
        unit: The unit to return the temperature in. (choices: ["celsius", "fahrenheit"])
    Returns:
        The current temperature at the specified location in the specified units, as a float.
    """
    return 22.  # A real function should probably actually get the temperature!

def get_current_wind_speed(location: str) -> float:
    """
    Get the current wind speed in km/h at a given location.

    Args:
        location: The location to get the wind speed for, in the format "City, Country"
    Returns:
        The current wind speed at the given location in km/h, as a float.
    """
    return 6.  # A real function should probably actually get the wind speed!

tools = {
    "get_current_temperature": get_current_temperature,
    "get_current_wind_speed": get_current_wind_speed
}

Now, let’s set up a conversation for our bot:

In [None]:
messages = [
  {"role": "system", "content": "You are a bot that responds to weather queries. You should reply with the unit used in the queried location. Use the tools provided."},
  {"role": "user", "content": "Hey, what's the temperature in Paris right now in Celsius?"}
]

Now, let’s apply the chat template and generate a response:

In [None]:
inputs = tokenizer.apply_chat_template(messages, tools=list(tools.values()), add_generation_prompt=True, return_dict=True, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}
out = model.generate(**inputs, max_new_tokens=128)
decoded_output = tokenizer.decode(out[0][len(inputs["input_ids"][0]):])
print(decoded_output)

<tool_call>
{"name": "get_current_temperature", "arguments": {"location": "Paris, France", "unit": "celsius"}}
</tool_call><|im_end|>


Hopefully, the model produced a valid function call. Let's try to parse it.

In [None]:
import json
import re
# Extract function calls using regex
tool_call_match = re.search(r"<tool_call>(.*?)</tool_call>", decoded_output, re.DOTALL)

if tool_call_match:
    try:
        # Strip the matched string and load it as JSON
        tool_call = json.loads(tool_call_match.group(1).strip()) # Convert to Dictionary
        print("\nExtracted Function Call:", tool_call)
    except json.JSONDecodeError:
        print("Error parsing tool call:", tool_call_match.group(1))
else:
    print("No tool call found.")


Extracted Function Call: {'name': 'get_current_temperature', 'arguments': {'location': 'Paris, France', 'unit': 'celsius'}}


The model has called the function with valid arguments, in the format requested by the function docstring. It has inferred that we’re most likely referring to the Paris in France, and it remembered to request the temperature in Celsius.

Next, let’s get the result of the function call:

In [None]:
func_name = tool_call["name"]
args = tool_call["arguments"]

result = None
if func_name in tools:
    result = tools[func_name](**args)
else:
    print(f"Function {func_name} not found.")

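Now that we have the result, we can append the model’s tool call, followed by the tool result, to the conversation:

In [None]:
messages.append({"role": "assistant", "tool_calls": [{"type": "function", "function": tool_call}]})
messages.append({"role": "tool", "name": func_name, "content": str(result)})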

Finally, let’s let the assistant read the function outputs and continue chatting with the user:

In [None]:
inputs = tokenizer.apply_chat_template(messages, tools=list(tools.values()), add_generation_prompt=True, return_dict=True, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}
out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0][len(inputs["input_ids"][0]):]))

The current temperature in Paris is 22.0°C.<|im_end|>


Although this was a simple demo with dummy tools and a single call, the same technique works with multiple real tools and longer conversations. This can be a powerful way to extend the capabilities of conversational agents with real-time information, computational tools like calculators, or access to large databases.


## Advanced Prompting Techniques

In this section we cover few-shot prompting, structured generation, and chain-of-thought prompting: techniques that can improve the performance of Large Language Models, especially on reasoning tasks.

### Few-shot prompting


The basic prompts in the sections above are examples of “zero-shot” prompts: the model has been given instructions and context, but no examples with solutions. LLMs that have been fine-tuned on instruction datasets generally perform well on such “zero-shot” tasks. However, your task may have more complexity or nuance, or you may have requirements for the output that the model doesn’t pick up from the instructions alone. In this case, you can try a technique called few-shot prompting.

In few-shot prompting, we provide examples in the prompt, giving the model more context to improve its performance. The examples condition the model to generate output that follows the patterns they demonstrate.

Here’s an example:

In [20]:
messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": """Parse the date from the following text:
Text: The first human went into space and orbited the Earth on April 12, 1961.
Date: 12/04/1961
Text: The first-ever televised presidential debate in the United States took place on September 28, 1960, between presidential candidates John F. Kennedy and Richard Nixon.
Date:"""},
 ]
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt = True, return_tensors="pt").to("cuda")
outputs = model.generate(tokenized_chat, max_new_tokens=128, do_sample=False, temperature=None, top_p=None, top_k=None)
print(tokenizer.decode(outputs[0]))

<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Parse the date from the following text:
Text: The first human went into space and orbited the Earth on April 12, 1961.
Date: 12/04/1961
Text: The first-ever televised presidential debate in the United States took place on September 28, 1960, between presidential candidates John F. Kennedy and Richard Nixon.
Date:<|im_end|>
<|im_start|>assistant
September 28, 1960<|im_end|>


In the above code snippet we used a single example to demonstrate the desired output to the model, so this can be called “one-shot” prompting. However, depending on the task complexity, you may need to use more than one example.
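
When more examples are needed, it helps to build the prompt programmatically so that every example follows exactly the same layout. The sketch below reproduces the <code>Text:</code> / <code>Date:</code> layout used above. <code>build_few_shot_prompt</code> is a hypothetical helper, and the second example is one we added for illustration.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Format (text, date) examples followed by an unanswered query."""
    lines = [instruction]
    for text, date in examples:
        lines.append(f"Text: {text}")
        lines.append(f"Date: {date}")
    # The query ends with an empty "Date:" slot for the model to complete.
    lines.append(f"Text: {query}")
    lines.append("Date:")
    return "\n".join(lines)


instruction = "Parse the date from the following text:"
examples = [
    ("The first human went into space and orbited the Earth on April 12, 1961.", "12/04/1961"),
    # Added for illustration; not part of the original prompt.
    ("The Berlin Wall fell on November 9, 1989.", "09/11/1989"),
]
query = (
    "The first-ever televised presidential debate in the United States took place on "
    "September 28, 1960, between presidential candidates John F. Kennedy and Richard Nixon."
)

few_shot_prompt = build_few_shot_prompt(instruction, examples, query)
print(few_shot_prompt)
```

The resulting string can be used as the <code>content</code> of the user message, exactly like the hand-written prompt above.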

> **Good to Know**: Limitations of the few-shot prompting technique:
1. While LLMs can pick up on the patterns in the examples, this technique doesn’t work well on complex reasoning tasks.
2. Few-shot prompting requires creating lengthy prompts. Prompts with a large number of tokens can increase computation and latency. There’s also a limit to the length of the prompts.
3. Sometimes when given a number of examples, models can learn patterns that you didn’t intend them to learn, e.g. that the third movie review is always negative.

### Structured Generation with Outlines

In the few-shot example above, the model returns the correct date but ignores the DD/MM/YYYY format shown in the example. It is, however, possible to force LMs to generate according to pre-defined constraints, by working out the set of valid tokens at every step and sampling only from those. The [Outlines](https://dottxt-ai.github.io/outlines) library provides a simple API with many wrappers for this purpose, supporting structures like JSON, [Pydantic](https://docs.pydantic.dev/latest/) Python classes and regular expressions. Let's retry the previous example with regex-structured generation:

In [21]:
from outlines import models, generate
from outlines.generate.api import GenerationParameters, SamplingParameters

#(0[1-9]|[12][0-9]|3[01]) - Day: Either 01-09, 10-29, or 30-31
#/ - Literal forward slash
#(0[1-9]|1[0-2]) - Month: Either 01-09 or 10-12
#/ - Literal forward slash
#([12][0-9]{3}) - Year: 1000-2999
dateformat_regex = r"(0[1-9]|[12][0-9]|3[01])/(0[1-9]|1[0-2])/([12][0-9]{3})"

# Wrap the model in outlines
outlines_model = models.Transformers(model, tokenizer)

generator = generate.regex(outlines_model, dateformat_regex)

# Input should be in text format, not vectors (automatically processed by Outlines)
tokenized_chat_txt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt = True)

# See GenerationParameters, SamplingParameters for allowed generation args
answer = generator(tokenized_chat_txt, max_tokens=128)

print(answer)

28/09/1960


More info about structured generation can be found in the [Outlines documentation](https://dottxt-ai.github.io/outlines/latest/).
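
The core idea of only considering valid tokens at each step can be illustrated with a toy decoder. The sketch below does not reflect how Outlines is implemented: it enumerates a fixed set of allowed strings instead of handling a regex, and its "model" is a fixed table of made-up token scores. All names are hypothetical.

```python
def constrained_greedy_decode(token_scores, allowed_outputs):
    """Greedily pick the best-scoring token that keeps the text a prefix of an allowed output."""
    text = ""
    while text not in allowed_outputs:
        # Mask out every token that would make the text an invalid prefix.
        valid = [
            tok for tok in token_scores
            if any(target.startswith(text + tok) for target in allowed_outputs)
        ]
        if not valid:
            raise ValueError(f"No token can extend {text!r} towards an allowed output")
        text += max(valid, key=token_scores.get)
    return text


# Toy "model": made-up scores that ignore context (a real LM rescores at every step).
token_scores = {"kel": 0.9, "vin": 0.8, "fah": 0.5, "cel": 0.4,
                "ren": 0.3, "heit": 0.2, "si": 0.1, "us": 0.05}
allowed_outputs = {"celsius", "fahrenheit"}

unconstrained_first_token = max(token_scores, key=token_scores.get)
constrained = constrained_greedy_decode(token_scores, allowed_outputs)
print(unconstrained_first_token)  # kel
print(constrained)                # fahrenheit
```

Without the mask, the highest-scoring token would start an output outside the allowed set. With it, the decoder can only ever produce one of the permitted strings.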

### Chain of Thought (CoT)


Chain-of-thought (CoT) prompting is a technique that nudges a model to produce intermediate reasoning steps thus improving the results on complex reasoning tasks.

There are two ways of steering a model to produce the reasoning steps:

1. few-shot prompting by illustrating examples with detailed answers to questions, showing the model how to work through a problem.
2. by instructing the model to reason by adding phrases like “Let’s think step by step” or “Take a deep breath and work through the problem step by step.”

Here are two examples, one with and one without CoT (using the 2nd method above combined with <code>continue_final_message</code> to further incentivise the model to 'reason').
>Remember that <code>continue_final_message</code> and <code>add_generation_prompt</code> cannot be used together.

In [None]:
## No Chain of Thought
messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "There are 4 apples on the table. Jack eats one. Elen eats a pear, and adds an apple to the table. How many apples are on the table?"},
]

tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")
outputs = model.generate(tokenized_chat, max_new_tokens=128)
print(tokenizer.decode(outputs[0]))

<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
There are 4 apples on the table. Jack eats one. Elen eats a pear, and adds an apple to the table. How many apples are on the table?<|im_end|>
<|im_start|>assistant
Jack eats one apple, so there is still 1 apple left on the table.<|im_end|>


In [None]:
# Chain of Thought
messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "There are 4 apples on the table. Jack eats one. Elen eats a pear, and adds an apple to the table. How many apples are on the table?"},
    {"role": "assistant", "content": "Let's think step by step:"},
]

tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, continue_final_message=True, return_tensors="pt").to("cuda")
outputs = model.generate(tokenized_chat, max_new_tokens=128)
print(tokenizer.decode(outputs[0]))

<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
There are 4 apples on the table. Jack eats one. Elen eats a pear, and adds an apple to the table. How many apples are on the table?<|im_end|>
<|im_start|>assistant
Let's think step by step: Initially there were 4 apples on the table.
Jack ate one, so now there are 3 apples left.
Elen added one apple back to the table, making it 4 again.

Therefore, there are still 4 apples on the table.<|im_end|>


As a last example on this topic, you can of course combine few-shot prompting with CoT:

In [None]:
# Few-shot + Chain of Thought
messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": """
  Question: There are 4 apples on the table. Jack eats one. Elen eats a pear, and adds an apple to the table. How many apples are on the table?
  Reasoning: Let's think step by step: Initially there were 4 apples on the table. Jack ate one, so now there are 3 apples left. Elen added one apple back to the table, making it 4 again.
  Answer: 4
  Question: There are 6 oranges on the table. Sam takes two. Lily eats a banana and puts one orange on the table. How many oranges are on the table now?
    """},
    {"role": "assistant", "content": "Reasoning: Let's think step by step:"},
]

tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, continue_final_message=True, return_tensors="pt").to("cuda")
outputs = model.generate(tokenized_chat, max_new_tokens=128)
print(tokenizer.decode(outputs[0]))

<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user

    Question: There are 4 apples on the table. Jack eats one. Elen eats a pear, and adds an apple to the table. How many apples are on the table?
    Reasoning: Let's think step by step: Initially there were 4 apples on the table. Jack ate one, so now there are 3 apples left. Elen added one apple back to the table, making it 4 again.
    Answer: 4
    Question: There are 6 oranges on the table. Sam takes two. Lily eats a banana and puts one orange on the table. How many oranges are on the table now?
    <|im_end|>
<|im_start|>assistant
Reasoning: Let's think step by step: Initially, there were 6 oranges on the table. Sam took two, leaving 4 oranges. Lily put one orange back on the table, increasing the count to 5 oranges.

Answer: 5<|im_end|>
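
Because the few-shot example fixes an <code>Answer:</code> line, the final answer can be separated from the reasoning with a small parser. This is a minimal sketch that assumes the model sticks to that layout. <code>extract_answer</code> is a hypothetical helper.

```python
import re


def extract_answer(completion):
    """Return the text after the last 'Answer:' marker, or None if the marker is missing."""
    # Stop at a newline or at '<' so trailing control tokens such as <|im_end|> are dropped.
    matches = re.findall(r"Answer:\s*([^<\n]+)", completion)
    return matches[-1].strip() if matches else None


completion = (
    "Reasoning: Let's think step by step: Initially, there were 6 oranges on the table. "
    "Sam took two, leaving 4 oranges. Lily put one orange back on the table, "
    "increasing the count to 5 oranges.\n\nAnswer: 5<|im_end|>"
)
print(extract_answer(completion))  # 5
```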


## Best practices of LLM prompting
In this section of the guide we have compiled a list of best practices that tend to improve the prompt results:

1. When choosing the model to work with, the latest and most capable models are likely to perform better.
2. Start with a simple and short prompt, and iterate from there.
3. Put the instructions at the beginning of the prompt, or at the very end. When working with large context, models apply various optimizations to prevent Attention complexity from scaling quadratically. This may make a model more attentive to the beginning or end of a prompt than the middle.
4. Clearly separate instructions from the text they apply to.
5. Be specific and descriptive about the task and the desired outcome - its format, length, style, language, etc.
6. Avoid ambiguous descriptions and instructions.
7. Favor instructions that say “what to do” instead of those that say “what not to do”.
8. “Lead” the output in the right direction by writing the first word (or even begin the first sentence for the model).
9. Use advanced techniques like [Few-shot prompting](https://huggingface.co/docs/transformers/tasks/prompting#few-shot-prompting) and [Chain-of-thought](https://huggingface.co/docs/transformers/tasks/prompting#chain-of-thought).
10. Test your prompts with different models to assess their robustness.
11. Version and track the performance of your prompts.


## Retrieval-augmented generation

“Retrieval-augmented generation” or “RAG” LLMs can search a corpus of documents for information before responding to a query. This allows models to vastly expand their knowledge base beyond their limited context size.
Below we present a minimal example, using a list of four facts as documents. There are four main steps:
1. Document tokenisation (whitespace splitting) and indexing (using <code>BM25Okapi</code>).
2. Retrieval of most relevant documents (top 2 in this case, using <code>bm25.get_top_n</code>).
3. Use of <code>apply_chat_template</code> to prepare the prompt, inserting the context (retrieved documents) in the user message.
4. Generate the answer, as we usually do.
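
Step 1 matters more than it may seem. Plain whitespace splitting keeps case and punctuation, so a query token like <code>Paris?</code> never matches the document token <code>Paris,</code>. The sketch below compares it with a simple normalising tokenizer (lowercasing and keeping only word characters). This is one possible choice, not something BM25 requires, and both function names are hypothetical.

```python
import re


def whitespace_tokens(text):
    """Split on whitespace only, keeping case and punctuation."""
    return text.split()


def normalized_tokens(text):
    """Lowercase the text and keep only runs of word characters."""
    return re.findall(r"\w+", text.lower())


document = "The Eiffel Tower is located in Paris, France."
query = "What is the most famous landmark in Paris?"

for tokenize in (whitespace_tokens, normalized_tokens):
    shared = set(tokenize(query)) & set(tokenize(document))
    print(f"{tokenize.__name__}: {sorted(shared)}")
# whitespace_tokens: ['in', 'is']
# normalized_tokens: ['in', 'is', 'paris', 'the']
```

Only the normalised version lets the query and the document share the informative token <code>paris</code>.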

In [None]:
from rank_bm25 import BM25Okapi

# Sample document collection
documents = [
    "The Eiffel Tower is located in Paris, France.",
    "The capital of Germany is Berlin.",
    "Shakespeare wrote many famous plays.",
    "The Pacific Ocean is the largest ocean on Earth."
]

# Query for the model to answer
query = 'What is the most famous landmark in Paris?'

# Tokenize documents for BM25
tokenized_docs = [doc.split() for doc in documents]
bm25 = BM25Okapi(tokenized_docs)

# Retrieve top_k passages using BM25
top_k = 2
tokenized_query = query.split()
top_docs = bm25.get_top_n(tokenized_query, documents, n=top_k)

# Apply chat template
context = "\n".join(top_docs)
chat = [
  {"role": "system", "content": "You are a helpful assistant"},
  {"role": "user", "content": f"Context: {context}\nQuery: {query}"},
]

prompt = tokenizer.apply_chat_template(chat, tokenize=False)

print("Prompt:", prompt)
# Generate response
input_ids = tokenizer.encode(prompt, return_tensors='pt').to("cuda")
output_ids = model.generate(input_ids, max_length=100)
response = tokenizer.decode(output_ids[0], skip_special_tokens=True)

print("Response:", response)

Prompt: <|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Context: Shakespeare wrote many famous plays.
The Eiffel Tower is located in Paris, France.
Query: What is the most famous landmark in Paris?<|im_end|>

Response: system
You are a helpful assistant
user
Context: Shakespeare wrote many famous plays.
The Eiffel Tower is located in Paris, France.
Query: What is the most famous landmark in Paris?
system
The most famous landmark in Paris is the Eiffel Tower.

Note that the response repeats the whole prompt, because <code>decode</code> is applied to the full output sequence, and that the answer appears under a <code>system</code> header rather than an <code>assistant</code> one: the prompt was rendered without <code>add_generation_prompt=True</code>, so the model had to open the next turn by itself. Passing <code>add_generation_prompt=True</code> to <code>apply_chat_template</code> and decoding only the newly generated tokens, as in the tool-use example, avoids both issues.


## LLM Agents

### 🤔 What are agents?

Any efficient system using AI will need to provide LLMs some kind of access to the real world: for instance the possibility to call a search tool to get external information, or to act on certain programs in order to solve a task. In other words, LLMs should have *agency*. Agentic programs are the gateway to the outside world for LLMs.

> **AI Agents are programs where LLM outputs control the workflow.**


Any system leveraging LLMs will integrate the LLM outputs into code. The influence of the LLM’s outputs on the code workflow is the level of agency of LLMs in the system.

Note that with this definition, “agent” is not a discrete, 0 or 1 definition: instead, “agency” evolves on a continuous spectrum, as you give more or less power to the LLM on your workflow.

This agentic system runs in a loop, executing a new action at each step (the action can involve calling some pre-determined tools), until its observations make it apparent that a satisfactory state has been reached to solve the given task. Here’s an example of how a multi-step agent can solve a simple math question:

![LLM Agents Introduction](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/Agent_ManimCE.gif)
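
The loop described above can be sketched in a few lines. To keep the example runnable without a model, the LLM is replaced by a scripted <code>policy</code> function. All names here are hypothetical, and a real agent would prompt an LLM with the task and the observations collected so far.

```python
def run_agent(task, policy, tools, max_steps=5):
    """Minimal act-observe loop: ask the policy for an action until it returns a final answer."""
    observations = []
    for _ in range(max_steps):
        action = policy(task, observations)  # in a real agent, an LLM call
        if action["tool"] == "final_answer":
            return action["arguments"]["answer"]
        result = tools[action["tool"]](**action["arguments"])
        observations.append(result)  # fed back to the policy at the next step
    raise RuntimeError("No final answer within max_steps")


def multiply(a, b):
    return a * b


def add(a, b):
    return a + b


def scripted_policy(task, observations):
    """Stand-in for an LLM: solves 3 * 4 + 5 with two tool calls, then answers."""
    if len(observations) == 0:
        return {"tool": "multiply", "arguments": {"a": 3, "b": 4}}
    if len(observations) == 1:
        return {"tool": "add", "arguments": {"a": observations[0], "b": 5}}
    return {"tool": "final_answer", "arguments": {"answer": observations[-1]}}


answer = run_agent("What is 3 * 4 + 5?", scripted_policy, {"multiply": multiply, "add": add})
print(answer)  # 17
```

The <code>max_steps</code> cap is a common safeguard so that a policy which never reaches a satisfactory state cannot loop forever.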


### ✅ When to use agents / ⛔ when to avoid them

Agents are useful when you need an LLM to determine the workflow of an app. But they’re often overkill. The question is: do I really need flexibility in the workflow to efficiently solve the task at hand? If the pre-determined workflow falls short too often, that means you need more flexibility. Let’s take an example: say you’re making an app that handles customer requests on a surfing trip website.

You might know in advance that the requests will fall into one of two buckets (based on user choice), and you have a predefined workflow for each of these two cases:

1. Wants some knowledge on the trips? ⇒ give them access to a search bar to search your knowledge base.
2. Wants to talk to sales? ⇒ let them type in a contact form.

If that deterministic workflow fits all queries, by all means just code everything! This will give you a 100% reliable system with no risk of error introduced by letting unpredictable LLMs meddle in your workflow. For the sake of simplicity and robustness, it’s advised to regularize towards not using any agentic behaviour.

But what if the workflow can’t be determined that well in advance?

For instance, a user wants to ask:

> "I can come on Monday, but I forgot my passport so risk being delayed to Wednesday, is it possible to take me and my stuff to surf on Tuesday morning, with a cancellation insurance?"

This question hinges on many factors, and probably none of the predetermined criteria above will suffice for this request.

That is where an agentic setup helps.

In the above example, you could just make a multi-step agent that has access to a weather API for weather forecasts, Google Maps API to compute travel distance, an employee availability dashboard and a RAG system on your knowledge base.

Until recently, computer programs were restricted to pre-determined workflows, trying to handle complexity by piling up if/else switches. They focused on extremely narrow tasks, like “compute the sum of these numbers” or “find the shortest path in this graph”. But actually, most real-life tasks, like our trip example above, do not fit in pre-determined workflows. Agentic systems open up the vast world of real-world tasks to programs!


### Code agents

In a multi-step agent, at each step, the LLM can write an action, in the form of some calls to external tools. A common format (used by Anthropic, OpenAI, and many others) is some variation on “writing actions as a JSON of tool names and arguments to use, which you then parse to know which tool to execute and with which arguments”.

[Multiple research papers](https://huggingface.co/papers/2411.01747) have shown that having LLMs write their tool calls in code works much better.

The reason for this is simply that we crafted our programming languages specifically to be the best possible way to express actions performed by a computer. If JSON snippets were a better expression, JSON would be the top programming language and programming would be hell on earth.

The figure below, taken from [Executable Code Actions Elicit Better LLM Agents](https://huggingface.co/papers/2402.01030), illustrates some advantages of writing actions in code:


![Code not json](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/code_vs_json_actions.png)
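
One of these advantages, composing several tool calls in a single action, can be shown with a small sketch. The tools are dummies, the "model outputs" are hard-coded strings, and running code with <code>exec</code> and an emptied <code>__builtins__</code> is for illustration only: it is not a real sandbox, and model-written code should only be executed inside a properly isolated environment.

```python
def get_current_temperature(location: str, unit: str) -> float:
    return 22.0  # dummy value, as in the tool-use example


def multiply(a: float, b: float) -> float:
    return a * b


TOOLS = {"get_current_temperature": get_current_temperature, "multiply": multiply}

# JSON-style action: one call per action. Using its result in another call
# requires a further round trip through the model.
json_action = {
    "name": "get_current_temperature",
    "arguments": {"location": "Paris, France", "unit": "celsius"},
}
temperature = TOOLS[json_action["name"]](**json_action["arguments"])

# Code-style action: a single action can chain calls and keep intermediate values.
code_action = """
temperature = get_current_temperature("Paris, France", "celsius")
answer = multiply(temperature, 2)
"""
namespace = {"__builtins__": {}, **TOOLS}  # NOT a real sandbox; illustration only
exec(code_action, namespace)

print(temperature)          # 22.0
print(namespace["answer"])  # 44.0
```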

# Read More:


#### On Chat Templates:
1. [Automated Pipeline for Chat](https://huggingface.co/docs/transformers/chat_templating#is-there-an-automated-pipeline-for-chat)
2. [What does “continue_final_message” do?](https://huggingface.co/docs/transformers/chat_templating#what-does-continuefinalmessage-do)
3. [Using Chat Templates In Training](https://huggingface.co/docs/transformers/chat_templating#can-i-use-chat-templates-in-training)
4. [Understanding tool schemas](https://huggingface.co/docs/transformers/chat_templating#understanding-tool-schemas)
5. [Prompting vs Fine-Tuning](https://huggingface.co/docs/transformers/tasks/prompting#prompting-vs-fine-tuning)

#### On Agents:
1. [SmolAgents](https://huggingface.co/docs/smolagents/conceptual_guides/intro_agents#why-smolagents-)
2. [Agents Example: Text to SQL](https://huggingface.co/docs/smolagents/examples/text_to_sql)