# Llama-2 agent with grammar-based sampling of function calls

LLM agents can decompose user-defined tasks into smaller steps and use tools at each step until the task is completed. Tool usage requires reasoning about the current state, deciding which tool to use next, interacting with the environment by calling that tool, making an observation, updating the current state with the observation, and repeating this until done. This synergy of reasoning and acting can be achieved via prompt engineering, fine-tuning, or a combination of both.

An interface to the underlying LLM should either return a tool call response (to be executed by the caller) if the LLM decides to interact with the environment or a final response to the user. An example of such an interface is OpenAI's [function calling](https://platform.openai.com/docs/guides/function-calling) interface.
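The loop and the interface described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `ToolCall`, `FinalResponse`, `run_agent` and `scripted_model` are hypothetical names introduced here, and the "model" is a scripted stand-in rather than an LLM.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Union


@dataclass
class ToolCall:
    """Hypothetical response type: the model wants a tool to be executed."""
    tool: str
    arguments: dict


@dataclass
class FinalResponse:
    """Hypothetical response type: the model answers the user directly."""
    content: str


# A "model" is any callable mapping the history so far to the next action.
Model = Callable[[List[str]], Union[ToolCall, FinalResponse]]


def run_agent(model: Model, tools: Dict[str, Callable[..., str]],
              task: str, max_steps: int = 5) -> str:
    history = [f"User: {task}"]
    for _ in range(max_steps):
        action = model(history)
        if isinstance(action, FinalResponse):
            return action.content
        # The caller, not the model, executes the tool call.
        observation = tools[action.tool](**action.arguments)
        # The observation becomes part of the state for the next step.
        history.append(f"Observation from {action.tool}: {observation}")
    raise RuntimeError("agent did not finish within max_steps")


def scripted_model(history: List[str]) -> Union[ToolCall, FinalResponse]:
    # Stand-in for an LLM: call a tool once, then answer.
    if len(history) == 1:
        return ToolCall(tool="add", arguments={"a": 2, "b": 3})
    return FinalResponse(content=f"Done. {history[-1]}")


def add(a: int, b: int) -> str:
    return str(a + b)


result = run_agent(scripted_model, {"add": add}, "What is 2 + 3?")
print(result)
```

The important design point is the return type of the model: either a tool call that the caller executes, or a final response that ends the loop.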

## Scope

This article presents the results of my experiments implementing a function calling interface for a Llama-2 chat model with LangChain (full code [here](https://github.com/krasserm/grammar-based-agents)). I used the [grammar-based sampling](https://github.com/ggerganov/llama.cpp/pull/1773) feature of llama.cpp to ensure that function calls generated by the model follow a user-defined schema. The resulting interface is similar to the [ChatOpenAI](https://python.langchain.com/docs/integrations/chat/openai) function calling interface and can be used with LangChain's [agent framework](https://python.langchain.com/docs/modules/agents/).

Llama-2 is known to have some zero-shot tool usage capabilities but they are limited. In its current state, the implementation is a simple prototype for demonstrating grammar-based sampling in LangChain agents. It is general enough to be used with many other language models supported by llama.cpp, after some tweaks to the prompt templates.

## Agent

A Llama-2 agent with grammar-based sampling of function calls can be created as follows (details in section [Components](#components)). The example uses a 4-bit quantized [Llama-2 70b chat model](https://huggingface.co/TheBloke/Llama-2-70B-Chat-GGUF) running on a [llama.cpp server](https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md) (launch instructions [here](https://github.com/krasserm/grammar-based-agents/blob/master/README.md#getting-started)).

```python
from langchain.agents import AgentExecutor
from langchain.tools import StructuredTool
from langchain_experimental.chat_models.llm_wrapper import Llama2Chat

from gba.agent import Agent
from gba.llm import LlamaCppClient
from gba.tool import ToolCalling

from example_tools import (
    calculate,
    create_event,
    search_images,
    search_internet,
)

# Custom LangChain LLM that interacts with a model hosted on a llama.cpp server
llm = LlamaCppClient(url="http://localhost:8080/completion", temperature=-1)

# Converts incoming messages into a Llama-2 compatible chat prompt
# and implements the LangChain chat model interface
chat_model = Llama2Chat(llm=llm)

# Layers a tool calling protocol on top of chat_model resulting in
# an interface similar to the ChatOpenAI function calling interface
tool_calling_model = ToolCalling(model=chat_model)

# List of tools created from Python functions and used by the agent
tools = [
    StructuredTool.from_function(search_internet),
    StructuredTool.from_function(search_images),
    StructuredTool.from_function(create_event),
    StructuredTool.from_function(calculate),
]

# Custom LangChain agent implementation (similar to OpenAIFunctionsAgent)
agent_obj = Agent.from_llm_and_tools(
    model=tool_calling_model,
    tools=tools,
)

# LangChain's AgentExecutor for running the agent loop
agent = AgentExecutor.from_agent_and_tools(
    agent=agent_obj,
    tools=tools,
    verbose=True,
)
```

Here's an example of a request that is decomposed by the agent into multiple steps, using a tool at each step.

```python
agent.run("Who is Leonardo DiCaprio's current girlfriend and what is her age raised to the 0.24 power?")
```

```text
> Entering new AgentExecutor chain...

Reasoning: I need to call the tool 'search_internet' to find out who Leonardo DiCaprio's current girlfriend is and then calculate her age raised to the 0.24 power using the tool 'calculate'.

Invoking: `search_internet` with `{'query': "Leonardo DiCaprio's current girlfriend"}`
Leonardo di Caprio started dating Vittoria Ceretti in 2023. She was born in Italy and is 25 years old
Reasoning: I need to call another tool to obtain more information, specifically the tool "calculate" to calculate Vittoria Ceretti's age raised to the power of 0.24.

Invoking: `calculate` with `{'expression': '25^0.24'}`
2.16524
Reasoning: I have enough information to respond with a final answer to the user.
Vittoria Ceretti, 2.16524

> Finished chain.

'Vittoria Ceretti, 2.16524'
```

The answer `Vittoria Ceretti, 2.16524` is short but correct (at the time of writing). At each step, the agent first reasons about the current state and then decides on a tool call using [grammar-based sampling on the server side](https://github.com/ggerganov/llama.cpp/pull/2532) (details in section [ToolCalling](#toolcalling)).

The agent may also decide to respond to the user directly if tool usage is not necessary. 

```python
agent.run("Tell me a joke")
```

```text
> Entering new AgentExecutor chain...

Reasoning: I can directly respond to the user with a joke, here's one: "Why don't scientists trust atoms? Because they make up everything!"
Why don't scientists trust atoms? Because they make up everything!

> Finished chain.

"Why don't scientists trust atoms? Because they make up everything!"
```

### Conversational agent

To maintain conversational state with the user, the agent can be configured with a memory object.

```python
from langchain.memory import ConversationBufferMemory
from langchain.prompts import MessagesPlaceholder

chat_history = MessagesPlaceholder(variable_name="chat_history")
memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True,
)

# Agent that additionally maintains conversational state with the user
conversational_agent_obj = Agent.from_llm_and_tools(
    model=tool_calling_model,
    tools=tools,
    extra_prompt_messages=[chat_history],
)
conversational_agent = AgentExecutor.from_agent_and_tools(
    agent=conversational_agent_obj,
    tools=tools,
    memory=memory,
    verbose=True,
)
```

Let's start the conversation by requesting an image of a brown dog.

```python
conversational_agent.run("find an image of a brown dog")
```

```text
> Entering new AgentExecutor chain...

Reasoning: I need to call the tool search_images(query: str) to find an image of a brown dog.

Invoking: `search_images` with `{'query': 'brown dog'}`
[brown_dog_1.jpg](https://example.com/brown_dog_1.jpg)
Reasoning: I have enough information to respond with a final answer, and I can provide the user with the URL of the image.
Here is an image of a brown dog: <https://example.com/brown_dog_1.jpg>

> Finished chain.

'Here is an image of a brown dog: <https://example.com/brown_dog_1.jpg>'
```

The next request refers to the previous one, and the agent updates the search query accordingly.

```python
conversational_agent.run("dog should be running too")
```

```text
> Entering new AgentExecutor chain...

Reasoning: I need to call another tool to search for images of a running dog.

Invoking: `search_images` with `{'query': 'brown dog running'}`
[brown_dog_running_1.jpg](https://example.com/brown_dog_running_1.jpg)
Reasoning: I have enough information to respond with a final answer.
Here is an image of a brown dog running: <https://example.com/brown_dog_running_1.jpg>

> Finished chain.

'Here is an image of a brown dog running: <https://example.com/brown_dog_running_1.jpg>'
```

The agent was able to create the query `brown dog running` only because it had access to conversational state (`brown` is mentioned only in the first request).
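The role of the memory object can be illustrated without LangChain. The following is a minimal sketch, assuming a buffer that simply replays all previous turns on every request; `BufferMemory` and `echo_agent` are hypothetical names, and the stand-in agent only concatenates user turns instead of calling a model.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (role, content)


class BufferMemory:
    """Keeps every turn and passes the full history to the agent."""

    def __init__(self) -> None:
        self.turns: List[Turn] = []

    def run(self, agent: Callable[[List[Turn]], str], request: str) -> str:
        self.turns.append(("user", request))
        # The agent sees the current request plus everything before it.
        reply = agent(list(self.turns))
        self.turns.append(("assistant", reply))
        return reply


def echo_agent(turns: List[Turn]) -> str:
    # Stand-in agent: builds a search query from all user turns seen so far.
    return " + ".join(content for role, content in turns if role == "user")


memory_sketch = BufferMemory()
first = memory_sketch.run(echo_agent, "brown dog")
second = memory_sketch.run(echo_agent, "running")
print(second)
```

Without the buffer, the second call would only see `running`, which mirrors why the real agent needs the chat history to produce `brown dog running`.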

## Components

### `LlamaCppClient`

`LlamaCppClient` is a proxy for a model hosted on a llama.cpp server. It implements LangChain's `LLM` interface and relies on the caller to provide a valid Llama-2 chat prompt.

```python
prompt = "<s>[INST] <<SYS>>\nYou are a helpful assistant\n<</SYS>>\n\nFind an image of brown dog [/INST]"
llm(prompt)
```

```text
"Sure, here's a cute image of a brown dog for you! 🐶💩\n\n[Image description: A brown dog with a wagging tail, sitting on a green grassy field. The dog has a friendly expression and is looking directly at the camera.]\n\nI hope this image brings a smile to your face! Is there anything else I can assist you with? 😊"
```

If a tool schema is provided, it is converted by `LlamaCppClient` to a grammar used for constrained sampling on the server side so that the output is a valid instance of that schema.

```python
from gba.tool import tool_to_schema

tool_schema = tool_to_schema(StructuredTool.from_function(search_images))
tool_schema
```

```text
{'type': 'object',
 'properties': {'tool': {'const': 'search_images'},
  'arguments': {'type': 'object',
   'properties': {'query': {'type': 'string'}}}}}
```

```python
llm(prompt, schema=tool_schema)
```

```text
'{ "tool": "search_images", "arguments": { "query": "brown dog" } }'
```
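To give an idea of what a schema-to-grammar conversion involves, here is a deliberately simplified sketch that handles only the schema subset shown above (`const`, `string`, and `object` with `properties`) and emits a grammar in llama.cpp's GBNF notation. It is not the converter used by `LlamaCppClient` or llama.cpp; the function names `literal`, `to_rule` and `to_grammar` are assumptions made for this illustration.

```python
import json


def literal(text: str) -> str:
    """Quote text as a GBNF string literal."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def to_rule(schema: dict) -> str:
    """Translate a tiny JSON schema subset into a GBNF expression."""
    if "const" in schema:
        # A constant must appear verbatim, including its JSON quotes.
        return literal(json.dumps(schema["const"]))
    if schema.get("type") == "string":
        return "string"
    if schema.get("type") == "object":
        # Properties are emitted in declaration order, all required.
        members = [
            literal(json.dumps(name)) + ' ws ":" ws ' + to_rule(sub)
            for name, sub in schema["properties"].items()
        ]
        return '"{" ws ' + ' ws "," ws '.join(members) + ' ws "}"'
    raise ValueError(f"unsupported schema: {schema}")


def to_grammar(schema: dict) -> str:
    return "\n".join([
        "root ::= " + to_rule(schema),
        r'string ::= "\"" ([^"\\] | "\\" ["\\/bfnrt])* "\""',
        r"ws ::= [ \t\n]*",
    ])


example_schema = {
    "type": "object",
    "properties": {
        "tool": {"const": "search_images"},
        "arguments": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
        },
    },
}

grammar = to_grammar(example_schema)
print(grammar)
```

Because the sampler can only emit tokens that keep the output inside this grammar, the tool name is forced to `search_images` and only the `query` string is left to the model. A complete converter would also need to cover numbers, arrays, optional properties and schema combinators such as `oneOf`.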

### `Llama2Chat`

[Llama2Chat](https://python.langchain.com/docs/integrations/chat/llama2_chat) wraps `llm` and implements a chat model interface that applies the Llama-2 chat prompt to incoming messages (see also [these](https://github.com/langchain-ai/langchain/pull/8295#issuecomment-1668988543) [examples](https://github.com/langchain-ai/langchain/pull/8295#issuecomment-1811914445) for other chat prompt formats). 

```python
from langchain.schema.messages import HumanMessage, SystemMessage

messages = [
    SystemMessage(content="You are a helpful assistant"),
    HumanMessage(content="Find an image of brown dog"),
]
chat_model.predict_messages(messages=messages)
```

```text
AIMessage(content="Sure, here's a cute image of a brown dog for you! 🐶💩\n\n[Image description: A brown dog with a wagging tail, sitting on a green grassy field. The dog has a friendly expression and is looking directly at the camera.]\n\nI hope this image brings a smile to your face! Is there anything else I can assist you with? 😊")
```

```python
chat_model.predict_messages(
    messages=messages,
    schema=tool_to_schema(StructuredTool.from_function(search_images)),
)
```

```text
AIMessage(content='{ "tool": "search_images", "arguments": { "query": "brown dog" } }')
```
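The mapping from chat messages to a Llama-2 prompt string can be sketched as follows. This is a rough illustration of the prompt format and not the `Llama2Chat` implementation; the function name `llama2_prompt` and the `(role, content)` tuple representation are assumptions made here.

```python
from typing import List, Tuple


def llama2_prompt(messages: List[Tuple[str, str]]) -> str:
    """Render (role, content) pairs as a Llama-2 chat prompt string."""
    system = ""
    prompt = ""
    for role, content in messages:
        if role == "system":
            # The system prompt is folded into the next user turn.
            system = f"<<SYS>>\n{content}\n<</SYS>>\n\n"
        elif role == "user":
            prompt += f"<s>[INST] {system}{content} [/INST]"
            system = ""
        elif role == "assistant":
            # A completed assistant turn closes the current sequence.
            prompt += f" {content} </s>"
        else:
            raise ValueError(f"unknown role: {role}")
    return prompt


rendered = llama2_prompt([
    ("system", "You are a helpful assistant"),
    ("user", "Find an image of brown dog"),
])
print(repr(rendered))
```

For the two messages used in this section, the sketch reproduces the hand-written prompt that was passed to `llm` directly in the `LlamaCppClient` section.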

### `ToolCalling`

`ToolCalling` adds tool calling functionality to the wrapped `chat_model`, resulting in an interface very similar to the function calling interface of `ChatOpenAI`. It converts a list of provided tools to a [oneOf](https://json-schema.org/understanding-json-schema/reference/combining#oneof) JSON schema and submits it to the wrapped chat model. It also constructs a system prompt internally that informs the backend LLM about the presence of these tools and adds a special `respond_to_user` tool used by the LLM for providing a final response.
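The schema construction described above might look roughly like this. It is a sketch under assumptions, not the code in `gba.tool`: the helper names `make_tool_schema` and `tools_to_schema` are hypothetical, and so is the `message` argument of the `respond_to_user` tool.

```python
from typing import Dict, List


def make_tool_schema(name: str, arguments: Dict[str, dict]) -> dict:
    """Schema for a single tool call: fixed tool name plus typed arguments."""
    return {
        "type": "object",
        "properties": {
            "tool": {"const": name},
            "arguments": {"type": "object", "properties": arguments},
        },
    }


def tools_to_schema(tool_schemas: List[dict]) -> dict:
    """Combine tool schemas so that exactly one of them must match."""
    # Assumption: the final response is modeled as just another tool
    # with a single string argument, here called "message".
    respond = make_tool_schema("respond_to_user", {"message": {"type": "string"}})
    return {"oneOf": tool_schemas + [respond]}


combined = tools_to_schema([
    make_tool_schema("search_images", {"query": {"type": "string"}}),
    make_tool_schema("calculate", {"expression": {"type": "string"}}),
])
print([s["properties"]["tool"]["const"] for s in combined["oneOf"]])
```

Modeling the final answer as a tool keeps the output format uniform: every model response in the acting phase is a tool call, and the caller only has to check whether the selected tool is `respond_to_user`.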

```python
chat = [HumanMessage(content="What is (2 * 5) raised to power of 0.8 divided by 2?")]
tool_calling_model.predict_messages(messages=chat, tools=tools)
```

```text
Reasoning: I need to call the calculate tool to evaluate the expression and obtain the result.

AIMessage(content='', additional_kwargs={'tool_call': {'tool': 'calculate', 'arguments': {'expression': '(2 * 5) ** 0.8 / 2'}}})
```

The wrapped `chat_model` first receives a message with an instruction to reason about the current state. This message doesn't contain a schema, so the model output is unconstrained. The model then receives another message with an instruction to act, i.e. to respond with a tool call. This message contains the JSON schema of the provided tools so that the tool call can be generated with grammar-based sampling.
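The two phases can be sketched with a stand-in model. Everything here is an assumption for illustration: the `Model` signature, the prompt wording, and `fake_model`, which returns canned text instead of sampling from an LLM. It is not how `ToolCalling` builds its messages.

```python
import json
from typing import Callable, Optional

# Hypothetical model signature: prompt plus optional JSON schema -> text
Model = Callable[[str, Optional[dict]], str]


def reason_then_act(model: Model, request: str, schema: dict) -> dict:
    # Phase 1: free-form reasoning, no schema, so sampling is unconstrained.
    thoughts = model(f"{request}\nReason about which tool to use next.", None)
    # Phase 2: the tool call is generated under the schema,
    # with the thoughts from phase 1 in the context.
    raw = model(f"{request}\nReasoning: {thoughts}\nRespond with a tool call.", schema)
    return json.loads(raw)


def fake_model(prompt: str, schema: Optional[dict]) -> str:
    """Stand-in for an LLM; a real backend would enforce `schema` via a grammar."""
    if schema is None:
        return "I need to call the calculate tool."
    return json.dumps(
        {"tool": "calculate", "arguments": {"expression": "(2 * 5) ** 0.8 / 2"}}
    )


call = reason_then_act(
    fake_model,
    "What is (2 * 5) raised to power of 0.8 divided by 2?",
    {"oneOf": []},  # placeholder schema, ignored by the stand-in model
)
print(call["tool"])
```

Since the second prompt includes the reasoning text, a real model can attend to its own thoughts while the grammar restricts the shape of the output.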

## Conclusion

Forcing a model to generate tool calls with grammar-based sampling ensures that they follow a specific schema. To increase the probability that the model generates an appropriate tool call at each step, i.e. that it selects the right tool and arguments, an unconstrained reasoning phase should precede the tool call generation phase. Without it, there are no generated thoughts for the model to attend to when it produces the tool call. This two-step approach may also be the basis for using more specialized models, e.g. one for the reasoning phase and another one for the tool call generation phase. I plan to implement extensions with more specialized open-source LLMs later and add them to the [GitHub repo](https://github.com/krasserm/grammar-based-agents).