# Migrating from Agent objects to Responses in Llama Stack

Llama Stack had a set of server APIs for creating and managing the components of an agent, including endpoints for grouping tools, making an agent that uses those groups of tools, making a session for that agent, and making a turn for that session.  However, these server APIs are deprecated in Llama Stack 0.3.0.  The functionality in those server APIs continues to exist in the client APIs, but the client APIs are relatively limited and are not expected to undergo substantial improvement, so they are not recommended for new development.  Instead, the recommended replacement is the OpenAI-compatible Responses API, which provides an all-in-one API for agentic reasoning.

To learn more about the Responses API and why you might want to use it, see [Your agent, your rules: A deep dive into the Responses API with Llama Stack](https://developers.redhat.com/articles/2025/08/20/your-agent-your-rules-deep-dive-responses-api-llama-stack) from Red Hat and/or [Why we built the Responses API](https://developers.openai.com/blog/responses-api/).  This notebook focuses more specifically on the following question:

> *What do I do if I have an existing system implemented using the old Llama Stack agent APIs and I want to use the Llama Stack Responses API instead?*

To that end, we present the following:

- *Getting Started*: Instructions for setting up, configuring, creating a client, and testing it out.
- *Legacy Agent API Example*: Code that creates and runs an agent using the Llama Stack client Agent APIs.
- *Equivalent Responses API Example*: Code that does the same thing by calling the Responses API to show how you would rewrite the first example to use Responses API.  That is probably the best approach for most users who are using the old APIs and want to move to the new one.  That code is significantly simpler and more elegant than the equivalent code using the old APIs.
- *Emulating the Legacy Agent API*: Code that provides a wrapper around the Responses API that emulates the old agent APIs along with code that mirrors the original example using that emulation.  Note that the emulation is a partial/incomplete emulation of the old APIs that is just powerful enough to implement the simple example.  It is meant to be a starting point for you to emulate those parts of the old agent APIs that are essential for your application.
- *Adopting a simpler agent class*: A simpler wrapper that is less consistent with the old APIs but might be an easier starting point if complete compatibility with the old APIs is less essential.
- *Human-in-the-loop tool approval*: Example of how to have an agent that checks with the user for approval before invoking a tool.
- *Model safety*: Example of how safety models are used with both the old agent APIs and the new Responses API.
- *ReAct agents*: Example of using the ReAct agent construct in the Llama Stack Python client library and how to accomplish something similar using the Responses API.
- *Multi-process architectures*: Thoughts on how to migrate applications that run different aspects of the old APIs in different processes (e.g., one process to make agents and another to consume them).


The development of this notebook was assisted by Google Gemini and Cursor using Claude 4 Sonnet.

## Getting Started

Before getting started, complete the following steps.

First install Llama Stack and all of the other dependencies for this notebook.
One way to do that is:

- First install Python 3.12 or later (do not try this with older versions of Python: it will not work).
- Then make a Python virtual environment.
- Then within that virtual environment run:

```
pip install -r requirements.txt
```

Once everything is installed, run the Llama Stack server:

```
llama stack run run.yaml --image-type venv
```

Also run the National Parks Service Model Context Protocol (MCP) server as described in [README_NPS.md](https://github.com/The-AI-Alliance/llama-stack-examples/blob/main/notebooks/01-responses/README_NPS.md).  Download it from that location and then run it using:

```
python nps_mcp_server.py --transport sse --port 3005
```
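
Before moving on, it can help to confirm that both servers are listening. The following is a minimal sketch using only the Python standard library; `is_port_open` is a hypothetical helper, and the ports are assumptions (8321 for Llama Stack, as used by the client later in this notebook, and 3005 from the command above). It only checks that a TCP connection can be opened, not that the servers are healthy.

```python
import socket
from urllib.parse import urlparse


def is_port_open(url: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parsed = urlparse(url)
    # Fall back to the scheme's default port when the URL does not name one.
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((parsed.hostname, port), timeout=timeout):
            return True
    except OSError:
        return False


# Assumed addresses; adjust them if you started the servers on other ports.
for name, url in [("Llama Stack", "http://localhost:8321"), ("NPS MCP", "http://localhost:3005")]:
    status = "reachable" if is_port_open(url) else "not reachable"
    print(f"{name} server at {url}: {status}")
```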

Alternatively, you can use this notebook with some other MCP server, but then you will need to change the example query and the server details to match that server.

Here we point to the locations of the servers we just started above.  We also provide the model ID for the model we want to use.  The model ID should be one that is specified in [run.yaml](./run.yaml).  In the [run.yaml](./run.yaml) included here, we have the following models defined:

- `openai/gpt-3.5-turbo` and `openai/gpt-4o` are models from OpenAI.  They will only work if you have OPENAI_API_KEY set in your environment to a [valid OpenAI API key](https://help.openai.com/en/articles/4936850-where-do-i-find-my-openai-api-key).
- `llama-openai-compat/Llama-3.3-70B-Instruct` is a model from Meta Llama.  This will only work if you have LLAMA_API_KEY set in your environment to a valid API key for the hosted [Llama API](https://www.llama.com/products/llama-api/).
- `watsonx/Llama-3.3-70B-Instruct` is the same model running on watsonx.ai (which has a somewhat different style for model IDs).  This will only work if you have the WATSONX_API_KEY and WATSONX_PROJECT_ID environment variables set to valid values (which requires an IBM Cloud account).  You may also need to set WATSONX_BASE_URL if your watsonx.ai instance is running anywhere other than US South (which is the default).  Note that the watsonx provider in Llama Stack was [not working](https://github.com/llamastack/llama-stack/issues/3165) when this notebook was created, but hopefully it will work by the time you read this.

If you can't or don't want to get any of those API keys, you can update [run.yaml](./run.yaml) to use [another inference provider](https://llama-stack.readthedocs.io/en/latest/providers/inference/index.html#overview).  Llama Stack includes numerous providers for calling hosted models like the ones above.  It also includes providers to call models that you deploy and run yourself using a model serving capability, e.g., the [vLLM provider](https://llama-stack.readthedocs.io/en/latest/providers/inference/remote_vllm.html) or the [ollama provider](https://llama-stack.readthedocs.io/en/latest/providers/inference/remote_ollama.html).
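
As a quick sanity check, the sketch below lists which of the model IDs above have their required environment variables set. `REQUIRED_ENV_VARS` and `usable_model_ids` are hypothetical names introduced for illustration, and the mapping simply restates the requirements described above. Note that it inspects the environment of the current process; the variables need to be set in the environment where the Llama Stack server runs.

```python
import os

# Environment variables each model depends on, as described in the list above.
REQUIRED_ENV_VARS = {
    "openai/gpt-3.5-turbo": ["OPENAI_API_KEY"],
    "openai/gpt-4o": ["OPENAI_API_KEY"],
    "llama-openai-compat/Llama-3.3-70B-Instruct": ["LLAMA_API_KEY"],
    "watsonx/Llama-3.3-70B-Instruct": ["WATSONX_API_KEY", "WATSONX_PROJECT_ID"],
}


def usable_model_ids(env=None):
    """Return the model IDs whose required variables are all set and non-empty."""
    env = os.environ if env is None else env
    return [
        model_id
        for model_id, names in REQUIRED_ENV_VARS.items()
        if all(env.get(name) for name in names)
    ]


print(usable_model_ids())
```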

In [None]:
LLAMA_STACK_URL = "http://localhost:8321/"
LLAMA_STACK_MODEL_IDS = [
    "openai/gpt-3.5-turbo",
    "openai/gpt-4o",
    "llama-openai-compat/Llama-3.3-70B-Instruct",
    "watsonx/Llama-3.3-70B-Instruct"
]

# Using gpt-4o for this demo, but feel free to try one of the others or add more to run.yaml.
LLAMA_STACK_MODEL_ID = LLAMA_STACK_MODEL_IDS[1]

Replace these with other values if you are using a different MCP server:

In [None]:
NPS_MCP_URL = "http://localhost:3005/sse/"
NPS_EXAMPLE_PROMPT = "Tell me about some parks in Rhode Island, and let me know if there are any upcoming events at them."
NPS_EXAMPLE_FOLLOWUP_PROMPT = "Which of these is happening the soonest?"

NPS_INSTRUCTIONS = "You are a helpful assistant that can answer questions about the National Parks Service."

# The NPS MCP server does not require an access token, but some MCP servers do.
# We are sending it a dummy token here to show how to send the access token to the MCP server.
NPS_ACCESS_TOKEN = "frog"

Next we instantiate the client:


In [None]:
from llama_stack_client import LlamaStackClient

# Use the server URL defined above rather than repeating it here.
client = LlamaStackClient(base_url=LLAMA_STACK_URL)

Then we test to see if it is working:

In [None]:
chat_completion_response = client.chat.completions.create(
    model=LLAMA_STACK_MODEL_ID,
    messages=[{"role": "user", "content": "What is the capital of France?"}]
)

In [None]:
import json
from datetime import date

def pretty_print(obj) -> None:
    """
    Recursively prints an object's __dict__ in a nicely formatted JSON.
    Handles nested objects and lists of objects.
    """
    def recursive_serializer(o):
        if hasattr(o, '__dict__'):
            return o.__dict__
        # Handle types that are not directly JSON serializable
        if isinstance(o, date):
            return o.isoformat()
        # For other types, raise a TypeError to let the default handler fail.
        raise TypeError(f"Object of type {o.__class__.__name__} is not JSON serializable")

    # Determine what to serialize
    data_to_serialize = obj.__dict__ if hasattr(obj, "__dict__") else obj

    print(json.dumps(
        data_to_serialize,
        indent=2,
        default=recursive_serializer
    ))


pretty_print(chat_completion_response)

## Legacy Agent API Example

Here is a simple example for how to use the Llama Stack Agent API.  We don't recommend this approach because the primary strategic focus for Llama Stack is OpenAI-compatible APIs since those reflect a de facto industry "standard".  We're showing that API here for context to motivate the alternative approaches in later sections.

Here (commented out) is an example of how an Agent was intended to be instantiated in older versions of Llama Stack as per some [old documentation for MCP with Agent in Llama Stack](https://llamastack.github.io/v0.2.22/building_applications/tools.html#model-context-protocol-mcp):

In [None]:
# from llama_stack_client import Agent
# import uuid

# client.toolgroups.register(
#     toolgroup_id="mcp::nps",
#     provider_id="model-context-protocol",
#     mcp_endpoint=URL(uri=NPS_MCP_URL),
# )

# agent = Agent(
#     model=LLAMA_STACK_MODEL_ID,
#     instructions=NPS_INSTRUCTIONS,
#     client=client,
#     tools=["mcp::nps"],
#     extra_headers={
#         "X-LlamaStack-Provider-Data": json.dumps(
#             {
#                 "mcp_headers": {
#                     NPS_MCP_URL: {
#                         "Authorization": f"Bearer {NPS_ACCESS_TOKEN}",
#                     },
#                 },
#             }
#         ),
#     },
# )

The approach above no longer works, but you can now specify MCP tools directly in the Agent object in the Python client for similar behavior:

In [None]:
from llama_stack_client import Agent
import uuid

agent = Agent(
    model=LLAMA_STACK_MODEL_ID,
    instructions=NPS_INSTRUCTIONS,
    client=client,
    tools=[{
            "type": "mcp",
            "server_url": NPS_MCP_URL,
            "server_label": "National Parks Service tools",
            "headers": {
                "Authorization": f"Bearer {NPS_ACCESS_TOKEN}",
            },
        }])

In [None]:
# Generate a unique session ID for the example.
# This is not required, but it is useful to have if you want to create multiple sessions without restarting Llama Stack.
session_id = agent.create_session(f"nps_session-{uuid.uuid4().hex}")

agent_response1 = agent.create_turn(
    messages=[{"role": "user", "content": NPS_EXAMPLE_PROMPT}],
    session_id=session_id,
    stream=False,
)

In [None]:
pretty_print(agent_response1)

In [None]:
agent_response2 = agent.create_turn(
    messages=[{"role": "user", "content": NPS_EXAMPLE_FOLLOWUP_PROMPT}],
    session_id=session_id,
    stream=False,
)

In [None]:
pretty_print(agent_response2)

As you can see this approach still works and can be a viable substitute for the approach commented out above.  It is a bit simpler and more elegant since you no longer need to separately register tool groups and then use them in your agent definition.

## Equivalent Responses API Example

Here is some code using the Responses API that is roughly equivalent to the Agent example above.  As you can see, it no longer has separate calls for:

- Registering tools
- Creating an agent
- Creating a session
- Issuing a query for that session

The Responses API takes the list of tools as an argument so there is no need to pre-register the tools.  It does not make a persistent agent object but it does make a session object implicitly within the call -- you can see how that is used in the second call, in which `previous_response_id=responses_api_response1.id` is set to indicate that the second call is a continuation of the first.
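
Because the same tool entry is passed on every call, it can be convenient to build it in one place. The following is a minimal sketch; `make_mcp_tool` is a hypothetical helper (not part of Llama Stack) that produces a dictionary in the same shape used throughout this notebook, and the literal values mirror the ones defined earlier.

```python
def make_mcp_tool(server_url, server_label, access_token=None):
    """Build an MCP tool entry in the shape used throughout this notebook."""
    tool = {
        "type": "mcp",
        "server_url": server_url,
        "server_label": server_label,
    }
    # Only attach an Authorization header when a token is supplied.
    if access_token is not None:
        tool["headers"] = {"Authorization": f"Bearer {access_token}"}
    return tool


# Example using the same values as NPS_MCP_URL and NPS_ACCESS_TOKEN above.
nps_tool = make_mcp_tool("http://localhost:3005/sse/", "National Parks Service tools", "frog")
print(nps_tool)
```

The resulting dictionary could then be passed as `tools=[nps_tool]` in place of the inline literal.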

In [None]:
responses_api_response1 = client.responses.create(
    model=LLAMA_STACK_MODEL_ID,
    input=NPS_EXAMPLE_PROMPT,
    instructions=NPS_INSTRUCTIONS,
    tools=[
        {
            "type": "mcp",
            "server_url": NPS_MCP_URL,
            "server_label": "National Parks Service tools",
            "headers": {
                "Authorization": f"Bearer {NPS_ACCESS_TOKEN}",
            },
        }
    ]
)


Here we print the entire object:

In [None]:
pretty_print(responses_api_response1)

However, for some applications, we really only care about the actual text.  Here is how to print that:

In [None]:
def print_response_text(response):
    for output_block in response.output:
        if output_block.type == "message":
            for content_block in output_block.content:
                if hasattr(content_block, "text"):
                    print(content_block.text)

In [None]:
print_response_text(responses_api_response1)

In [None]:
responses_api_response2 = client.responses.create(
    model=LLAMA_STACK_MODEL_ID,
    input=NPS_EXAMPLE_FOLLOWUP_PROMPT,
    instructions=NPS_INSTRUCTIONS,
    previous_response_id=responses_api_response1.id,
    tools=[
        {
            "type": "mcp",
            "server_url": NPS_MCP_URL,
            "server_label": "National Parks Service tools",
        }
    ]
)


In [None]:
print_response_text(responses_api_response2)
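
The two calls above generalize to any number of turns: each request passes the `id` of the previous response. The sketch below shows that pattern in isolation. `run_conversation` and `StubResponses` are hypothetical names; the stub stands in for `client.responses` so that the example runs without a server, and with a real client you would pass `client` instead of `stub_client`.

```python
from types import SimpleNamespace


def run_conversation(client, model, prompts, **create_kwargs):
    """Send prompts in order, linking each request to the previous response."""
    responses = []
    previous_response_id = None
    for prompt in prompts:
        response = client.responses.create(
            model=model,
            input=prompt,
            previous_response_id=previous_response_id,
            **create_kwargs,
        )
        # Remember this response so the next turn continues from it.
        previous_response_id = response.id
        responses.append(response)
    return responses


class StubResponses:
    """Stand-in for client.responses that records calls instead of contacting a server."""

    def __init__(self):
        self.calls = []

    def create(self, **kwargs):
        self.calls.append(kwargs)
        return SimpleNamespace(id=f"resp-{len(self.calls)}", output=[])


stub_client = SimpleNamespace(responses=StubResponses())
run_conversation(stub_client, "stub-model", ["first question", "follow-up question"])
print([call["previous_response_id"] for call in stub_client.responses.calls])  # [None, 'resp-1']
```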

## Emulating the Legacy Agent API

For most use cases, the approach above is probably the best way to migrate from the Agent API to the Responses API.  However, if you have a large amount of code that uses the old API, you might want to have some objects that emulate the old Agent API instead (the approach in the commented-out example at the top of the *Legacy Agent API Example* section).  Here is a very simple example of how to do that.

The simple example only covers the functionality needed in the original simple example above.  We'll get into some advanced functionality in later sections, but there are also many parameters in the legacy API that we do not provide sample code for.  We don't recommend trying to build a complete implementation of every parameter value in the entire API.  Instead, you can use this example as a starting point and then fill in those parameters and/or values that are important to your application.

In [None]:
TOOLS = {}
SESSIONS = {}

class LegacyURL:
    def __init__(self, uri):
        self.uri = uri

def toolgroups_register(toolgroup_id, provider_id, mcp_endpoint=None):
    """
    This replaces the client.toolgroups.register() call in the legacy agent API.  Since this version manages the
    tool group registration internally, it does not need to use the client object.
    """
    if provider_id == "model-context-protocol":
        TOOLS[toolgroup_id] = {
            "type": "mcp",
            "server_url": mcp_endpoint.uri,
            "server_label": toolgroup_id,
        }
    else:
        # TODO: Add support for other providers, not needed for this example.
        raise ValueError(f"Unsupported provider: {provider_id}")

def convert_response_to_legacy_agent_response(response):
    """
    This is a placeholder for now.  The code to convert objects to the legacy agent response format would go here.
    Note that just returning the response object is good enough for this example because the response object has
    all of the same fields that we are using in the example print statements.  However, in a more complex application
    you may need to do more work here.
    """
    return response

class LegacyAgent:
    def __init__(self, model, instructions, client, tools, extra_headers):
        self.model = model
        self.instructions = instructions
        self.client = client
        self.tool_ids = tools

        header_json = extra_headers["X-LlamaStack-Provider-Data"]
        headers = json.loads(header_json)
        self.headers = headers["mcp_headers"]
    
    def create_session(self, session_id):
        SESSIONS[session_id] = []
        return session_id
    
    def create_turn(self, messages, session_id, stream=False):
        # Note that the stream parameter is not used.  That works for this example because we are not using the stream=True,
        # but if you are using that option, you will need to update this code.
        if session_id not in SESSIONS:
            raise ValueError(f"Session {session_id} not found")
            
        previous_response_id = None
        if len(SESSIONS[session_id]) > 0:
            previous_response_id = SESSIONS[session_id][-1].id

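        # Copy each tool entry so that adding headers does not modify the shared TOOLS registry.
        tools = [dict(TOOLS[tool_id]) for tool_id in self.tool_ids]
        for tool in tools:
            if tool["server_url"] in self.headers:
                tool["headers"] = self.headers[tool["server_url"]]

        # Use the client passed to this agent rather than relying on a global variable.
        response = self.client.responses.create(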
            model=self.model,
            # TODO: This is using the last message in the list, which is fine for this example, but
            # if you call create_turn() with multiple messages, you will need to update this code.
            input=messages[-1]["content"],
            previous_response_id=previous_response_id,
            instructions=self.instructions,
            tools=tools
        )
        SESSIONS[session_id].append(response)
        return convert_response_to_legacy_agent_response(response)

With this simple wrapper in place, we can maintain much of the structure of the original Agent code we saw earlier:

In [None]:
toolgroups_register(
    toolgroup_id="mcp::nps",
    provider_id="model-context-protocol",
    mcp_endpoint=LegacyURL(uri=NPS_MCP_URL),
)

agent = LegacyAgent(
    model=LLAMA_STACK_MODEL_ID,
    instructions=NPS_INSTRUCTIONS,
    client=client,
    tools=["mcp::nps"],
    extra_headers={
        "X-LlamaStack-Provider-Data": json.dumps(
            {
                "mcp_headers": {
                    NPS_MCP_URL: {
                        "Authorization": f"Bearer {NPS_ACCESS_TOKEN}",
                    },
                },
            }
        ),
    },
)

In [None]:
session_id = agent.create_session(f"nps_session-{uuid.uuid4().hex}")

agent_response1 = agent.create_turn(
    messages=[{"role": "user", "content": NPS_EXAMPLE_PROMPT}],
    session_id=session_id,
    stream=False,
)
print_response_text(agent_response1)

In [None]:
agent_response2 = agent.create_turn(
    messages=[{"role": "user", "content": NPS_EXAMPLE_FOLLOWUP_PROMPT}],
    session_id=session_id,
    stream=False,
)
print_response_text(agent_response2)

## Adopting a simpler agent class

As you can see above, a downside of the LegacyAgent is that it drags in the clunky inelegance of the legacy APIs (by design).  Calling the Responses API directly, as shown in the section before that, is often the best way to avoid this inelegance, but some developers really want a more object-oriented experience with an object representing an agent, especially if they have existing code that is structured around such an object.  The example below is a compromise between calling a mirror of the legacy Agent class (as shown in the previous section) and just calling Responses directly (as shown in the section before that).

In [None]:
class SimpleExampleSession:
    def __init__(self, agent):
        self.agent = agent
        self.previous_response_id = None
    
    def create_turn(self, input):
        response = self.agent.client.responses.create(
            model=self.agent.model,
            input=input,
            previous_response_id=self.previous_response_id,
            instructions=self.agent.instructions,
            tools=self.agent.tools
        )
        self.previous_response_id = response.id
        return response

class SimpleExampleAgent:
    def __init__(self, model, instructions, client, tools):
        self.model = model
        self.instructions = instructions
        self.client = client
        self.tools = tools

    def create_session(self):
        return SimpleExampleSession(self)

As you can see, this is much simpler than LegacyAgent and the code to use it is simpler too:

In [None]:
simple_agent = SimpleExampleAgent(
    model=LLAMA_STACK_MODEL_ID,
    instructions=NPS_INSTRUCTIONS,
    client=client,
    tools=[
        {
            "type": "mcp",
            "server_url": NPS_MCP_URL,
            "server_label": "National Parks Service tools",
        }
    ]
)
simple_session = simple_agent.create_session()

simple_agent_response1 = simple_session.create_turn(NPS_EXAMPLE_PROMPT)
print_response_text(simple_agent_response1)

In [None]:
simple_agent_response2 = simple_session.create_turn(NPS_EXAMPLE_FOLLOWUP_PROMPT)
print_response_text(simple_agent_response2)

Of course, the downside of the simpler version here is that it is less compatible with the legacy API so if you have existing code that uses the legacy API, it will need more rewriting to use this structure.
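
One way to reduce that rewriting is a thin adapter that accepts the legacy-style `messages` list and forwards it to a session's `create_turn`. The sketch below is a starting point under stated assumptions: `legacy_messages_to_input`, `create_turn_from_legacy_messages`, and `EchoSession` are hypothetical names, and passing a list of `{"type": "message", ...}` items as `input` is assumed to be accepted by your server (it follows the OpenAI Responses input format). The stand-in session only echoes its input so that the example runs without a server.

```python
def legacy_messages_to_input(messages):
    """Convert legacy-style chat messages into a value for the Responses `input` argument."""
    if not messages:
        raise ValueError("messages must not be empty")
    # A single user message can be passed as a plain string, as in the examples above.
    if len(messages) == 1 and messages[0]["role"] == "user":
        return messages[0]["content"]
    # Otherwise pass every message as a message item (assumed to be accepted by the server).
    return [
        {"type": "message", "role": m["role"], "content": m["content"]}
        for m in messages
    ]


def create_turn_from_legacy_messages(session, messages):
    """Call a SimpleExampleSession-style create_turn with legacy-style messages."""
    return session.create_turn(legacy_messages_to_input(messages))


class EchoSession:
    """Stand-in session used only to demonstrate the adapter."""

    def create_turn(self, input):
        return input


print(create_turn_from_legacy_messages(EchoSession(), [{"role": "user", "content": "Hello"}]))  # Hello
```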

## Human-in-the-loop tool approval

Both LegacyAgent and SimpleExampleAgent here are extremely minimal, and there are many features of Responses that are not covered in these example classes.  In this section, we address one of those features, mainly to show how you can go about expanding on the basic examples.  A more full-featured agent construct is outside the scope of this notebook; it is up to the reader to explore the [Responses API documentation](https://platform.openai.com/docs/api-reference/responses) and decide which features fit their use case.  We will use the SimpleExampleAgent as a starting point because it is simpler to build on, but you could also make the same kinds of changes to LegacyAgent if you prefer.

In [None]:
class SimpleExampleSessionWithApproval:
    def __init__(self, agent):
        self.agent = agent
        self.previous_response_id = None
    
    def create_turn(self, input):
        response = self.agent.client.responses.create(
            model=self.agent.model,
            input=input,
            previous_response_id=self.previous_response_id,
            instructions=self.agent.instructions,
            tools=self.agent.tools
        )
        self.previous_response_id = response.id

        approval_requests = []
        for block in response.output:
            if block.type == "mcp_approval_request":
                approval_request = (block.name, block.arguments, block.id)
                approval_requests.append(approval_request)

        return response, approval_requests if approval_requests else None

    def create_approval_turn(self, approval_requests_to_approve):
        """
        This is a helper method to create a turn that is used to approve or reject tool calls.
        The input is a dictionary that maps approval requests to boolean values indicating whether to approve or reject.
        Like create_turn, it returns the response and any further approval requests (or None).
        """
        approval_response_entries = []
        for approval_request, approve in approval_requests_to_approve.items():
            # Each entry answers one approval request, identified by its id.
            approval_response_entries.append({
                "type": "mcp_approval_response",
                "approval_request_id": approval_request[2],
                "approve": approve
            })

        # Reuse create_turn so that the request goes through the agent's client and
        # any follow-up approval requests are extracted in the same way.
        return self.create_turn(approval_response_entries)
    
class SimpleExampleAgentWithApproval:
    def __init__(self, model, instructions, client, tools):
        self.model = model
        self.instructions = instructions
        self.client = client
        self.tools = tools

    def create_session(self):
        return SimpleExampleSessionWithApproval(self)

Unlike the version in the previous section, this version checks to see if there is an `mcp_approval_request` in the response, which there may be if you set `"require_approval": "always"` in one of your MCP tool entries as shown below:

In [None]:
simple_agent_with_approval = SimpleExampleAgentWithApproval(
    model=LLAMA_STACK_MODEL_ID,
    instructions=NPS_INSTRUCTIONS,
    client=client,
    tools=[
        {
            "type": "mcp",
            "server_url": NPS_MCP_URL,
            "server_label": "National Parks Service tools",
            "require_approval": "always"
        }
    ]
)

The `create_turn` method on a session now returns both the response object and a list of approval requests extracted from that response object (or `None` if no approval is needed):

In [None]:
simple_session_with_approval = simple_agent_with_approval.create_session()

simple_agent_response1, approval_requests1 = simple_session_with_approval.create_turn(NPS_EXAMPLE_PROMPT)
print_response_text(simple_agent_response1)
print(approval_requests1)

When the client gets an approval request, it should ask a user or apply whatever other logic it deems appropriate to approve or reject the request to run the tool.  This can be very important if you have tools that perform destructive operations since models will sometimes choose the wrong tools or the wrong values for those tools.  Here is an example of how to approve a tool use with this implementation:

In [None]:
def get_approval_from_user(approval_request):
    """
    This is a placeholder for the actual logic to get approval from the user.
    It prints the approval request so you can see the information that would be
    available to the user.  In a real application, you would replace this with
    the actual logic to show this to the user in a reasonable way and then ask
    for a yes/no decision on whether to approve the tool call.
    """
    print(approval_request)
    return True

In [None]:
def approve_tool_use(approval_requests):
    """
    Method to iterate over the approval requests, get approval boolean from the user for each one, and then send the approval booleans to the session.
    Each approval boolean is a boolean value indicating whether to approve or reject the tool call.
    The approval requests are a list of tuples, each containing the tool name, tool arguments, and approval request id.
    """
    approval_requests_to_approve = {}
    for approval_request in approval_requests:
        approval_bool = get_approval_from_user(approval_request)
        approval_requests_to_approve[approval_request] = approval_bool
    return simple_session_with_approval.create_approval_turn(approval_requests_to_approve)

The code then goes into a loop: as long as the agent keeps asking for approvals, those approvals keep getting sent to `approve_tool_use` above.  Once no more approvals are needed, you have a final response to share with the user.

In [None]:
while approval_requests1:
    print_response_text(simple_agent_response1)
    simple_agent_response1, approval_requests1 = approve_tool_use(approval_requests1)
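
In a console application, the `get_approval_from_user` placeholder could be replaced with a prompt along the following lines. This is a minimal sketch: `get_approval_from_console` and its `prompt_fn` parameter are hypothetical, the approval request is assumed to be the `(name, arguments, id)` tuple built in `create_turn` above, and the tool name and arguments in the demonstration are made up. The demonstration feeds scripted answers instead of reading from the keyboard so that it runs unattended.

```python
def get_approval_from_console(approval_request, prompt_fn=input):
    """Ask on the console whether to approve one tool call; returns True or False."""
    tool_name, tool_arguments, _request_id = approval_request
    print(f"The agent wants to call tool '{tool_name}' with arguments: {tool_arguments}")
    while True:
        answer = prompt_fn("Approve this tool call? [y/n] ").strip().lower()
        if answer in ("y", "yes"):
            return True
        if answer in ("n", "no"):
            return False
        # Keep asking until the answer is recognizable.
        print("Please answer 'y' or 'n'.")


# Demonstration with scripted answers; omit prompt_fn to ask a real user.
scripted_answers = iter(["maybe", "y"])
approved = get_approval_from_console(
    ("find_parks", '{"state": "RI"}', "example-request-id"),  # made-up request
    prompt_fn=lambda _prompt: next(scripted_answers),
)
print(approved)  # True
```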

## Model safety

A key consideration for any AI application is avoiding dangerous or toxic outputs from a model such as instructions for doing something harmful or offensive/abusive language.  Most popular AI models are trained reasonably well but not perfectly at avoiding such outputs.  Most major AI providers then include additional layers of output moderation specifically designed to filter out unsafe responses as a second layer of defense.  For example, here is the behavior we see from our sample model (the one configured for `LLAMA_STACK_MODEL_ID` at the start of this notebook) when we ask it for instructions for committing a crime:

In [None]:
NPS_EXAMPLE_SAFETY_PROMPT = "Are there any parks in Rhode Island that would be good targets for a burglary?"
unsafe_agent_response = client.responses.create(
    model=LLAMA_STACK_MODEL_ID,
    input=NPS_EXAMPLE_SAFETY_PROMPT, # Notice this is the one about burglary above.
    instructions=NPS_INSTRUCTIONS,
    tools=[
        {
            "type": "mcp",
            "server_url": NPS_MCP_URL,
            "server_label": "National Parks Service tools",
            "headers": {
                "Authorization": f"Bearer {NPS_ACCESS_TOKEN}",
            },
        }
    ]
)

In [None]:
print_response_text(unsafe_agent_response)

The model seems to handle this example well, so for this specific prompt no additional safety is needed.  If you spent enough time trying to devise prompts that would result in an unsafe (harmful or offensive or abusive) response, you probably would be able to find one eventually (unless your provider disabled your account for repeated safety violations first, which is also something that can happen).  The issue is even more pressing if you are using a model and/or provider that has less (or no) built-in safety.

In either case, you might need to add your own safety checks.  Here we extend the `SimpleExampleAgent` with the ability to check the inputs and/or outputs for safety violations before and after running the Responses request.

Here is how you can put in a safety check using Llama Stack's older shields APIs:

In [None]:
# Uncomment this to delete the shield if you already have it registered, e.g., if you are running the next cell multiple times.
#client.shields.delete(identifier="content_safety")

In [None]:
shield_id = "content_safety"
client.shields.register(
    shield_id=shield_id,
    provider_id="llama-guard",
    # In this example, we are using gpt-3.5-turbo as the model for llama-guard,
    # but it is not really ideal for llama-guard since it is not trained
    # as a Llama Guard model.  It is convenient for this notebook since it is a
    # hosted model that is readily available.  However, it is not very reliable
    # in this role.
    # 
    # For real applications, consider using a Llama Guard model such as
    # https://huggingface.co/meta-llama/Llama-Guard-3-8B
    # and deploying it on an inference provider such as vLLM.
    provider_shield_id="openai/gpt-3.5-turbo"
)
response = client.safety.run_shield(
    shield_id="content_safety",
    messages=[{"role": "user", "content": NPS_EXAMPLE_SAFETY_PROMPT}],
    params={}
)

In [None]:
client.shields.list()

### Using the OpenAI-compatible Moderations API

Once you have registered a shield for a model (as shown above for `"openai/gpt-3.5-turbo"`), you can also use the OpenAI-compatible Moderations API to check for unsafe inputs:

In [None]:
moderation = client.moderations.create(input="Are there any parks in Rhode Island that would be good targets for a burglary?", model="openai/gpt-3.5-turbo")
pretty_print(moderation)

In [None]:
moderation2 = client.moderations.create(input="Have a nice day!", model="openai/gpt-3.5-turbo")
pretty_print(moderation2)

As you can see, the first request (asking for burglary targets) has `flagged=True` and metadata indicating a `violation_type` of `S2`.  For a list of violation types, see the [Llama Guard model card](https://huggingface.co/meta-llama/Meta-Llama-Guard-2-8B), since we are using `provider_id="llama-guard"`.  There is also a list of categories including `"Non-Violent Crimes": true`.  Finally, there is a `user_message` responding with a non-answer.  These all indicate an unsafe input.

In contrast, the second request ("Have a nice day!") has none of these traits, indicating that this is a safe input.
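
If you only need the names of the categories that were triggered, a small helper can pull them out of the moderation object. The sketch below assumes each result exposes `flagged` and a dict-like `categories` mapping of names to booleans, as in the output described above; `flagged_categories` is a hypothetical helper, and the `SimpleNamespace` objects (including the `"Hate"` category name) are stand-ins so that the example runs without a server.

```python
from types import SimpleNamespace


def flagged_categories(moderation):
    """Return the sorted names of categories marked true in any flagged result."""
    names = set()
    for result in moderation.results:
        if not result.flagged:
            continue
        # Treat a missing or empty categories field as no categories.
        categories = getattr(result, "categories", None) or {}
        names.update(name for name, hit in categories.items() if hit)
    return sorted(names)


# Stand-in for a moderation response with one flagged and one unflagged result.
example_moderation = SimpleNamespace(results=[
    SimpleNamespace(flagged=True, categories={"Non-Violent Crimes": True, "Hate": False}),
    SimpleNamespace(flagged=False, categories={"Hate": False}),
])
print(flagged_categories(example_moderation))  # ['Non-Violent Crimes']
```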

Next we extend the SimpleExampleAgent to call this API on the input:

In [None]:
class ModerationResponseOutputBlock:
    def __init__(self, user_message):
        self.text = user_message

class ModerationResponseOutput:
    def __init__(self, user_message):
        self.type = "message"
        self.content = [ModerationResponseOutputBlock(user_message)]

class ModerationResponseObject:
    def __init__(self, result):
        self.type = "ModerationResponseObject"
        # Note that this is just passing along the refusal message from the moderations response.
        # Alternatively, you could just have a static refusal message or some other custom logic.
        self.output = [ModerationResponseOutput(result.user_message)]

class SimpleExampleSessionWithModeration:
    def __init__(self, agent):
        self.agent = agent
        self.previous_response_id = None
    
    def create_turn(self, input):
        moderation = client.moderations.create(input=input, model=self.agent.shield_model)
        for result in moderation.results:
            if result.flagged:
                return ModerationResponseObject(result)
        response = self.agent.client.responses.create(
            model=self.agent.model,
            input=input,
            previous_response_id=self.previous_response_id,
            instructions=self.agent.instructions,
            tools=self.agent.tools
        )
        # Note that you might also want to iterate over the blocks in the response
        # and check for unsafe outputs in each of them or all of them together.
        # In this example, we just check the input, but some applications might want
        # to check the output instead or to check both.
        self.previous_response_id = response.id
        return response

class SimpleExampleAgentWithModeration:
    def __init__(self, model, shield_model, instructions, client, tools):
        self.model = model
        self.shield_model = shield_model
        self.instructions = instructions
        self.client = client
        self.tools = tools
        shields = client.shields.list()
        has_shield = any(shield.provider_resource_id == shield_model for shield in shields)
        if not has_shield:
            shield_id = f"content_safety_{shield_model}"
            client.shields.register(
                shield_id=shield_id,
                provider_id="llama-guard",
                provider_shield_id=shield_model
            )

    def create_session(self):
        return SimpleExampleSessionWithModeration(self)

We verify that this still works with the safe original example prompt:

In [None]:
simple_agent_with_moderation = SimpleExampleAgentWithModeration(
    model=LLAMA_STACK_MODEL_ID, # gpt-4o
    shield_model="openai/gpt-3.5-turbo",
    instructions=NPS_INSTRUCTIONS,
    client=client,
    tools=[
        {
            "type": "mcp",
            "server_url": NPS_MCP_URL,
            "server_label": "National Parks Service tools",
        }
    ]
)
simple_session_with_moderation = simple_agent_with_moderation.create_session()

simple_agent_response_with_moderation1 = simple_session_with_moderation.create_turn(NPS_EXAMPLE_PROMPT)
print_response_text(simple_agent_response_with_moderation1)

Next we try with the NPS_EXAMPLE_SAFETY_PROMPT (the one about burglary):

In [None]:
simple_agent_response_with_moderation2 = simple_session_with_moderation.create_turn(NPS_EXAMPLE_SAFETY_PROMPT)
print_response_text(simple_agent_response_with_moderation2)

The output here is our simple ModerationResponseObject with just the moderation output as you can see below:

In [None]:
pretty_print(simple_agent_response_with_moderation2)

Notice that this example didn't include the human-in-the-loop tool approval capabilities discussed in the previous section.  You may want to combine the extensions in both of these sections and then continue to extend with all of the other advanced features you need for your application.
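The comments in `create_turn` mention that some applications might also want to check the model output.  A minimal sketch of that idea follows.  The helpers `collect_output_text` and `output_is_flagged` are hypothetical names introduced here, and the sketch assumes that message content blocks expose their text in a `text` attribute, as in the emulation classes above.  A stand-in client and response are used so the example runs without a server.

```python
from types import SimpleNamespace


def collect_output_text(response):
    """Join the text of all message content blocks in a Responses result."""
    texts = []
    for output_block in response.output:
        if getattr(output_block, "type", None) != "message":
            continue  # Skip tool calls and other non-message outputs.
        for content_block in output_block.content:
            text = getattr(content_block, "text", None)
            if text:
                texts.append(text)
    return "\n".join(texts)


def output_is_flagged(client, response, shield_model):
    """Return True if the Moderations API flags the response text.

    Hypothetical helper: mirrors the input check in create_turn, but runs
    on the generated output instead.
    """
    text = collect_output_text(response)
    if not text:
        return False  # Nothing to moderate (for example, only tool calls).
    moderation = client.moderations.create(input=text, model=shield_model)
    return any(result.flagged for result in moderation.results)


# Stand-in client and response so the sketch runs without a server.
class FakeModerations:
    def create(self, input, model):
        flagged = "burglary" in input.lower()
        return SimpleNamespace(results=[SimpleNamespace(flagged=flagged)])


fake_client = SimpleNamespace(moderations=FakeModerations())
fake_response = SimpleNamespace(
    output=[
        SimpleNamespace(type="mcp_call", content=None),
        SimpleNamespace(
            type="message",
            content=[SimpleNamespace(text="Acadia is lovely in autumn.")],
        ),
    ]
)
print(output_is_flagged(fake_client, fake_response, "openai/gpt-3.5-turbo"))
```

In a session class, you could call `output_is_flagged` right after `responses.create` and return a refusal object when it returns `True`.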

### Using extra_body/guardrails

Llama Stack also has a special non-breaking extension to the OpenAI-compatible APIs.  This lets you embed the moderation call inside the Responses API so you don't need to call it separately. To do this, you use a parameter called `extra_body` and provide a dictionary with key `guardrails` and value of a list of shield `identifier` values.  Here is an example of use (assuming you have already run `client.shields.register` as we did earlier in this notebook):

In [None]:
shields = client.shields.list()
first_shield_identifier = shields[0].identifier
shields

In [None]:
unsafe_agent_response_using_guardrails = client.responses.create(
    model=LLAMA_STACK_MODEL_ID,
    input=NPS_EXAMPLE_SAFETY_PROMPT, # This is the one about burglary above.
    instructions=NPS_INSTRUCTIONS,
    tools=[
        {
            "type": "mcp",
            "server_url": NPS_MCP_URL,
            "server_label": "National Parks Service tools",
            "headers": {
                "Authorization": f"Bearer {NPS_ACCESS_TOKEN}",
            },
        }
    ],
    extra_body={"guardrails": [first_shield_identifier]}, # This is the special extension to the OpenAI-compatible APIs
)

In [None]:
unsafe_agent_response_using_guardrails

If you look closely at the ResponseObject above, you will see that the first entry in the `output.content` list is a dictionary with key `type` and value `refusal`.  Compare that to a "normal" responses output below:

In [None]:
normal_response = client.responses.create(
    model=LLAMA_STACK_MODEL_ID,
    input="What is the capital of France?",
)
normal_response

Here you see that the first entry in the `output.content` list is an `OutputOpenAIResponseMessageContentUnionMember2` object with a `type` *field* with value `output_text`.  This complicates processing of output blocks because some of them are objects with a `type` field and others are dictionaries with a `type` key.  This is a consequence of the fact that the `extra_body`/`guardrails` capability is a non-breaking extension to the OpenAI APIs but the actual response object classes in the Llama Stack client only cover outputs produced by the actual OpenAI APIs, so the client represents this extension information as a dictionary.
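One way to cope with this mix of objects and dictionaries is a small accessor that reads a field from either kind of block.  The sketch below is an illustration under that assumption: `get_block_field` and `find_refusal` are hypothetical helpers, and the stand-in response only imitates the two shapes described above (a dictionary with `type` and `refusal` keys, and an object with `type` and `text` fields).

```python
from types import SimpleNamespace


def get_block_field(block, name, default=None):
    """Read `name` from a block that may be a dict or an object."""
    if isinstance(block, dict):
        return block.get(name, default)
    return getattr(block, name, default)


def find_refusal(response):
    """Return the text of the first refusal block, or None if there is none."""
    for output_block in response.output:
        if get_block_field(output_block, "type") != "message":
            continue
        for content_block in get_block_field(output_block, "content") or []:
            if get_block_field(content_block, "type") == "refusal":
                return get_block_field(content_block, "refusal")
    return None


# Stand-in responses imitating the two shapes described in the text.
refused = SimpleNamespace(
    output=[
        SimpleNamespace(
            type="message",
            content=[{"type": "refusal", "refusal": "I can't help with that."}],
        )
    ]
)
normal = SimpleNamespace(
    output=[
        SimpleNamespace(
            type="message",
            content=[SimpleNamespace(type="output_text", text="Paris.")],
        )
    ]
)
print(find_refusal(refused))
print(find_refusal(normal))
```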

An advantage of the extra_body/guardrails approach is that you don't need to separately call the Moderations API before and/or after the call to Responses because it is all done for you in a single call.  One disadvantage of this approach is the added complexity in the client data structures discussed above.  Another disadvantage of course is that you no longer have access to the fine-grained control that you get from being able to make separate Moderations calls before and/or after Responses, inspect all the structured output and use it in your application.  For example, some chat UI applications might want to use different icons or colors for input moderation violations, output moderation violations, violations of different types, etc.  However, if you don't mind dealing with the data structures and you don't need all that fine-grained control, then the `extra_body` approach might be the ideal fit.  Here is how we do that in our simple example agent:

In [None]:
class AlternateModerationResponseOutputBlock:
    def __init__(self, user_message):
        self.text = user_message

class AlternateModerationResponseOutput:
    def __init__(self, user_message):
        self.type = "message"
        self.content = [AlternateModerationResponseOutputBlock(user_message)]

class AlternateModerationResponseObject:
    def __init__(self, user_message):
        self.type = "AlternateModerationResponseObject"
        self.output = [AlternateModerationResponseOutput(user_message)]

class AlternateSimpleExampleSessionWithModeration:
    def __init__(self, agent):
        self.agent = agent
        self.previous_response_id = None
    
    def create_turn(self, input):
        response = self.agent.client.responses.create(
            model=self.agent.model,
            input=input,
            previous_response_id=self.previous_response_id,
            instructions=self.agent.instructions,
            tools=self.agent.tools,
            extra_body={"guardrails": [self.agent.shield_identifier]}
        )
        self.previous_response_id = response.id
        for output_block in response.output:
            if output_block.type == "message":
                for content_block in output_block.content:
                    # We need to check for a dictionary with a `type` key because the extra_body/guardrails
                    # feature returns a dictionary but other content blocks are generally objects with a `type` field.
                    if isinstance(content_block, dict) and content_block["type"] == "refusal":
                        # Note that this is just passing along the refusal message from the guardrails.
                        # Alternatively, you could just have a static refusal message or some other custom logic.
                        # Or you could just return the response object as is and let the caller decide what to do.
                        return AlternateModerationResponseObject(content_block["refusal"])
        return response

class AlternateSimpleExampleAgentWithModeration:
    def __init__(self, model, shield_model, instructions, client, tools):
        self.model = model
        self.instructions = instructions
        self.client = client
        self.tools = tools
        shields = client.shields.list()
        self.shield_identifier = None
        for shield in shields:
            if shield.provider_resource_id == shield_model:
                self.shield_identifier = shield.identifier
        if not self.shield_identifier:
            shield_id = f"content_safety_{shield_model}"
            client.shields.register(
                shield_id=shield_id,
                provider_id="llama-guard",
                provider_shield_id=shield_model
            )
            self.shield_identifier = shield_id

    def create_session(self):
        return AlternateSimpleExampleSessionWithModeration(self)

Once again, we first instantiate the agent and verify that a safe prompt still works as before:

In [None]:
client.shields.list()

In [None]:
alt_simple_agent_with_moderation = AlternateSimpleExampleAgentWithModeration(
    model=LLAMA_STACK_MODEL_ID, # gpt-4o
    shield_model="openai/gpt-3.5-turbo",
    instructions=NPS_INSTRUCTIONS,
    client=client,
    tools=[
        {
            "type": "mcp",
            "server_url": NPS_MCP_URL,
            "server_label": "National Parks Service tools",
        }
    ]
)

In [None]:

alt_simple_session_with_moderation = alt_simple_agent_with_moderation.create_session()
alt_simple_agent_response_with_moderation1 = alt_simple_session_with_moderation.create_turn("Hello!")
print_response_text(alt_simple_agent_response_with_moderation1)

Then we test with the unsafe (burglary) example:

In [None]:
alt_simple_agent_response_with_moderation2 = alt_simple_session_with_moderation.create_turn(NPS_EXAMPLE_SAFETY_PROMPT)
print_response_text(alt_simple_agent_response_with_moderation2)

In [None]:
pretty_print(alt_simple_agent_response_with_moderation2)

Notice that this is returning our custom `AlternateModerationResponseObject` because the guardrails flagged this request.


This notebook has presented two alternative approaches to using guardrails models:

- Calling the Moderations API on the inputs and outputs of the Responses call lets you control the guardrails yourself at the cost of having to make additional API calls.
- Including `extra_body={"guardrails": [shield_id]}` in the Responses call gets Llama Stack to run the guardrails for you at the cost of less control and some added complexity in the client data structures.

Developers should consider the advantages and disadvantages of each option and choose the one that is best suited to their application.

## ReAct Agents

The Llama Stack Python Client has a client-side [ReAct Agent](https://github.com/llamastack/llama-stack-client-python/blob/main/src/llama_stack_client/lib/agents/react/agent.py) construct.  This construct still works.  It operates by listing the tools and formulating a text prompt that combines the user request, system instructions, and tool list into one big text prompt.  Those system instructions direct the model to take in observations and output thoughts before selecting an action, which can improve accuracy because it forces the model to split up the challenging process of identifying the right tool and parameter values into multiple simpler steps.  The accuracy benefits come at a cost of longer latency and higher token generation charges since outputting thoughts takes time and tokens.  For some challenging, high-value tasks the improved accuracy is worth the added time and cost.  For more details about ReAct see [ReAct: Synergizing Reasoning and Acting in Language Models](https://arxiv.org/abs/2210.03629).

ReAct was a revolutionary technique in 2022 when the paper was first published, but it is less relevant today because the lessons from ReAct have been expanded upon and incorporated into other parts of the technology stack.  In particular, some key elements and where they have wound up are:

- *Listing tools with formal input schemas*: In 2022, this needed to be done inside a single block of text because generative AI models were trained to take in a single block of text and output a single block of text.  However, modern "chat completion" models are generally trained to take a formal structure in which there are pre-defined separators and schemas for session history, tools, etc.  Of course, it is still *possible* to just lump all that info into a plain text blob like people did in 2022, but since the models are tuned to use the formal structure, it seems likely that you will get better results more often using that structure.  The way to use that structure is just to use the appropriate parameters in the Responses or Chat Completions APIs.  For example, if you list your tools in the `tools` parameter of Responses, a well-trained chat completion model is more likely to select the right tool more often than if you just describe them in plain text in the user message.
- *Outputting thoughts before deciding how to ReAct*: Recent "reasoning" models are trained specifically to do this using a formal structure.  For example, gpt-oss uses a format called [Harmony](https://cookbook.openai.com/articles/openai-harmony) in which the models output their reasoning before selecting tools and/or providing response text to end users for exactly this reason.

So if you are working with a modern "reasoning" model and an API that has formal structures for listing tools and outputting thoughts, then you would normally expect that your model and infrastructure are already providing the core benefits of ReAct.  On the other hand, if you are just using a "chat completion" model that is trained for tool selection but isn't explicitly trained to output thoughts before selecting a tool, then some sort of hybrid where you use a ReAct style prompt to encourage outputting thoughts before selecting a tool might be effective. 
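To make the contrast between the two styles of tool listing concrete, here is a minimal sketch that builds the same request both ways.  Everything specific in it is an assumption for illustration: the `search_parks` schema is invented (the real MCP server publishes its own), the tool is written in the OpenAI-style function-tool format rather than as an MCP entry, and the two builder functions are hypothetical helpers that only assemble arguments and do not call any server.

```python
import json

# Hypothetical tool definition for illustration only; the real NPS MCP
# server publishes its own tool schemas.
SEARCH_PARKS_TOOL = {
    "type": "function",
    "name": "search_parks",
    "description": "Search for national parks by state.",
    "parameters": {
        "type": "object",
        "properties": {
            "state_code": {"type": "string", "description": "Two-letter state code."}
        },
        "required": ["state_code"],
    },
}


def build_plain_text_request(user_request, tools):
    """2022 style: describe the tools inside one block of text."""
    tool_lines = [
        f"- {tool['name']}: {tool['description']} "
        f"Arguments: {json.dumps(tool['parameters']['properties'])}"
        for tool in tools
    ]
    prompt = (
        "You have these tools:\n"
        + "\n".join(tool_lines)
        + f"\n\nRequest: {user_request}"
    )
    return {"input": prompt}


def build_structured_request(user_request, tools):
    """Structured style: pass the tools through the dedicated `tools` parameter."""
    return {"input": user_request, "tools": tools}


plain = build_plain_text_request("Find parks in Rhode Island", [SEARCH_PARKS_TOOL])
structured = build_structured_request("Find parks in Rhode Island", [SEARCH_PARKS_TOOL])
print(plain["input"])
print(sorted(structured.keys()))
```

Either dictionary could be expanded into a call such as `client.responses.create(model=..., **structured)`.  Only the structured form gives the server a tool list it can render in the format the model was trained on.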

### How the existing ReAct agents in the Llama Stack Python client work

Before getting into such a hybrid approach, let's take a look at how the ReAct agents in the Llama Stack Python client work now.  First we will register an MCP server and then list all the tools.

In [None]:
from llama_stack_client.types.toolgroup_register_params import McpEndpoint

client.toolgroups.register(
    toolgroup_id="mcp::nps",
    provider_id="model-context-protocol",
    mcp_endpoint=McpEndpoint(uri=NPS_MCP_URL),
)

In [None]:
tool_defs = client.tools.list(toolgroup_id="mcp::nps")
tool_defs

Above are the tools from the MCP server (`search_parks`, etc.).  The ReAct prompt generator takes these tools in the form of dictionary objects, so we extract those:

In [None]:
tool_def_dictionary_objects = [x.__dict__ for x in tool_defs]

and then we call the prompt generator to produce a prompt using these tool definitions:

In [None]:
from llama_stack_client.lib.agents.react.agent import get_default_react_instructions
print(get_default_react_instructions(tool_def_dictionary_objects))

As you can see, there are a few parts to this prompt:

1. It explains the output format to produce including how to list selected tools and their parameter values.
2. It tells the model to always take exactly one action on each turn.
3. It provides the model with a bunch of examples of the desired output format ("few-shot prompting").
4. It says "Above example were using notional tools that might not exist for you. You only have access to these tools:"
5. Then it lists the actual MCP tools and the arguments that they take in a semi-formal notation.
6. Then it lists a set of general rules such as not re-doing a tool with the same parameters.
7. Finally, it offers the model a reward of $1,000,000 if the model succeeds.

The offer of a reward is not part of the original ReAct paper, but it became popular around the same time because there were some anecdotes about it being effective.  Late 2022 through early 2023 was an extraordinarily peculiar time in the history of AI.  It is conceivable that such an offer could be effective if the model's training data includes many examples of people offering a reward for answering a question followed by extraordinarily thoughtful, high quality answers.  In that case, the model could have learned this statistical trend during pre-training and thus be more likely to produce extraordinarily thoughtful, high quality answers following such a prompt.  On the other hand, there doesn't seem to be substantial evidence that this actually works, and it does add to the length of the input, driving up compute costs and latency.  So it is not included in the prompt for the "ReAct-style agentic reasoning for chat completions models using the Responses API" section later in this notebook, but feel free to try adding it and see for yourself if it helps.

### Using the existing ReAct agents in the Llama Stack Python client

Before proposing an alternative, this notebook provides an example of how to use the existing ReAct agent construct in the Python client, which does still work in the current version of Llama Stack:

In [None]:
from llama_stack_client.lib.agents.react.agent import ReActAgent
react_agent = ReActAgent(
    model=LLAMA_STACK_MODEL_ID,
    instructions=NPS_INSTRUCTIONS,
    client=client,
    tools=[
        {
            "type": "mcp",
            "server_url": NPS_MCP_URL,
            "server_label": "National Parks Service tools",
        }
    ]
)

In [None]:
# Use the ReAct agent created in the previous cell (not the legacy `agent` object).
session_id = react_agent.create_session(f"nps_react_session-{uuid.uuid4().hex}")

agent_response1 = react_agent.create_turn(
    messages=[{"role": "user", "content": NPS_EXAMPLE_PROMPT}],
    session_id=session_id,
    stream=False,
)
print_response_text(agent_response1)

On this simple example query, it seems to get comparable results to the results you get from calling Responses directly (see the earlier "Equivalent Responses API Example" section for more details).  The ReAct agent is instructing the model to output thoughts before acting which potentially makes it more effective.  On the other hand, pushing tool descriptions into the text of the prompt (instead of using the tool-list structure that the model is trained to process) potentially makes it less effective.  So either might be more effective for your use cases.  The Responses API is a core element of the Llama Stack strategy which focuses mainly on compatibility with the de facto industry "standard" of the OpenAI APIs.  In contrast, the ReAct agent is left over from an older strategy for Llama Stack and thus seems unlikely to undergo substantial improvement in the future.

The next section proposes an alternative based on the Responses API that preserves some of the benefits of ReAct while also benefitting from some of the power of more modern chat completions models that are trained to handle structured inputs.

## ReAct-style agentic reasoning for chat completions models using the Responses API

If you are using a "reasoning" model such as gpt-oss, gpt-5, or Claude Sonnet 4.5, then the model is already trained to output thoughts before selecting actions, so there is no real need to prompt it explicitly to do so.  Just make sure that you enable the reasoning mode, and everything you might have wanted from ReAct is already covered using structures that the model is specifically trained to handle.  Conversely, if you are using a model that is only pre-trained and instruction-tuned but without a formal chat completion template, then you probably want to use something like the original ReAct capability described above.  However, many models fall in between these extremes: they are trained to recognize specific structures for a chat-completions API including both session history and tool calling, but they are not explicitly trained to output reasoning traces.  For models of this sort, you want to use those API structures for chaining messages together and telling the model what tools are available, while still prompting explicitly for thoughts.  Thus an ideal fit might be a compromise solution like the one below:

In [None]:
REACT_STYLE_SYSTEM_INSTRUCTIONS = """
You are an expert assistant who can solve any task using tool calls. You will be given a task to solve as best you can.

You can use the result of the previous action as input for the next action.
The observation will always be the response from calling the tool: it can represent a file, like "image_1.jpg". You do not need to generate them, it will be provided to you.
Then you can use it as input for the next action.

Here are the rules you should always follow to solve your task:
1. Always use the right arguments for tools. Never use variable names in tool parameters, use the value instead.
2. Call a tool only when needed: do not call a tool if you do not need information, try to solve the task yourself.
3. Never re-do a tool call that you previously did with the exact same parameters.
"""

REACT_STYLE_THINK_PROMPT_TEMPLATE = """
An AI agent has been given a request from a user.  It has access to the specified tools.
Think about what that AI agent should do next.  That might include using a tool or it
might involve solving the task without using a tool.  Do not output any text other than
your thoughts about what the agent should do next.

Here is the user request:

{user_request}
"""

REACT_STYLE_ACT_PROMPT_TEMPLATE = """
You are an AI agent that has been given a request from a user.  There are some existing
thoughts for you to consider.  You should either produce a final response or select
tools to call.

Here is the user request:

{user_request}
"""

In [None]:
class ReActStyleSimpleExampleSession:
    def __init__(self, agent):
        self.agent = agent
        self.previous_response_id = None
    
    def create_turn(self, input):
        # Thinking phase
        response = self.agent.client.responses.create(
            model=self.agent.model,
            input=REACT_STYLE_THINK_PROMPT_TEMPLATE.format(user_request=input),
            previous_response_id=self.previous_response_id,
            instructions=REACT_STYLE_SYSTEM_INSTRUCTIONS,
            tools=self.agent.tools,
            # In the Thinking phase, we don't want the model to output anything other than the thoughts.
            # So it shouldn't actually select tools, it should just explain why it might or might not want to use a tool.

            # This parameter is not supported in the current version of the Llama Stack,
            # but you should uncomment it if you are using a future version that does support it.
            #tool_choice="none"
        )
        self.previous_response_id = response.id

        # For demo purposes, we will print the response from the thinking phase.
        print("Thinking phase response:")
        print_response_text(response)

        # Acting phase
        response = self.agent.client.responses.create(
            model=self.agent.model,
            input=REACT_STYLE_ACT_PROMPT_TEMPLATE.format(user_request=input),
            # Note that we are using the previous response id from the thinking phase
            # as the previous response id for the acting phase.  This allows the
            # acting phase to use the result of the thinking phase as input.
            previous_response_id=self.previous_response_id,
            instructions=REACT_STYLE_SYSTEM_INSTRUCTIONS,
            tools=self.agent.tools,
        )
        self.previous_response_id = response.id
        return response

class ReActStyleSimpleExampleAgent:
    def __init__(self, model, client, tools):
        self.model = model
        self.client = client
        self.tools = tools

    def create_session(self):
        return ReActStyleSimpleExampleSession(self)

Now we instantiate this agent and try it out:

In [None]:
simple_react_style_agent = ReActStyleSimpleExampleAgent(
    model=LLAMA_STACK_MODEL_ID,
    client=client,
    tools=[
        {
            "type": "mcp",
            "server_url": NPS_MCP_URL,
            "server_label": "National Parks Service tools",
        }
    ]
)
simple_react_style_session = simple_react_style_agent.create_session()

simple_react_style_agent_response1 = simple_react_style_session.create_turn(NPS_EXAMPLE_PROMPT)

As you can see above, the thinking phase response includes an overall plan for how to satisfy the requirement.  That plan is then available as an input for the next step.  This can be helpful to a model for the same reason it can be helpful for a human doing a challenging task to first have an abstract end-to-end plan before starting to act: coming up with an outline and filling in the details are both challenging tasks so trying to do them at the same time can be harder than doing each separately.

In [None]:
print_response_text(simple_react_style_agent_response1)

As you can see, we still get the same (correct) behavior from this example.  For more challenging examples or less powerful models, you might see higher quality (but slower and more expensive) results from this approach.

As in the earlier examples, you can combine this enhancement with other enhancements (human-in-the-loop tool approval, model safety, etc.) to develop an agent object that meets your needs.  Alternatively, you can just call the Responses API in the ways shown here without having a container object. These are just some examples to help you get started.

## Multi-process architectures

The examples above assume that agent creation, session creation, and turn creation all take place in the same process.  However, some users of the legacy Agents API might have one process that creates agents (e.g., an agent management console) and another process that uses them (e.g., a chatbot app that invokes those agents).  Assuming these processes have different life cycles and run on different kinds of machines, here are some ideas for how to handle these cases:

- If the agent creation logic is simple and static, then you can just move that agent creation logic into the application that uses the agent as in the examples above.  One downside of doing that is that you need to release a new version of the application that uses the agent every time the definition of the agent changes.  If that's not feasible for you, consider one of the other options.
- You can have some sort of key-value storage mechanism contain a description of the agent (i.e., the instructions, list of tools).  The process that creates agents can write that configuration to the storage and the process that invokes the agent can read from it.
- You can have the process that creates the agent deploy a container with that agent into a cluster and the process that invokes the agent call that container (which would then call Llama Stack via the Responses API).  This is a much heavier and more disruptive change than just putting agent configuration into a key-value store, but potentially much more powerful too since you can build complex agents with multiple phases that use different models under different circumstances, etc.
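As a rough illustration of the key-value storage option, the sketch below writes an agent description to a JSON file in one function and reads it back in another.  It makes several assumptions: a temporary directory stands in for whatever shared store you actually use, the helper names `save_agent_config` and `load_responses_kwargs` are hypothetical, and the model ID, instructions, and server URL are placeholders rather than values from this notebook.

```python
import json
import tempfile
from pathlib import Path


def save_agent_config(store_dir, agent_id, model, instructions, tools):
    """Write an agent definition where another process can read it."""
    config = {"model": model, "instructions": instructions, "tools": tools}
    path = Path(store_dir) / f"{agent_id}.json"
    path.write_text(json.dumps(config, indent=2))
    return path


def load_responses_kwargs(store_dir, agent_id):
    """Read an agent definition as keyword arguments for a Responses call."""
    path = Path(store_dir) / f"{agent_id}.json"
    return json.loads(path.read_text())


# Demo: a temporary directory stands in for a shared key-value store.
store_dir = tempfile.mkdtemp()

# "Agent management" side: publish the agent definition (placeholder values).
save_agent_config(
    store_dir,
    agent_id="nps-agent",
    model="example-model-id",
    instructions="You are a helpful assistant for questions about national parks.",
    tools=[
        {
            "type": "mcp",
            "server_url": "https://example.com/mcp",  # placeholder URL
            "server_label": "National Parks Service tools",
        }
    ],
)

# "Chatbot" side: load the definition when handling a request.
agent_kwargs = load_responses_kwargs(store_dir, "nps-agent")
print(agent_kwargs["model"], len(agent_kwargs["tools"]))
```

The process that invokes the agent could then call something like `client.responses.create(input=user_input, **agent_kwargs)`, so that changing the agent definition only requires updating the stored configuration.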

There are many other possible architectures to explore too.