# Level 5: MCP Based RAG (Medium Difficulty)

This notebook extends the [Level 4 Agentic & MCP notebook](./Level4_agentic_and_mcp.ipynb) by adding RAG.
It is intended for developers who are already familiar with [basic agentic workflows](./Level2_simple_agentic_with_websearch.ipynb). It highlights slightly more advanced use cases in which a single tool call is insufficient to complete the required task. Here we rely on both agentic RAG and MCP servers to expand our agent's capabilities.

## Overview

This tutorial covers the following steps:
1. Review OpenShift logs for a failing pod.
2. Categorize the pod and summarize its error.
3. Search the available troubleshooting documentation for ideas on how to resolve the error.
4. Send a Slack message to the ops team with a brief summary of the error and next steps to take.

### MCP Tools:

Throughout this notebook we will rely on the [kubernetes-mcp-server](https://github.com/manusa/kubernetes-mcp-server) by [manusa](https://github.com/manusa) to interact with our OpenShift cluster.

We will also be using the [Slack MCP Server](https://github.com/modelcontextprotocol/servers/tree/main/src/slack) in this notebook.





### Pre-Requisites

Before starting, ensure you have the following:
- A running Llama Stack server
- A running Slack MCP server. Refer to our [documentation](https://github.com/opendatahub-io/llama-stack-demos/tree/main/kubernetes/mcp-servers/slack-mcp) on how you can set this up on your OpenShift cluster
- Access to an OpenShift cluster with a deployment of the [OpenShift MCP server](https://github.com/opendatahub-io/llama-stack-on-ocp/tree/main/mcp-servers/openshift) (see the [deployment manifests](https://github.com/opendatahub-io/llama-stack-on-ocp/tree/main/kubernetes/mcp-servers/openshift-mcp) for assistance with this).

### Setting your ENV variables:

Use the [.env.example](../../../.env.example) to create a new file called `.env` and ensure you add all the relevant environment variables below. 

In addition to the environment variables listed in the ["Getting Started" notebook](demos/rag_agentic/notebooks/Level0_getting_started_with_Llama_Stack.ipynb), the following should be provided for this demo to run:

- `REMOTE` (boolean): dictates if you are using a remote llama-stack instance.
- `REMOTE_BASE_URL` (string): the URL for your llama-stack instance if using remote connection.
- `REMOTE_OCP_MCP_URL` (string): the URL for your OpenShift MCP server. If the client does not find the tool registered to the llama-stack instance, it will use this URL to register the OpenShift tool (used in demos 1 and 3).
- `REMOTE_SLACK_MCP_URL` (string): the URL for your Slack MCP server. If the client does not find the tool registered to the llama-stack instance, it will use this URL to register the Slack tool (used in demo 3).
- `USE_PROMPT_CHAINING` (boolean): dictates if the prompt should be formatted as 3 separate prompts to isolate each step or in a single turn. 
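
Values loaded from `.env` are always strings, so boolean settings such as `REMOTE` and `USE_PROMPT_CHAINING` need explicit parsing: a non-empty string like `"False"` is truthy in Python. The following is a minimal sketch of one way to handle this. The helper name `env_bool` and the set of accepted spellings are assumptions made for illustration; they are not part of this notebook or of Llama Stack.

```python
import os

# Hypothetical helper: the accepted spellings below are an illustrative assumption.
_TRUE_VALUES = {"1", "true", "yes", "on"}


def env_bool(name: str, default: bool = False) -> bool:
    """Read an environment variable and interpret it as a boolean."""
    raw = os.getenv(name)
    if raw is None:
        return default  # variable not set at all
    return raw.strip().lower() in _TRUE_VALUES


# Usage with the variable names listed above.
remote = env_bool("REMOTE", default=True)
use_prompt_chaining = env_bool("USE_PROMPT_CHAINING")
```

Any value outside the accepted set, including an empty string, is treated as `False`.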

## Setting Up the Environment
We will initialize our environment as described in detail in our ["Getting Started" notebook](demos/rag_agentic/notebooks/Level0_getting_started_with_Llama_Stack.ipynb). Please refer to it for additional explanations.

### Configuring logging

Now that we have our dependencies, let's set up logging for the application:

In [1]:
from llama_stack_client.lib.agents.event_logger import EventLogger
import logging
import sys
sys.path.append('..')  
from src.utils import step_printer

logger = logging.getLogger(__name__)
if not logger.hasHandlers():  
    logger.setLevel(logging.INFO)
    stream_handler = logging.StreamHandler()
    stream_handler.setLevel(logging.INFO)
    formatter = logging.Formatter('%(message)s')
    stream_handler.setFormatter(formatter)
    logger.addHandler(stream_handler)

### Configuration
This section sets up key parameters for model inference and the RAG (Retrieval-Augmented Generation) vector database.

In [2]:
import uuid

# Inference settings
MODEL="meta-llama/Llama-3.2-3B-Instruct"
TEMPERATURE = 0.0
TOP_P = 0.95
if TEMPERATURE > 0.0:
    strategy = {"type": "top_p", "temperature": TEMPERATURE, "top_p": TOP_P}
else:
    strategy = {"type": "greedy"}

# For this demo, we are using Milvus Lite, which is our preferred solution.
# Any other vector DB supported by Llama Stack can be used.

# RAG vector DB settings
VECTOR_DB_EMBEDDING_MODEL = "all-MiniLM-L6-v2"
VECTOR_DB_EMBEDDING_DIMENSION = 384
VECTOR_DB_CHUNK_SIZE = 512
VECTOR_DB_PROVIDER_ID = 'milvus'

# Unique DB ID for session
vector_db_id = f"test_vector_db_{uuid.uuid4()}"

### Connecting to llama-stack server

For the llama-stack instance, you can either run it locally or connect to a remote llama-stack instance.

#### Remote llama-stack

- For remote, set the `REMOTE` environment variable to `True` and set `REMOTE_BASE_URL` to the URL of your remote llama-stack instance.
- [Remote Setup Guide](https://github.com/opendatahub-io/llama-stack-on-ocp/tree/main/kubernetes)

#### Local llama-stack
- For local, set `REMOTE` to `False` and check that the local `base_url` in the cell below matches your deployment. It defaults to `http://localhost:8321`, which uses the default llama-stack port `8321`; change it if your deployment of llama-stack uses a different port.
- [Local Setup Guide](https://github.com/redhat-et/agent-frameworks/tree/main/prototype/frameworks/llamastack)

In [3]:
import os
from dotenv import load_dotenv
load_dotenv()

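# Use the `remote` variable to switch between a local development environment and a remote Kubernetes cluster.
# Environment variables are strings, so parse the value explicitly; the string "False" would otherwise be truthy.
remote = os.getenv("REMOTE", "True").strip().lower() in ("true", "1", "yes")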
stream_output = False # Set to True to stream the output of the agent.

if remote:
    base_url = os.getenv("REMOTE_BASE_URL")
else:
    base_url = "http://localhost:8321"

from llama_stack_client import LlamaStackClient

client = LlamaStackClient(
    base_url=base_url
)
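
If `REMOTE` is enabled but `REMOTE_BASE_URL` is not set, `os.getenv` returns `None` and the client is not pointed at the server you intended. A small guard can surface this early. The sketch below is one possible approach; the function name `resolve_base_url` and the constant `LOCAL_BASE_URL` are hypothetical, and the local address is the `http://localhost:8321` default used above.

```python
from typing import Optional

LOCAL_BASE_URL = "http://localhost:8321"  # default local address used in this notebook


def resolve_base_url(remote: bool, remote_base_url: Optional[str]) -> str:
    """Pick the Llama Stack endpoint, failing fast when the remote URL is missing."""
    if not remote:
        return LOCAL_BASE_URL
    if not remote_base_url:
        raise ValueError("REMOTE is enabled but REMOTE_BASE_URL is not set")
    return remote_base_url
```

In the cell above, this could be used as `base_url = resolve_base_url(remote, os.getenv("REMOTE_BASE_URL"))` before constructing the client.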

### Indexing the Documents
- Initialize a new document collection in the target vector DB. All parameters related to the vector DB, such as the embedding model and dimension, must be specified here.
- Provide a list of document URLs to the RAG tool. Llama Stack will handle fetching, conversion and chunking of the documents' content.

In [4]:
from llama_stack_client import RAGDocument

# define and register the document collection to be used
client.vector_dbs.register(
    vector_db_id=vector_db_id,
    embedding_model=VECTOR_DB_EMBEDDING_MODEL,
    embedding_dimension=VECTOR_DB_EMBEDDING_DIMENSION,
    provider_id=VECTOR_DB_PROVIDER_ID,
)

# ingest the documents into the newly created document collection
urls = [
    ("https://docs.redhat.com/en/documentation/openshift_container_platform/4.11/html/support/troubleshooting#troubleshooting-installations.html", "text/html"),
]
documents = [
    RAGDocument(
        document_id=f"num-{i}",
        content=url,
        mime_type=url_type,
        metadata={},
    )
    for i, (url, url_type) in enumerate(urls)
]
client.tool_runtime.rag_tool.insert(
    documents=documents,
    vector_db_id=vector_db_id,
    chunk_size_in_tokens=VECTOR_DB_CHUNK_SIZE,
)
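
To build intuition for what `chunk_size_in_tokens` controls, here is a rough sketch of fixed-size chunking. It is not Llama Stack's implementation: it treats whitespace-separated words as a stand-in for model tokens, and the function name `chunk_words` and the `overlap` parameter are assumptions made for illustration.

```python
from typing import List


def chunk_words(text: str, chunk_size: int, overlap: int = 0) -> List[str]:
    """Split text into windows of at most `chunk_size` words.

    Consecutive windows share `overlap` words. Words are only a stand-in
    for model tokens in this sketch.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

Smaller chunks give more precise retrieval hits but less surrounding context per hit; larger chunks do the opposite.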

### Validate tools are available in our llama-stack instance

When an instance of llama-stack is redeployed, your tools need to be re-registered. Conversely, if a tool is already registered with a llama-stack instance and you try to register another one with the same `toolgroup_id`, llama-stack will throw an error.

For this reason, we recommend including some code to validate your tools and toolgroups. This is where the `mcp_url` comes into play. The following code checks that the `builtin::rag`, `mcp::openshift`, and `mcp::slack` toolgroups are registered. If an MCP toolgroup is not listed, the code attempts to register it using the corresponding MCP URL.

If you are running the MCP server from source, the default value for this is: `http://localhost:8000/sse`.

If you are running the MCP server from a container, the default value for this is: `http://host.containers.internal:8000/sse`.

Make sure to pass the corresponding MCP URL for the server you are trying to register/validate tools for.

In [5]:
# MCP server URLs, used only when the corresponding toolgroup is not registered yet
ocp_mcp_url = os.getenv("REMOTE_OCP_MCP_URL")
slack_mcp_url = os.getenv("REMOTE_SLACK_MCP_URL")

# Get list of registered tools and extract their toolgroup IDs
registered_tools = client.tools.list()
registered_toolgroups = [tool.toolgroup_id for tool in registered_tools]

if  "builtin::rag" not in registered_toolgroups: # Required
    client.toolgroups.register(
        toolgroup_id="builtin::rag",
        provider_id="milvus"
    )

if "mcp::openshift" not in registered_toolgroups: # required
    client.toolgroups.register(
        toolgroup_id="mcp::openshift",
        provider_id="model-context-protocol",
        mcp_endpoint={"uri":ocp_mcp_url},
    )

if "mcp::slack" not in registered_toolgroups: # required
    client.toolgroups.register(
        toolgroup_id="mcp::slack",
        provider_id="model-context-protocol",
        mcp_endpoint={"uri":slack_mcp_url},
    )

# Log the current toolgroups registered
logger.info(
    f"Your Llama Stack server is already registered with the following tool groups: {set(registered_toolgroups)}\n"
)

Your Llama Stack server is already registered with the following tool groups: {'mcp::github', 'builtin::websearch', 'builtin::code_interpreter', 'mcp::openshift', 'mcp::slack', 'mcp::ansible', 'builtin::rag', 'mcp::custom_tool'}
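
The check-then-register logic above repeats once per MCP server. One possible way to factor it is a small helper, sketched below. The helper name `ensure_mcp_toolgroups` is hypothetical. It only uses the `client.tools.list()` and `client.toolgroups.register(...)` calls already shown in this notebook, and it raises a clear error when a missing toolgroup has no URL configured instead of registering a `None` endpoint.

```python
from typing import Dict, List, Optional


def ensure_mcp_toolgroups(client, endpoints: Dict[str, Optional[str]]) -> List[str]:
    """Register each MCP toolgroup in `endpoints` that the server does not list yet.

    `endpoints` maps a toolgroup ID (for example "mcp::slack") to its MCP URL.
    Returns the IDs that were newly registered.
    """
    registered = {tool.toolgroup_id for tool in client.tools.list()}
    newly_registered = []
    for toolgroup_id, url in endpoints.items():
        if toolgroup_id in registered:
            continue  # registering an existing toolgroup_id would raise an error
        if not url:
            raise ValueError(f"{toolgroup_id} is not registered and no MCP URL was provided")
        client.toolgroups.register(
            toolgroup_id=toolgroup_id,
            provider_id="model-context-protocol",
            mcp_endpoint={"uri": url},
        )
        newly_registered.append(toolgroup_id)
    return newly_registered
```

With the variables from the cell above, a call might look like `ensure_mcp_toolgroups(client, {"mcp::openshift": ocp_mcp_url, "mcp::slack": slack_mcp_url})`.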



### System Prompts for different models

**Note:** If you have multiple models configured with your Llama Stack server, you can choose which one to run your queries against. When switching to a different model, you may need to adjust the system prompt to align with that model's expected behavior. Many models provide recommended system prompts for optimal and reliable outputs; these are typically documented on their respective websites.

In [6]:
# Here is a system prompt we have come up with which works well for granite-3.2-8b
granite_model="ibm-granite/granite-3.2-8b-instruct"
granite_prompt="""You are a helpful AI assistant with access to the tools listed next. When a tool is required to answer the user's query, respond with `<tool_call>` followed by a JSON object of the tool used. For example: `<tool_call> {"name":"function_name","arguments":{"arg1":"value"}} </tool_call>`:The user will respond with the output of the tool execution response so you can continue with the rest of the initial user prompt (continue).
If a tool does not exist in the provided list of tools, notify the user that you do not have the ability to fulfill the request. """

# Here is a system prompt we have come up with which works well for llama-3.2-3b
llama_model="meta-llama/Llama-3.2-3B-Instruct"
llama_prompt= """You are a helpful assistant. You have access to a number of tools.
Whenever a tool is called, be sure return the Response in a friendly and helpful tone. Whenever a pod has an error query the vector DB on the error from the pods and return a summary of steps to take"""
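
Because the right system prompt depends on the model, one option is to keep the pairs in a lookup table so the two cannot drift apart. The sketch below is illustrative only: the names `SYSTEM_PROMPTS`, `DEFAULT_PROMPT`, and `prompt_for`, the generic fallback, and the placeholder prompt texts are all assumptions. In this notebook you would store `granite_prompt` and `llama_prompt` as the values.

```python
# Assumed generic fallback for models without a tuned prompt.
DEFAULT_PROMPT = "You are a helpful assistant."

# Placeholder texts stand in for the full prompts defined in the cell above.
SYSTEM_PROMPTS = {
    "ibm-granite/granite-3.2-8b-instruct": "<granite system prompt>",
    "meta-llama/Llama-3.2-3B-Instruct": "<llama system prompt>",
}


def prompt_for(model_id: str) -> str:
    """Return the system prompt tuned for `model_id`, or the generic fallback."""
    return SYSTEM_PROMPTS.get(model_id, DEFAULT_PROMPT)
```

An agent could then be created with `model=model_id` and `instructions=prompt_for(model_id)`, so that switching models only requires changing one value.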

## Resolve and report any errors that might be happening on my OpenShift cluster

We will also be using the [Slack MCP Server](https://github.com/modelcontextprotocol/servers/tree/main/src/slack) for this query. If you haven't already, you can follow the instructions [here](https://github.com/opendatahub-io/llama-stack-demos/tree/main/kubernetes/mcp-servers/slack-mcp#setting-up-on-ocp) to install the Slack MCP server. Once you have the Slack MCP server running, you will need to:

- Set up a Slack app in your Slack workspace by following the steps [here](https://github.com/opendatahub-io/llama-stack-demos/tree/main/kubernetes/mcp-servers/slack-mcp#setting-up-the-slack-bot).
- Once your Slack app is set up and you have the OAuth token, provide the token to your Slack MCP server so that the app can post messages to your channels.
- Finally, register your Slack MCP server with your Llama Stack server.

In [7]:
from llama_stack_client import Agent
# Create simple agent with tools
agent = Agent(
    client,
    model=llama_model,  # replace this with your choice of model
    instructions=llama_prompt,  # update the system prompt based on the model you are using
    tools=[
        dict(
            name="builtin::rag/knowledge_search",
            args={
                "vector_db_ids": [vector_db_id],  # IDs of the document collections to consider during retrieval
            },
        ),
        "mcp::openshift",
        "mcp::slack",
    ],
    tool_config={"tool_choice": "auto"},
    sampling_params={"max_tokens": 4096, "strategy": {"type": "greedy"}},
)

user_prompts = [
    "View the logs for pod slack-test in the llama-serve OpenShift namespace. Categorize it as normal or error.",
    "Search for solutions on this error and provide a summary of the steps to take.",
    "Summarize the results with the pod name and category, along with a brief explanation as to why you categorized it as normal or error and a brief on the next steps to take.",
    "Send a message with the summarization to the demos channel on Slack.",
]
session_id = agent.create_session(session_name="OCP_Slack_demo")
for i, prompt in enumerate(user_prompts):
    response = agent.create_turn(
        messages=[
            {
                "role":"user",
                "content": prompt
            }
        ],
        session_id=session_id,
        stream=stream_output,
    )
    if stream_output:
        for log in EventLogger().log(response):
            log.print()
    else:
        step_printer(response.steps) # print the steps of an agent's response in a formatted way. 


---------- 📍 Step 1: InferenceStep ----------
🛠️ Tool call Generated:
Tool call: pods_log, Arguments: {'name': 'slack-test', 'namespace': 'llama-serve'}

---------- 📍 Step 2: ToolExecutionStep ----------
🔧 Executing tool...



---------- 📍 Step 3: InferenceStep ----------
🤖 Model Response:
The logs for the pod "slack-test" in the namespace "llama-serve" could not be retrieved. The error message indicates that the pod "slack-test" was not found in the namespace "llama-serve". This may be due to the pod being deleted or not existing in the first place.

To resolve this issue, you can try the following steps:

1. Check if the pod "slack-test" exists in the namespace "llama-serve" using the `pods_list_in_namespace` function.
2. If the pod exists, check the status of the pod using the `pods_get` function to see if it is running or not.
3. If the pod is not running, try to restart it using the `pods_run` function.
4. If the pod is running, check the logs of the pod using the `pods_log` function to see if there are any errors.

Here is an example of how you can use the `pods_list_in_namespace` function to check if the pod exists:

```
{
  "type": "function",
  "name": "pods_list_in_namespace",
  "parameters":
```


---------- 📍 Step 3: InferenceStep ----------
🤖 Model Response:
Based on the knowledge search results, here is a summary of the steps to take to troubleshoot the issue:

1. Check the status of pods in the `openshift-etcd-operator` namespace using the `oc get pods -n openshift-etcd-operator` command.
2. If any of the pods are not showing a `Running` or `Completed` status, gather diagnostic information for the pod by reviewing events for the pod using the `oc describe pod/<pod_name> -n <namespace>` command.
3. Inspect pod status in detail by using the `crictl inspect` command to get the pod's logs.
4. List containers related to a pod using the `crictl ps` command and inspect container status in detail by using the `crictl inspect` command to get the container's logs.
5. Review the logs for any containers not showing a `Ready` status.
6. If the namespace is managed by a deployment configuration, check the deployment configuration name and base image reference using the `oc status` c


---------- 📍 Step 3: InferenceStep ----------
🤖 Model Response:
A message has been sent to the #demos channel on Slack with the summarization of the pod issue.



### Output Analysis

Let's step through the output to better understand what is happening in this agentic demo.

1. First, the LLM sends a tool call to the `pods_log` tool provided by the OpenShift MCP server to fetch the logs for the specified pod from the OpenShift cluster.
2. The tool executes and returns its result. In a successful run this is the pod's logs; in the run shown above, the tool instead reports that the pod `slack-test` was not found in the `llama-serve` namespace.
3. The LLM receives the response from the tool call along with the original query and uses it to categorize the pod.
4. The LLM sends a tool call to the RAG tool to query the vector DB about the error from the pod.
5. The LLM receives the response from the tool call and summarizes the information into steps to take.
6. The LLM then performs an inference on the accumulated context to produce the summary requested in the user prompt: the pod name, its category of 'Normal' or 'Error', a brief explanation, and next steps drawn from the vector DB results.
7. Finally, the LLM uses the Slack MCP server to send the summary to the demos channel.
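
The run above sends one turn per prompt. The `USE_PROMPT_CHAINING` variable listed under the environment variables describes the alternative of sending everything in a single turn. The code shown in this notebook does not read that variable, so the following is only a sketch of how such a switch might look; the function name `build_turns` and the numbered joining format are assumptions.

```python
from typing import List


def build_turns(prompts: List[str], use_prompt_chaining: bool) -> List[str]:
    """Return the user messages to send, one list entry per agent turn."""
    if use_prompt_chaining:
        return list(prompts)  # one turn per step, so each step is isolated
    # Single turn: number the steps so the model still sees them in order.
    combined = "\n".join(f"{i}. {p}" for i, p in enumerate(prompts, start=1))
    return [combined]
```

Each returned string would be passed as the `content` of one `agent.create_turn(...)` call. Chaining keeps each tool call focused on a single step, which tends to matter more for smaller models, at the cost of more round trips.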

## Key Takeaways

This tutorial demonstrates how to implement agentic RAG and MCP applications with Llama Stack. We do so by initializing an agent with access to the MCP tools and the RAG tool configured with Llama Stack, then invoking the agent on each of the specified queries. Please check out our other notebooks for more examples using Llama Stack.