# Humanloop RAG Evaluation Walkthrough

The goal of this notebook is to demonstrate how to take an existing RAG pipeline and use Humanloop to evaluate it.

This notebook demonstrates how to:

1. Run an Eval on your RAG pipeline.
2. Set up detailed logging with SDK decorators.
3. Log to Humanloop manually.


## What is Humanloop?

Humanloop is an interactive development environment designed to streamline the entire lifecycle of LLM app development. It serves as a central hub where AI, Product, and Engineering teams can collaborate on Prompt management, Evaluation and Monitoring workflows. 


## What is RAG?

RAG stands for Retrieval Augmented Generation.
- **Retrieval** - Getting the relevant information from a larger data source for a given query.
- **Augmented** - Using the retrieved information as input to an LLM.
- **Generation** - Generating an output from the model given the input.

In practice, it remains an effective way to use LLMs for things like question answering, summarization, and more, where the data source is too large to fit in the context window of the LLM, or where providing the full data source for each query is not cost-effective.
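
To make the three steps concrete, here is a toy sketch that uses only the standard library. The word-overlap scoring and the stubbed `generate` function are illustrative stand-ins chosen for this example; the pipeline built later in this notebook uses Chroma and OpenAI instead.

```python
# Toy illustration of the three RAG steps using only the standard library.
# The word-overlap scoring and the stubbed "model" are stand-ins, not real components.

def retrieve(query: str, documents: list[str]) -> str:
    """Retrieval: return the document sharing the most words with the query."""
    query_words = set(query.lower().split())
    return max(documents, key=lambda doc: len(query_words & set(doc.lower().split())))

def augment(query: str, context: str) -> str:
    """Augmented: place the retrieved context alongside the query in the model input."""
    return f"Context:\n{context}\n\nQuestion: {query}"

def generate(prompt: str) -> str:
    """Generation: a real pipeline would call an LLM here; this stub only reports the prompt size."""
    return f"(model output for a {len(prompt)}-character prompt)"

documents = [
    "Hemophilia A is an X-linked recessive disorder.",
    "Aspirin inhibits cyclooxygenase enzymes.",
]
query = "How is hemophilia A inherited?"
context = retrieve(query, documents)
prompt = augment(query, context)
answer = generate(prompt)
```

Only the retrieved document reaches the model, which is what keeps the input small when the full data source would not fit in the context window.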


## What are the major challenges with RAG?

Implementing RAG and other similar flows complicates the process of [Prompt Engineering](https://humanloop.com/blog/prompt-engineering-101) because you expand the design space of your application. There are lots of choices you need to make around the retrieval and Prompt components that can significantly impact the performance of your overall application. For example,
- How do you select the data source?
- How should it be chunked up and indexed?
- What embedding and retrieval model should you use?
- How should you combine the retrieved information with the query?
- What should your system Prompt be?
- Which model should you use?

The process of versioning, evaluating and monitoring your pipeline therefore needs to consider both the retrieval and generation components. This is where Humanloop can help.
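
As one illustration of the chunking choice listed above, the following sketch splits text into fixed-size word windows with overlap. The function name `chunk_text` and its parameters are hypothetical and are not used elsewhere in this notebook; real pipelines often chunk by tokens, sentences, or document structure instead.

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into windows of chunk_size words, sharing `overlap` words with the previous window."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks

chunks = chunk_text("one two three four five six seven", chunk_size=4, overlap=2)
```

Both `chunk_size` and `overlap` change what the retriever can return, so they are worth treating as versioned parameters of the pipeline alongside the Prompt.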


# Example RAG Pipeline

We first need a reference RAG implementation. Our use case will be Q&A over a corpus of medical documents.

- **Dataset**: we'll use a version of the [MedQA dataset](https://huggingface.co/datasets/bigbio/med_qa) from Hugging Face. This is a multiple-choice question-answering problem based on the United States Medical License Exams (USMLE), with reference textbooks that contain the required information to answer the questions.
- **Retriever**: we're going to use [Chroma](https://docs.trychroma.com/getting-started) as a simple local vector DB with their default embedding model `all-MiniLM-L6-v2`. You can replace this with your favorite retrieval system.
- **Prompt**: **the Prompt will be managed in code**, populated with the user's question and the context retrieved from the Retriever and sent to [OpenAI](https://platform.openai.com/docs/api-reference/introduction) to generate the answer.


### Where to store your Prompts?

Generally speaking, when the engineering/applied AI teams are mainly responsible for managing the details of the Prompt, then the pattern of storing or constructing the Prompt in code works well. This is the pattern we follow in this tutorial. 

However, if the Product/Domain Expert teams are more involved in Prompt engineering and management, then the Prompt can instead be managed on Humanloop and retrieved or called by your code - this workflow lies outside the scope of this tutorial and we cover it separately. 

## Complete Prerequisites

### Install packages

This repository uses [Poetry](https://python-poetry.org/) to manage dependencies:

In [None]:
!poetry install

### Initialise the SDKs

You will need to set your OpenAI API key in the  `.env` file in the root of the repo. You can retrieve your API key from your [OpenAI account](https://platform.openai.com/api-keys).


In [1]:
# Set up dependencies
from dotenv import load_dotenv
import os
import chromadb
from openai import OpenAI

import pandas as pd

# load .env file that contains API keys
load_dotenv()

# init clients
chroma = chromadb.Client()
openai = OpenAI(api_key=os.getenv("OPENAI_KEY"))


### Set up the Vector DB
This involves loading the data from the MedQA dataset and embedding the data within a collection in Chroma. This will take a couple of minutes to complete.

In [2]:
# init collection into which we will add documents
collection = chroma.get_or_create_collection(name="MedQA")

# load knowledge base
knowledge_base = pd.read_parquet("../../assets/sources/textbooks.parquet")
knowledge_base = knowledge_base.sample(10, random_state=42)


# Add to Chroma - will by default use local vector DB and model all-MiniLM-L6-v2
collection.add(
    documents=knowledge_base["contents"].to_list(),
    ids=knowledge_base["id"].to_list(),
)

### Define the Prompt
We define a simple prompt template that has variables for the question, answer options and retrieved data.

It is generally good practice to define the Prompt details that impact the behaviour of the model in one place, separate from your application logic.

In [3]:
model = "gpt-4o-mini"
temperature = 0
template = [
    {
        "role": "system",
        "content": """Answer the following question factually.

Question: {{question}}

Options:
- {{option_A}}
- {{option_B}}
- {{option_C}}
- {{option_D}}
- {{option_E}}

---

Here is some retrieved information that might be helpful.
Retrieved data:
{{retrieved_data}}

---

Give your answer in 3 sections using the following format. Do not include the quotes or the brackets. Do include the "---" separators.
```
<chosen option verbatim>
---
<clear explanation of why the option is correct and why the other options are incorrect. keep it ELI5.>
---
<quote relevant information snippets from the retrieved data verbatim. every line here should be directly copied from the retrieved data>
```
""",
    }
]
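
The `{{variable}}` placeholders in the template are filled in at call time. The following is a minimal sketch of that substitution, assuming simple name-based replacement; `fill_placeholders` and `fill_messages` are hypothetical helpers written for illustration and are not the Humanloop SDK's implementation.

```python
import re

def fill_placeholders(text: str, variables: dict[str, str]) -> str:
    """Replace each {{name}} with variables[name]; unknown names are left untouched."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        return str(variables.get(name, match.group(0)))
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, text)

def fill_messages(messages: list[dict], variables: dict[str, str]) -> list[dict]:
    """Apply fill_placeholders to the content of every chat message, returning new dicts."""
    return [{**m, "content": fill_placeholders(m["content"], variables)} for m in messages]

example_template = [{"role": "system", "content": "Question: {{question}}\n- {{option_A}}"}]
filled = fill_messages(example_template, {"question": "What is 2 + 2?", "option_A": "4"})
```

Returning new message dicts leaves the template itself unchanged, so the same template can be reused for every question.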



## Define the RAG Pipeline

Now we provide the reference RAG pipeline using Chroma and OpenAI that takes a question and returns an answer. This is ultimately what we will evaluate.


In [4]:
def retrieval_tool(question: str) -> str:
    """Retrieve most relevant document from the vector db (Chroma) for the question."""
    response = collection.query(query_texts=[question], n_results=1)
    retrieved_doc = response["documents"][0][0]
    return retrieved_doc

def call_llm(**inputs):
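    # Populate the Prompt template. This uses the `humanloop` client created under
    # "Eval with Humanloop" > "Initialise the SDK", so run that cell before calling this function.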
    messages = humanloop.prompts.populate_template(template, inputs)
    
    # Call OpenAI to get response
    chat_completion = openai.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=messages,
    )
    return chat_completion.choices[0].message.content

def ask_question(**inputs) -> str:
    """Ask a question and get an answer using a simple RAG pipeline"""
    
    # Retrieve context
    retrieved_data = retrieval_tool(inputs["question"])
    inputs = {**inputs, "retrieved_data": retrieved_data}

    # Call LLM
    return call_llm(**inputs)
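
The double index in `response["documents"][0][0]` reflects how Chroma nests query results: one inner list per query text, each holding up to `n_results` matches. The sketch below uses a hand-written dictionary in place of a real query result, and the helpers `top_document` and `all_documents` are hypothetical names introduced for illustration.

```python
# Hand-written stand-in for a Chroma query result with 2 query texts and n_results=2.
# Real results carry further keys (for example distances); only two are mocked here.
mock_response = {
    "ids": [["doc-1", "doc-7"], ["doc-3", "doc-1"]],
    "documents": [
        ["text of doc-1", "text of doc-7"],  # matches for the first query text
        ["text of doc-3", "text of doc-1"],  # matches for the second query text
    ],
}

def top_document(response: dict, query_index: int = 0) -> str:
    """Return the best-ranked document for one of the query texts."""
    return response["documents"][query_index][0]

def all_documents(response: dict, query_index: int = 0, separator: str = "\n\n") -> str:
    """Join every retrieved document for a query, e.g. when n_results > 1."""
    return separator.join(response["documents"][query_index])
```

If you raise `n_results` above 1, something like `all_documents` is one way to pass several passages to the Prompt as a single `retrieved_data` string.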


In [None]:
# Test the pipeline
chat_completion = ask_question(
    **{
        "question": "A 34-year-old male suffers from inherited hemophilia A. He and his wife have three unaffected daughters. What is the probability that the second daughter is a carrier of the disease?",
        "option_A": "0%",
        "option_B": "25%",
        "option_C": "50%",
        "option_D": "75%",
        "option_E": "100%",
    }
)
print(chat_completion)

We now have a working RAG pipeline, which we can evaluate using Humanloop.
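
Because the Prompt asks for three sections separated by `---`, downstream code can split the output back into its parts. This is a minimal sketch assuming the model followed the format exactly; `parse_answer` is a hypothetical helper, and the example output is hand-written rather than a real model response.

```python
def parse_answer(output: str) -> dict[str, str]:
    """Split a model output into answer, explanation and citation sections."""
    sections = [part.strip() for part in output.split("---")]
    if len(sections) != 3:
        raise ValueError(f"Expected 3 sections separated by '---', got {len(sections)}")
    answer, explanation, citation = sections
    return {"answer": answer, "explanation": explanation, "citation": citation}

example_output = "100%\n---\nAll daughters inherit the father's X chromosome.\n---\nX-linked recessive"
parsed = parse_answer(example_output)
```

Raising on a malformed output, rather than guessing, makes format violations visible instead of silently scoring them as wrong answers.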

# Eval with Humanloop

Now we will integrate Humanloop into our RAG pipeline to evaluate it. We will use the Humanloop SDK to run an Eval on our RAG pipeline.

## Initialise the SDK
You will need to set your Humanloop API key in the  `.env` file in the root of the repo. You can retrieve your API key from your [Humanloop organization](https://app.humanloop.com/account/api-keys).


In [6]:
# Init the Humanloop SDK
from humanloop import Humanloop

load_dotenv()
humanloop = Humanloop(api_key=os.getenv("HUMANLOOP_KEY"), base_url=os.getenv("HUMANLOOP_BASE_URL"))

## Set up Evaluators

Here we will upload some Evaluators defined in code in `assets/evaluators/` so that Humanloop can manage running these for Evaluations (and later for Monitoring!)

Alternatively you can define [AI](https://humanloop.com/docs/v5/guides/evals/llm-as-a-judge), [Code](https://humanloop.com/docs/v5/guides/evals/code-based-evaluator) and [Human-based](https://humanloop.com/docs/v5/guides/evals/human-evaluators) Evaluators via the UI.

Furthermore, you can choose to not host the Evaluator on Humanloop and instead use your own runtime and post the results as part of the Evaluation. This can be useful for more complex workflows that require custom dependencies or resources. See the "Running Evaluators locally" section below for details.

In [7]:
def upload_evaluators():
    """Uploads Evaluators to Humanloop.
    
    Uploads the "Exact match", "Levenshtein", and "Reasoning" Evaluators.
    The "Exact match" and "Levenshtein" Evaluators are slight modifications to the examples
    automatically created in the "Example Evaluators" folder in Humanloop when you signed up,
    with some additional parsing for the output of this RAG pipeline.
    """
    # Upload Code Evaluators
    for evaluator_name, file_name, return_type in [
        ("Exact match", "exact_match.py", "boolean"),
        ("Levenshtein", "levenshtein.py", "number"),
    ]:
        with open(f"../../assets/evaluators/{file_name}", "r") as f:
            code = f.read()
        humanloop.evaluators.upsert(
            path=f"Evals demo/{evaluator_name}",
            spec={
                "evaluator_type": "python",
                "arguments_type": "target_required",
                "return_type": return_type,
                "code": code,
            },
            commit_message=f"New version from {file_name}",
        )

    # Upload an LLM Evaluator
    humanloop.evaluators.upsert(
        path="Evals demo/Reasoning",
        spec={
            "evaluator_type": "llm",
            "arguments_type": "target_free",
            "return_type": "boolean",
            "prompt": {
                "model": "gpt-4o-mini",
                "endpoint": "complete",
                "temperature": 0,
                "template": "An answer is shown below. The answer contains 3 sections, separated by \"---\". The first section is the final answer. The second section is an explanation. The third section is a citation.\n\nEvaluate if the final answer follows from the citation and the reasoning in the explanation section. Give a brief explanation/discussion. Do not make your judgment based on factuality, but purely based on the logic presented.\nOn a new line, give a final verdict of \"True\" or \"False\".\n\nAnswer:\n{{log.output}}",
            },
        },
        commit_message="Initial reasoning evaluator.",
    )

In [9]:
upload_evaluators()
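
The contents of `exact_match.py` and `levenshtein.py` are not shown in this notebook. As a rough sketch of what such code Evaluators might look like, the functions below assume that an Evaluator receives the Log and the Datapoint as dictionaries and that the expected answer lives under `testcase["target"]["output"]`; treat both the signature and the field names as assumptions to check against the actual files.

```python
def first_section(output: str) -> str:
    """Return the chosen option: the text before the first '---' separator."""
    return output.split("---")[0].strip()

def exact_match(log: dict, testcase: dict) -> bool:
    """Boolean Evaluator: does the chosen option equal the target output?"""
    return first_section(log["output"]) == testcase["target"]["output"].strip()

def levenshtein(log: dict, testcase: dict) -> int:
    """Number Evaluator: edit distance between the chosen option and the target output."""
    a, b = first_section(log["output"]), testcase["target"]["output"].strip()
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]

log = {"output": "100%\n---\nexplanation\n---\ncitation"}
testcase = {"target": {"output": "100%"}}
```

Parsing out the first section before comparing is what lets a strict metric like exact match work on an output that also contains an explanation and a citation.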

## Create a Dataset
Here we will create a Dataset on Humanloop using the MedQA test dataset. Alternatively you can create a Dataset from Logs on Humanloop, or upload via the UI - see our [guide](https://humanloop.com/docs/v5/evaluation/guides/create-dataset).

You can then effectively version control your Dataset centrally on Humanloop and hook into it for Evaluation workflows in code and via the UI.

In [10]:
def upload_dataset_to_humanloop():
    df = pd.read_json("../../assets/datapoints.jsonl", lines=True)

    datapoints = [row.to_dict() for _i, row in df.iterrows()][0:20]
    return humanloop.datasets.upsert(
        path="Evals demo/MedQA test",
        datapoints=datapoints,
        commit_message=f"Added {len(datapoints)} datapoints from MedQA test dataset.",
    )


In [11]:
dataset = upload_dataset_to_humanloop()

### Evaluate the pipeline over the Dataset

In [12]:
datapoints_pager = humanloop.datasets.list_datapoints(dataset.id, version_id=dataset.version_id)
datapoints = [datapoint for datapoint in datapoints_pager]

In [None]:
checks = humanloop.evaluations.run(
    name="Demo cookbook",
    file={
        "path": "Evals demo/MedQA pipeline",
        "callable": ask_question,
    },
    dataset={
        "path": "Evals demo/MedQA test",
        "datapoints": datapoints,
    },
    evaluators=[
        {"path": "Evals demo/Exact match"},
        {"path": "Evals demo/Levenshtein"},
        {"path": "Evals demo/Reasoning"},
        {"path": "Example Evaluators/Code/Latency"},
    ],
)

# Detailed logging

So far, we've been treating the RAG pipeline like a black box and evaluating it as a whole.
We can extend our logging to include more detailed information about the pipeline's internal steps. This can be useful for debugging and monitoring the various parts of the pipeline.

We can do this by adding logging for the Prompt and Tool steps within the Flow using Humanloop's Python decorators. If you're using a different language, you can still log to Humanloop via the API. Skip to the "Logging with the API" section below or check out our [guide](https://humanloop.com/docs/v5/guides/observability/logging-through-api) for more details.


```python
@humanloop.flow(path="Evals demo/MedQA pipeline")
def rag_pipeline(...):
    ...

@humanloop.tool(path="Evals demo/Retrieval tool")
def retrieval_tool(...):
    ...

@humanloop.prompt(path="Evals demo/LLM call")
def llm_call(...):
    ...
```

In [15]:
# Here we give them a different name to keep the original functions around to allow for this cell to be run multiple times,
# but you would not need to do so in your actual implementation.

@humanloop.flow(path="Evals demo/MedQA pipeline")
def ask_question_decorated(**inputs: str):
    retrieved_data = retrieval_tool_decorated(inputs["question"])
    inputs = {**inputs, "retrieved_data": retrieved_data}
    return call_llm_decorated(**inputs)

@humanloop.tool(path="Evals demo/Retrieval tool")
def retrieval_tool_decorated(question: str) -> str:
    return retrieval_tool(question)

@humanloop.prompt(path="Evals demo/LLM call")
def call_llm_decorated(**inputs):
    return call_llm(**inputs)


In [None]:
import random

print(
    ask_question_decorated(
        **random.choice(datapoints).inputs
    )
)

After running the above, you should see a new Log on Humanloop corresponding to the execution of the pipeline.

# Running Eval with decorators

These decorated functions can similarly be used to run an Eval on the pipeline. This will allow you to evaluate the pipeline and see the detailed logs for each step in the pipeline.

Let's change from `gpt-4o-mini` to `gpt-4o` and re-run the Eval.

By passing in the same `name` to `humanloop.evaluations.run(...)` call, we'll add another run to the previously-created Evaluation on Humanloop. This will allow us to compare the two Runs side-by-side.

In [17]:
model = "gpt-4o"

In [None]:
checks = humanloop.evaluations.run(
    name="Demo cookbook",
    file={
        "path": "Evals demo/MedQA pipeline",
        "callable": ask_question_decorated,
        "type": "flow",
    },
    dataset={
        "path": "Evals demo/MedQA test",
        "datapoints": datapoints,
    },
    evaluators=[
        {"path": "Evals demo/Exact match"},
        {"path": "Evals demo/Levenshtein"},
        {"path": "Evals demo/Reasoning"},
        {"path": "Example Evaluators/Code/Latency"},
    ],
)

Viewing our Evaluation on Humanloop, we can see that our newly-added Run with `gpt-4o` has been added to the Evaluation.
On the **Stats** tab, we can see that `gpt-4o` scores better for our "Exact match" (and "Levenshtein") metrics, but has higher latency.

![Eval runs](../../assets/images/evaluate_rag_flow_stats.png)

Perhaps surprisingly, `gpt-4o` performs worse according to our "Reasoning" Evaluator, as the above screenshot shows.

Humanloop also allows you to dive deeper into the specific Logs for each Run, to understand why the model performed the way it did.

Going to the **Review** tab and paging through Logs, we see that the "Reasoning" Evaluator has flagged the following `gpt-4o` Log, with the justification that it provided a citation that was not relevant to the question. `gpt-4o-mini` on the other hand did not provide any citation.

![Eval runs](../../assets/images/evaluate_rag_flow_review.png)

With Humanloop, you can measure the performance of your RAG pipelines and investigate changes in performance.

# Logging with the API

Above, we've let the SDK handle logging and versioning for us. However, you can also log data to Humanloop using the API directly. This can be useful if you want to perform some post-processing on the data before logging it, or if you want to include additional metadata in the logs or versions.

We'll now demonstrate how to extend your Humanloop logging with more fidelity; creating Tool, Prompt, and Flow Logs to give you full visibility.

We add additional logging steps to our `ask_question` function to represent the full trace of events on Humanloop.

(Note that the `run_id` and `source_datapoint_id` arguments are optional, and are included here for use in the Evaluation workflow demonstrated later.)

In [34]:
from datetime import datetime
import inspect

def ask_question_with_logging(run_id: str | None = None, source_datapoint_id: str | None = None, **inputs) -> str:
    """Ask a question and get an answer using a simple RAG pipeline."""


    trace = humanloop.flows.log(
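        # Use the same Flow path that the Evaluation is created for, so Logs can be attached to its Runs
        path="Evals demo/MedQA pipeline",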
        flow={
            "attributes": {
                "prompt": {
                    "template": template,
                    "model": model,
                    "temperature": temperature,
                },
                "tool": {
                    "name": "retrieval_tool_v3",
                    "description": "Retrieval tool for MedQA.",
                    "source_code": inspect.getsource(retrieval_tool),
                },
            }
        },
        inputs=inputs,
        start_time=datetime.now(),
        run_id=run_id,
        source_datapoint_id=source_datapoint_id,
    )

    # Retrieve context
    start_time = datetime.now()
    retrieved_data = retrieval_tool(inputs["question"])
    inputs = {**inputs, "retrieved_data": retrieved_data}

    # Log the retriever information to Humanloop separately
    humanloop.tools.log(
        path="Evals demo/Retrieval tool",
        tool={
            "function": {
                "name": "retrieval_tool",
                "description": "Retrieval tool for MedQA.",
            },
            "source_code": inspect.getsource(retrieval_tool),
        },
        output=retrieved_data,
        trace_parent_id=trace.id,
        start_time=start_time,
        end_time=datetime.now()
    )
    
    # Populate the Prompt template
    start_time = datetime.now()
    messages = humanloop.prompts.populate_template(template, inputs)

    # Call OpenAI to get response
    chat_completion = openai.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=messages,
    )
    output = chat_completion.choices[0].message.content

    # Log the prompt information to Humanloop separately
    humanloop.prompts.log(
        path="evals_demo/medqa-answer",
        prompt={
            "model": model,
            "temperature": temperature,
            "template": template,
        },
        inputs=inputs,
        output=output,
        output_message=chat_completion.choices[0].message,
        trace_parent_id=trace.id,
        start_time=start_time,
        end_time=datetime.now()
    )

    # Close the trace
    humanloop.flows.update_log(
        log_id=trace.id,
        output=output,
        trace_status="complete",
    )
    
    return output

In [None]:
print(
    ask_question_with_logging(
        **random.choice(datapoints).inputs
    )
)
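
The manual logging above repeats the same start and end timestamp bookkeeping around every step. One way to factor that out is a small context manager, sketched here under the assumption that your logging call accepts `start_time` and `end_time` keyword arguments; `timed` and `fake_log` are hypothetical names, and `fake_log` merely returns what it was given.

```python
from contextlib import contextmanager
from datetime import datetime

@contextmanager
def timed():
    """Yield a dict that receives start_time on entry and end_time on exit."""
    timing = {"start_time": datetime.now()}
    try:
        yield timing
    finally:
        timing["end_time"] = datetime.now()

def fake_log(**fields) -> dict:
    """Hypothetical stand-in for a logging call; it simply returns its keyword arguments."""
    return fields

with timed() as timing:
    retrieved = "some retrieved context"  # the step being measured

record = fake_log(output=retrieved, **timing)
```

Because `end_time` is set in a `finally` clause, the timing is recorded even when the measured step raises.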

## Evaluating with manual logging

To orchestrate your own Evaluations, you can pass in `run_id` and `source_datapoint_id` to the `humanloop.flows.log(...)` call to associate Logs with a specific Run and Datapoint.

The following is an example of how you can manually create an Evaluation and Run, and log data to Humanloop using the API,
giving you full control over the Evaluation process.

In [36]:
# Create Evaluation
evaluation = humanloop.evaluations.create(
    name="Manual logging demo",
    file={"path": "Evals demo/MedQA pipeline"},
    evaluators=[
        {"path": "Evals demo/Exact match"},
        {"path": "Evals demo/Levenshtein"},
        {"path": "Evals demo/Reasoning"},
        {"path": "Example Evaluators/Code/Latency"},
    ],
)

# Create Run
run = humanloop.evaluations.create_run(id=evaluation.id, dataset={"path": "Evals demo/MedQA test"})


In [None]:
from tqdm import tqdm

# Run the pipeline over the Dataset
for datapoint in tqdm(datapoints):
    ask_question_with_logging(run_id=run.id, source_datapoint_id=datapoint.id, **datapoint.inputs)
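
To see why each Log carries a `run_id` and a `source_datapoint_id`, consider the bookkeeping they enable. The following in-memory sketch is purely illustrative (the log records, `accuracy_by_run` and `missing_datapoints` are hypothetical and unrelated to the Humanloop API): grouping by Run yields per-Run scores, and the Datapoint IDs show which inputs still lack a Log.

```python
from collections import defaultdict

# Hypothetical in-memory logs; each one records the Run and Datapoint it belongs to.
logs = [
    {"run_id": "run_a", "source_datapoint_id": "dp_1", "output": "100%", "target": "100%"},
    {"run_id": "run_a", "source_datapoint_id": "dp_2", "output": "25%", "target": "50%"},
    {"run_id": "run_b", "source_datapoint_id": "dp_1", "output": "100%", "target": "100%"},
    {"run_id": "run_b", "source_datapoint_id": "dp_2", "output": "50%", "target": "50%"},
]

def accuracy_by_run(logs: list[dict]) -> dict[str, float]:
    """Group logs by run_id and compute the fraction whose output matches the target."""
    matches: dict[str, list[bool]] = defaultdict(list)
    for log in logs:
        matches[log["run_id"]].append(log["output"] == log["target"])
    return {run_id: sum(flags) / len(flags) for run_id, flags in matches.items()}

def missing_datapoints(logs: list[dict], run_id: str, datapoint_ids: set[str]) -> set[str]:
    """Datapoints that have no log yet for the given Run."""
    seen = {log["source_datapoint_id"] for log in logs if log["run_id"] == run_id}
    return datapoint_ids - seen

scores = accuracy_by_run(logs)
```

Comparing Runs on the same Datapoints is what makes a side-by-side comparison meaningful, since each score is computed over identical inputs.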


You can then similarly view results on the Humanloop UI.

![Eval Logs table](../../assets/images/evaluate_rag_flow_logs.png)

# Using Retriever as a Tool

In the above example, the flow of our app is static: given the question, we always call the retriever, pass the context to the model, and then respond to the user. It can be helpful to have a more dynamic setup where we instead allow the model to choose when to use the retriever.
This allows the model to do smart query expansion, or ask clarifying questions, before deciding to retrieve the context, which can be a significant boost in performance.

This entails updating the Prompt with a Tool schema definition for the retriever, and then routing your flow to the retriever based on whether or not the model makes a tool call. The Flow on Humanloop can capture the full trace of events so that you have full transparency into the model's actions. And the `evaluations.run(...)` utility works as normal. We will demonstrate how to do this below.

First we define the tools required for the model to call and their corresponding schemas to provide to the model. We will then define the full function for interacting with this more agentic RAG variant.

In [None]:
# Define the tools and their schemas:

tool_schemas = [
    {
        "type": "function",
        "function": {
            "name": "retrieve_knowledge",
            "description": "Looks up relevant context in a knowledge base for a given query.",
            "parameters": {
                "type": "object",
                "required": ["query"],
                "properties": {
                    "query": {"type": "string", "description": "The query to retrieve knowledge to help answer."},
                },
                "additionalProperties": False,
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "get_clarification",
            "description": "Asks the user clarifying information.",
            "parameters": {
                "type": "object",
                "required": ["question"],
                "properties": {
                    "question": {"type": "string", "description": "Details of the clarification you would like from the user in order to help answer the question."},
                },
                "additionalProperties": False,
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "provide_answer",
            "description": "Submits the answer to the user.",
            "parameters": {
                "type": "object",
                "required": ["answer", "reasoning", "citation"],
                "properties": {
                    "answer": {"type": "string", "description": "The final synthesized answer."},
                    "reasoning": {"type": "string", "description": "The reasoning for your answer."},
                    "citation": {"type": "string", "description": "The citation to the knowledge base entry you used when answering."},
                },
                "additionalProperties": False,
            },
        },
    }
]


def retrieve_knowledge(query: str) -> str:
    """Retrieve most relevant document from the vector db (Chroma) for the question."""
    response = collection.query(query_texts=[query], n_results=1)
    retrieved_doc = response["documents"][0][0]
    return retrieved_doc


def get_clarification(question: str) -> str:
    """Ask the user for clarifying information."""
    # TODO: Implement this with your own logic for sending and receiving messages from the user
    print("AI: ", question)
    return input("User: ")

def provide_answer(answer: str, reasoning: str, citation: str) -> str:
    """Provide the answer to the user and return the formatted response."""
    # TODO: Implement this with your own logic for sending the final response to the user
    response = f"Answer: {answer}\nReasoning: {reasoning}\nCitation: {citation}"
    print(response)
    return response
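
Routing a tool call to the matching Python function can be done with a lookup table keyed by tool name. The sketch below assumes the OpenAI-style tool call shape, where `function.arguments` is a JSON-encoded string; `TOOL_REGISTRY`, `dispatch`, and the two stub functions are hypothetical and stand in for the real tools defined above.

```python
import json
from typing import Callable

def lookup(query: str) -> str:
    """Stub standing in for a retrieval function."""
    return f"context for {query!r}"

def clarify(question: str) -> str:
    """Stub standing in for a function that asks the user a question."""
    return f"asked user: {question}"

# Map each tool name exposed to the model onto the Python function that implements it.
TOOL_REGISTRY: dict[str, Callable[..., str]] = {
    "retrieve_knowledge": lookup,
    "get_clarification": clarify,
}

def dispatch(tool_call: dict) -> str:
    """Run the function named in an OpenAI-style tool call with its JSON-encoded arguments."""
    name = tool_call["function"]["name"]
    if name not in TOOL_REGISTRY:
        raise ValueError(f"Unknown tool call: {name}")
    arguments = json.loads(tool_call["function"]["arguments"])
    # Unpack the decoded arguments so they line up with the function's parameter names
    return TOOL_REGISTRY[name](**arguments)

example_call = {
    "id": "call_1",
    "type": "function",
    "function": {"name": "retrieve_knowledge", "arguments": json.dumps({"query": "hemophilia"})},
}
tool_result = dispatch(example_call)
```

Unpacking with `**arguments` relies on the schema's property names matching the function's parameter names, which is why the schemas above use `query` and `question`.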


In [None]:

import json

model = "gpt-4o-mini"
temperature = 0
# define your Prompt template that contains variables for answer options and retrieved context
template = [
    {
        "role": "system",
        "content": """Answer the question provided by the user factually. And reference the options for the answer:

Options:
- {{option_A}}
- {{option_B}}
- {{option_C}}
- {{option_D}}
- {{option_E}}

---
If the question is clearly stated, you can use the `retrieve_knowledge` tool to get more context before answering.
If you need more information, first call the `get_clarification` tool to ask the user for more details, before then calling the `retrieve_knowledge` tool.

When you are happy with the context, call the `provide_answer` tool to provide the answer to the user, where you should provide your reasoning and a clear citation.
""",
    }
]


def call_model(inputs: dict[str, str], messages: list[dict], trace_id: str) -> dict:
    """Calls the model with the provided inputs and messages."""
    
    # Populate the Prompt template
    start_time = datetime.now()
    populated_template_messages = humanloop.prompts.populate_template(template, inputs)

    # Call OpenAI to get response - note you can also instead manage your prompts on Humanloop and use our proxy:
    # https://humanloop.com/docs/v5/guides/prompts/call-prompt
    chat_completion = openai.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=populated_template_messages + messages,
        # include the tool schemas so the model can make a call
        tools=tool_schemas,
    )
    # Log to Humanloop
    humanloop.prompts.log(
        path="evals_demo/medqa-answer",
        prompt={
            "model": model,
            "temperature": temperature,
            "template": template,
            "tools": tool_schemas,
        },
        inputs=inputs,
        output=chat_completion.choices[0].message.content,
        output_message=chat_completion.choices[0].message,
        trace_parent_id=trace_id,
        start_time=start_time,
        end_time=datetime.now()
    )
    
    return chat_completion.choices[0].message.to_dict(exclude_unset=False)



def ask_question(question: str, **inputs):
    """Ask a question and get an answer using a simple RAG pipeline.
    
    param: inputs: the question and options needed by the Prompt template above
    
    Return: the message history with the latest being the model response
    """
    
    # Open up the trace on Humanloop
    trace = humanloop.flows.log(
        path="evals_demo/medqa-flow",
        flow={
            # Optionally define attributes that uniquely determine your app so Humanloop can version the Flow
            "attributes": {
                "prompt": {
                    "template": template,
                    "model": model,
                    "temperature": temperature,
                },
                "tool": {
                    "name": "retrieval_tool_v3",
                    "description": "Retrieval tool for MedQA.",
                    "source_code": inspect.getsource(retrieval_tool),
                    "schemas": tool_schemas
                },
            }
        },
        inputs=inputs,
        start_time=datetime.now(),
    )
    # Initialize the messages with the question
    messages = [{"role": "user", "content": question}]

    # Do the main agent loop
    output = None
    steps = 0
    while output is None:
        steps += 1
        response = call_model(inputs, messages, trace.id)
        messages.append(response)
        if not response.get("tool_calls"):
            # The model replied in plain text instead of calling a tool, so treat that as the answer
            output = response.get("content") or ""
            break
        # Process the tool calls in your system
        for tool_call in response["tool_calls"]:
            tool_name = tool_call["function"]["name"]
            tool_args = json.loads(tool_call["function"]["arguments"])
            start_time = datetime.now()
            if tool_name == "retrieve_knowledge":
                tool_output = retrieve_knowledge(**tool_args)
            elif tool_name == "get_clarification":
                tool_output = get_clarification(**tool_args)
            elif tool_name == "provide_answer":
                provide_answer(**tool_args)
                # Session is over: setting the output also ends the outer while loop
                output = tool_args["answer"]
                break
            else:
                raise ValueError(f"Unknown tool call: {tool_call}")
            # Add the tool output to the messages to send back to the model
            messages.append(
                {
                    "role": "tool",
                    "content": json.dumps(tool_output),
                    "tool_call_id": tool_call["id"],
                }
            )
            # Log the tool call to Humanloop
            humanloop.tools.log(
                path="Evals demo/Retrieval tool",
                tool={
                    "function": {
                        "name": tool_call["function"]["name"],
                        "description": tool_call["function"]["name"],
                        "schema": ...,
                    },
                    "source_code": inspect.getsource(retrieval_tool),
                },
                output=tool_output,
                trace_parent_id=trace.id,
                start_time=start_time,
                end_time=datetime.now()
            )

        if output is None and steps > 10:
            raise ValueError("Too many steps in the conversation.")

    # Close the trace on Humanloop so any monitoring evaluators will be run
    humanloop.flows.update_log(
        log_id=trace.id,
        output=output,
        trace_status="complete",
    )

    return messages
