# Humanloop RAG Evaluation Walkthrough
The goal of this notebook is to demonstrate how to take an existing RAG pipeline and integrate Humanloop in order to:
1. Set up logging for your AI application using a Flow.
2. Create a [Dataset](https://humanloop.com/docs/v5/concepts/datasets) and run Evaluations to benchmark the performance of your RAG pipeline.
3. Extend your logging to include Prompt and Tool steps within your Flow.


## What is Humanloop?
Humanloop is an interactive development environment designed to streamline the entire lifecycle of LLM app development. It serves as a central hub where AI, Product, and Engineering teams can collaborate on Prompt management, Evaluation and Monitoring workflows. 


## What is RAG?
RAG stands for Retrieval Augmented Generation.
- **Retrieval** - Getting the relevant information from a larger data source for a given query.
- **Augmented** - Using the retrieved information as input to an LLM.
- **Generation** - Generating an output from the model given the input.

In practice, it remains an effective way to exploit LLMs for things like question answering, summarization, and more, where the data source is too large to fit in the context window of the LLM, or where providing the full data source for each query is not cost-effective.
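
The three stages can be sketched in a few lines of plain Python. This is only a toy illustration: the word-overlap `retrieve` function stands in for a real embedding search, `generate` stands in for an LLM call, and all names and documents here are made up for the example.

```python
# Toy illustration of the three RAG stages; no embedding model or LLM is involved.
DOCS = [
    "Hemophilia A is an X-linked recessive disorder.",
    "Chroma is a local vector database.",
    "The mitochondria is the powerhouse of the cell.",
]


def retrieve(query: str, docs: list[str]) -> str:
    """Retrieval: pick the document that shares the most words with the query."""
    query_words = set(query.lower().split())
    return max(docs, key=lambda doc: len(query_words & set(doc.lower().split())))


def augment(query: str, context: str) -> str:
    """Augmented: place the retrieved text into the model input."""
    return f"Context: {context}\n\nQuestion: {query}"


def generate(prompt: str) -> str:
    """Generation: stand-in for an LLM call; it simply echoes the context line."""
    return prompt.splitlines()[0].removeprefix("Context: ")


query = "Is hemophilia A inherited?"
prompt = augment(query, retrieve(query, DOCS))
print(generate(prompt))
```

The rest of this notebook replaces each stand-in with a real component: Chroma for retrieval, a Prompt template for augmentation and OpenAI for generation.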


## What are the major challenges with RAG?
Implementing RAG and other similar flows complicates the process of [Prompt Engineering](https://humanloop.com/blog/prompt-engineering-101) because you expand the design space of your application. There are lots of choices you need to make around the retrieval and Prompt components that can significantly impact the performance of your overall application. For example,
- How do you select the data source?
- How should it be chunked up and indexed?
- What embedding and retrieval model should you use?
- How should you combine the retrieved information with the query?
- What should your system Prompt be?
- Which model should you use?

The process of versioning, evaluating and monitoring your pipeline therefore needs to consider both the retrieval and generation components. This is where Humanloop can help.


# Example RAG Pipeline

We first need a reference RAG implementation. Our use case will be Q&A over a corpus of medical documents.

- **Dataset**: we'll use a version of the [MedQA dataset](https://huggingface.co/datasets/bigbio/med_qa) from Hugging Face. This is a multiple choice question answering problem based on the United States Medical License Exams (USMLE), with reference text books that contain the required information to answer the questions.
- **Retriever**: we're going to use [Chroma](https://docs.trychroma.com/getting-started) as a simple local vector DB with their default embedding model `all-MiniLM-L6-v2`. You can replace this with your favorite retrieval system.
- **Prompt**: **the Prompt will be managed in code**, populated with the user's question and the context retrieved from the Retriever and sent to [OpenAI](https://platform.openai.com/docs/api-reference/introduction) to generate the answer.

### Where to store your Prompts?

Generally speaking, when the engineering/applied AI teams are mainly responsible for managing the details of the Prompt, then the pattern of storing or constructing the Prompt in code works well. This is the pattern we follow in this tutorial. 

However, if the Product/Domain Expert teams are more involved in Prompt engineering and management, then the Prompt can instead be managed on Humanloop and retrieved or called by your code - this workflow lies outside the scope of this tutorial and we cover it separately. 

## Complete Prerequisites

### Install packages
We use poetry to manage dependencies:

In [1]:
!poetry install

Installing dependencies from lock file

No dependencies to install or update


### Initialise the SDKs

You will need to set your OpenAI API key in the  `.env` file in the root of the repo. You can retrieve your API key from your [OpenAI account](https://platform.openai.com/api-keys).


In [2]:
# Set up dependencies
from dotenv import load_dotenv
import os
import chromadb
from openai import OpenAI

import pandas as pd

# load .env file that contains API keys
load_dotenv()

# init clients
chroma = chromadb.Client()
openai = OpenAI(api_key=os.getenv("OPENAI_KEY"))


### Set up the Vector DB
This involves loading the data from the MedQA dataset and embedding the data within a collection in Chroma. This will take a couple of minutes to complete.

In [3]:
# init collection into which we will add documents
collection = chroma.get_or_create_collection(name="MedQA")

# load knowledge base
knowledge_base = pd.read_parquet("../../assets/sources/textbooks.parquet")
knowledge_base = knowledge_base.sample(10, random_state=42)


# Add to Chroma - will by default use local vector DB and model all-MiniLM-L6-v2
collection.add(
    documents=knowledge_base["contents"].to_list(),
    ids=knowledge_base["id"].to_list(),
)

### Define the Prompt
We define a simple prompt template that has variables for the question, answer options and retrieved data.

It is generally good practice to define the Prompt details that impact the behaviour of the model in one place, separate from your application logic.

In [4]:
model = "gpt-4o"
temperature = 0
template = [
    {
        "role": "system",
        "content": """Answer the following question factually.

Question: {{question}}

Options:
- {{option_A}}
- {{option_B}}
- {{option_C}}
- {{option_D}}
- {{option_E}}

---

Here is some retrieved information that might be helpful.
Retrieved data:
{{retrieved_data}}

---

Give your answer in 3 sections using the following format. Do not include the quotes or the brackets. Do include the "---" separators.
```
<chosen option verbatim>
---
<clear explanation of why the option is correct and why the other options are incorrect. keep it ELI5.>
---
<quote relevant information snippets from the retrieved data verbatim. every line here should be directly copied from the retrieved data>
```
""",
    }
]

def populate_template(template: list, inputs: dict[str, str]) -> list:
    """Populate a template with input variables."""
    messages = []
    for template_message in template:
        content = template_message["content"]
        for key, value in inputs.items():
            content = content.replace("{{" + key + "}}", value)
        message = {**template_message, "content": content}
        messages.append(message)
    return messages
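
Because the template is filled by plain string replacement, a missing input leaves its `{{variable}}` in the message without raising an error. A small check like the one below can catch that before the request is sent. This is a sketch only: `unfilled_placeholders` is a hypothetical helper, not part of the pipeline or of any SDK.

```python
import re

# Matches {{name}} placeholders of the kind used in the template above.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def unfilled_placeholders(messages: list[dict]) -> set[str]:
    """Return the names of template variables still present in the messages."""
    return {
        name
        for message in messages
        for name in PLACEHOLDER.findall(message["content"])
    }


# Example: the question was filled in, but the retrieved data was not.
messages = [
    {
        "role": "system",
        "content": "Question: What is RAG?\nRetrieved data:\n{{retrieved_data}}",
    }
]
print(unfilled_placeholders(messages))  # {'retrieved_data'}
```

Note that the check would also flag an input value that itself contains `{{...}}`, so treat a hit as a warning rather than a hard failure.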


## Define the RAG Pipeline

Now we provide the reference RAG pipeline using Chroma and OpenAI that takes a question and returns an answer. This is ultimately what we will evaluate.


In [5]:
from openai.types.chat import ChatCompletionMessage

def retrieval_tool(question: str) -> str:
    """Retrieve most relevant document from the vector db (Chroma) for the question."""
    response = collection.query(query_texts=[question], n_results=1)
    retrieved_doc = response["documents"][0][0]
    return retrieved_doc

def ask_question(inputs: dict[str, str])-> ChatCompletionMessage:
    """Ask a question and get an answer using a simple RAG pipeline"""
    
    # Retrieve context
    retrieved_data = retrieval_tool(inputs["question"])
    inputs = {**inputs, "retrieved_data": retrieved_data}
    
    # Populate the Prompt template
    messages = populate_template(template, inputs)
    
    # Call OpenAI to get response
    chat_completion = openai.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=messages,
    )
    return chat_completion.choices[0].message
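
The double index in `response["documents"][0][0]` is easier to read once you see the shape of a query result: Chroma returns one inner list per query text, and each inner list holds the `n_results` matches for that query. The sketch below uses a hand-written dictionary with made-up values in place of a real response, and `top_documents` is a hypothetical helper.

```python
# Hand-written stand-in for a Chroma query result (values are invented).
# Outer list: one entry per query text. Inner list: the matches for that query.
mock_response = {
    "ids": [["doc_7", "doc_2"]],
    "documents": [["First matching passage.", "Second matching passage."]],
    "distances": [[0.31, 0.52]],
}


def top_documents(response: dict, k: int = 1) -> list[str]:
    """Return the first k retrieved documents for the first (and only) query."""
    return response["documents"][0][:k]


print(top_documents(mock_response))       # ['First matching passage.']
print(top_documents(mock_response, k=2))  # both passages, best match first
```

Raising `n_results` and joining several passages is one of the retrieval choices you may want to evaluate later.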

In [None]:
# Test the pipeline
chat_completion = ask_question(
    {
        "question": "A 34-year-old male suffers from inherited hemophilia A. He and his wife have three unaffected daughters. What is the probability that the second daughter is a carrier of the disease?",
        "option_A": "0%",
        "option_B": "25%",
        "option_C": "50%",
        "option_D": "75%",
        "option_E": "100%",
    }
)
print(chat_completion.content)
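
The Prompt asks for three sections separated by `---`, so the response can be split back into its parts when you want to inspect them individually. The helper below is a sketch under that assumption; `split_answer` is a hypothetical name, the sample text is hand-written rather than real model output, and a model may not always follow the format.

```python
import re


def split_answer(output: str) -> dict[str, str]:
    """Split a response into its answer, explanation and citation sections."""
    # Split only on lines that consist of the "---" separator.
    parts = [part.strip() for part in re.split(r"^[ \t]*---[ \t]*$", output, flags=re.MULTILINE)]
    if len(parts) != 3:
        raise ValueError(f"expected 3 sections, found {len(parts)}")
    answer, explanation, citation = parts
    return {"answer": answer, "explanation": explanation, "citation": citation}


sample_output = "100%\n---\nThe father passes his X chromosome to every daughter.\n---\nHemophilia A is X-linked."
sections = split_answer(sample_output)
print(sections["answer"])  # 100%
```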

# Humanloop Integration

We now integrate Humanloop into the RAG pipeline to first enable logging and then to trigger evaluations against a dataset.


## Initialise the SDK
You will need to set your Humanloop API key in the  `.env` file in the root of the repo. You can retrieve your API key from your [Humanloop organization](https://app.humanloop.com/account/api-keys).


In [None]:
# Init the Humanloop SDK
from humanloop import Humanloop

load_dotenv()
humanloop = Humanloop(api_key=os.getenv("HUMANLOOP_KEY"))

## Integrate Logging

Below, we just need to add a `humanloop.flows.log(...)` call after we execute our pipeline. We include in the `attributes` the information that will version our Flow on Humanloop. This can be arbitrary configuration of your choice.

On running this updated code, Humanloop will now begin to track the versions of your Flow along with inputs, outputs and associated metadata. 

In [8]:
# Add a `humanloop.flows.log()` call after `ask_question()` to log the response and the Flow's versioning attributes to your Humanloop Flow
import inspect 
from datetime import datetime

inputs = {
    "question": "A 34-year-old male suffers from inherited hemophilia A. He and his wife have three unaffected daughters. What is the probability that the second daughter is a carrier of the disease?",
    "option_A": "0%",
    "option_B": "25%",
    "option_C": "50%",
    "option_D": "75%",
    "option_E": "100%",
}

start_time = datetime.now()
chat_completion = ask_question(inputs)

# Log the response to a Humanloop Flow
humanloop.flows.log(
    path="evals_demo/medqa-flow",
    # We want our Prompt details and Tool implementation to define the version of our Flow. 
    # If these details change in code, Humanloop will bump the version automatically.
    flow={
        "attributes": {
            "prompt": {
                "template": template,
                "model": model,
                "temperature": temperature,
            },
            "tool": {
                "name": "retrieval_tool_v3",
                "description": "Retrieval tool for MedQA.",
                "source_code": inspect.getsource(retrieval_tool),
            },
        }
    },
    inputs=inputs,
    output=chat_completion.content,
    start_time=start_time,
    end_time=datetime.now(),
    # We don't intend on adding any more Logs to our trace, so mark as complete.
    trace_status="complete",
)

CreateFlowLogResponse(id='log_LscHgTkvAUcGf0g47xV5o', flow_id='fl_jLavsElEpb6dqckBMRObh', version_id='flv_ebvvb8oK8i3UmDZnYkQ6X', trace_status=None, status='complete')

## Check your Humanloop workspace

After running this pipeline, you will now see your Flow logs in your Humanloop workspace.

If you make changes to your `attributes` in code and re-run the pipeline, you will see a new version of the Flow created in Humanloop.
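
One way to build intuition for this behaviour is to think of the `attributes` as a fingerprint of your configuration. The sketch below hashes a canonical JSON form of the attributes to show that any change yields a different fingerprint. It is purely an illustration: `fingerprint` is a hypothetical helper, and this is not a description of how Humanloop computes versions internally.

```python
import hashlib
import json


def fingerprint(attributes: dict) -> str:
    """Hash a canonical JSON form of the attributes (illustration only)."""
    canonical = json.dumps(attributes, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


base = {"prompt": {"model": "gpt-4o", "temperature": 0}}
reordered = {"prompt": {"temperature": 0, "model": "gpt-4o"}}
changed = {"prompt": {"model": "gpt-4o", "temperature": 0.7}}

print(fingerprint(base) == fingerprint(reordered))  # True: same content, same fingerprint
print(fingerprint(base) == fingerprint(changed))    # False: the temperature changed
```

The practical takeaway is to put everything that affects behaviour (template, model, parameters, tool source) into `attributes`, and nothing that varies per request.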


![Flow Logs](../../assets/images/flow_log.png)



# Trigger Evaluations

We will now extend our implementation to allow us to run Evaluations on Humanloop against a specific test dataset.

This involves the following steps:
1. Create a Dataset that we can manage and re-use on Humanloop as the source of truth.
2. Create some Evaluators that we can manage and re-use on Humanloop that can provide judgements on the performance of our Pipeline.
3. Trigger an Evaluation with logging to Humanloop.
4. View the results.

Then as you tweak your pipeline in code, this will allow you to easily track and compare the performance of different versions. 

## Create a Dataset
Here we will create a Dataset on Humanloop using the MedQA test dataset. Alternatively, you can create a Dataset from Logs on Humanloop, or upload one via the UI - see our [guide](https://humanloop.com/docs/v5/evaluation/guides/create-dataset).

You can then effectively version control your Dataset centrally on Humanloop and hook into it for Evaluation workflows in code and via the UI.

In [9]:
def upload_dataset_to_humanloop():
    df = pd.read_json("../../assets/datapoints.jsonl", lines=True)

    datapoints = [row.to_dict() for _i, row in df.iterrows()][0:20]
    return humanloop.datasets.upsert(
        path="evals_demo/medqa-test",
        datapoints=datapoints,
        commit_message=f"Added {len(datapoints)} datapoints from MedQA test dataset.",
    )


In [None]:
dataset = upload_dataset_to_humanloop()
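
If your own test data is not yet in datapoint form, it needs converting first. The sketch below assumes each datapoint carries the pipeline `inputs` plus a `target` holding the expected output; that layout, the `to_datapoint` helper and the field names of the raw row (`options`, `answer_key`) are assumptions for illustration, and the contents of `datapoints.jsonl` may differ.

```python
def to_datapoint(row: dict) -> dict:
    """Convert a raw multiple-choice row into an inputs/target datapoint."""
    options = row["options"]
    inputs = {
        "question": row["question"],
        # Produces option_A, option_B, ... to match the template variables.
        **{f"option_{key}": value for key, value in options.items()},
    }
    return {"inputs": inputs, "target": {"output": options[row["answer_key"]]}}


# Hypothetical raw row; real MedQA records may use different field names.
row = {
    "question": "What is the probability that the second daughter is a carrier?",
    "options": {"A": "0%", "B": "25%", "C": "50%", "D": "75%", "E": "100%"},
    "answer_key": "E",
}
datapoint = to_datapoint(row)
print(datapoint["target"])  # {'output': '100%'}
```

Keeping the input keys identical to the template variables means a datapoint's `inputs` can be passed straight to `ask_question`.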

## Set up Evaluators

Here we will upload some Evaluators defined in code in `assets/evaluators/` so that Humanloop can manage running these for Evaluations (and later for Monitoring!)

Alternatively you can define AI, Code and Human based Evaluators via the UI - see the relevant `How-to guides` on [Evaluations](https://humanloop.com/docs/v5/evaluation/overview) for creating Evaluators of different kinds.

You can also choose not to host an Evaluator on Humanloop, and instead run it in your own runtime and post the results as part of the Evaluation. This can be useful for more complex workflows that require custom dependencies or resources. See the "Running Evaluators locally" section below for details.

In [12]:
def upload_evaluators():
    # Upload Code Evaluators
    for evaluator_name, return_type in [
        ("exact_match", "boolean"),
        ("levenshtein", "number"),
    ]:
        with open(f"../../assets/evaluators/{evaluator_name}.py", "r") as f:
            code = f.read()
        humanloop.evaluators.upsert(
            path=f"evals_demo/{evaluator_name}",
            spec={
                "evaluator_type": "python",
                "arguments_type": "target_required",
                "return_type": return_type,
                "code": code,
            },
            commit_message=f"New version from {evaluator_name}.py",
        )

    # Upload an LLM Evaluator
    humanloop.evaluators.upsert(
        path="evals_demo/reasoning",
        spec={
            "evaluator_type": "llm",
            "arguments_type": "target_free",
            "return_type": "boolean",
            "prompt": {
                "model": "gpt-4o",
                "endpoint": "complete",
                "temperature": 0,
                "template": "An answer is shown below. The answer contains 3 sections, separated by \"---\". The first section is the final answer. The second section is an explanation. The third section is a citation.\n\nEvaluate if the final answer follows from the citation and the reasoning in the explanation section. Give a brief explanation/discussion. Do not make your judgment based on factuality, but purely based on the logic presented.\nOn a new line, give a final verdict of \"True\" or \"False\".\n\nAnswer:\n{{log.output}}",
            },
        },
        commit_message="Initial reasoning evaluator.",
    )

In [None]:
upload_evaluators()
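
The Evaluator source files are not shown in this notebook, so here is a rough sketch of what code Evaluators of this kind might look like. Everything in it is an assumption for illustration: the `(log, testcase)` signature, the dictionary keys, and the choice to compare only the first section of the output. The actual files in `assets/evaluators/` may be implemented differently.

```python
def chosen_option(output: str) -> str:
    """Take the first section of the response, i.e. the chosen option."""
    return output.split("---")[0].strip()


def exact_match_sketch(log: dict, testcase: dict) -> bool:
    """Boolean judgement: does the chosen option equal the target exactly?"""
    return chosen_option(log["output"]) == testcase["target"]["output"].strip()


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + (char_a != char_b),  # substitution
                )
            )
        previous = current
    return previous[-1]


def levenshtein_sketch(log: dict, testcase: dict) -> int:
    """Numeric judgement: edit distance between chosen option and target."""
    return levenshtein(chosen_option(log["output"]), testcase["target"]["output"].strip())


log = {"output": "100%\n---\nExplanation.\n---\nCitation."}
testcase = {"target": {"output": "100%"}}
print(exact_match_sketch(log, testcase), levenshtein_sketch(log, testcase))  # True 0
```

A distance-based Evaluator is a useful companion to exact match because it distinguishes a near miss (a formatting slip) from a completely different answer.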

## Run Evaluation

Now we can start to trigger Evaluations on Humanloop using our Dataset and Evaluators:

In [15]:
from tqdm import tqdm


# Create the Evaluation specifying the Dataset and Evaluators to use
evaluation = humanloop.evaluations.create(
    name="Demo evals 2",
    file={"path":"evals_demo/medqa-flow"},
    dataset={"path": "evals_demo/medqa-test"},
    evaluators=[
        {"path": "evals_demo/exact_match"},
        {"path": "evals_demo/levenshtein"},
        {"path": "evals_demo/reasoning"},
    ],
)
print(f"Evaluation created: {evaluation.id}")


Evaluation created: evr_hlf4paqOeAy41ECJRtchp


Add `source_datapoint_id` and `evaluation_id` to the `humanloop.flows.log(...)` call so that the Logs are added to the Evaluation.

In [16]:
def populate_evaluation():
    """Run a variation of your Pipeline over the Dataset to populate results"""
    retrieved_dataset = humanloop.datasets.get(
        id=evaluation.dataset.id,
        include_datapoints=True,
    )
    for datapoint in tqdm(retrieved_dataset.datapoints):
        start_time = datetime.now()
        
        chat_completion = ask_question(datapoint.inputs)
        
        humanloop.flows.log(
            path="evals_demo/medqa-flow",
            flow={
                "attributes": {
                    "prompt": {
                        "template": template,
                        "model": model,
                        "temperature": temperature,
                    },
                    "tool": {
                        "name": "retrieval_tool_v4",
                        "description": "Retrieval tool for MedQA.",
                        "source_code": inspect.getsource(retrieval_tool),
                    },
                }
            },
            inputs=datapoint.inputs,
            output=chat_completion.content,
            start_time=start_time,
            end_time=datetime.now(),
            trace_status="complete",
            # NB: New arguments to link to Evaluation and Dataset
            source_datapoint_id=datapoint.id,
            evaluation_id=evaluation.id,
        )


In [None]:
populate_evaluation()

# Then change your pipeline and run this function again, keeping the Evaluation ID the same, to populate additional columns in your Evaluation!

## Get Results and URL
We can now get the aggregate results via the API, along with the URL to navigate to the Evaluation in the Humanloop UI.

In [18]:
evaluation = humanloop.evaluations.get(id=evaluation.id)
print("URL: ", evaluation.url)

URL:  https://stg.humanloop.com/project/fl_jLavsElEpb6dqckBMRObh/evaluations/evr_hlf4paqOeAy41ECJRtchp/stats


![Flow Evals](../../assets/images/flow_evals.png)

## Logging the full trace

So far our Humanloop integration only cares about the output of our RAG pipeline. In many applications, it's helpful to have visibility into the behaviour of the components that make up your pipeline. In this example, we have a retrieval step and an LLM step that make up our application.

We'll now demonstrate how to extend your Humanloop logging with more fidelity by adding separate Tool and Prompt steps to your Flow Logs, giving you full visibility.

We add additional logging steps to our `ask_question` function that are linked to our Flow Log to represent the full trace of events on Humanloop.

In [38]:
# We want to nest our subsequent logs under our Flow log to represent the full trace of events.

def ask_question(inputs: dict[str, str], trace_id: str)-> ChatCompletionMessage:
    """Ask a question and get an answer using a simple RAG pipeline"""
    
    # Retrieve context
    start_time=datetime.now()
    retrieved_data = retrieval_tool(inputs["question"])
    inputs = {**inputs, "retrieved_data": retrieved_data}

    # Log the retriever information to Humanloop separately
    humanloop.tools.log(
        path="evals_demo/medqa-retrieval",
        tool={
            "function": {
                "name": "retrieval_tool",
                "description": "Retrieval tool for MedQA.",
            },
            "source_code": inspect.getsource(retrieval_tool),
        },
        output=retrieved_data,
        trace_parent_id=trace_id,
        start_time=start_time,
        end_time=datetime.now()
    )
    
    # Populate the Prompt template
    start_time=datetime.now()
    messages = populate_template(template, inputs)
    
    # Call OpenAI to get response
    chat_completion = openai.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=messages,
    )

    # Log the prompt information to Humanloop separately
    humanloop.prompts.log(
        path="evals_demo/medqa-answer",
        prompt={
            "model": model,
            "temperature": temperature,
            "template": template,
        },
        inputs=inputs,
        output=chat_completion.choices[0].message.content,
        output_message=chat_completion.choices[0].message,
        trace_parent_id=trace_id,
        start_time=start_time,
        end_time=datetime.now()
    )
    
    return chat_completion.choices[0].message
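
Each step in the trace needs its own start and end time, and tracking these by hand across several steps is easy to get wrong. One option is a small wrapper that returns the timestamps together with the result. This is only a sketch: `timed` is a hypothetical helper and `fake_retrieval` stands in for the real retrieval step.

```python
from datetime import datetime
from typing import Any, Callable


def timed(step: Callable[..., Any], *args: Any, **kwargs: Any) -> tuple[Any, datetime, datetime]:
    """Run a step and return its result with its own start and end times."""
    start = datetime.now()
    result = step(*args, **kwargs)
    return result, start, datetime.now()


def fake_retrieval(question: str) -> str:
    """Stand-in for the retrieval step, for illustration only."""
    return f"Passage relevant to: {question}"


retrieved, retrieval_start, retrieval_end = timed(fake_retrieval, "What is hemophilia A?")
print(retrieved)
print(retrieval_end >= retrieval_start)  # True
```

The returned timestamps can then be passed as `start_time` and `end_time` when logging that step, so the durations shown for the Tool and Prompt steps do not overlap.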

In [None]:
# Update Flow logging to provide trace_id

inputs = {
    "question": "A 34-year-old male suffers from inherited hemophilia A. He and his wife have three unaffected daughters. What is the probability that the second daughter is a carrier of the disease?",
    "option_A": "0%",
    "option_B": "25%",
    "option_C": "50%",
    "option_D": "75%",
    "option_E": "100%",
}

start_time = datetime.now()

# Create the Flow log
log = humanloop.flows.log(
    path="evals_demo/medqa-flow",
    flow={
        "attributes": {
            "prompt": {
                "template": template,
                "model": model,
                "temperature": temperature,
            },
            "tool": {
                "name": "retrieval_tool_v3",
                "description": "Retrieval tool for MedQA.",
                "source_code": inspect.getsource(retrieval_tool),
            },
        }
    },
    inputs=inputs,
    start_time=start_time,
)

chat_completion = ask_question(inputs, log.id)

# Close the trace
humanloop.flows.update_log(
    log_id=log.id,
    output=chat_completion.content,
    trace_status="complete",
)

![Flow full trace](../../assets/images/flow_full_trace.png)