# Humanloop RAG Evaluation Walkthrough

The goal of this notebook is to demonstrate how to take an existing RAG pipeline and integrate Humanloop in order to:

1. Set up logging for both your [Prompt](https://humanloop.com/docs/v5/concepts/prompts) and retriever [Tool](https://humanloop.com/docs/v5/concepts/prompts) so that you can easily track the versions of these components.
2. Create a [Dataset](https://humanloop.com/docs/v5/concepts/prompts) and run Evaluations to benchmark the performance of your RAG pipeline.
3. Configure [Evaluators](https://humanloop.com/docs/v5/concepts/evaluators) for monitoring your RAG pipeline in production.

## What is Humanloop?

Humanloop is an interactive development environment designed to streamline the entire lifecycle of LLM app development. It serves as a central hub where AI, Product, and Engineering teams can collaborate on Prompt management, Evaluation and Monitoring workflows.

## What is RAG?

RAG stands for Retrieval Augmented Generation.

- **Retrieval** - Getting the relevant information from a larger data source for a given query.
- **Augmented** - Using the retrieved information as input to an LLM.
- **Generation** - Generating an output from the model given the input.

In practice, it remains an effective way to use LLMs for tasks such as question answering and summarization where the data source is too large to fit in the context window of the LLM, or where providing the full data source for each query is not cost-effective.
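The three steps can be sketched end to end with toy components. This is a minimal illustration, not the pipeline built later in this notebook: the word-overlap retriever and the `fake_llm` stub are hypothetical stand-ins for a vector DB and a real model call.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase a string and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: list[str]) -> str:
    """Retrieval: return the document sharing the most words with the query."""
    query_tokens = tokenize(query)
    return max(documents, key=lambda doc: len(query_tokens & tokenize(doc)))


def augment(query: str, context: str) -> str:
    """Augmented: place the retrieved text into the model input."""
    return f"Context:\n{context}\n\nQuestion: {query}"


def fake_llm(prompt: str) -> str:
    """Generation: hypothetical stub standing in for a real LLM call."""
    return f"(answer based on {len(prompt)} prompt characters)"


documents = [
    "Hemophilia A is an X-linked recessive bleeding disorder.",
    "Type 1 diabetes results from autoimmune destruction of beta cells.",
]
query = "How is hemophilia A inherited?"
context = retrieve(query, documents)
prompt = augment(query, context)
answer = fake_llm(prompt)
```

Only the retrieved document reaches the model, which is what keeps the input small when the full corpus would not fit in the context window.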

## What are the major challenges with RAG?

Implementing RAG and other similar flows complicates the process of [Prompt Engineering](https://humanloop.com/blog/prompt-engineering-101) because you expand the design space of your application. You need to make many choices about the retrieval and Prompt components, and these can significantly impact the performance of your overall application. For example:

- How do you select the data source?
- How should it be chunked up and indexed?
- What embedding and retrieval model should you use?
- How should you combine the retrieved information with the query?
- What should your system Prompt (system message) be?
- Which model should you use?
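To make one of these choices concrete, here is a minimal sketch of fixed-size chunking with overlap. The function name `chunk_text` and its default sizes are illustrative assumptions; this tutorial does not chunk the textbooks this way, and real pipelines often split on sentences or tokens instead of characters.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into character windows that overlap by `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start : start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


example_chunks = chunk_text("abcdefghij", chunk_size=4, overlap=2)
```

Larger chunks give the model more context per retrieval but make the match less precise, which is one reason this choice needs to be evaluated instead of guessed.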

The process of versioning, evaluating and monitoring your pipeline therefore needs to consider both the retrieval and generation components. This is where Humanloop can help.


# Example RAG Pipeline

We first need a reference RAG implementation. Our use case will be Q&A over a corpus of medical documents.

- **Dataset**: we'll use a version of the [MedQA dataset](https://huggingface.co/datasets/bigbio/med_qa) from Hugging Face. This is a multiple-choice question answering problem based on the United States Medical Licensing Examination (USMLE), with reference textbooks that contain the information required to answer the questions.
- **Retriever**: we're going to use [Chroma](https://docs.trychroma.com/getting-started) as a simple local vector DB with their default embedding model `all-MiniLM-L6-v2`. You can replace this with your favorite retrieval system.
- **Prompt**: **the Prompt will be managed in code**, populated with the user's question and the context retrieved from the Retriever, and sent to [OpenAI](https://platform.openai.com/docs/api-reference/introduction) to generate the answer.

### Where to store your Prompts?

Generally speaking, when the engineering/applied AI teams are mainly responsible for managing the details of the Prompt, then the pattern of storing or constructing the Prompt in code works well. This is the pattern we follow in this tutorial.

However, if the Product/Domain Expert teams are more involved in Prompt engineering and management, then the Prompt can instead be managed on Humanloop and retrieved or called by your code - this workflow lies outside the scope of this tutorial and we cover it separately.


## Complete Prerequisites

### Install packages

We use poetry to manage dependencies:


In [1]:
!poetry install

Installing dependencies from lock file

No dependencies to install or update

Installing the current project: humanloop-cookbook (0.1.0)
If you do not want to install the current project use --no-root.
If you want to use Poetry only for dependency management but not for packaging, you can disable package mode by setting package-mode = false in your pyproject.toml file.


### Initialise the SDKs

You will need to set your OpenAI API key in the `.env` file in the root of the repo. You can retrieve your API key from your [OpenAI account](https://platform.openai.com/api-keys).


In [2]:
IS_DEV = True

In [3]:
# Paste your own Humanloop API key here; never commit a real key to source control
HL_KEY = "<YOUR_HUMANLOOP_API_KEY>" if IS_DEV else ""
OPENAI_KEY = ""
# requests needs an explicit URL scheme, so include https:// in the host
HOST = "https://neostaging.humanloop.ml" if IS_DEV else "http://0.0.0.0:80"

In [8]:
# Set up dependencies
from dotenv import load_dotenv
import os
import chromadb
from openai import OpenAI
import requests
import datetime

import pandas as pd

# load .env file that contains API keys
load_dotenv()

# init clients
chroma = chromadb.Client()
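# fall back to OPENAI_API_KEY from .env when OPENAI_KEY is left empty
openai = OpenAI(api_key=OPENAI_KEY or os.getenv("OPENAI_API_KEY"))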



### Set up the Vector DB

This involves loading the data from the MedQA dataset and embedding the data within a collection in Chroma. This will take a couple of minutes to complete.


In [None]:
# init collection into which we will add documents
collection = chroma.get_or_create_collection(name="MedQA")

# load knowledge base
knowledge_base = pd.read_parquet("../../assets/sources/textbooks.parquet")
knowledge_base = knowledge_base.sample(5, random_state=42)


# Add to Chroma - will by default use local vector DB and model all-MiniLM-L6-v2
collection.add(
    documents=knowledge_base["contents"].to_list(),
    ids=knowledge_base["id"].to_list(),
)
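Conceptually, querying the collection means embedding the question and returning the stored documents whose vectors are closest to it. The toy sketch below ranks by cosine similarity, which is one common measure; the vectors are made up for illustration, and Chroma's own index and distance metric may differ.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int = 1) -> list[str]:
    """Return the ids of the k documents most similar to the query vector."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine_similarity(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical 2-dimensional embeddings; real embeddings have hundreds of dimensions
toy_index = {"doc_a": [0.9, 0.1], "doc_b": [0.0, 1.0], "doc_c": [0.5, 0.5]}
nearest = top_k([1.0, 0.0], toy_index, k=2)
```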

### Define the Prompt

We define a simple prompt template that has variables for the question, answer options and retrieved data.

It is generally good practice to define the Prompt details that affect the behaviour of the model in one place, separate from your application logic.


In [22]:
model = "gpt-3.5-turbo"
temperature = 0
template = [
    {
        "role": "system",
        "content": """Answer the following question factually.

Question: {{question}}

Options:
- {{option_A}}
- {{option_B}}
- {{option_C}}
- {{option_D}}
- {{option_E}}

---

Here is some retrieved information that might be helpful.
Retrieved data:
{{retrieved_data}}

---

Give your answer in 3 sections using the following format. Do not include the quotes or the brackets. Do include the "---" separators.
```
<chosen option verbatim>
---
<clear explanation of why the option is correct and why the other options are incorrect. keep it ELI5.>
---
<quote relevant information snippets from the retrieved data verbatim. every line here should be directly copied from the retrieved data>
```
""",
    }
]


def populate_template(template: list, inputs: dict[str, str]) -> list:
    """Populate a template with input variables."""
    messages = []
    for template_message in template:
        content = template_message["content"]
        for key, value in inputs.items():
            content = content.replace("{{" + key + "}}", value)
        message = {**template_message, "content": content}
        messages.append(message)
    return messages
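Because `populate_template` relies on plain string replacement, a missing input silently leaves its `{{variable}}` in the message sent to the model. A small check like the following can catch that before the call. `unfilled_placeholders` is a hypothetical helper added for illustration.

```python
import re

# Matches {{name}} with optional whitespace inside the braces
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def unfilled_placeholders(messages: list[dict]) -> list[str]:
    """Return the names of template variables still present in the messages."""
    names: list[str] = []
    for message in messages:
        for name in PLACEHOLDER.findall(message["content"]):
            if name not in names:
                names.append(name)
    return names


# Example: `retrieved_data` was never substituted
example_messages = [
    {"role": "system", "content": "Question: What is RAG?\nData: {{retrieved_data}}"}
]
leftover = unfilled_placeholders(example_messages)
```

You could call this on the output of `populate_template` and raise an error if the returned list is not empty.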


## Define the RAG Pipeline

Now we provide the reference RAG pipeline using Chroma and OpenAI that takes a question and returns an answer. This is ultimately what we will evaluate.


In [23]:
def retrieval_tool(question: str) -> str:
    """Retrieve most relevant document from the vector db (Chroma) for the question."""
    response = collection.query(query_texts=[question], n_results=1)
    retrieved_doc = response["documents"][0][0]
    return retrieved_doc


def ask_question(inputs: dict[str, str]) -> str:
    """Ask a question and get an answer using a simple RAG pipeline"""

    # Retrieve context
    retrieved_data = retrieval_tool(inputs["question"])
    inputs = {**inputs, "retrieved_data": retrieved_data}

    # Populate the Prompt template
    messages = populate_template(template, inputs)

    # Call OpenAI to get response
    chat_completion = openai.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=messages,
    )
    return chat_completion.choices[0].message.content

In [None]:
# Test the pipeline

print(
    ask_question(
        {
            "question": "A 34-year-old male suffers from inherited hemophilia A. He and his wife have three unaffected daughters. What is the probability that the second daughter is a carrier of the disease?",
            "option_A": "0%",
            "option_B": "25%",
            "option_C": "50%",
            "option_D": "75%",
            "option_E": "100%",
        }
    )
)
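The Prompt asks for three sections separated by `---`, so downstream code will usually want them split apart. The sketch below is one way to do that, assuming the model followed the requested format; `parse_answer` and its key names are hypothetical.

```python
import re

# A separator is a line containing only "---" (surrounding spaces allowed)
SEPARATOR = re.compile(r"^[ \t]*---[ \t]*$", flags=re.MULTILINE)


def parse_answer(raw: str) -> dict[str, str]:
    """Split a model answer into its chosen option, explanation and evidence."""
    parts = [part.strip() for part in SEPARATOR.split(raw)]
    if len(parts) != 3:
        raise ValueError(f"Expected 3 sections separated by '---', got {len(parts)}")
    choice, explanation, evidence = parts
    return {"choice": choice, "explanation": explanation, "evidence": evidence}


# Made-up answer text in the requested format, for illustration only
example_raw = "100%\n---\nEvery daughter inherits the father's X chromosome.\n---\nquoted snippet"
parsed = parse_answer(example_raw)
```

Raising on a malformed answer makes format failures visible instead of letting them pass as wrong answers.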

# Humanloop Integration

We now integrate Humanloop into the RAG pipeline to first enable logging and then to trigger evaluations against a dataset.


## Flow V1 Experiment


In [25]:
import inspect
import uuid


def ask_question(
    inputs: dict[str, str],
    datapoint_id: str | None = None,
    evaluation_id: str | None = None,
) -> str:
    """Ask a question and get an answer using a simple RAG pipeline"""
    trace_request = requests.post(
        f"{HOST}/v5/flows/log",
        headers={"X-API-KEY": HL_KEY},
        json={
            "trace_id": uuid.uuid4().hex,
            "flow": {
                "attributes": {
                    "description": "Answering medical questions",
                    "chroma": 1,
                }
            },
            "path": "evals_demo/MedQA Flow",
            "source_datapoint_id": datapoint_id,
            "evaluation_id": evaluation_id,
        },
    ).json()
    print(trace_request)
    trace_id = trace_request["id"]

    start_time = datetime.datetime.now()

    # Retrieve context
    retrieved_data = retrieval_tool(inputs["question"])

    end_time = datetime.datetime.now()

    # Log the context and retriever details to your Humanloop Tool
    session_id = uuid.uuid4().hex
    requests.post(
        f"{HOST}/v5/tools/log",
        json={
            "path": "evals_demo/medqa-retrieval",
            "tool": {
                "function": {
                    "name": "retrieval_tool",
                    "description": "Retrieval tool for MedQA.",
                },
                "source_code": inspect.getsource(retrieval_tool),
            },
            "output": retrieved_data,
            "session_id": session_id,
            "trace_id": trace_id,
            "start_time": start_time.isoformat(),
            "end_time": end_time.isoformat(),
        },
        headers={"X-API-Key": HL_KEY},
    )

    # Populate the Prompt template
    inputs = {**inputs, "retrieved_data": retrieved_data}
    messages = populate_template(template, inputs)

    # Call OpenAI to get a response
    start_time = datetime.datetime.now()

    chat_completion = openai.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=messages,
    )
    message = chat_completion.choices[0].message
    answer = message.content

    end_time = datetime.datetime.now()

    requests.post(
        f"{HOST}/v5/prompts/log",
        headers={"X-API-Key": HL_KEY},
        json={
            "path": "evals_demo/medqa-answer",
            "prompt": {
                "model": model,
                "temperature": temperature,
                "template": template,
            },
            "inputs": inputs,
            "output": answer,
            "output_message": message.to_dict(),
            "session_id": session_id,
            "trace_id": trace_id,
            "source_datapoint_id": datapoint_id,
            "evaluation_id": evaluation_id,
            "start_time": start_time.isoformat(),
            "end_time": end_time.isoformat(),
        },
    )

    try:
        requests.patch(
            f"{HOST}/v5/flows/log/{trace_id}",
            headers={"X-API-KEY": HL_KEY},
            json={
                "inputs": inputs,
                "output": answer,
                "status": "complete",
            },
            timeout=2,
        )
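    except requests.RequestException:
        # Best-effort update: do not fail the pipeline if marking the Trace complete fails
        pass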

    return answer
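The Tool and Prompt logging calls above both record `start_time` and `end_time` by hand. One way to reduce that repetition is a small context manager. `timed` is a hypothetical helper, not part of any Humanloop SDK, and it uses naive local timestamps to match the code above.

```python
import datetime
from contextlib import contextmanager


@contextmanager
def timed():
    """Yield a dict that is filled with ISO-8601 start_time and end_time."""
    timing = {"start_time": datetime.datetime.now().isoformat()}
    try:
        yield timing
    finally:
        # Recorded even if the wrapped step raises
        timing["end_time"] = datetime.datetime.now().isoformat()


# Stand-in for a retrieval or model call
with timed() as timing:
    result = sum(range(1000))

# The timing keys can be merged straight into a log payload
payload = {"output": result, **timing}
```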

## Create a Dataset

Here we will create a Dataset on Humanloop using the MedQA test dataset. Alternatively, you can create a Dataset from Logs on Humanloop, or upload one via the UI - see our [guide](https://humanloop.com/docs/v5/evaluation/guides/create-dataset).

You can then effectively version control your Dataset centrally on Humanloop and hook into it for Evaluation workflows in code and via the UI.


In [26]:
def upload_dataset_to_humanloop():
    df = pd.read_json("../../assets/datapoints.jsonl", lines=True)

    datapoints = [row.to_dict() for _i, row in df.iterrows()][0:20]

    response = requests.post(
        f"{HOST}/v5/datasets",
        json={
            "path": "evals_demo/medqa-test",
            "commit_message": f"Added {len(datapoints)} datapoints from MedQA test dataset.",
            "datapoints": datapoints,
        },
        headers={"X-API-Key": HL_KEY},
    )
    print(response.json())
    return response.json()["id"]


In [None]:
dataset_id = upload_dataset_to_humanloop()
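Before uploading, it can help to check that each datapoint carries the inputs the Prompt template expects. The sketch below assumes each record has an `inputs` dict (as read later via `datapoint["inputs"]`) and a `target`; the exact shape of `target` in `datapoints.jsonl` is not shown in this notebook, so the example value is an assumption.

```python
# Variables used by the Prompt template, excluding retrieved_data (filled at run time)
REQUIRED_INPUT_KEYS = {
    "question",
    "option_A",
    "option_B",
    "option_C",
    "option_D",
    "option_E",
}


def validate_datapoint(datapoint: dict) -> list[str]:
    """Return a list of problems found in a datapoint (empty if it looks valid)."""
    problems = []
    inputs = datapoint.get("inputs")
    if not isinstance(inputs, dict):
        problems.append("missing 'inputs' dict")
    else:
        missing = sorted(REQUIRED_INPUT_KEYS - inputs.keys())
        if missing:
            problems.append(f"inputs missing keys: {missing}")
    if not datapoint.get("target"):
        problems.append("missing 'target'")
    return problems


# Hypothetical datapoint; the target shape is an assumption
example_datapoint = {
    "inputs": {
        "question": "Example question?",
        "option_A": "0%",
        "option_B": "25%",
        "option_C": "50%",
        "option_D": "75%",
        "option_E": "100%",
    },
    "target": {"output": "100%"},
}
example_problems = validate_datapoint(example_datapoint)
```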

## Set up Evaluators

Here we will upload some Evaluators defined in code in `assets/evaluators/` so that Humanloop can run them for Evaluations (and later for Monitoring).

Alternatively you can define AI, Code and Human based Evaluators via the UI - see the relevant `How-to guides` on [Evaluations](https://humanloop.com/docs/v5/evaluation/overview) for creating Evaluators of different kinds.

You can also choose not to host the Evaluator on Humanloop, and instead run it in your own runtime and post the results as part of the Evaluation. This can be useful for more complex workflows that require custom dependencies or resources, but lies outside the scope of this tutorial.


In [28]:
def upload_evaluators():
    for evaluator_name, return_type in [
        ("exact_match", "boolean"),
        ("levenshtein", "number"),
    ]:
        with open(f"../../assets/evaluators/{evaluator_name}.py", "r") as f:
            code = f.read()

        requests.post(
            f"{HOST}/v5/evaluators",
            json={
                "path": f"evals_demo/{evaluator_name}",
                "spec": {
                    "evaluator_type": "python",
                    "arguments_type": "target_required",
                    "return_type": return_type,
                    "code": code,
                },
                "commit_message": f"New version from {evaluator_name}.py",
            },
            headers={"Content-Type": "application/json", "X-API-Key": HL_KEY},
        )

In [29]:
upload_evaluators()
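The contents of `exact_match.py` and `levenshtein.py` are not shown in this notebook. As a rough sketch of what such Evaluators might look like, the code below assumes an Evaluator is a function that receives a `log` dict and a `testcase` dict, with the expected answer under `testcase["target"]["output"]`. Both the signature and the field names are assumptions, and comparing only the first `---` section is a design choice made here because the Prompt asks for three sections.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + (char_a != char_b),  # substitution
                )
            )
        previous = current
    return previous[-1]


def chosen_option(output: str) -> str:
    """Take the first section of the answer, before the first '---'."""
    return output.split("---")[0].strip()


def exact_match(log: dict, testcase: dict) -> bool:
    """True if the chosen option equals the target exactly."""
    return chosen_option(log["output"]) == testcase["target"]["output"].strip()


def levenshtein(log: dict, testcase: dict) -> int:
    """Edit distance between the chosen option and the target."""
    return levenshtein_distance(
        chosen_option(log["output"]), testcase["target"]["output"].strip()
    )


# Made-up log and testcase for illustration
example_log = {"output": "100%\n---\nexplanation\n---\nquote"}
example_testcase = {"target": {"output": "100%"}}
is_match = exact_match(example_log, example_testcase)
distance = levenshtein(example_log, example_testcase)
```

The boolean and numeric return values line up with the `"boolean"` and `"number"` return types passed to `upload_evaluators` above.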

## Run Evaluation

Now we can start to trigger Evaluations on Humanloop using our Dataset and Evaluators:


In [None]:
from tqdm import tqdm


# Create the Evaluation specifying the Dataset and Evaluators to use
evaluation_id = requests.post(
    f"{HOST}/v5/evaluations",
    json={
        "dataset": {"path": "evals_demo/medqa-test"},
        "evaluators": [
            {"path": "evals_demo/exact_match"},
            {"path": "evals_demo/levenshtein"},
        ],
    },
    headers={"X-API-Key": HL_KEY},
).json()["id"]
print(f"Evaluation created: {evaluation_id}")


def populate_evaluation():
    """Run a variation of your Pipeline over the Dataset to populate results"""
    datapoints_response = requests.get(
        f"{HOST}/v5/datasets/{dataset_id}?include_datapoints=true",
        headers={"X-API-KEY": HL_KEY},
    ).json()
    for datapoint in tqdm(datapoints_response["datapoints"]):
        ask_question(
            inputs=datapoint["inputs"],
            datapoint_id=datapoint["id"],
            evaluation_id=evaluation_id,
        )


In [None]:
populate_evaluation()
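A single failing datapoint (for example a timeout) would stop the loop in `populate_evaluation` partway through. The sketch below shows one way to keep going and collect failures for later inspection. `run_over_datapoints` is a hypothetical helper, and the stub pipeline stands in for `ask_question`.

```python
from typing import Callable


def run_over_datapoints(
    datapoints: list[dict], pipeline: Callable[[dict], str]
) -> tuple[dict[str, str], dict[str, str]]:
    """Run the pipeline on every datapoint, recording outputs and failures by id."""
    outputs: dict[str, str] = {}
    failures: dict[str, str] = {}
    for datapoint in datapoints:
        try:
            outputs[datapoint["id"]] = pipeline(datapoint["inputs"])
        except Exception as error:  # keep going; report the failure afterwards
            failures[datapoint["id"]] = repr(error)
    return outputs, failures


def stub_pipeline(inputs: dict) -> str:
    """Stand-in for ask_question that fails when the question is missing."""
    return inputs["question"].upper()


example_datapoints = [
    {"id": "dp_1", "inputs": {"question": "a"}},
    {"id": "dp_2", "inputs": {}},
]
outputs, failures = run_over_datapoints(example_datapoints, stub_pipeline)
```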

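Each Trace is marked as complete by the final `PATCH` request in `ask_question` (`"status": "complete"`), which allows the Evaluators to run on it.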


## Get Results and URL

We can now get the aggregate results via the API, along with the URL to navigate to the Evaluation in the Humanloop UI.


In [16]:
evaluation_response = requests.get(
    f"{HOST}/v5/evaluations/{evaluation_id}",
    headers={"X-API-KEY": HL_KEY},
)
print("URL: ", evaluation_response.url)