# Humanloop RAG Evaluation Walkthrough
The goal of this notebook is to demonstrate how to take an existing RAG pipeline and integrate Humanloop in order to:
1. Set up logging for both your [Prompt](https://humanloop.com/docs/v5/concepts/prompts) and retriever [Tool](https://humanloop.com/docs/v5/concepts/prompts) so that you can easily track the versions of these components.
2. Create a [Dataset](https://humanloop.com/docs/v5/concepts/prompts) and run Evaluations to benchmark the performance of your RAG pipeline.
3. Configure [Evaluators](https://humanloop.com/docs/v5/concepts/evaluators) for monitoring your RAG pipeline in production.


## What is Humanloop?
Humanloop is an interactive development environment designed to streamline the entire lifecycle of LLM app development. It serves as a central hub where AI, Product, and Engineering teams can collaborate on Prompt management, Evaluation and Monitoring workflows. 


## What is RAG?
RAG stands for Retrieval Augmented Generation.
- **Retrieval** - Getting the relevant information from a larger data source for a given query.
- **Augmented** - Using the retrieved information as input to an LLM.
- **Generation** - Generating an output from the model given the input.

In practice, it remains an effective way to exploit LLMs for things like question answering, summarization, and more, where the data source is too large to fit in the context window of the LLM, or where providing the full data source for each query is not cost-effective.
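As a rough illustration of the three stages, here is a toy sketch that runs offline. Everything in it is hypothetical: the documents, the word-overlap retriever and the stubbed `generate` function stand in for a real vector DB and LLM call, which the rest of this notebook sets up properly.

```python
# Toy illustration of Retrieval -> Augmentation -> Generation.
# All names are hypothetical; the "LLM" is a stub so the example runs offline.

DOCUMENTS = [
    "Hemophilia A is an X-linked recessive disorder caused by factor VIII deficiency.",
    "Type 1 diabetes results from autoimmune destruction of pancreatic beta cells.",
]


def retrieve(query: str, documents: list[str]) -> str:
    """Retrieval: return the document sharing the most words with the query."""
    query_words = set(query.lower().split())
    return max(documents, key=lambda doc: len(query_words & set(doc.lower().split())))


def augment(query: str, context: str) -> str:
    """Augmentation: place the retrieved text into the model input."""
    return f"Context:\n{context}\n\nQuestion: {query}"


def generate(prompt: str) -> str:
    """Generation: stand-in for an LLM call."""
    return f"(model answer based on {len(prompt)} characters of input)"


question = "How is hemophilia A inherited?"
context = retrieve(question, DOCUMENTS)
print(generate(augment(question, context)))
```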


## What are the major challenges with RAG?
Implementing RAG and other similar flows complicates the process of [Prompt Engineering](https://humanloop.com/blog/prompt-engineering-101) because you expand the design space of your application. There are lots of choices you need to make around the retrieval and Prompt components that can significantly impact the performance of your overall application. For example,
- How do you select the data source?
- How should it be chunked up and indexed?
- What embedding and retrieval model should you use?
- How should you combine the retrieved information with the query?
- What should your system Prompt be?
- Which model should you use?

The process of versioning, evaluating and monitoring your pipeline therefore needs to consider both the retrieval and generation components. This is where Humanloop can help.


# Example RAG Pipeline

We first need a reference RAG implementation. Our use case will be Q&A over a corpus of medical documents.

- **Dataset**: We'll use a version of the [MedQA dataset](https://huggingface.co/datasets/bigbio/med_qa) from Hugging Face. This is a multiple choice question answering problem based on the United States Medical License Exams (USMLE), with reference text books that contain the required information to answer the questions.
- **Retriever**: We'll use [Chroma](https://docs.trychroma.com/getting-started) as a simple local vector DB with its default embedding model, `all-MiniLM-L6-v2`. You can replace this with your favorite retrieval system.
- **Prompt**: **The Prompt will be managed in code**, populated with the user's question and the context retrieved from the Retriever, and sent to [OpenAI](https://platform.openai.com/docs/api-reference/introduction) to generate the answer.

### NB: where to store your Prompts?

Generally speaking, when the engineering/applied AI teams are mainly responsible for managing the details of the Prompt, then the pattern of storing or constructing the Prompt in code works well. This is the pattern we follow in this tutorial. 

However, if the Product/Domain Expert teams are more involved in Prompt engineering and management, then the Prompt can instead be managed on Humanloop and retrieved or called by your code - this workflow lies outside the scope of this tutorial and we cover it separately. 

## Complete Pre-requisites

### Install packages
We use poetry to manage dependencies:

In [5]:
!poetry install


### Initialise the SDKs

You will need to set your OpenAI API key as `OPENAI_KEY` in the `.env` file in the root of the repo. You can retrieve your API key from your [OpenAI account](https://platform.openai.com/api-keys).


In [11]:
# Set up dependencies
from dotenv import load_dotenv
import os
import chromadb
from openai import OpenAI

import pandas as pd

# load .env file that contains API keys
load_dotenv()

# init clients
chroma = chromadb.Client()
openai = OpenAI(api_key=os.getenv("OPENAI_KEY"))
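A missing or placeholder key usually only surfaces as an authentication error on the first API call. As an optional sketch, a small check like the one below can flag unset variables earlier. The helper `missing_env_vars` is hypothetical; `OPENAI_KEY` and `HUMANLOOP_KEY` are the variable names read by the cells in this notebook.

```python
import os


def missing_env_vars(names: list[str]) -> list[str]:
    """Return the names that are unset or empty in the current environment."""
    return [name for name in names if not os.getenv(name)]


# Variable names taken from the os.getenv(...) calls in this notebook.
missing = missing_env_vars(["OPENAI_KEY", "HUMANLOOP_KEY"])
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```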


### Set up the Vector DB
This involves loading the data from the MedQA dataset and embedding the data within a collection in Chroma. This will take a couple of minutes to complete.

In [6]:
# init collection into which we will add documents
collection = chroma.get_or_create_collection(name="MedQA")

# load knowledge base
knowledge_base = pd.read_parquet("../../assets/sources/textbooks.parquet")
# sample a small subset of documents to keep embedding fast
knowledge_base = knowledge_base.sample(2, random_state=42)


# Add to Chroma - will by default use local vector DB and model all-MiniLM-L6-v2
collection.add(
    documents=knowledge_base["contents"].to_list(),
    ids=knowledge_base["id"].to_list(),
)
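The cell above indexes only a small sample. If you index a larger share of the knowledge base, one option is to add documents in batches, since some vector stores limit how many records a single call accepts. The sketch below uses a hypothetical `batched` helper and stand-in data; the batch size is arbitrary and the `collection.add(...)` call is shown only as a comment.

```python
from typing import Iterator


def batched(items: list, batch_size: int) -> Iterator[list]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


# Stand-in data; in the notebook these would come from the knowledge base DataFrame.
documents = [f"document {i}" for i in range(5)]
ids = [str(i) for i in range(5)]

for doc_batch, id_batch in zip(batched(documents, 2), batched(ids, 2)):
    # With a real collection this would be:
    # collection.add(documents=doc_batch, ids=id_batch)
    print(id_batch, doc_batch)
```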

### Define the Prompt
We define a simple prompt template that has variables for the question, answer options and retrieved data.

It is generally good practice to define the Prompt details that impact the behaviour of the model in one place, separate from your application logic.

In [7]:
model = "gpt-3.5-turbo"
temperature = 0
template = [
    {
        "role": "system",
        "content": """Answer the following question factually.

Question: {{question}}

Options:
- {{option_A}}
- {{option_B}}
- {{option_C}}
- {{option_D}}
- {{option_E}}

---

Here is some retrieved information that might be helpful.
Retrieved data:
{{retrieved_data}}

---

Give your answer in 3 sections using the following format. Do not include the quotes or the brackets. Do include the "---" separators.
```
<chosen option verbatim>
---
<clear explanation of why the option is correct and why the other options are incorrect. keep it ELI5.>
---
<quote relevant information snippets from the retrieved data verbatim. every line here should be directly copied from the retrieved data>
```
""",
    }
]

def populate_template(template: list, inputs: dict[str, str]) -> list:
    """Populate a template with input variables."""
    messages = []
    for template_message in template:
        content = template_message["content"]
        for key, value in inputs.items():
            content = content.replace("{{" + key + "}}", value)
        message = {**template_message, "content": content}
        messages.append(message)
    return messages
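Because the template is filled by plain string replacement, an input key that is missing or misspelled leaves its `{{variable}}` placeholder in the message without raising an error. As a sketch, a hypothetical helper like `unfilled_variables` below can detect leftover placeholders before the messages are sent to the model.

```python
import re

# Matches {{name}} placeholders such as {{question}} or {{option_A}}.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def unfilled_variables(messages: list[dict]) -> set[str]:
    """Return the names of any {{variable}} placeholders still present in the messages."""
    return {
        name
        for message in messages
        for name in PLACEHOLDER.findall(message["content"])
    }


# Hand-written example: the question was filled in, the retrieved data was not.
example = [
    {"role": "system", "content": "Question: What is RAG?\nRetrieved data:\n{{retrieved_data}}"}
]
print(unfilled_variables(example))
```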


### Define the RAG Pipeline

Now we provide the reference RAG pipeline using Chroma and OpenAI that takes a question and returns an answer. This is ultimately what we will evaluate.


In [12]:
def retrieval_tool(question: str) -> str:
    """Retrieve most relevant document from the vector db (Chroma) for the question."""
    response = collection.query(query_texts=[question], n_results=1)
    retrieved_doc = response["documents"][0][0]
    return retrieved_doc


def ask_question(inputs: dict[str, str]) -> str:
    """Ask a question and get an answer using a simple RAG pipeline"""
    retrieved_data = retrieval_tool(inputs["question"])

    inputs = {**inputs, "retrieved_data": retrieved_data}
    messages = populate_template(template, inputs)
    chat_completion = openai.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=messages,
    )
    answer = chat_completion.choices[0].message.content
    return answer
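The indexing `response["documents"][0][0]` relies on the nested shape of a Chroma query result: one inner list per query text, holding one entry per returned result. The sketch below uses a hand-written dict that mimics this shape (it does not call Chroma) and a hypothetical helper showing how several documents could be joined if you raise `n_results` above 1.

```python
# A hand-written dict mimicking the nested shape of a Chroma query result:
# one inner list per query text, one entry per returned result.
mock_response = {
    "ids": [["doc-7", "doc-2"]],
    "documents": [["First matching passage.", "Second matching passage."]],
    "distances": [[0.21, 0.35]],
}


def join_top_documents(response: dict, separator: str = "\n\n") -> str:
    """Concatenate all documents returned for the first (and only) query text."""
    return separator.join(response["documents"][0])


print(join_top_documents(mock_response))
```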

In [13]:
# Test the pipeline

print(
    ask_question(
        {
            "question": "A 34-year-old male suffers from inherited hemophilia A. He and his wife have three unaffected daughters. What is the probability that the second daughter is a carrier of the disease?",
            'option_A': '0%', 'option_B': '25%', 'option_C': '50%', 'option_D': '75%', 'option_E': '100%'
        }
    )
)


# Humanloop Integration

We now integrate Humanloop into the RAG pipeline to first enable logging and then to trigger evaluations against a dataset.


### Initialise the SDK
You will need to set your Humanloop API key as `HUMANLOOP_KEY` in the `.env` file in the root of the repo. You can retrieve your API key from your [Humanloop organization](https://app.humanloop.com/account/api-keys).


In [None]:
# Init the Humanloop SDK
from humanloop import Humanloop

load_dotenv()
humanloop = Humanloop(api_key=os.getenv("HUMANLOOP_KEY"))

## Integrate Logging

Below, we add a `humanloop.tools.log(...)` call after the retrieval step to log the retrieved documents to Humanloop,
and a `humanloop.prompts.log(...)` call after the chat completion generation.
We also pass in a `session_id` to link these two Logs together.

On running this updated code, Humanloop will now begin to track the versions of your Tool and Prompt and their inputs, outputs and associated metadata. 

In [36]:
# redefine the ask_question function to include logging
import inspect
import uuid


def ask_question(inputs: dict[str, str]) -> str:
    """Ask a question and get an answer using a simple RAG pipeline"""
    retrieved_data = retrieval_tool(inputs["question"])

    session_id = uuid.uuid4().hex
    humanloop.tools.log(
        path="evals_demo/medqa-retrieval",
        tool={
            "function": {
                "name": "retrieval_tool",
                "description": "Retrieval tool for MedQA.",
            },
            "source_code": inspect.getsource(retrieval_tool),
        },
        output=retrieved_data,
        session_id=session_id,
    )

    inputs = {**inputs, "retrieved_data": retrieved_data}
    messages = populate_template(template, inputs)
    chat_completion = openai.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=messages,
    )
    answer = chat_completion.choices[0].message.content

    humanloop.prompts.log(
        path="evals_demo/medqa-answer",
        prompt={
            "model": model,
            "temperature": temperature,
            "template": template,
        },
        inputs=inputs,
        output=chat_completion.choices[0].message.content,
        output_message=chat_completion.choices[0].message,
        session_id=session_id,
    )
    return answer

In [None]:
# Test the pipeline

print(
    ask_question(
        {
            "question": "A 34-year-old male suffers from inherited hemophilia A. He and his wife have three unaffected daughters. What is the probability that the second daughter is a carrier of the disease?",
            'option_A': '0%', 'option_B': '25%', 'option_C': '50%', 'option_D': '75%', 'option_E': '100%'
        }
    )
)
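The Prompt asks the model for three sections separated by `---`: the chosen option, an explanation, and supporting quotes. If you want to work with these sections separately, a parser along the following lines is one option. `parse_answer` is a hypothetical helper, and the sample string is hand-written rather than real model output.

```python
def parse_answer(raw: str) -> dict[str, str]:
    """Split a model response into the three sections requested by the Prompt."""
    parts = [part.strip() for part in raw.split("---")]
    if len(parts) != 3:
        raise ValueError(f"Expected 3 sections separated by '---', got {len(parts)}")
    chosen, explanation, quotes = parts
    return {"chosen_option": chosen, "explanation": explanation, "quotes": quotes}


# Hand-written sample in the requested format.
sample = (
    "100%\n"
    "---\n"
    "The father passes his X chromosome to every daughter.\n"
    "---\n"
    "Hemophilia A is X-linked."
)
print(parse_answer(sample)["chosen_option"])
```

Raising on a malformed response makes formatting failures visible rather than silently scoring a partial answer.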

### Check your Humanloop workspace

After running this pipeline, you will see your Prompt and Tool Logs in your Humanloop workspace.

If you make changes to your Prompt in code and re-run the pipeline, you will see a new version of the Prompt created in Humanloop.


# Triggering Evaluations

We will now extend our implementation to allow us to run Evaluations on Humanloop against a specific test dataset.

This involves the following steps:
1. Extend our logging to include info needed by Evaluations.
2. Create a Dataset that we can manage and re-use on Humanloop as the source of truth.
3. Create some Evaluators that we can manage and re-use on Humanloop that can provide judgements on the performance of our Pipeline.
4. Trigger an Evaluation and view the results.

Now as you tweak your pipeline, this will allow you to easily track and compare the performance of different versions. 

## Extend logging

Add `source_datapoint_id` and `evaluation_id` to the `humanloop.prompts.log(...)` and `humanloop.tools.log(...)` calls.
We do this below by adding the optional `datapoint_id` and `evaluation_id` arguments to `ask_question(...)`.

In [64]:
import inspect
import uuid


def ask_question(
    inputs: dict[str, str],
    datapoint_id: str | None = None,
    evaluation_id: str | None = None,
) -> str:
    """Ask a question and get an answer using a simple RAG pipeline"""
    retrieved_data = retrieval_tool(inputs["question"])

    session_id = uuid.uuid4().hex
    humanloop.tools.log(
        path="evals_demo/medqa-retrieval",
        tool={
            "function": {
                "name": "retrieval_tool",
                "description": "Retrieval tool for MedQA.",
            },
            "source_code": inspect.getsource(retrieval_tool),
        },
        output=retrieved_data,
        session_id=session_id,
        source_datapoint_id=datapoint_id,
        evaluation_id=evaluation_id,
    )

    inputs = {**inputs, "retrieved_data": retrieved_data}
    messages = populate_template(template, inputs)
    chat_completion = openai.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=messages,
    )
    answer = chat_completion.choices[0].message.content

    humanloop.prompts.log(
        path="evals_demo/medqa-answer",
        prompt={
            "model": model,
            "temperature": temperature,
            "template": template,
        },
        inputs=inputs,
        output=chat_completion.choices[0].message.content,
        output_message=chat_completion.choices[0].message,
        session_id=session_id,
        source_datapoint_id=datapoint_id,
        evaluation_id=evaluation_id,
    )

    return answer

### Create a dataset
Here we will create a Dataset on Humanloop using the MedQA test dataset. Alternatively, you can create a Dataset from Logs on Humanloop, or upload one via the UI - see our [guide](https://humanloop.com/docs/v5/evaluation/guides/create-dataset).

In [56]:
def upload_dataset_to_humanloop():
    df = pd.read_json("../../assets/datapoints.jsonl", lines=True)

    datapoints = [row.to_dict() for _i, row in df.iterrows()]
    return humanloop.datasets.upsert(
        path="evals_demo/medqa-test",
        datapoints=datapoints,
        commit_message=f"Added {len(datapoints)} datapoints from MedQA test dataset.",
    )
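The contents of `datapoints.jsonl` are not shown in this notebook. As an assumption about its shape, each line is taken here to be a JSON object with `inputs` (the variables passed to the pipeline) and `target` (the reference answer used by Evaluators). The helper `to_datapoint`, the `output` key inside `target` and the example question are all hypothetical.

```python
import json


def to_datapoint(question: str, options: dict[str, str], answer: str) -> dict:
    """Build one datapoint: `inputs` feed the pipeline, `target` holds the reference answer."""
    inputs = {
        "question": question,
        **{f"option_{letter}": text for letter, text in options.items()},
    }
    # Assumption: the reference answer is stored under target["output"].
    return {"inputs": inputs, "target": {"output": answer}}


datapoint = to_datapoint(
    question="Which vitamin deficiency causes scurvy?",
    options={"A": "Vitamin A", "B": "Vitamin B12", "C": "Vitamin C", "D": "Vitamin D", "E": "Vitamin K"},
    answer="Vitamin C",
)
# One line of a JSON Lines file holds one such object.
print(json.dumps(datapoint))
```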


In [57]:
dataset = upload_dataset_to_humanloop()


### Set up Evaluators

Here we will upload some Evaluators defined in code in `assets/evaluators/` so that Humanloop can manage running these for Evaluations.

Alternatively you can define AI, Code and Human based Evaluators via the UI - see the relevant `How-to guides` on [Evaluations](https://humanloop.com/docs/v5/evaluation/overview) for creating Evaluators of different kinds.

Further, you can choose not to host the Evaluator on Humanloop and instead run it in your own runtime, posting the results as part of the Evaluation. This can be useful for more complex workflows that require custom dependencies or resources, but lies outside the scope of this tutorial.

In [58]:
def upload_evaluators():
    for evaluator_name, return_type in [
        ("exact_match", "boolean"),
        ("levenshtein", "number"),
    ]:
        with open(f"../../assets/evaluators/{evaluator_name}.py", "r") as f:
            code = f.read()
        humanloop.evaluators.upsert(
            path=f"evals_demo/{evaluator_name}",
            spec={
                "evaluator_type": "python",
                "arguments_type": "target_required",
                "return_type": return_type,
                "code": code,
            },
            commit_message=f"New version from {evaluator_name}.py",
        )
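The evaluator files themselves are not reproduced in this notebook. Purely as a sketch of what they might contain, the functions below compare the first section of the output with a reference answer. The `(log, testcase)` signature and the `log["output"]` and `testcase["target"]["output"]` keys are assumptions, not confirmed by the source; check the Evaluator documentation for the exact interface.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def chosen_option(log: dict) -> str:
    """Extract the first '---'-separated section of the output (the chosen option)."""
    return log["output"].split("---")[0].strip()


def exact_match(log: dict, testcase: dict) -> bool:
    """Hypothetical boolean evaluator: does the chosen option equal the target?"""
    return chosen_option(log) == testcase["target"]["output"]


def levenshtein(log: dict, testcase: dict) -> int:
    """Hypothetical number evaluator: edit distance between chosen option and target."""
    return levenshtein_distance(chosen_option(log), testcase["target"]["output"])
```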

In [59]:
upload_evaluators()


## Run Evaluation

Now we will define the function to run the Evaluation on Humanloop.

In [61]:
from tqdm import tqdm

DATASET = "evals_demo/medqa-test"

def run_evaluation():
    """Runs an Evaluation."""
    
    # Create the Evaluation specifying the Dataset and Evaluators to use
    evaluation = humanloop.evaluations.create(
        # NB: you can also use the `id` to reference Datasets and Evaluators 
        dataset={"path": DATASET},
        evaluators=[
            {"path": "evals_demo/exact_match"},
            {"path": "evals_demo/levenshtein"},
        ],
    )
    print(f"Evaluation created: {evaluation.id}")
    
    # Run your pipeline over the Dataset
    retrieved_dataset = humanloop.datasets.get(
        path=DATASET,
        include_datapoints=True,
    )
    for datapoint in tqdm(retrieved_dataset.datapoints):
        ask_question(
            inputs=datapoint.inputs,
            datapoint_id=datapoint.id,
            evaluation_id=evaluation.id,
        )
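A run over a large Dataset can take a long time, and a single failed API call would abort the loop above. One possible variation, sketched below with hypothetical names and a stand-in pipeline, catches errors per datapoint and collects the ids that failed so they can be retried later.

```python
from typing import Callable


def run_over_datapoints(datapoints: list[dict], pipeline: Callable[[dict], str]) -> list[str]:
    """Call `pipeline` on each datapoint's inputs and return the ids of any that fail."""
    failed_ids = []
    for datapoint in datapoints:
        try:
            pipeline(datapoint["inputs"])
        except Exception as error:  # broad on purpose: keep the run going
            print(f"Datapoint {datapoint['id']} failed: {error}")
            failed_ids.append(datapoint["id"])
    return failed_ids


def flaky_pipeline(inputs: dict) -> str:
    """Stand-in pipeline that fails on an empty question."""
    if not inputs["question"]:
        raise ValueError("empty question")
    return "answer"


failed = run_over_datapoints(
    [
        {"id": "dp_1", "inputs": {"question": "What is RAG?"}},
        {"id": "dp_2", "inputs": {"question": ""}},
    ],
    flaky_pipeline,
)
print(failed)
```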


In [66]:
run_evaluation()

Evaluation created: evr_tDoZQgxw3ZCSV5EV7HTXy


  4%|▍         | 53/1273 [03:12<1:10:42,  3.48s/it]

You can now watch the Evaluation report build in your Humanloop workspace.