# Humanloop RAG Evaluation Walkthrough
The goal of this notebook is to demonstrate how to take an existing RAG pipeline and integrate Humanloop in order to:
1. Set up logging for both your [Prompt](https://humanloop.com/docs/v5/concepts/prompts) and retriever [Tool](https://humanloop.com/docs/v5/concepts/prompts) so that you can easily track the versions of these components.
2. Create a [Dataset](https://humanloop.com/docs/v5/concepts/prompts) and run Evaluations to benchmark the performance of your RAG pipeline.
3. Configure [Evaluators](https://humanloop.com/docs/v5/concepts/evaluators) for monitoring your RAG pipeline in production.


## What is Humanloop?
Humanloop is an interactive development environment designed to streamline the entire lifecycle of LLM app development. It serves as a central hub where AI, Product, and Engineering teams can collaborate on Prompt management, Evaluation and Monitoring workflows. 


## What is RAG?
RAG stands for Retrieval Augmented Generation.
- **Retrieval** - Getting the relevant information from a larger data source for a given query.
- **Augmented** - Using the retrieved information as input to an LLM.
- **Generation** - Generating an output from the model given the input.

In practice, it remains an effective way to use LLMs for tasks like question answering and summarization, where the data source is too large to fit in the context window of the LLM, or where providing the full data source for each query is not cost-effective.
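The three stages above can be sketched in a few lines of plain Python. This is a toy illustration only: the word-overlap retriever and the canned `generate` function are stand-ins we made up for this example, not the Chroma and OpenAI components used later in this notebook.

```python
# Toy illustration of the three RAG stages; no real LLM or vector DB is used.

def retrieve(query: str, documents: list[str]) -> str:
    """Retrieval: return the document sharing the most words with the query."""
    query_words = set(query.lower().split())
    return max(documents, key=lambda doc: len(query_words & set(doc.lower().split())))


def augment(query: str, retrieved: str) -> str:
    """Augmented: combine the retrieved text with the query into one prompt."""
    return f"Context:\n{retrieved}\n\nQuestion: {query}"


def generate(prompt: str) -> str:
    """Generation: hypothetical stand-in for an LLM call; returns a canned string."""
    return f"[model answer based on a prompt of {len(prompt)} characters]"


documents = [
    "The femur is the longest bone in the human body.",
    "Insulin is produced by beta cells in the pancreas.",
]
query = "Which cells produce insulin?"

retrieved = retrieve(query, documents)
prompt = augment(query, retrieved)
answer = generate(prompt)
print(prompt)
```

Only the retrieved document is passed to the model, which is what keeps the prompt small even when the full corpus is large.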


## What are the major challenges with RAG?
Implementing RAG and other similar flows complicates [Prompt Engineering](https://humanloop.com/blog/prompt-engineering-101) because it expands the design space of your application. You need to make many choices around the retrieval and Prompt components, and these can significantly impact the performance of your overall application. For example:
- How do you select the data source?
- How should it be chunked up and indexed?
- What embedding and retrieval model should you use?
- How should you combine the retrieved information with the query?
- What should your system Prompt be?
- Which model should you use?

The process of versioning, evaluating and monitoring your pipeline therefore needs to consider both the retrieval and generation components. This is where Humanloop can help.
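To make one of these choices concrete, here is a minimal sketch of fixed-size chunking with overlap. The function name and the `chunk_size` and `overlap` parameters are our own illustrative choices; this notebook itself uses a knowledge base that is already chunked.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into character chunks of at most chunk_size, overlapping by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start : start + chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


sample = "abcdefghij"
chunks = chunk_text(sample, chunk_size=4, overlap=2)
print(chunks)
```

Changing `chunk_size` or `overlap` changes what the retriever can return, which is one reason the retrieval component needs to be versioned and evaluated alongside the Prompt.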


# Example RAG Pipeline

We first need a reference RAG implementation. Our use case will be Q&A over a corpus of medical documents.

### Dataset
We will use a version of the [MedQA dataset](https://huggingface.co/datasets/bigbio/med_qa) from Hugging Face. This is a multiple-choice question answering problem based on the United States Medical Licensing Examination (USMLE), with reference textbooks that contain the information required to answer the questions.

### Retriever
We're going to use [Chroma](https://docs.trychroma.com/getting-started) as a simple local vector DB with its default embedding model, `all-MiniLM-L6-v2`. You can replace this with your favorite retrieval system.
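As a rough picture of what an embedding-based retriever does, the sketch below ranks documents by cosine similarity between a query vector and document vectors. The three-dimensional vectors and document ids are made up for illustration, and cosine similarity is assumed here as a common choice; Chroma's own index and distance metric are not reproduced.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int = 1) -> list[str]:
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine_similarity(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical toy embeddings; a real embedding model produces much longer vectors.
toy_embeddings = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.1, 0.8, 0.3],
}
query_embedding = [0.8, 0.2, 0.1]

best = top_k(query_embedding, toy_embeddings, k=1)
print(best)
```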




## Configure pre-requisites

In [5]:
!poetry add chromadb openai humanloop==0.8.0b6 pandas pyarrow


In [13]:
knowledge_base

| Unnamed: 0 | id | title | content | contents |
|---|---|---|---|---|
| 0 | Anatomy_Gray_0 | Anatomy_Gray | What is anatomy? Anatomy includes those struct... | Anatomy_Gray. What is anatomy? Anatomy include... |
| 1 | Anatomy_Gray_1 | Anatomy_Gray | Observation and visualization are the primary ... | Anatomy_Gray. Observation and visualization ar... |
| 2 | Anatomy_Gray_2 | Anatomy_Gray | How can gross anatomy be studied? The term ana... | Anatomy_Gray. How can gross anatomy be studied... |
| 3 | Anatomy_Gray_3 | Anatomy_Gray | This includes the vasculature, the nerves, the... | Anatomy_Gray. This includes the vasculature, t... |
| 4 | Anatomy_Gray_4 | Anatomy_Gray | Each of these approaches has benefits and defi... | Anatomy_Gray. Each of these approaches has ben... |
| ... | ... | ... | ... | ... |
| 125842 | Surgery_Schwartz_14344 | Surgery_Schwartz | feedback. However, the evidence base upon whic... | Surgery_Schwartz. feedback. However, the evide... |
| 125843 | Surgery_Schwartz_14345 | Surgery_Schwartz | College of Physicians Council of Associates; F... | Surgery_Schwartz. College of Physicians Counci... |
| 125844 | Surgery_Schwartz_14346 | Surgery_Schwartz | This review of 10 articles published between 2... | Surgery_Schwartz. This review of 10 articles p... |
| 125845 | Surgery_Schwartz_14347 | Surgery_Schwartz | a systematic review. J Surg Educ. 2015;72(6):1... | Surgery_Schwartz. a systematic review. J Surg ... |


In [39]:
# Set up dependencies for reference implementation
from dotenv import load_dotenv
import os
import chromadb
from openai import OpenAI
from humanloop import Humanloop

import pandas as pd

load_dotenv()

# init clients
chroma = chromadb.Client()
openai = OpenAI(api_key=os.getenv("OPENAI_KEY"))
humanloop = Humanloop(
    api_key=os.getenv("HUMANLOOP_KEY"), base_url=os.getenv("HUMANLOOP_BASE_URL")
)

In [38]:
# init collection into which we will add documents
collection = chroma.get_or_create_collection(name="MedQA")

# load knowledge base
knowledge_base = pd.read_parquet("../../assets/sources/textbooks.parquet")
knowledge_base = knowledge_base.sample(1000, random_state=42)


# Add to Chroma - will by default use local vector DB and model all-MiniLM-L6-v2
collection.add(
    documents=knowledge_base["contents"].to_list(),
    ids=knowledge_base["id"].to_list(),
)


In [33]:
model = "gpt-3.5-turbo"
temperature = 0
template = [
    {
        "role": "system",
        "content": """Answer the following question factually.

Question: {{question}}

Options:
- {{option_A}}
- {{option_B}}
- {{option_C}}
- {{option_D}}
- {{option_E}}

---

Here is some retrieved information that might be helpful.
Retrieved data:
{{retrieved_data}}

---

Give your answer in 3 sections using the following format. Do not include the quotes or the brackets. Do include the "---" separators.
```
<chosen option verbatim>
---
<clear explanation of why the option is correct and why the other options are incorrect. keep it ELI5.>
---
<quote relevant information snippets from the retrieved data verbatim. every line here should be directly copied from the retrieved data>
```
""",
    }
]

def populate_template(template: list, inputs: dict[str, str]) -> list:
    """Populate a template with input variables."""
    # TODO: Move to utils.
    messages = []
    for template_message in template:
        content = template_message["content"]
        for key, value in inputs.items():
            content = content.replace("{{" + key + "}}", value)
        message = {**template_message, "content": content}
        messages.append(message)
    return messages
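A quick usage sketch of `populate_template` on a toy template. The function is repeated here in compact form so that the block runs on its own, and the toy template and inputs are invented for illustration. Note that values must be strings, and that a `{{placeholder}}` with no matching key is left untouched.

```python
def populate_template(template: list, inputs: dict[str, str]) -> list:
    """Populate a template with input variables (same logic as the cell above)."""
    messages = []
    for template_message in template:
        content = template_message["content"]
        for key, value in inputs.items():
            content = content.replace("{{" + key + "}}", value)
        messages.append({**template_message, "content": content})
    return messages


toy_template = [
    {"role": "system", "content": "Question: {{question}}\nContext: {{retrieved_data}}"}
]

# All variables provided: every placeholder is substituted.
messages = populate_template(
    toy_template,
    {"question": "What is anatomy?", "retrieved_data": "Anatomy is the study of structure."},
)

# A missing variable leaves its placeholder in the output unchanged.
partial = populate_template(toy_template, {"question": "What is anatomy?"})

print(messages[0]["content"])
print(partial[0]["content"])
```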


In [34]:
# Reference RAG pipeline using Chroma and OpenAI

def retrieval_tool(question: str) -> str:
    """Retrieve relevant documents using a chroma collection."""
    response = collection.query(query_texts=[question], n_results=1)
    retrieved_doc = response["documents"][0][0]
    return retrieved_doc


def ask_question(inputs: dict[str, str]) -> str:
    """Ask a question and get an answer using a simple RAG pipeline"""
    retrieved_data = retrieval_tool(inputs["question"])

    inputs = {**inputs, "retrieved_data": retrieved_data}
    messages = populate_template(template, inputs)
    chat_completion = openai.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=messages,
    )
    answer = chat_completion.choices[0].message.content
    return answer

In [65]:
# Test the pipeline

print(
    ask_question(
        {
            "question": "A 34-year-old male suffers from inherited hemophilia A. He and his wife have three unaffected daughters. What is the probability that the second daughter is a carrier of the disease?",
            'option_A': '0%', 'option_B': '25%', 'option_C': '50%', 'option_D': '75%', 'option_E': '100%'
        }
    )
)

```
50%
---
The probability that the second daughter is a carrier of the disease is 50%. This is because the daughters of a male with hemophilia A will either be carriers (50% chance) or unaffected (50% chance), as daughters inherit one X chromosome from their father. The other options are incorrect because the daughters cannot have hemophilia A themselves due to the inheritance pattern of the disease.

---
"During meiosis or mitosis, failure of a chromosomal pair to separate properly results in nondisjunction."
```
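Because the Prompt asks for three sections separated by `---`, the response can be split back into its parts before comparing the chosen option against a target. The parser below is a minimal sketch under that assumption; `parse_answer` and the example response are our own, not part of the pipeline above.

```python
import re

FENCE = "`" * 3  # the model may echo the code fence from the Prompt


def parse_answer(raw: str) -> dict[str, str]:
    """Split a response into its choice, explanation and quotes sections."""
    text = raw.strip()
    # Drop a surrounding code fence if present.
    if text.startswith(FENCE):
        text = text[len(FENCE):]
    if text.endswith(FENCE):
        text = text[: -len(FENCE)]

    # A separator is a line containing only "---" (optionally padded with spaces).
    parts = [part.strip() for part in re.split(r"^[ \t]*---[ \t]*$", text, flags=re.MULTILINE)]
    if len(parts) != 3:
        raise ValueError(f"expected 3 sections, got {len(parts)}")

    choice, explanation, quotes = parts
    return {"choice": choice, "explanation": explanation, "quotes": quotes}


# Hypothetical response in the requested format.
example = FENCE + '\nOption text\n---\nBecause of X.\n\n---\n"Some quoted text."\n' + FENCE
parsed = parse_answer(example)
print(parsed["choice"])
```

Raising on an unexpected number of sections makes formatting failures visible instead of silently scoring a malformed response.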


# Humanloop Integration

To integrate Humanloop, we add the appropriate `log` calls to the pipeline so that the relevant information from each step is recorded on Humanloop. The same pattern applies to logging to, or calling, any of the core entities on Humanloop.

Below, we add a `humanloop.tools.log(...)` call after the retrieval step to log the retrieved documents to Humanloop,
and a `humanloop.prompts.log(...)` call after the chat completion generation.
We also pass in a `session_id` to link these two Logs together.
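The idea of linking Logs through a shared id can be illustrated without any network calls. In the sketch below, the in-memory `logs` list and the `log` helper are hypothetical stand-ins; they show only the grouping idea, not how Humanloop stores or joins Logs.

```python
import uuid
from collections import defaultdict

logs: list[dict] = []  # hypothetical in-memory stand-in for remotely stored Logs


def log(kind: str, output: str, session_id: str) -> None:
    """Record one log entry tagged with the session it belongs to."""
    logs.append({"kind": kind, "output": output, "session_id": session_id})


def run_pipeline(question: str) -> str:
    """Simulate one pipeline run: one tool log and one prompt log share an id."""
    session_id = uuid.uuid4().hex  # 32-character hexadecimal string
    log("tool", f"retrieved doc for: {question}", session_id)
    log("prompt", f"answer for: {question}", session_id)
    return session_id


first = run_pipeline("question one")
second = run_pipeline("question two")

# Group log kinds by session to recover which logs belong to the same run.
by_session: dict[str, list[str]] = defaultdict(list)
for entry in logs:
    by_session[entry["session_id"]].append(entry["kind"])

print(dict(by_session))
```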

In [36]:
import inspect
import uuid

def retrieval_tool(question: str) -> str:
    """Retrieve relevant documents using a chroma collection."""
    response = collection.query(query_texts=[question], n_results=1)
    retrieved_doc = response["documents"][0][0]
    return retrieved_doc


def ask_question(inputs: dict[str, str]) -> str:
    """Ask a question and get an answer using a simple RAG pipeline"""
    retrieved_data = retrieval_tool(inputs["question"])

    session_id = uuid.uuid4().hex
    humanloop.tools.log(
        path="evals_demo/medqa-retrieval",
        tool={
            "function": {
                "name": "retrieval_tool",
                "description": "Retrieval tool for MedQA.",
            },
            "source_code": inspect.getsource(retrieval_tool),
        },
        output=retrieved_data,
        session_id=session_id,
    )

    inputs = {**inputs, "retrieved_data": retrieved_data}
    messages = populate_template(template, inputs)
    chat_completion = openai.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=messages,
    )
    answer = chat_completion.choices[0].message.content

    humanloop.prompts.log(
        path="evals_demo/medqa-answer",
        prompt={
            "model": model,
            "temperature": temperature,
            "template": template,
        },
        inputs=inputs,
        output=chat_completion.choices[0].message.content,
        output_message=chat_completion.choices[0].message,
        session_id=session_id,
    )

    return answer

In [None]:
# TODO: Update. This is old.

# Manage your Prompt on Humanloop

def ask_question(question: str) -> str:
    """Ask a question and get an answer using a simple RAG pipeline"""
    # Retrieve relevant documents (pass the question variable, not the literal string)
    response = collection.query(query_texts=[question], n_results=1)
    retrieved_doc = response["documents"][0][0]
    
    # Generate answer using Prompt managed on Humanloop
    messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": question},
            {"role": "assistant", "content": retrieved_doc}
        ]
    # Use the `humanloop` client created earlier in this notebook
    answer = humanloop.prompts.call(
        path="faq-bot/rag-prompt",
        messages=messages,
    )
    return answer

# Setting up Evaluations

## Extend Humanloop integration

Add `source_datapoint_id` to the `humanloop.prompts.log(...)` and `humanloop.tools.log(...)` calls.
We do this below by adding the optional `datapoint_id` argument to `ask_question(...)`.

In [64]:
import inspect
import uuid

def retrieval_tool(question: str) -> str:
    """Retrieve relevant documents using a chroma collection."""
    response = collection.query(query_texts=[question], n_results=1)
    retrieved_doc = response["documents"][0][0]
    return retrieved_doc


def ask_question(inputs: dict[str, str], datapoint_id: str | None = None) -> str:
    """Ask a question and get an answer using a simple RAG pipeline"""
    retrieved_data = retrieval_tool(inputs["question"])

    session_id = uuid.uuid4().hex
    humanloop.tools.log(
        path="evals_demo/medqa-retrieval",
        tool={
            "function": {
                "name": "retrieval_tool",
                "description": "Retrieval tool for MedQA.",
            },
            "source_code": inspect.getsource(retrieval_tool),
        },
        output=retrieved_data,
        session_id=session_id,
        source_datapoint_id=datapoint_id,
    )

    inputs = {**inputs, "retrieved_data": retrieved_data}
    messages = populate_template(template, inputs)
    chat_completion = openai.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=messages,
    )
    answer = chat_completion.choices[0].message.content

    humanloop.prompts.log(
        path="evals_demo/medqa-answer",
        prompt={
            "model": model,
            "temperature": temperature,
            "template": template,
        },
        inputs=inputs,
        output=chat_completion.choices[0].message.content,
        output_message=chat_completion.choices[0].message,
        session_id=session_id,
        source_datapoint_id=datapoint_id,
    )

    return answer

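## Creating a Dataset
You can create a Dataset on Humanloop from your existing Logs or by using the SDK. Below we use the SDK to upload datapoints from a local JSONL file.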

In [56]:
def upload_dataset_to_humanloop():
    """Upload the MedQA test datapoints to Humanloop as a Dataset."""
    df = pd.read_json("../../assets/datapoints.jsonl", lines=True)

    datapoints = [row.to_dict() for _i, row in df.iterrows()]
    return humanloop.datasets.upsert(
        path="evals_demo/medqa-test",
        datapoints=datapoints,
        commit_message=f"Added {len(datapoints)} datapoints from MedQA test dataset.",
    )


In [57]:
upload_dataset_to_humanloop()

UnprocessableEntityError: status_code: 422, body: detail={'loc': ['commit_message'], 'msg': 'Error creating Version', 'description': "Version 'dsv_No7Ofzh5RrFPWBbfujGfy' has already been committed.", 'type': 'invalid_request_error'}
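The contents of `datapoints.jsonl` are not shown in this notebook. As a sketch, we assume each line is a JSON object with an `inputs` dict whose keys match the template variables and a `target` holding the expected answer. The `to_datapoint` helper, the shape of the raw `row`, and the `{"output": ...}` layout of `target` are all assumptions made for illustration.

```python
import json


def to_datapoint(row: dict) -> dict:
    """Convert a hypothetical raw question row into an inputs/target datapoint."""
    inputs = {"question": row["question"]}
    # Map each option letter to the matching template variable, e.g. option_A.
    for letter, text in row["options"].items():
        inputs[f"option_{letter}"] = text
    return {"inputs": inputs, "target": {"output": row["answer"]}}


# Toy row, invented for this example.
row = {
    "question": "Which organ produces insulin?",
    "options": {"A": "Liver", "B": "Pancreas", "C": "Kidney", "D": "Spleen", "E": "Heart"},
    "answer": "Pancreas",
}

datapoint = to_datapoint(row)
jsonl_line = json.dumps(datapoint)  # one line of a JSONL file
print(jsonl_line)
```

Keeping the input keys identical to the template variables means a datapoint's `inputs` can be passed straight to `ask_question(...)`.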

## Set up Evaluators

In [58]:
def upload_evaluators():
    for evaluator_name, return_type in [
        ("exact_match", "boolean"),
        ("levenshtein", "number"),
    ]:
        with open(f"../../assets/evaluators/{evaluator_name}.py", "r") as f:
            code = f.read()
        humanloop.evaluators.upsert(
            path=f"evals_demo/{evaluator_name}",
            spec={
                "evaluator_type": "python",
                "arguments_type": "target_required",
                "return_type": return_type,
                "code": code,
            },
            commit_message=f"New version from {evaluator_name}.py",
        )

In [59]:
upload_evaluators()

UnprocessableEntityError: status_code: 422, body: detail={'loc': ['commit_message'], 'msg': 'Error creating Version', 'description': "Version 'evv_w85ulWbWFqElPWxXe3lEu' has already been committed.", 'type': 'invalid_request_error'}
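The evaluator files uploaded above (`exact_match.py` and `levenshtein.py`) are not shown in this notebook. The sketch below illustrates the kind of logic such evaluators typically contain, written as plain functions on two strings. It is not the content of those files, and it does not cover the function signature Humanloop expects for a Python Evaluator.

```python
def exact_match(output: str, target: str) -> bool:
    """Return True when output and target are identical after trimming whitespace."""
    return output.strip() == target.strip()


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + substitution_cost,  # substitution
                )
            )
        previous = current
    return previous[-1]


print(exact_match(" Pancreas ", "Pancreas"))
print(levenshtein("kitten", "sitting"))
```

The return types line up with the `"boolean"` and `"number"` values passed as `return_type` in the cell above.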

## Run Evaluation

In [61]:
from tqdm import tqdm

def run_evaluation():
    """Runs an Evaluation."""

    # Placeholder IDs: replace these with the IDs from your own Humanloop workspace.
    DATASET_ID = "ds_"
    DATASET_VERSION_ID = "dsv_"
    PROMPT_VERSION_ID = "prv_"
    EVALUATOR_VERSION_IDS = [
        "evv_",  # exact_match
        "evv_",  # levenshtein
    ]

    evaluation = humanloop.evaluations.create(
        dataset={"version_id": DATASET_VERSION_ID},
        evaluatees=[{"version_id": PROMPT_VERSION_ID, "orchestrated": False}],
        evaluators=[{"version_id": ev_id} for ev_id in EVALUATOR_VERSION_IDS],
    )
    print(f"Evaluation created: {evaluation.id}")

    retrieved_dataset = humanloop.datasets.get(
        id=DATASET_ID,
        version_id=DATASET_VERSION_ID,
        include_datapoints=True,
    )
    for datapoint in tqdm(retrieved_dataset.datapoints):
        # with evaluation(datapoint.id):
        ask_question(
            inputs=datapoint.inputs,
            datapoint_id=datapoint.id,
        )


In [66]:
run_evaluation()

Evaluation created: evr_tDoZQgxw3ZCSV5EV7HTXy


  4%|▍         | 53/1273 [03:12<1:10:42,  3.48s/it]