## Complete Prerequisites

### Install packages

We use Poetry to manage dependencies:


In [1]:
!poetry install

Installing dependencies from lock file

No dependencies to install or update

Installing the current project: humanloop-cookbook (0.1.0)
If you do not want to install the current project use --no-root.
If you want to use Poetry only for dependency management but not for packaging, you can disable package mode by setting package-mode = false in your pyproject.toml file.


### Initialise the SDKs

You will need to set your OpenAI API key in the `.env` file in the root of the repo. You can retrieve your API key from your [OpenAI account](https://platform.openai.com/api-keys).
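The cells below assign the keys to plain variables. If you prefer to read them from a `.env` file, a minimal standard-library sketch could look like the following. The helper `load_env_file` and the variable names `OPENAI_API_KEY` and `HUMANLOOP_API_KEY` are assumptions made for illustration, not names defined by this repo.

```python
import os
import tempfile


def load_env_file(path: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    with open(path, encoding="utf-8") as fp:
        for raw_line in fp:
            line = raw_line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip().strip('"').strip("'")
    return values


# Demo against a throwaway file so the sketch runs anywhere.
with tempfile.NamedTemporaryFile("w", suffix=".env", delete=False) as tmp:
    tmp.write('# example only\nOPENAI_API_KEY="sk-example"\nHUMANLOOP_API_KEY=hl-example\n')

env = load_env_file(tmp.name)
os.remove(tmp.name)

# Hypothetical variable names; fall back to empty strings when a key is missing.
OPENAI_KEY = env.get("OPENAI_API_KEY", "")
HL_KEY = env.get("HUMANLOOP_API_KEY", "")
print(OPENAI_KEY, HL_KEY)
```

In a real run you would point `load_env_file` at the `.env` file in the repo root instead of a temporary file.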


In [2]:
IS_DEV = False

In [3]:
HL_KEY = ""
OPENAI_KEY = ""
HOST = "https://neostaging.humanloop.ml" if IS_DEV else "http://0.0.0.0:80"

In [4]:
# Set up dependencies
import os
import chromadb
from openai import OpenAI
import requests
import datetime

import pandas as pd

# init clients
chroma = chromadb.Client()
openai = OpenAI(api_key=OPENAI_KEY)


### Set up the Vector DB

This involves loading the data from the MedQA dataset and embedding the data within a collection in Chroma. This will take a couple of minutes to complete.


In [5]:
# init collection into which we will add documents
collection = chroma.get_or_create_collection(name="MedQA")

# load knowledge base
knowledge_base = pd.read_parquet("../../assets/sources/textbooks.parquet")
knowledge_base = knowledge_base.sample(5, random_state=42)


# Add to Chroma - will by default use local vector DB and model all-MiniLM-L6-v2
collection.add(
    documents=knowledge_base["contents"].to_list(),
    ids=knowledge_base["id"].to_list(),
)
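To make the idea of embedding-based retrieval concrete, here is a toy sketch that ranks documents by cosine similarity over word counts. It is only an illustration: the functions `embed`, `cosine` and `nearest` are hypothetical, and Chroma's real embedding model and distance computation are different from this stand-in.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest(query: str, documents: dict[str, str]) -> str:
    """Return the id of the document most similar to the query."""
    query_vec = embed(query)
    return max(documents, key=lambda doc_id: cosine(query_vec, embed(documents[doc_id])))


docs = {
    "doc-1": "hemophilia a is an x-linked recessive bleeding disorder",
    "doc-2": "the femur is the longest bone in the human body",
}
best_id = nearest("which disorder is x-linked", docs)
print(best_id)  # doc-1
```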

### Define the Prompt

We define a simple prompt template that has variables for the question, answer options and retrieved data.

It is generally good practice to define the Prompt details that impact the behaviour of the model in one place, separate from your application logic.


In [6]:
model = "gpt-3.5-turbo"
temperature = 0
template = [
    {
        "role": "system",
        "content": """Answer the following question factually.

Question: {{question}}

Options:
- {{option_A}}
- {{option_B}}
- {{option_C}}
- {{option_D}}
- {{option_E}}

---

Here is some retrieved information that might be helpful.
Retrieved data:
{{retrieved_data}}

---

Give your answer in 3 sections using the following format. Do not include the quotes or the brackets. Do include the "---" separators.
```
<chosen option verbatim>
---
<clear explanation of why the option is correct and why the other options are incorrect. keep it ELI5.>
---
<quote relevant information snippets from the retrieved data verbatim. every line here should be directly copied from the retrieved data>
```
""",
    }
]


def populate_template(template: list, inputs: dict[str, str]) -> list:
    """Populate a template with input variables."""
    messages = []
    for template_message in template:
        content = template_message["content"]
        for key, value in inputs.items():
            content = content.replace("{{" + key + "}}", value)
        message = {**template_message, "content": content}
        messages.append(message)
    return messages
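A quick check of how the substitution behaves may help. The sketch below repeats the same logic so that it runs on its own, then fills a small made-up template (`demo_template` is hypothetical). Note that placeholders without a matching input are left untouched, and that values must be strings because `str.replace` is used.

```python
def populate_template(template: list, inputs: dict[str, str]) -> list:
    """Same substitution logic as above, repeated so this snippet runs alone."""
    messages = []
    for template_message in template:
        content = template_message["content"]
        for key, value in inputs.items():
            content = content.replace("{{" + key + "}}", value)
        messages.append({**template_message, "content": content})
    return messages


demo_template = [{"role": "system", "content": "Q: {{question}} | A: {{option_A}} | {{unknown}}"}]
demo_messages = populate_template(demo_template, {"question": "2 + 2?", "option_A": "4"})
print(demo_messages[0]["content"])
# Q: 2 + 2? | A: 4 | {{unknown}}
```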


## Define the RAG Pipeline

Now we provide the reference RAG pipeline using Chroma and OpenAI that takes a question and returns an answer. This is ultimately what we will evaluate.


In [7]:
def retrieval_tool(question: str) -> str:
    """Retrieve most relevant document from the vector db (Chroma) for the question."""
    response = collection.query(query_texts=[question], n_results=1)
    retrieved_doc = response["documents"][0][0]
    return retrieved_doc


def ask_question(inputs: dict[str, str]) -> str:
    """Ask a question and get an answer using a simple RAG pipeline"""

    # Retrieve context
    retrieved_data = retrieval_tool(inputs["question"])
    inputs = {**inputs, "retrieved_data": retrieved_data}

    # Populate the Prompt template
    messages = populate_template(template, inputs)

    # Call OpenAI to get response
    chat_completion = openai.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=messages,
    )
    return chat_completion.choices[0].message.content

In [8]:
# Test the pipeline

print(
    ask_question(
        {
            "question": "A 34-year-old male suffers from inherited hemophilia A. He and his wife have three unaffected daughters. What is the probability that the second daughter is a carrier of the disease?",
            "option_A": "0%",
            "option_B": "25%",
            "option_C": "50%",
            "option_D": "75%",
            "option_E": "100%",
        }
    )
)

```
50%
---
The probability that the second daughter is a carrier of the disease is 50%. This is because the daughters of a male with hemophilia A will either inherit the disease (if they receive the affected X chromosome) or be carriers (if they receive the normal X chromosome). Since the daughters are unaffected, they must be carriers of the disease.

---
```
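Because the prompt asks for three sections separated by `---`, downstream code may want to split the raw answer before comparing it with a target. The helper below is a hypothetical sketch (`parse_answer` is not part of this cookbook); it assumes the model follows the requested format and tolerates optional code-fence lines around the answer.

```python
FENCE = "`" * 3  # the code-fence marker that the model may wrap its answer in


def parse_answer(raw: str) -> dict[str, str]:
    """Split a model answer into the three sections requested by the prompt."""
    lines = [line for line in raw.strip().splitlines() if line.strip() != FENCE]
    sections, current = [], []
    for line in lines:
        if line.strip() == "---":
            sections.append("\n".join(current).strip())
            current = []
        else:
            current.append(line)
    sections.append("\n".join(current).strip())
    # Pad so that a missing trailing section becomes an empty string.
    sections += [""] * (3 - len(sections))
    return {
        "choice": sections[0],
        "explanation": sections[1],
        "quotes": "\n".join(sections[2:]).strip(),
    }


sample = FENCE + "\n50%\n---\nShort explanation.\n\n---\n" + FENCE
parsed = parse_answer(sample)
print(parsed)
```

An empty `quotes` value, as in this sample, is a quick signal that the answer did not cite any retrieved text.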



# Humanloop Integration

We now integrate Humanloop into the RAG pipeline to first enable logging and then to trigger evaluations against a dataset.


## Flow V1 Experiment


In [9]:
import inspect
import uuid

flow_id = None


def ask_question(
    inputs: dict[str, str],
    datapoint_id: str | None = None,
    evaluation_id: str | None = None,
) -> str:
    """Ask a question and get an answer using a simple RAG pipeline"""
    trace_request = requests.post(
        f"{HOST}/v5/flows/log",
        headers={"X-API-KEY": HL_KEY},
        json={
            "log_id": uuid.uuid4().hex,
            "flow": {
                "attributes": {
                    "description": "Answering medical questions",
                    "chroma": 1,
                }
            },
            "path": "evals_demo/medqa-flow",
            "source_datapoint_id": datapoint_id,
            "evaluation_id": evaluation_id,
        },
    ).json()
    print(trace_request)
    trace_id = trace_request["id"]

    try:
        flow_id = trace_request["flow_id"]

        # Create an Evaluator to count the number of children in the trace
        with open("../../assets/evaluators/count_trace_children.py", "r") as fp:
            code = fp.read()

        evaluator_response = requests.post(
            f"{HOST}/v5/evaluators",
            headers={"X-API-KEY": HL_KEY},
            json={
                "spec": {
                    "code": code,
                    "evaluator_type": "python",
                    "return_type": "number",
                    "arguments_type": "target_free",
                },
                "path": "evals_demo/count-trace-children",
            },
        )
        # Associate trace_children evaluator with the Flow
        ev_version_id: int = evaluator_response.json()["version_id"]
        requests.post(
            f"{HOST}/v5/flows/{flow_id}/evaluators",
            headers={"X-API-KEY": HL_KEY},
            json={
                "activate": [
                    {
                        "evaluator_version_id": ev_version_id,
                    }
                ]
            },
        )
    except Exception:
        # Best-effort setup: ignore failures, e.g. when the Evaluator already exists
        pass

    start_time = datetime.datetime.now()

    # Retrieve context
    retrieved_data = retrieval_tool(inputs["question"])

    end_time = datetime.datetime.now()

    # Log the context and retriever details to your Humanloop Tool
    requests.post(
        f"{HOST}/v5/tools/log",
        json={
            "path": "evals_demo/medqa-retrieval",
            "tool": {
                "function": {
                    "name": "retrieval_tool",
                    "description": "Retrieval tool for MedQA.",
                },
                "source_code": inspect.getsource(retrieval_tool),
            },
            "output": retrieved_data,
            "trace_parent_id": trace_id,
            "start_time": start_time.isoformat(),
            "end_time": end_time.isoformat(),
        },
        headers={"X-API-Key": HL_KEY},
    )

    # Populate the Prompt template
    inputs = {**inputs, "retrieved_data": retrieved_data}
    messages = populate_template(template, inputs)

    # Call OpenAI to get a response
    start_time = datetime.datetime.now()

    chat_completion = openai.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=messages,
    )
    message = chat_completion.choices[0].message
    answer = message.content

    end_time = datetime.datetime.now()

    requests.post(
        f"{HOST}/v5/prompts/log",
        headers={"X-API-Key": HL_KEY},
        json={
            "path": "evals_demo/medqa-answer",
            "prompt": {
                "model": model,
                "temperature": temperature,
                "template": template,
            },
            "inputs": inputs,
            "output": answer,
            "output_message": message.to_dict(),
            "trace_parent_id": trace_id,
            "source_datapoint_id": datapoint_id,
            "evaluation_id": evaluation_id,
            "start_time": start_time.isoformat(),
            "end_time": end_time.isoformat(),
        },
    )

    try:
        requests.patch(
            f"{HOST}/v5/flows/logs/{trace_id}",
            headers={"X-API-KEY": HL_KEY},
            json={
                "inputs": inputs,
                "output": answer,
                "trace_status": "complete",
            },
            timeout=2,
        )
    except Exception as e:
        # Do not fail the pipeline if marking the trace as complete errors or times out
        print(e)

    return answer
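The function above records `start_time` and `end_time` by hand around each step. One possible way to cut that repetition is a small wrapper that times any callable and returns the ISO timestamps ready to merge into a log payload. This is a sketch only: `timed_call`, `fake_retrieval` and the placeholder trace id are hypothetical and nothing here calls the Humanloop API.

```python
import datetime
from typing import Any, Callable


def timed_call(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> tuple[Any, dict[str, str]]:
    """Run fn and return its result plus ISO-formatted start and end times."""
    start_time = datetime.datetime.now()
    result = fn(*args, **kwargs)
    end_time = datetime.datetime.now()
    timing = {"start_time": start_time.isoformat(), "end_time": end_time.isoformat()}
    return result, timing


def fake_retrieval(question: str) -> str:
    """Stand-in for retrieval_tool so the sketch runs without Chroma."""
    return f"context for: {question}"


retrieved, timing = timed_call(fake_retrieval, "What is hemophilia A?")
# The payload keys mirror the ones used in the log requests above.
log_payload = {"output": retrieved, "trace_parent_id": "trace-id-placeholder", **timing}
print(log_payload)
```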

## Create a Dataset

Here we will create a Dataset on Humanloop using the MedQA test dataset. Alternatively, you can create a Dataset from Logs on Humanloop, or upload one via the UI - see our [guide](https://humanloop.com/docs/v5/evaluation/guides/create-dataset).

You can then effectively version control your Dataset centrally on Humanloop and hook into it for Evaluation workflows in code and via the UI.


In [10]:
def upload_dataset_to_humanloop():
    df = pd.read_json("../../assets/datapoints.jsonl", lines=True)

    datapoints = [row.to_dict() for _i, row in df.iterrows()][0:20]

    response = requests.post(
        f"{HOST}/v5/datasets",
        json={
            "path": "evals_demo/medqa-test",
            "commit_message": f"Added {len(datapoints)} datapoints from MedQA test dataset.",
            "datapoints": datapoints,
        },
        headers={"X-API-Key": HL_KEY},
    )
    print(response.json())
    return response.json()["id"]
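The contents of `datapoints.jsonl` are not shown here, so the following is only a guess at the shape of one datapoint. The `inputs` keys match the template variables defined earlier and the pipeline reads `datapoint["inputs"]`; the `target` field and its `output` key are assumptions, as are all the example values. The sketch writes one JSON object per line and reads it back with the standard library.

```python
import io
import json

# Hypothetical datapoint; all values are placeholders.
example_datapoint = {
    "inputs": {
        "question": "Example question text?",
        "option_A": "first option",
        "option_B": "second option",
        "option_C": "third option",
        "option_D": "fourth option",
        "option_E": "fifth option",
    },
    # Assumption: the expected answer is stored under target["output"].
    "target": {"output": "first option"},
}

# JSON Lines: one JSON object per line.
buffer = io.StringIO()
buffer.write(json.dumps(example_datapoint) + "\n")
buffer.seek(0)

loaded = [json.loads(line) for line in buffer if line.strip()]
print(len(loaded), sorted(loaded[0]["inputs"]))
```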


In [11]:
dataset_id = upload_dataset_to_humanloop()

{'path': 'evals_demo/medqa-test', 'id': 'ds_JHjdcy1Pwn6ZxAnDglU30', 'directory_id': 'dir_7pYI47Zq55Ytx7eDyuYDo', 'name': 'medqa-test', 'version_id': 'dsv_8E2y5YjQt9Y9Y4Ni1DTvY', 'type': 'dataset', 'environments': [{'id': 'env_NakuPkQXr8w4dYkTAAynO', 'created_at': '2024-04-29T08:01:19.415384', 'name': 'production', 'tag': 'default'}], 'created_at': '2024-10-04T13:18:52.708228', 'updated_at': '2024-10-04T13:18:52.708228', 'created_by': {'id': 'usr_mp3w0ne3k8cwCh3IdyqQJ', 'email_address': 'andrei@humanloop.com', 'full_name': 'Andrei Bratu', 'platform_access': 'user'}, 'status': 'committed', 'last_used_at': '2024-10-04T13:18:52.708228', 'commit_message': 'Added 20 datapoints from MedQA test dataset.', 'datapoints_count': 20, 'datapoints': None, 'team_id': 'tm_b79syTwUvFjr0T1tmT8wq', 'attributes': None}


## Set up Evaluators

Here we will upload some Evaluators defined in code in `assets/evaluators/` so that Humanloop can manage running these for Evaluations (and later for Monitoring!)

Alternatively you can define AI, Code and Human based Evaluators via the UI - see the relevant `How-to guides` on [Evaluations](https://humanloop.com/docs/v5/evaluation/overview) for creating Evaluators of different kinds.

Further, you can choose not to host the Evaluator on Humanloop and instead run it in your own runtime, posting the results as part of the Evaluation. This can be useful for more complex workflows that require custom dependencies or resources, but lies outside the scope of this tutorial.


In [12]:
def upload_evaluators():
    for evaluator_name, return_type in [
        ("exact_match", "boolean"),
        ("levenshtein", "number"),
    ]:
        with open(f"../../assets/evaluators/{evaluator_name}.py", "r") as f:
            code = f.read()

        requests.post(
            f"{HOST}/v5/evaluators",
            json={
                "path": f"evals_demo/{evaluator_name}",
                "spec": {
                    "evaluator_type": "python",
                    "arguments_type": "target_required",
                    "return_type": return_type,
                    "code": code,
                },
                "commit_message": f"New version from {evaluator_name}.py",
            },
            headers={"Content-Type": "application/json", "X-API-Key": HL_KEY},
        )

In [13]:
upload_evaluators()
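The evaluator files themselves are not reproduced in this notebook. As a rough idea of what such code might contain, here is a sketch of an edit-distance function and an exact-match check. The function signatures, the `log["output"]` field and the `datapoint["target"]["output"]` field are assumptions, so check the files in `assets/evaluators/` for the real definitions.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def exact_match_evaluator(log: dict, datapoint: dict) -> bool:
    """Hypothetical signature: compare the first answer section with the target."""
    chosen = log["output"].split("---")[0].strip()
    return chosen == datapoint["target"]["output"].strip()


demo_log = {"output": "100%\n---\nexplanation\n---\nquoted text"}
demo_datapoint = {"target": {"output": "100%"}}
print(exact_match_evaluator(demo_log, demo_datapoint))  # True
print(levenshtein("kitten", "sitting"))  # 3
```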

## Run Evaluation

Now we can start to trigger Evaluations on Humanloop using our Dataset and Evaluators:


In [14]:
from tqdm import tqdm


# Create the Evaluation specifying the Dataset and Evaluators to use
response = requests.post(
    f"{HOST}/v5/evaluations",
    json={
        "dataset": {"path": "evals_demo/medqa-test"},
        "evaluators": [
            {"path": "evals_demo/exact_match"},
            {"path": "evals_demo/levenshtein"},
        ],
    },
    headers={"X-API-KEY": HL_KEY},
)
evaluation_id = response.json()["id"]
print(f"Evaluation created: {evaluation_id}")


def populate_evaluation():
    """Run a variation of your Pipeline over the Dataset to populate results"""
    datapoints_response = requests.get(
        f"{HOST}/v5/datasets/{dataset_id}?include_datapoints=true",
        headers={"X-API-KEY": HL_KEY},
    ).json()
    for datapoint in tqdm(datapoints_response["datapoints"]):
        ask_question(
            inputs=datapoint["inputs"],
            datapoint_id=datapoint["id"],
            evaluation_id=evaluation_id,
        )

Evaluation created: evr_B7RELNW5C3VoCRuGCPRoh


In [15]:
populate_evaluation()

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
  0%|          | 0/20 [00:00<?, ?it/s]

{'id': '4b1ead72e474486180337bc3bbb68900', 'flow_id': 'fl_MndWfGWogEIMKCDaISagx', 'version_id': 'flv_Nu0O5GMl49i5YxuEJaPlK', 'trace_status': 'incomplete'}


  5%|▌         | 1/20 [00:11<03:38, 11.48s/it]

HTTPConnectionPool(host='0.0.0.0', port=80): Read timed out. (read timeout=2)
{'id': '45599725d68941139496993ba14d0bcd', 'flow_id': 'fl_MndWfGWogEIMKCDaISagx', 'version_id': 'flv_Nu0O5GMl49i5YxuEJaPlK', 'trace_status': 'incomplete'}


 10%|█         | 2/20 [00:22<03:21, 11.20s/it]

HTTPConnectionPool(host='0.0.0.0', port=80): Read timed out. (read timeout=2)
{'id': 'e3c7c753c45c401596bd91a62d162338', 'flow_id': 'fl_MndWfGWogEIMKCDaISagx', 'version_id': 'flv_Nu0O5GMl49i5YxuEJaPlK', 'trace_status': 'incomplete'}


 15%|█▌        | 3/20 [00:34<03:13, 11.36s/it]

HTTPConnectionPool(host='0.0.0.0', port=80): Read timed out. (read timeout=2)
{'id': 'c3ce3094582742618e9f4ea5c5deb158', 'flow_id': 'fl_MndWfGWogEIMKCDaISagx', 'version_id': 'flv_Nu0O5GMl49i5YxuEJaPlK', 'trace_status': 'incomplete'}


 20%|██        | 4/20 [00:43<02:49, 10.56s/it]

HTTPConnectionPool(host='0.0.0.0', port=80): Read timed out. (read timeout=2)
{'id': '4aa1806f2e3646338642555ff986ee3a', 'flow_id': 'fl_MndWfGWogEIMKCDaISagx', 'version_id': 'flv_Nu0O5GMl49i5YxuEJaPlK', 'trace_status': 'incomplete'}


 25%|██▌       | 5/20 [00:54<02:41, 10.78s/it]

HTTPConnectionPool(host='0.0.0.0', port=80): Read timed out. (read timeout=2)
{'id': 'd5b15d2b7f1a420e848971c457c6eaa0', 'flow_id': 'fl_MndWfGWogEIMKCDaISagx', 'version_id': 'flv_Nu0O5GMl49i5YxuEJaPlK', 'trace_status': 'incomplete'}


 30%|███       | 6/20 [01:08<02:43, 11.71s/it]

HTTPConnectionPool(host='0.0.0.0', port=80): Read timed out. (read timeout=2)
{'id': 'd6f076d4a41f43dfb6f729067823dab6', 'flow_id': 'fl_MndWfGWogEIMKCDaISagx', 'version_id': 'flv_Nu0O5GMl49i5YxuEJaPlK', 'trace_status': 'incomplete'}


 35%|███▌      | 7/20 [01:17<02:22, 10.93s/it]

HTTPConnectionPool(host='0.0.0.0', port=80): Read timed out. (read timeout=2)
{'id': 'd4c614a80a694708a804f96520ff076a', 'flow_id': 'fl_MndWfGWogEIMKCDaISagx', 'version_id': 'flv_Nu0O5GMl49i5YxuEJaPlK', 'trace_status': 'incomplete'}


 40%|████      | 8/20 [01:32<02:26, 12.17s/it]

HTTPConnectionPool(host='0.0.0.0', port=80): Read timed out. (read timeout=2)
{'id': 'b0f1a08f4e4a41189d2a7a0f5facd30f', 'flow_id': 'fl_MndWfGWogEIMKCDaISagx', 'version_id': 'flv_Nu0O5GMl49i5YxuEJaPlK', 'trace_status': 'incomplete'}


 45%|████▌     | 9/20 [01:41<02:05, 11.38s/it]

HTTPConnectionPool(host='0.0.0.0', port=80): Read timed out. (read timeout=2)
{'id': '5a09d0c2767f4da487ea41695b289f6a', 'flow_id': 'fl_MndWfGWogEIMKCDaISagx', 'version_id': 'flv_Nu0O5GMl49i5YxuEJaPlK', 'trace_status': 'incomplete'}


 50%|█████     | 10/20 [01:51<01:49, 10.94s/it]

HTTPConnectionPool(host='0.0.0.0', port=80): Read timed out. (read timeout=2)
{'id': '95a59037cb0b48feaec69f9a38e12228', 'flow_id': 'fl_MndWfGWogEIMKCDaISagx', 'version_id': 'flv_Nu0O5GMl49i5YxuEJaPlK', 'trace_status': 'incomplete'}


 55%|█████▌    | 11/20 [02:01<01:35, 10.62s/it]

HTTPConnectionPool(host='0.0.0.0', port=80): Read timed out. (read timeout=2)
{'id': '4ce8d30e24a649f1a144d50d4acd5935', 'flow_id': 'fl_MndWfGWogEIMKCDaISagx', 'version_id': 'flv_Nu0O5GMl49i5YxuEJaPlK', 'trace_status': 'incomplete'}


 60%|██████    | 12/20 [02:12<01:25, 10.66s/it]

HTTPConnectionPool(host='0.0.0.0', port=80): Read timed out. (read timeout=2)
{'id': 'de41e03a15e24b6fb310f886f823e431', 'flow_id': 'fl_MndWfGWogEIMKCDaISagx', 'version_id': 'flv_Nu0O5GMl49i5YxuEJaPlK', 'trace_status': 'incomplete'}


 65%|██████▌   | 13/20 [02:22<01:12, 10.41s/it]

HTTPConnectionPool(host='0.0.0.0', port=80): Read timed out. (read timeout=2)
{'id': '9b627e74b407458da4c65a24b634d39e', 'flow_id': 'fl_MndWfGWogEIMKCDaISagx', 'version_id': 'flv_Nu0O5GMl49i5YxuEJaPlK', 'trace_status': 'incomplete'}


 70%|███████   | 14/20 [02:32<01:01, 10.23s/it]

HTTPConnectionPool(host='0.0.0.0', port=80): Read timed out. (read timeout=2)
{'id': '06fd059892c34245b10b921a2be189ad', 'flow_id': 'fl_MndWfGWogEIMKCDaISagx', 'version_id': 'flv_Nu0O5GMl49i5YxuEJaPlK', 'trace_status': 'incomplete'}


 75%|███████▌  | 15/20 [02:41<00:49,  9.92s/it]

HTTPConnectionPool(host='0.0.0.0', port=80): Read timed out. (read timeout=2)
{'id': '033a05f1f45842e6a591310c84ec8287', 'flow_id': 'fl_MndWfGWogEIMKCDaISagx', 'version_id': 'flv_Nu0O5GMl49i5YxuEJaPlK', 'trace_status': 'incomplete'}


 80%|████████  | 16/20 [02:51<00:39,  9.88s/it]

HTTPConnectionPool(host='0.0.0.0', port=80): Read timed out. (read timeout=2)
{'id': '0563a08c195c43aea85d256ce6ac5920', 'flow_id': 'fl_MndWfGWogEIMKCDaISagx', 'version_id': 'flv_Nu0O5GMl49i5YxuEJaPlK', 'trace_status': 'incomplete'}


 85%|████████▌ | 17/20 [03:02<00:31, 10.37s/it]

HTTPConnectionPool(host='0.0.0.0', port=80): Read timed out. (read timeout=2)
{'id': '15967e5f21e64128a75913edc3b02625', 'flow_id': 'fl_MndWfGWogEIMKCDaISagx', 'version_id': 'flv_Nu0O5GMl49i5YxuEJaPlK', 'trace_status': 'incomplete'}


 90%|█████████ | 18/20 [03:12<00:20, 10.35s/it]

HTTPConnectionPool(host='0.0.0.0', port=80): Read timed out. (read timeout=2)
{'id': 'cb508789c613464b9e52e16b1f0c7ce1', 'flow_id': 'fl_MndWfGWogEIMKCDaISagx', 'version_id': 'flv_Nu0O5GMl49i5YxuEJaPlK', 'trace_status': 'incomplete'}


 95%|█████████▌| 19/20 [03:24<00:10, 10.75s/it]

HTTPConnectionPool(host='0.0.0.0', port=80): Read timed out. (read timeout=2)
{'id': 'd224e7af34b34b2eadc0062bc17b2303', 'flow_id': 'fl_MndWfGWogEIMKCDaISagx', 'version_id': 'flv_Nu0O5GMl49i5YxuEJaPlK', 'trace_status': 'incomplete'}


100%|██████████| 20/20 [03:35<00:00, 10.77s/it]

HTTPConnectionPool(host='0.0.0.0', port=80): Read timed out. (read timeout=2)





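Each Trace must be marked as complete before the Evaluation can run; `ask_question` attempts this with its final `PATCH` request, which sets `trace_status` to `complete`.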


## Get Results and URL

We can now get the aggregate results via the API and the URL to navigate to the Evaluation in the Humanloop UI.


In [16]:
evaluation_response = requests.get(
    f"{HOST}/v5/evaluations/{evaluation_id}",
    headers={"X-API-KEY": HL_KEY},
)
print("URL: ", evaluation_response.url)

URL:  http://0.0.0.0:80/v5/evaluations/evr_B7RELNW5C3VoCRuGCPRoh
