## Step 1: Import modules and packages, download reference docs

This notebook walks through the entire LastMile AI Eval flow:

1. Create an ingestion trace.
2. Generate a query + ground-truth context pair for each node (chunk) in a document.
   - Run each generated query through a traced RAG query flow to capture the context that is actually retrieved.
3. List the query traces to include in a test set (defaults to the last N queries for now).
4. Create a test set from those query traces, storing the ground-truth context associated with each query.
5. Create evaluation metrics based on the ones provided by LlamaIndex.
   - Note: because these metrics come from LlamaIndex, they focus only on retrieval and do not score the LLM outputs (though the outputs are still stored as output events).
6. Create an evaluation set by running these metrics against the test set we just created.

Some notes:
- No manual ID grepping is needed: helper functions take care of it. This probably needs a better design in the future; the focus so far was on getting unblocked.
- `ingestion_trace_id` needs to be refactored to map to the trace level rather than to each RAG query event (right now it doesn't work; that will be added later).
- Some other small convenience functions still need to be added to the API, such as a helper function for `list_evaluation_sets()`.


In [15]:
# Install dependencies
# IMPORTANT: After running this cell, you MUST
# restart kernel for these changes to take effect

# !pip list | grep lastmile

# !pip3 install lastmile-eval #--upgrade --force-reinstall

!pwd

# Hacky way to install the lastmile-eval package locally (editable mode)
!pip3 install -e ../../../../..

!pip3 install llama-index

/Users/rossdancraig/Projects/eval/src/lastmile_eval/examples/rag_debugger/getting_started
Obtaining file:///Users/rossdancraig/Projects/eval
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Building wheels for collected packages: lastmile-eval
  Building editable for lastmile-eval (pyproject.toml) ... done
  Created wheel for lastmile-eval: filename=lastmile_eval-0.0.14-0.editable-py3-none-any.whl size=5150 sha256=8196950b3f8654adb90d152850e36e8768eba95d1e1f1cbd7460f7ff1479884d
  Stored in directory: /private/var/folders/n9/fr1zcc3x3m327h0r11mr5b7c0000gn/T/pip-ephem-wheel-cache-xc0zf5j9/wheels/f5/5c/e6/f8760477828ee734f8b060f518c34939861874bc3ff8be5687
Successfully built lastmile-eval
Installing collected packages: lastmile-eval
  Attempting uninstall: lastmile-eval
    Found 

In [16]:
!pip list | grep lastmile-eval

lastmile-eval                             0.0.14          /Users/rossdancraig/Projects/eval


In [17]:
# The nest_asyncio module enables the nesting of asynchronous functions within an already running async loop.
# This is necessary because Jupyter notebooks inherently operate in an asynchronous loop.
# By applying nest_asyncio, we can run additional async functions within this existing loop without conflicts.
import nest_asyncio

nest_asyncio.apply()

from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    ServiceContext,
)
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.core.evaluation import (
    RetrieverEvaluator,
    generate_question_context_pairs,
)
from llama_index.llms.openai import OpenAI

import pandas as pd

from lastmile_eval.rag.debugger.tracing import get_lastmile_tracer
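
To see why `nest_asyncio` is needed, the sketch below uses only the standard library (no Jupyter, no `nest_asyncio`) to reproduce the underlying restriction: calling `asyncio.run()` while an event loop is already running raises a `RuntimeError`. The function names `inner` and `outer` are hypothetical and exist only for this illustration.

```python
import asyncio


async def inner() -> int:
    # Stand-in for any async call a library might make internally
    return 42


async def outer() -> str:
    # We are already inside a running event loop here, which is the
    # situation a notebook cell is in
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a second, nested loop
    except RuntimeError as err:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(err)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Inside a notebook, `nest_asyncio.apply()` patches the running loop so that nested calls like this are allowed.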


In [18]:
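import os
import dotenv

# You can get your OPENAI_API_KEY from https://platform.openai.com/api-keys
# Both keys are read from a local .env file or the existing environment
dotenv.load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
LASTMILE_API_TOKEN = os.getenv("LASTMILE_API_TOKEN")

# Fail early with a clear message: assigning None to os.environ
# would otherwise raise a TypeError
if not OPENAI_API_KEY or not LASTMILE_API_TOKEN:
    raise ValueError(
        "Set OPENAI_API_KEY and LASTMILE_API_TOKEN in your environment or .env file"
    )

os.environ["LASTMILE_API_TOKEN"] = LASTMILE_API_TOKEN
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY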

In [None]:
!mkdir -p 'data/paul_graham/'
!curl 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -o 'data/paul_graham/paul_graham_essay.txt'

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 75042  100 75042    0     0   269k      0 --:--:-- --:--:-- --:--:--  269k


## Step 2: Run and Trace Ingestion Pipeline

In [19]:
# Instantiate a tracer object
tracer = get_lastmile_tracer("my_cool_tracer")


# You can use the tracer either as a decorator around a function (like below)
# or with the "with ... as span_variable_name:" syntax
@tracer.start_as_current_span("ingestion-root-span")
def run_ingestion_flow():
    documents = SimpleDirectoryReader("./data/paul_graham/").load_data()

    # Register the doc file paths as a parameter
    doc_file_paths = [
        doc.metadata.get("file_path")
        for doc in documents
        if doc.metadata.get("file_path") is not None
    ]
    tracer.register_param("doc_file_paths", str(doc_file_paths))

    with tracer.start_as_current_span(
        "create-document-nodes"
    ) as _node_parser_span:
        # Register chunk_size as a parameter in this
        # trace's parameter set
        chunk_size = 512
        tracer.register_param("chunk_size", chunk_size)

        node_parser = SimpleNodeParser.from_defaults(chunk_size=chunk_size)
        nodes = node_parser.get_nodes_from_documents(documents)

        # Mark a RAG Ingestion trace event
        #   --> For now this only accepts strings and list of strings
        #   --> We can add more specific events (like what you'll see with
        #       the `mark_rag_query_trace_event` method) in the future
        tracer.mark_rag_ingestion_trace_event("Created document nodes!")

    with tracer.start_as_current_span(
        "embed-document-nodes"
    ) as _create_node_span:
        vector_index = VectorStoreIndex(nodes)
        query_engine = vector_index.as_query_engine()
        tracer.mark_rag_ingestion_trace_event("Created embeddings!")

    # We use these variables later in the notebook so need to return them
    # in this function
    return nodes, vector_index, query_engine


2024-04-25 18:06:43,475 - Overriding of current TracerProvider is not allowed
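
The cell above uses the same `start_as_current_span` call both as a decorator and in a `with` block. As a rough illustration of how one object can support both forms, here is a toy stand-in built with the standard library's `contextlib.contextmanager`. It is not the LastMile tracer's implementation; the `span_log` list and the span names are made up for the example.

```python
from contextlib import contextmanager

# Records what happens, in order, so the nesting is easy to inspect
span_log: list[str] = []


@contextmanager
def start_as_current_span(name: str):
    # Toy version: log when the span starts and ends
    span_log.append(f"start {name}")
    try:
        yield name
    finally:
        span_log.append(f"end {name}")


# Form 1: decorator around a function (a span covers each call)
@start_as_current_span("root")
def run() -> None:
    # Form 2: "with ... as span_variable_name:" inside the function
    with start_as_current_span("child") as span_name:
        span_log.append(f"inside {span_name}")


run()
print(span_log)
```

The child span starts and ends inside the root span, which mirrors the parent/child relationship between `ingestion-root-span` and the spans nested within it.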


In [20]:
# Run the ingestion flow and save the trace data
# This saves it in two places:
# 1) The raw trace data, which is saved to Jaeger
# 2) The structured trace data (paramSets, events, etc.),
#    which is saved to our Postgres tables

# Run this cell once to generate an ingestion trace
nodes, vector_index, query_engine = run_ingestion_flow()

2024-04-25 18:06:46,268 - > [SimpleDirectoryReader] Total files added: 1
2024-04-25 18:06:46,269 - open file: /Users/rossdancraig/Projects/eval/src/lastmile_eval/examples/rag_debugger/getting_started/data/paul_graham/paul_graham_essay.txt
2024-04-25 18:06:46,312 - > Adding chunk: What I Worked On

February 2021

Before college...
2024-04-25 18:06:46,313 - > Adding chunk: I was puzzled by the 1401. I couldn't figure ou...
2024-04-25 18:06:46,313 - > Adding chunk: I remember vividly how impressed and envious I ...
2024-04-25 18:06:46,313 - > Adding chunk: All that seemed left for philosophy were edge c...
2024-04-25 18:06:46,314 - > Adding chunk: The commonly used programming languages then we...
2024-04-25 18:06:46,314 - > Adding chunk: At the time this bothered me, but now it seems ...
2024-04-25 18:06:46,315 - > Adding chunk: Its brokenness did, as so often happens, genera...
2024-04-25 18:06:46,315 - > Adding chunk: Any program you wrote today, no matter how good...
2024-04-25 18:06:

In [21]:
# Let's print the trace data from Jaeger to
# show you what it looks like (search for "operationName" in the data)

from lastmile_eval.rag.debugger.tracing import (
    get_latest_ingestion_trace_id,
    get_trace_data,
)

ingestion_trace_id = get_latest_ingestion_trace_id()
get_trace_data(ingestion_trace_id)

2024-04-25 18:06:56,554 - Starting new HTTPS connection (1): lastmileai.dev:443
2024-04-25 18:06:56,646 - https://lastmileai.dev:443 "GET /api/rag_ingestion_traces/list?pageSize=1 HTTP/1.1" 200 560
2024-04-25 18:06:56,651 - Starting new HTTPS connection (1): lastmileai.dev:443
2024-04-25 18:06:56,768 - https://lastmileai.dev:443 "GET /api/trace/read?id=977e7740aded018a8cab036372e89dfd HTTP/1.1" 200 None


{'data': [{'traceID': '977e7740aded018a8cab036372e89dfd',
   'spans': [{'traceID': '977e7740aded018a8cab036372e89dfd',
     'spanID': '7c658349bf74c0f2',
     'operationName': 'ingestion-root-span',
     'references': [],
     'startTime': 1714082806267571,
     'duration': 1819495,
     'tags': [{'key': 'doc_file_paths',
       'type': 'string',
       'value': "['/Users/rossdancraig/Projects/eval/src/lastmile_eval/examples/rag_debugger/getting_started/data/paul_graham/paul_graham_essay.txt']"},
      {'key': 'span.kind', 'type': 'string', 'value': 'internal'},
      {'key': 'internal.span.format', 'type': 'string', 'value': 'otlp'}],
     'logs': [],
     'processID': 'p1',
    {'traceID': '977e7740aded018a8cab036372e89dfd',
     'spanID': '58fdd7f952ecaa1b',
     'operationName': 'create-document-nodes',
     'references': [{'refType': 'CHILD_OF',
       'traceID': '977e7740aded018a8cab036372e89dfd',
       'spanID': '7c658349bf74c0f2'}],
     'startTime': 1714082806270176,
     'du
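
The raw trace output links spans through their `references` entries. A minimal sketch of recovering the parent of each span from that structure is shown below. The `sample_trace` dict is a trimmed copy of the fields visible in the output above (one element of the `data` list), and `span_parents` is a hypothetical helper, not part of `lastmile_eval`.

```python
from typing import Optional

# Trimmed copy of one entry of the "data" list shown above
sample_trace = {
    "traceID": "977e7740aded018a8cab036372e89dfd",
    "spans": [
        {
            "spanID": "7c658349bf74c0f2",
            "operationName": "ingestion-root-span",
            "references": [],
        },
        {
            "spanID": "58fdd7f952ecaa1b",
            "operationName": "create-document-nodes",
            "references": [
                {
                    "refType": "CHILD_OF",
                    "traceID": "977e7740aded018a8cab036372e89dfd",
                    "spanID": "7c658349bf74c0f2",
                }
            ],
        },
    ],
}


def span_parents(trace: dict) -> dict[str, Optional[str]]:
    """Map each span's operationName to its parent's operationName (None for a root)."""
    spans = trace["spans"]
    name_by_id = {span["spanID"]: span["operationName"] for span in spans}
    parents: dict[str, Optional[str]] = {}
    for span in spans:
        parent_ids = [
            ref["spanID"]
            for ref in span["references"]
            if ref["refType"] == "CHILD_OF"
        ]
        parents[span["operationName"]] = (
            name_by_id.get(parent_ids[0]) if parent_ids else None
        )
    return parents


print(span_parents(sample_trace))
```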

In [22]:
# Now let's fetch the trace event data from our Postgres table
# Notice that the `traceId` column matches the raw trace data

from lastmile_eval.rag.debugger.tracing import list_ingestion_trace_events

ingestion_trace_events = list_ingestion_trace_events(take=1)
pd.DataFrame.from_records(ingestion_trace_events["ingestionTraces"]).rename(  # type: ignore[fixme]
    columns={"id": "ragIngestionTraceEventId"}
)

2024-04-25 18:07:05,381 - Starting new HTTPS connection (1): lastmileai.dev:443
2024-04-25 18:07:05,545 - https://lastmileai.dev:443 "GET /api/rag_ingestion_traces/list?pageSize=1 HTTP/1.1" 200 560


Unnamed: 0,ragIngestionTraceEventId,createdAt,updatedAt,paramSet,metadata,traceId,creatorId,projectId,organizationId,visibility,active
0,clvfsnbf000mypbmlmn6bqmlw,2024-04-25T22:06:48.205Z,2024-04-25T22:06:48.205Z,"{'chunk_size': 512, 'doc_file_paths': '['/User...",,977e7740aded018a8cab036372e89dfd,clp1m7n3l0062qpqnd4nyabbl,,,MEMBER,True


## Step 3: Run and Trace Query Pipeline

In [24]:
import openai
from lastmile_eval.rag.debugger.api import (
    QueryReceived,
    ContextRetrieved,
    PromptResolved,
    LLMOutputReceived,
)

LLM_NAME = "gpt-4"

# Note: normally you can just call `query_engine.query(user_query)`,
# but that abstracts away a lot of the steps, so we do
# each step manually to showcase how to use the tracer
PROMPT_TEMPLATE = """
Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge, answer the query.
Query: {query_str}
Answer:
"""


@tracer.start_as_current_span("query-root-span")
def run_query_flow(user_query: str, ingestion_trace_id: str):
    tracer.mark_rag_query_trace_event(
        QueryReceived(query=user_query), ingestion_trace_id
    )

    with tracer.start_as_current_span(
        "retrieve-context"
    ) as _retrieve_context_span:
        similarity_top_k = 5
        tracer.register_param("similarity_top_k", similarity_top_k)

        retriever = vector_index.as_retriever(
            similarity_top_k=similarity_top_k
        )
        retrieved_nodes = retriever.retrieve(user_query)
        retrieved_contexts = [node.get_text() for node in retrieved_nodes]

        retrieved_node_ids = [node.id_ for node in retrieved_nodes]
        tracer.register_param("retrieved_node_ids", retrieved_node_ids)

        tracer.mark_rag_query_trace_event(
            ContextRetrieved(context=retrieved_contexts), ingestion_trace_id
        )

    with tracer.start_as_current_span("resolve-prompt") as _resolve_prompt_span:
        resolved_prompt = PROMPT_TEMPLATE.replace(
            "{context_str}", "\n\n\n".join(retrieved_contexts)
        ).replace("{query_str}", user_query)
        tracer.mark_rag_query_trace_event(
            PromptResolved(fully_resolved_prompt=resolved_prompt),
            ingestion_trace_id,
        )

    with tracer.start_as_current_span("call-llm") as _llm_span:
        openai_client = openai.Client(api_key=os.getenv("OPENAI_API_KEY"))
        response = openai_client.chat.completions.create(
            model=LLM_NAME,
            messages=[{"role": "user", "content": resolved_prompt}],
        )
        # `content` can be None, so fall back to an empty string
        output: str = response.choices[0].message.content or ""
        tracer.mark_rag_query_trace_event(
            LLMOutputReceived(llm_output=output), ingestion_trace_id
        )
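
The prompt-resolution step can be tried in isolation, without the tracer or an LLM call. The sketch below reuses the template from the cell above with made-up context strings; `resolve_prompt` is a hypothetical helper that wraps the same chained `.replace()` calls.

```python
PROMPT_TEMPLATE = """
Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge, answer the query.
Query: {query_str}
Answer:
"""


def resolve_prompt(template: str, contexts: list[str], query: str) -> str:
    # Join the retrieved chunks with blank lines, then fill both placeholders
    return template.replace(
        "{context_str}", "\n\n\n".join(contexts)
    ).replace("{query_str}", query)


# Made-up context chunks, for illustration only
resolved = resolve_prompt(
    PROMPT_TEMPLATE,
    ["First chunk.", "Second chunk."],
    "What did the author do growing up?",
)
print(resolved)
```

Because the replacements are chained, a retrieved chunk that itself contains the literal text `{query_str}` would also be substituted; this is worth keeping in mind if your documents contain template-like text.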


In [25]:
# TODO: Right now the ingestion_trace_id passed to mark_rag_query_trace_event
# is a no-op due to changes in assumptions; this will be fixed later
run_query_flow("What did the author do growing up?", ingestion_trace_id)

2024-04-25 18:07:49,609 - Request options: {'method': 'post', 'url': '/embeddings', 'files': None, 'post_parser': <function Embeddings.create.<locals>.parser at 0x16f60c400>, 'json_data': {'input': ['What did the author do growing up?'], 'model': 'text-embedding-ada-002', 'encoding_format': 'base64'}}
2024-04-25 18:07:49,610 - close.started
2024-04-25 18:07:49,610 - close.complete
2024-04-25 18:07:49,611 - connect_tcp.started host='api.openai.com' port=443 local_address=None timeout=60.0 socket_options=None
2024-04-25 18:07:49,624 - connect_tcp.complete return_value=<httpcore._backends.sync.SyncStream object at 0x34283f2d0>
2024-04-25 18:07:49,625 - start_tls.started ssl_context=<ssl.SSLContext object at 0x33fd46a80> server_hostname='api.openai.com' timeout=60.0
2024-04-25 18:07:49,639 - start_tls.complete return_value=<httpcore._backends.sync.SyncStream object at 0x3411fcc90>
2024-04-25 18:07:49,639 - send_request_headers.started request=<Request [b'POST']>
2024-04-25 18:07:49,639 - s

In [26]:
# Just like we did with the ingestion trace, let's print out what this
# looks like in the Postgres data, as well as the raw trace data again
from lastmile_eval.rag.debugger.tracing import (
    list_query_trace_events,
)

query_trace_events = list_query_trace_events(take=1)
query_trace_events_df = pd.DataFrame.from_records(query_trace_events["queryTraces"]).rename(  # type: ignore[fixme]
    columns={"id": "ragQueryTraceEventsId"}
)
query_trace_events_df



2024-04-25 18:08:02,008 - Starting new HTTPS connection (1): lastmileai.dev:443
2024-04-25 18:08:02,121 - https://lastmileai.dev:443 "GET /api/rag_query_traces/list?pageSize=1 HTTP/1.1" 200 None


Unnamed: 0,ragQueryTraceEventsId,createdAt,updatedAt,paramSet,query,context,fullyResolvedPrompt,output,metadata,traceId,ragIngestionTraceId,creatorId,projectId,organizationId,visibility,active,ragIngestionTrace
0,clvfsoqb500mzpbmlksxehtsu,2024-04-25T22:07:54.161Z,2024-04-25T22:07:54.161Z,"{'similarity_top_k': 5, 'retrieved_node_ids': ...","{""query"":""What did the author do growing up?""}","{""context"":[""This was now only weeks away. My ...","{""fully_resolved_prompt"":""\nContext informatio...","{""llm_output"":""The author grew up working on w...",,418f91f146e7e9c1eda0fb34f3bdc764,,clp1m7n3l0062qpqnd4nyabbl,,,MEMBER,True,
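
As the table shows, event columns such as `query` hold JSON-encoded strings rather than plain text. A minimal sketch of unwrapping one such cell with the standard library follows; the `row` dict stands in for a single dataframe row, and `unwrap_event` is a hypothetical helper.

```python
import json

# Stand-in for one row of the dataframe above (only the `query` column)
row = {"query": '{"query":"What did the author do growing up?"}'}


def unwrap_event(cell: str, key: str) -> object:
    # Each event cell is a JSON object with a single payload key
    return json.loads(cell)[key]


user_query = unwrap_event(row["query"], "query")
print(user_query)
```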


In [27]:
# This is what the trace data looks like
from lastmile_eval.rag.debugger.tracing import (
    get_trace_data,
)

query_trace_id = query_trace_events_df.iloc[0]["traceId"]
get_trace_data(query_trace_id)




2024-04-25 18:08:06,027 - Starting new HTTPS connection (1): lastmileai.dev:443
2024-04-25 18:08:06,155 - https://lastmileai.dev:443 "GET /api/trace/read?id=418f91f146e7e9c1eda0fb34f3bdc764 HTTP/1.1" 200 None


{'data': [{'traceID': '418f91f146e7e9c1eda0fb34f3bdc764',
   'spans': [{'traceID': '418f91f146e7e9c1eda0fb34f3bdc764',
     'spanID': '966a71f1a87be0f6',
     'operationName': 'query-root-span',
     'references': [],
     'startTime': 1714082869608619,
     'duration': 4421682,
     'tags': [{'key': 'span.kind', 'type': 'string', 'value': 'internal'},
      {'key': 'internal.span.format', 'type': 'string', 'value': 'otlp'}],
     'logs': [{'timestamp': 1714082869608685,
       'fields': [{'key': 'event', 'type': 'string', 'value': 'QueryReceived'},
        {'key': 'indexing_trace_id',
         'type': 'string',
         'value': '977e7740aded018a8cab036372e89dfd'},
        {'key': 'rag_query_event',
         'type': 'string',
         'value': '{"query":"What did the author do growing up?"}'}]}],
     'processID': 'p1',
    {'traceID': '418f91f146e7e9c1eda0fb34f3bdc764',
     'spanID': '9ab89315e0454e33',
     'operationName': 'retrieve-context',
     'references': [{'refType': 'CHILD

## Step 4: Create Test Sets and Run Evaluators

In [28]:
# NOTE: Running this cell on all the nodes will take a while (probably 5-10 minutes), so please be patient

# Set this to a lower value if you want to run faster.
# If this is None, the override is ignored and every node is used
# (see total_queries_per_batch in the next cell).
total_queries_to_run_override = 5  # or None


# We're now going to artificially generate a bunch of query + context
# (ground truth) pairs. We will then run the `run_query_flow()` method on these
# generated queries later

# Define an LLM
llm = OpenAI(model=LLM_NAME)


# This method `generate_question_context_pairs()` essentially
# calls an LLM to generate questions for us. See this URL for more details:
# https://github.com/run-llama/llama_index/blob/8b373239396134a92c9277b36aa7023c633c018a/llama-index-finetuning/llama_index/finetuning/embeddings/common.py#L49-L64
num_questions_per_chunk = 1
qa_dataset = generate_question_context_pairs(
    nodes[0:total_queries_to_run_override or len(nodes)],
    llm=llm,
    num_questions_per_chunk=num_questions_per_chunk
)

  0%|          | 0/5 [00:00<?, ?it/s]2024-04-25 18:08:10,250 - load_ssl_context verify=True cert=None trust_env=True http2=False
2024-04-25 18:08:10,251 - load_verify_locations cafile='/Users/rossdancraig/.pyenv/versions/3.11.6/lib/python3.11/site-packages/certifi/cacert.pem'
2024-04-25 18:08:10,259 - Request options: {'method': 'post', 'url': '/chat/completions', 'files': None, 'json_data': {'messages': [{'role': 'user', 'content': 'Context information is below.\n\n---------------------\nWhat I Worked On\n\nFebruary 2021\n\nBefore college the two main things I worked on, outside of school, were writing and programming. I didn\'t write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.\n\nThe first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processin

In [29]:
# Run these queries through the `run_query_flow()` method

total_queries_per_batch = len(qa_dataset.queries)
total_queries_to_run = min(total_queries_to_run_override or total_queries_per_batch, total_queries_per_batch)

expected_node_ids: list[str] = []
for i, (query_id, query) in enumerate(qa_dataset.queries.items()):
    run_query_flow(query, ingestion_trace_id)
    associated_node_id_for_query = qa_dataset.relevant_docs[query_id]
    expected_node_ids.append(associated_node_id_for_query[0])

    print(f"Finished running {i+1}/{total_queries_to_run} queries...")
    if i + 1 == total_queries_to_run:
        break

# Have to reverse because the `list_query_trace_events()` method
# returns the most recent trace events first
expected_node_ids.reverse()

2024-04-25 18:10:03,415 - Request options: {'method': 'post', 'url': '/embeddings', 'files': None, 'post_parser': <function Embeddings.create.<locals>.parser at 0x34443fce0>, 'json_data': {'input': ["Describe the author's early experiences with programming, including the type of computer and programming language used, the challenges faced, and how these experiences influenced his understanding of programming."], 'model': 'text-embedding-ada-002', 'encoding_format': 'base64'}}
2024-04-25 18:10:03,417 - close.started
2024-04-25 18:10:03,418 - close.complete
2024-04-25 18:10:03,418 - connect_tcp.started host='api.openai.com' port=443 local_address=None timeout=60.0 socket_options=None
2024-04-25 18:10:03,511 - connect_tcp.complete return_value=<httpcore._backends.sync.SyncStream object at 0x16f610210>
2024-04-25 18:10:03,512 - start_tls.started ssl_context=<ssl.SSLContext object at 0x33fd46a80> server_hostname='api.openai.com' timeout=60.0
2024-04-25 18:10:03,526 - start_tls.complete retu

Finished running 1/5 queries...


2024-04-25 18:10:33,953 - https://lastmileai.dev:443 "POST /api/trace/create HTTP/1.1" 200 10
2024-04-25 18:10:33,954 - load_ssl_context verify=True cert=None trust_env=True http2=False
2024-04-25 18:10:33,955 - load_verify_locations cafile='/Users/rossdancraig/.pyenv/versions/3.11.6/lib/python3.11/site-packages/certifi/cacert.pem'
2024-04-25 18:10:33,961 - Request options: {'method': 'post', 'url': '/chat/completions', 'files': None, 'json_data': {'messages': [{'role': 'user', 'content': '\nContext information is below.\n---------------------\nI remember vividly how impressed and envious I felt watching him sitting in front of it, typing programs right into the computer.\n\nComputers were expensive in those days and it took me years of nagging before I convinced my father to buy one, a TRS-80, in about 1980. The gold standard then was the Apple II, but a TRS-80 was good enough. This was when I really started programming. I wrote simple games, a program to predict how high my model roc

Finished running 2/5 queries...


2024-04-25 18:10:58,340 - https://lastmileai.dev:443 "POST /api/trace/create HTTP/1.1" 200 10
2024-04-25 18:10:58,397 - https://lastmileai.dev:443 "POST /api/trace/create HTTP/1.1" 200 10
2024-04-25 18:10:58,398 - load_ssl_context verify=True cert=None trust_env=True http2=False
2024-04-25 18:10:58,398 - load_verify_locations cafile='/Users/rossdancraig/.pyenv/versions/3.11.6/lib/python3.11/site-packages/certifi/cacert.pem'
2024-04-25 18:10:58,406 - Request options: {'method': 'post', 'url': '/chat/completions', 'files': None, 'json_data': {'messages': [{'role': 'user', 'content': '\nContext information is below.\n---------------------\nAll that seemed left for philosophy were edge cases that people in other fields felt could safely be ignored.\n\nI couldn\'t have put this into words when I was 18. All I knew at the time was that I kept taking philosophy courses and they kept being boring. So I decided to switch to AI.\n\nAI was in the air in the mid 1980s, but there were two things es

Finished running 3/5 queries...


2024-04-25 18:11:15,413 - https://lastmileai.dev:443 "POST /api/trace/create HTTP/1.1" 200 10
2024-04-25 18:11:15,444 - https://lastmileai.dev:443 "POST /api/trace/create HTTP/1.1" 200 10
2024-04-25 18:11:15,445 - load_ssl_context verify=True cert=None trust_env=True http2=False
2024-04-25 18:11:15,446 - load_verify_locations cafile='/Users/rossdancraig/.pyenv/versions/3.11.6/lib/python3.11/site-packages/certifi/cacert.pem'
2024-04-25 18:11:15,453 - Request options: {'method': 'post', 'url': '/chat/completions', 'files': None, 'json_data': {'messages': [{'role': 'user', 'content': '\nContext information is below.\n---------------------\nAll that seemed left for philosophy were edge cases that people in other fields felt could safely be ignored.\n\nI couldn\'t have put this into words when I was 18. All I knew at the time was that I kept taking philosophy courses and they kept being boring. So I decided to switch to AI.\n\nAI was in the air in the mid 1980s, but there were two things es

Finished running 4/5 queries...


2024-04-25 18:11:38,713 - receive_response_headers.complete return_value=(b'HTTP/1.1', 200, b'OK', [(b'Date', b'Thu, 25 Apr 2024 22:11:38 GMT'), (b'Content-Type', b'application/json'), (b'Transfer-Encoding', b'chunked'), (b'Connection', b'keep-alive'), (b'access-control-allow-origin', b'*'), (b'openai-model', b'text-embedding-ada-002'), (b'openai-organization', b'lastmile-ai'), (b'openai-processing-ms', b'20'), (b'openai-version', b'2020-10-01'), (b'strict-transport-security', b'max-age=15724800; includeSubDomains'), (b'x-ratelimit-limit-requests', b'10000'), (b'x-ratelimit-limit-tokens', b'10000000'), (b'x-ratelimit-remaining-requests', b'9999'), (b'x-ratelimit-remaining-tokens', b'9999942'), (b'x-ratelimit-reset-requests', b'6ms'), (b'x-ratelimit-reset-tokens', b'0s'), (b'x-request-id', b'req_f6d2f9bab44e534fc2eb2b63a4a9b934'), (b'CF-Cache-Status', b'DYNAMIC'), (b'Server', b'cloudflare'), (b'CF-RAY', b'87a1ab860ee11865-EWR'), (b'Content-Encoding', b'gzip'), (b'alt-svc', b'h3=":443"; 

Finished running 5/5 queries...
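
The `reverse()` call at the end of the cell matters because ground truth is later attached to the listed traces by position. The sketch below shows the alignment with made-up trace and node IDs, assuming (as the comment in the cell states) that the listing returns the most recent traces first.

```python
# Hypothetical (trace_id, expected_node_id) pairs, in the order the
# queries were run (oldest first)
run_order = [("trace-1", "node-a"), ("trace-2", "node-b"), ("trace-3", "node-c")]

# Collected during the loop, so also oldest first
expected_node_ids = [node_id for _, node_id in run_order]

# Simulate a listing call that returns the newest traces first
listed_trace_ids = [trace_id for trace_id, _ in reversed(run_order)]

# Without this reverse, "trace-3" would be paired with "node-a"
expected_node_ids.reverse()

aligned = dict(zip(listed_trace_ids, expected_node_ids))
print(aligned)
```

This positional pairing only holds if no other query traces are created between running the queries and listing them.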


In [33]:
from lastmile_eval.rag.debugger.tracing import list_query_trace_events

query_trace_events = list_query_trace_events(take=total_queries_to_run)
query_trace_events_df = pd.DataFrame.from_records(query_trace_events["queryTraces"]).rename(  # type: ignore[fixme]
    columns={"id": "ragQueryTraceId"}
)
query_trace_events_df

2024-04-25 18:13:07,008 - Starting new HTTPS connection (1): lastmileai.dev:443
2024-04-25 18:13:07,175 - https://lastmileai.dev:443 "GET /api/rag_query_traces/list?pageSize=5 HTTP/1.1" 200 None


Unnamed: 0,ragQueryTraceId,createdAt,updatedAt,paramSet,query,context,fullyResolvedPrompt,output,metadata,traceId,ragIngestionTraceId,creatorId,projectId,organizationId,visibility,active,ragIngestionTrace
0,clvfsttpk00u0qunwz1u77mxn,2024-04-25T22:11:51.848Z,2024-04-25T22:11:51.848Z,"{'similarity_top_k': 5, 'retrieved_node_ids': ...","{""query"":""Based on the author's experience and...","{""context"":[""At the time this bothered me, but...","{""fully_resolved_prompt"":""\nContext informatio...","{""llm_output"":""The author's interest in AI beg...",,5be1fdebda4574e442328614d3666f38,,clp1m7n3l0062qpqnd4nyabbl,,,MEMBER,True,
1,clvfstjcd00tyqunwzddu0ajo,2024-04-25T22:11:38.413Z,2024-04-25T22:11:38.413Z,"{'similarity_top_k': 5, 'retrieved_node_ids': ...","{""query"":""Discuss the author's journey from st...","{""context"":[""All that seemed left for philosop...","{""fully_resolved_prompt"":""\nContext informatio...","{""llm_output"":""The author initially planned to...",,c076246cd6e431eba787b5715902de67,,clp1m7n3l0062qpqnd4nyabbl,,,MEMBER,True,
2,clvfst1en0035qpi8fdgzozll,2024-04-25T22:11:15.168Z,2024-04-25T22:11:15.168Z,"{'similarity_top_k': 5, 'retrieved_node_ids': ...","{""query"":""Discuss the author's initial interes...","{""context"":[""All that seemed left for philosop...","{""fully_resolved_prompt"":""\nContext informatio...","{""llm_output"":""The author initially intended t...",,ea376d7689a2614ff1aac08fad1e6c2,,clp1m7n3l0062qpqnd4nyabbl,,,MEMBER,True,
3,clvfsso7v00txqunw2hqcmsmr,2024-04-25T22:10:58.075Z,2024-04-25T22:10:58.075Z,"{'similarity_top_k': 5, 'retrieved_node_ids': ...","{""query"":""Discuss the transition from using th...","{""context"":[""I remember vividly how impressed ...","{""fully_resolved_prompt"":""\nContext informatio...","{""llm_output"":""The writer first began programm...",,ae4803ca1dd01eb1fc6f492b701f6a22,,clp1m7n3l0062qpqnd4nyabbl,,,MEMBER,True,
4,clvfss5c700bhpe3kqs55g7oo,2024-04-25T22:10:33.608Z,2024-04-25T22:10:33.608Z,"{'similarity_top_k': 5, 'retrieved_node_ids': ...","{""query"":""Describe the author's early experien...","{""context"":[""I remember vividly how impressed ...","{""fully_resolved_prompt"":""\nContext informatio...","{""llm_output"":""The author's early experiences ...",,394347d69b08229b29bf92a1411279f0,,clp1m7n3l0062qpqnd4nyabbl,,,MEMBER,True,


## Step 5: Run Evaluators from Query Trace Event Data

This step creates an evaluation set directly with `evaluate_rag_outputs()`, without creating intermediate test cases and test sets. All you need is a dataframe whose rows are your query trace events.

In [None]:
from lastmile_eval.rag.debugger.api.evaluation import evaluate_rag_outputs
from lastmile_eval.rag.debugger.tracing import (
    get_query_trace_event,
)

from lastmile_eval.text.metrics import calculate_rouge1_score
from llama_index.core.evaluation import HitRate, MRR


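def rouge1(df: pd.DataFrame):
    # NOTE: `groundTruth` holds expected node IDs in this notebook (it is set
    # from `expected_node_ids` below), so this score mainly demonstrates a
    # dataframe-level evaluator
    return [
        # Workaround: some error occurs for 0 values, so add a small offset
        0.01 + x
        for x in calculate_rouge1_score(
            df["output"].tolist(), df["groundTruth"].tolist()
        )
    ]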

def extract_data_to_evaluate(
    row: pd.Series,
) -> tuple[list[str], list[str]]:
    trace_query_id: str = row["ragQueryTraceId"]
    trace_query_data = get_query_trace_event(trace_query_id)
    retrieved_node_ids = trace_query_data["paramSet"]["retrieved_node_ids"]
    expected_node_ids: list[str] = [row["groundTruth"]]
    return (retrieved_node_ids, expected_node_ids)


def compute_eval_score(
    retrieved_and_expected_node_ids_tuple: tuple[list[str], list[str]],
    evaluator: HitRate | MRR,
) -> float:
    retrieved_node_ids, expected_node_ids = (
        retrieved_and_expected_node_ids_tuple
    )
    return evaluator.compute(
        retrieved_ids=retrieved_node_ids, expected_ids=expected_node_ids
    ).score

# Example using a row-level function on the dataframe
def compute_mrr(df: pd.DataFrame):
    """
    We are demonstrating methods that are applied across a row instead of
    entire dataframe, such as the MRR and Hit Rate metrics from the 
    llama_index.core.evaluation package. In order to do this, we define a
    method at the row level where we:
    
    1. Extract the data to evaluate from the row
    2. Run the evaluators on this extracted data
    
    After that's done, we pass this row-level method to df.apply()
    """
    def evaluate_using_row_method(row: pd.Series) -> float:
        node_id_tuple = extract_data_to_evaluate(row)
        return compute_eval_score(node_id_tuple, MRR())
    
    return df.apply(evaluate_using_row_method, axis=1)

def compute_hit_rate(df: pd.DataFrame):
    """
    Another row-function example with hit_rate
    """
    def evaluate_using_row_method(row: pd.Series) -> float:
        node_id_tuple = extract_data_to_evaluate(row)
        return compute_eval_score(node_id_tuple, HitRate())
    
    return df.apply(evaluate_using_row_method, axis=1)
    
trace_level_evaluators = {
    "rouge1": rouge1,
    "mrr": compute_mrr,
    "hit_rate": compute_hit_rate,
}

# We must add groundTruth to the dataframe
query_trace_events_df["groundTruth"] = expected_node_ids

eval_result = evaluate_rag_outputs(
    project_id="can be anything for now",
    trace_level_evaluators=trace_level_evaluators,
    dataset_level_evaluators={},
    df=query_trace_events_df,
    lastmile_api_token=LASTMILE_API_TOKEN,
    evaluation_set_name="Cool new evaluation set name"
)

# Print out the result
eval_result
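
For intuition about what the two retrieval metrics measure, here is a minimal pure-Python sketch for a single query, using the common textbook definitions. It is not `llama_index`'s implementation, which may differ in details; the function names and node IDs are made up for the example.

```python
def hit_rate(retrieved_ids: list[str], expected_ids: list[str]) -> float:
    # 1.0 if any expected node appears among the retrieved nodes, else 0.0
    return 1.0 if any(node_id in expected_ids for node_id in retrieved_ids) else 0.0


def reciprocal_rank(retrieved_ids: list[str], expected_ids: list[str]) -> float:
    # 1 / (rank of the first relevant node), with ranks starting at 1;
    # 0.0 if no relevant node was retrieved
    for rank, node_id in enumerate(retrieved_ids, start=1):
        if node_id in expected_ids:
            return 1.0 / rank
    return 0.0


# Hypothetical node IDs: the expected node is retrieved in second position
retrieved = ["node-x", "node-b", "node-y"]
print(hit_rate(retrieved, ["node-b"]))
print(reciprocal_rank(retrieved, ["node-b"]))
```

Averaging the reciprocal rank over all queries gives the mean reciprocal rank (MRR); averaging the per-query hits gives the overall hit rate.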

## Step 6: Run Evaluators by first creating intermediate test cases and test sets

This shows how to manually create a test set.

In [34]:
from lastmile_eval.rag.debugger.api import (
    create_test_set_from_rag_query_traces,
)

create_test_set_from_rag_query_traces(
    query_trace_events_df,
    test_set_name="Retrieval Eval Test Set",
    lastmile_api_token=LASTMILE_API_TOKEN,
    ground_truth=expected_node_ids,
)

2024-04-25 18:13:11,477 - Starting new HTTPS connection (1): lastmileai.dev:443
2024-04-25 18:13:11,668 - https://lastmileai.dev:443 "POST /api/evaluation_test_sets/create HTTP/1.1" 200 295


CreateTestSetsResult(success=True, message='{"id":"clvfsvja200n1pbmlnq59a74y","createdAt":"2024-04-25T22:13:11.641Z","updatedAt":"2024-04-25T22:13:11.641Z","name":"Retrieval Eval Test Set","description":null,"creatorId":"clp1m7n3l0062qpqnd4nyabbl","projectId":null,"organizationId":null,"visibility":"MEMBER","active":true,"metadata":null}', ids=['clvfsvja200n1pbmlnq59a74y'])

In [35]:
from lastmile_eval.rag.debugger.api import (
    get_latest_test_set_id,
)

test_set_id = get_latest_test_set_id()

2024-04-25 18:13:13,926 - Starting new HTTPS connection (1): lastmileai.dev:443
2024-04-25 18:13:14,019 - https://lastmileai.dev:443 "GET /api/evaluation_test_sets/list?pageSize=1 HTTP/1.1" 200 372


In [36]:
from lastmile_eval.rag.debugger.api import download_test_set
test_set_df = download_test_set(test_set_id)
test_set_df

2024-04-25 18:13:16,129 - Starting new HTTPS connection (1): lastmileai.dev:443
2024-04-25 18:13:16,313 - https://lastmileai.dev:443 "GET /api/evaluation_test_cases/list HTTP/1.1" 200 None


{'evaluationTestCases': [{'id': 'clvfsvjad00n2pbml9q6k9n9p', 'createdAt': '2024-04-25T22:13:11.641Z', 'updatedAt': '2024-04-25T22:13:11.641Z', 'query': '{"query":"Based on the author\'s experience and observations during his undergraduate and graduate studies, how did his perception of Artificial Intelligence evolve and what led him to conclude that the AI practiced at the time was a hoax?"}', 'context': '{"context":["At the time this bothered me, but now it seems amusingly accurate, for reasons I was about to discover.\\n\\nI applied to 3 grad schools: MIT and Yale, which were renowned for AI at the time, and Harvard, which I\'d visited because Rich Draves went there, and was also home to Bill Woods, who\'d invented the type of parser I used in my SHRDLU clone. Only Harvard accepted me, so that was where I went.\\n\\nI don\'t remember the moment it happened, or if there even was a specific moment, but during the first year of grad school I realized that AI, as practiced at the time, w

Unnamed: 0,index,testCaseId,createdAt,updatedAt,query,context,fullyResolvedPrompt,output,groundTruth,metadata,ragQueryTraceId,testSetId,ragQueryTrace
0,0,clvfsvjad00n2pbml9q6k9n9p,2024-04-25T22:13:11.641Z,2024-04-25T22:13:11.641Z,"{""query"":""Based on the author's experience and...","{""context"":[""At the time this bothered me, but...","{""fully_resolved_prompt"":""\nContext informatio...","{""llm_output"":""The author's interest in AI beg...",cd886ed4-00b9-4803-a9a3-8d55dfb5432e,,clvfsttpk00u0qunwz1u77mxn,clvfsvja200n1pbmlnq59a74y,"{'id': 'clvfsttpk00u0qunwz1u77mxn', 'createdAt..."
1,1,clvfsvjad00n3pbmljqp28gai,2024-04-25T22:13:11.641Z,2024-04-25T22:13:11.641Z,"{""query"":""Discuss the author's journey from st...","{""context"":[""All that seemed left for philosop...","{""fully_resolved_prompt"":""\nContext informatio...","{""llm_output"":""The author initially planned to...",530867c8-9c07-44b7-9773-80bc5e57ff47,,clvfstjcd00tyqunwzddu0ajo,clvfsvja200n1pbmlnq59a74y,"{'id': 'clvfstjcd00tyqunwzddu0ajo', 'createdAt..."
2,2,clvfsvjad00n4pbml5hjnrzcp,2024-04-25T22:13:11.641Z,2024-04-25T22:13:11.641Z,"{""query"":""Discuss the author's initial interes...","{""context"":[""All that seemed left for philosop...","{""fully_resolved_prompt"":""\nContext informatio...","{""llm_output"":""The author initially intended t...",71f9d4ff-c517-40f8-b54d-f5eb49b97301,,clvfst1en0035qpi8fdgzozll,clvfsvja200n1pbmlnq59a74y,"{'id': 'clvfst1en0035qpi8fdgzozll', 'createdAt..."
3,3,clvfsvjae00n5pbmlgyuqte0s,2024-04-25T22:13:11.641Z,2024-04-25T22:13:11.641Z,"{""query"":""Discuss the transition from using th...","{""context"":[""I remember vividly how impressed ...","{""fully_resolved_prompt"":""\nContext informatio...","{""llm_output"":""The writer first began programm...",9318fbc7-13c0-48cb-9c0d-cad45bbea0ed,,clvfsso7v00txqunw2hqcmsmr,clvfsvja200n1pbmlnq59a74y,"{'id': 'clvfsso7v00txqunw2hqcmsmr', 'createdAt..."
4,4,clvfsvjae00n6pbmls1b6yc5a,2024-04-25T22:13:11.641Z,2024-04-25T22:13:11.641Z,"{""query"":""Describe the author's early experien...","{""context"":[""I remember vividly how impressed ...","{""fully_resolved_prompt"":""\nContext informatio...","{""llm_output"":""The author's early experiences ...",5ada222a-6228-4ca9-8b4e-7e6180d5d461,,clvfss5c700bhpe3kqs55g7oo,clvfsvja200n1pbmlnq59a74y,"{'id': 'clvfss5c700bhpe3kqs55g7oo', 'createdAt..."
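As the output shows, the `query`, `context`, `fullyResolvedPrompt`, and `output` columns hold JSON-encoded strings rather than plain text. Below is a minimal sketch of decoding such a cell with the standard library. The sample strings are shortened, hypothetical stand-ins for the real values, and `decode_cell` is an illustrative helper name, not part of lastmile-eval.

```python
import json
from typing import Any


def decode_cell(cell: str, key: str) -> Any:
    """Parse a JSON-encoded cell and return the value stored under `key`."""
    return json.loads(cell)[key]


# Hypothetical, shortened cells that mimic the encoding shown in the output
query_cell = '{"query":"Describe the author\'s early experiences with programming."}'
context_cell = '{"context":["First retrieved chunk.","Second retrieved chunk."]}'

print(decode_cell(query_cell, "query"))
print(len(decode_cell(context_cell, "context")))  # 2
```

With the real DataFrame, this could be applied column-wise, for example `test_set_df["query"].map(lambda c: decode_cell(c, "query"))`, assuming every row follows the same encoding.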


In [37]:
from lastmile_eval.rag.debugger.tracing import (
    get_query_trace_event,
)

# Define some out of the box retrieval evaluators
# TODO: Set up some evaluators that also measure outputs too
"""
Hit Rate:
Hit rate calculates the fraction of queries where the correct answer is found
within the top-k retrieved documents. In simpler terms, it’s about how often
our system gets it right within the top few guesses.

Mean Reciprocal Rank (MRR):
For each query, MRR evaluates the system’s accuracy by looking at the rank of
the highest-placed relevant document. Specifically, it’s the average of the
reciprocals of these ranks across all the queries. So, if the first relevant
document is the top result, the reciprocal rank is 1; if it’s second, the
reciprocal rank is 1/2, and so on.
"""
from llama_index.core.evaluation import HitRate, MRR
hit_rate_evaluator = HitRate()
mrr_evaluator = MRR()
metric_evaluators = [hit_rate_evaluator, mrr_evaluator]
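To make the two definitions above concrete, here is a minimal pure-Python sketch of the per-query hit rate and reciprocal rank. It only illustrates the definitions and is not LlamaIndex's implementation; the function names `hit_rate` and `reciprocal_rank` and the node IDs are hypothetical.

```python
def hit_rate(retrieved_ids: list[str], expected_ids: list[str]) -> float:
    """Return 1.0 if any expected ID appears among the retrieved IDs, else 0.0."""
    expected = set(expected_ids)
    return 1.0 if any(node_id in expected for node_id in retrieved_ids) else 0.0


def reciprocal_rank(retrieved_ids: list[str], expected_ids: list[str]) -> float:
    """Return 1/rank of the first relevant retrieved ID, or 0.0 if none is relevant."""
    expected = set(expected_ids)
    for rank, node_id in enumerate(retrieved_ids, start=1):
        if node_id in expected:
            return 1.0 / rank
    return 0.0


# Hypothetical node IDs: the expected node is the third of five results
retrieved = ["node-a", "node-b", "node-c", "node-d", "node-e"]
expected = ["node-c"]

print(hit_rate(retrieved, expected))         # 1.0
print(reciprocal_rank(retrieved, expected))  # 0.3333333333333333
```

Averaging the reciprocal rank over all queries gives the MRR; averaging the per-query hit values gives the overall hit rate.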

In [41]:
# Compute the retrieval metrics manually, one test case at a time
data = []

def retrieved_correct_context_node(test_set_df):
    """Score each test case's retrieved node IDs against its ground truth."""
    for _index, row in test_set_df.iterrows():
        trace_query_id = row["ragQueryTraceId"]
        trace_query_data = get_query_trace_event(trace_query_id)
        print(f"{trace_query_data=}")
        retrieved_node_ids = trace_query_data["paramSet"]["retrieved_node_ids"]
        # Each test case stores a single ground-truth node ID
        expected_node_ids: list[str] = [row["groundTruth"]]

        # One score per evaluator, in the same order as metric_evaluators
        evaluator_results = [
            evaluator.compute(
                retrieved_ids=retrieved_node_ids,
                expected_ids=expected_node_ids,
            ).score
            for evaluator in metric_evaluators
        ]
        data.append([trace_query_id, *evaluator_results])


retrieved_correct_context_node(test_set_df)

import pandas as pd

columns = ["Trace Query Event Id", "Hit Rate", "MRR"]
eval_pd = pd.DataFrame(data, columns=columns)
eval_pd

2024-04-25 18:15:16,322 - Starting new HTTPS connection (1): lastmileai.dev:443
2024-04-25 18:15:16,418 - https://lastmileai.dev:443 "GET /api/rag_query_traces/read?id=clvfsttpk00u0qunwz1u77mxn HTTP/1.1" 200 None
2024-04-25 18:15:16,422 - Starting new HTTPS connection (1): lastmileai.dev:443
2024-04-25 18:15:16,513 - https://lastmileai.dev:443 "GET /api/rag_query_traces/read?id=clvfstjcd00tyqunwzddu0ajo HTTP/1.1" 200 None
2024-04-25 18:15:16,517 - Starting new HTTPS connection (1): lastmileai.dev:443
2024-04-25 18:15:16,609 - https://lastmileai.dev:443 "GET /api/rag_query_traces/read?id=clvfst1en0035qpi8fdgzozll HTTP/1.1" 200 None
2024-04-25 18:15:16,613 - Starting new HTTPS connection (1): lastmileai.dev:443


trace_query_data={'id': 'clvfsttpk00u0qunwz1u77mxn', 'createdAt': '2024-04-25T22:11:51.848Z', 'updatedAt': '2024-04-25T22:11:51.848Z', 'paramSet': {'similarity_top_k': 5, 'retrieved_node_ids': ['ff3fe49f-f677-4662-911b-5881697056a9', '530867c8-9c07-44b7-9773-80bc5e57ff47', 'cd886ed4-00b9-4803-a9a3-8d55dfb5432e', '71f9d4ff-c517-40f8-b54d-f5eb49b97301', '0017c3d9-2bc4-4ded-8db1-26a5b25d3a20']}, 'query': '{"query":"Based on the author\'s experience and observations during his undergraduate and graduate studies, how did his perception of Artificial Intelligence evolve and what led him to conclude that the AI practiced at the time was a hoax?"}', 'context': '{"context":["At the time this bothered me, but now it seems amusingly accurate, for reasons I was about to discover.\\n\\nI applied to 3 grad schools: MIT and Yale, which were renowned for AI at the time, and Harvard, which I\'d visited because Rich Draves went there, and was also home to Bill Woods, who\'d invented the type of parser I

2024-04-25 18:15:16,699 - https://lastmileai.dev:443 "GET /api/rag_query_traces/read?id=clvfsso7v00txqunw2hqcmsmr HTTP/1.1" 200 None
2024-04-25 18:15:16,703 - Starting new HTTPS connection (1): lastmileai.dev:443
2024-04-25 18:15:16,784 - https://lastmileai.dev:443 "GET /api/rag_query_traces/read?id=clvfss5c700bhpe3kqs55g7oo HTTP/1.1" 200 None


trace_query_data={'id': 'clvfsso7v00txqunw2hqcmsmr', 'createdAt': '2024-04-25T22:10:58.075Z', 'updatedAt': '2024-04-25T22:10:58.075Z', 'paramSet': {'similarity_top_k': 5, 'retrieved_node_ids': ['71f9d4ff-c517-40f8-b54d-f5eb49b97301', '9318fbc7-13c0-48cb-9c0d-cad45bbea0ed', '5ada222a-6228-4ca9-8b4e-7e6180d5d461', 'aa502187-d057-470d-a5b2-3c86ecc9eac4', 'cd886ed4-00b9-4803-a9a3-8d55dfb5432e']}, 'query': '{"query":"Discuss the transition from using the IBM 1401 to microcomputers, highlighting the differences in programming and user interaction, as described in the text. Include specific examples such as the Heathkit and TRS-80."}', 'context': '{"context":["I remember vividly how impressed and envious I felt watching him sitting in front of it, typing programs right into the computer.\\n\\nComputers were expensive in those days and it took me years of nagging before I convinced my father to buy one, a TRS-80, in about 1980. The gold standard then was the Apple II, but a TRS-80 was good eno

Unnamed: 0,Trace Query Event Id,Hit Rate,MRR
0,clvfsttpk00u0qunwz1u77mxn,1.0,0.333333
1,clvfstjcd00tyqunwzddu0ajo,1.0,1.0
2,clvfst1en0035qpi8fdgzozll,1.0,0.333333
3,clvfsso7v00txqunw2hqcmsmr,1.0,0.5
4,clvfss5c700bhpe3kqs55g7oo,1.0,0.5
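The per-trace scores in the table can be averaged into dataset-level numbers. The sketch below does this with the standard library only. The rows are copied from the table above, with 0.333333 written as 1/3 on the assumption that the displayed value is a rounded reciprocal rank of 3.

```python
from statistics import fmean

# (trace query event id, hit rate, MRR) rows copied from the table above.
# Assumption: the displayed 0.333333 is a rounded 1/3.
rows = [
    ("clvfsttpk00u0qunwz1u77mxn", 1.0, 1 / 3),
    ("clvfstjcd00tyqunwzddu0ajo", 1.0, 1.0),
    ("clvfst1en0035qpi8fdgzozll", 1.0, 1 / 3),
    ("clvfsso7v00txqunw2hqcmsmr", 1.0, 0.5),
    ("clvfss5c700bhpe3kqs55g7oo", 1.0, 0.5),
]

mean_hit_rate = fmean(hit for _id, hit, _mrr in rows)
mean_mrr = fmean(mrr for _id, _hit, mrr in rows)

print(f"Hit Rate mean: {mean_hit_rate:.4f}")  # Hit Rate mean: 1.0000
print(f"MRR mean: {mean_mrr:.4f}")            # MRR mean: 0.5333
```

These means are consistent with the `Hit Rate_mean` and `MRR_mean` values that the stored evaluation set reports later in this notebook.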


In [43]:
# Using the run_and_store_evaluations method
from lastmile_eval.rag.debugger.api import run_and_store_evaluations

result = run_and_store_evaluations(
    test_set_id,
    "Fake project name",
    {
        "Hit Rate": compute_hit_rate,
        "MRR": compute_mrr,
    },
    {},
    LASTMILE_API_TOKEN,
    f"Evaluation Results for Test Set {test_set_id}",
)

# TODO: Print out the evaluation test table with final evaluation metrics
result


2024-04-25 18:15:49,242 - Starting new HTTPS connection (1): lastmileai.dev:443
2024-04-25 18:15:49,365 - https://lastmileai.dev:443 "GET /api/evaluation_test_cases/list HTTP/1.1" 200 None
2024-04-25 18:15:49,389 - Starting new HTTPS connection (1): lastmileai.dev:443
2024-04-25 18:15:49,475 - https://lastmileai.dev:443 "GET /api/rag_query_traces/read?id=clvfsttpk00u0qunwz1u77mxn HTTP/1.1" 200 None
2024-04-25 18:15:49,479 - Starting new HTTPS connection (1): lastmileai.dev:443
2024-04-25 18:15:49,564 - https://lastmileai.dev:443 "GET /api/rag_query_traces/read?id=clvfstjcd00tyqunwzddu0ajo HTTP/1.1" 200 None
2024-04-25 18:15:49,568 - Starting new HTTPS connection (1): lastmileai.dev:443


{'evaluationTestCases': [{'id': 'clvfsvjad00n2pbml9q6k9n9p', 'createdAt': '2024-04-25T22:13:11.641Z', 'updatedAt': '2024-04-25T22:13:11.641Z', 'query': '{"query":"Based on the author\'s experience and observations during his undergraduate and graduate studies, how did his perception of Artificial Intelligence evolve and what led him to conclude that the AI practiced at the time was a hoax?"}', 'context': '{"context":["At the time this bothered me, but now it seems amusingly accurate, for reasons I was about to discover.\\n\\nI applied to 3 grad schools: MIT and Yale, which were renowned for AI at the time, and Harvard, which I\'d visited because Rich Draves went there, and was also home to Bill Woods, who\'d invented the type of parser I used in my SHRDLU clone. Only Harvard accepted me, so that was where I went.\\n\\nI don\'t remember the moment it happened, or if there even was a specific moment, but during the first year of grad school I realized that AI, as practiced at the time, w

2024-04-25 18:15:49,656 - https://lastmileai.dev:443 "GET /api/rag_query_traces/read?id=clvfst1en0035qpi8fdgzozll HTTP/1.1" 200 None
2024-04-25 18:15:49,661 - Starting new HTTPS connection (1): lastmileai.dev:443
2024-04-25 18:15:49,779 - https://lastmileai.dev:443 "GET /api/rag_query_traces/read?id=clvfsso7v00txqunw2hqcmsmr HTTP/1.1" 200 None
2024-04-25 18:15:49,784 - Starting new HTTPS connection (1): lastmileai.dev:443
2024-04-25 18:15:49,880 - https://lastmileai.dev:443 "GET /api/rag_query_traces/read?id=clvfss5c700bhpe3kqs55g7oo HTTP/1.1" 200 None
2024-04-25 18:15:49,885 - Starting new HTTPS connection (1): lastmileai.dev:443
2024-04-25 18:15:49,966 - https://lastmileai.dev:443 "GET /api/rag_query_traces/read?id=clvfsttpk00u0qunwz1u77mxn HTTP/1.1" 200 None
2024-04-25 18:15:49,971 - Starting new HTTPS connection (1): lastmileai.dev:443
2024-04-25 18:15:50,054 - https://lastmileai.dev:443 "GET /api/rag_query_traces/read?id=clvfstjcd00tyqunwzddu0ajo HTTP/1.1" 200 None
2024-04-25 18:1

CreateEvaluationsResult(success=True, message='{"id":"clvfsyyhd0037qpi8j43btbrs","createdAt":"2024-04-25T22:15:51.312Z","updatedAt":"2024-04-25T22:15:51.312Z","name":"Evaluation Results for Test Set clvfsvja200n1pbmlnq59a74y","paramSet":null,"testSetId":"clvfsvja200n1pbmlnq59a74y","creatorId":"clp1m7n3l0062qpqnd4nyabbl","projectId":null,"organizationId":null,"visibility":"MEMBER","metadata":null,"active":true}')

In [44]:
import requests
from requests import Response
from typing import Any, Optional

# TODO: Move this into the lastmile-eval package as its own SDK helper
def list_evaluation_sets(
    take: int = 10,
    lastmile_api_token: Optional[str] = None,
    # TODO: Create macro for default timeout value
    timeout: int = 60,
) -> dict[str, Any]:  # TODO: Define explicit typing for JSON response return
    """
    Get a list of evaluation sets from the LastMile API.

    Args:
        take: The number of evaluation sets to return. The default is 10.
        lastmile_api_token: The API token for the LastMile API. If not provided,
            falls back to the notebook-level LASTMILE_API_TOKEN variable.
            You can create a token from the "API Tokens" section from this website:
            https://lastmileai.dev/settings?page=tokens
        timeout: The maximum time in seconds to wait for the request to complete.
            The default is 60.

    Returns:
        A dictionary containing the evaluation sets.
    """
    token = lastmile_api_token or LASTMILE_API_TOKEN
    lastmile_endpoint = f"https://lastmileai.dev/api/evaluation_sets/list?pageSize={take}"

    response: Response = requests.get(
        lastmile_endpoint,
        headers={"Authorization": f"Bearer {token}"},
        timeout=timeout,
    )
    # TODO: Handle response errors
    return response.json()

evaluation_sets = list_evaluation_sets(take=1)
evaluation_sets_df = pd.DataFrame.from_records(evaluation_sets["evaluationSets"]).rename(  # type: ignore[fixme]
    columns={"id": "evaluationSetId"}
)
pd.set_option('display.max_colwidth', None)
evaluation_sets_df

# TODO: evaluationSetMetrics looks a bit weird; we should probably have a
# helper method to display it better, but it's OK for now

2024-04-25 18:15:55,713 - Starting new HTTPS connection (1): lastmileai.dev:443
2024-04-25 18:15:55,879 - https://lastmileai.dev:443 "GET /api/evaluation_sets/list?pageSize=1 HTTP/1.1" 200 1023


Unnamed: 0,evaluationSetId,createdAt,updatedAt,name,paramSet,testSetId,creatorId,projectId,organizationId,visibility,metadata,active,evaluationSetMetrics,testSet
0,clvfsyyhd0037qpi8j43btbrs,2024-04-25T22:15:51.312Z,2024-04-25T22:15:51.312Z,Evaluation Results for Test Set clvfsvja200n1pbmlnq59a74y,,clvfsvja200n1pbmlnq59a74y,clp1m7n3l0062qpqnd4nyabbl,,,MEMBER,,True,"[{'createdAt': '2024-04-25T22:15:51.312Z', 'updatedAt': '2024-04-25T22:15:51.312Z', 'metricName': 'Hit Rate_mean', 'metricValue': 1, 'evaluationSetId': 'clvfsyyhd0037qpi8j43btbrs', 'creatorId': 'clp1m7n3l0062qpqnd4nyabbl', 'metadata': None}, {'createdAt': '2024-04-25T22:15:51.312Z', 'updatedAt': '2024-04-25T22:15:51.312Z', 'metricName': 'MRR_mean', 'metricValue': 0.5333333333333333, 'evaluationSetId': 'clvfsyyhd0037qpi8j43btbrs', 'creatorId': 'clp1m7n3l0062qpqnd4nyabbl', 'metadata': None}]","{'id': 'clvfsvja200n1pbmlnq59a74y', 'name': 'Retrieval Eval Test Set'}"
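The `evaluationSetMetrics` column is a list of metric records, which is hard to read inside a single cell. Below is a minimal sketch of flattening it into a name-to-value mapping. The sample records reuse the `metricName` and `metricValue` fields shown in the output above, with the other fields omitted, and `summarize_metrics` is a hypothetical helper, not part of lastmile-eval.

```python
from typing import Any


def summarize_metrics(metrics: list[dict[str, Any]]) -> dict[str, float]:
    """Map each metric name to its value, dropping the bookkeeping fields."""
    return {m["metricName"]: float(m["metricValue"]) for m in metrics}


# Records trimmed from the output above (only the two fields used here)
evaluation_set_metrics = [
    {"metricName": "Hit Rate_mean", "metricValue": 1},
    {"metricName": "MRR_mean", "metricValue": 0.5333333333333333},
]

summary = summarize_metrics(evaluation_set_metrics)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

With the real DataFrame, the same helper could be applied to `evaluation_sets_df["evaluationSetMetrics"]` row by row, assuming each cell is a list of dictionaries as in the output shown.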
