## Step 1: Import modules and packages, download reference docs

This notebook walks through the entire LastMile AI Eval flow:

1. Create an ingestion trace.
2. Generate a query + ground-truth context pair for each node in a document.
   - Run those queries through traced RAG queries to get the context that was actually retrieved.
3. List the query traces to include in a test set (defaults to the last N queries for now).
4. Create a test set from those query traces, storing the ground truth for the context associated with each query.
5. Create evaluation metrics based on the ones provided by LlamaIndex.
   - Note: because these come mainly from LlamaIndex, the evaluation metrics focus only on retrieval, not on outputs (though outputs are stored as output events too).
6. Create an evaluation set by running these metrics on the test set we just created.

Some notes:
- No manual ID grepping is needed; helper functions take care of it. This probably needs a better design in the future; the focus so far was on getting unblocked.
- `ingestion_trace_id` needs to be refactored to map to the trace level instead of being marked at the RAG query event level (right now it doesn't work; that will be added later).
- Some other small convenience functions still need to be added to the API, such as a helper for `list_evaluation_sets()`.


In [15]:
# Install dependencies
# IMPORTANT: After running this cell, you MUST
# restart kernel for these changes to take effect

# !pip list | grep lastmile

# !pip3 install lastmile-eval #--upgrade --force-reinstall

!pwd

# Workaround: install the lastmile-eval package locally in editable mode
!pip3 install -e ../../../../..

!pip3 install llama-index

/Users/rossdancraig/Projects/eval/src/lastmile_eval/examples/rag_debugger/getting_started
Obtaining file:///Users/rossdancraig/Projects/eval
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Building wheels for collected packages: lastmile-eval
  Building editable for lastmile-eval (pyproject.toml) ... done
  Created wheel for lastmile-eval: filename=lastmile_eval-0.0.14-0.editable-py3-none-any.whl size=5150 sha256=8196950b3f8654adb90d152850e36e8768eba95d1e1f1cbd7460f7ff1479884d
  Stored in directory: /private/var/folders/n9/fr1zcc3x3m327h0r11mr5b7c0000gn/T/pip-ephem-wheel-cache-xc0zf5j9/wheels/f5/5c/e6/f8760477828ee734f8b060f518c34939861874bc3ff8be5687
Successfully built lastmile-eval
Installing collected packages: lastmile-eval
  Attempting uninstall: lastmile-eval
    Found 

In [1]:
!pip list | grep lastmile-eval

lastmile-eval                             0.0.37


In [2]:
# The nest_asyncio module enables the nesting of asynchronous functions within an already running async loop.
# This is necessary because Jupyter notebooks inherently operate in an asynchronous loop.
# By applying nest_asyncio, we can run additional async functions within this existing loop without conflicts.
import nest_asyncio

nest_asyncio.apply()

from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    ServiceContext,
)
from llama_index.core.evaluation import (
    RetrieverEvaluator,
    generate_question_context_pairs,
)
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.llms.openai import OpenAI

import pandas as pd

from lastmile_eval.rag.debugger.tracing import get_lastmile_tracer


In [3]:
import os
import dotenv

# You can get your OPENAI_API_KEY from https://platform.openai.com/api-keys
dotenv.load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
LASTMILE_API_TOKEN = os.getenv("LASTMILE_API_TOKEN")

# Check before use: os.getenv() returns None for a missing variable, and
# both len(None) and os.environ[...] = None would raise a TypeError
assert OPENAI_API_KEY, "OPENAI_API_KEY is not set"
assert LASTMILE_API_TOKEN, "LASTMILE_API_TOKEN is not set"

os.environ["LASTMILE_API_TOKEN"] = LASTMILE_API_TOKEN
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

print(f"{len(OPENAI_API_KEY)=}, {len(LASTMILE_API_TOKEN)=}")

len(OPENAI_API_KEY)=51, len(LASTMILE_API_TOKEN)=324
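
If a key is missing from `.env`, a small helper can make the failure easier to read than a bare assertion. This is a minimal sketch using only the standard library; `require_env` and `EXAMPLE_API_TOKEN` are hypothetical names for illustration, not part of `lastmile-eval` or `python-dotenv`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it first"
        )
    return value


# Placeholder variable so the sketch runs anywhere (hypothetical name)
os.environ["EXAMPLE_API_TOKEN"] = "placeholder-token"
token = require_env("EXAMPLE_API_TOKEN")
print(f"{len(token)=}")
```

In this notebook the same helper could be called with `"OPENAI_API_KEY"` and `"LASTMILE_API_TOKEN"` after `dotenv.load_dotenv()`.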


In [5]:
!mkdir -p 'data/paul_graham/'
!curl 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -o 'data/paul_graham/paul_graham_essay.txt'

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 75042  100 75042    0     0   536k      0 --:--:-- --:--:-- --:--:--  555k


## Step 2: Run and Trace Ingestion Pipeline

In [4]:
# Instantiate a tracer object
tracer = get_lastmile_tracer("my_cool_tracer")


# You can use the tracer either as a decorator around a function (like below)
# or with the "with ... as span_variable_name:" syntax
@tracer.start_as_current_span("ingestion-root-span")
def run_ingestion_flow():
    documents = SimpleDirectoryReader("./data/paul_graham/").load_data()

    # Register the doc file paths as a parameter
    doc_file_paths = [
        doc.metadata.get("file_path")
        for doc in documents
        if doc.metadata.get("file_path") is not None
    ]
    tracer.register_param("doc_file_paths", str(doc_file_paths))

    with tracer.start_as_current_span(
        "create-document-nodes"
    ) as _node_parser_span:
        # Register chunk_size as a parameter in this
        # trace's parameter set
        chunk_size = 512
        tracer.register_param("chunk_size", chunk_size)

        node_parser = SimpleNodeParser.from_defaults(chunk_size=chunk_size)
        nodes = node_parser.get_nodes_from_documents(documents)

        # Mark a RAG ingestion trace event
        #   --> For now this only accepts strings and lists of strings
        #   --> We can add more specific events (like what you'll see with
        #      the `mark_rag_query_trace_event` method) in the future
        tracer.mark_rag_ingestion_trace_event("Created document nodes!")

    with tracer.start_as_current_span(
        "embed-document-nodes"
    ) as _create_node_span:
        vector_index = VectorStoreIndex(nodes)
        query_engine = vector_index.as_query_engine()
        tracer.mark_rag_ingestion_trace_event("Created embeddings!")

    # We use these variables later in the notebook so need to return them
    # in this function
    return nodes, vector_index, query_engine


2024-05-14 12:54:53,971 - Starting new HTTPS connection (1): lastmileai.dev:443
2024-05-14 12:54:54,067 - https://lastmileai.dev:443 "GET /api/evaluation_projects/list?name=my_cool_tracer HTTP/1.1" 200 349
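
The cell above uses the span API in two styles: as a decorator on `run_ingestion_flow()` and as a `with` block inside it. The sketch below illustrates the same two call styles with a stand-in built on `contextlib.contextmanager`. It is NOT the LastMile tracer; `start_span`, `opened_spans`, and `run_flow` are hypothetical names that only show how one object can serve both roles.

```python
from contextlib import contextmanager

# Stand-in for a tracer: records span names in the order the spans are opened.
# This is only an illustration, not the LastMile tracer implementation.
opened_spans: list[str] = []


@contextmanager
def start_span(name: str):
    opened_spans.append(name)
    try:
        yield name
    finally:
        pass  # a real tracer would close and export the span here


@start_span("root-span")  # style 1: decorator around a function
def run_flow() -> str:
    with start_span("child-span") as span_name:  # style 2: with-block
        return f"ran inside {span_name}"


result = run_flow()
print(opened_spans)
```

The decorator form opens a fresh span on every call of the function, so the child span is always opened after its root.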


In [9]:
# Run the ingestion flow and save the trace data
# This saves it to two tables:
# 1) The raw trace data that gets saved to Jaeger
# 2) The structured trace data that includes the paramSets, events,
#   etc that gets saved to our Postgres tables

# Run this cell once to generate an ingestion trace
nodes, vector_index, query_engine = run_ingestion_flow()

2024-05-14 12:55:44,772 - > [SimpleDirectoryReader] Total files added: 1
2024-05-14 12:55:44,782 - open file: /Users/jonathan/Projects/eval-cookbook/examples/data/paul_graham/paul_graham_essay.txt
2024-05-14 12:55:44,987 - > Adding chunk: What I Worked On

February 2021

Before college...
2024-05-14 12:55:44,988 - > Adding chunk: I was puzzled by the 1401. I couldn't figure ou...
2024-05-14 12:55:44,988 - > Adding chunk: I remember vividly how impressed and envious I ...
2024-05-14 12:55:44,988 - > Adding chunk: I couldn't have put this into words when I was ...
2024-05-14 12:55:44,988 - > Adding chunk: Learning Lisp expanded my concept of a program ...
2024-05-14 12:55:44,989 - > Adding chunk: Only Harvard accepted me, so that was where I w...
2024-05-14 12:55:44,989 - > Adding chunk: So I decided to focus on Lisp. In fact, I decid...
2024-05-14 12:55:44,989 - > Adding chunk: Anyone who wanted one to play around with could...
2024-05-14 12:55:44,989 - > Adding chunk: I'd never imagine

In [10]:
# Let's print the trace data from Jaeger to
# show you what it looks like (search for "operationName" in the data)

from lastmile_eval.rag.debugger.tracing import (
    get_latest_ingestion_trace_id,
    get_trace_data,
)

ingestion_trace_id = get_latest_ingestion_trace_id()
get_trace_data(ingestion_trace_id)

2024-05-14 12:56:00,754 - Starting new HTTPS connection (1): lastmileai.dev:443
2024-05-14 12:56:00,854 - https://lastmileai.dev:443 "GET /api/rag_ingestion_traces/list?pageSize=1 HTTP/1.1" 200 596
2024-05-14 12:56:00,859 - Starting new HTTPS connection (1): lastmileai.dev:443
2024-05-14 12:56:00,992 - https://lastmileai.dev:443 "GET /api/trace/read?id=2d1db49cbcbf707f8bea0cccbe43946a HTTP/1.1" 200 None


{'data': [{'traceID': '2d1db49cbcbf707f8bea0cccbe43946a',
   'spans': [{'traceID': '2d1db49cbcbf707f8bea0cccbe43946a',
     'spanID': 'd4ad661565f03ae2',
     'operationName': 'ingestion-root-span',
     'references': [],
     'startTime': 1715705744770425,
     'duration': 1466427,
     'tags': [{'key': 'doc_file_paths',
       'type': 'string',
       'value': "['/Users/jonathan/Projects/eval-cookbook/examples/data/paul_graham/paul_graham_essay.txt']"},
      {'key': 'span.kind', 'type': 'string', 'value': 'internal'},
      {'key': 'internal.span.format', 'type': 'string', 'value': 'otlp'}],
     'logs': [],
     'processID': 'p1',
    {'traceID': '2d1db49cbcbf707f8bea0cccbe43946a',
     'spanID': '01b23423d2a3313d',
     'operationName': 'create-document-nodes',
     'references': [{'refType': 'CHILD_OF',
       'traceID': '2d1db49cbcbf707f8bea0cccbe43946a',
       'spanID': 'd4ad661565f03ae2'}],
     'startTime': 1715705744783201,
     'duration': 221595,
     'tags': [{'key': 'ch
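
The raw trace dictionary is easier to read once it is flattened into one row per span. The sketch below works on an abbreviated sample shaped like the output above (only the fields it needs, with values copied from that output). `summarize_spans` is a hypothetical helper, and it assumes durations are in microseconds, as in Jaeger's JSON format.

```python
from typing import Optional

# Abbreviated sample shaped like the Jaeger-style output above
sample_trace = {
    "data": [
        {
            "traceID": "2d1db49cbcbf707f8bea0cccbe43946a",
            "spans": [
                {
                    "spanID": "d4ad661565f03ae2",
                    "operationName": "ingestion-root-span",
                    "references": [],
                    "duration": 1466427,
                },
                {
                    "spanID": "01b23423d2a3313d",
                    "operationName": "create-document-nodes",
                    "references": [
                        {"refType": "CHILD_OF", "spanID": "d4ad661565f03ae2"}
                    ],
                    "duration": 221595,
                },
            ],
        }
    ]
}


def summarize_spans(trace: dict) -> list[tuple[str, Optional[str], float]]:
    """Return (operation, parent operation, duration in ms) for each span."""
    rows = []
    for trace_entry in trace["data"]:
        spans = trace_entry["spans"]
        name_by_id = {s["spanID"]: s["operationName"] for s in spans}
        for span in spans:
            parent_ids = [
                ref["spanID"]
                for ref in span["references"]
                if ref["refType"] == "CHILD_OF"
            ]
            parent = name_by_id.get(parent_ids[0]) if parent_ids else None
            # Assumption: duration is in microseconds, so divide by 1000 for ms
            rows.append((span["operationName"], parent, span["duration"] / 1000))
    return rows


for row in summarize_spans(sample_trace):
    print(row)
```

Passing the result of `get_trace_data(ingestion_trace_id)` instead of `sample_trace` should work as long as the returned dictionary has the shape shown above.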

In [11]:
# Now let's fetch the trace event data from our Postgres table
# Notice that the `traceId` column matches the raw trace data

from lastmile_eval.rag.debugger.tracing import list_ingestion_trace_events

ingestion_trace_events = list_ingestion_trace_events(take=1)
pd.DataFrame.from_records(ingestion_trace_events["ingestionTraces"]).rename(  # type: ignore[fixme]
    columns={"id": "ragIngestionTraceEventId"}
)

2024-05-14 12:56:04,110 - Starting new HTTPS connection (1): lastmileai.dev:443
2024-05-14 12:56:04,212 - https://lastmileai.dev:443 "GET /api/rag_ingestion_traces/list?pageSize=1 HTTP/1.1" 200 596


Unnamed: 0,ragIngestionTraceEventId,createdAt,updatedAt,paramSet,metadata,input,output,eventData,eventName,traceId,creatorId,projectId,organizationId,visibility,active,annotations
0,clw6mwiik027ape2bdc6cvdnl,2024-05-14T16:55:46.364Z,2024-05-14T16:55:46.364Z,"{'chunk_size': 512, 'doc_file_paths': '['/User...",,,,,,2d1db49cbcbf707f8bea0cccbe43946a,clkrgxm850004phi6ee5mvhd1,,,MEMBER,True,[]


## Step 3: Run and Trace Query Pipeline

In [12]:
import openai
from lastmile_eval.rag.debugger.api import (
    QueryReceived,
    ContextRetrieved,
    PromptResolved,
    LLMOutputReceived,
)

LLM_NAME = "gpt-4"

# Note: normally you can just call `query_engine.query(user_query)`,
# but that call hides most of the steps, so we perform each step
# manually to showcase how to use the tracer
PROMPT_TEMPLATE = """
Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge, answer the query.
Query: {query_str}
Answer:
"""


@tracer.start_as_current_span("query-root-span")
def run_query_flow(user_query: str, ingestion_trace_id: str):
    tracer.mark_rag_query_trace_event(
        QueryReceived(query=user_query), ingestion_trace_id
    )

    with tracer.start_as_current_span(
        "retrieve-context"
    ) as _retrieve_context_span:
        similarity_top_k = 5
        tracer.register_param("similarity_top_k", similarity_top_k)

        retriever = vector_index.as_retriever(
            similarity_top_k=similarity_top_k
        )
        retrieved_nodes = retriever.retrieve(user_query)
        retrieved_contexts = [node.get_text() for node in retrieved_nodes]

        retrieved_node_ids = [node.id_ for node in retrieved_nodes]
        tracer.register_param("retrieved_node_ids", retrieved_node_ids)

        tracer.mark_rag_query_trace_event(
            ContextRetrieved(context=retrieved_contexts), ingestion_trace_id
        )

    with tracer.start_as_current_span("resolve-prompt") as _resolve_prompt_span:
        resolved_prompt = PROMPT_TEMPLATE.replace(
            "{context_str}", "\n\n\n".join(retrieved_contexts)
        ).replace("{query_str}", user_query)
        tracer.mark_rag_query_trace_event(
            PromptResolved(fully_resolved_prompt=resolved_prompt),
            ingestion_trace_id,
        )

    with tracer.start_as_current_span("call-llm") as _llm_span:
        openai_client = openai.Client(api_key=os.getenv("OPENAI_API_KEY"))
        response = openai_client.chat.completions.create(
            model=LLM_NAME,
            messages=[{"role": "user", "content": resolved_prompt}],
        )
        output: str = response.choices[0].message.content
        tracer.mark_rag_query_trace_event(
            LLMOutputReceived(llm_output=output), ingestion_trace_id
        )

    return output


In [13]:
# TODO: The ingestion_trace_id passed to mark_rag_query_trace_event is
# currently a no-op due to changed assumptions; this will be fixed later
response = run_query_flow("What did the author do growing up?", ingestion_trace_id)

print(f"Response: {response}")

2024-05-14 12:56:13,844 - Request options: {'method': 'post', 'url': '/embeddings', 'files': None, 'post_parser': <function Embeddings.create.<locals>.parser at 0x16fb563b0>, 'json_data': {'input': ['What did the author do growing up?'], 'model': 'text-embedding-ada-002', 'encoding_format': 'base64'}}
2024-05-14 12:56:13,845 - Sending HTTP Request: POST https://api.openai.com/v1/embeddings
2024-05-14 12:56:13,846 - close.started
2024-05-14 12:56:13,847 - close.complete
2024-05-14 12:56:13,847 - connect_tcp.started host='api.openai.com' port=443 local_address=None timeout=60.0 socket_options=None
2024-05-14 12:56:13,861 - connect_tcp.complete return_value=<httpcore._backends.sync.SyncStream object at 0x1432dcd00>
2024-05-14 12:56:13,861 - start_tls.started ssl_context=<ssl.SSLContext object at 0x2a55cec40> server_hostname='api.openai.com' timeout=60.0
2024-05-14 12:56:13,876 - start_tls.complete return_value=<httpcore._backends.sync.SyncStream object at 0x2a1136bc0>
2024-05-14 12:56:13,

Response: The author spent a significant amount of time writing and programming while growing up. He began writing short stories and attempted his first program on the IBM 1401 during the 9th grade. The author was also involved in learning Fortran, a programming language, using punch cards to store and run programs.



In [14]:
# Just like we did with the ingestion trace,
# let's print out what this looks like in the Postgres data, as well as the
# raw trace data again
from lastmile_eval.rag.debugger.tracing import (
    list_query_trace_events,
)

query_trace_events = list_query_trace_events(take=1)
query_trace_events_df = pd.DataFrame.from_records(query_trace_events["queryTraces"]).rename(  # type: ignore[fixme]
    columns={"id": "ragQueryTraceEventsId"}
)
query_trace_events_df



2024-05-14 12:56:21,840 - Starting new HTTPS connection (1): lastmileai.dev:443
2024-05-14 12:56:21,936 - https://lastmileai.dev:443 "GET /api/rag_query_traces/list?pageSize=1 HTTP/1.1" 200 None


Unnamed: 0,ragQueryTraceEventsId,createdAt,updatedAt,paramSet,query,context,fullyResolvedPrompt,input,output,eventData,...,metadata,traceId,ragIngestionTraceId,creatorId,projectId,organizationId,visibility,active,ragIngestionTrace,annotations
0,clw6mx7ie004jqpmoibjb7vb4,2024-05-14T16:56:18.758Z,2024-05-14T16:56:18.758Z,"{'similarity_top_k': 5, 'retrieved_node_ids': ...","{""query"":""What did the author do growing up?""}","{""context"":[""What I Worked On\n\nFebruary 2021...","{""fully_resolved_prompt"":""\nContext informatio...","{'query': '{""query"":""What did the author do gr...","{""llm_output"":""The author spent a significant ...",,...,,42585b4f5bdff6ca1c6c1c6496724618,,clkrgxm850004phi6ee5mvhd1,,,MEMBER,True,,[]
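
The `query`, `context`, and `output` columns above hold JSON-encoded strings rather than plain text. A minimal sketch of decoding them with the standard library follows; `decode_event_field` is a hypothetical helper, and the `raw_context` value is a shortened, made-up example with the same shape as the real column.

```python
import json


def decode_event_field(raw: str, key: str):
    """Parse a JSON-encoded event column and return the value stored under key."""
    return json.loads(raw)[key]


# Value copied from the `query` column above
raw_query = '{"query":"What did the author do growing up?"}'
query_text = decode_event_field(raw_query, "query")
print(query_text)

# Hypothetical, shortened `context` value with the same shape (a list of strings)
raw_context = '{"context":["What I Worked On", "I was puzzled by the 1401."]}'
contexts = decode_event_field(raw_context, "context")
print(len(contexts))
```

Assuming the dataframe column really holds strings like these, the helper could be applied with `query_trace_events_df["query"].map(lambda raw: decode_event_field(raw, "query"))`.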


In [13]:
# This is what the trace data looks like
from lastmile_eval.rag.debugger.tracing import (
    get_trace_data,
)

query_trace_id = query_trace_events_df.iloc[0]["traceId"]
get_trace_data(query_trace_id)




2024-05-14 12:29:31,207 - Starting new HTTPS connection (1): lastmileai.dev:443
2024-05-14 12:29:31,353 - https://lastmileai.dev:443 "GET /api/trace/read?id=aeb91d410bcc4da40e33447dc6268577 HTTP/1.1" 200 None


{'data': [{'traceID': 'aeb91d410bcc4da40e33447dc6268577',
   'spans': [{'traceID': 'aeb91d410bcc4da40e33447dc6268577',
     'spanID': '44ce6047750a837b',
     'operationName': 'query-root-span',
     'references': [],
     'startTime': 1715704151183028,
     'duration': 8506000,
     'tags': [{'key': 'span.kind', 'type': 'string', 'value': 'internal'},
      {'key': 'internal.span.format', 'type': 'string', 'value': 'otlp'}],
     'logs': [{'timestamp': 1715704151183184,
       'fields': [{'key': 'event', 'type': 'string', 'value': 'QueryReceived'},
        {'key': 'indexing_trace_id',
         'type': 'string',
         'value': 'fa74a5444904e0f16bec4c32ba7fff70'},
        {'key': 'rag_query_event',
         'type': 'string',
         'value': '{"query":"What did the author do growing up?"}'}]}],
     'processID': 'p1',
    {'traceID': 'aeb91d410bcc4da40e33447dc6268577',
     'spanID': '91d29cfe5dc3227e',
     'operationName': 'retrieve-context',
     'references': [{'refType': 'CHILD

## Step 4: Create Test Sets and Run Evaluators

In [17]:
# NOTE: Running this cell on all the nodes will take a while (probably 5-10 minutes), so please be patient

# Change this to a lower value if you want to run faster.
# If this is None, the override is ignored and total_queries_per_batch
# is used instead
total_queries_to_run_override = (
    5  # None
)


# We're now going to artificially generate a bunch of query + context
# (ground truth) pairs. We will then run the `run_query_flow()` method on these
# generated queries later

# Define an LLM
llm = OpenAI(model=LLM_NAME)


# This method `generate_question_context_pairs()` essentially
# calls an LLM to generate questions for us. See this URL for more details:
# https://github.com/run-llama/llama_index/blob/8b373239396134a92c9277b36aa7023c633c018a/llama-index-finetuning/llama_index/finetuning/embeddings/common.py#L49-L64
num_questions_per_chunk = 1
qa_dataset = generate_question_context_pairs(
    nodes[0:total_queries_to_run_override or len(nodes)],
    llm=llm,
    num_questions_per_chunk=num_questions_per_chunk
)

  0%|          | 0/5 [00:00<?, ?it/s]2024-05-14 12:57:15,858 - load_ssl_context verify=True cert=None trust_env=True http2=False
2024-05-14 12:57:15,860 - load_verify_locations cafile='/opt/homebrew/Caskroom/miniconda/base/envs/env310/lib/python3.10/site-packages/certifi/cacert.pem'
2024-05-14 12:57:15,884 - Request options: {'method': 'post', 'url': '/chat/completions', 'files': None, 'json_data': {'messages': [{'role': 'user', 'content': 'Context information is below.\n\n---------------------\nWhat I Worked On\n\nFebruary 2021\n\nBefore college the two main things I worked on, outside of school, were writing and programming. I didn\'t write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.\n\nThe first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data pr

In [15]:
# Run these queries through the `run_query_flow()` method

total_queries_per_batch = len(qa_dataset.queries)
total_queries_to_run = min(total_queries_to_run_override or total_queries_per_batch, total_queries_per_batch)

expected_node_ids: list[str] = []
for i, (query_id, query) in enumerate(qa_dataset.queries.items()):
    run_query_flow(query, ingestion_trace_id)
    associated_node_id_for_query = qa_dataset.relevant_docs[query_id]
    expected_node_ids.append(associated_node_id_for_query[0])

    print(f"Finished running {i+1}/{total_queries_to_run} queries...")
    if i + 1 == total_queries_to_run:
        break

# Have to reverse because the list_query_trace_events() method
# returns the most recent trace events first
expected_node_ids.reverse()

2024-05-14 12:31:12,862 - Request options: {'method': 'post', 'url': '/embeddings', 'files': None, 'post_parser': <function Embeddings.create.<locals>.parser at 0x1452d93f0>, 'json_data': {'input': ["Describe the author's early experiences with programming on the IBM 1401, including the challenges he faced and how he attempted to overcome them."], 'model': 'text-embedding-ada-002', 'encoding_format': 'base64'}}
2024-05-14 12:31:12,863 - Sending HTTP Request: POST https://api.openai.com/v1/embeddings
2024-05-14 12:31:12,864 - close.started
2024-05-14 12:31:12,865 - close.complete
2024-05-14 12:31:12,865 - connect_tcp.started host='api.openai.com' port=443 local_address=None timeout=60.0 socket_options=None
2024-05-14 12:31:12,889 - connect_tcp.complete return_value=<httpcore._backends.sync.SyncStream object at 0x135fcb160>
2024-05-14 12:31:12,890 - start_tls.started ssl_context=<ssl.SSLContext object at 0x14445ee40> server_hostname='api.openai.com' timeout=60.0
2024-05-14 12:31:12,905 -

Finished running 1/5 queries...


2024-05-14 12:31:47,447 - HTTP Response: POST https://api.openai.com/v1/embeddings "200 OK" Headers({'date': 'Tue, 14 May 2024 16:31:27 GMT', 'content-type': 'application/json', 'transfer-encoding': 'chunked', 'connection': 'keep-alive', 'access-control-allow-origin': '*', 'openai-model': 'text-embedding-ada-002', 'openai-organization': 'lastmile-ai', 'openai-processing-ms': '29', 'openai-version': '2020-10-01', 'strict-transport-security': 'max-age=15724800; includeSubDomains', 'x-ratelimit-limit-requests': '10000', 'x-ratelimit-limit-tokens': '10000000', 'x-ratelimit-remaining-requests': '9999', 'x-ratelimit-remaining-tokens': '9999928', 'x-ratelimit-reset-requests': '6ms', 'x-ratelimit-reset-tokens': '0s', 'x-request-id': 'req_8d94015396419ad61cbb68335c7e40c8', 'cf-cache-status': 'DYNAMIC', 'server': 'cloudflare', 'cf-ray': '883c4753895f7c82-EWR', 'content-encoding': 'gzip', 'alt-svc': 'h3=":443"; ma=86400'})
2024-05-14 12:31:47,448 - request_id: req_8d94015396419ad61cbb68335c7e40c8

Finished running 2/5 queries...


2024-05-14 12:32:10,887 - receive_response_headers.complete return_value=(b'HTTP/1.1', 200, b'OK', [(b'Date', b'Tue, 14 May 2024 16:32:10 GMT'), (b'Content-Type', b'application/json'), (b'Transfer-Encoding', b'chunked'), (b'Connection', b'keep-alive'), (b'access-control-allow-origin', b'*'), (b'openai-model', b'text-embedding-ada-002'), (b'openai-organization', b'lastmile-ai'), (b'openai-processing-ms', b'24'), (b'openai-version', b'2020-10-01'), (b'strict-transport-security', b'max-age=15724800; includeSubDomains'), (b'x-ratelimit-limit-requests', b'10000'), (b'x-ratelimit-limit-tokens', b'10000000'), (b'x-ratelimit-remaining-requests', b'9999'), (b'x-ratelimit-remaining-tokens', b'9999965'), (b'x-ratelimit-reset-requests', b'6ms'), (b'x-ratelimit-reset-tokens', b'0s'), (b'x-request-id', b'req_68f5273547dcc4f2314c8fc2e9cd9a77'), (b'CF-Cache-Status', b'DYNAMIC'), (b'Server', b'cloudflare'), (b'CF-RAY', b'883c48630d43c339-EWR'), (b'Content-Encoding', b'gzip'), (b'alt-svc', b'h3=":443"; 

Finished running 3/5 queries...


2024-05-14 12:32:41,461 - receive_response_headers.complete return_value=(b'HTTP/1.1', 200, b'OK', [(b'Date', b'Tue, 14 May 2024 16:32:41 GMT'), (b'Content-Type', b'application/json'), (b'Transfer-Encoding', b'chunked'), (b'Connection', b'keep-alive'), (b'access-control-allow-origin', b'*'), (b'openai-model', b'text-embedding-ada-002'), (b'openai-organization', b'lastmile-ai'), (b'openai-processing-ms', b'411'), (b'openai-version', b'2020-10-01'), (b'strict-transport-security', b'max-age=15724800; includeSubDomains'), (b'x-ratelimit-limit-requests', b'10000'), (b'x-ratelimit-limit-tokens', b'10000000'), (b'x-ratelimit-remaining-requests', b'9999'), (b'x-ratelimit-remaining-tokens', b'9999939'), (b'x-ratelimit-reset-requests', b'6ms'), (b'x-ratelimit-reset-tokens', b'0s'), (b'x-request-id', b'req_eadef9817677803f4312ba6f09a72aa6'), (b'CF-Cache-Status', b'DYNAMIC'), (b'Server', b'cloudflare'), (b'CF-RAY', b'883c491ee93742a0-EWR'), (b'Content-Encoding', b'gzip'), (b'alt-svc', b'h3=":443";

Finished running 4/5 queries...


2024-05-14 12:32:59,036 - receive_response_headers.complete return_value=(b'HTTP/1.1', 200, b'OK', [(b'Date', b'Tue, 14 May 2024 16:32:59 GMT'), (b'Content-Type', b'application/json'), (b'Transfer-Encoding', b'chunked'), (b'Connection', b'keep-alive'), (b'access-control-allow-origin', b'*'), (b'openai-model', b'text-embedding-ada-002'), (b'openai-organization', b'lastmile-ai'), (b'openai-processing-ms', b'24'), (b'openai-version', b'2020-10-01'), (b'strict-transport-security', b'max-age=15724800; includeSubDomains'), (b'x-ratelimit-limit-requests', b'10000'), (b'x-ratelimit-limit-tokens', b'10000000'), (b'x-ratelimit-remaining-requests', b'9999'), (b'x-ratelimit-remaining-tokens', b'9999937'), (b'x-ratelimit-reset-requests', b'6ms'), (b'x-ratelimit-reset-tokens', b'0s'), (b'x-request-id', b'req_0daa4263cf995a650800a6ec213208bc'), (b'CF-Cache-Status', b'DYNAMIC'), (b'Server', b'cloudflare'), (b'CF-RAY', b'883c498fd9a842f2-EWR'), (b'Content-Encoding', b'gzip'), (b'alt-svc', b'h3=":443"; 

Finished running 5/5 queries...
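
Reversing `expected_node_ids` relies on the trace events coming back in exactly the opposite order of the loop, which can break if another query is traced in between. A more defensive option is to align by query text. The sketch below uses hypothetical stand-in data (`queries`, `relevant_docs`, `trace_rows`) shaped like `qa_dataset.queries`, `qa_dataset.relevant_docs`, and the JSON-encoded `query` column, and it assumes every generated query text is unique.

```python
import json

# Hypothetical stand-ins: generated queries keyed by ID, and their source nodes
queries = {"q1": "First question?", "q2": "Second question?"}
relevant_docs = {"q1": ["node-a"], "q2": ["node-b"]}

# Trace rows as returned newest-first, with the JSON-encoded `query` column
trace_rows = [
    {"query": '{"query":"Second question?"}'},
    {"query": '{"query":"First question?"}'},
]

# Map query text -> expected node ID, then look up each row by its own query
expected_by_query = {
    text: relevant_docs[query_id][0] for query_id, text in queries.items()
}
aligned_expected_ids = [
    expected_by_query[json.loads(row["query"])["query"]] for row in trace_rows
]
print(aligned_expected_ids)
```

With this approach the ground truth follows each row regardless of the order in which rows are returned.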


In [16]:
from lastmile_eval.rag.debugger.tracing import list_query_trace_events

query_trace_events = list_query_trace_events(take=total_queries_to_run)
query_trace_events_df = pd.DataFrame.from_records(query_trace_events["queryTraces"]).rename(  # type: ignore[fixme]
    columns={"id": "ragQueryTraceId"}
)
query_trace_events_df

2024-05-14 12:33:48,459 - Starting new HTTPS connection (1): lastmileai.dev:443
2024-05-14 12:33:48,574 - https://lastmileai.dev:443 "GET /api/rag_query_traces/list?pageSize=5 HTTP/1.1" 200 None


Unnamed: 0,ragQueryTraceId,createdAt,updatedAt,paramSet,query,context,fullyResolvedPrompt,input,output,eventData,...,metadata,traceId,ragIngestionTraceId,creatorId,projectId,organizationId,visibility,active,ragIngestionTrace,annotations
0,clw6m3qbe001bqpmolpyr2njs,2024-05-14T16:33:23.451Z,2024-05-14T16:33:23.451Z,"{'similarity_top_k': 5, 'retrieved_node_ids': ...","{""query"":""Based on the context, discuss the au...","{""context"":[""Only Harvard accepted me, so that...","{""fully_resolved_prompt"":""\nContext informatio...","{'query': '{""query"":""Based on the context, dis...","{""llm_output"":""The author's initial enthusiasm...",,...,,786f1ea8345886c2b80a6ae8c42eb0b8,,clkrgxm850004phi6ee5mvhd1,,,MEMBER,True,,[]
1,clw6m37an0039pb1a0te49e2x,2024-05-14T16:32:58.800Z,2024-05-14T16:32:58.800Z,"{'similarity_top_k': 5, 'retrieved_node_ids': ...","{""query"":""Discuss the influence of Heinlein's ...","{""context"":[""I couldn't have put this into wor...","{""fully_resolved_prompt"":""\nContext informatio...","{'query': '{""query"":""Discuss the influence of ...","{""llm_output"":""Heinlein's novel \""The Moon is ...",,...,,f3ad21b11573d8d68754cd25a56c7939,,clkrgxm850004phi6ee5mvhd1,,,MEMBER,True,,[]
2,clw6m2tc70037pb1ah43yykfo,2024-05-14T16:32:40.711Z,2024-05-14T16:32:40.711Z,"{'similarity_top_k': 5, 'retrieved_node_ids': ...","{""query"":""Discuss the author's initial interes...","{""context"":[""I couldn't have put this into wor...","{""fully_resolved_prompt"":""\nContext informatio...","{'query': '{""query"":""Discuss the author's init...","{""llm_output"":""The author initially planned to...",,...,,34b652175b9e60e1d9270cfcf54c7e43,,clkrgxm850004phi6ee5mvhd1,,,MEMBER,True,,[]
3,clw6m265l0243pe2b1osn2x49,2024-05-14T16:32:10.665Z,2024-05-14T16:32:10.665Z,"{'similarity_top_k': 5, 'retrieved_node_ids': ...","{""query"":""Discuss the transition from using th...","{""context"":[""I was puzzled by the 1401. I coul...","{""fully_resolved_prompt"":""\nContext informatio...","{'query': '{""query"":""Discuss the transition fr...","{""llm_output"":""In the text, the author's early...",,...,,d30a45a3bf2960cfd19a5c067519c03f,,clkrgxm850004phi6ee5mvhd1,,,MEMBER,True,,[]
4,clw6m18mw0241pe2bz55au62k,2024-05-14T16:31:27.225Z,2024-05-14T16:31:27.225Z,"{'similarity_top_k': 5, 'retrieved_node_ids': ...","{""query"":""Describe the author's early experien...","{""context"":[""What I Worked On\n\nFebruary 2021...","{""fully_resolved_prompt"":""\nContext informatio...","{'query': '{""query"":""Describe the author's ear...","{""llm_output"":""The author's initial foray into...",,...,,2c424fca6a6c7c1231d7fc09892f1a55,,clkrgxm850004phi6ee5mvhd1,,,MEMBER,True,,[]


## Step 5: Run Evaluators from Query Trace Events Data

This directly creates evaluation sets using the method `evaluate_rag_outputs()` without the need to create intermediate test cases and test sets. All you need is to define your query trace event rows in a dataframe.
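
The cell below scores retrieval with `MRR` and `HitRate` from `llama_index.core.evaluation`. To make the numbers easier to interpret, here is a plain-Python sketch of the textbook definitions of both metrics for a single query. It is an illustration only, not the LlamaIndex implementation, and `reciprocal_rank`, `hit`, and the node IDs are hypothetical.

```python
def reciprocal_rank(retrieved_ids: list[str], expected_ids: list[str]) -> float:
    """1 / rank of the first retrieved ID that is expected, or 0.0 if none is."""
    expected = set(expected_ids)
    for rank, node_id in enumerate(retrieved_ids, start=1):
        if node_id in expected:
            return 1.0 / rank
    return 0.0


def hit(retrieved_ids: list[str], expected_ids: list[str]) -> float:
    """1.0 if any expected ID appears among the retrieved IDs, else 0.0."""
    expected = set(expected_ids)
    return 1.0 if any(node_id in expected for node_id in retrieved_ids) else 0.0


# Hypothetical example: the expected node is the third of five retrieved nodes
retrieved = ["n7", "n2", "n4", "n9", "n1"]
expected = ["n4"]
rr = reciprocal_rank(retrieved, expected)
hit_score = hit(retrieved, expected)
print(rr, hit_score)
```

Under these definitions, a reciprocal rank of about 0.33 means the ground-truth node was the third result, and 1.0 means it was the first.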

In [19]:
from lastmile_eval.rag.debugger.api.evaluation import evaluate_rag_outputs
from lastmile_eval.rag.debugger.tracing import (
    get_query_trace_event,
)

from lastmile_eval.text.metrics import calculate_rouge1_score
from llama_index.core.evaluation import HitRate, MRR


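def wrap_rouge1(df: pd.DataFrame):
    # Note: `groundTruth` holds expected node IDs (assigned further down), not
    # reference answers, so ROUGE-1 against the LLM output is expected to be near 0
    return calculate_rouge1_score(df["output"].tolist(), df["groundTruth"].tolist())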

def extract_data_to_evaluate(
    row: pd.Series,
) -> tuple[list[str], list[str]]:
    trace_query_id: str = row["ragQueryTraceId"]
    trace_query_data = get_query_trace_event(trace_query_id)
    retrieved_node_ids = trace_query_data["paramSet"]["retrieved_node_ids"]
    expected_node_ids: list[str] = [row["groundTruth"]]
    return (retrieved_node_ids, expected_node_ids)


def wrap_llama_index_evaluator(
    retrieved_and_expected_node_ids_tuple: tuple[list[str], list[str]],
    evaluator: HitRate | MRR,
) -> float:
    retrieved_node_ids, expected_node_ids = (
        retrieved_and_expected_node_ids_tuple
    )
    return evaluator.compute(
        retrieved_ids=retrieved_node_ids, expected_ids=expected_node_ids
    ).score

# Example using a row-level function on the dataframe
def compute_mrr(df: pd.DataFrame):
    """
    We are demonstrating methods that are applied across a row instead of
    entire dataframe, such as the MRR and Hit Rate metrics from the 
    llama_index.core.evaluation package. In order to do this, we define a
    method at the row level where we:
    
    1. Extract the data to evaluate from the row
    2. Run the evaluators on this extracted data
    
    After that's done, we pass this row-level method to df.apply()
    """
    def _evaluate_row(row: pd.Series) -> float:
        node_id_tuple = extract_data_to_evaluate(row)
        return wrap_llama_index_evaluator(node_id_tuple, MRR())
    
    return df.apply(_evaluate_row, axis=1)

def compute_hit_rate(df: pd.DataFrame):
    """
    Another row-function example with hit_rate
    """
    def _evaluate_row(row: pd.Series) -> float:
        node_id_tuple = extract_data_to_evaluate(row)
        return wrap_llama_index_evaluator(node_id_tuple, HitRate())
    
    return df.apply(_evaluate_row, axis=1)
    
trace_level_evaluators = {
    "rouge1": wrap_rouge1,
    "mrr": compute_mrr,
    "hit_rate": compute_hit_rate,
}

# We must add groundTruth to the dataframe
query_trace_events_df["groundTruth"] = expected_node_ids

eval_result = evaluate_rag_outputs(
    project_id="can be anything for now",
    trace_level_evaluators=trace_level_evaluators,
    dataset_level_evaluators={},
    df=query_trace_events_df,
    lastmile_api_token=LASTMILE_API_TOKEN,
    evaluation_set_name="Cool new evaluation set name"
)

# Print out the result
eval_result

2024-05-14 12:37:29,222 - Starting new HTTPS connection (1): lastmileai.dev:443


data={'name': 'Cool new evaluation set name', 'testCases': [{'query': '{"query":"Based on the context, discuss the author\'s initial enthusiasm for Artificial Intelligence and how his perspective changed during his first year of graduate school. What were his observations about the limitations of AI in understanding natural language?"}', 'context': '{"context":["Only Harvard accepted me, so that was where I went.\\n\\nI don\'t remember the moment it happened, or if there even was a specific moment, but during the first year of grad school I realized that AI, as practiced at the time, was a hoax. By which I mean the sort of AI in which a program that\'s told \\"the dog is sitting on the chair\\" translates this into some formal representation and adds it to the list of things it knows.\\n\\nWhat these programs really showed was that there\'s a subset of natural language that\'s a formal language. But a very proper subset. It was clear that there was an unbridgeable gap between what they

2024-05-14 12:37:29,440 - https://lastmileai.dev:443 "POST /api/evaluation_test_sets/create HTTP/1.1" 200 300
2024-05-14 12:37:29,443 - Starting new HTTPS connection (1): lastmileai.dev:443
2024-05-14 12:37:29,612 - https://lastmileai.dev:443 "GET /api/evaluation_test_cases/list HTTP/1.1" 200 None
2024-05-14 12:37:29,644 - Starting new HTTPS connection (1): s3.amazonaws.com:443
2024-05-14 12:37:29,712 - https://s3.amazonaws.com:443 "HEAD /datasets.huggingface.co/datasets/metrics/evaluate-metric/rouge/evaluate-metric/rouge.py HTTP/1.1" 404 0
2024-05-14 12:37:29,716 - Starting new HTTPS connection (1): huggingface.co:443
2024-05-14 12:37:29,777 - https://huggingface.co:443 "HEAD /spaces/evaluate-metric/rouge/resolve/v0.4.1/rouge.py HTTP/1.1" 404 0
2024-05-14 12:37:29,779 - Starting new HTTPS connection (1): huggingface.co:443
2024-05-14 12:37:29,839 - https://huggingface.co:443 "HEAD /spaces/evaluate-metric/rouge/resolve/main/rouge.py HTTP/1.1" 200 0
2024-05-14 12:37:29,845 - Attempting 

data={'testSetId': 'clw6m903n0259pe2bw8f7venr', 'name': 'Cool new evaluation set name', 'evaluationMetrics': [{'testCaseId': 'clw6m9043025bpe2bir498m2f', 'metricName': 'rouge1', 'metricValue': 0.0}, {'testCaseId': 'clw6m9043025cpe2b5e94d8r4', 'metricName': 'rouge1', 'metricValue': 0.0}, {'testCaseId': 'clw6m9043025dpe2bifcoejom', 'metricName': 'rouge1', 'metricValue': 0.0}, {'testCaseId': 'clw6m9043025epe2b4ab91sgr', 'metricName': 'rouge1', 'metricValue': 0.0}, {'testCaseId': 'clw6m9043025fpe2b6ocaw7rf', 'metricName': 'rouge1', 'metricValue': 0.0}, {'testCaseId': 'clw6m9043025bpe2bir498m2f', 'metricName': 'mrr', 'metricValue': 0.3333333333333333}, {'testCaseId': 'clw6m9043025cpe2b5e94d8r4', 'metricName': 'mrr', 'metricValue': 1.0}, {'testCaseId': 'clw6m9043025dpe2bifcoejom', 'metricName': 'mrr', 'metricValue': 0.3333333333333333}, {'testCaseId': 'clw6m9043025epe2b4ab91sgr', 'metricName': 'mrr', 'metricValue': 1.0}, {'testCaseId': 'clw6m9043025fpe2b6ocaw7rf', 'metricName': 'mrr', 'metri

CreateEvaluationsResult(success=True, message='{"id":"clw6m94wy0030qyql2lqcugqe","createdAt":"2024-05-14T16:37:35.650Z","updatedAt":"2024-05-14T16:37:35.650Z","name":"Cool new evaluation set name","paramSet":null,"testSetId":"clw6m903n0259pe2bw8f7venr","creatorId":"clkrgxm850004phi6ee5mvhd1","projectId":null,"organizationId":null,"visibility":"MEMBER","metadata":null,"active":true}', df_metrics_trace=                   testSetId                 testCaseId metricName     value
0  clw6m903n0259pe2bw8f7venr  clw6m9043025bpe2bir498m2f     rouge1  0.000000
1  clw6m903n0259pe2bw8f7venr  clw6m9043025cpe2b5e94d8r4     rouge1  0.000000
2  clw6m903n0259pe2bw8f7venr  clw6m9043025dpe2bifcoejom     rouge1  0.000000
3  clw6m903n0259pe2bw8f7venr  clw6m9043025epe2b4ab91sgr     rouge1  0.000000
4  clw6m903n0259pe2bw8f7venr  clw6m9043025fpe2b6ocaw7rf     rouge1  0.000000
0  clw6m903n0259pe2bw8f7venr  clw6m9043025bpe2bir498m2f        mrr  0.333333
1  clw6m903n0259pe2bw8f7venr  clw6m9043025cpe2b5e94d8r4  
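
To help read the `mrr` values above, here is a minimal plain-Python sketch of the standard definitions of the two retrieval metrics used in this cell. This is not the Llama Index implementation, and the function names and node ids are made up for illustration. Hit rate is 1.0 when any expected node id appears among the retrieved ids; reciprocal rank is 1 / (rank of the first expected id retrieved), or 0.0 when none is retrieved.

```python
from typing import Sequence


def hit_rate(expected_ids: Sequence[str], retrieved_ids: Sequence[str]) -> float:
    """Return 1.0 if any expected id was retrieved, else 0.0."""
    expected = set(expected_ids)
    return 1.0 if any(node_id in expected for node_id in retrieved_ids) else 0.0


def reciprocal_rank(expected_ids: Sequence[str], retrieved_ids: Sequence[str]) -> float:
    """Return 1 / rank of the first retrieved id that is expected, else 0.0."""
    expected = set(expected_ids)
    for rank, node_id in enumerate(retrieved_ids, start=1):
        if node_id in expected:
            return 1.0 / rank
    return 0.0


# Hypothetical node ids, for illustration only
expected = ["node-a"]
retrieved_first = ["node-a", "node-b", "node-c"]
retrieved_third = ["node-b", "node-c", "node-a"]

print(reciprocal_rank(expected, retrieved_first))  # 1.0
print(reciprocal_rank(expected, retrieved_third))  # 0.3333333333333333
print(hit_rate(expected, retrieved_third))         # 1.0
print(hit_rate(expected, ["node-x"]))              # 0.0
```

Under these definitions, an `mrr` of 0.333 for a test case would mean the expected node was the third one retrieved, while 1.0 would mean it was retrieved first.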

## Part 6 - Run RAG query function and then evaluations

This convenience function runs your query logic and then your specified evaluators, so you can evaluate your RAG query input/output pairs in a single call.
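
Conceptually, a helper like this does two things: it calls your query function on every input, then hands each output and its ground truth to the evaluators. The sketch below illustrates that flow in plain Python. It is not the `lastmile_eval` implementation: `run_then_evaluate`, `fake_query_flow`, and `exact_match` are hypothetical names, the trace id is a placeholder, and the evaluators are simplified to per-row functions. It also shows why `functools.partial` is handy here: it pre-binds the extra argument so the helper only has to pass the query.

```python
from functools import partial
from typing import Callable


def run_then_evaluate(
    rag_query_fn: Callable[[str], str],
    inputs: list[str],
    ground_truth: list[str],
    evaluators: dict[str, Callable[[str, str], float]],
) -> list[dict[str, object]]:
    """Run the query function on each input, then score every output."""
    rows: list[dict[str, object]] = []
    for query, expected in zip(inputs, ground_truth):
        output = rag_query_fn(query)  # Step 1: run the query logic
        row: dict[str, object] = {"query": query, "output": output}
        for name, evaluator in evaluators.items():  # Step 2: run each evaluator
            row[name] = evaluator(output, expected)
        rows.append(row)
    return rows


def fake_query_flow(query: str, ingestion_trace_id: str) -> str:
    """Stand-in for a real RAG query function (hypothetical)."""
    return f"answer to {query!r} from trace {ingestion_trace_id}"


def exact_match(output: str, expected: str) -> float:
    """Toy evaluator: 1.0 when the output equals the expected answer."""
    return 1.0 if output == expected else 0.0


results = run_then_evaluate(
    # partial fixes ingestion_trace_id, leaving a one-argument function
    rag_query_fn=partial(fake_query_flow, ingestion_trace_id="placeholder-trace-id"),
    inputs=["q1", "q2"],
    ground_truth=["answer to 'q1' from trace placeholder-trace-id", "something else"],
    evaluators={"exact_match": exact_match},
)
for row in results:
    print(row["query"], row["exact_match"])
```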

In [43]:
from functools import partial
from lastmile_eval.rag.debugger.api.evaluation import (
    run_and_evaluate_outputs,
    get_default_rag_trace_level_metrics
)

# Evaluate the relevance of the returned answer to the ground-truth answer.
trace_level_evaluators = get_default_rag_trace_level_metrics(
    names={"relevance"},
    lastmile_api_token=LASTMILE_API_TOKEN
)

inputs = list(qa_dataset.queries.values())

ground_truth_answers = [
    "The author first interacted with programming on a mainframe computer, using punch cards to input Fortran code, which was a challenging and time-consuming process",
    "The transition from the IBM 1401 to microcomputers like the TRS-80 represented a significant step forward in terms of both programming capabilities and user interaction.",
    "A turning point came after reading Nick Bostrom's \"Superintelligence,\" which presented a persuasive argument on the potential of Artificial Intelligence (AI)",
    "Heinlein's \"The Moon is a Harsh Mistress\" and Terry Winograd's SHRDLU heavily influenced the author's decision to pursue AI",
    "The author considered the AI practices during his first year of grad school as a \"hoax\" because they didn't meet his expectations for understanding and interpreting natural language accurately.",
]

evaluate_result = run_and_evaluate_outputs(
    "my_project_id",
    trace_level_evaluators=trace_level_evaluators,
    dataset_level_evaluators={},
    rag_query_fn=partial(
        run_query_flow,
        ingestion_trace_id=ingestion_trace_id
    ),
    inputs=inputs,
    ground_truth=ground_truth_answers
)

2024-05-14 14:19:07,623 - Request options: {'method': 'post', 'url': '/embeddings', 'files': None, 'post_parser': <function Embeddings.create.<locals>.parser at 0x2a89b4790>, 'json_data': {'input': ["Describe the author's early experiences with programming, including the type of computer used, the programming language, and the challenges faced. How did these experiences change with the advent of microcomputers?"], 'model': 'text-embedding-ada-002', 'encoding_format': 'base64'}}
2024-05-14 14:19:07,624 - Sending HTTP Request: POST https://api.openai.com/v1/embeddings
2024-05-14 14:19:07,625 - close.started
2024-05-14 14:19:07,626 - close.complete
2024-05-14 14:19:07,626 - connect_tcp.started host='api.openai.com' port=443 local_address=None timeout=60.0 socket_options=None
2024-05-14 14:19:07,663 - connect_tcp.complete return_value=<httpcore._backends.sync.SyncStream object at 0x2a9eb9cf0>
2024-05-14 14:19:07,664 - start_tls.started ssl_context=<ssl.SSLContext object at 0x2a55cec40> ser

data={'name': 'Test Set', 'testCases': [{'query': "Describe the author's early experiences with programming, including the type of computer used, the programming language, and the challenges faced. How did these experiences change with the advent of microcomputers?", 'output': "The author's early experiences with programming began during 9th grade when he and his friend Rich Draves were permitted to use the IBM 1401 that their school district used for data processing, located in the basement of their junior high school. They used an early version of Fortran as their programming language and had to type programs on punch cards, then stack them in the card reader to load the program into memory and run it. The author faced challenges with this type of computer since it was not interactive; the only form of input to programs was data stored on punch cards. He didn't have any data stored on punch cards, and he didn't know enough math to do anything interesting without relying on the input.

2024-05-14 14:20:43,941 - https://lastmileai.dev:443 "GET /api/evaluation_test_cases/list HTTP/1.1" 200 None
2024-05-14 14:20:43,946 - load_ssl_context verify=True cert=None trust_env=True http2=False
2024-05-14 14:20:43,946 - load_verify_locations cafile='/opt/homebrew/Caskroom/miniconda/base/envs/env310/lib/python3.10/site-packages/certifi/cacert.pem'
2024-05-14 14:20:43,967 - load_ssl_context verify=True cert=None trust_env=True http2=False
2024-05-14 14:20:43,967 - load_verify_locations cafile='/opt/homebrew/Caskroom/miniconda/base/envs/env310/lib/python3.10/site-packages/certifi/cacert.pem'
2024-05-14 14:20:43,985 - load_ssl_context verify=True cert=None trust_env=True http2=False
2024-05-14 14:20:43,986 - load_verify_locations cafile='/opt/homebrew/Caskroom/miniconda/base/envs/env310/lib/python3.10/site-packages/certifi/cacert.pem'
2024-05-14 14:20:44,002 - load_ssl_context verify=True cert=None trust_env=True http2=False
2024-05-14 14:20:44,002 - load_verify_locations cafile='/o

data={'testSetId': 'clw6pxrob0072petb9bhf862l', 'name': 'Test Set', 'evaluationMetrics': [{'testCaseId': 'clw6pxroh0073petbi46h40wl', 'metricName': 'relevance', 'metricValue': 1.0}, {'testCaseId': 'clw6pxroh0074petbgt1zaojm', 'metricName': 'relevance', 'metricValue': 1.0}, {'testCaseId': 'clw6pxroh0075petbkiogdcj9', 'metricName': 'relevance', 'metricValue': 1.0}, {'testCaseId': 'clw6pxroh0076petbdmhlvb2s', 'metricName': 'relevance', 'metricValue': 1.0}, {'testCaseId': 'clw6pxroh0077petbeykv1kct', 'metricName': 'relevance', 'metricValue': 1.0}], 'evaluationSetMetrics': [{'metricName': 'relevance_mean', 'metricValue': 0.8}, {'metricName': 'relevance_std', 'metricValue': 0.4472135954999579}, {'metricName': 'relevance_count', 'metricValue': 5.0}]}


In [38]:
evaluate_result.df_metrics_dataset

| | testSetId | metricName | value |
|---|---|---|---|
| 0 | clw6odszj002hqu5eegwo1g2a | relevance_mean | 0.8 |
| 0 | clw6odszj002hqu5eegwo1g2a | relevance_std | 0.447214 |
| 0 | clw6odszj002hqu5eegwo1g2a | relevance_count | 5.0 |
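
The three dataset-level rows look like plain aggregates of the per-trace `relevance` scores. As a sanity check, and assuming they are the mean, the sample standard deviation, and the count (an assumption, not something confirmed by the API), five scores with four 1.0s and one 0.0 reproduce the numbers shown above. The scores below are illustrative, not read from the API.

```python
import statistics

# Illustrative per-trace relevance scores: four hits and one miss
relevance_scores = [1.0, 1.0, 1.0, 1.0, 0.0]

aggregates = {
    "relevance_mean": statistics.mean(relevance_scores),
    # statistics.stdev is the sample standard deviation (divides by n - 1)
    "relevance_std": statistics.stdev(relevance_scores),
    "relevance_count": float(len(relevance_scores)),
}

for name, value in aggregates.items():
    print(f"{name}: {value:.6f}")
```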


In [41]:
evaluate_result.df_metrics_trace

| | testSetId | testCaseId | metricName | value |
|---|---|---|---|---|
| 0 | clw6odszj002hqu5eegwo1g2a | clw6odszo002iqu5e08dyuxmt | relevance | 0.0 |
| 1 | clw6odszj002hqu5eegwo1g2a | clw6odszo002jqu5eyn72fsji | relevance | 1.0 |
| 2 | clw6odszj002hqu5eegwo1g2a | clw6odszo002kqu5ef6dqqqcu | relevance | 0.0 |
| 3 | clw6odszj002hqu5eegwo1g2a | clw6odszo002lqu5eawdwljjf | relevance | 1.0 |
| 4 | clw6odszj002hqu5eegwo1g2a | clw6odszo002mqu5eqhhf5kq1 | relevance | 1.0 |


In [42]:
import requests
from requests import Response
from typing import Any

# TODO: Save this as its own SDK helper in the lastmile-eval package
def list_evaluation_sets(
    take: int = 10,
    # TODO: Create macro for default timeout value
    timeout: int = 60,
) -> dict[str, Any]:  # TODO: Define explicit typing for the JSON response
    """
    Get a list of evaluation sets from the LastMile API.

    Args:
        take: The number of evaluation sets to return. The default is 10.
        timeout: The maximum time in seconds to wait for the request to complete.
            The default is 60.

    Note:
        The request is authenticated with the LASTMILE_API_TOKEN variable
        defined earlier in this notebook. You can create a token from the
        "API Tokens" section of this website:
        https://lastmileai.dev/settings?page=tokens

    Returns:
        A dictionary containing the evaluation sets.
    """
    lastmile_endpoint = f"https://lastmileai.dev/api/evaluation_sets/list?pageSize={take}"

    response: Response = requests.get(
        lastmile_endpoint,
        headers={"Authorization": f"Bearer {LASTMILE_API_TOKEN}"},
        timeout=timeout,
    )
    # TODO: Handle response errors
    return response.json()

evaluation_sets = list_evaluation_sets(take=2)
evaluation_sets_df = pd.DataFrame.from_records(evaluation_sets["evaluationSets"]).rename(  # type: ignore[fixme]
    columns={"id": "evaluationSetId"}
)
pd.set_option('display.max_colwidth', 200)
evaluation_sets_df

# TODO: evaluationSetMetrics looks a bit weird; should probably have a helper
# method to display it better, but it's ok for now

2024-05-14 13:37:52,741 - Starting new HTTPS connection (1): lastmileai.dev:443


2024-05-14 13:37:52,839 - https://lastmileai.dev:443 "GET /api/evaluation_sets/list?pageSize=2 HTTP/1.1" 200 None


| | evaluationSetId | createdAt | updatedAt | name | paramSet | testSetId | creatorId | projectId | organizationId | visibility | metadata | active | evaluationSetMetrics | testSet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | clw6oe1j2001spbr76i941iu0 | 2024-05-14T17:37:23.773Z | 2024-05-14T17:37:23.773Z | Test Set | | clw6odszj002hqu5eegwo1g2a | clkrgxm850004phi6ee5mvhd1 | | | MEMBER | | True | `[{'createdAt': '2024-05-14T17:37:23.773Z', 'updatedAt': '2024-05-14T17:37:23.773Z', 'metricName': 'relevance_mean', 'metricValue': 0.8, 'evaluationSetId': 'clw6oe1j2001spbr76i941iu0', 'creatorId':...` | `{'id': 'clw6odszj002hqu5eegwo1g2a', 'name': 'Test Set'}` |
| 1 | clw6nokww0070qpmobk8dwl1y | 2024-05-14T17:17:35.840Z | 2024-05-14T17:17:35.840Z | Test Set | | clw6noc7q006xqu1pyiec6hwx | clkrgxm850004phi6ee5mvhd1 | | | MEMBER | | True | `[{'createdAt': '2024-05-14T17:17:35.840Z', 'updatedAt': '2024-05-14T17:17:35.840Z', 'metricName': 'relevance_mean', 'metricValue': 0.6, 'evaluationSetId': 'clw6nokww0070qpmobk8dwl1y', 'creatorId':...` | `{'id': 'clw6noc7q006xqu1pyiec6hwx', 'name': 'Test Set'}` |
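
One possible shape for the display helper mentioned in the TODO above: flatten each `evaluationSetMetrics` list into a `{metricName: metricValue}` dict, which is easier to read in a dataframe column. This is only a sketch. `flatten_metrics` is a hypothetical name, and the sample records are trimmed, illustrative stand-ins for the API response that keep only the `metricName` and `metricValue` fields visible in the output above.

```python
from typing import Any


def flatten_metrics(metrics: list[dict[str, Any]]) -> dict[str, float]:
    """Collapse a list of metric records into {metricName: metricValue}."""
    return {m["metricName"]: m["metricValue"] for m in metrics}


# Illustrative records (not real API output); real ones carry more fields
sample_metrics = [
    {"metricName": "relevance_mean", "metricValue": 0.8},
    {"metricName": "relevance_count", "metricValue": 5.0},
]

print(flatten_metrics(sample_metrics))
```

With the dataframe above, this could be applied as `evaluation_sets_df["evaluationSetMetrics"].apply(flatten_metrics)`, assuming every row holds a list of such records.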
