## Step 1: Import modules and packages, download reference docs

This notebook walks through the entire LastMile AI Eval flow:

1. Create an ingestion trace.
2. Generate a query + ground-truth context pair for each node context in a document.
   - Run those queries as RAG query traces to get the context that is actually retrieved.
3. List the query traces I want to include in a test set (defaults to the last N queries for now).
4. Create a test set from the given query traces, storing the ground truth for the context associated with each query.
5. Create evaluation metrics based on the ones provided by LlamaIndex.
   - Note: these come mainly from LlamaIndex, so the evaluation metrics focus only on retrieval, not on outputs (though I store those as output events too).
6. Create an evaluation set by feeding these metrics the test set we just created.

Some notes:
- No manual ID grepping is needed: helper functions take care of it. This probably needs a better design in the future; the focus so far was on getting unblocked.
- `ingestion_trace_id` needs to be refactored to map to the trace level instead of being passed when marking each RAG query event (right now it doesn't work; I'll add that later).
- Some other small convenience functions need to be added to the API, such as a helper for `list_evaluation_sets()`.


In [None]:
!pip install chromadb 
!pip install llama-index
!pip install lastmile-eval --upgrade

## 1. Setup
- Install required packages
- Download sample data

In [1]:
import os
import dotenv
# Set API Keys
# You can get your OPENAI_API_KEY from https://platform.openai.com/api-keys

dotenv.load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "")
LASTMILE_API_TOKEN = os.getenv("LASTMILE_API_TOKEN", "")

# Fail early with a clear message if either key is missing
assert len(OPENAI_API_KEY) > 0, "OPENAI_API_KEY is not set"
assert len(LASTMILE_API_TOKEN) > 0, "LASTMILE_API_TOKEN is not set"
print(f"{len(OPENAI_API_KEY)=}, {len(LASTMILE_API_TOKEN)=}")

os.environ["LASTMILE_API_TOKEN"] = LASTMILE_API_TOKEN
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

len(OPENAI_API_KEY)=51, len(LASTMILE_API_TOKEN)=324
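
A missing key is easiest to diagnose when the error names the variable. The following is a minimal sketch of a fail-fast lookup; `require_env` and `EXAMPLE_TOKEN` are hypothetical names used only for illustration and are not part of `lastmile-eval` or `python-dotenv`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise with a clear message."""
    value = os.environ.get(name, "")
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set or is empty")
    return value


# Demo with a throwaway variable so no real secret is needed.
os.environ["EXAMPLE_TOKEN"] = "abc123"
print(len(require_env("EXAMPLE_TOKEN")))
```

In this notebook the same helper could be called once for `OPENAI_API_KEY` and once for `LASTMILE_API_TOKEN` after `dotenv.load_dotenv()`.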


## Step 2: Run and Trace Ingestion Pipeline

Let's make a basic ingestion pipeline using chromadb. We will use a sample essay by Paul Graham.

- Run the first cell below to download the sample data.
- Run the second cell to create an ingestion pipeline, which is traced with the `@traced` decorator.

In [None]:
# Download sample data
!mkdir -p 'data/paul_graham/'
!curl 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -o 'data/paul_graham/paul_graham_essay.txt'

In [None]:
from lastmile_eval.rag.debugger.tracing import get_lastmile_tracer
from lastmile_eval.rag.debugger.tracing.decorators import traced

import chromadb
import os
chroma_client = chromadb.Client()

# Instantiate a tracer object
tracer = get_lastmile_tracer("Paul-Graham")


@traced(tracer)  # Decorate the function with the tracer
def chunk_document(file_path: str, chunk_size: int = 1000) -> list[str]:
    """
    Chunk a text file into a list of strings based on the specified chunk size.

    Args:
        file_path (str): The path to the text file.
        chunk_size (int): The desired number of characters in each chunk.

    Returns:
        list[str]: A list of strings, where each string represents a chunk of text.
    """
    with open(file_path, "r") as file:
        text = file.read()

    chunks: list[str] = []
    for i in range(0, len(text), chunk_size):
        chunks.append(text[i:i + chunk_size])

    return chunks

@traced(tracer)
def run_ingestion_flow() -> chromadb.Collection:
    collection = chroma_client.create_collection(name="my_collection")
    
    document_chunks = chunk_document("data/paul_graham/paul_graham_essay.txt")
    document_ids = [f"chunk_{i}" for i in range(len(document_chunks))]

    collection.add(
        ids=document_ids,
        documents=document_chunks, # ex: ["What I Worked On", "February 2021", ...]
    )
    return collection

collection: chromadb.Collection = run_ingestion_flow()
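
To make the chunking step concrete, here is a small standalone sketch of the same fixed-size character split applied to a ten-character string. `chunk_text` is an illustrative name; it works on an in-memory string rather than a file and is not traced.

```python
def chunk_text(text: str, chunk_size: int) -> list[str]:
    """Split `text` into consecutive pieces of at most `chunk_size` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


sample = "abcdefghij"  # 10 characters
chunks = chunk_text(sample, chunk_size=4)
ids = [f"chunk_{i}" for i in range(len(chunks))]
print(list(zip(ids, chunks)))
# [('chunk_0', 'abcd'), ('chunk_1', 'efgh'), ('chunk_2', 'ij')]
```

The last chunk is shorter than `chunk_size`, and because the split ignores word and sentence boundaries, a chunk can begin or end mid-word.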

In [11]:
# Let's print the trace data from Jaeger to
# show you what it looks like (search for "operationName" in the data)

from lastmile_eval.rag.debugger.tracing import (
    get_latest_ingestion_trace_id,
    get_trace_data,
)

ingestion_trace_id = get_latest_ingestion_trace_id()
get_trace_data(ingestion_trace_id)

2024-05-14 18:15:52,369 - Starting new HTTPS connection (1): lastmileai.dev:443
2024-05-14 18:15:52,467 - https://lastmileai.dev:443 "GET /api/rag_ingestion_traces/list?pageSize=1 HTTP/1.1" 200 596
2024-05-14 18:15:52,471 - Starting new HTTPS connection (1): lastmileai.dev:443
2024-05-14 18:15:52,661 - https://lastmileai.dev:443 "GET /api/trace/read?id=2d1db49cbcbf707f8bea0cccbe43946a HTTP/1.1" 200 None


{'data': [{'traceID': '2d1db49cbcbf707f8bea0cccbe43946a',
   'spans': [{'traceID': '2d1db49cbcbf707f8bea0cccbe43946a',
     'spanID': 'd4ad661565f03ae2',
     'operationName': 'ingestion-root-span',
     'references': [],
     'startTime': 1715705744770425,
     'duration': 1466427,
     'tags': [{'key': 'doc_file_paths',
       'type': 'string',
       'value': "['/Users/jonathan/Projects/eval-cookbook/examples/data/paul_graham/paul_graham_essay.txt']"},
      {'key': 'span.kind', 'type': 'string', 'value': 'internal'},
      {'key': 'internal.span.format', 'type': 'string', 'value': 'otlp'}],
     'logs': [],
     'processID': 'p1',
    {'traceID': '2d1db49cbcbf707f8bea0cccbe43946a',
     'spanID': '01b23423d2a3313d',
     'operationName': 'create-document-nodes',
     'references': [{'refType': 'CHILD_OF',
       'traceID': '2d1db49cbcbf707f8bea0cccbe43946a',
       'spanID': 'd4ad661565f03ae2'}],
     'startTime': 1715705744783201,
     'duration': 221595,
     'tags': [{'key': 'ch
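
The raw trace is a list of spans, where each child span points at its parent through a `CHILD_OF` reference. As a rough sketch of how to read that structure, the snippet below prints the span names as an indented tree. The `trace_data` dictionary is a trimmed, made-up stand-in that keeps only the fields used here; `span_tree_lines` is a hypothetical helper, not part of `lastmile-eval`.

```python
from typing import Any, Optional

# Trimmed-down stand-in for the structure printed above (illustrative IDs).
trace_data: dict[str, Any] = {
    "data": [
        {
            "traceID": "t1",
            "spans": [
                {"spanID": "a", "operationName": "ingestion-root-span", "references": []},
                {
                    "spanID": "b",
                    "operationName": "create-document-nodes",
                    "references": [{"refType": "CHILD_OF", "traceID": "t1", "spanID": "a"}],
                },
            ],
        }
    ]
}


def span_tree_lines(trace: dict[str, Any]) -> list[str]:
    """Return one indented line per span, with children listed under their parent."""
    spans = trace["data"][0]["spans"]

    # Group spans by the ID of their parent (None for root spans).
    children: dict[Optional[str], list[dict[str, Any]]] = {}
    for span in spans:
        parent_id = next(
            (ref["spanID"] for ref in span["references"] if ref["refType"] == "CHILD_OF"),
            None,
        )
        children.setdefault(parent_id, []).append(span)

    lines: list[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for span in children.get(parent_id, []):
            lines.append("  " * depth + span["operationName"])
            walk(span["spanID"], depth + 1)

    walk(None, 0)
    return lines


print("\n".join(span_tree_lines(trace_data)))
```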

In [10]:
# Now let's fetch the trace event data from our Postgres table
# Notice that the `traceId` column matches the raw trace data

import pandas as pd

from lastmile_eval.rag.debugger.tracing import list_ingestion_trace_events

ingestion_trace_events = list_ingestion_trace_events(take=1)
pd.DataFrame.from_records(ingestion_trace_events["ingestionTraces"]).rename(  # type: ignore[fixme]
    columns={"id": "ragIngestionTraceEventId"}
)

2024-05-14 18:15:48,489 - Starting new HTTPS connection (1): lastmileai.dev:443
2024-05-14 18:15:48,655 - https://lastmileai.dev:443 "GET /api/rag_ingestion_traces/list?pageSize=1 HTTP/1.1" 200 596

## Part 3: Run and Trace Query Pipeline

Now that we have an ingestion pipeline built, let's build a query pipeline. We will use an OpenAI model to generate responses to user queries, and we will trace this pipeline with the `@traced` decorator.
Note: the query pipeline and the ingestion pipeline are separate, so we need to link them together with the `ingestion_trace_id`.

In [None]:
import openai
from lastmile_eval.rag.debugger.api import (
    QueryReceived,
    ContextRetrieved,
    PromptResolved,
    LLMOutputReceived,
)

LLM_NAME = "gpt-3.5-turbo"


PROMPT_TEMPLATE = """
Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge, answer the query.
Query: {query_str}
Answer:
"""

@traced(tracer, name="retrieve-context") #Decorate the function with the tracer
def retrieve_context(query_string: str, top_k: int = 5) -> list[str]:
    """
    Retrieve the top-k most relevant contexts based on the query string
    from the chroma db collection
    """
    tracer.register_param("similarity_top_k", top_k)
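    chroma_retrieval_results = collection.query(
        query_texts=[query_string],  # Chroma will embed this for you
        n_results=top_k,  # how many results to return
    )
    documents_parsed_as_strings = list(chroma_retrieval_results["documents"][0])
    # Record the retrieved chunk IDs so the retrieval evaluators can read them
    # from the trace's paramSet later (assumes register_param accepts a list)
    tracer.register_param("retrieved_node_ids", chroma_retrieval_results["ids"][0])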
    tracer.mark_rag_query_trace_event(
            ContextRetrieved(context=documents_parsed_as_strings), get_latest_ingestion_trace_id()
        )
    return documents_parsed_as_strings

@traced(tracer, name="resolve-prompt")
def resolve_prompt(user_query: str, retrieved_contexts: list[str]):
    resolved_prompt = PROMPT_TEMPLATE.replace(
        "{context_str}", "\n\n\n".join(retrieved_contexts)
    ).replace("{query_str}", user_query)
    tracer.mark_rag_query_trace_event(
        PromptResolved(fully_resolved_prompt=resolved_prompt), get_latest_ingestion_trace_id()
    )
    return resolved_prompt


@traced(tracer=tracer, name="query-root-span") # You can also provide a custom name for the root span
def run_query_flow(user_query: str, ingestion_trace_id: str):
    tracer.mark_rag_query_trace_event(
        QueryReceived(query=user_query), ingestion_trace_id
    )

    retrieved_contexts = retrieve_context(user_query, top_k=3)

    resolved_prompt = resolve_prompt(user_query, retrieved_contexts)

    with tracer.start_as_current_span("call-llm") as _llm_span:
        openai_client = openai.Client(api_key=os.getenv("OPENAI_API_KEY"))
        response = openai_client.chat.completions.create(
            model=LLM_NAME,
            messages=[{"role": "user", "content": resolved_prompt}],
        )
        output: str = response.choices[0].message.content
        tracer.mark_rag_query_trace_event(
            LLMOutputReceived(llm_output=output), ingestion_trace_id
        )

    return output

In [None]:
# TODO: Right now the ingestion_trace_id passed to mark_rag_query_trace_event
# is a no-op due to changed assumptions; I'll fix this later
response = run_query_flow("What did the author do growing up?", ingestion_trace_id)

print(f"Response: {response}")
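
`resolve_prompt` fills the template with two chained `str.replace` calls. The standalone sketch below mirrors that logic on a shortened template so the resulting prompt text is easy to inspect; `TEMPLATE` and `fill_template` are illustrative names, and no tracing or LLM call is involved.

```python
TEMPLATE = "Context:\n{context_str}\nQuery: {query_str}\nAnswer:"


def fill_template(template: str, contexts: list[str], query: str) -> str:
    """Join the contexts and substitute both placeholders, mirroring resolve_prompt."""
    joined = "\n\n\n".join(contexts)
    return template.replace("{context_str}", joined).replace("{query_str}", query)


prompt = fill_template(TEMPLATE, ["chunk one", "chunk two"], "What is this about?")
print(prompt)
```

Because the replacements are chained, a retrieved chunk that happened to contain the literal text `{query_str}` would also be substituted. That is unlikely for this essay, but worth keeping in mind for other corpora.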

In [None]:
# Just like we did with the ingestion trace, let's print out what this
# looks like in the Postgres data, followed by the raw trace data
from lastmile_eval.rag.debugger.tracing import (
    list_query_trace_events,
)

query_trace_events = list_query_trace_events(take=1)
query_trace_events_df = pd.DataFrame.from_records(query_trace_events["queryTraces"]).rename(  # type: ignore[fixme]
    columns={"id": "ragQueryTraceEventsId"}
)
query_trace_events_df



In [None]:
# This is what the trace data looks like
from lastmile_eval.rag.debugger.tracing import (
    get_trace_data,
)

query_trace_id = query_trace_events_df.iloc[0]["traceId"]
get_trace_data(query_trace_id)




## Part 4: Create Test Sets and Run Evaluators

In [None]:
# NOTE: Running this cell on all the nodes will take a while (probably 5-10 minutes), so please be patient

from llama_index.core.evaluation import generate_question_context_pairs
from llama_index.core.schema import TextNode
from llama_index.llms.openai import OpenAI

# Change this to a lower value if you want to run faster
# If this is None, it is ignored and total_queries_per_batch is used instead
total_queries_to_run_override = (
    5  # None
)

# Rebuild the ingested chunks as nodes so that the node IDs match the Chroma
# document IDs (assumption: `nodes` is not defined in any earlier cell)
stored_chunks = collection.get()
nodes = [
    TextNode(id_=chunk_id, text=chunk_text)
    for chunk_id, chunk_text in zip(stored_chunks["ids"], stored_chunks["documents"])
]

# We're now going to artificially generate a bunch of query + context
# (ground truth) pairs. We will then run the `run_query_flow()` method on these
# generated queries later

# Define an LLM
llm = OpenAI(model=LLM_NAME)


# This method `generate_question_context_pairs()` essentially
# calls an LLM to generate questions for us. See this URL for more details:
# https://github.com/run-llama/llama_index/blob/8b373239396134a92c9277b36aa7023c633c018a/llama-index-finetuning/llama_index/finetuning/embeddings/common.py#L49-L64
num_questions_per_chunk = 1
qa_dataset = generate_question_context_pairs(
    nodes[0:total_queries_to_run_override or len(nodes)],
    llm=llm,
    num_questions_per_chunk=num_questions_per_chunk
)

In [None]:
# Run these queries through the `run_query_flow()` method

total_queries_per_batch = len(qa_dataset.queries)
total_queries_to_run = min(total_queries_to_run_override or total_queries_per_batch, total_queries_per_batch)

expected_node_ids: list[str] = []
for i, (query_id, query) in enumerate(qa_dataset.queries.items()):
    run_query_flow(query, ingestion_trace_id)
    associated_node_id_for_query = qa_dataset.relevant_docs[query_id]
    expected_node_ids.append(associated_node_id_for_query[0])

    print(f"Finished running {i+1}/{total_queries_to_run} queries...")
    if i + 1 == total_queries_to_run:
        break

# Have to reverse because the list_query_trace_events() method
# returns the most recent trace events first
expected_node_ids.reverse()
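
The reversal matters because ground truth is later attached to the trace events by position. Here is a tiny standalone illustration with made-up query and chunk IDs, showing how reversing the expected IDs lines them up with a most-recent-first listing.

```python
# Queries in the order they were run (oldest first), with their expected node IDs.
run_order = [("q1", "chunk_0"), ("q2", "chunk_1"), ("q3", "chunk_2")]
expected = [node_id for _, node_id in run_order]

# A listing that returns the most recent trace first yields the reverse order.
listed_most_recent_first = [query for query, _ in reversed(run_order)]

expected.reverse()
aligned = list(zip(listed_most_recent_first, expected))
print(aligned)
# [('q3', 'chunk_2'), ('q2', 'chunk_1'), ('q1', 'chunk_0')]
```

Positional alignment like this only holds if no other query traces are recorded in between; matching on the query text would be a more robust alternative.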

In [None]:
from lastmile_eval.rag.debugger.tracing import list_query_trace_events

query_trace_events = list_query_trace_events(take=total_queries_to_run)
query_trace_events_df = pd.DataFrame.from_records(query_trace_events["queryTraces"]).rename(  # type: ignore[fixme]
    columns={"id": "ragQueryTraceId"}
)
query_trace_events_df

## Part 5: Run Evaluators from Query Trace Event Data

This directly creates evaluation sets using the method `evaluate_rag_outputs()` without the need to create intermediate test cases and test sets. All you need is to define your query trace event rows in a dataframe.

In [None]:
from lastmile_eval.rag.debugger.api.evaluation import evaluate_rag_outputs
from lastmile_eval.rag.debugger.tracing import (
    get_query_trace_event,
)

import pandas as pd

from lastmile_eval.text.metrics import calculate_rouge1_score
from llama_index.core.evaluation import HitRate, MRR


def wrap_rouge1(df: pd.DataFrame):
    return calculate_rouge1_score(df["output"].tolist(), df["groundTruth"].tolist())

def extract_data_to_evaluate(
    row: pd.Series,
) -> tuple[list[str], list[str]]:
    trace_query_id: str = row["ragQueryTraceId"]
    trace_query_data = get_query_trace_event(trace_query_id)
    retrieved_node_ids = trace_query_data["paramSet"]["retrieved_node_ids"]
    expected_node_ids: list[str] = [row["groundTruth"]]
    return (retrieved_node_ids, expected_node_ids)


def wrap_llama_index_evaluator(
    retrieved_and_expected_node_ids_tuple: tuple[list[str], list[str]],
    evaluator: HitRate | MRR,
) -> float:
    retrieved_node_ids, expected_node_ids = (
        retrieved_and_expected_node_ids_tuple
    )
    return evaluator.compute(
        retrieved_ids=retrieved_node_ids, expected_ids=expected_node_ids
    ).score

# Example using a row-level function on the dataframe
def compute_mrr(df: pd.DataFrame):
    """
    We are demonstrating methods that are applied across a row instead of
    entire dataframe, such as the MRR and Hit Rate metrics from the 
    llama_index.core.evaluation package. In order to do this, we define a
    method at the row level where we:
    
    1. Extract the data to evaluate from the row
    2. Run the evaluators on this extracted data
    
    After that's done, we pass this row-level method to df.apply()
    """
    def _evaluate_row(row: pd.Series) -> float:
        node_id_tuple = extract_data_to_evaluate(row)
        return wrap_llama_index_evaluator(node_id_tuple, MRR())
    
    return df.apply(_evaluate_row, axis=1)

def compute_hit_rate(df: pd.DataFrame):
    """
    Another row-function example with hit_rate
    """
    def _evaluate_row(row: pd.Series) -> float:
        node_id_tuple = extract_data_to_evaluate(row)
        return wrap_llama_index_evaluator(node_id_tuple, HitRate())
    
    return df.apply(_evaluate_row, axis=1)
    
trace_level_evaluators = {
    "rouge1": wrap_rouge1,
    "mrr": compute_mrr,
    "hit_rate": compute_hit_rate,
}

# We must add groundTruth to the dataframe
query_trace_events_df["groundTruth"] = expected_node_ids

eval_result = evaluate_rag_outputs(
    project_id="can be anything for now",
    trace_level_evaluators=trace_level_evaluators,
    dataset_level_evaluators={},
    df=query_trace_events_df,
    lastmile_api_token=LASTMILE_API_TOKEN,
    evaluation_set_name="Cool new evaluation set name"
)

# Print out the result
eval_result
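
For intuition about the two retrieval metrics used above, here is a pure-Python sketch of their textbook definitions for a single query. It is not the LlamaIndex implementation, whose `HitRate` and `MRR` classes may differ in details; the function names and chunk IDs are illustrative.

```python
def hit_rate(retrieved_ids: list[str], expected_ids: list[str]) -> float:
    """1.0 if any expected ID appears among the retrieved IDs, else 0.0."""
    return 1.0 if any(doc_id in expected_ids for doc_id in retrieved_ids) else 0.0


def reciprocal_rank(retrieved_ids: list[str], expected_ids: list[str]) -> float:
    """1 / rank of the first retrieved ID that is expected, or 0.0 if none is."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in expected_ids:
            return 1.0 / rank
    return 0.0


retrieved = ["chunk_7", "chunk_2", "chunk_9"]
print(hit_rate(retrieved, ["chunk_2"]))         # 1.0
print(reciprocal_rank(retrieved, ["chunk_2"]))  # 0.5
print(reciprocal_rank(retrieved, ["chunk_4"]))  # 0.0
```

Averaging the reciprocal rank over all queries gives the mean reciprocal rank (MRR) for the evaluation set.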

## Part 6: Run the RAG Query Function and Then Evaluations

`run_and_evaluate_outputs()` is a convenience function that runs your query logic and then your specified evaluators, so you can evaluate RAG query input/output pairs in a single call.

In [None]:
from functools import partial
from lastmile_eval.rag.debugger.api.evaluation import (
    run_and_evaluate_outputs,
    get_default_rag_trace_level_metrics
)

# Evaluate the relevance of the returned answer to the ground-truth answer.
trace_level_evaluators = get_default_rag_trace_level_metrics(    
    names={"relevance"},
    lastmile_api_token=LASTMILE_API_TOKEN
)

inputs = list(qa_dataset.queries.values())

ground_truth_answers = [
    "The author first interacted with programming on a mainframe computer, using punch cards to input Fortran code, which was a challenging and time-consuming process",
    "The transition from the IBM 1401 to microcomputers like the TRS-80 represented a significant step forward in terms of both programming capabilities and user interaction.",
    "A turning point came after reading Nick Bostrom's \"Superintelligence,\" which presented a persuasive argument on the potential of Artificial Intelligence (AI)",
    "Heinlein's \"The Moon is a Harsh Mistress\" and Terry Winograd's SHRDLU heavily influenced the author's decision to pursue AI",
    "The author considered the AI practices during his first year of grad school as a \"hoax\" because they didn't meet his expectations for understanding and interpreting natural language accurately.",
]

evaluate_result = run_and_evaluate_outputs(
    "my_project_id",
    trace_level_evaluators=trace_level_evaluators,
    dataset_level_evaluators={},
    rag_query_fn=partial(
        run_query_flow,
        ingestion_trace_id=ingestion_trace_id
    ),
    inputs=inputs,
    ground_truth=ground_truth_answers
)
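
`functools.partial` is used above to pre-bind `ingestion_trace_id`, so the resulting callable takes only the query. The sketch below shows that binding on a stand-in function, under the assumption that `run_and_evaluate_outputs` calls `rag_query_fn` with each input as its only argument; `run_query` and `"trace-123"` are placeholders.

```python
from functools import partial


def run_query(user_query: str, ingestion_trace_id: str) -> str:
    """Stand-in for run_query_flow; returns a string instead of calling an LLM."""
    return f"[{ingestion_trace_id}] answer to: {user_query}"


# Bind the second argument so the result takes only the query.
query_fn = partial(run_query, ingestion_trace_id="trace-123")

outputs = [query_fn(q) for q in ["first question", "second question"]]
print(outputs)
```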

In [None]:
evaluate_result.df_metrics_dataset

In [None]:
evaluate_result.df_metrics_trace

In [None]:
import os
import requests
from requests import Response
from typing import Any, Optional

# TODO: Save this as its own helper in the lastmile-eval SDK
def list_evaluation_sets(
    take: int = 10,
    lastmile_api_token: Optional[str] = None,
    # TODO: Create macro for default timeout value
    timeout: int = 60,
) -> dict[str, Any]:  # TODO: Define explicit typing for JSON response return
    """
    Get a list of evaluation sets from the LastMile API.

    Args:
        take: The number of evaluation sets to return. The default is 10.
        lastmile_api_token: The API token for the LastMile API. If not provided,
            will try to get the token from the LASTMILE_API_TOKEN
            environment variable.
            You can create a token from the "API Tokens" section from this website:
            https://lastmileai.dev/settings?page=tokens
        timeout: The maximum time in seconds to wait for the request to complete.
            The default is 60.

    Returns:
        A dictionary containing the evaluation sets.
    """
    token = lastmile_api_token or os.environ["LASTMILE_API_TOKEN"]
    lastmile_endpoint = f"https://lastmileai.dev/api/evaluation_sets/list?pageSize={take}"

    response: Response = requests.get(
        lastmile_endpoint,
        headers={"Authorization": f"Bearer {token}"},
        timeout=timeout,
    )
    # Raise on HTTP error statuses instead of trying to parse an error body
    response.raise_for_status()
    return response.json()

evaluation_sets = list_evaluation_sets(take=2)
evaluation_sets_df = pd.DataFrame.from_records(evaluation_sets["evaluationSets"]).rename(  # type: ignore[fixme]
    columns={"id": "evaluationSetId"}
)
pd.set_option('display.max_colwidth', 200)
evaluation_sets_df

# TODO: evaluationSetMetrics looks a bit weird; it should probably have a
# helper method to display it better, but it's OK for now