## Step 1: Import modules and packages, download reference docs

This notebook walks through the entire LastMile AI Eval flow:

1. Create an ingestion trace.
2. Generate a query and ground-truth context pair for each node context in a document.
   - Run those queries as RAG query traces to get the actual retrieved context.
3. List the query traces to include in a test set (defaults to the last N queries for now).
4. Create a test set from the given query traces, storing the ground truth for the context associated with each query.
5. Create evaluation metrics based on the ones provided by LlamaIndex.
   - Note: because these come mainly from LlamaIndex, the metrics focus only on retrieval, not on outputs (though outputs are stored as output events too).
6. Create an evaluation set by running these metrics against the test set created above.

Some notes:
- No manual ID grepping is needed; helper functions take care of it. This probably needs a better design in the future; the focus so far was on getting unblocked.
- `ingestion_trace_id` needs to be refactored to map at the trace level rather than at the RAG query event level (right now it doesn't work; that will be added later).
- Some other small convenience functions need to be added to the API, such as a helper for `list_evaluation_sets()`.


In [None]:
!pip install chromadb 
!pip install llama-index
!pip install lastmile-eval --upgrade

## 1. Setup
- Install required packages
- Download sample data
- Launch the Rag Debugger

In [1]:
import os
import dotenv

# Load API keys from a local .env file; never hardcode tokens in the notebook.
# You can get your OPENAI_API_KEY from https://platform.openai.com/api-keys
# LASTMILE_API_TOKEN is your LastMile API token.
dotenv.load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
LASTMILE_API_TOKEN = os.getenv("LASTMILE_API_TOKEN")
# os.getenv returns None for a missing variable, so check truthiness instead of len()
assert OPENAI_API_KEY, "OPENAI_API_KEY is not set"
assert LASTMILE_API_TOKEN, "LASTMILE_API_TOKEN is not set"
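
The key checks above stop at the first missing variable. As a minimal sketch (the helper name `missing_env_vars` is hypothetical and not part of `lastmile-eval` or `python-dotenv`), the same check can report every missing key at once:

```python
import os


def missing_env_vars(names: list[str]) -> list[str]:
    """Return the names that are unset or empty in the current environment."""
    return [name for name in names if not os.environ.get(name)]


# Hypothetical usage: report every missing key in one message.
REQUIRED_KEYS = ["OPENAI_API_KEY", "LASTMILE_API_TOKEN"]
missing = missing_env_vars(REQUIRED_KEYS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Treating an empty string the same as an unset variable is a design choice here; it catches a `.env` entry that was left blank.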

In a separate terminal, run the following command to launch the Rag Debugger:


!rag-debug launch

Open your web browser and navigate to the URL provided by the Rag Debugger. It will look like `http://localhost:8080/`.
This notebook is an interactive tutorial that shows you how to use rag-debugger with the tracing API.


## Step 2: Run and Trace Ingestion Pipeline

Let's build a basic ingestion pipeline using chromadb. We will use a sample essay by Paul Graham.

- Run the first cell below to download the sample data.
- Run the second cell to create an ingestion pipeline, which is traced with the `@traced` decorator.
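
The ingestion cell in this step splits the essay into fixed-size character chunks and assigns each chunk an ID of the form `chunk_<index>`. A minimal standalone sketch of that chunking step, without tracing or chromadb (the function name `chunk_text` and the toy input are illustrative assumptions):

```python
def chunk_text(text: str, chunk_size: int = 1000) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# Toy input so the chunk boundaries are easy to see.
sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10)
ids = [f"chunk_{i}" for i in range(len(chunks))]

for chunk_id, chunk in zip(ids, chunks):
    print(chunk_id, repr(chunk))
# chunk_0 'abcdefghij'
# chunk_1 'klmnopqrst'
# chunk_2 'uvwxyz'
```

Chunks do not overlap and the last one may be shorter than `chunk_size`, so joining them reproduces the original text exactly.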

In [None]:
# Download sample data
!mkdir -p 'data/paul_graham/'
!curl 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -o 'data/paul_graham/paul_graham_essay.txt'

In [None]:
import chromadb

from lastmile_eval.rag.debugger.api import LastMileTracer
from lastmile_eval.rag.debugger.tracing import get_lastmile_tracer
from lastmile_eval.rag.debugger.tracing.decorators import traced

chroma_client = chromadb.Client()

# Instantiate a tracer object
tracer: LastMileTracer = get_lastmile_tracer("Paul-Graham")


@traced(tracer)  # Decorate the function with the tracer
def chunk_document(file_path: str, chunk_size: int = 1000) -> list[str]:
    """
    Chunk a text file into a list of strings based on the specified chunk size.

    Args:
        file_path (str): The path to the text file.
        chunk_size (int): The desired number of characters in each chunk.

    Returns:
        list[str]: A list of strings, where each string represents a chunk of text.
    """
    with open(file_path, "r") as file:
        text = file.read()

    chunks: list[str] = []
    for i in range(0, len(text), chunk_size):
        chunks.append(text[i:i + chunk_size])

    return chunks

@traced(tracer)
def run_ingestion_flow() -> chromadb.Collection:
    collection = chroma_client.create_collection(name="my_collection")
    tracer.mark_rag_ingestion_trace_event("Ingesting Paul Graham's essay")
    
    document_chunks = chunk_document("data/paul_graham/paul_graham_essay.txt")
    document_ids = [f"chunk_{i}" for i in range(len(document_chunks))]

    collection.add(
        ids=document_ids,
        documents=document_chunks, # ex: ["What I Worked On", "February 2021", ...]
    )
    return collection

collection: chromadb.Collection = run_ingestion_flow()

In [None]:
# Let's print the trace data from Jaeger to
# show you what it looks like (search for "operationName" in the data)

from lastmile_eval.rag.debugger.tracing import (
    get_latest_ingestion_trace_id,
    get_trace_data,
)

ingestion_trace_id = get_latest_ingestion_trace_id()
get_trace_data(ingestion_trace_id)

In [None]:
# Now let's fetch the trace event data from our Postgres table.
# Notice that the `traceId` column matches the raw trace data.

from lastmile_eval.rag.debugger.tracing import list_ingestion_trace_events
import pandas as pd

ingestion_trace_events = list_ingestion_trace_events(take=1)
pd.DataFrame.from_records(ingestion_trace_events["ingestionTraces"]).rename(  # type: ignore[fixme]
    columns={"id": "ragIngestionTraceEventId"}
)

## Step 3: Run and Trace Query Pipeline

Now that we have an ingestion pipeline, let's build a query pipeline. We will use an OpenAI model to generate responses to user queries and trace the pipeline with the `@traced` decorator.

Note: the query pipeline and ingestion pipeline are separate, so we need to link them together with the `ingestion_trace_id`.
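
One step of the query pipeline fills a prompt template with the retrieved context and the user query. A minimal sketch of that step in isolation, with no tracing or LLM call: the helper name `build_prompt` is hypothetical, and it uses `str.format` as a single-pass alternative to the chained `str.replace` in the cell below. The practical difference is that placeholder-like text inside a retrieved chunk is left untouched.

```python
PROMPT_TEMPLATE = """\
Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge, answer the query.
Query: {query_str}
Answer:
"""


def build_prompt(query: str, contexts: list[str], separator: str = "\n\n\n") -> str:
    """Fill the template in one pass; substituted values are not re-scanned."""
    return PROMPT_TEMPLATE.format(
        context_str=separator.join(contexts),
        query_str=query,
    )


prompt = build_prompt(
    "What did the author do growing up?",
    ["chunk about writing", "chunk mentioning {query_str} literally"],
)
print(prompt)
```

Note that `str.format` requires any literal braces in the template itself to be doubled (`{{` and `}}`), which is not an issue for the template used here.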

In [5]:
import openai
from lastmile_eval.rag.debugger.api import (
    QueryReceived,
    ContextRetrieved,
    PromptResolved,
    LLMOutputReceived,
)

LLM_NAME = "gpt-3.5-turbo"


PROMPT_TEMPLATE = """
Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge, answer the query.
Query: {query_str}
Answer:
"""

@traced(tracer, name="retrieve-context")  # Decorate the function with the tracer
def retrieve_context(query_string: str, top_k: int = 5) -> list[str]:
    """
    Retrieve the top-k most relevant contexts based on the query string
    from the chroma db collection
    """
    tracer.register_param("similarity_top_k", top_k)
    chroma_retrieval_results = collection.query(query_texts=[query_string], n_results=top_k)
    # Results are nested per query; take the documents for our single query
    documents_parsed_as_strings = list(chroma_retrieval_results["documents"][0])
    tracer.mark_rag_query_trace_event(
        ContextRetrieved(context=documents_parsed_as_strings), get_latest_ingestion_trace_id()
    )
    return documents_parsed_as_strings

@traced(tracer, name="resolve-prompt")
def resolve_prompt(user_query: str, retrieved_contexts: list[str]):
    resolved_prompt = PROMPT_TEMPLATE.replace(
        "{context_str}", "\n\n\n".join(retrieved_contexts)
    ).replace("{query_str}", user_query)
    tracer.mark_rag_query_trace_event(
        PromptResolved(fully_resolved_prompt=resolved_prompt), get_latest_ingestion_trace_id()
    )
    return resolved_prompt


@traced(tracer=tracer, name="query-root-span") # You can also provide a custom name for the root span
def run_query_flow(user_query: str, ingestion_trace_id: str):
    tracer.mark_rag_query_trace_event(
        QueryReceived(query=user_query), ingestion_trace_id
    )

    retrieved_contexts = retrieve_context(user_query, top_k=3)

    resolved_prompt = resolve_prompt(user_query, retrieved_contexts)

    with tracer.start_as_current_span("call-llm") as _llm_span:
        openai_client = openai.Client(api_key=os.getenv("OPENAI_API_KEY"))
        response = openai_client.chat.completions.create(
            model=LLM_NAME,
            messages=[{"role": "user", "content": resolved_prompt}],
        )
        output: str = response.choices[0].message.content
        tracer.mark_rag_query_trace_event(
            LLMOutputReceived(llm_output=output), get_latest_ingestion_trace_id()
        )

    return output

In [None]:
# TODO: The ingestion_trace_id passed to mark_rag_query_trace_event is currently
# a no-op due to changed assumptions; to be fixed later
response = run_query_flow("What did the author do growing up?", ingestion_trace_id)

print(f"Response: {response}")

Checkpoint: At this point, we have a traced ingestion pipeline and a traced query pipeline. Open up the Rag Debugger and navigate to the traces tab. You should see the traces for the ingestion pipeline and the query pipeline.

| View All Traces | Narrowed Ingestion Trace View |
| -------- | ------- |
| ![OpenAI Logo](https://github-production-user-asset-6210df.s3.amazonaws.com/141073967/330911464-10debfac-a32a-4349-bbd2-fd3e5da95fcb.png?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAVCODYLSA53PQK4ZA%2F20240515%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20240515T180450Z&X-Amz-Expires=300&X-Amz-Signature=1f66773e767b26fb65d406b13284c349284e1a17112555a13cc9bea8eb195fa4&X-Amz-SignedHeaders=host&actor_id=141073967&key_id=0&repo_id=768880246)  | ![](https://github-production-user-asset-6210df.s3.amazonaws.com/141073967/330911533-68e1a305-5734-432a-bffe-896fae10bf1a.png?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAVCODYLSA53PQK4ZA%2F20240515%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20240515T180514Z&X-Amz-Expires=300&X-Amz-Signature=c5c68b0ddc0b18f6326d3290086f4150c91b182e4605f23adbe4974a9d369212&X-Amz-SignedHeaders=host&actor_id=141073967&key_id=0&repo_id=768880246) |




In [None]:
# Just as we did with the ingestion trace,
# let's print what this looks like in the Postgres data, as well as the
# raw trace data again
from lastmile_eval.rag.debugger.tracing import (
    list_query_trace_events,
)

query_trace_events = list_query_trace_events(take=1)
query_trace_events_df = pd.DataFrame.from_records(query_trace_events["queryTraces"]).rename(  # type: ignore[fixme]
    columns={"id": "ragQueryTraceEventsId"}
)
query_trace_events_df



In [None]:
# This is what the trace data looks like
from lastmile_eval.rag.debugger.tracing import (
    get_trace_data,
)

# Fetch the first trace id from the query trace events
query_trace_id = query_trace_events_df.iloc[0]["traceId"]
get_trace_data(query_trace_id)



## Step 4: Create Test Sets and Run Evaluators

In [None]:
# Sample Query Test Set
from openai import OpenAI
from llama_index.core.evaluation import generate_question_context_pairs

queries = [ 
    "What two main things did Paul Graham work on before college, outside of school?", 
    "What was the key realization Paul Graham had about artificial intelligence during his first year of grad school at Harvard?", 
    "How did Paul Graham and his partner Robert Morris get their initial idea and start working on what became their startup Viaweb?", 
    "What were some of the novel approaches and advantages that Y Combinator introduced compared to traditional venture capital firms when it first started?", 
    "What ambitious programming language project did Paul Graham work on intensively for 4 years from 2015-2019, and what was unique about the goal and approach of this language called Bel?" 
]

In [None]:
# Run these queries through the `run_query_flow()` method

expected_node_ids: list[str] = []
for i, query in enumerate(queries):
    run_query_flow(query, ingestion_trace_id)

    print(f"Finished running {i + 1}/{len(queries)} queries...")


In [None]:
from lastmile_eval.rag.debugger.tracing import list_query_trace_events

# Fetch the trace data. Visualize with a pandas dataframe.
query_trace_events = list_query_trace_events(take=len(queries))
query_trace_events_df = pd.DataFrame.from_records(query_trace_events["queryTraces"]).rename(  # type: ignore[fixme]
    columns={"id": "ragQueryTraceId"}
)
query_trace_events_df

## Step 5: Run Online Evaluations with the RAG Query Function

`run_and_evaluate_outputs` is a convenience function that runs your query logic and then your specified evaluators, so you can evaluate RAG query input/output pairs in a single call.
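
To make the run-then-evaluate pattern concrete, here is a minimal sketch using only the standard library. It is not the `lastmile-eval` implementation: `run_and_score`, `token_overlap`, and `fake_query_flow` are hypothetical names, and the toy metric stands in for a real evaluator such as relevance.

```python
from functools import partial
from typing import Callable

# Hypothetical evaluator type: (output, ground_truth) -> score in [0, 1]
Evaluator = Callable[[str, str], float]


def token_overlap(output: str, ground_truth: str) -> float:
    """Toy metric: fraction of ground-truth tokens that appear in the output."""
    truth_tokens = set(ground_truth.lower().split())
    if not truth_tokens:
        return 0.0
    output_tokens = set(output.lower().split())
    return len(truth_tokens & output_tokens) / len(truth_tokens)


def run_and_score(
    query_fn: Callable[[str], str],
    inputs: list[str],
    ground_truth: list[str],
    evaluators: dict[str, Evaluator],
) -> list[dict[str, object]]:
    """Run query_fn on each input, then apply every evaluator to the result."""
    if len(inputs) != len(ground_truth):
        raise ValueError("inputs and ground_truth must have the same length")
    rows: list[dict[str, object]] = []
    for query, truth in zip(inputs, ground_truth):
        output = query_fn(query)
        row: dict[str, object] = {"input": query, "output": output}
        for name, evaluator in evaluators.items():
            row[name] = evaluator(output, truth)
        rows.append(row)
    return rows


def fake_query_flow(user_query: str, ingestion_trace_id: str) -> str:
    """Stand-in for run_query_flow that needs no LLM or tracing backend."""
    return f"echo: {user_query}"


# partial() binds the extra argument so the function takes only the query.
results = run_and_score(
    query_fn=partial(fake_query_flow, ingestion_trace_id="dummy-trace-id"),
    inputs=["writing and programming"],
    ground_truth=["programming and writing"],
    evaluators={"token_overlap": token_overlap},
)
print(results)
```

The `partial` call mirrors how the cell below binds `ingestion_trace_id` to `run_query_flow`, leaving a one-argument function that can be called once per input.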

In [None]:
from functools import partial
from lastmile_eval.rag.debugger.api.evaluation import (
    run_and_evaluate_outputs,
    get_default_rag_trace_level_metrics
)

# Evaluate the relevance of the returned answer to the ground-truth answer.
trace_level_evaluators = get_default_rag_trace_level_metrics(    
    names={"relevance"},
    lastmile_api_token=LASTMILE_API_TOKEN
)

inputs = queries # From the previous cell

ground_truth_answers = [
    "The author first interacted with programming on a mainframe computer, using punch cards to input Fortran code, which was a challenging and time-consuming process",
    "The transition from the IBM 1401 to microcomputers like the TRS-80 represented a significant step forward in terms of both programming capabilities and user interaction.",
    "A turning point came after reading Nick Bostrom's \"Superintelligence,\" which presented a persuasive argument on the potential of Artificial Intelligence (AI)",
    "Heinlein's \"The Moon is a Harsh Mistress\" and Terry Winograd's SHRDLU heavily influenced the author's decision to pursue AI",
    "The author considered the AI practices during his first year of grad school as a \"hoax\" because they didn't meet his expectations for understanding and interpreting natural language accurately.",
]

evaluate_result = run_and_evaluate_outputs(
    "my_project_id",
    trace_level_evaluators=trace_level_evaluators,
    dataset_level_evaluators={},
    rag_query_fn=partial(
        run_query_flow,
        ingestion_trace_id=ingestion_trace_id
    ),
    inputs=inputs,
    ground_truth=ground_truth_answers
)