# RAG-Debugger - Getting Started Tutorial

In this notebook, we showcase how to use the [RAG-Debugger](https://rag.lastmileai.dev/) to optimize your RAG pipelines. We will evaluate a demo RAG pipeline, which enables question-answering over [Paul Graham's essays](https://www.paulgraham.com/worked.html) using `gpt-3.5-turbo`. Check out our [Cookbook](https://github.com/lastmile-ai/eval-cookbook/tree/main) for more examples and tutorials.

<img width="500" alt="Screenshot 2024-05-16 at 12 27 45 PM" src="https://github.com/lastmile-ai/aiconfig/assets/81494782/497135cb-3fc0-452b-a7fd-c04819be2fab">

## Notebook Outline
* [Step 1: Install and Setup](#install)
* [Step 2: Build and Trace RAG System](#trace)
  * [Download Data](#download_data)
  * [Trace Ingestion Pipeline](#trace_ingestion)
  * [Trace Query Pipeline](#trace_query)
  * [Access Raw Traces](#access_data)
  * [View Traces in RAG Debugger UI](#view_ui)
* [Step 3: Debug and Optimize your RAG System](#debug)
  * [Measure and Evaluate Performance](#measure)
  * [Idenfity Issues](#identify)
  * [Iterate and Optimize System](#iterate)


<a name="install"></a>
## Step 1: Install and Setup
1. Install required packages and modules
2. Setup API Keys/Tokens

To begin, we need to install the required packages and modules.

In [1]:
# !pip install chromadb
# !pip install llama-index
# !pip install lastmile-eval --upgrade

In [2]:
import chromadb
import pandas as pd
import openai

from dataclasses import asdict
from lastmile_eval.rag.debugger.api import (
    LastMileTracer,
    Node,
    RetrievedNode,
)
from lastmile_eval.rag.debugger.tracing import (
    get_lastmile_tracer,
    list_ingestion_trace_events,
    get_latest_ingestion_trace_id,
    get_trace_data,
)
from functools import partial
from lastmile_eval.rag.debugger.api.evaluation import (
    run_and_evaluate,
)
from llama_index.core.evaluation import generate_question_context_pairs

  from .autonotebook import tqdm as notebook_tqdm


We also need the following tokens/keys:

* **LastMile AI API Token:** Go to the [LastMile Settings page](https://lastmileai.dev/settings?page=tokens). You will need to first create a LastMile AI account.
* **OpenAI API Key:** Go to [OpenAI API Keys page](https://platform.openai.com/account/api-keys) to create and access your OpenAI API Key.

For Jupyter notebook, save them in a `.env` file within this project directory.

In [3]:
import os

try:
    # First try this in case we're running on Google Colab
    from google.colab import userdata
    os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')
    os.environ['LASTMILE_API_TOKEN'] = userdata.get('LASTMILE_API_TOKEN')
except ModuleNotFoundError:
    import dotenv
    dotenv.load_dotenv()

<a name="trace"></a>

## Step 2: Build and Trace RAG System

1. Download Data (Paul Graham Essay)
2. Trace Document Ingestion Pipeline
3. (Optional) Access Raw Trace Data
4. Trace Query Pipeline
5. View Traces in RAG Debugger UI

**Note:** If you are using OpenAI, LangChain, or LlamaIndex, we offer auto-instrumention for tracing (no manual setup required). See our documentation to learn more.

<a name="download_data"></a>

#### Download Data

In [4]:
!mkdir -p 'data/paul_graham/'
!curl 'https://raw.´.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -o 'data/paul_graham/paul_graham_essay.txt'

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (6) Could not resolve host: raw.´.com


<a name="trace_ingestion"></a>

#### Trace Document Ingestion Pipeline
In the cells below, we create chunks of our document (Paul Graham essay) and store it in a vector database (ChromaDB). ChromaDB converts these chunks of texts to vector embeddings which are indexed in the database and can easily be retrieved.

We also instatiated a **LastMile AI Tracer object** and traced the chunking step and the ingestion step (storing the embeddings in ChromaDB). We use the `@tracer.trace_function()` decorator to trace these steps.

First, instantiate a Tracer object. The project name ("Paul-Graham-Demo-Project") enables you to group traces in the UI.

In [5]:
tracer: LastMileTracer = get_lastmile_tracer(
    tracer_name="my-tracer-object",
    project_name="Paul-Graham-Demo-Project",
)

In [6]:
tracer

<lastmile_eval.rag.debugger.tracing.lastmile_tracer.LastMileTracer at 0x340fb4740>

Setup of your ingestion pipeline below with the necessary tracing added.

In [7]:
chroma_client = chromadb.Client()

@tracer.trace_function() #Decorate the function with the tracer
def chunk_document(file_path: str, chunk_size: int = 1000) -> list[Node]:
    """
    Chunk a text file into a list of strings based on the specified chunk size.

    Args:
        file_path (str): The path to the text file.
        chunk_size (int): The desired number of characters in each chunk.

    Returns:
        list[Node]: A list of Nodes, where each node contains info
            representing a chunk of text.
    """
    with open(file_path, "r") as file:
        text = file.read()

    nodes: list[Node] = []
    for i in range(0, len(text), chunk_size):
        nodes.append(
            Node(
                id=f"node{i}",
                text=text[i:i + chunk_size],
            )
        )
    tracer.add_chunking_event(
        output_nodes=nodes, 
        filepath=file_path, 
        metadata={"chunk_size": chunk_size},
    )

    return nodes

@tracer.trace_function()
def run_ingestion_flow() -> chromadb.Collection:
    filepath = "data/paul_graham/paul_graham_essay.txt"
    document_nodes: list[Node] = chunk_document(filepath)

    collection = chroma_client.create_collection(name="paul_graham_collection")
    collection.add(
        ids=[node.id for node in document_nodes],
        documents=[node.text for node in document_nodes] # ex: ["What I Worked On", "February 2021", ...]
    )
    
    tracer.add_synthesize_event(
        input=filepath,
        # use asdict because input and output need to be JSON-serializable!!!
        output=[asdict(node) for node in document_nodes],
    )
    return collection

In [8]:
# TODO(b7r6): do this properly with a real cache and stop being lazy...
# try:
#   print(f"collection: {collection}")
# except NameError:
collection = run_ingestion_flow()

**Important - Linking Ingestion Trace to Query Pipeline**
The trace data for the ingestion pipeline has an ID associated with it. We can use this ID to link the tracing for the ingestion step and the query step of the RAG system for a comprehensive overview of your RAG system.

Here is how you get the latest ingestion trace ID which you can use when setting up the tracing for the Query Pipeline.

In [9]:
# TODO(b7r6): we need a utility for this...
ingestion_trace_id = list_ingestion_trace_events(take=1)["ingestionTraces"][0]["id"]

print(ingestion_trace_id)

clx58wrf60098qp14kfpc4tjz


<a name="trace_query"></a>

#### Trace Query Pipeline
Now that we have the document ingestion pipeline built and traced, lets build a query pipeline. We will use an OpenAI model (`gpt-3.5-turbo`) to generate responses to user queries. Similar to the document ingestion pipeline, we will trace this pipeline with the `@traced` decorator.

**NOTE:** the document ingestion pipeline and query pipeline are separate so we will need to link them together with `ingestion_trace_id`.

In [12]:
from typing import Optional

LLM_NAME = "gpt-3.5-turbo"

PROMPT_TEMPLATE = """
Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge, answer the query.
Query: {query_str}
Answer:
"""

@tracer.trace_function("retrieve-context") #Decorate the function with the tracer
def retrieve_context(
    query_string: str,
    ingestion_trace_id: str,
    top_k: int = 5,
) -> list[RetrievedNode]:
    """
    Retrieve the top-k most relevant contexts based on the query string
    from the chroma db collection
    """
    #Register parameters associated with your RAG pipeline setup
    tracer.register_param("similarity_top_k", top_k) 
    chroma_retrival_results = collection.query(query_texts=query_string, n_results=top_k)

    retrieved_nodes: list[RetrievedNode] = []
    for i in range(len(chroma_retrival_results.get("documents")[0])):
        retrieved_nodes.append(
            RetrievedNode(
                id=chroma_retrival_results.get("documents")[0][i],
                text=chroma_retrival_results.get("ids")[0][i],
                # shorter distance means more relevant so just take reciprocal
                score=1/chroma_retrival_results.get("distances")[0][i], 
            )
        )

    tracer.add_retrieval_event(
        query=query_string, 
        retrieved_nodes=retrieved_nodes, 
        metadata={"top_k": top_k},
        ingestion_trace_id=ingestion_trace_id,
    )

    return retrieved_nodes

@tracer.trace_function("resolve-prompt")
def resolve_prompt(
    user_query: str, 
    retrieved_nodes: list[RetrievedNode],
    ingestion_trace_id: str,
) -> str:
    retrieved_texts = [node.text for node in retrieved_nodes]
    resolved_prompt = PROMPT_TEMPLATE.replace(
        "{context_str}", "\n\n\n".join(retrieved_texts)
    ).replace("{query_str}", user_query)
    tracer.add_template_event(
        prompt_template=PROMPT_TEMPLATE,
        resolved_prompt=resolved_prompt,
        ingestion_trace_id=ingestion_trace_id,
    )
    return resolved_prompt

@tracer.trace_function("query-root-span") # You can provide a custom name for the root span
def run_query_flow(user_query: str, ingestion_trace_id: str) -> str:
    retrieved_nodes = retrieve_context(user_query, ingestion_trace_id, top_k=3)
    resolved_prompt = resolve_prompt(
        user_query, 
        retrieved_nodes, 
        ingestion_trace_id,
    )

    with tracer.start_as_current_span("call-llm") as _llm_span:
        openai_client = openai.Client(api_key=os.getenv("OPENAI_API_KEY"))
        response = openai_client.chat.completions.create(
            model=LLM_NAME,
            messages=[{"role": "user", "content": resolved_prompt}],
        )
        output: str = response.choices[0].message.content
    
        tracer.add_query_event(
            query=user_query, 
            llm_output=output,
            metadata={"llm_name": LLM_NAME},
            ingestion_trace_id=ingestion_trace_id,
        )

    return output


Let's try an example user query.

In [13]:
response = run_query_flow("What did the author do growing up?", ingestion_trace_id)

print(f"Response: {response}")

Response: Based on the context information provided, it is impossible to determine what the author did growing up.


<a name="access_data"></a>

#### Access Raw Trace Data
The trace data for the ingestion and query pipeline have IDs associated with them. The raw trace data is sent to both Jaegar and Postgres which we have shown below. This trace data is also viewable in a much more user-friendly view in the RAG-Debugger UI which will be shown soon.

In [14]:
# Print trace data from Jaeger
get_trace_data(get_latest_ingestion_trace_id())

{'data': [{'traceID': '564417d53e00be5ce837a5cdc9dc0953',
   'spans': [{'traceID': '564417d53e00be5ce837a5cdc9dc0953',
     'spanID': 'c7b5e277f46a16aa',
     'operationName': '1 - ingestion_function',
     'references': [],
     'startTime': 1717798559204194,
     'duration': 93634,
     'tags': [{'key': 'input', 'type': 'string', 'value': '{"chunk_size": 3}'},
      {'key': 'openinference.span.kind',
       'type': 'string',
       'value': 'EMBEDDING'},
      {'key': 'output', 'type': 'string', 'value': 'true'},
      {'key': 'span.kind', 'type': 'string', 'value': 'internal'},
      {'key': 'internal.span.format', 'type': 'string', 'value': 'otlp'}],
     'logs': [],
     'processID': 'p1',
    {'traceID': '564417d53e00be5ce837a5cdc9dc0953',
     'spanID': '27ec83ba8d66e84e',
     'operationName': '2 - ingestion-child-span',
     'references': [{'refType': 'CHILD_OF',
       'traceID': '564417d53e00be5ce837a5cdc9dc0953',
       'spanID': 'c7b5e277f46a16aa'}],
     'startTime': 1717

In [15]:
# Print trace data from Postgres
ingestion_trace_events = list_ingestion_trace_events(take=1)
pd.DataFrame.from_records(ingestion_trace_events["ingestionTraces"]).rename(  # type: ignore[fixme]
    columns={"id": "ragIngestionTraceEventId"}
)

Unnamed: 0,ragIngestionTraceEventId,createdAt,updatedAt,paramSet,eventName,eventData,input,output,metadata,traceId,creatorId,projectId,organizationId,visibility,active,annotations,feedback
0,clx58wrf60098qp14kfpc4tjz,2024-06-07T22:15:59.442Z,2024-06-07T22:15:59.442Z,"{'prognosis': 'My eyes are burning!', 'chunk_s...",,,,,,564417d53e00be5ce837a5cdc9dc0953,clp1m7n3l0062qpqnd4nyabbl,clwz7za15006aqj2xgabw04cv,,MEMBER,True,[],[]


<a name="view_ui"></a>

#### View Traces in RAG Debugger UI
At this point, we have a traced ingestion pipeline and a traced query pipeline. We can view the traces in the RAG Debugger UI. In a seperate terminal, run the following command to launch the RAG Debugger:

`rag-debug launch`

Open up your webbrowser and navigate to the url provided by the RAG Debugger. This will look like http://localhost:8080/

Navigate to the Traces tab. You should see the traces for the ingestion pipeline and the query pipeline. It will look something like this:

<img width="1792" alt="Screenshot 2024-05-16 at 12 00 55 PM" src="https://github.com/lastmile-ai/aiconfig/assets/81494782/adfc429e-7533-4d98-8bc7-acfc00a703f8">


<a name="debug"></a>

## Step 3: Debug your RAG System
1. Measure and Evaluate Performance
2. Identify Issues
3. Iterate and Optimize RAG System

Evaluation is a crucial part of LLM development. To improve and debug your RAG system, you must have a way to measure it. Evaluation metrics (aka evaluators) allow you to measure the quality of LLM-generated results. Evaluators can take in various inputs including the generated response, ground truth data, context, etc. and typically output a numeric score from 0 to 1.

Our first step is run evaluations on data we pass into our RAG system and gather metrics we can analyze in the RAG Debugger UI.

<a name="measure"></a>

#### Measure and Evaluate Performance


To evaluate our RAG system, we'll create a TestSet containing questions to ask the system. We'll compare the system's responses to ground truth answers using evaluation metrics to assess the quality and effectiveness of the pipeline in providing accurate and relevant answers based on the ingested document (Paul Graham essay).

In [16]:
user_questions = [
    "What two main things did Paul Graham work on before college, outside of school?",
    "What was the key realization Paul Graham had about artificial intelligence during his first year of grad school at Harvard?",
    "How did Paul Graham and his partner Robert Morris get their initial idea and start working on what became their startup Viaweb?",
    "What were some of the novel approaches and advantages that Y Combinator introduced compared to traditional venture capital firms when it first started?",
    "What ambitious programming language project did Paul Graham work on intensively for 4 years from 2015-2019, and what was unique about the goal and approach of this language called Bel?"
]

ground_truth_answers = [
    "The author first interacted with programming on a mainframe computer, using punch cards to input Fortran code, which was a challenging and time-consuming process",
    "The transition from the IBM 1401 to microcomputers like the TRS-80 represented a significant step forward in terms of both programming capabilities and user interaction.",
    "A turning point came after reading Nick Bostrom's \"Superintelligence,\" which presented a persuasive argument on the potential of Artificial Intelligence (AI)",
    "Heinlein's \"The Moon is a Harsh Mistress\" and Terry Winograd's SHRDLU heavily influenced the author's decision to pursue AI",
    "The author considered the AI practices during his first year of grad school as a \"hoax\" because they didn't meet his expectations for understanding and interpreting natural language accurately.",
]

For each question in the TestSet, we'll compute a **Relevance score** that measures how closely the system's response matches the corresponding ground truth answer. Since each question is associated with a specific trace, we'll obtain a Relevance score for each trace, allowing us to assess the performance of the RAG pipeline at a granular level.



In [17]:
# In this case, we are using just the default Relevance evaluation metric.
evaluator_names = {"relevance"}

evaluate_result = run_and_evaluate(
    project_id=None,
    evaluators=evaluator_names,
    run_query_fn=partial(
        run_query_flow,
        ingestion_trace_id=ingestion_trace_id
    ),
    inputs=user_questions,
    ground_truths=ground_truth_answers,
)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
llm_classify |██████████| 5/5 (100.0%) | ⏳ 00:04<00:00 |  1.17it/s
llm_classify |██████████| 5/5 (100.0%) | ⏳ 00:03<00:00 |  1.51it/s
llm_classify |██████████| 5/5 (100.0%) | ⏳ 00:03<00:00 |  1.60it/s
llm_classify |██████████| 5/5 (100.0%) | ⏳ 00:03<00:00 |  1.50it/s


In [18]:
print(f"""
    {evaluate_result.success=}
    {evaluate_result.message=}

    {evaluate_result.evaluation_result_id=}
    {evaluate_result.example_set_id=}
""")

print("Example-level metrics:")
display(evaluate_result.df_metrics_example_level)
print("Aggregated metrics:")
display(evaluate_result.df_metrics_aggregated)


    evaluate_result.success=True
    evaluate_result.message='{"id":"clxakktvk00zpqjw9tywfofjk","createdAt":"2024-06-11T15:41:29.024Z","updatedAt":"2024-06-11T15:41:29.024Z","name":"Evaluation Result","paramSet":{"similarity_top_k":3},"testSetId":"clxakkii7000equ02ncva2umj","creatorId":"clp1m7n3l0062qpqnd4nyabbl","projectId":null,"organizationId":null,"visibility":"MEMBER","metadata":null,"active":true}'

    evaluate_result.evaluation_result_id='clxakktvk00zpqjw9tywfofjk'
    evaluate_result.example_set_id='clxakkii7000equ02ncva2umj'

Example-level metrics:


Unnamed: 0,exampleSetId,exampleId,metricName,value
0,clxakkii7000equ02ncva2umj,clxakkij0000fqu02rq6plt4f,relevance,1.0
1,clxakkii7000equ02ncva2umj,clxakkij0000gqu0263x7faxa,relevance,0.0
2,clxakkii7000equ02ncva2umj,clxakkij0000hqu02e1jevd43,relevance,1.0
3,clxakkii7000equ02ncva2umj,clxakkij0000iqu02t0bhdr8a,relevance,1.0
4,clxakkii7000equ02ncva2umj,clxakkij0000jqu02bkinjslo,relevance,1.0


Aggregated metrics:


Unnamed: 0,exampleSetId,metricName,value
0,clxakkii7000equ02ncva2umj,relevance_mean,0.8
0,clxakkii7000equ02ncva2umj,relevance_std,0.447214
0,clxakkii7000equ02ncva2umj,relevance_count,5.0


<a name="identify"></a>

### Identify Issues

<a name="iterate"></a>

### Iterate and Optimize RAG System

`TODO(b7r6): figure out what we want to do here...`