# RAG-Debugger - Getting Started Tutorial

In this notebook, we showcase how to use the [RAG-Debugger](https://rag.lastmileai.dev/) to optimize your RAG pipelines. We will evaluate a demo RAG pipeline, which enables question-answering over [Paul Graham's essays](https://www.paulgraham.com/worked.html) using `gpt-3.5-turbo`.

Check out our [Cookbook](https://github.com/lastmile-ai/eval-cookbook/tree/main) for more examples and tutorials.

## Notebook Outline
* [Step 1: Install and Setup](#install)
* [Step 2: Build and Trace RAG System](#trace)
  * [Download Data](#download_data)
  * [Trace Ingestion Pipeline](#trace_ingestion)
  * [Trace Query Pipeline](#trace_query)
  * [Access Raw Traces](#access_data)
  * [View Traces in RAG Debugger UI](#view_ui)
* [Step 3: Debug and Optimize your RAG System](#debug)
  * [Measure and Evaluate Performance](#measure)
  * [View Results in RAG Debugger UI](#identify)

<a name="install"></a>
## Step 1: Install and Setup

To begin, we need to install the required packages.

In [None]:
!pip install chromadb
!pip install lastmile-eval --upgrade
!pip install jsonref
!pip install python-dotenv

Import all modules used in this tutorial.

In [1]:
import chromadb
import pandas as pd
import openai

from dataclasses import asdict
from lastmile_eval.rag.debugger.api import (
    LastMileTracer,
    Node,
    RetrievedNode,
)
from lastmile_eval.rag.debugger.tracing import (
    get_lastmile_tracer,
    list_ingestion_trace_events,
    get_latest_ingestion_trace_id,
    get_trace_data,
)
from lastmile_eval.rag.debugger.common.types import RagFlowType
from functools import partial
from lastmile_eval.rag.debugger.api.evaluation import (
    run_and_evaluate,
)



You need the following API tokens/keys:
1. **LastMile AI API Token:** Get from the [Settings page](https://lastmileai.dev/settings?page=tokens) after creating a free LastMile AI account.
2. **OpenAI API Key:** Create from the [API Keys page](https://platform.openai.com/account/api-keys).

Set up your keys using one of the following methods:

* **Google Colab:** Add secrets `OPENAI_API_KEY` and `LASTMILE_API_TOKEN` in "Secrets Manager" (lock icon on left).
* **Local Notebook:** Create a `.env` file and add a line for each key (e.g. `LASTMILE_API_TOKEN=your-api-token`).

Run the code cell below after setting the keys. Avoid inputting keys directly in the notebook.

In [2]:
import os

try:
    # If running on Google Colab, use userdata to securely input keys
    from google.colab import userdata
    OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
    LASTMILE_API_TOKEN = userdata.get('LASTMILE_API_TOKEN')
except ModuleNotFoundError:
    # If running locally, load keys from .env file
    from dotenv import load_dotenv
    load_dotenv()
    OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
    LASTMILE_API_TOKEN = os.getenv('LASTMILE_API_TOKEN')

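# Fail early with a clear message instead of a TypeError if a key is missing
if not OPENAI_API_KEY or not LASTMILE_API_TOKEN:
    raise ValueError("Set OPENAI_API_KEY and LASTMILE_API_TOKEN before continuing.")

os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY
os.environ['LASTMILE_API_TOKEN'] = LASTMILE_API_TOKEN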

<a name="trace"></a>

## Step 2: Build and Trace RAG System

1. Download Data (Paul Graham Essay)
2. Trace Document Ingestion Pipeline
3. Trace Query Pipeline
4. (Optional) Access Raw Trace Data
5. View Traces in RAG Debugger UI

**Note:** If you are using OpenAI, LangChain, or LlamaIndex, we offer auto-instrumentation for tracing (no manual setup required).

<a name="download_data"></a>

#### Download Data

In [3]:
!mkdir -p 'data/paul_graham/'
!curl 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -o 'data/paul_graham/paul_graham_essay.txt'

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 75042  100 75042    0     0   680k      0 --:--:-- --:--:-- --:--:--  684k


<a name="trace_ingestion"></a>

#### Trace Document Ingestion Pipeline
We will split our document (the Paul Graham essay) into chunks and store them in a vector database (ChromaDB). ChromaDB converts these text chunks into vector embeddings, which are indexed in the database so the chunks most similar to a query can be retrieved quickly.
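Conceptually, retrieval from a vector store is a nearest-neighbour search over embeddings. The sketch below is a toy illustration of that idea, not ChromaDB's implementation: it assumes tiny hand-made 3-dimensional vectors in place of real embeddings and ranks chunks by brute-force squared Euclidean (L2) distance, which is Chroma's default distance metric unless a collection is configured otherwise. Chroma itself uses an approximate HNSW index rather than a loop like this.

```python
def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vec: list[float], index: dict[str, list[float]], k: int) -> list[tuple[str, float]]:
    """Return the k chunk IDs closest to the query vector, smallest distance first."""
    scored = [(chunk_id, squared_l2(query_vec, vec)) for chunk_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Hypothetical toy "embeddings" keyed by chunk ID (values chosen for illustration)
toy_index = {
    "node0": [0.9, 0.1, 0.0],
    "node1000": [0.1, 0.8, 0.1],
    "node2000": [0.0, 0.2, 0.9],
}

print(top_k([1.0, 0.0, 0.0], toy_index, k=2))
```

Lower distance means a closer match, which is why the query pipeline later turns Chroma's distances into scores where higher is better.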

We will instantiate a **LastMile AI Tracer object** and use it to trace the chunking step and the ingestion step.

First, instantiate a Tracer object. The project name (`"LM-Tutorial"`) groups your traces in the UI.

In [4]:
tracer: LastMileTracer = get_lastmile_tracer(
    tracer_name="my-tracer",
    project_name="LM-Tutorial",
    rag_flow_type=RagFlowType.INGESTION,
)

Set up your ingestion pipeline with LastMile tracing below.

In [5]:
chroma_client = chromadb.Client()

# Decorate chunking function with LastMile Tracer
@tracer.trace_function()
def chunk_document(file_path: str, chunk_size: int = 1000) -> list[Node]:
    """
    Chunk a text file into a list of Nodes based on the specified chunk size.

    Args:
        file_path (str): The path to the text file.
        chunk_size (int): The desired number of characters in each chunk.

    Returns:
        list[Node]: A list of Nodes, where each node contains info
            representing a chunk of text.
    """
    with open(file_path, "r") as file:
        text = file.read()

    nodes: list[Node] = []
    for i in range(0, len(text), chunk_size):
        nodes.append(
            Node(
                id=f"node{i}",
                text=text[i:i + chunk_size],
            )
        )

    # Add the chunking event to LastMile Tracer
    tracer.add_chunking_event(
        output_nodes=nodes,
        filepath=file_path,
        metadata={"chunk_size": chunk_size},
    )

    return nodes

# Decorate ingestion function with LastMile Tracer
@tracer.trace_function()
def run_ingestion_flow() -> chromadb.Collection:
    filepath = "data/paul_graham/paul_graham_essay.txt"
    document_nodes: list[Node] = chunk_document(filepath)

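    # get_or_create_collection lets this cell be re-run without failing because the collection already exists
    collection = chroma_client.get_or_create_collection(name="paul_graham_collection")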
    collection.add(
        ids=[node.id for node in document_nodes],
        documents=[node.text for node in document_nodes]
    )

    # Add the synthesize event to LastMile Tracer
    tracer.add_synthesize_event(
        input=filepath,
        output=[asdict(node) for node in document_nodes],
    )

    return collection

In [6]:
collection = run_ingestion_flow()
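`chunk_document` uses fixed-size character windows, and each node ID encodes the chunk's starting character offset (`node0`, `node1000`, ...). The pure-Python sketch below reproduces that slicing on a short string so you can see where the boundaries fall. Note that character windows can split words, which is worth keeping in mind when debugging retrieval quality. `chunk_text` is a hypothetical helper written for illustration, not part of `lastmile-eval`.

```python
def chunk_text(text: str, chunk_size: int = 1000) -> list[tuple[str, str]]:
    """Split text into fixed-size character chunks, mirroring chunk_document.

    Returns (node_id, chunk) pairs, where node_id encodes the chunk's start offset.
    """
    return [
        (f"node{i}", text[i:i + chunk_size])
        for i in range(0, len(text), chunk_size)
    ]


chunks = chunk_text("What I Worked On", chunk_size=6)
for node_id, chunk in chunks:
    print(node_id, repr(chunk))
# node0 'What I'
# node6 ' Worke'
# node12 'd On'
```

The word "Worked" is split across two chunks, and concatenating the chunks in order recovers the original text exactly.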

**Important: Linking the Ingestion Trace to the Query Pipeline**
The ingestion pipeline's trace data has an ID. We can pass this ID to the query pipeline's trace events to link the ingestion and query steps, giving you an end-to-end view of your RAG system.

Here is how to get the latest ingestion trace ID, which you will use when setting up tracing for the query pipeline.

In [7]:
ingestion_trace_id = list_ingestion_trace_events(take=1)["ingestionTraces"][0]["id"]

print(ingestion_trace_id)

clxkny9ap00h8qjt6p9u28jg2
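The cell above indexes straight into the response, which raises an `IndexError` if no ingestion trace has been logged yet. A slightly more defensive sketch, assuming the response keeps the `{"ingestionTraces": [{"id": ...}, ...]}` shape used above; the helper name `extract_latest_ingestion_id` is hypothetical:

```python
from typing import Any, Optional


def extract_latest_ingestion_id(response: dict[str, Any]) -> Optional[str]:
    """Return the first ingestion trace ID from a list_ingestion_trace_events response, or None."""
    traces = response.get("ingestionTraces") or []
    return traces[0]["id"] if traces else None


# Usage with the real API (assumes the response shape shown above):
# ingestion_trace_id = extract_latest_ingestion_id(list_ingestion_trace_events(take=1))
print(extract_latest_ingestion_id({"ingestionTraces": [{"id": "clxkny9ap00h8qjt6p9u28jg2"}]}))
print(extract_latest_ingestion_id({"ingestionTraces": []}))
```

Returning `None` lets you check for a missing ingestion trace explicitly before running the query pipeline.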


<a name="trace_query"></a>

#### Trace Query Pipeline
Now that the document ingestion pipeline is built and traced, let's build a query pipeline. We will use an OpenAI model (`gpt-3.5-turbo`) to generate responses to user queries. As with the ingestion pipeline, we trace each step with the `@tracer.trace_function()` decorator.

**NOTE:** The document ingestion pipeline and the query pipeline are traced separately, so we link them by passing `ingestion_trace_id` to each query-side trace event.

In [8]:
LLM_NAME = "gpt-3.5-turbo"

PROMPT_TEMPLATE = """
Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge, answer the query.
Query: {query_str}
Answer:
"""

# Instantiate Tracer for the Query Pipeline
tracer: LastMileTracer = get_lastmile_tracer(
    tracer_name="my-tracer",
    project_name="LM-Tutorial",
    rag_flow_type=RagFlowType.QUERY,
)

# Decorate retrieval function with LastMile Tracer
@tracer.trace_function("retrieve-context")
def retrieve_context(
    query_string: str,
    ingestion_trace_id: str,
    top_k: int = 5,
) -> list[RetrievedNode]:
    """
    Retrieve the top-k most relevant contexts based on the query string
    from the chroma db collection
    """
    # Register key parameters to your trace
    tracer.register_param("similarity_top_k", top_k)

    chroma_retrieval_results = collection.query(query_texts=[query_string], n_results=top_k)

    retrieved_nodes: list[RetrievedNode] = []
    # Chroma returns one inner list per query text; we sent a single query
    for node_id, text, distance in zip(
        chroma_retrieval_results["ids"][0],
        chroma_retrieval_results["documents"][0],
        chroma_retrieval_results["distances"][0],
    ):
        retrieved_nodes.append(
            RetrievedNode(
                text=text,
                id=node_id,
                # Convert distance (lower = closer) into a score (higher = closer)
                score=1 / distance,
            )
        )

    # Add the retrieval event to LastMile Tracer
    tracer.add_retrieval_event(
        query=query_string,
        retrieved_nodes=retrieved_nodes,
        metadata={"top_k": top_k},
        ingestion_trace_id=ingestion_trace_id,
    )

    return retrieved_nodes

# Decorate prompt resolution function with LastMile Tracer
@tracer.trace_function("resolve-prompt")
def resolve_prompt(
    user_query: str,
    retrieved_nodes: list[RetrievedNode],
    ingestion_trace_id: str,
) -> str:
    retrieved_texts = [node.text for node in retrieved_nodes]
    resolved_prompt = PROMPT_TEMPLATE.replace(
        "{context_str}", "\n\n\n".join(retrieved_texts)
    ).replace("{query_str}", user_query)

    # Add the prompt template event to LastMile Tracer
    tracer.add_template_event(
        prompt_template=PROMPT_TEMPLATE,
        resolved_prompt=resolved_prompt,
        ingestion_trace_id=ingestion_trace_id,
    )
    return resolved_prompt

# Decorate the top-level query function (root span) with LastMile Tracer
@tracer.trace_function("query-root-span")
def run_query_flow(user_query: str, ingestion_trace_id: str) -> str:
    retrieved_nodes = retrieve_context(user_query, ingestion_trace_id, top_k=3)
    resolved_prompt = resolve_prompt(
        user_query,
        retrieved_nodes,
        ingestion_trace_id,
    )

    # Open a child span for the LLM call on LastMile Tracer
    with tracer.start_as_current_span("call-llm") as _llm_span:
        openai_client = openai.Client(api_key=os.getenv("OPENAI_API_KEY"))
        response = openai_client.chat.completions.create(
            model=LLM_NAME,
            messages=[{"role": "user", "content": resolved_prompt}],
        )
        output: str = response.choices[0].message.content

        # Add the LLM query event to LastMile Tracer
        tracer.add_query_event(
            query=resolved_prompt,
            llm_output=output,
            metadata={"llm_name": LLM_NAME},
            ingestion_trace_id=ingestion_trace_id,
        )

    return output


Let's try an example user query.

In [9]:
response = run_query_flow("What did the author do growing up?", ingestion_trace_id)

print(f"Response: {response}")



Response: Without more specific information about the author or their background, it is not possible to determine what the author did growing up.
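The answer above indicates the model did not find what it needed in the retrieved context, so retrieval is a natural place to start debugging. `collection.query` returns a dict of parallel lists (`ids`, `documents`, `distances`, ...) with one inner list per query text. The sketch below flattens such a result into `(id, score, preview)` rows ranked by the same `1 / distance` score that `retrieve_context` uses. `rank_results` is a hypothetical helper and the sample dict is made up for illustration.

```python
from typing import Any


def rank_results(results: dict[str, Any], preview_chars: int = 40) -> list[tuple[str, float, str]]:
    """Flatten a single-query Chroma result into (id, score, preview) rows.

    Assumes results["ids"], results["documents"] and results["distances"] hold
    one inner list per query text; only the first query is used. Assumes all
    distances are non-zero.
    """
    rows = []
    for node_id, doc, distance in zip(
        results["ids"][0], results["documents"][0], results["distances"][0]
    ):
        # Same conversion as retrieve_context: smaller distance -> larger score
        rows.append((node_id, 1 / distance, doc[:preview_chars]))
    # Highest score (closest chunk) first
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical query result with two hits
sample = {
    "ids": [["node3000", "node0"]],
    "documents": [["...later text...", "What I Worked On ..."]],
    "distances": [[1.25, 0.5]],
}
for row in rank_results(sample):
    print(row)
```

Against the real collection, you could pass `collection.query(query_texts=["What did the author do growing up?"], n_results=3)` to `rank_results` to see which chunks the query pipeline retrieved.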


<a name="access_data"></a>

#### Access Raw Trace Data
The ingestion and query traces each have an ID. The raw trace data is sent to both Jaeger and Postgres, and the cells below read it back from each. The same data is much easier to browse in the RAG Debugger UI, covered in the next section.

In [10]:
# Print trace data from Jaeger
get_trace_data(get_latest_ingestion_trace_id())

{'data': [{'traceID': '0ea9ec02e3182b69a90e0a6669b216ad',
   'spans': [{'traceID': '0ea9ec02e3182b69a90e0a6669b216ad',
     'spanID': '04eacba5705eec9a',
     'operationName': '1 - run_ingestion_flow',
     'references': [],
     'startTime': 1718730809223550,
     'duration': 6762927,
     'tags': [{'key': 'input', 'type': 'string', 'value': '{}'},
      {'key': 'lastmile.span.kind', 'type': 'string', 'value': 'synthesize'},
      {'key': 'output',
       'type': 'string',
       'value': '"<output not json-serializable>: All logged values must be JSON-serializable: <chromadb.api.models.Collection.Collection object at 0x109db6650>"'},
      {'key': 'span.kind', 'type': 'string', 'value': 'internal'},
      {'key': 'internal.span.format', 'type': 'string', 'value': 'otlp'}],
     'logs': [{'timestamp': 1718730815986452,
       'fields': [{'key': 'event', 'type': 'string', 'value': 'synthesize'},
        {'key': 'event_data', 'type': 'string', 'value': '{}'},
        {'key': 'ingestion_
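The raw Jaeger payload is verbose. A minimal sketch for summarizing it, assuming the structure visible in the output above (`data` → `spans`, each with an `operationName`, a `duration` in microseconds as Jaeger reports it, and a list of `tags`); `summarize_spans` is a hypothetical helper:

```python
from typing import Any


def summarize_spans(trace_json: dict[str, Any]) -> list[tuple[str, float, str]]:
    """Return (operationName, duration in ms, lastmile.span.kind) for every span."""
    rows = []
    for trace in trace_json.get("data", []):
        for span in trace.get("spans", []):
            # Turn the list of {"key": ..., "value": ...} tags into a lookup dict
            tags = {tag["key"]: tag.get("value") for tag in span.get("tags", [])}
            rows.append((
                span["operationName"],
                span["duration"] / 1000,  # microseconds -> milliseconds
                tags.get("lastmile.span.kind", ""),
            ))
    return rows


# Trimmed-down example mirroring the output above
example = {"data": [{"spans": [{
    "operationName": "1 - run_ingestion_flow",
    "duration": 6762927,
    "tags": [{"key": "lastmile.span.kind", "type": "string", "value": "synthesize"}],
}]}]}
print(summarize_spans(example))
```

In the notebook, `summarize_spans(get_trace_data(get_latest_ingestion_trace_id()))` should work if the return value keeps the shape shown above.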

In [11]:
# Print trace data from Postgres
ingestion_trace_events = list_ingestion_trace_events(take=1)
pd.DataFrame.from_records(ingestion_trace_events["ingestionTraces"]).rename(  # type: ignore[fixme]
    columns={"id": "ragIngestionTraceEventId"}
)

| | ragIngestionTraceEventId | createdAt | updatedAt | paramSet | eventName | eventData | input | output | metadata | traceId | creatorId | projectId | organizationId | visibility | active | annotations | tags | feedback |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | clxkny9ap00h8qjt6p9u28jg2 | 2024-06-18T17:13:36.145Z | 2024-06-18T17:13:36.145Z | {} | | | data/paul_graham/paul_graham_essay.txt | [{'id': 'node0', 'text': ' What I Worked On ... | | 0ea9ec02e3182b69a90e0a6669b216ad | clkrgxm850004phi6ee5mvhd1 | clxkm800i0000qp7ghkmhyckg | | MEMBER | True | [] | [] | [] |


<a name="view_ui"></a>

#### View Traces in RAG Debugger UI
At this point, we have a traced ingestion pipeline and a traced query pipeline. We can view the traces in the RAG Debugger UI.

Run the following command in a separate terminal to launch the UI:

`rag-debug launch`

Open your web browser and navigate to the URL printed by the RAG Debugger (for example, http://localhost:8080/).

1. Navigate to the **Traces Tab**.
2. Select the project 'LM-Tutorial' (the `project_name` passed to `get_lastmile_tracer`).
3. You should see a single trace for the test run we executed from this notebook.

<img width="973" alt="Screenshot 2024-06-17 at 1 56 46 PM" src="https://github.com/lastmile-ai/aiconfig/assets/81494782/3e96e7fe-ad7e-427e-955f-532409044664">



<a name="debug"></a>

## Step 3: Debug and Optimize your RAG System
1. Measure and Evaluate Performance
2. View Results in RAG Debugger UI

Evaluation is a crucial part of LLM development. To improve and debug your RAG system, you must have a way to measure it. Evaluation metrics (aka evaluators) allow you to measure the quality of LLM-generated results. Evaluators can take in various inputs including the generated response, ground truth data, context, etc. and typically output a numeric score from 0 to 1.

Our first step is to run evaluations on data we pass into our RAG system and gather metrics we can analyze in the RAG Debugger UI.
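To make the idea of an evaluator concrete, here is a deliberately simple one: token-overlap F1 between a response and a ground truth answer, which always lands in the 0 to 1 range. This is an illustrative stand-in, not how the `relevance`, `qa`, or `similarity` evaluators in `lastmile-eval` are computed.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(response: str, ground_truth: str) -> float:
    """Token-overlap F1 between a response and a ground truth answer, in [0, 1]."""
    resp_counts = Counter(tokenize(response))
    truth_counts = Counter(tokenize(ground_truth))
    # Size of the multiset intersection (shared tokens, counting repeats)
    overlap = sum((resp_counts & truth_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(resp_counts.values())
    recall = overlap / sum(truth_counts.values())
    return 2 * precision * recall / (precision + recall)


print(token_f1("the cat sat", "the cat ran"))  # → about 0.667
```

Lexical overlap like this misses paraphrases, which is one reason LLM-based or embedding-based evaluators are often preferred for free-form answers.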

<a name="measure"></a>

#### Measure and Evaluate Performance


We will test a few questions with known ground truth answers and use evaluator metrics to judge the RAG system's LLM responses against those ground truth answers.

In [12]:
user_questions = [
    "What two main things did Paul Graham work on before college, outside of school?",
    "What was the key realization Paul Graham had about artificial intelligence during his first year of grad school at Harvard?",
    "How did Paul Graham and his partner Robert Morris get their initial idea and start working on what became their startup Viaweb?",
    "What were some of the novel approaches and advantages that Y Combinator introduced compared to traditional venture capital firms when it first started?",
    "What ambitious programming language project did Paul Graham work on intensively for 4 years from 2015-2019, and what was unique about the goal and approach of this language called Bel?"
]

ground_truth_answers = [
    "The author first interacted with programming on a mainframe computer, using punch cards to input Fortran code, which was a challenging and time-consuming process",
    "The transition from the IBM 1401 to microcomputers like the TRS-80 represented a significant step forward in terms of both programming capabilities and user interaction.",
    "A turning point came after reading Nick Bostrom's \"Superintelligence,\" which presented a persuasive argument on the potential of Artificial Intelligence (AI)",
    "Heinlein's \"The Moon is a Harsh Mistress\" and Terry Winograd's SHRDLU heavily influenced the author's decision to pursue AI",
    "The author considered the AI practices during his first year of grad school as a \"hoax\" because they didn't meet his expectations for understanding and interpreting natural language accurately.",
]

For each question, we'll run three evaluators (`relevance`, `qa`, and `similarity`) on the system's response, using the corresponding ground truth answer where the evaluator needs one. Since each question is associated with a specific trace, we get scores for each trace, allowing us to assess the performance of the RAG pipeline at a granular level.

In [13]:
# Evaluators to run on every example
evaluator_names = {"relevance", "qa", "similarity"}

# Runs the RAG query pipeline (with tracing) on test set and evaluates responses
evaluate_result = run_and_evaluate(
    project_id=tracer.project_id,
    evaluators=evaluator_names,
    run_query_fn=partial(
        run_query_flow,
        ingestion_trace_id=ingestion_trace_id
    ),
    inputs=user_questions,
    ground_truths=ground_truth_answers,
)


In [14]:
# Print results
print("Metric Scores for Each Question:")
display(evaluate_result.df_metrics_example_level)
print("Aggregated Metric Scores for All Questions:")
display(evaluate_result.df_metrics_aggregated)

Metric Scores for Each Question:

| | exampleSetId | exampleId | metricName | value |
|---|---|---|---|---|
| 0 | clxknylfe009spe73x2pavnwh | clxknylg6009upe73thslyaa1 | similarity | 0.2 |
| 1 | clxknylfe009spe73x2pavnwh | clxknylg6009vpe73ja2d6eod | similarity | 0.2 |
| 2 | clxknylfe009spe73x2pavnwh | clxknylg6009wpe732bfu1pew | similarity | 0.5 |
| 3 | clxknylfe009spe73x2pavnwh | clxknylg6009xpe739y9bfvt3 | similarity | 0.5 |
| 4 | clxknylfe009spe73x2pavnwh | clxknylg6009ype735e19fwul | similarity | 0.2 |
| 0 | clxknylfe009spe73x2pavnwh | clxknylg6009upe73thslyaa1 | qa | 0.0 |
| 1 | clxknylfe009spe73x2pavnwh | clxknylg6009vpe73ja2d6eod | qa | 0.0 |
| 2 | clxknylfe009spe73x2pavnwh | clxknylg6009wpe732bfu1pew | qa | 0.0 |
| 3 | clxknylfe009spe73x2pavnwh | clxknylg6009xpe739y9bfvt3 | qa | 1.0 |
| 4 | clxknylfe009spe73x2pavnwh | clxknylg6009ype735e19fwul | qa | 1.0 |

Aggregated Metric Scores for All Questions:

| | exampleSetId | metricName | value |
|---|---|---|---|
| 0 | clxknylfe009spe73x2pavnwh | similarity_mean | 0.44 |
| 0 | clxknylfe009spe73x2pavnwh | similarity_std | 0.216795 |
| 0 | clxknylfe009spe73x2pavnwh | similarity_count | 5.0 |
| 0 | clxknylfe009spe73x2pavnwh | qa_mean | 0.8 |
| 0 | clxknylfe009spe73x2pavnwh | qa_std | 0.447214 |
| 0 | clxknylfe009spe73x2pavnwh | qa_count | 5.0 |
| 0 | clxknylfe009spe73x2pavnwh | relevance_mean | 1.0 |
| 0 | clxknylfe009spe73x2pavnwh | relevance_std | 0.0 |
| 0 | clxknylfe009spe73x2pavnwh | relevance_count | 5.0 |
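The aggregated table summarizes long-format, per-example scores by metric. A sketch of computing similar `mean`/`std`/`count` summaries with pandas, assuming a DataFrame with the `exampleId`, `metricName`, and `value` columns shown above. The scores below are made up, and whether `lastmile-eval` uses the sample standard deviation (pandas' default, `ddof=1`) or the population one is an assumption not confirmed here.

```python
import pandas as pd

# Hypothetical long-format scores: one row per (example, metric)
df = pd.DataFrame({
    "exampleId": ["ex1", "ex2", "ex3", "ex1", "ex2", "ex3"],
    "metricName": ["qa", "qa", "qa", "similarity", "similarity", "similarity"],
    "value": [1.0, 0.0, 1.0, 0.2, 0.5, 0.5],
})

# One row per metric with mean, sample standard deviation, and count
summary = df.groupby("metricName")["value"].agg(["mean", "std", "count"])
print(summary)

# One row per example with a column per metric, handy for spotting weak questions
per_example = df.pivot(index="exampleId", columns="metricName", values="value")
print(per_example)
```

The pivoted view makes it easy to find the questions where every metric is low, which are good candidates to open in the debugger.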


<a name="identify"></a>

#### View Results in RAG Debugger UI

We can view the evaluation results in the RAG Debugger UI.

Run the following command in a separate terminal to launch the UI:

`rag-debug launch`

Open your web browser and navigate to the URL printed by the RAG Debugger (for example, http://localhost:8080/).

1. Navigate to the **Evaluation Console**.
2. Select the project 'LM-Tutorial' (the `project_name` passed to `get_lastmile_tracer`).
3. You should see your latest evaluation runs here for this project.
4. Click on the latest run.

<img width="973" alt="Screenshot 2024-06-17 at 4 22 07 PM" src="https://github.com/lastmile-ai/aiconfig/assets/81494782/bb73d1b2-de1f-43b3-ade7-d14e0368230e">


5. Now we can see all the inputs, outputs, and metrics for our latest evaluation run.

<img width="973" alt="Screenshot 2024-06-17 at 4 26 05 PM" src="https://github.com/lastmile-ai/aiconfig/assets/81494782/c22bc5f8-370c-4ba5-a0f0-2b088fa07fbc">

6. Let's debug the response to a specific question. Click the debug icon.

<img width="973" alt="Screenshot 2024-06-17 at 4 26 05 PM" src="https://github.com/lastmile-ai/aiconfig/assets/81494782/6eba1fb8-e1b9-40f5-bf09-5c87508485cd">

7. Now we can see all the steps (via our LastMile Tracer) used to generate the output for this specific question.

<img width="973" alt="Screenshot 2024-06-17 at 4 26 24 PM" src="https://github.com/lastmile-ai/aiconfig/assets/81494782/425c0bf9-0579-42ff-b545-5a332087c67f">


8. For the LLM call, we can also use the Prompt Debugger to switch out models, edit the system prompt, and more, in real time!

<img width="973" alt="Screenshot 2024-06-17 at 4 26 59 PM" src="https://github.com/lastmile-ai/aiconfig/assets/81494782/c0c96946-18ee-415a-a876-0a1df6aa6f27">


<img width="973" alt="Screenshot 2024-06-17 at 4 27 10 PM" src="https://github.com/lastmile-ai/aiconfig/assets/81494782/7ee929c1-5352-4ac0-9f2e-c590e0895392">

