# Retrieval-First RAG Experimentation on NFCorpus

This notebook presents a controlled Retrieval-Augmented Generation (RAG)
experiment using the public **NFCorpus** biomedical QA dataset.

The goal is to understand how retrieval and chunking configurations
impact ranking quality, system stability, and latency under realistic
LLM context constraints.

This notebook is designed to be:
- Fully reproducible on free-tier Google Colab (T4 GPU)
- Retrieval-first (no generation quality judging)
- Focused on experimentation and tradeoffs


## Dataset: NFCorpus

NFCorpus is a biomedical question-answering dataset containing:
- Natural language medical questions
- Relevant biomedical abstracts
- Human-annotated relevance judgments (qrels)

Why NFCorpus:
- Dense, technical text (stress-tests chunking strategies)
- Realistic QA retrieval workload
- Public, well-known IR benchmark


## Experiment Goal

Evaluate how RAG retrieval configurations affect:
- Ranking quality (Precision, Recall, F1, NDCG@5, MRR)
- System stability under LLM context limits
- End-to-end processing time

We focus on **retrieval-first optimization**, not generation quality.



## Hypotheses

1. Smaller chunk sizes will improve ranking metrics by reducing
   semantic dilution in dense biomedical text.
2. Increasing retrieval depth improves recall but risks
   exceeding the LLM context window.
3. A stable RAG system must balance ranking quality with
   prompt length constraints.


### Install and Initialize RapidFire AI

In [None]:
try:
    import rapidfireai
    print("✅ rapidfireai already installed")
except ImportError:
    !pip install rapidfireai  # Takes 1 min
    !rapidfireai init --evals # Takes 1 min

### Import Rapidfire Components

In [2]:
import os
os.environ['PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION'] = 'python'

from rapidfireai import Experiment
from rapidfireai.evals.automl import List, RFLangChainRagSpec, RFvLLMModelConfig, RFPromptManager, RFGridSearch
import re, json
from typing import List as listtype, Dict, Any

# NB: If you get "AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'" from Colab, just rerun this cell

### Download Dataset

In [None]:


!pip -q install beir

from beir import util
from beir.datasets.data_loader import GenericDataLoader
from pathlib import Path

ROOT = Path.cwd()
OUT_DIR = ROOT / "datasets"
OUT_DIR.mkdir(parents=True, exist_ok=True)

url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nfcorpus.zip"

# Download + unzip (returns the extracted dataset folder path)
extracted_path = Path(util.download_and_unzip(url, str(OUT_DIR)))
print("Extracted to:", extracted_path)

# Find the folder that actually contains corpus.jsonl (handles nested nfcorpus/nfcorpus)
corpus_jsonl = next(extracted_path.rglob("corpus.jsonl"))
DATASET_DIR = corpus_jsonl.parent
print("Using data_folder:", DATASET_DIR)

loader = GenericDataLoader(data_folder=str(DATASET_DIR))
corpus, queries, qrels = loader.load()

print("\nDataset stats:")
print(f"Corpus documents: {len(corpus)}")
print(f"Queries: {len(queries)}")
print(f"Qrels: {sum(len(v) for v in qrels.values())}")

### Dataset Normalization

RAG evaluation pipelines rely on **strict ID consistency** across all inputs.

In particular, the following fields must refer to the **same identifiers**:
- `queries.jsonl` → `query_id`
- `corpus.jsonl` → `_id`
- `qrels.tsv` → `query_id`, `corpus_id`

If any of these IDs are stored as strings while others are integers, joins and lookups can fail silently or cause downstream errors during indexing, retrieval, or evaluation.

RapidFire AI expects **integer IDs** for queries and documents, since IDs are used internally for hashing, indexing, and metric computation.  
This notebook includes a normalization step to ensure all dataset files use consistent integer IDs before running experiments.
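
As a minimal sketch of the kind of check this implies, the snippet below shows how a string/integer mismatch makes a lookup miss without raising an error, and how qrels rows with unknown IDs can be listed. The helper `find_orphan_qrels` and the toy IDs are hypothetical; they are not part of RapidFire AI, BEIR, or NFCorpus.

```python
from typing import Iterable, List, Set, Tuple


def find_orphan_qrels(
    query_ids: Set[int],
    corpus_ids: Set[int],
    qrels_rows: Iterable[Tuple[int, int, int]],
) -> List[Tuple[int, int, int]]:
    """Return qrels rows whose query_id or corpus_id is unknown (hypothetical helper)."""
    return [
        row
        for row in qrels_rows
        if row[0] not in query_ids or row[1] not in corpus_ids
    ]


# A string ID never matches an integer ID, and no error is raised
assert "7" not in {7}

# Illustrative toy data, not taken from NFCorpus
toy_query_ids = {0, 1}
toy_corpus_ids = {10, 11}
toy_qrels = [(0, 10, 1), (1, 11, 2), (1, 99, 1)]

orphans = find_orphan_qrels(toy_query_ids, toy_corpus_ids, toy_qrels)
print(orphans)  # [(1, 99, 1)]
```

Running a check like this before indexing surfaces mismatches early instead of letting them show up as unexplained zero-recall queries.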


### Dataset normalization contract
This notebook produces a canonical set of dataset files (*.final.*) that are guaranteed to use integer identifiers.

If the source dataset already satisfies this constraint, files are copied unchanged. Otherwise, deterministic ID normalization is applied.

In [None]:
from pathlib import Path
import json
import pandas as pd
import shutil

def prepare_final_dataset(
    dataset_dir: Path,
    corpus_file="corpus.jsonl",
    queries_file="queries.jsonl",
    qrels_file="qrels.tsv",
):
    """
    Produces canonical files:
      - corpus.final.jsonl
      - queries.final.jsonl
      - qrels.final.tsv

    Handles datasets where query IDs may be under `_id` or `query_id`.
    """

    corpus_path = dataset_dir / corpus_file
    queries_path = dataset_dir / queries_file
    qrels_path = dataset_dir / qrels_file

    final_corpus = dataset_dir / "corpus.final.jsonl"
    final_queries = dataset_dir / "queries.final.jsonl"
    final_qrels  = dataset_dir / "qrels.final.tsv"

    # --------------------
    # Load inputs
    # --------------------
    with open(corpus_path) as f:
        corpus = [json.loads(l) for l in f]

    with open(queries_path) as f:
        queries = [json.loads(l) for l in f]

    qrels = pd.read_csv(
        qrels_path,
        sep="\t",
        header=None,
        names=["query_id", "corpus_id", "relevance"]
    )

    # --------------------
    # Detect query ID field
    # --------------------
    if "query_id" in queries[0]:
        query_id_key = "query_id"
    elif "_id" in queries[0]:
        query_id_key = "_id"
    else:
        raise ValueError("❌ Could not find query ID field in queries.jsonl")

    print("🔍 ID type check")
    print(f"  corpus _id: {type(corpus[0].get('_id'))}")
    print(f"  query {query_id_key}: {type(queries[0].get(query_id_key))}")

    # --------------------
    # Check if already normalized
    # --------------------
    if isinstance(corpus[0]["_id"], int) and isinstance(queries[0][query_id_key], int):
        shutil.copy(corpus_path, final_corpus)
        shutil.copy(queries_path, final_queries)
        shutil.copy(qrels_path, final_qrels)
        print("\n✅ IDs already integers; copied originals to *.final.*")
        return

    # --------------------
    # Normalize IDs
    # --------------------
    print("\n⚠️ String IDs detected. Normalizing deterministically...")

    corpus_id_map = {doc["_id"]: i for i, doc in enumerate(corpus)}
    query_id_map  = {q[query_id_key]: i for i, q in enumerate(queries)}

    # Rewrite corpus
    for doc in corpus:
        doc["_id"] = corpus_id_map[doc["_id"]]

    # Rewrite queries (canonicalize to query_id)
    for q in queries:
        q["query_id"] = query_id_map[q[query_id_key]]
        if query_id_key != "query_id":
            del q[query_id_key]


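    # Rewrite qrels
    qrels["query_id"]  = qrels["query_id"].map(query_id_map)
    qrels["corpus_id"] = qrels["corpus_id"].map(corpus_id_map)

    # Unmapped rows include the TSV header row (if present) and IDs absent from corpus/queries
    missing = qrels[qrels.isnull().any(axis=1)]
    if len(missing) > 0:
        print(f"⚠️ Dropping {len(missing)} qrels with missing corpus/query IDs")
        qrels = qrels.dropna().reset_index(drop=True)

    # map() yields floats when any row was unmapped; cast back to integer IDs
    qrels = qrels.astype({"query_id": int, "corpus_id": int})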

    # --------------------
    # Write canonical finals
    # --------------------
    with open(final_corpus, "w") as f:
        for doc in corpus:
            f.write(json.dumps(doc) + "\n")

    with open(final_queries, "w") as f:
        for q in queries:
            f.write(json.dumps(q) + "\n")

    qrels.to_csv(final_qrels, sep="\t", index=False, header=False)

    print("\n✅ Normalization complete")
    print("➡ Written canonical files:")
    print("   - corpus.final.jsonl")
    print("   - queries.final.jsonl")
    print("   - qrels.final.tsv")

# Same location as the download cell; on Colab this resolves to /content/datasets/nfcorpus
NF_DATASET_DIR = Path.cwd() / "datasets" / "nfcorpus"

prepare_final_dataset(
    dataset_dir=NF_DATASET_DIR,
    corpus_file="corpus.jsonl",
    queries_file="queries.jsonl",
    qrels_file="qrels/test.tsv",
)



### Load Dataset, Rename Columns, and Downsample Data

In [None]:
from datasets import load_dataset
import pandas as pd
import json, random
from pathlib import Path

# ----------------
# Project + dataset root
# ----------------
PROJECT_ROOT = Path.cwd()
dataset_dir = PROJECT_ROOT / "datasets" / "nfcorpus"

# ----------------
# Load queries (canonical final)
# ----------------
nfcorpus_dataset = load_dataset(
    "json",
    data_files=str(dataset_dir / "queries.final.jsonl"),
    split="train"
)

# Rename only if needed
if "text" in nfcorpus_dataset.column_names:
    nfcorpus_dataset = nfcorpus_dataset.rename_columns({"text": "query"})

# ----------------
# Load qrels (canonical final)
# ----------------
qrels = pd.read_csv(
    dataset_dir / "qrels.final.tsv",
    sep="\t",
    header=None,
    names=["query_id", "corpus_id", "relevance"]
)


# ----------------
# Downsample queries + corpus jointly (NFCorpus-safe)
# ----------------
NUM_QUERIES = 10
rseed = 1

# Keep only queries that have qrels
valid_query_ids = set(qrels["query_id"].unique())

nfcorpus_dataset = nfcorpus_dataset.filter(
    lambda x: x["query_id"] in valid_query_ids
)

print(f"Queries with qrels: {len(nfcorpus_dataset)}")

nfcorpus_dataset = (
    nfcorpus_dataset
    .shuffle(seed=rseed)
    .select(range(min(NUM_QUERIES, len(nfcorpus_dataset))))
)

print(f"Using {len(nfcorpus_dataset)} queries")

# IDs are guaranteed to be integers
query_ids = set(nfcorpus_dataset["query_id"])

# Filter qrels to the sampled queries
qrels_filtered = qrels[qrels["query_id"].isin(query_ids)]
relevant_corpus_ids = set(qrels_filtered["corpus_id"].tolist())

print(f"Found {len(relevant_corpus_ids)} relevant documents for these queries")

# ----------------
# Load corpus (canonical final) and filter to relevant documents
# ----------------
input_file = dataset_dir / "corpus.final.jsonl"
output_file = dataset_dir / "corpus_sampled.jsonl"

with open(input_file, "r") as f:
    all_corpus = [json.loads(line) for line in f]

sampled_corpus = [
    doc for doc in all_corpus
    if doc["_id"] in relevant_corpus_ids
]

with open(output_file, "w") as f:
    for doc in sampled_corpus:
        f.write(json.dumps(doc) + "\n")

print(f"Sampled {len(sampled_corpus)} documents from {len(all_corpus)} total")
print(f"Saved to: {output_file}")
print(f"Filtered qrels to {len(qrels_filtered)} relevance judgments")

# Update qrels to match sampled dataset
qrels = qrels_filtered.reset_index(drop=True)


## Experiment Design

### Fixed Parameters
- Dataset: NFCorpus
- Embedding model: sentence-transformers/all-MiniLM-L6-v2
- Vector index: FAISS (default)
- Retrieval method: similarity search
- Reranker: cross-encoder/ms-marco-MiniLM-L6-v2
- LLM: Qwen/Qwen2.5-0.5B-Instruct (context capped at 3000 tokens via `max_model_len`)

### Varied Parameters
- Chunk size ∈ {128, 256}
- Retrieval depth (k) ∈ {8, 16}
- Reranker top_n ∈ {2}
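
The varied parameters above form a small Cartesian grid. As a plain-Python sketch of how the combinations enumerate (independent of `RFGridSearch`; the dictionary keys `chunk_size`, `rag_k`, and `top_n` are chosen here for illustration):

```python
from itertools import product

chunk_sizes = [128, 256]
retrieval_depths = [8, 16]
rerank_top_ns = [2]

# One dictionary per configuration in the grid
grid = [
    {"chunk_size": c, "rag_k": k, "top_n": n}
    for c, k, n in product(chunk_sizes, retrieval_depths, rerank_top_ns)
]

for config in grid:
    print(config)
print(f"Total configurations: {len(grid)}")
```

With two chunk sizes, two retrieval depths, and a single reranker setting, the grid contains 2 × 2 × 1 = 4 configurations.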

### Metrics
- Precision
- Recall
- F1 Score
- NDCG@5
- MRR
- Processing time
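
For reference, a sketch of the standard binary-relevance definitions of these metrics for a single query. The notation is introduced here and is an assumption: $D$ is the set of retrieved document IDs, $G$ is the set of relevant IDs from the qrels, $rel_i \in \{0, 1\}$ indicates whether the document at rank $i$ is relevant, and graded relevance scores are collapsed to relevant / not relevant.

$$
P = \frac{|D \cap G|}{|D|}, \qquad
R = \frac{|D \cap G|}{|G|}, \qquad
F_1 = \frac{2PR}{P + R}
$$

$$
\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i + 1)}, \qquad
\mathrm{IDCG@}k = \sum_{i=1}^{\min(k, |G|)} \frac{1}{\log_2(i + 1)}, \qquad
\mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}
$$

$$
\mathrm{RR} = \frac{1}{r^{*}}, \qquad
\mathrm{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \mathrm{RR}_q
$$

Here $r^{*}$ is the rank of the first relevant document (with $\mathrm{RR} = 0$ if none is retrieved) and $Q$ is the set of evaluated queries.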




### Create Experiment

In [None]:
experiment = Experiment(experiment_name="exp1-nfcorpus-rag-colab", mode="evals")

### Define Partial Multi-Config Knobs for LangChain part of RAG Pipeline using RapidFire AI Wrapper APIs

In [7]:
from langchain_community.document_loaders import DirectoryLoader, JSONLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_classic.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

# Per-Actor batch size for hardware efficiency
batch_size = 50

# 2 chunk sizes x 2 retrieval depths (k) x 1 reranking top-n = 4 combinations in total
rag_gpu = RFLangChainRagSpec(
    document_loader=DirectoryLoader(
        path=str(NF_DATASET_DIR),

        glob="corpus.final.jsonl",
        loader_cls=JSONLoader,
        loader_kwargs={
            "jq_schema": ".",
            "content_key": "text",
            "metadata_func": lambda record, metadata: {
                "corpus_id": int(record.get("_id"))
            },  # store the document id
            "json_lines": True,
            "text_content": False,
        },
        sample_seed=42,
    ),
    # 2 chunking strategies with different chunk sizes
    text_splitter=List([
            RecursiveCharacterTextSplitter.from_tiktoken_encoder(
                encoding_name="gpt2", chunk_size=256, chunk_overlap=32
            ),
            RecursiveCharacterTextSplitter.from_tiktoken_encoder(
                encoding_name="gpt2", chunk_size=128, chunk_overlap=32
            ),
            # RecursiveCharacterTextSplitter.from_tiktoken_encoder(
            #     encoding_name="gpt2", chunk_size=64, chunk_overlap=32
            # ),
        ],
    ),
    embedding_cls=HuggingFaceEmbeddings,
    embedding_kwargs={
        "model_name": "sentence-transformers/all-MiniLM-L6-v2",
        "model_kwargs": {"device": "cuda:0"},
        "encode_kwargs": {"normalize_embeddings": True, "batch_size": batch_size},
    },
    vector_store=None,  # uses FAISS by default
    search_type="similarity",
    #search_kwargs={"k": 8},
    search_kwargs={"k": List([8, 16])},
    # Single reranking top-n value (extend the list to sweep more)
    reranker_cls=CrossEncoderReranker,
    reranker_kwargs={
        "model_name": "cross-encoder/ms-marco-MiniLM-L6-v2",
        "model_kwargs": {"device": "cpu"},
        "top_n": List([2]),
        #"top_n": List([2, 5]),
    },
    enable_gpu_search=True,
)

## Context Length Constraints

We observed that increasing `chunk_size` to 256 caused prompt overflow in some configurations,
even when limiting reranked context to `top_n=2`.

This occurs because:
- Biomedical abstracts are dense
- Larger chunks increase prompt length rapidly
- The Qwen2.5-0.5B-Instruct generator is configured with a 3000-token limit (`max_model_len=3000`)

As a result, **`chunk_size=256` combined with deep retrieval (`rag_k=16`) fails deterministically**.  
However, `chunk_size=256` with moderate retrieval depth (`rag_k=8`) remains stable and achieves strong ranking quality.
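
A rough pre-flight check can flag prompts that risk exceeding the limit before they reach the generator. The sketch below is a conservative illustration, not RapidFire AI or vLLM functionality: `fits_context`, `whitespace_count`, and the `template_overhead` allowance are hypothetical, and whitespace splitting is only a crude stand-in for the model tokenizer. The defaults of 3000 and 128 mirror the `max_model_len` and `max_tokens` values used in this notebook.

```python
from typing import Callable


def fits_context(
    system_prompt: str,
    user_prompt: str,
    count_tokens: Callable[[str], int],
    max_model_len: int = 3000,
    max_new_tokens: int = 128,
    template_overhead: int = 32,
) -> bool:
    """Return True if prompt tokens plus the generation budget fit in the context window."""
    prompt_tokens = (
        count_tokens(system_prompt) + count_tokens(user_prompt) + template_overhead
    )
    return prompt_tokens + max_new_tokens <= max_model_len


def whitespace_count(text: str) -> int:
    """Crude token estimate (assumption): one token per whitespace-separated word."""
    return len(text.split())


short_prompt = "context " * 500
long_prompt = "context " * 3000

print(fits_context("Answer the question.", short_prompt, whitespace_count))  # True
print(fits_context("Answer the question.", long_prompt, whitespace_count))   # False
```

In practice the count should come from the generator's own tokenizer, since chunk sizes in this notebook are measured with the `gpt2` encoding while the generator uses a different tokenizer, so the two token counts will not match exactly.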


### Define Data Processing and Postprocessing Functions

In [None]:
def sample_preprocess_fn(
    batch: Dict[str, listtype], rag: RFLangChainRagSpec, prompt_manager: RFPromptManager
) -> Dict[str, listtype]:
    """Function to prepare the final inputs given to the generator model"""

    INSTRUCTIONS = "Use the provided biomedical context to answer the question accurately."


    # Perform batched retrieval over all queries; returns a list of lists of k documents per query
    all_context = rag.get_context(batch_queries=batch["query"], serialize=False)

    # Extract the retrieved document ids from the context
    retrieved_documents = [
        [doc.metadata["corpus_id"] for doc in docs] for docs in all_context
    ]

    # Serialize the retrieved documents into a single string per query using the default template
    serialized_context = rag.serialize_documents(all_context)
    batch["query_id"] = [int(query_id) for query_id in batch["query_id"]]

    # Each batch to contain conversational prompt, retrieved documents, and original 'query_id', 'query', 'metadata'
    return {
        "prompts": [
            [
                {"role": "system", "content": INSTRUCTIONS},
                {
                    "role": "user",
                    "content": f"Here is some relevant context:\n{context}. \nNow answer the following question using the context provided earlier:\n{question}",
                },
            ]
            for question, context in zip(batch["query"], serialized_context)
        ],
        "retrieved_documents": retrieved_documents,
        **batch,
    }


def sample_postprocess_fn(batch: Dict[str, listtype]) -> Dict[str, listtype]:
    """Function to postprocess outputs produced by generator model"""
    # Get ground truth documents for each query; can be done in preprocess_fn too but done here for clarity
    batch["ground_truth_documents"] = [
        qrels[qrels["query_id"] == query_id]["corpus_id"].tolist()
        for query_id in batch["query_id"]
    ]
    return batch

### Define Custom Eval Metrics Functions

In [9]:
import math


def compute_ndcg_at_k(retrieved_docs: set, expected_docs: set, k=5):
    """Utility function to compute NDCG@k"""
    relevance = [1 if doc in expected_docs else 0 for doc in list(retrieved_docs)[:k]]
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevance))

    # IDCG: perfect ranking limited by min(k, len(expected_docs))
    # Ideal gains use the same binary scale (1) as the DCG above so that NDCG can reach 1.0
    ideal_length = min(k, len(expected_docs))
    ideal_relevance = [1] * ideal_length + [0] * (k - ideal_length)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal_relevance))

    return dcg / idcg if idcg > 0 else 0.0


def compute_rr(retrieved_docs: set, expected_docs: set):
    """Utility function to compute Reciprocal Rank (RR) for a single query"""
    rr = 0
    for i, retrieved_doc in enumerate(retrieved_docs):
        if retrieved_doc in expected_docs:
            rr = 1 / (i + 1)
            break
    return rr


def sample_compute_metrics_fn(batch: Dict[str, listtype]) -> Dict[str, Dict[str, Any]]:
    """Function to compute all eval metrics based on retrievals and/or generations"""

    true_positives, precisions, recalls, f1_scores, ndcgs, rrs = 0, [], [], [], [], []
    total_queries = len(batch["query"])

    for pred, gt in zip(batch["retrieved_documents"], batch["ground_truth_documents"]):
        expected_set = set(gt)
        retrieved_set = set(pred)
        # Rank-based metrics need retrieval order; de-duplicate while preserving it
        ranked_docs = list(dict.fromkeys(pred))

        true_positives = len(expected_set.intersection(retrieved_set))
        precision = true_positives / len(retrieved_set) if len(retrieved_set) > 0 else 0
        recall = true_positives / len(expected_set) if len(expected_set) > 0 else 0
        f1 = (
            2 * precision * recall / (precision + recall)
            if (precision + recall) > 0
            else 0
        )

        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)
        ndcgs.append(compute_ndcg_at_k(ranked_docs, expected_set, k=5))
        rrs.append(compute_rr(ranked_docs, expected_set))

    return {
        "Total": {"value": total_queries},
        "Precision": {"value": sum(precisions) / total_queries},
        "Recall": {"value": sum(recalls) / total_queries},
        "F1 Score": {"value": sum(f1_scores) / total_queries},
        "NDCG@5": {"value": sum(ndcgs) / total_queries},
        "MRR": {"value": sum(rrs) / total_queries},
    }


def sample_accumulate_metrics_fn(
    aggregated_metrics: Dict[str, listtype],
) -> Dict[str, Dict[str, Any]]:
    """Function to accumulate eval metrics across all batches"""

    num_queries_per_batch = [m["value"] for m in aggregated_metrics["Total"]]
    total_queries = sum(num_queries_per_batch)
    algebraic_metrics = ["Precision", "Recall", "F1 Score", "NDCG@5", "MRR"]

    return {
        "Total": {"value": total_queries},
        **{
            metric: {
                "value": sum(
                    m["value"] * queries
                    for m, queries in zip(
                        aggregated_metrics[metric], num_queries_per_batch
                    )
                )
                / total_queries,
                "is_algebraic": True,
                "value_range": (0, 1),
            }
            for metric in algebraic_metrics
        },
    }
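
NDCG and reciprocal rank depend on the order in which documents are retrieved, which a single-query example makes easy to check by hand. The sketch below is self-contained and uses standard binary-relevance definitions; the function names `ndcg_at_k` and `reciprocal_rank` and the toy document IDs are hypothetical and separate from the notebook's own functions.

```python
import math
from typing import List, Set


def ndcg_at_k(ranked: List[int], relevant: Set[int], k: int = 5) -> float:
    """Binary-relevance NDCG@k over an ordered list of document IDs."""
    dcg = sum(
        1 / math.log2(i + 2) for i, doc in enumerate(ranked[:k]) if doc in relevant
    )
    idcg = sum(1 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / idcg if idcg > 0 else 0.0


def reciprocal_rank(ranked: List[int], relevant: Set[int]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for i, doc in enumerate(ranked):
        if doc in relevant:
            return 1 / (i + 1)
    return 0.0


# Toy query (illustrative IDs): documents 7 and 3 are relevant
ranked = [5, 7, 3]
relevant = {7, 3}

hits = len(set(ranked) & relevant)
precision = hits / len(set(ranked))
recall = hits / len(relevant)

print(f"precision={precision:.3f} recall={recall:.3f}")
print(f"ndcg@5={ndcg_at_k(ranked, relevant):.3f} rr={reciprocal_rank(ranked, relevant):.3f}")
```

Because the first relevant document sits at rank 2, the reciprocal rank is 1/2, and NDCG@5 falls below 1 even though recall is perfect. Swapping the order to `[7, 3, 5]` would raise both to 1.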

### Define Partial Multi-Config Knobs for vLLM Generator part of RAG Pipeline using RapidFire AI Wrapper APIs

This tutorial uses Qwen2.5-0.5B-Instruct (0.5B parameters), which fits comfortably within Colab's memory constraints.

In [10]:
vllm_config1 = RFvLLMModelConfig(
    model_config={
        "model": "Qwen/Qwen2.5-0.5B-Instruct",
        "dtype": "half",
        "gpu_memory_utilization": 0.25,
        "tensor_parallel_size": 1,
        "distributed_executor_backend": "mp",
        "enable_chunked_prefill": False,
        "enable_prefix_caching": False,
        "max_model_len": 3000,
        "disable_log_stats": True,  # Disable vLLM progress logging
        "enforce_eager": True,
        "disable_custom_all_reduce": True,
    },
    sampling_params={
        "temperature": 0.8,
        "top_p": 0.95,
        "max_tokens": 128,
    },
    rag=rag_gpu,
    prompt_manager=None,
)

batch_size = 3 # Smaller batch size for generation
config_set = {
    "vllm_config": vllm_config1,  # Only 1 generator, but it represents 4 full configs
    "batch_size": batch_size,
    "preprocess_fn": sample_preprocess_fn,
    "postprocess_fn": sample_postprocess_fn,
    "compute_metrics_fn": sample_compute_metrics_fn,
    "accumulate_metrics_fn": sample_accumulate_metrics_fn,
    "online_strategy_kwargs": {
        "strategy_name": "normal",
        "confidence_level": 0.95,
        "use_fpc": True,
    },
}

### Create Config Group

In [11]:
# Simple grid search across all config combinations: 4 total (2 chunkers × 2 retrieval depths × 1 reranker top-n)
config_group = RFGridSearch(config_set)

### Run Multi-Config Evals + Launch Interactive Run Controller

Now we get to the main function for running multi-config evals. Two tables will appear below the run_evals cell:
- The first table will appear immediately. It lists all preprocessing/RAG sources.
- After a short while, the second table will appear. It lists all individual runs with their knobs and metrics that are updated in real-time via online aggregation showing both estimates and confidence intervals.
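
To give a feel for what an estimate with a confidence interval means while only part of the dataset has been processed, here is a generic textbook construction: a normal-approximation interval for a mean with a finite population correction. It is a sketch under stated assumptions and not necessarily how RapidFire AI computes its intervals; the function `normal_ci_with_fpc` and the toy scores are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev
from typing import List, Tuple


def normal_ci_with_fpc(
    sample: List[float], population_size: int, confidence_level: float = 0.95
) -> Tuple[float, float, float]:
    """Return (estimate, lower, upper) for the mean of a partially processed dataset."""
    n = len(sample)
    if n < 2:
        raise ValueError("Need at least two observations to estimate the spread")
    estimate = mean(sample)
    if n >= population_size:
        # Every item has been processed, so the mean is exact
        return estimate, estimate, estimate
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    standard_error = stdev(sample) / math.sqrt(n)
    # Finite population correction shrinks the interval as n approaches the dataset size
    fpc = math.sqrt((population_size - n) / (population_size - 1))
    half_width = z * standard_error * fpc
    return estimate, estimate - half_width, estimate + half_width


# Illustrative per-query scores for 3 of 10 queries (not real results)
est, lower, upper = normal_ci_with_fpc([0.2, 0.4, 0.6], population_size=10)
print(f"estimate={est:.3f}, interval=({lower:.3f}, {upper:.3f})")
```

The interval narrows as more queries are processed and collapses to a point once the whole dataset has been evaluated.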

RapidFire AI also provides an Interactive Controller panel UI for Colab that lets you manage executing runs dynamically in real-time from the notebook:

- ⏹️ **Stop**: Gracefully stop a running config
- ▶️ **Resume**: Resume a stopped run
- 🗑️ **Delete**: Remove a run from this experiment
- 📋 **Clone**: Create a new run by editing the config dictionary of a parent run to try new knob values; optional warm start of parameters
- 🔄 **Refresh**: Update run status and metrics

In [None]:
# Launch evals of all RAG configs in the config_group with swap granularity of 4 chunks
# NB: If your machine has more than 1 GPU, set num_actors to that number
results = experiment.run_evals(
    config_group=config_group,
    dataset=nfcorpus_dataset,
    num_actors=1,
    num_shards=4,
    seed=42,
)

### View Results

In [None]:
# Convert results dict to DataFrame
results_df = pd.DataFrame([
    {k: v['value'] if isinstance(v, dict) and 'value' in v else v for k, v in {**metrics_dict, 'run_id': run_id}.items()}
    for run_id, (_, metrics_dict) in results.items()
])

results_df

## Result Interpretation

Key observations:
- chunk_size=128 provides stable execution across all runs
- Larger chunks (chunk_size=256) improve recall marginally but fail due to context overflow when combined with deep retrieval (rag_k=16)
- Increasing retrieval depth (rag_k=16) improves recall but introduces mild precision and ranking tradeoffs,
  especially when combined with larger chunks
- Reranking with top_n=2 keeps the reranked context short; it was the only top_n value tested

This supports the hypothesis that stability depends on balancing ranking quality against prompt length,
with chunk size as the dominant factor for dense biomedical datasets.


### End Experiment

In [None]:
from google.colab import output
from IPython.display import display, HTML

display(HTML('''
<button id="continue-btn" style="padding: 10px 20px; font-size: 16px;">Click to End Experiment</button>
'''))

# eval_js blocks until the Promise resolves
output.eval_js('''
new Promise((resolve) => {
    document.getElementById("continue-btn").onclick = () => {
        document.getElementById("continue-btn").disabled = true;
        document.getElementById("continue-btn").innerText = "Continuing...";
        resolve("clicked");
    };
})
''')

# Actually end the experiment after the button is clicked
experiment.end()
print("Done!")

### View RapidFire AI Log Files

In [None]:
# Get the experiment-specific log file
log_file = experiment.get_log_file_path()

print(f"📄 Log File: {log_file}")
print()

if log_file.exists():
    print("=" * 80)
    print(f"Last 30 lines of {log_file.name}:")
    print("=" * 80)
    with open(log_file, 'r', encoding='utf-8') as f:
        lines = f.readlines()
        for line in lines[-30:]:
            print(line.rstrip())
else:
    print(f"❌ Log file not found: {log_file}")

### Plot the Metrics

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

# ----------------------------------
# Normalize Processing Time safely
# ----------------------------------
results_df = results_df.copy()

results_df["Processing Time"] = (
    results_df["Processing Time"]
    .astype(str)
    .str.replace(" seconds", "", regex=False)
    .astype(float)
)

# ----------------------------------
# Prepare plotting dataframe
# ----------------------------------
plot_df = results_df[
    [
        "chunk_size",
        "rag_k",
        "Precision",
        "Recall",
        "F1 Score",
        "NDCG@5",
        "MRR",
        "Processing Time",
    ]
].copy()

# Force numeric where possible (safe, no hardcoding)
for col in ["Precision", "Recall", "F1 Score", "NDCG@5", "MRR", "Processing Time"]:
    plot_df[col] = pd.to_numeric(plot_df[col], errors="coerce")

plot_df = plot_df.dropna()
plot_df = plot_df.sort_values(["rag_k", "chunk_size"])

# ----------------------------------
# Annotation helper (robust)
# ----------------------------------
def annotate_from_df(ax, df, x_col, y_col, fmt, y_offset=6):
    for _, row in df.iterrows():
        if pd.notna(row[y_col]):
            ax.annotate(
                fmt.format(row[y_col]),
                (row[x_col], row[y_col]),
                textcoords="offset points",
                xytext=(0, y_offset),
                ha="center",
                fontsize=9
            )

# ----------------------------------
# Metrics to plot
# ----------------------------------
metrics = [
    ("Precision", "{:.3f}"),
    ("Recall", "{:.3f}"),
    ("F1 Score", "{:.3f}"),
    ("NDCG@5", "{:.3f}"),
    ("MRR", "{:.3f}"),
    ("Processing Time", "{:.2f}s"),
]

# ----------------------------------
# Generate plots
# ----------------------------------
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
axes = axes.flatten()

for ax, (metric, fmt) in zip(axes, metrics):
    for k, g in plot_df.groupby("rag_k"):
        ax.plot(
            g["chunk_size"],
            g[metric],
            marker="o",
            label=f"k={int(k)}"
        )
        annotate_from_df(ax, g, "chunk_size", metric, fmt)

    ax.set_xlabel("Chunk Size (tokens)")
    ax.set_ylabel(metric)
    ax.set_title(f"{metric} vs Chunk Size")
    ax.set_xticks(sorted(plot_df["chunk_size"].unique()))
    ax.grid(True)
    ax.legend()

plt.tight_layout()
plt.show()


## Best-Performing Configurations

Our experiments reveal **two distinct optima**, depending on whether the goal is maximizing ranking quality or prioritizing latency and robustness.

### Best Ranking Quality (Primary Winner)

- **Chunk size:** 256  
- **Retrieval depth (rag_k):** 8  
- **Reranker top_n:** 2  

This configuration achieves the **highest NDCG@5**, indicating the strongest ranking quality among all successful runs.  
Despite using larger chunks, it remains **stable under context limits** when retrieval depth is kept moderate.

---

### Best Latency–Stability Tradeoff

- **Chunk size:** 128  
- **Retrieval depth (rag_k):** 16  
- **Reranker top_n:** 2  

This configuration minimizes **end-to-end processing time** while avoiding prompt overflows.  
Although ranking metrics are slightly lower than the primary winner, it provides a strong balance between **recall, stability, and runtime efficiency**.

---

### Key Insight

There is **no single universally optimal RAG configuration**.  
Instead, effective retrieval-first RAG systems must balance:

- Ranking quality (NDCG@5, MRR)
- Retrieval depth and semantic coverage
- Context-length constraints
- Runtime stability

These results demonstrate why **controlled experimentation is essential** when deploying RAG systems under real-world constraints.
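
Choosing a winner per objective can be made explicit in code. The sketch below picks the best run for a given metric from a list of per-run dictionaries; the helper `best_config` is hypothetical, and the numbers are placeholders for illustration only, not this notebook's results.

```python
from typing import Any, Dict, List


def best_config(
    runs: List[Dict[str, Any]], metric: str, maximize: bool = True
) -> Dict[str, Any]:
    """Return the run with the best value of `metric`, skipping runs where it is missing."""
    candidates = [run for run in runs if run.get(metric) is not None]
    if not candidates:
        raise ValueError(f"No run reports the metric {metric!r}")
    chooser = max if maximize else min
    return chooser(candidates, key=lambda run: run[metric])


# Placeholder values for illustration only; these are NOT the notebook's results
toy_runs = [
    {"chunk_size": 128, "rag_k": 8, "NDCG@5": 0.10, "Processing Time": 30.0},
    {"chunk_size": 128, "rag_k": 16, "NDCG@5": 0.20, "Processing Time": 10.0},
    {"chunk_size": 256, "rag_k": 8, "NDCG@5": 0.30, "Processing Time": 20.0},
    {"chunk_size": 256, "rag_k": 16, "NDCG@5": None, "Processing Time": None},
]

quality_winner = best_config(toy_runs, "NDCG@5", maximize=True)
latency_winner = best_config(toy_runs, "Processing Time", maximize=False)
print("Best ranking quality:", quality_winner["chunk_size"], quality_winner["rag_k"])
print("Lowest latency:", latency_winner["chunk_size"], latency_winner["rag_k"])
```

A results DataFrame can be converted into this shape with `results_df.to_dict("records")`. Failed runs, represented here by `None` metrics, are excluded rather than treated as zeros.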


## Role of RapidFire AI

RapidFire AI enabled:
- Parallel evaluation of RAG configurations
- Clean separation of retrieval and evaluation logic
- Interactive control over long-running experiments
- Reproducible, customer-ready experimentation artifacts

This mirrors how real AI teams test RAG pipelines before deployment.


## Conclusion

This notebook demonstrates that effective RAG optimization requires:
- Controlled experimentation
- Awareness of context length constraints
- Dataset-specific tuning

Rather than maximizing retrieval depth blindly, we show that
balanced configurations produce more reliable real-world systems.
