# UCSB Course Catalog RAG System
## RapidFire AI Winter Competition - RAG Track

**Dataset:** UCSB 2009-2010 Course Catalog
**Research Question:** Which chunking and reranking strategies optimize retrieval for course catalog queries?
**Configurations Tested:** 4 (chunk_size: [128, 256] × reranking top_n: [2, 5])

## Experiment Hypothesis

Larger chunks (256 tokens) will better preserve course description context compared to smaller chunks (128 tokens), leading to improved Precision and NDCG@5 scores.
Keeping more reranked documents (top_n=5) will improve Recall at the cost of some Precision.
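
The trade-off in the second hypothesis can be bounded before any run. The sketch below is a minimal illustration, assuming each query has exactly 5 relevant documents (matching the qrels used in this notebook) and that the retrieved documents are distinct; `metric_bounds` is a hypothetical helper, not part of RapidFire AI.

```python
def metric_bounds(top_n: int, num_relevant: int = 5) -> dict:
    """Best-case set-based Precision and Recall when top_n documents are kept."""
    # At most min(top_n, num_relevant) of the kept documents can be relevant
    best_hits = min(top_n, num_relevant)
    return {
        "top_n": top_n,
        "max_precision": best_hits / top_n,
        "max_recall": best_hits / num_relevant,
    }

for top_n in (2, 5):
    print(metric_bounds(top_n))
```

Under these assumptions, top_n=2 caps Recall at 2/5 no matter how good the reranker is, while top_n=5 can in principle reach full Recall.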

In [None]:
try:
    import rapidfireai
    print("✅ rapidfireai already installed")
except ImportError:
    %pip install rapidfireai
    !rapidfireai init --evals

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import os
os.environ['PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION'] = 'python'

from rapidfireai import Experiment
from rapidfireai.automl import List, RFLangChainRagSpec, RFvLLMModelConfig, RFPromptManager, RFGridSearch
import re, json
from typing import List as listtype, Dict, Any

## 📊 Dataset Preparation

The UCSB 2009-2010 catalog was processed into three files:
- **corpus.jsonl**: 1,304 document chunks from the 490-page PDF
- **queries.jsonl**: 30 typical student questions about courses and requirements
- **qrels.tsv**: 150 relevance judgments (5 per query) generated via embedding similarity

For dataset preparation code, see the separate data preparation scripts.
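
The preparation scripts are not included here, so the following is only a rough sketch of how "5 relevance judgments per query via embedding similarity" could be produced. It assumes query and chunk embeddings are already available as plain vectors; `cosine`, `top_k_qrels`, and the toy vectors are hypothetical and stand in for whatever the actual scripts do.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k_qrels(query_vecs, corpus_vecs, k=5):
    """Label the k most similar corpus chunks as relevant for each query."""
    rows = []
    for query_id, q_vec in query_vecs.items():
        ranked = sorted(
            corpus_vecs,
            key=lambda cid: cosine(q_vec, corpus_vecs[cid]),
            reverse=True,
        )
        rows.extend({"query_id": query_id, "corpus_id": cid} for cid in ranked[:k])
    return rows

# Toy 2-dimensional "embeddings" (illustrative values only)
query_vecs = {0: [1.0, 0.0]}
corpus_vecs = {10: [0.9, 0.1], 11: [0.0, 1.0], 12: [0.7, 0.7]}
qrels_rows = top_k_qrels(query_vecs, corpus_vecs, k=2)
print(qrels_rows)
```

The `query_id` and `corpus_id` keys mirror the column names this notebook reads from qrels.tsv.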

In [None]:
from datasets import load_dataset
import pandas as pd
import random
from pathlib import Path

dataset_dir = Path("/content/tutorial_notebooks/rag-contexteng/datasets")

# Load UCSB catalog queries
ucsb_dataset = load_dataset(
    "json",
    data_files=str(dataset_dir / "ucsb_catalog" / "queries.jsonl"),
    split="train"
)

# Load relevance labels
qrels = pd.read_csv(str(dataset_dir / "ucsb_catalog" / "qrels.tsv"), sep="\t")

# Sample 50% of queries for this experiment
sample_fraction = 0.5
rseed = 1
random.seed(rseed)

sample_size = int(len(ucsb_dataset) * sample_fraction)
ucsb_dataset = ucsb_dataset.shuffle(seed=rseed).select(range(sample_size))

query_ids = {int(qid) for qid in ucsb_dataset["query_id"]}
qrels = qrels[qrels["query_id"].isin(query_ids)]

print(f"Using {len(ucsb_dataset)} queries ({sample_fraction:.0%} of dataset)")
print(f"Filtered qrels to {len(qrels)} relevance judgments")

## 🔧 RAG Pipeline Configuration

Testing 4 configurations:

| Config | Chunk Size | Rerank Top-N | Hypothesis |
|--------|-----------|--------------|------------|
| 1 | 256 | 2 | High precision, focused context |
| 2 | 256 | 5 | Balanced precision & recall |
| 3 | 128 | 2 | Granular chunks, focused results |
| 4 | 128 | 5 | Granular chunks, broad coverage |

**Fixed Parameters:**
- Embedding: sentence-transformers/all-MiniLM-L6-v2
- Initial retrieval: k=8 (similarity search)
- Reranker: cross-encoder/ms-marco-MiniLM-L6-v2
- Generator: Qwen/Qwen2.5-0.5B-Instruct
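
The 4 configurations are the Cartesian product of the two experiment variables. The sketch below enumerates them in the same order as the table above; it is only an illustration, and the order in which `RFGridSearch` assigns run IDs may differ.

```python
from itertools import product

chunk_sizes = [256, 128]   # EXPERIMENT VARIABLE 1
rerank_top_ns = [2, 5]     # EXPERIMENT VARIABLE 2

# One dictionary per (chunk_size, top_n) combination, numbered from 1
grid = [
    {"config": i, "chunk_size": chunk_size, "top_n": top_n}
    for i, (chunk_size, top_n) in enumerate(product(chunk_sizes, rerank_top_ns), start=1)
]

for row in grid:
    print(row)
```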

In [None]:
from langchain_community.document_loaders import DirectoryLoader, JSONLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_classic.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

batch_size = 50

rag_gpu = RFLangChainRagSpec(
    document_loader=DirectoryLoader(
        path=str(dataset_dir / "ucsb_catalog"),
        glob="corpus.jsonl",
        loader_cls=JSONLoader,
        loader_kwargs={
            "jq_schema": ".",
            "content_key": "text",
            "metadata_func": lambda record, metadata: {
                "corpus_id": int(record.get("_id"))
            },
            "json_lines": True,
            "text_content": False,
        },
        sample_seed=42,
    ),
    # EXPERIMENT VARIABLE 1: Chunk size
    text_splitter=List([
        RecursiveCharacterTextSplitter.from_tiktoken_encoder(
            encoding_name="gpt2", chunk_size=256, chunk_overlap=32
        ),
        RecursiveCharacterTextSplitter.from_tiktoken_encoder(
            encoding_name="gpt2", chunk_size=128, chunk_overlap=32
        ),
    ]),
    embedding_cls=HuggingFaceEmbeddings,
    embedding_kwargs={
        "model_name": "sentence-transformers/all-MiniLM-L6-v2",
        "model_kwargs": {"device": "cuda:0"},
        "encode_kwargs": {"normalize_embeddings": True, "batch_size": batch_size},
    },
    vector_store=None,
    search_type="similarity",
    search_kwargs={"k": 8},
    # EXPERIMENT VARIABLE 2: Reranking strategy
    reranker_cls=CrossEncoderReranker,
    reranker_kwargs={
        "model_name": "cross-encoder/ms-marco-MiniLM-L6-v2",
        "model_kwargs": {"device": "cpu"},
        "top_n": List([2, 5]),
    },
    enable_gpu_search=True,
)

In [None]:
def sample_preprocess_fn(
    batch: Dict[str, listtype], rag: RFLangChainRagSpec, prompt_manager: RFPromptManager
) -> Dict[str, listtype]:
    """Retrieve context and format prompts for LLM"""

    INSTRUCTIONS = """You are a helpful academic advisor for UCSB students.
Use the provided course catalog information to answer questions about courses,
requirements, policies, and academic programs. Be specific and cite relevant
catalog sections when possible."""

    all_context = rag.get_context(batch_queries=batch["query"], serialize=False)

    retrieved_documents = [
        [doc.metadata["corpus_id"] for doc in docs] for docs in all_context
    ]

    serialized_context = rag.serialize_documents(all_context)
    batch["query_id"] = [int(query_id) for query_id in batch["query_id"]]

    return {
        "prompts": [
            [
                {"role": "system", "content": INSTRUCTIONS},
                {
                    "role": "user",
                    "content": f"""Here is relevant information from the UCSB course catalog:

{context}

Based on this catalog information, please answer the following question:
{question}""",
                },
            ]
            for question, context in zip(batch["query"], serialized_context)
        ],
        "retrieved_documents": retrieved_documents,
        **batch,
    }

In [None]:
def sample_postprocess_fn(batch: Dict[str, listtype]) -> Dict[str, listtype]:
    """Attach ground truth documents for evaluation"""
    batch["ground_truth_documents"] = [
        qrels[qrels["query_id"] == query_id]["corpus_id"].tolist()
        for query_id in batch["query_id"]
    ]
    return batch

In [None]:
import math

def compute_ndcg_at_k(retrieved_docs: listtype, expected_docs: set, k=5):
    """Compute NDCG@k with binary relevance; retrieved_docs should be in rank order"""
    relevance = [1 if doc in expected_docs else 0 for doc in list(retrieved_docs)[:k]]
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevance))

    # The ideal ranking places every relevant document first. Its gain must match
    # the binary gain (1) used for DCG above so that a perfect ranking scores 1.0.
    ideal_length = min(k, len(expected_docs))
    ideal_relevance = [1] * ideal_length + [0] * (k - ideal_length)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal_relevance))

    return dcg / idcg if idcg > 0 else 0.0

def compute_rr(retrieved_docs: listtype, expected_docs: set):
    """Compute Reciprocal Rank for a single query; retrieved_docs should be in rank order"""
    rr = 0
    for i, retrieved_doc in enumerate(retrieved_docs):
        if retrieved_doc in expected_docs:
            rr = 1 / (i + 1)
            break
    return rr

def sample_compute_metrics_fn(batch: Dict[str, listtype]) -> Dict[str, Dict[str, Any]]:
    """Compute retrieval metrics per batch"""
    precisions, recalls, f1_scores, ndcgs, rrs = [], [], [], [], []
    total_queries = len(batch["query"])

    for pred, gt in zip(batch["retrieved_documents"], batch["ground_truth_documents"]):
        expected_set = set(gt)
        # Deduplicate while preserving rank order. A plain set() has no defined
        # order, which would make the rank-based metrics (NDCG, MRR) meaningless.
        ranked_docs = list(dict.fromkeys(pred))
        retrieved_set = set(ranked_docs)

        true_positives = len(expected_set.intersection(retrieved_set))
        precision = true_positives / len(retrieved_set) if len(retrieved_set) > 0 else 0
        recall = true_positives / len(expected_set) if len(expected_set) > 0 else 0
        f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) > 0 else 0

        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)
        ndcgs.append(compute_ndcg_at_k(ranked_docs, expected_set, k=5))
        rrs.append(compute_rr(ranked_docs, expected_set))

    return {
        "Total": {"value": total_queries},
        "Precision": {"value": sum(precisions) / total_queries},
        "Recall": {"value": sum(recalls) / total_queries},
        "F1 Score": {"value": sum(f1_scores) / total_queries},
        "NDCG@5": {"value": sum(ndcgs) / total_queries},
        "MRR": {"value": sum(rrs) / total_queries},
    }

def sample_accumulate_metrics_fn(
    aggregated_metrics: Dict[str, listtype],
) -> Dict[str, Dict[str, Any]]:
    """Accumulate metrics across all batches"""
    num_queries_per_batch = [m["value"] for m in aggregated_metrics["Total"]]
    total_queries = sum(num_queries_per_batch)
    algebraic_metrics = ["Precision", "Recall", "F1 Score", "NDCG@5", "MRR"]

    return {
        "Total": {"value": total_queries},
        **{
            metric: {
                "value": sum(
                    m["value"] * queries
                    for m, queries in zip(
                        aggregated_metrics[metric], num_queries_per_batch
                    )
                ) / total_queries,
                "is_algebraic": True,
                "value_range": (0, 1),
            }
            for metric in algebraic_metrics
        },
    }

In [None]:
vllm_config1 = RFvLLMModelConfig(
    model_config={
        "model": "Qwen/Qwen2.5-0.5B-Instruct",
        "dtype": "half",
        "gpu_memory_utilization": 0.25,
        "tensor_parallel_size": 1,
        "distributed_executor_backend": "mp",
        "enable_chunked_prefill": False,
        "enable_prefix_caching": False,
        "max_model_len": 3000,
        "disable_log_stats": True,
        "enforce_eager": True,
        "disable_custom_all_reduce": True,
    },
    sampling_params={
        "temperature": 0.8,
        "top_p": 0.95,
        "max_tokens": 128,
    },
    rag=rag_gpu,
    prompt_manager=None,
)

# Queries per evaluation batch (separate from the embedding batch size of 50 set earlier)
batch_size = 3

config_set = {
    "vllm_config": vllm_config1,
    "batch_size": batch_size,
    "preprocess_fn": sample_preprocess_fn,
    "postprocess_fn": sample_postprocess_fn,
    "compute_metrics_fn": sample_compute_metrics_fn,
    "accumulate_metrics_fn": sample_accumulate_metrics_fn,
    "online_strategy_kwargs": {
        "strategy_name": "normal",
        "confidence_level": 0.95,
        "use_fpc": True,
    },
}

In [None]:
config_group = RFGridSearch(config_set)

## 🚀 Running Experiments

The `run_evals` call below executes all 4 configurations in parallel using RapidFire AI's experiment orchestration system.

**Expected Runtime:** ~20-30 minutes on Colab T4 GPU

In [None]:
experiment = Experiment(experiment_name="ucsb-catalog-rag-exp1", mode="evals")

In [None]:
from google.colab import output
output.serve_kernel_port_as_iframe(8855)

In [None]:
results = experiment.run_evals(
    config_group=config_group,
    dataset=ucsb_dataset,
    num_actors=1,
    num_shards=4,
    seed=42,
)

## 📊 Results Analysis

The table below shows the final metrics for all 4 configurations.

**Key Metrics:**
- **Precision**: Of retrieved documents, what % were relevant?
- **Recall**: Of all relevant documents, what % were retrieved?
- **F1 Score**: Harmonic mean of Precision and Recall
- **NDCG@5**: How well were relevant documents ranked? (higher = better ranking)
- **MRR**: How quickly was the first relevant document found?
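
To make these definitions concrete, here is a small worked example for a single query. It is a sketch that assumes binary relevance and distinct document IDs; `retrieval_metrics` and the IDs are hypothetical and independent of the metric functions defined earlier in this notebook.

```python
import math

def retrieval_metrics(retrieved, relevant, k=5):
    """Set-based and rank-based metrics for one query (binary relevance)."""
    relevant = set(relevant)
    hits = [1 if doc in relevant else 0 for doc in retrieved]  # in rank order
    true_positives = sum(hits)

    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    # DCG discounts each hit by log2(rank + 1); IDCG is the best achievable DCG
    dcg = sum(hit / math.log2(i + 2) for i, hit in enumerate(hits[:k]))
    idcg = sum(1 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    ndcg = dcg / idcg if idcg else 0.0

    # Reciprocal rank of the first relevant document (0 if none was retrieved)
    rr = next((1 / (i + 1) for i, hit in enumerate(hits) if hit), 0.0)

    return {"Precision": precision, "Recall": recall, "F1 Score": f1,
            "NDCG@5": ndcg, "MRR": rr}

# Three documents retrieved; the 2nd and 3rd are among the 5 relevant ones
example = retrieval_metrics(retrieved=[7, 3, 9], relevant=[3, 9, 12, 15, 20])
print(example)
```

Here Precision is 2/3, Recall is 2/5, and the reciprocal rank is 1/2 because the first relevant document appears in second position.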

In [None]:
results_df = pd.DataFrame([
    {k: v['value'] if isinstance(v, dict) and 'value' in v else v
     for k, v in {**metrics_dict, 'run_id': run_id}.items()}
    for run_id, (_, metrics_dict) in results.items()
])

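# Display key columns only for clarity. Columns missing from the results are
# skipped, since the hyperparameter column names depend on what run_evals returns.
display_cols = ['run_id', 'chunk_size', 'top_n', 'Precision', 'Recall', 'F1 Score', 'NDCG@5', 'MRR']
results_df[[col for col in display_cols if col in results_df.columns]]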

## 🔍 Key Findings

### Best Configuration: [Fill this in based on your results]

**Observations:**
- Chunk size impact: [Compare 256 vs 128]
- Reranking strategy impact: [Compare top_n=2 vs top_n=5]
- Trade-offs observed: [Precision vs Recall, etc.]

### RapidFire AI Value

RapidFire AI enabled:
1. **Parallel execution**: All 4 configs ran simultaneously, saving ~XX minutes
2. **Live metrics**: Real-time confidence intervals showed convergence
3. **Interactive control**: Could stop/clone configs dynamically
4. **Easy reproducibility**: Complete config tracking in logs
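
The live confidence intervals relate to the `online_strategy_kwargs` set earlier (`"normal"` strategy, `confidence_level=0.95`, `use_fpc=True`). The sketch below shows one standard way to compute a normal-approximation interval with a finite population correction (FPC). It is an assumption about what such an interval looks like, not RapidFire AI's actual implementation, and `mean_ci_with_fpc` is a hypothetical name.

```python
import math
from statistics import NormalDist

def mean_ci_with_fpc(values, population_size, confidence=0.95):
    """Normal-approximation CI for a mean, shrunk by the finite population correction."""
    n = len(values)
    mean = sum(values) / n
    if n < 2:
        return mean, mean, mean  # not enough data to estimate a variance

    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    std_err = math.sqrt(variance / n)

    # FPC: the interval narrows as the sample approaches the full population
    if population_size > 1:
        std_err *= math.sqrt(max(population_size - n, 0) / (population_size - 1))

    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean, mean - z * std_err, mean + z * std_err

# Per-query precision values seen so far (illustrative numbers), out of 15 queries
partial = mean_ci_with_fpc([0.5, 1.0, 0.5, 0.0, 1.0], population_size=15)
complete = mean_ci_with_fpc([0.2, 0.4, 0.6], population_size=3)
print(partial, complete)
```

With the correction, the interval collapses to the mean once every query has been processed, which is one way "convergence" can show up in a live view.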

In [None]:
experiment.end()
print("Experiment completed successfully!")

In [None]:
# Save to Google Drive for easy access
results_df.to_csv('/content/drive/MyDrive/ucsb_catalog_results.csv', index=False)
print("✅ Results saved to Google Drive")