# Retrieval QA Benchmark

Retrieval QA Benchmark (RQABench for short) is an open-source, end-to-end test workbench for Retrieval Augmented Generation (RAG) systems. We intend it to be an open benchmark that developers and researchers can use to reproduce existing RAG systems and design new ones. We also want it to be a platform where everyone can share their lego blocks to help others build their own retrieval + LLM system.

Here are some major features of this benchmark:

- **Flexibility**: You are free to design your retrieval system however you like, as long as each transform accepts a `QARecord` as input and returns a `QARecord` as output.
- **Reproducibility**: All settings for an evaluation run are gathered in a single YAML configuration, which helps you track and reproduce experiments.
- **Traceability**: We collect more than accuracy and scores. You can also track the running time of any function you want to watch, as well as the tokens used across the whole RAG system.
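
The record-in, record-out contract from the flexibility point can be sketched in isolation. The snippet below is a minimal illustration only: `Record`, `add_context`, `truncate_context`, and `chain` are hypothetical names made up for this example, and `Record` is a stand-in dataclass rather than the benchmark's actual `QARecord`, whose fields are not shown here.

```python
from dataclasses import dataclass, field, replace
from typing import Callable, List


@dataclass(frozen=True)
class Record:
    """Hypothetical stand-in for QARecord; the real fields are assumed, not known."""
    question: str
    context: List[str] = field(default_factory=list)


# Any step with this shape can be plugged into the pipeline.
Transform = Callable[[Record], Record]


def add_context(record: Record) -> Record:
    """Toy retrieval step: attach one placeholder passage to the record."""
    passage = f"(placeholder passage for: {record.question})"
    return replace(record, context=[*record.context, passage])


def truncate_context(record: Record) -> Record:
    """Toy post-processing step: keep only the first passage."""
    return replace(record, context=record.context[:1])


def chain(*steps: Transform) -> Transform:
    """Compose transforms left to right into one record-to-record function."""
    def run(record: Record) -> Record:
        for step in steps:
            record = step(record)
        return record
    return run


pipeline = chain(add_context, add_context, truncate_context)
out = pipeline(Record(question="What is RAG?"))
print(out.context)
```

Because every step has the same input and output type, steps can be reordered, swapped, or shared without touching the rest of the pipeline.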

In [None]:
from retrieval_qa_benchmark.models import *
from retrieval_qa_benchmark.datasets import *
from retrieval_qa_benchmark.transforms import *
from retrieval_qa_benchmark.evaluators import *
from retrieval_qa_benchmark.utils.registry import REGISTRY
from retrieval_qa_benchmark.utils.profiler import PROFILER

print(str(REGISTRY))

In [None]:
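from retrieval_qa_benchmark.utils.factory import EvaluatorFactory
from retrieval_qa_benchmark.utils.config import load

# Load the YAML configuration and close the file handle afterwards
with open("../config/mmlu.yaml") as f:
    config = load(f)
# Build the evaluator described by the configuration
evaluator = EvaluatorFactory.from_config(config).build()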

In [None]:
import json
from tqdm import tqdm
from retrieval_qa_benchmark.utils.profiler import PROFILER

# Reset the profiler so its counters only reflect this run
PROFILER.clear()

# Shrink the dataset to 5 records for a quick run
evaluator.dataset.eval_set = evaluator.dataset.eval_set[:5]

# Apply the evaluator's transform to every record, with a progress bar
data = []
for r in tqdm(
    map(evaluator.transform, evaluator.dataset.iterator()), total=len(evaluator.dataset)
):
    data.append(r)

# Save the transformed records as JSON Lines
with open("new_aligned.jsonl", "w") as f:
    f.write("\n".join([json.dumps(d.model_dump()) for d in data]))

In [None]:
# Show how a record is rendered as a plain-text prompt
print(evaluator.llm.convert_record(data[0]))

In [None]:
# Print the profiling results, then reset the profiler
print(str(PROFILER))
PROFILER.clear()

In [None]:
acc, result = evaluator()

with open("results.with-retrieval.test.jsonl", "w") as f:
    f.write("\n".join([r.model_dump_json() for r in result]))

## Save all mismatched predictions

In [None]:
from retrieval_qa_benchmark.schema import QAPrediction

# Keep only the predictions that did NOT match the expected answer
mismatched = [pred for pred in result if not pred.matched]

In [None]:
with open("results.test.jsonl", "w") as f:
    f.write("\n".join([r.model_dump_json() for r in mismatched]))

## Build all datasets

In [None]:
for dname, dset in REGISTRY.Datasets.items():
    print(f"Loading {dname}...")
    dset.build()

## Check intersection of two results (retrieval system recall)

In [None]:
import json

# Each file holds one JSON record per line
with open("old_aligned.jsonl") as f1:
    old_result = [json.loads(line) for line in f1]

with open("new.jsonl") as f2:
    new_result = [json.loads(line) for line in f2]

In [None]:
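from hashlib import sha256

# Contexts per record; the overlap count is normalized by this value
TOP_K = 5


def hash_contexts(contexts):
    """Hash each context string so the two runs are compared by content."""
    return {sha256(c.encode("utf-8")).hexdigest() for c in contexts}


cnt = 0
for r1, r2 in zip(old_result, new_result):
    cnt += len(hash_contexts(r2["context"]) & hash_contexts(r1["context"]))
# Average fraction of contexts shared between the two runs
print(cnt / len(new_result) / TOP_K)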
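
The overlap ratio computed above can be sanity-checked on toy data. The sketch below wraps the same idea in a function; `digest`, `context_overlap`, and the toy records are hypothetical names and values invented for illustration, and `k` is assumed to be the number of contexts retrieved per record.

```python
from hashlib import sha256
from typing import Dict, List


def digest(text: str) -> str:
    """Hash a context string so passages are compared by exact content."""
    return sha256(text.encode("utf-8")).hexdigest()


def context_overlap(old: List[Dict], new: List[Dict], k: int) -> float:
    """Average fraction of the k contexts per record shared by both runs."""
    if not new:
        return 0.0
    shared = 0
    for r_old, r_new in zip(old, new):
        old_hashes = {digest(c) for c in r_old["context"]}
        new_hashes = {digest(c) for c in r_new["context"]}
        shared += len(new_hashes & old_hashes)
    return shared / (len(new) * k)


# Toy data: the first record shares 1 of 2 contexts, the second shares 2 of 2.
old_runs = [{"context": ["a", "b"]}, {"context": ["c", "d"]}]
new_runs = [{"context": ["a", "x"]}, {"context": ["c", "d"]}]

score = context_overlap(old_runs, new_runs, k=2)
print(score)  # 3 shared contexts out of 4 slots -> 0.75
```

Note that `zip` pairs records by position, so both files must list the same questions in the same order for the ratio to be meaningful.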