A benchmark framework for evaluating multiple Retrieval-Augmented Generation (RAG) and GraphRAG methods across QA and summarization tasks.
| Method | Scripts |
|---|---|
| RAG (standard vector-based) | retrieval.py, retrieval_multiple.py |
| RaptorRAG | raptor_retrieval.py, raptor_retrieval_multiple.py |
| KG-GraphRAG (Triplets / Triplets+Text) | graph_retrieval.py, graph_retrieval_multiple.py |
| Community-GraphRAG (Local) | graphrag_local.py, graphrag_local_multiple.py |
| Community-GraphRAG (Global) | graphrag_global.py, graphrag_global_multiple.py |
| HippoRAG2 | hippo_retrieval.py, hippo_retrieval_multiple.py |
| Type | Datasets |
|---|---|
| Full-document (single index) | news (MultiHop-RAG), hotpot, Story, Meeting |
| Sub-document (per-document index) | NovelQA, NQ, SQU, QM |
| QA task | news (MultiHop-RAG), hotpot, NovelQA, NQ |
| Summarization task | Story, Meeting, SQU, QM |
1. Indexing → 2. Retrieval → 3. QA / Summarization → 4. Evaluation
Indexing is handled automatically inside each retrieval script via dataset.py. The index is cached on disk and reused on subsequent runs.
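The load-or-build pattern looks roughly like the following. This is a minimal sketch assuming LlamaIndex's on-disk persistence; corpus_dir and persist_dir are illustrative names, and the actual logic in dataset.py may differ.

```python
import os
from llama_index.core import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)

def get_index(corpus_dir: str, persist_dir: str) -> VectorStoreIndex:
    """Return a cached index if one exists on disk, otherwise build and persist it."""
    if os.path.isdir(persist_dir):
        # Reuse the index written by a previous run.
        storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
        return load_index_from_storage(storage_context)
    # First run: embed the corpus and cache the result for later runs.
    documents = SimpleDirectoryReader(corpus_dir).load_data()
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=persist_dir)
    return index
```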
Use retrieval.py for full-document datasets and retrieval_multiple.py for sub-document datasets.
Basic retrieval:
python retrieval.py --dataset news --topk 10
python retrieval_multiple.py --dataset NovelQA --topk 10 --subdocs
With IRCoT (iterative retrieval; sketched below):
python retrieval.py --dataset news --topk 10 --ircot
python retrieval_multiple.py --dataset NovelQA --topk 10 --subdocs --ircot
With reranking:
python retrieval.py --dataset news --topk 20 --rerank
python retrieval_multiple.py --dataset NovelQA --topk 20 --subdocs --rerank
Other method-specific retrieval scripts follow the same argument conventions. For KG-GraphRAG, pass --withtext to include source text alongside triplets.
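For reference, --ircot interleaves retrieval with LLM reasoning. A minimal sketch of that loop follows; retrieve and generate_thought are hypothetical placeholders, and the real prompts, stopping criterion, and step limit may differ.

```python
def ircot_retrieve(question, retrieve, generate_thought, max_steps=4, topk=10):
    """Interleave retrieval and reasoning: each generated thought becomes the next query."""
    collected, query = [], question
    for _ in range(max_steps):
        docs = retrieve(query, topk=topk)                 # retrieve with the current query
        collected.extend(d for d in docs if d not in collected)
        thought = generate_thought(question, collected)   # LLM reasons over evidence so far
        if "so the answer is" in thought.lower():         # stop once the model commits to an answer
            break
        query = thought                                   # use the new thought as the next query
    return collected
```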
For QA datasets (news, hotpot, NovelQA, NQ), use qa/qa_batch.py:
cd qa
python qa_batch.py --dataset news --graphrag_local
python qa_batch.py --dataset hotpot --rag
For summarization datasets (SQU, QM, Story, Meeting), use qa/summarization.py:
cd qa
python summarization.py --dataset Story --graphrag_local
python summarization.py --dataset Meeting --rag
Method flags: --rag, --raptor_rag, --hippo_rag, --graphrag_local, --graphrag (global).
| Dataset | Script |
|---|---|
| news (MultiHop-RAG) | evaluation/qa_evaluation.py |
| hotpot, NQ | evaluation/hotpot_evaluation.py |
| NovelQA | evaluation/NovelQA_evaluation.py |
| SQU, QM, Story, Meeting | evaluation/summarization_evaluation.py |
Examples:
cd evaluation
# news
python qa_evaluation.py --dataset news --graphrag_local
# hotpot / NQ
python hotpot_evaluation.py --dataset NQ --subdocs --graphrag_local
# NovelQA
python NovelQA_evaluation.py --dataset NovelQA --graphrag_local
# Summarization
python summarization_evaluation.py --dataset Meeting --graphrag_local
| Argument | Description |
|---|---|
| --dataset | Dataset name (e.g., news, hotpot, NovelQA, NQ, Story, Meeting, SQU, QM) |
| --topk | Number of retrieved documents |
| --subdocs | Use per-document (sub-document) indices |
| --rerank | Enable reranking with BAAI/bge-reranker-large (see the sketch below) |
| --ircot | Enable IRCoT iterative retrieval |
| --model_size | LLM size for IRCoT (default: 8B) |
| --embed_model | Embedding model (default: text-embedding-ada-002) |
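For reference, --rerank rescores retrieved candidates with BAAI/bge-reranker-large. A minimal sketch, assuming the sentence-transformers CrossEncoder loader; the scripts may load and apply the model differently.

```python
from sentence_transformers import CrossEncoder

def rerank(query: str, passages: list[str], final_k: int = 10) -> list[str]:
    """Score (query, passage) pairs with the cross-encoder and keep the top-scoring passages."""
    reranker = CrossEncoder("BAAI/bge-reranker-large")
    scores = reranker.predict([(query, p) for p in passages])
    ranked = sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)
    return [p for p, _ in ranked[:final_k]]
```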
- LlamaIndex
- vLLM
- HippoRAG
- RAPTOR
- Microsoft GraphRAG
- OpenAI API (for embeddings and LLM calls)