This repository contains the code accompanying the paper
“Core-based Hierarchies for Efficient GraphRAG.”
Our implementation builds directly on the official GraphRAG benchmarking framework released by Microsoft. We introduce k-core–based hierarchical community construction algorithms as drop-in replacements for Leiden-based community detection. The overall pipeline, datasets, and evaluation procedures remain unchanged.
The code follows the same setup and execution procedure as the original GraphRAG benchmarking repository:
https://github.com/microsoft/graphrag-benchmarking-datasets
We retain:
- The same dataset format
- The same indexing and query-time execution flow
- The same evaluation and head-to-head comparison framework
Our contributions are limited to modifications in the community construction and hierarchy generation components.
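To make the idea of a core-based hierarchy concrete, the sketch below computes core numbers by peeling minimum-degree nodes and then treats the connected components of each k-core as the communities at level k. It is a generic illustration of k-core nesting only, not the heuristics proposed in the paper; the function names (build_adjacency, core_numbers, kcore_hierarchy) and the toy edge list are hypothetical and do not come from this repository.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map from (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)


def core_numbers(adj):
    """Core number of every node, found by peeling a minimum-degree node each step."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Linear scan for the minimum; simple rather than fast.
        v = min(remaining, key=degree.__getitem__)
        k = max(k, degree[v])
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
    return core


def connected_components(adj, nodes):
    """Connected components of the subgraph induced on `nodes`."""
    nodes = set(nodes)
    seen = set()
    components = []
    for start in sorted(nodes):
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        component = set()
        while stack:
            v = stack.pop()
            component.add(v)
            for u in adj[v]:
                if u in nodes and u not in seen:
                    seen.add(u)
                    stack.append(u)
        components.append(component)
    return components


def kcore_hierarchy(adj):
    """Map each k to the components of the k-core; higher k means denser, nested groups."""
    core = core_numbers(adj)
    levels = {}
    for k in range(1, max(core.values(), default=0) + 1):
        members = [v for v, c in core.items() if c >= k]
        levels[k] = connected_components(adj, members)
    return levels


# Toy graph (hypothetical): a triangle a-b-c, a pendant node d, and a separate edge e-f.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("e", "f")]
hierarchy = kcore_hierarchy(build_adjacency(edges))
for k, groups in hierarchy.items():
    print(k, [sorted(g) for g in groups])
```

Every component at level k+1 is contained in some component at level k, which is what makes the k-core decomposition a natural source of hierarchical communities.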
The code is executed using the standard GraphRAG commands. No additional steps are required beyond those described in the original GraphRAG repository.
Run the following from the root directory:
pip install -e ./graphrag
Create a new directory to hold your input files (e.g., for the Kevin Scott podcast benchmark):
mkdir -p ./kevin_scott_podcasts/input
Place the unzipped input files into the input folder.
python graphrag/cli/main.py init --root kevin_scott_podcasts/
This command creates two files inside ./kevin_scott_podcasts/:
- .env: contains the GRAPHRAG_API_KEY environment variable. Set it to your OpenAI or Azure OpenAI API key.
- settings.yaml: configures the GraphRAG pipeline. You may edit this file to customize pipeline behavior.
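Before starting a long indexing run, it can help to confirm that the API key was actually filled in. The following is a minimal sketch, assuming the .env file uses plain KEY=VALUE lines; the helper names and the `<API_KEY>` placeholder value are assumptions made for illustration and are not taken from this repository.

```python
from pathlib import Path


def read_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def api_key_is_set(env_path, placeholder="<API_KEY>"):
    """True if GRAPHRAG_API_KEY is present, non-empty, and not the assumed placeholder."""
    value = read_env_file(env_path).get("GRAPHRAG_API_KEY", "")
    return bool(value) and value != placeholder


if __name__ == "__main__":
    env_path = Path("kevin_scott_podcasts") / ".env"
    if env_path.exists() and api_key_is_set(env_path):
        print("GRAPHRAG_API_KEY looks set.")
    else:
        print("GRAPHRAG_API_KEY is missing or still a placeholder.")
```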
Indexing constructs the knowledge graph and hierarchical communities:
python graphrag/cli/main.py index --root ./kevin_scott_podcasts --community RkH
Here, RkH specifies the community construction algorithm. The default is the original GraphRAG Leiden algorithm. You may alternatively select one of our proposed heuristics:
- RkH
- M2hC
- MRC
Indexing time depends on dataset size and can range from several minutes to several hours.
You can now run queries against your indexed dataset. For example, to perform global search on leaf-level (LF) communities:
python graphrag/cli/main.py query \
--root ./kevin_scott_podcasts \
--method global \
--community-level LF \
--query "What recurring topics do tech leaders emphasize in their discussions?"
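When comparing community construction algorithms, it can be convenient to assemble the index and query commands programmatically. The sketch below only builds argument lists from the flags shown in this README; the helper names are hypothetical, and it assumes that omitting --community falls back to the default Leiden algorithm.

```python
import shlex
import sys

# Path to the CLI entry point, relative to the repository root (as used in this README).
CLI = "graphrag/cli/main.py"


def index_command(root, community=None):
    """Argument list for indexing; `community` is passed to --community when given."""
    cmd = [sys.executable, CLI, "index", "--root", root]
    if community is not None:
        cmd += ["--community", community]
    return cmd


def query_command(root, query, method="global", community_level="LF"):
    """Argument list for a query against an already indexed dataset."""
    return [
        sys.executable, CLI, "query",
        "--root", root,
        "--method", method,
        "--community-level", community_level,
        "--query", query,
    ]


if __name__ == "__main__":
    root = "./kevin_scott_podcasts"
    question = "What recurring topics do tech leaders emphasize in their discussions?"
    # Print the commands instead of running them; indexing can take hours.
    print(shlex.join(index_command(root, community="RkH")))
    print(shlex.join(query_command(root, question)))
```

Passing either list to subprocess.run(cmd, check=True) from the repository root would execute the command; building lists rather than shell strings avoids quoting problems with the query text.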