KMA-Bench

A benchmark for knowledge-management tasks in AI agent systems.

Existing benchmarks test retrieval (BEIR), QA (MuSiQue), or embedding quality (MTEB). None test whether a system can manage a knowledge base: detect when stored knowledge is stale, evaluate whether a change is legitimate or corrupt, route a query to the right files, and decide whether to serve, flag, or block content.

KMA-Bench fills that gap with 166 cases across three tasks and two knowledge bases; 14 of the cases are drawn from real git commits. Bring any system, implement three methods, and get scored.

Tasks

Task	What it tests	Cases	Metrics
Diff evaluation	Classify a change as accept / reject / partial	80	Accuracy, per-category
Routing	Select the relevant files for a query	60	Recall, precision, rank-1
Gate decision	Serve, annotate, or block based on a change signal	26	Accuracy, severity-weighted
Staleness detection	Is a query affected by recent changes?	planned	Detection rate, false alarm
Full pipeline	Route → detect → gate → eval → answer	planned	End-to-end accuracy, tokens

Knowledge bases

French wiki: 16 markdown files of grammar, tenses, vocabulary, and lesson notes for B2 exam prep (the author's own notes).
PostHog handbook: a subset of the open-source PostHog company handbook (MIT).

Note

ClickHouse docs were used to evaluate models during development, but those cases aren't redistributed here because they're CC BY-NC-SA. The public benchmark is the 166 cases across French + PostHog. See NOTICE.

Quick start

uv pip install git+https://github.com/malgamves/kma-bench

Run the built-in heuristic baseline, a deliberately dumb floor (always "accept" on diff eval, keyword matching for routing, a severity lookup for gating):

uv run kma-bench evaluate --system heuristic
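To make the three heuristics concrete, here is a minimal standalone sketch of what such a floor could look like. It is not the package's actual baseline: the function names, the shape of `file_index` (a dict of filename to description), and the severity-to-decision mapping are all assumptions made for illustration.

```python
import re

# Assumed mapping from signal severity to gate decision (hypothetical).
SEVERITY_TO_DECISION = {"critical": "block", "medium": "annotate", "low": "serve"}


def heuristic_diff_verdict(diff: str) -> str:
    """Always accept, regardless of what the diff contains."""
    return "accept"


def heuristic_route(query: str, file_index: dict[str, str]) -> list[str]:
    """Rank files by how many query words appear in their name or description."""
    query_words = set(re.findall(r"\w+", query.lower()))
    scored = []
    for filename, description in file_index.items():
        file_words = set(re.findall(r"\w+", f"{filename} {description}".lower()))
        overlap = len(query_words & file_words)
        if overlap:
            scored.append((overlap, filename))
    # Highest overlap first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [filename for _, filename in scored]


def heuristic_gate(signal_severity: str) -> str:
    """Look the decision up by severity; unknown severities fall back to annotate."""
    return SEVERITY_TO_DECISION.get(signal_severity, "annotate")
```

A baseline like this needs no model calls, which is what makes it useful as a floor: any real system should beat it.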

Three ready-made systems (a local fine-tune and base model via LM Studio, and Claude) live in examples/. Run one, or compare several:

uv run kma-bench evaluate --system examples.kma_system.KMAFineTuned
uv run kma-bench compare  --systems heuristic,examples.kma_system.KMAFineTuned,examples.kma_claude.KMAClaude

Filter by task or knowledge base with --task {diff_eval,routing,gate} and --kb {french_wiki,posthog_handbook}. See kma_bench_instructions.md for the full walkthrough.

Benchmark your own system

Subclass KMASystem and implement three methods:

from kma_bench import KMASystem, DiffEvalResult, RoutingResult, GateResult

class MySystem(KMASystem):
    @property
    def name(self):
        # Display name for this system
        return "My KMA system"

    def evaluate_diff(self, old_content, diff, query, filename):
        # verdict is one of: accept, reject, partial
        return DiffEvalResult(verdict="accept", reasoning="...", confidence=0.9)

    def route_query(self, query, file_index):
        # selected_files lists the files relevant to the query
        return RoutingResult(selected_files=["file.md"], reasoning="...", confidence=0.9)

    def gate_decision(self, signal_description, signal_severity, query, filename):
        # decision is one of: serve, annotate, block
        return GateResult(decision="serve", reasoning="...", risk_level="low", confidence=0.9)

Then point the CLI at your class:

uv run kma-bench evaluate --system mypackage.MySystem

Tip

load_system adds the working directory to sys.path, so a module in your current directory works as --system module.MyClass.

Scoring

Diff evaluation: exact match 1.0; adjacent confusion (e.g. reject↔partial) 0.5; opposite (reject↔accept) 0.0.
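The partial-credit rule for diff evaluation can be sketched as a distance on an ordered verdict scale. This is an illustration, not the benchmark's scoring code: `score_diff_verdict` is a hypothetical name, and treating partial↔accept as adjacent (like reject↔partial) and scoring unknown verdicts as 0.0 are assumptions.

```python
# Verdicts ordered so that neighbours on the scale count as "adjacent".
VERDICT_ORDER = ["reject", "partial", "accept"]


def score_diff_verdict(predicted: str, expected: str) -> float:
    """Return 1.0 for an exact match, 0.5 for adjacent, 0.0 for opposite."""
    if predicted not in VERDICT_ORDER or expected not in VERDICT_ORDER:
        return 0.0  # assumption: unrecognised verdicts earn no credit
    distance = abs(VERDICT_ORDER.index(predicted) - VERDICT_ORDER.index(expected))
    return {0: 1.0, 1: 0.5, 2: 0.0}[distance]
```

For example, predicting "partial" when the expected verdict is "reject" earns half credit, while predicting "accept" earns none.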

Routing: recall (expected files found), precision (returned files that were expected), rank-1 (top file is correct), and out-of-scope (correctly empty on irrelevant queries).
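As a rough sketch of how these routing metrics could be computed for a single case (the function name, the return shape, and the handling of empty selections are assumptions, not the benchmark's implementation):

```python
def routing_metrics(selected: list[str], expected: list[str]) -> dict[str, float]:
    """Score one routing case against its expected files."""
    selected_set, expected_set = set(selected), set(expected)

    if not expected_set:
        # Out-of-scope query: correct only if the system returned nothing.
        return {"out_of_scope": 1.0 if not selected_set else 0.0}

    hits = selected_set & expected_set
    recall = len(hits) / len(expected_set)
    # Assumption: an empty selection gets precision 0.0 rather than being undefined.
    precision = len(hits) / len(selected_set) if selected_set else 0.0
    # Rank-1 only looks at the first file the system returned.
    rank1 = 1.0 if selected and selected[0] in expected_set else 0.0
    return {"recall": recall, "precision": precision, "rank1": rank1}
```

Note that rank-1 depends on the order of `selected`, so a system should return its most confident file first.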

Gate decision: exact match 1.0, with severity-weighted errors: serving when it should block (0.0) is penalised harder than annotating when it should block (0.3).
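One way to picture the severity weighting is a small lookup keyed by (expected, predicted). Only the two block-related scores below come from the description above; the `score_gate` name and the 0.0 default for every other mismatch are assumptions for this sketch.

```python
# Partial credit keyed by (expected decision, predicted decision).
GATE_PARTIAL_CREDIT = {
    ("block", "serve"): 0.0,     # stated above: the most harmful error
    ("block", "annotate"): 0.3,  # stated above: wrong, but the user is warned
}

# Assumption: mismatches not listed above earn no credit.
DEFAULT_MISMATCH_SCORE = 0.0


def score_gate(predicted: str, expected: str) -> float:
    """Return 1.0 on an exact match, otherwise the severity-weighted credit."""
    if predicted == expected:
        return 1.0
    return GATE_PARTIAL_CREDIT.get((expected, predicted), DEFAULT_MISMATCH_SCORE)
```

The asymmetry is the point: annotating content that should have been blocked still surfaces a warning, so it is scored above silently serving it.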

Test-case categories

Task	Categories
Diff eval	`corruption_conjugation`, `corruption_chars`, `corruption_values`, `rule_swap`, `fake_entry`, `legitimate_addition`, `legitimate_update`, `correction`, `no_change`, `formatting`, `restructure`, `partial_error`, `deletion`
Routing	`single`, `multi`, `negative` (out of scope)
Gate	critical / medium / low severity, across corruption, legitimate updates, partial errors, and structural changes

Related work

BEIR: information retrieval (tests routing, not management).
MuSiQue: multi-hop QA (tests reasoning, not staleness).
MeMo: Memory as a Model (validates the small-model-as-knowledge-server idea).
MTEB: embeddings (not generative knowledge management).

Citation

@software{kma_bench_2026,
  author = {Phiri, Daniel},
  title  = {KMA-Bench: Knowledge Management Agent Benchmark},
  year   = {2026},
  url    = {https://github.com/malgamves/kma-bench}
}
