A benchmark for knowledge-management tasks in AI agent systems.
Existing benchmarks test retrieval (BEIR), QA (MuSiQue), or embedding quality (MTEB). None test whether a system can manage a knowledge base: detect when stored knowledge is stale, evaluate whether a change is legitimate or corrupt, route a query to the right files, and decide whether to serve, flag, or block content.
KMA-Bench fills that gap with 166 cases, 14 of them drawn from real git commits, across three tasks and two knowledge bases. Bring any system, implement three methods, and get scored.
| Task | What it tests | Cases | Metrics |
|---|---|---|---|
| Diff evaluation | Classify a change as accept / reject / partial | 80 | Accuracy, per-category |
| Routing | Select the relevant files for a query | 60 | Recall, precision, rank-1 |
| Gate decision | Serve, annotate, or block based on a change signal | 26 | Accuracy, severity-weighted |
| Staleness detection | Is a query affected by recent changes? | planned | Detection rate, false alarm |
| Full pipeline | Route → detect → gate → eval → answer | planned | End-to-end accuracy, tokens |
- French wiki: 16 markdown files of grammar, tenses, vocabulary, and lesson notes for B2 exam prep (the author's own notes).
- PostHog handbook: a subset of the open-source PostHog company handbook (MIT).
Note: ClickHouse docs were used to evaluate models during development, but those cases aren't redistributed here because the docs are licensed CC BY-NC-SA. The public benchmark is the 166 cases across the French wiki and the PostHog handbook. See NOTICE.
Install from GitHub:

    uv pip install git+https://github.com/malgamves/kma-bench

Run the built-in heuristic baseline, a deliberately dumb floor (always "accept" on diff eval, keyword matching for routing, a severity lookup for gating):

    uv run kma-bench evaluate --system heuristic

Three ready-made systems (a local fine-tune and a base model via LM Studio, plus Claude) live in examples/. Run one, or compare several:

    uv run kma-bench evaluate --system examples.kma_system.KMAFineTuned
    uv run kma-bench compare --systems heuristic,examples.kma_system.KMAFineTuned,examples.kma_claude.KMAClaude

Filter by task or knowledge base with `--task {diff_eval,routing,gate}` and `--kb {french_wiki,posthog_handbook}`. See the instructions for the full walkthrough.
Subclass `KMASystem` and implement three methods:

    from kma_bench import KMASystem, DiffEvalResult, RoutingResult, GateResult


    class MySystem(KMASystem):
        @property
        def name(self):
            return "My KMA system"

        def evaluate_diff(self, old_content, diff, query, filename):
            return DiffEvalResult(verdict="accept", reasoning="...", confidence=0.9)

        def route_query(self, query, file_index):
            return RoutingResult(selected_files=["file.md"], reasoning="...", confidence=0.9)

        def gate_decision(self, signal_description, signal_severity, query, filename):
            return GateResult(decision="serve", reasoning="...", risk_level="low", confidence=0.9)

Then point the CLI at your class:

    uv run kma-bench evaluate --system mypackage.MySystem

Tip: `load_system` adds the working directory to `sys.path`, so a module in your current directory works as `--system module.MyClass`.
Diff evaluation: exact match 1.0; adjacent confusion (e.g. reject↔partial) 0.5; opposite (reject↔accept) 0.0.
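The diff-evaluation rule above can be sketched as a distance on an ordered scale. This is an illustrative reading, not the benchmark's own scorer: the names `VERDICT_ORDER`, `diff_eval_score`, and `mean_score` are hypothetical, and the sketch assumes accept↔partial counts as adjacent in the same way as reject↔partial.

```python
# Verdicts placed on a line so that distance captures adjacency.
# Assumption: accept<->partial is treated as adjacent, like reject<->partial.
VERDICT_ORDER = {"reject": 0, "partial": 1, "accept": 2}


def diff_eval_score(predicted: str, expected: str) -> float:
    """Return 1.0 for an exact match, 0.5 for adjacent, 0.0 for opposite."""
    distance = abs(VERDICT_ORDER[predicted] - VERDICT_ORDER[expected])
    return 1.0 - 0.5 * distance


def mean_score(pairs) -> float:
    """Average the per-case score over (predicted, expected) pairs."""
    scores = [diff_eval_score(p, e) for p, e in pairs]
    return sum(scores) / len(scores) if scores else 0.0
```

Under this reading, a system that always answers "accept" still earns half credit on every case whose expected verdict is "partial", which is one reason to check the per-category breakdown rather than the aggregate alone.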
Routing: recall (expected files found), precision (returned files that were expected), rank-1 (top file is correct), and out-of-scope (correctly empty on irrelevant queries).
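A minimal sketch of the routing metrics for a single case, assuming the system returns a ranked list of filenames and that an out-of-scope case is one whose expected list is empty. The function name `routing_metrics` and the returned dictionary keys are hypothetical and may not match the package's own output.

```python
def routing_metrics(selected: list[str], expected: list[str]) -> dict:
    """Compute per-case routing metrics from ranked selections."""
    expected_set = set(expected)
    # Drop duplicate selections while keeping the original rank order.
    unique_selected = list(dict.fromkeys(selected))

    if not expected_set:
        # Out-of-scope query: the only correct answer is an empty selection.
        return {"out_of_scope_correct": not unique_selected}

    hits = [f for f in unique_selected if f in expected_set]
    recall = len(hits) / len(expected_set)
    # Assumption: an empty selection on an in-scope query gets precision 0.
    precision = len(hits) / len(unique_selected) if unique_selected else 0.0
    rank1 = bool(unique_selected) and unique_selected[0] in expected_set
    return {"recall": recall, "precision": precision, "rank1": rank1}
```

For example, selecting `["a.md", "x.md"]` when `["a.md", "b.md"]` was expected finds one of two expected files (recall 0.5), returns one relevant file out of two (precision 0.5), and has a correct top file.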
Gate decision: exact match 1.0, with severity-weighted errors: serving when it should block (0.0) is penalised harder than annotating when it should block (0.3).
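The severity weighting can be pictured as a lookup of partial credit per (predicted, expected) pair. Only the two penalties stated above are taken from the text; the decision strings `"annotate"` and `"block"`, the names `ERROR_CREDIT` and `gate_score`, and the default credit for any other mistake are assumptions made for illustration.

```python
# Partial credit for specific (predicted, expected) mistakes.
# Only these two entries come from the text above.
ERROR_CREDIT = {
    ("serve", "block"): 0.0,     # served content that should have been blocked
    ("annotate", "block"): 0.3,  # flagged it, but still let it through
}

# Hypothetical placeholder: credit for mistakes the text does not describe.
DEFAULT_ERROR_CREDIT = 0.0


def gate_score(predicted: str, expected: str) -> float:
    """Return 1.0 for an exact match, otherwise the credit for that mistake."""
    if predicted == expected:
        return 1.0
    return ERROR_CREDIT.get((predicted, expected), DEFAULT_ERROR_CREDIT)
```

Keeping the penalties in a table rather than in branching logic makes it easy to see which mistakes are treated as more dangerous than others.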
| Task | Categories |
|---|---|
| Diff eval | corruption_conjugation, corruption_chars, corruption_values, rule_swap, fake_entry, legitimate_addition, legitimate_update, correction, no_change, formatting, restructure, partial_error, deletion |
| Routing | single, multi, negative (out of scope) |
| Gate | critical / medium / low severity, across corruption, legitimate updates, partial errors, and structural changes |
- BEIR: information retrieval (tests routing, not management).
- MuSiQue: multi-hop QA (tests reasoning, not staleness).
- MeMo: Memory as a Model (validates the small-model-as-knowledge-server idea).
- MTEB: embeddings (not generative knowledge management).
@software{kma_bench_2026,
author = {Phiri, Daniel},
title = {KMA-Bench: Knowledge Management Agent Benchmark},
year = {2026},
url = {https://github.com/malgamves/kma-bench}
}