KMA-Bench

A benchmark for knowledge-management tasks in AI agent systems.

Existing benchmarks test retrieval (BEIR), QA (MuSiQue), or embedding quality (MTEB). None test whether a system can manage a knowledge base: detect when stored knowledge is stale, evaluate whether a change is legitimate or corrupt, route a query to the right files, and decide whether to serve, flag, or block content.

KMA-Bench fills that gap with 166 cases (14 of them drawn from real git commits) across three tasks and two knowledge bases. Bring any system, implement three methods, and get scored.

Tasks

| Task | What it tests | Cases | Metrics |
| --- | --- | --- | --- |
| Diff evaluation | Classify a change as accept / reject / partial | 80 | Accuracy, per-category |
| Routing | Select the relevant files for a query | 60 | Recall, precision, rank-1 |
| Gate decision | Serve, annotate, or block based on a change signal | 26 | Accuracy, severity-weighted |
| Staleness detection | Is a query affected by recent changes? | planned | Detection rate, false alarm |
| Full pipeline | Route → detect → gate → eval → answer | planned | End-to-end accuracy, tokens |

Knowledge bases

  • French wiki: 16 markdown files of grammar, tenses, vocabulary, and lesson notes for B2 exam prep (the author's own notes).
  • PostHog handbook: a subset of the open-source PostHog company handbook (MIT).

Note

ClickHouse docs were used to evaluate models during development, but those cases aren't redistributed here; they're CC BY-NC-SA. The public benchmark is 166 cases across French + PostHog. See NOTICE.

Quick start

uv pip install git+https://github.com/malgamves/kma-bench

Run the built-in heuristic baseline, a deliberately dumb floor (always "accept" on diff eval, keyword matching for routing, a severity lookup for gating):

uv run kma-bench evaluate --system heuristic
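To make the floor concrete, here is a minimal sketch of what a baseline of that shape could look like. It is not the code shipped in kma-bench: the function names, the `Routing` stand-in type, the assumption that `file_index` maps filenames to short descriptions, and the severity-to-decision mapping are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Routing:
    """Stand-in result type so the sketch runs without kma_bench installed."""
    selected_files: list = field(default_factory=list)


def heuristic_diff_verdict(diff: str) -> str:
    # The described baseline ignores the diff and always accepts.
    return "accept"


def heuristic_route(query: str, file_index: dict, top_k: int = 3) -> Routing:
    # ASSUMPTION: file_index maps filename -> short description.
    # Keep only words longer than three characters as crude keywords.
    words = {w for w in query.lower().split() if len(w) > 3}
    scores = {}
    for filename, description in file_index.items():
        text = f"{filename} {description}".lower()
        hits = sum(1 for w in words if w in text)
        if hits:
            scores[filename] = hits
    # Highest keyword count first; filename breaks ties deterministically.
    ranked = sorted(scores, key=lambda f: (-scores[f], f))
    return Routing(selected_files=ranked[:top_k])


# ASSUMPTION: this mapping is illustrative, not the benchmark's actual table.
SEVERITY_TO_DECISION = {"critical": "block", "medium": "annotate", "low": "serve"}


def heuristic_gate(signal_severity: str) -> str:
    # Unknown severities fall back to the middle option.
    return SEVERITY_TO_DECISION.get(signal_severity, "annotate")
```

A baseline like this never reads the content of a change, which is why it works as a floor: any system that inspects the diff should beat it on corruption cases.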

Three ready-made systems (a local fine-tune and base model via LM Studio, and Claude) live in examples/. Run one, or compare several:

uv run kma-bench evaluate --system examples.kma_system.KMAFineTuned
uv run kma-bench compare  --systems heuristic,examples.kma_system.KMAFineTuned,examples.kma_claude.KMAClaude

Filter by task or knowledge base with --task {diff_eval,routing,gate} and --kb {french_wiki,posthog_handbook}. See the instructions for the full walkthrough.

Benchmark your own system

Subclass KMASystem and implement three methods:

from kma_bench import KMASystem, DiffEvalResult, RoutingResult, GateResult

class MySystem(KMASystem):
    @property
    def name(self):
        return "My KMA system"

    def evaluate_diff(self, old_content, diff, query, filename):
        return DiffEvalResult(verdict="accept", reasoning="...", confidence=0.9)

    def route_query(self, query, file_index):
        return RoutingResult(selected_files=["file.md"], reasoning="...", confidence=0.9)

    def gate_decision(self, signal_description, signal_severity, query, filename):
        return GateResult(decision="serve", reasoning="...", risk_level="low", confidence=0.9)

Then run it by dotted path:

uv run kma-bench evaluate --system mypackage.MySystem

Tip

load_system adds the working directory to sys.path, so a module in your current directory works as --system module.MyClass.
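The behaviour described in the tip can be approximated with the standard library. The sketch below is an assumption about how such a loader might work, not the actual `load_system` implementation; the name `load_by_dotted_path` is hypothetical, and the real loader also has to handle built-in names such as `heuristic`, which this sketch does not.

```python
import importlib
import os
import sys


def load_by_dotted_path(path: str):
    """Resolve 'package.module.ClassName' to the class object."""
    # Make modules in the current directory importable.
    cwd = os.getcwd()
    if cwd not in sys.path:
        sys.path.insert(0, cwd)

    # Split on the last dot: everything before it is the module path.
    module_name, _, class_name = path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.ClassName', got {path!r}")

    module = importlib.import_module(module_name)
    return getattr(module, class_name)
```

With a file `my_system.py` in the working directory that defines `MySystem`, `load_by_dotted_path("my_system.MySystem")` would return that class.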

Scoring

Diff evaluation: exact match 1.0; adjacent confusion (e.g. reject↔partial) 0.5; opposite (reject↔accept) 0.0.
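As a rough sketch of this rule, the three verdicts can be placed on an ordered scale and scored by distance. The ordering and the function name `score_diff_verdict` are assumptions for illustration; in particular, the text only names reject↔partial as an adjacent pair, so treating accept↔partial the same way is a guess.

```python
# ASSUMPTION: verdicts sit on an ordered scale, so "partial" is adjacent
# to both "reject" and "accept".
VERDICT_ORDER = ["reject", "partial", "accept"]

# Distance on the scale -> score, following the rule stated above.
SCORE_BY_DISTANCE = {0: 1.0, 1: 0.5, 2: 0.0}


def score_diff_verdict(predicted: str, expected: str) -> float:
    distance = abs(VERDICT_ORDER.index(predicted) - VERDICT_ORDER.index(expected))
    return SCORE_BY_DISTANCE[distance]
```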

Routing: recall (expected files found), precision (returned files that were expected), rank-1 (top file is correct), and out-of-scope (correctly empty on irrelevant queries).
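The routing metrics can be sketched for a single case as follows. The function name, the dictionary keys, and the choice to report only the out-of-scope result when no files are expected are assumptions; the benchmark's own aggregation may differ.

```python
def routing_metrics(selected: list, expected: list) -> dict:
    """Score one routing case from the returned and expected file lists."""
    expected_set = set(expected)
    selected_set = set(selected)

    # Out-of-scope query: the only correct answer is an empty selection.
    if not expected_set:
        return {"out_of_scope_correct": 1.0 if not selected else 0.0}

    found = selected_set & expected_set
    return {
        # Share of expected files that were returned.
        "recall": len(found) / len(expected_set),
        # Share of returned files that were expected (0.0 if nothing returned).
        "precision": len(found) / len(selected_set) if selected_set else 0.0,
        # Whether the top-ranked file is one of the expected files.
        "rank1": 1.0 if selected and selected[0] in expected_set else 0.0,
    }
```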

Gate decision: exact match 1.0, with severity-weighted errors: serving when it should block (0.0) is penalised harder than annotating when it should block (0.3).
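A small lookup table is enough to sketch this scheme. Only the two error scores stated above are encoded; every other mismatch falls back to `default_error_score`, a hypothetical parameter, because the text does not say how those pairs are scored. The function name `score_gate` is also illustrative.

```python
# (predicted, expected) -> score, for the two error cases documented above.
DOCUMENTED_ERROR_SCORES = {
    ("serve", "block"): 0.0,     # served content that should have been blocked
    ("annotate", "block"): 0.3,  # flagged content that should have been blocked
}


def score_gate(predicted: str, expected: str, default_error_score: float = 0.0) -> float:
    if predicted == expected:
        return 1.0
    # ASSUMPTION: undocumented mismatches use the caller-supplied default.
    return DOCUMENTED_ERROR_SCORES.get((predicted, expected), default_error_score)
```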

Test-case categories

| Task | Categories |
| --- | --- |
| Diff eval | corruption_conjugation, corruption_chars, corruption_values, rule_swap, fake_entry, legitimate_addition, legitimate_update, correction, no_change, formatting, restructure, partial_error, deletion |
| Routing | single, multi, negative (out of scope) |
| Gate | critical / medium / low severity, across corruption, legitimate updates, partial errors, and structural changes |

Related work

  • BEIR: information retrieval (tests routing, not management).
  • MuSiQue: multi-hop QA (tests reasoning, not staleness).
  • MeMo: Memory as a Model (validates the small-model-as-knowledge-server idea).
  • MTEB: embeddings (not generative knowledge management).

Citation

@software{kma_bench_2026,
  author = {Phiri, Daniel},
  title  = {KMA-Bench: Knowledge Management Agent Benchmark},
  year   = {2026},
  url    = {https://github.com/malgamves/kma-bench}
}
