A small, local model that manages a knowledge base, so a frontier model doesn't have to.
Peripheral is a fine-tuned 8B model that does the unglamorous admin around a filesystem knowledge base: it routes a query to the right files, notices when stored knowledge has gone stale, and decides whether a recent change is safe to serve. It runs locally, costs nothing per call, and on these knowledge-management tasks it beats a frontier API at a fraction of the latency and cost.
It's a knowledge management agent (KMA): a specialist, not a better frontier model. On KMA-Bench it scores 81% to Claude Sonnet 4.6's 76%, at ~760ms per call (roughly 5x faster and 5x cheaper).
Note
Peripheral was trained on ~7,000 task examples and beats a zero-shot generalist, but it transfers only partially to unseen domains. See the model card for the honest limitations.
- Local and cheap: an 8B
Q4_K_MGGUF (~4.8 GB VRAM) you run in LM Studio, at zero API cost. - Three jobs, one model: routing, change evaluation, and serve/annotate/block gating, each a tagged prompt that returns a single JSON object.
- Cheap heuristics do the easy work: a topic-keyword index for routing and git diffs for staleness; the model is reserved for the hard call: is this change legitimate, corrupt, or somewhere in between?
- Reproducible: generate the training data and fine-tune on your own knowledge base, no frontier API required.
A query passes through a pipeline of cheap heuristics before the model is asked anything, so the model only does the work that genuinely needs judgement.
User query
│
▼
┌──────────────────────────────────────────────────────────┐
│ Peripheral pipeline │
│ ┌────────┐ ┌────────┐ ┌──────────────────┐ │
│ │ Router │──▶│ PVL │──▶│ Block Gate │ │
│ │ topic │ │ git │ │ serve / annotate │ │
│ │ index │ │ diff │ │ / block │ │
│ └────────┘ └────────┘ └────────┬─────────┘ │
│ │ │ │
│ relevant files ┌──────┴──────┐ │
│ (5-6x fewer tokens) │ Eval (diff │ │
│ │ classifier) │ │
│ └─────────────┘ │
└──────────────────────────────────────────────────────────┘
│
▼
Response from verified, routed context
- Router: maps a query to the relevant wiki files via a hand-built topic-keyword index (100% hit rate vs ~30% for embedding search on a non-English KB).
- PVL (Peripheral Vision Layer): an append-only change log from git diffs that flags recently modified files, at ~226 tokens per query.
- Block Gate: instead of injecting a "this might be stale" warning (which the model ignores), it withholds the content and serves the diff plus the previous version, reframing the task as evaluate this change.
- Peripheral (the model): Qwen3-8B fine-tuned with QLoRA, the judgement layer that classifies a change as
accept/partial/reject.
See docs/architecture.md for the full design.
On KMA-Bench, averaged over multiple runs across three knowledge bases (% correct):
| Model | Diff Eval | Routing | Gate | Overall | Latency |
|---|---|---|---|---|---|
| Heuristic | 44% | 75% | 100% | 64% | 0 ms |
| Base Qwen3-8B | 46% | 13% | 16% | 30% | ~3,500 ms |
| Peripheral (8B) | 71% | 88% | 96% | 81% | ~760 ms |
| Claude Sonnet 4.6 (zero-shot) | 60% | 96% | 83% | 76% | ~4,100 ms |
Fine-tuning carried the base model from 30% (it couldn't reliably emit valid JSON) to 81%. The public benchmark ships 166 cases (French + PostHog); ClickHouse was evaluated too but isn't redistributed.
Prerequisites: Python 3.11+, uv, and LM Studio (for the local model) or an Anthropic API key.
uv sync
cp .env.example .env # add your Anthropic key only if benchmarking ClaudeRun the benchmark against the local model (loaded in LM Studio) or Claude:
uv run python scripts/benchmark/benchmark_core.py --backend lmstudio --model-name peripheral-8b
uv run python scripts/benchmark/benchmark_core.py --backend anthropicReproduce the training data and fine-tune split on your own knowledge base, generated locally, zero API cost:
uv run python scripts/datagen/gen_eval.py --wiki-path ./french-kb/wiki --output ./training-data/eval
uv run python scripts/datagen/gen_routing.py --wiki-path ./french-kb/wiki --output ./training-data/routing
uv run python scripts/datagen/gen_gate.py --wiki-path ./french-kb/wiki --output ./training-data/gate
uv run python scripts/training/prep_finetune.py --data-dir ./training-data --output ./finetune-dataFine-tune the prepared split with Unsloth (QLoRA on a single GPU) and export a Q4_K_M GGUF.
The fastest path is LM Studio: search malgamves/peripheral-8b, download the GGUF, and give the model a task instruction ([EVAL], [ROUTE], or [GATE]) plus your input. It replies with a single JSON object.
Tip
Keep the temperature low (~0.1). Peripheral is trained to emit short, deterministic JSON. Full prompt formats are in the model card.
peripheral/
├── french-kb/ French language-learning wiki (the author's notes)
├── scripts/
│ ├── benchmark/ KMA-Bench runners
│ ├── datagen/ Training-data generation
│ ├── measurement/ Token-measurement experiments
│ ├── pvl/ Peripheral Vision Layer + block gate
│ ├── training/ Fine-tuning prep (train/val split)
│ └── validation/ Index-validation experiments
└── docs/ Architecture notes + model card
The benchmark lives in its own repo: KMA-Bench.
Note
An MCP Server is coming!!
- Architecture: the pipeline, the four tasks, and the data flow.
- Model card: prompts, results, training details, and limitations.
- Write-up: the full story behind the project.