The eval, wired into the loop: a 3-layer quality gate — deterministic fail-fast → trusted verdict-rules (short-circuit the LLM) → cross-family LLM jury on a 6-dim rubric — routing ship / retry / escalate with trust-tiered provenance.
Status: M1 · part of the Vollko cube platform.
Language: TypeScript (Bun/Node) · zero runtime dependencies; depends on
`cube-runtime` only for the `AsyncModelBackend` contract type used by the real async paths.
```ts
import { gate, Jury, FakeEvaluator, PredicateSensor, RUBRIC } from "eval-gate";

const jury = new Jury([
  new FakeEvaluator(Object.fromEntries(RUBRIC.map((d) => [d.name, 4.4])) as Record<string, number>, "judgeA", "anthropic"),
  new FakeEvaluator(Object.fromEntries(RUBRIC.map((d) => [d.name, 4.1])) as Record<string, number>, "judgeB", "openai"),
]);

const result = gate("the work to judge.", { profile: "default", jury });
// result.decision -> "ship" | "retry" | "escalate"
```

- Deterministic sensors run first and can veto a perfect jury (cheap, certain).
- Trusted `VerdictRule`s short-circuit the LLM (hard-fail → escalate, etc.).
- The jury aggregates evaluators by per-dimension median; a cross-family jury scores higher trust.
- The entity doing the work never judges it: you pass in an external `Evaluator`/`Jury`. `FakeEvaluator` makes the whole gate deterministic with no API keys; a real model judge (or `harness`) drops in behind the `Evaluator` interface.
- The async real path (`AsyncEvaluator`/`AsyncCompressionJudge`) wires an injected `AsyncModelBackend` behind the same fakes-first philosophy: no model SDK ever becomes a compile-time dependency of this package.
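The per-dimension median mentioned above can be illustrated with a small standalone sketch. This is not the package's `Jury` code; `median`, `medianByDimension`, and the plain `Record<string, number>` score shape are hypothetical names chosen for illustration, and the even-count rule (mean of the two middle values) is an assumption.

```ts
// Hypothetical stand-in for the package's DimensionScores type.
type ScoreMap = Record<string, number>;

// Median of a non-empty list; for an even count, the mean of the two middle values (assumed rule).
function median(values: number[]): number {
  if (values.length === 0) throw new Error("median of an empty list");
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Aggregate several evaluators' scores into one map, dimension by dimension.
function medianByDimension(all: ScoreMap[]): ScoreMap {
  if (all.length === 0) throw new Error("at least one evaluator is required");
  const out: ScoreMap = {};
  for (const dim of Object.keys(all[0])) {
    out[dim] = median(all.map((scores) => scores[dim]));
  }
  return out;
}

const aggregated = medianByDimension([
  { complete: 3, correct: 5 },
  { complete: 4, correct: 2 },
  { complete: 5, correct: 4 },
]);
// aggregated -> { complete: 4, correct: 4 }
```

A median keeps one outlier judge from dragging a dimension up or down, which is presumably why it is preferred here over a mean.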
The jury scores work on six dimensions (weights sum to 100 %):
| Dimension | Weight | What it measures |
|---|---|---|
| `complete` | 25 % | All parts of what was asked are addressed |
| `specific` | 20 % | Claims are concrete rather than vague |
| `correct` | 20 % | Content is factually and logically sound |
| `actionable` | 15 % | Someone can act directly on this output |
| `coherent` | 10 % | Structure and flow are easy to follow |
| `format` | 10 % | Presentation matches expectations (length, structure) |
`weightedAverage(scores)` collapses a `DimensionScores` map to a single number.
`PROFILE_THRESHOLDS` defines three ship thresholds: lenient (3.0), default (3.5),
strict (4.0); `thresholdOf(profile)` retrieves the value.
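As a worked illustration of the arithmetic, the sketch below recomputes a weighted average from the rubric weights in the table and compares it to a profile threshold. The names `WEIGHTS`, `THRESHOLDS`, `weightedAverageSketch`, and `meetsThreshold` are hypothetical, and the "greater than or equal" comparison is an assumption; the package's own `weightedAverage` and `thresholdOf` may differ in detail.

```ts
// Rubric weights in percent, copied from the table above (they sum to 100).
const WEIGHTS = { complete: 25, specific: 20, correct: 20, actionable: 15, coherent: 10, format: 10 } as const;
type Dim = keyof typeof WEIGHTS;

// Ship thresholds per profile, as listed above.
const THRESHOLDS = { lenient: 3.0, default: 3.5, strict: 4.0 } as const;
type ProfileName = keyof typeof THRESHOLDS;

// Sum of weight * score, divided by the total weight of 100.
function weightedAverageSketch(scores: Record<Dim, number>): number {
  let total = 0;
  for (const dim of Object.keys(WEIGHTS) as Dim[]) {
    total += WEIGHTS[dim] * scores[dim];
  }
  return total / 100;
}

// Assumption: work clears a profile when its average meets or exceeds the threshold.
function meetsThreshold(scores: Record<Dim, number>, profile: ProfileName): boolean {
  return weightedAverageSketch(scores) >= THRESHOLDS[profile];
}

const sample = { complete: 4, specific: 4, correct: 5, actionable: 4, coherent: 4, format: 4 };
const average = weightedAverageSketch(sample);
// (25*4 + 20*4 + 20*5 + 15*4 + 10*4 + 10*4) / 100 = 420 / 100 = 4.2
```

Keeping the weights as integer percentages and dividing once at the end avoids accumulating floating-point error across six multiplications by fractions.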
The package also ships standalone scorers that measure the evaluators themselves — built on the same deterministic-first philosophy, with judges injected so the kernels stay pure.
- Validation contracts (`contract.ts`): author named pass/fail assertions at plan time, before any code. `manifestOf` serialises the contract's intent (ids + descriptions) to a JSON-safe `ContractManifest` for the transcript; `contractSensors` feeds the assertions straight into `gate({ sensors })`; `enforceContract` runs them standalone and reports the failed assertion ids.

  ```ts
  import { enforceContract, PredicateSensor, type ValidationContract } from "eval-gate";

  const contract: ValidationContract = {
    id: "c1",
    task: "build login",
    assertions: [
      {
        id: "a.has-token",
        description: "mentions a token",
        sensor: new PredicateSensor("has-token", (w) => w.includes("token"), "no token"),
      },
    ],
  };

  enforceContract(contract, "here is a token").passed; // true
  ```

- Reviewer benchmark (`benchmark.ts`): score a bug-finder against a golden set. `matchFindings` greedily pairs `Finding`s to `GoldenBug`s (same file, same type, line within `lineTolerance`, default 2), `prf1` turns counts into precision/recall/F1, `benchmark` does both in one call, and `consistency` summarises F1 spread across repeated runs (lower spread = more deterministic reviewer).
- Compression probes (`compressionProbe.ts`): does a summary still answer the questions that matter once the source is gone? `scoreCompression` grades a `ProbeSuite` with an injected `CompressionJudge`, aggregating per `ProbeClass` (recall/artifact/continuation/decision) and overall, so dropping one class stays visible even when the headline pass-rate looks healthy. `fakeJudge` is a deterministic case-insensitive substring judge for tests and exact-match probes. `scoreCompressionAsync` is the async mirror for real LLM judges. `modelCompressionJudge(backend)` builds a production judge from an injected `AsyncModelBackend`; `parseCompressionVerdict` extracts a boolean from the model's yes/no response and throws if the text is ambiguous or contains both signals.
- Comparable scores (`scorers.ts`): `passRate` is the fraction of boolean trials that passed; `eloFromPairwise` fits deterministic Bradley-Terry strength scores from head-to-head votes (MM iteration; identical votes always yield identical scores).
- Golden set (`goldenSet.ts`): `GoldenSet` curates high-confidence reference cases; `promoteCandidates` graduates production traffic into the golden set; `tailSample` pulls low-scoring items for review.
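The reviewer-benchmark idea (greedy pairing, then precision/recall/F1) can be sketched without the package. Everything below is illustrative: `Located`, `greedyMatchSketch`, and `prf1Sketch` are hypothetical names, the real `Finding`/`GoldenBug` types may carry more fields, and returning 0 when a denominator is 0 is an assumed convention.

```ts
// Hypothetical minimal shape shared by a reported finding and a golden bug.
interface Located {
  file: string;
  type: string;
  line: number;
}

// Greedy pairing: each finding claims the first unused golden bug with the same
// file and type whose line is within the tolerance. Each golden bug matches at most once.
function greedyMatchSketch(findings: Located[], golden: Located[], lineTolerance = 2) {
  const used = new Set<number>();
  let tp = 0;
  for (const f of findings) {
    const idx = golden.findIndex(
      (g, i) =>
        !used.has(i) &&
        g.file === f.file &&
        g.type === f.type &&
        Math.abs(g.line - f.line) <= lineTolerance,
    );
    if (idx >= 0) {
      used.add(idx);
      tp++;
    }
  }
  return { tp, fp: findings.length - tp, fn: golden.length - tp };
}

// Precision, recall and F1 from counts; 0 when a denominator is 0 (assumed convention).
function prf1Sketch(tp: number, fp: number, fn: number) {
  const precision = tp + fp === 0 ? 0 : tp / (tp + fp);
  const recall = tp + fn === 0 ? 0 : tp / (tp + fn);
  const f1 = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}

const counts = greedyMatchSketch(
  [
    { file: "a.ts", type: "null-deref", line: 10 },
    { file: "a.ts", type: "null-deref", line: 50 },
    { file: "b.ts", type: "leak", line: 7 },
  ],
  [
    { file: "a.ts", type: "null-deref", line: 12 },
    { file: "b.ts", type: "leak", line: 20 },
  ],
);
// counts -> { tp: 1, fp: 2, fn: 1 }
const metrics = prf1Sketch(counts.tp, counts.fp, counts.fn);
```

Greedy pairing is order-dependent: a finding can take a golden bug that a later finding was a closer match for. That trade-off buys simplicity and determinism.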
Both async seams follow the same contract: the caller injects an `AsyncModelBackend`;
this package only depends on the contract type, never on a model SDK.
```ts
import { modelEvaluator, FakeAsyncEvaluator } from "eval-gate";
import type { AsyncModelBackend } from "cube-runtime";

// In tests: no API keys, deterministic.
const fake = new FakeAsyncEvaluator({ complete: 4, specific: 4, correct: 5,
                                      actionable: 4, coherent: 4, format: 4 });
const scores = await fake.score("some output");

// In production: wire any AsyncModelBackend.
declare const backend: AsyncModelBackend;
const evaluator = modelEvaluator(backend, { name: "claude-judge", family: "anthropic" });
const scores2 = await evaluator.score("some output");
// scores2 -> DimensionScores, one number per rubric dimension
```

`buildEvalPrompt(work)` constructs the scoring prompt (JSON-only, 1–5 per dimension).
`parseEvalScores(text)` parses the model response into `DimensionScores` and throws if:
- no JSON object is found in the response,
- the JSON is malformed,
- any rubric dimension key is missing or non-numeric.
A model that ignored the scoring contract is a real failure, not a silent pass.
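The three failure modes listed above can be made concrete with a standalone sketch. It is not the package's `parseEvalScores`; `parseScoresSketch` and `DIMENSIONS` are hypothetical names, and locating the JSON by the first `{` and last `}` is an assumed strategy.

```ts
// Rubric dimension names, copied from the rubric table.
const DIMENSIONS = ["complete", "specific", "correct", "actionable", "coherent", "format"] as const;
type DimName = (typeof DIMENSIONS)[number];

function parseScoresSketch(text: string): Record<DimName, number> {
  // 1. Locate a JSON object anywhere in the reply (assumed: first "{" to last "}").
  const start = text.indexOf("{");
  const end = text.lastIndexOf("}");
  if (start === -1 || end <= start) {
    throw new Error("no JSON object found in response");
  }

  // 2. Reject malformed JSON with a clear error rather than a raw SyntaxError.
  let parsed: Record<string, unknown>;
  try {
    parsed = JSON.parse(text.slice(start, end + 1)) as Record<string, unknown>;
  } catch {
    throw new Error("malformed JSON in response");
  }

  // 3. Every rubric dimension must be present and numeric.
  const scores = {} as Record<DimName, number>;
  for (const dim of DIMENSIONS) {
    const value = parsed[dim];
    if (typeof value !== "number") {
      throw new Error(`missing or non-numeric dimension: ${dim}`);
    }
    scores[dim] = value;
  }
  return scores;
}

const reply =
  'Scores: {"complete": 4, "specific": 4, "correct": 5, "actionable": 4, "coherent": 4, "format": 4}';
const parsedScores = parseScoresSketch(reply);
// parsedScores.correct -> 5
```

Throwing on every deviation means a judge that answers in prose, or omits a dimension, surfaces as an error the caller must handle instead of quietly producing a partial score.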
`modelEvaluator(backend, opts?)` returns an `AsyncEvaluator` with optional `name`,
`family`, and `context` overrides.
```ts
import { modelCompressionJudge, scoreCompressionAsync, parseCompressionVerdict, type ProbeSuite } from "eval-gate";
import type { AsyncModelBackend } from "cube-runtime";

declare const backend: AsyncModelBackend;
declare const summary: string;   // the compressed text under test
declare const suite: ProbeSuite; // the probes the summary must still answer

const judge = modelCompressionJudge(backend);
const result = await scoreCompressionAsync(summary, suite, judge);
// result.passRate -> overall fraction of probes that passed
// result.byClass  -> { recall, artifact, continuation, decision } ClassTally each
```

`parseCompressionVerdict(text)` accepts yes/true/no/false (case-insensitive,
anywhere in the text) and throws if the response is ambiguous (both signals present)
or contains neither.

`modelCompressionJudge(backend)` returns an `AsyncCompressionJudge` that builds its
prompt with `buildCompressionJudgePrompt` (internal) and parses the reply with
`parseCompressionVerdict`.
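The verdict-parsing rule described above is small enough to sketch directly. The function below is a hypothetical stand-in named `parseVerdictSketch`, not the package's `parseCompressionVerdict`; matching on whole words (so that "not" or "know" do not count as "no") is an assumption about how "anywhere in the text" is interpreted.

```ts
// Returns true for a yes/true reply, false for a no/false reply,
// and throws when the reply carries both signals or neither.
function parseVerdictSketch(text: string): boolean {
  // Assumed: whole-word, case-insensitive matching.
  const positive = /\b(yes|true)\b/i.test(text);
  const negative = /\b(no|false)\b/i.test(text);

  if (positive && negative) {
    throw new Error("ambiguous verdict: both positive and negative signals present");
  }
  if (!positive && !negative) {
    throw new Error("no verdict found in response");
  }
  return positive;
}

const accepted = parseVerdictSketch("Yes, the summary still answers the question.");
const rejected = parseVerdictSketch("FALSE");
// accepted -> true, rejected -> false
```

Refusing to guess on a mixed reply such as "yes and no" keeps a hedging judge from being silently counted as a pass.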
Everything is re-exported from `eval-gate` (`src/index.ts`):
- Gate: `gate`, `GateArgs`, `GateResult`, `ShadowVerdict`, `Decision`
- Rubric: `RUBRIC`, `PROFILE_THRESHOLDS`, `thresholdOf`, `weightedAverage`, `Dimension`, `DimensionScores`, `Profile`
- Evaluators (sync): `Jury`, `FakeEvaluator`, `Evaluator`
- Evaluators (async): `AsyncEvaluator`, `FakeAsyncEvaluator`, `modelEvaluator`, `buildEvalPrompt`, `parseEvalScores`
- Sensors: `PredicateSensor`, `runSensors`, `allPassed`, `DeterministicSensor`, `SensorResult`
- Trust: `TRUST_TIER`, `VerdictSource`
- Verdict rules: `HardFailRule`, `SourcesRequiredRule`, `firstVerdict`, `Verdict`, `VerdictRule`
- Slop: `detectSlop`, `SlopResult`
- Contradiction: `detectContradictions`, `sameKeyConflict`, `Assertion`, `Contradiction`, `ContradictionRule`
- Golden set: `GoldenSet`, `promoteCandidates`, `tailSample`, `GoldenCase`, `ProdCase`
- Validation contracts: `enforceContract`, `contractSensors`, `manifestOf`, `ValidationContract`, `ContractAssertion`, `ContractManifest`, `ContractResult`
- Benchmark: `benchmark`, `consistency`, `matchFindings`, `prf1`, `Finding`, `GoldenBug`, `MatchOptions`
- Compression probes: `scoreCompression`, `scoreCompressionAsync`, `modelCompressionJudge`, `parseCompressionVerdict`, `fakeJudge`, `PROBE_CLASSES`, `Probe`, `ProbeSuite`, `ProbeClass`, `ClassTally`, `CompressionJudge`, `AsyncCompressionJudge`
- Scorers: `eloFromPairwise`, `passRate`, `Pairwise`, `EloOptions`
Dev: bun install, bun test, bun run typecheck.
Part of the Vollko cube platform — a polyrepo of composable, zero-dependency building blocks for AI-native organizations. See reference-constellation for the whole stack running end-to-end; packages depend on each other via file:../, so clone the siblings alongside this one.
Apache-2.0 © 2026 Vlad Bordei bordeivlad@gmail.com · https://github.com/p-vbordei