eval-gate

The eval, wired into the loop: a 3-layer quality gate — deterministic fail-fast → trusted verdict-rules (short-circuit the LLM) → cross-family LLM jury on a 6-dim rubric — routing ship / retry / escalate with trust-tiered provenance.

Status: M1 · part of the Vollko cube platform. Language: TypeScript (Bun/Node) · zero runtime dependencies; the only dependency is on cube-runtime, for the AsyncModelBackend contract type used by the real async paths.

import { gate, Jury, FakeEvaluator, RUBRIC } from "eval-gate";

// Give every rubric dimension the same score, for a deterministic fake judge.
const flat = (score: number): Record<string, number> =>
  Object.fromEntries(RUBRIC.map((d) => [d.name, score]));

// Two judges from different model families form a cross-family jury.
const jury = new Jury([
  new FakeEvaluator(flat(4.4), "judgeA", "anthropic"),
  new FakeEvaluator(flat(4.1), "judgeB", "openai"),
]);
const result = gate("the work to judge.", { profile: "default", jury });
// result.decision -> "ship" | "retry" | "escalate"

Deterministic sensors run first and can veto a perfect jury (cheap, certain).
Trusted VerdictRules short-circuit the LLM (hard-fail → escalate, etc.).
The jury aggregates evaluator scores by per-dimension median; a jury whose evaluators come from different model families earns higher trust.
The entity doing the work never judges it — you pass in an external Evaluator/Jury.
FakeEvaluator makes the whole gate deterministic with no API keys; a real model judge (or harness) drops in behind the Evaluator interface.
The async real path (AsyncEvaluator / AsyncCompressionJudge) wires an injected AsyncModelBackend behind the same fakes-first philosophy — no model SDK ever becomes a compile-time dependency of this package.
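
The per-dimension median mentioned above can be illustrated with a small standalone sketch. It is not the package's Jury implementation: the Scores type, the function names, and the choice to average the two middle values for an even number of judges are all assumptions made for illustration.

```typescript
// Hypothetical shape for illustration; the package's own types may differ.
type Scores = Record<string, number>;

/** Median of a non-empty list; averages the two middle values when the count is even (assumed). */
function median(values: number[]): number {
  if (values.length === 0) throw new Error("median of empty list");
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

/** Collapse several judges' ballots into one score per dimension. */
function aggregateByMedian(ballots: Scores[], dimensions: string[]): Scores {
  const out: Scores = {};
  for (const dim of dimensions) {
    out[dim] = median(ballots.map((b) => b[dim]));
  }
  return out;
}

const ballots: Scores[] = [
  { complete: 5, correct: 2 },
  { complete: 4, correct: 4 },
  { complete: 1, correct: 5 },
];
const merged = aggregateByMedian(ballots, ["complete", "correct"]);
// merged -> { complete: 4, correct: 4 }
```

A median keeps one outlying judge (the 1 on complete, the 2 on correct) from dragging the aggregate, which a mean would not.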

Rubric

The jury scores work on six dimensions (weights sum to 100 %):

Dimension	Weight	What it measures
`complete`	25 %	All parts of what was asked are addressed
`specific`	20 %	Claims are concrete rather than vague
`correct`	20 %	Content is factually and logically sound
`actionable`	15 %	Someone can act directly on this output
`coherent`	10 %	Structure and flow are easy to follow
`format`	10 %	Presentation matches expectations (length, structure)

weightedAverage(scores) collapses a DimensionScores map to a single number. PROFILE_THRESHOLDS defines three ship thresholds: lenient (3.0), default (3.5), strict (4.0); thresholdOf(profile) retrieves the value.
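
As a worked example of the arithmetic, the sketch below recomputes a weighted average from the table's weights and compares it with the three thresholds. It is a standalone illustration, not the package's weightedAverage: the function name is hypothetical, and treating "score >= threshold" as the ship condition is an assumption.

```typescript
// Weights copied from the rubric table; everything else here is illustrative.
const WEIGHTS: Record<string, number> = {
  complete: 25, specific: 20, correct: 20, actionable: 15, coherent: 10, format: 10,
};
const THRESHOLDS = { lenient: 3.0, default: 3.5, strict: 4.0 } as const;

/** Weighted mean of the six dimension scores; throws if a dimension is missing. */
function weightedAverageSketch(scores: Record<string, number>): number {
  let total = 0;
  let weightSum = 0;
  for (const [dim, weight] of Object.entries(WEIGHTS)) {
    const score = scores[dim];
    if (typeof score !== "number") throw new Error(`missing score for ${dim}`);
    total += score * weight;
    weightSum += weight;
  }
  return total / weightSum;
}

const strong = { complete: 4, specific: 4, correct: 5, actionable: 4, coherent: 4, format: 4 };
const weak = { ...strong, correct: 1 };

const strongAvg = weightedAverageSketch(strong); // (100 + 80 + 100 + 60 + 40 + 40) / 100 = 4.2
const weakAvg = weightedAverageSketch(weak);     // (100 + 80 + 20 + 60 + 40 + 40) / 100 = 3.4

const strongClearsStrict = strongAvg >= THRESHOLDS.strict; // true
const weakClearsDefault = weakAvg >= THRESHOLDS.default;   // false
```

With correct weighted at 20 %, a single badly wrong answer is enough to pull otherwise solid work below the default threshold.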

Beyond the gate

The package also ships standalone scorers that measure the evaluators themselves — built on the same deterministic-first philosophy, with judges injected so the kernels stay pure.

Validation contracts (contract.ts) — author named pass/fail assertions at plan time, before any code. manifestOf serialises the contract's intent (ids + descriptions) to a JSON-safe ContractManifest for the transcript; contractSensors feeds the assertions straight into gate({ sensors }); enforceContract runs them standalone and reports the failed assertion ids.

import { enforceContract, PredicateSensor, type ValidationContract } from "eval-gate";

const contract: ValidationContract = {
  id: "c1",
  task: "build login",
  assertions: [
    { id: "a.has-token", description: "mentions a token",
      sensor: new PredicateSensor("has-token", (w) => w.includes("token"), "no token") },
  ],
};
enforceContract(contract, "here is a token").passed; // true

Reviewer benchmark (benchmark.ts) — score a bug-finder against a golden set. matchFindings greedily pairs Findings to GoldenBugs (same file, same type, line within lineTolerance, default 2), prf1 turns counts into precision/recall/F1, benchmark does both in one call, and consistency summarises F1 spread across repeated runs (lower spread = more deterministic reviewer).
Compression probes (compressionProbe.ts) — does a summary still answer the questions that matter once the source is gone? scoreCompression grades a ProbeSuite with an injected CompressionJudge, aggregating per ProbeClass (recall / artifact / continuation / decision) and overall, so dropping one class stays visible even when the headline pass-rate looks healthy. fakeJudge is a deterministic case-insensitive substring judge for tests and exact-match probes. scoreCompressionAsync is the async mirror for real LLM judges. modelCompressionJudge(backend) builds a production judge from an injected AsyncModelBackend; parseCompressionVerdict extracts a boolean from the model's yes/no response and throws if the text is ambiguous or contains both signals.
Comparable scores (scorers.ts) — passRate is the fraction of boolean trials that passed; eloFromPairwise fits deterministic Bradley-Terry strength scores from head-to-head votes (MM iteration; identical votes always yield identical scores).
Golden set (goldenSet.ts) — GoldenSet curates high-confidence reference cases; promoteCandidates graduates production traffic into the golden set; tailSample pulls low-scoring items for review.
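
To make the reviewer-benchmark description concrete, here is a minimal standalone sketch of greedy matching followed by precision/recall/F1. It is not the package's matchFindings or prf1: the Located shape, the function names, the inclusive tolerance check, and returning 0 when a denominator is zero are assumptions for illustration.

```typescript
// Hypothetical shape; the package's Finding and GoldenBug types may carry other fields.
interface Located { file: string; type: string; line: number }

/** Pair each finding with the first unused golden bug that matches it. */
function matchGreedy(findings: Located[], golden: Located[], lineTolerance = 2) {
  const used = new Set<number>();
  let truePositives = 0;
  for (const f of findings) {
    const hit = golden.findIndex(
      (g, i) =>
        !used.has(i) &&
        g.file === f.file &&
        g.type === f.type &&
        Math.abs(g.line - f.line) <= lineTolerance,
    );
    if (hit !== -1) {
      used.add(hit); // each golden bug can be claimed only once
      truePositives++;
    }
  }
  return {
    truePositives,
    falsePositives: findings.length - truePositives,
    falseNegatives: golden.length - truePositives,
  };
}

/** Precision, recall and F1 from raw counts; 0 is returned for empty denominators (assumed). */
function prf1Sketch(tp: number, fp: number, fn: number) {
  const precision = tp + fp === 0 ? 0 : tp / (tp + fp);
  const recall = tp + fn === 0 ? 0 : tp / (tp + fn);
  const f1 = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}

const goldenBugs: Located[] = [
  { file: "auth.ts", type: "null-deref", line: 10 },
  { file: "auth.ts", type: "race", line: 40 },
];
const reported: Located[] = [
  { file: "auth.ts", type: "null-deref", line: 12 }, // 2 lines off: matches
  { file: "auth.ts", type: "race", line: 50 },       // 10 lines off: false positive
];
const counts = matchGreedy(reported, goldenBugs);
const metrics = prf1Sketch(counts.truePositives, counts.falsePositives, counts.falseNegatives);
// counts  -> 1 true positive, 1 false positive, 1 false negative
// metrics -> precision 0.5, recall 0.5, f1 0.5
```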

Async real paths

Both async seams follow the same contract: the caller injects an AsyncModelBackend; this package only depends on the contract type, never on a model SDK.

AsyncEvaluator — LLM jury over HTTP

import { modelEvaluator, FakeAsyncEvaluator } from "eval-gate";
import type { AsyncModelBackend } from "cube-runtime";

// In tests — no API keys, deterministic:
const fake = new FakeAsyncEvaluator({ complete: 4, specific: 4, correct: 5,
                                      actionable: 4, coherent: 4, format: 4 });
const scores = await fake.score("some output");

// In production — wire any AsyncModelBackend:
declare const backend: AsyncModelBackend;
const evaluator = modelEvaluator(backend, { name: "claude-judge", family: "anthropic" });
const scores2 = await evaluator.score("some output");
// scores2 -> DimensionScores, one number per rubric dimension

buildEvalPrompt(work) constructs the scoring prompt (JSON-only, 1–5 per dimension). parseEvalScores(text) parses the model response into DimensionScores and throws if:

no JSON object is found in the response,
the JSON is malformed,
any rubric dimension key is missing or non-numeric.

A model that ignored the scoring contract is a real failure, not a silent pass.
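
The three failure conditions above can be sketched as a strict parser. This is an illustration under stated assumptions, not the package's parseEvalScores: the function name is hypothetical, and locating the JSON object by the first "{" and last "}" is one possible extraction strategy among several.

```typescript
// Dimension names taken from the rubric table.
const DIMENSIONS = ["complete", "specific", "correct", "actionable", "coherent", "format"];

/** Parse a model reply into one numeric score per dimension, or throw. */
function parseScoresSketch(text: string): Record<string, number> {
  // Assumed strategy: take the span from the first "{" to the last "}",
  // so prose around the JSON is tolerated.
  const start = text.indexOf("{");
  const end = text.lastIndexOf("}");
  if (start === -1 || end <= start) throw new Error("no JSON object found");

  let parsed: unknown;
  try {
    parsed = JSON.parse(text.slice(start, end + 1));
  } catch {
    throw new Error("malformed JSON");
  }

  const record = parsed as Record<string, unknown>;
  const scores: Record<string, number> = {};
  for (const dim of DIMENSIONS) {
    const value = record[dim];
    if (typeof value !== "number") {
      throw new Error(`missing or non-numeric dimension: ${dim}`);
    }
    scores[dim] = value;
  }
  return scores;
}

const reply =
  'Scores: {"complete":4,"specific":4,"correct":5,"actionable":4,"coherent":4,"format":4}';
const parsedScores = parseScoresSketch(reply);
// parsedScores.correct -> 5
```

Throwing on every deviation means a caller has to handle the error explicitly, so a malformed reply can never be mistaken for a set of passing scores.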

modelEvaluator(backend, opts?) returns an AsyncEvaluator with optional name, family, and context overrides.

AsyncCompressionJudge — yes/no probe grading over HTTP

import { modelCompressionJudge, scoreCompressionAsync, parseCompressionVerdict, type ProbeSuite } from "eval-gate";
import type { AsyncModelBackend } from "cube-runtime";

declare const backend: AsyncModelBackend;
declare const summary: string;   // the compressed text under test (assumed here to be a string)
declare const suite: ProbeSuite; // the probes to grade it against
const judge = modelCompressionJudge(backend);

const result = await scoreCompressionAsync(summary, suite, judge);
// result.passRate -> overall fraction of probes that passed
// result.byClass  -> one ClassTally per class: recall, artifact, continuation, decision

parseCompressionVerdict(text) accepts yes/true/no/false (case-insensitive, anywhere in the text) and throws if the response is ambiguous (both signals present) or contains neither.
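
A minimal sketch of that verdict rule follows. It is not the package's parseCompressionVerdict: the function name is hypothetical, and matching on word boundaries (so that "no" does not fire inside "know") is an assumption about how "anywhere in the text" is interpreted.

```typescript
/** Extract a yes/no verdict from free text; throw when it is ambiguous or absent. */
function parseVerdictSketch(text: string): boolean {
  // Word boundaries are an assumption: they keep "no" from matching inside "know".
  const positive = /\b(yes|true)\b/i.test(text);
  const negative = /\b(no|false)\b/i.test(text);
  if (positive && negative) throw new Error("ambiguous verdict: both signals present");
  if (!positive && !negative) throw new Error("no verdict found");
  return positive;
}

const accepted = parseVerdictSketch("Yes, the summary still answers the question."); // true
const rejected = parseVerdictSketch("FALSE");                                        // false
```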

modelCompressionJudge(backend) returns an AsyncCompressionJudge that builds its prompt with buildCompressionJudgePrompt (internal) and parses the reply with parseCompressionVerdict.
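
The reason for reporting byClass alongside the headline passRate can be shown with a small standalone tally. The Outcome and Tally shapes and the function name are hypothetical, chosen for illustration; the package's ClassTally may hold different fields.

```typescript
type ProbeClassName = "recall" | "artifact" | "continuation" | "decision";

// Hypothetical shapes for illustration only.
interface Outcome { cls: ProbeClassName; passed: boolean }
interface Tally { passed: number; total: number; passRate: number }

const CLASS_NAMES: ProbeClassName[] = ["recall", "artifact", "continuation", "decision"];

/** Count passes overall and per class; a class with no probes gets a passRate of 0 (assumed). */
function tallyByClass(outcomes: Outcome[]) {
  const byClass = {} as Record<ProbeClassName, Tally>;
  for (const cls of CLASS_NAMES) {
    const inClass = outcomes.filter((o) => o.cls === cls);
    const passed = inClass.filter((o) => o.passed).length;
    const total = inClass.length;
    byClass[cls] = { passed, total, passRate: total === 0 ? 0 : passed / total };
  }
  const passedAll = outcomes.filter((o) => o.passed).length;
  return { passRate: outcomes.length === 0 ? 0 : passedAll / outcomes.length, byClass };
}

// Eight recall probes pass, but both decision probes fail.
const outcomes: Outcome[] = [
  ...Array.from({ length: 8 }, (): Outcome => ({ cls: "recall", passed: true })),
  ...Array.from({ length: 2 }, (): Outcome => ({ cls: "decision", passed: false })),
];
const tally = tallyByClass(outcomes);
// tally.passRate                  -> 0.8
// tally.byClass.decision.passRate -> 0
```

An overall rate of 0.8 looks healthy, yet the summary has lost every decision; the per-class breakdown is what exposes it.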

Public API

Everything is re-exported from eval-gate (src/index.ts):

Gate: gate, GateArgs, GateResult, ShadowVerdict, Decision
Rubric: RUBRIC, PROFILE_THRESHOLDS, thresholdOf, weightedAverage, Dimension, DimensionScores, Profile
Evaluators (sync): Jury, FakeEvaluator, Evaluator
Evaluators (async): AsyncEvaluator, FakeAsyncEvaluator, modelEvaluator, buildEvalPrompt, parseEvalScores
Sensors: PredicateSensor, runSensors, allPassed, DeterministicSensor, SensorResult
Trust: TRUST_TIER, VerdictSource
Verdict rules: HardFailRule, SourcesRequiredRule, firstVerdict, Verdict, VerdictRule
Slop: detectSlop, SlopResult
Contradiction: detectContradictions, sameKeyConflict, Assertion, Contradiction, ContradictionRule
Golden set: GoldenSet, promoteCandidates, tailSample, GoldenCase, ProdCase
Validation contracts: enforceContract, contractSensors, manifestOf, ValidationContract, ContractAssertion, ContractManifest, ContractResult
Benchmark: benchmark, consistency, matchFindings, prf1, Finding, GoldenBug, MatchOptions
Compression probes: scoreCompression, scoreCompressionAsync, modelCompressionJudge, parseCompressionVerdict, fakeJudge, PROBE_CLASSES, Probe, ProbeSuite, ProbeClass, ClassTally, CompressionJudge, AsyncCompressionJudge
Scorers: eloFromPairwise, passRate, Pairwise, EloOptions

Dev: bun install, bun test, bun run typecheck.

Part of the Vollko cube platform — a polyrepo of composable, zero-dependency building blocks for AI-native organizations. See reference-constellation for the whole stack running end-to-end; packages depend on each other via file:../, so clone the siblings alongside this one.
