p-vbordei/eval-gate

eval-gate

The eval, wired into the loop: a 3-layer quality gate (deterministic fail-fast → trusted verdict rules that short-circuit the LLM → cross-family LLM jury on a 6-dimension rubric) that routes work to ship / retry / escalate with trust-tiered provenance.

Status: M1 · part of the Vollko cube platform. Language: TypeScript (Bun/Node) · zero runtime dependencies; the only cross-package dependency is cube-runtime, which supplies the AsyncModelBackend contract type used by the real async paths.

import { gate, Jury, FakeEvaluator, PredicateSensor, RUBRIC } from "eval-gate";

const jury = new Jury([
  new FakeEvaluator(Object.fromEntries(RUBRIC.map((d) => [d.name, 4.4])) as Record<string, number>, "judgeA", "anthropic"),
  new FakeEvaluator(Object.fromEntries(RUBRIC.map((d) => [d.name, 4.1])) as Record<string, number>, "judgeB", "openai"),
]);
const result = gate("the work to judge.", { profile: "default", jury });
// result.decision -> "ship" | "retry" | "escalate"

  • Deterministic sensors run first and can veto a perfect jury (cheap, certain).
  • Trusted VerdictRules short-circuit the LLM (hard-fail → escalate, etc.).
  • The jury aggregates its evaluators' scores by per-dimension median; a jury whose evaluators span different model families is assigned higher trust.
  • The entity doing the work never judges it — you pass in an external Evaluator/Jury.
  • FakeEvaluator makes the whole gate deterministic with no API keys; a real model judge (or harness) drops in behind the Evaluator interface.
  • The async real path (AsyncEvaluator / AsyncCompressionJudge) wires an injected AsyncModelBackend behind the same fakes-first philosophy — no model SDK ever becomes a compile-time dependency of this package.
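The per-dimension median mentioned above can be sketched as follows. This is a minimal illustration, not the package's Jury implementation: JudgeReport, median, aggregateByMedian and isCrossFamily are hypothetical names, and taking the mean of the two middle values for an even number of judges is an assumption.

```typescript
// Hypothetical local types for illustration; the real package exports its own Evaluator and Jury.
type DimensionScores = Record<string, number>;

interface JudgeReport {
  name: string;
  family: string; // model family, e.g. "anthropic" or "openai"
  scores: DimensionScores;
}

// Median of a non-empty list. For an even count this takes the mean of the
// two middle values (an assumption; the package may choose differently).
function median(values: number[]): number {
  if (values.length === 0) throw new Error("median of an empty list");
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Aggregate each dimension independently, so one outlier judge cannot drag a dimension down.
function aggregateByMedian(reports: JudgeReport[], dimensions: string[]): DimensionScores {
  const out: DimensionScores = {};
  for (const dim of dimensions) {
    out[dim] = median(reports.map((r) => r.scores[dim]));
  }
  return out;
}

// A jury counts as cross-family here when its judges come from more than one model family.
function isCrossFamily(reports: JudgeReport[]): boolean {
  return new Set(reports.map((r) => r.family)).size > 1;
}

const reports: JudgeReport[] = [
  { name: "judgeA", family: "anthropic", scores: { complete: 4, correct: 5 } },
  { name: "judgeB", family: "openai", scores: { complete: 4.5, correct: 4 } },
  { name: "judgeC", family: "openai", scores: { complete: 1, correct: 4.5 } },
];
const aggregated = aggregateByMedian(reports, ["complete", "correct"]);
// aggregated -> { complete: 4, correct: 4.5 }; judgeC's outlier 1 is ignored
const crossFamily = isCrossFamily(reports); // true
```

The median is used here rather than the mean because a single very low or very high judge leaves the aggregate unchanged as long as the other judges agree.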

Rubric

The jury scores work on six dimensions (weights sum to 100 %):

| Dimension  | Weight | What it measures                                      |
|------------|--------|-------------------------------------------------------|
| complete   | 25 %   | All parts of what was asked are addressed             |
| specific   | 20 %   | Claims are concrete rather than vague                 |
| correct    | 20 %   | Content is factually and logically sound              |
| actionable | 15 %   | Someone can act directly on this output               |
| coherent   | 10 %   | Structure and flow are easy to follow                 |
| format     | 10 %   | Presentation matches expectations (length, structure) |

weightedAverage(scores) collapses a DimensionScores map to a single number. PROFILE_THRESHOLDS defines three ship thresholds: lenient (3.0), default (3.5), strict (4.0); thresholdOf(profile) retrieves the value.
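As a worked example of the arithmetic, the sketch below recomputes a weighted average from the weights in the table and compares it with the three profile thresholds. It is a standalone illustration rather than the package source: WEIGHTS, THRESHOLDS, weightedAverageSketch and clearsThreshold are hypothetical names, and treating "score >= threshold" as the ship condition is an assumption.

```typescript
// Weights copied from the rubric table, as integer percentages (they sum to 100).
const WEIGHTS = {
  complete: 25,
  specific: 20,
  correct: 20,
  actionable: 15,
  coherent: 10,
  format: 10,
} as const;
type Dim = keyof typeof WEIGHTS;

// Thresholds copied from the text above.
const THRESHOLDS = { lenient: 3.0, default: 3.5, strict: 4.0 } as const;
type ProfileName = keyof typeof THRESHOLDS;

// Sum of weight * score over all six dimensions, divided by the total weight.
function weightedAverageSketch(scores: Record<Dim, number>): number {
  let total = 0;
  for (const dim of Object.keys(WEIGHTS) as Dim[]) {
    total += WEIGHTS[dim] * scores[dim];
  }
  return total / 100;
}

// Assumption: a score at or above the threshold clears it.
function clearsThreshold(scores: Record<Dim, number>, profile: ProfileName): boolean {
  return weightedAverageSketch(scores) >= THRESHOLDS[profile];
}

const sample: Record<Dim, number> = {
  complete: 4, specific: 4, correct: 5, actionable: 4, coherent: 4, format: 4,
};
// (25*4 + 20*4 + 20*5 + 15*4 + 10*4 + 10*4) / 100 = 420 / 100 = 4.2
const overall = weightedAverageSketch(sample);
const shipsUnderStrict = clearsThreshold(sample, "strict"); // true, since 4.2 >= 4.0
```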

Beyond the gate

The package also ships standalone scorers that measure the evaluators themselves — built on the same deterministic-first philosophy, with judges injected so the kernels stay pure.

  • Validation contracts (contract.ts) — author named pass/fail assertions at plan time, before any code. manifestOf serialises the contract's intent (ids + descriptions) to a JSON-safe ContractManifest for the transcript; contractSensors feeds the assertions straight into gate({ sensors }); enforceContract runs them standalone and reports the failed assertion ids.

    import { enforceContract, PredicateSensor, type ValidationContract } from "eval-gate";
    
    const contract: ValidationContract = {
      id: "c1",
      task: "build login",
      assertions: [
        { id: "a.has-token", description: "mentions a token",
          sensor: new PredicateSensor("has-token", (w) => w.includes("token"), "no token") },
      ],
    };
    enforceContract(contract, "here is a token").passed; // true
  • Reviewer benchmark (benchmark.ts) — score a bug-finder against a golden set. matchFindings greedily pairs Findings to GoldenBugs (same file, same type, line within lineTolerance, default 2), prf1 turns counts into precision/recall/F1, benchmark does both in one call, and consistency summarises F1 spread across repeated runs (lower spread = more deterministic reviewer).

  • Compression probes (compressionProbe.ts) — does a summary still answer the questions that matter once the source is gone? scoreCompression grades a ProbeSuite with an injected CompressionJudge, aggregating per ProbeClass (recall / artifact / continuation / decision) and overall, so dropping one class stays visible even when the headline pass-rate looks healthy. fakeJudge is a deterministic case-insensitive substring judge for tests and exact-match probes. scoreCompressionAsync is the async mirror for real LLM judges. modelCompressionJudge(backend) builds a production judge from an injected AsyncModelBackend; parseCompressionVerdict extracts a boolean from the model's yes/no response and throws if the text is ambiguous or contains both signals.

  • Comparable scores (scorers.ts) — passRate is the fraction of boolean trials that passed; eloFromPairwise fits deterministic Bradley-Terry strength scores from head-to-head votes (MM iteration; identical votes always yield identical scores).

  • Golden set (goldenSet.ts) — GoldenSet curates high-confidence reference cases; promoteCandidates graduates production traffic into the golden set; tailSample pulls low-scoring items for review.
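To make the reviewer benchmark bullet above concrete, here is a minimal sketch of greedy matching followed by precision/recall/F1. It does not reproduce the package's matchFindings or prf1: the Located shape, the function names, the "first unmatched candidate wins" tie-break, and returning 0 when a denominator is 0 are all assumptions.

```typescript
// Hypothetical shape shared by findings and golden bugs in this sketch.
interface Located {
  file: string;
  type: string;
  line: number;
}

// Greedy pairing: each finding, in order, takes the first still-unmatched golden bug
// with the same file, the same type, and a line within lineTolerance.
function matchGreedy(
  findings: Located[],
  golden: Located[],
  lineTolerance = 2,
): Array<[number, number]> {
  const used = new Set<number>();
  const pairs: Array<[number, number]> = [];
  findings.forEach((f, fi) => {
    const gi = golden.findIndex(
      (g, i) =>
        !used.has(i) &&
        g.file === f.file &&
        g.type === f.type &&
        Math.abs(g.line - f.line) <= lineTolerance,
    );
    if (gi !== -1) {
      used.add(gi);
      pairs.push([fi, gi]);
    }
  });
  return pairs;
}

// Precision, recall and F1 from raw counts; empty denominators yield 0 (assumed).
function prf1Sketch(tp: number, fp: number, fn: number) {
  const precision = tp + fp === 0 ? 0 : tp / (tp + fp);
  const recall = tp + fn === 0 ? 0 : tp / (tp + fn);
  const f1 = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}

const findings: Located[] = [
  { file: "src/auth.ts", type: "null-deref", line: 10 },
  { file: "src/auth.ts", type: "null-deref", line: 40 },
  { file: "src/db.ts", type: "sql-injection", line: 7 },
];
const goldenBugs: Located[] = [
  { file: "src/auth.ts", type: "null-deref", line: 12 },
  { file: "src/db.ts", type: "sql-injection", line: 8 },
  { file: "src/db.ts", type: "race", line: 30 },
  { file: "src/api.ts", type: "off-by-one", line: 3 },
];

const pairs = matchGreedy(findings, goldenBugs); // two pairs: finding 0 with bug 0, finding 2 with bug 1
const tp = pairs.length;
const report = prf1Sketch(tp, findings.length - tp, goldenBugs.length - tp);
// precision = 2/3, recall = 2/4, f1 = 4/7
```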

Async real paths

Both async seams follow the same contract: the caller injects an AsyncModelBackend; this package only depends on the contract type, never on a model SDK.

AsyncEvaluator — LLM jury over HTTP

import { modelEvaluator, FakeAsyncEvaluator } from "eval-gate";
import type { AsyncModelBackend } from "cube-runtime";

// In tests — no API keys, deterministic:
const fake = new FakeAsyncEvaluator({ complete: 4, specific: 4, correct: 5,
                                      actionable: 4, coherent: 4, format: 4 });
const scores = await fake.score("some output");

// In production — wire any AsyncModelBackend:
declare const backend: AsyncModelBackend;
const evaluator = modelEvaluator(backend, { name: "claude-judge", family: "anthropic" });
const scores2 = await evaluator.score("some output");
// scores2 -> DimensionScores, one number per rubric dimension

buildEvalPrompt(work) constructs the scoring prompt (JSON-only, 1–5 per dimension). parseEvalScores(text) parses the model response into DimensionScores and throws if:

  • no JSON object is found in the response,
  • the JSON is malformed,
  • any rubric dimension key is missing or non-numeric.

A model that ignored the scoring contract is a real failure, not a silent pass.
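The three failure modes listed above can be illustrated with a small parser. This is a sketch under assumptions, not the source of parseEvalScores: parseScoresSketch is a hypothetical name, the error messages are invented, and locating the object by the first "{" and the last "}" is one possible extraction strategy.

```typescript
// The six rubric dimensions named in the Rubric section.
const DIMENSIONS = ["complete", "specific", "correct", "actionable", "coherent", "format"] as const;
type Dimension = (typeof DIMENSIONS)[number];

function parseScoresSketch(text: string): Record<Dimension, number> {
  // 1. Locate a JSON object, tolerating prose before and after it.
  const start = text.indexOf("{");
  const end = text.lastIndexOf("}");
  if (start === -1 || end <= start) {
    throw new Error("no JSON object found in response");
  }

  // 2. Parse it; malformed JSON is an error, not a silent pass.
  let parsed: unknown;
  try {
    parsed = JSON.parse(text.slice(start, end + 1));
  } catch {
    throw new Error("malformed JSON in response");
  }

  // 3. Require a finite number for every rubric dimension.
  const obj = parsed as Record<string, unknown>;
  const scores = {} as Record<Dimension, number>;
  for (const dim of DIMENSIONS) {
    const value = obj[dim];
    if (typeof value !== "number" || !Number.isFinite(value)) {
      throw new Error(`missing or non-numeric dimension: ${dim}`);
    }
    scores[dim] = value;
  }
  return scores;
}

const reply =
  'Scores: {"complete": 4, "specific": 4, "correct": 5, "actionable": 4, "coherent": 4, "format": 4}';
const parsedScores = parseScoresSketch(reply);
// parsedScores.correct -> 5
```

Throwing on every deviation keeps a malformed judge reply from being mistaken for a passing score.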

modelEvaluator(backend, opts?) returns an AsyncEvaluator with optional name, family, and context overrides.

AsyncCompressionJudge — yes/no probe grading over HTTP

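import { modelCompressionJudge, scoreCompressionAsync, parseCompressionVerdict, type ProbeSuite } from "eval-gate";
import type { AsyncModelBackend } from "cube-runtime";

declare const backend: AsyncModelBackend;
declare const summary: string;   // the compressed text under test (string type is an assumption)
declare const suite: ProbeSuite; // the probes to grade the summary against
const judge = modelCompressionJudge(backend);

const result = await scoreCompressionAsync(summary, suite, judge);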
// result.passRate    -> overall fraction of probes that passed
// result.byClass     -> { recall, artifact, continuation, decision } ClassTally each

parseCompressionVerdict(text) accepts yes/true/no/false (case-insensitive, anywhere in the text) and throws if the response is ambiguous (both signals present) or contains neither.
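A minimal sketch of this yes/no extraction is shown below. It is an illustration, not the package's parseCompressionVerdict: parseVerdictSketch is a hypothetical name, and matching on word boundaries (so that "know" or "not" does not count as "no") is an assumption about how the signals are detected.

```typescript
// Returns true for a yes/true reply and false for a no/false reply.
// Throws when both kinds of signal appear, or when neither does.
function parseVerdictSketch(text: string): boolean {
  const saysYes = /\b(yes|true)\b/i.test(text);
  const saysNo = /\b(no|false)\b/i.test(text);
  if (saysYes && saysNo) {
    throw new Error("ambiguous verdict: both yes and no signals present");
  }
  if (!saysYes && !saysNo) {
    throw new Error("no verdict found in response");
  }
  return saysYes;
}

const a = parseVerdictSketch("Yes, the summary still names the file."); // true
const b = parseVerdictSketch("The answer is NO.");                      // false
// parseVerdictSketch("yes and no") throws (both signals)
// parseVerdictSketch("maybe")      throws (neither signal)
```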

modelCompressionJudge(backend) returns an AsyncCompressionJudge that builds its prompt with buildCompressionJudgePrompt (internal) and parses the reply with parseCompressionVerdict.

Public API

Everything is re-exported from eval-gate (src/index.ts):

  • Gate: gate, GateArgs, GateResult, ShadowVerdict, Decision
  • Rubric: RUBRIC, PROFILE_THRESHOLDS, thresholdOf, weightedAverage, Dimension, DimensionScores, Profile
  • Evaluators (sync): Jury, FakeEvaluator, Evaluator
  • Evaluators (async): AsyncEvaluator, FakeAsyncEvaluator, modelEvaluator, buildEvalPrompt, parseEvalScores
  • Sensors: PredicateSensor, runSensors, allPassed, DeterministicSensor, SensorResult
  • Trust: TRUST_TIER, VerdictSource
  • Verdict rules: HardFailRule, SourcesRequiredRule, firstVerdict, Verdict, VerdictRule
  • Slop: detectSlop, SlopResult
  • Contradiction: detectContradictions, sameKeyConflict, Assertion, Contradiction, ContradictionRule
  • Golden set: GoldenSet, promoteCandidates, tailSample, GoldenCase, ProdCase
  • Validation contracts: enforceContract, contractSensors, manifestOf, ValidationContract, ContractAssertion, ContractManifest, ContractResult
  • Benchmark: benchmark, consistency, matchFindings, prf1, Finding, GoldenBug, MatchOptions
  • Compression probes: scoreCompression, scoreCompressionAsync, modelCompressionJudge, parseCompressionVerdict, fakeJudge, PROBE_CLASSES, Probe, ProbeSuite, ProbeClass, ClassTally, CompressionJudge, AsyncCompressionJudge
  • Scorers: eloFromPairwise, passRate, Pairwise, EloOptions

Dev: bun install, bun test, bun run typecheck.

Part of the Vollko cube platform — a polyrepo of composable, zero-dependency building blocks for AI-native organizations. See reference-constellation for the whole stack running end-to-end; packages depend on each other via file:../, so clone the siblings alongside this one.


Apache-2.0 © 2026 Vlad Bordei bordeivlad@gmail.com · https://github.com/p-vbordei
