multivon-ai/eval-action

Engine: multivon-eval · Docs · Apache 2.0

Run a multivon-eval suite on every pull request. Posts a PR comment with Wilson confidence intervals, McNemar p-values, cost in dollars, and an opinionated gate verdict.

The comment looks like this:

## multivon-eval — ⚠️ FIX_THEN_MERGE
_toxicity regression (−18.0pp, p=0.012); cost 1.7× baseline_

Pass rate 85.4% · Cost $0.0345 · Cases 50 · Runs/case 3 · Δ vs baseline −4.6pp · Cost Δ 1.7×

### Per-evaluator
| Evaluator      | Pass rate (95% CI)   | Baseline | Δ        | p (McNemar) | Verdict        |
|---             |---                   |---       |---       |---          |---             |
| `faithfulness` | 90.0% [0.78–0.96]    | 95.0%    | −5.0pp   | 0.31        | noise          |
| `toxicity`     | 50.0% [0.32–0.68]    | 100.0%   | −50.0pp  | 0.001       | 🔻 regression  |
| `pii_detection`| 100.0% [0.93–1.00]   | 100.0%   | 0.0pp    | —           | ≈ unchanged    |

🔒 **Lock:** lock OK

Why use this

PR comment is where engineers actually read CI eval results. Replacing "the suite passed" with "the suite passed with these intervals" turns eval from a vibes check into a statistical gate. Specifically:

Wilson 95% CI on every per-evaluator pass rate so you can tell noise from signal at small n.
McNemar paired test on the verdict deltas — flags only the evaluators whose change is statistically real.
Cost report. Every comment shows the dollar cost of the run. Procurement won't approve a CI tool whose spend it can't predict.
Lockfile check. Verifies suite.lock hasn't drifted (catches silent prompt changes).
Default gate ladder. PASS / FIX_THEN_MERGE / NEEDS_REWORK with safety-class regressions escalating automatically.

Quick start

# .github/workflows/eval.yml
on:
  pull_request:
    paths: [src/**, evals/**]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: multivon-ai/eval-action@v1
        with:
          suite: evals/production.py
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

evals/production.py:

from multivon_eval import EvalSuite, EvalCase
from multivon_eval.evaluators.llm_judge import Faithfulness, Hallucination

def build_suite() -> EvalSuite:
    suite = EvalSuite.eu_ai_act_high_risk(jurisdiction="gdpr")
    suite.add_cases([
        EvalCase(input="Summarize this contract.", context=open("evals/contract.txt").read()),
        # …
    ])
    return suite

Inputs

Input	Default	Description
`suite`	(required)	Path to a Python file that exposes `suite` or `build_suite()`.
`baseline`	base branch of the PR	Git ref to diff against.
`fail-on`	`PR_NEEDS_REWORK`	When to exit non-zero. One of `NEVER`, `PR_NEEDS_REWORK`, `PR_FIX_THEN_MERGE`, `ANY_REGRESSION`.
`runs-per-case`	`3`	Multi-run flakiness detection. Higher = more confidence, more cost.
`workers`	`4`	Concurrent cases.
`evaluator-concurrency`	unbounded	Concurrent evaluators per case.
`comment-mode`	`replace`	`replace` (rewrite our previous comment), `append`, or `off`.
`gate-policy`	(none)	Path to a YAML policy file overriding the default gate rules.
`lockfile`	(none)	Path to a saved `suite.lock`. If set, drift causes a warning in the comment.
`github-token`	`${{ github.token }}`	Token with PR `comments: write` permission.

Outputs

Output	Example
`gate`	`PASS`, `FIX_THEN_MERGE`, `NEEDS_REWORK`
`pass_rate`	`0.854`
`cost_usd`	`0.0345`
`comment_url`	`https://github.com/…/issues/1284#issuecomment-...`

Gate policy

Defaults handle 90% of cases. Override per repo with:

# .multivon/gate-policy.yaml
gates:
  - rule: "regression"
    on_fail: FIX_THEN_MERGE
  - rule: "cost_delta_x > 2.0"
    on_fail: FIX_THEN_MERGE

Pass gate-policy: .multivon/gate-policy.yaml to the Action.

How the verdict is computed

Default rules, applied top-down:

Condition	Verdict
Any evaluator with `safety`/`toxicity`/`bias`/`pii`/`hallucination` in its name regresses with p<0.05	`NEEDS_REWORK`
Any evaluator regresses with p<0.05 (CIs don't overlap)	`FIX_THEN_MERGE`
Overall pass rate dropped >5pp	`FIX_THEN_MERGE`
Cost > 2× baseline	`FIX_THEN_MERGE`
Otherwise	`PASS`

Verdict → exit code is driven by the fail-on input.

What it doesn't do

Replaces neither pytest nor your existing CI. The Action is a layer on top — pair it with pytest -q in another job.
Does not check out arbitrary refs. Uses git worktree against whatever refs your checkout step staged.
Does not handle PRs from forks that lack secrets.OPENAI_API_KEY. Standard GitHub limitation; document a workflow_run pattern for fork support.

Cost expectations

For a 50-case suite × 3 runs × 5 sub-calls/case on gpt-4o-mini: roughly $0.03 per PR at default settings, and the comment shows the actual number every time.

Pair with multivon-eval's built-in judge cache (JudgeConfig(cache=True)) to amortize across repeated PRs on the same baseline.

The Multivon ecosystem

Five public + one early-access package, all built on a shared evaluation engine:

Repo	What it is
multivon-eval	Python SDK — the engine eval-action runs on every PR
pdfhell	Adversarial PDFs — also emits JUnit output, also gates merges
multivon-mcp	MCP server — call the same evals from inside Claude / Cursor / Cline
eval-action (you are here)	GitHub Action wrapper
eval-framework-benchmark	Reproducible head-to-head benchmark vs DeepEval + RAGAS
multivon-guard (early access)	Local proxy that catches LLM coding agents leaking secrets / PII

License

Apache 2.0.

Maintained by Multivon. Issues + PRs welcome.

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
.github		.github
examples		examples
src		src
tests		tests
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
action.yml		action.yml
entrypoint.sh		entrypoint.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

multivon-ai/eval-action

Why use this

Quick start

Inputs

Outputs

Gate policy

How the verdict is computed

What it doesn't do

Cost expectations

The Multivon ecosystem

License

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

multivon-ai/eval-action

Why use this

Quick start

Inputs

Outputs

Gate policy

How the verdict is computed

What it doesn't do

Cost expectations

The Multivon ecosystem

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages