The only eval package native to the official Laravel AI stack.
Status: Early development — pre-1.0, API unstable.
Modern Laravel apps increasingly ship AI agents, prompts, and MCP tools straight
to production. The official laravel/ai SDK makes building them easy, but the
feedback loop for evaluating them has lived outside the framework in
language-agnostic tools like Promptfoo. Proofread brings that loop home: a
Laravel-native way to measure whether your agents actually do what you think
they do, from Pest, from CI, and from production traffic.
Three things make Proofread different:
- Agent classes are first-class eval subjects. Point an eval at an FQCN, an instance, or a callable — `SubjectResolver` normalizes all of them, and `laravel/ai` fakes plug in automatically.
- Pest-native expectations. Write evals as `expect($agent)->toPassEval(...)` instead of learning a new YAML DSL. Rubrics, JSON schemas, cost ceilings, and golden snapshots are all `expect()` extensions.
- Shadow evals in production. Capture real traffic via middleware, evaluate it asynchronously against registered assertions, alert on pass-rate drops, and promote failing captures into regression datasets.
```shell
composer require mosaiqo/proofread
```

Requires PHP 8.4 and Laravel 13.x.

Optional MCP integration (expose eval tools to MCP-compatible editors and assistants):

```shell
composer require laravel/mcp
```

Publish the config and migrations:

```shell
php artisan vendor:publish --tag=proofread-config
php artisan vendor:publish --tag=proofread-migrations
php artisan migrate
```

```php
namespace App\Agents;

use Laravel\Ai\Contracts\Agent;
use Laravel\Ai\Promptable;
use Stringable;

class SentimentAgent implements Agent
{
    use Promptable;

    public function instructions(): Stringable|string
    {
        return <<<'PROMPT'
        You are a sentiment classifier.
        Classify the user message as exactly one of: positive, negative, neutral.
        Respond with a single lowercase word. No punctuation, no explanation.
        PROMPT;
    }
}
```

```php
use App\Agents\SentimentAgent;
use Mosaiqo\Proofread\Assertions\CostLimit;
use Mosaiqo\Proofread\Assertions\RegexAssertion;
use Mosaiqo\Proofread\Assertions\Rubric;
use Mosaiqo\Proofread\Support\Dataset;

it('classifies sentiment reliably and cheaply', function (): void {
    $dataset = Dataset::make('sentiment', [
        ['input' => 'I love this product!', 'expected' => 'positive'],
        ['input' => 'This is terrible.', 'expected' => 'negative'],
        ['input' => 'It works as described.', 'expected' => 'neutral'],
    ]);

    expect(SentimentAgent::class)->toPassEval($dataset, [
        RegexAssertion::make('/^(positive|negative|neutral)$/'),
        Rubric::make('response is a single lowercase sentiment label'),
        CostLimit::under(0.01),
    ]);
});
```

```php
namespace App\Evals;

use App\Agents\SentimentAgent;
use Mosaiqo\Proofread\Assertions\RegexAssertion;
use Mosaiqo\Proofread\Suite\EvalSuite;
use Mosaiqo\Proofread\Support\Dataset;

final class SentimentSuite extends EvalSuite
{
    public function dataset(): Dataset
    {
        return Dataset::make('sentiment', [
            ['input' => 'I love this product!', 'expected' => 'positive'],
            ['input' => 'This is terrible.', 'expected' => 'negative'],
        ]);
    }

    public function subject(): mixed
    {
        return SentimentAgent::class;
    }

    public function assertions(): array
    {
        return [
            RegexAssertion::make('/^(positive|negative|neutral)$/'),
        ];
    }
}
```

```shell
php artisan evals:run "App\\Evals\\SentimentSuite"
```

A suite bundles a dataset, a subject, and a list of assertions. Extend `Mosaiqo\Proofread\Suite\EvalSuite` and implement three methods:

```php
abstract public function dataset(): Dataset;
abstract public function subject(): mixed;
abstract public function assertions(): array;
```

See src/Suite/EvalSuite.php for the full contract.
Suites can optionally override `setUp()` and `tearDown()` for
database-dependent setup, and `assertionsFor(array $case)` to vary
assertions per case based on metadata. Both `toPassSuite()` in Pest
and the `evals:run` Artisan command drive the full lifecycle
automatically.
Three shapes are accepted:

- Callable — a closure or any callable; receives `(mixed $input, array $case)`.
- Agent FQCN — a class-string implementing `Laravel\Ai\Contracts\Agent`, resolved from the container on each case.
- Agent instance — useful when you want to pre-configure the agent.

`Mosaiqo\Proofread\Runner\SubjectResolver` normalizes all three into a uniform
closure and captures token usage, model, provider, latency, and derived cost
into assertion metadata.
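As a mental model, the normalization can be pictured in a few lines of plain PHP. This is a simplified, hypothetical sketch, not the package's `SubjectResolver`: the real resolver pulls agents from the container, invokes the agent's prompt pipeline, and records usage metadata, all of which is elided here.

```php
<?php

// Hypothetical sketch of subject normalization: an FQCN becomes an
// instance, and every accepted shape collapses into one uniform closure.
function normalizeSubject(mixed $subject): Closure
{
    // Class-string: instantiate it (the real resolver uses the container
    // and drives the agent's prompt pipeline rather than __invoke).
    if (is_string($subject) && class_exists($subject)) {
        $subject = new $subject();
    }

    if (is_callable($subject)) {
        return fn (mixed $input, array $case): mixed => $subject($input, $case);
    }

    throw new InvalidArgumentException('Unsupported subject shape.');
}

$uppercase = normalizeSubject(static fn (mixed $input): string => strtoupper($input));

echo $uppercase('hello', []); // HELLO
```

The payoff of this shape is that datasets, assertions, and the runner never need to know which of the three forms the test author handed in.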
| Assertion | Kind | Example |
|---|---|---|
| `ContainsAssertion` | deterministic | `ContainsAssertion::make('positive')` |
| `RegexAssertion` | deterministic | `RegexAssertion::make('/^\d+$/')` |
| `LengthAssertion` | deterministic | `LengthAssertion::between(5, 200)` |
| `CountAssertion` | deterministic | `CountAssertion::between(1, 10)` |
| `JsonSchemaAssertion` | deterministic | `JsonSchemaAssertion::fromAgent(MyAgent::class)` |
| `TokenBudget` | operational | `TokenBudget::maxTotal(1000)` |
| `CostLimit` | operational | `CostLimit::under(0.01)` |
| `LatencyLimit` | operational | `LatencyLimit::under(3000)` |
| `Rubric` | semantic (LLM-as-judge) | `Rubric::make('polite and concise')` |
| `Similar` | semantic (embeddings) | `Similar::to('reference text')->minScore(0.8)` |
| `HallucinationAssertion` | semantic (LLM-as-judge) | `HallucinationAssertion::against($groundTruth)` |
| `LanguageAssertion` | semantic (LLM-as-judge) | `LanguageAssertion::matches('en')` |
| `Trajectory` | trajectory | `Trajectory::callsTool('search')` |
| `GoldenSnapshot` | snapshot | `GoldenSnapshot::fromContext()` |
| `StructuredOutputAssertion` | structured | `StructuredOutputAssertion::conformsTo(MyAgent::class)` |
| `PiiLeakageAssertion` | safety | `PiiLeakageAssertion::make()` |
All assertions live under `Mosaiqo\Proofread\Assertions\`. Each one returns an
`AssertionResult` with a `passed` bool, a human-readable reason, an optional
numeric score, and arbitrary metadata.
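To make that shape concrete, here is a minimal deterministic assertion written against a hypothetical stand-in for `AssertionResult`. The value object below is inferred from the README's description, not copied from the package; check src/Assertions/ for the actual contract before copying.

```php
<?php

// Hypothetical stand-in for the package's AssertionResult value object,
// with the four fields the README describes.
final readonly class AssertionResult
{
    public function __construct(
        public bool $passed,
        public string $reason,
        public ?float $score = null,
        public array $metadata = [],
    ) {}
}

// A tiny deterministic assertion: does the output end with a period?
final class EndsWithPeriod
{
    public function assert(string $output): AssertionResult
    {
        $passed = str_ends_with(rtrim($output), '.');

        return new AssertionResult(
            passed: $passed,
            reason: $passed
                ? 'Output ends with a period.'
                : 'Output is missing a terminal period.',
        );
    }
}

$result = (new EndsWithPeriod)->assert('All good.');
var_dump($result->passed); // bool(true)
```

A `proofread:make-assertion` scaffold (see the CLI table below) generates the real boilerplate.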
| Expectation | Subject | Usage |
|---|---|---|
| `toPassAssertion` | any output | `expect($output)->toPassAssertion(ContainsAssertion::make('x'))` |
| `toPassEval` | callable / Agent | `expect($agent)->toPassEval($dataset, $assertions)` |
| `toPassSuite` | `EvalSuite` | `expect($suite)->toPassSuite()` |
| `toPassRubric` | string output | `expect($output)->toPassRubric('polite tone')` |
| `toMatchSchema` | JSON output | `expect($json)->toMatchSchema($schemaArray)` |
| `toCostUnder` | `EvalRun` | `expect($run)->toCostUnder(0.05)` |
| `toMatchGoldenSnapshot` | any output | `expect($output)->toMatchGoldenSnapshot()` |
Both `toPassEval` and `toPassSuite` assign the resulting `EvalRun`
to the expectation's `->value` after the assertion passes, so
callers can chain post-run inspection:

```php
$run = expect($agent)->toPassEval($dataset, $assertions)->value;

expect($run->total_cost_usd)->toBeLessThan(0.05);
expect($run->failures())->toBeEmpty();
```

Expectations are loaded via `Mosaiqo\Proofread\Testing\expectations.php`; wire
it up from your tests/Pest.php:

```php
require_once __DIR__.'/../vendor/mosaiqo/proofread/src/Testing/expectations.php';
```

A bundled PHPStan extension teaches static analysis about these dynamic expectations — no stub files to maintain.
| Command | Purpose |
|---|---|
| `evals:run {suites*}` | Run one or more `EvalSuite` classes. Flags: `--persist`, `--fail-fast`, `--filter`, `--junit`, `--queue`, `--commit-sha`, `--fake-judge`, `--concurrency`, `--gate-pass-rate`, `--gate-cost-max`. |
| `evals:benchmark {suite}` | Run a suite N times and report pass-rate variance, duration percentiles, cost, and per-case flakiness. Flags: `--iterations`, `--concurrency`, `--fake-judge`, `--flakiness-threshold`, `--format`. |
| `evals:compare {base} {head}` | Structured diff between two persisted runs. Flags: `--format=table\|json\|markdown`, `--only-regressions`, `--max-cases`, `--output`. |
| `evals:dataset:diff {dataset}` | Compare two versions of a dataset. Accepts `--base`, `--head`, `--format`. |
| `evals:providers {suite}` | Run a `MultiSubjectEvalSuite` and render a matrix of cases × subjects. Flags: `--persist`, `--commit-sha`, `--concurrency`, `--provider-concurrency`, `--fake-judge`, `--format`. |
| `evals:export {id}` | Export a persisted run or comparison as self-contained Markdown or HTML. `id` accepts a ULID, a commit SHA prefix, or `latest` (resolves to the most recent run). Use `--type=comparison` to target comparisons. Flags: `--format`, `--output`, `--type=run\|comparison`. |
| `evals:cluster` | Cluster failures by embedding similarity. |
| `evals:cost-simulate {agent}` | Project cost against alternative models using shadow capture data. Flags: `--days`, `--model`, `--format`. |
| `evals:coverage {agent} {dataset}` | Analyze dataset coverage against shadow captures using embeddings. Flags: `--days`, `--threshold`, `--max-captures`, `--embedding-model`, `--format`. |
| `proofread:lint {agents*}` | Static analysis of Agent `instructions()`. Flags: `--format`, `--severity`, `--with-judge`. |
| `shadow:evaluate` | Evaluate captured shadow traffic against registered assertions. |
| `shadow:alert` | Check pass-rate alerts against thresholds. |
| `dataset:generate` | Generate synthetic cases from a schema via an LLM. |
| `dataset:import {file}` | Import a CSV or JSON file into a PHP dataset file. Flags: `--name`, `--output`, `--force`. |
| `dataset:export {dataset}` | Export a persisted dataset version as CSV or JSON. Flags: `--format`, `--output`, `--dataset-version`. |
| `proofread:make-suite {name}` | Scaffold a new `EvalSuite`. Flag: `--multi`. |
| `proofread:make-assertion {name}` | Scaffold a new Assertion class. |
| `proofread:make-dataset {name}` | Scaffold a dataset PHP file. Flag: `--path`. |
Each command supports `--help` for the full flag list.

`EvalRunner::runSuite($suite, concurrency: 5)` executes up to 5 cases
in parallel via Laravel's process-based concurrency driver. The
`evals:run --concurrency=N` flag exposes the same capability from the
CLI.

Concurrency is beneficial for I/O-bound subjects (LLM and HTTP calls)
where parallel wait dominates. For deterministic in-memory subjects
the per-task overhead of child-process spawning and closure
serialization makes sequential execution faster — leave concurrency
at its default of 1 in that case.

Subjects must be serializable. Agent class-string FQCNs and static
closures serialize cleanly. Ad-hoc closures that capture test-local
state by reference (`use (&$x)`) cannot cross the process boundary
and should only be used with `concurrency: 1`.

Note: `concurrency > 1` is unsafe for subjects that write to SQLite. SQLite serializes writers, so parallel SQLite writes will hit "database is locked" errors. Use concurrency for LLM and HTTP subjects, or subjects that only read from the database.
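To illustrate the serialization distinction, here is a subject that can cross the process boundary next to one that cannot. This is plain PHP for illustration, not package code:

```php
<?php

// Safe under --concurrency > 1: a static closure with no captured
// references, so it can be serialized and shipped to a child process.
$safeSubject = static fn (mixed $input, array $case): string => strtolower((string) $input);

// Unsafe under concurrency: captures test-local state by reference.
// Run subjects like this only with concurrency: 1.
$calls = 0;
$unsafeSubject = function (mixed $input, array $case) use (&$calls): string {
    $calls++;

    return strtolower((string) $input);
};

echo $safeSubject('HELLO', []); // hello
```

Agent FQCNs sidestep the problem entirely, since a class-string is trivially serializable and the child process re-resolves the agent itself.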
Shadow evals let you measure agent quality on real production traffic without affecting the user-facing response path.
The flow is:

- Apply `Mosaiqo\Proofread\Shadow\EvalShadowMiddleware` to routes that call your agent. It samples a configurable fraction of requests, sanitizes PII, and persists a `ShadowCapture` row.
- A queued job (or `shadow:evaluate` on a schedule) runs the registered assertions for that agent over new captures and records a `ShadowEval`.
- `shadow:alert` checks the pass rate over a sliding window and dispatches a notification (mail, Slack webhook, etc.) when it drops below the configured threshold.
- From the dashboard you can promote any failing capture straight into a regression dataset.
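Conceptually, the sampling step in the middleware amounts to a per-request probability check. This is an illustrative sketch, not the middleware's actual code:

```php
<?php

// Sketch of per-request sampling: capture roughly $sampleRate of traffic.
// A rate of 0.1 means about one in ten requests is persisted.
function shouldCapture(float $sampleRate): bool
{
    return mt_rand() / mt_getrandmax() < $sampleRate;
}

// Boundary behaviour: a rate of 0.0 never captures.
var_dump(shouldCapture(0.0)); // bool(false)
```

Because the capture decision is made before any queue work is dispatched, requests that are not sampled pay essentially no overhead.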
Register the assertions you want to evaluate per agent in a service provider:

```php
use App\Agents\SentimentAgent;
use Mosaiqo\Proofread\Assertions\RegexAssertion;
use Mosaiqo\Proofread\Shadow\ShadowAssertionsRegistry;

public function boot(ShadowAssertionsRegistry $registry): void
{
    $registry->register(SentimentAgent::class, [
        RegexAssertion::make('/^(positive|negative|neutral)$/'),
    ]);
}
```

And enable shadow capture in config/proofread.php:

```php
'shadow' => [
    'enabled' => true,
    'sample_rate' => 0.1,
    'queue' => 'default',
    'alerts' => [
        'enabled' => true,
        'pass_rate_threshold' => 0.85,
        'window' => '1h',
        'channels' => ['mail', 'slack'],
    ],
],
```

See src/Shadow/ for the full surface.
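The alert decision reduces to comparing a windowed pass rate against `pass_rate_threshold`. A plain-PHP sketch of that logic (the real `shadow:alert` command reads persisted `ShadowEval` rows for the configured window):

```php
<?php

// Sketch: given pass/fail outcomes inside the sliding window,
// decide whether an alert should fire.
function shouldAlert(array $outcomes, float $threshold): bool
{
    if ($outcomes === []) {
        return false; // no traffic in the window, nothing to alert on
    }

    $passRate = count(array_filter($outcomes)) / count($outcomes);

    return $passRate < $threshold;
}

// 8 passes out of 10 is a 0.8 pass rate, below the 0.85 threshold.
$window = [true, true, true, true, true, true, true, true, false, false];
var_dump(shouldAlert($window, 0.85)); // bool(true)
```

The empty-window guard matters: a quiet hour should not page anyone.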
Compare the same dataset against multiple subjects — typically
different models, providers, or prompt variations — in a single
invocation. Extend `MultiSubjectEvalSuite` and declare
`subjects(): array<string, mixed>`:

```php
final class SentimentMatrixSuite extends MultiSubjectEvalSuite
{
    public function name(): string
    {
        return 'sentiment-matrix';
    }

    public function dataset(): Dataset
    {
        return Dataset::make('sentiment', [...]);
    }

    public function subjects(): array
    {
        return [
            'haiku' => SentimentAgentHaiku::class,
            'sonnet' => SentimentAgentSonnet::class,
            'opus' => SentimentAgentOpus::class,
        ];
    }

    public function assertions(): array
    {
        return [Rubric::make('Must classify correctly.')->minScore(0.8)];
    }
}
```

Run it:

```shell
php artisan evals:providers App\\Evals\\SentimentMatrixSuite --persist --provider-concurrency=3
```

Output is a matrix of cases × subjects with pass/fail status and
per-subject aggregate stats. The persisted comparison is browsable
at /evals/comparisons/{id} in the dashboard, exportable via
`evals:export {id}`, and queryable via the `EvalComparison` model.

Helper methods on `EvalComparison`: `bestByPassRate()`, `cheapest()`,
`fastest()`. No opinionated overall winner — the three axes are
surfaced separately.
A Livewire-powered dashboard ships with the package at `/evals` (configurable).

Routes:

- `/evals/overview` — home with trend chart and recent regressions
- `/evals/runs` — run history with filters and stats
- `/evals/runs/{ulid}` — per-case drill-down
- `/evals/runs/{ulid}/export?format=md|html` — download the run as a self-contained Markdown or HTML document
- `/evals/datasets` — dataset explorer with sparklines
- `/evals/compare` — side-by-side diff between two runs
- `/evals/comparisons` — multi-provider comparison history with filters and stats
- `/evals/comparisons/{id}` — matrix detail with winner cards and drill-down drawer
- `/evals/comparisons/{id}/export?format=md|html` — download the comparison as a self-contained Markdown or HTML document
- `/evals/costs` — cost breakdown by model and dataset
- `/evals/shadow` — captures, evals, and promote-to-dataset workflow
Access is controlled by the `viewEvals` Gate. The default definition allows
the local environment only; override it in your `AppServiceProvider`:

```php
use Illuminate\Support\Facades\Gate;

public function boot(): void
{
    Gate::define('viewEvals', fn ($user) => $user?->isAdmin() === true);
}
```

Every persisted run is linked to an `EvalDatasetVersion` snapshot
capturing the exact cases that were evaluated. When a dataset's
checksum changes between runs, Proofread automatically records a
new version without losing the old one. Use `evals:dataset:diff` to
see what changed between any two versions.
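One way to picture the checksum mechanism is hashing the canonical encoding of the cases, so any change yields a new version. This is a sketch of the idea; the package's actual canonicalization details may differ:

```php
<?php

// Sketch: version a dataset by hashing its JSON encoding. Any change
// to a case (input, expected value, ordering) produces a new checksum,
// which triggers a new EvalDatasetVersion snapshot.
function datasetChecksum(array $cases): string
{
    return hash('sha256', json_encode($cases, JSON_THROW_ON_ERROR));
}

$v1 = datasetChecksum([['input' => 'I love this product!', 'expected' => 'positive']]);
$v2 = datasetChecksum([['input' => 'I love this product!', 'expected' => 'negative']]);

var_dump($v1 === $v2); // bool(false)
```

The old snapshot is never overwritten, which is what makes `evals:dataset:diff` possible after the fact.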
Proofread ships approximate pricing for common models (Claude 4.x, GPT-4o,
o1 series, Gemini 1.5, OpenAI embeddings). Pricing covers input, output,
cache reads, cache writes, and reasoning tokens. `CostLimit`, the dashboard
cost view, and shadow capture cost tracking all work out of the box for
supported models.

Override or extend the table in config/proofread.php:

```php
'pricing' => [
    'models' => [
        'claude-sonnet-4-6' => [
            'input_per_1m' => 3.00,
            'output_per_1m' => 15.00,
            'cache_read_per_1m' => 0.30,
            'cache_write_per_1m' => 3.75,
        ],
        // ...
    ],
],
```

Any model absent from the table reports a null cost — `CostLimit` fails
closed on missing data rather than silently passing.
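The derivation itself is simple arithmetic over the per-1M rates. The sketch below mirrors the config shape above, but the function name and the fail-closed return are illustrative, not the package's internals:

```php
<?php

// Sketch: cost = tokens × rate ÷ 1,000,000, summed per token class
// (cache and reasoning tokens omitted for brevity). Returns null when
// the model is missing, so a CostLimit can fail closed.
function deriveCostUsd(array $models, string $model, int $inputTokens, int $outputTokens): ?float
{
    $rates = $models[$model] ?? null;

    if ($rates === null) {
        return null; // unknown model: no silent zero-cost pass
    }

    return $inputTokens * $rates['input_per_1m'] / 1_000_000
        + $outputTokens * $rates['output_per_1m'] / 1_000_000;
}

$models = ['claude-sonnet-4-6' => ['input_per_1m' => 3.00, 'output_per_1m' => 15.00]];

// 1,000 input + 500 output tokens: 0.003 + 0.0075 = 0.0105 USD.
echo deriveCostUsd($models, 'claude-sonnet-4-6', 1000, 500);
```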
When `laravel/mcp` is installed, Proofread exposes four tools:

- `list_eval_suites` — list registered `EvalSuite` classes
- `run_eval_suite` — run a suite and return the structured result
- `get_eval_run_diff` — diff two persisted runs by ULID
- `run_provider_comparison` — run a `MultiSubjectEvalSuite` and return per-subject aggregate stats

Register them from your own Server subclass:

```php
use Laravel\Mcp\Server;
use Mosaiqo\Proofread\Mcp\McpIntegration;

class ProofreadMcpServer extends Server
{
    protected array $tools = McpIntegration::tools();
}
```

And declare which suites are discoverable via the tool in
config/proofread.php:

```php
'mcp' => [
    'suites' => [
        \App\Evals\SentimentSuite::class,
    ],
],
```

Proofread ships a ready-to-use workflow template. Publish it to your project and customize the suite FQCN:

```shell
php artisan vendor:publish --tag=proofread-workflows
```

The workflow lands at `.github/workflows/proofread.yml` and runs your
suites on every PR and push to main. It uploads JUnit XML as an
artifact and renders a per-case report directly in the PR via
`mikepenz/action-junit-report`.
Required repository secrets if your suites use LLM-backed assertions
(`Rubric`, `Hallucination`, `Similar`, `Language`):

- `ANTHROPIC_API_KEY` — if your judge or agents use Anthropic.
- `OPENAI_API_KEY` — if they use OpenAI.

For deterministic CI runs without real LLM calls, add
`--fake-judge=pass` to the `evals:run` command in the workflow.
`evals:compare --format=markdown` renders the diff as a
PR-friendly Markdown document. Regressions lead, improvements
follow, stable cases collapse into a `<details>` block for
readability. Pair it with an `--output` path and post the result
via `peter-evans/create-or-update-comment` or similar:

```shell
php artisan evals:compare "$BASE_RUN_ID" "$HEAD_RUN_ID" \
    --format=markdown \
    --output=storage/evals/pr-comment.md
```

The published workflow template includes commented scaffolding for this step. Activate it by implementing your own baseline strategy (artifact, shared DB, or branch comparison) to resolve the two run IDs.
If your project has `laravel/telescope` installed, Proofread
automatically records persisted eval runs as Telescope entries.
They appear under Events tagged with `proofread_eval` alongside
your queries, jobs, and requests. Filter by the `proofread_eval`
tag (or by `dataset:...`, `suite:...`, `commit:...`) to inspect
recent runs without leaving your debugging workflow.
Registration is conditional — Proofread checks for Telescope at boot time and wires up the listener only when it is available. No configuration required.
If your project uses `laravel/pulse`, Proofread automatically
records persisted eval runs as Pulse metrics. Each run emits a
`proofread_eval` counter keyed by `dataset::passed|failed`, a
`proofread_eval_duration` gauge (avg + max, in ms), and a
`proofread_eval_cost` sum (in micro-dollars) when cost data is
available. They surface alongside your queries, jobs, and
requests in the Pulse dashboard.

Publish the optional dashboard card:

```shell
php artisan vendor:publish --tag=proofread-pulse
```

Then add it to your Pulse dashboard view
resources/views/vendor/pulse/dashboard.blade.php:

```blade
<x-pulse>
    @include('vendor.pulse.cards.proofread', ['cols' => 4, 'rows' => 2])
</x-pulse>
```

Registration is conditional on Pulse being installed and bound in the container — no action needed if you don't use Pulse.
If your project has `open-telemetry/api` installed (with a
configured `TracerProvider`), Proofread emits a span tree for
every persisted eval run:

- Root span `proofread.eval.run` covers the full run duration.
- A child span `proofread.eval.case` per case, parented to the root.
- Each case span carries its assertion outcomes as OTel events named `assertion`.

Attributes are prefixed with `proofread.*` (`run.id`,
`dataset.name`, `suite.class`, `run.passed`, `case.duration_ms`,
etc.). Spans propagate to whatever exporter the host app has
configured — Jaeger, Grafana Tempo, Honeycomb, an OTLP collector,
and so on.
Registration is conditional on the OpenTelemetry API being
available. Proofread does not require it as a hard dependency;
install `open-telemetry/api` (or the full SDK) in your project and
wire a `TracerProvider` via `OpenTelemetry\API\Globals` to
activate tracing.
If your project uses laravel/boost, publish Proofread's AI
guidelines so Boost-powered editors can generate idiomatic
suites, assertions, and tests:
```shell
php artisan vendor:publish --tag=proofread-boost-guidelines
```

The guidelines land at `.ai/guidelines/proofread.md`. They cover
suite structure, assertion selection, testing patterns, and CLI
workflow. If your Boost setup expects a different path, move the
file after publishing.
Most LLM providers bill per-token via their APIs. If you already pay a flat-rate subscription (Max, Pro, etc.), you may not want to rack up API charges on top of what you are already paying. Proofread lets you evaluate against headless CLI tools that run under your subscription.
```php
use Mosaiqo\Proofread\Cli\Subjects\ClaudeCodeCliSubject;
use Mosaiqo\Proofread\Runner\EvalRunner;

$subject = ClaudeCodeCliSubject::make()
    ->withModel('claude-sonnet-4-6')
    ->withTimeout(60);

$run = (new EvalRunner)->run($subject, $dataset, $assertions);
```

The CLI must be installed and authenticated in the environment
where evals run. `claude --output-format json` is parsed to extract
the response text plus token usage, cost, and model metadata.
`LatencyLimit`, `TokenBudget`, and `CostLimit` assertions work
unchanged — tokens come from the CLI's reported usage, and cost is
whatever the CLI reports (0 for subscription mode).
Subclass `CliSubject` and implement three methods:

```php
use Mosaiqo\Proofread\Cli\CliResponse;
use Mosaiqo\Proofread\Cli\CliSubject;

final class MyCliSubject extends CliSubject
{
    public function binary(): string
    {
        return 'mycli';
    }

    public function args(string $prompt): array
    {
        return ['--prompt', $prompt, '--json'];
    }

    public function parseOutput(string $stdout, string $stderr): CliResponse
    {
        $decoded = json_decode($stdout, true);

        return new CliResponse(
            output: $decoded['response'] ?? '',
            metadata: [
                'tokens_in' => $decoded['tokens']['in'] ?? null,
                'tokens_out' => $decoded['tokens']['out'] ?? null,
                'model' => $decoded['model'] ?? null,
            ],
        );
    }
}
```

Optional overrides: `timeout()`, `workingDirectory()`, `env()`,
`usesStdin()`, `estimateTokens()`.
- Subprocess spawning is heavier than HTTP. Avoid `--concurrency > 1` with CLI subjects.
- CI environments need the CLI pre-installed and authenticated.
- `StructuredOutputAssertion` and `Trajectory` assertions require API-level data that headless CLIs may not expose.
- Maintenance burden: CLI output formats change between versions. Proofread maintains `ClaudeCodeCliSubject`; community implementations for other CLIs are welcome.
Publish the config file:

```shell
php artisan vendor:publish --tag=proofread-config
```

Key sections (see config/proofread.php for the annotated full file):

- `judge` — default model and retries for LLM-as-judge (`Rubric`)
- `similarity` — default embedding model for `Similar` and clustering
- `pricing.models` — per-model token rates (input, output, cache, reasoning)
- `snapshots` — `GoldenSnapshot` storage path and update mode
- `dashboard` — enable/disable, path, middleware stack
- `shadow` — capture sampling, PII sanitization, alerts, queue connection
- `queue` — async eval execution connection and queue
- `webhooks` — regression alerts (Slack, Discord, generic JSON)
- `mcp` — suites exposed through MCP tools
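Embedding-based scoring, as used conceptually by `Similar::to(...)->minScore(0.8)` and the failure-clustering command, conventionally relies on cosine similarity between vectors. A self-contained sketch with made-up two-dimensional vectors (real embedding vectors come from the configured `similarity` model and have hundreds of dimensions):

```php
<?php

// Sketch: cosine similarity between two equal-length embedding vectors.
// 1.0 means identical direction, 0.0 means orthogonal (unrelated).
function cosineSimilarity(array $a, array $b): float
{
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;

    foreach ($a as $i => $value) {
        $dot += $value * $b[$i];
        $normA += $value ** 2;
        $normB += $b[$i] ** 2;
    }

    return $dot / (sqrt($normA) * sqrt($normB));
}

echo cosineSimilarity([1.0, 0.0], [1.0, 0.0]); // 1
```

A `minScore(0.8)` threshold then simply requires the candidate output's embedding to score at least 0.8 against the reference text's embedding.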
```shell
composer test     # Pest v4 against Testbench
composer analyse  # PHPStan
composer format   # Pint
```

All three must pass on every commit to main.
Contributions are welcome. The development workflow is:
- TDD always. Red, green, refactor.
- Work directly on `main` while the package is pre-v1. Every commit must be green (`composer test`, `composer format`, `composer analyse`).
- Commits in English, imperative mood (`Add X`, not `Added X`).
See CLAUDE.md for the full set of conventions.
The MIT License (MIT). See LICENSE for details.