
Success Criteria

nick3 edited this page May 28, 2026 · 1 revision


This page describes how the goal runner decides that a goal is actually done. You pick one of four criterion types when starting a goal (see Starting-a-Goal). The AI calls claim_complete when it thinks the goal is achieved, and the runner then runs the criterion to verify the claim.

If verification passes, the goal ends as completed. If verification fails, the failure detail is injected as a system message ("Verification failed: …") and the loop resumes. The AI sees why it failed and can try again, or call abort_with_report if it's truly stuck.
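The pass/fail branching described above can be sketched as a small pure function. This is only an illustration: `VerificationResult`, `LoopAction`, and `nextAction` are hypothetical names, not the runner's actual API. Only the "Verification failed:" prefix comes from the text.

```typescript
// Hypothetical shapes for this sketch; the runner's real types are not shown on this page.
type VerificationResult =
  | { verified: true }
  | { verified: false; detail: string };

type LoopAction =
  | { kind: "end"; state: "completed" }
  | { kind: "resume"; systemMessage: string };

// Maps a verification result to what the loop does next.
function nextAction(result: VerificationResult): LoopAction {
  if (result.verified) {
    return { kind: "end", state: "completed" };
  }
  // The failure detail is injected so the AI can see why its claim was rejected.
  return { kind: "resume", systemMessage: `Verification failed: ${result.detail}` };
}
```

Modelling the two outcomes as a discriminated union keeps the "resume" branch from being constructed without a message for the AI.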


Shell (most reliable)

{ type: 'shell', command: string, exitCode?: number /* default 0 */ }

The runner spawns command via child_process.spawn with the shell: true option, captures stdout and stderr, and compares the process exit code with exitCode (default 0).
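A minimal sketch of that check, assuming Node.js. For brevity it uses the synchronous `spawnSync` rather than the asynchronous `spawn` named above, and `verifyShell` and `ShellResult` are hypothetical names for illustration, not the runner's own code.

```typescript
import { spawnSync } from "node:child_process";

// Criterion shape as given in the text above.
type ShellCriterion = { type: "shell"; command: string; exitCode?: number };

// Hypothetical result shape for this sketch.
type ShellResult = { verified: boolean; actual: number | null; stdout: string; stderr: string };

// Runs the command through a shell and compares its exit code with the expected one.
// Assumption: spawnSync stands in for the asynchronous spawn the runner is described as using.
function verifyShell(criterion: ShellCriterion): ShellResult {
  const expected = criterion.exitCode ?? 0; // default 0
  const proc = spawnSync(criterion.command, { shell: true, encoding: "utf8" });
  return {
    verified: proc.status === expected,
    actual: proc.status, // null when the process was terminated by a signal
    stdout: proc.stdout ?? "",
    stderr: proc.stderr ?? "",
  };
}
```

Because the command string goes through a shell, pipes and `&&` chains work as written, which is what the examples further down rely on.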

| Pros | Cons |
| --- | --- |
| Deterministic: exit codes don't lie | Requires the verification to be expressible as a shell command |
| Output tail is fed back to the AI on failure so it can self-correct | Local-only: runs on the machine ClusterSpace is on |
| Fast | |

Best for: web servers up (curl), files exist (test -f), tests pass (npm test), commits land (git log -1 --oneline | grep), packages installed (which nginx), services running (systemctl is-active).

Examples:

curl -sf http://localhost:8080/health
npm test
test -f /tmp/report.csv && test -s /tmp/report.csv
docker ps --format '{{.Names}}' | grep -q my-container
git rev-parse HEAD | grep -q '^[0-9a-f]\{40\}$'

When the shell exits non-zero, the runner surfaces the last 5 lines of stderr (or stdout) in the failure message — so the AI sees 404 Not Found or Test failed: line 23 expected 3 got 2 and can act on it.
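The tail extraction can be sketched as follows. The message wording mirrors the trace in the failure-loop section of this page. The rule that stderr is preferred and stdout is used only when stderr is empty is an assumption about what "(or stdout)" means, and `tail` and `shellFailureDetail` are hypothetical helper names.

```typescript
// Returns the last `count` lines of `text`, ignoring trailing whitespace.
function tail(text: string, count: number): string {
  const lines = text.trimEnd().split("\n");
  return lines.slice(-count).join("\n");
}

// Builds the failure detail shown to the AI.
// Assumption: stderr wins when it has content; otherwise stdout is used.
function shellFailureDetail(actual: number, expected: number, stdout: string, stderr: string): string {
  const source = stderr.trim() !== "" ? stderr : stdout;
  return `Shell exited ${actual}, wanted ${expected}. Output tail:\n${tail(source, 5)}`;
}
```

Keeping only a short tail bounds the size of the injected system message even when a test suite prints thousands of lines.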


Model question (subjective / visual)

{ type: 'model_question', question: string, threshold?: 'yes' | 'high_confidence' }

The runner makes a one-shot call to the active provider with a strict YES/NO judge prompt:

System: You are a strict yes/no judge. Reply with exactly one token: YES or NO — nothing else.
User:   Question: <question>
        Agent rationale: <claim_complete rationale>
        Answer:

The first token determines verification. Anything other than YES → not verified; the full reply text becomes the failure detail.
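The first-token rule can be sketched as below. Treating a "token" as the first whitespace-delimited word and comparing it exactly (case-sensitively) with YES are assumptions made for this illustration, and `parseJudgeReply` is a hypothetical name.

```typescript
// Hypothetical outcome shape for this sketch.
type JudgeOutcome = { verified: true } | { verified: false; detail: string };

// Assumption: the "first token" is approximated by the first whitespace-delimited word.
function parseJudgeReply(reply: string): JudgeOutcome {
  const firstToken = reply.trim().split(/\s+/)[0] ?? "";
  if (firstToken === "YES") {
    return { verified: true };
  }
  // Anything else counts as not verified; the full reply becomes the failure detail.
  return { verified: false, detail: reply };
}
```

Under these assumptions a reply such as "NO, the header is missing" fails verification and its full text is what the AI sees on the next step.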

| Pros | Cons |
| --- | --- |
| Handles subjective outcomes shell can't ("does the dashboard look right?") | Judge can be wrong: bias toward NO if rationale is thin |
| No external tooling needed | Costs a model call per claim |
| Provider-agnostic | Quality scales with the judge model |

Best for: visual outcomes (combined with Vision-Verification tools the AI calls before claiming), policy-style judgements ("did the agent follow the deploy checklist?"), interpretation-required outcomes.

Examples:

"Is the user dashboard showing the username 'shop@cardboardlegacy.com' in the top-right corner?"
"Does the README now have a 'Configuration' section?"
"Was the deploy followed by a health check that returned 200?"

For visual outcomes, the AI should call browser_verify_visual_state before claim_complete — that grounds its rationale in pixels the judge can implicitly trust.


JSON predicate (deferred)

{ type: 'json_predicate', expr: string }

A JavaScript-style expression that the runner would evaluate against the AI's claim rationale parsed as JSON. Not fully implemented yet — the runner accepts the rationale with a note that the predicate was unverified, and surfaces the predicate text in the report so you know what was asserted.

For now, use model_question instead: phrase the predicate as a yes/no question.
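One way to do that rephrasing mechanically is sketched below. The two criterion shapes are copied from this page; `asModelQuestion` and the question wording are hypothetical and only illustrate the workaround.

```typescript
// Criterion shapes as given on this page.
type JsonPredicate = { type: "json_predicate"; expr: string };
type ModelQuestion = {
  type: "model_question";
  question: string;
  threshold?: "yes" | "high_confidence";
};

// Hypothetical helper: wraps the predicate text in a yes/no question for the judge.
function asModelQuestion(predicate: JsonPredicate): ModelQuestion {
  return {
    type: "model_question",
    question: `Does the claimed result satisfy this condition: ${predicate.expr}?`,
  };
}
```

A hand-written question will usually be clearer to the judge than a wrapped expression, so this is best treated as a stopgap.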

When the evaluator ships, it'll support expressions like response.status === 200 && response.body.ok === true against the claimed result.


Manual (trust)

{ type: 'manual' }

The runner trusts whatever rationale the AI provides with claim_complete. Verification always passes.

| Pros | Cons |
| --- | --- |
| Useful for exploratory goals where success isn't well-defined upfront | No safety net: the AI can claim done falsely |
| Lets you review the rationale in the dashboard and re-open if it lied | You have to review |

Best for: research goals, "look around and tell me what you find" tasks, anything where you'll inspect the final report yourself anyway.


How the AI uses these

The model is taught (via the goal prompt the runner builds) about its termination contract:

RULES:

  • You cannot stop on your own. The loop runs until you call claim_complete (which the runner verifies) or abort_with_report (graceful give-up).
  • When you believe the goal is achieved, call claim_complete with a brief rationale. If verification fails, you'll be told why and the loop resumes.
  • If you genuinely cannot make progress, call abort_with_report with a reason and what you learned.
  • Use the step protocol (declare_step → action → verify_step) for non-trivial actions.
  • You are running under policy: risk=, sandbox=. Tools exceeding this scope will prompt the user.

The runner enforces these. The model can't "just stop."
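The contract can be illustrated with a toy loop over scripted turns. Everything here is a sketch under stated assumptions: `Turn`, `Outcome`, and `runLoop` are hypothetical, `verify` stands in for running the success criterion, and running out of turns is used to model the case where the model returns no message.

```typescript
// Hypothetical turn shape: plain text, or one of the two terminating tool calls.
type Turn =
  | { kind: "message"; text: string }
  | { kind: "claim_complete"; rationale: string }
  | { kind: "abort_with_report"; reason: string };

type Outcome = { state: "completed" | "aborted" | "failed"; steps: number };

// Drives a scripted list of turns until a verified claim or an abort.
function runLoop(turns: Turn[], verify: (rationale: string) => boolean): Outcome {
  let steps = 0;
  for (const turn of turns) {
    steps += 1;
    if (turn.kind === "abort_with_report") {
      return { state: "aborted", steps };
    }
    if (turn.kind === "claim_complete" && verify(turn.rationale)) {
      return { state: "completed", steps };
    }
    // Plain messages and rejected claims do not end the run; the loop continues.
  }
  // Assumption: an exhausted script models "model returned no message".
  return { state: "failed", steps };
}
```

Note that a plain message never terminates the loop in this sketch, which is the sense in which the model cannot "just stop".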


Failure-loop dynamics

When verification fails, the loop continues with the failure reason injected. So:

[step 12] AI calls claim_complete("nginx is running on port 80")
[runner]  spawns: curl -sf http://localhost
[runner]  exit 7 — connection refused
[runner]  injects: "Verification failed: Shell exited 7, wanted 0. Output tail:
          curl: (7) Failed to connect to localhost port 80: Connection refused"
[step 13] AI sees failure → reasons "I should check the service status"
[step 14] AI runs: systemctl status nginx
[step 15] AI sees nginx isn't running → starts it → re-claims complete
[runner]  spawns: curl -sf http://localhost
[runner]  exit 0
[runner]  goal completed

This is the heart of why "set it and walk away" works: a failed verification doesn't end the run; it informs the next attempt.


When the loop ends

| End state | Triggers |
| --- | --- |
| completed | claim_complete + verification passed |
| aborted | User clicked Abort, or AI called abort_with_report |
| failed | Wall clock exceeded, model returned no message, loop crashed (rare) |

The terminal state lives in the goal log along with the full step history and finalReport text — visible in the dashboard.
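The end-state table can be written down as an exhaustive lookup. The three state names come from the table; the `Trigger` identifiers and `endStateFor` are hypothetical labels invented for this sketch, not values the runner is known to emit.

```typescript
type EndState = "completed" | "aborted" | "failed";

// Hypothetical trigger labels, one per cause listed in the table.
type Trigger =
  | "verification_passed"
  | "user_abort"
  | "abort_with_report"
  | "wall_clock_exceeded"
  | "no_model_message"
  | "loop_crashed";

// Record<Trigger, EndState> makes the compiler reject a missing trigger.
const END_STATE_BY_TRIGGER: Record<Trigger, EndState> = {
  verification_passed: "completed",
  user_abort: "aborted",
  abort_with_report: "aborted",
  wall_clock_exceeded: "failed",
  no_model_message: "failed",
  loop_crashed: "failed",
};

function endStateFor(trigger: Trigger): EndState {
  return END_STATE_BY_TRIGGER[trigger];
}
```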

