Success Criteria
How the goal runner decides a goal is actually done. Pick one of four types when Starting-a-Goal; the AI calls claim_complete when it thinks the goal is achieved, and the runner runs the criterion to verify.
If verification passes: goal ends as completed.
If verification fails: the failure detail is injected as a system message ("Verification failed: …") and the loop resumes. The AI sees why it failed and can try again, or call abort_with_report if it's truly stuck.
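The claim-and-verify flow above can be sketched in TypeScript. This is a minimal illustration, not the runner's actual code: the criterion shapes are copied from this page, while `VerifyResult`, `LoopAction`, and `afterClaim` are hypothetical names introduced here.

```typescript
// The four criterion shapes documented on this page.
type SuccessCriterion =
  | { type: 'shell'; command: string; exitCode?: number }
  | { type: 'model_question'; question: string; threshold?: 'yes' | 'high_confidence' }
  | { type: 'json_predicate'; expr: string }
  | { type: 'manual' };

// Hypothetical result of running a criterion after claim_complete.
interface VerifyResult {
  verified: boolean;
  detail: string; // failure detail; empty when verified
}

// Hypothetical description of what the loop does next.
type LoopAction =
  | { kind: 'end'; state: 'completed' }
  | { kind: 'resume'; systemMessage: string };

// Pass: the goal ends as completed.
// Fail: the detail is injected as a system message and the loop resumes.
function afterClaim(result: VerifyResult): LoopAction {
  if (result.verified) {
    return { kind: 'end', state: 'completed' };
  }
  return { kind: 'resume', systemMessage: `Verification failed: ${result.detail}` };
}
```
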
{ type: 'shell', command: string, exitCode?: number /* default 0 */ }

The runner spawns command via child_process.spawn with shell=true, captures stdout/stderr, and checks the exit code against exitCode (default 0).
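A minimal sketch of that check, assuming Node.js. `runShellCriterion` and `ShellOutcome` are hypothetical names; only `child_process.spawn` with `shell: true` and the default exit code of 0 come from the description above.

```typescript
import { spawn } from 'node:child_process';

interface ShellCriterion {
  type: 'shell';
  command: string;
  exitCode?: number; // default 0
}

// Hypothetical outcome shape for this sketch.
interface ShellOutcome {
  verified: boolean;
  exitCode: number | null; // null if the process was killed by a signal
  stdout: string;
  stderr: string;
}

function runShellCriterion(criterion: ShellCriterion): Promise<ShellOutcome> {
  const wanted = criterion.exitCode ?? 0;
  return new Promise((resolve, reject) => {
    // shell: true lets the command use pipes, &&, and redirects.
    const child = spawn(criterion.command, { shell: true });
    let stdout = '';
    let stderr = '';
    child.stdout.on('data', (chunk: Buffer) => { stdout += chunk.toString(); });
    child.stderr.on('data', (chunk: Buffer) => { stderr += chunk.toString(); });
    child.on('error', reject); // the shell itself could not be started
    child.on('close', (code) => {
      resolve({ verified: code === wanted, exitCode: code, stdout, stderr });
    });
  });
}
```

Because the command goes through a shell, compound checks such as `test -f /tmp/report.csv && test -s /tmp/report.csv` work as a single criterion.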
| Pros | Cons |
|---|---|
| Deterministic — exit codes don't lie | Requires the verification to be expressible as a shell command |
| Output tail is fed back to the AI on failure so it can self-correct | Local-only — runs on the machine ClusterSpace is on |
| Fast | |
Best for: web servers up (curl), files exist (test -f), tests pass (npm test), commits land (git log -1 --oneline | grep), packages installed (which nginx), services running (systemctl is-active).
Examples:
curl -sf http://localhost:8080/health
npm test
test -f /tmp/report.csv && test -s /tmp/report.csv
docker ps --format '{{.Names}}' | grep -q my-container
git rev-parse HEAD | grep -q '^[0-9a-f]\{40\}$'
When the shell exits non-zero, the runner surfaces the last 5 lines of stderr (or stdout) in the failure message — so the AI sees 404 Not Found or Test failed: line 23 expected 3 got 2 and can act on it.
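The tail extraction could look roughly like this. It is a sketch under two assumptions: stdout is used only when stderr is empty, and blank lines are dropped before counting. `outputTail` and `shellFailureDetail` are hypothetical names; the message wording follows the injected message shown in the nginx walkthrough on this page.

```typescript
// Returns the last `maxLines` non-empty lines of stderr,
// falling back to stdout when stderr is empty (assumption).
function outputTail(stdout: string, stderr: string, maxLines = 5): string {
  const source = stderr.trim().length > 0 ? stderr : stdout;
  const lines = source.split(/\r?\n/).filter((line) => line.length > 0);
  return lines.slice(-maxLines).join('\n');
}

// Builds the failure detail that is fed back to the AI.
function shellFailureDetail(
  actual: number | null,
  wanted: number,
  stdout: string,
  stderr: string,
): string {
  return `Shell exited ${actual}, wanted ${wanted}. Output tail:\n${outputTail(stdout, stderr)}`;
}
```
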
{ type: 'model_question', question: string, threshold?: 'yes' | 'high_confidence' }

The runner makes a one-shot call to the active provider with a strict YES/NO judge prompt:
System: You are a strict yes/no judge. Reply with exactly one token: YES or NO — nothing else.
User: Question: <question>
Agent rationale: <claim_complete rationale>
Answer:
The first token determines verification. Anything other than YES → not verified; the full reply text becomes the failure detail.
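A sketch of building the judge prompt and reading the reply. The prompt text is taken from above; `buildJudgePrompt` and `parseJudgeReply` are hypothetical names, and treating the first token as an exact, case-sensitive match on `YES` is an assumption of this sketch.

```typescript
interface JudgeMessages {
  system: string;
  user: string;
}

// Assembles the one-shot judge prompt from the question and the
// rationale the AI passed to claim_complete.
function buildJudgePrompt(question: string, rationale: string): JudgeMessages {
  return {
    system:
      'You are a strict yes/no judge. Reply with exactly one token: YES or NO — nothing else.',
    user: `Question: ${question}\nAgent rationale: ${rationale}\nAnswer:`,
  };
}

// Only a first token of exactly "YES" verifies (assumption: case-sensitive).
// On anything else, the full reply text becomes the failure detail.
function parseJudgeReply(reply: string): { verified: boolean; detail: string } {
  const firstToken = reply.trim().split(/\s+/)[0] ?? '';
  const verified = firstToken === 'YES';
  return { verified, detail: verified ? '' : reply };
}
```
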
| Pros | Cons |
|---|---|
| Handles subjective outcomes shell can't ("does the dashboard look right?") | Judge can be wrong — bias toward NO if rationale is thin |
| No external tooling needed | Costs a model call per claim |
| Provider-agnostic | Quality scales with the judge model |
Best for: visual outcomes (combined with Vision-Verification tools the AI calls before claiming), policy-style judgements ("did the agent follow the deploy checklist?"), interpretation-required outcomes.
Examples:
"Is the user dashboard showing the username 'shop@cardboardlegacy.com' in the top-right corner?"
"Does the README now have a 'Configuration' section?"
"Was the deploy followed by a health check that returned 200?"
For visual outcomes, the AI should call browser_verify_visual_state before claim_complete — that grounds its rationale in pixels the judge can implicitly trust.
{ type: 'json_predicate', expr: string }

A JavaScript-style expression that the runner would evaluate against the AI's claim rationale parsed as JSON. Not fully implemented yet: the runner accepts the rationale with a note that the predicate was unverified, and surfaces the predicate text in the report so you know what was asserted.
For now, use model_question instead: phrase the predicate as a yes/no question.
When the evaluator ships, it'll support expressions like response.status === 200 && response.body.ok === true against the claimed result.
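Until then, the interim behaviour and the suggested workaround can be pictured as follows. Both functions are hypothetical illustrations; the note text and the generated question wording are assumptions, not the runner's actual strings.

```typescript
interface JsonPredicateCriterion {
  type: 'json_predicate';
  expr: string;
}

interface ModelQuestionCriterion {
  type: 'model_question';
  question: string;
  threshold?: 'yes' | 'high_confidence';
}

// Interim behaviour: the claim is accepted, but the report notes
// that the predicate was not actually evaluated.
function verifyJsonPredicateInterim(
  criterion: JsonPredicateCriterion,
): { verified: boolean; detail: string } {
  return {
    verified: true,
    detail: `Predicate was not evaluated (unverified): ${criterion.expr}`,
  };
}

// Workaround: phrase the predicate as a yes/no question for the judge.
function asModelQuestion(criterion: JsonPredicateCriterion): ModelQuestionCriterion {
  return {
    type: 'model_question',
    question: `Does the claimed result satisfy this condition: ${criterion.expr}?`,
  };
}
```
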
{ type: 'manual' }

The runner trusts whatever rationale the AI provides with claim_complete. Verification always passes.
| Pros | Cons |
|---|---|
| Useful for exploratory goals where success isn't well-defined upfront | No safety net — the AI can claim done falsely |
| Lets you review the rationale in the dashboard and re-open if it lied | You have to review |
Best for: research goals, "look around and tell me what you find" tasks, anything where you'll inspect the final report yourself anyway.
The model is taught (via the goal prompt the runner builds) about its termination contract:
RULES:
- You cannot stop on your own. The loop runs until you call claim_complete (which the runner verifies) or abort_with_report (graceful give-up).
- When you believe the goal is achieved, call claim_complete with a brief rationale. If verification fails, you'll be told why and the loop resumes.
- If you genuinely cannot make progress, call abort_with_report with a reason and what you learned.
- Use the step protocol (declare_step → action → verify_step) for non-trivial actions.
- You are running under policy: risk=, sandbox=. Tools exceeding this scope will prompt the user.
The runner enforces these. The model can't "just stop."
When verification fails, the loop continues with the failure reason injected. So:
[step 12] AI calls claim_complete("nginx is running on port 80")
[runner] spawns: curl -sf http://localhost
[runner] exit 7 — connection refused
[runner] injects: "Verification failed: Shell exited 7, wanted 0. Output tail:
curl: (7) Failed to connect to localhost port 80: Connection refused"
[step 13] AI sees failure → reasons "I should check the service status"
[step 14] AI runs: systemctl status nginx
[step 15] AI sees nginx isn't enabled → fixes → re-claims complete
[runner] spawns: curl -sf http://localhost
[runner] exit 0
[runner] goal completed
This is the heart of why "set it and walk away" works — the verification gap doesn't end the run, it informs the next attempt.
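The retry behaviour can be simulated with a small loop. This is a sketch, not the runner's implementation: `runGoal`, `ModelTurn`, and the scripted model are hypothetical, and the turn budget exists only to keep the sketch finite (the real runner's limits are described under end states).

```typescript
type ModelTurn =
  | { tool: 'claim_complete'; rationale: string }
  | { tool: 'abort_with_report'; reason: string }
  | { tool: 'other'; name: string };

interface VerifyResult { verified: boolean; detail: string }
interface RunOutcome { state: 'completed' | 'aborted'; injected: string[] }

// `nextTurn` stands in for the model and sees every message injected so far.
function runGoal(
  nextTurn: (injected: readonly string[]) => ModelTurn,
  verify: (rationale: string) => VerifyResult,
  maxTurns = 100,
): RunOutcome {
  const injected: string[] = [];
  for (let turnIndex = 0; turnIndex < maxTurns; turnIndex++) {
    const turn = nextTurn(injected);
    if (turn.tool === 'abort_with_report') return { state: 'aborted', injected };
    if (turn.tool === 'claim_complete') {
      const result = verify(turn.rationale);
      if (result.verified) return { state: 'completed', injected };
      // A failed claim does not end the run; it informs the next attempt.
      injected.push(`Verification failed: ${result.detail}`);
    }
    // Any other tool call simply continues the loop.
  }
  throw new Error('turn budget exhausted (sketch-only safeguard)');
}

// Scripted stand-in for the nginx walkthrough: claim, fail, fix, re-claim.
let serviceFixed = false;
const outcome = runGoal(
  (injected) => {
    if (injected.length === 1 && !serviceFixed) {
      serviceFixed = true;
      return { tool: 'other', name: 'fix nginx service' };
    }
    return { tool: 'claim_complete', rationale: 'nginx is running on port 80' };
  },
  () =>
    serviceFixed
      ? { verified: true, detail: '' }
      : { verified: false, detail: 'Shell exited 7, wanted 0.' },
);
```
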
| End state | Triggers |
|---|---|
| completed | claim_complete + verification passed |
| aborted | User clicked Abort, or AI called abort_with_report |
| failed | Wall clock exceeded, model returned no message, loop crashed (rare) |
The terminal state lives in the goal log along with the full step history and finalReport text — visible in the dashboard.
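The mapping from trigger to end state can be written as a small lookup. The three end states come from the table above; the `Trigger` identifiers are hypothetical labels chosen for this sketch.

```typescript
type EndState = 'completed' | 'aborted' | 'failed';

// Hypothetical labels for the triggers listed in the table.
type Trigger =
  | 'claim_verified'
  | 'user_abort'
  | 'abort_with_report'
  | 'wall_clock_exceeded'
  | 'no_model_message'
  | 'loop_crashed';

function endStateFor(trigger: Trigger): EndState {
  switch (trigger) {
    case 'claim_verified':
      return 'completed';
    case 'user_abort':
    case 'abort_with_report':
      return 'aborted';
    case 'wall_clock_exceeded':
    case 'no_model_message':
    case 'loop_crashed':
      return 'failed';
  }
}
```
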
- Starting-a-Goal — picking a criterion when creating a goal
- Goal-Runner-Overview — the loop contract
- Critic-and-Replan — what runs between claims to keep progress on track
- Vision-Verification — for visual outcomes the AI verifies mid-loop