Skip to content

pejmanjohn/slot-machine

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

89 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

🎰 Slot Machine

An open-source agent skill for Claude Code.

AI agents are probabilistic. The same spec produces different code every time — different designs, different bugs, different quality. A single attempt is a coin flip. Slot Machine gives you N coins and keeps the best one — or combines the best parts of each.

Run N independent implementations of the same feature in parallel. Each gets reviewed by an independent agent that hunts for real bugs. A meta-judge compares all of them and makes one of three calls: pick the clear winner, synthesize the best elements from multiple implementations into something better than any individual, or reject all if none meet the bar.

What You Can Do

Run 3 competing implementations and pick the best one:

/slot-machine with 3 slots — Implement the payment webhook handler from PLAN.md

Three agents implement the same spec independently, each steered toward a different emphasis such as simplicity, robustness, or functional style. Independent reviewers hunt for bugs in each. A judge picks the winner — or synthesizes the best parts of several.

Assign a different skill to each slot:

/slot-machine with /superpowers:test-driven-development and /ce:work — Build the rate limiter

Slot 1 follows TDD (tests first). Slot 2 follows CE patterns (codebase-aware). Same spec, different methodologies, best result wins.

Or even run some slots on Codex:

/slot-machine with /superpowers:test-driven-development, /ce:work, and codex — Implement the API

Three slots: Claude with TDD, Claude with CE patterns, and OpenAI Codex. Different models find different bugs — the evaluation pipeline reviews all of them the same way.

It works for writing too:

/slot-machine with profile: writing — Write the launch announcement

Each slot drafts with a different voice and structure. The judge picks the strongest draft or synthesizes the best elements from several.

Set it once in your project and forget:

## Slot Machine Settings (add to CLAUDE.md)
slot-machine-slots:
  - /superpowers:test-driven-development
  - /ce:work
  - codex
  - default

Every /slot-machine invocation in this project uses these slots automatically.

How It Works

Slot-machine dispatches a pipeline of specialized agents. Each role is isolated — implementers never see each other's work, reviewers never see each other's reviews.

Step Agent What it does
Implement N implementers (parallel) Each builds the full spec independently in an isolated git worktree. Different slots can use different skills or even different agent harnesses (Codex).
Review N reviewers (parallel) Each reviews one implementation blind — spec compliance, adversarial bug hunting with file:line evidence, test gap analysis.
Judge 1 judge Reads all reviewer scorecards, does targeted code inspection where reviewers disagree, and issues a verdict: PICK the winner, SYNTHESIZE the best elements, or NONE_ADEQUATE.
Synthesize 1 synthesizer (if needed) Takes one slot as base, ports specific elements from donors per the judge's plan, verifies coherence, runs the full test suite.
Resolve Orchestrator Merges the winner, cleans up worktrees, writes result artifacts with full model attribution.

The key insight: the agent that implements never evaluates. The agent that reviews never sees alternatives. The judge only sees structured scorecards, not raw code (unless it needs to inspect a specific disagreement). This separation prevents the bias that happens when one agent does everything.

Install

This repo is currently distributed as a standalone Claude Code skill, not a plugin.

Install it into your Claude Code user skills directory:

git clone https://github.com/pejmanjohn/slot-machine.git ~/.claude/skills/slot-machine

If ~/.claude/skills does not exist yet, create it first.

Then invoke it from any Claude Code session:

/slot-machine with 3 slots — Implement the payment webhook handler from PLAN.md

To update later:

git -C ~/.claude/skills/slot-machine pull

See It Work

You give it a spec:

/slot-machine with 3 slots — Implement the TaskScheduler from the spec

The skill takes over:

Slot Machine — coding profile

Feature: TaskScheduler
Slots: 3 | Simplest approach (claude-opus-4-6), Robustness (claude-opus-4-6), Functional (claude-opus-4-6)

Three agents implement the full spec independently, each in an isolated worktree with a different implementation emphasis. Then independent reviewers inspect each one — not a rubber stamp, an adversarial review with evidence:

Slot 1 Review:
  Spec Compliance: PASS
  Critical: src/scheduler.ts:47 — unhandled TypeError when concurrency
           is non-integer. Constructor accepts 1.5 silently.
  Important: src/scheduler.ts:38 — drain() doesn't account for tasks
             scheduled after drain is called.
  Verdict: Not a contender — critical validation bug.

Slot 2 Review:
  Spec Compliance: PASS
  Important: tests/scheduler.test.ts:92 — flaky timing assertion.
  Minor: No error message in constructor throw.
  Verdict: Yes — strongest validation, 17 tests.

Slot 3 Review:
  Spec Compliance: PASS
  Important: drain() uses snapshot semantics — may miss late-scheduled tasks.
  Verdict: Yes with concerns — clean API but drain limitation.

The judge compares all three, does targeted code inspection, and decides:

---

Verdict: PICK Slot 2 (Claude Code claude-opus-4-6) | Confidence: HIGH

Zero critical issues, strongest test coverage (17 tests including concurrency
stress tests), correct drain semantics. No synthesis needed — clear winner.

---

Bugs caught that would have shipped with a single implementation. The winner has 3x the test coverage of either alternative.

When the Judge Synthesizes

Sometimes no single slot is the best at everything. In a cross-model run, the judge saw complementary strengths and called SYNTHESIZE:

---

Verdict: SYNTHESIZE | Confidence: HIGH

Slot 3 has the cleanest code. Slot 1 has the best tests. Combining both
produces something better than either.

- Base: Slot 3 (Codex gpt-5.4) — cleanest implementation, proper drain pattern
- + Slot 1 (Claude Code opus-4.6 w/ /ce:work) — 19-test suite: nested scheduling,
  timing verification, counter tracking
- Keep Slot 3: event-ordering drain test, error propagation test

---

The synthesizer agent starts with one slot as the base, ports specific elements from the donors, checks for coherence, and runs the full test suite. The result reads like one person wrote it, not like pieces were stitched together.

Without vs With Slot Machine

Without Skill With Slot Machine
Implementations 1 3 (parallel)
Review Self-review (finds 0 bugs) 3 independent adversarial reviewers
Bugs found 0 3 (including a crash-severity TypeError)
Tests in winner ~20 45
Decision process Ships whatever it built Evidence-based PICK or SYNTHESIZE with file:line reasoning
Synthesis N/A Can combine best code from one slot with best tests from another
Confidence "Looks good to me" HIGH — judge verified via targeted code inspection
Design alternatives 0 (never explored) 2 rejected alternatives with documented reasons
Cross-model N/A Claude vs Codex on same spec — different models find different bugs

Usage

/slot-machine

Spec: Implement the payment webhook handler from PLAN.md

Or inline with options:

/slot-machine with 3 slots, profile: writing

Spec: Write a changelog entry announcing the new task profiles feature

The skill also triggers on natural language: "slot-machine this", "best-of-N", "pull the lever", or "parallel implementations."

Cross-Model Runs

Run the same spec across different agent harnesses and pick the best result:

/slot-machine with /ce:work, /ce:work + codex, and codex

Spec: Implement the TaskScheduler class from PLAN.md

Three slots: Claude Code with CE patterns, Codex with CE patterns, and bare Codex. Each implements independently, all reviewed by the same evaluation pipeline. The progress table shows which model ran each slot:

Slot Status Model Tests Approach
1 DONE claude-opus-4-6 17 tests /ce:work
2 DONE_WITH_CONCERNS gpt-5.4 5 tests /ce:work + codex
3 DONE_WITH_CONCERNS gpt-5.4 5 tests codex

Skills guide methodology (TDD, CE patterns). Harnesses choose the AI system (Claude, Codex). Compose them with +:

/slot-machine with 4 slots:
  slot 1: /superpowers:test-driven-development
  slot 2: /superpowers:test-driven-development + codex
  slot 3: /ce:work
  slot 4: codex

Skills are invoked natively by each harness — Claude uses the Skill tool, Codex uses $ prefix. Each loads the full skill document in its own way.

Or set project defaults in CLAUDE.md:

## Slot Machine Settings
slot-machine-slots:
  - /superpowers:test-driven-development
  - /ce:work
  - codex
  - default

Profiles: Coding and Writing

Slot-machine auto-detects whether your spec is a coding task or a writing task and loads the right profile. Each profile has its own approach hints, reviewer criteria, and synthesis strategy.

Coding profile (isolation: worktree):

  • Hints steer toward different implementation emphases: simplicity, robustness, functional style, idiomatic APIs, extensibility
  • Reviewer checks spec compliance, hunts bugs with file:line evidence, assesses test coverage
  • Pre-checks run your test suite before review
  • Each slot gets an isolated git worktree

Writing profile (isolation: file):

  • Hints steer toward different voices: concise, narrative, technical, conversational, structured
  • Reviewer checks brief compliance, prose quality, audience fit, coherence
  • No git worktrees — each slot writes to a file
  • Synthesis merges the best phrasing and structure from multiple drafts

Force a profile with profile: writing or profile: coding, or let auto-detection handle it.

Works With Your Other Skills

Each implementer slot is a full Claude Code session with access to all your installed skills. You can assign a specific skill per slot — or let implementers pick up skills automatically from your environment.

Works with superpowers, compound-engineering, gstack, or any other implementation skill. The orchestrator passes your CLAUDE.md conventions as project context to each implementer, so project-specific rules apply to every slot automatically.

What the Reviewer Actually Finds

This is a real finding from one of our test runs. The reviewer is an independent agent that reads the actual code — not the implementer's self-report:

Critical: src/api.py:47 — Unhandled TypeError crash

  What: POST /tasks with priority="high" (string instead of int) causes
        an unhandled TypeError in PriorityQueue.put(). Flask catches it
        and returns a generic 500 Internal Server Error.

  Impact: Any API caller sending a non-integer priority crashes the endpoint.
          No error message, no 400 Bad Request — just a 500.

  Fix: Add type validation before queue insertion:
       if not isinstance(priority, int):
           return jsonify({"error": "priority must be an integer"}), 400

The reviewer cites the exact file and line, explains the impact, and suggests a fix. The implementer's self-review said "all requirements implemented, tests pass" — it missed this entirely.

The judge then ranks all slots based on reviewer findings:

| Rank | Slot | Critical | Important | Minor | Spec | Verdict       |
|------|------|----------|-----------|-------|------|---------------|
| 1    | 2    | 0        | 1         | 1     | PASS | Winner        |
| 2    | 3    | 0        | 2         | 1     | PASS | With concerns |
| 3    | 1    | 1        | 1         | 0     | PASS | Disqualified  |

That's what goes into your codebase. Not the first thing, not the prettiest — the one that held up under independent scrutiny.

Configuration

Setting Default Description
slots 3 Number of parallel attempts
approach_hints true Different architectural direction per slot
auto_synthesize true Allow combining best elements from multiple slots
max_retries 1 Re-run failed slots (0 = no retry)
cleanup true Delete worktrees after completion
quiet false Suppress progress tables (for autonomous loops)
implementer_model inherit Model for implementers (inherits from session)
reviewer_model inherit Model for reviewers (inherits from session)
judge_model inherit Model for judge (inherits from session)
synthesizer_model inherit Model for synthesizer (inherits from session)

Set in your project's CLAUDE.md or override inline: /slot-machine with 3 slots

When to Use

Slot-machine trades tokens and time for quality. In return, you get independent review that catches bugs self-review misses, multiple design alternatives compared under structured criteria, and the option to synthesize the best parts of each.

That tradeoff is worth it when the cost of shipping a bug exceeds the cost of the extra compute. It's not worth it when the task is mechanical.

Use when:

  • Feature has meaningful design choices (architecture, patterns, tradeoffs)
  • The code will ship to production or be built on top of
  • Spec is clear enough for independent implementation
  • Running in autonomous loops where you're not waiting at the terminal — the extra time costs nothing when the agent is working overnight

Skip when:

  • Simple mechanical changes (rename, add a field)
  • You already know exactly how it should be built
  • Spec is too vague — brainstorm first, then slot-machine
  • Interactive back-and-forth where you're waiting for each response

Does this problem have a design space worth exploring? If yes, pull the lever.

Works in Autonomous Loops

Slot-machine runs inside Ralph Loop and custom agent loops. No special setup — add config to your CLAUDE.md and the loop's AI instances pick it up automatically.

Slot-machine self-regulates: it evaluates each task and only engages when the task has meaningful design choices. Mechanical tasks (add a field, rename a function) get single-shot implementation. You can blanket-enable slot-machine and trust it to only spend compute when competition adds value.

Setup (add to CLAUDE.md):

## Slot Machine Settings
slot-machine-profile: coding
slots: 3
quiet: true

Every run writes a machine-readable result to .slot-machine/runs/latest/result.json that scripts can parse:

{
  "verdict": "PICK",
  "winning_slot": 2,
  "confidence": "HIGH",
  "slot_details": [
    {"slot": 1, "harness": "Claude Code", "model": "claude-opus-4-6", "skill": "/ce:work"},
    {"slot": 2, "harness": "Codex", "model": "gpt-5.4", "skill": null}
  ],
  "files_changed": ["src/api.py", "tests/test_api.py"],
  "tests_passing": 45
}

Set quiet: true to suppress progress tables in unattended runs. The run directory (.slot-machine/runs/) keeps all artifacts (slot drafts, reviewer scorecards, judge verdict) for post-hoc inspection.

Custom Profiles

Create your own profiles to customize how slot-machine implements, reviews, judges, and synthesizes. Use this to enforce your team's coding standards, define domain-specific review criteria, or change how the judge weighs tradeoffs. A profile is a folder with 5 files:

my-profile/
  0-profile.md       # Config: name, isolation, pre-checks, approach hints
  1-implementer.md   # Prompt for each implementation agent
  2-reviewer.md      # Prompt for each review agent
  3-judge.md         # Prompt for the meta-judge
  4-synthesizer.md   # Prompt for the synthesizer

Profile config (0-profile.md) uses YAML frontmatter:

---
name: api-review
description: For reviewing and reimplementing API endpoints with security focus.
extends: coding
isolation: worktree
pre_checks: |
  {test_command} 2>&1
  npm audit 2>&1
---

## Approach Hints

1. "Focus on input validation and authentication — treat every caller as untrusted."
2. "Optimize for observability — structured logging, error codes, request tracing."
3. "Design for backward compatibility — existing clients must not break."

Inheritance: Set extends: coding to inherit all prompts from the coding profile and override only what you change. Files present in your profile replace the base; missing files are inherited. One level of inheritance max.

Install locations:

  • Project-local: ./profiles/my-profile/ (checked into your repo)
  • Personal: ~/.slot-machine/profiles/my-profile/ (available in all projects)

Use it: /slot-machine with profile: my-profile

Or set as project default in CLAUDE.md:

slot-machine-profile: my-profile

All prompts receive universal variables ({{SPEC}}, {{PROJECT_CONTEXT}}, {{APPROACH_HINT}}, etc.) — your prompts just need to reference them.

Why Not Just Ask Claude to Do It 5 Times?

We tried that. Five parallel implementations, no skill, Claude doing what it naturally does. The parallelism worked fine. Six things broke:

Self-review finds nothing. The same agent that wrote the code reviewed it. In our benchmark, self-review found 0 bugs. Independent reviewers found 3 — including a crash-severity TypeError. You can't objectively evaluate your own work.

No structured comparison. Without a rubric, Claude made an ad hoc "this one looks best" decision. No spec compliance check, no severity categorization, no file:line evidence. The judge in slot-machine reads structured scorecards with ranked findings — not vibes.

No synthesis. When no single implementation is best at everything — one has the cleanest code, another has the best tests — Claude just picks one and loses the other's strengths. Slot-machine's judge can call SYNTHESIZE: combine the best code from one slot with the best tests from another.

No diversity. Without guidance, Claude produces similar implementations each time. Same patterns, same blind spots. Slot-machine creates diversity at three levels: hints steer each slot toward a different implementation emphasis (simplicity vs robustness vs functional style), skills assign different methodologies per slot (TDD for one, CE patterns for another), and cross-model dispatch runs some slots on entirely different agent harnesses (Codex finds bugs Claude doesn't, and vice versa).

No isolation. Without worktree management, parallel implementations write to the same files and clobber each other. Slot-machine gives each slot its own git worktree — fully isolated workspaces where implementations can't interfere. The winner's branch merges cleanly.

No trail. Without the skill, the comparison is ephemeral — gone when the conversation ends. Slot-machine saves reviewer scorecards, judge verdict, and result artifacts to .slot-machine/runs/ for post-hoc inspection.

The hard part isn't running N agents. It's evaluating their output honestly.

This is NOT Standard Parallel Agents

Every major tool splits different tasks across agents (frontend, backend, tests in parallel). That's task decomposition.

Slot Machine gives the same task to N agents and compares their full implementations. The value isn't parallelism — it's competition, independent review, and structured judgment. Different problem, different solution.

Running Tests

./tests/run-tests.sh                  # Contract validation (instant)
./tests/run-tests.sh --smoke          # + Real implementer/reviewer/judge smoke tests
./tests/run-tests.sh --integration    # + Smoke tier + real happy-path E2E + skipped edge-case E2E
./tests/run-tests.sh --all            # Everything, with unavailable headless tiers skipped

The fast suite verifies prompt contracts, repo structure, and harness integrity. The implementer, reviewer, and judge smoke tests plus the happy-path E2E test now run for real via headless claude -p; the edge-case E2E and reviewer-accuracy scripts still report explicit skips instead of passing silently.

License

MIT

About

An open-source skill for running parallel implementations, reviewing them independently, and selecting or synthesizing the best result.

Topics

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages