Makes autonomous coding agents produce software that actually works — not software that passes tests.
codeharness is an npm CLI + Claude Code plugin that packages verification-driven development as an installable tool: black-box verification via Docker, agent-first observability via VictoriaMetrics, and mechanical enforcement via hooks that make skipping verification architecturally impossible.
- Verifies features work — not just that tests pass. Black-box verification runs the built CLI inside a Docker container with no source code access. If the feature doesn't work from a user's perspective, verification fails.
- Fixes what it finds — verification failures with code bugs automatically return to development with specific findings. The dev agent gets told exactly what's broken and why.
- Runs sprints autonomously — reads your sprint plan, picks the highest-priority story, implements it, reviews it, verifies it, and moves to the next one. Cross-epic prioritization, retry management, and session handoff built in.
- Makes agents see runtime — ephemeral VictoriaMetrics stack (logs, metrics, traces) that agents query programmatically during development. No guessing at what the code does at runtime.
Two components — install both:
# CLI (npm package)
npm install -g codeharness
# Claude Code plugin (slash commands, hooks, skills)
claude plugin install github:iVintik/codeharness# Initialize in your project
codeharness init
# Start autonomous sprint execution (inside Claude Code)
/harness-runThe CLI handles all mechanical work — stack detection, Docker management, verification, coverage, retry state.
| Command | Purpose |
|---|---|
codeharness init |
Detect stack, install dependencies, start observability, scaffold docs |
codeharness run |
Execute the autonomous coding loop (Ralph) |
codeharness verify --story <key> |
Run verification pipeline for a story |
codeharness status |
Show harness health, sprint progress, Docker stack |
codeharness coverage |
Run tests with coverage and evaluate against targets |
codeharness onboard epic |
Scan codebase for gaps, generate onboarding stories |
codeharness retry --status |
Show retry counts and flagged stories |
codeharness retry --reset |
Clear retry state for re-verification |
codeharness verify-env build |
Build Docker image for black-box verification |
codeharness stack start |
Start the shared observability stack |
codeharness teardown |
Remove harness from project |
All commands support --json for machine-readable output.
The plugin provides slash commands that orchestrate the CLI within Claude Code sessions:
| Command | Purpose |
|---|---|
/harness-run |
Autonomous sprint execution — picks stories by priority, runs create → dev → review → verify loop |
/harness-init |
Interactive project initialization |
/harness-status |
Quick overview of sprint progress and harness health |
/harness-onboard |
Scan project and generate onboarding plan |
/harness-verify |
Verify a story with real-world evidence |
codeharness integrates with BMAD Method for structured sprint planning:
| Phase | Commands |
|---|---|
| Analysis | /create-brief, /brainstorm-project, /market-research |
| Planning | /create-prd, /create-ux |
| Solutioning | /create-architecture, /create-epics-stories |
| Implementation | /sprint-planning, /create-story, then /harness-run |
┌─────────────────────────────────────────┐
│ Claude Code Session │
│ /harness-run picks next story │
│ → create-story → dev → review → verify │
└────────────────────┬────────────────────┘
│ verify
▼
┌─────────────────────────────────────────┐
│ Docker Container (no source code) │
│ - codeharness CLI installed from tarball│
│ - claude CLI for nested verification │
│ - curl/jq for observability queries │
│ Exercises CLI as a real user would │
└────────────────────┬────────────────────┘
│ queries
▼
┌─────────────────────────────────────────┐
│ Observability Stack (VictoriaMetrics) │
│ - VictoriaLogs :9428 (LogQL) │
│ - VictoriaMetrics :8428 (PromQL) │
│ - OTEL Collector :4318 │
└─────────────────────────────────────────┘
When verification finds code bugs → story returns to dev with findings → dev fixes → re-verify. This loop runs up to 10 times per story. Infrastructure failures (timeouts, Docker errors) retry 3 times then skip.
- Node.js >= 18
- Docker (for observability and verification)
- Claude Code (for plugin features)
MIT