floop-bench

Can AI agents learn from corrections and get better at coding tasks?

floop-bench is an open benchmark for testing that question. It evaluates floop — a tool that helps AI agents learn from human corrections — by running controlled A/B experiments on real software engineering tasks from SWE-bench Verified.

Every result is published here, whether it helps or hurts floop's case. The goal is truth, not marketing.

What We've Found So Far

We've run 11 experiments across 4 months. Here are the key results:

Run	What Was Tested	Bare	Floop	Delta	p-value	Significant?
10	3 hand-written heuristics in system prompt	4/20 (20%)	7/20 (35%)	+15pp	0.45	No
11a	`floop prompt` output from 3-behavior store	Pending	—	—	—	—

Total project spend: ~$59

The strongest signal so far is a +15 percentage point improvement from three focused behavioral heuristics (Run 10). By Cohen's conventions this is a small-to-medium effect (h = 0.34, between the 0.2 "small" and 0.5 "medium" benchmarks), but it is not statistically significant at n=20 tasks (p = 0.45). We need larger sample sizes, multiple model families, and more runs to draw real conclusions.
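The effect size quoted above can be reproduced from the Run 10 row of the results table. The sketch below uses the standard arcsine-difference definition of Cohen's h; the function name `cohens_h` is ours and is not taken from the `analysis` package.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference between arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


# Run 10 pass rates from the results table: Floop 7/20, Bare 4/20.
h = cohens_h(7 / 20, 4 / 20)
print(f"Cohen's h = {h:.2f}")  # prints "Cohen's h = 0.34"
```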

For the full experiment log with methodology, versions, and analysis for every run, see docs/RUNBOOK.md.

How We Test

Each experiment compares two arms on the same set of SWE-bench Verified tasks:

Bare arm: A coding agent with no behavioral guidance
Floop arm: The same agent with floop-generated behaviors injected into its system prompt

The agent runs inside a Docker sandbox, attempts to fix a real GitHub issue, and produces a git patch. SWE-bench's Docker-based evaluator runs the repository's test suite against the patch to determine pass/fail.

Statistical analysis uses bootstrap confidence intervals and McNemar's test for paired comparisons. See SPEC.md for the full experimental design.
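As a rough illustration of the two methods named above, here is a standard-library sketch of an exact McNemar test and a paired percentile bootstrap. It is not the code in `analysis/`. The function names are assumptions, and the per-task outcomes are hypothetical: they were chosen only so that the totals match Run 10 (4/20 and 7/20), and the real per-task results are in the run logs.

```python
import math
import random


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b = tasks only the floop arm solved, c = tasks only the bare arm solved.
    """
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(math.comb(n, k) for k in range(min(b, c) + 1)) / 2**n
    return min(1.0, 2 * tail)


def paired_bootstrap_ci(bare, floop, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile CI for the pass-rate difference, resampling whole tasks."""
    rng = random.Random(seed)
    n = len(bare)
    deltas = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        deltas.append(sum(floop[i] - bare[i] for i in idx) / n)
    deltas.sort()
    lo = deltas[int((alpha / 2) * n_resamples)]
    hi = deltas[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# HYPOTHETICAL outcomes (1 = pass) for 20 tasks: 2 solved by both arms,
# 5 by floop only, 2 by bare only, 11 by neither.
bare = [1] * 2 + [0] * 5 + [1] * 2 + [0] * 11
floop = [1] * 2 + [1] * 5 + [0] * 2 + [0] * 11

b = sum(f and not x for x, f in zip(bare, floop))  # floop-only passes
c = sum(x and not f for x, f in zip(bare, floop))  # bare-only passes
p_value = mcnemar_exact(b, c)
lo, hi = paired_bootstrap_ci(bare, floop)
print(f"McNemar exact p = {p_value:.3f}")  # prints "McNemar exact p = 0.453"
print(f"95% bootstrap CI for delta: [{lo:+.2f}, {hi:+.2f}]")
```

Both arms attempt the same tasks, so the bootstrap resamples task indices rather than resampling each arm independently. This keeps the pairing that McNemar's test also relies on.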

What counts as "floop"

We're careful to separate the tool from the technique:

Runs 7-10 tested hand-written heuristics injected as raw text. This tests the technique of behavioral prompting, not floop itself.
Runs 11 onward test `floop prompt`: the actual floop binary generating behavior text from a learned store.

Both findings are valuable. We report exactly what was tested in each run.

Where This Is Going

floop-bench is currently a manual evaluation harness. The roadmap:

Level	Automated	Manual	Phase
Manual	Nothing	Everything	Now (Runs 1-11)
Semi-auto	Post-consolidation tier 1	Human reviews results	v0
Auto + guardrails	Full loop overnight, proposes changes	Human approves	v1
Full auto	Hypothesize, test, keep/discard	Weekly summary review	v2

When floop-bench gains the ability to autonomously hypothesize and test consolidation parameters, it becomes floop-research.

Project Status

floop-bench is an active research project. The harness has been used for 11 experiment runs across multiple model configurations. The evaluation pipeline (task execution, SWE-bench evaluation, statistical analysis) is stable. Results update with each run.

Getting Started

Prerequisites

Python 3.12+
uv
Docker or Podman
An API key for at least one model provider (see below)

Setup

git clone https://github.com/nvandessel/floop-bench.git
cd floop-bench
uv sync

# Build the sandbox image (agents run inside this container)
make build

API Keys

cp .env.example .env
# Edit .env with your key(s)

Set whichever keys you need for the models configured in config/arms.toml. Keys are forwarded into sandbox containers automatically.

Validate the environment:

uv run python -m scripts.validate_harness

Running Experiments

# Smoke test (2 tasks, validates the pipeline)
make smoke

# Train phase (30 tasks, agent learns behaviors organically)
make train

# Eval phase (20 tasks, leakage audit runs first)
make eval

# Statistical analysis
uv run python -m analysis.analyze
uv run python -m analysis.charts

Make Targets

Target	Description
`make build`	Build the sandbox container image
`make shell`	Interactive bash inside the sandbox
`make smoke`	Smoke test (2 tasks, sandboxed)
`make train`	Train phase (30 tasks, sandboxed)
`make eval`	Eval phase (20 tasks, leakage audit + sandboxed)
`make leakage`	Manual leakage audit against train volume
`make clean`	Remove volumes and sandbox image

Override defaults: `make smoke ARM=gemini_flash_bare TIMEOUT=600 BUDGET=10`

Configuration

Arms are defined in config/arms.toml. Each arm specifies a model, an agent backend, and whether floop is enabled. Model strings use litellm format — any litellm-supported model works.

Two agent backends are included:

mini_swe — Lightweight agent loop using litellm
claude_code — Wraps the Claude Code CLI

New agents can be added by implementing the protocol in agents/base.py.
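A new backend might look roughly like the following. This is a guess at the general shape, not the actual interface: `Agent`, `AgentResult`, `solve`, and their parameters are hypothetical names, so read agents/base.py for the real protocol before implementing one.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class AgentResult:
    """HYPOTHETICAL result type: what one agent attempt returns."""
    patch: str       # git diff produced by the agent
    cost_usd: float  # API spend for this task


class Agent(Protocol):
    """HYPOTHETICAL protocol: any object with a matching solve() qualifies."""

    def solve(self, issue_text: str, repo_dir: str, system_prompt: str) -> AgentResult: ...


class NoOpAgent:
    """Trivial backend that returns an empty patch and spends nothing."""

    def solve(self, issue_text: str, repo_dir: str, system_prompt: str) -> AgentResult:
        return AgentResult(patch="", cost_usd=0.0)


def run_task(agent: Agent, issue_text: str, repo_dir: str, system_prompt: str = "") -> AgentResult:
    """Dispatch one task to whichever backend was configured."""
    return agent.solve(issue_text, repo_dir, system_prompt)


result = run_task(NoOpAgent(), "Fix off-by-one in pagination", "/tmp/repo")
print(repr(result.patch), result.cost_usd)
```

A structural protocol like this lets a backend be added without inheriting from a base class, which suits wrappers around external CLIs.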

Dataset

50 tasks sampled from SWE-bench Verified (seed 42), stratified by repo into 30 train / 20 eval. The split is committed at config/splits.json.
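One way to build a seeded, repo-stratified 30/20 split is sketched below. It is an illustration of the idea, not necessarily the procedure that produced config/splits.json; the function name, the repo names, and the task ids are all made up.

```python
import random
from collections import defaultdict


def stratified_split(tasks, n_train, seed=42):
    """Split (task_id, repo) pairs into train/eval lists, keeping each repo's
    share of train tasks close to the overall n_train / len(tasks) ratio."""
    rng = random.Random(seed)
    by_repo = defaultdict(list)
    for task_id, repo in tasks:
        by_repo[repo].append(task_id)

    # Shuffle within each repo, then lay the repos end to end.
    ordered = []
    for repo in sorted(by_repo):
        ids = sorted(by_repo[repo])
        rng.shuffle(ids)
        ordered.extend(ids)

    # Walk the list and hand out train slots at an even rate.
    total = len(ordered)
    train, evals = [], []
    for i, task_id in enumerate(ordered):
        takes_train_slot = ((i + 1) * n_train) // total > (i * n_train) // total
        (train if takes_train_slot else evals).append(task_id)
    return train, evals


# HYPOTHETICAL task list: 50 ids spread over four made-up repos.
repo_sizes = {"repo_a": 20, "repo_b": 15, "repo_c": 10, "repo_d": 5}
tasks = [(f"{repo}-{n}", repo) for repo, size in repo_sizes.items() for n in range(size)]

train, evals = stratified_split(tasks, n_train=30)
print(len(train), len(evals))  # prints "30 20"
```

Fixing the seed and committing the resulting split means every run is scored on the same eval tasks, which is what makes results comparable across runs.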

Cost Controls

The orchestrator tracks cumulative API spend and halts when --budget is exceeded. Interrupted runs resume automatically. Use scripts.estimate_cost for projections.
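The budget-and-resume behavior described above can be sketched as a loop that persists its state after every task. This is a simplified model under our own assumptions, not the orchestrator's code: the state file format, the function name, and the choice to stop once spend reaches the budget are all illustrative.

```python
import json
import tempfile
from pathlib import Path


def run_with_budget(task_ids, run_task, budget_usd, state_path):
    """Run tasks in order, saving progress so an interrupted run can resume."""
    state = {"spent": 0.0, "done": []}
    if state_path.exists():
        state = json.loads(state_path.read_text())  # resume from the last save
    for task_id in task_ids:
        if task_id in state["done"]:
            continue  # already completed in an earlier run
        if state["spent"] >= budget_usd:
            break  # budget reached: halt before starting another task
        state["spent"] += run_task(task_id)
        state["done"].append(task_id)
        state_path.write_text(json.dumps(state))
    return state


def fake_task(task_id):
    """Stand-in for a real agent run; pretends every task costs $0.50."""
    return 0.5


task_ids = ["t1", "t2", "t3", "t4", "t5"]
with tempfile.TemporaryDirectory() as tmp:
    state_file = Path(tmp) / "state.json"
    first = run_with_budget(task_ids, fake_task, budget_usd=1.0, state_path=state_file)
    resumed = run_with_budget(task_ids, fake_task, budget_usd=3.0, state_path=state_file)

print(first["done"])     # prints "['t1', 't2']"
print(resumed["spent"])  # prints "2.5"
```

Writing the state file after each task, rather than once at the end, is what lets a killed run pick up without repeating paid work.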

License

Apache License 2.0
