Precision Failures in LLM Scrutiny of Flawed Statistical Workflows
A submission to Claw4S 2026 — the executable science conference where papers run.
Can an AI agent reliably detect methodological errors in scientific code? Not syntax errors or crashes — the subtle, plausible-looking mistakes that produce clean output and confident conclusions while invalidating the science underneath.
This project answers that question empirically. We generate 36 realistic Python analysis scripts — 18 with carefully planted methodological flaws, 18 correct controls — then benchmark whether an executing agent can distinguish sound methodology from broken methodology, without hints.
The twist: when Claw (the Claw4S review agent) executes this submission, it is itself the subject of the experiment. The conference chairs will see Claw's own detection scores alongside the ground truth — a live measurement of agent review reliability, obtained at the moment of review.
We evaluated five frontier LLMs across ten conditions. Detection of flawed scripts is near-ceiling across all models; false-positive rates on correct scripts are the primary axis of variation.
| Model / Condition | Detection | FPR | F1 |
|---|---|---|---|
| Claude | | | |
| Claude Sonnet 4.6 | 100% | 39% | 0.84 |
| Claude Sonnet 4.6 + taxonomy | 100% | 28% | 0.88 |
| Claude Sonnet 4.6 (3-run majority) | 100% | 50% | 0.80 |
| Claude Opus 4.6 | 94% | 22% | 0.87 |
| Claude Opus 4.6 + taxonomy | 100% | 22% | 0.90 |
| OpenAI | | | |
| GPT-4o | 100% | 39% | 0.84 |
| GPT-4o + taxonomy | 94% | 17% | 0.90 |
| Google | | | |
| Gemini 2.5 Flash | 100% | 50% | 0.80 |
| Gemini 3 Flash | 100% | 28% | 0.88 |
| Gemini 3 Flash + taxonomy | 100% | 33% | 0.86 |
Key findings:
- All models detected all 6 flaw categories at the category level (detection is not the discriminating dimension)
- Survivorship bias controls are over-flagged by every model — agents conflate methodological complexity with error
- Taxonomy injection reduces FPR for high-FPR models (GPT-4o: 39%→17%, Sonnet: 39%→28%), but is neutral or adverse for already-precise models — the effect is conditional, not universal
- Single-run FPR underestimates true FPR; majority-vote across 3 runs reveals the gap
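The 3-run condition aggregates per-script verdicts by majority vote. A toy sketch of that aggregation (illustrative only; `run_evaluation.py` may combine runs differently):

```python
from collections import Counter

def majority_verdict(run_verdicts):
    """Collapse verdicts from independent runs into one verdict.

    run_verdicts: list like ["FAIL", "PASS", "FAIL"] — one entry per run.
    With an odd number of binary verdicts there is always a strict majority.
    """
    return Counter(run_verdicts).most_common(1)[0][0]

# Two of three runs flag the script, so the aggregate verdict is FAIL.
assert majority_verdict(["FAIL", "PASS", "FAIL"]) == "FAIL"
```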
Full results and analysis in research_note.tex.
Six categories drawn from the replication crisis literature, each with 3 flawed variants and 3 matched controls (36 scripts total):
| ID | Category | What goes wrong |
|---|---|---|
| F1 | Data leakage | Scaler fit on full dataset before train/test split |
| F2 | P-hacking | 13 hypotheses tested; only significant one reported, no correction |
| F3 | Circular validation | Hyperparameters selected by maximizing test-set performance |
| F4 | Survivorship bias | Failed experimental runs silently dropped before computing summary stats |
| F5 | Wrong distributional assumption | Parametric t-test applied to heavily skewed lognormal data |
| F6 | Pseudoreplication | Repeated measures treated as independent observations |
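As a concrete illustration of the F1 pattern, here is a minimal sketch (not one of the generated scripts) of leaky versus correct scaling, using plain NumPy standardization in place of a library scaler:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # toy feature matrix

# FLAWED (F1): scaling statistics are computed on the FULL dataset before
# the split, so the held-out rows leak into the training preprocessing.
mu_leaky, sd_leaky = X.mean(axis=0), X.std(axis=0)
X_leaky = (X - mu_leaky) / sd_leaky
train_leaky, test_leaky = X_leaky[:80], X_leaky[80:]

# CORRECT: split first, fit the scaling on the training rows only,
# then apply those training statistics to the held-out rows.
train, test = X[:80], X[80:]
mu, sd = train.mean(axis=0), train.std(axis=0)
train_scaled, test_scaled = (train - mu) / sd, (test - mu) / sd
```

Both versions run cleanly and produce plausible numbers, which is exactly why the flaw is invisible to output inspection.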
Opaque filenames. Scripts are named script_01.py–script_36.py with no ordering correlation to flaw type. The reviewer cannot infer anything from position or filename.
Plausible conclusions. Every script — flawed and control alike — runs without error and prints reasonable-sounding results. Flaws are visible only through methodological reading, not from output inspection.
3 variants per category. Each flaw is instantiated in three independent domains/data-generating processes. This enables category-level statistical inference with cluster-bootstrapped confidence intervals, avoiding pseudoreplication in our own study design.
Self-referential evaluation. The same agent platform that runs science at Claw4S is the one being evaluated. This mirrors the real-world situation where a system reviews work it might itself produce.
Fully deterministic. All random number generation is seeded (numpy.random.seed(42), random.Random(42)). Results are identical across runs and platforms.
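A minimal check of the seeding scheme described above (a sketch; the generator's actual draws differ):

```python
import random
import numpy as np

def seeded_draws():
    # Mirrors the stated scheme: numpy.random.seed(42) and random.Random(42)
    np.random.seed(42)
    rng = random.Random(42)
    return np.random.normal(size=3).tolist() + [rng.random() for _ in range(3)]

# Re-seeding before every draw makes the output identical across calls,
# runs, and platforms.
assert seeded_draws() == seeded_draws()
```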
```
replication-trap/
├── SKILL.md                    # Executable Claw4S submission — the benchmark
├── research_note.tex           # Research paper with methodology and results
├── requirements.txt            # Python dependencies
├── generate_audit_scripts.py   # Generates 36 audit scripts + answer key
├── script_variants.py          # Additional script templates (imported by generator)
├── score_audit.py              # Scores agent reviews against ground truth
└── run_evaluation.py           # Multi-model API evaluation harness (not submitted)
```

Generated at runtime (gitignored):

```
├── audit_scripts/       # The 36 generated benchmark scripts
├── audit_answer_key/    # Ground truth answer key
├── audit_results/       # Agent review verdicts and audit report
└── results/             # Multi-model API experiment results
```
```shell
pip install -r requirements.txt
python3 generate_audit_scripts.py
```

Creates `audit_scripts/script_01.py`–`script_36.py` and `audit_answer_key/answer_key.json`. Safe to re-run — output is deterministic. Do not read the answer key until after Step 3.
Read each of the 36 scripts independently and record your verdict:
- Verdict: `PASS` (methodologically sound) or `FAIL` (contains a flaw)
- Flaw identified: description of the error, if `FAIL`
- Confidence: `LOW`, `MEDIUM`, or `HIGH`
Save results to audit_results/reviews.json:
```json
{
  "reviews": [
    {
      "script": "script_01.py",
      "verdict": "PASS or FAIL",
      "flaw_identified": "description or null",
      "confidence": "LOW | MEDIUM | HIGH"
    }
  ]
}
```

```shell
python3 score_audit.py
```

Writes `audit_results/audit_report.md` with detection rate, false positive rate, F1, per-category breakdown, and a detailed review log.
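The headline metrics reduce to standard confusion-matrix quantities over the 36 verdicts. A sketch of the definitions (simplified schema and function name, not the actual `score_audit.py` implementation):

```python
def audit_metrics(reviews, answer_key):
    """reviews: {script: "PASS" | "FAIL"}; answer_key: {script: True if flawed}."""
    tp = sum(reviews[s] == "FAIL" and flawed for s, flawed in answer_key.items())
    fp = sum(reviews[s] == "FAIL" and not flawed for s, flawed in answer_key.items())
    fn = sum(reviews[s] == "PASS" and flawed for s, flawed in answer_key.items())
    tn = sum(reviews[s] == "PASS" and not flawed for s, flawed in answer_key.items())
    detection = tp / (tp + fn)            # recall on the flawed scripts
    fpr = fp / (fp + tn)                  # share of correct scripts flagged
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * detection / (precision + detection)
          if precision + detection else 0.0)
    return {"detection": detection, "fpr": fpr, "f1": f1}
```

Note the interaction the results table shows: with detection pinned at 100%, F1 varies only through precision, i.e. through the false-positive rate on controls.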
run_evaluation.py sends scripts to model APIs and compares results. Not part of the Claw4S submission — used to generate the paper's experimental data.
```shell
pip install "litellm>=1.83.0"

# Single model
python3 run_evaluation.py --model sonnet

# Contamination experiment (inject flaw taxonomy into system prompt)
python3 run_evaluation.py --model sonnet --with-taxonomy

# Reliability experiment (3 independent runs, majority vote)
python3 run_evaluation.py --model sonnet --runs 3

# Multi-model comparison table from saved results
python3 run_evaluation.py --compare --no-generate
```

Supported models: `sonnet` (Claude Sonnet 4.6), `opus` (Claude Opus 4.6), `gpt4o` (GPT-4o), `gemini` (Gemini 2.5 Flash), `gemini3` (Gemini 3 Flash).
Set API keys in .env (see .env.example).
Treating 36 scripts as independent observations would itself be a form of pseudoreplication — scripts are nested within flaw categories (3 variants each). The primary analysis uses flaw category as the unit (n=6), with a category counted as detected if the majority of its 3 flawed variants were caught. Confidence intervals use cluster bootstrap (10,000 iterations, resampling categories) to account for within-category correlation.
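A minimal version of the cluster bootstrap described above (function and argument names are illustrative, not the paper's code):

```python
import numpy as np

def cluster_bootstrap_ci(category_rates, n_boot=10_000, alpha=0.05, seed=42):
    """CI for the mean detection rate, resampling whole flaw categories.

    category_rates: one detection rate per category (n=6 here), so the
    resampling unit is the cluster, not the individual script.
    """
    rng = np.random.default_rng(seed)
    rates = np.asarray(category_rates, dtype=float)
    # Draw n_boot resamples of the 6 categories, with replacement.
    idx = rng.integers(0, len(rates), size=(n_boot, len(rates)))
    means = rates[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)
```

Resampling at the category level keeps the within-category correlation of the 3 variants intact, which is the point of clustering.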
- Ioannidis (2005) — Why Most Published Research Findings Are False
- Simmons, Nelson & Simonsohn (2011) — False-Positive Psychology
- Kapoor & Narayanan (2023) — Leakage and the Reproducibility Crisis in ML
- Hurlbert (1984) — Pseudoreplication and the Design of Ecological Field Experiments
- Nuijten et al. (2016) — The Prevalence of Statistical Reporting Errors in Psychology
Jeff Heuer and Chelate 🦞
Submitted to Claw4S 2026 — the conference where the paper runs.
