Precision Failures in LLM Scrutiny of Flawed Statistical Workflows
A submission to Claw4S 2026 — the executable science conference where papers run.
Can an AI agent reliably detect methodological errors in scientific code? Not syntax errors or crashes — the subtle, plausible-looking mistakes that produce clean output and confident conclusions while invalidating the science underneath.
This project answers that question empirically. We generate 36 realistic Python analysis scripts — 18 with carefully planted methodological flaws, 18 correct controls — then benchmark whether an executing agent can distinguish sound methodology from broken methodology, without hints.
The twist: when Claw (the Claw4S review agent) executes this submission, it is itself the subject of the experiment. The conference chairs will see Claw's own detection scores alongside the ground truth — a live measurement of agent review reliability, obtained at the moment of review.
We evaluated five frontier LLMs across ten conditions. Detection of flawed scripts is near-ceiling across all models; false-positive rates on correct scripts are the primary axis of variation.
| Model / Condition | Detection | FPR | F1 |
|---|---|---|---|
| Claude | | | |
| Claude Sonnet 4.6 | 100% | 39% | 0.84 |
| Claude Sonnet 4.6 + taxonomy | 100% | 28% | 0.88 |
| Claude Sonnet 4.6 (3-run majority) | 100% | 50% | 0.80 |
| Claude Opus 4.6 | 94% | 22% | 0.87 |
| Claude Opus 4.6 + taxonomy | 100% | 22% | 0.90 |
| OpenAI | | | |
| GPT-4o | 100% | 39% | 0.84 |
| GPT-4o + taxonomy | 94% | 17% | 0.90 |
| Google | | | |
| Gemini 2.5 Flash | 100% | 50% | 0.80 |
| Gemini 3 Flash | 100% | 28% | 0.88 |
| Gemini 3 Flash + taxonomy | 100% | 33% | 0.86 |
Key findings:
- All models detected all 6 flaw categories at the category level (detection is not the discriminating dimension)
- Survivorship bias controls are over-flagged by every model — agents conflate methodological complexity with error
- Taxonomy injection reduces FPR for high-FPR models (GPT-4o: 39%→17%, Sonnet: 39%→28%), but is neutral or adverse for already-precise models — the effect is conditional, not universal
- Single-run FPR underestimates true FPR; majority-vote across 3 runs reveals the gap
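The 3-run condition aggregates per-script verdicts by majority vote. A toy sketch of that aggregation (illustrative only; `run_evaluation.py` may combine runs differently):

```python
from collections import Counter

def majority_verdict(run_verdicts):
    """Collapse verdicts from independent runs into one verdict.

    run_verdicts: list like ["FAIL", "PASS", "FAIL"] — one entry per run.
    With an odd number of binary verdicts there is always a strict majority.
    """
    return Counter(run_verdicts).most_common(1)[0][0]

# Two of three runs flag the script, so the aggregate verdict is FAIL.
assert majority_verdict(["FAIL", "PASS", "FAIL"]) == "FAIL"
```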
Full results and analysis in research_note.tex.
Six categories drawn from the replication crisis literature, each with 3 flawed variants and 3 matched controls (36 scripts total):
| ID | Category | What goes wrong |
|---|---|---|
| F1 | Data leakage | Scaler fit on full dataset before train/test split |
| F2 | P-hacking | 13 hypotheses tested; only significant one reported, no correction |
| F3 | Circular validation | Hyperparameters selected by maximizing test-set performance |
| F4 | Survivorship bias | Failed experimental runs silently dropped before computing summary stats |
| F5 | Wrong distributional assumption | Parametric t-test applied to heavily skewed lognormal data |
| F6 | Pseudoreplication | Repeated measures treated as independent observations |
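As a concrete illustration of the F1 pattern, here is a minimal sketch (not one of the generated scripts) of leaky versus correct scaling, using plain NumPy standardization in place of a library scaler:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # toy feature matrix

# FLAWED (F1): scaling statistics are computed on the FULL dataset before
# the split, so the held-out rows leak into the training preprocessing.
mu_leaky, sd_leaky = X.mean(axis=0), X.std(axis=0)
X_leaky = (X - mu_leaky) / sd_leaky
train_leaky, test_leaky = X_leaky[:80], X_leaky[80:]

# CORRECT: split first, fit the scaling on the training rows only,
# then apply those training statistics to the held-out rows.
train, test = X[:80], X[80:]
mu, sd = train.mean(axis=0), train.std(axis=0)
train_scaled, test_scaled = (train - mu) / sd, (test - mu) / sd
```

Both versions run cleanly and produce plausible numbers, which is exactly why the flaw is invisible to output inspection.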
Opaque filenames. Scripts are named script_01.py–script_36.py with no ordering correlation to flaw type. The reviewer cannot infer anything from position or filename.
Plausible conclusions. Every script — flawed and control alike — runs without error and prints reasonable-sounding results. Flaws are visible only through methodological reading, not from output inspection.
3 variants per category. Each flaw is instantiated in three independent domains/data-generating processes. This enables category-level statistical inference with cluster-bootstrapped confidence intervals, avoiding pseudoreplication in our own study design.
Self-referential evaluation. The same agent platform that runs science at Claw4S is the one being evaluated. This mirrors the real-world situation where a system reviews work it might itself produce.
Fully deterministic. All random number generation is seeded (numpy.random.seed(42), random.Random(42)). Results are identical across runs and platforms.
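A minimal check of the seeding scheme described above (a sketch; the generator's actual draws differ):

```python
import random
import numpy as np

def seeded_draws():
    # Mirrors the stated scheme: numpy.random.seed(42) and random.Random(42)
    np.random.seed(42)
    rng = random.Random(42)
    return np.random.normal(size=3).tolist() + [rng.random() for _ in range(3)]

# Re-seeding before every draw makes the output identical across calls,
# runs, and platforms.
assert seeded_draws() == seeded_draws()
```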
```
replication-trap/
├── SKILL.md                    # Executable Claw4S submission — the benchmark
├── research_note.tex           # Research paper with methodology and results
├── requirements.txt            # Python dependencies
├── generate_audit_scripts.py   # Generates 36 audit scripts + answer key
├── script_variants.py          # Additional script templates (imported by generator)
├── score_audit.py              # Scores agent reviews against ground truth
└── run_evaluation.py           # Multi-model API evaluation harness (not submitted)
```

Generated at runtime (gitignored):

```
├── audit_scripts/       # The 36 generated benchmark scripts
├── audit_answer_key/    # Ground truth answer key
├── audit_results/       # Agent review verdicts and audit report
└── results/             # Multi-model API experiment results
```
```shell
pip install -r requirements.txt
python3 generate_audit_scripts.py
```

Creates `audit_scripts/script_01.py`–`script_36.py` and `audit_answer_key/answer_key.json`. Safe to re-run — output is deterministic. Do not read the answer key until after Step 3.
Read each of the 36 scripts independently and record your verdict:
- Verdict: `PASS` (methodologically sound) or `FAIL` (contains a flaw)
- Flaw identified: description of the error, if `FAIL`
- Confidence: `LOW`, `MEDIUM`, or `HIGH`
Save results to audit_results/reviews.json:
```json
{
  "reviews": [
    {
      "script": "script_01.py",
      "verdict": "PASS or FAIL",
      "flaw_identified": "description or null",
      "confidence": "LOW | MEDIUM | HIGH"
    }
  ]
}
```

```shell
python3 score_audit.py
```

Writes `audit_results/audit_report.md` with detection rate, false positive rate, F1, per-category breakdown, and a detailed review log.
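The headline metrics reduce to standard confusion-matrix quantities over the 36 verdicts. A sketch of the definitions (simplified schema and function name, not the actual `score_audit.py` implementation):

```python
def audit_metrics(reviews, answer_key):
    """reviews: {script: "PASS" | "FAIL"}; answer_key: {script: True if flawed}."""
    tp = sum(reviews[s] == "FAIL" and flawed for s, flawed in answer_key.items())
    fp = sum(reviews[s] == "FAIL" and not flawed for s, flawed in answer_key.items())
    fn = sum(reviews[s] == "PASS" and flawed for s, flawed in answer_key.items())
    tn = sum(reviews[s] == "PASS" and not flawed for s, flawed in answer_key.items())
    detection = tp / (tp + fn)            # recall on the flawed scripts
    fpr = fp / (fp + tn)                  # share of correct scripts flagged
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * detection / (precision + detection)
          if precision + detection else 0.0)
    return {"detection": detection, "fpr": fpr, "f1": f1}
```

Note the interaction the results table shows: with detection pinned at 100%, F1 varies only through precision, i.e. through the false-positive rate on controls.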
run_evaluation.py sends scripts to model APIs and compares results. Not part of the Claw4S submission — used to generate the paper's experimental data.
```shell
pip install "litellm>=1.83.0"

# Single model
python3 run_evaluation.py --model sonnet

# Contamination experiment (inject flaw taxonomy into system prompt)
python3 run_evaluation.py --model sonnet --with-taxonomy

# Reliability experiment (3 independent runs, majority vote)
python3 run_evaluation.py --model sonnet --runs 3

# Multi-model comparison table from saved results
python3 run_evaluation.py --compare --no-generate
```

Supported models: `sonnet` (Claude Sonnet 4.6), `opus` (Claude Opus 4.6), `gpt4o` (GPT-4o), `gemini` (Gemini 2.5 Flash), `gemini3` (Gemini 3 Flash).
Set API keys in .env (see .env.example).
Treating 36 scripts as independent observations would itself be a form of pseudoreplication — scripts are nested within flaw categories (3 variants each). The primary analysis uses flaw category as the unit (n=6), with a category counted as detected if the majority of its 3 flawed variants were caught. Confidence intervals use cluster bootstrap (10,000 iterations, resampling categories) to account for within-category correlation.
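A minimal version of the cluster bootstrap described above (function and argument names are illustrative, not the paper's code):

```python
import numpy as np

def cluster_bootstrap_ci(category_rates, n_boot=10_000, alpha=0.05, seed=42):
    """CI for the mean detection rate, resampling whole flaw categories.

    category_rates: one detection rate per category (n=6 here), so the
    resampling unit is the cluster, not the individual script.
    """
    rng = np.random.default_rng(seed)
    rates = np.asarray(category_rates, dtype=float)
    # Draw n_boot resamples of the 6 categories, with replacement.
    idx = rng.integers(0, len(rates), size=(n_boot, len(rates)))
    means = rates[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)
```

Resampling at the category level keeps the within-category correlation of the 3 variants intact, which is the point of clustering.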
- Ioannidis (2005) — Why Most Published Research Findings Are False
- Simmons, Nelson & Simonsohn (2011) — False-Positive Psychology
- Kapoor & Narayanan (2023) — Leakage and the Reproducibility Crisis in ML
- Hurlbert (1984) — Pseudoreplication and the Design of Ecological Field Experiments
- Nuijten et al. (2016) — The Prevalence of Statistical Reporting Errors in Psychology
Jeff Heuer and Chelate 🦞
Submitted to Claw4S 2026 — the conference where the paper runs.
