Autonomous goal-driven experimentation for Codex.
English · 🇨🇳 中文 · 🇯🇵 日本語 · 🇰🇷 한국어 · 🇫🇷 Français · 🇩🇪 Deutsch · 🇪🇸 Español · 🇧🇷 Português · 🇷🇺 Русский
The idea: tell Codex what you want to improve, then walk away. It modifies your code, verifies the result, keeps or discards, and repeats. You come back to a log of experiments and a better codebase.
Inspired by Karpathy's autoresearch, generalized beyond ML to anything you can verify mechanically: test coverage, type errors, latency, lint warnings, security findings, release readiness — if a command can tell whether it improved, the loop can iterate on it.
# Install
```bash
git clone https://github.com/leo-lilinxiao/codex-autoresearch.git
cp -r codex-autoresearch your-project/.agents/skills/codex-autoresearch
```

Open Codex in your project and go:
```
You:   $codex-autoresearch
       I want to get rid of all the `any` types in my TypeScript code

Codex: I found 47 `any` occurrences across src/**/*.ts.
       Metric: `any` count (current: 47), direction: lower
       Verify: grep count + tsc --noEmit as guard
       Run mode: foreground or background?

You:   Background, go. Run overnight.

Codex: Starting background run — baseline: 47. Iterating.
```
Each improvement stacks. Each failure reverts. Everything is logged.
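For this goal, the verify step is just two commands. A minimal sketch, assuming a conventional `src/` layout and GNU grep (the exact commands Codex proposes depend on your repo):

```bash
# Metric: number of `any` annotations (lower is better).
# GNU grep shown; adjust the \b word boundary for BSD grep.
grep -rnoE ':[[:space:]]*any\b' src --include='*.ts' | wc -l

# Guard: the change must still type-check; any error here means revert.
npx tsc --noEmit
```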
See INSTALL.md for more options. See GUIDE.md for the full manual.
```
You say one sentence → Codex scans & confirms → You say "go"
                             |
              +--------------+--------------+
              |                             |
         foreground                    background
     (current session)           (detached, overnight)
              |                             |
              +--------------+--------------+
                             |
                             v
                  +-------------------+
                  |     The Loop      |
                  |                   |
                  | modify one thing  |
                  | git commit        |
                  | run verify        |
                  | improved? keep    |
                  | worse?    revert  |
                  | log the result    |
                  | repeat            |
                  +-------------------+
```
That's it. You pick one: foreground keeps the loop in your current session, background hands it off to a detached process so you can sleep. Same loop either way; a single run uses one mode or the other, never both at once.
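To make the keep/revert mechanics concrete, here is one iteration written out by hand. This is a sketch, not the skill's actual implementation; `verify.sh` stands in for whatever verify command you confirmed:

```bash
baseline=$(./verify.sh)                  # e.g. current `any` count: 47

# ... one focused change is made here ...
git commit -am "experiment: narrow types in auth module"

result=$(./verify.sh)
if [ "$result" -lt "$baseline" ]; then   # direction: lower is better
  echo "keep: $baseline -> $result"
else
  git revert --no-edit HEAD              # worse or unchanged: roll back
  echo "discard: $baseline -> $result"
fi
# Either way, the attempt is appended to the experiment log.
```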
| You say | What happens |
|---|---|
| "Improve my test coverage" | Iterates until target or interrupted |
| "Fix the 12 failing tests" | Repairs one by one until zero remain |
| "Why is the API returning 503?" | Hunts root cause with falsifiable hypotheses |
| "Is this code secure?" | STRIDE + OWASP audit, every finding backed by code evidence |
| "Ship it" | Verifies readiness, generates checklist, gates release |
| "I want to optimize but don't know what" | Analyzes repo, suggests metrics, generates config |
Behind the scenes, Codex maps your sentence to one of 7 modes (loop, plan, debug, fix, security, ship, exec). You never need to pick one.
You don't write config. Codex infers everything from your sentence and your repo:
| What it needs | How it gets it | Example |
|---|---|---|
| Goal | Your sentence | "get rid of all any types" |
| Scope | Scans repo structure | src/**/*.ts |
| Metric | Proposes based on goal + tooling | any count (current: 47) |
| Direction | Infers from "improve" / "reduce" / "eliminate" | lower |
| Verify | Matches to repo tooling | grep count + tsc --noEmit |
| Guard | Suggests if regression risk exists | npm test |
Before starting, Codex always shows what it found and asks you to confirm. Then you choose foreground or background and say "go."
Instead of blind retrying, the loop escalates:
| Trigger | Action |
|---|---|
| 3 consecutive failures | REFINE — adjust within current strategy |
| 5 consecutive failures | PIVOT — try a fundamentally different approach |
| 2 PIVOTs without progress | Web search — look for external solutions |
| 3 PIVOTs without progress | Stop — report that human input is needed |
One success resets all counters.
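The same policy restated as logic, if that's easier to follow. Illustrative pseudocode only; the skill tracks these counters internally, and whether the failure counter resets after a PIVOT is an assumption here:

```bash
fails=0; pivots=0

record_result() {   # $1 is "success" or "failure"
  if [ "$1" = success ]; then fails=0; pivots=0; return; fi
  fails=$((fails + 1))
  [ "$fails" -eq 3 ] && echo "REFINE: adjust within current strategy"
  if [ "$fails" -ge 5 ]; then
    fails=0; pivots=$((pivots + 1))      # assumption: failures reset per PIVOT
    case "$pivots" in
      1) echo "PIVOT: try a fundamentally different approach" ;;
      2) echo "PIVOT, plus web search for external solutions" ;;
      3) echo "STOP: human input needed" ;;
    esac
  fi
}
```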
Every iteration is recorded in `research-results.tsv`:

```
iteration  commit   metric  delta  status    description
0          a1b2c3d  47       0     baseline  initial any count
1          b2c3d4e  41      -6     keep      replace any in auth module
2          -        49      +8     discard   generic wrapper introduced new anys
3          d4e5f6g  38      -3     keep      type-narrow API response handlers
```
Failed experiments are reverted in git but stay in the log. The log is the real audit trail.
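Because it's plain TSV, standard tools can slice it (column numbers follow the header above):

```bash
awk -F'\t' '$5 == "keep"' research-results.tsv   # kept experiments only
cut -f3 research-results.tsv | tail -1           # most recent metric value
```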
These are covered in detail in GUIDE.md:
- Cross-run learning — lessons from past runs bias future hypothesis generation
- Parallel experiments — test up to 3 hypotheses simultaneously via git worktrees (sketched after this list)
- Session resume — interrupted runs pick up from the last consistent state
- CI/CD mode (`exec`) — non-interactive, JSON output, for automation pipelines
- Dual-gate verification — separate verify (did it improve?) and guard (did anything break?)
- Session hooks — auto-installed; keep Codex on track across session boundaries
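For the parallel-experiments feature, git worktrees are what let hypotheses run side by side without stepping on each other. Conceptually (the skill manages this for you; paths and branch names here are illustrative):

```bash
# Each hypothesis gets its own working directory on its own branch.
git worktree add ../exp-a -b autoresearch/exp-a
git worktree add ../exp-b -b autoresearch/exp-b

# Run the verify command in each, keep the winner, clean up the rest.
git worktree remove ../exp-b
git branch -D autoresearch/exp-b
```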
**It only makes small incremental changes. Can it try bigger ideas?** By default the loop favors small, verifiable steps — that's by design. But it can go bigger: describe a larger hypothesis in your prompt (e.g., "try replacing the attention mechanism with linear attention and run the full eval"), and it will treat that as a single experiment to verify. The loop is best when the human sets the research direction and the agent does the heavy execution and analysis.
**Is this more for engineering optimization than research?** It's strongest when the goal and metric are clear — push coverage up, push errors down, push latency lower. For open-ended research where the direction itself is uncertain, use plan mode first to explore, then switch to loop once you know what to measure. Think of it as a human-AI collaboration: you provide judgment, it provides iteration speed.
**How do I stop it?** Foreground: interrupt Codex. Background: run `$codex-autoresearch` again and ask it to stop.
**Can it resume after interruption?** Yes. It resumes from `autoresearch-state.json` automatically.
**Can I run multiple sessions against the same repo at once?** Yes, if each run uses its own artifact paths or directory (`research-results.tsv`, `autoresearch-state.json`, launch/runtime/log files). Background runs support this directly. Avoid overlapping edit scopes in the same worktree unless you really want both sessions touching the same files.
**How do I use it in CI?** Use `exec` mode via `codex exec`: all config upfront, JSON output, exit codes 0/1/2.
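A hypothetical CI step (the prompt wording and output file name are illustrative; the exact exec-mode config fields and exit-code meanings are in GUIDE.md):

```bash
codex exec '$codex-autoresearch in exec mode. Goal: zero `any` types in src/. Verify: grep count + tsc --noEmit. Output: JSON.' \
  > autoresearch-result.json
status=$?
echo "autoresearch exited with $status"   # 0/1/2, per GUIDE.md
exit "$status"
```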
| Doc | What it covers |
|---|---|
| INSTALL.md | All installation methods, skill discovery paths, hooks setup |
| GUIDE.md | Full operator's manual: modes, config fields, safety model, advanced usage |
| EXAMPLES.md | Recipes by domain: coverage, performance, types, security, etc. |
Built on ideas from Karpathy's autoresearch. The Codex skills platform is by OpenAI.
MIT — see LICENSE.
