Autonomous goal-driven experimentation for Codex.
English · 🇨🇳 中文 · 🇯🇵 日本語 · 🇰🇷 한국어 · 🇫🇷 Français · 🇩🇪 Deutsch · 🇪🇸 Español · 🇧🇷 Português · 🇷🇺 Русский
The idea: tell Codex what you want to improve, then walk away. It modifies your code, verifies the result, keeps or discards, and repeats. You come back to a log of experiments and a better codebase.
Inspired by Karpathy's autoresearch, generalized beyond ML to anything you can verify mechanically: test coverage, type errors, latency, lint warnings, security findings, release readiness — if a command can tell whether it improved, the loop can iterate on it.
# Install
```bash
git clone https://github.com/leo-lilinxiao/codex-autoresearch.git
cp -r codex-autoresearch your-project/.agents/skills/codex-autoresearch
```

Open Codex in your project and go:
```
You:   $codex-autoresearch
       I want to get rid of all the `any` types in my TypeScript code

Codex: I found 47 `any` occurrences across src/**/*.ts.
       Metric: `any` count (current: 47), direction: lower
       Verify: grep count + tsc --noEmit as guard
       Run mode: foreground or background?

You:   Background, go. Run overnight.

Codex: Starting background run — baseline: 47. Iterating.
```
Each improvement stacks. Each failure reverts. Everything is logged.
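For this goal, the verify step is just two commands. A minimal sketch, assuming a conventional `src/` layout and GNU grep (the exact commands Codex proposes depend on your repo):

```bash
# Metric: number of `any` annotations (lower is better).
# GNU grep shown; adjust the \b word boundary for BSD grep.
grep -rnoE ':[[:space:]]*any\b' src --include='*.ts' | wc -l

# Guard: the change must still type-check; any error here means revert.
npx tsc --noEmit
```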
See INSTALL.md for more options. See GUIDE.md for the full manual.
```
You say one sentence → Codex scans & confirms → You say "go"
                             |
              +--------------+--------------+
              |                             |
         foreground                    background
     (current session)           (detached, overnight)
              |                             |
              +--------------+--------------+
                             |
                             v
                  +-------------------+
                  |     The Loop      |
                  |                   |
                  | modify one thing  |
                  | git commit        |
                  | run verify        |
                  | improved? keep    |
                  | worse?    revert  |
                  | log the result    |
                  | repeat            |
                  +-------------------+
```
That's it. You pick one: foreground keeps the loop in your current session, background hands it off to a detached process so you can sleep. Same loop either way; a single run uses one mode or the other, never both at once.
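To make the keep/revert mechanics concrete, here is one iteration written out by hand. This is a sketch, not the skill's actual implementation; `verify.sh` stands in for whatever verify command you confirmed:

```bash
baseline=$(./verify.sh)                  # e.g. current `any` count: 47

# ... one focused change is made here ...
git commit -am "experiment: narrow types in auth module"

result=$(./verify.sh)
if [ "$result" -lt "$baseline" ]; then   # direction: lower is better
  echo "keep: $baseline -> $result"
else
  git revert --no-edit HEAD              # worse or unchanged: roll back
  echo "discard: $baseline -> $result"
fi
# Either way, the attempt is appended to the experiment log.
```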
| You say | What happens |
|---|---|
| "Improve my test coverage" | Iterates until target or interrupted |
| "Fix the 12 failing tests" | Repairs one by one until zero remain |
| "Why is the API returning 503?" | Hunts root cause with falsifiable hypotheses |
| "Is this code secure?" | STRIDE + OWASP audit, every finding backed by code evidence |
| "Ship it" | Verifies readiness, generates checklist, gates release |
| "I want to optimize but don't know what" | Analyzes repo, suggests metrics, generates config |
Behind the scenes, Codex maps your sentence to one of 7 modes (loop, plan, debug, fix, security, ship, exec). You never need to pick one.
You don't write config. Codex infers everything from your sentence and your repo:
| What it needs | How it gets it | Example |
|---|---|---|
| Goal | Your sentence | "get rid of all any types" |
| Scope | Scans repo structure | src/**/*.ts |
| Metric | Proposes based on goal + tooling | any count (current: 47) |
| Direction | Infers from "improve" / "reduce" / "eliminate" | lower |
| Verify | Matches to repo tooling | grep count + tsc --noEmit |
| Guard | Suggests if regression risk exists | npm test |
Before starting, Codex always shows what it found and asks you to confirm. Then you choose foreground or background and say "go."
Instead of blind retrying, the loop escalates:
| Trigger | Action |
|---|---|
| 3 consecutive failures | REFINE — adjust within current strategy |
| 5 consecutive failures | PIVOT — try a fundamentally different approach |
| 2 PIVOTs without progress | Web search — look for external solutions |
| 3 PIVOTs without progress | Stop — report that human input is needed |
One success resets all counters.
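The same policy restated as logic, if that's easier to follow. Illustrative pseudocode only; the skill tracks these counters internally, and whether the failure counter resets after a PIVOT is an assumption here:

```bash
fails=0; pivots=0

record_result() {   # $1 is "success" or "failure"
  if [ "$1" = success ]; then fails=0; pivots=0; return; fi
  fails=$((fails + 1))
  [ "$fails" -eq 3 ] && echo "REFINE: adjust within current strategy"
  if [ "$fails" -ge 5 ]; then
    fails=0; pivots=$((pivots + 1))      # assumption: failures reset per PIVOT
    case "$pivots" in
      1) echo "PIVOT: try a fundamentally different approach" ;;
      2) echo "PIVOT, plus web search for external solutions" ;;
      3) echo "STOP: human input needed" ;;
    esac
  fi
}
```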
Every iteration is recorded in `research-results.tsv`:

```
iteration  commit   metric  delta  status    description
0          a1b2c3d  47       0     baseline  initial any count
1          b2c3d4e  41      -6     keep      replace any in auth module
2          -        49      +8     discard   generic wrapper introduced new anys
3          d4e5f6g  38      -3     keep      type-narrow API response handlers
```
Failed experiments are reverted in git but stay in the log. The log is the real audit trail.
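Because it's plain TSV, standard tools can slice it (column numbers follow the header above):

```bash
awk -F'\t' '$5 == "keep"' research-results.tsv   # kept experiments only
cut -f3 research-results.tsv | tail -1           # most recent metric value
```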
These are covered in detail in GUIDE.md:
- Cross-run learning — lessons from past runs bias future hypothesis generation
- Parallel experiments — test up to 3 hypotheses simultaneously via git worktrees (sketched after this list)
- Session resume — interrupted runs pick up from the last consistent state
- CI/CD mode (`exec`) — non-interactive, JSON output, for automation pipelines
- Dual-gate verification — separate verify (did it improve?) and guard (did anything break?)
- Session hooks — auto-installed; keep Codex on track across session boundaries
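For the parallel-experiments feature, git worktrees are what let hypotheses run side by side without stepping on each other. Conceptually (the skill manages this for you; paths and branch names here are illustrative):

```bash
# Each hypothesis gets its own working directory on its own branch.
git worktree add ../exp-a -b autoresearch/exp-a
git worktree add ../exp-b -b autoresearch/exp-b

# Run the verify command in each, keep the winner, clean up the rest.
git worktree remove ../exp-b
git branch -D autoresearch/exp-b
```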
**It only makes small incremental changes. Can it try bigger ideas?** By default the loop favors small, verifiable steps — that's by design. But it can go bigger: describe a larger hypothesis in your prompt (e.g., "try replacing the attention mechanism with linear attention and run the full eval"), and it will treat that as a single experiment to verify. The loop is best when the human sets the research direction and the agent does the heavy execution and analysis.
**Is this more for engineering optimization than research?** It's strongest when the goal and metric are clear — push coverage up, push errors down, push latency lower. For open-ended research where the direction itself is uncertain, use plan mode first to explore, then switch to loop once you know what to measure. Think of it as a human-AI collaboration: you provide judgment, it provides iteration speed.
**How do I stop it?** Foreground: interrupt Codex. Background: run `$codex-autoresearch` again and ask it to stop.
**Can it resume after interruption?** Yes. It resumes from `autoresearch-state.json` automatically.
**Can I run multiple sessions against the same repo at once?** Yes, if each run uses its own artifact paths or directory (`research-results.tsv`, `autoresearch-state.json`, launch/runtime/log files). Background runs support this directly. Avoid overlapping edit scopes in the same worktree unless you really want both sessions touching the same files.
**How do I use it in CI?** Use `exec` mode via `codex exec`: all config upfront, JSON output, exit codes 0/1/2.
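A hypothetical CI step (the prompt wording and output file name are illustrative; the exact exec-mode config fields and exit-code meanings are in GUIDE.md):

```bash
codex exec '$codex-autoresearch in exec mode. Goal: zero `any` types in src/. Verify: grep count + tsc --noEmit. Output: JSON.' \
  > autoresearch-result.json
status=$?
echo "autoresearch exited with $status"   # 0/1/2, per GUIDE.md
exit "$status"
```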
| Doc | What it covers |
|---|---|
| INSTALL.md | All installation methods, skill discovery paths, hooks setup |
| GUIDE.md | Full operator's manual: modes, config fields, safety model, advanced usage |
| EXAMPLES.md | Recipes by domain: coverage, performance, types, security, etc. |
Built on ideas from Karpathy's autoresearch. The Codex skills platform is by OpenAI.
MIT — see LICENSE.
