lhl/codex-autoresearch

 
 


Codex Autoresearch

Aim. Iterate. Arrive.

Autonomous goal-driven experimentation for Codex.


English · 🇨🇳 中文 · 🇯🇵 日本語 · 🇰🇷 한국어 · 🇫🇷 Français · 🇩🇪 Deutsch · 🇪🇸 Español · 🇧🇷 Português · 🇷🇺 Русский


The idea: tell Codex what you want to improve, then walk away. It modifies your code, verifies the result, keeps or discards the change, and repeats. You come back to a log of experiments and a better codebase.

Inspired by Karpathy's autoresearch, generalized beyond ML to anything you can verify mechanically: test coverage, type errors, latency, lint warnings, security findings, release readiness — if a command can tell whether it improved, the loop can iterate on it.

Quick Start

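# Install
git clone https://github.com/leo-lilinxiao/codex-autoresearch.git
# Create the skills directory first so cp does not fail if it is missing
mkdir -p your-project/.agents/skills
cp -r codex-autoresearch your-project/.agents/skills/codex-autoresearch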

Open Codex in your project and go:

You:   $codex-autoresearch
       I want to get rid of all the `any` types in my TypeScript code

Codex: I found 47 `any` occurrences across src/**/*.ts.
       Metric: `any` count (current: 47), direction: lower
       Verify: grep count + tsc --noEmit as guard
       Run mode: foreground or background?

You:   Background, go. Run overnight.

Codex: Starting background run — baseline: 47. Iterating.

Each improvement stacks. Each failure reverts. Everything is logged.

See INSTALL.md for more options. See GUIDE.md for the full manual.

How It Works

You say one sentence  →  Codex scans & confirms  →  You say "go"
                                                        |
                                         +--------------+--------------+
                                         |                             |
                                    foreground                    background
                                  (current session)            (detached, overnight)
                                         |                             |
                                         +--------------+--------------+
                                                        |
                                                        v
                                              +-------------------+
                                              |    The Loop       |
                                              |                   |
                                              |  modify one thing |
                                              |  git commit       |
                                              |  run verify       |
                                              |  improved? keep   |
                                              |  worse? revert    |
                                              |  log the result   |
                                              |  repeat           |
                                              +-------------------+

That's it. You pick one: foreground keeps the loop in your current session, background hands it off to a detached process so you can sleep. Same loop either way, but they don't run at the same time.
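
The loop in the diagram can be sketched in a few lines of Python. This is a minimal illustration, not the skill's actual implementation: `run_loop`, `Iteration`, and the toy `modify`/`revert` callables are hypothetical names, and the "codebase" is a single number standing in for real edits, commits, and verify commands.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Iteration:
    index: int
    metric: float
    status: str  # "baseline", "keep", or "discard"


def run_loop(
    measure: Callable[[], float],
    modify: Callable[[int], None],
    revert: Callable[[], None],
    max_iterations: int,
    lower_is_better: bool = True,
) -> List[Iteration]:
    """Modify one thing, verify, keep or revert, log, repeat."""
    best = measure()
    log = [Iteration(0, best, "baseline")]
    for i in range(1, max_iterations + 1):
        modify(i)           # stands in for "modify one thing" + "git commit"
        metric = measure()  # stands in for "run verify"
        improved = metric < best if lower_is_better else metric > best
        if improved:
            best = metric   # keep: the change becomes the new reference point
            status = "keep"
        else:
            revert()        # discard: roll the change back
            status = "discard"
        log.append(Iteration(i, metric, status))  # failures are logged too
    return log


# Toy stand-ins for a real repo: the metric is just a number we nudge.
state = {"current": 47, "previous": 47}
deltas = {1: -6, 2: +8, 3: -3}


def modify(i: int) -> None:
    state["previous"] = state["current"]
    state["current"] += deltas[i]


def revert() -> None:
    state["current"] = state["previous"]


log = run_loop(lambda: state["current"], modify, revert, max_iterations=3)
print([(it.metric, it.status) for it in log])
# [(47, 'baseline'), (41, 'keep'), (49, 'discard'), (38, 'keep')]
```

Each kept change becomes the reference for the next comparison, which is why improvements stack while a discarded attempt leaves the metric where it was.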

What You Say vs What Happens

| You say | What happens |
| --- | --- |
| "Improve my test coverage" | Iterates until target or interrupted |
| "Fix the 12 failing tests" | Repairs one by one until zero remain |
| "Why is the API returning 503?" | Hunts root cause with falsifiable hypotheses |
| "Is this code secure?" | STRIDE + OWASP audit, every finding backed by code evidence |
| "Ship it" | Verifies readiness, generates checklist, gates release |
| "I want to optimize but don't know what" | Analyzes repo, suggests metrics, generates config |

Behind the scenes, Codex maps your sentence to one of 7 modes (loop, plan, debug, fix, security, ship, exec). You never need to pick one.

What Codex Figures Out

You don't write config. Codex infers everything from your sentence and your repo:

| What it needs | How it gets it | Example |
| --- | --- | --- |
| Goal | Your sentence | "get rid of all any types" |
| Scope | Scans repo structure | src/**/*.ts |
| Metric | Proposes based on goal + tooling | any count (current: 47) |
| Direction | Infers from "improve" / "reduce" / "eliminate" | lower |
| Verify | Matches to repo tooling | grep count + tsc --noEmit |
| Guard | Suggests if regression risk exists | npm test |

Before starting, Codex always shows what it found and asks you to confirm. Then you choose foreground or background and say "go."
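
To make the confirmation step concrete, the six inferred fields can be pictured as one small record that is shown back to you before the run starts. This is only a sketch: `RunConfig` and `summary` are hypothetical names, and the `"higher"` direction value is an assumption (the README only shows `lower`).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical container for the fields Codex infers before a run."""
    goal: str
    scope: str
    metric: str
    direction: str               # "lower" or "higher" ("higher" is assumed)
    verify: str
    guard: Optional[str] = None  # only suggested when regression risk exists

    def __post_init__(self) -> None:
        if self.direction not in ("lower", "higher"):
            raise ValueError(
                f"direction must be 'lower' or 'higher', got {self.direction!r}"
            )

    def summary(self) -> str:
        """Render the fields as the confirmation text shown to the user."""
        rows = [
            ("Goal", self.goal),
            ("Scope", self.scope),
            ("Metric", self.metric),
            ("Direction", self.direction),
            ("Verify", self.verify),
            ("Guard", self.guard or "(none)"),
        ]
        return "\n".join(f"{name}: {value}" for name, value in rows)


config = RunConfig(
    goal="get rid of all any types",
    scope="src/**/*.ts",
    metric="any count (current: 47)",
    direction="lower",
    verify="grep count + tsc --noEmit",
    guard="npm test",
)
print(config.summary())
```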

When It Gets Stuck

Instead of blind retrying, the loop escalates:

| Trigger | Action |
| --- | --- |
| 3 consecutive failures | REFINE: adjust within current strategy |
| 5 consecutive failures | PIVOT: try a fundamentally different approach |
| 2 PIVOTs without progress | Web search: look for external solutions |
| 3 PIVOTs without progress | Stop: report that human input is needed |

One success resets all counters.
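
The escalation ladder can be read as a small counter-based state machine. The sketch below is one possible interpretation, not the skill's real logic: the `Escalation` class and its action strings are hypothetical, and it assumes that a pivot starts a fresh failure streak and that the third pivot trigger stops the run instead of pivoting again.

```python
from dataclasses import dataclass


@dataclass
class Escalation:
    """Hypothetical tracker for the REFINE / PIVOT / search / stop ladder."""
    consecutive_failures: int = 0
    pivots_without_progress: int = 0

    def record(self, success: bool) -> str:
        """Record one iteration result and return the next action."""
        if success:
            # One success resets all counters.
            self.consecutive_failures = 0
            self.pivots_without_progress = 0
            return "continue"

        self.consecutive_failures += 1
        if self.consecutive_failures == 3:
            return "refine"
        if self.consecutive_failures >= 5:
            # Assumption: each pivot begins a new failure streak.
            self.consecutive_failures = 0
            self.pivots_without_progress += 1
            if self.pivots_without_progress >= 3:
                return "stop"
            if self.pivots_without_progress == 2:
                return "web_search"
            return "pivot"
        return "continue"


tracker = Escalation()
actions = [tracker.record(False) for _ in range(15)]
print(actions[2], actions[4], actions[9], actions[14])
# refine pivot web_search stop
```

Under these assumptions, fifteen straight failures walk the whole ladder, while a single success anywhere along the way drops the tracker back to the start.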

Results Log

Every iteration is recorded in research-results.tsv:

iteration  commit   metric  delta   status    description
0          a1b2c3d  47      0       baseline  initial any count
1          b2c3d4e  41      -6      keep      replace any in auth module
2          -        49      +8      discard   generic wrapper introduced new anys
3          d4e5f6a  38      -3      keep      type-narrow API response handlers

Failed experiments revert from git but stay in the log. The log is the real audit trail.
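
Because the log is plain TSV, it is easy to post-process. The following sketch assumes the file is tab-delimited with the header row shown above; `summarize` is a hypothetical helper, and `SAMPLE_LOG` holds rows adapted from the sample rather than real output.

```python
import csv
import io

# Rows adapted from the sample log; a real run would read research-results.tsv.
SAMPLE_LOG = "\n".join([
    "iteration\tcommit\tmetric\tdelta\tstatus\tdescription",
    "0\ta1b2c3d\t47\t0\tbaseline\tinitial any count",
    "1\tb2c3d4e\t41\t-6\tkeep\treplace any in auth module",
    "2\t-\t49\t+8\tdiscard\tgeneric wrapper introduced new anys",
    "3\td4e5f6a\t38\t-3\tkeep\ttype-narrow API response handlers",
])


def summarize(tsv_text: str) -> dict:
    """Summarize a results log: starting metric, current metric, and counts."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    # Only the baseline and kept iterations survive in the codebase.
    surviving = [r for r in rows if r["status"] in ("baseline", "keep")]
    return {
        "baseline": float(rows[0]["metric"]),
        "current": float(surviving[-1]["metric"]),
        "kept": sum(r["status"] == "keep" for r in rows),
        "discarded": sum(r["status"] == "discard" for r in rows),
    }


summary = summarize(SAMPLE_LOG)
print(summary)
# {'baseline': 47.0, 'current': 38.0, 'kept': 2, 'discarded': 1}
```

Note that in the sample the delta for iteration 3 is measured against the last kept metric (41), not against the discarded attempt (49).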

More Features

These are covered in detail in GUIDE.md:

  • Cross-run learning — lessons from past runs bias future hypothesis generation
  • Parallel experiments — test up to 3 hypotheses simultaneously via git worktrees
  • Session resume — interrupted runs pick up from the last consistent state
  • CI/CD mode (exec) — non-interactive, JSON output, for automation pipelines
  • Dual-gate verification — separate verify (did it improve?) and guard (did anything break?)
  • Session hooks — auto-installed; keep Codex on track across session boundaries

FAQ

It only makes small incremental changes. Can it try bigger ideas?
By default the loop favors small, verifiable steps; that's by design. But it can go bigger: describe a larger hypothesis in your prompt (e.g., "try replacing the attention mechanism with linear attention and run the full eval"), and it will treat that as a single experiment to verify. The loop works best when the human sets the research direction and the agent does the heavy execution and analysis.

Is this more for engineering optimization than research?
It's strongest when the goal and metric are clear: push coverage up, push errors down, push latency lower. For open-ended research where the direction itself is uncertain, use plan mode first to explore, then switch to loop once you know what to measure. Think of it as a human-AI collaboration: you provide judgment, it provides iteration speed.

How do I stop it?
Foreground: interrupt Codex. Background: invoke $codex-autoresearch again and ask it to stop.

Can it resume after interruption?
Yes. It resumes from autoresearch-state.json automatically.

Can I run multiple sessions against the same repo at once?
Yes, if each run uses its own artifact paths or directory (research-results.tsv, autoresearch-state.json, launch/runtime/log files). Background runs support this directly. Avoid overlapping edit scopes in the same worktree unless you really want both sessions touching the same files.

How do I use it in CI?
Use exec mode with codex exec: all config is supplied upfront, output is JSON, and exit codes are 0/1/2.

Documentation

| Doc | What it covers |
| --- | --- |
| INSTALL.md | All installation methods, skill discovery paths, hooks setup |
| GUIDE.md | Full operator's manual: modes, config fields, safety model, advanced usage |
| EXAMPLES.md | Recipes by domain: coverage, performance, types, security, etc. |

Acknowledgments

Built on ideas from Karpathy's autoresearch. The Codex skills platform is by OpenAI.


License

MIT — see LICENSE.
