jaharris87/agentic-dev-framework

Agentic Dev Framework

A reusable framework for multi-agent adversarial development workflows using Claude Code (builder) and OpenAI Codex (reviewer).

What This Is

Templates, scripts, and conventions for setting up:

  1. Claude Code as your development agent — writes code, opens PRs, applies review labels, monitors for reviews, and responds to findings.
  2. Codex as your adversarial review panel — software review, methodology review, and red-team review roles, each with distinct prompts.
  3. GitHub Actions as the orchestration layer — a lightweight workflow (no AI API calls) that posts @codex review comments triggered by PR labels.

Every PR gets multi-perspective adversarial review, and the builder agent is required to respond to every finding before the PR is considered complete.

Primary Usage

Start a Claude Code session from this directory and describe the project you want to create. Claude will:

  1. Ask structured clarifying questions (language, domain, test categories, risks, architecture)
  2. Create the project directory and copy the template scaffolding
  3. Fill in all placeholders based on your confirmed requirements
  4. Customize CI, review prompts, and settings for your language/domain
  5. Initialize git, create the GitHub repo, create labels, and push
  6. Guide you through any manual steps (PAT creation, repo secret)

cd /path/to/agentic-dev-framework
claude

# Then describe your project:
# "Create a new Fortran project at ~/projects/hydro-solver that implements
#  a 2D compressible Euler solver with MPI parallelism. Use CMake + CTest.
#  It should have verification tests against Sod's shock tube."

Claude reads the CLAUDE.md in this repo, which tells it how to use the templates to bootstrap your project. You can also run the scripts manually if you prefer — see Manual Setup below.

Architecture

┌──────────────┐     opens PR        ┌──────────────────┐
│ Claude Code  │────────────────────▶│    GitHub PR     │
│ (builder)    │  + applies labels   │                  │
└──────┬───────┘                     └────────┬─────────┘
       │                                      │
       │                              labeled event
       │                                      │
       │                             ┌────────▼─────────┐
       │                             │  GitHub Action    │
       │                             │  (no AI calls)    │
       │                             │  posts @codex     │
       │                             │  review comments  │
       │                             └────────┬─────────┘
       │                                      │
       │                             ┌────────▼─────────┐
       │                             │  Codex            │
       │       reads findings        │  (3 reviewer      │
       │◀────────────────────────────│   roles)          │
       │                             └──────────────────┘
       │
       │  responds to each finding:
       │  • agree and fix
       │  • disagree with evidence
       │  • defer with reason
       ▼

The Three Layers

| Layer | Tool | Role | Calls AI API? |
| --- | --- | --- | --- |
| Builder | Claude Code | Writes code, opens PRs, applies labels, responds to reviews | Yes (Anthropic) |
| Orchestrator | GitHub Actions | Posts @codex review comments with saved prompts | No |
| Reviewers | Codex (via GitHub) | Reviews PRs with role-specific prompts | Yes (OpenAI, via ChatGPT-linked Codex) |

The Three Reviewer Roles

| Role | Label | Focus |
| --- | --- | --- |
| Software Review | codex-software-review | Bugs, edge cases, test gaps, API consistency, docs matching implementation |
| Methodology Review | codex-methodology-review | Domain-specific correctness: numerical methods, evaluation validity, metric alignment |
| Red Team Review | codex-red-team-review | Adversarial falsification: "How could this look good while being wrong?" |

The methodology review is optional — it's most valuable for ML/data science, scientific computing, financial modeling, or simulation projects. The other two are universally applicable.

Prerequisites

  • Claude Code installed and authenticated
  • GitHub CLI (gh) installed and authenticated
  • ChatGPT Plus/Team/Enterprise with Codex access (for the review side)

First-Time Claude Code Setup (Optional)

If this is your first time using Claude Code, configure global settings and run the security precheck. Skip this section if you already have ~/.claude/settings.json configured.

Global settings (~/.claude/settings.json)

See templates/global-settings-example.json for a comprehensive starting point. It covers permissions for common toolchains (Python, C/C++, Fortran, Rust, Go, Node.js) and denies dangerous operations. Copy and customize:

mkdir -p ~/.claude
cp templates/global-settings-example.json ~/.claude/settings.json
# Edit to match your preferences

Key principles:

  • allow: Read-only operations and build/test commands that are safe to run freely.
  • deny: Destructive operations that should never happen without manual intervention.
  • ask: Operations that change shared state (git push, PR creation, file deletion). Always keep git push in ask. Authorization for one push does not carry forward to the next.
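Structurally, such a settings file is a permissions object with allow, ask, and deny lists of tool rules. The sketch below is illustrative only — the shape follows Claude Code's `Tool(specifier)` rule pattern, but these specific rules are hypothetical and not the contents of global-settings-example.json:

```json
{
  "permissions": {
    "allow": [
      "Bash(git status)",
      "Bash(git diff:*)",
      "Bash(make test:*)"
    ],
    "ask": [
      "Bash(git push:*)",
      "Bash(gh pr create:*)",
      "Bash(rm:*)"
    ],
    "deny": [
      "Read(./.env)",
      "Bash(curl:*)"
    ]
  }
}
```

Rules are matched per tool call, so a push approved once under ask does not whitelist later pushes.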

Security hook (defense-in-depth)

The global settings example includes a PreToolUse hook that hard-blocks dangerous patterns on every tool call, as defense-in-depth behind the deny list. Install it:

mkdir -p ~/.claude/hooks
cp templates/hooks/security-precheck.py ~/.claude/hooks/security-precheck.py
chmod +x ~/.claude/hooks/security-precheck.py

This hook catches things the deny list can't express, including:

  • Pipe-to-shell patterns (curl ... | bash)
  • Credential exfiltration via network tools
  • Recursive deletion of critical directories (~/.ssh, ~/.gnupg)
  • Base64 + network tool combinations
  • eval/exec, awk shell escapes, find -exec
  • Read/Edit/Write access to sensitive paths (.env, .ssh, .pem, etc.)
  • echo/printf with file redirection (use the Write tool instead)

The hook is registered in global-settings-example.json under the hooks key. It runs automatically on every PreToolUse event — exit code 0 allows the operation, exit code 2 hard-blocks it. Blocked events are logged to ~/.claude/security-audit.log for post-hoc review.
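The mechanics can be sketched as follows. The stdin-JSON-in, exit-code-out protocol matches Claude Code's PreToolUse hook contract; the patterns and structure below are a simplified illustration of what a hook like security-precheck.py might check, not its actual contents.

```python
"""Simplified sketch of a PreToolUse security hook (illustrative only,
not the actual security-precheck.py). Claude Code invokes the hook with
the tool-call event as JSON on stdin; exit 0 allows, exit 2 hard-blocks."""
import json
import re
import sys

# Patterns a prefix-based deny list can't express (illustrative subset)
BLOCKED = [
    re.compile(r"curl[^|]*\|\s*(ba)?sh"),       # pipe-to-shell
    re.compile(r"rm\s+-rf\s+~/\.(ssh|gnupg)"),  # recursive deletion of key dirs
    re.compile(r"\beval\b|\bexec\b"),           # shell eval/exec
]

def check(command: str) -> bool:
    """Return True if the command should be hard-blocked."""
    return any(p.search(command) for p in BLOCKED)

def main() -> int:
    # Claude Code passes the event on stdin; the entry point would be
    # `sys.exit(main())` when run as a script.
    event = json.load(sys.stdin)
    if event.get("tool_name") == "Bash":
        cmd = event.get("tool_input", {}).get("command", "")
        if check(cmd):
            print(f"Blocked dangerous pattern: {cmd}", file=sys.stderr)
            return 2   # hard block; stderr is fed back to the agent
    return 0           # allow
```

A real hook would also append each blocked event to an audit log before exiting.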

Note on echo/printf: Global settings allow both echo * and printf * for terminal output, but the security hook intentionally blocks file redirection (echo "x" > file.txt). This is by design — use the Write tool for file creation, which provides better auditability.

Manual Setup

If you prefer to set up a project manually rather than through Claude:

Step 1: Bootstrap the project

./scripts/init-project.sh /path/to/your-project

This copies all template files without overwriting existing ones.

Step 2: Fill in templates

Replace all {{PLACEHOLDER}} values in:

  • CLAUDE.md — Builder agent instructions. Include exact build/test/lint commands, architecture, and the PR review workflow with label trigger paths customized for your project.
  • AGENTS.md — Reviewer agent context. Write project-specific risks. Generic risks are useless; specific failure modes are gold.
  • .github/workflows/ci.yml — Replace placeholder steps with your actual toolchain.
  • .github/prompts/ — Condensed one-liner prompts for @codex review triggers (Codex ignores multi-line instructions). Customize the methodology one-liner for your domain, or remove it. Full detailed prompts are in .github/prompts/detailed/ for fallback/manual reviews.

Step 3: Create the PAT and labels

Codex ignores @codex review comments from github-actions[bot]. You need a PAT so comments appear from your user account.

  1. GitHub.com → Settings → Developer settings → Fine-grained tokens
  2. Generate a token with Pull requests: Read and write on your repo
  3. Add as a repository secret named CODEX_TRIGGER_PAT
  4. Create the review labels:

./scripts/create-labels.sh owner/repo

Development Workflow

Test-Driven Development

  1. Write test data / fixtures — static inputs for the feature.
  2. Write failing tests — tests that exercise the expected behavior.
  3. Implement — make the tests pass.
  4. Commit at each step — capture the red-green progression in git history.

The specific test categories depend on your project's domain:

| Category | When to use |
| --- | --- |
| Unit tests | Any project — test individual functions/modules in isolation |
| Integration tests | Multi-module projects — test interactions between components |
| Regression tests | Any project — guard against re-introduced bugs |
| Performance tests | HPC/scientific computing — ensure performance doesn't degrade |
| Verification tests | Scientific computing — confirm numerical convergence rates |
| Validation tests | Physics/engineering — compare against analytical solutions or experimental data |
| Conservation tests | Physics simulations — verify conserved quantities (energy, mass, momentum) |
| Symmetry/invariant tests | Any project with symmetry properties — verify invariants hold |
| MPI/parallel tests | Parallel codes — verify correctness across process counts |

Claude Code follows the TDD cycle when instructed via CLAUDE.md. The key is making it mandatory rather than suggesting it.
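To illustrate the conservation category above, here is a minimal, hypothetical test: a toy 1D finite-volume update written in flux form, where total mass should be preserved to round-off. None of this comes from the framework's templates; it only shows the shape such a test takes.

```python
import numpy as np

def advect_step(u, velocity=1.0, dx=0.1, dt=0.05):
    """Toy 1D upwind advection step in flux form with periodic boundaries.
    Flux-form updates conserve the total of u by construction."""
    flux = velocity * u                       # upwind flux (velocity > 0)
    return u - (dt / dx) * (flux - np.roll(flux, 1))

def test_mass_conservation():
    rng = np.random.default_rng(0)
    u = rng.random(64)
    total_before = u.sum()
    for _ in range(100):
        u = advect_step(u)
    # Total "mass" should survive 100 steps to round-off accuracy
    assert abs(u.sum() - total_before) < 1e-10 * total_before
```

A review prompt can then ask the reviewer to verify that every flux-form invariant the code claims actually has a test like this guarding it.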

Git Conventions

| Practice | Why |
| --- | --- |
| Feature branches + PRs | Never commit to main directly. |
| Granular commits | One commit per logical step, not one per feature. |
| Explicit push approval | Every git push requires confirmation. |
| Clean worktree between branches | Verify clean state before creating new branches. |
| Review CLAUDE.md with each PR | Keep docs in sync with code changes. |
| Audit tests before PRs | Review coverage for gaps and redundancy. |

PR Review Pipeline

1. Claude opens PR + applies labels
   ├── codex-software-review (always)
   ├── codex-methodology-review (if domain code changed)
   └── codex-red-team-review (if evaluation/validation code changed)

2. GitHub Action posts @codex review comments (one per label)

3. Codex reviews arrive (2-5 minutes, sequential)

4. Claude monitors, then responds to EVERY finding:
   ├── Agree and fix → make the change, push, note in reply
   ├── Disagree with evidence → cite code/tests/design decisions
   └── Defer with reason → acknowledge but explain out-of-scope

5. Claude reports summary to user

The forced rebuttal (step 4) is not optional. It prevents reviews from becoming decorative.
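The orchestration job in step 2 can be sketched as a label-gated workflow along these lines. This is a hypothetical, simplified version — the codex-review.yml template in this repo is the authoritative one — and the one-liner prompt body is illustrative:

```yaml
name: Codex review triggers
on:
  pull_request:
    types: [labeled]

jobs:
  software-review:
    if: github.event.label.name == 'codex-software-review'
    runs-on: ubuntu-latest
    # Deduplicate GitHub's occasional duplicate `labeled` events
    concurrency:
      group: software-review-${{ github.event.pull_request.number }}
      cancel-in-progress: true
    steps:
      - name: Post @codex review comment (as a real user, via PAT)
        env:
          GH_TOKEN: ${{ secrets.CODEX_TRIGGER_PAT }}
        run: |
          gh pr comment "${{ github.event.pull_request.number }}" \
            --repo "${{ github.repository }}" \
            --body "@codex review: check for bugs, edge cases, and test gaps"
```

Note the PAT in GH_TOKEN: commenting with the default workflow token would come from github-actions[bot], which Codex ignores.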

Claude Code Collaboration Features

The builder-reviewer loop is one half of the workflow. The other half is the human-AI collaboration loop — how you work with Claude Code across sessions.

Memory system

Claude Code persists context across sessions in ~/.claude/projects/*/memory/. Use memory for:

  • Architectural decisions — why a design was chosen (not just what)
  • User preferences — coding style, review strictness, domain expertise level
  • Risk discoveries — failure modes found during development or review

Use CLAUDE.md for build commands, workflow instructions, and anything every session needs immediately. Memory is for context that's useful but not essential at session start.

Session continuity

| Feature | When to use |
| --- | --- |
| claude --continue | Resume the most recent session with full context |
| /compact | Compress conversation history when approaching context limits |
| Worktrees | Isolated git worktrees for parallel work on multiple features |
| Plan mode (/plan) | Align on approach before complex implementations |
| claude --print | Non-interactive mode for scripted automation |

Context window management

Keep CLAUDE.md lean. If it exceeds ~200 lines, move reference material (API docs, data dictionaries, environment setup) to separate files that Claude can read on demand. CLAUDE.md loads at every session start — everything in it costs context on every interaction.

Extending the Framework

Adding a new review type

  1. Create .github/prompts/your-review.md with a condensed one-liner prompt (Codex ignores multi-line @codex review instructions). Optionally also create .github/prompts/detailed/your-review.md with the full prompt for fallback/manual reviews.
  2. Add a job to .github/workflows/codex-review.yml (copy an existing job, change the label name and prompt file)
  3. Run gh label create "codex-your-review" --description "..." --color "..."
  4. Update CLAUDE.md Step 1 with when to apply the new label

Useful additional review types

| Review Type | Focus | Best For |
| --- | --- | --- |
| Test Design | Test proposals across all categories (unit, integration, regression, verification, validation) | Any project |
| Data Quality | Schema drift, format changes, missing entries, silent imputations | Data-heavy projects |
| Documentation Truth | Compare all docs to actual implementation | Any project |
| Numerical Methods | Stability, convergence, order of accuracy, limiters, boundary conditions | Scientific computing |
| Parallel Correctness | Race conditions, decomposition assumptions, communication patterns, load balance | MPI/threaded codes |
| Experiment Planning | Next experiments, ablation plans, baseline comparisons, stopping rules | Research projects |

Keeping AGENTS.md current

AGENTS.md is written at project creation but must evolve. Stale risk documentation directs reviewer attention to the wrong places. Update it:

  • After each major feature: New components introduce new failure modes.
  • After review cycles where findings were consistently wrong: The risk profile has drifted.
  • At project milestones: Reassess which risks are still load-bearing.
  • After real bugs: Add the failure mode that was missed.

Remove risks that are no longer relevant. A risk about a data pipeline that was replaced is noise.

Making reviews effective

  1. Give reviews authority to block. Define blocking conditions in CLAUDE.md (the template includes merge-blocking criteria).
  2. Require code citations. Every finding must name a file, function, or test gap.
  3. Reward important flaws, not volume. The prompts cap nits at 3.

File Reference

CLAUDE.md                                  # Instructions for Claude when bootstrapping projects
templates/
├── CLAUDE.md                              # Builder agent instructions (for new projects)
├── AGENTS.md                              # Reviewer agent context (for new projects)
├── .github/
│   ├── workflows/
│   │   ├── ci.yml                         # CI pipeline (language-agnostic)
│   │   └── codex-review.yml               # Codex review triggers
│   └── prompts/
│       ├── software-review.md             # One-liner Codex trigger prompt
│       ├── methodology-review.md          # One-liner Codex trigger prompt (optional)
│       ├── red-team-review.md             # One-liner Codex trigger prompt
│       └── detailed/                      # Full prompts for fallback/manual reviews
│           ├── software-review.md
│           ├── methodology-review.md
│           └── red-team-review.md
├── .claude/
│   └── settings.json                      # Project permissions template
├── hooks/
│   └── security-precheck.py               # PreToolUse hook (defense-in-depth)
└── global-settings-example.json           # ~/.claude/settings.json reference

scripts/
├── init-project.sh                        # Copy templates into a new project
└── create-labels.sh                       # Create GitHub review labels

Lessons Learned

Non-obvious things discovered through iteration:

  1. Codex ignores bot comments. github-actions[bot] can't trigger @codex review. You need a PAT so comments appear from a real user.

  2. GitHub fires duplicate labeled events. When applying multiple labels at PR creation, GitHub can fire extra events. Per-job concurrency groups with cancel-in-progress: true deduplicate them.

  3. ready_for_review doesn't fire on non-draft PR creation. Use labeled as the sole trigger and always apply labels.

  4. CLAUDE.md is loaded at conversation start. Existing sessions won't see mid-conversation updates. Start a new session after CLAUDE.md changes. Use --continue to resume a previous session, /compact to manage long sessions.

  5. Forced rebuttal is the most important part. Without it, reviews become decorative. The builder agent must respond to every finding.

  6. Save positive feedback, not just corrections. If Claude makes a good non-obvious choice, confirm it. Otherwise it only learns what NOT to do.

  7. git push must always require approval. One approval does not extend to the next push.

  8. Granular commits capture development progression. One commit per step is more valuable than one commit per feature.

  9. Domain-specific risks are 10x more valuable than generic ones. "Don't introduce bugs" in AGENTS.md is useless. Specific failure modes catch real problems.

  10. Test categories vary by domain. Unit tests alone are insufficient for scientific codes — add verification, validation, conservation, and convergence tests as appropriate.
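To make lesson 10 concrete, a verification test typically checks an observed order of accuracy rather than a single error value. A minimal, hypothetical example (not from the templates): a first-order forward difference should show errors that roughly halve when the step halves.

```python
import math

def forward_diff(f, x, h):
    """First-order forward-difference derivative: error is O(h)."""
    return (f(x + h) - f(x)) / h

def observed_order(f, dfdx, x, h):
    """Estimate convergence order from errors at step h and h/2."""
    e_coarse = abs(forward_diff(f, x, h) - dfdx(x))
    e_fine = abs(forward_diff(f, x, h / 2) - dfdx(x))
    return math.log2(e_coarse / e_fine)

p = observed_order(math.sin, math.cos, x=1.0, h=0.1)
assert 0.8 < p < 1.2   # a first-order scheme should show order ≈ 1
```

Asserting on the order, not the raw error, is what catches a scheme that is stable but silently less accurate than claimed.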

About

Multi-agent adversarial dev workflow templates: Claude Code (builder) + Codex (reviewers) + GitHub Actions (orchestrator)
