jaharris87/agentic-dev-framework

Agentic Dev Framework

A reusable framework for multi-agent adversarial development workflows using Claude Code (builder) and OpenAI Codex (reviewer).

What This Is

Templates, scripts, and conventions for setting up:

  1. Claude Code as your development agent — writes code, opens PRs, applies review labels, monitors for reviews, and responds to findings.
  2. Codex as your adversarial review panel — software review, methodology review, and red-team review roles, each with distinct prompts.
  3. GitHub Actions as the orchestration layer — a lightweight workflow (no AI API calls) that posts @codex review comments triggered by PR labels.

Every PR gets multi-perspective adversarial review, and the builder agent is required to respond to every finding before the PR is considered complete.

Primary Usage

Start a Claude Code session from this directory and describe the project you want to create. Claude will:

  1. Ask structured clarifying questions (language, domain, test categories, risks, architecture)
  2. Create the project directory and copy the template scaffolding
  3. Fill in all placeholders based on your confirmed requirements
  4. Customize CI, review prompts, and settings for your language/domain
  5. Initialize git, create the GitHub repo, create labels, and push
  6. Guide you through any manual steps (PAT creation, repo secret)

cd /path/to/agentic-dev-framework
claude

# Then describe your project:
# "Create a new Fortran project at ~/projects/hydro-solver that implements
#  a 2D compressible Euler solver with MPI parallelism. Use CMake + CTest.
#  It should have verification tests against Sod's shock tube."

Claude reads the CLAUDE.md in this repo, which tells it how to use the templates to bootstrap your project. You can also run the scripts manually if you prefer — see Manual Setup below.

Architecture

┌──────────────┐     opens PR        ┌──────────────────┐
│ Claude Code  │────────────────────▶│    GitHub PR     │
│ (builder)    │  + applies labels   │                  │
└──────┬───────┘                     └────────┬─────────┘
       │                                      │
       │                              labeled event
       │                                      │
       │                             ┌────────▼─────────┐
       │                             │  GitHub Action    │
       │                             │  (no AI calls)    │
       │                             │  posts @codex     │
       │                             │  review comments  │
       │                             └────────┬─────────┘
       │                                      │
       │                             ┌────────▼─────────┐
       │                             │  Codex            │
       │       reads findings        │  (3 reviewer      │
       │◀────────────────────────────│   roles)          │
       │                             └──────────────────┘
       │
       │  responds to each finding:
       │  • agree and fix
       │  • disagree with evidence
       │  • defer with reason
       ▼

The Three Layers

| Layer | Tool | Role | Calls AI API? |
| --- | --- | --- | --- |
| Builder | Claude Code | Writes code, opens PRs, applies labels, responds to reviews | Yes (Anthropic) |
| Orchestrator | GitHub Actions | Posts @codex review comments with saved prompts | No |
| Reviewers | Codex (via GitHub) | Reviews PRs with role-specific prompts | Yes (OpenAI, via ChatGPT-linked Codex) |

The Three Reviewer Roles

| Role | Label | Focus |
| --- | --- | --- |
| Software Review | codex-software-review | Bugs, edge cases, test gaps, API consistency, docs matching implementation |
| Methodology Review | codex-methodology-review | Domain-specific correctness: numerical methods, evaluation validity, metric alignment |
| Red Team Review | codex-red-team-review | Adversarial falsification: "How could this look good while being wrong?" |

The methodology review is optional — it's most valuable for ML/data science, scientific computing, financial modeling, or simulation projects. The other two are universally applicable.

Prerequisites

  • Claude Code installed and authenticated
  • GitHub CLI (gh) installed and authenticated
  • ChatGPT Plus/Team/Enterprise with Codex access (for the review side)

First-Time Claude Code Setup (Optional)

If this is your first time using Claude Code, configure global settings and run the security precheck. Skip this section if you already have ~/.claude/settings.json configured.

Global settings (~/.claude/settings.json)

See templates/global-settings-example.json for a comprehensive starting point. It covers permissions for common toolchains (Python, C/C++, Fortran, Rust, Go, Node.js) and denies dangerous operations. Copy and customize:

mkdir -p ~/.claude
cp templates/global-settings-example.json ~/.claude/settings.json
# Edit to match your preferences

Key principles:

  • allow: Read-only operations and build/test commands that are safe to run freely.
  • deny: Destructive operations that should never happen without manual intervention.
  • ask: Operations that change shared state (git push, PR creation, file deletion). Always keep git push in ask. Authorization for one push does not carry forward to the next.
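Structurally, such a settings file is a permissions object with allow, ask, and deny lists of tool rules. The sketch below is illustrative only — the shape follows Claude Code's `Tool(specifier)` rule pattern, but these specific rules are hypothetical and not the contents of global-settings-example.json:

```json
{
  "permissions": {
    "allow": [
      "Bash(git status)",
      "Bash(git diff:*)",
      "Bash(make test:*)"
    ],
    "ask": [
      "Bash(git push:*)",
      "Bash(gh pr create:*)",
      "Bash(rm:*)"
    ],
    "deny": [
      "Read(./.env)",
      "Bash(curl:*)"
    ]
  }
}
```

Rules are matched per tool call, so a push approved once under ask does not whitelist later pushes.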

Security hook (defense-in-depth)

The global settings example includes a PreToolUse hook that hard-blocks dangerous patterns on every tool call, as defense-in-depth behind the deny list. Install it:

mkdir -p ~/.claude/hooks
cp templates/hooks/security-precheck.py ~/.claude/hooks/security-precheck.py
chmod +x ~/.claude/hooks/security-precheck.py

This hook catches things the deny list can't express, including:

  • Pipe-to-shell patterns (curl ... | bash)
  • Credential exfiltration via network tools
  • Recursive deletion of critical directories (~/.ssh, ~/.gnupg)
  • Base64 + network tool combinations
  • eval/exec, awk shell escapes, find -exec
  • Read/Edit/Write access to sensitive paths (.env, .ssh, .pem, etc.)
  • echo/printf with file redirection (use the Write tool instead)

The hook is registered in global-settings-example.json under the hooks key. It runs automatically on every PreToolUse event — exit code 0 allows the operation, exit code 2 hard-blocks it. Blocked events are logged to ~/.claude/security-audit.log for post-hoc review.
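The mechanics can be sketched as follows. The stdin-JSON-in, exit-code-out protocol matches Claude Code's PreToolUse hook contract; the patterns and structure below are a simplified illustration of what a hook like security-precheck.py might check, not its actual contents.

```python
"""Simplified sketch of a PreToolUse security hook (illustrative only,
not the actual security-precheck.py). Claude Code invokes the hook with
the tool-call event as JSON on stdin; exit 0 allows, exit 2 hard-blocks."""
import json
import re
import sys

# Patterns a prefix-based deny list can't express (illustrative subset)
BLOCKED = [
    re.compile(r"curl[^|]*\|\s*(ba)?sh"),       # pipe-to-shell
    re.compile(r"rm\s+-rf\s+~/\.(ssh|gnupg)"),  # recursive deletion of key dirs
    re.compile(r"\beval\b|\bexec\b"),           # shell eval/exec
]

def check(command: str) -> bool:
    """Return True if the command should be hard-blocked."""
    return any(p.search(command) for p in BLOCKED)

def main() -> int:
    # Claude Code passes the event on stdin; the entry point would be
    # `sys.exit(main())` when run as a script.
    event = json.load(sys.stdin)
    if event.get("tool_name") == "Bash":
        cmd = event.get("tool_input", {}).get("command", "")
        if check(cmd):
            print(f"Blocked dangerous pattern: {cmd}", file=sys.stderr)
            return 2   # hard block; stderr is fed back to the agent
    return 0           # allow
```

A real hook would also append each blocked event to an audit log before exiting.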

Note on echo/printf: Global settings allow both echo * and printf * for terminal output, but the security hook intentionally blocks file redirection (echo "x" > file.txt). This is by design — use the Write tool for file creation, which provides better auditability.

Manual Setup

If you prefer to set up a project manually rather than through Claude:

Step 1: Bootstrap the project

./scripts/init-project.sh /path/to/your-project

This copies all template files without overwriting existing ones.

Step 2: Fill in templates

Replace all {{PLACEHOLDER}} values in:

  • CLAUDE.md — Builder agent instructions. Include exact build/test/lint commands, architecture, and the PR review workflow with label trigger paths customized for your project.
  • AGENTS.md — Reviewer agent context. Write project-specific risks. Generic risks are useless; specific failure modes are gold.
  • .github/workflows/ci.yml — Replace placeholder steps with your actual toolchain.
  • .github/prompts/ — Condensed one-liner prompts for @codex review triggers (Codex ignores multi-line instructions). Customize the methodology one-liner for your domain, or remove it. Full detailed prompts are in .github/prompts/detailed/ for fallback/manual reviews.

Step 3: Create the PAT and labels

Codex ignores @codex review comments from github-actions[bot]. You need a PAT so comments appear from your user account.

  1. GitHub.com → Settings → Developer settings → Fine-grained tokens
  2. Generate a token with Pull requests: Read and write on your repo
  3. Add as a repository secret named CODEX_TRIGGER_PAT
  4. Create the review labels:

./scripts/create-labels.sh owner/repo

Development Workflow

Test-Driven Development

  1. Write test data / fixtures — static inputs for the feature.
  2. Write failing tests — tests that exercise the expected behavior.
  3. Implement — make the tests pass.
  4. Commit at each step — capture the red-green progression in git history.

The specific test categories depend on your project's domain:

| Category | When to use |
| --- | --- |
| Unit tests | Any project — test individual functions/modules in isolation |
| Integration tests | Multi-module projects — test interactions between components |
| Regression tests | Any project — guard against re-introduced bugs |
| Performance tests | HPC/scientific computing — ensure performance doesn't degrade |
| Verification tests | Scientific computing — confirm numerical convergence rates |
| Validation tests | Physics/engineering — compare against analytical solutions or experimental data |
| Conservation tests | Physics simulations — verify conserved quantities (energy, mass, momentum) |
| Symmetry/invariant tests | Any project with symmetry properties — verify invariants hold |
| MPI/parallel tests | Parallel codes — verify correctness across process counts |

Claude Code follows the TDD cycle when instructed via CLAUDE.md. The key is making it mandatory rather than suggesting it.
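To illustrate the conservation category above, here is a minimal, hypothetical test: a toy 1D finite-volume update written in flux form, where total mass should be preserved to round-off. None of this comes from the framework's templates; it only shows the shape such a test takes.

```python
import numpy as np

def advect_step(u, velocity=1.0, dx=0.1, dt=0.05):
    """Toy 1D upwind advection step in flux form with periodic boundaries.
    Flux-form updates conserve the total of u by construction."""
    flux = velocity * u                       # upwind flux (velocity > 0)
    return u - (dt / dx) * (flux - np.roll(flux, 1))

def test_mass_conservation():
    rng = np.random.default_rng(0)
    u = rng.random(64)
    total_before = u.sum()
    for _ in range(100):
        u = advect_step(u)
    # Total "mass" should survive 100 steps to round-off accuracy
    assert abs(u.sum() - total_before) < 1e-10 * total_before
```

A review prompt can then ask the reviewer to verify that every flux-form invariant the code claims actually has a test like this guarding it.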

Git Conventions

| Practice | Why |
| --- | --- |
| Feature branches + PRs | Never commit to main directly. |
| Granular commits | One commit per logical step, not one per feature. |
| Explicit push approval | Every git push requires confirmation. |
| Clean worktree between branches | Verify clean state before creating new branches. |
| Review CLAUDE.md with each PR | Keep docs in sync with code changes. |
| Audit tests before PRs | Review coverage for gaps and redundancy. |

PR Review Pipeline

1. Claude opens PR + applies labels
   ├── codex-software-review (always)
   ├── codex-methodology-review (if domain code changed)
   └── codex-red-team-review (if evaluation/validation code changed)

2. GitHub Action posts @codex review comments (one per label)

3. Codex reviews arrive (2-5 minutes, sequential)

4. Claude monitors, then responds to EVERY finding:
   ├── Agree and fix → make the change, push, note in reply
   ├── Disagree with evidence → cite code/tests/design decisions
   └── Defer with reason → acknowledge but explain out-of-scope

5. Claude reports summary to user

The forced rebuttal (step 4) is not optional. It prevents reviews from becoming decorative.
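The orchestration job in step 2 can be sketched as a label-gated workflow along these lines. This is a hypothetical, simplified version — the codex-review.yml template in this repo is the authoritative one — and the one-liner prompt body is illustrative:

```yaml
name: Codex review triggers
on:
  pull_request:
    types: [labeled]

jobs:
  software-review:
    if: github.event.label.name == 'codex-software-review'
    runs-on: ubuntu-latest
    # Deduplicate GitHub's occasional duplicate `labeled` events
    concurrency:
      group: software-review-${{ github.event.pull_request.number }}
      cancel-in-progress: true
    steps:
      - name: Post @codex review comment (as a real user, via PAT)
        env:
          GH_TOKEN: ${{ secrets.CODEX_TRIGGER_PAT }}
        run: |
          gh pr comment "${{ github.event.pull_request.number }}" \
            --repo "${{ github.repository }}" \
            --body "@codex review: check for bugs, edge cases, and test gaps"
```

Note the PAT in GH_TOKEN: commenting with the default workflow token would come from github-actions[bot], which Codex ignores.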

Claude Code Collaboration Features

The builder-reviewer loop is one half of the workflow. The other half is the human-AI collaboration loop — how you work with Claude Code across sessions.

Memory system

Claude Code persists context across sessions in ~/.claude/projects/*/memory/. Use memory for:

  • Architectural decisions — why a design was chosen (not just what)
  • User preferences — coding style, review strictness, domain expertise level
  • Risk discoveries — failure modes found during development or review

Use CLAUDE.md for build commands, workflow instructions, and anything every session needs immediately. Memory is for context that's useful but not essential at session start.

Session continuity

| Feature | When to use |
| --- | --- |
| claude --continue | Resume the most recent session with full context |
| /compact | Compress conversation history when approaching context limits |
| Worktrees | Isolated git worktrees for parallel work on multiple features |
| Plan mode (/plan) | Align on approach before complex implementations |
| claude --print | Non-interactive mode for scripted automation |

Context window management

Keep CLAUDE.md lean. If it exceeds ~200 lines, move reference material (API docs, data dictionaries, environment setup) to separate files that Claude can read on demand. CLAUDE.md loads at every session start — everything in it costs context on every interaction.

Extending the Framework

Adding a new review type

  1. Create .github/prompts/your-review.md with a condensed one-liner prompt (Codex ignores multi-line @codex review instructions). Optionally also create .github/prompts/detailed/your-review.md with the full prompt for fallback/manual reviews.
  2. Add a job to .github/workflows/codex-review.yml (copy an existing job, change the label name and prompt file)
  3. Run gh label create "codex-your-review" --description "..." --color "..."
  4. Update CLAUDE.md Step 1 with when to apply the new label

Useful additional review types

| Review Type | Focus | Best For |
| --- | --- | --- |
| Test Design | Test proposals across all categories (unit, integration, regression, verification, validation) | Any project |
| Data Quality | Schema drift, format changes, missing entries, silent imputations | Data-heavy projects |
| Documentation Truth | Compare all docs to actual implementation | Any project |
| Numerical Methods | Stability, convergence, order of accuracy, limiters, boundary conditions | Scientific computing |
| Parallel Correctness | Race conditions, decomposition assumptions, communication patterns, load balance | MPI/threaded codes |
| Experiment Planning | Next experiments, ablation plans, baseline comparisons, stopping rules | Research projects |

Keeping AGENTS.md current

AGENTS.md is written at project creation but must evolve. Stale risk documentation directs reviewer attention to the wrong places. Update it:

  • After each major feature: New components introduce new failure modes.
  • After review cycles where findings were consistently wrong: The risk profile has drifted.
  • At project milestones: Reassess which risks are still load-bearing.
  • After real bugs: Add the failure mode that was missed.

Remove risks that are no longer relevant. A risk about a data pipeline that was replaced is noise.

Making reviews effective

  1. Give reviews authority to block. Define blocking conditions in CLAUDE.md (the template includes merge-blocking criteria).
  2. Require code citations. Every finding must name a file, function, or test gap.
  3. Reward important flaws, not volume. The prompts cap nits at 3.

File Reference

CLAUDE.md                                  # Instructions for Claude when bootstrapping projects
templates/
├── CLAUDE.md                              # Builder agent instructions (for new projects)
├── AGENTS.md                              # Reviewer agent context (for new projects)
├── .github/
│   ├── workflows/
│   │   ├── ci.yml                         # CI pipeline (language-agnostic)
│   │   └── codex-review.yml               # Codex review triggers
│   └── prompts/
│       ├── software-review.md             # One-liner Codex trigger prompt
│       ├── methodology-review.md          # One-liner Codex trigger prompt (optional)
│       ├── red-team-review.md             # One-liner Codex trigger prompt
│       └── detailed/                      # Full prompts for fallback/manual reviews
│           ├── software-review.md
│           ├── methodology-review.md
│           └── red-team-review.md
├── .claude/
│   └── settings.json                      # Project permissions template
├── hooks/
│   └── security-precheck.py               # PreToolUse hook (defense-in-depth)
└── global-settings-example.json           # ~/.claude/settings.json reference

scripts/
├── init-project.sh                        # Copy templates into a new project
└── create-labels.sh                       # Create GitHub review labels

Lessons Learned

Non-obvious things discovered through iteration:

  1. Codex ignores bot comments. github-actions[bot] can't trigger @codex review. You need a PAT so comments appear from a real user.

  2. GitHub fires duplicate labeled events. When applying multiple labels at PR creation, GitHub can fire extra events. Per-job concurrency groups with cancel-in-progress: true deduplicate them.

  3. ready_for_review doesn't fire on non-draft PR creation. Use labeled as the sole trigger and always apply labels.

  4. CLAUDE.md is loaded at conversation start. Existing sessions won't see mid-conversation updates. Start a new session after CLAUDE.md changes. Use --continue to resume a previous session, /compact to manage long sessions.

  5. Forced rebuttal is the most important part. Without it, reviews become decorative. The builder agent must respond to every finding.

  6. Save positive feedback, not just corrections. If Claude makes a good non-obvious choice, confirm it. Otherwise it only learns what NOT to do.

  7. git push must always require approval. One approval does not extend to the next push.

  8. Granular commits capture development progression. One commit per step is more valuable than one commit per feature.

  9. Domain-specific risks are 10x more valuable than generic ones. "Don't introduce bugs" in AGENTS.md is useless. Specific failure modes catch real problems.

  10. Test categories vary by domain. Unit tests alone are insufficient for scientific codes — add verification, validation, conservation, and convergence tests as appropriate.
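To make lesson 10 concrete, a verification test typically checks an observed order of accuracy rather than a single error value. A minimal, hypothetical example (not from the templates): a first-order forward difference should show errors that roughly halve when the step halves.

```python
import math

def forward_diff(f, x, h):
    """First-order forward-difference derivative: error is O(h)."""
    return (f(x + h) - f(x)) / h

def observed_order(f, dfdx, x, h):
    """Estimate convergence order from errors at step h and h/2."""
    e_coarse = abs(forward_diff(f, x, h) - dfdx(x))
    e_fine = abs(forward_diff(f, x, h / 2) - dfdx(x))
    return math.log2(e_coarse / e_fine)

p = observed_order(math.sin, math.cos, x=1.0, h=0.1)
assert 0.8 < p < 1.2   # a first-order scheme should show order ≈ 1
```

Asserting on the order, not the raw error, is what catches a scheme that is stable but silently less accurate than claimed.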

About

Multi-agent adversarial dev workflow templates: Claude Code (builder) + Codex (reviewers) + GitHub Actions (orchestrator)
