Summary
Implement a shadow/dual-run mode in the release promotion pipeline where the "next" channel tag agent runs in parallel with "stable" on the same PR, but only stable's output is posted. Shadow output is logged and compared for quality regression detection, providing the evidence needed for confident ring promotion without risking production PR reviews.
Market Signal
TrueFoundry's Agent Gateway (2026) supports Shadow Mode as a first-class deployment pattern for AI agents — run new versions in parallel, compare outputs, auto-promote when quality matches (Agent DevOps: CI/CD, Evals, and Canary Deployments). Research on identity-stable canary deployment (ICAN-Deploy, arxiv 2605.28097) formalizes the safety properties needed: capability versions can change independently of agent identity. GitHub Copilot's coding agent used quality evaluation against production tasks before GA rollout. Shadow testing is emerging as the standard pattern for validating AI agents before production deployment — traditional canary rollouts (traffic splitting) don't fully apply to AI agents because even a small percentage of bad reviews posted to real PRs causes harm.
User Signal
The Safe Release Strategy initiative (#495) is the project's top strategic initiative, with active Phase 2 work on:
Issue #503 specifically asks to prove safety — shadow mode directly enables this proof by running the candidate version against real workloads without posting its output. If shadow output diverges negatively from stable, promotion is blocked automatically with evidence.
Technical Opportunity
The release channel architecture (stable/next tags, ring definitions, cut-release.sh) is already built. The review pipeline already supports cross-engine comparison via DUCK_ENGINE/DUCK_MODEL in engine.sh. Extending this to run a second engine invocation from the "next" ref in shadow mode reuses existing infrastructure:
Shadow invocation: review-one-pr.sh receives a SHADOW_AGENT_REF env var; if set, it invokes the review a second time using scripts checked out at that ref
Output capture: Shadow output goes to a workflow artifact, not to the PR
Quality comparison: A lightweight comparison (Haiku-tier LLM call or deterministic diff scoring) evaluates whether shadow output is at least as good as stable
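The shadow invocation and output capture steps above could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the function names, the `SHADOW_ARTIFACT_DIR` variable, the location of `review-one-pr.sh` at the repository root, and the assumption that it takes a PR number and prints its review to stdout (with posting handled separately) are all hypothetical. Only `SHADOW_AGENT_REF` comes from the proposal.

```bash
#!/usr/bin/env bash
# Sketch only. Hypothetical: function names, SHADOW_ARTIFACT_DIR, and the
# "review-one-pr.sh <pr-number> prints the review to stdout" convention.

# True when a shadow run was requested via SHADOW_AGENT_REF.
shadow_enabled() {
  [ -n "${SHADOW_AGENT_REF:-}" ]
}

# Path where shadow output for a PR is written (a file, never the PR itself).
shadow_artifact_path() {
  local pr="$1"
  printf '%s/pr-%s.shadow.md\n' "${SHADOW_ARTIFACT_DIR:-shadow-artifacts}" "$pr"
}

# Run the review scripts from SHADOW_AGENT_REF in a throwaway git worktree and
# capture their output to the artifact path. Always returns 0 so that a broken
# candidate version cannot fail the stable review.
run_shadow_review() {
  local pr="$1" tree out status=0
  shadow_enabled || return 0
  out="$(shadow_artifact_path "$pr")"
  mkdir -p "$(dirname "$out")"
  tree="$(mktemp -d)"
  if ! git worktree add --detach "$tree" "$SHADOW_AGENT_REF" >/dev/null 2>&1; then
    echo "shadow: could not check out $SHADOW_AGENT_REF" > "$out"
    rmdir "$tree" 2>/dev/null
    return 0
  fi
  # Run in a subshell with SHADOW_AGENT_REF unset so the shadow run
  # does not try to start a shadow run of its own.
  ( unset SHADOW_AGENT_REF; "$tree/review-one-pr.sh" "$pr" ) > "$out" 2>&1 || status=$?
  echo "shadow exit status: $status" >> "$out"
  git worktree remove --force "$tree" >/dev/null 2>&1
  return 0
}
```

In a workflow, the stable review would run and post as it does today, then `run_shadow_review "$PR_NUMBER"` would run afterwards, and the artifact directory would be uploaded by a later step. Using a separate worktree keeps the stable checkout untouched while the candidate scripts run.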
Requires dual invocation plumbing and quality comparison logic; reuses existing cross-engine infrastructure but adds workflow complexity
Impact
high
Directly enables safe self-modification — the most critical capability for an agent that modifies its own infrastructure
Urgency
med
The Safe Release Strategy is in Phase 2; shadow mode is a natural Phase 3 enhancement after rings and health gates are solid
Adversarial Review
Strongest objection: Shadow mode doubles compute cost during validation windows. Two concurrent agent runs per PR could hit rate limits, especially with the shared daily Claude subscription cap (issue #206).
Rebuttal: Shadow mode only runs during promotion windows (1-2 days before a ring promotion, covering ~20-50 PRs), not permanently. The cost is bounded and predictable: budget for 2x during promotion week. The rate limit concern is mitigated by the cross-provider fallback architecture in engine.sh, since shadow runs can use the Gemini or Copilot provider while stable uses Claude, or vice versa. Starting with a small sample (shadowing a random 10% of PRs) bounds cost further, although at ~20-50 PRs per window that yields only about 2-5 comparisons, so the sample rate or the window would need to grow before the quality data is statistically meaningful. The ROI of catching a bad agent version before it touches production PRs across all org repos far outweighs the temporary cost increase during validation.
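One way to sample a fixed share of PRs is sketched below. It swaps the random draw mentioned above for a hash of the PR number, so a re-run of the same PR makes the same decision. The function names and the default rate are hypothetical, and `cksum` is used only as a convenient stable hash.

```bash
#!/usr/bin/env bash
# Sketch only. Hypothetical names; hash-based sampling stands in for a
# random draw so that retries of the same PR are treated consistently.

# Map a PR number to a stable bucket in 0..99 using the POSIX cksum CRC.
pr_bucket() {
  local sum
  sum="$(printf '%s' "$1" | cksum | cut -d ' ' -f 1)"
  echo $(( sum % 100 ))
}

# Succeeds when the PR falls inside the sampled share.
# $1 = PR number, $2 = sample rate in percent (defaults to 10).
should_shadow() {
  local pr="$1" rate="${2:-10}"
  [ "$(pr_bucket "$pr")" -lt "$rate" ]
}

# Expected number of shadow runs for a window (integer, rounded down).
# $1 = PRs in the window, $2 = sample rate in percent.
expected_shadow_runs() {
  echo $(( $1 * $2 / 100 ))
}
```

For a window of 20 to 50 PRs at a 10% rate, `expected_shadow_runs` gives 2 and 5 respectively, which makes the size of the evidence base explicit when choosing the rate.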
Suggested Next Step
Design a shadow-run mode for review-one-pr.sh that invokes the agent at a second ref (from SHADOW_AGENT_REF env var), captures output to a workflow artifact, and posts a comparison summary to the promotion tracking issue. Gate implementation on Safe Release Strategy Phase 2 completion (#501).
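A deterministic comparison summary could start from something as simple as the sketch below. The metric here (the share of stable's non-blank lines that appear verbatim in the shadow output) is an assumption chosen for illustration. It is a crude divergence signal, not a measure of review quality, and the function names and the 80% default threshold are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only. Hypothetical names and threshold; verbatim line overlap is a
# rough divergence signal and does not prove the shadow review is "as good".

# Percentage of stable's non-blank lines that also appear verbatim in shadow.
# $1 = stable output file, $2 = shadow output file.
line_overlap_percent() {
  local stable="$1" shadow="$2" total shared
  total="$(grep -c '[^[:space:]]' "$stable")"
  if [ "$total" -eq 0 ]; then
    echo 0
    return 0
  fi
  # -F fixed strings, -x whole-line match, -f read patterns from the shadow file
  shared="$(grep '[^[:space:]]' "$stable" | grep -c -F -x -f "$shadow")"
  echo $(( shared * 100 / total ))
}

# "pass" when overlap meets the threshold, otherwise "block".
divergence_verdict() {
  if [ "$1" -ge "$2" ]; then echo pass; else echo block; fi
}

# One summary line per PR, suitable for appending to a tracking comment.
# $1 = PR number, $2 = stable file, $3 = shadow file, $4 = threshold percent.
comparison_summary() {
  local pr="$1" stable="$2" shadow="$3" threshold="${4:-80}" pct
  pct="$(line_overlap_percent "$stable" "$shadow")"
  printf 'PR #%s: overlap %s%% (threshold %s%%): %s\n' \
    "$pr" "$pct" "$threshold" "$(divergence_verdict "$pct" "$threshold")"
}
```

The summary lines could be collected into a file and posted to the promotion tracking issue, for example with `gh issue comment`. A "block" result is probably best treated as a prompt for a closer look (or for the LLM-based comparison), since two good reviews can differ heavily in wording.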