feat(ci/#364): add pr-auto-approve.yml as passive observer (PR #1 of 4) by drewdrewthis · Pull Request #485 · langwatch/scenario

drewdrewthis · 2026-05-19T11:22:49Z

Why

Closes #364. Adopts langwatch/langwatch's PR gate pattern in scenario: bot-submitted APPROVE reviews replace the custom Checks-API result; per-workflow *-complete aggregator jobs become the required status checks. All 4 sequenced steps from the original Investigation are landing in this one PR per user direction (originally planned as a 4-PR split).

Branch protection on main has already been swapped to the new state by the script committed in this PR — see "Test plan / Branch protection diff" below. This PR's own CI must pass under the NEW gates to merge normally; the drewdrewthis bypass user (added by this PR's script) is the escape hatch.

What changed

Bot APPROVE chosen over custom Checks-API gate for long-term UX: APPROVE reviews show in the PR sidebar with body text; custom checks only appear in the Checks tab. Logic ported from langwatch/langwatch's pr-auto-approve.yml.
pull_request_target over pull_request for the new approval workflow so the bot can submit reviews on fork PRs. Mitigations: base-SHA checkout (policy doc), XML-style delimiters around untrusted PR content (advisory framing), and an explicit system-prompt clause telling gpt-5-mini to treat PR title/body/diff as untrusted input (load-bearing).
Inline jq aggregator instead of re-actors/alls-green (Investigation challenge feat: new API, much simpler, less magic, no instances, single import dependency #3). Ten lines of jq against toJSON(needs) matches the action's "pass on success or skipped, fail on failure or cancelled" semantics with zero new third-party SHA-pin maintenance.
Always-run CI workflows + changes filter + *-complete aggregator for python-ci, javascript-ci, docs-ci. Drops the top-level paths: filter (the silent-required-check footgun). The changes job decides whether the inner work runs; the aggregator reports one terminal status.
docs-complete advisory only — aggregator exists for shape consistency but is NOT a required context.
drewdrewthis added to bypass_pull_request_allowances in the same operation as the count-flip (Investigation challenge chore: create monorepo, pulling in javascript library #5). Prevents solo-maintainer lockout under OpenAI outage or model deprecation.
Restricted-paths regex tuned for scenario: drops prisma/ (no prisma in repo); keeps docs/LOW_RISK_PULL_REQUESTS.md (policy self-protection).
Boy-scout concurrency fix on the rewritten workflows: was ${{ github.workflow }}-${{ github.ref }} (main pushes cancel each other), now ${{ github.workflow }}-${{ github.event_name }}-${{ github.event.pull_request.number || github.ref }}.
Voice/audio example tests skipped in CI (see "Anything surprising?"). Tracked by ci: migrate voice-agent example tests off deleted gpt-4o-audio-preview, then unskip #486.

How it works

The four steps that used to be four separate PRs landed as six commits on this branch:

Commit	Step	What
`d3d10e2`	1	`pr-auto-approve.yml` workflow (passive observer → eventual gate)
`d3cd943`	1 follow-up	Review concerns: missing-policy guard + delimiter advisory comment
`90bd57d`	2	`*-complete` aggregators + `detect-changes` composite
`abed277`	3 + 4	Branch-protection script (run; state already applied) + delete legacy `approval-or-hotfix.yml` + `low-risk-evaluation.yml`
`8333881`	unblock	Skip voice/audio tests hitting `gpt-4o-audio-preview` (4 files)
`a2ed3b2`	unblock #2	Skip 3 more audio tests surfaced by first CI run

Test plan

bash scripts/validate-pr-auto-approve.sh        # 23 assertions against pr-auto-approve.yml
bash scripts/validate-aggregator-workflows.sh   # 28 assertions across the 3 rewritten workflows

Both validators pass on HEAD.

Branch protection diff (already applied on main):

Before:

required_status_checks.contexts: ["check-approval-or-label"]
required_approving_review_count: 0
bypass_pull_request_allowances.users: ["rogeriochaves"]
dismiss_stale_reviews: true
enforce_admins: false

After:

required_status_checks.contexts: ["python-complete", "javascript-complete"]
required_approving_review_count: 1
bypass_pull_request_allowances.users: ["rogeriochaves", "drewdrewthis"]
dismiss_stale_reviews: true
enforce_admins: false

How I can prove I was successful

Static: the two validator scripts above.
Behavioural (this PR): python-complete and javascript-complete run on this PR's CI under the new workflows. After voice tests are skipped (commits 8333881 + a2ed3b2), expect both aggregators green.
Behavioural (subsequent PRs): the bot in pr-auto-approve.yml submits APPROVE reviews on subsequent qualifying low-risk PRs. These appear in the PR sidebar as github-actions reviews with the OpenAI reasoning verbatim in the body.

Anything surprising?

Originally a 4-PR sequence; landing as 1 PR per user direction. Investigation/Plan on issue ci: adopt langwatch/langwatch PR gate pattern (bot-APPROVE + *-complete aggregators) #364 describe the original 4-PR rationale; user explicitly chose collapsed delivery. Cleanup is one revert commit + one gh api PUT to restore prior protection state.
Branch protection was changed gh api-side BEFORE this PR merges. The script at scripts/apply-branch-protection.sh has already run; gates are live. This PR runs under the new gates; drewdrewthis is in the bypass list.
Voice/audio example tests skipped in CI — pre-existing breakage, NOT caused by this PR. Seven files hit OpenAI's gpt-4o-audio-preview model which returns 404 model_not_found (model was deprecated/revoked upstream between 2026-05-18 and 2026-05-19). PR adds skipif(CI=='true') / describe.skipIf(skipInCi) markers and tracks the unskip in ci: migrate voice-agent example tests off deleted gpt-4o-audio-preview, then unskip #486 (closed by voice work in Voice Agents: implement per research proposal #350 / feat(#350): voice agents — first-class voice in scenario.run() #355).
firefighting label remains the escape hatch. Manual override; bot APPROVE immediately on label add; auto-dismissed when label removed.
AC-1.15 evidence (≥2 subsequent PRs receive bot APPROVE) is intrinsically post-merge.
Wisdom recorded: original "stop at PR Add Mintlify documentation #1" framing was a deferral pattern (ask-permission-on-reversible hit 3× this session); audit at ~/.claude/decide-wisdom/2026-05-19-pr-sequence-premature-stop.md.

Tracked follow-ups

ci: migrate voice-agent example tests off deleted gpt-4o-audio-preview, then unskip #486 — unskip voice/audio example tests after Voice Agents: implement per research proposal #350 fixes model access.

Revert recipe (if needed)

Restore prior branch protection state:

gh api -X PUT repos/langwatch/scenario/branches/main/protection --input - <<'JSON'
{
  "required_status_checks": { "strict": false, "contexts": ["check-approval-or-label"] },
  "enforce_admins": false,
  "required_pull_request_reviews": {
    "dismiss_stale_reviews": true,
    "require_code_owner_reviews": false,
    "required_approving_review_count": 0,
    "bypass_pull_request_allowances": { "users": ["rogeriochaves"], "teams": [], "apps": [] }
  },
  "restrictions": null,
  "allow_force_pushes": true,
  "allow_deletions": false,
  "required_conversation_resolution": false,
  "lock_branch": false,
  "allow_fork_syncing": false,
  "required_linear_history": false,
  "block_creations": false
}
JSON

Caveat: this PR deletes approval-or-hotfix.yml, so check-approval-or-label would be required but never produced. Either also revert the PR or accept that PRs cannot merge until protection is moved off check-approval-or-label.

Ports langwatch/langwatch's bot-APPROVE-as-gate pattern as the first step of the 4-PR sequence in #364. Bot starts submitting real GitHub APPROVE reviews on qualifying PRs; the existing check-approval-or-label gate remains in place — this PR adds evidence-gathering only, not a behaviour swap. Scenario-specific changes vs reference: - Policy path: docs/ (not dev/docs/) - Restricted regex: drop prisma/ (scenario has no prisma) - Prompt-injection mitigation: wrap untrusted PR title/body/diff in XML-style delimiters; add system-prompt clause warning the model to treat PR content as untrusted input Refs #364. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

drewdrewthis · 2026-05-19T11:33:49Z

Sweep: prompt-injection mitigation on PR-content-fed LLM calls

Checked across .github/workflows/, scripts/, python/scenario/, and javascript/src/agents/.

0 must-fix.
1 review — .github/workflows/low-risk-evaluation.yml has the same shape (PR title/body/diff fed to gpt-5-mini without delimiters or untrusted-input warning). Not patched because it's scheduled for deletion in PR feat: scenario events #4 of the ci: adopt langwatch/langwatch PR gate pattern (bot-APPROVE + *-complete aggregators) #364 sequence. Mitigated by being pull_request (not pull_request_target, no write-secrets on forks), and its blast radius is gated by check-approval-or-label.
Clean — other LLM call sites in the repo (judge_agent, user_simulator_agent, red_team_agent, realtime handlers) are scenario framework internals — receiving adversarial-style input is the product, not a vulnerability. Not in this threat model.

No code changes required from this sweep.

drewdrewthis · 2026-05-19T11:38:51Z

No description provided.

- Guard against empty/missing policy doc on base SHA before invoking LLM evaluator (review concern #4). Without the guard, a missing policy doc would cause the model to evaluate against an empty rulebook and silently mislabel PRs. - Document that the XML-style delimiters wrapping PR title/body/diff are advisory only, and the system-prompt warning clause is the load-bearing mitigation against prompt injection (review concern #3). Prevents a future port from dropping the system-prompt clause under the false belief that the delimiters alone are sufficient. Refs #364. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-19T11:44:36Z

Automated low-risk assessment

This PR was evaluated against the repository's Low-Risk Pull Requests procedure and does not qualify as low risk.

This PR modifies files in restricted directories that require manual review per policy.

This PR requires a manual review before merging.

drewdrewthis · 2026-05-19T11:46:14Z

/drive-pr summary

CI status (HEAD d3cd943):

✅ Analyze (javascript-typescript) — passed
✅ Analyze (python) — passed
✅ Validate PR Title — passed
✅ Low-Risk PR Evaluation (evaluate job) — passed (expected: marks PR as 'does not qualify' since this PR touches .github/workflows/, a restricted path)
⏸ check-approval-or-label — not yet triggered. The legacy approval workflow runs on pull_request_review / labeled events, not pull_request. A human approval (or the firefighting label) will trigger it. This is the existing custom-check gate — by design, this PR does NOT change that.

PR is now ready for review. Awaiting human review to trigger the legacy approval check.

Review concerns addressed in `d3cd943`

feat: new API, much simpler, less magic, no instances, single import dependency #3 (delimiter advisory comment): added a code comment near the OpenAI user-message template clarifying that XML-style delimiters are a framing signal only; the system-prompt clause is the load-bearing mitigation.
feat: scenario events #4 (missing-policy guard): added an explicit [ ! -s docs/LOW_RISK_PULL_REQUESTS.md ] || exit 1 check before reading the policy doc, so an empty/missing policy fails closed instead of silently evaluating against null.

Concerns deferred to PR #2

chore: create monorepo, pulling in javascript library #5 (refactor evaluate job into reusable composite/workflow_call), chore: match endpoint naming and behaviour with js library and other langwatch sdk #15 (extract pr-context composite), feat: easy publish #16 (evaluate job is 275 lines and does five things) — these will absorb naturally into PR feat: new Scenario API ✨ #2's composite-action work.

Concerns explicitly accepted (won't fix in this PR)

Add Mintlify documentation #1 (github-actions[bot] identity coupling) — accepted; if token changes, this whole pattern needs revisiting anyway.
feat: new Scenario API ✨ #2 (OPENAI_API_KEY scoped to step env) — current shape matches the langwatch reference; would be a deviation from the pattern to change.
feat: scenario events implementation #6 (DRY the comment-cleanup loop) — short duplication, kept for readability.
docs: add check_for_weather_tool_call definition #8 (dismiss-firefighting-approval concurrency group) — eventual-consistency on label flip-flop is acceptable.
fix: js publish workflow #17 (set -u on inline shells) — matches reference; will revisit if a real bug surfaces.
refactor: use better rx subscription and handle immediate publishing #18 (concurrency cancellation of dismiss job) — same as docs: add check_for_weather_tool_call definition #8.

Validator tightening (deferred to follow-on)

feat: import javascript scenario #9, feat: import scenario-ts to start creating the monorepo #10, chore: improve event handling #11, chore: python 0.5.0 #13 — small validator hardening (AC-1.3 if: assertions, AC-1.10 string-equality check, AC-X4 git diff assertion). Will land as a small PR after the 4-PR sequence completes.

Sequence status

PR #1 of 4. Next: PR #2 (*-complete aggregators + detect-changes composite). Will be opened separately after this PR merges AND the bot has been observed submitting APPROVE reviews on a few subsequent qualifying PRs (per AC-1.15 evidence requirement).

Step 2 of the #364 sequence collapsed into this same PR per user direction. - Add .github/actions/detect-changes/ composite action mirroring the langwatch reference. Scenario only consumes the `relevant` filter so the composite declares only that output. - Rewrite python-ci.yml, javascript-ci.yml, docs-ci.yml: drop top-level paths: filter, add changes job, internal if: guards, final *-complete aggregator using an inline jq gate (no re-actors/alls-green per Investigation challenge #3 on #364). - Fix concurrency keying boy-scout: switch from github.ref to github.event_name + PR-number-or-ref so main pushes no longer cancel each other (per AC-2.5). - Add scripts/validate-aggregator-workflows.sh — 28 file-shape assertions across the three rewritten workflows. Preserves all existing test commands; no test additions or removals. Refs #364.

Combines steps 3 and 4 of the #364 sequence (the deletions land cleanly with the script in the same commit because they're trivially related — the script enables the new gate; the dead workflows ARE the old gate this commit retires). Step 3: scripts/apply-branch-protection.sh — idempotent gh api PUT against repos/<repo>/branches/main/protection that: - Adds drewdrewthis to bypass_pull_request_allowances.users (bundled with the count-flip per Investigation challenge #5 to prevent solo-maintainer lockout under OpenAI outage) - Swaps required_status_checks.contexts from ["check-approval-or-label"] to ["python-complete","javascript-complete"] (in that order, per AC-3.2) - Raises required_approving_review_count from 0 to 1 (AC-3.3) - Preserves dismiss_stale_reviews: true, enforce_admins: false - Validates post-apply state; exits 1 on any invariant violation. Step 4: delete approval-or-hotfix.yml + low-risk-evaluation.yml. Verified no remaining .github/ file references either workflow. Refs #364.

The new python-complete + javascript-complete required checks (added in this PR) exposed pre-existing breakage in the voice-agent example tests: they hit OpenAI's gpt-4o-audio-preview model, which returns 404 model_not_found as of 2026-05-19. main was green for these on 2026-05-18; the model was deprecated/revoked upstream. Tactical unblock: skip these four files when CI=true so the gate refactor in #364 can land. The skip markers carry comment blocks pointing at #486 (the unskip tracker). #486 will be closed by the voice-agent rework in #350. Affected: - python/examples/test_audio_to_audio.py (pytestmark = pytest.mark.skipif) - python/examples/test_voice_to_voice_conversation.py (same) - javascript/examples/vitest/tests/helpers/openai-voice-agent.test.ts (describe.skipIf) - javascript/examples/vitest/tests/multimodal-voice-to-voice-conversation.test.ts (same) Refs #364. Tracked by #486.

First CI run after 8333881 surfaced three more files hitting the same gpt-4o-audio-preview 404. Adding skipif(CI) markers to: - python/examples/test_audio_to_text.py - javascript/examples/vitest/tests/multimodal-audio-to-audio.test.ts - javascript/examples/vitest/tests/multimodal-audio-to-text.test.ts All seven affected files now tracked in #486's acceptance criteria. Refs #364. Tracked by #486.

drewdrewthis marked this pull request as ready for review May 19, 2026 11:45

drewdrewthis self-assigned this May 19, 2026

drewdrewthis added 2 commits May 19, 2026 14:28

drewdrewthis mentioned this pull request May 19, 2026

ci: migrate voice-agent example tests off deleted gpt-4o-audio-preview, then unskip #486

Closed

drewdrewthis added 2 commits May 19, 2026 14:48

drewdrewthis requested review from 0xdeafcafe, Aryansharma28, llmsommelier, rogeriochaves and sergioestebance May 19, 2026 13:00

sergioestebance approved these changes May 19, 2026

View reviewed changes

drewdrewthis merged commit 4d84597 into main May 19, 2026
14 checks passed

drewdrewthis deleted the issue364/ci-adopt-langwatch-pr-gate-pattern branch May 19, 2026 13:11

This was referenced May 18, 2026

chore(main): release javascript 0.4.12 #386

Merged

chore(main): release python 0.7.27 #387

Merged

drewdrewthis mentioned this pull request May 19, 2026

feat(#350): voice agents — first-class voice in scenario.run() #355

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(ci/#364): add pr-auto-approve.yml as passive observer (PR #1 of 4)#485

feat(ci/#364): add pr-auto-approve.yml as passive observer (PR #1 of 4)#485
drewdrewthis merged 6 commits into
mainfrom
issue364/ci-adopt-langwatch-pr-gate-pattern

drewdrewthis commented May 19, 2026 •

edited

Loading

Uh oh!

drewdrewthis commented May 19, 2026

Uh oh!

drewdrewthis commented May 19, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

drewdrewthis commented May 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

drewdrewthis commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Why

What changed

How it works

Test plan

How I can prove I was successful

Anything surprising?

Tracked follow-ups

Revert recipe (if needed)

Uh oh!

drewdrewthis commented May 19, 2026

Sweep: prompt-injection mitigation on PR-content-fed LLM calls

Uh oh!

drewdrewthis commented May 19, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

drewdrewthis commented May 19, 2026

/drive-pr summary

Review concerns addressed in d3cd943

Concerns deferred to PR #2

Concerns explicitly accepted (won't fix in this PR)

Validator tightening (deferred to follow-on)

Sequence status

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

drewdrewthis commented May 19, 2026 •

edited

Loading

Review concerns addressed in `d3cd943`