chore: main-side cleanup — docs + spec + python/TS parity by drewdrewthis · Pull Request #586 · langwatch/scenario

drewdrewthis · 2026-05-29T14:02:37Z

Summary

Consolidated main-side cleanup PR: docs housekeeping + spec housekeeping + python/TS parity fixes. Surfaced by a directory-structure audit on PR #561. Lands the changes for the follow-up issues listed under Closes below (most already closed; see that section).

File-structure changes

What moved, what was deleted, what was added (the recordings tree moved wholesale — 68 files — shown collapsed):

docs/
  voice-bug-bash.md                          ✖ deleted (orphan QA guide, zero refs)
  voice-twilio.md                            ✖ deleted (superseded by voice/adapters/twilio.mdx)
  voice/
    capability-matrix.md                     ✖ deleted ─┐ folded into the published mdx
    happy-path-elevenlabs.md                 ──► docs/docs/pages/voice/happy-path-elevenlabs.mdx   (published)
    happy-path-openai-realtime.md            ──► docs/docs/pages/voice/happy-path-openai-realtime.mdx (published)
  LOW_RISK_PULL_REQUESTS.md                  ──► .github/LOW_RISK_PULL_REQUESTS.md   (GitHub-process doc)
  docs/pages/voice/capability-matrix.mdx     ✎ edited (now the single source-of-truth; sidebar in docs/vocs.config.tsx)

javascript/
  docs/voice/capability-matrix.md            ✖ deleted ─┘ (was a divergent copy)

python/
  recordings/                                ──► python/outputs/recordings/   (whole tree: 18 demos, 68 files)
  outputs/README.md                          ✚ added (documents the new outputs/ parent)
  scenario/voice/assets/noise/*.wav          ✎ refreshed (cafe/street/office/airport + babble)
  scenario/voice/capabilities.py             ✎ edited (error msg → published capability-matrix URL)

javascript/src/voice/capabilities.ts         ✎ edited (error msg → published capability-matrix URL)
.github/workflows/pr-auto-approve.yml        ✎ edited (HTTP-406 oversized-diff handling; new LOW_RISK path)
javascript/tsconfig.json                     ✎ edited (removed duplicate "target" key)
specs/langwatch-pr-gate-pattern.feature      ✎ edited (dropped a stale, no-longer-meaningful scenario)

Changes

Docs housekeeping

Delete two orphan docs: docs/voice-bug-bash.md (zero refs; tied to the now-closed PR #355) and docs/voice-twilio.md (superseded by the published voice/adapters/twilio.mdx).
Publish the two happy-path guides — docs/voice/happy-path-{elevenlabs,openai-realtime}.md → docs/docs/pages/voice/*.mdx, with sidebar entries in docs/vocs.config.tsx so they render on the docs site.
Fold 3 capability-matrix files into 1 published source-of-truth — deleted docs/voice/capability-matrix.md and javascript/docs/voice/capability-matrix.md; merged into the existing docs/docs/pages/voice/capability-matrix.mdx. The error messages in python/scenario/voice/capabilities.py and javascript/src/voice/capabilities.ts now point at the published URL, and the JS contract-surface test (javascript/src/voice/__tests__/voice-contract-surface.test.ts) points at the published mdx.

Process / infra

Move docs/LOW_RISK_PULL_REQUESTS.md → .github/ — GitHub-process docs idiomatically live alongside CODEOWNERS and PULL_REQUEST_TEMPLATE. The path is load-bearing in .github/workflows/pr-auto-approve.yml; the workflow regex, validator script, and spec assertions were updated in the same change. Verified via scripts/validate-pr-auto-approve.sh.
Catch HTTP 406 in the pr-auto-approve.yml diff fetch and harden reason interpolation: gh pr diff hits GitHub's 20k-line API cap on huge PRs (observed on #561, run 26644602950) and crashes the evaluate step before reaching the workflow's own oversized-diff guard. Two parts: (1) wrap the fetch with a grep-specific catch on the 20k-line error string (auth/network failures still exit 1); (2) emit oversized_reason via heredoc and pass it through the env: pattern to the fail-fast step (the documented GitHub Actions anti-shell-injection pattern, applied even though the source is gh stderr). Supersedes PR #572. Closes #571 (ci: evaluate workflow hard-fails on PRs >20k-line diff instead of its oversized path).
Fix a duplicate "target" key in javascript/tsconfig.json — a pre-existing defect that broke the JS test suite under Node 24's strict JSON parser (it was already failing on main). This change unblocks that suite.

Spec housekeeping

Remove the stale scenario "PR #1 does not modify the legacy approval workflows" from specs/langwatch-pr-gate-pattern.feature. It asserted that PR #1 left approval-or-hotfix.yml and low-risk-evaluation.yml untouched, but both workflows have since been deleted on main, so the assertion is no longer meaningful. The matching traceability comment is updated to mark it removed. (The separate deletion scenarios for those two workflows in the same spec are unaffected.)

Python / TS parity

Rename python/recordings/ → python/outputs/recordings/ — mirrors the TS rename + nest in PR #561. outputs/ is the parent for all test-run artifacts (recordings now, traces/logs/screenshots later): "outputs" describes the purpose, "recordings" the format. The helper module and function names are preserved (_recording_helper.py, save_demo_recording); a new python/outputs/README.md documents the parent dir.
Refresh python/scenario/voice/assets/noise/*.wav to match the layered, distinct cafe / street / office / airport ambience (plus the babble sample used by the multiple_voices effect) now shipping on TS in PR #561. Closes the parity gap in backgroundNoise(). Generated with the same deterministic seeded generator as the TS assets (which ship in PR #561, not yet on main), so the two are reproducible from one seed; a single canonical generator that writes both targets is tracked as follow-up #588.

Test sanity

Python voice unit + capability + recording suites pass with 0 regressions (see CI). Live-LLM / integration tests are deselected under CI=true (they require API keys).
The capability error-message refactor keeps the existing "capability matrix" in str(err) assertion green; the recordings rename is path-only, so the recording artifact contract is unchanged.

Out of scope (kept as one follow-up issue)

#588 (voice: fold to one canonical noise-sample generator, single source-of-truth): one script that writes both python/scenario/voice/assets/noise/*.wav and javascript/src/voice/assets/noise/*.wav. Architectural change; deferred.

Closes

Closes #571.

Also carries to main the work tracked by #587, #589, and #590 — those issues were already closed on 2026-05-29; this PR lands their changes. (GitHub auto-closes only the still-open #571 on merge.)

🤖 Generated with Claude Code

…es closed PR #355) Internal QA guide superseded by PR #355's merge. Zero references in the repo. The file referenced a workflow that no longer matches the current voice-adapter architecture. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ed voice/adapters/twilio.mdx) The orphan was a hands-on smoke-test playbook (Twilio trial setup, cloudflared install, 3 runnable smoke scripts, manual webhook reset recipe). It is superseded by the published reference doc at docs/docs/pages/voice/adapters/twilio.mdx, which is the canonical user-facing source. The orphan referenced example files that no longer exist in the current tree (voice_pipecat_scenario.py, voice_twilio_agent_answers_scenario.py, voice_twilio_simulator_calls_human_scenario.py), so keeping it would have sent users to dead links. Verified zero external refs: rg -l voice-twilio --hidden # empty Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…process doc convention) Process docs idiomatically live in .github/ alongside CODEOWNERS, PULL_REQUEST_TEMPLATE, and contributing guides. The Low-Risk PR policy is GitHub-workflow scaffolding, not user-facing product docs.

Coordinated updates to every load-bearing reference:
- .github/workflows/pr-auto-approve.yml: RESTRICTED_PATTERN regex, 3 file-path uses, 2 GitHub URLs, 1 comment.
- specs/langwatch-pr-gate-pattern.feature: 3 spec assertions and the AC-X4 coverage-map comment.
- scripts/validate-pr-auto-approve.sh: EXPECTED_PATTERN regex (must match the workflow byte-for-byte), AC-X3 policy-path grep assertions and messages, comment-URL assertion, plus a new fail-guard against the prior docs/ path resurfacing.

Verified the validator runs clean on the new state (AC-X3 + AC-1.6 all pass). Two pre-existing AC-1.11 failures on main are unrelated to this move (legacy workflows already deleted by a later PR in the gate-swap sequence).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
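The fail-guard against the old docs/ path resurfacing can be pictured with a small Python sketch. This is an illustration only: the real check lives in scripts/validate-pr-auto-approve.sh (a shell script whose contents are not shown here), and the function names below are hypothetical.

```python
# Paths taken from the PR description; function names are hypothetical.
OLD_POLICY_PATH = "docs/LOW_RISK_PULL_REQUESTS.md"
NEW_POLICY_PATH = ".github/LOW_RISK_PULL_REQUESTS.md"


def stale_policy_refs(text: str) -> list[int]:
    """Return 1-based line numbers that still mention the old docs/ path."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if OLD_POLICY_PATH in line
    ]


def references_new_path(text: str) -> bool:
    """True when the text mentions the relocated policy file at least once."""
    return NEW_POLICY_PATH in text
```

A validator built this way would fail when `stale_policy_refs` returns a non-empty list for the workflow file, and also when `references_new_path` is false.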

Audit feedback: these were real docs with real refs but were unpublished — discoverable only by grep. Moved into the vocs pages tree + added to the sidebar so users actually find them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…f-truth Audit feedback: three files claimed "capability matrix" with slightly different content (Python source-of-truth, TS mirror, published mdx). Reduced to one: the published mdx is now the canonical source. Both sides reference the public URL. Error messages in Python and TS code now point users at the live docs URL instead of an unpublished markdown file. JS contract-surface test points at the published mdx. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
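As a rough sketch of what an error message pointing at the published docs might look like on the Python side: the URL is the one quoted in the verification comment on this PR, while the class name, constructor arguments, and message wording are assumptions made for illustration and are not taken from python/scenario/voice/capabilities.py.

```python
# URL as quoted in the PR's verification comment.
CAPABILITY_MATRIX_URL = "https://scenario-docs.langwatch.ai/voice/capability-matrix"


class UnsupportedCapabilityError(RuntimeError):
    """Hypothetical error type; the real class in capabilities.py may differ."""

    def __init__(self, adapter: str, capability: str) -> None:
        # Keep the phrase "capability matrix" in the message so assertions of
        # the form `"capability matrix" in str(err)` continue to hold.
        super().__init__(
            f"{adapter} does not support {capability!r}. "
            f"See the capability matrix: {CAPABILITY_MATRIX_URL}"
        )
        self.adapter = adapter
        self.capability = capability
```

Keeping the URL in one module-level constant means a future docs move touches a single line per language.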

…rkflows Workflows approval-or-hotfix.yml and low-risk-evaluation.yml were deleted in PR #4 of the gate-swap sequence (already merged on main) but the spec still asserted the diff of PR #1 must not modify them. Both PRs have long-since landed and the files no longer exist on main, so the assertion is no longer meaningful. The AC mapping comment is preserved with a tombstone explaining the removal. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…y with TS) Mirrors the TS rename in PR #561. "Outputs" describes the purpose (what the example tests produced); "recordings" describes the format (audio recordings). The recording artifact name is preserved inside the helper (save_demo_recording, _recording_helper.py, _RECORDINGS_ROOT variable); only the on-disk dir changes.

Updates:
- git mv python/recordings → python/outputs (68 files; rename detected)
- python/examples/voice/_recording_helper.py: directory path + docstrings
- python/examples/voice/pipecat_{scenario,ws}.py: comment paths
- python/.env.example: SCENARIO_LOG_FILE default
- .gitignore: all python/recordings patterns
- .github/workflows/voice-integration.yml: upload-artifact path
- specs/voice-agents.feature: 3 mentions in scenario text
- specs/voice-docs-surface.feature: 1 mention in troubleshooting scenario

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…red ambience) Python was shipping 0.5s/24KB single-tone placeholders; TS upgraded to 3s/144KB layered cafe/office/highway/wind ambience in PR #561. Bringing Python into parity so backgroundNoise() ships the same quality on both sides. Byte-identical to the TS asset bundle (same deterministic seeded generator). WAVs: 144044 bytes each, 3.00s, mono PCM16 24kHz. LICENSES.md: updated to describe the new generator + layered content; notes the cross-language byte-identical copy with a reference to the single-canonical-generator follow-up at #588. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
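The quoted file size follows from the format: 3 s × 24,000 samples/s × 2 bytes per PCM16 sample × 1 channel = 144,000 bytes of audio, plus the 44-byte canonical RIFF/WAVE header, for 144,044 bytes. The sketch below reproduces that arithmetic with Python's standard wave module. It writes silence rather than ambience and is not the project's generator.

```python
import io
import wave

SAMPLE_RATE_HZ = 24_000  # 24kHz, per the commit message
SAMPLE_WIDTH_BYTES = 2   # PCM16
CHANNELS = 1             # mono
DURATION_S = 3           # 3.00s


def payload_size() -> int:
    """Bytes of raw audio data for the stated format and duration."""
    return SAMPLE_RATE_HZ * DURATION_S * SAMPLE_WIDTH_BYTES * CHANNELS


def silent_wav_bytes() -> bytes:
    """Build an in-memory WAV of silence with the same format parameters."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as writer:
        writer.setnchannels(CHANNELS)
        writer.setsampwidth(SAMPLE_WIDTH_BYTES)
        writer.setframerate(SAMPLE_RATE_HZ)
        writer.writeframes(b"\x00" * payload_size())
    return buffer.getvalue()
```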

User feedback: outputs/ should be a parent for all test-run artifact types (recordings now, traces/logs/screenshots later). Adds the recordings/ nesting + a new outputs/README.md describing the shape. Writer in _recording_helper.py updated to point at outputs/recordings/. Symmetric with the TS-side nesting in PR #561. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

JSON allows duplicate keys (last wins) but Node 24's tsx/oxc parser rejects with a strict JSON parse error. This was breaking 21 test files on the ci-checks (24.x) CI matrix on both main and PR #586. Removes the duplicate; first occurrence (same value, ES2022) is kept. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
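A lenient parser hides this class of defect, so a small check can surface it before a stricter parser does. The following is a minimal Python sketch using the standard json module's object_pairs_hook. It assumes the file is plain JSON (no comments or trailing commas, which tsconfig files sometimes contain), and the function name is hypothetical.

```python
import json


def find_duplicate_keys(text: str) -> list[str]:
    """Return every key that appears more than once within the same object."""
    duplicates: list[str] = []

    def record_pairs(pairs):
        # Called once per JSON object with its (key, value) pairs in order.
        seen = set()
        for key, _value in pairs:
            if key in seen:
                duplicates.append(key)
            seen.add(key)
        # Preserve the default last-wins behaviour for the parsed result.
        return dict(pairs)

    json.loads(text, object_pairs_hook=record_pairs)
    return duplicates
```

For an input such as `{"compilerOptions": {"target": "ES2022", "strict": true, "target": "ES2022"}}` the function reports `target` once, while an ordinary `json.loads` call would accept the document silently.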

Pre-existing failure masked by the tsconfig parse error; surfaced after PR #586's tsconfig fix unblocked the test loader. Commit 71dd5ed (origin/main, 2026-05-29) added 'interruption' to the feature step:

    And it declares: streaming_transcripts, native_vad, dtmf, interruption, input_formats, output_formats

but did not update the matching binding string in voice-contract-surface.test.ts, so vitest-cucumber raised StepAbleUnknowStepError for 'Scenario: Every adapter publishes a capabilities attribute' at specs/voice-agents.feature:751. The binding body already asserted empty.interruption; only the title string was stale. Aligning it restores the green path:

    Test Files  1 passed (1)
    Tests  16 passed (16)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gemini-3-pro-preview returned 404 on the live Gemini API: "This model models/gemini-3-pro-preview is no longer available. Please update your code to use a newer model for the latest features and improvements." Swapped both occurrences (JudgeAgent + litellm.completion) for gemini-2.5-pro — current stable pro-tier Gemini, same intent as the original (pro-tier judge). The other Gemini reference in this file (UserSimulatorAgent on gemini-2.5-flash) and other repo Gemini call sites already use the gemini-2.5 family. Pre-existing failure on main; this unblocks main CI and PR #586. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The `gh pr diff` API returns 406 when a PR's diff exceeds GitHub's 20k-line cap (observed on PR #561's evaluate run 26644602950). The downstream "Fail fast for oversized diffs" step already handles `oversized=true` gracefully (sets qualifies=false, exits 0, posts manual-review-required review), but the upstream `gh pr diff` call crashes before reaching it. Wrap the fetch: on HTTP 406, set oversized=true and exit 0. Other failures still propagate as exit 1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
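The decision logic described above can be sketched as a pure function in Python. The workflow itself does this in shell, and the marker substring below is an assumption about what the gh error text contains, not the pattern the workflow actually greps for. The dataclass and function names are hypothetical.

```python
from dataclasses import dataclass

# Assumption: a substring of the gh stderr emitted when the 20k-line cap is
# hit. The workflow's real grep pattern is not shown in this PR text.
ASSUMED_OVERSIZED_MARKER = "diff exceeded the maximum number of lines"


@dataclass(frozen=True)
class FetchOutcome:
    oversized: bool  # value for the step's `oversized` output
    exit_code: int   # exit code the step should finish with
    reason: str      # stderr text to surface downstream, if any


def classify_diff_fetch(
    returncode: int,
    stderr: str,
    marker: str = ASSUMED_OVERSIZED_MARKER,
) -> FetchOutcome:
    """Map a `gh pr diff` result onto the three cases in the commit message."""
    if returncode == 0:
        return FetchOutcome(oversized=False, exit_code=0, reason="")
    if marker in stderr:
        # Diff-size cap: route to the oversized path instead of crashing.
        return FetchOutcome(oversized=True, exit_code=0, reason=stderr.strip())
    # Anything else (auth, network, ...) still fails the step.
    return FetchOutcome(oversized=False, exit_code=1, reason=stderr.strip())
```

Matching on a specific marker rather than on any non-zero exit is what keeps genuine auth or network failures visible as hard failures.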

Follow-up to bafdbf7: the 406-handling fix now emits an oversized_reason output so the downstream "Fail fast for oversized diffs" step can post a more honest review explaining WHY evaluation was skipped (fetch failure vs char-count exceeded). Pass the reason via env: pattern (intermediate environment variable, not direct ${{ }} expansion in run:) per the documented GitHub Actions shell-injection hardening pattern. Use heredoc for GITHUB_OUTPUT writes since gh CLI stderr can be multi-line and would otherwise corrupt the output-file format. Pattern source: PR #572 commit 7dd9311. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
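For readers unfamiliar with the heredoc form of GITHUB_OUTPUT, here is a minimal Python sketch of the same idea. The workflow writes the file from shell, so this only illustrates the documented `name<<DELIMITER` layout; the function names and the delimiter scheme are assumptions.

```python
import os
import secrets
from typing import Optional


def format_multiline_output(name: str, value: str, delimiter: Optional[str] = None) -> str:
    """Render one output entry in the multi-line (heredoc) form."""
    # A random delimiter makes an accidental collision with the value unlikely.
    delimiter = delimiter or f"ghadelimiter_{secrets.token_hex(8)}"
    if delimiter in value:
        raise ValueError("delimiter collides with the value")
    return f"{name}<<{delimiter}\n{value}\n{delimiter}\n"


def append_output(name: str, value: str, path: Optional[str] = None) -> None:
    """Append the entry to the file named by GITHUB_OUTPUT (or an explicit path)."""
    target = path or os.environ["GITHUB_OUTPUT"]
    with open(target, "a", encoding="utf-8") as handle:
        handle.write(format_multiline_output(name, value))
```

A single-line `name=value` write would break as soon as the stderr text contained a newline, which is the corruption the commit message describes. The consuming step would then read the reason from an environment variable set under `env:` rather than interpolating the expression directly into the script body.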

drewdrewthis · 2026-05-31T17:30:07Z

/prove-it: PASS + /review: clean (re-verified at 06489d8, after addressing review findings).

All ACs across #571 / #587 / #589 / #590 + PR-body claims map to first-hand evidence:

#571 (ci: evaluate workflow hard-fails on PRs >20k-line diff instead of its oversized path): the specific-grep catch in the evaluate step routes the GitHub 20k-line diff cap to the oversized path (exit 0); genuine auth/network errors still exit 1; oversized_reason is passed via env: (shell-injection-hardened). The evaluate job is green on this head. ✅
#587 (voice(python): noise-sample parity with TS + rename recordings → outputs): noise WAVs refreshed to 3.00s / 24kHz (was 0.5s placeholder); python/recordings/ → python/outputs/recordings/ rename complete (helper points there, no stale refs); LICENSES.md updated. ✅
#589 (docs(voice): publish or demote orphan voice docs (capability-matrix, happy-path-*)) and #590 (docs(voice): reduce 3 capability-matrix files to 1 source-of-truth + 1 published): 3 capability-matrix files folded to 1 published .mdx; py + ts error messages point at https://scenario-docs.langwatch.ai/voice/capability-matrix. ✅
Docs/infra — 2 orphan docs deleted; LOW_RISK_PULL_REQUESTS.md → .github/ (workflow regex + validator + spec changed in coordination); tsconfig.json duplicate target key fixed; stale AC-1.11 assertion removed. ✅
Tests — voice unit 315 passed, 35 deselected (CI-equivalent), capabilities 5/5, recording 11/11. ✅

Review findings addressed (commit 06489d8):

(blocking) The AC-1.11 block in scripts/validate-pr-auto-approve.sh asserted that two workflows exist which were deleted in PR #4 of the gate-swap sequence, so the validator exited 1 despite the workflow being correct. The PR body cited this script as verification. Fixed: stale block removed, validator now exits 0 (mirrors the spec's own AC-1.11 removal).
The LICENSES.md regeneration instruction pointed at generate-noise-samples.mjs, which ships in PR #561 and isn't on main yet. Fixed: reworded to reference PR #561 and follow-up #588.
PR-body "byte-identical to the TS asset bundle" softened — no TS bundle on main to compare against; assets are reproducible from the same seed.

Non-fixes (deliberate): spec Background line listing the two now-deleted workflows is the pre-sequence (t0) baseline every scenario references — not a current-state claim — so it's correct as-is. Filed #592 for the systemic gap that the validator is never run in CI.

🤖 /prove-it + /review via Claude Code

The AC-1.11 block asserted approval-or-hotfix.yml and low-risk-evaluation.yml still exist, but both were deleted in PR #4 of the gate-swap sequence, so the validator exited 1 even though the workflow is correct. The matching spec scenario was already removed in this PR; this aligns the validator script. Also fix the noise-sample LICENSES.md regeneration note, which pointed at generate-noise-samples.mjs — a script that ships with PR #561 and is not on main yet. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

drewdrewthis · 2026-05-31T18:14:27Z

📣 Slack PR-review request posted to #dev — thread. (CI 🟢 on 06489d8, /prove-it + /review clean.) Label slack-requested may be absent due to the org OAuth-App restriction on this token; this comment is the dedup marker.

github-actions · 2026-06-01T09:27:10Z

Automated low-risk assessment

This PR was evaluated against the repository's Low-Risk Pull Requests procedure and does not qualify as low risk.

This PR's diff exceeds the size limit for automated low-risk evaluation. Manual review required.

This PR requires a manual review before merging.

drewdrewthis and others added 3 commits May 29, 2026 13:58

drewdrewthis self-assigned this May 29, 2026

drewdrewthis and others added 5 commits May 29, 2026 14:25

drewdrewthis changed the title ~~chore: docs cleanup — delete 2 stale orphans, move LOW_RISK policy to .github/~~ chore: main-side cleanup — docs + spec + python/TS parity May 29, 2026

drewdrewthis and others added 2 commits May 29, 2026 14:59

drewdrewthis mentioned this pull request May 29, 2026

feat(typescript-sdk): voice agent testing — consolidated clean stack #561

Merged

drewdrewthis and others added 3 commits May 29, 2026 17:41

drewdrewthis mentioned this pull request May 31, 2026

fix(ci/#571): soft-pass oversized PR diffs in pr-auto-approve evaluate #572

Closed

drewdrewthis mentioned this pull request May 31, 2026

ci: validate-pr-auto-approve.sh is never run — wire into CI or delete #592

Open

3 tasks

drewdrewthis added the pr-ready label Jun 1, 2026

drewdrewthis requested review from 0xdeafcafe, Aryansharma28, rogeriochaves and sergioestebance June 1, 2026 09:26

drewdrewthis mentioned this pull request Jun 1, 2026

chore: scrub vestigial AC-reference tags from code + specs #594

Merged

0xdeafcafe approved these changes Jun 1, 2026

drewdrewthis merged commit 371f94c into main Jun 1, 2026
25 checks passed

drewdrewthis deleted the chore/docs-cleanup-orphans branch June 1, 2026 15:43

This was referenced Jun 1, 2026

chore(main): release javascript 0.4.12 #386

Merged

chore(main): release python 0.7.28 #542

Open

drewdrewthis merged 15 commits into main from chore/docs-cleanup-orphans


2 participants
