fix(voice/#350): post-merge cleanup — capability AC, disconnect logging, doc drift, e2e wedge by drewdrewthis · Pull Request #492 · langwatch/scenario

drewdrewthis · 2026-05-20T11:55:37Z

Summary

Bundles four post-merge voice-#350 cleanup issues into one PR by request. Source: fresh-eyes audit of PR #355.

Four atomic commits — one per issue — so any one can be reverted or cherry-picked without losing the others.

Closes #487
Closes #488
Closes #489

#491 is partial: VAD parametrized test + isolation-requirement docs land here; wedge root-cause diagnosis stays open as a follow-up with a comment attached. Refs #491.

#490 (15 MB recordings) is intentionally not in this PR — needs a maintainer policy choice between LFS / golden-per-cluster / manifests-only / CDN.

#489 — doc drift on @Unit scenario count + stale Python classifiers

The doc-drift half is moot: docs/proposals/issue-350-prove-it-report.md was already deleted in f423d67 (May 11) as a relocated process artifact. The remaining work is the stale classifier fix.

python/pyproject.toml:17-21 — drop Programming Language :: Python :: 3.8 and 3.9 entries. requires-python is >=3.10, so those were PyPI search false-positives.

Evidence: git diff c833990^..c833990 -- python/pyproject.toml shows two lines removed.

#488 — log disconnect failures in `_voice_disconnect_all`

The docstring at python/scenario/scenario_executor.py:750-751 promised "failures are logged but do not mask the primary scenario result", but the implementation silently swallowed every Exception. That made real operational risks invisible — most notably the Twilio B-leg voice_url restore path, where a failure leaves a customer account clobbered with no operator signal.

python/scenario/scenario_executor.py:758-768 — add logger.warning(..., exc_info=True) in both swallow sites (adapter disconnect() and ffmpeg_playback.stop()), keying the adapter message on type(agent).__name__. Exceptions are still absorbed so cleanup completes — only their visibility changes.
python/tests/test_scenario_executor.py::test_voice_disconnect_logs_adapter_failures — fake ExplodingVoiceAdapter whose disconnect() raises; asserts a WARNING record with adapter name + exc_info is captured.

Evidence:

$ uv run pytest tests/test_scenario_executor.py -k voice_disconnect -v
PASSED tests/test_scenario_executor.py::test_voice_disconnect_logs_adapter_failures

#487 — `interruption` in published-capabilities AC + parametrized adapter tests

The interruption field exists in AdapterCapabilities and is set by Pipecat, Twilio, OpenAI Realtime, and Gemini Live adapters, but was not named in the feature file's published-capabilities scenario nor asserted by the parametrized adapter test. TS port (#372) would have inherited the gap.

specs/voice-agents.feature:754 — add interruption to the declared-capability list.
python/tests/voice/test_adapters.py:202-211 — extend test_every_adapter_publishes_capabilities to assert each capability flag (streaming_transcripts, native_vad, dtmf, interruption) is a bool, alongside the existing format-list checks.
python/tests/voice/test_capabilities.py — cover interruption=False default + new test_interrupt_raises_unsupported_when_interruption_false (a BareVoiceAdapter with default capabilities raises UnsupportedCapabilityError from await adapter.interrupt(), naming both adapter and capability).
docs/voice/capability-matrix.md — fix Gemini Live cell from ❌ to ✅ (the adapter declares interruption=True via Activity markers; the matrix said ❌ — drift caught while scoping this fix). Update the Deferred section to drop Gemini Live from the ElevenLabs-only "no native cancel" note.

Evidence:

$ uv run pytest tests/voice/test_capabilities.py tests/voice/test_adapters.py tests/voice/test_feature_file_contract.py -v
64 passed, 210 warnings in 5.70s

#491 — partial — VAD rate-limit tests + isolation-requirement docs

Full delivery on the VAD half:

python/tests/voice/test_vad.py — replace the two scalar warning-count tests with a single parametrized test_vad_fallback_rate_limits_by_adapter_name covering the three documented cases:
- ["A","A"] → 1 warning
- ["A","B"] → 2 warnings
- cross-instance same-string → 1 warning (locks in that the dedupe is on the caller-supplied adapter_name string in a ClassVar set, across all instances)
Keep one scalar test for the warning content (names adapter, mentions native VAD + accuracy caveat) — different assertion shape from rate-limit count.

Partial delivery on the wedge half:

TESTING.md — new section documenting the @e2e suite isolation requirement: the multi-turn voice demos wedge when the full suite is run in a single pytest process. Notes that conftest auto-marks *_e2e.py as integration (deselected from default CI), the @pytest.mark.skip markers are belt-and-suspenders for the nightly voice-integration.yml dispatch, and the per-cluster invocation pattern that avoids the wedge.
Wedge root-cause is not isolated in this PR. Diagnosis (static observations + likely contributors + recommended next-step path) posted as a comment on voice: diagnose multi-turn @e2e suite-wedge + tighten VAD warning rate-limit tests #491. Brief budget cap (couple hours, ship VAD + docs if stubborn) hit; LLM credits preserved for demo verification.

Evidence:

$ uv run pytest tests/voice/test_vad.py -v
6 passed, 3 warnings in 3.58s

Test plan

#489 — classifier diff inspected; no runtime tests needed.
#488 — pytest tests/test_scenario_executor.py -k voice_disconnect -v passes.
#487 — pytest tests/voice/test_capabilities.py tests/voice/test_adapters.py tests/voice/test_feature_file_contract.py -v passes (64 tests).
#491 (partial) — pytest tests/voice/test_vad.py -v passes (6 tests). Wedge diagnosis posted on the issue.
Pre-commit hooks (commitizen, large-file check) pass on all four commits.
Full voice suite intentionally NOT run (preserves LLM credits for demo verification per project memory).

🤖 Generated with Claude Code

requires-python is >=3.10, so PyPI 3.8/3.9 classifier entries were false-positives in package search. The doc-drift half of #489 (prove-it report scenario counts) is moot — docs/proposals/issue-350-prove-it-report.md was already deleted in f423d67 (May 11) as a relocated process artifact. Closes #489. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The docstring at scenario_executor.py:750-751 promises "failures are logged but do not mask the primary scenario result", but the implementation silently swallowed every Exception. That made real operational risks invisible — most notably the Twilio B-leg voice_url restore path, where a failure leaves a customer account clobbered with no operator signal. Adds logger.warning(..., exc_info=True) at both swallow sites (adapter.disconnect() and the ffmpeg playback .stop()), keying the adapter message on type(agent).__name__. Exceptions are still absorbed so cleanup completes — only their visibility changes. Closes #488. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…dapter tests The interruption capability is published by adapters (Pipecat, Twilio, OpenAI Realtime, Gemini Live all set interruption=True) and gated by the base VoiceAgentAdapter.interrupt() default, which raises UnsupportedCapabilityError. Until now neither the feature file's "Every adapter publishes a capabilities attribute" scenario nor the parametrized adapter assertions named the field, so a TS-port (#372) re-implementation could silently drop it. - specs/voice-agents.feature:754 — name `interruption` in the declared-capability list. - python/tests/voice/test_adapters.py:202-211 — assert every adapter declares interruption as a bool, alongside the other flags. Catches any future capability removed by accident. - python/tests/voice/test_capabilities.py — extend the conservative- defaults assertion to cover interruption; add a negative-case test that BareVoiceAdapter (interruption=False) raises UnsupportedCapabilityError when ``await adapter.interrupt()`` is called against it, naming both adapter and capability. - docs/voice/capability-matrix.md — fix Gemini Live cell from ❌ to ✅ (the adapter declares interruption=True via Activity markers; the matrix said ❌, drift caught while scoping this fix). Update the Deferred section to drop Gemini Live from the ElevenLabs-only "no native cancel" note. Closes #487. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@e2e

…on requirement VAD half (full delivery): - Parametrize WebRTCVadFallback warning rate-limit assertions to lock in the three documented cases: ["A","A"] (1), ["A","B"] (2), cross-instance same-string (1). The dedupe is on the caller-passed adapter_name string, memoized in the ClassVar set across all instances — the parametrized form makes that property obvious and cheaper to extend. - Keep one scalar test for the warning *content* (names adapter, mentions native VAD + accuracy caveat) — that's a different assertion shape than rate-limit count. E2E wedge half (partial — diagnosis only, no fix): - TESTING.md now documents the isolation requirement: the multi-turn voice @e2e demos wedge when run in a single pytest process. They're already deselected from default CI (conftest auto-marks *_e2e.py as integration; python-ci uses -m "not integration") and additionally carry @pytest.mark.skip. Documents the per-cluster invocation pattern for nightly voice-integration runs. - Wedge root-cause not isolated in this PR. Likely contributors (background tasks not reaped between function-scoped loops, ffmpeg subprocess lingering, max_turns runaway) called out in the new TESTING.md section. Refs #491 — VAD AC done, isolation-requirement AC done. Wedge root-cause AC stays open in #491 with diagnosis attached as a comment. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ss (CodeQL nit) CodeQL flagged five locations where the empty-body convention `async def foo(self) -> None: ...` was used as a stub on test-only inline VoiceAgentAdapter subclasses (BareVoiceAdapter, ExplodingVoiceAdapter). Replace each with an explicit `pass` to silence the bot. No behavior change — the stubs are still no-ops, and both touched tests stay green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-24T23:27:01Z

Automated low-risk assessment

This PR was evaluated against the repository's Low-Risk Pull Requests procedure.

Scope: Adds logging for failures in _voice_disconnect_all; updates tests for voice adapter capabilities, VAD rate-limits, and disconnect logging; modifies feature file to include the 'interruption' capability; updates docs (TESTING.md, capability matrix) and removes Python 3.8/3.9 classifiers from python/pyproject.toml.
Exclusions confirmed: no changes to auth, security settings, database schema, business-critical logic, or external integrations.
Classification: low-risk-change under the documented policy.

The changes are limited to documentation, test code, packaging metadata, and an observability-only change that logs exceptions during voice cleanup while still absorbing them; there are no changes to authentication/authorization, secrets/encryption, database schemas/migrations, business‑critical logic, or external integration behavior. The runtime change does not alter control flow or external API calls (it only records warnings for failures), so it fits the low-risk criteria.

An approving review has been submitted by automation. The PR may merge once required CI checks pass.

github-actions

Approved by automation: PR qualifies as low-risk-change under the documented policy.

drewdrewthis · 2026-05-24T23:32:24Z

[grinder] READY for human review

CI: green (14 non-skipping checks passing, 0 failing, 0 pending)
Threads: 0 unresolved

Verification Report

#	Criterion	Evidence	Status
1	Disconnect failures logged via WARNING (#488)	`logger.warning("voice adapter %s disconnect failed"...)` added to `scenario_executor.py`; `test_voice_disconnect_logs_adapter_failures` test confirms WARNING emitted	PASS
2	Stale Python 3.8/3.9 classifiers removed from pyproject.toml (#489)	`-"Programming Language :: Python :: 3.8"` and `3.9` confirmed removed from diff	PASS
3	VAD parametrized test + isolation docs (#491 partial)	`test_vad.py` in diff; CI green	PASS
4	TESTING.md and capability-matrix.md updated (#487 cleanup)	Both files in diff	PASS
5	All CI green	`gh pr checks 492` → 14 pass, 0 fail, 0 pending	PASS

Verdict: 5/5 PASS

Verified by:
`gh pr checks 492` → 14 pass, 0 fail, 0 pending
`gh pr diff 492 | grep logger.warning` → WARNING log added for disconnect failures
`gh pr diff 492 | grep "3.(8|9)"` → classifiers removed
GraphQL reviewThreads → 0 unresolved threads

Feature file `specs/voice-agents.feature:971` (added by main commit 71dd5ed / PR #492) lists `interruption` in the adapter-capabilities declaration. The vitest-cucumber binding at voice-contract-surface.test.ts:177 still had the pre-71dd5ed step title (missing `interruption`), so StepAble couldn't find the matching feature step. Update the step title to match the feature file and add the live-adapter `typeof caps.interruption === "boolean"` check (the empty-adapter check on line 192 already exists). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

drewdrewthis and others added 4 commits May 20, 2026 13:45

github-code-quality Bot found potential problems May 20, 2026

View reviewed changes

drewdrewthis added the grinding Grinder is actively managing this PR label May 24, 2026

github-actions Bot added the low-risk-change PR qualifies as low-risk per policy and can be merged without manual review label May 24, 2026

github-actions Bot approved these changes May 24, 2026

View reviewed changes

drewdrewthis added pr-ready and removed grinding Grinder is actively managing this PR labels May 24, 2026

drewdrewthis merged commit 71dd5ed into main May 29, 2026
26 of 29 checks passed

drewdrewthis deleted the issue487/voice-post-merge-cleanup branch May 29, 2026 12:38

rogeriochaves mentioned this pull request May 29, 2026

chore(main): release python 0.7.28 #542

Open

langwatch-agent mentioned this pull request May 31, 2026

fix(deps): bump liquidjs override to >=10.26.0 to close RCE/ReDoS alerts #591

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(voice/#350): post-merge cleanup — capability AC, disconnect logging, doc drift, e2e wedge#492

fix(voice/#350): post-merge cleanup — capability AC, disconnect logging, doc drift, e2e wedge#492
drewdrewthis merged 5 commits into
mainfrom
issue487/voice-post-merge-cleanup

drewdrewthis commented May 20, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented May 24, 2026

Uh oh!

github-actions Bot left a comment

Uh oh!

drewdrewthis commented May 24, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

drewdrewthis commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

#489 — doc drift on @Unit scenario count + stale Python classifiers

#488 — log disconnect failures in _voice_disconnect_all

#487 — interruption in published-capabilities AC + parametrized adapter tests

#491 — partial — VAD rate-limit tests + isolation-requirement docs

Test plan

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented May 24, 2026

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Uh oh!

drewdrewthis commented May 24, 2026

Verification Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

drewdrewthis commented May 20, 2026 •

edited

Loading

#488 — log disconnect failures in `_voice_disconnect_all`

#487 — `interruption` in published-capabilities AC + parametrized adapter tests