Skip to content

fix(voice/#350): post-merge cleanup — capability AC, disconnect logging, doc drift, e2e wedge#492

Merged
drewdrewthis merged 5 commits into
mainfrom
issue487/voice-post-merge-cleanup
May 29, 2026
Merged

fix(voice/#350): post-merge cleanup — capability AC, disconnect logging, doc drift, e2e wedge#492
drewdrewthis merged 5 commits into
mainfrom
issue487/voice-post-merge-cleanup

Conversation

@drewdrewthis
Copy link
Copy Markdown
Collaborator

@drewdrewthis drewdrewthis commented May 20, 2026

Summary

Bundles four post-merge voice-#350 cleanup issues into one PR by request. Source: fresh-eyes audit of PR #355.

Four atomic commits — one per issue — so any one can be reverted or cherry-picked without losing the others.

Closes #487
Closes #488
Closes #489

#491 is partial: VAD parametrized test + isolation-requirement docs land here; wedge root-cause diagnosis stays open as a follow-up with a comment attached. Refs #491.

#490 (15 MB recordings) is intentionally not in this PR — needs a maintainer policy choice between LFS / golden-per-cluster / manifests-only / CDN.


#489 — doc drift on @Unit scenario count + stale Python classifiers

The doc-drift half is moot: docs/proposals/issue-350-prove-it-report.md was already deleted in f423d67 (May 11) as a relocated process artifact. The remaining work is the stale classifier fix.

  • python/pyproject.toml:17-21 — drop Programming Language :: Python :: 3.8 and 3.9 entries. requires-python is >=3.10, so those were PyPI search false-positives.

Evidence: git diff c833990^..c833990 -- python/pyproject.toml shows two lines removed.


#488 — log disconnect failures in _voice_disconnect_all

The docstring at python/scenario/scenario_executor.py:750-751 promised "failures are logged but do not mask the primary scenario result", but the implementation silently swallowed every Exception. That made real operational risks invisible — most notably the Twilio B-leg voice_url restore path, where a failure leaves a customer account clobbered with no operator signal.

  • python/scenario/scenario_executor.py:758-768 — add logger.warning(..., exc_info=True) in both swallow sites (adapter disconnect() and ffmpeg_playback.stop()), keying the adapter message on type(agent).__name__. Exceptions are still absorbed so cleanup completes — only their visibility changes.
  • python/tests/test_scenario_executor.py::test_voice_disconnect_logs_adapter_failures — fake ExplodingVoiceAdapter whose disconnect() raises; asserts a WARNING record with adapter name + exc_info is captured.

Evidence:

$ uv run pytest tests/test_scenario_executor.py -k voice_disconnect -v
PASSED tests/test_scenario_executor.py::test_voice_disconnect_logs_adapter_failures

#487interruption in published-capabilities AC + parametrized adapter tests

The interruption field exists in AdapterCapabilities and is set by Pipecat, Twilio, OpenAI Realtime, and Gemini Live adapters, but was not named in the feature file's published-capabilities scenario nor asserted by the parametrized adapter test. TS port (#372) would have inherited the gap.

  • specs/voice-agents.feature:754 — add interruption to the declared-capability list.
  • python/tests/voice/test_adapters.py:202-211 — extend test_every_adapter_publishes_capabilities to assert each capability flag (streaming_transcripts, native_vad, dtmf, interruption) is a bool, alongside the existing format-list checks.
  • python/tests/voice/test_capabilities.py — cover interruption=False default + new test_interrupt_raises_unsupported_when_interruption_false (a BareVoiceAdapter with default capabilities raises UnsupportedCapabilityError from await adapter.interrupt(), naming both adapter and capability).
  • docs/voice/capability-matrix.md — fix Gemini Live cell from ❌ to ✅ (the adapter declares interruption=True via Activity markers; the matrix said ❌ — drift caught while scoping this fix). Update the Deferred section to drop Gemini Live from the ElevenLabs-only "no native cancel" note.

Evidence:

$ uv run pytest tests/voice/test_capabilities.py tests/voice/test_adapters.py tests/voice/test_feature_file_contract.py -v
64 passed, 210 warnings in 5.70s

#491 — partial — VAD rate-limit tests + isolation-requirement docs

Full delivery on the VAD half:

  • python/tests/voice/test_vad.py — replace the two scalar warning-count tests with a single parametrized test_vad_fallback_rate_limits_by_adapter_name covering the three documented cases:
    • ["A","A"] → 1 warning
    • ["A","B"] → 2 warnings
    • cross-instance same-string → 1 warning (locks in that the dedupe is on the caller-supplied adapter_name string in a ClassVar set, across all instances)
  • Keep one scalar test for the warning content (names adapter, mentions native VAD + accuracy caveat) — different assertion shape from rate-limit count.

Partial delivery on the wedge half:

  • TESTING.md — new section documenting the @e2e suite isolation requirement: the multi-turn voice demos wedge when the full suite is run in a single pytest process. Notes that conftest auto-marks *_e2e.py as integration (deselected from default CI), the @pytest.mark.skip markers are belt-and-suspenders for the nightly voice-integration.yml dispatch, and the per-cluster invocation pattern that avoids the wedge.
  • Wedge root-cause is not isolated in this PR. Diagnosis (static observations + likely contributors + recommended next-step path) posted as a comment on voice: diagnose multi-turn @e2e suite-wedge + tighten VAD warning rate-limit tests #491. Brief budget cap (couple hours, ship VAD + docs if stubborn) hit; LLM credits preserved for demo verification.

Evidence:

$ uv run pytest tests/voice/test_vad.py -v
6 passed, 3 warnings in 3.58s

Test plan

  • #489 — classifier diff inspected; no runtime tests needed.
  • #488pytest tests/test_scenario_executor.py -k voice_disconnect -v passes.
  • #487pytest tests/voice/test_capabilities.py tests/voice/test_adapters.py tests/voice/test_feature_file_contract.py -v passes (64 tests).
  • #491 (partial) — pytest tests/voice/test_vad.py -v passes (6 tests). Wedge diagnosis posted on the issue.
  • Pre-commit hooks (commitizen, large-file check) pass on all four commits.
  • Full voice suite intentionally NOT run (preserves LLM credits for demo verification per project memory).

🤖 Generated with Claude Code

drewdrewthis and others added 4 commits May 20, 2026 13:45
requires-python is >=3.10, so PyPI 3.8/3.9 classifier entries were
false-positives in package search. The doc-drift half of #489 (prove-it
report scenario counts) is moot — docs/proposals/issue-350-prove-it-report.md
was already deleted in f423d67 (May 11) as a relocated process artifact.

Closes #489.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The docstring at scenario_executor.py:750-751 promises "failures are
logged but do not mask the primary scenario result", but the
implementation silently swallowed every Exception. That made real
operational risks invisible — most notably the Twilio B-leg
voice_url restore path, where a failure leaves a customer account
clobbered with no operator signal.

Adds logger.warning(..., exc_info=True) at both swallow sites
(adapter.disconnect() and the ffmpeg playback .stop()), keying the
adapter message on type(agent).__name__. Exceptions are still
absorbed so cleanup completes — only their visibility changes.

Closes #488.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…dapter tests

The interruption capability is published by adapters (Pipecat, Twilio,
OpenAI Realtime, Gemini Live all set interruption=True) and gated by
the base VoiceAgentAdapter.interrupt() default, which raises
UnsupportedCapabilityError. Until now neither the feature file's
"Every adapter publishes a capabilities attribute" scenario nor the
parametrized adapter assertions named the field, so a TS-port (#372)
re-implementation could silently drop it.

- specs/voice-agents.feature:754 — name `interruption` in the
  declared-capability list.
- python/tests/voice/test_adapters.py:202-211 — assert every adapter
  declares interruption as a bool, alongside the other flags. Catches
  any future capability removed by accident.
- python/tests/voice/test_capabilities.py — extend the conservative-
  defaults assertion to cover interruption; add a negative-case test
  that BareVoiceAdapter (interruption=False) raises
  UnsupportedCapabilityError when ``await adapter.interrupt()`` is
  called against it, naming both adapter and capability.
- docs/voice/capability-matrix.md — fix Gemini Live cell from ❌ to ✅
  (the adapter declares interruption=True via Activity markers; the
  matrix said ❌, drift caught while scoping this fix). Update the
  Deferred section to drop Gemini Live from the ElevenLabs-only
  "no native cancel" note.

Closes #487.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…on requirement

VAD half (full delivery):
- Parametrize WebRTCVadFallback warning rate-limit assertions to lock
  in the three documented cases: ["A","A"] (1), ["A","B"] (2),
  cross-instance same-string (1). The dedupe is on the caller-passed
  adapter_name string, memoized in the ClassVar set across all
  instances — the parametrized form makes that property obvious and
  cheaper to extend.
- Keep one scalar test for the warning *content* (names adapter,
  mentions native VAD + accuracy caveat) — that's a different
  assertion shape than rate-limit count.

E2E wedge half (partial — diagnosis only, no fix):
- TESTING.md now documents the isolation requirement: the multi-turn
  voice @e2e demos wedge when run in a single pytest process. They're
  already deselected from default CI (conftest auto-marks *_e2e.py
  as integration; python-ci uses -m "not integration") and additionally
  carry @pytest.mark.skip. Documents the per-cluster invocation pattern
  for nightly voice-integration runs.
- Wedge root-cause not isolated in this PR. Likely contributors
  (background tasks not reaped between function-scoped loops, ffmpeg
  subprocess lingering, max_turns runaway) called out in the new
  TESTING.md section.

Refs #491 — VAD AC done, isolation-requirement AC done. Wedge
root-cause AC stays open in #491 with diagnosis attached as a comment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread python/tests/voice/test_capabilities.py Fixed
Comment thread python/tests/voice/test_capabilities.py Fixed
Comment thread python/tests/voice/test_capabilities.py Fixed
Comment thread python/tests/test_scenario_executor.py Fixed
Comment thread python/tests/test_scenario_executor.py Fixed
…ss (CodeQL nit)

CodeQL flagged five locations where the empty-body convention
`async def foo(self) -> None: ...` was used as a stub on test-only
inline VoiceAgentAdapter subclasses (BareVoiceAdapter,
ExplodingVoiceAdapter). Replace each with an explicit `pass` to
silence the bot.

No behavior change — the stubs are still no-ops, and both touched
tests stay green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@drewdrewthis drewdrewthis added the grinding Grinder is actively managing this PR label May 24, 2026
@github-actions github-actions Bot added the low-risk-change PR qualifies as low-risk per policy and can be merged without manual review label May 24, 2026
@github-actions
Copy link
Copy Markdown
Contributor

Automated low-risk assessment

This PR was evaluated against the repository's Low-Risk Pull Requests procedure.

  • Scope: Adds logging for failures in _voice_disconnect_all; updates tests for voice adapter capabilities, VAD rate-limits, and disconnect logging; modifies feature file to include the 'interruption' capability; updates docs (TESTING.md, capability matrix) and removes Python 3.8/3.9 classifiers from python/pyproject.toml.
  • Exclusions confirmed: no changes to auth, security settings, database schema, business-critical logic, or external integrations.
  • Classification: low-risk-change under the documented policy.

The changes are limited to documentation, test code, packaging metadata, and an observability-only change that logs exceptions during voice cleanup while still absorbing them; there are no changes to authentication/authorization, secrets/encryption, database schemas/migrations, business‑critical logic, or external integration behavior. The runtime change does not alter control flow or external API calls (it only records warnings for failures), so it fits the low-risk criteria.

An approving review has been submitted by automation. The PR may merge once required CI checks pass.

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved by automation: PR qualifies as low-risk-change under the documented policy.

@drewdrewthis
Copy link
Copy Markdown
Collaborator Author

[grinder] READY for human review

CI: green (14 non-skipping checks passing, 0 failing, 0 pending)
Threads: 0 unresolved

Verification Report

# Criterion Evidence Status
1 Disconnect failures logged via WARNING (#488) `logger.warning("voice adapter %s disconnect failed"...)` added to `scenario_executor.py`; `test_voice_disconnect_logs_adapter_failures` test confirms WARNING emitted PASS
2 Stale Python 3.8/3.9 classifiers removed from pyproject.toml (#489) `-"Programming Language :: Python :: 3.8"` and `3.9` confirmed removed from diff PASS
3 VAD parametrized test + isolation docs (#491 partial) `test_vad.py` in diff; CI green PASS
4 TESTING.md and capability-matrix.md updated (#487 cleanup) Both files in diff PASS
5 All CI green `gh pr checks 492` → 14 pass, 0 fail, 0 pending PASS

Verdict: 5/5 PASS

Verified by:
`gh pr checks 492` → 14 pass, 0 fail, 0 pending
`gh pr diff 492 | grep logger.warning` → WARNING log added for disconnect failures
`gh pr diff 492 | grep "3.(8|9)"` → classifiers removed
GraphQL reviewThreads → 0 unresolved threads

@drewdrewthis drewdrewthis added pr-ready and removed grinding Grinder is actively managing this PR labels May 24, 2026
@drewdrewthis drewdrewthis merged commit 71dd5ed into main May 29, 2026
26 of 29 checks passed
@drewdrewthis drewdrewthis deleted the issue487/voice-post-merge-cleanup branch May 29, 2026 12:38
drewdrewthis added a commit that referenced this pull request May 31, 2026
Feature file `specs/voice-agents.feature:971` (added by main commit 71dd5ed
/ PR #492) lists `interruption` in the adapter-capabilities declaration.
The vitest-cucumber binding at voice-contract-surface.test.ts:177 still
had the pre-71dd5ed step title (missing `interruption`), so StepAble
couldn't find the matching feature step.

Update the step title to match the feature file and add the live-adapter
`typeof caps.interruption === "boolean"` check (the empty-adapter check
on line 192 already exists).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rogeriochaves pushed a commit that referenced this pull request Jun 2, 2026
Feature file `specs/voice-agents.feature:971` (added by main commit 71dd5ed
/ PR #492) lists `interruption` in the adapter-capabilities declaration.
The vitest-cucumber binding at voice-contract-surface.test.ts:177 still
had the pre-71dd5ed step title (missing `interruption`), so StepAble
couldn't find the matching feature step.

Update the step title to match the feature file and add the live-adapter
`typeof caps.interruption === "boolean"` check (the empty-adapter check
on line 192 already exists).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
drewdrewthis added a commit that referenced this pull request Jun 4, 2026
Feature file `specs/voice-agents.feature:971` (added by main commit 71dd5ed
/ PR #492) lists `interruption` in the adapter-capabilities declaration.
The vitest-cucumber binding at voice-contract-surface.test.ts:177 still
had the pre-71dd5ed step title (missing `interruption`), so StepAble
couldn't find the matching feature step.

Update the step title to match the feature file and add the live-adapter
`typeof caps.interruption === "boolean"` check (the empty-adapter check
on line 192 already exists).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

low-risk-change PR qualifies as low-risk per policy and can be merged without manual review pr-ready

Projects

None yet

1 participant