Skip to content

fix(whatsapp): respect audioAsVoice flag in outbound delivery#68744

Open
masatohoshino wants to merge 9 commits intoopenclaw:mainfrom
masatohoshino:fix/whatsapp-audio-as-voice-consumption
Open

fix(whatsapp): respect audioAsVoice flag in outbound delivery#68744
masatohoshino wants to merge 9 commits intoopenclaw:mainfrom
masatohoshino:fix/whatsapp-audio-as-voice-consumption

Conversation

@masatohoshino
Copy link
Copy Markdown
Contributor

Summary

On the WhatsApp channel, audio replies flagged with [[audio_as_voice]] were being delivered as generic document attachments instead of native PTT (push-to-talk) voice notes.

This matters because WhatsApp presents voice notes differently from generic file attachments, so the intended voice-message UX was lost for audio reply scenarios.

This PR updates the WhatsApp adapter to consume the audioAsVoice flag across the relevant send paths (send.ts, outbound-base.ts, outbound-adapter.ts, and auto-reply/deliver-reply.ts). When the flag is true and the media is audio (detected by MIME type or filename extension), the file is routed to the PTT path with its mimetype normalized to audio/ogg; codecs=opus. Non-audio files continue to use the document path even if the flag is true.

Scope is intentionally limited to the WhatsApp adapter. Pipeline-side audioAsVoice propagation (src/infra/outbound/), other channel adapters, and public APIs are unchanged.

Change Type

  • Bug fix

Scope

  • Integrations (WhatsApp adapter only)

Linked Issue/PR

Root Cause

The pipeline already propagated audioAsVoice through sendOverrides, but the WhatsApp adapter did not consistently consume and forward that flag on its audio-send paths.

A structural factor that made this easier to miss was that send.ts and outbound-base.ts used separate local options types, and the outbound-base.ts version did not include audioAsVoice. As a result, the flag could fall through silently on that path.

This gap was surfaced by @escotilha in #66053.

What Changed

  • WhatsApp send paths now consistently forward and consume audioAsVoice
  • Audio media with audioAsVoice=true is sent via the PTT / voice-note path
  • The outgoing mimetype is normalized to audio/ogg; codecs=opus on that path
  • Non-audio files remain on the document path even when audioAsVoice=true

Regression Test Plan

  • Coverage: Unit + seam integration
  • Files touched: extensions/whatsapp/src/{send,outbound-base,outbound-adapter}.ts, extensions/whatsapp/src/auto-reply/deliver-reply.ts, plus their *.test.ts siblings
  • Scenarios covered:
    • audioAsVoice=true + audio MIME or .ogg → PTT voice-note path
    • audioAsVoice=true + non-audio (application/octet-stream, etc.) → document path preserved
    • audioAsVoice unset/false + kind="document" with audio-like filename → document path preserved
    • Adapter/base → send forwarding: audioAsVoice propagates from outbound-adapter.sendMedia and outbound-base.sendMedia to sendMessageWhatsApp
  • Test results: pnpm test extensions/whatsapp/src passes locally (57 files / 483 tests)
  • Added/updated coverage includes:
    • send.test.ts
    • auto-reply/deliver-reply.test.ts

Security Impact

  • New permissions: No
  • Secrets handling: No
  • Network calls: No (uses the existing WhatsApp send path)
  • Data scope: No

Repro + Verification

Steps

  1. Configure OpenClaw with the WhatsApp channel enabled.
  2. Trigger an outbound reply that produces an audio file (.ogg / opus) together with [[audio_as_voice]] and MEDIA:<path>.
  3. Inspect how the message arrives in the WhatsApp client.

Before this fix

  • The audio file arrives as a generic document attachment rather than a native voice note.

After this fix

  • The audio file arrives as a PTT voice note (waveform UI / play button).

Evidence

  • Failing mode covered by tests + passing after
    • The newly added audioAsVoice=true cases in send.test.ts and auto-reply/deliver-reply.test.ts are designed to cover the missing-flag-consumption path that existed before this fix. With this change, the expected PTT routing, non-audio document preservation, and unset-flag regression guards all pass locally.
  • Trace/log snippets
    • Local test run: pnpm test extensions/whatsapp/src57 files / 483 tests pass
    • codex review --base origin/main reported no P1 or P2 findings
  • Screenshot/recording
    • I do not have a reproducible WhatsApp test environment for live before/after captures. The behavior difference is covered by unit + seam tests, and the repro steps above should allow maintainers to verify it directly in a configured WhatsApp environment.

Human Verification

  • Verified scenarios:
    • Local build and full WhatsApp test suite pass after the change
    • pre-commit hook (pnpm check, type check, import-cycle check) passes cleanly
    • Unset-path regression tests confirm that documents with audio-like filenames remain on the document path when audioAsVoice is unset
  • Edge cases covered by tests:
    • Non-audio files are not mis-routed to PTT when the flag is true
    • kind: "document" with voice.ogg filename and unset flag remains on the document path
  • Not verified:
    • Live WhatsApp delivery on a real account (no reproducible environment on my side)
    • Very large audio payloads (>16MB) and uncommon audio container formats

Compatibility / Migration

  • Backward compatible: Yes
  • Config / env changes: No
  • Migration: No

Risks and Follow-ups

  • I noticed a few places in other channel adapters that may be worth checking for similar flag-consumption gaps. I intentionally kept this PR scoped to WhatsApp so the fix remains reviewable and directly resolves WhatsApp: [[audio_as_voice]] / PTT voice-note delivery not working #66053.
  • The duplicated local options-type definitions in send.ts and outbound-base.ts made this omission easier to miss. I left that broader cleanup out of scope here to keep the PR minimal, but consolidating those types in a follow-up could help prevent similar issues.

AI-assisted

  • Drafted with Claude (strategy and split decisions) + Codex CLI (implementation)
  • Tested locally — all relevant tests pass
  • Author understands the change

WhatsApp adapter previously did not consume the audioAsVoice flag from
ChannelOutboundContext, so audio messages with [[audio_as_voice]]
directive were delivered as document attachments instead of PTT
voice notes.

This change adds audioAsVoice consumption across the three layers:
- send.ts: forceVoiceDelivery decision based on audioAsVoice + audio
  MIME/extension, with mediaType normalization
- outbound-base.ts / outbound-adapter.ts: forwards audioAsVoice from
  sendMedia to lower layers
- deliver-reply.ts: routes audio-equivalent files to voice path when
  audioAsVoice is true, with mimetype normalization to
  "audio/ogg; codecs=opus"

Non-audio files are kept on the document path even when audioAsVoice
is true, to avoid mis-application.

Closes openclaw#66053
Add regression coverage for the case where audioAsVoice is unset
(or false) and the media kind is document. Both send.ts and
deliver-reply.ts paths are covered, since deliver-reply.ts has
independent voice-routing logic that calls msg.sendMedia directly
rather than going through sendMessageWhatsApp.

These tests fix the previously implicit assumption that
audioAsVoice consumption (added in the parent commit) does not
regress the historical document-path behavior.
@masatohoshino
Copy link
Copy Markdown
Contributor Author

For a bit of extra context on the follow-up note in the PR body:

While working on this fix, I noticed a few other channel adapters that may be worth checking for similar audioAsVoice consumption gaps in their outbound paths. I deliberately kept this PR scoped to WhatsApp so it stays easy to review and resolves #66053 cleanly.

Happy to prepare a separate follow-up PR for the cross-channel side if maintainers think that would be useful — otherwise I'll leave the observation here for the record.

@openclaw-barnacle openclaw-barnacle bot added channel: whatsapp-web Channel integration: whatsapp-web size: M labels Apr 19, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Apr 19, 2026

Greptile Summary

This PR fixes the WhatsApp adapter to correctly route audio replies flagged with audioAsVoice=true to the PTT (push-to-talk) voice-note path instead of the document-attachment path. The fix is applied consistently across send.ts, outbound-base.ts, outbound-adapter.ts, and auto-reply/deliver-reply.ts, with a follow-up commit (566b640) also guarding the media.kind === "audio" branch against being overwritten when forceVoiceDelivery is active — directly addressing the concern raised in the previous review thread.

Confidence Score: 5/5

Safe to merge — all previously flagged P1 issues have been addressed and the remaining changes are well-tested.

The previously flagged bug (audio-kind branch overwriting forceVoiceDelivery) is fixed in the latest commit with the !forceVoiceDelivery guard on line 133 of send.ts. The mimetype normalization logic is consistent across all four touched files. No new logic paths are left unguarded, and no remaining P1 or P0 findings exist.

No files require special attention.

Reviews (2): Last reviewed commit: "fix(whatsapp): guard audio-kind branch a..." | Re-trigger Greptile

Comment thread extensions/whatsapp/src/send.ts Outdated
…rride

Address greptile review on openclaw#68744. The audio-kind branch in send.ts
was a standalone if statement, unlike the video/image/document
branches that all share an else-if chain guarded by !forceVoiceDelivery.
Without the guard, the correct "audio/ogg; codecs=opus" mimetype set
by forceVoiceDelivery was overwritten with "application/octet-stream"
when media.contentType was null, silently defeating the audioAsVoice
fix for that case.

Move the audio-kind branch into the else-if chain so it is consistent
with the other media-kind branches, and add a regression test that
pins the null-contentType path to the PTT mimetype.
@masatohoshino
Copy link
Copy Markdown
Contributor Author

Thanks @greptile-apps for the catch. Addressed in the latest commit:

  • send.ts — moved the media.kind === "audio" branch into the else if chain guarded by !forceVoiceDelivery, matching the pattern used by the video/image/document branches. The forceVoiceDelivery block now correctly owns the audio-routing decision when the flag is set.
  • send.test.ts — added a regression test for the audioAsVoice=true + kind="audio" + contentType=null path, asserting the resulting mimetype is audio/ogg; codecs=opus.

Verification:

  • pnpm test extensions/whatsapp/src → 57 files / 484 tests pass
  • codex review --base origin/main reports no P1 or P2 findings on the latest state
  • The new test fails against the pre-fix code as expected

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 566b640784

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread extensions/whatsapp/src/send.ts Outdated
Add two media utilities for channel adapters that route audio media:

- isVerifiedAudioSource: classifies media as audio only when the kind
  or contentType comes from a trusted upstream signal, deliberately
  ignoring filename or URL hints that an attacker can control.
- sanitizeMediaMime: validates a MIME string against RFC 6838 token
  syntax, rejects control characters that could enable header injection
  in downstream API clients, and strips parameters by default.

These are exposed via openclaw/plugin-sdk/media-runtime through the
existing wildcard re-export, so plugin adapters (WhatsApp here, others
later) can adopt them without further plugin-sdk changes.
…r PTT

Address two aisle-research-bot findings on openclaw#68744:

- CWE-20 (Aisle openclaw#1): the previous forceVoiceDelivery decision treated
  filename or URL extension as sufficient evidence that media is audio.
  An attacker controlling those hints could coerce non-audio content
  onto the PTT path. Switch to isVerifiedAudioSource so only kind="audio"
  or a contentType in the audio/* family triggers voice routing.
- CWE-93 (Aisle openclaw#2): media.contentType was forwarded to Baileys as
  the outgoing mimetype with no validation. Run it through
  sanitizeMediaMime before use, falling back to the canonical opus
  string when the input is unsafe or malformed.

Two existing tests were rewritten because they had previously locked
in the spoofing behavior (filename-only voice coercion). They now
assert that document delivery is preserved instead. Two new
integration tests cover the CRLF rejection path on both send.ts and
deliver-reply.ts.
…udio

Generalize sanitizeMediaMime application beyond the audio-only path.
The previous PR-W introduced sanitizeMediaMime but applied it only to
the audio/voice-note branch, leaving image/video/document branches
forwarding raw media.contentType to Baileys (CWE-113 header injection
risk).

Address aisle-research-bot findings openclaw#1 and openclaw#2 on openclaw#68744 by:
- send.ts: route all kinds (image/video/document/audio) through
  sanitizeMediaMime with kind-specific fallbacks (image/jpeg for image,
  video/mp4 for video, application/octet-stream for document) so the
  downstream send-api kind detection (which keys off the mediaType
  prefix) keeps working when sanitization rejects the input.
- deliver-reply.ts: apply the same pattern to msg.sendMedia mimetype
  for image/video/document branches with matching prefix-guarded
  fallbacks.
- isVerifiedAudioSource: normalize contentType through sanitizeMediaMime
  before checking the audio/ prefix, so malformed values are rejected
  and valid case variants are accepted. The trust model itself (kind
  or contentType-based classification) is preserved for backward
  compatibility with the [[audio_as_voice]] feature path.

Add regression tests covering the CRLF-rejection fallback path for
each media kind in deliver-reply.ts (image, video, document) and a
representative case in send.ts. Add unit tests for isVerifiedAudioSource
covering case-insensitive matching, malformed-input rejection, and
invalid-token rejection.

Aisle finding openclaw#3 (Medium) raises a separate question about the trust
model itself (whether contentType-based classification should be allowed
at all). That requires a behavior change on the [[audio_as_voice]]
feature path and is left out of this PR for separate discussion.
…on safety

Add sanitizeFileName as an outbound document-name sanitizer and apply it
to the WhatsApp document delivery paths. Aisle finding openclaw#2 on commit
7247b12 noted that document.fileName is passed to Baileys without
sanitization in both deliver-reply.ts (auto-reply document send) and
send.ts (documentFileName forwarding), allowing CRLF/control characters
in attacker-controllable filenames to reach downstream Content-Disposition
multipart headers.

- src/media/mime.ts: introduce sanitizeFileName(input) -> string with
  control-character stripping (charCodeAt-based to avoid the
  no-control-regex lint rule), path-separator and quote replacement
  with underscore, length cap (128 chars), and a "file" fallback for
  empty/unsafe input. Exposed via the existing
  openclaw/plugin-sdk/media-runtime wildcard re-export.
- deliver-reply.ts: wrap the document fileName resolution with
  sanitizeFileName before passing to msg.sendMedia.
- send.ts: wrap documentFileName with sanitizeFileName before forwarding
  through sendOptions.

Add 19 unit tests for sanitizeFileName covering CRLF/control character
stripping, path separator and quote replacement, length cap, and empty
fallback. Add regression tests for the document filename sanitization
in deliver-reply.ts and send.ts.

Aisle finding openclaw#1 (Medium, magic-byte verification in isVerifiedAudioSource)
and openclaw#3 (Medium, fileName sanitization in inbound/send-api.ts) are out of
scope for this PR and will be tracked as separate follow-ups: openclaw#1 because
it requires loadWebMedia() refactoring and changes the [[audio_as_voice]]
trust model; openclaw#3 because it shares the same root cause as
parseContentDispositionFileName upstream and benefits from a coordinated
inbound-side fix.
…eration

Address aisle-research-bot CWE-400 finding on commit 949d3d8.
The previous sanitizeFileName implementation built the result with
repeated string concatenation (stripped += char) and applied the
128-character cap only after the full string was built. For very long
attacker-controlled filenames (e.g. from Content-Disposition headers
in remote URLs), this could degrade to O(n^2) build cost and enable
CPU/memory DoS.

Switch to an Array-based accumulator joined at the end (O(n) build),
inline the path-separator and quote replacement into the iteration,
and break the loop as soon as 128 valid characters have been
accumulated. This makes both the build cost and the loop bound
proportional to min(input length, 128) rather than the full input
length.

No behavioral change: all existing 19 unit tests for sanitizeFileName
continue to pass with identical output. Add one regression test that
asserts a 100k-character input is capped at 128 with the expected
content, exercising the early-cap path.
…(RTLO)

Strengthen the outbound document filename sanitization policy to also
strip Unicode bidirectional and invisible format characters before
length-capping and ASCII control stripping. This closes a UI-spoofing
class of attack where an attacker-controlled filename can use right-to-
left override (RTLO, U+202E) and related directional formatting
characters to make a malicious filename render misleadingly in the
WhatsApp client (e.g. "invoice<RLO>gnp.exe" appears as "invoiceexe.png"
to the recipient).

The sanitizeFileName helper now strips the following Unicode format
characters at the start of processing:
- U+202A..U+202E: explicit directional formatting (LRE/RLE/PDF/LRO/RLO)
- U+2066..U+2069: directional isolates (LRI/RLI/FSI/PDI)
- U+200E..U+200F: implicit marks (LRM/RLM)
- U+061C:        Arabic Letter Mark (ALM)

These format characters have no legitimate use in outbound filenames
and are removed via a module-scope regex so the cost is paid once at
module load. JSDoc updated to document the bidi-stripping intent.

Also addresses aisle-research-bot finding on commit 9f34b56.

Add four regression tests covering the classic RTLO trick, LRM/RLM
implicit marks, LRI/PDI directional isolates, and Arabic Letter Mark.
All previous unit tests for sanitizeFileName continue to pass with
identical output.
@masatohoshino
Copy link
Copy Markdown
Contributor Author

@codex review

@masatohoshino
Copy link
Copy Markdown
Contributor Author

All security findings raised by the analysis bots on this PR have been addressed or scoped to follow-ups as discussed above. The PR is ready for human review when convenient.

Two items intentionally left for separate follow-ups:

  • magic-byte verification in isVerifiedAudioSource (needs loadWebMedia refactoring and changes the [[audio_as_voice]] trust model)
  • fileName sanitization in inbound/send-api.ts (best fixed alongside the upstream parseContentDispositionFileName sanitizer)

Happy to split either of those into their own PRs if maintainers want them tracked separately.

@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Swish!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

channel: whatsapp-web Channel integration: whatsapp-web size: L

Projects

None yet

Development

Successfully merging this pull request may close these issues.

WhatsApp: [[audio_as_voice]] / PTT voice-note delivery not working

1 participant