fix(whatsapp): respect audioAsVoice flag in outbound delivery by masatohoshino · Pull Request #68744 · openclaw/openclaw

masatohoshino · 2026-04-19T00:58:20Z

Summary

On the WhatsApp channel, audio replies flagged with [[audio_as_voice]] were being delivered as generic document attachments instead of native PTT (push-to-talk) voice notes.

This matters because WhatsApp presents voice notes differently from generic file attachments, so the intended voice-message UX was lost for audio reply scenarios.

This PR updates the WhatsApp adapter to consume the audioAsVoice flag across the relevant send paths (send.ts, outbound-base.ts, outbound-adapter.ts, and auto-reply/deliver-reply.ts). When the flag is true and the media is audio (detected by MIME type or filename extension), the file is routed to the PTT path with its mimetype normalized to audio/ogg; codecs=opus. Non-audio files continue to use the document path even if the flag is true.

Scope is intentionally limited to the WhatsApp adapter. Pipeline-side audioAsVoice propagation (src/infra/outbound/), other channel adapters, and public APIs are unchanged.

Change Type

Bug fix

Scope

Integrations (WhatsApp adapter only)

Linked Issue/PR

Closes WhatsApp: [[audio_as_voice]] / PTT voice-note delivery not working #66053
This PR fixes a bug or regression

Root Cause

The pipeline already propagated audioAsVoice through sendOverrides, but the WhatsApp adapter did not consistently consume and forward that flag on its audio-send paths.

A structural factor that made this easier to miss was that send.ts and outbound-base.ts used separate local options types, and the outbound-base.ts version did not include audioAsVoice. As a result, the flag could fall through silently on that path.

This gap was surfaced by @escotilha in #66053.

What Changed

WhatsApp send paths now consistently forward and consume audioAsVoice
Audio media with audioAsVoice=true is sent via the PTT / voice-note path
The outgoing mimetype is normalized to audio/ogg; codecs=opus on that path
Non-audio files remain on the document path even when audioAsVoice=true

Regression Test Plan

Coverage: Unit + seam integration
Files touched: extensions/whatsapp/src/{send,outbound-base,outbound-adapter}.ts, extensions/whatsapp/src/auto-reply/deliver-reply.ts, plus their *.test.ts siblings
Scenarios covered:
- audioAsVoice=true + audio MIME or .ogg → PTT voice-note path
- audioAsVoice=true + non-audio (application/octet-stream, etc.) → document path preserved
- audioAsVoice unset/false + kind="document" with audio-like filename → document path preserved
- Adapter/base → send forwarding: audioAsVoice propagates from outbound-adapter.sendMedia and outbound-base.sendMedia to sendMessageWhatsApp
Test results: pnpm test extensions/whatsapp/src passes locally (57 files / 483 tests)
Added/updated coverage includes:
- send.test.ts
- auto-reply/deliver-reply.test.ts

Security Impact

New permissions: No
Secrets handling: No
Network calls: No (uses the existing WhatsApp send path)
Data scope: No

Repro + Verification

Steps

Configure OpenClaw with the WhatsApp channel enabled.
Trigger an outbound reply that produces an audio file (.ogg / opus) together with [[audio_as_voice]] and MEDIA:<path>.
Inspect how the message arrives in the WhatsApp client.

Before this fix

The audio file arrives as a generic document attachment rather than a native voice note.

After this fix

The audio file arrives as a PTT voice note (waveform UI / play button).

Evidence

Failing mode covered by tests + passing after
- The newly added audioAsVoice=true cases in send.test.ts and auto-reply/deliver-reply.test.ts are designed to cover the missing-flag-consumption path that existed before this fix. With this change, the expected PTT routing, non-audio document preservation, and unset-flag regression guards all pass locally.
Trace/log snippets
- Local test run: pnpm test extensions/whatsapp/src → 57 files / 483 tests pass
- codex review --base origin/main reported no P1 or P2 findings
Screenshot/recording
- I do not have a reproducible WhatsApp test environment for live before/after captures. The behavior difference is covered by unit + seam tests, and the repro steps above should allow maintainers to verify it directly in a configured WhatsApp environment.

Human Verification

Verified scenarios:
- Local build and full WhatsApp test suite pass after the change
- pre-commit hook (pnpm check, type check, import-cycle check) passes cleanly
- Unset-path regression tests confirm that documents with audio-like filenames remain on the document path when audioAsVoice is unset
Edge cases covered by tests:
- Non-audio files are not mis-routed to PTT when the flag is true
- kind: "document" with voice.ogg filename and unset flag remains on the document path
Not verified:
- Live WhatsApp delivery on a real account (no reproducible environment on my side)
- Very large audio payloads (>16MB) and uncommon audio container formats

Compatibility / Migration

Backward compatible: Yes
Config / env changes: No
Migration: No

Risks and Follow-ups

I noticed a few places in other channel adapters that may be worth checking for similar flag-consumption gaps. I intentionally kept this PR scoped to WhatsApp so the fix remains reviewable and directly resolves WhatsApp: [[audio_as_voice]] / PTT voice-note delivery not working #66053.
The duplicated local options-type definitions in send.ts and outbound-base.ts made this omission easier to miss. I left that broader cleanup out of scope here to keep the PR minimal, but consolidating those types in a follow-up could help prevent similar issues.

AI-assisted

Drafted with Claude (strategy and split decisions) + Codex CLI (implementation)
Tested locally — all relevant tests pass
Author understands the change

WhatsApp adapter previously did not consume the audioAsVoice flag from ChannelOutboundContext, so audio messages with [[audio_as_voice]] directive were delivered as document attachments instead of PTT voice notes. This change adds audioAsVoice consumption across the three layers: - send.ts: forceVoiceDelivery decision based on audioAsVoice + audio MIME/extension, with mediaType normalization - outbound-base.ts / outbound-adapter.ts: forwards audioAsVoice from sendMedia to lower layers - deliver-reply.ts: routes audio-equivalent files to voice path when audioAsVoice is true, with mimetype normalization to "audio/ogg; codecs=opus" Non-audio files are kept on the document path even when audioAsVoice is true, to avoid mis-application. Closes openclaw#66053

Add regression coverage for the case where audioAsVoice is unset (or false) and the media kind is document. Both send.ts and deliver-reply.ts paths are covered, since deliver-reply.ts has independent voice-routing logic that calls msg.sendMedia directly rather than going through sendMessageWhatsApp. These tests fix the previously implicit assumption that audioAsVoice consumption (added in the parent commit) does not regress the historical document-path behavior.

masatohoshino · 2026-04-19T00:58:22Z

For a bit of extra context on the follow-up note in the PR body:

While working on this fix, I noticed a few other channel adapters that may be worth checking for similar audioAsVoice consumption gaps in their outbound paths. I deliberately kept this PR scoped to WhatsApp so it stays easy to review and resolves #66053 cleanly.

Happy to prepare a separate follow-up PR for the cross-channel side if maintainers think that would be useful — otherwise I'll leave the observation here for the record.

greptile-apps · 2026-04-19T01:00:43Z

Greptile Summary

This PR fixes the WhatsApp adapter to correctly route audio replies flagged with audioAsVoice=true to the PTT (push-to-talk) voice-note path instead of the document-attachment path. The fix is applied consistently across send.ts, outbound-base.ts, outbound-adapter.ts, and auto-reply/deliver-reply.ts, with a follow-up commit (566b640) also guarding the media.kind === "audio" branch against being overwritten when forceVoiceDelivery is active — directly addressing the concern raised in the previous review thread.

Confidence Score: 5/5

Safe to merge — all previously flagged P1 issues have been addressed and the remaining changes are well-tested.

The previously flagged bug (audio-kind branch overwriting forceVoiceDelivery) is fixed in the latest commit with the !forceVoiceDelivery guard on line 133 of send.ts. The mimetype normalization logic is consistent across all four touched files. No new logic paths are left unguarded, and no remaining P1 or P0 findings exist.

No files require special attention.

_{Reviews (2): Last reviewed commit: "fix(whatsapp): guard audio-kind branch a..." | Re-trigger Greptile}

…rride Address greptile review on openclaw#68744. The audio-kind branch in send.ts was a standalone if statement, unlike the video/image/document branches that all share an else-if chain guarded by !forceVoiceDelivery. Without the guard, the correct "audio/ogg; codecs=opus" mimetype set by forceVoiceDelivery was overwritten with "application/octet-stream" when media.contentType was null, silently defeating the audioAsVoice fix for that case. Move the audio-kind branch into the else-if chain so it is consistent with the other media-kind branches, and add a regression test that pins the null-contentType path to the PTT mimetype.

masatohoshino · 2026-04-19T01:39:43Z

Thanks @greptile-apps for the catch. Addressed in the latest commit:

send.ts — moved the media.kind === "audio" branch into the else if chain guarded by !forceVoiceDelivery, matching the pattern used by the video/image/document branches. The forceVoiceDelivery block now correctly owns the audio-routing decision when the flag is set.
send.test.ts — added a regression test for the audioAsVoice=true + kind="audio" + contentType=null path, asserting the resulting mimetype is audio/ogg; codecs=opus.

Verification:

pnpm test extensions/whatsapp/src → 57 files / 484 tests pass
codex review --base origin/main reports no P1 or P2 findings on the latest state
The new test fails against the pre-fix code as expected

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 566b640784

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Add two media utilities for channel adapters that route audio media: - isVerifiedAudioSource: classifies media as audio only when the kind or contentType comes from a trusted upstream signal, deliberately ignoring filename or URL hints that an attacker can control. - sanitizeMediaMime: validates a MIME string against RFC 6838 token syntax, rejects control characters that could enable header injection in downstream API clients, and strips parameters by default. These are exposed via openclaw/plugin-sdk/media-runtime through the existing wildcard re-export, so plugin adapters (WhatsApp here, others later) can adopt them without further plugin-sdk changes.

…r PTT Address two aisle-research-bot findings on openclaw#68744: - CWE-20 (Aisle openclaw#1): the previous forceVoiceDelivery decision treated filename or URL extension as sufficient evidence that media is audio. An attacker controlling those hints could coerce non-audio content onto the PTT path. Switch to isVerifiedAudioSource so only kind="audio" or a contentType in the audio/* family triggers voice routing. - CWE-93 (Aisle openclaw#2): media.contentType was forwarded to Baileys as the outgoing mimetype with no validation. Run it through sanitizeMediaMime before use, falling back to the canonical opus string when the input is unsafe or malformed. Two existing tests were rewritten because they had previously locked in the spoofing behavior (filename-only voice coercion). They now assert that document delivery is preserved instead. Two new integration tests cover the CRLF rejection path on both send.ts and deliver-reply.ts.

…udio Generalize sanitizeMediaMime application beyond the audio-only path. The previous PR-W introduced sanitizeMediaMime but applied it only to the audio/voice-note branch, leaving image/video/document branches forwarding raw media.contentType to Baileys (CWE-113 header injection risk). Address aisle-research-bot findings openclaw#1 and openclaw#2 on openclaw#68744 by: - send.ts: route all kinds (image/video/document/audio) through sanitizeMediaMime with kind-specific fallbacks (image/jpeg for image, video/mp4 for video, application/octet-stream for document) so the downstream send-api kind detection (which keys off the mediaType prefix) keeps working when sanitization rejects the input. - deliver-reply.ts: apply the same pattern to msg.sendMedia mimetype for image/video/document branches with matching prefix-guarded fallbacks. - isVerifiedAudioSource: normalize contentType through sanitizeMediaMime before checking the audio/ prefix, so malformed values are rejected and valid case variants are accepted. The trust model itself (kind or contentType-based classification) is preserved for backward compatibility with the [[audio_as_voice]] feature path. Add regression tests covering the CRLF-rejection fallback path for each media kind in deliver-reply.ts (image, video, document) and a representative case in send.ts. Add unit tests for isVerifiedAudioSource covering case-insensitive matching, malformed-input rejection, and invalid-token rejection. Aisle finding openclaw#3 (Medium) raises a separate question about the trust model itself (whether contentType-based classification should be allowed at all). That requires a behavior change on the [[audio_as_voice]] feature path and is left out of this PR for separate discussion.

…on safety Add sanitizeFileName as an outbound document-name sanitizer and apply it to the WhatsApp document delivery paths. Aisle finding openclaw#2 on commit 7247b12 noted that document.fileName is passed to Baileys without sanitization in both deliver-reply.ts (auto-reply document send) and send.ts (documentFileName forwarding), allowing CRLF/control characters in attacker-controllable filenames to reach downstream Content-Disposition multipart headers. - src/media/mime.ts: introduce sanitizeFileName(input) -> string with control-character stripping (charCodeAt-based to avoid the no-control-regex lint rule), path-separator and quote replacement with underscore, length cap (128 chars), and a "file" fallback for empty/unsafe input. Exposed via the existing openclaw/plugin-sdk/media-runtime wildcard re-export. - deliver-reply.ts: wrap the document fileName resolution with sanitizeFileName before passing to msg.sendMedia. - send.ts: wrap documentFileName with sanitizeFileName before forwarding through sendOptions. Add 19 unit tests for sanitizeFileName covering CRLF/control character stripping, path separator and quote replacement, length cap, and empty fallback. Add regression tests for the document filename sanitization in deliver-reply.ts and send.ts. Aisle finding openclaw#1 (Medium, magic-byte verification in isVerifiedAudioSource) and openclaw#3 (Medium, fileName sanitization in inbound/send-api.ts) are out of scope for this PR and will be tracked as separate follow-ups: openclaw#1 because it requires loadWebMedia() refactoring and changes the [[audio_as_voice]] trust model; openclaw#3 because it shares the same root cause as parseContentDispositionFileName upstream and benefits from a coordinated inbound-side fix.

…eration Address aisle-research-bot CWE-400 finding on commit 949d3d8. The previous sanitizeFileName implementation built the result with repeated string concatenation (stripped += char) and applied the 128-character cap only after the full string was built. For very long attacker-controlled filenames (e.g. from Content-Disposition headers in remote URLs), this could degrade to O(n^2) build cost and enable CPU/memory DoS. Switch to an Array-based accumulator joined at the end (O(n) build), inline the path-separator and quote replacement into the iteration, and break the loop as soon as 128 valid characters have been accumulated. This makes both the build cost and the loop bound proportional to min(input length, 128) rather than the full input length. No behavioral change: all existing 19 unit tests for sanitizeFileName continue to pass with identical output. Add one regression test that asserts a 100k-character input is capped at 128 with the expected content, exercising the early-cap path.

…(RTLO) Strengthen the outbound document filename sanitization policy to also strip Unicode bidirectional and invisible format characters before length-capping and ASCII control stripping. This closes a UI-spoofing class of attack where an attacker-controlled filename can use right-to- left override (RTLO, U+202E) and related directional formatting characters to make a malicious filename render misleadingly in the WhatsApp client (e.g. "invoice<RLO>gnp.exe" appears as "invoiceexe.png" to the recipient). The sanitizeFileName helper now strips the following Unicode format characters at the start of processing: - U+202A..U+202E: explicit directional formatting (LRE/RLE/PDF/LRO/RLO) - U+2066..U+2069: directional isolates (LRI/RLI/FSI/PDI) - U+200E..U+200F: implicit marks (LRM/RLM) - U+061C: Arabic Letter Mark (ALM) These format characters have no legitimate use in outbound filenames and are removed via a module-scope regex so the cost is paid once at module load. JSDoc updated to document the bidi-stripping intent. Also addresses aisle-research-bot finding on commit 9f34b56. Add four regression tests covering the classic RTLO trick, LRM/RLM implicit marks, LRI/PDI directional isolates, and Arabic Letter Mark. All previous unit tests for sanitizeFileName continue to pass with identical output.

masatohoshino · 2026-04-19T22:41:17Z

@codex review

masatohoshino · 2026-04-19T22:47:45Z

All security findings raised by the analysis bots on this PR have been addressed or scoped to follow-ups as discussed above. The PR is ready for human review when convenient.

Two items intentionally left for separate follow-ups:

magic-byte verification in isVerifiedAudioSource (needs loadWebMedia refactoring and changes the [[audio_as_voice]] trust model)
fileName sanitization in inbound/send-api.ts (best fixed alongside the upstream parseContentDispositionFileName sanitizer)

Happy to split either of those into their own PRs if maintainers want them tracked separately.

chatgpt-codex-connector · 2026-04-19T22:48:29Z

Codex Review: Didn't find any major issues. Swish!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

masatohoshino added 2 commits April 18, 2026 23:45

openclaw-barnacle bot added channel: whatsapp-web Channel integration: whatsapp-web size: M labels Apr 19, 2026

greptile-apps bot reviewed Apr 19, 2026

View reviewed changes

Comment thread extensions/whatsapp/src/send.ts Outdated

chatgpt-codex-connector bot reviewed Apr 19, 2026

View reviewed changes

Comment thread extensions/whatsapp/src/send.ts Outdated

masatohoshino added 3 commits April 19, 2026 04:16

openclaw-barnacle bot added size: L and removed size: M labels Apr 19, 2026

masatohoshino added 3 commits April 19, 2026 14:43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(whatsapp): respect audioAsVoice flag in outbound delivery#68744

fix(whatsapp): respect audioAsVoice flag in outbound delivery#68744
masatohoshino wants to merge 9 commits intoopenclaw:mainfrom
masatohoshino:fix/whatsapp-audio-as-voice-consumption

masatohoshino commented Apr 19, 2026

Uh oh!

masatohoshino commented Apr 19, 2026

Uh oh!

greptile-apps bot commented Apr 19, 2026 •

edited

Loading

Uh oh!

Uh oh!

masatohoshino commented Apr 19, 2026

Uh oh!

chatgpt-codex-connector bot left a comment

Uh oh!

Uh oh!

masatohoshino commented Apr 19, 2026

Uh oh!

masatohoshino commented Apr 19, 2026

Uh oh!

chatgpt-codex-connector bot commented Apr 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

masatohoshino commented Apr 19, 2026

Summary

Change Type

Scope

Linked Issue/PR

Root Cause

What Changed

Regression Test Plan

Security Impact

Repro + Verification

Steps

Before this fix

After this fix

Evidence

Human Verification

Compatibility / Migration

Risks and Follow-ups

AI-assisted

Uh oh!

masatohoshino commented Apr 19, 2026

Uh oh!

greptile-apps bot commented Apr 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Uh oh!

Uh oh!

masatohoshino commented Apr 19, 2026

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

masatohoshino commented Apr 19, 2026

Uh oh!

masatohoshino commented Apr 19, 2026

Uh oh!

chatgpt-codex-connector bot commented Apr 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

greptile-apps bot commented Apr 19, 2026 •

edited

Loading