fix(whatsapp): respect audioAsVoice flag in outbound delivery#68744
fix(whatsapp): respect audioAsVoice flag in outbound delivery#68744masatohoshino wants to merge 9 commits intoopenclaw:mainfrom
Conversation
WhatsApp adapter previously did not consume the audioAsVoice flag from ChannelOutboundContext, so audio messages with [[audio_as_voice]] directive were delivered as document attachments instead of PTT voice notes. This change adds audioAsVoice consumption across the three layers: - send.ts: forceVoiceDelivery decision based on audioAsVoice + audio MIME/extension, with mediaType normalization - outbound-base.ts / outbound-adapter.ts: forwards audioAsVoice from sendMedia to lower layers - deliver-reply.ts: routes audio-equivalent files to voice path when audioAsVoice is true, with mimetype normalization to "audio/ogg; codecs=opus" Non-audio files are kept on the document path even when audioAsVoice is true, to avoid mis-application. Closes openclaw#66053
Add regression coverage for the case where audioAsVoice is unset (or false) and the media kind is document. Both send.ts and deliver-reply.ts paths are covered, since deliver-reply.ts has independent voice-routing logic that calls msg.sendMedia directly rather than going through sendMessageWhatsApp. These tests fix the previously implicit assumption that audioAsVoice consumption (added in the parent commit) does not regress the historical document-path behavior.
|
For a bit of extra context on the follow-up note in the PR body: While working on this fix, I noticed a few other channel adapters that may be worth checking for similar Happy to prepare a separate follow-up PR for the cross-channel side if maintainers think that would be useful — otherwise I'll leave the observation here for the record. |
Greptile SummaryThis PR fixes the WhatsApp adapter to correctly route audio replies flagged with Confidence Score: 5/5Safe to merge — all previously flagged P1 issues have been addressed and the remaining changes are well-tested. The previously flagged bug (audio-kind branch overwriting forceVoiceDelivery) is fixed in the latest commit with the !forceVoiceDelivery guard on line 133 of send.ts. The mimetype normalization logic is consistent across all four touched files. No new logic paths are left unguarded, and no remaining P1 or P0 findings exist. No files require special attention. Reviews (2): Last reviewed commit: "fix(whatsapp): guard audio-kind branch a..." | Re-trigger Greptile |
…rride Address greptile review on openclaw#68744. The audio-kind branch in send.ts was a standalone if statement, unlike the video/image/document branches that all share an else-if chain guarded by !forceVoiceDelivery. Without the guard, the correct "audio/ogg; codecs=opus" mimetype set by forceVoiceDelivery was overwritten with "application/octet-stream" when media.contentType was null, silently defeating the audioAsVoice fix for that case. Move the audio-kind branch into the else-if chain so it is consistent with the other media-kind branches, and add a regression test that pins the null-contentType path to the PTT mimetype.
|
Thanks @greptile-apps for the catch. Addressed in the latest commit:
Verification:
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 566b640784
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Add two media utilities for channel adapters that route audio media: - isVerifiedAudioSource: classifies media as audio only when the kind or contentType comes from a trusted upstream signal, deliberately ignoring filename or URL hints that an attacker can control. - sanitizeMediaMime: validates a MIME string against RFC 6838 token syntax, rejects control characters that could enable header injection in downstream API clients, and strips parameters by default. These are exposed via openclaw/plugin-sdk/media-runtime through the existing wildcard re-export, so plugin adapters (WhatsApp here, others later) can adopt them without further plugin-sdk changes.
…r PTT Address two aisle-research-bot findings on openclaw#68744: - CWE-20 (Aisle openclaw#1): the previous forceVoiceDelivery decision treated filename or URL extension as sufficient evidence that media is audio. An attacker controlling those hints could coerce non-audio content onto the PTT path. Switch to isVerifiedAudioSource so only kind="audio" or a contentType in the audio/* family triggers voice routing. - CWE-93 (Aisle openclaw#2): media.contentType was forwarded to Baileys as the outgoing mimetype with no validation. Run it through sanitizeMediaMime before use, falling back to the canonical opus string when the input is unsafe or malformed. Two existing tests were rewritten because they had previously locked in the spoofing behavior (filename-only voice coercion). They now assert that document delivery is preserved instead. Two new integration tests cover the CRLF rejection path on both send.ts and deliver-reply.ts.
…udio Generalize sanitizeMediaMime application beyond the audio-only path. The previous PR-W introduced sanitizeMediaMime but applied it only to the audio/voice-note branch, leaving image/video/document branches forwarding raw media.contentType to Baileys (CWE-113 header injection risk). Address aisle-research-bot findings openclaw#1 and openclaw#2 on openclaw#68744 by: - send.ts: route all kinds (image/video/document/audio) through sanitizeMediaMime with kind-specific fallbacks (image/jpeg for image, video/mp4 for video, application/octet-stream for document) so the downstream send-api kind detection (which keys off the mediaType prefix) keeps working when sanitization rejects the input. - deliver-reply.ts: apply the same pattern to msg.sendMedia mimetype for image/video/document branches with matching prefix-guarded fallbacks. - isVerifiedAudioSource: normalize contentType through sanitizeMediaMime before checking the audio/ prefix, so malformed values are rejected and valid case variants are accepted. The trust model itself (kind or contentType-based classification) is preserved for backward compatibility with the [[audio_as_voice]] feature path. Add regression tests covering the CRLF-rejection fallback path for each media kind in deliver-reply.ts (image, video, document) and a representative case in send.ts. Add unit tests for isVerifiedAudioSource covering case-insensitive matching, malformed-input rejection, and invalid-token rejection. Aisle finding openclaw#3 (Medium) raises a separate question about the trust model itself (whether contentType-based classification should be allowed at all). That requires a behavior change on the [[audio_as_voice]] feature path and is left out of this PR for separate discussion.
…on safety Add sanitizeFileName as an outbound document-name sanitizer and apply it to the WhatsApp document delivery paths. Aisle finding openclaw#2 on commit 7247b12 noted that document.fileName is passed to Baileys without sanitization in both deliver-reply.ts (auto-reply document send) and send.ts (documentFileName forwarding), allowing CRLF/control characters in attacker-controllable filenames to reach downstream Content-Disposition multipart headers. - src/media/mime.ts: introduce sanitizeFileName(input) -> string with control-character stripping (charCodeAt-based to avoid the no-control-regex lint rule), path-separator and quote replacement with underscore, length cap (128 chars), and a "file" fallback for empty/unsafe input. Exposed via the existing openclaw/plugin-sdk/media-runtime wildcard re-export. - deliver-reply.ts: wrap the document fileName resolution with sanitizeFileName before passing to msg.sendMedia. - send.ts: wrap documentFileName with sanitizeFileName before forwarding through sendOptions. Add 19 unit tests for sanitizeFileName covering CRLF/control character stripping, path separator and quote replacement, length cap, and empty fallback. Add regression tests for the document filename sanitization in deliver-reply.ts and send.ts. Aisle finding openclaw#1 (Medium, magic-byte verification in isVerifiedAudioSource) and openclaw#3 (Medium, fileName sanitization in inbound/send-api.ts) are out of scope for this PR and will be tracked as separate follow-ups: openclaw#1 because it requires loadWebMedia() refactoring and changes the [[audio_as_voice]] trust model; openclaw#3 because it shares the same root cause as parseContentDispositionFileName upstream and benefits from a coordinated inbound-side fix.
…eration Address aisle-research-bot CWE-400 finding on commit 949d3d8. The previous sanitizeFileName implementation built the result with repeated string concatenation (stripped += char) and applied the 128-character cap only after the full string was built. For very long attacker-controlled filenames (e.g. from Content-Disposition headers in remote URLs), this could degrade to O(n^2) build cost and enable CPU/memory DoS. Switch to an Array-based accumulator joined at the end (O(n) build), inline the path-separator and quote replacement into the iteration, and break the loop as soon as 128 valid characters have been accumulated. This makes both the build cost and the loop bound proportional to min(input length, 128) rather than the full input length. No behavioral change: all existing 19 unit tests for sanitizeFileName continue to pass with identical output. Add one regression test that asserts a 100k-character input is capped at 128 with the expected content, exercising the early-cap path.
…(RTLO) Strengthen the outbound document filename sanitization policy to also strip Unicode bidirectional and invisible format characters before length-capping and ASCII control stripping. This closes a UI-spoofing class of attack where an attacker-controlled filename can use right-to- left override (RTLO, U+202E) and related directional formatting characters to make a malicious filename render misleadingly in the WhatsApp client (e.g. "invoice<RLO>gnp.exe" appears as "invoiceexe.png" to the recipient). The sanitizeFileName helper now strips the following Unicode format characters at the start of processing: - U+202A..U+202E: explicit directional formatting (LRE/RLE/PDF/LRO/RLO) - U+2066..U+2069: directional isolates (LRI/RLI/FSI/PDI) - U+200E..U+200F: implicit marks (LRM/RLM) - U+061C: Arabic Letter Mark (ALM) These format characters have no legitimate use in outbound filenames and are removed via a module-scope regex so the cost is paid once at module load. JSDoc updated to document the bidi-stripping intent. Also addresses aisle-research-bot finding on commit 9f34b56. Add four regression tests covering the classic RTLO trick, LRM/RLM implicit marks, LRI/PDI directional isolates, and Arabic Letter Mark. All previous unit tests for sanitizeFileName continue to pass with identical output.
|
@codex review |
|
All security findings raised by the analysis bots on this PR have been addressed or scoped to follow-ups as discussed above. The PR is ready for human review when convenient. Two items intentionally left for separate follow-ups:
Happy to split either of those into their own PRs if maintainers want them tracked separately. |
|
Codex Review: Didn't find any major issues. Swish! ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
Summary
On the WhatsApp channel, audio replies flagged with
[[audio_as_voice]]were being delivered as generic document attachments instead of native PTT (push-to-talk) voice notes.This matters because WhatsApp presents voice notes differently from generic file attachments, so the intended voice-message UX was lost for audio reply scenarios.
This PR updates the WhatsApp adapter to consume the
audioAsVoiceflag across the relevant send paths (send.ts,outbound-base.ts,outbound-adapter.ts, andauto-reply/deliver-reply.ts). When the flag is true and the media is audio (detected by MIME type or filename extension), the file is routed to the PTT path with its mimetype normalized toaudio/ogg; codecs=opus. Non-audio files continue to use the document path even if the flag is true.Scope is intentionally limited to the WhatsApp adapter. Pipeline-side
audioAsVoicepropagation (src/infra/outbound/), other channel adapters, and public APIs are unchanged.Change Type
Scope
Linked Issue/PR
Root Cause
The pipeline already propagated
audioAsVoicethroughsendOverrides, but the WhatsApp adapter did not consistently consume and forward that flag on its audio-send paths.A structural factor that made this easier to miss was that
send.tsandoutbound-base.tsused separate local options types, and theoutbound-base.tsversion did not includeaudioAsVoice. As a result, the flag could fall through silently on that path.This gap was surfaced by @escotilha in #66053.
What Changed
audioAsVoiceaudioAsVoice=trueis sent via the PTT / voice-note pathaudio/ogg; codecs=opuson that pathaudioAsVoice=trueRegression Test Plan
extensions/whatsapp/src/{send,outbound-base,outbound-adapter}.ts,extensions/whatsapp/src/auto-reply/deliver-reply.ts, plus their*.test.tssiblingsaudioAsVoice=true+ audio MIME or.ogg→ PTT voice-note pathaudioAsVoice=true+ non-audio (application/octet-stream, etc.) → document path preservedaudioAsVoiceunset/false +kind="document"with audio-like filename → document path preservedaudioAsVoicepropagates fromoutbound-adapter.sendMediaandoutbound-base.sendMediatosendMessageWhatsApppnpm test extensions/whatsapp/srcpasses locally (57 files / 483 tests)send.test.tsauto-reply/deliver-reply.test.tsSecurity Impact
Repro + Verification
Steps
.ogg/ opus) together with[[audio_as_voice]]andMEDIA:<path>.Before this fix
After this fix
Evidence
audioAsVoice=truecases insend.test.tsandauto-reply/deliver-reply.test.tsare designed to cover the missing-flag-consumption path that existed before this fix. With this change, the expected PTT routing, non-audio document preservation, and unset-flag regression guards all pass locally.pnpm test extensions/whatsapp/src→57 files / 483 tests passcodex review --base origin/mainreported no P1 or P2 findingsHuman Verification
pre-commithook (pnpm check, type check, import-cycle check) passes cleanlyaudioAsVoiceis unsetkind: "document"withvoice.oggfilename and unset flag remains on the document pathCompatibility / Migration
Risks and Follow-ups
send.tsandoutbound-base.tsmade this omission easier to miss. I left that broader cleanup out of scope here to keep the PR minimal, but consolidating those types in a follow-up could help prevent similar issues.AI-assisted