realtime: honor output_modalities to skip TTS in text-only mode #9838
Merged
e5dd7b4 to 8db5a54
The emulated realtime pipeline previously ignored the OpenAI Realtime spec field output_modalities and always synthesized TTS. Add resolveOutputModalities + modalitiesContainAudio helpers and gate the TTS / ResponseOutputAudio* emission so a client requesting ["text"] gets only ResponseOutputText* events. This lets thin clients (e.g. thing5-poc) cache TTS on the client side while still using the realtime WS for VAD + STT + LLM + tool-call parsing. Assisted-by: Claude:claude-opus-4-7
Follow-up to the previous commit:
- Resolve response.create's output_modalities at the gate so a per-response override of an audio session is honored (the test asserted this contract but the production call site was passing nil).
- Mirror OutputModalities in the RealtimeSession echo so session.update round-trips the client-supplied value, matching MaxOutputTokens's pattern.
Assisted-by: Claude:claude-opus-4-7
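The override rule described in this commit can be sketched as below. This is a simplified illustration and not the code from the PR: the real `resolveOutputModalities(session, response)` takes the session and response objects, while this version takes plain slices. The `Modality` type stands in for `types.Modality`, and falling back to audio when neither side sets the field is an assumption based on the previous always-TTS behavior.

```go
package main

import "fmt"

// Modality is a hypothetical stand-in for the project's types.Modality.
type Modality string

const (
	ModalityText  Modality = "text"
	ModalityAudio Modality = "audio"
)

// resolveOutputModalities returns the per-response override when it is set,
// otherwise the session-level value, otherwise an audio default.
// The audio default is an assumption made for this sketch.
func resolveOutputModalities(session, override []Modality) []Modality {
	if len(override) > 0 {
		return override
	}
	if len(session) > 0 {
		return session
	}
	return []Modality{ModalityAudio}
}

func main() {
	audioSession := []Modality{ModalityAudio}

	// A response.create override of ["text"] wins over an audio session.
	fmt.Println(resolveOutputModalities(audioSession, []Modality{ModalityText}))

	// Passing nil as the override falls back to the session value, which is
	// why a call site that always passes nil can never honor an override.
	fmt.Println(resolveOutputModalities(audioSession, nil))
}
```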
CI's errcheck flagged the pre-existing `defer os.Remove(audioFilePath)` inside the audio-emission block (now wrapped by the modality gate). Wrap the call in a closure that explicitly discards the error — the canonical Go pattern for "I want to defer a cleanup whose error I genuinely don't care about." Assisted-by: Claude:claude-opus-4-7 golangci-lint
f3ef553 to 49027ee
Collaborator
Looks good!
Summary
The emulated realtime pipeline previously ignored the OpenAI Realtime spec field `output_modalities`: the field was declared on `RealtimeSession` and `Response` but never read, so the server always ran the TTS step and emitted `response.output_audio.*` events.

This PR gates the audio block in `core/http/endpoints/openai/realtime.go` on the resolved modalities. When a client requests `["text"]` (session-level or per-response via `response.create`), the server emits `response.output_text.delta` + `response.output_text.done` with `finalSpeech` and skips TTS entirely.

This enables thin clients that want to use the realtime WebSocket for VAD + STT + LLM + tool-call parsing while running their own TTS pipeline (e.g., for client-side caching).
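The gate described in the summary might look roughly like the sketch below. It is an illustration under stated assumptions, not the PR's code: `Modality` stands in for `types.Modality`, `eventsFor` is a hypothetical helper that returns event type names instead of running TTS or writing to the WebSocket, and the audio branch lists only two representative `response.output_audio.*` events.

```go
package main

import "fmt"

// Modality is a hypothetical stand-in for the project's types.Modality.
type Modality string

const (
	ModalityText  Modality = "text"
	ModalityAudio Modality = "audio"
)

// modalitiesContainAudio reports whether "audio" is among the modalities.
func modalitiesContainAudio(m []Modality) bool {
	for _, x := range m {
		if x == ModalityAudio {
			return true
		}
	}
	return false
}

// eventsFor returns the server event types a response would emit for the
// resolved modalities. In the real pipeline the audio branch is where TTS
// runs; here it only returns placeholder event names.
func eventsFor(resolved []Modality) []string {
	if modalitiesContainAudio(resolved) {
		return []string{"response.output_audio.delta", "response.output_audio.done"}
	}
	// Text-only: no TTS, just the text events.
	return []string{"response.output_text.delta", "response.output_text.done"}
}

func main() {
	fmt.Println(eventsFor([]Modality{ModalityText}))
	fmt.Println(eventsFor([]Modality{ModalityAudio}))
}
```

In this sketch an empty slice takes the text branch, so any default for unset modalities has to be applied before the gate, when the modalities are resolved.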
Changes
- `resolveOutputModalities(session, response)` and `modalitiesContainAudio(m)`. The TTS / `ResponseOutputAudio*` block (lines ~1657-1755) is wrapped in an `if modalitiesContainAudio(modalities)` branch; the `else` branch emits the text events.
- `OutputModalities []types.Modality` added on the local `Session` struct (mirrors `MaxOutputTokens` pattern), copied from `SessionUpdate` in `updateSession`, echoed back in the `session.update` server response, and resolved against `overrides.OutputModalities` from `response.create`.
- `modalitiesContainAudio` truth table.
- `defer os.Remove(audioFilePath)` rewritten as `defer func() { _ = os.Remove(audioFilePath) }()` to satisfy errcheck (the block is now inside the gated branch).

Test plan
- `go test ./core/http/endpoints/openai/`: all 90 specs pass.
- `go vet` clean.
- `session.update {output_modalities: ["text"]}`, send audio, confirm only `response.output_text.*` events arrive (no `response.output_audio.*`).
- `["audio"]`: confirm existing audio-mode behavior is unchanged.

Notes
FLAG_REALTIME_AUDIO) use a different code path.