realtime: honor output_modalities to skip TTS in text-only mode #9838

Merged
mudler merged 3 commits into master from feat/realtime-honor-output-modalities
May 15, 2026

@localai-bot
Collaborator

@localai-bot localai-bot commented May 15, 2026

Summary

The emulated realtime pipeline previously ignored the OpenAI Realtime spec field output_modalities — the field was declared on RealtimeSession and Response but never read, so the server always ran the TTS step and emitted response.output_audio.* events.

This PR gates the audio block in core/http/endpoints/openai/realtime.go on the resolved modalities. When a client requests ["text"] (session-level or per-response via response.create), the server emits response.output_text.delta + response.output_text.done with finalSpeech and skips TTS entirely.

This enables thin clients that want to use the realtime WebSocket for VAD + STT + LLM + tool-call parsing while running their own TTS pipeline (e.g., for client-side caching).

Changes

  • realtime.go: two new helpers resolveOutputModalities(session, response) and modalitiesContainAudio(m). The TTS / ResponseOutputAudio* block (lines ~1657-1755) is wrapped in an if modalitiesContainAudio(modalities) branch; the else branch emits the text events.
  • Plumbing: OutputModalities []types.Modality added on the local Session struct (mirrors MaxOutputTokens pattern), copied from SessionUpdate in updateSession, echoed back in the session.update server response, and resolved against overrides.OutputModalities from response.create.
  • realtime_modality_test.go: new Ginkgo spec, 6 cases covering default-to-audio, session-level text-only, response-level override, and modalitiesContainAudio truth table.
  • lint fix: pre-existing defer os.Remove(audioFilePath) rewritten as defer func() { _ = os.Remove(audioFilePath) }() to satisfy errcheck (the block's now inside the gated branch).
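The resolution and gating logic described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code from realtime.go: the `Modality` type, the slice-based signature of `resolveOutputModalities`, and the `eventsFor` helper are assumptions made for the example, and the real audio branch emits more than the single event family named here.

```go
package main

import "fmt"

// Modality is a stand-in for types.Modality (assumed to be string-like).
type Modality string

const (
	ModalityText  Modality = "text"
	ModalityAudio Modality = "audio"
)

// resolveOutputModalities picks the effective modalities for one response.
// Assumed precedence: response.create override, then session value,
// then the audio default.
func resolveOutputModalities(session, response []Modality) []Modality {
	if len(response) > 0 {
		return response
	}
	if len(session) > 0 {
		return session
	}
	return []Modality{ModalityAudio}
}

// modalitiesContainAudio reports whether audio output was requested.
func modalitiesContainAudio(m []Modality) bool {
	for _, v := range m {
		if v == ModalityAudio {
			return true
		}
	}
	return false
}

// eventsFor is a hypothetical helper that names the event family the gate
// would emit; the real code sends events instead of returning strings.
func eventsFor(m []Modality) []string {
	if modalitiesContainAudio(m) {
		// Audio branch: TTS runs and audio events are emitted.
		return []string{"response.output_audio.*"}
	}
	// Text-only branch: TTS is skipped entirely.
	return []string{"response.output_text.delta", "response.output_text.done"}
}

func main() {
	// No modalities anywhere: falls back to audio.
	fmt.Println(eventsFor(resolveOutputModalities(nil, nil)))
	// Session-level text-only.
	fmt.Println(eventsFor(resolveOutputModalities([]Modality{ModalityText}, nil)))
	// Audio session with a per-response text override.
	fmt.Println(eventsFor(resolveOutputModalities(
		[]Modality{ModalityAudio}, []Modality{ModalityText})))
}
```

The three calls in `main` correspond to the default-to-audio, session-level text-only, and response-level override cases listed for the test spec.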

Test plan

  • go test ./core/http/endpoints/openai/ — all 90 specs pass.
  • go vet clean.
  • Manual: connect a WS client with session.update {output_modalities: ["text"]}, send audio, confirm only response.output_text.* events arrive (no response.output_audio.*).
  • Manual: same with default ["audio"] — confirm existing audio-mode behavior is unchanged.
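For the manual checks above, a client needs a text-only session.update payload and a way to spot stray audio events. The sketch below only builds the JSON and classifies event type strings; it does not open a WebSocket. The struct names and `isAudioEvent` are hypothetical, and the payload is trimmed to the fields mentioned in this PR, so a real session object may need more.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// sessionConfig carries only the field this PR is about.
type sessionConfig struct {
	OutputModalities []string `json:"output_modalities"`
}

// sessionUpdate is a minimal client event; other session fields are omitted.
type sessionUpdate struct {
	Type    string        `json:"type"`
	Session sessionConfig `json:"session"`
}

// textOnlySessionUpdate builds the message asking the server to skip TTS.
func textOnlySessionUpdate() ([]byte, error) {
	return json.Marshal(sessionUpdate{
		Type:    "session.update",
		Session: sessionConfig{OutputModalities: []string{"text"}},
	})
}

// isAudioEvent reports whether a server event type belongs to the
// response.output_audio.* family, which must not appear in text-only mode.
func isAudioEvent(eventType string) bool {
	return strings.HasPrefix(eventType, "response.output_audio.")
}

func main() {
	payload, err := textOnlySessionUpdate()
	if err != nil {
		fmt.Println("marshal error:", err)
		return
	}
	fmt.Println(string(payload))

	// Event types as they might arrive during a text-only response.
	for _, ev := range []string{"response.output_text.delta", "response.output_text.done"} {
		fmt.Println(ev, "audio:", isAudioEvent(ev))
	}
}
```

Sending the printed payload as the first client message and asserting that `isAudioEvent` never returns true for incoming events mirrors the first manual check.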

Notes

  • Audio-mode behavior is preserved byte-for-byte (the gated block contents are unmodified).
  • Only the emulated pipeline is affected. Native any-to-any audio models (FLAG_REALTIME_AUDIO) use a different code path.
  • WebRTC and WebSocket transports both honor the gate.

@mudler mudler force-pushed the feat/realtime-honor-output-modalities branch from e5dd7b4 to 8db5a54 Compare May 15, 2026 10:24
@mudler mudler added the bug Something isn't working label May 15, 2026
mudler added 3 commits May 15, 2026 10:31
The emulated realtime pipeline previously ignored the OpenAI Realtime spec
field output_modalities and always synthesized TTS. Add resolveOutputModalities
+ modalitiesContainAudio helpers and gate the TTS / ResponseOutputAudio*
emission so a client requesting ["text"] gets only ResponseOutputText* events.

This lets thin clients (e.g. thing5-poc) cache TTS on the client side while
still using the realtime WS for VAD + STT + LLM + tool-call parsing.

Assisted-by: Claude:claude-opus-4-7

Follow-up to the previous commit:
- Resolve response.create's output_modalities at the gate so a per-response
  override of an audio session is honored (the test asserted this contract
  but the production call site was passing nil).
- Mirror OutputModalities in the RealtimeSession echo so session.update
  round-trips the client-supplied value, matching MaxOutputTokens's pattern.

Assisted-by: Claude:claude-opus-4-7

CI's errcheck flagged the pre-existing `defer os.Remove(audioFilePath)`
inside the audio-emission block (now wrapped by the modality gate). Wrap
the call in a closure that explicitly discards the error — the canonical
Go pattern for "I want to defer a cleanup whose error I genuinely don't
care about."

Assisted-by: Claude:claude-opus-4-7 golangci-lint
@mudler mudler force-pushed the feat/realtime-honor-output-modalities branch from f3ef553 to 49027ee Compare May 15, 2026 10:32
@richiejp
Collaborator

Looks good!

@mudler mudler merged commit a39591f into master May 15, 2026
57 checks passed
@mudler mudler deleted the feat/realtime-honor-output-modalities branch May 15, 2026 10:39