
feat(realtime): support multi-message generation per response#1555

Merged
longcw merged 3 commits into main from footsie-sundial-perused on May 25, 2026

Conversation

@rosetta-livekit-bot
Contributor

@rosetta-livekit-bot rosetta-livekit-bot Bot commented May 20, 2026

Summary

  • Process each MessageGeneration from generation_ev.message_stream serially via perform_audio_forwarding + perform_text_forwarding + wait_for_playout. Only one flush is in flight at a time.
  • Per-msg state is derived directly from the playback_finished event:
    • full → emit ChatMessage(interrupted=False) with the msg's message_id
    • partial → emit ChatMessage(interrupted=True) and call _rt_session.truncate(...) with this msg's local playback_position (not a cumulative offset)
    • skipped → drop locally and call update_chat_ctx(...) so the realtime server removes never-played items from its history
  • _on_first_frame now early-returns once started_speaking_at is set, so per-msg first-frame callbacks don't re-fire _update_agent_state("speaking") for each message.
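
The per-message handling above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the type names, field names, and `planForResponse` are hypothetical, and the rule that an interrupted message with a playback position of zero counts as skipped is an assumption made for the example.

```typescript
// Illustrative sketch only: these types and names are hypothetical and are
// not the actual @livekit/agents API.
type PlaybackState = 'full' | 'partial' | 'skipped';

interface PlaybackFinished {
  messageId: string;
  interrupted: boolean;
  // Position within THIS message's audio, in milliseconds (not cumulative).
  playbackPositionMs: number;
}

interface Plan {
  chatMessages: { id: string; interrupted: boolean }[];
  truncations: { id: string; audioEndMs: number }[];
  skippedIds: string[];
}

// Assumption: an interrupted message whose audio never started is "skipped".
function classify(ev: PlaybackFinished): PlaybackState {
  if (!ev.interrupted) return 'full';
  return ev.playbackPositionMs > 0 ? 'partial' : 'skipped';
}

// Walks the per-message playback events in order and records what would be
// emitted locally, truncated on the server, or dropped.
function planForResponse(events: PlaybackFinished[]): Plan {
  const plan: Plan = { chatMessages: [], truncations: [], skippedIds: [] };
  for (const ev of events) {
    const state = classify(ev);
    if (state === 'full') {
      plan.chatMessages.push({ id: ev.messageId, interrupted: false });
    } else if (state === 'partial') {
      plan.chatMessages.push({ id: ev.messageId, interrupted: true });
      // Truncate at this message's local position, not a running total.
      plan.truncations.push({ id: ev.messageId, audioEndMs: ev.playbackPositionMs });
    } else {
      plan.skippedIds.push(ev.messageId);
    }
  }
  return plan;
}
```

Returning a plan rather than calling the session directly keeps the sketch free of side effects; the real code acts on each message as soon as its playout finishes.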

Alternative considered

#5690 makes multi-message work by flushing per message, which requires the synchronizer to keep pending/finalizing impls alive and to serialize concurrent flushes in room_io/_output.py. Our AudioOutput assumes there is only one speech at a time, so serializing per message at the wait_for_playout boundary (this PR) avoids both changes.

close livekit/agents#5690, livekit/agents#5684

@changeset-bot

changeset-bot Bot commented May 20, 2026

🦋 Changeset detected

Latest commit: 896d71c

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 33 packages
Name Type
@livekit/agents Patch
@livekit/agents-plugin-anam Patch
@livekit/agents-plugin-assemblyai Patch
@livekit/agents-plugin-baseten Patch
@livekit/agents-plugin-bey Patch
@livekit/agents-plugin-cartesia Patch
@livekit/agents-plugin-cerebras Patch
@livekit/agents-plugin-deepgram Patch
@livekit/agents-plugin-elevenlabs Patch
@livekit/agents-plugin-fishaudio Patch
@livekit/agents-plugin-google Patch
@livekit/agents-plugin-hedra Patch
@livekit/agents-plugin-hume Patch
@livekit/agents-plugin-inworld Patch
@livekit/agents-plugin-lemonslice Patch
@livekit/agents-plugin-liveavatar Patch
@livekit/agents-plugin-livekit Patch
@livekit/agents-plugin-minimax Patch
@livekit/agents-plugin-mistral Patch
@livekit/agents-plugin-mistralai Patch
@livekit/agents-plugin-neuphonic Patch
@livekit/agents-plugin-openai Patch
@livekit/agents-plugin-perplexity Patch
@livekit/agents-plugin-phonic Patch
@livekit/agents-plugin-resemble Patch
@livekit/agents-plugin-rime Patch
@livekit/agents-plugin-runway Patch
@livekit/agents-plugin-sarvam Patch
@livekit/agents-plugin-silero Patch
@livekit/agents-plugin-tavus Patch
@livekit/agents-plugins-test Patch
@livekit/agents-plugin-trugen Patch
@livekit/agents-plugin-xai Patch


devin-ai-integration[bot]

This comment was marked as resolved.

@yaniv-peretz
Contributor

yaniv-peretz commented May 21, 2026

Local Test Results: Success

  • I built the package and swapped the agents/dist/voice/agent_activity.* files in my local dev server.

Result

Instructions

Availability per group size:
1-2: not available
3-15:  available - only if all are adult males
15-30:  available - only if all are adult females
25+: not available

Conversation Log

  • Note the 2nd Agent conversation item
Agent: hi;
Agent: hi;
User: ~~On planet~~ **I Plan** to come either as a group of two males, a group of ten mixed, a group of twenty only females, or a group of forty. Can I come?
Agent: Let’s walk through each group option against the availability rules.
Agent: Only the group of 20 adult females can come. The group of two males is too small, the group of 10 is mixed so it doesn’t qualify, and the group of 40 is not available.

@yaniv-peretz
Contributor

yaniv-peretz commented May 21, 2026

@tinalenguyen Stating the obvious: what is required to push this through?
gpt-realtime-2 is expected to be a banger (75% cost reduction + higher intelligence),
opening opportunities for customer support and more complex customer-service scenarios.

[Chart: Speech Reasoning (Big Bench Audio) vs Cost per Hour of Input Audio (21 May '26)]

@theomonnom theomonnom requested review from longcw and removed request for longcw May 21, 2026 17:56
Comment thread agents/src/voice/agent_activity.ts Outdated
this.agentSession._conversationItemAdded(message);

// TODO(brian): add tracing span
if (realtimeModel.capabilities.midSessionChatCtxUpdate) {
Contributor

This doesn't check whether any messages were skipped, so updateChatCtx is called every time the agent speech is interrupted. Should we check whether any were skipped before this?

Contributor Author

Updated to only call updateChatCtx on interruption when at least one processed message was skipped.
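
A minimal sketch of the guard described in this reply, assuming hypothetical names (`SyncInput`, `shouldUpdateChatCtx`); only the `midSessionChatCtxUpdate` capability flag comes from the diff context above, and the real condition in agent_activity.ts may be shaped differently.

```typescript
// Hypothetical names for illustration; not the actual agent_activity.ts code.
type MessageOutcome = 'full' | 'partial' | 'skipped';

interface SyncInput {
  // Whether the agent speech was interrupted.
  interrupted: boolean;
  // Outcome of every message processed for this response.
  outcomes: MessageOutcome[];
  // Mirrors the realtimeModel.capabilities.midSessionChatCtxUpdate check in the diff.
  midSessionChatCtxUpdate: boolean;
}

// Returns true only when there is something for the server to forget:
// the speech was interrupted AND at least one processed message never played.
function shouldUpdateChatCtx(input: SyncInput): boolean {
  if (!input.midSessionChatCtxUpdate) return false;
  if (!input.interrupted) return false;
  return input.outcomes.some((outcome) => outcome === 'skipped');
}
```

With this guard, an interruption that only cuts one message short (a `partial` outcome with no `skipped` ones) would not trigger a chat context update.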

Comment thread agents/src/voice/agent_activity.ts Outdated
Comment on lines +3097 to +3099
} else if (interrupted && output.synchronizedTranscript !== undefined) {
forwardedText = output.synchronizedTranscript;
}
Contributor

this branch is not reachable since synchronizedTranscript is set only if audioOut is valid?

Contributor Author

Removed the unreachable fallback branch; interrupted audio still uses synchronizedTranscript when available and falls back to an empty string otherwise.
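
The resulting selection logic might look roughly like the sketch below. `MessageOutput`, `hasAudio`, and `pickForwardedText` are hypothetical names, and the behavior for text-only messages (returning the text as is) is an assumption, since the thread only covers the interrupted-audio case.

```typescript
// Hypothetical shape for illustration; the real output object in
// agent_activity.ts has different fields.
interface MessageOutput {
  // Text forwarded for this message so far.
  text: string;
  // Whether this message has an audio output attached.
  hasAudio: boolean;
  // Set only when audio output exists and a synchronized transcript was reported.
  synchronizedTranscript?: string;
}

function pickForwardedText(output: MessageOutput, interrupted: boolean): string {
  // Assumption: uninterrupted or text-only messages keep their text unchanged.
  if (!interrupted || !output.hasAudio) return output.text;
  // Interrupted audio: use what was actually spoken, or nothing if unknown.
  return output.synchronizedTranscript ?? '';
}
```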

}
await waitFor(forwardTasks);
} catch (error) {
this.logger.error(error, 'error reading messages from the realtime API');
Contributor

should we raise this error? what was the original behavior when failed?

Contributor Author

Kept the existing TS behavior here: the previous readMessages path caught and logged errors from reading the realtime message stream rather than rethrowing. This port preserves that behavior.

textOut = _textOut;
forwardTasks.push(forwardTask);
output.audioOut = audioOut;
audioOut.firstFrameFut.await
Contributor

Should we clean up firstFrameFut the way the Python implementation does?

Contributor Author

No extra cleanup is needed in the JS path: performAudioForwarding resolves firstFrameFut on playback start and rejects it in its finally block if playback never starts, and each await already has a catch to avoid an unhandled rejection.
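
The pattern described in this reply can be illustrated with a small stand-in. The `Future` class, `forwardAudio`, and `onFirstFrame` below are simplified, hypothetical versions written for this example and are not the library's implementations; they only show why a future that is always settled in `finally`, paired with a `catch` on every await, needs no extra cleanup.

```typescript
// Simplified, hypothetical Future for illustration; not the library's class.
class Future<T> {
  readonly await: Promise<T>;
  private settleResolve!: (value: T) => void;
  private settleReject!: (error: Error) => void;
  private done = false;
  private rejected = false;

  constructor() {
    // The executor runs synchronously, so both settle functions are set here.
    this.await = new Promise<T>((resolve, reject) => {
      this.settleResolve = resolve;
      this.settleReject = reject;
    });
  }

  isDone(): boolean {
    return this.done;
  }

  isRejected(): boolean {
    return this.rejected;
  }

  resolve(value: T): void {
    if (this.done) return;
    this.done = true;
    this.settleResolve(value);
  }

  reject(error: Error): void {
    if (this.done) return;
    this.done = true;
    this.rejected = true;
    this.settleReject(error);
  }
}

// Stand-in for audio forwarding: resolves the future on the first frame and
// rejects it in `finally` if no frame was ever played.
function forwardAudio(frames: readonly string[], firstFrameFut: Future<number>): number {
  let played = 0;
  try {
    for (let i = 0; i < frames.length; i++) {
      if (!firstFrameFut.isDone()) firstFrameFut.resolve(i);
      played++;
    }
  } finally {
    if (!firstFrameFut.isDone()) {
      firstFrameFut.reject(new Error('playback never started'));
    }
  }
  return played;
}

// Each consumer attaches its own catch, so a rejected future never surfaces
// as an unhandled rejection.
function onFirstFrame(firstFrameFut: Future<number>, callback: () => void): void {
  firstFrameFut.await.then(() => callback()).catch(() => {});
}
```

Because the future is settled on every path and every consumer handles rejection, nothing is left pending once forwarding returns.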

@longcw longcw merged commit 181c868 into main May 25, 2026
9 checks passed
@longcw longcw deleted the footsie-sundial-perused branch May 25, 2026 07:26
@github-actions github-actions Bot mentioned this pull request May 25, 2026

Development

Successfully merging this pull request may close these issues.

feat(realtime): support multi-message generation per response - gpt-realtime-2

2 participants