Multi-modal conversation entities #1085

Shulyaka · 2024-05-14T17:06:45Z

Context

Historically, conversation entities have supported only one input/output modality: text. The Assist pipeline adds voice capability to conversation entities by integrating speech-to-text and text-to-speech entities. However, with the advent of multi-modal LLMs that natively support speech input and output, the intermediate text format becomes a limiting factor.

Multi-modality may also include image, file, or sensory input and output; however, this architecture proposal focuses solely on audio.

Proposal

The GPT-4o audio API has not been released as of this writing, but the solution below looks universal.

Introduce a supported_features method on ConversationEntity that indicates whether the entity supports audio input and audio output. Text input and output remain mandatory for all conversation entities.
The assist_pipeline should provide an option to bypass stt and/or tts and stream audio directly to and from the conversation entity, if the entity supports it.
For audio input capability, introduce an API similar to that of the stt entities, i.e. supported_formats, supported_codecs, supported_bit_rates, supported_sample_rates, supported_channels, and a check_metadata method. Additionally, the entity should support an async_process_audio_stream(self, metadata: stt.SpeechMetadata, stream: AsyncIterable[bytes]) -> ConversationResult method.
For audio output capability, introduce an API similar to that of the tts entities, i.e. async_get_supported_voices, supported_options, default_options, and additionally implement async_process_to_audio(self, user_input: ConversationInput, options: dict[str, Any]) -> tts.TtsAudioType.
Additionally, an entity that supports both audio input and output should implement async_process_audio_stream_to_audio(self, metadata: stt.SpeechMetadata, stream: AsyncIterable[bytes], options: dict[str, Any]) -> tts.TtsAudioType.
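
To illustrate how the pipeline could choose which stages to run, here is a minimal sketch that depends only on the standard library. The ConversationEntityFeature flag names and the plan_pipeline helper are hypothetical and exist only for this example. The label async_process stands for the existing text-only path, and the other stage labels are the method names proposed above.

```python
from __future__ import annotations

from enum import IntFlag


class ConversationEntityFeature(IntFlag):
    """Hypothetical feature flags; the names are illustrative only."""

    AUDIO_INPUT = 1
    AUDIO_OUTPUT = 2


def plan_pipeline(features: ConversationEntityFeature) -> list[str]:
    """Return the stages a voice request would pass through (hypothetical helper)."""
    audio_in = ConversationEntityFeature.AUDIO_INPUT in features
    audio_out = ConversationEntityFeature.AUDIO_OUTPUT in features

    stages: list[str] = []
    if not audio_in:
        # The entity only accepts text, so speech-to-text is still required.
        stages.append("stt")

    # Pick the entity method that matches the supported modalities.
    if audio_in and audio_out:
        stages.append("async_process_audio_stream_to_audio")
    elif audio_in:
        stages.append("async_process_audio_stream")
    elif audio_out:
        stages.append("async_process_to_audio")
    else:
        stages.append("async_process")

    if not audio_out:
        # The entity only returns text, so text-to-speech is still required.
        stages.append("tts")
    return stages
```

An entity with no audio features keeps the current stt, text, tts path, while an entity with both flags bypasses stt and tts entirely.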

balloob · 2024-05-14T18:38:18Z

balloob
May 14, 2024
Maintainer

From the examples that I saw, it will be possible to interrupt the AI as well. In that case, it shouldn't just be a process_stream; it needs to be more like a call.

3 replies

Shulyaka May 14, 2024
Author

I think interruptions are not a feature of the model itself, so it is more the job of assist_pipeline to do the voice detection and cancel the task or end the stream.
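
The cancel-on-voice-detection idea can be sketched with plain asyncio. Everything here is hypothetical: play_response stands in for streaming the response audio, and the user_spoke event stands in for a voice activity detector. Neither is an existing assist_pipeline API.

```python
from __future__ import annotations

import asyncio


async def play_response(chunks: list[bytes], played: list[bytes]) -> None:
    """Stand-in for streaming response audio to a speaker, one chunk at a time."""
    for chunk in chunks:
        played.append(chunk)
        await asyncio.sleep(0)  # yield so an interruption can be noticed


async def run_turn(
    chunks: list[bytes], user_spoke: asyncio.Event
) -> tuple[list[bytes], bool]:
    """Play the response until it finishes or the user starts speaking."""
    played: list[bytes] = []
    playback = asyncio.create_task(play_response(chunks, played))
    interrupt = asyncio.create_task(user_spoke.wait())

    done, _ = await asyncio.wait(
        {playback, interrupt}, return_when=asyncio.FIRST_COMPLETED
    )
    interrupted = playback not in done

    # Cancel whichever task is still running and wait for it to unwind.
    for task in (playback, interrupt):
        if not task.done():
            task.cancel()
    await asyncio.gather(playback, interrupt, return_exceptions=True)
    return played, interrupted


async def demo(chunks: list[bytes], interrupt: bool) -> tuple[list[bytes], bool]:
    """Run one turn; optionally simulate the user speaking right away."""
    user_spoke = asyncio.Event()
    if interrupt:
        user_spoke.set()
    return await run_turn(chunks, user_spoke)
```

In this sketch the entity itself knows nothing about interruptions; the caller simply cancels the playback task, which matches the idea of keeping voice detection in the pipeline.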

balloob May 15, 2024
Maintainer

But what is a model? The experience that OpenAI showed yesterday is a combination of many different things, which they might still make available as a single API. Shouldn't that also be covered by this proposal? It becomes a 2-way conversation sending audio, video, photos, text…

Shulyaka May 15, 2024
Author

Yes, a model is whatever is presented to us as a single API. If OpenAI releases an API for interruptions, we might want to incorporate it as well.
Images and files could be supported the same way, with an additional supported_feature and additional methods to attach them to the message history.
