Multi-modal conversation entities #1085
Shulyaka asked this question in Entity Models
From the examples that I saw, it will be possible to interrupt the AI as well. In that case, it shouldn't just be a |
Context
Historically, conversation entities have supported only one input/output modality: text. The Assist pipeline adds voice capability to conversation entities by integrating speech-to-text and text-to-speech entities. However, with the advent of multi-modal LLMs that natively support speech input and output, the intermediate text format becomes a limiting factor.
Multi-modality may also include image, file, or sensory input and output; however, this architecture proposal focuses solely on audio.
Proposal
As of writing, the GPT-4o audio API has not been released yet, but the solution below looks universal.
1. Add a `supported_features` method to the `ConversationEntity` that would indicate whether the entity supports audio input and audio output. Text input and output remain mandatory for all conversation entities.
2. `assist_pipeline` should provide an option to bypass `stt` and/or `tts` and stream the audio to/from the `conversation` entity directly, if the entity supports that.
3. A conversation entity that supports audio input should expose the same properties and methods as `stt` entities, i.e. `supported_formats`, `supported_codecs`, `supported_bit_rates`, `supported_sample_rates`, `supported_channels`, and the `check_metadata` method. Additionally, it should support an `async_process_audio_stream(self, metadata: stt.SpeechMetadata, stream: AsyncIterable[bytes]) -> ConversationResult` method.
4. A conversation entity that supports audio output should expose the same properties and methods as `tts` entities, i.e. `async_get_supported_voices`, `supported_options`, `default_options`, and additionally implement `async_process_to_audio(self, user_input: ConversationInput, options: dict[str, Any]) -> tts.TtsAudioType`.
5. A conversation entity that supports both audio input and audio output should also implement `async_process_audio_stream_to_audio(self, metadata: stt.SpeechMetadata, stream: AsyncIterable[bytes], options: dict[str, Any]) -> tts.TtsAudioType`.
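To make the proposal more concrete, here is a rough, self-contained sketch of how the feature flags and the stage-bypass decision might fit together. It does not import Home Assistant. `ConversationAudioFeature`, `PipelinePlan`, `plan_pipeline`, `EchoAudioConversationEntity`, and `handle_audio` are hypothetical names invented for illustration. The metadata is modeled as a plain `dict` instead of `stt.SpeechMetadata`, and the return value as an `(extension, data)` tuple standing in for `tts.TtsAudioType`. Only the `async_process_audio_stream_to_audio` name comes from the proposal itself.

```python
import asyncio
from collections.abc import AsyncIterable, AsyncIterator
from dataclasses import dataclass
from enum import IntFlag
from typing import Any


class ConversationAudioFeature(IntFlag):
    """Hypothetical feature flags; names and values are assumptions."""

    AUDIO_INPUT = 1
    AUDIO_OUTPUT = 2


@dataclass
class PipelinePlan:
    """Which intermediate pipeline stages would still run for a request."""

    run_stt: bool
    run_tts: bool


def plan_pipeline(features: ConversationAudioFeature) -> PipelinePlan:
    """Skip stt and/or tts when the entity handles that side natively."""
    return PipelinePlan(
        run_stt=not (features & ConversationAudioFeature.AUDIO_INPUT),
        run_tts=not (features & ConversationAudioFeature.AUDIO_OUTPUT),
    )


class EchoAudioConversationEntity:
    """Toy stand-in for an entity with native audio input and output."""

    supported_features = (
        ConversationAudioFeature.AUDIO_INPUT | ConversationAudioFeature.AUDIO_OUTPUT
    )

    async def async_process_audio_stream_to_audio(
        self,
        metadata: dict[str, Any],  # stand-in for stt.SpeechMetadata
        stream: AsyncIterable[bytes],
        options: dict[str, Any],
    ) -> tuple[str, bytes]:  # stand-in for tts.TtsAudioType
        # A real entity would forward chunks to the model; this one echoes them.
        audio = b"".join([chunk async for chunk in stream])
        return ("wav", audio)


async def handle_audio(entity: Any, stream: AsyncIterable[bytes]) -> tuple[str, bytes]:
    """Route audio straight to the entity when both stages can be bypassed."""
    plan = plan_pipeline(entity.supported_features)
    if not plan.run_stt and not plan.run_tts:
        return await entity.async_process_audio_stream_to_audio({}, stream, {})
    raise NotImplementedError("stt/tts stages are out of scope for this sketch")


async def fake_microphone() -> AsyncIterator[bytes]:
    """Yield two small chunks in place of a real audio stream."""
    for chunk in (b"\x00\x01", b"\x02\x03"):
        yield chunk


if __name__ == "__main__":
    print(asyncio.run(handle_audio(EchoAudioConversationEntity(), fake_microphone())))
```

Because the flags are independent, an entity that only accepts audio input would still get its text answer passed through `tts`, and an entity with neither flag would keep today's text-only path unchanged.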