Multiple connections versus persistent connection for conversational AI #421
I am using Deepgram for a conversational AI use case. It's working well, but I have a question on best practices.

Since the conversational dialog flow is very much prompt-response-repeat, I connect to Deepgram via websocket with each prompt, collect a transcript, disconnect, process the transcript, say something to the user, and then repeat the whole cycle. That is fine, but I do pay a slight price for reconnecting each time: the time to establish the socket connection (TLS handshake etc.) has at times caused a delay that resulted in some speech audio not making it to Deepgram.

Also, since each turn of the conversation is a completely new session on Deepgram, I am wondering whether I am forgoing some accuracy as the speaker gets further into the conversation. What I mean to ask is whether Deepgram does any "learning" as more audio is processed, such that transcripts would be more accurate in a single long audio session than over multiple sessions that are unrelated as far as Deepgram is concerned.

That is my main question. While the answer could lead me to adjust my implementation and connect just once for the entire conversation, there is a blocker: because things like keywords can only be provided in the URL on the connection, I would not be able to change keywords during the conversation, which would render this approach useless. Have you considered augmenting the API so that clients could send JSON text frames during the connection to manipulate things like keywords, or even language?
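For reference, here is a rough sketch of the reconnect-per-turn setup described above, showing why the keywords are fixed for the lifetime of a connection: they are baked into the URL that is built before connecting. The helper name `build_listen_url` and the example keywords are hypothetical; the `wss://api.deepgram.com/v1/listen` endpoint and the `keywords` and `language` query parameters are assumed to match the streaming API in use.

```python
from urllib.parse import urlencode

# Assumed streaming endpoint; adjust if your account uses a different host.
BASE_URL = "wss://api.deepgram.com/v1/listen"


def build_listen_url(keywords, **params):
    """Build the websocket URL for one conversational turn.

    Hypothetical helper: each keyword becomes its own `keywords` query
    parameter, and any other options (e.g. language) are passed through.
    """
    query = list(params.items()) + [("keywords", kw) for kw in keywords]
    return f"{BASE_URL}?{urlencode(query)}"


# Each turn gets its own URL, so the keywords can differ from turn to turn.
turn_1_url = build_listen_url(["yes", "no"], language="en-US")
turn_2_url = build_listen_url(["checking", "savings"], language="en-US")
```

With a single persistent connection, only the first of these URLs would ever be used, which is the blocker mentioned above.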
Replies: 2 comments 1 reply
You can send Deepgram a KeepAlive message every few seconds and stop sending audio packets when they are not needed. While you are sending keep-alives but no audio, we will not charge, since we are not processing any audio. The websocket connection stays open, and once you send audio again we will transcribe it. See: https://developers.deepgram.com/reference/streaming#stream-keepalive
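As a minimal sketch of that pattern, the loop below sends a keep-alive text frame at a fixed interval until it is told to stop (for example, because the next turn's audio is about to start). It assumes the keep-alive payload is the JSON text frame `{"type": "KeepAlive"}` as described on the linked page. The `send_text` callable, the `keepalive_loop` name, and the 5-second default interval are illustrative assumptions, not part of any SDK; `send_text` stands in for whatever coroutine your websocket client uses to send a text frame.

```python
import asyncio
import json
from typing import Awaitable, Callable

# Assumed keep-alive payload, sent as a text (not binary) frame.
KEEPALIVE_MESSAGE = json.dumps({"type": "KeepAlive"})


async def keepalive_loop(
    send_text: Callable[[str], Awaitable[None]],
    stop: asyncio.Event,
    interval_s: float = 5.0,
) -> int:
    """Send a keep-alive frame every interval_s seconds until stop is set.

    Returns the number of keep-alive frames sent.
    """
    sent = 0
    while not stop.is_set():
        await send_text(KEEPALIVE_MESSAGE)
        sent += 1
        try:
            # Wake up early if the caller signals that audio is resuming.
            await asyncio.wait_for(stop.wait(), timeout=interval_s)
        except asyncio.TimeoutError:
            pass  # Interval elapsed; send the next keep-alive.
    return sent
```

Between turns you would start this as a task with `asyncio.create_task(...)`, then set the `stop` event and await the task just before streaming audio again.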
Thanks, appreciate that info. Wonder if you could respond to my question though:
Hi @davehorton - The short answer is no, you're not getting additional cumulative accuracy by using a single uninterrupted stream. Deepgram doesn't use all the previous audio in a stream to improve transcription as the stream progresses.
However, it does use some context to improve accuracy. If you send very short snippets of audio (say, less than 5 seconds), there can be context missing, which can lower accuracy. But if a complete turn is only a few seconds, then in some ways that's the full context available for the utterance. I'd say that a stream should last at least the length of one full turn, but it can be at your discretion whether you want to keep reconnecting for each new turn, …