Bug Description
We have been making good use of manual turn control recently (https://docs.livekit.io/agents/logic/turns/#manual). We are using it with gpt-realtime, so I will use that as the example, but I believe the issue described here applies to any model, including cascaded models.
In short: in manual turn taking, as per the official docs, we mark the end of the user's turn essentially by muting the user's microphone (session.input.set_audio_enabled(False)). This has an unexpected side effect, however: it "locks" session.user_state in its previous value. If the user was speaking at the moment of the mute, the state stays "speaking" regardless of whether they have since stopped, and regardless of the fact that we know they cannot be speaking because their microphone is muted.
You might wonder why we have server-side VAD enabled at all if we are using manual turn taking. It is because it's important for us to recognise user speech/non-speech even in a manual turn-taking scenario, e.g. so we can avoid triggering interruptions mid-user-speech, and so we can commit user speech to context within their turn. With no VAD, the LiveKit user state is always "listening", so this issue would go away, but for anyone who, like us, needs VAD with gpt-realtime, and indeed for any cascaded-model user (since those require VAD), this issue should be evident.
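To make the motivation concrete, here is a minimal sketch of the kind of interruption gating that depends on user_state. FakeSession and maybe_interrupt are hypothetical stand-ins for illustration only, not the real AgentSession API; the point is that a VAD-driven user_state is what lets us refuse to interrupt mid-utterance.

```python
from dataclasses import dataclass


@dataclass
class FakeSession:
    """Hypothetical stand-in for AgentSession, for illustration only."""

    user_state: str = "listening"  # "speaking" | "listening" | "away"
    interrupted: bool = False

    def interrupt(self) -> None:
        self.interrupted = True


def maybe_interrupt(session: FakeSession) -> bool:
    # Gate agent-side interruptions on the VAD-driven user state:
    # never cut in while the user is mid-utterance.
    if session.user_state == "speaking":
        return False
    session.interrupt()
    return True
```

If user_state is stuck at "speaking" after muting, this kind of guard blocks interruptions forever, which is how the bug bites us in practice.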
Expected Behavior
In short, when the user ends their turn and stops speaking, we expect the user state to no longer be "speaking".
Furthermore, once we have muted the microphone, we expect the LiveKit user state to no longer be "speaking" even if the user is still talking.
Reproduction Steps
Run the following, speak during the first 10 seconds, then stop speaking: the user state stays stuck in "speaking".
import asyncio
import logging

from dotenv import load_dotenv
from livekit import rtc
from livekit.agents import Agent, AgentServer, AgentSession, JobContext, JobProcess, cli, room_io
from livekit.plugins import noise_cancellation, openai, silero
from openai.types.realtime import AudioTranscription
from openai.types.realtime.realtime_audio_input_turn_detection import SemanticVad

logger = logging.getLogger("agent")
load_dotenv(".env.local")


class Assistant(Agent):
    def __init__(self) -> None:
        super().__init__(
            instructions="You are a helpful voice AI assistant. Speak English.",
        )


server = AgentServer()


def prewarm(proc: JobProcess) -> None:
    proc.userdata["vad"] = silero.VAD.load()


server.setup_fnc = prewarm


@server.rtc_session()
async def my_agent(ctx: JobContext) -> None:
    semantic_vad = SemanticVad(
        type="semantic_vad",
        create_response=True,
        interrupt_response=True,
        eagerness="low",
    )
    session = AgentSession(
        turn_detection="manual",
        llm=openai.realtime.RealtimeModel(
            input_audio_transcription=AudioTranscription(language="en", model="whisper-1"),
            max_session_duration=20 * 60,
            turn_detection=semantic_vad,
        ),
    )

    # Start the session, which initializes the voice pipeline and warms up the models
    await session.start(
        agent=Assistant(),
        room=ctx.room,
        room_options=room_io.RoomOptions(
            audio_input=room_io.AudioInputOptions(
                noise_cancellation=lambda params: noise_cancellation.BVCTelephony()
                if params.participant.kind == rtc.ParticipantKind.PARTICIPANT_KIND_SIP
                else noise_cancellation.BVC(),
            ),
        ),
    )
    session.input.set_audio_enabled(True)

    async def end_turn() -> None:
        logger.info("ENDING TURN: %s", session.user_state)
        session.input.set_audio_enabled(False)  # Stop listening
        session.commit_user_turn()  # Process the input and generate a response
        logger.info("ENDED TURN: %s", session.user_state)

    # Join the room and connect to the user
    await ctx.connect()

    # Log the user state every 0.5s for 10 seconds, then end the turn
    for _ in range(20):
        await asyncio.sleep(0.5)
        logger.info("BEFORE END TURN: %s", session.user_state)
    await end_turn()
    while True:
        await asyncio.sleep(0.5)
        logger.info("After ENDED TURN: %s", session.user_state)


if __name__ == "__main__":
    cli.run_app(server)
Now if you talk to this agent, speaking continuously until the turn ends (after 10 seconds), you see the following:
14:43:37.094 INFO … agent BEFORE END TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
14:43:37.595 INFO … agent BEFORE END TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
14:43:38.096 INFO … agent BEFORE END TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
14:43:38.597 INFO … agent BEFORE END TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
14:43:39.099 INFO … agent BEFORE END TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
14:43:39.601 INFO … agent BEFORE END TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
14:43:39.718 INFO … root ignoring text stream with topic 'lk.agent.request', no callback attached {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
14:43:40.102 INFO … agent BEFORE END TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
14:43:40.604 INFO … agent BEFORE END TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
14:43:41.105 INFO … agent BEFORE END TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
14:43:41.606 INFO … agent BEFORE END TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
14:43:42.107 INFO … agent BEFORE END TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
14:43:42.608 INFO … agent BEFORE END TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
14:43:43.109 INFO … agent BEFORE END TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
14:43:43.610 INFO … agent BEFORE END TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
14:43:44.110 INFO … agent BEFORE END TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
14:43:44.611 INFO … agent BEFORE END TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
14:43:44.716 INFO … root ignoring text stream with topic 'lk.agent.request', no callback attached {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
14:43:45.112 INFO … agent BEFORE END TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
INFO … agent ENDING TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
DEBUG… livekit.agents input stream detached
{"participant": "identity-kD4R", "source": "SOURCE_MICROPHONE", "accepted_sources": ["SOURCE_MICROPHONE"], "pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id":
"RM_yz9uBJCXQhQw"}
14:43:45.117 INFO … agent ENDED TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
14:43:45.618 INFO … agent After ENDED TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
14:43:46.119 INFO … agent After ENDED TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
14:43:46.620 INFO … agent After ENDED TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
14:43:47.120 INFO … agent After ENDED TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
14:43:47.621 INFO … agent After ENDED TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
14:43:48.121 INFO … agent After ENDED TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
14:43:48.622 INFO … agent After ENDED TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
14:43:49.123 INFO … agent After ENDED TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
14:43:49.625 INFO … agent After ENDED TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
i.e. we are correctly in the "speaking" state before the turn ends, but then incorrectly remain in the "speaking" state after the user's microphone has been muted, and we stay stuck there indefinitely until the user is unmuted.
Operating System
Linux
Models Used
gpt-realtime
Package Versions
Session/Room/Call IDs
No response
Proposed Solution
Potential solution: session.input.set_audio_enabled(False) should transition session.user_state out of "speaking" (e.g. back to "listening"), since a muted user cannot be speaking.
Alternative solution: expose a public method to move the state out of "speaking", which we could call ourselves as part of end_turn().
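A sketch of the semantics we are proposing, using a minimal stand-in class (UserStateModel and its method names are hypothetical, not the actual SDK internals): muting the input forces the state out of "speaking", and VAD events are ignored while muted.

```python
class UserStateModel:
    """Hypothetical model of the proposed user_state behavior."""

    def __init__(self) -> None:
        self.audio_enabled = True
        self.user_state = "listening"

    def on_vad_speech(self, speaking: bool) -> None:
        # VAD events only apply while the microphone input is enabled.
        if self.audio_enabled:
            self.user_state = "speaking" if speaking else "listening"

    def set_audio_enabled(self, enabled: bool) -> None:
        self.audio_enabled = enabled
        # Proposed fix: muting the input moves the state out of "speaking",
        # since the user can no longer be heard.
        if not enabled and self.user_state == "speaking":
            self.user_state = "listening"
```

With these semantics, the repro above would log "listening" (rather than a stuck "speaking") immediately after end_turn() disables the audio input.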
Additional Context
No response
Screenshots and Recordings
No response