
Livekit user speaking state not fully compatible with manual turn taking #5118

@bml1g12

Description


Bug Description

We have been making good use of manual turn control recently (i.e. https://docs.livekit.io/agents/logic/turns/#manual). We are using it with gpt-realtime, so I will use that as the example, but I believe the issue described here applies to any model, including cascaded pipelines.

In short, the problem is that in manual turn taking, per the official docs, we mark the end of the user turn essentially by muting the user microphone (session.input.set_audio_enabled(False)). This has an unexpected side effect, however: it "locks" session.user_state in its previous value. If the user was speaking at the moment of muting, the state remains "speaking" regardless of whether they have since stopped, and regardless of the fact that we know they cannot be speaking because we have muted their microphone.

I should note, you might wonder why we have server-side VAD enabled at all if we are using manual turn taking. This is because it's important for us to recognise user speech/non-speech even in a manual turn-taking scenario, e.g. so we can avoid triggering interruptions mid-speech, and so we can commit user speech to context within their turn. Without VAD, the LiveKit user state is always "listening", so this issue would of course go away; but for users who, like us, need VAD with gpt-realtime, and indeed for any cascaded-model user (as those require VAD), I think this issue would be evident.

Expected Behavior

In short, when a user ends their turn and stops speaking, we expect the user state to no longer be marked as "speaking".
Furthermore, if we have muted the microphone, we expect the LiveKit user state to no longer be "speaking" even if the user is still talking.

Reproduction Steps

Run the following and speak during the first 10 seconds, then stop speaking --> the user state remains stuck at "speaking".

import asyncio
import logging

from dotenv import load_dotenv
from livekit import rtc
from livekit.agents import Agent, AgentServer, AgentSession, JobContext, JobProcess, cli, room_io
from livekit.plugins import noise_cancellation, openai, silero
from openai.types.realtime import AudioTranscription
from openai.types.realtime.realtime_audio_input_turn_detection import SemanticVad

logger = logging.getLogger("agent")

load_dotenv(".env.local")


class Assistant(Agent):
    def __init__(self) -> None:
        super().__init__(
            instructions=("You are a helpful voice AI assistant. Speak English. "),
        )


server = AgentServer()


def prewarm(proc: JobProcess) -> None:
    proc.userdata["vad"] = silero.VAD.load()


server.setup_fnc = prewarm


@server.rtc_session()
async def my_agent(ctx: JobContext) -> None:
    # Set up a voice pipeline using the OpenAI Realtime model with manual turn detection
    semantic_vad = SemanticVad(
        type="semantic_vad",
        create_response=True,
        interrupt_response=True,
        eagerness="low",
    )
    session = AgentSession(
        turn_detection="manual",
        # stt=StreamAdapter(
        #     vad=ctx.proc.userdata["vad"],
        # ),
        llm=openai.realtime.RealtimeModel(
            input_audio_transcription=AudioTranscription(language="en", model="whisper-1"),
            max_session_duration=20 * 60,
            turn_detection=semantic_vad,
        ),
    )

    # Start the session, which initializes the voice pipeline and warms up the models
    await session.start(
        agent=Assistant(),
        room=ctx.room,
        room_options=room_io.RoomOptions(
            audio_input=room_io.AudioInputOptions(
                noise_cancellation=lambda params: noise_cancellation.BVCTelephony()
                if params.participant.kind == rtc.ParticipantKind.PARTICIPANT_KIND_SIP
                else noise_cancellation.BVC(),
            ),
        ),
    )

    session.input.set_audio_enabled(True)

    async def end_turn():
        logger.info("ENDING TURN: %s", session.user_state)
        session.input.set_audio_enabled(False)  # Stop listening
        session.commit_user_turn()  # Process the input and generate response
        logger.info("ENDED TURN: %s", session.user_state)


    # Join the room and connect to the user
    await ctx.connect()

    for i in range(20):
        await asyncio.sleep(0.5)
        logger.info("BEFORE END TURN: %s", session.user_state)
    await end_turn()
    while True:
        await asyncio.sleep(0.5)
        logger.info("After ENDED TURN: %s", session.user_state)


if __name__ == "__main__":
    cli.run_app(server)

Now if you try talking to this agent, constantly speaking until the turn ends (after 10 seconds), you see the below:

 14:43:37.094 INFO … agent              BEFORE END TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
    14:43:37.595 INFO … agent              BEFORE END TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
    14:43:38.096 INFO … agent              BEFORE END TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
    14:43:38.597 INFO … agent              BEFORE END TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
    14:43:39.099 INFO … agent              BEFORE END TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
    14:43:39.601 INFO … agent              BEFORE END TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
    14:43:39.718 INFO … root               ignoring text stream with topic 'lk.agent.request', no callback attached {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
    14:43:40.102 INFO … agent              BEFORE END TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
    14:43:40.604 INFO … agent              BEFORE END TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
    14:43:41.105 INFO … agent              BEFORE END TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
    14:43:41.606 INFO … agent              BEFORE END TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
    14:43:42.107 INFO … agent              BEFORE END TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
    14:43:42.608 INFO … agent              BEFORE END TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
    14:43:43.109 INFO … agent              BEFORE END TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
    14:43:43.610 INFO … agent              BEFORE END TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
    14:43:44.110 INFO … agent              BEFORE END TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
    14:43:44.611 INFO … agent              BEFORE END TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
    14:43:44.716 INFO … root               ignoring text stream with topic 'lk.agent.request', no callback attached {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
    14:43:45.112 INFO … agent              BEFORE END TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
                 INFO … agent              ENDING TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
                 DEBUG… livekit.agents     input stream detached {"participant": "identity-kD4R", "source": "SOURCE_MICROPHONE", "accepted_sources": ["SOURCE_MICROPHONE"], "pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
    14:43:45.117 INFO … agent              ENDED TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
    14:43:45.618 INFO … agent              After ENDED TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
    14:43:46.119 INFO … agent              After ENDED TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
    14:43:46.620 INFO … agent              After ENDED TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
    14:43:47.120 INFO … agent              After ENDED TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
    14:43:47.621 INFO … agent              After ENDED TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
    14:43:48.121 INFO … agent              After ENDED TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
    14:43:48.622 INFO … agent              After ENDED TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
    14:43:49.123 INFO … agent              After ENDED TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}
    14:43:49.625 INFO … agent              After ENDED TURN: speaking {"pid": 1798205, "job_id": "AJ_t2sWpaXpZbDq", "room_id": "RM_yz9uBJCXQhQw"}

i.e. we are correctly in the "speaking" state before the turn ends, but then incorrectly remain in the "speaking" state after the user's microphone has been muted, and will stay in this state indefinitely until the user is unmuted.

Operating System

Linux

Models Used

gpt-realtime

Package Versions

1.4.5

Session/Room/Call IDs

No response

Proposed Solution

Potential solution: session.input.set_audio_enabled(False) should transition session.user_state out of "speaking".

Alternative solution: expose a public method that switches the state away from "speaking", which we can call as part of end_turn().
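Until this is addressed upstream, one application-side workaround is to track the mute state ourselves and derive an "effective" user state from it. A minimal sketch, assuming we mirror both the mute calls and the raw session.user_state into a small helper (TurnAwareUserState is hypothetical, not part of the LiveKit API):

```python
from dataclasses import dataclass


@dataclass
class TurnAwareUserState:
    """Derive an 'effective' user state that cannot be 'speaking' while muted.

    raw_state mirrors session.user_state; audio_enabled mirrors the last
    session.input.set_audio_enabled(...) call made by the application.
    """

    raw_state: str = "listening"
    audio_enabled: bool = True

    def set_audio_enabled(self, enabled: bool) -> None:
        # Call this alongside session.input.set_audio_enabled(...)
        self.audio_enabled = enabled

    def update_raw(self, state: str) -> None:
        # Call this from a user_state_changed handler (or by polling)
        self.raw_state = state

    @property
    def effective(self) -> str:
        # A muted user cannot be speaking, regardless of the last VAD event.
        if not self.audio_enabled and self.raw_state == "speaking":
            return "listening"
        return self.raw_state
```

The application would then consult state.effective instead of session.user_state wherever the stuck "speaking" value matters, e.g. when deciding whether an interruption is safe.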

Additional Context

No response

Screenshots and Recordings

No response

Labels

bug (Something isn't working)