Transcription text streams not generated for more than one participant #3657

@pabloFuente

Bug description

I have been looking into a strange behavior when receiving text messages from the lk.transcription topic, generated by an STT agent.

Reading the agents documentation, it seems that an agent with a well-configured stt node (see snippet below) should generate transcription messages for every participant connected to the room and publishing an audio track.

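from livekit.agents import (
    Agent,
    AgentSession,
    AutoSubscribe,
    JobContext,
    RoomInputOptions,
    RoomOutputOptions,
)
from livekit.plugins import aws
from livekit.plugins.turn_detector.multilingual import MultilingualModel


async def entrypoint(ctx: JobContext):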
    # Configure AWS Transcribe STT
    stt = aws.STT(language="en-US")

    # Create a simple agent with STT
    agent = Agent(instructions="not-needed", stt=stt)

    # Create agent session with VAD and turn detection
    session = AgentSession(
        vad=ctx.proc.userdata["vad"],
        turn_detection=MultilingualModel(),
    )

    # Start the session
    await session.start(
        agent=agent,
        room=ctx.room,
        room_output_options=RoomOutputOptions(
            # The agent will only generate text transcriptions as output
            transcription_enabled=True,
            audio_enabled=False,
        ),
        room_input_options=RoomInputOptions(
            # The agent will only receive audio tracks as input
            text_enabled=False,
            video_enabled=False,
            audio_enabled=True,
            pre_connect_audio=True,
            pre_connect_audio_timeout=3.0,
        ),
    )

    # Connect to room and subscribe to audio only
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)

I am assuming that the agent above should automatically subscribe to all the audio tracks published to its room (both existing and new ones), and start generating transcription messages on topic lk.transcription for each subscribed track.

But what I am seeing is that in a 1-to-1 audio-only Room, the agent sends transcription messages for one participant, but not for the other.

Minimal setup to reproduce the issue

I have created a very minimal project to demonstrate the issue:

https://github.com/OpenVidu/livekit-agents-transcription-test

This repository contains:

  1. A very simple agent built with the latest livekit-agents Python SDK. It is a simplified version of the official agent-starter-python and uses the aws plugin.
  2. A very simple web app that allows joining participants to rooms, each publishing a single audio track (and not subscribing to remote tracks at all). A textarea for each participant shows the transcription messages received in topic lk.transcription.

Instructions to run the minimal setup (also available in its README):

# Clone the repository
git clone https://github.com/OpenVidu/livekit-agents-transcription-test.git
cd livekit-agents-transcription-test

# Build the agent container
docker build -t livekit/transcription-agent-test:latest agent/.

# Export your LiveKit Cloud credentials
export LIVEKIT_URL=wss://xxxxxxxx.livekit.cloud
export LIVEKIT_API_KEY=your_livekit_cloud_api_key
export LIVEKIT_API_SECRET=your_livekit_cloud_api_secret

# Export your AWS credentials
export AWS_ACCESS_KEY_ID=your_access_key_id
export AWS_SECRET_ACCESS_KEY=your_secret_access_key
export AWS_DEFAULT_REGION=your_aws_region

# Start the agent
docker compose up -d

# Modify the webapp LiveKit Cloud credentials. Search and replace in webapp/index.html:
# const LIVEKIT_URL = "wss://xxxxxxxx.livekit.cloud";
# const LIVEKIT_API_KEY = "your_livekit_cloud_api_key";
# const LIVEKIT_API_SECRET = "your_livekit_cloud_api_secret";

# Run the web app
cd webapp
npm install
npm start

The web app will be available at http://localhost:3000. You can connect to it and launch the following scenario:

Scenario

  1. Create a room with two participants, each publishing a single audio track.
  2. The STT agent is dispatched automatically and starts generating transcription text messages.

Expected behavior

Transcription messages should be received in the frontend for both participants.

Actual, wrong behavior

Transcription messages are received only for the first participant.
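
To make the expected versus actual behavior concrete, here is a minimal sketch in Python of the bookkeeping a receiver could do: group incoming lk.transcription messages by the participant they are attributed to, then list the publishing participants for which nothing was received. The `TranscriptionMessage` type, its fields, and both helper functions are hypothetical names used only for illustration and are not part of the LiveKit SDK. How a real client maps a text stream to a participant depends on the SDK in use.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TranscriptionMessage:
    """Hypothetical record of one message received on the lk.transcription topic."""

    participant_identity: str  # participant the transcript is attributed to (assumed field)
    text: str


def group_by_participant(messages):
    """Collect transcript texts per participant identity, preserving arrival order."""
    grouped = defaultdict(list)
    for msg in messages:
        grouped[msg.participant_identity].append(msg.text)
    return dict(grouped)


def untranscribed(publishers, messages):
    """Return the publishing participants for which no message was received."""
    seen = {msg.participant_identity for msg in messages}
    return sorted(set(publishers) - seen)


if __name__ == "__main__":
    # Illustrative data only: two publishers, messages for just one of them.
    received = [
        TranscriptionMessage("participant-1", "hello"),
        TranscriptionMessage("participant-1", "can you hear me"),
    ]
    print(group_by_participant(received))
    print(untranscribed(["participant-1", "participant-2"], received))
```

In the situation reported here, `untranscribed` would return the second participant, whereas the expected behavior corresponds to an empty list.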

Images demonstrating the issue

The screenshots below demonstrate the issue:

  1. Set a room name:
     [Image]
  2. Connect the first participant. The agent automatically joins the room:
     [Image]
  3. The agent starts generating transcription messages for the first participant:
     [Image]
  4. Connect the second participant:
     [Image]
  5. Both participants are talking, but the agent generates transcription messages only for the first participant:
     [Image]

The behavior is the same even when manual dispatch is used to change the moment the agent joins the room. If the agent is added to the room after the two participants are connected and publishing audio, it still generates transcription events for only one participant.

I am not sure whether this behavior is a bug or whether I have misunderstood the agent documentation. At first I also thought that connecting two participants and publishing two audio tracks from the same device and source could cause problems. But:

  • Event IsSpeakingChanged indicates that both participants are actively speaking.
  • I have also tested this scenario from separate devices and it still happens.

I have also checked that the agent's worker availability is OK by forcing a load_threshold that is fully permissive.

I am quite lost at the moment. I am thinking about exploring the standalone STT usage strategy to see if I can work around this behavior, but I would first like to confirm whether this is a bug or whether I have misinterpreted the documentation.
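
If it turns out that a single AgentSession only handles audio from one participant (an assumption this report has not confirmed), one possible workaround is to run one transcription pipeline per remote participant. The sketch below shows only the asyncio bookkeeping in plain Python. `PerParticipantTranscribers` and the `start_transcriber` callback are hypothetical names, and the callback stands in for whatever per-participant setup (for example a dedicated session or a standalone STT stream) ends up being used.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


class PerParticipantTranscribers:
    """Keeps one background task per participant identity (hypothetical helper)."""

    def __init__(self, start_transcriber: Callable[[str], Awaitable[None]]):
        # start_transcriber is a placeholder for the real per-participant pipeline.
        self._start = start_transcriber
        self._tasks: Dict[str, asyncio.Task] = {}

    def on_participant_connected(self, identity: str) -> None:
        # Start a pipeline only once per participant.
        if identity not in self._tasks:
            self._tasks[identity] = asyncio.create_task(self._start(identity))

    async def on_participant_disconnected(self, identity: str) -> None:
        # Cancel the pipeline and wait until the cancellation has completed.
        task = self._tasks.pop(identity, None)
        if task is not None:
            task.cancel()
            await asyncio.gather(task, return_exceptions=True)

    @property
    def active(self) -> List[str]:
        # Identities that currently have a running pipeline.
        return sorted(self._tasks)


async def main() -> None:
    async def fake_transcriber(identity: str) -> None:
        # Stand-in pipeline that simply waits until it is cancelled.
        await asyncio.Event().wait()

    manager = PerParticipantTranscribers(fake_transcriber)
    manager.on_participant_connected("participant-1")
    manager.on_participant_connected("participant-2")
    print(manager.active)
    await manager.on_participant_disconnected("participant-1")
    print(manager.active)
    await manager.on_participant_disconnected("participant-2")


if __name__ == "__main__":
    asyncio.run(main())
```

In a real agent, the two handler methods would be driven by the room's participant connect and disconnect events. That wiring is left out here because it depends on the SDK version.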
