Transcription text streams not generated for more than one participant #3657

## Bug description

I have been investigating some strange behavior in the text messages received on the `lk.transcription` topic, which are generated by an STT agent.

From the Agents documentation, I understand that an agent with a well-configured STT node (see snippet below) should generate transcription messages for every participant that is connected to the room and publishing an audio track.

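```python
from livekit.agents import (
    Agent,
    AgentSession,
    AutoSubscribe,
    JobContext,
    JobProcess,
    RoomInputOptions,
    RoomOutputOptions,
    WorkerOptions,
    cli,
)
from livekit.plugins import aws, silero
from livekit.plugins.turn_detector.multilingual import MultilingualModel


def prewarm(proc: JobProcess):
    # Load the VAD model once per worker process; entrypoint() reads it back
    proc.userdata["vad"] = silero.VAD.load()


async def entrypoint(ctx: JobContext):
    # Configure AWS Transcribe STT
    stt = aws.STT(language="en-US")

    # Create a simple agent with STT
    agent = Agent(instructions="not-needed", stt=stt)

    # Create agent session with VAD and turn detection
    session = AgentSession(
        vad=ctx.proc.userdata["vad"],
        turn_detection=MultilingualModel(),
    )

    # Start the session
    await session.start(
        agent=agent,
        room=ctx.room,
        room_output_options=RoomOutputOptions(
            # The agent will only generate text transcriptions as output
            transcription_enabled=True,
            audio_enabled=False,
        ),
        room_input_options=RoomInputOptions(
            # The agent will only receive audio tracks as input
            text_enabled=False,
            video_enabled=False,
            audio_enabled=True,
            pre_connect_audio=True,
            pre_connect_audio_timeout=3.0,
        ),
    )

    # Connect to room and subscribe to audio only
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)


if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint, prewarm_fnc=prewarm))
```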

I am assuming that the agent above **should automatically subscribe to all the audio tracks published to its room** (both existing and new ones) and start generating transcription messages on the `lk.transcription` topic for each subscribed track.
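A minimal sketch of the behavior I expected, in plain asyncio with no LiveKit SDK involved: each subscribed audio track gets its own independent transcription loop, and every segment is tagged with the identity of the participant who published the track. `AudioTrack`, `transcribe_track` and `transcribe_room` are hypothetical names, and the queue only stands in for the `lk.transcription` topic; this models my assumption, not how livekit-agents is actually implemented.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class AudioTrack:
    # Hypothetical stand-in for a subscribed remote audio track
    participant_identity: str
    utterances: list[str]  # what the participant "says", in order


async def transcribe_track(track: AudioTrack, topic: asyncio.Queue) -> None:
    # One independent STT loop per track: every utterance becomes a
    # transcription segment tagged with the publishing participant
    for text in track.utterances:
        await asyncio.sleep(0)  # yield control, as a streaming STT would
        await topic.put((track.participant_identity, text))


async def transcribe_room(tracks: list[AudioTrack]) -> dict[str, list[str]]:
    # Stand-in for the `lk.transcription` topic
    topic: asyncio.Queue = asyncio.Queue()
    # Fan out: one concurrent task per subscribed audio track
    await asyncio.gather(*(transcribe_track(t, topic) for t in tracks))
    # Group the received segments per participant, like the web app's textareas
    received: dict[str, list[str]] = {}
    while not topic.empty():
        identity, text = topic.get_nowait()
        received.setdefault(identity, []).append(text)
    return received


if __name__ == "__main__":
    tracks = [
        AudioTrack("participant-1", ["hello", "how are you"]),
        AudioTrack("participant-2", ["hi", "fine thanks"]),
    ]
    print(asyncio.run(transcribe_room(tracks)))
```

With two tracks, this model yields one entry per participant; the problem described here is that, in practice, only one of them ever shows up.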

However, in a 1-to-1 audio-only room, the agent sends transcription messages for one participant but not for the other.

## Minimal setup to reproduce the issue

I have created a minimal project that reproduces the issue:

**https://github.com/OpenVidu/livekit-agents-transcription-test**

This repository contains:

1. A simple agent built with the latest livekit-agents Python SDK: a stripped-down version of the official [agent-starter-python](https://github.com/livekit-examples/agent-starter-python) that uses the AWS plugin.
2. A simple web app that lets participants join rooms, each publishing a single audio track (and not subscribing to any remote tracks). A textarea for each participant shows the transcription messages received on the `lk.transcription` topic.

Instructions to run the minimal setup (also available in its [README](https://github.com/OpenVidu/livekit-agents-transcription-test/tree/main?tab=readme-ov-file#livekit-agents-transcription-test)):

```bash
# Clone the repository
git clone https://github.com/OpenVidu/livekit-agents-transcription-test.git
cd livekit-agents-transcription-test

# Build the agent container
docker build -t livekit/transcription-agent-test:latest agent/.

# Export your LiveKit Cloud credentials
export LIVEKIT_URL=wss://xxxxxxxx.livekit.cloud
export LIVEKIT_API_KEY=your_livekit_cloud_api_key
export LIVEKIT_API_SECRET=your_livekit_cloud_api_secret

# Export your AWS credentials
export AWS_ACCESS_KEY_ID=your_access_key_id
export AWS_SECRET_ACCESS_KEY=your_secret_access_key
export AWS_DEFAULT_REGION=your_aws_region

# Start the agent
docker compose up -d

# Set the web app's LiveKit Cloud credentials by editing these constants in webapp/index.html:
# const LIVEKIT_URL = "wss://xxxxxxxx.livekit.cloud";
# const LIVEKIT_API_KEY = "your_livekit_cloud_api_key";
# const LIVEKIT_API_SECRET = "your_livekit_cloud_api_secret";

# Run the web app
cd webapp
npm install
npm start
```

The web app will be available at [http://localhost:3000](http://localhost:3000). Open it and run the following scenario:

**Scenario**

1. Create a room with two participants, each publishing a single audio track.
2. The STT agent is dispatched automatically and starts generating transcription text messages.

**Expected behavior**

The frontend receives transcription messages for both participants.

**Actual, wrong behavior**

Transcription messages are received only for the first participant.
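The expected/actual comparison above boils down to a set difference between the participants publishing audio and the participants with at least one transcription segment. A minimal sketch, assuming the frontend's received messages are logged as hypothetical `(participant_identity, text)` pairs; `participants_without_transcriptions` is an illustrative helper, not part of the test project:

```python
def participants_without_transcriptions(
    publishers: set[str], received: list[tuple[str, str]]
) -> set[str]:
    """Return the identities that publish audio but got no transcription.

    `received` is a hypothetical log of (participant_identity, text) pairs,
    one per segment read from the `lk.transcription` topic.
    """
    transcribed = {identity for identity, _text in received}
    return publishers - transcribed


if __name__ == "__main__":
    log = [("participant-1", "hello"), ("participant-1", "how are you")]
    # Expected behavior: an empty set. The behavior reported here leaves one participant out.
    print(participants_without_transcriptions({"participant-1", "participant-2"}, log))
```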

## Images demonstrating the issue

The screenshots below demonstrate the issue:

1. Set a room name:

<img width="1142" height="332" alt="Image" src="https://github.com/user-attachments/assets/0c0baaf3-92b9-443d-9b76-0714f3b7a729" />

2. Connect the first participant. The agent automatically joins the room:

<img width="1915" height="965" alt="Image" src="https://github.com/user-attachments/assets/30973827-ae19-43a3-806c-8a40e873bf17" />

3. The agent starts generating transcription messages for the first participant:

<img width="1915" height="965" alt="Image" src="https://github.com/user-attachments/assets/29e8a13f-ba01-47a3-a10a-84afdbe67a54" />

4. Connect the second participant:

<img width="1915" height="1080" alt="Image" src="https://github.com/user-attachments/assets/82fb058e-7ae6-4d3f-a078-defe9eacf6b3" />

5. Both participants are talking, but the agent generates transcription messages only for the first participant:

<img width="1915" height="1080" alt="Image" src="https://github.com/user-attachments/assets/c3e6730b-4bed-4b7d-ab1e-21928804cd25" />

The behavior is the same when I use manual dispatch to change the moment the agent joins the room: if the agent is added after both participants are connected and publishing audio, it still generates transcription events for only one participant.

I am not sure whether this behavior is a bug or whether I have simply misunderstood the agent documentation. At first I also suspected that connecting two participants that publish two audio tracks from the same device and source could cause problems. However:

- The `IsSpeakingChanged` event indicates that both participants are actively speaking.
- I have also tested this scenario from separate devices, and the issue persists.

I have also ruled out the agent worker's availability as the cause by setting a fully permissive `load_threshold`.

I am quite lost at the moment. I am considering the [STT standalone usage](https://docs.livekit.io/agents/models/stt/#standalone-usage) approach as a way to work around this behavior, but I would first like to confirm whether this is a real problem or whether I have misinterpreted the documentation.
