Bug description
I have been looking into a strange behavior when receiving text messages from the lk.transcription topic, generated by an STT agent.
Reading the agents documentation, it seems that an agent with a well-configured stt node (see snippet below) should generate transcription messages for every participant connected to the room and publishing an audio track.
from livekit.agents import (
    Agent,
    AgentSession,
    AutoSubscribe,
    JobContext,
    RoomInputOptions,
    RoomOutputOptions,
)
from livekit.plugins import aws
from livekit.plugins.turn_detector.multilingual import MultilingualModel


async def entrypoint(ctx: JobContext):
    # Configure AWS Transcribe STT
    stt = aws.STT(language="en-US")

    # Create a simple agent with STT
    agent = Agent(instructions="not-needed", stt=stt)

    # Create agent session with VAD and turn detection
    session = AgentSession(
        vad=ctx.proc.userdata["vad"],
        turn_detection=MultilingualModel(),
    )

    # Start the session
    await session.start(
        agent=agent,
        room=ctx.room,
        room_output_options=RoomOutputOptions(
            # The agent will only generate text transcriptions as output
            transcription_enabled=True,
            audio_enabled=False,
        ),
        room_input_options=RoomInputOptions(
            # The agent will only receive audio tracks as input
            text_enabled=False,
            video_enabled=False,
            audio_enabled=True,
            pre_connect_audio=True,
            pre_connect_audio_timeout=3.0,
        ),
    )

    # Connect to room and subscribe to audio only
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)
I am assuming that the agent above should automatically subscribe to all the audio tracks published to its room (both existing and new ones), and start generating transcription messages on topic lk.transcription for each subscribed track.
But what I am seeing is that in a 1-to-1 audio-only Room, the agent sends transcription messages for one participant, but not for the other.
Minimal setup to reproduce the issue
I have created a very minimal project to demonstrate the issue:
https://github.com/OpenVidu/livekit-agents-transcription-test
This repository contains:
- A very simple agent built with the latest livekit-agents Python SDK, based on a simpler version of the official agent-starter-python using the aws plugin.
- A very simple web app that allows joining participants to rooms, each publishing a single audio track (and not subscribing to remote tracks at all). A textarea for each participant shows the transcription messages received in topic lk.transcription.
Instructions to run the minimal setup (also available in its README):
# Clone the repository
git clone https://github.com/OpenVidu/livekit-agents-transcription-test.git
cd livekit-agents-transcription-test
# Build the agent container
docker build -t livekit/transcription-agent-test:latest agent/.
# Export your LiveKit Cloud credentials
export LIVEKIT_URL=wss://xxxxxxxx.livekit.cloud
export LIVEKIT_API_KEY=your_livekit_cloud_api_key
export LIVEKIT_API_SECRET=your_livekit_cloud_api_secret
# Export your AWS credentials
export AWS_ACCESS_KEY_ID=your_access_key_id
export AWS_SECRET_ACCESS_KEY=your_secret_access_key
export AWS_DEFAULT_REGION=your_aws_region
# Start the agent
docker compose up -d
# Modify the webapp LiveKit Cloud credentials. Search and replace in webapp/index.html:
# const LIVEKIT_URL = "wss://xxxxxxxx.livekit.cloud";
# const LIVEKIT_API_KEY = "your_livekit_cloud_api_key";
# const LIVEKIT_API_SECRET = "your_livekit_cloud_api_secret";
# Run the web app
cd webapp
npm install
npm start
The web app will be available at http://localhost:3000. You can connect to it and launch the following scenario:
Scenario
- Create a Room with two participants, each publishing a single audio track.
- The stt agent is dispatched automatically, and starts generating transcription text messages.
Expected behavior
Transcription messages should be received in the frontend for both participants.
Actual, wrong behavior
Transcription messages are received only for the first participant.
Images demonstrating the issue
The screenshots below demonstrate the issue:
- Set a room-name:
- Connect first participant. The agent automatically joins the room:
- The agent starts generating transcription messages for the first participant:
- Connect second participant.
- Both participants are talking, but the agent generates transcription messages only for the first participant:
The behavior is the same even when changing the moment the agent joins the room with manual dispatch. If the agent is added to the room after the two participants are connected and publishing audio, the agent keeps generating transcription events only for one participant.
I am not actually sure whether this behavior is a bug or whether I have just misunderstood the agent documentation. At first I also thought that connecting two participants and publishing two audio tracks from the same device and source could cause problems. But:
- Event IsSpeakingChanged indicates that both participants are actively speaking.
- I have also tested this scenario from separate devices and it still happens.
I have also checked that the agent's worker availability is OK, forcing a load_threshold that is fully permissive.
I am quite lost at the moment. I am thinking about exploring the STT standalone usage strategy to see if I can work around this behavior, but I would first like to confirm whether this is a bug or whether I have just misinterpreted the documentation.
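To make the standalone idea concrete, here is a minimal sketch of the pattern it implies: one independent transcription task per subscribed audio track, each tagging its output with the publishing participant. It is a pure asyncio model and deliberately does not call the LiveKit SDK. `fake_audio`, `fake_transcribe`, `PerTrackTranscriber` and `on_track_subscribed` are hypothetical names invented for illustration; in a real agent the audio source and the STT stream would come from the SDK, and how they map onto it is an assumption not verified here.

```python
import asyncio
from typing import AsyncIterator, Callable, Dict, List, Tuple


async def fake_audio(chunks: List[str]) -> AsyncIterator[str]:
    """Hypothetical stand-in for one participant's audio track."""
    for chunk in chunks:
        await asyncio.sleep(0)  # yield control so tracks interleave
        yield chunk


async def fake_transcribe(audio: AsyncIterator[str]) -> AsyncIterator[str]:
    """Hypothetical stand-in for an STT stream: one transcript per chunk."""
    async for chunk in audio:
        yield chunk.upper()


class PerTrackTranscriber:
    """Runs one transcription task per participant identity."""

    def __init__(self, publish: Callable[[str, str], None]) -> None:
        self._publish = publish
        self._tasks: Dict[str, asyncio.Task] = {}

    def on_track_subscribed(self, identity: str, audio: AsyncIterator[str]) -> None:
        # Ignore a second track from the same participant in this sketch.
        if identity in self._tasks:
            return
        self._tasks[identity] = asyncio.create_task(self._run(identity, audio))

    async def _run(self, identity: str, audio: AsyncIterator[str]) -> None:
        # Each track gets its own transcription stream, so no participant
        # depends on another participant's stream being active.
        async for text in fake_transcribe(audio):
            self._publish(identity, text)

    async def wait_closed(self) -> None:
        await asyncio.gather(*self._tasks.values())


async def demo() -> List[Tuple[str, str]]:
    received: List[Tuple[str, str]] = []
    transcriber = PerTrackTranscriber(lambda who, text: received.append((who, text)))
    transcriber.on_track_subscribed("alice", fake_audio(["hello", "there"]))
    transcriber.on_track_subscribed("bob", fake_audio(["hi"]))
    await transcriber.wait_closed()
    return received


if __name__ == "__main__":
    for who, text in asyncio.run(demo()):
        print(f"{who}: {text}")
```

The point of the design is that the set of transcription streams is keyed by participant, so a second participant joining simply adds a second task instead of competing for a single input.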