
Add vad_threshold parameter to AssemblyAI STT plugin #4880

Merged
chenghao-mou merged 4 commits into livekit:main from AhmadIbrahiim:add-assemblyai-vad-threshold
Feb 19, 2026

Conversation

Contributor

AhmadIbrahiim commented Feb 18, 2026

Summary

  • Add support for the vad_threshold parameter from AssemblyAI's streaming API
  • This allows users to configure VAD sensitivity (0-1 range) for different audio environments

Changes

  • Added vad_threshold to STTOptions dataclass
  • Added vad_threshold parameter to STT.__init__()
  • Added vad_threshold to STT.update_options() and SpeechStream.update_options()
  • Include vad_threshold in WebSocket connection config
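The plumbing above can be sketched as follows. This is a simplified stand-in, assuming the NOT_GIVEN sentinel convention used by livekit-agents; STTOptions is reduced to the new field, build_ws_url is a hypothetical helper rather than the plugin's actual code, and the endpoint URL is an assumption for illustration.

```python
from dataclasses import dataclass
from typing import Any
from urllib.parse import urlencode

# Minimal stand-in for the NOT_GIVEN sentinel used by livekit-agents.
class _NotGiven:
    def __repr__(self) -> str:
        return "NOT_GIVEN"

NOT_GIVEN = _NotGiven()

@dataclass
class STTOptions:
    # 0-1; lower values make the VAD more sensitive. When left NOT_GIVEN,
    # the parameter is omitted and AssemblyAI's server-side default applies.
    vad_threshold: Any = NOT_GIVEN

def build_ws_url(
    opts: STTOptions,
    base: str = "wss://streaming.assemblyai.com/v3/ws",  # assumed endpoint
) -> str:
    params: dict[str, Any] = {"sample_rate": 16000}
    if not isinstance(opts.vad_threshold, _NotGiven):
        # Validate locally before sending; the API expects a 0-1 value.
        if not 0.0 <= opts.vad_threshold <= 1.0:
            raise ValueError("vad_threshold must be between 0 and 1")
        params["vad_threshold"] = opts.vad_threshold
    return f"{base}?{urlencode(params)}"
```

Keeping the default as NOT_GIVEN (rather than hard-coding 0.4) means users who never touch the option automatically track whatever default AssemblyAI ships server-side.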

API Reference

From AssemblyAI's docs:

The threshold for voice activity detection (VAD). A value between 0 and 1 that determines how sensitive the VAD is. Lower values make the VAD more sensitive, meaning it will detect quieter speech. Higher values make the VAD less sensitive. The default value is 0.4.

Closes #4879

Add support for the vad_threshold parameter from AssemblyAI's streaming
API. This allows users to configure VAD sensitivity (0-1 range) for
different audio environments.

Closes livekit#4879
Add unit tests covering:
- Default value (NOT_GIVEN)
- Setting value in constructor
- Boundary values (0 and 1)
- Dynamic updates via update_options
- Interaction with other options
- Partial updates
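The listed cases can be condensed into a small self-contained sketch. FakeSTT below is a local stub standing in for the real assemblyai.STT (which needs an API key and a network connection); its names and signatures, including end_of_turn_confidence, are illustrative, not the plugin's actual API.

```python
# Local stub approximating the option-holding behavior under test.
class FakeSTT:
    def __init__(self, vad_threshold=None, end_of_turn_confidence=None):
        self.vad_threshold = vad_threshold
        self.end_of_turn_confidence = end_of_turn_confidence

    def update_options(self, *, vad_threshold=None, end_of_turn_confidence=None):
        # Partial update: only overwrite options that were actually passed
        # (None stands in for NOT_GIVEN in this simplified sketch).
        if vad_threshold is not None:
            self.vad_threshold = vad_threshold
        if end_of_turn_confidence is not None:
            self.end_of_turn_confidence = end_of_turn_confidence

def run_checks() -> bool:
    assert FakeSTT().vad_threshold is None                   # default stays unset
    assert FakeSTT(vad_threshold=0.0).vad_threshold == 0.0   # lower boundary
    assert FakeSTT(vad_threshold=1.0).vad_threshold == 1.0   # upper boundary
    s = FakeSTT(vad_threshold=0.3, end_of_turn_confidence=0.8)
    s.update_options(vad_threshold=0.6)                      # dynamic partial update
    assert s.vad_threshold == 0.6
    assert s.end_of_turn_confidence == 0.8                   # other option untouched
    return True
```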
Contributor Author

AhmadIbrahiim commented Feb 18, 2026

Hi @theomonnom @davidzhao 👋

This PR adds support for the vad_threshold parameter from AssemblyAI's streaming API. We need this to configure VAD sensitivity for different audio environments.

Would appreciate a review when you have a moment. Thanks!

@chenghao-mou
Member

/test-stt

@github-actions
Contributor

STT Test Results

Status: ✗ Some tests failed

Metric Count
✓ Passed 20
✗ Failed 3
× Errors 1
→ Skipped 15
▣ Total 39
⏱ Duration 191.8s
Failed Tests
  • tests.test_stt::test_recognize[livekit.plugins.google]
    self = <livekit.plugins.google.stt.STT object at 0x7f2dc5297cb0>
    buffer = [rtc.AudioFrame(sample_rate=24000, num_channels=1, samples_per_channel=1119168, duration=46.632)]
    
        async def _recognize_impl(
            self,
            buffer: utils.AudioBuffer,
            *,
            language: NotGivenOr[SpeechLanguages | str] = NOT_GIVEN,
            conn_options: APIConnectOptions,
        ) -> stt.SpeechEvent:
            frame = rtc.combine_audio_frames(buffer)
      
            config = self._build_recognition_config(
                sample_rate=frame.sample_rate,
                num_channels=frame.num_channels,
                language=language,
            )
      
            try:
                async with self._pool.connection(timeout=conn_options.timeout) as client:
    >               raw = await client.recognize(
                        self._build_recognition_request(client, config, frame.data.tobytes()),
                        timeout=conn_options.timeout,
                    )
    
    livekit-plugins/livekit-plugins-google/livekit/plugins/google/stt.py:368: 
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    
    self = <google.cloud.speech_v1.services.speech.async_client.SpeechAsyncClient object at 0x7f2dc52cd400>
    request = config {
      encoding: LINEAR16
      sample_rate_hertz: 24000
      language_code: "en-US"
      audio_channel_count: 1
      enable_wo...000\000\000\000\000\000\000\001\000\001\000\001\000\000\000\000\000\001\000\377\377\377\377\000\000\000\000\000\000"
    }
    
    
        async def recognize(
            self,
            request: Optional[Union[cloud_speech.RecognizeRequest, dict]] = None,
            *,
            config: Optional[cloud_speech.RecognitionConfig] = None,
            audio: Optional[cloud_speech.RecognitionAudio] = None,
            retry: OptionalRetry = gapic_v1.method.DEFAULT,
            timeout: Union[float, object] = gapic_v1.method.DEFAULT,
            metadata: Sequence[Tuple[str, Union[str, bytes]]] = (),
        ) -> cloud_speech.RecognizeResponse:
            r"""Performs synchronous speech recognition: receive
    
  • tests.test_stt::test_stream[livekit.plugins.speechmatics]
    def finalizer() -> None:
            """Yield again, to finalize."""
      
            async def async_finalizer() -> None:
                try:
                    await gen_obj.__anext__()
                except StopAsyncIteration:
                    pass
                else:
                    msg = "Async generator fixture didn't stop."
                    msg += "Yield only once."
                    raise ValueError(msg)
      
    >       runner.run(async_finalizer(), context=context)
    
    .venv/lib/python3.12/site-packages/pytest_asyncio/plugin.py:330: 
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    
    self = <asyncio.runners.Runner object at 0x7f4f66b55010>
    coro = <coroutine object _wrap_asyncgen_fixture.<locals>._asyncgen_fixture_wrapper.<locals>.finalizer.<locals>.async_finalizer at 0x7f4f66b43c60>
    
        def run(self, coro, *, context=None):
            """Run a coroutine inside the embedded event loop."""
            if not coroutines.iscoroutine(coro):
                raise ValueError("a coroutine was expected, got {!r}".format(coro))
      
            if events._get_running_loop() is not None:
                # fail fast with short traceback
                raise RuntimeError(
                    "Runner.run() cannot be called from a running event loop")
      
            self._lazy_init()
      
            if context is None:
                context = self._context
            task = self._loop.create_task(coro, context=context)
      
            if (threading.current_thread() is threading.main_thread()
                and signal.getsignal(signal.SIGINT) is signal.default_int_handler
            ):
                sigint_handler = functools.partial(self._on_sigint, main_task=task)
                try:
                    signal.signal(signal.SIGINT, sigint_handler)
                except ValueError:
                    # `signal.signal` may throw if `threading.main_thread` does
                    # not support signals (e.g. embedded interpreter with signals
                    # not registered - see gh-91880)
                    sigint_handler = None
    
  • tests.test_stt::test_stream[livekit.agents.inference]
    stt_factory = <function parameter_factory.<locals>.<lambda> at 0x7f2dc520dee0>
    request = <FixtureRequest for <Coroutine test_stream[livekit.agents.inference]>>
    
        @pytest.mark.usefixtures("job_process")
        @pytest.mark.parametrize("stt_factory", STTs)
        async def test_stream(stt_factory: Callable[[], STT], request):
            sample_rate = SAMPLE_RATE
            plugin_id = request.node.callspec.id.split("-")[0]
            frames, transcript, _ = await make_test_speech(chunk_duration_ms=10, sample_rate=sample_rate)
      
            # TODO: differentiate missing key vs other errors
            try:
                stt_instance: STT = stt_factory()
            except ValueError as e:
                pytest.skip(f"{plugin_id}: {e}")
      
            async with stt_instance as stt:
                label = f"{stt.model}@{stt.provider}"
                if not stt.capabilities.streaming:
                    pytest.skip(f"{label} does not support streaming")
      
                for attempt in range(MAX_RETRIES):
                    try:
                        state = {"closing": False}
      
                        async def _stream_input(
                            frames: list[rtc.AudioFrame], stream: RecognizeStream, state: dict = state
                        ):
                            for frame in frames:
                                stream.push_frame(frame)
                                await asyncio.sleep(0.005)
      
                            stream.end_input()
                            state["closing"] = True
      
                        async def _stream_output(stream: RecognizeStream, state: dict = state):
                            text = ""
                            # make sure the events are sent in the right order
                            recv_start, recv_end = False, True
                            start_time = time.time()
                            got_final_transcript = False
      
                            async for event in stream:
                                if event.type == agents.stt.SpeechEventType.START_OF_SPEECH:
    
  • tests.test_stt::test_stream[livekit.plugins.google]
    self = <google.api_core.grpc_helpers_async._WrappedStreamStreamCall object at 0x7f4f75c5af60>
    
        async def _wrapped_aiter(self) -> AsyncGenerator[P, None]:
            try:
                # NOTE(lidiz) coverage doesn't understand the exception raised from
                # __anext__ method. It is covered by test case:
                #     test_wrap_stream_errors_aiter_non_rpc_error
    >           async for response in self._call:  # pragma: no branch
    
    .venv/lib/python3.12/site-packages/google/api_core/grpc_helpers_async.py:107: 
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    
    self = <_AioCall of RPC that terminated with:
    	status = Request had invalid authentication credentials. Expected OAuth 2 acce... other valid authentication credential. See https://developers.google.com/identity/sign-in/web/devconsole-project."}"
    >
    
        async def _fetch_stream_responses(self) -> ResponseType:
            message = await self._read()
            while message is not cygrpc.EOF:
                yield message
                message = await self._read()
      
            # If the read operation failed, Core should explain why.
    >       await self._raise_for_status()
    
    .venv/lib/python3.12/site-packages/grpc/aio/_call.py:364: 
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    
    self = <_AioCall of RPC that terminated with:
    	status = Request had invalid authentication credentials. Expected OAuth 2 acce... other valid authentication credential. See https://developers.google.com/identity/sign-in/web/devconsole-project."}"
    >
    
        async def _raise_for_status(self) -> None:
            if self._cython_call.is_locally_cancelled():
                raise asyncio.CancelledError()
            code = await self.code()
            if code != grpc.StatusCode.OK:
    >           raise _create_rpc_error(
                    await self.initial_metadata(),
                    await self._cython_call.status(),
                )
    E           grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with
    
Skipped Tests
Test Reason
tests.test_stt::test_recognize[livekit.plugins.assemblyai] universal-streaming-english@AssemblyAI does not support batch recognition
tests.test_stt::test_recognize[livekit.plugins.speechmatics] enhanced@Speechmatics does not support batch recognition
tests.test_stt::test_recognize[livekit.plugins.fireworksai] unknown@FireworksAI does not support batch recognition
tests.test_stt::test_recognize[livekit.plugins.cartesia] ink-whisper@Cartesia does not support batch recognition
tests.test_stt::test_recognize[livekit.plugins.soniox] stt-rt-v3@Soniox does not support batch recognition
tests.test_stt::test_recognize[livekit.plugins.aws] unknown@Amazon Transcribe does not support batch recognition
tests.test_stt::test_recognize[livekit.plugins.nvidia] unknown@unknown does not support batch recognition
tests.test_stt::test_recognize[livekit.plugins.deepgram.STTv2] flux-general-en@Deepgram does not support batch recognition
tests.test_stt::test_recognize[livekit.plugins.gradium.STT] unknown@Gradium does not support batch recognition
tests.test_stt::test_recognize[livekit.agents.inference] unknown@livekit does not support batch recognition
tests.test_stt::test_recognize[livekit.plugins.azure] unknown@Azure STT does not support batch recognition
tests.test_stt::test_stream[livekit.plugins.elevenlabs] scribe_v1@ElevenLabs does not support streaming
tests.test_stt::test_stream[livekit.plugins.mistralai] voxtral-mini-latest@MistralAI does not support streaming
tests.test_stt::test_stream[livekit.plugins.openai] gpt-4o-mini-transcribe@api.openai.com does not support streaming
tests.test_stt::test_stream[livekit.plugins.fal] Wizper@Fal does not support streaming

Triggered by workflow run #881

@AhmadIbrahiim
Contributor Author

/test-stt

@chenghao-mou I see 3 failed tests, but I don't think they're related to the changes in this PR?

chenghao-mou self-assigned this Feb 18, 2026
@chenghao-mou
Member

Yeah, it should be fine. I will review and test this by EOW.

Member

chenghao-mou left a comment


LGTM. One minor issue about the docstring.

vad_threshold: The threshold for voice activity detection (VAD). A value between
0 and 1 that determines how sensitive the VAD is. Lower values make the VAD
more sensitive (detects quieter speech). Higher values make it less sensitive.
Defaults to 0.5.
Member


nitpicking: the official default value is 0.4.

Contributor Author


Updated ✅

Contributor

devin-ai-integration bot left a comment


Devin Review found 1 new potential issue.

View 5 additional findings in Devin Review.


vad_threshold: The threshold for voice activity detection (VAD). A value between
0 and 1 that determines how sensitive the VAD is. Lower values make the VAD
more sensitive (detects quieter speech). Higher values make it less sensitive.
Defaults to 0.4.
Contributor


🟡 Docstring states incorrect default value for vad_threshold (0.4 vs 0.5)

The docstring for vad_threshold at line 93 states "Defaults to 0.4" but the AssemblyAI API documentation (referenced in the PR description itself) states the server-side default is 0.5.

Detailed Explanation

The PR description directly quotes AssemblyAI's docs:

The default value is 0.5.

However, the docstring at livekit-plugins/livekit-plugins-assemblyai/livekit/plugins/assemblyai/stt.py:93 says:

Defaults to 0.4.

Since vad_threshold defaults to NOT_GIVEN and is sent as None (thus omitted from the WebSocket URL query parameters), the actual default applied is whatever AssemblyAI's server uses — which is 0.5 per their docs. Users reading this docstring would be misled about the effective default behavior of the VAD sensitivity.

Impact: Users relying on the documented default of 0.4 to understand the VAD sensitivity behavior will have incorrect expectations. The actual server-side default is 0.5 (less sensitive than documented).

Suggested change:
- Defaults to 0.4.
+ Defaults to 0.5.


chenghao-mou merged commit 1cb9c53 into livekit:main Feb 19, 2026
9 checks passed
AhmadIbrahiim deleted the add-assemblyai-vad-threshold branch February 19, 2026 19:41


Development

Successfully merging this pull request may close these issues.

Add vad_threshold parameter support to AssemblyAI STT plugin

2 participants