silero VAD silently falls back to CPU when TensorRT is absent, despite CUDA being available #5860

@ahmedmuzammilAI

Bug Description

When silero.VAD.load(force_cpu=False) is called on a machine with a CUDA-capable GPU
but without TensorRT installed, the ONNX Runtime session silently falls back to
CPUExecutionProvider instead of using CUDAExecutionProvider.

The issue is in livekit/plugins/silero/onnx_model.py, in the new_inference_session()
function. When force_cpu=False, the session is created with no explicit providers
argument:

    session = onnxruntime.InferenceSession(path, sess_options=opts)

ONNX Runtime's default provider priority list includes TensorrtExecutionProvider before
CUDAExecutionProvider. When TensorRT is not installed (missing libnvinfer.so.10),
ORT fails to load the TRT provider and silently falls all the way back to
CPUExecutionProvider — skipping CUDAExecutionProvider entirely.

Verified: explicitly passing providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
to the same InferenceSession call works correctly and uses the GPU.

Expected Behavior

When force_cpu=False and a CUDA GPU is available, silero.VAD.load() should use
CUDAExecutionProvider regardless of whether TensorRT is installed.

The fix is to explicitly build the provider list in new_inference_session() instead
of relying on ORT's default auto-detection, which does not gracefully cascade from
a failed TensorrtExecutionProvider to CUDAExecutionProvider:

    available = onnxruntime.get_available_providers()
    if "CUDAExecutionProvider" in available:
        providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    else:
        providers = ["CPUExecutionProvider"]
    session = onnxruntime.InferenceSession(path, providers=providers, sess_options=opts)

Reproduction Steps

1. Set up a machine with a CUDA GPU and CUDA 12.x drivers (e.g. Azure NC/NV VM with Tesla V100)
2. Install onnxruntime-gpu but NOT TensorRT (libnvinfer.so.10 absent)
3. Verify CUDA provider is listed: `onnxruntime.get_available_providers()`
   → ['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']
4. Run the following:

    from livekit.plugins import silero
    vad = silero.VAD.load(force_cpu=False)
    # Inspect which provider the underlying ONNX session is using:
    from livekit.plugins.silero.onnx_model import new_inference_session
    sess = new_inference_session(force_cpu=False)
    print(sess.get_providers())  # Prints: ['CPUExecutionProvider']  ← BUG

5. Expected: ['CUDAExecutionProvider', 'CPUExecutionProvider']
6. Confirmed fix: passing providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
   explicitly to InferenceSession produces the correct result.
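
The comparison in steps 4 to 6 can be wrapped in a small helper so it is easy to rerun after a change. This is a minimal sketch, not part of livekit or ONNX Runtime: `compare_provider_selection` and its `make_session` callback are hypothetical names. The callback receives either `None` (no providers argument) or an explicit list, and must return an object exposing `get_providers()`, as `onnxruntime.InferenceSession` does.

```python
from __future__ import annotations

from typing import Callable, Optional, Protocol, Sequence


class HasProviders(Protocol):
    """Anything with a get_providers() method, e.g. an InferenceSession."""

    def get_providers(self) -> list[str]: ...


# The explicit list that this report found to work.
EXPLICIT = ["CUDAExecutionProvider", "CPUExecutionProvider"]


def compare_provider_selection(
    make_session: Callable[[Optional[Sequence[str]]], HasProviders],
) -> dict[str, list[str]]:
    """Create one session without a providers argument and one with the
    explicit CUDA-then-CPU list, and report what each ended up using."""
    default_session = make_session(None)
    explicit_session = make_session(EXPLICIT)
    return {
        "default": list(default_session.get_providers()),
        "explicit": list(explicit_session.get_providers()),
    }


# Usage with ONNX Runtime (assumes onnxruntime-gpu is installed and `path`
# points at the Silero ONNX model file):
#
#   import onnxruntime
#
#   def make_session(providers):
#       if providers is None:
#           return onnxruntime.InferenceSession(path)
#       return onnxruntime.InferenceSession(path, providers=list(providers))
#
#   print(compare_provider_selection(make_session))
```

If the two lists differ, the default path is not selecting the same providers as the explicit one, which is the behaviour described in this report.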

Operating System

Ubuntu 24.04.4 LTS (Azure VM, NVIDIA Tesla V100 PCIe 16GB, CUDA 12.2, Driver 535.309.01)

Models Used

Deepgram Nova-3 (STT), Azure OpenAI GPT (LLM), Deepgram Aura-2 (TTS), Silero VAD, LiveKit MultilingualModel (turn detection)

Package Versions

livekit-agents==1.5.13
livekit-plugins-silero==1.5.13
onnxruntime-gpu==1.26.0

Session/Room/Call IDs

No response

Proposed Solution

In `livekit/plugins/silero/onnx_model.py`, replace the implicit provider
auto-detection with an explicit provider list in `new_inference_session()`:

Current code (force_cpu=False path):

    else:
        session = onnxruntime.InferenceSession(path, sess_options=opts)

Proposed fix:

    else:
        available = onnxruntime.get_available_providers()
        if "CUDAExecutionProvider" in available:
            providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
        else:
            providers = ["CPUExecutionProvider"]
        session = onnxruntime.InferenceSession(
            path, providers=providers, sess_options=opts
        )

This skips TensorrtExecutionProvider intentionally: TRT requires a separate
heavyweight install (libnvinfer) that most users won't have, and Silero VAD
does not benefit meaningfully from TRT over plain CUDA. Users who do have TRT
installed and want it could still pass it explicitly via a future `providers`
parameter on `VAD.load()`.
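
One way an explicit override could combine with the CUDA-first default is sketched below. `resolve_providers` and its `requested` argument are hypothetical; `VAD.load()` has no `providers` parameter today, and this is only one possible interpretation of such a parameter. The function is pure so it can be checked without a GPU; in practice `available` would come from `onnxruntime.get_available_providers()`.

```python
from __future__ import annotations

from typing import Optional, Sequence

CPU = "CPUExecutionProvider"
CUDA = "CUDAExecutionProvider"


def resolve_providers(
    available: Sequence[str],
    requested: Optional[Sequence[str]] = None,
) -> list[str]:
    """Return the provider list to pass to InferenceSession.

    With no explicit request, prefer CUDA and never include TensorRT.
    With an explicit request, keep the caller's order but drop providers
    that this build does not list. CPU is always kept as the last resort.
    """
    if requested is None:
        wanted = [CUDA, CPU]
    else:
        wanted = list(requested)

    chosen = [p for p in wanted if p in available]
    if CPU not in chosen:
        chosen.append(CPU)
    return chosen
```

With the provider list from step 3 of the reproduction, `resolve_providers(available)` yields the CUDA-then-CPU list, while a caller who really wants TensorRT can pass `requested=["TensorrtExecutionProvider", "CUDAExecutionProvider"]` and keep that order.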

Additional Context

This affects any deployment running on a CUDA-capable machine without TensorRT
(common in cloud VMs, Docker containers, and CI environments). The failure is
completely silent — no warning or error is raised, and the agent runs on CPU
without the developer knowing.
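
Because nothing is logged when this happens, a check after session creation could surface the fallback. The sketch below is an assumption about how such a check might look, not existing livekit behaviour; `warn_on_cpu_fallback` and the logger name are hypothetical. It compares the providers that were asked for with what `session.get_providers()` reports.

```python
from __future__ import annotations

import logging
from typing import Sequence

logger = logging.getLogger("silero_provider_check")  # hypothetical logger name

CPU = "CPUExecutionProvider"


def warn_on_cpu_fallback(requested: Sequence[str], active: Sequence[str]) -> bool:
    """Log a warning and return True when an accelerator was requested
    but the session reports no provider other than CPU."""
    wanted_accelerators = [p for p in requested if p != CPU]
    active_accelerators = [p for p in active if p != CPU]
    if wanted_accelerators and not active_accelerators:
        logger.warning(
            "requested %s but the session is running on %s",
            wanted_accelerators,
            list(active),
        )
        return True
    return False
```

A caller would invoke it right after building the session, for example `warn_on_cpu_fallback(providers, session.get_providers())`.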

The root cause is a known quirk in ONNX Runtime's Python bindings: when no
providers list is given to InferenceSession, ORT iterates its internal
priority list (TRT → CUDA → CPU). If TRT fails to load its shared library, the
fallback does NOT cascade to CUDA — it drops straight to CPU. This is different
from the behaviour when providers are listed explicitly, where ORT correctly
falls back through the list (confirmed with onnxruntime-gpu==1.26.0).

Workaround applied locally until this is fixed upstream:

  • Patched the installed onnx_model.py with the proposed fix above.
  • Registered the pip-installed cuDNN path system-wide via ldconfig
    (/etc/ld.so.conf.d/nvidia-pip-cudnn.conf) so libcudnn.so.9 is
    discoverable without setting LD_LIBRARY_PATH manually.
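
For the second bullet, the directory to register has to be located first. The sketch below searches for it, assuming the pip cuDNN wheel places its shared libraries under `nvidia/cudnn/lib` inside a site-packages directory. That layout and the helper name `find_pip_cudnn_dirs` are assumptions, so verify the path on your system before writing it to a file under /etc/ld.so.conf.d.

```python
from __future__ import annotations

import site
from pathlib import Path
from typing import Iterable, Optional


def find_pip_cudnn_dirs(roots: Optional[Iterable[str]] = None) -> list[Path]:
    """Return directories under the given site-packages roots that contain
    libcudnn shared objects in the assumed nvidia/cudnn/lib layout."""
    if roots is None:
        # Default to the interpreter's global and per-user site-packages.
        roots = [*site.getsitepackages(), site.getusersitepackages()]
    found = []
    for root in roots:
        lib_dir = Path(root) / "nvidia" / "cudnn" / "lib"
        if lib_dir.is_dir() and any(lib_dir.glob("libcudnn*.so*")):
            found.append(lib_dir)
    return found
```

Calling `find_pip_cudnn_dirs()` with no arguments inspects the current interpreter's site-packages; each returned path is a candidate line for the ld.so.conf.d file, followed by running ldconfig.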

After the patch, sess.get_providers() correctly returns
['CUDAExecutionProvider', 'CPUExecutionProvider'] on every run.

Screenshots and Recordings

No response

Labels

bug
