Bug Description
When silero.VAD.load(force_cpu=False) is called on a machine with a CUDA-capable GPU
but without TensorRT installed, the ONNX Runtime session silently falls back to
CPUExecutionProvider instead of using CUDAExecutionProvider.
The issue is in livekit/plugins/silero/onnx_model.py, in the new_inference_session()
function. When force_cpu=False, the session is created with no explicit providers
argument:
    session = onnxruntime.InferenceSession(path, sess_options=opts)
ONNX Runtime's default provider priority list includes TensorrtExecutionProvider before
CUDAExecutionProvider. When TensorRT is not installed (missing libnvinfer.so.10),
ORT fails to load the TRT provider and silently falls all the way back to
CPUExecutionProvider — skipping CUDAExecutionProvider entirely.
Verified: explicitly passing providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
to the same InferenceSession call works correctly and uses the GPU.
Expected Behavior
When force_cpu=False and a CUDA GPU is available, silero.VAD.load() should use
CUDAExecutionProvider regardless of whether TensorRT is installed.
The fix is to explicitly build the provider list in new_inference_session() instead
of relying on ORT's default auto-detection, which does not gracefully cascade from
a failed TensorrtExecutionProvider to CUDAExecutionProvider:
    available = onnxruntime.get_available_providers()
    if "CUDAExecutionProvider" in available:
        providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    else:
        providers = ["CPUExecutionProvider"]
    session = onnxruntime.InferenceSession(path, providers=providers, sess_options=opts)
Reproduction Steps
1. Set up a machine with a CUDA GPU and CUDA 12.x drivers (e.g. Azure NC/NV VM with Tesla V100)
2. Install onnxruntime-gpu but NOT TensorRT (libnvinfer.so.10 absent)
3. Verify CUDA provider is listed: `onnxruntime.get_available_providers()`
→ ['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']
4. Run the following:
       from livekit.plugins import silero
       vad = silero.VAD.load(force_cpu=False)
       # Inspect which provider the underlying ONNX session is using:
       from livekit.plugins.silero.onnx_model import new_inference_session
       sess = new_inference_session(force_cpu=False)
       print(sess.get_providers())  # Prints: ['CPUExecutionProvider'] ← BUG
5. Expected: ['CUDAExecutionProvider', 'CPUExecutionProvider']
6. Confirmed fix: passing providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
explicitly to InferenceSession produces the correct result.
Operating System
Ubuntu 24.04.4 LTS (Azure VM, NVIDIA Tesla V100 PCIe 16GB, CUDA 12.2, Driver 535.309.01)
Models Used
Deepgram Nova-3 (STT), Azure OpenAI GPT (LLM), Deepgram Aura-2 (TTS), Silero VAD, LiveKit MultilingualModel (turn detection)
Package Versions
livekit-agents==1.5.13
livekit-plugins-silero==1.5.13
onnxruntime-gpu==1.26.0
Session/Room/Call IDs
No response
Proposed Solution
In `livekit/plugins/silero/onnx_model.py`, replace the implicit provider
auto-detection with an explicit provider list in `new_inference_session()`:
Current code (force_cpu=False path):
    else:
        session = onnxruntime.InferenceSession(path, sess_options=opts)
Proposed fix:
    else:
        available = onnxruntime.get_available_providers()
        if "CUDAExecutionProvider" in available:
            providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
        else:
            providers = ["CPUExecutionProvider"]
        session = onnxruntime.InferenceSession(
            path, providers=providers, sess_options=opts
        )
This skips TensorrtExecutionProvider intentionally: TRT requires a separate
heavyweight install (libnvinfer) that most users won't have, and Silero VAD
does not benefit meaningfully from TRT over plain CUDA. Users who do have TRT
installed and want it could opt in explicitly if a `providers` parameter is
later added to `VAD.load()`.
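As a rough sketch of the same idea, the selection logic could be pulled out into a small pure function so it can be unit-tested without a GPU or an ONNX model. The name `select_providers` and its `force_cpu` parameter are hypothetical and not part of livekit-plugins-silero; only the provider strings come from the report above.

```python
from typing import Sequence


def select_providers(available: Sequence[str], force_cpu: bool = False) -> list[str]:
    """Build an explicit provider list (hypothetical helper, not in the plugin).

    Prefers CUDA when it is available and never selects TensorRT, so a
    missing libnvinfer cannot influence which provider is chosen.
    """
    if not force_cpu and "CUDAExecutionProvider" in available:
        # Keep CPU as the last entry so there is always a usable fallback.
        return ["CUDAExecutionProvider", "CPUExecutionProvider"]
    return ["CPUExecutionProvider"]
```

Inside `new_inference_session()` this would be called as `select_providers(onnxruntime.get_available_providers(), force_cpu)` and the result passed as `providers=` to `InferenceSession`.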
Additional Context
This affects any deployment running on a CUDA-capable machine without TensorRT
(common in cloud VMs, Docker containers, and CI environments). The failure is
completely silent — no warning or error is raised, and the agent runs on CPU
without the developer knowing.
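Because nothing is logged today, a small guard can at least make the fallback visible. The sketch below is only an illustration: the function name `fell_back_to_cpu` and the logger name are assumptions, and it takes plain lists so it does not depend on onnxruntime being importable.

```python
import logging
from typing import Sequence

# Hypothetical logger name, chosen for this example only.
logger = logging.getLogger("silero.providers")


def fell_back_to_cpu(available: Sequence[str], active: Sequence[str]) -> bool:
    """Return True when CUDA was available but the session is not using it."""
    cuda = "CUDAExecutionProvider"
    fallback = cuda in available and cuda not in active
    if fallback:
        # Surface the otherwise silent downgrade to the developer.
        logger.warning(
            "CUDA is available but the session is running on %s", list(active)
        )
    return fallback
```

With a real session this could be invoked as `fell_back_to_cpu(onnxruntime.get_available_providers(), sess.get_providers())` right after the session is created.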
The root cause is a known quirk in ONNX Runtime's Python bindings: when no
providers list is given to InferenceSession, ORT iterates its internal
priority list (TRT → CUDA → CPU). If TRT fails to load its shared library, the
fallback does NOT cascade to CUDA — it drops straight to CPU. This is different
from the behaviour when providers are listed explicitly, where ORT correctly
falls back through the list (confirmed with onnxruntime-gpu==1.26.0).
Workaround applied locally until this is fixed upstream:
- Patched the installed onnx_model.py with the proposed fix above.
- Registered the pip-installed cuDNN path system-wide via ldconfig
  (/etc/ld.so.conf.d/nvidia-pip-cudnn.conf) so libcudnn.so.9 is
  discoverable without setting LD_LIBRARY_PATH manually.
After the patch, sess.get_providers() correctly returns
['CUDAExecutionProvider', 'CPUExecutionProvider'] on every run.
Screenshots and Recordings
No response