Description
When accessing a TPU device from within a container, at most one process can use the TPU device for the lifetime of the sandbox. All subsequent processes fail with an error, even if the original process has completed.
Error Message
This is the error returned by jax:

```
RuntimeError: Unable to initialize backend 'tpu': UNKNOWN: TPU initialization failed: open(/dev/vfio/1): Device or resource busy: Device or resource busy; Couldn't open iommu group /dev/vfio/1 (set JAX_PLATFORMS='' to automatically choose an available backend)
```
Details
Repro script:

```python
import jax
print(jax.devices())
```
The first time we run it, we receive the correct response:

```
root@runsc:~# python test.py
[TpuDevice(id=0, process_index=0, coords=(0,0,0), core_on_chip=0), TpuDevice(id=1, process_index=0, coords=(1,0,0), core_on_chip=0), TpuDevice(id=2, process_index=0, coords=(0,1,0), core_on_chip=0), TpuDevice(id=3, process_index=0, coords=(1,1,0), core_on_chip=0)]
```
The second time, we receive an error:

```
root@runsc:~# python test.py
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/jax/_src/xla_bridge.py", line 917, in backends
    backend = _init_backend(platform)
  File "/usr/local/lib/python3.10/site-packages/jax/_src/xla_bridge.py", line 1003, in _init_backend
    backend = registration.factory()
  File "/usr/local/lib/python3.10/site-packages/jax/_src/xla_bridge.py", line 151, in tpu_client_timer_callback
    client = xla_client.make_tpu_client(
  File "/usr/local/lib/python3.10/site-packages/jaxlib/xla_client.py", line 212, in make_tpu_client
    return make_tfrt_tpu_c_api_client(options)
  File "/usr/local/lib/python3.10/site-packages/jaxlib/xla_client.py", line 134, in make_tfrt_tpu_c_api_client
    return _xla.get_c_api_client('tpu', options)
jaxlib.xla_extension.XlaRuntimeError: UNKNOWN: TPU initialization failed: open(/dev/vfio/1): Device or resource busy: Device or resource busy; Couldn't open iommu group /dev/vfio/1
```
However, if we look for the first process, we don't see anything. Whatever is consuming the vfio resource is outside the container:

```
root@runsc:~# ps auxf
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         120  0.0  0.0  14120  6484 ?        Ss   20:31   0:00 bash
root        1331  0.0  0.0  16716  9428 ?        R    20:54   0:00  \_ ps auxf
```
Additionally:

- This error does not reproduce on the same host, outside of gVisor.
- After accessing the TPU device, `lsof` shows ~60 `vfio-device` objects, all belonging to a list of `[exe]` processes, none of which are visible from within the container.
- We have tried running with `-platform=kvm`, but container creation errored out, seemingly because our host does not support it.
This means that, once any process has used the TPUs, no other process is able to use them, even if the original process has completed. This is making in-container TPUs nearly impossible to use. Is it possible to work around or address this issue?
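For anyone trying to confirm the same symptom from the host side, a portable way to find which host processes hold a vfio device open is to scan `/proc` for matching fd links. This is only a sketch (it assumes shell access to the host's `/proc` and enough privilege to read other processes' fd links):

```shell
#!/bin/sh
# Scan every process's open fds on the host for vfio device nodes.
# Links that can't be read (permissions, races) are silently skipped.
for fd in /proc/[0-9]*/fd/*; do
  target=$(readlink "$fd" 2>/dev/null) || continue
  case "$target" in
    /dev/vfio/*)
      pid=${fd#/proc/}; pid=${pid%%/*}
      echo "pid $pid holds $target"
      ;;
  esac
done
```

On a host where nothing holds the TPU, the scan prints nothing; after the sandbox has touched the device, it should surface the otherwise-invisible holders.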
Steps to reproduce

We are running the sandbox with the `-tpuproxy` flag, and we've mounted the `/dev/vfio` devices in the container with this piece of config:
```python
config_dict["linux"]["devices"] = [
    {
        "path": "/dev/vfio/0",
        "type": "c",
        "major": 245,
        "minor": 1,
        "fileMode": 438,
        "uid": 0,
        "gid": 0,
    },
    {
        "path": "/dev/vfio/1",
        "type": "c",
        "major": 245,
        "minor": 2,
        "fileMode": 438,
        "uid": 0,
        "gid": 0,
    },
    {
        "path": "/dev/vfio/2",
        "type": "c",
        "major": 245,
        "minor": 3,
        "fileMode": 438,
        "uid": 0,
        "gid": 0,
    },
    {
        "path": "/dev/vfio/3",
        "type": "c",
        "major": 245,
        "minor": 0,
        "fileMode": 438,
        "uid": 0,
        "gid": 0,
    },
    {
        "path": "/dev/vfio/vfio",
        "type": "c",
        "major": 10,
        "minor": 196,
        "fileMode": 438,
        "uid": 0,
        "gid": 0,
    },
]
```
runsc version

```
runsc version release-20250210.0-dirty
spec: 1.1.0-rc.1
```

`-dirty` because it includes https://github.com/google/gvisor/pull/11482 on behalf of https://github.com/google/gvisor/issues/11425

docker version (if using docker)
uname

```
Linux sandboxing-tpu-0 6.1.100+ #1 SMP PREEMPT_DYNAMIC Sat Aug 24 16:19:44 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
```
kubectl (if using Kubernetes)
repo state (if built from source)
No response