TPU resources consumed by since-completed processes #11545

@isaacwr

Description

When accessing a TPU device from within a container, at most one process can access the TPU device for the lifetime of the sandbox. All subsequent processes fail with the error below, even if the original process has completed.

Error Message

This is the error returned by jax:

RuntimeError: Unable to initialize backend 'tpu': UNKNOWN: TPU initialization failed: open(/dev/vfio/1): Device or resource busy: Device or resource busy; Couldn't open iommu group /dev/vfio/1 (set JAX_PLATFORMS='' to automatically choose an available backend)
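To tell this failure apart from other backend initialization errors in a script, the message can be matched for the busy vfio path. This is a minimal sketch; the function name `busy_vfio_device` is hypothetical, and the pattern assumes the message keeps the exact wording shown above.

```python
import re
from typing import Optional

# Matches the "open(/dev/vfio/N): Device or resource busy" fragment of the
# error message. Assumption: the wording stays as reported in this issue.
_VFIO_BUSY = re.compile(r"open\((/dev/vfio/[^)]+)\): Device or resource busy")


def busy_vfio_device(message: str) -> Optional[str]:
    """Return the vfio path reported as busy in `message`, or None."""
    match = _VFIO_BUSY.search(message)
    return match.group(1) if match else None
```

Calling it on `str(exc)` inside an `except RuntimeError as exc:` block around `jax.devices()` would yield `/dev/vfio/1` for the message above.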

Details

The repro script succeeds the first time it runs in a sandbox; every later run fails. Repro script:

import jax

print(jax.devices())

The first time we run it, we receive the correct response:

root@runsc:~# python test.py
[TpuDevice(id=0, process_index=0, coords=(0,0,0), core_on_chip=0), TpuDevice(id=1, process_index=0, coords=(1,0,0), core_on_chip=0), TpuDevice(id=2, process_index=0, coords=(0,1,0), core_on_chip=0), TpuDevice(id=3, process_index=0, coords=(1,1,0), core_on_chip=0)]

The second time, we receive an error:

root@runsc:~# python test.py
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/jax/_src/xla_bridge.py", line 917, in backends
    backend = _init_backend(platform)
  File "/usr/local/lib/python3.10/site-packages/jax/_src/xla_bridge.py", line 1003, in _init_backend
    backend = registration.factory()
  File "/usr/local/lib/python3.10/site-packages/jax/_src/xla_bridge.py", line 151, in tpu_client_timer_callback
    client = xla_client.make_tpu_client(
  File "/usr/local/lib/python3.10/site-packages/jaxlib/xla_client.py", line 212, in make_tpu_client
    return make_tfrt_tpu_c_api_client(options)
  File "/usr/local/lib/python3.10/site-packages/jaxlib/xla_client.py", line 134, in make_tfrt_tpu_c_api_client
    return _xla.get_c_api_client('tpu', options)
jaxlib.xla_extension.XlaRuntimeError: UNKNOWN: TPU initialization failed: open(/dev/vfio/1): Device or resource busy: Device or resource busy; Couldn't open iommu group /dev/vfio/1
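The two runs above can be automated so the pass-then-fail pattern is captured in one go. This is a sketch, not part of the original report; `run_repeatedly` is a hypothetical helper, and it assumes the repro script is saved as `test.py`.

```python
import subprocess


def run_repeatedly(cmd, runs=2):
    """Run `cmd` back to back and return (exit code, last stderr line) per run."""
    results = []
    for _ in range(runs):
        # Each run is a fresh process, so the previous one has fully exited
        # before the next one tries to open the device.
        proc = subprocess.run(cmd, capture_output=True, text=True)
        lines = proc.stderr.strip().splitlines()
        results.append((proc.returncode, lines[-1] if lines else ""))
    return results
```

Inside the sandbox, `run_repeatedly(["python", "test.py"])` would be expected to show exit code 0 for the first run and a non-zero code with the `XlaRuntimeError` line for the second, if the behavior described here reproduces.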

However, the first process is no longer running when we look for it, so whatever is holding the vfio resource appears to be outside the container:

root@runsc:~# ps auxf
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root       120  0.0  0.0  14120  6484 ?        Ss   20:31   0:00 bash
root      1331  0.0  0.0  16716  9428 ?        R    20:54   0:00  \_ ps auxf

Additionally

  • This error does not reproduce on the same host, outside of gvisor.
  • After accessing the tpu device, lsof shows ~60 vfio-device objects, all belonging to a list of [exe] processes, none of which are visible from within the container.
  • We have tried running with -platform=kvm, but container creation errored out, seemingly because our host does not support it.
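The `lsof` observation can be approximated without `lsof` by walking `/proc/<pid>/fd` and reading each link target. The sketch below is an assumption-laden helper (`vfio_holders` is a hypothetical name): it matches any target containing `vfio`, which should cover both `/dev/vfio/N` group files and anonymous `vfio-device` inodes, and it only sees processes whose `fd` directory is readable by the caller.

```python
import os


def vfio_holders(proc_root="/proc", needle="vfio"):
    """Map pid -> sorted open-file targets containing `needle`."""
    holders = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or permission denied
        targets = set()
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # fd closed between listdir and readlink
            if needle in target:
                targets.add(target)
        if targets:
            holders[int(entry)] = sorted(targets)
    return holders
```

Run on the host (typically as root) after the first repro run, a non-empty result would list the processes still holding vfio files; inside the sandbox it would only show sandboxed processes.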

This means that, once any process has used the TPUs, no other process is able to use them, even if the original process has completed. This is making in-container TPUs nearly impossible to use. Is it possible to work around or address this issue?

Steps to reproduce

We are running the sandbox with the -tpuproxy flag, and we've mounted the /dev/vfio devices in the container with this piece of config:

        config_dict["linux"]["devices"] = [
            {
                "path": "/dev/vfio/0",
                "type": "c",
                "major": 245,
                "minor": 1,
                "fileMode": 438,
                "uid": 0,
                "gid": 0,
            },
            {
                "path": "/dev/vfio/1",
                "type": "c",
                "major": 245,
                "minor": 2,
                "fileMode": 438,
                "uid": 0,
                "gid": 0,
            },
            {
                "path": "/dev/vfio/2",
                "type": "c",
                "major": 245,
                "minor": 3,
                "fileMode": 438,
                "uid": 0,
                "gid": 0,
            },
            {
                "path": "/dev/vfio/3",
                "type": "c",
                "major": 245,
                "minor": 0,
                "fileMode": 438,
                "uid": 0,
                "gid": 0,
            },
            {
                "path": "/dev/vfio/vfio",
                "type": "c",
                "major": 10,
                "minor": 196,
                "fileMode": 438,
                "uid": 0,
                "gid": 0,
            },
        ]
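The major and minor numbers above are hardcoded, and they can differ between hosts. As an alternative sketch (not what this report used), each entry could be derived from `os.stat` on the host device node. `vfio_device_entry` is a hypothetical helper; the `st` parameter exists only so the function can be exercised without real devices.

```python
import os
import stat


def vfio_device_entry(path, st=None):
    """Build one OCI `linux.devices` entry for the character device at `path`."""
    st = st if st is not None else os.stat(path)
    if not stat.S_ISCHR(st.st_mode):
        raise ValueError(f"{path} is not a character device")
    return {
        "path": path,
        "type": "c",
        "major": os.major(st.st_rdev),
        "minor": os.minor(st.st_rdev),
        "fileMode": 0o666,  # equals 438 decimal, matching the config above
        "uid": 0,
        "gid": 0,
    }


# Possible usage, assuming /dev/vfio exists on the host:
#   import glob
#   config_dict["linux"]["devices"] = [
#       vfio_device_entry(p) for p in sorted(glob.glob("/dev/vfio/*"))
#   ]
```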

runsc version

runsc version release-20250210.0-dirty
spec: 1.1.0-rc.1


The version is `-dirty` because the build includes https://github.com/google/gvisor/pull/11482, applied for https://github.com/google/gvisor/issues/11425.

docker version (if using docker)

uname

Linux sandboxing-tpu-0 6.1.100+ #1 SMP PREEMPT_DYNAMIC Sat Aug 24 16:19:44 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

kubectl (if using Kubernetes)

repo state (if built from source)

No response

runsc debug logs (if available)

Labels

type: bug (Something isn't working)
