
[GSD-12641] USM device allocation fails with UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY when Level Zero context spans 2 discrete Arc A770 GPUs (regression from 25.40 to 26.09) #916

@fradav

Description


Pre-submission Checklist

  • I am using the latest GPU driver version (releases)
  • I have searched for similar issues and found none

GPU Hardware

2× Intel Arc A770 (DG2, rev 08), 16 GiB each

DRI Devices Information

$ ls -ls /dev/dri/*
0 crw-rw---- 1 root video  226,   0 Apr 17 11:07 /dev/dri/card0
0 crw-rw---- 1 root video  226,   1 Apr 17 11:07 /dev/dri/card1
0 crw-rw-rw- 1 root render 226, 128 Apr 17 11:07 /dev/dri/renderD128
0 crw-rw-rw- 1 root render 226, 129 Apr 17 11:07 /dev/dri/renderD129

$ ls -la /dev/dri/by-path/
total 0
lrwxrwxrwx 1 root root  8 Apr 17 11:07 pci-0000:0d:00.0-card -> ../card0
lrwxrwxrwx 1 root root 13 Apr 17 11:07 pci-0000:0d:00.0-render -> ../renderD128
lrwxrwxrwx 1 root root  8 Apr 17 11:07 pci-0000:11:00.0-card -> ../card1
lrwxrwxrwx 1 root root 13 Apr 17 11:07 pci-0000:11:00.0-render -> ../renderD129

GPU Detailed Information (lspci output)

$ sudo lspci -vvv -k -s 0000:0d:00.0
0d:00.0 VGA compatible controller: Intel Corporation DG2 [Arc A770] (rev 08) (prog-if 00 [VGA controller])
        Subsystem: Intel Corporation Device 1020
        Region 0: Memory at fb000000 (64-bit, non-prefetchable) [size=16M]
        Region 2: Memory at f800000000 (64-bit, prefetchable) [size=16G]
        LnkCap: Port #0, Speed 2.5GT/s, Width x1
        LnkSta: Speed 2.5GT/s, Width x1
        Kernel driver in use: xe
        Kernel modules: i915, xe

$ sudo lspci -vvv -k -s 0000:11:00.0
11:00.0 VGA compatible controller: Intel Corporation DG2 [Arc A770] (rev 08) (prog-if 00 [VGA controller])
        Subsystem: Intel Corporation Device 1020
        Region 0: Memory at f9000000 (64-bit, non-prefetchable) [size=16M]
        Region 2: Memory at f000000000 (64-bit, prefetchable) [size=16G]
        LnkCap: Port #0, Speed 2.5GT/s, Width x1
        LnkSta: Speed 2.5GT/s, Width x1
        Kernel driver in use: xe
        Kernel modules: i915, xe

Driver Version

26.09.37435.1

Installed GPU Driver Packages

$ pacman -Q | grep -iE "intel|level.zero|igc|gmmlib|opencl|compute-runtime|ze"
intel-compute-runtime 26.09.37435.1-1
intel-gmmlib 22.9.0-2
intel-graphics-compiler-bin 1:2.30.1-1
intel-media-driver 25.4.6-1
intel-metee 6.2.1-1
intel-oneapi-basekit-2025 2025.3.1-6
intel-xpumanager-bin 1.3.5-1
level-zero-headers 1.28.0-1
level-zero-loader 1.28.0-1
vulkan-intel 1:26.0.4-1

Driver Installation Details

  • Installation method: Arch Linux official extra repository (pacman)
  • Command: sudo pacman -S intel-compute-runtime
  • Repository: Arch extra (https://archlinux.org/packages/extra/x86_64/intel-compute-runtime/)
  • No custom kernel parameters for the driver itself; xe driver forced via boot params: xe.force_probe=56a0 i915.force_probe=!56a0

Linux Distribution

Arch Linux

Other Linux Distribution

No response

Kernel Version & Boot Parameters

$ uname -r
6.19.11-arch1-1

$ cat /proc/cmdline
initrd=\initramfs-linux.img root=UUID=c68104cc-894e-4ccb-a8dd-d0fd2275a9a3 resume=UUID=8d90b58e-8569-45e5-8d71-15d924a7bc74 rw i915.force_probe=!56a0 xe.force_probe=56a0

$ lsmod | grep -E 'i915|xe'
xe                   4227072  6
intel_vsec             28672  1 xe
drm_ttm_helper         20480  1 xe
drm_suballoc_helper    16384  1 xe
gpu_sched              69632  1 xe
drm_gpuvm              57344  1 xe
drm_exec               12288  2 drm_gpuvm,xe
drm_gpusvm_helper      40960  1 xe
i915                 4943872  0
drm_buddy              32768  2 xe,i915
ttm                   126976  3 drm_ttm_helper,xe,i915
drm_display_helper    286720  2 xe,i915

Note: i915 is loaded but has 0 uses (force-probed away from DG2 via i915.force_probe=!56a0). xe is the active KMD for both Arc A770 cards.

Actual Behavior

When running any PyTorch XPU workload (torch 2.11.0+xpu) with 2 discrete Intel Arc A770 GPUs present, the Unified Runtime (UR) layer creates a Level Zero context spanning both devices (urContextCreate(.DeviceCount = 2)), then immediately fails on urUSMDeviceAlloc with UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY, despite each GPU having 16 GiB of free VRAM.

The Python exception raised is:

RuntimeError: could not create a memory buffer of 2097152 bytes

As a result, the application is completely unable to allocate any tensor on the XPU device.
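For scale, the failing request is tiny compared to the available VRAM; the quick arithmetic below (sizes taken from the error message and the hardware section above) shows the ratio:

```python
req = 2097152            # bytes, from the RuntimeError message
vram = 16 * 1024**3      # 16 GiB per Arc A770 (hardware section above)

print(req / 2**20)       # 2.0 -> the request is exactly 2 MiB
print(req / vram)        # 0.0001220703125 -> about 0.012% of one GPU's VRAM
```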

Expected Behavior

USM device memory allocation should succeed when a Level Zero context spans 2 discrete root devices, as it did with intel-compute-runtime 25.40.35563.4. Each GPU has 16 GiB of VRAM and the allocation request is only 2 MiB.

Reproduction Rate

Always reproduces - 100%

Steps to Reproduce

  1. Install intel-compute-runtime 26.09.37435.1 on Arch Linux
  2. Install PyTorch with XPU support: pip install torch==2.11.0+xpu --index-url https://download.pytorch.org/whl/xpu
  3. Ensure 2 discrete Intel Arc A770 GPUs are present and enumerated (both visible in /dev/dri/)
  4. Run the following Python script:
import torch
print(torch.xpu.device_count())  # prints 2
x = torch.zeros(1, device='xpu:0')  # FAILS here
  5. Observe RuntimeError: could not create a memory buffer of 2097152 bytes

Workaround: export ONEAPI_DEVICE_SELECTOR=level_zero:0 before running; this restricts UR to a single-device context and all allocations succeed.
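The workaround can also be applied from inside Python, provided it happens before torch initializes the XPU runtime (the selector appears to be read during device enumeration; setting it afterwards has no effect). A minimal sketch, with the torch lines commented out since they require XPU hardware:

```python
import os

# Must be set before importing torch: the selector is read when the
# runtime first enumerates Level Zero devices.
os.environ["ONEAPI_DEVICE_SELECTOR"] = "level_zero:0"

# import torch                        # safe to import only after the line above
# x = torch.zeros(1, device="xpu")    # allocates from a 1-device context
print(os.environ["ONEAPI_DEVICE_SELECTOR"])   # level_zero:0
```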

Is this a regression?

  • Yes, this is a regression - functionality that previously worked is now broken

Last Known Working Driver Version

25.40.35563.4

First Known Failing Driver Version

26.09.37435.1

API Call Logs

UR trace captured with UR_LOG_TRACING='output:file:minimal.log;level:info' (binary format; key excerpts below):

<--- urContextCreate(.DeviceCount = 2, .phDevices = {0x325bd5a0, 0x325bd5c0}, .pProperties = nullptr, .phContext = ...) -> UR_RESULT_SUCCESS;

---> urUSMDeviceAlloc
<--- urUSMDeviceAlloc(.hContext = 0x34921e00, .hDevice = 0x325bd5a0,
     .pUSMDesc = {.align = 512}, .pool = nullptr,
     .size = 2097152, .ppMem = nullptr) -> UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY;

---> urUSMDeviceAlloc
<--- urUSMDeviceAlloc(.hContext = 0x34921e00, .hDevice = 0x325bd5a0,
     .pUSMDesc = {.align = 512}, .pool = nullptr,
     .size = 2097152, .ppMem = nullptr) -> UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY;

Key observation: context is created with DeviceCount=2 (both Arc A770s). Every subsequent USM allocation fails immediately. With ONEAPI_DEVICE_SELECTOR=level_zero:0, context is DeviceCount=1 and all allocations succeed.

Suspected root cause: commit 7bb2d32 "performance: enable l0 device usm growing pools" (merged ~Oct 2025) changed useUsmPoolManager default from false to true. The growing pool allocator appears to fail when the L0 context spans multiple root devices.
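To make the suspected failure mode concrete, here is a purely illustrative Python model (NOT the actual NEO/compute-runtime code; all names and sizes are invented): if a growing pool reserves its initial chunk once per device in the context but budgets against a single-device capacity, a 2-device context exhausts the budget before the first user allocation.

```python
class GrowingPool:
    """Toy model of a context-wide USM growing pool (illustrative only)."""
    CHUNK = 32 << 20       # initial chunk reserved per device (invented number)
    CAPACITY = 64 << 20    # budget sized for ONE device: the bug sketch

    def __init__(self, n_devices):
        # Bug sketch: one chunk is reserved per device in the context,
        # but CAPACITY is never scaled by the device count.
        self.reserved = self.CHUNK * n_devices

    def alloc(self, size):
        if self.reserved + size > self.CAPACITY:
            return None    # would surface as UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY
        self.reserved += size
        return bytearray(size)  # stand-in for a device pointer

print(GrowingPool(1).alloc(2 << 20) is not None)  # True:  1-device context succeeds
print(GrowingPool(2).alloc(2 << 20) is not None)  # False: 2-device context fails
```

Whether this is the actual mechanism would need confirmation against the commit; the observable symptom (DeviceCount=1 succeeds, DeviceCount=2 fails for a 2 MiB request) matches this shape.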

strace Logs

No response

System Logs / dmesg Output

No GPU errors or warnings in dmesg. Both cards load cleanly:

$ sudo dmesg | grep -iE 'xe|drm' | grep -iE 'error|warn|fail|memory|OOM'
(no output - clean driver load)

Backtrace (if crash or hang occurred)

No response

Source Code / Reproducer

Minimal Python reproducer (requires torch 2.11.0+xpu and 2 discrete Intel Arc GPUs):

import torch

# This will print 2 with 2 Arc A770 cards
print(f"XPU device count: {torch.xpu.device_count()}")

# This fails with UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY under 26.09
# Works fine under 25.40 or with ONEAPI_DEVICE_SELECTOR=level_zero:0
x = torch.zeros(1, device='xpu:0')
print("Success:", x)

Run with:

pip install torch==2.11.0+xpu --index-url https://download.pytorch.org/whl/xpu
python reproducer.py

For UR-level tracing:

UR_LOG_TRACING='output:file:trace.log;level:info' python reproducer.py
strings trace.log | grep -E "urContextCreate|urUSMDeviceAlloc|UR_RESULT"
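If strings is not available, the same extraction can be sketched in Python (a minimal stand-in for strings | grep; the trace file name and patterns follow the commands above):

```python
import re
import string

# Printable-ASCII characters, excluding line/page breaks inside a run.
PRINTABLE = set(string.printable) - set("\r\n\x0b\x0c")

def readable_runs(blob: bytes, min_len: int = 8):
    """Yield printable runs of at least min_len chars, like `strings`."""
    run = []
    for b in blob:
        ch = chr(b)
        if ch in PRINTABLE:
            run.append(ch)
        else:
            if len(run) >= min_len:
                yield "".join(run)
            run = []
    if len(run) >= min_len:
        yield "".join(run)

pattern = re.compile(r"urContextCreate|urUSMDeviceAlloc|UR_RESULT")

# In practice: blob = open("trace.log", "rb").read()
blob = b"\x00\x01urUSMDeviceAlloc(...) -> UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY;\x00junk"
for line in readable_runs(blob):
    if pattern.search(line):
        print(line)
```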

Command Line / Application Details

# Reproduces with:
python reproducer.py

# Does NOT reproduce with (workaround):
ONEAPI_DEVICE_SELECTOR=level_zero:0 python reproducer.py

oneAPI Version (if applicable)

No response

Screenshots / Video

No response

Additional Notes

  • The bug is specific to having 2 or more discrete root devices in the Level Zero context. With a single GPU (or with ONEAPI_DEVICE_SELECTOR restricting to one), all allocations succeed.
  • The suspected culprit commit is 7bb2d32 ("performance: enable l0 device usm growing pools") which enables useUsmPoolManager=true by default. This was introduced between 25.40 and 26.09.
  • Attempted workarounds that did NOT help (UR trace unchanged in both cases):
    • NEO_EnableUsmAllocationPoolManager=0
    • NEO_EnableDeviceUsmAllocationPool=0
  • Related upstream issue (PyTorch side): "XPU OOM when allocate tensor according to its reported available memory" (pytorch/pytorch#164966)
  • Context: this is triggered by PyTorch XPU (torch 2.11.0+xpu) which uses the UR adapter libur_adapter_level_zero.so.0 bundled with PyTorch (UR 0.11.x). The UR adapter itself calls into intel-compute-runtime's Level Zero implementation.

Metadata

Labels

  • OS: Linux - Issue specific to Linux distributions (Ubuntu, Fedora, RHEL, etc.)
  • Type: Bug - General bug report, unexpected behavior or crash
  • Type: Regression - Previously working functionality is now broken
