[GSD-12641] USM device allocation fails with UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY when Level Zero context spans 2 discrete Arc A770 GPUs (regression from 25.40 to 26.09) #916
Note: i915 is loaded but has 0 uses (force-probed away from DG2 via i915.force_probe=!56a0). xe is the active KMD for both Arc A770 cards.
Actual Behavior
When running any PyTorch XPU workload (torch 2.11.0+xpu) with 2 discrete Intel Arc A770 GPUs present, the Unified Runtime (UR) layer creates a Level Zero context spanning both devices (urContextCreate(.DeviceCount = 2)), then immediately fails on urUSMDeviceAlloc with UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY, despite each GPU having 16 GiB of free VRAM.
The Python exception raised is:
RuntimeError: could not create a memory buffer of 2097152 bytes
As a result, the application cannot allocate any tensor on the XPU device.
Expected Behavior
USM device memory allocation should succeed when a Level Zero context spans 2 discrete root devices, as it did with intel-compute-runtime 25.40.35563.4. Each GPU has 16 GiB of VRAM and the allocation request is only 2 MiB.
Reproduction Rate
Always reproduces - 100%
Steps to Reproduce
Install intel-compute-runtime 26.09.37435.1 on Arch Linux
Install PyTorch with XPU support: pip install torch==2.11.0+xpu --index-url https://download.pytorch.org/whl/xpu
Ensure 2 discrete Intel Arc A770 GPUs are present and enumerated (both visible in /dev/dri/)
Run the following Python script:
import torch
print(torch.xpu.device_count())  # prints 2
x = torch.zeros(1, device='xpu:0')  # FAILS here
Observe RuntimeError: could not create a memory buffer of 2097152 bytes
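As a pre-check for step 3, the DRM nodes can be listed without importing torch (a sketch; card numbering and the presence of an integrated GPU vary by system):

```python
import glob

# Two discrete A770s typically appear as two cardN plus two renderDNNN
# nodes under /dev/dri (an iGPU, if present, adds another pair).
nodes = sorted(glob.glob("/dev/dri/*"))
print(nodes)
```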
Workaround: Set export ONEAPI_DEVICE_SELECTOR=level_zero:0 before running — this restricts UR to a single-device context and allocations succeed.
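The same restriction can be applied from inside a Python script, as long as the variable is set before torch initializes the Level Zero runtime (a sketch of the workaround above; only the environment handling is shown):

```python
import os

# Restrict UR to a single Level Zero device *before* importing torch;
# the selector is read when the runtime initializes.
os.environ["ONEAPI_DEVICE_SELECTOR"] = "level_zero:0"

# import torch  # safe to import now; the UR context will be single-device
```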
Is this a regression?
Yes, this is a regression - functionality that previously worked is now broken
Last Known Working Driver Version
25.40.35563.4
First Known Failing Driver Version
26.09.37435.1
API Call Logs
UR trace captured using UR_LOG_TRACING=output:file:minimal.log;level:info (binary format, key excerpts below):
Key observation: context is created with DeviceCount=2 (both Arc A770s). Every subsequent USM allocation fails immediately. With ONEAPI_DEVICE_SELECTOR=level_zero:0, context is DeviceCount=1 and all allocations succeed.
Suspected root cause: commit 7bb2d32 "performance: enable l0 device usm growing pools" (merged ~Oct 2025) changed useUsmPoolManager default from false to true. The growing pool allocator appears to fail when the L0 context spans multiple root devices.
strace Logs
No response
System Logs / dmesg Output
No GPU errors or warnings in dmesg. Both cards load cleanly:
Backtrace (if crash or hang occurred)
No response
Source Code / Reproducer
Minimal Python reproducer (requires torch 2.11.0+xpu and 2 discrete Intel Arc GPUs):
import torch

# This will print 2 with 2 Arc A770 cards
print(f"XPU device count: {torch.xpu.device_count()}")

# This fails with UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY under 26.09
# Works fine under 25.40 or with ONEAPI_DEVICE_SELECTOR=level_zero:0
x = torch.zeros(1, device='xpu:0')
print("Success:", x)
# Reproduces with:
python reproducer.py
# Does NOT reproduce with (workaround):
ONEAPI_DEVICE_SELECTOR=level_zero:0 python reproducer.py
oneAPI Version (if applicable)
No response
Screenshots / Video
No response
Additional Notes
The bug is specific to having 2 or more discrete root devices in the Level Zero context. With a single GPU (or with ONEAPI_DEVICE_SELECTOR restricting to one), all allocations succeed.
The suspected culprit commit is 7bb2d32 ("performance: enable l0 device usm growing pools"), which enables useUsmPoolManager=true by default and was introduced between 25.40 and 26.09. The compute-runtime debug variables NEO_EnableUsmAllocationPoolManager=0 and NEO_EnableDeviceUsmAllocationPool=0 disable this pooling and can be used to test the suspicion.
Context: this is triggered by PyTorch XPU (torch 2.11.0+xpu), which uses the UR adapter libur_adapter_level_zero.so.0 bundled with PyTorch (UR 0.11.x). The UR adapter itself calls into intel-compute-runtime's Level Zero implementation.
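The pooling suspicion can be A/B tested by relaunching the reproducer with the pool disabled (a sketch; NEO_EnableUsmAllocationPoolManager and NEO_EnableDeviceUsmAllocationPool are assumed compute-runtime debug-variable names, and reproducer.py stands in for the script from Steps to Reproduce):

```python
import os
import sys

# Build a relaunch command with compute-runtime's USM pooling disabled;
# if the growing pool is the culprit, the 2 MiB allocation should then
# succeed on the dual-A770 system.
env = dict(os.environ)
env["NEO_EnableUsmAllocationPoolManager"] = "0"  # assumed debug-variable name
env["NEO_EnableDeviceUsmAllocationPool"] = "0"   # assumed debug-variable name

cmd = [sys.executable, "reproducer.py"]  # the script from Steps to Reproduce
# On the affected machine: subprocess.run(cmd, env=env)
print("would run:", cmd)
```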
Pre-submission Checklist
GPU Hardware
2× Intel Arc A770 (DG2, rev 08), 16 GiB each
DRI Devices Information
GPU Detailed Information (lspci output)
Driver Version
26.09.37435.1
Installed GPU Driver Packages
Driver Installation Details
sudo pacman -S intel-compute-runtime
Linux Distribution
Arch Linux
Other Linux Distribution
No response
Kernel Version & Boot Parameters
xe.force_probe=56a0 i915.force_probe=!56a0