
[GSD-12641] USM device allocation fails with UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY when Level Zero context spans 2 discrete Arc A770 GPUs (regression from 25.40 to 26.09) #916

@fradav

Description


Pre-submission Checklist

  • I am using the latest GPU driver version (releases)
  • I have searched for similar issues and found none

GPU Hardware

2× Intel Arc A770 (DG2, rev 08), 16 GiB each

DRI Devices Information

$ ls -ls /dev/dri/*
0 crw-rw---- 1 root video  226,   0 Apr 17 11:07 /dev/dri/card0
0 crw-rw---- 1 root video  226,   1 Apr 17 11:07 /dev/dri/card1
0 crw-rw-rw- 1 root render 226, 128 Apr 17 11:07 /dev/dri/renderD128
0 crw-rw-rw- 1 root render 226, 129 Apr 17 11:07 /dev/dri/renderD129

$ ls -la /dev/dri/by-path/
total 0
lrwxrwxrwx 1 root root  8 Apr 17 11:07 pci-0000:0d:00.0-card -> ../card0
lrwxrwxrwx 1 root root 13 Apr 17 11:07 pci-0000:0d:00.0-render -> ../renderD128
lrwxrwxrwx 1 root root  8 Apr 17 11:07 pci-0000:11:00.0-card -> ../card1
lrwxrwxrwx 1 root root 13 Apr 17 11:07 pci-0000:11:00.0-render -> ../renderD129

GPU Detailed Information (lspci output)

$ sudo lspci -vvv -k -s 0000:0d:00.0
0d:00.0 VGA compatible controller: Intel Corporation DG2 [Arc A770] (rev 08) (prog-if 00 [VGA controller])
        Subsystem: Intel Corporation Device 1020
        Region 0: Memory at fb000000 (64-bit, non-prefetchable) [size=16M]
        Region 2: Memory at f800000000 (64-bit, prefetchable) [size=16G]
        LnkCap: Port #0, Speed 2.5GT/s, Width x1
        LnkSta: Speed 2.5GT/s, Width x1
        Kernel driver in use: xe
        Kernel modules: i915, xe

$ sudo lspci -vvv -k -s 0000:11:00.0
11:00.0 VGA compatible controller: Intel Corporation DG2 [Arc A770] (rev 08) (prog-if 00 [VGA controller])
        Subsystem: Intel Corporation Device 1020
        Region 0: Memory at f9000000 (64-bit, non-prefetchable) [size=16M]
        Region 2: Memory at f000000000 (64-bit, prefetchable) [size=16G]
        LnkCap: Port #0, Speed 2.5GT/s, Width x1
        LnkSta: Speed 2.5GT/s, Width x1
        Kernel driver in use: xe
        Kernel modules: i915, xe

Driver Version

26.09.37435.1

Installed GPU Driver Packages

$ pacman -Q | grep -iE "intel|level.zero|igc|gmmlib|opencl|compute-runtime|ze"
intel-compute-runtime 26.09.37435.1-1
intel-gmmlib 22.9.0-2
intel-graphics-compiler-bin 1:2.30.1-1
intel-media-driver 25.4.6-1
intel-metee 6.2.1-1
intel-oneapi-basekit-2025 2025.3.1-6
intel-xpumanager-bin 1.3.5-1
level-zero-headers 1.28.0-1
level-zero-loader 1.28.0-1
vulkan-intel 1:26.0.4-1

Driver Installation Details

  • Installation method: Arch Linux official extra repository (pacman)
  • Command: sudo pacman -S intel-compute-runtime
  • Repository: Arch extra (https://archlinux.org/packages/extra/x86_64/intel-compute-runtime/)
  • No custom kernel parameters for the driver itself; xe driver forced via boot params: xe.force_probe=56a0 i915.force_probe=!56a0

Linux Distribution

Arch Linux

Other Linux Distribution

No response

Kernel Version & Boot Parameters

$ uname -r
6.19.11-arch1-1

$ cat /proc/cmdline
initrd=\initramfs-linux.img root=UUID=c68104cc-894e-4ccb-a8dd-d0fd2275a9a3 resume=UUID=8d90b58e-8569-45e5-8d71-15d924a7bc74 rw i915.force_probe=!56a0 xe.force_probe=56a0

$ lsmod | grep -E 'i915|xe'
xe                   4227072  6
intel_vsec             28672  1 xe
drm_ttm_helper         20480  1 xe
drm_suballoc_helper    16384  1 xe
gpu_sched              69632  1 xe
drm_gpuvm              57344  1 xe
drm_exec               12288  2 drm_gpuvm,xe
drm_gpusvm_helper      40960  1 xe
i915                 4943872  0
drm_buddy              32768  2 xe,i915
ttm                   126976  3 drm_ttm_helper,xe,i915
drm_display_helper    286720  2 xe,i915

Note: i915 is loaded but has 0 uses (force-probed away from DG2 via i915.force_probe=!56a0). xe is the active KMD for both Arc A770 cards.

Actual Behavior

When running any PyTorch XPU workload (torch 2.11.0+xpu) with 2 discrete Intel Arc A770 GPUs present, the Unified Runtime (UR) layer creates a Level Zero context spanning both devices (urContextCreate(.DeviceCount = 2)), then immediately fails on urUSMDeviceAlloc with UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY, despite each GPU having 16 GiB of free VRAM.

The Python exception raised is:

RuntimeError: could not create a memory buffer of 2097152 bytes

As a result, the application is completely unable to allocate any tensor on the XPU device.
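For scale, the failing request is tiny compared to the available VRAM; the quick arithmetic below (sizes taken from the error message and the hardware section above) shows the ratio:

```python
req = 2097152            # bytes, from the RuntimeError message
vram = 16 * 1024**3      # 16 GiB per Arc A770 (hardware section above)

print(req / 2**20)       # 2.0 -> the request is exactly 2 MiB
print(req / vram)        # 0.0001220703125 -> about 0.012% of one GPU's VRAM
```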

Expected Behavior

USM device memory allocation should succeed when a Level Zero context spans 2 discrete root devices, as it did with intel-compute-runtime 25.40.35563.4. Each GPU has 16 GiB of VRAM and the allocation request is only 2 MiB.

Reproduction Rate

Always reproduces - 100%

Steps to Reproduce

  1. Install intel-compute-runtime 26.09.37435.1 on Arch Linux
  2. Install PyTorch with XPU support: pip install torch==2.11.0+xpu --index-url https://download.pytorch.org/whl/xpu
  3. Ensure 2 discrete Intel Arc A770 GPUs are present and enumerated (both visible in /dev/dri/)
  4. Run the following Python script:
import torch
print(torch.xpu.device_count())  # prints 2
x = torch.zeros(1, device='xpu:0')  # FAILS here
  5. Observe RuntimeError: could not create a memory buffer of 2097152 bytes

Workaround: export ONEAPI_DEVICE_SELECTOR=level_zero:0 before running; this restricts UR to a single-device context and all allocations succeed.
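The workaround can also be applied from inside Python, provided it happens before torch initializes the XPU runtime (the selector appears to be read during device enumeration; setting it afterwards has no effect). A minimal sketch, with the torch lines commented out since they require XPU hardware:

```python
import os

# Must be set before importing torch: the selector is read when the
# runtime first enumerates Level Zero devices.
os.environ["ONEAPI_DEVICE_SELECTOR"] = "level_zero:0"

# import torch                        # safe to import only after the line above
# x = torch.zeros(1, device="xpu")    # allocates from a 1-device context
print(os.environ["ONEAPI_DEVICE_SELECTOR"])   # level_zero:0
```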

Is this a regression?

  • Yes, this is a regression - functionality that previously worked is now broken

Last Known Working Driver Version

25.40.35563.4

First Known Failing Driver Version

26.09.37435.1

API Call Logs

UR trace captured with UR_LOG_TRACING='output:file:minimal.log;level:info' (binary format; key excerpts below):

<--- urContextCreate(.DeviceCount = 2, .phDevices = {0x325bd5a0, 0x325bd5c0}, .pProperties = nullptr, .phContext = ...) -> UR_RESULT_SUCCESS;

---> urUSMDeviceAlloc
<--- urUSMDeviceAlloc(.hContext = 0x34921e00, .hDevice = 0x325bd5a0,
     .pUSMDesc = {.align = 512}, .pool = nullptr,
     .size = 2097152, .ppMem = nullptr) -> UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY;

---> urUSMDeviceAlloc
<--- urUSMDeviceAlloc(.hContext = 0x34921e00, .hDevice = 0x325bd5a0,
     .pUSMDesc = {.align = 512}, .pool = nullptr,
     .size = 2097152, .ppMem = nullptr) -> UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY;

Key observation: context is created with DeviceCount=2 (both Arc A770s). Every subsequent USM allocation fails immediately. With ONEAPI_DEVICE_SELECTOR=level_zero:0, context is DeviceCount=1 and all allocations succeed.

Suspected root cause: commit 7bb2d32 "performance: enable l0 device usm growing pools" (merged ~Oct 2025) changed useUsmPoolManager default from false to true. The growing pool allocator appears to fail when the L0 context spans multiple root devices.
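To make the suspected failure mode concrete, here is a purely illustrative Python model (NOT the actual NEO/compute-runtime code; all names and sizes are invented): if a growing pool reserves its initial chunk once per device in the context but budgets against a single-device capacity, a 2-device context exhausts the budget before the first user allocation.

```python
class GrowingPool:
    """Toy model of a context-wide USM growing pool (illustrative only)."""
    CHUNK = 32 << 20       # initial chunk reserved per device (invented number)
    CAPACITY = 64 << 20    # budget sized for ONE device: the bug sketch

    def __init__(self, n_devices):
        # Bug sketch: one chunk is reserved per device in the context,
        # but CAPACITY is never scaled by the device count.
        self.reserved = self.CHUNK * n_devices

    def alloc(self, size):
        if self.reserved + size > self.CAPACITY:
            return None    # would surface as UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY
        self.reserved += size
        return bytearray(size)  # stand-in for a device pointer

print(GrowingPool(1).alloc(2 << 20) is not None)  # True:  1-device context succeeds
print(GrowingPool(2).alloc(2 << 20) is not None)  # False: 2-device context fails
```

Whether this is the actual mechanism would need confirmation against the commit; the observable symptom (DeviceCount=1 succeeds, DeviceCount=2 fails for a 2 MiB request) matches this shape.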

strace Logs

No response

System Logs / dmesg Output

No GPU errors or warnings in dmesg. Both cards load cleanly:

$ sudo dmesg | grep -iE 'xe|drm' | grep -iE 'error|warn|fail|memory|OOM'
(no output - clean driver load)

Backtrace (if crash or hang occurred)

No response

Source Code / Reproducer

Minimal Python reproducer (requires torch 2.11.0+xpu and 2 discrete Intel Arc GPUs):

import torch

# This will print 2 with 2 Arc A770 cards
print(f"XPU device count: {torch.xpu.device_count()}")

# This fails with UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY under 26.09
# Works fine under 25.40 or with ONEAPI_DEVICE_SELECTOR=level_zero:0
x = torch.zeros(1, device='xpu:0')
print("Success:", x)

Run with:

pip install torch==2.11.0+xpu --index-url https://download.pytorch.org/whl/xpu
python reproducer.py

For UR-level tracing:

UR_LOG_TRACING='output:file:trace.log;level:info' python reproducer.py
strings trace.log | grep -E "urContextCreate|urUSMDeviceAlloc|UR_RESULT"
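If strings is not available, the same extraction can be sketched in Python (a minimal stand-in for strings | grep; the trace file name and patterns follow the commands above):

```python
import re
import string

# Printable-ASCII characters, excluding line/page breaks inside a run.
PRINTABLE = set(string.printable) - set("\r\n\x0b\x0c")

def readable_runs(blob: bytes, min_len: int = 8):
    """Yield printable runs of at least min_len chars, like `strings`."""
    run = []
    for b in blob:
        ch = chr(b)
        if ch in PRINTABLE:
            run.append(ch)
        else:
            if len(run) >= min_len:
                yield "".join(run)
            run = []
    if len(run) >= min_len:
        yield "".join(run)

pattern = re.compile(r"urContextCreate|urUSMDeviceAlloc|UR_RESULT")

# In practice: blob = open("trace.log", "rb").read()
blob = b"\x00\x01urUSMDeviceAlloc(...) -> UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY;\x00junk"
for line in readable_runs(blob):
    if pattern.search(line):
        print(line)
```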

Command Line / Application Details

# Reproduces with:
python reproducer.py

# Does NOT reproduce with (workaround):
ONEAPI_DEVICE_SELECTOR=level_zero:0 python reproducer.py

oneAPI Version (if applicable)

No response

Screenshots / Video

No response

Additional Notes

  • The bug is specific to having 2 or more discrete root devices in the Level Zero context. With a single GPU (or with ONEAPI_DEVICE_SELECTOR restricting to one), all allocations succeed.
  • The suspected culprit commit is 7bb2d32 ("performance: enable l0 device usm growing pools") which enables useUsmPoolManager=true by default. This was introduced between 25.40 and 26.09.
  • Attempted workarounds that did NOT help (UR trace unchanged in both cases):
    • NEO_EnableUsmAllocationPoolManager=0
    • NEO_EnableDeviceUsmAllocationPool=0
  • Related upstream issue (PyTorch side): "XPU OOM when allocate tensor according to its reported available memory" (pytorch/pytorch#164966)
  • Context: this is triggered by PyTorch XPU (torch 2.11.0+xpu) which uses the UR adapter libur_adapter_level_zero.so.0 bundled with PyTorch (UR 0.11.x). The UR adapter itself calls into intel-compute-runtime's Level Zero implementation.

Metadata

Labels

  • OS: Linux - Issue specific to Linux distributions (Ubuntu, Fedora, RHEL, etc.)
  • Type: Bug - General bug report, unexpected behavior or crash
  • Type: Regression - Previously working functionality is now broken
