🐛 Describe the bug

Almost all HuggingFace models in the accuracy benchmark suite (inference mode) fail with a stride-mismatch assertion in PyTorch's XPU backend for the scaled dot-product attention (SDPA) operation:
```
Traceback (most recent call last):
  File "C:\Users\gta\repositories\pytorch\pytorch\benchmarks\dynamo\common.py", line 2379, in check_accuracy
    new_result = self.run_n_iterations(
                 ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\gta\repositories\pytorch\pytorch\benchmarks\dynamo\common.py", line 2077, in run_n_iterations
    model_iter_fn(mod, inputs, collect_outputs=False)
  File "C:\Users\gta\miniforge3\envs\202604010405_4e67aacf_32.0.101.8626_2025.3.2.61\Lib\site-packages\torch\_dynamo\eval_frame.py", line 1036, in compile_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\gta\repositories\pytorch\pytorch\benchmarks\dynamo\huggingface.py", line 554, in forward_pass
    def forward_pass(self, mod, inputs, collect_outputs=True):
  File "C:\Users\gta\miniforge3\envs\202604010405_4e67aacf_32.0.101.8626_2025.3.2.61\Lib\site-packages\torch\_dynamo\eval_frame.py", line 1272, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\gta\miniforge3\envs\202604010405_4e67aacf_32.0.101.8626_2025.3.2.61\Lib\site-packages\torch\_functorch\aot_autograd.py", line 1186, in forward
    return compiled_fn(full_args)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\gta\miniforge3\envs\202604010405_4e67aacf_32.0.101.8626_2025.3.2.61\Lib\site-packages\torch\_functorch\_aot_autograd\runtime_wrappers.py", line 767, in runtime_wrapper
    all_outs = compiled_invoker.run(args, on_before_call=exit_prologue)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\gta\miniforge3\envs\202604010405_4e67aacf_32.0.101.8626_2025.3.2.61\Lib\site-packages\torch\_functorch\_aot_autograd\runtime_wrappers.py", line 513, in run
    return call_func_at_runtime_with_args(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\gta\miniforge3\envs\202604010405_4e67aacf_32.0.101.8626_2025.3.2.61\Lib\site-packages\torch\_functorch\_aot_autograd\utils.py", line 126, in call_func_at_runtime_with_args
    out = normalize_as_list(f(args))
                            ^^^^^^^
  File "C:\Users\gta\miniforge3\envs\202604010405_4e67aacf_32.0.101.8626_2025.3.2.61\Lib\site-packages\torch\_functorch\_aot_autograd\runtime_wrappers.py", line 840, in wrapper
    return compiled_fn(runtime_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\gta\miniforge3\envs\202604010405_4e67aacf_32.0.101.8626_2025.3.2.61\Lib\site-packages\torch\_functorch\_aot_autograd\runtime_wrappers.py", line 1044, in inner_fn
    outs = compiled_fn(args)
           ^^^^^^^^^^^^^^^^^
  File "C:\Users\gta\miniforge3\envs\202604010405_4e67aacf_32.0.101.8626_2025.3.2.61\Lib\site-packages\torch\_inductor\output_code.py", line 682, in __call__
    return self.current_callable(inputs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\gta\miniforge3\envs\202604010405_4e67aacf_32.0.101.8626_2025.3.2.61\Lib\site-packages\torch\_inductor\utils.py", line 3459, in run
    out = model(new_inputs)
          ^^^^^^^^^^^^^^^^^
  File "C:\Users\gta\AppData\Local\Temp\torchinductor_gta\2i\c2ijxofv4zv6scovbk6xqyo7xb5lx4g3ofjg5yuax4p24dh4smau.py", line 2567, in call
    (buf464, buf467, buf471) = self.partitions[0](partition0_args)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\gta\AppData\Local\Temp\torchinductor_gta\2i\c2ijxofv4zv6scovbk6xqyo7xb5lx4g3ofjg5yuax4p24dh4smau.py", line 1559, in partition_0
    assert_size_stride(buf19, (1, 64, 512, 64), (2097152, 32768, 64, 1), 'torch.ops.aten._scaled_dot_product_fused_attention_overrideable.default')
AssertionError: expected size 64==64, stride 64==32768 at dim=1; expected size 512==512, stride 4096==64 at dim=2
Error in op: torch.ops.aten._scaled_dot_product_fused_attention_overrideable.default
This error most often comes from a incorrect fake (aka meta) kernel for a custom op.
Use torch.library.opcheck to test your custom op.
See https://pytorch.org/docs/stable/library.html#torch.library.opcheck
TorchDynamo optimized model failed to run because of following error
fail_to_run
```
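The mismatched strides are suggestive. The strides the assertion expects, `(2097152, 32768, 64, 1)`, are those of a fully contiguous `(batch, heads, seq_len, head_dim)` tensor, while the strides it observes (`64` at dim 1, `4096` at dim 2) are exactly what a contiguous `(batch, seq_len, heads, head_dim)` buffer has after swapping dims 1 and 2. A plausible reading (my analysis, not confirmed in this report) is that the fake/meta kernel for `_scaled_dot_product_fused_attention_overrideable` declares a contiguous output while the real XPU kernel returns a transposed layout. A quick stride check, assuming row-major layout:

```python
def contiguous_strides(shape):
    """Row-major (C-contiguous) strides for a shape, in elements."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return tuple(strides)

# Strides the assert expects: a contiguous (1, 64, 512, 64) buffer.
assert contiguous_strides((1, 64, 512, 64)) == (2097152, 32768, 64, 1)

# Strides the runtime buffer actually has: a contiguous (1, 512, 64, 64)
# buffer with dims 1 and 2 swapped, i.e. a transpose(1, 2) view.
s = contiguous_strides((1, 512, 64, 64))          # (2097152, 4096, 64, 1)
observed = (s[0], s[2], s[1], s[3])               # (2097152, 64, 4096, 1)
assert observed[1] == 64 and observed[2] == 4096  # matches the error message
```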
Versions
```
PyTorch version: 2.12.0a0+git4e67aac
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A
OS: Microsoft Windows 11 Pro (10.0.26100 64-bit)
GCC version: Could not collect
Clang version: Could not collect
CMake version: version 3.31.6
Libc version: N/A
Python version: 3.12.13 | packaged by conda-forge | (main, Mar 5 2026, 16:36:12) [MSC v.1944 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-11-10.0.26100-SP0
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Is XPU available: True
XPU used to build PyTorch: 20250302
Intel GPU driver version:
Intel GPU models onboard:
Intel GPU models detected:
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Caching allocator config: N/A

CPU:
Name: 13th Gen Intel(R) Core(TM) i5-13400
Manufacturer: GenuineIntel
Family: 205
Architecture: 9
ProcessorType: 3
DeviceID: CPU0
CurrentClockSpeed: 2500
MaxClockSpeed: 2500
L2CacheSize: 9728
L2CacheSpeed: None
Revision: None

Versions of relevant libraries:
[pip3] intel-openmp==2025.3.3
[pip3] mkl-include==2025.3.1
[pip3] mkl-static==2025.3.1
[pip3] mypy==1.20.0
[pip3] mypy_extensions==1.1.0
[pip3] numpy==1.26.2
[pip3] onemkl-license==2025.3.1
[pip3] onnx==1.20.0
[pip3] onnx-ir==0.1.16
[pip3] onnxscript==0.6.2
[pip3] optree==0.13.0
[pip3] pytorch-labs-segment-anything-fast==0.2
[pip3] tbb==2022.3.1
[pip3] tbb-devel==2022.3.1
[pip3] tcmlib==1.4.1
[pip3] torch==2.12.0a0+git4e67aac
[pip3] torch_geometric==2.4.0
[pip3] torchao==0.17.0
[pip3] torchaudio==2.11.0a0+c0cbdb9
[pip3] torchbench==0.1
[pip3] torchmetrics==1.9.0
[pip3] torchmultimodal==0.1.0b0
[pip3] torchrec-nightly==2022.4.26
[pip3] torchtext==0.17.0a0+a5e6106
[pip3] torchvision==0.27.0a0+9bf794d
[pip3] torchx-nightly==2026.3.31
[pip3] triton-xpu==3.7.0+git33f782ef
[conda] intel-openmp 2025.3.3 pypi_0 pypi
[conda] mkl-include 2025.3.1 pypi_0 pypi
[conda] mkl-static 2025.3.1 pypi_0 pypi
[conda] numpy 1.26.2 pypi_0 pypi
[conda] onemkl-license 2025.3.1 pypi_0 pypi
[conda] optree 0.13.0 pypi_0 pypi
[conda] pytorch-labs-segment-anything-fast 0.2 pypi_0 pypi
[conda] tbb 2022.3.1 pypi_0 pypi
[conda] tbb-devel 2022.3.1 pypi_0 pypi
[conda] tcmlib 1.4.1 pypi_0 pypi
[conda] torch 2.12.0a0+git4e67aac pypi_0 pypi
[conda] torch-geometric 2.4.0 pypi_0 pypi
[conda] torchao 0.17.0 pypi_0 pypi
[conda] torchaudio 2.11.0a0+c0cbdb9 pypi_0 pypi
[conda] torchbench 0.1 pypi_0 pypi
[conda] torchmetrics 1.9.0 pypi_0 pypi
[conda] torchmultimodal 0.1.0b0 pypi_0 pypi
[conda] torchrec-nightly 2022.4.26 pypi_0 pypi
[conda] torchtext 0.17.0a0+a5e6106 pypi_0 pypi
[conda] torchvision 0.27.0a0+9bf794d pypi_0 pypi
[conda] torchx-nightly 2026.3.31 pypi_0 pypi
[conda] triton-xpu 3.7.0+git33f782ef pypi_0 pypi
```