
[DirectML] Divergent results between CPU and DirectML only on Intel Integrated GPU #362

@apleynes

Issue

On a Windows laptop with an Intel iGPU and a discrete NVIDIA GPU (NVIDIA Optimus), results differ between CPU and DirectML running on the Intel iGPU. CPU and DirectML running on the NVIDIA GPU match within 1e-5 tolerance.

Originally discovered using ONNX Runtime's DirectML EP (issue opened here: microsoft/onnxruntime#14214), but further testing showed that torch-directml is affected as well, which indicates the issue is in the underlying DirectML backend.

Using Python 3.9, torch 1.13.1, and torch-directml 0.1.13.dev221216, installed from PyPI.

System information:
Intel Core i7-11800H with Intel UHD Graphics (Tiger Lake GT1)
NVIDIA RTX 3060 Laptop GPU
The laptop uses NVIDIA Optimus to switch between the Intel iGPU and the discrete NVIDIA GPU

Platform: Windows

OS Version: Windows 11 21H2 Build 22000.1335

ONNX Runtime Version or Commit ID
1.13.1

ONNX Runtime API
Python

Architecture
X64

Execution Provider
DirectML

Execution Provider Library Version
onnxruntime-directml 1.13.1

To Reproduce

Run the following code, updating the device indices to match the adapter ordering on your machine:

import os
# Allow duplicate OpenMP runtimes; works around a common libiomp5 conflict on Windows
os.environ['KMP_DUPLICATE_LIB_OK'] = "TRUE"
import torch
import torch.nn as nn
import torch.onnx
import torchvision
import onnxruntime as rt
import numpy as np
import matplotlib.pyplot as plt
import torch_directml

# Print devices
print([torch_directml.device_name(d_idx) for d_idx in range(torch_directml.device_count())])
# On my machine, NVIDIA GPU is device 0, Intel iGPU is devices 1 and 2 (not sure why it's showing up twice)

import skimage.data
from torchvision.io.image import read_image
from torchvision.models.segmentation import fcn_resnet50, FCN_ResNet50_Weights
from torchvision.transforms.functional import to_pil_image

# skimage.data.cat() is an HxWx3 uint8 array; .T reverses the axes to
# channels-first (3xWxH). The H/W swap does not matter for this comparison.
img = torch.from_numpy(skimage.data.cat().T)

## Run using CPU
# Step 1: Initialize model with the best available weights
weights = FCN_ResNet50_Weights.DEFAULT
model = fcn_resnet50(weights=weights)
model.to('cpu')
model.eval()

# Step 2: Initialize the inference transforms
preprocess = weights.transforms()

# Step 3: Apply inference preprocessing transforms
batch = preprocess(img).unsqueeze(0)

# Step 4: Use the model and visualize the prediction
with torch.no_grad():
    prediction_cpu = model(batch)["out"][0, 8]  # Class 8 is cat


## Run using NVIDIA GPU
# Step 1: Initialize model with the best available weights
weights = FCN_ResNet50_Weights.DEFAULT
model = fcn_resnet50(weights=weights)
device = torch_directml.device(0)  # !! Update device number here
model.to(device)
model.eval()

# Step 2: Initialize the inference transforms
preprocess = weights.transforms()

# Step 3: Apply inference preprocessing transforms
batch = preprocess(img).unsqueeze(0)

# Step 4: Use the model and visualize the prediction
with torch.no_grad():
    prediction_nvidia_gpu = model(batch.to(device))["out"].to('cpu')[0, 8]

## Run using Intel iGPU
# Step 1: Initialize model with the best available weights
weights = FCN_ResNet50_Weights.DEFAULT
model = fcn_resnet50(weights=weights)
device = torch_directml.device(1)  # !! Update device number here
model.to(device)
model.eval()

# Step 2: Initialize the inference transforms
preprocess = weights.transforms()

# Step 3: Apply inference preprocessing transforms
batch = preprocess(img).unsqueeze(0)

# Step 4: Use the model and visualize the prediction
with torch.no_grad():
    prediction_intel_igpu = model(batch.to(device))["out"].to('cpu')[0, 8]


# Print numerical comparisons
print(np.isclose(prediction_cpu.detach().numpy(), prediction_intel_igpu.detach().numpy(), atol=1e-4).all())  # Returns False, only returns True around atol=1e1
print(np.isclose(prediction_cpu.detach().numpy(), prediction_nvidia_gpu.detach().numpy(), atol=1e-5).all())  # Returns True

# Display image results
plt.figure()
plt.subplot(1, 4, 1)
plt.imshow(np.transpose(img, [1, 2, 0]))
plt.subplot(1, 4, 2)
plt.imshow(np.squeeze(prediction_cpu.detach().numpy()))
plt.colorbar()
plt.subplot(1, 4, 3)
plt.imshow(np.squeeze(prediction_intel_igpu.detach().numpy()))
plt.colorbar()
plt.subplot(1, 4, 4)
plt.imshow(np.squeeze(prediction_nvidia_gpu.detach().numpy()))
plt.colorbar()
# Note that colorbar scales will be different
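
To quantify the divergence beyond a pass/fail tolerance check, the worst-case elementwise difference against the CPU reference can also be printed (a small sketch using the prediction tensors from the script above):

# Worst-case elementwise divergence against the CPU reference.
pred_cpu = prediction_cpu.detach().numpy()
print("max |CPU - Intel iGPU|:", np.abs(pred_cpu - prediction_intel_igpu.detach().numpy()).max())
print("max |CPU - NVIDIA GPU|:", np.abs(pred_cpu - prediction_nvidia_gpu.detach().numpy()).max())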

Other models exhibit the same divergence on the Intel iGPU.
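
For completeness, the ONNX Runtime path from the original report can be checked the same way. A minimal sketch, assuming onnxruntime-directml is installed and the script above has been run; the export path, the opset version, the output indexing (the exporter flattens the model's dict output, so index 0 is assumed to be "out"), and the device_id value for the DML execution provider are assumptions to adjust for your machine:

# Export the CPU model and compare ONNX Runtime's CPU and DirectML EPs.
onnx_path = "fcn_resnet50.onnx"  # hypothetical export path
torch.onnx.export(model.to('cpu'), batch, onnx_path,
                  input_names=["input"], opset_version=13)

so = rt.SessionOptions()
so.enable_mem_pattern = False  # recommended when using the DML EP

sess_cpu = rt.InferenceSession(onnx_path, so, providers=["CPUExecutionProvider"])
sess_dml = rt.InferenceSession(onnx_path, so,
                               providers=[("DmlExecutionProvider", {"device_id": 1})])  # iGPU index on my machine

out_cpu = sess_cpu.run(None, {"input": batch.numpy()})[0]
out_dml = sess_dml.run(None, {"input": batch.numpy()})[0]
print(np.isclose(out_cpu, out_dml, atol=1e-4).all())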

Labels

hardware (issue is likely hardware-specific or driver version related)
onnx-runtime (issues that affect ONNX Runtime's DML EP)
pytorch-directml (issues in PyTorch when using its DirectML backend)