
torch.Tensor.as_strided yields not the same results after conversion to ONNX with CPUExecutionProvider #13920

Closed
fxmarty opened this issue Dec 9, 2022 · 4 comments



fxmarty commented Dec 9, 2022

Describe the issue

When exporting a very simple PyTorch model that uses a tensor.as_strided() operation, no warning or error is raised during the export.

However, running the exported model with ONNX Runtime produces results that differ from PyTorch. It could be related to limited dynamic shape support.

To reproduce

Define the model:

import torch
import torch.nn as nn

class MyModel3(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x: torch.Tensor):
        a, b, c = x.size()

        strides_original = x.stride()

        shape = (a, b // 2, 4)
        stride = (strides_original[0] // 2, strides_original[1] - 3, 3)
        
        # stride = (x[0][0][0], 4, 3)  <-- if used instead, this will raise an error

        x_strided = x.as_strided(size=shape, stride=stride)

        return x_strided
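For context, as_strided reinterprets the tensor's existing storage: element [i, j, k] of the view reads the flat storage offset i*s0 + j*s1 + k*s2. A minimal sketch of this rule using NumPy (the helper name is mine, not from the issue; note that torch strides are in elements while NumPy strides are in bytes, hence the itemsize scaling):

```python
import numpy as np

# Hypothetical helper: emulate torch.Tensor.as_strided with NumPy.
# torch strides are in *elements*, NumPy strides are in *bytes*.
def as_strided_elements(x, size, stride):
    byte_strides = tuple(s * x.itemsize for s in stride)
    return np.lib.stride_tricks.as_strided(x, shape=size, strides=byte_strides)

x = np.arange(24, dtype=np.int64).reshape(2, 3, 4)  # element strides: (12, 4, 1)
y = as_strided_elements(x, size=(2, 2, 2), stride=(12, 4, 1))
# Element y[i, j, k] reads flat offset i*12 + j*4 + k*1 of x's storage:
print(y[1, 1, 1])  # -> 17 (offset 12 + 4 + 1)
```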

Export to ONNX:

model = MyModel3()
x = torch.randint(8, (50, 30, 15)) + 1
res = model(x)
print(res)

torch.onnx.export(
    model,
    (x,),
    "/home/fxmarty/asstrided_model.onnx",
    input_names=["x"],
    output_names=["x_out"],
    dynamic_axes={"x":  {0: "axis0", 1: "axis1", 2: "axis2"}},
    opset_version=14
)

No warning or error whatsoever is shown during the export.

Then, compare the inference between PyTorch and ONNX Runtime with CPUExecutionProvider:

import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("/home/fxmarty/asstrided_model.onnx", providers=["CPUExecutionProvider"])

inp = {
    "x": np.random.randint(8, size=(45, 56, 29)) + 1,
}

res = session.run(None, inp)

model = MyModel3()
with torch.no_grad():
    res_pt = model(torch.tensor(inp["x"]))

res_ort = res[0]
res_pt_np = res_pt.numpy()

assert res_ort.shape == res_pt_np.shape

diff = np.max(np.abs(res_ort - res_pt_np))
print(f"[x] Maxdiff: {diff}")

Prints:

[x] Maxdiff: 0.9846013188362122

The same issue occurs when exporting with opset 15, 16 or 17.
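For what it's worth, the mismatch is consistent with the stride values being frozen at export time. A quick sketch (my assumed diagnosis, and the helper function is hypothetical) of how the export-time and inference-time stride values diverge between the two shapes used above:

```python
# Hypothetical sketch: compare the stride values the model should use at
# inference time with the ones frozen in at export time.
def contiguous_strides(shape):
    # torch element strides for a contiguous 3-D tensor of shape (a, b, c)
    a, b, c = shape
    return (b * c, c, 1)

s_export = contiguous_strides((50, 30, 15))   # shape passed to torch.onnx.export
s_runtime = contiguous_strides((45, 56, 29))  # shape fed to ONNX Runtime

# The model computes stride = (s0 // 2, s1 - 3, 3):
print((s_export[0] // 2, s_export[1] - 3, 3))    # (225, 12, 3), baked into the graph
print((s_runtime[0] // 2, s_runtime[1] - 3, 3))  # (812, 26, 3), what PyTorch uses
```

If the graph carries the export-time values as constants, every inference on a differently shaped input reads the wrong storage offsets, which matches the large max-diff observed.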


PyTorch version: 1.13.1

@justinchuby https://www.justinchuby.com/torch-onnx-op-matrix/ shows "Broken" support for as_strided — is that related to my issue? In pytorch/pytorch#80039 as_strided is marked as supported, so I'm not sure.


Thanks everyone!

Urgency

mediumish

Platform

Linux

OS Version

Linux 5.15.0-56-generic #62-Ubuntu SMP Tue Nov 22 19:54:14 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.13.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

Default CPU

Execution Provider Library Version

No response

@justinchuby

The support matrix shows an export error, whereas here we are seeing a runtime correctness issue, so the failure detected there may be different from yours:


It is still possible that the exported model is incorrect. Could you also share the exported onnx model here?


fxmarty commented Dec 9, 2022

Oh, I did not know we could click on there. Awesome!

Yes, you can find the model exported as in my original post here and a preview in netron here.

I suspect that strides_original = x.stride() is not raising an error while maybe it should. If I instead try to make the stride dynamic by using the shape values (a, b, c), or e.g. x[0][0][0] (see the updated comment in the original model for an example), then torch.onnx.export rightfully raises:

torch.onnx.errors.SymbolicValueError: Failed to export a node '%17 : Long(requires_grad=0, device=cpu) = onnx::Gather[axis=0](%16, %1), scope: __main__.MyModel3:: # /home/fxmarty/test_torchsript.py:48:0
' (in list node %21 : int[] = prim::ListConstruct(%17, %18, %20), scope: __main__.MyModel3::
) because it is not constant. Please try to make things (e.g. kernel sizes) static if possible.  [Caused by the value '21 defined in (%21 : int[] = prim::ListConstruct(%17, %18, %20), scope: __main__.MyModel3::
)' (type 'List[int]') in the TorchScript graph. The containing node has kind 'prim::ListConstruct'.] 

Edit: oh, yes, it seems the graph is wrong. There are hard-coded values in the Mul nodes, coming from the example input given to torch.onnx.export. Should I rather open an issue under pytorch? I think this is somewhat very minor so maybe not worth it. I did still spend quite a long time debugging this on a large model!

@justinchuby

Yes, please open an issue on torch and mention me. Thanks for doing the hard work to isolate the issue!


fxmarty commented Dec 10, 2022

Great, thanks!
