
Integrate torch-mlir@ec6d7aa onnx.resize op #17358

Merged: 1 commit, May 14, 2024

Conversation

AmosLewis
Contributor

@AmosLewis AmosLewis commented May 12, 2024

Unsolved IREE issue:
ONNX "resize" op test failures #17345

One torch-mlir commit before:
Integrate both llvm-project@2083e97e (+1 ↩️, +1 🍒) and torch-mlir@bce800a3 #17330

Discord discussion:
https://discord.com/channels/973663919757492264/1238540944383541330

related torch-mlir onnx.resize patch: llvm/torch-mlir#3013, author: https://github.com/aldesilv

@ScottTodd
Member

Please follow https://iree.dev/developers/general/contributing/#obtaining-commit-access to get at least triage access to this repository so workflows can run without approval.


@ScottTodd
Member

Updating the XFAIL lists here is going to be a bit bumpy, since I've had to turn off the main runners (#17370) and there is a new CUDA hang.

Can you at least sync this PR to include the newly disabled jobs?

@ScottTodd
Member

Pushed a commit syncing this PR after a few of my CI fixes landed. Hopefully that will show the new test outcomes (passes/failures) and timeouts. We'll have to update the ROCm tests later, once the w7900 runner is back online and stable.

@ScottTodd
Member

Ok, the tests that hang can be spotted easily now.
Logs from this PR: https://github.com/iree-org/iree/actions/runs/9071387883/job/24925163523?pr=17358#step:9:3466

Note the `Failed: Timeout >30.0s` lines:

PASSED SHARK-TestSuite/iree_tests/onnx/node/generated/test_xor_bcast4v4d/model.mlir::gpu_cuda_t4_test
FAILED SHARK-TestSuite/iree_tests/onnx/node/generated/test_resize_downsample_scales_linear/model.mlir::gpu_cuda_t4_test
FAILED SHARK-TestSuite/iree_tests/onnx/node/generated/test_resize_downsample_scales_linear_align_corners/model.mlir::gpu_cuda_t4_test - Failed: Timeout >30.0s
FAILED SHARK-TestSuite/iree_tests/onnx/node/generated/test_resize_downsample_scales_linear_half_pixel_symmetric/model.mlir::gpu_cuda_t4_test - Failed: Timeout >30.0s
FAILED SHARK-TestSuite/iree_tests/onnx/node/generated/test_resize_downsample_sizes_linear_pytorch_half_pixel/model.mlir::gpu_cuda_t4_test
FAILED SHARK-TestSuite/iree_tests/onnx/node/generated/test_resize_upsample_scales_linear/model.mlir::gpu_cuda_t4_test
FAILED SHARK-TestSuite/iree_tests/onnx/node/generated/test_resize_upsample_scales_linear_align_corners/model.mlir::gpu_cuda_t4_test
FAILED SHARK-TestSuite/iree_tests/onnx/node/generated/test_resize_upsample_scales_linear_half_pixel_symmetric/model.mlir::gpu_cuda_t4_test
FAILED SHARK-TestSuite/iree_tests/onnx/node/generated/test_resize_upsample_sizes_nearest_floor_align_corners/model.mlir::gpu_cuda_t4_test
============ 8 failed, 581 passed, 643 xfailed in 292.09s (0:04:52) ============
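Timed-out tests can be separated from ordinary failures mechanically when triaging a log like the one above. A minimal sketch (the two log lines are copied from the output above; the parsing logic itself is an assumption, not part of the CI tooling):

```python
# Split CI FAILED lines into hang/timeout cases vs ordinary failures, so
# the timeouts can be routed to skip_run_tests instead of
# expected_run_failures. Sample lines taken from the pytest output above.
log = """\
FAILED SHARK-TestSuite/iree_tests/onnx/node/generated/test_resize_downsample_scales_linear/model.mlir::gpu_cuda_t4_test
FAILED SHARK-TestSuite/iree_tests/onnx/node/generated/test_resize_downsample_scales_linear_align_corners/model.mlir::gpu_cuda_t4_test - Failed: Timeout >30.0s
"""

timeouts, failures = [], []
for line in log.splitlines():
    if not line.startswith("FAILED "):
        continue
    test_id = line.split()[1]  # second token is the test path/id
    (timeouts if "Failed: Timeout" in line else failures).append(test_id)

print(len(timeouts), len(failures))  # 1 1
```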

So to update xfail lists:

  1. Download .json files from the summary page: https://github.com/iree-org/iree/actions/runs/9071387883?pr=17358
  2. Move those files to https://github.com/iree-org/iree/tree/main/build_tools/pkgci/external_test_suite
  3. **New:** edit those files manually (just the CUDA one in this case), putting any timed-out tests in skip_run_tests, not expected_run_failures
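Step 3 above can be sketched as a small script. The key names skip_run_tests and expected_run_failures come from the comment above; the exact schema of onnx_gpu_cuda.json and the test identifiers below are assumptions for illustration:

```python
import json

# Hypothetical test ids for illustration only.
TIMEOUT_TESTS = [
    "onnx/node/generated/test_resize_downsample_scales_linear_align_corners",
    "onnx/node/generated/test_resize_downsample_scales_linear_half_pixel_symmetric",
]

def route_timeouts(config: dict, timeouts: list) -> dict:
    """Move timed-out tests out of expected_run_failures and into
    skip_run_tests, so they are never run (a hang would otherwise stall CI)."""
    skip = set(config.get("skip_run_tests", []))
    xfail = [t for t in config.get("expected_run_failures", []) if t not in timeouts]
    skip.update(timeouts)
    config["skip_run_tests"] = sorted(skip)
    config["expected_run_failures"] = xfail
    return config

# Example config standing in for onnx_gpu_cuda.json.
cfg = {"skip_run_tests": [], "expected_run_failures": TIMEOUT_TESTS + ["some_other_test"]}
print(json.dumps(route_timeouts(cfg, TIMEOUT_TESTS), indent=2))
```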

 - Update skip_run_tests in onnx_gpu_cuda.json to skip the hanging CUDA resize tests
 - Mark most of the CPU resize tests XFAIL
@AmosLewis AmosLewis marked this pull request as ready for review May 14, 2024 20:04
@AmosLewis AmosLewis requested a review from ScottTodd as a code owner May 14, 2024 20:04