
(ROCM) Failed to distribute matmul in sdxl-turbo unet #17495

Open
monorimet opened this issue May 23, 2024 · 1 comment
Labels
bug 🐞 (Something isn't working), codegen/rocm (ROCm code generation compiler backend)

Comments

monorimet (Collaborator) commented May 23, 2024

What happened?

Module-level error:

<unknown>:0: error: cannot get concrete layout for contraction
<stdin>:2704:12: error: 'func.func' op failed to distribute
    %445 = torch.aten.convolution %338, %439, %440, %441, %442, %443, %false_465, %444, %int1_468 : !torch.vtensor<[1,320,64,64],f16>, !torch.vtensor<[640,320,1,1],f16>, !torch.vtensor<[640],f16>, !torch.list<int>, !torch.list<int>, !torch.list<int>, !torch.bool, !torch.list<int>, !torch.int -> !torch.vtensor<[1,640,64,64],f16>
           ^
<stdin>:1718:10: note: called from
    %6 = call @forward(%0, %1, %2, %3, %4, %5) : (!torch.vtensor<[1,4,128,128],f16>, !torch.vtensor<[1,64,2048],f16>, !torch.vtensor<[1,1280],f16>, !torch.vtensor<[1,6],f16>, !torch.vtensor<[1],f16>, !torch.vtensor<[1],si64>) -> !torch.vtensor<[1,4,128,128],f16>
         ^
<stdin>:2704:12: error: failed to run translation of source executable to target executable for backend #hal.executable.target<"rocm", "rocm-hsaco-fb", {mma_intrinsics = [#iree_gpu.mma_layout<MFMA_F16_16x16x16_F32>, #iree_gpu.mma_layout<MFMA_F16_32x32x8_F32>], target_arch = "gfx942", ukernels = "none", waves_per_eu = 2 : i64}>
    %445 = torch.aten.convolution %338, %439, %440, %441, %442, %443, %false_465, %444, %int1_468 : !torch.vtensor<[1,320,64,64],f16>, !torch.vtensor<[640,320,1,1],f16>, !torch.vtensor<[640],f16>, !torch.list<int>, !torch.list<int>, !torch.list<int>, !torch.bool, !torch.list<int>, !torch.int -> !torch.vtensor<[1,640,64,64],f16>
           ^
<stdin>:1718:10: note: called from
    %6 = call @forward(%0, %1, %2, %3, %4, %5) : (!torch.vtensor<[1,4,128,128],f16>, !torch.vtensor<[1,64,2048],f16>, !torch.vtensor<[1,1280],f16>, !torch.vtensor<[1,6],f16>, !torch.vtensor<[1],f16>, !torch.vtensor<[1],si64>) -> !torch.vtensor<[1,4,128,128],f16>
         ^

Dispatch-level error:

iree.compiler.tools.binaries.CompilerToolError: Error invoking IREE compiler tool iree-compile
Error code: 1
Diagnostics:
failed to translate executables
/home/monorimet/SHARK/models/stabilityai_sdxl_turbo_1_64_1024x1024_fp16_gfx942/EulerDiscrete_unet_4_dispatches/configured_compiled_scheduled_unet_run_forward$async_dispatch_50.mlir:26:10: error: cannot get concrete layout for contraction
          %11 = arith.addf %out, %10 : f32 loc("/home/monorimet/SHARK/models/stabilityai_sdxl_turbo_1_64_1024x1024_fp16_gfx942/EulerDiscrete_unet_4_dispatches/compiled_scheduled_unet_run_forward$async_dispatch_50.mlir":26:10)
         ^
/home/monorimet/SHARK/models/stabilityai_sdxl_turbo_1_64_1024x1024_fp16_gfx942/EulerDiscrete_unet_4_dispatches/configured_compiled_scheduled_unet_run_forward$async_dispatch_50.mlir:9:6: error: 'func.func' op failed to distribute
      func.func @run_forward$async_dispatch_50_matmul_like_64x64x640x320_f16xf16xf32() attributes {translation_info = #iree_codegen.translation_info<LLVMGPUVectorDistribute workgroup_size = [128, 2, 1] subgroup_size = 64, {mma_schedule = #iree_gpu.mma_schedule<intrinsic = #iree_gpu.mma_layout<MFMA_F16_16x16x16_F32>, subgroup_m_count = 2, subgroup_n_count = 2>}>} {
     ^
/home/monorimet/SHARK/models/stabilityai_sdxl_turbo_1_64_1024x1024_fp16_gfx942/EulerDiscrete_unet_4_dispatches/configured_compiled_scheduled_unet_run_forward$async_dispatch_50.mlir:2:2: error: failed to run translation of source executable to target executable for backend #hal.executable.target<"rocm", "rocm-hsaco-fb", {mma_intrinsics = [#iree_gpu.mma_layout<MFMA_F16_16x16x16_F32>, #iree_gpu.mma_layout<MFMA_F16_32x32x8_F32>], target_arch = "gfx942", ukernels = "none", waves_per_eu = 2 : i64}>
  hal.executable.variant public @rocm_hsaco_fb target(<"rocm", "rocm-hsaco-fb", {mma_intrinsics = [#iree_gpu.mma_layout<MFMA_F16_16x16x16_F32>, #iree_gpu.mma_layout<MFMA_F16_32x32x8_F32>], target_arch = "gfx942", ukernels = "none", waves_per_eu = 2 : i64}>) {
 ^

Reproducer:
attention_and_matmul_spec_mfma.mlir
compiled_scheduled_unet_run_forward$async_dispatch_50.mlir

Invoked with:
 iree-compile \
   --iree-input-type=torch \
   --iree-vm-bytecode-module-output-format=flatbuffer-binary \
   --iree-hal-target-backends=rocm \
   --mlir-print-debuginfo \
   --mlir-print-op-on-diagnostic=false \
   --iree-hal-target-backends=rocm \
   --iree-rocm-target-chip=gfx942 \
   --iree-opt-const-eval=false \
   --iree-rocm-waves-per-eu=2 \
   --iree-hal-dump-executable-files-to=/home/monorimet/SHARK/models/stabilityai_sdxl_turbo_1_64_1024x1024_fp16_gfx942/EulerDiscrete_unet_4_dispatches \
   --iree-flow-enable-aggressive-fusion \
   --iree-global-opt-enable-fuse-horizontal-contractions=true \
   --iree-opt-aggressively-propagate-transposes=true \
   --iree-global-opt-propagate-transposes=true \
   --iree-opt-outer-dim-concat=true \
   --iree-vm-target-truncate-unsupported-floats \
   --iree-llvmgpu-enable-prefetch=true \
   --iree-opt-data-tiling=false \
   --iree-codegen-gpu-native-math-precision=true \
   --iree-rocm-waves-per-eu=2 \
   --iree-codegen-llvmgpu-use-vector-distribution=true \
   --iree-preprocessing-pass-pipeline="builtin.module(iree-preprocessing-transpose-convolution-pipeline, util.func(iree-preprocessing-pad-to-intrinsics))" \
   --iree-codegen-transform-dialect-library=/home/monorimet/SHARK/models/stabilityai_sdxl_turbo_1_64_1024x1024_fp16_gfx942/attention_and_matmul_spec_mfma.mlir \
   --compile-from=flow \
   compiled_scheduled_unet_run_forward\$async_dispatch_50.mlir
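
For a quicker repro, a pared-down invocation should hit the same failure. The sketch below keeps only the flags from the command above that look relevant to this dispatch; it is untested and assumes the two attached reproducer files sit in the current directory:

 iree-compile \
   --iree-hal-target-backends=rocm \
   --iree-rocm-target-chip=gfx942 \
   --iree-rocm-waves-per-eu=2 \
   --iree-codegen-llvmgpu-use-vector-distribution=true \
   --iree-codegen-transform-dialect-library=attention_and_matmul_spec_mfma.mlir \
   --compile-from=flow \
   compiled_scheduled_unet_run_forward\$async_dispatch_50.mlir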

Steps to reproduce your issue

No response

What component(s) does this issue relate to?

Compiler

Version information

commit 2a2a4d0 (HEAD -> main, origin/main, origin/HEAD)

Additional context

This is a new version of the IR in which the batch dim is 1, so we likely have a vector distribution issue for this op at this shape.
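
For reference, the dispatch name matmul_like_64x64x640x320_f16xf16xf32 suggests a contraction with two parallel dims of 64 (the flattened spatial dims of the 1x1 conv), N = 640, K = 320, and f16 operands accumulating into f32. The snippet below is a hand-written sketch of that shape, not the actual dispatch contents; the names and indexing maps are my guesses:

#lhs = affine_map<(d0, d1, d2, d3) -> (d0, d1, d3)>
#rhs = affine_map<(d0, d1, d2, d3) -> (d3, d2)>
#out = affine_map<(d0, d1, d2, d3) -> (d0, d1, d2)>
func.func @matmul_like_64x64x640x320(%lhs: tensor<64x64x320xf16>,
                                     %rhs: tensor<320x640xf16>,
                                     %acc: tensor<64x64x640xf32>) -> tensor<64x64x640xf32> {
  %0 = linalg.generic {indexing_maps = [#lhs, #rhs, #out],
                       iterator_types = ["parallel", "parallel", "parallel", "reduction"]}
      ins(%lhs, %rhs : tensor<64x64x320xf16>, tensor<320x640xf16>)
      outs(%acc : tensor<64x64x640xf32>) {
  ^bb0(%a: f16, %b: f16, %out: f32):
    // f16 multiply, f32 accumulate -- matching the `arith.addf %out, %10 : f32`
    // the layout analysis points at in the diagnostics above.
    %1 = arith.extf %a : f16 to f32
    %2 = arith.extf %b : f16 to f32
    %3 = arith.mulf %1, %2 : f32
    %4 = arith.addf %out, %3 : f32
    linalg.yield %4 : f32
  } -> tensor<64x64x640xf32>
  return %0 : tensor<64x64x640xf32>
}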

monorimet added the bug 🐞 and codegen/rocm labels on May 23, 2024
monorimet (Collaborator, Author) commented:

FWIW this also fails, on the same dispatch, with vector distribution explicitly disabled:

<unknown>:0: error: LLVM Translation failed for operation: builtin.unrealized_conversion_cast
/home/eagarvey/SHARK/models/stabilityai_sdxl_turbo_1_64_1024x1024_fp16_gfx942/EulerDiscrete_unet_4_dispatches/configured_compiled_scheduled_unet_run_forward$async_dispatch_50.mlir:2:2: error: failed to translate the MLIR LLVM dialect to the native llvm::Module
  hal.executable.variant public @rocm_hsaco_fb target(<"rocm", "rocm-hsaco-fb", {mma_intrinsics = [#iree_gpu.mma_layout<MFMA_F16_16x16x16_F32>, #iree_gpu.mma_layout<MFMA_F16_32x32x8_F32>], target_arch = "gfx942", ukernels = "none", waves_per_eu = 2 : i64}>) {
 ^
/home/eagarvey/SHARK/models/stabilityai_sdxl_turbo_1_64_1024x1024_fp16_gfx942/EulerDiscrete_unet_4_dispatches/configured_compiled_scheduled_unet_run_forward$async_dispatch_50.mlir:2:2: error: failed to serialize executable for target backend rocm
  hal.executable.variant public @rocm_hsaco_fb target(<"rocm", "rocm-hsaco-fb", {mma_intrinsics = [#iree_gpu.mma_layout<MFMA_F16_16x16x16_F32>, #iree_gpu.mma_layout<MFMA_F16_32x32x8_F32>], target_arch = "gfx942", ukernels = "none", waves_per_eu = 2 : i64}>) {
 ^
/home/eagarvey/SHARK/models/stabilityai_sdxl_turbo_1_64_1024x1024_fp16_gfx942/EulerDiscrete_unet_4_dispatches/configured_compiled_scheduled_unet_run_forward$async_dispatch_50.mlir:1:0: error: failed to serialize executables
