
CUDA requesting shared memory size larger than allowed size #12771

Open
mariecwhite opened this issue Mar 27, 2023 · 15 comments

@mariecwhite
Contributor

What happened?

Getting this error for many models recently:

/work/runtime/src/iree/hal/drivers/cuda/native_executable.c:136: INTERNAL; CUDA driver error: Requested shared memory size of 421376 larger than allowed size of 166912; while invoking native function hal.executable.create; while calling import; 
[ 1]   native hal.executable.create:0 -
[ 0] bytecode module.__init:2050 <eval_with_key>.65:118:14
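
For context, the 166912-byte limit in the message appears to be the device's opt-in maximum shared memory per block on sm_80 (it matches the value an A100 reports). A minimal sketch for checking both per-block limits, assuming Linux with libcuda.so loadable via ctypes and with error checking omitted:

import ctypes

# CUdevice_attribute enum values from the CUDA driver API headers.
CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK = 8         # default limit, typically 48 KB
CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK_OPTIN = 97  # opt-in limit

libcuda = ctypes.CDLL("libcuda.so")
libcuda.cuInit(0)

device = ctypes.c_int()
libcuda.cuDeviceGet(ctypes.byref(device), 0)

value = ctypes.c_int()
for name, attr in [("default", CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK),
                   ("opt-in", CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK_OPTIN)]:
    libcuda.cuDeviceGetAttribute(ctypes.byref(value), attr, device)
    print(f"max shared memory per block ({name}): {value.value} bytes")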

Steps to reproduce your issue

  1. Download https://storage.googleapis.com/iree-model-artifacts/pytorch/torch_models_20230321.784_1679461251/SD_VAE_MODEL/batch_1/linalg.mlir

  2. Compile:

iree-compile --iree-hal-target-backends=cuda \
    --iree-input-type=none \
    --iree-hal-cuda-llvm-target-arch=sm_80 \
    linalg.mlir -o linalg.vmfb
  3. Run:
iree-benchmark-module --module=linalg.vmfb \
    --function=forward \
    --input=1x4x64x64xf32=0 \
    --device_allocator=caching \
    --device=cuda://0

What component(s) does this issue relate to?

Runtime

Version information

Based on IREE SHA c6092c4

Additional context

Also seeing this in SHARK: nod-ai/SHARK-Studio#1243

mariecwhite added the bug 🐞 Something isn't working and awaiting-triage labels on Mar 27, 2023
@mariecwhite
Contributor Author

FYI @monorimet

monorimet added a commit to nod-ai/SHARK-Studio that referenced this issue Mar 27, 2023
* Rollback T5 models for torch as the inputs give some issues that aren't trivial to resolve
* xfail efficientnet-b0 on torch+cuda -- see CUDA requesting shared memory size larger than allowed size iree-org/iree#12771
powderluv added this to the IREE + NOD Model Coverage milestone on Mar 28, 2023
@ThomasRaoux
Contributor

I believe we have seen this issue before. Somehow a tensor.empty op is used as an operand of an elementwise op:

    %89 = tensor.empty() : tensor<1x4096x4096xf32>
    %90 = tensor.empty() : tensor<1x512x4096xf32>
    %91 = linalg.generic {indexing_maps = [#map5, #map6], iterator_types = ["parallel", "parallel", "parallel"]} ins(%collapsed_175 : tensor<1x4096x512xf32>) outs(%90 : tensor<1x512x4096xf32>) {
    ^bb0(%in: f32, %out: f32):
      linalg.yield %in : f32
    } -> tensor<1x512x4096xf32>
    %92 = linalg.fill ins(%cst_14 : f32) outs(%89 : tensor<1x4096x4096xf32>) -> tensor<1x4096x4096xf32>
    %93 = linalg.batch_matmul ins(%collapsed_173, %91 : tensor<1x4096x512xf32>, tensor<1x512x4096xf32>) outs(%92 : tensor<1x4096x4096xf32>) -> tensor<1x4096x4096xf32>
    %94 = linalg.generic {indexing_maps = [#map9, #map5], iterator_types = ["parallel", "parallel", "parallel"]} ins(%93 : tensor<1x4096x4096xf32>) outs(%89 : tensor<1x4096x4096xf32>) {
    ^bb0(%in: f32, %out: f32):
      %716 = arith.mulf %in, %cst_9 : f32
      linalg.yield %716 : f32
    } -> tensor<1x4096x4096xf32>
    %95 = linalg.generic {indexing_maps = [#map9, #map9, #map5], iterator_types = ["parallel", "parallel", "parallel"]} ins(%94, %89 : tensor<1x4096x4096xf32>, tensor<1x4096x4096xf32>) outs(%89 : tensor<1x4096x4096xf32>) {
    ^bb0(%in: f32, %in_320: f32, %out: f32):
      %716 = arith.mulf %in_320, %cst_14 : f32
      %717 = arith.addf %in, %716 : f32
      linalg.yield %717 : f32
    } -> tensor<1x4096x4096xf32>

Is this a front end problem? @ramiro050, would you know?

@ramiro050
Member

Is this a front end problem? @ramiro050, would you know?

Might be. Normally we do zero out tensors before passing them to linalg.generic, but this might be a case that got missed. @mariecwhite, do you have the torch-dialect MLIR for this model?

@mariecwhite
Contributor Author

@ramiro050
Member

Is that the mlir from calling torch-mlir compile? It's here: https://storage.googleapis.com/iree-model-artifacts/pytorch/torch_models_20230321.784_1679461251/SD_VAE_MODEL/batch_1/linalg.mlir

torch_mlir.compile but with output_type="torch"
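
As a point of reference, a minimal sketch of that flow using the torch_mlir.compile API from that time; TinyModel and example_input here are hypothetical stand-ins, not the actual SD VAE model:

import torch
import torch_mlir

# Hypothetical stand-in for the SD VAE model and its 1x4x64x64 input.
class TinyModel(torch.nn.Module):
    def forward(self, x):
        return x * 2.0

example_input = torch.randn(1, 4, 64, 64)

# output_type="torch" emits torch-dialect IR; output_type="raw" (requested
# further below) emits the IR before torch-mlir's simplification passes.
torch_ir = torch_mlir.compile(TinyModel(), example_input, output_type="torch")
raw_ir = torch_mlir.compile(TinyModel(), example_input, output_type="raw")
print(torch_ir)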

@ramiro050
Member

There is an empty tensor being fed to the torch.aten.add.Tensor op:

    %258 = torch.prim.ListConstruct %int1, %int4096, %int4096 : (!torch.int, !torch.int, !torch.int) -> !torch.list<int>
    %259 = torch.aten.empty.memory_format %258, %int6, %none, %cpu, %false, %none : !torch.list<int>, !torch.int, !torch.none, !torch.Device, !torch.bool, !torch.none -> !torch.vtensor<[1,4096,4096],f32>
    %260 = torch.aten.transpose.int %254, %int-1, %int-2 : !torch.vtensor<[1,4096,512],f32>, !torch.int, !torch.int -> !torch.vtensor<[1,512,4096],f32>
    %261 = torch.aten.bmm %251, %260 : !torch.vtensor<[1,4096,512],f32>, !torch.vtensor<[1,512,4096],f32> -> !torch.vtensor<[1,4096,4096],f32>
    %262 = torch.aten.mul.Scalar %261, %float4.419420e-02 : !torch.vtensor<[1,4096,4096],f32>, !torch.float -> !torch.vtensor<[1,4096,4096],f32>
    %263 = torch.aten.add.Tensor %262, %259, %int0 : !torch.vtensor<[1,4096,4096],f32>, !torch.vtensor<[1,4096,4096],f32>, !torch.int -> !torch.vtensor<[1,4096,4096],f32>

While it could still be a bug in torch-mlir that results in this (the chances are small), it could also be a bug in the model itself.

@mariecwhite, can you link one last IR, generated with output_type="raw"? This will allow me to say for sure whether torch-mlir is causing this or not.

@ramiro050
Member

This seems to be a bug in the model, not in torch-mlir:

    %278 = torch.prim.ListConstruct %int1, %int4096, %int4096 : (!torch.int, !torch.int, !torch.int) -> !torch.list<int>
    %279 = torch.aten.empty.memory_format %278, %int6, %none_0, %cpu, %false, %none_0 : !torch.list<int>, !torch.int, !torch.none, !torch.Device, !torch.bool, !torch.none -> !torch.tensor
    %280 = torch.aten.transpose.int %271, %int-1, %int-2 : !torch.tensor, !torch.int, !torch.int -> !torch.tensor
    %281 = torch.aten.baddbmm %279, %265, %280, %int0, %float4.419420e-02 : !torch.tensor, !torch.tensor, !torch.tensor, !torch.int, !torch.float -> !torch.tensor

An empty tensor is being used as the first argument for baddbmm. This also appears in the Python code string that FX graphs have attached to them:

    empty = torch.ops.aten.empty([1, 4096, 4096], dtype = torch.float32, device = device(type='cpu'), pin_memory = False)
    transpose_1 = torch.ops.aten.transpose(view_13, -1, -2);  view_13 = None
    baddbmm = torch.ops.aten.baddbmm(empty, view_11, transpose_1, beta = 0, alpha = 0.044194173824159216);  empty = view_11 = transpose_1 = None
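
For illustration, a minimal sketch of that pattern and one possible model-side workaround; the shapes and alpha value are taken from the dump above, and the scaled-bmm replacement is only a suggestion, not a confirmed fix:

import torch

view_11 = torch.randn(1, 4096, 512)
view_13 = torch.randn(1, 4096, 512)

# Pattern from the FX graph: an uninitialized tensor is passed as the bias
# input of baddbmm, even though beta=0 means its values should never be used.
empty = torch.empty(1, 4096, 4096, dtype=torch.float32)
out = torch.baddbmm(empty, view_11, view_13.transpose(-1, -2),
                    beta=0, alpha=0.044194173824159216)

# Possible model-side workaround: drop the uninitialized bias and express the
# same computation as a scaled bmm.
out_alt = 0.044194173824159216 * torch.bmm(view_11, view_13.transpose(-1, -2))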

@allieculp

@ramiro050 Can this be closed? Or are there additional questions here?

@ramiro050
Member

If the CUDA error is caused by the zero tensor being used as an argument, then this seems to be an issue with the model and not with IREE. @mariecwhite, can you confirm?

@mariecwhite
Contributor Author

My implementation provides non-zero tensors as input, so there is something else going on. @ramiro050, is there a way to visualize the FX graph?

@ramiro050
Member

ramiro050 commented Apr 19, 2023

Sorry, I meant empty tensors being used as arguments in the baddbmm op. If you have a torch.fx graph module, you can print the graph by doing print(my_module.graph). You can also see the Python code representation by doing print(my_module.code)
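
For example, a minimal sketch with a hypothetical module standing in for the traced model:

import torch
import torch.fx as fx

# Hypothetical stand-in for the traced graph module.
class Toy(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x) + 1.0

my_module = fx.symbolic_trace(Toy())
print(my_module.graph)  # node-by-node view of the FX graph
print(my_module.code)   # generated Python source for the graph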

@allieculp

@ramiro050 @mariecwhite Any update on this one?

@mariecwhite
Contributor Author

I haven't had cycles to look into this. I'll try and look into it next week.

ramiro050 removed their assignment on Aug 7, 2024