
CUDA requesting shared memory size larger than allowed size #12771

Open
mariecwhite opened this issue Mar 27, 2023 · 15 comments

@mariecwhite
Contributor

What happened?

Getting this error for many models recently:

/work/runtime/src/iree/hal/drivers/cuda/native_executable.c:136: INTERNAL; CUDA driver error: Requested shared memory size of 421376 larger than allowed size of 166912; while invoking native function hal.executable.create; while calling import; 
[ 1]   native hal.executable.create:0 -
[ 0] bytecode module.__init:2050 <eval_with_key>.65:118:14
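
For context, the 166912-byte limit in the message appears to be the device's opt-in maximum shared memory per block on sm_80 (it matches the value an A100 reports). A minimal sketch for checking both per-block limits, assuming Linux with libcuda.so loadable via ctypes and with error checking omitted:

import ctypes

# CUdevice_attribute enum values from the CUDA driver API headers.
CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK = 8         # default limit, typically 48 KB
CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK_OPTIN = 97  # opt-in limit

libcuda = ctypes.CDLL("libcuda.so")
libcuda.cuInit(0)

device = ctypes.c_int()
libcuda.cuDeviceGet(ctypes.byref(device), 0)

value = ctypes.c_int()
for name, attr in [("default", CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK),
                   ("opt-in", CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK_OPTIN)]:
    libcuda.cuDeviceGetAttribute(ctypes.byref(value), attr, device)
    print(f"max shared memory per block ({name}): {value.value} bytes")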

Steps to reproduce your issue

  1. Download https://storage.googleapis.com/iree-model-artifacts/pytorch/torch_models_20230321.784_1679461251/SD_VAE_MODEL/batch_1/linalg.mlir

  2. Compile:

iree-compile --iree-hal-target-backends=cuda \
    --iree-input-type=none \
    --iree-hal-cuda-llvm-target-arch=sm_80 \
    linalg.mlir -o linalg.vmfb
  3. Run:
iree-benchmark-module --module=linalg.vmfb \
    --function=forward \
    --input=1x4x64x64xf32=0 \
    --device_allocator=caching \
    --device=cuda://0

What component(s) does this issue relate to?

Runtime

Version information

Based on IREE SHA c6092c4

Additional context

Also seeing this in SHARK: nod-ai/SHARK-Studio#1243

mariecwhite added the bug 🐞 Something isn't working and awaiting-triage labels on Mar 27, 2023
@mariecwhite
Contributor Author

FYI @monorimet

monorimet added a commit to nod-ai/SHARK-Studio that referenced this issue Mar 27, 2023
* Rollback T5 models for torch as the inputs give some issues that aren't trivial to resolve
* xfail efficientnet-b0 on torch+cuda -- see CUDA requesting shared memory size larger than allowed size iree-org/iree#12771
powderluv added this to the IREE + NOD Model Coverage milestone on Mar 28, 2023
@ThomasRaoux
Contributor

I believe we have seen this issue before. Somehow a tensor.empty op is used as an operand of an elementwise op:

    %89 = tensor.empty() : tensor<1x4096x4096xf32>
    %90 = tensor.empty() : tensor<1x512x4096xf32>
    %91 = linalg.generic {indexing_maps = [#map5, #map6], iterator_types = ["parallel", "parallel", "parallel"]} ins(%collapsed_175 : tensor<1x4096x512xf32>) outs(%90 : tensor<1x512x4096xf32>) {
    ^bb0(%in: f32, %out: f32):
      linalg.yield %in : f32
    } -> tensor<1x512x4096xf32>
    %92 = linalg.fill ins(%cst_14 : f32) outs(%89 : tensor<1x4096x4096xf32>) -> tensor<1x4096x4096xf32>
    %93 = linalg.batch_matmul ins(%collapsed_173, %91 : tensor<1x4096x512xf32>, tensor<1x512x4096xf32>) outs(%92 : tensor<1x4096x4096xf32>) -> tensor<1x4096x4096xf32>
    %94 = linalg.generic {indexing_maps = [#map9, #map5], iterator_types = ["parallel", "parallel", "parallel"]} ins(%93 : tensor<1x4096x4096xf32>) outs(%89 : tensor<1x4096x4096xf32>) {
    ^bb0(%in: f32, %out: f32):
      %716 = arith.mulf %in, %cst_9 : f32
      linalg.yield %716 : f32
    } -> tensor<1x4096x4096xf32>
    %95 = linalg.generic {indexing_maps = [#map9, #map9, #map5], iterator_types = ["parallel", "parallel", "parallel"]} ins(%94, %89 : tensor<1x4096x4096xf32>, tensor<1x4096x4096xf32>) outs(%89 : tensor<1x4096x4096xf32>) {
    ^bb0(%in: f32, %in_320: f32, %out: f32):
      %716 = arith.mulf %in_320, %cst_14 : f32
      %717 = arith.addf %in, %716 : f32
      linalg.yield %717 : f32
    } -> tensor<1x4096x4096xf32>

Is this a front end problem? @ramiro050, would you know?

@ramiro050
Member

Is this a front end problem? @ramiro050, would you know?

Might be. Normally we do zero out tensors before passing them to linalg.generic, but this might be a case that got missed. @mariecwhite, do you have the torch-dialect MLIR for this model?

@mariecwhite
Contributor Author

@ramiro050
Member

Is that the mlir from calling torch-mlir compile? It's here: https://storage.googleapis.com/iree-model-artifacts/pytorch/torch_models_20230321.784_1679461251/SD_VAE_MODEL/batch_1/linalg.mlir

torch_mlir.compile but with output_type="torch"
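
As a point of reference, a minimal sketch of that flow using the torch_mlir.compile API from that time; TinyModel and example_input here are hypothetical stand-ins, not the actual SD VAE model:

import torch
import torch_mlir

# Hypothetical stand-in for the SD VAE model and its 1x4x64x64 input.
class TinyModel(torch.nn.Module):
    def forward(self, x):
        return x * 2.0

example_input = torch.randn(1, 4, 64, 64)

# output_type="torch" emits torch-dialect IR; output_type="raw" (requested
# further below) emits the IR before torch-mlir's simplification passes.
torch_ir = torch_mlir.compile(TinyModel(), example_input, output_type="torch")
raw_ir = torch_mlir.compile(TinyModel(), example_input, output_type="raw")
print(torch_ir)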

@ramiro050
Member

There is an empty tensor being fed to the torch.aten.add.Tensor op:

    %258 = torch.prim.ListConstruct %int1, %int4096, %int4096 : (!torch.int, !torch.int, !torch.int) -> !torch.list<int>
    %259 = torch.aten.empty.memory_format %258, %int6, %none, %cpu, %false, %none : !torch.list<int>, !torch.int, !torch.none, !torch.Device, !torch.bool, !torch.none -> !torch.vtensor<[1,4096,4096],f32>
    %260 = torch.aten.transpose.int %254, %int-1, %int-2 : !torch.vtensor<[1,4096,512],f32>, !torch.int, !torch.int -> !torch.vtensor<[1,512,4096],f32>
    %261 = torch.aten.bmm %251, %260 : !torch.vtensor<[1,4096,512],f32>, !torch.vtensor<[1,512,4096],f32> -> !torch.vtensor<[1,4096,4096],f32>
    %262 = torch.aten.mul.Scalar %261, %float4.419420e-02 : !torch.vtensor<[1,4096,4096],f32>, !torch.float -> !torch.vtensor<[1,4096,4096],f32>
    %263 = torch.aten.add.Tensor %262, %259, %int0 : !torch.vtensor<[1,4096,4096],f32>, !torch.vtensor<[1,4096,4096],f32>, !torch.int -> !torch.vtensor<[1,4096,4096],f32>

While it could still be a bug in torch-mlir that results in this (the chances are small), it could also be a bug in the model itself.

@mariecwhite, can you link one last IR, generated with output_type="raw"? This will allow me to say for sure whether torch-mlir is causing this or not.

@ramiro050
Member

This seems to be a bug in the model, not in torch-mlir:

    %278 = torch.prim.ListConstruct %int1, %int4096, %int4096 : (!torch.int, !torch.int, !torch.int) -> !torch.list<int>
    %279 = torch.aten.empty.memory_format %278, %int6, %none_0, %cpu, %false, %none_0 : !torch.list<int>, !torch.int, !torch.none, !torch.Device, !torch.bool, !torch.none -> !torch.tensor
    %280 = torch.aten.transpose.int %271, %int-1, %int-2 : !torch.tensor, !torch.int, !torch.int -> !torch.tensor
    %281 = torch.aten.baddbmm %279, %265, %280, %int0, %float4.419420e-02 : !torch.tensor, !torch.tensor, !torch.tensor, !torch.int, !torch.float -> !torch.tensor

An empty tensor is being used as the first argument for baddbmm. This also appears in the Python code string that FX graphs have attached to them:

    empty = torch.ops.aten.empty([1, 4096, 4096], dtype = torch.float32, device = device(type='cpu'), pin_memory = False)
    transpose_1 = torch.ops.aten.transpose(view_13, -1, -2);  view_13 = None
    baddbmm = torch.ops.aten.baddbmm(empty, view_11, transpose_1, beta = 0, alpha = 0.044194173824159216);  empty = view_11 = transpose_1 = None
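
For illustration, a minimal sketch of that pattern and one possible model-side workaround; the shapes and alpha value are taken from the dump above, and the scaled-bmm replacement is only a suggestion, not a confirmed fix:

import torch

view_11 = torch.randn(1, 4096, 512)
view_13 = torch.randn(1, 4096, 512)

# Pattern from the FX graph: an uninitialized tensor is passed as the bias
# input of baddbmm, even though beta=0 means its values should never be used.
empty = torch.empty(1, 4096, 4096, dtype=torch.float32)
out = torch.baddbmm(empty, view_11, view_13.transpose(-1, -2),
                    beta=0, alpha=0.044194173824159216)

# Possible model-side workaround: drop the uninitialized bias and express the
# same computation as a scaled bmm.
out_alt = 0.044194173824159216 * torch.bmm(view_11, view_13.transpose(-1, -2))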

@allieculp

@ramiro050 Can this be closed? Or are there additional questions here?

@ramiro050
Member

If the CUDA error is caused by the zero tensor being used as an argument, then this seems to be an issue with the model and not with IREE. @mariecwhite, can you confirm?

@mariecwhite
Contributor Author

My implementation provides non-zero tensors as input, so there is something else going on. @ramiro050, is there a way to visualize the FX graph?

@ramiro050
Member

ramiro050 commented Apr 19, 2023

Sorry, I meant empty tensors being used as arguments in the baddbmm op. If you have a torch.fx graph module, you can print the graph by doing print(my_module.graph). You can also see the Python code representation by doing print(my_module.code)
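
For example, a minimal sketch with a hypothetical module standing in for the traced model:

import torch
import torch.fx as fx

# Hypothetical stand-in for the traced graph module.
class Toy(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x) + 1.0

my_module = fx.symbolic_trace(Toy())
print(my_module.graph)  # node-by-node view of the FX graph
print(my_module.code)   # generated Python source for the graph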

@allieculp

@ramiro050 @mariecwhite Any update on this one?

@mariecwhite
Contributor Author

I haven't had cycles to look into this. I'll try and look into it next week.

ramiro050 removed their assignment on Aug 7, 2024