CUDA requesting shared memory size larger than allowed size #12771
Comments
FYI @monorimet
* Rollback T5 models for torch as the inputs give some issues that aren't trivial to resolve
* xfail efficientnet-b0 on torch+cuda -- see CUDA requesting shared memory size larger than allowed size iree-org/iree#12771
I believe we had seen this issue before. Somehow a
Is this a front end problem? @ramiro050, would you know?
Might be. Normally we do zero out tensors before passing them to
Is that the mlir from calling torch-mlir compile? It's here: https://storage.googleapis.com/iree-model-artifacts/pytorch/torch_models_20230321.784_1679461251/SD_VAE_MODEL/batch_1/linalg.mlir
There is an empty tensor being fed to the torch.aten.add.Tensor op below:

```mlir
%258 = torch.prim.ListConstruct %int1, %int4096, %int4096 : (!torch.int, !torch.int, !torch.int) -> !torch.list<int>
%259 = torch.aten.empty.memory_format %258, %int6, %none, %cpu, %false, %none : !torch.list<int>, !torch.int, !torch.none, !torch.Device, !torch.bool, !torch.none -> !torch.vtensor<[1,4096,4096],f32>
%260 = torch.aten.transpose.int %254, %int-1, %int-2 : !torch.vtensor<[1,4096,512],f32>, !torch.int, !torch.int -> !torch.vtensor<[1,512,4096],f32>
%261 = torch.aten.bmm %251, %260 : !torch.vtensor<[1,4096,512],f32>, !torch.vtensor<[1,512,4096],f32> -> !torch.vtensor<[1,4096,4096],f32>
%262 = torch.aten.mul.Scalar %261, %float4.419420e-02 : !torch.vtensor<[1,4096,4096],f32>, !torch.float -> !torch.vtensor<[1,4096,4096],f32>
%263 = torch.aten.add.Tensor %262, %259, %int0 : !torch.vtensor<[1,4096,4096],f32>, !torch.vtensor<[1,4096,4096],f32>, !torch.int -> !torch.vtensor<[1,4096,4096],f32>
```

While it could still be a bug in torch-mlir that results in this (chances are small), this could also be a bug in the model itself. @mariecwhite, can you link one last IR:
This seems to be a bug in the model, not in torch-mlir:

```mlir
%278 = torch.prim.ListConstruct %int1, %int4096, %int4096 : (!torch.int, !torch.int, !torch.int) -> !torch.list<int>
%279 = torch.aten.empty.memory_format %278, %int6, %none_0, %cpu, %false, %none_0 : !torch.list<int>, !torch.int, !torch.none, !torch.Device, !torch.bool, !torch.none -> !torch.tensor
%280 = torch.aten.transpose.int %271, %int-1, %int-2 : !torch.tensor, !torch.int, !torch.int -> !torch.tensor
%281 = torch.aten.baddbmm %279, %265, %280, %int0, %float4.419420e-02 : !torch.tensor, !torch.tensor, !torch.tensor, !torch.int, !torch.float -> !torch.tensor
```

An empty tensor is being used as the first argument for baddbmm:

```python
empty = torch.ops.aten.empty([1, 4096, 4096], dtype = torch.float32, device = device(type='cpu'), pin_memory = False)
transpose_1 = torch.ops.aten.transpose(view_13, -1, -2);  view_13 = None
baddbmm = torch.ops.aten.baddbmm(empty, view_11, transpose_1, beta = 0, alpha = 0.044194173824159216);  empty = view_11 = transpose_1 = None
```
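To make the pattern concrete, here is a minimal, self-contained PyTorch sketch of it (shapes reduced and variable names invented for illustration; the real model uses (1, 4096, 512) inputs, a (1, 4096, 4096) result, and alpha ≈ 0.0442 ≈ 1/sqrt(512)). PyTorch documents that with beta=0 the input argument of baddbmm is ignored, so the uninitialized tensor's contents are never read:

```python
import torch

# Scaled-down stand-ins for the model's projections; the real shapes are
# (1, 4096, 512) for q/k and (1, 4096, 4096) for the result.
q = torch.randn(1, 8, 4)
k = torch.randn(1, 8, 4)

# Uninitialized tensor, mirroring the aten.empty in the IR above. With
# beta=0 below, baddbmm ignores this argument entirely, so its garbage
# contents (including any NaN/inf) do not propagate.
attn_bias = torch.empty(1, 8, 8)

scores = torch.baddbmm(attn_bias, q, k.transpose(-1, -2),
                       beta=0, alpha=1.0 / (4 ** 0.5))
print(scores.shape)  # torch.Size([1, 8, 8])
```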
@ramiro050 Can this be closed? Or are there additional questions here?
If the CUDA error is caused by the zero tensor being used as an argument, then this seems to be an issue with the model and not with IREE. @mariecwhite, can you confirm?
My implementation provides non-zero tensors as input. There is something else going on. @ramiro050 is there a way to visualize the FX graph?
Sorry, I meant
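On the question of visualizing the FX graph, a small sketch of one common approach (not necessarily the workflow used for this model) is to symbolically trace the module and print the resulting torch.fx graph:

```python
import torch
import torch.fx

class Toy(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x) + 1

# symbolic_trace produces a torch.fx.GraphModule whose graph can be dumped.
gm = torch.fx.symbolic_trace(Toy())
print(gm.graph)           # node-by-node textual listing
gm.graph.print_tabular()  # tabular view; needs the optional `tabulate` package
```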
@ramiro050 @mariecwhite Any update on this one?
I haven't had cycles to look into this. I'll try and look into it next week.
What happened?
Getting the "CUDA requesting shared memory size larger than allowed size" error for many models recently.
Steps to reproduce your issue
Download https://storage.googleapis.com/iree-model-artifacts/pytorch/torch_models_20230321.784_1679461251/SD_VAE_MODEL/batch_1/linalg.mlir
Compile:
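A representative iree-compile invocation targeting the CUDA backend would look roughly like the following (the flags here are an assumption; the exact command from the report is not reproduced in this thread):

```shell
# Assumed flags for a CUDA-target compile of the downloaded linalg.mlir.
iree-compile --iree-hal-target-backends=cuda linalg.mlir -o sd_vae_cuda.vmfb
```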
What component(s) does this issue relate to?
Runtime
Version information
Based on IREE SHA c6092c4
Additional context
Also seeing this in SHARK: nod-ai/SHARK-Studio#1243