
hal.fence.join: OUT_OF_RANGE; count 90 of iree_hal_fence > 32 #13543

Closed
silvasean opened this issue May 10, 2023 · 9 comments · Fixed by #13631
Labels: bug 🐞 Something isn't working, collab/nvidia

Comments

@silvasean (Contributor) commented May 10, 2023

What happened?

Using ir19.flagfile and ir19.no_sharding.mlir, run the following commands:

iree-compile --iree-hal-target-backends=cuda --iree-input-type=mhlo --iree-hal-cuda-llvm-target-arch=sm_80 ir19.no_sharding.mlir --iree-codegen-llvmgpu-use-transform-dialect= --iree-codegen-llvmgpu-enable-transform-dialect-jit=false -o ir19.vmfb
iree-benchmark-module --function=main --module=ir19.vmfb --device=cuda --flagfile=ir19.flagfile --benchmark_repetitions=5

(The --iree-codegen-llvmgpu-use-transform-dialect= --iree-codegen-llvmgpu-enable-transform-dialect-jit=false flags are a workaround for #13419.)

I see the following error:

iree/runtime/src/iree/modules/hal/module.c:1119: OUT_OF_RANGE; count 90 of iree_hal_fence > 32; while invoking native function hal.fence.join; while calling import; 
.....

(full log here)

Steps to reproduce your issue

See above

What component(s) does this issue relate to?

Runtime

Version information

iree.git @ 4d6c2b8 (I have a couple of trivial local modifications, but they should be unrelated to this bug)

Additional context

No response

@silvasean (Contributor, Author)

Is the solution here as simple as just bumping up the number from 32 to 128 or something like that?

@benvanik (Collaborator)

We could bump it up slightly as a whack-a-mole fix - it's guarding a stack allocation, though, and we'd want to keep it small and special-case anything larger with a heap allocation. I think the real issue is that you have 90 non-folded fences - those are (relatively) expensive and indicate that some earlier stage of the pipeline is being really inefficient. Run --compile-to=stream, post the IR, and we can see if it's obvious.

@allieculp
@pavanimajety Curious if there is any update on this?

@silvasean (Contributor, Author)

@benvanik here is the --compile-to=stream output you requested: https://gist.github.com/silvasean/1038fec28172cf61872cdd6e000523ad

@silvasean (Contributor, Author)

For context, this is an LLM training workload. The IR structure is the forward pass, backward pass, and then optimizer updates. The forward and backward passes each have their own stablehlo.while op. In the current IR, this actually only iterates once (not sure if we fold it away; I see a ton of control flow in the --compile-to=stream IR, more than I would expect for this). In the user workload this corresponds to microbatching (see the image here).

@allieculp
@pjannaty @pavanimajety Is this being worked on by your team?

@pavanimajety (Contributor)

@allieculp I am taking a look; I was out of office for the last few days.

@pjannaty (Contributor) commented May 15, 2023

Thanks, Allie and Pavani. Allie, this looks like a high-priority bug; we are gradually ramping up and don't mean to block anyone else who also wants to take a look. Happy to collaborate on this as well.

@benvanik (Collaborator)

The issue is FoldBlockArgumentsPattern not properly folding duplicate block arguments. I'll see if I can get that fixed.

@benvanik benvanik self-assigned this May 15, 2023
benvanik added a commit that referenced this issue May 16, 2023
The existing code would give up on particular args if multiple branch
sites had non-identical duplicate arg sets for that arg.

Fixes #13543.
benvanik added a commit that referenced this issue May 16, 2023
NatashaKnk pushed a commit to NatashaKnk/iree that referenced this issue Jul 6, 2023