
[FeatureRequest] Pad on broadcast dimensions are not well supported by codegen (both frontend / backend) #21

Closed
jjsjann123 opened this issue Mar 16, 2023 · 3 comments


@jjsjann123
Collaborator

jjsjann123 commented Mar 16, 2023

Background

This is a pretty involved topic. The original question arises from @jacobhinkle's comment about the lack of codegen support for padding broadcast dimensions. #10 (comment)
I had an offline discussion with @naoyam on this and am opening this issue to track the problem.

1. We need to support padding on broadcast dimensions. See the example below:

```
def fn0():
  t0 = torch.randn(1)  # how do we define t0's IterDomain?
  t1 = torch.randn(5)
  o1 = torch._C._nn.pad(t0, (2, 1))
  o2 = t0 + t1
```

t0's IterDomain needs to be marked as broadcast, since we need to resolve the output shape of o2 from t0 + t1, where t1 has a non-broadcast IterDomain.
Meanwhile, t0 is also being padded, so we can't really work around this without proper backend support. (See the runnable sketch after this list.)

2. Our current API seems to suggest that we support dynamic pad lengths.

   Fuser/csrc/ops/alias.cpp, lines 382 to 384 at 86d5dd3:

   ```
   // Padding widths are assumed to be non-negative. Currently there's no
   // validation.
   TensorView* pad(TensorView* inp, const std::vector<Val*>& pad_widths) {
   ```

   I doubt that we can actually support it.

```
def fn1(pad_left, pad_right):
  t0 = torch.empty([1])
  t1 = torch._C._nn.pad(t0, (pad_left, pad_right))  # how do we define t1's IterDomain?
```

In the example above, t0 is rank-1 with size 1, which means it comes with a broadcast IterDomain.

So the tricky part is what to do with t1: does t1 get a broadcast or a non-broadcast IterDomain? Well, that depends on the values of pad_left/pad_right!
If you are given a pad of 0 at runtime, the output t1 needs to have a broadcast IterDomain, while with non-zero padding we'll have a non-broadcast IterDomain. (Note: it can't be a broadcast with an expanded extent, since we are padding with an actual value here!)

3. The last point is trivial: codegen's pad currently assumes non-negative padding values, which is unfortunately not the case for PyTorch:

   ```
   >>> x = torch.randn(5)
   >>> o = torch._C._nn.pad(x, (-2, -2))
   >>> o
   tensor([-0.6133])
   ```
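To make points 1 and 2 concrete, here is a small standalone sketch using the public `torch.nn.functional.pad` (assumed here to behave the same as `torch._C._nn.pad` for this case). Whether the padded output of a size-1 dimension stays broadcastable depends entirely on the runtime pad widths:

```
import torch
import torch.nn.functional as F

t0 = torch.randn(1)   # size-1 dimension -> broadcast IterDomain
t1 = torch.randn(5)

# Point 1: t0 broadcasts against t1, so its dimension must be treated as
# broadcast, yet the same t0 is also an input to pad.
o2 = t0 + t1
assert o2.shape == (5,)

# Point 2: the same size-1 dimension, padded with runtime-dependent widths.
def fn1(pad_left, pad_right):
    return F.pad(t0, (pad_left, pad_right))

a = fn1(0, 0)   # shape (1,): still broadcastable against t1
b = fn1(2, 1)   # shape (4,): a concrete, non-broadcast dimension
print(a.shape, (a + t1).shape)   # torch.Size([1]) torch.Size([5])
print(b.shape)                   # torch.Size([4]); b + t1 would raise a shape error
```

At trace time, with symbolic pad_left/pad_right, both outcomes are still possible, which is exactly the ambiguity described above.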

Proposal

A few things we think are needed:

1. We should support pad on broadcast dimensions.
2. We should enforce non-negative pad lengths on the op and have the frontend transform pads with negative lengths into slice or other ops, to keep our lives simple (see the sketch below).
3. Maybe we should limit pad lengths to be static; that would make it straightforward to plumb support for pad on broadcast dimensions. (In combination with non-negative pad lengths, we would know at compile time whether the output IterDomain should be broadcast or not.)

Note that point 3 is not a decision we can make without looking at the models we want to support. cc'ing @kevinstephano: do you think static-size padding is sufficient for nanogpt for now?
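To illustrate proposal point 2: a negative constant pad is just a slice of the input, so a frontend rewrite along the following lines should be behavior-preserving. This is only a sketch of the intended transform (the helper `pad_last_dim` is hypothetical, not existing frontend code):

```
import torch
import torch.nn.functional as F

def pad_last_dim(x, pad_left, pad_right):
    # Hypothetical frontend rewrite: split a possibly-negative 1-D pad into
    # a slice (for the negative parts) followed by a non-negative pad.
    start = max(-pad_left, 0)
    stop = x.size(-1) - max(-pad_right, 0)
    sliced = x.narrow(-1, start, stop - start)
    return F.pad(sliced, (max(pad_left, 0), max(pad_right, 0)))

x = torch.randn(5)
assert torch.equal(pad_last_dim(x, -2, -2), F.pad(x, (-2, -2)))  # pure slice
assert torch.equal(pad_last_dim(x, 2, -1), F.pad(x, (2, -1)))    # mixed
assert torch.equal(pad_last_dim(x, 1, 3), F.pad(x, (1, 3)))      # pure pad
```

With such a rewrite in place, the pad op itself only ever sees non-negative widths, which is the invariant the current codegen already assumes.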

Tracking things:

  • is static pad length sufficient for our initial support?
  • does the same problem translate to cat / slice?
@jjsjann123 jjsjann123 changed the title Pad on broadcast dimensions are not well supported by codegen (both frontend / backend) [FeatureRequest] Pad on broadcast dimensions are not well supported by codegen (both frontend / backend) Mar 16, 2023
@jacobhinkle
Collaborator

It seems like this really calls for inferring a property of that IterDomain based on later uses within the Fusion, which is a topic that also comes up with dynamic shapes for other ops like view. For instance, we might mark the ID as "MaybeBroadcast"; then, when it is used later, we would infer some potentially complicated relations between the inputs and the sizes and types of the IDs. For pointwise binary ops, for example, the matching domains would imply that they match or that at least one is a broadcast. Once we have processed the whole graph, we can try to satisfy those relations and proceed as usual. So this issue is related to IterDomain graphs, but with the new wrinkle that we may need to change the graph by changing whether an ID is marked as broadcast.

@naoyam
Collaborator

naoyam commented Mar 16, 2023

> It seems like this really calls for inferring a property of that IterDomain based on later uses within the Fusion, which is a topic that also comes up with dynamic shapes for other ops like view. For instance, we might mark the ID as "MaybeBroadcast"; then, when it is used later, we would infer some potentially complicated relations between the inputs and the sizes and types of the IDs. For pointwise binary ops, for example, the matching domains would imply that they match or that at least one is a broadcast. Once we have processed the whole graph, we can try to satisfy those relations and proceed as usual. So this issue is related to IterDomain graphs, but with the new wrinkle that we may need to change the graph by changing whether an ID is marked as broadcast.

Exactly. Here's what I'm prototyping: https://github.com/NVIDIA/Fuser/compare/dynamic_reshape. I added a new IterType called Symbolic.
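For readers following along, here is a tiny standalone sketch of that idea. The name `Symbolic` mirrors the prototype's new IterType, but the resolution logic below is only an illustration of the concretization concept, not nvFuser's actual implementation:

```
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional, Tuple

class IterType(Enum):
    Iteration = auto()
    Broadcast = auto()
    Symbolic = auto()   # type not known until extents are concretized

@dataclass
class IterDomain:
    extent: Optional[int]   # None while unknown
    iter_type: IterType

def pad_output_id(inp: IterDomain,
                  pad_widths: Tuple[Optional[int], Optional[int]]) -> IterDomain:
    # Output ID of pad: Symbolic whenever the inputs don't pin it down.
    if inp.extent is None or None in pad_widths:
        return IterDomain(None, IterType.Symbolic)
    extent = inp.extent + pad_widths[0] + pad_widths[1]
    return IterDomain(extent, IterType.Broadcast if extent == 1 else IterType.Iteration)

def concretize(id_: IterDomain, extent: int) -> IterDomain:
    # Once the runtime extent is known, the Symbolic ID resolves one way or the other.
    assert id_.iter_type is IterType.Symbolic
    return IterDomain(extent, IterType.Broadcast if extent == 1 else IterType.Iteration)

t0_id = IterDomain(1, IterType.Broadcast)      # t0 = torch.randn(1)
t1_id = pad_output_id(t0_id, (None, None))     # dynamic pad widths -> Symbolic
print(concretize(t1_id, 1).iter_type)          # IterType.Broadcast  (zero padding)
print(concretize(t1_id, 4).iter_type)          # IterType.Iteration  (pad (2, 1))
```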

jacobhinkle added a commit that referenced this issue Mar 22, 2024
This introduces a thread-local global memory allocator for each device and uses it whenever an intermediate tensor that requires zero-initialization is needed.

To enable it, use `NVFUSER_ENABLE=reuse_zeroed_memory`. You can monitor the allocator using `NVFUSER_DUMP=global_zeroed_memory`.

Before we enable this feature by default, we need to have high confidence that every kernel using zero-initialized memory will always clean up its semaphores. This is currently only the case for serial grid reductions, as far as I know.

This enables the basic functionality of #1829. However, it does not
modify existing algorithms to clean up their memory. See
`NVFUSER_ENABLE=reuse_zeroed_memory NVFUSER_DUMP=global_zeroed_memory
build/nvfuser_tests --gtest_filter=SerialGridReductionTest.Scheduling`,
which succeeds when using serial grid reduction, but fails (in debug
mode) when using `gridReduce` (note that this test is updated to behave
differently in this PR):
```
# NVFUSER_ENABLE=reuse_zeroed_memory NVFUSER_DUMP=global_zeroed_memory build/nvfuser_tests --gtest_filter=SerialGridReductionTest.Scheduling                                                       
Running main() from /opt/pytorch/nvfuser/third_party/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = SerialGridReductionTest.Scheduling
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from SerialGridReductionTest
[ RUN      ] SerialGridReductionTest.Scheduling
[global zeroed memory] Resizing arena to 512 bytes
[global zeroed memory] Allocating byte range: 0 to 512 bytes
[global zeroed memory] Resetting allocated bytes to 0
[global zeroed memory] Allocating byte range: 0 to 512 bytes
[global zeroed memory] Resetting allocated bytes to 0
[global zeroed memory] Resizing arena to 16384 bytes
[global zeroed memory] Allocating byte range: 0 to 16384 bytes
[global zeroed memory] Resetting allocated bytes to 0
[global zeroed memory] Allocating byte range: 0 to 16384 bytes
unknown file: Failure
C++ exception with description "nnz.equal(0) INTERNAL ASSERT FAILED at "/opt/pytorch/nvfuser/csrc/global_allocator.cpp":88, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. Global memory arena was not properly zeroed. Found 2048 bytes that are not zero
Exception raised from checkZeroed at /opt/pytorch/nvfuser/csrc/global_allocator.cpp:88 (most recent call first):
frame #0: <unknown function> + 0x2fde9e (0x556cdb95de9e in build/nvfuser_tests)
frame #1: <unknown function> + 0x2fe0df (0x556cdb95e0df in build/nvfuser_tests)
frame #2: <unknown function> + 0x3f3720 (0x556cdba53720 in build/nvfuser_tests)
frame #3: <unknown function> + 0x3f33df (0x556cdba533df in build/nvfuser_tests)
frame #4: <unknown function> + 0x3f38ed (0x556cdba538ed in build/nvfuser_tests)
frame #5: <unknown function> + 0x315e67 (0x556cdb975e67 in build/nvfuser_tests)
frame #6: <unknown function> + 0x7c5780 (0x556cdbe25780 in build/nvfuser_tests)
frame #7: <unknown function> + 0x7c5877 (0x556cdbe25877 in build/nvfuser_tests)
frame #8: <unknown function> + 0x138f8cc (0x556cdc9ef8cc in build/nvfuser_tests)
frame #9: <unknown function> + 0x1457f0b (0x556cdcab7f0b in build/nvfuser_tests)
frame #10: <unknown function> + 0x14519fd (0x556cdcab19fd in build/nvfuser_tests)
frame #11: <unknown function> + 0x142de24 (0x556cdca8de24 in build/nvfuser_tests)
frame #12: <unknown function> + 0x142e93f (0x556cdca8e93f in build/nvfuser_tests)
frame #13: <unknown function> + 0x142f345 (0x556cdca8f345 in build/nvfuser_tests)
frame #14: <unknown function> + 0x143f86c (0x556cdca9f86c in build/nvfuser_tests)
frame #15: <unknown function> + 0x1458e98 (0x556cdcab8e98 in build/nvfuser_tests)
frame #16: <unknown function> + 0x1452ac7 (0x556cdcab2ac7 in build/nvfuser_tests)
frame #17: <unknown function> + 0x143de6d (0x556cdca9de6d in build/nvfuser_tests)
frame #18: <unknown function> + 0x1407ca0 (0x556cdca67ca0 in build/nvfuser_tests)
frame #19: <unknown function> + 0x1407c19 (0x556cdca67c19 in build/nvfuser_tests)
frame #20: <unknown function> + 0x29d90 (0x7f616c7d4d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #21: __libc_start_main + 0x80 (0x7f616c7d4e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #22: <unknown function> + 0x11e9d5 (0x556cdb77e9d5 in build/nvfuser_tests)
" thrown in the test body.

To reproduce: NVFUSER_TEST_RANDOM_SEED=1711120799 NVFUSER_TEST_ATEN_RANDOM_SEED=0 nvfuser_tests --gtest_filter='SerialGridReductionTest.Scheduling'
[  FAILED  ] SerialGridReductionTest.Scheduling (5669 ms)
[----------] 1 test from SerialGridReductionTest (5669 ms total)
```
This test runs with serial grid reduction, then with `gridReduce`. Each
time it runs two grid reductions. Both serial grid reductions succeed
because the semaphore buffer is properly zeroed. The `gridReduce` succeeds the first time since the memory pool calls `at::zeros` again to request a larger buffer size (`gridReduce` requires more semaphores, since there is one per thread segment vs. one per block segment). However, the second call to `gridReduce` fails because it has not cleaned up its semaphores. Hacking that function to force `PERSISTENT=1` would clean up the semaphores, resulting in success in this case. I'm leaving that kind of modification for a follow-up.
@jacobhinkle
Collaborator

This functionality has existed for some time now.
