
[FeatureRequest] Pad on broadcast dimensions are not well supported by codegen (both frontend / backend) #21

Closed
jjsjann123 opened this issue Mar 16, 2023 · 3 comments


@jjsjann123
Collaborator

jjsjann123 commented Mar 16, 2023

Background

This is a pretty involved topic. The original question arises from @jacobhinkle's comment about the lack of codegen support for padding broadcast dimensions. #10 (comment)
I had an offline discussion with @naoyam on this and am opening this issue to track the problem.

1. We need to support padding on broadcast dimensions. See the example below:

```
def fn0():
  t0 = torch.randn(1)  # how do we define t0's IterDomain?
  t1 = torch.randn(5)
  o1 = torch._C._nn.pad(t0, (2, 1))
  o2 = t0 + t1
```

t0's IterDomain needs to be marked as broadcast, since we need to resolve the output shape of o2 from t0 + t1, where t1 has a non-broadcast IterDomain.
Meanwhile, t0 is also being padded, so we can't really work around this without proper backend support. (See the runnable sketch after this list.)

2. Our current API seems to suggest that we support dynamic pad lengths.

   Fuser/csrc/ops/alias.cpp, lines 382 to 384 at 86d5dd3:

   ```
   // Padding widths are assumed to be non-negative. Currently there's no
   // validation.
   TensorView* pad(TensorView* inp, const std::vector<Val*>& pad_widths) {
   ```

   I doubt that we can actually support it.

```
def fn1(pad_left, pad_right):
  t0 = torch.empty([1])
  t1 = torch._C._nn.pad(t0, (pad_left, pad_right))  # how do we define t1's IterDomain?
```

In the example above, t0 is rank-1 with size 1, which means it comes with a broadcast IterDomain.

So the tricky part is what to do with t1: does t1 get a broadcast or a non-broadcast IterDomain? Well, that depends on the values of pad_left/pad_right!
If you are given a pad of 0 at runtime, the output t1 needs to have a broadcast IterDomain, while with non-zero padding we'll have a non-broadcast IterDomain. (Note: it can't be a broadcast with an expanded extent, since we are padding with an actual value here!)

3. The last point is trivial: codegen's pad currently assumes non-negative padding values, which is unfortunately not the case for PyTorch:

   ```
   >>> x = torch.randn(5)
   >>> o = torch._C._nn.pad(x, (-2, -2))
   >>> o
   tensor([-0.6133])
   ```
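To make points 1 and 2 concrete, here is a small standalone sketch using the public `torch.nn.functional.pad` (assumed here to behave the same as `torch._C._nn.pad` for this case). Whether the padded output of a size-1 dimension stays broadcastable depends entirely on the runtime pad widths:

```
import torch
import torch.nn.functional as F

t0 = torch.randn(1)   # size-1 dimension -> broadcast IterDomain
t1 = torch.randn(5)

# Point 1: t0 broadcasts against t1, so its dimension must be treated as
# broadcast, yet the same t0 is also an input to pad.
o2 = t0 + t1
assert o2.shape == (5,)

# Point 2: the same size-1 dimension, padded with runtime-dependent widths.
def fn1(pad_left, pad_right):
    return F.pad(t0, (pad_left, pad_right))

a = fn1(0, 0)   # shape (1,): still broadcastable against t1
b = fn1(2, 1)   # shape (4,): a concrete, non-broadcast dimension
print(a.shape, (a + t1).shape)   # torch.Size([1]) torch.Size([5])
print(b.shape)                   # torch.Size([4]); b + t1 would raise a shape error
```

At trace time, with symbolic pad_left/pad_right, both outcomes are still possible, which is exactly the ambiguity described above.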

Proposal

A few things we think are needed:

1. We should support pad on broadcast dimensions.
2. We should enforce non-negative pad lengths on the op and have the frontend transform pads with negative lengths into slice or other ops, to keep our lives simple (see the sketch below).
3. Maybe we should limit pad lengths to be static; that would make it straightforward to plumb support for pad on broadcast dimensions. (In combination with non-negative pad lengths, we would know at compile time whether the output IterDomain should be broadcast or not.)

Note that point 3 is not a decision we can make without looking at the models we want to support. cc'ing @kevinstephano: do you think static-size padding is sufficient for nanogpt for now?
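To illustrate proposal point 2: a negative constant pad is just a slice of the input, so a frontend rewrite along the following lines should be behavior-preserving. This is only a sketch of the intended transform (the helper `pad_last_dim` is hypothetical, not existing frontend code):

```
import torch
import torch.nn.functional as F

def pad_last_dim(x, pad_left, pad_right):
    # Hypothetical frontend rewrite: split a possibly-negative 1-D pad into
    # a slice (for the negative parts) followed by a non-negative pad.
    start = max(-pad_left, 0)
    stop = x.size(-1) - max(-pad_right, 0)
    sliced = x.narrow(-1, start, stop - start)
    return F.pad(sliced, (max(pad_left, 0), max(pad_right, 0)))

x = torch.randn(5)
assert torch.equal(pad_last_dim(x, -2, -2), F.pad(x, (-2, -2)))  # pure slice
assert torch.equal(pad_last_dim(x, 2, -1), F.pad(x, (2, -1)))    # mixed
assert torch.equal(pad_last_dim(x, 1, 3), F.pad(x, (1, 3)))      # pure pad
```

With such a rewrite in place, the pad op itself only ever sees non-negative widths, which is the invariant the current codegen already assumes.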

Tracking things:

  • is static pad length sufficient for our initial support?
  • does the same problem translate to cat / slice?
@jjsjann123 jjsjann123 changed the title Pad on broadcast dimensions are not well supported by codegen (both frontend / backend) [FeatureRequest] Pad on broadcast dimensions are not well supported by codegen (both frontend / backend) Mar 16, 2023
@jacobhinkle
Collaborator

It seems like this really calls for inferring a property of that IterDomain based on later uses within the Fusion, which is a topic that also comes up with dynamic shapes for other ops like view. For instance, we might mark the ID as "MaybeBroadcast"; then, when it is used later, we would infer some potentially complicated relations between the inputs and the sizes and types of the IDs. For pointwise binary ops, for example, the matching domains would imply that they match or that at least one is a broadcast. Once we have processed the whole graph, we can try to satisfy those relations and proceed as usual. So this issue is related to IterDomain graphs, but with the new wrinkle that we may need to change the graph by changing whether an ID is marked as broadcast.

@naoyam
Collaborator

naoyam commented Mar 16, 2023

> It seems like this really calls for inferring a property of that IterDomain based on later uses within the Fusion, which is a topic that also comes up with dynamic shapes for other ops like view. For instance, we might mark the ID as "MaybeBroadcast"; then, when it is used later, we would infer some potentially complicated relations between the inputs and the sizes and types of the IDs. For pointwise binary ops, for example, the matching domains would imply that they match or that at least one is a broadcast. Once we have processed the whole graph, we can try to satisfy those relations and proceed as usual. So this issue is related to IterDomain graphs, but with the new wrinkle that we may need to change the graph by changing whether an ID is marked as broadcast.

Exactly. Here's what I'm prototyping: https://github.com/NVIDIA/Fuser/compare/dynamic_reshape. I added a new IterType called Symbolic.
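For readers following along, here is a tiny standalone sketch of that idea. The name `Symbolic` mirrors the prototype's new IterType, but the resolution logic below is only an illustration of the concretization concept, not nvFuser's actual implementation:

```
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional, Tuple

class IterType(Enum):
    Iteration = auto()
    Broadcast = auto()
    Symbolic = auto()   # type not known until extents are concretized

@dataclass
class IterDomain:
    extent: Optional[int]   # None while unknown
    iter_type: IterType

def pad_output_id(inp: IterDomain,
                  pad_widths: Tuple[Optional[int], Optional[int]]) -> IterDomain:
    # Output ID of pad: Symbolic whenever the inputs don't pin it down.
    if inp.extent is None or None in pad_widths:
        return IterDomain(None, IterType.Symbolic)
    extent = inp.extent + pad_widths[0] + pad_widths[1]
    return IterDomain(extent, IterType.Broadcast if extent == 1 else IterType.Iteration)

def concretize(id_: IterDomain, extent: int) -> IterDomain:
    # Once the runtime extent is known, the Symbolic ID resolves one way or the other.
    assert id_.iter_type is IterType.Symbolic
    return IterDomain(extent, IterType.Broadcast if extent == 1 else IterType.Iteration)

t0_id = IterDomain(1, IterType.Broadcast)      # t0 = torch.randn(1)
t1_id = pad_output_id(t0_id, (None, None))     # dynamic pad widths -> Symbolic
print(concretize(t1_id, 1).iter_type)          # IterType.Broadcast  (zero padding)
print(concretize(t1_id, 4).iter_type)          # IterType.Iteration  (pad (2, 1))
```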

jacobhinkle added a commit that referenced this issue Mar 22, 2024
This introduces a thread-local global memory allocator for each device and uses it whenever an intermediate tensor that requires zero-initialization is needed.

To enable it, use `NVFUSER_ENABLE=reuse_zeroed_memory`. You can monitor the allocator using `NVFUSER_DUMP=global_zeroed_memory`.

Before we enable this feature by default, we need to have high confidence that every kernel using zero-initialized memory will always clean up its semaphores. This is currently only the case for serial grid reductions, as far as I know.

This enables the basic functionality of #1829. However, it does not
modify existing algorithms to clean up their memory. See
`NVFUSER_ENABLE=reuse_zeroed_memory NVFUSER_DUMP=global_zeroed_memory
build/nvfuser_tests --gtest_filter=SerialGridReductionTest.Scheduling`,
which succeeds when using serial grid reduction, but fails (in debug
mode) when using `gridReduce` (note that this test is updated to behave
differently in this PR):
```
# NVFUSER_ENABLE=reuse_zeroed_memory NVFUSER_DUMP=global_zeroed_memory build/nvfuser_tests --gtest_filter=SerialGridReductionTest.Scheduling                                                       
Running main() from /opt/pytorch/nvfuser/third_party/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = SerialGridReductionTest.Scheduling
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from SerialGridReductionTest
[ RUN      ] SerialGridReductionTest.Scheduling
[global zeroed memory] Resizing arena to 512 bytes
[global zeroed memory] Allocating byte range: 0 to 512 bytes
[global zeroed memory] Resetting allocated bytes to 0
[global zeroed memory] Allocating byte range: 0 to 512 bytes
[global zeroed memory] Resetting allocated bytes to 0
[global zeroed memory] Resizing arena to 16384 bytes
[global zeroed memory] Allocating byte range: 0 to 16384 bytes
[global zeroed memory] Resetting allocated bytes to 0
[global zeroed memory] Allocating byte range: 0 to 16384 bytes
unknown file: Failure
C++ exception with description "nnz.equal(0) INTERNAL ASSERT FAILED at "/opt/pytorch/nvfuser/csrc/global_allocator.cpp":88, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. Global memory arena was not properly zeroed. Found 2048 bytes that are not zero
Exception raised from checkZeroed at /opt/pytorch/nvfuser/csrc/global_allocator.cpp:88 (most recent call first):
frame #0: <unknown function> + 0x2fde9e (0x556cdb95de9e in build/nvfuser_tests)
frame #1: <unknown function> + 0x2fe0df (0x556cdb95e0df in build/nvfuser_tests)
frame #2: <unknown function> + 0x3f3720 (0x556cdba53720 in build/nvfuser_tests)
frame #3: <unknown function> + 0x3f33df (0x556cdba533df in build/nvfuser_tests)
frame #4: <unknown function> + 0x3f38ed (0x556cdba538ed in build/nvfuser_tests)
frame #5: <unknown function> + 0x315e67 (0x556cdb975e67 in build/nvfuser_tests)
frame #6: <unknown function> + 0x7c5780 (0x556cdbe25780 in build/nvfuser_tests)
frame #7: <unknown function> + 0x7c5877 (0x556cdbe25877 in build/nvfuser_tests)
frame #8: <unknown function> + 0x138f8cc (0x556cdc9ef8cc in build/nvfuser_tests)
frame #9: <unknown function> + 0x1457f0b (0x556cdcab7f0b in build/nvfuser_tests)
frame #10: <unknown function> + 0x14519fd (0x556cdcab19fd in build/nvfuser_tests)
frame #11: <unknown function> + 0x142de24 (0x556cdca8de24 in build/nvfuser_tests)
frame #12: <unknown function> + 0x142e93f (0x556cdca8e93f in build/nvfuser_tests)
frame #13: <unknown function> + 0x142f345 (0x556cdca8f345 in build/nvfuser_tests)
frame #14: <unknown function> + 0x143f86c (0x556cdca9f86c in build/nvfuser_tests)
frame #15: <unknown function> + 0x1458e98 (0x556cdcab8e98 in build/nvfuser_tests)
frame #16: <unknown function> + 0x1452ac7 (0x556cdcab2ac7 in build/nvfuser_tests)
frame #17: <unknown function> + 0x143de6d (0x556cdca9de6d in build/nvfuser_tests)
frame #18: <unknown function> + 0x1407ca0 (0x556cdca67ca0 in build/nvfuser_tests)
frame #19: <unknown function> + 0x1407c19 (0x556cdca67c19 in build/nvfuser_tests)
frame #20: <unknown function> + 0x29d90 (0x7f616c7d4d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #21: __libc_start_main + 0x80 (0x7f616c7d4e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #22: <unknown function> + 0x11e9d5 (0x556cdb77e9d5 in build/nvfuser_tests)
" thrown in the test body.

To reproduce: NVFUSER_TEST_RANDOM_SEED=1711120799 NVFUSER_TEST_ATEN_RANDOM_SEED=0 nvfuser_tests --gtest_filter='SerialGridReductionTest.Scheduling'
[  FAILED  ] SerialGridReductionTest.Scheduling (5669 ms)
[----------] 1 test from SerialGridReductionTest (5669 ms total)
```
This test runs with serial grid reduction, then with `gridReduce`. Each
time it runs two grid reductions. Both serial grid reductions succeed
because the semaphore buffer is properly zeroed. The `gridReduce` succeeds the first time since the memory pool calls `at::zeros` again to request a larger buffer size (`gridReduce` requires more semaphores, since there is one per thread segment vs. one per block segment). However, the second call to `gridReduce` fails because it has not cleaned up its semaphores. Hacking that function to force `PERSISTENT=1` would clean up the semaphores, resulting in success in this case. I'm leaving that kind of modification for a follow-up.
@jacobhinkle
Collaborator

This functionality has existed for some time now.
