[FeatureRequest] codegen reshape/view on python API #22

Open
jjsjann123 opened this issue Mar 16, 2023 · 1 comment
jjsjann123 (Collaborator)

Background

reshape/view in nvfuser doesn't imply a memory alias, so we'll refer to the operation as reshape throughout this issue to keep the conversation simple and accurate.

nvfuser implements reshape by translating it into a series of keep, merge, and split transformations:

Fuser/csrc/ops/alias.cpp

Lines 20 to 63 in 86d5dd3

```cpp
//! Transform TensorView according to keep, merge, and split transformations.
//! Squeeze and broadcast transformations are handled separately.
//! It is recommended to use the composite ops view function, which will call
//! the analyzeView function to generate the appropriate transformations.
//!
//! For example:
//! original_sizes = [2, 10, 40]
//! new_sizes = [2, 10, 2, 20]
//! auto analysis = analyzeView(TV0, original_sizes, new_sizes)
//! auto TV1 = TV0->view(analysis.transforms);
//!
//! Transforms = [(Keep I0), (Keep I1), (Split I2 by 2)]
//! Before: TV0[I0, I1, I2]
//! After:  TV0[I0, I1, 2, ceilDiv(I2, 2)]
//!
//! orig_tv is the tensor view originally coming in from the user for the view
//! operation. This is the tensor view all of the view analysis is relative to.
//! View might be doing squeezes before sending into the view operation, so we
//! want the actual input to the view operation to be potentially after the
//! original view operation.
TensorView* applyViewTransforms(
    TensorView* orig_tv,
    TensorView* post_reduce_tv,
    const AnalyzeViewResult& view_analysis) {
  TORCH_INTERNAL_ASSERT(orig_tv != nullptr, "Input is invalid.");
  TORCH_INTERNAL_ASSERT(post_reduce_tv != nullptr, "Input is invalid.");
  TORCH_INTERNAL_ASSERT(
      !post_reduce_tv->hasComputeAt(),
      "Cannot modify rfactor domain after compute at has been set.");
  TORCH_INTERNAL_ASSERT(
      post_reduce_tv->nDims() > 0, "Tried to view a 0-dim TensorView");
  TORCH_INTERNAL_ASSERT(!view_analysis.transforms.empty());
  TensorView* consumer = IrBuilder::create<TensorView>(
      orig_tv->container(),
      orig_tv->domain()->view(view_analysis),
      orig_tv->getDataType().value());
  IrBuilder::create<ViewOp>(orig_tv->container(), consumer, post_reduce_tv);
  return consumer;
}
```
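To make the keep/merge/split decomposition concrete, here is a toy Python sketch of the analysis. This is not nvfuser's actual `analyzeView`: it ignores squeeze/broadcast handling and assumes each old axis maps to a contiguous run of new axes.

```python
import math

def view_transforms(old_shape, new_shape):
    """Greedily decompose a reshape into keep/merge/split steps.

    Toy stand-in for an analyzeView-style pass: no squeeze/broadcast
    handling, and extents must factor contiguously.
    """
    assert math.prod(old_shape) == math.prod(new_shape)
    old = list(old_shape)  # evolves as splits/merges are applied
    transforms = []
    i = j = 0
    while i < len(old) and j < len(new_shape):
        if old[i] == new_shape[j]:
            transforms.append(("keep", i))
            i += 1
            j += 1
        elif old[i] > new_shape[j]:
            # Split axis i into (factor, remainder); the leading factor
            # now matches the next requested extent.
            factor = new_shape[j]
            assert old[i] % factor == 0
            transforms.append(("split", i, factor))
            old[i : i + 1] = [factor, old[i] // factor]
            i += 1
            j += 1
        else:
            # Merge adjacent axes i and i+1, then retry against new_shape[j].
            transforms.append(("merge", i))
            old[i : i + 2] = [old[i] * old[i + 1]]
    return transforms
```

For the example in the comment above, `view_transforms([2, 10, 40], [2, 10, 2, 20])` yields keep, keep, split-by-2, and a trailing keep for the split remainder.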

nvfuser reshape support in TorchScript

Currently we rely on runtime checks to ensure that the reshape parsing, i.e. the `ViewOp` in the fusion, is still semantically correct. This works fine for our TorchScript integration, where a guard operator queries the backend API

```cpp
auto new_constraints = nvfuser::analyzeViewConstraint(
    tensor_sizes_int_vec, view_sizes_int_vec);
```

and rejects the fusion on a mismatch.
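The guard's behavior can be modeled with a short Python sketch. The names here are hypothetical stand-ins, not nvfuser's API, and this toy constraint is stricter than the real one (it keeps the full size vectors, whereas the real analysis records only the constraints the decomposition depends on):

```python
def analyze_view_constraint(input_sizes, view_sizes):
    # Hypothetical stand-in for nvfuser::analyzeViewConstraint. The real
    # analysis records only what the split/merge decomposition depends on;
    # this toy version keeps the full size vectors, so it is stricter.
    return (tuple(input_sizes), tuple(view_sizes))

def guard_passes(cached_constraint, runtime_input_sizes, runtime_view_sizes):
    # TorchScript-style guard: re-run the analysis on runtime sizes and
    # fall back (reject the cached fusion) on any mismatch.
    return analyze_view_constraint(
        runtime_input_sizes, runtime_view_sizes) == cached_constraint
```

A fusion cached for `[2, 10, 40] -> [2, 10, 2, 20]` passes the guard for the same runtime sizes and is rejected for different ones.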

python API and cache

This workflow is harder to support in our Python integration. There are a few reasons:

  1. The lack of shape inference in our Python API makes it tricky to validate runtime tensor shapes against reshape ops.
  2. The FusionRecord design assumes that each leaf node in the trie structure corresponds to a single, unique fusion object. If a reshape node in FusionRecord could be lowered to different fusions depending on input shapes, supporting that would require some nasty patching of the design. cc'ing @kevinstephano @jacobhinkle for reference.
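To illustrate the assumption in item 2, here is a toy model of a trie-backed cache. Class and key names are illustrative only, not nvfuser's actual `FusionCache`/`FusionRecord` types:

```python
class TrieNode:
    """Each recorded op descends one edge; a leaf holds one compiled fusion."""

    def __init__(self):
        self.children = {}
        self.fusion = None  # set when this leaf's fusion is compiled

    def record(self, op_keys):
        # Descend (creating nodes as needed) and return the leaf.
        node = self
        for key in op_keys:
            node = node.children.setdefault(key, TrieNode())
        return node

cache = TrieNode()
# The same record sequence always reaches the same leaf, even though two
# calls with different runtime input shapes may need different lowerings:
leaf_a = cache.record([("reshape", (6, 4))])
leaf_b = cache.record([("reshape", (6, 4))])
assert leaf_a is leaf_b
# Folding the inferred input shape into the key restores the
# one-leaf-one-fusion invariant:
leaf_c = cache.record([("reshape", (2, 3, 4), (6, 4))])
assert leaf_c is not leaf_a
```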

current plan

IIUC, we are moving forward with more plumbing to support our reshape logic in the Python API. A few ongoing items (cc'ing @csarofeen @naoyam for reference):

  • @naoyam is working on APIs to make expression evaluation accessible from the Python API, so we'll be able to infer input shapes to reshape ops.
  • We are plumbing nvfuser::analyzeViewConstraint into our cache system, so that we can use the inferred shapes to pick the right fusion object.

A lot of refactoring needs to happen for this new workflow to work. It feels like we are doing quite a lot of plumbing on both the codegen and the Python API side just to mimic a reshape op in the codegen.
But in the end, we are not doing anything more than a decomposition, and a decomposition would be much easier to perform and validate at program acquisition time. IIUC, the missing piece that stops us from doing that is shape inference in our integration.

I know this is mostly a design decision, and we are pushing to expose nvfuser expression evaluation in client-facing APIs. Still, I'm not sure we can expect expression evaluation to replace a shape inference mechanism in our integration, both because nvfuser op coverage is limited and because of the awkward program flow in which expression evaluation only becomes available after we have a fusion IR.
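For what it's worth, the shape inference needed at program acquisition time is small for reshape itself. A PyTorch-style sketch (my own illustration, not nvfuser code), where at most one -1 is inferred from the element count:

```python
import math

def infer_reshape_shape(in_shape, requested):
    """Resolve a reshape's output shape at trace time.

    PyTorch-style semantics: at most one -1, inferred from the element
    count; the remaining extents must divide the element count evenly.
    """
    numel = math.prod(in_shape)
    requested = list(requested)
    if requested.count(-1) > 1:
        raise ValueError("only one dimension can be inferred")
    if -1 in requested:
        known = math.prod(d for d in requested if d != -1)
        if known == 0 or numel % known != 0:
            raise ValueError(f"cannot reshape {in_shape} into {requested}")
        requested = [numel // known if d == -1 else d for d in requested]
    if math.prod(requested) != numel:
        raise ValueError(f"cannot reshape {in_shape} into {requested}")
    return tuple(requested)
```

With this kind of inference available at acquisition time, a reshape record could be resolved to concrete extents before any fusion IR exists.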

naoyam (Collaborator) commented Apr 28, 2023

Linking related PRs

I think the remaining task is mostly on the frontend side.

jacobhinkle added a commit that referenced this issue Mar 22, 2024
This introduces a thread-local global memory allocator for each device
and uses it whenever an intermediate tensor requires zero-initialization.

To enable it, use `NVFUSER_ENABLE=reuse_zeroed_memory`. You can monitor the
allocator using `NVFUSER_DUMP=global_zeroed_memory`.

Before we enable this feature by default, we need to have high
confidence that every kernel using zero-initialized memory will always
clean up its semaphores. This is currently only the case for serial
grid reductions, as far as I know.

This enables the basic functionality of #1829. However, it does not
modify existing algorithms to clean up their memory. See
`NVFUSER_ENABLE=reuse_zeroed_memory NVFUSER_DUMP=global_zeroed_memory
build/nvfuser_tests --gtest_filter=SerialGridReductionTest.Scheduling`,
which succeeds when using serial grid reduction, but fails (in debug
mode) when using `gridReduce` (note that this test is updated to behave
differently in this PR):
```
# NVFUSER_ENABLE=reuse_zeroed_memory NVFUSER_DUMP=global_zeroed_memory build/nvfuser_tests --gtest_filter=SerialGridReductionTest.Scheduling                                                       
Running main() from /opt/pytorch/nvfuser/third_party/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = SerialGridReductionTest.Scheduling
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from SerialGridReductionTest
[ RUN      ] SerialGridReductionTest.Scheduling
[global zeroed memory] Resizing arena to 512 bytes
[global zeroed memory] Allocating byte range: 0 to 512 bytes
[global zeroed memory] Resetting allocated bytes to 0
[global zeroed memory] Allocating byte range: 0 to 512 bytes
[global zeroed memory] Resetting allocated bytes to 0
[global zeroed memory] Resizing arena to 16384 bytes
[global zeroed memory] Allocating byte range: 0 to 16384 bytes
[global zeroed memory] Resetting allocated bytes to 0
[global zeroed memory] Allocating byte range: 0 to 16384 bytes
unknown file: Failure
C++ exception with description "nnz.equal(0) INTERNAL ASSERT FAILED at "/opt/pytorch/nvfuser/csrc/global_allocator.cpp":88, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. Global memory arena was not properly zeroed. Found 2048 bytes that are not zero
Exception raised from checkZeroed at /opt/pytorch/nvfuser/csrc/global_allocator.cpp:88 (most recent call first):
frame #0: <unknown function> + 0x2fde9e (0x556cdb95de9e in build/nvfuser_tests)
frame #1: <unknown function> + 0x2fe0df (0x556cdb95e0df in build/nvfuser_tests)
frame #2: <unknown function> + 0x3f3720 (0x556cdba53720 in build/nvfuser_tests)
frame #3: <unknown function> + 0x3f33df (0x556cdba533df in build/nvfuser_tests)
frame #4: <unknown function> + 0x3f38ed (0x556cdba538ed in build/nvfuser_tests)
frame #5: <unknown function> + 0x315e67 (0x556cdb975e67 in build/nvfuser_tests)
frame #6: <unknown function> + 0x7c5780 (0x556cdbe25780 in build/nvfuser_tests)
frame #7: <unknown function> + 0x7c5877 (0x556cdbe25877 in build/nvfuser_tests)
frame #8: <unknown function> + 0x138f8cc (0x556cdc9ef8cc in build/nvfuser_tests)
frame #9: <unknown function> + 0x1457f0b (0x556cdcab7f0b in build/nvfuser_tests)
frame #10: <unknown function> + 0x14519fd (0x556cdcab19fd in build/nvfuser_tests)
frame #11: <unknown function> + 0x142de24 (0x556cdca8de24 in build/nvfuser_tests)
frame #12: <unknown function> + 0x142e93f (0x556cdca8e93f in build/nvfuser_tests)
frame #13: <unknown function> + 0x142f345 (0x556cdca8f345 in build/nvfuser_tests)
frame #14: <unknown function> + 0x143f86c (0x556cdca9f86c in build/nvfuser_tests)
frame #15: <unknown function> + 0x1458e98 (0x556cdcab8e98 in build/nvfuser_tests)
frame #16: <unknown function> + 0x1452ac7 (0x556cdcab2ac7 in build/nvfuser_tests)
frame #17: <unknown function> + 0x143de6d (0x556cdca9de6d in build/nvfuser_tests)
frame #18: <unknown function> + 0x1407ca0 (0x556cdca67ca0 in build/nvfuser_tests)
frame #19: <unknown function> + 0x1407c19 (0x556cdca67c19 in build/nvfuser_tests)
frame #20: <unknown function> + 0x29d90 (0x7f616c7d4d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #21: __libc_start_main + 0x80 (0x7f616c7d4e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #22: <unknown function> + 0x11e9d5 (0x556cdb77e9d5 in build/nvfuser_tests)
" thrown in the test body.

To reproduce: NVFUSER_TEST_RANDOM_SEED=1711120799 NVFUSER_TEST_ATEN_RANDOM_SEED=0 nvfuser_tests --gtest_filter='SerialGridReductionTest.Scheduling'
[  FAILED  ] SerialGridReductionTest.Scheduling (5669 ms)
[----------] 1 test from SerialGridReductionTest (5669 ms total)
```
This test runs with serial grid reduction, then with `gridReduce`. Each
time it runs two grid reductions. Both serial grid reductions succeed
because the semaphore buffer is properly zeroed. The `gridReduce`
succeeds the first time since the memory pool calls `at::zeros` again to
request a larger buffer size (`gridReduce` requires more semaphores
since there is one per thread segment vs. one per block segment).
However, the second call to `gridReduce` fails because it has not
cleaned up its semaphores. Hacking that function to force
`PERSISTENT=1` would clean up the semaphores, resulting in success in
this case. I'm leaving those kinds of modifications for a follow-up.
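The allocator's contract can be modeled with a short Python sketch. This is illustrative only (the real implementation lives in `csrc/global_allocator.cpp` and hands out CUDA tensors): memory is zeroed when the arena grows, and reuse after `reset()` relies on kernels having returned their semaphores to zero, which a `checkZeroed`-style validation enforces.

```python
class ZeroedArena:
    """Toy model of the reuse_zeroed_memory arena."""

    def __init__(self):
        self.buf = bytearray()  # bytearray is zero-filled on creation
        self.offset = 0

    def allocate(self, nbytes):
        if self.offset + nbytes > len(self.buf):
            # Grow (at least doubling) with fresh zeroed storage; prior
            # allocations from this arena are invalidated, as in a resize.
            self.buf = bytearray(max(2 * len(self.buf), self.offset + nbytes))
        start = self.offset
        self.offset += nbytes
        return memoryview(self.buf)[start : self.offset]

    def reset(self):
        # "Resetting allocated bytes to 0": reuse the buffer without
        # re-zeroing, so fail loudly if a kernel left nonzero bytes behind.
        if any(self.buf):
            raise RuntimeError("global memory arena was not properly zeroed")
        self.offset = 0
```

A well-behaved kernel zeroes its semaphores before the arena is reset; a kernel that leaves them set trips the check, mirroring the failing `gridReduce` case above.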