
slice performance: Horizontal fusion based on slice of an input tensor results in segmentation #58

Closed
kevinstephano opened this issue Mar 22, 2023 · 1 comment

@kevinstephano
Collaborator

In a use case from nanoGPT, the activations from the input linear layers of multi-head attention are split three ways. This should produce a single horizontal fusion containing 3 parallel sequences of slice+reshape+permute. Instead, nvFuser segments the fusion into 6 kernels, which is not great.
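
For context, here is a minimal PyTorch sketch of the eager-mode pattern this fusion corresponds to (shapes match the repro below; the variable names are illustrative rather than copied from nanoGPT):

import torch

B, T, C, n_head = 16, 128, 1024, 16          # batch, sequence, embedding, heads
x = torch.randn(B, T, 3 * C, device='cuda')  # output of the fused QKV input linear

# Each of q, k, v is produced by a slice + reshape (view) + permute (transpose),
# i.e. the three parallel branches in the fusion below.
q, k, v = x.split(C, dim=2)
q = q.view(B, T, n_head, C // n_head).transpose(1, 2)  # (16, 16, 128, 64)
k = k.view(B, T, n_head, C // n_head).transpose(1, 2)
v = v.view(B, T, n_head, C // n_head).transpose(1, 2)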

Repro:

import torch
from nvfuser import FusionDefinition, DataType

inputs = [
    torch.randn(16, 128, 3072, device='cuda'),
]

def nvfuser_fusion(fd: FusionDefinition) -> None:
    T0 = fd.from_pytorch(inputs[0])
    # Split the last dimension into three 1024-wide slices.
    T0_slice1 = fd.ops.slice(T0, [0, 0, 0], [16, 128, 1024], [1, 1, 1])
    T0_slice2 = fd.ops.slice(T0, [0, 0, 1024], [16, 128, 2048], [1, 1, 1])
    T0_slice3 = fd.ops.slice(T0, [0, 0, 2048], [16, 128, 3072], [1, 1, 1])
    # Reshape each slice to (batch, seq, heads, head_dim).
    T1_slice1 = fd.ops.reshape(T0_slice1, [16, 128, 1024], [16, 128, 16, 64])
    T1_slice2 = fd.ops.reshape(T0_slice2, [16, 128, 1024], [16, 128, 16, 64])
    T1_slice3 = fd.ops.reshape(T0_slice3, [16, 128, 1024], [16, 128, 16, 64])
    # Permute to (batch, heads, seq, head_dim).
    T2_slice1 = fd.ops.permute(T1_slice1, [0, 2, 1, 3])
    T2_slice2 = fd.ops.permute(T1_slice2, [0, 2, 1, 3])
    T2_slice3 = fd.ops.permute(T1_slice3, [0, 2, 1, 3])
    fd.add_output(T2_slice1)
    fd.add_output(T2_slice2)
    fd.add_output(T2_slice3)

with FusionDefinition() as fd:
    nvfuser_fusion(fd)

out = fd.execute(inputs)

Nsys cmd:

nsys nvprof --print-gpu-trace python test.py

Nsys output:

Start (ns)  Duration (ns)  CorrId  GrdX  GrdY  GrdZ  BlkX  BlkY  BlkZ  Reg/Trd  StcSMem (MB)  DymSMem (MB)  Bytes (MB)  Throughput (MBps)  SrcMemKd  DstMemKd         Device         Ctx  Strm                                                  Name                                                
 ----------  -------------  ------  ----  ----  ----  ----  ----  ----  -------  ------------  ------------  ----------  -----------------  --------  --------  --------------------  ---  ----  ----------------------------------------------------------------------------------------------------
 1307711678          23104     146   912     1     1   256     1     1       40         0.000         0.000                                                     NVIDIA H100 PCIe (0)    1     7  void at::native::<unnamed>::distribution_elementwise_grid_stride_kernel<float, (int)4, void at::nat…
 1499928416           6560     270   256    16     1   128     1     1       20         0.000         0.000                                                     NVIDIA H100 PCIe (0)    1     7  CudaCodeGen::kernel1(CudaCodeGen::Tensor<float, (int)3>, CudaCodeGen::Tensor<float, (int)3>)        
 1648504226           6048     311   256    16     1   128     1     1       20         0.000         0.000                                                     NVIDIA H100 PCIe (0)    1     7  CudaCodeGen::kernel2(CudaCodeGen::Tensor<float, (int)3>, CudaCodeGen::Tensor<float, (int)3>)        
 1796967171           7936     356   256    16     1   128     1     1       20         0.000         0.000                                                     NVIDIA H100 PCIe (0)    1     7  CudaCodeGen::kernel3(CudaCodeGen::Tensor<float, (int)3>, CudaCodeGen::Tensor<float, (int)3>)        
 1949040639          11680     397    16   256     1   128     1     1       16         0.000         0.000                                                     NVIDIA H100 PCIe (0)    1     7  CudaCodeGen::kernel4(CudaCodeGen::Tensor<float, (int)3>, CudaCodeGen::Tensor<float, (int)4>)        
 2101463421          11713     442    16   256     1   128     1     1       16         0.000         0.000                                                     NVIDIA H100 PCIe (0)    1     7  CudaCodeGen::kernel5(CudaCodeGen::Tensor<float, (int)3>, CudaCodeGen::Tensor<float, (int)4>)        
 2253434746          11744     483    16   256     1   128     1     1       16         0.000         0.000                                                     NVIDIA H100 PCIe (0)    1     7  CudaCodeGen::kernel6(CudaCodeGen::Tensor<float, (int)3>, CudaCodeGen::Tensor<float, (int)4>)
naoyam added a commit that referenced this issue Mar 23, 2023
Previously, fusions like
[this](https://github.com/NVIDIA/Fuser/pull/60/files#diff-a8f5333aa3f2d21440b3cea429bb2a588ed583f4d05486063ef1dc1a30996df9R2411)
were segmented due to a limitation of `DomainMap`.

There appears to be no impact on the existing tests and benchmarks: no
failures, and the CUDA kernels generated by the benchmarks, dumped and
compared before and after this PR, are unchanged.

This is part of the fix for #58

---------

Co-authored-by: Gao, Xiang <qasdfgtyuiop@gmail.com>
naoyam added a commit that referenced this issue Mar 24, 2023
Resize ops are not replayed, so they don't need to be exactly mapped

Previously, `FusionSliceForNanoGPT3_CUDA` was segmented because the `resize`
ops are not exactly mapped, since they have different expansion
arguments. As those `resize` ops are part of rfactor transformations,
they were detected as conflicting rfactor transformations. However,
unlike the `split` and `merge` ops used by `reshape`, `resize` ops are not
replayed, so they don't need to be uniform.

This is also part of the fix for #58. It looks like the Python example is
no longer segmented, although I suspect there is still something to be
done for `permute`.
@naoyam
Collaborator

naoyam commented Mar 24, 2023

I'm going to close this issue as the repro is no longer segmented after #64. I haven't looked at detailed performance profiles, but here's the result of running the repro with PYTORCH_NVFUSER_DUMP=dump_eff_bandwidth on an A100 80GB:

kernel1 run in 0.041984 ms, achieved: 1198.83 GB/s
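
As a rough sanity check (assuming the kernel reads the fp32 input once and writes the three fp32 outputs once, with no other traffic):

elems = 16 * 128 * 3072            # input elements; the three outputs total the same count
bytes_moved = 2 * elems * 4        # read input + write outputs, fp32
time_s = 0.041984e-3               # reported kernel time in seconds
print(bytes_moved / time_s / 1e9)  # ~1198.8 GB/s, consistent with the dump above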
