[SYCL] Poor Performance - Very Low bandwidth of SYCL kernel for LRN #8292

Open
nived98 opened this issue Feb 10, 2023 · 7 comments
Labels
bug (Something isn't working) · cuda (CUDA back-end) · performance (Performance related issues)

Comments

@nived98

nived98 commented Feb 10, 2023

Describe the bug
This reproducer was created to help improve performance, i.e. to achieve higher bandwidth for the SYCL implementation of the
LRN primitive on Nvidia.
The reproducer runs the LRN algorithm in forward propagation and measures the memory bandwidth, then does the same for backward propagation.
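
For reference, the reported figures are consistent with the usual bytes-moved-per-second calculation. Below is a minimal sketch of that calculation (illustrative only; the function name and byte accounting are assumptions, not code taken from lrn.cpp):

    // Illustrative sketch: bandwidth = total bytes transferred / elapsed time.
    #include <cstddef>
    #include <cstdio>

    double bandwidth_gbps(std::size_t bytes_read, std::size_t bytes_written,
                          double seconds) {
      // Reported in units of 1e9 bytes per second.
      return static_cast<double>(bytes_read + bytes_written) / seconds / 1e9;
    }

    int main() {
      // Assumption: forward LRN reads src and writes dst, each N*C*D*H*W floats.
      const std::size_t elems = 2ull * 15 * 10 * 16 * 16;  // N*C*D*H*W
      const std::size_t bytes = elems * sizeof(float);
      // With the forward timing reported below, this prints ~1.044898.
      std::printf("Total bandwidth = %f Gb/s\n",
                  bandwidth_gbps(bytes, bytes, 0.000588));
      return 0;
    }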

Reproduce

For the reproducer code, refer to the attached setup.sh, lrn.cpp and lrn_kernels.hpp.

  1. Go to the directory containing the reproducer files and run the script below to set up the environment.
    source setup.sh

  2. To compile, run:
    clang++ -O3 -fsycl -fsycl-targets=nvptx64-nvidia-cuda lrn.cpp

  3. The above produces the executable a.out. To see the measured bandwidth, run:
    ./a.out

Observed behaviour

For N = 2, C = 15, D = 10, H = 16, W = 16, size = 5, ndims = 5,
Propagation: Forward
Alg: LRN

Result:
Total time = 0.000588 sec
Total bandwidth = 1.044898 Gb/s

For N = 2, C = 150, D = 100, H = 160, W = 160, size = 5, ndims = 5,
Propagation: Backward
Alg: LRN

Result:
Total time = 0.688375 sec
Total bandwidth = 8.925368 Gb/s

Expected behavior
The ideal behavior is to attain the maximum bandwidth (i.e., the CLPeak value for Float16, 251.94 GBPS) for any input size.
For the current reproducer a higher bandwidth is expected, i.e., a minimum of 100 GBPS.

Environment

- OS: Ubuntu 22.04.1 LTS
- Target device and vendor: Nvidia Tesla T4
- DPC++ version: clang version 15.0.0 (https://github.com/intel/llvm.git 0c7a1e1)
- Dependencies version: Driver Version: 495.29.05, CUDA Version: 11.5

Additional context

Currently N, C, D, H, W, size and ndims are hard-coded and can be changed as needed.
The source code of the reproducer is attached below:
LRN_Reproducer.zip

@nived98 nived98 added the bug Something isn't working label Feb 10, 2023
@AlexeySachkov AlexeySachkov added performance Performance related issues cuda CUDA back-end labels Feb 10, 2023
@zjin-lcf
Contributor

I have a question. What is the performance of the CUDA kernel for LRN?

@nived98
Author

nived98 commented Feb 14, 2023

CUDA performance:

For N = 2, C = 15, D = 10, H = 16, W = 16, size = 5, ndims = 5,
Propagation: Forward
Alg: LRN

Result:
Total time = 0.000527 sec
Total bandwidth = 1.165844 Gb/s

For N = 6, C = 150, D = 100, H = 160, W = 160, size = 5, ndims = 5,
Propagation: Forward
Alg: LRN

Result:
Total time = 0.156514 sec
Total bandwidth = 117.765816 Gb/s

For N = 2, C = 150, D = 100, H = 160, W = 160, size = 5, ndims = 5,
Propagation: Backward
Alg: LRN

Result:
Total time = 22.727676 sec
Total bandwidth = 0.405497 Gb/s

For N = 5, C = 150, D = 100, H = 160, W = 160, size = 5, ndims = 5,
Propagation: Backward
Alg: LRN

Result:
Total time = 0.178457 sec
Total bandwidth = 129.106735 Gb/s


To compile,
nvcc -O3 lrn.cu

We are attaching the source code below:
cuda_lrn.zip

zjin-lcf pushed a commit to zjin-lcf/HeCBench that referenced this issue Feb 14, 2023
@zjin-lcf
Contributor

I rewrote your host programs somewhat, changing the data types of some variables and the grid size for the CUDA kernel. Then I commented out all the code related to lrn_fwd and evaluated only the performance of lrn_bwd.

The optimization option is "-O3" for both compilers. For the large problem size you mentioned, the global work size is larger than the largest value representable by an int, so I think the option "-fno-sycl-id-queries-fit-in-int" is needed.
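
For example, the compile command from the reproduction steps above would then become (a sketch, assuming the same lrn.cpp source file):

    clang++ -O3 -fsycl -fsycl-targets=nvptx64-nvidia-cuda -fno-sycl-id-queries-fit-in-int lrn.cpp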

I observe that the SYCL kernel "lrn_bwd_kernel" takes 9.5 s and the CUDA kernel takes 0.6 s on an NV100 GPU. The performance gap is significant. The CUDA kernel uses 64 registers and the SYCL kernel uses 80 registers.

@nived98
Author

nived98 commented Feb 15, 2023

As you noted, the performance gap between the SYCL and CUDA kernels is significant. Any suggestions on how to improve the SYCL kernel so that its execution time matches the CUDA version?

@zjin-lcf
Contributor

Is there some license needed for your original example? After observing a performance gap between the CUDA and HIP kernels, I would like to report the issue to ROCm too.

@nived98
Author

nived98 commented Feb 17, 2023

As per my understanding, no license is needed for the SYCL reproducer. Currently we are focusing only on enhancing the performance of the SYCL kernel to achieve higher bandwidth.

@zjin-lcf
Contributor

zjin-lcf commented Apr 9, 2023

I have a question about the "lrn_bwd_kernel". When the value of "across_channel" is one, is the value of "channel" also expected to be one?

The kernel consumes many registers, so it may be split into two kernels, one of which is selected based on the boolean value of "channel".

    bool across_channel = 1;
    auto Operation = [=](int64_t mb, int64_t c, int64_t d, int64_t h, int64_t w) {
      bool channel = 0;
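
A minimal sketch of that suggestion (illustrative only; the function names and kernel bodies below are placeholders, not the actual lrn_bwd_kernel from the reproducer): resolving the boolean on the host and launching one of two specialized kernels avoids carrying both code paths, and their registers, in a single kernel.

    #include <sycl/sycl.hpp>

    // Placeholder for the across-channel variant of the backward LRN body.
    void lrn_bwd_across(sycl::queue &q, float *diff_src, size_t n) {
      q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
        diff_src[i[0]] = 0.f;  // real across-channel computation goes here
      });
    }

    // Placeholder for the within-channel variant.
    void lrn_bwd_within(sycl::queue &q, float *diff_src, size_t n) {
      q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
        diff_src[i[0]] = 0.f;  // real within-channel computation goes here
      });
    }

    void lrn_bwd(sycl::queue &q, float *diff_src, size_t n, bool across_channel) {
      // Branch on the host instead of inside one large kernel,
      // so each compiled kernel only pays for the registers it needs.
      if (across_channel)
        lrn_bwd_across(q, diff_src, n);
      else
        lrn_bwd_within(q, diff_src, n);
      q.wait();
    }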
