[SYCL] Poor Performance - Very Low bandwidth of SYCL kernel for LRN #8292
Comments
I have a question: what is the performance of the CUDA kernel for LRN?
CUDA performance was measured for the following configurations (size = 5, ndims = 5 in all cases):
N = 2, C = 15, D = 10, H = 16, W = 16
N = 6, C = 150, D = 100, H = 160, W = 160
N = 2, C = 150, D = 100, H = 160, W = 160
N = 5, C = 150, D = 100, H = 160, W = 160
We are attaching the source code below:
I rewrote your host programs somewhat to change the data types of some variables and the grid size for the CUDA kernel. Then I commented out all the code related to lrn_fwd and evaluated only the performance of lrn_bwd. The optimization option is "-O3" for both compilers. For the large problem size you mentioned, the global work size is larger than the largest number representable by an int, so I think the option "-fno-sycl-id-queries-fit-in-int" is needed. I observe that the SYCL kernel "lrn_bwd_kernel" takes 9.5 s and the CUDA kernel takes 0.6 s on a V100 GPU. The performance gap is significant. The CUDA kernel uses 64 registers and the SYCL kernel uses 80.
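For reference, the compile line from the Reproduce section below with that option added would look like this (a sketch, assuming the same target triple):

clang++ -O3 -fsycl -fsycl-targets=nvptx64-nvidia-cuda -fno-sycl-id-queries-fit-in-int lrn.cpp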
As you noted, the performance gap between the SYCL and CUDA kernels is significant. Do you have any suggestions on how to improve the SYCL kernel so that its execution time matches the CUDA version?
Is there a license needed for your original example? After observing a performance gap between the CUDA and HIP kernels, I would like to report the issue to ROCm too.
As per my understanding, no license is needed for the SYCL reproducer. Currently we are focusing only on enhancing the performance of the SYCL kernel to achieve higher bandwidth.
I have a question about "lrn_bwd_kernel". When the value of "across_channel" is one, is the value of "channel" also expected to be one? The kernel consumes many registers, so it may be worth splitting it into two kernels, one of which is selected based on the boolean value of "channel"; see the sketch below.
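A minimal sketch of that idea, assuming the flag can be lifted into a compile-time template parameter (all names and the one-line kernel bodies are illustrative placeholders, not the reproducer's actual code): with "if constexpr", the compiler emits two specialized kernels, and each carries only the registers its own branch needs.

#include <sycl/sycl.hpp>

template <bool AcrossChannel>
void launch_lrn_bwd(sycl::queue &q, float *diff_src, const float *diff_dst, size_t n) {
  q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
    if constexpr (AcrossChannel) {
      diff_src[i] = diff_dst[i] * 2.0f;  // placeholder for the across-channel math
    } else {
      diff_src[i] = diff_dst[i] * 0.5f;  // placeholder for the within-channel math
    }
  }).wait();
}

int main() {
  sycl::queue q;
  const size_t n = 1024;
  float *diff_src = sycl::malloc_shared<float>(n, q);
  float *diff_dst = sycl::malloc_shared<float>(n, q);
  q.fill(diff_dst, 1.0f, n).wait();
  const bool across_channel = true;  // runtime flag, as in the reproducer
  if (across_channel)                // dispatch once on the host, not per work-item
    launch_lrn_bwd<true>(q, diff_src, diff_dst, n);
  else
    launch_lrn_bwd<false>(q, diff_src, diff_dst, n);
  sycl::free(diff_src, q);
  sycl::free(diff_dst, q);
}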
Describe the bug
This reproducer was created to help enhance the performance, i.e., achieve higher bandwidth, of the SYCL implementation of the LRN primitive on NVIDIA GPUs.
The reproducer computes the LRN algorithm in forward propagation and then calculates the memory bandwidth; similarly, the LRN algorithm is computed for backward propagation and its memory bandwidth is calculated. A sketch of this measurement pattern follows below.
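For context, here is a minimal sketch of the timing and bandwidth-calculation pattern described above. The copy kernel, byte accounting, and names are illustrative placeholders, not the reproducer's actual LRN kernels; the real logic lives in lrn.cpp.

#include <sycl/sycl.hpp>
#include <chrono>
#include <cstdio>

int main() {
  sycl::queue q;
  const size_t n = 1 << 26;  // element count; stand-in for N*C*D*H*W
  float *src = sycl::malloc_device<float>(n, q);
  float *dst = sycl::malloc_device<float>(n, q);
  q.fill(src, 1.0f, n).wait();  // initialize the input
  auto t0 = std::chrono::steady_clock::now();
  // placeholder kernel: one read and one write per element
  q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) { dst[i] = src[i]; }).wait();
  auto t1 = std::chrono::steady_clock::now();
  double sec = std::chrono::duration<double>(t1 - t0).count();
  double bytes = 2.0 * n * sizeof(float);  // bytes read + bytes written
  std::printf("Total time = %f sec\nTotal bandwidth = %f GB/s\n", sec, bytes / sec / 1e9);
  sycl::free(src, q);
  sycl::free(dst, q);
}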
Reproduce
For the reproducer code, refer to the attachments setup.sh, lrn.cpp, and lrn_kernels.hpp.
Go to the directory containing the reproducer files and run the script below to set up the environment.
source setup.sh
To compile, run:
clang++ -O3 -fsycl -fsycl-targets=nvptx64-nvidia-cuda lrn.cpp
The above generates the output file. To see the output bandwidth, run:
./a.out
Observed behaviour
For N = 2, C = 15, D = 10, H = 16, W = 16, size = 5, ndims = 5,
Propagation : Forward
Alg : LRN
Result,
Total time = 0.000588 sec
Total bandwidth = 1.044898 Gb/s
For N = 2, C = 150, D = 100, H = 160, W = 160, size = 5, ndims = 5,
Propagation : Backward
Alg : LRN
Result,
Total time = 0.688375 sec
Total bandwidth = 8.925368 Gb/s
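As a back-of-the-envelope check on the backward figure (assuming the reproducer counts four FP16 tensors of traffic, e.g. src, dst, diff_src, and diff_dst): N·C·D·H·W = 2·150·100·160·160 = 7.68×10^8 elements, so one FP16 tensor is about 1.536 GB and four tensors are about 6.144 GB; 6.144 GB / 0.688375 s ≈ 8.93 GB/s, which matches the reported number (so "Gb/s" in the output appears to mean gigabytes per second).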
Expected behavior
The ideal behavior is to attain the maximum bandwidth for any input size, i.e., the CLPeak value for Float16 of 251.94 GB/s.
For the current reproducer, a higher bandwidth of at least 100 GB/s is expected.
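For scale, using the traffic estimate above (an assumption about what the tool counts): at 100 GB/s the backward case's ~6.14 GB of traffic would take roughly 61 ms, and at the CLPeak value of 251.94 GB/s roughly 24 ms, versus the observed 688 ms.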
Environment
- OS: Ubuntu 22.04.1 LTS
- Target device and vendor: NVIDIA Tesla T4
- DPC++ version: clang version 15.0.0 (https://github.com/intel/llvm.git 0c7a1e1)
- Dependencies version: Driver Version 495.29.05, CUDA Version 11.5
Additional context
Currently N, C, D, H, W, size, and ndims are hard-coded; they can be changed as needed.
The source code of the reproducer is attached below:
LRN_Reproducer.zip