sycl::double2 type degrades SYCL performance on NV GPU with additional generated memory instructions #6583

**Describe the bug**
Passing a `sycl::double2` value to a kernel function significantly degrades SYCL performance on an NV GPU: the compiler generates numerous additional memory instructions compared with the baseline CUDA implementation.

**Description**
1) Comparison: the CUDA SDK BlackScholes sample vs. the DPCT-migrated SYCL BlackScholes. Both run on an NV GPU (A100).
2) The DPCT-migrated SYCL version is more than 50% slower on the NV GPU than the baseline CUDA version (CUDA: 53 Goptions/s vs. SYCL: 22 Goptions/s).
3) [Discovered reason] Using `sycl::double2` generates many **additional** memory instructions in the LLVM IR. These are carried into the final NV binary (more details below) and degrade performance significantly.
4) [Discovered workaround] Insert `__attribute__((always_inline))` before the SYCL kernel function. SYCL performance then matches the CUDA implementation.
5) [What needs to be done] The PTX backend needs to optimize away the additional memory instructions (without the user having to add the inline attribute), and/or the optimized LLVM IR should be generated in the first place.
&ensp;

**To Reproduce**
`git clone https://github.com/sphblue/BlackScholes_From_CUDA_SDK_Samples_PublicVersion.git`

Default DPCT-migrated SYCL version
`cd BlackScholes_DPCT_Using_Default_Double2`
`clang++ -O2 -gline-tables-only -fsycl -fsycl-unnamed-lambda -fsycl-targets=nvptx64-nvidia-cuda  *.cpp -I/opt/intel/oneapi/dpcpp-ct/latest/include -o BlackScholes.dpct.nvgpu`
`./BlackScholes.dpct.nvgpu`

Fixed SYCL version (with the inline attribute)
`cd BlackScholes_DPCT_Using_Default_Double2_attribute`
`clang++ -O2 -gline-tables-only -fsycl -fsycl-unnamed-lambda -fsycl-targets=nvptx64-nvidia-cuda  *.cpp -I/opt/intel/oneapi/dpcpp-ct/latest/include -o BlackScholes.dpct.nvgpu.inlineattribute`
`./BlackScholes.dpct.nvgpu.inlineattribute`

Baseline CUDA version (for reference, not needed for this issue)
`cd BlackScholes_CUDA_Using_Default_Double2`
`make`
&ensp;


**Code**
Default DPCT migrated SYCL
![image](https://user-images.githubusercontent.com/90853374/184519250-17b7ee20-3da7-4662-8e4c-3643b08cbea8.png)

SYCL with inline attribute manually inserted
![image](https://user-images.githubusercontent.com/90853374/184519268-4c447f04-e55b-43ea-a74a-15baafdd9a52.png)
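Since the code above is only available as screenshots, here is a minimal text sketch of where the attribute goes. It is not the actual BlackScholes source: `Double2` is a plain-struct stand-in for `sycl::double2` so the snippet builds with an ordinary Clang/GCC host compiler, and `scaleKernelBody` and `launchScale` are hypothetical names. In the real code the attributed function is the one called from the `parallel_for` lambda.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for sycl::double2 (assumption: two packed doubles), used so the
// sketch builds without a SYCL toolchain.
struct Double2 {
    double x;
    double y;
};

// Hypothetical kernel body. The attribute is placed in front of the function
// that the kernel lambda calls, asking Clang/GCC to always inline it.
__attribute__((always_inline)) inline void scaleKernelBody(
    const Double2 *in, Double2 *out, std::size_t i, double factor) {
    Double2 v = in[i];  // read one two-element vector value
    v.x *= factor;
    v.y *= factor;
    out[i] = v;         // write one two-element vector value
}

// Host-side loop standing in for the SYCL parallel_for launch.
inline void launchScale(const Double2 *in, Double2 *out, std::size_t n,
                        double factor) {
    assert(in != nullptr && out != nullptr);
    for (std::size_t i = 0; i < n; ++i) {
        scaleKernelBody(in, out, i, factor);
    }
}
```

The only difference between the two variants in this issue is the presence of `__attribute__((always_inline))` on the kernel function; the function body itself is unchanged.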
&ensp;


**Performance Outputs**
Baseline CUDA: 54854 goptions / s
Default DPCT migrated SYCL: 22137 goptions / s
Fixed inline attribute SYCL: 53898 goptions / s
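
To make the size of the gap explicit, the three figures above can be turned into ratios against the CUDA baseline. This is only a small arithmetic sketch; the helper names `relativeThroughput` and `slowdownPercent` are made up for illustration.

```cpp
#include <cassert>

// Throughput figures copied from the list above.
const double kCudaBaseline = 54854.0;
const double kSyclDefault  = 22137.0;
const double kSyclInlined  = 53898.0;

// Fraction of the baseline throughput that a measured run achieves.
inline double relativeThroughput(double measured, double baseline) {
    assert(baseline > 0.0);
    return measured / baseline;
}

// Throughput lost relative to the baseline, in percent.
inline double slowdownPercent(double measured, double baseline) {
    return (1.0 - relativeThroughput(measured, baseline)) * 100.0;
}
```

With these numbers, the default SYCL build reaches roughly 40% of the CUDA baseline (about a 60% drop), while the build with the inline attribute reaches roughly 98%.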
&ensp;


**LLVM IR**
Default DPCT migrated SYCL 
![1](https://user-images.githubusercontent.com/90853374/184519122-a7b0cfc9-0bb8-4546-98b7-94836a3e2351.PNG)

Workaround version, inline attribute SYCL
![2](https://user-images.githubusercontent.com/90853374/184519193-7a508d20-72ac-418e-95c8-dbae79399e1d.PNG)
&ensp;



**Using Nvidia profiler**
The Nvidia profiler shows that the extra memory instructions present in the LLVM IR are carried into the final NV binary.
![image](https://user-images.githubusercontent.com/90853374/184519312-aabd6939-410c-4502-8115-a80c4e723084.png)
&ensp;


**Environment (please complete the following information):**

- OS: Ubuntu 22.04
- Target device and vendor: NVIDIA A100
- DPC++ version: clang version 16.0.0
- Dependencies version: Using A100, normal config
&ensp;


**Additional context**
I believe this is not limited to SYCL's double2 type (here, `sycl::double2` comes from CUDA's `double2` type during DPCT migration).
CUDA workloads often use `double2`, `double4`, and similar vector types. When users migrate those CUDA workloads, I expect SYCL performance on NV GPUs to degrade because of the same extra memory instructions.





