## **Lab 3**

### **Overview**

This workshop demonstrates how to convert CUDA samples to SYCL C++ programs with the SYCLomatic tool, and how to compile the resulting SYCL programs with the Intel® oneAPI DPC++/C++ Compiler.

You will also learn about the current state of CUDA API support in the CUDA-to-SYCL C++ conversion.

### **Overview of CUDA Libraries**

NVIDIA provides a layer on top of the CUDA platform called CUDA-X, which is a collection of libraries, tools, and technologies.

GPU-accelerated CUDA libraries enable drop-in acceleration across multiple domains such as linear algebra, image and video processing, deep learning, and graph analytics.

NVIDIA's CUDA libraries include:
* Math libraries: cuBLAS, cuRAND, cuFFT
* Parallel Algorithm libraries: nvGRAPH, Thrust
* Image and video libraries: nvJPEG, NPP
* Communication libraries: NVSHMEM, NCCL
* Deep Learning libraries: cuDNN, TensorRT, Riva

### **CUDA API Migration Support Status and oneAPI API-Based Programming**

The CUDA API migration support status is documented at https://www.intel.com/content/www/us/en/docs/dpcpp-compatibility-tool/developer-guide-reference/2023-2/cuda-api-migration-support-status.html

Before starting a migration, it is important to understand the library dependencies of the CUDA program. If a CUDA program uses higher-level CUDA-based libraries, or uses CUDA Graphs to coordinate a large number of GPU operations, the SYCLomatic tool cannot complete the migration on its own, because it is designed to handle the migration of foundational CUDA libraries.

This is also why oneAPI offers two programming modes:
* **Direct Programming** - at SYCL C++ level.
* **API-Based Programming**  - a collection of oneAPI libraries that are comparable to the set of CUDA foundational libraries.

![oneAPI Programming Modes](./images/oneAPI-libraries.jpg)

### **Brief Summary**
* Converting a CUDA program that uses foundational libraries can be accelerated with the SYCLomatic tool.
* If a CUDA program uses high-level CUDA-based libraries, use the oneAPI API-based libraries, as they yield a faster migration.
  If your program requires specialized device/kernel code, write it directly in SYCL C++.

### **Exercise**

#### 1) Git clone CUDA Samples

In [None]:
# Note: cuda-samples has already been cloned ahead of time; clone it only if the directory is missing
! [ ! -d /app/notebooks/cuda-samples ] && git clone https://github.com/NVIDIA/cuda-samples.git /app/notebooks/cuda-samples

#### 2) Make a copy of CUDA sample 'jacobiCudaGraphs' inside lab-3

In [None]:
# Make a fresh copy of CUDA 'jacobiCudaGraphs' sample
! [ -d cuda-samples ] && rm -rf cuda-samples
! mkdir -p cuda-samples/Samples/3_CUDA_Features
! cp -rf /app/notebooks/cuda-samples/Common cuda-samples/
! cp -rf /app/notebooks/cuda-samples/Samples/3_CUDA_Features/jacobiCudaGraphs/  cuda-samples/Samples/3_CUDA_Features/

**Information:**
* cuda-samples/Common - CUDA helper headers
* jacobiCudaGraphs - a CUDA sample that uses CUDA Graphs as well as CUDA foundational libraries

#### 3) Use intercept-build to obtain the CUDA sample project's compilation dependencies

**Note:** 
* Each Jupyter Notebook shell command (! \<bash command\>) runs in its own subprocess, so process state such as the working directory does not persist to the next ! \<bash command\>.
* For convenience, the lab prefixes each command with 'cd \<directory\> &&' so that it runs in a specific location.
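
The note above can be illustrated with a minimal sketch. It assumes a POSIX shell and uses a command substitution (which runs in a subshell) to stand in for a separate notebook `!` command; the variable names are illustrative only.

```shell
# Remember where we start.
start_dir=$(pwd)

# The directory change happens inside a subshell, just as it would
# inside a single "! cd / && pwd" notebook command.
inside_dir=$(cd / && pwd)

# Back in the parent shell, the working directory is unchanged.
after_dir=$(pwd)

echo "inside subshell: $inside_dir"
echo "after subshell:  $after_dir"
```

This is why every cell in this lab repeats the 'cd ... &&' prefix instead of relying on an earlier cell's directory change.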

In [None]:
! cd cuda-samples/Samples/3_CUDA_Features/jacobiCudaGraphs/ && make clean
! cd cuda-samples/Samples/3_CUDA_Features/jacobiCudaGraphs/ && intercept-build make

In [None]:
! cd cuda-samples/Samples/3_CUDA_Features/jacobiCudaGraphs/ && cat compile_commands.json

**Information:**
* compile_commands.json - contains the recorded CUDA compilation commands.
* nvcc - the CUDA compiler; it compiles both jacobi.cu (GPU/device code) and main.cpp (host/CPU code).
* The compiled object files are then linked to produce the "jacobiCudaGraphs" executable.
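
As a rough sketch of how to read such a compilation database without extra tools, the snippet below writes a shortened, hypothetical compile_commands.json to a temporary directory and lists the source files it covers. The paths and nvcc command lines are placeholders, not the output of this lab; the real file records the full nvcc invocations.

```shell
# Hypothetical, shortened compilation database for illustration only.
workdir=$(mktemp -d)
cat > "$workdir/compile_commands.json" <<'EOF'
[
  {
    "command": "nvcc -c -o jacobi.o jacobi.cu",
    "directory": "/path/to/jacobiCudaGraphs",
    "file": "/path/to/jacobiCudaGraphs/jacobi.cu"
  },
  {
    "command": "nvcc -c -o main.o main.cpp",
    "directory": "/path/to/jacobiCudaGraphs",
    "file": "/path/to/jacobiCudaGraphs/main.cpp"
  }
]
EOF

# Extract the value of every "file" key, one source file per line.
files=$(sed -n 's/^ *"file": *"\(.*\)",\{0,1\}$/\1/p' "$workdir/compile_commands.json")
echo "$files"

# Count the entries, i.e. the number of recorded compile steps.
entry_count=$(grep -c '"file":' "$workdir/compile_commands.json")
echo "entries: $entry_count"
```

Each entry pairs one source file with the directory and command used to compile it, which is the information c2s reads through its -p option.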

#### 4) Use the SYCLomatic tool to convert CUDA code to SYCL C++

In [None]:
# If sycl_output exists, we delete it for a fresh 'sycl_output' SYCL conversion
! [ -d cuda-samples/Samples/3_CUDA_Features/jacobiCudaGraphs/sycl_output ] && rm -rf cuda-samples/Samples/3_CUDA_Features/jacobiCudaGraphs/sycl_output
! cd cuda-samples/Samples/3_CUDA_Features/jacobiCudaGraphs/ && c2s -p compile_commands.json --in-root ../../.. --gen-helper-function --use-experimental-features=logical-group --cuda-include-path=/usr/local/cuda-12.1/include --out-root=sycl_output 

**Information:**
* --in-root ../../.. : Specify the root of the source tree to migrate, i.e. cuda-samples/, so that the common include files are covered.
* --gen-helper-function : Generate the SYCLomatic helper header files in the output directory.
* --use-experimental-features=logical-group : Use the experimental c2s logical-group feature to migrate CUDA cooperative groups.
* --cuda-include-path=\<path to CUDA include\> : Specify the CUDA include header path.
* --out-root=\<SYCL output directory\> : Specify the output directory for the SYCL code.
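
To make the mapping between options and values easier to see, here is a dry-run sketch that assembles the same c2s command from named variables and prints it instead of executing it. The variable names are illustrative, and the values are the ones used in this lab; adjust them to your installation.

```shell
# Values taken from the lab's c2s invocation.
in_root="../../.."
out_root="sycl_output"
cuda_include="/usr/local/cuda-12.1/include"

# Assemble the command as positional parameters so that every option
# stays a single word, even if a path were to contain spaces.
set -- c2s -p compile_commands.json \
  --in-root "$in_root" \
  --gen-helper-function \
  --use-experimental-features=logical-group \
  --cuda-include-path="$cuda_include" \
  --out-root="$out_root"

# Dry run: print the command rather than running it.
c2s_command="$*"
echo "$c2s_command"
```

To run the migration for real from the sample directory, replace the final echo with "$@".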

**Note:**
* oneAPI Base Toolkit version 2023.02 supports CUDA Toolkit version 12.1.

#### 5) Review the SYCL output 

In [None]:
! cd cuda-samples/Samples/3_CUDA_Features/jacobiCudaGraphs/ && tree sycl_output

**Information:**
* MainSourceFiles.yaml : CUDA to SYCL conversion log
* Common/ : CUDA helper headers from cuda-samples
* include/ : SYCLomatic helper headers
* Samples/3_CUDA_Features/jacobiCudaGraphs/ : the migrated SYCL C++ code

In [None]:
# Inspect the migrated code (jacobi.dp.cpp)
! cd cuda-samples/Samples/3_CUDA_Features/jacobiCudaGraphs/ && cat sycl_output/Samples/3_CUDA_Features/jacobiCudaGraphs/jacobi.dp.cpp

In [None]:
# Inspect the migrated code (main.cpp.dp.cpp)
! cd cuda-samples/Samples/3_CUDA_Features/jacobiCudaGraphs/ && cat sycl_output/Samples/3_CUDA_Features/jacobiCudaGraphs/main.cpp.dp.cpp

#### 6) Compile the SYCL code using the DPC++ compiler

In [None]:
! cd cuda-samples/Samples/3_CUDA_Features/jacobiCudaGraphs/sycl_output/Samples/3_CUDA_Features/jacobiCudaGraphs/ && icpx -fsycl -I ../../../Common -I ../../../include *.cpp -o jacob_prog

#### 7) Fix SYCL compilation issue by commenting out CUDA Graphs code

As expected, the **CUDA Graphs functions** used in the 'jacobiCudaGraphs' sample are **not supported by the SYCLomatic tool**, so compiling the SYCL C++ code produces the errors above.
We fix the compilation errors by disabling **(#if 0 .... #endif)** the affected code sections in both **main.cpp.dp.cpp** and **jacobi.dp.cpp**, as shown below:

**main.cpp.dp.cpp**

![lab-3-main-fix](./images/lab-3-main-fix.jpg)

**jacobi.dp.cpp**

![lab-3-jacob-fix-1](./images/lab-3-jacob-fix-1.jpg)

![lab-3-jacob-fix-2](./images/lab-3-jacob-fix-2.jpg)

**Note:** Add a DEBUG print just before the 64-bit floating-point (FP64) capability check.

![lab-3-jacob-fix-3](./images/lab-3-jacob-fix-3.jpg)
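
The mechanics of an #if 0 .... #endif guard can be sketched on a toy file. Everything below is hypothetical: the file name, its contents, and the function unsupported_graph_call() are placeholders, and the real edits in this lab follow the screenshots above. The sketch assumes a POSIX shell with awk.

```shell
# Create a small, hypothetical source file with one unsupported call.
workdir=$(mktemp -d)
cat > "$workdir/demo.dp.cpp" <<'EOF'
int run_solver() {
  int iterations = 0;
  unsupported_graph_call();
  iterations = 42;
  return iterations;
}
EOF

# Wrap line 3 in "#if 0" / "#endif" so the preprocessor drops it
# while the original text stays in the file for later reference.
awk 'NR == 3 { print "#if 0"; print; print "#endif"; next } { print }' \
  "$workdir/demo.dp.cpp" > "$workdir/demo.patched.cpp"

cat "$workdir/demo.patched.cpp"
```

Unlike deleting the lines, the guard keeps the unsupported code visible, which makes it easier to revisit once an equivalent SYCL implementation is written.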

In [None]:
# Check the patched main.cpp.dp.cpp
! cat cuda-samples/Samples/3_CUDA_Features/jacobiCudaGraphs/sycl_output/Samples/3_CUDA_Features/jacobiCudaGraphs/main.cpp.dp.cpp

In [None]:
# Check the patched jacobi.dp.cpp
! cat cuda-samples/Samples/3_CUDA_Features/jacobiCudaGraphs/sycl_output/Samples/3_CUDA_Features/jacobiCudaGraphs/jacobi.dp.cpp

#### 8) Recompile the Jacobi SYCL program

In [None]:
! cd cuda-samples/Samples/3_CUDA_Features/jacobiCudaGraphs/sycl_output/Samples/3_CUDA_Features/jacobiCudaGraphs/ && icpx -fsycl -I ../../../Common -I ../../../include *.cpp -o jacob_prog

In [None]:
# Check jacob_prog executable file information
! file cuda-samples/Samples/3_CUDA_Features/jacobiCudaGraphs/sycl_output/Samples/3_CUDA_Features/jacobiCudaGraphs/jacob_prog

#### 9) Run the Jacobi SYCL program

In [None]:
! cd cuda-samples/Samples/3_CUDA_Features/jacobiCudaGraphs/sycl_output/Samples/3_CUDA_Features/jacobiCudaGraphs/ && ./jacob_prog

**Information:** 
* The SYCL Jacobi program exits prematurely at the 64-bit floating-point (FP64) capability check (**dpct::has_capability_or_fail(stream->get_device(), {sycl::aspect::fp64})**).
* dpct::has_capability_or_fail() is provided by the SYCLomatic helper headers.
* The Intel Xe GPU architecture web page (https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/2023-2/intel-xe-gpu-architecture.html) states that FP64 is currently supported by the Intel® Xe-HPC Data Center GPU Max Series.

### **Conclusion:**

* The SYCLomatic tool is designed to accelerate CUDA-to-SYCL C++ conversion; it is not meant to perform a complete conversion.
* A CUDA program may use CUDA libraries that are not supported by the SYCLomatic tool.
* MainSourceFiles.yaml contains the details of the CUDA-to-SYCL C++ conversion.
* The SYCLomatic tool generates helper headers (namespace dpct) as part of the CUDA-to-SYCL migration.

**Notices & Disclaimers** 

Intel technologies may require enabled hardware, software or service activation. 

No product or component can be absolutely secure.  

Your costs and results may vary.  

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document, with the sole exception that code included in this document is licensed subject to the Zero-Clause BSD open source license (0BSD), [Open Source Initiative](https://opensource.org/licenses/0BSD). No rights are granted to create modifications or derivatives of this document. 

© Intel Corporation.  Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries.  Other names and brands may be claimed as the property of others.  