# SYCL Migration - Sorting Networks

##### Sections
- [Introduction](#Introduction)
- [Analyze CUDA source](#Analyze-CUDA-source)
- [Migrate CUDA source to SYCL source](#Migrate-CUDA-source-to-SYCL-source)
- [Analyze, Compile and Run the migrated SYCL source](#Analyze,-Compile-and-Run-the-migrated-SYCL-source)
- [Source Code](#Source-Code)

## Learning Objectives
* Use the SYCLomatic Tool to migrate a simple single-source CUDA application
* Use various command line options of `SYCLomatic` for CUDA to SYCL migration
* Compile and run migrated SYCL code on Intel CPUs and GPUs
* Optimize the migrated SYCL code with manual coding

## Introduction

This module walks you through migrating CUDA code to SYCL code using the SYCLomatic Tool.

#### Requirements
1. NVIDIA CUDA development machine
2. Development machine with Intel CPU/GPU OR an Intel Developer Cloud account

#### Migration Process
We will do the following steps in this hands-on workshop:
- Analyze CUDA source
- Migrate CUDA source to SYCL source
- Analyze, Compile and Run the migrated SYCL source

## Analyze CUDA source

The CUDA source for the "Sorting Networks" example is available on [NVIDIA GitHub](https://github.com/NVIDIA/cuda-samples/tree/master/Samples/2_Concepts_and_Techniques/sortingNetworks).

Pull the entire repository on your CUDA Development machine.

```
git clone https://github.com/NVIDIA/cuda-samples.git

cd cuda-samples/Samples/2_Concepts_and_Techniques/sortingNetworks
```

The CUDA source for the SortingNetworks implementation is in the following files.

[__main.cpp__](https://github.com/NVIDIA/cuda-samples/blob/master/Samples/2_Concepts_and_Techniques/sortingNetworks/main.cpp) contains the host code that:
- identifies the GPU device
- allocates memory on the GPU
- initializes data on the CPU
- copies data to GPU memory for computation
- kicks off computation on the GPU
- verifies and prints results
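
The verification step can be pictured with a small host-side check. The following is a minimal sketch in plain C++, not the sample's own validation code: the function name `batches_sorted` and its parameters are hypothetical, and it assumes the keys are stored as consecutive batches of equal length.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the CUDA sample): returns true if every
// consecutive batch of `array_length` keys is ordered in the given direction.
inline bool batches_sorted(const std::vector<unsigned> &keys,
                           std::size_t array_length, bool ascending) {
  // Assumption: the input holds a whole number of equally sized batches.
  assert(array_length > 0 && keys.size() % array_length == 0);
  for (std::size_t base = 0; base < keys.size(); base += array_length) {
    for (std::size_t i = base + 1; i < base + array_length; ++i) {
      const bool ordered =
          ascending ? keys[i - 1] <= keys[i] : keys[i - 1] >= keys[i];
      if (!ordered) return false;  // first out-of-order pair found
    }
  }
  return true;
}
```

A check like this only confirms ordering; a fuller validation would also confirm that each value still travels with its original key.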

[__oddEvenMergeSort.cu__, __bitonicSort.cu__](https://github.com/NVIDIA/cuda-samples/blob/master/Samples/2_Concepts_and_Techniques/sortingNetworks/OddEvenMergeSort.cu) contain the kernel code for the sort computations that run on the GPU. The kernel code:
- defines the kernels
- allocates shared local memory
- performs the OddEvenMergeSort computation
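
To make the compare-and-exchange structure of a sorting network concrete, here is a minimal host-only sketch of a bitonic sort in plain C++. It is not the sample's kernel code: the name `bitonic_sort_host` is hypothetical, it sorts keys only (no values), and it assumes the array length is a power of two.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical host-only reference: sorts `keys` in place with a bitonic
// sorting network. Assumption: keys.size() is zero or a power of two.
inline void bitonic_sort_host(std::vector<unsigned> &keys, bool ascending) {
  const std::size_t n = keys.size();
  assert((n & (n - 1)) == 0 && "length must be a power of two");
  // `size` is the length of the bitonic sequences being merged.
  for (std::size_t size = 2; size <= n; size <<= 1) {
    // `stride` is the distance between the two inputs of one comparator.
    for (std::size_t stride = size >> 1; stride > 0; stride >>= 1) {
      for (std::size_t i = 0; i < n; ++i) {
        const std::size_t partner = i ^ stride;
        if (partner <= i) continue;  // visit each comparator only once
        // Neighbouring blocks sort in opposite directions so that the next
        // merge step again sees a bitonic sequence.
        const bool up = ((i & size) == 0) == ascending;
        const bool out_of_order =
            up ? keys[i] > keys[partner] : keys[i] < keys[partner];
        if (out_of_order) std::swap(keys[i], keys[partner]);
      }
    }
  }
}
```

The comparators inside one `stride` step touch disjoint pairs, so they are independent of each other. That independence is why the pattern maps well to one GPU work-item per comparator.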

The code also uses helper functions, for example to capture kernel execution times, which are available in the `cuda-samples/Common/` folder of the repo:
- `helper_cuda.h`
- `helper_timer.h`
- `helper_string.h`
- `exception.h`


## Migrate CUDA source to SYCL source

<p style="background-color:#cdc"> Note: A CUDA development machine is required to accomplish the task in this section </p>

Now that we have analyzed the CUDA source, we will next migrate the CUDA source into SYCL source using the __SYCLomatic Tool__.

In this exercise, we will walk you through migrating the CUDA code step by step.

#### Requirements

Make sure you have an __NVIDIA CUDA development machine__ that can __compile and run CUDA code__. The next step is to install the tools for migrating CUDA to SYCL:

- Install SYCLomatic Tool on this machine
  - go to https://github.com/oneapi-src/SYCLomatic/releases/
  - copy link to latest `linux_release.tgz` from assets
  - on the CUDA development machine: `mkdir syclomatic; cd syclomatic`
  - `wget <link to linux_release.tgz>`
  - `tar -xvf linux_release.tgz`
  - `export PATH="/home/$USER/syclomatic/bin:$PATH"`
  - Verify installation: `c2s --version`
- Pull the CUDA samples repo to this machine
  - `git clone https://github.com/NVIDIA/cuda-samples.git`
- Compile and run the `sortingNetworks` sample
  - `cd cuda-samples/Samples/2_Concepts_and_Techniques/sortingNetworks`
  - `make`


### Migrate CUDA source to SYCL source using SYCLomatic

On the NVIDIA CUDA development machine, go to the CUDA source folder and generate a compilation database with the tool `intercept-build`. This creates a JSON file that records every compiler invocation, including the names of the input files and the compiler options.

```
make clean
intercept-build make
```

This will create a file named `compile_commands.json` in the sample folder.

Next, use the SYCLomatic Tool (c2s) to migrate the code; it will store the result in the migration folder `dpct_output`:

```
c2s -p compile_commands.json --in-root ../../.. --gen-helper-function
```

The `--gen-helper-function` option copies the SYCLomatic helper header files to the output directory.

The `--in-root` option specifies the root directory of the source tree to migrate. Here it points to the repository root, so the common include files used by the CUDA project are migrated as well.

This command should migrate the CUDA source to the C++ SYCL source in a folder named `dpct_output` by default, and the folder will have the C++ SYCL source along with any dependencies from the `Common` folder:

- `main.cpp.dp.cpp`
- `oddEvenMergeSort.dp.cpp`
- `bitonicSort.dp.cpp`
- `sortingNetworks_validate.cpp`
- `sortingNetworks_common.dp.hpp`
- `sortingNetworks_common.h`

This command may also emit a number of warnings about the migration process. CUDA code that cannot be migrated automatically is marked with warning comments in the migrated source files and has to be migrated manually.


## Analyze, Compile and Run the migrated SYCL source

<p style="background-color:#cdc"> Note: The tasks in this section should be done on Intel Developer Cloud or on a system with the Intel oneAPI Base Toolkit installed.</p>


The migrated SYCL source files are in the `Samples` folder under the `dpct_output` folder:
- `main.cpp.dp.cpp`
- `oddEvenMergeSort.dp.cpp`
- `bitonicSort.dp.cpp`
- `sortingNetworks_validate.cpp`
- `sortingNetworks_common.dp.hpp`
- `sortingNetworks_common.h`

The `dpct_output` folder also has the header files needed for compiling the migrated SYCL code. The `Common` folder has header files with CUDA helper functions that were migrated to SYCL, and the `include` folder has header files with SYCLomatic helper functions.


#### Requirements

Make sure you have one of the following:
- __Development machine with Intel CPU/GPU__ with Intel oneAPI Base Toolkit installed
- __Intel Developer Cloud__ account to access the Intel CPUs/GPUs on the cloud

### Compiling migrated SYCL code

To compile the migrated SYCL code we can use the following command:
```sh
icpx -fsycl -I ../../../Common -I ../../../include *.cpp
```

There may be compile errors based on whether all of the CUDA code was migrated to SYCL or not. The migrated code may also include comments with warning messages, which could help make it easier to fix the errors. These errors have to be manually fixed to get the code to compile.

### Fixing unmigrated SYCL code

Other migration warnings can be reviewed to see whether any improvements can be made. Below are a couple of example warnings in `main.cpp.dp.cpp`:

```cpp
      /*
      DPCT1065:1: Consider replacing sycl::nd_item::barrier() with
      sycl::nd_item::barrier(sycl::access::fence_space::local_space) for better
      performance if there is no access to global memory.
      */
      item_ct1.barrier();
```
The above DPCT warning is a suggestion to improve performance.

```cpp
    /*
    DPCT1049:4: The work-group size passed to the SYCL kernel may exceed the
    limit. To get the device limit, query info::device::max_work_group_size.
    Adjust the work-group size if needed.
    */
    q_ct1.submit([&](sycl::handler &cgh) {
      cgh.parallel_for(
          sycl::nd_range<3>(
              sycl::range<3>(1, 1, (batchSize * arrayLength) / 512) *
                  sycl::range<3>(1, 1, 256),
              sycl::range<3>(1, 1, 256)),
          [=](sycl::nd_item<3> item_ct1) {
            oddEvenMergeGlobal(d_DstKey, d_DstVal, d_DstKey, d_DstVal,
                               arrayLength, size, stride, dir, item_ct1);
          });
    });
```
The above DPCT warning suggests checking the work-group size limit for the device, since the limit varies from device to device.
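
One way to act on the DPCT1049 warning is to clamp the requested work-group size to the device limit before building the launch range. The sketch below is plain C++ and only illustrates the arithmetic. The names `LaunchConfig` and `make_launch_config` are hypothetical, and `device_max` is assumed to hold the value returned by the `info::device::max_work_group_size` query mentioned in the warning.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper types, not part of SYCL or SYCLomatic.
struct LaunchConfig {
  std::size_t global_size;  // total work-items, a multiple of local_size
  std::size_t local_size;   // work-group size actually used
};

// Clamp the requested work-group size to the device limit and round the
// global size up so that it stays divisible by the work-group size.
inline LaunchConfig make_launch_config(std::size_t total_work_items,
                                       std::size_t requested_local,
                                       std::size_t device_max) {
  assert(requested_local > 0 && device_max > 0);
  const std::size_t local = std::min(requested_local, device_max);
  const std::size_t groups = (total_work_items + local - 1) / local;
  return LaunchConfig{groups * local, local};
}
```

This is a sketch under stated assumptions rather than a drop-in fix. A kernel launched with a rounded-up global size needs a bounds check, and the sorting kernels in this sample are written around fixed sizes, so a changed work-group size may require kernel changes as well.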

### Compile and Run the migrated SYCL source

Once you have successfully migrated the CUDA source to the SYCL source, verify that the migrated SYCL code is functioning correctly by compiling and running it on the Intel Developer Cloud, which has a variety of Intel CPUs and GPUs available for development.

#### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

`! ./q.sh run_sycl_migrated.sh gen9`

### SYCL Code Migration Analysis

When comparing the CUDA code and migrated SYCL code, we can see that there are some 1:1 equivalent calls, which are listed below in the table:

| Functionality|CUDA|SYCL
|-|-|-
| header file|`#include <cuda_runtime.h>`|`#include <CL/sycl.hpp>`
| Memory allocation on device| `cudaMalloc((void **)&d_InputKey, N * sizeof(uint))` | `d_InputKey = sycl::malloc_device<uint>(N, q_ct1)`
| Copy memory between host and device| `cudaMemcpy(d_InputKey, h_InputKey, N * sizeof(uint), cudaMemcpyHostToDevice)` | `q_ct1.memcpy(d_InputKey, h_InputKey, N * sizeof(uint))`
| Free device memory allocation| `cudaFree(d_A)` | `sycl::free(d_A, q_ct1)`
| Synchronize host and device | `cudaDeviceSynchronize()` | `dev_ct1.queues_wait_and_throw()`

The actual kernel function invocation is different. In CUDA, the kernel function is invoked with the execution configuration syntax `<<<blockCount, threadCount>>>` as follows, specifying blocks and threads:

```cpp
bitonicSortShared<<<blockCount, threadCount>>>(d_DstKey, d_DstVal, d_SrcKey,
                                                   d_SrcVal, arrayLength, dir);
```

In SYCL, the kernel function is invoked using `parallel_for` and specifying `nd_range` with global size and work group size, as follows:
```cpp
    q_ct1.submit([&](sycl::handler &cgh) {
      sycl::accessor<uint, 1, sycl::access_mode::read_write,
                     sycl::access::target::local>
          s_key_acc_ct1(sycl::range<1>(1024 /*1024U*/), cgh);
      sycl::accessor<uint, 1, sycl::access_mode::read_write,
                     sycl::access::target::local>
          s_val_acc_ct1(sycl::range<1>(1024 /*1024U*/), cgh);

      cgh.parallel_for(sycl::nd_range<3>(sycl::range<3>(1, 1, blockCount) *
                                             sycl::range<3>(1, 1, threadCount),
                                         sycl::range<3>(1, 1, threadCount)),
                       [=](sycl::nd_item<3> item_ct1) {
                         bitonicSortShared(d_DstKey, d_DstVal, d_SrcKey,
                                           d_SrcVal, arrayLength, dir, item_ct1,
                                           s_key_acc_ct1.get_pointer(),
                                           s_val_acc_ct1.get_pointer());
                       });
    });
```
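
The relationship between the two launch styles comes down to simple index arithmetic. The sketch below models it in plain C++ without any SYCL or CUDA headers. The names `Range1D`, `to_nd_range`, and `global_linear_id` are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model of a one-dimensional launch; not a SYCL type.
struct Range1D {
  std::size_t global;  // SYCL global size (total work-items)
  std::size_t local;   // SYCL work-group size
};

// CUDA <<<blockCount, threadCount>>> corresponds to a global size of
// blockCount * threadCount and a work-group size of threadCount.
inline Range1D to_nd_range(std::size_t block_count, std::size_t thread_count) {
  assert(thread_count > 0);
  return Range1D{block_count * thread_count, thread_count};
}

// CUDA's blockIdx.x * blockDim.x + threadIdx.x, expressed with the
// work-group id, work-group size, and local id of a work-item.
inline std::size_t global_linear_id(std::size_t group_id,
                                    std::size_t local_range,
                                    std::size_t local_id) {
  assert(local_id < local_range);
  return group_id * local_range + local_id;
}
```

In the migrated code above, the CUDA `x` dimension appears as the last index of the three-dimensional `sycl::range<3>`, which is why the block and thread counts sit in the third position.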

Another difference is that SYCL requires creating a queue with a device selector and other optional properties. The queue is used to submit command groups for execution on the device. The migrated SYCL code creates the queue with helper functions, as follows:
```cpp
dpct::device_ext &dev_ct1 = dpct::get_current_device();
sycl::queue &q_ct1 = dev_ct1.default_queue();
```
In CUDA, the equivalent is a CUDA stream. If no stream is created in the CUDA code, the default stream is used implicitly.


## Source Code

This section describes the location of the CUDA source and the contents of different SYCL source code directories in this project.

| folder name | source code description
| --- | ---
| [CUDA github](https://github.com/NVIDIA/cuda-samples/tree/master/Samples/2_Concepts_and_Techniques/sortingNetworks) | Original CUDA Source used for migration
| dpct_output | SYCL migration output from SYCLomatic Tool, compiles without errors
| sycl_migrated | Same as dpct_output, compiles without errors

## Summary

In this module, we learned how to migrate a simple CUDA source to a functional SYCL source using `SYCLomatic`, and then analyzed and optimized the SYCL source with manual coding.