# SYCL Migration - Simple VectorAdd

##### Sections
- [Introduction](#Introduction)
- [Analyze CUDA Source](#Analyze-CUDA-Source)
- [Migrate CUDA source to SYCL source](#Migrate-CUDA-source-to-SYCL-source)
- [Analyze, Compile and Run the migrated SYCL source](#Analyze,-Compile-and-Run-the-migrated-SYCL-source)
- [Manually Optimize the migrated SYCL source](#Manually-Optimize-the-migrated-SYCL-source)
- [Source Code](#Source-Code)

## Learning Objectives
* Use SYCLomatic Tool to migrate a simple single source CUDA application
* Use various command line options of `SYCLomatic` for CUDA to SYCL migration
* Compile and run migrated SYCL code on Intel CPUs and GPUs
* Optimize the migrated SYCL code with manual coding

## Introduction

This module will walk you through migrating CUDA code to SYCL code using the SYCLomatic Tool.

#### Requirements
1. NVIDIA CUDA development machine
2. Development machine with an Intel CPU/GPU OR an Intel Developer Cloud account

#### Migration Process
We will do the following steps in this hands-on workshop:
- Analyze CUDA source
- Migrate CUDA source to SYCL source
- Analyze, Compile and Run the migrated SYCL source
- Manually Optimize the migrated SYCL source

## Analyze CUDA Source

Below is a simple VectorAdd CUDA source file, `vectoradd.cu`:

```cpp

#include <cuda.h>
#include <iostream>
#include <vector>
#define N 16

//# kernel code to perform VectorAdd on GPU
__global__ void VectorAddKernel(float* A, float* B, float* C)
{
        C[threadIdx.x] = A[threadIdx.x] + B[threadIdx.x];
}

int main()
{
        //# Initialize vectors on host
        float A[N] = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
        float B[N] = {2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
        float C[N] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};

        //# Allocate memory on device
        float *d_A, *d_B, *d_C;
        cudaMalloc(&d_A, N*sizeof(float));
        cudaMalloc(&d_B, N*sizeof(float));
        cudaMalloc(&d_C, N*sizeof(float));

        //# copy vector data from host to device
        cudaMemcpy(d_A, A, N*sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_B, B, N*sizeof(float), cudaMemcpyHostToDevice);

        //# submit task to compute VectorAdd on device
        VectorAddKernel<<<1, N>>>(d_A, d_B, d_C);

        //# copy result of vector data from device to host
        cudaMemcpy(C, d_C, N*sizeof(float), cudaMemcpyDeviceToHost);

        //# print result on host
        for (int i = 0; i < N; i++) std::cout<< C[i] << " ";
        std::cout << "\n";

        //# free allocation on device
        cudaFree(d_A);
        cudaFree(d_B);
        cudaFree(d_C);
        return 0;
}
```


The CUDA code above initializes three arrays: `A`, `B` and `C`. The code allocates memory on device for three arrays: `d_A`, `d_B` and `d_C` using the `cudaMalloc` function, and then copies host initialized arrays to device locations using `cudaMemcpy`. The kernel function `VectorAddKernel` is then called to add arrays `d_A` and `d_B` into `d_C`. The results from `d_C` are copied back to host using `cudaMemcpy` to print the output.
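
As a cross-check for what the kernel computes, the same element-wise addition can be written as a plain host loop. This is a minimal host-only sketch and not part of the original sample; the function name `vector_add_reference` is hypothetical. Each loop iteration stands in for one CUDA thread, with `i` playing the role of `threadIdx.x`.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-only reference (not part of vectoradd.cu): computes the
// same element-wise sum that VectorAddKernel computes on the device.
// The inputs are assumed to have the same length; if b is shorter, the
// remaining elements of the result stay at 0.
std::vector<float> vector_add_reference(const std::vector<float>& a,
                                        const std::vector<float>& b)
{
        std::vector<float> c(a.size());
        // Each iteration corresponds to one CUDA thread;
        // i plays the role of threadIdx.x.
        for (std::size_t i = 0; i < a.size() && i < b.size(); i++)
                c[i] = a[i] + b[i];
        return c;
}
```

With the inputs from `vectoradd.cu` (sixteen 1s and sixteen 2s), every element of the result is 3, which is the output to expect from both the CUDA program and the migrated SYCL program.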

## Migrate CUDA source to SYCL source

<p style="background-color:#cdc"> Note: A CUDA development machine is required to accomplish the task in this section </p>

Now that we have analyzed the CUDA source, we will migrate the CUDA source into SYCL source using the __SYCLomatic Tool__.

For relatively simple projects, the SYCLomatic Tool can be invoked directly on the CUDA source code. In this exercise, we will walk through the migration step by step.

### Requirements

Make sure you have an __NVIDIA CUDA development machine__ that can __compile and run CUDA code__. The next step is to install the tools for migrating CUDA to SYCL:

- Install SYCLomatic Tool on this machine
  - go to https://github.com/oneapi-src/SYCLomatic/releases/
  - copy link to latest `linux_release.tgz` from assets
  - on the CUDA development machine: `mkdir syclomatic; cd syclomatic`
  - `wget <link to linux_release.tgz>`
  - `tar -xvf linux_release.tgz`
  - `export PATH="/home/$USER/syclomatic/bin:$PATH"`
  - Verify installation: `c2s --version`
- Create a working directory and copy the above `vectoradd.cu` CUDA source to this machine


### Migrate CUDA source to SYCL source using SYCLomatic

On the NVIDIA CUDA development machine, run the following command to migrate CUDA code to SYCL code:

```
c2s vectoradd.cu --gen-helper-function
```

This command should migrate the CUDA source to SYCL source in a folder named `dpct_output/` by default. The folder will contain the SYCL source file `vectoradd.dp.cpp`.

The `--gen-helper-function` option copies the SYCLomatic helper header files to the output directory.

Note that running the tool again for the same source may fail with an error stating that the `dpct_output` folder is not empty. If that happens, delete the `dpct_output` folder and then rerun the `c2s vectoradd.cu --gen-helper-function` command.

Next, we will use the `c2s --out-root` option to specify a custom output directory, as shown below:

```
c2s vectoradd.cu --gen-helper-function --out-root sycl_code
```

This command should migrate the CUDA source to SYCL source in a folder named `sycl_code`.



## Analyze, Compile and Run the migrated SYCL source

<p style="background-color:#cdc"> Note: The tasks in this section should be done on the Intel Developer Cloud or on a system with the oneAPI Base Toolkit installed.</p>

Once you have successfully migrated the CUDA source to the SYCL source, verify that the migrated SYCL code is functioning correctly by compiling and running it on a system with an Intel CPU/GPU. Alternatively, you can do this on the Intel Developer Cloud, which has a variety of Intel CPUs and GPUs available for development.

Let's look at the migrated SYCL code:

```cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <sycl/sycl.hpp>
#include <dpct/dpct.hpp>
#include <iostream>
#include <vector>
#define N 16

//# kernel code to perform VectorAdd on GPU
void VectorAddKernel(float* A, float* B, float* C,
                     const sycl::nd_item<3> &item_ct1)
{
        C[item_ct1.get_local_id(2)] =
            A[item_ct1.get_local_id(2)] + B[item_ct1.get_local_id(2)];
}

int main()
{
        dpct::device_ext &dev_ct1 = dpct::get_current_device();
        sycl::queue &q_ct1 = dev_ct1.default_queue();
        std::cout << "Device: " << q_ct1.get_device().get_info<sycl::info::device::name>() << "\n";

        //# Initialize vectors on host
        float A[N] = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
        float B[N] = {2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
        float C[N] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};

        //# Allocate memory on device
        float *d_A, *d_B, *d_C;
        d_A = sycl::malloc_device<float>(N, q_ct1);
        d_B = sycl::malloc_device<float>(N, q_ct1);
        d_C = sycl::malloc_device<float>(N, q_ct1);

        //# copy vector data from host to device
        q_ct1.memcpy(d_A, A, N * sizeof(float));
        q_ct1.memcpy(d_B, B, N * sizeof(float)).wait();

        //# submit task to compute VectorAdd on device
        q_ct1.parallel_for(
            sycl::nd_range<3>(sycl::range<3>(1, 1, N), sycl::range<3>(1, 1, N)),
            [=](sycl::nd_item<3> item_ct1) {
                    VectorAddKernel(d_A, d_B, d_C, item_ct1);
            });

        //# copy result of vector data from device to host
        q_ct1.memcpy(C, d_C, N * sizeof(float)).wait();

        //# print result on host
        for (int i = 0; i < N; i++) std::cout<< C[i] << " ";
        std::cout << "\n";

        //# free allocation on device
        sycl::free(d_A, q_ct1);
        sycl::free(d_B, q_ct1);
        sycl::free(d_C, q_ct1);
        return 0;
}
```

The migrated SYCL code can be compiled using the following command in terminal:
```sh
icpx -fsycl -I include vectoradd.dp.cpp
```

OR you can compile and run by executing the cell below:

#### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

! ./q.sh run_vector_add.sh gen9

### SYCL Code Migration Analysis

Comparing the CUDA code with the migrated SYCL code, we can see that some calls have 1:1 equivalents, which are listed in the table below:

| Functionality | CUDA | SYCL |
|-|-|-|
| Header file | `#include <cuda.h>` | `#include <sycl/sycl.hpp>` |
| Memory allocation on device | `cudaMalloc(&d_A, N*sizeof(float))` | `d_A = sycl::malloc_device<float>(N, q_ct1)` |
| Copy memory between host and device | `cudaMemcpy(d_A, A, N*sizeof(float), cudaMemcpyHostToDevice)` | `q_ct1.memcpy(d_A, A, N * sizeof(float))` |
| Free device memory allocation | `cudaFree(d_A)` | `sycl::free(d_A, q_ct1)` |

The kernel invocation itself is different. In CUDA, the kernel function is invoked with the execution configuration syntax `<<<1, N>>>`, specifying 1 block of N threads:

```cpp
VectorAddKernel<<<1, N>>>(d_A, d_B, d_C);
```

In SYCL, the kernel function is invoked using `parallel_for` with an `nd_range` that specifies N work-items in a single work-group, as follows:


```cpp
q_ct1.parallel_for(
            sycl::nd_range<3>(sycl::range<3>(1, 1, N), sycl::range<3>(1, 1, N)),
            [=](sycl::nd_item<3> item_ct1) {
                    VectorAddKernel(d_A, d_B, d_C, item_ct1);
            });
```
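
The `nd_range` arguments follow from the CUDA launch configuration in a fairly mechanical way. The sketch below is a host-only illustration in plain C++ (no SYCL headers needed); the `Dim3` and `NdRange3` types and the `to_nd_range` function are hypothetical stand-ins for `dim3` and `sycl::nd_range<3>`, not SYCLomatic or SYCL APIs. It shows two conventions visible in the migrated code: the global range is the grid size multiplied by the block size, and the dimension order is reversed, so CUDA's `x` lands at SYCL index 2 (which is why `threadIdx.x` became `item_ct1.get_local_id(2)`).

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for CUDA's dim3; unspecified dimensions default to 1.
struct Dim3 {
        std::size_t x, y, z;
        Dim3(std::size_t x_ = 1, std::size_t y_ = 1, std::size_t z_ = 1)
            : x(x_), y(y_), z(z_) {}
};

// Hypothetical stand-in for sycl::nd_range<3>: a global and a local range.
struct NdRange3 {
        std::array<std::size_t, 3> global;
        std::array<std::size_t, 3> local;
};

// Map a CUDA <<<grid, block>>> configuration to SYCL-style ranges.
// Index 0 holds z, index 1 holds y, index 2 holds x (reversed order).
NdRange3 to_nd_range(const Dim3& grid, const Dim3& block)
{
        NdRange3 r;
        r.local = {block.z, block.y, block.x};
        // The global range counts all work-items: blocks times threads per block.
        r.global = {grid.z * block.z, grid.y * block.y, grid.x * block.x};
        return r;
}
```

For the launch `<<<1, N>>>` with `N` equal to 16, `to_nd_range(Dim3(1), Dim3(16))` yields a global range of `(1, 1, 16)` and a local range of `(1, 1, 16)`, matching the two `sycl::range<3>(1, 1, N)` arguments in the migrated code.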

Another difference is that SYCL requires a SYCL queue, created with a device selector and other optional properties. The queue is used to submit command groups for execution on the device. The migrated SYCL code creates the queue using dpct helper functions:

```cpp
dpct::device_ext &dev_ct1 = dpct::get_current_device();
sycl::queue &q_ct1 = dev_ct1.default_queue();
```

In CUDA, the equivalent is a CUDA stream; if the CUDA code does not create a stream, the default stream is used implicitly.


## Manually Optimize the migrated SYCL source

The SYCLomatic Tool migrates the CUDA code to functionally equivalent SYCL code, but you may have to manually optimize the resulting SYCL code for best performance.

Now that we have successfully migrated the CUDA code to the SYCL code and executed on an Intel CPU/GPU, let’s look at what manual optimizations we can do.

Analyzing the migrated SYCL code, we can see that a SYCL queue is created using the following code:

```cpp
dpct::device_ext &dev_ct1 = dpct::get_current_device();
sycl::queue &q_ct1 = dev_ct1.default_queue();
```

The above code creates a SYCL queue using dpct helper functions, whose implementation can be inspected in the `dpct/dpct.hpp` header file.

The queue it creates has the `in_order` queue property and uses default device selection, which is the same as the code below written with only SYCL API syntax:

```cpp
sycl::queue q_ct1{sycl::default_selector_v, sycl::property::queue::in_order()};
```

An `in_order` queue executes submitted operations one after another, so operations that have no dependency on each other cannot overlap. Therefore, we will remove the `in_order` queue property and add event-based dependencies between the operations instead.

We can replace the SYCL queue creation with the following code:

```cpp
sycl::queue q_ct1;
```

This will create an out-of-order queue with default device selection, which allows independent operations to overlap.

The next step is to add the dependencies. The two host-to-device `memcpy` operations are independent of each other, so they can overlap. The vector add kernel must wait for both of them, and the final `memcpy` that copies the results back must wait for the kernel.

The resulting optimized code will look like this:

```cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>
#define N 16

//# kernel code to perform VectorAdd on GPU
void VectorAddKernel(float* A, float* B, float* C,
                     const sycl::nd_item<3> &item_ct1)
{
        C[item_ct1.get_local_id(2)] =
            A[item_ct1.get_local_id(2)] + B[item_ct1.get_local_id(2)];
}

int main()
{
        // sycl queue with out of order execution allowed
        sycl::queue q_ct1;
        std::cout << "Device: " << q_ct1.get_device().get_info<sycl::info::device::name>() << "\n";

        //# Initialize vectors on host
        float A[N] = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
        float B[N] = {2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
        float C[N] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};

        //# Allocate memory on device
        float *d_A, *d_B, *d_C;
        d_A = sycl::malloc_device<float>(N, q_ct1);
        d_B = sycl::malloc_device<float>(N, q_ct1);
        d_C = sycl::malloc_device<float>(N, q_ct1);

        //# copy vector data from host to device
        auto e1 = q_ct1.memcpy(d_A, A, N * sizeof(float));
        auto e2 = q_ct1.memcpy(d_B, B, N * sizeof(float));

        //# submit task to compute VectorAdd on device
        auto e3 = q_ct1.parallel_for(
            sycl::nd_range<3>(sycl::range<3>(1, 1, N), sycl::range<3>(1, 1, N)), {e1, e2},
            [=](sycl::nd_item<3> item_ct1) {
                    VectorAddKernel(d_A, d_B, d_C, item_ct1);
            });

        //# copy result of vector data from device to host
        q_ct1.memcpy(C, d_C, N * sizeof(float), e3).wait();

        //# print result on host
        for (int i = 0; i < N; i++) std::cout<< C[i] << " ";
        std::cout << "\n";

        //# free allocation on device
        sycl::free(d_A, q_ct1);
        sycl::free(d_B, q_ct1);
        sycl::free(d_C, q_ct1);
        return 0;
}
```
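
To see what the event dependencies allow, the four operations can be modeled as a small dependency graph. The following is a host-only sketch in plain C++ rather than SYCL; the function `earliest_steps` is hypothetical, and it only counts scheduling steps under the idealized assumption that independent operations can run fully in parallel.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical model: deps[i] lists the operations that operation i must wait
// for. Operations are assumed to be listed in submission order, so every
// dependency index is smaller than i. Returns the earliest step at which each
// operation can start if independent operations are free to overlap.
std::vector<int> earliest_steps(const std::vector<std::vector<std::size_t>>& deps)
{
        std::vector<int> step(deps.size(), 0);
        for (std::size_t i = 0; i < deps.size(); i++)
                for (std::size_t d : deps[i])
                        // An operation starts one step after its latest dependency.
                        step[i] = std::max(step[i], step[d] + 1);
        return step;
}
```

Number the operations 0 (`e1`, copy `A`), 1 (`e2`, copy `B`), 2 (`e3`, the kernel) and 3 (the copy back). The dependency lists `{{}, {}, {0, 1}, {2}}` from the optimized code give steps `0, 0, 1, 2`, so the two host-to-device copies may overlap. An in-order queue behaves like the chain `{{}, {0}, {1}, {2}}`, which gives `0, 1, 2, 3`. Whether the copies actually overlap on a given device depends on the runtime and the hardware.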

#### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

! ./q.sh run_vector_add_optimized.sh gen9

## Source Code

This section describes the location of the CUDA source and the contents of different SYCL source code directories in this project.

| Folder name | Source code description |
| --- | --- |
| cuda | Original CUDA source used for migration |
| dpct_output | SYCL migration output from the SYCLomatic Tool, compiles without errors |
| sycl_migrated | SYCL code with added code that prints the offload device name, compiles without errors |
| sycl_migrated_optimized | SYCL code optimized to allow kernels to execute out-of-order, compiles without errors |

## Summary

In this module, we learned how to migrate a simple CUDA source to functionally equivalent SYCL source using the `SYCLomatic` Tool, and then analyzed and manually optimized the migrated SYCL source.
