# SYCL Migration - SimpleCUDAGraphs

##### Sections
- [Introduction](#Introduction)
- [Analyze CUDA source](#Analyze-CUDA-source)
- [Migrate CUDA source to SYCL source](#Migrate-CUDA-source-to-SYCL-source)
- [Analyze, Compile and Run the migrated SYCL source](#Analyze,-Compile-and-Run-the-migrated-SYCL-source)
- [Source Code](#Source-Code)

## Learning Objectives
* Use the SYCLomatic Tool to migrate a simple single-source CUDA application
* Use various command-line options of `SYCLomatic` for CUDA to SYCL migration
* Compile and run migrated SYCL code on Intel CPUs and GPUs
* Optimize the migrated SYCL code with manual coding

## Introduction

This module walks you through migrating CUDA code to SYCL code using the Intel SYCLomatic Tool.

#### Requirements
1. NVIDIA CUDA development machine
2. Development machine with an Intel CPU/GPU, or an Intel Developer Cloud account

#### Migration Process
We will do the following steps in this hands-on workshop:
- Analyze CUDA source
- Migrate CUDA source to SYCL source
- Analyze, Compile and Run the migrated SYCL source

## Analyze CUDA source

The CUDA source for the "SimpleCUDAGraphs" example is available on [NVIDIA GitHub](https://github.com/NVIDIA/cuda-samples/tree/v11.8/Samples/3_CUDA_Features/simpleCudaGraphs).

Clone the entire repository on your CUDA development machine.

```
git clone https://github.com/NVIDIA/cuda-samples.git

cd cuda-samples/Samples/3_CUDA_Features/simpleCudaGraphs/
```

The CUDA source demonstrates CUDA Graphs creation, instantiation and launch using Graphs APIs and Stream Capture APIs in the following file.

[__simpleCudaGraphs.cu__](https://github.com/NVIDIA/cuda-samples/tree/master/Samples/3_CUDA_Features/simpleCudaGraphs): host code, in which
- The CUDA Graph API is demonstrated in two functions, `cudaGraphsManual()` and `cudaGraphsUsingStreamCapture()`
- `cudaGraphsManual()` uses the explicit CUDA Graph APIs
- `cudaGraphsUsingStreamCapture()` uses the stream capture APIs
- Reduction is performed in two CUDA kernels, `reduce()` and `reduceFinal()`
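To make the two-kernel structure concrete, here is a minimal host-only C++ sketch of a two-stage reduction: a first pass produces one partial sum per block, and a second pass sums the partials. The names `reduce_blocks` and `reduce_final` and the idea of a caller-chosen block size are assumptions for illustration; the sample's kernels run on the GPU and are not implemented as these loops.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Stage 1 (stands in for reduce()): one partial sum per block of the input.
std::vector<double> reduce_blocks(const std::vector<float>& input,
                                  std::size_t block_size) {
  assert(block_size > 0);
  std::vector<double> partial;
  for (std::size_t start = 0; start < input.size(); start += block_size) {
    std::size_t end = std::min(start + block_size, input.size());
    double sum = 0.0;
    for (std::size_t i = start; i < end; ++i) sum += input[i];
    partial.push_back(sum);
  }
  return partial;
}

// Stage 2 (stands in for reduceFinal()): collapse the partial sums.
double reduce_final(const std::vector<double>& partial) {
  return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

In the graph, the second stage depends on the first, which is why the sample adds the two kernels as separate, ordered nodes.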


## Migrate CUDA source to SYCL source

<p style="background-color:#cdc"> Note: A CUDA development machine is required to accomplish the task in this section </p>

Now that we have analyzed the CUDA source, we will migrate the CUDA source into SYCL source using the __SYCLomatic Tool__.

In this exercise, we will walk you through step-by-step to migrate the CUDA code.

#### Requirements

Make sure you have an __NVIDIA CUDA development machine__ that can __compile and run CUDA code__. The next step is to install the tools for migrating CUDA to SYCL:

- Install SYCLomatic Tool on this machine
  - Go to https://github.com/oneapi-src/SYCLomatic/releases/
  - Copy the link to the latest `linux_release.tgz` from the assets
  - On the CUDA development machine: `mkdir syclomatic; cd syclomatic`
  - `wget <link to linux_release.tgz>`
  - `tar -xvf linux_release.tgz`
  - `export PATH="/home/$USER/syclomatic/bin:$PATH"`
  - Verify installation: `c2s --version`
- Clone the CUDA samples repo to this machine
  - `git clone https://github.com/NVIDIA/cuda-samples.git`
- Compile and run the `simpleCudaGraphs` sample
  - `cd cuda-samples/Samples/3_CUDA_Features/simpleCudaGraphs/`
  - `make`


### Migrate CUDA source to SYCL source using SYCLomatic

On the NVIDIA CUDA development machine, go to the CUDA source folder and generate a compilation database with the `intercept-build` tool. This creates a JSON file that records every compiler invocation, including the names of the input files and the compiler options.

```
make clean
intercept-build make
```

This will create a file named `compile_commands.json` in the sample folder.

Next, use the SYCLomatic Tool (c2s) to migrate the code; it will store the result in the migration folder `dpct_output`:

```
c2s -p compile_commands.json --in-root ../../.. --gen-helper-function
```

The `--gen-helper-function` option copies the SYCLomatic helper header files to the output directory.

The `--in-root` option specifies the root directory of the source tree to migrate. Pointing it at the repository root (`../../..`) lets the tool also migrate the shared headers in the `Common` folder.

By default, the command writes the migrated C++ SYCL source to a folder named `dpct_output`, together with any migrated dependencies from the `Common` folder:

- `simpleCudaGraphs.dp.cpp`

This command may also emit a number of warnings about the migration process. CUDA code that cannot be migrated automatically is marked with warning comments in the migrated source files and has to be migrated manually.


## Analyze, Compile and Run the migrated SYCL source

<p style="background-color:#cdc"> Note: The tasks in this section should be done on Intel DevCloud or on a system with oneAPI Base toolkit installed.</p>

The migrated SYCL code is in the `Samples` folder under the `dpct_output` folder:
- `simpleCudaGraphs.dp.cpp`

The `dpct_output` folder also has the header files needed for compiling the migrated SYCL code. The `Common` folder has header files with CUDA helper functions that have been migrated to SYCL, and the `include` folder has header files with SYCLomatic helper functions.

#### Requirements

Make sure you have one of the following:
- __Development machine with Intel CPU/GPU__ with Intel oneAPI Base Toolkit installed
- __Intel Developer Cloud__ account to access the Intel CPUs/GPUs on the cloud

### Compiling migrated SYCL code

Copy the files mentioned above from the `dpct_output` folder on the __NVIDIA development machine__ to the __Intel Developer Cloud__.

To compile the migrated SYCL code we can use the following command:
```
icpx -fsycl -fsycl-targets=intel_gpu_pvc -I ../../../Common -I ../../../include *.cpp -pthread
```

There may be compile errors based on whether all of the CUDA code was migrated to SYCL or not. The migrated code may also include comments with warning messages, which could help make it easier to fix the errors. These errors have to be manually fixed to get the code to compile.


### Fixing unmigrated SYCL code

The CUDA Graph API calls can be migrated to SYCL manually using two separate approaches:
- The Taskflow programming model, which manages a task dependency graph, demonstrated in sycl_migrated_option1
- The SYCL Graph extension, which builds a dependency graph of command groups that is later executed as a whole, demonstrated in sycl_migrated_option2

With the Taskflow approach, we do not migrate `cudaGraphsUsingStreamCapture()` because CUDA Stream Capture APIs are not yet supported in SYCL through Taskflow.
With the SYCL Graph approach, the method `cudaGraphsManual()` is migrated using the explicit graph building API, and the method `cudaGraphsUsingStreamCapture()` is migrated using the queue recording API.

The following warnings in the "DPCT1XXX" format are generated by the tool to indicate the code has not been migrated by the tool and needs to be manually modified to complete the migration. Below are the manual workarounds, Option 1 for sycl_migrated_option1 and Option 2 for sycl_migrated_option2 respectively.

##### 1. DPCT1007: Migration of cudaGraphCreate is not supported.
```
cudaGraphCreate(&graph, 0);
```
##### Option 1 (using Taskflow): 
SYCLomatic does not yet migrate the CUDA Graph APIs. We can migrate these APIs manually with the help of the [Taskflow](https://github.com/taskflow/taskflow) programming model, which supports SYCL. Taskflow introduces a lightweight task graph-based programming model, [tf::syclFlow](https://github.com/taskflow/taskflow/tree/master/taskflow/sycl), for tasking SYCL operations and their dependencies. To use `tf::syclFlow`, we must include the header file `taskflow/sycl/syclflow.hpp`.
```
tf::Taskflow tflow;
tf::Executor exe;
```

The above code lines construct a taskflow and an executor. The graph created by the taskflow is executed by an executor.
##### Option 2 (using SYCL Graph):
SYCL Graph is an extension in the `sycl::ext::oneapi::experimental` namespace. Constructing a `command_graph` creates a graph object in the modifiable state for a given context and device, here taken from the queue `q`.

```
namespace sycl_ext = sycl::ext::oneapi::experimental;
sycl_ext::command_graph graph(q.get_context(), q.get_device());
```
##### 2. DPCT1007: Migration of cudaGraphAddMemcpyNode is not supported.
```
cudaGraphAddMemcpyNode(&memcpyNode, graph, NULL, 0, &memcpyParams);
```

##### Option 1 (using Taskflow): 
`tf::syclFlow` provides a `memcpy` method that creates a memcpy task copying untyped data; the size is given in bytes.

```
tf::syclTask inputVec_h2d = sf.memcpy(inputVec_d, inputVec_h, sizeof(float) * inputSize).name("inputVec_h2d");
```

##### Option 2 (using SYCL Graph):
The `command_graph` class provides an `add` method. Called without a command group function, it creates an empty node that contains no command; such a node serves as a connection point between groups of nodes and can significantly reduce the number of edges. Called with a command group function, it creates a node for that command, which is how we add the memcpy operation as a node.

```
auto nodecpy = graph.add([&](sycl::handler& h){
 h.memcpy(inputVec_d, inputVec_h, sizeof(float) * inputSize);
});
```

##### 3. DPCT1007: Migration of cudaGraphAddMemsetNode is not supported.
```
cudaGraphAddMemsetNode(&memsetNode, graph, NULL, 0, &memsetParams);
```

##### Option 1 (using Taskflow): 
The tf::syclFlow::memset method creates a memset task that fills untyped data with a byte value.
```
tf::syclTask outputVec_memset = sf.memset(outputVec_d, 0, numOfBlocks * sizeof(double)).name("outputVecd_memset");
```

For more information on memory operations, refer to the [syclFlow header](https://github.com/taskflow/taskflow/blob/master/taskflow/sycl/syclflow.hpp).

##### Option 2 (using SYCL Graph):
Similar to the memcpy node, the memset operation can also be included as a node through the command graph `add` method.

```
auto nodememset1 = graph.add([&](sycl::handler& h){
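 // fill writes numOfBlocks copies of the pattern; a double pattern (0.0)
 // zeroes every double, while an int pattern would cover only half the bytes.
 h.fill(outputVec_d, 0.0, numOfBlocks);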
});
```
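The two options size the same buffer differently: the Taskflow `memset` call takes a byte count (`numOfBlocks * sizeof(double)`), while `fill` takes an element count and writes one pattern value per element. The difference can be sketched on the host with the standard library; `zero_bytes`, `zero_elements`, and `leading_zeros` are hypothetical helpers for illustration and are not part of SYCL or Taskflow.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// memset-style: the size argument is a number of bytes.
void zero_bytes(double* data, std::size_t num_bytes) {
  assert(data != nullptr);
  std::memset(data, 0, num_bytes);
}

// fill-style: the size argument is a number of elements of type T.
template <typename T>
void zero_elements(T* data, std::size_t count) {
  assert(data != nullptr);
  std::fill_n(data, count, T{});
}

// Returns how many leading elements of `v` are zero.
std::size_t leading_zeros(const std::vector<double>& v) {
  std::size_t n = 0;
  while (n < v.size() && v[n] == 0.0) ++n;
  return n;
}
```

Passing an element count where a byte count is expected clears only part of the buffer: `zero_bytes(ptr, 8)` zeroes a single `double`, not eight of them.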

##### 4. DPCT1007: Migration of cudaGraphAddKernelNode is not supported.
```
cudaGraphAddKernelNode(&kernelNode, graph, nodeDependencies.data(),
                         nodeDependencies.size(), &kernelNodeParams);
```

##### Option 1 (using Taskflow): 
The tf::syclFlow::on creates a task to launch the given command group function object and tf::syclFlow::parallel_for creates a kernel task from a parallel_for method through the handler object associated with a command group. The SYCL runtime schedules command group function objects from an out-of-order queue and constructs a task graph based on submitted events.

```
tf::syclTask reduce_kernel = sf.on([=] (sycl::handler& cgh){
  sycl::local_accessor<double, 1> tmp(sycl::range<1>(THREADS_PER_BLOCK), cgh);
  cgh.parallel_for(sycl::nd_range<3>{sycl::range<3>(1, 1, numOfBlocks) *
                            sycl::range<3>(1, 1, THREADS_PER_BLOCK), sycl::range<3>(1, 1, THREADS_PER_BLOCK)}, [=](sycl::nd_item<3> item_ct1)[[intel::reqd_sub_group_size(SUB_GRP_SIZE)]]
                  {
                    reduce(inputVec_d, outputVec_d, inputSize, numOfBlocks, item_ct1, tmp.get_pointer());
                  });
    }).name("reduce_kernel");
```

##### Option 2 (using SYCL Graph):
Kernel operations are also included as a node through the command graph `add` method. These commands are captured into the graph and executed asynchronously when the graph is submitted to a queue. The `property::node::depends_on` property can be passed here with a list of nodes to create dependency edges on.

```
auto nodek1 = graph.add([&](sycl::handler &cgh) {
sycl::local_accessor<double, 1> tmp_acc_ct1(
 sycl::range<1>(THREADS_PER_BLOCK), cgh);

cgh.parallel_for(
 sycl::nd_range<3>(sycl::range<3>(1, 1, numOfBlocks) *
                       sycl::range<3>(1, 1, THREADS_PER_BLOCK),
                   sycl::range<3>(1, 1, THREADS_PER_BLOCK)),
 [=](sycl::nd_item<3> item_ct1) [[intel::reqd_sub_group_size(32)]] {
   reduce(inputVec_d, outputVec_d, inputSize, numOfBlocks, item_ct1,
          tmp_acc_ct1.get_pointer());
 });
},  sycl_ext::property::node::depends_on(nodecpy, nodememset1));
```

##### 5. DPCT1007: Migration of cudaGraphAddHostNode is not supported.
```
cudaGraphAddHostNode(&hostNode, graph, nodeDependencies.data(),
                     nodeDependencies.size(), &hostParams);
```

##### Option 1 (using Taskflow): 
`tf::syclFlow` has no host method for running a callable on the host. Instead, we can create a task in the enclosing taskflow, since Taskflow supports dynamic tasking and runs the callable on the host.

```
tf::Task syclHostTask = tflow.emplace([&](){
  myHostNodeCallback(&hostFnData);
}).name("syclHostTask");
syclHostTask.succeed(syclKernelTask);
```

The task dependencies are established through `precede` or `succeed`; here `syclHostTask` runs after `syclKernelTask`.

##### 6. DPCT1007: Migration of cudaGraphGetNodes is not supported.
```
cudaGraphGetNodes(graph, nodes, &numNodes);
```

##### Option 1 (using Taskflow): 
CUDA graph nodes are equivalent to SYCL tasks. Both the `tf::Taskflow` and `tf::syclFlow` classes provide a `num_tasks()` function to query the total number of tasks.

```
sf_Task = sf.num_tasks();
```

##### 7. DPCT1007: Migration of cudaGraphInstantiate is not supported.
```
cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);
```

##### Option 1 (using Taskflow): 
A SYCL task graph does not need to be instantiated before execution, but the task dependencies must be established using `precede` and `succeed`.

```
reduce_kernel.succeed(inputVec_h2d, outputVec_memset).precede(reduceFinal_kernel);
reduceFinal_kernel.succeed(resultd_memset).precede(result_d2h);
```

The `inputVec_h2d` and `outputVec_memset` tasks can run in parallel; `reduce_kernel` runs after both have completed.
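The ordering that `precede` and `succeed` impose can be illustrated without Taskflow. The following host-only sketch uses a hypothetical `TinyGraph` class (not the Taskflow API) that records edges and runs tasks sequentially in a valid topological order using Kahn's algorithm. Taskflow can additionally run independent tasks concurrently, which this sketch does not attempt.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal task graph; illustrates dependency ordering only.
class TinyGraph {
 public:
  // Adds a task and returns its id.
  std::size_t add(std::string name, std::function<void()> work = [] {}) {
    names_.push_back(std::move(name));
    work_.push_back(std::move(work));
    successors_.emplace_back();
    indegree_.push_back(0);
    return names_.size() - 1;
  }

  // Declares that `before` must finish before `after` starts.
  void precede(std::size_t before, std::size_t after) {
    assert(before < names_.size() && after < names_.size());
    successors_[before].push_back(after);
    ++indegree_[after];
  }

  // Runs every task once, respecting the edges; returns the order used.
  std::vector<std::string> run() const {
    std::vector<std::size_t> remaining = indegree_;
    std::queue<std::size_t> ready;
    for (std::size_t i = 0; i < remaining.size(); ++i)
      if (remaining[i] == 0) ready.push(i);

    std::vector<std::string> order;
    while (!ready.empty()) {
      std::size_t t = ready.front();
      ready.pop();
      work_[t]();
      order.push_back(names_[t]);
      // A successor becomes ready once all of its predecessors have run.
      for (std::size_t next : successors_[t])
        if (--remaining[next] == 0) ready.push(next);
    }
    return order;
  }

 private:
  std::vector<std::string> names_;
  std::vector<std::function<void()>> work_;
  std::vector<std::vector<std::size_t>> successors_;
  std::vector<std::size_t> indegree_;
};
```

With tasks named after the ones above and the same edges, `reduce_kernel` always appears after both `inputVec_h2d` and `outputVec_memset` in the returned order, and `result_d2h` comes last.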

##### Option 2 (using SYCL Graph):
After all the operations have been added as nodes, the graph is finalized with `finalize()`. No more nodes can be added after this point, and the call returns an executable graph that can be submitted for execution.
```
auto exec_graph = graph.finalize();
sycl::queue qexec = sycl::queue{sycl::gpu_selector_v, 
 {sycl::ext::intel::property::queue::no_immediate_command_list()}};
```

##### 8. DPCT1007: Migration of cudaGraphClone is not supported.
```
cudaGraphClone(&clonedGraph, graph);
```

##### Option 1 (using Taskflow): 
Taskflow provides no clone function because its graph objects are move-only. We can use `std::move()` as shown below to obtain equivalent functionality.

```
tf::Taskflow tflow_clone(std::move(tflow));
```

This constructs a taskflow `tflow_clone` from the moved taskflow `tflow`, and `tflow` becomes empty. For more information, see the [Taskflow class reference](https://taskflow.github.io/taskflow/classtf_1_1Taskflow.html#afd790de6db6d16ddf4729967c1edebb5).
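The difference between cloning and moving can be shown with a small host-only sketch. `MoveOnlyGraph` and `transfer` are hypothetical stand-ins, not Taskflow types: copying is deleted, and moving hands the tasks to the new object while leaving the source empty, which is the behavior described above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical move-only graph type; stands in for tf::Taskflow here.
class MoveOnlyGraph {
 public:
  MoveOnlyGraph() = default;
  MoveOnlyGraph(const MoveOnlyGraph&) = delete;  // no copy, so no "clone"
  MoveOnlyGraph& operator=(const MoveOnlyGraph&) = delete;

  // Moving transfers the tasks and explicitly leaves the source empty.
  MoveOnlyGraph(MoveOnlyGraph&& other) noexcept
      : tasks_(std::move(other.tasks_)) {
    other.tasks_.clear();
  }

  void add_task(std::string name) { tasks_.push_back(std::move(name)); }
  std::size_t num_tasks() const { return tasks_.size(); }
  bool empty() const { return tasks_.empty(); }

 private:
  std::vector<std::string> tasks_;
};

// Mirrors `tf::Taskflow tflow_clone(std::move(tflow));` for the sketch type.
MoveOnlyGraph transfer(MoveOnlyGraph& source) {
  std::size_t before = source.num_tasks();
  MoveOnlyGraph target(std::move(source));
  assert(target.num_tasks() == before && source.empty());
  return target;
}
```

Unlike `cudaGraphClone`, the original graph cannot be launched afterwards, so any code that still needs it has to rebuild it.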

##### 9. DPCT1007: Migration of cudaGraphLaunch is not supported.
```
for (int i = 0; i < GRAPH_LAUNCH_ITERATIONS; i++) {
  cudaGraphLaunch(graphExec, streamForGraph);
}
```
##### Option 1 (using Taskflow): 
A taskflow graph can be run once or multiple times using an executor. run_n() will run the taskflow the number of times specified by the second argument.

```
exe.run_n(tflow, GRAPH_LAUNCH_ITERATIONS).wait();
```

##### Option 2 (using SYCL Graph):
The graph is submitted in its entirety for execution via `handler::ext_oneapi_graph(graph)`.

```
for (int i = 0; i < GRAPH_LAUNCH_ITERATIONS; i++) {
  qexec.submit([&](sycl::handler& cgh) {
    cgh.ext_oneapi_graph(exec_graph);
  }).wait();
}
```

##### 10. DPCT1007: Migration of cudaGraphExecDestroy is not supported.
```
cudaGraphExecDestroy(graphExec);
cudaGraphDestroy(graph);
```

##### Option 1 (using Taskflow): 
Both `tf::Executor` and `tf::Taskflow` have destructors that release their resources when the objects go out of scope, so no explicit destroy calls are needed.

```
~Executor() 
~Taskflow()
```

To ensure that all the taskflow submissions are completed before calling the destructor, we must use wait() during the execution.

<p style="background-color:#cdc"> Note: The SYCL task graph programming model, syclFlow, leverages the out-of-order property of the SYCL queue to design a simple and efficient scheduling algorithm using topological sort. It can still be slower than CUDA Graphs because of execution overheads, so we prefer migrating with the SYCL Graph extension.</p>

Below is the manual migration of the `cudaGraphsUsingStreamCapture()` method using the SYCL Graph extension.

##### 11. DPCT1027: The call to cudaStreamBeginCapture was replaced with 0 because SYCL currently does not support capture operations on queues.
```
cudaStreamBeginCapture(stream1, cudaStreamCaptureModeGlobal);
cudaStreamEndCapture(stream1, &graph);
```

The Queue Recording API (Record & Replay) captures command groups submitted to a queue and records them in a graph. The `command_graph::begin_recording` and `command_graph::end_recording` entry points return a bool value that tells the user whether a related queue state change occurred. All the operations are placed between these two calls.

```
sycl_ext::command_graph graph(q.get_context(), q.get_device());
 graph.begin_recording(q);
 ..
 graph.end_recording();
```
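The record-and-replay idea can be sketched on the host without SYCL. `RecordingQueue` below is a hypothetical class, not the SYCL Graph API: while recording is active, submitted work is stored instead of executed, and `replay()` later runs the stored work in submission order. The bool return values mirror the state-change reporting described above.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical recorder; illustrates record and replay only.
class RecordingQueue {
 public:
  // Returns true if the state changed (recording was not already active).
  bool begin_recording() {
    bool changed = !recording_;
    recording_ = true;
    return changed;
  }

  // Returns true if the state changed (recording was active).
  bool end_recording() {
    bool changed = recording_;
    recording_ = false;
    return changed;
  }

  // Executes immediately when not recording; otherwise stores the work.
  void submit(std::function<void()> work) {
    assert(work != nullptr);
    if (recording_) {
      recorded_.push_back(std::move(work));
    } else {
      work();
    }
  }

  // Runs every recorded operation once, in submission order.
  void replay() const {
    for (const auto& work : recorded_) work();
  }

 private:
  bool recording_ = false;
  std::vector<std::function<void()>> recorded_;
};
```

Nothing submitted between `begin_recording()` and `end_recording()` runs until `replay()` is called, and calling `replay()` several times corresponds to launching the finalized graph several times.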

##### 12. The memcpy, memset, and kernel operations are submitted to the recording queue, and the returned `sycl::event` objects express the dependencies between the recorded nodes
```
     sycl::event ememcpy = q.memcpy(inputVec_d, inputVec_h, sizeof(float) * inputSize);

     // A double pattern (0.0) makes fill zero all numOfBlocks doubles.
     sycl::event ememset = q.fill(outputVec_d, 0.0, numOfBlocks);

     sycl::event ek1 = q.submit([&](sycl::handler &cgh) {
     cgh.depends_on({ememcpy, ememset});
     sycl::local_accessor<double, 1> tmp_acc_ct1(
       sycl::range<1>(THREADS_PER_BLOCK), cgh);

     cgh.parallel_for(
      sycl::nd_range<3>(sycl::range<3>(1, 1, numOfBlocks) *
                            sycl::range<3>(1, 1, THREADS_PER_BLOCK),
                        sycl::range<3>(1, 1, THREADS_PER_BLOCK)),
      [=](sycl::nd_item<3> item_ct1) [[intel::reqd_sub_group_size(32)]] {
        reduce(inputVec_d, outputVec_d, inputSize, numOfBlocks, item_ct1,
               tmp_acc_ct1.get_pointer());
      });
     });
```

##### 13. DPCT1007: Migration of cudaGraphInstantiate is not supported.
```
   cudaGraphInstantiate(&clonedGraphExec, clonedGraph, NULL, NULL, 0);
```

As with the explicit graph API, once all the operations have been recorded as nodes, the graph is finalized with `finalize()`. No more nodes can be added after this point, and the call returns an executable graph that can be submitted for execution.
```
   auto exec_graph = graph.finalize();
   sycl::queue qexec = sycl::queue{sycl::gpu_selector_v, 
      {sycl::ext::intel::property::queue::no_immediate_command_list()}};
```

##### 14. DPCT1007: Migration of cudaGraphLaunch is not supported.
```
   cudaGraphLaunch(clonedGraphExec, streamForGraph);
```

The graph is then submitted for execution via `handler::ext_oneapi_graph(graph)`.

```
   for (int i = 0; i < GRAPH_LAUNCH_ITERATIONS; i++) {
      qexec.submit([&](sycl::handler& cgh) {
        cgh.ext_oneapi_graph(exec_graph);
      }).wait();
   }
```

##### 15. The CUDA code calls a custom helper, `findCudaDevice`, from the `helper_cuda.h` file to find the best CUDA device available.
```
    findCudaDevice (argc, (const char **) argv);
```

Because it is a custom helper rather than a CUDA API, the SYCLomatic Tool does not act on it. We can either remove the call or replace it with the SYCL `get_device()` API.

### Compile and Run the migrated SYCL source

Once you have successfully migrated the CUDA source to the SYCL source, verify that the migrated SYCL code is functioning correctly by compiling and running it on the Intel Developer Cloud, which has a variety of Intel CPUs and GPUs available for development.

#### Build and Run sycl_migrated_option1
Select the cell below and click run ▶ to compile and execute the code:

`! ./q.sh run_sycl_migrated_option1.sh`

#### Build and Run sycl_migrated_option2
Select the cell below and click run ▶ to compile and execute the code:

`! ./q.sh run_sycl_migrated_option2.sh`

### SYCL Code Migration Analysis

Comparing the CUDA code with the migrated SYCL code shows several 1:1 equivalent calls, which are listed in the tables below.

1:1 equivalent mapping for the explicit Graph APIs:

| Functionality| CUDA| SYCL Taskflow| SYCL Graph
|-|-|-|-
| Header file| `#include <cuda_runtime.h>`| `#include <sycl/sycl.hpp>` <br> `#include <dpct/dpct.hpp>`| `#include <sycl/sycl.hpp>` <br> `#include <dpct/dpct.hpp>`
| Create Graph| `cudaGraphCreate(&graph, 0);`| `tf::Taskflow tflow; tf::Executor exe;`| `namespace sycl_ext = sycl::ext::oneapi::experimental; sycl_ext::command_graph graph(q.get_context(), q.get_device());`
| Add nodes to Graph| `cudaGraphAddKernelNode(&kernelNode, graph, nodeDependencies.data(),nodeDependencies.size(), &kernelNodeParams);`| `tf::syclTask reduce_kernel = sf.on([=] (sycl::handler& cgh){ cgh.parallel_for( … ); }).name("reduce_kernel");`| `auto nodek1 = graph.add([&](sycl::handler &cgh) { cgh.parallel_for( … ); },  sycl_ext::property::node::depends_on(nodecpy, nodememset1));`
| Finalize Graph| `cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);`| `reduce_kernel.succeed(inputVec_h2d, outputVec_memset).precede(reduceFinal_kernel); reduceFinal_kernel.succeed(resultd_memset).precede(result_d2h);`| `auto exec_graph = graph.finalize(); sycl::queue qexec = sycl::queue{sycl::gpu_selector_v,{sycl::ext::intel::property::queue::no_immediate_command_list()}};`
| Submit Graph to Queue| `for (int i = 0; i < 3; i++) { cudaGraphLaunch(graphExec, streamForGraph); }`| `exe.run_n(tflow, GRAPH_LAUNCH_ITERATIONS).wait();`| `for (int i = 0; i < 3; i++) {  qexec.submit([&](sycl::handler& cgh) { cgh.ext_oneapi_graph(exec_graph);  }).wait();}`


1:1 equivalent mapping for the Graph Stream Capture APIs:

| Functionality| CUDA| SYCL Graph
|-|-|-
| Create Graph| `cudaGraphCreate(&graph, 0); cudaStreamCreate(&stream1);`| `namespace sycl_ext = sycl::ext::oneapi::experimental; sycl_ext::command_graph graph(q.get_context(), q.get_device());`
| Begin Record| `cudaStreamBeginCapture(stream1, cudaStreamCaptureModeGlobal);`| `graph.begin_recording(q);`
| Add nodes to Graph| `cudaMemcpyAsync(&result_h, result_d, sizeof(double),cudaMemcpyDefault, stream1);`| `q.submit([&](sycl::handler &cgh) {cgh.depends_on(ek2); cgh.memcpy(&result_h, result_d, sizeof(double)); });`
| End Record| `cudaStreamEndCapture(stream1, &graph);`| `graph.end_recording();`
| Finalize Graph| `cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);`| `auto exec_graph = graph.finalize(); sycl::queue qexec = sycl::queue{sycl::gpu_selector_v,{sycl::ext::intel::property::queue::no_immediate_command_list()}};`
| Submit Graph to Queue| `for (int i = 0; i < 3; i++) { cudaGraphLaunch(graphExec, streamForGraph); }`| `for (int i = 0; i < 3; i++) {  qexec.submit([&](sycl::handler& cgh) { cgh.ext_oneapi_graph(exec_graph);  }).wait();}`



## Source Code

This section describes the location of the CUDA source and the contents of different SYCL source code directories in this project.

| folder name | source code description
| --- | ---
| [CUDA github](https://github.com/NVIDIA/cuda-samples/tree/master/Samples/3_CUDA_Features/simpleCudaGraphs) | Original CUDA Source used for migration
| dpct_output | Contains the output of the SYCLomatic Tool, which migrates the CUDA code to SYCL-compliant code. This SYCL code has some unmigrated code that must be manually fixed to get full functionality. (The code does not functionally work as generated.)
| sycl_migrated_option1 | Contains manually migrated SYCL code from CUDA code using Taskflow programming model.
| sycl_migrated_option2 | Contains manually migrated SYCL code from CUDA code using SYCL Graph extension.

<p style="background-color:#cdc"> Note: In the first approach (sycl_migrated_option1) we only migrate the cudaGraphsManual() method using the Taskflow programming model. We do not migrate cudaGraphsUsingStreamCapture() because CUDA Stream Capture APIs are not yet supported in SYCL through Taskflow.</p>


## Summary

In this module we learned how to migrate a simple CUDA source to a functional SYCL source using `SYCLomatic`, and then analyzed and optimized the SYCL source by manual coding.