# C++ SYCL Introduction

This section introduces compute offloading to accelerators and SYCL programming, starting from plain C++:
- [Why Offload Computation to Accelerators?](#Why-Offload-Computation-to-Accelerators?)
- [Simple Computation on CPU](#Simple-Computation-on-CPU)
- [C++ SYCL for Offloading Computation on Device](#C++-SYCL-for-Offloading-Computation-on-Device)
- [SYCL Libraries for Offloading Computation on Device](#SYCL-Libraries-for-Offloading-Computation-on-Device)



## Why Offload Computation to Accelerators?

Large computational problems in High Performance Computing (HPC) can often run substantially faster on specialized hardware accelerators, such as GPUs, than on a CPU.

Accelerators such as GPUs can run many small computations at once by exploiting parallelism in the hardware.

### Programming Languages for Accelerators

| Language | Description
|:--|:--
| CUDA | CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model developed by NVIDIA for programming NVIDIA GPUs.
| HIP | HIP (Heterogeneous Interface for Portability) is a free and open-source runtime API and kernel language. With it, you can convert an existing CUDA® application into a single C++ code base that can be compiled to run on AMD or NVIDIA GPUs.
| OpenMP | OpenMP (Open Multi-Processing) is an application programming interface (API) that supports multi-platform shared-memory multiprocessing programming in C/C++ and Fortran. OpenMP allows loop-level parallelism and also supports function-level parallelism.
| OpenACC | The OpenACC (Open Accelerators) specification supports the C, C++ and Fortran programming languages and multiple hardware architectures. OpenACC uses directives to tell the compiler where to parallelize loops and how to manage data between host and accelerator memories.
| OpenCL | OpenCL (Open Computing Language) is an open, royalty-free standard by the Khronos Group for cross-platform, parallel programming of diverse accelerators. OpenCL provides a C-like programming model and is a low-level framework that needs manual memory management and synchronization.
| SYCL | SYCL is a royalty-free open standard developed by the Khronos Group that allows developers to program heterogeneous architectures in standard C++. Implementations support NVIDIA, AMD and Intel GPUs. SYCL provides C++-style programming with high-level abstractions that make it easier to program.


### Accelerating Choice with SYCL
SYCL (pronounced ‘sickle’) is a royalty-free, cross-platform abstraction layer that:

- Enables code for heterogeneous and offload processors to be written using modern ISO C++ (at least C++17).
- Provides APIs and abstractions to find devices (e.g. CPUs, GPUs, FPGAs) on which code can be executed, and to manage data resources and code execution on those devices.

SYCL offers:
- An open, standards-based programming model
- Multiarchitecture performance
- Freedom from vendor lock-in
- Performance that can be comparable to native CUDA on NVIDIA GPUs
- An extension of the widely used C++ language
- Faster code migration from CUDA via the open source SYCLomatic tool


## Simple Computation on CPU

Let’s look at this simple C++ code:
- We initialize a data array
- We do some computation on each element of the array
- Print the output


In [None]:
%%writefile lab/cpp_compute.cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================

#include <iostream>

int main(){

    //# initialize some data array
    const int N = 16;
    float data[N];
    for(int i=0;i<N;i++) data[i] = i;

    //# computation on CPU
    for(int i=0;i<N;i++) data[i] = data[i] * 5;

    //# print output
    for(int i=0;i<N;i++) std::cout << data[i] << "\n"; 
}

#### Build and Run

We will use `icpx` to compile the C++ code:

```sh
icpx lab/cpp_compute.cpp
```
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! ./run_cpp_compute.sh

## C++ SYCL for Offloading Computation on Device

The code below shows the modifications made to the C++ code above using SYCL so that the computation is offloaded to an accelerator device.

What SYCL does here:
- Select device for offloading
- Allocate memory so that both host and device can access
- Submit a kernel task to device for computation and wait for completion


In [None]:
%%writefile lab/sycl_compute_offload.cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================

#include <sycl/sycl.hpp>
#include <iostream>

int main(){
    //# select device for offload
    sycl::queue q(sycl::gpu_selector_v);
    std::cout << "Offload Device: " << q.get_device().get_info<sycl::info::device::name>() << "\n";

    //# initialize some data array
    const int N = 16;
    auto data = sycl::malloc_shared<float>(N, q);
    for(int i=0;i<N;i++) data[i] = i;

    //# computation on GPU
    q.single_task([=](){
        for(int i=0;i<N;i++) data[i] = data[i] * 5;
    }).wait();

    //# print output
    for(int i=0;i<N;i++) std::cout << data[i] << "\n";

    //# free the shared memory allocation
    sycl::free(data, q);
}

#### Build and Run

We will use `icpx` with the `-fsycl` compiler flag to compile the C++ SYCL code:

```sh
icpx -fsycl lab/sycl_compute_offload.cpp
```

Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! ./run_sycl_compute_offload.sh

SYCL enables three important pieces of functionality:
- __Device Selection__ for offloading computation
- __Memory Allocation__ so that both host and device can access data
- __Submit Compute Task__ to execute on device

### SYCL - Device Selection

`sycl::queue` is used to schedule a task to execute on a device. The device can be specified when the `sycl::queue` is constructed:

| SYCL Device Selector | Description |
|---|---|
| `sycl::queue q(sycl::gpu_selector_v);`| Select GPU for offloading |
| `sycl::queue q(sycl::cpu_selector_v);`| Select CPU for offloading|
| `sycl::queue q(sycl::accelerator_selector_v);`|Select other Accelerator for offloading|
| `sycl::queue q(sycl::default_selector_v);`|Select the default device for offloading (implementation-defined; typically a GPU if one exists, otherwise the CPU)|
| `sycl::queue q;` | Same as default selector|
| `sycl::queue q(sycl::aspect_selector(sycl::aspect::fp64));` | Select a device that supports fp64 |

### SYCL - Memory Allocation

`sycl::malloc_shared` is used here to allocate memory that both host and device can access; the runtime moves the data between them implicitly.

There is also `sycl::malloc_device`, which allocates memory on the device only. Data must then be copied explicitly between host and device, which gives more control over data movement and is often recommended for performance.

### SYCL - Submitting Task to Device

`q.single_task` is the most basic method to submit a task to execute on the device; the kernel runs once, as a single work-item.

Kernel execution happens asynchronously, so the host has to synchronize with the device before using the results.
The `.wait()` method synchronizes with the host by waiting for the task to complete.

There is also `q.parallel_for`, which submits a kernel that is executed once for each index in a range, enabling parallel execution on the device.

The code below shows usage of `parallel_for` to enable parallelism of execution on device:

In [None]:
%%writefile lab/sycl_compute_offload_parallelism.cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================

#include <sycl/sycl.hpp>
#include <iostream>

int main(){
    //# select device for offload
    sycl::queue q(sycl::gpu_selector_v);
    std::cout << "Offload Device: " << q.get_device().get_info<sycl::info::device::name>() << "\n";

    //# initialize some data array
    const int N = 16;
    auto data = sycl::malloc_shared<float>(N, q);
    for(int i=0;i<N;i++) data[i] = i;

    //# parallel computation on GPU
    q.parallel_for(N,[=](auto i){
        data[i] = data[i] * 5;
    }).wait();

    //# print output
    for(int i=0;i<N;i++) std::cout << data[i] << "\n";

    //# free the shared memory allocation
    sycl::free(data, q);
}

#### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! ./run_sycl_compute_offload_parallelism.sh

## SYCL Libraries for Offloading Computation on Device

The code below shows the modifications made to the C++ code above using a SYCL library, oneDPL (the oneAPI DPC++ Library), so that the computation is offloaded to an accelerator device without requiring any knowledge of SYCL programming.

The code uses the `oneapi::dpl::for_each(...)` library call, which applies the given function to each element in a range of data. It also passes the execution policy `oneapi::dpl::execution::dpcpp_default`, which selects the default SYCL device for offloading the computation.

Many more SYCL libraries are available that simplify offloading certain types of computation to devices. These libraries can be used to get device offloading working quickly when starting from C/C++ code or when the computation is simple. However, using SYCL calls directly lets you better optimize the code for performance by programming to device hardware features.

In [None]:
%%writefile lab/onedpl_compute_offload.cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================

#include <oneapi/dpl/algorithm>
#include <oneapi/dpl/execution>
#include <iostream>
#include <vector>

int main(){
    //# initialize some data array
    const int N = 16;
    std::vector<int> data(N);
    for(int i=0;i<N;i++) data[i] = i;

    //# parallel computation on GPU using SYCL library (oneDPL)
    oneapi::dpl::for_each(oneapi::dpl::execution::dpcpp_default, data.begin(), data.end(), [](int &tmp){ tmp *= 5; });

    //# print output
    for(int i=0;i<N;i++) std::cout << data[i] << "\n"; 
}

#### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! ./run_onedpl_compute_offload.sh
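
oneDPL's `for_each` mirrors the interface of the C++ standard library algorithm `std::for_each`. For comparison, here is a minimal host-only sketch of the same computation using the standard algorithm. No offload happens in this version, and the helper name `scale_by_five` is hypothetical.

```cpp
#include <algorithm>
#include <vector>

//# hypothetical helper: same element-wise operation as above, run sequentially on the host
std::vector<int> scale_by_five(std::vector<int> data){
    //# std::for_each takes the same iterator range and callable as oneapi::dpl::for_each,
    //# but no execution policy is passed here, so it runs on the CPU
    std::for_each(data.begin(), data.end(), [](int &tmp){ tmp *= 5; });
    return data;
}
```

Because the iterator range and the lambda are identical, moving between the host version and the oneDPL version mainly involves changing the namespace and adding the execution policy argument.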

## Lab Exercise: Vector Add

Complete the coding exercise below using the SYCL concepts introduced above:
- The code has three arrays initialized on host
- There is a for-loop to compute c = a + b
- Modify the code using SYCL to offload the computation to a device
  - include SYCL header
  - create a SYCL queue for offloading
  - allocate memory so that both device and host can access
  - submit a kernel for computation

1. Edit the code cell below by following the steps and then click run ▶ to save the code to a file.
2. Next run ▶ the cell in the __Build and Run__ section below the code to compile and execute the code.

In [None]:
%%writefile lab/sycl_vector_add.cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================

#include <iostream>

//# STEP 1 : Include header for SYCL
//# YOUR CODE GOES HERE





int main(){
    
    //# STEP 2: Create a SYCL queue and device selection for offload
    //# YOUR CODE GOES HERE
    
    
    
    

    //# initialize some data array
    const int N = 16;
    
    //# STEP 3: Allocate memory so that both host and device can access
    //# MODIFY THE CODE BELOW    
    float a[N], b[N], c[N];
    
    
    
    
    for(int i=0;i<N;i++) {
        a[i] = 1;
        b[i] = 2;
        c[i] = 0;
    }

    
    //# STEP 4: Submit computation to Offload device
    //# MODIFY THE CODE BELOW      
    
    //# computation
    for(int i=0;i<N;i++) c[i] = a[i] + b[i];

    
    
    
    //# print output
    for(int i=0;i<N;i++) std::cout << c[i] << "\n"; 
}

### Build and Run
Select the cell below and click Run ▶ to compile and execute the code above:

In [None]:
! ./run_sycl_vector_add.sh

## Summary

C++ with SYCL is an open standard for heterogeneous computing.