# Introduction to oneAPI and SYCL

##### Sections
- [oneAPI Programming Model Overview](#oneAPI-Programming-Model-Overview)
- [Programming Challenges for Multiple architectures](#Programming-Challenges-for-Multiple-architectures)
- [Introducing oneAPI](#Introducing-oneAPI)
- _Code:_ [SYCL Hello World](#Simple-Exercise)
- [What is SYCL](#SYCL)
- [How to Compile & Run a SYCL program](#How-to-Compile-&-Run-SYCL-program)

## Learning Objectives

* Explain how the __oneAPI__ programming model can solve the challenges of programming in a heterogeneous world 
* Use oneAPI projects to enable your workflows
* Understand the __SYCL__ language and programming model
* Become familiar with using Jupyter notebooks for training throughout the course


## oneAPI Programming Model Overview
The __oneAPI__ programming model provides a comprehensive and unified portfolio of developer tools that can
be used across hardware targets, including a range of performance libraries spanning several workload
domains. The libraries include functions custom-coded for each target architecture so the same
function call delivers optimized performance across supported architectures. __DPC++__ is based on
industry standards and open specifications to encourage ecosystem collaboration and innovation.

### oneAPI Distribution
Intel&reg; oneAPI toolkits are available via multiple distribution channels:
* Local product installation: install the oneAPI toolkits from the __Intel® Developer Zone__.
* Install from containers or repositories: install the oneAPI toolkits from one of several supported
containers or repositories.
* Pre-installed in the __Intel® DevCloud__: a free development sandbox for access to the latest Intel® SVMS hardware and select oneAPI toolkits. 

## Programming Challenges for Multiple architectures
In the data-centric space, specialized workloads are growing. Because there is no common programming language or set of APIs, each kind of data-centric hardware typically has to be programmed with its own languages and libraries, which means maintaining separate code bases. Tool support is inconsistent across platforms, so developers have to learn a different set of tools for each one. Developing software for each hardware platform requires a separate investment, with little ability to reuse that work on a different architecture. You also have to account for the differing requirements of this diverse set of data-centric hardware.

<img src="Assets/oneapi1.png">


## Introducing oneAPI
__oneAPI__ is a unified programming model that __simplifies development__ across diverse architectures. It includes a unified and simplified language and libraries for expressing __parallelism__, and it delivers uncompromised native high-level language performance across a range of hardware, including __CPUs, GPUs, and FPGAs__. The oneAPI initiative is based on __industry standards and open specifications__ and is interoperable with existing HPC programming models.

<img src="Assets/oneapi2.png">


***
# Simple Exercise
This exercise introduces SYCL through a small code sample. It also introduces the Jupyter notebook environment for editing and saving code, and for running and submitting programs to the Intel® DevCloud.

##  Editing the simple.cpp code
The Jupyter cell below with the gray background can be edited in-place and saved.

The first line of the cell contains the command **%%writefile lab/simple.cpp**. This tells the input cell to save its contents into a file named `simple.cpp` in the `lab` directory under your current directory (usually your home directory). Each time you edit the cell and run it in the Jupyter notebook, it saves your changes into that file.

The code below is a SYCL program that sorts an array with four variants of a bitonic sort and reports timing statistics for each. Simply inspect the code; no modifications are necessary. Run the first cell to create the file, then run the cell below it to compile and execute the code.
1. Inspect the code cell below, then click run ▶ to save the code to a file
2. Run ▶ the cell in the __Build and Run__ section below the code snippet to compile and execute the code in the saved file

%%writefile lab/simple.cpp
#include <CL/sycl.hpp>
#include <iostream>
#include <vector>
#include <cmath>
#include <algorithm>
#include <cassert>
#include <chrono>
#include <numeric>
#include <string>   // std::string (used by print_stats)
#include <cstdlib>  // rand

using namespace sycl;

// Helper: returns true if x is a non-zero power of two
bool is_power_of_two(size_t x) { return x && !(x & (x - 1)); }

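// Basic bitonic sort: one kernel per (k, j) stage over a simple range;
// the host waits once after all stages have been submitted.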
void bitonic_sort_v1(std::vector<int>& data) {
    size_t N = data.size();
    assert(is_power_of_two(N) && "size must be power of two");
    queue q;
    buffer<int> buf(data.data(), range<1>(N));

    for (size_t k = 2; k <= N; k <<= 1) {
        for (size_t j = k >> 1; j > 0; j >>= 1) {
            q.submit([&](handler& h) {
                auto d = buf.get_access<access::mode::read_write>(h);
                h.parallel_for(range<1>(N), [=](id<1> idx) {
                    size_t i = idx[0];
                    size_t ixj = i ^ j;
                    if (ixj > i && ((d[i] > d[ixj]) == ((i & k) == 0))) {
                        int tmp = d[i];
                        d[i] = d[ixj];
                        d[ixj] = tmp;
                    }
                });
            });
        }
    }
    q.wait();
}

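// Single work-group version: the whole array is staged in local memory and
// every stage runs inside one kernel, separated by work-group barriers.
// N must not exceed the device's maximum work-group size.
void bitonic_sort_v2(std::vector<int>& data) {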
    const size_t N = data.size();
    if (!is_power_of_two(N)) {
        std::cerr << "Array size must be a power of two.\n";
        return;
    }

    if (N > 8192) {
        std::cerr << "bitonic_sort_v2 supports only N <= 8192 due to local memory limits.\n";
        return;
    }

    queue q;
    buffer<int> buf(data.data(), range<1>(N));

    q.submit([&](handler& h) {
        accessor<int, 1, access::mode::read_write, access::target::local> local_mem(N, h);
        auto d = buf.get_access<access::mode::read_write>(h);

        h.parallel_for(nd_range<1>(range<1>(N), range<1>(N)), [=](nd_item<1> item) {
            size_t i = item.get_local_id(0);

            // Load to local memory
            local_mem[i] = d[i];
            item.barrier(access::fence_space::local_space);

            for (size_t k = 2; k <= N; k <<= 1) {
                for (size_t j = k >> 1; j > 0; j >>= 1) {
                    size_t ixj = i ^ j;
                    if (ixj > i) {
                        bool ascending = ((i & k) == 0);
                        int a = local_mem[i];
                        int b = local_mem[ixj];
                        if ((a > b) == ascending) {
                            local_mem[i] = b;
                            local_mem[ixj] = a;
                        }
                    }
                    item.barrier(access::fence_space::local_space);
                }
            }

            // Write back
            d[i] = local_mem[i];
        });
    }).wait();
}





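// ND-range version: one kernel per (k, j) stage with a work-group size of 128,
// waiting on the host after every submission. N must be a multiple of 128.
void bitonic_sort_v3(std::vector<int>& data) {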
    const size_t N = data.size();
    assert(is_power_of_two(N));

    sycl::queue q;
    sycl::buffer<int> buf(data.data(), sycl::range<1>(N));

    const size_t local_size = 128;

    for (size_t k = 2; k <= N; k <<= 1) {
        for (size_t j = k >> 1; j > 0; j >>= 1) {
            q.submit([&](sycl::handler& h) {
                auto d = buf.get_access<sycl::access::mode::read_write>(h);

                h.parallel_for(
                    sycl::nd_range<1>(sycl::range<1>(N), sycl::range<1>(local_size)),
                    [=](sycl::nd_item<1> item) {
                        size_t i = item.get_global_id(0);
                        size_t ixj = i ^ j;

                        if (ixj > i && ixj < N) {
                            int val_i = d[i];
                            int val_ixj = d[ixj];
                            bool ascending = ((i & k) == 0);
                            if ((val_i > val_ixj) == ascending) {
                                d[i] = val_ixj;
                                d[ixj] = val_i;
                            }
                        }
                    });
            }).wait();
        }
    }
}




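// ND-range version with a work-group size of 256: kernels are submitted back
// to back and the host waits once at the end. N must be a multiple of 256.
void bitonic_sort_v4(std::vector<int>& data) {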
    const size_t N = data.size();
    if (!is_power_of_two(N)) {
        std::cerr << "Array size must be a power of two.\n";
        return;
    }

    sycl::queue q;
    sycl::buffer<int, 1> buf(data.data(), sycl::range<1>(N));

    const size_t tile_size = 256;

    for (size_t k = 2; k <= N; k *= 2) {
        for (size_t j = k / 2; j > 0; j /= 2) {
            q.submit([&](sycl::handler& h) {
                auto d = buf.get_access<sycl::access::mode::read_write>(h);
                h.parallel_for(
                    sycl::nd_range<1>(sycl::range<1>(N), sycl::range<1>(tile_size)),
                    [=](sycl::nd_item<1> item) {
                        size_t i = item.get_global_id(0);
                        size_t ixj = i ^ j;

                        if (ixj > i && ixj < N && i < N) {
                            bool ascending = ((i & k) == 0);
                            int val_i = d[i];
                            int val_ixj = d[ixj];

                            if ((val_i > val_ixj) == ascending) {
                                d[i] = val_ixj;
                                d[ixj] = val_i;
                            }
                        }
                    });
            });
        }
    }

    q.wait();
}


// Compute the median; sorts v in place first
long long median(std::vector<long long>& v) {
    std::sort(v.begin(), v.end());
    size_t n = v.size();
    return (n % 2 == 0) ? (v[n / 2 - 1] + v[n / 2]) / 2 : v[n / 2];
}

void print_stats(const std::string& label, const std::vector<long long>& times) {
    if (times.empty()) return;

    double sum = std::accumulate(times.begin(), times.end(), 0.0);
    double mean = sum / times.size();

    double sq_sum = 0;
    for (auto t : times) sq_sum += (t - mean) * (t - mean);
    double stddev = std::sqrt(sq_sum / times.size());

    long long min_time = *std::min_element(times.begin(), times.end());
    long long max_time = *std::max_element(times.begin(), times.end());

    auto sorted_times = times;
    long long med = median(sorted_times);

    std::cout << "\n--- " << label << " ---\n";
    std::cout << "Avg     : " << mean << " us\n";
    std::cout << "Stddev  : " << stddev << " us\n";
    std::cout << "Min     : " << min_time << " us\n";
    std::cout << "Max     : " << max_time << " us\n";
    std::cout << "Median  : " << med << " us\n";
}

int main() {
    constexpr int ITER = 50;
    constexpr size_t N = 1 << 8; // 256 elements

    std::vector<int> orig(N);
    for (size_t i = 0; i < N; ++i)
        orig[i] = rand() % 1000000;

    auto expect = orig;
    std::sort(expect.begin(), expect.end());

    std::vector<long long> times_v1, times_v2, times_v3, times_v4;

    // Warm-up run (untimed) before the measured iterations
    auto data = orig;
    bitonic_sort_v1(data);

    for (int it = 0; it < ITER; ++it) {
        for (int ver = 1; ver <= 4; ++ver) {
            auto data = orig;
            auto t0 = std::chrono::high_resolution_clock::now();

            if (ver == 1) bitonic_sort_v1(data);
            if (ver == 2) bitonic_sort_v2(data);
            if (ver == 3) bitonic_sort_v3(data);
            if (ver == 4) bitonic_sort_v4(data);

            auto t1 = std::chrono::high_resolution_clock::now();

            if (data != expect) {
                std::cerr << "Mismatch in v" << ver << "\n";
                return 1;
            }

            long long us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
            if (ver == 1) times_v1.push_back(us);
            if (ver == 2) times_v2.push_back(us);
            if (ver == 3) times_v3.push_back(us);
            if (ver == 4) times_v4.push_back(us);
        }
    }

    print_stats("v1", times_v1);
    print_stats("v2", times_v2);
    print_stats("v3", times_v3);
    print_stats("v4", times_v4);

    // Optional: dump arrays for external plotting
    auto dump_array = [](const std::string& label, const std::vector<long long>& v) {
        std::cout << "\n" << label << "_times = [";
        for (size_t i = 0; i < v.size(); ++i)
            std::cout << v[i] << (i + 1 < v.size() ? ", " : "");
        std::cout << "]\n";
    };

    dump_array("v1", times_v1);
    dump_array("v2", times_v2);
    dump_array("v3", times_v3);
    dump_array("v4", times_v4);

    return 0;
}

Overwriting lab/simple.cpp


### Build and Run
Select the cell below and click Run ▶ to compile and execute the code above:

! chmod 755 q; chmod 755 run_simple.sh;if [ -x "$(command -v qsub)" ]; then ./q run_simple.sh; else ./run_simple.sh; fi

Compiling SYCL C++ program...
In file included from lab/simple.cpp:1:
      |  ^
   57 |         accessor<int, 1, access::mode::read_write, access::target::local> local_mem(N, h);
      |                                                                    ^
/opt/intel/oneapi/compiler/2025.1/bin/compiler/../../include/sycl/access/access.hpp:25:9: note: 'local' has been explicitly marked deprecated here
   25 |   local __SYCL2020_DEPRECATED("use `local_accessor` instead") = 2016,
      |         ^
/opt/intel/oneapi/compiler/2025.1/bin/compiler/../../include/sycl/detail/defines_elementary.hpp:62:40: note: expanded from macro '__SYCL2020_DEPRECATED'
   62 | #define __SYCL2020_DEPRECATED(message) __SYCL_DEPRECATED(message)
      |                                        ^
/opt/intel/oneapi/compiler/2025.1/bin/compiler/../../include/sycl/

## SYCL
__SYCL__ (pronounced ‘sickle’) represents an industry standardization effort that includes
support for data-parallel programming for C++. It is summarized as “C++ Single-source
Heterogeneous Programming for OpenCL.” The SYCL standard, like OpenCL*, is managed
by the __Khronos Group*__.

SYCL is a cross-platform abstraction layer that builds on OpenCL. It enables code
for heterogeneous processors to be written in a “single source” style using C++. This is not
only useful to the programmers, but it also gives a compiler the ability to analyze and
optimize across the entire program regardless of the device on which the code is to be run.

Unlike OpenCL, SYCL includes templates and lambda functions to enable higher-level application software to be cleanly coded with optimized acceleration of kernel code.
Developers program at a higher level than OpenCL but always have access to lower-level code through seamless integration with OpenCL, as well as C/C++ libraries.

## What is Data Parallel C++
__Data Parallel C++ (DPC++)__ is oneAPI's implementation of a SYCL compiler. It takes advantage of modern C++ productivity benefits and familiar constructs, and incorporates the __SYCL*__ standard for data parallelism and heterogeneous programming. SYCL is a __single source__ language where host code and __heterogeneous accelerator kernels__ can be mixed in the same source files. A SYCL program is invoked on the host computer and offloads the computation to an accelerator. Programmers use familiar C++ and library constructs with added functionalities like a __queue__ for work targeting, a __buffer__ for data management, and __parallel_for__ for parallelism to direct which parts of the computation and data should be offloaded.

## DPC++ extends SYCL
DPC++ extensions aim to __enhance productivity__: simple things should be simple to express, with lower verbosity and less programmer burden. They also __enhance performance__ by giving programmers control over program execution and by enabling hardware-specific features. DPC++ is a fast-moving open collaboration feeding into the __SYCL* standard__, and is an __open source__ implementation with the goal of upstreaming to LLVM and having DPC++ extensions become core __SYCL*__ or __Khronos*__ extensions.

## HPC Single Node Workflow with oneAPI 
Accelerated code can be written in either a kernel (SYCL) or __directive-based style__. Developers can use the __Intel® DPC++ Compatibility tool__ to perform a one-time migration from __CUDA__ to __SYCL__. Existing __Fortran__ applications can use a __directive-based style in OpenMP__. Existing __C++__ applications can choose either the __Kernel style__ or the __directive-based style option__ and existing __OpenCL__ applications can remain in the OpenCL language or migrate to SYCL.

__Intel® Advisor__ is recommended to __optimize__ the design for __vectorization and memory__ (CPU and GPU), to __identify__ loops that are candidates for __offload__, and to project the __performance on target accelerators__.

The figure below shows the recommended approach of different starting points for HPC developers:


<img src="Assets/workflow.png">


## oneAPI Programming models

### Platform Model

The platform model for oneAPI is based upon the SYCL* platform model. It specifies a host controlling one or more devices. A host is the computer, typically a CPU-based system executing the primary portion of a program, specifically the application scope and the command group scope. 

The host coordinates and controls the compute work that is performed on the devices. A device is an accelerator, a specialized component containing compute resources that can quickly execute a subset of operations typically more efficiently than the CPUs in the system. Each device contains one or more compute units that can execute several operations in parallel. Each compute unit contains one or more processing elements that serve as the individual engine for computation.

The following figure provides a visual depiction of the relationships in the platform model. One host communicates with one or more devices. Each device can contain one or more compute units. Each compute unit can contain one or more processing elements. In this example, the CPU in a desktop computer is the host and it can also be made available as a device in a platform configuration.

<img src="Assets/plat30.png">



### Execution Model

The execution model is based upon the SYCL* execution model. It defines and specifies how code, termed kernels, executes on the devices and interacts with the controlling host.
The host execution model coordinates execution and data management between the host and devices via command groups. The command groups, which are groupings of commands like kernel invocation and accessors, are submitted to queues for execution.

Accessors, which are formally part of the memory model, also communicate ordering requirements of execution. A program employing the execution model declares and instantiates queues. Queues can execute with an in-order or out-of-order policy controllable by the program. In-order queues began as an Intel extension and are a standard queue property (`property::queue::in_order`) in SYCL 2020.

The device execution model specifies how computation is accomplished on the accelerator. Computation ranging from small one-dimensional data to large multidimensional data sets is distributed across a hierarchy of ND-ranges, work-groups, sub-groups (an Intel extension prior to SYCL 2020), and work-items, which are all specified when the work is submitted to the command queue.

It is important to note that the actual kernel code represents the work that is executed for one work-item. The code outside of the kernel controls just how much parallelism is executed; the amount and distribution of the work is controlled by specification of the sizes of the ND-range and work-group.


The following figure depicts the relationship between an ND-range, work-group, sub-group, and work-item. The total amount of work is specified by the ND-range size. The grouping of the work is specified by the work-group size. The example shows the ND-range size of X * Y * Z, work-group size of X’ * Y’ * Z’, and subgroup size of X’. Therefore, there are X * Y * Z work-items. There are (X * Y * Z) / (X’ * Y’ * Z’) work-groups and (X * Y * Z) / X’ subgroups.

<img src="Assets/kernel30.png">



### Memory Model

The memory model for oneAPI is based upon the SYCL* memory model. It defines how the host and devices interact with memory. It coordinates the allocation and management of memory between the host and devices. The memory model is an abstraction that aims to generalize across and be adaptable to the different possible host and device configurations.

In this model, memory resides upon and is owned by either the host or the device and is specified by declaring a memory object. There are two different types of memory objects, buffers and images. Interaction of these memory objects between the host and device is accomplished via an accessor, which communicates the desired location of access, such as host or device, and the particular mode of access, such as read or write.

Consider a case where memory is allocated on the host through a traditional malloc call. Once the memory is allocated on the host, a buffer object is created, which enables the host allocated memory to be communicated to the device. The buffer class communicates the type and number of items of that type to be communicated to the device for computation. Once a buffer is created on the host, the type of access allowed on the device is communicated via an accessor object, which specifies the type of access to the buffer.

<img src="Assets/memory.png">

### Kernel Programming Model
The kernel programming model for oneAPI is based upon the SYCL* kernel programming model. It enables explicit parallelism between the host and device. The parallelism is explicit in the sense that the programmer determines what code executes on the host and device; it is not automatic. The kernel code executes on the accelerator. 

Programs employing the oneAPI programming model support single source, meaning the host code and device code can be in the same source file. However, there are differences between the source code accepted in the host code and the device code with respect to language conformance and language features. 

The SYCL Specification defines in detail the required language features for host code and device code. The following is a summary that is specific to the oneAPI product.

## How to Compile & Run SYCL program

The three main steps of compiling and running a SYCL program are:
1. Initialize environment variables
2. Compile the SYCL source code
3. Run the application
 
#### Compiling and Running on Intel&reg; DevCloud:
 
For this training, we have written a script (q) to aid developers in developing projects on DevCloud. This script submits the `run.sh` script to a GPU node on DevCloud for execution, waits for the job to complete and prints out the output/errors. We will be using this command to run on DevCloud: `./q run.sh`



#### Compiling and Running on a Local System:

If you have installed the Intel&reg; oneAPI Base Toolkit on your local system, you can use the commands below to compile and run a SYCL program:

    source /opt/intel/oneapi/setvars.sh

    icpx -fsycl simple.cpp -o simple

    ./simple
    
_Note: the run.sh script is a combination of the three steps listed above._

# Summary
In this module you learned:
* How oneAPI solves the challenges of programming in a heterogeneous world
* How to take advantage of oneAPI solutions to enable your workflows
* How to use the Intel® DevCloud to test-drive oneAPI tools and libraries
* The basics of the SYCL language and programming model
* How to use Jupyter notebooks by editing source code in context


<html><body><span style="color:green"><h1>Survey</h1></span></body></html>

[Tell us how we did in this module with a short survey. We will use your feedback to improve the quality and impact of these learning materials. Thanks!](https://intel.az1.qualtrics.com/jfe/form/SV_6m4G7BXPNSS7FBz)



## Resources

Check out these related resources

#### Intel® oneAPI Toolkit documentation
* [Intel® oneAPI main page](https://software.intel.com/oneapi "oneAPI main page")
* [Intel® oneAPI programming guide](https://software.intel.com/sites/default/files/oneAPIProgrammingGuide_3.pdf "oneAPI programming guide")
* [Intel® DevCloud Signup](https://software.intel.com/en-us/devcloud/oneapi "Intel DevCloud")  Sign up here if you do not have an account.
* [Intel® DevCloud Connect](https://devcloud.intel.com/datacenter/connect)  Login to the DevCloud here.
* [Get Started with oneAPI for Linux*](https://software.intel.com/en-us/get-started-with-intel-oneapi-linux)
* [Get Started with oneAPI for Windows*](https://software.intel.com/en-us/get-started-with-intel-oneapi-windows)
* [Intel® oneAPI Code Samples](https://software.intel.com/en-us/articles/code-samples-for-intel-oneapibeta-toolkits)
* [oneAPI Specification elements](https://www.oneapi.com/spec/)

#### SYCL 
* [SYCL 2020 Specification](https://www.khronos.org/registry/SYCL/specs/sycl-2020/pdf/sycl-2020.pdf)

#### Modern C++
* [CPPReference](https://en.cppreference.com/w/)
* [CPlusPlus](http://www.cplusplus.com/)

***