# Kernel Submission

In this section we cover the performance impacts and considerations of submitting a kernel in SYCL:
- [Kernel Launch](#Kernel-Launch)
- [Executing Multiple Kernels](#Executing-Multiple-Kernels)
- [Submitting Kernels to Multiple Queues](#Submitting-Kernels-to-Multiple-Queues)
- [Avoid Redundant Queue Construction](#Avoid-Redundant-Queue-Construction)



## Kernel Launch
In SYCL, work is performed by enqueueing kernels into queues that target specific devices. The host submits these kernels to the device, the device executes them, and the results are sent back. Kernel submission by the host and the actual start of execution on the device are asynchronous, so we have to keep track of the following timings associated with a kernel.
#### Kernel submission start time
This is the time at which the host starts the process of submitting the kernel.
#### Kernel submission end time
This is the time at which the host finishes submitting the kernel. During submission the host performs multiple tasks, such as queuing the arguments and allocating the resources the runtime needs for the kernel to start execution on the device.
#### Kernel launch time
This is the time at which the kernel that was submitted by the host starts executing on the device. Note that this is not exactly the same as the kernel submission end time. There is a lag between the submission end time and the kernel launch time, which depends on the availability of the device. It is possible for the host to queue up a number of kernels for execution before the kernels are actually launched. Moreover, a few data transfers need to happen before the kernel actually starts executing, and the time they take is typically not accounted for separately from the kernel launch time.
#### Kernel completion time
This is the time at which the kernel finishes execution on the device. The current generation of devices is non-preemptive, which means that once a kernel starts, it runs to completion.

Tools like VTune™ Profiler (vtune), clIntercept, and zeIntercept provide a visual timeline for each of the above times for every kernel in the application.
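The four timestamps above split the life of one kernel into three intervals. The sketch below is plain C++ with no SYCL dependency. The `KernelTimeline` struct, its field names, and the helper functions are hypothetical and exist only to make the arithmetic concrete; in a real application the values would come from a profiling tool.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical record of the four timestamps described above, in nanoseconds
// on a single common clock.
struct KernelTimeline {
  std::uint64_t submit_start;  // host begins submitting the kernel
  std::uint64_t submit_end;    // host finishes submitting the kernel
  std::uint64_t launch;        // kernel starts executing on the device
  std::uint64_t complete;      // kernel finishes executing on the device
};

// Host-side cost of handing the kernel to the runtime.
inline std::uint64_t submission_overhead(const KernelTimeline &t) {
  assert(t.submit_end >= t.submit_start);
  return t.submit_end - t.submit_start;
}

// Lag between the end of submission and the start of execution.
inline std::uint64_t launch_latency(const KernelTimeline &t) {
  assert(t.launch >= t.submit_end);
  return t.launch - t.submit_end;
}

// Time the kernel actually spends running on the device.
inline std::uint64_t execution_time(const KernelTimeline &t) {
  assert(t.complete >= t.launch);
  return t.complete - t.launch;
}
```

For example, the made-up timeline `KernelTimeline{100, 250, 900, 5900}` gives a submission overhead of 150 ns, a launch latency of 650 ns, and an execution time of 5000 ns.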

The following simple example measures two intervals on the host. The time measured after `wait` covers the kernel submission time on the host, the kernel execution time on the device, and any data transfer times (the kernel uses no buffers or other memory, so the transfer time is usually zero in this case).

The time before the `wait` measures the time it takes for the host to submit the kernel to the runtime. These overheads are highly dependent on the backend runtime being used and the processing power of the host.
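The examples in this section divide a raw tick count by `1e+9`, which assumes the clock ticks in nanoseconds. A small helper such as the following avoids that assumption by letting `std::chrono` do the conversion. This is a sketch only: the `Stopwatch` name is hypothetical, and `std::chrono::steady_clock` is chosen here because it is monotonic.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: measures wall-clock time on the host since construction.
class Stopwatch {
public:
  Stopwatch() : start_(std::chrono::steady_clock::now()) {}

  // Seconds elapsed since construction, independent of the clock's tick period.
  double elapsed_seconds() const {
    const std::chrono::duration<double> d =
        std::chrono::steady_clock::now() - start_;
    assert(d.count() >= 0.0);  // steady_clock never goes backwards
    return d.count();
  }

private:
  std::chrono::steady_clock::time_point start_;
};
```

To use it with a kernel, construct a `Stopwatch` before the `parallel_for` call, read `elapsed_seconds()` once right after the submission returns, and read it again after `wait()`. The first reading covers submission only; the second covers submission plus execution.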

In [None]:
%%writefile lab/kernel_launch.cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================

#include <chrono>
#include <iostream>

#include <sycl/sycl.hpp>

constexpr int N = 1024000000;

int main() {
  sycl::queue q;
    
  auto start = std::chrono::high_resolution_clock::now().time_since_epoch().count();

  q.parallel_for(N, [=](auto id) {
    /* NOP */
  });
  
  auto k_subm = std::chrono::high_resolution_clock::now().time_since_epoch().count() - start;
    
  q.wait();
    
  auto k_exec = std::chrono::high_resolution_clock::now().time_since_epoch().count() - start;
  std::cout << "Kernel Submission Time: " << k_subm / 1e+9 << " seconds\n";
  std::cout << "Kernel Submission + Execution Time: " << k_exec / 1e+9 << " seconds\n";

  return 0;
}


#### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! ./q.sh run_kernel_launch.sh

#### SYCL Profiling API

One way to measure the actual kernel execution time on the device is to use the SYCL built-in profiling API. 
```cpp
sycl::queue q{sycl::property::queue::enable_profiling()};
```
The following code demonstrates usage of the SYCL profiling API to profile kernel execution times. It also shows the kernel submission time. There is no way to programmatically measure the kernel launch time since it is dependent on the runtime and the device driver. Profiling tools can provide this information.

In [None]:
%%writefile lab/kernel_profiling.cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================

#include <chrono>
#include <iostream>

#include <sycl/sycl.hpp>

constexpr int N = 1024000000;

int main() {
  sycl::queue q{sycl::property::queue::enable_profiling()};
    
  auto start = std::chrono::high_resolution_clock::now().time_since_epoch().count();

  auto e = q.parallel_for(N, [=](auto id) {
    /* NOP */
  });
  e.wait();
    
  auto duration = std::chrono::high_resolution_clock::now().time_since_epoch().count() - start;
  std::cout << "Kernel Duration  : " << duration / 1e+9 << " seconds\n";

  auto startK = e.get_profiling_info<sycl::info::event_profiling::command_start>();
  auto endK = e.get_profiling_info<sycl::info::event_profiling::command_end>();
  std::cout << "Kernel Execution : " << (endK - startK) / 1e+9 << " seconds\n";
    
  return 0;
}


#### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! ./q.sh run_kernel_profiling.sh

## Executing Multiple Kernels
SYCL has two kinds of queues that a programmer can create and use to submit kernels for execution.
- __in-order queues__ : where kernels are executed in the order they were submitted to the queue
- __out-of-order queues__ : where kernels can be executed in an arbitrary order (subject to the dependency constraints among them).

The choice to create an in-order or out-of-order queue is made at queue construction time through the property `sycl::property::queue::in_order()`. By default, when no property is specified, the queue is out-of-order.

In the following example, three kernels are submitted per iteration. Each of these kernels uses only one work-group with 256 work-items. These kernels are created specifically with one group to ensure that they do not use the entire machine. This is done to illustrate the benefit of parallel kernel execution.

In [None]:
%%writefile lab/kernel_multiple.cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================

#include <array>
#include <chrono>
#include <iostream>

#include <sycl/sycl.hpp>

// Array type and data size for this example.
constexpr size_t array_size = (1 << 15);
typedef std::array<int, array_size> IntArray;

#define iter 10

int multi_queue(sycl::queue &q, const IntArray &a, const IntArray &b) {
  size_t num_items = a.size();
  IntArray s1, s2, s3;

  sycl::buffer a_buf(a);
  sycl::buffer b_buf(b);
  sycl::buffer sum_buf1(s1);
  sycl::buffer sum_buf2(s2);
  sycl::buffer sum_buf3(s3);

  size_t num_groups = 1;
  size_t wg_size = 256;
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iter; i++) {
    q.submit([&](sycl::handler &h) {
      sycl::accessor a_acc(a_buf, h, sycl::read_only);
      sycl::accessor b_acc(b_buf, h, sycl::read_only);
      sycl::accessor sum_acc(sum_buf1, h, sycl::write_only, sycl::no_init);

      h.parallel_for(sycl::nd_range<1>(num_groups * wg_size, wg_size),
                     [=](sycl::nd_item<1> index) {
                       size_t loc_id = index.get_local_id();
                       sum_acc[loc_id] = 0;
                       for (int j = 0; j < 1000; j++)
                         for (size_t i = loc_id; i < array_size; i += wg_size) {
                           sum_acc[loc_id] += a_acc[i] + b_acc[i];
                         }
                     });
    });
    q.submit([&](sycl::handler &h) {
      sycl::accessor a_acc(a_buf, h, sycl::read_only);
      sycl::accessor b_acc(b_buf, h, sycl::read_only);
      sycl::accessor sum_acc(sum_buf2, h, sycl::write_only, sycl::no_init);

      h.parallel_for(sycl::nd_range<1>(num_groups * wg_size, wg_size),
                     [=](sycl::nd_item<1> index) {
                       size_t loc_id = index.get_local_id();
                       sum_acc[loc_id] = 0;
                       for (int j = 0; j < 1000; j++)
                         for (size_t i = loc_id; i < array_size; i += wg_size) {
                           sum_acc[loc_id] += a_acc[i] + b_acc[i];
                         }
                     });
    });
    q.submit([&](sycl::handler &h) {
      sycl::accessor a_acc(a_buf, h, sycl::read_only);
      sycl::accessor b_acc(b_buf, h, sycl::read_only);
      sycl::accessor sum_acc(sum_buf3, h, sycl::write_only, sycl::no_init);

      h.parallel_for(sycl::nd_range<1>(num_groups * wg_size, wg_size),
                     [=](sycl::nd_item<1> index) {
                       size_t loc_id = index.get_local_id();
                       sum_acc[loc_id] = 0;
                       for (int j = 0; j < 1000; j++)
                         for (size_t i = loc_id; i < array_size; i += wg_size) {
                           sum_acc[loc_id] += a_acc[i] + b_acc[i];
                         }
                     });
    });
  }
  q.wait();
  auto end = std::chrono::steady_clock::now();
  std::cout << "multi_queue completed on device - took "
            << (end - start).count()/ 1e+9 << " seconds\n";
  // check results
  return ((end - start).count());
} // end multi_queue

void InitializeArray(IntArray &a) {
  for (size_t i = 0; i < a.size(); i++)
    a[i] = 1;
}

IntArray a, b;

int main() {

  sycl::queue q;

  InitializeArray(a);
  InitializeArray(b);

  std::cout << "Running on device: "
            << q.get_device().get_info<sycl::info::device::name>() << "\n";
  std::cout << "Vector size: " << a.size() << "\n";

  // begin in-order submission
  std::cout << "In order queue: Jitting+Execution time\n";
  sycl::queue q1{sycl::property::queue::in_order()};
  multi_queue(q1, a, b);
  std::cout << "In order queue: Execution time\n";
  multi_queue(q1, a, b);
  // end in-order submission

  // begin out-of-order submission
  sycl::queue q2;
  std::cout << "Out of order queue: Jitting+Execution time\n";
  multi_queue(q2, a, b);
  std::cout << "Out of order queue: Execution time\n";
  multi_queue(q2, a, b);
  // end out-of-order submission
  return 0;
}


#### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! ./q.sh run_kernel_multiple.sh

When the underlying queue is in-order (`q1` in the example above, created with `sycl::property::queue::in_order()`), these kernels cannot execute in parallel and have to run sequentially, even though there are adequate resources in the machine and there are no dependencies among the kernels. This can be seen from the larger total execution time for all the kernels.

When the queue is out-of-order (`q2` in the example above, created without any property), the overall execution time can be much lower, indicating that the machine is able to execute different kernels from the queue at the same time.

In situations where kernels do not scale strongly and therefore cannot effectively utilize full machine compute resources, it is better to allocate only the required compute units through appropriate selection of work-group/work-item values and try to execute multiple kernels at the same time.
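A back-of-the-envelope model makes the difference between the two queue types visible. The sketch below is plain C++ and does not run anything on a device. `in_order_makespan` and `out_of_order_makespan` are hypothetical helpers, the kernels are assumed to be independent, and `slots` is an assumed number of kernels the device can run side by side. Real schedulers are more complicated, so treat the numbers as an illustration of the trend only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// In-order queue: each kernel starts only after the previous one finishes,
// so the total time is the sum of the individual durations.
inline double in_order_makespan(const std::vector<double> &durations) {
  double total = 0.0;
  for (double d : durations) total += d;
  return total;
}

// Out-of-order queue, idealized: independent kernels are handed, in
// submission order, to whichever of `slots` concurrent slots frees up first.
inline double out_of_order_makespan(const std::vector<double> &durations,
                                    std::size_t slots) {
  assert(slots > 0);
  std::vector<double> busy_until(slots, 0.0);
  for (double d : durations) {
    auto earliest = std::min_element(busy_until.begin(), busy_until.end());
    *earliest += d;
  }
  return *std::max_element(busy_until.begin(), busy_until.end());
}
```

With three kernels of 2.0 time units each, the in-order estimate is 6.0 units, while the idealized out-of-order estimate with three slots is 2.0 units. With a single slot the two estimates coincide.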

It is also possible to statically partition a single device into sub-devices by using the `create_sub_devices` function of the device class. This gives the programmer more control over submitting kernels to an appropriate sub-device. However, because the partition is static, the runtime cannot adapt to the dynamic load of an application: it has no flexibility to move kernels from one sub-device to another.

> NOTE:
> At the time of writing, only the OpenCL backend is able to execute the kernels out of order. Support in the Level Zero backend to execute kernels out of order is still in development.
>
> To see concurrent kernel execution, set the SYCL backend to `opencl`. By default, the SYCL backend is `level_zero` for the Intel oneAPI DPC++ Compiler. The second command below switches back to `level_zero`.
> ```sh
> export SYCL_DEVICE_FILTER=opencl
>
> export SYCL_DEVICE_FILTER=level_zero
> ```

## Submitting Kernels to Multiple Queues
Queues provide a channel to submit kernels for execution on an accelerator. Queues also hold a context that describes the state of the device. This state includes the contents of buffers and any memory needed to execute the kernels. The runtime keeps track of the current device context and avoids unnecessary memory transfers between host and device. Therefore, it is better to submit and launch kernels from one context together, as opposed to interleaving the kernel submissions in different contexts.

The following example submits 3000 independent kernels that use the same buffers as input to compute the result into different output buffers. All these kernels are completely independent and can potentially execute concurrently and out of order. The kernels are submitted to three queues, and the execution of each kernel will incur different costs depending on how the queues are created. The SYCL code submits kernels in 3 different ways:
- Kernel submission to same queue
- Kernel submission to different queues with same context
- Kernel submission to different queues with different contexts

In [None]:
%%writefile lab/kernel_multiple_queues.cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <chrono>
#include <iostream>
#include <vector>

#include <sycl/sycl.hpp>

constexpr int N = 1024;
#define iter 1000

int VectorAdd(sycl::queue &q1, sycl::queue &q2, sycl::queue &q3,
              std::vector<int> a, std::vector<int> b) {

  sycl::buffer a_buf(a);
  sycl::buffer b_buf(b);
  sycl::buffer<int> *sum_buf[3 * iter];
  for (size_t i = 0; i < (3 * iter); i++)
    sum_buf[i] = new sycl::buffer<int>(256);

  size_t num_groups = 1;
  size_t wg_size = 256;
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iter; i++) {
    q1.submit([&](auto &h) {
      sycl::accessor a_acc(a_buf, h, sycl::read_only);
      sycl::accessor b_acc(b_buf, h, sycl::read_only);
      auto sum_acc = sum_buf[3 * i]->get_access<sycl::access::mode::write>(h);

      h.parallel_for(sycl::nd_range<1>(num_groups * wg_size, wg_size),
                     [=](sycl::nd_item<1> index) {
                       size_t loc_id = index.get_local_id();
                       sum_acc[loc_id] = 0;
                       for (int j = 0; j < 1000; j++)
                         for (size_t i = loc_id; i < N; i += wg_size) {
                           sum_acc[loc_id] += a_acc[i] + b_acc[i];
                         }
                     });
    });
    q2.submit([&](auto &h) {
      sycl::accessor a_acc(a_buf, h, sycl::read_only);
      sycl::accessor b_acc(b_buf, h, sycl::read_only);
      auto sum_acc =
          sum_buf[3 * i + 1]->get_access<sycl::access::mode::write>(h);

      h.parallel_for(sycl::nd_range<1>(num_groups * wg_size, wg_size),
                     [=](sycl::nd_item<1> index) {
                       size_t loc_id = index.get_local_id();
                       sum_acc[loc_id] = 0;
                       for (int j = 0; j < 1000; j++)
                         for (size_t i = loc_id; i < N; i += wg_size) {
                           sum_acc[loc_id] += a_acc[i] + b_acc[i];
                         }
                     });
    });
    q3.submit([&](auto &h) {
      sycl::accessor a_acc(a_buf, h, sycl::read_only);
      sycl::accessor b_acc(b_buf, h, sycl::read_only);
      auto sum_acc =
          sum_buf[3 * i + 2]->get_access<sycl::access::mode::write>(h);

      h.parallel_for(sycl::nd_range<1>(num_groups * wg_size, wg_size),
                     [=](sycl::nd_item<1> index) {
                       size_t loc_id = index.get_local_id();
                       sum_acc[loc_id] = 0;
                       for (int j = 0; j < 1000; j++)
                         for (size_t i = loc_id; i < N; i += wg_size) {
                           sum_acc[loc_id] += a_acc[i] + b_acc[i];
                         }
                     });
    });
  }
  q1.wait();
  q2.wait();
  q3.wait();
  auto end = std::chrono::steady_clock::now();
  std::cout << "Vector add completed on device - took " << (end - start).count() / 1e+9 << " seconds\n";
  // check results
  for (size_t i = 0; i < (3 * iter); i++)
    delete sum_buf[i];
  return ((end - start).count());
}


int main() {

  sycl::queue q(sycl::default_selector_v);
  
  std::vector<int> a(N, 1);
  std::vector<int> b(N, 2);

  std::cout << "Running on device: "
            << q.get_device().get_info<sycl::info::device::name>() << "\n";
  std::cout << "Vector size: " << a.size() << "\n";

  // jit the code
  VectorAdd(q, q, q, a, b);

  std::cout << "\nSubmission to same queue out_of_order\n";
  VectorAdd(q, q, q, a, b);

  sycl::queue q0(sycl::default_selector_v, sycl::property::queue::in_order());
  std::cout << "\nSubmission to same queue in_order\n";
  VectorAdd(q0, q0, q0, a, b);
    
  std::cout << "\nSubmission to different queues with same context\n";
  sycl::queue q1(sycl::default_selector_v);
  sycl::queue q2(q1.get_context(), sycl::default_selector_v);
  sycl::queue q3(q1.get_context(), sycl::default_selector_v);
  VectorAdd(q1, q2, q3, a, b);

  std::cout << "\nSubmission to different queues with different contexts\n";
  sycl::queue q4(sycl::default_selector_v);
  sycl::queue q5(sycl::default_selector_v);
  sycl::queue q6(sycl::default_selector_v);
  VectorAdd(q4, q5, q6, a, b);

  return 0;
}


#### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! ./q.sh run_kernel_multiple_queues.sh

Submitting the kernels to the same queue gives the best performance because all the kernels are able to just transfer the needed inputs once at the beginning and do all their computations.
```cpp
  VectorAdd(q, q, q, a, b);
```
If the kernels are submitted to different queues that share the same context, the performance is similar to submitting them to one queue. The issue to note here is that when a kernel is submitted to a new queue with a different context, the JIT process compiles the kernel for the new device associated with that context. If this JIT compilation time is discounted, the actual execution time of the kernels is similar.
```cpp
  sycl::queue q1(sycl::default_selector_v);
  sycl::queue q2(q1.get_context(), sycl::default_selector_v);
  sycl::queue q3(q1.get_context(), sycl::default_selector_v);
  VectorAdd(q1, q2, q3, a, b);
```
If the kernels are submitted to three different queues that have three different contexts, performance degrades because at kernel invocation, the runtime needs to transfer all input buffers to the accelerator every time. In addition, the kernels will be JITed for each of the contexts.
```cpp
  sycl::queue q4(sycl::default_selector_v);
  sycl::queue q5(sycl::default_selector_v);
  sycl::queue q6(sycl::default_selector_v);
  VectorAdd(q4, q5, q6, a, b);
```
If for some reason you need to use different queues, the problem can be alleviated by creating the queues with a shared context. This removes the need to transfer the input buffers repeatedly, but the memory footprint of the kernels will increase because all the output buffers have to be resident in the context at the same time, whereas earlier the same memory on the device could be reused for the output buffers.

Another thing to remember is the issue of __memory-to-compute ratio__ in the kernels. If the compute requirement of the kernel is low, the overall execution is dominated by the memory transfers. When the compute is high, these transfers do not contribute much to the overall execution time.
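One way to reason about this ratio is a simple additive cost model. The sketch below is an assumption-laden illustration, not a measurement: `transfer_fraction` is a hypothetical helper, transfers and compute are assumed not to overlap, and the bandwidth and throughput figures in the usage note are made-up round numbers.

```cpp
#include <cassert>

// Fraction of total time spent moving data, assuming transfer and compute
// run back to back without overlap.
//   bytes         : data moved between host and device
//   bytes_per_sec : assumed transfer bandwidth
//   ops           : arithmetic operations performed by the kernel
//   ops_per_sec   : assumed device throughput
inline double transfer_fraction(double bytes, double bytes_per_sec,
                                double ops, double ops_per_sec) {
  assert(bytes_per_sec > 0.0 && ops_per_sec > 0.0);
  const double transfer_time = bytes / bytes_per_sec;
  const double compute_time = ops / ops_per_sec;
  const double total = transfer_time + compute_time;
  return total > 0.0 ? transfer_time / total : 0.0;
}
```

With both assumed rates set to 1e9 per second, moving 1e9 bytes for 1e9 operations gives a transfer fraction of 0.5. Raising the operation count to 9e9 for the same data lowers the fraction to about 0.1, which matches the observation that transfers matter less as compute grows.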

## Avoid Redundant Queue Construction
To execute kernels on a device, the user must create a queue, which references an associated context, platform, and device. These may be chosen automatically, or specified by the user.

A context is constructed, either directly by the user or implicitly when creating a queue, to hold all the runtime information required by the SYCL runtime and the SYCL backend to operate on a device. When a queue is created with no context specified, a new context is implicitly constructed using the default constructor. In general, creating a new context is a heavyweight operation because the program has to be JIT compiled every time a kernel is submitted to a queue with a new context. For good performance, use as few contexts as possible in an application.
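The cost pattern described above can be pictured as a per-context cache of compiled kernels. The following is a toy model in plain C++, not a description of how any SYCL runtime is actually implemented: `JitModel`, its integer context IDs, and the kernel name used in the usage note are all hypothetical.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>

// Toy model: a kernel is JIT-compiled the first time it is submitted in a
// given context, and reused on later submissions in that same context.
class JitModel {
public:
  // Record a submission; returns true if it triggered a (modelled) JIT compile.
  bool submit(int context_id, const std::string &kernel_name) {
    assert(context_id >= 0);
    const bool first_time =
        compiled_.insert(std::make_pair(context_id, kernel_name)).second;
    if (first_time) ++compile_count_;
    return first_time;
  }

  int compile_count() const { return compile_count_; }

private:
  std::set<std::pair<int, std::string>> compiled_;
  int compile_count_ = 0;
};
```

In this model, submitting a kernel named `"increment"` 1000 times with a fresh context ID each time yields 1000 modelled compiles, whereas reusing one context ID for all 1000 submissions yields a single compile.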

In the following example, a queue is created inside the loop and the kernel is submitted to this new queue. This will essentially invoke the JIT compiler for every iteration of the loop.

Work through the following 4 steps and compare the performance numbers:

1. Run the code and check the performance number.

2. In case you need to create multiple queues, try to share the contexts among the queues. This will improve the performance.

   In the code below you can change `sycl::queue q2;` to:
   ```cpp
   sycl::queue q2{ q1.get_context(), sycl::default_selector_v };
   ```
3. Another option is to move the `q2` queue declaration outside the for-loop, which improves performance quite dramatically.

4. You can also use the same `sycl::queue` `q1` inside the for-loop instead of using `q2`.


In [None]:
%%writefile lab/kernel_redundant_queue.cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================

#include <chrono>
#include <iostream>
#include <vector>

#include <sycl/sycl.hpp>

constexpr int N = 1024000;
constexpr int ITER = 1000;

int main() {

  std::vector<int> data(N);
  sycl::buffer<int> data_buf(data);

  auto start = std::chrono::high_resolution_clock::now().time_since_epoch().count();

  //# kernel to initialize data 
  sycl::queue q1;
  q1.submit([&](auto &h) {
    sycl::accessor data_acc(data_buf, h, sycl::write_only, sycl::no_init);
    h.parallel_for(N, [=](auto i) { data_acc[i] = i; });
  }).wait();

  //# for-loop with kernel computation
  for (int i = 0; i < ITER; i++) {

    sycl::queue q2;

    q2.submit([&](auto &h) {
      sycl::accessor data_acc(data_buf, h);
      h.parallel_for(N, [=](auto i) {
        data_acc[i] += 1;
      });
    });
    sycl::host_accessor ha(data_buf);

  }
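  // Read the result through a host accessor: the contents of `data` are not
  // guaranteed to be up to date while `data_buf` is still alive.
  sycl::host_accessor result(data_buf, sycl::read_only);
  std::cout << "data[0] = " << result[0] << "\n";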
    
  auto duration = std::chrono::high_resolution_clock::now().time_since_epoch().count() - start;
  std::cout << "Compute Duration: " << duration / 1e+9 << " seconds\n";

  return 0;
}


#### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! ./q.sh run_kernel_redundant_queue.sh

## Resources

- [Intel GPU Optimization Guide](https://www.intel.com/content/www/us/en/develop/documentation/oneapi-gpu-optimization-guide/top.html) - Up to date resources for Intel GPU Optimization
- [SYCL Specification](https://registry.khronos.org/SYCL/specs/sycl-2020/pdf/sycl-2020.pdf) - Latest Specification document for reference
- [SYCL Essentials Training](https://github.com/oneapi-src/oneAPI-samples/tree/master/DirectProgramming/C%2B%2BSYCL/Jupyter/oneapi-essentials-training) - Learn basics of C++ SYCL Programming