# Atomic Operation Optimization

This section looks at how atomic operations can be optimized, including synchronization using barriers.

- [Data Types for Atomic Operations](#Data-Types-for-Atomic-Operations)
- [Atomic Operations in Global vs Local Space](#Atomic-Operations-in-Global-vs-Local-Space)

## Data Types for Atomic Operations
Atomics let multiple work-items communicate safely through memory. SYCL atomics are similar to C++ atomics: an access to a resource protected by an atomic is guaranteed to execute as a single, indivisible unit.
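
Because SYCL atomics follow the C++ model, the idea can be sketched on the host with standard C++ alone. The block below is an analogy, not SYCL code: each `std::thread` stands in for a work-item, and the function name `atomic_sum` and the strided split of the input are assumptions made for illustration.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Host-side analogy (not SYCL): each thread plays the role of a work-item
// and adds its share of the input to one shared accumulator.
int atomic_sum(const std::vector<int> &data, unsigned num_threads) {
  std::atomic<int> sum{0};
  std::vector<std::thread> workers;
  for (unsigned t = 0; t < num_threads; ++t) {
    workers.emplace_back([&, t] {
      // Strided loop: thread t handles elements t, t + num_threads, ...
      for (std::size_t i = t; i < data.size(); i += num_threads) {
        // Relaxed ordering suffices because only the final total matters.
        sum.fetch_add(data[i], std::memory_order_relaxed);
      }
    });
  }
  for (auto &w : workers) w.join();
  return sum.load();
}
```

Calling `atomic_sum(std::vector<int>(1000, 1), 4)` adds 1000 ones from four threads. Without the atomic, concurrent updates to a plain `int` could be lost.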

#### Atomic Operation: Integer vs Float

The following SYCL code implements a reduction in which every work-item atomically updates a global accumulator. The first kernel applies the reduction to a vector of integers and the second to a vector of floats.

The performance of the integer kernel is reasonable compared to other techniques used for reduction. If the vector holds float or double values (the second kernel below uses float), performance on certain accelerators is impaired by the lack of hardware support for float or double atomics. The two kernels demonstrate how the time to execute an atomic add can vary drastically depending on whether native atomics are supported.

In [None]:
%%writefile lab/atomics_data_type.cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <sycl/sycl.hpp>
#include <chrono>
#include <iostream>
#include <vector>

constexpr size_t N = 1024 * 100;

int reductionInt(sycl::queue &q, std::vector<int> &data) {
  const size_t data_size = data.size();
  int sum = 0;

  const sycl::property_list props = {sycl::property::buffer::use_host_ptr()};

  sycl::buffer<int> buf(data.data(), data_size, props);
  sycl::buffer<int> sum_buf(&sum, 1, props);

  auto start = std::chrono::high_resolution_clock::now().time_since_epoch().count();
    q.submit([&](auto &h) {
      sycl::accessor buf_acc(buf, h, sycl::read_only);
      // read_write without no_init: fetch_add reads the accumulator,
      // so its initial value (0) must be preserved.
      sycl::accessor sum_acc(sum_buf, h, sycl::read_write);

      h.parallel_for(data_size, [=](auto index) {
        size_t glob_id = index[0];
        auto v = sycl::atomic_ref<
            int, sycl::memory_order::relaxed,
            sycl::memory_scope::device,
            sycl::access::address_space::global_space>(sum_acc[0]);
        v.fetch_add(buf_acc[glob_id]);
      });
    });
    q.wait();
    sycl::host_accessor h_acc(sum_buf);
    sum = h_acc[0];
  std::cout << "ReductionInt Sum   = " << sum << ", Duration " << (std::chrono::high_resolution_clock::now().time_since_epoch().count() - start) * 1e-9 << " seconds\n";

  return sum;
}

float reductionFloat(sycl::queue &q, std::vector<float> &data) {
  const size_t data_size = data.size();
  float sum = 0.0;

  const sycl::property_list props = {sycl::property::buffer::use_host_ptr()};

  sycl::buffer<float> buf(data.data(), data_size, props);
  sycl::buffer<float> sum_buf(&sum, 1, props);

  auto start = std::chrono::high_resolution_clock::now().time_since_epoch().count();
    q.submit([&](auto &h) {
      sycl::accessor buf_acc(buf, h, sycl::read_only);
      // read_write without no_init: fetch_add reads the accumulator,
      // so its initial value (0) must be preserved.
      sycl::accessor sum_acc(sum_buf, h, sycl::read_write);

      h.parallel_for(data_size, [=](auto index) {
        size_t glob_id = index[0];
        auto v = sycl::atomic_ref<
            float, sycl::memory_order::relaxed,
            sycl::memory_scope::device,
            sycl::access::address_space::global_space>(sum_acc[0]);
        v.fetch_add(buf_acc[glob_id]);
      });
    });
    q.wait();
    sycl::host_accessor h_acc(sum_buf);
    sum = h_acc[0];
  
  std::cout << "ReductionFloat Sum = " << sum << ", Duration " << (std::chrono::high_resolution_clock::now().time_since_epoch().count() - start) * 1e-9 << " seconds\n";
  return sum;
}

int main(int argc, char *argv[]) {

  sycl::queue q;
  std::cout << q.get_device().get_info<sycl::info::device::name>() << "\n";
  {
    std::vector<int> data(N, 1);  // N elements, each initialized to 1
    reductionInt(q, data);
  }

  {
    std::vector<float> data(N, 1.0f);  // N elements, each initialized to 1.0f
    reductionFloat(q, data);
  }
}


#### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! ./q.sh run_atomics_data_type.sh

When using atomics, take care to ensure that the hardware supports them and can execute them efficiently. Gen9 and Intel® Iris® Xe integrated graphics do not support atomics on float or double data types, so kernels that rely on them, such as the float reduction above, will perform very poorly. On GPUs that support float and double atomics in hardware, the performance of the above kernel will be much better.

### Intel VTune Analysis
Analyzing these kernels with VTune Profiler lets us measure the impact of native atomic support. In the profile below, the VectorInt kernel is much faster than the VectorDouble and VectorFloat kernels.

<img src="assets/atomics_vtune.png">


### Intel Advisor Report
The Intel Advisor tool has a recommendation pane that provides insights on how to improve the performance of GPU kernels.

One of the recommendations that Intel Advisor provides is “Inefficient atomics present”. When atomics are not natively supported in hardware, they are emulated. Intel Advisor can detect this and suggests possible solutions.

<img src="assets/atomics_advisor.png">
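
To give a feel for why emulated atomics cost more, the sketch below shows one common way to build an atomic float add from a compare-and-swap (CAS) retry loop. It is written in standard C++ as a host-side illustration and is not the code any particular GPU compiler generates. The helper names `emulated_fetch_add` and `emulated_sum` are hypothetical.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical helper: emulate an atomic float add with a CAS loop.
float emulated_fetch_add(std::atomic<float> &target, float value) {
  float expected = target.load(std::memory_order_relaxed);
  // Retry until no other thread changed target between our read and our
  // write. On failure, compare_exchange_weak reloads the current value
  // into expected, so the next attempt uses fresh data.
  while (!target.compare_exchange_weak(expected, expected + value,
                                       std::memory_order_relaxed)) {
  }
  return expected;  // value observed just before our add took effect
}

// Hypothetical driver: several threads add 1.0f to one shared float.
float emulated_sum(int num_threads, int adds_per_thread) {
  std::atomic<float> sum{0.0f};
  std::vector<std::thread> workers;
  for (int t = 0; t < num_threads; ++t) {
    workers.emplace_back([&] {
      for (int i = 0; i < adds_per_thread; ++i) emulated_fetch_add(sum, 1.0f);
    });
  }
  for (auto &w : workers) w.join();
  return sum.load();
}
```

Under contention, every failed compare-and-swap repeats the read, the add, and the write attempt. That repeated work is what a native atomic instruction avoids.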

The standard C++ memory model assumes that applications execute on a single device with a single address space. Neither of these assumptions holds for DPC++ applications: different parts of the application execute on different devices (i.e., a host device and one or more accelerator devices); each device has multiple address spaces (i.e., private, local, and global); and the global address space of each device may or may not be disjoint (depending on USM support).


## Atomic Operations in Global vs Local Space
When using atomics in the global address space, care must again be taken, because atomic updates to global memory are much slower than updates to local memory.

#### Atomic Operation: Global Space

In [None]:
%%writefile lab/atomics_global.cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <sycl/sycl.hpp>
#include <cstdlib>
#include <iostream>

int main() {
  constexpr int N = 1024 * 1000 * 1000;
  constexpr int M = 256;
  int sum = 0;
  int *data = static_cast<int *>(malloc(sizeof(int) * N));
  for (int i = 0; i < N; i++) data[i] = 1;

  sycl::queue q({sycl::property::queue::enable_profiling()});
  sycl::buffer<int> buf_sum(&sum, 1);
  sycl::buffer<int> buf_data(data, N);

  auto e = q.submit([&](sycl::handler &h) {
    sycl::accessor acc_sum(buf_sum, h);
    sycl::accessor acc_data(buf_data, h, sycl::read_only);
    h.parallel_for(sycl::nd_range<1>(N, M), [=](auto it) {
      auto i = it.get_global_id();
      sycl::atomic_ref<int, sycl::memory_order_relaxed,
        sycl::memory_scope_device, sycl::access::address_space::global_space>
        atomic_op(acc_sum[0]);
      atomic_op += acc_data[i];
    });
  });
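  // The host accessor blocks until the kernel has finished; read the result
  // through it rather than through sum, which the buffer still owns.
  sycl::host_accessor h_a(buf_sum);

  std::cout << "Reduction Sum : " << h_a[0] << "\n";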
  auto total_time = (e.get_profiling_info<sycl::info::event_profiling::command_end>() - e.get_profiling_info<sycl::info::event_profiling::command_start>()) * 1e-9;
  std::cout << "Kernel Execution Time of Global Atomics : " << total_time << " seconds\n";
  return 0;
}


#### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! ./q.sh run_atomics_global.sh

#### Atomic Operation: Local Space

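The code can be refactored so that each work-group first accumulates into an atomic in local memory, and only one work-item per group then updates the global accumulator, as the following example demonstrates.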

In [None]:
%%writefile lab/atomics_local.cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <sycl/sycl.hpp>
#include <cstdlib>
#include <iostream>

int main() {
  constexpr int N = 1024 * 1000 * 1000;
  constexpr int M = 256;
  int sum = 0;
  int *data = static_cast<int *>(malloc(sizeof(int) * N));
  for (int i = 0; i < N; i++) data[i] = 1;

  sycl::queue q({sycl::property::queue::enable_profiling()});
  sycl::buffer<int> buf_sum(&sum, 1);
  sycl::buffer<int> buf_data(data, N);

  auto e = q.submit([&](sycl::handler &h) {
    sycl::accessor acc_sum(buf_sum, h);
    sycl::accessor acc_data(buf_data, h, sycl::read_only);
    sycl::local_accessor<int, 1> local(1, h);
    h.parallel_for(sycl::nd_range<1>(N, M), [=](auto it) {
      auto i = it.get_global_id(0);
      sycl::atomic_ref<int, sycl::memory_order_relaxed,
        sycl::memory_scope_device, sycl::access::address_space::local_space>
        atomic_op(local[0]);
      atomic_op = 0;
      sycl::group_barrier(it.get_group());
      sycl::atomic_ref<int, sycl::memory_order_relaxed,
        sycl::memory_scope_device,sycl::access::address_space::global_space>
        atomic_op_global(acc_sum[0]);
      atomic_op += acc_data[i];
      sycl::group_barrier(it.get_group());
      if (it.get_local_id(0) == 0)
        atomic_op_global += local[0];
    });
  });
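  // The host accessor blocks until the kernel has finished; read the result
  // through it rather than through sum, which the buffer still owns.
  sycl::host_accessor ha(buf_sum);

  std::cout << "Reduction Sum : " << ha[0] << "\n";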
  auto total_time = (e.get_profiling_info<sycl::info::event_profiling::command_end>() - e.get_profiling_info<sycl::info::event_profiling::command_start>()) * 1e-9;
  std::cout << "Kernel Execution Time of Local Atomics  : " << total_time << " seconds\n";
  return 0;
}


#### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! ./q.sh run_atomics_local.sh
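
The benefit of the local-space version comes from how few updates reach the contended global accumulator. The sketch below is a host-side analogy in standard C++, not SYCL: each thread stands in for a whole work-group, and a plain `partial` variable stands in for the local-memory accumulator, which in the SYCL kernel needs local atomics because many work-items update it. The function name `grouped_sum` is hypothetical, and `group_size` is assumed to be greater than zero.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// Returns {total, number of atomic updates made to the shared accumulator}.
// Assumes group_size > 0.
std::pair<long, long> grouped_sum(const std::vector<int> &data,
                                  std::size_t group_size) {
  std::atomic<long> global_sum{0};
  std::atomic<long> global_updates{0};
  std::vector<std::thread> groups;
  for (std::size_t begin = 0; begin < data.size(); begin += group_size) {
    groups.emplace_back([&, begin] {
      const std::size_t end = std::min(begin + group_size, data.size());
      long partial = 0;  // stands in for the group's local accumulator
      for (std::size_t i = begin; i < end; ++i) partial += data[i];
      // One global atomic update per group instead of one per element.
      global_sum.fetch_add(partial, std::memory_order_relaxed);
      global_updates.fetch_add(1, std::memory_order_relaxed);
    });
  }
  for (auto &g : groups) g.join();
  return {global_sum.load(), global_updates.load()};
}
```

With 1024 elements and a group size of 256, the shared accumulator is updated 4 times rather than 1024 times. The local-space kernel above applies the same ratio, with a work-group size of `M = 256`.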

## Resources

- [Intel GPU Optimization Guide](https://www.intel.com/content/www/us/en/develop/documentation/oneapi-gpu-optimization-guide/top.html) - Up to date resources for Intel GPU Optimization
- [SYCL Specification](https://registry.khronos.org/SYCL/specs/sycl-2020/pdf/sycl-2020.pdf) - Latest Specification document for reference
- [SYCL Essentials Training](https://github.com/oneapi-src/oneAPI-samples/tree/master/DirectProgramming/C%2B%2BSYCL/Jupyter/oneapi-essentials-training) - Learn basics of C++ SYCL Programming