# Kernel Reduction
Reduction is a common operation in parallel programming where an operator is applied to all elements of an array and a single result is produced. The reduction operator is associative and in some cases commutative. Some examples of reductions are summation, maximum, and minimum. 
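
As a small host-side illustration of the three example reductions named above, the sketch below folds an operator over a vector with `std::accumulate`. The helper names `reduce_sum`, `reduce_max`, and `reduce_min` are hypothetical and only serve this example; they are not part of SYCL.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Sum: the identity of + is 0, so an empty input reduces to 0.
int reduce_sum(const std::vector<int> &v) {
  return std::accumulate(v.begin(), v.end(), 0);
}

// Maximum: seeded with the first element, so the input must be non-empty.
int reduce_max(const std::vector<int> &v) {
  assert(!v.empty());
  return std::accumulate(v.begin() + 1, v.end(), v.front(),
                         [](int a, int b) { return std::max(a, b); });
}

// Minimum: same structure as the maximum, with a different operator.
int reduce_min(const std::vector<int> &v) {
  assert(!v.empty());
  return std::accumulate(v.begin() + 1, v.end(), v.front(),
                         [](int a, int b) { return std::min(a, b); });
}
```

Only the operator and the seed value change between the three functions, which is why a single reduction pattern can serve all of them.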

In the next few sections we will look at different implementations of sum reduction in SYCL kernels:
- [Reduction using Atomic Operation](#Reduction-using-Atomic-Operation)
- [Reduction using Shared Local Memory](#Reduction-using-Shared-Local-Memory)
- [Reduction using Sub-Groups](#Reduction-using-Sub-Groups)
- [Reduction using SYCL Reduction Kernel](#Reduction-using-SYCL-Reduction-Kernel)

Several implementations of the reduction operation are provided and discussed here; their performance characteristics may differ depending on the architecture of the accelerator. Another important point is that the time it takes to bring the result of the reduction to the host over the PCIe interface (for a discrete GPU) is almost the same as the time spent performing the entire reduction on the device. You should therefore avoid data transfers between host and device as much as possible, or overlap kernel execution with data transfers.

A serial summation reduction is shown below:
```cpp
  for (int it = 0; it < iter; it++) {
    sum = 0;
    for (size_t i = 0; i < data_size; ++i) {
      sum += data[i];
    }
  }
```
The time complexity of a serial reduction is linear in the number of elements. There are several ways to parallelize it, and care must be taken to minimize the amount of communication and synchronization between processing elements.
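
One way to picture a decomposition with little synchronization is to split the input into independent chunks, reduce each chunk privately, and combine the partial results once at the end. The sketch below models this serially on the host; `chunked_sum` and `num_chunks` are hypothetical names, and each loop iteration stands in for work that a separate processing element could do.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: reduces `data` in `num_chunks` independent pieces and
// then combines the partial sums in a single final step.
long long chunked_sum(const std::vector<int> &data, std::size_t num_chunks) {
  assert(num_chunks > 0);
  // Round up so that the last chunk absorbs any remainder.
  const std::size_t chunk = (data.size() + num_chunks - 1) / num_chunks;
  std::vector<long long> partial(num_chunks, 0);
  for (std::size_t c = 0; c < num_chunks; ++c) {
    // Each chunk touches only its own slice and its own slot in `partial`,
    // so no communication is needed while the chunk is being reduced.
    const std::size_t begin = std::min(c * chunk, data.size());
    const std::size_t end = std::min(begin + chunk, data.size());
    partial[c] =
        std::accumulate(data.begin() + begin, data.begin() + end, 0LL);
  }
  // The only point where results from different chunks meet.
  return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

Regrouping the additions this way is valid because addition is associative; the kernels in the following sections differ mainly in how the chunks are formed and where the partial results are stored.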

## Reduction using Atomic Operation
A naive way to parallelize this reduction is to use a global variable and let the threads update it with an atomic operation. Because every thread atomically updates the same memory location, this approach suffers from significant contention:

In [None]:
%%writefile lab/reduction_atomics.cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>

constexpr size_t N = (1000 * 1024 * 1024);

int main(int argc, char *argv[]) {

  sycl::queue q{sycl::property::queue::enable_profiling{}};
  std::cout << "Device: " << q.get_device().get_info<sycl::info::device::name>() << "\n";

  std::vector<int> data(N, 1);
  int sum = 0;

  const sycl::property_list props = {sycl::property::buffer::use_host_ptr()};

  sycl::buffer<int> buf(data.data(), data.size(), props);
  sycl::buffer<int> sum_buf(&sum, 1, props);
    
  auto e = q.submit([&](auto &h) {
    sycl::accessor buf_acc(buf, h, sycl::read_only);
    // fetch_add reads the current value, so the initial 0 must reach the device
    sycl::accessor sum_acc(sum_buf, h, sycl::read_write);

    h.parallel_for(N, [=](auto index) {
      size_t glob_id = index[0];
      auto v = sycl::atomic_ref<int, 
        sycl::memory_order::relaxed, 
        sycl::memory_scope::device, 
        sycl::access::address_space::global_space>(sum_acc[0]);
      v.fetch_add(buf_acc[glob_id]);
    });
  });

  // Read the result through the host accessor rather than the raw variable
  sycl::host_accessor h_acc(sum_buf);
  std::cout << "Sum = " << h_acc[0] << "\n";

  std::cout << "Kernel time = " << (e.template get_profiling_info<sycl::info::event_profiling::command_end>() - e.template get_profiling_info<sycl::info::event_profiling::command_start>()) * 1e-9 << " seconds\n";
  return 0;
}


#### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! ./q.sh run_reduction_atomics.sh

## Reduction using Shared Local Memory
A further optimization is to block the accesses to the input vector and use the shared local memory to store the intermediate results. This kernel is shown below. In this kernel every work-item operates on a certain number of vector elements, and then one thread in the work-group reduces all these elements to one result by linearly going through the shared memory containing the intermediate results.

In [None]:
%%writefile lab/reduction_slm.cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>

constexpr size_t N = (1000 * 1024 * 1024);

int main(int argc, char *argv[]) {

  sycl::queue q{sycl::property::queue::enable_profiling{}};
  std::cout << "Device: " << q.get_device().get_info<sycl::info::device::name>() << "\n";

  std::vector<int> data(N, 1);
  int sum = 0;

  int work_group_size = 256;
  int log2elements_per_block = 13;
  int elements_per_block = (1 << log2elements_per_block); // 8192

  int log2workitems_per_block = 8;
  int workitems_per_block = (1 << log2workitems_per_block); // 256
  int elements_per_work_item = elements_per_block / workitems_per_block;

  // Low bits select the work-item within a block (avoids shifting a negative value)
  int mask = (1 << log2workitems_per_block) - 1;
  int num_work_items = data.size() / elements_per_work_item;
  int num_work_groups = num_work_items / work_group_size;
  std::cout << "Num work items = " << num_work_items << std::endl;

  const sycl::property_list props = {sycl::property::buffer::use_host_ptr()};

  sycl::buffer<int> buf(data.data(), data.size(), props);
  sycl::buffer<int> sum_buf(&sum, 1, props);
  sycl::buffer<int> accum_buf(num_work_groups);
    
  auto e = q.submit([&](auto &h) {
      sycl::accessor buf_acc(buf, h, sycl::read_only);
      sycl::accessor accum_acc(accum_buf, h, sycl::write_only, sycl::no_init);
      sycl::local_accessor<int, 1> scratch(work_group_size, h);
      h.parallel_for(sycl::nd_range<1>{num_work_items, work_group_size},
                     [=](sycl::nd_item<1> item) {
                       size_t glob_id = item.get_global_id(0);
                       size_t group_id = item.get_group(0);
                       size_t loc_id = item.get_local_id(0);
                       int offset = ((glob_id >> log2workitems_per_block)
                                     << log2elements_per_block) +
                                    (glob_id & mask);
                       int sum = 0;
                       for (size_t i = 0; i < elements_per_work_item; i++)
                         sum +=
                             buf_acc[(i << log2workitems_per_block) + offset];
                       scratch[loc_id] = sum;
                       // Wait until every work-item has stored its partial sum
                       sycl::group_barrier(item.get_group());
                       // Serial reduction by the first work-item of the group
                       if (loc_id == 0) {
                         int sum = 0;
                         for (int i = 0; i < work_group_size; i++)
                           sum += scratch[i];
                         accum_acc[group_id] = sum;
                       }
                     });
    });

  q.wait();
  {
    sum = 0;
    sycl::host_accessor h_acc(accum_buf);
    for (int i = 0; i < num_work_groups; i++)
      sum += h_acc[i];
  }
  std::cout << "Sum = " << sum << "\n";

  std::cout << "Kernel time = " << (e.template get_profiling_info<sycl::info::event_profiling::command_end>() - e.template get_profiling_info<sycl::info::event_profiling::command_start>()) * 1e-9 << " seconds\n";
  return 0;
}


#### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! ./q.sh run_reduction_slm.sh
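
The index arithmetic in the shared local memory kernel can be checked on the host. The sketch below replays the same `offset` and stride computation for every work-item and counts how often each input element is read. The function name `count_blocked_reads` is hypothetical, and the example assumes the input size is a multiple of the block size, as it is in the kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Replays the blocked index mapping on the host and returns, for each of the
// `n` input elements, how many times it would be read.
std::vector<int> count_blocked_reads(std::size_t n,
                                     int log2_elements_per_block,
                                     int log2_workitems_per_block) {
  assert(log2_elements_per_block >= log2_workitems_per_block);
  const std::size_t elements_per_block = std::size_t{1}
                                         << log2_elements_per_block;
  assert(n % elements_per_block == 0);
  const std::size_t workitems_per_block = std::size_t{1}
                                          << log2_workitems_per_block;
  const std::size_t elements_per_work_item =
      elements_per_block / workitems_per_block;
  const std::size_t mask = workitems_per_block - 1;
  const std::size_t num_work_items = n / elements_per_work_item;

  std::vector<int> reads(n, 0);
  for (std::size_t glob_id = 0; glob_id < num_work_items; ++glob_id) {
    // Start of this work-item's block plus its position within the block.
    const std::size_t offset =
        ((glob_id >> log2_workitems_per_block) << log2_elements_per_block) +
        (glob_id & mask);
    // Successive reads are workitems_per_block apart, so in each iteration
    // neighbouring work-items touch neighbouring addresses.
    for (std::size_t i = 0; i < elements_per_work_item; ++i)
      ++reads[(i << log2_workitems_per_block) + offset];
  }
  return reads;
}
```

Calling `count_blocked_reads(4 * 8192, 13, 8)` with the kernel's block parameters yields a vector in which every entry is 1, meaning each element is read exactly once.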

In the shared local memory kernel above, a tree reduction can also be used to combine the intermediate results from all the work-items in a work-group. In most cases this does not seem to make a big difference in performance.

```cpp
       // tree reduction
       for (int i = work_group_size / 2; i > 0; i >>= 1) {
         // Make the writes of the previous step visible to the whole work-group
         sycl::group_barrier(item.get_group());
         if (loc_id < i)
           scratch[loc_id] += scratch[loc_id + i];
       }
       if (loc_id == 0)
         accum_acc[group_id] = scratch[0];
```
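
To see why the tree form needs only a logarithmic number of steps, the following host model applies the same halving loop to a plain vector. It is a sketch under the assumption that the scratch size is a power of two; `tree_reduce` is a hypothetical name, and the inner loop runs serially here where the kernel would run one work-item per index.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host model of a work-group tree reduction. The result ends up in
// scratch[0]; the return value is the number of halving steps, each of which
// corresponds to one barrier in the kernel.
int tree_reduce(std::vector<int> &scratch) {
  const std::size_t n = scratch.size();
  assert(n > 0 && (n & (n - 1)) == 0); // size must be a power of two
  int steps = 0;
  for (std::size_t stride = n / 2; stride > 0; stride >>= 1) {
    // Reads come from [stride, 2 * stride) and writes go to [0, stride), so
    // the updates within one step do not interfere with each other.
    for (std::size_t loc_id = 0; loc_id < stride; ++loc_id)
      scratch[loc_id] += scratch[loc_id + stride];
    ++steps;
  }
  return steps;
}
```

For a scratch array of 256 entries this takes 8 steps, compared with 255 additions performed by a single work-item in the serial variant.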

## Reduction using Sub-Groups
The kernel below uses a completely different technique for accessing memory. It uses sub-group loads to produce the intermediate result in vector form. This intermediate result is then brought back to the host, where the final reduction is performed. In some cases it may be better to launch another kernel that reduces this result in a single work-group, which lets you perform a tree reduction using efficient barriers.

In [None]:
%%writefile lab/reduction_sg.cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>

constexpr size_t N = (1000 * 1024 * 1024);

int main(int argc, char *argv[]) {

  sycl::queue q{sycl::property::queue::enable_profiling{}};
  std::cout << "Device: " << q.get_device().get_info<sycl::info::device::name>() << "\n";

  std::vector<int> data(N, 1);
  int sum = 0;
    
  int work_group_size = 256;
  int log2elements_per_work_item = 6;
  int elements_per_work_item = (1 << log2elements_per_work_item); // 64
  int num_work_items = data.size() / elements_per_work_item;
  int num_work_groups = num_work_items / work_group_size;

  std::cout << "Num work items = " << num_work_items << std::endl;
  std::cout << "Num work groups = " << num_work_groups << std::endl;

  const sycl::property_list props = {sycl::property::buffer::use_host_ptr()};

  sycl::buffer<int> buf(data.data(), data.size(), props);
  sycl::buffer<int> sum_buf(&sum, 1, props);
  sycl::buffer<sycl::vec<int, 8>> accum_buf(num_work_groups);
    
  auto e = q.submit([&](auto &h) {
      const sycl::accessor buf_acc(buf, h);
      sycl::accessor accum_acc(accum_buf, h, sycl::write_only, sycl::no_init);
      sycl::local_accessor<sycl::vec<int, 8>, 1> scratch(work_group_size, h);
      h.parallel_for(
          sycl::nd_range<1>{num_work_items, work_group_size},
          [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(16)]] {
            size_t glob_id = item.get_global_id(0);
            size_t group_id = item.get_group(0);
            size_t loc_id = item.get_local_id(0);
            sycl::sub_group sg = item.get_sub_group();
            sycl::vec<int, 8> sum{0, 0, 0, 0, 0, 0, 0, 0};
            using global_ptr =
                sycl::multi_ptr<int, sycl::access::address_space::global_space>;
            int base = (group_id * work_group_size +
                        sg.get_group_id()[0] * sg.get_local_range()[0]) *
                       elements_per_work_item;
            for (size_t i = 0; i < elements_per_work_item / 8; i++)
              sum += sg.load<8>(global_ptr(&buf_acc[base + i * 128]));
            scratch[loc_id] = sum;
            for (int i = work_group_size / 2; i > 0; i >>= 1) {
              sycl::group_barrier(item.get_group());
              if (loc_id < i)
                scratch[loc_id] += scratch[loc_id + i];
            }
            if (loc_id == 0)
              accum_acc[group_id] = scratch[0];
          });
    });

  q.wait();
  {
    sycl::host_accessor h_acc(accum_buf);
    sycl::vec<int, 8> res{0, 0, 0, 0, 0, 0, 0, 0};
    for (int i = 0; i < num_work_groups; i++)
      res += h_acc[i];
    sum = 0;
    for (int i = 0; i < 8; i++)
      sum += res[i];
  }
  sycl::host_accessor h_acc(sum_buf);
  std::cout << "Sum = " << sum << "\n";

  std::cout << "Kernel time = " << (e.template get_profiling_info<sycl::info::event_profiling::command_end>() - e.template get_profiling_info<sycl::info::event_profiling::command_start>()) * 1e-9 << " seconds\n";
  return 0;
}


#### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! ./q.sh run_reduction_sg.sh
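
The access pattern of the sub-group kernel can be modelled on the host. The sketch below rests on one assumption that should be checked against the sub-group documentation for your compiler: a block load of 8 elements gives the work-item at position `lane` the values at `src[j * sg_size + lane]` for `j = 0..7`. All function and parameter names here are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kVecWidth = 8; // elements per work-item per load

// Host model of one work-item's accumulation loop.
// ASSUMPTION: element j of a block load comes from src[j * sg_size + lane].
std::array<int, kVecWidth> lane_partial_sums(const std::vector<int> &data,
                                             std::size_t base,
                                             std::size_t lane,
                                             std::size_t sg_size,
                                             std::size_t elements_per_work_item) {
  assert(lane < sg_size);
  assert(elements_per_work_item % kVecWidth == 0);
  assert(base + sg_size * elements_per_work_item <= data.size());
  std::array<int, kVecWidth> sum{};
  // One load covers kVecWidth * sg_size elements (128 for sg_size == 16).
  const std::size_t stride = kVecWidth * sg_size;
  for (std::size_t i = 0; i < elements_per_work_item / kVecWidth; ++i)
    for (std::size_t j = 0; j < kVecWidth; ++j)
      sum[j] += data[base + i * stride + j * sg_size + lane];
  return sum;
}

// Adds up the partial vectors of every lane in one sub-group.
int subgroup_total(const std::vector<int> &data, std::size_t base,
                   std::size_t sg_size, std::size_t elements_per_work_item) {
  int total = 0;
  for (std::size_t lane = 0; lane < sg_size; ++lane)
    for (int v : lane_partial_sums(data, base, lane, sg_size,
                                   elements_per_work_item))
      total += v;
  return total;
}
```

With a sub-group size of 16 and 64 elements per work-item, one sub-group covers 1024 contiguous elements, and under the stated layout assumption the lanes together read each of them exactly once.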

## Reduction using SYCL Reduction Kernel
SYCL also supports built-in reduction operations. Use them where suitable, because their implementation is fine-tuned to the underlying architecture.

The following kernel shows how to use the built-in reduction operator in the compiler.

In [None]:
%%writefile lab/reduction_sycl.cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>

constexpr size_t N = (1000 * 1024 * 1024);

int main(int argc, char *argv[]) {

  sycl::queue q{sycl::property::queue::enable_profiling{}};
  std::cout << "Device: " << q.get_device().get_info<sycl::info::device::name>() << "\n";

  std::vector<int> data(N, 1);
  int sum = 0; // the reduction combines with the buffer's initial value

  const sycl::property_list props = {sycl::property::buffer::use_host_ptr()};

  sycl::buffer<int> buf(data.data(), data.size(), props);
  sycl::buffer<int> sum_buf(&sum, 1, props);
    
  auto e = q.submit([&](auto &h) {
      sycl::accessor buf_acc(buf, h, sycl::read_only);
      auto sum_reduction = sycl::reduction(sum_buf, h, sycl::plus<>());
      h.parallel_for(sycl::nd_range<1>{N, 256}, sum_reduction,
                     [=](sycl::nd_item<1> item, auto &sum_wg) {
                       int i = item.get_global_id(0);
                       sum_wg += buf_acc[i];
                     });
    });

  // Read the result through the host accessor rather than the raw variable
  sycl::host_accessor h_acc(sum_buf);
  std::cout << "Sum = " << h_acc[0] << "\n";

  std::cout << "Kernel time = " << (e.template get_profiling_info<sycl::info::event_profiling::command_end>() - e.template get_profiling_info<sycl::info::event_profiling::command_start>()) * 1e-9 << " seconds\n";
  return 0;
}


#### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! ./q.sh run_reduction_sycl.sh

#### Reduction using SYCL Reduction Kernel and Blocking
The kernel below uses the blocking technique and then the built-in reduction operator to perform the final reduction. This gave good performance on most of the platforms on which it was tested.

In [None]:
%%writefile lab/reduction_sycl_blocks.cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>

constexpr size_t N = (1000 * 1024 * 1024);

int main(int argc, char *argv[]) {

  sycl::queue q{sycl::property::queue::enable_profiling{}};
  std::cout << "Device: " << q.get_device().get_info<sycl::info::device::name>() << "\n";

  std::vector<int> data(N, 1);
  int sum = 0; // the reduction combines with the buffer's initial value

  int work_group_size = 256;
  int log2elements_per_block = 13;
  int elements_per_block = (1 << log2elements_per_block); // 8192

  int log2workitems_per_block = 8;
  int workitems_per_block = (1 << log2workitems_per_block); // 256
  int elements_per_work_item = elements_per_block / workitems_per_block;

  // Low bits select the work-item within a block (avoids shifting a negative value)
  int mask = (1 << log2workitems_per_block) - 1;
  int num_work_items = data.size() / elements_per_work_item;
  int num_work_groups = num_work_items / work_group_size;

  std::cout << "Num work items = " << num_work_items << std::endl;
  std::cout << "Num work groups = " << num_work_groups << std::endl;
  std::cout << "Elements per item = " << elements_per_work_item << std::endl;

  const sycl::property_list props = {sycl::property::buffer::use_host_ptr()};

  sycl::buffer<int> buf(data.data(), data.size(), props);
  sycl::buffer<int> sum_buf(&sum, 1, props);
    
  auto e = q.submit([&](auto &h) {
      sycl::accessor buf_acc(buf, h, sycl::read_only);
      auto sumr = sycl::reduction(sum_buf, h, sycl::plus<>());
      h.parallel_for(sycl::nd_range<1>{num_work_items, work_group_size}, sumr,
                     [=](sycl::nd_item<1> item, auto &sumr_arg) {
                       size_t glob_id = item.get_global_id(0);
                       size_t group_id = item.get_group(0);
                       size_t loc_id = item.get_local_id(0);
                       int offset = ((glob_id >> log2workitems_per_block)
                                     << log2elements_per_block) +
                                    (glob_id & mask);
                       int sum = 0;
                       for (size_t i = 0; i < elements_per_work_item; i++)
                         sum +=
                             buf_acc[(i << log2workitems_per_block) + offset];
                       sumr_arg += sum;
                     });
    });

  // Read the result through the host accessor rather than the raw variable
  sycl::host_accessor h_acc(sum_buf);
  std::cout << "Sum = " << h_acc[0] << "\n";

  std::cout << "Kernel time = " << (e.template get_profiling_info<sycl::info::event_profiling::command_end>() - e.template get_profiling_info<sycl::info::event_profiling::command_start>()) * 1e-9 << " seconds\n";
  return 0;
}


#### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! ./q.sh run_reduction_sycl_blocks.sh

## Resources

- [Intel GPU Optimization Guide](https://www.intel.com/content/www/us/en/develop/documentation/oneapi-gpu-optimization-guide/top.html) - Up-to-date resources for Intel GPU optimization
- [SYCL Specification](https://registry.khronos.org/SYCL/specs/sycl-2020/pdf/sycl-2020.pdf) - Latest Specification document for reference
- [SYCL Essentials Training](https://github.com/oneapi-src/oneAPI-samples/tree/master/DirectProgramming/C%2B%2BSYCL/Jupyter/oneapi-essentials-training) - Learn basics of C++ SYCL Programming