# RAJA-SYCL work-in-progress backend

RAJA kernel execution is determined by the policies passed as template arguments.  This document shows how those execution policies are mapped to the execution of a SYCL kernel in the internals of the RAJA-SYCL work-in-progress backend.

## RAJA-SYCL forall
`RAJA::forall` takes an execution policy, `<exec_policy>`, as a template parameter.  To enable `RAJA::forall` with the SYCL programming model, we implement the `sycl_exec<block_size, async>` execution policy.  We also have a temporary `sycl_exec_trivial<block_size, async>` execution policy for kernels that are trivially copyable.

The `sycl_exec_trivial` policy implementation is shown below.  It determines a `globalSize` that is a multiple of the specified `blockSize`.  It then launches a SYCL kernel using an `nd_range` and calls the kernel lambda with the RAJA index variable for every work-item whose global id falls inside the iteration space.
```cpp
template <typename Iterable, typename LoopBody, size_t BlockSize, bool Async>
RAJA_INLINE void forall_impl(sycl_exec_trivial<BlockSize, Async>,
                             Iterable&& iter,
                             LoopBody&& loop_body)
{

  using Iterator  = camp::decay<decltype(std::begin(iter))>;
  using LOOP_BODY = camp::decay<LoopBody>;
  using IndexType = camp::decay<decltype(std::distance(std::begin(iter), std::end(iter)))>;

  //
  // Compute the requested iteration space size
  //
  Iterator begin = std::begin(iter);
  Iterator end = std::end(iter);
  IndexType len = std::distance(begin, end);

  // Only launch kernel if we have something to iterate over
  if (len > 0 && BlockSize > 0) {

    //
    // Compute the number of blocks
    //
    sycl_dim_t blockSize{BlockSize};
    sycl_dim_t globalSize = impl::getGridDim(static_cast<size_t>(len), BlockSize);

    cl::sycl::queue* q = ::RAJA::sycl::detail::getQueue();

    q->submit([&](cl::sycl::handler& h) {

      h.parallel_for( cl::sycl::nd_range<1>{globalSize, blockSize},
                      [=]  (cl::sycl::nd_item<1> it) {

        size_t ii = it.get_global_id(0);

        if (ii < len) {
          loop_body(begin[ii]);
        }
      });
    });

    if (!Async) { q->wait(); }

  }
}
```
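
The rounding and masking steps can be checked on the host without a SYCL device. The sketch below assumes that `impl::getGridDim` rounds the iteration count up to the next multiple of the block size, as the description above implies. The names `round_up_global_size` and `emulate_forall` are hypothetical and are not part of RAJA.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: smallest multiple of block_size that is >= len.
// Assumption: this mirrors what impl::getGridDim computes.
inline std::size_t round_up_global_size(std::size_t len, std::size_t block_size)
{
  assert(block_size > 0);
  return ((len + block_size - 1) / block_size) * block_size;
}

// Host-side emulation of the nd_range launch: every "work-item" in
// [0, global_size) runs, but only those with ii < len call the body.
// Returns the number of padded work-items that were masked off.
template <typename Body>
std::size_t emulate_forall(std::size_t len, std::size_t block_size, Body&& body)
{
  const std::size_t global_size = round_up_global_size(len, block_size);
  std::size_t padded = 0;
  for (std::size_t ii = 0; ii < global_size; ++ii) {
    if (ii < len) {
      body(ii);
    } else {
      ++padded;
    }
  }
  return padded;
}
```

Under this assumption, `len = 1000` with a block size of 256 gives a global size of 1024, so 24 work-items are masked off by the `ii < len` check.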

## RAJA-SYCL Kernel
`RAJA::kernel` takes a number of `RAJA::statement`s that specify how to execute the kernel.  For the RAJA-SYCL work-in-progress backend, we implement the `SyclKernel` and `SyclKernelTrivial` statements to launch a SYCL kernel.  SYCL kernel policies, e.g. `RAJA::sycl_global_1<256>`, are then passed to `RAJA::statement::For` to specify how the ranges and indices of the RAJA kernel are mapped to the SYCL kernel.

Shown below is the SYCL kernel launch specified with the `SyclKernelTrivial` statement.

```cpp
template<bool async0, typename StmtList, typename Data, typename Types>
struct SyclLaunchHelperTrivial<sycl_launch<async0>,StmtList,Data,Types>
{
  using Self = SyclLaunchHelperTrivial;

  static constexpr bool async = async0;

  using executor_t = internal::sycl_statement_list_executor_t<StmtList, Data, Types>;
  using data_t = camp::decay<Data>;

  static void launch(Data &&data,
                     internal::LaunchDims launch_dims,
                     size_t shmem,
                     cl::sycl::queue* qu)
  {

    qu->submit([&](cl::sycl::handler& h) {

      h.parallel_for(launch_dims.fit_nd_range(),
                     [=] (cl::sycl::nd_item<3> item) {

        SyclKernelLauncher<Data, executor_t>(data, item);

      });
    });

    // Block until the kernel finishes unless an asynchronous launch was requested
    if (!async) { qu->wait(); }

  }
};

```

The inner statement, `sycl_global_1<256>`, is called through its `exec` function, shown below.  The SYCL global id is mapped to the RAJA kernel index, and the next enclosed `RAJA::statement` is then executed.

```cpp
  static
  inline RAJA_DEVICE void exec(Data &data, cl::sycl::nd_item<3> item, bool thread_active)
  {
    auto len = segment_length<ArgumentId>(data);
    auto i = item.get_global_id(Dim);

    // Assign this dimension's global id to the argument
    data.template assign_offset<ArgumentId>(i);

    // Execute enclosed statements; work-items past the segment end are masked off
    enclosed_stmts_t::exec(data, item, thread_active && (i<len));
  }
```
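
The way the `thread_active` flag narrows through nested statements can be emulated on the host. The sketch below is a simplified model, not the RAJA implementation: `Data`, `Lambda`, `For`, and `count_active` are hypothetical stand-ins, and a `std::array` of three indices takes the place of `cl::sycl::nd_item<3>`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the RAJA loop data.
struct Data {
  std::array<std::size_t, 3> length;  // segment length per argument
  std::array<std::size_t, 3> offset;  // index assigned per argument
  std::size_t body_calls;             // how often the loop body ran
};

// Innermost statement: runs the loop body only for active work-items.
struct Lambda {
  static void exec(Data& data, const std::array<std::size_t, 3>&, bool thread_active)
  {
    if (thread_active) { ++data.body_calls; }
  }
};

// Same shape as the exec function above: read the global id in dimension
// Dim, assign it to argument ArgumentId, and pass the narrowed mask on.
template <std::size_t ArgumentId, std::size_t Dim, typename Enclosed>
struct For {
  static void exec(Data& data, const std::array<std::size_t, 3>& global_id,
                   bool thread_active)
  {
    const std::size_t len = data.length[ArgumentId];
    const std::size_t i = global_id[Dim];
    data.offset[ArgumentId] = i;
    Enclosed::exec(data, global_id, thread_active && (i < len));
  }
};

// Run every work-item of a padded 2D global range through two nested
// For statements and count how many reach the loop body.
inline std::size_t count_active(std::size_t len0, std::size_t len1,
                                std::size_t global0, std::size_t global1)
{
  // The global range must cover the iteration space.
  assert(len0 <= global0 && len1 <= global1);

  Data data;
  data.length = {{len0, len1, 1}};
  data.offset = {{0, 0, 0}};
  data.body_calls = 0;

  using Nest = For<0, 0, For<1, 1, Lambda>>;
  for (std::size_t g0 = 0; g0 < global0; ++g0) {
    for (std::size_t g1 = 0; g1 < global1; ++g1) {
      const std::array<std::size_t, 3> id = {{g0, g1, 0}};
      Nest::exec(data, id, true);
    }
  }
  return data.body_calls;
}
```

In this model every work-item still walks the whole statement list, but only those inside the iteration space in every dimension reach the body: a 3 by 5 iteration space on a 4 by 8 global range runs the body 15 times.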

## RAJA-SYCL reduction object
Part of the current work-in-progress implementation of the RAJA reduction objects with the `sycl_reduce` policy is shown below.  Currently, the SYCL reduction object only works with 1D `forall` execution.

The reduction uses `atomic_ref` to accumulate into an array whose length is the work-group size.  Each work-item selects its slot in that array using its local id, which is obtained through the `__spirv` extension.

```cpp
//! specialization of ReduceSum for SYCL
template <typename T>
class ReduceSum<sycl_reduce, T>
    : public TargetReduce<RAJA::reduce::sum<T>, T>
{
public:

  using self = ReduceSum<sycl_reduce, T>;
  using parent = TargetReduce<RAJA::reduce::sum<T>, T>;
  using parent::parent;

  //! enable operator+= for ReduceSum -- alias for reduce()
  self &operator+=(T rhsVal)
  {
    parent::reduce(rhsVal);
    return *this;
  }

  //! const overload of operator+=: on the device, atomically adds into the
  //! slot for this work-item's local id; on the host, falls back to reduce()
  const self &operator+=(T rhsVal) const
  {
#ifdef __SYCL_DEVICE_ONLY__
    auto i = __spirv::initLocalInvocationId<1, cl::sycl::id<1>>()[0];
    auto atm = cl::sycl::ONEAPI::atomic_ref<
        T,
        cl::sycl::ONEAPI::memory_order::relaxed,
        cl::sycl::ONEAPI::memory_scope::device,
        cl::sycl::access::address_space::global_space>(parent::val.device[i]);
    atm.fetch_add(rhsVal);
    return *this;
#else
    parent::reduce(rhsVal);
    return *this;
#endif
  }
};
```
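
The accumulation scheme can be emulated on the host. The sketch below makes two assumptions that the excerpt above does not confirm: that the per-slot partial sums are combined by a final loop, and that the local id equals the global id modulo the block size (true for a 1D `nd_range` without an offset). `std::atomic` stands in for `atomic_ref`, and `emulate_sycl_reduce_sum` is a hypothetical name.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host emulation of the device path of operator+=.
// Each "work-item" adds its value to the slot selected by its local id,
// so only work-items with the same local id contend for a slot.
inline long emulate_sycl_reduce_sum(const std::vector<long>& values,
                                    std::size_t block_size)
{
  assert(block_size > 0);

  // One slot per local id, i.e. an array of work-group size.
  std::vector<std::atomic<long>> slots(block_size);
  for (auto& s : slots) { s.store(0, std::memory_order_relaxed); }

  for (std::size_t global_id = 0; global_id < values.size(); ++global_id) {
    const std::size_t local_id = global_id % block_size;
    slots[local_id].fetch_add(values[global_id], std::memory_order_relaxed);
  }

  // Assumed final step: combine the per-slot partial sums.
  long total = 0;
  for (auto& s : slots) { total += s.load(std::memory_order_relaxed); }
  return total;
}
```

Spreading the atomic updates over one slot per local id keeps contention lower than a single shared accumulator would, at the cost of a final pass over the slots.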