# SYCL with Buffers to SYCL with USM

## Starting Point
We will start with the code from the first notebook.


```c
#include "DAXPY.hpp"

#include "RAJA/RAJA.hpp"

#if defined(RAJA_ENABLE_SYCL)

#include "common/SyclDataUtils.hpp"

#include <iostream>

namespace rajaperf
{
namespace basic
{

  //
  // Define thread block size for SYCL execution
  //
  const size_t block_size = 256; // We could query our device for this

#define DAXPY_DATA_SETUP_SYCL \
  sycl::buffer<Real_type> d_x { m_x, iend }; \
  sycl::buffer<Real_type> d_y { m_y, iend };


#define DAXPY_DATA_TEARDOWN_SYCL \
 // Nothing to do here

void DAXPY::runSyclVariant(VariantID vid)
{
  const Index_type run_reps = getRunReps();
  const Index_type ibegin = 0;
  const Index_type iend = getRunSize();

  DAXPY_DATA_SETUP; // This sets up our host data. m_x, m_y.

  if ( vid == Base_SYCL ) {
    { // Create a scope for our buffers

      DAXPY_DATA_SETUP_SYCL;

      startTimer();
      for (RepIndex_type irep = 0; irep < run_reps; ++irep) {

        qu.submit([&] (sycl::handler& h) {
          auto x = d_x.get_access<sycl::access::mode::read>(h);
          auto y = d_y.get_access<sycl::access::mode::read_write>(h);

          h.parallel_for(sycl::range<1>(iend), [=] (sycl::item<1> item ) {

            Index_type i = item.get_id(0);
            DAXPY_BODY

          });
        });
      }
      qu.wait(); // Wait for computation to finish before stopping timer
      stopTimer();
     
    } // End of buffer scope

    DAXPY_DATA_TEARDOWN_SYCL;
 
  } else if ( vid == RAJA_SYCL ) {

  // We will do this later

  } else {
     std::cout << "\n  DAXPY : Unknown Sycl variant id = " << vid << std::endl;
  }

}

} // end namespace basic
} // end namespace rajaperf

#endif  // RAJA_ENABLE_SYCL
```

## Data Setup

As an alternative to the implicit memory management provided by buffers and accessors, the SYCL 2020 specification offers an explicit, pointer-based model called Unified Shared Memory (USM), which is similar to the memory management in existing heterogeneous programming models.

Within SYCL we can use the USM function `sycl::malloc_device` to allocate space on the device. Unlike allocation calls that fill in a pointer passed as an argument, `sycl::malloc_device` returns the pointer to the newly allocated device memory. After the size argument we pass either the device and context or, more simply, the queue. In this code the queue is a public static class member named `qu`.

```c
  dptr = cl::sycl::malloc_device<typename std::remove_pointer<T>::type>(len, qu);
```

Copying the data to the device works much like a conventional `memcpy`, except that it is a member function of the queue, `qu.memcpy`. The call is asynchronous, so we wait on the returned event to ensure the data has arrived before we use it.

```c
  auto e = qu.memcpy( dptr, hptr,
                      len * sizeof(typename std::remove_pointer<T>::type));
  e.wait();
```
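The `std::remove_pointer<T>::type` expression appears in both snippets because the helper is templated on the pointer type rather than the element type. Note also that the templated `sycl::malloc_device` takes an element count, while `memcpy` takes a size in bytes. The following is a minimal host-only sketch of that bookkeeping, assuming `T` is a raw pointer type such as `double*`; the names `element_t`, `bytes_for`, and `example_usage` are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical alias: strips one level of pointer from T, so double* -> double.
template <typename T>
using element_t = typename std::remove_pointer<T>::type;

// Hypothetical helper: bytes occupied by `len` elements when T is a pointer type.
// This is the value passed as the last argument of memcpy.
template <typename T>
constexpr std::size_t bytes_for(std::size_t len)
{
  return len * sizeof(element_t<T>);
}

// Compile-time check that the alias yields the element type.
static_assert(std::is_same<element_t<double*>, double>::value,
              "remove_pointer yields the element type");

// Example usage: four doubles occupy 4 * sizeof(double) bytes.
inline void example_usage()
{
  assert(bytes_for<double*>(4) == 4 * sizeof(double));
}
```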


## Kernel Launch
The kernel launch doesn't change much, except that we no longer create accessors; the lambda captures the device pointers directly. We also switch from a `sycl::range` to a `sycl::nd_range`, which takes two arguments: the global iteration size and the work-group (block) size. Because the global size must be a multiple of the block size, we round our iteration space up to the next multiple of `block_size`.

```c
  const size_t global_size = block_size * RAJA_DIVIDE_CEILING_INT(iend, block_size);

  qu.submit([&] (cl::sycl::handler& h) {
    h.parallel_for<class DAXPY>(cl::sycl::nd_range<1>(global_size, block_size),
                                [=] (cl::sycl::nd_item<1> item ) {

      Index_type i = item.get_global_id(0);
      if (i < iend) {
        DAXPY_BODY
      }

    });
  });
```
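To make the rounding concrete, here is a small host-only sketch of the arithmetic. It assumes `RAJA_DIVIDE_CEILING_INT` performs integer division rounded up; `divide_ceiling` and `padded_global_size` are hypothetical stand-ins written for this example, not RAJA functions.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for RAJA_DIVIDE_CEILING_INT: integer division rounded up.
inline std::size_t divide_ceiling(std::size_t n, std::size_t d)
{
  assert(d > 0);
  return (n + d - 1) / d;
}

// Smallest multiple of block_size that is greater than or equal to n.
inline std::size_t padded_global_size(std::size_t n, std::size_t block_size)
{
  return block_size * divide_ceiling(n, block_size);
}
```

With `block_size = 256`, an iteration space of 1000 elements is padded to 1024 work-items (four work-groups), while an iteration space of exactly 1024 elements is left unchanged.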

## Kernel
Within the kernel, because we now have a `sycl::nd_item` instead of a `sycl::item`, we access our global index through `item.get_global_id` instead of `item.get_id`. The name is more explicit because a `sycl::nd_item` also gives us access to our group id and local id.

Within the kernel body we now need to include a check so that the padding work-items, those between the end of our iteration space and the end of our global space, do no work.


```c
    [=] (cl::sycl::nd_item<1> item ) {

      Index_type i = item.get_global_id(0);
      if (i < iend) { // Skip the padding work-items past the end of our iteration space
        DAXPY_BODY
      }

    });
```
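The effect of the guard can be emulated on the host. The sketch below is not SYCL code: it simply loops over every global id the padded launch would create and counts how many are skipped. It assumes `DAXPY_BODY` is the usual `y[i] += a * x[i]`, which this notebook does not show, and `guarded_daxpy` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-only emulation of the padded launch: visit every global id, as the
// device would, and update y only for ids inside the iteration space.
// Returns the number of ids skipped by the guard.
inline std::size_t guarded_daxpy(std::vector<double>& y,
                                 const std::vector<double>& x,
                                 double a, std::size_t global_size)
{
  const std::size_t iend = y.size();
  assert(x.size() == iend && global_size >= iend);

  std::size_t skipped = 0;
  for (std::size_t i = 0; i < global_size; ++i) {
    if (i < iend) {
      y[i] += a * x[i];  // assumed DAXPY_BODY
    } else {
      ++skipped;         // padding id: without the guard this would write out of bounds
    }
  }
  return skipped;
}
```

For 1000 elements and a global size of 1024, the guard skips the last 24 ids.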

## Data Teardown
Because we are managing our memory explicitly, we cannot rely on the buffer going out of scope to move the data back to the host. We instead need to invoke the `memcpy` and `sycl::free` calls ourselves.

The `memcpy` looks the same as when we transferred the data to the device, except that we switch the destination and source arguments. After we wait for the `memcpy` to finish, we free the device memory with `sycl::free`.

```c
  // Destination (host) comes first, source (device) second
  auto e = qu.memcpy( hptr, dptr,
                      len * sizeof(typename std::remove_pointer<T>::type));
  e.wait();

  cl::sycl::free(dptr, qu);
```
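The ordering of setup, kernel, copy back, and free can be illustrated with a host-only analogy. In this sketch `std::malloc`, `std::memcpy`, and `std::free` play the roles of `sycl::malloc_device`, `qu.memcpy`, and `sycl::free`; `std::memcpy` uses the same destination-first argument order as `qu.memcpy`. The function `round_trip_scaled` is hypothetical, and a simple loop stands in for the kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Host-only analogy of the USM life cycle: allocate, copy in, compute,
// copy out, free. No SYCL calls are made here.
inline void round_trip_scaled(double* hptr, std::size_t len, double scale)
{
  if (len == 0) return;

  const std::size_t bytes = len * sizeof(double);
  double* dptr = static_cast<double*>(std::malloc(bytes));  // "device" allocation
  assert(dptr != nullptr);

  std::memcpy(dptr, hptr, bytes);  // setup: destination = device, source = host

  for (std::size_t i = 0; i < len; ++i) {
    dptr[i] *= scale;              // stand-in for the kernel
  }

  std::memcpy(hptr, dptr, bytes);  // teardown: destination = host, source = device

  std::free(dptr);                 // release only after the copy back has finished
}
```

In the real SYCL version the copies are asynchronous, which is why the returned event must be waited on before the device memory is freed.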

## Let's put it all together

```c
#include "DAXPY.hpp"

#include "RAJA/RAJA.hpp"

#if defined(RAJA_ENABLE_SYCL)

#include "common/SyclDataUtils.hpp"

#include <iostream>

namespace rajaperf
{
namespace basic
{

  //
  // Define thread block size for SYCL execution
  //
  const size_t block_size = 256;

#define DAXPY_DATA_SETUP_SYCL \
  allocAndInitSyclDeviceData(x, m_x, iend, qu); \
  allocAndInitSyclDeviceData(y, m_y, iend, qu);

#define DAXPY_DATA_TEARDOWN_SYCL \
  getSyclDeviceData(m_y, y, iend, qu); \
  deallocSyclDeviceData(x, qu); \
  deallocSyclDeviceData(y, qu);


void DAXPY::runSyclVariant(VariantID vid)
{
  const Index_type run_reps = getRunReps();
  const Index_type ibegin = 0;
  const Index_type iend = getRunSize();

  DAXPY_DATA_SETUP;

  if ( vid == Base_SYCL ) {

    DAXPY_DATA_SETUP_SYCL;

    startTimer();
    for (RepIndex_type irep = 0; irep < run_reps; ++irep) {

      const size_t global_size = block_size * RAJA_DIVIDE_CEILING_INT(iend, block_size);

      qu.submit([&] (cl::sycl::handler& h) {
        h.parallel_for<class DAXPY>(cl::sycl::nd_range<1>{global_size, block_size},
                                    [=] (cl::sycl::nd_item<1> item ) {

          Index_type i = item.get_global_id(0);
          if (i < iend) {
            DAXPY_BODY
          }

        });
      });
    }
    qu.wait(); // Wait for computation to finish before stopping timer
    stopTimer();

    DAXPY_DATA_TEARDOWN_SYCL;

  } else if ( vid == RAJA_SYCL ) {

      // We will do this later
      
  } else {
     std::cout << "\n  DAXPY : Unknown Sycl variant id = " << vid << std::endl;
  }

}

} // end namespace basic
} // end namespace rajaperf

#endif  // RAJA_ENABLE_SYCL
```

# Now Run It !!!