# Memory Optimization - Buffers

This section covers topics related to the declaration, movement, and access of data in the memory hierarchy.
- [Buffer Accessor Modes](#Buffer-Accessor-Modes)
- [Optimizing Memory Movement Between Host and Device](#Optimizing-Memory-Movement-Between-Host-and-Device)
- [Avoid Declaring Buffers in a Loop](#Avoid-Declaring-Buffers-in-a-Loop)
- [Avoid Moving Data Back and Forth Between Host and Device](#Avoid-Moving-Data-Back-and-Forth-Between-Host-and-Device)

## Buffer Accessor Modes
In SYCL, a buffer provides an abstract view of memory that can be accessed by the host or a device. A buffer cannot be accessed directly through the buffer object. Instead, we must create an accessor object that allows us to access the buffer’s data.

The access mode describes how we intend to use the memory associated with the accessor in the program. The accessor’s access modes are used by the runtime to create an execution order for the kernels and perform data movement. This will ensure that kernels are executed in an order intended by the programmer. Depending on the capabilities of the underlying hardware, the runtime can execute kernels concurrently if the dependencies do not give rise to dependency violations or race conditions.

For better performance, make sure that the access modes of accessors reflect the operations performed by the kernel. The compiler will flag an error when a write is done on an accessor that is declared `read_only`. However, the compiler does not change the declaration of an accessor from `read_write` to `read_only` if the kernel performs no writes.

The following example shows a simple vector-add computation. The accessors for buffers A, B, and C use the default access mode, which is `read_write`.

The `read_write` access mode informs the runtime that the data needs to be available on the device before the kernel can begin executing and the data needs to be copied from the device to the host at the end of the computation.

In [None]:
%%writefile lab/buffer_access_modes.cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================

#include <sycl/sycl.hpp>
#include <chrono>
#include <iostream>
#include <vector>

constexpr int N = 1024000000;

int main() {

  std::vector<int> a(N, 1);
  std::vector<int> b(N, 2);
  std::vector<int> c(N);

  auto start = std::chrono::high_resolution_clock::now().time_since_epoch().count();
    
  sycl::queue q;
  {
    sycl::buffer<int> a_buf(a);
    sycl::buffer<int> b_buf(b);
    sycl::buffer<int> c_buf(c);

    q.submit([&](auto &h) {
      // Create device accessors.
      sycl::accessor a_acc(a_buf, h);
      sycl::accessor b_acc(b_buf, h);
      sycl::accessor c_acc(c_buf, h);

      h.parallel_for(N, [=](auto i) {
        c_acc[i] = a_acc[i] + b_acc[i];
      });
    });
  }
  std::cout << "C = " << c[N/2] << "\n";
    
  auto duration = std::chrono::high_resolution_clock::now().time_since_epoch().count() - start;
  std::cout << "Compute Duration: " << duration / 1e+9 << " seconds\n";

  return 0;
}


#### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! ./q.sh run_buffer_access_modes.sh

A better implementation is to make A and B `read_only`: since they are not modified, they need not be copied back to the host. Similarly, make C `write_only`: since its values are overwritten, it need not be copied from the host to the device to begin with.

```cpp
      sycl::accessor a_acc(a_buf, h, sycl::read_only);
      sycl::accessor b_acc(b_buf, h, sycl::read_only);
      sycl::accessor c_acc(c_buf, h, sycl::write_only);
```

The `read_only` access mode informs the runtime that the data needs to be available on the device before the kernel can begin executing, but the data need not be copied from the device to the host at the end of the computation.

The `write_only` access mode informs the runtime that the data need not be available on the device before the kernel can begin executing, but the data needs to be copied from the device to the host at the end of the computation.

Making this change to the accessors in the code above should reduce the compute duration.

## Optimizing Memory Movement Between Host and Device
Buffers can be created with properties that control how they are allocated. One such property is `use_host_ptr`, which informs the runtime that, if possible, the buffer should use the host memory directly instead of a copy. This avoids copying the content of the buffer back and forth between the host memory and the buffer memory, potentially saving time during buffer creation and destruction.

As another case, when the GPU and CPU share memory, copies can be avoided by sharing pages. For page sharing to be possible, however, the allocated memory must satisfy certain requirements, such as being aligned on a page boundary. With discrete devices, the benefit may not be realized, because any memory operation by the accelerator has to cross PCIe or another interface that is slower than the accelerator's own memory.

The following code shows how to print the memory addresses on the host, inside the buffer, and on the accelerator device inside the kernel.

In [None]:
%%writefile lab/buffer_host_ptr.cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================

#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>

int main() {
  constexpr int num_items = 16;
  constexpr int iter = 1;

  std::vector<int> a(num_items, 10);
  std::vector<int> b(num_items, 10);
  std::vector<int> sum(num_items, 0);

  sycl::queue q;
  std::cout << "Device : " << q.get_device().get_info<sycl::info::device::name>() << "\n";
  
  const sycl::property_list props = {sycl::property::buffer::use_host_ptr()};

  sycl::buffer a_buf(a, props);
  sycl::buffer b_buf(b, props);
  sycl::buffer sum_buf(sum, props);
  {
    sycl::host_accessor a_host_acc(a_buf);
    std::cout << "address of vector a     = " << a.data() << "\n";
    std::cout << "buffer memory address   = " << a_host_acc.get_pointer() << "\n";
  }
  q.submit([&](auto &h) {
    // Input accessors
    sycl::accessor a_acc(a_buf, h, sycl::read_only);
    sycl::accessor b_acc(b_buf, h, sycl::read_only);
    // Output accessor
    sycl::accessor sum_acc(sum_buf, h, sycl::write_only, sycl::no_init);
    sycl::stream out(1024 * 1024, 1 * 128, h);

    h.parallel_for(num_items, [=](auto i) {
      if (i[0] == 0)
        out << "device accessor address = " << a_acc.get_pointer() << "\n";
      sum_acc[i] = a_acc[i] + b_acc[i];
    });
  }).wait();
  return 0;
}



#### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! ./q.sh run_buffer_host_ptr.sh

When this program is run, it can be seen that the addresses for host and the buffer are the same when the property `use_host_ptr` is set.

Also note that none of the input vectors are declared `const`. If they are declared `const`, they are copied during buffer creation and new memory is allocated instead of reusing the memory in the host vectors. You can try this in the above code by changing the vector `a` to `const`:
```cpp
    const std::vector<int> a(num_items, 10);
    ^^^^^
```
The program then incurs the cost of copying memory contents between the host and the buffer, and also from the buffer to the accelerator.

Care must be taken to ensure that unnecessary copies are avoided during the creation of buffers and passing the memory from the buffers to the kernels. Even when the accelerator shares memory with the host, a few additional conditions must be satisfied to avoid these extra copies.

## Avoid Declaring Buffers in a Loop
When kernels are repeatedly launched inside a for-loop, you can prevent repeated allocation and freeing of a buffer by declaring the buffer outside the loop. Declaring a buffer inside the loop introduces repeated host-to-device and device-to-host memory copies.

In the following example, the kernel is repeatedly launched inside a for-loop. The buffer C is used as a temporary array, where it is used to hold values in an iteration, and the values assigned in one iteration are not used in any other iteration. Since the buffer C is declared inside the for-loop, it is allocated and freed in every loop iteration. In addition to the allocation and freeing of the buffer, the memory associated with the buffer is redundantly transferred from host to device and device to host in each iteration.

In [None]:
%%writefile lab/buffer_loop.cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================

#include <sycl/sycl.hpp>
#include <chrono>
#include <cstdio>
#include <iostream>
#include <vector>

constexpr int N = 16;
constexpr int STEPS = 10000;

int main() {

  std::vector<int> a(N, 1);
  std::vector<int> b(N, 2);
  std::vector<int> c(N);

  auto start = std::chrono::high_resolution_clock::now().time_since_epoch().count();
    
  sycl::queue q;

  sycl::buffer<int> a_buf(a);
  sycl::buffer<int> b_buf(b);

  for (int j = 0; j < STEPS; j++) {
    //# Buffer c in the loop
    sycl::buffer<int> c_buf(c, sycl::no_init);

    q.submit([&](auto &h) {
      // Create device accessors.
      sycl::accessor a_acc(a_buf, h);
      sycl::accessor b_acc(b_buf, h);
      sycl::accessor c_acc(c_buf, h);
      h.parallel_for(N, [=](auto i) {
        c_acc[i] = (a_acc[i] < b_acc[i]) ? -1 : 1;
        a_acc[i] += c_acc[i];
        b_acc[i] -= c_acc[i];
      });
    });
  }

  // Create host accessors.
  const sycl::host_accessor ha(a_buf);
  const sycl::host_accessor hb(b_buf);
  printf("%d %d\n", ha[N / 2], hb[N / 2]);
    
  auto duration = std::chrono::high_resolution_clock::now().time_since_epoch().count() - start;
  std::cout << "Compute Duration: " << duration / 1e+9 << " seconds\n";

  return 0;
}


#### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! ./q.sh run_buffer_loop.sh

A better approach would be to declare the buffer C before the for-loop, so that it is allocated and freed only once, resulting in improved performance by avoiding the redundant data transfers between host and device. 

In the above code, try moving the declaration of buffer C before the for-loop and compare the compute duration.
```cpp
sycl::buffer<int> c_buf(c, sycl::no_init);
```



## Avoid Moving Data Back and Forth Between Host and Device
The cost of moving data between host and device is quite high, especially in the case of discrete accelerators, so it is important to avoid data transfers between host and device as much as possible. In some situations, data computed by a kernel on the accelerator must be brought to the host, processed there, and sent back to the device for further processing. In such situations we pay for the device-to-host transfer and then again for the host-to-device transfer.

Consider the following example, where one kernel produces data through some operation (in this case vector add) into a new vector. This vector is then transformed into another vector by applying a function on each value and then fed as input into another kernel for some additional computation. This form of computation is quite common and occurs in many domains where algorithms are iterative and output from one computation needs to be fed as input into another computation. One classic example is machine learning models which are structured as layers of computation and output of one layer is input to the next layer.

In [None]:
%%writefile lab/buffer_mem_move_0.cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================

#include <sycl/sycl.hpp>
#include <chrono>
#include <iostream>
#include <vector>

int main() {
  constexpr int num_items = 1024*1000*1000;
  std::vector<int> a(num_items);
  for(int i=0;i<num_items;i++) a[i] = i;
  std::vector<int> b(num_items, 1);
  std::vector<int> c(num_items, 2);
  std::vector<int> d(num_items, 3);
  std::vector<int> sum(num_items, 0);
  std::vector<int> res(num_items, 0);

  sycl::queue q;
  std::cout << "Device : " << q.get_device().get_info<sycl::info::device::name>() << "\n";
  
  const sycl::property_list props = {sycl::property::buffer::use_host_ptr()};

  sycl::buffer a_buf(a, props);
  sycl::buffer b_buf(b, props);
  sycl::buffer c_buf(c, props);
  sycl::buffer d_buf(d, props);
  sycl::buffer sum_buf(sum, props);
  sycl::buffer res_buf(res, props);

  auto start = std::chrono::high_resolution_clock::now().time_since_epoch().count();

  //# Kernel 1
  q.submit([&](auto &h) {
    sycl::accessor a_acc(a_buf, h, sycl::read_only);
    sycl::accessor b_acc(b_buf, h, sycl::read_only);
    sycl::accessor sum_acc(sum_buf, h, sycl::write_only, sycl::no_init);

    h.parallel_for(num_items, [=](auto i) { sum_acc[i] = a_acc[i] + b_acc[i]; });
  });

  {
    sycl::host_accessor h_acc(sum_buf);
    for (int j = 0; j < num_items; j++)
      if (h_acc[j] > 10)
        h_acc[j] = 1;
      else
        h_acc[j] = 0;
  }

  //# Kernel 2
  q.submit([&](auto &h) {
    sycl::accessor c_acc(c_buf, h, sycl::read_only);
    sycl::accessor d_acc(d_buf, h, sycl::read_only);
    sycl::accessor sum_acc(sum_buf, h, sycl::read_only);
    sycl::accessor res_acc(res_buf, h, sycl::write_only, sycl::no_init);

    h.parallel_for(num_items, [=](auto i) { res_acc[i] = sum_acc[i] * c_acc[i] + d_acc[i]; });
  }).wait();

  sycl::host_accessor h_acc(res_buf); 
  for (int i = 0; i < 20; i++) std::cout << h_acc[i] << " ";
  std::cout << "...\n";
    
  auto duration = std::chrono::high_resolution_clock::now().time_since_epoch().count() - start;
  std::cout << "Compute Duration: " << duration / 1e+9 << " seconds\n";
  return 0;
}


#### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! ./q.sh run_buffer_mem_move_0.sh

Instead of bringing the data to the host, applying the function there, and sending the data back to the device for the second kernel, you can create a kernel that executes this function on the device itself. This avoids the round trip of data between device and host. The technique is shown in the example below, which is functionally the same as the previous code. We now introduce a third kernel, kernel3, that operates on the intermediate data in `sum_buf` between kernel1 and kernel2.

In [None]:
%%writefile lab/buffer_mem_move_1.cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================

#include <sycl/sycl.hpp>
#include <chrono>
#include <iostream>
#include <vector>

int main() {
  constexpr int num_items = 1024*1000*1000;
  std::vector<int> a(num_items);
  for(int i=0;i<num_items;i++) a[i] = i;
  std::vector<int> b(num_items, 1);
  std::vector<int> c(num_items, 2);
  std::vector<int> d(num_items, 3);
  std::vector<int> sum(num_items, 0);
  std::vector<int> res(num_items, 0);

  sycl::queue q;
  std::cout << "Device : " << q.get_device().get_info<sycl::info::device::name>() << "\n";
  
  const sycl::property_list props = {sycl::property::buffer::use_host_ptr()};

  sycl::buffer a_buf(a, props);
  sycl::buffer b_buf(b, props);
  sycl::buffer c_buf(c, props);
  sycl::buffer d_buf(d, props);
  sycl::buffer sum_buf(sum, props);
  sycl::buffer res_buf(res, props);

  auto start = std::chrono::high_resolution_clock::now().time_since_epoch().count();

  //# Kernel 1
  q.submit([&](auto &h) {
    sycl::accessor a_acc(a_buf, h, sycl::read_only);
    sycl::accessor b_acc(b_buf, h, sycl::read_only);
    sycl::accessor sum_acc(sum_buf, h, sycl::write_only, sycl::no_init);

    h.parallel_for(num_items, [=](auto i) { sum_acc[i] = a_acc[i] + b_acc[i]; });
  });

  //# Kernel 3
  q.submit([&](auto &h) {
    sycl::accessor sum_acc(sum_buf, h, sycl::read_write);
    h.parallel_for(num_items, [=](auto i) { 
      if (sum_acc[i] > 10)
        sum_acc[i] = 1;
      else
        sum_acc[i] = 0;
    });
  });

  //# Kernel 2
  q.submit([&](auto &h) {
    sycl::accessor c_acc(c_buf, h, sycl::read_only);
    sycl::accessor d_acc(d_buf, h, sycl::read_only);
    sycl::accessor sum_acc(sum_buf, h, sycl::read_only);
    sycl::accessor res_acc(res_buf, h, sycl::write_only, sycl::no_init);

    h.parallel_for(num_items, [=](auto i) { res_acc[i] = sum_acc[i] * c_acc[i] + d_acc[i]; });
  }).wait();

  sycl::host_accessor h_acc(res_buf); 
  for (int i = 0; i < 20; i++) std::cout << h_acc[i] << " ";
  std::cout << "...\n";

  auto duration = std::chrono::high_resolution_clock::now().time_since_epoch().count() - start;
  std::cout << "Compute Duration: " << duration / 1e+9 << " seconds\n";
  return 0;
}


#### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! ./q.sh run_buffer_mem_move_1.sh

There are other ways to optimize this example. For instance, the clipping operation in kernel3 can be merged into the computation of kernel1, as shown below. This is kernel fusion, and it has the added advantage of not launching a third kernel. The DPC++ compiler cannot perform this kind of optimization. In some specific domains such as machine learning, graph compilers operate on the ML models and fuse the operations, which has the same effect.

In [None]:
%%writefile lab/buffer_mem_move_2.cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================

#include <sycl/sycl.hpp>
#include <chrono>
#include <iostream>
#include <vector>

int main() {
  constexpr int num_items = 1024*1000*1000;
  std::vector<int> a(num_items);
  for(int i=0;i<num_items;i++) a[i] = i;
  std::vector<int> b(num_items, 1);
  std::vector<int> c(num_items, 2);
  std::vector<int> d(num_items, 3);
  std::vector<int> sum(num_items, 0);
  std::vector<int> res(num_items, 0);

  sycl::queue q;
  std::cout << "Device : " << q.get_device().get_info<sycl::info::device::name>() << "\n";
  
  const sycl::property_list props = {sycl::property::buffer::use_host_ptr()};

  sycl::buffer a_buf(a, props);
  sycl::buffer b_buf(b, props);
  sycl::buffer c_buf(c, props);
  sycl::buffer d_buf(d, props);
  sycl::buffer sum_buf(sum, props);
  sycl::buffer res_buf(res, props);

  auto start = std::chrono::high_resolution_clock::now().time_since_epoch().count();

  //# Kernel 1
  q.submit([&](auto &h) {
    sycl::accessor a_acc(a_buf, h, sycl::read_only);
    sycl::accessor b_acc(b_buf, h, sycl::read_only);
    sycl::accessor sum_acc(sum_buf, h, sycl::write_only, sycl::no_init);

    h.parallel_for(num_items, [=](auto i) {
      int t = a_acc[i] + b_acc[i];
      if (t > 10)
        sum_acc[i] = 1;
      else
        sum_acc[i] = 0;
    });
  });

  //# Kernel 2
  q.submit([&](auto &h) {
    sycl::accessor c_acc(c_buf, h, sycl::read_only);
    sycl::accessor d_acc(d_buf, h, sycl::read_only);
    sycl::accessor sum_acc(sum_buf, h, sycl::read_only);
    sycl::accessor res_acc(res_buf, h, sycl::write_only, sycl::no_init);

    h.parallel_for(num_items, [=](auto i) { res_acc[i] = sum_acc[i] * c_acc[i] + d_acc[i]; });
  }).wait();

  sycl::host_accessor h_acc(res_buf); 
  for (int i = 0; i < 20; i++) std::cout << h_acc[i] << " ";
  std::cout << "...\n";

  auto duration = std::chrono::high_resolution_clock::now().time_since_epoch().count() - start;
  std::cout << "Compute Duration: " << duration / 1e+9 << " seconds\n";
  return 0;
}



#### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! ./q.sh run_buffer_mem_move_2.sh

We can take this kernel fusion one level further and fuse kernel1 and kernel2, as shown in the code below. This gives very good performance because it eliminates the intermediate `sum_buf` completely, saving memory in addition to avoiding the launch of an additional kernel. Most of the performance benefit in this case comes from improved locality of memory references.

In [None]:
%%writefile lab/buffer_mem_move_3.cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================

#include <sycl/sycl.hpp>
#include <chrono>
#include <iostream>
#include <vector>

int main() {
  constexpr int num_items = 1024*1000*1000;
  std::vector<int> a(num_items);
  for(int i=0;i<num_items;i++) a[i] = i;
  std::vector<int> b(num_items, 1);
  std::vector<int> c(num_items, 2);
  std::vector<int> d(num_items, 3);
  std::vector<int> res(num_items, 0);

  sycl::queue q;
  std::cout << "Device : " << q.get_device().get_info<sycl::info::device::name>() << "\n";
  
  const sycl::property_list props = {sycl::property::buffer::use_host_ptr()};

  sycl::buffer a_buf(a, props);
  sycl::buffer b_buf(b, props);
  sycl::buffer c_buf(c, props);
  sycl::buffer d_buf(d, props);
  sycl::buffer res_buf(res, props);

  auto start = std::chrono::high_resolution_clock::now().time_since_epoch().count();

  //# Kernel 1
  q.submit([&](auto &h) {
    sycl::accessor a_acc(a_buf, h, sycl::read_only);
    sycl::accessor b_acc(b_buf, h, sycl::read_only);
    sycl::accessor c_acc(c_buf, h, sycl::read_only);
    sycl::accessor d_acc(d_buf, h, sycl::read_only);
    sycl::accessor res_acc(res_buf, h, sycl::write_only, sycl::no_init);

    h.parallel_for(num_items, [=](auto i) {
      int t = a_acc[i] + b_acc[i];
      if (t > 10)
        res_acc[i] = c_acc[i] + d_acc[i] ;
      else
        res_acc[i] = d_acc[i];
    });
  }).wait();

  sycl::host_accessor h_acc(res_buf); 
  for (int i = 0; i < 20; i++) std::cout << h_acc[i] << " ";
  std::cout << "...\n";

  auto duration = std::chrono::high_resolution_clock::now().time_since_epoch().count() - start;
  std::cout << "Compute Duration: " << duration / 1e+9 << " seconds\n";
  return 0;
}


#### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! ./q.sh run_buffer_mem_move_3.sh

## Resources

- [Intel GPU Optimization Guide](https://www.intel.com/content/www/us/en/develop/documentation/oneapi-gpu-optimization-guide/top.html) - Up to date resources for Intel GPU Optimization
- [SYCL Specification](https://registry.khronos.org/SYCL/specs/sycl-2020/pdf/sycl-2020.pdf) - Latest Specification document for reference
- [SYCL Essentials Training](https://github.com/oneapi-src/oneAPI-samples/tree/master/DirectProgramming/C%2B%2BSYCL/Jupyter/oneapi-essentials-training) - Learn basics of C++ SYCL Programming