# Memory Optimization
Accelerators have access to a rich memory hierarchy. Utilizing the right level in the hierarchy is critical to getting the best performance.

In this section we cover the performance impact and considerations of choosing a memory model in SYCL:
- [USM vs Buffers](#USM-vs-Buffers)
- [Performance Impact of Buffers](#Performance-Impact-of-Buffers)
- [Performance Impact of USM Shared Allocations](#Performance-Impact-of-USM-Shared-Allocations)
- [Performance Impact of USM Device Allocations](#Performance-Impact-of-USM-Device-Allocations)

We will also look in detail at memory optimizations and considerations when using buffers and USM:
- [Memory Optimization: Buffers](031_Memory_Optimization_Buffers.ipynb)
- [Memory Optimization: Unified Shared Memory](032_Memory_Optimization_USM.ipynb)

## USM vs Buffers
SYCL offers several choices for managing memory on the device. This section discusses the performance tradeoffs, briefly introducing the concepts. For an in-depth explanation, see Data Parallel C++.

As with other language features, the specification defines the behavior but not the implementation, so performance characteristics can change between software versions and devices. This guide provides best practices.

#### Buffers
A buffer is a container for data that can be accessed from a device and the host. The SYCL runtime manages memory by providing APIs for allocating, reading, and writing memory. The runtime is responsible for moving data between host and device, and synchronizing access to the data.

#### Unified Shared Memory (USM)
USM allows reading and writing of data with conventional pointers, in contrast to buffers where access to data is exclusively by API. USM has two commonly-used variants. Device allocations can only be accessed from the device and therefore require explicit movement of data between host and device. Shared allocations can be referenced from device or host, with the runtime automatically moving memory.

We illustrate the tradeoffs between choices by showing the same example program written with the three models. To highlight the issues, we use a program where a GPU and the host cooperatively compute, and therefore need to ship data back and forth.

## Performance Impact of Buffers vs USM

In the next sections we will look at Performance Impact of:
- Buffers
- USM Shared Allocation (Implicit Data Movement)
- USM Device Allocation (Explicit Data Movement)

### Performance Impact of Buffers

Below, we show computation using buffers to manage data. A `buffer` is created and `parallel_for` executes the kernel. The kernel uses the `device_data` accessor to read and write data in `buffer_data`.

Note that the code does not specify the location of data. An `accessor` indicates when and where the data is needed, and the SYCL runtime moves the data to the device (if necessary) and then launches the kernel. The `host_accessor` indicates that the data will be read or written on the host. Since the kernel also reads and writes `buffer_data`, the `host_accessor` constructor waits for the kernel to complete and moves the data to the host before the host accesses it.

In [None]:
%%writefile lab/buffers.cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================

#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>

int main() {
  constexpr int N = 16;
  std::vector<int> host_data(N, 10);

  sycl::queue q;
  std::cout << "Device : " << q.get_device().get_info<sycl::info::device::name>() << "\n";

  //# Modify data array on device
  sycl::buffer buffer_data(host_data);
  q.submit([&](sycl::handler& h) {
    sycl::accessor device_data(buffer_data, h);
    h.parallel_for(N, [=](auto i) { device_data[i] += 1; });
  });
  sycl::host_accessor ha(buffer_data, sycl::read_only);

  //# print output
  for (int i = 0; i < N; i++) std::cout << ha[i] << " ";
  std::cout << "\n";
  return 0;
}


#### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! ./q.sh run_buffers.sh

#### Performance Considerations - Buffers

In the above code, the data accesses `device_data[i]` appear to be simple array references, but they are implemented by the SYCL runtime with C++ operator overloading. The efficiency of accessor array references depends on the implementation. In practice, device code pays no overhead for overloading compared to direct memory references. The runtime does not know in advance which part of the buffer is accessed, so it must ensure all the data is on the device before the kernel begins. Both observations hold for current implementations, but may change over time.

The same is not currently true for the `host_accessor`. The runtime does not move all the data to the host. The array references are implemented with more complex code and are significantly slower than native C++ array references. Referencing a small amount of data through a `host_accessor` is acceptable, but computationally intensive algorithms that use one pay a large performance penalty, so avoid `host_accessor` in such code.

Another issue is concurrency. A `host_accessor` can block kernels that reference the same buffer from launching, even if the accessor is not actively being used to read/write data. Limit the scope that contains the `host_accessor` to the minimum possible.

### Performance Impact of USM Shared Allocations

Next we show the same algorithm implemented with shared allocations. Data is allocated using `sycl::malloc_shared`. Accessors are not needed because USM-allocated data can be referenced with conventional pointers, so the array references can be implemented with simple indexing. The `parallel_for` ends with a wait to ensure the kernel finishes before the host accesses data.

Similar to buffers, the SYCL runtime ensures that all the data is resident on the device before launching a kernel. And like buffers, shared allocations are not copied to the host unless they are referenced. The first time the host references data, there is an operating system page fault, a page of data is copied from device to host, and execution continues. Subsequent references to data on the same page execute at full speed. When a kernel is launched, all of the host-resident pages are flushed back to the device.

In [None]:
%%writefile lab/usm_shared.cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  sycl::queue q;
  std::cout << "Device : " << q.get_device().get_info<sycl::info::device::name>() << "\n";

  //# USM allocation using malloc_shared
  constexpr int N = 16;
  int *data = sycl::malloc_shared<int>(N, q);

  //# Initialize data array
  for (int i = 0; i < N; i++) data[i] = 10;

  //# Modify data array on device
  q.parallel_for(N, [=](auto i) { data[i] += 1; }).wait();

  //# print output
  for (int i = 0; i < N; i++) std::cout << data[i] << " ";
  std::cout << "\n";
  sycl::free(data, q);
  return 0;
}


#### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! ./q.sh run_usm_shared.sh

#### Performance Considerations - USM Shared Allocation
Compared to buffers, data references are simple pointers and perform well. However, servicing page faults to bring data to the host incurs overhead in addition to the cost of transferring data. The impact on the application depends on the reference pattern. Sparse random access has the highest overhead and linear scans through data have lower impact from page faults.

Since all synchronization is explicit and under programmer control, concurrency is not an issue for a well-designed program.

### Performance Impact of USM Device Allocations
The same program with device allocations is shown below. With a device allocation, data can only be directly accessed on the device and must be explicitly copied to and from the host. All synchronization between device and host is explicit. The code below uses a default queue, which does not guarantee in-order execution, so each operation ends with a `wait`: the `parallel_for` does not start until the first `memcpy` finishes, and the host does not read `host_data` until the final asynchronous copy finishes. Constructing the `queue` with the `in_order` property would instead make each operation wait for the previous one, leaving only the final `wait`.

In [None]:
%%writefile lab/usm_device.cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  sycl::queue q;
  std::cout << "Device : " << q.get_device().get_info<sycl::info::device::name>() << "\n";

  //# initialize data on host
  constexpr int N = 16;
  int host_data[N];
  for (int i = 0; i < N; i++) host_data[i] = 10;

  //# Explicit USM allocation using malloc_device
  int *device_data = sycl::malloc_device<int>(N, q);

  //# copy mem from host to device
  q.memcpy(device_data, host_data, sizeof(int) * N).wait();

  //# update device memory
  q.parallel_for(N, [=](auto i) { device_data[i] += 1; }).wait();

  //# copy mem from device to host
  q.memcpy(host_data, device_data, sizeof(int) * N).wait();

  //# print output
  for (int i = 0; i < N; i++) std::cout << host_data[i] << " ";
  std::cout << "\n";
  sycl::free(device_data, q);
  return 0;
}


#### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! ./q.sh run_usm_device.sh

#### Performance Considerations - USM Device Allocation
Both data movement and synchronization are explicit and under the full control of the programmer. Array references on the host are native C++ array references, so this approach has neither the page-fault overhead of shared allocations nor the overloading overhead associated with buffers. Shared allocations only transfer data that the host actually references, at memory-page granularity. In theory, device allocations allow on-demand movement at any granularity. In practice, fine-grained, asynchronous movement of data can be complex, and most programmers simply move the entire data structure once. The requirement for explicit data movement and synchronization makes the code more complicated, but device allocations can provide the best performance.

## Memory Optimizations for Buffers and USM
Next we will look at optimizations specific to Buffers and USM:
- [Buffers Optimization](031_Memory_Optimization_Buffers.ipynb)
- [USM Optimizations](032_Memory_Optimization_USM.ipynb)

## Resources

- [Intel GPU Optimization Guide](https://www.intel.com/content/www/us/en/develop/documentation/oneapi-gpu-optimization-guide/top.html) - Up to date resources for Intel GPU Optimization
- [SYCL Specification](https://registry.khronos.org/SYCL/specs/sycl-2020/pdf/sycl-2020.pdf) - Latest Specification document for reference
- [SYCL Essentials Training](https://github.com/oneapi-src/oneAPI-samples/tree/master/DirectProgramming/C%2B%2BSYCL/Jupyter/oneapi-essentials-training) - Learn basics of C++ SYCL Programming