# Memory Optimization - USM

In this section we cover topics related to declaring, moving, and accessing data across the memory hierarchy.
- [Overlapping Data Transfer from Host to Device](#Overlapping-Data-Transfer-from-Host-to-Device)
- [Avoid Copying Unnecessary Blocks of Data](#Avoid-Copying-Unnecessary-Blocks-of-Data)
- [Copying Memory from Host to USM Device Allocation](#Copying-Memory-from-Host-to-USM-Device-Allocation)

## Overlapping Data Transfer from Host to Device
Some GPUs provide specialized engines for copying data from host to device. Using them effectively allows host-to-device data transfers to overlap with execution on the device. In the following example, a block of memory is divided into chunks; each chunk is transferred to the accelerator, processed, and the result is copied back to the host. The three tasks for each chunk (copy-in, compute, copy-out) are independent of the tasks for other chunks, so different chunks can be processed in parallel, depending on the availability of hardware resources.

On systems with copy engines that can transfer data between host and device, operations from different loop iterations can execute in parallel. The parallel execution can manifest in two ways:
- Between two memory copies, where one is executed by the GPU execution units (EUs) and one by a copy engine, or both are executed by copy engines.
- Between a memory copy and a compute kernel, where the memory copy is executed by a copy engine and the compute kernel by the GPU EUs.

In [None]:
%%writefile lab/usm_overlap_copy.cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <sycl/sycl.hpp>
#include <chrono>
#include <iostream>

#define NITERS 10
#define KERNEL_ITERS 10000
#define NUM_CHUNKS 10
#define CHUNK_SIZE 10000000

int main() {
  const int num_chunks = NUM_CHUNKS;
  const int chunk_size = CHUNK_SIZE;
  const int iter = NITERS;

  sycl::queue q;

  //# Allocate and initialize host data
  float *host_data[num_chunks];
  for (int c = 0; c < num_chunks; c++) {
    host_data[c] = sycl::malloc_host<float>(chunk_size, q);
    float val = c;
    for (int i = 0; i < chunk_size; i++)
      host_data[c][i] = val;
  }
  std::cout << "Allocated host data\n";

  //# Allocate and initialize device memory
  float *device_data[num_chunks];
  for (int c = 0; c < num_chunks; c++) {
    device_data[c] = sycl::malloc_device<float>(chunk_size, q);
    float val = 1000.0;
    q.fill<float>(device_data[c], val, chunk_size);
  }
  q.wait();
  std::cout << "Allocated device data\n";

  auto start = std::chrono::high_resolution_clock::now().time_since_epoch().count();

  for (int it = 0; it < iter; it++) {
    for (int c = 0; c < num_chunks; c++) {

      //# Copy-in not dependent on previous event
      auto copy_in_event = q.memcpy(device_data[c], host_data[c], sizeof(float) * chunk_size);

      //# Compute waits for copy_in_event
      auto compute_event = q.parallel_for(chunk_size, copy_in_event, [=](auto id) {
        for (int i = 0; i < KERNEL_ITERS; i++) device_data[c][id] += 1.0;
      });

      //# Copy out waits for compute_event
      auto copy_out_event = q.memcpy(host_data[c], device_data[c], sizeof(float) * chunk_size, compute_event);
    }

    q.wait();
  }
  auto duration = std::chrono::high_resolution_clock::now().time_since_epoch().count() - start;

  for (int c = 0; c < num_chunks; c++) {
    for (int i = 0; i < chunk_size; i++) {
      if (host_data[c][i] != (float)((c + KERNEL_ITERS * iter))) {
        std::cout << "Mismatch for chunk: " << c << " position: " << i
                  << " expected: " << c + KERNEL_ITERS * iter << " got: " << host_data[c][i]
                  << "\n";
        break;
      }
    }
  }

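  std::cout << "Compute Duration: " << duration / 1e+9 << " seconds\n";

  //# Free host and device allocations
  for (int c = 0; c < num_chunks; c++) {
    sycl::free(host_data[c], q);
    sycl::free(device_data[c], q);
  }
  return 0;
}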


#### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! ./q.sh run_usm_overlap_copy.sh

The timeline below, collected using ze_tracer, shows that copy-ins from upcoming iterations overlap with the execution of the compute kernel. It also shows multiple copy-ins executing in parallel on multiple copy engines.

ze_tracer plot showing copy-in overlap with execution of the compute kernel
<img src="assets/zetracer_overlap.jpeg">

## Avoid Copying Unnecessary Blocks of Data

The example below allocates memory on the device and adds up all the elements of an array, storing the result in the first element of the array. Because only that first element is needed on the host, you can limit the amount of data copied back by copying only the first element when using `memcpy` to copy memory from device to host.

```cpp
// Copies all N elements back to the host:
q.memcpy(data, device_data, sizeof(int) * N).wait();

// Copies only the first element, which holds the result:
q.memcpy(data, device_data, sizeof(int) * 1).wait();
```

In [None]:
%%writefile lab/usm_copy_partial.cpp
//==============================================================
// Copyright © 2020 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <sycl/sycl.hpp>
#include <chrono>
#include <cstdlib>
#include <iostream>

static constexpr size_t N = 102400000; // global size

int main() {
  auto start = std::chrono::high_resolution_clock::now().time_since_epoch().count();
    
  //# setup queue with default selector
  sycl::queue q;
  std::cout << "Device : " << q.get_device().get_info<sycl::info::device::name>() << "\n";

  //# initialize host data array using malloc
  int *data = static_cast<int *>(malloc(N * sizeof(int)));
  for (size_t i = 0; i < N; i++) data[i] = 1;

  //# USM device allocation
  auto device_data = sycl::malloc_device<int>(N, q);

  //# copy mem from host to device
  q.memcpy(device_data, data, sizeof(int) * N).wait();

  //# single_task kernel performing simple addition of all elements
  q.single_task([=](){
    int sum = 0;
    for(int i=0;i<N;i++){
        sum += device_data[i];
    }
    device_data[0] = sum;
  }).wait();

  //# copy only the first element (the sum) from device to host
  q.memcpy(data, device_data, sizeof(int) * 1).wait();

  std::cout << "Sum = " << data[0] << "\n";
    
  auto duration = std::chrono::high_resolution_clock::now().time_since_epoch().count() - start;
  std::cout << "Compute Duration: " << duration / 1e+9 << " seconds\n";

  sycl::free(device_data, q);
  free(data);
  return 0;
}


#### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! ./q.sh run_usm_copy_partial.sh

## Copying Memory from Host to USM Device Allocation

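When copying memory from host to a USM device allocation, the copy is typically faster if the host memory is allocated using USM `malloc_host` rather than regular `malloc`. USM host allocations are known to the SYCL runtime and are usually pinned, which can let the device read them directly instead of going through an intermediate staging copy.

The example below makes two host allocations, one using `malloc` and one using USM `malloc_host`, and allocates memory on the device using USM `malloc_device`. It then copies each host allocation to the device using `q.memcpy()` and reports the `memcpy` execution duration for each: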

In [None]:
%%writefile lab/usm_memcpy.cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <sycl/sycl.hpp>
#include <cstdlib>
#include <iostream>

int main() {
  sycl::queue q{sycl::property::queue::enable_profiling{}};
  std::cout << "Device : " << q.get_device().get_info<sycl::info::device::name>() << "\n";

  constexpr int N = 1024000000;
  //# host allocation using malloc
  auto host_data = static_cast<int *>(malloc(N * sizeof(int)));
  //# USM host allocation using malloc_host
  auto host_data_usm = sycl::malloc_host<int>(N, q);

  //# USM device allocation using malloc_device
  auto device_data_usm = sycl::malloc_device<int>(N, q);

  //# copy mem from host (malloc) to device
  auto e1 = q.memcpy(device_data_usm, host_data, sizeof(int) * N);
    
  //# copy mem from host (malloc_host) to device
  auto e2 = q.memcpy(device_data_usm, host_data_usm, sizeof(int) * N);

  q.wait();

  //# free allocations
  sycl::free(device_data_usm, q);
  sycl::free(host_data_usm, q);
  free(host_data);

  std::cout << "memcpy Time (malloc-to-malloc_device)     : " << (e1.template get_profiling_info<sycl::info::event_profiling::command_end>() - e1.template get_profiling_info<sycl::info::event_profiling::command_start>()) / 1e+9 << " seconds\n";

  std::cout << "memcpy Time (malloc_host-to-malloc_device): " << (e2.template get_profiling_info<sycl::info::event_profiling::command_end>() - e2.template get_profiling_info<sycl::info::event_profiling::command_start>()) / 1e+9 << " seconds\n";

  return 0;
}


#### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! ./q.sh run_usm_memcpy.sh

## Resources

- [Intel GPU Optimization Guide](https://www.intel.com/content/www/us/en/develop/documentation/oneapi-gpu-optimization-guide/top.html) - Up-to-date resources for Intel GPU Optimization
- [SYCL Specification](https://registry.khronos.org/SYCL/specs/sycl-2020/pdf/sycl-2020.pdf) - Latest Specification document for reference
- [SYCL Essentials Training](https://github.com/oneapi-src/oneAPI-samples/tree/master/DirectProgramming/C%2B%2BSYCL/Jupyter/oneapi-essentials-training) - Learn basics of C++ SYCL Programming