# SYCL Multi-GPU Programming

This module covers multi-GPU SYCL programming and optimization. It also covers the `ONEAPI_DEVICE_SELECTOR` environment variable, which filters the devices available for SYCL kernel offloading.

##### Sections
- [Check Available Devices](#Check-Available-Devices)
  - [sycl-ls](#sycl-ls)
  - [ONEAPI_DEVICE_SELECTOR Environment Variable](#ONEAPI_DEVICE_SELECTOR-Environment-Variable)
- [Multi-GPU Device Selection](#Multi-GPU-Device-Selection)
  - _Code:_ [Single GPU](#Single-GPU)
  - _Code:_ [Multiple GPUs](#Multiple-GPUs)
- [Multi-GPU Kernel Submission](#Multi-GPU-Kernel-Submission)
- [GPU to GPU Memory Copy](#GPU-to-GPU-Memory-Copy)
- [Split computation to multiple GPUs](#Split-computation-to-multiple-GPUs)
- [Optimizing Multi-GPU Offload](#Optimizing-Multi-GPU-Offload)
- [Tips for Multi-GPU Programming](#Tips-for-Multi-GPU-Programming)
- [Filter offload devices with ONEAPI_DEVICE_SELECTOR](#Filter-offload-devices-with-ONEAPI_DEVICE_SELECTOR)
  - [Filter by CPU](#Filter-by-CPU)
  - [Filter by GPU OpenCL Backend](#Filter-by-GPU-OpenCL-Backend)
  - [Filter by GPU Level Zero Backend](#Filter-by-GPU-Level-Zero-Backend)
  - [Filter number of GPUs available for offload](#Filter-number-of-GPUs-available-for-offload)


## Learning Objectives
* Use __sycl-ls__ command line tool to find all available offload devices.
* Use __ONEAPI_DEVICE_SELECTOR__ environment variable to enable or disable devices available for offload.
* Write SYCL code to find all GPU devices in a system and submit kernels to all concurrently.
* Understand the __memory copy latency__ differences between Host to GPU and GPU to GPU.
* __Optimize__ Multi-GPU offload SYCL code.

## Check Available Devices

You can use the `sycl-ls` command line tool to check which devices a SYCL kernel can be offloaded to, and the `ONEAPI_DEVICE_SELECTOR` environment variable to filter which of those devices are available for offload.

### sycl-ls

The `sycl-ls` command line tool lists all the SYCL-enabled devices.

Run the following command to check all devices available by default for SYCL kernel offload.

In [None]:
! sycl-ls

### ONEAPI_DEVICE_SELECTOR Environment Variable

This device selection environment variable limits the choice of devices available when a SYCL application runs. It is useful for limiting devices to a certain type (like GPUs or accelerators) or to a certain backend (like Level Zero or OpenCL).

With no environment variables set to say otherwise, all platforms and devices presently on the machine are available. The default choice will be one of these devices, usually preferring a Level Zero GPU device, if available. The `ONEAPI_DEVICE_SELECTOR` can be used to limit that choice of devices, and to expose GPU sub-devices or sub-sub-devices as individual devices.

`ONEAPI_DEVICE_SELECTOR=<backend>:<device>`


Try each of these settings, then run `sycl-ls` to see which devices remain visible:
```
export ONEAPI_DEVICE_SELECTOR=opencl:cpu
export ONEAPI_DEVICE_SELECTOR=opencl:gpu
export ONEAPI_DEVICE_SELECTOR=opencl:*
export ONEAPI_DEVICE_SELECTOR=level_zero:gpu
export ONEAPI_DEVICE_SELECTOR=*:gpu
export ONEAPI_DEVICE_SELECTOR=*:*
```

To reset and enable all devices:
```
unset ONEAPI_DEVICE_SELECTOR
```

Run the following three cells to list the devices visible under each setting, and then with the variable unset. You can also run these commands directly in a terminal:

In [None]:
! ONEAPI_DEVICE_SELECTOR=opencl:cpu sycl-ls

In [None]:
! ONEAPI_DEVICE_SELECTOR=opencl:gpu sycl-ls

In [None]:
! unset ONEAPI_DEVICE_SELECTOR; sycl-ls

## Multi-GPU Device Selection

To offload a SYCL kernel to a specific device, you can either limit the available devices using the `ONEAPI_DEVICE_SELECTOR` environment variable or select the device in SYCL code by passing a device selector to `sycl::queue`.

### Single GPU

To submit work to a single GPU, create a `sycl::queue` with the `sycl::gpu_selector_v` device selector, as shown below:

```cpp
sycl::queue q(sycl::gpu_selector_v);
```

The SYCL code below shows GPU device selection. Inspect the code; no modifications are necessary.

1. Inspect the code cell below and click run ▶ to save the code to a file.
2. Next, run ▶ the cell in the Build and Run section below the code to compile and execute the code.


In [None]:
%%writefile lab/single_gpu.cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  // Create a device queue with device selector
  sycl::queue q(sycl::gpu_selector_v);

  // Print the device name
  std::cout << "Device: " << q.get_device().get_info<sycl::info::device::name>() << "\n";

  return 0;
}

#### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! ./run_single_gpu.sh

### Multiple GPUs

To find multiple GPU devices in the system, construct a `sycl::platform` with `sycl::gpu_selector_v`, which picks the platform that contains the selected GPU, and then call its `get_devices()` method, which returns a vector of the devices found on that platform.

```cpp
auto gpus = sycl::platform(sycl::gpu_selector_v).get_devices();

sycl::queue q0(gpus[0]);
sycl::queue q1(gpus[1]);
```

Once we have found all the GPU devices, we create a `sycl::queue` for each GPU device and submit work to each of them.

The SYCL code below shows multi-GPU device selection. Inspect the code; no modifications are necessary.

1. Inspect the code cell below and click run ▶ to save the code to a file.
2. Next, run ▶ the cell in the Build and Run section below the code to compile and execute the code.

In [None]:
%%writefile lab/multi_gpu.cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  // get all GPU devices into a vector
  auto gpus = sycl::platform(sycl::gpu_selector_v).get_devices();

  // Print the device names
  for(const auto &gpu : gpus)
    std::cout << "Device: " << gpu.get_info<sycl::info::device::name>() << "\n";

  return 0;
}

#### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! ./run_multi_gpu.sh

## Multi-GPU Kernel Submission

In the previous section we learned how to find multiple GPUs and create a `sycl::queue` for each of them.

Next we will submit kernels for execution on different GPUs using these queues.
```cpp
    // submit kernels to 2 devices
    q0.parallel_for(N, [=](auto i) {
      a0[i] *= 2;
    }).wait();
    
    q1.parallel_for(N, [=](auto i) {
      a1[i] *= 3;
    }).wait();
```
Note that the code above submits each kernel to a GPU and waits for it to complete. Because `.wait()` is a blocking call on the host, the second kernel is not submitted until the first one finishes, so the 2 kernels do not execute concurrently on the 2 GPUs.

To get concurrent execution on GPUs, we have to separate the asynchronous calls and synchronization calls as shown below:

```cpp
    // submit kernels to 2 devices
    q0.parallel_for(N, [=](auto i) {
      a0[i] *= 2;
    });
    q1.parallel_for(N, [=](auto i) {
      a1[i] *= 3;
    });

    // wait for compute complete
    q0.wait();
    q1.wait();
    
```

The SYCL code below shows multi-GPU kernel submission: it submits 2 different kernels to 2 different GPUs. Inspect the code; no modifications are necessary.

1. Inspect the code cell below and click run ▶ to save the code to a file.
2. Next, run ▶ the cell in the Build and Run section below the code to compile and execute the code.

In [None]:
%%writefile lab/multi_gpu_submit.cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  // get all GPU devices into a vector
  auto gpus = sycl::platform(sycl::gpu_selector_v).get_devices();

  if(gpus.size() >= 2) {
    // Initialize array with values
    const int N = 256;
    float a[N], b[N];
    for (int i = 0; i < N; i++){
      a[i] = i;
    }
      
    // Create sycl::queue for each gpu
    sycl::queue q0(gpus[0]);
    std::cout << "GPU0: " << gpus[0].get_info<sycl::info::device::name>() << "\n";
    sycl::queue q1(gpus[1]);
    std::cout << "GPU1: " << gpus[1].get_info<sycl::info::device::name>() << "\n";

    // device mem alloc for each device
    auto a0 = sycl::malloc_device<float>(N, q0);
    auto a1 = sycl::malloc_device<float>(N, q1);

    // memcpy to device alloc
    q0.memcpy(a0, a, N * sizeof(float));
    q1.memcpy(a1, a, N * sizeof(float));

    // wait for copy to complete
    q0.wait();
    q1.wait();

    // submit kernels to 2 devices
    q0.parallel_for(N, [=](auto i) {
      a0[i] *= 2;
    });
    q1.parallel_for(N, [=](auto i) {
      a1[i] *= 3;
    });

    // wait for compute complete
    q0.wait();
    q1.wait();

    // copy back result to host
    q0.memcpy(a, a0, N * sizeof(float));
    q1.memcpy(b, a1, N * sizeof(float));

    // wait for copy to complete
    q0.wait();
    q1.wait();

    // print output
    for (int i = 0; i < N; i++) std::cout << a[i] << " ";
    std::cout << "\n";
    for (int i = 0; i < N; i++) std::cout << b[i] << " ";
    std::cout << "\n";

    sycl::free(a0, q0);
    sycl::free(a1, q1);
  } else {
      std::cout << "Multiple GPUs not available\n";
  }
  return 0;
}

#### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! ./run_multi_gpu_submit.sh

## GPU to GPU Memory Copy

The code below shows how to copy memory between 2 GPUs, which is especially useful when programming kernels for multi-GPU systems. Memory copy between 2 GPUs is typically faster than memory copy between Host and GPU.

Understanding the memory copy latency differences between Host to GPU, GPU-1 to GPU-2, and same-GPU copies is key to designing performant GPU kernel code.

```cpp
  // host to GPU0 copy
  q0.memcpy(dev0_mem, host_mem, N*sizeof(int));

  // GPU0 to GPU1 copy
  q0.memcpy(dev1_mem, dev0_mem, N*sizeof(int));

  // GPU0 to GPU0 copy (same device)
  q0.memcpy(dev0_mem2, dev0_mem, N*sizeof(int));
```

In [None]:
%%writefile lab/multi_gpu_memcpy.cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <sycl/sycl.hpp>
#include <iostream>

int main(){
  const int N = 1024000000;

  // get all GPU devices into a vector
  auto gpus = sycl::platform(sycl::gpu_selector_v).get_devices();

  if(gpus.size() > 1){
    sycl::queue q0(gpus[0], sycl::property::queue::enable_profiling());
    sycl::queue q1(gpus[1], sycl::property::queue::enable_profiling());
    std::cout << "GPU0: " << q0.get_device().get_info<sycl::info::device::name>() << "\n";
    std::cout << "GPU1: " << q1.get_device().get_info<sycl::info::device::name>() << "\n";

    auto host_mem = sycl::malloc_host<int>(N, q0);
    for (int i=0;i<N;i++) host_mem[i] = i;

    auto dev0_mem = sycl::malloc_device<int>(N, q0);
    auto dev0_mem2 = sycl::malloc_device<int>(N, q0);
    auto dev1_mem = sycl::malloc_device<int>(N, q1);

    // host to GPU0 copy
    auto event_h2d0 = q0.memcpy(dev0_mem, host_mem, N*sizeof(int));
    q0.wait();

    // GPU0 to GPU1 copy q0
    auto event_d2d_q0 = q0.memcpy(dev1_mem, dev0_mem, N*sizeof(int));
    q0.wait();

    // GPU0 to GPU0 copy
    auto event_d2d_same0 = q0.memcpy(dev0_mem2, dev0_mem, N*sizeof(int));
    q0.wait();

    std::cout << host_mem[0] << " ... " << host_mem[N-1] << " (1024M int, 3.8GiB Tx)\n";

    // free allocation
    sycl::free(host_mem, q0);
    sycl::free(dev0_mem, q0);
    sycl::free(dev0_mem2, q0);
    sycl::free(dev1_mem, q1);

    // Print kernel profile times for copy
    auto startK = event_h2d0.get_profiling_info<sycl::info::event_profiling::command_start>();
    auto endK = event_h2d0.get_profiling_info<sycl::info::event_profiling::command_end>();
    std::cout << "Host to GPU0 Copy [q0]: " << (endK - startK) / 1e+9 << " seconds\n";
    startK = event_d2d_q0.get_profiling_info<sycl::info::event_profiling::command_start>();
    endK = event_d2d_q0.get_profiling_info<sycl::info::event_profiling::command_end>();
    std::cout << "GPU0 to GPU1 Copy [q0]: " << (endK - startK) / 1e+9 << " seconds\n";
    startK = event_d2d_same0.get_profiling_info<sycl::info::event_profiling::command_start>();
    endK = event_d2d_same0.get_profiling_info<sycl::info::event_profiling::command_end>();
    std::cout << "GPU0 to GPU0 Copy [q0]: " << (endK - startK) / 1e+9 << " seconds\n";
  } else {
    std::cout << "Multiple GPUs not available\n";
  }
  return 0;
}


#### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! ./run_multi_gpu_memcpy.sh

## Split computation to multiple GPUs

The code below shows how a vector add computation is split across the available GPUs:
- The total workload size is divided by the number of GPUs found on the system (this example assumes the size divides evenly)
- A kernel is submitted to every GPU, and each GPU computes a fraction of the workload
- The results from all GPUs are aggregated on the host

In [None]:
%%writefile lab/multi_gpu_vadd_split.cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <sycl/sycl.hpp>
#include <cstdlib>
#include <iostream>
#include <vector>

void kernel_compute_vadd(sycl::queue &q, float *a, float *b, float *c, size_t n) {
  q.parallel_for(n, [=](auto i) {
    c[i] = a[i] + b[i];
  });
}

int main() {
  const int N = 1200;

  // Define 3 arrays
  float *a = static_cast<float *>(malloc(N * sizeof(float)));
  float *b = static_cast<float *>(malloc(N * sizeof(float)));
  float *c = static_cast<float *>(malloc(N * sizeof(float)));

  // Initialize vectors with values
  for (int i = 0; i < N; i++){
    a[i] = 1;
    b[i] = 2;
    c[i] = 0;
  }
    
  // get all GPU devices into a vector
  auto gpus = sycl::platform(sycl::gpu_selector_v).get_devices();
  int num_devices = gpus.size();

  // Create sycl::queue for each gpu
  // (start from an empty vector so that q[i] is the queue for gpus[i])
  std::vector<sycl::queue> q;
  for(int i = 0; i < num_devices; i++){
    std::cout << "Device: " << gpus[i].get_info<sycl::info::device::name>() << "\n";
    q.push_back(sycl::queue(gpus[i]));
  }

  // device mem alloc for vectors a,b,c for each device
  std::vector<float *> da(num_devices);
  std::vector<float *> db(num_devices);
  std::vector<float *> dc(num_devices);
  for (int i = 0; i < num_devices; i++) {
    da[i] = sycl::malloc_device<float>(N/num_devices, q[i]);
    db[i] = sycl::malloc_device<float>(N/num_devices, q[i]);
    dc[i] = sycl::malloc_device<float>(N/num_devices, q[i]);
  }

  // memcpy for vectors a and b to device alloc
  for (int i = 0; i < num_devices; i++) {
    q[i].memcpy(&da[i][0], &a[i*N/num_devices], N/num_devices * sizeof(float));
    q[i].memcpy(&db[i][0], &b[i*N/num_devices], N/num_devices * sizeof(float));
  }

  // wait for copy to complete
  for (int i = 0; i < num_devices; i++)
    q[i].wait();

  // submit vector-add kernels to all devices
  for (int i = 0; i < num_devices; i++)
    kernel_compute_vadd(q[i], da[i], db[i], dc[i], N/num_devices);

  // wait for compute complete
  for (int i = 0; i < num_devices; i++)
    q[i].wait();

  // copy back result to host
  for (int i = 0; i < num_devices; i++)
    q[i].memcpy(&c[i*N/num_devices], &dc[i][0], N/num_devices * sizeof(float));

  // wait for copy to complete
  for (int i = 0; i < num_devices; i++)
    q[i].wait();

  // print output
  for (int i = 0; i < N; i++) std::cout << c[i] << " ";
  std::cout << "\n";

  free(a);
  free(b);
  free(c);
  for (int i = 0; i < num_devices; i++) {
    sycl::free(da[i], q[i]);
    sycl::free(db[i], q[i]);
    sycl::free(dc[i], q[i]);
  }
  return 0;
}

#### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! ./run_multi_gpu_vadd_split.sh

## Optimizing Multi-GPU Offload

Let's look at an example of a matrix multiplication kernel submitted to multiple GPUs.

The SYCL code loops over the GPUs to perform operations like:
- creating memory allocation
- memory copy from host to device
- kernel submission
- memory copy from device to host

Apart from the memory allocation, these operations are submitted asynchronously, so work queued to different GPUs can run concurrently.

```cpp
  // device mem alloc for matrix a,b,c for each device
  for (int i = 0; i < num_devices; i++) {
    da[i] = sycl::malloc_device<float>(N * N, q[i]);
    db[i] = sycl::malloc_device<float>(N * N, q[i]);
    dc[i] = sycl::malloc_device<float>(N * N, q[i]);
  }

  // memcpy for matrix a and b to device alloc
  for (int i = 0; i < num_devices; i++) {
    q[i].memcpy(&da[i][0], &matrix_a[i][0], N * N * sizeof(float));
    q[i].memcpy(&db[i][0], &matrix_b[i][0], N * N * sizeof(float));
  }

  // wait for copy to complete
  for (int i = 0; i < num_devices; i++)
    q[i].wait();

  // submit matrix multiply kernels to all devices
  for (int i = 0; i < num_devices; i++)
    kernel_compute_mm(q[i], da[i], db[i], dc[i], N, B);
```
The full SYCL code for performing matrix multiplication on multiple GPUs is below:

In [None]:
%%writefile lab/multi_gpu_mm.cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <sycl/sycl.hpp>
#include <algorithm>
#include <chrono>
#include <execution>
#include <iostream>
#include <vector>

static constexpr size_t N = 5120; // global size
static constexpr size_t B = 32;   // WG size

void kernel_compute_mm(sycl::queue &q, float *a, float *b, float *c, size_t n, size_t wg) {
  q.parallel_for(
      sycl::nd_range<2>(sycl::range<2>{n, n}, sycl::range<2>{wg, wg}),
      [=](sycl::nd_item<2> item) {
        const int i = item.get_global_id(0);
        const int j = item.get_global_id(1);
        float temp = 0.0f;
        for (size_t k = 0; k < n; k++) {
          temp += a[i * n + k] * b[k * n + j];
        }
        c[i * n + j] = temp;
      });
}

int main() {
  auto start = std::chrono::high_resolution_clock::now().time_since_epoch().count();

  // find all GPUs
  auto gpus = sycl::platform(sycl::gpu_selector_v).get_devices();
  int num_devices = gpus.size();

  // Define matrices
  float *matrix_a[num_devices];
  float *matrix_b[num_devices];
  float *matrix_c[num_devices];

  float v1 = 2.f;
  float v2 = 3.f;
  for (int n = 0; n < num_devices; n++) {
    matrix_a[n] = static_cast<float *>(malloc(N * N * sizeof(float)));
    matrix_b[n] = static_cast<float *>(malloc(N * N * sizeof(float)));
    matrix_c[n] = static_cast<float *>(malloc(N * N * sizeof(float)));

    // Initialize matrices with values
    for (int i = 0; i < N; i++)
      for (int j = 0; j < N; j++) {
        matrix_a[n][i * N + j] = v1++;
        matrix_b[n][i * N + j] = v2++;
        matrix_c[n][i * N + j] = 0.f;
      }
  }

  float *da[num_devices];
  float *db[num_devices];
  float *dc[num_devices];

  std::vector<sycl::queue> q(num_devices);
  std::vector<int> id;

  // create queues for each device
  std::cout << "\nSubmitting Compute Kernel to GPUs:\n";
  for (int i = 0; i < num_devices; i++) {
    q[i] = sycl::queue(gpus[i]);
    id.push_back(i);
    std::cout << "GPU" << i << ": " << q[i].get_device().get_info<sycl::info::device::name>() << "\n";
  }

  // device mem alloc for matrix a,b,c for each device
  for (int i = 0; i < num_devices; i++) {
    da[i] = sycl::malloc_device<float>(N * N, q[i]);
    db[i] = sycl::malloc_device<float>(N * N, q[i]);
    dc[i] = sycl::malloc_device<float>(N * N, q[i]);
  }

  // memcpy for matrix a and b to device alloc
  for (int i = 0; i < num_devices; i++) {
    q[i].memcpy(&da[i][0], &matrix_a[i][0], N * N * sizeof(float));
    q[i].memcpy(&db[i][0], &matrix_b[i][0], N * N * sizeof(float));
  }

  // wait for copy to complete
  for (int i = 0; i < num_devices; i++)
    q[i].wait();

  // submit matrix multiply kernels to all devices
  /*
  for (int i = 0; i < num_devices; i++)
    kernel_compute_mm(q[i], da[i], db[i], dc[i], N, B);
  */
  std::for_each(std::execution::par, id.begin(), id.end(), [&q, &da, &db, &dc](auto i){
    kernel_compute_mm(q[i], da[i], db[i], dc[i], N, B);
  });

  // wait for compute complete
  for (int i = 0; i < num_devices; i++)
    q[i].wait();

  // copy back result to host
  for (int i = 0; i < num_devices; i++)
    q[i].memcpy(&matrix_c[i][0], &dc[i][0], N * N * sizeof(float));

  // wait for copy to complete
  for (int i = 0; i < num_devices; i++)
    q[i].wait();

  // print first element of result matrix
  std::cout << "\nMatrix Multiplication Complete\n\n";
  for (int i = 0; i < num_devices; i++)
    std::cout << "GPU" << i << ": matrix_c[0][0]=" << matrix_c[i][0] << "\n";

  for (int i = 0; i < num_devices; i++) {
    free(matrix_a[i]);
    free(matrix_b[i]);
    free(matrix_c[i]);
    sycl::free(da[i], q[i]);
    sycl::free(db[i], q[i]);
    sycl::free(dc[i], q[i]);
  }

  auto duration = std::chrono::high_resolution_clock::now().time_since_epoch().count() - start;
  std::cout << "\nCompute Duration: " << duration / 1e+9 << " seconds\n";
  return 0;
}


#### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! ./run_multi_gpu_mm.sh

#### VTune Analysis

Both CPU and GPU activity can be profiled with VTune using the command `vtune --collect gpu-hotspots`. The profiling data shows that kernel submission to multiple GPUs incurs a `zeModuleCreate` overhead on the host. `zeModuleCreate` is the Level Zero backend API that compiles kernel code for execution on the GPU. Because a for-loop submits to the GPUs one after another, these calls do not overlap and instead run back to back on the host. The loop and the corresponding VTune capture are shown below:
```cpp
for (int i = 0; i < num_devices; i++)
  kernel_compute_mm(q[i], da[i], db[i], dc[i], N, B);
```
<img src="assets/vtune_mm.png">

Kernel submission to multiple GPUs can instead use `std::for_each` with the `std::execution::par` policy, which runs the `zeModuleCreate` calls on separate CPU threads. The calls then overlap on the CPU, which improves performance:
```cpp
std::for_each(std::execution::par, id.begin(), id.end(), [&q, &da, &db, &dc](auto i){
  kernel_compute_mm(q[i], da[i], db[i], dc[i], N, B);
});
```
<img src="assets/vtune_mm_tbb.png">

Execute the script below to collect VTune data for the code above. To compare the two methods of submitting kernels to multiple GPUs, modify the SYCL code, capture VTune data for each method, and compare the results.


In [None]:
! ./vtune_multi_gpu_mm.sh

## Tips for Multi-GPU Programming

- Check whether offloading to multiple GPUs is beneficial: calculate the total number of hardware threads that can execute concurrently on one GPU and compare it to the total number of work-items in the kernel workload.
- Host to GPU memory copy is more expensive than __GPU to GPU memory copy__. Reduce the memory copy latency by aggregating computation on GPUs before copying back to Host when applicable.
- Use __CPU multi-threading__ when launching multiple kernels to multiple GPUs to overlap Kernel Module Creation.
- Understand which SYCL calls are __asynchronous__ versus __blocking calls__ to get concurrency on multiple GPUs.
- Use __VTune Profiler__ to collect `gpu-hotspots` and check for concurrency in thread dispatch.

## Filter offload devices with ONEAPI_DEVICE_SELECTOR

The `ONEAPI_DEVICE_SELECTOR` environment variable can be used to filter the devices available for SYCL kernel offload.

If the SYCL `queue` uses `sycl::default_selector_v`, the offload device can be controlled with the `ONEAPI_DEVICE_SELECTOR` environment variable instead of updating the SYCL code and recompiling. This is useful for quick performance analysis on different hardware architectures or hardware from different vendors.

The SYCL code should be __performance portable__ to get accurate, consistent results on different hardware architectures or hardware from different vendors. For example, different hardware architectures have a different Max Work-Group size or Max Local Memory size.

In [None]:
! ./build_offload_mm.sh

### Filter by CPU
Check the script below which sets `ONEAPI_DEVICE_SELECTOR` value to `opencl:cpu` for offload and run the script to execute.

In [None]:
%%writefile run_offload_mm_filter.sh
#!/bin/bash
source /opt/intel/oneapi/setvars.sh > /dev/null 2>&1

export ONEAPI_DEVICE_SELECTOR=opencl:cpu

if [ $? -eq 0 ]; then ./offload_mm.out; fi

In [None]:
! ./run_offload_mm_filter.sh

### Filter by GPU OpenCL Backend
Check the script below which sets `ONEAPI_DEVICE_SELECTOR` value to `opencl:gpu` for offload and run the script to execute.

In [None]:
%%writefile run_offload_mm_filter.sh
#!/bin/bash
source /opt/intel/oneapi/setvars.sh > /dev/null 2>&1

export ONEAPI_DEVICE_SELECTOR=opencl:gpu

if [ $? -eq 0 ]; then ./offload_mm.out; fi

In [None]:
! ./run_offload_mm_filter.sh

### Filter by GPU Level Zero Backend
Check the script below which sets `ONEAPI_DEVICE_SELECTOR` value to `level_zero:gpu` for offload and run the script to execute.

In [None]:
%%writefile run_offload_mm_filter.sh
#!/bin/bash
source /opt/intel/oneapi/setvars.sh > /dev/null 2>&1

export ONEAPI_DEVICE_SELECTOR=level_zero:gpu

if [ $? -eq 0 ]; then ./offload_mm.out; fi

In [None]:
! ./run_offload_mm_filter.sh

### Filter number of GPUs available for offload

The examples below show how to limit the number of GPUs available for offload.

Run the script below to compile the VectorAdd SYCL code.

In [None]:
! ./build_offload_vadd.sh

#### VectorAdd Computation split on 2 GPUs:
Check the script below which sets `ONEAPI_DEVICE_SELECTOR` value to limit GPUs available for offload and run the script to execute.

In [None]:
%%writefile run_multi_gpu_vadd_filter.sh
#!/bin/bash
source /opt/intel/oneapi/setvars.sh > /dev/null 2>&1

export ONEAPI_DEVICE_SELECTOR="level_zero:0;level_zero:1"

if [ $? -eq 0 ]; then ./offload_vadd.out; fi

In [None]:
! ./run_multi_gpu_vadd_filter.sh

#### VectorAdd Computation split on 3 GPUs:
Check the script below which sets `ONEAPI_DEVICE_SELECTOR` value to limit GPUs available for offload and run the script to execute.

In [None]:
%%writefile run_multi_gpu_vadd_filter.sh
#!/bin/bash
source /opt/intel/oneapi/setvars.sh > /dev/null 2>&1

export ONEAPI_DEVICE_SELECTOR="level_zero:0;level_zero:1;level_zero:2"

if [ $? -eq 0 ]; then ./offload_vadd.out; fi

In [None]:
! ./run_multi_gpu_vadd_filter.sh

## Summary

In this module you learned:
* How to use `sycl-ls` and `ONEAPI_DEVICE_SELECTOR` environment variable to filter offload devices
* How to select multiple GPUs and offload a kernel
* How to copy memory from GPU-1 to GPU-2
* How to optimize Multi-GPU SYCL code for performance
