# Implicit and Explicit Scaling for Multi-Stack Architecture

In this section we cover how Multi-Stack Architecture can be programmed using Implicit Scaling and Explicit Scaling.

- [Multi-Stack Architecture](#Multi-Stack-Architecture)
- [Implicit Scaling](#Implicit-Scaling)
  - [Performance Expectations](#Performance-Expectations)
  - [Work Scheduling and Memory Distribution](#Work-Scheduling-and-Memory-Distribution)
  - [Programming Principles](#Programming-Principles)
- [Explicit Scaling](#Explicit-Scaling)
  - [Creating Sub-Devices](#Creating-Sub-Devices)
  - [Context](#Context)
  - [Explicit Scaling Example](#Explicit-Scaling-Example)

## Multi-Stack Architecture

Intel Data Center GPU Max Series devices use a Multi-Stack Architecture with 1 or 2 Stacks. Each Stack can function as an independent GPU and execute workloads on its own.

For general applications, the multi-stack GPU is represented as a single GPU device. Applications do not need to know that the GPU is internally built from smaller Stacks, which simplifies the programming model and allows existing applications to run without any code changes. The Intel GPU driver and the SYCL and OpenMP language runtimes work together to automatically dispatch workloads across the Stacks.

The figures below show schematics of a single-Stack GPU and a 2-Stack GPU:

<img src="assets/1-tile-architecture.png">

<img src="assets/2-tile-architecture.png">

Stacks are connected by a fast interconnect that allows efficient communication between Stacks, including access to each other's High Bandwidth Memory (HBM).

Any Stack is capable of reading and writing to any HBM memory. For example, Stack 0 may read the local HBM memory of Stack 1. In this case, the interconnect between Stack 0 and Stack 1 is used for communication.

Stack 0 is connected to the PCIe bus, but any Stack can read and write system memory. The same inter-Stack interconnects are used to transfer the data. Hence, Stack 0 has the shortest path to system memory among all the Stacks.

Reading and writing system memory does not require CPU involvement: the GPU can perform DMA (Direct Memory Access) over PCIe to system memory.

Because access to a Stack's local HBM does not involve inter-Stack interconnect, it is more efficient than cross-Stack HBM access, with lower latency and lower inter-Stack bandwidth consumption. Advanced developers can take advantage of memory locality to achieve higher performance.

To make good use of a multi-stack GPU, two application programming modes are available:

#### Implicit scaling mode
The driver and language runtimes are responsible for work distribution and multi-stack memory placement. The application sees the GPU as one monolithic device and does not need to be aware of the multi-stack architecture.

#### Explicit scaling mode
The user is responsible for work distribution and multi-stack memory placement. The driver and language runtimes provide tools that expose each Stack as a separate sub-device that can be programmed independently of the others.

## Implicit Scaling

In Implicit Scaling mode, the driver and language runtimes are responsible for work distribution and multi-stack memory placement. The application sees the GPU as one monolithic device and does not need to be aware of the multi-stack architecture.

Implicit scaling can be enabled by exporting this environment variable:

```
export EnableImplicitScaling=1
```

This environment variable changes the meaning of a device to root-device. No change in application code is required. A kernel submitted to the device will utilize all stacks. Similarly, a memory allocation on the device will span all stacks.

A root-device is built from multiple sub-devices, also known as stacks. These stacks form a shared memory space, which allows a root-device to be treated as a monolithic device without explicit communication between stacks. This section covers multi-stack programming principles using implicit scaling. When using implicit scaling, the root-device driver is responsible for distributing work to all stacks when application code launches a kernel.

### Performance Expectations

Implicit scaling exposes the resources of all stacks to a single kernel launch. For a root-device with 2 stacks, a kernel has access to 2x peak compute, 2x memory bandwidth and 2x memory capacity. In the ideal case, workload performance increases by 2x. In addition, cache size and cache bandwidth also increase by 2x, which can lead to better-than-linear scaling if the workload fits in the increased cache capacity.

Each stack is equivalent to a NUMA domain, so memory access patterns and memory allocation are crucial to achieving optimal implicit scaling performance. Workloads with a concept of locality are expected to work best with this programming model because cross-stack memory accesses are naturally minimized. Note that compute-bound kernels are not impacted by NUMA domains and are therefore expected to scale easily to multiple stacks with implicit scaling.

MPI applications are more efficient with implicit scaling compared to an explicit scaling approach. A single rank can utilize the entire root-device which eliminates explicit synchronization and communication between stacks. Implicit scaling automatically overlaps local memory accesses and cross-stack memory accesses in a single kernel launch.

Implicit scaling improves kernel execution time only. Serial bottlenecks will not speed up. Applications will observe no speed-up with implicit scaling if a large serial bottleneck is present. Common serial bottlenecks are:

- high CPU usage
- kernel launch latency
- PCIe transfers

These will become more pronounced as kernel execution time decreases. Note that only stack-0 has a PCIe connection to the host.
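
The effect of serial bottlenecks can be estimated with Amdahl's law. The sketch below is only a back-of-the-envelope model: `estimated_app_speedup` and its parameters are hypothetical names, and it assumes that kernel time scales ideally with the number of stacks while everything else stays constant.

```cpp
#include <cassert>

// Estimated whole-application speed-up when only kernel time scales.
// serial_fraction: share of the single-stack runtime spent outside kernels
//                  (CPU work, kernel launch latency, PCIe transfers), in [0, 1].
// stacks:          number of stacks; kernel time is assumed to shrink by
//                  exactly this factor, which is an optimistic assumption.
double estimated_app_speedup(double serial_fraction, int stacks) {
  assert(serial_fraction >= 0.0 && serial_fraction <= 1.0);
  assert(stacks >= 1);
  const double kernel_fraction = 1.0 - serial_fraction;
  return 1.0 / (serial_fraction + kernel_fraction / stacks);
}
```

Under these assumptions, an application that spends 10% of its runtime in serial parts gets roughly 1.8x from two stacks, and one that spends 50% gets roughly 1.3x.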


### Work Scheduling and Memory Distribution

#### Memory Coloring
Any SYCL shared or device allocation is colored across all stacks, meaning that the allocation is divided into number-of-stacks chunks that are distributed round-robin between stacks. Consider this root-device allocation:
```cpp
int *a = sycl::malloc_device<int>(N, q);
```
For a 2-stack root-device, the first half (elements `a[0]` to `a[N/2-1]`) is physically allocated on stack-0. The remaining half (elements `a[N/2]` to `a[N-1]`) is located on stack-1. In the future, we will introduce memory allocation APIs that allow user-defined memory coloring.

<img src="assets/2-tile-scaling.png">

__Note:__
- Memory coloring described above is applied at page size granularity.
  - An allocation containing three pages has two pages resident on stack-0.
  - Allocations smaller than or equal to the page size are resident on stack-0 only.
- Using a memory pool that is based on a single allocation will break the memory coloring logic. It is recommended that applications create one allocation per object so that the object's data is distributed across all stacks.
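
The coloring rules above can be modeled with a small helper that maps a byte offset to a stack. This is a sketch of the described behavior, not driver code: `stack_of_offset` is a hypothetical name and the 64 KiB page size in `kPageBytes` is an assumption chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Assumed page size for illustration; the real granularity is
// determined by the driver and hardware.
constexpr std::size_t kPageBytes = 64 * 1024;

// Returns the stack that would hold the byte at `offset` inside an
// allocation of `alloc_bytes`, assuming the allocation is cut into
// `stacks` contiguous chunks of whole pages (chunk size rounded up).
int stack_of_offset(std::size_t offset, std::size_t alloc_bytes, int stacks) {
  assert(stacks >= 1 && alloc_bytes > 0 && offset < alloc_bytes);
  const std::size_t pages = (alloc_bytes + kPageBytes - 1) / kPageBytes;
  const std::size_t pages_per_chunk = (pages + stacks - 1) / stacks;
  return static_cast<int>((offset / kPageBytes) / pages_per_chunk);
}
```

With two stacks, a three-page allocation places pages 0 and 1 on stack-0 and page 2 on stack-1, and any allocation that fits in one page stays on stack-0, matching the notes above.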


#### Static Partitioning
Scheduling of work-groups to stacks is deterministic and referred to as static partitioning. The partitioning follows a simple rule: the slowest moving dimension is divided into number-of-stacks chunks that are distributed round-robin between stacks.

Let's look at a 1-dimensional kernel launch on a root-device:
```cpp
q.parallel_for(N, [=](auto i) {
    //
});
```
Since there is only a single dimension, it is automatically the slowest moving dimension and is partitioned between stacks by the driver. For a 2-stack root-device, iterations 0 to N/2-1 are scheduled to stack-0. The remaining iterations N/2 to N-1 are executed on stack-1.
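
The 1-dimensional case can be modeled as follows. This is a sketch under stated assumptions: `stack_of_iteration` is a hypothetical helper, and `wg_size` stands for the work-group size, which the runtime chooses for a range-based `parallel_for` like the one above. Work-groups, not individual iterations, are the unit that is assigned to a stack.

```cpp
#include <cassert>
#include <cstddef>

// Returns the stack that would execute iteration `i` of an `n`-iteration
// 1D kernel, assuming work-groups of `wg_size` iterations are split into
// `stacks` contiguous chunks (chunk size rounded up).
int stack_of_iteration(std::size_t i, std::size_t n, std::size_t wg_size,
                       int stacks) {
  assert(stacks >= 1 && wg_size > 0 && i < n);
  const std::size_t groups = (n + wg_size - 1) / wg_size;
  const std::size_t groups_per_stack = (groups + stacks - 1) / stacks;
  return static_cast<int>((i / wg_size) / groups_per_stack);
}
```

For `n = 1024`, `wg_size = 16` and two stacks, iterations 0 to 511 map to stack-0 and iterations 512 to 1023 map to stack-1. A kernel that accesses `a[i]` in iteration `i` therefore tends to touch the half of a colored allocation that lives on the same stack.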

Let's look at a 3-dimensional kernel launch on a root-device:

```cpp
range<3> global{nz, ny, nx};
range<3> local{1, 1, 16};

cgh.parallel_for(nd_range<3>(global, local), [=](nd_item<3> item) {
    //
});
```

The slowest moving dimension is z, so it is partitioned between stacks: for a 2-stack root-device, all iterations from z=0 to z=nz/2-1 are executed on stack-0. The remaining iterations with z=nz/2 to z=nz-1 are scheduled to stack-1.

If the slowest moving dimension cannot be divided evenly between stacks and this creates a remainder imbalance larger than 5%, the driver will partition the next dimension if that leads to less load imbalance. This only affects kernels with odd dimensions of 19 or smaller (for two stacks, a remainder of one work-group out of 19 is 5.3%, while one out of 21 is 4.8%). Examples for different kernel launches are shown in the table below (assuming local range {1,1,16}):

Work group partition to stacks:

|nz|ny|nx|Partitioned Dimension
|---|---|---|---
|512|512|512|z
|21|512|512|z
|19|512|512|y
|18|512|512|z
|19|19|512|x

With a multi-dimensional local range in SYCL, the partitioned dimension can change. For example, for global range {38,512,512} with local range {2,1,8} the driver would partition the y-dimension, while for local range {1,1,16} the driver would partition the z-dimension. OpenMP can only have a 1-dimensional local range, which is created from the innermost loop and thus does not impact the static partitioning heuristics. OpenMP kernels created with a collapse level larger than 3 correspond to a 1-dimensional kernel with all for loops linearized. The linearized loop will be partitioned following the 1D kernel launch heuristics.
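
The table and the local-range example can be reproduced with a simple model of the heuristic. This is not the driver's actual algorithm, only a sketch consistent with the cases listed here: `remainder_imbalance` and `partitioned_dimension` are hypothetical names, imbalance is assumed to be the leftover work-groups divided by the total work-groups in a dimension, and dimensions are tried from slowest to fastest.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Fraction of work-groups left over when `groups` is split across `stacks`.
double remainder_imbalance(std::size_t groups, int stacks) {
  assert(groups > 0 && stacks >= 1);
  return static_cast<double>(groups % stacks) / static_cast<double>(groups);
}

// Returns the index of the dimension to partition: 0 is the slowest moving
// dimension (z), 2 is the fastest (x). Ranges are given slowest first.
int partitioned_dimension(const std::array<std::size_t, 3>& global,
                          const std::array<std::size_t, 3>& local,
                          int stacks = 2) {
  constexpr double kThreshold = 0.05;  // 5% limit from the text
  int best = 0;
  double best_imbalance = 2.0;  // larger than any possible imbalance
  for (int d = 0; d < 3; ++d) {
    assert(local[d] > 0 && global[d] % local[d] == 0);
    const double imbalance =
        remainder_imbalance(global[d] / local[d], stacks);
    if (imbalance <= kThreshold) return d;  // first acceptable dimension
    if (imbalance < best_imbalance) {
      best = d;
      best_imbalance = imbalance;
    }
  }
  return best;  // nothing within the limit: take the least imbalanced
}
```

For example, `partitioned_dimension({19, 512, 512}, {1, 1, 16})` yields 1 (y), and `partitioned_dimension({38, 512, 512}, {2, 1, 8})` also yields 1 because the z-dimension then contains only 19 work-groups.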

__Note:__
- Static partitioning happens at work-group granularity.
  - This implies that all work-items in a work-group are scheduled to same stack.
- A kernel with a single work-group is resident on stack-0 only.



### Programming Principles
To achieve good performance with implicit scaling, cross-stack memory accesses must be minimized, but it is not necessary to eliminate them entirely. A certain amount of cross-stack traffic can be handled by the stack-to-stack interconnect if it is performed concurrently with local memory accesses. For memory-bandwidth-bound workloads, the acceptable amount of cross-stack accesses is determined by the ratio of local memory bandwidth to cross-stack bandwidth (see Cross-stack Traffic).

The following principles should be embraced by workloads that use implicit scaling:

- A kernel must have enough work-items to utilize both stacks.
  - The minimal number of work-items needed to utilize both stacks is: `<number of VEs> * <hardware-threads per VE> * <SIMD width>`.
  - A 2-stack GPU with 1024 VEs and SIMD32 requires at least 262,144 work-items.
- Device time must dominate runtime to observe whole-application scaling.
- Minimize cross-stack memory accesses by exploiting locality in the algorithm.
- The slowest moving dimension should be large to avoid stack load imbalance.
- Cross-stack memory accesses and local memory accesses should be interleaved.
- Avoid stride-1 memory access in the slowest moving dimension for 2D and 3D kernel launches.
- If the memory access pattern changes dynamically over time, perform a sorting step every nth iteration to minimize cross-stack memory accesses.
- Don't use a memory pool based on a single allocation.
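
The work-item threshold from the first principle is a plain product and can be checked quickly. In the sketch below, `min_work_items` is a hypothetical helper; the value of 8 hardware threads per VE is not stated in the text and is inferred from its figures (262,144 / (1024 * 32) = 8).

```cpp
#include <cassert>
#include <cstdint>

// Minimal number of work-items needed to occupy every hardware thread
// on every VE once: VEs * hardware threads per VE * SIMD width.
std::uint64_t min_work_items(std::uint64_t vector_engines,
                             std::uint64_t hw_threads_per_ve,
                             std::uint64_t simd_width) {
  assert(vector_engines > 0 && hw_threads_per_ve > 0 && simd_width > 0);
  return vector_engines * hw_threads_per_ve * simd_width;
}
```

Here `min_work_items(1024, 8, 32)` gives 262,144, matching the 2-stack SIMD32 figure above. With SIMD16 the same hardware would need half as many work-items.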
 
Many applications naturally have a concept of locality. These applications are expected to be a good fit for implicit scaling due to low cross-stack traffic. To illustrate this concept, let's use a stencil kernel as an example. A stencil operates on a grid that can be divided into blocks, where the majority of stencil computations within a block use stack-local data. Only stencil operations at the border of a block require data from another block, i.e. from another stack. The share of these cross-stack/cross-border accesses is kept small by a low halo-to-local-volume ratio. This concept is illustrated below:

<img src="assets/2-tile-stencil.png">
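
The halo-to-local-volume ratio can be estimated for a 2-stack split along the slowest moving dimension. This is a simplified sketch: `cross_stack_fraction` is a hypothetical helper, the grid is assumed to be split into two equal blocks of `nz / 2` planes, and `radius` is the assumed stencil reach in that dimension.

```cpp
#include <cassert>
#include <cstddef>

// Fraction of grid points in a block whose stencil reaches into the
// block owned by the other stack. Only the `radius` planes next to the
// single interface between the two blocks are affected.
double cross_stack_fraction(std::size_t nz, std::size_t radius) {
  const std::size_t planes_per_stack = nz / 2;
  assert(planes_per_stack > 0 && radius <= planes_per_stack);
  return static_cast<double>(radius) / static_cast<double>(planes_per_stack);
}
```

For `nz = 512` and a radius of 1, fewer than 0.4% of the points in each block read data from the other stack, whereas for `nz = 16` and a radius of 4 the fraction rises to 50%. This is why a large slowest moving dimension helps.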


## Explicit Scaling

In Explicit Scaling mode, the user is responsible for work distribution and multi-stack memory placement. The driver and language runtimes provide tools that expose each stack as a separate sub-device that can be programmed independently of the others.

### Creating Sub-Devices

In this section we will learn how to create sub-devices in SYCL that represent each stack in a multi-stack GPU device.

#### Root-device
Represents a multi-stack GPU device, containing multiple stacks.

#### Sub-Device
Represents a stack in a multi-stack GPU device. The root-device in such cases can be partitioned into sub-devices, with each sub-device corresponding to a physical stack.

```cpp
std::vector<sycl::device> SubDevices = RootDevice.create_sub_devices<
      sycl::info::partition_property::partition_by_affinity_domain>(
      sycl::info::partition_affinity_domain::numa);
```

#### Query for individual stacks in Multi-stack device
The example below shows how to query for individual stacks (sub-devices) in a multi-stack device.


In [None]:
%%writefile lab/sub_device.cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>

int main(){
  sycl::queue q;
  sycl::device RootDevice =  q.get_device();
  std::cout << "Device: " << RootDevice.get_info<sycl::info::device::name>() << "\n";
  std::cout << "-EUs  : " << RootDevice.get_info<sycl::info::device::max_compute_units>() << "\n\n";

  //# Check if GPU can be partitioned (stacks/Stack)
  auto partitions = RootDevice.get_info<sycl::info::device::partition_max_sub_devices>();
  if(partitions > 0){
    std::cout << "-partition_max_sub_devices: " << partitions << "\n\n";
    std::vector<sycl::device> SubDevices = RootDevice.create_sub_devices<
                  sycl::info::partition_property::partition_by_affinity_domain>(
                                                  sycl::info::partition_affinity_domain::numa);
    for (auto &SubDevice : SubDevices) {
      std::cout << "Sub-Device: " << SubDevice.get_info<sycl::info::device::name>() << "\n";
      std::cout << "-EUs      : " << SubDevice.get_info<sycl::info::device::max_compute_units>() << "\n";
    }  
  }
}


#### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! ./q.sh run_sub_device.sh

### Context
Contexts are used for resource isolation and sharing. A SYCL context may consist of one or multiple devices. Both root-devices and sub-devices can be within a single context, but they must all belong to the same SYCL platform. A SYCL program created against a context with multiple devices will be built for each of the root-devices in the context. For a context that consists of multiple sub-devices of the same root-device, only a single build (for that root-device) is needed.

#### Unified shared memory
Memory allocated against a root-device is accessible by all of its sub-devices (stacks). So if you are operating on a context with multiple sub-devices of the same root-device, you can use `malloc_device` on that root-device instead of the slower `malloc_host`. Remember that with `malloc_device` you need an explicit copy out to the host if the data must be visible there.

#### Buffer
SYCL buffers are also created against a context and are mapped to the Level-Zero USM allocations discussed above. The current mapping is as follows:

For an integrated device, the allocations are made on the host, and are accessible by the host and the device without any copying.

Memory buffers for a context with sub-devices of the same root-device (possibly including the root-device itself) are allocated on that root-device. Thus they are readily accessible by all the devices in such a context. Synchronization with the host is performed by the SYCL runtime, with map/unmap doing implicit copies when necessary.

Memory buffers for a context containing devices from different root-devices are allocated on the host (and are thus accessible to all devices).
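
The three buffer placement rules can be summarized as a small decision function. This is only a restatement of the mapping described above, not runtime code: `BufferPlacement`, `buffer_placement` and both parameters are hypothetical names introduced for illustration.

```cpp
#include <cassert>

// Where the backing allocation of a SYCL buffer ends up.
enum class BufferPlacement { Host, RootDevice };

// is_integrated:           the context targets an integrated GPU.
// root_devices_in_context: number of distinct root-devices that the
//                          devices in the context belong to.
BufferPlacement buffer_placement(bool is_integrated,
                                 int root_devices_in_context) {
  assert(root_devices_in_context >= 1);
  if (is_integrated) return BufferPlacement::Host;
  // Sub-devices of one root-device (optionally with the root-device itself).
  if (root_devices_in_context == 1) return BufferPlacement::RootDevice;
  // Devices from different root-devices share data through host memory.
  return BufferPlacement::Host;
}
```

A context built from the two stacks of one discrete GPU therefore maps to `BufferPlacement::RootDevice`, while a context spanning two cards falls back to `BufferPlacement::Host`.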

#### Queue
A SYCL queue is always attached to a single device in a possibly multi-device context. Here are some typical scenarios, ordered from most performant to least performant:

##### Context associated with single sub-device
In this scheme, a context contains a single sub-device and the queue is attached to that sub-device (stack). Execution and visibility are limited to that single sub-device, and this scheme is expected to offer the best performance per stack. See a code example:
```cpp
  std::vector<sycl::device> SubDevices = ...;
  for (auto &D : SubDevices) {
    // Each queue is in its own context, no data sharing across them.
    auto Q = sycl::queue(D);
    Q.submit([&](sycl::handler &cgh) { ... });
  }
```

##### Context associated with multiple sub-devices
In this scheme, a context contains multiple sub-devices (multiple stacks) of the same root-device, and queues are attached to the sub-devices, effectively implementing "explicit scaling". For better performance, the root-device should not be passed to such a context. See a code example below:
```cpp
  std::vector<sycl::device> SubDevices = ...;
  auto C = sycl::context(SubDevices);
  for (auto &D : SubDevices) {
    // All queues share the same context, data can be shared across
    // queues.
    auto Q = sycl::queue(C, D);
    Q.submit([&](sycl::handler &cgh) { ... });
  }
```

##### Context associated with root device
In this scheme, a context contains a single root-device and the queue is attached to that root-device. Work is automatically distributed across all sub-devices/stacks via "implicit scaling" by the GPU driver. This is the simplest way to enable multi-stack hardware, but it does not offer the possibility to target specific stacks. See a code example below:
```cpp
  // The queue is attached to the root-device; the driver distributes work
  // to the sub-devices, if any.
  auto D = sycl::device(sycl::gpu_selector_v);
  auto Q = sycl::queue(D);
  Q.submit([&](sycl::handler &cgh) { ... });
```

##### Context associated with multiple root devices
In this scheme, a context contains multiple root-devices (multi-card). This is the least restrictive scheme: queues are attached to different root-devices, which offers the most sharing possibilities at the cost of slow access through host memory or the need for explicit copies. See a code example:
```cpp
  auto P = sycl::platform(sycl::gpu_selector_v);
  auto RootDevices = P.get_devices();
  auto C = sycl::context(RootDevices);
  for (auto &D : RootDevices) {
    // Context has multiple root-devices, data can be shared across
    // multiple cards (requires explicit copying).
    auto Q = sycl::queue(C, D);
    Q.submit([&](sycl::handler &cgh) { ... });
  }
```


### Explicit Scaling Example

The example below shows vector addition that is explicitly scaled to execute on multi-stack GPU:


In [None]:
%%writefile lab/vectoradd_explicit_scaling.cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
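#include <sycl/sycl.hpp>
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iostream>
#include <limits>
#include <string>
#include <vector>
using namespace sycl;

// 64-bit unsigned integer for nanosecond timings (same width as OpenCL's cl_ulong).
using cl_ulong = std::uint64_t;

constexpr int num_runs = 10;
constexpr size_t scalar = 3;

cl_ulong triad(size_t array_size) {

  // Start from the largest representable value so std::min keeps the first run.
  cl_ulong min_time_ns0 = std::numeric_limits<cl_ulong>::max();
  cl_ulong min_time_ns1 = std::numeric_limits<cl_ulong>::max();

  device dev = device(gpu_selector_v);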

  std::vector<device> subdev = {};
  subdev = dev.create_sub_devices<sycl::info::partition_property::
    partition_by_affinity_domain>(sycl::info::partition_affinity_domain::numa);

  queue q[2] = {queue(subdev[0], property::queue::enable_profiling{}),
    queue(subdev[1], property::queue::enable_profiling{})};

  std::cout << "Running on device: " <<
    q[0].get_device().get_info<info::device::name>() << "\n";
  std::cout << "Running on device: " <<
    q[1].get_device().get_info<info::device::name>() << "\n";

  // malloc_shared<T> takes an element count, not a byte count.
  double *A0 = malloc_shared<double>(array_size/2, q[0]);
  double *B0 = malloc_shared<double>(array_size/2, q[0]);
  double *C0 = malloc_shared<double>(array_size/2, q[0]);

  double *A1 = malloc_shared<double>(array_size/2, q[1]);
  double *B1 = malloc_shared<double>(array_size/2, q[1]);
  double *C1 = malloc_shared<double>(array_size/2, q[1]);

  for (size_t i = 0; i < array_size/2; i++) {
     A0[i]= 1.0; B0[i]= 2.0; C0[i]= 0.0;
     A1[i]= 1.0; B1[i]= 2.0; C1[i]= 0.0;
  }

  for (int i = 0; i< num_runs; i++) {
    auto q0_event = q[0].submit([&](handler& h) {
        h.parallel_for(array_size/2, [=](id<1> idx) {
            C0[idx] = A0[idx] + B0[idx] * scalar;
            });
        });

    auto q1_event = q[1].submit([&](handler& h) {
        h.parallel_for(array_size/2, [=](id<1> idx) {
            C1[idx] = A1[idx] + B1[idx] * scalar;
            });
        });

    q[0].wait();
    q[1].wait();

    cl_ulong exec_time_ns0 =
      q0_event.get_profiling_info<info::event_profiling::command_end>() -
      q0_event.get_profiling_info<info::event_profiling::command_start>();

    std::cout << "stack-0 Execution time (iteration " << i << ") [sec]: "
      << (double)exec_time_ns0 * 1.0E-9 << "\n";
    min_time_ns0 = std::min(min_time_ns0, exec_time_ns0);

    cl_ulong exec_time_ns1 =
      q1_event.get_profiling_info<info::event_profiling::command_end>() -
      q1_event.get_profiling_info<info::event_profiling::command_start>();

    std::cout << "stack-1 Execution time (iteration " << i << ") [sec]: "
      << (double)exec_time_ns1 * 1.0E-9 << "\n";
    min_time_ns1 = std::min(min_time_ns1, exec_time_ns1);
  }

  // Check correctness
  bool error = false;
  for (size_t i = 0; i < array_size/2; i++) {
    if ((C0[i] != A0[i] + scalar * B0[i]) || (C1[i] != A1[i] + scalar * B1[i])) {
      std::cout << "\nResult incorrect (element " << i << " is " << C0[i] << ")!\n";
      error = true;
    }
  }

  sycl::free(A0, q[0]);
  sycl::free(B0, q[0]);
  sycl::free(C0, q[0]);

  sycl::free(A1, q[1]);
  sycl::free(B1, q[1]);
  sycl::free(C1, q[1]);

  if (error) return -1;

  std::cout << "Results are correct!\n\n";
  return std::max(min_time_ns0, min_time_ns1);
}

int main(int argc, char *argv[]) {

  size_t array_size;
  if (argc > 1 ) {
    array_size = std::stoull(argv[1]);
  }
  else {
    std::cout << "Run as ./<progname> <arraysize in elements>\n";
    return 1;
  }
  std::cout << "Running with stream size of " << array_size
    << " elements (" << (array_size * sizeof(double))/(double)1024/1024 << "MB)\n";

  cl_ulong min_time = triad(array_size);

  if (min_time == -1) return 1;
  size_t triad_bytes = 3 * sizeof(double) * array_size;
  std::cout << "Triad Bytes: " << triad_bytes << "\n";
  std::cout << "Time in sec (fastest run): " << min_time * 1.0E-9 << "\n";
  double triad_bandwidth = 1.0E-09 * triad_bytes/(min_time*1.0E-9);
  std::cout << "Bandwidth of fastest run in GB/s: " << triad_bandwidth << "\n";
  return 0;
}

#### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! ./q.sh run_vectoradd_explicit_scaling.sh
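
The example above gives each stack `array_size/2` elements, which silently drops the last element when `array_size` is odd. When distributing work by hand, a remainder-aware split avoids this. The helper below is a minimal sketch and not part of the original example: `split_range` is a hypothetical name, and each returned pair is a half-open `[begin, end)` range that could be assigned to one sub-device queue.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Splits [0, n) into `parts` contiguous half-open ranges whose sizes
// differ by at most one element.
std::vector<std::pair<std::size_t, std::size_t>> split_range(
    std::size_t n, std::size_t parts) {
  assert(parts > 0);
  std::vector<std::pair<std::size_t, std::size_t>> ranges;
  ranges.reserve(parts);
  const std::size_t base = n / parts;
  const std::size_t extra = n % parts;  // first `extra` parts get one more
  std::size_t begin = 0;
  for (std::size_t p = 0; p < parts; ++p) {
    const std::size_t len = base + (p < extra ? 1 : 0);
    ranges.emplace_back(begin, begin + len);
    begin += len;
  }
  return ranges;
}
```

For 7 elements and 2 stacks this yields the ranges [0, 4) and [4, 7), so every element is processed exactly once.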

### Explicit Scaling Summary
Performance tuning for a multi-stack discrete GPU can be a tedious process because parallelism is managed at a finer granularity. However, the fundamentals are similar to CPU performance tuning. To understand what dominates performance scaling, pay attention to:

- VE utilization efficiency - how kernels utilize the execution resources of different stacks
- Data placement - how allocations are spread across the HBM of different stacks
- Thread-data affinity - where data is located and how it is accessed in the system

In addition, there are several critical programming model concepts that application developers should keep in mind when selecting a scaling scheme for productivity, portability and performance:

- Sub-devices (numa_domains) and Sub-sub-devices (subnuma_domains)
- Implicit and explicit scaling
- Contexts and queues
- Environment variables and program language APIs or constructs