# Thread Mapping and GPU Occupancy
In this section we cover how SYCL thread mapping works and how to achieve maximum occupancy with a kernel:
- [SYCL Thread Mapping](#SYCL-Thread-Mapping)
- [Mapping Work-groups to Xe-cores for Maximum Occupancy](#Mapping-Work-groups-to-Xe-cores-for-Maximum-Occupancy)
- [Thread Synchronization](#Thread-Synchronization)
- [GPU Occupancy Calculation Example](#GPU-Occupancy-Calculation-Example)
- [Intel® GPU Occupancy Calculator](#Intel®-GPU-Occupancy-Calculator)

## SYCL Thread Mapping
The SYCL* execution model exposes an abstract view of GPU execution. The SYCL thread hierarchy consists of a 1-, 2-, or 3-dimensional grid of work-items. These work-items are grouped into equal-sized thread groups called work-groups. Threads in a work-group are further divided into equal-sized vector groups called sub-groups (see the illustration that follows).
#### Work-item
A work-item represents one of a collection of parallel executions of a kernel.
#### Sub-group
A sub-group represents a short range of consecutive work-items that are processed together as a SIMD vector of length 8, 16, 32, or a multiple of the native vector length of a CPU with Intel® UHD Graphics.
#### Work-group
A work-group is a 1-, 2-, or 3-dimensional set of threads within the thread hierarchy. In SYCL, synchronization across work-items is only possible with barriers for the work-items within the same work-group.
### nd_range
An nd_range divides the thread hierarchy into 1-, 2-, or 3-dimensional grids of work-groups. It is represented by the global range and the local range of each work-group.

Thread hierarchy:
<img src="assets/nd_range.png">

The diagram above illustrates the relationship among ND-Range, work-group, sub-group, and work-item.
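
As a rough illustration of how the global and local ranges determine the grid of work-groups, the sketch below does the arithmetic in plain C++ with no SYCL dependency. `Extent3` and the helper functions are hypothetical names introduced for this example; they are not part of the SYCL API. The sketch assumes that the local range must divide the global range evenly in each dimension.

```cpp
#include <cstddef>

// Hypothetical helper type (not a SYCL class): a 3-dimensional extent.
struct Extent3 {
  std::size_t d0, d1, d2;
};

// Number of elements covered by an extent (product of its dimensions).
constexpr std::size_t linear_size(const Extent3 &r) {
  return r.d0 * r.d1 * r.d2;
}

// True when the local range divides the global range in every dimension.
constexpr bool divides_evenly(const Extent3 &global, const Extent3 &local) {
  return local.d0 != 0 && local.d1 != 0 && local.d2 != 0 &&
         global.d0 % local.d0 == 0 && global.d1 % local.d1 == 0 &&
         global.d2 % local.d2 == 0;
}

// Number of work-groups per dimension: global range / local range.
// Only meaningful when divides_evenly(global, local) is true.
constexpr Extent3 work_groups(const Extent3 &global, const Extent3 &local) {
  return Extent3{global.d0 / local.d0, global.d1 / local.d1,
                 global.d2 / local.d2};
}
```

For a global range of 64 x 64 x 128 and a local range of 1 x 4 x 128, `work_groups` yields a 64 x 16 x 1 grid, that is, 1024 work-groups of 512 work-items each.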

## Mapping Work-groups to Xe-cores for Maximum Occupancy
The rest of this chapter explains how to pick a proper work-group size to maximize the occupancy of the GPU resources. In the current terminology, a subslice is called an Xe-core (XC) and an Execution Unit (EU) is called an Xe Vector Engine (XVE).

From the key [GPU architecture parameters](https://www.intel.com/content/www/us/en/develop/documentation/oneapi-gpu-optimization-guide/top/xe-arch.html), we summarize the architecture parameters for Gen9, Gen11 and Gen12 below:

Generations | Threads per XVE | XVEs per XC | Threads per XC | XCs | Total XVEs | Total Threads | Total Operations | Max Work Group Size
---|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:
Intel UHD Graphics P630 (Gen9) | 7 | 8  | 56  | 3 | 24 | 168 | 1344 | 256
Intel Iris Xe ICL (Gen11)      | 7 | 8  | 56  | 8 | 64 | 448 | 3584 | 256
Intel Iris Xe-LP TGL (Gen12)   | 7 | 16 | 112 | 6 | 96 | 672 | 5376 | 512

<img src="assets/arch_gen12_xe-lp.png">
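
The derived columns of the architecture table follow from three independent parameters per GPU. The sketch below reproduces that arithmetic. The `GpuParams` struct and the function names are hypothetical, and the factor of 8 operations per thread is an assumption inferred from the ratio of Total Operations to Total Threads in the table.

```cpp
// Hypothetical struct: the independent parameters of one table row.
struct GpuParams {
  int threads_per_xve;  // hardware thread contexts per XVE
  int xves_per_xc;      // XVEs per Xe-core
  int xcs;              // Xe-cores on the GPU
  int ops_per_thread;   // assumed 8 (inferred from the table, see lead-in)
};

constexpr int threads_per_xc(const GpuParams &g) {
  return g.threads_per_xve * g.xves_per_xc;
}

constexpr int total_xves(const GpuParams &g) {
  return g.xves_per_xc * g.xcs;
}

constexpr int total_threads(const GpuParams &g) {
  return threads_per_xc(g) * g.xcs;
}

constexpr int total_operations(const GpuParams &g) {
  return total_threads(g) * g.ops_per_thread;
}

// Rows of the table above.
constexpr GpuParams kGen9{7, 8, 3, 8};
constexpr GpuParams kGen11{7, 8, 8, 8};
constexpr GpuParams kGen12{7, 16, 6, 8};
```

With these inputs, `total_threads(kGen12)` is 672 and `total_operations(kGen12)` is 5376, matching the Xe-LP row.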

The maximum work-group size is a constraint imposed by the hardware and GPU driver. You can query the supported maximum with `device::get_info<sycl::info::device::max_work_group_size>()`.

Let’s start with a simple kernel:
```cpp
auto command_group = [&](auto &cgh) {
      cgh.parallel_for(sycl::range<3>(64, 64, 128), // global range
                       [=](sycl::item<3> it) {
                         // (kernel code)
      });
};
```
This kernel contains 524,288 work-items structured as a 3D range of 64 x 64 x 128. It leaves the work-group and sub-group size selection to the compiler. To fully utilize the 5376 parallel operations available in the Xe-LP TGL GPU, the compiler must choose a proper work group size.

The two most important GPU resources are:
- __Thread Contexts:__
The kernel should have a sufficient number of threads to utilize the GPU’s thread contexts.
- __SIMD Units and SIMD Registers:__
The kernel should be organized to vectorize the work-items and utilize the SIMD registers.

In a SYCL kernel, the programmer can affect the work distribution by structuring the kernel with proper work-group size, sub-group size, and organizing the work-items for efficient vector execution. Writing efficient vector kernels is covered in a separate section. This chapter focuses on work-group and sub-group size selection.

Thread contexts are easier to utilize than SIMD vectors. Therefore, start by selecting the number of threads in a work-group. Each Xe-core has 112 thread contexts, but a single work-group usually cannot use all of them if the kernel is also vectorized by 8, because that would require 112 x 8 = 896 work-items, which exceeds the maximum work-group size of 512. From this, we can derive that the maximum number of threads in a work-group is 64 (512 / 8).

SYCL does not provide a mechanism to directly set the number of threads in a work-group. However, you can use work-group size and SIMD sub-group size to set the number of threads:
```
Work group size = Threads x SIMD sub-group size
```
You can increase the sub-group size as long as there is a sufficient number of registers for the kernel after widening. Note that each XVE has 128 SIMD8 registers so there is a lot of room for widening on simple kernels. The effect of increasing sub-group size is similar to loop unrolling: while each XVE still executes eight 32-bit operations per cycle, the amount of work per work-group interaction is doubled/quadrupled. In SYCL, a programmer can explicitly specify the sub-group size with the kernel attribute `[[intel::reqd_sub_group_size(N)]]`, where N is 8, 16, or 32, to override the compiler’s selection.
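
The relationship between work-group size, sub-group size, and thread count can be sketched as a few small functions. This is an illustrative calculation only; the function names are hypothetical, and it assumes that each sub-group occupies exactly one hardware thread.

```cpp
// Hardware threads needed by one work-group: one thread per sub-group,
// rounding up when the work-group size is not a multiple of the
// sub-group size.
constexpr int threads_per_work_group(int wg_size, int sg_size) {
  return (wg_size + sg_size - 1) / sg_size;
}

// Upper bound on the threads in a single work-group for a given
// sub-group size and device maximum work-group size.
constexpr int max_threads_per_work_group(int max_wg_size, int sg_size) {
  return max_wg_size / sg_size;
}

// True when one work-group alone could occupy every thread context of
// an Xe-core.
constexpr bool one_group_can_fill_xe_core(int threads_per_xc, int max_wg_size,
                                          int sg_size) {
  return max_threads_per_work_group(max_wg_size, sg_size) >= threads_per_xc;
}
```

With the Xe-LP values used in this chapter (112 thread contexts per Xe-core, maximum work-group size 512), `max_threads_per_work_group(512, 8)` is 64, so a single SIMD8 work-group cannot fill an Xe-core by itself.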

The table below summarizes the selection criteria of threads and sub-group sizes to keep all GPU resources occupied for TGL:

Configurations to ensure full occupancy

Maximum Threads | Minimum Sub-group Size | Maximum Sub-group Size | Maximum Work-group Size | Constraint
:-:|:-:|:-:|:-:|:-
64 | 8 | 32 | 512 | Threads x sub-group size <= 512

Back to our example program, if you choose a work-group size less than 64 for sub-group size 8, less than 128 for sub-group size 16, or less than 256 for sub-group size 32, the application will not be able to fully utilize TGL GPU’s thread contexts. Choosing a larger work-group size has the additional advantage of reducing the number of rounds of work-group dispatching.

## Thread Synchronization
SYCL provides two synchronization mechanisms that can be called within a kernel function. Both are only defined for work-items within the same work-group. SYCL does not provide any global synchronization mechanism inside a kernel for all work-items across the entire nd_range.

- __group_barrier__ inserts a memory fence and blocks the execution of all work-items within the work-group until all work-items have reached its location.

- __mem_fence__ inserts a memory fence on global and local memory access across all work-items in a work-group.


### Impact of Work-item Synchronization Within Work-group
Let’s look at a kernel requiring work-item synchronization:
```cpp
auto command_group = [&](auto &cgh) {
      cgh.parallel_for(sycl::nd_range(sycl::range(64, 64, 128), // global range
                                sycl::range(1, R, 128)    // local range
                                ),
                       [=](sycl::nd_item<3> item) {
                         // (kernel code)
                         // Internal synchronization
                         sycl::group_barrier(item.get_group());
                         // (kernel code)
      });
};
```

This kernel is similar to the previous example, except it requires work-group barrier synchronization. Work-item synchronization is only available to work-items within the same work-group. You must pick a work-group local range using nd_range and nd_item. Because synchronization is implemented using an Xe-core’s SLM for shared variables, all the work-items of a work-group must be allocated to the same Xe-core, which affects Xe-core occupancy and kernel performance.

In this kernel, the local range of work-group is given as range(1, R, 128). Assuming the sub-group size is eight, let’s look at how the values of variable R affect XVE occupancy. In the case of R=1, the local group range is (1, 1, 128) and work-group size is 128. The Xe-core allocated for a work-group contains only 16 threads out of 112 available thread contexts (i.e., very low occupancy). However, the system can dispatch 7 work-groups to the same Xe-core to reach full occupancy at the expense of a higher number of dispatches.

In the case of R>4, the work-group size will exceed the system-supported maximum work-group size of 512, and the kernel will fail to launch. In the case of R=4, the work-group needs 64 threads, so an Xe-core is only 57% occupied (4 of the 7 thread contexts on each XVE). The three unused thread contexts per XVE are not sufficient to accommodate another work-group, wasting 43% of the available XVE capacity. Note that the driver may still be able to dispatch a partial work-group to an unused Xe-core. However, because of the barrier in the kernel, the partially dispatched work-items would not be able to pass the barrier until the rest of the work-group is dispatched. In most cases, the kernel’s performance would not benefit much from the partial dispatch. Hence, it is important to avoid this problem by properly choosing the work-group size.
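
The reasoning in the preceding paragraphs can be sketched as a small calculation. The names `XeCoreFit`, `fit_on_xe_core`, and `occupancy_percent` are hypothetical. The sketch assumes that a work-group using a barrier must be entirely resident on one Xe-core and that thread contexts are the only limiting resource (SLM and register pressure are ignored).

```cpp
// Hypothetical result type for this sketch.
struct XeCoreFit {
  bool launchable;       // false when wg_size exceeds the device maximum
  int threads_per_wg;    // hardware threads one work-group needs
  int resident_groups;   // whole work-groups that fit on one Xe-core
  int occupied_threads;  // thread contexts in use with that many groups
};

// How many whole work-groups fit on one Xe-core, counting thread
// contexts only. sg_size is assumed to be positive.
constexpr XeCoreFit fit_on_xe_core(int wg_size, int sg_size,
                                   int threads_per_xc, int max_wg_size) {
  if (wg_size <= 0 || wg_size > max_wg_size) {
    return XeCoreFit{false, 0, 0, 0};
  }
  const int threads = (wg_size + sg_size - 1) / sg_size;
  const int groups = threads_per_xc / threads;
  return XeCoreFit{true, threads, groups, groups * threads};
}

// Occupied thread contexts as a percentage of the Xe-core total.
constexpr double occupancy_percent(const XeCoreFit &f, int threads_per_xc) {
  return 100.0 * f.occupied_threads / threads_per_xc;
}
```

For example, `fit_on_xe_core(128 * R, 8, 112, 512)` with R=1 reports 7 resident work-groups using all 112 thread contexts, while R=4 reports a single resident work-group using 64 of them.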

The table below summarizes the tradeoffs between group size, number of threads, Xe-core utilization, and occupancy.

#### Utilization for various configurations

Work-items | Group Size | Threads | Xe-core Utilization | Xe-core Occupancy
--- | --- | --- | --- | ---
64 x 64 x 128 = 524288 | (R=1) 128 | 16 | 16/112 = 14% | 100% with 7 work-groups
64 x 64 x 128 = 524288 | (R=2) 128 x 2 | 16 x 2 | 32/112 = 28.6% | 86% with 3 work-groups
64 x 64 x 128 = 524288 | (R=3) 128 x 3 | 16 x 3 | 48/112 = 42.9% | 86% with 2 work-groups
64 x 64 x 128 = 524288 | (R=4) 128 x 4 | 16 x 4 | 64/112 = 57% | 57% maximum
64 x 64 x 128 = 524288 | (R>4) 640+ |  |  | Fail to launch


### Impact of Local Memory Within Work-group
Let’s look at an example where a kernel allocates local memory for a work-group:
```cpp
auto command_group =
    [&](auto &cgh) {
      // local memory variables shared among work items
      sycl::local_accessor<int, 1> myLocal(sycl::range(R), cgh);
      cgh.parallel_for(sycl::nd_range(sycl::range<3>(64, 64, 128), // global range
                                sycl::range<3>(1, R, 128)    // local range
                                ),
                       [=](sycl::nd_item<3> item) {
                         // (work group code)
                         // index with dimension 1, whose local extent is R
                         myLocal[item.get_local_id(1)] = ...
                       });
    };
```
Because work-group local variables are shared among its work-items, they are allocated in an Xe-core’s SLM. Therefore, this work-group must be allocated to a single Xe-core, the same as with intra-group synchronization. In addition, you must also weigh the sizes of local variables under different group size options such that the local variables fit within an Xe-core’s 128KB SLM capacity limit.
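
To illustrate how SLM usage can cap the number of work-groups resident on an Xe-core, the sketch below takes the minimum of the thread-context limit and the SLM limit. The function name is hypothetical, the 48 KB allocation in the usage note is a made-up value, and the model assumes SLM is divided among resident work-groups with no other overhead.

```cpp
#include <cstddef>

// Work-groups that can be resident on one Xe-core when both thread
// contexts and SLM capacity are limited. threads_per_wg is assumed to be
// positive.
constexpr int resident_groups(int threads_per_wg, int threads_per_xc,
                              std::size_t slm_per_wg_bytes,
                              std::size_t slm_per_xc_bytes) {
  const int by_threads = threads_per_xc / threads_per_wg;
  if (slm_per_wg_bytes == 0) {
    return by_threads;  // the kernel uses no SLM
  }
  if (slm_per_wg_bytes > slm_per_xc_bytes) {
    return 0;  // a single work-group's allocation does not fit
  }
  const int by_slm = static_cast<int>(slm_per_xc_bytes / slm_per_wg_bytes);
  return by_threads < by_slm ? by_threads : by_slm;
}
```

With 16 threads per work-group and 112 thread contexts per Xe-core, a kernel without SLM could have 7 resident work-groups. If each work-group allocated a hypothetical 48 KB of a 128 KB SLM, `resident_groups(16, 112, 48 * 1024, 128 * 1024)` drops to 2.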

## GPU Occupancy Calculation Example
Before concluding this section, let’s look at the hardware occupancies of several variants of a simple vector add example, using _Intel(R) UHD Graphics P630_ (Gen9 GPU) as the underlying hardware. Its resource parameters are:

GPU | Threads per XVE | XVEs per XC | Threads per XC | XCs | Total Threads | Total Operations | Max Work Group Size
---|:-:|:-:|:-:|:-:|:-:|:-:|:-:
Intel UHD Graphics P630 (Gen9) | 7 | 8  | 56  | 3 | 168 | 1344 | 256

The _vec_add_ example below explicitly sets the work-group size to `max_work_group_size` (256) and the SIMD width to 32, and takes the number of work-groups as the template parameter `groups`.

In the absence of intra-work group synchronization, threads from any work-group can be dispatched to any Xe-core. Dividing the number of threads by the number of available thread contexts in the GPU (168 for Gen9) gives us an estimate of the GPU hardware occupancy. 

```
GPU Occupancy = (Work-Groups * (WG-Size/SIMD)) / GPU-Threads

GPU Occupancy = (Work-Groups * (256/32)) / 168

```
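
The formula above can be written as a small helper. This is a sketch under the same simplification as the formula: thread contexts are the only resource considered. The function names are hypothetical, and capping the result at 100% is an assumption that reflects that threads beyond the available contexts have to wait.

```cpp
// Hardware threads requested by a launch: work-groups x (WG-Size / SIMD).
constexpr int total_hw_threads(int work_groups, int wg_size, int simd) {
  return work_groups * (wg_size / simd);
}

// Estimated GPU occupancy as a fraction in [0, 1].
constexpr double gpu_occupancy(int work_groups, int wg_size, int simd,
                               int gpu_threads) {
  const double ratio =
      static_cast<double>(total_hw_threads(work_groups, wg_size, simd)) /
      gpu_threads;
  return ratio > 1.0 ? 1.0 : ratio;  // cannot exceed the available contexts
}
```

For the Gen9 parameters, `gpu_occupancy(7, 256, 32, 168)` is 56/168, about one third, and any launch of 21 or more work-groups reaches the cap of 1.0.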

| Work-groups | Work-items | Work-group Size | SIMD | Threads per Work-group | Threads | Xe-core Occupancy<br> = Threads / 56 | GPU Occupancy<br> = Threads / 168 |
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| 1  | 256  | 256 | 32 | 8 | 8   | 14.2% | 4.7% |
| 2  | 512  | 256 | 32 | 8 | 16  | 28.5% | 9.5% |
| 3  | 768  | 256 | 32 | 8 | 24  | 42.8% | 14.2% |
| 4  | 1024 | 256 | 32 | 8 | 32  | 57.1% | 19% |
| 5  | 1280 | 256 | 32 | 8 | 40  | 71.4% | 23.8% |
| 6  | 1536 | 256 | 32 | 8 | 48  | 85.7% | 28.5% |
| 7  | 1792 | 256 | 32 | 8 | 56  | 100%  | 33.3% |
| 8  | 2048 | 256 | 32 | 8 | 64  | 100%  | 38% |
| 12 | 3072 | 256 | 32 | 8 | 96  | 100%  | 57% |
| 16 | 4096 | 256 | 32 | 8 | 128 | 100%  | 76% |
| 20 | 5120 | 256 | 32 | 8 | 160 | 100%  | 95% |
| 24 | 6144 | 256 | 32 | 8 | 192 | 100%  | 100% |

The `vec_add.cpp` SYCL code example below can be used to run the experiment and capture VTune profiling data.

```cpp
// lab/vec_add.cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <sycl/sycl.hpp>

#include <chrono>
#include <iostream>
#include <vector>

#define N 13762560

template <int groups, int wg_size, int sg_size>
int VectorAdd(sycl::queue &q, std::vector<int> &a, std::vector<int> &b,
               std::vector<int> &sum) {
  sycl::range num_items{a.size()};

  sycl::buffer a_buf(a);
  sycl::buffer b_buf(b);
  sycl::buffer sum_buf(sum.data(), num_items);
  size_t num_groups = groups;

  auto start = std::chrono::high_resolution_clock::now().time_since_epoch().count();
  q.submit([&](auto &h) {
    sycl::accessor a_acc(a_buf, h, sycl::read_only);
    sycl::accessor b_acc(b_buf, h, sycl::read_only);
    sycl::accessor sum_acc(sum_buf, h, sycl::write_only, sycl::no_init);

    h.parallel_for(
        sycl::nd_range<1>(num_groups * wg_size, wg_size),
        [=](sycl::nd_item<1> index) [[intel::reqd_sub_group_size(sg_size)]] {
          // Each work-group processes one contiguous chunk of N / groups
          // elements; N is divisible by every group count used in main().
          constexpr size_t chunk = N / groups;
          size_t grp_id = index.get_group()[0];
          size_t loc_id = index.get_local_id();
          size_t first = grp_id * chunk;
          size_t last = first + chunk;
          // Work-items of a group stride through the chunk by wg_size.
          for (size_t i = first + loc_id; i < last; i += wg_size) {
            sum_acc[i] = a_acc[i] + b_acc[i];
          }
        });
  });
  q.wait();
  auto end = std::chrono::high_resolution_clock::now().time_since_epoch().count();
  std::cout << "VectorAdd<" << groups << "> completed on device - "
            << (end - start) * 1e-9 << " seconds\n";
  return 0;
}

int main() {

  sycl::queue q;

  std::vector<int> a(N), b(N), sum(N);
  for (size_t i = 0; i < a.size(); i++){
    a[i] = i;
    b[i] = i;
    sum[i] = 0;
  }

  std::cout << "Running on device: "
            << q.get_device().get_info<sycl::info::device::name>() << "\n";
  std::cout << "Vector size: " << a.size() << "\n";

  VectorAdd<1,256,32>(q, a, b, sum);
  VectorAdd<2,256,32>(q, a, b, sum);
  VectorAdd<3,256,32>(q, a, b, sum);
  VectorAdd<4,256,32>(q, a, b, sum);
  VectorAdd<5,256,32>(q, a, b, sum);
  VectorAdd<6,256,32>(q, a, b, sum);
  VectorAdd<7,256,32>(q, a, b, sum);
  VectorAdd<8,256,32>(q, a, b, sum);
  VectorAdd<12,256,32>(q, a, b, sum);
  VectorAdd<16,256,32>(q, a, b, sum);
  VectorAdd<20,256,32>(q, a, b, sum);
  VectorAdd<24,256,32>(q, a, b, sum);
  VectorAdd<28,256,32>(q, a, b, sum);
  VectorAdd<32,256,32>(q, a, b, sum);
  return 0;
}
```



#### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

```
! ./q.sh run_vec_add.sh gen9
```

#### VTune Analysis
You can collect VTune `gpu_hotspots` data with the `vtune_collect.sh` script. The script collects VTune data using command-line options; you can then open the results in the VTune Profiler tool.

You can run this command in a terminal: `./q.sh vtune_collect.sh gen9`. Alternatively, you can log in to a Gen9 node interactively on DevCloud and run the script `./vtune_collect.sh`. Collecting VTune data takes a few minutes and generates an HTML report in the same folder. It also creates a `vtune_data` folder, which can be zipped and copied to a local machine with the VTune GUI tool installed for further analysis, as shown below:

From the VTune output below, you can see that the occupancy predicted earlier by manual calculation matches the VTune data:

<img src="assets/vtune_tasks_vec_add_gen9.png">

The GPU Compute Threads Dispatch can be analyzed on VTune Profiler:

<img src="assets/vtune_threads_vec_add_gen9.png">

## Intel® GPU Occupancy Calculator
In summary, a SYCL work-group is typically dispatched to an Xe-core. All the work-items in a work-group share the same SLM of an Xe-core for intra-work-group thread barriers and memory fence synchronization. Multiple work-groups can be dispatched to the same Xe-core if there are sufficient XVE ALUs, SLM, and thread contexts to accommodate them.

You can achieve higher performance by fully utilizing all available Xe-cores. The parameters affecting a kernel’s GPU occupancy are the work-group size and the SIMD sub-group size, which together also determine the number of threads in the work-group.

The [Intel® GPU Occupancy Calculator](https://oneapi-src.github.io/oneAPI-samples/Tools/GPU-Occupancy-Calculator/index.html) can be used to calculate the occupancy on an Intel GPU for a given kernel, and its work-group parameters.

## Resources

- [Intel GPU Optimization Guide](https://www.intel.com/content/www/us/en/develop/documentation/oneapi-gpu-optimization-guide/top.html) - Up to date resources for Intel GPU Optimization
- [SYCL Specification](https://registry.khronos.org/SYCL/specs/sycl-2020/pdf/sycl-2020.pdf) - Latest Specification document for reference
- [SYCL Essentials Training](https://github.com/oneapi-src/oneAPI-samples/tree/master/DirectProgramming/C%2B%2BSYCL/Jupyter/oneapi-essentials-training) - Learn basics of C++ SYCL Programming