# Subgroups

##### Sections
- [What are Subgroups?](#What-are-Subgroups?)
- [How a Subgroup Maps to Graphics Hardware](#How-a-Subgroup-Maps-to-Graphics-Hardware)
- _Code:_ [Subgroup info](#Subgroup-info)
- _Code:_ [Subgroup shuffle operations](#Subgroup-shuffle-operations)
- _Code:_ [Subgroup Collectives](#Subgroup-Collectives)

## Learning Objectives

- Understand advantages of using Subgroups in Data Parallel C++ (DPC++)
- Take advantage of Subgroup collectives in ND-Range kernel implementation
- Use Subgroup Shuffle operations to avoid explicit memory operations

## What are Subgroups?

On many modern hardware platforms, __a subset of the work-items in a work-group__ is executed simultaneously or with additional scheduling guarantees. These subsets of work-items are called subgroups. Leveraging subgroups helps to __map execution to low-level hardware__ and may help in achieving higher performance.

## Subgroups in ND-Range Kernel Execution

Parallel execution with an ND-range kernel groups work-items so that they map to hardware resources. This helps to __tune applications for performance__.

The execution range of an ND-range kernel is divided into __work-groups__, __subgroups__ and __work-items__, as shown in the picture below.

![ND-range kernel execution](assets/ndrange.png)

## How a Subgroup Maps to Graphics Hardware

| Term | Description |
|:---:|:---|
| __Work-item__ | Represents an individual instance of a kernel function. |
| __Work-group__ | The entire iteration space is divided into smaller groups called work-groups. Work-items within a work-group are scheduled on a single compute unit on hardware. |
| __Subgroup__ | A subset of work-items within a work-group that are executed simultaneously and may be mapped to vector hardware. (DPC++) |


The picture below shows how work-groups and subgroups map to __Intel® Gen11 Graphics Hardware__.

![ND-Range Hardware Mapping](assets/hwmapping.png)

## Why use Subgroups?

- Work-items in a sub-group can __communicate directly using shuffle operations__, without explicit memory operations.
- Work-items in a sub-group can synchronize using sub-group barriers and __guarantee memory consistency__ using sub-group memory fences.
- Work-items in a sub-group have access to __sub-group collectives__, providing fast implementations of common parallel patterns.

## sub_group class

The subgroup handle can be obtained from the `nd_item` using the __get_sub_group()__ method:

```cpp
        ONEAPI::sub_group sg = item.get_sub_group();
```

Once you have the subgroup handle, you can query for more information about the subgroup, do shuffle operations or use collective functions.

## Subgroup info

The subgroup handle can be queried for information such as the number of work-items in the subgroup or the number of subgroups in a work-group, which developers need when implementing kernel code that uses subgroups:
- __get_local_id()__ returns the index of the work-item within its subgroup
- __get_local_range()__ returns the size of the subgroup
- __get_group_id()__ returns the index of the subgroup
- __get_group_range()__ returns the number of subgroups within the parent work-group


```cpp
    h.parallel_for(nd_range<1>(64,64), [=](nd_item<1> item){
      /* get sub_group handle */
      ONEAPI::sub_group sg = item.get_sub_group();
      /* query sub_group and print sub_group info once per sub_group */
      if(sg.get_local_id()[0] == 0){
        out << "sub_group id: " << sg.get_group_id()[0]
            << " of " << sg.get_group_range()[0]
            << ", size=" << sg.get_local_range()[0] 
            << endl;
      }
    });
```

### Lab Exercise: Subgroup Info

The DPC++ code below demonstrates the subgroup query methods by printing subgroup info. Inspect the code; no modifications are necessary:
1. Inspect the code cell below and click run ▶ to save the code to file
2. Next run ▶ the cell in the __Build and Run__ section below the code to compile and execute the code.

```cpp
%%writefile lab/sub_group_info.cpp
//==============================================================
// Copyright © 2020 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <CL/sycl.hpp>
using namespace sycl;

static const size_t N = 64; // global size
static const size_t B = 64; // work-group size

int main() {
  queue q;
  std::cout << "Device : " << q.get_device().get_info<info::device::name>() << std::endl;

  q.submit([&](handler &h) {
    //# setup sycl stream class to print standard output from device code
    auto out = stream(1024, 768, h);

    //# nd-range kernel
    h.parallel_for(nd_range<1>(N, B), [=](nd_item<1> item) {
      //# get sub_group handle
      ONEAPI::sub_group sg = item.get_sub_group();

      //# query sub_group and print sub_group info once per sub_group
      if (sg.get_local_id()[0] == 0) {
        out << "sub_group id: " << sg.get_group_id()[0] << " of "
            << sg.get_group_range()[0] << ", size=" << sg.get_local_range()[0]
            << endl;
      }
    });
  }).wait();
}
```

#### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

```sh
! chmod 755 q; chmod 755 run_sub_group_info.sh; if [ -x "$(command -v qsub)" ]; then ./q run_sub_group_info.sh; else ./run_sub_group_info.sh; fi
```

_If the Jupyter cells are not responsive or if they error out when you compile the code samples, please restart the Jupyter Kernel: 
"Kernel->Restart Kernel and Clear All Outputs" and compile the code samples again_.

## Subgroup shuffle operations

One of the most useful features of subgroups is the ability to __communicate directly between individual work-items__ without explicit memory operations.

Shuffle operations enable us to remove work-group local memory usage from our kernels and/or to __avoid unnecessary repeated accesses to global memory__.

The code below uses `shuffle_xor` to swap the values of two work-items:

```cpp
    h.parallel_for(nd_range<1>(N,B), [=](nd_item<1> item){
      ONEAPI::sub_group sg = item.get_sub_group();
      size_t i = item.get_global_id(0);
      /* Shuffles */
      //data[i] = sg.shuffle(data[i], 2);
      //data[i] = sg.shuffle_up(0, data[i], 1);
      //data[i] = sg.shuffle_down(data[i], 0, 1);
      data[i] = sg.shuffle_xor(data[i], 1);
    });

```

<img src="assets/shuffle_xor.png" alt="shuffle_xor" width="300"/>

### Lab Exercise: Subgroup Shuffle

The code below uses subgroup shuffle to swap items in a subgroup. You can try other shuffle operations or change the fixed constant in the shuffle function.

The DPC++ code below demonstrates subgroup shuffle operations. Inspect the code; no modifications are necessary:

1. Inspect the code cell below and click run ▶ to save the code to file.

2. Next run ▶ the cell in the __Build and Run__ section below the code to compile and execute the code.

```cpp
%%writefile lab/sub_group_shuffle.cpp
//==============================================================
// Copyright © 2020 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <CL/sycl.hpp>
using namespace sycl;

static const size_t N = 256; // global size
static const size_t B = 64;  // work-group size

int main() {
  queue q;
  std::cout << "Device : " << q.get_device().get_info<info::device::name>() << std::endl;

  //# initialize data array using usm
  int *data = static_cast<int *>(malloc_shared(N * sizeof(int), q));
  for (int i = 0; i < N; i++) data[i] = i;
  for (int i = 0; i < N; i++) std::cout << data[i] << " ";
  std::cout << std::endl << std::endl;

  q.parallel_for(nd_range<1>(N, B), [=](nd_item<1> item) {
    ONEAPI::sub_group sg = item.get_sub_group();
    size_t i = item.get_global_id(0);

    //# swap adjacent items in array using sub_group shuffle_xor
    data[i] = sg.shuffle_xor(data[i], 1);
  }).wait();

  for (int i = 0; i < N; i++) std::cout << data[i] << " ";
  free(data, q);
  return 0;
}
```

#### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

```sh
! chmod 755 q; chmod 755 run_sub_group_shuffle.sh; if [ -x "$(command -v qsub)" ]; then ./q run_sub_group_shuffle.sh; else ./run_sub_group_shuffle.sh; fi
```


## Subgroup Collectives

The collective functions provide implementations of closely-related common parallel patterns.  

Providing these implementations as library functions __increases developer productivity__ and gives implementations the ability to __generate highly optimized code__ for individual target devices.

```cpp
    h.parallel_for(nd_range<1>(N,B), [=](nd_item<1> item){
      ONEAPI::sub_group sg = item.get_sub_group();
      size_t i = item.get_global_id(0);
      /* Collectives */
      data[i] = reduce(sg, data[i], ONEAPI::plus<>());
      //data[i] = reduce(sg, data[i], ONEAPI::maximum<>());
      //data[i] = reduce(sg, data[i], ONEAPI::minimum<>());
    });

```

### Lab Exercise: Subgroup Collectives

The code below uses subgroup collectives to add all items in a subgroup. You can change "_plus_" to "_maximum_" and check output.

The DPC++ code below demonstrates subgroup collectives. Inspect the code; no modifications are necessary:

1. Inspect the code cell below and click run ▶ to save the code to file.

2. Next run ▶ the cell in the __Build and Run__ section below the code to compile and execute the code.

```cpp
%%writefile lab/sub_group_collective.cpp
//==============================================================
// Copyright © 2020 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <CL/sycl.hpp>
using namespace sycl;

static const size_t N = 256; // global size
static const size_t B = 64;  // work-group size

int main() {
  queue q;
  std::cout << "Device : " << q.get_device().get_info<info::device::name>() << std::endl;

  //# initialize data array using usm
  int *data = static_cast<int *>(malloc_shared(N * sizeof(int), q));
  for (int i = 0; i < N; i++) data[i] = 1 + i;
  for (int i = 0; i < N; i++) std::cout << data[i] << " ";
  std::cout << std::endl << std::endl;

  q.parallel_for(nd_range<1>(N, B), [=](nd_item<1> item) {
    ONEAPI::sub_group sg = item.get_sub_group();
    size_t i = item.get_global_id(0);

    //# Adds all elements in sub_group using sub_group collectives
    int sum = reduce(sg, data[i], ONEAPI::plus<>());

    //# write sub_group sum in first location for each sub_group
    if (sg.get_local_id()[0] == 0) {
      data[i] = sum;
    } else {
      data[i] = 0;
    }
  }).wait();

  for (int i = 0; i < N; i++) std::cout << data[i] << " ";
  free(data, q);
  return 0;
}
```

#### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

```sh
! chmod 755 q; chmod 755 run_sub_group_collective.sh; if [ -x "$(command -v qsub)" ]; then ./q run_sub_group_collective.sh; else ./run_sub_group_collective.sh; fi
```


## Summary

Subgroups allow kernel programming that maps execution to low-level hardware and may help in achieving higher levels of performance.
