Skip to content

Latest commit

 

History

History
951 lines (695 loc) · 27.6 KB

sycl_ext_oneapi_non_uniform_groups.asciidoc

File metadata and controls

951 lines (695 loc) · 27.6 KB

sycl_ext_oneapi_non_uniform_groups

Notice

Copyright © 2022-2023 Intel Corporation. All rights reserved.

Khronos® is a registered trademark and SYCL™ and SPIR™ are trademarks of The Khronos Group Inc. OpenCL™ is a trademark of Apple Inc. used by permission by Khronos.

Contact

To report problems with this extension, please open a new issue at:

Dependencies

This extension is written against the SYCL 2020 revision 7 specification. All references below to the "core SYCL specification" or to section numbers in the SYCL specification refer to that revision.

Status

This is a proposed extension specification, intended to gather community feedback. Interfaces defined in this specification may not be implemented yet or may be in a preliminary state. The specification itself may also change in incompatible ways before it is finalized. Shipping software products should not rely on APIs defined in this specification.

Backend support status

The APIs in this extension may be used only on a device that has aspect::ext_oneapi_non_uniform_groups. The application must check that the device has this aspect before submitting a kernel using any of the APIs in this extension. If the application fails to do this, the implementation throws a synchronous exception with the errc::kernel_not_supported error code when the kernel is submitted to the queue.

Overview

Many modern hardware architectures support flexible sub-divisions of work-groups and sub-groups to support fine-grained work scheduling. A common use-case for such flexibility is communication between and coordination of work-items in divergent control flow.

This proposal introduces new classes to represent sub-divisions of SYCL’s built-in group types, traits for detecting these classes, and free functions for creating new instances of these classes.

These new classes can be used as arguments to group functions, group algorithms, and custom functions to convey exactly which work-items an operation is expecting, simplifying interfaces and greatly reducing the amount of delicate wording needed in function documentation.

Note
The first version of this extension only supports partitioning of sub-groups. It is expected that in the future, these functions will be expanded to also allow partitioning of root-groups, work-groups and user-constructed groups.

Specification

Feature test macro

This extension provides a feature-test macro as described in the core SYCL specification. An implementation supporting this extension must predefine the macro SYCL_EXT_ONEAPI_NON_UNIFORM_GROUPS to one of the values defined in the table below. Applications can test for the existence of this macro to determine if the implementation supports this feature, or applications can test the macro’s value to determine which of the extension’s features the implementation supports.

Value Description

1

The APIs of this experimental extension are not versioned, so the feature-test macro always has this value.

Extension to enum class aspect

namespace sycl {
enum class aspect {
  ...
  ext_oneapi_non_uniform_groups
}
}

If a SYCL device has the ext_oneapi_non_uniform_groups aspect, then it supports the non-uniform groups described in the next sections.

Control Flow

The SYCL specification defines control flow as below:

When all work-items in a group are executing the same sequence of statements, they are said to be executing under converged control flow. Control flow diverges when different work-items in a group execute a different sequence of statements, typically as a result of evaluating conditions differently (e.g. in selection statements or loops).

This extension introduces some new terminology to describe other kinds of control flow, to simplify the description of the behavior for new group types.

Two or more work-items are considered to have simultaneous execution at a given point if they execute statements at the same time. A common example of simultaneous execution is when work-items are mapped to different lanes of the same SIMD instruction(s). The amount of granularity of simultaneous execution is implementation-defined.

A tangle is a collection of work-items from the same group executing under converged control flow.

Group Taxonomy

Fixed topology groups are groups which represent the hierarchical execution model topology used by SYCL kernels. These groups are implicitly created by an implementation when a SYCL kernel function is enqueued. The following group types are fixed topology groups:

User-constructed groups are explicitly created by a developer (e.g. by partitioning one of the fixed topology groups). This extension introduces the following user-constructed groups:

  • ballot_group

  • fixed_size_group

  • tangle_group

  • opportunistic_group

The is_fixed_topology_group and is_user_constructed_group traits can be used to detect whether a group type represents a fixed topology or user-constructed group, respectively.

namespace sycl::ext::oneapi::experimental {

  template <class T>
  struct is_fixed_topology_group;

  template <class T>
  inline constexpr
  bool is_fixed_topology_group_v = is_fixed_topology_group<T>::value;


  template <class T>
  struct is_user_constructed_group;

  template <class T>
  inline constexpr bool
  is_user_constructed_group_v = is_user_constructed_group<T>::value;

} // namespace sycl::ext::oneapi::experimental

is_fixed_topology_group<T>::value is std::true_type if T is one of: root_group, group or sub_group.

is_user_constructed_group<T>::value is std::true_type if T is one of: ballot_group, fixed_size_group, tangle_group, or opportunisic_group.

Additionally, the is_group<T>::value trait from SYCL 2020 is std::true_type if T is one of: ballot_group, fixed_size_group, tangle_group, or opportunistic_group.

Group Functions and Algorithms

When a user-constructed group is passed to a group function or group algorithm, all work-items in the group must call the function or algorithm in converged control flow. Violating this restriction results in undefined behavior.

If a work-item calls a group function or group algorithm using an object that represents a group to which the work-item does not belong, this results in undefined behavior.

Note
Using group functions and algorithms in the presence of overlapping groups is discouraged, since it is highly likely that such code would have to make assumptions regarding work-item scheduling and forward progress guarantees.

The following group functions support the ballot_group, fixed_size_group, tangle_group, and opportunistic_group group types:

  • group_barrier

  • group_broadcast

The following group algorithms support ballot_group, fixed_size_group, tangle_group, and opportunistic_group group types:

  • joint_any_of and any_of_group

  • joint_all_of and all_of_group

  • joint_none_of and none_of_group

  • shift_group_left

  • shift_group_right

  • permute_group_by_xor

  • select_from_group

  • joint_reduce and reduce_over_group

  • joint_exclusive_scan and exclusive_scan_over_group

  • joint_inclusive_scan and inclusive_scan_over_group

Ballot-Groups

A ballot-group is a non-contiguous subset of a group, representing a collection of all work-items in the group that share the same value of some predicate. Ballot-groups are always created in a range of two: the first ballot-group contains all work-items where the predicate is true, and the second ballot-group contains all work-items where the predicate is false.

Creation

New ballot-groups are created by partitioning an existing group, using the get_ballot_group free-function.

Note
Creating a ballot-group requires a barrier across all work-items in the parent group, since work-items must exchange predicate values in order to determine group membership.
namespace ext::oneapi::experimental {

template <typename Group>
ballot_group get_ballot_group(Group group, bool predicate);

} // namespace ext::oneapi::experimental

Constraints: Available only if sycl::is_group_v<std::decay_t<Group>> && std::is_same_v<Group, sycl::sub_group> is true.

Preconditions: All work-items in group must encounter this function in converged control flow.

Effects: Synchronizes all work-items in group.

Returns: A ballot_group consisting of the work-items in group for which predicate has the same value as the calling work-item.

ballot_group Class

namespace sycl::ext::oneapi::experimental {

template <typename ParentGroup>
class ballot_group {
public:
  using id_type = id<1>;
  using range_type = range<1>;
  using linear_id_type = uint32_t;
  static constexpr int dimensions = 1;
  static constexpr sycl::memory_scope fence_scope = ParentGroup::fence_scope;

  id_type get_group_id() const;

  id_type get_local_id() const;

  range_type get_group_range() const;

  range_type get_local_range() const;

  linear_id_type get_group_linear_id() const;

  linear_id_type get_local_linear_id() const;

  linear_id_type get_group_linear_range() const;

  linear_id_type get_local_linear_range() const;

  bool leader() const;
};

}
Note
ballot_group is templated on a ParentGroup because it is expected that it will eventually be possible to construct a ballot-group from more than only sub-groups.
id_type get_group_id() const;

Returns: An id representing the index of the ballot-group.

Note
This will always be either 0 (representing the group of work-items where the predicate was true) or 1 (representing the group of work-items where the predicate was false).
id_type get_local_id() const;

Returns: An id representing the calling work-item’s position within the ballot-group.

range_type get_group_range() const;

Returns: A range representing the number of ballot-groups.

Note
This will always return a range of 2, as there will always be two groups; one representing the group of work-items where the predicate was true and another representing the group of work-items where the predicate was false.
range_type get_local_range() const;

Returns: A range representing the number of work-items in the ballot-group.

id_type get_group_linear_id() const;

Returns: A linearized version of the id returned by get_group_id().

id_type get_local_linear_id() const;

Returns: A linearized version of the id returned by get_local_linear_id().

range_type get_group_linear_range() const;

Returns: A linearized version of the id returned by get_group_range().

range_type get_local_linear_range() const;

Returns: A linearized version of the id returned by get_local_range().

bool leader() const;

Returns: true for exactly one work-item in the ballot-group, if the calling work-item is the leader of the ballot-group, and false for all other work-items in the ballot-group. The leader of the ballot-group is guaranteed to be the work-item for which get_local_id() returns 0.

Usage Examples

A ballot_group can be used in conjunction with constructs like loops and branches to safely communicate between all work-items still executing:

auto sg = it.get_sub_group();

// get group representing the subset of the sub-group that will take the branch
auto will_branch = sg.get_local_linear_id() % 2 == 0;
auto inner = sycl::ext::oneapi::experimental::get_ballot_group(sg, will_branch);

if (will_branch)
{
  // wait for all work-items that took the branch to hit the barrier
  sycl::group_barrier(inner);

  // reduce across subset of outer work-items that took the branch
  float ix = sycl::reduce_over_group(inner, x, plus<>());
}

Fixed-Size-Groups

A fixed-size-group is a contiguous collection of work-items created by subdividing a group into equally sized parts, such that each work-item is a member of exactly one partition. The size of a fixed-size-group is a static (compile-time) property.

Creation

New fixed-size-groups are created by partitioning an existing group, using the get_fixed_size_group free-function.

Note
Creating a fixed-size-group does not require a barrier across all work-items in the parent group, since work-items can independently identify partition members given a fixed partition size.
namespace ext::oneapi::experimental {

template <size_t PartitionSize, typename Group>
fixed_size_group<PartitionSize, Group> get_fixed_size_group(Group group);

} // namespace ext::oneapi::experimental

Constraints: Available only if sycl::is_group_v<std::decay_t<Group>> && std::is_same_v<Group, sycl::sub_group> is true. PartitionSize must be positive and a power of 2.

Preconditions: PartitionSize must be less than or equal to the result of group.get_max_local_range(). group.get_local_linear_range() must be evenly divisible by PartitionSize.

Returns: A fixed_size_group<PartitionSize> consisting of all work-items in group that are in the same partition as the calling work-item.

fixed_size_group Class

namespace sycl::ext::oneapi::experimental {

template <std::size_t PartitionSize, typename ParentGroup>
class fixed_size_group {
public:
  using id_type = id<1>;
  using range_type = range<1>;
  using linear_id_type = uint32_t;
  static constexpr int dimensions = 1;
  static constexpr sycl::memory_scope fence_scope = ParentGroup::fence_scope;

  id_type get_group_id() const;

  id_type get_local_id() const;

  range_type get_group_range() const;

  range_type get_local_range() const;

  linear_id_type get_group_linear_id() const;

  linear_id_type get_local_linear_id() const;

  linear_id_type get_group_linear_range() const;

  linear_id_type get_local_linear_range() const;

  bool leader() const;
};

}
Note
fixed_size_group is templated on a ParentGroup because it is expected that it will eventually be possible to construct a fixed-size-group from more than only sub-groups.
id_type get_group_id() const;

Returns: An id representing the index of the fixed-size-group.

id_type get_local_id() const;

Returns: An id representing the calling work-item’s position within the fixed-size-group.

range_type get_group_range() const;

Returns: A range representing the number of fixed-size-groups.

range_type get_local_range() const;

Returns: A range representing the number of work-items in the fixed-size-group, which is always equal to PartitionSize.

id_type get_group_linear_id() const;

Returns: A linearized version of the id returned by get_group_id().

id_type get_local_linear_id() const;

Returns: A linearized version of the id returned by get_local_linear_id().

range_type get_group_linear_range() const;

Returns: A linearized version of the id returned by get_group_range().

range_type get_local_linear_range() const;

Returns: A linearized version of the id returned by get_local_range().

bool leader() const;

Returns: true for exactly one work-item in the fixed-size-group, if the calling work-item is the leader of the fixed-size-group, and false for all other work-items in the fixed-size-group. The leader of the fixed-size-group is guaranteed to be the work-item for which get_local_id() returns 0.

Usage Examples

A fixed_size_group can be used to apply group algorithms to subsets of data:

auto sg = it.get_sub_group();

// reduce over contiguous groups of 8 elements
auto partition = sycl::ext::oneapi::experimental::get_fixed_size_group<8>(sg);
auto result = sycl::reduce_over_group(partition, buf[it.get_local_linear_id()], sycl::plus<>());

// write result out once per group
if (partition.leader()){
  buf[partition.get_group_id()] = result;
}

A fixed_size_group can be used to provide an interface accepting a specific number of work-items:

void func_that_needs_4_work_items(sycl::ext::oneapi::experimental::fixed_size_group<4> group);

Tangle-Groups

A tangle-group is a non-contiguous subset of a group representing work-items executing in a tangle. A tangle-group can therefore be used to capture all work-items currently executing the same control flow.

Creation

New tangle-groups are created by partitioning an existing group, using the get_tangle_group free-function.

Note
Creating a tangle-group may implicitly synchronize members of the tangle_group on some devices, since it may be necessary to wait for work-items to reconverge. For consistency, this synchronization is required by all implementations.
namespace ext::oneapi::experimental {

template <typename Group>
tangle_group get_tangle_group(Group group);

} // namespace ext::oneapi::experimental

Constraints: Available only if sycl::is_group_v<std::decay_t<Group>> && std::is_same_v<Group, sycl::sub_group> is true.

Effects: Synchronizes all work-items in the resulting tangle_group.

Returns: A tangle_group consisting of the work-items in group which are part of the same tangle.

tangle_group Class

namespace sycl::ext::oneapi::experimental {

template <typename ParentGroup>
class tangle_group {
public:
  using id_type = id<1>;
  using range_type = range<1>;
  using linear_id_type = uint32_t;
  static constexpr int dimensions = 1;
  static constexpr sycl::memory_scope fence_scope = ParentGroup::fence_scope;

  id_type get_group_id() const;

  id_type get_local_id() const;

  range_type get_group_range() const;

  range_type get_local_range() const;

  linear_id_type get_group_linear_id() const;

  linear_id_type get_local_linear_id() const;

  linear_id_type get_group_linear_range() const;

  linear_id_type get_local_linear_range() const;

  bool leader() const;
};

}
id_type get_group_id() const;

Returns: An id representing the index of the tangle-group.

Note
This will always be an id with all values set to 0, since there can only be one tangle-group.
id_type get_local_id() const;

Returns: An id representing the calling work-item’s position within the tangle-group.

range_type get_group_range() const;

Returns: A range representing the number of tangle-groups.

Note
This will always return a range of 1 as there can only be one tangle-group.
range_type get_local_range() const;

Returns: A range representing the number of work-items in the tangle-group.

id_type get_group_linear_id() const;

Returns: A linearized version of the id returned by get_group_id().

id_type get_local_linear_id() const;

Returns: A linearized version of the id returned by get_local_linear_id().

range_type get_group_linear_range() const;

Returns: A linearized version of the id returned by get_group_range().

range_type get_local_linear_range() const;

Returns: A linearized version of the id returned by get_local_range().

bool leader() const;

Returns: true for exactly one work-item in the tangle-group, if the calling work-item is the leader of the tangle-group, and false for all other work-items in the tangle-group. The leader of the tangle-group is guaranteed to be the work-item for which get_local_id() returns 0.

Usage Examples

A tangle_group can be used in conjunction with constructs like loops and branches to safely communicate between all work-items executing the same control flow.

Note
This differs from a ballot_group because a tangle_group requires the implementation to track group membership. Which group type to use will depend on a combination of implementation/backend/device and programmer preference.
auto sg = it.get_sub_group();

auto will_branch = sg.get_local_linear_id() % 2 == 0;
if (will_branch)
{
  // wait for all work-items that took the branch to hit the barrier
  auto inner = sycl::ext::oneapi::experimental::get_tangle_group(sg);
  sycl::group_barrier(inner);

  // reduce across subset of outer work-items that took the branch
  float ix = sycl::reduce_over_group(inner, x, plus<>());
}

Opportunistic-Groups

An opportunistic-group is a non-contiguous subset of a sub-group, representing the work-items which are executing simultaneously.

In SYCL implementations where work-items have strong forward progress guarantees (and can therefore make progress independently of other work-items in the same sub-group), it is possible that only a subset of the work-items in a sub-group executing the same control flow will execute simultaneously.

In some cases it may be helpful to capture this group and use it for opportunistic optimizations.

Creation

Opportunistic groups are created by calls to the get_opportunistic_group() free-function. Each call to get_opportunistic_group() returns a different group. There are no guarantees that a group returned by get_opportunistic_group() will contain all work-items executing the same control flow, nor the same set of work-items as the group returned by any previous call to get_opportunistic_group().

Note
Creating an opportunistic group does not require a barrier or introduce any synchronization because it is designed to capture whichever set of work-items happen to call get_opportunistic_group() simultaneously.
namespace ext::oneapi::experimental::this_kernel {

opportunistic_group get_opportunistic_group();

} // namespace ext::oneapi::experimental::this_kernel

Returns: An opportunistic_group consisting of all work-items in the same sub-group as the calling work-item which call the function simultaneously.

opportunistic_group Class

namespace sycl::ext::oneapi::experimental {

class opportunistic_group {
public:
  using id_type = id<1>;
  using range_type = range<1>;
  using linear_id_type = uint32_t;
  static constexpr int dimensions = 1;
  static constexpr sycl::memory_scope fence_scope =
    sycl::memory_scope::sub_group;

  id_type get_group_id() const;

  id_type get_local_id() const;

  range_type get_group_range() const;

  range_type get_local_range() const;

  linear_id_type get_group_linear_id() const;

  linear_id_type get_local_linear_id() const;

  linear_id_type get_group_linear_range() const;

  linear_id_type get_local_linear_range() const;

  bool leader() const;
};

}
id_type get_group_id() const;

Returns: An id representing the index of the opportunistic-group.

Note
This will always be an id with all values set to 0, since there can only be one opportunistic-group.
id_type get_local_id() const;

Returns: An id representing the calling work-item’s position within the opportunistic-group.

range_type get_group_range() const;

Returns: A range representing the number of opportunistic-groups.

Note
This will always return a range of 1 as there will only be one opportunistic-group.
range_type get_local_range() const;

Returns: A range representing the number of work-items in the opportunistic-group.

id_type get_group_linear_id() const;

Returns: A linearized version of the id returned by get_group_id().

id_type get_local_linear_id() const;

Returns: A linearized version of the id returned by get_local_linear_id().

range_type get_group_linear_range() const;

Returns: A linearized version of the id returned by get_group_range().

range_type get_local_linear_range() const;

Returns: A linearized version of the id returned by get_local_range().

bool leader() const;

Returns: true for exactly one work-item in the opportunistic-group, if the calling work-item is the leader of the opportunistic-group, and false for all other work-items in the opportunistic-group. The leader of the opportunistic group is guaranteed to be the work-item for which get_local_id() returns 0.

Usage Example

The following example shows an atomic pointer being incremented. It is expected that all the work-items in the sub_group will increment the atomic value, but we opportunistically capture the groups of work-items as they arrive to this point in the control flow.

template <sycl::memory_order Order,
          sycl::memory_scope Scope,
          sycl::access::address_space AddressSpace>
int atomic_aggregate_inc(sycl::sub_group sg, sycl::atomic_ref<int, Order, Scope, AddressSpace> ptr) {

  // get the set of work-items that called this function simultaneously
  auto active_group = sycl::ext::oneapi::experimental::this_kernel::get_opportunistic_group();

  // increment the atomic once on behalf of all active work-items
  int count = active_group.get_local_linear_range();
  int old_value;
  if (active_group.leader()) {
    old_value = ptr.fetch_add(count);
  }

  // return the value the individual work-item might have received if it had worked alone
  auto base = sycl::group_broadcast(active_group, old_value);
  auto idx = active_group.get_local_linear_id();
  return base + idx;

}

Implementation notes

This non-normative section provides information about one possible implementation of this extension. It is not part of the specification of the extension’s API.

For SPIR-V backends, all user-constructed group types are expected to be implemented using SPIR-V’s non-uniform instructions. fixed_size_group functionality is expected to leverage the optional PartitionSize argument of those instructions. Each group type will require slightly different usage of those instructions to ensure that distinct groups encounter unique control flow when appropriate.

For CUDA backends, all user-constructed group types are expected to be lowered to PTX instructions with explicit masks. The only expected difference in implementation for the different group types is how the mask is initially constructed. Supporting tangle_group may require the compiler to construct masks when encountering control flow constructs, and to pass those masks across call boundaries.

Issues

  1. Should tangle_group support work-groups or just sub-groups?

    SPIR-V "tangled instructions" include group and sub-group instructions, but it is unclear how to identify which work-items in different sub-groups are executing the same control flow (without introducing significant overhead). If we decide at a later date that tangle_group should support only sub-groups, we should revisit the name to avoid creating confusion.