Copyright © 2022-2023 Intel Corporation. All rights reserved.
Khronos® is a registered trademark and SYCL™ and SPIR™ are trademarks of The Khronos Group Inc. OpenCL™ is a trademark of Apple Inc. used by permission by Khronos.
To report problems with this extension, please open a new issue at:
This extension is written against the SYCL 2020 revision 7 specification. All references below to the "core SYCL specification" or to section numbers in the SYCL specification refer to that revision.
This is a proposed extension specification, intended to gather community feedback. Interfaces defined in this specification may not be implemented yet or may be in a preliminary state. The specification itself may also change in incompatible ways before it is finalized. Shipping software products should not rely on APIs defined in this specification.
The APIs in this extension may be used only on a device that has
aspect::ext_oneapi_non_uniform_groups
. The application must check that the
device has this aspect before submitting a kernel using any of the APIs in this
extension. If the application fails to do this, the implementation throws a
synchronous exception with the errc::kernel_not_supported
error code when the
kernel is submitted to the queue.
Many modern hardware architectures support flexible sub-divisions of work-groups and sub-groups to support fine-grained work scheduling. A common use-case for such flexibility is communication between and coordination of work-items in divergent control flow.
This proposal introduces new classes to represent sub-divisions of SYCL’s built-in group types, traits for detecting these classes, and free functions for creating new instances of these classes.
These new classes can be used as arguments to group functions, group algorithms, and custom functions to convey exactly which work-items an operation is expecting, simplifying interfaces and greatly reducing the amount of delicate wording needed in function documentation.
Note
|
The first version of this extension only supports partitioning of sub-groups. It is expected that in the future, these functions will be expanded to also allow partitioning of root-groups, work-groups and user-constructed groups. |
This extension provides a feature-test macro as described in the core SYCL
specification. An implementation supporting this extension must predefine the
macro SYCL_EXT_ONEAPI_NON_UNIFORM_GROUPS
to one of the values defined in the
table below. Applications can test for the existence of this macro to
determine if the implementation supports this feature, or applications can test
the macro’s value to determine which of the extension’s features the
implementation supports.
Value | Description |
---|---|
1 |
The APIs of this experimental extension are not versioned, so the feature-test macro always has this value. |
namespace sycl {
enum class aspect {
...
ext_oneapi_non_uniform_groups
}
}
If a SYCL device has the ext_oneapi_non_uniform_groups
aspect,
then it supports the non-uniform groups described in the next sections.
The SYCL specification defines control flow as below:
When all work-items in a group are executing the same sequence of statements, they are said to be executing under converged control flow. Control flow diverges when different work-items in a group execute a different sequence of statements, typically as a result of evaluating conditions differently (e.g. in selection statements or loops).
This extension introduces some new terminology to describe other kinds of control flow, to simplify the description of the behavior for new group types.
Two or more work-items are considered to have simultaneous execution at a given point if they execute statements at the same time. A common example of simultaneous execution is when work-items are mapped to different lanes of the same SIMD instruction(s). The amount of granularity of simultaneous execution is implementation-defined.
A tangle is a collection of work-items from the same group executing under converged control flow.
Fixed topology groups are groups which represent the hierarchical execution model topology used by SYCL kernels. These groups are implicitly created by an implementation when a SYCL kernel function is enqueued. The following group types are fixed topology groups:
-
root_group
(if sycl_ext_oneapi_root_group is supported) -
group
-
sub_group
User-constructed groups are explicitly created by a developer (e.g. by partitioning one of the fixed topology groups). This extension introduces the following user-constructed groups:
-
ballot_group
-
fixed_size_group
-
tangle_group
-
opportunistic_group
The is_fixed_topology_group
and is_user_constructed_group
traits can be
used to detect whether a group type represents a fixed topology or
user-constructed group, respectively.
namespace sycl::ext::oneapi::experimental {
template <class T>
struct is_fixed_topology_group;
template <class T>
inline constexpr
bool is_fixed_topology_group_v = is_fixed_topology_group<T>::value;
template <class T>
struct is_user_constructed_group;
template <class T>
inline constexpr bool
is_user_constructed_group_v = is_user_constructed_group<T>::value;
} // namespace sycl::ext::oneapi::experimental
is_fixed_topology_group<T>::value
is std::true_type
if T
is one of:
root_group
, group
or sub_group
.
is_user_constructed_group<T>::value
is std::true_type
if T
is one of:
ballot_group
, fixed_size_group
, tangle_group
, or opportunisic_group
.
Additionally, the is_group<T>::value
trait from SYCL 2020 is std::true_type
if T
is one of: ballot_group
, fixed_size_group
, tangle_group
, or
opportunistic_group
.
When a user-constructed group is passed to a group function or group algorithm, all work-items in the group must call the function or algorithm in converged control flow. Violating this restriction results in undefined behavior.
If a work-item calls a group function or group algorithm using an object that represents a group to which the work-item does not belong, this results in undefined behavior.
Note
|
Using group functions and algorithms in the presence of overlapping groups is discouraged, since it is highly likely that such code would have to make assumptions regarding work-item scheduling and forward progress guarantees. |
The following group functions support the ballot_group
, fixed_size_group
,
tangle_group
, and opportunistic_group
group types:
-
group_barrier
-
group_broadcast
The following group algorithms support ballot_group
, fixed_size_group
,
tangle_group
, and opportunistic_group
group types:
-
joint_any_of
andany_of_group
-
joint_all_of
andall_of_group
-
joint_none_of
andnone_of_group
-
shift_group_left
-
shift_group_right
-
permute_group_by_xor
-
select_from_group
-
joint_reduce
andreduce_over_group
-
joint_exclusive_scan
andexclusive_scan_over_group
-
joint_inclusive_scan
andinclusive_scan_over_group
A ballot-group is a non-contiguous subset of a group, representing a collection of all work-items in the group that share the same value of some predicate. Ballot-groups are always created in a range of two: the first ballot-group contains all work-items where the predicate is true, and the second ballot-group contains all work-items where the predicate is false.
New ballot-groups are created by partitioning an existing group, using the
get_ballot_group
free-function.
Note
|
Creating a ballot-group requires a barrier across all work-items in the parent group, since work-items must exchange predicate values in order to determine group membership. |
namespace ext::oneapi::experimental {
template <typename Group>
ballot_group get_ballot_group(Group group, bool predicate);
} // namespace ext::oneapi::experimental
Constraints: Available only if sycl::is_group_v<std::decay_t<Group>> &&
std::is_same_v<Group, sycl::sub_group>
is true.
Preconditions: All work-items in group
must encounter this function in
converged control flow.
Effects: Synchronizes all work-items in group
.
Returns: A ballot_group
consisting of the work-items in group
for which
predicate
has the same value as the calling work-item.
namespace sycl::ext::oneapi::experimental {
template <typename ParentGroup>
class ballot_group {
public:
using id_type = id<1>;
using range_type = range<1>;
using linear_id_type = uint32_t;
static constexpr int dimensions = 1;
static constexpr sycl::memory_scope fence_scope = ParentGroup::fence_scope;
id_type get_group_id() const;
id_type get_local_id() const;
range_type get_group_range() const;
range_type get_local_range() const;
linear_id_type get_group_linear_id() const;
linear_id_type get_local_linear_id() const;
linear_id_type get_group_linear_range() const;
linear_id_type get_local_linear_range() const;
bool leader() const;
};
}
Note
|
ballot_group is templated on a ParentGroup because it is expected
that it will eventually be possible to construct a ballot-group from more
than only sub-groups.
|
id_type get_group_id() const;
Returns: An id
representing the index of the ballot-group.
Note
|
This will always be either 0 (representing the group of work-items where the predicate was true) or 1 (representing the group of work-items where the predicate was false). |
id_type get_local_id() const;
Returns: An id
representing the calling work-item’s position within
the ballot-group.
range_type get_group_range() const;
Returns: A range
representing the number of ballot-groups.
Note
|
This will always return a range of 2, as there will always be two groups;
one representing the group of work-items where the predicate was true and
another representing the group of work-items where the predicate was false.
|
range_type get_local_range() const;
Returns: A range
representing the number of work-items in the ballot-group.
id_type get_group_linear_id() const;
Returns: A linearized version of the id
returned by get_group_id()
.
id_type get_local_linear_id() const;
Returns: A linearized version of the id
returned by get_local_linear_id()
.
range_type get_group_linear_range() const;
Returns: A linearized version of the id
returned by get_group_range()
.
range_type get_local_linear_range() const;
Returns: A linearized version of the id
returned by get_local_range()
.
bool leader() const;
Returns: true
for exactly one work-item in the ballot-group, if the calling
work-item is the leader of the ballot-group, and false
for all other
work-items in the ballot-group. The leader of the ballot-group is guaranteed to
be the work-item for which get_local_id()
returns 0.
A ballot_group
can be used in conjunction with constructs like loops and
branches to safely communicate between all work-items still executing:
auto sg = it.get_sub_group();
// get group representing the subset of the sub-group that will take the branch
auto will_branch = sg.get_local_linear_id() % 2 == 0;
auto inner = sycl::ext::oneapi::experimental::get_ballot_group(sg, will_branch);
if (will_branch)
{
// wait for all work-items that took the branch to hit the barrier
sycl::group_barrier(inner);
// reduce across subset of outer work-items that took the branch
float ix = sycl::reduce_over_group(inner, x, plus<>());
}
A fixed-size-group is a contiguous collection of work-items created by subdividing a group into equally sized parts, such that each work-item is a member of exactly one partition. The size of a fixed-size-group is a static (compile-time) property.
New fixed-size-groups are created by partitioning an existing group, using the
get_fixed_size_group
free-function.
Note
|
Creating a fixed-size-group does not require a barrier across all work-items in the parent group, since work-items can independently identify partition members given a fixed partition size. |
namespace ext::oneapi::experimental {
template <size_t PartitionSize, typename Group>
fixed_size_group<PartitionSize, Group> get_fixed_size_group(Group group);
} // namespace ext::oneapi::experimental
Constraints: Available only if sycl::is_group_v<std::decay_t<Group>> &&
std::is_same_v<Group, sycl::sub_group>
is true. PartitionSize
must be positive
and a power of 2.
Preconditions: PartitionSize
must be less than or equal to the result of
group.get_max_local_range()
. group.get_local_linear_range()
must be evenly
divisible by PartitionSize
.
Returns: A fixed_size_group<PartitionSize>
consisting of all work-items in
group
that are in the same partition as the calling work-item.
namespace sycl::ext::oneapi::experimental {
template <std::size_t PartitionSize, typename ParentGroup>
class fixed_size_group {
public:
using id_type = id<1>;
using range_type = range<1>;
using linear_id_type = uint32_t;
static constexpr int dimensions = 1;
static constexpr sycl::memory_scope fence_scope = ParentGroup::fence_scope;
id_type get_group_id() const;
id_type get_local_id() const;
range_type get_group_range() const;
range_type get_local_range() const;
linear_id_type get_group_linear_id() const;
linear_id_type get_local_linear_id() const;
linear_id_type get_group_linear_range() const;
linear_id_type get_local_linear_range() const;
bool leader() const;
};
}
Note
|
fixed_size_group is templated on a ParentGroup because it is expected
that it will eventually be possible to construct a fixed-size-group from more
than only sub-groups.
|
id_type get_group_id() const;
Returns: An id
representing the index of the fixed-size-group.
id_type get_local_id() const;
Returns: An id
representing the calling work-item’s position within
the fixed-size-group.
range_type get_group_range() const;
Returns: A range
representing the number of fixed-size-groups.
range_type get_local_range() const;
Returns: A range
representing the number of work-items in the
fixed-size-group, which is always equal to PartitionSize
.
id_type get_group_linear_id() const;
Returns: A linearized version of the id
returned by get_group_id()
.
id_type get_local_linear_id() const;
Returns: A linearized version of the id
returned by get_local_linear_id()
.
range_type get_group_linear_range() const;
Returns: A linearized version of the id
returned by get_group_range()
.
range_type get_local_linear_range() const;
Returns: A linearized version of the id
returned by get_local_range()
.
bool leader() const;
Returns: true
for exactly one work-item in the fixed-size-group, if the calling
work-item is the leader of the fixed-size-group, and false
for all other
work-items in the fixed-size-group. The leader of the fixed-size-group is guaranteed
to be the work-item for which get_local_id()
returns 0.
A fixed_size_group
can be used to apply group algorithms to subsets of data:
auto sg = it.get_sub_group();
// reduce over contiguous groups of 8 elements
auto partition = sycl::ext::oneapi::experimental::get_fixed_size_group<8>(sg);
auto result = sycl::reduce_over_group(partition, buf[it.get_local_linear_id()], sycl::plus<>());
// write result out once per group
if (partition.leader()){
buf[partition.get_group_id()] = result;
}
A fixed_size_group
can be used to provide an interface accepting a specific
number of work-items:
void func_that_needs_4_work_items(sycl::ext::oneapi::experimental::fixed_size_group<4> group);
A tangle-group is a non-contiguous subset of a group representing work-items executing in a tangle. A tangle-group can therefore be used to capture all work-items currently executing the same control flow.
New tangle-groups are created by partitioning an existing group, using the
get_tangle_group
free-function.
Note
|
Creating a tangle-group may implicitly synchronize members of the
tangle_group on some devices, since it may be necessary to wait for
work-items to reconverge. For consistency, this synchronization is required by
all implementations.
|
namespace ext::oneapi::experimental {
template <typename Group>
tangle_group get_tangle_group(Group group);
} // namespace ext::oneapi::experimental
Constraints: Available only if sycl::is_group_v<std::decay_t<Group>> &&
std::is_same_v<Group, sycl::sub_group>
is true.
Effects: Synchronizes all work-items in the resulting tangle_group
.
Returns: A tangle_group
consisting of the work-items in group
which are
part of the same tangle.
namespace sycl::ext::oneapi::experimental {
template <typename ParentGroup>
class tangle_group {
public:
using id_type = id<1>;
using range_type = range<1>;
using linear_id_type = uint32_t;
static constexpr int dimensions = 1;
static constexpr sycl::memory_scope fence_scope = ParentGroup::fence_scope;
id_type get_group_id() const;
id_type get_local_id() const;
range_type get_group_range() const;
range_type get_local_range() const;
linear_id_type get_group_linear_id() const;
linear_id_type get_local_linear_id() const;
linear_id_type get_group_linear_range() const;
linear_id_type get_local_linear_range() const;
bool leader() const;
};
}
id_type get_group_id() const;
Returns: An id
representing the index of the tangle-group.
Note
|
This will always be an id with all values set to 0, since there can
only be one tangle-group.
|
id_type get_local_id() const;
Returns: An id
representing the calling work-item’s position within
the tangle-group.
range_type get_group_range() const;
Returns: A range
representing the number of tangle-groups.
Note
|
This will always return a range of 1 as there can only be one
tangle-group.
|
range_type get_local_range() const;
Returns: A range
representing the number of work-items in the tangle-group.
id_type get_group_linear_id() const;
Returns: A linearized version of the id
returned by get_group_id()
.
id_type get_local_linear_id() const;
Returns: A linearized version of the id
returned by get_local_linear_id()
.
range_type get_group_linear_range() const;
Returns: A linearized version of the id
returned by get_group_range()
.
range_type get_local_linear_range() const;
Returns: A linearized version of the id
returned by get_local_range()
.
bool leader() const;
Returns: true
for exactly one work-item in the tangle-group, if the calling
work-item is the leader of the tangle-group, and false
for all other
work-items in the tangle-group. The leader of the tangle-group is guaranteed to
be the work-item for which get_local_id()
returns 0.
A tangle_group
can be used in conjunction with constructs like loops and
branches to safely communicate between all work-items executing the same
control flow.
Note
|
This differs from a ballot_group because a tangle_group requires the
implementation to track group membership. Which group type to use will depend
on a combination of implementation/backend/device and programmer preference.
|
auto sg = it.get_sub_group();
auto will_branch = sg.get_local_linear_id() % 2 == 0;
if (will_branch)
{
// wait for all work-items that took the branch to hit the barrier
auto inner = sycl::ext::oneapi::experimental::get_tangle_group(sg);
sycl::group_barrier(inner);
// reduce across subset of outer work-items that took the branch
float ix = sycl::reduce_over_group(inner, x, plus<>());
}
An opportunistic-group is a non-contiguous subset of a sub-group, representing the work-items which are executing simultaneously.
In SYCL implementations where work-items have strong forward progress guarantees (and can therefore make progress independently of other work-items in the same sub-group), it is possible that only a subset of the work-items in a sub-group executing the same control flow will execute simultaneously.
In some cases it may be helpful to capture this group and use it for opportunistic optimizations.
Opportunistic groups are created by calls to the get_opportunistic_group()
free-function. Each call to get_opportunistic_group()
returns a different
group. There are no guarantees that a group returned by
get_opportunistic_group()
will contain all work-items executing the same
control flow, nor the same set of work-items as the group returned by any
previous call to get_opportunistic_group()
.
Note
|
Creating an opportunistic group does not require a barrier or introduce
any synchronization because it is designed to capture whichever set of
work-items happen to call get_opportunistic_group() simultaneously.
|
namespace ext::oneapi::experimental::this_kernel {
opportunistic_group get_opportunistic_group();
} // namespace ext::oneapi::experimental::this_kernel
Returns: An opportunistic_group
consisting of all work-items in the same
sub-group as the calling work-item which call the function simultaneously.
namespace sycl::ext::oneapi::experimental {
class opportunistic_group {
public:
using id_type = id<1>;
using range_type = range<1>;
using linear_id_type = uint32_t;
static constexpr int dimensions = 1;
static constexpr sycl::memory_scope fence_scope =
sycl::memory_scope::sub_group;
id_type get_group_id() const;
id_type get_local_id() const;
range_type get_group_range() const;
range_type get_local_range() const;
linear_id_type get_group_linear_id() const;
linear_id_type get_local_linear_id() const;
linear_id_type get_group_linear_range() const;
linear_id_type get_local_linear_range() const;
bool leader() const;
};
}
id_type get_group_id() const;
Returns: An id
representing the index of the opportunistic-group.
Note
|
This will always be an id with all values set to 0, since there can
only be one opportunistic-group.
|
id_type get_local_id() const;
Returns: An id
representing the calling work-item’s position within
the opportunistic-group.
range_type get_group_range() const;
Returns: A range
representing the number of opportunistic-groups.
Note
|
This will always return a range of 1 as there will only be one
opportunistic-group.
|
range_type get_local_range() const;
Returns: A range
representing the number of work-items in the
opportunistic-group.
id_type get_group_linear_id() const;
Returns: A linearized version of the id
returned by get_group_id()
.
id_type get_local_linear_id() const;
Returns: A linearized version of the id
returned by get_local_linear_id()
.
range_type get_group_linear_range() const;
Returns: A linearized version of the id
returned by get_group_range()
.
range_type get_local_linear_range() const;
Returns: A linearized version of the id
returned by get_local_range()
.
bool leader() const;
Returns: true
for exactly one work-item in the opportunistic-group, if the
calling work-item is the leader of the opportunistic-group, and false
for all
other work-items in the opportunistic-group. The leader of the opportunistic
group is guaranteed to be the work-item for which get_local_id()
returns 0.
The following example shows an atomic pointer being incremented. It is expected that all the work-items in the sub_group will increment the atomic value, but we opportunistically capture the groups of work-items as they arrive to this point in the control flow.
template <sycl::memory_order Order,
sycl::memory_scope Scope,
sycl::access::address_space AddressSpace>
int atomic_aggregate_inc(sycl::sub_group sg, sycl::atomic_ref<int, Order, Scope, AddressSpace> ptr) {
// get the set of work-items that called this function simultaneously
auto active_group = sycl::ext::oneapi::experimental::this_kernel::get_opportunistic_group();
// increment the atomic once on behalf of all active work-items
int count = active_group.get_local_linear_range();
int old_value;
if (active_group.leader()) {
old_value = ptr.fetch_add(count);
}
// return the value the individual work-item might have received if it had worked alone
auto base = sycl::group_broadcast(active_group, old_value);
auto idx = active_group.get_local_linear_id();
return base + idx;
}
This non-normative section provides information about one possible implementation of this extension. It is not part of the specification of the extension’s API.
For SPIR-V backends, all user-constructed group types are expected to be
implemented using SPIR-V’s
non-uniform
instructions. fixed_size_group
functionality is expected to leverage the
optional PartitionSize
argument of those instructions. Each group type will
require slightly different usage of those instructions to ensure that distinct
groups encounter unique control flow when appropriate.
For CUDA backends, all user-constructed group types are expected to be lowered
to PTX instructions with explicit masks. The only expected difference in
implementation for the different group types is how the mask is initially
constructed. Supporting tangle_group
may require the compiler to construct
masks when encountering control flow constructs, and to pass those masks
across call boundaries.
-
Should
tangle_group
support work-groups or just sub-groups?SPIR-V "tangled instructions" include group and sub-group instructions, but it is unclear how to identify which work-items in different sub-groups are executing the same control flow (without introducing significant overhead). If we decide at a later date that
tangle_group
should support only sub-groups, we should revisit the name to avoid creating confusion.