Important
|
This specification is a draft. |
Note
|
Khronos® is a registered trademark and SYCL™ and SPIR™ are trademarks of The Khronos Group Inc. OpenCL™ is a trademark of Apple Inc. used by permission by Khronos. |
This document describes an extension enabling the declaration of local memory objects at the kernel functor scope.
Working Draft
This is a preview extension specification, intended to provide early access to a feature for review and community feedback. When the feature matures, this specification may be released as a formal extension.
Because the interfaces defined by this specification are not final and are subject to change they are not intended to be used by shipping software products.
John Pennycook, Intel (john 'dot' pennycook 'at' intel 'dot' com)
Roland Schulz, Intel (roland 'dot' schulz 'at' intel 'dot' com)
This extension provides a feature-test macro as described in the core SYCL
specification section 6.3.3 "Feature test macros". Therefore, an
implementation supporting this extension must predefine the macro
SYCL_EXT_ONEAPI_LOCAL_MEMORY
to one of the values defined in the table below.
Applications can test for the existence of this macro to determine if the
implementation supports this feature, or applications can test the macro’s
value to determine which of the extension’s APIs the implementation supports.
Value | Description |
---|---|
1 |
Initial extension version. Base features are supported. |
OpenCL provides two ways for local memory to be used in a kernel:
-
The kernel accepts a pointer in the
local
address space as an argument, and the host passes the size of the allocation to the OpenCL runtime when the kernel is launched. -
The kernel declares
local
variables in the kernel function scope.
In SYCL, programmers have two choices:
-
Local accessors created by the host, analogous to the OpenCL kernel argument mechanism.
-
Variables declared at the kernel functor scope, in hierarchical parallelism kernels.
Note that SYCL currently lags behind OpenCL when it comes to local memory allocations; in particular, work-group data parallel SYCL kernels are limited to the accessor method. This is undesirable for some architectures, where allocating local memory with a compile-time known size is required for performance.
This limitation is also undesirable from a usability point of view, since programmers have to declare an accessor outside a kernel and capture it inside the kernel functor.
This extension introduces a concept of group-local memory, with semantics
similar to OpenCL kernel-scope local
variables and C++ thread_local
variables.
The sycl::ext::oneapi::group_local_memory
and
sycl::ext::oneapi::group_local_memory_for_overwrite
functions can be used to
allocate group-local memory at the kernel functor scope of a work-group data
parallel kernel.
Note
|
The restriction that group-local variables must be defined at kernel functor scope may be lifted in a future version of this extension. |
Group-local memory is allocated in an address space accessible by all work-items in the group, and is shared by all work-items of the group.
namespace sycl {
namespace ext {
namespace oneapi {
template <typename T, typename Group, typename... Args>
multi_ptr<T, Group::address_space> group_local_memory(Group g, Args&&... args);
template <typename T, typename Group>
multi_ptr<T, Group::address_space> group_local_memory_for_overwrite(Group g);
} // namespace oneapi
} // namespace ext
} // namespace sycl
Functions | Description |
---|---|
|
Constructs an object of type All arguments in args must be the same for all work-items in the group.
|
|
Constructs an object of type
|
Note
|
The restrictions on supported types for Group and T may be lifted
in a future version of this extension.
|
This non-normative section shows some example usages of the extension.
myQueue.submit([&](handler &h) {
h.parallel_for(
nd_range<1>(range<1>(128), range<1>(32)), [=](nd_item<1> item) {
multi_ptr<int[64], access::address_space::local_space> ptr = group_local_memory<int[64]>(item.get_group());
auto& ref = *ptr;
ref[2 * item.get_local_linear_id()] = 42;
});
});
The example above creates a kernel with four work-groups, each containing 32
work-items. An int[64]
object is defined in group-local memory, and
each work-item in the work-group obtains a multi_ptr
to the same allocation.