Skip to content

Latest commit

 

History

History
1252 lines (1010 loc) · 46.2 KB

sycl_ext_oneapi_device_global.asciidoc

File metadata and controls

1252 lines (1010 loc) · 46.2 KB

sycl_ext_oneapi_device_global

Introduction

In OpenCL 2.0 and later, a user is able to allocate program scope memory which can be accessed like a C++ global variable by any kernel in an OpenCL program (cl_program). When a program is shared between multiple devices, each device receives its own unique instance of the program scope memory allocation.

This extension introduces device scoped memory allocations into SYCL that can be accessed within a kernel using syntax similar to C++ global variables, but that have unique instances per sycl::device and sycl::context. Mechanisms are provided for the host program to enqueue copies to or from the allocations on a specific device. Restrictions are placed on the types of data that can be stored within device_global allocations, particularly around copyability and constructors/destructors.

Notice

Copyright (c) 2021 - 2023 Intel Corporation. All rights reserved.

Note
Khronos® is a registered trademark and SYCL™ and SPIR™ are trademarks of The Khronos Group Inc. OpenCL™ is a trademark of Apple Inc. used by permission by Khronos.

Status

This is an experimental extension specification, intended to provide early access to features and gather community feedback. Interfaces defined in this specification are implemented in DPC++, but they are not finalized and may change incompatibly in future versions of DPC++ without prior notice. Shipping software products should not rely on APIs defined in this specification.

Contact

Artem Radzikhovskyy, Intel (artem 'dot' radzikhovskyy 'at' intel 'dot' com)

Contributors

Artem Radzikhovskyy, Intel
Michael Kinsner, Intel
Jessica Davies, Intel
Joe Garvey, Intel
Mohammad Fawaz, Intel
Tommy Hoffner, Intel
John Pennycook, Intel
Greg Lueck, Intel
Roland Schulz, Intel

Dependencies

This extension is written against the SYCL 2020 specification, revision 3.

It also depends on the SYCL_EXT_ONEAPI_PROPERTIES extension.

Overview

Note

In this document, we use device_global to indicate the proposed sycl::ext::oneapi::experimental::device_global.

The purpose of this document is to clearly describe and specify device_global and related concepts, types, and mechanisms, and to give examples and context for their usage.

Motivation

Device scope memory allocations can provide an efficient mechanism for communication between multiple invocations of a kernel, or between kernels executing on the same device and context. There are additional benefits and optimization opportunities when a device compiler has visibility into the allocation size (static sizing) and uses of the allocation.

Syntax allowing direct use of an allocation (without passing pointers or parameters through function call boundaries) can also lead to syntax simplification in some important use cases.

Examples

Two example device_global variables can be declared at namespace scope, as follows:

struct MyClass {
  bool flag;
};

using namespace sycl::ext::oneapi::experimental;

device_global<MyClass> dm1;
static device_global<int[4]> dm2;

dm1 creates an allocation on each sycl::device that contains an object of type MyClass. dm2 has internal linkage (due to static), and creates allocations containing an array of four int per device.

Uses of dm1 and dm2 in device functions are syntactically similar to uses of global variables in C++ (access directly through the namespace scope identifier), and device_global has reference wrapper-like semantics on a device. Of note, because C++ doesn’t allow for overloading of the "dot operator", a get() member of device_global allows a reference to be extracted, to which the usual dot operator may be applied as in:

sycl::queue Q;
Q.submit([&](sycl::handler& h) {
  h.single_task([=] {
    int x = 5;
    if (dm1.get().flag)
      x = dm2[0];
  });
});

For both dm1 and dm2, the MyClass and int[4] allocations on each device in the context associated with Q are zero-initialized before any non-initialization accesses occur.

Specification

Feature test macro

This extension provides a feature-test macro as described in the core SYCL specification, Section 6.3.3 "Feature test macros". Therefore, an implementation supporting this extension must predefine the macro SYCL_EXT_ONEAPI_DEVICE_GLOBAL to one of the values defined in the table below. Applications can test for the existence of this macro to determine if the implementation supports this feature, or applications can test the macro’s value to determine which of the extension’s features that the implementation supports.

Value Description

1

Initial extension version

Representation of device globals

device_global provides a mechanism to allocate device scope memory - memory which has unique underlying storage (of type T) for each sycl::device and sycl::context combination. If multiple valid device and context combinations are present then each receives its own unique underlying allocation. All kernels that reference the same device_global entity (either directly or via a pointer to its underlying object of type T) share the same allocation of that object when those kernels run on the same device and context.

device_global allocations are in the global address space, as are any underlying allocations of type T which are implicitly allocated on each device as a result of a device_global object. It is undefined behavior if the host program directly accesses a device_global or any address obtained from a device_global member function, and similarly it is undefined behavior if a device_global or address obtained on one device from a device_global member function is accessed on a different device or context. There is no mechanism to obtain addresses of or directly access a device’s device_global allocation within the host program.

A device_global on a given device and context maintains its state (address of the allocation and data within the allocation) even after the application changes the value of a specialization constant via handler::set_specialization_constant(). Additionally, a device_global maintains its state even when it is referenced from a kernel in a different kernel_bundle.

namespace sycl::ext::oneapi::experimental {
template <typename T, typename PropertyListT = empty_properties_t>
class device_global {
  ...

device_global is a class template, parameterized by the type of the underlying allocation T, and a list of properties PropertyListT. The type of the allocation T also encodes the size of the allocation for potentially multidimensional array types.

T is restricted to types that have a trivial destructor. PropertyListT enables properties to be associated with a device_global.

When compiling with C++ versions before C++20, T must also have a trivial default constructor. In this case, the allocation of type T for a given device_global is zero-initialized on a given device prior to the first access to that device_global on that device. For the purposes of this definition an access can be a direct access of the device_global in kernel code or a copy to or from that device_global enqueued to the given device.

When compiling with C++20 or later, T must have a constructor that can be constexpr evaluated, and the parameters to the device_global constructor are forwarded to the T constructor. In this case, the allocation of type T for a given device_global is initialized on a given device prior to the first access to that device_global on that device.

Properties may be specified for a device_global to provide semantic modification or optimization hint information to the compiler. See the section below for a list of the properties that are allowed.

Note

On a device, device_global has similar semantics to a reference wrapper. The dot operator (operator.) cannot be overloaded, so a get() member is provided to allow a reference to be extracted directly when needed. Some operators are declared in device_global that must be members (e.g. operator[] and operator->). Note that other operators can be overloaded by specific T as free functions, which will be selected through implicit conversion to T in device functions.

The section below and the table following describe the constructors, member functions and factory methods for device_global.

namespace sycl::ext::oneapi::experimental {

template <typename T, typename PropertyListT = empty_properties_t>
class device_global {
  using subscript_return_t =
    std::remove_reference_t<decltype(std::declval<T>()[std::ptrdiff_t{}])>;

public:
  using element_type = std::remove_extent_t<T>;

  static_assert(std::is_trivially_destructible_v<T>,
      "Type T must be trivially destructible.");

  // device_global initializes underlying T with the args argument
#if __cpp_consteval
  template <typename... Args>
  consteval explicit device_global(Args&&... args);
#else
  static_assert(std::is_trivially_default_constructible_v<T>,
                "Type T must be trivially default constructable (until C++20 "
                "consteval is supported and enabled)");

  // The underlying memory allocations of type T on devices will be
  // zero-initialized before any non-initialization accesses occur.
  device_global() = default;
#endif // __cpp_consteval

  device_global(const device_global &) = delete;
  device_global(const device_global &&) = delete;
  device_global &operator=(const device_global &) = delete;
  device_global &operator=(const device_global &&) = delete;

  template <access::decorated IsDecorated>
  multi_ptr<T, access::address_space::global_space, IsDecorated>
    get_multi_ptr() noexcept;

  template <access::decorated IsDecorated>
  multi_ptr<const T, access::address_space::global_space, IsDecorated>
    get_multi_ptr() const noexcept;

  // Access the underlying data
  operator T&() noexcept;
  operator const T&() const noexcept;

  T& get() noexcept;
  const T& get() const noexcept;

  // Enable assignments from underlying type
  device_global& operator=(const T&) noexcept;

  // Available if the operator[] is valid for objects of type T
  subscript_return_t& operator[]( std::ptrdiff_t idx ) noexcept;
  const subscript_return_t& operator[]( std::ptrdiff_t idx ) const noexcept;

  // Available if the operator-> is valid for objects of type T
  T& operator->() noexcept;
  const T& operator->() const noexcept;

  // Note that there is no need for "device_global" to define member functions for
  // operators like "++", comparison, etc.  Instead, the type "T" need only define
  // these operators as non-member functions.  Because there is an implicit conversion
  // from "device_global" to "T&", the operations can be applied to objects of type
  // "device_global<T>".

  template<typename propertyT>
  static constexpr bool has_property();

  // The return type is an unspecified internal class used to represent
  // instances of propertyT
  template<typename propertyT>
  static constexpr /*unspecified*/ get_property();
};

} // namespace sycl::ext::oneapi::experimental
Functions Description
device_global();

Constructs a device_global object, and implicit storage for T in the global address space on each device that may access it.

The storage on each device for T is zero-initialized.

T must be trivially default constructable and trivially destructible.

template <typename... Args>
consteval explicit device_global(Args&&... args);

Constructs a device_global object, and implicit storage for T in the global address space on each device that may access it.

The object of type T is initialized from the args parameter pack using list initialization as defined in the C++ specification.

T must be trivially destructible.

template <access::decorated IsDecorated>
multi_ptr<T, access::address_space::global_space, IsDecorated>
  get_multi_ptr() noexcept;

template <access::decorated IsDecorated>
multi_ptr<T, access::address_space::global_space, IsDecorated>
  get_multi_ptr() const noexcept;

Available only in device functions.

Returns a multi_ptr to the underlying T on the device. It is undefined behavior to dereference the returned pointer or any address derived from the pointer on a different device or on the host.

operator T&() noexcept;
operator const T&() const noexcept;

Available only in device functions.

Implicit conversion to a reference to the underlying T on the device. It is undefined behavior to access the reference or any address derived from it on a different device or on the host.

T& get() noexcept;
const T& get() const noexcept;

Available only in device functions.

Returns a reference to the underlying T on the device. It is undefined behavior to access the reference or any address derived from it on a different device or on the host.

device_global& operator=(const T&) noexcept;

Available only in device functions.

Enables assignment of type T to the underlying allocation on the device.

element_type& operator[]( std::ptrdiff_t idx ) noexcept;
const element_type& operator[]( std::ptrdiff_t idx ) const noexcept;

Available only in device functions.

Available only when the underlying T defines an operator[].

Indexes into the underlying T. It is undefined behavior if idx is negative.

T& operator->() noexcept;
const T& operator->() const noexcept;

Available only in device functions.

Available only when operator-> is valid for objects of type T.

Provides member access through T that is a pointer or a class which defines operator->.

template<typename propertyT>
static constexpr bool has_property();

Returns true if the PropertyListT contains the property specified by propertyT. Returns false if it does not. Available only if sycl::is_property_key_of_v<propertyT, sycl::ext::oneapi::experimental::device_global> is true.

template<typename propertyT>
static constexpr auto get_property();

Returns an object of the class used to represent the value of property propertyT. Must produce a compiler diagnostic if PropertyListT does not contain a propertyT property. Available only if sycl::is_property_key_of_v<propertyT, sycl::ext::oneapi::experimental::device_global> is true.

Restrictions on creating device global objects

There are restrictions on how the application can create objects of type device_global. Applications that violate these restrictions are ill-formed.

  • The application may declare a variable of type device_global in the following ways:

    • As a variable at namespace scope, or

    • As a static member variable, but only if the member variable is publicly accessible from namespace scope.

    The application must not create an object of type device_global in any other way. (E.g. variables with automatic storage duration or objects created via new are not allowed.)

  • The device_global variable must not itself be an array. The underlying type T may be an array type, but the device_global variable itself must not be an array.

  • The device_global variable must not be shadowed by another identifier X which has the same name and is declared in an inline namespace, such that the device_global variable is no longer accessible after the declaration of X.

  • If the device_global variable is declared in a namespace, none of the enclosing namespace names N may be shadowed by another identifier X which has the same name as N and is declared in an inline namespace, such that N is no longer accessible after the declaration of X.

Note

The expectation is that some implementations may conceptually insert code at the end of a translation unit which references each device_global variable that is declared in that translation unit. The restrictions listed above make this possible by ensuring that these variables are accessible at the end of the translation unit.

The following example illustrates some of these restrictions:

#include <sycl/sycl.hpp>
using namespace sycl::ext::oneapi::experimental;

device_global<int> a;           // OK
static device_global<int> b;    // OK
inline device_global<int> c;    // OK

struct Foo {
  static device_global<int> d;  // OK
};
device_global<int> Foo::d;

struct Bar {
  device_global<int> e;         // ILLEGAL: non-static member variable not
};                              // allowed

struct Baz {
 private:
  static device_global<int> f;  // ILLEGAL: not publicly accessible from
};                              // namespace scope
device_global<int> Baz::f;

device_global<int[4]> g;        // OK
device_global<int> h[4];        // ILLEGAL: array of "device_global" not
                                // allowed

device_global<int> same_name;   // OK
namespace foo {
  device_global<int> same_name; // OK
}
namespace {
  device_global<int> same_name; // OK
}
inline namespace other {
  device_global<int> same_name; // ILLEGAL: shadows "device_global" variable
}                               // with same name in enclosing namespace scope
inline namespace other2 {
  namespace foo {               // ILLEGAL: namespace name shadows "::foo"
  }                             // namespace which contains "device_global"
                                // variable.
}

Constant initialization of device_globals

When compiling with C++20, constant compile-time initialization for device_globals is supported. The following example shows a few examples of what this would look like:

// Constant int and array of int device_globals
device_global<int> no_device_image_dg {3};
device_global<int, decltype(properties(device_image_scope))> dg_int{5};
device_global<int[3], decltype(properties(device_image_scope))>
   dg_int_arr{5, 2, 3};

// Constant char and array of char device_globals
device_global<char, decltype(properties(device_image_scope))> dg_char{'f'};
device_global<char[3], decltype(properties(device_image_scope))>
   dg_char_arr{'d', '4', 'S'};

// Multidimensional array of integers
device_global<int[3][2], decltype(properties(device_image_scope))>
    dg_multi_dim_arr{3, 4, 5, 6, 7, 8};

// Constant float and array of float device_globals
device_global<float, decltype(properties(device_image_scope))> dg_float{4.5};
device_global<float[6], decltype(properties(device_image_scope))>
   dg_float_arr{4.5, 2.1, 3.5, 9.33, 2.33, 2.1};

// Constant double and array of double device_globals
device_global<double, decltype(properties(device_image_scope))>
   dg_double{3.56543};
device_global<double[3], decltype(properties(device_image_scope))>
   dg_double_arr{2.2341234, 233.23423, 236.52321};

// Constant bool and array of bool device_globals
device_global<bool, decltype(properties(device_image_scope))> dg_bool{true};
device_global<bool[3], decltype(properties(device_image_scope))>
   dg_bool_arr{true, false, true};

// Constant struct and array of struct device_globals
struct TestStruct {
  int field1;
  bool field2;
  float field3;
  int field4[4];
};
constexpr TestStruct TS1(5, true, 2.1, {1, 2, 3, 4});
constexpr TestStruct TS2(7, false, 2.4, {1, 2, 3, 4});
constexpr TestStruct TS3(6, false, 4.34534, {5, 6, 7, 8});
device_global<TestStruct, decltype(properties(device_image_scope))>
   dg_struct{TS3};
device_global<TestStruct[2], decltype(properties(device_image_scope))>
   dg_struct_arr{TS1, TS2};

Properties for device global variables

The device_global class supports several compile-time-constant properties. If specified, these properties are included in the PropertyListT template parameter as shown in this example:

using namespace sycl::ext::oneapi::experimental;

device_global<MyClass, decltype(properties(device_image_scope))> dm1;
device_global<int[4], decltype(properties(host_access_read))> dm2;

The following code synopsis shows the set of supported properties, and the following table describes their effect.

namespace sycl::ext::oneapi::experimental {

struct device_image_scope_key {
  using value_t = property_value<device_image_scope_key>;
};

enum class host_access_enum : /* unspecified */ {
  read,
  write,
  read_write,
  none
};

struct host_access_key {
  template <host_access_enum Access>
  using value_t =
      property_value<host_access_key,
                     std::integral_constant<host_access_enum, Access>>;
};

enum class init_mode_enum : /* unspecified */ {
  reprogram,
  reset
};

struct init_mode_key {
  template <init_mode_enum Trigger>
  using value_t =
      property_value<init_mode_key,
                     std::integral_constant<init_mode_enum, Trigger>>;
};

struct implement_in_csr_key {
  template <bool Enable>
  using value_t =
      property_value<implement_in_csr_key, std::bool_constant<Enable>>;
};

inline constexpr device_image_scope_key::value_t device_image_scope;

template <host_access_enum Access>
inline constexpr host_access_key::value_t<Access> host_access;
inline constexpr host_access_key::value_t<host_access_enum::read>
    host_access_read;
inline constexpr host_access_key::value_t<host_access_enum::write>
    host_access_write;
inline constexpr host_access_key::value_t<host_access_enum::read_write>
    host_access_read_write;
inline constexpr host_access_key::value_t<host_access_enum::none>
    host_access_none;

template <init_mode_enum Trigger>
inline constexpr init_mode_key::value_t<Trigger> init_mode;
inline constexpr init_mode_key::value_t<init_mode_enum::reprogram>
    init_mode_reprogram;
inline constexpr init_mode_key::value_t<init_mode_enum::reset> init_mode_reset;

template <bool Enable>
inline constexpr implement_in_csr_key::value_t<Enable> implement_in_csr;
inline constexpr implement_in_csr_key::value_t<true> implement_in_csr_on;
inline constexpr implement_in_csr_key::value_t<false> implement_in_csr_off;

template <> struct is_property_key<device_image_scope_key> : std::true_type {};
template <> struct is_property_key<host_access_key> : std::true_type {};
template <> struct is_property_key<init_mode_key> : std::true_type {};
template <> struct is_property_key<implement_in_csr_key> : std::true_type {};

template <typename T, typename PropertyListT>
struct is_property_key_of<device_image_scope_key, device_global<T, PropertyListT>>
  : std::true_type {};
template <typename T, typename PropertyListT>
struct is_property_key_of<host_access_key, device_global<T, PropertyListT>>
  : std::true_type {};
template <typename T, typename PropertyListT>
struct is_property_key_of<init_mode_key, device_global<T, PropertyListT>>
  : std::true_type {};
template <typename T, typename PropertyListT>
struct is_property_key_of<implement_in_csr_key, device_global<T, PropertyListT>>
  : std::true_type {};

} // namespace sycl::ext::oneapi::experimental
Property Description
device_image_scope

This property is most useful for kernels that are submitted to an FPGA device, but it may be used with any kernel. Normally, a single instance of a device global variable is allocated for each device, and that instance is shared by all kernels that belong to the same context and are submitted to the same device, regardless of which device image contains the kernel. When this property is specified, it is an assertion by the user that on a given device a given device_global decorated with this property is only ever accessed in a single device_image. An implementation may be able to optimize accesses to the device global when this property is specified (especially on an FPGA device), but the user must be aware of which device image contains the kernels that use the variable.

A device global that is decorated with this property may not be accessed from kernels that reside in different device images, either by direct reference to the variable or indirectly by passing the variable’s address to another kernel. The implementation is required to diagnose an error if the kernels that directly access a variable do not all reside in the same device image, however no diagnostic is required for an indirect access from another device image.

A device global variable is guaranteed to be initialized for a device prior to the first time it is accessed (whether from a kernel or a copy operation). Device globals may also be re-initialized at implementation-defined times if multiple device images are used on the same device. To avoid unexpected re-initializations, applications should ensure that all kernels that are enqueued to a device D come from the same device image. In addition, applications should ensure that all device global copy operation enqueued to device D correspond to that same device image.

The application may copy to or from a device global even before any kernel in the device image is submitted to the device. Doing so causes the device global to be initialized immediately before the copy happens. (Typically, the copy operation causes the device image to be loaded onto the device also.) As a result, copying from a device global returns the initial value if the device image that contains the variable is not currently loaded onto the device.

host_access

This property provides an assertion by the user telling the implementation whether the host code copies to or from the device global. As a result, the implementation may be able to perform certain optimizations. Although this property may be used with any device, it is generally only beneficial when used on FPGA devices.

The following values are supported:

  • read: The user asserts that the host code may copy from (read) the variable, but it will never copy to (write) it. For an FPGA device, only a read port is exposed.

  • write: The user asserts that the host code may copy to (write) the variable, but it never copy from (read) it. For an FPGA device, only a write port is exposed.

  • none: The user asserts that the host code will never copy to or copy from the variable. For an FPGA device, no external ports are exposed.

  • read_write: The user provides no assertions, and the host code may either copy to or copy from the variable. This is the default. For an FPGA device, a read/write port is exposed.

init_mode

This property is only meaningful when used with an FPGA device. It is ignored for other devices. The following values are supported:

  • reprogram: Initialization is performed by reprogramming the device. This may require more frequent reprogramming but may reduce area.

  • reset: Initialization is performed by sending a reset signal to the device. This may increase area but may reduce reprogramming frequency.

If the init_mode property is not specified, the default behavior is equivalent to one of the values listed above, but the choice is implementation defined.

implement_in_csr

This property is only meaningful when used with an FPGA device. It is ignored for other devices. The following values are supported:

  • true: Access to this memory is done through a CSR interface shared with kernel arguments.

  • false: Access to this memory is done through a dedicated interface.

If the implement_in_csr property is not specified, the default behavior is equivalent to one of the values listed above, but the choice is implementation defined.

Note

As stated above, the user must understand which device image contains a kernel in order to use the device_image_scope property. Each implementation may have its own rules that determine when two kernels are bundled together into the same device image. For DPC++ two kernels K1 and K2 will be bundled into the same device image when both of the following conditions are satisfied:

  • The translation unit containing K1 and the translation unit containing K2 must both be compiled with -fsycl-targets=X -fsycl-assume-all-kernels-run-on-targets where the target X is the same in both compilations. (A list of targets may also be specified such as -fsycl-targets=X,Y. In this case the list must be the same in both compilations.)

  • The application must be linked with -fsycl-device-code-split such that the kernels K1 and K2 are not split into different device images. For example, if K1 and K2 reside in the same translation unit, -fsycl-device-code-split=per_source will guarantee that they are bundled together in the same device image. If they reside in different translation units, -fsycl-device-code-split=off will guarantee that they reside in the same device image.

In addition, the following factors also affect how kernels are bundled into device images:

  • Kernels that are online-compiled using sycl::kernel_bundle may reside in different device images if they are compiled from different kernel_bundle objects.

  • A kernel that uses specialization constants may have a new instance in a new device image each time the application sets a new value for the specialization constant. However, this happens only if the device supports native specialization constants, which is not the case for FPGA devices.

Relax language restrictions for SYCL device functions

SYCL 2020 restrictions must be relaxed to allow device_global to be used within device functions without being const or constexpr and without being zero-initialized or constant-initialized. This is achieved by adding device_global exceptions to the following point in Section 5.4 "Language restrictions for device functions". The modified restriction is:

  • Variables with static storage duration that are odr-used inside a device function, must be const or constexpr and zero-initialized or constant-initialized, except if the variable is of type device_global in which case it can be odr-used inside a device function without being const/constexpr or zero-/constant-initialized.

    • Amongst other things, this restriction makes it illegal for a device function to access a global variable that isn’t const or constexpr unless the variable is of type device_global.

Referencing a device global defined in another translation unit

This extension broadens the use of the SYCL_EXTERNAL macro to apply also to device global variables. If the implementation defines the SYCL_EXTERNAL macro, device code in one translation unit may reference a device global variable that is defined in a different translation unit so long as the declaration of the variable in both translation units uses SYCL_EXTERNAL. For example:

// In one translation unit
#include <sycl/sycl.hpp>
using namespace sycl::ext::oneapi::experimental;

SYCL_EXTERNAL device_global<int> Foo;  // definition (also a declaration)

// In another translation unit
#include <sycl/sycl.hpp>
using namespace sycl::ext::oneapi::experimental;
using namespace sycl;

SYCL_EXTERNAL extern device_global<int> Foo;  // declaration

void bar(queue q) {
  q.single_task([=] {
    Foo = 42;
  });
}

Add new copy and memcpy members to the queue class

Add the following functions to the sycl::queue interface described in Section 4.6.5.1 of the SYCL 2020 specification.

Note

A pointer to the allocation within a device_global may not be obtained by the host program (can only be extracted in device functions because allocations are per device), so pointer arithmetic can therefore not be used in the host program to define copy/memcpy offsets into data. startIndex and offset arguments are provided in these interfaces to allow offsetting without pointer arithmetic.

namespace sycl {
class queue {
public:
  // Copy to device_global
  template <typename T, typename PropertyListT>
  event copy(const std::remove_all_extents_t<T> *src,
    device_global<T, PropertyListT>& dest,
    size_t count = sizeof(T) / sizeof(std::remove_all_extents_t<T>),
    size_t startIndex = 0);

  template <typename T, typename PropertyListT>
  event copy(const std::remove_all_extents_t<T> *src,
    device_global<T, PropertyListT>& dest,
    size_t count, size_t startIndex, event depEvent);

  template <typename T, typename PropertyListT>
  event copy(const std::remove_all_extents_t<T> *src,
    device_global<T, PropertyListT>& dest,
    size_t count, size_t startIndex,
    const std::vector<event> &depEvents);

  // Copy from device_global
  template <typename T, typename PropertyListT>
  event copy(const device_global<T, PropertyListT>& src,
    std::remove_all_extents_t<T> *dest,
    size_t count = sizeof(T) / sizeof(std::remove_all_extents_t<T>),
    size_t startIndex = 0);

  template <typename T, typename PropertyListT>
  event copy(const device_global<T, PropertyListT>& src,
    std::remove_all_extents_t<T> *dest,
    size_t count, size_t startIndex, event depEvent);

  template <typename T, typename PropertyListT>
  event copy(const device_global<T, PropertyListT>& src,
    std::remove_all_extents_t<T> *dest,
    size_t count,size_t startIndex, const std::vector<event> &depEvents);

  // memcpy to device_global
  template <typename T, typename PropertyListT>
  event memcpy(device_global<T, PropertyListT>& dest,
    const void *src, size_t numBytes = sizeof(T), size_t offset = 0);

  template <typename T, typename PropertyListT>
  event memcpy(device_global<T, PropertyListT>& dest,
    const void *src, size_t numBytes,
    size_t offset, event depEvent);

  template <typename T, typename PropertyListT>
  event memcpy(device_global<T, PropertyListT>& dest,
    const void *src, size_t numBytes,
    size_t offset, const std::vector<event> &depEvents);

  // memcpy from device_global
  template <typename T, typename PropertyListT>
  event memcpy(void *dest,
    const device_global<T, PropertyListT>& src,
    size_t numBytes = sizeof(T), size_t offset = 0);

  template <typename T, typename PropertyListT>
  event memcpy(void *dest,
    const device_global<T, PropertyListT>& src, size_t numBytes,
    size_t offset, event depEvent);

  template <typename T, typename PropertyListT>
  event memcpy(void *dest,
    const device_global<T, PropertyListT>& src, size_t numBytes,
    size_t offset, const std::vector<event> &depEvents);
};
} // namespace sycl

Add the following function descriptions to the sycl::queue interface description table in Section 4.6.5.1 of the SYCL 2020 specification.

Function Definition Function type
template <typename T, typename PropertyListT>
event copy(const std::remove_all_extents_t<T> *src,
  device_global<T, PropertyListT>& dest,
  size_t count = sizeof(T) / sizeof(std::remove_all_extents_t<T>),
  size_t startIndex = 0);

Explicit copy

template <typename T, typename PropertyListT>
event copy(const std::remove_all_extents_t<T> *src,
  device_global<T, PropertyListT>& dest,
  size_t count, size_t startIndex, event depEvent);

Explicit copy

template <typename T, typename PropertyListT>
event copy(const std::remove_all_extents_t<T> *src,
  device_global<T, PropertyListT>& dest,
  size_t count, size_t startIndex, const std::vector<event> &depEvents);

Explicit copy

template <typename T, typename PropertyListT>
event copy(const device_global<T, PropertyListT>& src,
  std::remove_all_extents_t<T> *dest,
  size_t count = sizeof(T) / sizeof(std::remove_all_extents_t<T>),
  size_t startIndex = 0);

Explicit copy

template <typename T, typename PropertyListT>
event copy(const device_global<T, PropertyListT>& src,
  std::remove_all_extents_t<T> *dest,
  size_t count, size_t startIndex, event depEvent);

Explicit copy

template <typename T, typename PropertyListT>
event copy(const device_global<T, PropertyListT>& src,
  std::remove_all_extents_t<T> *dest,
  size_t count, size_t startIndex, const std::vector<event> &depEvents);

Explicit copy

template <typename T, typename PropertyListT>
event memcpy(device_global<T, PropertyListT>& dest,
  const void *src, size_t numBytes = sizeof(T), size_t offset = 0);

Explicit copy

template <typename T, typename PropertyListT>
event memcpy(device_global<T, PropertyListT>& dest,
  const void *src, size_t numBytes,
  size_t offset, event depEvent);

Explicit copy

template <typename T, typename PropertyListT>
event memcpy(device_global<T, PropertyListT>& dest,
  const void *src, size_t numBytes,
  size_t offset, const std::vector<event> &depEvents);

Explicit copy

template <typename T, typename PropertyListT>
event memcpy(void *dest,
  const device_global<T, PropertyListT>& src,
  size_t numBytes = sizeof(T), size_t offset = 0);

Explicit copy

template <typename T, typename PropertyListT>
event memcpy(void *dest,
  const device_global<T, PropertyListT>& src, size_t numBytes,
  size_t offset, event depEvent);

Explicit copy

template <typename T, typename PropertyListT>
event memcpy(void *dest,
  const device_global<T, PropertyListT>& src, size_t numBytes,
  size_t offset, const std::vector<event> &depEvents);

Explicit copy

Add new copy and memcpy members to the handler class

Add the following functions to the sycl::handler interface described in Section 4.9.4.3 of the SYCL 2020 specification.

Add to Table 130, "Member functions of the handler class".

Member Function Description
template <typename T, typename PropertyListT>
void copy(const std::remove_all_extents_t<T> *src,
  device_global<T, PropertyListT>& dest,
  size_t count = sizeof(T) / sizeof(std::remove_all_extents_t<T>),
  size_t startIndex = 0);

T must be device copyable.

Not available if PropertyListT contains the host_access property with read or none assertions.

Copies count elements of type std::remove_all_extents_t<T> from the pointer src to the device_global dest, starting at startIndex elements of dest. src may be either a host or USM pointer.

If count and startIndex would cause data to be written beyond the end of the variable dest, the implementation throws an exception with the errc::invalid error code.

If PropertyListT contains the device_image_scope property and the dest variable exists in more than one device image for this queue’s device, the implementation throws an exception with the errc::invalid error code.

If PropertyListT contains the device_image_scope property, at least one kernel in the device image containing the dest variable must access the dest variable. If this is not the case, the implementation throws an exception with the errc::kernel_not_supported error code.

template <typename T, typename PropertyListT>
void copy(const device_global<T, PropertyListT>& src,
  std::remove_all_extents_t<T> *dest,
  size_t count = sizeof(T) / sizeof(std::remove_all_extents_t<T>),
  size_t startIndex = 0);

T must be device copyable.

Not available if PropertyListT contains the host_access property with write or none assertions.

Copies count elements of type std::remove_all_extents_t<T> from the device_global src to the pointer dest, starting at startIndex elements of src. dest may be either a host or USM pointer.

If count and startIndex would cause data to be read beyond the end of the variable src, the implementation throws an exception with the errc::invalid error code.

If PropertyListT contains the device_image_scope property and the src variable exists in more than one device image for this queue’s device, the implementation throws an exception with the errc::invalid error code.

If PropertyListT contains the device_image_scope property, at least one kernel in the device image containing the dest variable must access the dest variable. If this is not the case, the implementation throws an exception with the errc::kernel_not_supported error code.

template <typename T, typename PropertyListT>
void memcpy(device_global<T, PropertyListT>& dest,
  const void *src, size_t numBytes = sizeof(T), size_t offset = 0);

T must be device copyable.

Not available if PropertyListT contains the host_access property with read or none assertions.

Copies count bytes from the pointer src to the device_global dest, starting at offset bytes. src may be either a host or USM pointer.

If numBytes and offset would cause data to be written beyond the end of the variable dest, the implementation throws an exception with the errc::invalid error code.

If PropertyListT contains the device_image_scope property and the dest variable exists in more than one device image for this queue’s device, the implementation throws an exception with the errc::invalid error code.

If PropertyListT contains the device_image_scope property, at least one kernel in the device image containing the dest variable must access the dest variable. If this is not the case, the implementation throws an exception with the errc::kernel_not_supported error code.

template <typename T, typename PropertyListT>
void memcpy(void *dest,
  const device_global<T, PropertyListT>& src,
  size_t numBytes = sizeof(T), size_t offset = 0);

T must be device copyable.

Not available if PropertyListT contains the host_access property with write or none assertions.

Copies count bytes from the device_global src to the pointer dest, starting at offset bytes. dest may be either a host or USM pointer.

If numBytes and offset would cause data to be read beyond the end of the variable src, the implementation throws an exception with the errc::invalid error code.

If PropertyListT contains the device_image_scope property and the src variable exists in more than one device image for this queue’s device, the implementation throws an exception with the errc::invalid error code.

If PropertyListT contains the device_image_scope property, at least one kernel in the device image containing the dest variable must access the dest variable. If this is not the case, the implementation throws an exception with the errc::kernel_not_supported error code.

Note

As specified above, the copy and memcpy functions throw an exception if the global variable has the device_image_scope property and exists in more than one device image for the queue’s device. This condition could occur if the application submits a kernel referencing the variable to the same device with different values for a specialization constant (when the device supports specialization constants natively). This condition could also occur if the application submits the same kernel from more than one kernel bundle.

Non-normative: Implementation hints

device_global prioritizes usability over simplicity of implementation, and therefore adds requirements such as (1) that contents and addresses of the allocation on each device remain stable across changes to specialization constant values, and (2) that the allocation be accessible across device_image on the same device. These requirements mean that the semantics of device_global do not match the semantics of SPIR-V module scope variables, and therefore may not be implementable exclusively using the SPIR-V feature in existing SPIR-V consuming implementations.

Also note that there are no restrictions on passing (and subsequent dereferencing) of pointers obtained on a device from a device_global, between kernels on the same device, including through storage to memory.

Issues

1) Can sycl::atomic_ref be used with device_global?
Resolved: Yes, but only on the device side. There is no visibility/communication across devices because each device receives a unique allocation of type T underlying the device_global. There is no way for an atomic_ref associated with the allocation to be created in host code because there is no way to extract a pointer or reference in host code (only copy/memcpy).

2) Should we restrict device_global to static storage duration, and if so how?
Resolved: Yes, through similar language as specialization_id. Moreover restricted to namespace scope, because it is expensive to implement function scope statics. This could change if a compelling use case arises that needs function scope static support.

3) Should the returned multi_ptr default to decorated or an undecorated?
Resolved: No default - follow convention on this set by multi_ptr

4) Is a mechanism needed that can mark device accesses as read only, while allowing for host write access?
Resolved: No known compelling use cases at this point.

5) Are there important use cases that require arbitrary destructors to be supported by device_global?
Resolved: No important cases known at this time. May loosen restriction in the future.

Revision History

Rev Date Author Changes

1

2021-06-11

Artem Radzikhovskyy

Initial review version

2

2021-08-01

Mike Kinsner

Restrict to trivial default constructors for first release, change from pointer to reference semantics, swap order of arguments in copy functions, update and clarify wording, remove factory functions.

3

2023-09-14

Justin Rosner

Adding consteval constructor to allow for constant compile-time initialization of device_globals with the device_image_scope property.