Skip to content

Latest commit

 

History

History
2689 lines (2166 loc) · 103 KB

sycl_ext_oneapi_bindless_images.asciidoc

File metadata and controls

2689 lines (2166 loc) · 103 KB

sycl_ext_oneapi_bindless_images

Notice

Copyright © Codeplay. All rights reserved.

Khronos® is a registered trademark and SYCL™ and SPIR™ are trademarks of The Khronos Group Inc. OpenCL™ is a trademark of Apple Inc. used by permission by Khronos.

Dependencies

This extension is written against the SYCL 2020 revision 6 specification. All references below to the "core SYCL specification" or to section numbers in the SYCL specification refer to that revision.

Status

This is an experimental extension specification, intended to provide early access to features and gather community feedback. Interfaces defined in this specification are implemented in DPC++, but they are not finalized and may change incompatibly in future versions of DPC++ without prior notice. Shipping software products should not rely on APIs defined in this specification.

Backend support status

This extension is currently implemented in DPC++ only for GPU devices and only when using the CUDA backend. Attempting to use this extension in kernels that run on other devices or backends will not work. Be aware that the compiler may not be able to issue a diagnostic to warn you if this happens.

Overview

Images in SYCL 1.2.1 were designed to work with OpenCL. SYCL 2020 tried to make them more versatile by splitting the image type into sampled and unsampled images. This enabled SYCL 2020 images to work better with other backends. However, SYCL 2020 images still didn’t quite meet user expectations. There was feedback about various use cases where the current model falls short (see examples at the end of this document for some of the use cases).

One of the key issues is requesting access to arbitrary images through handles, and not accessors. Accessing images through handles instead of accessors grants much more flexibility to the user, at the expense of automatic data dependency tracking. Bypassing accessors allows users to implement programs where the number of images is not known at compile-time, such as a texture atlas where one image holds references to other images. This kind of feature is impossible to implement with the accessor model outlined in the core specification.

These shortcomings are why we propose a new extension for SYCL 2020 images. Per our proposal, users would be able to separate memory allocation for the image from the actual image creation. Images will be represented by opaque handle types that can be passed directly into a kernel without requesting access. In many ways, this model more closely resembles the USM model when accessing data on the device, but it’s specialized for dealing with images.

The proposed model does not replace SYCL 2020 images, it is instead meant as building blocks for implementing SYCL 2020 images on top of it.

In addition to bindless images, this document also proposes an interoperability extension providing functionality to allow users to import external memory and semaphore objects from other APIs, such as Vulkan or DirectX.

Importing memory allows it to be shared between APIs without the need to duplicate allocations and perform multiple copies between host and device to ensure that said memory is kept uniform across those APIs at all times.

Importing semaphores will also allow SYCL to schedule command groups and queue operations that depend on completion of GPU commands submitted by external APIs.

Note

The interoperability outlined in this document concerns only the importing of external API objects into the SYCL runtime. We do not expose exportation of SYCL objects to external APIs. Interoperability capabilities vary between APIs. For example, CUDA allows the import of external memory and semaphores, but does not allow export of its own resources.

Specification

Feature test macro

This extension provides a feature-test macro as described in the core SYCL specification. An implementation supporting this extension must predefine the macro SYCL_EXT_ONEAPI_BINDLESS_IMAGES to one of the values defined in the table below. Applications can test for the existence of this macro to determine if the implementation supports this feature, or applications can test the macro’s value to determine which of the extension’s features the implementation supports.

Rev Description

1

Initial draft of the proposal

2

Second revision of the proposal

3

Third revision of the proposal

4

Fourth revision of the proposal

See the revision history at the bottom of this document for features added in each revision.

Querying bindless image support

We provide the following device queries to retrieve information on whether a SYCL implementation provides support for various bindless image features.

The device aspects for these queries are:

Device descriptor Description

aspect::ext_oneapi_bindless_images

Indicates if the device supports creation of bindless images backed by the image_mem and image_mem_handle APIs.

aspect::ext_oneapi_bindless_images_shared_usm

Indicates if the device supports the creation of bindless images backed by shared USM memory.

aspect::ext_oneapi_bindless_images_1d_usm

Indicates if the device supports creation of 1D bindless images backed by USM.

aspect::ext_oneapi_bindless_images_2d_usm

Indicates if the device supports creation of 2D bindless images backed by USM.

Note

Not all SYCL backends may provide support for bindless images constructed from USM memory with all dimensions. As an example, CUDA does not have native support for 3D image resources constructed from USM. In the future, some backends may support this, and this proposal may be updated to allow creation of 3D USM images.

Image descriptor

namespace sycl::ext::oneapi::experimental {

enum class image_channel_order : /* unspecified */ {
  a,
  r,
  rx,
  rg,
  rgx,
  ra,
  rgb,
  rgbx,
  rgba,
  argb,
  bgra,
  intensity,
  luminance,
  abgr,
};

enum class image_channel_type : /* unspecified */ {
  snorm_int8,
  snorm_int16,
  unorm_int8,
  unorm_int16,
  signed_int8,
  signed_int16,
  signed_int32,
  unsigned_int8,
  unsigned_int16,
  unsigned_int32,
  fp16,
  fp32,
};

enum class image_type : /* unspecified */ {
  standard,
  mipmap,
  array,
  cubemap,
};

struct image_descriptor {
  size_t width;
  size_t height;
  size_t depth;
  image_channel_type channel_type;
  image_channel_order channel_order;
  image_type type;
  unsigned int num_levels;
  unsigned int array_size;

  image_descriptor(sycl::range<1> dims, image_channel_order channel_order,
                   image_channel_type channel_type,
                   image_type type = image_type::standard,
                   unsigned int num_levels = 1, unsigned int array_size = 1);

  image_descriptor(sycl::range<2> dims, image_channel_order channel_order,
                   image_channel_type channel_type,
                   image_type type = image_type::standard,
                   unsigned int num_levels = 1, unsigned int array_size = 1);

  image_descriptor(sycl::range<3> dims, image_channel_order channel_order,
                   image_channel_type channel_type,
                   image_type type = image_type::standard,
                   unsigned int num_levels = 1, unsigned int array_size = 1);

  image_descriptor get_mip_level_desc(unsigned int level) const;

  void verify() const;
};

}

The image descriptor represents the image dimensions, channel type, and channel order. An image_type member is also present to allow for implementation of mipmapped, image array, and cubemapped images.

The image_descriptor shall be default constructible and follow by-value semantics.

Note

Additional future `image_type`s may include combined image types like "mipmapped cubemap".

Note that image_channel_type and image_channel_order existed in SYCL 1.2.1, but were removed in SYCL 2020 in favor of a single, unified enum class. We propose separating them again to enable better flexibility and to avoid combinatorial complexity.

The verify member function is available to check the validity of the image descriptor against the limitations outlined below. If the given descriptor is deemed invalid, then a sycl::exception will be thrown with error code sycl::errc::invalid.

For the standard image type, the value of num_levels and array_size must both be 1.

The type member will inform the implementation of the type of image to create, allocate, or free.

Only mipmap image types support more than one level. For mipmap images, the member function get_mip_level_desc will return an image_descriptor for a given level of a mipmap, with valid dimension values for that level, and the type of the returned image_descriptor will be image_type::standard.

Only array image types support more than one array layer.

Allocating image memory

The process of creating an image is two-fold: allocate an image’s memory, then create an image handle from the allocation. Allocation of image memory can be achieved in two ways.

Allocating non-USM image memory

namespace sycl::ext::oneapi::experimental {

struct image_mem_handle {
  using raw_handle_type = /* implementation defined */;
  raw_handle_type raw_handle;
}

class image_mem {
public:
  image_mem();
  image_mem(const image_mem &rhs);
  image_mem(image_mem &&rhs) noexcept;

  image_mem(const image_descriptor &imageDesc,
            const sycl::device &syclDevice,
            const sycl::context &syclContext);
  image_mem(const image_descriptor &imageDesc,
            const sycl::queue &syclQueue);

  ~image_mem();

  image_mem &operator=(image_mem &rhs);
  image_mem &operator=(image_mem &&rhs) noexcept;

  bool operator==(const image_mem &rhs) const;
  bool operator!=(const image_mem &rhs) const;

  image_mem_handle get_handle() const;
  image_descriptor get_descriptor() const;
  sycl::device get_device() const;
  sycl::context get_context() const;

  sycl::range<3> get_range() const;
  sycl::image_channel_type get_image_channel_type() const;
  sycl::image_channel_type get_image_channel_order() const;
  unsigned int get_image_num_channels() const;
  image_type get_type() const;

  image_mem_handle get_mip_level_mem_handle(unsigned int level) const;
};

image_mem_handle alloc_image_mem(const image_descriptor &imageDesc,
                                 const sycl::device &syclDevice,
                                 const sycl::context &syclContext);
image_mem_handle alloc_image_mem(const image_descriptor &imageDesc,
                                 const sycl::queue &syclQueue);

void free_image_mem(image_mem_handle memHandle,
                    image_type imageType,
                    const sycl::device &syclDevice,
                    const sycl::context &syclContext);
void free_image_mem(image_mem_handle memHandle,
                    image_type imageType,
                    const sycl::queue &syclQueue);
}

The first method of allocating device memory for images is through alloc_image_mem. This takes a sycl::device, sycl::context, and image_descriptor to allocate device memory, with the appropriate image type and size based on the image_descriptor. Alternatively, we can also pass a sycl::queue instead of both sycl::device and sycl::context.

Memory allocated in this way requires the user to free that memory after all operations using the memory are completed and no more operations operating on the memory will be scheduled. This is done using free_image_mem. An image_type should be passed to free_image_mem to inform the implementation of the type of memory to be freed.

The second method involves the image_mem class, which is a RAII class wrapper that performs allocation and deallocation of device memory.

The default constructor does not allocate any memory on the device and the resulting image_mem object is in an uninitialized state.

the constructor is a wrapper for alloc_image_mem functionality. The destructor is a wrapper for free_image_mem functionality.

image_mem also provides some functions to get various properties of the image memory allocation such as the image range, channel type, channel order, number of channels, number of levels, and image type.

In the case where a mipmap has been allocated, get_mip_level_mem_handle can be used to return an image_mem_handle to a specific level of the mipmap. This can then be used to copy data to that specific level or create an image handle based on that level.

Note that the handle type image_mem_handle::raw_handle_type is an opaque type, and the handle cannot be dereferenced on the host. The layout of the memory is backend-specific, and may be an optimized layout, e.g. tile swizzle patterns.

The image_mem class must follow Common Reference Semantics as outlined by the core SYCL 2020 specification.

The image_mem class is not a valid kernel argument.

If the construction of the image_mem class fails, a sycl::exception with error code sycl::errc::memory_allocation will be thrown.

Similarly, if alloc_image_mem or free_image_mem fail, a sycl::exception with error code sycl::errc::memory_allocation will be thrown.

Note

In the DPC++ CUDA backend, image_mem will allocate/deallocate a CUarray type (or CUmipmappedArray in the case of mipmap images).

Getting image information from image_mem_handle

Extension functions are provided to retrieve information about images allocated using the image_mem_alloc function. These are similar to the member functions provided by image_mem. However, since the image_mem_handle is a minimal struct representing just the opaque handle the underlying memory object, there is some information that we cannot retrieve from it, namely the image_type, image_channel_order, the sycl::context or sycl::device the memory was allocated in, and the image_descriptor used to allocate the memory.

namespace sycl::ext::oneapi {

sycl::range<3> get_image_range(const image_mem_handle memHandle,
                               const sycl::device &syclDevice,
                               const sycl::context &syclContext);
sycl::range<3> get_image_range(const image_mem_handle memHandle,
                               const sycl::queue &syclQueue);

sycl::image_channel_type
get_image_channel_type(const image_mem_handle memHandle,
                       const sycl::device &syclDevice,
                       const sycl::context &syclContext);
sycl::image_channel_type
get_image_channel_type(const image_mem_handle memHandle,
                       const sycl::queue &syclQueue);

unsigned int get_image_num_channels(const image_mem_handle memHandle,
                                    const sycl::device &syclDevice,
                                    const sycl::context &syclContext);
unsigned int get_image_num_channels(const image_mem_handle memHandle,
                                    const sycl::queue &syclQueue);

image_mem_handle get_mip_level_mem_handle(const image_mem_handle mipMemHandle,
                                          unsigned int level,
                                          const sycl::device &syclDevice,
                                          const sycl::context &syclContext);
image_mem_handle get_mip_level_mem_handle(const image_mem_handle mipMemHandle,
                                          unsigned int level,
                                          const sycl::queue &syclQueue);
}

For get_image_range where the underlying image memory was allocated with one or two dimensions, the returned sycl::range<3> will contain zero values for the dimensions unused by the underlying image memory object.

Allocating USM image memory

The second way to allocate image memory is to use USM allocations. SYCL already provides a number of USM allocation functions. This proposal would add another, pitched memory allocation, through pitched_alloc_device.

namespace sycl::ext::oneapi::experimental {

void *pitched_alloc_device(size_t *retRowPitch,
                           size_t widthInBytes, size_t height,
                           unsigned int elementSizeBytes,
                           const sycl::queue &syclQueue);
void *pitched_alloc_device(size_t *retRowPitch
                           size_t widthInBytes, size_t height,
                           unsigned int elementSizeBytes,
                           const sycl::device &syclDevice,
                           const sycl::context &syclContext);

void *pitched_alloc_device(size_t *resultPitch,
                           const image_descriptor &desc,
                           const sycl::queue &queue);

void *pitched_alloc_device(size_t *resultPitch,
                           const image_descriptor &desc,
                           const sycl::device &syclDevice,
                           const sycl::context &syclContext);
}

This function will allocate a memory region aimed to be used for two-dimensional images. It allocates memory that is guaranteed to adhere to the device’s alignment requirements for USM images.

If the user does not wish to use pitched_alloc_device to allocate two-dimensional USM images, but prefers to use another USM allocation function instead, then that allocation must adhere to some alignment restrictions. These restrictions are device specific, and queries for them can be found in the "Pitch alignment restrictions and queries" section below.

If the allocation of pitched memory fails, pitched_alloc_device will throw a sycl::exception with error code sycl::errc::memory_allocation.

Pitch alignment restrictions and queries

For the purposes of this document, the row pitch of an image memory allocation is the distance in bytes between the first elements of adjacent rows of the image. Some devices may require two-dimensional USM images to be allocated with specific alignments for their width and pitch values. The pitched_alloc_device API intends to make allocation of USM memory adhering to these restrictions easy, returning the appropriate pitch value to the user. However, if a user wishes to use another USM allocation function, they must be aware of these restrictions, and query the device to ensure the allocations they wish to use adhere to those restrictions.

This proposal provides a number of additional device queries that enable the user to allocate appropriate pitched USM memory for two-dimensional images. One-dimensional images do not require any pitch values.

The device information descriptors for these queries are:

Device descriptor Return type Description

ext::oneapi::experimental::info::device::image_row_pitch_align

uint32_t

Returns the required alignment of the pitch between two rows of an image in bytes for images allocated using USM.

ext::oneapi::experimental::info::device::max_image_linear_width

size_t

Returns the maximum linear width allowed for images allocated using USM.

ext::oneapi::experimental::info::device::max_image_linear_height

size_t

Returns the maximum linear height allowed for images allocated using USM.

ext::oneapi::experimental::info::device::max_image_linear_row_pitch

size_t

Returns the maximum linear row pitch allowed for images allocated using USM.

Obtaining a handle to the image

The next step is to create the image, and obtain the handle.

namespace sycl::ext::oneapi::experimental {

/// Opaque unsampled image handle type.
struct unsampled_image_handle {
  using raw_image_handle_type = /* Implementation defined */;

  unsampled_image_handle();
  unsampled_image_handle(raw_image_handle_type raw_handle);

  raw_image_handle_type raw_handle;
};

/// Opaque sampled image handle type.
struct sampled_image_handle {
  using raw_image_handle_type = /* Implementation defined */;

  sampled_image_handle();
  sampled_image_handle(raw_image_handle_type raw_image_handle);

  raw_image_handle_type raw_handle;
};

// Creating an unsampled image from an `image_mem_handle`
unsampled_image_handle create_image(image_mem_handle memHandle,
                                    const image_descriptor &desc,
                                    const sycl::device &syclDevice,
                                    const sycl::context &syclContext);
unsampled_image_handle create_image(image_mem_handle memHandle,
                                    const image_descriptor &desc,
                                    const sycl::queue &syclQueue);

// Creating a sampled image from an `image_mem_handle`
sampled_image_handle create_image(image_mem_handle memHandle,
                                  const image_descriptor &desc,
                                  const bindless_image_sampler &sampler,
                                  const sycl::device &syclDevice,
                                  const sycl::context &syclContext);
sampled_image_handle create_image(image_mem_handle memHandle,
                                  const image_descriptor &desc,
                                  const bindless_image_sampler &sampler,
                                  const sycl::queue &syclQueue);

// Creating an unsampled image from an `image_mem` object
unsampled_image_handle create_image(const image_mem &memHandle,
                                    const image_descriptor &desc,
                                    const sycl::device &syclDevice,
                                    const sycl::context &syclContext);
unsampled_image_handle create_image(const image_mem &memHandle,
                                    const image_descriptor &desc,
                                    const sycl::queue &syclQueue);

// Creating a sampled image from an `image_mem` object
sampled_image_handle create_image(const image_mem &memHandle,
                                  const image_descriptor &desc,
                                  const bindless_image_sampler &sampler,
                                  const sycl::device &syclDevice,
                                  const sycl::context &syclContext);
sampled_image_handle create_image(const image_mem &memHandle,
                                  const image_descriptor &desc,
                                  const bindless_image_sampler &sampler,
                                  const sycl::queue &syclQueue);

// Creating a sampled image from a USM allocation and pitch
sampled_image_handle create_image(const void *usmPtr, size_t pitch,
                                  const image_descriptor &desc,
                                  const bindless_image_sampler &sampler,
                                  const sycl::device &syclDevice,
                                  const sycl::context &syclContext);
sampled_image_handle create_image(const void *usmPtr, size_t pitch,
                                  const image_descriptor &desc,
                                  const bindless_image_sampler &sampler,
                                  const sycl::queue &syclQueue);

// Destroying an image handle
void destroy_image_handle(sampled_image_handle &imageHandle,
                          const sycl::device &syclDevice,
                          const sycl::context &syclContext);
void destroy_image_handle(sampled_image_handle &imageHandle,
                          const sycl::queue &syclQueue);

void destroy_image_handle(unsampled_image_handle &imageHandle,
                          const sycl::device &syclDevice,
                          const sycl::context &syclContext);
void destroy_image_handle(unsampled_image_handle &imageHandle,
                          const sycl::queue &syclQueue);
}

Once we have allocated memory, we can pass it into the create_image function to obtain a sampled_image_handle or unsampled_image_handle. These objects are opaque types that represent an image object. They can be captured by value into a SYCL kernel, or they can be passed in a buffer as a dynamic array of images (see examples at the bottom of this document).

We can either provide a bindless_image_sampler (defined in section below) or not when creating the image. Doing so will create a sampled_image_handle, where otherwise an unsampled_image_handle would be returned. A sampled_image_handle should contain a raw sampler handle that will be used when sampling an image.

Whether an image_descriptor or void * USM allocation was passed to create_image, it must have been allocated in the same context and on the same device as the one passed to create_image.

If we choose to create a 2D image from a USM allocation by passing a void *, we must also pass the pitch of the memory allocation. If the memory was allocated using pitched_alloc_device, the pitch passed must be the one which was returned by pitched_alloc_device. If the user did not use pitched_alloc_device to allocate this memory, then that memory must still adhere to device specific alignment restrictions. These restrictions and their queries are outlined in the section "Pitch alignment restrictions and queries" below.

The pitch is ignored for 1D USM images.

If the creation of an image fails, create_image will throw a sycl::exception with error code sycl::errc::runtime.

The unsampled_image_handle and sampled_image_handle types shall be default-constructible, copy-constructible, and device-copyable. When default constructed, image handles are not valid until a user manually assigns a valid raw_image_handle_type to the raw_handle field of the handle struct. The default value of the raw_handle is implementation defined.

The unsampled_image_handle and sampled_image_handle types have a constructor to allow creation of the types from a raw_image_handle_type

Note

In the DPC++ CUDA backend a sampled image will correspond to a CUDA texture, whereas an unsampled image will correspond to a CUDA surface.

After we’re done with the image, we need to destroy the handle using destroy_image_handle. Destroying an image handle does not deallocate the underlying image memory. The user is responsible for deallocation, either through free_image_mem, or destroying the image_mem object, if one was used.

Image sampler struct

The bindless_image_sampler struct shown below is used to set the sampling properties of sampled_images upon image creation. It can be used to set sampling properties that exist in the SYCL 2020 image_sampler as well as extra properties used for sampling additional image types including level-of-detail (LOD) and anisotropic filtering for mipmaps, and seamless filtering for cubemaps.

namespace sycl::ext::oneapi::experimental {

enum class cubemap_filtering_mode : /* unspecified */ {
  disjointed,
  seamless,
};

struct bindless_image_sampler {

  // Assign addressing mode to all dimensions
  bindless_image_sampler(sycl::addressing_mode addressing,
                         sycl::coordinate_normalization_mode coordinate,
                         sycl::filtering_mode filtering);

  bindless_image_sampler(sycl::addressing_mode addressing,
                         sycl::coordinate_normalization_mode coordinate,
                         sycl::filtering_mode filtering,
                         sycl::filtering_mode mipFiltering,
                         float minMipmapLevelClamp, float maxMipmapLevelClamp,
                         float maxAnisotropy);

  bindless_image_sampler(sycl::addressing_mode addressing,
                         sycl::coordinate_normalization_mode coordinate,
                         sycl::filtering_mode filtering,
                         cubemap_filtering_mode cubemapFiltering);

  // Specific addressing modes per dimension
  bindless_image_sampler(sycl::addressing_mode addressing[3],
                         sycl::coordinate_normalization_mode coordinate,
                         sycl::filtering_mode filtering);

  bindless_image_sampler(sycl::addressing_mode addressing[3],
                         sycl::coordinate_normalization_mode coordinate,
                         sycl::filtering_mode filtering,
                         sycl::filtering_mode mipmapFiltering,
                         float minMipmapLevelClamp, float maxMipmapLevelClamp,
                         float maxAnisotropy);

  bindless_image_sampler(sycl::addressing_mode addressing[3],
                         sycl::coordinate_normalization_mode coordinate,
                         sycl::filtering_mode filtering,
                         cubemap_filtering_mode cubemapFiltering);

  sycl::addressing_mode addressing[3] = {sycl::addressing_mode::none};
  sycl::coordinate_normalization_mode coordinate =
      sycl::coordinate_normalization_mode::unnormalized;
  sycl::filtering_mode filtering = sycl::filtering_mode::nearest;
  sycl::filtering_mode mipmap_filtering = sycl::filtering_mode::nearest;
  float min_mipmap_level_clamp = 0.f;
  float max_mipmap_level_clamp = 0.f;
  float max_anisotropy = 0.f;
  ext::oneapi::experimental::cubemap_filtering_mode cubemap_filtering =
    cubemap_filtering_mode::disjointed;
};

}

The bindless_image_sampler shall be default constructible and follow by-value semantics. The value for the addressing mode, addressing_mode::none, represents the backend’s default addressing mode. On CUDA this is Wrap, i.e. addressing_mode::repeat.

addressing[3] defines the addressing mode per texture dimension. A bindless_image_sampler can be constructed with a singular sycl::addressing_mode, where this parameter will define all dimensions.

mipmap_filtering dictates the method in which sampling between mipmap levels is performed.

min_mipmap_level_clamp defines the minimum mipmap level from which we can sample, with the minimum value being 0.

max_mipmap_level_clamp defines the maximum mipmap level from which we can sample. This value cannot be higher than the number of allocated levels.

max_anisotropy dictates the anisotropic ratio used when sampling the mipmap with anisotropic filtering.

cubemap_filtering dictates the method of sampling along cubemap face borders. Disjointed indicates no sampling between faces whereas seamless indicates that sampling across face boundaries is enabled.

Note

In CUDA, when seamless cubemap filtering is enabled, sampled image address modes specified are ignored. Instead, if the filtering mode is set to nearest the address mode clamp_to_edge will be applied for all dimensions. If the filtering mode is set to linear then seamless cubemap filtering will be performed when sampling along the cube face borders.

Explicit copies

namespace sycl {

class handler {
public:

  // Simple host to device copy
  void ext_oneapi_copy(
      void *Src,
      ext::oneapi::experimental::image_mem_handle Dest,
      const ext::oneapi::experimental:image_descriptor &DestImgDesc);

  // Host to device copy with offsets and extent
  void ext_oneapi_copy(
      void *Src,
      sycl::range<3> SrcOffset,
      sycl::range<3> SrcExtent,
      ext::oneapi::experimental::image_mem_handle Dest,
      sycl::range<3> DestOffset,
      const ext::oneapi::experimental::image_descriptor &DestImgDesc,
      sycl::range<3> CopyExtent);

  // Simple device to host copy
  void ext_oneapi_copy(
      ext::oneapi::experimental::image_mem_handle Src,
      void *Dest,
      const ext::oneapi::experimental::image_descriptor &SrcImgDesc);

  // Device to host copy with offsets and extent
  void ext_oneapi_copy(
    ext::oneapi::experimental::image_mem_handle Src,
    sycl::range<3> SrcOffset,
    const ext::oneapi::experimental::image_descriptor &SrcImgDesc,
    void *Dest,
    sycl::range<3> DestOffset,
    sycl::range<3> DestExtent,
    sycl::range<3> CopyExtent);

  // Simple HtoD or DtoH copy with USM device memory
  void ext_oneapi_copy(void *Src,
                       void *Dest,
                       const ext::oneapi::experimental::image_descriptor &Desc,
                       size_t DeviceRowPitch);

  // HtoD or DtoH copy with USM device memory, using offsets, extent
  void ext_oneapi_copy(
    void *Src,
    sycl::range<3> SrcOffset,
    void *Dest,
    sycl::range<3> DestOffset,
    const ext::oneapi::experimental::image_descriptor &DeviceImgDesc,
    size_t DeviceRowPitch,
    sycl::range<3> HostExtent,
    sycl::range<3> CopyExtent);

  // Simple device to device copy
  void ext_oneapi_copy(
      ext::oneapi::experimental::image_mem_handle Src,
      ext::oneapi::experimental::image_mem_handle Dest,
      const ext::oneapi::experimental::image_descriptor &ImageDesc);
};

class queue {
public:

  // Simple host to device copy
  event ext_oneapi_copy(void *Src,
                        ext::oneapi::experimental::image_mem_handle Dest,
                        const ext::oneapi::experimental::image_descriptor &DestImgDesc);
  event ext_oneapi_copy(void *Src,
                        ext::oneapi::experimental::image_mem_handle Dest,
                        const ext::oneapi::experimental::image_descriptor &DestImgDesc,
                        event DepEvent);
  event ext_oneapi_copy(void *Src,
                        ext::oneapi::experimental::image_mem_handle Dest,
                        const ext::oneapi::experimental::image_descriptor &DestImgDesc,
                        const std::vector<event> &DepEvents);

  // Host to device copy with offsets and extent
  event ext_oneapi_copy(
    void *Src,
    range<3> SrcOffset,
    range<3> SrcExtent,
    ext::oneapi::experimental::image_mem_handle Dest,
    range<3> DestOffset,
    const ext::oneapi::experimental::image_descriptor &DestImgDesc,
    range<3> CopyExtent);
  event ext_oneapi_copy(
    void *Src,
    range<3> SrcOffset,
    range<3> SrcExtent,
    ext::oneapi::experimental::image_mem_handle Dest,
    range<3> DestOffset,
    const ext::oneapi::experimental::image_descriptor &DestImgDesc,
    range<3> Extent, event DepEvent);
  event ext_oneapi_copy(
    void *Src,
    range<3> SrcOffset,
    range<3> SrcExtent,
    ext::oneapi::experimental::image_mem_handle Dest,
    range<3> DestOffset,
    const ext::oneapi::experimental::image_descriptor &DestImgDesc,
    range<3> CopyExtent, const std::vector<event> &DepEvents);

  // Simple device to host copy
  event ext_oneapi_copy(
      ext::oneapi::experimental::image_mem_handle Src,
      void *Dest,
      const ext::oneapi::experimental::image_descriptor &Desc);
  event ext_oneapi_copy(
      ext::oneapi::experimental::image_mem_handle Src,
      void *Dest,
      const ext::oneapi::experimental::image_descriptor &Desc,
      event DepEvent);
  event ext_oneapi_copy(
      ext::oneapi::experimental::image_mem_handle Src,
      void *Dest,
      const ext::oneapi::experimental::image_descriptor &Desc,
      const std::vector<event> &DepEvents);

  // Device to host copy with offsets and extent
  event ext_oneapi_copy(
      ext::oneapi::experimental::image_mem_handle Src,
      range<3> SrcOffset,
      const ext::oneapi::experimental::image_descriptor &SrcImgDesc,
      void *Dest,
      range<3> DestOffset,
      range<3> DestExtent,
      range<3> CopyExtent);
  event ext_oneapi_copy(
      ext::oneapi::experimental::image_mem_handle Src,
      range<3> SrcOffset,
      const ext::oneapi::experimental::image_descriptor &SrcImgDesc,
      void *Dest,
      range<3> DestOffset,
      range<3> DestExtent,
      range<3> CopyExtent, event DepEvent);
  event ext_oneapi_copy(
      ext::oneapi::experimental::image_mem_handle Src,
      range<3> SrcOffset,
      const ext::oneapi::experimental::image_descriptor &SrcImgDesc,
      void *Dest,
      range<3> DestOffset,
      range<3> DestExtent,
      range<3> CopyExtent, const std::vector<event> &DepEvents);

  // Host to device OR device to host using USM device memory
  event ext_oneapi_copy(
      void *Src, void *Dest,
      const ext::oneapi::experimental::image_descriptor &DeviceImgDesc,
      size_t DeviceRowPitch);
  event ext_oneapi_copy(
      void *Src, void *Dest,
      const ext::oneapi::experimental::image_descriptor &DeviceImgDesc,
      size_t DeviceRowPitch,
      event DepEvent);
  event ext_oneapi_copy(
      void *Src, void *Dest,
      const ext::oneapi::experimental::image_descriptor &DeviceImgDesc,
      size_t DeviceRowPitch,
      const std::vector<event> &DepEvents);

  // Host to device OR device to host using USM device memory,
  // with control over sub-region
  event ext_oneapi_copy(
      void *Src, sycl::range<3> SrcOffset,
      void *Dest, sycl::range<3> DestOffset,
      const ext::oneapi::experimental::image_descriptor &DeviceImgDesc,
      size_t DeviceRowPitch,
      sycl::range<3> HostExtent,
      sycl::range<3> CopyExtent);
  event ext_oneapi_copy(
      void *Src, sycl::range<3> SrcOffset,
      void *Dest, sycl::range<3> DestOffset,
      const ext::oneapi::experimental::image_descriptor &DeviceImgDesc,
      size_t DeviceRowPitch,
      sycl::range<3> HostExtent,
      sycl::range<3> CopyExtent);
  event ext_oneapi_copy(
      void *Src, sycl::range<3> SrcOffset,
      void *Dest, sycl::range<3> DestOffset,
      const ext::oneapi::experimental::image_descriptor &DeviceImgDesc,
      size_t DeviceRowPitch,
      sycl::range<3> HostExtent,
      sycl::range<3> CopyExtent);

  // Simple device to device copy
  event ext_oneapi_copy(
      ext::oneapi::experimental::image_mem_handle Src,
      ext::oneapi::experimental::image_mem_handle Dest,
      const ext::oneapi::experimental::image_descriptor &ImageDesc);
  event ext_oneapi_copy(
      ext::oneapi::experimental::image_mem_handle Src,
      ext::oneapi::experimental::image_mem_handle Dest,
      const ext::oneapi::experimental::image_descriptor &ImageDesc,
      event DepEvent);
  event ext_oneapi_copy(
      ext::oneapi::experimental::image_mem_handle Src,
      ext::oneapi::experimental::image_mem_handle Dest,
      const ext::oneapi::experimental::image_descriptor &ImageDesc,
      const std::vector<event> &DepEvents);
};
}

To enable the copying of images an ext_oneapi_copy function is proposed as a method of the queue and handler. It can be used to copy image memory, whether allocated through USM or using an image_mem_handle, from host to device, or device to host. Device to device copies are currently supported only through image_mem_handle allocations. For the ext_oneapi_copy variants that do not take offsets and extents, the image descriptor passed to the ext_oneapi_copy API is used to determine the pixel size, dimensions, and extent in memory of the image to copy. If performing sub-region copy, the size of the memory region is also determined by the offsets and extent passed.

For images allocated using USM, existing SYCL functionality can be used to copy their memory, but we also provide ext_oneapi_copy functions that take USM pointers. If the image memory was allocated using pitched_alloc_device, then the source and destination, row pitch parameter passed must match that which was returned from pitched_alloc_device. If the user opted to use another allocation function then the device pitch parameters must adhere to the alignment restrictions outlined in the "Pitch alignment restrictions and queries" section.

Unless performing a sub-region copy, the user must ensure that the memory regions accessed through Dest and Src have the same capacity.

Whether copying image memory to the device through a USM Dest pointer, or an image_mem_handle, the host memory is always assumed to be tightly packed. Similarly, the host memory is assumed to be packed when copying from device to host.

For the functions that take an image_mem_handle, the handle must have been allocated within the same context and device of the queue.

For the forms that take a USM pointer, the image memory must also have been allocated within the same context and device of the queue. The USM memory must be accessible on the queue’s device.

The ext_oneapi_copy function variants that don’t take offsets and extents may fail in the following scenarios:

  1. The Src and Dest memory was not allocated on the same device and context of the queue.

  2. The Src and Dest memory regions, where Src or Dest can be either on the host or device, do not have the same memory capacity, where the capacity is calculate from the width, height, depth, channel_order, and channel_type members of the image_descriptor parameter.

The ext_oneapi_copy function variants that do take offsets and extents may fail in the following scenarios:

  1. The Src and Dest memory was not allocated on the same device and context of the queue.

  2. The image descriptor passed does not match the image descriptor used to allocate the image on the device.

  3. the CopyExtent describes a memory region larger than that which was allocated on either the host or the device.

  4. The HostExtent describes a memory region larger than that which was allocated on the host.

  5. The SrcExtent describes a memory region larger than that which was allocated, where Src can be either the host or device.

  6. The DestExtent describes a memory region larger than that which was allocated, where Dest can be either the host or device.

  7. If SrcOffset + CopyExtent moves the memory sub-region outside the bounds of the memory described by Src, irrespective of whether Src is on the host or the device.

  8. If DestOffset + CopyExtent moves the memory sub-region outside the bounds of the memory described by Dest, irrespective of whether Dest is on the host or the device.

  9. The DeviceRowPitch does not adhere to the alignment requirements outlined in section "Pitch alignment restrictions and queries"

  10. The value of DeviceRowPitch is smaller than the width of the image on the device.

If copying of an image fails, ext_oneapi_copy will throw a sycl::exception with error code sycl::errc::invalid, and relay an error message back to the user through sycl::exception::what(), describing which of the scenarios listed above caused the failure.

Reading and writing inside the kernel

namespace sycl::ext::oneapi::experimental {

template <typename DataT, typename HintT = DataT, typename CoordT>
DataT fetch_image(const unsampled_image_handle &ImageHandle,
                  const CoordT &Coords);

template <typename DataT, typename HintT = DataT, typename CoordT>
DataT fetch_image(const sampled_image_handle &ImageHandle,
                  const CoordT &Coords);
template <typename DataT, typename HintT = DataT, typename CoordT>
DataT sample_image(const sampled_image_handle &ImageHandle,
                   const CoordT &Coords);

template <typename DataT, typename CoordT>
void write_image(unsampled_image_handle ImageHandle,
                 const CoordT &Coords, const DataT &Color);
}

Inside a kernel, it’s possible to retrieve data from an image via fetch_image or sample_image, passing the appropirate image handle. The fetch_image API is applicable to sampled and unsampled images, and the data will be fetched exactly as is in device memory. The sample_image API is only applicable to sampled images, the image data will be sampled according to the bindless_image_sampler that was passed to the image upon construction.

When fetching from a sampled image handle, data exatly as is in memory, no sampling operations will be performed, and the bindless_image_sampler passed to the image upon creation has no effect on the returned image data. Note that not all devices may support fetching of sampled image data depending on the dimension or backing memory type. We provide device aspect queries for this in Querying sampled image fetch support.

The user is required to pass a DataT template parameter, which specifies the return type of the fetch_image and sample_image functions. If DataT is not a recognized standard type, as defined in Recognized standard types, and instead a user-defined type, the user must provide a HintT template parameter to the fetch_image and sample_image functions, to allow the backend to select the correct device intrinsic to fetch or sample their data.

HintT must be one of the the Recognized standard types, and must be the same size as DataT. If DataT is a recognized standard type, and HintT is also passed, HintT will be ignored.

When fetching or sampling an image backed by a normalized integer channel type, either DataT must be a 32-bit or 16-bit floating point value, a sycl::vec of 32-bit or 16-bit floating point values, or, in the case DataT is not one of the above, then HintT must be one of the above, and be of the same size as DataT.

It’s possible to write to an unsampled image via write_image passing the handle of the image to be written to, along with the coordinates to write to and the data. User-defined types are allowed to be written provided that type is trivially copyable. The user defined type must also be of the same size as any of the Recognized standard types.

Sampled images cannot be written to using write_image.

For fetching and writing of unsampled images, coordinates are specified by int, sycl::vec<int, 2>, and sycl::vec<int, 3> for 1D, 2D, and 3D images, respectively.

Sampled image "sampled reads" take float, sycl::vec<float, 2>, and sycl::vec<float, 3> coordinate types for 1D, 2D, and 3D images, respectively.

Sampled image "fetch reads" take int, sycl::vec<int, 2>, and sycl::vec<int, 3> coordinate types for 1D, 2D, and 3D images, respectively.

Note also that all images must be used in either read-only or write-only fashion within a single kernel invocation; read/write images are not supported.

Note also that read-after-write functionality is not supported. Unsampled images may be read from and written back to within the same kernel, however, reading from that same image again will result in undefined behaviour. A new kernel must be submitted for the written data to be accessible.

Note

Attempting to sample a standard sampled image with sample_mipmap or any other defined sampling function will result in undefined behaviour.

Recognized standard types

For the purposes of this extension, the following are classified as recognized standard types.

  • All POD types (char, short, int, float, etc.) excluding double

  • sycl::half

  • Variants of sycl::vec<T, N> where T is one of the above, and N is 1, 2, or 3

Any other types are classified as user-defined types.

User-defined types

Some examples of a user-defined types may be:

struct my_float4 {
  float r, g, b, a;
};

struct my_short2 {
  short r, g;
};

When providing the above types as DataT parameters to an image read function, the corresponding HintT parameters to use would be sycl::vec<float, 4> and sycl::vec<short, 2>, respectively.

Querying sampled image fetch support

We provide the following device queries to query support for sampled image fetch with various backing memory types and dimensionalities.

The device aspect descriptors for these queries are:

Device descriptor Description

aspect::ext_oneapi_bindless_sampled_image_fetch_1d_usm

Indicates if the device is capable of fetching USM backed 1D sampled image data.

aspect::ext_oneapi_bindless_sampled_image_fetch_1d

Indicates if the device is capable of fetching non-USM backed 1D sampled image data.

aspect::ext_oneapi_bindless_sampled_image_fetch_2d_usm

Indicates if the device is capable of fetching USM backed 2D sampled image data.

aspect::ext_oneapi_bindless_sampled_image_fetch_2d

Indicates if the device is capable of fetching non-USM backed 2D sampled image data.

aspect::ext_oneapi_bindless_sampled_image_fetch_3d_usm

Indicates if the device is capable of fetching USM backed 3D sampled image data.

aspect::ext_oneapi_bindless_sampled_image_fetch_3d

Indicates if the device is capable of fetching non-USM backed 3D sampled image data.

Mipmapped images

So far, we have described how to create and operate on standard bindless images. Another type of image we propose support for is a mipmapped image. Mipmapped images are an image type with multiple levels. Each consecutive dimension of a mipmapped image level is smaller than the previous level. The dimensions of a succeeding mip level is half that of the preceding level. As an example, a two-dimensional mipmapped image where the top-most level (level==0) image has a width==16 and height==16, the succeeding level (level==1) in the mipmap will have sizes width==8 and height==8. This pattern continues until either the final level has sizes of width==1 and height==1, or the user-specified maximum mip level has been reached (described by the num_levels member of image_descriptor).

Querying mipmap support

We provide the following device queries to retrieve information on a SYCL implementation of various mipmap features.

The device aspect descriptors for these queries are:

Device descriptor Description

aspect::ext_oneapi_mipmap

Indicates if the device supports allocating mipmap resources.

aspect::ext_oneapi_mipmap_anisotropy

Indicates if the device supports sampling mipmap images with anisotropic filtering

aspect::ext_oneapi_mipmap_level_reference

Indicates if the device supports using images created from individual mipmap levels

The device information descriptors for these queries are:

Device descriptor Return type Description

ext::oneapi::experimental::info::device::mipmap_max_anisotropy

float

Return the maximum anisotropic ratio supported by the device

Allocation of mipmapped images

Mipmaps are allocated in a similar manner to standard images, however, mipmaps do not support USM backed memory.

Mipmap memory is allocated through alloc_image_mem. The user should populate the image_descriptor with the image type of image_type::mipmap, and provide the number of mipmaps levels they wish to allocate. The value of num_levels must be greater than 1.

Mipmap memory allocated this way requires the user to free that memory after all operations using the memory are completed and no more operations operating on the memory will be scheduled. This is done using free_image_mem, passing image_type::mipmap. Importantly, individual levels of a mipmap must not be freed before calling free_image_mem.

The RAII class image_mem may also be used to perform allocation and deallocation of mipmap device memory. The constructor and destructor act as a wrapper for the functions alloc_image_mem and free_image_mem respectively.

When the underlying memory of image_mem is a mipmap, get_mip_level_mem_handle can be used to return an image_mem_handle to a specific level of the mipmap. This can then be used to copy data to that specific level or create an image based on that level.

Obtaining a handle to a mipmap image

A handle to a mipmap image is acquired in the same way as a sampled_image_handle. Mipmaps can only be sampled image types. We can create a sampled_image_handle to the allocated mipmap through the create_image functions which take a bindless_image_sampler. To sample a mipmap correctly, the mipmap attributes of this sampler must be defined.

Attempting to create an unsampled_image_handle to a mipmap will result in a sycl::exception with error code sycl::errc::runtime being thrown.

Copying mipmap image data

In order to copy to or from mipmaps, the user should retrieve an individual level’s image_mem_handle through image_mem::get_mip_level_mem_handle, which can then be passed to ext_oneapi_copy. The user must ensure that the image descriptor passed to ext_oneapi_copy is dimensioned correctly for the mip level being copied to/from. The provided image_descriptor::get_mip_level_desc allows the user to retrieve a correctly dimensioned image descriptor for any level of a given top-level descriptor.

Reading a mipmap

Inside the kernel, it’s possible to sample a mipmap via sample_mipmap, passing the sampled_image_handle, the coordinates, and either the level or anisotropic gradient values.

The method of sampling a mipmap is different based on which sample_mipmap function is used, and the sampler attributes passed upon creation of the mipmap.

// Nearest/linear filtering between mip levels
template <typename DataT, typename HintT = DataT, typename CoordT>
DataT sample_mipmap(const sampled_image_handle &ImageHandle,
                    const CoordT &Coords,
                    const float Level);

// Anisotropic filtering
template <typename DataT, typename HintT = DataT, typename CoordT>
DataT sample_mipmap(const sampled_image_handle &ImageHandle,
                    const CoordT &Coords,
                    const CoordT &Dx, const CoordT &Dy);

Reading a mipmap follows the same restrictions on what coordinate types may be used as laid out in Reading and writing inside the kernel, and the viewing gradients are bound to the same type as used for the coordinates.

Reading a mipmap by providing a user-defined return DataT type also follows the restrictions as laid out in Reading and writing inside the kernel.

Note

Attempting to sample a mipmap with sample_image or any other defined sample function will result in undefined behaviour.

Image arrays

Another type of image we propose support for is image arrays. Image arrays are images made up of multiple array indices where each index is itself an image and every index has the same dimensionality, size, and data type.

Image arrays may also be referred to as layered images, and the array indices may be referred to layers.

Allocation of image arrays

Image arrays are allocated in a similar manner to standard images.

Image array memory is allocated through alloc_image_mem. The user should populate the image_descriptor with the image type of image_type::array, and provide the size of the array they wish to allocate. The value of array_size must be greater than 1.

Image array memory allocated this way requires the user to free that memory after all operations using the memory are completed and no more operations operating on the memory will be scheduled. This is done using free_image_mem, passing image_type::array.

The RAII class image_mem may also be used to perform allocation and deallocation of arrayed image device memory. The constructor and destructor act as a wrapper for the functions alloc_image_mem and free_image_mem respectively.

Note

Currently there is no support for image arrays backed by USM.

Obtaining a handle to an image array

A handle to an image array is acquired in the same way as unsampled_image_handle. We create the handle through the create_image functions which take image_descriptor that has image_type::array and array_size greater than 1.

Note

Currently there is no support for sampled image arrays.

Copying image array data

When copying to or from image arrays, the user should copy to/from the entire array of images in one call to ext_oneapi_copy by passing the image arrays' image_mem_handle.

Reading an image array

Inside the kernel, it’s possible to fetch data from an unsampled image array via fetch_image_array, passing the unsampled_image_handle, the coordinates, and the array index.

// Fetch an unsampled image array
template <typename DataT, typename HintT = DataT, typename CoordT>
DataT fetch_image_array(const unsampled_image_handle &ImageHandle,
                       const CoordT &Coords, const unsigned int ArrayLayer);

Fetching an image array follows the same restrictions on what coordinate types may be used as laid out in Reading and writing inside the kernel.

Fetching an image array by providing a user-defined return DataT type also follows the restrictions as laid out in Reading and writing inside the kernel.

Note

Attempting to fetch an image array with fetch_image or any other defined fetch function will result in undefined behaviour.

Writing an image array

Inside the kernel, it’s possible to write to an image array via write_image_array, passing the unsampled_image_handle, the coordinates, the array index, and the data to write. User-defined types are allowed to be written provided that type is trivially copyable.

// Write to an unsampled image array
template <typename DataT, typename CoordT>
DataT write_image_array(unsampled_image_handle ImageHandle,
                        const CoordT &Coords, const unsigned int ArrayLayer
                        const DataT &Color);

Writing to an image array follows the same restrictions on what coordinate types may be used as laid out in Reading and writing inside the kernel.

Note

Attempting to write to an image array with write_image or any other defined write function will result in undefined behaviour.

Cubemapped images

Another image type this extension supports is cubemapped images. Cubemap images are a specialisation of 2D image arrays that have exactly six layers representing the faces of a cube where the width and height of each layer (cube face) are equal. Cube mapping is a method of environment mapping, where the environment is projected onto the sides of the cube. Cubemaps have been applied in graphical systems such as skylight illumination, dynamic reflection, and skyboxes.

Querying cubemap support

We provide the following device aspects to retrieve support information on a SYCL implementation of just a couple of cubemap features.

The device aspect descriptors for these queries are:

Device descriptor Description

aspect::ext_oneapi_cubemap

Indicates if the device supports allocating and accessing cubemap resources

aspect::ext_oneapi_cubemap_seamless_filtering

Indicates if the device supports sampling cubemapped images across face bounderies

Allocation of cubemapped images

As with all other image types, cubemap memory is allocated through alloc_image_mem with the appropriately populated image_descriptor, where width and height are equal, and the type is set to image_type::cubemap. Since cubemaps are specialised image arrays, the array_size must be populated with the only valid value, 6. Overriding this with any other value for array_size could result in an exception or undefined behaviour. Cubemaps are not supported with USM backed memory.

Cubemap memory allocated this way requires the user to free that memory after all operations using the memory are completed and no more operations operating on the memory will be scheduled. This is done using free_image_mem, passing image_type::cubemap.

The RAII class image_mem may also be used to perform allocation and deallocation of cubemapped device memory. The constructor and destructor act as a wrapper for the functions alloc_image_mem and free_image_mem respectively.

Obtaining a handle to a cubemap

A handle to a cubemap is acquired in the same way as a standard image for both an unsampled_image_handle and sampled_image_handle. We create the handle for a cubemap through the appropriate create_image functions which take the image_descriptor and bindless_image_sampler for a sampled_image_handle, or just the image_descriptor for an unsampled_image_handle.

As with allocation, the descriptor must be populated appropriately, i.e. image_type::cubemap, width and height are equal, and array_size is equal to 6. To sample a cubemap as expected, the cubemap sampling attribute of the sampler, namely seamless_filtering_mode, must be defined.

Copying cubemap image data

In order to copy to or from cubemaps, the user should utilise the provided ext_oneapi_copy functions following the details laid out in Copying image array data.

Reading, writing, and sampling a cubemap

Cubemaps are supported as both unsampled and sampled images, however, the meaning of their usage is quite different.

An unsampled cubemap is treated as an image array with six layers, i.e. an integer index denoting a face and two integer coordinates addressing a texel within the layer corresponding to this face. Inside the kernel, this is done via fetch_cubemap, passing the unsampled_image_handle, the integer coordinates, int2, and an integer index denoting the face, int. Being an unsampled image, a cubemap can be written with write_cubemap, passing the unsampled_image_handle, the integer coordinates, int2, and an integer index denoting the face, int.

On the other hand, a sampled cubemap is addressed using three floating-point coordinates x, y, and z that are interpreted as a direction vector emanating from the centre of the cube and pointing to one face of the cube and a texel within the layer corresponding to that face. Inside the kernel, this is done via sample_cubemap, passing the sampled_image_handle, the floating-point coordinates x, y, and z, as a float3. The method of sampling depends on the sampler attributes passed upon creation of the cubemap.

// Unsampled cubemap read
template <typename DataT, typename HintT = DataT>
DataT fetch_cubemap(const unsampled_image_handle &ImageHandle,
                    const int2 &Coords,
                    const int Face);

// Sampled cubemap read
template <typename DataT, typename HintT = DataT>
DataT sample_cubemap(const sampled_image_handle &ImageHandle,
                     const float3 &Vec);

// Unsampled cubemap write
template <typename DataT>
void write_cubemap(unsampled_image_handle ImageHandle,
                   const int2 &Coords,
                   const int Face,
                   const DataT &Color);
Note

Attempting to read or write to a cubemap with any other defined read/write function will result in undefined behaviour.

Interoperability

Querying interoperability support

We provide the following device queries to retrieve information on whether a SYCL implementation provides support for various interoperability features.

The device aspect descriptors for these queries are:

Device descriptor Description

aspect::ext_oneapi_interop_memory_import

Indicates if the device supports importing external memory resources.

aspect::ext_oneapi_interop_memory_export

Indicates if the device supports exporting internal memory resources.

aspect::ext_oneapi_interop_semaphore_import`

Indicates if the device supports importing external semaphore resources.

aspect::ext_oneapi_interop_semaphore_export

Indicates if the device supports exporting internal event resources.

Note

Not all SYCL backends may provide support for importing or exporting native memory or semaphore objects. CUDA for example only supports importation of external memory and semaphores, but provides no support for their exportation.

External Resource types

In order to facilitate the importing of a number of different external memory and external semaphore handle types, we propose the following resource structures.

Note

We only show three examples of external resource handle types here, but the external_mem_descriptor and external_semaphore_descriptor structs, as defined in Importing external memory objects and Importing external semaphores, could be templated by any number of handle types, provided that the SYCL implementation provides support for them.

namespace sycl::ext::oneapi::experimental {

// POSIX file descriptor handle type
struct resource_fd {
  int file_descriptor;
};

// Windows NT handle type
struct resource_win32_handle {
  void *handle;
};

// Windows NT name type
struct resource_win32_name {
  const void *name;
};

}

Importing external memory objects

In order to import a memory object, an external API must provide an appropriate handle to that memory. The exact structure and type of this handle can depend on the external API, and the operating system the application is running on.

External memory import is facilitated through the following proposed descriptor struct.

namespace sycl::ext::oneapi::experimental {

// Descriptor templated on specific resource type
template <typename ResourceType>
struct external_mem_descriptor {
  ResourceType external_resource;
  size_t size_in_bytes;
};

}

The user should create an external_mem_descriptor templated on the appropriate handle type, ResourceType, for their purposes, e.g. resource_fd to describe a POSIX file descriptor resource on Linux systems, or a resource_win32_handle for Windows NT resource handles.

Once the user populates the external_mem_descriptor with the appropriate ResourceType values, and the size of the external memory in bytes, they can then import that memory into SYCL through import_external_memory.

namespace sycl::ext::oneapi::experimental {

struct interop_mem_handle {
  using raw_handle_type = /* Implementation defined */;
  raw_handle_type raw_handle;
};

template <typename ResourceType>
interop_mem_handle import_external_memory(
    external_mem_descriptor<ResourceType> externalMemDescriptor,
    const sycl::device &syclDevice,
    const sycl::context &syclContext);

template <typename ResourceType>
interop_mem_handle import_external_memory(
    external_mem_descriptor<ResourceType> externalMemDescriptor,
    const sycl::queue &syclQueue);

image_mem_handle map_external_image_memory(
    interop_mem_handle interopMemHandle,
    const image_descriptor &imageDescriptor,
    const sycl::device &syclDevice,
    const sycl::context &syclContext);
image_mem_handle map_external_image_memory(
    interop_mem_handle interopMemHandle,
    const image_descriptor &imageDescriptor,
    const sycl::queue &syclQueue);
}

The resulting interop_mem_handle can then be mapped, where the resulting type is an image_mem_handle. This can be used to construct images in the same way as memory allocated through alloc_image_mem. The ext_oneapi_copy operations also work with imported memory mapped to image_mem_handle types.

When calling create_image with an image_mem_handle mapped from an external memory object, the user must ensure that the image descriptor they pass to create_image has members that match or map to those of the external API. A mismatch between any of the width, height, depth, image_channel_type, or image_channel_order members will result in undefined behavior.

Additionally, the image_type describing the image must match to the image of the external API. The current supported importable image types are standard and mipmap. Attempting to import other image types will result in undefined behaviour.

Once a user has finished operating on imported memory, they must ensure that they destroy the imported memory handle through release_external_memory.

release_external_memory can only accept interop_mem_handles that were created through import_external_memory.

namespace sycl::ext::oneapi::experimental {

void release_external_memory(interop_mem_handle interopMem,
                             const sycl::device &syclDevice,
                             const sycl::context &syclContext);
void release_external_memory(interop_mem_handle interopMem,
                             const sycl::queue &syclQueue);
}

Destroying or freeing any imported memory through image_mem_free or sycl::free will result in undefined behavior.

Importing external semaphores

In addition to proposing importation of external memory resources, we also propose importation of synchronization primitives. Just like the sharing of memory between APIs described above, any external APIs must provide a valid a handle to a valid semaphore resource they wish to share, and just as external memory resources handles can take different forms of structure and type depending on the API and operating system, so do external semaphore resource handles.

External semaphore import is facilitated through the following proposed descriptor struct.

namespace sycl::ext::oneapi::experimental {

// Descriptor templated on specific resource type
template <typename ResourceType>
struct external_semaphore_descriptor {
  ResourceType external_resource;
};

}

The user should create an external_semaphore_descriptor templated on the appropriate handle type, ResourceType, for their purposes, e.g. resource_fd to describe a POSIX file descriptor resource on Linux systems, or a resource_win32_handle for Windows NT resource handles.

Once the user populates the external_semaphore_descriptor with the appropriate ResourceType values, they can then import that semaphore into SYCL through import_external_semaphore.

namespace sycl::ext::oneapi::experimental {

struct interop_semaphore_handle {
  using raw_handle_type = /* Implementation defined */;
  raw_handle_type raw_handle;
};

template <typename ResourceType>
interop_semaphore_handle import_external_semaphore(
    external_semaphore_descriptor<ResourceType>
        externalSemaphoreDescriptor,
    const sycl::device &syclDevice,
    const sycl::context &syclContext);
}

template <typename ResourceType>
interop_semaphore_handle import_external_semaphore(
    external_semaphore_descriptor<ResourceType>
        externalSemaphoreDescriptor,
    const sycl::queue &syclQueue);
}

The resulting interop_semaphore_handle can then be used in a SYCL command group, to either wait until the semaphore is in the signaled state, or set the semaphore to a signaled state.

We propose to extend the SYCL queue and handler classes with semaphore waiting and signalling operations.

namespace sycl {

class handler {
public:
  void ext_oneapi_wait_external_semaphore(
      ext::oneapi::experimental::interop_semaphore_handle
          interop_semaphore_handle);

  void ext_oneapi_signal_external_semaphore(
      ext::oneapi::experimental::interop_semaphore_handle
          interop_semaphore_handle);
};

class queue {
public:
  event ext_oneapi_wait_external_semaphore(
      ext::oneapi::experimental::interop_semaphore_handle
          interop_semaphore_handle);
  event ext_oneapi_wait_external_semaphore(
      ext::oneapi::experimental::interop_semaphore_handle
          interop_semaphore_handle,
      event DepEvent);
  event ext_oneapi_wait_external_semaphore(
      ext::oneapi::experimental::interop_semaphore_handle
          interop_semaphore_handle,
      const std::vector<event> &DepEvents);

  event ext_oneapi_signal_external_semaphore(
      ext::oneapi::experimental::interop_semaphore_handle
          interop_semaphore_handle);
  event ext_oneapi_signal_external_semaphore(
      ext::oneapi::experimental::interop_semaphore_handle
          interop_semaphore_handle,
      event DepEvent);
  event ext_oneapi_signal_external_semaphore(
      ext::oneapi::experimental::interop_semaphore_handle
          interop_semaphore_handle,
      const std::vector<event> &DepEvents);
};
}

Any operations submitted to the queue after a ext_oneapi_wait_external_semaphore call will not begin until the imported semaphore is in a signaled state.

When ext_oneapi_signal_external_semaphore is called, the external semaphore will be set to the signaled state after all commands submitted to the queue prior to the ext_oneapi_signal_external_semaphore call complete.

ext_oneapi_wait_external_semaphore and ext_oneapi_signal_external_semaphore are non-blocking, asynchronous operations.

The user must ensure to destroy all external semaphore objects once they are no longer required through destroy_external_semaphore.

namespace sycl::ext::oneapi::experimental {

void destroy_external_semaphore(interop_semaphore_handle semaphoreHandle,
                                const sycl::device &syclDevice,
                                const sycl::context &syclContext);

void destroy_external_semaphore(interop_semaphore_handle semaphoreHandle,
                                const sycl::queue &syclQueue);

}

Examples

1D image read/write

// Set up device, queue, and context
sycl::device device;
sycl::queue queue(device);
sycl::context context = queue.get_context();

// Initialize input data
constexpr size_t width = 512;
std::vector<float> dataIn(width);
std::vector<float> dataOut(width);
for (int i = 0; i < width; i++) {
  dataIn[i] = static_cast<float>(i);
}

// Image descriptor - can use the same for both images
sycl::ext::oneapi::experimental::image_descriptor desc(
    sycl::range{width}, sycl::ext::oneapi::experimental::image_channel_order::r,
    sycl::ext::oneapi::experimental::image_channel_type::fp32);

try {
  // Extension: returns the device pointer to the allocated memory
  sycl::ext::oneapi::experimental::image_mem imgMemoryIn(desc, queue);
  sycl::ext::oneapi::experimental::image_mem imgMemoryOut(desc, queue);

  // Extension: create the image and return the handle
  sycl::ext::oneapi::experimental::unsampled_image_handle imgIn =
      sycl::ext::oneapi::experimental::create_image(imgMemoryIn, desc, queue);
  sycl::ext::oneapi::experimental::unsampled_image_handle imgOut =
      sycl::ext::oneapi::experimental::create_image(imgMemoryOut, desc, queue);

  // Extension: copy over data to device
  q.ext_oneapi_copy(dataIn.data(), imgMemoryIn, desc);

  // Bindless images require manual synchronization
  // Wait for copy operation to finish
  q.wait_and_throw();

  q.submit([&](sycl::handler &cgh) {
    // No need to request access, handles captured by value

    cgh.parallel_for(width, [=](sycl::id<1> id) {
      // Extension: read image data from handle
      float pixel = sycl::ext::oneapi::experimental::fetch_image<float>(
          imgIn, int(id[0]));

      // Extension: write to image data using handle
      sycl::ext::oneapi::experimental::write_image(imgOut, int(id[0]), pixel);
    });
  });

  // Using image handles requires manual synchronization
  q.wait_and_throw();

  // Copy data written to imgOut to host
  q.ext_oneapi_copy(imgMemoryOut, dataOut.data(), desc);

  // Cleanup
  sycl::ext::oneapi::experimental::destroy_image_handle(imgIn, queue);
  sycl::ext::oneapi::experimental::destroy_image_handle(imgOut, queue);
} catch (sycl::exception e) {
  std::cerr << "SYCL exception caught: " << e.what();
  exit(-1);
}

// Validate that `dataIn` correctly transferred to `dataOut`
bool validated = (dataIn == dataOut);

Reading from a dynamically sized array of 2D images

// Set up device, queue, and context
sycl::device device;
sycl::queue queue(device);
sycl::context context = queue.get_context();

// declare image data
size_t numImages = 5;
size_t width = 8;
size_t height = 8;
size_t numPixels = width * height;
std::vector<float> dataIn(numPixels);
std::vector<float> dataOut(numPixels);
std::vector<float> dataExpected(numPixels);
for (int i = 0; i < width; i++) {
  for (int j = 0; j < height; j++) {
    int index = j + (height * i);
    dataIn[index] = index;
    dataExpected[index] = index * numImages;
  }
}

// Image descriptor - can use the same for all images
sycl::ext::oneapi::experimental::image_descriptor desc(
    {width, height}, sycl::ext::oneapi::experimental::image_channel_order::r,
    sycl::ext::oneapi::experimental::image_channel_type::fp32);

try {

  // Allocate each image and save the handles
  std::vector<sycl::ext::oneapi::experimental::image_mem> imgAllocations;
  for (int i = 0; i < numImages; i++) {
    // Extension: move-construct device allocated memory
    imgAllocations.emplace_back(
        sycl::ext::oneapi::experimental::image_mem{desc, queue});
  }

  // Copy over data to device for each image
  for (int i = 0; i < numImages; i++) {
    // Extension: copy over data to device
    q.ext_oneapi_copy(dataIn.data(), imgAllocations[i], desc);
  }

  // Wait for copy operations to finish
  q.wait_and_throw();

  // Create the images and return the handles
  std::vector<sycl::ext::oneapi::experimental::unsampled_image_handle>
      imgHandles;
  for (int i = 0; i < numImages; i++) {
    // Extension: create the image and return the handle
    sycl::ext::oneapi::experimental::unsampled_image_handle imgHandle =
        sycl::ext::oneapi::experimental::create_image(imgAllocations[i],
                                                      desc, queue);
    imgHandles.push_back(imgHandle);
  }

  sycl::buffer outBuf{dataOut.data(), sycl::range{height, width}};
  sycl::buffer imgHandlesBuf{imgHandles.data(), sycl::range{numImages}};
  q.submit([&](sycl::handler &cgh) {
    sycl::accessor outAcc{outBuf, cgh, sycl::write_only};
    sycl::accessor imgHandleAcc{imgHandlesBuf, cgh, sycl::read_only};

    cgh.parallel_for(
        sycl::nd_range<2>{{width, height}, {width, height}},
        [=](sycl::nd_item<2> it) {
          size_t dim0 = it.get_local_id(0);
          size_t dim1 = it.get_local_id(1);

          // Sum each image by reading via its handle
          float sum = 0;
          for (int i = 0; i < numImages; i++) {
            // Extension: read image data from handle
            sum += (sycl::ext::oneapi::experimental::fetch_image<float>(
                imgHandleAcc[i], sycl::vec<int, 2>(dim0, dim1)));
          }
          outAcc[sycl::id{dim1, dim0}] = sum;
        });
  });

  // Using image handles requires manual synchronization
  q.wait_and_throw();

  // Cleanup
  for (int i = 0; i < numImages; i++) {
    sycl::ext::oneapi::experimental::destroy_image_handle(imgHandles[i], queue);
  }
} catch (sycl::exception e) {
  std::cerr << "SYCL exception caught: " << e.what();
  exit(-1);
}

// Validate that `dataOut` is correct
bool validated = (dataOut == dataExpected);

Reading a 1D mipmap with anisotropic filtering and levels

// Set up device, queue, and context
sycl::device device;
sycl::queue queue(device);
sycl::context context = q.get_context();

// declare image data
constexpr size_t width = 16;
unsigned int num_levels = 2;
std::vector<float> dataIn1(width);
std::vector<float> dataIn2(width / 2);
std::vector<float> dataOut(width);
std::vector<float> dataExpected(width);
int j = 0;
for (int i = 0; i < width; i++) {
  dataExpected[i] = static_cast<float>(i + (j + 10));
  if (i % 2)
    j++;
  dataIn1[i] = static_cast<float>(i);
  if (i < (N / 2))
    dataIn2[i] = static_cast<float>(i + 10);
}

try {

  // Image descriptor -- number of levels
  sycl::ext::oneapi::experimental::image_descriptor desc(
      {width}, sycl::ext::oneapi::experimental::image_channel_order::r,
      sycl::ext::oneapi::experimental::image_channel_type::fp32,
      sycl::ext::oneapi::experimental::image_type::mipmap, num_levels);

  // Allocate the mipmap
  sycl::ext::oneapi::experimental::image_mem mip_mem(desc, queue);

  // Retrieve level 0
  sycl::ext::oneapi::experimental::image_mem_handle img_mem1 =
      mip_mem.get_mip_level_mem_handle(0)

  // Copy over data to level 0
  q.ext_oneapi_copy(dataIn1.data(), img_mem1, desc);

  // Copy over data to level 1
  q.ext_oneapi_copy(dataIn2.data(), mip_mem.get_mip_level_mem_handle(1),
                    desc.get_mip_level_desc(1));
  q.wait_and_throw();

  // Extended sampler object to take in mipmap attributes
  sycl::ext::oneapi::experimental::bindless_image_sampler samp(
      addressing_mode::mirrored_repeat,
      coordinate_normalization_mode::normalized, filtering_mode::nearest,
      mipmap_filtering_mode::nearest, 0.0f, (float)num_levels, 8.0f);

  // Create a sampled image handle to represent the mipmap
  sycl::ext::oneapi::experimental::sampled_image_handle mipHandle =
      sycl::ext::oneapi::experimental::create_image(mip_mem, samp, desc, queue);
  q.wait_and_throw();

  sycl::buffer<float, 1> buf((float *)dataOut.data(), width);
  q.submit([&](handler &cgh) {
    auto outAcc = buf.get_access<access_mode::write>(cgh, width);

    cgh.parallel_for<image_addition>(width, [=](id<1> id) {
      float sum = 0;
      float x = (static_cast<float>(id[0]) + 0.5f) / static_cast<float>(width);
      // Read mipmap level 0 with anisotropic filtering
      // and level 1 with level filtering
      float px1 = sycl::ext::oneapi::experimental::sample_mipmap<float>(
          mipHandle, x, 0.0f, 0.0f);
      float px2 = sycl::ext::oneapi::experimental::sample_mipmap<float>(
          mipHandle, x, 1.0f);

      sum = px1 + px2;
      outAcc[id] = sum;
    });
  });

  q.wait_and_throw();

  // Cleanup
  sycl::ext::oneapi::experimental::destroy_image_handle(mipHandle, queue);

} catch (sycl::exception e) {
  std::cerr << "SYCL exception caught! : " << e.what() << "\n";
  exit(-1);
} catch (...) {
  std::cerr << "Unknown exception caught!\n";
  exit(-1);
}

// Validate that `dataOut` is correct
bool validated = (dataOut == dataExpected);

1D image array read/write

using VecType = sycl::vec<float, 4>;

sycl::device dev;
sycl::queue q(dev);
auto ctxt = q.get_context();

// declare image data
constexpr size_t width = 5;
constexpr size_t array_size = 2;
constexpr size_t N = width;
std::vector<VecType> out(N * array_size);
std::vector<float> expected(N * array_size);
std::vector<float> outBuf(N);
std::vector<VecType> dataIn1(N * array_size);
std::vector<VecType> dataIn2(N * array_size);

for (int i = 0; i < N * array_size; i++) {
  // Populate input data (to-be image arrays)
  dataIn1[i] = VecType(i);
  dataIn2[i] = VecType(2*i);
}

// Populate expected output
for (int i = 0; i < width; i++) {
  for (int l = 0; l < array_size; l++) {
    expected[l * N + i] = dataIn1[l * N + i][0] + dataIn2[l * N + i][0];
  }
}

try {
  // Extension: image descriptor -- number of layers
  sycl::ext::oneapi::experimental::image_descriptor desc(
      {width}, sycl::image_channel_order::rgba, sycl::image_channel_type::fp32,
      sycl::ext::oneapi::experimental::image_type::array, 1, array_size);

  // Extension: allocate image array memory on device
  sycl::ext::oneapi::experimental::image_mem arrayMem1(desc, dev, ctxt);
  sycl::ext::oneapi::experimental::image_mem arrayMem2(desc, dev, ctxt);
  sycl::ext::oneapi::experimental::image_mem outMem(desc, dev, ctxt);

  // Extension: copy over data to device
  q.ext_oneapi_copy(dataIn1.data(), arrayMem1.get_handle(), desc);
  q.ext_oneapi_copy(dataIn2.data(), arrayMem2.get_handle(), desc);
  q.wait_and_throw();

  // Extension: create a unsampled image handles to represent the image arrays
  sycl::ext::oneapi::experimental::unsampled_image_handle arrayHandle1 =
      sycl::ext::oneapi::experimental::create_image(arrayMem1, desc, dev,
                                                    ctxt);
  sycl::ext::oneapi::experimental::unsampled_image_handle arrayHandle2 =
      sycl::ext::oneapi::experimental::create_image(arrayMem2, desc, dev,
                                                    ctxt);
  sycl::ext::oneapi::experimental::unsampled_image_handle outHandle =
      sycl::ext::oneapi::experimental::create_image(outMem, desc, dev,
                                                    ctxt);

  q.submit([&](sycl::handler &cgh) {

    cgh.parallel_for<kernel>(N, [=](sycl::id<1> id) {
      float sum1 = 0;
      float sum2 = 0;

      // Extension: read image layers 0 and 1
      VecType px1 = sycl::ext::oneapi::experimental::fetch_image_array<VecType>(
          arrayHandle1, int(id[0]), 0);
      VecType px2 = sycl::ext::oneapi::experimental::fetch_image_array<VecType>(
          arrayHandle1, int(id[0]), 1);

      // Extension: read image layers 0 and 1
      VecType px3 = sycl::ext::oneapi::experimental::fetch_image_array<VecType>(
          arrayHandle2, int(id[0]), 0);
      VecType px4 = sycl::ext::oneapi::experimental::fetch_image_array<VecType>(
          arrayHandle2, int(id[0]), 1);

      sum1 = px1[0] + px3[0];
      sum2 = px2[0] + px4[0];

      // Extension: write to image layers with handle
      sycl::ext::oneapi::experimental::write_image_array<VecType>(
          outHandle, int(id[0]), 0, VecType(sum1));
      sycl::ext::oneapi::experimental::write_image_array<VecType>(
          outHandle, int(id[0]), 1, VecType(sum2));
    });
  });

  q.wait_and_throw();

  // Extension: copy data from device to host
  q.ext_oneapi_copy(outMem.get_handle(), out.data(), desc);
  q.wait_and_throw();

  // Extension: cleanup
  sycl::ext::oneapi::experimental::destroy_image_handle(arrayHandle1, dev, ctxt);
  sycl::ext::oneapi::experimental::destroy_image_handle(arrayHandle2, dev, ctxt);
  sycl::ext::oneapi::experimental::destroy_image_handle(outHandle, dev, ctxt);

} catch (sycl::exception e) {
  std::cerr << "SYCL exception caught! : " << e.what() << "\n";
  std::cout << "Test failed!" << std::endl;
  exit(1);
} catch (...) {
  std::cerr << "Unknown exception caught!\n";
  std::cout << "Test failed!" << std::endl;
  exit(2);
}

// collect and validate output
bool validated = true;
for (int i = 0; i < N * array_size; i++) {
  bool mismatch = false;
  if (out[i][0] != expected[i]) {
    mismatch = true;
    validated = false;
  }
}
if (validated) {
  return 0;
}

return 1;

Sampling a cubemap

#include <iostream>
#include <sycl/sycl.hpp>

int main() {

  namespace syclexp = sycl::ext::oneapi::experimental;

  sycl::device dev;
  sycl::queue q(dev);
  auto ctxt = q.get_context();

  // declare image data
  // width and height must be equal
  size_t width = 8;
  size_t height = 8;
  size_t N = width * height;
  std::vector<float> out(N);
  std::vector<float> expected(N);
  std::vector<sycl::float4> dataIn1(N * 6);
  for (int i = 0; i < width; i++) {
    for (int j = 0; j < height; j++) {
      for (int k = 0; k < 6; k++) {
        dataIn1[i + width * (j + height * k)] = {i + width * (j + height * k),
                                                 0, 0, 0};
      }
    }
  }

  int j = 0;
  for (int i = N - 1; i >= 0; i--) {
    expected[j] = (float)i;
    j++;
  }

  // Extension: image descriptor - Cubemap
  syclexp::image_descriptor desc(
      {width, height}, sycl::image_channel_order::rgba,
      sycl::image_channel_type::fp32, syclexp::image_type::cubemap, 1, 6);

  syclexp::bindless_image_sampler samp(
      sycl::addressing_mode::clamp_to_edge,
      sycl::coordinate_normalization_mode::normalized,
      sycl::filtering_mode::nearest, syclexp::cubemap_filtering_mode::seamless);

  try {
    // Extension: allocate memory on device and create the handle
    syclexp::image_mem imgMem(desc, dev, ctxt);

    // Extension: create the image and return the handle
    syclexp::sampled_image_handle imgHandle =
        syclexp::create_image(imgMem, samp, desc, dev, ctxt);

    // Extension: copy over data to device (handler variant)
    q.submit([&](sycl::handler &cgh) {
      cgh.ext_oneapi_copy(dataIn1.data(), imgMem.get_handle(), desc);
    });
    q.wait_and_throw();

    sycl::buffer<float, 2> buf((float *)out.data(),
                               sycl::range<2>{height, width});
    q.submit([&](sycl::handler &cgh) {
      auto outAcc = buf.get_access<sycl::access_mode::write>(
          cgh, sycl::range<2>{height, width});

      // Emanating vector scans one face
      cgh.parallel_for<kernel>(
          sycl::nd_range<2>{{width, height}, {width, height}},
          [=](sycl::nd_item<2> it) {
            size_t dim0 = it.get_local_id(0);
            size_t dim1 = it.get_local_id(1);

            // Direction Vector
            // x -- largest magnitude
            // y -- shifted between [-0.99, 0.99] + offset
            // z -- shifted between [-0.99, 0.99] + offset
            //
            // [-0.99, 0.99] -- maintains x as largest magnitude
            //
            // 4 elems == [-1, -0.5, 0, 0.5] -- need offset to bring uniformity
            // +0.25 = [-0.75, -0.25, 0.25, 0.75]
            float fdim0 = 1.f;
            float fdim1 = (((float(dim0) / (float)width) * 1.98) - 0.99) +
                          (1.f / (float)width);
            float fdim2 = (((float(dim1) / (float)height) * 1.98) - 0.99) +
                          (1.f / (float)height);

            // Extension: read texture cubemap data from handle
            sycl::float4 px = syclexp::sample_cubemap<sycl::float4>(
                imgHandle, sycl::float3(fdim0, fdim1, fdim2));

            outAcc[sycl::id<2>{dim0, dim1}] = px[0];
          });
    });
    q.wait_and_throw();

    // Extension: cleanup
    syclexp::destroy_image_handle(imgHandle, dev, ctxt);
  } catch (sycl::exception e) {
    std::cerr << "SYCL exception caught! : " << e.what() << "\n";
    return 1;
  } catch (...) {
    std::cerr << "Unknown exception caught!\n";
    return 2;
  }

  // collect and validate output
  bool validated = true;
  for (int i = 0; i < N; i++) {
    bool mismatch = false;
    if (out[i] != expected[i]) {
      mismatch = true;
      validated = false;
    }
    if (mismatch) {
      std::cout << "Result mismatch! Expected: " << expected[i]
                << ", Actual: " << out[i] << std::endl;
    }
  }
  if (validated) {
    std::cout << "Test passed!" << std::endl;
    return 0;
  }

  std::cout << "Test failed!" << std::endl;
  return 3;
}

Using imported memory and semaphore objects

// Set up device, queue, and context
sycl::device device;
sycl::queue queue(device);
sycl::context context = queue.get_context();

size_t width = /* passed from external API */;
size_t height = /* passed from external API */;

sycl::ext::oneapi::experimental::image_channel_order channel_order =
    /* mapped from external API */
    /* we assume sycl::image_channel_order::r */;

sycl::ext::oneapi::experimental::image_channel_type channel_type =
    /* mapped from external API */
    /* we assume sycl::image_channel_type::unsigned_int32 */;

// Image descriptor - mapped to external API image layout
sycl::ext::oneapi::experimental::image_descriptor desc(
    {width, height}, channel_order, channel_type);

size_t img_size_in_bytes = width * height * sizeof(uint32_t);

int external_input_image_file_descriptor = /* passed from external API */
int external_output_image_file_descriptor = /* passed from external API */

// Extension: populate external memory descriptors
sycl::ext::oneapi::experimental::external_mem_descriptor<
    sycl::ext::oneapi::experimental::resource_fd>
    input_ext_mem_desc{external_input_image_file_descriptor,
                       img_size_in_bytes};

sycl::ext::oneapi::experimental::external_mem_descriptor<
    sycl::ext::oneapi::experimental::resource_fd>
    output_ext_mem_desc{external_output_image_file_descriptor,
                        img_size_in_bytes};

// An external API semaphore will signal this semaphore before our SYCL commands
// can begin execution
int wait_semaphore_file_descriptor = /* passed from external API */;

// An external API will wait on this semaphore to be signalled by us before it
// can execute some commands
int done_semaphore_file_descriptor = /* passed from external API */;

// Extension: populate external semaphore descriptor.
//            We assume POSIX file descriptor resource types
sycl::ext::oneapi::experimental::external_semaphore_descriptor<
    sycl::ext::oneapi::experimental::resource_fd>
    wait_external_semaphore_desc{wait_semaphore_file_descriptor};

sycl::ext::oneapi::experimental::external_semaphore_descriptor<
    sycl::ext::oneapi::experimental::resource_fd>
    done_external_semaphore_desc{done_semaphore_file_descriptor};

try {
  // Extension: import external semaphores
  sycl::ext::oneapi::experimental::interop_semaphore_handle
      wait_interop_semaphore_handle =
          sycl::ext::oneapi::experimental::import_external_semaphore(
              wait_external_semaphore_desc, queue);

  sycl::ext::oneapi::experimental::interop_semaphore_handle
      done_interop_semaphore_handle =
          sycl::ext::oneapi::experimental::import_external_semaphore(
              done_external_semaphore_desc, queue);

  // Extension: import external memory from descriptors
  sycl::ext::oneapi::experimental::interop_mem_handle
      input_interop_mem_handle =
          sycl::ext::oneapi::experimental::import_external_memory(
              input_ext_mem_desc, queue);

  sycl::ext::oneapi::experimental::interop_mem_handle
      output_interop_mem_handle =
          sycl::ext::oneapi::experimental::import_external_memory(
              output_ext_mem_desc, queue);

  // Extension: map imported external memory to image memory
  sycl::ext::oneapi::experimental::image_mem_handle input_mapped_mem_handle =
      sycl::ext::oneapi::experimental::map_external_image_memory(
          input_interop_mem_handle, desc, queue);
  sycl::ext::oneapi::experimental::image_mem_handle output_mapped_mem_handle =
      sycl::ext::oneapi::experimental::map_external_image_memory(
          output_interop_mem_handle, desc, queue);

  // Extension: create images from mapped memory and return the handles
  sycl::ext::oneapi::experimental::unsampled_image_handle img_input =
      sycl::ext::oneapi::experimental::create_image(
          input_mapped_mem_handle, desc, queue);
  sycl::ext::oneapi::experimental::unsampled_image_handle img_output =
      sycl::ext::oneapi::experimental::create_image(
          output_mapped_mem_handle, desc, queue);

  // Extension: wait for imported semaphore
  q.ext_oneapi_wait_external_semaphore(wait_interop_semaphore_handle)

  // Submit our kernel that depends on imported "wait_semaphore_file_descriptor"
  q.submit([&](sycl::handler &cgh) {
    cgh.parallel_for<>(
        sycl::nd_range<2>{{width, height}, {32, 32}},
        [=](sycl::nd_item<2> it) {
          size_t dim0 = it.get_global_id(0);
          size_t dim1 = it.get_global_id(1);

          // Extension: read image data from handle to imported image
          uint32_t pixel =
              sycl::ext::oneapi::experimental::fetch_image<uint32_t>(
                  img_input, sycl::vec<int, 2>(dim0, dim1));

          // Modify the data before writing back
          pixel *= 10;

          // Extension: write image data using handle to imported image
          sycl::ext::oneapi::experimental::write_image(
              img_output, sycl::vec<int, 2>(dim0, dim1), pixel);
        });
  });

  // Extension: signal imported semaphore
  q.ext_oneapi_signal_external_semaphore(done_interop_semaphore_handle)

  // The external API can now use the semaphore it exported to
  // "done_semaphore_file_descriptor" to schedule its own command submissions

  q.wait_and_throw();

  // Extension: destroy all external resources
  sycl::ext::oneapi::experimental::release_external_memory(
      input_interop_mem_handle, queue);
  sycl::ext::oneapi::experimental::release_external_memory(
      output_interop_mem_handle, queue);
  sycl::ext::oneapi::experimental::destroy_external_semaphore(
      wait_interop_semaphore_handle, queue);
  sycl::ext::oneapi::experimental::destroy_external_semaphore(
      done_interop_semaphore_handle, queue);
  sycl::ext::oneapi::experimental::destroy_image_handle(img_input, queue);
  sycl::ext::oneapi::experimental::destroy_image_handle(img_output, queue);
} catch (sycl::exception e) {
  std::cerr << "SYCL exception caught! : " << e.what() << "\n";
  exit(-1);
} catch (...) {
  std::cerr << "Unknown exception caught!\n";
  exit(-1);
}

Implementation notes

The current DPC++ prototype only implements the proposal for the CUDA backend, however we are actively exploring Level Zero with SPIR-V. We are looking at other backend as well in order to ensure the extension can work across different backends.

Issues

No dependency tracking

Because this extension allows images to work in a USM-like model, there are similar limitations to using USM for non-images, mainly the lack of dependency tracking and the need for users to manually synchronize operations.

Limitations when using USM as image memory

There are dimension specific limitations:

  • 1D - Linear interpolation not possible in the CUDA backend. A workaround is to allocate 2D pitched memory with a height of 1.

  • 2D - There are some alignment restrictions. See the "Pitch alignment restrictions and queries" section, or use pitched_alloc_device to allocate 2D USM image memory.

  • 3D - No support at the moment. Possible support in non CUDA backends in the future.

Not supported yet

These features still need to be handled:

  • Level Zero and SPIR-V support

Revision History

Rev Date Changes

1

2023-02-03

Initial draft

2

2023-02-23

- Added image_mem_handle for image memory allocated with allocate_image

- Added ability to create images from USM

- Added new way to copy images, removed requirement for copy direction

- Added image memory information getters to reflect cuArray3DGetDescriptor functionality

3

2023-03-30

- Some text clarifications.

- Unsampled images can no longer be created from USM.

- Added SYCL 1.2.1 image_channel_order and image_channel_type structs.

- Added image_type to enable construction of layered, mipmap, and cubemap images in the future.

- Added device information descriptors for querying pitched allocation size and alignment requirement.

- Added ext_oneapi_copy methods for the sycl::handler.

- ext_oneapi_copy functions now take the Src as the first parameter.

- Created image_mem as a RAII style class.

- Renamed allocate_image to alloc_image_mem

- pitched_alloc_device can now take an image_descriptor.

- Added interoperability features

- Added support to query bindless image and interoperability capabilities

- Added mipmap support

4

2023-06-23

- Added sycl::device parameter to multiple functions to clarify that images must be created and used on the same device.

- Changed naming and order of some parameters to be consistent throughout the proposal and with core SYCL.

- Added variants of functions that take a sycl::queue instead of both sycl::device and sycl::context.

- Removed standalone wait and signal semaphore functions. These should always go through the queue or handler methods.

- Removed get_image_handle and get_sampler_handle functions from sampled and unsampled image handle structs. The structs have public handle members that can be retrieved without getters.

- Made all enum types and values unspecified

- Moved support queries to device aspects, improved naming of queries for better consistency, and moved device info queries to the experimental namespace.

- Added get_mip_level_desc member function to image_descriptor

- Fixed get_mip_level_mem_handle prototype in image_mem, and added a standalone function.

- Removed ext_oneapi_copy variants that take image_mem, the user should retrieve the raw handle and pass that themselves.

- Removed ext_oneapi_copy variants that take a mip level, the user should retrieve individual mip level image handles themselves and pass that.

- Added ext_oneapi_copy variants that take offsets and the extent, to enable sub-region copy.

- Created a list of failure scenarios for ext_oneapi_copy, changed the failure error code to errc::invalid, and specified that the implementation should relay the reason for the failure back to the user.

- Added a bindless_image_sampler struct.

- Specified that image_mem must follow Common Reference Semantics.

- Updated code samples.

4.1

2023-07-21

- Made bindless image sampler member names snake-case

4.2

2023-08-18

- write_image now allows passing of user-defined types

4.3

2023-09-08

- Clarify how normalized image formats are read - Remove support for packed normalized image formats (unorm_short_555, unorm_short_565, unorm_int_101010)

4.4

2023-09-12

- Added overload with sycl::queue to standalone functions

4.5

2023-09-14

- Update wording for allocating images + fix typo

4.6

2023-09-19

- Clarify restrictions on reading/writing coordinate types

4.7

2023-10-16

- Introduce read_mipmap for mipmap access and clarify reading restrictions on image types

4.8

2023-10-25

- Change the name of map_external_memory_array to map_external_image_memory to avoid CUDA terminology

4.9

2023-11-13

- Add that the bindless sampler is default constructible and follows by-value semantics

4.10

2023-11-15

- Added constructors for sampled_image_handle and unsampled_image_handle structs. - Removed raw_sampler_handle member from sampled_image_handle struct. Awaiting LevelZero and SPIR-V extensions to mature before before deciding whether a raw_sampler_handle member is necessary. - Renamed image_handle members in sampled_image_handle and unsampled_image_handle structs to raw_handle.

5.0

2023-11-21

- Added section "Recognized standard types", to simplify wording around what types are allowed to be read or written. - Allow read_image and read_mipmap to return a user-defined type.

5.1

2024-01-17

- Added overload for ext_oneapi_copy enabling device to device copies using image_mem_handle.

5.1

2023-12-06

- Added unique addressing modes per dimension to the bindless_image_sampler

5.2

2024-02-14

- Image read and write functions now accept 3-component coordinates for 3D reads, instead of 4-component coordinates.

5.3

2024-02-16

- Replace read_image and read_mipmap APIs in favor of more descriptive naming, with fetch_image, sample_image, and sample_mipmap.

5.4

2024-02-23

- Added support for unsampled image arrays. - Creation of unsampled image arrays. - Fetching/writing of unsampled image arrays. - image_type::array added to enum. - array_size member added to image_descriptor. - image_descriptor::verify() member function added.

5.5

2024-02-27

- Update interop with mipmap interop and slight redesign - interop removed from image_type

5.6

2024-03-04

- Added cubemap support. - Allocation of cubemaps. - Creation of cubemaps. - Fetching/writing of unsampled cubemaps and sampling cubemaps. - image_type::cubemap added to enum. - Cubemap example. - Updated image_array_write with non-const handle parameter. - Removed & reference qualifier from write_xxx handle parameter.

5.7

2024-04-09

- Allow fetching of sampled image data through the fetch_image API.

5.8

2024-05-09

- Add missing cubemap HintT template parameter to fetch_cubemap and sample_cubemap.