Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[sycl_ext_oneapi_bindless_images] Extension proposals for SYCL (rev 3) and Level Zero (rev 3) #8307

Closed
wants to merge 5 commits into from

Conversation

ProGTX
Copy link
Contributor

@ProGTX ProGTX commented Feb 10, 2023

Initial proposal for adding support for Bindless Images in SYCL.

Co-authored-by: Przemek Malon przemek.malon@codeplay.com
Co-authored-by: Isaac Ault isaac.ault@codeplay.com
Co-authored-by: Sean Stirling sean.stirling@codeplay.com
Co-authored-by: Duncan Brawley duncan.brawley@codeplay.com

Initial proposal for adding support for Bindless Images in SYCL.

Co-authored-by: Przemek Malon <przemek.malon@codeplay.com>
Co-authored-by: Isaac Ault <isaac.ault@codeplay.com>
Co-authored-by: Sean Stirling <sean.stirling@codeplay.com>
Co-authored-by: Duncan Brawley <duncan.brawley@codeplay.com>
@ProGTX
Copy link
Contributor Author

ProGTX commented Feb 10, 2023

Tagging @gmlueck @jbrodman @bashbaug

@ProGTX
Copy link
Contributor Author

ProGTX commented Feb 10, 2023

Note that this PR is not intended for merging just yet, only to get feedback.

Both the SYCL (revision 2) and Level Zero (revision 1) bindless images proposals
have been converted to ASCIIDOC and moved into the experimental extensions directory.
@ProGTX
Copy link
Contributor Author

ProGTX commented Feb 22, 2023

The SYCL extension has been updated to revision 2, and we also added an extension for Level Zero (revision 1).

@ProGTX ProGTX changed the title [sycl_ext_oneapi_bindless_images] Extension proposal revision 1 [sycl_ext_oneapi_bindless_images] Extension proposals for SYCL (rev 2) and Level Zero (rev 1) Feb 22, 2023
`write_image` is not available for sampled images, or images created from USM
memory.

`DataT` must correspond to the type specified in the `image_descriptor`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no DataT in the image descriptor. There are the image_channel_order and image_channel_type enumerations, but how will they map to a DataT ?
The reason I ask is that in SYCL 1.2.1 images, no matter the choices for image_channel_order and image_channel_type, the accessor data type was always just cl_int4 or cl_float4 . IIRC, even one channel images use a cl_int4, and similarly there is no difference in the accessor data type for image_channel_type::unsigned_int8 versus ::unsigned_int16 or ::unsigned_int32. They all used cl_int4 for the accessor type.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't aware this was the case in SYCL 1.2.1. With this proposal we can pass a DataT that corresponds to any of the combinations of image_channel_order and image_channel_type, e.g. an image_channel_type::fp16 combined with a iamge_channel_order::rg would make DataT == sycl::half2 == sycl::vec<sycl::half, 2>.

We will need to clarify that we are looking for sycl::vec types for multi-channel stores as it is not explained, and maybe include a table of the acceptable combinations and resulting types. We also need to add a restrictions on channels, as CUDA only supports 1, 2, or 4 channel images, when we pass a channel order like rgb, we would need to emulate it with a 4-channel image, or get rid of the 3-channel image orders altogether.

There was feedback about various use cases where the current model falls short
(see examples at the end of this document for some of the use cases).

The main issue is requesting access to each individual image.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear what this sentence means. Of course, SYCL 2020 images provide the ability to access each of the images that are defined in the application, so I think you must mean some other limitation.

[frame="none",options="header"]
|======================
|Value |Description
|202302 |Initial version of this extension
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our convention with other extensions is to start numbering at 1 and then increment by one for each change to the API.

=== Image descriptor

```cpp
namespace sycl::ext::oneapi {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is an experimental extension, all declarations should be in the namespace sycl::ext::oneapi::experimental.

Note that `image_channel_type` and `image_channel_order` existed in SYCL 1.2.1,
but were removed in SYCL 2020 in favor of a single, unified enum class.
We propose separating them again to enable better flexibility
and to avoid combinatorial complexity.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since these types are no longer in the SYCL specification, I think they should be added to this extension. I presume they would be in the same namespace as image_descriptor.

Comment on lines 131 to 132
image_mem_handle allocate_image(const context& syclContext,
image_descriptor desc);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
image_mem_handle allocate_image(const context& syclContext,
image_descriptor desc);
image_mem_handle allocate_image_mem(const context& syclContext,
const image_descriptor &desc);

The name allocate_image_mem seems better because it distinguishes this operation from create_image below. I presume you want to pass the image_descriptor by reference since it is a struct with several fields?

// SYCL deals with indexing in row-major fashion
// Reverse output buffer dimensions and access to convert
// the cuda column-major data back to row-major
buffer<float, 2> buf((float *)out.data(), range<2>{height, width});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
buffer<float, 2> buf((float *)out.data(), range<2>{height, width});
buffer buf(out.data(), range{height, width});

I think the cast is not needed and CTAD should deduce these template parameters.

Comment on lines 451 to 452
buffer<sycl::ext::oneapi::unsampled_image_handle, 1> imgHandlesBuf(
imgHandles.data(), range<1>{numImages});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
buffer<sycl::ext::oneapi::unsampled_image_handle, 1> imgHandlesBuf(
imgHandles.data(), range<1>{numImages});
buffer imgHandlesBuf(imgHandles.data(), range{numImages});

I believe CTAD allows you to simplify it to this.

// Cuda stores data in column-major fashion
// SYCL deals with indexing in row-major fashion
// Reverse output buffer dimensions and access to convert
// the cuda column-major data back to row-major
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something seems wrong here if the CUDA memory layout is leaking through the API. Shouldn't the SYCL API provide a consistent mapping of dimensions to rows and columns in the image? Otherwise, it's impossible to write portable code that accesses images.

We are looking at other backend as well in order to ensure the extension can
work across different backends.

== Issues
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a global comment here only so the conversation can be threaded.

Several parts of this API would allow better compile-time error checking if the unsampled_image_handle, sampled_image_handle, and image_mem_handle types were templated on the number of dimensions. Is this worth considering? I think the main downside is that it becomes impossible to create an array of image handles where the handles have diverse numbers of dimensions. I'm not sure if that is an important use case, though. Even if it were a use case, users could instead create a std::tuple of image handles that have different numbers of dimensions.

To give one example of where the compile-time checking could be improved, consider read_image and write_image. Currently, the type of CoordT cannot be checked at compile-time because the compiler doesn't know the number of dimensions. Even worse, this cannot be checked at runtime either because exceptions aren't supported in device code. Therefore, passing a CoordT with the wrong number of dimensions could result in a non-obvious failure that is hard to debug. If the image handles were templated on the number of dimensions, we could check this all at compile-time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Several parts of this API would allow better compile-time error checking if the unsampled_image_handle, sampled_image_handle, and image_mem_handle types were templated on the number of dimensions. Is this worth considering?

It's definitely worth considering. The inability to store differently dimensioned image's handles in a single container like vector was the reason we didn't go with more templating, and I think it would also make SYCLomatic's job a bit harder. It does provide more opportunity for compile time error checking though. We can look into this

We are looking at other backend as well in order to ensure the extension can
work across different backends.

== Issues
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And another global comment.

I noted above that the image handle types need to define a copy constructor / operator in order to enable your examples. One way to address this is to say that they have the "common reference semantics". This does not give you a default constructor, though, so you would still need to specify that.

Defining image_mem_handle as an object with common reference semantics would have some nice benefits:

  • The functions listed under "Getting image information from non-USM image memory" could all be member functions instead of free functions.
  • The image_mem_handle object could store its context internally, which means that most of the functions operating on it would no longer need a context parameter.
  • The allocate_image function would simply become a constructor for image_mem_handle.
  • We would no longer need the free_image function because the destructor would automatically deallocate the memory (because it's common reference semantics, this would only happen when the last reference was dropped).
  • The programming model would become RAII, which is generally considered a good C++ design principle.
  • This type would be more consistent with other SYCL types, which also have common reference semantics.

Defining unsampled_image_handle and sampled_image_handle as common reference semantics would have similar nice benefits.

Co-authored-by: Przemyslaw Malon <przemek.malon@codeplay.com>
Co-authored-by: Sean Stirling <sean.stirling@codeplay.com>

- Some text clarifications.

- Unsampled images can no longer be created from USM.

- Added SYCL 1.2.1 `image_channel_order` and
  `image_channel_type` structs.

- Added `image_type` to enable construction of layered,
  mipmap, and cubemap images in the future.

- Added device information descriptors for querying pitched
  allocation size and alignment requirement.

- Added `ext_oneapi_copy` methods for the `sycl::handler`.

- `ext_oneapi_copy` functions now take the `Src` as the first
  parameter.

- Created `image_mem` as a RAII style class.

- Renamed `allocate_image` to `alloc_image_mem`

- `pitched_alloc_device` can now take an `image_descriptor`.

- Added interoperability features

- Added support to query bindless image and interoperability
  capabilities

- Added mipmap support
@ProGTX ProGTX changed the title [sycl_ext_oneapi_bindless_images] Extension proposals for SYCL (rev 2) and Level Zero (rev 1) [sycl_ext_oneapi_bindless_images] Extension proposals for SYCL (rev 3) and Level Zero (rev 1) Mar 30, 2023
* Structures
+
```
ze_image_mem_exp_desc_t
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @ProGTX . Since this is an extension for L0, it is better to define it in the L0 specification repo https://github.com/oneapi-src/level-zero-spec/, instead of here on SYCL.

I have created a new issue in the L0 spec to add this oneapi-src/level-zero-spec#109


Also, accessing images through USM is not supported by Level Zero.
This would allow the reading and writing of image data with conventional pointers, in contrast to the current Level Zero images which require the exclusive use of API.
This proposal introduces bindless images to Level Zero which allows for USM access.
Copy link

@zzdanowicz zzdanowicz Apr 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why you need to access image via USM pointer?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A requirement for our extension was to allow for the creation of bindless images via USM. A potential use-case is re-using existing USM memory preventing copies, creating an image on top of that memory, and then accessing the image with hardware sampling applied to the data.

Copy link

@zzdanowicz zzdanowicz Apr 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have plans to access such image from the host side via such USM pointer, like reading pixels?
Or is it only to provide backing storage for such image?
What are the use-cases for this extension?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. We don't have any plans on accessing images from the host-side via USM. As of now, this would strictly be device accesses through device pointers.
  2. There is no backing storage since this would only be a device allocation.
  3. A valuable use-case is the re-use of existing USM memory. A bindless image can then be created directly on-top of this memory allowing for the hardware sampling of the underlying device data.
  4. Other use-cases include coherence between the SYCL bindless images extension and SYCLomatic's objective.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. We don't have any plans on accessing images from the host-side via USM. As of now, this would strictly be device accesses through device pointers.

Accessing USM pointer from host side or from device side (but by trying to read by dereferencing the pointer) would be essentially the same. Because images by default have different memory layout (tiled) that is needed for image sampler reads.
Reading data from such tiled memory using USM pointer would be meaningless - EU side code would still need to read such image by using dedicated read/write image functions.
Sure, driver can setup such image into linear memory layout, but 1st - you can expect worse performance and 2nd - there would be no need for hardware sampling, it would be just like reading any ordinary buffer.

  1. A valuable use-case is the re-use of existing USM memory. A bindless image can then be created directly on-top of this memory allowing for the hardware sampling of the underlying device data.

Have you considered to use L0 Image View extension: https://spec.oneapi.io/level-zero/latest/core/api.html#imageview

Copy link
Contributor

@Seanst98 Seanst98 Apr 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As of right now, we have only worked with USM bindless images on a device resident basis. We propose that images may be created from linear memory in this extension. That way, dereferencing the pointer would return the expected image data.

I would imagine that the performance would be dependent on the use-case. Allocating new memory for a tiled layout and copying the linear layout image data to this new allocation could be slower than simply allowing for the creation of an image on top of the already existing linear data and sampling that (despite the slower sampling speed).

I don't think we've seen the L0 Image View extension. However, at a quick glance, this seems to only provide a mechanism that redescribes how an already existing image is interpreted. In the use-case I mentioned above, there was no prior existing image. I'll take some more time to look deeper into the Image View.


== Overview

Bindless images are a feature that provides flexibility on how images are accessed and used, such as removing limitations on how many images can be accessed as well as potentially improving performance.
Copy link

@HoppeMateusz HoppeMateusz Apr 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you describe what are bindless images and how they differ from bindful images?

what is the goal of this extension ?

why current L0 API cannot be used?

If the main reason to have this extension is to be able to access image memory on HOST with a pointer , or allocate an image with provided memory ( imported) - then i do not understand benefits over RAW buffer memory - without image format information, we cannot use tiled images or even linear images and samplers - images have their restrictions on row pitch / slice pitch and dimensions.

we cannot import any memory and create image out of it .

if HOST does not need to access memory - Image view extension can redescribe image and create new image from existing / allocated memory. Wouldnt that be enough ?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HoppeMateusz Mateusz this is the main point - bindless textures are not bindless or bindful images (or buffers). It seems the word "bindless" has two different meaning and therefore people can get confused.

  1. Bindless texture - L0 extension draft (here) - texture (image) not attached to the backing storage (thus not bound - bindless).
  2. Bindless and bindful addressing model - two flavors of stateful addressing.

So this extension is very similar to the OpenCL 1.2 image from buffer extension, but with dynamic selection from which buffer image is created. And yes, it looks like only flat buffer memory could be supported, so no tiled memory layout as sampler capable platform could read it.

Copy link

@HoppeMateusz HoppeMateusz Apr 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

event with flat linear memory - sampler may not be able to read it if pitch dimensions are not aligned with HW requirements - and from the intents in this spec - i see that any mmoery can be used to create an image - imported memory from different API - it means memory may not be big enough to have necessary padding or memory filled on HOST may not know excatct HW restrictions on row pith and slice pitch and paddings

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the responsibility of the user to hold onto the pitch information for their allocation. This information is contained in ze_image_mem_exp_desc_t and is contained in the struct passed to zeImageCreate.

It is also the responsibility of the user to make sure that their allocation meets the requirements of the HW. I think stating this has been missed in the proposal. Potentially, we could make it the responsibility of the user to query for the HW requirements and guarantee their allocations are consistent (or face the undefined consequences), or zeImageCreate would throw on encountering a problem like this.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

user defined pitch may not align with HW restrictions - if it doesn;t memory will have to be reallocated ( copied) - this is what we do in OpenCL when createImage with USE HOST PTR flag is used.

Restrictions depend on HW image format used - for RGBA restrictions may be different than for planar YUV image, or YCrCb images.

and once requirements are queried by user - the memory cannot be used ( in genreal) for image of tdifferent format - so user will have to allocate new memory - or stop using/ interpreting this memory as described by first image formt..

Does spec cover all of this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case where the user has allocated with our extension to zeMemAllocDevice (though we are probably going to need to propose a new allocation function), the optimised pitch variables are returned to the user. This way the pitch will align with the HW requirements when creating the image.

However, in the case where the user doesn't allocate with our extended allocation function, then it should be the responsibility of the user to query the hardware for the alignment requirements and make sure that their allocation is consistent with those requirements before creating the image.

What are the restrictions based on if not the number of bytes? The user should not be able to create an image on top of memory whose specified formats disagree with one another. We were also recently made aware of the Image View extension, what happens in the case where a user chooses to make an image view with a different image format than what the original image is defined with?

The spec doesn't cover these cases in detail enough. It's clear we're missing a lot and we are hoping to take your comments into consideration. Your feedback is greatly appreciated!

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from the code - looks that we do not check any compatibility - so any image view could be applied to existing image and L0 would not fail.

For bindless images in SYCL to work in Level Zero, not only must there be a way to produce bindless image handles, but also allocate and populate image memory before `zeImageCreate` is called.

Level Zero stores image memory in an implementation-specific encoding and layout that optimizes for device access such as the z-order curve.
This proposal, instead, focuses on storing image memory in a linear fashion.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why forcing linear layout ?
This will have negative performance impact.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MichalMrozek we could make this linear layout opt-in for sampler-less platforms. But there is nothing in principle against using USM pointer as a backing storage, define it in API as inaccessible from CPU and read it via sampler and image surface state on GPU.

Copy link
Contributor

@Seanst98 Seanst98 Apr 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because L0 stores image memory in an implementation-specific layout and can create images on top of that memory, we would also wish to allow for the creation of images on top of linear memory.

The performance impact can depend entirely on the access pattern. This is something that the user decides for themselves by choosing the appropriate allocation function as set out in the rev3 SYCL bindless images extension proposal.

@zzdanowicz are you saying that we wouldn't be able to perform sampling on linear memory? We've been able to achieve this on the CUDA backend. Ideally, we'd like to achieve this through L0 as well.

The main point is that an image should be able to sample from both linear memory and implementation-specific layouts.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need to know all HW restrictions on dimensions - row pitch slice pitch , imnage width / height based on the format used - only then linear image can by accessed by a sampler.

do you know all those HW restrictions when creating and filling memory on HOST ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In our revision 3 spec for the SYCL bindless images extension, we propose device query functions so that the user can make sure that their allocation is consistent with the hardware requirements. See https://github.com/intel/llvm/blob/67a37b1786213aba78140c4fb74b906ba0fdd1d3/sycl/doc/extensions/experimental/sycl_ext_oneapi_bindless_images/sycl_ext_oneapi_bindless_images.asciidoc#pitch-alignment-restrictions-and-queries

The allocation function we provide will guarantee these restrictions are met. But if the user wishes to use another allocation, then they must be aware of these restrictions, and query the device to ensure they are correct.

+
[out][optional] pitch of slice

`ze_image_mem_alloc_exp_desc_t` is intended to be passed to `zeMemAllocDevice` when manually allocating memory for an image to return optimized pitch information.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if ze_image_mem_alloc_exp_desc_t will be passed to zeMemAllocDevice, then how would this ensure that size passed to zeMemAllocDevice is enough to fit the image?
What if size would be lower then image requirements ? Is error expected to be returned?

The below pseudo code assumes that image would fit
width * height * sizeof(float)

This will not be true for all sizes, HW has certain pitch requirements.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that there is an oversight in this proposal where we still include the passing of the size parameter to zeMemAllocDevice. This can cause confusion between the struct we pass which contains the dimensions/sizing and return pitch information, while still passing this size parameter. We may need to instead propose a new allocation function.

It was intended that the struct ze_image_mem_alloc_exp_desc_t passed to zeMemAllocDevice contains the fields rowPitch and slicePitch, and the function would allocate the appropriate amount of memory with the widthInBytes, height and depth values. These are return variables that the device populates with the optimal pitch values.

Co-authored-by: Isaac Ault <isaac.ault@codeplay.com>
Co-authored-by: Przemek Malon <przemek.malon@codeplay.com>
Co-authored-by: Sean Stirling <sean.stirling@codeplay.com>

Update the Level Zero bindless images extension proposal to revision 2:
 - Re-designed the proposed API
 - Revamped overview and background section
 - Added specification overview
 - Added definitions (defs, interfaces, enums)
 - Added programming examples
@ProGTX ProGTX changed the title [sycl_ext_oneapi_bindless_images] Extension proposals for SYCL (rev 3) and Level Zero (rev 1) [sycl_ext_oneapi_bindless_images] Extension proposals for SYCL (rev 3) and Level Zero (rev 2) Apr 11, 2023
/// - ::rowPitch and ::slicePitch are [out] parameters
/// + ::ze_image_desc_t.pNext
/// - ::rowPitch and ::slicePitch are [in] parameters
/// - When this structure is passed to an image descriptor,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we pass this to zeImageCreate? so we need to pass it to zeMemAllocPitchedExp and then again to zeImageCreate? wouldn't be better to pass the output from zeMemAllocPitchedExp to zeImageCreate, w/o needing to pass again the struct?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To my understanding, zeImageCreate will require the data fields in this struct. That said, it will also need the output from zeMemAllocPitchedExp which is an issue we will address in the next revision.

In the next revision, we hope to remove the image memory descriptors we've proposed and piggy-back off ze_image_desc_t, using pNext for any additional data fields we may require. This is a possible place where we could add the returned memory handle of zeMemAllocPitchedExp, so that it could be propogated forward to zeImageCreate.

typedef struct _ze_device_image_bindless_exp_desc_t {
ze_structure_type_t stype; // [in]
void *pNext; // [in,out][optional]
uint32_t imagePitchAlign; // [out]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

imagePitchAlign -> minimum Alignment I guess?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it only row pitch aligment ?
how about 3D images that may have slice pitch alignment ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Answered above. This will need to be extended to support 3D images fully

/// - ::ZE_RESULT_ERROR_INVALID_NULL_HANDLE
/// + nullptr == hContext
ZE_APIEXPORT ze_result_t ZE_APICALL
zeMemAllocImageExp(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why would be the CUDA equivalent of this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cuArray3DCreate
CUresult cuArray3DCreate ( CUarray* pHandle, const CUDA_ARRAY3D_DESCRIPTOR* pAllocateArray )

/// - ::ZE_RESULT_ERROR_INVALID_NULL_HANDLE
/// + nullptr == hContext
ZE_APIEXPORT ze_result_t ZE_APICALL
zeMemAllocPitchedExp(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same, why would be the equivalent CUDA API of this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cuMemAllocPitch
CUresult cuMemAllocPitch ( CUdeviceptr* dptr, size_t* pPitch, size_t WidthInBytes, size_t Height, unsigned int ElementSizeBytes )

ze_context_handle_t hContext, /// [in] handle of the context object
const ze_image_mem_alloc_exp_desc_t *image_desc, /// [in] memory allocation descriptor
ze_image_mem_handle_exp_t *mem_handle /// [out] pointer to device allocation handle
);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont think we need this function. We could just reuse zeImageCreate, and through pNext, pass ze_image_mem_alloc_exp_desc_t. When L0 detects that descriptor in pNext, then it would behave as this new one. That way we minimize the number of zeImageCreate APIs in L0 - we already have 2, this would be the 3rd - and that way we minimize confusion. @wdamon-intel, what do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the important part of the nomenclature of this function is the "Alloc" aspect. As mentioned in the overview of this proposal, we seek to have

The creation of images to be split into explicit allocation of image memory and the creation of image handles from the previously allocated memory.

It's because of this that zeImageCreate wasn't considered for this, as it's nomenclature implies returning an image, where the proposed zeMemAllocImageExp would allocate memory for the image and return the allocation.

That said, I think we could use zeImageCreate for both, using a flag/enum to determine whether the call to zeImageCreate is asking for an allocation of memory or for the creation of an image. This would result in two calls to zeImageCreate in any instance that separation is needed, and the resulting code would look like

// Allocate image memory
zeImageCreate(...)

// Copy from host to device
zeCommandListAppendImageMemoryCopyFromHostExp(...)

// Create image from memory allocated above
zeImageCreate(...)

Is this what you'd prefer?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @isaacault for the details. Yes, it seems weird to call twice zeImageCreate, so makes more sense to have it separate.

/// - ::ZE_RESULT_ERROR_INVALID_SIZE
/// + (nullptr == phWaitEvents) && (0 < numWaitEvents)
ZE_APIEXPORT ze_result_t ZE_APICALL
zeCommandListAppendImageUSMCopyFromHostExp(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we dont need this and neither CopyToHostExp. This is a usm pointer, so we can the zeCommandListAppendMemoryCopy

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is an issue with our proposal. To my understanding, zeCommandListAppendMemoryCopy doesn't support copying pitched memory. The issue here is that our proposed function here doesn't support pitched copies either.

We'll address this in the next revision, likely by passing the descriptor ze_image_usm_alloc_exp_desc_t to the copy function.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @isaacault . Here we are passing pitched memory, which is USM, right? L0 internally should be able to detect whether the pointer is regular USM memory or USM pitched memory, and do what it needs. ze_image_region_t is the same as ze_copy_region_t, and for that we have zeCommandListAppendMemoryCopyRegion. Unless we need to pass some extra info to the copy, then we could use that one already. All that I see below is ze_image_region_t, and zeCommandListAppendMemoryCopyRegion would cover that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@isaacault :

We'll address this in the next revision

was this next revision already posted? if not, when do you think we will have it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jandres742 It has just now been posted. Apologies for the delay

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @isaacault .

I will move this now to the L0 spec repo.


// Create image from memory allocated above
ze_image_handle_t hImage;
zeImageCreate(hContext, hDevice, &imageDesc, &hImage);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dont understand this. We created imageMemDevice with zeMemAllocPitchedExp, and then initialize it with zeCommandListAppendImageUSMCopyFromHostExp. How, hImage is associated with imageMemDevice when calling zeImageCreate? Should we have a new struct that contains imageMemDevice , and then pass that struct to imageDesc->pNext, so that way zeImageCreate knows it has to associate the hImage handle with the non-USM memory previously created? something like:

ze_mem_image_pitched_desc_t pitchedDesc {};
pitchedDesc->ptr = imageMemDevice ;

imageDesc->pNext = &pitchedDesc;

// Create image from memory allocated above
ze_image_handle_t hImage;
zeImageCreate(hContext, hDevice, &imageDesc, &hImage);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another issue with the proposal, and we'll address this for the next revision.

In addition to adding the memory handle to the struct passed to zeImageCreate, as you seem to have noticed there is a large overlap between our image memory allocation descriptors, and we plan to collapse them into ze_image_desc_t as you've said by extending through pNext for any additional functionality we need. This is an example of what will be there, along with pitch values in the USM case.

|`info::device::ext_bindless_image_3D_USM_support` |`bool` |
Returns `true` if the device supports creation of 3D bindless images backed by
USM.
|======================
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These queries should also be in the extension namespace. For example:

namespace sycl::ext::oneapi::experimental::info::device {

struct bindless_image_support;
struct bindless_image_1d_usm_support;
struct bindless_image_2d_usm_support;
struct bindless_image_3d_usm_support;

} // sycl::ext::oneapi::experimental::info::device

Since they are already in an "ext" namespace, I think the "ext_" prefix is not necessary. The SYCL convention is to use lowercase, even for abbreviations like "usm" or "1d".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted. Though in DPC++ there are some info::device properties that are from extensions, but not in the experimental::info::device namespace, e.g. info::device::ext_intel_max_mem_bandwidth, hence we followed that style. However, I think these are marked as deprecated now.

We will update this in the next revision.

We were also considering using device::aspects for some of these properties. E.g. aspect::bindless_image, or aspect::bindless_mipmap. We welcome discussion on this topic.

intensity = 11,
luminance = 12,
abgr = 13,
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the specific enumeration values important? If not, let's not overspecify. Most (all?) enumeration values in the core SYCL spec are not specified.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are not. We will update this in the next revision.

Comment on lines +213 to +214
image_descriptor(sycl::range<1> dims, sycl::image_channel_order channel_order,
sycl::image_channel_type channel_type,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
image_descriptor(sycl::range<1> dims, sycl::image_channel_order channel_order,
sycl::image_channel_type channel_type,
image_descriptor(sycl::range<1> dims, image_channel_order channel_order,
image_channel_type channel_type,

These aren't in the sycl namespace. Since they're defined in the same namespace as this struct, they don't need to be qualified.

There are several other occurrences of these types below that are shown in the sycl namespace.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted to fix in the next revision.

backend-specific, and may be an optimized layout, e.g. tile swizzle patterns.

`image_mem` shall not be copy-constructible, copy-assignable, or
device-copyable.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many other SYCL classes have "common reference semantics". Why not define image_mem that way too? Defining it in a different way (e.g. no copy constructor) will lead to confusion.

Objects with common reference semantics have shallow copy semantics. Therefore, copying an image_mem would not create a new underlying memory buffer. Rather, you just get a new object wrapping the same buffer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification, I wasn't aware that common reference semantics have shallow copy semantics. We will look more into the common reference semantics and update this in the next revision.

image_descriptor get_descriptor() const;
sycl::context get_context() const;

sycl::range<3> get_range() const;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should document which dimension is width, height, depth. Since we are using range here, why not use range also in image_descriptor?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted to document this in the next revision.

As for using range in image_descriptor, I think the initial reason we did not was because we would try to template image_descriptor by the number of image dimensions, which caused some problems, and descriptors for differently dimensioned images would be distinct types, something that we wanted to avoid for the benefit of SYCLomatic.

However, we could use an explicit range<3>. Will note this down as well to consider for the next revision.

unsigned int get_image_num_channels() const;
image_type get_type() const;

image_mem_handle get_mip_level() const;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
image_mem_handle get_mip_level() const;
unsigned int get_mip_level() const;

I assume?

The image_descriptor APIs call this "num_levels". Use consistent terminology. If this is specific to mipmaps, maybe "mip_levels" or "mipmap_levels" is a good name to use everywhere.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_mip_level returns a handle to the image memory of a specific level in the mipmap, and not the number of mipmap levels. This enables users to perform operations on the memory of individual images in the mipmap.

I can see how the name might cause some confusion though. Perhaps we should rename to get_mip_level_mem_handle or similar.


The first method of allocating device memory for images is through
`alloc_image_mem`. This takes an `image_descriptor` and `sycl::context` to
allocate device memory, appropriately sized based on the `image_descriptor`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Testing my understanding ... image memory allocated this way can be used to create an image for any device that is listed in syclContext. Correct?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is an oversight on our part, as I don't think memory allocated on device A can be used on another device B, even if both devices are in the same context.

Unless sharing memory across devices like this is possible, we will need to fix this in the next revision. We should pass both sycl::context and sycl::device parameters.


This function will allocate a memory region aimed to be used for
two-dimensional images. It allocates memory that is guaranteed to adhere to the
device's alignment requirements for 2D USM images.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first two overloads take an image_descriptor. Why can't those overloads allocate memory for a 3D image? What happens if the image_descriptor describes a 3D image? This might be a silly question ... is there such a thing as a 1D image?

Presumably, the user must still ensure that the width and height passed to these APIs is less than the max_texture_linear_width / max_texture_linear_height parameters defined below? If so, the spec should say that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So far our revisions of the proposal were heavily influenced by our prototyping of the implementation in the PI CUDA backend.

Now that we have more insight into Level Zero and fleshed out that proposal, we need to rethink how this should work.

CUDA does not allow 3D images to be constructed from USM. We have a query for checking USM image support for each dimensionality, e.g. info::device::ext_bindless_image_3d_usm_support.

Level Zero might allow 3D image construction from USM memory. In that case, we might need to add depth parameter to the overloads that don't take image_descriptor, along with a size_t *ret_slice_pitch out paramater for all overloads.

There are 1D images. In the CUDA case, however, 1D USM images do not support linear sampling, so the query info::device::ext_bindless_image_1d_usm_support would return false on CUDA devices.

Presumably, the user must still ensure that the width and height passed to these APIs is less than the max_texture_linear_width / max_texture_linear_height parameters defined below? If so, the spec should say that.

Noted to clarify in the next revision.


The second way to allocate image memory is to use USM allocations. SYCL already
provides a number of USM allocation functions. This proposal would add another,
pitched memory allocation, through `pitched_alloc_device`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if you are entertaining suggestions on names, but I think malloc_device_image or malloc_device_pitched_image might be better. That's more consistent with the existing USM allocation function names, which reinforces that these are allocating regular USM device memory.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're entertaining any and all suggestions. The SYCL 2020 spec has USM allocation functions named similarly, e.g. sycl::aligned_alloc_device, I think that pitched_alloc_device follows that style.

As for including "image" in the name, I'm leaning towards not including it. I don't think pitched allocation is exclusive to images. From my understanding it can be used for regular buffers as well, as its purpose is to align rows of memory to provide better performing accesses. But I might be wrong about this and this benefit does not apply to buffers, only images.

Returns the maximum linear height allowed for images allocated using USM.
|`info::device::ext_oneapi_max_texture_linear_pitch` |`size_t` |
Returns the maximum linear pitch allowed for images allocated using USM.
|======================
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here that I made earlier about the namespace of device info descriptors.

@MichalMrozek
Copy link
Contributor

Can further Level Zero discussion happen in Level Zero Github ?
https://github.com/oneapi-src/level-zero-spec


We also propose a new struct `ze_device_image_bindless_exp_desc_t` that
describes device restrictions on image properties. It should be passed to
`ze_device_properties_t::pNext`. It contains the fields `imagePitchAlign`,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are the new images limited to 2D images only ?

or are 3D images allowed too ?

for 3D images - slice pitch has to be defined ( size of 2D image + padding )

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we do want to support 3D USM images. If this requires alignment restrictions on both rowPitch and slicePitch, then this will need to be added. This may include adding any or all of the following:

  • imageSlicePitchAlign
  • maxImageLinearDepth
  • maxImageLinearSlicePitch

I'm not sure if these names are the best choice, but the idea stands.

ze_structure_type_t stype; // [in]
void *pNext; // [in,out][optional]
size_t widthInBytes; // [in]
size_t height; // [in][optional]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

depth is usually used for 3D images -are 3D images supported ?

does widthInBytes already include padding to row pitch alignment ?

is this the formula to calculate total memory size : widthInBytes * height * depth * elementSizeByte ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. 3D images are supported.

Yes, widthInBytes already includes padding and alignment.

No, the formula for total memory size would be

widthInBytes * height * depth

and the formula for number of elements would be

(widthInBytes / elementSizeByte) * height * depth

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so i think slice padding ( if exists, for a case where slice size != rowpitch * elementSize * height ) is not included

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @HoppeMateusz ,

You'll have to forgive me, I've answered your original question wrong. To go back on it, widthInBytes does not include padding/alignment, and is merely the requested allocation width (in bytes). I'm very sorry for any confusion from this.

The relevant formulas would be

total_memory_size_2D = rowPitch * height * depth
total_memory_size_3D = rowPitch * slicePitch * depth

To answer your questions about support for 3D images, I hope the latest revision clears up any issues as it has some changes around this part of the spec. To summarize the changes, the row pitch and slice pitch should be handled by the proposed implementation-specific ze_image_mem_handle_exp_t, whereas width, height, depth, element size, etc. should be piggy-backed off the existing ze_image_desc_t.


=== Background

The DPC++ bindless images extension has sought to provide the flexibility of

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the term "bindless image" refer to the possibility of passing variable number of images to kernel ? Or to the allocation flow - how backing memory is allocated and usd for image ?

It looks to me the extension defined in this document describes the new way of allocating images - and new memory handles. It does not need to be coupled with bindless images ( allowing passing variable count of image handles to kernels at runtime ) - that are orthogonal to the way of how images are allocated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the term "bindless image" refer to the possibility of passing variable number of images to kernel ?

Yes.

Or to the allocation flow - how backing memory is allocated and usd for image ?

No. However, the allocation flow that we've proposed is necessary for our ability to implement the bindless images extension in the DPC++ Level Zero backend.

It does not need to be coupled with bindless images

The new way of allocating images, and hence the new handles, is necessary for device buffers containing device image handles. We were unaware until yesterday that the existing ze_image_handle_t is a complex type and thus not valid on the device. With this, the extension will also require for the user to have access to device image handles, e.g. through

ze_result_t zeImageGetDeviceHandle(
      ze_image_handle_t image,            /// [in]
      unsigned long long* device_handle   /// [out] – type TBD
);

function, `zeMemFreeImageExp`, to free memory allocated with
`zeMemAllocImageExp`.

We also propose a new struct `ze_device_image_bindless_exp_desc_t` that

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ze_device_image_bindless_exp_desc_t

do we need "bindless" in the name of the struct ? it seems the struct defines properties of memory allocated for the image and used as backing storage of the image.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we don't need bindless in this name. We will remove this in the upcoming revision.

- Collapse memory descriptors into ::ze_image_desc_t
- Associate memory handles with image handles
- Use existing zeCommandListAppendMemoryCopyRegion for USM copies
@ProGTX ProGTX changed the title [sycl_ext_oneapi_bindless_images] Extension proposals for SYCL (rev 3) and Level Zero (rev 2) [sycl_ext_oneapi_bindless_images] Extension proposals for SYCL (rev 3) and Level Zero (rev 3) May 10, 2023
// Once all operations on the image are complete we need to free the memory and
// destroy the handle

// Free image memory

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

memory cannot be freed before image is destroyed - as image is using this memory as a backing stroage,

the order of freeing should be reverse to allocation order:

zeImageDestroy(hImage);
zeMemFree(hContext, allocDesc.pMemAlloc);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HoppeMateusz that's right - this was overlooked in this proposal. You'll see that in the DPC++ extension we destroy image handles and then free the memory - reverse allocation order as you've mentioned. It is meant for this to work in the same order.

Apologies for the confusion here.

@ProGTX
Copy link
Contributor Author

ProGTX commented Jun 13, 2023

Closing this PR.

The Level Zero proposal will be moved to https://github.com/oneapi-src/level-zero-spec at a later time, as discussed.

The PR for the SYCL proposal revision 3 is now ready for merging and available in #9842 . The corresponding implementation PR is #9665 , or rather #9808 as a split PR. All the feedback from here has been taken into account, but will have to wait until revision 4, the priority is to get the implementation in.

@ProGTX ProGTX closed this Jun 13, 2023
steffenlarsen pushed a commit that referenced this pull request Jul 6, 2023
Replaces #8307

Required for #9808 and
#10112 .

Addressed some of the feedback on revision 3 as revision 4. Any larger
changes will need to move to subsequent revisions.

---------

Co-authored-by: Przemek Malon <przemek.malon@codeplay.com>
Co-authored-by: Isaac Ault <isaac.ault@codeplay.com>
Co-authored-by: Sean Stirling <sean.stirling@codeplay.com>
Co-authored-by: Duncan Brawley <duncan.brawley@codeplay.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.