Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYCL][Doc] Graph fusion extension proposal #8678

Merged
merged 14 commits into from Sep 20, 2023

Conversation

sommerlukas
Copy link
Contributor

Experimental SYCL extension proposal for kernel fusion on top of the SYCL graphs API.

Constructing the sequence of kernels to fuse is completely left to the graphs proposal, which provides two APIs to this end. One recording API similar to the fusion mode for queues in the initial kernel fusion proposal, and an explicit graph construction APIs. Both APIs are supported for kernel fusion.

This proposal mainly introduces a number of properties to trigger fusion of the graph and internalization of dataflow in the fused kernel.

This proposal continues some of the ideas of the experimental SYCL extension for kernel fusion. In contrast to the original kernel fusion proposal, this proposal now also allows internalization of USM pointers.

@sommerlukas sommerlukas added the spec extension All issues/PRs related to extensions specifications label Mar 16, 2023
@sommerlukas sommerlukas requested a review from a team as a code owner March 16, 2023 13:42
@sommerlukas sommerlukas self-assigned this Mar 16, 2023
@Naghasan
Copy link
Contributor

May be good to add a comment on the initial extension it is being superseded by this one.

Copy link
Contributor

@EwanC EwanC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have some very nitpicky editorial comments, but otherwise this LGTM from a layering on top of sycl_ext_oneapi_graph perspective.

@sommerlukas
Copy link
Contributor Author

May be good to add a comment on the initial extension it is being superseded by this one.

I added a note to the initial kernel fusion extension.

Copy link
Contributor

@EwanC EwanC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Highlighting a couple of changes to the graphs extension which will affect your examples.

@sommerlukas sommerlukas requested a review from EwanC August 17, 2023 14:03
Comment on lines 127 to 149
The property is not prescriptive. Implementations are free to not perform fusion
if it is not possible (see below section <<limitations, Limitations>>), fusion is not
supported by the implementation, or the implementation decides not to perform
fusion for other reasons. It is not an error if an implementation does not
perform fusion even though the property is passed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just my opinion, but I find it a bit weird that this interface requires developers to provide lots of decorations and properties about what they want to happen, but that there's still so much implementation-defined behavior.

I think it would be preferable to separate "make a best effort to perform fusion" from "perform fusion", where the former is allowed to fail spuriously and the latter throws an exception if it fails. The former might already be covered by as-if optimization guarantees.

I say this because if somebody is trying to optimize a code and prepare a graph for fusion, they should know when the property doesn't do what they asked.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In our current implementation, it is possible to get this information through warning messages enabled by an implementation-specific environment variable. As SYCL does not define a mechanism for synchronous warning messages, this makes it hard to define such a requirement for implementations in this extension proposal.

For the extension, our aim was to strike a balance between required features and portability, but we can certainly try to adjust the current tradeoff:

  • For the internalization properties (promote_local/promote_private), I think they should remain to not be prescriptive. These properties might need to be applied in multiple locations and applying them conditionally would make code less readable. Also, if internalization can be performed might depend on compiler internals, so would be hard to predict in the form of an aspect/information descriptor.
  • For the fusion itself, we could make this prescriptive. If an implementation reports support for fusion on a device through the aspect/information descriptor, we could mandate a synchronous error to be thrown if fusion is not happening.

Would you also want to define a minimum set of fusion scenarios that must be supported for an implementation to be allowed to report fusion support?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the internalization properties (promote_local/promote_private), I think they should remain to not be prescriptive.

I agree. These should be hints that give the compiler information about which kernels could legally be fused.

For the fusion itself, we could make this prescriptive. If an implementation reports support for fusion on a device through the aspect/information descriptor, we could mandate a synchronous error to be thrown if fusion is not happening.

This makes sense to me.

It seems to me like an implementation could choose to try and fuse all of the kernels in a graph without any input from the user: as long as the implementation produces the right answer (and respects the properties attached to buffers, etc), the user can't really tell if fusion happened or not. If I'm right about that, then the descriptive version of perform_fusion doesn't really add very much (except that our current implementation doesn't attempt to fuse if it's not present).

If we want to keep a descriptive version around, I'd recommend the naming enable_fusion (similar to enable_profiling). It suggests that the implementation may fuse, but isn't required to. It might be a no-op, either because an implementation always fuses, or because it never fuses.

For a prescriptive version, I'd recommend something like require_fusion. It could coexist with the enable_ variant.

Would you also want to define a minimum set of fusion scenarios that must be supported for an implementation to be allowed to report fusion support?

I don't think we have to go this far, at least right now. I think require_fusion would have to mean something like "maximally fuse the linearization of this graph". In the absence of enqueued barriers, that probably means making everything into one kernel? With enqueued barriers, I guess you'd get one kernel between each pair of barriers.

Any implementation that can't fuse all the kernels together (for any reason) would throw an exception. The user could then enqueue again without fusion, or investigate why fusion isn't happening, depending on their use-case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds reasonable to me.

The necessity of a descriptive variant hinges on the guarantees graphs want give with regard to outside visibility of individual kernels. If SYCL graphs want to guarantee that tools such as profilers (Intel VTune, Nvidia NSight, ...) can observe individual kernels, fusion would only be possible if explicitly allowed by the user through a descriptive property. If, on the other hand, SYCL graphs give no such guarantee, we could omit the descriptive variant and implementations are free to perform fusion without explicit instruction to do so.

What type of guarantees should/do graphs give in that regard?

Independent of that, we would have a prescriptive version, throwing an error if fusion fails.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a very good question. I don't know the answer, so tagging some folks who might know and/or have opinions: @reble, @bashbaug, @gmlueck

I can see arguments both ways... But I agree that the necessity of the descriptive variant hinges on the answer to this question.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a descriptive and a prescriptive property to initiate fusion for now. Depending on feedback from @reble, @julianmi, @EwanC and @Bensuo we can drop the descriptive variant if outside observability of individual kernels is not a requirement for graphs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The examples you give for "outside observability" are all profilers. It seems unlikely to me that a user would want to disable optimizations (like fusion) when running a profiler. Wouldn't a user want to profile the optimized code? Instead, maybe the graph fusion implementation should name the fused kernel in a way that makes it obvious which input kernels it corresponds to.

Are there any issues with "inside observability", though? For example, what if the kernel has a type-name, and the application gets a kernel_bundle from that type-name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The examples you give for "outside observability" are all profilers. It seems unlikely to me that a user would want to disable optimizations (like fusion) when running a profiler. Wouldn't a user want to profile the optimized code?

Yes, that's true.

If we agree that no outside observability for individual graph nodes is necessary, we can drop the descriptive version of the fusion property and implementations would always be free to perform fusion as a (performance) optimization.

However, I think the right place to talk about outside observability would be the graph extension proposal itself, rather than this fusion proposal. What do you think?

Instead, maybe the graph fusion implementation should name the fused kernel in a way that makes it obvious which input kernels it corresponds to.

Yes, that would be possible. I would consider this an implementation detail, so not require a specific naming in this extension, though.

Are there any issues with "inside observability", though? For example, what if the kernel has a type-name, and the application gets a kernel_bundle from that type-name?

The kernel is only fused at runtime and the original kernels are all still available, so that is not an issue. It would still be possible to access the individual kernels through these APIs.

Comment on lines 146 to 147
If the property list contains this property, no barriers are introduced between
kernels in the fused kernel (see below section on synchronization on kernels).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is too coarse-grained.

I can see how it would be useful for simple cases, but couldn't there be cases where the developer wants to fuse a chain like k1 -> k2 -> k3, and they need a barrier between k1 and k2, but not between k2 and k3?

What about making the "no barriers" behavior the default instead, and requiring the developer to insert explicit barriers between pairs of kernels that will still require a barrier after fusion? You could represent such a barrier using something similar to sycl_ext_oneapi_enqueue_barrier.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could represent such a barrier using something similar to sycl_ext_oneapi_enqueue_barrier.

A note that current the sycl_ext_oneapi_graph spec currently forbids using sycl_ext_oneapi_enqueue_barrier in a graph node, but this is a restriction we'll relax in the near future given we're in the process of implementing it in our fork reble#301

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sycl_ext_oneapi_enqueue_barrier would be one possible way to introduce a barrier (although not portable between implementations) once supported in SYCL graphs. Another easy way is to add a group_barrier at the end/beginning of a kernel, because they do not get removed by fusion.

We chose barrier insertion as the default behavior, because it at least partially matches the behavior without fusion. Without fusion, there's an implicit device-wide barrier between two kernels that are executed on the same device. This implicit barrier is gone with fusion, so we chose to insert at least work-group barriers by default.

Therefore, we also wanted to make disablement of the barrier insertion an active decision by the user (through the property).

Would you still prefer the default to be changed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another easy way is to add a group_barrier at the end/beginning of a kernel, because they do not get removed by fusion.

This seems dangerous to me. I think it's quite likely that optimizing compilers may try to remove redundant (i.e. back-to-back) barriers or unused barriers as a performance optimization. For your suggestion to work, we'd have to try and make sure that fusion always happens before any such optimizations.

Therefore, we also wanted to make disablement of the barrier insertion an active decision by the user (through the property).

Would you still prefer the default to be changed?

I don't feel strongly about the default -- my concern is really about the granularity of control.

If the default is to insert work-group barriers between kernels, then I think there needs to be a mechanism for developers to express that a barrier is not required between a specific pair of kernels (i.e. without removing all barriers between all pairs of fused kernels).

If the default is not to insert work-group barriers between kernels, then I think there needs to be a mechanism for developers to express that a barrier is required between a specific pair of kernels. I was leaning towards this just because it seemed we already had the necessary functionality (enqueueing a barrier).

Sorry for not being clear about that originally.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the answer to fine-grained control is sub-graphs. If the user wants device-wide synchronization, they could simply create two sub-graphs (one for kernels before the barrier, one for kernels after the barrier), fuse each of them and then submit them as sub-graphs of a larger graph without fusion.

This would allow for the fine-grained control we were looking for. WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That works for me. The idea of using sub-graphs also ties in nicely to the prescriptive/descriptive versions of the properties we've been discussing -- I can imagine cases where a developer expects (and thus prescribes) fusion within a few small sub-graphs consisting of a handful of kernels, but leaves the decision to fuse other things up to the implementation.

A non-normative note explaining that developers can use sub-graphs like this would be really helpful, I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, so I would propose we do the following:

  • Switch the default behavior to no barriers being inserted by the fusion compiler, so fusion is fast by default.
  • Add a property (e.g., insert_barriers) that triggers the current behavior, i.e., automatic insertion of work-group barriers between the kernels by the JIT compiler.
  • Add a non-normative note explaining that sub-graphs can be combined with fusion to achieve a device-wide barrier between two fused kernels.

If you agree, I can update the proposal.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all sounds good to me.

One thing I'd still like to support somehow is inserting a work-group barrier between two fused kernels without needing a work-group barrier between all pairs of fused kernels. But the direction you've outlined is compatible with adding something like that in the future, and so I'm happy to wait until we have clear use-cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the default behavior and introduced the insert_barriers property to trigger insertion of work-group barriers between every pair of kernels.

`sycl::aligned_alloc_shared()` to apply the property to an USM pointer.

```c++
sycl::ext::oneapi::experimental::property::promote_local
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the name promote here (and in promote_private) is confusing.

I understand this is what is happening from the compiler's perspective, but it reads like the developer is prescribing promotion must happen. In my mind, that's inconsistent with the fact that the compiler is allowed to ignore this property.

Since promotion isn't guaranteed, my recommendation here would be to find a way to express this which is more descriptive. For example, a name like work_group_local or access_scope<memory_scope::work_group> would convey that a developer can use these properties to declare certain guarantees that may be used for optimization, without the developer having to prescribe which optimization is expected to happen.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, we would be open to renaming these properties (they currently align with the existing kernel fusion extension).

A name like work_group_local (and work_group_private) does however not capture the full semantics of the property. The fact that access is local to the work-item/work-group is only one aspect expressed by the property. The property also expresses that write-back of the results to global memory can be omitted, i.e., no future use of the buffer relies on being able to access these results in the buffer.

We would need to find a name that captures both these aspects, which is why we came up with promote_... (although I agree that it's based on the compiler effect of it).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, interesting. I'd missed that.

Would it make sense to split that out into two orthogonal properties: one that expresses there is no sharing happening beyond a certain scope, and another that expresses there is no need to write the results back to global memory?

I understand that the current implementation would require both properties to be present to have any effect, but hypothetically speaking, couldn't it be useful to just see one of them? I might be thinking about this the wrong way, but it seems like knowing that there's no sharing would be sufficient to cache/re-use the results in registers or work-group local memory during execution of some of the fused kernels, as long as the results were written out to global memory before they're needed. Similarly, knowing that a buffer is only intended to be used by a specific kernel would allow the memory for those buffers to be re-used as scratchpads.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting thought!

it seems like knowing that there's no sharing would be sufficient to cache/re-use the results in registers or work-group local memory during execution of some of the fused kernels, as long as the results were written out to global memory before they're needed

Yes, the current implementation does not support that, but the knowledge that no sharing occurs would be sufficient for internalization. The compiler would need to determine the stores which need to be kept to have the final result in the buffer again (or simply keep all stores), but that is only a limitation of the current implementation.

Similarly, knowing that a buffer is only intended to be used by a specific kernel would allow the memory for those buffers to be re-used as scratchpads.

Can you elaborate a bit on what optimization the use as scratchpads would enable? So far, we had thought of not allocating the buffer at all as a potential optimization, but that requires successful internalization, i.e., the no-sharing property, too.

The main reason to have a combined property so far was to reduce the number of properties a user would have to specify to enable internalization. However, if each of the two aspects separately could already enable some optimization, it might indeed make sense to split the properties.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate a bit on what optimization the use as scratchpads would enable? So far, we had thought of not allocating the buffer at all as a potential optimization, but that requires successful internalization, i.e., the no-sharing property, too.

Disclaimer: I haven't fully thought this through!

I'm imagining a case something like this:

buffer<int> b1{range{dataSize}};
buffer<int> b2{range{dataSize}};
buffer<int> b3{range{dataSize}};

ext::oneapi::experimental::command_graph graph{
    q.get_context(), q.get_device()};

graph.begin_recording(q);

q.submit([&](handler &cgh) {
  auto a1 = b1.get_access(cgh);
  auto a2 = b2.get_access(cgh, sycl::ext::oneapi::experimental::property::scratch);
  cgh.parallel_for<K1>(dataSize, ...);
});

q.submit([&](handler &cgh) {
  auto a1 = b1.get_access(cgh);
  auto a2 = b3.get_access(cgh, sycl::ext::oneapi::experimental::property::scratch);
  cgh.parallel_for<K2>(dataSize, ...);
});

graph.end_recording();

K1 is going to use b2 to communicate some partial results during execution of the kernel, and K2 is going to use b3 for the same. The values of b2 and b3 are never checked after the kernel executes.

Without some sort of decoration, it seems like a runtime would have no choice but to allocate storage for both b2 and b3, just in case the values are used later. With some sort of decoration, it would be legal to use the same storage for b2 and b3 within the graph. I don't think this would always be true after fusion, but it would be true in some cases.

Somebody writing the program as I did above would probably have used a single buffer instead of b2 and b3, but it's not hard to imagine a case like this arising where the kernels are defined and provided by different developers (e.g. in different libraries).

I seem to recall there was some discussion of supporting this sort of optimization in the original graph proposal, but I honestly don't remember where it landed. If there's already a property with the semantics that I discussed above, maybe we could just use that + the scope-narrowing property.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting example, I think that's a good motivation to split the properties.

In internal discussion, we also realized that in our current implementation promote_local/promote_private currently imply ` three things:

  • Work-group (local) or work-item (private) exclusive access.
  • No write back of results into the buffer.
  • No initialization of the local/private memory with data from the buffer.

Going forward, we would be using three sets of properties to model this:

  1. The new properties access_scope<memory_scope_work_item> (private) / access_scope<memory_scope_work_group> (local) to express work-item/work-group exclusive access.
  2. A new property to model that no write-back of the results into the device memory will happen. For this one, we need to find a good name. My initial idea no_write_back isn't very suitable, as it would be easily confused with set_write_back(false), which is different as it disables write-back from device memory into host memory. Suggestions for the name are welcome 😉
  3. The existing no_init property to express that the private/local memory is not initialized with data from the buffer.

Implementations can require a combination of these properties to be attached to a buffer/accessor for internalization. It's an implementation detail, but implementations should ideally document it.

In our current implementation, we would require all three properties to be applied for now, but the requirement could be lowered when the JIT compiler get's more capabilities.

WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this direction a lot.

My initial idea no_write_back isn't very suitable, as it would be easily confused with set_write_back(false), which is different as it disables write-back from device memory into host memory. Suggestions for the name are welcome 😉

Best I can come up with is no_keep or no_save. I thought about discard, but I fear that would be too confusing given that discard_read and discard_write used to mean "discard the initial results".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I split the properties' semantics as discussed. Access pattern/behavior is now expressed through the access_scope property template. Not initializing the internalized memory with values from the buffer and not writing back to the buffer are now expressed through the existing no_init property and the new fusion_internal_memory property, respectively.

A few questions are still open and I would appreciate your opinion on those:

  • Should we prohibit specialization of the access_scope property template with other values of the memory_scope enum beyond the two cases currently defined? Or should implementations be free to allow additional specializations?
  • For the no_init property, the SYCL specification defines a list of places where it can appear (table 52). For our case, we need to allow it in additional places, do you see any issue with that?
  • The name of the property for no write-back (fusion_internal_memory) is the best I could come up with so far, but can be changed if we have better ideas.

Copy link
Contributor Author

@sommerlukas sommerlukas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your feedback @Pennycook! I've addressed the comments inline.

Comment on lines 127 to 149
The property is not prescriptive. Implementations are free to not perform fusion
if it is not possible (see below section <<limitations, Limitations>>), fusion is not
supported by the implementation, or the implementation decides not to perform
fusion for other reasons. It is not an error if an implementation does not
perform fusion even though the property is passed.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In our current implementation, it is possible to get this information through warning messages enabled by an implementation-specific environment variable. As SYCL does not define a mechanism for synchronous warning messages, this makes it hard to define such a requirement for implementations in this extension proposal.

For the extension, our aim was to strike a balance between required features and portability, but we can certainly try to adjust the current tradeoff:

  • For the internalization properties (promote_local/promote_private), I think they should remain to not be prescriptive. These properties might need to be applied in multiple locations and applying them conditionally would make code less readable. Also, if internalization can be performed might depend on compiler internals, so would be hard to predict in the form of an aspect/information descriptor.
  • For the fusion itself, we could make this prescriptive. If an implementation reports support for fusion on a device through the aspect/information descriptor, we could mandate a synchronous error to be thrown if fusion is not happening.

Would you also want to define a minimum set of fusion scenarios that must be supported for an implementation to be allowed to report fusion support?

Comment on lines 146 to 147
If the property list contains this property, no barriers are introduced between
kernels in the fused kernel (see below section on synchronization on kernels).
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sycl_ext_oneapi_enqueue_barrier would be one possible way to introduce a barrier (although not portable between implementations) once supported in SYCL graphs. Another easy way is to add a group_barrier at the end/beginning of a kernel, because they do not get removed by fusion.

We chose barrier insertion as the default behavior, because it at least partially matches the behavior without fusion. Without fusion, there's an implicit device-wide barrier between two kernels that are executed on the same device. This implicit barrier is gone with fusion, so we chose to insert at least work-group barriers by default.

Therefore, we also wanted to make disablement of the barrier insertion an active decision by the user (through the property).

Would you still prefer the default to be changed?

`sycl::aligned_alloc_shared()` to apply the property to an USM pointer.

```c++
sycl::ext::oneapi::experimental::property::promote_local
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, we would be open to renaming these properties (they currently align with the existing kernel fusion extension).

A name like work_group_local (and work_group_private) does however not capture the full semantics of the property. The fact that access is local to the work-item/work-group is only one aspect expressed by the property. The property also expresses that write-back of the results to global memory can be omitted, i.e., no future use of the buffer relies on being able to access these results in the buffer.

We would need to find a name that captures both these aspects, which is why we came up with promote_... (although I agree that it's based on the compiler effect of it).

Comment on lines +226 to +235
the property is specified on an accessor) or in any kernel in the graph (in case
the property is specified on a buffer or an USM pointer).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The second half of this sentence is surprising to me. In a real application, a developer might not have insight into or control over how a USM allocation is used by other kernels.

Wouldn't it be safer to require this property on all usages of USM within the graph?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue with USM is that throughout most of the application, these are just plain C++ pointers, so it's hard to attach things to them at their usage point (there's no accessors created for USM and no require).

With the standard USM API, the allocation is more or less the only place to attach a property list to a USM pointer.

In an earlier version, we used the USM allocation nodes of the graph to narrow the scope of the USM pointer and therefore also internalization properties to just the graph, but the allocation nodes did not make it into the current version of the graph proposal.

A potential alternative to allow more fine-grained would be to introduce properties to mark USM pointers that take the USM pointer itself:

template<sycl::memory_scope Scope>
struct usm_access_scope {
  void* ptr;
};

These properties could then be passed to command_graph::finalize() to allow limit internalization semantics to just that one graph:

auto* usm_ptr = malloc_shared(...);
...
graph.finalize(require_fusion{}, usm_access_scope<memory_scope::work_item>{usm_ptr});

We could introduce this in addition to the access_scope property and either only allow this for USM pointers, or allow both alternatives. WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see. Thank you for the explanation. Somehow, I missed that this property was attached to the malloc.

I like your idea of being able to pass a pointer into the usm_access_scope property, but I'm not sure I follow why it should be part of the finalize call. It sounds like what we really want to be able to do is mark individual usages, so that we can make different fusion decisions for different kernels using the same pointer(s).

What if you made it so that access_scope was a valid property for an sycl_ext_oneapi_annotated_ptr? That way, a developer could attach access scope information to a pointer and carry that information as long as necessary. If the access scope applies to the whole graph, the annotation can be attached to the pointer once, and that same annotated pointer can be captured by all the kernels. If the access scope applies to a single kernel, an annotated pointer can be created locally at command-group scope.

Would that make sense?

One limitation of this approach that I can see is that you'd only be able to use the information if a USM pointer was one of the kernel arguments (i.e. it couldn't be captured indirectly or via a struct). But I suspect this may be an issue with the current proposal already, since there's no association between pointers and the kernels/graphs that use them. Do you need to add a restriction somewhere talking about this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if you made it so that access_scope was a valid property for an sycl_ext_oneapi_annotated_ptr? That way, a developer could attach access scope information to a pointer and carry that information as long as necessary

I wasn't aware of that extension (or forgot about it again 😅 ), thanks for bringing this up! I think this is a great idea and gives nice control. I don't have a lot of implementation experience with it, but I think it should be possible to integrate the access_scope<...> and fusion_internal_memory into the compile time properties framework.

For symmetry and because implementations might require all three properties to successfully perform internalization, no_init would also need to be allowed for annotated_ptr. I would need to check the implementation, but I assume that the no_init property should also fit into the compile property framework. WDYT?

One limitation of this approach that I can see is that you'd only be able to use the information if a USM pointer was one of the kernel arguments (i.e. it couldn't be captured indirectly or via a struct)

I think in practice this would not be a big limitation. Following pointers stored into structs (or even more deeply nested) around to check whether internalization can be performed would anyways be hard for the compiler, so even if there was not limitation on the use of the property, I doubt that it would be possible for implementations to internalize in that scenario.

But I suspect this may be an issue with the current proposal already, since there's no association between pointers and the kernels/graphs that use them. Do you need to add a restriction somewhere talking about this?

I'm not entirely sure that I get what you mean here. As fusion happens at runtime, it is usually possible to determine the arguments captured/used by a kernel through the SYCL runtime, so implementations of fusion have access to that information, unless it's nested inside structs or similar. As internalization is only prescriptive, the fusion implementation could simply not perform internalization in that case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For symmetry and because implementations might require all three properties to successfully perform internalization, no_init would also need to be allowed for annotated_ptr. I would need to check the implementation, but I assume that the no_init property should also fit into the compile property framework. WDYT?

I hadn't thought about this. One problem I see here is that no_init for buffers/accessors is scoped to a command-group scope, so the property implicitly means "this memory is not initialized prior to this command-group". It's not clear to me what it would mean if you attached it to a pointer at arbitrary scope. If we attached no_init to an annotated_ptr before any kernels, the interpretation seems like it would have to be "this memory is never initialized" or "this memory is not initialized before its first use".

Do you think the second interpretation is implementable? I can sort of see how you might be able to track this in the runtime.

I'm not entirely sure that I get what you mean here. As fusion happens at runtime, it is usually possible to determine the arguments captured/used by a kernel through the SYCL runtime, so implementations of fusion have access to that information, unless it's nested inside structs or similar

Right, we're on the same page. I'm suggesting that you add text to the effect of:

"The properties of an annotated_ptr should be inspected by an implementation when the annotated_ptr is captured by a kernel lambda or otherwise passed as an argument to a kernel function. Implementations are not required to track fusion-related information for other USM pointers that may be used by a kernel, such as those stored inside of structs or other data structures."

I think it should be a weak "not required" because some implementations (like DPC++) might unpack some structs in some situations. But I think it's important to note that if a developer wants the information to be usable, they should ensure that the pointer is captured as an argument.


One other possible restriction came to mind while I was writing the above. What is supposed to happen in a case like this:

int* some_regular_ptr = ...;
annotated_ptr<int, ...> some_annotated_ptr = ...; // with appropriate internalization properties

q.parallel_for(..., [=](auto& i) {
  foo(some_regular_ptr);
  bar(some_annotated_ptr);
});

If some_annotated_ptr is internalized successfully, some_regular_ptr still won't be. (If you think the compiler is smart enough to detect this case, assume some_regular_ptr is in a struct and some_annotated_ptr isn't). Am I right in thinking that any code which expects these two pointers to point to the same memory will break?

If so, I think you should add another restriction saying that this is UB.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, we could introduce another more fusion/graph-specific property (similar to fusion_internal_memory) to make it clearer. Which solution would you prefer?

Thanks for the explanation. I prefer using no_init, here -- I just want to make sure we'd thought through the corner cases, and document anything that might be important.

It is illegal to apply annotated_ptr to members of kernel arguments.

But I would still add the note to make it more self-contained and in case the annotated_ptr proposal changes.

I think that's a good idea. I'd missed/forgotten this note in the annotated_ptr proposal, so highlighting it here is helpful.

In case some_regular_ptr and some_annotated_ptr point to the same memory, no internalization would happen. As fusion happens at runtime, it's relatively easy to detect if the two pointers are identical. In that case, the same resolution rules as for two accessors to the same buffer (second table in 7.6.1) could apply: One captured pointer has a work-group scope attached, the other not, so the first line of the table would apply and no internalization would happen. I would extend the text before that table to also talk about USM pointers.

But this is only true if both some_regular_ptr and some_annotated_ptr are captured as arguments, isn't it? If they point to the same memory but some_regular_ptr is accessed indirectly (e.g. via a struct member, a pointer to pointer, some kernel-dependent logic) then the runtime won't be able to compare the pointers.

As a somewhat realistic use-case: imagine that there's some big allocation being used as the backing store for a data structure, and a data structure layered on top (e.g. a linked list or something). A developer can declare the big allocation as having some fusion-related properties, but those properties won't carry through to the data structure. The runtime can't apply the accessor resolution rules, because it can't see the pointers in the data structure. The runtime in this case might conclude that the memory can be internalized, but any changes to the memory would break the data structure.

Does that make sense? I feel like I might not be explaining this well.

The only case for which we should probably add a restriction saying it's UB is if some_annotated_ptr is some_regular_ptr plus some offset. As USM pointers do not carry their allocation size, it's impossible to determine if two pointers partially overlap, and the fusion compiler should be able to assume they do not, unless they are identical.

I agree it would be a good idea to add a restriction about this too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this is only true if both some_regular_ptr and some_annotated_ptr are captured as arguments, isn't it? If they point to the same memory but some_regular_ptr is accessed indirectly (e.g. via a struct member, a pointer to pointer, some kernel-dependent logic) then the runtime won't be able to compare the pointers.

As a somewhat realistic use-case: imagine that there's some big allocation being used as the backing store for a data structure, and a data structure layered on top (e.g. a linked list or something). A developer can declare the big allocation as having some fusion-related properties, but those properties won't carry through to the data structure. The runtime can't apply the accessor resolution rules, because it can't see the pointers in the data structure. The runtime in this case might conclude that the memory can be internalized, but any changes to the memory would break the data structure.

Does that make sense? I feel like I might not be explaining this well.

That's a good point, I hadn't considered that case.

As the absence of such an indirect use of the pointer would definitely be non-trivial for the compiler to prove, the most realistic option seems to be to have the user promise that no such use takes place. This would be possible by extending the semantics of one of the properties (I'm leaning towards fusion_internal_memory) or by introducing a new property, that essentially lets the user state that no such indirect use will take place.

WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the absence of such an indirect use of the pointer would definitely be non-trivial for the compiler to prove, the most realistic option seems to be to have the user promise that no such use takes place.

I agree, this makes sense to me.

This would be possible by extending the semantics of one of the properties (I'm leaning towards fusion_internal_memory) or by introducing a new property, that essentially lets the user state that no such indirect use will take place.

I might be wrong about this, but I think you need to extend the semantics of all the properties. It seems like a developer could get themselves into similar trouble by accessing memory through two pointers with incompatible access_scopes (and/or raw pointers with no access_scope). Maybe the simplest thing to do is to follow atomic_ref, and say something like:

"If an annotated_ptr is created with any of these properties, the underlying memory must only be accessed via an annotated_ptr with a compatible set of properties. During the lifetime of an annotated_ptr with any of these properties, direct access to the underlying memory via a non-annotated pointer results in undefined behavior."

Or is that too strong?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's too strong, but the annotation might not be the decisive point here. I think the key point is whether the pointer is captured/used as a kernel argument or not.

Let me try to explain with two examples:

  1. We have allocated a USM pointer ptr. kernel1 captures an annotated_ptr(ptr, access_scope_work_item, fusion_internal_memory, no_init), kernel2 captures ptr directly. This scenario is unproblematic. The fusion compiler at runtime can trivially detect that both kernels use the same pointer, apply the resolution rules and not perform internalization, no harm done. In fact, this is a valid way to opt out of internalization, e.g., if the two kernels are added to the graph by two different libraries.
  2. We again have allocated a USM pointer ptr, and define auto taggedPtr = annotated_ptr(ptr, access_scope_work_item, fusion_internal_memory, no_init). kernel1 and kernel2 both capture taggedPtr, but in addition, taggedPtr is stored into some struct that is captured by kernel1 and used in the kernel to access memory.
    The fusion compiler would now determine that it's okay to perform the internalization, so replace all loads/stores to the captured pointer with register. However, as the compiler would be unable to determine (without inspecting pointers nested inside structs, which is definitely non-trivial) that the pointer inside the struct is in fact the same pointer as it just internalized. As a consequence, it would not update loads and stores from/to the pointer inside the struct, leading to wrong outcome.

I think the examples show that the crucial point isn't the annotation, but rather the fact the pointer is directly captured as a kernel argument, which allows the compiler at runtime to reason about it. I would therefore propose an alternative wording:

If an annotated_ptr is created with any of the properties relating to
internalization and captured by a kernel lambda or otherwise passed as an
argument to a kernel function participating in fusion, the underlying memory
must only be accessed via pointers that are also captured or passed as kernel
argument. Access to the underlying memory via a different pointer, such as
pointers stored inside of structs or other data structures results in undefined
behavior.

I added this wording and also integrated the other changes discussed in this thread in the latest commit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thank you for making these changes!

Copy link
Contributor

@EwanC EwanC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some very minor editorial comments, but LGTM.
I'll create a task to track specifying "outside observability" behavior as part the graphs extension

@sommerlukas sommerlukas requested a review from a team as a code owner September 20, 2023 11:11
Copy link
Contributor Author

@sommerlukas sommerlukas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback @EwanC @gmlueck @victor-eds!

Experimental SYCL extension proposal for kernel fusion on top of the
SYCL graphs API.

Signed-off-by: Lukas Sommer <lukas.sommer@codeplay.com>
Signed-off-by: Lukas Sommer <lukas.sommer@codeplay.com>
Signed-off-by: Lukas Sommer <lukas.sommer@codeplay.com>
Signed-off-by: Lukas Sommer <lukas.sommer@codeplay.com>
Signed-off-by: Lukas Sommer <lukas.sommer@codeplay.com>
Signed-off-by: Lukas Sommer <lukas.sommer@codeplay.com>
Signed-off-by: Lukas Sommer <lukas.sommer@codeplay.com>
Signed-off-by: Lukas Sommer <lukas.sommer@codeplay.com>
Signed-off-by: Lukas Sommer <lukas.sommer@codeplay.com>
Signed-off-by: Lukas Sommer <lukas.sommer@codeplay.com>
Signed-off-by: Lukas Sommer <lukas.sommer@codeplay.com>
@sommerlukas
Copy link
Contributor Author

@intel/llvm-gatekeepers Can someone please merge this? TIA!

@aelovikov-intel aelovikov-intel merged commit 8c4cad4 into intel:sycl Sep 20, 2023
2 checks passed
iclsrc pushed a commit that referenced this pull request Mar 5, 2024
… (#83741)

Emit a warning if pointer/reference to compound literal is returned from
a function.

In C, compound literals in block scope are lvalues that have automatic
storage duration. In C++, compound literals in block scope are
temporaries.

In either case, returning a pointer/reference to a compound literal can
cause a use-after-free bug.

Fixes #8678
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
spec extension All issues/PRs related to extensions specifications
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants