Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYCL][Doc] Add kernel fusion extension proposal #7098

Merged
merged 6 commits into from
Nov 17, 2022

Conversation

victor-eds
Copy link
Contributor

@victor-eds victor-eds commented Oct 18, 2022

Add specification for the "sycl_ext_codeplay_kernel_fusion" extension proposal, which allows user-driven kernel fusion of two or more kernels in a single kernel launch.

@victor-eds victor-eds requested a review from a team as a code owner October 18, 2022 15:53
Signed-off-by: Victor Perez <victor.perez@codeplay.com>
@victor-eds victor-eds changed the title [SYCL] Add kernel fusion extension proposal [SYCL][Doc] Add kernel fusion extension proposal Oct 18, 2022
@keryell
Copy link
Contributor

keryell commented Oct 19, 2022

After thinking more about these extensions at the F2F, I think that the start()/stop() API is not C++ exception-friendly and you should come with an RAII approach, either with a kernel_fuser constructor taking a queue or a queue::get_kernel_fuser() returning a kernel_fuser. When the kernel_fuser is destructed the kernel fusion stop. To bike-shed, obviously.

@sommerlukas
Copy link
Contributor

After thinking more about these extensions at the F2F, I think that the start()/stop() API is not C++ exception-friendly and you should come with an RAII approach, either with a kernel_fuser constructor taking a queue or a queue::get_kernel_fuser() returning a kernel_fuser. When the kernel_fuser is destructed the kernel fusion stop. To bike-shed, obviously.

Thank you very much for your feedback @keryell!

The main reason we avoided to introduce new objects like kernel_fuser was the ability to use kernel fusion with all the existing SYCL code and libraries out there. If fusion was bound to such an object, this object would need to be passed around and require existing interfaces and library APIs to be adapted for kernel fusion, which would be more intrusive than the current design. Therefore we made the design decision to not introduce such an object.

What is your main concern regarding exceptions? An unclear fusion state if an exception occurs between a pair of start and complete/cancel?

@victor-eds victor-eds self-assigned this Oct 24, 2022
@victor-eds victor-eds added the spec extension All issues/PRs related to extensions specifications label Oct 24, 2022
@pvchupin
Copy link
Contributor

ping @intel/dpcpp-specification-reviewers

@keryell
Copy link
Contributor

keryell commented Oct 29, 2022

The main reason we avoided to introduce new objects like kernel_fuser was the ability to use kernel fusion with all the existing SYCL code and libraries out there. If fusion was bound to such an object, this object would need to be passed around and require existing interfaces and library APIs to be adapted for kernel fusion, which would be more intrusive than the current design. Therefore we made the design decision to not introduce such an object.

I guess such libraries are probably C-like anyway so it does not make sense to clean them with some clean C++.

What is your main concern regarding exceptions? An unclear fusion state if an exception occurs between a pair of start and complete/cancel?

Yes. We cannot really attach some state like that to a queue because it goes against RAII principles https://github.com/isocpp/CppCoreGuidelines/blob/master/CppCoreGuidelines.md#Rr-raii https://en.cppreference.com/w/cpp/language/raii

@sommerlukas
Copy link
Contributor

Yes. We cannot really attach some state like that to a queue because it goes against RAII principles

I think the situation with sycl::queue and RAII/exceptions for fusion is similar to the regular queue::submit when it comes to exception safety. When calling a function with a queue and that function causes an exception, there is also no way of knowing if any kernels have been submitted for execution through that queue before the exception happened, as they are not bound to RAII of the queue.

The implementation we are currently working on implicitly cancels fusion if the queue (i.e., the "last remaining host copy" in SYCL reference semantics) is destructed before fusion is explicitly cancelled/completed.

For a future revision of this proposal, we're also considering adding a fusion object (kernel_fuser in your first comment, name TBD) as a second, alternative API. Constructing this object with a queue would then be equivalent to start_fusion, destruction of the object without previous explicit cancel/complete fusion would implicitly cancel fusion. We tend to offer this as an alternative API, so users with restrictions on the API (e.g., of libraries) can use the start/complete approach, while others can use the more RAII-friendly fusion object approach.

@keryell
Copy link
Contributor

keryell commented Nov 1, 2022

Yes. We cannot really attach some state like that to a queue because it goes against RAII principles

I think the situation with sycl::queue and RAII/exceptions for fusion is similar to the regular queue::submit when it comes to exception safety. When calling a function with a queue and that function causes an exception, there is also no way of knowing if any kernels have been submitted for execution through that queue before the exception happened, as they are not bound to RAII of the queue.

The queue::submit has not this problem (even if we have other internal discussions on similar problems https://gitlab.khronos.org/sycl/Specification/-/issues/253) because there is no expected queue::submit_stop which might be missed.
The problem is not when the queue is destructed by an exception but when the fusion finalization call is skipped by an exception while the queue is still alive. You might end up to eat all your memory in that case by recording all the kernels. But anyway, this extension is not for SYCL SC, right? :-)

@sommerlukas
Copy link
Contributor

But anyway, this extension is not for SYCL SC, right? :-)

Correct, SYCL SC is not the target of this extension.

You might end up to eat all your memory in that case by recording all the kernels

That might indeed be relevant on very memory-limited (e.g., embedded) devices. Just as a note from implementation experience: When recording, in our current implementation, we do not store the IR (e.g., SPIR-V) for the kernel in the fusion list, but rather only argument information. This information is similar (or identical) in size to the information held by the SYCL RT scheduler for regular queue::submit.

The IR for the kernels is held by the applications in "fat" binaries and only retrieved on call to ext_codeplay_complete_fusion, and not stored in the fusion list before that.

@gmlueck
Copy link
Contributor

gmlueck commented Nov 3, 2022

This is a very interesting extension, but I have some concerns about the way the API is structured. My main concern is that the ext_codeplay_start_fusion member function adds a global mode to the queue. There are two reasons I don't like this.

As @keryell pointed out, this doesn't work well when the application uses exceptions. Consider an application like this:

void do_thing(queue q) {
  q.ext_codeplay_start_fusion();
  do_something_else();
  q.submit(/*...*/);
  q.submit(/*...*/);
  q.ext_codeplay_complete_fusion();
}

void do_something_else() {
  /*...*/
  if (/*whatever*/) {
    throw an_error{};
  }
  /*...*/
}

This code snippet has a bug because do_thing leaves the queue in "fusion mode" when do_something_else throws an exception. To fix this, the body of do_thing needs to catch the exception just to disable fusion mode:

void do_thing(queue q) {
  q.ext_codeplay_start_fusion();
  try {
    do_something_else();
  }
  catch (...) {
    q.ext_codeplay_cancel_fusion();
    throw;
  }
  q.submit(/*...*/);
  q.submit(/*...*/);
  q.ext_codeplay_complete_fusion();
}

However, this code pattern is tedious and it's very easy to forget the try / catch block, which means this is an easy bug to fall into.

It's better if the API uses RAII pattern, so that no try / catch is needed. For example:

void do_thing(queue q) {
  kernel_fuser fuser;
  do_something_else();
  fuser.add_kernel(/*...*/);
  fuser.add_kernel(/*...*/);
  q.ext_codeplay_submit_fuser(fuser);
}

Now, if do_something_else throws an exception, fuser is safely destroyed and the queue remains in its original state.

My second concern is that the API encourages users to put the queue into "fusion mode" and then call arbitrary code that adds commands to the queue. In fact, you mention this as a motivation in earlier comments in this PR. However, I think this is not safe because the semantics of the queue change in incompatible ways when it is in fusion mode. As a result, changing the queue mode could cause existing code using that queue to break. I noticed two places where the queue API changes incompatibly when in fusion mode:

  • The event returned from queue::submit cannot be relied upon when fusions occur. Therefore, it is error prone to put a queue into fusion mode unless you examine all code using the queue and determine that it does not rely on these events.

  • The spec seems to indicate that the user must beware of certain data races in kernels that are fused: "it’s the user’s responsibility to make sure no data races occur in the fused kernel". It wasn't clear to me what data race is the concern here, but this could be another incompatible API change.

I think you could avoid this problem also by exposing the fusion API through a new object like the kernel_fuser I show above. This forces users to examine their existing code to make sure it is safe to fuse.

In practice, I suspect users will need to examine and modify their code anyways in order to get reasonable performance benefits because they will need to add the properties promote_local, promote_private, and no_barriers in order to see a good speedup. These are not safe to add unless you understand the kernels you want to fuse.

@sommerlukas
Copy link
Contributor

@gmlueck Thanks for your feedback!

I agree that exceptions could leave the queue in an unknown fusion state and the try/catch-pattern would be tedious to use. The fact that we still deemed this behavior acceptable for fusion during our initial design phase has to do with how queues behave on exceptions, even without fusion.

This behavior becomes visible in a modified version of your example:

void do_stuff(queue q){
  do_thing(q);
  q.wait();
}

void do_thing(queue q) {
  q.submit(/* kernel 1 */);
  do_something_else();
  q.submit(/* kernel 2 */);
}

void do_something_else() {
  /*...*/
  if (/*whatever*/) {
    throw an_error{};
  }
  /*...*/
}

In this example, there is no way for do_stuff to know which kernels it will be waiting for in q.wait(). Depending on whether do_something_else threw an exception, it might be waiting for kernel 1 and kernel 2 or only kernel 1. On top of that, it cannot even be sure that kernel 1 has been submitted, in case the submission caused an synchronous error.

As exceptions can already leave the queue in a state where the user is unable to tell what exactly has been submitted through this queue, the additional fusion mode seemed acceptable to us during design.

ext_codeplay_is_in_fusion_mode() allows the user to query for the fusion state and it is also permissible to call ext_codeplay_cancel_fusion or ext_codeplay_complete_fusion after fusion has already ended.

The event returned from queue::submit cannot be relied upon when fusions occur. Therefore, it is error prone to put a queue into fusion mode unless you examine all code using the queue and determine that it does not rely on these events.

We did not want to commit to validity of the events from queue::submit after ext_codeplay_complete_fusion based on implementation experience with an early prototype. However, in our current implementation (see #7204 and this section for details), the events returned by queue::submit remain valid and allow synchronization independent of whether implicit cancellation, a call to ext_codeplay_cancel_fusion or a call to ext_codeplay_complete_fusion happened. We can change this extension proposal such that the events are guaranteed to always remain valid.

The spec seems to indicate that the user must beware of certain data races in kernels that are fused: "it’s the user’s responsibility to make sure no data races occur in the fused kernel". It wasn't clear to me what data race is the concern here, but this could be another incompatible API change.

This refers to data races between work-items in different work-groups. If two kernels are submitted as separate kernels (i.e., without fusion), there is an implicit global barrier between the execution of the two kernels.

If two kernels are fused, a data race may occur if work-items from the second kernel require synchronization with work-items from a different work-group in the first kernel, as no implicit global barrier is present in the execution of the single, fused kernel and only local barriers are supported in general (and inserted by fusion, see property::no_barriers).

In practice, I suspect users will need to examine and modify their code anyways in order to get reasonable performance benefits because they will need to add the properties promote_local, promote_private, and no_barriers in order to see a good speedup.

This work was partially motivated by SYCL-based ML libraries such as SYCL-DNN. We were able to successfully apply fusion to applications submitting multiple kernels from the SYCL-DNN library by putting the queue in fusion mode before calling the SYCL-DNN library functions and completing it afterwards. In this case, it was also not necessary to modify the SYCL-DNN library functions for NN operators themselves, as the properties you mentioned can also be applied to the buffer passed to the library.

Still, we agree with your and @keryell's comment that a more exception-safe, RAII-based API based on an explicit fusion object (kernel_fuser in your example) could be a valuable addition to this extension.

While we would like to keep the existing API for now, to enable users to work with libraries, we could introduce the kernel fusion object as an alternative API.

The extension proposal could then strongly encourage the use of the fusion object RAII API, while still offering a less safe API for users that need to work with libraries that are not yet fusion-aware.

If you agree, our team could hash out the details of the kernel_fuser API and update this extension proposal to include it.

@gmlueck
Copy link
Contributor

gmlueck commented Nov 4, 2022

In this case, it was also not necessary to modify the SYCL-DNN library functions for NN operators themselves, as the properties you mentioned can also be applied to the buffer passed to the library.

This sounds really unsafe. By setting these properties on the buffer passed into the library, isn't the caller making an assumption about how the buffer is used inside that library? If that assumption is wrong or if the library implementation changes, then the code will be broken? Is that the case, or am I not understanding the situation correctly?

@sommerlukas
Copy link
Contributor

isn't the caller making an assumption about how the buffer is used inside that library

In case of SYCL-DNN as an open-source library, it is possible to analyze the use and access pattern of the buffer to make sure internalization is possible.

@sommerlukas
Copy link
Contributor

We had some offline discussion with @gmlueck to improve the proposal.

Key takeaways are:

  • We are going to use the current API for now, to collect some experience and early feedback for kernel fusion. However, the proposal will be clearly marked as "experimental" (also indicated by the namespace), to make users aware of the concerns regarding safety (exceptions, third-party libraries) and indicate that future changes to the API are very likely.
  • For separation of concerns and to make the use of fusion more explicit with the current API, most of the new member functions have been separated from queue into an explicit fusion_wrapper class, which wraps a queue to give access to the fusion API.
  • A future proposal will replace the current API with an explicit fusion object to address exception safety and make the use of fusion in third-party libraries explicit. We are working on a design for that and will also align it with the SYCL graph extension.

steffenlarsen pushed a commit to intel/llvm-test-suite that referenced this pull request Nov 28, 2022
Two simple tests to check that code using the kernel fusion extension API compiles correctly. 

The tests currently do not yet execute the compiled application, as the necessary functionality will only be added to the implementation in a later PR. 

Spec: intel/llvm#7098
Implementation: intel/llvm#7416

Signed-off-by: Lukas Sommer <lukas.sommer@codeplay.com>
steffenlarsen pushed a commit that referenced this pull request Dec 8, 2022
This is the fourth patch in a series of patches to add an implementation
of the [kernel fusion
extension](#7098). We have split the
implementation into multiple patches to make them more easy to review.

This patch adds the LLVM passes that perform the kernel fusion and
related optimizations:
* A pass creating the function definition for the fused kernel from the
input kernel definitions.
* A pass performing internalization of dataflow internal to the fused
kernel into either private or local memory.
The type of memory to use is currently specified by the user in the
runtime.
* A pass propagating values for scalars and by-val aggregates from the
SYCL runtime to the fused kernel as constants.

The information is propagated from the SYCL runtime to the passes via
LLVM metadata inserted by the JIT compiler frontend.

After and between the fusion passes, some standard LLVM optimization and
transformation passes are executed to enable passes and optimize the
fused kernel.

Signed-off-by: Lukas Sommer <lukas.sommer@codeplay.com>
Co-authored-by: Victor Perez <victor.perez@codeplay.com>
againull pushed a commit that referenced this pull request Dec 16, 2022
This is the third patch in a series of patches to add an implementation
of the [kernel fusion
extension](#7098). We have split the
implementation into multiple patches to make them more easy to review.
This patch integrates the kernel fusion extension into the SYCL runtime
scheduler.

Next to collecting the kernels submitted while in fusion mode in the
fusion list associated with the queue, the integration into the
scheduler is also responsible for detecting the synchronization
scenarios. Various scenarios, such as buffer destruction or event wait,
require fusion to be aborted early. The full list of scenarios is
available in the [extension
proposal](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_codeplay_kernel_fusion.asciidoc#synchronization-in-the-sycl-application).

A high-level description of the integration into the scheduler can be
found in the [design document](#7204).

This PR can be reviewed and merged independently of
#7465.

Signed-off-by: Lukas Sommer <lukas.sommer@codeplay.com>

Signed-off-by: Lukas Sommer <lukas.sommer@codeplay.com>
EwanC added a commit to reble/llvm that referenced this pull request Dec 21, 2022
Better aligns the queue record graph creation mechansism
with the [kernel fusion
extension](intel#7098)

```cpp
ext::codeplay::experimental::fusion_wrapper w{q};
w.start_fusion();
// 'q' submissions
w.complete_fusion()
```

By changing the relationship between a queue and a graph so
that recording starts and finishes on a graph we better match
kernel fusion. This design is also more exception safe as
`end_recording()` can be called in a RAII approach when a graph
is destroyed.

As a result a graph is now created from queue recording like:
```cpp
ext::oneapi::experimental::command_graph graph;
graph.begin_recording({q});
// 'q' submissions
graph.end_recording();
```

Addresses Issue #53
EwanC added a commit to reble/llvm that referenced this pull request Jan 10, 2023
Better aligns the queue record graph creation mechanism with the [kernel fusion extension](intel#7098)

```cpp
ext::codeplay::experimental::fusion_wrapper w{q};
w.start_fusion();
// 'q' submissions
w.complete_fusion()
```

By changing the relationship between a queue and a graph so
that recording starts and finishes on a graph we better match
kernel fusion. This design is also more exception safe as
`end_recording()` can be called in a RAII approach when a graph
is destroyed.

As a result a graph is now created from queue recording like:
```cpp
ext::oneapi::experimental::command_graph graph;
graph.begin_recording({q});
// 'q' submissions
graph.end_recording();
```

Addresses Issue #53
steffenlarsen pushed a commit that referenced this pull request Jan 13, 2023
This is the fifth patch in a series of patches to add an implementation
of the [kernel fusion
extension](#7098). We have split the
implementation into multiple patches to make them more easy to review.

This patch connects the JIT compiler for kernel fusion (`sycl-fusion`)
with the SYCL runtime.

- Enable the feature by default and add an option to `configure.py` to
disable it.
- Link the runtime against the JIT compiler library as a shared library.
- Add logic to retrieve binaries (SPIR-V) and other information (e.g.,
accessors) from the SYCL RT and invoke the JIT compiler.
- Representation to store binaries (SPIR-V) returned by JIT compiler in
memory for use as PI device binaries.

The integration of the JIT compiler into the SYCL RT is described in
[this design document](#7204).

Signed-off-by: Lukas Sommer <lukas.sommer@codeplay.com>
dm-vodopyanov pushed a commit to intel/llvm-test-suite that referenced this pull request Jan 23, 2023
Test integration of kernel fusion into the SYCL runtime scheduler.
    
Check that cancellation of the fusion happens if required by synchronization rules, as described in the [extension proposal](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_codeplay_kernel_fusion.asciidoc#synchronization-in-the-sycl-application).

Spec: intel/llvm#7098
Implementation: intel/llvm#7531

Signed-off-by: Lukas Sommer <lukas.sommer@codeplay.com>
steffenlarsen pushed a commit to intel/llvm-test-suite that referenced this pull request Jan 27, 2023
Test different scenarios for kernel fusion, including creation of the fused kernel by the JIT compiler and performance optimizations such as dataflow internalization.

Automatically detect availability of the kernel fusion extension in the DPC++ build in `lit.cfg.py` and make it available for `REQUIRES` clauses.

Spec: intel/llvm#7098
Implementation: intel/llvm#7831

Signed-off-by: Lukas Sommer <lukas.sommer@codeplay.com>
aelovikov-intel pushed a commit to aelovikov-intel/llvm that referenced this pull request Feb 23, 2023
Two simple tests to check that code using the kernel fusion extension API compiles correctly. 

The tests currently do not yet execute the compiled application, as the necessary functionality will only be added to the implementation in a later PR. 

Spec: intel#7098
Implementation: intel#7416

Signed-off-by: Lukas Sommer <lukas.sommer@codeplay.com>
aelovikov-intel pushed a commit to aelovikov-intel/llvm that referenced this pull request Feb 23, 2023
Test integration of kernel fusion into the SYCL runtime scheduler.
    
Check that cancellation of the fusion happens if required by synchronization rules, as described in the [extension proposal](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_codeplay_kernel_fusion.asciidoc#synchronization-in-the-sycl-application).

Spec: intel#7098
Implementation: intel#7531

Signed-off-by: Lukas Sommer <lukas.sommer@codeplay.com>
aelovikov-intel pushed a commit to aelovikov-intel/llvm that referenced this pull request Feb 23, 2023
Test different scenarios for kernel fusion, including creation of the fused kernel by the JIT compiler and performance optimizations such as dataflow internalization.

Automatically detect availability of the kernel fusion extension in the DPC++ build in `lit.cfg.py` and make it available for `REQUIRES` clauses.

Spec: intel#7098
Implementation: intel#7831

Signed-off-by: Lukas Sommer <lukas.sommer@codeplay.com>
bader pushed a commit that referenced this pull request Mar 14, 2023
Design document covering the approach to integrate the kernel fusion
extension into the runtime and
the kernel fusion JIT process.

Covers design to implement extension proposed in
#7098

Signed-off-by: Victor Lomuller <victor@codeplay.com>
Co-authored-by: Lukas Sommer <lukas.sommer@codeplay.com>
Co-authored-by: Victor Perez <victor.perez@codeplay.com>
aelovikov-intel pushed a commit to aelovikov-intel/llvm that referenced this pull request Mar 27, 2023
…ite#1404)

Two simple tests to check that code using the kernel fusion extension API compiles correctly. 

The tests currently do not yet execute the compiled application, as the necessary functionality will only be added to the implementation in a later PR. 

Spec: intel#7098
Implementation: intel#7416

Signed-off-by: Lukas Sommer <lukas.sommer@codeplay.com>
aelovikov-intel pushed a commit to aelovikov-intel/llvm that referenced this pull request Mar 27, 2023
…te#1416)

Test integration of kernel fusion into the SYCL runtime scheduler.
    
Check that cancellation of the fusion happens if required by synchronization rules, as described in the [extension proposal](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_codeplay_kernel_fusion.asciidoc#synchronization-in-the-sycl-application).

Spec: intel#7098
Implementation: intel#7531

Signed-off-by: Lukas Sommer <lukas.sommer@codeplay.com>
aelovikov-intel pushed a commit to aelovikov-intel/llvm that referenced this pull request Mar 27, 2023
…uite#1535)

Test different scenarios for kernel fusion, including creation of the fused kernel by the JIT compiler and performance optimizations such as dataflow internalization.

Automatically detect availability of the kernel fusion extension in the DPC++ build in `lit.cfg.py` and make it available for `REQUIRES` clauses.

Spec: intel#7098
Implementation: intel#7831

Signed-off-by: Lukas Sommer <lukas.sommer@codeplay.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
spec extension All issues/PRs related to extensions specifications
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants