[SYCL][Doc] Add kernel fusion extension proposal #7098
Signed-off-by: Victor Perez <victor.perez@codeplay.com>
After thinking more about these extensions at the F2F, I think that the |
Thank you very much for your feedback @keryell! The main reason we avoided to introduce new objects like What is your main concern regarding exceptions? An unclear fusion state if an exception occurs between a pair of |
ping @intel/dpcpp-specification-reviewers |
I guess such libraries are probably C-like anyway so it does not make sense to clean them with some clean C++.
Yes. We cannot really attach some state like that to a queue because it goes against RAII principles https://github.com/isocpp/CppCoreGuidelines/blob/master/CppCoreGuidelines.md#Rr-raii https://en.cppreference.com/w/cpp/language/raii |
I think the situation with The implementation we are currently working on implicitly cancels fusion if the queue (i.e., the "last remaining host copy" in SYCL reference semantics) is destructed before fusion is explicitly cancelled/completed. For a future revision of this proposal, we're also considering adding a fusion object ( |
sycl/doc/extensions/proposed/sycl_ext_codeplay_kernel_fusion.asciidoc
The |
Correct, SYCL SC is not the target of this extension.
That might indeed be relevant on very memory-limited (e.g., embedded) devices. Just as a note from implementation experience: When recording, in our current implementation, we do not store the IR (e.g., SPIR-V) for the kernel in the fusion list, but rather only argument information. This information is similar (or identical) in size to the information held by the SYCL RT scheduler for regular The IR for the kernels is held by the applications in "fat" binaries and only retrieved on call to |
This is a very interesting extension, but I have some concerns about the way the API is structured. My main concern is that the As @keryell pointed out, this doesn't work well when the application uses exceptions. Consider an application like this:
This code snippet has a bug because
However, this code pattern is tedious and it's very easy to forget the It's better if the API uses RAII pattern, so that no
Now, if My second concern is that the API encourages users to put the queue into "fusion mode" and then call arbitrary code that adds commands to the queue. In fact, you mention this as a motivation in earlier comments in this PR. However, I think this is not safe because the semantics of the queue change in incompatible ways when it is in fusion mode. As a result, changing the queue mode could cause existing code using that queue to break. I noticed two places where the queue API changes incompatibly when in fusion mode:
I think you could avoid this problem also by exposing the fusion API through a new object like the In practice, I suspect users will need to examine and modify their code anyways in order to get reasonable performance benefits because they will need to add the properties |
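To make the exception-safety concern in this comment concrete, here is a minimal sketch of an RAII-style fusion guard. It assumes a mock queue type rather than the real `sycl::queue`: `mock_queue`, `fusion_guard`, and `submit_two` are hypothetical names invented for illustration, and the cancel/complete behavior is a simplification, not the extension's specified semantics.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a queue with a fusion mode; NOT the real sycl::queue.
struct mock_queue {
  bool in_fusion_mode = false;
  std::vector<std::string> fusion_list; // kernels recorded while in fusion mode
  std::vector<std::string> launched;    // kernels actually launched

  void start_fusion() { in_fusion_mode = true; }

  void submit(const std::string &kernel) {
    (in_fusion_mode ? fusion_list : launched).push_back(kernel);
  }

  // Assumed behavior: launch the recorded kernels one by one, unfused.
  void cancel_fusion() {
    for (const auto &k : fusion_list) launched.push_back(k);
    fusion_list.clear();
    in_fusion_mode = false;
  }

  // Assumed behavior: launch the recorded kernels as a single fused kernel.
  void complete_fusion() {
    std::string fused;
    for (const auto &k : fusion_list) fused += (fused.empty() ? "" : "+") + k;
    if (!fused.empty()) launched.push_back(fused);
    fusion_list.clear();
    in_fusion_mode = false;
  }
};

// Hypothetical RAII guard: fusion is cancelled on scope exit unless completed.
class fusion_guard {
  mock_queue &q_;
  bool done_ = false;

public:
  explicit fusion_guard(mock_queue &q) : q_(q) { q_.start_fusion(); }
  fusion_guard(const fusion_guard &) = delete;
  fusion_guard &operator=(const fusion_guard &) = delete;

  void complete() {
    q_.complete_fusion();
    done_ = true;
  }

  ~fusion_guard() {
    if (!done_) q_.cancel_fusion();
  }
};

// Submits two kernels under fusion; optionally throws between the submissions.
void submit_two(mock_queue &q, bool fail) {
  fusion_guard guard{q};
  q.submit("k1");
  if (fail) throw std::runtime_error("failure between submissions");
  q.submit("k2");
  guard.complete();
}

// Returns true if the queue left fusion mode even though an exception was thrown.
bool queue_recovers_after_throw() {
  mock_queue q;
  try {
    submit_two(q, true);
  } catch (const std::runtime_error &) {
  }
  return !q.in_fusion_mode;
}
```

In this sketch the destructor is the only place that has to remember to leave fusion mode, so no `try`/`catch` is needed at the call site just to restore the queue state.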
@gmlueck Thanks for your feedback! I agree that exceptions could leave the queue in an unknown fusion state and the This behavior becomes visible in a modified version of your example:

    void do_stuff(queue q) {
      do_thing(q);
      q.wait();
    }
    void do_thing(queue q) {
      q.submit(/* kernel 1 */);
      do_something_else();
      q.submit(/* kernel 2 */);
    }
    void do_something_else() {
      /*...*/
      if (/*whatever*/) {
        throw an_error{};
      }
      /*...*/
    }

In this example, there is no way for As exceptions can already leave the queue in a state where the user is unable to tell what exactly has been submitted through this queue, the additional fusion mode seemed acceptable to us during design.
We did not want to commit to validity of the events from
This refers to data races between work-items in different work-groups. If two kernels are submitted as separate kernels (i.e., without fusion), there is an implicit global barrier between the execution of the two kernels. If two kernels are fused, a data race may occur if work-items from the second kernel require synchronization with work-items from a different work-group in the first kernel, as no implicit global barrier is present in the execution of the single, fused kernel and only local barriers are supported in general (and inserted by fusion, see
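The cross-work-group hazard described here can be illustrated with a small host-side simulation. This is a sketch under stated assumptions, not SYCL code: `kernel1`, `kernel2`, `run_unfused`, and `run_fused` are hypothetical names, and work-groups are simulated sequentially, which fixes one possible outcome of what would be a nondeterministic race on a real device.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t N = 8;          // total number of work-items
constexpr std::size_t GROUP_SIZE = 4; // work-items per work-group

// Kernel 1: each work-item writes its own element of b.
inline void kernel1(std::size_t i, const std::vector<int> &a, std::vector<int> &b) {
  b[i] = a[i] + 1;
}

// Kernel 2: each work-item reads its right neighbour's element of b,
// which may have been produced by a different work-group.
inline void kernel2(std::size_t i, const std::vector<int> &b, std::vector<int> &c) {
  c[i] = b[(i + 1) % N];
}

// Separate launches: all of kernel 1 finishes before any of kernel 2 starts,
// which models the implicit global barrier between two kernels.
std::vector<int> run_unfused(const std::vector<int> &a) {
  std::vector<int> b(N, 0), c(N, 0);
  for (std::size_t i = 0; i < N; ++i) kernel1(i, a, b);
  for (std::size_t i = 0; i < N; ++i) kernel2(i, b, c);
  return c;
}

// Fused launch: each work-group runs kernel 1 and then kernel 2 on its own
// work-items, separated only by a group-local barrier.
std::vector<int> run_fused(const std::vector<int> &a) {
  std::vector<int> b(N, 0), c(N, 0);
  for (std::size_t g = 0; g < N; g += GROUP_SIZE) {
    for (std::size_t i = g; i < g + GROUP_SIZE; ++i) kernel1(i, a, b);
    // A local barrier would sit here; it does not wait for other work-groups.
    for (std::size_t i = g; i < g + GROUP_SIZE; ++i) kernel2(i, b, c);
  }
  return c;
}
```

With the input `{0, 1, ..., 7}`, the last work-item of the first group reads `b[4]` before the second group has written it, so the fused result differs from the unfused one at that index.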
This work was partially motivated by SYCL-based ML libraries such as SYCL-DNN. We were able to successfully apply fusion to applications submitting multiple kernels from the SYCL-DNN library by putting the queue in fusion mode before calling the SYCL-DNN library functions and completing it afterwards. In this case, it was also not necessary to modify the SYCL-DNN library functions for NN operators themselves, as the properties you mentioned can also be applied to the buffer passed to the library. Still, we agree with your and @keryell's comment that a more exception-safe, RAII-based API based on an explicit fusion object ( While we would like to keep the existing API for now, to enable users to work with libraries, we could introduce the kernel fusion object as an alternative API. The extension proposal could then strongly encourage the use of the fusion object RAII API, while still offering a less safe API for users that need to work with libraries that are not yet fusion-aware. If you agree, our team could hash out the details of the |
This sounds really unsafe. By setting these properties on the buffer passed into the library, isn't the caller making an assumption about how the buffer is used inside that library? If that assumption is wrong or if the library implementation changes, then the code will be broken? Is that the case, or am I not understanding the situation correctly? |
In case of SYCL-DNN as an open-source library, it is possible to analyze the use and access pattern of the buffer to make sure internalization is possible. |
We had some offline discussion with @gmlueck to improve the proposal. Key takeaways are:
sycl/doc/extensions/experimental/sycl_ext_codeplay_kernel_fusion.asciidoc
Two simple tests to check that code using the kernel fusion extension API compiles correctly. The tests currently do not yet execute the compiled application, as the necessary functionality will only be added to the implementation in a later PR. Spec: intel/llvm#7098 Implementation: intel/llvm#7416 Signed-off-by: Lukas Sommer <lukas.sommer@codeplay.com>
This is the fourth patch in a series of patches to add an implementation of the [kernel fusion extension](#7098). We have split the implementation into multiple patches to make them easier to review.

This patch adds the LLVM passes that perform the kernel fusion and related optimizations:

* A pass creating the function definition for the fused kernel from the input kernel definitions.
* A pass performing internalization of dataflow internal to the fused kernel into either private or local memory. The type of memory to use is currently specified by the user in the runtime.
* A pass propagating values for scalars and by-val aggregates from the SYCL runtime to the fused kernel as constants.

The information is propagated from the SYCL runtime to the passes via LLVM metadata inserted by the JIT compiler frontend. After and between the fusion passes, some standard LLVM optimization and transformation passes are executed to enable passes and optimize the fused kernel.

Signed-off-by: Lukas Sommer <lukas.sommer@codeplay.com>
Co-authored-by: Victor Perez <victor.perez@codeplay.com>
This is the third patch in a series of patches to add an implementation of the [kernel fusion extension](#7098). We have split the implementation into multiple patches to make them easier to review.

This patch integrates the kernel fusion extension into the SYCL runtime scheduler. Besides collecting the kernels submitted while in fusion mode in the fusion list associated with the queue, the integration into the scheduler is also responsible for detecting the synchronization scenarios. Various scenarios, such as buffer destruction or event wait, require fusion to be aborted early. The full list of scenarios is available in the [extension proposal](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_codeplay_kernel_fusion.asciidoc#synchronization-in-the-sycl-application).

A high-level description of the integration into the scheduler can be found in the [design document](#7204). This PR can be reviewed and merged independently of #7465.

Signed-off-by: Lukas Sommer <lukas.sommer@codeplay.com>
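The scheduler behavior described in this commit message (collect kernels in a fusion list, abort fusion early on synchronization) can be sketched with a simplified model. Everything here is an assumption for illustration: `mock_scheduler` and its member names are hypothetical and do not correspond to the actual SYCL runtime scheduler classes, and only two of the synchronization scenarios are modelled.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified scheduler model; names are illustrative only.
class mock_scheduler {
  bool fusing_ = false;
  std::vector<std::string> fusion_list_; // kernels held back for fusion
  std::vector<std::string> executed_;    // kernels handed over for execution

  // Abort fusion early: run the recorded kernels individually, in order.
  void abort_fusion() {
    for (const auto &k : fusion_list_) executed_.push_back(k);
    fusion_list_.clear();
    fusing_ = false;
  }

public:
  void start_fusion() { fusing_ = true; }

  // In fusion mode, kernels are only recorded; otherwise they run directly.
  void submit(const std::string &kernel) {
    if (fusing_)
      fusion_list_.push_back(kernel);
    else
      executed_.push_back(kernel);
  }

  // Synchronization scenarios: results are needed now, so fusion is aborted.
  void on_event_wait() {
    if (fusing_) abort_fusion();
  }
  void on_buffer_destruction() {
    if (fusing_) abort_fusion();
  }

  bool fusing() const { return fusing_; }
  const std::vector<std::string> &executed() const { return executed_; }
};
```

The point of the model is that a synchronization request cannot be satisfied while kernels are still only recorded, so the recorded kernels are released for ordinary execution and the queue drops out of fusion mode.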
Better aligns the queue record graph creation mechanism with the [kernel fusion extension](intel#7098)

```cpp
ext::codeplay::experimental::fusion_wrapper w{q};
w.start_fusion();
// 'q' submissions
w.complete_fusion();
```

By changing the relationship between a queue and a graph so that recording starts and finishes on a graph, we better match kernel fusion. This design is also more exception safe, as `end_recording()` can be called in a RAII approach when a graph is destroyed. As a result, a graph is now created from queue recording like:

```cpp
ext::oneapi::experimental::command_graph graph;
graph.begin_recording({q});
// 'q' submissions
graph.end_recording();
```

Addresses Issue #53
This is the fifth patch in a series of patches to add an implementation of the [kernel fusion extension](#7098). We have split the implementation into multiple patches to make them easier to review.

This patch connects the JIT compiler for kernel fusion (`sycl-fusion`) with the SYCL runtime.

- Enable the feature by default and add an option to `configure.py` to disable it.
- Link the runtime against the JIT compiler library as a shared library.
- Add logic to retrieve binaries (SPIR-V) and other information (e.g., accessors) from the SYCL RT and invoke the JIT compiler.
- Add a representation to store binaries (SPIR-V) returned by the JIT compiler in memory for use as PI device binaries.

The integration of the JIT compiler into the SYCL RT is described in [this design document](#7204).

Signed-off-by: Lukas Sommer <lukas.sommer@codeplay.com>
Test integration of kernel fusion into the SYCL runtime scheduler. Check that cancellation of the fusion happens if required by synchronization rules, as described in the [extension proposal](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_codeplay_kernel_fusion.asciidoc#synchronization-in-the-sycl-application). Spec: intel/llvm#7098 Implementation: intel/llvm#7531 Signed-off-by: Lukas Sommer <lukas.sommer@codeplay.com>
Test different scenarios for kernel fusion, including creation of the fused kernel by the JIT compiler and performance optimizations such as dataflow internalization. Automatically detect availability of the kernel fusion extension in the DPC++ build in `lit.cfg.py` and make it available for `REQUIRES` clauses. Spec: intel/llvm#7098 Implementation: intel/llvm#7831 Signed-off-by: Lukas Sommer <lukas.sommer@codeplay.com>
Design document covering the approach to integrate the kernel fusion extension into the runtime and the kernel fusion JIT process. Covers design to implement extension proposed in #7098 Signed-off-by: Victor Lomuller <victor@codeplay.com> Co-authored-by: Lukas Sommer <lukas.sommer@codeplay.com> Co-authored-by: Victor Perez <victor.perez@codeplay.com>
Add specification for the "sycl_ext_codeplay_kernel_fusion" extension proposal, which allows user-driven kernel fusion of two or more kernels in a single kernel launch.