From be3dfbd70ec7da46ccf42d180d111c39ff7726ca Mon Sep 17 00:00:00 2001 From: vic Date: Thu, 17 Nov 2022 19:57:42 +0100 Subject: [PATCH] [SYCL][Doc] Add kernel fusion extension proposal (#7098) Add specification for the "sycl_ext_codeplay_kernel_fusion" extension proposal, which allows user-driven kernel fusion of two or more kernels in a single kernel launch. Signed-off-by: Victor Perez Co-authored-by: Lukas Sommer --- .../sycl_ext_codeplay_kernel_fusion.asciidoc | 710 ++++++++++++++++++ 1 file changed, 710 insertions(+) create mode 100644 sycl/doc/extensions/experimental/sycl_ext_codeplay_kernel_fusion.asciidoc diff --git a/sycl/doc/extensions/experimental/sycl_ext_codeplay_kernel_fusion.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_codeplay_kernel_fusion.asciidoc new file mode 100644 index 000000000000..468d8b08ab57 --- /dev/null +++ b/sycl/doc/extensions/experimental/sycl_ext_codeplay_kernel_fusion.asciidoc @@ -0,0 +1,710 @@ += sycl_ext_codeplay_kernel_fusion + +:source-highlighter: coderay +:coderay-linenums-mode: table + +// This section needs to be after the document title. +:doctype: book +:toc2: +:toc: left +:encoding: utf-8 +:lang: en +:dpcpp: pass:[DPC++] + +// Set the default source code type in this document to C++, +// for syntax highlighting purposes. This is needed because +// docbook uses c++ and html5 uses cpp. +:language: {basebackend@docbook:c++:cpp} + + +== Notice + +[%hardbreaks] +Copyright (C) 2022-2022 Codeplay Software Limited. All rights reserved. + +Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are trademarks +of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc. used by +permission by Khronos. + + +== Contact + +To report problems with this extension, please open a new issue at: + +https://github.com/intel/llvm/issues + + +== Dependencies + +This extension is written against the SYCL 2020 revision 6 specification. All +references below to the "core SYCL specification" or to section numbers in the +SYCL specification refer to that revision. + +== Status + +This is a proposed extension specification, intended to gather community +feedback. Interfaces defined in this specification may not be implemented yet +or may be in a preliminary state. The specification itself may also change in +incompatible ways before it is finalized. *Shipping software products should +not rely on APIs defined in this specification.* + +[NOTE] +==== +This is an experimental extension for the SYCL specification. +Exceptions while in fusion mode can leave a `queue` in an unknown fusion state, +as no RAII-based management of fusion is done. Passing a `queue` in fusion mode +to third-party libraries can make assumptions about the kernels enqueued by the +library that might change over time. + +This experimental proposal is intended to collect experience and early feedback +on an API for kernel fusion in SYCL to inform a future extension proposal +addressing the mentioned problems. +==== + +[NOTE] +==== +This extension is currently being implemented in {dpcpp} only for kernels in +SPIRV format. +==== + +== Overview + +Every kernel launch in SYCL carries an overhead due to memory traffic and device +launch and synchronization. To avoid this repeated overhead, it can be desirable +to **fuse** two or more kernels executing on the same device into a single +kernel launch. + +However, constructing a reliable, completely automatic kernel fusion in the +compiler is hard for the general case. Therefore, we instead propose an +interface for **user-driven kernel fusion**, so that the user can leverage +application/domain knowledge to explicitly instruct the SYCL runtime to fuse two +or more kernels. + +This work is motivated by scenarios in which the information to decide whether +to fuse is only available at runtime, e.g., taking into account input data size; +and/or the kernels being submitted for execution are not known at compile time, +e.g., using different kernels for different input data sizes and/or +platform. Thus, the fusion of kernels should be possible at **runtime** of the +application (in contrast to compile time). + +The aim of this document is to propose a mechanism for users to request the +fusion of two or more kernels into a single kernel **at runtime**. This requires +the extension of the runtime with some sort of JIT compiler to allow for the +fusion of kernel functions at runtime. + +=== Internalizing Dataflow [[internalization]] + +While avoiding repeated kernel launch overheads will most likely already improve +application performance, kernel fusion can deliver even higher performance gains +when internalizing dataflows. + +In a situation where data produced by one kernel is consumed by another kernel +and the two kernels are fused, the dataflow from the first kernel to the second +kernel can be made internal to the fused kernel. Instead of using time-consuming +reads and writes to/from global memory, the fused kernel can use much faster +mechanisms, e.g., registers or private memory to "communicate" the result, as we +will see in the following example. + +To achieve this result during fusion, a fusion compiler must be aware of some +additional information and context: + +* The compiler must know that two arguments refer to the same + accessor/underlying memory. +* As internalized buffers are not initialized, elements of the internalized + buffer being read by a kernel must have been written before (either in the + same kernel or in a previous one). +* Values stored to an internalized buffer must not be used by any other kernel + not part of the fusion process, as the data would become unavailable to + consumers. This is knowledge that the compiler cannot deduce. Instead, the + fact that the values stored to an internalized buffer are not used outside the + fused kernel must be provided by the user. +* If these conditions hold, depending on the memory access pattern of the fused + kernel, we can say that a buffer is: +** _Privately internalizable_: If not a single element of the buffer is to be + accessed by more than one work-item; +** _Locally internalizable_: If not a single element of the buffer is to be + accessed by work items of different work groups. + +As the compiler can reason about the access behavior of the different kernels +only in a very limited fashion, **it's the user's responsibility to make sure no +data races occur in the fused kernel**. Data races could in particular be +introduced because the implicit inter-work-group synchronization between the +execution of two separate kernels is eliminated by fusion. The user must ensure +that the kernels combined during fusion do not rely on this synchronization. + +=== Example + +```c++ +class KernelOne { +public: + KernelOne(accessor a, accessor b, accessor c) + : A{a}, B{b}, C{c} {} + + void operator()(item<1> i){ + C[i] = A[i] * B[i]; + } + +private: + accessor A; + accessor B; + accessor C; +}; + +class KernelTwo { +public: + KernelTwo(accessor x, accessor y, accessor z) + : X{x}, Y{y}, Z{z} {} + + void operator()(item<1> i){ + Z[i] = X[i] + Y[i]; + } + +private: + accessor X; + accessor Y; + accessor Z; + +}; + +int main(){ + constexpr size_t dataSize = 512; + int in1[dataSize], in2[dataSize], in3[dataSize], tmp[dataSize], out[dataSize]; + + queue q{gpu_selector{}, + {ext::codeplay::experimental::property::queue::enable_fusion()}}; + + { + ext::codeplay::experimental::fusion_wrapper w{q}; + + buffer bIn1{in1, range{dataSize}}; + buffer bIn2{in2, range{dataSize}}; + buffer bIn3{in3, range{dataSize}}; + buffer bOut{out, range{dataSize}}; + // Buffer bTmp will be internalized, as the promote_private property is used + // in its construction. + buffer bTmp{tmp, range{dataSize}, + {ext::codeplay::experimental::property::promote_private()}}; + + // Set the queue into "fusion mode" + w.start_fusion(); + + // "Submit" the first kernel. The kernel will be added to the the list of + // kernels to be fused and will not be executed before fusion is completed + // or cancelled. + q.submit([&](handler& cgh){ + auto accIn1 = bIn1.get_access(cgh); + auto accIn2 = bIn2.get_access(cgh); + auto accTmp = bTmp.get_access(cgh); + cgh.parallel_for(dataSize, KernelOne{accIn1, accIn2, accTmp}); + }); + + // "Submit" the second kernel. The kernel will be added to the the list of + // kernels to be fused and will not be executed before fusion is completed + // or canceled. + q.submit([&](handler& cgh){ + auto accTmp = bTmp.get_access(cgh); + auto accIn3 = bIn3.get_access(cgh); + auto accOut = bOut.get_access(cgh); + cgh.parallel_for(dataSize, KernelTwo{accTmp, accIn3, accOut}); + }); + + // Complete the fusion: JIT-compile a fused kernel containing KernelOne and + // KernelTwo and submit the fused kernel for execution. This call may return + // before JIT-compilation or execution of the fused kernel is completed. + w.complete_fusion({ext::codeplay::experimental::property::no_barriers()}); + + // End of the scope - buffers go out-of-scope and are destructed. Buffer + // destruction causes a synchronization with all outstanding commands + // operating on the buffer, in this case the fused kernel. + } +} +``` + +== Specification + +=== Feature test macro + +This extension provides a feature-test macro as described in the core SYCL +specification. An implementation supporting this extension must predefine the +macro `SYCL_EXT_CODEPLAY_KERNEL_FUSION` to one of the values defined in the +table below. Applications can test for the existence of this macro to determine +if the implementation supports this feature, or applications can test the +macro's value to determine which of the extension's features the implementation +supports. + +[%header,cols="1,5"] +|=== +|Value +|Description + +|1 +|Initial version of this extension. +|=== + +=== API Extension + +The design tightly integrates with the `queue` class and leverages the +asynchronous nature of SYCL kernel submissions. It introduces a new class +`fusion_wrapper` that wraps a SYCL queue to give access to the relevant API for +fusion. The wrapper class is introduced to achieve a separation of concerns by +keeping the fusion control API separate from the existing queue API. The wrapper +directly manipulates and controls the fusion state of the wrapped queue. + +Next to the `fusion_wrapper`, this extension also introduces additional +properties and a new member function for class `queue`. + +==== Fusion Wrapper class + +The `fusion_wrapper` is a thin wrapper around a SYCL queue object and provides +access to the necessary API functions to control the fusion state of the wrapped +queue object. The `fusion_wrapper` member functions directly modify the fusion +state of the underlying `queue`, effectively making the queue stateful. + +As the fusion state is attached to the wrapped `queue` object, it is permissible +to create two or more `fusion_wrapper` objects for the same `queue` object. The +`fusion_wrapper` objects will manage the fusion state for the same queue. It is +the applications responsibility to synchronize if one or multiple +`fusion_wrapper` objects are used in a multithreaded context. + +The `fusion_wrapper` class is **not** an allowable type for kernel parameters +(https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:kernel.parameter.passing[§4.12.4] +of the SYCL 2020 specification). + +A synopsis of the SYCL `fusion_wrapper` class is provided below. The constructors, +destructors and member functions of the SYCL `fusion_wrapper` class are listed in +Table 1 and 2. + +```c++ +namespace sycl { +namespace ext { +namespace codeplay { +namespace experimental { + +class fusion_wrapper { + + explicit fusion_wrapper(queue &q); + + /* -- common interface members -- */ + + queue get_queue() const; + + bool is_in_fusion_mode() const; + + void start_fusion(); + + void cancel_fusion(); + + event complete_fusion(const property_list &propList = {}); +}; +} // namespace experimental +} // namespace codeplay +} // namespace ext +} // namespace sycl +``` + + +Table 1. Constructors and destructors of the `fusion_wrapper` class +|=== +|Constructor|Description + +|`explicit fusion_wrapper(queue& syclQueue)` + +|Wraps the queue `syclQueue` with a `fusion_wrapper` to get access to the +fusion API and manage kernel fusion on `syclQueue`. + +The underlying `queue` must have property +`sycl::ext::codeplay::experimental::property::queue::enable_fusion` + +|=== + +Table 2. Member functions of the `fusion_wrapper` class + +|=== +|Member Function|Description + +|`void start_fusion()` + +|Set the wrapped `queue` into "fusion mode". Subsequent command group +submissions to the `queue` will not be submitted for execution right away, but rather added to +a list of kernels that should be fused (i.e., to the _fusion list_), until +`complete_fusion` or `cancel_fusion` are called. + +If the wrapped `queue` is already in fusion mode, the function throws an +`exception` with `errc::invalid` error code. + +|`event complete_fusion(const property_list &)` + +|Complete the fusion: If the runtime decides to perform fusion, it will +JIT-compile a fused kernel from all kernels submitted to the wrapped `queue` +since the last call to `start_fusion` and submit the fused kernel for execution. +Inside the fused kernel, the per-work-item effects are executed in the same +order as the kernels were initially submitted, adding group barriers between +each of them by default. If the runtime decides not to fuse the kernels, they +are passed to the scheduler in the same order that they were originally +submitted to the queue. Constraints on when fusion is possible and criteria for +the implementation to perform fusion are implementation-defined. Calling +`fusion_wrapper::complete_fusion` does therefore not guarantee that the kernels +will be fused. + +The call is asynchronous, i.e., it may return after fusion (JIT-compilation) is +done, but before execution of the fused kernel is completed. The returned event +allows to synchronize with the execution of the fused kernel. + +At call completion the wrapped `queue` is no longer in fusion mode, until the +next `start_fusion`. + +|`void cancel_fusion()` + +|Cancel the fusion and submit all kernels submitted to the wrapped `queue` since +the last `start_fusion()` for immediate execution **without** fusion. The +kernels are submitted in the same order as they were initially submitted to the +queue. + +This operation is asynchronous, i.e., it may return after the kernels have been +added to the scheduler, but before any of the previously submitted kernel starts +or completes execution. + +At call completion the wrapped `queue` is no longer in fusion mode, until the next +`start_fusion`. + +|`bool is_in_fusion_mode() const` +|Returns true if the wrapped SYCL `queue` is currently in fusion mode. + +|=== + +==== Properties + +Next to the new API functions and classes described above, this extension also +adds new properties that are described in Table 3. + +Table 3. New properties for kernel fusion. + +|=== +|Property|Description + +|`sycl::ext::codeplay::experimental::property::queue::enable_fusion` +|This property enables kernel fusion for the queue. If a `fusion_wrapper` object +is constructed on a queue without this property, an `exception` with `errc::invalid` +error code is thrown. + +If a `queue` is constructed with this property, but the underlying `device` of +the queue returns `false` for the device information descriptor +`sycl::ext::codeplay::experimental::info::device::supports_fusion`, an +`exception` with `errc::invalid` error code is thrown. + +|`sycl::ext::codeplay::experimental::property::no_barriers` + +|If the property list passed to `fusion_wrapper::complete_fusion()` contains this +property, no barriers are introduced between kernels in the fused kernel. + +|`sycl::ext::codeplay::experimental::property::promote_local` +|This property can be passed to the `accessor` constructor, giving a more +granular control, or to the `buffer` constructor, in which case all the +`accessors` will inherit this property (unless overridden). + +This property is an assertion by the application that each element in the buffer +is accessed by no more than one work-group in the kernel submitted by this +command-group (in case the property is specified on an accessor) or in any +kernel in the fusion set (in case the property is specified on a buffer). +Implementations may treat this as a hint to promote the buffer elements to +local memory (see local and private internalization in <>). + +The application also asserts that the updates made to the buffer by the kernel +submitted by this command-group (in case the property is specified on an +accessor) or in any kernel in the fusion set (in case the property is specified +on a buffer) may not be available for use after the fused kernel completes +execution. Implementations may treat this as a hint to not write back the final +result to global memory. + +|`sycl::ext::codeplay::experimental::property::promote_private` +|This property can be passed to the `accessor` constructor, giving a more +granular control, or to the `buffer` constructor, in which case all the +`accessors` will inherit this property (unless overridden). + +This property is an assertion by the application that each element in the buffer +is accessed by no more than one work-item in the kernel submitted by this +command-group (in case the property is specified on an accessor) or in any +kernel in the fusion set (in case the property is specified on a buffer). +Implementations may treat this as a hint to promote the buffer elements to +private memory (see local and private internalization in <>). + +The application also asserts that the updates made to the buffer by the kernel +submitted by this command-group (in case the property is specified on an +accessor) or in any kernel in the fusion set (in case the property is specified +on a buffer) may not be available for use after the fused kernel completes +execution. Implementations may treat this as a hint to not write back the final +result to global memory. + +|`sycl::ext::codeplay::experimental::property::force_fusion` + +|This property forces the SYCL runtime implementation to perform fusion if it is +possible to do so. Implementations must not defer kernel fusion, even if they +deemed the fusion to be non-profitable, e.g., based on some profitability +analysis. + +This property can be passed to `fusion_wrapper::complete_fusion()`. + +|=== + +==== New Queue Member Functions + +To support querying if a `queue` can be used for fusion, i.e., can be wrapped by +a `fusion_wrapper` object, this extension adds a new member function to the +`queue` class. + +Table 4. Added member functions of the `queue` class + +|=== +|Member Function|Description + +|`bool queue::ext_codeplay_supports_fusion() const` + +|Returns true if the SYCL `queue` was created with the `enable_fusion` property. +Equivalent to +`has_property()`. + +|=== + +==== Additional Device Information Descriptors + +To support querying whether a SYCL device and the underlying platform support +kernel fusion before constructing a queue with property +`ext::codeplay::experimental::property::queue::enable_fusion`, the following +device information descriptor is added as part of this extension proposal. + +Table 5. Added device information descriptors + +|=== +|Device descriptor |Return type |Description + +|`sycl::ext::codeplay::experimental::info::device::supports_fusion` | `bool` + +|Returns true if the SYCL `device` and the underlying `platform` support kernel fusion. + +|=== + +=== Synchronization while in Fusion Mode + +[NOTE] +==== +This section follows the same structure as +https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:synchronization[its +homonym in the SYCL standard]. +==== + +By design, the execution of a SYCL application using our proposed extension +should produce the same visible results as if the kernels were executed +regularly. Throughout this section, synchronization rules while in fusion mode +are described. A `queue` is said to be in fusion mode between being set into +fusion mode through a call to `fusion_wrapper::start_fusion` on a +`fusion_wrapper` object wrapping this queue and a call to either +`fusion_wrapper::cancel_fusion` or `fusion_wrapper::complete_fusion` on a +`fusion_wrapper` object wrapping this queue (note that the the two +`fusion_wrapper` objects need not be the same object). + +Also note that some scenarios will lead to the sequential submission of the +kernels in the fusion list, as adherence to the SYCL standard takes a higher +priority than the optimization benefits brought by the kernel fusion. + +==== Synchronization in the SYCL Application + +* _Buffer destruction_: In order to adhere to the SYCL standard, destruction of + a buffer which is to be accessed by kernels in the fusion list implies an + implicit fusion cancellation. This way, the kernels would be executed in + submission order, ensuring correct semantics, pending work would be completed + and the data would be copied back on completion. +* _Host accessors_: Similarly, to obtain correct semantics, when a host accessor + accessing a buffer to be accessed by a kernel submitted to the fusion list is + created, kernel fusion is implicitly canceled to be able to obtain the + expected contents of the buffer. +* _Command group enqueue_: Submission of command groups to (at least) + two different queues, of which at least one is in fusion mode, can + lead to _circular dependencies_ between the fused kernel and the + execution of other command-groups, if the command-groups synchronize + via requirements or explicit synchronization. In this context, a + circular dependencies arise if any kernel in a fusion list depends + on a kernel submitted for execution in a different queue and, at the + same time, this depends on another kernel in the fusion list. This + causes a circular dependency as the fused kernel would depend on the + kernel not in the fusion list and, at the same time, this would + depend on the fused kernel. ++ +Circular dependencies can be caused by device kernels, host tasks or +explicit memory operations. Implementations must cancel fusion in time +to avoid such circular dependencies and deadlock of the +application. The concrete event/submission causing cancellation is +implementation defined. Implementations could opt to cancel only when +the submission would create a circular dependency, but are free to do +so earlier, e.g., on submission of a command-group to another queue +which synchronizes with a kernel in the fusion list of another queue. +* _Queue operations_: Calls to queue operations blocking execution of the +calling thread, such as `sycl::queue::wait()`, must also imply an implicit +kernel fusion cancellation. +* _SYCL event objects_: Host synchronization on events returned by a call to +`queue::submit` while the queue is still in fusion mode would also result on an +implicit kernel fusion cancellation. Explicit dependencies (specified by the +user with `handler::depends_on`) between kernels to be fused must be dropped, as +the requirement will trivially hold (per work-item) thanks to fusion semantics. +* _Queue destruction_: As in this extension the queue becomes stateful, the +destruction of a queue in fusion mode would lead to an implicit kernel fusion +cancellation. + +==== Synchronization in SYCL kernels + +Group barriers semantics do not change in the fused kernel and barriers already +in the unfused kernels are preserved in the fused kernel. Despite this, it is +worth noting that, in order to introduce synchronization between work items in a +same work-group executing a fused kernel, a barrier is added between each of the +kernels being fused. This way, fusing a submission sequence as the one above +would result in the following one unless the `property::no_barriers` property is +used: + +```c++ +queue.submit([&](handler& cgh){ + auto accIn1 = bIn1.get_access(cgh); + auto accIn2 = bIn2.get_access(cgh); + auto accIn3 = bIn3.get_access(cgh); + auto accTmp = bTmp.get_access(cgh); + auto accOut = bOut.get_access(cgh); + cgh.parallel_for(dataSize, + [=](item<1> i) { + KernelOne{accIn1, accIn2, accTmp}(i); + group_barrier(i.get_group()); + KernelTwo{accTmp, accIn3, accOut}(i); + }); +} +``` + +=== Kernel Fusion Limitations + +In addition to the cases discussed above, kernel fusion might be canceled by the +runtime if some undesired scenarios arise. Note that some implementations might +be more capable/permissive and might not abort fusion in all of these +cases. Also, whether to abort when a kernel is submitted or when +`fusion_wrapper::complete_fusion` is called will be implementation and +scenario-dependent. + +==== Hierarchical Parallelism + +The extension does not support kernels using hierarchical parallelism. Although +some implementations might want to add support for this kind of kernels. + +==== Incompatible ND-ranges of the kernels to fuse + +Incompatibility of ND-ranges will be determined by the kernel fusion +implementation. All implementations should support fusing kernels with the exact +same ND-ranges, but implementations might cancel fusion as soon as a kernel with +a different ND-range is submitted. + +==== Kernels with different dimensions + +Similar to the previous one, it is implementation-defined whether or not to +support fusing kernels with different dimensionality. + +==== Explicit memory operations + +Calls to member function of the `handler` class (or their homologous `queue` +class shortcuts) should abort fusion in any of the following scenarios: + +* The command-group calling the explicit memory function explicitly synchronizes + (through an event) with one or multiple kernels in the fusion list; +* One or multiple requirements created by the command-group calling the explicit + memory function requires the execution of one or multiple kernels in the + fusion list to be satisfied. + +==== No intermediate representation + +In case any of the kernels to be fused does not count with an accessible +suitable intermediate representation, kernel fusion is canceled. + +=== Combining Internalization Properties + +In some cases, the user will specify different internalization targets for a +buffer and accessors to such buffer. When incompatible combinations are used, an +`exception` with `errc::invalid` error code is thrown. Otherwise, these +properties must be combined as follows: + +[options="header"] +|=== +|Accessor Internalization Target|Buffer Internalization Target|Resulting Internalization Target + +.3+.^|None +|None +|None + +|Local +|Local + +|Private +|Private + +.3+.^|Local +|None +|Local + +|Local +|Local + +|Private +|*Error* + +.3+.^|Private +|None +|Private + +|Local +|*Error* + +|Private +|Private +|=== + +In case different internalization targets are used for accessors to the same +buffer, the following (commutative and associative) rules are followed: + +[options="header"] +|=== +|Accessor~1~ Internalization Target|Accessor~2~ Internalization Target|Resulting Internalization Target + +|None +|_Any_ +|None + +.2+.^|Local +|Local +|Local + +|Private +|None + +|Private +|Private +|Private +|=== + +If no work-group size is specified or two accessors specify different +work-group sizes when using local internalization for any of the +kernels involved in the fusion, no internalization will be +performed. If there is a mismatch between the two accessors (access +range, access offset, number of dimensions, data type), no +internalization is performed. + +== Design Constraints + +The biggest constraint for the design stems from the the fact that the +combination of kernels to be fused is unknown at compile time. This means that, +for the design of the extension, templates cannot be leveraged to full +extent. Templates can only be used in cases where the information is available +at compile time (e.g., for a single kernel), but never for any interface working +with combinations of kernels that should be fused. + +== Revision History + +[cols="5,15,15,70"] +[grid="rows"] +[options="header"] +|======================================== +|Rev|Date|Authors|Changes +|1|2022-10-14|Victor Lomüller, Lukas Sommer and Victor Perez|*Initial draft* +|2|2022-11-09|Victor Lomüller, Lukas Sommer and Victor Perez|*Separate fusion API into new `fusion_wrapper`* +|========================================