[nvidia|amd] Add missing synchronization #1732

Open
densamoilov opened this issue Oct 3, 2023 · 3 comments
Labels: bug (A confirmed library bug), help wanted, platform:gpu-amd (Codeowner: @oneapi-src/onednn-gpu-amd), platform:gpu-nvidia (Codeowner: @oneapi-src/onednn-gpu-nvidia)

Comments

@densamoilov (Contributor)

Currently, if we call an asynchronous API within a host task, the event tied to the host task completes when the host task returns, rather than when the operation submitted by the asynchronous API call completes. This is the root cause of issues like #1703.

To fix the issue we need to perform proper synchronization within the host task. For example:

    auto e = q.submit([&](sycl::handler &cgh) {
        cgh.host_task([=](const sycl::interop_handle &ih) {
            cudnnAddTensor(handle, ...);
            // Only the stream associated with the handle has to be synchronized
            cudaStream_t currentStreamId;
            cudnnGetStream(handle, &currentStreamId);
            cuStreamSynchronize(currentStreamId);
        });
    });
    e.wait(); // The event completes when the `cudnnAddTensor` operation completes.

This fix should be implemented for both the cuDNN/cuBLAS based implementations and the MIOpen/rocBLAS based ones (using the HIP counterparts, e.g. hipStreamSynchronize); a sketch of the HIP variant follows.
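For illustration, a direct analog for the HIP backend could look like the sketch below. This is an assumption-laden sketch, not the actual oneDNN code: `miopenOpTensor` stands in for whatever MIOpen call the implementation makes, and `miopenGetStream` returns the `hipStream_t` associated with the handle.

    auto e = q.submit([&](sycl::handler &cgh) {
        cgh.host_task([=](const sycl::interop_handle &ih) {
            miopenOpTensor(handle, ...); // illustrative stand-in for the real MIOpen call
            // Only the stream associated with the MIOpen handle has to be synchronized.
            hipStream_t currentStreamId;
            miopenGetStream(handle, &currentStreamId);
            hipStreamSynchronize(currentStreamId);
        });
    });
    e.wait(); // Completes when the `miopenOpTensor` operation completes.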

Also, we would need to document a set of limitations that follow from this synchronization scheme: between the asynchronous API call and the cuStreamSynchronize/hipStreamSynchronize call, another thread could submit work to the same stream. In that case the submitted host task completes only when that other work completes, and in the worst case this can even lead to a deadlock.
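A hypothetical interleaving that shows the hazard (`stream` is the stream queried from the handle as above; the racing thread is marked in comments):

    // Thread A, inside the host task:
    cudnnAddTensor(handle, ...);  // enqueues work on `stream`
    // <-- Thread B enqueues a long-running kernel on the same `stream` here.
    cuStreamSynchronize(stream);  // now also waits for thread B's kernel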

@densamoilov added the bug label on Oct 3, 2023
@AerialMantis

That's right: under the current SYCL 2020 specification, the approach described here is the only correct way to guarantee that the operations enqueued within the host task are synchronized with. However, it will likely impact performance. We are working on a SYCL extension to resolve this limitation.

In the meantime, if you wish to achieve better performance, there is another option that works for DPC++, but be aware that it only applies under certain conditions and is not guaranteed to work in other SYCL implementations. If the queue is in-order and the same queue is used for all host task submissions, then the same underlying CUDA stream is always used, so execution is forced in-order and no explicit synchronization is required. However, if the queue may also be out-of-order there is no such shortcut, so you may need to check the queue's properties to know whether explicit synchronization is required (see the sketch below).
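A minimal sketch of that DPC++-specific shortcut, assuming the queue is created in-order up front (the `sycl::property::queue::in_order` property and the `has_property` check are standard SYCL 2020; the cuDNN call is illustrative):

    // In DPC++, all submissions to an in-order queue map to the same underlying
    // CUDA stream, so the host task needs no explicit stream synchronization.
    sycl::queue q{sycl::gpu_selector_v, sycl::property::queue::in_order{}};

    auto e = q.submit([&](sycl::handler &cgh) {
        cgh.host_task([=](const sycl::interop_handle &ih) {
            cudnnAddTensor(handle, ...); // ordered by the single in-order stream
        });
    });

    // Elsewhere, detect whether the shortcut applies:
    bool needs_explicit_sync = !q.has_property<sycl::property::queue::in_order>();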

@densamoilov (Contributor, Author)

@AerialMantis, thanks for the suggestion. Even with in-order queues there is still a synchronization problem when there are multiple of them and we submit operations that depend on each other to those queues (see the sketch below). Though I don't know whether that use case is common.
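For concreteness, a hypothetical case with two in-order queues and a cross-queue dependency; `handle1`/`handle2` are illustrative cuDNN handles bound to each queue's underlying stream:

    sycl::queue q1{dev, sycl::property::queue::in_order{}};
    sycl::queue q2{dev, sycl::property::queue::in_order{}};

    auto e1 = q1.submit([&](sycl::handler &cgh) {
        cgh.host_task([=](const sycl::interop_handle &ih) {
            cudnnAddTensor(handle1, ...); // async on stream1; `e1` completes too early
        });
    });

    q2.submit([&](sycl::handler &cgh) {
        cgh.depends_on(e1); // only orders against the host task, not stream1's work
        cgh.host_task([=](const sycl::interop_handle &ih) {
            cudnnAddTensor(handle2, ...); // may run before stream1's work finishes
        });
    });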

As for the performance implications, do you know how significant the impact would be and where it would come from? Is it because we would lose the ability to amortize the cost of asynchronous API calls, since we would have to wait for the previous call to complete before making the next one?

@vpirogov added the platform:gpu-nvidia and platform:gpu-amd labels on Mar 29, 2024
@AerialMantis

@densamoilov, apologies for the late reply; I hadn't seen your response. That's right, this solution would only work in the case of a single in-order queue, though since it relies on implementation details I would avoid it in general.

The performance impact I described would come from all asynchronous native commands enqueued within the host task function being synchronized with before returning, therefore making those commands effectively blocking.

There is now an extension which addresses this limitation by extending the host task interface: it allows native events to be passed into the host task function, and native events to be propagated out and encapsulated in the SYCL event returned by submit, therefore allowing the host task function to enqueue asynchronous commands.
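Assuming this refers to DPC++'s experimental native-command extension (an assumption on my part; the comment does not name it), usage might look roughly like the sketch below, with the returned SYCL event covering the natively enqueued work:

    auto e = q.submit([&](sycl::handler &cgh) {
        // Experimental DPC++ API (assumed): work enqueued on the native stream
        // inside this function is folded into `e`, so no stream sync is needed.
        cgh.ext_codeplay_enqueue_native_command([=](sycl::interop_handle ih) {
            auto stream = ih.get_native_queue<sycl::backend::ext_oneapi_cuda>();
            cudnnSetStream(handle, stream);
            cudnnAddTensor(handle, ...); // asynchronous; no synchronization here
        });
    });
    e.wait(); // completes when the natively enqueued work completes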
