[nvidia|amd] Add missing synchronization #1732

Open
densamoilov opened this issue Oct 3, 2023 · 3 comments
Labels: bug (A confirmed library bug), help wanted, platform:gpu-amd (Codeowner: @oneapi-src/onednn-gpu-amd), platform:gpu-nvidia (Codeowner: @oneapi-src/onednn-gpu-nvidia)

Comments

@densamoilov (Contributor)

Currently, if we call an asynchronous API within a host task, the event tied to the host task completes when the host task returns, rather than when the operation submitted by the asynchronous API call completes. This is the root cause of issues like #1703.

To fix the issue we need to perform proper synchronization within the host task. For example:

    auto e = q.submit([&](sycl::handler &cgh) {
        cgh.host_task([=](const sycl::interop_handle &ih) {
            cudnnAddTensor(handle, ...);
            // Only the stream associated with the handle has to be synchronized
            cudaStream_t currentStreamId;
            cudnnGetStream(handle, &currentStreamId);
            cuStreamSynchronize(currentStreamId);
        });
    });
    e.wait(); // The event completes when the `cudnnAddTensor` operation completes.

This fix should be implemented for both the cuDNN/cuBLAS based implementations and the MIOpen/rocBLAS based ones (using the HIP counterparts, e.g. hipStreamSynchronize); a sketch of the HIP variant follows.
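For illustration, a direct analog for the HIP backend could look like the sketch below. This is an assumption-laden sketch, not the actual oneDNN code: `miopenOpTensor` stands in for whatever MIOpen call the implementation makes, and `miopenGetStream` returns the `hipStream_t` associated with the handle.

    auto e = q.submit([&](sycl::handler &cgh) {
        cgh.host_task([=](const sycl::interop_handle &ih) {
            miopenOpTensor(handle, ...); // illustrative stand-in for the real MIOpen call
            // Only the stream associated with the MIOpen handle has to be synchronized.
            hipStream_t currentStreamId;
            miopenGetStream(handle, &currentStreamId);
            hipStreamSynchronize(currentStreamId);
        });
    });
    e.wait(); // Completes when the `miopenOpTensor` operation completes.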

Also, we would need to document a set of limitations that follow from this synchronization scheme: between the asynchronous API call and the cuStreamSynchronize/hipStreamSynchronize call, another thread could submit work to the same stream. In that case the submitted host task completes only when that other work completes, and in the worst case this can even lead to a deadlock.
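A hypothetical interleaving that shows the hazard (`stream` is the stream queried from the handle as above; the racing thread is marked in comments):

    // Thread A, inside the host task:
    cudnnAddTensor(handle, ...);  // enqueues work on `stream`
    // <-- Thread B enqueues a long-running kernel on the same `stream` here.
    cuStreamSynchronize(stream);  // now also waits for thread B's kernel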

@densamoilov added the bug label on Oct 3, 2023
@AerialMantis

That's right: under the current SYCL 2020 specification, the approach described here is the only correct way to guarantee that the operations enqueued within the host task are synchronized with. However, it will likely impact performance. We are working on a SYCL extension to resolve this limitation.

In the meantime, if you wish to achieve better performance, there is another option that works for DPC++, but be aware that it only applies under certain conditions and is not guaranteed to work in other SYCL implementations. If the queue is in-order and the same queue is used for all host task submissions, then the same underlying CUDA stream is always used, so execution is forced in-order and no explicit synchronization is required. However, if the queue may also be out-of-order there is no such shortcut, so you may need to check the queue's properties to know whether explicit synchronization is required (see the sketch below).
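A minimal sketch of that DPC++-specific shortcut, assuming the queue is created in-order up front (the `sycl::property::queue::in_order` property and the `has_property` check are standard SYCL 2020; the cuDNN call is illustrative):

    // In DPC++, all submissions to an in-order queue map to the same underlying
    // CUDA stream, so the host task needs no explicit stream synchronization.
    sycl::queue q{sycl::gpu_selector_v, sycl::property::queue::in_order{}};

    auto e = q.submit([&](sycl::handler &cgh) {
        cgh.host_task([=](const sycl::interop_handle &ih) {
            cudnnAddTensor(handle, ...); // ordered by the single in-order stream
        });
    });

    // Elsewhere, detect whether the shortcut applies:
    bool needs_explicit_sync = !q.has_property<sycl::property::queue::in_order>();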

@densamoilov (Contributor, Author)

@AerialMantis, thanks for the suggestion. Even with in-order queues there is still a synchronization problem when there are multiple of them and we submit operations that depend on each other to those queues (see the sketch below). Though I don't know whether that use case is common.
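For concreteness, a hypothetical case with two in-order queues and a cross-queue dependency; `handle1`/`handle2` are illustrative cuDNN handles bound to each queue's underlying stream:

    sycl::queue q1{dev, sycl::property::queue::in_order{}};
    sycl::queue q2{dev, sycl::property::queue::in_order{}};

    auto e1 = q1.submit([&](sycl::handler &cgh) {
        cgh.host_task([=](const sycl::interop_handle &ih) {
            cudnnAddTensor(handle1, ...); // async on stream1; `e1` completes too early
        });
    });

    q2.submit([&](sycl::handler &cgh) {
        cgh.depends_on(e1); // only orders against the host task, not stream1's work
        cgh.host_task([=](const sycl::interop_handle &ih) {
            cudnnAddTensor(handle2, ...); // may run before stream1's work finishes
        });
    });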

As for the performance implications, do you know how significant the impact would be and where it would come from? Is it because we would lose the ability to amortize the cost of asynchronous API calls, since we would have to wait for the previous call to complete before making the next one?

@vpirogov added the platform:gpu-nvidia and platform:gpu-amd labels on Mar 29, 2024
@AerialMantis

@densamoilov, apologies for the late reply; I hadn't seen your response. That's right, this solution would only work in the case of a single in-order queue, though since it relies on implementation details I would avoid it in general.

The performance impact I described would come from all asynchronous native commands enqueued within the host task function being synchronized with before returning, therefore making those commands effectively blocking.

There is now an extension which addresses this limitation by extending the host task interface: it allows native events to be passed into the host task function, and native events to be propagated out and encapsulated in the SYCL event returned by submit, therefore allowing the host task function to enqueue asynchronous commands.
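Assuming this refers to DPC++'s experimental native-command extension (an assumption on my part; the comment does not name it), usage might look roughly like the sketch below, with the returned SYCL event covering the natively enqueued work:

    auto e = q.submit([&](sycl::handler &cgh) {
        // Experimental DPC++ API (assumed): work enqueued on the native stream
        // inside this function is folded into `e`, so no stream sync is needed.
        cgh.ext_codeplay_enqueue_native_command([=](sycl::interop_handle ih) {
            auto stream = ih.get_native_queue<sycl::backend::ext_oneapi_cuda>();
            cudnnSetStream(handle, stream);
            cudnnAddTensor(handle, ...); // asynchronous; no synchronization here
        });
    });
    e.wait(); // completes when the natively enqueued work completes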
