
Vulkan: Use timeline semaphores for synchronization if possible #17761

Merged

Conversation

msat-huawei
Contributor

What does this PR do?

With this PR the Vulkan RHI backend uses timeline semaphores instead of binary semaphores to synchronize execution across queues. Binary semaphores are still used for swapchains, and the backend falls back to binary semaphores entirely if a device does not support timeline semaphores.

What does the PR actually do?

  1. Use timeline semaphores where possible (everywhere but swapchains).
  2. Add a new synchronization construct. Binary semaphores have the limitation that the signal operation must be submitted to the queue before the wait operation is. Timeline semaphores don't have this limitation, but when mixing timeline semaphores with binary semaphores we still need to order the queue operations: when enqueuing a binary semaphore wait operation, all semaphore wait operations it depends on (including those on timeline semaphores) must already have been submitted. See here and here. For binary semaphores this was previously accomplished by a Signal/Wait on the CPU that ensured the right submit order. We no longer call these functions. Instead we introduce a SemaphoreTracker class that counts how many semaphores precede a swapchain in the framegraph and how many of them have already been submitted to the queue. We don't track which specific semaphores have been signaled, just how many of the semaphores a swapchain depends on have been signaled.
  3. A way to tell the framegraph that an external (Fence) semaphore has been signalled. This is accomplished by the FenceTracker class. The user of a Fence is responsible for calling FenceTracker::SignalUserSemaphore when calling Fence::SignalOnCpu.
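The counting idea behind point 2 can be sketched roughly as follows. This is an illustrative standalone class with hypothetical names, not the actual O3DE SemaphoreTracker: the swapchain's binary-semaphore wait blocks on the CPU until the expected number of dependent signal operations has been enqueued.

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Minimal sketch of the SemaphoreTracker counting approach
// (hypothetical names): we only count how many dependent signal
// operations have been submitted, not which specific ones.
class SemaphoreTrackerSketch
{
public:
    explicit SemaphoreTrackerSketch(uint32_t expectedSignals)
        : m_expected(expectedSignals)
    {
    }

    // Called whenever a dependent semaphore's signal operation
    // has been submitted to a queue.
    void OnSignalSubmitted()
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        ++m_submitted;
        m_cv.notify_all();
    }

    // Called before submitting the binary-semaphore wait for the
    // swapchain: blocks until all dependencies are enqueued.
    void WaitForAllSignalsSubmitted()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cv.wait(lock, [this] { return m_submitted >= m_expected; });
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_cv;
    uint32_t m_expected = 0;
    uint32_t m_submitted = 0;
};
```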

Why do we need this?

In a previous PR (#17569) we introduced timeline semaphores for user-defined Fences. We found that signalling these from the CPU led to a deadlock somewhere in the graphics driver/kernel, which is explained by the Vulkan specification linked above.
Signalling Fences from the CPU will be used in multi-GPU contexts; see this RFC.

In addition to fixing the problem above, this may also yield a slight performance gain when semaphores are used for syncing between queues: the queue submits no longer need to wait for each other, except for the swapchain.

How was this PR tested?

Tested with multiple levels and Atom Sample Viewer samples (async compute, multi pipeline, multi scene, mesh) on Windows with Vulkan.
I cannot test this on a machine without timeline semaphore support because I don't have access to such hardware; however, I did manually deactivate m_signalFenceFromCPU to exercise that codepath.
This was also tested with the MultiGPU RPI sample in Atom Sample Viewer on this branch: https://github.com/jhmueller-huawei/o3de/tree/multi-device-copy-pass-test

@msat-huawei msat-huawei requested review from a team as code owners April 17, 2024 12:19
Contributor

@jhmueller-huawei jhmueller-huawei left a comment


Looks good, great job! Just minor comments from my side, but I'd like to get this reviewed by @moudgils as well.

Gems/Atom/RHI/Vulkan/Code/Source/RHI/CommandQueue.cpp (outdated review thread, resolved)
@moudgils
Contributor

Recommend waiting for @akioCL approval before merging. I will also try to find time to go through it this week.

Contributor

@akioCL akioCL left a comment


I'm going through the review, but I feel like this change increased the complexity of semaphores a lot. This is starting to look like spaghetti code. Maybe this is a signal that a deeper refactor is needed in order to accommodate binary and timeline semaphores. For example, almost all functions of the semaphore class behave differently for binary and timeline semaphores. Maybe we should separate them into two classes with a common interface? I'm having a hard time following the semaphore logic in the Vulkan RHI. What do you guys think?

@msat-huawei
Contributor Author

> I'm going through the review, but I feel like this change increased the complexity of semaphores a lot. This is starting to look like spaghetti code. Maybe this is a signal that a deeper refactor is needed in order to accommodate binary and timeline semaphores. For example, almost all functions of the semaphore class behave differently for binary and timeline semaphores. Maybe we should separate them into two classes with a common interface? I'm having a hard time following the semaphore logic in the Vulkan RHI. What do you guys think?

I refactored the code a bit so I could remove the FenceTracker class from the public RHI API. The SemaphoreTracker classes are now signalled and waited on through the Fence/Semaphore classes. This removes a lot of the new code I added.

Do you mean the implementation in Semaphore/Fence when you say spaghetti code? Should I split them into a timeline semaphore class and a binary semaphore/VkFence-specific class?

@msat-huawei
Contributor Author

I'm getting some validation errors like:

vkDebugMessage: [ERROR][Validation] Validation Error: [ UNASSIGNED-CoreValidation-DrawState-InvalidImageLayout ] Object 0: handle = 0x20582dd94b0, type = VK_OBJECT_TYPE_COMMAND_BUFFER; | MessageID = 0x4dae5635 | vkQueueSubmit(): pSubmits[0].pCommandBuffers[0] command buffer VkCommandBuffer 0x20582dd94b0[] expects VkImage 0x9484f20000012180[LuminanceMapAttachmentId] (subresource: aspectMask 0x1 array layer 0, mip level 0) to be in layout VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL--instead, current layout is VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL.
vkDebugMessage: [ERROR][Validation] Validation Error: [ UNASSIGNED-VkImageMemoryBarrier-image-00004 ] Object 0: handle = 0x20582dd94b0, type = VK_OBJECT_TYPE_COMMAND_BUFFER; | MessageID = 0x4acfa767 | vkQueueSubmit(): in submitted command buffer VkImageMemoryBarrier acquiring ownership of VkImage (VkImage 0x35807000000d02e[]), from srcQueueFamilyIndex 0 to dstQueueFamilyIndex 2 has no matching release barrier queued for execution.
[ERROR][Validation] Validation Error: [ UNASSIGNED-VkImageMemoryBarrier-image-00004 ] Object 0: handle = 0x2060a142060, type = VK_OBJECT_TYPE_COMMAND_BUFFER; | MessageID = 0x4acfa767 | vkQueueSubmit(): in submitted command buffer VkImageMemoryBarrier acquiring ownership of VkImage (VkImage 0x35807000000d02e[]), from srcQueueFamilyIndex 2 to dstQueueFamilyIndex 0 has no matching release barrier queued for execution.

These are likely false positives, as discussed here and here.

After the refactoring, the PR no longer works for the MultiGPU sample in this branch. I will mark the PR as draft until I've fixed that.

@msat-huawei msat-huawei marked this pull request as draft April 19, 2024 12:17
@msat-huawei msat-huawei marked this pull request as ready for review April 22, 2024 07:11
@akioCL
Contributor

akioCL commented Apr 23, 2024

> I'm going through the review, but I feel like this change increased the complexity of semaphores a lot. This is starting to look like spaghetti code. Maybe this is a signal that a deeper refactor is needed in order to accommodate binary and timeline semaphores. For example, almost all functions of the semaphore class behave differently for binary and timeline semaphores. Maybe we should separate them into two classes with a common interface? I'm having a hard time following the semaphore logic in the Vulkan RHI. What do you guys think?

> I refactored the code a bit so I could remove the FenceTracker class from the public RHI API. The SemaphoreTracker classes are now signalled and waited on through the Fence/Semaphore classes. This removes a lot of the new code I added.

> Do you mean the implementation in Semaphore/Fence when you say spaghetti code? Should I split them into a timeline semaphore class and a binary semaphore/VkFence-specific class?

I do think we should split the fence and semaphore classes: a normal and a timeline-semaphore fence, and a binary and a timeline semaphore class. It would make the code easier to read and understand.
Also, after re-reading the standard, I think we have a problem with binary semaphores in general, independent of whether we are using timeline semaphores or not. Currently, a binary semaphore waits on the CPU until the signal operation for the SAME semaphore has been submitted. But, according to the standard, it has to wait for all dependencies (not only its own signal operation), so we are missing dependencies. We need to do this for ALL binary semaphores, whether we are using timeline semaphores or not.

As for a solution, I'm not a fan of SemaphoreTracker, SemaphoreTrackerCollection and SemaphoreTrackerHandle. I think we can reuse what we already have. Currently, the binary semaphore has a condition_variable that is signaled once and waited on once. We should refactor the Vulkan::SignalEvent class to accept multiple signal calls. We would share a SignalEvent between the dependent semaphores (similar to the shared SemaphoreTrackerHandle). So all binary semaphores would need to wait on the CPU until ALL previous dependent semaphores have signaled the SignalEvent. We cannot use the "count" approach, because now the binary semaphore is not necessarily the last one, so we need to know who is signaling (maybe a bitset in the SignalEvent class). We can get the dependent-semaphore information from the framegraph. You are currently doing it in the FrameExecuter, but I think it should go at the end of FrameGraphCompiler::CompileInternal, after all barriers and semaphores have been added to the scopes.

Since timeline semaphores don't care about out-of-order signal/wait operations, they would only signal the SignalEvent; they wouldn't spend any time waiting on the CPU. Only binary semaphores would wait until all dependent semaphores (including themselves) have submitted their signal operation to a queue.

What do you think?
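The proposed SignalEvent extension could be sketched roughly like this. This is an illustrative standalone class, not the actual Vulkan::SignalEvent: each dependent semaphore owns one bit, timeline semaphores only ever signal, and binary semaphores additionally wait until every bit in their dependency mask is set.

```cpp
#include <bitset>
#include <condition_variable>
#include <cstddef>
#include <mutex>

// Sketch of a SignalEvent that accepts multiple signal calls
// (illustrative names). A bitset records who has signaled, so a
// binary semaphore in the middle of the dependency chain can wait
// for exactly the operations it depends on.
class MultiSignalEventSketch
{
public:
    static constexpr size_t MaxSemaphores = 64;
    using Mask = std::bitset<MaxSemaphores>;

    // Called when the semaphore owning `bit` has submitted its
    // signal operation to a queue.
    void Signal(size_t bit)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_signaled.set(bit);
        m_cv.notify_all();
    }

    // Called by a binary semaphore before its wait is submitted:
    // blocks until every dependency in `mask` has signaled.
    void Wait(const Mask& mask)
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cv.wait(lock, [&] { return (m_signaled & mask) == mask; });
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_cv;
    Mask m_signaled;
};
```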

@msat-huawei
Contributor Author

I don't think we have a problem when only binary semaphores are present.
When we submit a wait operation for a semaphore, we wait for its signal operation to have been enqueued.
Before the signal operation is submitted, we wait for the signal operations of the directly dependent semaphores to be enqueued, and those dependent semaphores in turn wait on their own dependencies, and so on.
Or maybe I'm missing something there.

I will split the Semaphore/Fence classes into normal/timeline semaphore classes.

Your approach also makes sense, as does moving it to the FrameGraphCompiler. I will implement it.

@byrcolin byrcolin added the sig/graphics-audio Categorizes an issue or PR as relevant to SIG graphics-audio. label Apr 23, 2024
Signed-off-by: Martin Sattlecker <martin.sattlecker@huawei.com>
msat-huawei and others added 3 commits April 25, 2024 15:18
Co-authored-by: Martin Winter <102576959+martinwinter-huawei@users.noreply.github.com>
Signed-off-by: Martin Sattlecker <120572403+msat-huawei@users.noreply.github.com>
@msat-huawei
Contributor Author

@akioCL I refactored the synchronization code according to your suggestion, and split the Fence/Semaphore into multiple classes. Can you have a look at it again?

Contributor

@akioCL akioCL left a comment


I think it looks much better. I'm still going through the review but I added a comment about the Factory::CreateFence function. Will publish the rest of my comments on Monday.

@@ -168,7 +168,7 @@ namespace AZ::RHI

virtual Ptr<Device> CreateDevice() = 0;

virtual Ptr<Fence> CreateFence() = 0;
virtual Ptr<Fence> CreateFence(const RHI::Device& device) = 0;
Contributor


We don't need to change the public interface of the Factory to have two different fence implementations. What I would do is have a Vulkan::Fence class, and inside that class a member that is the fence implementation (m_fenceImpl) using a FenceBase interface. Then, during initialization, we create the proper fence implementation depending on the device properties, and funnel the calls from Vulkan::Fence to the FenceBase class.
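The suggested composition could look roughly like this. This is a hedged sketch with hypothetical names; the boolean parameter stands in for the device feature query, and the real O3DE classes differ.

```cpp
#include <memory>

// Sketch of the suggested structure (illustrative names): the
// public fence class is unchanged and forwards every call to an
// implementation chosen at initialization time.
struct FenceBaseSketch
{
    virtual ~FenceBaseSketch() = default;
    virtual void SignalOnCpu() = 0;
    virtual bool IsTimeline() const = 0;
};

struct BinaryFenceSketch : FenceBaseSketch
{
    void SignalOnCpu() override { /* VkFence-based path */ }
    bool IsTimeline() const override { return false; }
};

struct TimelineFenceSketch : FenceBaseSketch
{
    void SignalOnCpu() override { /* timeline-semaphore path */ }
    bool IsTimeline() const override { return true; }
};

class FenceSketch
{
public:
    // `timelineSemaphoresSupported` stands in for the device
    // properties checked during initialization.
    void Init(bool timelineSemaphoresSupported)
    {
        if (timelineSemaphoresSupported)
        {
            m_fenceImpl = std::make_unique<TimelineFenceSketch>();
        }
        else
        {
            m_fenceImpl = std::make_unique<BinaryFenceSketch>();
        }
    }

    // Public calls funnel to the chosen implementation.
    void SignalOnCpu() { m_fenceImpl->SignalOnCpu(); }
    bool IsTimeline() const { return m_fenceImpl->IsTimeline(); }

private:
    std::unique_ptr<FenceBaseSketch> m_fenceImpl;
};
```

This keeps `Factory::CreateFence` untouched because the choice between the two implementations happens inside `Init`, not at the factory boundary.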

Contributor

@akioCL akioCL left a comment


Some minor comments, but I think it should be good to go after addressing them.

@martinwinter-huawei
Copy link
Contributor

@akioCL we did another refactor, what do you think about this?

Contributor

@akioCL akioCL left a comment


Looks good! Thanks for making the changes

Signed-off-by: Joerg H. Mueller <joerg.mueller@huawei.com>
Co-authored-by: Joerg H. Mueller <joerg.mueller@huawei.com>
Co-authored-by: Martin Winter <martin.winter@huawei.com>
@jhmueller-huawei jhmueller-huawei merged commit d9b7dbe into o3de:development May 6, 2024
3 checks passed
martinwinter-huawei added a commit to martinwinter-huawei/o3de that referenced this pull request May 7, 2024
…#17761)

Signed-off-by: Martin Sattlecker <martin.sattlecker@huawei.com>
Co-authored-by: Martin Sattlecker <martin.sattlecker@huawei.com>
Co-authored-by: Joerg H. Mueller <joerg.mueller@huawei.com>
Co-authored-by: Martin Winter <martin.winter@huawei.com>
martinwinter-huawei added further commits to martinwinter-huawei/o3de that referenced this pull request May 8, 2024
galibzon added a commit to galibzon/o3de that referenced this pull request May 8, 2024
jhmueller-huawei added a commit that referenced this pull request May 10, 2024