
Vulkan: Use timeline semaphores for synchronization if possible #17761

Merged

Conversation

msat-huawei
Contributor

What does this PR do?

With this PR the Vulkan RHI backend uses timeline semaphores instead of binary semaphores to synchronize execution across queues. Binary semaphores are still used for swapchains, and the backend falls back to binary semaphores entirely if a device does not support timeline semaphores.

What does the PR actually do?

  1. Use timeline semaphores where possible (everywhere but swapchains).
  2. Add a new synchronization construct. Binary semaphores have the limitation that the signal operation must be submitted to the queue before the wait operation is. Timeline semaphores don't have this limitation, but when mixing timeline semaphores with binary semaphores we still need to order the queue operations: when enqueuing a binary semaphore wait operation, all semaphore wait operations it depends on (including those on timeline semaphores) must already have been submitted. See here and here. For binary semaphores this was previously accomplished by a Signal/Wait on the CPU that ensured the right submit order. We no longer call these functions. Instead we introduce a SemaphoreTracker class that counts how many semaphores precede a swapchain in the framegraph and how many of them have already been submitted to the queue. We don't track which specific semaphores have been signaled, just how many of the semaphores a swapchain depends on have been signaled.
  3. A way to tell the framegraph that an external (Fence) semaphore has been signalled. This is accomplished by the FenceTracker class. The user of a Fence is responsible for calling FenceTracker::SignalUserSemaphore when calling Fence::SignalOnCpu.
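The counting idea behind point 2 can be sketched roughly as follows. This is an illustrative standalone class with hypothetical names, not the actual O3DE SemaphoreTracker: the swapchain's binary-semaphore wait blocks on the CPU until the expected number of dependent signal operations has been enqueued.

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Minimal sketch of the SemaphoreTracker counting approach
// (hypothetical names): we only count how many dependent signal
// operations have been submitted, not which specific ones.
class SemaphoreTrackerSketch
{
public:
    explicit SemaphoreTrackerSketch(uint32_t expectedSignals)
        : m_expected(expectedSignals)
    {
    }

    // Called whenever a dependent semaphore's signal operation
    // has been submitted to a queue.
    void OnSignalSubmitted()
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        ++m_submitted;
        m_cv.notify_all();
    }

    // Called before submitting the binary-semaphore wait for the
    // swapchain: blocks until all dependencies are enqueued.
    void WaitForAllSignalsSubmitted()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cv.wait(lock, [this] { return m_submitted >= m_expected; });
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_cv;
    uint32_t m_expected = 0;
    uint32_t m_submitted = 0;
};
```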

Why do we need this?

In a previous PR (#17569) we introduced timeline semaphores for user-defined Fences. We found that signalling these from the CPU led to a deadlock somewhere in the graphics driver/kernel, which is explained by the Vulkan specification linked above.
Signalling Fences from the CPU will be used in multi-GPU contexts; see this RFC.

In addition to fixing the problem above, this may also yield a slight performance gain when semaphores are used for syncing between queues: the queue submits no longer need to wait for each other, except for the swapchain.

How was this PR tested?

Tested with multiple levels and Atom Sample Viewer samples (async compute, multi pipeline, multi scene, mesh) on Windows with Vulkan.
I cannot test this on a machine without timeline semaphore support because I don't have access to such hardware; however, I did manually deactivate m_signalFenceFromCPU to exercise that codepath.
This was also tested with the MultiGPU RPI sample in Atom Sample Viewer on this branch: https://github.com/jhmueller-huawei/o3de/tree/multi-device-copy-pass-test

@msat-huawei msat-huawei requested review from a team as code owners April 17, 2024 12:19
Contributor

@jhmueller-huawei jhmueller-huawei left a comment


Looks good, great job! Just minor comments from my side, but I'd like to get this reviewed by @moudgils as well.

Gems/Atom/RHI/Vulkan/Code/Source/RHI/CommandQueue.cpp (outdated review thread, resolved)
@moudgils
Contributor

Recommend waiting for @akioCL approval before merging. I will also try to find time to go through it this week.

Contributor

@akioCL akioCL left a comment


I'm going through the review, but I feel like this change increased the complexity of semaphores a lot. This is starting to look like spaghetti code. Maybe this is a signal that a deeper refactor is needed in order to accommodate binary and timeline semaphores. For example, almost all functions of the semaphore class behave differently for binary and timeline semaphores. Maybe we should separate them into two classes with a common interface? I'm having a hard time following the semaphore logic in the Vulkan RHI. What do you guys think?

@msat-huawei
Contributor Author

> I'm going through the review, but I feel like this change increased the complexity of semaphores a lot. This is starting to look like spaghetti code. Maybe this is a signal that a deeper refactor is needed in order to accommodate binary and timeline semaphores. For example, almost all functions of the semaphore class behave differently for binary and timeline semaphores. Maybe we should separate them into two classes with a common interface? I'm having a hard time following the semaphore logic in the Vulkan RHI. What do you guys think?

I refactored the code a bit so I could remove the FenceTracker class from the public RHI API. The SemaphoreTracker classes are now signalled and waited on through the Fence/Semaphore classes. This removes a lot of the new code I added.

Do you mean the implementation in Semaphore/Fence when you say spaghetti code? Should I split them into a timeline semaphore class and a binary semaphore/VkFence-specific class?

@msat-huawei
Contributor Author

I'm getting some validation errors like:

vkDebugMessage: [ERROR][Validation] Validation Error: [ UNASSIGNED-CoreValidation-DrawState-InvalidImageLayout ] Object 0: handle = 0x20582dd94b0, type = VK_OBJECT_TYPE_COMMAND_BUFFER; | MessageID = 0x4dae5635 | vkQueueSubmit(): pSubmits[0].pCommandBuffers[0] command buffer VkCommandBuffer 0x20582dd94b0[] expects VkImage 0x9484f20000012180[LuminanceMapAttachmentId] (subresource: aspectMask 0x1 array layer 0, mip level 0) to be in layout VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL--instead, current layout is VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL.
vkDebugMessage: [ERROR][Validation] Validation Error: [ UNASSIGNED-VkImageMemoryBarrier-image-00004 ] Object 0: handle = 0x20582dd94b0, type = VK_OBJECT_TYPE_COMMAND_BUFFER; | MessageID = 0x4acfa767 | vkQueueSubmit(): in submitted command buffer VkImageMemoryBarrier acquiring ownership of VkImage (VkImage 0x35807000000d02e[]), from srcQueueFamilyIndex 0 to dstQueueFamilyIndex 2 has no matching release barrier queued for execution.
[ERROR][Validation] Validation Error: [ UNASSIGNED-VkImageMemoryBarrier-image-00004 ] Object 0: handle = 0x2060a142060, type = VK_OBJECT_TYPE_COMMAND_BUFFER; | MessageID = 0x4acfa767 | vkQueueSubmit(): in submitted command buffer VkImageMemoryBarrier acquiring ownership of VkImage (VkImage 0x35807000000d02e[]), from srcQueueFamilyIndex 2 to dstQueueFamilyIndex 0 has no matching release barrier queued for execution.

These are likely false positives, as discussed here and here.

After the refactoring, the PR no longer works for the MultiGPU sample in this branch. I will mark the PR as draft until I've fixed that.

@msat-huawei msat-huawei marked this pull request as draft April 19, 2024 12:17
@msat-huawei msat-huawei marked this pull request as ready for review April 22, 2024 07:11
@akioCL
Contributor

akioCL commented Apr 23, 2024

> I'm going through the review, but I feel like this change increased the complexity of semaphores a lot. This is starting to look like spaghetti code. Maybe this is a signal that a deeper refactor is needed in order to accommodate binary and timeline semaphores. For example, almost all functions of the semaphore class behave differently for binary and timeline semaphores. Maybe we should separate them into two classes with a common interface? I'm having a hard time following the semaphore logic in the Vulkan RHI. What do you guys think?

> I refactored the code a bit so I could remove the FenceTracker class from the public RHI API. The SemaphoreTracker classes are now signalled and waited on through the Fence/Semaphore classes. This removes a lot of the new code I added.

> Do you mean the implementation in Semaphore/Fence when you say spaghetti code? Should I split them into a timeline semaphore class and a binary semaphore/VkFence-specific class?

I do think we should split the fence and semaphore classes: a normal and a timeline-semaphore fence, and a binary and a timeline semaphore class. It would make the code easier to read and understand.
Also, after re-reading the standard, I think we have a problem with binary semaphores in general, independent of whether we are using timeline semaphores or not. Currently, a binary semaphore waits on the CPU until the signal operation for the SAME semaphore has been submitted. But, according to the standard, it has to wait for all dependencies (not only its own signal operation), so we are missing dependencies. We need to do this for ALL binary semaphores, whether we are using timeline semaphores or not.

As for a solution, I'm not a fan of SemaphoreTracker, SemaphoreTrackerCollection and SemaphoreTrackerHandle. I think we can reuse what we already have. Currently, the binary semaphore has a condition_variable that is signaled once and waited on once. We should refactor the Vulkan::SignalEvent class to accept multiple signal calls. We would share a SignalEvent between the dependent semaphores (similar to the shared SemaphoreTrackerHandle). So all binary semaphores would need to wait on the CPU until ALL previous dependent semaphores have signaled the SignalEvent. We cannot use the "count" approach, because now the binary semaphore is not necessarily the last one, so we need to know who is signaling (maybe a bitset in the SignalEvent class). We can get the dependent-semaphore information from the framegraph. You are currently doing it in the FrameExecuter, but I think it should go at the end of FrameGraphCompiler::CompileInternal, after all barriers and semaphores have been added to the scopes.

Since timeline semaphores don't care about out-of-order signal/wait operations, they would only signal the SignalEvent; they wouldn't spend any time waiting on the CPU. Only binary semaphores would wait until all dependent semaphores (including themselves) have submitted their signal operation to a queue.

What do you think?
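The proposed SignalEvent extension could be sketched roughly like this. This is an illustrative standalone class, not the actual Vulkan::SignalEvent: each dependent semaphore owns one bit, timeline semaphores only ever signal, and binary semaphores additionally wait until every bit in their dependency mask is set.

```cpp
#include <bitset>
#include <condition_variable>
#include <cstddef>
#include <mutex>

// Sketch of a SignalEvent that accepts multiple signal calls
// (illustrative names). A bitset records who has signaled, so a
// binary semaphore in the middle of the dependency chain can wait
// for exactly the operations it depends on.
class MultiSignalEventSketch
{
public:
    static constexpr size_t MaxSemaphores = 64;
    using Mask = std::bitset<MaxSemaphores>;

    // Called when the semaphore owning `bit` has submitted its
    // signal operation to a queue.
    void Signal(size_t bit)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_signaled.set(bit);
        m_cv.notify_all();
    }

    // Called by a binary semaphore before its wait is submitted:
    // blocks until every dependency in `mask` has signaled.
    void Wait(const Mask& mask)
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cv.wait(lock, [&] { return (m_signaled & mask) == mask; });
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_cv;
    Mask m_signaled;
};
```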

@msat-huawei
Contributor Author

I don't think we have a problem when only binary semaphores are present.
When we submit a wait operation for a semaphore, we wait for its signal operation to have been enqueued.
Before the signal operation is submitted, we wait for the signal operations of the directly dependent semaphores to be enqueued, and those dependent semaphores in turn wait on their own dependencies, and so on.
Or maybe I'm missing something there.

I will split the Semaphore/Fence classes into normal/timeline semaphore classes.

Your approach also makes sense, as does moving it to the FrameGraphCompiler. I will implement it.

@byrcolin byrcolin added the sig/graphics-audio Categorizes an issue or PR as relevant to SIG graphics-audio. label Apr 23, 2024
Signed-off-by: Martin Sattlecker <martin.sattlecker@huawei.com>
msat-huawei and others added 3 commits April 25, 2024 15:18
Co-authored-by: Martin Winter <102576959+martinwinter-huawei@users.noreply.github.com>
Signed-off-by: Martin Sattlecker <120572403+msat-huawei@users.noreply.github.com>
@msat-huawei
Contributor Author

@akioCL I refactored the synchronization code according to your suggestion, and split the Fence/Semaphore into multiple classes. Can you have a look at it again?

Contributor

@akioCL akioCL left a comment


I think it looks much better. I'm still going through the review but I added a comment about the Factory::CreateFence function. Will publish the rest of my comments on Monday.

@@ -168,7 +168,7 @@ namespace AZ::RHI

virtual Ptr<Device> CreateDevice() = 0;

virtual Ptr<Fence> CreateFence() = 0;
virtual Ptr<Fence> CreateFence(const RHI::Device& device) = 0;
Contributor


We don't need to change the public interface of the Factory to have two different fence implementations. What I would do is have a Vulkan::Fence class, and inside that class a member that is the fence implementation (m_fenceImpl) using a FenceBase interface. Then, during initialization, we create the proper fence implementation depending on the device properties, and funnel the calls from Vulkan::Fence to the FenceBase class.
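The suggested composition could look roughly like this. This is a hedged sketch with hypothetical names; the boolean parameter stands in for the device feature query, and the real O3DE classes differ.

```cpp
#include <memory>

// Sketch of the suggested structure (illustrative names): the
// public fence class is unchanged and forwards every call to an
// implementation chosen at initialization time.
struct FenceBaseSketch
{
    virtual ~FenceBaseSketch() = default;
    virtual void SignalOnCpu() = 0;
    virtual bool IsTimeline() const = 0;
};

struct BinaryFenceSketch : FenceBaseSketch
{
    void SignalOnCpu() override { /* VkFence-based path */ }
    bool IsTimeline() const override { return false; }
};

struct TimelineFenceSketch : FenceBaseSketch
{
    void SignalOnCpu() override { /* timeline-semaphore path */ }
    bool IsTimeline() const override { return true; }
};

class FenceSketch
{
public:
    // `timelineSemaphoresSupported` stands in for the device
    // properties checked during initialization.
    void Init(bool timelineSemaphoresSupported)
    {
        if (timelineSemaphoresSupported)
        {
            m_fenceImpl = std::make_unique<TimelineFenceSketch>();
        }
        else
        {
            m_fenceImpl = std::make_unique<BinaryFenceSketch>();
        }
    }

    // Public calls funnel to the chosen implementation.
    void SignalOnCpu() { m_fenceImpl->SignalOnCpu(); }
    bool IsTimeline() const { return m_fenceImpl->IsTimeline(); }

private:
    std::unique_ptr<FenceBaseSketch> m_fenceImpl;
};
```

This keeps `Factory::CreateFence` untouched because the choice between the two implementations happens inside `Init`, not at the factory boundary.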

Contributor

@akioCL akioCL left a comment


Some minor comments, but I think it should be good to go after addressing them.

@martinwinter-huawei
Copy link
Contributor

@akioCL we did another refactor, what do you think about this?

Contributor

@akioCL akioCL left a comment


Looks good! Thanks for making the changes

Signed-off-by: Joerg H. Mueller <joerg.mueller@huawei.com>
Co-authored-by: Joerg H. Mueller <joerg.mueller@huawei.com>
Co-authored-by: Martin Winter <martin.winter@huawei.com>
@jhmueller-huawei jhmueller-huawei merged commit d9b7dbe into o3de:development May 6, 2024
3 checks passed
martinwinter-huawei added a commit to martinwinter-huawei/o3de that referenced this pull request May 7, 2024
…#17761)

Signed-off-by: Martin Sattlecker <martin.sattlecker@huawei.com>
Co-authored-by: Martin Sattlecker <martin.sattlecker@huawei.com>
Co-authored-by: Joerg H. Mueller <joerg.mueller@huawei.com>
Co-authored-by: Martin Winter <martin.winter@huawei.com>
martinwinter-huawei added further commits to martinwinter-huawei/o3de that referenced this pull request May 8, 2024
galibzon added a commit to galibzon/o3de that referenced this pull request May 8, 2024
jhmueller-huawei added a commit that referenced this pull request May 10, 2024