Anvil Examples have Synchronization Issues #40

Closed
marti4d opened this issue Sep 20, 2017 · 4 comments

marti4d commented Sep 20, 2017

Some of Anvil's demos have synchronization issues. I personally used the Push Constants one, but from what I can tell they all have similar problems.

  1. The "Push Constants" example uses a uniform buffer, but there is only one, shared by all the command buffers. Since the GPU may still be processing the previous frame when the CPU updates that UB, this creates a race condition between the GPU and the CPU. There likely needs to be a UB for every command buffer (a sketch of one possible fix follows this list).

  2. No CPU fence - the part of the example that submits the command buffers to the graphics and present queues doesn't wait on any fences. This means the CPU never synchronizes with the other engines and could keep cranking out frame after frame without stopping, until it runs out of memory, if the GPU and/or presentation engine can't keep up.

    The reason it works right now is that I'm fairly certain vkAcquireNextImageKHR currently does block until the next image is available, although the Vulkan spec explicitly says not to count on this behavior.
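
For reference, here is a minimal sketch of one common way to address both points using the raw Vulkan API rather than Anvil's wrappers. Every name here (`PerFrame`, `draw_frame`, `N_FRAMES_IN_FLIGHT`, and so on) is made up for illustration and is not the example's actual code; the idea is simply one fence and one uniform buffer per frame in flight.

```cpp
#include <cstdint>
#include <vulkan/vulkan.h>

// Hypothetical per-frame bundle: the example would duplicate these per frame
// in flight instead of sharing a single uniform buffer across command buffers.
constexpr uint32_t N_FRAMES_IN_FLIGHT = 3; // e.g. one per swapchain image

struct PerFrame
{
    VkCommandBuffer cmd_buffer;     // re-recorded each frame for the acquired image
    VkBuffer        uniform_buffer; // exclusive region for this frame's data
    VkFence         in_flight;      // created with VK_FENCE_CREATE_SIGNALED_BIT
    VkSemaphore     image_acquired;
    VkSemaphore     render_done;
};

// Assumes 'queue' supports both graphics and present for brevity.
void draw_frame(VkDevice device, VkQueue queue, VkSwapchainKHR swapchain,
                PerFrame (&frames)[N_FRAMES_IN_FLIGHT], uint32_t& frame_idx)
{
    PerFrame& f = frames[frame_idx];

    // CPU throttle: block until the submission that last used this frame's
    // command buffer and uniform buffer has finished on the GPU.
    vkWaitForFences(device, 1, &f.in_flight, VK_TRUE, UINT64_MAX);
    vkResetFences  (device, 1, &f.in_flight);

    uint32_t image_index = 0;
    vkAcquireNextImageKHR(device, swapchain, UINT64_MAX,
                          f.image_acquired, VK_NULL_HANDLE, &image_index);

    // Safe to update f.uniform_buffer here: nothing still in flight reads it.

    const VkPipelineStageFlags wait_stage = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;

    VkSubmitInfo submit         = {};
    submit.sType                = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    submit.waitSemaphoreCount   = 1;
    submit.pWaitSemaphores      = &f.image_acquired;
    submit.pWaitDstStageMask    = &wait_stage;
    submit.commandBufferCount   = 1;
    submit.pCommandBuffers      = &f.cmd_buffer;
    submit.signalSemaphoreCount = 1;
    submit.pSignalSemaphores    = &f.render_done;

    // The fence is signalled when this submission completes on the GPU.
    vkQueueSubmit(queue, 1, &submit, f.in_flight);

    VkPresentInfoKHR present   = {};
    present.sType              = VK_STRUCTURE_TYPE_PRESENT_INFO_KHR;
    present.waitSemaphoreCount = 1;
    present.pWaitSemaphores    = &f.render_done;
    present.swapchainCount     = 1;
    present.pSwapchains        = &swapchain;
    present.pImageIndices      = &image_index;
    vkQueuePresentKHR(queue, &present);

    frame_idx = (frame_idx + 1) % N_FRAMES_IN_FLIGHT;
}
```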

DominikWitczakAMD (Contributor) commented

Good observations!

  1. Agreed. We could use dynamic UB bindings here to make sure individual swapchain frames are assigned non-overlapping, exclusive regions of memory to operate on (see the sketch below).

Will fix this for the next update.
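
A rough sketch of what that could look like with the raw Vulkan API (not Anvil's wrappers) follows; `bind_per_frame_ub_region()`, `align_up()` and the other names are assumptions made up purely for illustration:

```cpp
#include <vulkan/vulkan.h>

// Hypothetical helper, not Anvil code: one large uniform buffer is carved into
// per-frame regions, the descriptor is created as
// VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC, and each frame re-binds the same
// set with a different dynamic offset.
inline VkDeviceSize align_up(VkDeviceSize value, VkDeviceSize alignment)
{
    // Valid because minUniformBufferOffsetAlignment is a power of two.
    return (value + alignment - 1) & ~(alignment - 1);
}

void bind_per_frame_ub_region(VkCommandBuffer  cmd_buffer,
                              VkPipelineLayout pipeline_layout,
                              VkDescriptorSet  descriptor_set,
                              VkDeviceSize     per_frame_data_size,
                              VkDeviceSize     min_ub_offset_alignment, // from VkPhysicalDeviceLimits
                              uint32_t         n_frame)
{
    // Region stride must respect minUniformBufferOffsetAlignment.
    const VkDeviceSize region_stride  = align_up(per_frame_data_size, min_ub_offset_alignment);
    const uint32_t     dynamic_offset = static_cast<uint32_t>(n_frame * region_stride);

    vkCmdBindDescriptorSets(cmd_buffer,
                            VK_PIPELINE_BIND_POINT_GRAPHICS,
                            pipeline_layout,
                            0,                   /* firstSet                   */
                            1, &descriptor_set,  /* set holding the dynamic UB */
                            1, &dynamic_offset); /* this frame's region        */

    // The CPU then only ever writes to
    // [dynamic_offset, dynamic_offset + per_frame_data_size) for frame n_frame,
    // so frames never stomp on each other's data.
}
```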

Random trivia: sounds like this is something the validation layers should capture, eh?

  2. I have to disagree here. A few starting points:
  • the app uses the FIFO presentation mode;
  • our swapchain is built of 3 swapchain images;
  • acquire_image_by_setting_semaphore() implicitly calls vkAcquireNextImageKHR().

You said that vkAcquireNextImageKHR() cannot be counted on to block until the next image becomes available. What I understand by "next image" is any swapchain image which is re-used (i.e. it is not being acquired for the first time). Am I following you correctly?

If so, let's continue. Anvil's acquire function uses a timeout of UINT64_MAX and checks the VkResult for any errors that may be reported by the function. Furthermore, we do specify a wait & a signal semaphore in order to ensure the GPU does not write to the same swapchain image without it first having been presented. That should defend us from any GPU-side sync issues, correct?

Now, from the CPU side, given that no assertion failure is reported right after vkAcquireNextImageKHR() returns control to us, I'm assuming we're also protected on the CPU end. Fences are heavy-handed. I may be wrong here, but my understanding is that you'd only use them for some sort of inter-process or inter-device synchronization, for which semaphores would not be a good fit.

Am I missing something trivial in the picture here?

DominikWitczakAMD (Contributor) commented

I've picked Dan's brain on 2) and after a bit of a brainstorm we concluded that you're right about the CPU aspect. We're going to need that CPU-level sync in order to make things aligned with the spec.

Thanks for spotting this! It's definitely not a trivial bug. It seems like it's a good candidate for validation layer improvement.


marti4d commented Sep 29, 2017

> What I understand by "next image" is any swapchain image which is re-used (i.e. it is not being acquired for the first time). Am I following you correctly?

Ah - I think this is where the confusion is, and I must admit I spent way too much time reading the Vulkan spec trying to understand this and its consequences. vkAcquireNextImageKHR() is allowed to return as long as:

  1. It knows which image it's going to give you next
  2. It can guarantee the semaphore will be signalled
  3. It can guarantee the fence will be signalled

Point 1 is extremely important, because it means that if the implementation uses a simple round-robin algorithm, it's allowed to just quickly push the index, semaphore, and fence into a list and return immediately with the predicted index.

This means the presentation engine is allowed to give you an index that it already has queued in its FIFO, and that the same index is allowed to appear multiple times in the FIFO with different semaphores and fences.

In fact, if the presentation engine weren't running for some reason, its FIFO could get filled up to system memory capacity with "1,2,3,1,2,3,1,2,3,1,2,3,1,2,3..." and it would still be a valid implementation.
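
In other words, one common way to stay safe on the CPU side (a sketch only, not necessarily what Anvil ended up doing) is to remember which in-flight fence last used each swapchain image and wait on it before reusing that image. All names here (`image_in_flight`, `acquire_safely`, the semaphore and fence parameters) are made up for illustration:

```cpp
#include <cstdint>
#include <vector>
#include <vulkan/vulkan.h>

// Hypothetical sketch: track, per swapchain image, the fence of the submission
// that last rendered to it. Even if vkAcquireNextImageKHR() returns immediately
// with a "predicted" index, the CPU waits here before reusing that image.
std::vector<VkFence> image_in_flight; // sized to the swapchain image count, VK_NULL_HANDLE-initialised

void acquire_safely(VkDevice device, VkSwapchainKHR swapchain,
                    VkSemaphore image_acquired_semaphore, VkFence current_frame_fence,
                    uint32_t* out_image_index)
{
    vkAcquireNextImageKHR(device, swapchain, UINT64_MAX,
                          image_acquired_semaphore, VK_NULL_HANDLE, out_image_index);

    VkFence& last_user = image_in_flight[*out_image_index];
    if (last_user != VK_NULL_HANDLE)
    {
        // The GPU may still be rendering to this image from an earlier acquire.
        vkWaitForFences(device, 1, &last_user, VK_TRUE, UINT64_MAX);
    }

    // This frame's fence now "owns" the image; it gets signalled by the
    // vkQueueSubmit() that renders to it.
    last_user = current_frame_fence;
}
```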

DominikWitczakAMD (Contributor) commented

Both issues have been fixed internally. The relevant fixes will land in the update planned for this week.

DominikWitczakAMD added a commit that referenced this issue Oct 4, 2017
Bug-fixes and improvements; Addresses #36, #37, #39, #40, partially #…
This issue was closed.