Examples crash WindowServer on macOS #268

Closed
simonask opened this issue Jan 29, 2023 · 3 comments · Fixed by #269

Comments

@simonask

Running this:

cargo run -p with_winit

An empty window is shown, and then after a second or two the WindowServer process seems to freeze, requiring a hard reboot. Unfortunately, no crash report is available in the system console, but if I leave it frozen for a while and reboot, a watchdog event seems to have been reported, noting that the WindowServer process is unresponsive and that the with_winit example is still running. No stack trace, no nothing.

Tried the following:

  • Release mode did not make a difference.
  • The with_bevy example is the same.
  • Disconnecting my external monitor did not make a difference.

I tried inserting println!()s and commenting things out to try to understand what's going on.

Commenting out surface.present() seems to make it not freeze the system outright, but does still seem to cause some mysterious jankiness even after the process is terminated (like scrolling in VSCode becomes jittery).

Then I looked at block_on_wgpu(), which has a note about deadlocking if it is "awaiting anything other than GPU progress". As I understand wgpu::Device::poll(), it defines "GPU progress" as work having been submitted to a queue, and it returns true if the queue is empty. I'm not sure, but I think that means that when wgpu::Device::poll() returns true, that is exactly the situation where there would be a deadlock. So I tried panicking when poll() returns true, and indeed it happens.
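For illustration, the experiment described above can be sketched without wgpu at all. This is a minimal model, not the actual `block_on_wgpu()` implementation: `drive_with_device`, `Driven`, `ReadyAfter`, and `NoopWake` are hypothetical names, and the `poll_device` closure stands in for `device.poll(Maintain::Wait)`, assumed to return `true` when the queue is empty. Unlike a bare panic, the sketch checks the future once more after the device poll, on the assumption that the poll may have run callbacks that complete it.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Waker that does nothing: the loop below re-polls the future unconditionally.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Result of driving a future alongside a (stand-in) device poll.
#[derive(Debug, PartialEq)]
pub enum Driven<T> {
    /// The future completed with this value.
    Ready(T),
    /// The device reported an empty queue, yet the future was still pending.
    SuspectedDeadlock,
}

/// Hypothetical helper, not part of wgpu or of the examples in this repository.
/// `poll_device` stands in for `device.poll(Maintain::Wait)` and is assumed
/// to return `true` when the queue is empty.
pub fn drive_with_device<F, P>(fut: F, mut poll_device: P) -> Driven<F::Output>
where
    F: Future,
    P: FnMut() -> bool,
{
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    loop {
        if let Poll::Ready(value) = fut.as_mut().poll(&mut cx) {
            return Driven::Ready(value);
        }
        if poll_device() {
            // The device poll may have run callbacks that complete the future,
            // so check once more before reporting a problem.
            return match fut.as_mut().poll(&mut cx) {
                Poll::Ready(value) => Driven::Ready(value),
                Poll::Pending => Driven::SuspectedDeadlock,
            };
        }
    }
}

/// Test future that returns `Pending` a fixed number of times, then `Ready(42)`.
pub struct ReadyAfter {
    pub remaining: u32,
}

impl Future for ReadyAfter {
    type Output = u32;

    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<u32> {
        if self.remaining == 0 {
            Poll::Ready(42)
        } else {
            self.remaining -= 1;
            Poll::Pending
        }
    }
}
```

Calling `drive_with_device(std::future::pending::<()>(), || true)` yields `Driven::SuspectedDeadlock`, which corresponds to the panic I hit. Note that if the closure never returns `true` and the future never completes, the loop spins forever, which is the situation the note in `block_on_wgpu()` warns about.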

(Oddly enough, after crashing the process in a separate run with that panic, the jankiness in VSCode stopped...)

The render_to_texture_async() function does .await a buffer mapping after some stuff has been submitted to the queue (through run_recording()), and I'm honestly not sure what to make of wgpu's documentation here, because device.poll() is also meant to be called when awaiting buffer mapping events, but what exactly happens when the queue is fully emptied by the driver, but there are still outstanding buffer mappings?

To clarify: I'm not certain that poll() returning true while the future passed to block_on_wgpu() is still Pending is actually a reliable indication of a deadlock, because I'm not sure it captures pending buffer mappings. But if it is, it seems plausible that the device is being spammed by poll()s (with the Maintain::Wait argument no less), which somehow ends up crashing either WindowServer or the GPU driver (which is concerning in itself).

I'm sorry if I'm missing something and this is a wild goose chase - I'm really struggling (and I know I'm not alone here) with understanding how exactly wgpu::Device::poll() should be used, and I know that the wgpu team has iterated a bit on its behavior.

@simonask
Author

simonask commented Jan 29, 2023

Oh, and this was on a MacBook Pro (M1 Pro) running the latest macOS Ventura, 13.2.

@raphlinus
Contributor

This repros, thanks for the report. It was inadequately tested on Mac, ironically because I moved my development to Windows: it's so painful to debug when the computer is constantly crashing and hard-locking.

I suspect this is not a simple problem. I naively assumed that the behavior of Device::poll would be similar on Windows (Vulkan) and mac, but now it's clear I need to deeply understand those implementations.

@simonask
Author

I can confirm that #269 fixes the problem on my machine. That was fast! :-)

Out of curiosity, could the actual problem be described as a deadlock in a compute shader? Whether as a logic error or a miscompilation.
