Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wayland: Buffers with old size can get commited already after ack_configure on window resize #4563

Closed
dos1 opened this issue Aug 1, 2021 · 40 comments · Fixed by #6361
Closed
Milestone

Comments

@dos1
Copy link
Contributor

dos1 commented Aug 1, 2021

To reproduce:

  1. Run a nested compositor, like weston --width=640 --height=480
  2. Create a window larger than the screen, like 1280x720
  3. Toggle fullscreen with SDL_SetWindowFullscreen (tested with SDL_WINDOW_FULLSCREEN_DESKTOP)

The client gets terminated with a protocol error:

[468479.498] wl_display@1.error(xdg_wm_base@13, 4, "xdg_surface buffer (1280 x 720) is larger than the configured fullscreen state (640 x 480)")

Happens with xdg-shell, does not happen with libdecor. Tested on d784dd2.

/cc @flibitijibibo

@dos1
Copy link
Contributor Author

dos1 commented Aug 1, 2021

It doesn't crash on any other compositor I checked, but seems it's only because they don't check for this protocol violation.

This log was obtained on Mutter:

[1889923.478]  -> zwp_linux_buffer_params_v1@30.create_immed(new id wl_buffer@26, 1280, 720, 875713112, 0)
(...)
[1890108.585]  -> xdg_toplevel@28.set_fullscreen(wl_output@7)
[1890133.181] wl_display@1.delete_id(25)
[1890133.215] wl_display@1.delete_id(29)
[1890133.226] xdg_toplevel@28.configure(800, 600, array)
[1890133.245] xdg_surface@27.configure(54)
[1890133.260]  -> wl_surface@21.set_buffer_scale(1)
[1890133.279]  -> xdg_surface@27.ack_configure(54)
[1890133.293]  -> wl_compositor@4.create_region(new id wl_region@32)
[1890133.307]  -> wl_region@32.add(0, 0, 800, 600)
[1890133.339]  -> wl_surface@21.set_opaque_region(wl_region@32)
[1890133.352]  -> wl_region@32.destroy()
[1890133.362] wl_callback@25.done(837525162)
[1890133.375]  -> wl_surface@21.frame(new id wl_callback@30)
[1890134.684] wl_buffer@34.release()
[1890134.703]  -> wl_buffer@34.destroy()
[1890134.711] wl_callback@29.done(837525162)
[1890136.007]  -> wl_surface@21.frame(new id wl_callback@29)
[1890136.033]  -> wl_surface@21.attach(wl_buffer@26, 0, 0)
[1890136.051]  -> wl_surface@21.damage(0, 0, 2147483647, 2147483647)
[1890136.473]  -> wl_surface@21.commit()
[1890137.489] wl_display@1.delete_id(32)
[1890137.515] wl_display@1.delete_id(34)
[1890137.747] wl_buffer@24.release()
[1890137.764]  -> wl_buffer@24.destroy()
[1890140.517] wl_display@1.delete_id(24)
[1890140.539] wl_display@1.delete_id(30)
[1890140.545] wl_display@1.delete_id(29)
[1890140.551] wl_callback@29.done(837525178)
[1890140.559]  -> wl_surface@21.frame(new id wl_callback@29)
[1890140.571]  -> zwp_linux_dmabuf_v1@23.create_params(new id zwp_linux_buffer_params_v1@24)
[1890140.600]  -> zwp_linux_buffer_params_v1@24.add(fd 104, 0, 0, 3200, 16777215, 4294967295)
[1890140.630]  -> zwp_linux_buffer_params_v1@24.create_immed(new id wl_buffer@34, 800, 600, 875713112, 0)
[1890140.656]  -> zwp_linux_buffer_params_v1@24.destroy()
[1890140.663]  -> wl_surface@21.attach(wl_buffer@34, 0, 0)
[1890140.678]  -> wl_surface@21.damage(0, 0, 2147483647, 2147483647)
[1890142.088]  -> wl_surface@21.commit()

After setting xdg_toplevel::set_fullscreen, compositor sends xdg_toplevel::configure(800, 600, array) which gets acked and the new surface state gets commited right away (wl_surface::commit) - but the new state still contains the old 1280x720 buffer, as the new one gets created only after that commit.

@dos1
Copy link
Contributor Author

dos1 commented Aug 1, 2021

Interestingly, when unsetting fullscreen the new buffer size is commited before the configure event is even received, which is why it also triggers protocol violation:

[2468867.151]  -> xdg_toplevel@25.unset_fullscreen()
[2468867.170]  -> wl_surface@18.set_buffer_scale(1)
[2468867.192]  -> wl_compositor@4.create_region(new id wl_region@29)
[2468867.219]  -> wl_region@29.add(0, 0, 1280, 720)
[2468867.262]  -> wl_surface@18.set_opaque_region(wl_region@29)
[2468867.282]  -> wl_region@29.destroy()
[2468878.603]  -> wl_buffer@28.destroy()
[2468884.628] wl_buffer@21.release()
[2468884.690]  -> wl_buffer@21.destroy()
[2468884.768] wl_display@1.delete_id(29)
[2468884.805] wl_display@1.delete_id(28)
[2468885.693] wl_display@1.delete_id(21)
[2468885.740] wl_display@1.delete_id(23)
[2468885.759] wl_display@1.delete_id(22)
[2468885.776] wl_callback@22.done(838098027)
[2468885.800]  -> wl_surface@18.frame(new id wl_callback@22)
[2468885.839]  -> zwp_linux_dmabuf_v1@20.create_params(new id zwp_linux_buffer_params_v1@21)
[2468885.961]  -> zwp_linux_buffer_params_v1@21.add(fd 28, 0, 0, 5120, 16777216, 4)
[2468886.053]  -> zwp_linux_buffer_params_v1@21.add(fd 29, 1, 3768320, 256, 16777216, 4)
[2468886.117]  -> zwp_linux_buffer_params_v1@21.create_immed(new id wl_buffer@28, 1280, 720, 875713112, 0)
[2468886.178]  -> zwp_linux_buffer_params_v1@21.destroy()
[2468886.194]  -> wl_surface@18.attach(wl_buffer@28, 0, 0)
[2468886.232]  -> wl_surface@18.damage(0, 0, 2147483647, 2147483647)
[2468886.806]  -> wl_surface@18.commit()
[2468886.874] xdg_toplevel@25.configure(0, 0, array)
[2468886.914] xdg_surface@24.configure(332)
[2468886.940]  -> wl_surface@18.set_buffer_scale(1)
[2468886.963]  -> xdg_surface@24.ack_configure(332)
[2468886.989]  -> wl_compositor@4.create_region(new id wl_region@29)
[2468887.012]  -> wl_region@29.add(0, 0, 1280, 720)
[2468887.061]  -> wl_surface@18.set_opaque_region(wl_region@29)
[2468887.083]  -> wl_region@29.destroy()
[2468887.098] wl_callback@23.done(838098027)
[2468887.121]  -> wl_surface@18.frame(new id wl_callback@27)
[2468887.188] wl_display@1.delete_id(21)
[2468887.211] wl_display@1.error(xdg_wm_base@13, 4, "xdg_surface buffer (1280 x 720) is larger than the configured fullscreen state (640 x 480)")

(this log comes from Weston)

@flibitijibibo
Copy link
Collaborator

Odd, do we need to resize the buffer as part of setting fullscreen? I figured setting fullscreen would do that for us, via configure or a similar mechanism...

@dos1
Copy link
Contributor Author

dos1 commented Aug 1, 2021

No, the problem is that:

  • in case of toggling fullscreen on, the resize gets acknowledged too early (when it's not ready yet)
  • in case of toggling fullscreen off, the resize happens too early (before the compositor asks for it)

@flibitijibibo
Copy link
Collaborator

The second bit is probably the resize in SetWindowFullscreen, which unfortunately was necessary to get compositors to not blow apart the window size when leaving fullscreen (test with Super Hexagon). For the first one... that one's lost me, how long do we have to wait for it to be ready for an ack when they told us to reconfigure in the first place?

This might be a bit out of my range, but SupHex is a perfect test case for fullscreen testing as it avoids all APIs except SetWindowFullscreen and uses resize events for everything else, despite not being resizable.

@dos1
Copy link
Contributor Author

dos1 commented Aug 1, 2021

how long do we have to wait for it to be ready for an ack when they told us to reconfigure in the first place?

Wait? No, SDL should send an ack when SDL is ready - that is, the surface has been reconfigured with a new resolution. Currently SDL acks first, commits and only then changes the buffer size, which is obviously wrong.

@dos1
Copy link
Contributor Author

dos1 commented Aug 1, 2021

Just finished bisecting.

The first commit that introduces crashes when toggling fullscreen off is e5f9fae. That crash isn't deterministic and doesn't seem to happen at all when toggling fullscreen on.

The second commit that makes it reliably crash in both cases is 871c111.

@flibitijibibo
Copy link
Collaborator

I'd be up for reintroducing the first one - the second one basically only works because we put off resizing so late that it skips over a lot of activity in exchange for lag. Odds are we just need to move the ack to be later in HandlePendingResize.

@dos1
Copy link
Contributor Author

dos1 commented Aug 1, 2021

The flush shouldn't be necessary with correct resizing logic. It was simply hiding the bug under the carpet.

From 7f261d3:

void
Wayland_SetWindowFullscreen(_THIS, SDL_Window * window,
                            SDL_VideoDisplay * _display, SDL_bool fullscreen)
{
    struct wl_output *output = ((SDL_WaylandOutputData*) _display->driverdata)->output;
    SetFullscreen(window, fullscreen ? output : NULL);

    /* The window may have been resized to the output size, so reset this when
     * returning to a window
     */
    if (!fullscreen) {
        SDL_WindowData *wind = (SDL_WindowData*) window->driverdata;
        wind->resize.width = window->windowed.w;
        wind->resize.height = window->windowed.h;
        wind->resize.pending = SDL_TRUE;
        Wayland_HandlePendingResize(window);
    }
}

This is wrong - Wayland_HandlePendingResize immediatelly calls WAYLAND_wl_egl_window_resize, which can't be done before we receive a configure event from the compositor (as in that case it still applies to fullscreened surface). This is what was hidden by that flush, since flushing makes wl_egl_window_resize to be executed after configure is already sent by the compositor (it would still fail in case compositor rejects unfullscreening though, which is something it's free to do). Commenting Wayland_HandlePendingResize out makes unsetting fullscreen work fine. Setting fullscreen still fails though.

@flibitijibibo
Copy link
Collaborator

flibitijibibo commented Aug 1, 2021

That makes perfect sense to me - in messing around with this I think the block should be replaced with a flush, as that was what I was desperately trying to replicate here.

It's very possible that, like the original version, a flush should be unconditional. To lay out why this is such a pain in the ass, this is what happens:

  • SDL_SetWindowFullscreen calls the driver function
  • Driver unsets fullscreen, triggers a RESIZED event that sets it back to the old windowed size
  • SDL_SetWindowFullscreen immediately queues a SIZE_CHANGED event, erasing the resize event
  • Application polls event loop, resize event is gone, game doesn't know to resize backbuffer, madness ensues

So the issue is two-fold: We need the first resized event to ensure we have the correct windowed size before SetWindowFullscreen has a chance to delete it, and we need the second resize from the configure event to bypass SDL_video.c's SIZE_CHANGED so that the game knows that a non-programmatic system resize happened. This is why Super Hexagon is a really important test case: It not only calls SetWindowFullscreen with no SetWindowSize, but it exclusively uses RESIZED events to know whether to match the fixed windowed size or the desktop size. Without the exact sequence of events listed above, the SDL events don't fire or the window size will be incorrect.

@dos1
Copy link
Contributor Author

dos1 commented Aug 1, 2021

I still don't understand what's going on regarding the case when fullscreen is being set. However, I'm pretty sure now that libdecor backend is broken in exact same way - it just gets away with it because it calls xdg_surface::set_window_geometry when acking configure, and Weston, in contrary to the error message it sends, actually checks the geometry size and not the buffer size. That may be a good workaround for xdg-shell in 2.0.16 in case we don't manage to fix it soon.

This means that instead of crashing/violating the protocol, we're getting a broken frame displayed on screen until it gets replaced with the next one that's actually correctly sized.

@dos1
Copy link
Contributor Author

dos1 commented Aug 2, 2021

So there's #4575 that workarounds this issue, but doesn't get rid of visual glitches. I've been trying to fix it properly as well and got the code written for at least the GL case that, to my best knowledge, should be correct, but... it doesn't work - and judging from discussion on #wayland it smells like it may be another Mesa bug.

Also, even if it worked, there would still be Vulkan case to handle. I don't know much about Vulkan - does SDL even know when the buffer swap happens there?

@flibitijibibo
Copy link
Collaborator

Vulkan handles presentation on its own via vkQueuePresentKHR and vkAcquireNextImageKHR, so it doesn't need to touch SDL like OpenGL does.

I wonder if the glitch has to do with the frame callback? We try to follow it for GL as best as we can, but technically we don't swap inside the call itself. I don't know if that would actually cause an issue, but it's about the only thing I can think of off the top of my head.

@flibitijibibo
Copy link
Collaborator

I did Even More Super Hexagon Tests(TM) and it seems like a root cause may have been us ignoring configure events...?

2506729...1fb1aa1

With this I get the right window size without having to fire all those arbitrary resizes from before, so it's very possible that this affected what should have been smooth resizes as well.

@dos1 dos1 changed the title Wayland/xdg-shell: Crash on toggling fullscreen when window size is larger than screen size wayland: Buffers with old size can get commited already after ack_configure on window resize Aug 3, 2021
@dos1
Copy link
Contributor Author

dos1 commented Aug 3, 2021

Renamed the issue as we don't crash anymore, but the incorrect resize logic is still going to haunt us.

@dos1
Copy link
Contributor Author

dos1 commented Aug 3, 2021

@flibitijibibo I don't think frame callback is related in any way. At least it shouldn't be.

To my best knowledge, this should work in OpenGL case: 785dfdf

However, it does not. Even when eglQuerySurface reports the size to be already changed, the next buffer that gets commited can still have the old size.

From #wayland:

[05:00] <dos1> is there some documentation available on wl_egl_window_resize behavior? i.e. whether there are situations where the next commit will still have a buffer with old size attached etc. 
[10:26] <pq> dos1, sorry, I don't recall there being any, that is a very good and tricky question.
[10:26] <pq> dos1, I believe the size is "locked" for the next swapbuffers by any API call that needs to allocate a buffer.
[10:27] <pq> dos1, so not just GL drawing calls, but also things like querying buffer_age extension.
[11:51] <dos1> should the dimensions obtained from wl_egl_window_get_attached_size right before calling eglSwapBuffers always match the commited buffer size?
[11:52] <daniels> dos1: no
[11:52] <daniels> dos1: wl_egl_window_get_attached_size() will return the dimensions of the buffer committed the _last_ time you called eglSwapBuffers
[11:52] <dos1> I see
[11:52] <dos1> is there a way to find out what size is going to be commited?
[11:53] <daniels> if you want to see what's going to get committed the _next_ time you call eglSwapBuffers, call eglQuerySurface(surf, EGL_{WIDTH,HEIGHT})
[11:53] <dos1> thanks!
[11:54] <daniels> as pq says, resize requests take effect the first time in a frame that you need a new buffer - after you call eglSwapBuffers there is no current back buffer, until you do something like submit a GL draw call or query EGL buffer age, which would cause a back buffer to be locked - the resize request takes effect during that lock
[11:57] <dos1> yeah, I get that. the thing is that SDL doesn't control whether the client has issued some GL calls already or not at the time wl_egl_window_resize is called, so it needs to find out whether the resize request took effect already or not in order to know when to ack the configure request
[11:58] <dos1> but sounds like eglQuerySurface will do
[11:58] <pq> dos1, is there no other way to go about it? You can't e.g. ensure to call wl_egl_window_resize() right *before* eglSwapBuffers of the *previous* frame? Maybe you don't have the new size yet at that time?
[12:00] <pq> I'm just thinking that there should be a more explicit way to match window state with the content drawn, because resizes are not the only thing that might change.
[12:06] <daniels> pq: well, it depends on your definition of 'anywhere' :P https://lists.freedesktop.org/archives/mesa-dev/2018-June/thread.html#196474
[12:08] <pq> daniels, add that link in the EGL_KHR_platform_wayland spec?
[12:09] <dos1> I'm afraid SDL client API isn't really in the "every frame is perfect" age yet. AFAIK the client doesn't even have a way to acknowledge the resize back to SDL
[12:10] <dos1> or any other state change for that matter
[12:11] <pq> dos1, does that mean that apps don't check what the window state should be before they draw? If they did, couldn't SDL assume that if the state update is visible to the app, the next frame will be drawn accordingly? I guess this breaks on threaded apps?
[12:15] <dos1> apps receive events, like SDL_WINDOWEVENT_RESIZED with new window dimensions - but clients can still draw and flip buffers in between the points of time when the event has been queued by SDL and handled by the app
[12:17] <dos1> the resize used to be delayed until after wl_egl_window_resize (that's how I wrote it back in 2018), but apparently it was causing troubles with clients that only draw opportunistically: https://github.com/libsdl-org/SDL/issues/4326
[12:23] <pq> dos1, and I think there is also slack betweend "handled by the app" and "actually drawn by the app", so yeah.
[12:29] <dos1> blah, until after eglSwapBuffers*
[13:21] <dos1> I must be still missing something - I check with eglQuerySurface whether it has the new size already; if so, then call ack_configure; and then call eglSwapBuffers
[13:21] <dos1> I still end up with a buffer of the old size commited after ack_configure
[13:21] <dos1> which weston then rejects with wl_display@1.error(xdg_wm_base@13, 4, "xdg_surface buffer (1280 x 720) is larger than the configured fullscreen state (640 x 480)")
[14:07] <dos1> so, based on my current understanding of how it all should work this doesn't make much sense: https://paste.debian.net/1206328/
[14:12] <dos1> running this on a compositor that doesn't throw an error in that case shows that the next frame gets a new and correctly sized buffer
[14:12] <pq> dos1, I agree... there are two things missing: EGL creating a new wl_buffer in the new size which seems like the resize didn't go through; the other is lack of adjusting xdg_surface geometry but if the client never set that maybe it's supposed to be automatic from surface size.
[14:13] <dos1> geometry was never set
[14:13] <pq> ok
[14:14] <pq> so yeah, something is going wrong with the resizing, and I don't remember the intended semantics between all those EGL-related calls to say anything
[14:15] <pq> a Mesa bug is not impossible either I suppose, if other compositors don't bother with errors
[14:16] <dos1> yeah, from all the compositors I tried only weston checked for this error
[14:16] <pq> nor is a Weston bug, but since no new wl_buffer is seen created, I'd think it's something client-side
[14:18] <pq> dos1, if you don't get help here, I'd suggest filing a Mesa bug.
[1391779.427]  -> xdg_toplevel@24.set_fullscreen(wl_output@12)
[1391791.369] xdg_toplevel@24.configure(640, 480, array)
[1391791.441] xdg_surface@23.configure(16761)
[1391791.470]  -> wl_surface@18.set_buffer_scale(1)
wl_egl_window_resize 640 480
[1391807.599] wl_buffer@29.release()
[1391807.632]  -> wl_buffer@29.destroy()
[1391809.022] wl_display@1.delete_id(22)
[1391809.040] wl_display@1.delete_id(27)
[1391809.048] wl_display@1.delete_id(25)
[1391809.054] wl_callback@27.done(918874813)
[1391809.063]  -> wl_surface@18.frame(new id wl_callback@22)
eglQuerySurface (CURRENT) 640 480
[1391809.219]  -> xdg_surface@23.ack_configure(16761)
wl_egl_window_get_attached_size (PREVIOUS) 1280 720
eglSwapBuffers:
[1391809.238] wl_callback@25.done(918874813)
[1391809.247]  -> wl_surface@18.frame(new id wl_callback@25)
[1391809.257]  -> wl_surface@18.attach(wl_buffer@26, 0, 0)
[1391809.275]  -> wl_surface@18.damage(0, 0, 2147483647, 2147483647)
[1391809.706]  -> wl_surface@18.commit()
wl_egl_window_get_attached_size (COMMITED) 640 480
[1391810.496] wl_display@1.delete_id(29)
[1391810.508] wl_display@1.error(xdg_wm_base@13, 4, "xdg_surface geometry (1280 x 720) is larger than the configured fullscreen state (640 x 480)")

So, I'm going to try to isolate the test case and file a Mesa bug afterwards.

@dos1
Copy link
Contributor Author

dos1 commented Aug 3, 2021

BTW. Moving the eglQuerySurface check and xdg_surface_ack_configure after eglSwapBuffers fixes the glitches... most of the time. It still fails sometimes, but most of the time it resizes flawlessly. It also does not make any sense whatsoever, I just tried it to see what will happen :P So yeah, I think we're in the "Mesa does something else under the hood than everyone thinks it does" territory.

@flibitijibibo
Copy link
Collaborator

flibitijibibo commented Aug 3, 2021

Interesting, if this related to eglSwapBuffers, it might be worth checking with the swap interval set to both 0 and 1. The weird custom vsync waiting has some weird side effects, and one of them was VVV not showing the frame during resize when the interval is set to 1.

@flibitijibibo flibitijibibo added this to the 2.0.18 milestone Aug 4, 2021
@christianrauch
Copy link
Contributor

One source of error could be that we use wl_surface_commit after setting min/max limits. According to the protocol documentation for xdg_toplevel.set_min_size

Values set in this way are double-buffered. They will get applied
on the next commit.

the surface has to be committed for the limits to apply.

This also happens inside the configure callback with the stack trace handle_configure_xdg_toplevel -> SetFullscreen -> CommitMinMaxDimensions -> wl_surface_commit. To this end, we commit the old surface (with old buffers) before resizing the wl_egl_window and acknowledging the configure event.
This is very bad, because the client is not supposed to apply this state immediately as per xdg_toplevel.configure documentation:

This configure event asks the client to resize its toplevel surface or
to change its state. The configured state should not be applied
immediately. See xdg_surface.configure for details.

but has to wait for xdg_surface.ack_configure:

The configure event marks the end of a configure sequence. A configure
sequence is a set of one or more events configuring the state of the
xdg_surface, including the final xdg_surface.configure event.

Where applicable, xdg_surface surface roles will during a configure
sequence extend this event as a latched state sent as events before the
xdg_surface.configure event. Such events should be considered to make up
a set of atomically applied configuration states, where the
xdg_surface.configure commits the accumulated state.

Since xdg_toplevel is derived from xdg_surface you have to wait for the configure event of the "base" surface and all configure events make up the entire configuration of this surface. So, committing the surface while it is in between a configuration is likely messing up your state. Don't do this.

Ideally, you will gather the state provided by the configure events from the compositor, acknowledge this, add additional states (such as min/max limits, window geometry, buffers) and commit once at the end. The commits that we have currently right after setting the limits must be postponed up to the point where SDL set all necessary properties for the surface.

@flibitijibibo
Copy link
Collaborator

flibitijibibo commented Aug 4, 2021

Oof, yeah that is a messy stack... I think that might have been a carryover from before we had the shared CommitMinMax function, I wonder if moving the commit out of the function and instead put it after CommitMinMaxDimensions in SetWindowResizable/SetWindowMaximumSize/SetWindowMinimumSize would be enough to avoid committing while having everything continue to work. Functionally it'd be the same for every function except SetFullscreen so if it works it'd be pretty safe to sneak into 2.0.16 at the last moment.

@christianrauch
Copy link
Contributor

christianrauch commented Aug 5, 2021

The same goes for xdg_toplevel.set_fullscreen etc. Those functions that change the state of the surface should be called after the end of the configure sequence. In the specific case of the SDL window flags, I am pretty sure the configure event should not cause going into fullscreen directly. The only thing SDL has to do is to "react" to any state changes from the configure event, such as resize buffers and redraw when the surface size changes or hiding/updating the decorations when changing to fullscreen/maximised state.

I think weston's window.c is a good inspiration for how to handle the configure events. In xdg_toplevel_handle_configure it simply gathers the state (set values and flags) and in xdg_surface_handle_configure it acknowledges the configuration, calls a client provided callback and enables redrawing of the surface.

I take it that the intention with the SDL_WINDOW_FULLSCREEN flag checking is to change to fullscreen when the user has requested to have the window in fullscreen mode. The wl_surface_commit is a side-effect of this, and postponing the commit is just an intermediate solution. A proper solution would be to postpone the entre SetFullscreen.

I think the entire loop should look something like:

  1. poll the wayland file descriptor for new events via wl_display_dispatch (this should trigger all registered event handlers like the configure handler)
  2. gather all state changes that have to be applied (pending states)
  3. setup the surfaces with the new pending states
  4. commit the surface (the state now becomes applied)
  5. process users events like keyboard and mouse
  6. compare the just applied state with the user's desired state (e.g. you just applied a floating state because of the compositor's request but want SDL_WINDOW_FULLSCREEN)
  7. send any requests that might change that current surface state to the desired state (e.g. set_fullscreen) to the compositor via wl_display_flush

The interaction with the compositor via the Wayland file descriptor can be quite complicated, especially when you have to use multiple queues for multiple threads. The documentation about wl_display is a good resource to start.

@dos1
Copy link
Contributor Author

dos1 commented Aug 5, 2021

One source of error could be that we use wl_surface_commit after setting min/max limits.

While I completely agree that it shouldn't be done this way, commenting it out was one of the first things I tried while looking at this issue and, sadly, it did not help at all. Judging from the protocol logs I really don't think it's SDL who's at fault there (although it wouldn't surprise me if reworking it to work more like a conventional Wayland client could make it not trigger the bug anymore).

@dos1
Copy link
Contributor Author

dos1 commented Aug 8, 2021

Interesting, if this related to eglSwapBuffers, it might be worth checking with the swap interval set to both 0 and 1.

For the record, I tried it without busy-looping SwapWindow hack just in case it was somewhat related, but that didn't change anything.

Also, I don't seem to be able to reproduce it with some simpler apps like testgles2; this appears to be some kind of race condition based on when the client is invoking GL calls that cause the buffer to get allocated, and my Allegro-based games when used with SDL backend somehow manage to trigger it consistently. Trying to write a simple reproducer got me nowhere so far, so I'll probably have to work the other way around to simplify the existing reproducer.

@flibitijibibo flibitijibibo added waiting Waiting on user response and removed bug labels Jan 25, 2022
@dos1
Copy link
Contributor Author

dos1 commented Jan 30, 2022

Just tried the latest main and it's still there (xdg-shell path with Mesa on Intel).

This list of steps should make it easy to reproduce:

  1. Modify SDL to remove all xdg_surface_set_window_geometry calls across the codebase (they exist only to workaround this issue). Alternatively apply 785dfdf.
  2. git clone https://gitlab.com/HolyPangolin/animatch --recursive && cd animatch && cmake -Bbuild -GNinja -DLIBSUPERDERPY_EMBEDDED_ALLEGRO=ON -DALLEGRO_SDL=ON && cmake --build build
  3. Run nested weston with weston --width=1280 --height=720 --scale=1
  4. Run Animatch with: SDL_VIDEO_WAYLAND_ALLOW_LIBDECOR=0 build/src/animatch -w (use SDL_DYNAMIC_API as appropriate)
  5. Once Animatch loads, press F to enter fullscreen mode.

Effect:

xdg_wm_base@12: error 4: xdg_surface buffer (720 x 1440) is larger than the configured fullscreen state (1280 x 720)

@flibitijibibo flibitijibibo added bug and removed waiting Waiting on user response labels Jan 30, 2022
@flibitijibibo
Copy link
Collaborator

I messed around with this some more and didn't get anywhere... this works great under Vulkan, GL not so much:

From 48f86edfa1000ae4547b2198019adafd2f78d9b2 Mon Sep 17 00:00:00 2001
From: Ethan Lee <SNIP>
Date: Wed, 2 Feb 2022 15:18:52 -0500
Subject: [PATCH] wayland: Rewrite SetWindowSize to avoid resizes mid-frame

---
 src/video/wayland/SDL_waylandwindow.c | 76 ++++++++++-----------------
 1 file changed, 27 insertions(+), 49 deletions(-)

diff --git a/src/video/wayland/SDL_waylandwindow.c b/src/video/wayland/SDL_waylandwindow.c
index 635546af4..b21b630d2 100644
--- a/src/video/wayland/SDL_waylandwindow.c
+++ b/src/video/wayland/SDL_waylandwindow.c
@@ -1361,7 +1361,6 @@ static void
 Wayland_HandleResize(SDL_Window *window, int width, int height, float scale)
 {
     SDL_WindowData *data = (SDL_WindowData *) window->driverdata;
-    SDL_VideoData *viddata = data->waylandData;
     struct wl_region *region;
 
     if (data->needs_resize_event || window->w != width || window->h != height || data->scale_factor != scale) {
@@ -1388,15 +1387,6 @@ Wayland_HandleResize(SDL_Window *window, int width, int height, float scale)
     wl_region_add(region, 0, 0, window->w, window->h);
     wl_surface_set_opaque_region(data->surface, region);
     wl_region_destroy(region);
-
-    /* XXX: This workarounds issues with commiting buffers with old size after
-     * already acknowledging the new size, which can cause protocol violations.
-     * It doesn't fix the first frames after resize being glitched visually,
-     * but at least lets us not be terminated by the compositor.
-     * Can be removed once SDL's resize logic becomes compliant. */
-    if (viddata->shell.xdg && data->shell_surface.xdg.surface) {
-       xdg_surface_set_window_geometry(data->shell_surface.xdg.surface, 0, 0, window->w, window->h);
-    }
 }
 
 void
@@ -1413,54 +1403,42 @@ Wayland_SetWindowMaximumSize(_THIS, SDL_Window * window)
 
 void Wayland_SetWindowSize(_THIS, SDL_Window * window)
 {
-    SDL_VideoData *data = _this->driverdata;
     SDL_WindowData *wind = window->driverdata;
-    struct wl_region *region;
-#ifdef HAVE_LIBDECOR_H
-    struct libdecor_state *state;
-#endif
 
 #ifdef HAVE_LIBDECOR_H
-    /* we must not resize the window while we have a static (non-floating) size */
-    if (data->shell.libdecor &&
-        wind->shell_surface.libdecor.frame &&
-        !libdecor_frame_is_floating(wind->shell_surface.libdecor.frame)) {
-            /* Commit the resize when we re-enter floating state */
-            wind->floating_resize_pending = SDL_TRUE;
-            return;
-    }
+    SDL_VideoData *data = _this->driverdata;
+    if (data->shell.libdecor) {
+        if (wind->shell_surface.libdecor.frame) {
+            /* We must not resize the window while we have a static
+             * (non-floating) size! Commit when we re-enter floating state
+             */
+            if (!libdecor_frame_is_floating(wind->shell_surface.libdecor.frame)) {
+                wind->floating_resize_pending = SDL_TRUE;
+            } else {
+                struct libdecor_state *state = libdecor_state_new(window->w, window->h);
+                libdecor_frame_commit(wind->shell_surface.libdecor.frame, state, NULL);
+                libdecor_state_free(state);
+            }
+        }
+    } else
 #endif
-
-    wl_surface_set_buffer_scale(wind->surface, wind->scale_factor);
-
-    if (wind->egl_window) {
-        WAYLAND_wl_egl_window_resize(wind->egl_window,
-                                     window->w * wind->scale_factor,
-                                     window->h * wind->scale_factor,
-                                     0, 0);
-    }
-
-#ifdef HAVE_LIBDECOR_H
-    if (data->shell.libdecor && wind->shell_surface.libdecor.frame) {
-        state = libdecor_state_new(window->w, window->h);
-        libdecor_frame_commit(wind->shell_surface.libdecor.frame, state, NULL);
-        libdecor_state_free(state);
+    if (data->shell.xdg) {
+        if (!(window->flags & SDL_WINDOW_RESIZABLE) || (window->flags & SDL_WINDOW_FULLSCREEN)) {
+            /* We can use min/max to update the size for fixed-size windows */
+            SetMinMaxDimensions(window, SDL_TRUE);
+        } else {
+            /* Do NOT commit, a fullscreen toggle after this would crash! */
+            xdg_surface_set_window_geometry(wind->shell_surface.xdg.surface,
+                                            0, 0, window->w, window->h);
+        }
+    } else {
+        /* No good way to resize, do it the hard way */
+        Wayland_HandleResize(window, window->w, window->h, wind->scale_factor);
     }
-#endif
 
     /* windowed is unconditionally set, so we can trust it here */
     wind->floating_width = window->windowed.w;
     wind->floating_height = window->windowed.h;
-
-    region = wl_compositor_create_region(data->compositor);
-    wl_region_add(region, 0, 0, window->w, window->h);
-    wl_surface_set_opaque_region(wind->surface, region);
-    wl_region_destroy(region);
-
-    /* Update the geometry which may have been set by a hack in Wayland_HandleResize */
-    if (data->shell.xdg && wind->shell_surface.xdg.surface) {
-       xdg_surface_set_window_geometry(wind->shell_surface.xdg.surface, 0, 0, window->w, window->h);
-    }
 }
 
 void Wayland_SetWindowTitle(_THIS, SDL_Window * window)
-- 
2.34.1


@flibitijibibo
Copy link
Collaborator

Try as I might I cannot get this to happen locally on anything other than the exact test case on the exact compositor. This is at risk of being bumped to .24 as a result.

(We really should make window changes a Commit() operation in 3.0...)

@flibitijibibo
Copy link
Collaborator

This bug may interact with #5450, probably not but it does touch a lot of the same areas with an added protocol extension.

@flibitijibibo
Copy link
Collaborator

We're at the deadline, so bumping to 2.0.24. That said I would be very interested to know if the recent wp_viewporter additions fix this, since the viewporter would presumably scale the surface and thus prevent the protocol error.

@flibitijibibo flibitijibibo modified the milestones: 2.0.22, 2.0.24 Mar 31, 2022
@flibitijibibo
Copy link
Collaborator

Going to make a potentially controversial call and make this a 3.0 task, rather than a 2.0.x task... see #3519 (comment) for more.

@Kontrabant
Copy link
Contributor

Any chance of testing this against the latest version and seeing if it's fixed? There's been a lot of cleanup of excess commits and configuration in the Wayland backend, particularly regarding entering/exiting fullscreen.

With the xdg_surface_set_window_geometry() call in Wayland_HandleResize() removed I'm not seeing any issues in Gnome or Weston, so maybe this is fixed now?

@Kontrabant
Copy link
Contributor

I played around with this a bit more and I can't replicate this at all with the current version, so I think it's fair to say this has been fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants