Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

2.0.18 regression: osk-sdl unit test fails #5136

Closed
smcv opened this issue Dec 24, 2021 · 16 comments
Closed

2.0.18 regression: osk-sdl unit test fails #5136

smcv opened this issue Dec 24, 2021 · 16 comments
Assignees
Milestone

Comments

@smcv
Copy link
Contributor

smcv commented Dec 24, 2021

After updating the SDL package in Debian from 2.0.16 to 2.0.18 (+ selected patches recommended by @slouken), an osk-sdl unit test that previously passed has started failing, blocking the "migration" process in Debian (the effect is similar to blocking the SDL update's inclusion in Debian betas).

I am a packager/maintainer of SDL in Debian, but I have no special knowledge of osk-sdl, which I installed for the first time in a test VM today to try to get an easier reproducer / more information.

Bisecting SDL indicates that commit 8bf32e1 "Improved SDL_PollEvent usage (#4794)" is the first one to fail. Reverting that commit (which requires a bit of conflict resolution to account for subsequent changes) makes the test pass again.

To reproduce:

  • have build-dependencies for osk-sdl and SDL, plus Xvfb, xauth and Debian's xvfb-run script (not part of Xvfb upstream, but many distros package it)
  • build SDL itself
  • git clone -b reproduce-sdl-regression https://gitlab.com/smcv/osk-sdl.git
    • this branch is based on upstream release 0.66, and adds a few commits to add more debugging to the test script, and adjust a font path in osd.conf to be correct for Debian/Ubuntu (please adjust as necessary for your distro)
  • cd osk-sdl
  • meson _build && ninja -C _build
  • export OSK_SDL_EXE_PATH=$(pwd)/_build/osk-sdl
  • export OSK_SDL_CONF_PATH=$(pwd)/osk.conf
  • export TEST_DEBUG=1
  • export TEST_RECORD=$(pwd)/test.webm
  • export LD_LIBRARY_PATH=$HOME/SDL/build/.libs (or wherever you built SDL)
  • xvfb-run -a -s '-ac -screen 0 480x800x24' ./test/test_functional.sh test_keyscript_mouse_toggle_osk

Expected result:

  • The test passes, logs Success! and exits with status 0
  • In particular, when the test "clicks" on the button to open the on-screen keyboard, osk-sdl logs INFO: keyboard toggle hit x: 460, y: 775
  • good.log, good.webm in https://people.debian.org/~smcv/bug1001809/

Actual result:

  • The test fails
  • In particular, when the test "clicks" on the button to open the on-screen keyboard, osk-sdl does not log anything and the OSK does not open
  • bad.log, bad.webm in https://people.debian.org/~smcv/bug1001809/

Please note that when the test fails, it leaks an osk-sdl process which will occupy 100% of a CPU. This is probably a bug in the test script and/or osk-sdl, but it's not my bug, and probably not SDL's bug either; it's just something to watch out for while debugging, because these processes will build up across multiple failures.

Setting the SDL_POLL_SENTINEL environment variable to either 0 or 1 does not seem to have any effect on this.

This is https://bugs.debian.org/1001809 downstream. At the moment I don't know whether it is a SDL bug, or whether osk-sdl is doing something wrong.

@smcv
Copy link
Contributor Author

smcv commented Dec 24, 2021

Here's the patch I tried to revert this change, which makes this test pass again: smcv@6677d17

@smcv
Copy link
Contributor Author

smcv commented Dec 24, 2021

The osk-sdl event loop as of 0.66 appears to be a busy-loop, which doesn't seem great to me (but it's not my code and maybe there is a good reason why it has to be):

	while (luksDev.isLocked() && !done) {
		show_osk = !keyboardToggle.isVisible();
		if (SDL_WaitEvent(&event)) {
			// an event was found
			switch (event.type) {
			// handle the keyboard
			case SDL_KEYDOWN:
...

@smcv
Copy link
Contributor Author

smcv commented Dec 24, 2021

This seems to avoid the regression, although I don't know whether it's correct. (Indentation deliberately not adjusted to highlight what the significant change is)

diff --git a/src/events/SDL_events.c b/src/events/SDL_events.c
index f61075eeb..84d7b51af 100644
--- a/src/events/SDL_events.c
+++ b/src/events/SDL_events.c
@@ -954,6 +954,7 @@ SDL_WaitEventTimeout(SDL_Event * event, int timeout)
     Uint32 expiration = 0;
 
     /* First check for existing events */
+    if (timeout == 0) {
     switch (SDL_PeepEvents(event, 1, SDL_GETEVENT, SDL_FIRSTEVENT, SDL_LASTEVENT)) {
     case -1:
         return 0;
@@ -967,6 +968,7 @@ SDL_WaitEventTimeout(SDL_Event * event, int timeout)
         /* Has existing events */
         return 1;
     }
+    }
 
     if (timeout > 0) {
         start = SDL_GetTicks();

@rohlem
Copy link
Contributor

rohlem commented Dec 27, 2021

I think I have an idea of what is happening:

  • The application's main event loop is filled with calls to SDL_PushEvent pushing a custom "render event" type into the queue.
    • In particular, the code handling the render event may continuously push new render events into the queue.
    • The programmer's assumption must have been that "it will be added to the end of the event queue, behind new input events".
    • This is not true, because SDL_PumpEvents (which adds the input events to the queue) was only called after the render event was added (from SDL_WaitEvent).
    • In 2.0.16, this resulted in one stale frame, but an otherwise working program.
  • The optimization in commit 8bf32e1 is incompatible with this usage scenario.
    • Its reasoning was that "as long as there's something in the event queue, WaitEvents doesn't need to call SDL_PumpEvents at all.
    • However, the evnt queue keeps getting filled without any calls to SDL_PumpEvents. Therefore render events pushed by the user are the only events ever pushed, and keep SDL 2.0.18 from pumping in SDL_WaitEvent.

I don't think this use case aligns with the intention of SDL's current API, though the discrepancies should probably be documented:

  • SDL_PushEvent queues the event behind queued events, which are only guaranteed as recent as the last call to SDL_PumpEvents.
    • Changing this would require SDL_PushEvent (which is currently thread-safe) to trigger a call to SDL_PumpEvents (which should currently only be called by the "main" thread that initialized the video subsystem).
  • SDL_WaitEvents and SDL_WaitEventsTimeout only call SDL_PumpEvents when the event queue is empty.
    • The posted patch only sidesteps the optimization (effectively reverting 8bf32e1 ) for calls to SDL_WaitEventTimeout with timeout != 0 (which leaves SDL_Poll incompatible for this usage scenario).

EDIT: created downstream issue 138.


EDIT 2: On second thought, maybe it would just work if we generalized the polling sentinel solution from timeout == 0 to timeout <= 0, when peeping and inserting.
Additionally for timeout <= 0, we wouldn't want to return early but fall through to the later sections that implement the waiting for events.
But maybe I'm missing some issue that would introduce.

EDIT 3: edit 2 would address a separate issue and doesn't yet solve this. We would need to repurpose the polling sentinel even more:

  • on the start of the first WaitEventTimeout, call PumpEvents, set a "did_pump_events" flag, and insert the sentinel event
  • on subsequent WaitEventTimeout, check the "did_pump_events" flag and don't insert the sentinel (so we don't flood the queue with them)
  • on encountering the sentinel, reset the "did_pump_events" flag for the next polling cycle

@slouken slouken added this to the 2.0.20 milestone Dec 28, 2021
@smcv
Copy link
Contributor Author

smcv commented Jan 3, 2022

There are three modes:

  • polling (timeout == 0, SDL_PollEvent)
  • waiting until event or timeout (timeout > 0)
  • waiting until event, no timeout (timeout < 0, SDL_WaitEvent)

#4794 seems to have been intended to help the polling case, but osk-sdl is using the waiting mode (although because of the way it re-queues events, there is usually an event available, so it rarely actually waits).

I think perhaps ideally we only want to enter the First check for existing events block if there is a sentinel already queued, and not if the queue contains only non-sentinels?

@cgutman
Copy link
Collaborator

cgutman commented Jan 3, 2022

My opinion is that we should just document this, especially if this is the only known application affected.

The optimization of avoiding SDL_PumpEvents() is a nice performance uplift across the board, regardless of whether an application is doing SDL_PollEvent() or SDL_WaitEvent(). Doing a bunch of syscalls each time the application wants to get an event sucks, especially for high frequency input devices (a main target of #4794). It would be a shame to lose it in some/all cases due to a single broken application.

Normally I'd want to strongly avoid breaking applications, but in this particular case the application is using a pretty unusual event processing technique and the optimization has benefits for all SDL users.

@Undef-a
Copy link

Undef-a commented Jan 4, 2022

I've just tested the proposed fix for the problematic application (osk-sdl) by adding a call to SDL_PumpEvents() into it's main loop. This resolved the deadlock, but SDL_MOUSEBUTTONDOWN and SDL_MOUSEBUTTONUP events still do not trigger. Some others, such as SDL_WINDOWEVENT and SDL_KEYDOWN seem to work fine.

This is the case when built on Debian Sid (SDL 2.0.18) but works fine when built on Debian Bullseye (2.0.16).

@icculus
Copy link
Collaborator

icculus commented Jan 6, 2022

Its reasoning was that "as long as there's something in the event queue, WaitEvents doesn't need to call SDL_PumpEvents at all.

Perhaps the plan should be "WaitEvents doesn't need to call SDL_PumpEvents again until the previous SDL_PumpEvents call has completely drained, as indicated by the sentinel event" instead of "WaitEvents doesn't need to call SDL_PumpEvents again until the event queue is empty."

This still solves the original problem (calling PumpEvents every time you poll/block for the next event in the queue), but should also fix this issue (be in a position where PumpEvents doesn't get called at all).

Am I way oversimplifying (or simply misunderstanding) the problem?

@smcv
Copy link
Contributor Author

smcv commented Jan 6, 2022

Perhaps the plan should be "WaitEvents doesn't need to call SDL_PumpEvents again until the previous SDL_PumpEvents call has completely drained, as indicated by the sentinel event"

I think so, yes.

If I'm understanding the intention of #4794 correctly, the problem was that high-frequency input devices would enqueue events every time SDL_PumpEvents is called, starving other event sources; and so you don't want to call SDL_PumpEvents until all the events generated by the previous SDL_PumpEvents have been dispatched to the library user (as indicated by reaching the sentinel). Is that right?

What's happening in osk-sdl is almost the other way round: every time osk-sdl processes one of its own "render events", it enqueues another render event, starving the event source that should be talking to X11. Like this:

  • start with one render event in the queue
  • the block that is affected by 2.0.18 regression: osk-sdl unit test fails #5136 (comment) sees that the event queue is non-empty, pops the render event from the queue and does an early-return
  • osk-sdl processes the render event and enqueues another one
  • we effectively go back to the beginning
  • the code that talks to X11 in SDL_PumpEvents is never reached
  • osk-sdl never escapes from the "currently animating" state

If we conditionally skipped the same block as in #5136 (comment), but instead of the condition for running it being if (timeout == 0), the condition was if (there is a sentinel event in the queue), then I think that would solve osk-sdl's problem:

To achieve that, I think it might be necessary to adjust some of the locking, so that the checks in the block affected by #5136 (comment) are all done within a single lock/unlock.

@smcv
Copy link
Contributor Author

smcv commented Jan 6, 2022

Also, if there is a sentinel but it is at the head of the queue, I think probably we want to fall through to starting a new poll cycle immediately, rather than considering the next event? Otherwise, osk-sdl could do this:

  • start with one render event in the queue
  • the early-return block is skipped
  • SDL_PumpEvents polls X11 and enqueues 0 or more input events, etc.
  • also enqueue 1 sentinel
  • dispatch the first event, which is the render event
    • osk-sdl enqueues another one [A], but it is enqueued after the sentinel
  • keep dispatching events until we reach the sentinel
  • when the sentinel is the head of the queue, the early-return block checks the new head of the queue
  • the new head of the queue is the render event added at [A]
  • we early-return and dispatch the render event
  • osk-sdl enqueues another render event
  • oops, we loop forever without polling X11 again

smcv added a commit to smcv/SDL that referenced this issue Jan 6, 2022
…sdl-org#5136)

If the library user responds to an event by adding a custom event to
the queue, as osk-sdl does during its startup, then the queue will never
fully drain, so we'll never leave the early-return block and poll for
new windowing system events.

To avoid this situation, add an internal variant of SDL_PeepEvents
that can be asked whether the event queue contains at least one sentinel.
This is done under the same lock as the rest of the interactions with
the event queue, so that it is atomic with respect to other concurrent
calls to SDL_PeepEvents_internal.

Signed-off-by: Simon McVittie <smcv@debian.org>
@smcv
Copy link
Contributor Author

smcv commented Jan 6, 2022

Perhaps the plan should be "WaitEvents doesn't need to call SDL_PumpEvents again until the previous SDL_PumpEvents call has completely drained, as indicated by the sentinel event"

#5184 implements this.

smcv added a commit to smcv/SDL that referenced this issue Jan 6, 2022
…sdl-org#5136)

If the library user responds to an event by adding a custom event to
the queue, as osk-sdl does during its startup, then the queue will never
fully drain, so we'll never leave the early-return block and poll for
new windowing system events.

To avoid this situation, add an internal variant of SDL_PeepEvents
that can be asked whether the event queue contains at least one sentinel.
This is done under the same lock as the rest of the interactions with
the event queue, so that it is atomic with respect to other concurrent
calls to SDL_PeepEvents_internal.

Signed-off-by: Simon McVittie <smcv@debian.org>
@0x1F9F1
Copy link
Contributor

0x1F9F1 commented Jan 6, 2022

I need to look at this a bit more carefully, but before the poll sentinel was added, wouldn't the event queue just continue to grow as the application falls further and further behind? Also the poll sentinel commit did add a hint to disable them (which just uses the regular API for disabling events).

@smcv
Copy link
Contributor Author

smcv commented Jan 6, 2022

Also the poll sentinel commit did add a hint to disable them

That hint doesn't disable the fast-path when the queue is non-empty, only the sentinel event, so it doesn't help osk-sdl.

@rohlem
Copy link
Contributor

rohlem commented Jan 6, 2022

I don't quite follow how the scenario in @Undef-a 's report above manifested:

I've just tested the proposed fix for the problematic application (osk-sdl) by adding a call to SDL_PumpEvents() into it's main loop. This resolved the deadlock, but SDL_MOUSEBUTTONDOWN and SDL_MOUSEBUTTONUP events still do not trigger. Some others, such as SDL_WINDOWEVENT and SDL_KEYDOWN seem to work fine.

This is the case when built on Debian Sid (SDL 2.0.18) but works fine when built on Debian Bullseye (2.0.16).

If manual pumping doesn't fix it, I don't see how adapting SDL_WaitEventTimeout would fix it either (unless I'm missing something) - is there another separate issue at play?
@smcv Have you attempted patching osk-sdl and encountered this as well?

@slouken
Copy link
Collaborator

slouken commented Jan 6, 2022

Can you try this approach?
#5187

@slouken slouken assigned slouken and unassigned icculus Jan 7, 2022
@slouken
Copy link
Collaborator

slouken commented Jan 7, 2022

This is fixed in the latest commits.

@slouken slouken closed this as completed Jan 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants