Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

N3DS: Semaphore fixes. #6776

Merged
merged 5 commits into from
Dec 11, 2022
Merged

N3DS: Semaphore fixes. #6776

merged 5 commits into from
Dec 11, 2022

Conversation

FtZPetruska
Copy link
Contributor

Description

There were a handful of issues with Nintendo 3DS' semaphore implementation. Namely:

  • lacking a function to wait with a timeout, it had a naive implementation that tried once every millisecond.
  • contended Post/TryWait operations would hang the OS.

This PR addresses both issues:

  • the best delay between acquire attempts seems to be around 10-100µs, which gave about 10 timeouts each for 10000 iterations
  • for TryWait, a delay of at least 50µs after the call is needed to avoid the OS hang.

Testing was done using test/testsem.c.

I'll make a similar patch for the SDL2 branch since it is also applicable there.

Existing Issue(s)

N/A

if (timeoutNS == 0) {
int result = LightSemaphore_TryAcquire(&sem->semaphore, 1);
/* TryWait will block unless we wait about 50 microseconds */
SDL_DelayNS(SDL_US_TO_NS(50));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems wrong. Why would you need to delay after the tryacquire and before the code that acquired it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wish I had a better explanation, but trying to step through it in the debugger, stack gets corrupted without that wait and forces me to hard-shutdown the console.

It may also come from the test code calling TryWait repeatedly:

        while (alive) {
            if (SDL_SemTryWait(sem) == SDL_MUTEX_TIMEDOUT) {
                ++state->content_count;
            }
            ++state->loop_count;
        }

Since outside testing you'd most likely call Wait or WaitTimeout instead of busy-waiting TryWait.

Moving the wait from the SDL_sysem.c to testsem.c gives the same result:

        while (alive) {
            if (SDL_SemTryWait(sem) == SDL_MUTEX_TIMEDOUT) {
                ++state->content_count;
            }
            ++state->loop_count;
#ifdef __3DS__
            SDL_DelayNS(SDL_US_TO_NS(50));
#endif
        }

Would that be preferred ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, something else is happening in the system that is requiring the wait. If you have nothing else operating on the semaphore, do you still need the wait?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've spent the last few hours trying to debug the issue, here are my findings:

  • When just creating a semaphore with a high count and calling TryWait as fast as possible to get it down to 0, it works fine without any delay.
  • As soon as the consumer (calling TryWait) and the producer (calling Post) are on different threads, there is a chance of blocking.

To expand further on that last point:

  • It seems that any newly created thread will starve its parent unless it goes to sleep
  • With the default settings of testsem (10 threads), it takes about 50µs for all the consumer threads to be asleep and the main thread (producer) to be scheduled.
  • When bringing it down to one consumer thread, then any sleep duration in the thread will allow the main thread to be scheduled.
  • It seems that even when the two threads have the same affinity, the scheduler will not change threads and whichever one is currently running will starve all the others.

Considering all this, what should be done about TryWait ? Should it keep the current 50µs delay to pass testsem ? As soon as there are 2 consumer threads, it already takes a delay of about 15µs to avoid starvation.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a bug in the N3DS runtime or scheduler. I don't think we should fake things to get the tests to pass, as that'll bite someone later who is relying on these functions. Does the N3DS not have pre-emptive multitasking?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe what we need to do instead is have whoever is releasing the semaphore sleep immediately afterwards?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a bug in the N3DS runtime or scheduler. I don't think we should fake things to get the tests to pass, as that'll bite someone later who is relying on these functions. Does the N3DS not have pre-emptive multitasking?

From what I've found, the OS uses cooperative multitasking instead of preemptive. Which would explain why there are no context switching from a child thread unless it sleeps.

Maybe what we need to do instead is have whoever is releasing the semaphore sleep immediately afterwards?

This did not work, on its own. But here are a few things I noticed:

  • in testsem, the main thread does a busy wait until the semaphore is back to 0, to avoid starvation, it should have a wait:
    while (SDL_SemValue(sem)) {
      SDL_DelayNS(1);
    }
  • in TryWait, a delay of 1ns avoids starvation and lets all threads consume the semaphore equally.
  • in Post, a delay of 1ns will make the first consumer thread starve all the other threads from getting a successful acquire.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making the code more friendly to cooperative multitasking seems like a good idea. Feel free to submit a new PR (or change this one) to do that.

The 3DS has a cooperative threading model. Sleeping after TryWait and
WaitTimeout avoid starving other threads. It inccurs a runtime penalty,
but it's better than having to hard reset your console to recover from
a deadlock.
if (LightSemaphore_TryAcquire(&sem->semaphore, 1) != 0) {
return WaitOnSemaphoreFor(sem, timeoutNS);
}
/* Avoid starvation since this is a cooperative threading model */
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we want an unconditional sleep here. It would introduce unnecessary delay if the semaphore is always signaled. If the timeout is 0, the application explicitly does not want to delay.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With a timeout of 0:

  • if we don't sleep altogether we get the thread starvation discussed above
  • if we only sleep when failing to acquire the lock we will actually starve the other threads:
INFO: Doing 100000 contended Post/TryWait operations on semaphore using 10 threads
INFO: Took 3685 milliseconds, threads where contended 900000 out of 1000000 times in total (90.00%)
INFO: { 100000, 0, 0, 0, 0, 0, 0, 0, 0, 0 }

With a timeout != 0, and without that unconditional sleep, we will once again starve the other threads:

INFO: Doing 100000 contended Post/WaitTimeout operations on semaphore using 10 threads
INFO: Took 25 milliseconds, threads timed out 10 out of 100010 times in total (0.01%)
INFO: { 100000, 0, 0, 0, 0, 0, 0, 0, 0, 0 }

Results currently:

INFO: Doing 100000 contended Post/WaitTimeout operations on semaphore using 10 threads
INFO: Took 367 milliseconds, threads timed out 0 out of 100000 times in total (0.00%)
INFO: { 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000 }

INFO: Doing 100000 contended Post/TryWait operations on semaphore using 10 threads
INFO: Took 369 milliseconds, threads where contended 0 out of 100000 times in total (0.00%)
INFO: { 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000 }

So while this does not run the fastest, it avoids starvation.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The real problem is that this test is designed to test behavior with preemptive multitasking. 0% contention here isn't really a valid result.

Hmm, okay, can you very clearly document what's happening and why you're doing what you're doing?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm surprised that a timeout != 0 ends up starving threads. Wouldn't that translate into a cooperative yield for that amount of time?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the delay, here is a situation which I think illustrates the pros and cons of this approach.

First a bit of background on the 3DS's threading model. It's cooperative, meaning the thread must manually yield for another one to run. Userland processes can only run on a single core. And lastly, it seems that at equal priority, scheduling is done in a round-robin fashion.

For the example, we have one semaphore that is roughly equivalent to a thread-safe queue of data. One or more thread that append data to the queue, once the queue is full and/or they have no data left to add, they sleep. On the other hand, we have N threads consuming data from the queue. The consumer threads have three ways of waiting for data to be available:

  • A blocking wait:
    while(alive) {
      void *data = Acquire(queue);
      DoWork(data);
    }
  • A busy wait:
    while(alive) {
      void *data = TryAcquire(queue);
      if (data != NULL) {
        DoWork(data);
      }
    }
  • A wait with timeout:
    while(alive) {
      void *data = TryAcquireTimeout(queue, timeout);
      if (data != NULL) {
        DoWork(data);
      }
    }

There is one current issue with the busy wait. Whether there is data or not, the thread will never yield, starving all other threads from ever running again. To address this, the first idea is to yield execution if the queue was empty. This avoids locking up the system by running this thread indefinitely. But with that fix alone, we have a new problem, the first thread after a producer empties the queue and starves the other consumer threads.

There is a similar issue with the timeout wait. It will only sleep if there are no data, and the first consumer thread will starve the others.

To address this issue, we introduce an unconditional yield at the end of TryAcquire and TryAcquireTimeout. This spreads the data equally over each thread.

This has a few caveats:

  • If there is only one consumer thread, this brings an important runtime cost.
  • Potentially unexpected behaviour, when you try to acquire data with no timeout, you don't necessarily expect your thread to sleep.

As for the tl;dr:

  • By introducing a yield in TryWait if we fail, we avoid monopolising the CPU time indefinitely.
  • By introducing an unconditional yield in TryWait and TryWaitTimeout, we get the benefit from above; plus we avoid starving the other threads at the cost of higher runtime overhead.

This latter change allows for a more predictable behaviour. Preemptive systems would (likely) spread the work more or less equally on their thread, so someone using SDL would expect this to be true on supported platforms. This change emulates this behaviour, but I can totally understand the point where the runtime cost is too high, that this may be too unlikely of a situation, or if this unnecessary.

EDIT: I've noticed when re-reading that doing a blocking wait would also starve other consumer threads.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point, if you document it well, I'm fine with you committing the solution you're comfortable with.

@@ -208,6 +208,8 @@ TestOverheadContended(SDL_bool try_wait)
}
/* Make sure threads consumed everything */
while (SDL_SemValue(sem)) {
/* Friendlier with cooperative threading models */
SDL_DelayNS(1);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be a longer delay, like 10 ms.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, maybe not, since we're trying to time the whole thing.

@slouken slouken merged commit 053ce39 into libsdl-org:main Dec 11, 2022
@FtZPetruska FtZPetruska deleted the n3ds-sem branch December 11, 2022 19:30
FtZPetruska added a commit to FtZPetruska/SDL that referenced this pull request Dec 11, 2022
FtZPetruska added a commit to FtZPetruska/SDL that referenced this pull request Dec 11, 2022
FtZPetruska added a commit to FtZPetruska/SDL that referenced this pull request Dec 11, 2022
slouken pushed a commit that referenced this pull request Dec 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants