Thread.yield fairness with multiple busy yielding threads #2112

Merged: 3 commits merged into ocaml:trunk from ahh:ahh-yield-fix on Feb 8, 2019

Conversation

@ahh (Contributor) commented Oct 19, 2018

Here's a very hacky repro for the problem: the code below counts how many consecutive iterations a thread runs through a yielding loop before the yield actually hands control to another thread. At trunk, this runs 10,000+ times per thread, or more if the threads are pinned to the same CPU. With this patch, we reliably yield every time, because a yielding thread is properly marked as a waiter.

let threads = 4
let print = Sys.argv.(1) |> bool_of_string
let iters = Sys.argv.(2) |> int_of_string

let yields = ref 0
let are_ready = ref 0
let are_done = ref 0

let last = ref '!'

let start = ref 0.0

let stop = ref 0.0

let run_lengths = ref []

let report label cnt =
  let old = !run_lengths in
  run_lengths := (label, cnt) :: old;;

let threads =
  List.init threads (Thread.create (fun i ->
    (* Label our threads A..Z (clearer than TIDs, and integer indices are confusable
       with run lengths) *)
    let label = Char.unsafe_chr (Char.code 'A' + i) in
    incr are_ready;
    (* Don't make any progress until all threads are spawned and properly contending for
       the OCaml lock. *)
    if !are_ready = threads then start := (Unix.gettimeofday ());
    while !are_ready < threads do
      Thread.yield ()
    done;
    let consecutive = ref 0 in
    while !yields < iters do
      incr yields;
      last := label;
      Thread.yield ();
      incr consecutive;
      if not (!last = label)
      then (
        report label !consecutive;
        consecutive := 0)
    done;
    if !consecutive > 0 then report label !consecutive;
    incr are_done;
    if !are_done = threads then stop := (Unix.gettimeofday ())
  ));;


List.iter Thread.join threads;;

let report (label, count) = Printf.printf "%c ran for %d\n" label count;;

if print then List.iter report (List.rev !run_lengths)

let time = !stop -. !start;;
let switches = List.length !run_lengths;;
Printf.printf "Executed in %fs with %d switches\n" time switches

I can measure no overhead in throughput from the normal volume of yielding due to the tick thread (say, with two threads merrily allocating) but I'm happy to run more tests.

Thanks!

@xavierleroy (Contributor) commented Oct 19, 2018

Interesting proposal, thanks! @jhjourdan, would you like to look at it?

@xavierleroy (Contributor) commented Oct 19, 2018

For easy reference: MPR 7669 and GPR #1533.

@jhjourdan (Contributor) commented Oct 19, 2018

Interesting proposal, indeed! I agree that this approach is much more reliable than my dirty nanosleep hack, hence I am not surprised by your fairness measurements.

@xavierleroy, do you think we could add the code snippet above to the test suite? Or do you think these kinds of measurements are not reliable enough?

Moreover, it seems the Windows API now supports condition variables. We could re-implement the master lock using those, and hence have fairness on Windows too. They are not supported on Windows Server 2003 and Windows XP, so we would need to fall back to the current implementation there, or even drop support for XP (do people still care about Windows XP nowadays?).

Apart from that (and what is shown by the continuous integration), I reviewed the code, and it seems OK to me.
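
[Editorial sketch] For concreteness, here is a minimal sketch of what a condition-variable-based master lock could look like on Vista-and-later Windows, mirroring the POSIX layout. This is not the actual st_win32.h code: the type and function names (win32_masterlock, masterlock_acquire, ...) are made up for illustration; only the Win32 primitives (CRITICAL_SECTION, CONDITION_VARIABLE, SleepConditionVariableCS, WakeConditionVariable) are real API.

#include <windows.h>

/* Sketch only: a master lock built on a critical section plus a Vista+
   condition variable.  Names and structure are hypothetical. */
typedef struct {
  CRITICAL_SECTION lock;
  CONDITION_VARIABLE is_free;
  int busy;       /* 1 if some thread holds the master lock */
  int waiters;    /* number of threads blocked in acquire */
} win32_masterlock;

static void masterlock_init(win32_masterlock * m)
{
  InitializeCriticalSection(&m->lock);
  InitializeConditionVariable(&m->is_free);
  m->busy = 0;
  m->waiters = 0;
}

static void masterlock_acquire(win32_masterlock * m)
{
  EnterCriticalSection(&m->lock);
  while (m->busy) {
    m->waiters++;
    /* Atomically releases the critical section and sleeps; reacquires it
       before returning. */
    SleepConditionVariableCS(&m->is_free, &m->lock, INFINITE);
    m->waiters--;
  }
  m->busy = 1;
  LeaveCriticalSection(&m->lock);
}

static void masterlock_release(win32_masterlock * m)
{
  EnterCriticalSection(&m->lock);
  m->busy = 0;
  LeaveCriticalSection(&m->lock);
  WakeConditionVariable(&m->is_free);
}

With a structure like this, the yield trick used on POSIX (wake a competitor and register as a waiter before sleeping) would carry over directly; XP/2003 would keep the current code behind an ifdef.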

@dra27 (Contributor) commented Oct 19, 2018

If it’s one or two little #ifdefs to support Windows XP/2003 then fine to keep the support, but otherwise don’t worry - building OCaml on XP/2003 is already very difficult to set up, so the person/company intent on doing so will probably also be able to restore the old code!

@dra27 (Contributor) commented Oct 19, 2018

(There is a reasonable chance that multicore will absolutely force Windows 7+ anyway, which will be most welcome...)

@ahh force-pushed the ahh:ahh-yield-fix branch from de9e17d to 48c99e7 on Oct 19, 2018
@ahh (Contributor, Author) commented Oct 19, 2018

@jhjourdan I hadn't added an explicit test because yield behavior is so nondeterministic, but now that I think about it more, it's actually not so bad. I added a slightly simplified version of that test above; take a look. I don't really know how the test metadata works and I can't find a reference for it, so I'm not sure the comment block at the top of the test is correct, but it seems to pass as is and fail if I break it, so that's good.

(I have no objection to dropping the test, but if you want it, here you go.)

@ahh closed this on Oct 19, 2018
@ahh reopened this on Oct 19, 2018
@ahh (Contributor, Author) commented Oct 19, 2018

(Apologies, fatfingered "close".)

@jhjourdan (Contributor) commented Oct 20, 2018

@jhjourdan I hadn't added an explicit test because yield behavior is so nondeterministic, but now that I think about it more, it's actually not so bad. I added a slightly simplified version of that test above; take a look. I don't really know how the test metadata works and I can't find a reference for it, so I'm not sure the comment block at the top of the test is correct, but it seems to pass as is and fail if I break it, so that's good.

Great. I am personally in favor of adding this test. We need to know what others think. To make sure that we won't get any spurious failures, we could use a larger safety margin (e.g., a threshold of 10 instead of 3).

BTW, the CI still complains about some overly long lines.

@jhjourdan (Contributor) commented Oct 20, 2018

@dra27: any idea why the AppVeyor CI is failing? This seems completely unrelated to this PR.

@nojb (Contributor) commented Oct 20, 2018

From the log:

[00:07:11] i686-w64-mingw32-gcc -c -O -mms-bitfields -Wall -Wno-unused -fno-tree-vrp -Werror -DCAML_NAME_SPACE -DUNICODE -D_UNICODE -DWINDOWS_UNICODE=1 -I../../runtime  \
[00:07:11]    -o st_stubs_b.o st_stubs.c
[00:07:11] In file included from st_stubs.c:52:0:
[00:07:11] st_win32.h:104:36: error: unknown type name ‘st_masterlock’
[00:07:11]  static INLINE void st_thread_yield(st_masterlock * m)
[00:07:11]                                     ^~~~~~~~~~~~~
[00:07:11] st_stubs.c: In function ‘caml_thread_yield’:
[00:07:11] st_stubs.c:748:3: error: implicit declaration of function ‘st_thread_yield’ [-Werror=implicit-function-declaration]
[00:07:11]    st_thread_yield(&caml_master_lock);
[00:07:11]    ^~~~~~~~~~~~~~~
[00:07:11] cc1: all warnings being treated as errors
[00:07:11] make[3]: *** [Makefile:96: st_stubs_b.o] Error 1
@jhjourdan (Contributor) commented Oct 20, 2018

If it’s one or two little #ifdefs to support Windows XP/2003 then fine to keep the support, but otherwise don’t worry - building OCaml on XP/2003 is already very difficult to set up, so the person/company intent on doing so will probably also be able to restore the old code!

I guess that's something to discuss with the dev team. For now, I will prepare a PR that uses condition variables on versions newer than XP, behind ifdefs, once this one gets merged.

@ahh force-pushed the ahh:ahh-yield-fix branch 2 times, most recently from 9575bc8 to 61d6a92 on Oct 21, 2018
@ahh (Contributor, Author) commented Oct 22, 2018

Fixed the long lines. I think I also fixed the AppVeyor build, but I'm less certain of that.

@ahh force-pushed the ahh:ahh-yield-fix branch from 61d6a92 to c850862 on Oct 22, 2018
@ahh (Contributor, Author) commented Oct 22, 2018

I uploaded a fix for the Windows build, but the pull request isn't updating; I'm not sure why. ahh@c850862 shows the log

@jhjourdan (Contributor) commented Oct 22, 2018

The Windows build shows the following error message:

i686-w64-mingw32-gcc -c -O -mms-bitfields -Wall -Wno-unused -fno-tree-vrp -Werror -DCAML_NAME_SPACE -DUNICODE -D_UNICODE -DWINDOWS_UNICODE=1 -I../../runtime  \
   -o st_stubs_b.o st_stubs.c
In file included from st_stubs.c:52:0:
st_win32.h: In function ‘st_thread_yield’:
st_win32.h:161:5: error: expected ‘;’ before ‘}’ token
     }
     ^
@dra27 (Contributor) commented Oct 22, 2018

CIs both kicked...

@dra27 (Contributor) commented Oct 22, 2018

Ok, the failure now looks to be something to do with the new test - I’ll try to have a look later, unless someone else has access to a Windows box in the meantime?

@jhjourdan (Contributor) commented Oct 22, 2018

The test should be disabled on Windows. There is no fix for this issue on Windows (yet?), so there is no reason for this test to pass.

@damiendoligez (Member) commented Oct 22, 2018

@dra27 I'm looking into it.

@jhjourdan (Contributor) commented Oct 22, 2018

@damiendoligez : Again, I don't think there is anything to do here: st_win32.h does not contain the fix proposed by this PR, because it needs condition variables, which are not available on Windows XP.

At least for now, I think we should just disable the test on Windows, just like we did for #1533.

@ahh (Contributor, Author) commented Oct 22, 2018

I agree that it's probably best to disable the test on Windows, but it's at least a little surprising it doesn't work; note that Windows sleeps unconditionally since we don't have a waiter count there. That still leaves a lot of ways for the scheduler to hose us (deciding to wake up the waiter on a remote CPU which takes too long to run, for example!) though.

@damiendoligez (Member) commented Oct 22, 2018

Looks like the Windows scheduler is not as reliable as the Linux one. I'm getting some test runs that are OK, but most of the time I get a handful of random numbers. The highest I've observed so far is 5598.

Looks like disabling the test under Windows is the best solution.

@xavierleroy (Contributor) commented Oct 22, 2018

Regarding Windows: I was under the impression that the Windows scheduler is rather fair, and that Sleep(0) yields to other threads like we expect it to. So I'd rather stick with the current Win32 code than embark on a reimplementation with condition variables and the like.

Now, the hiccups that @damiendoligez reports maybe prove me wrong, or maybe they are exacerbated by our test setting (virtual machines running on a heavily-loaded Jenkins/Cloudstack installation).

I still think we should not include the test in our test suite, even under Linux, because it is testing something (a latency) that is not reproducible (especially on heavily-loaded hardware). A test should pass all the time or fail all the time.

@jhjourdan (Contributor) commented Oct 22, 2018

Now, the hiccups that @damiendoligez reports maybe prove me wrong, or maybe they are exacerbated by our test setting (virtual machines running on a heavily-loaded Jenkins/Cloudstack installation).

We already observed that when writing the test for #1533. We disabled it on Windows because Sleep(0) was not able to yield reliably to another thread.

I still think we should not include the test in our test suite, even under Linux, because it is testing something (a latency) that is not reproducible (especially on heavily-loaded hardware). A test should pass all the time or fail all the time.

We are not testing a latency here. The result of this test should not depend on whether the hardware is heavily loaded or not. Using condition variables, each call to yield will give control to another thread. The only case that would make the test fail is when the thread that receives control gets immediately interrupted by the scheduler. Moreover, if we use a good safety margin on the number of observed yield failures, this has to happen several times in a row, which is highly unlikely (if not impossible, since there are only a limited number of code points in the loop that would allow handling the scheduling signal).

@ahh (Contributor, Author) commented Oct 22, 2018

@jhjourdan (Contributor) commented Oct 31, 2018

So the situation is not worse than it was before, in any case.
I will in fact say that the program as written is not safe:
In short: the double checked locking should not be an issue.

Still, I think @xavierleroy has a point: this program is not safe under the C11 memory model. Not that I think such races must be avoided at all costs (the OCaml runtime is written in C99, where we do not have atomic variables), but they should be avoided when we can.

And, as it turns out, I suspect the first check of the waiters count is not really useful here. If there is only one active thread, then taking and releasing the lock should essentially be free, since the mutex's data structures should already be in that core's cache. Am I right in my reasoning?

@jhjourdan (Contributor) commented Oct 31, 2018

(That would actually mean we could get rid of the st_masterlock_waiters, which is good since it is broken on Windows.)

@jhjourdan (Contributor) commented Oct 31, 2018

Seem reasonable?

It seems to me, at least.

@ahh (Contributor, Author) commented Oct 31, 2018

Still, I think @xavierleroy has a point: This program is not safe under the C11 memory model. Not that I think this should be avoided at all costs (the OCaml runtime is written in C99 where we do not have atomic variables), but that should be avoided when we can.

I agree, but I want to emphasize that I didn't make this unsafe. Any place where what we're doing is unsafe and might cause problems would also be unsafe right now.

And, as it turns out, I suspect the first check of the waiters count is not really useful here. If there is only one active thread, then taking and releasing the lock should essentially be free, since the mutex's data structures should belong to the core's cache. Am I right in my reasoning?

Unfortunately you are not. Even an uncontended mutex acquire can be surprisingly noticeable; atomic instructions aren't free. (They are much cheaper than they were, say, fifteen years ago, but that's not the same as free.) I would be leery of removing the waiters check.

@jhjourdan (Contributor) commented Oct 31, 2018

Unfortunately you are not. Even an uncontended mutex acquire can be surprisingly noticeable; atomic instructions aren't free.

I've just done some tests on my machine:

#include <pthread.h>
#include <unistd.h>
#include <stdio.h>

int ended = 0;
void* thread1(void* arg) {
  sleep(10);
  ended = 1;
  return NULL; /* silence the missing-return warning */
}

pthread_mutex_t lock;
int n2 = 0;

void* thread2(void* arg) {
  while(!ended) {
    pthread_mutex_lock(&lock);
    pthread_mutex_unlock(&lock);
    n2++;
  }
  return NULL; /* silence the missing-return warning */
}

int main (void) {
  pthread_t t;
  pthread_create(&t, NULL, thread1, NULL);

  pthread_mutex_init(&lock, NULL);

  // pthread_create(&t, NULL, thread2, NULL);

  int n1 = 0;
  while(!ended) {
    pthread_mutex_lock(&lock);
    pthread_mutex_unlock(&lock);
    n1++;
  }

  printf("%d %d\n", n1, n2);

  return 0;
}

It seems that if only one thread is using the lock, then a full acquire-release cycle takes around 50 cycles. Do you really think 50 cycles is too much for Thread.yield to afford? Or do you think this could be more on other architectures?

@ahh (Contributor, Author) commented Oct 31, 2018

I do think it could be more on other architectures; it's hard to speak authoritatively without hardware to benchmark at hand, but in the past architects I've spoken to have generally not let me pin them down on the cost of an uncontended atomic operation, because it's often hard for them to make those guarantees. :) But that said, honestly, I do also think 50 cycles is a bad thing to accept here.

In particular: in Async, we yield at certain points that are fairly hot, because doing so can dramatically improve fairness (and, though the experiments here predate me, I believe also reduce the effective latency of long-running syscalls.) When we need to yield, doing so here is very effective; when we don't need to, yielding is free. So it's a very easy decision.

Adding any significant cost--even just a pthread lock!--makes it an actual, possibly difficult decision. Now we don't know what's worth it and what's not; we'll have to make judgment calls we could get wrong; the tradeoff isn't automatically one-sided. Would it still be worth doing in Async? Probably, but that's not the point. Right now we can unconditionally just yield and not worry about it in the case where we won't actually yield. That seems very valuable to me.

And to reiterate, the memory model issues aren't changed by this patch set at all. We do exactly the same locking and accesses. If we want to have a separate conversation about fixing that, we can, but I don't think it's necessary to do it here.

Anyway, that's my two cents.

@xavierleroy (Contributor) commented Oct 31, 2018

Concerning the unprotected read from m->waiters:

I know there's a race condition there in the original code; I might even have written it myself. The reasoning at the time was that if the value read from m->waiters is sometimes wrong (incl. out of thin air) we'll skip some yields or do some useless yields, but nothing really wrong will happen.

I think the new code has the same property, because of the lock-then-check-again trick, but I wanted to make sure this is the case.

As regards C11, yes, the race is undefined behavior, but I still hope that GCC and Clang refrain from producing crazy code here.
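
[Editorial sketch] To spell out the pattern being discussed, here is a rough sketch of the check-then-lock-then-check-again idiom. This is not the actual st_posix.h code; the struct layout and names are approximations used only for illustration.

#include <pthread.h>

/* Approximate master-lock layout, for the sake of the sketch. */
typedef struct {
  pthread_mutex_t lock;    /* protects busy and waiters */
  pthread_cond_t is_free;  /* signaled when the master lock is released */
  int busy;                /* 1 if some thread holds the master lock */
  int waiters;             /* number of threads blocked waiting for it */
} st_masterlock;

static void thread_yield_sketch(st_masterlock * m)
{
  /* Unsynchronized fast path: if the value read here is stale (or, under a
     hostile compiler, arbitrary), the worst outcomes are skipping a yield
     or taking the mutex needlessly. */
  if (m->waiters == 0) return;

  pthread_mutex_lock(&m->lock);
  /* Check again now that we hold the lock; this is the check that actually
     matters, and it prevents blocking forever when the first read lied. */
  if (m->waiters == 0) {
    pthread_mutex_unlock(&m->lock);
    return;
  }
  /* ... wake a competitor and wait our turn (see the full yield sketch
     after the commit message below) ... */
  pthread_mutex_unlock(&m->lock);
}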

@xavierleroy (Contributor) commented Oct 31, 2018

Concerning the optimization "if no waiters then don't even release and acquire the master lock":

I'm pretty sure this optimization was not there initially and was added later based on observations by serious users that it makes a noticeable difference in practice. I'd need to dig in the bug tracker and my old e-mail to find out the OS and the kind of application where the difference was observed.

@ahh (Contributor, Author) commented Oct 31, 2018

I know there's a race condition there in the original code; I might even have written it myself. The reasoning at the time was that if the value read from m->waiters is sometimes wrong (incl. out of thin air) we'll skip some yields or do some useless yields, but nothing really wrong will happen.

With the exception of memory model issues, I agree with this reasoning entirely.

I think the new code has the same property, because of the lock-then-check-again trick, but I wanted to make sure this is the case.

I agree that reading a zany value from m->waiters outside the lock is still fine--we maybe miss a yield, or maybe take a lock unnecessarily, but that's fine.

(The second check under the lock is necessary to prevent deadlock, obviously.)

As regards C11, yes, the race is undefined behavior, but I still hope that GCC and Clang refrain from producing crazy code here.

I think, in practice, you'll be mostly OK. I do think that moving to C11 and doing it right is worth doing when the portability issues are satisfactory, but it's not a high priority.
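
[Editorial sketch] A C11 cleanup of the fast path could look roughly like the sketch below. This is purely hypothetical and not part of this PR: it assumes the waiters field becomes an atomic_int and that a relaxed load is acceptable, precisely because a stale value only causes a missed or an extra yield.

#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical C11 variant of the master-lock struct. */
typedef struct {
  pthread_mutex_t lock;
  pthread_cond_t is_free;
  int busy;
  atomic_int waiters;   /* atomic: removes the formal data race on reads */
} st_masterlock_c11;

static int someone_is_waiting(st_masterlock_c11 * m)
{
  /* Relaxed is enough for the racy fast-path check. */
  return atomic_load_explicit(&m->waiters, memory_order_relaxed) > 0;
}

Increments and decrements of waiters, still performed while holding the mutex, would go through the same atomic type (a plain ++/-- on an atomic_int is an atomic read-modify-write in C11).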

@damiendoligez (Member) commented Nov 16, 2018

@ahh you should add @jhjourdan's name in the list of reviewers, then I think it's good to merge.

@ahh (Contributor, Author) commented Nov 17, 2018

@damiendoligez Can I do that as a non-maintainer? Sorry, I don't see a button.

@dbuenzli (Contributor) commented Nov 17, 2018

@ahh he meant in the changelog credits here.

@ahh force-pushed the ahh:ahh-yield-fix branch 2 times, most recently from c850862 to 9c7dc96 on Nov 17, 2018
@ahh (Contributor, Author) commented Nov 17, 2018

Ah, of course, that makes more sense. :) Done!

@ahh closed this on Nov 19, 2018
@ahh force-pushed the ahh:ahh-yield-fix branch from 9c7dc96 to 7859094 on Nov 19, 2018
Andrew Hunter added 3 commits Oct 15, 2018
Thread.yield invoked a trivial blocking section, which basically woke
up a competitor and then raced with them to get the OCaml lock back, invoking
nanosleep() to help guarantee that the yielder would lose the race. However,
until the yielder woke up again and attempted to take the OCaml lock, it
wouldn't be marked as a waiter.

As a result, if two threads A and B yielded to each other in a tight loop, A's
first yield would work well, but then B would execute 10000+ iterations of the
loop before A could mark itself as a waiter and be yielded to. This gets even
worse if A and B are pinned to the same CPU, in which case A can't be marked as
a waiter until the kernel preempts B, which can take tens or hundreds of
milliseconds!

So we reimplement yield; instead of dropping the lock and taking it again (with
a wait in the middle), atomically wake a competitor and mark the yielding thread
as a waiter. (We essentially inline a failed masterlock_acquire into
masterlock_release, specialized for the case where we know another waiter exists
and we want them to run instead.)

Now, threads yielding to each other very consistently succeed--in that same
tight loop, we see a change of control on every iteration (with some very rare
exceptions, most likely from other uncommon blocking region invocations.)

This also means we don't have to worry about the vagaries of kernel scheduling
and whether or not a yielding or a yielded-to thread gets to run first; we
consistently let a competing thread run whenever we yield, which is what the API
claims to do.
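
[Editorial sketch] Roughly, the reworked yield amounts to something like the following, reusing the hypothetical st_masterlock layout from the earlier sketch (a pthread mutex, a condition variable, a busy flag, and a waiters count). It is a simplified illustration of the approach described above, not the literal patch.

static void st_thread_yield_sketch(st_masterlock * m)
{
  pthread_mutex_lock(&m->lock);
  /* If nobody is waiting for the master lock, there is nobody to yield to:
     keep the lock and return immediately. */
  if (m->waiters == 0) {
    pthread_mutex_unlock(&m->lock);
    return;
  }
  /* Release the master lock and wake up one competitor... */
  m->busy = 0;
  pthread_cond_signal(&m->is_free);
  /* ...while counting ourselves as a waiter before going to sleep, so the
     thread we just woke sees a waiter and will hand the lock back. */
  m->waiters++;
  do {
    pthread_cond_wait(&m->is_free, &m->lock);
  } while (m->busy);
  m->waiters--;
  m->busy = 1;   /* we own the master lock again */
  pthread_mutex_unlock(&m->lock);
}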
@ahh reopened this on Nov 19, 2018
@ahh (Contributor, Author) commented Nov 19, 2018

Okay, sorry, I really badly screwed up that rebase and fix, and then screwed it up again trying to fix it. I think all is fine now--we have exactly the test from 88bde14, a Changes entry that mentions @jhjourdan, and the same code we've been happy with. PTAL. Sorry for the line noise here.

@damiendoligez merged commit 7a27cd5 into ocaml:trunk on Feb 8, 2019
2 checks passed:
continuous-integration/appveyor/pr: AppVeyor build succeeded
continuous-integration/travis-ci/pr: The Travis CI build passed
@damiendoligez (Member) commented Feb 8, 2019

Thanks!

@ahh (Contributor, Author) commented Feb 8, 2019

Thanks!
