This repository has been archived by the owner on May 4, 2018. It is now read-only.

darwin: invoke mach_timebase_info only once #1325

Closed
wants to merge 1 commit into from

Conversation

indutny
Contributor

@indutny indutny commented Jun 19, 2014

According to @aktau, the call to mach_timebase_info costs 90% of the
total execution time of uv_hrtime(). The result of the call is static
on all existing platforms, so there is no need to invoke it multiple times.

@aktau

aktau commented Jun 19, 2014

Hey Fedor! Thanks for the very quick response. This does replace one syscall with another though (pthread_once). Any reason why it couldn't be moved inside of an if-check as I suggested?

if (unlikely(info.denom == 0)) {
  if (pthread_once(&once, uv__hrtime_init) != 0) {
    abort();
  }
}

That would avoid a syscall on the critical path. I realize that there might be a partial write in progress, which would make the condition fail, but perhaps we can solve that with an atomic get/set:

static uint64_t uv__hrtime_multiplier;

static void uv__hrtime_init(void) {
  if (mach_timebase_info(&uv__hrtime_timebase_info) != KERN_SUCCESS)
    abort();
  atomic_set_64(uv__hrtime_multiplier, uv__hrtime_timebase_info.numer / uv__hrtime_timebase_info.denom);
}

...

if (unlikely(atomic_load_64(uv__hrtime_multiplier) == 0)) {
  if (pthread_once(&once, uv__hrtime_init) != 0) {
    abort();
  }
}

Just writing down my thoughts at the moment.

EDIT: Also, what's the reason for the info = &uv__hrtime_timebase_info; line?

@indutny
Contributor Author

indutny commented Jun 19, 2014

May I ask you to do some benchmarks of it and post the results here? I wonder how slow pthread_once actually is.

@aktau

aktau commented Jun 19, 2014

I hope I can get to that at some point during the coming days! It was just my intuition that the call to mach_absolute_time() would reach close to the minimum of what a syscall can cost, so that any other syscall would at least double the runtime.

@aktau

aktau commented Jun 19, 2014

For my own reference: 8 byte reads/writes are atomic if the alignment is correct: http://stackoverflow.com/a/1292938/558819

This gives a clear path to the optimal codepath if it turns out that pthread_once is slower than expected (align the static and use an atomic op, which any decent compiler will lower to a single mov without barriers).
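
A minimal sketch of that idea, assuming the GCC/Clang __atomic builtins; packing both 32-bit timebase fields into one aligned 64-bit word is just one way to get a single-mov load, and none of these names are from libuv:

#include <stdint.h>
#include <stdlib.h>
#include <mach/mach_time.h>

/* 8-byte aligned static; relaxed atomic load/store lowers to a plain mov on x86-64 */
static uint64_t hrtime_scale;  /* high 32 bits: numer, low 32 bits: denom */

uint64_t hrtime_sketch(void) {
  uint64_t scale = __atomic_load_n(&hrtime_scale, __ATOMIC_RELAXED);
  if (scale == 0) {
    mach_timebase_info_data_t info;
    if (mach_timebase_info(&info) != KERN_SUCCESS)
      abort();
    /* pack both fields so a single 64-bit store publishes them together */
    scale = ((uint64_t) info.numer << 32) | info.denom;
    __atomic_store_n(&hrtime_scale, scale, __ATOMIC_RELAXED);
  }
  return mach_absolute_time() * (scale >> 32) / (uint32_t) scale;
}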

@indutny
Contributor Author

indutny commented Jun 19, 2014

Force-pushed with what is indeed a better solution.

@aktau

aktau commented Jun 19, 2014

Force-pushed with what is indeed a better solution.

This is the solution from before commit a7cedbe.

And that commit was made for thread-safety reasons.

@indutny
Contributor Author

indutny commented Jun 19, 2014

Oh, you are right, indeed. What do you think about the last commit?

@aktau

aktau commented Jun 19, 2014

I don't know for sure what ACCESS_ONCE does yet (didn't look). But I did do some really simplistic benchmarks: https://gist.github.com/aktau/9f52f812200d8d69a5d1

results:

$ echo "clang-3.4" && /usr/local/bin/clang-3.4 -std=c99 -g3 -O2 bench.c -o bench && ./bench && echo "gcc-4.9" && /usr/local/bin/gcc-4.9 -std=c99 -g3 -O2 bench.c -o bench && ./bench
clang-3.4
mach_absolute_time        took  436573634 ns,  21 ns per call, dummy 11220952913650381342
atomic + unlikely         took  738636415 ns,  36 ns per call,  169.2% of lower-bound, dummy 11232726204406528702
atomic + unlikely + align took  746793238 ns,  37 ns per call,  171.1% of lower-bound, dummy 11247578014054130890
ACCESS_ONCE               took  737353267 ns,  36 ns per call,  168.9% of lower-bound, dummy 11262409467510417712
pthread mutex             took 1707136147 ns,  85 ns per call,  391.0% of lower-bound, dummy 11286921008564394065
pthread once              took  803783314 ns,  40 ns per call,  184.1% of lower-bound, dummy 11311987089439266983
current libuv             took 4577809367 ns, 228 ns per call, 1048.6% of lower-bound, dummy 11365789124687598240
non threadsafe            took  734334774 ns,  36 ns per call,  168.2% of lower-bound, dummy 11418915193310618148
atomic                    took  741769774 ns,  37 ns per call,  169.9% of lower-bound, dummy 11433661662949986180

gcc-4.9
mach_absolute_time        took  476533200 ns,  23 ns per call, dummy 4143116828107367400
atomic + unlikely         took  744618783 ns,  37 ns per call,  156.3% of lower-bound, dummy 4155330547845523307
atomic + unlikely + align took  728663609 ns,  36 ns per call,  152.9% of lower-bound, dummy 4170048722709943038
ACCESS_ONCE               took  735128677 ns,  36 ns per call,  154.3% of lower-bound, dummy 4184707760867182211
pthread mutex             took 1748285394 ns,  87 ns per call,  366.9% of lower-bound, dummy 4209641516512722094
pthread once              took  836038519 ns,  41 ns per call,  175.4% of lower-bound, dummy 4235274408810859842
current libuv             took 4668379608 ns, 233 ns per call,  979.7% of lower-bound, dummy 4290621455330345238
non threadsafe            took  748268089 ns,  37 ns per call,  157.0% of lower-bound, dummy 4344585482989562800
atomic                    took  741199666 ns,  37 ns per call,  155.5% of lower-bound, dummy 4359549083582924228

So the atomic is about as fast as the non-threadsafe implementation.

EDIT: results updated with many more methods and 2 compilers.

@aktau

aktau commented Jun 19, 2014

I see that ACCESS_ONCE provokes a volatile read. I'm not entirely sure that's necessary here; we'd have to inspect the generated assembly. What I remember from doing a bit of threading work some time ago is that using volatile to alleviate threading issues is usually not correct, although it is better than nothing because it stops the compiler from reordering accesses.

On x64 the ACCESS_ONCE code will "probably" work because 8-byte read/writes are atomic and the timebase struct is 8 bytes... for now.
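
For reference, a two-argument ACCESS_ONCE matching the way it is invoked in this thread would presumably be defined along these lines (a guess, not the actual definition from the gist):

/* Forces a single read/write of the lvalue through a volatile-qualified
 * pointer: a compiler barrier only, no CPU memory fence is emitted. */
#define ACCESS_ONCE(type, var) (*(volatile type *) &(var))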

@indutny
Contributor Author

indutny commented Jun 19, 2014

is atomic a pthread_once implementation?

@aktau

aktau commented Jun 19, 2014

Nope, check out the gist :).

@indutny
Contributor Author

indutny commented Jun 19, 2014

Ah, I see... I think the current version should be fine too, though.

@saghul ?

@aktau

aktau commented Jun 19, 2014

Ah, I see... I think the current version should be fine too, though.

For x64 I mostly agree: the volatile marking should be sufficient. For other platforms I don't know. We also don't know how mach_timebase_info sets the values; we could look at the XNU sources, I suppose. Since @bnoordhuis was the one who did the first threading fix, perhaps he has an idea about this too.

EDIT: updated benchmarks with your implementation @indutny.

EDIT 2: what's really annoying is that calls to mach_timebase_info shouldn't be this slow... http://fxr.watson.org/fxr/source/osfmk/i386/rtclock.c?v=xnu-2050.18.24#L546

@saghul
Contributor

saghul commented Jun 19, 2014

I can't really add anything here, I trust your judgement, @indutny :-)

@aktau

aktau commented Jun 19, 2014

Just as a last addendum, the atomic code has (as far as I can see) quite good code generation. This is what clang 3.4.1 manages to generate on OSX with clang-3.4 -std=c99 -g3 -O2 -march=native bench.c -o bench && ./bench:

bench`uv_hrtime_atomic_bopt at bench.c:61:
bench[0x1000007b0]:  pushq  %rbp
bench[0x1000007b1]:  movq   %rsp, %rbp
bench[0x1000007b4]:  subq   $0x10, %rsp
bench[0x1000007b8]:  movl   0x88e(%rip), %eax         ; uv_hrtime_atomic_bopt.info.1
bench[0x1000007be]:  testl  %eax, %eax
bench[0x1000007c0]:  je     0x1000007e5               ; uv_hrtime_atomic_bopt + 53 at bench.c:75
bench[0x1000007c2]:  callq  0x100000dd6               ; symbol stub for: mach_absolute_time
bench[0x1000007c7]:  movl   0x87b(%rip), %ecx         ; uv_hrtime_atomic_bopt.info.0
bench[0x1000007cd]:  imulq  %rax, %rcx
bench[0x1000007d1]:  movl   0x875(%rip), %esi         ; uv_hrtime_atomic_bopt.info.1
bench[0x1000007d7]:  xorl   %edx, %edx
bench[0x1000007d9]:  movq   %rcx, %rax
bench[0x1000007dc]:  divq   %rsi
bench[0x1000007df]:  addq   $0x10, %rsp
bench[0x1000007e3]:  popq   %rbp
bench[0x1000007e4]:  ret
bench[0x1000007e5]:  leaq   -0x8(%rbp), %rdi
bench[0x1000007e9]:  callq  0x100000ddc               ; symbol stub for: mach_timebase_info
bench[0x1000007ee]:  testl  %eax, %eax
bench[0x1000007f0]:  jne    0x100000806               ; uv_hrtime_atomic_bopt + 86 at bench.c:69
bench[0x1000007f2]:  movl   -0x8(%rbp), %eax
bench[0x1000007f5]:  xchgl  %eax, 0x84d(%rip)         ; uv_hrtime_atomic_bopt.info.0
bench[0x1000007fb]:  movl   -0x4(%rbp), %eax
bench[0x1000007fe]:  xchgl  %eax, 0x848(%rip)         ; uv_hrtime_atomic_bopt.info.1
bench[0x100000804]:  jmp    0x1000007c2               ; uv_hrtime_atomic_bopt + 18 at bench.c:75
bench[0x100000806]:  callq  0x100000dd0               ; symbol stub for: abort

I'm not entirely sure why clang decides to use xchgl, since the function itself is 8-byte aligned and thus the struct is as well; that should make the assignment a simple mov. GCC decides to use an mfence + mov but otherwise generates similar code, so something must be up with that. At any rate, the fast path uses plain movs, which is as good as we can get without pre-multiplying the conversion factor (which could lose accuracy in some cases, I guess, as currently the numerator is multiplied with the absolute time first).
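
To make the accuracy point concrete (with made-up timebase values):

/* Suppose numer = 125, denom = 3, t = 1000000 ticks:
 *   t * numer / denom   = 41666666 ns   (multiply first, then divide)
 *   t * (numer / denom) = 41000000 ns   (125 / 3 truncates to 41: ~1.6% low)
 */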

@aktau

aktau commented Jun 19, 2014

And as a last note, if libuv ever needs to run on PowerPC-based Macs...: http://stackoverflow.com/questions/23378063/how-can-i-use-mach-absolute-time-without-overflowing

if (mach_timebase_info(&info) != KERN_SUCCESS)
  abort();
if (ACCESS_ONCE(uint32_t, info.numer) == 0 &&
    ACCESS_ONCE(uint32_t, info.denom) == 0 &&
@bnoordhuis
Contributor

That's not entirely thread-safe, not even with the x86's relaxed memory model because loads can still be reordered with stores to different locations. While some CPU models implement stronger guarantees than that, I wouldn't rely on it. Remember that ACCESS_ONCE is just a compiler barrier, not a memory barrier.

@aktau

Remember that ACCESS_ONCE is just a compiler barrier, not a memory barrier.

An interesting article where the distinction is discussed further in the comment section: http://bartoszmilewski.com/2008/11/05/who-ordered-memory-fences-on-an-x86/

I've been trying to reason about my version with atomics (https://gist.github.com/aktau/9f52f812200d8d69a5d1). I used __ATOMIC_SEQ_CST because anything else would likely drive me insane.

Vis-a-vis loads/stores I have this:

1. load denom (if-check)
2. store numer (only if if-check failed)
3. store denom (only if if-check failed)
4. load denom (for division, no order guarantees)
5. load numer (for multiplication, no order guarantees)

Going through this checklist with one thread setting the variables (steps 2 and 3) and another in the process of calculating the multiplier (steps 4 and 5), I'm thinking this should hold:

If we reach steps 4/5, then denom has been assigned (step 3). This means we only have to make absolutely sure that step 2 happens (and is visible) before step 3.

I think that some of the default x86 memory guarantees ("Stores are not reordered with other stores" and "Stores are not reordered with older loads") make this assertion true. However, for processors that don't give such lavish guarantees, the memory fences introduced by the atomic calls should be sufficient. Am I right or completely off the ball here, @bnoordhuis? I know it's very tricky and that it's all too easy to think an implementation with atomics is correct when it isn't.
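
For illustration, a minimal sketch of the ordering described above, using release/acquire pairs from the GCC/Clang __atomic builtins (the gist uses __ATOMIC_SEQ_CST, which is at least as strong; all names here are made up for the sketch):

#include <stdint.h>
#include <stdlib.h>
#include <mach/mach_time.h>

static uint32_t g_numer;  /* zero-initialized statics; denom == 0 means "not set up yet" */
static uint32_t g_denom;

static void timebase_publish(void) {
  mach_timebase_info_data_t info;
  if (mach_timebase_info(&info) != KERN_SUCCESS)
    abort();
  /* step 2: store numer first... */
  __atomic_store_n(&g_numer, info.numer, __ATOMIC_RELAXED);
  /* step 3: ...then publish denom with release semantics, so any thread that
   * observes a non-zero denom also observes the numer written above */
  __atomic_store_n(&g_denom, info.denom, __ATOMIC_RELEASE);
}

uint64_t hrtime_ordered(void) {
  /* steps 1/4: acquire-load denom */
  uint64_t denom = __atomic_load_n(&g_denom, __ATOMIC_ACQUIRE);
  if (denom == 0) {
    timebase_publish();
    denom = __atomic_load_n(&g_denom, __ATOMIC_ACQUIRE);
  }
  /* step 5: safe, because the acquire of a non-zero denom synchronizes with
   * the release store in timebase_publish() */
  uint64_t numer = __atomic_load_n(&g_numer, __ATOMIC_RELAXED);
  return mach_absolute_time() * numer / denom;
}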

@bnoordhuis
Contributor

With the caveat that it's 1.30 AM over here and it's been a long day: yes, that looks correct. It's still possible for multiple threads to end up calling mach_timebase_info(), but that seems harmless because it's "pure" (AFAIK): it always returns the same value.

@indutny
Contributor Author

@bnoordhuis do you mean my commit or @aktau's description? I think this diff is correct: since it enters mach_timebase_info only if both fields are observed to be 0, it's no big deal how many times mach_timebase_info will actually be called.

@bnoordhuis
Contributor

@aktau's description. The bug in this PR is that, because you don't know whether mach_timebase_info() stores numer or denom first, you can see a non-zero numer in the check and then divide by zero because denom hasn't been updated yet. I won't speculate on the probability, but it's, ah, non-zero.
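
To spell out one possible interleaving (assuming, for the sake of the example, that mach_timebase_info() happens to write numer before denom; the actual store order is not documented):

/* Thread A (initializing)              Thread B (calling uv_hrtime)
 *
 * writes info.numer = N  (N != 0)
 *                                      check sees numer != 0 -> skips init
 *                                      computes t * numer / denom
 *                                      denom is still 0 -> division by zero
 * writes info.denom = D  (too late)
 */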

@aktau

The bug in this PR is that, because you don't know whether mach_timebase_info() stores numer or denom first, you can see a non-zero numer in the check and then divide by zero because denom hasn't been updated yet.

It's useful to add that even if the order of the numer/denom checks in the if-statement is reversed, there could still be a problem when only denom has been assigned and numer hasn't. This wouldn't divide by 0, but it would still produce a wrong result (0).

On x86, it seems that assigning the info structure from a temporary variable as I have done, but with the volatile modifier instead of atomics, will provide a correct result (i.e. only a compiler fence). However, this is not guaranteed for other ISAs, as you have mentioned, @bnoordhuis. This is why I used the atomics, which generate memory fences (mfence or xchgl on x86). Technically I think only a store fence (sfence on x86) is necessary, but switching to that would be of dubious benefit: we already see that the version with fully fenced atomics is just as fast as the fastest non-threadsafe version.

@aktau

aktau commented Jun 20, 2014

Benchmarks updated. It seems that a pthread_once-based implementation is a decent contender:

  • with gcc: atomics are 55.5% slower than a pure mach_absolute_time call, pthread_once is 75.4% slower
  • with clang: atomics are 69.9% slower than a pure mach_absolute_time call, pthread_once is 84.1% slower

(Take these numbers with a small grain of salt: they haven't been statistically analyzed and the standard deviation could be a few percent.)

@bnoordhuis
Contributor

It seems that a pthread_once-based implementation is a decent contender

Yes, that sounds plausible. Might be easiest to just stick in a pthread_once() and call it a day.

@aktau

aktau commented Jun 20, 2014

Yes, that sounds plausible. Might be easiest to just stick in a pthread_once() and call it a day.

It would at least be safe, work with all compilers that libuv currently works with (libuv requires pthreads support), and be 5.5x faster than the current implementation (atomics would be 6.2x faster, for reference). I thus vote for something like this:

static mach_timebase_info_data_t once_info;
static void init_info_once(void) {
  if (mach_timebase_info(&once_info) != KERN_SUCCESS)
    abort();
}

static pthread_once_t once = PTHREAD_ONCE_INIT;
uint64_t uv__hrtime(void) {
  pthread_once(&once, init_info_once);

  return mach_absolute_time() * once_info.numer / once_info.denom;
}

(with whatever variable naming scheme you prefer).

@indutny
Contributor Author

indutny commented Jun 24, 2014

@bnoordhuis I'm OK with using pthread_once. I just want to know what is really problematic in my code right now; could you please post a scenario where it fails?

As far as I can see, it will use numer and denom only if they are non-zero, and initialize them if either of them is zero (just fixed this). Is that correct?

@bnoordhuis
Contributor

As far as I can see, it will use numer and denom only if they are non-zero, and initialize them if either of them is zero (just fixed this). Is that correct?

It looks correct to me. There is a general caveat with mixing volatile and non-volatile loads in that the compiler can (at least in theory) reorder them in such a way that the non-volatile load comes first; volatile loads and stores only have a 'happens before' relationship with other volatile loads and stores. That shouldn't apply here because of the side effects of the system call.

@aktau

aktau commented Jun 24, 2014

It looks correct to me. There is a general caveat with mixing volatile and non-volatile loads in that the compiler can (at least in theory) reorder them in such a way that the non-volatile load comes first; volatile loads and stores only have a 'happens before' relationship with other volatile loads and stores. That shouldn't apply here because of the side effects of the system call.

If it were generally correct, then it would be best to use it, as it takes 32ns instead of 40ns (pthread_once) on my system. However, x86 seems to have a particularly nice memory model; do we know whether the same is true for all systems libuv runs on? I reckon the optimal + guaranteed case is only possible with atomics.

@bnoordhuis
Contributor

That's a good point. I've been reasoning from an x86 perspective (libuv only supports OS X 10.6+, so that effectively restricts it to x86), but libuv also runs on iOS. That means ARM, and ARM has much weaker guarantees. I cast my vote for pthread_once().

@aktau

aktau commented Jun 24, 2014

That means ARM, and ARM has much weaker guarantees.

Indeed. FWIW, atomics are suitably implemented on all popular systems supported by clang/gcc, the Sun compiler (Solaris) and MSVC (Windows). An example from nanomsg: https://github.com/nanomsg/nanomsg/blob/master/src/utils/atomic.c (these are not the operations we would need, but one gets the gist).
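
To make that concrete, a hypothetical portable 32-bit atomic load in that spirit (only the GCC/Clang and MSVC arms are shown; this is not libuv's or nanomsg's actual API):

#include <stdint.h>
#if defined(_MSC_VER)
#include <intrin.h>
#endif

static uint32_t atomic_load_u32(uint32_t *p) {
#if defined(_MSC_VER)
  /* MSVC: _InterlockedOr with 0 is a full-barrier atomic read */
  return (uint32_t) _InterlockedOr((volatile long *) p, 0);
#else
  /* GCC >= 4.7 / Clang: __atomic builtins */
  return __atomic_load_n(p, __ATOMIC_SEQ_CST);
#endif
}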

@trevnorris
Contributor

Please quickly beat me down if I'm completely off my rocker, but isn't mach_absolute_time() a monotonic clock? If that's the case then it would stop counting when the CPU powers down.

@aktau

aktau commented Jun 24, 2014

Please quickly beat me down if I'm completely off my rocker, but isn't mach_absolute_time() a monotonic clock? If that's the case then it would stop counting when the CPU powers down.

Not only is that not guaranteed anywhere I could find, but even if it were the case, that doesn't impact this bug report.

@trevnorris
Contributor

[...] even if it were the case, that doesn't impact this bug report.

Thought it would, since the value returned by mach_absolute_time() would differ depending on whether the CPU had powered off or not.

@indutny
Contributor Author

indutny commented Jun 24, 2014

@bnoordhuis the compiler could not reorder them, because mach_timebase_info is a syscall.

@indutny
Contributor Author

indutny commented Jun 24, 2014

Ok, I guess I'll land it :)

@aktau

aktau commented Jun 24, 2014

Thought it would, since the value returned by mach_absolute_time() would differ depending on whether the CPU had powered off or not.

I'm not sure how anything would call mach_absolute_time() with the CPU powered off.

Ok, I guess I'll land it :)

Which one? :)

@indutny
Contributor Author

indutny commented Jun 24, 2014

@aktau this one

@indutny
Contributor Author

indutny commented Jun 24, 2014

Landed in 211bf4e

@indutny indutny closed this Jun 24, 2014
@aktau

aktau commented Jun 24, 2014

@aktau this one

Sure, we're (neovim) only on x86 for the time being anyway (and single-threaded) ;). Thanks for the fix!

@aktau

aktau commented Jun 24, 2014

Landed in 211bf4e

I'd like to try this out, by the way. I see you merged this into the v0.10 branch. I'm not sure what your git workflow is, but when would this reach master, or at least some recent version?

Neovim uses the v0.11 API, I believe.

@saghul
Contributor

saghul commented Jun 24, 2014

You can probably cherry-pick the commit. I'll merge 0.10 into master tomorrow, I don't trust myself right now :-P

@indutny indutny deleted the fix/performance-hrtime branch June 24, 2014 21:42
@aktau

aktau commented Jun 24, 2014

Perfect, thanks!

@aktau

aktau commented Jun 26, 2014

I can confirm that uv_hrtime no longer appears prominently in my stacktraces. Easy performance win. On to the next one ;).

aktau added a commit to aktau/neovim that referenced this pull request Jul 3, 2014
Fixes some bugs and increase the performance of uv_hrtime() on OSX, which
reduces its prominence in performance traces. This allows us to better see
what's really causing slowness.

ref:
- neovim#868
- joyent/libuv#1325
- https://github.com/joyent/libuv/releases
aktau added a commit to neovim/neovim that referenced this pull request Jul 6, 2014

fmoralesc pushed a commit to fmoralesc/neovim that referenced this pull request Aug 19, 2014