darwin: invoke mach_timebase_info
only once
#1325
Conversation
Hey Fedor! Thanks for the very quick response. This does replace one syscall with another, though. What about checking the cached value first and only falling back to pthread_once when it is still unset:

if (unlikely(info.denom == 0)) {
  if (pthread_once(&once, uv__hrtime_init) != 0)
    abort();
}

That would avoid a syscall on the critical path. I realize that there might be a partial write in progress which would make the condition fail, but perhaps we can solve that with an atomic get/set:

static uint64_t uv__hrtime_multiplier;

static void uv__hrtime_init(void) {
  if (mach_timebase_info(&uv__hrtime_timebase_info) != KERN_SUCCESS)
    abort();
  atomic_set_64(&uv__hrtime_multiplier,
                uv__hrtime_timebase_info.numer / uv__hrtime_timebase_info.denom);
}

...

if (unlikely(atomic_load_64(&uv__hrtime_multiplier) == 0)) {
  if (pthread_once(&once, uv__hrtime_init) != 0)
    abort();
}

Just writing down my thoughts at the moment. EDIT: Also, what's the reason for the |
May I ask you to do some benchmarks of it and post results here? I wonder how slow the |
I hope I can get to that at some point during the coming days! It was just my intuition that the call to |
For my own reference: 8-byte reads/writes are atomic if the alignment is correct: http://stackoverflow.com/a/1292938/558819 This provides a clear path to the optimal codepath if it turns out that |
Force pushed with indeed better solution. |
This is the solution from before commit a7cedbe, and that was done for thread-safety reasons. |
Oh, you are right, indeed. What do you think about the last commit? |
I don't know for sure what results:

$ echo "clang-3.4" && /usr/local/bin/clang-3.4 -std=c99 -g3 -O2 bench.c -o bench && ./bench && echo "gcc-4.9" && /usr/local/bin/gcc-4.9 -std=c99 -g3 -O2 bench.c -o bench && ./bench
clang-3.4
mach_absolute_time took 436573634 ns, 21 ns per call, dummy 11220952913650381342
atomic + unlikely took 738636415 ns, 36 ns per call, 169.2% of lower-bound, dummy 11232726204406528702
atomic + unlikely + align took 746793238 ns, 37 ns per call, 171.1% of lower-bound, dummy 11247578014054130890
ACCESS_ONCE took 737353267 ns, 36 ns per call, 168.9% of lower-bound, dummy 11262409467510417712
pthread mutex took 1707136147 ns, 85 ns per call, 391.0% of lower-bound, dummy 11286921008564394065
pthread once took 803783314 ns, 40 ns per call, 184.1% of lower-bound, dummy 11311987089439266983
current libuv took 4577809367 ns, 228 ns per call, 1048.6% of lower-bound, dummy 11365789124687598240
non threadsafe took 734334774 ns, 36 ns per call, 168.2% of lower-bound, dummy 11418915193310618148
atomic took 741769774 ns, 37 ns per call, 169.9% of lower-bound, dummy 11433661662949986180
gcc-4.9
mach_absolute_time took 476533200 ns, 23 ns per call, dummy 4143116828107367400
atomic + unlikely took 744618783 ns, 37 ns per call, 156.3% of lower-bound, dummy 4155330547845523307
atomic + unlikely + align took 728663609 ns, 36 ns per call, 152.9% of lower-bound, dummy 4170048722709943038
ACCESS_ONCE took 735128677 ns, 36 ns per call, 154.3% of lower-bound, dummy 4184707760867182211
pthread mutex took 1748285394 ns, 87 ns per call, 366.9% of lower-bound, dummy 4209641516512722094
pthread once took 836038519 ns, 41 ns per call, 175.4% of lower-bound, dummy 4235274408810859842
current libuv took 4668379608 ns, 233 ns per call, 979.7% of lower-bound, dummy 4290621455330345238
non threadsafe took 748268089 ns, 37 ns per call, 157.0% of lower-bound, dummy 4344585482989562800
atomic took 741199666 ns, 37 ns per call, 155.5% of lower-bound, dummy 4359549083582924228

So the atomic is about as fast as the non-threadsafe implementation.

EDIT: results updated with many more methods and 2 compilers. |
I see that On x64 the |
is |
Nope, check out the gist :). On Thursday, 19 June 2014, Fedor Indutny wrote:
|
Ah, I see... I think the current version should be fine too, though. @saghul ? |
For x64 I mostly agree: the volatile marking should be sufficient; for other platforms I don't know. We don't know how

EDIT: updated benchmarks with your implementation, @indutny.

EDIT 2: what's really annoying is that calls to |
I can't really add anything here, I trust your judgement, @indutny :-) |
Just as a last addendum: the atomic code has, as far as I can see, quite good code generation. This is what clang 3.4.1 manages to generate on OSX:

bench`uv_hrtime_atomic_bopt at bench.c:61:
bench[0x1000007b0]: pushq %rbp
bench[0x1000007b1]: movq %rsp, %rbp
bench[0x1000007b4]: subq $0x10, %rsp
bench[0x1000007b8]: movl 0x88e(%rip), %eax ; uv_hrtime_atomic_bopt.info.1
bench[0x1000007be]: testl %eax, %eax
bench[0x1000007c0]: je 0x1000007e5 ; uv_hrtime_atomic_bopt + 53 at bench.c:75
bench[0x1000007c2]: callq 0x100000dd6 ; symbol stub for: mach_absolute_time
bench[0x1000007c7]: movl 0x87b(%rip), %ecx ; uv_hrtime_atomic_bopt.info.0
bench[0x1000007cd]: imulq %rax, %rcx
bench[0x1000007d1]: movl 0x875(%rip), %esi ; uv_hrtime_atomic_bopt.info.1
bench[0x1000007d7]: xorl %edx, %edx
bench[0x1000007d9]: movq %rcx, %rax
bench[0x1000007dc]: divq %rsi
bench[0x1000007df]: addq $0x10, %rsp
bench[0x1000007e3]: popq %rbp
bench[0x1000007e4]: ret
bench[0x1000007e5]: leaq -0x8(%rbp), %rdi
bench[0x1000007e9]: callq 0x100000ddc ; symbol stub for: mach_timebase_info
bench[0x1000007ee]: testl %eax, %eax
bench[0x1000007f0]: jne 0x100000806 ; uv_hrtime_atomic_bopt + 86 at bench.c:69
bench[0x1000007f2]: movl -0x8(%rbp), %eax
bench[0x1000007f5]: xchgl %eax, 0x84d(%rip) ; uv_hrtime_atomic_bopt.info.0
bench[0x1000007fb]: movl -0x4(%rbp), %eax
bench[0x1000007fe]: xchgl %eax, 0x848(%rip) ; uv_hrtime_atomic_bopt.info.1
bench[0x100000804]: jmp 0x1000007c2 ; uv_hrtime_atomic_bopt + 18 at bench.c:75
bench[0x100000806]: callq 0x100000dd0 ; symbol stub for: abort

I'm not entirely sure why clang decides to use |
And as a last note, if libuv needs to run on PowerPC based macs...: http://stackoverflow.com/questions/23378063/how-can-i-use-mach-absolute-time-without-overflowing |
if (mach_timebase_info(&info) != KERN_SUCCESS)
  abort();
if (ACCESS_ONCE(uint32_t, info.numer) == 0 &&
    ACCESS_ONCE(uint32_t, info.denom) == 0 &&
That's not entirely thread-safe, not even with the x86's relaxed memory model because loads can still be reordered with stores to different locations. While some CPU models implement stronger guarantees than that, I wouldn't rely on it. Remember that ACCESS_ONCE is just a compiler barrier, not a memory barrier.
Remember that ACCESS_ONCE is just a compiler barrier, not a memory barrier.
An interesting article where the distinction is discussed further in the comment section: http://bartoszmilewski.com/2008/11/05/who-ordered-memory-fences-on-an-x86/
I've been trying to reason about my version with atomics (https://gist.github.com/aktau/9f52f812200d8d69a5d1). I used __ATOMIC_SEQ_CST
because anything else would likely drive me insane.
Vis-a-vis loads/stores I have this:
1. load denom (if-check)
2. store numer (only if if-check failed)
3. store denom (only if if-check failed)
4. load denom (for division, no order guarantees)
5. load numer (for multiplication, no order guarantees)
Going through this checklist for one thread setting the variables (steps 2 and 3) and another in the process of calculating the multiplier (steps 4 and 5). I'm thinking this should hold:
If we reach 4/5, then denom has been assigned (step 3). This means that we only have to make absolutely sure that step 2 happens (and is visible) before step 3.
I think that some of the default x86 memory guarantees ("Stores are not reordered with other stores" and "Stores are not reordered with older loads") make this assertion true. However, for processors that don't give such lavish guarantees, the memory fences introduced by the atomic calls should be sufficient. Am I right or completely off the ball here @bnoordhuis? I know it's very tricky and that it's all too easy to think an implementation with atomics is correct when it isn't.
With the caveat that it's 1.30 AM over here and it's been a long day: yes, that looks correct. It's still possible for multiple threads to end up calling mach_timebase_info() but that seems harmless because it's "pure" (AFAIK), it always returns the same value.
@bnoordhuis you meant my commit or @aktau's description? I think that this diff is correct: since it enters mach_timebase_info only if both fields are observed to be 0, there is no big deal about how many times mach_timebase_info will actually be called.
@aktau's description. The bug in this PR is that because you don't know if mach_timebase_info() stores numer or denom first, you see a non-zero numer in the check, then divide by zero because denom hasn't been updated yet. I won't speculate on the probability but it's, ah, non-zero.
The bug in this PR is that because you don't know if mach_timebase_info() stores numer or denom first, you see a non-zero numer in the check, then divide by zero because denom hasn't been updated yet.
It's useful to add that even if the order of numer/denom checks in the if-statement is reversed, there could still be a problem when only denom has been assigned and numer hasn't. This wouldn't divide by 0 but would provoke a wrong result (0) anyway.
On x86, it seems that assigning the info structure from a temporary variable as I have done, but with the volatile modifier instead of atomics (i.e. only a compiler fence), will provide a correct result. However, this is not guaranteed for other ISAs, as you have mentioned @bnoordhuis. This is why I used the atomics, which generate memory fences (mfence or xchgl on x86). Technically I think only a store fence (sfence on x86) is necessary, but weakening it that far would be a dubious gain: we already see that the version with fully fenced atomics is as fast as the fastest non-threadsafe version.

Benchmarks updated. It seems that a

(take the numbers with a small grain of salt: they haven't been statistically analyzed, and the standard deviation could be a few percent). |
Yes, that sounds plausible. Might be easiest to just stick in a pthread_once() and call it a day. |
It would at least be safe, work for all compilers with which libuv works right now (libuv requires pthreads support), and be 5.5x faster than the current implementation (atomics would be 6.2x faster, for reference). I thus vote for something like this:

static mach_timebase_info_data_t once_info;
static pthread_once_t once = PTHREAD_ONCE_INIT;

static void init_info_once(void) {
  if (mach_timebase_info(&once_info) != KERN_SUCCESS)
    abort();
}

uint64_t uv__hrtime(void) {
  pthread_once(&once, init_info_once);
  return mach_absolute_time() * once_info.numer / once_info.denom;
}

(with whatever variable naming scheme you prefer). |
@bnoordhuis I'm ok with using

As far as I can see, it will use |
According to @aktau, the call to `mach_timebase_info` costs 90% of the total execution time of `uv_hrtime()`. The result of the call is static on all existing platforms, so there is no need to invoke it multiple times.
It looks correct to me. There is a general caveat with mixing volatile and non-volatile loads in that the compiler can (at least in theory) reorder them in such a way that the non-volatile load comes first; volatile loads and stores only have a 'happens before' relationship with other volatile loads and stores. That shouldn't apply here because of the side effects of the system call. |
If it were generally correct, it would be best to use it, as it takes 32 ns instead of 40 ns (pthread_once) on my system. However, x86 seems to have a particularly nice memory model; do we know the same is true for all systems libuv is running on? I reckon the optimal + guaranteed case is only possible with atomics. |
That's a good point. I've been reasoning from an x86 perspective (libuv only supports OS X 10.6+ so that effectively restricts it to x86) but libuv also runs on iOS. That means ARM and ARM has much weaker guarantees. I cast my vote for pthread_once(). |
Indeed. Fwiw, atomics are suitably implemented on all popular systems supported by clang/gcc, sun compiler (Solaris) and msvc (Windows). An example from nanomsg: https://github.com/nanomsg/nanomsg/blob/master/src/utils/atomic.c (these are not the operations we would need, but one gets the gist). |
Please quickly beat me down if I'm completely off my rocker, but isn't |
Not only is that not guaranteed anywhere I could find, but even if it were the case, that doesn't impact this bug report. |
Thought it would since the value returned by |
@bnoordhuis the compiler could not reorder them, because |
Ok, I guess I'll land it :) |
I'm not sure how anything would call
Which one? :) |
@aktau this one |
Landed in 211bf4e |
Sure, we're (neovim) only on x86 for the time being anyway (and single-threaded) ;). Thanks for the fix! |
I'd like to try this out, by the way. I see you merged this into branch v0.10. I'm not sure what your git workflow is, but when would this reach master, or at least some recent version? Neovim uses the v0.11 API, I believe. |
You can probably cherry-pick the commit. I'll merge 0.10 into master tomorrow, I don't trust myself right now :-P |
Perfect, thanks! On Tuesday, 24 June 2014, Saúl Ibarra Corretgé wrote:
|
I can confirm that |
Fixes some bugs and increase the performance of uv_hrtime() on OSX, which reduces its prominence in performance traces. This allows us to better see what's really causing slowness. ref: - neovim#868 - joyent/libuv#1325 - https://github.com/joyent/libuv/releases