This repository has been archived by the owner on May 4, 2018. It is now read-only.

darwin: invoke mach_timebase_info only once #1325

Closed
wants to merge 1 commit into from

Conversation

indutny
Contributor

@indutny indutny commented Jun 19, 2014

According to @aktau, the call to mach_timebase_info costs 90% of the
total execution time of uv_hrtime(). The result of the call is static
on all existing platforms, so there is no need to invoke it multiple times.

@aktau

aktau commented Jun 19, 2014

Hey Fedor! Thanks for the very quick response. This does replace one syscall with another though (pthread_once). Any reason why it couldn't be moved inside of an if-check as I suggested?

if (unlikely(info.denom == 0)) {
  if (pthread_once(&once, uv__hrtime_init) != 0) {
    abort();
  }
}

That would avoid a syscall on the critical path. I realize that there might be a partial write in progress, which would make the condition fail, but perhaps we can solve that with an atomic get/set:

static uint64_t uv__hrtime_multiplier;

static void uv__hrtime_init(void) {
  if (mach_timebase_info(&uv__hrtime_timebase_info) != KERN_SUCCESS)
    abort();
  atomic_set_64(uv__hrtime_multiplier, uv__hrtime_timebase_info.numer / uv__hrtime_timebase_info.denom);
}

...

if (unlikely(atomic_load_64(uv__hrtime_multiplier) == 0)) {
  if (pthread_once(&once, uv__hrtime_init) != 0) {
    abort();
  }
}

Just writing down my thoughts at the moment.

EDIT: Also, what's the reason for the info = &uv__hrtime_timebase_info; line?

@indutny
Contributor Author

indutny commented Jun 19, 2014

May I ask you to do some benchmarks of it and post the results here? I wonder how slow pthread_once actually is.

@aktau

aktau commented Jun 19, 2014

I hope I can get to that at some point during the coming days! It was just my intuition that the call to mach_absolute_time() would reach close to the minimum of what a syscall can cost, so that any other syscall would at least double the runtime.

@aktau

aktau commented Jun 19, 2014

For my own reference: 8 byte reads/writes are atomic if the alignment is correct: http://stackoverflow.com/a/1292938/558819

This gives a clear path to the optimal codepath if it turns out that pthread_once is slower than expected (align the static and use an atomic op, which any decent compiler will lower to a single mov without barriers).
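
A minimal sketch of that idea, assuming the GCC/Clang __atomic builtins; packing both 32-bit timebase fields into one aligned 64-bit word is just one way to get a single-mov load, and none of these names are from libuv:

#include <stdint.h>
#include <stdlib.h>
#include <mach/mach_time.h>

/* 8-byte aligned static; relaxed atomic load/store lowers to a plain mov on x86-64 */
static uint64_t hrtime_scale;  /* high 32 bits: numer, low 32 bits: denom */

uint64_t hrtime_sketch(void) {
  uint64_t scale = __atomic_load_n(&hrtime_scale, __ATOMIC_RELAXED);
  if (scale == 0) {
    mach_timebase_info_data_t info;
    if (mach_timebase_info(&info) != KERN_SUCCESS)
      abort();
    /* pack both fields so a single 64-bit store publishes them together */
    scale = ((uint64_t) info.numer << 32) | info.denom;
    __atomic_store_n(&hrtime_scale, scale, __ATOMIC_RELAXED);
  }
  return mach_absolute_time() * (scale >> 32) / (uint32_t) scale;
}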

@indutny
Contributor Author

indutny commented Jun 19, 2014

Force-pushed with what is indeed a better solution.

@aktau

aktau commented Jun 19, 2014

Force-pushed with what is indeed a better solution.

This is the solution from before commit a7cedbe.

And that commit was made for thread-safety reasons.

@indutny
Contributor Author

indutny commented Jun 19, 2014

Oh, you are right, indeed. What do you think about the last commit?

@aktau

aktau commented Jun 19, 2014

I don't know for sure what ACCESS_ONCE does yet (didn't look). But I did do some really simplistic benchmarks: https://gist.github.com/aktau/9f52f812200d8d69a5d1

results:

$ echo "clang-3.4" && /usr/local/bin/clang-3.4 -std=c99 -g3 -O2 bench.c -o bench && ./bench && echo "gcc-4.9" && /usr/local/bin/gcc-4.9 -std=c99 -g3 -O2 bench.c -o bench && ./bench
clang-3.4
mach_absolute_time        took  436573634 ns,  21 ns per call, dummy 11220952913650381342
atomic + unlikely         took  738636415 ns,  36 ns per call,  169.2% of lower-bound, dummy 11232726204406528702
atomic + unlikely + align took  746793238 ns,  37 ns per call,  171.1% of lower-bound, dummy 11247578014054130890
ACCESS_ONCE               took  737353267 ns,  36 ns per call,  168.9% of lower-bound, dummy 11262409467510417712
pthread mutex             took 1707136147 ns,  85 ns per call,  391.0% of lower-bound, dummy 11286921008564394065
pthread once              took  803783314 ns,  40 ns per call,  184.1% of lower-bound, dummy 11311987089439266983
current libuv             took 4577809367 ns, 228 ns per call, 1048.6% of lower-bound, dummy 11365789124687598240
non threadsafe            took  734334774 ns,  36 ns per call,  168.2% of lower-bound, dummy 11418915193310618148
atomic                    took  741769774 ns,  37 ns per call,  169.9% of lower-bound, dummy 11433661662949986180

gcc-4.9
mach_absolute_time        took  476533200 ns,  23 ns per call, dummy 4143116828107367400
atomic + unlikely         took  744618783 ns,  37 ns per call,  156.3% of lower-bound, dummy 4155330547845523307
atomic + unlikely + align took  728663609 ns,  36 ns per call,  152.9% of lower-bound, dummy 4170048722709943038
ACCESS_ONCE               took  735128677 ns,  36 ns per call,  154.3% of lower-bound, dummy 4184707760867182211
pthread mutex             took 1748285394 ns,  87 ns per call,  366.9% of lower-bound, dummy 4209641516512722094
pthread once              took  836038519 ns,  41 ns per call,  175.4% of lower-bound, dummy 4235274408810859842
current libuv             took 4668379608 ns, 233 ns per call,  979.7% of lower-bound, dummy 4290621455330345238
non threadsafe            took  748268089 ns,  37 ns per call,  157.0% of lower-bound, dummy 4344585482989562800
atomic                    took  741199666 ns,  37 ns per call,  155.5% of lower-bound, dummy 4359549083582924228

So the atomic is about as fast as the non-threadsafe implementation.

EDIT: results updated with many more methods and 2 compilers.

@aktau

aktau commented Jun 19, 2014

I see that ACCESS_ONCE provokes a volatile read. I'm not entirely sure that's necessary here; we'd have to inspect the generated assembly. What I remember from doing a bit of threading work some time ago is that using volatile to alleviate threading issues is usually not correct, although it is better than nothing because it stops the compiler from reordering accesses.

On x64 the ACCESS_ONCE code will "probably" work because 8-byte read/writes are atomic and the timebase struct is 8 bytes... for now.
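
For reference, a two-argument ACCESS_ONCE matching the way it is invoked in this thread would presumably be defined along these lines (a guess, not the actual definition from the gist):

/* Forces a single read/write of the lvalue through a volatile-qualified
 * pointer: a compiler barrier only, no CPU memory fence is emitted. */
#define ACCESS_ONCE(type, var) (*(volatile type *) &(var))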

@indutny
Contributor Author

indutny commented Jun 19, 2014

is atomic a pthread_once implementation?

@aktau

aktau commented Jun 19, 2014

Nope, check out the gist :).

@indutny
Contributor Author

indutny commented Jun 19, 2014

Ah, I see... I think the current version should be fine too, though.

@saghul ?

@aktau

aktau commented Jun 19, 2014

Ah, I see... I think the current version should be fine too, though.

For x64 I mostly agree: the volatile marking should be sufficient. For other platforms I don't know. We also don't know how mach_timebase_info sets the values; we could look at the XNU sources, I suppose. Since @bnoordhuis was the one who did the first threading fix, perhaps he has an idea about this too.

EDIT: updated benchmarks with your implementation @indutny.

EDIT 2: what's really annoying is that calls to mach_timebase_info shouldn't be this slow... http://fxr.watson.org/fxr/source/osfmk/i386/rtclock.c?v=xnu-2050.18.24#L546

@saghul
Contributor

saghul commented Jun 19, 2014

I can't really add anything here, I trust your judgement, @indutny :-)

@aktau

aktau commented Jun 19, 2014

Just as a last addendum, the atomic code has (as far as I can see) quite good code generation. This is what clang 3.4.1 manages to generate on OSX with clang-3.4 -std=c99 -g3 -O2 -march=native bench.c -o bench && ./bench:

bench`uv_hrtime_atomic_bopt at bench.c:61:
bench[0x1000007b0]:  pushq  %rbp
bench[0x1000007b1]:  movq   %rsp, %rbp
bench[0x1000007b4]:  subq   $0x10, %rsp
bench[0x1000007b8]:  movl   0x88e(%rip), %eax         ; uv_hrtime_atomic_bopt.info.1
bench[0x1000007be]:  testl  %eax, %eax
bench[0x1000007c0]:  je     0x1000007e5               ; uv_hrtime_atomic_bopt + 53 at bench.c:75
bench[0x1000007c2]:  callq  0x100000dd6               ; symbol stub for: mach_absolute_time
bench[0x1000007c7]:  movl   0x87b(%rip), %ecx         ; uv_hrtime_atomic_bopt.info.0
bench[0x1000007cd]:  imulq  %rax, %rcx
bench[0x1000007d1]:  movl   0x875(%rip), %esi         ; uv_hrtime_atomic_bopt.info.1
bench[0x1000007d7]:  xorl   %edx, %edx
bench[0x1000007d9]:  movq   %rcx, %rax
bench[0x1000007dc]:  divq   %rsi
bench[0x1000007df]:  addq   $0x10, %rsp
bench[0x1000007e3]:  popq   %rbp
bench[0x1000007e4]:  ret
bench[0x1000007e5]:  leaq   -0x8(%rbp), %rdi
bench[0x1000007e9]:  callq  0x100000ddc               ; symbol stub for: mach_timebase_info
bench[0x1000007ee]:  testl  %eax, %eax
bench[0x1000007f0]:  jne    0x100000806               ; uv_hrtime_atomic_bopt + 86 at bench.c:69
bench[0x1000007f2]:  movl   -0x8(%rbp), %eax
bench[0x1000007f5]:  xchgl  %eax, 0x84d(%rip)         ; uv_hrtime_atomic_bopt.info.0
bench[0x1000007fb]:  movl   -0x4(%rbp), %eax
bench[0x1000007fe]:  xchgl  %eax, 0x848(%rip)         ; uv_hrtime_atomic_bopt.info.1
bench[0x100000804]:  jmp    0x1000007c2               ; uv_hrtime_atomic_bopt + 18 at bench.c:75
bench[0x100000806]:  callq  0x100000dd0               ; symbol stub for: abort

I'm not entirely sure why clang decides to use xchgl, since the function itself is 8-byte aligned and thus the struct is as well; that should make the assignment a simple mov. GCC decides to use an mfence + mov but otherwise generates similar code, so something must be up with that. At any rate, the fast path uses plain movs, which is as good as we can get without pre-multiplying the conversion factor (which could lose accuracy in some cases, I guess, as currently the numerator is multiplied with the absolute time first).
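
To make the accuracy point concrete (with made-up timebase values):

/* Suppose numer = 125, denom = 3, t = 1000000 ticks:
 *   t * numer / denom   = 41666666 ns   (multiply first, then divide)
 *   t * (numer / denom) = 41000000 ns   (125 / 3 truncates to 41: ~1.6% low)
 */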

@aktau

aktau commented Jun 19, 2014

And as a last note, if libuv ever needs to run on PowerPC-based Macs...: http://stackoverflow.com/questions/23378063/how-can-i-use-mach-absolute-time-without-overflowing

if (mach_timebase_info(&info) != KERN_SUCCESS)
  abort();
if (ACCESS_ONCE(uint32_t, info.numer) == 0 &&
    ACCESS_ONCE(uint32_t, info.denom) == 0 &&
@bnoordhuis
Contributor

That's not entirely thread-safe, not even with the x86's relaxed memory model because loads can still be reordered with stores to different locations. While some CPU models implement stronger guarantees than that, I wouldn't rely on it. Remember that ACCESS_ONCE is just a compiler barrier, not a memory barrier.

@aktau

Remember that ACCESS_ONCE is just a compiler barrier, not a memory barrier.

An interesting article where the distinction is discussed further in the comment section: http://bartoszmilewski.com/2008/11/05/who-ordered-memory-fences-on-an-x86/

I've been trying to reason about my version with atomics (https://gist.github.com/aktau/9f52f812200d8d69a5d1). I used __ATOMIC_SEQ_CST because anything else would likely drive me insane.

Vis-a-vis loads/stores I have this:

1. load denom (if-check)
2. store numer (only if if-check failed)
3. store denom (only if if-check failed)
4. load denom (for division, no order guarantees)
5. load numer (for multiplication, no order guarantees)

Going through this checklist with one thread setting the variables (steps 2 and 3) and another in the process of calculating the multiplier (steps 4 and 5), I'm thinking this should hold:

If we reach steps 4/5, then denom has been assigned (step 3). This means we only have to make absolutely sure that step 2 happens (and is visible) before step 3.

I think that some of the default x86 memory guarantees ("Stores are not reordered with other stores" and "Stores are not reordered with older loads") make this assertion true. However, for processors that don't give such lavish guarantees, the memory fences introduced by the atomic calls should be sufficient. Am I right or completely off the ball here, @bnoordhuis? I know it's very tricky and that it's all too easy to think an implementation with atomics is correct when it isn't.
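
For illustration, a minimal sketch of the ordering described above, using release/acquire pairs from the GCC/Clang __atomic builtins (the gist uses __ATOMIC_SEQ_CST, which is at least as strong; all names here are made up for the sketch):

#include <stdint.h>
#include <stdlib.h>
#include <mach/mach_time.h>

static uint32_t g_numer;  /* zero-initialized statics; denom == 0 means "not set up yet" */
static uint32_t g_denom;

static void timebase_publish(void) {
  mach_timebase_info_data_t info;
  if (mach_timebase_info(&info) != KERN_SUCCESS)
    abort();
  /* step 2: store numer first... */
  __atomic_store_n(&g_numer, info.numer, __ATOMIC_RELAXED);
  /* step 3: ...then publish denom with release semantics, so any thread that
   * observes a non-zero denom also observes the numer written above */
  __atomic_store_n(&g_denom, info.denom, __ATOMIC_RELEASE);
}

uint64_t hrtime_ordered(void) {
  /* steps 1/4: acquire-load denom */
  uint64_t denom = __atomic_load_n(&g_denom, __ATOMIC_ACQUIRE);
  if (denom == 0) {
    timebase_publish();
    denom = __atomic_load_n(&g_denom, __ATOMIC_ACQUIRE);
  }
  /* step 5: safe, because the acquire of a non-zero denom synchronizes with
   * the release store in timebase_publish() */
  uint64_t numer = __atomic_load_n(&g_numer, __ATOMIC_RELAXED);
  return mach_absolute_time() * numer / denom;
}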

@bnoordhuis
Contributor

With the caveat that it's 1.30 AM over here and it's been a long day: yes, that looks correct. It's still possible for multiple threads to end up calling mach_timebase_info(), but that seems harmless because it's "pure" (AFAIK): it always returns the same value.

@indutny
Contributor Author

@bnoordhuis do you mean my commit or @aktau's description? I think this diff is correct: since it enters mach_timebase_info only if both fields are observed to be 0, it's no big deal how many times mach_timebase_info will actually be called.

@bnoordhuis
Contributor

@aktau's description. The bug in this PR is that, because you don't know whether mach_timebase_info() stores numer or denom first, you can see a non-zero numer in the check and then divide by zero because denom hasn't been updated yet. I won't speculate on the probability, but it's, ah, non-zero.
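
To spell out one possible interleaving (assuming, for the sake of the example, that mach_timebase_info() happens to write numer before denom; the actual store order is not documented):

/* Thread A (initializing)              Thread B (calling uv_hrtime)
 *
 * writes info.numer = N  (N != 0)
 *                                      check sees numer != 0 -> skips init
 *                                      computes t * numer / denom
 *                                      denom is still 0 -> division by zero
 * writes info.denom = D  (too late)
 */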

@aktau

The bug in this PR is that, because you don't know whether mach_timebase_info() stores numer or denom first, you can see a non-zero numer in the check and then divide by zero because denom hasn't been updated yet.

It's useful to add that even if the order of the numer/denom checks in the if-statement is reversed, there could still be a problem when only denom has been assigned and numer hasn't. This wouldn't divide by 0, but it would still produce a wrong result (0).

On x86, it seems that assigning the info structure from a temporary variable as I have done, but with the volatile modifier instead of atomics, will provide a correct result (i.e. only a compiler fence). However, this is not guaranteed for other ISAs, as you have mentioned, @bnoordhuis. This is why I used the atomics, which generate memory fences (mfence or xchgl on x86). Technically I think only a store fence (sfence on x86) is necessary, but switching to that would be of dubious benefit: we already see that the version with fully fenced atomics is just as fast as the fastest non-threadsafe version.

@aktau

aktau commented Jun 20, 2014

Benchmarks updated. It seems that a pthread_once-based implementation is a decent contender:

  • with gcc: atomics are 55.5% slower than a pure mach_absolute_time call, pthread_once is 75.4% slower
  • with clang: atomics are 69.9% slower than a pure mach_absolute_time call, pthread_once is 84.1% slower

(Take these numbers with a small grain of salt: they haven't been statistically analyzed and the standard deviation could be a few percent.)

@bnoordhuis
Contributor

It seems that a pthread_once-based implementation is a decent contender

Yes, that sounds plausible. Might be easiest to just stick in a pthread_once() and call it a day.

@aktau

aktau commented Jun 20, 2014

Yes, that sounds plausible. Might be easiest to just stick in a pthread_once() and call it a day.

It would at least be safe, work with all compilers that libuv currently works with (libuv requires pthreads support), and be 5.5x faster than the current implementation (atomics would be 6.2x faster, for reference). I thus vote for something like this:

static mach_timebase_info_data_t once_info;
static void init_info_once(void) {
  if (mach_timebase_info(&once_info) != KERN_SUCCESS)
    abort();
}

static pthread_once_t once = PTHREAD_ONCE_INIT;
uint64_t uv__hrtime(void) {
  pthread_once(&once, init_info_once);

  return mach_absolute_time() * once_info.numer / once_info.denom;
}

(with whatever variable naming scheme you prefer).

@indutny
Contributor Author

indutny commented Jun 24, 2014

@bnoordhuis I'm OK with using pthread_once. I just want to know what is really problematic in my code right now; could you please post a scenario where it fails?

As far as I can see, it will use numer and denom only if they are non-zero, and initialize them if either of them is zero (just fixed this). Is that correct?

@bnoordhuis
Contributor

As far as I can see, it will use numer and denom only if they are non-zero, and initialize them if either of them is zero (just fixed this). Is that correct?

It looks correct to me. There is a general caveat with mixing volatile and non-volatile loads in that the compiler can (at least in theory) reorder them in such a way that the non-volatile load comes first; volatile loads and stores only have a 'happens before' relationship with other volatile loads and stores. That shouldn't apply here because of the side effects of the system call.

@aktau

aktau commented Jun 24, 2014

It looks correct to me. There is a general caveat with mixing volatile and non-volatile loads in that the compiler can (at least in theory) reorder them in such a way that the non-volatile load comes first; volatile loads and stores only have a 'happens before' relationship with other volatile loads and stores. That shouldn't apply here because of the side effects of the system call.

If it were generally correct, then it would be best to use it, as it takes 32ns instead of 40ns (pthread_once) on my system. However, x86 seems to have a particularly nice memory model; do we know whether the same is true for all systems libuv runs on? I reckon the optimal + guaranteed case is only possible with atomics.

@bnoordhuis
Contributor

That's a good point. I've been reasoning from an x86 perspective (libuv only supports OS X 10.6+, so that effectively restricts it to x86), but libuv also runs on iOS. That means ARM, and ARM has much weaker guarantees. I cast my vote for pthread_once().

@aktau

aktau commented Jun 24, 2014

That means ARM, and ARM has much weaker guarantees.

Indeed. FWIW, atomics are suitably implemented on all popular systems supported by clang/gcc, the Sun compiler (Solaris) and MSVC (Windows). An example from nanomsg: https://github.com/nanomsg/nanomsg/blob/master/src/utils/atomic.c (these are not the operations we would need, but one gets the gist).
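
To make that concrete, a hypothetical portable 32-bit atomic load in that spirit (only the GCC/Clang and MSVC arms are shown; this is not libuv's or nanomsg's actual API):

#include <stdint.h>
#if defined(_MSC_VER)
#include <intrin.h>
#endif

static uint32_t atomic_load_u32(uint32_t *p) {
#if defined(_MSC_VER)
  /* MSVC: _InterlockedOr with 0 is a full-barrier atomic read */
  return (uint32_t) _InterlockedOr((volatile long *) p, 0);
#else
  /* GCC >= 4.7 / Clang: __atomic builtins */
  return __atomic_load_n(p, __ATOMIC_SEQ_CST);
#endif
}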

@trevnorris
Contributor

Please quickly beat me down if I'm completely off my rocker, but isn't mach_absolute_time() a monotonic clock? If that's the case then it would stop counting when the CPU powers down.

@aktau

aktau commented Jun 24, 2014

Please quickly beat me down if I'm completely off my rocker, but isn't mach_absolute_time() a monotonic clock? If that's the case then it would stop counting when the CPU powers down.

Not only is that not guaranteed anywhere I could find, but even if it were the case, that doesn't impact this bug report.

@trevnorris
Contributor

[...] even if it were the case, that doesn't impact this bug report.

Thought it would, since the value returned by mach_absolute_time() would differ depending on whether the CPU had powered off or not.

@indutny
Contributor Author

indutny commented Jun 24, 2014

@bnoordhuis the compiler could not reorder them, because mach_timebase_info is a syscall.

@indutny
Contributor Author

indutny commented Jun 24, 2014

Ok, I guess I'll land it :)

@aktau

aktau commented Jun 24, 2014

Thought it would, since the value returned by mach_absolute_time() would differ depending on whether the CPU had powered off or not.

I'm not sure how anything would call mach_absolute_time() with the CPU powered off.

Ok, I guess I'll land it :)

Which one? :)

@indutny
Contributor Author

indutny commented Jun 24, 2014

@aktau this one

@indutny
Contributor Author

indutny commented Jun 24, 2014

Landed in 211bf4e

@indutny indutny closed this Jun 24, 2014
@aktau

aktau commented Jun 24, 2014

@aktau this one

Sure, we're (neovim) only on x86 for the time being anyway (and single-threaded) ;). Thanks for the fix!

@aktau

aktau commented Jun 24, 2014

Landed in 211bf4e

I'd like to try this out, by the way. I see you merged this into the v0.10 branch. I'm not sure what your git workflow is, but when would this reach master, or at least some recent version?

Neovim uses the v0.11 API, I believe.

@saghul
Contributor

saghul commented Jun 24, 2014

You can probably cherry-pick the commit. I'll merge 0.10 into master tomorrow, I don't trust myself right now :-P

@indutny indutny deleted the fix/performance-hrtime branch June 24, 2014 21:42
@aktau

aktau commented Jun 24, 2014

Perfect, thanks!

@aktau

aktau commented Jun 26, 2014

I can confirm that uv_hrtime no longer appears prominently in my stacktraces. Easy performance win. On to the next one ;).

aktau added a commit to aktau/neovim that referenced this pull request Jul 3, 2014
Fixes some bugs and increase the performance of uv_hrtime() on OSX, which
reduces its prominence in performance traces. This allows us to better see
what's really causing slowness.

ref:
- neovim#868
- joyent/libuv#1325
- https://github.com/joyent/libuv/releases
aktau added a commit to neovim/neovim that referenced this pull request Jul 6, 2014

fmoralesc pushed a commit to fmoralesc/neovim that referenced this pull request Aug 19, 2014