runtime: osyield is expensive on darwin? #19409
This is a naive question about an unexpected and striking benchmarking result.
At tip, on my amd64 OS X laptop, I get:
$ go test -bench=BenchmarkSplitSingleByteSeparator -run=NONE bytes BenchmarkSplitSingleByteSeparator-8 500 2722171 ns/op
If I apply CL 37795, the execution time increases 65%:
$ go test -bench=BenchmarkSplitSingleByteSeparator -run=NONE bytes BenchmarkSplitSingleByteSeparator-8 300 4518575 ns/op
Note that in that CL, all that really happens is that a single function is removed from the call stack. Index checks the length of its argument, and if it is 1, then Index calls IndexByte.
CPU profiling indicates that basically all of the extra time is spent in runtime.usleep (called from runtime.osyield) and runtime.mach_semaphore_signal (called from runtime.notewakeup).
I'm left wondering:
(1) Is there a cheaper way to do an osyield on darwin that doesn't cost a full microsecond? (Linux appears to make an arch_prctl syscall instead of calling select.)
(2) Why does removing a function call from the stack create additional calls to osyield? Can this be avoided?
The text was updated successfully, but these errors were encountered:
Oh right, of course. And changing the CL to occasionally call Index instead of IndexByte eliminates the regression:
I am still interested in making osyield cheaper. I have definitely seen those stack traces show up prominently in other profiles as well.
I spent a little while playing with this. Adding an explicit preemption check in the genSplit loop fixes the problem, so @dr2chase's back-edge preemption would fix this problem assuming it can be landed without other bad performance effects.
Overall, the pending CL to replace osyield's usleep(1) with sched_yield() on macOS has a mixed effect on the go1 benchmarks but has an overall -0.29% geomean speedup.
However, the effect on Josh's benchmark is devastating, +13% slower.
It looks like in the case where we do spin waiting for preemption, using usleep(1) is much better than using sched_yield(). Because of this, I think we should put this change on hold until back-edge preemption is solved, rather than make the problem noticeably worse.
Basically, "fixing" osyield seems to make programs either a tiny bit faster or a lot slower. Most programs won't care but the ones that do will get much worse. Better not to do that.
I'd be interested to hear the results of the experiment on Linux too, but I'd rather not change Linux at this point to work around the problem. I'd rather fix the underlying problem #10958. We've got most of a solution and mainly need to make it perform a bit better.