
runtime: frequent blocking syscalls cause high sysmon overhead #18236

Open · philhofer opened this issue Dec 7, 2016 · 20 comments

@philhofer (Contributor) commented Dec 7, 2016

The hard-coded 20us sysmon tick in the runtime appears to be far too aggressive for minority architectures.

Our storage controllers run a single-core 800MHz ARM CPU, which runs single-threaded workloads between 10 and 20 times slower than my desktop Intel machine.

Most of our product-level benchmarks improved substantially (5-10%) when I manually adjusted the minimum tick to 2ms on our arm devices. (Our use-case benefits from particularly long syscall ticks, since we tend to make read and write syscalls with large buffers, and those can take a long time even if they never actually block in I/O. Also, as you might imagine, the relative cost of thread creation and destruction on this system is quite high.)

Perhaps the runtime should calibrate the sysmon tick based on the performance of a handful of trivial syscalls (getpid(), etc). Similarly, the tick could be adjusted upwards if an m wakes up within a millisecond (or so) of being kicked off its p. Perhaps someone with a better intuition for the runtime can suggest something heuristically less messy.

Thanks,
Phil

@bradfitz (Contributor) commented Dec 7, 2016

@davecheney (Contributor) commented Dec 7, 2016

@philhofer (Contributor, Author) commented Dec 7, 2016

It's a Marvell Armada 370, which is ARMv7.

@davecheney (Contributor) commented Dec 7, 2016

@philhofer (Contributor, Author) commented Dec 7, 2016

Yes, please drop ARMv5. I was secretly elated when I saw that was finally happening. There are too many wildly divergent implementations of ARM in the wild.

@minux (Member) commented Dec 7, 2016

@aclements (Member) commented Dec 7, 2016

In general, sysmon should back off to a 10ms delay unless it's frequently finding goroutines that are blocked in syscalls for more than 10ms or there are frequent periods where all threads are idle (e.g., waiting on the network or blocked in syscalls). Is that the case in your application?

> Perhaps the runtime should calibrate the sysmon tick based on the performance of a handful of trivial syscalls (getpid(), etc). Similarly, the tick could be adjusted upwards if an m wakes up within a millisecond (or so) of being kicked off its p.

Perhaps I don't understand your application, but I don't see how these would help. There are two delays involved here: one is the minimum sysmon delay, which is 20us, and the other is the retake delay, which is 10ms. I think both of your suggestions aim at adjusting the retake delay, but it sounds like that's not the problem here.

I think the best way to handle the minimum delay is to eliminate it by making sysmon event-driven. The minimum delay is there because sysmon is sampling what is ultimately a stream of instantaneous events. I believe it could instead compute when the next interesting event will occur and just sleep for that long. This is slightly complicated by how we figure out how long a goroutine has been executing or in a syscall, but in this model the minimum delay controls the variance we're willing to accept on that. In practice, that variance can be up to 10ms right now, which would suggest a minimum tick of 10ms if sysmon were event-driven.

@aclements aclements added this to the Go1.9Early milestone Dec 7, 2016
@philhofer (Contributor, Author) commented Dec 7, 2016

@aclements Thanks for the detailed reply.

We do, in fact, have all threads blocked in syscalls quite frequently. There's only one P, and the only meaningful thing that this hardware does is move data from the network to the disk and vice-versa. (We spend a good chunk of time in startm and stopm.) Many of these syscalls don't truly "block," because we do a relatively good job of issuing the right prefetch hints to hit the kernel block cache, but inevitably blocking requests start to pile up.

Sorta related: #6817, #7903 (we do already have a semaphore around blocking syscalls)

Probably also worth pointing out that this hardware definitely can't hit a 1ms timer deadline reliably under linux, and struggles to meet a 10ms deadline with less than 2ms of overrun.

@philhofer (Contributor, Author) commented Dec 7, 2016

In case it helps, here's the hack that bought us a perf boost:

diff --git a/src/runtime/proc.go b/src/runtime/proc.go
index cad1b1c..5fc3dcd 100644
--- a/src/runtime/proc.go
+++ b/src/runtime/proc.go
@@ -3753,7 +3753,11 @@ func sysmon() {
        delay := uint32(0)
        for {
                if idle == 0 { // start with 20us sleep...
-                       delay = 20
+                       if GOARCH == "arm" {
+                               delay = 2 * 1000
+                       } else {
+                               delay = 20
+                       }
                } else if idle > 50 { // start doubling the sleep after 1ms...
                        delay *= 2
                }
@aclements aclements changed the title runtime: sysmon tick is hardcoded runtime: frequent blocking syscalls cause high sysmon overhead Jan 8, 2017
@bradfitz bradfitz modified the milestones: Go1.9Maybe, Go1.9Early May 3, 2017
@aclements aclements modified the milestones: Go1.10, Go1.9Maybe Jul 18, 2017
@rsc rsc modified the milestones: Go1.10, Go1.11 Nov 22, 2017
@bradfitz (Contributor) commented Jun 19, 2018

Sorry, bumping this again. (But I assume Igneous has been running their own builds for the past two years anyway)

@bradfitz bradfitz modified the milestones: Go1.11, Go1.12 Jun 19, 2018
@jeffdh commented Jun 20, 2018

Thanks for the bump @bradfitz. Rather than applying the patch listed above, we opted to put a semaphore in front of disk I/O that keeps concurrency in optimal syscall/thread territory, which avoids this issue.

@aclements (Member) commented Jun 25, 2019

Is this still an issue or should we go ahead and close this?

@aclements aclements removed this from the Go1.13 milestone Jun 25, 2019
@aclements aclements added this to the Unplanned milestone Jun 25, 2019
@philhofer (Contributor, Author) commented Jun 25, 2019

@agirbal commented Jan 28, 2021

While troubleshooting a CPU usage increase in our HTTP server since Go 1.14 (#41699), I noticed that runtime.sysmon / runtime.timeSleepUntil was showing up prominently in our profiling (6-12% of CPU). I have tried a patch that sets the sysmon delay to 10ms on both Go 1.15 and Go 1.13.

The graph below shows canary results:

  • baseline is always vanilla Go 1.13
  • beginning of graph until 11:00 has Go 1.15 vanilla as canary. It results in fairly massive ~+15% increase.
  • 11:00 to 16:00 uses Go 1.15 with sysmon patched to 10ms. We see ~+10% increase, so getting back about 5%.
  • 16:00 onward uses Go 1.13 with sysmon patched to 10ms. At this point we see a gain of 3%.

[graph: canary CPU comparison]

This seems to indicate that sysmon can add significant overhead, and that the overhead grows as the server becomes busier.
Here is what it looks like in our profile graph:
[profile graph showing runtime.sysmon / runtime.timeSleepUntil]

PS: we run on FreeBSD.

@ianlancetaylor (Contributor) commented Jan 28, 2021

@agirbal If your program uses many timers and often resets those timers to fire earlier, then the code in the future 1.16 release will be much more efficient than the code in 1.15. Is it possible for you to test your program with the 1.16rc1 release that just came out? Thanks.

@agirbal commented Jan 29, 2021

@ianlancetaylor ok I just tried 1.16rc1 and the issue seems gone! That is amazing news, we can get back on top of our upgrades, thank you. I have commented further in #41699 .

To center back on the sysmon issue, in graph below:

  • baseline is Go 1.13 vanilla
  • until 11:00 canary is Go 1.13 with sysmon reduction, showing about 4% savings.
  • 11:00 to 14:00 canary is Go 1.16 with sysmon reduction, showing about 4% savings.
  • from 14:00 canary is vanilla Go 1.16. It seems to have a slight regression of less than 1%.

[graph: canary CPU comparison]

The 4-5% savings is fairly significant for us. Is it acceptable to run with a constant sysmon delay of 10ms, or are there obvious downsides? Would you consider reducing the sysmon overhead? Thanks!

@prattmic (Member) commented Jan 29, 2021

I think this is worth considering. Simply changing the delay to 10ms without any other changes may be too aggressive, but perhaps we can do some work to make sysmon less active.

One of the major reasons not to extend the delay previously was that sysmon helped with timer expirations. That is gone now in 1.16.

The biggest remaining reason is retaking Ps from blocking syscalls (which occurs after 20us). If we increase the delay to 10ms, then retake will almost never occur.

Could you share profiles of what your application looks like now on 1.16, with and without the sysmon reduction?

@ianlancetaylor (Contributor) commented Jan 29, 2021

It occurs to me that we could handle blocking syscalls by having entersyscall start a timer. If the timer expires, it hands off the P. exitsyscall cancels the timer. The timeout would be fairly short, so the main cost would be storing the timer somewhere. I don't know if we want to make the G struct that much bigger. Although it's already pretty big.

We would still need sysmon to preempt long-running Gs, though, so I don't know if it's worth it.

@prattmic (Member) commented Jan 29, 2021

> It occurs to me that we could handle blocking syscalls by having entersyscall start a timer. If the timer expires, it hands off the P. exitsyscall cancels the timer. The timeout would be fairly short, so the main cost would be storing the timer somewhere.

It seems to me like the main cost would be arming the timer. Since a 20us timer is almost certainly the next timer, basically every entersyscall would have to wakeNetPoller, which would get very expensive fast. Perhaps we could get away with not waking netpoll and just letting it extend to the next M that notices it? Or more generally, perhaps exitsyscall0 could itself attempt a retake by finding a P with an expired timer.

Edit for the last part: I guess it would need to be any pidleget trying to retake, not just exitsyscall0.

@ianlancetaylor (Contributor) commented Jan 29, 2021

True, waking the poller doesn't make sense. Having pidleget pick up Ps waiting in syscall makes sense. But it seems that we would still need sysmon.
