runtime: frequent blocking syscalls cause high sysmon overhead #18236

Open
philhofer opened this Issue Dec 7, 2016 · 11 comments

@philhofer
Contributor

philhofer commented Dec 7, 2016

The hard-coded 20us sysmon tick in the runtime appears to be far too aggressive for minority architectures.

Our storage controllers run a single-core 800MHz ARM CPU, which runs single-threaded workloads 10 to 20 times slower than my desktop Intel machine.

Most of our product-level benchmarks improved substantially (5-10%) when I manually adjusted the minimum tick to 2ms on our arm devices. (Our use-case benefits from particularly long syscall ticks, since we tend to make read and write syscalls with large buffers, and those can take a long time even if they never actually block in I/O. Also, as you might imagine, the relative cost of thread creation and destruction on this system is quite high.)

Perhaps the runtime should calibrate the sysmon tick based on the performance of a handful of trivial syscalls (getpid(), etc). Similarly, the tick could be adjusted upwards if an m wakes up within a millisecond (or so) of being kicked off its p. Perhaps someone with a better intuition for the runtime can suggest something heuristically less messy.

Thanks,
Phil

@philhofer

Contributor

philhofer commented Dec 7, 2016

It's a Marvell Armada 370, which is ARMv7.

@philhofer

Contributor

philhofer commented Dec 7, 2016

Yes, please drop ARMv5. I was secretly elated when I saw that was finally happening. There are too many wildly divergent implementations of ARM in the wild.

@aclements

Member

aclements commented Dec 7, 2016

In general, sysmon should back off to a 10ms delay unless it's frequently finding goroutines that are blocked in syscalls for more than 10ms or there are frequent periods where all threads are idle (e.g., waiting on the network or blocked in syscalls). Is that the case in your application?

Perhaps the runtime should calibrate the sysmon tick based on the performance of a handful of trivial syscalls (getpid(), etc). Similarly, the tick could be adjusted upwards if an m wakes up within a millisecond (or so) of being kicked off its p.

Perhaps I don't understand your application, but I don't see how these would help. There are two delays involved here: one is the minimum sysmon delay, which is 20us, and the other is the retake delay, which is 10ms. I think both of your suggestions aim at adjusting the retake delay, but it sounds like that's not the problem here.

I think the best way to handle the minimum delay is to eliminate it by making sysmon event-driven. The minimum delay is there because sysmon is sampling what is ultimately a stream of instantaneous events. I believe it could instead compute when the next interesting event will occur and just sleep for that long. This is slightly complicated by how we figure out how long a goroutine has been executing or in a syscall, but in this model the minimum delay controls the variance we're willing to accept on that. In practice, that variance can be up to 10ms right now, which would suggest a minimum tick of 10ms if sysmon were event-driven.

@aclements aclements added this to the Go1.9Early milestone Dec 7, 2016

@philhofer

Contributor

philhofer commented Dec 7, 2016

@aclements Thanks for the detailed reply.

We do, in fact, have all threads blocked in syscalls quite frequently. There's only one P, and the only meaningful thing that this hardware does is move data from the network to the disk and vice versa. (We spend a good chunk of time in startm and stopm.) Many of these syscalls don't truly "block," because we do a relatively good job of issuing the right prefetch hints to hit the kernel block cache, but inevitably blocking requests start to pile up.

Sorta related: #6817, #7903 (we already have a semaphore around blocking syscalls)

Probably also worth pointing out that this hardware definitely can't hit a 1ms timer deadline reliably under linux, and struggles to meet a 10ms deadline with less than 2ms of overrun.

@philhofer

Contributor

philhofer commented Dec 7, 2016

In case it helps, here's the hack that bought us a perf boost:

diff --git a/src/runtime/proc.go b/src/runtime/proc.go
index cad1b1c..5fc3dcd 100644
--- a/src/runtime/proc.go
+++ b/src/runtime/proc.go
@@ -3753,7 +3753,11 @@ func sysmon() {
        delay := uint32(0)
        for {
                if idle == 0 { // start with 20us sleep...
-                       delay = 20
+                       if GOARCH == "arm" {
+                               delay = 2 * 1000
+                       } else {
+                               delay = 20
+                       }
                } else if idle > 50 { // start doubling the sleep after 1ms...
                        delay *= 2
                }

@aclements aclements changed the title from runtime: sysmon tick is hardcoded to runtime: frequent blocking syscalls cause high sysmon overhead Jan 8, 2017

@bradfitz bradfitz modified the milestones: Go1.9Maybe, Go1.9Early May 3, 2017

@bradfitz bradfitz added the Performance label May 3, 2017

@aclements aclements modified the milestones: Go1.10, Go1.9Maybe Jul 18, 2017

@rsc rsc modified the milestones: Go1.10, Go1.11 Nov 22, 2017

@bradfitz

Member

bradfitz commented Jun 19, 2018

Sorry, bumping this again. (But I assume Igneous has been running their own builds for the past two years anyway)

@bradfitz bradfitz modified the milestones: Go1.11, Go1.12 Jun 19, 2018

@jeffdh

jeffdh commented Jun 20, 2018

Thanks for the bump @bradfitz. We have opted to put a semaphore in front of disk I/O that keeps things in optimal syscall/thread territory to avoid this issue as opposed to the patch listed above.
