Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: Livelock in mallocgc #29287

Open
ewhamilton opened this issue Dec 15, 2018 · 22 comments

Comments

@ewhamilton
Copy link

commented Dec 15, 2018

What version of Go are you using (go version)?

root@cb464ce07cbb:/repo# go version
go version go1.11.1 linux/amd64

Does this issue reproduce with the latest release?

This is latest major release. Haven't tried latest minor release yet.

What operating system and processor architecture are you using (go env)?

go env Output
Note: Below is go env from container build environment.  Repro is within a Docker container in a Kubernetes deployment with NumCPU=1 and GOMAXPROCS=4 (also seen when GOMAXPROCS defaults to 1).
root@cb464ce07cbb:/repo# go env
GOARCH="amd64"
GOBIN=""
GOCACHE="/root/.cache/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/repo/go:/k10/go"
GOPROXY=""
GORACE=""
GOROOT="/usr/local/go"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build291547819=/tmp/go-build -gno-record-gcc-switches"

What did you do?

I have captured a trace of our application in which a goroutine calling mallocgc() stalls and blocks subsequent STW requests from GC. This sounds similar to #28695 but it happens more frequently (often but not always on on workload). Sorry, I don't have a simple test case that repros though there are only a few goroutines involved at the time of failure.

The mallocgc was a result of a library allocating an 8MB buffer with make([]byte, 0, 8 * 1024 * 1024). From my reading of runtime code it is headed down the largealloc path, perhaps hitting deductSweepCredit() and the sweeping done in alloc_m().

I hope that the trace is sufficient to identify the issue. At the time of the trace I was also collecting logs for GODEBUG=schedtrace=1000,scheddetail=1,gctrace=1,gcpacertrace=1. Neither the trace nor the logs show any sign of a goroutine not preempting while in user code.

Trace file is
tracefile.23-50-04.gz

Log file is
trace.log

More background is at https://groups.google.com/forum/#!topic/golang-nuts/g2gDf12y-Fo

What did you expect to see?

Expected timely allocation and garbage collection such that Kubernetes health checks every 10 seconds do not fail repeatedly.

What did you see instead?

Application hung, stopped responding to health check requests, and after over 30 seconds of STW was sent SIGTERM as part of restarting app. Observed that the SIGTERM appeared to break out of the livelock, but by then it was too late.

@ewhamilton ewhamilton changed the title Livelock in mallocgc runtime: Livelock in mallocgc Dec 15, 2018

@odeke-em

This comment has been minimized.

Copy link
Member

commented Dec 17, 2018

Thank you for filing this issue @ewhamilton and welcome to the Go project!

/cc @aclements @mknyszek @RLH

@mknyszek

This comment has been minimized.

Copy link
Contributor

commented Dec 17, 2018

I took a look at the trace (with @aclements help) and it looks like G36 is stuck in some non-preemptible state for a long period of time. Only later does it finally get to a GC safepoint and it's preempted, so the GC can begin. It's really difficult for me to try and figure out where it's getting into that state since not all the code is available, and it's unclear what's getting inlined. Technically, every function call boundary is a GC safepoint, but there could be enough inlining going on that it's not reached.

In the trace there's a backtrace on G36, but that refers to the moment at which it's finally preempted, not the location where it was running all that time, so I don't think it's blocked inside mallocgc.

Overall, as mentioned in the golang-nuts thread, it strongly suggests a non-preemptible loop in some library. For example, iterating/operating over a very large slice could result in something like this.

Can you get more traces? Do they look similar? Do they have the same backtraces on the non-preemptible goroutines?

Also, try profiling with pprof and see where your program is spending all its time. G36 is running continuously for 30 seconds, so clearly it's doing something. It should almost definitely show up in the CPU profile.

There's already a well-known issue that the runtime's preemption abilities are somewhat limited (see #10958 and #24543). Besides that, it doesn't look like any new runtime issue to me. For now, I won't discount the possibility though.

@ewhamilton

This comment has been minimized.

Copy link
Author

commented Dec 18, 2018

Thanks for taking a look. Yes, I should be able to get more traces and I'll see if I can repro with pprof next.

As you mentioned the stack trace in this instance comes at a point where G36 has run after whatever it was doing for so long. No doubt it'd be clearer what's going on with a corefile from the midst of the long pause (and I'll pursue that if necessary).

I do think there's more evidence that the long execution may be within the call to mallocgc:

  • If we launch go tool trace tracefile.23-50-04 and then navigate through Goroutine analysis to kio/daemon.(*Daemon).Start.func1.1 then G36 and click on the very long span it reports the top of stack as:
runtime.mallocgc:996
runtime.makeslice:70
vendor/google.golang.org/api/gensupport.NewMediaBuffer:26
vendor/google.golang.org/api/gensupport.PrepareUpload:188
vendor/google.golang.org/api/storage/v1.(*ObjectsInsertCall).Media:8622
vendor/github.com/graymeta/stow/google.(*Container).Put:144
  • In an earlier shown in the golang-nuts thread, the panic from SIGTERM caught the offending thread at:
goroutine 15 [runnable]:
runtime.gcStart(0x0, 0x1, 0x0, 0x0)
	/usr/local/go/src/runtime/mgc.go:1236 +0x583 fp=0xc000cd7540 sp=0xc000cd7538 pc=0x419e63
runtime.mallocgc(0x800000, 0x1c04f80, 0x1, 0x600)
	/usr/local/go/src/runtime/malloc.go:996 +0x40e fp=0xc000cd75e0 sp=0xc000cd7540 pc=0x40bb0e
runtime.makeslice(0x1c04f80, 0x0, 0x800000, 0xc00022ad50, 0xc000cd7678, 0x15cfcdc)
	/usr/local/go/src/runtime/slice.go:70 +0x77 fp=0xc000cd7610 sp=0xc000cd75e0 pc=0x4414f7
vendor/google.golang.org/api/gensupport.NewMediaBuffer(...)
	/repo/go/src/vendor/google.golang.org/api/gensupport/buffer.go:26
vendor/google.golang.org/api/gensupport.PrepareUpload(0x23aa900, 0xc000039b30, 0x800000, 0x0, 0x23aa900, 0xc000039b30)
	/repo/go/src/vendor/google.golang.org/api/gensupport/media.go:188 +0x50 fp=0xc000cd7688 sp=0xc000cd7610 pc=0x15d03e0
vendor/google.golang.org/api/storage/v1.(*ObjectsInsertCall).Media(0xc000b2e780, 0x23a5a80, 0xc00022ad50, 0x0, 0x0, 0x0, 0xe)
	/repo/go/src/vendor/google.golang.org/api/storage/v1/storage-gen.go:8622 +0xb8 fp=0xc000cd7718 sp=0xc000cd7688 pc=0x195c188
vendor/github.com/graymeta/stow/google.(*Container).Put(0xc000de4be0, 0xc00058a871, 0x86, 0x23a5a80, 0xc00022ad50, 0x45, 0xc00022ad80, 0x50, 0xc000cd7840, 0x40c278, ...)
	/repo/go/src/vendor/github.com/graymeta/stow/google/container.go:144 +0x186 fp=0xc000cd77d0 sp=0xc000cd7718 pc=0x1969d36
  • Looking at the scheddetail=1 output from SCHED 1259242ms onward we see G36 marked as mallocing=1 and locks=1. That doesn't look like G36 running user code to me.

To me this feels like either

  • spinning CPU indefintely as a result of an edge condition where no forward progress is made each iteration, or
  • missing a wakeup of some sort

Hopefully pprof will point to where the time is being spent.

@mknyszek

This comment has been minimized.

Copy link
Contributor

commented Dec 18, 2018

Regarding your first bullet point, I mentioned this earlier but I think I wasn't very clear. This stack trace comes from the moment the goroutine gets preempted, not where it's running. The execution tracer just traces scheduling events, so it doesn't actually know what a goroutine was doing for the period of time it infers, just what it was doing when it got parked and unparked. I think the trace viewer could be improved by making that explicit, since initially I missed that detail also.

Regarding your second bullet point, I'm not totally convinced that this is the offending goroutine. Note that this goroutine was not running when the SIGTERM hit, but was only runnable.

@ewhamilton

This comment has been minimized.

Copy link
Author

commented Dec 19, 2018

Yes, you were clear enough and I understood that the stack traces were from a point after wherever the goroutines were while the long time was elapsing. They may be indicators of where the code was, but I agree are not definitive.

I've captured 2 new sets of traces, this time running with both trace.Start() and pprof.StartCPUProfile at the beginning of our suspect code and the corresponding stop calls in the shutdown handler triggered by the SIGTERM handler.

In trace-21-17-45.out there appear to be 2 long pauses-- the first one is for 11 seconds and did not trigger the instrumentation via SIGTERM. The second was for 22 seconds and caught both.

In trace-21-56-53.out we see G13 with 3 long duration chunks. Only the third seems to interfere with the STW by G821 and result in SIGTERM.

In both cases the output of pprof lists runtime.memclrNoHeapPointers (from mallocgc) as the most-sampled bucket. I'll also observe that the number of samples appears to be much less than would be expected in the duration. I'm not yet sure what to make of that.

pprof-21-17-45.out.gz
pprof-21-56-53.out.gz
trace-21-17-45.out.gz
trace-21-56-53.out.gz

@mknyszek

This comment has been minimized.

Copy link
Contributor

commented Dec 19, 2018

That's very weird. And as you point out, the amount of CPU profiling samples is super super low. Assuming you haven't changed the profiling rate, then there's the potential that that goroutine is in fact blocked on a runtime lock for a long time. That would be very bad and in fact a runtime issue.

Could you please:

  1. send your program a SIGQUIT instead of a SIGTERM (and don't handle the SIGQUIT), and
  2. set GOTRACEBACK=crash when running your Go binary?

This way the program will dump a stack trace of every goroutine that's running and we can see exactly what it's blocked on.

Thank you so much for your help thus far! :)

@ewhamilton

This comment has been minimized.

Copy link
Author

commented Dec 19, 2018

I'll see what I can do to get either a stack trace of all goroutines or a corefile without unwedging the paused/running threads. It's a bit complicated by the fact that the program is running in a container and controlled by Kubernetes which sends the SIGTERM, but it looks like a preStop hook may work.

I've also added trace.Log() calls immediately before and after the make([]byte, 0, 8 * 1024 *1024) call. Together with the existing traces in the runtime, that should show definitively whether the pause was during runtime or user goroutine execution.

On a related note, is there a recommended way to dump traces? The trace visualization via http is great but it's helpful to know what has been measured vs. inferred. I found that go tool trace -d gives most of that-- timestamps, tags that can correlate to code, processor & goroutine ID, sometimes a bit more.

When I looked at trace-21-56-53.out using go tool trace -d, I did see something of interest-- at the time that G13 wakes up (triggered by SIGTERM handling I believe) the first trace entry is:

107257548315 GoPreempt p=2 g=13 off=70560

Maybe that's unsurprising and just the response of G13 to the STW by G821-- guess the question is what was G13 doing until then.

Here's a larger extract if you're interested in looking while I try to catch the pause in the act:
trace-fragment-21-56-53.txt

@ewhamilton

This comment has been minimized.

Copy link
Author

commented Dec 20, 2018

Good news: I managed to repro a pause failure and capture the panic output. Here's the last 30 seconds of log: repro-sigquit.txt.gz

Bad news: Didn't capture the core file for gdb due to container restart and readonly /proc.

Not sure news:The most interesting goroutines are on system stacks so not dumped. From the log looks like G12 is the goroutine that does the mallocgc. G869 is trying to do STW. G869 is doing CPU profiling.

While I wish I'd turned off CPU profiling, I don't think it invalidates this run-- I've been consistently seeing mallocgc pausing and interfering with the next STW, whatever its cause.

I'll try again tomorrow with profiling off and working directory changed to a persistent volume in order to capture core file and provide a gdb thread apply all bt output.

@RLH

This comment has been minimized.

Copy link
Contributor

commented Dec 20, 2018

There has always been the theoretical issue of the OS starving Go's Ps. This could be caused by co-tenant processes running putting a high load on the system. Michael's comment about low sampling rate seems to support this as a possible cause. The gctrace indicates the runtime believes it has 4 HW threads available, it that in fact the case?

@mknyszek

This comment has been minimized.

Copy link
Contributor

commented Dec 20, 2018

@ewhamilton Thanks for catching it in the act!

Looking at the traceback, G12 is in fact the goroutine we're interested in (since it's running), but that message doesn't mean it's on the system stack. It means that the runtime failed to preempt it the normal ways.

A core dump should indeed be able to show us exactly where the offending goroutine is.

@RLH that's a good point, Go doesn't have a solution to the co-tenancy problem (though I'm not sure anyone does). @ewhamilton what's the load on the system (inside and outside the container) aside from your program?

@aclements

This comment has been minimized.

Copy link
Member

commented Dec 20, 2018

The thread apply all bt output would be great. Alternatively, it looks like your SIGQUIT output was captured without GOTRACEBACK=crash set. If you set that, it should be able to capture the stack trace of the stuck goroutine.

@ewhamilton

This comment has been minimized.

Copy link
Author

commented Dec 21, 2018

@RLH, yes there could be external factors that could appear as pauses— whether from co-tenancy of hardware, page faults, interrupt storms, pod migration, etc. This specific hardware environment is an 8-node GKE (Google Kubernetes Engine) cluster. Each container does present itself as having 1 CPU and by default NumCPU=1 and GOMAXPROCS=1. I set GOMAXPROCS=4 along the way when it looked like it might help. We’ve seen the same issue either way and I only left it at 4 to limit the variables across runs.

My primary reason to think that is not what’s happening here is the scale of the pause— 30 seconds or more is wallclock scale— and the fact that the workload is a light weight functional test, the failure is repeatable in this code path, that we don’t see it in other apps, etc. For what it’s worth, the GKE console also shows this cluster as underutilized.

Because of the complexities of the containerized environment and the interactions with the golang runtime I have 3 sets of stack traces:

  • SIGABRT caught by go runtime, goroutines may have awoken, no core file because pid=1 in container (log.nocore.txt)
  • SIGABRT caught by go runtime, goroutines may have awoken, core file captured (log.core.?.txt, btall.core.?.txt)
  • runtime changed not to catch SIGABRT, core file captured without go runtime awkakening (log.abrt.?.txt, btall.abrt.?.txt)

During a test run I sometimes had multiple failures. I apologize for having 7 different repro instances, but I also know with something this elusive the clue to resolution may be in any of these.

Thanks for folks’ interest and help (and for golang of course)

log.nocore.txt

log.core.1.txt
btall.core.1.txt
log.core.2.txt
btall.core.2.txt
log.core.3.txt
btall.core.3.txt

log.abrt.1.txt
btall.abrt.1.txt
log.abrt.2.txt
btall.abrt.2.txt
log.abrt.3.txt
btall.abrt.3.txt

@ewhamilton

This comment has been minimized.

Copy link
Author

commented Dec 21, 2018

No doubt you all know what to look for in the panic dumps and btall output.

FYI, here's my recipe for identifying the goroutines of interest in this lockup:

  • First sign of trouble is mallocing=1 captured in a SCHED sample.
    ** curg=X indicates that GX is the goroutine doing the mallocgc
  • Second sign of trouble is gcwaiting=1 when another caller tries to STW
    ** goroutine can be found from P with status=3 and M != -1, look at curg for that M
@RLH

This comment has been minimized.

Copy link
Contributor

commented Dec 21, 2018

@ewhamilton

This comment has been minimized.

Copy link
Author

commented Dec 22, 2018

I added trace entries within the runtime and verified from the timestamps just before and after the calls to memclrNoHeapPointers that a significant time (routinely over 5 seconds for 8MB) is spent there.

That seems too long for an 8MB allocation, but I suppose that if zeroing a newly allocated 8MB mmap region causes 2000 page faults and OS timeslicing of threads, maybe it's to be expected (still seems long for map anonymous). I may not be able to create a simple reproducer, but I can certainly benchmark a bunch of 8MB allocations in our environment, being careful about reuse. If that's the source of 5+ seconds of delay, I'm going to have to modify the app to avoid those at runtime. (In our case that's doable but I haven't done yet because I didn't want to mask the failure without knowing it was fixed).

In the attached trace I also see a much longer delay (25+ seconds) with no sign of things going on the system. This delay is terminated at the time of the arrival of the SIGTERM. I think this delay is real and is perhaps exposed by the long memory clear times.

Below are the trace events extracted for the section that got my attention. Here's what I see of interest:

  • G36 is doing 8MB allocation
  • HeapAlloc mem=102 is hack I added to mheap.go before call to memclrNoHeapPointers.
  • G32 starts running and wants to send to chan (new connection, see go tool trace visualizer)
  • G32 does HeapAlloc
  • GCSTWStart
  • Series of gaps for 25 seconds
  • G36 HeapAlloc mem=103 sentinel at end of memclr
  • G36 Preempt
  • G32 Gomaxprocs trace from procresize

All that could be explained by G36 8MB memclr taking 25 seconds, but that's really long and variable. I'll see what the variability is of a test program in our environment.

25526506823 UserLog p=1 g=36 off=21510 id=0 keyid=0 category= message=Before make([]byte, 0, 8 * 1024 * 1024)
25526518926 HeapAlloc p=1 g=36 off=21558 mem=100
25526522039 HeapAlloc p=1 g=36 off=21562 mem=38057008
25526522184 HeapAlloc p=1 g=36 off=21568 mem=101
25526522243 HeapAlloc p=1 g=36 off=21571 mem=102
25530864005 ProcStart p=0 g=0 off=5024 thread=12
25530875205 GoStart p=0 g=520 off=5030 g=520 seq=1
25530876398 GoEnd p=0 g=520 off=5036
25530885592 ProcStart p=3 g=0 off=41042 thread=0
25531022969 GoStart p=0 g=9 off=5038 g=9 seq=29
25531031144 GoSysCall p=0 g=9 off=5043
25531069460 GoBlockRecv p=0 g=9 off=5047
25531074493 ProcStop p=0 g=0 off=5051
25531077897 ProcStop p=3 g=0 off=41048
26529972011 ProcStart p=3 g=0 off=41051 thread=12
26529972971 GoUnblock p=1000002 g=0 off=41057 g=32 seq=0
26529974222 GoStart p=3 g=32 off=41061 g=32 seq=0
26529992027 GoSysCall p=3 g=32 off=41064
26530043522 GoSysCall p=3 g=32 off=41069
26530052395 HeapAlloc p=3 g=32 off=41074 mem=38058800
26530054723 GCStart p=1000004 g=0 off=41081 seq=1
26530067058 GCSTWStart p=3 g=0 off=41086 kindid=1 kind=sweep termination
27197169006 GoSysExit p=1000003 g=49 off=21594 g=49 seq=32 ts=303595723065293
29952525206 GoSysExit p=1000003 g=65 off=21701 g=65 seq=38 ts=303595817773639
29958914829 GoUnblock p=-1 g=0 off=42006 g=445 seq=2
40796101331 GoUnblock p=1000002 g=0 off=42017 g=432 seq=18
40796101884 GoUnblock p=1000002 g=0 off=42028 g=453 seq=14
40796102145 GoUnblock p=1000002 g=0 off=42035 g=507 seq=18
44527088801 GoSysExit p=1000003 g=17 off=25975 g=17 seq=22 ts=303596318737112
47718917716 GoSysExit p=1000003 g=10 off=26028 g=10 seq=2 ts=303596428448095
49960065198 GoUnblock p=1000002 g=0 off=42042 g=450 seq=66
51430146194 HeapAlloc p=1 g=36 off=21574 mem=103
51430332564 GoPreempt p=1 g=36 off=21581
51430339081 ProcStop p=1 g=0 off=21586
51430630186 Gomaxprocs p=3 g=32 off=41090 procs=4
51430656457 GCSTWDone p=3 g=0 off=41099
51430668531 GoUnblock p=3 g=32 off=41102 g=81 seq=10
51430677404 GoCreate p=3 g=32 off=41110 g=546 stack=226
51430684329 GoSysCall p=3 g=32 off=41120
51430705974 GCMarkAssistStart p=3 g=32 off=41125

Here's the trace and logs in case anyone is interested (knowing the ball is in my court to measure baseline for 8MB allocation and zeroing in our environment):
trace.2.out.gz
events.trace.2.txt
log.trace.2.txt

@robaho

This comment has been minimized.

Copy link

commented Dec 22, 2018

@ewhamilton 8 mb allocations is nothing, even under stress. Sorry to say, but it is almost certainly an unsafe/native library corrupting the heap... I would start by removing all native/unsafe code and see if you can reproduce the issue.

But I still think the most likely issue is a goroutine bug that is getting stuck in a near infinite loop preventing GC from occurring.

I base most of this on the fact that the prevalence of Go would of caused similar issues with others, and the fact that it is only affecting you, means there is nothing wrong in the runtime. Not a guarantee of this, but highly likely.

I am suggesting again, that you when it is in this "hang phase" you use an external stack sampler to record the stacks of all Go routines several times, with their cpu utilization (something like Instruments on OSX). This is your best chance of find the culprit.

@ewhamilton

This comment has been minimized.

Copy link
Author

commented Jan 3, 2019

@robaho in the repros above, the 3 log.abrt.?.txt instances are each single point in time external stack samples when in the pause condition. They aren’t as compelling as runs with multiple samples would be but we already have them and I haven’t yet set up to run perf or profile within a Docker container under Kubernetes.

So I did some more digging with delve and gdb over my break...

In log.abrt.1.txt, we have:

  • thread 14 (LWP 9) is in the memclr from largeAlloc for the 8MB []byte.
  • thread 4 (LWP 8) is in loop within stopTheWorldWithSema, waiting for thread 14 to preempt
    ** 100us notetsleep()/futexsleep() between checks

In log.abrt.2.txt we have:

  • thread 13 (LWP 17) is in memclr from growing heap, (looks like for 8K chunk)
  • thread 1 (LWP 5) is gcBgMarkWorker in loop in scang waiting for thread 13 to preempt and scan itself
    ** calls procyield(10) as it spins and osyield() once every 5 usec
    ** procyield(10) executes 10 PAUSE instructions
    ** osyield() calls sched_yield()

In log.abrt.3.txt:

  • thread 14 (LWP 10) is in memclr from largeAlloc
  • thread 1 (LWP 5) is in loop in STW waiting for thread 14 to stop
    ** 100us notetsleep()/futexsleep() between checks

I’ve dug a bit deeper into our environment. The app is running within a Docker container in a Kubernetes cluster. This particular cluster is an 8 node cluster, with each node relatively small. Each node is a VM that has one vCPU and 3.75GB of memory. In GCE terms each is an “n1-standard-1” node (see https://cloud.google.com/compute/docs/machine-types). Per the documentation the vCPU maps onto a single HW thread of a Broadwell Xeon E4 v4 CPU. The node OS is Ubuntu 18.04 LTS. I’ve confirmed that at both the node and container level there is a single vCPU.

One possibility is that we’re under-provisioned and it’s just slow— it doesn’t feel right for 20+ second pauses, but I still need to collect some baseline timings for 8MB allocations/faults/clears.

A second possibility is that it looks suspicious that these 3 instances all have one compute-bound thread doing memclr (and maybe page fault) and a second thread polling for the first to preempt while running on a single hyperthread vCPU.

I have to wonder whether the polling by the runtime could be interfering with the CPU-bound clearing, either completely stalling it or slowing it. I see no signs of any use of thread priorities that could invert or such but there has been work getting the Linux CFS scheduler and Kubernetes to work well together.

A quick check of the Linux 4.15 kernel shows that futexsleep() calculates an absolute timeout and won’t reschedule if the timeout expires by the time it goes to wait. 100us seems like it should yield sometimes, but it is quick and TBD how the CFS scheduler handles frequent short timeouts on a uniprocessor.

Regarding log.abrt.2.txt, I don’t believe a PAUSE instruction will do anything but save power on a single hyperthread vCPU. And sched_yield(2) includes the following:
{quote}
sched_yield () is intended for use with read-time scheduling policies (i.e., SCHED_FIFO or SCHED_RR).
Use of sched_yield() with nondeterministic scheduling policies such as SCHED_OTHER is unspecified
and very likely means your application design is broken.
{quote}
TBD how the Linux CFS scheduler handles sched_yield.

I’ve got baselining and further experiments to do. @RLH, looking at these instances, I’ll also see if I can create a simplified reproducer program.

@robaho

This comment has been minimized.

Copy link

commented Jan 3, 2019

@ewhamilton since you only have a single processor, I think you should set GOMAXPROCS=1 otherwise the scheduler may be having heuristic difficulties - thinking it should be able to run the concurrent work - even though it can't - since there is no physical resources available.

Your logs show that GOMAXPROCS is 4.

@ewhamilton

This comment has been minimized.

Copy link
Author

commented Jan 4, 2019

Agreed. Since I'm re-measuring baseline with the latest version of our code, go 1.11.4, and Kubernetes, I'll make it as stock as possible. GOMAXPROCS defaults to NumCPU == 1 in this environment so I just need to stop overriding it. I'll baseline with GODEBUG unset as well.

While our original problem was seen multiple times with GOMAXPROCS==1, hopefully my repro case wasn't just experimental error-- TBD. Or maybe I should hope the problem disappears...

@robaho

This comment has been minimized.

Copy link

commented Jan 4, 2019

Another possible cause is fragmentation. If you continually creating 8mb objects while allocating other small ones, you might be fragmenting memory, leading to lots of OS memory requests. Go does not have a compacting collector. To be honest, I’m not certain about this, but maybe a Go designer can comment.
You might need to pool these large objects.

@robaho

This comment has been minimized.

Copy link

commented Jan 4, 2019

It looks like Go partitions the heap into large and small objects so I wouldn’t expect that to be a cause. But you can look at the fragmentation size to be certain.

@ewhamilton

This comment has been minimized.

Copy link
Author

commented Jan 7, 2019

I can no longer reproduce the issue on demand. Accordingly I'm going to drop it.

We upgraded our GKE test nodes from 1.10.6-gke.2 to 1.11.5-gke.5. That included a change in VMs and the kernel used for each node. Now even with the same container images as before and the same test code the problem does not recur even after 5x as many attempts.

I don't know whether or not the golang runtime contributed to the problem-- I'm still a bit leery of the robustness of poll loops using sched_yield() on a uniprocessor without SCHED_RT-- but there's clearly an external factor as well. I think it that my setting GOMAXPROCS=4 may have aggravated the situation but since I can no longer compare repro the problem I cannot prove or disprove that.

I do think it is prudent advice to leave GOMAXPROCS at its default of GOMAXPROCS=1 on any uniprocessor-- doing so will avoid some if not all of the issues.

I don't know how the project prefers to handle issues like this-- feel free to close this or let it age or ask me to close it.

Thanks for everyone's help and suggestions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants
You can’t perform that action at this time.