runtime: FreeBSD memory corruption involving fork system call #15658
Here's another panic experienced in mallocgc by the same sample code:
@derekmarcotte, can you also reproduce this at master? (which will become Go 1.7) And do you only see it on FreeBSD, or other operating systems as well?
@RLH, weren't you seeing "finalizer already set" failures a little while ago? Did you track that down? Could it have been related?
On closer inspection of the failures you posted (thanks for collecting several, BTW), this smells like memory corruption. Just guessing, there are a few likely culprits. It may be the finalizer code, but I actually think that's less likely. More likely is that it's the fork/exec code: that code is really subtle, mucks with the address space, and contains system-specific parts (which would explain why it's showing up on FreeBSD, but I haven't been able to reproduce it on Linux yet). @derekmarcotte, can you try commenting out the runtime.SetFinalizer call in newProcess in os/exec.go (your test doesn't need that finalizer) and see if you can still reproduce it? If you can, that will rule out finalizers.
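For reference, a minimal sketch of what that experiment looks like, assuming the Go 1.6-era os package source (the exact body of newProcess may differ slightly between versions):

```go
// os/exec.go (approximate): disable the process finalizer for the experiment.
func newProcess(pid int, handle uintptr) *Process {
	p := &Process{Pid: pid, handle: handle}
	// runtime.SetFinalizer(p, (*Process).Release) // commented out to rule out finalizers
	return p
}
```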
Note that FreeBSD runs via gomote, if this is that easily reproducible. I haven't yet tried.
Just got a golang/go dev environment set up on my machine (was from FreeBSD packages). Will report back soon.
Here's the heads of a bunch of logs with the epoch at the start of the process, so you can see the interval. I suspected a race vs. memory corruption because by and large it is the finalizer already set that crashes the process. I thought maybe the gc was setting these as free (or otherwise touching them) before the SetFinalizer had a chance to set their value. I didn't include too many of them in my initial report, because I thought they were largely redundant. @bradfitz: these logs are against master:
@aclements: I will try your patch next. One variable at a time.
The "fatal error: runtime.SetFinalizer: finalizer already set" bug I was None of the TOC code is in 1.7 or for that matter on TIP. I can force On Fri, May 13, 2016 at 6:47 PM, Derek Marcotte notifications@github.com
|
@aclements: I've run it with your patch, although I haven't been able to babysit it too much. The first time, all threads were idle after a number of hours (i.e. 0% cpu across the board). Connecting gdb to that process gave me trouble, and I couldn't get any logging out of it. This morning, I was able to connect to a different process that looks a lot like 1462933614-wedged.txt. I've attached a log from gdb there: Will keep trying to come up with more info.
@aclements: Here's some more logs from a binary build with the patch: 1463315708-finalizer-already-set.txt Please let me know if I can be of further assistance?
Thanks for the logs! I hadn't realized there were two finalizers involved here. Could you also comment out the SetFinalizer in NewFile in os/file_unix.go and see if it's still reproducible? (Your test also doesn't need that finalizer.)
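As with the first experiment, here is a rough sketch of what this second change looks like, assuming the Go 1.6-era os/file_unix.go (field and helper names may differ in other versions):

```go
// os/file_unix.go (approximate): disable the file finalizer as well.
func NewFile(fd uintptr, name string) *File {
	fdi := int(fd)
	if fdi < 0 {
		return nil
	}
	f := &File{&file{fd: fdi, name: name}}
	// runtime.SetFinalizer(f.file, (*file).close) // commented out to rule out finalizers
	return f
}
```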
I didn't mean to say that it isn't necessarily a race. It's actually quite likely a race, but it's resulting in corruption of internal runtime structures, which suggests that the race is happening on freed memory. The "workbuf is empty" failure mode especially points at memory corruption, which is why my initial guess is that the finalizers (and the specials queue in general) may be victims rather than perpetrators. It's also easy to get finalizers out of the picture, while it's harder to get fork/exec out of the picture without completely changing the program. :)
Thanks @aclements ! One crash, ~4 hours into running, since removing the SetFinalizer in NewFile. I have a second process running for almost 11 hours, with 4 of the threads wedged, but it is still doing work.
After ~11 hours + 5 minutes, the process panic'd:
Thanks for the new logs. I suspect that's actually not the same failure, which suggests that the original problem is in fact related to finalizers.
My pleasure, thanks for the feedback. Are there any next steps for me? (I'm likely to poke around this bottom issue, although I'm neither a go runtime guy, nor a FreeBSD systems-level guy - yet. Would like to be as helpful as possible.) Thanks again!
I'm going to post a few more here. This'll be my last batch, unless requested. 😄 1463584079 is a new message.
Thanks! I assume these last few failures are also with master and with the two SetFinalizer calls commented out?
That's correct! Thanks. Anyone else able to reproduce?
@derekmarcotte, what version of FreeBSD and how many CPUs are you running with? I haven't had any luck yet reproducing on FreeBSD 10.1 with 2 CPUs.
@aclements: both machines were 4 core, 8 thread CPUs:
The Xeon is running 10.3-RELEASE, and the AMD was running 10.1-RELEASE at the time of the logs (it has since been upgraded to 10.3-RELEASE). I suspect these hosts can chew through many more invocations in the same time than a 2-core machine, and additionally increase the probability of contention/collisions at any given instant. The Xeon has since moved to production, so I don't have that hardware at my disposal for the time being, although I might be able to arrange something if it's required. I can get dmesgs/kldstats for the Xeon and AMD if helpful (I would rather post those out of band).
@derekmarcotte, thanks for the extra details (I see now you already gave the CPU configuration; sorry I missed that). Two more experiments to try:
@aclements, Thanks for your suggestions. I'm currently exploring a different option. I was using gb for building my project (sorry! forgot to mention), and additionally for this test case. I certainly didn't expect wildly differing binaries in a project with no external dependencies, as gb uses the go compiler internally. I've got more research to do here to account for this, so I apologize for that. I've built using go directly and am in the process of testing. So far it has been running for 12 hours, without problem (with the SetFinalizers disabled). I have had previous test runs last this long, so I'm not ready to call it a success just yet. I'll be out of town for the next few days, so I can leave it running for a while and see where it ends up. I think this is a very promising lead, based on the objdump of the two artefacts. It might be interesting to include in the Issue Report template, with the build tool ecosystem that is currently out there (or that there is an ecosystem at all).
@aclements, rebuilding gb from source using the go based on master, and then rebuilding the test with the new gb creates nearly identical binaries (minus file path TEXT entries), this is to be expected. Perhaps there's something to this. Will keep you posted.
cc @davecheney
@aclements, the go-only build of the binary did eventually crash... somewhere around 24 hours (~153M goroutines later), so I don't think it's gb-related. I'm going to re-build with the following, per your suggestion:

```go
func run(done chan struct{}) {
	cmd := exec.Command("doesnotexist")
	cmd.Wait()
	done <- struct{}{}
	return
}
```
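For completeness, a minimal harness this modified run function could be dropped into; this is an assumption based on the forking-loop structure described later in this thread, not the exact test program, and the workers count is illustrative:

```go
package main

import "os/exec"

const workers = 16 // assumed goroutine count; tune to taste

func run(done chan struct{}) {
	cmd := exec.Command("doesnotexist")
	cmd.Wait() // returns an error immediately since the command was never started
	done <- struct{}{}
}

func main() {
	done := make(chan struct{}, workers*2)
	for i := 0; i < workers; i++ {
		go run(done)
	}
	// Keep a steady number of goroutines calling run() indefinitely.
	for range done {
		go run(done)
	}
}
```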
@aclements, after the above change, it ran for 4 days without a panic.
Improve performance by eliminating the fork out to swapinfo on FreeBSD which also helps prevent crashes / hangs due to the outstanding fork crash bug: golang/go#15658 This also fixes the value reported by SwapMemory and SwapMemoryWithContext on FreeBSD which previously only included the first swap device and also reported the values in terms of 1K blocks instead of bytes.
Improve performance by eliminating the fork out to uname on FreeBSD which also helps prevent crashes / hangs due to the outstanding fork crash bug: golang/go#15658 Also added a test for PlatformInformation.
@stevenh could you pull down https://svnweb.freebsd.org/base?view=revision&revision=335171 and try again? It seems like this is a different manifestation of the bug described in https://reviews.freebsd.org/D15293
Just seen that and thought that it could be related. I've asked @kostikbel for clarification about how that issue may exhibit, with the thought that it may be related to this one. If the patch is compatible with 11.x I'll grab it and run some more tests.
From the description I got from alc of how the issue this fix addresses exhibits, it sounds like it's not our cause, but I'm running a test nonetheless.
The way this bug exhibits, it sounds like having some form of system-wide reverse / time-travel debugger running would be handy, e.g. the bug exhibits, so hit "stop" and then track it backwards. Not sure if such a thing exists for FreeBSD though. 👼
No, Go still crashes.
@justinclift Wouldn't DTrace be enough to investigate this problem?
Not sure how, as this appears to be random memory corruption of the forking process. I've tried running it through truss before, but it adds enough delay that it doesn't reproduce. I can see if tracing the fork syscall with dtrace shows anything.
This issue is a sprawling 2-year epic adventure. The fact that multiple bugs with different reproducers have been subsumed under one issue is unhelpful. Can someone write the Cliff's Notes, including the current fastest reproducer? Thanks in advance.
Hi Matt, my current reproduction case is:

```go
package main

import (
	"fmt"
	"log"
	"runtime"
	"syscall"
)

var (
	forkRoutines = 16
)

func run(done chan struct{}) {
	if err := syscall.ForkOnlyBSDTest(); err != nil {
		log.Fatal(err)
	}

	done <- struct{}{}
}

func main() {
	fmt.Printf("Starting %v forking goroutines...\n", forkRoutines)
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))

	done := make(chan struct{}, forkRoutines*2)
	for i := 0; i < forkRoutines; i++ {
		go run(done)
	}

	for range done {
		go run(done)
	}
}
```

Then this in the lib:

```go
// +build darwin dragonfly freebsd netbsd openbsd

package syscall

func ForkOnlyBSDTest() (err error) {
	var r1 uintptr
	var pid int
	var err1 Errno
	var wstatus WaitStatus

	ForkLock.Lock()
	runtime_BeforeFork()

	r1, _, err1 = RawSyscall(SYS_FORK, 0, 0, 0)
	if err1 != 0 {
		runtime_AfterFork()
		ForkLock.Unlock()
		return err1
	}

	if r1 == 0 {
		// in child, die die die my darling
		for {
			RawSyscall(SYS_EXIT, 253, 0, 0)
		}
	}

	runtime_AfterFork()
	ForkLock.Unlock()

	pid = int(r1)

	// Prime directive, exterminate
	// Whatever stands left
	_, err = Wait4(pid, &wstatus, 0, nil)
	for err == EINTR {
		_, err = Wait4(pid, &wstatus, 0, nil)
	}

	return
}
```

It's not quick and not guaranteed.
There is also a thread on the freebsd ports mailing list, "lang/go failes to build with poudriere, since 2018-04-05", which has a related reproduction case running the lang/go port build.
@stevenh is there any correlation with the amount of memory in the system or the number of cores? For example, would lowering hw.physmem on a 48-way help?
I've not seen one.
I've been running the usual tests on 11.2-RELEASE-p3 with go1.11.1 and have not managed to reproduce after a week's run. Has anyone else seen a crash with these updated versions?
Ok, still no issues triggered on my test box, and it's been running forks in a loop since the 18th and 20th of November.
Given this, I'm tempted to say that the issue has been fixed in either FreeBSD or the Go runtime.
Thanks very much for following up on this. I'm going to close the issue in the hope that it is indeed fixed. Anyone, please comment if you find future failures.
While updating our test box to 12.0 from 11.2, I noticed that this latest test was running with a potentially relevant kernel patch on top of 11.2-RELEASE. If anyone can reproduce this issue on 11.2, they should try applying the following to see if it was indeed the fix: This is already included in 12.0-RELEASE and was MFC'ed to 11-stable as of:
Please answer these questions before submitting your issue. Thanks!
What version of Go are you using (go version)?
go version go1.6.2 freebsd/amd64
What operating system and processor architecture are you using (go env)?
What did you expect to see?
I expect this strange program to spawn instances of /bin/true in parallel, until I stop it.
What did you see instead?
Various types of panics caused by what looks to be corruption within the finalizer lists, which I assume is the result of race conditions. These panics can happen as quickly as 2 minutes in, or take much longer; 10 minutes seems a good round number.
Occasionally addspecial gets stuck in an infinite loop holding the lock, and the process wedges. This is illustrated in log 1462933614, with x.next pointing to x. This appears to be corruption of that data structure. I have seen processes in this state run for 22 hours.
I understand there is some trepidation expressed in issue #11485 around the locking of the data structures involved.
Here are some sample messages:
fatal error: runtime.SetFinalizer: finalizer already set
runtime: nonempty check fails b.log[0]= 0 b.log[1]= 0 b.log[2]= 0 b.log[3]= 0 fatal error: workbuf is empty
1462926841-SetFinalizer-ex1.txt
1462926969-SetFinalizer-ex2.txt
1462933295-nonempty-check-fails.txt
1462933614-wedged.txt
This was run on an 8-core processor, and on a 4-core, 8-thread processor with ECC RAM, with similar results.
Additionally, while this example is extreme, it also represents the core functionality of a project I've been working on part-time for many months. I'm happy to provide any further assistance diagnosing this issue - I'm very invested!