
runtime: FreeBSD memory corruption involving fork system call #15658

Closed

derekmarcotte opened this issue May 12, 2016 · 243 comments

@derekmarcotte commented May 12, 2016

Please answer these questions before submitting your issue. Thanks!

  1. What version of Go are you using (go version)?

go version go1.6.2 freebsd/amd64

  2. What operating system and processor architecture are you using (go env)?
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="freebsd"
GOOS="freebsd"
GOPATH=""
GORACE=""
GOROOT="/usr/local/go"
GOTOOLDIR="/usr/local/go/pkg/tool/freebsd_amd64"
GO15VENDOREXPERIMENT="1"
CC="cc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0"
CXX="clang++"
  3. What did you do?
package main

/* stdlib includes */
import (
        "fmt"
        "os/exec"
)

func run(done chan struct{}) {
        cmd := exec.Command("true")
        if err := cmd.Start(); err != nil {
                goto finished
        }

        cmd.Wait()

finished:
        done <- struct{}{}
        return
}

func main() {
        fmt.Println("Starting a bunch of goroutines...")

        // 8 & 16 are arbitrary
        done := make(chan struct{}, 16)

        for i := 0; i < 8; i++ {
                go run(done)
        }

        for {
                select {
                case <-done:
                        go run(done)
                }
        }
}
  4. What did you expect to see?

I expect this strange program to spawn instances of /bin/true in parallel, until I stop it.

  5. What did you see instead?

Various types of panics caused by what looks to be corruption within the finalizer lists, which I assume stems from a race condition. These panics can happen as quickly as 2 minutes in, or take much longer; 10 minutes seems a good round number.

Occasionally addspecial gets stuck in an infinite loop holding the lock, and the process wedges. This is illustrated in log 1462933614, with x.next pointing to x. This appears to be corruption of that data structure. I have seen processes in this state run for 22 hours.

I understand there is some trepidation expressed in issue #11485 around the locking of the data structures involved.

Here are some sample messages:

  • fatal error: runtime.SetFinalizer: finalizer already set
  • runtime: nonempty check fails b.log[0]= 0 b.log[1]= 0 b.log[2]= 0 b.log[3]= 0 fatal error: workbuf is empty

1462926841-SetFinalizer-ex1.txt
1462926969-SetFinalizer-ex2.txt
1462933295-nonempty-check-fails.txt
1462933614-wedged.txt

This was run on an 8-core processor, and on a 4-core, 8-thread processor with ECC RAM, with similar results.

Additionally, while this example is an extreme case, it also represents the core functionality of a project I've been working on part-time for many months. I'm happy to provide any further assistance diagnosing this issue - I'm very invested!

@derekmarcotte (Author) commented May 12, 2016

Here's another panic experienced in mallocgc by the same sample code:

1463046620-malloc-panic.txt

@bradfitz bradfitz added this to the Go1.7Maybe milestone May 12, 2016

@bradfitz (Member) commented May 12, 2016

@derekmarcotte, can you also reproduce this at master? (which will become Go 1.7)

And do you only see it on FreeBSD, or other operating systems as well?

/cc @aclements @ianlancetaylor

@aclements (Member) commented May 12, 2016

@RLH, weren't you seeing "finalizer already set" failures a little while ago? Did you track that down? Could it have been related?

@aclements (Member) commented May 12, 2016

On closer inspection of the failures you posted (thanks for collecting several, BTW), this smells like memory corruption. Just guessing, there are a few likely culprits. It may be the finalizer code, but I actually think that's less likely. More likely is that it's the fork/exec code: that code is really subtle, mucks with the address space, and contains system-specific parts (which would explain why it's showing up on FreeBSD, but I haven't been able to reproduce it on Linux yet).

@derekmarcotte, can you try commenting out the runtime.SetFinalizer call in newProcess in os/exec.go (your test doesn't need that finalizer) and see if you can still reproduce it? If you can, that will rule out finalizers.

@bradfitz (Member) commented May 12, 2016

Note that FreeBSD is available via gomote, if this is that easily reproducible. I haven't tried yet.

@derekmarcotte (Author) commented May 13, 2016

Just got a golang/go dev environment set up on my machine (it was previously using Go from FreeBSD packages). Will report back soon.

@derekmarcotte (Author) commented May 13, 2016

Here's the heads of a bunch of logs with the epoch at the start of the process, so you can see the interval. I suspected a race vs. memory corruption because by and large it is the finalizer already set that crashes the process. I thought maybe the gc was setting these as free (or otherwise touching them) before the SetFinalizer had a chance to set their value.

I didn't include too many of them in my initial report, because I thought they were largely redundant.

@bradfitz: these logs are against master:

1463168442
Starting a bunch of goroutines...
fatal error: runtime.SetFinalizer: finalizer already set

runtime stack:
runtime.throw(0x4c366d, 0x2b)
        /home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b fp=0xc420059f48 sp=0xc420059f30
runtime.SetFinalizer.func2()
        /home/derek/go/src/github.com/golang/go/src/runtime/mfinal.go:375 +0x73 fp=0xc420059f80 sp=0xc420059f48
runtime.systemstack(0xc420019500)
        /home/derek/go/src/github.com/golang/go/src/runtime/asm_amd64.s:298 +0x79 fp=0xc420059f88 sp=0xc420059f80


1463170469
Starting a bunch of goroutines...
fatal error: runtime.SetFinalizer: finalizer already set

runtime stack:
runtime.throw(0x4c366d, 0x2b)
        /home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b fp=0x7fffffffea80 sp=0x7fffffffea68
runtime.SetFinalizer.func2()
        /home/derek/go/src/github.com/golang/go/src/runtime/mfinal.go:375 +0x73 fp=0x7fffffffeab8 sp=0x7fffffffea80
runtime.systemstack(0x52ad00)
        /home/derek/go/src/github.com/golang/go/src/runtime/asm_amd64.s:298 +0x79 fp=0x7fffffffeac0 sp=0x7fffffffeab8


1463170494
Starting a bunch of goroutines...
fatal error: runtime.SetFinalizer: finalizer already set

runtime stack:
runtime.throw(0x4c366d, 0x2b)
        /home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b fp=0xc420067f48 sp=0xc420067f30
runtime.SetFinalizer.func2()
        /home/derek/go/src/github.com/golang/go/src/runtime/mfinal.go:375 +0x73 fp=0xc420067f80 sp=0xc420067f48
runtime.systemstack(0xc420019500)
        /home/derek/go/src/github.com/golang/go/src/runtime/asm_amd64.s:298 +0x79 fp=0xc420067f88 sp=0xc420067f80


1463171133
Starting a bunch of goroutines...
fatal error: runtime.SetFinalizer: finalizer already set

runtime stack:
runtime.throw(0x4c366d, 0x2b)
        /home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b fp=0xc4202cbf48 sp=0xc4202cbf30
runtime.SetFinalizer.func2()
        /home/derek/go/src/github.com/golang/go/src/runtime/mfinal.go:375 +0x73 fp=0xc4202cbf80 sp=0xc4202cbf48
runtime.systemstack(0xc42001c000)
        /home/derek/go/src/github.com/golang/go/src/runtime/asm_amd64.s:298 +0x79 fp=0xc4202cbf88 sp=0xc4202cbf80

@aclements: I will try your patch next. One variable at a time.

@RLH (Contributor) commented May 14, 2016

The "fatal error: runtime.SetFinalizer: finalizer already set" bugs I was seeing a few days ago were on the 1.8 dev.garbage branch, and were the result of a TOC write barrier not marking an object as being published, TOC doing a rollback, and reallocating a new object over an old one. Both objects had finalizers, and the bug was tripped.

None of the TOC code is in 1.7 or, for that matter, at tip. I can force myself to imagine that the 1.7 allocCache code could cause a similar situation if the allocCache and bitmaps were not coherent, but if that were the case I would expect lots of failures all over the place and across all platforms, not just FreeBSD.


@derekmarcotte (Author) commented May 15, 2016

@aclements: I've run it with your patch, although I haven't been able to babysit it too much.

The first time, all threads were idle after a number of hours (i.e. 0% cpu across the board). Connecting gdb to that process gave me trouble, and I couldn't get any logging out of it.

This morning, I was able to connect to a different process that looks a lot like 1462933614-wedged.txt. I've attached a log from gdb there:

1463270694.txt

Will keep trying to come up with more info.

@derekmarcotte (Author) commented May 16, 2016

@aclements: Here's some more logs from a binary build with the patch:

1463315708-finalizer-already-set.txt
1463349807-finalizer-already-set.txt
1463352601-workbuf-not-empty.txt
1463362849-workbuf-empty.txt
1463378745-wedged-gdb.txt

Please let me know if I can be of further assistance?

@aclements (Member) commented May 16, 2016

Thanks for the logs! I hadn't realized there were two finalizers involved here. Could you also comment out the SetFinalizer in NewFile in os/file_unix.go and see if it's still reproducible? (Your test also doesn't need that finalizer.)

@aclements (Member) commented May 16, 2016

> I suspected a race vs. memory corruption because by and large it is the finalizer already set that crashes the process.

I didn't mean to say that it isn't necessarily a race. It's actually quite likely a race, but it's resulting in corruption of internal runtime structures, which suggests that the race is happening on freed memory. The "workbuf is empty" failure mode especially points at memory corruption, which is why my initial guess is that the finalizers (and the specials queue in general) may be victims rather than perpetrators. It's also easy to get finalizers out of the picture, while it's harder to get fork/exec out of the picture without completely changing the program. :)

@derekmarcotte (Author) commented May 17, 2016

Thanks @aclements !

One crash, ~4 hours into running, since removing the SetFinalizer in NewFile.

1463429459.txt

I have a second process running for almost 11 hours, with 4 of the threads wedged, but it is still doing work.

@derekmarcotte (Author) commented May 17, 2016

After ~11 hours + 5 minutes, the process panic'd:

1463445934-invalid-p-state.txt

@aclements (Member) commented May 17, 2016

Thanks for the new logs. I suspect that's actually not the same failure, which suggests that the original problem is in fact related to finalizers.

@derekmarcotte (Author) commented May 17, 2016

My pleasure, thanks for the feedback. Are there any next steps for me? (I'm likely to poke around this bottom issue, although I'm neither a Go runtime guy nor a FreeBSD systems-level guy - yet. I would like to be as helpful as possible.)

Thanks again!

@derekmarcotte (Author) commented May 19, 2016

I'm going to post a few more here. This'll be my last batch, unless requested. 😄 1463584079 is a new message.

 ==> 1463485780 <==
Starting a bunch of goroutines...

fatal error: workbuf is not empty

runtime stack:
runtime.throw(0x4bff02, 0x14)
        /home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b
runtime.(*workbuf).checkempty(0x8007dae00)
        /home/derek/go/src/github.com/golang/go/src/runtime/mgcwork.go:301 +0x3f
runtime.getempty(0x8007dae00)

==> 1463584079 <==
Starting a bunch of goroutines...
fatal error: MHeap_AllocLocked - MSpan not free

runtime stack:
runtime.throw(0x4c2528, 0x22)
        /home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b
runtime.(*mheap).allocSpanLocked(0x52dbe0, 0x1, 0xc4200ae1a0)
        /home/derek/go/src/github.com/golang/go/src/runtime/mheap.go:637 +0x498
runtime.(*mheap).alloc_m(0x52dbe0, 0x1, 0x12, 0xc420447fe0)
        /home/derek/go/src/github.com/golang/go/src/runtime/mheap.go:510 +0xd6

==> 1463603516 <==
Starting a bunch of goroutines...
acquirep: p->m=842351485952(10) p->status=1
acquirep: p->m=842350813184(4) p->status=1
fatal error: acquirep: invalid p state
fatal error: acquirep: invalid p state

runtime stack:
runtime.throw(0x4c0ab3, 0x19)
        /home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b
runtime.acquirep1(0xc420015500)

==> 1463642257 <==
Starting a bunch of goroutines...
acquirep: p->m=0(0) p->status=2
fatal error: acquirep: invalid p state
acquirep: p->m=0(0) p->status=3
fatal error: acquirep: invalid p state

runtime stack:
runtime.throw(0x4c0ab3, 0x19)
        /home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b
runtime.acquirep1(0xc42001c000)
@aclements (Member) commented May 19, 2016

Thanks! I assume these last few failures are also with master and with the two SetFinalizer calls commented out?

@derekmarcotte (Author) commented May 19, 2016

That's correct! Thanks. Anyone else able to reproduce?

@aclements (Member) commented May 26, 2016

@derekmarcotte, what version of FreeBSD and how many CPUs are you running with? I haven't had any luck yet reproducing on FreeBSD 10.1 with 2 CPUs.

@derekmarcotte (Author) commented May 26, 2016

@aclements: both machines were 4 core, 8 thread CPUs:

Intel(R) Xeon(R) CPU E5-2637 v3 @ 3.50GHz (32GB/ECC) - none of the included logs were from this machine, but it could not keep my process (from my project, not the listing in this issue) running for very long

AMD FX(tm)-8350 Eight-Core Processor (8GB) - this is my dev box where all the logs are included

The Xeon is running 10.3-RELEASE, and the AMD was running 10.1-RELEASE at the time of the logs (has since been upgraded to 10.3-RELEASE).

I suspect these hosts can chew through many more invocations in the same time than a 2-core machine, additionally increasing the probability of contention/collisions at any given instant.

The Xeon has since moved to production, so I don't have that hardware at my disposal for the time being, although I might be able to arrange something if it's required.

I can get dmesgs/kldstats for the Xeon and the AMD if helpful (I would rather post them out of band).

@aclements (Member) commented May 31, 2016

@derekmarcotte, thanks for the extra details (I see now you already gave the CPU configuration; sorry I missed that).

Two more experiments to try:

  1. See if you can reproduce with GOMAXPROCS=2 (either 1.6.2 or master is fine).
  2. Try the same test program, but invoking an absolute path to a command that doesn't exist (and ignoring the error). This should get the exec out of the picture.
@derekmarcotte (Author) commented Jun 1, 2016

@aclements, Thanks for your suggestions. I'm currently exploring a different option. I was using gb for building my project (sorry! forgot to mention), and additionally for this test case.

I certainly didn't expect wildly differing binaries in a project with no external dependencies, as gb uses the go compiler internally. I've got more research to do here to account for this, so I apologize for that.

I've built using go directly and am in the process of testing. So far it has been running for 12 hours, without problem (with the SetFinalizers disabled). I have had previous test runs last this long, so I'm not ready to call it a success just yet. I'll be out of town for the next few days, so I can leave it running for a while and see where it ends up.

I think this is a very promising lead, based on the objdump of the two artefacts. It might be interesting for the issue report template to ask which build tool was used, given the build-tool ecosystem that is currently out there (or the fact that there is an ecosystem at all).

@derekmarcotte (Author) commented Jun 1, 2016

@aclements, rebuilding gb from source using the go based on master, and then rebuilding the test with the new gb creates nearly identical binaries (minus file path TEXT entries), this is to be expected.

Perhaps there's something to this. Will keep you posted.


@derekmarcotte (Author) commented Jun 2, 2016

@aclements, the go-only build of the binary did eventually crash... somewhere around 24 hours (~153M goroutines later), so I don't think it's gb-related. I'm going to re-build with the following, per your suggestion:

func run(done chan struct{}) {
        cmd := exec.Command("doesnotexist")
        cmd.Wait()

        done <- struct{}{}
        return
}
@derekmarcotte (Author) commented Jun 6, 2016

@aclements, after the above change, it ran for 4 days without a panic.

@stevenh (Contributor) commented Feb 15, 2018

Hehe, I guess I should have refreshed my page before replying, thanks @vangyzen :)

@stevenh (Contributor) commented Feb 17, 2018

Having applied the two patches, I've unfortunately managed to confirm they haven't solved the issue, as I had a crash:

acquirep: p->m=5491744(0) p->status=1
fatal error: acquirep: invalid p state

stevenh added a commit to stevenh/gopsutil that referenced this issue Mar 11, 2018

Eliminate call to swapinfo on FreeBSD
Improve performance by eliminating the fork out to swapinfo on FreeBSD which also helps prevent crashes / hangs due to the outstanding fork crash bug:
golang/go#15658

This also fixes the value reported by SwapMemory and SwapMemoryWithContext on FreeBSD which previously only included the first swap device and also reported the values in terms of 1K blocks instead of bytes.

stevenh added a commit to stevenh/gopsutil that referenced this issue Mar 11, 2018

Eliminate call to uname on FreeBSD
Improve performance by eliminating the fork out to uname on FreeBSD which also helps prevent crashes / hangs due to the outstanding fork crash bug:
golang/go#15658

Also added a test for PlatformInformation.
@sean- commented Jun 14, 2018

@stevenh could you pull down https://svnweb.freebsd.org/base?view=revision&revision=335171 and try again? It seems like this is a different manifestation of the bug described in https://reviews.freebsd.org/D15293

@stevenh (Contributor) commented Jun 14, 2018

Just seen that, and thought that it could be related.

I've asked @kostikbel for clarification about how that bug exhibits, with the thought that it may be related to this.

If the patch is compatible with 11.x I'll grab it and run some more tests.

@stevenh (Contributor) commented Jun 15, 2018

From the description I got from alc of how the issue this fix addresses exhibits, it sounds like it's not our cause, but I'm running a test nonetheless.

@justinclift

This comment has been minimized.

Copy link

commented Jun 15, 2018

The way this bug exhibits, it sounds like having some form of system-wide reverse / time-travel debugger running would be handy, e.g. the bug exhibits, so hit "stop" and then track it backwards.

Not sure if such a thing exists for FreeBSD though. 👼

@stevenh (Contributor) commented Jun 18, 2018

No, go still crashes:
Starting 16 forking goroutines...
GOMAXPROCS: 24
acquirep: p->m=5491744(0) p->status=1
fatal error: acquirep: invalid p state

runtime stack:
runtime.throw(0x4c6fe1, 0x19)
        /usr/local/go/src/runtime/panic.go:605 +0x95 fp=0xc420077e00 sp=0xc420077de0 pc=0x426fc5
runtime.acquirep1(0xc42002a600)
        /usr/local/go/src/runtime/proc.go:3699 +0x16d fp=0xc420077e38 sp=0xc420077e00 pc=0x43205d
runtime.acquirep(0xc42002a600)
        /usr/local/go/src/runtime/proc.go:3671 +0x2b fp=0xc420077e50 sp=0xc420077e38 pc=0x431eab
runtime.stopm()
        /usr/local/go/src/runtime/proc.go:1689 +0x124 fp=0xc420077e78 sp=0xc420077e50 pc=0x42c924
runtime.findrunnable(0xc420020c00, 0x0)
        /usr/local/go/src/runtime/proc.go:2135 +0x4d2 fp=0xc420077f10 sp=0xc420077e78 pc=0x42dac2
runtime.schedule()
        /usr/local/go/src/runtime/proc.go:2255 +0x12c fp=0xc420077f58 sp=0xc420077f10 pc=0x42e57c
runtime.goexit0(0xc42013bc80)
        /usr/local/go/src/runtime/proc.go:2406 +0x236 fp=0xc420077fb0 sp=0xc420077f58 pc=0x42efd6
runtime.mcall(0x0)
        /usr/local/go/src/runtime/asm_amd64.s:286 +0x5b fp=0xc420077fc0 sp=0xc420077fb0 pc=0x44dfab
@bartekrutkowski commented Jun 18, 2018

@justinclift Wouldn't DTrace be enough to investigate this problem?


@mattmacy commented Jul 4, 2018

This issue is a sprawling 2-year epic adventure. The fact that multiple bugs with different reproducers have been subsumed under one issue is unhelpful. Can someone write the Cliff's Notes, including the current fastest reproducer? Thanks in advance.

@stevenh (Contributor) commented Jul 5, 2018

Hi Matt, my current reproduction case is:

package main
  
import (
        "fmt"
        "log"
        "runtime"
        "syscall"
)

var (
        forkRoutines = 16
)

func run(done chan struct{}) {
        if err := syscall.ForkOnlyBSDTest(); err != nil {
                log.Fatal(err)
        }

        done <- struct{}{}
}

func main() {
        fmt.Printf("Starting %v forking goroutines...\n", forkRoutines)
        fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))

        done := make(chan struct{}, forkRoutines*2)

        for i := 0; i < forkRoutines; i++ {
                go run(done)
        }

        for range done {
                go run(done)
        }
}

Then this in the lib:

// +build darwin dragonfly freebsd netbsd openbsd
  
package syscall

func ForkOnlyBSDTest() (err error) {
        var r1 uintptr
        var pid int
        var err1 Errno
        var wstatus WaitStatus

        ForkLock.Lock()
        runtime_BeforeFork()

        r1, _, err1 = RawSyscall(SYS_FORK, 0, 0, 0)
        if err1 != 0 {
                runtime_AfterFork()
                ForkLock.Unlock()
                return err1
        }

        if r1 == 0 {
                // in child, die die die my darling
                for {
                        RawSyscall(SYS_EXIT, 253, 0, 0)
                }
        }

        runtime_AfterFork()
        ForkLock.Unlock()

        pid = int(r1)

        // Prime directive, exterminate
        // Whatever stands left
        _, err = Wait4(pid, &wstatus, 0, nil)
        for err == EINTR {
                _, err = Wait4(pid, &wstatus, 0, nil)
        }

        return
}

It's not quick and not guaranteed.

@stevenh (Contributor) commented Jul 5, 2018

There is also a thread on the FreeBSD ports mailing list, "lang/go failes to build with poudriere, since 2018-04-05", which has a related reproduction case running the lang/go port build.

@ianlancetaylor ianlancetaylor modified the milestones: Go1.11, Go1.12 Jul 6, 2018

@mattmacy commented Jul 6, 2018

@stevenh is there any correlation with the amount of memory in the system or the number of cores? For example, would lowering hw.physmem on a 48-way help?


@stevenh (Contributor) commented Nov 28, 2018

I've been running the usual tests on 11.2-RELEASE-p3 with go1.11.1 and have not managed to reproduce after a week's run.

Has anyone else seen a crash with these updated versions?

@aclements aclements modified the milestones: Go1.12, Go1.13 Dec 18, 2018

@stevenh (Contributor) commented Dec 21, 2018

OK, still no issues triggered on my test box, and it's been running forks in a loop since the 18th and 20th of November:

root  50215  273.0  0.0   104384   4988  9  S+J  20Nov18  134201:59.00 ./test
root  46993  259.4  0.0   104640   3204  8  S+J  18Nov18  151748:35.09 ./test2

Given this I'm tempted to say that the issue has been fixed either in FreeBSD or golang runtime.

@ianlancetaylor (Contributor) commented Dec 21, 2018

Thanks very much for following up on this. I'm going to close the issue in the hope that it is indeed fixed. Please comment if you find future failures.

@stevenh (Contributor) commented Jan 3, 2019

While updating our test box from 11.2 to 12.0, I noticed that this latest test was running with a potentially relevant kernel patch on top of 11.2-RELEASE.

If anyone can reproduce this issue on 11.2 then they should try applying the following to see if was indeed the fix:
https://svnweb.freebsd.org/base?view=revision&revision=335171

This is already included in 12.0-RELEASE and was MFC'ed to 11-stable as of:
https://svnweb.freebsd.org/base?view=revision&revision=335508
