runtime: FreeBSD memory corruption involving fork system call #15658

Open
derekmarcotte opened this Issue May 12, 2016 · 239 comments

derekmarcotte commented May 12, 2016

Please answer these questions before submitting your issue. Thanks!

  1. What version of Go are you using (go version)?

go version go1.6.2 freebsd/amd64

  2. What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="freebsd"
GOOS="freebsd"
GOPATH=""
GORACE=""
GOROOT="/usr/local/go"
GOTOOLDIR="/usr/local/go/pkg/tool/freebsd_amd64"
GO15VENDOREXPERIMENT="1"
CC="cc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0"
CXX="clang++"
  3. What did you do?

package main

/* stdlib includes */
import (
        "fmt"
        "os/exec"
)

func run(done chan struct{}) {
        cmd := exec.Command("true")
        if err := cmd.Start(); err != nil {
                goto finished
        }

        cmd.Wait()

finished:
        done <- struct{}{}
        return
}

func main() {
        fmt.Println("Starting a bunch of goroutines...")

        // 8 & 16 are arbitrary
        done := make(chan struct{}, 16)

        for i := 0; i < 8; i++ {
                go run(done)
        }

        for {
                select {
                case <-done:
                        go run(done)
                }
        }
}
  4. What did you expect to see?

I expect this strange program to spawn instances of /bin/true in parallel, until I stop it.

  5. What did you see instead?

Various types of panics caused by what looks to be corruption of the finalizer lists, which I assume stems from race conditions. These panics can happen as quickly as 2 minutes in, or take much longer; 10 minutes seems a good round number.

Occasionally addspecial gets stuck in an infinite loop holding the lock, and the process wedges. This is illustrated in log 1462933614, with x.next pointing to x. This appears to be corruption of that data structure. I have seen processes in this state run for 22 hours.

I understand there is some trepidation expressed in issue #11485 around the locking of the data structures involved.

Here are some sample messages:

  • fatal error: runtime.SetFinalizer: finalizer already set
  • runtime: nonempty check fails b.log[0]= 0 b.log[1]= 0 b.log[2]= 0 b.log[3]= 0 fatal error: workbuf is empty

1462926841-SetFinalizer-ex1.txt
1462926969-SetFinalizer-ex2.txt
1462933295-nonempty-check-fails.txt
1462933614-wedged.txt

This was run on an 8-core processor and on a 4-core/8-thread processor with ECC RAM, with similar results.

Additionally, while this example is extreme, it also represents the core functionality of a project I've been working on part-time for many months. I'm happy to provide any further assistance diagnosing this issue - I'm very invested!

derekmarcotte commented May 12, 2016

Here's another panic experienced in mallocgc by the same sample code:

1463046620-malloc-panic.txt

@bradfitz bradfitz added this to the Go1.7Maybe milestone May 12, 2016

bradfitz commented May 12, 2016

@derekmarcotte, can you also reproduce this at master? (which will become Go 1.7)

And do you only see it on FreeBSD, or other operating systems as well?

/cc @aclements @ianlancetaylor

aclements commented May 12, 2016

@RLH, weren't you seeing "finalizer already set" failures a little while ago? Did you track that down? Could it have been related?

aclements commented May 12, 2016

On closer inspection of the failures you posted (thanks for collecting several, BTW), this smells like memory corruption. Just guessing, there are a few likely culprits. It may be the finalizer code, but I actually think that's less likely. More likely is that it's the fork/exec code: that code is really subtle, mucks with the address space, and contains system-specific parts (which would explain why it's showing up on FreeBSD, but I haven't been able to reproduce it on Linux yet).

@derekmarcotte, can you try commenting out the runtime.SetFinalizer call in newProcess in os/exec.go (your test doesn't need that finalizer) and see if you can still reproduce it? If you can, that will rule out finalizers.
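For reference, the call in question — newProcess in src/os/exec.go as it stood around Go 1.6; a fragment of package os, shown here only to locate the experiment, not a standalone program:

```go
// src/os/exec.go (package os), circa Go 1.6:
func newProcess(pid int, handle uintptr) *Process {
	p := &Process{Pid: pid, handle: handle}
	// The line the experiment comments out; normally it ties
	// (*Process).Release to garbage collection of p:
	// runtime.SetFinalizer(p, (*Process).Release)
	return p
}
```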

bradfitz commented May 12, 2016

Note that FreeBSD runs via gomote, if this is that easily reproducible. I haven't yet tried.

derekmarcotte commented May 13, 2016

Just got a golang/go dev environment set up on my machine (it was previously running Go from FreeBSD packages). Will report back soon.

derekmarcotte commented May 13, 2016

Here are the heads of a bunch of logs, with the epoch at the start of each process, so you can see the interval. I suspected a race rather than memory corruption because, by and large, it is the "finalizer already set" error that crashes the process. I thought maybe the GC was marking these objects as free (or otherwise touching them) before SetFinalizer had a chance to set their value.

I didn't include too many of them in my initial report, because I thought they were largely redundant.

@bradfitz: these logs are against master:

1463168442
Starting a bunch of goroutines...
fatal error: runtime.SetFinalizer: finalizer already set

runtime stack:
runtime.throw(0x4c366d, 0x2b)
        /home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b fp=0xc420059f48 sp=0xc420059f30
runtime.SetFinalizer.func2()
        /home/derek/go/src/github.com/golang/go/src/runtime/mfinal.go:375 +0x73 fp=0xc420059f80 sp=0xc420059f48
runtime.systemstack(0xc420019500)
        /home/derek/go/src/github.com/golang/go/src/runtime/asm_amd64.s:298 +0x79 fp=0xc420059f88 sp=0xc420059f80


1463170469
Starting a bunch of goroutines...
fatal error: runtime.SetFinalizer: finalizer already set

runtime stack:
runtime.throw(0x4c366d, 0x2b)
        /home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b fp=0x7fffffffea80 sp=0x7fffffffea68
runtime.SetFinalizer.func2()
        /home/derek/go/src/github.com/golang/go/src/runtime/mfinal.go:375 +0x73 fp=0x7fffffffeab8 sp=0x7fffffffea80
runtime.systemstack(0x52ad00)
        /home/derek/go/src/github.com/golang/go/src/runtime/asm_amd64.s:298 +0x79 fp=0x7fffffffeac0 sp=0x7fffffffeab8


1463170494
Starting a bunch of goroutines...
fatal error: runtime.SetFinalizer: finalizer already set

runtime stack:
runtime.throw(0x4c366d, 0x2b)
        /home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b fp=0xc420067f48 sp=0xc420067f30
runtime.SetFinalizer.func2()
        /home/derek/go/src/github.com/golang/go/src/runtime/mfinal.go:375 +0x73 fp=0xc420067f80 sp=0xc420067f48
runtime.systemstack(0xc420019500)
        /home/derek/go/src/github.com/golang/go/src/runtime/asm_amd64.s:298 +0x79 fp=0xc420067f88 sp=0xc420067f80


1463171133
Starting a bunch of goroutines...
fatal error: runtime.SetFinalizer: finalizer already set

runtime stack:
runtime.throw(0x4c366d, 0x2b)
        /home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b fp=0xc4202cbf48 sp=0xc4202cbf30
runtime.SetFinalizer.func2()
        /home/derek/go/src/github.com/golang/go/src/runtime/mfinal.go:375 +0x73 fp=0xc4202cbf80 sp=0xc4202cbf48
runtime.systemstack(0xc42001c000)
        /home/derek/go/src/github.com/golang/go/src/runtime/asm_amd64.s:298 +0x79 fp=0xc4202cbf88 sp=0xc4202cbf80

@aclements: I will try your patch next. One variable at a time.

RLH commented May 14, 2016

The "fatal error: runtime.SetFinalizer: finalizer already set" bug I was
seeing a few days ago was on the 1.8 dev.garbage branch, and was the result
of a TOC write barrier not marking an object as published, TOC doing a
rollback, and reallocating a new object over an old one. Both objects had
finalizers, and the bug was tripped.

None of the TOC code is in 1.7, or for that matter at tip. I can force
myself to imagine that the 1.7 allocCache code could cause a similar
situation if the allocCache and bitmaps were not coherent, but if that were
the case I would expect lots of failures all over the place and across all
platforms, not just FreeBSD.


derekmarcotte commented May 15, 2016

@aclements: I've run it with your patch, although I haven't been able to babysit it too much.

The first time, all threads were idle after a number of hours (i.e., 0% CPU across the board). Connecting gdb to that process gave me trouble, and I couldn't get any logging out of it.

This morning, I was able to connect to a different process that looks a lot like 1462933614-wedged.txt. I've attached a log from gdb there:

1463270694.txt

Will keep trying to come up with more info.

derekmarcotte commented May 16, 2016

@aclements: Here are some more logs from a binary built with the patch:

1463315708-finalizer-already-set.txt
1463349807-finalizer-already-set.txt
1463352601-workbuf-not-empty.txt
1463362849-workbuf-empty.txt
1463378745-wedged-gdb.txt

Please let me know if I can be of further assistance.

aclements commented May 16, 2016

Thanks for the logs! I hadn't realized there were two finalizers involved here. Could you also comment out the SetFinalizer in NewFile in os/file_unix.go and see if it's still reproducible? (Your test also doesn't need that finalizer.)

aclements commented May 16, 2016

I suspected a race vs. memory corruption because by and large it is the finalizer already set that crashes the process.

I didn't mean to say that it isn't necessarily a race. It's actually quite likely a race, but it's resulting in corruption of internal runtime structures, which suggests that the race is happening on freed memory. The "workbuf is empty" failure mode especially points at memory corruption, which is why my initial guess is that the finalizers (and the specials queue in general) may be victims rather than perpetrators. It's also easy to get finalizers out of the picture, while it's harder to get fork/exec out of the picture without completely changing the program. :)

derekmarcotte commented May 17, 2016

Thanks @aclements !

One crash, ~4 hours into running, since removing the SetFinalizer in NewFile.

1463429459.txt

I have a second process running for almost 11 hours, with 4 of the threads wedged, but it is still doing work.

derekmarcotte commented May 17, 2016

After ~11 hours + 5 minutes, the process panicked:

1463445934-invalid-p-state.txt

aclements commented May 17, 2016

Thanks for the new logs. I suspect that's actually not the same failure, which suggests that the original problem is in fact related to finalizers.

derekmarcotte commented May 17, 2016

My pleasure, thanks for the feedback. Are there any next steps for me? (I'm likely to poke around this underlying issue, although I'm neither a Go runtime guy nor a FreeBSD systems-level guy - yet. I would like to be as helpful as possible.)

Thanks again!

derekmarcotte commented May 19, 2016

I'm going to post a few more here. This'll be my last batch, unless requested. 😄 1463584079 is a new message.

 ==> 1463485780 <==
Starting a bunch of goroutines...

fatal error: workbuf is not empty

runtime stack:
runtime.throw(0x4bff02, 0x14)
        /home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b
runtime.(*workbuf).checkempty(0x8007dae00)
        /home/derek/go/src/github.com/golang/go/src/runtime/mgcwork.go:301 +0x3f
runtime.getempty(0x8007dae00)

==> 1463584079 <==
Starting a bunch of goroutines...
fatal error: MHeap_AllocLocked - MSpan not free

runtime stack:
runtime.throw(0x4c2528, 0x22)
        /home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b
runtime.(*mheap).allocSpanLocked(0x52dbe0, 0x1, 0xc4200ae1a0)
        /home/derek/go/src/github.com/golang/go/src/runtime/mheap.go:637 +0x498
runtime.(*mheap).alloc_m(0x52dbe0, 0x1, 0x12, 0xc420447fe0)
        /home/derek/go/src/github.com/golang/go/src/runtime/mheap.go:510 +0xd6

==> 1463603516 <==
Starting a bunch of goroutines...
acquirep: p->m=842351485952(10) p->status=1
acquirep: p->m=842350813184(4) p->status=1
fatal error: acquirep: invalid p state
fatal error: acquirep: invalid p state

runtime stack:
runtime.throw(0x4c0ab3, 0x19)
        /home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b
runtime.acquirep1(0xc420015500)

==> 1463642257 <==
Starting a bunch of goroutines...
acquirep: p->m=0(0) p->status=2
fatal error: acquirep: invalid p state
acquirep: p->m=0(0) p->status=3
fatal error: acquirep: invalid p state

runtime stack:
runtime.throw(0x4c0ab3, 0x19)
        /home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b
runtime.acquirep1(0xc42001c000)

aclements commented May 19, 2016

Thanks! I assume these last few failures are also with master and with the two SetFinalizer calls commented out?

derekmarcotte commented May 19, 2016

That's correct! Thanks. Anyone else able to reproduce?

aclements commented May 26, 2016

@derekmarcotte, what version of FreeBSD and how many CPUs are you running with? I haven't had any luck yet reproducing on FreeBSD 10.1 with 2 CPUs.

derekmarcotte commented May 26, 2016

@aclements: both machines were 4 core, 8 thread CPUs:

Intel(R) Xeon(R) CPU E5-2637 v3 @ 3.50GHz (32GB/ECC) - none of the logs included were from this machine, but I could not keep my process (from my project, not the listing in this issue) running for very long

AMD FX(tm)-8350 Eight-Core Processor (8GB) - this is my dev box where all the logs are included

The Xeon is running 10.3-RELEASE, and the AMD was running 10.1-RELEASE at the time of the logs (has since been upgraded to 10.3-RELEASE).

I suspect I would be able to chew through many more invocations in the same time as a 2 core machine on these hosts, and additionally increase probability of contention/collisions in any given instant.

The Xeon has since moved to production, so I don't have that hardware at my disposal for the time being, although I might be able to arrange something if it's required.

I can get dmesgs/kldstats for the xeon, and amd if helpful (would rather post out of band).

aclements commented May 31, 2016

@derekmarcotte, thanks for the extra details (I see now you already gave the CPU configuration; sorry I missed that).

Two more experiments to try:

  1. See if you can reproduce with GOMAXPROCS=2 (either 1.6.2 or master is fine).
  2. Try the same test program, but invoking an absolute path to a command that doesn't exist (and ignoring the error). This should get the exec out of the picture.
derekmarcotte commented Jun 1, 2016

@aclements, thanks for your suggestions. I'm currently exploring a different option: I was using gb for building my project (sorry! I forgot to mention that), and additionally for this test case.

I certainly didn't expect wildly differing binaries in a project with no external dependencies, as gb uses the go compiler internally. I've got more research to do here to account for this, so I apologize for that.

I've built using go directly and am in the process of testing. So far it has been running for 12 hours without problem (with the SetFinalizer calls disabled). I have had previous test runs last this long, so I'm not ready to call it a success just yet. I'll be out of town for the next few days, so I can leave it running for a while and see where it ends up.

I think this is a very promising lead, based on the objdump of the two artefacts. It might be interesting to ask about the build tool in the issue report template, given the build tool ecosystem that is currently out there (or that there is an ecosystem at all).

derekmarcotte commented Jun 1, 2016

@aclements, rebuilding gb from source using the go based on master, and then rebuilding the test with the new gb, creates nearly identical binaries (minus file path TEXT entries); this is to be expected.

Perhaps there's something to this. Will keep you posted.

josharian commented Jun 1, 2016

derekmarcotte commented Jun 2, 2016

@aclements, the go-only build of the binary did eventually crash... somewhere around 24 hours (~153M goroutines later), so I don't think it's gb-related. I'm going to re-build with the following, per your suggestion:

func run(done chan struct{}) {
        cmd := exec.Command("doesnotexist")
        cmd.Wait()

        done <- struct{}{}
        return
}

derekmarcotte commented Jun 6, 2016

@aclements, after the above change, it ran for 4 days without a panic.

blinkingled commented Dec 8, 2017

I was running the original test case posted above on DragonFlyBSD, mostly out of curiosity, and although I don't see any crashes/corruption in the short time I let it run, on 5.0.2 after running the test case I consistently get this in syslog:

td 0xffffff81dd9a6900 test: timeout race
td 0xffffff81e01e7400 test: timeout race

This is coming from sys/kern/kern_synch.c:716 - which is tsleep() -

 /*
         * Cleanup the timeout.  If the timeout has already occured thandle
         * has already been stopped, otherwise stop thandle.  If the timeout
         * is running (the callout thread must be blocked trying to get
         * lwp_token) then wait for us to get scheduled.
         */
        if (timo) {
                while (td->td_flags & TDF_TIMEOUT_RUNNING) {
                        /* else we won't get rescheduled! */
                        if (lp->lwp_stat != LSSTOP)
                                lp->lwp_stat = LSSLEEP;
                        lwkt_deschedule_self(td);
                        td->td_wmesg = "tsrace";
                        lwkt_switch();
                        kprintf("td %p %s: timeout race\n", td, td->td_comm);
                }

I am not sure this is even relevant - I haven't looked deeper, but thought I would throw this out just in case.

@ianlancetaylor ianlancetaylor modified the milestones: Go1.10, Go1.11 Jan 3, 2018

vangyzen commented Feb 15, 2018

I haven't read the details of this issue, but even so, it could be fixed by this recent change in the FreeBSD VM system:

https://svnweb.freebsd.org/base?view=revision&revision=329254

If you want to test that change, you'll also need this one:

https://svnweb.freebsd.org/base?view=revision&revision=329252

stevenh commented Feb 15, 2018

It's not clear whether these two kernel fixes will help with the issues identified here, but they are definitely in the right area, particularly the second one:

Do not call pmap_enter() with invalid protection mode.
Ensure memory consistency on COW

I'm trying to confirm either way whether the above have an impact; tests are in progress. I'll report back when I have more, but if anyone else has a good reproduction environment where they can apply the above patches and test, that would be great.

stevenh commented Feb 15, 2018

Hehe, I guess I should have refreshed my page before replying, thanks @vangyzen :)

stevenh commented Feb 17, 2018

Having applied the two patches, I've unfortunately managed to confirm they haven't solved the issue, as I had a crash:

acquirep: p->m=5491744(0) p->status=1
fatal error: acquirep: invalid p state

stevenh added a commit to stevenh/gopsutil that referenced this issue Mar 11, 2018

Eliminate call to swapinfo on FreeBSD
Improve performance by eliminating the fork out to swapinfo on FreeBSD
which also helps prevent crashes / hangs due to the outstanding fork
crash bug:
golang/go#15658

stevenh added a commit to stevenh/gopsutil that referenced this issue Mar 11, 2018

Eliminate call to swapinfo on FreeBSD
Improve performance by eliminating the fork out to swapinfo on FreeBSD which also helps prevent crashes / hangs due to the outstanding fork crash bug:
golang/go#15658

stevenh added a commit to stevenh/gopsutil that referenced this issue Mar 11, 2018

Eliminate call to swapinfo on FreeBSD
Improve performance by eliminating the fork out to swapinfo on FreeBSD which also helps prevent crashes / hangs due to the outstanding fork crash bug:
golang/go#15658

This also fixes the value reported by SwapMemory and SwapMemoryWithContext on FreeBSD which previously only included the first swap device and also reported the values in terms of 1K blocks instead of bytes.

stevenh added a commit to stevenh/gopsutil that referenced this issue Mar 11, 2018

Eliminate call to uname on FreeBSD
Improve performance by eliminating the fork out to uname on FreeBSD which also helps prevent crashes / hangs due to the outstanding fork crash bug:
golang/go#15658

Also added a test for PlatformInformation.
sean- commented Jun 14, 2018

@stevenh could you pull down https://svnweb.freebsd.org/base?view=revision&revision=335171 and try again? It seems like this is a different manifestation of the bug described in https://reviews.freebsd.org/D15293

stevenh commented Jun 14, 2018

Just seen that and thought that it could be related.

I've asked @kostikbel for clarification about how that issue exhibits, as I thought it may be related to this one.

If the patch is compatible with 11.x I'll grab it and run some more tests.

stevenh commented Jun 15, 2018

From the description I got from alc of how the issue this fix addresses exhibits, it sounds like it's not our cause, but I'm running a test nonetheless.

justinclift commented Jun 15, 2018

The way this bug exhibits, it sounds like having some form of system-wide reverse / time-travel debugger running would be handy, e.g. the bug exhibits, so hit "stop" and then track it backwards.

Not sure if such a thing exists for FreeBSD though. 👼

stevenh commented Jun 18, 2018

No go, it still crashes:

Starting 16 forking goroutines...
GOMAXPROCS: 24
acquirep: p->m=5491744(0) p->status=1
fatal error: acquirep: invalid p state

runtime stack:
runtime.throw(0x4c6fe1, 0x19)
        /usr/local/go/src/runtime/panic.go:605 +0x95 fp=0xc420077e00 sp=0xc420077de0 pc=0x426fc5
runtime.acquirep1(0xc42002a600)
        /usr/local/go/src/runtime/proc.go:3699 +0x16d fp=0xc420077e38 sp=0xc420077e00 pc=0x43205d
runtime.acquirep(0xc42002a600)
        /usr/local/go/src/runtime/proc.go:3671 +0x2b fp=0xc420077e50 sp=0xc420077e38 pc=0x431eab
runtime.stopm()
        /usr/local/go/src/runtime/proc.go:1689 +0x124 fp=0xc420077e78 sp=0xc420077e50 pc=0x42c924
runtime.findrunnable(0xc420020c00, 0x0)
        /usr/local/go/src/runtime/proc.go:2135 +0x4d2 fp=0xc420077f10 sp=0xc420077e78 pc=0x42dac2
runtime.schedule()
        /usr/local/go/src/runtime/proc.go:2255 +0x12c fp=0xc420077f58 sp=0xc420077f10 pc=0x42e57c
runtime.goexit0(0xc42013bc80)
        /usr/local/go/src/runtime/proc.go:2406 +0x236 fp=0xc420077fb0 sp=0xc420077f58 pc=0x42efd6
runtime.mcall(0x0)
        /usr/local/go/src/runtime/asm_amd64.s:286 +0x5b fp=0xc420077fc0 sp=0xc420077fb0 pc=0x44dfab

bartekrutkowski commented Jun 18, 2018

@justinclift Wouldn't DTrace be enough to investigate this problem?
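If DTrace were pressed into service here, a hypothetical starting point (probe and variable names are from the FreeBSD syscall provider; requires root, and observes events going forward only, so it cannot rewind to the corrupting store the way a time-travel debugger could) would be counting fork returns per executable while the reproducer runs:

```shell
# Tally fork() syscall returns by executable name and errno while the
# reproducer is running; Ctrl-C prints the aggregation.
dtrace -n 'syscall::fork:return { @[execname, errno] = count(); }'
```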

stevenh commented Jun 18, 2018

mattmacy commented Jul 4, 2018

This issue is a sprawling 2 year epic adventure. The fact that multiple bugs with different reproducers have been subsumed under one issue is unhelpful. Can someone write the Cliff's Notes, including the current fastest reproducer? Thanks in advance.

stevenh commented Jul 5, 2018

Hi Matt, my current reproduction case is:

package main
  
import (
        "fmt"
        "log"
        "runtime"
        "syscall"
)

var (
        forkRoutines = 16
)

func run(done chan struct{}) {
        if err := syscall.ForkOnlyBSDTest(); err != nil {
                log.Fatal(err)
        }

        done <- struct{}{}
}

func main() {
        fmt.Printf("Starting %v forking goroutines...\n", forkRoutines)
        fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))

        done := make(chan struct{}, forkRoutines*2)

        for i := 0; i < forkRoutines; i++ {
                go run(done)
        }

        for range done {
                go run(done)
        }
}

Then this in the lib:

// +build darwin dragonfly freebsd netbsd openbsd
  
package syscall

func ForkOnlyBSDTest() (err error) {
        var r1 uintptr
        var pid int
        var err1 Errno
        var wstatus WaitStatus

        ForkLock.Lock()
        runtime_BeforeFork()

        r1, _, err1 = RawSyscall(SYS_FORK, 0, 0, 0)
        if err1 != 0 {
                runtime_AfterFork()
                ForkLock.Unlock()
                return err1
        }

        if r1 == 0 {
                // in child, die die die my darling
                for {
                        RawSyscall(SYS_EXIT, 253, 0, 0)
                }
        }

        runtime_AfterFork()
        ForkLock.Unlock()

        pid = int(r1)

        // Prime directive, exterminate
        // Whatever stands left
        _, err = Wait4(pid, &wstatus, 0, nil)
        for err == EINTR {
                _, err = Wait4(pid, &wstatus, 0, nil)
        }

        return
}

It's not quick and not guaranteed.

stevenh commented Jul 5, 2018

There is also a thread on the freebsd ports mailing list, "lang/go failes to build with poudriere, since 2018-04-05", whose author has a reproduction case running the lang/go port build, which is related.

@ianlancetaylor ianlancetaylor modified the milestones: Go1.11, Go1.12 Jul 6, 2018

mattmacy commented Jul 6, 2018

@stevenh is there any correlation with the amount of memory in the system or the number of cores? For example, would lowering hw.physmem on a 48-way help?

stevenh commented Jul 7, 2018
