runtime: FreeBSD memory corruption involving fork system call #15658

Open
derekmarcotte opened this Issue May 12, 2016 · 234 comments

derekmarcotte commented May 12, 2016

Please answer these questions before submitting your issue. Thanks!

  1. What version of Go are you using (go version)?

go version go1.6.2 freebsd/amd64

  2. What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="freebsd"
GOOS="freebsd"
GOPATH=""
GORACE=""
GOROOT="/usr/local/go"
GOTOOLDIR="/usr/local/go/pkg/tool/freebsd_amd64"
GO15VENDOREXPERIMENT="1"
CC="cc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0"
CXX="clang++"
  3. What did you do?

package main

/* stdlib includes */
import (
        "fmt"
        "os/exec"
)

func run(done chan struct{}) {
        cmd := exec.Command("true")
        if err := cmd.Start(); err != nil {
                goto finished
        }

        cmd.Wait()

finished:
        done <- struct{}{}
        return
}

func main() {
        fmt.Println("Starting a bunch of goroutines...")

        // 8 & 16 are arbitrary
        done := make(chan struct{}, 16)

        for i := 0; i < 8; i++ {
                go run(done)
        }

        for {
                select {
                case <-done:
                        go run(done)
                }
        }
}
  4. What did you expect to see?

I expect this strange program to spawn instances of /bin/true in parallel, until I stop it.

  5. What did you see instead?

Various types of panics, caused by what looks to be corruption within the finalizer lists, which I assume is the result of race conditions. These panics can happen as quickly as 2 minutes in, or take much longer; 10 minutes seems a good round number.

Occasionally addspecial gets stuck in an infinite loop holding the lock, and the process wedges. This is illustrated in log 1462933614, with x.next pointing to x. This appears to be corruption of that data structure. I have seen processes in this state run for 22 hours.

I understand there is some trepidation expressed in issue #11485 around the locking of the data structures involved.

Here are some sample messages:

  • fatal error: runtime.SetFinalizer: finalizer already set
  • runtime: nonempty check fails b.log[0]= 0 b.log[1]= 0 b.log[2]= 0 b.log[3]= 0 fatal error: workbuf is empty

1462926841-SetFinalizer-ex1.txt
1462926969-SetFinalizer-ex2.txt
1462933295-nonempty-check-fails.txt
1462933614-wedged.txt

This was run on an 8-core processor and on a 4-core, 8-thread processor with ECC RAM, with similar results.

Additionally, while this example is extreme, it represents the core functionality of a project I've been working on part-time for many months. I'm happy to provide any further assistance diagnosing this issue - I'm very invested!

derekmarcotte commented May 12, 2016

Here's another panic experienced in mallocgc by the same sample code:

1463046620-malloc-panic.txt

@bradfitz bradfitz added this to the Go1.7Maybe milestone May 12, 2016

bradfitz commented May 12, 2016

@derekmarcotte, can you also reproduce this at master? (which will become Go 1.7)

And do you only see it on FreeBSD, or other operating systems as well?

/cc @aclements @ianlancetaylor

aclements commented May 12, 2016

@RLH, weren't you seeing "finalizer already set" failures a little while ago? Did you track that down? Could it have been related?

aclements commented May 12, 2016

On closer inspection of the failures you posted (thanks for collecting several, BTW), this smells like memory corruption. Just guessing, there are a few likely culprits. It may be the finalizer code, but I actually think that's less likely. More likely is that it's the fork/exec code: that code is really subtle, mucks with the address space, and contains system-specific parts (which would explain why it's showing up on FreeBSD, but I haven't been able to reproduce it on Linux yet).

@derekmarcotte, can you try commenting out the runtime.SetFinalizer call in newProcess in os/exec.go (your test doesn't need that finalizer) and see if you can still reproduce it? If you can, that will rule out finalizers.

bradfitz commented May 12, 2016

Note that FreeBSD runs via gomote, if this is that easily reproducible. I haven't yet tried.

derekmarcotte commented May 13, 2016

Just got a golang/go dev environment set up on my machine (was from FreeBSD packages). Will report back soon.

derekmarcotte commented May 13, 2016

Here are the heads of a bunch of logs, with the epoch of each process's start so you can see the interval. I suspected a race rather than memory corruption because, by and large, it is the "finalizer already set" error that crashes the process. I thought maybe the GC was marking these objects as free (or otherwise touching them) before SetFinalizer had a chance to set their value.

I didn't include too many of them in my initial report, because I thought they were largely redundant.

@bradfitz: these logs are against master:

1463168442
Starting a bunch of goroutines...
fatal error: runtime.SetFinalizer: finalizer already set

runtime stack:
runtime.throw(0x4c366d, 0x2b)
        /home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b fp=0xc420059f48 sp=0xc420059f30
runtime.SetFinalizer.func2()
        /home/derek/go/src/github.com/golang/go/src/runtime/mfinal.go:375 +0x73 fp=0xc420059f80 sp=0xc420059f48
runtime.systemstack(0xc420019500)
        /home/derek/go/src/github.com/golang/go/src/runtime/asm_amd64.s:298 +0x79 fp=0xc420059f88 sp=0xc420059f80


1463170469
Starting a bunch of goroutines...
fatal error: runtime.SetFinalizer: finalizer already set

runtime stack:
runtime.throw(0x4c366d, 0x2b)
        /home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b fp=0x7fffffffea80 sp=0x7fffffffea68
runtime.SetFinalizer.func2()
        /home/derek/go/src/github.com/golang/go/src/runtime/mfinal.go:375 +0x73 fp=0x7fffffffeab8 sp=0x7fffffffea80
runtime.systemstack(0x52ad00)
        /home/derek/go/src/github.com/golang/go/src/runtime/asm_amd64.s:298 +0x79 fp=0x7fffffffeac0 sp=0x7fffffffeab8


1463170494
Starting a bunch of goroutines...
fatal error: runtime.SetFinalizer: finalizer already set

runtime stack:
runtime.throw(0x4c366d, 0x2b)
        /home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b fp=0xc420067f48 sp=0xc420067f30
runtime.SetFinalizer.func2()
        /home/derek/go/src/github.com/golang/go/src/runtime/mfinal.go:375 +0x73 fp=0xc420067f80 sp=0xc420067f48
runtime.systemstack(0xc420019500)
        /home/derek/go/src/github.com/golang/go/src/runtime/asm_amd64.s:298 +0x79 fp=0xc420067f88 sp=0xc420067f80


1463171133
Starting a bunch of goroutines...
fatal error: runtime.SetFinalizer: finalizer already set

runtime stack:
runtime.throw(0x4c366d, 0x2b)
        /home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b fp=0xc4202cbf48 sp=0xc4202cbf30
runtime.SetFinalizer.func2()
        /home/derek/go/src/github.com/golang/go/src/runtime/mfinal.go:375 +0x73 fp=0xc4202cbf80 sp=0xc4202cbf48
runtime.systemstack(0xc42001c000)
        /home/derek/go/src/github.com/golang/go/src/runtime/asm_amd64.s:298 +0x79 fp=0xc4202cbf88 sp=0xc4202cbf80

@aclements: I will try your patch next. One variable at a time.


RLH commented May 14, 2016

The "fatal error: runtime.SetFinalizer: finalizer already set" bug I was seeing a few days ago was on the 1.8 dev.garbage branch, and was the result of a TOC write barrier not marking an object as published, TOC doing a rollback, and reallocating a new object over an old one. Both objects had finalizers, and the bug was tripped.

None of the TOC code is in 1.7, or for that matter on tip. I can force myself to imagine that the 1.7 allocCache code could cause a similar situation if the allocCache and bitmaps were not coherent, but if that were the case I would expect lots of failures all over the place and across all platforms, not just FreeBSD.

On Fri, May 13, 2016 at 6:47 PM, Derek Marcotte notifications@github.com
wrote:

Here's the heads of a bunch of logs with the epoch at the start of the
process, so you can see the interval. I suspected a race vs. memory
corruption because by and large it is the finalizer already set that
crashes the process. I thought maybe the gc was setting these as free (or
otherwise touching them) before the SetFinalizer had a chance to set their
value.

I didn't include too many of them in my initial report, because I thought
they were largely redundant.

@bradfitz https://github.com/bradfitz: these logs are against master:

1463168442
Starting a bunch of goroutines...
fatal error: runtime.SetFinalizer: finalizer already set

runtime stack:
runtime.throw(0x4c366d, 0x2b)
/home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b fp=0xc420059f48 sp=0xc420059f30
runtime.SetFinalizer.func2()
/home/derek/go/src/github.com/golang/go/src/runtime/mfinal.go:375 +0x73 fp=0xc420059f80 sp=0xc420059f48
runtime.systemstack(0xc420019500)
/home/derek/go/src/github.com/golang/go/src/runtime/asm_amd64.s:298 +0x79 fp=0xc420059f88 sp=0xc420059f80

1463170469
Starting a bunch of goroutines...
fatal error: runtime.SetFinalizer: finalizer already set

runtime stack:
runtime.throw(0x4c366d, 0x2b)
/home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b fp=0x7fffffffea80 sp=0x7fffffffea68
runtime.SetFinalizer.func2()
/home/derek/go/src/github.com/golang/go/src/runtime/mfinal.go:375 +0x73 fp=0x7fffffffeab8 sp=0x7fffffffea80
runtime.systemstack(0x52ad00)
/home/derek/go/src/github.com/golang/go/src/runtime/asm_amd64.s:298 +0x79 fp=0x7fffffffeac0 sp=0x7fffffffeab8

1463170494
Starting a bunch of goroutines...
fatal error: runtime.SetFinalizer: finalizer already set

runtime stack:
runtime.throw(0x4c366d, 0x2b)
/home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b fp=0xc420067f48 sp=0xc420067f30
runtime.SetFinalizer.func2()
/home/derek/go/src/github.com/golang/go/src/runtime/mfinal.go:375 +0x73 fp=0xc420067f80 sp=0xc420067f48
runtime.systemstack(0xc420019500)
/home/derek/go/src/github.com/golang/go/src/runtime/asm_amd64.s:298 +0x79 fp=0xc420067f88 sp=0xc420067f80

1463171133
Starting a bunch of goroutines...
fatal error: runtime.SetFinalizer: finalizer already set

runtime stack:
runtime.throw(0x4c366d, 0x2b)
/home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b fp=0xc4202cbf48 sp=0xc4202cbf30
runtime.SetFinalizer.func2()
/home/derek/go/src/github.com/golang/go/src/runtime/mfinal.go:375 +0x73 fp=0xc4202cbf80 sp=0xc4202cbf48
runtime.systemstack(0xc42001c000)
/home/derek/go/src/github.com/golang/go/src/runtime/asm_amd64.s:298 +0x79 fp=0xc4202cbf88 sp=0xc4202cbf80

@aclements https://github.com/aclements: I will try your patch next.
One variable at a time.


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#15658 (comment)

derekmarcotte commented May 15, 2016

@aclements: I've run it with your patch, although I haven't been able to babysit it too much.

The first time, all threads were idle after a number of hours (i.e. 0% cpu across the board). Connecting gdb to that process gave me trouble, and I couldn't get any logging out of it.

This morning, I was able to connect to a different process that looks a lot like 1462933614-wedged.txt. I've attached a log from gdb there:

1463270694.txt

Will keep trying to come up with more info.


derekmarcotte commented May 16, 2016

@aclements: Here's some more logs from a binary build with the patch:

1463315708-finalizer-already-set.txt
1463349807-finalizer-already-set.txt
1463352601-workbuf-not-empty.txt
1463362849-workbuf-empty.txt
1463378745-wedged-gdb.txt

Please let me know if I can be of further assistance.

aclements commented May 16, 2016

Thanks for the logs! I hadn't realized there were two finalizers involved here. Could you also comment out the SetFinalizer in NewFile in os/file_unix.go and see if it's still reproducible? (Your test also doesn't need that finalizer.)

aclements commented May 16, 2016

> I suspected a race vs. memory corruption because by and large it is the finalizer already set that crashes the process.

I didn't mean to say that it isn't necessarily a race. It's actually quite likely a race, but it's resulting in corruption of internal runtime structures, which suggests that the race is happening on freed memory. The "workbuf is empty" failure mode especially points at memory corruption, which is why my initial guess is that the finalizers (and the specials queue in general) may be victims rather than perpetrators. It's also easy to get finalizers out of the picture, while it's harder to get fork/exec out of the picture without completely changing the program. :)

@derekmarcotte derekmarcotte referenced this issue in derekmarcotte/hfm May 16, 2016

Open

FreeBSD instability: go runtime panics #1

derekmarcotte commented May 17, 2016

Thanks @aclements !

One crash, ~4 hours into running, since removing the SetFinalizer in NewFile.

1463429459.txt

I have a second process running for almost 11 hours, with 4 of the threads wedged, but it is still doing work.

derekmarcotte commented May 17, 2016

After ~11 hours + 5 minutes, the process panic'd:

1463445934-invalid-p-state.txt


aclements commented May 17, 2016

Thanks for the new logs. I suspect that's actually not the same failure, which suggests that the original problem is in fact related to finalizers.

derekmarcotte commented May 17, 2016

My pleasure, thanks for the feedback. Are there any next steps for me? (I'm likely to poke around this bottom issue, although I'm neither a go runtime guy, nor a FreeBSD systems-level guy - yet. Would like to be as helpful as possible.)

Thanks again!


derekmarcotte commented May 19, 2016

I'm going to post a few more here. This'll be my last batch, unless requested. 😄 1463584079 is a new message.

 ==> 1463485780 <==
Starting a bunch of goroutines...

fatal error: workbuf is not empty

runtime stack:
runtime.throw(0x4bff02, 0x14)
        /home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b
runtime.(*workbuf).checkempty(0x8007dae00)
        /home/derek/go/src/github.com/golang/go/src/runtime/mgcwork.go:301 +0x3f
runtime.getempty(0x8007dae00)

==> 1463584079 <==
Starting a bunch of goroutines...
fatal error: MHeap_AllocLocked - MSpan not free

runtime stack:
runtime.throw(0x4c2528, 0x22)
        /home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b
runtime.(*mheap).allocSpanLocked(0x52dbe0, 0x1, 0xc4200ae1a0)
        /home/derek/go/src/github.com/golang/go/src/runtime/mheap.go:637 +0x498
runtime.(*mheap).alloc_m(0x52dbe0, 0x1, 0x12, 0xc420447fe0)
        /home/derek/go/src/github.com/golang/go/src/runtime/mheap.go:510 +0xd6

==> 1463603516 <==
Starting a bunch of goroutines...
acquirep: p->m=842351485952(10) p->status=1
acquirep: p->m=842350813184(4) p->status=1
fatal error: acquirep: invalid p state
fatal error: acquirep: invalid p state

runtime stack:
runtime.throw(0x4c0ab3, 0x19)
        /home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b
runtime.acquirep1(0xc420015500)

==> 1463642257 <==
Starting a bunch of goroutines...
acquirep: p->m=0(0) p->status=2
fatal error: acquirep: invalid p state
acquirep: p->m=0(0) p->status=3
fatal error: acquirep: invalid p state

runtime stack:
runtime.throw(0x4c0ab3, 0x19)
        /home/derek/go/src/github.com/golang/go/src/runtime/panic.go:566 +0x8b
runtime.acquirep1(0xc42001c000)

aclements commented May 19, 2016

Thanks! I assume these last few failures are also with master and with the two SetFinalizer calls commented out?

derekmarcotte commented May 19, 2016

That's correct! Thanks. Anyone else able to reproduce?


aclements commented May 26, 2016

@derekmarcotte, what version of FreeBSD and how many CPUs are you running with? I haven't had any luck yet reproducing on FreeBSD 10.1 with 2 CPUs.

derekmarcotte commented May 26, 2016

@aclements: both machines were 4 core, 8 thread CPUs:

Intel(R) Xeon(R) CPU E5-2637 v3 @ 3.50GHz (32GB/ECC) - none of the included logs are from this machine, but it could not keep my process (from my project, not the listing in this issue) running for very long

AMD FX(tm)-8350 Eight-Core Processor (8GB) - this is my dev box where all the logs are included

The Xeon is running 10.3-RELEASE, and the AMD was running 10.1-RELEASE at the time of the logs (has since been upgraded to 10.3-RELEASE).

I suspect these hosts can chew through many more invocations in the same amount of time than a 2-core machine, which additionally increases the probability of contention/collisions at any given instant.

The Xeon has since moved to production, so I don't have that hardware at my disposal for the time being, although I might be able to arrange something if it's required.

I can get dmesgs/kldstats for the Xeon and AMD machines if helpful (would rather post out of band).


@aclements

aclements May 31, 2016

Member

@derekmarcotte, thanks for the extra details (I see now you already gave the CPU configuration; sorry I missed that).

Two more experiments to try:

  1. See if you can reproduce with GOMAXPROCS=2 (either 1.6.2 or master is fine).
  2. Try the same test program, but invoking an absolute path to a command that doesn't exist (and ignoring the error). This should get the exec out of the picture.
@derekmarcotte

derekmarcotte Jun 1, 2016

@aclements, Thanks for your suggestions. I'm currently exploring a different option. I was using gb for building my project (sorry! forgot to mention), and additionally for this test case.

I certainly didn't expect wildly differing binaries in a project with no external dependencies, as gb uses the go compiler internally. I've got more research to do here to account for this, so I apologize for that.

I've built using go directly and am in the process of testing. So far it has been running for 12 hours, without problem (with the SetFinalizers disabled). I have had previous test runs last this long, so I'm not ready to call it a success just yet. I'll be out of town for the next few days, so I can leave it running for a while and see where it ends up.

I think this is a very promising lead, based on the objdump of the two artefacts. It might be interesting to ask about the build tool in the issue report template, given the build-tool ecosystem that is currently out there (or the fact that there is an ecosystem at all).


@derekmarcotte

derekmarcotte Jun 1, 2016

@aclements, rebuilding gb from source using the go based on master, and then rebuilding the test with the new gb creates nearly identical binaries (minus file path TEXT entries), this is to be expected.

Perhaps there's something to this. Will keep you posted.


@josharian

Contributor

josharian commented Jun 1, 2016

@derekmarcotte


derekmarcotte Jun 2, 2016

@aclements, the go-only build of the binary did eventually crash... somewhere around 24 hours (~153M goroutines later), so I don't think it's gb-related. I'm going to re-build with the following, per your suggestion:

func run(done chan struct{}) {
        cmd := exec.Command("doesnotexist")
        cmd.Wait()

        done <- struct{}{}
        return
}

@derekmarcotte


derekmarcotte Jun 6, 2016

@aclements, after the above change, it ran for 4 days without a panic.


@aclements


aclements Jun 6, 2016

Member

@derekmarcotte, argh, I'm sorry. That wasn't a useful test. :( That eliminates the fork, the exec, and the finalizer, so there isn't really anything left. It needs to be something like:

func run(done chan struct{}) {
    cmd := exec.Command("/doesnotexist")
    cmd.Start()
    cmd.Wait()

    done <- struct{}{}
    return
}

It's important that the path be absolute. Otherwise, Go will try to resolve the path itself, fail at that point, and not do anything for the subsequent method calls. It's also important that the cmd.Start() still happen (even though it will fail). Again, without that, the Wait just returns nil immediately.

I'm sorry I didn't catch that earlier.


@derekmarcotte


derekmarcotte Jun 9, 2016

@aclements, at least we know that Go can spin for 4 days without crashing on my dev machine - so not a total waste.

I've recompiled with the above, and crashed after ~32 hours. Log attached:

doesnotexists.txt


@aclements


aclements Jun 9, 2016

Member

@derekmarcotte, thanks! Was that with go or gb? With or without the SetFinalizer calls?


@aclements


aclements Jun 9, 2016

Member

Also, 1.6.2 or master?


@derekmarcotte


derekmarcotte Jun 9, 2016

@aclements great questions. I was thinking we ought to recap all the variables at this point anyway.

Trying to narrow things down as much as possible. So, this is with go build using master (as of 7af2ce3), with this diff applied:

diff --git a/src/os/exec.go b/src/os/exec.go
index 239fd92..6a8eed5 100644
--- a/src/os/exec.go
+++ b/src/os/exec.go
@@ -5,7 +5,6 @@
 package os

 import (
-       "runtime"
        "sync/atomic"
        "syscall"
 )
@@ -19,7 +18,6 @@ type Process struct {

 func newProcess(pid int, handle uintptr) *Process {
        p := &Process{Pid: pid, handle: handle}
-       runtime.SetFinalizer(p, (*Process).Release)
        return p
 }

diff --git a/src/os/file_unix.go b/src/os/file_unix.go
index 9b64f21..a997e9e 100644
--- a/src/os/file_unix.go
+++ b/src/os/file_unix.go
@@ -54,7 +54,6 @@ func NewFile(fd uintptr, name string) *File {
                return nil
        }
        f := &File{&file{fd: fdi, name: name}}
-       runtime.SetFinalizer(f.file, (*file).close)
        return f
 }

with source:

package main

/* stdlib includes */
import (
    "fmt"
    "os/exec"
)

func run(done chan struct{}) {
    cmd := exec.Command("/doesnotexist")
    cmd.Start()
    cmd.Wait()

    done <- struct{}{}
    return
}

func main() {
    fmt.Println("Starting a bunch of goroutines...")

    // 8 & 16 are arbitrary
    done := make(chan struct{}, 16)

    for i := 0; i < 8; i++ {
        go run(done)
    }

    for {
        select {
        case <-done:
            go run(done)
        }
    }
}

@ianlancetaylor


ianlancetaylor Jun 14, 2016

Contributor

It sounds like this is FreeBSD-specific, so marking as such. Please correct me if I am wrong.


@ianlancetaylor ianlancetaylor changed the title from runtime: garbage collector race condition (SetFinalizer, addfinalizer, addspecial, removespecial) to runtime: FreeBSD memory corruption involving finalizers Jun 14, 2016

@derekmarcotte


derekmarcotte Jun 15, 2016

@ianlancetaylor : thanks for this. I'm just building a linux box dedicated to testing tonight. I haven't yet been able to test on linux. I'll update the thread as I find more information out.


@adg adg added this to the Go1.8 milestone Jun 20, 2016

@derekmarcotte


derekmarcotte Jun 16, 2017

@rsc: Thanks. Agreed that there may be other things going on here.

Having said that, the issue I'm referencing is having the same crashes (fatal error: workbuf is not empty, fatal error: acquirep: invalid p state), in the same places, under Linux running the same test as in FreeBSD. I suspect this should be enough to invalidate the change to FreeBSD-only, yet perhaps not enough to merge the two issues.

I haven't seen the fatal error: unexpected signal during runtime execution referenced there, for example, and I've seen a lot of crashes.


@jclinton


jclinton Jun 25, 2017

Are the folks experiencing this running on Skylake or Kaby Lake hardware? If so, this might be relevant: https://lists.debian.org/debian-devel/2017/06/msg00308.html . Please try updating your CPU microcode to the latest release.


@stevenh


stevenh Jun 25, 2017

Contributor

Nope, Nehalem EP here.


@ianlancetaylor


ianlancetaylor Aug 2, 2017

Contributor

I guess we still don't know what is happening. Punting to 1.10.


@ianlancetaylor ianlancetaylor modified the milestones: Go1.10, Go1.9 Aug 2, 2017

@justinclift justinclift referenced this issue in sqlitebrowser/dbhub.io Aug 14, 2017

Open

Cannot connect #73

@reezer


reezer Aug 25, 2017

I agree, @derekmarcotte. It doesn't make sense to state that something is FreeBSD-related when something very similar happens on at least one other OS too.

If this gets changed, it should be mentioned in the release notes as well, though there it might be enough to simply mention Linux and reference the corresponding issue.


@dankm


dankm Sep 19, 2017

Might this change to FreeBSD be related? It doesn't seem likely to me, but others here know more about how Go works than I do.


@stevenh


stevenh Sep 20, 2017

Contributor

It looks like Go only uses the LDT on 386 and not amd64, so I suspect this wouldn't have any impact, @dankm.



stevenh added a commit to stevenh/gopsutil that referenced this issue Oct 5, 2017

Eliminate use of sysctl command on FreeBSD
In order to improve performance and help prevent crashes due to the outstanding fork crash bug:
golang/go#15658

Replace string parsed values from the sysctl command with native reads of sysctl values using unix.SysctlRaw and unix.SysctlUint32.

This also merges OpenBSD and FreeBSD load implementations which are identical.

@stevenh stevenh referenced this issue in shirou/gopsutil Oct 5, 2017

Merged

Eliminate use of sysctl command on FreeBSD #432

@stevenh


stevenh Oct 11, 2017

Contributor

Just wanted to confirm this is still an issue with 1.9.1 which includes the signal handling fixes:

2017/10/10 22:57:08 Starting 8 forking goroutines...
2017/10/10 22:57:08 GOMAXPROCS: 24 version: go1.9.1
acquirep: p->m=842351920128(9) p->status=1
fatal error: acquirep: invalid p state

runtime stack:
runtime.throw(0x7a8101, 0x19)
        /usr/local/go/src/runtime/panic.go:605 +0x95
runtime.acquirep1(0xc42001ec00)
        /usr/local/go/src/runtime/proc.go:3689 +0x16d
runtime.acquirep(0xc42001ec00)
        /usr/local/go/src/runtime/proc.go:3661 +0x2b
runtime.findrunnable(0xc420021300, 0x0)
        /usr/local/go/src/runtime/proc.go:2068 +0x74f
runtime.schedule()
        /usr/local/go/src/runtime/proc.go:2245 +0x12c
runtime.park_m(0xc420152900)
        /usr/local/go/src/runtime/proc.go:2308 +0xb6
runtime.mcall(0x800a0b040)
        /usr/local/go/src/runtime/asm_amd64.s:286 +0x5b
...
1 runs completed, 1 failures, over 6h47m54s

continuum-milind-adpaiker added a commit to ContinuumLLC/godep-shirou-gopsutil that referenced this issue Oct 26, 2017

Eliminate use of sysctl command on FreeBSD
In order to improve performance and help prevent crashes due to the outstanding fork crash bug:
golang/go#15658

Replace string parsed values from the sysctl command with native reads of sysctl values using unix.SysctlRaw and unix.SysctlUint32.

This also merges OpenBSD and FreeBSD load implementations which are identical.

@marcan marcan referenced this issue in prometheus/node_exporter Nov 8, 2017

Closed

Various crashes/segfaults on one host #730

@blinkingled


blinkingled Dec 8, 2017

I was running the original test case posted above on DragonFlyBSD out of curiosity mostly and although I don't see any crashes/corruption in the short time I let it run, on 5.0.2 after running the testcase I consistently get this in syslog -

td 0xffffff81dd9a6900 test: timeout race
td 0xffffff81e01e7400 test: timeout race

This is coming from sys/kern/kern_synch.c:716 - which is tsleep() -

 /*
         * Cleanup the timeout.  If the timeout has already occured thandle
         * has already been stopped, otherwise stop thandle.  If the timeout
         * is running (the callout thread must be blocked trying to get
         * lwp_token) then wait for us to get scheduled.
         */
        if (timo) {
                while (td->td_flags & TDF_TIMEOUT_RUNNING) {
                        /* else we won't get rescheduled! */
                        if (lp->lwp_stat != LSSTOP)
                                lp->lwp_stat = LSSLEEP;
                        lwkt_deschedule_self(td);
                        td->td_wmesg = "tsrace";
                        lwkt_switch();
                        kprintf("td %p %s: timeout race\n", td, td->td_comm);
                }

I am not sure this is even relevant - I haven't looked deeper, but thought I would throw this out just in case.


@ianlancetaylor ianlancetaylor modified the milestones: Go1.10, Go1.11 Jan 3, 2018

@vangyzen


vangyzen Feb 15, 2018

I haven't read the details of this issue, but even so, it could be fixed by this recent change in the FreeBSD VM system:

https://svnweb.freebsd.org/base?view=revision&revision=329254

If you want to test that change, you'll also need this one:

https://svnweb.freebsd.org/base?view=revision&revision=329252


@stevenh


stevenh Feb 15, 2018

Contributor

It's not clear if these two kernel fixes will help fix the issues identified here but they are definitely in the right area, particularly the second one.
Do not call pmap_enter() with invalid protection mode.
Ensure memory consistency on COW

I'm trying to confirm either way if the above do have an impact, tests in progress. I'll report back when I have more, but if anyone else has a good reproduction environment which they can apply the above patches and test that would be great.


@stevenh


stevenh Feb 15, 2018

Contributor

Hehe, I guess I should have refreshed my page before replying, thanks @vangyzen :)


@stevenh


stevenh Feb 17, 2018

Contributor

Having applied the two patches, I've unfortunately managed to confirm they haven't solved the issue, as I had a crash:

acquirep: p->m=5491744(0) p->status=1
fatal error: acquirep: invalid p state

stevenh added a commit to stevenh/gopsutil that referenced this issue Mar 11, 2018

Eliminate call to swapinfo on FreeBSD
Improve performance by eliminating the fork out to swapinfo on FreeBSD
which also helps prevent crashes / hangs due to the outstanding fork
crash bug:
golang/go#15658


@stevenh stevenh referenced this issue in shirou/gopsutil Mar 11, 2018

Merged

Eliminate call to swapinfo on FreeBSD #497

stevenh added a commit to stevenh/gopsutil that referenced this issue Mar 11, 2018

Eliminate call to swapinfo on FreeBSD
Improve performance by eliminating the fork out to swapinfo on FreeBSD which also helps prevent crashes / hangs due to the outstanding fork crash bug:
golang/go#15658

This also fixes the value reported by SwapMemory and SwapMemoryWithContext on FreeBSD which previously only included the first swap device and also reported the values in terms of 1K blocks instead of bytes.

stevenh added a commit to stevenh/gopsutil that referenced this issue Mar 11, 2018

Eliminate call to uname on FreeBSD
Improve performance by eliminating the fork out to uname on FreeBSD which also helps prevent crashes / hangs due to the outstanding fork crash bug:
golang/go#15658

Also added a test for PlatformInformation.

@stevenh stevenh referenced this issue in shirou/gopsutil Mar 11, 2018

Merged

Eliminate call to uname on FreeBSD #498

@sean-


sean- Jun 14, 2018

@stevenh could you pull down https://svnweb.freebsd.org/base?view=revision&revision=335171 and try again? It seems like this is a different manifestation of the bug described in https://reviews.freebsd.org/D15293


@stevenh


stevenh Jun 14, 2018

Contributor

Just seen that and thought that it could be related.

I've asked @kostikbel for clarification about how that bug exhibits, with the thought that it may be related to this.

If the patch is compatible with 11.x I'll grab it and run some more tests.


@stevenh


stevenh Jun 15, 2018

Contributor

From the description I got from alc of how the issue this fix addresses exhibits, it sounds like it's not our cause, but I'm running a test nonetheless.


@justinclift


justinclift Jun 15, 2018

The way this bug exhibits, it sounds like having some form of system-wide reverse/time-travel debugger running would be handy, e.g. the bug exhibits, so hit "stop", then track it backwards.

Not sure if such a thing exists for FreeBSD though. 👼


@stevenh


stevenh Jun 18, 2018

Contributor

No, Go still crashes:

Starting 16 forking goroutines...
GOMAXPROCS: 24
acquirep: p->m=5491744(0) p->status=1
fatal error: acquirep: invalid p state

runtime stack:
runtime.throw(0x4c6fe1, 0x19)
        /usr/local/go/src/runtime/panic.go:605 +0x95 fp=0xc420077e00 sp=0xc420077de0 pc=0x426fc5
runtime.acquirep1(0xc42002a600)
        /usr/local/go/src/runtime/proc.go:3699 +0x16d fp=0xc420077e38 sp=0xc420077e00 pc=0x43205d
runtime.acquirep(0xc42002a600)
        /usr/local/go/src/runtime/proc.go:3671 +0x2b fp=0xc420077e50 sp=0xc420077e38 pc=0x431eab
runtime.stopm()
        /usr/local/go/src/runtime/proc.go:1689 +0x124 fp=0xc420077e78 sp=0xc420077e50 pc=0x42c924
runtime.findrunnable(0xc420020c00, 0x0)
        /usr/local/go/src/runtime/proc.go:2135 +0x4d2 fp=0xc420077f10 sp=0xc420077e78 pc=0x42dac2
runtime.schedule()
        /usr/local/go/src/runtime/proc.go:2255 +0x12c fp=0xc420077f58 sp=0xc420077f10 pc=0x42e57c
runtime.goexit0(0xc42013bc80)
        /usr/local/go/src/runtime/proc.go:2406 +0x236 fp=0xc420077fb0 sp=0xc420077f58 pc=0x42efd6
runtime.mcall(0x0)
        /usr/local/go/src/runtime/asm_amd64.s:286 +0x5b fp=0xc420077fc0 sp=0xc420077fb0 pc=0x44dfab
@bartekrutkowski


bartekrutkowski Jun 18, 2018

@justinclift Wouldn't DTrace be enough to investigate this problem?


@stevenh


stevenh Jun 18, 2018

Contributor