Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: panic marked free object in span[1.17] #54247

Closed
linsite opened this issue Aug 4, 2022 · 10 comments
Closed

runtime: panic marked free object in span[1.17] #54247

linsite opened this issue Aug 4, 2022 · 10 comments
Labels
compiler/runtime NeedsInvestigation WaitingForInfo
Milestone

Comments

@linsite
Copy link

linsite commented Aug 4, 2022

What version of Go are you using (go version)?

$ go version
go version go1.17.12 linux/amd64

Does this issue reproduce with the latest release?

Yes, reproduced with 1.17.13.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE="on"
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/ulin/.cache/go-build"
GOENV="/home/ulin/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/ulin/gopkg/gow/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/ulin/gopkg/gow"
GOPRIVATE=""
GOROOT="/home/ulin/bin/go"
GOSUMDB="off"
GOTMPDIR=""
GOTOOLDIR="/home/ulin/bin/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.17.12"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/ulin/dl/tmp/exporter/poc/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build4248681014=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Using go build to compile the following PoC.

package main

import (
        "fmt"
)

type TaskRunner struct {
        limitChan chan struct{}
}

func NewTaskRunner(concurrency int) *TaskRunner {
        return &TaskRunner{
                limitChan: make(chan struct{}, concurrency),
        }
}

func (rp *TaskRunner) Schedule(task func()) {
        rp.limitChan <- struct{}{}

        go func() {
                var a [125]int
                defer Recover(
                        func() {
                                <-rp.limitChan
                        },
                        func() {
                                a[0] = 1
                        },
                )

                task()
        }()
}

func Recover(cleanups ...func()) {
        for _, cleanup := range cleanups {
                cleanup()
        }

        if p := recover(); p != nil {
                fmt.Println(p)
        }
}

func main() {
        schd := NewTaskRunner(10)
        for {
                schd.Schedule(func() {
                        j := 0
                        for i := 0; i < 2000; i++ {
                                j += 1
                        }
                })
        }
}

Run the following script to launch 90 processes. The count can be small, 90 only to fast the reproducing.

for i in $(seq 1 90); do
    ./poc &>> poc-$$.log&
done
    echo "poc-$$.log"

What did you expect to see?

The 90 processes can run forever without any panics.

What did you see instead?

Found panics, the stderr back trace as follow:

runtime: pointer 0xc0002f5fa0 to unused region of span span.base()=0xc00047c000 span.limit=0xc00047dff8 span.state=1
runtime: found in object at *(0xc0001def90+0x0)
object=0xc0001def90 s.base()=0xc0001de000 s.limit=0xc0001e0000 s.spanclass=4 s.elemsize=16 s.state=mSpanInUse
 *(object+0) = 0xc0002f5fa0 <==
 *(object+8) = 0xc0002f5f90
fatal error: found bad pointer in Go heap (incorrect use of unsafe or cgo?)

runtime stack:
runtime.throw({0x49c418, 0x6})
        /home/ulin/bin/go/src/runtime/panic.go:1198 +0x71 fp=0xc000093df8 sp=0xc000093dc8 pc=0x42f991
runtime.badPointer(0x7fe4b63a8090, 0xc0002b7ba0, 0xc0001def90, 0xc000093ec0)
        /home/ulin/bin/go/src/runtime/mbitmap.go:367 +0x150 fp=0xc000093e48 sp=0xc000093df8 pc=0x4108d0
runtime.findObject(0xc00003e400, 0xc000093f08, 0x418d70)
        /home/ulin/bin/go/src/runtime/mbitmap.go:409 +0xa6 fp=0xc000093e80 sp=0xc000093e48 pc=0x410a66
runtime.scanobject(0xc000023698, 0xc000023698)
        /home/ulin/bin/go/src/runtime/mgcmark.go:1280 +0x190 fp=0xc000093f18 sp=0xc000093e80 pc=0x41b190
runtime.gcDrain(0xc000023698, 0x3)
        /home/ulin/bin/go/src/runtime/mgcmark.go:1047 +0x1ba fp=0xc000093f70 sp=0xc000093f18 pc=0x41aa1a
runtime.gcBgMarkWorker.func2()
        /home/ulin/bin/go/src/runtime/mgc.go:1269 +0xa5 fp=0xc000093fc0 sp=0xc000093f70 pc=0x417c25
runtime.systemstack()
        /home/ulin/bin/go/src/runtime/asm_amd64.s:383 +0x49 fp=0xc000093fc8 sp=0xc000093fc0 pc=0x458b89

goroutine 93390 [GC worker (idle)]:
runtime.systemstack_switch()
        /home/ulin/bin/go/src/runtime/asm_amd64.s:350 fp=0xc0003aaf60 sp=0xc0003aaf58 pc=0x458b20
runtime.gcBgMarkWorker()
        /home/ulin/bin/go/src/runtime/mgc.go:1256 +0x1b1 fp=0xc0003aafe0 sp=0xc0003aaf60 pc=0x4178d1
runtime.goexit()
        /home/ulin/bin/go/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc0003aafe8 sp=0xc0003aafe0 pc=0x45ac01
created by runtime.gcBgMarkStartWorkers
        /home/ulin/bin/go/src/runtime/mgc.go:1124 +0x25

goroutine 1 [chan send]:
runtime.gopark(0xc00009aef8, 0x40b8c5, 0x68, 0x9a, 0x6)
        /home/ulin/bin/go/src/runtime/proc.go:366 +0xd6 fp=0xc00009ae78 sp=0xc00009ae58 pc=0x432436
runtime.chansend(0xc000094060, 0xc00009af40, 0x1, 0x47e36f)
        /home/ulin/bin/go/src/runtime/chan.go:257 +0x425 fp=0xc00009af00 sp=0xc00009ae78 pc=0x404765
runtime.chansend1(0x48c480, 0xc0001be708)
        /home/ulin/bin/go/src/runtime/chan.go:143 +0x1d fp=0xc00009af30 sp=0xc00009af00 pc=0x40431d
main.(*TaskRunner).Schedule(0xc0000b2018, 0x49c800)
        /home/ulin/dl/tmp/exporter/poc/poc.go:22 +0x32 fp=0xc00009af50 sp=0xc00009af30 pc=0x47e312
main.main()
        /home/ulin/dl/tmp/exporter/poc/poc.go:67 +0x67 fp=0xc00009af80 sp=0xc00009af50 pc=0x47e767
runtime.main()
        /home/ulin/bin/go/src/runtime/proc.go:255 +0x227 fp=0xc00009afe0 sp=0xc00009af80 pc=0x432067
runtime.goexit()
        /home/ulin/bin/go/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc00009afe8 sp=0xc00009afe0 pc=0x45ac01
goroutine 2 [force gc (idle), 2 minutes]:
runtime.gopark(0x0, 0x0, 0x0, 0x0, 0x0)
        /home/ulin/bin/go/src/runtime/proc.go:366 +0xd6 fp=0xc00003afb0 sp=0xc00003af90 pc=0x432436
runtime.goparkunlock(...)
        /home/ulin/bin/go/src/runtime/proc.go:372
runtime.forcegchelper()
        /home/ulin/bin/go/src/runtime/proc.go:306 +0xad fp=0xc00003afe0 sp=0xc00003afb0 pc=0x4322cd
runtime.goexit()
        /home/ulin/bin/go/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc00003afe8 sp=0xc00003afe0 pc=0x45ac01
created by runtime.init.7
        /home/ulin/bin/go/src/runtime/proc.go:294 +0x25

goroutine 3 [GC sweep wait]:
runtime.gopark(0x49ca01, 0x5264a0, 0xc, 0x14, 0x1)
        /home/ulin/bin/go/src/runtime/proc.go:366 +0xd6 fp=0xc00003b7b0 sp=0xc00003b790 pc=0x432436
runtime.goparkunlock(...)
        /home/ulin/bin/go/src/runtime/proc.go:372
runtime.bgsweep()
        /home/ulin/bin/go/src/runtime/mgcsweep.go:182 +0xd8 fp=0xc00003b7e0 sp=0xc00003b7b0 pc=0x41f2d8
runtime.goexit()
        /home/ulin/bin/go/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc00003b7e8 sp=0xc00003b7e0 pc=0x45ac01
created by runtime.gcenable
        /home/ulin/bin/go/src/runtime/mgc.go:181 +0x55

goroutine 4 [runnable]:
runtime.gopark(0xc00003bfb8, 0x0, 0x0, 0x0, 0x0)
        /home/ulin/bin/go/src/runtime/proc.go:366 +0xd6 fp=0xc00003bf80 sp=0xc00003bf60 pc=0x432436
runtime.goparkunlock(...)
        /home/ulin/bin/go/src/runtime/proc.go:372
runtime.bgscavenge()
        /home/ulin/bin/go/src/runtime/mgcscavenge.go:314 +0x288 fp=0xc00003bfe0 sp=0xc00003bf80 pc=0x41d548
runtime.goexit()
        /home/ulin/bin/go/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc00003bfe8 sp=0xc00003bfe0 pc=0x45ac01
created by runtime.gcenable
        /home/ulin/bin/go/src/runtime/mgc.go:182 +0x65

...

More information:

  1. If the var a [125]int line and the 2nd Recover argument are removed, no panics will occur;
  2. If set GOEXPERIMENT=noregabi no panics will occur;
  3. The kernel is 4.18, and can be reproduced on Ubuntu18.04(Linux version 4.15.0-163-generic (buildd@lcy01-amd64-021) (gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)) #171-Ubuntu SMP Fri Nov 5 11:55:11 UTC 2021)
@gopherbot gopherbot added the compiler/runtime label Aug 4, 2022
@linsite linsite changed the title runtime: marked free object in span[1.17] runtime: panic marked free object in span[1.17] Aug 4, 2022
@seankhliao
Copy link
Member

seankhliao commented Aug 4, 2022

does this reproduce with 1.18 or 1.19?
1.17 is no longer supported

@linsite
Copy link
Author

linsite commented Aug 4, 2022

1.18 didn't reproduce in 4h. I'm trying with 1.19.

@dmitshur dmitshur added the NeedsInvestigation label Aug 4, 2022
@dmitshur dmitshur added this to the Backlog milestone Aug 4, 2022
@dmitshur
Copy link
Contributor

dmitshur commented Aug 4, 2022

CC @golang/runtime.

@cherrymui cherrymui added the WaitingForInfo label Aug 4, 2022
@linsite
Copy link
Author

linsite commented Aug 5, 2022

1.19 didn't reproduce either. Close it if you guys see O.K. as 1.17 is no longer supported. As I'm a little bit interesting what's causing this, I think it may relate to the goroutine switching where the context RDX register is not correctly restored regarding to closure defer statement. It'll be great if it can be verified from Go Team. Thanks for your time anyway. :)

@linsite
Copy link
Author

linsite commented Aug 6, 2022

May be solved by this commit: 0a0e3a3

@seankhliao seankhliao closed this as not planned Aug 6, 2022
@linsite
Copy link
Author

linsite commented Aug 8, 2022

May be solved by this commit: 0a0e3a3

confirmed.

@prattmic
Copy link
Member

prattmic commented Aug 8, 2022

Interesting. I'm not sure that 0a0e3a3 was expected to fix anything, making me wonder if we actually fixed or just masked a problem.

cc @mdempsky

@mdempsky
Copy link
Member

mdempsky commented Aug 8, 2022

Yeah, 0a0e3a3 wasn't meant to fix anything, but maybe improved escape analysis in some cases and masked the issue.

I'm curious if changing the Recover call to:

                cleanups := []func() {
                        func() {
                                <-rp.limitChan
                        },
                        func() {
                                a[0] = 1
                        },
                }
                defer Recover(cleanups...)

changes things at all?

I could believe the 16-byte object is the backing array for the variadic function call.

@linsite
Copy link
Author

linsite commented Aug 8, 2022

Yeah, 0a0e3a3 wasn't meant to fix anything, but maybe improved escape analysis in some cases and masked the issue.

I'm curious if changing the Recover call to:

                cleanups := []func() {
                        func() {
                                <-rp.limitChan
                        },
                        func() {
                                a[0] = 1
                        },
                }
                defer Recover(cleanups...)

changes things at all?

I could believe the 16-byte object is the backing array for the variadic function call.

Yes, you're right. After changing Recover's arguments to a predefined slice, the issue no longer reproduced. If change the arguments to add 1 more a[0]=1 function, the back trace message will be like this:

runtime: found in object at *(0xc0003d3f68+0x0)
object=0xc0003d3f68 s.base()=0xc0003d2000 s.limit=0xc0003d3ff8 s.spanclass=6 s.elemsize=24 s.state=mSpanInUse
 *(object+0) = 0xc0005eff98 <==
 *(object+8) = 0xc0005eff88
 *(object+16) = 0xc0005eff78

The elemsize becomes 24 bytes.

@mdempsky
Copy link
Member

mdempsky commented Aug 8, 2022

Thanks. There's a lot of subtlety to how escape analysis, defers, and variadic functions interacted before.

Internally, the compiler rewrites

defer f(arg)

into something like:

f_tmp, arg_tmp := f, arg
defer func() { f_tmp(arg_tmp) }()

However, this rewriting was happening later in the compiler, after escape analysis. It's possible there was a misalignment between how escape analysis was analyzing the original form and the data flow of the resulting form.

The effect of that change was to move the rewriting to before escape analysis, so that escape analysis can see how we actually rewrote the statement, rather than inferring how it should be rewritten.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime NeedsInvestigation WaitingForInfo
Projects
None yet
Development

No branches or pull requests

7 participants