runtime: possible GC bug with msan #76138

@RaduBerinde

Go version

go version go1.25.3 linux/amd64

Output of go env in your module/workspace:

AR='ar'
CC='gcc'
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_ENABLED='1'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
CXX='g++'
GCCGO='gccgo'
GO111MODULE=''
GOAMD64='v1'
GOARCH='amd64'
GOAUTH='netrc'
GOBIN=''
GOCACHE='/home/runner/.cache/go-build'
GOCACHEPROG=''
GODEBUG=''
GOENV='/home/runner/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFIPS140='off'
GOFLAGS=''
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build1722691548=/tmp/go-build -gno-record-gcc-switches'
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMOD='/work/go.mod'
GOMODCACHE='/home/runner/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/home/runner/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/local/go'
GOSUMDB='sum.golang.org'
GOTELEMETRY='local'
GOTELEMETRYDIR='/home/runner/.config/go/telemetry'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/usr/local/go/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.25.3'
GOWORK=''
PKG_CONFIG='pkg-config'

What did you do?

I have an uninitialized-memory crash with -msan that I can reproduce only in GitHub CI and in a Docker container that mirrors the CI runner.

This issue is a request to guide me towards obtaining more information.

I'll start with the relevant parts of my program. The relevant source code is here: https://github.com/RaduBerinde/pebble/blob/e016d44c914386a5b45e7956d5d5dce1aa5090d8/internal/deletepacer/delete_pacer.go#L236

Basically I have a queue storing entries of this type:

type queueEntry struct {
	ObsoleteFile
	JobID int
}
type ObsoleteFile struct {
	FileType base.FileType
	FS       vfs.FS
	Path     string
	FileNum  base.DiskFileNum
	FileSize uint64 // approx for log files
	IsLocal  bool
}

The queue is implemented using a structure (https://github.com/RaduBerinde/pebble/blob/e016d44c914386a5b45e7956d5d5dce1aa5090d8/internal/deletepacer/queue.go) which will essentially allocate this object (with T = queueEntry):

type queueNode[T any] struct {
	buf       [queueNodeSize]T
	head, len int32
	next      *queueNode[T]
}

Something about this specific structure matters: if I replace this queue with a simple slice, I can no longer reproduce the failure.
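For readers who do not want to open the linked file, here is a minimal sketch of a queue built on the node type above. It is not the code from queue.go: the Queue type, its PushBack and PopFront methods, and the value of queueNodeSize are assumptions made for illustration only.

```go
package main

import "fmt"

// queueNodeSize is an assumed value; the real constant is defined in the linked queue.go.
const queueNodeSize = 4

type queueNode[T any] struct {
	buf       [queueNodeSize]T
	head, len int32
	next      *queueNode[T]
}

// Queue is a hypothetical FIFO made of a linked list of fixed-size nodes.
type Queue[T any] struct {
	head, tail *queueNode[T]
	n          int
}

// PushBack appends v, allocating a new node when the tail node has no room left.
func (q *Queue[T]) PushBack(v T) {
	if q.tail == nil || q.tail.head+q.tail.len == queueNodeSize {
		node := &queueNode[T]{}
		if q.tail == nil {
			q.head = node
		} else {
			q.tail.next = node
		}
		q.tail = node
	}
	q.tail.buf[q.tail.head+q.tail.len] = v
	q.tail.len++
	q.n++
}

// PopFront removes and returns the oldest element by value.
func (q *Queue[T]) PopFront() (T, bool) {
	var zero T
	if q.n == 0 {
		return zero, false
	}
	node := q.head
	v := node.buf[node.head]
	node.buf[node.head] = zero // clear the slot so the queue no longer references v
	node.head++
	node.len--
	q.n--
	if node.len == 0 {
		if node.next != nil {
			q.head = node.next // drop the drained node
		} else {
			node.head = 0 // last node: reset it for reuse
		}
	}
	return v, true
}

func main() {
	var q Queue[string]
	q.PushBack("_meta/foo/standard-012/data/000003.log")
	path, ok := q.PopFront()
	fmt.Println(path, ok)
}
```

After PopFront clears the slot, the caller's copy is the only remaining reference to the string data, which matches the situation described in this report.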

In my reproduction case, we push exactly one entry into this queue. A single background goroutine (mainLoop) pops it from the queue - first it makes a copy here: https://github.com/RaduBerinde/pebble/blob/e016d44c914386a5b45e7956d5d5dce1aa5090d8/internal/deletepacer/delete_pacer.go#L236

The problem is with the file.Path field. At some later point, when we actually try to delete the file, msan complains that this string is uninitialized. The string is generated via fmt.Sprintf() and passed through path.Join(); it's a simple heap string, no unsafe shenanigans. I verified that the string pointer is correct (the same one that was enqueued).

I cannot reproduce the failure with the GC turned off. Using runtime/trace with GODEBUG=traceallocfree=1, I confirmed that at some point after we pop the element from the queue, the string object gets freed.

I found that if I add a runtime.GC() call right after the pop (this is the code version I linked), the GC detects the problem right there:

file to delete _meta/foo/standard-012/data/000003.log 0xc00048a030 38

PushBack 0xc00048a030 _meta/foo/standard-012/data/000003.log
notification wait end


popped
 - [JOB 6] compacting(move) L0 [000006] (3.2KB) Score=0.00 + L6 [] (0B) Score=0.00; OverlappingRatio: Single 0.00, Multi 0.00
runtime: marked free object in span 0x7a50dc78d660, elemsize=48 freeindex=1 (bad use of unsafe.Pointer or having race conditions? try -d=checkptr or -race)
0xc00048a000 alloc marked  
0xc00048a030 free  marked   zombie              <-----------

I looked at the assembly and found where the structure is stored on the stack and where we later retrieve the string from the stack. So the pointer stays on the stack for this whole period and does not change. I also looked at the stkobj map and, as far as I understand, it does mark that stack slot as containing a pointer.

This is part of a larger codebase that does employ various unsafe tricks. However, I am able to reproduce with a fairly small amount of code actually running in-between the time we push the entry onto the queue and the failure, and I couldn't find any unsafe use there.

I was never able to reproduce any failure without -msan (I tried both -race and -asan). I suspect an obscure bug that is specific to the msan integration.

I can provide instructions for reproducing the failure, but the process is fairly tedious, so I believe it would be easiest at this point if I did the work to gather more information myself. I would appreciate some guidance on how to narrow down the problem.

What did you see happen?

Uninitialized memory crashes, and later (after adding runtime.GC() at the right place) fatal error: found pointer to free object

What did you expect to see?

No failure.

Labels: BugReport, NeedsInvestigation, compiler/runtime
