unique: Handle memory leaks; cleanup goroutine can't keep up #71772

@bradfitz

Go version

go version go1.24.0 darwin/arm64
go version go1.24.0 linux/amd64
go version go1.23.6 darwin/arm64
go version go1.23.6 linux/amd64

What did you do?

We have a server that processes a stream of user connect & disconnect events from our edge servers. Users are identified by a [32]byte public key, wrapped up like:

type NodePublic struct {
	k [32]byte
}

This server maintains ~4 maps keyed by NodePublic. Out of curiosity and as an experiment, we decided to kick the tires on unique.Handle and key those maps by unique.Handle[NodePublic] instead, shrinking each key from 32 bytes to 8. It saves about 96 bytes per connection. Not a ton, but ... an experiment.

The experiment didn't go as expected.

What did you see happen?

What we saw was unbounded memory growth and kernel OOMs. These are the two instances of this server we run. (It's an internal debug tool only, so we run just two, for HA reasons.)

[Graph: memory usage of the two server instances, growing unbounded until OOM]

It previously took about ~1GB of memory at steady state. With unique.Handle, it just keeps growing until the kernel OOM kills it.

The flat parts in the graph are where we changed the map key types from unique.Handle[NodePublic] back to just NodePublic, changing nothing else.

To debug, I hacked up the unique package (patch: https://gist.github.com/bradfitz/06f57b1515d58fb4d8b9b4eb5a84262e) to try out two things:

  • see whether the weak pointers were getting tinyalloc'ed and thereby prevented from being GC'ed. To counter this, I tried making the unique package use 24-byte value types in its uniqueMap. No change.
  • log stats when the cleaners ran, to see how much each pass tried to delete
  • Oh, and I also tried both Go 1.23.6 and Go 1.24.0

The result is that it never frees anything. The weak pointers are never nil. Each pass takes ever longer to iterate over the map (up to 2.7 seconds now) and never finds anything to delete:

Feb 16 00:30:44 derptrackd2 derptrackd[771925]: XXX did unique cleanup loop: items= 2 ; deleted= 0 ; kept= 2 ; took  16µs
Feb 16 00:30:46 derptrackd2 derptrackd[771925]: XXX did unique cleanup loop: items= 2829766 ; deleted= 0 ; kept= 2829766 ; took  2.585244s
Feb 16 00:32:51 derptrackd2 derptrackd[771925]: XXX did unique cleanup loop: items= 2 ; deleted= 0 ; kept= 2 ; took  14µs
Feb 16 00:32:53 derptrackd2 derptrackd[771925]: XXX did unique cleanup loop: items= 2850211 ; deleted= 0 ; kept= 2850211 ; took  2.692595s
Feb 16 00:34:58 derptrackd2 derptrackd[771925]: XXX did unique cleanup loop: items= 2 ; deleted= 0 ; kept= 2 ; took  19µs
Feb 16 00:35:01 derptrackd2 derptrackd[771925]: XXX did unique cleanup loop: items= 2870410 ; deleted= 0 ; kept= 2870410 ; took  2.728681s
Feb 16 00:37:06 derptrackd2 derptrackd[771925]: XXX did unique cleanup loop: items= 2 ; deleted= 0 ; kept= 2 ; took  19µs
Feb 16 00:37:08 derptrackd2 derptrackd[771925]: XXX did unique cleanup loop: items= 2890817 ; deleted= 0 ; kept= 2890817 ; took  2.635071s
Feb 16 00:39:13 derptrackd2 derptrackd[771925]: XXX did unique cleanup loop: items= 2 ; deleted= 0 ; kept= 2 ; took  18µs
Feb 16 00:39:16 derptrackd2 derptrackd[771925]: XXX did unique cleanup loop: items= 2911098 ; deleted= 0 ; kept= 2911098 ; took  2.600478s
Feb 16 00:41:20 derptrackd2 derptrackd[771925]: XXX did unique cleanup loop: items= 2 ; deleted= 0 ; kept= 2 ; took  19µs
Feb 16 00:41:22 derptrackd2 derptrackd[771925]: XXX did unique cleanup loop: items= 2931399 ; deleted= 0 ; kept= 2931399 ; took  2.606853s
Feb 16 00:43:27 derptrackd2 derptrackd[771925]: XXX did unique cleanup loop: items= 2 ; deleted= 0 ; kept= 2 ; took  20µs
Feb 16 00:43:29 derptrackd2 derptrackd[771925]: XXX did unique cleanup loop: items= 2951574 ; deleted= 0 ; kept= 2951574 ; took  2.701304s

Looking at historical logs, it never deletes anything on Linux.

I tried to write a repro (on my Mac) and failed, despite a bunch of attempts.

I ended up doing too much debugging on a remote AWS VM, with a hacked up Go toolchain.

Eventually I ran the same code on my Mac (same server + same hacked up Go toolchain), pulling from the same data sources, and surprisingly: it worked! unique finds weak things to delete in its cleaner and it doesn't grow.

In summary:

  • unique leaks on linux/amd64 (prod cloud VM)
  • same code on same data doesn't leak on darwin/arm64 (my dev laptop)

I'll now try revisiting my standalone repro but running it on Linux instead of macOS.

I had hoped to deliver @mknyszek a bug on a silver platter before filing, but I'm going a little crazy at this point and figured it was time to start sharing.

What did you expect to see?

No memory leaks on Linux also.

Labels: BugReport (issues describing a possible bug in the Go implementation); NeedsFix (the path to resolution is known, but the work has not been done)
