unique: Handle memory leaks; cleanup goroutine can't keep up #71772
Comments
Actually, I'm going to close this for now until there's something actionable. I'll reopen if/when I have a repro.
Tangentially related, but we should probably remove the loop and switch to
I played with hacking away at the server, trying to minimize the real code down to a minimal repro instead of building one up from scratch, with the goal of eventually plugging in an in-memory generated random data source rather than reading from the network. That seemed to have worked... at a certain point, Linux started reclaiming a small amount of memory on a test Linux box I was running. And overnight it did one huge unique.Handle reclaim during one of its GCs, but zero for all the others throughout the night. Now I need to binary search my deletions and see what the key deletion was. In any case, it still reclaims super slowly on Linux even after my minimization. Will report back when I have more.
Okay, I think I've narrowed it down and figured out the macOS-vs-Linux difference: this is about the number of goroutines producing new unique.Handles overwhelming the sole cleanup goroutine. My Mac laptop (M4 Max) is just significantly beefier than my dinky Linux VMs (both prod & dev). Now I understand why my ground-up repro didn't work before: it was using a single goroutine. I have a standalone repro now from the full server that doesn't depend on the network (using 100 goroutines adding+removing keys), but now that I think I know what's happening, I think I'll switch back to my bottom-up repro, slapping on some goroutines. So @mknyszek, maybe
Here ya go: https://go.dev/play/p/rrNRBh6g-WF

Ignore that that's a playground link and also that it's written as a test function. Just run it locally (on either macOS or Linux) and watch it leak. Running
But using
Note that there are multiple GCs happening while the
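As a rough sketch of the pattern the linked reproducer exercises (many goroutines creating and immediately dropping handles for distinct keys, outpacing the single cleanup goroutine), something like the following shows the same shape of growth. The worker count and random-key generation here are illustrative assumptions, not the actual playground code:

```go
package main

import (
	"crypto/rand"
	"fmt"
	"runtime"
	"time"
	"unique"
)

func main() {
	// Hypothetical worker count; the comment above describes a repro
	// with 100 goroutines adding and removing keys.
	const workers = 100

	for i := 0; i < workers; i++ {
		go func() {
			var key [32]byte
			for {
				rand.Read(key[:]) // a fresh, distinct key each iteration
				_ = unique.Make(key) // insert into the global interning map,
				// then immediately drop the handle
			}
		}()
	}

	// Watch heap usage; if the cleanup goroutine can't keep up with the
	// insert rate, this keeps climbing even though no handles are retained.
	var ms runtime.MemStats
	for {
		time.Sleep(5 * time.Second)
		runtime.ReadMemStats(&ms)
		fmt.Printf("heap in use: %d MiB\n", ms.HeapInuse>>20)
	}
}
```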
Yeah, https://gist.github.com/bradfitz/a136006f112fee47ce1d87232e779d24 (I've signed the CLA; consider that trashy patch safe to look at 😅)
A potential fix for Go 1.23 (without using Go 1.24's AddCleanup) is https://gist.github.com/bradfitz/901f14f9c6dfd5c9344673a2c9478932, making new handles wait for any ongoing cleanup to finish, but it's a bit too aggressive: it makes all goroutines wait. But it works. 🤷♂
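That gist isn't reproduced here, but the general idea ("make new handles wait for any ongoing cleanup") can be sketched with a read-write lock gating inserts. This is a hypothetical illustration of the synchronization pattern, not the actual patch or the `unique` package's internals:

```go
package main

import "sync"

// internTable is a stand-in for an interning map with a periodic
// cleanup pass, used only to illustrate the back-pressure idea above.
type internTable struct {
	gate    sync.RWMutex // write-locked for the duration of a cleanup pass
	entries sync.Map     // stand-in for the real concurrent interning map
}

// insert adds an entry, first waiting for any ongoing cleanup
// (RLock blocks while cleanup holds the write lock).
func (t *internTable) insert(key string) {
	t.gate.RLock()
	defer t.gate.RUnlock()
	t.entries.Store(key, struct{}{})
}

// cleanup holds the write lock for the whole pass, so every inserter
// stalls until it finishes: the "a bit too aggressive" part noted above.
func (t *internTable) cleanup() {
	t.gate.Lock()
	defer t.gate.Unlock()
	t.entries.Range(func(k, _ any) bool {
		t.entries.Delete(k) // stand-in for "delete entries whose value is dead"
		return true
	})
}

func main() {
	t := &internTable{}
	t.insert("a")
	t.cleanup()
	t.insert("b")
}
```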
Hm... We should make the

In other words, if there are so many new entries being created that deletion can't keep up, that suggests there isn't really all that much deduplication happening; that there are more inserts than lookups.

... But also maybe I'm wrong about that. Maybe there's enough of a benefit even if you only get, for example, 1 lookup on average after an insert, before a deletion. I'd love to understand this better if you can collect any metrics from production in an ad-hoc way, @bradfitz. And perhaps we need diagnostics (

Off the top of my head, I'm not sure what the right back-pressure policy would be here. An alternative would be that we could scale up deletions instead. Deletions on independent parts of the tree (especially for a big tree) should scale well in my testing. The
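To illustrate the "scale up deletions" direction mentioned above: if the table is split into independent shards, each shard can be swept by its own goroutine, so deletion throughput grows with available CPUs instead of being capped at a single cleanup goroutine. The shard layout and the "dead entry" test below are hypothetical stand-ins, not the `unique` package's actual data structure:

```go
package main

import (
	"runtime"
	"sync"
)

// entry is a stand-in for a map entry; dead plays the role of
// "the weak pointer has gone nil, so this entry can be removed".
type entry struct {
	dead bool
}

type shard struct {
	mu      sync.Mutex
	entries map[string]*entry
}

type table struct {
	shards []shard
}

// sweepAll runs one concurrent sweeper per shard, so total deletion
// throughput scales with the number of shards/CPUs.
func (t *table) sweepAll() {
	var wg sync.WaitGroup
	for i := range t.shards {
		wg.Add(1)
		go func(s *shard) {
			defer wg.Done()
			s.mu.Lock()
			defer s.mu.Unlock()
			for k, e := range s.entries {
				if e.dead {
					delete(s.entries, k)
				}
			}
		}(&t.shards[i])
	}
	wg.Wait()
}

func main() {
	t := &table{shards: make([]shard, runtime.GOMAXPROCS(0))}
	for i := range t.shards {
		t.shards[i].entries = map[string]*entry{"x": {dead: true}}
	}
	t.sweepAll()
}
```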
Thank you for the thorough investigation @bradfitz!!
While that's likely true (we just did this as an experiment to see if it'd save the expected few hundred MB of heap), I strongly feel that it shouldn't be possible to use a Go API and see runaway memory growth. That is, users should be able to evaluate whether
Yes, totally agree. My apologies, I did not intend to suggest we should leave this problem alone. Mostly I just wanted to understand your workload better. I plan to work on this this week; it shouldn't take too long to fix. While I'm at it I have some ideas to reduce the memory overhead of each entry in the global map, since using

(As an aside, yet another back-pressure idea I have is to let callers into
Change https://go.dev/cl/650256 mentions this issue:
CL 650256 is step one. We still need a back-pressure mechanism, and I'd like to make
Change https://go.dev/cl/650697 mentions this issue:
https://go.dev/cl/650697 (and changes beneath it) automatically scales up to

@bradfitz, if you're willing to give it a shot in production, I'd be curious to know if it's good enough. Applying real backpressure in
I'm traveling this week and getting very little computer time but I'll get back to this next week. I'm excited about all the changes! 🙏
@mknyszek, results from prod, from a Go built at 486a06e3770a2331ae4141fd9f1197372075fbc2 (PS 14 of https://go-review.googlesource.com/c/go/+/650697)
It's better than before, but not perfect. Over the past 7 days: green is not using unique; the yellow instance is the test, using unique at 486a06e3770a23. Let me know if I can run any other experiments.
Those results roughly reflect what I saw in the reproducer, which is good at least in terms of the reproducer being accurate. I'm curious to know if memory continues to grow further if left alone. I acknowledge there's a trend here over several days, and maybe this is just wishful thinking, but it kinda sorta looks like it's leveling out? Maybe?

If the leak problem is mostly fixed and it's just a new higher memory footprint, there might not be a way to make it perfect. Still, applying backpressure would probably make this all more stable. Still not sure what the right backpressure mechanism is here. Simplest policy: each
I will get back to you once I have a clearer idea of what else to try, but these patches seem like a good place to start.
I'll keep it running and get back to you with what the shape looks like in a few days.
OK, so problem definitely not solved. Thank you for confirming!
Go version
go version go1.24.0 darwin/arm64
go version go1.24.0 linux/amd64
go version go1.23.6 darwin/arm64
go version go1.23.6 linux/amd64
What did you do?
We have a server that processes a stream of user connect & disconnect events from our edge servers. Users are identified by a `[32]byte` public key, wrapped up like:
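(The exact type definition isn't reproduced in this report. The sketch below assumes a plausible shape, a named wrapper around the 32-byte key, and also previews the handle-keyed maps described in the next paragraph; the names are illustrative assumptions.)

```go
package main

import (
	"fmt"
	"unique"
)

// NodePublic here is an assumed shape, not the real definition:
// a 32-byte public key wrapped in a named, comparable type.
type NodePublic struct {
	k [32]byte
}

func main() {
	// Original scheme: maps keyed directly by the 32-byte value.
	byKey := make(map[NodePublic]bool)

	// The experiment: key the maps by an 8-byte unique.Handle instead,
	// interning each NodePublic in the unique package's global map.
	byHandle := make(map[unique.Handle[NodePublic]]bool)

	var pub NodePublic
	pub.k[0] = 0x42

	byKey[pub] = true
	byHandle[unique.Make(pub)] = true

	fmt.Println(len(byKey), len(byHandle))
}
```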
This server maintains ~4 maps keyed by `NodePublic`. Out of curiosity and as an experiment, we decided to kick the tires with `unique.Handle`, making those maps instead be keyed by `unique.Handle[NodePublic]`, so the keys are 8 bytes instead of 32. It saves about 96 bytes per connection. Not a ton, but ... an experiment.

The experiment didn't go as expected.
What did you see happen?
What we saw happen was unbounded memory growth and kernel OOMs. These are the two instances of this server we run. (it's an internal debug tool only, so we only run two for HA reasons)
It previously took about ~1GB of memory at steady state. With `unique.Handle`, it just keeps growing until the kernel OOM kills it.

The flat parts in the graph are where we changed the map key types from `unique.Handle[NodePublic]` back to just `NodePublic`, changing nothing else.

To debug, I hacked up the `unique` package (patch: https://gist.github.com/bradfitz/06f57b1515d58fb4d8b9b4eb5a84262e) to try out two things; one was making the `unique` package use 24-byte value types in its uniqueMap. No change.

The result is that the cleanup never frees anything. The weak pointer is never `nil`. It takes ever longer to iterate over the map (up to 2.7 seconds now), never finding anything to delete:

Looking at historical logs, it never deletes anything on Linux.
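For context on what "the weak pointer is never `nil`" is checking: roughly speaking, entries in the `unique` package's map hold weak references to the interned values, and the cleanup can only delete an entry once the GC has cleared its weak pointer. A minimal, hypothetical illustration of that expected behavior, using Go 1.24's `weak` package rather than the server code or `unique`'s internals:

```go
package main

import (
	"fmt"
	"runtime"
	"weak"
)

func main() {
	v := new([32]byte) // stand-in for an interned value
	wp := weak.Make(v) // weak reference, like the ones kept per map entry

	fmt.Println("before GC, value present:", wp.Value() != nil) // true: strong reference still held

	v = nil      // drop the only strong reference
	runtime.GC() // a full GC should clear the weak pointer...

	// ...after which Value returns nil and the entry would be deletable.
	fmt.Println("after GC, value present:", wp.Value() != nil)
}
```

The surprising observation in this report is that, on the Linux server, the weak pointers apparently never went nil across many GCs.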
I tried to write a repro (on my Mac) and I failed, despite a bunch of efforts.
I ended up doing too much debugging on a remote AWS VM, with a hacked up Go toolchain.
Eventually I ran the same code on my Mac (same server + same hacked up Go toolchain), pulling from the same data sources, and surprisingly: it worked!
`unique` finds weak things to delete in its cleaner and it doesn't grow.

In summary: `unique` leaks on `linux/amd64` (prod cloud VM) but not on `darwin/arm64` (my dev laptop).

I'll now try revisiting my standalone repro but running it on Linux instead of macOS.
I had hoped to deliver @mknyszek a bug on a silver platter before filing, but I'm going a little crazy at this point and figured it was time to start sharing.
What did you expect to see?
No memory leaks on Linux, either.