runtime: Unable to calculate memory usage of an object accurately #34561
What version of Go are you using (
The text was updated successfully, but these errors were encountered:
One possibility is that there are lingering pointers elsewhere in the system that are keeping the evictions from being GC'ed. Another possibility is that there hasn't been a full GC since the evictions and not all memory has been reclaimed.
Do you know if the numbers are more accurate when data is only added, before any evictions?
I checked out the HEAD of that PR but I was unable to figure out what to build and what to run to get that output. What is the sequence of commands to run to reproduce the output you showed?
So, we are force running GC every 5s or so, if a GC hasn't already run. If we force run it, we also spit out the memory stats.
In this case, I stopped the loading of data after cost-added reached 505MB (cost-added = 529481634, cost-evicted = 0), and lot before Ristretto would start evictions.
The effect is reversed this time. ReadPostingList in this case is only showing 428MB, i.e. under what Ristretto is seeing (505MB), which is off by 15%. Assuming that go profiler has a 3% error rate, this is still off by ~10%.
Daniel is going to post the instructions on how to run Dgraph and do a data load, so the caching would trigger.
(Instructions work on Linux only)
To build PR dgraph-io/dgraph#4072:
Relevant logs would be from Alpha:
For simplicity, I did this without Docker. I ran in three different windows:
First I ^C'ed the third command once dgraph alpha was printing about 1 GB of data. It had just started evicting, but only about 1 MB by its accounting.
The profile looks like this. Note that ReadPostingList is charged with 1271 MB, compared to the 1e9/1024/1024 = 953 MB that dgraph had calculated.
There are three functions doing most of the allocations: ReadPostingList, a closure inside ReadPostingList (ReadPostingList.func1), and a proto Unmarshal called from that closure. Here are the profiled source listings.
There are about 230 MB of List structs, 230 MB of pb.PostingList structs, and 120 MB of allocated empty maps.
I count 1+3+1+3+0+3+1 = 12 words = 96 bytes, which is exactly an allocation size in Go, so no wastage there.
I count 11 words, which will round up to 12 in the allocator, so it makes sense that this code, which allocates the same number of Lists as pb.PostingLists, would see the same allocation counts for those two. 230*1024*1024/96 = 2.51e6 of each.
Given that the map is 123 MB, it sounds like the empty map takes up 6 words. Sure enough, it does:
Now there's 665 MB in the call to the closure via item.Value, so let's look at that next:
Another 126 MB of pb.PostingList structs, compared to the 232 MB allocated in the outer function, suggests that on average there are only about half as many item.Value closure calls as there are ReadPostingList calls. More precisely, 126.5*1024*1024/96 = 1.38e6, compared to 2.51e6 of the outer List allocations. That is, at least about half the Lists returned by ReadPostingList have an empty map. Maybe 1/2 have an empty map and 1/2 have one pb.PostingList in the map. Maybe 3/4 are empty and 1/4 have two pb.PostingLists in the map. Maybe something else. But at least half are entirely empty (item.Value is never called). So by object count (not necessarily object size), the cache is at least half full with empty posting lists.
The other allocation is a map bucket, which for a map[uintptr]*pb.PostingList is 18 words = 144 bytes, which is another size class, so no wasted allocated space. 165.73*1024*1024/144 = 1.21e6 buckets (out of 2.51e6 maps). So at least 1.30e6 (more than half) of the maps are empty. And the 1.21e6 buckets hold only 1.38e6 entries, so at least 1.04e6 of the buckets hold only one entry.
And then there is 373.56MB in 1.38e6 calls to pl.Unmarshal, or about 284 bytes per call. Most are in this line:
Here all the allocations are on the line
The 346.05MB/1.38e6 calls/160 bytes = 1.64 Postings allocated per call, on average, or 2.27e6 total. The 20.51MB of append data for 2.27e6 pointers ends up being 9.24 bytes per pointer, which is quite good. Probably most of the lists have a single entry, and we do handle that case optimally in an append loop.
Overall, this profile shows that there are:
230 MB = 2.51e6 List
That adds up to 1240.5 MB, which is close enough to the 1270 MB pprof number that we can stop. (Another 7 MB is in string data in the pb.Postings, for what that's worth.)
The old sizing code did this:
This is not accounting for the size of the List struct itself, nor the size of the map header, nor the unused map bucket entries. And then it uses l.plist.Size() and pl.Size(), clearly intended as the total, recursive in-memory size of the protobuf structures involved. But those functions calculate the wire representation size, which is far more compact than the in-memory representation. That's why it was so far off.
The new package dgraph uses in the PR has its own problems. It does:
There are a few problems with this code:
Actually, scratch that. The struct field loop ignores maps entirely. So the entire 288 MB of map data never is accounted for, plus anything reachable through the map. Starting over, we have:
That's a net 550.5 MB missing from the actual counts, which is a wider gap than observed (1270 - 953 = 317 MB). I'm sure the remaining 232 MB can be found somewhere, but I'll stop here.
Overall I would suggest that using this complex nest of protobufs and maps as the representation of values in a large-memory cache is incredibly wasteful. A representation storing only the values you care about, in a simple format, would let you store easily 2X, possibly even 10X the number of values in the same amount of memory. And the simpler format would be easier to calculate the size of. (And it could be done without reflection and the corresponding overheads.)
The simpler representation should probably also have a special case small way to represent "no list". Fully 300 MB out of the 1270 MB, or 23% of the cache space, is being taken up by Lists with empty maps.
Finally, I would suggest that if being able to calculate the size of cached values is important, then there should be a test of that calculation specifically. A simple test with a few manually constructed values would have turned up at least some of the bugs in both the sizing implementations quite quickly, instead of having to run an entire dgraph cluster and workload.
Overall, "unable to calculate memory usage of an object accurately" is because of major bugs in the two different calculation codes, not Go itself. The reflect-based approach could be made to work if the bugs were all fixed. It would just be slow. And pprof itself appears to be working fine.
Thanks for the detailed response.
I agree with the conclusion that it would be way easier to convert the List into a different format. We were trying to avoid that because doing that for values with very large posting lists (i.e very large indices) might be more expensive than using the reflect package but we'll look into it.
One question I have after reading the post, how is the size of a map bucket calculated?
Thanks a lot for the detailed analysis, Russ! This is extremely helpful. This is not actually how we are planning to use the cache. Instead this is a quick way for us to showcase how inaccurate our memory calculation algorithm is.
We would try to improve the calculation of the memory usage of the list structure given your guidelines. At the same time, we would also try to see if we can simplify our data structure itself, but that would come at the added complexity of the underlying algorithm.
More importantly though, the sort of calculations involved do lead me to think that it would be very useful if Go provided a function which would recursively calculate the memory usage of any given Go struct so the users do not have to have such a deep understanding of how Go allocates memory internally.
As you can see from how wrong both our own attempts and third party attempts at memory calculation were, it is hard to gain this level of internal understanding of Go memory allocation. And our engineers have tried.
This would be an arbitrarily expensive function - in the limit it is a full garbage collection - and it's not even a particularly well-formed concept, since there might be overlap with other possible starting points, so you can't in general add the results of multiple calls. Such a function might be interesting for debugging but would be a serious mistake for production use. Having it in the standard library would encourage the latter.
I'm sorry, but we can't put every function someone writes incorrectly into the standard library, especially functions that might look cheap but are actually quite expensive.
If this functionality is particularly important to you, I would suggest putting in the time to write a good implementation and then either contributing it back to the current third-party location or publishing your own open source version that others can use. This code does not need to be in the standard library.
The analysis above requires understanding how to use pprof but not really anything deep about Go memory layout. The bulk of the memory is in plain structs containing pointers to structs, which is not any different from C or Java or nearly any other language. The only really Go-specific thing is the layout of maps, which I don't know off the top of my head any more than other Go programmers - I looked it up while doing the analysis (see runtime/map.go).
I'm very sorry, but I find it exceedingly hard to credit the engineering effort that went into either of the two attempts at memory calculation.
The first attempt evidently did not get as far as reading the documentation for the method being called, which clearly states, "Size returns the encoded size of a protocol buffer message."
The second attempt grabbed a third-party library off the internet, apparently without inspection, and hoped it would do a better job. Your engineers overlooked that the library - on Aug 5, the day it was adopted had no tests whatsoever - and even today has only one trivial test. See my blog post Our Software Dependency Problem for more about why and how to evaluate third-party dependencies before adopting them.
Worst of all, neither attempt included any kind of test for the functionality in question!
Writing a test is, hands down, the number one most effective way to arrive at code that works.
If your engineers did not think it worthwhile to write a simple, standalone test for this functionality, even after realizing that the code being used didn't work, that's a serious problem with your engineering practices; addressing it would pay significant dividends.
There's nothing left to do on the Go side, so I'm going to close this issue.
I know this is a closed issue, but I want to point out a fundamental issue: Not only is this hard, it is impossible to do what we really want from a size computation.
Each of these criteria is obviously desireable.
Now consider a struct with a byte slice in it, and two such structs with the same byte slice. The size of each struct must include the storage for the slice. The size of a thing containing those two structs should only include the slice once.
You cannot win. The traits essential for a size calculation to do what we generally want it to do are mutually exclusive because pointers can point to the same thing. To get a usable size calculation, you have to decide what tradeoffs you're looking for, and that will be too application-specific to make sense to try to build into stdlib.
(Counterargument: Runtime can tell you how much space a slice really takes in the case where it's a reference into storage that extends past its bounds, such as bigarray[1000:], user code can't. But I don't think that's enough of the problem to justify the massive effort of resolving the definitional problems several different and mutually exclusive ways.)
Calculating sizes like this is a use case for
With all due respect, I wish you would present your arguments with a bit more empathy for the receiver. We already know we are off in our calculation of the memory usage. A test does not give us any more help there than what pprof did already.
We do not have the engineering resources of Google to look deeply into every problem that we face. Sometimes we look outside towards the experts for guidance. But, for all that Dgraph team has given the Go ecosystem (two databases and a cache, if you missed), I really think we could use a bit more respect from the Go team.
In fact, we just needed a few guiding pointers, which is why I raised this on Twitter. I was talked down into creating this issue, only then to turn around and be degraded. I'll remind myself not to raise issues here in the future, or tag the Go team on Twitter.
Talking about significant dividents, I run multiple open source communities myself. And we do our best to guide our users, even if their approach was wrong, or they didn't read docs properly, or their engineers are sub par (all these 3 things happen all the time).
So, let me present your arguments for you, in how I would treat a user of Dgraph, Badger or Ristretto:
This would be an arbitrarily expensive function - in the limit it is a full garbage collection - and it's not even a particularly well-formed concept, since there might be overlap with other possible starting points, so you can't in general add the results of multiple calls. Such a function might be useful for debugging but would be too slow for production use. Having it in the standard library would encourage the latter.
Calculating memory seems general enough that it might be useful to others as well. I would suggest putting in the time to write a good implementation and then either contributing it back to the current third-party location or publishing your own open source version that others can use. While this might not make it to standard library, I or the Go team can help review the code, so it would be correct as of Go 1.13s memory layout.
While the analysis above seems hard, it only requires understanding how to use pprof, not really anything deep about Go memory layout. The bulk of the memory is in plain structs containing pointers to structs, which is not any different from C or Java or nearly any other language. The only really Go-specific thing is the layout of maps, which I don't know off the top of my head any more than other Go programmers - I looked it up while doing the analysis (see runtime/map.go).
While we don't expect typical Go users to read the source code of runtime/map.go, given your use case, it would be recommended to look at map.go, slice.go and these other files <xyz.go>, while doing your implementation.
I understand you felt that your team spent a bunch of work trying to do memory calculations. The first attempt had some issues like using the encoded size, which wouldn't work because of inmemory representation of structs. And the second attempt was to use a third-party library, which lacked tests and was also wrong. I'd recommend writing unit tests, because they could tell you exactly how much memory is being used (quicker than pprof would), which would shorten your iteration time.
Clearly, we could do a better job at documenting how Go structs consume memory. It might even make a nice blog post, or a talk in the next Go conference. But meanwhile, look at the resources I mentioned above and if you open source your memory calculator, my team would be happy to help. Or, if they are not available, perhaps xy person (from the community) should also be able to help.
Closing this issue for now.
-- A Go team who cares
I apologize for the way my reply came across. I absolutely want you and other Go users to succeed at using Go, and I want you to have the tools to do that on your own.
I couldn't disagree more.
A simple, standalone test absolutely gives you more help. It lets you narrow down the problem to a simpler case that is easier to reproduce, easier to share with others, and—most critically of all—easier to debug on your own. If I had noticed a problem in a large production run, writing a simple, standalone test to check the behavior that seems to be broken is literally the first thing I would do.
This is the most important point I was trying to make in my last reply. The response to "this big program is not working right" should almost always be "write (or re-examine) a test of the smaller piece that seems to be wrong".