Memprof support for native allocations #9230
Thanks, you have been faster than me! I will look further into the details tomorrow. I however had another implementation plan which (I think) would have avoided the burden of local arrays with the corresponding CAMLxparams, at the cost of the need for My plan was to use the
Also, I am not completely convinced this implementation is correct, because if memprof is stopped (or restarted) while a callback is running, then we need to discard the samples instead of inserting them.
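One way to picture the concern above is an epoch check around the callback. The following is only a toy sketch in C: `profile_epoch`, `toy_start`, `toy_stop` and `toy_track` are hypothetical names invented for illustration, not the runtime's actual data structures or API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy state: NOT the runtime's real data structures. */
#define TOY_CAP 16
static unsigned long profile_epoch = 0;  /* bumped on every start/stop */
static int tracked[TOY_CAP];             /* toy table of tracked samples */
static size_t n_tracked = 0;

static void toy_start(void) { profile_epoch++; n_tracked = 0; }
static void toy_stop(void)  { profile_epoch++; n_tracked = 0; }

typedef int (*toy_callback)(int sample);

/* Runs the callback, then records its result only if profiling was not
   stopped or restarted in the meantime.
   Returns 1 if the sample was recorded, 0 if it was discarded. */
static int toy_track(int sample, toy_callback cb)
{
  unsigned long epoch_before = profile_epoch;
  int result = cb(sample);                 /* may call toy_stop/toy_start */
  if (profile_epoch != epoch_before)
    return 0;                              /* stale sample: discard it */
  if (n_tracked == TOY_CAP)
    return 0;                              /* toy table is full */
  tracked[n_tracked++] = result;
  assert(n_tracked <= TOY_CAP);
  return 1;
}

/* Two example callbacks: one well-behaved, one that restarts profiling. */
static int cb_plain(int s)   { return s * 2; }
static int cb_restart(int s) { toy_stop(); toy_start(); return s * 2; }
```

With this sketch, `toy_track(3, cb_plain)` records a sample, while `toy_track(4, cb_restart)` discards its result because the epoch changed while the callback ran.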
This is better. I'd thought it would be hard to keep the entries contiguous in the presence of blocking callbacks, but as you say this doesn't happen if we create all entries before running any callbacks. I'll try this today.
Well spotted!
OK, I've looked at this a bit and now I think using Suppose that we add all of the Comballoc blocks to trackst.entries early, and then start running callbacks. One of these callbacks blocks and another thread runs, doing some Memprof callbacks of its own. Since all of our callbacks are already in trackst, this other thread can run some of those. Two bad things happen:
This means that the trick of using a single idx_ptr to track the contiguous comballoc'd entries isn't reliable, so I think we're back to a local array of tracking pointers (at which point it's simpler not to use trackst.entries during construction, and atomically add all of the entries after the callbacks have run).
2c55baf to f3a4232
I've just pushed a simpler implementation: it removes Placeholder_value but doesn't use CAMLxparamN. Instead, once >=2 samples happen in the same comballoc block, it allocates a block on the GC heap to hold them.
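The "one sample inline, spill on the second" idea can be sketched as follows. This is a toy model only: `struct sample_buf` and the `buf_*` functions are hypothetical, and `realloc` stands in for the GC-heap block mentioned above (a real GC-heap block would also need to stay visible to the GC, which this sketch does not model).

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical buffer for the callback results of one combined allocation. */
struct sample_buf {
  size_t len;        /* number of results stored */
  size_t cap;        /* capacity of `heap`, 0 while it is unused */
  long inline_slot;  /* holds the result while there is only one */
  long *heap;        /* allocated lazily, on the second result */
};

static void buf_init(struct sample_buf *b)
{
  b->len = 0; b->cap = 0; b->inline_slot = 0; b->heap = NULL;
}

/* Appends a result. Returns 0 on success, -1 on allocation failure. */
static int buf_push(struct sample_buf *b, long v)
{
  if (b->len == 0) {               /* common case: no allocation at all */
    b->inline_slot = v;
    b->len = 1;
    return 0;
  }
  if (b->len + 1 > b->cap) {       /* second sample, or buffer is full */
    size_t ncap = b->cap ? b->cap * 2 : 4;
    long *p = realloc(b->heap, ncap * sizeof *p);
    if (p == NULL) return -1;
    if (b->cap == 0) p[0] = b->inline_slot;  /* migrate the first result */
    b->heap = p;
    b->cap = ncap;
  }
  b->heap[b->len++] = v;
  return 0;
}

static long buf_get(const struct sample_buf *b, size_t i)
{
  assert(i < b->len);
  return b->heap ? b->heap[i] : b->inline_slot;
}

static void buf_free(struct sample_buf *b)
{
  free(b->heap);
  buf_init(b);
}
```

The point of the design is that the common case (zero or one sample per combined allocation) never allocates; only the rarer multi-sample case pays for a separate block.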
Thanks for this hard work!
Here is the work which I think is left to be done:
- Either drop support for tags or give a correct implementation. Such a correct implementation will probably need to add some more information in the frame tables.
- Have different callstack for the various blocks of a comballoc set.
In addition, I still think that an implementation based on pre-allocation in the trackst.entries
array would be simpler. I think it would make it simpler to fix the two issues of deallocation callbacks not being called, which I noted below.
Thanks for reviewing!
Agreed. I lean towards dropping support for tags - my impression is that they were originally present only because the information was to hand rather than because there was any compelling use-case.
Agreed, but it'll probably be next week before I get around to this.
I tried this yesterday but didn't see any easy way around the reuse of idx_ptr and the shrinking issues. Am I missing something?
(One tag-related aspect that we may want is the ability to behave differently on the allocation of "custom" values, which could benefit from problem-domain-specific resource-consumption measures.)
This sounds useful, but I don't think a tag field in memprof helps. The tag field lets us distinguish "Custom_tag" from not custom tag, but doesn't help distinguishing 'int32' from bigarray / socket / whatever. (Some more general way to pass information from allocation sites to memprof callbacks might be handy, though).
I'm currently trying this.
Agreed. Tracking custom blocks is something we need, but just knowing the tag is clearly not enough. Let's postpone this for later.
I proposed in stedolan#2 an alternative implementation inspired by what I was discussing above. I gave up on pre-allocating entries, because of the issues related to what @stedolan was mentioning. Instead, I use a single loop for sampling, allocating entries and calling the callbacks in one go. The code is a bit longer, but has the advantage of fixing several bugs: Memprof.start/stop can be safely called in a comballoc set and deallocation callbacks are called even if an exception is raised.
Naively, my impression was that comballoc should be fairly easy:
This adds a slight delay when there are several sampled blocks; in a sense they are all delayed to run only after the last one was allocated (in terms of the source code with non-combined blocks), but it is not clear to me that this is an issue. Is this reasoning wrong? Is it in fact much more complex to get this basic level of behavior, or are you both trying to enable extra guarantees that make it much more difficult?
Essentially, if we do that, we delay until the next async callback check. This can be a lot later than the last allocated block.
In my (simplified) mental model, the allocation of the combined blocks which have at least one sample triggers a minor-collection-like cold path (at least in bytecode mode, this is done by having set the young_limit at sampling time; I thought the same mechanism was used in native as well). At this point we are in runtime GC code, we can do the allocation, add tracking entries, and then run some callbacks, without waiting for the next async-handling event.
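The trigger mechanism in this mental model can be sketched with a toy downward bump allocator. Everything here (`toy_alloc`, `toy_set_trigger`, the fixed-size `heap`) is a hypothetical illustration of "moving the limit so that a later allocation takes the slow path"; it does not reproduce the runtime's code, and it does not address the native-code ordering problem raised in the replies that follow.

```c
#include <assert.h>
#include <stddef.h>

#define HEAP_WORDS 64
static long heap[HEAP_WORDS];
static long *young_ptr = heap + HEAP_WORDS;  /* allocation goes downwards */
static long *young_limit = heap;             /* crossing it = slow path */
static int slow_path_hits = 0;

/* Arranges for the slow path to fire on the first allocation that takes
   the total allocated from now on above `words` words. */
static void toy_set_trigger(size_t words)
{
  assert(words <= (size_t)(young_ptr - heap));
  young_limit = young_ptr - words;
}

/* Allocates n words, or returns NULL if the toy heap is exhausted. */
static long *toy_alloc(size_t n)
{
  if ((size_t)(young_ptr - heap) < n) return NULL;
  long *p = young_ptr - n;
  if (p < young_limit) {
    slow_path_hits++;      /* a runtime could sample and run callbacks here */
    young_limit = heap;    /* reset the trigger */
  }
  young_ptr = p;
  return p;
}
```

After `toy_set_trigger(5)`, a first 3-word allocation stays on the fast path and the second one crosses the limit and takes the slow path once.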
The problem then is that the code which does the allocation is the native code, not the GC code. So we cannot perform the allocation in the memprof code.
To elaborate slightly on @jhjourdan's answer, for @gasche and anyone following along:
The three steps are indeed (a) do the allocation, (b) add tracking entries and (c) run some callbacks. But we can't do them in that order. The issue is that the initialisation of the newly-allocated block is done after the GC returns (either in ocamlopt-compiled native code or in the interpreter loop). This initialisation code assumes, reasonably enough, that the block that it allocated is the most recently allocated block. That means that we must do step (a) after step (c), since the callbacks might allocate. If we do step (b) before step (c), then we have some tracking entries in a weird state (tracking an allocation that's not yet done) during callbacks, when the GC or other threads might run. My patch here does (c) (b) (a), creating the tracking entries only once their contents are known, by locally buffering some state while callbacks run. @jhjourdan's patch does (b) (c) (a), carefully ensuring that the weird-state tracking entries are kept in a consistent state across (c).
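The (c) (b) (a) ordering with local buffering can be sketched as a toy model. The names `toy_comballoc`, `toy_callback`, `entries` and `trace` are hypothetical; the sketch only records the order of events and makes no claim about how the actual patch stores its state.

```c
#include <assert.h>
#include <string.h>

#define MAX_BLOCKS 8

static char trace[64];            /* event log: 'c', 'b' and 'a' */
static int entries[MAX_BLOCKS];   /* toy tracking table */
static int n_entries = 0;

static void emit(char e)
{
  size_t n = strlen(trace);
  if (n + 1 < sizeof trace) { trace[n] = e; trace[n + 1] = '\0'; }
}

/* Toy allocation callback: returns the value to track for this block. */
static int toy_callback(int block) { emit('c'); return 100 + block; }

/* sampled[i] != 0 means block i of the combined allocation is sampled. */
static void toy_comballoc(const int *sampled, int n)
{
  int local[MAX_BLOCKS];          /* callback results, buffered locally */
  int n_local = 0;
  assert(n <= MAX_BLOCKS);

  /* (c) run the callbacks first; the tracking table is not touched yet */
  for (int i = 0; i < n; i++)
    if (sampled[i]) local[n_local++] = toy_callback(i);

  /* (b) publish the tracking entries now that their contents are known */
  for (int i = 0; i < n_local; i++)
    if (n_entries < MAX_BLOCKS) { entries[n_entries++] = local[i]; emit('b'); }

  /* (a) the allocation itself comes last, so the new block is still the
     most recently allocated one when it gets initialised */
  emit('a');
}
```

Calling `toy_comballoc` on three blocks of which the first and third are sampled leaves `trace` equal to `"ccbba"`: both callbacks, then both entries, then the allocation.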
Nice, that looks good to me. Just to check that I've got the idx_ptr invariants right (which I think are the most subtle part by far): at any point where memprof might run,
The reason that 2 and 3 don't conflict is that state 3 is only entered after the allocation callback has run, and the promotion/deallocation callbacks can't trigger until
Indeed. Thanks for making this explicit. Could you write this explanation somewhere in
e76eece to f38ade7
These are all done now.
c1234a7 to 5b97a9a
Unlike what @damiendoligez and myself did for #8920, I don't have a convenient time to do a full review of this in the next week. I just discussed the review status of this with @jhjourdan, who claims that he reviewed the code carefully and that @stedolan also did -- plus evidence of good-quality review from @gadmm. On this basis I approve the PR on their behalf -- I looked at it and had the design explained to me, but did not do a full review.
This is almost the last change for Memprof in 4.11. The remaining items in my mental TODO-list are:
- I would like to change the Memprof.start API to use records instead of optional parameters, for reasons that will be explained in the PR that I am hoping to send in the next few days.
- @jhjourdan needs to port the user-side library that he wrote for the earlier version of Memprof.
- Hopefully we get user testing and possible feedback.
We also discussed with @jhjourdan the idea of having OCAMLRUNPARAM support for a basic memprof client (that would, for example, dump (to an environment-specified file or stderr) a tree of backtraces at each major-GC full cycle), so that people can get some memory-profiling support without changing their application code. To be discussed.
As discussed, I did not really review it (and I am not sure I will have time to do it shortly...)
ec74919 to 1839d52
I've made the last few fixes here, so I think this is ready to merge.
Modulo the failure on Mingw, the changes look good to me. I have no idea why there is one extra backtrace slot on mingw32, though.
The failure on Mingw reminds me of another issue we already had with backtraces on mingw: #8641 (comment). This seems like a real bug, but particularly hard to reproduce...
I started a precheck job, which succeeded so far: https://ci.inria.fr/ocaml/job/precheck/332/.
b310d31 to 83a47c2
I just rebased on top of trunk, and fixed the conflict.
What's the status of the PR now? Are you satisfied with the CI results? There is a "revert later" commit still in the patchset.
The issue is the Mingw32 failure which appeared twice (once here and once in #8641), and that we do not know how to debug since we cannot observe it reliably anywhere. That said, since the issue was observed in #8641, this is clearly not related to that specific PR, so we could merge. @stedolan, what do you think? Could you prepare a mergeable patchset?
eaf8557 to ca5f81c
I've removed the "revert later" commit, which was some CI debugging. I'm going to wait until CI either passes or fails with no errors other than the nondeterministic "Called from unknown location" one (which has nothing to do with this patch), then merge. BTW, I think we have a fix for the weird CI failures in #9268.
I think we can now say that the first version of Memprof is fully merged and ready to be released! Congrats and thanks to all who contributed!
I'm keen to try this out. Is https://github.com/jhjourdan/statmemprof-emacs/ supposed to work with this first version that is now in trunk?
You are right that statmemprof-emacs is still not compatible with the version of Memprof merged in trunk. This needs to be done, but my time is getting scarce.
Memprof support for native allocations (cherry picked from commit d0e0cc8)
Memprof support for native allocations
This patch adds support for native allocations to Memprof. The tricky bit is keeping track of allocations that were combined via Comballoc, but these days the frametable has enough information to support this.
This PR depends on #9229 and #9225.
This PR doesn't yet implement precise call stacks for Comballoc: combined allocations are individually tracked, but are all given the same stack. (I'll try fixing this tomorrow).
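To illustrate what "individually tracked" involves, here is a minimal sketch of recovering each block's position inside a combined allocation from the per-block sizes. The function name and the layout order (first block at the lowest offset) are assumptions made for illustration; only the "one header word plus the fields" size of a block is taken as given.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: given the field counts (wosizes) of the n blocks
   merged into one combined allocation, fills offsets[i] with the word
   offset of block i's header from the start of the combined allocation.
   Returns the total size of the combined allocation in words. */
static size_t toy_comballoc_offsets(const size_t *wosizes, size_t n,
                                    size_t *offsets)
{
  size_t pos = 0;
  assert(n == 0 || (wosizes != NULL && offsets != NULL));
  for (size_t i = 0; i < n; i++) {
    offsets[i] = pos;
    pos += wosizes[i] + 1;   /* one header word plus the fields */
  }
  return pos;
}
```

For blocks of 2, 3 and 1 fields this yields offsets 0, 3 and 7 and a total of 9 words, which is the kind of per-block information needed to attribute a sample to one block rather than to the whole combined allocation.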
Currently, tags are not tracked in the frame table for native allocations, so all native minor tags are reported as 0. This is fixable at some space cost in the frametables. Is this an important feature?
cc @jhjourdan