Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memprof support for native allocations #9230

Merged
merged 6 commits into from Jan 28, 2020
Merged

Conversation

stedolan
Copy link
Contributor

@stedolan stedolan commented Jan 7, 2020

This patch adds support for native allocations to Memprof. The tricky bit is keeping track of allocations that were combined via Comballoc, but these days the frametable has enough information to support this.

This PR depends on #9229 and #9225.

This PR doesn't yet implement precise call stacks for Comballoc: combined allocations are individually tracked, but are all given the same stack. (I'll try fixing this tomorrow).

Currently, tags are not tracked in the frame table for native allocations, so all native minor tags are reported as 0. This is fixable at some space cost in the frametables. Is this an important feature?

cc @jhjourdan

@jhjourdan
Copy link
Contributor

Thanks, you have been faster than me! I will look further the details tomorrow.

I however had another implementation plan which (I think) would have avoided the burden of local arrays with the corresponding CAMLxparams, at the cost of the need for Placeholder_value. This is the reason why I left Placeholder_value, even though I knew it was not necessary for the single allocation case.

My plan was to use the trackst.entries array to store all the temporary information, in contiguous entries. First, we do the sampling and allocate the entries. Second, we store the index of the first sampled entry in a local variable, which we register in its idx_ptr. Third, we run the callbacks. Fourth, we write the blocks values. It seems to me that we never need any local array nor any local root. This might be significantly shorter, don't you think?

@jhjourdan
Copy link
Contributor

Also, I am not completely convinced this implementation is correct, because if memprof is stopped (or restarted) while a callback is running, then we need to discard the samples instead of inserting them.

@stedolan
Copy link
Contributor Author

stedolan commented Jan 8, 2020

This might be significantly shorter, don't you think?

This is better. I'd thought it would be hard to keep the entries contiguous in the presence of blocking callbacks, but as you say this doesn't happen if we create all entries before running any callbacks. I'll try this today.

if memprof is stopped (or restarted) while a callback is running, then we need to discard the samples instead of inserting them

Well spotted!

@stedolan
Copy link
Contributor Author

stedolan commented Jan 8, 2020

This might be significantly shorter, don't you think?

This is better.

OK, I've looked at this a bit and now I think using trackst.entries is more awkward than the current local roots.

Suppose that we add all of the Comballoc blocks to trackst.entries early, and then start running callbacks. One of these callbacks blocks and another thread runs, doing some Memprof callbacks of its own. Since all of our callbacks are already in trackst, this other thread can run some of those. Two bad things happen:

  • We fight over ownership of entry.idx_ptr - we're using it to keep track of the new comballoc'd entries, which conflicts with the other thread using it during run_callback_exn.

  • The new comballoc'd entries can shrink - the allocations won't be garbage-collected but some of the alloc callbacks can return None, and since the other thread can call flush_delete some of the entries can go away.

This means that the trick of using a single idx_ptr to track the contiguous comballoc'd entries isn't reliable, so I think we're back to a local array of tracking pointers (at which point it's simpler not to use trackst.entries during construction, and atomically add all of the entries after the callbacks have run).

@stedolan
Copy link
Contributor Author

stedolan commented Jan 8, 2020

I've just pushed a simpler implementation: it removes Placeholder_value but doesn't use CAMLxparamN. Instead, once >=2 samples happen in the same comballoc block, it allocates a block on the GC heap to hold them.

Copy link
Contributor

@jhjourdan jhjourdan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this hard work!

Here is the work which I think is left to be done:

  • Either drop support for tags or give a correct implementation. Such a correct implementation will probably need to add some more information in the frame tables.
  • Have different callstack for the various blocks of a comballoc set.

In addition, I still think that an implementation based on pre-allocation in the trackst.entries array would be simpler. I think it would make simpler to fix the two issues of deallocation callbacks not being called which I noted bellow.

runtime/memprof.c Outdated Show resolved Hide resolved
runtime/memprof.c Outdated Show resolved Hide resolved
runtime/caml/stack.h Outdated Show resolved Hide resolved
runtime/caml/memory.h Outdated Show resolved Hide resolved
runtime/memprof.c Show resolved Hide resolved
testsuite/tests/statmemprof/lists_in_minor.opt.reference Outdated Show resolved Hide resolved
testsuite/tests/statmemprof/lists_in_minor.ml Show resolved Hide resolved
testsuite/tests/statmemprof/comballoc.ml Show resolved Hide resolved
testsuite/tests/statmemprof/comballoc.ml Outdated Show resolved Hide resolved
testsuite/tests/statmemprof/arrays_in_minor.ml Outdated Show resolved Hide resolved
@stedolan
Copy link
Contributor Author

stedolan commented Jan 9, 2020

Thanks for reviewing!

Either drop support for tags or give a correct implementation. Such a correct implementation will probably need to add some more information in the frame tables.

Agreed. I lean towards dropping support for tags - my impression is that they were originally present only because the information was to hand rather than because there was any compelling use-case.

Have different callstack for the various blocks of a comballoc set.

Agreed, but it'll probably be next week before I get around to this.

In addition, I still think that an implementation based on pre-allocation in the trackst.entries array would be simpler. I think it would make simpler to fix the two issues of deallocation callbacks not being called which I noted bellow.

I tried this yesterday but didn't see any easy way around the reuse of idx_ptr and the shrinking issues. Am I missing something?

@gasche
Copy link
Member

gasche commented Jan 9, 2020

(One tag-related aspect that we may want is the ability to behave differently on the allocation of "custom" values, which could benefit from problem-domain-specific resource-consumption measures.)

@stedolan
Copy link
Contributor Author

stedolan commented Jan 9, 2020

(One tag-related aspect that we may want is the ability to behave differently on the allocation of "custom" values, which could benefit from problem-domain-specific resource-consumption measures.)

This sounds useful, but I don't think a tag field in memprof helps. The tag field lets us distinguish "Custom_tag" from not custom tag, but doesn't help distinguishing 'int32' from bigarray / socket / whatever. (Some more general way to pass information from allocation sites to memprof callbacks might be handy, though).

@jhjourdan
Copy link
Contributor

I tried this yesterday but didn't see any easy way around the reuse of idx_ptr and the shrinking issues. Am I missing something?

I'm currently trying this.

runtime/caml/stack.h Outdated Show resolved Hide resolved
@jhjourdan
Copy link
Contributor

This sounds useful, but I don't think a tag field in memprof helps. The tag field lets us distinguish "Custom_tag" from not custom tag, but doesn't help distinguishing 'int32' from bigarray / socket / whatever. (Some more general way to pass information from allocation sites to memprof callbacks might be handy, though).

Agreed. tracking custom blocks is something we need, but just knowing the tag is clearly not enough. Let's postpone this for later.

runtime/memprof.c Outdated Show resolved Hide resolved
@jhjourdan
Copy link
Contributor

I proposed in stedolan#2 an alternative implementation inspired from what I was discussing above.

I gave up on pre-allocating entries, because of the issues relate to what @stedolan was mentioning. Instead, I use one and only loop for sampling, allocating entries and calling the callbacks in one go.

The code is a bit longer, but has the advantage of fixing several bugs: Memprof.start/stop can be safely called in a comballoc set and deallocation callbacks are called even if an exception is raised.

@gasche
Copy link
Member

gasche commented Jan 10, 2020

Naively, my impression was that comballoc should be fairly easy:

  • allocate the combined blocks
  • add a tracking entry for each sampled block, without running any callback
  • call caml_memprof_check_action_pending

This adds a slight delay when there are several sampled blocks; in a sense they are all delayed to run only after the last one was allocated (in term of the source code with non-combined blocks), but it is not clear to me that this is an issue.

Is this reasoning wrong? Is it in fact much more complex to get this basic level of behavior, or are you both trying to enable extra guarantees that make it much more difficult?

@jhjourdan
Copy link
Contributor

This adds a slight delay when there are several sampled blocks; in a sense they are all delayed to run only after the last one was allocated (in term of the source code with non-combined blocks), but it is not clear to me that this is an issue.

Essentially, if we do that, we delay until the next async callback check. This can be a lot later the last allocated block.

@gasche
Copy link
Member

gasche commented Jan 10, 2020

In my (simplified) mental model, the allocation of the combined blocks which have at least one sample triggers a minor-collection-like cold path (at least in bytecode mode, this is done by having set the young_limit at sampling time; I thought the same mechanism was used in native as well). At this point we are in runtime GC code, we can do the allocation, add tracking entries, and then run some callbacks, without waiting for the next async-handling event.

@jhjourdan
Copy link
Contributor

The problem then is that the code which does the allocation is the native code, not the GC code. So we cannot perform the allocation in the memprof code.

@stedolan
Copy link
Contributor Author

to elaborate slightly on @jhjourdan's answer, for @gasche and anyone following along:

In my (simplified) mental model, [... snip snip snip ...] we can do the allocation, add tracking entries, and then run some callbacks

The three steps are indeed (a) do the allocation, (b) add tracking entries and (c) run some callbacks. But we can't do them in that order.

The issue is that the initialisation of the newly-allocated block is done after the GC returns (either in ocamlopt-compiled native code or in the interpreter loop). This initialisation code assumes, reasonably enough, that the block that it allocated is the most recently allocated block. That means that we must do step (a) after step (c), since the callbacks might allocate.

If we do step (b) before step (c), then we have some tracking entries in a weird state (tracking an allocation that's not yet done) during callbacks, when the GC or other threads might run. My patch here does (c) (b) (a), creating the tracking entries only once their contents are known, by locally buffering some state while callbacks run. @jhjourdan's patch does (b) (c) (a), carefully ensuring that the weird-state tracking entries are kept in a consistent state across (c).

@jhjourdan

I proposed in stedolan#2 an alternative implementation inspired from what I was discussing above.

I gave up on pre-allocating entries, because of the issues relate to what @stedolan was mentioning. Instead, I use one and only loop for sampling, allocating entries and calling the callbacks in one go.

Nice, that looks good to me. Just to check that I've got the idx_ptr invariants right (which I think are the most subtle part by far): at any point where memprof might run, t->idx_ptr always exactly one of:

  1. NULL
  2. a pointer to the stack of the thread currently running a callback (with t->callback_running = 1)
  3. a pointer to the stack of the thread doing a Comballoc allocation (with t->block = Placeholder_offs(...) and t->cb_alloc = 1)

The reason that 2 and 3 don't conflict is that state 3 is only entered after the allocation callback has run, and the promotion/deallocation callbacks can't trigger until t->block becomes something other than Placeholder_offs(...).

@jhjourdan
Copy link
Contributor

Just to check that I've got the idx_ptr invariants right (which I think are the most subtle part by far): at any point where memprof might run, t->idx_ptr always exactly one of:

Indeed. Thanks for making this explicit. Could you write this explanation somewhere in memprof.c ?

@stedolan stedolan force-pushed the memprof-native branch 3 times, most recently from e76eece to f38ade7 Compare January 14, 2020 15:35
@stedolan
Copy link
Contributor Author

Thanks for this hard work!

Here is the work which I think is left to be done:

  • Either drop support for tags or give a correct implementation. Such a correct implementation will probably need to add some more information in the frame tables.
  • Have different callstack for the various blocks of a comballoc set.

In addition, I still think that an implementation based on pre-allocation in the trackst.entries array would be simpler. I think it would make simpler to fix the two issues of deallocation callbacks not being called which I noted bellow.

These are all done now.

runtime/backtrace_nat.c Outdated Show resolved Hide resolved
@stedolan stedolan force-pushed the memprof-native branch 2 times, most recently from c1234a7 to 5b97a9a Compare January 14, 2020 16:11
Copy link
Member

@gasche gasche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unlike what @damiendoligez and myself did for #8920, I don't have a convenient time to do a full review of this in the next week. I just discussed the review status of this with @jhjourdan, who claims that he reviewed the code carefully and that @stedolan also did -- plus evidence of good-quality review from @gadmm. On this basis I approve the PR on their behalf -- I looked at it and had the design explained to me, but did not do a full review.

This is almost the last change for Memprof in 4.11. The remaining items in my mental TODO-list are:

  1. I would like to change the Memprof.start API to use records instead of optional parameters, for reasons that will be explained in the PR that I am hoping to send in the next few days.
  2. @jhjourdan needs to port the user-side library that he wrote for the earlier version of Memprof.
  3. Hopefully we get user testing and possible feedback.

We also discussed with @jhjourdan the idea of having OCAMLRUNPARAM support for a basic memprof client (that would, for example, dump (to a environment-specified file or stderr) a tree of backtraces at each major-GC full cycle), so that people can get some memory-profiling support without changing their application code. To be discussed.

@gadmm
Copy link
Contributor

gadmm commented Jan 15, 2020

plus evidence of good-quality evidence from @gadmm. On this basis I approve the PR on their behalf

As discussed, I did not really review it (and I am not sure I will have time to do it shortly...)

@stedolan
Copy link
Contributor Author

I've made the last few fixes here, so I think this is ready to merge.

@jhjourdan
Copy link
Contributor

Modulo the failure on Mingw, the changes look good to me.

I have no idea why there is one extra backtrace slot on mingw32, though.

@jhjourdan
Copy link
Contributor

The failure on Mingw reminds me of another issue we already had with backtraces on mingw: #8641 (comment).

This seems like a real bug, but particularly hard to reproduce....

@jhjourdan
Copy link
Contributor

I started a precheck job, which succeeded so far: https://ci.inria.fr/ocaml/job/precheck/332/.

@jhjourdan
Copy link
Contributor

I just rebased on top of trunk, and fixed the conflict.

@gasche
Copy link
Member

gasche commented Jan 26, 2020

What's the status of the PR now? Are you satisfied with the CI results? There is a "revert later" commit still in the patchset.

@jhjourdan
Copy link
Contributor

The issue is the Mingw32 failure which appeared twice (once here and once in #8641), and that we do not know how to debug since we cannot observe it reliably anywhere.

That said, since the issue was observed in #8641, this is clearly not related to that specific PR, so we could merge.

@stedolan, what do you think? Could you prepare a mergeable patchset?

@stedolan
Copy link
Contributor Author

I've removed the "revert later" commit, which was some CI debugging. I'm going to wait until CI either passes or fails with no errors other than nondeterministic "Called from unknown location" one (which has nothing to do with this patch), then merge.

BTW, I think we have a fix for the weird CI failures in #9268.

@stedolan stedolan merged commit d0e0cc8 into ocaml:trunk Jan 28, 2020
@jhjourdan
Copy link
Contributor

I think we can now say that the first version of Memprof is fully merged and ready to be released! Congrats and thank to all who contributed!

@hannesm
Copy link
Member

hannesm commented Jan 31, 2020

first version of Memprof is fully merged and ready to be released

I'm keen to try this out, is https://github.com/jhjourdan/statmemprof-emacs/ supposed to work with this first version that is now in trunk?

@jhjourdan
Copy link
Contributor

You are right that statmemprof-emacs is still not compatible with the version of Memprof merged in trunk.

This needs to be done, but my time resources are getting sparse.

stedolan added a commit to janestreet/ocaml that referenced this pull request Mar 6, 2020
Memprof support for native allocations

(cherry picked from commit d0e0cc8)
stedolan added a commit to janestreet/ocaml that referenced this pull request Mar 17, 2020
mshinwell pushed a commit to mshinwell/ocaml that referenced this pull request Apr 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants