
Statistical memory profiling #847

Open: wants to merge 40 commits into base: trunk
@jhjourdan (Contributor) commented Oct 10, 2016

This GPR implements a mechanism to statistically profile the heap usage of an OCaml program. The sampling rate is tunable, so the overhead can be reduced as much as desired. More information about the general idea can be found in this document.

The patch can be divided into several parts:

  • Additions to the runtime system that make it possible to statistically sample allocations. Most of the runtime changes are in the file byterun/memprof.c.
  • Modifications to the native compiler that make it possible to undo, at run time, the optimization that combines several allocations into one. Specifically, we store the headers of the allocated blocks in the frame tables, together with debugging information, even when the allocation point allocates several blocks at once.
  • A new Memprof module in the standard library.
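To give an idea of how such a sampling interface can be used, here is a hedged sketch. It is based on the Gc.Memprof API eventually merged upstream (OCaml 4.11 and later, modulo later API tweaks), which differs in its details from the Memprof module proposed in this PR:

```ocaml
(* Sketch only: uses the upstream Gc.Memprof interface, not this PR's
   exact Memprof module.  We record the size and callstack of each
   sampled minor-heap allocation. *)
let samples : (int * Printexc.raw_backtrace) list ref = ref []

let start_profiling () =
  Gc.Memprof.start ~sampling_rate:1e-4 ~callstack_size:16
    { Gc.Memprof.null_tracker with
      alloc_minor =
        (fun (a : Gc.Memprof.allocation) ->
           samples := (a.size, a.callstack) :: !samples;
           (* Returning None: do not track this block after allocation. *)
           None) }
```

A consumer can later aggregate `!samples` by callstack to find the hottest allocation points.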

This GPR is still a work in progress. I would be happy to receive general comments from the OCaml development team. I also have some other concerns:

  • The user interface is, for now, rather rudimentary: the Memprof module of the standard library provides only a sampling mechanism, and the memprofHelpers.ml file, available at the root of the repository, provides a basic user interface. I think this system could share at least part of its user interface with the Spacetime profiler.
  • The caml_alloc_shr function and its variants become a mess.
    • The first reason is that, when collecting the minor heap, I need a way of allocating in the major heap without sampling these allocations.
    • The other reason is that this function is currently guaranteed not to trigger a GC (a fact that is actually relied upon in interp.c), but that conflicts with the sampling mechanism, which may call an arbitrary OCaml function at each allocation. As a result, I created yet another variant of caml_alloc_shr which allows GC calls, and which is called as often as possible. So here is my question: apart from the OCaml runtime system, do you think there is much code in the wild relying on the fact that caml_alloc_shr does not call the GC? Would it be possible to remove this guarantee?
@nojb (Contributor) commented Oct 11, 2016

Could you say a few words about how this work compares with Spacetime? When would one use one or the other? What are the advantages/disadvantages of one over the other?

@lefessan (Contributor) commented Oct 11, 2016

I won't comment more on this PR, as I am biased towards OCamlPro's Memory Profiler (ocp-memprof), but since you compared the two memory profilers in your talk, I will give my own comparison here:

  • ocp-memprof had no additional memory cost (the only information is stored in the standard block header) and is always active (you can run the program in production and just ask for a snapshot at any time), which is not the case with this work: you need to have started the profiling at the beginning, thus leaking memory for the profiler itself;
  • with this work, you get information on the backtraces at allocation points (as with Spacetime, but at a much smaller cost), but to recover the types, you need to go back to the sources. ocp-memprof, on the contrary, was able to recover types directly from the allocation points, which made it possible to aggregate by type and have an immediate interpretation of the results;
  • ocp-memprof could store the graph of pointers in its (compressed) snapshots, which means you could go back from the (problematic) blocks to the roots that retained them, information that is not available with the census provided by this work and Spacetime. More generally, many computations were possible offline on ocp-memprof snapshots.

FWIW, I am sad to see so much work (with this work and Spacetime) on building a competitor for ocp-memprof, whose license was really cheap and provided full access to the sources, and for which there is now little interest for OCamlPro to work on. Sometimes, it's better to pay a little to get a well-crafted tool with an efficient GUI than a plethora of unmaintained prototypes.

@mshinwell (Contributor) commented Oct 11, 2016

@lefessan I think many people would be pleased if OCamlPro would consider contributing the code for storing a compressed version of the heap graph (as applied to Spacetime snapshots). Although I understand this might not be possible for commercial reasons.

@lefessan (Contributor) commented Oct 11, 2016

@mshinwell

Although I understand this might not be possible for commercial reasons.

It's not a matter of commercial reasons; it's just that we have no good reason to spend any more time (and thus money) on this subject. And actually, technically, releasing this code would be useless without releasing the code to analyse the snapshots, i.e. all of ocp-memprof.

@jhjourdan (Contributor, Author) commented Oct 11, 2016

Could you say a few words about how this work compares with Spacetime? When would one use one or the other? What are the advantages/disadvantages of one over the other?

From a technical point of view, this is very different from Spacetime: Spacetime not only profiles allocations, it also constructs a call graph that can be used for debugging as well as for profiling purposes.

On the other hand, this approach chooses only a few allocations at random, and gives the user the opportunity to gather a lot of information about the state of the program when allocating. This has two major advantages: first, it is much more lightweight, making it possible to enable it in production with almost no overhead. Second, because any user-chosen function can be called when sampling, it gives more flexibility in which information is gathered.
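Concretely, choosing "a few allocations at random" can be implemented by sampling each allocated word with a small probability; instead of tossing a coin per word, one draws the distance to the next sampled word from a geometric distribution. A hedged sketch (the constant and the function name are illustrative, not the PR's actual code):

```ocaml
(* Illustrative sketch: word-level sampling with rate [lambda].
   P(distance = k) = (1 - lambda)^(k-1) * lambda, obtained by
   inverse-CDF sampling from a uniform draw. *)
let lambda = 1e-4

let next_sample_distance () =
  let u = Random.float 1.0 in
  (* log1p (-. u) = ln (1 - u); guard against a zero draw. *)
  max 1 (int_of_float (ceil (log1p (-. u) /. log1p (-. lambda))))
```

The runtime then only does per-sample work every `next_sample_distance ()` allocated words, which is what keeps the overhead proportional to the sampling rate.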

@jhjourdan (Contributor, Author) commented Oct 11, 2016

you need to have started the profiling at the beginning, thus leaking memory for the profiler itself;

This is true, but a little bit exaggerated: first, the memory overhead is, in fact, negligible if the sampling rate is low enough; and second, the memory is not "leaked", because it is recovered as soon as the corresponding memory is freed.

jhjourdan added some commits Oct 11, 2016

Fix build with MSVC.
When configured with --statmemprof, there is still a use of VLAs in memprof.c, making the code not C90-compliant and hence incompatible with MSVC.
@lefessan (Contributor) commented Oct 12, 2016

@jhjourdan

the memory overhead is, in fact, negligible if the sampling rate is low enough

Do you have numbers? For example on Coq, where backtraces might become large with some tactics?

the memory is not "leaked", because it is recovered as soon as the heap is freed.

What I meant was: if the program is leaking memory, it will leak even more, since the leaked blocks are never freed, and neither are the attached backtraces.

@jhjourdan (Contributor, Author) commented Oct 12, 2016

the memory overhead is, in fact, negligible if the sampling rate is low enough

Let us consider an extremely bad case: the captured callstacks have average length 10^4. Then, if we set the sampling rate to 10^-5 (which already gives you a lot of statistical information), the memory overhead is bounded by 10%.

Given that, in practice, I expect callstacks to be much smaller (an average length of 10^4 would mean that the program is quite close to stack overflow, in which case the programmer probably has other problems on her mind), I really think that the memory overhead is negligible.

If you are still worried about your memory being filled with callstacks, you can either compress them when sampling or limit their size.
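For concreteness, the 10% bound quoted above is simply the product of the sampling rate and the average callstack length, i.e. the expected number of extra words stored per allocated word:

```ocaml
(* Back-of-the-envelope version of the bound above: the expected number
   of extra words stored per allocated word is the sampling rate times
   the average callstack length. *)
let sampling_rate = 1e-5
let avg_callstack_len = 1e4
let overhead_fraction = sampling_rate *. avg_callstack_len
(* overhead_fraction is ~0.1, i.e. the overhead is bounded by ~10% *)
```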

@mshinwell mshinwell changed the title [WIP] Statistical memory profiling - Request for comments Statistical memory profiling - Request for comments Dec 28, 2016

@damiendoligez damiendoligez added this to the long-term milestone Sep 29, 2017

@jhjourdan jhjourdan force-pushed the jhjourdan:memprof_trunk branch from b48d787 to 5883eee Nov 20, 2018

@jhjourdan (Contributor, Author) commented Dec 19, 2018

Any update on this PR since I rebased it on trunk?

@braibant, have you been able to test the 4.07 variant on your projects?

@gasche (Member) commented Mar 11, 2019

I would be interested in getting statmemprof integrated as soon as possible, if we can find people willing to review it. For me the Spacetime integration is not a requirement, and that work has produced no observable output (I pinged Mark earlier today who will ping the secret people working on it, just in case), so I would be happy to do without it.

@braibant (Contributor) commented Mar 15, 2019

I experimented a bit with statmemprof. Here are some observations:

  • The backtrace does not contain enough information to retrieve the complete filename (it only provides the basename). My understanding is that this is due to the limited information available in Printexc.location.
  • The memory profiler has no knowledge of the data allocated outside of the OCaml heap. Two examples I care about are Bigstring.t (bigarrays -- custom blocks with small length on the OCaml heap) and Zarith.Q.t (pairs on the OCaml heap), where the actual value lives on the C heap. The samples coming from statmemprof were roughly in line with what I saw using cruder measures (htop), but there were discrepancies, potentially coming from the above.
  • I have not seen, but not measured either, a performance hit when using low enough sampling rates. I have not tried running the statistical memory profiler with high sampling rates.
  • The profiler exposes data about the tag of the blocks being allocated and the size of the block. I tried figuring out if there is any way to use the tag for analysis, in a way that can be documented, but did not arrive at something satisfying.
  • The driver that I used is an adapted version of https://opam.ocaml.org/packages/statmemprof-emacs/. The emacs integration is not something that is easy to port and I replaced this with a simple file backend. However, my understanding is that someone is working on a better terminal UI. That would be great.
  • Most of my runs were showing that almost 100% of the data was allocated with allocation kind minor. This is expected: the kind reflects that the values were initially allocated on the minor heap, not that they were still on the minor heap when sampled. Documentation about typical use cases should maybe mention that the profiler does not allow us to inspect the "age" of a value.
  • A follow-up to the above: would it be possible to have an alloc_kind tracking compaction? Would it be useful? (The interesting bit there would be values surviving multiple compactions...)

Things that might be interesting to know / do.

  • Testing the performance regression in the null-case scenario (runtime patched with the statistical memory profiler vs unpatched runtime)
  • Testing the performance regression as a function of the sampling rate
  • Using #1738, the new custom block allocation functions could allow us to track in some capacity the data allocated on the C heap.

I am very enthusiastic about this work, and would love to see it reviewed and merged. I don't think the integration with the spacetime viewer is a pre-requisite.

@gasche (Member) commented Mar 15, 2019

Nice! Thanks a lot @braibant. Two quick comments on low-hanging fruit:

The memory profiler has no knowledge of the data allocated outside of the OCaml heap. Two examples I care about are Bigstring.t (bigarrays -- custom blocks with small length on the OCaml heap) and Zarith.Q.t (pairs on the ocaml heap), where the actual value lives on the C heap.

At the Mirage retreat, @hannesm played with statmemprof and had the same issue (bigarrays are heavily used in the Mirage codebase due to Cstruct), and @chambart helped him do a sort of ad-hoc hack to track bigarray allocations (adding instrumentation directly in the C implementation of bigarrays), if I understand correctly using a separate ephemeron. It would be nicer to have support for out-of-heap resources (and why not also the "virtual cost" API of custom values) built into statmemprof. That might emerge as an enhancement PR from @hannesm's and @chambart's experiment, but I would not bet on it: someone with the resources to do something complete for upstreaming may well move quicker than this pair of already-overloaded programmers.

(For Zarith, I would assume that the out-of-heap size is relatively small, and thus fairly proportional to the OCaml-side tracking? Do bignums get arbitrarily large in real-world computations, the way bigarrays do?)

The profiler exposes data about the tag of the blocks being allocated and the size of the block. I tried figuring out if there is any way to use the tag for analysis, in a way that can be documented, but did not arrive at something satisfying.

I know that @let-def has done some amusing experiments encoding type information in "the rest of the header" (what is typically used by ocp-memprof or Spacetime); it might be possible to combine the two works in interesting ways!

@mshinwell (Contributor) commented Mar 15, 2019

@jhjourdan In terms of getting this upstreamed, I think we need to break this down into its constituent parts, much like I'm doing for the gdb work. I believe there is sufficient consensus as to the overall aim here that we should merge individual parts of the work as they become ready, even if some parts have a few ragged edges and stubs, rather than doing everything strictly in dependency order. My previous experience has shown that the latter often leads to long delays.

As discussed on caml-devel, JS are willing to devote resources to reviewing this from approximately the start of April onwards, and @let-def has kindly agreed to contribute some of his time as well. We can do the splitting up of the patch at that time if necessary. The rough idea so far is to concentrate on getting the core parts in for 4.09 and defer some of the more elaborate pieces for 4.10.

There are two elaborate pieces that I have thought about so far:

  1. The complications (which I haven't completely read in detail yet) about caml_alloc_shr and so forth, which I think stem from the decision to allow arbitrary OCaml code to run on allocations triggered from C code. (This of course includes all major heap allocations.) How about we simplify this for a first version? Here is a sketch of a plan. Add a C hook that is called at the relevant point during allocation. Add a C function that can be called by the implementation of such hook in order to trigger a previously-registered OCaml callback at a later stage. The deferral mechanism here would be much like signal handling. This would mean that all allocations from C that issue from a given external call in OCaml code would be lumped together. However for a first version, this would probably dramatically simplify the code, and practical experience suggests that the sort of allocations involved here (e.g. create an array, create a string, etc.) are "nearly" uniquely determined from the point in OCaml code at which they are called. (At least when a human is involved, looking at the output.)

  2. The Comballoc-related code (which incidentally will have merge conflicts with the gdb work; I can help resolve that). This seems as if it would be a nice candidate to defer until 4.10, but unfortunately it doesn't seem obvious how to do this, without losing the required statistical properties and risking some allocations never being sampled at all.

@lpw25 and I have discussed what to do about the problem relating to backtraces that @braibant mentions. We think we have an approach that should be straightforward to implement without much code: add some functionality to retrieve the backtrace as a list of return addresses. These can then be put through the same mechanisms (which, if I remember correctly, use @let-def's owee library) as used in the Spacetime viewer to retrieve the full source pathname, etc. from the DWARF information in the executable.

@mshinwell mshinwell changed the title Statistical memory profiling - Request for comments Statistical memory profiling Mar 15, 2019

@jhjourdan (Contributor, Author) commented Mar 15, 2019

@braibant Thanks for the feedback!

The memory profiler has no knowledge of the data allocated outside of the OCaml heap. Two examples I care about are Bigstring.t (bigarrays -- custom blocks with small length on the OCaml heap) and Zarith.Q.t (pairs on the ocaml heap), where the actual value lives on the C heap. The samples coming from statmemprof were roughly in line with what I saw using more crude measures (htop), but there were discrepancies potentially coming with the above.

I can indeed plan to add out-of-heap tracking as an improvement to the current implementation. I do not think this is a difficult addition. It will, however, require some support from the corresponding C libraries. This could be as simple as requiring them to use caml_alloc_custom_mem.

The driver that I used is an adapted version of https://opam.ocaml.org/packages/statmemprof-emacs/. The emacs integration is not something that is easy to port and I replaced this with a simple file backend. However, my understanding is that someone is working on a better terminal UI. That would be great.

AFAIK, @let-def was already doing something in that direction.

Some documentation about typical use cases should maybe mention that the profiler does not allow us to inspect the "age" of a value.

I would say that this is something that can be done outside of the OCaml runtime, e.g. as part of statmemprof-emacs or similar tooling. It is easy to imagine the sampling callback inspecting, e.g., the current time and the current status of the GC (via Gc.quick_stat) to record the date of birth of each block.
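The idea in the last paragraph can be sketched as follows; the record type and helper function are hypothetical, not part of the PR:

```ocaml
(* Hypothetical helper: a sampling callback could call this to record a
   block's "date of birth" from the GC counters and the clock. *)
type birth = {
  cpu_time : float;   (* processor time in seconds, via Sys.time *)
  minor_gcs : int;    (* minor collections completed so far *)
  major_gcs : int;    (* major collection cycles completed so far *)
}

let birth_of_now () =
  let s = Gc.quick_stat () in
  { cpu_time = Sys.time ();
    minor_gcs = s.Gc.minor_collections;
    major_gcs = s.Gc.major_collections }
```

Comparing a block's recorded `birth` against the current counters later gives an approximate "age in GC generations" for that block.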

@jhjourdan (Contributor, Author) commented Mar 15, 2019

The complications (which I haven't completely read in detail yet) about caml_alloc_shr and so forth, which I think stem from the decision to allow arbitrary OCaml code to run on allocations triggered from C code. (This of course includes all major heap allocations.) How about we simplify this for a first version? Here is a sketch of a plan. Add a C hook that is called at the relevant point during allocation. Add a C function that can be called by the implementation of such hook in order to trigger a previously-registered OCaml callback at a later stage. The deferral mechanism here would be much like signal handling. This would mean that all allocations from C that issue from a given external call in OCaml code would be lumped together. However for a first version, this would probably dramatically simplify the code, and practical experience suggests that the sort of allocations involved here (e.g. create an array, create a string, etc.) are "nearly" uniquely determined from the point in OCaml code at which they are called. (At least when a human is involved, looking at the output.)

There already is such a mechanism in this merge request, and this deferral mechanism is the source of much of the complication, since it requires a specific data structure for recording the deferred allocations. The handling of non-deferred allocations is actually rather simple.

The only simplification in your proposal (deferring all C allocations) would be in the public interface of caml_alloc_shr, not in its implementation. In any case, there will be some complexity related to the fact that the minor GC must be able to call a special version of caml_alloc_shr, since sampling should be deactivated during the minor GC. From my point of view, the added complexity in the interface of caml_alloc_shr is more than justified by the fact that backtraces for, e.g., arrays or strings are well-located.

@jhjourdan (Contributor, Author) commented Mar 15, 2019

The Comballoc-related code (which incidentally will have merge conflicts with the gdb work; I can help resolve that). This seems as if it would be a nice candidate to defer until 4.10, but unfortunately it doesn't seem obvious how to do this, without losing the required statistical properties and risking some allocations never being sampled at all.

I agree that the Comballoc-related code adds a large amount of complexity. Perhaps a solution for a first version would be to activate statmemprof only under a specific configure option, which would also deactivate Comballoc? I don't have a clear understanding of the performance impact of deactivating Comballoc, though. @xavierleroy?

@jhjourdan (Contributor, Author) commented Mar 15, 2019

@jhjourdan In terms of getting this upstreamed, I think we need to break this down into its constituent parts, much like I'm doing for the gdb work. I believe there is sufficient consensus as to the overall aim here that we should merge individual parts of the work as they become ready, even if some parts have a few ragged edges and stubs, rather than doing everything strictly in dependency order. My previous experience has shown that the latter often leads to long delays.

Alright. Then I'll try to prepare smaller patches for review when I have a bit more time than now.

@mshinwell (Contributor) commented Mar 15, 2019

@jhjourdan OK, I'm going to have to look more carefully at the details for the caml_alloc_shr code.

@hannesm (Member) commented Mar 16, 2019

Some documentation about typical use cases should maybe mention that the profiler does not allow us to inspect the "age" of a value.

I would say that this is something that can be done outside of the OCaml runtime, e.g., as part of statmemprof-emacs or similar tooling. It is easy to imagine that the sampling callback inspects e.g., the current time and the current status of the GC (via GC.quick_stat) to remember the date of birth of each block.

@chambart presented statmemprof at the MirageOS retreat in Marrakesh last week. He used a slightly modified statmemprof-emacs which includes three more numbers (roughly, GC generation numbers): the generation of first allocation, the generation of last allocation, and the average generation. If the first and last are 0 and the maximum, and the average is past the midpoint, the allocation may be a leak (it is worth looking for these). I find these additions incredibly helpful.
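The heuristic described above might be rendered like this; the type, names, and threshold are my guesses at what the modified tool computes, not its actual code:

```ocaml
(* Hypothetical rendering of the leak heuristic: an allocation site that
   has been allocating since generation 0, is still allocating at the
   latest generation, and whose average allocation generation is past
   the midpoint, is a leak suspect. *)
type gen_stats = { first_gen : int; last_gen : int; avg_gen : float }

let leak_suspect ~current_gen { first_gen; last_gen; avg_gen } =
  first_gen = 0
  && last_gen = current_gen
  && avg_gen > float_of_int current_gen /. 2.
```

The intuition: a site whose allocations keep surviving and whose mass skews towards recent generations is accumulating, rather than churning, memory.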

out of OCaml heap allocation

Yes, this would be great to have integrated. I'm not sure whether it should be statistical as well, or take every allocation into account. Some initial places where code needs to be hooked were identified (especially for bigarray); I will see whether I can develop this further.

file backend

I adapted statmemprof-emacs to work in a MirageOS unikernel, using TCP to transport data from the inside to the emacs proxy (using Marshal); see https://github.com/hannesm/statmemprof-mirage. While doing this, I ran into the issue that Printexc.raw_backtrace (part of Memprof.sample_info) is not safe to Marshal. I use Printexc.backtrace_slots info.Memprof.callstack and Marshal the resulting backtrace_slot array option, which works fine.

-> Could Memprof.sample_info contain a backtrace_slot array option instead of a raw_backtrace? Or are there usages where the raw_backtrace is needed, or is the function backtrace_slots too expensive?
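The workaround described above might look like the following sketch; the function name is mine, only the Printexc calls are the stdlib's:

```ocaml
(* Sketch of the workaround: a Printexc.raw_backtrace contains raw code
   pointers and is not safe to Marshal across processes, but the decoded
   backtrace_slot array is plain data and marshals fine. *)
let marshal_backtrace (bt : Printexc.raw_backtrace) : string =
  let slots : Printexc.backtrace_slot array option =
    Printexc.backtrace_slots bt
  in
  Marshal.to_string slots []
```

Note that `backtrace_slots` returns `None` when the program lacks debug information, so the receiver must handle that case.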

@gasche (Member) commented Mar 16, 2019

Converting a raw_backtrace into a higher-level representation has a noticeable cost (backtraces can be long); in fact, we introduced raw_backtrace precisely for the use case of statmemprof. I think having the lowest-level representation, with clearly documented tools to let users who need it convert to higher-level representations, is the right design for a profiling tool aimed at low overhead.

jhjourdan added some commits Apr 16, 2019
