statmemprof is absent in OCaml 5.0 #11911

Closed
NickBarnes opened this issue Jan 18, 2023 · 28 comments

@NickBarnes
Contributor

The statistical memory profiler, also known as "statmemprof", is currently stubbed-out. But it was tremendously useful in OCaml 4. This high-level issue exists to discuss, reason through, and then track the work to resurrect it, which I hope to complete during 2023Q1. There may be several subsidiary issues covering parts of that work.

Statmemprof was stubbed-out because of unanswered questions about its behaviour in the multicore world, both at the implementation level and for the API: what semantics are desired or even possible in the presence of multiple domains? We have to identify and answer those questions.

For reference:

  • the implementation is in runtime/memprof.c;
  • the API is in the Memprof module in stdlib/gc.ml[i].
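For readers unfamiliar with the API under discussion, here is a minimal sketch of how a client drives it, assuming the OCaml 4.14 `Gc.Memprof` interface (`null_tracker`, `start`, `stop`). The helper name `count_samples` is hypothetical, and treating the stubbed-out `start` as raising `Failure` is an assumption about how the stub behaves.

```ocaml
(* Count sampled allocations while running [work].
   Returns [None] if the profiler is unavailable (assumed to show up
   as [Failure] from the stubbed-out [Gc.Memprof.start]). *)
let count_samples (work : unit -> unit) : int option =
  let samples = ref 0 in
  let on_alloc (info : Gc.Memprof.allocation) =
    samples := !samples + info.Gc.Memprof.n_samples;
    None (* returning None means: do not track this block any further *)
  in
  let tracker =
    { Gc.Memprof.null_tracker with
      alloc_minor = on_alloc;
      alloc_major = on_alloc }
  in
  match Gc.Memprof.start ~sampling_rate:0.1 ~callstack_size:10 tracker with
  | exception Failure _ -> None
  | _ ->
    (* Always stop sampling, even if [work] raises. *)
    Fun.protect ~finally:Gc.Memprof.stop work;
    Some !samples
```

The wildcard in the `match` ignores whatever `start` returns, so the sketch does not depend on its exact result type.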

I'm starting work on this by reading through the existing code (in 4.14) in fine detail, documenting it to my own satisfaction, with the intention of then putting together a proposal for multicore semantics and notes on implementing them. All of that work will end up in comments on this issue and/or in source code comments in PR(s) linked from here.

@NickBarnes
Contributor Author

I would particularly appreciate input on the following questions:

  • Would a limited single-domain-only version of statmemprof be worth having? That is, would it be a useful goal, maybe as an intermediate step, to make statmemprof work but solely for single-domain programs, failing (or producing unreliable results) if other domains exist or are created?
  • In a multi-domain program, should statmemprof work independently in separate domains (that is, with a per-domain context: rate / stack depth / tracker)? Or is "domain" the wrong granularity of independence? Perhaps it should run per Domainslib.Task.pool? Do we need to provide a new abstraction ("memory profiling context") which would allow developers of parallel-programming libraries to make these choices?
  • In slightly deeper detail on that last bullet point, how should this "memory profiling context" interact with effects?

@gasche
Member

gasche commented Jan 18, 2023

Statmemprof was stubbed-out because of unanswered questions about its behaviour in the multicore world, both at the implementation level and for the API: what semantics are desired or even possible in the presence of multiple domains?

My understanding is a bit different: Statmemprof required a very careful polishing of the runtime system to be able to tell, at each allocation point, how to handle the trigger of a statmemprof sampling event (can we run callbacks at this point and, if not, when is the next safepoint to run one?). This careful engineering, which happened in upstream 4.x after the forking of the Multicore codebase, was never back-ported into the Multicore OCaml branch because of lack of expertise (or prioritization) from the Multicore developers, and as a result it disappeared from the 5.x runtime. It would need to be restored to get Statmemprof back, even before asking about the behavior in multi-domain contexts.

(@jhjourdan is the original author of statmemprof; @stedolan and @gadmm reviewed it carefully and could probably assist in those questions; related issues/PRs from @gadmm: #10915, #11057 )

Would a limited single-domain-only version of statmemprof be worth having? That is, would it be a useful goal, maybe as an intermediate step, to make statmemprof work but solely for single-domain programs, failing (or producing unreliable results) if other domains exist or are created?

Yes, because it forces you to address the issues above.

In a multi-domain program, should statmemprof work independently in separate domains (that is, with a per-domain context: rate / stack depth / tracker)?

Whatever is simple and is compatible with good performance. A first proposal:

  • statistical sampling is done independently on each domain
  • but the sampling probability is maintained as a global value shared by all domains

(The idea of "context" that you mention could come later, if there is need.)
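The proposal above (independent sampling per domain, one shared probability) can be sketched as follows. This is an illustration only, not the runtime's implementation: the names `sampling_rate`, `prng_key` and `words_until_next_sample` are hypothetical, and it assumes OCaml 5's `Domain.DLS` and `Atomic`. Each domain draws from its own PRNG how many words to skip before the next sample, using inversion of a geometric distribution whose parameter is the global rate.

```ocaml
(* Global configuration: per-word sampling probability, shared by all domains. *)
let sampling_rate : float Atomic.t = Atomic.make 1e-3

(* Per-domain state: each domain lazily creates its own PRNG. *)
let prng_key : Random.State.t Domain.DLS.key =
  Domain.DLS.new_key (fun () -> Random.State.make_self_init ())

(* Number of words until the next sampled word on the calling domain. *)
let words_until_next_sample () : int =
  let lambda = Atomic.get sampling_rate in
  if lambda <= 0.0 then max_int          (* sampling disabled *)
  else if lambda >= 1.0 then 1           (* every word is sampled *)
  else begin
    let st = Domain.DLS.get prng_key in
    (* u lies in (0, 1]; the clamp avoids log 0. *)
    let u = Float.max Float.min_float (1.0 -. Random.State.float st 1.0) in
    let gap = log u /. log1p (-. lambda) in
    if gap >= 1e18 then max_int else 1 + int_of_float gap
  end
```

No locking is needed on the sampling path: the only shared datum is read atomically, and the PRNG state never leaves its domain.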

@kayceesrk
Contributor

kayceesrk commented Jan 19, 2023

Statmemprof was stubbed-out because of unanswered questions

Here's my understanding of the history. Since Multicore OCaml was being developed as a fork, we typically caught up with mainline OCaml by rebasing the changes onto the Multicore OCaml branch one release at a time. Multicore OCaml typically remained a few releases behind the latest OCaml release. Hence, we could see the upstream developments several releases ahead of where Multicore OCaml was. Separately, in parallel to Multicore OCaml, Statmemprof was being developed and its API was refined over multiple OCaml releases. It was painful for the Multicore OCaml team to understand and integrate Statmemprof changes in a particular OCaml version with the knowledge that the subsequent version would change the statmemprof API (for the better). So we decided to not integrate Statmemprof until the API had stabilised (which it has now).

@sadiqj is also a good person to comment on this issue since he reviewed a number of PRs by @gadmm which improved how OCaml 5.0 treats asynchronous actions. The work by @gadmm brings in (most/all?) of the changes that were introduced to treat asynchronous actions, in a way that is compatible with what Statmemprof requires.

@NickBarnes
Contributor Author

For the record, memprof.c on trunk is very nearly identical to 4.14, apart from the fact that it's #ifdeffed-out. There are literally three different lines (passing Caml_state to caml_set_action_pending and caml_reset_young_limit, instead of no arguments). So the development and then refinement of it on the 4 trunk is all present on trunk, although not compiled, or probably compilable, at present. This is good as it means I don't have a complex merge history to pick apart and consider.

@gasche
Member

gasche commented Jan 19, 2023

But statmemprof relies on a lot of small changes in the rest of the runtime to work well. See the changelog entries around the Statmemprof merge:

Main PRs:

Buildup work:

  • Resource-safe C interface, part 1 (the 4.10 backwards-compatibility edition) #8993: New C functions caml_process_pending_actions{,_exn} in
    caml/signals.h, intended for executing all pending actions inside
    long-running C functions (requested minor and major collections,
    signal handlers, finalisers, and memprof callbacks). The function
    caml_process_pending_actions_exn returns any exception arising
    during their execution, allowing resources to be cleaned-up before
    re-raising.
    (Guillaume Munch-Maccagnoni, review by Jacques-Henri Jourdan,
    Stephen Dolan, and Gabriel Scherer)

Follow-ups:

As mentioned by @kayceesrk, it may be that the 5.x runtime is now in a good-enough state wrt. Statmemprof requirements (@gadmm would know), but this would be something to check.

@NickBarnes
Contributor Author

Thanks for all of this, these are exactly the sorts of pointers I need. The innards of memprof.c are also not at all thread-safe (in the C sense): there's a lot of global shared state all the way down to the random number generators. It's a SMOP to put all of that into some sort of thread-safe object, created on demand. It seems likely to me that this object should be per-domain, linked from the domain state (analogously to, say, the caml_intern_state).

@gasche
Member

gasche commented Jan 19, 2023

I would distinguish between:

  • the "state" of memprof sampling (the PRNG state, entry_array, callback_idx etc.), which probably should be domain-local / per-domain by default, if only for performance reasons, and
  • the "configuration" of memprof (lambda, callstack_size, init, started etc.), for which (in the absence of a strong opinion otherwise) we probably want a policy similar to other GC configuration parameters for consistency. Currently this mostly means "global to all domains, with a stop-the-world pause to change it". A first iteration could even be "fails with an error if we detect that more than one domain is currently running".

(I don't remember the use-cases for "pausing" and restarting memprof sampling, it may be that those are used to protect critical sections, and therefore should be fast and per-domain as well.)

@gadmm
Contributor

gadmm commented Jan 19, 2023

  • Would a limited single-domain-only version of statmemprof be worth having? That is, would it be a useful goal, maybe as an intermediate step, to make statmemprof work but solely for single-domain programs, failing (or producing unreliable results) if other domains exist or are created?

This should be easy to do; at one point I contemplated doing exactly that on top of #11307.

The work by @gadmm brings in (most/all?) of the changes that were introduced to treat asynchronous actions, in a way that is compatible with what Statmemprof requires.

We needed to go a bit further than 4 due to some subtleties (e.g. replacement of signals_are_pending with something else); pending #11307, one should finally get there. But it is possible to work on restoring statmemprof in parallel.

  • In a multi-domain program, should statmemprof work independently in separate domains (that is, with a per-domain context: rate / stack depth / tracker)? Or is "domain" the wrong granularity of independence? Perhaps it should run per Domainslib.Task.pool? Do we need to provide a new abstraction ("memory profiling context") which would allow developers of parallel-programming libraries to make these choices?
  • In slightly deeper detail on that last bullet point, how should this "memory profiling context" interact with effects?

I am not worried about needing support for user schedulers, etc. One example is memprof-limits, which attaches per-task context to callbacks. This is done by the library without need for special support from memprof. Concretely, I have a thread-safe map from Thread.id to per-task context (several tasks can be nested per thread; the thread id is enough to find the innermost context). I think that profilers, likewise, might want to be aware of user threads. If I want to support a notion of task tied to a user-side thread abstraction, I just need to make it parametric over the notion of thread (concretely have an interface that is parametric on a particular notion of thread id or thread-local storage). If possible it is better to avoid making memprof dependent on the user-side threading abstractions; the profiler might need to worry about this but not memprof itself.
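A minimal sketch of the kind of structure described above: a lock-protected map from a thread id to a stack of nested task contexts, parametric over the notion of thread. This is not memprof-limits' actual code. All names (`THREAD`, `Make_contexts`, `with_task`, `innermost`) are hypothetical, contexts are plain strings for brevity, and the instance uses OCaml 5 domain ids rather than `Thread.id` so that it needs only the standard library.

```ocaml
(* The user-side notion of "thread" the library is parametric over. *)
module type THREAD = sig
  type id
  val self : unit -> id
end

module Make_contexts (T : THREAD) = struct
  let lock = Mutex.create ()

  (* thread id -> stack of nested task contexts, innermost first *)
  let table : (T.id, string list) Hashtbl.t = Hashtbl.create 16

  let with_lock f =
    Mutex.lock lock;
    Fun.protect ~finally:(fun () -> Mutex.unlock lock) f

  (* What a memprof callback would call to find its task. *)
  let innermost () =
    with_lock (fun () ->
      match Hashtbl.find_opt table (T.self ()) with
      | Some (ctx :: _) -> Some ctx
      | _ -> None)

  (* Run [f] as a task with context [ctx]; tasks may nest. *)
  let with_task ctx f =
    let id = T.self () in
    let push () =
      with_lock (fun () ->
        let stack = Option.value ~default:[] (Hashtbl.find_opt table id) in
        Hashtbl.replace table id (ctx :: stack))
    in
    let pop () =
      with_lock (fun () ->
        match Hashtbl.find_opt table id with
        | Some (_ :: (_ :: _ as rest)) -> Hashtbl.replace table id rest
        | _ -> Hashtbl.remove table id)
    in
    push ();
    Fun.protect ~finally:pop f
end

(* One possible instance: one "thread" per domain. *)
module Domain_thread = struct
  type id = int
  let self () = (Domain.self () :> int)
end

module Contexts = Make_contexts (Domain_thread)
```

Swapping `Domain_thread` for a module built on a user-level scheduler's fiber id would give per-fiber tasks without any change on the memprof side, which is the point made above.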

Another point to consider is what happens if a computation (continuation) is sent from one domain to another. I think it is much simpler to have the memprof status (started/stopped) and starting parameters be global than to have the profiler negotiate with the user threading library how it should be started on every domain it manages.

Another aspect of interaction with effects is whether one is allowed to perform effects inside memprof callbacks. Here, note that no effect can escape from a callback for the simple reason that the capture of C stack frames is not supported ATM (and it is unclear that it will be). Thus any effect will have to be handled inside the callback, which makes sense from a computational point of view.

As for changing the parameters per-domain, for memprof-limits I do not need it but it does not bother me either (I would just not use it).

On the other hand, a useful property of memprof is that it guarantees that delayed callbacks run on the right thread; this should be easily preserved for domains. This might not be doable for user threads. I have to think more about it, but it might be useful to know whether a callback has been delayed or not, because the latter guarantees that it runs on the sampled thread. (In this example, directly querying the user-thread id from memprof at the moment of sampling is not doable, because it might be impossible to run any OCaml code, for the same reason that callbacks are delayed.) There is a third option to solve this issue but this needs its own discussion thread.

until the API had stabilised (which it has now).

AFAIR it was still marked as "experimental" and susceptible to change, but it would be good to stabilize it. (It would be good to have the change proposed at #9267; the discussion there is a bit long-winded but it is mostly subsumed by my papers and talks about asynchronous exceptions and progress with Stephen and Leo on this topic, so it could benefit from starting the discussion anew. Stephen told me verbally that he agrees with the change so I hope I can have positive reviews without too much discussion effort from the community.)

"global to all domains, with a stop-the-world pause to change it"

With moderate priority, I'd avoid a stop-the-world pause if possible.

(I don't remember the use-cases for "pausing" and restarting memprof sampling, it may be that those are used to protect critical sections, and therefore should be fast and per-domain as well.)

Memprof wants to avoid sampling its own allocations (edit: those done inside memprof callbacks). Thus per domain (edit: this is already per-thread; furthermore one should not expect to have to make it aware of user threads because one cannot perform outside effects inside callbacks).

@sadiqj
Contributor

sadiqj commented Jan 19, 2023

In a multi-domain program, should statmemprof work independently in separate domains (that is, with a per-domain context: rate / stack depth / tracker)? Or is "domain" the wrong granularity of independence? Perhaps it should run per Domainslib.Task.pool? Do we need to provide a new abstraction ("memory profiling context") which would allow developers of parallel-programming libraries to make these choices?

I was thinking the simplest extension would be to keep a global rate but that the allocating domain's id would be available in the allocation callbacks.

* Would a limited single-domain-only version of statmemprof be worth having? That is, would it be a useful goal, maybe as an intermediate step, to make statmemprof work but solely for single-domain programs, failing (or producing unreliable results) if other domains exist or are created?

I'm not entirely convinced we'd save that much time overall with single-domain. Happy to discuss in front of a whiteboard though.

@gadmm
Contributor

gadmm commented Jan 19, 2023

I was thinking the simplest extension would be to keep a global rate but that the allocating domain's id would be available in the allocation callbacks.

And such a global/default rate can be made customizable per-domain later without breaking the API.

I am not sure we need the allocating domain id. The current statmemprof implementation already guarantees that the callback runs on the same thread, so one just has to call Domain.self.

@jhjourdan
Contributor

  • Would a limited single-domain-only version of statmemprof be worth having? That is, would it be a useful goal, maybe as an intermediate step, to make statmemprof work but solely for single-domain programs, failing (or producing unreliable results) if other domains exist or are created?

I'm not entirely convinced we'd save that much time overall with single-domain. Happy to discuss in front of a whiteboard though.

I tend to agree. Actually, I think that trying to isolate domains is rather an additional feature than a simplification. To me, what this means is simply that we have a guarantee that the various callbacks for a block are all called in the allocating domain. This is clearly desirable in some use cases, but I would say that it adds some complexity, because once a block is in the major, the GC does not give us any guarantee in the general case on the deallocating domain for a given block (right?).

Anyway, I'm happy to discuss and give some help.

@gadmm
Contributor

gadmm commented Jan 19, 2023

I do not think that we need more guarantees than currently exist w.r.t. threads. We already have some guarantees, right? So it is probably more work to remove those guarantees.

I do not think we need constraints on which domain calls a deallocating callback, this does not make much sense to me.

I agree that simply reactivating memprof for a single domain is an intermediate step that can be done in parallel with discussions about the right multicore API.

@jhjourdan
Contributor

I do not think we need constraints on which domain calls a deallocating callback, this does not make much sense to me.

I think it does make sense, because client code will likely store information about tracked blocks in domain-local data structures.

@gasche
Member

gasche commented Jan 19, 2023

My non-expert understanding of the 5.x major GC is that each domain sweeps its domain-local shared heap, in which it promoted its domain-local minor heap. So it may be the case that we can call the deallocating callback on the allocating domain.

@kayceesrk
Contributor

I am not worried about needing support for user schedulers, etc.

Agree with this. OCaml 5.0 does not have a primitive notion of user-level schedulers. The language supports delimited continuations, which may be used for building lots of things including user-level threading. In the first iteration, it would be prudent to focus only on domains.

What I suspect may happen is that once we have a good understanding of how things work with multiple domains, we may add some primitive features to the compiler so that support for user-level threads can be built outside the compiler. A good example of this is how we've developed runtime events. Runtime events don't understand user-level threading but are aware of domains. Recently, support for custom events was merged: #11474, which is used by the meio observability tool for the eio library, which has a user-level scheduler. I am hoping we can do something similar for Statmemprof.

@sadiqj
Contributor

sadiqj commented Jan 20, 2023

The GC does not give us any guarantee in the general case on the deallocating domain for a given block (right?).

This is correct and why we can't guarantee the callbacks will always happen in the same domain. Adopted pools from terminated domains can be swept by a different domain eventually.

@gasche
Member

gasche commented Jan 20, 2023

But pools are adopted explicitly on domain shutdown, we could call another callback on sampled values there that would arrange ownership transition if necessary. (This could make adoption more costly, but it is not a performance-critical operation.)

@sadiqj
Contributor

sadiqj commented Jan 20, 2023

But pools are adopted explicitly on domain shutdown, we could call another callback on sampled values there that would arrange ownership transition if necessary. (This could make adoption more costly, but it is not a performance-critical operation.)

Pools are moved to the global available-for-adoption lists on shutdown but there isn't an explicit handoff to a new domain. That happens when another domain needs a new pool of the appropriate size and grabs one from the available-for-adoption lists (or we get to the end of a major cycle and there are still pools waiting for adoption). In both cases the original domain the pool came from has been terminated.

@gasche
Member

gasche commented Jan 20, 2023

We could call a callback on the terminating domain when the sampled block is about to be orphaned, which would be in charge of moving the user-implemented tracking data from the domain-local pool to a global resource. This could be rather cheap if we track, per-pool, whether there exists one sampled block that has such a call-me-on-orphaning callback (common case: no block has this).

@gasche
Member

gasche commented Jan 20, 2023

(We could also have a callback on readoption, or not.)

@gadmm
Contributor

gadmm commented Jan 20, 2023

Doesn't this amount to setting implementation details of the GC in stone? These could change in the future.

What are the alternatives for storing information on blocks? Is there a simple thread-safe data structure that could take advantage of the bias towards the current domain?

@gadmm
Contributor

gadmm commented Jan 20, 2023

Another API-changing aspect is that we really would like to sample stack allocations, now that the stack is treated like dynamically allocated memory.

@NickBarnes
Contributor Author

I see that the backtrace modules have diverged somewhat between 4 and 5, so memprof will also have to adapt to that.

@sadiqj
Contributor

sadiqj commented Jan 20, 2023

Doesn't this amount to setting implementation details of the GC in stone? These could change in the future.

Agree with @gadmm here, we end up exposing a lot of the internals of the GC really. We'd end up with similar things needed for compaction too.

Here's a potentially controversial proposal. Instead of keeping domain-affinity for callbacks we drop it and go further. We include the allocating domain in the tracking data that memprof keeps and then only a single domain (the one that registers callbacks) receives them.

This has a couple of benefits. It doesn't require domains doing actual useful work to stop what they're doing and run callbacks and you can get away with using fast non-thread-safe data structures in the callbacks.

The downside is that you are limited by the throughput of a single domain's callback processing, but I can't think of a realistic use case where you'd hit that (you would reduce the sampling rate).
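To make the shape of this proposal concrete, here is a user-level sketch, assuming OCaml 5 and using hypothetical names throughout (`sample`, `record_sample`, `drain`). In the proposal the runtime would do the recording itself; here a lock-protected queue stands in for it. Allocating domains only enqueue a record carrying their own id, and a single collecting domain drains the queue and runs the handler.

```ocaml
(* What would be kept per sampled allocation, including its origin. *)
type sample = { domain : int; words : int }

let lock = Mutex.create ()
let pending : sample Queue.t = Queue.create ()

(* Called on whichever domain allocates: cheap, runs no user callback. *)
let record_sample ~words =
  let s = { domain = (Domain.self () :> int); words } in
  Mutex.lock lock;
  Queue.push s pending;
  Mutex.unlock lock

(* Called only on the collecting domain. [handle] runs outside the lock,
   so it may use plain, non-thread-safe data structures. *)
let drain (handle : sample -> unit) =
  let batch = Queue.create () in
  Mutex.lock lock;
  Queue.transfer pending batch;
  Mutex.unlock lock;
  Queue.iter handle batch
```

The throughput limit mentioned above shows up here as the rate at which the collecting domain calls `drain`.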

@gasche
Member

gasche commented Jan 20, 2023

I don't see the problem with exposing the detail that the GC uses per-domain major heaps. If users were expecting their statmemprof code to keep working flawlessly in all future OCaml versions, they must be seriously disappointed by now! (Also if we have an 'orphaning' callback, we can just stop calling it if the GC changes and does not orphan anymore, and things will keep working without changing the API.)

@gadmm
Contributor

gadmm commented Jan 21, 2023

@sadiqj I remember being told that memprof tried hard to run the callback immediately at the allocation point if possible. In addition this design would make it impossible for memprof-limits to raise an exception in the right thread.

I think we may be overthinking this issue. I think there's a solution on the side of the user. You can use a domain-local single-threaded data-structure at first and then use Domain.at_exit to move the surviving data to an auxiliary global thread-safe data-structure. (Assuming your reason for preferring a domain-local structure was performance.)
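The user-side approach suggested above can be sketched like this, assuming OCaml 5's `Domain.DLS` and `Domain.at_exit`. The names (`track`, `untrack`, `orphaned`) and the choice of an `int`-keyed table of strings are hypothetical stand-ins for whatever a profiler stores about tracked blocks.

```ocaml
(* Global, thread-safe fallback for data that outlives its domain. *)
let global_lock = Mutex.create ()
let global_table : (int, string) Hashtbl.t = Hashtbl.create 16

let flush_to_global (local : (int, string) Hashtbl.t) =
  Mutex.lock global_lock;
  Hashtbl.iter (fun id descr -> Hashtbl.replace global_table id descr) local;
  Mutex.unlock global_lock;
  Hashtbl.reset local

(* Domain-local, single-threaded table, created on first use. *)
let local_key : (int, string) Hashtbl.t option ref Domain.DLS.key =
  Domain.DLS.new_key (fun () -> ref None)

let local_table () =
  let slot = Domain.DLS.get local_key in
  match !slot with
  | Some tbl -> tbl
  | None ->
    let tbl = Hashtbl.create 16 in
    slot := Some tbl;
    (* When this domain terminates, hand the survivors over. *)
    Domain.at_exit (fun () -> flush_to_global tbl);
    tbl

(* Fast path: no locking while the owning domain is alive. *)
let track id descr = Hashtbl.replace (local_table ()) id descr
let untrack id = Hashtbl.remove (local_table ()) id

(* Entries whose owning domain has terminated, sorted by id. *)
let orphaned () =
  Mutex.lock global_lock;
  let l = Hashtbl.fold (fun id d acc -> (id, d) :: acc) global_table [] in
  Mutex.unlock global_lock;
  List.sort compare l
```

A deallocation callback that runs on a different domain would miss the local table and then have to look in the global one; that lookup is omitted here.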

@gasche There is no reason to not be as backwards-compatible as possible. I wonder how much the current API is designed for this. For instance, if the tracker data-type was abstract and its elements constructed with a function, then it might be easier to add new kinds of callbacks (by adding optional arguments).

@NickBarnes It will also be necessary to add a mechanism to adopt orphaned pending memprof callbacks, like there is for the finalizer queue.

@edwintorok
Contributor

Would a limited single-domain-only version of statmemprof be worth having?
I agree that simply reactivating memprof for a single domain is an intermediate step that can be done in parallel of discussions about the right multicore API.

Just to provide a use-case here: yes, it would be very useful if OCaml 5+ supported statmemprof even if restricted just to a single domain.
Here is a concrete example that I've run into recently:
eio requires OCaml 5.0+ and is usually very fast due to io_uring, however 'statmemprof/memtrace' doesn't work on OCaml 5.0, so when flamegraphs inevitably show that most time is spent in the GC I wouldn't have the tools to find out what is causing all those allocations, so I can't use eio.
So my choices currently are: use Lwt, which is slower, but works on both OCaml 4 and 5 and I can use memtrace to debug performance issues due to allocation hotspots; or write code that uses uring or iomux directly (which works on both OCaml 4 and 5), or the several other alternatives that work on both versions (e.g. luv, but that has its own allocation hotspots on the OCaml/C interface side).
I ended up writing some test code with iomux to identify where the allocation hotspot was (it was in angstrom when called from httpaf to parse responses), but that approach won't work if the hotspot ends up somewhere inside eio itself.

This would've been a lot easier if statmemprof worked on OCaml 5+, even if just on a single domain.

@NickBarnes
Contributor Author

Fixed (finally!) by #12923.
