Memprof new api #8920
Conversation
I like the new API. One tiny suggestion: it might be a good idea to make
What would you think about using a private type, so that we avoid a heavy list of accessors for this type in the API? As far as I know, a private record can be extended without compatibility issues.
The callbacks let you return a None -- in fact it looks like the previous API also did, I had just never wondered about it. What's the semantics if you return None? The value is not instrumented, but does this affect the sampling in any way? (How do we reason about uniformity in this case?)
This version is quite nice. However, the previous version provided direct access to the value, which allowed tracking it specifically; this one lacks that. For instance, you might want to traverse the memory graph to know why a specific value was alive. One other nice addition would be to allow the user to add a value to the tracked list. But this can be done later, outside of this PR.
A private record is good too (although I don't think the list of accessors is actually longer than the record definition).
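For readers following along, here is a hypothetical sketch of the private-record style under discussion (field names are illustrative, not the PR's actual ones): a private record can be read but not constructed outside its module, so fields can be added in later releases without breaking client code.

```ocaml
module type MEMPROF = sig
  (* Clients can read fields of [allocation] but cannot build values of it,
     so future versions may add fields without breaking existing code. *)
  type allocation = private
    { size : int                          (* size of the sampled block, in words *)
    ; callstack : Printexc.raw_backtrace  (* call stack at the allocation point *)
    }
end
```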
This access to the value as an
Given all this it seemed reasonable to remove the . I would be curious to understand your proposed use-case. Your idea is that, at a given time in the program execution, we can trace the whole live heap, record a path from the memory roots to each value, and store that path for those values that are currently being sampled?
As documented:
As for uniformity: if you use this mechanism to filter out some blocks, then you sample according to the probability distribution of blocks that satisfy the given filter. You can imagine, for example, considering only blocks that are allocated during the execution of some parts of the program.
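To make the filtering idea concrete, here is a minimal sketch written against the tracker shape that eventually shipped in OCaml 4.11 (the exact names in this PR may differ): the allocation callbacks return `None` outside the phase of interest, so only blocks allocated during that phase are sampled and tracked.

```ocaml
let in_phase = ref false

(* Track only blocks allocated while [in_phase] is true. *)
let phase_tracker : (unit, unit) Gc.Memprof.tracker =
  { Gc.Memprof.null_tracker with
    alloc_minor = (fun _ -> if !in_phase then Some () else None);
    alloc_major = (fun _ -> if !in_phase then Some () else None) }

let () =
  Gc.Memprof.start ~sampling_rate:1e-4 phase_tracker;
  in_phase := true;
  (* ... the part of the program we want to profile ... *)
  in_phase := false;
  Gc.Memprof.stop ()
```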
Another use case would be sampling allocations without caring about how long they were live. Spacetime's viewer has a graph of just allocations over time and it is often quite useful.
Apparently, one test fails with MinGW32, with a weird error code. This looks like a segfault, but I cannot reproduce it on Linux (even with Valgrind and the debug runtime).
@gasche I don't need to have access to the value itself, but to be able to follow a specific value. For instance, a fresh int would be sufficient, and would in fact be easier to follow than an Obj.t. My use case would be to produce a dump of the heap and be able to point out, in the memory graph, which values are followed by memprof.
Otherwise, a specific type which usually matters a lot for memory usage is bigarrays, which can be worth following specifically and introspecting. Given that these come from C-side allocations, they could have a different API that triggers when returning from C to OCaml, where we know that the value should be completely initialized. But this seems outside the scope of this PR.
Maybe we could have a generic approach to track custom values, not just bigarrays; in particular, we could use their "virtual size" to decide how to sample them. This is indeed outside the scope of the present PR.
@stedolan : gentle ping. Are you aware that this PR exists? Are you planning to do a review?
@jhjourdan Yes and yes! Apologies for the delay, I've been pretty busy (I was visiting the JS NYC office over the last two weeks, just back now).
The new API looks great. (I'm particularly looking forward to sampling blocks that get promoted). I haven't finished reading the implementations yet, but abstractly I prefer the malloc-based version, both for performance and because it has less impact on the GC behaviour of the program being traced. I am surprised by the amount of code involved here. I'd hoped that it would be possible to reuse the finalisation mechanism (the existing code in
To clarify: ephemerons are supported in multicore, as @kayceesrk has already done the hard work of making them work. However, I would like to revisit the API (particularly the mutable bits like set_key), since the current one causes quite a lot of headaches in multicore. So, I'd much prefer it if the Memprof API didn't depend on mutating the key of an ephemeron.
Here is the relevant issue in the multicore repo that discusses the ephemerons implementation: ocaml-multicore/ocaml-multicore#88. But as @stedolan pointed out, I'd much prefer not using set_key, as a cleaner alternative API would not provide set_key.
I did not try this, but I don't think this will simplify the code significantly. Much of the complexity is in a data structure which makes it possible to:
All this complexity already arises with promotion only, and delegating deallocation to finalisers will only marginally simplify the code. Moreover, note that the semantics of finalisers is slightly different: a finaliser is a closure attached to a block, while memprof attaches some user data that has to be given to a globally defined closure. Resetting this globally defined closure cancels the tracking of the corresponding blocks, which is not something that is currently possible with finalisers. That's not an intractable issue with your approach, but at least it is an annoyance.
Ah, now that I think about it, I understand what you mean by reusing the mechanism of finalizers. You will have to extend finalizers to support "promotion finalizers", and do some tricks (using references?) to allow threading user data through the various callbacks. This can certainly work; it's effectively an intermediate solution between the malloc-based and the heap-based solutions I proposed, since some of the data structures will be in the OCaml heap (the references used to thread the user data) and some will be in the malloc heap (the finalizer information). This will probably have more overhead (since data structures would be allocated both on the OCaml and the malloc heaps), but that might be acceptable. Feel free to give it a try. I don't foresee any more difficulties than the ones I presented above.
Thanks for the explanation! I'll have a go at the finaliser-based one today and see how it looks.
I wrote the finaliser-style one. Because of the differences between finalisers and memprof that you mentioned (particularly promotion callbacks), I ended up reimplementing some of the data structures from finalise.c and not sharing code after all. Despite that, it still ends up being less code than the malloc-based linked lists. Instead of multiple linked lists, there's a single array of active samples. The code's available here, I'm going to run some benchmarks now and post the results here.
Here are some benchmark results, for both short-lived and long-lived objects, and for tracked (i.e. callback returns Some ()) and untracked (callback returns None) samples:
Benchmark code here. All numbers are median-of-5 executions in seconds. Edit: I've just added numbers for trunk, using the previous ephemeron-based API. The 'untracked' versions do not create an ephemeron, while the 'tracked' versions create an ephemeron and store the last 16k ephemerons in an array. (If the ephemeron is just discarded, then it's much faster as the ephemeron itself gets collected. But then no actual lifetime tracking gets done). With the trunk numbers, either implementation of this API seems to be much, much faster than using ephemerons to track lifetimes.
Thanks for this attempt. Indeed, the code is significantly shorter with this single array of tracked blocks. However, as is, I have several concerns:
Thanks for the benchmarks. My conclusion is that (not so surprisingly) ephemerons add a significant overhead for tracked blocks. But the two implementations of the new API are close to a tie: yours is a bit faster (<= 5%), but I suspect it requires a bit of debugging (see my previous message).
Hmmm. Thinking about this, I think the simplest approach is to make memprof callbacks single-threaded, using a lock. As well as not having to worry about reentrancy in the runtime, this also makes life simpler for the user of the memprof API, who need not worry about making the callbacks themselves threadsafe.
Thanks for spotting this bug, but I think the fix is very easy! Updating the young pointer when writing user_data is fine: at the next minor GC, the part of the array examined will be the shortest suffix of the array containing recently sampled blocks and recently executed callbacks.
Agreed. I'll do some debugging and see how it goes. A performance improvement is nice, but my main motivation here is shorter code. The other feature I haven't implemented yet is handling of exceptions raised from memprof callbacks. I'm wondering what the right thing to do here is: is there any reason to want these? Shouldn't they just be a fatal error?
Aren't we risking a sequential bottleneck at moderate sampling rates (say 1/100)? If a few minor allocations are both much slower than the others and sequential, this sounds like a bad yet fairly common/natural scenario. The most common memprof callback is to just return/store the backtrace. We can expect most callbacks to be performing aggregation, which is easy to do in a thread-local way and gather at statistics-production time. This sounds like a problem domain where thread-safety is natural. (Of course it also makes sense to aim for implementation simplicity at first, and refine the implementation if we do observe a bottleneck in practice.)
I disagree that this makes for a stronger spec and makes Memprof safer: indeed, if this is what they want, printing a message to

I hope this clarifies the circumstances in which these exceptions appear. I feel like dropping events is a big deal to others and not to me, and I do not understand why.
I feel a bit guilty for going deep into this question of exception-handling. I think we collectively spent a lot of time on it and I would rather see Memprof as a whole go forward, support native allocations and be usable by everyone in OCaml 4.11, even if this handling of exceptions-within-callbacks turns out to be not-perfect -- we can always iterate to improve it later. None of the proposals so far is obviously better than the current behavior. (I'm not convinced that the new
Sure, let us not pressure ourselves to fix this for 4.11 and take our time. This API is still documented as experimental, after all. I use this PR because this is where the discussion happened, and scattering the discussion all over the place would do harm.

However, I was pointing out a bug in the current API that I find important: you cannot program a reliable

However, if I may have been a bit annoying, it is because this was not my impression from the replies. Instead, my impression is that we are misunderstanding each other. For instance, in case it was not clear, what @jhjourdan described is not what I was asking for: I have the impression that it mixes what will need to be done when I introduce masking with what could be done to fix this new bug which I noticed afterwards (which is about having a convenient API as the default). It is worth spending time to clear up this misunderstanding, because to me it is not about Memprof. We already had some of this discussion for

So, what are we supposed to do in the

I will stop for now.
I'm not sure if this has already been suggested, but how about we always treat an exception from a memprof handler by dropping any remaining callbacks and stopping memprof? IIUC that would make "processing all callbacks as part of

I also think it is sensible behaviour -- if you think it is a bug for a callback to raise an uncaught exception, then it is reasonable to stop executing callbacks known to be buggy.
(Note that my above suggestion says nothing about how you report the failure of the callback -- e.g. via stderr or by re-raising the exception. I think that issue is orthogonal).
This makes sense to me. In that case I think that you want to re-raise the exception to tell whoever started Memprof that it has stopped. This is a bit more radical than my suggestion. In one use-case that I had in mind (allocation limits), you might want to keep raising new exceptions if further limits are reached. This is what Haskell's allocation limits do. This might be less important in theory than in practice.
( @gadmm : I certainly wasn't complaining about your comments in the thread, and I completely agree that letting users define a correct
BTW, if there is an agreement on a solution, I don't mind coding it. But indeed it is not clear whether we are close to reaching agreement.
Note that I think it is possible to reliably stop memprof using something like:

```ocaml
let rec do_stop_memprof () =
  try Gc.Memprof.stop ()
  with exn ->
    (* HANDLE EXCEPTION *)
    do_stop_memprof ()
```

The only possible way of leaving the loop is by succeeding a call to Gc.Memprof.stop.

As for a

Now, if you think the current API is not the right one, then I actually think @lpw25's proposal is sensible: we could use that behavior before settling on something that everybody agrees on, and document that the behavior wrt. exceptions is subject to change in the future.
I am happy to pre-announce memprof-limits, an implementation of global memory limits and allocation limits à la Haskell, compatible with systhreads, based on memprof. The main reason I did this now is that I found it fairer if you all had the opportunity to criticise concrete code rather than (what looked like) hypotheticals on my part. It consisted of two exercises, which allows me to provide some further feedback on the API:
I'll try to make it self-contained.
I do not mind coding it if we reach agreement (at least for the changes that are easy to do), so that is not an obstacle. For self-containedness, let me recall why we need memprof-limits. It can be useful regarding known issues with loss of session in Coq when memory grows unpredictably (or similar tools, which already know how to handle asynchronous exceptions), or runaway requests on a server. I also think it can make realistic a relaxed specification for Out_of_memory that rationalizes the currently awful behaviour of not raising on minor allocation, one usage being replaced by the other. In combination with other improvements I have in mind for Out_of_memory, it solves the issues (when the alternative is to implement reliable OOM during minor GC, which is hard).
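For concreteness, here is a much simplified sketch of the allocation-limit idea (not memprof-limits' actual code; it assumes the 4.11-style API, a single thread, and no masking): with sampling rate r, each sample stands for roughly 1/r allocated words, so the callback raises once the estimated total exceeds the limit.

```ocaml
exception Allocation_limit_exceeded

let with_allocation_limit ~limit_words f =
  let sampling_rate = 1e-4 in
  let estimated = ref 0. in
  let count (a : Gc.Memprof.allocation) =
    (* Each sample represents about 1/sampling_rate allocated words. *)
    estimated := !estimated +. float_of_int a.n_samples /. sampling_rate;
    if !estimated > float_of_int limit_words then
      raise Allocation_limit_exceeded;
    None  (* we only count allocations; no need to track the block *)
  in
  Gc.Memprof.start ~sampling_rate
    { Gc.Memprof.null_tracker with alloc_minor = count; alloc_major = count };
  (* As discussed later in this thread, Fun.protect is not fully robust
     against asynchronous exceptions; it keeps the sketch short. *)
  Fun.protect ~finally:Gc.Memprof.stop f
```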
While I do like this idea, I'm still unconvinced that basing it on Memprof is a good implementation. If I set an allocation limit of 100 KB, and run a function that allocates exactly 50 KB, then using Haskell's allocation limits the limit will not be reached and the function will succeed. Using
I agree that this should certainly hold. Earlier, I said that I preferred a stronger statement with the guarantee that all callbacks run holding even in the failure case. (I mean "stronger" in the sense of a logically stronger proposition, not necessarily "better". Apologies if this caused confusion). I'm more on the fence about this now.
No. Making this guarantee can cause unbounded delays. Suppose an allocation is sampled on a thread in a C call (where the callback cannot run promptly), and the C call then enters a blocking section. It can be a very long time before the blocking section returns.
You are right, but I think it is still a valuable proof of concept:
I see. This assumption is made in statmemprof-emacs, and while I do not use it currently, I agree that it is a useful assumption. However, an alternative would be to record the allocating thread id in the allocation record. In fact, it can be hard to know what to do if we are in an allocation callback for an allocation that happened in a different thread, if we do not have this information.
I am very thankful to @gadmm for having done this experiment. I would like us to converge on what we believe is a solid API before the 4.11 release (which will be the first exposing the Memprof API to the world). My feeling is that only little tweaks are needed, if any, to get something nice, but that we needed to get experience from actual use-cases. Whether or not memprof is the right long-term approach for memory limits, it sounds like an excellent use-case to clean up corner cases of the API. Below I try to discuss each point of @gadmm's feedback that could be interpreted as a modest/reasonable API change/improvement suggestion, and propose an API to implement it. If people were kind enough to give an opinion on the proposed APIs, we could try to implement them quickly. It would be nice to prioritize changes to existing APIs (as opposed to just adding more stuff), to avoid changing existing programs after the first API is released. (Of course, we marked things as experimental so that we are allowed to change them.)
### Memprof.stop

Apparently you all more or less agree with the fact that

Formally, there is no way to observe that this guarantee indeed holds, because asynchronous exceptions are, well, asynchronous. That is, the exception can occur immediately before

Said otherwise: in a sense, this guarantee already holds, since when the function raises an exception, we can always pretend that it has never been called and that the exception occurred just before the call. The user will never be able to prove the contrary.

In the same vein, @gadmm's function

Note that the problem is actually quite fundamental, since it also exists with signals. And I think the signals API currently has the same "defect" as the current memprof API.

### Threads

As already mentioned, I have an almost-ready PR to submit which provides this guarantee. There is a question we will have to answer, though: if we provide this guarantee and a

Perhaps the right thing to do, then, is simply to drop the remaining callbacks in

### Querying the running state and parameters

I agree with @gasche's proposal.

### Mutating the running parameters?

Mutating the parameters using start/stop is, IMO, vain. This would not be atomic, hence incompatible with threads, and this will likely drop callbacks if we implement the semantics of

Having a function which allows mutating the runtime parameters is, I think, opening a can of worms because of callbacks remaining in the queue
Taking a step back, what I want is to be able to run a piece of code with memprof monitoring on, and the guarantee that memprof monitoring is stopped after the piece of code is done running (through normal return or an exception, synchronous or not). So really I would like to have a function
If there is an interface for
@stedolan asked an important question: whether it is really possible to program at all with things like allocation limits, given the statistical nature of memprof. This is important in the discussion, given that I proposed such an application to decide in favour of allowing asynchronous exceptions to be raised. I have done a statistical analysis of memprof-limits here: https://gitlab.com/gadmm/memprof-limits#documentation. First, I have found that the statistical nature of memprof makes it very easy to reason about the application and not worry about implementation details (e.g. other threads interfering with the sampling). I have also found that its deterministic nature was (essential and) useful for reproducing runs. I find the idea and the design of Memprof very, very nice. Long story short, memprof-limits becomes accurate enough starting around a safe allocation value of 100 KB (meaning a limit of 1 to 3 MiB depending on the chosen precision), with the ratio between the maximal safe allocation and the limit dropping very quickly for higher values. Correctly, the analysis shows that limits under 500 KiB, such as the example objected to by Stephen, are unreliable. It was entirely true that memprof-limits was incomplete before we had the statistical analysis, and Stephen was right to object. As to whether users can program with memprof-limits, that is, write programs and reason about them, I also make two hypotheses:
Under these hypotheses, the statistical limit is just as reliable as precise limits à la Haskell. Also, the precision is much better than I expected, so this opens the possibility of reliable-enough local memory limits based on the allocator-pays model (that is, like allocation limits but also counting deallocations) with a low-enough sampling rate (which are more computationally intensive, because they need to track blocks). Replies to @gasche and @jhjourdan about tweaks to the API are coming later.
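As a back-of-envelope illustration of the precision involved (my own arithmetic, not taken from the linked analysis): with sampling rate r, the number of words allocated before the k-th sample has mean k/r and standard deviation about sqrt(k)/r, so a limit that triggers at k = r × limit samples has a relative standard deviation of about 1/sqrt(r × limit).

```ocaml
let relative_std ~rate ~limit_words =
  1. /. sqrt (rate *. float_of_int limit_words)

let () =
  (* With rate 1e-4 and a 1 MiB limit (131_072 words of 8 bytes),
     this prints about 0.28, i.e. a ~28% spread. *)
  Printf.printf "%.2f\n" (relative_std ~rate:1e-4 ~limit_words:131_072)
```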
@jhjourdan, does this sound realistic to you in light of your upcoming PR, or does it need more time and might it require further API changes? I am happy to help if some things need action before 4.11. (P.S.: apologies for the length.)

### Memprof.stop
I think you just need some way to force flushing, and this will come automatically with the BSP patch which, AFAIU, introduces a primitive to poll without allocating. Then the user can define:

```ocaml
let try_stop () = poll () ; stop ()
```

(As a stopgap measure one could have a C primitive to flush without allocating, for expert users, pending further feedback.)
I agree that this reasoning makes sense and is sound, but to me all it shows is that you cannot reliably stop memprof in the normal path, and one should stop it instead in the release path, which ought to be safe for asynchronous exceptions. What we learn from systems programming languages is that the release path should have a special status. Furthermore, it would be wrong to say that OCaml cannot yet offer a release path which is safe for asynchronous exceptions. Indeed, the following implementation of Haskell's bracket in OCaml:

```ocaml
let with_resource ~acquire arg f ~(release : _ -> unit) =
  let release_noexn x = try release x with _ -> () in
  let r = acquire arg in
  match f r with
  | y -> release_noexn r ; y
  | exception e -> release_noexn r ; raise e
```

is actually sound in native code under the condition that

It is true that the naive code likely to be written by users:

```ocaml
M.start () ;
Fun.protect ~finally:M.stop f ()
```

is incorrect in exactly one way: it makes one unprotected allocation of a callback, and if memprof samples it and the alloc_minor callback raises, then memprof will not be stopped.

Another question which is implicit in your objection is whether users should be given

So one is stuck between wanting something convenient for one kind of user, and something general for another kind. What I propose does not change anything for the latter, but makes it slightly more convenient for the former. But if one would like to replace
I disagree: in memprof-limits I only use it in a way which is safe for asynchronous exceptions. (I have marked several places in the code where I would need an implementation of
For the problems with asynchronous exceptions, I agree. But

### Threads
I am more on the fence now. Looking at it more closely, memprof-limits has a bug to fix due to this lack of guarantees, but all I need to fix it is to obtain the thread id along with the allocation record. I do not yet have the details of @jhjourdan's new design, but I wonder what happens when you send the value across threads before the callback manages to run.
That would be furiously incompatible with reasonable implementations of masking, which should allow releasing the runtime lock and also let systhreads yield inside the mask (such as the one I am working on).
Right; the alternative, which adds a thread id to the allocation record, cannot be left for later by your criterion, so it would be nice to see what @jhjourdan has in mind. Another note about the thread id: the notion could be moved to the stdlib so as not to be systhread-specific. Currently, any memprof application that wants to be compatible with systhreads (such as memprof-limits) has to force the dependency on systhreads. But I do not need systhreads, I just need to know the thread id, which could have a default implementation that always returns 1 when not using systhreads.
So there is the question of what happens when the value is sent across threads before the callback ran: how is the ordering of callbacks (allocate, promote, deallocate) maintained? Especially in light of the unbounded delays explained by @stedolan.
By seeing it as a release vs. flush problem, you would indeed need a stop() function that drops callbacks, to be called in the release path, but you would also need a

(Then one can wonder what should happen if an exception arises from a callback in another thread while flushing... then, I think we can consider dropping the callbacks of that thread. But then it makes it impossible to combine memprof-limits with a profiling library provided by the user, for instance, because callbacks from one will drop callbacks from the other... This does not happen with a global queue. Edit: this is not an issue when flushing is used before stopping, but it is an issue for flushing before mutating a parameter as suggested below.)

### Querying the running state and parameters
I remember thinking about it, and not finding concrete uses. The main issue is that stopping and starting again should not be a noop, because it should forget about all tracked blocks (or it would sound like a bug to me).
Indeed it would be odd to define a type

What about simply:

```ocaml
val sampling_rate : unit -> float option
val callstack_size : unit -> int option
```

or, even simpler:

```ocaml
val started : unit -> bool
val sampling_rate : unit -> float
val callstack_size : unit -> int
```

with values

### Mutating the running parameters?
This is the "elaborate way to do that currently without losing samples" I alluded to. Again, if stopping and starting again does not lose samples, then something is broken I think.
I agree this should not happen, nor should callbacks be dropped in that case. The suggestion was to flush the callbacks before changing the value. With the design you propose above, flush would need to synchronise, but the principle is the same. Again, I find that per-thread memprof is rather more complicated (Jacques-Henri knows that I liked the idea originally) when in practice all I found I really need is the thread id. Do you have other reasons in favour? And indeed it might require dropping callbacks in more situations than just
I am not the guy complaining about the current API. I consider my upcoming PR to be just an improvement. To me, the current API is fine and ready for 4.11.
I think you are trying to rely on the absence of polling points at some places in the code, and I think this really is something you have very little control over. For example: are you aware that, in bytecode, any function call is a polling point? Basically, this means that you cannot write a "release function" which does not poll, simply because the call of that function is already a polling point! In any case, I maintain my opinion: these functions are broken (at the very least in bytecode) since they cannot be called without polling, and this can result in undesired asynchronous exceptions.
Wrong. If an asynchronous exception occurs when calling this function, it will not set the signal.
Of course, we would not do that if masking is enabled. But masking should not be used for long periods of time, should it? Otherwise, yes, some callbacks will be delayed for long. But what can we do about that?
The ordering problem already exists in the current implementation, and is already taken care of. Essentially, you maintain flags for each sampled block indicating which callbacks have been called.
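An illustrative rendering in OCaml of that bookkeeping (the runtime does this in C, and the real fields surely differ): each tracked block carries flags recording which callbacks have already run, so a deallocation callback can be delayed until the allocation (or promotion) callback has fired.

```ocaml
type 'a tracked_block =
  { mutable user_data : 'a          (* value returned by the last callback *)
  ; mutable alloc_cb_done : bool    (* allocation callback has run *)
  ; mutable promote_cb_done : bool  (* promotion callback has run (if promoted) *)
  ; mutable deallocated : bool      (* block is dead; dealloc callback pending *)
  }
```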
In memprof-limits, you use exceptions in callbacks to interrupt a thread which would allocate too much. Thus you rely on that guarantee. If callbacks are called far less often in the allocating thread, then you will have far fewer opportunities to interrupt it.
Any feedback on the proposal to add

```ocaml
val with_memprof : <memprof-parameters> -> (unit -> 'a) -> 'a
```

and encourage users to use this instead of
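For reference, a minimal sketch of what such a combinator could look like, assuming the 4.11-style entry points (and inheriting `Fun.protect`'s weakness against asynchronous exceptions noted above, so this conveys the intent rather than a fully robust implementation):

```ocaml
let with_memprof ~sampling_rate ?callstack_size tracker f =
  Gc.Memprof.start ~sampling_rate ?callstack_size tracker;
  (* Guarantee that Memprof is stopped on normal return or (synchronous)
     exception; asynchronous exceptions remain the hard case. *)
  Fun.protect ~finally:Gc.Memprof.stop f
```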
P.S.: I agree that the current API is fine for 4.11; my messages above are a thought experiment to see, if we shake the memprof-limits tree, whether the fruits that fall down suggest changes that are better done now rather than later. But it is always an option to push them for later indeed -- the API being marked as experimental and all that.
Regarding the

Moreover, note that it is not just "synchronizing": it is also waiting for callbacks to terminate in other threads, which may take time (callbacks could even call blocking functions). So the correct implementation of such a function is difficult (if not impossible). I now think we should seriously consider dropping callbacks in the places where we would like to flush.
From what I can see,

In any case, the idea of a "scoped combinator" is broken in the presence of threads. So this is not the solution to all our problems.
I have finished what is urgent for 4.11, so I am coming back to this one.
In summary, thank you for patiently addressing all my comments. More focused discussion will continue on the spin-off PRs you have opened.
The current API for memprof uses ephemerons to register tracked blocks. While this is an elegant use of ephemerons, @stedolan suggested that it may not be the best thing to do, for two reasons:
This PR presents a new API for memprof: instead of using ephemerons, the client provides 5 callbacks for the various events that can occur during a sampled object's lifetime: allocation (minor or major), promotion, and deallocation (minor or major).
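As a sketch, the callback interface described above could look like the following (the names here follow the shape eventually released in 4.11 and may differ from this PR's exact code):

```ocaml
type allocation  (* metadata about a sampled block: size, callstack, ... *)

type ('minor, 'major) tracker =
  { alloc_minor : allocation -> 'minor option  (* minor allocation sampled *)
  ; alloc_major : allocation -> 'major option  (* major allocation sampled *)
  ; promote : 'minor -> 'major option          (* tracked block promoted *)
  ; dealloc_minor : 'minor -> unit             (* tracked minor block freed *)
  ; dealloc_major : 'major -> unit             (* tracked major block freed *)
  }
```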
The main difficulty when implementing this new API is that the runtime now has to maintain a data structure for tracked memory blocks. The code related to this data structure constitutes most of the changes in this PR.
I can see two possible choices for storing this data structure: either we use `malloc`-based linked lists, or we use the same linked lists but allocated in the major OCaml heap. The two commits of this PR are the implementations of these two choices, and we need to make a choice between them. Using the OCaml heap requires slightly simpler code, because we do not need to manage all the pointers to the OCaml heap which occur in this data structure (the GC does it for us automatically, since these are just OCaml memory blocks), and because we need not deallocate manually with `free`. On the other hand, preliminary benchmarks seem to indicate that the `malloc` version is faster. Indeed, `malloc`/`free` is apparently not a bottleneck even if called once for each sampled block, but the extra work performed by the GC when the lists are allocated in the major heap happens to have a significant impact on the running time. In practice, when using the micro-benchmark of #8731, we see a slowdown of about 10%-15% when using the major heap.

So, what should we do? I would be in favor of the `malloc`/`free` version, since it is faster and a bit more readable.