
Memprof new api #8920

Merged
merged 1 commit into from Dec 21, 2019

Conversation

jhjourdan
Contributor

The current API for memprof uses ephemerons to register tracked blocks. While this is an elegant use of ephemerons, @stedolan suggested that it may not be the best thing to do, for two reasons:

  • Ephemerons are likely to be dropped from multicore OCaml, since there is no easy way to make them work in the new parallel GC.
  • Ephemerons are great for noticing the deallocation of an object, but they give no information about the promotion of an object from the minor to the major heap. This is unfortunate, since the client may want to exclude, for performance reasons, short-lived non-promoted objects from its statistics.

This PR presents a new API for memprof: instead of using ephemerons, the client provides 5 callbacks for the various events that can occur during a sampled object's lifetime: allocation (minor or major), promotion and deallocation (minor or major).
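For concreteness, here is a hedged sketch of the general shape such an interface could take (type and field names are illustrative and need not match the exact signature added by this PR):

type allocation
(* information about a sampled block: size, callstack, etc. *)

type ('minor, 'major) tracker = {
  alloc_minor : allocation -> 'minor option;  (* sampled minor allocation *)
  alloc_major : allocation -> 'major option;  (* sampled major allocation *)
  promote : 'minor -> 'major option;          (* tracked block was promoted *)
  dealloc_minor : 'minor -> unit;             (* tracked block died in the minor heap *)
  dealloc_major : 'major -> unit;             (* tracked block died in the major heap *)
}

val start : sampling_rate:float -> ?callstack_size:int -> ('minor, 'major) tracker -> unit
val stop : unit -> unit

The 'minor and 'major type parameters stand for the user data returned by the allocation callbacks and threaded to the promotion and deallocation callbacks; returning None cancels the tracking of that block.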

The main difficulty when implementing this new API is that the runtime now has to maintain a data structure for tracked memory blocks. The code related to this data structure is what constitutes most of the changes of this PR.

I can see two possible choices for storing this data structure: either we use malloc-based linked lists, or we use the same linked lists, but allocated in the major OCaml heap. The two commits of this PR are the implementations of these two choices, and we need to make a choice between them. Using the OCaml heap allows slightly simpler code, because we do not need to manage all the pointers to the OCaml heap which occur in this data structure (the GC does it for us automatically, since these are just OCaml memory blocks), and because we need not deallocate manually with free. On the other hand, preliminary benchmarks seem to indicate that the malloc version is faster. Indeed, malloc/free is apparently not a bottleneck even if called once for each sampled block, but the extra work performed by the GC if the lists are allocated in the major heap turns out to have a significant impact on the running time. In practice, when using the micro-benchmark of #8731, we see a slowdown of about 10%-15% when using the major heap.

So, what should we do? I would be in favor of the malloc/free version, since it is faster and a bit more readable.

@lpw25
Contributor

lpw25 commented Sep 5, 2019

I like the new API. One tiny suggestion: it might be a good idea to make alloc_info into an abstract type. I don't think users gain anything by knowing it's a record, and this should make it a little easier to change the type if necessary. I'd also probably just call it allocation, but that's totally subjective.

@jhjourdan
Contributor Author

I like the new API. One tiny suggestion: it might be a good idea to make alloc_info into an abstract type. I don't think users gain anything by knowing it's a record, and this should make it a little easier to change the type if necessary. I'd also probably just call it allocation, but that's totally subjective.

What would you think about using a private type, so that we avoid a heavy list of accessors for this type in the API? As far as I know, a private record can be extended without compatibility issues.
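To illustrate the idea (a hedged sketch with illustrative field names, not the exact ones from this PR): a private record lets clients read the fields directly while construction stays inside the module, so fields can be added later without breaking client code.

module Memprof : sig
  type allocation = private {
    size : int;                           (* size of the sampled block, in words *)
    callstack : Printexc.raw_backtrace;   (* where the allocation happened *)
  }
end = struct
  type allocation = { size : int; callstack : Printexc.raw_backtrace }
end

(* Clients can read [a.size] or pattern-match on the fields, but cannot
   build or mutate an [allocation] themselves. *)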

Member

@gasche gasche left a comment


The callbacks let you return None -- in fact it looks like the previous API also did, I had just never wondered about it. What are the semantics if you return None: the value is not instrumented, but does this affect the sampling in any way? (How do we reason about uniformity in this case?)

@chambart
Contributor

chambart commented Sep 5, 2019

This version is quite nice. However, the previous version provided direct access to the value, which allowed tracking it specifically; this one lacks that. For instance, if you wanted to traverse the memory graph to know why a specific value was alive.

One other nice addition would be to allow the user to add a value to the tracked list. But this can be done later, outside of this PR.

@lpw25
Contributor

lpw25 commented Sep 5, 2019

What would you think about using a private type, so that we avoid a heavy list of accessors for this type in the API?

A private record is good too (although I don't think the list of accessors is actually longer than the record definition).

@gasche
Member

gasche commented Sep 6, 2019

@chambart

However, the previous version provided direct access to the value, which allowed tracking it specifically; this one lacks that. For instance, if you wanted to traverse the memory graph to know why a specific value was alive.

This access to the value as an Obj.t was part of a previous iteration of the API, and it was removed to simplify the API. There were at least the following concerns with that part of the API:

  1. At the time where the callback is invoked, the Obj.t value is invalid: the memory space has been allocated (with a valid header) but the value itself is still missing -- in particular, it is not a valid memory representation at the value's type. It would be possible to store the Obj.t somewhere and use it later, after it has been initialized, but there is no way using the API to tell when that initialization has occurred, so you have to poke at the Obj.t value using the Obj API to try to guess whether it is a valid value, and which type it has. This seems deeply unsafe and fairly error-prone.

  2. If I understand correctly, the trick of calling the callback after the value is allocated only works for allocations in the major heap (so callback_alloc_major and callback_promote), not for allocations in the minor heap; @jhjourdan thus had different allocation_info types for minor and major allocations, which made the API irritatingly less uniform.

  3. At the time we discussed this (last Tuesday) we knew of no actual client making use of this Obj.t value. (One use-case we discussed is someone trying to analyze a program where most allocations happen at a very specific type whose memory representation is easily inspectable and contains useful information (typically, a meaningful unique identifier), but we didn't have any concrete example in mind.)

Given all this it seemed reasonable to remove the Obj.t access from this iteration of the API.


I would be curious to understand your proposed use-case. Your idea is that at a given time in the program execution we can trace the whole live heap, record a path from memory roots to each value, and store that path for those values that are currently being sampled?

@jhjourdan
Contributor Author

The callbacks let you return None -- in fact it looks like the previous API also did, I had just never wondered about it. What are the semantics if you return None: the value is not instrumented, but does this affect the sampling in any way? (How do we reason about uniformity in this case?)

As documented:

If such a callback returns [None], then the tracking of this particular block is cancelled.

As for uniformity: if you use this mechanism to filter out some blocks, then you sample according to the probability distribution of blocks that satisfy the given filter. You can imagine, for example, considering only blocks that are allocated during the execution of some parts of the program.
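For instance (a hedged sketch, assuming the tracker record sketched in the PR description above is in scope; the exact API may differ), one could wrap a tracker so that blocks sampled outside a phase of interest are simply not tracked:

let phase_of_interest = ref false

(* Blocks sampled while [phase_of_interest] is false are not tracked,
   because the allocation callbacks return None for them. *)
let only_during_phase (t : ('minor, 'major) tracker) =
  { t with
    alloc_minor = (fun a -> if !phase_of_interest then t.alloc_minor a else None);
    alloc_major = (fun a -> if !phase_of_interest then t.alloc_major a else None) }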

@lpw25
Contributor

lpw25 commented Sep 6, 2019

You can imagine, for example, considering only blocks that are allocated during the execution of some parts of the program.

Another use case would be sampling allocations without caring about how long they were live. Spacetime's viewer has a graph of just allocations over time and it is often quite useful.
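A hedged sketch of what that could look like with the proposed API (again assuming the tracker record from the PR description; names are illustrative): count sampled allocations but never track them, by always returning None from the allocation callbacks.

let samples = ref 0

let allocation_counter : (unit, unit) tracker = {
  alloc_minor = (fun _ -> incr samples; None);   (* count, but do not track *)
  alloc_major = (fun _ -> incr samples; None);
  promote = (fun () -> None);                    (* never reached: nothing is tracked *)
  dealloc_minor = (fun () -> ());
  dealloc_major = (fun () -> ());
}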

@jhjourdan
Contributor Author

Apparently, one test fails with MinGW32, with a weird error code. This looks like a segfault, but I cannot reproduce on linux (even with Valgrind with the debug runtime).

@chambart
Contributor

chambart commented Sep 6, 2019

@gasche I don't need to have access to the value itself, but to be able to follow a specific value. For instance a fresh int would be sufficient, and would in fact be easier to follow than an Obj.t. My use case would be to produce a dump of the heap and be able to point out, in the memory graph, which values are followed by memprof.

@chambart
Contributor

chambart commented Sep 6, 2019

Otherwise, a specific type which usually matters a lot for memory usage is bigarrays, which can be worth following specifically and introspecting. Given that these come from C-side allocations, they could have a different API that triggers when returning from C to OCaml, where we know that the value should be completely initialized. But this seems outside of the scope of this PR.

@gasche
Member

gasche commented Sep 6, 2019

Maybe we could have a generic approach to track custom values, not just bigarrays; in particular, we could use their "virtual size" to decide how to sample them. This is indeed outside the scope of the present PR.

@jhjourdan
Contributor Author

@stedolan : gentle ping. Are you aware that this PR exists? Are you planning to do a review?

@stedolan
Contributor

stedolan commented Oct 7, 2019

@jhjourdan Yes and yes! Apologies for the delay, I've been pretty busy (I was visiting the JS NYC office over the last two weeks, just back now).

@stedolan
Contributor

stedolan commented Oct 9, 2019

The new API looks great. (I'm particularly looking forward to sampling blocks that get promoted).

I haven't finished reading the implementations yet, but abstractly I prefer the malloc-based version, both for performance and because it has less impact on the GC behaviour of the program being traced.

I am surprised by the amount of code involved here. I'd hoped that it would be possible to reuse the finalisation mechanism (the existing code in finalise.c) to trigger these callbacks, rather than having to reimplement finalisation in memprof.c. Did you try this? Is there some deep problem I'm missing? (If not, I'll have a go prototyping this approach on Friday)

Ephemerons are likely to be dropped from multicore OCaml, since there is no easy way to make them work in the new parallel GC.

To clarify: Ephemerons are supported in multicore, as @kayceesrk's already done the hard work of making them work. However, I would like to revisit the API (particularly the mutable bits like set_key), since the current one causes quite a lot of headache in multicore. So, I'd much prefer if the Memprof API didn't depend on mutating the key of an ephemeron.

@kayceesrk
Contributor

Here is the relevant issue in multicore repo that discusses the ephemerons implementation ocaml-multicore/ocaml-multicore#88

But as @stedolan pointed out, I'd much prefer not using set_key, as a cleaner alternative API does not provide set_key.

@jhjourdan
Contributor Author

jhjourdan commented Oct 10, 2019

I am surprised by the amount of code involved here. I'd hoped that it would be possible to reuse the finalisation mechanism (the existing code in finalise.c) to trigger these callbacks, rather than having to reimplement finalisation in memprof.c. Did you try this? Is there some deep problem I'm missing? (If not, I'll have a go prototyping this approach on Friday)

I did not try this, but I don't think this will simplify the code significantly. Much of the complexity is in a data structure which makes it possible to:

  • remember postponed callbacks,
  • thread the data provided by the user through the various callbacks,
  • track the events happening to the tracked blocks,
  • register the right weak and strong roots to the GC.

All this complexity already arises with promotion only, and delegating deallocation to finalisers will only marginally simplify the code.

Moreover, note that the semantics of finalisers is slightly different: a finaliser is a closure attached to a block, while memprof attaches some user data that has to be given to a globally defined closure. Resetting this globally defined closure cancels the tracking of the corresponding blocks, which is not something currently possible with finalisers. That's not an intractable issue with your approach, but at least it is an annoyance.

@jhjourdan
Contributor Author

Ah, now that I think about it, I understand what you mean by reusing the mechanism of finalizers. You will have to extend finalizers to support "promotion finalizers", and do some tricks (using references?) to allow threading user data through the various callbacks.

This can certainly work; that's effectively an intermediate solution between the malloc-based and the heap-based solutions I proposed, since some of the data structures will be in the OCaml heap (the references used to thread the user data) and some will be in the malloc heap (the finalizer information). This will probably have more overhead (since data structures would be allocated both on the OCaml and the malloc heaps), but that might be acceptable.

Feel free to give it a try. I don't foresee any more difficulties than the ones I presented above.

@stedolan
Contributor

Thanks for the explanation! I'll have a go at the finaliser-based one today and see how it looks.

@stedolan
Contributor

I wrote the finaliser-style one. Because of the differences between finalisers and memprof that you mentioned (particularly promotion callbacks), I ended up reimplementing some of the data structures from finalise.c and not sharing code after all.

Despite that, it still ends up being less code than the malloc-based linked lists. Instead of multiple linked lists, there's a single array of active samples. The code's available here, I'm going to run some benchmarks now and post the results here.

@stedolan
Contributor

stedolan commented Oct 15, 2019

Here's some benchmark results, for both short-lived and long-lived objects, and for tracked (i.e. callback returns Some ()) and untracked (callback returns None) samples:

          short, untracked   short, tracked   long, untracked   long, tracked
09c16c7   0.88               0.93             2.81              3.03
6e03559   0.85               0.85             2.81              2.94
trunk     0.84               1.30             2.72              5.69

Benchmark code here. All numbers are median-of-5 executions in seconds.

Edit: I've just added numbers for trunk, using the previous ephemeron-based API. The 'untracked' versions do not create an ephemeron, while the 'tracked' versions create an ephemeron and store the last 16k ephemerons in an array. (If the ephemeron is just discarded, then it's much faster as the ephemeron itself gets collected. But then no actual lifetime tracking gets done).

With the trunk numbers, either implementation of this API seems to be much, much faster than using ephemerons to track lifetimes.

@jhjourdan
Contributor Author

I wrote the finaliser-style one. Because of the differences between finalisers and memprof that you mentioned (particularly promotion callbacks), I ended up reimplementing some of the data structures from finalise.c and not sharing code after all.

Despite that, it still ends up being less code than the malloc-based linked lists. Instead of multiple linked lists, there's a single array of active samples. The code's available here, I'm going to run some benchmarks now and post the results here.

Thanks for this attempt. Indeed, it is significantly shorter with this single array of tracked blocks. However, as is, I have several concerns:

  • In a multithreaded setting, a context switch can happen during a callback. As a result, the function caml_memprof_handle_postponed needs to be reentrant, and, as far as I understand, this is not the case currently. At least two assertions are broken: CAMLassert(tracked.len == prev_len); on line 396 and i == tracked.len - 1 on line 427.
  • In caml_memprof_oldify_young_roots, you start the traversal at tracked.young, but don't update tracked.young in run_callbacks when updating user_data. This can end up in a callback returning a young value which is not taken into account at the next minor collection. Of course, the issue here is that it is (probably ?) intractable to update the young pointer when writing user_data, because this could end up in reexamining a large part of the array at the next minor collection. That was the purpose of my young_user_data list.

@jhjourdan
Contributor Author

Here's some benchmark results, for both short-lived and long-lived objects, and for tracked (i.e. callback returns Some ()) and untracked (callback returns None) samples:

Thanks for the benchmarks. My conclusion is that (not so surprisingly) ephemerons add a significant overhead for tracked blocks. But the two different implementations of the new API are close to a tie: yours is a bit faster (<= 5%), but I suspect it requires a bit of debugging (see my previous message).

@stedolan
Contributor

In a multithreaded setting, a context switch can happen during a callback. As a result, the function caml_memprof_handle_postponed needs to be reentrant, and, as far as I understand, this is not the case currently. At least two assertions are broken: CAMLassert(tracked.len == prev_len); on line 396 and i == tracked.len - 1 on line 427.

Hmmm. Thinking about this, I think the simplest approach is to make memprof callbacks single-threaded, using a lock. As well as not having to worry about reentrancy in the runtime, this also makes life simpler for the user of the memprof API, who need not worry about making the callbacks themselves threadsafe.

In caml_memprof_oldify_young_roots, you start the traversal at tracked.young, but don't update tracked.young in run_callbacks when updating user_data. This can end up in a callback returning a young value which is not taken into account at the next minor collection. Of course, the issue here is that it is (probably ?) intractable to update the young pointer when writing user_data, because this could end up in reexamining a large part of the array at the next minor collection. That was the purpose of my young_user_data list.

Thanks for spotting this bug, but I think the fix is very easy! Updating the young pointer when writing user_data is fine: at the next minor GC, the part of the array examined will be the shortest suffix of the array containing recently sampled blocks and recently executed callbacks.

Thanks for the benchmarks. My conclusion is that (not so surprisingly) ephemerons add a significant overhead for tracked blocks. But the two different implementations of the new API are close to a tie: yours is a bit faster (<= 5%), but I suspect it requires a bit of debugging (see my previous message).

Agreed. I'll do some debugging and see how it goes. A performance improvement is nice, but my main motivation here is shorter code.

The other feature I haven't implemented yet is handling of exceptions raised from memprof callbacks. I'm wondering what the right thing to do here is: is there any reason to want these? Shouldn't they just be a fatal error?

@gasche
Member

gasche commented Oct 21, 2019

I think the simplest approach is to make memprof callbacks single-threaded, using a lock

Aren't we risking a sequential bottleneck at moderate sampling rates (say 1/100)? If a few minor allocations are both much slower than the others and sequential, this sounds like a bad yet fairly common/natural scenario.

The most common memprof callback is to just return/store the backtrace. We can expect most callbacks to be performing aggregation, which is easy to do in a thread-local way and gather at statistics-production time. This sounds like a problem domain where thread-safety is natural.

(Of course it also makes sense to aim for implementation simplicity at first, and refine the implementation if we do observe a bottleneck in practice.)

@gadmm
Contributor

gadmm commented Jan 7, 2020

I disagree that this makes for a stronger spec and makes Memprof safer: indeed, if this is what they want, printing a message to stderr and ignoring the exception is something the user can already do. They can also abort the program if they want.

I hope this clarifies the circumstances in which these exceptions appear. I feel like dropping events is a big deal to others and not to me, and I do not understand why.

@gasche
Member

gasche commented Jan 7, 2020

I feel a bit guilty for going deep into this question of exception-handling. I think we collectively spent a lot of time on it and I would rather see Memprof as a whole go forward, support native allocations and be usable by everyone in OCaml 4.11, even if this handling of exceptions-within-callbacks turns out to be not-perfect -- we can always iterate to improve it later. None of the proposals so far are obviously better than the current behavior. (I'm not convinced that the new flush+stop API is such an improvement either.)

@gadmm
Contributor

gadmm commented Jan 7, 2020

Sure, let us not pressure ourselves to fix this for 4.11 and take our time. This API is still documented as experimental after all. I use this PR because this is where the discussion happened, and scattering the discussion all over the place would do harm.

However, I was pointing out a bug in the current API that I find important: you cannot program a reliable with_memprof wrapper without jumping through hoops. And there is a simple fix. If you simply do not find the issue as bad as I do, no problem, I will stop bothering.

However, if I may have been a bit annoying, it is because that was not my impression from the replies. Instead, my impression is that we are misunderstanding each other. For instance, in case it was not clear, what @jhjourdan described is not what I was asking for: I have the impression that it mixes what will need to be done when I introduce masking with what could be done to fix this new bug which I noticed afterwards (which is about having a convenient API as the default).

It is worth spending time to clear up this misunderstanding, because to me it is not just about Memprof. We already had some of this discussion for Fun.protect and we will keep having it for similar resource interfaces like with_file.

So, what are we supposed to do in the finally clause? We agreed that collecting all the exceptions in some way, or dropping them all except the first one, is not something we want to do. I claim it is a natural and legitimate request to want to simply drop further events and stop Memprof. The current API does not just try to run all callbacks, it forces that onto the programmer. But this is not what has been argued.

I will stop for now.

@lpw25
Contributor

lpw25 commented Jan 7, 2020

I'm not sure if this has already been suggested, but how about we always treat an exception from a memprof handler by dropping any remaining callbacks and stopping memprof?

IIUC that would make "processing all callbacks as part of Memprof.stop" and "stop executing callbacks when one raises an exception and ensure that Memprof is disabled" actually be the same behaviours.

I also think it is sensible behaviour -- if you think it is a bug for a callback to raise an uncaught exception then it is reasonable to stop executing callbacks known to be buggy.

@lpw25
Contributor

lpw25 commented Jan 7, 2020

(Note that my above suggestion says nothing about how you report the failure of the callback -- e.g. via stderr or by re-raising the exception. I think that issue is orthogonal).

@gadmm
Contributor

gadmm commented Jan 7, 2020

This makes sense to me. In that case I think that you want to re-raise the exception to tell whoever started Memprof that it has stopped.

This is a bit more radical than my suggestion. In one use-case that I had in mind (allocation limits), you might want to keep raising new exceptions if further limits are reached. This is what Haskell's allocation limits do. This might be less important in theory than in practice.

@gasche
Member

gasche commented Jan 8, 2020

( @gadmm : I certainly wasn't complaining about your comments in the thread, and I completely agree that letting users define a correct with_memprof function should be a goal of the API. I don't think there is anything wrong with the discussion, except that over time it started feeling too detailed, too long, and too time-consuming for a point that is probably less important than other forms of memprof work right now. )

@gadmm
Contributor

gadmm commented Jan 8, 2020

BTW, if there is an agreement on a solution, I don't mind coding it. But indeed it is not clear whether we are close to reaching agreement.

@jhjourdan
Contributor Author

jhjourdan commented Jan 8, 2020

Note that I think it is possible to reliably stop memprof using something like:

let rec do_stop_memprof () =
  try Gc.Memprof.stop ()
  with exn ->
    (* HANDLE EXCEPTION *);
    do_stop_memprof ()

The only possible way of leaving the loop is by a successful call to Gc.Memprof.stop, which guarantees that Memprof is indeed stopped. But I agree this is (somewhat) cumbersome.

As for a with_memprof function: there is a choice to be made about exceptions arising as part of running the remaining callbacks... If the choice is to stop sampling as soon as an exception coming from a callback escapes with_memprof, then the client code can get the desired behavior by using a callback wrapper that checks a Boolean indicating whether sampling has been stopped. This is not particularly elegant, but it works.
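A hedged sketch of that wrapper idea (assuming Gc.Memprof.start/stop and a tracker record with the five callbacks discussed in this PR; the exact signatures may differ):

let with_memprof ~sampling_rate tracker f =
  let open Gc.Memprof in
  let stopped = ref false in
  (* Once [stopped] is set, any callbacks still queued become no-ops, so the
     client observes sampling as stopped even if [stop] raised. *)
  let keep cb x = if !stopped then None else cb x in
  let drop cb x = if !stopped then () else cb x in
  let guarded = {
    alloc_minor = keep tracker.alloc_minor;
    alloc_major = keep tracker.alloc_major;
    promote = keep tracker.promote;
    dealloc_minor = drop tracker.dealloc_minor;
    dealloc_major = drop tracker.dealloc_major;
  } in
  start ~sampling_rate guarded;
  let finish () = stopped := true; (try stop () with _ -> ()) in
  match f () with
  | v -> finish (); v
  | exception e -> finish (); raise e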

Now, if you think the current API is not the right one, then I actually think @lpw25 proposal is sensible: we could use that behavior before settling down to something that everybody agrees on, and document that the behavior wrt. exceptions is subject to change in the future.

stedolan pushed a commit to janestreet/ocaml that referenced this pull request Mar 6, 2020
Memprof new api

(cherry picked from commit d0fe00c)
stedolan pushed a commit to janestreet/ocaml that referenced this pull request Mar 17, 2020
@gadmm
Contributor

gadmm commented Apr 6, 2020

I am happy to pre-announce memprof-limits, an implementation of global memory limits, and allocation limits à la Haskell, compatible with systhreads, based on memprof. The main reason I did this now is because I found it fairer if you all had the opportunity to criticise concrete code rather than (what looked like) hypotheticals on my part.

It consisted of two exercises, which allowed me to provide some further feedback on the API:

  1. implementing limits themselves,
  2. implementing a server that combines limits with an arbitrary profiler provided by the user (I settled on a very simple implementation but I went through several more complicated ideas that let me think about the API).

I'll try to make it self-contained.

  • I did not find a use for the current behaviour of Memprof.stop, which does not guarantee that it stops, and I always had to use a loop like do_stop_memprof as described by Jacques-Henri above. Besides forcing one to provoke exceptions which then have to be ignored, it is fairly obvious that this will lead to bugs, since not all programmers will notice that they need a loop. (It also currently has the bad consequence that, in case of bugs such as somebody stopping memprof prematurely, it will cause a non-interruptible loop when stop raises a Failure, but this is easy to fix.)
  • As for Leo's proposition, while sound, it is too conservative for memprof-limits: the latter relies on asynchronous exceptions being raised and caught inside pairs of start/stop, which with his proposition can only happen once. I still believe that the best specification is the one I gave, which I recall here: Memprof is guaranteed to be stopped upon normal or exceptional return of [Memprof.stop ()]; in addition, if it returns normally then it is guaranteed that all callbacks ran. Indeed, assuming correct discipline for exceptions (which ought to be encouraged separately with Memprof: wrap uncaught exceptions before re-raising them #9267), either the asynchronous exception crosses the start/stop boundary (in which case it behaves like Leo's proposition: Memprof is stopped), or it does not. In that case, it has to be assumed that recovering from the asynchronous exception is part of the normal behaviour of the program, as in the case of memprof-limits. So Memprof has to keep running. (I mentioned the analogy with fclose. For the curious, there is in fact a strikingly similar discussion which led to clarifying the specification of POSIX close: https://www.austingroupbugs.net/view.php?id=529.)
  • More refined applications, for instance a resource limit based on an “allocator-pays” model as in the Yang & Mazières paper (but with statistical approximation of memory consumption), might need a higher sampling rate. To lower the runtime cost of such an approach it would be nice to be able to start/stop memprof on a per-thread basis.
  • Is there a guarantee that if an allocation callback is delayed, it is going to run in the same thread as the allocation happened?
  • I have found that duplication was needed to query the started state and the sampling_rate for no good reason.
  • Moreover, one could imagine wanting to change the sampling rate and the callstack size (if such functions are provided, they could for instance flush the delayed callbacks before performing the change). There is no elaborate way to do that currently without losing samples.

I do not mind coding these if we reach agreement (at least the ones that are easy to do), so that is not an obstacle.

For self-containedness, let me recall why we need memprof-limits. It can be useful regarding known issues with loss of session in Coq when memory grows unpredictably (or with similar tools, which already know how to handle asynchronous exceptions), or with runaway requests on a server. I also think it can make realistic a relaxed specification for Out_of_memory that rationalizes the currently awful behaviour of not raising on minor allocation, one usage being replaced by the other. In combination with other improvements I have in mind for Out_of_memory, it solves the issues (when the alternative is to implement reliable OOM during minor GC, which is hard).

@stedolan
Contributor

stedolan commented Apr 6, 2020

I am happy to pre-announce memprof-limits, an implementation of global memory limits, and allocation limits à la Haskell, compatible with systhreads, based on memprof.

While I do like this idea, I'm still unconvinced that basing it on Memprof is a good implementation. If I set an allocation limit of 100 KB, and run a function that allocates exactly 50 KB, then using Haskell allocation limits the limit will not be reached and the function will succeed. Using memprof-limits, the function might well fail, due to the random aperiodic nature of Memprof sampling.

Memprof is guaranteed to be stopped upon normal or exceptional return of [Memprof.stop ()]; in addition, if it returns normally then it is guaranteed that all callbacks ran

I agree that this should certainly hold.

Earlier, I said that I preferred a stronger statement with the guarantee that all callbacks run holding even in the failure case. (I mean "stronger" in the sense of a logically stronger proposition, not necessarily "better". Apologies if this caused confusion). I'm more on the fence about this now.

Is there a guarantee that if an allocation callback is delayed, it is going to run in the same thread as the allocation happened?

No. Making this guarantee can cause unbounded delays. Suppose an allocation is sampled on a thread in a C call (where the callback cannot run promptly), and the C call then enters a blocking section. It can be a very long time before the blocking section returns.

@gadmm
Contributor

gadmm commented Apr 6, 2020

While I do like this idea, I'm still unconvinced that basing it on Memprof is a good implementation. If I set an allocation limit of 100 KB, and run a function that allocates exactly 50 KB, then using Haskell allocation limits the limit will not be reached and the function will succeed. Using memprof-limits, the function might well fail, due to the random aperiodic nature of Memprof sampling.

You are right, but I think it is still a valuable proof of concept:

  • In the specifics: allocation limits do not need to be precise, as nobody is asking to stop at 100KB precisely; rather, one wants to be sure that a runaway process will stop eventually. To reason about your program, you want instead 1) the assurance that it does not stop below X with high probability, 2) high probability that it is stopped around N*X, 3) certainty that it eventually stops if it keeps allocating. The sampling rate depends on X and N, so you also want advice for values of X and N for which performance is not too affected. I believe Memprof can deliver that. From that point of view it is true that the prototype only delivers 3) and is lacking: for instance, 100KB is about 10 samples currently, which is too low, and the sampling rate was picked with much higher values in mind (without any probability calculation).
  • In principle: using Memprof is the only way currently to implement such limits, barring changes to the runtime. The question was not “how can we implement allocation limits the best way?” but “what can we do with memprof?”, having in mind that a language feature that finds uses it was not meant for is the sign of a good design. This takes the point of view of the OCaml user who wants to implement something good enough for them with what they have (in particular there is no plan to propose memprof-limits for the standard distribution, for instance). But, if we could afford changing the runtime to have a better implementation, what would we do? Maybe better tracking of per-thread allocations would be easy and moderately invasive. Maybe also adding yet another kind of asynchronous callback if we really want to be precise with the limit, which would be fairly invasive, but there again we can reuse Memprof callbacks to perform a recurring check if we allow some inaccuracy. Is there room for either in the complexity budget? Given that the need for precise allocation limits is unclear, it is much simpler to have a library-side solution that makes do with what Memprof can offer.

Is there a guarantee that if an allocation callback is delayed, it is going to run in the same thread as the allocation happened?

No. Making this guarantee can cause unbounded delays. Suppose an allocation is sampled on a thread in a C call (where the callback cannot run promptly), and the C call then enters a blocking section. It can be a very long time before the blocking section returns.

I see. This assumption is made in statmemprof-emacs, and while I do not use it currently, I agree that it is a useful assumption. However an alternative would be to record the allocating thread id in the allocation record. In fact, it can be hard to know what to do if we are in an allocation callback for an allocation that happened in a different thread, if we do not have this information.

mshinwell pushed a commit to mshinwell/ocaml that referenced this pull request Apr 7, 2020
@gasche
Member

gasche commented Apr 13, 2020

I am very thankful to @gadmm for having done this experiment. I would like us to converge on what we believe is a solid API before the 4.11 release (which will be the first exposing the Memprof API to the world). My feeling is that only little tweaks are needed, if any, to get something nice, but that we needed to get experience from actual use-cases. Whether or not memprof is the right long-term approach for memory limits, it sounds like an excellent use-case to clean up corner cases of the API.

Below I try to discuss each point of @gadmm's feedback that could be interpreted as a modest/reasonable API change/improvement suggestion, and propose an API to implement it. If people were kind enough to give an opinion on the proposed APIs, we could try to implement them quickly.

It would be nice to prioritize changes to existing APIs (in opposition to just adding more stuff), to avoid changing existing programs after the first API is released. (Of course, we marked things as experimental so that we are allowed to change them.)

Memprof.stop

Besides forcing one to provoke exceptions which then have to be ignored, it is fairly obvious that this will lead to bugs, since not all programmers will notice that they need a loop.

Yes, this is the strongest argument to me. Guillaume's code has the following:

  let rec really_do_stop_memprof () =
    (* FIXME: loops if already stopped *)
    try Gc.Memprof.stop () with _ -> really_do_stop_memprof ()

This is not very nice, and in any case most users are going to forget doing this, and they will write buggy code instead of reading the documentation carefully and using the complex correct way to stop memprof.

I vote for implementing @gadmm's proposed specification:

Memprof is guaranteed to be stopped upon normal or exceptional return of [Memprof.stop ()]; in addition, if it returns normally then it is guaranteed that all callbacks ran.

We can keep a Memprof.try_stop : unit -> unit function that implements the current, finer-grained behavior, but is out of the "path of least effort".

Threads

I agree with @gadmm (and @jhjourdan) that same-thread guarantees would be nice, and we might be able to give them without undue latency by handling pending callbacks before blocking calls (before releasing the runtime system). But this can be left for later, as not providing any guarantee is obviously forward-compatible.

Querying the running state and parameters

@gadmm asks about accessing the current running state and sampling rate, and then possibly about changing them.

I have found that duplication was needed to query the started state and the sampling_rate for no good reason.

For the curious about why there is duplication in Guillaume's code: Guillaume wants "stackable" callbacks, so that users can use Memprof for memory profiling on top of his memprof-limits monitor. So he exposes an interface that resembles Memprof.start, but calls his callbacks before the user callbacks. In this scenario his callbacks need to know what the current sampling rate is (to do some statistics computation from the number of samples, I think), and so he enriched the Memprof.start interface to take allocation closures parametrized over the current sampling rate.

One could think instead of providing a way to query the running state of Memprof.

I see two options. The first is to provide a global query function

val state : unit -> [ `Running of running_state | `Stopped ]

The second, lighter, only passes the running_state to the callbacks.

val running : unit -> bool
type ('minor, 'major) tracker = {
  alloc_minor: state -> allocation -> 'minor option;
  ...
}

The first one feels noticeably simpler to me and I don't see a downside. (I suppose there are moments where there is no well-defined running state, but at this point users cannot run OCaml code anyway.)

Finally, I wonder if this running_state should expose only sampling_rate and callback_size, or it should also provide an existential packing of the current callbacks. In the second case, this gives the expressivity to stop and restart Memprof with the same callbacks.

If there is consensus I would propose to start by exposing just the sampling_rate and callback_size in a record. So the API would be something like:

type runtime_parameters = {
  sampling_rate: float;
  callback_size: int;
}
type run_state = Running of runtime_parameters | Stopped

val run_state : unit -> run_state

We can always add access to existentially-packed callbacks later if the need arises.

(Should start take a runtime_parameters record instead of labelled arguments?)

Mutating the running parameters?

  • Moreover, one could imagine wanting to change the sampling rate and the callstack size (if such functions are provided, they could for instance flush the delayed callbacks before performing the change). There is no elaborate way to do that currently without losing samples.

Ideally, if we provided a way to change the sampling rate, it should be expressible from user code by finding the right way to stop Memprof, and then starting it again with the new sampling rate. I think @gadmm is saying that the current Memprof.stop approach does not let us do this (in addition, we cannot access the callbacks, so we cannot even try unless we statically know which callbacks were passed).

I would propose to leave this part for later, but I would be reassured by whatever is proposed for Memprof.stop () if we knew that stop (); start callbacks is equivalent to a callback-polling no-op when memprof is on, and the same callbacks that were running are passed.

@jhjourdan
Contributor Author

Memprof.stop

Apparently you all more or less agree with the fact that Memprof.stop should "guarantee" that the sampling is stopped even if it returns with an exception, so I won't fight too much. Let me just re-emphasize my argument once more:

Formally, there is no way to observe that this guarantee indeed holds, because asynchronous exceptions are, well, asynchronous. That is, the exception can occur immediately before Memprof.stop is called, and there is no way to determine whether the exception occurred just before or just after the call.

Said otherwise: in a sense, this guarantee already holds, since when the function raises an exception, we can always pretend that it has never been called and that the exception occurred just before the call. The user will never be able to prove the contrary.

Along the same lines, @gadmm's function really_do_stop_memprof and my do_stop_memprof are both either bogus or formally equivalent to Memprof.stop. They cannot be relied on, because if an exception occurs just before they are called, they don't provide any guarantee.

Note that the problem is actually quite fundamental, since it also exists with signals. And I think the signals API currently has the same "defect" as the current API of memprof.

Threads

As already mentioned, I have an almost-ready PR to submit which provides this guarantee. There is a question we will have to answer, though: if we provide this guarantee and a Memprof.stop is called, then we need either to wait for the other threads to empty their queue (but this is cumbersome to implement since we need to somehow interrupt system calls...) or drop the remaining callbacks.

Perhaps the right thing to do, then, is simply to drop the remaining callbacks in Memprof.stop. This would also provide the guarantee you all would like for Memprof.stop, since it would no longer call callbacks. Of course, we will need to swallow a bitter pill: some callbacks will be dropped even though they should be called...

Querying the running state and parameters

I agree with @gasche's proposal.

Mutating the running parameters?

Mutating the parameters using start/stop is, IMO, pointless. This would not be atomic, hence incompatible with threads, and it will likely drop callbacks if we implement the semantics of Memprof.stop I propose above.

Having a function which allows mutating the runtime parameters is, I think, opening a can of worms because of callbacks remaining in the queue

  • if you allow changing the closures, then you need to take care that the blocks already sampled will be called with the old callbacks, otherwise you break type safety (you pass the old user value to the new callbacks)
  • when changing the parameters, the old callbacks in the queue will still have been sampled with the old parameters, but be called with the new parameters.... This is rather confusing.

@gasche
Member

gasche commented Apr 14, 2020

Taking a step back, what I want is to be able to run a piece of code with memprof monitoring on, and the guarantee that memprof monitoring is stopped after the piece of code is done running (through normal return or an exception, synchronous or not). So really I would like to have a function

monitor : <memprof-parameters> -> (unit -> 'a) -> 'a

If there is an interface for stop that lets us implement this correctly as a library, then I'm happy with this interface. Otherwise maybe we should consider providing this function directly in the memprof module -- assuming that we can implement it correctly with runtime support.

@gadmm
Contributor

gadmm commented Apr 14, 2020

@stedolan asked an important question, namely whether it is really possible to program at all for things like allocation limits, given the statistical nature of memprof. This is important in the discussion, given that I proposed such an application to decide in favour of allowing asynchronous exceptions to be raised.

I have done a statistical analysis of memprof-limits here: https://gitlab.com/gadmm/memprof-limits#documentation. First, I have found that the statistical nature of memprof makes it very easy to reason about the application and not worry about implementation details (e.g. other threads interfering with the sampling). I have also found that its deterministic nature was (essential and) useful for reproducing runs. I find the idea and the design of Memprof very very nice.

Long story short, memprof-limits starts being accurate enough around a safe allocation value of 100 KB (meaning a limit of 1 to 3 MiB depending on the chosen precision), with the ratio between the maximal safe allocation and the limit dropping very quickly for higher values. Correctly, the analysis shows that limits under 500 KiB, such as the example Stephen objected with, are unreliable. It was entirely true that memprof-limits was incomplete before we had the statistical analysis, and Stephen was right to object.

As to whether users can program with memprof-limits, that is, write programs and reason about them, I also make two hypotheses:

  1. Allocation limits as used in Haskell are used by determining peak reasonable allocation usage empirically and picking a limit at a comfortable margin over it, rather than computing a precise memory bound and using it as a limit. Indeed, 1) in situations where this would be possible, there would probably be more suitable tools than allocation limits, 2) to the best of my understanding, being able to predict memory usage has never been a design goal of Haskell, so I find it unlikely that this has ever happened.
  2. The user is fine with a very unlikely possibility of a false positive; indeed the program is already designed to let true positives fail without bringing down mission-critical parts of the program (that's the whole point of the limits). For instance they can prefer to see a legitimate client having a connection closed once every 10^N years for N of their choosing, if that is the price to pay for avoiding being subject to DOS on maliciously-crafted requests.

Under these hypotheses, the statistical limit is just as reliable as precise limits à la Haskell.

Also, the precision is much better than I expected, so this opens the possibility of reliable-enough local memory limits based on the allocator-pays model (that is, like allocation limits but also counting deallocations) with a low-enough sampling rate (these are more computationally intensive, because they need to track blocks).
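To give an idea of the kind of calculation behind such an analysis, here is a hedged sketch (illustrative names and a simplified model, not code from memprof-limits): Memprof samples each allocated word with probability equal to the sampling rate, so the number of samples seen after a given number of allocated words is approximately Poisson-distributed, and the probability that a task allocating only that much nevertheless reaches a limit of k samples (a false positive) is the corresponding Poisson tail.

let prob_limit_reached ~sampling_rate ~words_allocated ~limit_in_samples =
  (* Number of samples after [words_allocated] words is ~ Poisson(mean). *)
  let mean = sampling_rate *. float_of_int words_allocated in
  (* P(S >= limit) = 1 - sum over i < limit of exp(-mean) * mean^i / i! *)
  let rec partial_sum i term acc =
    if i >= limit_in_samples then acc
    else partial_sum (i + 1) (term *. mean /. float_of_int (i + 1)) (acc +. term)
  in
  1. -. exp (-. mean) *. partial_sum 0 1. 0.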

Replies to @gasche and @jhjourdan about tweaks to the API coming later.

@gadmm
Contributor

gadmm commented Apr 14, 2020

I would like us to converge on what we believe is a solid API before the 4.11 release (which will be the first exposing the Memprof API to the world). My feeling is that only little tweaks are needed, if any, to get something nice, but that we needed to get experience from actual use-cases.

@jhjourdan, does this sound realistic to you in light of your upcoming PR, or does it need more time and might require further API changes? I am happy to help if some things need action before 4.11.

P.S.: apologies for the length

Memprof.stop

@gasche:

We can keep a Memprof.try_stop : unit -> unit function that implements the current, finer-grained behavior, but is out of the "path of least effort".

I think you just need some way to force flushing, and this will come automatically with the BSP patch which AFAIU introduces a primitive to poll without allocating. Then the user can define:

let try_stop () = poll () ; stop ()

(as a stopgap measure one could have a C primitive to flush without allocating for expert users, pending further feedback.)

@jhjourdan:

Formally, there is no way to observe that this guarantee indeed holds, because asynchronous exceptions are, well, asynchronous. That is, the exception can occur immediately before Memprof.stop is called, and there is no way to determine whether the exception occurred just before or just after the call.

Said otherwise: in a sense, this guarantee already holds, since when the function raises an exception, we can always pretend that it has never been called and that the exception occurred just before the call. The user will never be able to prove the contrary.

I agree that this reasoning makes sense and is sound, but to me all it shows is that you cannot reliably stop memprof in the normal path, and one should stop it instead in the release path, which ought to be safe for asynchronous exceptions. What we learn from systems programming languages is that the release path should have a special status.

Furthermore, it would be wrong to say that OCaml cannot yet offer a release path which is safe for asynchronous exceptions. Indeed, the following implementation of Haskell's bracket in OCaml:

let with_resource ~acquire arg f ~(release : _ -> unit) =
  let release_noexn x = try release x with _ -> () in
  let r = acquire arg in
  match f r with
  | y -> release_noexn r ; y
  | exception e -> release_noexn r ; raise e

is actually sound in native code under the condition that acquire and release do not poll (for instance if they do not allocate and do not release the runtime lock). As it happens, start and stop fulfil the conditions.
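For instance, a scoped Memprof combinator can then be written on top of it (a hedged sketch: the tracker value t and the start/stop signatures follow the shape discussed earlier in this PR, and may differ from the final API):

let memprof_scoped ~sampling_rate t f =
  with_resource
    ~acquire:(fun () -> Gc.Memprof.start ~sampling_rate t) ()
    f
    ~release:(fun () -> Gc.Memprof.stop ())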

It is true that the naive code likely to be written by users:

M.start() ;
Fun.protect ~finally:M.stop f ()

is incorrect in exactly one way: it makes one unprotected allocation of a callback, and if memprof samples it and the alloc_minor callback raises, then memprof will not be stopped. Another question which is implicit in your objection is whether users should be given start and stop instead of a safe with_memprof implementation, given that they are so easy to misuse. I see two aspects:

  • as some kind of convenience for the user who wants to quickly slap Memprof on their program. Then they probably want to use Fun.protect for stopping Memprof in case of an exception in their program. It is okay for them if this is "good enough". If M.stop does not stop, the data they have collected is invalid (the profile contains allocations that happened afterwards), so guaranteeing to stop is slightly more "good enough". However, given that they most likely use a scoped combinator, it would be just as good to provide them only with a safe with_memprof.

  • as a form which is more expressive than a scoped combinator. There is no good way currently of relaxing the LIFO nature of with_resource; however, people who choose to live without asynchronous exceptions (in particular who make sure to catch all exceptions inside memprof callbacks) are able to use start and stop safely, and would maybe find some use in having a non-scoped resource.

So one is stuck between wanting something convenient for one kind of users, and something general for another kind of users. What I propose does not change anything for the latter, but makes it slightly more convenient for the former.

But if one would like to replace start/stop with a correct with_memprof implementation, I think this is already possible (including for bytecode by being careful, if I remember correctly my experiments around with_resource).

Along the same lines, @gadmm's function really_do_stop_memprof and my do_stop_memprof are both either bogus or formally equivalent to Memprof.stop. They cannot be relied on, because if an exception occurs just before they are called, they don't provide any guarantee.

I disagree, in memprof-limits I only use it in a way which is safe for asynchronous exceptions. (I have marked several places in the code where I would need an implementation of with_resource with masking, but this is not the case for the one that calls really_do_stop_memprof for the reason explained above.)

Note that the problem is actually quite fundamental, since it also exists with signals. And I think the signals API currently has the same "defect" as the current API of memprof.

For the problems with asynchronous exceptions, I agree. But Sys.set_signal never fails to set the signal, even if a signal is pending. So, while a naive “scoped” signal using Fun.protect is incorrect for the same reason as above, it is entirely possible to write a correct one by hand without having to drop a signal or an asynchronous exception it would have raised.

Threads

I agree with @gadmm (and @jhjourdan) that same-thread guarantees would be nice

I am more on the fence now. Looking at it more closely, memprof-limits has a bug to fix due to this lack of guarantees, but all I need to fix it is to obtain the thread id along with the allocation record. I do not have yet the details of @jhjourdan's new design, but I wonder what happens when you send the value across threads before the callback manages to run.

and we might be able to give them without undue latency by handling pending callbacks before blocking calls (before releasing the runtime system).

That would be furiously incompatible with reasonable implementations of masking, which should allow releasing the runtime lock and also let systhread yield inside the mask (such as the one I am working on).

But this can be left for later, as not providing any guarantee is obviously forward-compatible.

Right, the alternative which adds a thread id to the allocation record cannot be left for later with your criterion, so it would be nice to see what @jhjourdan has in mind.

Another note about the thread id: the notion could be moved to stdlib to not be systhread-specific. Currently, any memprof application that wants to be compatible with systhreads (such as memprof-limits) has to force the dependency on systhreads. But I do not need systhreads, I just need to know the thread id, which could have a default implementation that always returns 1 when not using systhreads.

@jhjourdan:

As already mentioned, I have an almost-ready PR to submit which provides this guarantee. There is a question we will have to answer, though: if we provide this guarantee and a Memprof.stop is called, then we need either to wait for the other threads to empty their queue (but this is cumbersome to implement since we need to somehow interrupt system calls...) or drop the remaining callbacks.

So there is the question of what happens when the value is sent across threads before the callback ran: how is the ordering of callbacks (allocate, promote, deallocate) maintained? Especially in light of the unbounded delays explained by @stedolan.

Perhaps the right thing to do, then, is simply to drop the remaining callbacks in Memprof.stop. This would also provide the guarantee you all would like for Memprof.stop, since it would no longer call callbacks. Of course, we will need to swallow a bitter pill: some callbacks will be dropped even though they should be called...

By seeing it as a release vs. flush problem, you would indeed need a stop() function that drops callbacks, to be called in the release path, but you would also need a flush function which is more involved since it waits for other threads to empty their queue, to be called in the normal path at the discretion of the user.

(Then one can wonder what should happen if an exception arises from a callback in another thread while flushing... then, I think we can consider dropping the callbacks of that thread. But then it makes it impossible to combine memprof-limits with a profiling library provided by the user, for instance, because callbacks from one will drop callbacks from the other... This does not happen with a global queue. Edit: this is not an issue when flushing is used before stopping, but it is an issue for flushing before mutating a parameter as suggested below.)

Querying the running state and parameters

Finally, I wonder if this running_state should expose only sampling_rate and callstack_size, or whether it should also provide an existential packing of the current callbacks. In the second case, this gives the expressivity to stop and restart Memprof with the same callbacks.

I remember thinking about it, and not finding concrete uses. The main issue is that stopping and starting again should not be a noop, because it should forget about all tracked blocks (or it would sound like a bug to me).

If there is consensus I would propose to start by exposing just the sampling_rate and callstack_size in a record.
[…]
(Should start take a runtime_parameters record instead of labelled arguments?)

Indeed it would be odd to define a type runtime_parameters and then not use it as an argument to start! But, conversely, is it not too heavy to have to use the { default with … } pattern to supply default arguments? Or then maybe it is better to have both parameters and callbacks in a single record, but then it should existentially pack the current callbacks and would allow querying them. Ah!
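For reference, the record-plus-defaults option being weighed here would look roughly like this (type, fields and default values are hypothetical):

type parameters = { sampling_rate : float; callstack_size : int }
let default = { sampling_rate = 1e-4; callstack_size = 32 }

(* Overriding a single field at the call site: *)
let params = { default with sampling_rate = 1e-3 }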

What about simply:

val sampling_rate : unit -> float option
val callstack_size : unit -> int option

or even simpler

val started : unit -> bool
val sampling_rate : unit -> float
val callstack_size : unit -> int

with values false, 0. and 0 respectively when stopped?
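As an illustration, a client could use these hypothetical accessors to restart profiling with a modified rate (the start signature assumed here is the tracker-taking one of the 4.11 API; the callbacks still have to be re-supplied by the caller):

let double_rate tracker =
  if Gc.Memprof.started () then begin
    let rate = Gc.Memprof.sampling_rate () in
    let size = Gc.Memprof.callstack_size () in
    Gc.Memprof.stop ();
    Gc.Memprof.start ~sampling_rate:(2. *. rate) ~callstack_size:size tracker
  end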

Mutating the running parameters?

@gasche:

I would propose to leave this part for later, but I would be reassured by whatever is proposed for Memprof.stop () if we knew that stop (); start callbacks is equivalent to a callback-polling no-op when memprof is on, and the same callbacks that were running are passed.

This is the "elaborate way to do that currently without losing samples" I alluded to. Again, if stopping and starting again does not lose samples, then something is broken I think.

@jhjourdan:

Having a function which allows mutating the runtime parameters is, I think, opening a can of worms because of callbacks remaining in the queue

* when changing the parameters, the callbacks remaining in the queue will correspond to blocks sampled with the old parameters, but they will be called with the new parameters in effect... This is rather confusing.

I agree this should not happen, nor should callbacks be dropped in that case. The suggestion was to flush the callbacks before changing the value. With the design you propose above, flush would need to synchronise, but the principle is the same.

Again, I find per-thread memprof rather more complicated (Jacques-Henri knows that I liked the idea originally) when in practice all I really found I needed was the thread id. Do you have other reasons in its favour? And indeed it might require dropping callbacks in more situations than just stop, when an exception arises. One argument to explore is whether not adding synchronisation to the API now will mean that memprof cannot be ported to multicore later without changing the API.

@jhjourdan
Contributor Author

@jhjourdan, does this sound realistic to you in light of your upcoming PR, or does it need more time and might require further API changes? I am happy to help if some things need action before 4.11.

I am not the one complaining about the current API. I consider my upcoming PR to be just an improvement. To me, the current API is fine and ready for 4.11.

I disagree, in memprof-limits I only use it in a way which is safe for asynchronous exceptions. (I have marked several places in the code where I would need an implementation of with_resource with masking, but this is not the case for the one that calls really_do_stop_memprof for the reason explained above.)

I think you are trying to rely on the absence of polling points at certain places in the code, and this really is something you have very little control over. For example: are you aware that, in bytecode, any function call is a polling point? Basically, this means that you cannot write a "release function" which does not poll, simply because the call to that function is already a polling point! In any case, I maintain my opinion: these functions are broken (at the very least in bytecode) since they cannot be called without polling, and this can result in undesired asynchronous exceptions.

But Sys.set_signal never fails to set the signal, even if a signal is pending.

Wrong. If an asynchronous exception occurs when calling this function, it will not set the signal.

and we might be able to give them without undue latency by handling pending callbacks before blocking calls (before releasing the runtime system).
That would be thoroughly incompatible with reasonable implementations of masking, which should allow releasing the runtime lock and also let systhreads yield inside the mask (such as the one I am working on).

Of course, we would not do that if masking is enabled. But masking should not be used for long periods of time, should it? Otherwise, yes, some callbacks will be delayed for a long time. But what can we do about that?

So there remains the question of what happens when the value is sent across threads before the callback has run: how is the ordering of the callbacks (allocation, promotion, deallocation) maintained, especially in light of the unbounded delays explained by @stedolan?

The ordering problem already exists in the current implementation, and is already taken care of. Essentially, you maintain flags for each sampled block indicating which callbacks have been called.
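A rough picture of that bookkeeping, with illustrative field names rather than the runtime's actual ones:

type tracked_block = {
  mutable alloc_cb_done   : bool;  (* allocation callback has run *)
  mutable promoted        : bool;  (* block has moved to the major heap *)
  mutable promote_cb_done : bool;  (* promotion callback has run *)
  mutable deallocated     : bool;  (* block has been freed *)
}

(* The deallocation callback may only run once the callbacks that must
   precede it have completed: *)
let dealloc_cb_ready b =
  b.deallocated
  && b.alloc_cb_done
  && (not b.promoted || b.promote_cb_done)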

Again, I find per-thread memprof rather more complicated. [...] Do you have other reasons in its favour?

In memprof-limits, you use exceptions in callbacks to interrupt a thread that allocates too much. Thus you rely on that guarantee. If callbacks are called far less often in the allocating thread, then you will have far fewer opportunities to interrupt it.
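A very rough sketch of that mechanism (not memprof-limits' actual code), assuming the 4.11 allocation record: the allocation callback raises in the allocating thread once a counter of sampled words exceeds a limit.

exception Memory_limit_exceeded

let sampled_words = ref 0
let limit = 10_000_000

let alloc_callback (info : Gc.Memprof.allocation) =
  sampled_words := !sampled_words + info.Gc.Memprof.size;
  if !sampled_words > limit then raise Memory_limit_exceeded;
  Some ()  (* keep tracking the block *)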

@gasche
Member

gasche commented Apr 15, 2020

Any feedback on the proposal to add monitor or with_memprof

val with_memprof : <memprof-parameters> -> (unit -> 'a) -> 'a

and encourage users to use this instead of start/stop?

@gasche
Member

gasche commented Apr 15, 2020

P.S.: I agree that the current API is fine for 4.11; my messages above are a thought experiment to see whether, if we shake the memprof-limits tree, the fruits that fall suggest changes that are better done now rather than later. But it is always an option to push them for later indeed -- the API being marked as experimental and all that.

@jhjourdan
Contributor Author

Regarding the flush function: it would in any case already be required to synchronize, since we would need to wait for another thread to complete, e.g., an allocation callback in order to run the corresponding deallocation callback if the value has been deallocated in between.

Moreover, note that it is not just "synchronizing". It is also waiting for callbacks to terminate in other threads, which may take time (callbacks could even call blocking functions).

So the correct implementation of such a function is difficult (if not impossible). I now think we should seriously consider dropping callbacks in places we would like to flush.

@jhjourdan
Contributor Author

Any feedback on the proposal to add monitor or with_memprof

From what I can see, Fun.protect ~finally:M.stop f should do the job, and is honestly quite intuitive to use. If it turns out that an asynchronous exception is raised during stop, well then Fun.protect will complain and the program will likely end abnormally. If Memprof.stop drops callbacks, then this should happen only very, very rarely.
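For instance, a minimal sketch of that combinator, assuming the 4.11 start/stop API (the parameter list is illustrative):

let with_memprof ~sampling_rate ?callstack_size tracker f =
  (* Start sampling, run f, and stop even if f raises. *)
  Gc.Memprof.start ~sampling_rate ?callstack_size tracker;
  Fun.protect ~finally:Gc.Memprof.stop f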

In any case, the idea of a "scoped combinator" is broken in the presence of threads. So this is not the solution to all our problems.

@gadmm
Contributor

gadmm commented Apr 21, 2020

I finished what is urgent for 4.11 so I come back to this one.

  • The lack of a way to drop callbacks is being remedied in "Memprof: provide the guarantee that an allocation callback is always run in the same thread the allocation takes place" #9449, thanks! Unfortunately not in time for 4.11. Once this is done, I agree that the scope using Fun.protect is "good enough" for users who do not care about asynchronous exceptions (only possible bugs in callbacks), which is all I am asking for. I was not asking for more, as I have said that I will make it right for masking once it is ready; I just did not think it was good to make a principle out of the broken behaviour in bytecode.

  • By the way, while what I wrote explicitly concerned native, it is very simple to write a with_memprof function that works properly in both bytecode and native, if you are interested.

  • Thanks for the explanation about the synchronising nature of complete flushing in both designs, that was clarifying. For the sake of the discussion one should therefore distinguish the best-effort flush from the complete one. Both are interesting. In the single-threaded case they match, IIUC, and are easy to implement. Sure, it is not always going to be meaningful to completely flush, but I imagine it would be useful if, in some well-controlled circumstances (e.g. good scoping, calling Gc.full_major just before), the user could deduce there is a leak if some sample has not been collected. But you convinced me that both designs are equivalent on this point, and it is fine if it remains just a "thought experiment".

In summary, thank you for patiently addressing all my comments. More focused discussion will continue on the spin-off PRs you have opened.

  • You are also asking me where I see OCaml headed regarding polling points. I do indeed intend to rely on them for critical sections involving masking, as people who rely on them to write correct concurrent code already do. This is not going to break overnight. Furthermore, the current behaviour in native code has very nice compositionality properties, and I have written on the "Better Safe Points" page how these good properties could (and, in my opinion, should) be preserved when adding the missing polling points (more precisely, when implementing the non-polling block). As for the behaviour in bytecode, it indeed has bad compositionality and is hard to program with. I had initially hoped that the addition of the missing polling points would also benefit bytecode, so that it would be possible to bring bytecode behaviour in line with native (we have more details now because part of the implementation has landed in the multicore repository). Instead, it will be enough if the non-polling block is given appropriate semantics in bytecode.
