Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Treat as errors the uncaught exceptions in finalisers #8873

Closed
wants to merge 3 commits into from

Conversation

@gadmm
Copy link
Contributor

gadmm commented Aug 13, 2019

This old patch fixes the bug whereby catchable exceptions (e.g. Not_found) can in theory appear out of thin air, due to exceptions being raised at allocation points from finalisers. After discussion with @lpw25 and @stedolan, I decided to resurrect it and propose it as a PR.

Warning: this was my first modification to the runtime, initially meant as an exercise to find my way around the compiler. I cannot guarantee that I understand everything I wrote, so please review carefully. And no hard feelings if rejected!

The strategy is the same as for Fun.protect: we promote the uncaught exception into a serious exception by wrapping it into a new exception Finaliser_raised of exn. This uncaught exception is documented as a programming error or unexpected exception (OOM, Sys.Break...), exactly like Fun.Finally_raised, so one should not catch Finaliser_raised other than with a catch-all.

Right now, there is a consensus that uncaught exceptions in finalisers should be ignored or terminate the program (probably with a customisable dbuenzli-hook). I too think this is a good general direction, and the current PR does not aim to change what this ideal solution should be. There are a few comparison points where one might still want the current PR as a solution available right here and now:

  • The current PR is backwards-compatible: it is guaranteed to only reveal bugs (catching an exception out of a finaliser without a catch-all then re-raise, or at isolated boundaries, has to be a programming error). On the contrary, changing the behaviour to abort on uncaught exceptions in finalisers can turn buggy but hardened programs into abortive ones. So, one might prefer a smoother transition where one has an additional chance to detect these bugs in the meanwhile.

  • Besides programming errors, these uncaught exceptions in finalisers can currently be OOM or Sys.Break. In that case, it is better to ignore than to abort, but you at least need to mask signals while finalisers run. And if you want to allow aborting, you first need to sanitize OOM exceptions. Maybe it is a more involved change than it sounds, that pends a more general clarification of the exception safety model which is bound to happen with multicore. (It does make sense, for instance, to ask that finalisers are written in the same dialect as other resource cleanup functions where signals are masked and all exceptions should be handled.) In short, the alternative proposition is not obvious to design, and has no ETA.

The current PR fixes the bug now, and up until our multicore people ask that finalisers run in their own threads.

As an alternative design to the current Finaliser_raised which struck me due to the way the code is structured: it also makes sense to have a single Asynchronous_exception of exn to ensure that all asynchronous exceptions are promoted to a serious exception (that is: uncaught exceptions in finalisers, exceptions raised from signal handlers, and uncaught exceptions from memprof callbacks). Programs that catch Sys.Break explicitly might break, but one is already not supposed to catch Sys.Break without a catch-all.

Failing tests: currently, I have the following kind of test failure, e.g. typing-sigsubst/sigsubst.ml:

-Error: Illegal shadowing of included type t/97 by t/101
+Error: Illegal shadowing of included type t/98 by t/102

Is this a bug? Can you give me clues to fix it?

@dbuenzli

This comment has been minimized.

Copy link
Contributor

dbuenzli commented Aug 13, 2019

Did you consider simply not letting uncaught exceptions in finalizers subvert the control flow of the program but still letting the program be made aware that something bad happened ?

I always thought these kind of problems would be better solved by a trap (again sorry...) rather than effectively letting them turn any kind of exception into a asynchronous one (before this PR) or introducing a new, but single, one to represent them (this PR).

Basically the idea would be to either use or extend (my preference) the uncaught exception handler and simply pass these uncaught exceptions there hereby easily letting the program decide whatever action it wishes to perform according to where they occur. Something like:

type ctx = [ `Signal_handler | `Program | `Finalizer | `Self (* the uncaught handler itself raised *) ]
val set_uncaught_exception_handler : (ctx -> exn -> raw_backtrace -> unit) -> unit
val handle_uncaught_exception : ctx -> exn -> raw_backtrace -> unit
@gadmm

This comment has been minimized.

Copy link
Contributor Author

gadmm commented Aug 13, 2019

Did you consider simply not letting uncaught exceptions in finalizers subvert the control flow of the program but still letting the program be made aware that something bad happened ?

Yes, I explained in detail the challenges of this alternative approach. The point is that this simple fix is available right now, does not raise as many questions, and does not preclude the solution you suggest from being implemented once OCaml is ready for it.

I always thought these kind of problems would be better solved by a trap (again sorry...)

(No need to apologize for being insistent :)

Basically the idea would be to either use or extend (my preference) the uncaught exception handler and simply pass these uncaught exceptions there hereby easily letting the program decide whatever action it wishes to perform according to where they occur.

This is what I meant with a dbuenzli-hook. In the handler, there is indeed not much else you can do besides aborting or swallowing the exception (and logging). Maybe you were thinking of being able to do something else?

@dbuenzli

This comment has been minimized.

Copy link
Contributor

dbuenzli commented Aug 13, 2019

The point is that this simple fix is available right now,

Right but somehow I don't see it fixing much. I mean the only thing you get is more information right ? That is: rather than a puzzling exception out of of the blue you still get it but wrapped in another one. Agreed it's better, but still... I'm not sure the problem is common enough to warrant introducing a half-fix in the stdlib which will have to be deprecated.

Unless there's something I deeply misunderstand I don't really buy the "backwards-compatible" argument you are making in the first point above: these errors are non-deterministic and if they happen to occur outside an exn handler your program is already abortive now.

So why not introduce the trap immediately ? You could even elect not to abort on `Finalizer by default if you think it's a problem to change that behaviour -- maybe that's even what should be be done anyways. But even if you abort I wouldn't be too concerned to keep backward compatibility with what amounts to the non-deterministic abortive behaviour we have now.

Maybe you were thinking of being able to do something else?

No that pretty much sums it...

@dbuenzli

This comment has been minimized.

Copy link
Contributor

dbuenzli commented Aug 13, 2019

these errors are non-deterministic and if they happen to occur outside an exn handler your program is already abortive now.

Okay sorry of course not if you have a global try main () with exn -> .... Still I'm not sure the churn of introducing that exn and then deprecating is worth over directly putting a forward looking trap in place. Even for a hardened program you may get non-deterministic behaviour, also keeping backward compatibility for buggy programs is weird, release notes can simply indicate the change.

@gadmm

This comment has been minimized.

Copy link
Contributor Author

gadmm commented Aug 13, 2019

The point is that this simple fix is available right now,

Right but somehow I don't see it fixing much. I mean the only thing you get is more information right ? That is: rather than a puzzling exception out of of the blue you still get it but wrapped in another one. Agreed it's better, but still...

No, you get the (theoretical) assurance that when you catch for instance Not_found, it means what you expect. And it does not claim to achieve anything else.

I'm not sure the problem is common enough to warrant introducing a half-fix in the stdlib which will have to be deprecated.

I agree that it is quite theoretical, but it's pretty bad and gets mentioned often when discussing asynchronous exceptions. I do not think it is a half-fix, but a proper fix for OCaml as it currently stands (where you've currently got OOM, Sys.Break, and where you have to be careful about breaking changes).

So why not introduce the trap immediately ?

I'd be in favour of an alternative PR which introduces a trap right away, but note that this currently needs to at the very least implement masking first, and to investigate the consequences of the change.

You could even elect not to abort on `Finalizer by default if you think it's a problem to change that behaviour -- maybe that's even what should be be done.

In general I am not sure that ignoring bugs amounts to a healthy default, which is part of the complications. But I agree it might end up the best choice, i.e. not necessarily consider uncaught exceptions as a bug, e.g. if you have a mental model where a finalizer runs in an isolated thread.

Note that once you make ignoring the default, then it's hard to go back. And if you use libraries written when ignoring is the default, they might not be well-tested against changing this behaviour.

With the current PR we have the time to wait and see where multicore is headed before settling on a design.

Still I'm not sure the churn of introducing that exn and then deprecating is worth over directly putting a forward looking trap in place.

Yes, I used to think the same, which is why I did not think opening a PR until recently when I was encouraged to do so.

Developers have been speaking about sanitizing uncaught exceptions in finalisers for a while now, everybody agrees it's bad, but I have not seen much progress.

Sorry if it looks like a lot of discussion around tiny details, but it's important.

@gadmm

This comment has been minimized.

Copy link
Contributor Author

gadmm commented Aug 13, 2019

I did not see your edit.

Even for a hardened program you may get non-deterministic behaviour, also keeping backward compatibility for buggy programs is weird, release notes can simply indicate the change.

Not all exceptions out of finalisers are programming errors (currently).

@dbuenzli

This comment has been minimized.

Copy link
Contributor

dbuenzli commented Aug 14, 2019

No, you get the (theoretical) assurance that when you catch for instance Not_found, it means what you expect.

That's a convincing point.

Not all exceptions out of finalisers are programming errors (currently).

😱

@gadmm

This comment has been minimized.

Copy link
Contributor Author

gadmm commented Aug 14, 2019

Still I'm not sure the churn of introducing that exn and then deprecating is worth over directly putting a forward looking trap in place.

Also, note the proposition of introducing a common exception Asynchronous_exception of exn for all asynchronous exceptions, which does not have the same deprecation issue. I am ready to implement this alternative if people would prefer that (I am starting to like the idea more and more).

Continuing with the idea of Asynchronous_exception, I am not sure I made one point clearly enough. Doing this more ambitious change might break programs using the form try ... with Sys.Break -> ... (such as Coq) where this must be replaced with a catch-all. But regardless, if these programs were to start using Fun.protect, or use libraries that start using it, then they are already broken because Sys.Break can be converted into Fun.Finally_raised Sys.Break (and likewise with Base.Exn.protect or whatever non-broken implementation of unwind-protect: no messenger-shooting allowed here). So doing the more ambitious change would be a way to make programs future-proof (or force fixing actual latent bugs).

I am curious to audit the uses of Sys.Break in the opam repository. Has anybody some guidance on where to start?

There's a challenging problem for backwards compatibility. Maybe it is necessary to introduce deprecation warnings for patterns like | Sys.Break -> or | Out_of_memory -> with this change.

I'll be around at the OCaml workshop in the morning if people want to discuss this.

@dbuenzli

This comment has been minimized.

Copy link
Contributor

dbuenzli commented Aug 14, 2019

I am curious to audit the uses of Sys.Break in the opam repository. Has anybody some guidance on where to start?

That could be a start:

mkdir /tmp/source && cd /tmp/source
for pkg in $(opam list -s --all); do opam source $pkg; done
grep -r 'Sys.Break'
@lthls

This comment has been minimized.

Copy link
Contributor

lthls commented Aug 14, 2019

You mention in stdlib.mli that this exception should not be caught except with a catch-all. You could enforce this by not exporting the exception in stdlib.mli, documenting it in Gc.finalise instead. As an alternative, you could mark it as private (using the syntax type exn += private Finaliser_raised of exn). This allows matching on it, but not creating such exceptions from OCaml code.

This PR does not go in a direction I'm interested in, but it doesn't seem to break anything I rely on.
The tests failures you're seeing (at least the ones where the differences are only an offset of one in the identifier stamps, like the one you showed) are expected because you've added a predefined value. If you've checked that all of them are of this form, you can do make promote in the testsuite directory and it will update the expected outputs of the tests to match the ones you have now.

@gadmm

This comment has been minimized.

Copy link
Contributor Author

gadmm commented Aug 15, 2019

@lthls Thanks for your input. One aspect is that you do not want to catch without a catch-all, but you are allowed to match for display or logging purposes for instance (hence the "general rule").

This is in theory. I am now closing this PR as I am going to explain.

@lpw25 @stedolan @gasche (and other people who have shown interest or displayed opinions about asynchronous exceptions), this is for the record. I did an audit of the use of asynchronous exceptions in OCaml packages published on opam. I performed the following searches.

grep -nr "Sys.Break\|Sys.catch.break"
grep -nr -C 5 "Sys.signal ", then Ctrl-s "raise"
grep -nr -C 5 "set_signal ", then Ctrl-s "raise"
grep -nr -C 5 "Gc.finalise ", then Ctrl-s "raise" and Ctrl-s "assert"
grep -nr -C 5 "Gc.create_alarm ", then Ctrl-s "raise"

Then I curated the results by hand by assessing whether they 1) purposefully raise exceptions in signal handlers or finalisers, 2) whether they explicitly match on this exception. Note that the goal was not to be exhaustive but to receive some concrete usage examples, hence the coarse search patterns and the search limited to opam. Nobody has the time nor the possibility to do an exhaustive audit.

Here is a list of packages that rely on asynchronous exceptions by criterion 1):

abella.2.0.6
acgtk.1.5.0
alt-ergo.2.3.0
archsat.1.0
assertions.0.1
b0.0.0.0
bap-saluki.bap-1.6
beluga.0.8.2
bindlib.5.0.1
bracetax.0.4
bsbnative.1.9.4
camlimages.5.0.1
camlp5.7.06.10-g84ce6cc4
cduce.0.5.5
cmdtui-lambda-term.0.4.3
cmdtui.0.4.3
coccinelle.1.0.2
coq-serapi.8.9.0+0.6.1
coq.8.9.1
coqide.8.9.1
core_kernel.v0.12.2
cubicle.1.1.2
dtools.0.4.1
dune-release.1.3.1
elpi.1.4.0
flowcaml.1.07
frama-c-base.15.0
frama-c.19.0
frenetic.5.0.3
fstar.0.9.6.0
glsurf.3.3.1
goblint.1.0.0
graphics.4.07.1
hack_parallel.0.1.1
haxe.3.4.7
imap.1.1.1
iocaml-kernel.0.4.8
iocaml.0.4.9
iocamljs-kernel.0.4.8
js_of_ocaml-camlp4.3.1.0
js_of_ocaml-compiler.3.4.0
js_of_ocaml-lwt.3.4.0
js_of_ocaml-ocamlbuild.3.1.0
js_of_ocaml-ppx.3.4.0
js_of_ocaml-ppx_deriving_json.3.4.0
js_of_ocaml-toplevel.3.4.0
js_of_ocaml-tyxml.3.4.0
js_of_ocaml.3.4.0
jupyter-archimedes.2.3.2
jupyter-kernel.0.4
jupyter.2.4.0
KaSim.4.0.0
lambda-term.2.0.1
lambdapi.1.0
ldap.2.4.0
learn-ocaml.0.10
libres3.1.3
links-postgresql.0.8
links.0.8
lutin.2.56
mccs.1.1+9
merlin.3.3.1
minilight.1.6
mirari.0.9.7
msat.0.8
ocaml-base-compiler.4.08.0
ocaml-http.0.1.6
ocaml-src.4.08.0
ocaml-variants.4.10.0+trunk+flambda
ocamlbuild.0.14.0
ocamlcc.1.0
ocamlfind.1.8.0
ocamlnet.4.1.6
ocp-browser.1.1.9
ocp-index.1.1.9
omake.0.10.3
opa-base.1.1.0+4263
opam-bundle.0.4
opam-client.2.0.4
opam-core.2.0.4
opam-depext.1.1.3
opam-devel.2.0.4
opam-ed.0.3
opam-format.2.0.4
opam-installer.2.0.4
opam-lib.1.3.1
opam-repository.2.0.4
opam-solver.2.0.4
opam-state.2.0.4
osbx.1.2.4
ox.1.1.1
patoline.0.1
pfff.0.37.1
phox.0.89.170929
pkcs11-cli.0.18.0
pkcs11-driver.0.18.0
pkcs11-rev.0.18.0
pkcs11.0.18.0
podge.0.8.0
procord.0.2.0
profiler-plugin.1.30
protocol-9p-tool.2.0.1
protocol-9p-unix.2.0.1
protocol-9p.2.0.1
rfsm.1.5
rmlbuild.0.11.0-00
shcaml.0.2.1
spdiff.0.1
spotlib.4.0.3
spotlib_js.2.2.0_js
unison.2.48.3
utop.2.4.0
why.2.41

The the best of my understanding, here is the sublist of packages that subscribe to the discipline of not discriminating the asynchronous exception from other kinds of unexpected exceptions (criterion 2):

frenetic.5.0.3
glsurf.3.3.1
js_of_ocaml-compiler.3.4.0
js_of_ocaml-lwt.3.4.0
js_of_ocaml-ppx.3.4.0
js_of_ocaml-ppx_deriving_json.3.4.0
js_of_ocaml-toplevel.3.4.0
js_of_ocaml-tyxml.3.4.0
js_of_ocaml.3.4.0
lutin.2.56
merlin.3.3.1
mirari.0.9.7
ocamlbuild.0.14.0
ocamlfind.1.8.0
ocamlnet.4.1.6
opam-core.2.0.4
ox.1.1.1
rfsm.1.5
spotlib.4.0.3
spotlib_js.2.2.0_js
why.2.41

The most prevalent use of async exceptions is raising an exception on SIGINT in interactive programs. Quite importantly, in many cases a specific action is performed, from a specific message ("Interrupted"), to a specific user interaction ("press Ctrl-C two more times to exit"), even to a distinct behaviour (taking the decision to respawn a task instead of aborting). It appears important to be able to discriminate Sys.Break. Besides SIGINT, it is also used to customize behaviour on SIGTERM, sanitize behaviour on SIGPIPE (here we raise a catchable exception: End_of_file), and abort computation on SIGALRM/SIGVTALRM.

As for finalisers, the first remark is that grep "Gc.finalise" does not return relevant cases, except ones raising fatal errors, sometimes explicitly such as with an assert. This correlates the intuition that purposefully raising from finalisers has no use-case, except for communicating fatal errors. Explicitly communicating failure does happen (contrary to the idea that ignoring the exception by default would be fine).

The prize of the most innovative use of asynchronous exceptions goes to msat and archsat which rely on the possibility to raise in finalisers to abort computation when a memory limit is reached. How so? By registering a Gc alarm (a function called at the end of major Gc cycles) that raise some "Out_of_space" exception. Gc alarms are indeed implemented with finalisers. The fact that finalisers are used is probably irrelevant and this can be replaced by something else in multicore, but that also means that you have a new kind of asynchronous exception to take into account.

Note again the empirical evidence in favour of the need to compute until some space or time bound is reached.

So, contrary to what I understood of the maintainer folklore, the established ocaml codebase is not ready even for the mildest proposed change (wrap exceptions in Finaliser_raised), because of signal handlers and Gc alarms.

As for multicore, I see new challenges. I clearly see a design for asynchronous exceptions in multicore, but there are two new questions: 1) is it possible to discriminate signals from other unexpected exceptions? (there are a few obvious ideas to explore), 2) for any such design, is there a clear evolution path from the established code base to the new design? (that sounds more challenging to me).

Hope to see you in Berlin!

@gadmm gadmm closed this Aug 15, 2019
@dbuenzli

This comment has been minimized.

Copy link
Contributor

dbuenzli commented Aug 15, 2019

I understand you went quickly over this but I'd just like to mention that:

b0.0.0.0

does not as far as I understand your definitions belong to 1: it only matches over known asynchronous OCaml runtime system exceptions to let them flow up and traps any other one. It does not raise exceptions from signal handlers or use finalizers. So I think it's rather in 2.

@gadmm

This comment has been minimized.

Copy link
Contributor Author

gadmm commented Aug 15, 2019

Ah, yes. I remember it as a special case because I remember even looking up details on opam and seeing you were the author. There are a few other packages I marked as "?" because they matched on Sys.Break without setting up signals. Here is the full list:

b0.0.0.0
camlimages.5.0.1
dune-release.1.3.1
iocamljs-kernel.0.4.8
lambda-term.2.0.1
mccs.1.1+9
ocp-browser.1.1.9
ocp-index.1.1.9
pkcs11-cli.0.18.0
pkcs11-driver.0.18.0
pkcs11-rev.0.18.0
pkcs11.0.18.0
protocol-9p-tool.2.0.1
protocol-9p-unix.2.0.1
protocol-9p.2.0.1

In doubt, in case I missed something, I registered the intent to match on Sys.Break.

For completeness, here's the list of ones I have marked as "bad" (1 but not 2 nor above):

abella.2.0.6
acgtk.1.5.0
alt-ergo.2.3.0
archsat.1.0
assertions.0.1
bap-saluki.bap-1.6
beluga.0.8.2
bindlib.5.0.1
bracetax.0.4
bsbnative.1.9.4
camlp5.7.06.10-g84ce6cc4
cduce.0.5.5
cmdtui-lambda-term.0.4.3
cmdtui.0.4.3
coccinelle.1.0.2
coq-serapi.8.9.0+0.6.1
coq.8.9.1
coqide.8.9.1
core_kernel.v0.12.2
cubicle.1.1.2
dtools.0.4.1
elpi.1.4.0
flowcaml.1.07
frama-c-base.15.0
frama-c.19.0
fstar.0.9.6.0
goblint.1.0.0
graphics.4.07.1
hack_parallel.0.1.1
haxe.3.4.7
imap.1.1.1
iocaml-kernel.0.4.8
iocaml.0.4.9
js_of_ocaml-camlp4.3.1.0
js_of_ocaml-ocamlbuild.3.1.0
jupyter-archimedes.2.3.2
jupyter-kernel.0.4
jupyter.2.4.0
KaSim.4.0.0
lambdapi.1.0
ldap.2.4.0
learn-ocaml.0.10
libres3.1.3
links-postgresql.0.8
links.0.8
minilight.1.6
msat.0.8
ocaml-base-compiler.4.08.0
ocaml-http.0.1.6
ocaml-src.4.08.0
ocaml-variants.4.10.0+trunk+flambda
ocamlcc.1.0
omake.0.10.3
opa-base.1.1.0+4263
opam-bundle.0.4
opam-client.2.0.4
opam-depext.1.1.3
opam-devel.2.0.4
opam-ed.0.3
opam-format.2.0.4
opam-installer.2.0.4
opam-lib.1.3.1
opam-repository.2.0.4
opam-solver.2.0.4
opam-state.2.0.4
osbx.1.2.4
patoline.0.1
pfff.0.37.1
phox.0.89.170929
podge.0.8.0
procord.0.2.0
profiler-plugin.1.30
rmlbuild.0.11.0-00
shcaml.0.2.1
spdiff.0.1
unison.2.48.3
utop.2.4.0

Edit: Here are my raw notes, for the record: https://gitlab.com/gadmm/stdlib-experiment/blob/master/other/async_audit/break_or_catch_break

@gadmm gadmm mentioned this pull request Sep 22, 2019
0 of 4 tasks complete
@gadmm gadmm mentioned this pull request Nov 13, 2019
gadmm added a commit to gadmm/ocaml that referenced this pull request Jan 27, 2020
Before this patch, a Memprof callback can potentially raise any
exception, including for instance Not_found out of thin air. This
means that when doing "try ... with Not_found ->...", this might not
always mean what the programmer intends. This is a well-known
theoretical bug affecting the raising behaviour of finalisers (ocaml#8873).

In general, any exception raised by Memprof is an asynchronous
exception subject to the discipline of ML interrupts (see e.g.
http://isabelle.in.tum.de/dist/library/Doc/Implementation/ML.html):
one must not discriminate interrupts, and an interrupt must always be
re-raised promptly except at isolation boundaries.

This patch avoids this sort of bugs, and it encourages and facilitates
following the discipline of interrupts for catching exceptions arising
from Memprof callbacks.

Note that some code in the wild exists of the form `try ... with
Sys.Break -> ...`. Such code does not follow the interrupt discipline
as it discriminates on interrupts. With this patch this practice
becomes incompatible with Memprof (similarly to what it is already
with Fun.protect). Users of such code who want to use Memprof must fix
it to properly catch all exceptions, and can record the cancellation
state into a boolean flag inside the signal handler of SIGINT for the
current thread.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

3 participants
You can’t perform that action at this time.