Improve error message when bytecode program runs out of fds #2195

dra27 · 2018-12-11T09:42:55Z

The following program:

while true do
  open_in "foo" |> ignore
done

would terminate with Fatal error: executable program file not found if backtraces were enabled which is wrong for two reasons:

The file was found, the issue is that all fds have been exhausted
This error is not fatal

Furthermore, this error would happen if user code attempted to get the backtrace information (e.g. to display an error message). In this case, the information about the exception is completely lost, since the runtime terminates.

The first commit detects EMFILE in caml_attempt_open and returns a new flag to differentiate the error. caml_debuginfo_location is altered to return 0 on success but to pass on the return value of caml_attempt_open if this fails (native runtime is unaffected by this change). caml_print_exception_backtrace is then altered so that it displays a more helpful error message, and the error is no longer fatal. Note that loc_is_raise remains valid in this case, which means that the effect on user code would be to act as though debugging information were not available (where before the runtime had terminated)

The second commit optionally extends this further by causing the debug information to be loaded during startup if backtraces are enabled. The assumption is that the system is not likely to have run out of fds at this point, though in this case the runtime would in fact fail to start. Again, native code is not affected by this change.

The background for this seemingly mundane change is a non-trivial amount of time lost debugging Dune...

cc @jonludlam

Before, any errors when loading bytecode backtrace information were fatal and the error message was potentially misleading.

Load the debug information during runtime startup if OCAMLRUNPARAM=b=2. This guards against the specific case of running out of fds, since the debug information can't then be loaded.

dra27 · 2018-12-11T09:53:43Z

See also ocaml/dune#1633 (comment) for a log showing how weird this error message was before!

gasche

I read read_main_debug_info only so far, not the rest of the changes.

I wondered about whether reading the debug information at startup could be an issue in any way. The only potential issue I can see (besides making the startup logic more copmlex) is a slowdown at startup. Do we have reasonably big executables on which we could measure the debuginfo-loading time, to ensure that it remains unnoticeable in scripting scenarios?

gasche · 2018-12-11T10:41:24Z

runtime/backtrace_byt.c

-    di->events = process_debug_events(caml_start_code, events, &di->num_events);
+      di->events =
+        process_debug_events(caml_start_code, events, &di->num_events);
+    } else {


I would find the code more readable if the fd < 0 case was handled first.

This is the else branch for if (caml_seek_optional_section ...

(though your comment can still stand - this is actually a leaking fd bug in the code which was there before)

gasche · 2018-12-11T10:42:14Z

runtime/backtrace_byt.c

+    } else {
+      close(fd);
+    }
+    fd = 0;


With this write, isn't it the case that fd is always 0 at the CAMLreturnT point below? I don't understand when it is that we may have a non-zero return value.

No - if fd was negative after the call to caml_attempt_open then this code is never reached (see line 345)

gasche · 2018-12-11T10:48:45Z

runtime/backtrace.c

-      print_location(&li, i);
+      result = caml_debuginfo_location(dbg, &li);
+      if (result == 0)
+        print_location(&li, i);


Nitpicking, but I would find err or errno clearer than result, and I think the control flow would be clearer with a break right after the caml_debuginfo_location case --- than a test in the loop.

There are two loops - break would only exit the inner loop. The structuring relies on a "trick" which is namely that result will always be 0 in native code and in bytecode the inner loop only executes once (because the inner loop is to do with inlining).

Happy to rename the variable, though!

shindere · 2018-12-19T08:25:35Z

Thanks @dra27, I really like the idea of this being fixed! Do you intend to include a test program in this GRP? For instance, couldn't your little example program be made part of the testsuite?

dra27

Rebased and addressed @shindere's suggestion to incorporate the test (a few other notes inline)

dra27 · 2020-02-27T16:16:29Z

testsuite/tests/backtrace/pr2195.ml

@@ -0,0 +1,18 @@
+(* TEST
+   flags += "-g"
+   ocamlrunparam = "b"


@shindere, this idiom is used in other places apparently with the intention to cause OCAMLRUNPARAM to be set, but it doesn't seem to. Is it supposed to?

dra27 · 2020-02-27T16:16:52Z

testsuite/tests/backtrace/pr2195.run

@@ -0,0 +1,4 @@
+#!/bin/sh
+export OCAMLRUNPARAM=b


See comment above - this hook is only used to ensure that the test is run with OCAMLRUNPARAM=b

testsuite/tests/backtrace/pr2195.ml

xavierleroy

As a matter of general policy, I would still prefer to load debug information lazily, only when needed to produce a backtrace, rather than at program start-up. (You could imagine short-running programs with humonguous debugging information that takes forever, relatively speaking, to load at start-up time.)

Could we just improve the error message ("cannot display backtrace information because I'm out of file descriptors") instead? This problem happens excessively rarely, so that might be good enough.

shindere · 2020-03-02T14:38:20Z

@dra27: what makes you think OCAMLRUNPARAM is not defined / exported
as it should?

I just checked and, yes, something like

  ocamlrunparam += "b"

in a test block should add b to the current value of OCAMLRUNPARAM

I think this is working properly e.g. in testsuite/tests/backtrace/raw_backtrace.ml,
no?

damiendoligez · 2020-07-21T16:17:13Z

You could also open the file descriptor at the start, and only read from it when you need the info. This would avoid the problem with little overhead. The drawback is that you'd leak one fd every time the program execs an OCaml program.

dra27 · 2020-09-21T16:03:07Z

@damiendoligez - the drawback feels too expensive. In the original case with Dune, the problem was that there are not enough FDs - i.e. it wasn't bug per se in Dune, but that it needed to be careful to exhaust a finite resource.

@shindere - this is working this time, thanks. I'm not sure quite sure what was going wrong when I tried to set ocamlrunparam previously!

@xavierleroy - this is now done. See longer next comment!

dra27 · 2020-09-21T16:07:20Z

Rebased and revised version pushed:

The error message was only used in the default C exception handler, which meant it was basically never seen. I've now updated printexc.ml as well.
Rather than returning the load status all over the place, it's now encoded in the already_read field of the bytecode debuginfo struct. Before 0 meant not yet read and 1 meant loaded. Now negative numbers record the error which occurred.
Rather than always loading debug information at program startup if OCAMLRUNPARAM=b, it is now possible to have b=2. In the latter case, debug info is loaded at startup and the running out of FDs error can therefore advertise the existence of this setting. I updated the manual too. Please note that the overwhelming of this patch is improving the error message - adding b=2 is a relatively tiny change.
There's a test which tests b=0, b=1 and b=2 in a program which exhausts FDs.

dra27 · 2020-09-22T14:31:04Z

A small round of updates, in the light of a somewhat red precheck...

The attempt to load the debug info was too early in startup, and now takes place immediately after caml_sys_init in a dedicated function (instead of trying to piggy-back on to caml_init_debug_info). The NNP check was quite good at spotting this...
The message included in Sys_error is of course platform-dependent, so the test now converts it to an Exit exception
It turns out that Unix is a good deal easier to denial-of-service than Windows ... the test now includes a wrapper which calls ulimit -n before running the test.

dra27 · 2020-09-23T08:57:00Z

OK, so this PR now seems to fall into two parts, which I'm happy to separate:

Fixing the error message if the bytecode runtime is unable to open the bytecode executable to read debug information (first commit)
Providing a way to have the debug information, even if the program has run out of FDs (second commit)

Prior to splitting, though, I have 3 possible implementations of the second part:

The current PR which introduces OCAMLRUNPARAM=b=2 and which loads the debug information at program startup, thus side-stepping any problems. It's quite small but the downside is that it repeats all the logic for finding the bytecode executable and opens a bytecode executable which was just opened to read the actual bytecode.
dra27@9af0b57 is an alternate version of b=2 which doesn't re-open the file. This involves a bit of refactoring of functions to allow reading debug information from a channel rather than an fd. This has a big benefit that loading debug information is guaranteed to succeed - because we know that we opened the bytecode executable to run it! In theory, it means errors can only happen with the highly niche CAML_DEBUG_FILE (caml_cds_file) mode.
Executables in native code which have debug information always "load" it inasmuch as it's in the executable and the OS will have mapped and even pre-emptively fetched it. When we load a bytecode executable, the first which happens after opening is a seek to the end to get the trailer which is a pretty strong hint to the OS to load the whole thing - dra27@d2d120d reads the DBUG section but doesn't unmarshal until it's actually used. Apart from a tweak in intern to allow reading from a malloc'd buffer with an offset, this actually involves deleting a lot of code - in particular, there's no longer need for a special case for the linking mode for -output-complete-exe and the handling of caml_cds_file becomes much more transparent (in the implementation there, CAML_DEBUG_FILE is read at startup). It's not a major concern, although I'm told it has happened "in the wild" for native executables before, but it also stops the wrong debug information being loaded if the bytecode executable has been replaced while the process is running.

I personally lean towards the third one - at startup, it adds only the overhead of reading data from a file which is already open and the memory overhead of keeping that around. On the plus-side, backtraces will always work.

xavierleroy

Looks good to me, modulo the wording in the documentation (see below for suggestions).

I would have been happy with the first commit only (the one that explains the issue with running out of FDs), but the second commit (the one that introduces "b=2") is small enough.

xavierleroy · 2020-10-04T09:48:26Z

manual/manual/cmds/runtime.etex

+        when an uncaught exception aborts the program. This option does not
+        require an argument. If a bytecode program does not have any available
+        FDs when the runtime attempts to load debugging information, then
+        location information will not be displayed. You can force the runtime
+        to load debugging information at program startup by setting "b=2".


The documentation is not very clear. Not everyone knows what an "FD" is. Plus, the backtrace will tell when "b=2" is required. An example of an alternate wording:

Suggested change

when an uncaught exception aborts the program. This option does not

require an argument. If a bytecode program does not have any available

FDs when the runtime attempts to load debugging information, then

location information will not be displayed. You can force the runtime

to load debugging information at program startup by setting "b=2".

when an uncaught exception aborts the program. An optional argument

can be provided: "b=0" turns backtrace printing off; "b=1" is equivalent to "b"

and turns backtrace printing on; "b=2" turns backtrace printing on and forces

the runtime system to load debugging information at program startup time

instead of at backtrace printing time.

@dra27 could you please agree to reword the documentation, or state that you don't want it changed? We need to make progress on PRs.

xavierleroy · 2020-10-04T09:57:44Z

Concerning the other two alternatives:

dra27/ocaml@9af0b57 is an alternate version of b=2 which doesn't re-open the file.

That's quite a refactoring. Not sure I want to take that risk for such a rarely-used feature.

Executables in native code which have debug information always "load" it inasmuch as it's in the executable and the OS will have mapped and even pre-emptively fetched it.

The stack backtrace information produced by ocamlopt -g is MUCH smaller than the full debugging information produced by ocamlc -g, as the latter includes full typing information, which is huge. So, no, I refuse to load all bytecode debugging information at start-up every time.

(For the record: the little bit of DWARF debugging info that ocamlopt produces, and the much bigger chunk of DWARF debugging info that @mshinwell 's gdb support work produced, go into separate sections of the ELF executable file, sections which are never loaded in memory when the program starts.)

dra27 · 2020-11-02T11:47:11Z

Sorry this one got dropped, too - your wording is clearly better than mine! Rebased and updated to include it, all I added was a sentence explicitly saying when you might want to use b=2: ""b=2" can be used if the runtime is unable to load debugging information at backtrace printing time, for example if there are no file descriptors available."

dra27 · 2020-11-02T11:49:00Z

(Changes amended, too)

xavierleroy

Excellent! Merging right now...

dra27 · 2020-11-02T13:21:30Z

Is this still OK for 4.12, @Octachron?

Improve error message when bytecode program runs out of fds (cherry picked from commit 8575a4b)

dra27 added 2 commits December 11, 2018 08:54

Improve error messages when loading backtrace info

43b28fa

Before, any errors when loading bytecode backtrace information were fatal and the error message was potentially misleading.

OCAMLRUNPARAM=b=2 => load debuginfo

692bd73

Load the debug information during runtime startup if OCAMLRUNPARAM=b=2. This guards against the specific case of running out of fds, since the debug information can't then be loaded.

gasche reviewed Dec 11, 2018

View reviewed changes

dra27 force-pushed the better-ulimit branch from 25a395d to 991f018 Compare February 27, 2020 16:11

dra27 commented Feb 27, 2020

View reviewed changes

xavierleroy reviewed Mar 1, 2020

View reviewed changes

dra27 mentioned this pull request Mar 4, 2020

Program built with --output-complete-exe cannot read debug infos #9344

Closed

dra27 force-pushed the better-ulimit branch from 991f018 to c16c64a Compare September 21, 2020 16:00

dra27 force-pushed the better-ulimit branch 3 times, most recently from f3d4589 to 4746ea3 Compare September 22, 2020 14:22

dra27 force-pushed the better-ulimit branch from 4746ea3 to b7ed1ed Compare September 23, 2020 08:24

xavierleroy approved these changes Oct 4, 2020

View reviewed changes

xavierleroy added this to the 4.12 milestone Oct 7, 2020

dra27 force-pushed the better-ulimit branch from b7ed1ed to 5508eba Compare November 2, 2020 11:46

dra27 force-pushed the better-ulimit branch from 5508eba to 692bd73 Compare November 2, 2020 11:48

xavierleroy approved these changes Nov 2, 2020

View reviewed changes

xavierleroy merged commit 8575a4b into ocaml:trunk Nov 2, 2020

dra27 deleted the better-ulimit branch November 2, 2020 13:21

dra27 pushed a commit that referenced this pull request Nov 27, 2020

Merge pull request #2195 from dra27/better-ulimit

ae1c370

Improve error message when bytecode program runs out of fds (cherry picked from commit 8575a4b)

dra27 mentioned this pull request Dec 4, 2020

Limit declaration of C primitives in runtime headers #9919

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve error message when bytecode program runs out of fds #2195

Improve error message when bytecode program runs out of fds #2195

dra27 commented Dec 11, 2018 •

edited

dra27 commented Dec 11, 2018

gasche left a comment

gasche Dec 11, 2018

dra27 Dec 11, 2018

dra27 Dec 11, 2018

gasche Dec 11, 2018

dra27 Dec 11, 2018

gasche Dec 11, 2018

dra27 Dec 11, 2018

shindere commented Dec 19, 2018 via email

dra27 left a comment

dra27 Feb 27, 2020

dra27 Feb 27, 2020

xavierleroy left a comment

shindere commented Mar 2, 2020

damiendoligez commented Jul 21, 2020

dra27 commented Sep 21, 2020

dra27 commented Sep 21, 2020

dra27 commented Sep 22, 2020

dra27 commented Sep 23, 2020

xavierleroy left a comment

xavierleroy Oct 4, 2020

xavierleroy Oct 30, 2020

xavierleroy commented Oct 4, 2020

dra27 commented Nov 2, 2020

dra27 commented Nov 2, 2020

xavierleroy left a comment

dra27 commented Nov 2, 2020

Improve error message when bytecode program runs out of fds #2195

Improve error message when bytecode program runs out of fds #2195

Conversation

dra27 commented Dec 11, 2018 • edited

dra27 commented Dec 11, 2018

gasche left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

shindere commented Dec 19, 2018 via email

dra27 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

xavierleroy left a comment

Choose a reason for hiding this comment

shindere commented Mar 2, 2020

damiendoligez commented Jul 21, 2020

dra27 commented Sep 21, 2020

dra27 commented Sep 21, 2020

dra27 commented Sep 22, 2020

dra27 commented Sep 23, 2020

xavierleroy left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

xavierleroy commented Oct 4, 2020

dra27 commented Nov 2, 2020

dra27 commented Nov 2, 2020

xavierleroy left a comment

Choose a reason for hiding this comment

dra27 commented Nov 2, 2020

dra27 commented Dec 11, 2018 •

edited