Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve error message when bytecode program runs out of fds #2195

Merged
merged 2 commits into from
Nov 2, 2020

Conversation

dra27
Copy link
Member

@dra27 dra27 commented Dec 11, 2018

The following program:

while true do
  open_in "foo" |> ignore
done

would terminate with Fatal error: executable program file not found if backtraces were enabled which is wrong for two reasons:

  1. The file was found, the issue is that all fds have been exhausted
  2. This error is not fatal

Furthermore, this error would happen if user code attempted to get the backtrace information (e.g. to display an error message). In this case, the information about the exception is completely lost, since the runtime terminates.

The first commit detects EMFILE in caml_attempt_open and returns a new flag to differentiate the error. caml_debuginfo_location is altered to return 0 on success but to pass on the return value of caml_attempt_open if this fails (native runtime is unaffected by this change). caml_print_exception_backtrace is then altered so that it displays a more helpful error message, and the error is no longer fatal. Note that loc_is_raise remains valid in this case, which means that the effect on user code would be to act as though debugging information were not available (where before the runtime had terminated)

The second commit optionally extends this further by causing the debug information to be loaded during startup if backtraces are enabled. The assumption is that the system is not likely to have run out of fds at this point, though in this case the runtime would in fact fail to start. Again, native code is not affected by this change.

The background for this seemingly mundane change is a non-trivial amount of time lost debugging Dune...

cc @jonludlam

Before, any errors when loading bytecode backtrace information were
fatal and the error message was potentially misleading.
Load the debug information during runtime startup if OCAMLRUNPARAM=b=2.
This guards against the specific case of running out of fds, since the
debug information can't then be loaded.
@dra27
Copy link
Member Author

dra27 commented Dec 11, 2018

See also ocaml/dune#1633 (comment) for a log showing how weird this error message was before!

Copy link
Member

@gasche gasche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read read_main_debug_info only so far, not the rest of the changes.

I wondered about whether reading the debug information at startup could be an issue in any way. The only potential issue I can see (besides making the startup logic more copmlex) is a slowdown at startup. Do we have reasonably big executables on which we could measure the debuginfo-loading time, to ensure that it remains unnoticeable in scripting scenarios?

di->events = process_debug_events(caml_start_code, events, &di->num_events);
di->events =
process_debug_events(caml_start_code, events, &di->num_events);
} else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would find the code more readable if the fd < 0 case was handled first.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the else branch for if (caml_seek_optional_section ...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(though your comment can still stand - this is actually a leaking fd bug in the code which was there before)

} else {
close(fd);
}
fd = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this write, isn't it the case that fd is always 0 at the CAMLreturnT point below? I don't understand when it is that we may have a non-zero return value.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No - if fd was negative after the call to caml_attempt_open then this code is never reached (see line 345)

print_location(&li, i);
result = caml_debuginfo_location(dbg, &li);
if (result == 0)
print_location(&li, i);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpicking, but I would find err or errno clearer than result, and I think the control flow would be clearer with a break right after the caml_debuginfo_location case --- than a test in the loop.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two loops - break would only exit the inner loop. The structuring relies on a "trick" which is namely that result will always be 0 in native code and in bytecode the inner loop only executes once (because the inner loop is to do with inlining).

Happy to rename the variable, though!

@shindere
Copy link
Contributor

shindere commented Dec 19, 2018 via email

Copy link
Member Author

@dra27 dra27 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rebased and addressed @shindere's suggestion to incorporate the test (a few other notes inline)

@@ -0,0 +1,18 @@
(* TEST
flags += "-g"
ocamlrunparam = "b"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shindere, this idiom is used in other places apparently with the intention to cause OCAMLRUNPARAM to be set, but it doesn't seem to. Is it supposed to?

@@ -0,0 +1,4 @@
#!/bin/sh
export OCAMLRUNPARAM=b
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment above - this hook is only used to ensure that the test is run with OCAMLRUNPARAM=b

testsuite/tests/backtrace/pr2195.ml Show resolved Hide resolved
Copy link
Contributor

@xavierleroy xavierleroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a matter of general policy, I would still prefer to load debug information lazily, only when needed to produce a backtrace, rather than at program start-up. (You could imagine short-running programs with humonguous debugging information that takes forever, relatively speaking, to load at start-up time.)

Could we just improve the error message ("cannot display backtrace information because I'm out of file descriptors") instead? This problem happens excessively rarely, so that might be good enough.

@shindere
Copy link
Contributor

shindere commented Mar 2, 2020

@dra27: what makes you think OCAMLRUNPARAM is not defined / exported
as it should?

I just checked and, yes, something like

  ocamlrunparam += "b"

in a test block should add b to the current value of OCAMLRUNPARAM

I think this is working properly e.g. in testsuite/tests/backtrace/raw_backtrace.ml,
no?

@damiendoligez
Copy link
Member

You could also open the file descriptor at the start, and only read from it when you need the info. This would avoid the problem with little overhead. The drawback is that you'd leak one fd every time the program execs an OCaml program.

@dra27
Copy link
Member Author

dra27 commented Sep 21, 2020

@damiendoligez - the drawback feels too expensive. In the original case with Dune, the problem was that there are not enough FDs - i.e. it wasn't bug per se in Dune, but that it needed to be careful to exhaust a finite resource.

@shindere - this is working this time, thanks. I'm not sure quite sure what was going wrong when I tried to set ocamlrunparam previously!

@xavierleroy - this is now done. See longer next comment!

@dra27
Copy link
Member Author

dra27 commented Sep 21, 2020

Rebased and revised version pushed:

  • The error message was only used in the default C exception handler, which meant it was basically never seen. I've now updated printexc.ml as well.
  • Rather than returning the load status all over the place, it's now encoded in the already_read field of the bytecode debuginfo struct. Before 0 meant not yet read and 1 meant loaded. Now negative numbers record the error which occurred.
  • Rather than always loading debug information at program startup if OCAMLRUNPARAM=b, it is now possible to have b=2. In the latter case, debug info is loaded at startup and the running out of FDs error can therefore advertise the existence of this setting. I updated the manual too. Please note that the overwhelming of this patch is improving the error message - adding b=2 is a relatively tiny change.
  • There's a test which tests b=0, b=1 and b=2 in a program which exhausts FDs.

@dra27 dra27 force-pushed the better-ulimit branch 3 times, most recently from f3d4589 to 4746ea3 Compare September 22, 2020 14:22
@dra27
Copy link
Member Author

dra27 commented Sep 22, 2020

A small round of updates, in the light of a somewhat red precheck...

  • The attempt to load the debug info was too early in startup, and now takes place immediately after caml_sys_init in a dedicated function (instead of trying to piggy-back on to caml_init_debug_info). The NNP check was quite good at spotting this...
  • The message included in Sys_error is of course platform-dependent, so the test now converts it to an Exit exception
  • It turns out that Unix is a good deal easier to denial-of-service than Windows ... the test now includes a wrapper which calls ulimit -n before running the test.

@dra27
Copy link
Member Author

dra27 commented Sep 23, 2020

OK, so this PR now seems to fall into two parts, which I'm happy to separate:

  • Fixing the error message if the bytecode runtime is unable to open the bytecode executable to read debug information (first commit)
  • Providing a way to have the debug information, even if the program has run out of FDs (second commit)

Prior to splitting, though, I have 3 possible implementations of the second part:

  1. The current PR which introduces OCAMLRUNPARAM=b=2 and which loads the debug information at program startup, thus side-stepping any problems. It's quite small but the downside is that it repeats all the logic for finding the bytecode executable and opens a bytecode executable which was just opened to read the actual bytecode.
  2. dra27@9af0b57 is an alternate version of b=2 which doesn't re-open the file. This involves a bit of refactoring of functions to allow reading debug information from a channel rather than an fd. This has a big benefit that loading debug information is guaranteed to succeed - because we know that we opened the bytecode executable to run it! In theory, it means errors can only happen with the highly niche CAML_DEBUG_FILE (caml_cds_file) mode.
  3. Executables in native code which have debug information always "load" it inasmuch as it's in the executable and the OS will have mapped and even pre-emptively fetched it. When we load a bytecode executable, the first which happens after opening is a seek to the end to get the trailer which is a pretty strong hint to the OS to load the whole thing - dra27@d2d120d reads the DBUG section but doesn't unmarshal until it's actually used. Apart from a tweak in intern to allow reading from a malloc'd buffer with an offset, this actually involves deleting a lot of code - in particular, there's no longer need for a special case for the linking mode for -output-complete-exe and the handling of caml_cds_file becomes much more transparent (in the implementation there, CAML_DEBUG_FILE is read at startup). It's not a major concern, although I'm told it has happened "in the wild" for native executables before, but it also stops the wrong debug information being loaded if the bytecode executable has been replaced while the process is running.

I personally lean towards the third one - at startup, it adds only the overhead of reading data from a file which is already open and the memory overhead of keeping that around. On the plus-side, backtraces will always work.

Copy link
Contributor

@xavierleroy xavierleroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, modulo the wording in the documentation (see below for suggestions).

I would have been happy with the first commit only (the one that explains the issue with running out of FDs), but the second commit (the one that introduces "b=2") is small enough.

Comment on lines 115 to 119
when an uncaught exception aborts the program. This option does not
require an argument. If a bytecode program does not have any available
FDs when the runtime attempts to load debugging information, then
location information will not be displayed. You can force the runtime
to load debugging information at program startup by setting "b=2".
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation is not very clear. Not everyone knows what an "FD" is. Plus, the backtrace will tell when "b=2" is required. An example of an alternate wording:

Suggested change
when an uncaught exception aborts the program. This option does not
require an argument. If a bytecode program does not have any available
FDs when the runtime attempts to load debugging information, then
location information will not be displayed. You can force the runtime
to load debugging information at program startup by setting "b=2".
when an uncaught exception aborts the program. An optional argument
can be provided: "b=0" turns backtrace printing off; "b=1" is equivalent to "b"
and turns backtrace printing on; "b=2" turns backtrace printing on and forces
the runtime system to load debugging information at program startup time
instead of at backtrace printing time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dra27 could you please agree to reword the documentation, or state that you don't want it changed? We need to make progress on PRs.

@xavierleroy
Copy link
Contributor

Concerning the other two alternatives:

dra27/ocaml@9af0b57 is an alternate version of b=2 which doesn't re-open the file.

That's quite a refactoring. Not sure I want to take that risk for such a rarely-used feature.

Executables in native code which have debug information always "load" it inasmuch as it's in the executable and the OS will have mapped and even pre-emptively fetched it.

The stack backtrace information produced by ocamlopt -g is MUCH smaller than the full debugging information produced by ocamlc -g, as the latter includes full typing information, which is huge. So, no, I refuse to load all bytecode debugging information at start-up every time.

(For the record: the little bit of DWARF debugging info that ocamlopt produces, and the much bigger chunk of DWARF debugging info that @mshinwell 's gdb support work produced, go into separate sections of the ELF executable file, sections which are never loaded in memory when the program starts.)

@xavierleroy xavierleroy added this to the 4.12 milestone Oct 7, 2020
@dra27
Copy link
Member Author

dra27 commented Nov 2, 2020

Sorry this one got dropped, too - your wording is clearly better than mine! Rebased and updated to include it, all I added was a sentence explicitly saying when you might want to use b=2: ""b=2" can be used if the runtime is unable to load debugging information at backtrace printing time, for example if there are no file descriptors available."

@dra27
Copy link
Member Author

dra27 commented Nov 2, 2020

(Changes amended, too)

Copy link
Contributor

@xavierleroy xavierleroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent! Merging right now...

@xavierleroy xavierleroy merged commit 8575a4b into ocaml:trunk Nov 2, 2020
@dra27
Copy link
Member Author

dra27 commented Nov 2, 2020

Is this still OK for 4.12, @Octachron?

@dra27 dra27 deleted the better-ulimit branch November 2, 2020 13:21
dra27 pushed a commit that referenced this pull request Nov 27, 2020
Improve error message when bytecode program runs out of fds

(cherry picked from commit 8575a4b)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants