
EINTR-based signals #1128 (Closed)

stedolan wants to merge 1 commit
Conversation

@stedolan (Contributor) commented Mar 29, 2017

(This is quite a subtle change, I'd appreciate nitpicking about locking, signal handling and error checking. It's still incomplete at the moment, but I'd like some discussion. Apologies about the length, but the fallout outside io.c is fairly minimal, and the changes within are mostly mechanical.)

Currently, OCaml signal handlers for signals arriving during IO are run directly from the Unix signal handler, which has a number of problems (see discussion on #1107 for details). Additionally, they may be run while holding channels' locks, which causes the deadlocks/crashes detailed in MPR#7503.

Nowadays, a better solution is available: by relying on blocking system calls to return EINTR when interrupted by a signal, we can do the OCaml signal handling outside of the Unix signal handler. (The old BSD systems that couldn't return EINTR are thankfully thin on the ground now).

This patch rewrites the error-handling logic for IO channels in this style, which makes signal handling much more robust. I need to do a lot more testing, but one program whose behaviour improves is this:

let counter = ref 0
let () =
  Sys.set_signal Sys.sigint (Sys.Signal_handle (fun _ ->
    incr counter;
    Printf.printf "\nsignalled %d times\n%!" !counter;
    if Random.int 10 = 0 then Gc.full_major ()));
  while true do
    print_string "\r";
    flush stdout
  done

This program does allocation, I/O, GC and compaction from a signal handler. (I killed it after handling ~500k signals successfully. Trunk sometimes manages 1).

The biggest code changes are in io.c, where code like this:

Lock(chan);
operation_that_might_fail();
Unlock(chan);

becomes this:

Lock(chan);
do{
  err = operation_that_might_fail();
} while (check_retry(chan, err));
Unlock(chan);

The old operation_that_might_fail could raise exceptions or invoke signal handlers directly, but those paths had to contend with: the channel possibly being locked (and possibly left in an inconsistent state if forcibly unlocked); the signal handler possibly re-entering channel code (and deadlocking if it needs the same lock); or the signal handler triggering a GC, which makes finalisers run under an unknown collection of channel locks.

The new operation_that_might_fail returns an error code, and if the error code is nonzero then check_retry unlocks the channel, handles the error or signal, and retries the operation (if it was interrupted by a signal). The code is longer, but the error paths are easier to understand.
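For concreteness, here is a minimal sketch of what such a check_retry helper could look like; the io_result convention, the signal-processing call, and the error-raising helper are assumptions for illustration, not code from the patch:

/* Sketch only: err is assumed to be 0 on success or a positive errno value. */
static int check_retry(struct channel *chan, int err)
{
  if (err == 0) return 0;                /* success: leave the retry loop */
  Unlock(chan);                          /* never handle signals or raise while locked */
  if (err == EINTR) {
    caml_process_pending_signals();      /* may allocate, trigger GC, run OCaml code */
    Lock(chan);
    return 1;                            /* lock retaken, retry the operation */
  }
  caml_raise_with_errno(err);            /* hypothetical helper: raises Sys_error */
  return 0;                              /* not reached */
}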

Some details:

Low-level I/O

The low-level IO functions caml_read_fd and caml_write_fd now return an error code instead of raising exceptions, and never run signal handlers themselves (returning EINTR if a signal was triggered).

Similarly, there are new functions caml_try_putblock and caml_try_getblock in io.h which return error codes. These low-level operations expect their channel arguments to be locked, but the higher-level operations no longer do (Holding the channel locks across any operation which may allocate, or across caml_leave_blocking_section, causes deadlocks, so explicitly taking those locks is not something most code should do).
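For example, the revised low-level interface could look roughly like the following; the exact prototypes are assumptions inferred from the description above, not copied from the patch:

/* Sketch: io_result is the patch's error-code type; 0 means success and a
   positive errno value (possibly EINTR) means failure. Byte counts are
   reported through out-parameters, and the channel arguments of the
   try-variants are expected to be locked by the caller. */
io_result caml_read_fd(int fd, void *buf, int len, int *nread);
io_result caml_write_fd(int fd, const void *buf, int len, int *nwritten);
io_result caml_try_getblock(struct channel *chan, char *p, int len, int *nread);
io_result caml_try_putblock(struct channel *chan, const char *p, int len, int *nwritten);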

Blocking sections

caml_enter_blocking_section no longer runs signal handlers, because this caused a race condition in every IO operation:

CAMLparam1(vchannel);
struct channel * chan = Channel(vchannel);
Lock(chan);
caml_enter_blocking_section();
err = caml_read_fd(fd, &chan->buff, len, &nread);

In code like the above, if caml_enter_blocking_section runs a signal handler that uses the same channel, a deadlock arises when the handler tries to Lock(chan).

I think there is no need for caml_enter_blocking_section to run signal handlers: it is not a blocking operation, and it comes at the end of normal OCaml code in which signals were regularly polled.

caml_leave_blocking_section still checks for signals, but there is a new version caml_leave_blocking_section_nosig which does not, for use in caml_read_fd and other low-level IO functions that return explicit EINTR.
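As an illustration of how these pieces fit together (a sketch under the assumed prototypes above, not the patch itself):

/* Sketch: report EINTR to the caller instead of running handlers here;
   errors and signals are handled later, outside the channel lock, by
   check_retry. Assumes <unistd.h> and <errno.h>. */
static io_result read_fd_sketch(int fd, void *buf, int len, int *nread)
{
  ssize_t r;
  caml_enter_blocking_section();        /* no signal handlers run here */
  r = read(fd, buf, len);
  caml_leave_blocking_section_nosig();  /* nor here */
  if (r == -1) return errno;            /* EINTR propagates upwards */
  *nread = (int) r;
  return 0;
}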

Atomicity

I tried to maintain the same atomicity guarantees that currently exist. In particular, caml_really_{get,put}block (which may be implemented as multiple calls to read / write internally) are atomic with respect to systhreads if not interrupted, since they hold the channel lock for the duration of the call.

These functions are not atomic if interrupted by a signal. In particular, if a signal handler writes to a channel, the signal handler's output may appear anywhere in the output stream.

To do

  • Spacetime support (Spacetime does some slightly complicated things with channel locking)
  • Windows support
  • Atomic test-and-clear of caml_pending_signals
  • Remove the now-unneeded mechanism for unlocking channels on exceptions

  - Avoid running OCaml code from Unix signal handlers
  - Avoid holding channel locks across signal handlers
  - Make low-level I/O primitives avoid raising
@mrvn (Contributor) commented Mar 30, 2017

Blocking Sections

caml_enter_blocking_section no longer runs signal handlers, because this caused a race condition in every IO operation:

CAMLparam1(buf);
char* buf = &Byte(buf, 0);
caml_enter_blocking_section();
err = caml_read_fd(fd, buf, len, &nread);

In code like the above, if caml_enter_blocking_section runs a signal handler, then a GC may be triggered and buf might move, causing memory corruption in the following caml_read_fd.

This code and argument is just wrong.

  1. After entering a blocking section nothing on the OCaml heap may be accessed, because the GC may run in another thread and move things around. The use of buf in the caml_read_fd call is therefore illegal and may corrupt the OCaml heap at any time. Since caml_read_fd may block for a long time I would consider this near certain to happen.
  2. Running the signal handlers in caml_enter_blocking_section before releasing the runtime lock is equivalent to running the signal handler after acquiring the runtime lock in caml_leave_blocking_section in another thread. The reason to run it in caml_enter_blocking_section is to reduce delays, especially in the single-threaded case. In any case, the signal handler can (and will) run at any time between entering and leaving the blocking section. Not running it in caml_enter_blocking_section just makes corruption more random.

You might notice that in the stdlib caml_read_fd and caml_write_fd are called with channel->buff, which is part of a C structure and lives outside the OCaml heap; channel->buff is unaffected by GC compaction.

Similarly, in otherlibs/unix/read.c you have:

char iobuf[UNIX_BUFFER_SIZE];

Begin_root (buf);
  numbytes = Long_val(len);
  if (numbytes > UNIX_BUFFER_SIZE) numbytes = UNIX_BUFFER_SIZE;
  caml_enter_blocking_section();
  ret = read(Int_val(fd), iobuf, (int) numbytes);
  caml_leave_blocking_section();
  if (ret == -1) uerror("read", Nothing);
  memmove (&Byte(buf, Long_val(ofs)), iobuf, ret);
End_roots();

As you can see, the read is performed into a stack-local buffer and only copied into the OCaml buffer (via Byte) later, after leaving the blocking section.

In conclusion I would suggest 2 things:

  1. The commit seems to change multiple things at once. Split the commit up into smaller chunks that each fix a single problem.
  2. Keep the part about running signal handlers before entering blocking sections. It reduces the delay and the risk of dropping a signal because it is already pending.

@stedolan (Contributor, author) commented Apr 6, 2017

This code and argument is just wrong.

You're right, not quite sure what I was thinking there. I've updated the example to show the real issue.

Running the signal handlers in caml_enter_blocking_section before releasing the runtime lock is equivalent to running the signal handler after acquiring the runtime lock in caml_leave_blocking_section in another thread.

This is not true, and is part of what makes signal handling so hairy. If the current thread holds locks (like the channel locks in io.c), then signal handlers running in the current thread will deadlock the system or crash if they try to acquire them, while signal handlers running in other threads will simply wait until the operation completes and the lock can be acquired. See MPR#7503.

The commit seems to change multiple things at once. Split the commit up into smaller chunks that fix a single problem.

There is one change being made here, which is that IO functions should return EINTR and allow signals to be handled by the caller, rather than attempting to handle signals in an unknown context. Making this change requires touching a large amount of code, but I'm not sure that it's easily split into self-contained parts.

keep the part about running signal handlers before entering blocking sections. It reduces the delay and the risk of dropping a signal because it is already pending.

I think the updated example shows why this is a bad idea. Also, coalescing repeated signals is not a bad thing: this is how Unix signals have always worked.

@mrvn (Contributor) commented Apr 18, 2017

This is not true, and is part of what makes signal handling so hairy. If the current thread holds locks (like the channel locks in io.c), then signal handlers running in the current thread will deadlock the system or crash if they try to acquire them, while signal handlers running in other threads will simply wait until the operation completes and the lock can be acquired.

Sure. But now think again: What happens if the other thread holds the locks (like the channel locks in io.c). Then the signal handler running in the other thread will deadlock.

There are only two real solutions: (1) add critical sections where signal handlers won't run, or (2) don't allow functions that need locks in signal handlers.

@stedolan (Contributor, author)

Sure. But now think again: What happens if the other thread holds the locks (like the channel locks in io.c). Then the signal handler running in the other thread will deadlock.

No. With this patch, threads never run signal handlers while holding locks. If a signal arrives during a blocking operation, the operation returns EINTR which is propagated upwards, and the lock is released before the signal is handled.

@mrvn (Contributor) commented May 4, 2017

Actually yes, even with this patch. Your idea of having caml_leave_blocking_section_nosig() is correct, but you only use it once, for write. Reads can catch signals too.

If you fix that and signal handlers are never run while holding locks, then why not keep running them before entering a blocking section? What I mean is that there should be a caml_enter_blocking_section_nosig() too, and anything holding a lock should use the _nosig flavour for both enter and leave. That way signal propagation is only degraded in the case where a lock is held.

On a side note: the problem here is that signal handlers can't be run while holding a channel lock. Wouldn't it make sense to have the lock/unlock functions disable/enable signal processing? That way one couldn't forget to use caml_leave_blocking_section_nosig(), as happened in this patch.

@xavierleroy (Contributor)

Just a reminder: now that 4.06 is branched off, let's restart this proposal and try to make it converge in time for 4.07. I think it goes in the right direction, even though it's not completely clear in my head yet.

@damiendoligez added this to the 4.07-or-later milestone Sep 25, 2017
@damiendoligez (Member) left a review comment

I have only partially reviewed this PR (I'll need to read it a few more times) but here are a few comments.

  • You write int* x instead of int *x. Don't do that because it goes against the grain: the * operator is right-associative and your style is misleading. Someday you'll end up writing int *x, y instead of int *x, *y. More important, it's inconsistent with the rest of the code base.
  • I'd suggest defining and using a macro:
#define perform_interruptible_io(channel, expr) \
  do { \
    io_result __caml__err; \
    do { __caml__err = expr; } while (check_retry(channel, __caml__err)); \
  } while (0)
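A hypothetical use site (the callee and its signature below are illustrative, not from the patch):

/* Retry the low-level write until it succeeds or fails with a real error;
   check_retry releases and retakes the channel lock around any handling. */
perform_interruptible_io(channel, caml_try_putblock(channel, p, len, &written));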

static io_result caml_lseek_set(int fd, file_offset offset)
{
file_offset off;
off = lseek(fd, offset, SEEK_SET);
A Member commented:

You need a blocking section around this call to lseek because it is a blocking system call, and indeed the original code has both calls to lseek inside a blocking section.
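For instance, a sketch of what that would look like (whether lseek can actually block here is questioned just below):

/* Sketch: mirror the original code by wrapping lseek in a blocking section. */
static io_result lseek_set_sketch(int fd, file_offset offset)
{
  file_offset off;
  caml_enter_blocking_section();
  off = lseek(fd, offset, SEEK_SET);
  caml_leave_blocking_section_nosig();
  if (off == -1) return errno;
  return 0;
}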

A Contributor replied:

Do you have a reference for this assertion? The (Linux) man page for lseek does not seem to mention anything like this. It does not even mention the fact that it can return EINTR.

A Contributor added:

In any case, this should be made uniform with caml_lseek_end.

@mshinwell (Contributor)

@damiendoligez FWIW, I write "int*" rather than "int *" because I see the * as part of the type, to be separated from the name of the variable. I prefer to disallow the form where multiple variables are declared at the same time to avoid any confusion (especially where one of them has an initializer, as I pointed out to @nojb the other day).

@gasche (Member) commented Dec 13, 2017

(Count me in the int * camp, although int * p is allowed. The int* p model crumbles when it tries to say int const* p.)

@jhjourdan mentioned this pull request Dec 22, 2017
@damiendoligez (Member)

I write "int*" rather than "int *" because I see the * as part of the type, to be separated from the name of the variable.

Yes. I used to do that too, and it would be the right way if C were designed properly. But it isn't.

@xavierleroy (Contributor)

Where are we with this PR? Bug reports that would be solved by this PR keep arriving, and I thought it was a prerequisite for multicore OCaml. So, what needs to be done (collectively) to finish and merge this PR?

@gadmm (Contributor) commented Apr 25, 2020

To be more specific, could we have an idea of a deadline to fix the bug? I have it on my stack because I want to make sure that it interacts well with my ongoing implementation of masking. My understanding is that its urgency comes not from multicore depending on it (indeed it should not break programs, so there is no need for a transition period), but only from the fact that it would already solve existing bugs.

@xavierleroy (Contributor)

I, too, would very much like to see this work restarted and brought to completion. Not only would it fix a number of issues, but it is also a prerequisite for Multicore OCaml, as far as I understand.

@kayceesrk (Contributor)

@gadmm You mentioned that this work is on your list. Would you have time to look at this in the coming weeks? Otherwise, we can schedule it in the multicore plan. Either way, we would be glad to help and see how this fits with your masking implementation.

@gadmm (Contributor) commented Apr 30, 2020

Yes, I proposed to Stephen some time ago to have a look (but I have not had the time yet). I cannot yet wrap my head around what the correct behaviour for EINTR is inside acquisition and release functions (which are masked), so I need to look at the details. My current plan is to propose a masking design and implementation, maybe with a (hopefully quick) RFC process, with hopes that it gets into 4.12 (4.13 at the latest). I wanted to look at this patch once the details of masking are known, so I proposed to Stephen that we collectively aim for 4.13 with this PR. That is not exactly in the coming weeks, but does this sound good to you?

@kayceesrk (Contributor)

Sounds good. The multicore team is currently looking at the interaction between signal handling and domains. We should have a plan by the time your RFC for masking is ready.

@gadmm (Contributor) commented Jun 14, 2020

Question for @stedolan or @xavierleroy.

I wonder if we should also look at the handlers trap_handler and segv_handler in signals_nat.c since they raise directly into OCaml. caml_raise can call caml_fatal_uncaught_exception which in turn can execute arbitrary OCaml code. (In addition, caml_array_bound_error can call fprintf, though only if it runs without the standard library which should be very rare). Is there an argument for saying this is safe? I wonder if the approach taken for AMD64+OSX (cf. RETURN_AFTER_STACK_OVERFLOW) should be generalised. Or at least one should handle uncaught exceptions differently. Any thoughts?

@xavierleroy (Contributor)

About RETURN_AFTER_STACK_OVERFLOW: it doesn't change the semantics in that the exception raising still takes place at the point in the OCaml code where the signal was received; however, it seems to behave better w.r.t. signal mask restoration and alternate stack switching, so maybe we could use this trick more widely.

About trap_handler and segv_handler: they handle "synchronous" signals, generated by exceptional conditions in the OCaml code itself, as opposed to being sent asynchronously from another process.

The SIGTRAP signal is generated on PowerPC by bound checking operations, so I would claim it is safe to raise an exception and execute arbitrary OCaml code at this point.

The SIGSEGV signal corresponds to a stack overflow, more precisely a memory access inside OCaml-compiled code just "below" the stack limit. I don't know whether all such memory accesses are "safe points" in the sense that it's OK to raise an exception and execute arbitrary OCaml code.

@jhjourdan (Contributor) left a review comment

I did a light review of this PR, and have some comments. I will do a more thorough review once this PR has been rebased, the bugs I mention have been fixed, and some design decisions have been taken.

In more detail, here is what I think is left to be done before this PR can be merged:

  • rebasing on top of trunk
  • there seems to be a problem with GC roots management in io.c (or some more documentation is needed)
  • while I appreciate the clear distinction between low-level and high-level functions for manipulating channels, I would very much like a simple way to distinguish the two, such as a clear naming convention
  • It is not clear to me which guarantee this PR is trying to achieve. Indeed, there is still no guarantee that a signal will be handled with reasonable delay: if a signal arrives right before a system call, then no EINTR error code will be returned, and we will need to wait for the end of the system call before the signal gets handled. This is made even worse by the fact that this PR uses caml_leave_blocking_section_nosig, which, well, does not check for signals. If we do not want that guarantee, then we can very well avoid checking for signals during I/O. The fix for this problem is the infamous "self-pipe trick" (a minimal standalone sketch follows below). I know this is a larger change, but at least it should be considered for this PR.
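For reference, a minimal, self-contained sketch of the self-pipe trick (standalone C, not runtime code; the setup and helper names are illustrative):

/* Self-pipe trick sketch: the handler writes a byte to a non-blocking pipe,
   and blocking waits poll both the real fd and the pipe, so a signal that
   arrives just before the wait still interrupts it. */
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <signal.h>
#include <unistd.h>

static int sig_pipe[2];

static void on_signal(int signo)
{
  char c = (char) signo;
  int saved_errno = errno;
  (void) write(sig_pipe[1], &c, 1);   /* async-signal-safe; failures are ignored */
  errno = saved_errno;
}

static int setup(void)
{
  if (pipe(sig_pipe) == -1) return -1;
  fcntl(sig_pipe[0], F_SETFL, O_NONBLOCK);
  fcntl(sig_pipe[1], F_SETFL, O_NONBLOCK);
  signal(SIGINT, on_signal);
  return 0;
}

/* Returns 1 if fd is readable, 0 if a signal was queued first. */
static int wait_readable(int fd)
{
  struct pollfd fds[2] = { { fd, POLLIN, 0 }, { sig_pipe[0], POLLIN, 0 } };
  int r;
  do { r = poll(fds, 2, -1); } while (r == -1 && errno == EINTR);
  return (fds[1].revents & POLLIN) ? 0 : 1;
}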

@@ -118,25 +154,45 @@ CAMLexport void caml_close_channel(struct channel *channel)
unlink_channel(channel);
caml_stat_free(channel->name);
caml_stat_free(channel);
return;
A Contributor commented:

Unnecessary return.

Comment on lines -687 to -689
Lock(channel);
caml_output_val(channel, v, flags);
Unlock(channel);
A Contributor commented:

If I understand correctly, this means that while we took the lock only once before this PR, we now take the lock once for every externed block. I can see two problems with that:

  • Atomicity is no longer guaranteed, and the marshalled data can be interleaved with whatever a signal handler decided to write. But this is true in general for anything in this PR, so I don't think there is anything we can do.
  • Performance: are we sure that the repeated lock acquisition/release has negligible cost?


Comment on lines +188 to +193
do {
err = caml_lseek_end(channel->fd, &end);
} while (check_retry(channel, err));
do {
err = caml_lseek_set(channel->fd, channel->offset);
} while (check_retry(channel, err));
A Contributor commented:

One issue with this code is that it is bogus if some signal handler is called between the two lseek calls, since the position of the file descriptor will not match channel->offset.

return 0;
}

static io_result caml_lseek_set(int fd, file_offset offset)
A Contributor commented:

Suggested change
static io_result caml_lseek_set(int fd, file_offset offset)
static io_result lseek_set(int fd, file_offset offset)

It seems like the convention is to not use the caml_ prefix for static functions.

return;
}

static io_result caml_lseek_end(int fd, file_offset* end)
A Contributor commented:

Suggested change
static io_result caml_lseek_end(int fd, file_offset* end)
static io_result lseek_end(int fd, file_offset* end)

exception */
// FIXME really?
A Contributor commented:

What's the matter here?

@@ -643,9 +787,7 @@ CAMLprim value caml_ml_output_partial(value vchannel, value buff, value start,
struct channel * channel = Channel(vchannel);
int res;

Lock(channel);
res = caml_putblock(channel, &Byte(buff, Long_val(start)), Long_val(length));
A Contributor commented:

There seems to be an issue with roots here: although buff is registered as a root, the pointer passed to caml_putblock isn't, and, because of the retry loop in caml_putblock, this pointer may be used after a signal handler (and thus a GC pass) has run.

The old code was fine (although very fragile) because the pointer was only used in caml_putblock before releasing the runtime lock, which is no longer the case now.

More generally, I find that root management in this file is particularly difficult to follow. Some more review and documentation would be more than welcome.
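For illustration, one possible shape of a fix (a sketch with an assumed caml_try_putblock signature, not the patch): recompute the derived pointer from the rooted value on every retry, so that a GC triggered by a signal handler cannot leave a stale pointer in use.

/* Sketch: buff stays registered as a root; the char* derived from it is
   recomputed after any point where the GC may have moved buff. */
do {
  err = caml_try_putblock(channel, &Byte(buff, Long_val(start)),
                          Long_val(length), &written);
  /* check_retry may run signal handlers (and hence the GC); on the next
     iteration &Byte(buff, ...) is recomputed from the up-to-date buff. */
} while (check_retry(channel, err));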

@stedolan mentioned this pull request Jun 29, 2020
@stedolan (Contributor, author)

Closed in favour of #9722.

Thanks for the reviews, all. (I'll add you to the Changes entry for #9722)

@stedolan closed this Jun 29, 2020
stedolan pushed a commit to stedolan/ocaml that referenced this pull request Mar 21, 2023
EmileTrotignon pushed a commit to EmileTrotignon/ocaml that referenced this pull request Jan 12, 2024