
EINTR-based signals #1128

Open · stedolan wants to merge 1 commit into base: trunk
Conversation

@stedolan (Contributor) commented Mar 29, 2017

(This is quite a subtle change, I'd appreciate nitpicking about locking, signal handling and error checking. It's still incomplete at the moment, but I'd like some discussion. Apologies about the length, but the fallout outside io.c is fairly minimal, and the changes within are mostly mechanical.)

Currently, OCaml signal handlers for signals arriving during IO are run directly from the Unix signal handler, which has a number of problems (see discussion on #1107 for details). Additionally, they may be run while holding channels' locks, which causes the deadlocks/crashes detailed in MPR#7503.

Nowadays, a better solution is available: by relying on blocking system calls to return EINTR when interrupted by a signal, we can do the OCaml signal handling outside of the Unix signal handler. (The old BSD systems that couldn't return EINTR are thankfully thin on the ground now).

This patch rewrites the error-handling logic for IO channels in this style, which makes signal handling much more robust. I need to do a lot more testing, but one program whose behaviour improves is this:

let counter = ref 0
let () =
  Sys.set_signal Sys.sigint (Sys.Signal_handle (fun _ ->
    incr counter;
    Printf.printf "\nsignalled %d times\n%!" !counter;
    if Random.int 10 = 0 then Gc.full_major ()));
  while true do
    print_string "\r";
    flush stdout
  done

This program does allocation, I/O, GC and compaction from a signal handler. (I killed it after handling ~500k signals successfully. Trunk sometimes manages 1).

The biggest code changes are in io.c, where code like this:

Lock(chan);
operation_that_might_fail();
Unlock(chan);

becomes this:

Lock(chan);
do {
  err = operation_that_might_fail();
} while (check_retry(chan, err));
Unlock(chan);

The old operation_that_might_fail could raise exceptions or invoke signal handlers directly. These had to contend with the channel possibly being locked (and possibly being left in an inconsistent state if forcibly unlocked), with the signal handler re-entering channel code (and deadlocking if it needs the same lock), and with handler-triggered GC running finalisers under an unknown collection of channel locks.

The new operation_that_might_fail returns an error code, and if the error code is nonzero then check_retry unlocks the channel, handles the error or signal, and retries the operation (if it was interrupted by a signal). The code is longer, but the error paths are easier to understand.

Some details:

Low-level I/O

The low-level IO functions caml_read_fd and caml_write_fd now return an error code instead of raising exceptions, and never run signal handlers themselves (returning EINTR if a signal was triggered).

Similarly, there are new functions caml_try_putblock and caml_try_getblock in io.h which return error codes. These low-level operations expect their channel arguments to be locked, but the higher-level operations no longer do. (Holding a channel lock across any operation that may allocate, or across caml_leave_blocking_section, causes deadlocks, so explicitly taking those locks is not something most code should do.)

Blocking sections

caml_enter_blocking_section no longer runs signal handlers, because this caused a race condition in every IO operation:

CAMLparam1(vchannel);
struct channel * chan = Channel(vchannel);
Lock(chan);
caml_enter_blocking_section();
err = caml_read_fd(fd, &chan->buff, len, &nread);

In code like the above, if caml_enter_blocking_section runs a signal handler, then if the signal handler uses the same channel a deadlock arises when the signal handler tries to Lock(chan).

I think there is no need for caml_enter_blocking_section to run signal handlers: it is not a blocking operation, and it comes at the end of normal OCaml code in which signals were regularly polled.

caml_leave_blocking_section still checks for signals, but there is a new version caml_leave_blocking_section_nosig which does not, for use in caml_read_fd and other low-level IO functions that return explicit EINTR.

Atomicity

I tried to maintain the same atomicity guarantees that currently exist. In particular, caml_really_{get,put}block (which may be implemented as multiple calls to read / write internally) are atomic with respect to systhreads if not interrupted, since they hold the channel lock for the duration of the call.

These functions are not atomic if interrupted by a signal. In particular, if a signal handler writes to a channel, the signal handler's output may appear anywhere in the output stream.

To do

  • Spacetime support (Spacetime does some slightly complicated things with channel locking)
  • Windows support
  • Atomic test-and-clear of caml_pending_signals
  • Remove the now-unneeded mechanism for unlocking channels on exceptions
  • Avoid running OCaml code from Unix signal handlers
  • Avoid holding channel locks across signal handlers
  • Make low-level I/O primitives avoid raising
@mrvn (Contributor) commented Mar 30, 2017

Blocking Sections

caml_enter_blocking_section no longer runs signal handlers, because this caused a race condition in every IO operation:

CAMLparam1(buf);
char* buf = &Byte(buf, 0);
caml_enter_blocking_section();
err = caml_read_fd(fd, buf, len, &nread);

In code like the above, if caml_enter_blocking_section runs a signal handler, then a GC may be triggered and buf might move, causing memory corruption in the following caml_read_fd.

This code and argument is just wrong.

  1. After entering a blocking section, nothing on the OCaml heap may be accessed, because the GC may run in another thread and move things around. The use of buf in the caml_read_fd call is therefore illegal and may corrupt the OCaml heap at any time. Since caml_read_fd may block for a long time, I would consider this near certain to happen.
  2. Running the signal handlers in caml_enter_blocking_section before releasing the runtime lock is equivalent to running the signal handler after acquiring the runtime lock in caml_leave_blocking_section in another thread. The reason to run it in caml_enter_blocking_section is to reduce delays, especially in the single-threaded case. Anyway, the signal handler can (and will) run at any time between entering and leaving the blocking section. Not running it in caml_enter_blocking_section just makes corruption more random.

You might notice that in the stdlib, caml_read_fd and caml_write_fd are called with channel->buff, which belongs to a C structure living outside the OCaml heap. channel->buff is unaffected by GC compaction.

Similarly, in otherlibs/unix/read.c you have:

char iobuf[UNIX_BUFFER_SIZE];

Begin_root (buf);
  numbytes = Long_val(len);
  if (numbytes > UNIX_BUFFER_SIZE) numbytes = UNIX_BUFFER_SIZE;
  caml_enter_blocking_section();
  ret = read(Int_val(fd), iobuf, (int) numbytes);
  caml_leave_blocking_section();
  if (ret == -1) uerror("read", Nothing);
  memmove (&Byte(buf, Long_val(ofs)), iobuf, ret);
End_roots();

As you can see, the read is performed into a stack-local buffer and only copied into the OCaml value later, after leaving the blocking section.

In conclusion I would suggest 2 things:

  1. The commit seems to change multiple things at once. Split the commit up into smaller chunks that each fix a single problem.
  2. Keep the part about running signal handlers before entering blocking sections. It reduces the delay and the risk of dropping a signal because it is already pending.
@stedolan (Contributor, Author) commented Apr 6, 2017

This code and argument is just wrong.

You're right, not quite sure what I was thinking there. I've updated the example to show the real issue.

Running the signal handlers in caml_enter_blocking_section before releasing the runtime lock is equivalent to running the signal handler after acquiring the runtime lock in caml_leave_blocking_section in another thread.

This is not true, and is part of what makes signal handling so hairy. If the current thread holds locks (like the channel locks in io.c), then signal handlers running in the current thread will deadlock the system or crash if they try to acquire them, while signal handlers running in other threads will simply wait until the operation completes and the lock can be acquired. See MPR#7503.

The commit seems to change multiple things at once. Split the commit up into smaller chunks that fix a single problem.

There is one change being made here, which is that IO functions should return EINTR and allow signals to be handled by the caller, rather than attempting to handle signals in an unknown context. Making this change requires touching a large amount of code, but I'm not sure that it's easily split into self-contained parts.

keep the part about running signal handlers before entering blocking sections. It reduces the delay and the risk of dropping a signal because it is already pending.

I think the updated example shows why this is a bad idea. Also, coalescing repeated signals is not a bad thing: this is how Unix signals have always worked.

@mrvn (Contributor) commented Apr 18, 2017

This is not true, and is part of what makes signal handling so hairy. If the current thread holds locks (like the channel locks in io.c), then signal handlers running in the current thread will deadlock the system or crash if they try to acquire them, while signal handlers running in other threads will simply wait until the operation completes and the lock can be acquired.

Sure. But now think again: What happens if the other thread holds the locks (like the channel locks in io.c). Then the signal handler running in the other thread will deadlock.

There are only two real solutions: 1) add critical sections where signal handlers won't run, or 2) don't allow functions needing locks in signal handlers.

@stedolan (Contributor, Author) commented Apr 20, 2017

Sure. But now think again: What happens if the other thread holds the locks (like the channel locks in io.c). Then the signal handler running in the other thread will deadlock.

No. With this patch, threads never run signal handlers while holding locks. If a signal arrives during a blocking operation, the operation returns EINTR which is propagated upwards, and the lock is released before the signal is handled.

@mrvn (Contributor) commented May 4, 2017

Actually yes, even with this patch. Your idea of having caml_leave_blocking_section_nosig() is correct, but you only use it once, for write. Reads can catch signals too.

If you fix that and signal handlers are never run while holding locks then why not keep running them before entering a blocking section? What I mean is that there should be caml_enter_blocking_section_nosig() too and anything holding a lock should use the _nosig flavour for both enter and leave. That way signal propagation is only degraded for the case where a lock is held.

On a side note: the problem here is that signals can't be run while holding a channel lock. Wouldn't it make sense to have the lock/unlock functions disable/enable signal processing? That way one couldn't forget to use caml_leave_blocking_section_nosig() like in this patch.

@xavierleroy (Contributor) commented Sep 23, 2017

Just a reminder: now that 4.06 is branched off, let's restart this proposal and try to make it converge in time for 4.07. I think it goes in the right direction, even though it's not completely clear in my head yet.

@damiendoligez (Member) left a comment

I have only partially reviewed this PR (I'll need to read it a few more times) but here are a few comments.

  • You write int* x instead of int *x. Don't do that, because it goes against the grain of the language: in a declaration, the * binds to the declarator, not the type, so the int* x style is misleading. Someday you'll end up writing int* x, y when you mean int *x, *y. More importantly, it's inconsistent with the rest of the code base.
  • I'd suggest defining and using a macro:

#define perform_interruptible_io(channel, expr) \
  do { \
    io_result __caml__err; \
    do { __caml__err = (expr); } while (check_retry(channel, __caml__err)); \
  } while (0)
static io_result caml_lseek_set(int fd, file_offset offset)
{
  file_offset off;
  off = lseek(fd, offset, SEEK_SET);

@damiendoligez (Member) commented Dec 13, 2017:
You need a blocking section around this call to lseek because it is a blocking system call, and indeed the original code has both calls to lseek inside a blocking section.

@mshinwell (Contributor) commented Dec 13, 2017

@damiendoligez FWIW, I write "int*" rather than "int *" because I see the * as part of the type, to be separated from the name of the variable. I prefer to disallow the form where multiple variables are declared at the same time to avoid any confusion (especially where one of them has an initializer, as I pointed out to @nojb the other day).

@gasche (Member) commented Dec 13, 2017

(Count me in the int * camp, although int * p is allowed. The int* p model crumbles when it tries to say int const* p.)

@jhjourdan jhjourdan mentioned this pull request Dec 22, 2017
@damiendoligez (Member) commented Feb 7, 2018

I write "int*" rather than "int *" because I see the * as part of the type, to be separated from the name of the variable.

Yes. I used to do that too, and it would be the right way if C were designed properly. But it isn't.

@damiendoligez damiendoligez removed this from the consider-for-4.07 milestone Jun 5, 2018
@damiendoligez damiendoligez self-assigned this Jun 5, 2018
@gadmm gadmm mentioned this pull request Sep 22, 2019
@gadmm gadmm mentioned this pull request Sep 29, 2019
@xavierleroy (Contributor) commented Oct 12, 2019

Where are we with this PR? Bug reports that would be solved by this PR keep arriving, and I thought it was a prerequisite for multicore OCaml. So, what needs to be done (collectively) to finish and merge this PR?
