New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
EINTR-based signals #1128
EINTR-based signals #1128
Conversation
- Avoid running OCaml code from Unix signal handlers - Avoid holding channel locks across signal handlers - Make low-level I/O primitives avoid raising
Blocking Sections
This code and argument is just wrong.
You might notice that in the stdlib caml_read_fd and caml_write_fd are called with channel->buff, which is a C structure and lives outside the ocaml heap. The channel->buf is unaffected by Gc compaction. Similary in otherlibs/unix/read.c you have
As you can see the read is performed into a stack local buffer and only copied into the Byte later, after leaving the blocking section. In conclusion I would suggest 2 things:
|
You're right, not quite sure what I was thinking there. I've updated the example to show the real issue.
This is not true, and is part of what makes signal handling so hairy. If the current thread holds locks (like the channel locks in
There is one change being made here, which is that IO functions should return EINTR and allow signals to be handled by the caller, rather than attempting to handle signals in an unknown contexts. Making this change requires touching a large amount of code, but I'm not sure that it's easily split into self-contained parts.
I think the updated example shows why this is a bad idea. Also, coalescing repeated signals is not a bad thing: this is how Unix signals have always worked. |
Sure. But now think again: What happens if the other thread holds the locks (like the channel locks in io.c). Then the signal handler running in the other thread will deadlock. There are only 2 real solutions: 1) add critical sections where signal handlers won't run, 2) don't allow function needing locks in signal handlers. |
No. With this patch, threads never run signal handlers while holding locks. If a signal arrives during a blocking operation, the operation returns EINTR which is propagated upwards, and the lock is released before the signal is handled. |
Actually yes, even with this patch. Your idea of having caml_leave_blockiong_section_nosig() is correct. But you only use that once for write. Reads can catch signals too. If you fix that and signal handlers are never run while holding locks then why not keep running them before entering a blocking section? What I mean is that there should be caml_enter_blocking_section_nosig() too and anything holding a lock should use the _nosig flavour for both enter and leave. That way signal propagation is only degraded for the case where a lock is held. On a side node: The problem here is that signals can't be run while holding a channel lock. Wouldn't it make sense to have the lock/unlock functions disable/enable signals processing? That way one couldn't forget to use caml_leave_blockiong_section_nosig() like in this patch. |
Just a reminder: now that 4.06 is branched off, let's restart this proposal and try to make it converge in time for 4.07. I think it goes in the right direction, even though it's not completely clear in my head yet. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have only partially reviewed this PR (I'll need to read it a few more times) but here are a few comments.
- You write
int* x
instead ofint *x
. Don't do that because it goes against the grain: the*
operator is right-associative and your style is misleading. Someday you'll end up writingint *x, y
instead ofint *x, *y
. More important, it's inconsistent with the rest of the code base. - I'd suggest defining and using a macro:
#define perform_interruptible_io(channel, expr) \
do { \
io_result __caml__err; \
do { __caml__err = expr; } while (check_retry(channel, __caml__err)); \
}while(0)
static io_result caml_lseek_set(int fd, file_offset offset) | ||
{ | ||
file_offset off; | ||
off = lseek(fd, offset, SEEK_SET); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need a blocking section around this call to lseek
because it is a blocking system call, and indeed the original code has both calls to lseek
inside a blocking section.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have a reference for this assertion? The (Linux) man page for lseek does not seem to mention anything like this. IT does not even mention the fact that it can return EINTR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In any case, this should be made uniform with caml_lseek_end
.
@damiendoligez FWIW, I write " |
(Count me in the |
Yes. I used to do that too and it would be the right way if C was designed properly. But it isn't. |
Where are we with this PR? Bug reports that would be solved by this PR keep arriving, and I thought it was a prerequisite for multicore OCaml. So, what needs to be done (collectively) to finish and merge this PR? |
To be more specific, could we have an idea of a deadline to fix the bug? I have it on my stack because I want to make sure that it interacts well with my ongoing implementation of masking. My understanding is that it is not urgent because multicore depends on it (indeed it should not break programs, so there is no need for a transition period), but only because it would already solve bugs. |
I, too, would very much like to see this work restarted and brought to completion. Not only it would fix a number of issues, but it is also a prerequisite for Multicore OCaml, as far as I understand. |
@gadmm You mentioned that this work is on your list. Would you have time to look at this in the coming weeks? Otherwise, we can schedule it in the multicore plan. Either way, we would be glad to help and see how this fits with your masking implementation. |
Yes, I proposed some time ago to Stephen to have a look (but I did not have the time yet). I cannot yet wrap my head around what is the correct behaviour for EINTR inside acquisition and release functions (which are masked), so I need to look at the details. My current plan is to propose a masking design and implementation, maybe with a (hopefully quick) RFC process, with hopes that it gets into 4.12 (4.13 at the latest). I wanted to look at this patch once the details of masking are known, so I proposed to Stephen that we collectively aim for 4.13 with this PR. That is not exactly in the coming weeks, but does this sound good to you? |
Sounds good. The multicore team is currently looking at the interaction between signal handling and domains. We should have a plan by the time your RFC for masking is ready. |
Question for @stedolan or @xavierleroy. I wonder if we should also look at the handlers |
About About trap_handler and segv_handler: they handle "synchronous" signals, generated by exceptional conditions in the OCaml code itself, as opposed to being sent asynchronously from another process. The SIGTRAP signal is generated on PowerPC by bound checking operations, so I would claim it is safe to raise an exception and execute arbitrary OCaml code at this point. The SIGSEGV signal corresponds to a stack overflow, more precisely a memory access inside OCaml-compiled code just "below" the stack limit. I don't know whether all such memory accesses are "safe points" in the sense that it's OK to raise an exception and execute arbitrary OCaml code. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a light review of this PR, and have some comments. I will do a more thorough review when this PR will be rebased, the bugs I mention will be fixed and some design decisions will be taken.
More in details, here is what I think is left to be done so that this PR could be merged:
- rebasing on top of trunk
- there seems to be a problem with GC roots management in
io.c
(or some more documentation is needed) - while I appreciate the clear distinction between low-level and high-level functions for manipulating channels, I would very much like having a simple way to distinguish the two, such as a clear naming convention
- It is not clear to me which guarantee this PR is trying to achieve. Indeed, there is still no guarantee that a signal will be handled with reasonable delay: if a signal arrives right before a system call, then no EINTR error code will be returned, and we will need to wait for the end of the system call before the signal gets handled. This is made even worse by the fact that this PR uses
caml_leave_blocking_section_nosig
, which, well, does not check for signals. If we do not want that guarantee, then we can very well avoid checking for signlas during I/O. The fix for this problem is the infamous "self-pipe trick". I know this is a larger change, but at least it should be considered for this PR.
@@ -118,25 +154,45 @@ CAMLexport void caml_close_channel(struct channel *channel) | |||
unlink_channel(channel); | |||
caml_stat_free(channel->name); | |||
caml_stat_free(channel); | |||
return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unnecessary return
.
Lock(channel); | ||
caml_output_val(channel, v, flags); | ||
Unlock(channel); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand well, this means that while we took the lock only once before this PR, we take the lock once for every externed block. I can see two problems with that:
- Atomicity is no longer guaranteed, and the unmarshalled data can be interleaved with whatever a signal handler decided to write. But this is true in general for anything in this PR, so I don't think there is anything we can do.
- Performance: are we sure that the repeated lock acquisition/release has negligible cost?
static io_result caml_lseek_set(int fd, file_offset offset) | ||
{ | ||
file_offset off; | ||
off = lseek(fd, offset, SEEK_SET); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have a reference for this assertion? The (Linux) man page for lseek does not seem to mention anything like this. IT does not even mention the fact that it can return EINTR.
static io_result caml_lseek_set(int fd, file_offset offset) | ||
{ | ||
file_offset off; | ||
off = lseek(fd, offset, SEEK_SET); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In any case, this should be made uniform with caml_lseek_end
.
do { | ||
err = caml_lseek_end(channel->fd, &end); | ||
} while (check_retry(channel, err)); | ||
do { | ||
err = caml_lseek_set(channel->fd, channel->offset); | ||
} while (check_retry(channel, err)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One issue with this code is that it is bogus if some signal handler is called between the two lseek
calls, since the position of the file description will not match channel->offset
.
return 0; | ||
} | ||
|
||
static io_result caml_lseek_set(int fd, file_offset offset) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
static io_result caml_lseek_set(int fd, file_offset offset) | |
static io_result lseek_set(int fd, file_offset offset) |
It seems like the convention is to not use the caml_
prefix for static functions.
return; | ||
} | ||
|
||
static io_result caml_lseek_end(int fd, file_offset* end) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
static io_result caml_lseek_end(int fd, file_offset* end) | |
static io_result lseek_end(int fd, file_offset* end) |
exception */ | ||
// FIXME really? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the matter here?
@@ -643,9 +787,7 @@ CAMLprim value caml_ml_output_partial(value vchannel, value buff, value start, | |||
struct channel * channel = Channel(vchannel); | |||
int res; | |||
|
|||
Lock(channel); | |||
res = caml_putblock(channel, &Byte(buff, Long_val(start)), Long_val(length)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There seems to be an issue with roots here: altough buff
is registered as a root, the pointer passed to caml_putblock
isn't, and, due to the retry loop in caml_putblock
, it is possible that this pointer be used after a signal handler (and thus, a GC pass) is executed.
The old code was fine (altough very fragile) because the pointer was only used in caml_putblock
before releasing the runtime lock, which is no longer the case now.
More generally, I find that root management in this file is particularly difficult to follow. Some more review and documentation would be more than welcome.
(This is quite a subtle change, I'd appreciate nitpicking about locking, signal handling and error checking. It's still incomplete at the moment, but I'd like some discussion. Apologies about the length, but the fallout outside
io.c
is fairly minimal, and the changes within are mostly mechanical.)Currently, OCaml signal handlers for signals arriving during IO are run directly from the Unix signal handler, which has a number of problems (see discussion on #1107 for details). Additionally, they may be run while holding channels' locks, which causes the deadlocks/crashes detailed in MPR#7503.
Nowadays, a better solution is available: by relying on blocking system calls to return
EINTR
when interrupted by a signal, we can do the OCaml signal handling outside of the Unix signal handler. (The old BSD systems that couldn't return EINTR are thankfully thin on the ground now).This patch rewrites the error-handling logic for IO channels in this style, which makes signal handling much more robust. I need to do a lot more testing, but one program whose behaviour improves is this:
This program does allocation, I/O, GC and compaction from a signal handler. (I killed it after handling ~500k signals successfully. Trunk sometimes manages 1).
The biggest code changes are in
io.c
, where code like this:becomes this:
The old
operation_that_might_fail
possibly raises exceptions or invokes signal handlers, but these have to contend with the channel possibly being locked (and possibly being left in an inconsistent state if forcibly unlocked), the signal handler possibly trying to re-enter channel code (and deadlocking if it needs the same lock), or the signal handler causing GC which makes finalisers run under an unknown collection of channel locks.The new
operation_that_might_fail
returns an error code, and if the error code is nonzero thencheck_retry
unlocks the channel, handles the error or signal, and retries the operation (if it was interrupted by a signal). The code is longer, but the error paths are easier to understand.Some details:
Low-level I/O
The low-level IO functions
caml_read_fd
andcaml_write_fd
now return an error code instead of raising exceptions, and never run signal handlers themselves (returningEINTR
if a signal was triggered).Similarly, there are new functions
caml_try_putblock
andcaml_try_getblock
inio.h
which return error codes. These low-level operations expect their channel arguments to be locked, but the higher-level operations no longer do (Holding the channel locks across any operation which may allocate, or acrosscaml_leave_blocking_section
, causes deadlocks, so explicitly taking those locks is not something most code should do).Blocking sections
caml_enter_blocking_section
no longer runs signal handlers, because this caused a race condition in every IO operation:In code like the above, if
caml_enter_blocking_section
runs a signal handler, then if the signal handler uses the same channel a deadlock arises when the signal handler tries toLock(chan)
.I think there is no need for
caml_enter_blocking_section
to run signal handlers: it is not a blocking operation, and it comes at the end of normal OCaml code in which signals were regularly polled.caml_leave_blocking_section
still checks for signals, but there is a new versioncaml_leave_blocking_section_nosig
which does not, for use incaml_read_fd
and other low-level IO functions that return explicitEINTR
.Atomicity
I tried to maintain the same atomicity guarantees that currently exist. In particular,
caml_really_{get,put}block
(which may be implemented as multiple calls toread
/write
internally) are atomic with respect to systhreads if not interrupted, since they hold the channel lock for the duration of the call.These functions are not atomic if interrupted by a signal. In particular, if a signal handler writes to a channel, the signal handler's output may appear anywhere in the output stream.
To do