[runtime] Fix backup thread vs interrupt race (#13408)
Conversation
jmid
left a comment
LGTM!
This furthermore matches my observations from #13386.
Finally, with the fix, I'm no longer able to trigger the assertion failure
- after running on my local Linux box for ~4x as long as the longest it previously took to trigger, and
- after hours of testing in both an Alpine Docker container and on a MacBook Pro, where I could previously reproduce it.
It calls for a Changelog entry, IMHO.
This is a nice analysis, but I'm not convinced by the fix itself. I think it is error-prone to put this responsibility in each caller, while we could easily ensure it in `handle_incoming()` itself, by only asserting when there are pending interrupts:

```diff
 static int handle_incoming(struct interruptor* s)
 {
   int handled = interruptor_has_pending(s);
-  CAMLassert (s->running);
   if (handled) {
+    CAMLassert (s->running);
     interruptor_set_handled(s);
     stw_handler(domain_self->state);
   }
   return handled;
 }
```
Note that there is also a synchronization question here: is the flag, for example, synchronized correctly between the threads that access it, or should it be made atomic? I'm not sure.
Your proposed diff ought to work as well. I am a bit reluctant to move assertions, however, as 1) I don't know what the original code author was thinking and whether this is an honest bug or an intentional placement of this assert, and 2) the fact that any given condition may have changed status after a call to a possibly blocking function is a textbook TOCTTOU, and won't disappear anytime soon.
It only makes sense to confirm the domain interruptor is in running state if there are interrupts to handle. There is a possibility of a domain exiting while the backup thread, in BT_IN_BLOCKING_SECTION state, is waiting for the domain lock; it will then proceed to invoke handle_incoming() as there had been pending interrupts at the time it tried to acquire the domain lock, but domain termination has taken care of them. Therefore handle_incoming() should be safe to invoke with the interruptor in non-running state, as long as there are indeed no pending interrupts.
The alternative patch LGTM too!
Hard to tell. The logs are useless and I am not sure the error output (
The same testcase failed on #13410 on POWER yesterday and succeeded after a rerun, so I suspect it is a buggy testcase. |
It was originally used for control flow. See d3e78c0. If it is no longer used other than assertions, I'd prefer to remove this variable from the code. |
It seems to be used in this other debug assertion block, which performs a different assertion check and which will then also have to go: lines 1650 to 1660 at commit 7137407.
What @jmid said. I think it is worth keeping at least the number-of-domains-in-an-STW accounting logic. |
Ok. It seems useful to keep the variable around. |
As a(n assertion) bugfix, could this be cherry-picked to 5.3? 🙏 |
[runtime] Fix backup thread vs interrupt race (cherry picked from commit 0125942)
Sure, done. |
This is a proposed fix for #13386. Quoting what I wrote there:

I think this is a subtle race when reusing domain states.

- A domain is terminating and holds its `domain_lock`. At the same time, its backup thread is in `BT_IN_BLOCKING_SECTION` state and waiting for `domain_lock`, because the `caml_incoming_interrupts_queued()` condition was true.
- In `domain_terminate`, the interrupts are handled as part of the `while (!finished)` loop. And then the interruptor gets marked as no longer running.
- `domain_lock` gets released.
- The backup thread acquires `domain_lock` and resumes execution. Since the `caml_incoming_interrupts_queued()` condition was satisfied prior to attempting to take the lock, it will now invoke `caml_handle_incoming_interrupts()`, which will trigger the assertion.

What needs to be done is, every time there is a possibly blocking operation between `caml_incoming_interrupts_queued()` and `caml_handle_incoming_interrupts()`, to check again for `caml_incoming_interrupts_queued()`.