Skip to content

[runtime] Fix backup thread vs interrupt race#13408

Merged
gasche merged 1 commit intotrunkfrom
unknown repository
Aug 30, 2024
Merged

[runtime] Fix backup thread vs interrupt race#13408
gasche merged 1 commit intotrunkfrom
unknown repository

Conversation

@ghost
Copy link

@ghost ghost commented Aug 29, 2024

This is a proposed fix for #13386. Quoting what I wrote there:

I think this is a subtle race when reusing domain states.

  1. Domain D1 is terminating, holds its own domain_lock. At the same time, its backup thread is in BT_IN_BLOCKING_SECTION state and waiting for domain_lock, because the caml_incoming_interrupts_queued() condition was true.
  2. During domain_terminate, the interrupts are handled as part of the while (!finished) loop. And the the interruptor gets marked as no longer running.
  3. Then the domain_lock gets released.
  4. Backup thread gets the domain_lock and resumes execution. Since the caml_incoming_interrupts_queued() condition was satisfied prior to attempting to take the lock, it will now invoke caml_handle_incoming_interrupts(), which will trigger the assertion.

What needs to be done is that, every time there is a possibly blocking operation between caml_incoming_interrupts_queued() and caml_handle_incoming_interrupts(), to check again for caml_incoming_interrupts_queued().

Copy link
Member

@jmid jmid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!
This furthermore matches my observations from #13386.
Finally, with the fix, I'm not longer able to trigger the assertion failure

  • after running on my local Linux box for ~4x as long as the longest it previously took to trigger and
  • after hours of testing in both an Alpine Docker container and on a macBook Pro, where I could previously reproduce.

It calls for a Changelog entry, IMHO.

@gasche
Copy link
Member

gasche commented Aug 29, 2024

This is a nice analysis, but I'm not convinced by the fix itself. I think it is error-prone to put this responsability in each caller, while we could easily ensure that caml_handle_incoming_interrupts is a no-op when someone else (the backup thread or the main thread) has handled the interrupts. In fact the code of handle_incoming almost satisfies this property, except for the assertion. My proposed diff would then be as follows:

static int handle_incoming(struct interruptor* s)
{
  int handled = interruptor_has_pending(s);
- CAMLassert (s->running);
  if (handled) {
+   CAMLassert (s->running);
    interruptor_set_handled(s);

    stw_handler(domain_self->state);
  }
  return handled;
}

@gasche
Copy link
Member

gasche commented Aug 29, 2024

Note that the s->running boolean is, as far as I can tell, never used for any actual control-flow logic, it seems to have been implemented to reflect our mental model of the system, to be used in debug assertions and -- presumably -- catch bugs. But since the 5.0 release we haven't caught a bug thanks to it, and you have spent a lot debugging its own debug assertions. It is not obvious to me that having this form of ghost state is worth it, and one option would be to remove it altogether to cut costs.

(Is it, for example, synchronized correctly between threads that access it, or should it be made atomic? I'm not sure.)

@ghost
Copy link
Author

ghost commented Aug 30, 2024

Your proposed diff ought to work as well.

I am a bit reluctant at moving assertions however, as 1) I don't know what the original code author was thinking and whether this is a honest bug or an intentional placement of this assert, and 2) the fact that any given condition may have changed status after a call to a possibly blocking function is a textbook TOCTTOU, and won't disappear anytime soon.

It only makes sense to confirm the domain interruptor is in running state
if there are interrupts to handle. There is a possibility of a domain
exiting while the backup thread, in BT_IN_BLOCKING_SECTION state, is
waiting for the domain lock; it will then proceed to invoke handle_incoming()
as there had been pending interrupts at the time it tried to acquire the
domain lock, but domain termination has taken care of them.

Therefore handle_incoming() should be safe to invoke with the interruptor in
non-running state, as long as there are indeed no pending interrupts.
@jmid
Copy link
Member

jmid commented Aug 30, 2024

The alternative patch LGTM too!
Looks like the MSVC CI failure in tests/lib-runtime-events/test_dropped_events.ml is unrelated.

@ghost
Copy link
Author

ghost commented Aug 30, 2024

The alternative patch LGTM too! Looks like the MSVC CI failure in tests/lib-runtime-events/test_dropped_events.ml is unrelated.

Hard to tell. The logs are useless and I am not sure the error output (stderr) has been captured.

@jmid
Copy link
Member

jmid commented Aug 30, 2024

Hard to tell. The logs are useless and I am not sure the error output (stderr) has been captured.

The same testcase failed on #13410 on POWER yesterday and succeeded after a rerun, so I suspect it is a buggy testcase.

@kayceesrk
Copy link
Contributor

Note that the s->running boolean is, as far as I can tell, never used for any actual control-flow logic,

It was originally used for control flow. See d3e78c0. If it is no longer used other than assertions, I'd prefer to remove this variable from the code.

@jmid
Copy link
Member

jmid commented Aug 30, 2024

Note that the s->running boolean is, as far as I can tell, never used for any actual control-flow logic,

[...]. If it is no longer used other than assertions, I'd prefer to remove this variable from the code.

It seems to be used in this other debug assertion block, which performs a different assertion check and which will then also have to go:

ocaml/runtime/domain.c

Lines 1650 to 1660 in 7137407

#ifdef DEBUG
{
int domains_participating = 0;
for(i=0; i<caml_params->max_domains; i++) {
if(all_domains[i].interruptor.running)
domains_participating++;
}
CAMLassert(domains_participating == stw_domains.participating_domains);
CAMLassert(domains_participating > 0);
}
#endif

@ghost
Copy link
Author

ghost commented Aug 30, 2024

What @jmid said. I think it is worth keeping at least the number-of-domains-in-an-STW accounting logic.

@kayceesrk
Copy link
Contributor

Ok. It seems useful to keep the variable around.

@gasche gasche merged commit 0125942 into ocaml:trunk Aug 30, 2024
@ghost ghost deleted the recycle_race branch August 30, 2024 13:37
@jmid
Copy link
Member

jmid commented Aug 30, 2024

As a(n assertion) bugfix, could this be cherry-picked to 5.3? 🙏

gasche added a commit that referenced this pull request Aug 30, 2024
[runtime] Fix backup thread vs interrupt race

(cherry picked from commit 0125942)
@gasche
Copy link
Member

gasche commented Aug 30, 2024

Sure, done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants