Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't exit while a domain is still running. #13190

Open
wants to merge 2 commits into
base: trunk
Choose a base branch
from

Conversation

damiendoligez
Copy link
Member

See also commit 78fd956

This test segfaults on my machine in cleanup-at-exit mode.

@gasche
Copy link
Member

gasche commented May 22, 2024

I'm lost. I think that the code of this test is erroneous, it is invalid for two threads to try to force the same lazy thunk concurrently. But of course such erroneous code should still be memory-safe.

  1. Is it intentional that the code is morally wrong, is the test trying to test this? (In this case, could it be written explicitly in a comment?)
  2. You said that you observe a segfault before, this is not supposed to happen, I suppose this a memory-safety bug? Are we hiding the bug by changing the test, will we also track the bug and try to fix it?

@kayceesrk
Copy link
Contributor

The crash indicates a bug. Even with unsynchronized access to lazy, the program shouldn't crash.

From https://v2.ocaml.org/api/Stdlib.Lazy.html,

Note: Lazy.force is not concurrency-safe. If you use this module with multiple fibers, systhreads or domains, then you will need to add some locks. The module however ensures memory-safety, and hence, concurrently accessing this module will not lead to a crash but the behaviour is unspecified.

Independently, the lazy3 test also seems buggy since it tests an implementation behaviour, whereas the behaviour is left unspecified by the language.

@kayceesrk
Copy link
Contributor

Also, the title of the PR is making me curious

Don't exit while a domain is still running.

Is that a recommendation that we will provide when using cleanup-at-exit mode? Even if the programmer doesn't follow that recommendation, the worst that should happen is that resources are not cleaned up (degenerate to the standard "no cleanup at exit" mode), and not a segfault.

@gasche
Copy link
Member

gasche commented May 23, 2024

My vague guess as what is going on:

  • the problem is not related to Lazy, but to the fact that when main raises Lazy.Undefined it never gets to call Domain.join d1, so the program is wrong because the main domain returns while another domain is running.
  • in cleanup-at-exit mode, when the main domain returns it will destroy the runtime system as well; the other domain is still running and a spectacular crash ensues (this is not a segfault in Lazy.force)
  • the fix of Damien is correct because it ensures that Domain.join is always called
  • the underlying problem is that caml_shutdown will segfault if other domains are still running, and that should be tested and fixed, but it should be done in another test (as this is not lazy-related)

Copy link
Member

@gasche gasche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wish there were two comments in the test:

  1. Point out that this test is intentionally creating a forcing race between domains, to check that we remain memory-safe in that case.
  2. Point out that the option-wrapping is done to ensure that Domain.join is always called.

@lthls
Copy link
Contributor

lthls commented May 23, 2024

  • the underlying problem is that caml_shutdown will segfault if other domains are still running, and that should be tested and fixed, but it should be done in another test (as this is not lazy-related)

In my experiments the segfault comes from the domain still running, not from caml_shutdown itself (but maybe that's what you meant ?).
I have managed to catch a segfaulting trace with rr, I can share my setup if someone else wants to investigate this way.

@gasche
Copy link
Member

gasche commented May 23, 2024

Well caml_shutdown will collect runtime resources, unmap the minor-heap memory, etc., so of course any remaining mutators will crash big time. I don't think there is much to debug as there is no hope of making this behavior safe, instead we must force mutators to stop before we pull the runtime rug out.

@kayceesrk
Copy link
Contributor

Thanks for the explanation @gasche. The source of the failure is clear now. And, thanks, @lthls, for confirming the failure location.

@damiendoligez damiendoligez marked this pull request as draft May 27, 2024 09:22
@damiendoligez
Copy link
Member Author

  • the problem is not related to Lazy, but to the fact that when main raises Lazy.Undefined it never gets to call Domain.join d1, so the program is wrong because the main domain returns while another domain is running.
  • in cleanup-at-exit mode, when the main domain returns it will destroy the runtime system as well; the other domain is still running and a spectacular crash ensues (this is not a segfault in Lazy.force)
  • the fix of Damien is correct because it ensures that Domain.join is always called
  • the underlying problem is that caml_shutdown will segfault if other domains are still running, and that should be tested and fixed, but it should be done in another test (as this is not lazy-related)

That's completly right. Sorry I didn't think of writing down a detailed explanation. Note that the underlying bug is being worked on in #12964 and #13010. It just happens that our CI is testing cleanup-at-exit while it's not completely implemented yet.

@gasche I've added the comments you suggested.

@damiendoligez damiendoligez marked this pull request as ready for review May 27, 2024 09:37
should not crash. Currently, the implementation raises Undefined,
and that's what we test here.

Note: due to a bug in the current implementation of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is likely to go out of sync. When we fix this bug, it is likely that we will not fix this comment as the existing tools (such as running the testsuite) won't help catch this.

My preference should be to remove this Note altogether, with the idea that memory cleanup at exit feature will have a recommendation something to the effect of "When used in a multi-domain program, it is undefined behaviour not to join all the spawned domain before the main domain terminates. Undefined behaviour here includes crashes.".

Additionally, It is not clear what this bug is. If we are keeping this comment in some form, it would be useful to link to the bug report somewhere (or the wip PR #12964).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense, so I removed the merge-me label for now. (I am not patient enough to wait for @damiendoligez to update his PR, but then I am not motivated enough to fix the comment myself, so for now let's just wait.)

@gasche gasche removed the merge-me label May 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants