-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix caml pooled mode and recursive startup #10304
Conversation
In addition, the last commit prevents reparsing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are various things I don't understand with this change. I guess the main point is that there are several changes and it is not clear what does what.
The description announces that the PR fixes a crash, but how do we observe the crash? (Is it something that we should consider adding in our testsuite as a regression test?)
More importantly: before caml_startup_aux
had a very simple role, it would count the number of times it was called to cutoff initialization, and also call caml_stat_create_pool
which is related to the startup/shutdown logic. With your PR, a lot of the "init" logic is moved inside this function (why? this is not explained; is it to fix a bug or for code factorization?), but some of the logic remains in startup_byt
and startup_nat
. Looking at the diff there is no clear reason for what is moved where.
My guess would be that what stays in startup_nat
is exactly, in your design, the initialization functions that need to be implemented differently in the bytecode and native runtime. Here is maybe a naming scheme that could make this simpler to follow/understand:
caml_starting_aux
is left as it was before, and possibly renamed intocaml_on_first_startup
- the new stuff you moved in it moves to a different
caml_init_aux
function - the remaining
init_*
bits in the byt/natcaml_startup_*
functions are moved tocaml_init_{byt,nat}
functions, that start by callingcaml_init_aux
(so as a function they are in charge of all initialization)
Ah, I realize now that some of my questions are answered by reviewing commit-by-commit, I had only read the whole diff. So if I understand correctly, the main fix is to ensure that |
Mea culpa, I forgot to describe the main bug and fix: the second commit moves functions that use caml_stat_alloc after the initialisation of the pool. Initialising the pool after something has been allocated will prevent the latter from being deallocated on shutdown, and will cause UB (most likely a crash) when attempting to free it manually. The logic of startup is a bit convoluted and recent evolutions forgot to take this logic into account (also it is hard to see how this can be caught by testing, but I think adding comments in the source will avoid these issues in the future). |
We could expose Then a test could a C program starting the pooled runtime, interpreting a toy bytecode program, shutting down the runtime and cleaning the pool, then checking that the |
I started implementing the idea of a flag but then realised that for people who call caml_stat_alloc*_noexc functions without the runtime lock, this flag would need to be synchronised. The documentation of some _noexc functions explicitly mention that the runtime lock is not needed. The pooled mode was not made the default because of backwards compatibility concerns (with lots of positives found in the original PR), but in fact the new requirements were not added to the documentation (again: in pooled mode you need the runtime lock to call the _noexc functions) and are not enforced in any way with default OCaml settings, so library writers are unlikely to have started to follow the new discipline. These requirements are not even necessarily easy to follow, think about needing to acquire the runtime lock (and thus knowing whether you already have it) if you need to free a block given to foreign code during clean-up after an error condition (when the documentation of caml_stat_free does not even mention needing the runtime lock). It seems to me that a better course of action would be to:
The I want to stress that to take advantage of the pooled allocations the program or library must be explicitly designed for it. Indeed, for a library to correctly release memory allocated with caml_stat_alloc*, it must still call caml_stat_free explicitly because pooling is not the default behaviour. In this scenario, when activating the pooled mode, either the pooled mode has no useful effect (if the deletions are all done before shutdown and with runtime lock held), or it always causes UB (otherwise). So it does not make sense to have a switch to choose whether allocations are pooled or not. Anyway, these comments go fairly beyond the purpose of this PR. For the current PR I prefer to stop at its current state I think. |
Sounds reasonable, thanks. Do you have an opinion on the refactoring of the |
Yes, I do not see what is wrong about moving stuff to startup_aux. Even if it had a specific purpose I do not see the issue with repurposing it. |
The function moves from "one purpose" to "two purposes" (what it was doing now + some of the runtime initialization logic), so it is getting strictly more complex, which is bad. |
For the record: the "c" parameter is tested as part of the CI at Inria, in the "other configs" script: ocaml/tools/ci/inria/other-configs/script Line 46 in 414bdec
and also in the "sanitizers" script, in conjunction with a leak detector: ocaml/tools/ci/inria/sanitizers/script Lines 88 to 94 in 414bdec
So I'm reasonably confident that the "c" mode works and is maintained. The "pooled" API is another matter, though. |
Hello, Gabriel, Guillaume, and Monsieur Leroy! I still use the pooled (unloadable) runtime in 4.06, and totally care about it. This PR is quite interesting, and I will do a proper review a bit later (in a few days). |
This got me puzzled as to why it did not catch that |
I have pushed some simplifications (make static what does not need to be extern, etc.). I have also found a hidden function that could be factored (init_atom_table). Its place inside init_static in nat was nice and appropriate but I think it is even better to factor the call between nat and byt and move it to the common file. As discussed with @gasche I do not see how to easily improve the organisation further but if you want to show me what you have in mind I will be happy to consider it as a patch. |
See gadmm#1 for a modest proposal. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General comment: this PR lacks structure, which made it difficult to review. It is good that the changes are split into commits, but these commits lack proper messages (with explanations), and aligning these commits with the PR comments required effort.
Changes
entry could be more detailed. Perhaps it should be split into 2-3 entries.
In addition, the last commit prevents reparsing OCAMLRUNPARAM on subsequent calls to caml_startup*. In rare conditions, e.g. when using setenv/putenv to change the value of caml_use_huge_pages after OCaml has already started, this can lead to UB (a segfault most probably). This is unlikely to happen, but it shows that it does not make sense to reparse OCAMLRUNPARAM.
[...]
Mea culpa, I forgot to describe the main bug and fix: the second commit moves functions that use caml_stat_alloc after the initialisation of the pool. Initialising the pool after something has been allocated will prevent the latter from being deallocated on shutdown, and will cause UB (most likely a crash) when attempting to free it manually. The logic of startup is a bit convoluted and recent evolutions forgot to take this logic into account (also it is hard to see how this can be caught by testing, but I think adding comments in the source will avoid these issues in the future).
Correct. Both of these problems (an unlikely UB and a few minor leaks) were introduced by me in cef4a94, while adding caml_cleanup_on_exit
.
See gadmm#1 for a modest proposal.
What strikes me wrong with this approach is that caml_{startup,shutdown}_needed
are named like predicates, but actually increment/decrement global counters on calls, which is completely unobvious.
How about just renaming caml_stratup_aux
to caml_startup_common
and documenting it properly?
I [...] take advantage of this open PR to introduce a minor clarification to the documentation (closes #10179).
This change is correct and fixes a mistake that I added in af5899f.
runtime/startup_aux.c
Outdated
if (pooling) | ||
caml_cleanup_on_exit = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see the meaning in this change.
Previously, if the user set OCAMLRUNPARAM=c
, then caml_cleanup_on_exit
became 1, which enabled pooling.
Now, if pooling is enabled (via a call to caml_startup_pooled
), it sets caml_cleanup_on_exit
to 1, which later forces caml_shutdown
on exit
. But the user didn't ask for this! Leak detection mode (caml_cleanup_on_exit
) isn't the same as ability to unload the runtime manually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have removed this change, which stemmed from my misunderstanding. I also add a clarification of the difference between the two features in the documentation if you accept, to explain what you just wrote.
I will come back to this PR in a moment. In the meanwhile, I'd like to know what you think @gasche :
I tend to agree with @murmour, for this reason and because in the end the code to call the two functions ends up duplicated between the three call sites. Would this alternative be fine with you? |
I think; I timed out on this problem and I am okay with a code proposal that you and @murmour would both like. |
@gadmm, regarding your ideas on |
We move duplicated code to startup_aux to simplify further modifications. This should not change the meaning of the code.
caml_init_domain and CAML_EVENTLOG_INIT call caml_stat_alloc functions, so they are moved after the pool initialisation. Suubsequently, the call to caml_record_backtrace, which accesses the Caml_state, must be moved later. We document that the pool is only available after some point, to hope avoid similar bugs in the future.
This commit prevents reparsing OCAMLRUNPARAM on subsequent calls to caml_startup*. In rare conditions, e.g. when using setenv/putenv to change the value of caml_use_huge_pages after OCaml has already started, this can lead to UB (a segfault most probably). This is unlikely to happen, but it shows that it does not make sense to reparse OCAMLRUNPARAM. It is unlikely that the reparsing of OCAMLRUNPARAM was relied on in practice.
The calls to caml_init_atom_table can be factored. Then, make static what can be.
Fix ocaml#10179 (wrong example of thing that is released on shutdown), and clarify the difference between startup_pooled and the "cleanup_on_exit" function (at least to me). Fix typo.
Also the previous caml_startup_common should be static; rename it.
Here is a rebased version. Comments have been added in the logs, otherwise only the last three commits have changed. Thanks for the review. I took your comments into account. You are right that I could have made your task easier by including detailed commit logs and mentioning that it is best reviewed commit-per-commit. I will be more careful next time. I do not see how to improve the Changes entry. I think all the user-visible changes have been described. |
The failing test appears unrelated. |
Happy belated birthday, PR#10304! I am closing this PR since I no longer have time not interest in this problem in light of the absence of review and the work needed to rebase it on top of multicore (and duplication of work to fix the issues in both OCaml 4 and 5). I have tried to use the pooled mode for our library, and as explained in #10304 (comment), this mode needs a rework before it can be useful, so we are going to do without it. For now I simply recommend to stay clear from the " But anyone who cares about these issues should feel free to adopt it and/or open an issue about it. I am not deleting the branch so the code will remain available for now. |
Terminology:
caml_stat_alloc*
is freed upon shutdown.caml_startup*
several times (not technically recursively but whatever) described in the manual.This PR fixes a crash in the instrumented runtime in pooled mode, and fixes some leaks in the normal runtime in pooled mode or when calling
caml_startup*
several times. It also factors some duplicated startup code and adds some comments to avoid similar issues in the future.This affects all OCaml versions starting from 4.10.