Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missing GC root registrations in runtime/io.c #12445

Merged
merged 1 commit into from Jul 31, 2023
Merged

Conversation

gasche
Copy link
Member

@gasche gasche commented Jul 29, 2023

#12436 led me to realize that most of the code of io.c is subtly wrong because it does not root its value arguments when it should. (This was already broken for OCaml 4.x.)

In this PR I decided to systematically review all functions, root arguments if necessary, and add an explicit justification comment for any function with unrooted value arguments or temporaries.

(cc @damiendoligez @yallop)

(This PR supersedes #12436.)

@xavierleroy xavierleroy changed the title carefully root value arguments in the runtime libraries Missing GC root registrations in runtime/io.c Jul 30, 2023
@xavierleroy
Copy link
Contributor

I think the changes are good, but I have a couple of nits to pick :-)

The comment /* No need to root the arguments as we never call the GC. */ is not super clear. Who is "we" ? I would expect "we" to be the code of the function where this comment occurs, but even in the buggy case (the one reported in #12436) the function doesn't call the GC directly nor indirectly by calling functions that allocate; the GC is triggered by concurrently-running threads. Something like

/* No need to register vchan as root since no GC can occur while this function is running */

would be more accurate.

Also, after rooting through several English dictionaries, I still don't root for your use of "to root value arguments" to mean "to register value arguments as memory roots for the GC". Perhaps that's because the practice of verbing nouns is not deeply rooted in my native language. But the only meaning I know so far for "to root something" is to install a rootkit on something or more generally to elevate one's privileges to the "root" user level on something.

@gasche
Copy link
Member Author

gasche commented Jul 30, 2023

The comment /* No need to root the arguments as we never call the GC. */ is not super clear.

Thanks for the nitpick! I had to search a bit for a comment I liked myself, so I am happy to discuss this further.

I would expect "we" to be the code of the function where this comment occur.

Yes, that was my intent.

No need to register [vchan] as root since no GC can occur while this function is running

Counter-nitpick: in OCaml 5 some parts of the GC may always be running concurrently in other domains.
Note: it is not obvious to me why this does not immediately destroy any hope of intentionally non-registering some roots, but my non-expert reasoning about why this may be safe is that the concurrent parts are the major GC, which only collect things that have been dead since the last cycle or something, in particular before the last minor collection. We (the function) can assume that our value arguments were live immediately before we were called, so they cannot be swept by a concurrent GC.

I think that a fully precise comment would be as follows:

We never give control to the GC in a way that could invalidate local (unregistered) values or call finalizers.

"Giving control" to the GC can happen by a direct call or indirectly by releasing the runtime lock. I thought that "call" was an acceptable substitute for "giving control" (or "transferring control") in this broad sense.

What would you think of the following?

/* No need to register roots since we never give control to the GC. */

@gasche
Copy link
Member Author

gasche commented Jul 30, 2023

On a different note: I was surprised to find so many subtle bugs in one file of the runtime -- I started auditing other parts of the runtime and didn't find any issue; what is different in the production context of this one file that explains the somewhat high defect rate? I found the answer: the functions that are buggy here are all functions in which locking operations were added just before the 5.0 release, namely in #11171. There probably aren't rampant root-registration issues in the runtime libraries, but maybe @Octachron and @OlivierNicole need a crash course on being paranoid about the GC.

@gasche
Copy link
Member Author

gasche commented Jul 30, 2023

After thinking more about this, I decided to remove the comments/justifications on why certain functions do not register their value arguments as roots. I suspect that it would take a lot more work to converge towards a common vocabulary for what is (un)safe. I am happy to send the comments as a separate PR if there is interest, but for now I reduced the current PR to contain only the extra root registrations.

@gasche
Copy link
Member Author

gasche commented Jul 30, 2023

I pushed a Changes entry for the 5.1 section because I think that the fixes are safe and worth having in 5.1 as well.

For example, those issues could show up in automated property-based testing. They may also be the cause of some of the flaky CI failures we have been observing: @damiendoligez mentioned that an intext_par failure.ml was fixed by #12436.

cc @Octachron

Copy link
Contributor

@xavierleroy xavierleroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, modulo one change that may not be necessary. I'll let @Octachron decide how and when to merge this.

runtime/io.c Outdated
Comment on lines 1046 to 1049
CAMLparam1 (vchannel);
struct channel * channel = Channel(vchannel);
int num = caml_num_rows_fd(channel->fd);
CAMLreturn (Val_int(num));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the original code is safe and this change is not necessary. "But it doesn't hurt!" you're saying? I prefer bug-fix commit that contain only code strictly necessary to fix the bug.

Copy link
Member Author

@gasche gasche Jul 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I removed the root-registration here.

(caml_num_rows_fd does not currently call the GC in either its Unix and Windows versions. This is not documented and might change in the future, so I thought that it was more robust to root anyway.)

Copy link
Member

@yallop yallop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(half-written comment deleted. The changes are clearly correct.)

@yallop
Copy link
Member

yallop commented Jul 31, 2023

Another similar case, from caml_ba_blit:

As @gasche points out below: I misread the code; apologies!

@Octachron
Copy link
Member

I believe that the fact that I had missed the adding a lock broke the non-GC invariant in those functions is an argument for adding comments (to help future myopic reviewers like myself).

I agree that it is better to merge those fixes in 5.1 to decrease the number of runtime bugs in 5.1.0 .

@OlivierNicole
Copy link
Contributor

Apologies on introducing the bugs, I hadn’t realized at the time that the CAML* macros were about GC roots. The manual says to use them, but most of the code in io.c didn’t, so I wrongly assumed that they didn’t matter.

@gasche
Copy link
Member Author

gasche commented Jul 31, 2023

I believe that the fact that I had missed the adding a lock broke the non-GC invariant in those functions is an argument for adding comments (to help future myopic reviewers like myself).

I think that what would really help is a discipline to distinguish functions that may transfer functions to the GC, and those that do not. Currently one has to transitively unfold definitions to tell if a function is in one category or another; this should be property that can be more easily checked.

I have an idea for a discipline that could work, which is to systematically use CAMLparam<n>/CAMLreturn for any function that may transfer control to the GC. This should probably be discussed elsewhere.

Each added root corresponds to a bug in the previous code, except for
`caml_terminfo_rows` which was safe for an arguably-fragile reason.
@gasche
Copy link
Member Author

gasche commented Jul 31, 2023

Thanks @xavierleroy and @yallop for the reviews. I will go ahead and merge (also in 5.1) once the CI is happy again.

@gasche
Copy link
Member Author

gasche commented Jul 31, 2023

@yallop I don't see the issue with caml_ba_blit: the values vsrc and vdst are registered as roots, and the data pointers are not internal pointers into the custom block (which could be invalidated by the GC) but pointers to external (non-heap-allocated) memory, if I understand correctly. Can you elaborate?

@gasche gasche merged commit f272e37 into ocaml:trunk Jul 31, 2023
9 checks passed
gasche added a commit that referenced this pull request Jul 31, 2023
Missing GC root registrations in runtime/io.c

(cherry picked from commit f272e37)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants