Add padding before and after the atom table #9128
Of course, this is a fix for a bug, which we may consider critical since it can cause segmentation faults. So we probably want to backport it to the versions of OCaml for which we may still publish a bugfix release.
Backporting: yes! Alpine Linux is heavily used for virtual machines and CI purposes, so I think a backport to 4.08 in addition to the active versions makes sense. But first we need a review and approval.
Thanks for the detective work on the Alpine Linux crashes. An alternate, perhaps simpler fix would be to allocate the atom table dynamically, using For native code, I'm confident that (read-only) code and (read-write) data will never end up in the same page. That might not be the case for readonly data, which is why we should not put initialized data in readonly areas until the page table goes away. |
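A minimal sketch of the dynamic-allocation idea, not the code of this PR: it assumes a 4 kB page, 256 atoms, and a simplified header word, and uses the C11 `aligned_alloc` rather than any OCaml runtime allocator. All `sketch_` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumptions for this sketch, not taken from the OCaml runtime. */
#define SKETCH_PAGE_SIZE 4096u
#define SKETCH_NUM_ATOMS 256u

typedef uintptr_t sketch_header;

/* Round a byte count up to a whole number of pages. */
static size_t sketch_round_up_to_page(size_t bytes)
{
  return (bytes + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE * SKETCH_PAGE_SIZE;
}

/* Allocate a table of zero-sized block headers, one per tag, in a
   page-aligned region whose size is a whole number of pages.
   Returns NULL on allocation failure. */
static sketch_header *sketch_alloc_atom_table(void)
{
  /* One extra word, because a pointer to an atom points just past
     its header. */
  size_t request =
    sketch_round_up_to_page((SKETCH_NUM_ATOMS + 1) * sizeof(sketch_header));
  sketch_header *table = aligned_alloc(SKETCH_PAGE_SIZE, request);
  if (table == NULL) return NULL;
  for (unsigned tag = 0; tag < SKETCH_NUM_ATOMS; tag++) {
    /* Size 0 in the upper bits, tag in the low 8 bits; the GC colour
       bits are left out of this sketch. */
    table[tag] = (sketch_header)tag;
  }
  table[SKETCH_NUM_ATOMS] = 0;
  assert(((uintptr_t)table % SKETCH_PAGE_SIZE) == 0);
  return table;
}
```

Because the region starts on a page boundary and spans whole pages, no code or unrelated data can land in the same page as the table.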
Nicely found! The patch looks good, but when reading it I think I saw an off-by-one error in
This adds |
I was a bit surprised by @xavierleroy's comment that the atom table is not performance-critical, because I remembered seeing I noticed two specific use-cases that one could try to exploit to generate contradictory micro-benchmarks:
I don't think there will be any performance effect: there is no extra indirection. |
Sorry for the confusion; I was discussing Xavier's proposal to allocate the atom table dynamically at runtime startup. |
And what about static OCaml data that ends up next to data allocated by some C stub, data that an OCaml heap block points to (when not in no-naked-pointers mode)?
Indeed, fixed. |
Indeed. I agree this should not be performance-critical. @gasche, do you confirm? Shall I implement this?
@gasche thanks, I understand your comment now. I thought that there would be a problem with this approach in compiling the expression |
I can't confirm because I know less than the three of you on how to assess the performance-impact of a runtime change. I looked at the uses of |
Out of curiosity, is there a bug report anywhere about the Alpine crashes?
An empty string has size 1... Empty arrays maybe? |
It's not performance-critical in the sense that allocation in the minor heap is performance-critical. The only use, as far as I can remember, is to evaluate the empty array Also it opens the way to lazy allocation and initialization of the atom table the first time it's needed, which might actually save time and space in many programs (all those that don't use empty arrays). Just a thought.
Right. The main reason for the atom table is that we cannot allocate zero-sized objects in the minor heap (EDIT: because there would be no room for a forwarding pointer) nor in the major heap (EDIT: because there would be no room to maintain a free list), so static allocation is a must. In native code we have a solid general mechanism for statically-allocated data. In bytecode we don't, so we need the atom table. If you wonder whether
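To make the zero-size argument concrete, here is a small sketch of a block header with a size field and a tag field. The layout and every `sketch_` name are assumptions made for illustration, not the runtime's macros: a block whose size is 0 words has a header and nothing else, so there is no word in which to store a forwarding pointer or a free-list link.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t sketch_header;

/* Hypothetical header layout for this sketch: size in words in the
   upper bits, tag in the low 8 bits, two bits in between left unused
   here (a real collector would keep its colour there). */
static sketch_header sketch_make_header(size_t wosize, unsigned tag)
{
  return ((sketch_header)wosize << 10) | (sketch_header)(tag & 0xFFu);
}

static size_t sketch_wosize(sketch_header hd) { return (size_t)(hd >> 10); }
static unsigned sketch_tag(sketch_header hd) { return (unsigned)(hd & 0xFFu); }

/* Bytes of payload that follow the header of a block. */
static size_t sketch_payload_bytes(sketch_header hd)
{
  return sketch_wosize(hd) * sizeof(sketch_header);
}

/* A block can hold a forwarding pointer or a free-list link only if
   it has at least one payload word. */
static int sketch_can_hold_link(sketch_header hd)
{
  return sketch_payload_bytes(hd) >= sizeof(void *);
}
```

With this layout, `sketch_can_hold_link(sketch_make_header(0, 0))` is false, which is the situation the atom table works around by allocating such blocks statically.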
Naked pointers must go away, the sooner the better. (Mental note: update chapter 20 of the reference manual to say "don't use naked pointers".) |
Force-pushed from a34e2b1 to a2c06d9.
I've pushed a new version of the PR, which:
Force-pushed from a2c06d9 to 4c4046f.
The test |
Force-pushed from e0b63e5 to 9c7cc72.
Looks good to me. Some unimportant remarks and suggestions below. I still think this must be reviewed by @damiendoligez himself.
Force-pushed from 857e7c1 to cb32991.
A very small change, then good to go.
Force-pushed from 0d6c8bf to ad4595e.
Thanks @damiendoligez. I included your change. This should be ready to merge as soon as CI passes. @gasche, could you backport to 4.10 (and 4.09?)?
For 4.10, "yes of course". For 4.09, we need some sort of assessment of the regression-risk of the change. If you are confident that this is a low-risk change, it can go into a maintenance branch, but otherwise I would be hesitant to backport it, at least before it gets more testing in 4.10 and trunk. (To me, a non-runtime-expert, it looks like a change that might go wrong in unplanned ways?) |
As you prefer. To me, it is low-risk, but I have a hard time assessing the risk of a change objectively. Moreover, it seems that without my memprof change, the memory mapping was somehow avoiding the issue. So we may decide not to backport to 4.09.
This is a large-ish diff, so if the bug was not observed in the wild on 4.09, I'd advise against backporting. |
There are, in fact, reports in the wild of segfaults with OCaml 4.09 on Alpine+musl Docker images, which are the initial reason for @jhjourdan's investigation and the currently proposed fix. I haven't been able to find a link to the discussions just now, but I ended up on older issues, such as #7562, which may in fact be related.
Then let's backport. |
I'll start by running |
(enthusiasm, which I hope is correlated with a careful code review!) |
Is there anything blocking merging for this PR, now that we have agreed to backport to 4.09? |
…me and inline it at the only place it is used.
We need to make sure they do not share a page with code, otherwise the GC and the polymorphic hash and comparison functions will follow code pointers and potentially trigger a segfault.
We must wait for two major cycles to finish in order to make sure that finalizers have indeed been examined.
Force-pushed from ad4595e to d5d972d.
(I have just fixed the conflict in Changes.) |
Sorry, I just forgot about it. I'll try to remember to merge after the CI comes back green again. Thanks for pinging! |
(The precheck job was https://ci.inria.fr/ocaml/job/precheck/329/ and passed.) |
Add padding before and after the atom table (cherry picked from commit 11b5182)
We need to make sure it does not share a page with code, otherwise the GC and the polymorphic hash and comparison functions will follow code pointers and potentially trigger a segfault.
This seems to be the cause of the segfaults that we observed recently on Alpine Linux. It seems like the linker and musl conspired by putting some bytecode next to the atom table. Then, the polymorphic hash function decided to follow the code pointer of a closure, which did not point to any valid block, finally triggering the segfault.
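The failure mode hinges on classification being done per page: once a page is registered as holding values, everything in it is treated as a value. A small sketch of the "do these two regions touch a common page?" question follows, assuming a 4 kB page; the function names are hypothetical and not part of the runtime.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumption for this sketch */

/* Index of the page containing a given address. */
static uintptr_t sketch_page_of(uintptr_t addr)
{
  return addr / SKETCH_PAGE_SIZE;
}

/* Do the half-open address ranges [a_start, a_end) and
   [b_start, b_end) touch at least one common page?
   Both ranges must be non-empty. */
static int sketch_ranges_share_page(uintptr_t a_start, uintptr_t a_end,
                                    uintptr_t b_start, uintptr_t b_end)
{
  assert(a_start < a_end && b_start < b_end);
  uintptr_t a_first = sketch_page_of(a_start);
  uintptr_t a_last = sketch_page_of(a_end - 1);
  uintptr_t b_first = sketch_page_of(b_start);
  uintptr_t b_last = sketch_page_of(b_end - 1);
  /* Two page intervals overlap when each starts before the other ends. */
  return a_first <= b_last && b_first <= a_last;
}
```

For instance, a table occupying `0x1800..0x2000` and bytecode occupying `0x1000..0x1800` do not overlap byte-wise, yet both live in page 1, so a per-page lookup cannot tell them apart.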
Even though this PR fixes the issue on Alpine, I am still slightly concerned by a similar issue: in native mode, static data is not surrounded by similar padding, so that, in theory, the OS could decide to map the static data in the same page as some code. I know most OSes would typically use different pages for code and data in order to give them different permissions, but there is no guarantee here. Also, when not in the no-naked-pointers mode, it is possible that the heap contains a pointer to some non-heap data in a page registered as a value page.
So, there is a possible extension to this fix: add padding before and after any static data segment in native mode. Of course, this means we increase the size of the binary generated for any OCaml compilation unit by 8 kB...
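As an illustration of where the 8 kB figure comes from, here is a sketch of static data sandwiched between two page-sized pads, assuming 4 kB pages. The struct and all `sketch_` names are hypothetical and are not claimed to match the patch itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumptions for this sketch, not taken from the OCaml runtime. */
#define SKETCH_PAGE_SIZE 4096u
#define SKETCH_NUM_ATOMS 256u

typedef uintptr_t sketch_header;

/* The table sits between two pads of one page each, so every page
   it touches is covered by this object alone. */
struct sketch_padded_table {
  char pad_before[SKETCH_PAGE_SIZE];
  sketch_header table[SKETCH_NUM_ATOMS];
  char pad_after[SKETCH_PAGE_SIZE];
};

static struct sketch_padded_table sketch_atoms;

/* Bytes spent on padding: two pages, i.e. 8192 with 4 kB pages. */
static size_t sketch_padding_overhead(void)
{
  return sizeof(sketch_atoms.pad_before) + sizeof(sketch_atoms.pad_after);
}

/* True if every page touched by the table lies inside the object. */
static int sketch_table_pages_are_private(void)
{
  uintptr_t obj_start = (uintptr_t)&sketch_atoms;
  uintptr_t obj_end = obj_start + sizeof(sketch_atoms);
  uintptr_t tab_start = (uintptr_t)sketch_atoms.table;
  uintptr_t tab_end = tab_start + sizeof(sketch_atoms.table);
  uintptr_t first_page = tab_start / SKETCH_PAGE_SIZE * SKETCH_PAGE_SIZE;
  uintptr_t last_page_end =
    (tab_end + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE * SKETCH_PAGE_SIZE;
  return obj_start <= first_page && last_page_end <= obj_end;
}
```

The pads guarantee isolation whatever address the linker picks, at the cost of two pages per padded object; applied to every compilation unit, that is the size increase mentioned above.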