Refactor Dynlink startup to avoid parsing bytecode sections twice #12599

Merged: 5 commits into ocaml:trunk, Jan 19, 2024

Conversation

stedolan
Contributor

Dynlink startup currently re-parses the bytecode executable (via Symtable) to extract various bits of information, most of which the runtime has already gathered. This PR makes the runtime hang on to this information, and provide it directly to Dynlink startup. (The dependency between Symtable and bytecode parsing (Bytesections) caused some annoyance for @shindere in #11996, and is removed here.)

For reviewers: Start by reading bytecomp/symtable.ml - the simplification of init_toplevel is the point of this patch.

(thanks @dra27 for help figuring out interactions with -output-complete-obj and other unusual cases of bytecode loading)

@xavierleroy
Contributor

I haven't reviewed in detail yet, but I have the impression that this PR is increasing the memory requirements and start-up times of all bytecode programs, not just those that use Dynlink. Is this correct?

@dra27
Member

dra27 commented Sep 25, 2023

I don't think it adds to start-up time - the primitive table and DLLs were always processed before, but there is an increase in memory because those structures are then kept.

@stedolan
Contributor Author

I tried benchmarking ocamlc -version, which seems like a worst-case: a large binary but a very short runtime, so startup costs are noticeable.

I was not able to measure any slowdown. Slightly more work is being done: some extra binary sections are being read. They aren't parsed until Dynlink is initialised, though, and reading some bytes from a file into a buffer does not take long. Before and after this patch, caml_init_dynlink takes about half a millisecond (I renamed it, but it was always unconditionally called). This half a millisecond turns out to be almost entirely spent on an O(n^2) loop comparing primitive names (ah, the things you discover when you run a profiler...).
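
To illustrate the kind of quadratic loop described above, here is a minimal sketch. It is not the runtime's actual code: the struct and function names are hypothetical, and it only assumes that each required primitive name is resolved against a table of built-in names. A linear scan per name costs O(n*m) string comparisons in total, while sorting the table once and using binary search brings each lookup down to O(log m).

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical entry: a primitive name and its slot in the built-in table. */
struct prim_entry { const char *name; int index; };

/* Linear scan: O(m) comparisons per lookup, so O(n*m) for n required names. */
int lookup_linear(const char *const *names, int count, const char *name)
{
  for (int i = 0; i < count; i++)
    if (strcmp(names[i], name) == 0) return i;
  return -1;
}

/* Order entries by name, for qsort and bsearch. */
static int cmp_entry(const void *a, const void *b)
{
  const struct prim_entry *x = a, *y = b;
  return strcmp(x->name, y->name);
}

/* Build a name-sorted copy of the table once: O(m log m).
   The original slot of each name is remembered in `index`.
   The caller frees the result. */
struct prim_entry *build_sorted(const char *const *names, int count)
{
  struct prim_entry *tbl = malloc(sizeof *tbl * (size_t)count);
  if (tbl == NULL) return NULL;
  for (int i = 0; i < count; i++) { tbl[i].name = names[i]; tbl[i].index = i; }
  qsort(tbl, (size_t)count, sizeof *tbl, cmp_entry);
  return tbl;
}

/* Binary search: O(log m) comparisons per lookup.
   Returns the original slot of `name`, or -1 if it is absent. */
int lookup_sorted(const struct prim_entry *tbl, int count, const char *name)
{
  struct prim_entry key = { name, -1 };
  const struct prim_entry *hit =
    bsearch(&key, tbl, (size_t)count, sizeof *tbl, cmp_entry);
  return hit ? hit->index : -1;
}
```

Both lookups return the same slot numbers, so the sorted variant could in principle replace the scan without changing how primitives are numbered. Whether that is worthwhile for half a millisecond is a separate question.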

Memory usage goes up by about 20k. This represents a ~0.2% increase in the memory use of ocamlc -version (which is around 9MB, measured by Linux maxrss), and is much less than the run-to-run variance. At the risk of making it a bit more complicated, we could get this back by loading more lazily, but I don't think it's worth it: if we're chasing a few tens of k, we'd have more impact lazily creating stdin's 64k buffer.

@damiendoligez damiendoligez self-assigned this Oct 4, 2023
shindere pushed a commit to shindere/ocaml that referenced this pull request Oct 5, 2023
Refactor Dynlink startup to avoid parsing bytecode sections twice.

This removes the dependency from Symtable->Bytesections, because now
Dynlink and toplevel startup can ask the runtime for the bytecode
sections that were parsed at startup time, rather than re-parsing
them in OCaml.
Member

@damiendoligez damiendoligez left a comment

Looks good modulo a few suggestions, mostly comments that need updating.

extern void caml_build_primitive_table(char_os * lib_path,
                                       char_os * libs,
                                       char * req_prims);
extern void caml_init_dynlink(char_os * lib_path,
Member

You should update the comment above this declaration.

/* Build the table of primitives, given a search path and a list
of shared libraries (both 0-separated in a char array).
Abort the runtime system on error. */
Member

You also need to update this comment.
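
For readers unfamiliar with the "0-separated in a char array" convention that the comment under review mentions, a minimal sketch of walking such a buffer follows. It uses plain char rather than the runtime's char_os, the function names are hypothetical, and it assumes the buffer ends with an empty string (two consecutive 0 bytes).

```c
#include <stddef.h>
#include <string.h>

/* Count the entries in a buffer holding strings separated by 0 bytes.
   Assumption: the buffer ends with an empty string, i.e. two consecutive
   0 bytes mark the end, as in "a\0b\0\0". */
size_t count_zero_separated(const char *buf)
{
  size_t n = 0;
  for (const char *p = buf; *p != '\0'; p += strlen(p) + 1) n++;
  return n;
}

/* Return the i-th entry (counting from 0), or NULL if there are fewer
   than i+1 entries. */
const char *nth_zero_separated(const char *buf, size_t i)
{
  for (const char *p = buf; *p != '\0'; p += strlen(p) + 1)
    if (i-- == 0) return p;
  return NULL;
}
```

Each step skips one string plus its terminating 0 byte, so the loop stops at the first empty string.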

Comment on lines 565 to 558
req_prims = read_section(fd, &trail, "PRIM", NULL);
symb_section = read_section(fd, &trail, "SYMB", &symb_section_len);
crcs_section = read_section(fd, &trail, "CRCS", &crcs_section_len);
if (req_prims == NULL) caml_fatal_error("no PRIM section");
Member

The check for req_prims == NULL should stay next to the assignment to req_prims. (diff doesn't do a good job on this one)

Suggested change
- req_prims = read_section(fd, &trail, "PRIM", NULL);
- symb_section = read_section(fd, &trail, "SYMB", &symb_section_len);
- crcs_section = read_section(fd, &trail, "CRCS", &crcs_section_len);
- if (req_prims == NULL) caml_fatal_error("no PRIM section");
+ req_prims = read_section(fd, &trail, "PRIM", NULL);
+ if (req_prims == NULL) caml_fatal_error("no PRIM section");
+ symb_section = read_section(fd, &trail, "SYMB", &symb_section_len);
+ crcs_section = read_section(fd, &trail, "CRCS", &crcs_section_len);

Member

@dra27 dra27 left a comment

Assuming I didn't mess something else up, at the moment this breaks -output-complete-exe and so forth subtly. My test is:

let () = Dynlink.allow_unsafe_modules true in
let () = Dynlink.loadfile (Filename.concat Config.standard_library "unix/unix.cma") in
print_endline "Hello, pointless world!"

and compile with -output-complete-exe. It will fail to find dllunixbyt.so.

The issue is that ld.conf and CAML_LD_LIBRARY_PATH are not inspected in the image-as-data path, and they now need to be - this will thread through as the field for dlpt.

It's imperfect, given that at the moment the Symtable.init_toplevel API model doesn't allow a program which doesn't use Dynlink to release memory, but it seems a fairly easy win to release all the memory held over at bytecode startup at the end of the first call to the primitive?
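
The "release on first call" idea can be sketched as follows. This is only an illustration under assumptions: the holder struct and both function names are hypothetical and do not correspond to anything in the runtime. The point is that the startup code keeps one copy, and the first consumer takes ownership so nothing stays retained afterwards.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical holder for section data kept from startup. */
struct kept_section { char *data; size_t len; };

/* Called at startup: keep a private copy of the section contents.
   Returns 0 on success, -1 if allocation fails. */
int keep_section(struct kept_section *k, const char *data, size_t len)
{
  k->data = malloc(len ? len : 1);
  if (k->data == NULL) { k->len = 0; return -1; }
  memcpy(k->data, data, len);
  k->len = len;
  return 0;
}

/* The first call hands ownership of the buffer to the caller, who frees it.
   Later calls return NULL because nothing is retained any more. */
char *take_section(struct kept_section *k, size_t *len)
{
  char *data = k->data;
  *len = k->len;
  k->data = NULL;
  k->len = 0;
  return data;
}
```

A program that never calls the primitive would still hold the copy for its whole lifetime, which is the imperfection noted above.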

@xavierleroy
Contributor

I apologize for this, but I'm still stuck at my question of 3 weeks ago:

[is this] increasing the memory requirements and start-up times of all bytecode programs, not just those that use Dynlink?

to which @dra27 replied

I don't think it adds to start-up time - the primitive table and DLLs were always processed before, but there is an increase in memory because those structures are then kept.

DLPT, DLLS and PRIM sections were always processed before, but now every bytecode program is also reading and keeping in memory SYMB and CRCS sections, which are probably bigger. I'm not saying this is a show-stopper, and @stedolan's quick measurements suggest it is not, but let's present the facts accurately.

For the same reasons, @stedolan's claim that

Dynlink startup currently re-parses the bytecode executable (via Symtable) to extract various bits of information, most of which the runtime has already gathered.

is not true, since the runtime neither needs nor gathers the SYMB and CRCS sections; these are only for Dynlink and for the toplevel.

Going back to the premises of this PR: what is wrong with reading sections off the bytecode executable in Dynlink and the toplevel?

@dra27
Member

dra27 commented Oct 19, 2023

@stedolan's exposition does say most of which, not all; inaccuracies in my first assessment were covered by I don't think (which apparently I hadn't done enough of).

The overall premise is to reduce the things which both ocaml/ocamlc and dynlink have to do in #11996, as that either increases code duplication between bytecomp/symtable.ml and otherlibs/dynlink/byte/dynlink_symtable.ml or means we have to duplicate modules again.

For the extra memory and processing, there's a reasonably straightforward approach which I hinted at in #12599 (comment). In making the bootstrap repeatable in #11149, I added the stripping of the CRCS section to tools/stripdebug.ml. An alternative is to determine in Bytelink whether it needs to be written in the first place, which is easily done by noting the use of the caml_get_section_table/caml_get_bytecode_sections primitive (using an idea very similar to the "weak" bytecode primitives in amongst all the zstd experiments).

The presence of a CRCS section can then be used by startup_byt.c to determine if any of this processing needs to be done (and the memory can then be released on the first call to caml_get_bytecode_sections). I think[1] that this would then mean that programs which don't use dynlink or ocamlbytecomp would not consume any additional memory at startup, and in fact would be slightly smaller, having lost their unnecessary CRCS section. It's possible a similar trick could be done with -g and the SYMB section (which IIUC is only used by the toplevel, dynlink and the debugger).

Footnotes

  1. hopefully a bit harder than before

@xavierleroy
Contributor

Maybe my message came out as harsher than I wanted it to be, sorry about that. Still, I remain more comfortable with the current approach, where Dynlink reads (possibly again) the bytecode sections it needs off the bytecode executable file, without interfering with bytecode program startup, which is quite complicated already. I think there are ways to keep doing this while cutting the dependency on compiler-libs, e.g. by duplicating the (small) OCaml code that reads bytecode sections, or by adding a C primitive to runtime/meta.c that wraps caml_read_section_descriptors / caml_seek_section / read_section, or by other ways to be discussed later.

@dra27
Member

dra27 commented Oct 23, 2023

Ah, OK - the caml_dynlink_get_bytecode_sections primitive already in the PR could instead be re-opening the bytecode executable, indeed. The thing I personally like more about the changes here is that caml_startup_code and caml_main become slightly more uniform, with Symtable no longer having to worry about how the program started, and that would remain. I'm less bothered about whether the bytecode image is physically read twice (and I wholeheartedly agree that bytecode startup is complicated...!)

@stedolan
Contributor Author

There are four sections that Dynlink needs the contents of:

  • PRIM: the list of primitives
  • DLPT: part of the shared library search path (to be combined with CAML_LD_LIBRARY_PATH and ld.conf)
  • SYMB: the map of module block offsets in the global table
  • CRCS: the list of compilation unit hashes

The runtime requires PRIM and DLPT, regardless of whether Dynlink is in use. (I was confused on this point before because the logic that parses these lives in dynlink.c but is not Dynlink-specific).

There are various ways to get this information to Dynlink. (I don't claim the list here is complete: if I'm missing someone's preferred option, say so):

  1. (Current): Have Dynlink include a big chunk of the compiler via Dynlink_compilerlibs, including a copy of the bytecode format parsing logic, and load and parse the bytecode executable at Dynlink startup.
  2. (This PR, currently): Load all four sections at runtime startup instead of just two. SYMB and CRCS are loaded but left unparsed (i.e. not unmarshalled) until Dynlink startup.
  3. Load only PRIM and DLPT at runtime startup, but record the offsets of SYMB and CRCS while the runtime is parsing the bytecode header. At Dynlink startup, re-open the bytecode and load these two sections, while taking PRIM and DLPT from the runtime.
  4. Load only PRIM and DLPT at runtime startup. Copy the bytecode parsing logic to Dynlink and have it re-parse the executable, reload PRIM and DLPT and load SYMB and CRCS.

I don't like the current option (1) because it is one of the few remaining things keeping dynlink_compilerlibs around. @xavierleroy doesn't like option (2) because it keeps the contents of SYMB and CRCS around pointlessly in programs not using Dynlink.

I'm happy with anything that's not (1), but I'd now prefer (3): that way only Dynlink loads the SYMB and CRCS sections, but it doesn't have to duplicate the bytecode table-of-contents logic nor (worse) the ld.conf parsing. @xavierleroy does that seem reasonable to you?
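
A minimal sketch of the "record the offsets" part of option (3), under stated assumptions: section data are stored back to back in table-of-contents order, followed by the table itself (8 bytes per entry) and a fixed-size trailer. The struct and function names are hypothetical, and the layout is a simplified model rather than a specification of the bytecode executable format.

```c
#include <stdint.h>
#include <string.h>

/* Simplified model of one table-of-contents entry: a 4-character section
   name (not 0-terminated) and the section length in bytes. */
struct toc_entry { char name[4]; uint32_t len; };

/* Assumed layout: [section 0][section 1]...[TOC, 8 bytes/entry][trailer].
   Returns the distance from the END of the file to the start of the named
   section, or -1 if the section is absent. `name` must point to at least
   4 characters. */
int64_t section_offset_from_end(const struct toc_entry *toc, int count,
                                int64_t trailer_size, const char *name)
{
  /* Start just before the TOC and walk the sections backwards. */
  int64_t ofs = trailer_size + (int64_t)count * 8;
  for (int i = count - 1; i >= 0; i--) {
    ofs += toc[i].len;
    if (memcmp(toc[i].name, name, 4) == 0) return ofs;
  }
  return -1;
}
```

With offsets like these recorded at startup, a later reader would only need to reopen the file, seek to (file size - offset) and read the section's length in bytes, without keeping SYMB or CRCS in memory in the meantime.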

@xavierleroy
Contributor

I didn't think of option (3) before, but it looks good to me, and a nice way to move forward on this PR.

For what it's worth, I had a look at (4) but didn't go very far yet. The code to read sections off bytecode executable files is rather simple and easy to maintain: we have an implementation in Perl (!) in tools/ocamlsize that had its last nontrivial change in 2000 (when the "sections" mechanism was introduced)...

@dra27 mentioned yet another alternative: (5) export some of the C code that reads TOC and sections off bytecode files as OCaml primitives, and use them to read sections in byte/dynlink.ml. I didn't look into this yet.

@stedolan stedolan force-pushed the dynlink-parse-bytecode-once branch 2 times, most recently from 8ad439e to cd4196b Compare December 4, 2023 18:24
@stedolan
Contributor Author

stedolan commented Dec 4, 2023

@xavierleroy I've updated the code to approach, eh, (3.5): it reopens the bytecode executable and re-parses the section table (using the nice C API already present in the runtime), but keeps the primitive table and search path as previously computed by the runtime.

(I still need to address @dra27's comments about output-(complete?)-obj, I'll have a look at those tomorrow)

@stedolan
Contributor Author

stedolan commented Dec 5, 2023

(I still need to address @dra27's comments about output-(complete?)-obj, I'll have a look at those tomorrow)

This is done now

@stedolan stedolan requested a review from dra27 December 5, 2023 13:51
@gasche gasche added this to the 5.2 milestone Dec 13, 2023
Member

@dra27 dra27 left a comment

Thanks! The -output-complete-... stuff indeed now works. I have thoroughly (hopefully!) re-reviewed the logic. A few minor suggestions, and a tedious tweak needed to list_of_ext_table to cope with UCS-2 strings on Windows (that I should have spotted the first time round).

@stedolan
Contributor Author

Thanks for the review @dra27. Just pushed an updated version.

I ended up deleting list_of_ext_table and inlining it in its two callsites: your observation that the two uses should use different copy string functions means it's harder to share code, but after applying your suggestion to use caml_alloc_2 it's only 3 lines long, so I no longer feel the need to share the uses.

Member

@dra27 dra27 left a comment

This looks great, thank you! Marking as approved to clear my "requesting changes", but this shouldn't be merged until the bootstrap commit is re-done.

Unfortunately, you've been stung on the rebase and the bootstrap isn't repeatable: boot/ocamlrun boot/ocamlc -vnum reports 5.2.0+dev0-2023-04-11 where it ought to be reporting 5.3.0+dev0-2023-12-22 (I expect you didn't re-run configure after the rebase; a fact which really the build system ought to complain about).

The bootstrap, as now, needs to be in a separate commit from the main change - I guess you're planning on squashing the 5 commits together and then the 6th commit?

@dra27
Member

dra27 commented Jan 19, 2024

From a side-channel discussion with @stedolan - the bootstrap part has been removed from here, so this PR can be squash-merged once CI catches up; the removal of the old primitive can then be put more cleanly in a separate PR, which is slightly less awkward.

@stedolan stedolan merged commit b851fea into ocaml:trunk Jan 19, 2024
11 checks passed
dra27 pushed a commit that referenced this pull request Feb 1, 2024
…2599)

This removes the dependency from Symtable->Bytesections, because now
Dynlink and toplevel startup can ask the runtime for the bytecode
sections that were parsed at startup time, rather than re-parsing
them in OCaml.

(cherry picked from commit b851fea)
6 participants