Skip to content

Conversation

@hhugo
Copy link
Contributor

@hhugo hhugo commented Oct 11, 2024

Context

Js_of_ocaml uses different memory representation for bytes and strings. It leverages JavaScript immutable strings to implement ocaml ones (initial change in ocsigen/js_of_ocaml#923, enabled by default in ocsigen/js_of_ocaml#976).

This means that Bytes.unsafe_to_string, Bytes.unsafe_of_string are not longer the identity.
The additional cost is usually ok because theses conversions often comes next to computations that are linear in the size of the argument of the conversions. However, some use cases convert large strings/bytes but only look at a small substring/subbytes.

This issue was spotted by @vouillon in ocsigen/js_of_ocaml#1703.
The Yojson parser/lexer uses the following helper that currently result in converting the whole lexbuf to string over and over again.

  let add_lexeme buf (lexbuf : Lexing.lexbuf) =
    let len = lexbuf.lex_curr_pos - lexbuf.lex_start_pos in
    Buffer.add_subbytes buf lexbuf.lex_buffer lexbuf.lex_start_pos len

This PR

In this PR, One tried to find and fix such problematic usage in the stdlib.

  • Buffer.add_subbytes no longer rely on the string implementation.
    and as a result Buffer.add_bytes not longer perform conversion either.

  • Digest.(sub)?bytes and Digest.MD5.(sub)?bytes no longer rely on the string implementation.

  • In addition, one has removed useless conversions when hashing the content of a channel.

The PR also contains another kind of change, that has to do with "fixing" the fast path of String.escaped with jsoo.
String.escaped now avoids a conversion from bytes to string if the optimisation/fastpath in Bytes.unsafe_escape triggers. This change could be dropped from this PR if it's problematic.

@gasche
Copy link
Member

gasche commented Oct 12, 2024

This looks okay to me from a distance. (The String.escaped change is different in nature, I suppose you will explain it separately.) I suppose you are planning to do this anyway, but a bit of explanations on the js_of_ocaml string/bytes representation choices could help appreciate the change.

@hhugo hhugo marked this pull request as ready for review October 12, 2024 12:30
@hhugo
Copy link
Contributor Author

hhugo commented Oct 12, 2024

This looks okay to me from a distance. (The String.escaped change is different in nature, I suppose you will explain it separately.) I suppose you are planning to do this anyway, but a bit of explanations on the js_of_ocaml string/bytes representation choices could help appreciate the change.

@gasche, I completed the PR, added proper description and undrafted the PR

@gasche
Copy link
Member

gasche commented Oct 12, 2024

If I understand things correctly, one way to reason about performance for js_of_ocaml is to consider that the unsafe_{of,to}_string functions have linear cost (they behave morally like their safe counterparts), and to avoid the cases where this strictly worsens the complexity of the code.

Besides the cases that you have already noticed, I think that there are write_substring and send{,to}_substring functions in otherlibs/unix/unix_{unix,win32}.ml that may need to be changed along these lines, typically by exposing new primitives caml_unix_write_string and caml_unix_send_string. (Maybe js_of_ocaml programs never use those anyway?) I think that would cover all the cases I found when grepping for unsafe_{of,to}_string in the compiler distribution -- outside the testsuite.

@hhugo
Copy link
Contributor Author

hhugo commented Oct 12, 2024

If I understand things correctly, one way to reason about performance for js_of_ocaml is to consider that the unsafe_{of,to}_string functions have linear cost (they behave morally like their safe counterparts), and to avoid the cases where this strictly worsens the complexity of the code.

It's currently a bit more complicated but could become exactly that if we merge ocsigen/js_of_ocaml#1229.
Bytes are currently able to use js string or array as their underlying datastructure and will convert between the two depending on operations performed on them.

Bytes.unsafe_of_string alone will be O(1) as it currently reuse its argument for the underlying data.
Bytes.unsafe_to_string alone can be O(1) or O(n) depending on the underlying structure.

Besides the cases that you have already noticed, I think that there are write_substring and send{,to}substring functions in otherlibs/unix/unix{unix,win32}.ml that may need to be changed along these lines, typically by exposing new primitives caml_unix_write_string and caml_unix_send_string. (Maybe js_of_ocaml programs never use those anyway?) I think that would cover all the cases I found when grepping for unsafe_{of,to}_string in the compiler distribution -- outside the testsuite.

write_substring and send{,to}_substring are currenly not supported by jsoo and are not problematic because Bytes.unsafe_of_string is currently O(1)

Copy link
Member

@gasche gasche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am okay with the PR, but I would prefer if you could get rid of the code duplication in the C code. (Unless I am missing something, this should be very easy.)

@gasche
Copy link
Member

gasche commented Oct 12, 2024

This is good to go, would you like to rebase to squash/fixup the relevant commits?

@hhugo
Copy link
Contributor Author

hhugo commented Oct 13, 2024

This is good to go, would you like to rebase to squash/fixup the relevant commits?

This is done

@gasche gasche merged commit c256a92 into ocaml:trunk Oct 13, 2024
16 checks passed
@gasche
Copy link
Member

gasche commented Oct 13, 2024

Merged, thanks! Does it make a difference for you whether we cherry-pick in 5.3 (in which case I need to ask our glorious release manager)?

@hhugo hhugo deleted the less-unsafe branch October 13, 2024 10:59
@hhugo
Copy link
Contributor Author

hhugo commented Oct 13, 2024

Merged, thanks! Does it make a difference for you whether we cherry-pick in 5.3 (in which case I need to ask our glorious release manager)?

It could help with merging ocsigen/js_of_ocaml#1229 sooner but it's not a big deal if it doesn't make it,

@gasche
Copy link
Member

gasche commented Oct 13, 2024

@Octachron let us know if you agree with backporting this in 5.3. This is a performance bugfix that only affects js_of_ocaml. It changes the set of runtime primitives (it exposes bytes-specific variants of caml_md5_string and caml_blake2_string), so I could see it breaking the build of certain systems that are tightly coupled to the runtime (for example variants of Mirage).

@avsm
Copy link
Member

avsm commented Oct 13, 2024

We can adapt those Mirage-specific breakages without much trouble I think, but I think we have our own stubs independent of the runtime in mirage-crypto /cc @hannesm @samoht

@Octachron
Copy link
Member

As a well-scoped performance bug fix, this looks ok to integrate at this point of the release cycle.

gasche added a commit that referenced this pull request Oct 14, 2024
Remove some String to/form Bytes conversion to behave better with jsoo

(cherry picked from commit c256a92)
@gasche
Copy link
Member

gasche commented Oct 14, 2024

Okay, I merged this in 5.3, along with its sidekick #13550 (but not #13546, which is doc-only so I didn't bother).

@hhugo
Copy link
Contributor Author

hhugo commented Oct 14, 2024

If we ever make Bytes.unsafe_of_string O(n), such as in ocsigen/js_of_ocaml#1229, we would also need a fix for String.sub, String.get_intN_*, String.get_uintN_*, String.get_utf_*, String.is_valid_utf_*. They convert the whole string argument to bytes but only look at a small subset.

@gasche
Copy link
Member

gasche commented Oct 14, 2024

I think that we are not opposed to such changes in principle, as long as the maintenance costs are reasonable. The liberal use of those unsafe conversions in the current codebase comes from a least-effort approach when we did the split.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants