Skip to content

Split runtime/array.c functions, use the new floatarray primitives in Float.Array#13361

Merged
gasche merged 8 commits intoocaml:trunkfrom
gasche:split-array-functions
Aug 5, 2024
Merged

Split runtime/array.c functions, use the new floatarray primitives in Float.Array#13361
gasche merged 8 commits intoocaml:trunkfrom
gasche:split-array-functions

Conversation

@gasche
Copy link
Member

@gasche gasche commented Aug 4, 2024

This PR is buildup work for ocaml/RFCs#37. It splits the caml_array_foo functions in runtime/array.c in three versions, caml_flotarray_foo, caml_uniform_array_foo, and finally caml_array_foo. The new floatarray_foo versions are used in Float.Array, which could noticeably improve performance as they replace pure-OCaml implementations.

@nojb would you maybe be interested in reviewing this?

@gasche gasche force-pushed the split-array-functions branch from 4efdca5 to 85eb4b3 Compare August 4, 2024 13:31
Copy link
Contributor

@nojb nojb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can see, the patch is correct. LGTM

@gasche gasche added runtime-system refactoring Performance PR or issues affecting runtime performance of the compiled programs labels Aug 4, 2024
@gasche gasche force-pushed the split-array-functions branch 3 times, most recently from ec9c7b6 to a527e53 Compare August 4, 2024 16:23
@gasche
Copy link
Member Author

gasche commented Aug 4, 2024

The debug CI caught a couple mistakes in CAMLassert assertions -- in one place I used Double_array_tag instead of Double_tag, in another I neglected that floatarray can also have tag 0 when they are empty. I am fixing these.

@nojb
Copy link
Contributor

nojb commented Aug 4, 2024

The debug CI caught a couple mistakes in CAMLassert assertions -- in one place I used Double_array_tag instead of Double_tag, in another I neglected that floatarray can also have tag 0 when they are empty. I am fixing these.

That will teach me to look more carefully at assertions!

@gasche gasche force-pushed the split-array-functions branch from a527e53 to 4b8f4e1 Compare August 4, 2024 20:24
@gasche
Copy link
Member Author

gasche commented Aug 4, 2024

There was a tricky bug hiding in the CI failure: I believed that all empty arrays are (physically) equal to Atom(0), but in fact this invariant seems preserved by ocamlc but not by ocamlopt.

$ cat test.ml
let () = assert ([||] == Array.init 0 (fun i -> i))
$ ocamlopt -o test test.ml
$ ./test
Fatal error: exception Assert_failure("test.ml", 1, 9)

I was using this assumption in the following implementation of caml_array_gather, which is the basically the only new piece of logic in array.c:

  for (mlsize_t i = 0; i < num_arrays; i++) {
    /* An array is either the empty array Atom(0),
       or a float array, or a non-float array.
       We know which implementation to use on the first non-empty array. */
    if (arrays[i] == Atom(0))
      continue;
    else if (Tag_val(arrays[i]) == Double_array_tag)
      return caml_floatarray_gather(num_arrays, arrays, offsets, lengths);
    else
      break;
  }
  return caml_uniform_array_gather(num_arrays, arrays, offsets, lengths);

(This code was written to avoid a second full traversal of the arrays array, by doing an early exit on the first non-empty array.)

This code is wrong due to the existence of empty arrays that are not Atom(0), and I fixed it as follows:

  for (mlsize_t i = 0; i < num_arrays; i++) {
    /* An array is either an empty array,
       or a float array, or a non-float array.
       We know which implementation to use on the first non-empty array. */
    if (Wosize_val(arrays[i]) == 0)
      continue;
    else if (Tag_val(arrays[i]) == Double_array_tag)
      return caml_floatarray_gather(num_arrays, arrays, offsets, lengths);
    else
      break;
  }
  return caml_uniform_array_gather(num_arrays, arrays, offsets, lengths);

@lthls
Copy link
Contributor

lthls commented Aug 4, 2024

There was a tricky bug hiding in the CI failure: I believed that all empty arrays are (physically) equal to Atom(0), but in fact this invariant seems preserved by ocamlc but not by ocamlopt.

I suspect that even ocamlc was supposed to treat [||] as a static constant (so stored in the symbol tables of the bytecode executable, not as an atom) but the code for traducing constant arrays as static constants was delayed when flambda was introduced, and as a result small constant arrays are not treated as constants anymore in bytecode.
I'll try to look at the history to see if that was intended or not.

@alainfrisch
Copy link
Contributor

FTR, we are currently exploring the performance of Float.Array.fill. It turns out that for small-ish arrays (say, below 10 elements), the for-loop version is significantly faster than calling into C; so, this PR actually slows down this function (and we need to go to relatively large sizes, around 100) to see a consistant and significant speed up. I don't know what to do with this information though (small sizes are fast anyway...) : could we imagine having some logic in the stdlib code to use either a for-loop and a call into C, depending on the size? (Of course, the threshold would be different depending on the backend, and very different between bytecode and native.)

Another point about the performance of Float.Array.fill. It seems that with the default flags, this function is not inlined (-inline 2 is enough, but the default is 1.25). In the code we are currently optimizing, the float argument that we pass to this function is read from a floatarray; the lack of inlining forces an allocation to box that float, which is quite sad, because it is then passed in unboxed form to the C primitive. Several ways to address it:

  • Add an explicit [@ocaml.inline always] annotation just for that function (and other similar ones).
  • Increase the -inline parameter when compiling float.ml.
  • Reconsider the current default threshold (1.25) which seems quite low to me.
  • Change the inlining heuristics to make it more likely to inline functions that unbox some of their arguments (the rationale being that inlining them can avoid useless boxing).

@gasche
Copy link
Member Author

gasche commented Feb 26, 2025

Looking at the definition of the external in float.ml, I see that I forgot to mark it @@noalloc. Doing it should decrease the FFI costs, possibly significantly. If you already have a small microbenchmark that you are playing with, can you try it? Otherwise I will write one.

@alainfrisch
Copy link
Contributor

I was using this trivial microbenchmark : test_floatarray.ml.txt

I'll try to find time to play with your suggestion next week.

@alainfrisch
Copy link
Contributor

I've tried different variant with/without the noalloc attribute, with/without inlining, with/without bound checking.

Conclusion: yes, adding noalloc brings a visible gain for small sizes, and there is no downside, I guess. Forcing the inlining of check and fill is also quite useful for such sizes (but there is small risk of increasing code size).

The inlined+noalloc version (with bound checks) is still quite a bit slower than the version with a for-loop (with bound checks), but it's much less dramatic than with the current version (not inlined, no noalloc attribute).

@gasche gasche mentioned this pull request Mar 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Performance PR or issues affecting runtime performance of the compiled programs refactoring runtime-system

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants