
Bigarray: do not change GC pace when creating sub-arrays #12500

Closed
4 commits

Conversation

@gasche (Member) commented Aug 24, 2023

This is another attempt to fix #12491 (the first attempt #12493 did not convince me).

This PR introduces a new bigarray-creation function

static value caml_ba_inherit(value vb, int num_dims, intnat * dim)

that creates a sub-array: a new OCaml bigarray that shares its data with an existing OCaml bigarray, and that correctly accounts for external memory usage instead of needlessly increasing the pace of the GC.

The behavior of caml_ba_alloc is unchanged.

This caml_ba_inherit function could be exported to users, who might find it useful for defining their own bigarray-reusing functions, but this PR does not do so -- the function is only visible inside bigarray.c, where it serves the data-reusing functions exposed in the stdlib.

I checked that the reproduction case of #12491 terminates immediately with this PR.

PR structure / review tips

This PR is best reviewed commit by commit.

The first two commits are refactoring commits that should keep the behavior unchanged:

  • the first commit factors out the overflow-resilient size computation,
    to make bigarray-creating functions easier to follow
  • the second introduces the caml_ba_inherit helper and uses it
    (at this point the GC accounting is still wrong)

The last commit fixes the bug by introducing a caml_ba_alloc_gen function that takes an extra parameter, owns_data, indicating whether the external memory should be credited to the OCaml heap, and by using that function in caml_ba_inherit.

@gasche gasche force-pushed the bigarray-proxy-accounting-2 branch from 594f0be to 9df9ee9 Compare August 26, 2023 04:35
@xavierleroy (Contributor)

I think this PR implements the correct behavior, but I have a feeling that it duplicates too much code, even with the factoring of the overflow-avoiding size computation. To make this feeling precise, here are two alternate implementations:

This one just adds a CAML_BA_SUBARRAY flag that lets caml_ba_alloc do the right thing for subarrays. The flag is recorded in the subarray, and might be useful one day (?)

This one adds a caml_ba_sub_alloc function that duplicates a bit of code from caml_ba_alloc, but less than in your solution, because for sub-arrays there is no need to allocate data and no need to compute the total size.

@gasche (Member, Author) commented Aug 31, 2023

Thanks for the alternate implementations.

My implementation is noticeably more invasive, but it has the benefit of factoring out some logic common to all sub-array-construction functions, namely the management of Custom_ops and of the proxy. I find the resulting code a bit cleaner, because it hides proxy management from domain-specific code.

If we decide that we don't care about this factoring/simplification (after all, we are just trying to fix a bug), I am drawn to the minimality of the flag-based approach. (I considered adding new flags, but I thought it would have to be a new flag in the caml_ba_managed enum, and I was worried about the backwards-compatibility implications of changing CAML_BA_MANAGED_MASK.) I agree that keeping the information in the metadata is a nice side effect.

In your second proposal, I am not sure that it is a good idea to skip dimension checking in caml_ba_sub_alloc. I agree that all current callers respect the invariant that the new total size is smaller than or equal to the previous one (and so cannot overflow, etc.), but I think it would be safer (and quite easy) to check this in some way in the array-creation function.

@gasche (Member, Author) commented Aug 31, 2023

(Note: in my mind the next step for this PR is for @xavierleroy to formulate a clear preference among one of the three approaches proposed so far. If it's for one of his, I will force-push the PR to adopt his commit instead and do the reviewing myself.)

@xavierleroy (Contributor)

The flag-based solution is short and sweet. But I notice that in my second approach the caml_ba_sub_alloc function could also take care of

  /* Copy the finalization function from the original array (PR#8568) */
  Custom_ops_val(res) = Custom_ops_val(vb);
  /* Create or update proxy in case of managed bigarray */
  caml_ba_update_proxy(b, Caml_ba_array_val(res));

This code is currently replicated 4 times, which is not nice and is error-prone. Can we take a bit more time to think about this?

> In your second proposal, I am not sure that it is a good idea to skip dimension checking in caml_ba_sub_alloc. I agree that all current callers respect the invariant that the new total size is smaller or equal to the previous one

This is not checked currently either. The only check performed is that the total size of the subarray (as determined from its dimensions) doesn't overflow.

@gasche (Member, Author) commented Aug 31, 2023

If you want to add the custom-ops+proxy handling to caml_ba_sub_alloc, you need to take the bigarray value as a parameter, and this becomes very close to the current PR.

(I am not sure what part of the current PR you count as code duplication. Can you elaborate?)

> This is not checked currently either. The only check performed is that the total size of the subarray (as determined from its dimensions) doesn't overflow.

Right, and in fact this is also true of the code in my PR (in addition to the code in trunk). So maybe we can set this aspect aside for now.

@xavierleroy (Contributor)

Indeed, your caml_ba_inherit function is fine; the code duplication I was unhappy about is in caml_ba_alloc_gen.

@gasche (Member, Author) commented Aug 31, 2023

But caml_ba_alloc_gen replaces caml_ba_alloc (which is just a wrapper around it), the code is moved rather than copied; is it the duplication of documentation and parameter list that you are worried about? (I could probably just remove the documentation of caml_ba_alloc.)

@xavierleroy (Contributor)

Sorry, I've been working from memory on this PR, and clearly my memory isn't good enough. Let me reconsider that later, OK?

@fabbing (Contributor) commented Nov 14, 2023

It looks like your PR solves the performance regression reported in #12460!
I haven't looked at the changes yet, but I will review them in detail (I'm not yet familiar with Bigarray) and also consider @xavierleroy's two alternative implementations.

@fabbing (Contributor) commented Nov 15, 2023

I can confirm that this PR solves the performance issue, or at least drastically reduces it, in @ghennequin's par-matmul-profiling reproduction example.
owl uses Bigarray, and the increased GC pace causes many more major collections (about 7 times more in 5.1 than in 5.0), so more running time is spent on GC work rather than mutator work.

Both the current code of @gasche's PR and @xavierleroy's bigarray-pace-1 bring the GC back to the same number of major collections as in 5.0.0.

GC stats from `par-matmul-profiling`:

|                          | 5.0.0     | 5.1.0    | pr12500   | pace-1    |
|--------------------------|-----------|----------|-----------|-----------|
| minor_collections        | 625       | 310      | 809       | 848       |
| major_collections        | 39        | 285      | 39        | 39        |
| compactions              | 0         | 0        | 0         | 0         |
| forced_major_collections | 1         | 1        | 1         | 1         |
| minor_words              | 127335409 | 54592452 | 115691617 | 110620325 |
| promoted_words           | 96211     | 105612   | 96117     | 95962     |
| major_words              | 103751    | 113691   | 103658    | 103369    |
| top_heap_words           | 1917231   | 1589551  | 1696047   | 1790255   |
| heap_words               | 196508    | 188316   | 204700    | 200604    |
| live_words               | 18897     | 18968    | 18901     | 18901     |
| free_words               | 176704    | 168460   | 184880    | 180786    |
| largest_free             | 0         | 0        | 0         | 0         |
| fragments                | 907       | 888      | 919       | 917       |
| live_blocks              | 1706      | 1716     | 1707      | 1707      |
| free_blocks              | 0         | 0        | 0         | 0         |
| heap_chunks              | 0         | 0        | 0         | 0         |

As for the code, if I had to rank the proposals by preference, I would say:

1. bigarray-pace-1 is appealing because it is particularly concise.
2. @gasche's code is more verbose, but I don't think it duplicates code; rather, it refactors ba_compute_size out of caml_ba_alloc and caml_ba_deserialize. If bigarray-pace-1 is favoured, maybe this idea could be integrated into bigarray-pace-1 or another PR?
3. bigarray-pace-2 feels riskier, as it duplicates code from caml_ba_alloc in caml_ba_sub_alloc.

Speedup on my (noisy) 4-core machine:

| Domains | 5.0     | 5.1      | pr12500  | pace-1   |
|---------|---------|----------|----------|----------|
| 1       | 1.00106 | 1.03771  | 0.988304 | 0.971236 |
| 2       | 1.36456 | 0.980097 | 1.24194  | 1.12929  |
| 4       | 1.97288 | 1.53115  | 1.87095  | 1.83326  |

@gasche (Member, Author) commented Nov 17, 2023

Thanks @fabbing for the detective work that brought this PR/bugfix back into focus.

After giving it a few months of rest, I think that all the proposed options are basically fine. I thus feel tempted by the minimal approach of using a new flag: it is the least amount of code, and thus the easiest to review, and the flag may be useful in the future. If we want to factor the proxy handling out, we can send (you, @fabbing, can send!) a separate PR for that.

@xavierleroy what do you think of submitting trunk...xavierleroy:ocaml:bigarray-pace-1 as a PR of your own? I could help review it and get it merged. (I don't mind replacing my PR with your code, but then I cannot act as reviewer.)

@xavierleroy (Contributor)

Submitted the minimal fix as #12754.

xavierleroy added a commit to xavierleroy/ocaml that referenced this pull request Nov 21, 2023

This is achieved by adding a CAML_BA_SUBARRAY flag that is honored by caml_ba_alloc.

Fixes: ocaml#12491
Closes: ocaml#12500
Octachron pushed a commit that referenced this pull request Nov 28, 2023
This is achieved by adding a CAML_BA_SUBARRAY flag that is honored by caml_ba_alloc.

Fixes: #12491
Closes: #12500
(cherry picked from commit 6125c29)
Development

Successfully merging this pull request may close these issues.

Bigarray sub-arrays counted towards total out-of-heap memory usage
3 participants