Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

incorrect GC statistics on program exit #11663

Closed
gasche opened this issue Oct 24, 2022 · 6 comments
Closed

incorrect GC statistics on program exit #11663

gasche opened this issue Oct 24, 2022 · 6 comments
Assignees
Labels

Comments

@gasche
Copy link
Member

gasche commented Oct 24, 2022

In some cases the option OCAMLRUNPARAM="v=0x400, which prints GC statistics on program exit, reports incorrect statistics on trunk (and I suppose 5.0).

Repro case:

let nb_iters = 1
let nb_allocs = 1_000_000
let nb_domains = 2

let work () =
  for _ = 1 to nb_iters do
    ignore (Array.init nb_allocs (fun i -> Some i))
  done

let () =
  let workers = Array.init (nb_domains - 1) (fun _ -> Domain.spawn work) in
  work ();
  Array.iter Domain.join workers

On my machine, running this program on trunk with OCAMLRUNPARAM="v=0x400" shows GC stats on program exit with the following:

heap_words: 6059142
top_heap_words: 3039811

it makes no sense to have heap_words > top_heap_words, clearly the statistics are wrong.

(On my machine, the wrong result is deterministic with these constants in the repro case, but with other constants only some runs would have this wrong property.)

Adding a call to Gc.major () at the end of the program fixes the issue.

let () =
  let workers = Array.init (nb_domains - 1) (fun _ -> Domain.spawn work) in
  work ();
  Array.iter Domain.join workers;
  Gc.major ()
heap_words: 3039811
top_heap_words: 3039811

(Calling Gc.full_major () also works and decreases the final heap_words value by a lot.)

It is known that statistics are only in a partially-inconsistent state in-between two major GCs, and that some statistics can only be trusted at the end of each GC cycle. The observed behavior is still clearly a bug. Forcing a major collection cycle on program exit in v=0x400 mode would fix the observed behavior, but I suspect that it would only hide the underlying bug -- I don't see why it would ever be correct to have heap_words > top_heap_words, even in-between cycles.

@gasche
Copy link
Member Author

gasche commented Oct 24, 2022

(cc @Engil who worked with me on the GC-statistics code in the past and @lthls who recently tried to figure out weird GC issues to fix an flambda-specific testsuite failure.)

@lthls
Copy link
Contributor

lthls commented Oct 26, 2022

I think the answer is unfortunately rather trivial. If you look at the code of caml_compute_gc_stats, pool_max_words is very coarsely approximated by taking the sum of the maximums of all the currently domains... but not the orphaned stats.
So you have two domains, one of them terminates and orphans its shared heap, and the program terminates without any other domain adopting the pools. The result is that the reported heap words represent the words from both the main domain and the orphaned pools, while the reported max heap words only takes into account the words from the main domain..
I suspect this can be fixed by adding the orphaned heap stats' max words to the accumulators before looping on the domains.

@gasche
Copy link
Member Author

gasche commented Oct 26, 2022

Thanks for the analysis! I haven't checked but it makes sense.

I think the answer is unfortunately rather trivial.

Why "unfortunately"? I like when bugs have simple fixes.

Are you planning to submit a bugfix PR, or should I try to do it based on your analysis?

P.S.: yes, we should rewrite the computation of top_heap_words to be less incorrect. It somehow hasn't found itself on the top of anyone's TODO list yet.

@lthls
Copy link
Contributor

lthls commented Oct 26, 2022

Why "unfortunately"? I like when bugs have simple fixes.

I was hoping that I would learn more about the multicore GC in the process, but in the end there wasn't much to look at outside the GC stats computation. I did take the opportunity to look at the shared heap implementation though, so not everything is lost.

Are you planning to submit a bugfix PR, or should I try to do it based on your analysis?

I'm not planning a PR, but if we forget about this long enough for the stale bot to complain I'll probably feel guilty enough to write and submit a patch.

@gasche
Copy link
Member Author

gasche commented Oct 27, 2022

@Engil I missed your self-assignment, sorry about this, but I sent a bugfix proposal at #11671.

@gasche
Copy link
Member Author

gasche commented Oct 27, 2022

(It would be more work, more useful and more interesting to compute proper maxima at the end of each major GC cycle, to get a meaningful value instead of our shoddy sum of local maxima. I'm not planning to work on this just yet.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants