[new release] moonpool (0.4) #24306

Merged 1 commit, Aug 30, 2023

c-cube
Contributor

@c-cube c-cube commented Aug 24, 2023

Pools of threads supported by a pool of domains

CHANGES:
  • add Fut.{reify_error,bind_reify_error}

  • full lifecycle for worker domains, where a domain
    will shut down if no thread runs on it, after a
    short delay.

  • fix: generalize type of create_arg

  • perf: in Bb_queue, only signal condition on push if queue was empty
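
To illustrate the last item, here is a minimal sketch of a blocking queue that wakes consumers only when a push finds the queue empty. It is not moonpool's actual `Bb_queue`: the type and the function names `create`, `push`, and `pop` are hypothetical, and it assumes OCaml 5, where `Mutex` and `Condition` are part of the standard library. It uses `Condition.broadcast` rather than `Condition.signal` so that several blocked consumers cannot miss a wakeup.

```ocaml
(* Hypothetical blocking queue, for illustration only; assumes OCaml 5. *)
type 'a t = {
  mutex : Mutex.t;
  cond : Condition.t;
  q : 'a Queue.t;
}

let create () : 'a t =
  { mutex = Mutex.create (); cond = Condition.create (); q = Queue.create () }

(* Push [x]. Consumers can only be blocked while the queue is empty,
   so the condition is only triggered when the queue was empty. *)
let push (self : 'a t) (x : 'a) : unit =
  Mutex.lock self.mutex;
  let was_empty = Queue.is_empty self.q in
  Queue.push x self.q;
  if was_empty then Condition.broadcast self.cond;
  Mutex.unlock self.mutex

(* Pop the oldest element, blocking while the queue is empty. *)
let pop (self : 'a t) : 'a =
  Mutex.lock self.mutex;
  while Queue.is_empty self.q do
    Condition.wait self.cond self.mutex
  done;
  let x = Queue.pop self.q in
  Mutex.unlock self.mutex;
  x
```

The saving is that a push onto a non-empty queue skips the condition-variable call entirely, which matters when producers push in bursts.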

@c-cube
Contributor Author

c-cube commented Aug 24, 2023

So that's going to be interesting to debug. What machine does this run on?

@avsm
Member

avsm commented Aug 24, 2023

How many domains are the tests trying to create? It'll be running on a many core x86 machine in that first test, but may be running into cgroups resource limits.

@c-cube
Contributor Author

c-cube commented Aug 24, 2023 via email

@c-cube
Contributor Author

c-cube commented Aug 25, 2023

So this is quite puzzling. I tried on a c3.4xlarge instance and a friend tried on his M1, to no avail — the test always passes. Can it be the cgroup settings?

@avsm
Member

avsm commented Aug 29, 2023

It's very likely to do with cgroups restrictions in opam-repo-ci. /cc @mtelvers to perhaps shed light on this.

@mtelvers
Contributor

This has been a real puzzle for me this afternoon: I took a long look at cgroup limits on the cluster workers, ran the spec file on different workers, and got some expert advice from @dra27. The test fails when nproc >= Max_domains. The machines where it fails include cree, pima and marpe, which have nproc of 128, 128 and 256, respectively. On smaller machines where nproc < 127, there is no issue. This has also been tested as a bare-metal installation on pima (no Docker or OBuilder) and the results are the same.

@dra27
Member

dra27 commented Aug 29, 2023

I'm not sure where the bug lies, but the domains are never being joined. I was able to reproduce the failures very easily with a custom compiler with Max_domains at 16 on the same machine (that triggered even more test failures with the same exception). I hacked together some domain joining logic (the hack was to record the Domain.t with the domain state and always join on it before re-using the slot). With that version, I'm not seeing the crash.
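
The joining hack described above could look roughly like the sketch below. This is an illustration under assumptions, not the actual patch: the `slot` type and the functions `make_slots`, `spawn_in_slot`, and `join_all` are hypothetical names, and the code assumes a single coordinating thread manages the slots (it does no locking of its own). The idea is that each slot remembers the `Domain.t` it last started and calls `Domain.join` on it before a new domain is spawned in that slot.

```ocaml
(* Hypothetical slot table, for illustration only; assumes OCaml 5.
   Not thread-safe: callers must serialize access to the slots. *)
type slot = { mutable handle : unit Domain.t option }

let make_slots (n : int) : slot array =
  Array.init n (fun _ -> { handle = None })

(* Run [work] in slot [i]. If a domain was previously started in this
   slot, wait for it to terminate before spawning the next one. *)
let spawn_in_slot (slots : slot array) (i : int) (work : unit -> unit) : unit =
  let s = slots.(i) in
  (match s.handle with
   | Some d -> Domain.join d
   | None -> ());
  s.handle <- Some (Domain.spawn work)

(* Join every domain that is still recorded in a slot. *)
let join_all (slots : slot array) : unit =
  Array.iter
    (fun s ->
      match s.handle with
      | Some d ->
        Domain.join d;
        s.handle <- None
      | None -> ())
    slots
```

With this shape, the number of domains alive at once never exceeds the number of slots, regardless of how many times slots are reused.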

@c-cube
Contributor Author

c-cube commented Aug 29, 2023

Thank you @mtelvers and @dra27 for the thorough investigation and the fix!!

@avsm
Member

avsm commented Aug 30, 2023

Such a success disaster, that we have machines with so many cores that we're blowing past the maximum available domains in OCaml already ;-) Thank you for the thorough investigation @mtelvers and @dra27, and @c-cube for the revised package!

@avsm avsm merged commit 2bdf65b into ocaml:master Aug 30, 2023
2 checks passed
@c-cube
Contributor Author

c-cube commented Aug 30, 2023 via email
