OOM-kill on large PRs #220

Closed
kit-ty-kate opened this issue May 4, 2023 · 6 comments
Labels
bug Something isn't working

Comments

@kit-ty-kate
Contributor

ocaml/opam-repository#23742 takes all of the 32 GB of RAM on the server opam-repo-ci is running on, and the service is crashing and restarting in a loop.

kit-ty-kate added the bug label on May 4, 2023
@rikusilvola

opam.ci.ocaml.org has temporarily moved to toxis.caelum.ci.dev. An HTTP redirect is in place.

@rikusilvola

Inspecting with memtrace shows that (among other things) the lists of reverse dependencies of all the JS packages are so large that computing them and sending the build orders to the ocluster creates an enormous amount of in-memory data, which cannot be garbage collected because the builds are so numerous (and non-trivial). The team has identified https://github.com/ocurrent/opam-repo-ci/blob/master/service/pipeline.ml#L99 as the source of a large number of builds being created.
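For reference, a minimal sketch of how memtrace is typically wired into an OCaml service so that heap profiles like the one mentioned above can be captured. This uses the `memtrace` opam package; the `context` string is illustrative, and this is not necessarily how opam-repo-ci enables it:

```ocaml
(* Start an allocation trace only when the MEMTRACE environment variable
   names an output file; otherwise this is a no-op, so it is safe to
   leave enabled in production builds. *)
let () =
  Memtrace.trace_if_requested ~context:"opam-repo-ci" ();
  (* ... start the pipeline / main event loop as usual ... *)
  ()
```

The resulting trace file can then be explored with memtrace-viewer to find which allocation sites retain the most memory.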

Possible solutions that have been proposed are:

  • "batch" / block subset of the pipeline so that we have a smaller graph overall, but it's a pain to setup without having it blow up again in a few hours
  • Deport the revdeps pipeline on another server with ocluster custom jobs? (this will take time to setup)
  • Kate's memtrace reveals small memory waste that we could optimize in current_incr/ocurrent, but that won't save more than a 1-3 Gb overall (not significant enough)
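As a rough illustration of the first option, here is a small, self-contained sketch of chunking a large list of reverse dependencies so that each chunk becomes one unit of work instead of one pipeline node per package. The function name and chunk size are illustrative, not part of opam-repo-ci:

```ocaml
(* Split [xs] into consecutive chunks of at most [size] elements.
   With e.g. ~5000 revdeps and size = 100, the pipeline graph gains
   50 batch nodes rather than 5000 individual ones. *)
let chunks ~size xs =
  let rec go acc cur n = function
    | [] -> List.rev (if cur = [] then acc else List.rev cur :: acc)
    | x :: rest ->
        if n = size then go (List.rev cur :: acc) [ x ] 1 rest
        else go acc (x :: cur) (n + 1) rest
  in
  go [] [] 0 xs

(* Example: [chunks ~size:2 [1; 2; 3; 4; 5]] = [[1; 2]; [3; 4]; [5]] *)
```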

In the meantime the service has been temporarily relocated to a host with sufficient memory.

@rikusilvola

rikusilvola commented May 6, 2023

The service suffered repeated segfaults after being moved to toxis. Increasing the stack and open-files ulimits from 8K to 64K has helped. After a few days the process is still running, at 46 GB.

@tmcgilchrist
Member

Additionally, we noticed that the Current_ocluster.Connection pool used to throttle the rate of job creation seems to prevent jobs from being created for non-busy pools of workers such as linux-arm64 and macOS (basically anything that isn't linux-x86_64). I'm going to see if I can change that to be per pool and start submitting those jobs.
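A hedged sketch of the per-pool idea, using one plain Lwt_pool per worker pool instead of a single global limiter; the names (`limiter_for`, `submit`, `submit_job`) and the limit of 100 are illustrative, and the real Current_ocluster API may look quite different:

```ocaml
(* One throttle per worker pool, so a backlog on linux-x86_64 cannot
   block submissions to linux-arm64 or macOS pools. *)
let limiters : (string, unit Lwt_pool.t) Hashtbl.t = Hashtbl.create 8

let limiter_for pool_name =
  match Hashtbl.find_opt limiters pool_name with
  | Some p -> p
  | None ->
      (* Allow up to 100 in-flight submissions per pool (illustrative). *)
      let p = Lwt_pool.create 100 (fun () -> Lwt.return_unit) in
      Hashtbl.add limiters pool_name p;
      p

(* [submit_job] is passed in so the sketch stays self-contained; in the
   real service it would hand the job to the cluster scheduler. *)
let submit ~pool_name ~submit_job job =
  Lwt_pool.use (limiter_for pool_name) (fun () -> submit_job job)
```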

@tmcgilchrist
Member

We have permanently moved the opam-repo-ci application to a machine with more RAM to avoid the Linux OOM kill on large PRs. The new machine has significantly more resources (36 physical cores, 72 with SMT, and 515 GB of RAM), which comfortably covers the peak RAM usage of 42 GB. Additionally, the backlog of jobs for opam-repo-ci has returned to a normal level after a period of higher-than-usual load; extra capacity was added to the cluster to help clear this backlog.

A full writeup of the incident will appear on http://infra.ocaml.org in the coming week.

@shonfeder
Contributor

Looks like this has been resolved for now.
