OOM-kill on large PRs #220

Closed
kit-ty-kate opened this issue May 4, 2023 · 6 comments
Labels
bug Something isn't working

Comments

@kit-ty-kate
Contributor

ocaml/opam-repository#23742 takes all of the 32 GB of RAM on the server opam-repo-ci is running on, and the service is crashing and restarting in a loop.

kit-ty-kate added the bug label on May 4, 2023
@rikusilvola

opam.ci.ocaml.org has temporarily moved to toxis.caelum.ci.dev. An HTTP redirect is in place.

@rikusilvola

Inspecting with memtrace shows that (among other things) the lists of reverse dependencies of all the JS packages are so large that computing them and sending the build orders to the ocluster creates an enormous amount of in-memory data, which cannot be garbage collected because the builds are so numerous (and non-trivial). The team has identified https://github.com/ocurrent/opam-repo-ci/blob/master/service/pipeline.ml#L99 as the source of a large number of builds being created.
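For reference, a minimal sketch of how memtrace is typically wired into an OCaml service so that heap profiles like the one mentioned above can be captured. This uses the `memtrace` opam package; the `context` string is illustrative, and this is not necessarily how opam-repo-ci enables it:

```ocaml
(* Start an allocation trace only when the MEMTRACE environment variable
   names an output file; otherwise this is a no-op, so it is safe to
   leave enabled in production builds. *)
let () =
  Memtrace.trace_if_requested ~context:"opam-repo-ci" ();
  (* ... start the pipeline / main event loop as usual ... *)
  ()
```

The resulting trace file can then be explored with memtrace-viewer to find which allocation sites retain the most memory.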

Possible solutions that have been proposed are:

  • "batch" / block subset of the pipeline so that we have a smaller graph overall, but it's a pain to setup without having it blow up again in a few hours
  • Deport the revdeps pipeline on another server with ocluster custom jobs? (this will take time to setup)
  • Kate's memtrace reveals small memory waste that we could optimize in current_incr/ocurrent, but that won't save more than a 1-3 Gb overall (not significant enough)
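As a rough illustration of the first option, here is a small, self-contained sketch of chunking a large list of reverse dependencies so that each chunk becomes one unit of work instead of one pipeline node per package. The function name and chunk size are illustrative, not part of opam-repo-ci:

```ocaml
(* Split [xs] into consecutive chunks of at most [size] elements.
   With e.g. ~5000 revdeps and size = 100, the pipeline graph gains
   50 batch nodes rather than 5000 individual ones. *)
let chunks ~size xs =
  let rec go acc cur n = function
    | [] -> List.rev (if cur = [] then acc else List.rev cur :: acc)
    | x :: rest ->
        if n = size then go (List.rev cur :: acc) [ x ] 1 rest
        else go acc (x :: cur) (n + 1) rest
  in
  go [] [] 0 xs

(* Example: [chunks ~size:2 [1; 2; 3; 4; 5]] = [[1; 2]; [3; 4]; [5]] *)
```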

In the meantime the service has been temporarily relocated to a host with sufficient memory.

@rikusilvola

rikusilvola commented May 6, 2023

The service suffered repeated segfaults after being moved to toxis. Increasing the stack and open-files ulimits from 8K to 64K has helped. After a few days the process is still running, at 46 GB.

@tmcgilchrist
Member

Additionally, we noticed that the Current_ocluster.Connection pool used to throttle the rate of job creation seems to prevent jobs from being created for non-busy pools of workers such as linux-arm64 and macOS (basically anything that isn't linux-x86_64). I'm going to see if I can change that to be per pool and start submitting those jobs.
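A hedged sketch of the per-pool idea, using one plain Lwt_pool per worker pool instead of a single global limiter; the names (`limiter_for`, `submit`, `submit_job`) and the limit of 100 are illustrative, and the real Current_ocluster API may look quite different:

```ocaml
(* One throttle per worker pool, so a backlog on linux-x86_64 cannot
   block submissions to linux-arm64 or macOS pools. *)
let limiters : (string, unit Lwt_pool.t) Hashtbl.t = Hashtbl.create 8

let limiter_for pool_name =
  match Hashtbl.find_opt limiters pool_name with
  | Some p -> p
  | None ->
      (* Allow up to 100 in-flight submissions per pool (illustrative). *)
      let p = Lwt_pool.create 100 (fun () -> Lwt.return_unit) in
      Hashtbl.add limiters pool_name p;
      p

(* [submit_job] is passed in so the sketch stays self-contained; in the
   real service it would hand the job to the cluster scheduler. *)
let submit ~pool_name ~submit_job job =
  Lwt_pool.use (limiter_for pool_name) (fun () -> submit_job job)
```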

@tmcgilchrist
Member

We have permanently moved the opam-repo-ci application to a machine with more RAM to avoid the Linux OOM kill on large PRs. The new machine has significantly more resources (36 physical cores, 72 with SMT, and 515 GB of RAM), which comfortably covers the peak RAM usage of 42 GB. Additionally, the backlog of jobs for opam-repo-ci has returned to a normal level after a period of higher-than-usual load; extra capacity was added to the cluster to help clear this backlog.

A full writeup of the incident will appear on http://infra.ocaml.org in the coming week.

@shonfeder
Contributor

Looks like this has been resolved for now.
