Can cabal -j help avoid "failed to create OS thread:" somehow? #2576

Closed

rrnewton opened this issue May 4, 2015 · 9 comments

Comments

@rrnewton (Member) commented May 4, 2015

In test scripts used in continuous integration, I typically do cabal install -j. Our Jenkins master is hooked up to various kinds of worker nodes; on some of them a job will be building alone, while on others we might be running up to 32 testing jobs simultaneously (on 32 cores).

When 16 or 32 cabal install -j instances get spawned simultaneously, a typical outcome is:

setup-Simple-Cabal-1.20.0.2-x86_64-linux-ghc-7.6.3: failed to create OS thread: Resource temporarily unavailable

On this particular four-socket, 32-core server, the max thread limit in the kernel is high:

$ cat /proc/sys/kernel/threads-max
1029964

So I think this is actually a matter of running out of memory (64 GB). As a separate question, it would be interesting to know whether the GHC RTS could catch this error in some circumstances; for example, GHC doesn't really need as many HECs as the user requests.

Obviously, it would be fantastic to have a way to dynamically throttle back some of cabal's parallelism when a machine looks to be running out of memory or threads, or some degree of mutual exclusion between cabal processes. But whose job is this? It could be the subject of a pile of hacky wrapper scripts on top of cabal, but the software that knows best when cabal is running is cabal, and handling it internally would probably have some advantages. (E.g. strategies such as dialing back the parallel width of already-running cabal jobs can only be implemented by cabal itself.)

Do cabal processes take any kind of lock-files or communicate with each other in any way currently?

@23Skidoo (Member) commented May 4, 2015

Possibly related: #1476, deadlock in parallel install code that only happens on machines with a large number of cores (~= 32).

@23Skidoo (Member) commented May 4, 2015

Do cabal processes take any kind of lock-files or communicate with each other in any way currently?

No, though I have some code lying around that makes use of OS semaphores that I plan to integrate.
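
Roughly, the idea is something like the following sketch (illustration only, not that code): a POSIX named semaphore, via System.Posix.Semaphore from the unix package, shared by every cabal process on the machine. The semaphore name and the job limit below are made-up placeholders.

import Control.Exception (bracket_)
import System.Posix.Semaphore (OpenSemFlags (..), semOpen, semPost, semWait)

-- Hypothetical machine-wide job limit shared by every cabal process.
maxConcurrentJobs :: Int
maxConcurrentJobs = 8

-- Acquire one slot in a machine-wide named semaphore, run the action,
-- and release the slot afterwards (also on exceptions).
withGlobalJobSlot :: IO a -> IO a
withGlobalJobSlot action = do
  sem <- semOpen "/cabal-global-jobs"                            -- made-up name
                 (OpenSemFlags { semCreate = True, semExclusive = False })
                 0o600                                           -- permissions
                 maxConcurrentJobs                               -- initial value on first creation
  bracket_ (semWait sem) (semPost sem) action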

cabal-1.20 install --reinstall --with-ghc=ghc-7.6.3 --force-reinstalls -j -O0 --disable-library-profiling --disable-executable-profiling --disable-library-coverage -fthreaded --ghc-options=-threaded atomic-primops/ atomic-primops/testing/ atomic-primops-foreign/ 

If you're not running anything else on that machine simultaneously, then this command alone should not be enough to make it run out of memory. It looks like either a bug in the GHC RTS or a bug in our parallel code. The error message comes from rts/Task.c.

@23Skidoo (Member) commented May 4, 2015

Possibly related: https://ghc.haskell.org/trac/ghc/ticket/8604

What is the stack size limit (ulimit -s) on that machine?

@rrnewton (Member Author) commented May 5, 2015

This is on RHEL 6.5.

The problem does trigger when several cabal installs run simultaneously, but it's still happening much earlier than I would expect. Currently I'm triggering it by running 8 copies of cabal-1.22.3.0 install -j8, which should be only 64 processes max!

Note that I'm not passing --ghc-options=-jN, which would compound things, of course. Also, I can confirm from watching htop that nowhere near all of the system's memory is in use when these errors happen.

So.... something seems off. Is there some reason Cabal is creating massive numbers of OS threads?

$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 514982
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 500
virtual memory          (kbytes, -v) 67108864
file locks                      (-x) unlimited

I can, btw, reproduce GHC's ticket 8604 on this machine, using the ulimits given and either GHC 7.8.4 or 7.10.1. But given the numbers above, the ratio of -v to -s is 6553.6, which should allow plenty of threads for running <= 64 compiles.
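
(For the record, that ratio is just ulimit -v divided by ulimit -s, assuming each thread stack could reserve the full stack limit against the address-space limit:)

-- 67108864 KB of allowed address space / 10240 KB per full-size stack
maxFullSizeStacks :: Double
maxFullSizeStacks = 67108864 / 10240   -- = 6553.6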

@rrnewton (Member Author) commented May 5, 2015

After some grepping of the Cabal source, I'd just like to confirm -- Cabal isn't actually forking a bunch of threads to implement parallel builds, is it? It just creates subprocesses for all GHC-related calls, right?

So without lots of calls to forkOS or lots of HECs via +RTS -N, what are the possible culprits for excessive thread creation?

@23Skidoo (Member) commented May 5, 2015

It shouldn't be creating a "massive" number of threads. There should be 1 HEC with 1 Haskell thread + at most $NCPUS threads for the FFI call thread pool. Can you measure how many threads are actually created?
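
(For what it's worth, one Linux-only way to check, sketched in Haskell with a made-up helper name; counting the entries under /proc/<pid>/task gives the number of OS threads, same as ls /proc/<pid>/task | wc -l from a shell:)

import System.Directory (getDirectoryContents)

-- Count the OS threads of a running process by listing /proc/<pid>/task
-- (Linux only); "." and ".." are excluded from the count.
countOsThreads :: Int -> IO Int
countOsThreads pid = do
  entries <- getDirectoryContents ("/proc/" ++ show pid ++ "/task")
  return (length (filter (`notElem` [".", ".."]) entries))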

Relevant parts of the code:

executeInstallPlan :: Verbosity

https://github.com/haskell/cabal/blob/5c70361b362e41b3c13d48b58b46224d42f401dc/cabal-install/Distribution/Client/JobControl.hs
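
(The gist of the parallel job control, as a simplified sketch rather than the linked code: a bounded pool of lightweight forkIO threads, so only a handful of OS threads should ever be needed. The helper below is made up.)

import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Concurrent.QSem (newQSem, signalQSem, waitQSem)
import Control.Exception (bracket_, finally)

-- Run each job on a lightweight (forkIO) thread, at most n at a time,
-- and block until every job has finished.
runLimited :: Int -> [IO ()] -> IO ()
runLimited n jobs = do
  sem   <- newQSem n                               -- n tokens, like -jN
  dones <- mapM (const newEmptyMVar) jobs
  sequence_
    [ forkIO (bracket_ (waitQSem sem) (signalQSem sem) job
                `finally` putMVar done ())
    | (done, job) <- zip dones jobs ]
  mapM_ takeMVar dones                             -- wait for all jobs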

After some grepping of the Cabal source, I'd just like to confirm -- Cabal isn't actually forking a bunch of threads to implement parallel builds, is it?

Yes: with -j, setup configure/build/install/... is done by a separate setup process for each package:

onFailure ConfigureFailed $ withJobLimit buildLimit $ do

externalSetupMethod :: SetupMethod
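
(Illustration only, not the actual externalSetupMethod: each package's configure/build/install is driven by its own Setup executable, i.e. at least one extra OS process per parallel job, which counts against ulimit -u. The helper name is made up.)

import System.Process (callProcess)

-- Hypothetical: run one phase ("configure", "build", "install", ...) of one
-- package by invoking its compiled Setup executable as a separate OS process.
runSetupPhase :: FilePath -> String -> [String] -> IO ()
runSetupPhase setupExe phase args = callProcess setupExe (phase : args)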

$ ulimit -a
[...]
max user processes              (-u) 500

Can this be the problem?

@rrnewton (Member Author) commented May 5, 2015

Thanks! Good catch re: -u. Turns out our RHEL machines are configured much more restrictively than our Ubuntu ones. I'm trying to get this changed now and will confirm that it solves the problem.

@23Skidoo (Member)

@rrnewton Can this issue be closed?

@rrnewton (Member Author)

Yep, it basically turned out to be spurious. Sorry!
