Can cabal -j help avoid "failed to create OS thread:" somehow? #2576

Closed

rrnewton opened this issue May 4, 2015 · 9 comments

Comments

@rrnewton (Member) commented May 4, 2015

In test scripts used in continuous integration, I typically do cabal install -j. Our Jenkins master is hooked up to various kinds of worker nodes; on some of them a job will be building alone, while on others we might be running up to 32 testing jobs simultaneously (on 32 cores).

When 16 or 32 cabal install -j instances get spawned simultaneously, a typical outcome is:

setup-Simple-Cabal-1.20.0.2-x86_64-linux-ghc-7.6.3: failed to create OS thread: Resource temporarily unavailable

On this particular four-socket, 32-core server, the max thread limit in the kernel is high:

$ cat /proc/sys/kernel/threads-max
1029964

So I think this is actually a matter of running out of memory (64 GB). As a separate question, it would be interesting to know whether the GHC RTS could catch this error in some circumstances; for example, GHC doesn't really need as many HECs as the user requests.

Obviously, it would be fantastic to have a way to dynamically throttle back some of cabal's parallelism when a machine looks to be running out of memory or threads, or some degree of mutual exclusion between cabal processes. But whose job is this? It could be the subject of a pile of hacky wrapper scripts on top of cabal, but the software that knows best when cabal is running is cabal, and handling it internally would probably have some advantages. (E.g. strategies such as dialing back the parallel width of already-running cabal jobs can only be implemented by cabal itself.)

Do cabal processes take any kind of lock-files or communicate with each other in any way currently?

@23Skidoo (Member) commented May 4, 2015

Possibly related: #1476, deadlock in parallel install code that only happens on machines with a large number of cores (~= 32).

@23Skidoo (Member) commented May 4, 2015

Do cabal processes take any kind of lock-files or communicate with each other in any way currently?

No, though I have some code lying around that makes use of OS semaphores that I plan to integrate.
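
Roughly, the idea is something like the following sketch (illustration only, not that code): a POSIX named semaphore, via System.Posix.Semaphore from the unix package, shared by every cabal process on the machine. The semaphore name and the job limit below are made-up placeholders.

import Control.Exception (bracket_)
import System.Posix.Semaphore (OpenSemFlags (..), semOpen, semPost, semWait)

-- Hypothetical machine-wide job limit shared by every cabal process.
maxConcurrentJobs :: Int
maxConcurrentJobs = 8

-- Acquire one slot in a machine-wide named semaphore, run the action,
-- and release the slot afterwards (also on exceptions).
withGlobalJobSlot :: IO a -> IO a
withGlobalJobSlot action = do
  sem <- semOpen "/cabal-global-jobs"                            -- made-up name
                 (OpenSemFlags { semCreate = True, semExclusive = False })
                 0o600                                           -- permissions
                 maxConcurrentJobs                               -- initial value on first creation
  bracket_ (semWait sem) (semPost sem) action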

cabal-1.20 install --reinstall --with-ghc=ghc-7.6.3 --force-reinstalls -j -O0 --disable-library-profiling --disable-executable-profiling --disable-library-coverage -fthreaded --ghc-options=-threaded atomic-primops/ atomic-primops/testing/ atomic-primops-foreign/ 

If you're not running anything else on that machine simultaneously, then this command alone should not be enough to make it run out of memory. It looks like either a bug in the GHC RTS or a bug in our parallel code. The error message comes from rts/Task.c.

@23Skidoo (Member) commented May 4, 2015

Possibly related: https://ghc.haskell.org/trac/ghc/ticket/8604

What is the stack size limit (ulimit -s) on that machine?

@rrnewton (Member Author) commented May 5, 2015

This is on RHEL 6.5.

The problem does trigger when several cabal installs run simultaneously, but it's still happening much earlier than I would expect. Currently I'm triggering it by running 8 copies of cabal-1.22.3.0 install -j8, which should be only 64 processes max!

Note that I'm not passing --ghc-options=-jN, which would compound things, of course. Also, I can confirm from watching htop that nowhere near all of the system's memory is in use when these errors happen.

So.... something seems off. Is there some reason Cabal is creating massive numbers of OS threads?

$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 514982
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 500
virtual memory          (kbytes, -v) 67108864
file locks                      (-x) unlimited

I can, btw, reproduce GHC's ticket 8604 on this machine, using the ulimits given and either GHC 7.8.4 or 7.10.1. But given the numbers above, the ratio of -v to -s is 6553.6, which should allow plenty of threads for running <= 64 compiles.
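
(For the record, that ratio is just ulimit -v divided by ulimit -s, assuming each thread stack could reserve the full stack limit against the address-space limit:)

-- 67108864 KB of allowed address space / 10240 KB per full-size stack
maxFullSizeStacks :: Double
maxFullSizeStacks = 67108864 / 10240   -- = 6553.6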

@rrnewton (Member Author) commented May 5, 2015

After some grepping of the Cabal source, I'd just like to confirm -- Cabal isn't actually forking a bunch of threads to implement parallel builds, is it? It just creates subprocesses for all GHC-related calls, right?

So without lots of calls to forkOS or lots of HECs via +RTS -N, what are the possible culprits for excessive thread creation?

@23Skidoo (Member) commented May 5, 2015

It shouldn't be creating a "massive" number of threads. There should be 1 HEC with 1 Haskell thread + at most $NCPUS threads for the FFI call thread pool. Can you measure how many threads are actually created?
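
(For what it's worth, one Linux-only way to check, sketched in Haskell with a made-up helper name; counting the entries under /proc/<pid>/task gives the number of OS threads, same as ls /proc/<pid>/task | wc -l from a shell:)

import System.Directory (getDirectoryContents)

-- Count the OS threads of a running process by listing /proc/<pid>/task
-- (Linux only); "." and ".." are excluded from the count.
countOsThreads :: Int -> IO Int
countOsThreads pid = do
  entries <- getDirectoryContents ("/proc/" ++ show pid ++ "/task")
  return (length (filter (`notElem` [".", ".."]) entries))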

Relevant parts of the code:

executeInstallPlan :: Verbosity

https://github.com/haskell/cabal/blob/5c70361b362e41b3c13d48b58b46224d42f401dc/cabal-install/Distribution/Client/JobControl.hs
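
(The gist of the parallel job control, as a simplified sketch rather than the linked code: a bounded pool of lightweight forkIO threads, so only a handful of OS threads should ever be needed. The helper below is made up.)

import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Concurrent.QSem (newQSem, signalQSem, waitQSem)
import Control.Exception (bracket_, finally)

-- Run each job on a lightweight (forkIO) thread, at most n at a time,
-- and block until every job has finished.
runLimited :: Int -> [IO ()] -> IO ()
runLimited n jobs = do
  sem   <- newQSem n                               -- n tokens, like -jN
  dones <- mapM (const newEmptyMVar) jobs
  sequence_
    [ forkIO (bracket_ (waitQSem sem) (signalQSem sem) job
                `finally` putMVar done ())
    | (done, job) <- zip dones jobs ]
  mapM_ takeMVar dones                             -- wait for all jobs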

After some grepping of the Cabal source, I'd just like to confirm -- Cabal isn't actually forking a bunch of threads to implement parallel builds, is it?

Yes: with -j, setup configure/build/install/... is done by a separate setup process for each package:

onFailure ConfigureFailed $ withJobLimit buildLimit $ do

externalSetupMethod :: SetupMethod
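
(Illustration only, not the actual externalSetupMethod: each package's configure/build/install is driven by its own Setup executable, i.e. at least one extra OS process per parallel job, which counts against ulimit -u. The helper name is made up.)

import System.Process (callProcess)

-- Hypothetical: run one phase ("configure", "build", "install", ...) of one
-- package by invoking its compiled Setup executable as a separate OS process.
runSetupPhase :: FilePath -> String -> [String] -> IO ()
runSetupPhase setupExe phase args = callProcess setupExe (phase : args)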

$ ulimit -a
[...]
max user processes              (-u) 500

Can this be the problem?

@rrnewton (Member Author) commented May 5, 2015

Thanks! Good catch re: -u. Turns out our RHEL machines are configured much more restrictively than our Ubuntu ones. I'm trying to get this changed now and will confirm that it solves the problem.

@23Skidoo (Member)

@rrnewton Can this issue be closed?

@rrnewton (Member Author)

Yep, it basically turned out to be spurious. Sorry!
