Can cabal -j help avoid "failed to create OS thread:" somehow? #2576
Comments
Possibly related: #1476, a deadlock in the parallel install code that only happens on machines with a large number of cores (~32).
No, though I have some code lying around that makes use of OS semaphores that I plan to integrate.
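The cross-process coordination hinted at here can be pictured as a pool of job tokens shared between build processes, in the spirit of GNU make's jobserver. The sketch below is purely illustrative — it is not cabal's actual mechanism, and the fifo path and token count are made up:

```shell
# A minimal cross-process job-token pool: N tokens live in a fifo; each
# worker takes a token before building and returns it afterwards, so at
# most N build steps run at once across all participating processes.
# (Illustrative only — not cabal's real coordination code.)
FIFO="/tmp/job-tokens.$$"
mkfifo "$FIFO"
exec 3<>"$FIFO"          # hold the fifo open read-write so tokens persist

for t in 1 2 3 4; do echo "$t" >&3; done   # 4 tokens = at most 4 jobs

run_limited() {
    read -r token <&3    # blocks until some worker returns a token
    "$@"                 # run the build step
    echo "$token" >&3    # give the token back
}

run_limited echo "building with a token"
rm -f "$FIFO"            # unlink; fd 3 keeps the pool alive for this shell
```

An OS semaphore plays the same role as the fifo here, but survives coordination between unrelated processes more robustly.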
If you're not running anything else on that machine simultaneously, then this command should not be enough to make it run out of memory. It looks like either a bug in the GHC RTS or a bug in our parallel code; the "failed to create OS thread" message itself is printed by the GHC RTS.
Possibly related: https://ghc.haskell.org/trac/ghc/ticket/8604. What is the stack size limit on that machine?
This is on RHEL 6.5. The problem does trigger when simultaneous cabal installs run together, but it's still happening much earlier than I would expect; currently I can trigger it with just 8 copies of `cabal install -j`. So... something seems off. Is there some reason Cabal is creating massive numbers of OS threads?
I can, btw, reproduce GHC ticket 8604 on this machine, using the ulimits given there and either GHC 7.8.4 or 7.10.1. But given the numbers above, the normal ratio of threads to processes shouldn't get anywhere near the limit.
After some grepping of the Cabal source, I'd just like to confirm: Cabal isn't actually forking a bunch of threads to implement parallel builds, is it? It just creates subprocesses for all GHC-related calls, right? So without lots of forked Haskell threads, where would all the OS threads come from?
It shouldn't be creating a "massive" number of threads. There should be 1 HEC with 1 Haskell thread + at most $NCPUS threads for the FFI call thread pool. Can you measure how many threads are actually created? Relevant parts of the code:
https://github.com/haskell/cabal/blob/5c70361b362e41b3c13d48b58b46224d42f401dc/cabal-install/Distribution/Client/JobControl.hs
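One way to answer the "can you measure how many threads are actually created?" question on Linux is to count entries under `/proc/PID/task`. This is a generic diagnostic, not something from the thread; the function name is made up:

```shell
# count_threads PID — print the number of OS threads of process PID.
# Uses Linux /proc when available, falling back to ps's NLWP column.
count_threads() {
    pid="$1"
    if [ -d "/proc/$pid/task" ]; then
        ls "/proc/$pid/task" | wc -l | tr -d ' '
    else
        ps -o nlwp= -p "$pid" | tr -d ' '
    fi
}

# e.g. point it at a running cabal process's pid:
count_threads $$
```

Running it repeatedly against a building `cabal install -j` process would show whether the count stays near 1 + NCPUS as expected, or grows.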
Yes, with
Can this be the problem?
Thanks! Good catch.
@rrnewton Can this issue be closed?
Yep, it basically turned out to be spurious. Sorry!
In test scripts used in continuous integration I typically do `cabal install -j`. Our Jenkins master is hooked up to various kinds of worker nodes: on some of them a job will be building alone, while on others we might be running up to 32 testing jobs simultaneously (on 32 cores). When 16 or 32 `cabal install -j` instances get spawned simultaneously, a typical outcome is a "failed to create OS thread" error, even though on this particular four-socket, 32-core server the max thread limit in the kernel is high.
So I think this is actually a matter of running out of memory (64G). As a separate question, it would be interesting to know whether the GHC RTS could catch this failure in some circumstances; for example, GHC doesn't really need to have as many HECs as the user requests.
Obviously, it would be fantastic to have a way to dynamically throttle back some of cabal's parallelism when a machine looks to be running out of memory or threads, or some degree of mutual exclusion between cabal processes. But whose job is this? It could be the subject of a pile of hacky wrapper scripts on top of cabal. But the software that knows best when cabal is running is cabal, and handling it internally would probably have some advantages. (E.g., strategies such as dialing back the parallel width of already-running cabal jobs can only be implemented by cabal itself.)
Do cabal processes take any kind of lock file or communicate with each other in any way currently?
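While the thread leaves that question open, the "pile of hacky wrapper scripts" stopgap mentioned above is easy to sketch with `flock(1)` from util-linux: serialize concurrent installs behind a host-wide lock. The function name and lock path are arbitrary choices, not anything cabal provides:

```shell
# with_lock LOCKFILE CMD... — run CMD while holding an exclusive lock on
# LOCKFILE, so concurrent invocations on the same host are serialized.
# Requires util-linux's flock(1); it creates LOCKFILE if missing.
with_lock() {
    lock="$1"; shift
    flock "$lock" "$@"
}

# e.g. wrap each CI job's install step:
#   with_lock /var/tmp/cabal-install.lock cabal install -j
```

This trades throughput for safety — jobs queue up behind the lock instead of all building at once — which is crude compared with the shared token pool or in-cabal throttling discussed above.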