Worker inconsistencies with large exported data and SLURM #179
Comments
Ok, that was a lot to read. Thank you for the detailed report! 👍

Issue 1
I remember a similar issue when an interrupt was signaled while waiting for

Issue 2 + 3
Do you have your slurm log file that tells you why the workers were killed? Just because your data is (let's say) 5 GB doesn't mean that R won't hit the 50 GB mark doing some operation on it (even if it's just unserializing). This may not always be constant. Your scheduler log file should give you some insight into the total amount of memory used and why it was killed.

Issue 4
This is interesting. There's some issue with we (or

Not sure yet if this is a bug or expected behavior.
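To make that point concrete (an illustration added here, not code from the thread): serializing an exported object for transfer creates at least one extra full copy in memory, before any worker-side work happens.

```r
# The serialized raw vector is roughly the same size as the object and
# exists alongside it while both are in scope, so peak memory is already
# a multiple of the nominal object size.
data <- rep(1, 1e8)                       # 1e8 doubles, ~0.75 GiB
print(object.size(data), units = "Mb")    # ~763 Mb
blob <- serialize(data, NULL)             # raw copy, as it would be sent to workers
print(object.size(blob), units = "Mb")    # roughly the same size again
```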
Thanks for the reply. Sorry that it has taken so long for me to get back to this. I was able to hack together a solution for a while by reducing the size and number of exported objects. However, now on a new problem, I am facing the same issue again.

Related to Issues 2 and 3, the standard slurm log files (

For Issue 1, the crash behavior definitely seems to be related to the
I tried this again, and I could go up to `data <- rep(1, 1e9)`, which is 7.6 Gb of common data, 55 Gb memory on the main process and 30 Gb on the workers. Over 3 runs, all workers returned results every time.

Can you try the current

I'll also move your "Issue 4" to a new issue.
Hi @benmarchi! I've tracked this down to a couple of possibilities, but unfortunately still cannot reproduce it. If you could spare some more time to help me finally fix this, I would greatly appreciate it! Steps would be (ideally using one job with
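The exact steps were cut off above; purely as a sketch of what a minimal single-job test could look like (assuming the SLURM scheduler is selected via the usual option), a per-worker log can be requested through `Q()`'s `log_worker` argument:

```r
library(clustermq)
options(clustermq.scheduler = "slurm")  # assumption: scheduler selected via option

# One SLURM job only, with a log file written for the worker so the reason
# it dies (if it does) ends up somewhere inspectable.
res <- Q(function(x) x + 1, x = 1:3, n_jobs = 1, log_worker = TRUE)
```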
I was able to play around a bit with the

I have not looked into your fix for #200, but I will follow up there if I see any unexpected behavior.
Thank you for checking!
This is a follow-up to Issue #146.
Just to recap, I'm seeing a similar issue where workers appear to fail when exported objects are large. I've tried a variety of things, but I haven't found a consistent solution to the problem. In fact, there seem to be a number of things that might be contributing to the overall crash behavior. Some key system configuration points:
R Session info:
Issue 1
The first issue is that there seems to be different behavior when the master is remote versus already on the target host. All examples shown here use `foreach`, but the behavior is consistent if `Q` is used instead.

R session on the same node as SLURM jobs
Here is an MWE:
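The code block itself was not captured here; a minimal sketch of what such an MWE could look like (assuming the clustermq `foreach` backend registered via `register_dopar_cmq()`, the SLURM scheduler set via the usual option, and 8 workers) is:

```r
library(foreach)
library(clustermq)

options(clustermq.scheduler = "slurm")  # assumption: SLURM selected via option
register_dopar_cmq(n_jobs = 8)          # clustermq as the %dopar% backend

data <- rep(1, 100000000)               # ~0.75 GiB exported object

result <- foreach(i = 1:8, .export = "data") %dopar% {
  sum(data) + i                         # trivial work touching the exported object
}
```

The equivalent call through `Q()` would be roughly `Q(function(i) sum(data) + i, i = 1:8, export = list(data = data), n_jobs = 8)`, consistent with the note above that both interfaces behave the same.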
From what I have been able to see, this runs consistently. If I increase the object size by 3x, it also seems fine.
R session on a remote node compared to SLURM jobs
The original MWE appears fine; however, the second example is hit or miss. There are times when all 8 workers return results, times when only a subset return results, and times when all fail to return results. Here is an example of the SLURM submission script:
When workers do fail, their logs all look similar to

So it looks like somewhere along the way, `msg = rzmq::receive.socket(socket)` is returning a zero-sized object. If you watch the process monitor while the worker processes are starting up, most of the time workers seem to start processing ok (reasonable memory and CPU usage), but then die after a few seconds. After the `timeout` is reached, `clustermq` returns a message indicating that worker processes have likely failed.

Issue 2
There is another wrinkle to this behavior, which brings us to the second issue. When the exported data size is sufficiently large, all workers fail regardless of the location of the original R session.
If you look at the system processes during the `foreach` call, SLURM seems to be correctly starting R processes for each worker. However, in the case of `foreach` calls with large exported objects, those R processes are killed almost immediately. Reducing `n_jobs` does not result in a successful function evaluation. The killed worker processes also share a similar log file to the one above. The workers are definitely being killed before the master process `timeout`.
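If SLURM is killing the workers for exceeding their memory allocation, the per-worker request can be raised through clustermq (a hedged sketch; the 8192 MB value here is an assumption, not something tested in this report):

```r
# memory is forwarded to the scheduler template, so each worker job asks
# SLURM for ~8 GB instead of the template default.
register_dopar_cmq(n_jobs = 8, memory = 8192)
```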
Issue 3
The next issue is a bit of a weird one. Recall that both MWEs seemed to work fine when the original R process was on the same node as the SLURM processes. In particular, increasing the exported object size by 3x did not cause workers to crash. So summarizing worker success with respect to exported object size:
- `data <- rep(1, 100000000)`: Success
- `data <- rep(1, 300000000)`: Success
- `data <- rep(1, 500000000)`: No success

Just out of curiosity, I tried running the MWE with `data <- rep(1, 200000000)`, fully expecting the workers to process successfully. However, that was not the case. All workers crashed instantly, just like with `data <- rep(1, 500000000)`. So now the situation seems to be that objects can't be too big (which makes some sense), but also not certain sizes.
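For reference (a quick size check added here, not part of the original report), the in-memory sizes of the vectors tested are:

```r
# Doubles are 8 bytes each, so the tested vectors weigh in at roughly:
print(object.size(rep(1, 1e8)), units = "Gb")  # ~0.7 Gb -> success
print(object.size(rep(1, 2e8)), units = "Gb")  # ~1.5 Gb -> all workers crash
print(object.size(rep(1, 3e8)), units = "Gb")  # ~2.2 Gb -> success
print(object.size(rep(1, 5e8)), units = "Gb")  # ~3.7 Gb -> all workers crash
```

which underlines that the failures do not line up with object size alone.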
Issue 4
This was moved to #200
R session on a remote node compared to SLURM jobs
When the original R process is on a different node relative to the SLURM worker processes, the `data <- rep(1, 100000000)` MWE appears to be consistently fine. However, while the standard `foreach` call with `data <- rep(1, 300000000)` seemed hit or miss, when wrapped in the function it now always appears to fail. Worker processes fail almost instantly after being started by SLURM. The worker log files look the same as the one I showed above.

So, to summarize:
- Workers fail when exported objects are large, especially when `foreach` is called within a function
- Workers are killed before the master process `timeout`, but with similar log files
- Behavior differs when `foreach` is called within a function compared to by itself (in particular, with remote R processes)

I know that there is a lot of information here, but I wanted to provide as clear a picture as possible of the issues I'm currently facing regarding large exported objects. Any help in figuring things out would be greatly appreciated.