Worker inconsistencies with large exported data and SLURM #179

Closed
benmarchi opened this issue Nov 7, 2019 · 7 comments

@benmarchi

benmarchi commented Nov 7, 2019

This is a follow up to Issue #146

Just to recap, I'm seeing a similar issue where workers appear to fail when exported objects are large. I've tried a variety of things, but I haven't found a consistent solution to the problem. In fact, there seem to be a number of things that might be contributing to the overall crash behavior. Some key system configuration points:

  • I'm using SLURM as the backend
  • Each node in the cluster has 28 cores and 500 GB memory

R Session info:

> sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux

Matrix products: default
BLAS: /opt/revr/ropen/3.5.3/lib64/R/lib/libRblas.so
LAPACK: /opt/revr/ropen/3.5.3/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] clustermq_0.8.8      foreach_1.4.7        RevoUtilsMath_11.0.0

loaded via a namespace (and not attached):
[1] compiler_3.5.3   RevoUtils_11.0.3 codetools_0.2-16 iterators_1.0.12

Issue 1

The first issue is that the behavior differs depending on whether the master is remote or already on the target host. All examples shown here use foreach, but the behavior is consistent if Q is used instead (a Q sketch follows the second MWE below).

R session on the same node as SLURM jobs

Here is a MWE:

library(foreach)
library(clustermq)

clustermq::register_dopar_cmq(n_jobs = 8, memory = 50000, timeout=180, template = list(log_file= "~/%a.log"))

data <- rep(1, 100000000)

fun <- function(i, data) {
  temp <- sum(data)
  i
}

x = foreach(i = 1:100, .export=c("data")) %dopar% fun(i, data)

From what I have been able to see, this runs consistently. If I increase the object size by 3x, it also seems fine.

library(foreach)
library(clustermq)

clustermq::register_dopar_cmq(n_jobs = 8, memory = 50000, timeout=180, template = list(log_file= "~/%a.log"))

data <- rep(1, 300000000)

fun <- function(i, data) {
  temp <- sum(data)
  i
}

x = foreach(i = 1:100, .export=c("data")) %dopar% fun(i, data)
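For reference, here is roughly what the same workload looks like when expressed with Q instead of foreach (a sketch using the standard clustermq arguments; not copied from my actual session):

library(clustermq)

data <- rep(1, 300000000)

fun <- function(i, data) {
  temp <- sum(data)
  i
}

# const= ships `data` to every worker once, analogous to .export above
x <- Q(fun, i = 1:100, const = list(data = data),
       n_jobs = 8, memory = 50000,
       template = list(log_file = "~/%a.log"))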

R session on a remote node compared to SLURM jobs

The original MWE appears fine; however, the second example is hit or miss. There are times when all 8 workers return results, times when only a subset return results, and times when all fail to return results. Here is an example of the SLURM submission script:

#!/bin/sh
#SBATCH --job-name=job1
#SBATCH --output=~/%a.log
#SBATCH --error=~/%a.log
#SBATCH --mem-per-cpu=50000
#SBATCH --array=1-8
#SBATCH --cpus-per-task=1

ulimit -v $(( 1024 * 50000))
CMQ_AUTH=fqqpr R --no-save --no-restore -e 'clustermq:::worker("tcp://host1:6006")'

When workers do fail, their logs all look similar to

> clustermq:::worker("tcp://host1:6006")
2019-11-07 10:08:13.587569 | Master: tcp://host1:6006
2019-11-07 10:08:13.599158 | WORKER_UP to: tcp://host1:6006
2019-11-07 10:08:21.299439 | 
Error in switch(msg$id, DO_CALL = { : EXPR must be a length 1 vector
Calls: <Anonymous>
Execution halted

So it looks like somewhere along the way, msg = rzmq::receive.socket(socket) is returning a zero-sized object. If you watch the process monitor while the worker processes are starting up, most of the time the workers seem to start up and begin processing fine (reasonable memory and CPU usage), but then die after a few seconds. After the timeout is reached, clustermq returns a message indicating that worker processes have likely failed.
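For what it's worth, the exact error message can be reproduced locally by handing switch() a NULL, which is consistent with receive.socket returning nothing (a minimal illustration, not the actual worker code):

msg <- NULL
switch(msg$id, DO_CALL = "would do work here")
# Error in switch(msg$id, DO_CALL = "would do work here") :
#   EXPR must be a length 1 vector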

Issue 2

There is another wrinkle to this behavior, which brings us to the second issue. When the export data size is sufficiently large, all workers fail regardless of the location of the original R session.

library(foreach)
library(clustermq)

clustermq::register_dopar_cmq(n_jobs = 8, memory = 50000, timeout=180, template = list(log_file= "~/%a.log"))

data <- rep(1, 500000000)

fun <- function(i, data) {
  temp <- sum(data)
  i
}

x = foreach(i = 1:100, .export=c("data")) %dopar% fun(i, data)

If you look at the system processes during the foreach call, SLURM seems to be correctly starting R processes for each worker. However, in the case of foreach calls with large exported objects, those R processes are killed almost immediately. Reducing n_jobs does not result in a successful function evaluation. The killed worker processes also produce log files similar to the one above. The workers are definitely being killed before the master process timeout is reached.

Issue 3

The next issue is a bit of a weird one. Recall that both MWEs seemed to work fine when the original R process was on the same node as the SLURM processes. In particular, increasing the exported object size by 3x did not cause workers to crash. So summarizing worker success with respect to exported object size:

  • data <- rep(1, 100000000): Success
  • data <- rep(1, 300000000): Success
  • data <- rep(1, 500000000): No success

Just out of curiosity, I tried running the MWE with data <- rep(1, 200000000), fully expecting the workers to process successfully. However, that was not the case. All workers crashed instantly, just like when data <- rep(1, 500000000). So now the situation seems to be that objects can't be too big (which makes some sense), but also not certain sizes.

Issue 4

This was moved to #200

R session on a remote node compared to SLURM jobs

When the original R process is on a different node from the SLURM worker processes, the data <- rep(1, 100000000) MWE appears to be consistently fine. However, while the standard foreach call with data <- rep(1, 300000000) seemed hit or miss, when wrapped in the function it now always appears to fail. Worker processes fail almost instantly after being started by SLURM. The worker log files look the same as the one I showed above.

So, to summarize:

  • There seems to be a difference in worker success rate depending on the location of the original R process
  • Exported objects cannot be too big, but also cannot be certain sizes
  • Memory required by workers is different when foreach is called within a function
  • When workers fail, they either fail right away or after some time that is less than the timeout, but with similar log files
  • There is a difference in worker success rate when foreach is called within a function compared to by itself (in particular, with remote R processes)

I know that there is a lot of information here, but I wanted to provide as clear of a picture as possible describing the issues I'm currently facing regarding large exported objects. Any help in figuring things out would be greatly appreciated.

@mschubert
Owner

Ok, that was a lot to read. Thank you for the detailed report! 👍

Issue 1

I remember a similar issue when an interrupt was signaled while waiting for receive.socket. If it is an interrupt, I've got no idea what causes it. If it's something else, I've got no idea what (yet).

Issue 2 + 3

Do you have your slurm log file that tells you why the workers were killed? Just because your data is (let's say) 5 GB doesn't mean that R won't hit the 50 GB mark doing some operation on it (even if it's just unserializing). This may not always be constant.

Your scheduler log file should give you some insight on the total amount of memory used and why it was killed.

Issue 4

This is interesting. There's some issue with how we (or foreach?) collect exports from the environment. In the first case, exports come from .GlobalEnv and data is added twice (once explicitly exported and once found in the environment). In the second case, the function environment is empty, so only the explicit export is used.
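To illustrate the two cases (a hypothetical sketch; the actual code is in the report moved to #200):

# Case 1: foreach at the top level; `data` sits in .GlobalEnv and can be
# collected from the environment in addition to the explicit .export
x <- foreach(i = 1:100, .export = c("data")) %dopar% fun(i, data)

# Case 2: foreach inside a wrapper whose local environment does not hold
# `data`, so only the explicit .export copy should be collected
wrapper <- function() {
  foreach(i = 1:100, .export = c("data")) %dopar% fun(i, data)
}
x <- wrapper()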

Not sure yet if this is a bug or expected behavior.

@benmarchi
Author

Thanks for the reply. Sorry that it has taken so long for me to get back to this. I was able to hack together a solution for a while by reducing the size and number of exported objects. However, now on a new problem, I am facing the same issue again.

Related to Issues 2 and 3, the standard slurm log files (#SBATCH --output and #SBATCH --error) for crashed jobs don't really provide any information related to usage. It turns out that the compute cluster I have access to does not have job accounting turned on. So, I am not able to pull numbers for actual resources being used by the job (no sacct information). I have reached out to the administrators to see if they can set it up. If so, I will update with more information as it becomes available.

For Issue 1, the crash behavior definitely seems to be related to the receive.socket call. For jobs that fail, the first time it is called it returns NULL. In worker.R, msg is then NULL, which causes the error in switch. I can't figure out any link between successful calls to receive.socket and those that fail, but the failure itself seems to be consistent. Any help narrowing down the root cause of the workers failing to receive information from the socket would be greatly appreciated.
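In the meantime, the kind of guard I have been picturing around that call looks roughly like this (a sketch only, not the actual worker.R code, and untested on my end):

receive_msg <- function(socket, retries = 3) {
  for (k in seq_len(retries)) {
    msg <- rzmq::receive.socket(socket)
    if (!is.null(msg)) return(msg)   # got a real message
    Sys.sleep(1)                     # NULL: wait briefly and try again
  }
  stop("no message received from master after ", retries, " attempts")
}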

@mschubert
Owner

mschubert commented Jun 21, 2020

I tried this again, and I could go up to

data <- rep(1, 1e9)

which is 7.6 Gb of common data, 55 Gb of memory on the main process and 30 Gb on the workers.

I ran this 3 times, and all workers returned results every time.

Can you try the current develop branch and see if your issue persists? There have been quite a few changes under the hood. (Note that template=list(...) is currently broken there - I will fix this asap.)

I'll also move your "Issue 4" to a new issue.

@mschubert
Owner

mschubert commented Jun 29, 2020

Hi @benmarchi!

I've tracked this down to a couple of possibilities, but can unfortunately still not reproduce it. If you could spare some more time to help me finally fix this I would greatly appreciate it!

Steps would be (ideally using one job with log_worker=TRUE):

  1. Try with latest develop branch (remotes::install_github("mschubert/clustermq", ref="develop"))
  2. Try with issue-179 branch (remotes::install_github("mschubert/clustermq", ref="issue-179")) - this prints additional debug messages to the log
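For example, a single-job run of the large-object MWE with worker logging could look like this (a sketch; adjust the object size to whatever triggers the failure for you):

library(clustermq)

data <- rep(1, 500000000)
fun <- function(i, data) { temp <- sum(data); i }

x <- Q(fun, i = 1:10, const = list(data = data),
       n_jobs = 1, memory = 50000, log_worker = TRUE)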

@benmarchi
Author

I was able to play around a bit with the develop branch and can confirm that exporting large objects now appears way more stable. I have not seen any workers reporting issues or hanging due to msg being NULL. I will follow up with a new issue if this reappears. Thanks for spending some time tracking down this issue!

I have not looked into your fix for #200, but I will follow up there if I see any unexpected behavior.

@mschubert
Owner

Thank you for checking!

@strazto
Contributor

strazto commented May 13, 2021

Hey, just flagging that this fix is great :)
[screenshot]

It didn't use to look like this after a few hours :)
