Worker inconsistencies with large exported data and SLURM #179

Closed
benmarchi opened this issue Nov 7, 2019 · 7 comments

@benmarchi

benmarchi commented Nov 7, 2019

This is a follow up to Issue #146

Just to recap, I'm seeing a similar issue where workers appear to fail when exported objects are large. I've tried a variety of things, but I haven't found a consistent solution to the problem. In fact, there seem to be a number of things that might be contributing to the overall crash behavior. Some key system configuration points:

  • I'm using SLURM as the backend
  • Each node in the cluster has 28 cores and 500 GB memory

R Session info:

> sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux

Matrix products: default
BLAS: /opt/revr/ropen/3.5.3/lib64/R/lib/libRblas.so
LAPACK: /opt/revr/ropen/3.5.3/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] clustermq_0.8.8      foreach_1.4.7        RevoUtilsMath_11.0.0

loaded via a namespace (and not attached):
[1] compiler_3.5.3   RevoUtils_11.0.3 codetools_0.2-16 iterators_1.0.12

Issue 1

The first issue is that the behavior differs depending on whether the master is remote or already on the target host. All examples shown here use foreach, but the behavior is consistent if Q is used instead (a Q sketch follows the second MWE below).

R session on the same node as SLURM jobs

Here is a MWE:

library(foreach)
library(clustermq)

clustermq::register_dopar_cmq(n_jobs = 8, memory = 50000, timeout=180, template = list(log_file= "~/%a.log"))

data <- rep(1, 100000000)

fun <- function(i, data) {
  temp <- sum(data)
  i
}

x = foreach(i = 1:100, .export=c("data")) %dopar% fun(i, data)

From what I have been able to see, this runs consistently. If I increase the object size by 3x, it also seems fine.

library(foreach)
library(clustermq)

clustermq::register_dopar_cmq(n_jobs = 8, memory = 50000, timeout=180, template = list(log_file= "~/%a.log"))

data <- rep(1, 300000000)

fun <- function(i, data) {
  temp <- sum(data)
  i
}

x = foreach(i = 1:100, .export=c("data")) %dopar% fun(i, data)
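For reference, here is roughly what the same workload looks like when expressed with Q instead of foreach (a sketch using the standard clustermq arguments; not copied from my actual session):

library(clustermq)

data <- rep(1, 300000000)

fun <- function(i, data) {
  temp <- sum(data)
  i
}

# const= ships `data` to every worker once, analogous to .export above
x <- Q(fun, i = 1:100, const = list(data = data),
       n_jobs = 8, memory = 50000,
       template = list(log_file = "~/%a.log"))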

R session on a remote node compared to SLURM jobs

The original MWE appears fine; however, the second example is hit or miss. There are times when all 8 workers return results, times when only a subset return results, and times when all fail to return results. Here is an example of the SLURM submission script:

#!/bin/sh
#SBATCH --job-name=job1
#SBATCH --output=~/%a.log
#SBATCH --error=~/%a.log
#SBATCH --mem-per-cpu=50000
#SBATCH --array=1-8
#SBATCH --cpus-per-task=1

ulimit -v $(( 1024 * 50000))
CMQ_AUTH=fqqpr R --no-save --no-restore -e 'clustermq:::worker("tcp://host1:6006")'

When workers do fail, their logs all look similar to

> clustermq:::worker("tcp://host1:6006")
2019-11-07 10:08:13.587569 | Master: tcp://host1:6006
2019-11-07 10:08:13.599158 | WORKER_UP to: tcp://host1:6006
2019-11-07 10:08:21.299439 | 
Error in switch(msg$id, DO_CALL = { : EXPR must be a length 1 vector
Calls: <Anonymous>
Execution halted

So it looks like somewhere along the way, msg = rzmq::receive.socket(socket) is returning a zero-sized object. If you watch the process monitor while the worker processes are starting up, most of the time the workers seem to start up and begin processing fine (reasonable memory and CPU usage), but then die after a few seconds. After the timeout is reached, clustermq returns a message indicating that worker processes have likely failed.
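For what it's worth, the exact error message can be reproduced locally by handing switch() a NULL, which is consistent with receive.socket returning nothing (a minimal illustration, not the actual worker code):

msg <- NULL
switch(msg$id, DO_CALL = "would do work here")
# Error in switch(msg$id, DO_CALL = "would do work here") :
#   EXPR must be a length 1 vector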

Issue 2

There is another wrinkle to this behavior, which brings us to the second issue. When the export data size is sufficiently large, all workers fail regardless of the location of the original R session.

library(foreach)
library(clustermq)

clustermq::register_dopar_cmq(n_jobs = 8, memory = 50000, timeout=180, template = list(log_file= "~/%a.log"))

data <- rep(1, 500000000)

fun <- function(i, data) {
  temp <- sum(data)
  i
}

x = foreach(i = 1:100, .export=c("data")) %dopar% fun(i, data)

If you look at the system processes during the foreach call, SLURM seems to be correctly starting R processes for each worker. However, in the case of foreach calls with large exported objects, those R processes are killed almost immediately. Reducing n_jobs does not result in a successful function evaluation. The killed worker processes also produce log files similar to the one above. The workers are definitely being killed before the master process timeout is reached.

Issue 3

The next issue is a bit of a weird one. Recall that both MWEs seemed to work fine when the original R process was on the same node as the SLURM processes. In particular, increasing the exported object size by 3x did not cause workers to crash. So summarizing worker success with respect to exported object size:

  • data <- rep(1, 100000000): Success
  • data <- rep(1, 300000000): Success
  • data <- rep(1, 500000000): No success

Just out of curiosity, I tried running the MWE with data <- rep(1, 200000000), fully expecting the workers to process successfully. However, that was not the case. All workers crashed instantly, just like when data <- rep(1, 500000000). So now the situation seems to be that objects can't be too big (which makes some sense), but also not certain sizes.

Issue 4

This was moved to #200

R session on a remote node compared to SLURM jobs

When the original R process is on a different node from the SLURM worker processes, the data <- rep(1, 100000000) MWE appears to be consistently fine. However, while the standard foreach call with data <- rep(1, 300000000) seemed hit or miss, when wrapped in the function it now always appears to fail. Worker processes fail almost instantly after being started by SLURM. The worker log files look the same as the one I showed above.

So, to summarize:

  • There seems to be a difference in worker success rate depending on the location of the original R process
  • Exported objects cannot be too big, but also cannot be certain sizes
  • Memory required by workers is different when foreach is called within a function
  • When workers fail, they either fail right away or after some time that is less than the timeout, but with similar log files
  • There is a difference in worker success rate when foreach is called within a function compared to by itself (in particular, with remote R processes)

I know that there is a lot of information here, but I wanted to provide as clear of a picture as possible describing the issues I'm currently facing regarding large exported objects. Any help in figuring things out would be greatly appreciated.

@mschubert
Owner

Ok, that was a lot to read. Thank you for the detailed report! 👍

Issue 1

I remember a similar issue when an interrupt was signaled while waiting for receive.socket. If it is an interrupt, I've got no idea what causes it. If it's something else, I've got no idea what (yet).

Issue 2 + 3

Do you have your slurm log file that tells you why the workers were killed? Just because your data is (let's say) 5 GB doesn't mean that R won't hit the 50 GB mark doing some operation on it (even if it's just unserializing). This may not always be constant.

Your scheduler log file should give you some insight on the total amount of memory used and why it was killed.

Issue 4

This is interesting. There's some issue with how we (or foreach?) collect exports from the environment. In the first case, exports come from .GlobalEnv and data is added twice (once explicitly exported and once found in the environment). In the second case, the function environment is empty, so only the explicit export is used.
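To illustrate the two cases (a hypothetical sketch; the actual code is in the report moved to #200):

# Case 1: foreach at the top level; `data` sits in .GlobalEnv and can be
# collected from the environment in addition to the explicit .export
x <- foreach(i = 1:100, .export = c("data")) %dopar% fun(i, data)

# Case 2: foreach inside a wrapper whose local environment does not hold
# `data`, so only the explicit .export copy should be collected
wrapper <- function() {
  foreach(i = 1:100, .export = c("data")) %dopar% fun(i, data)
}
x <- wrapper()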

Not sure yet if this is a bug or expected behavior.

@benmarchi
Author

Thanks for the reply. Sorry that it has taken so long for me to get back to this. I was able to hack together a solution for a while by reducing the size and number of exported objects. However, now on a new problem, I am facing the same issue again.

Related to Issues 2 and 3, the standard slurm log files (#SBATCH --output and #SBATCH --error) for crashed jobs don't really provide any information related to usage. It turns out that the compute cluster I have access to does not have job accounting turned on. So, I am not able to pull numbers for actual resources being used by the job (no sacct information). I have reached out to the administrators to see if they can set it up. If so, I will update with more information as it becomes available.

For Issue 1, the crash behavior definitely seems to be related to the receive.socket call. For jobs that fail, the first time it is called it returns NULL. In worker.R, msg is then NULL, which causes the error in switch. I can't figure out any link between successful calls to receive.socket and those that fail, but the failure itself seems to be consistent. Any help narrowing down the root cause of the workers failing to receive information from the socket would be greatly appreciated.
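In the meantime, the kind of guard I have been picturing around that call looks roughly like this (a sketch only, not the actual worker.R code, and untested on my end):

receive_msg <- function(socket, retries = 3) {
  for (k in seq_len(retries)) {
    msg <- rzmq::receive.socket(socket)
    if (!is.null(msg)) return(msg)   # got a real message
    Sys.sleep(1)                     # NULL: wait briefly and try again
  }
  stop("no message received from master after ", retries, " attempts")
}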

@mschubert
Owner

mschubert commented Jun 21, 2020

I tried this again, and I could go up to

data <- rep(1, 1e9)

which is 7.6 Gb of common data, 55 Gb of memory on the main process and 30 Gb on the workers.

I ran this 3 times, and all workers returned results every time.

Can you try the current develop branch and see if your issue persists? There have been quite a few changes under the hood. (Note that template=list(...) is currently broken there - I will fix this asap.)

I'll also move your "Issue 4" to a new issue.

@mschubert
Owner

mschubert commented Jun 29, 2020

Hi @benmarchi!

I've tracked this down to a couple of possibilities, but can unfortunately still not reproduce it. If you could spare some more time to help me finally fix this I would greatly appreciate it!

Steps would be (ideally using one job with log_worker=TRUE):

  1. Try with latest develop branch (remotes::install_github("mschubert/clustermq", ref="develop"))
  2. Try with issue-179 branch (remotes::install_github("mschubert/clustermq", ref="issue-179")) - this prints additional debug messages to the log
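For example, a single-job run of the large-object MWE with worker logging could look like this (a sketch; adjust the object size to whatever triggers the failure for you):

library(clustermq)

data <- rep(1, 500000000)
fun <- function(i, data) { temp <- sum(data); i }

x <- Q(fun, i = 1:10, const = list(data = data),
       n_jobs = 1, memory = 50000, log_worker = TRUE)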

@benmarchi
Author

I was able to play around a bit with the develop branch and can confirm that exporting large objects now appears way more stable. I have not seen any workers reporting issues or hanging due to msg being NULL. I will follow up with a new issue if this reappears. Thanks for spending some time tracking down this issue!

I have not looked into your fix for #200, but I will follow up there if I see any unexpected behavior.

@mschubert
Owner

Thank you for checking!

@strazto
Contributor

strazto commented May 13, 2021

Hey, just flagging that this fix is great :)
[screenshot]

It didn't use to look like this after a few hours :)
