Running the coordinating R/clustermq process on a different HPC node #216

Closed
mattwarkentin opened this issue Nov 4, 2020 · 6 comments

@mattwarkentin
Contributor

Hi @mschubert,

When using the ssh + slurm combination to run jobs, this is my mental model of how things seem to work (going with the simple case of a single worker process):

  • Process 1 - The calling R process which exists on my desktop (calls Q())
  • Process 2 - The calling process then spawns a coordinating R process on the HPC via SSH
  • Process 3 - The coordinating R process then submits the batch job, creating 1 persistent worker process

The worker communicates with the coordinating process, and the coordinating process communicates back with the spawning process. Is this a correct mental model?

If so, is there a way to get Process 2 to run on a node other than the login/head node?
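For reference, this is roughly how I set things up on the calling side (the host alias is just a placeholder for my own SSH config entry):

    # roughly my current setup; "hpc" is a placeholder SSH alias
    library(clustermq)
    options(
        clustermq.scheduler = "ssh",
        clustermq.ssh.host = "hpc"  # currently points at the login/head node
    )
    Q(function(x) x + 1, x = 1:3, n_jobs = 1)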

@mschubert
Owner

That's right, except that your Process 3 is one worker/process per array index.

The R process that runs on the head node should be very light in terms of CPU, but it does hold your common_data in memory. There is currently no way to circumvent this.

I suppose the reason you're asking is because you're worried about memory usage on the head node?

My answer to this would be that common_data should be small when sent via SSH, and larger data sets should be accessed via network storage.
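A rough sketch of what I mean (the path and object here are made up): pass only a file path through const and read the large object inside the worker function, where the network filesystem is mounted:

    # sketch: keep the SSH payload small by shipping a path, not the data
    fx <- function(x, data_path) {
        big <- readRDS(data_path)  # read from shared network storage on the worker
        sum(big[[x]])
    }
    Q(fx, x = 1:10,
      const = list(data_path = "/shared/project/big_data.rds"),  # made-up path
      n_jobs = 1)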

@mattwarkentin
Contributor Author

mattwarkentin commented Nov 4, 2020

I suppose the reason you're asking is because you're worried about memory usage on the head node?

Actually, I just received a bit of a scathing email from my institute's sysadmin. They are rather vigilant about stopping users from executing any long-running processes on the head node, understandably so. In this particular case, I was doing some debugging yesterday, and for various reasons my jobs terminated in an unusual way that left many orphaned coordinating R/cmq processes on the head node, which were apparently still running today.

Presumably if I provide a different address to options(clustermq.ssh.host = "..."), such as a dev/interactive node with slurm job submission permissions, this would circumvent the head node, right?

@mschubert
Owner

Presumably if I provide a different address to options(clustermq.ssh.host = "..."), such as a dev/interactive node with slurm job submission permissions, this would circumvent the head node, right?

Yes, that should work - as long as you set it up in your .ssh/config (I assume you know how, otherwise I can type it out)
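Roughly something like this in ~/.ssh/config on your desktop (host names and user are placeholders):

    # ~/.ssh/config (placeholder names)
    Host dev
        HostName dev-node.example.org
        User myuser
        # ProxyJump login.example.org  # only needed if the dev node is reachable via the head node

and then options(clustermq.ssh.host = "dev") in R.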

@mattwarkentin
Contributor Author

Okay, great. I will give it a try. Thanks.

@mattwarkentin
Contributor Author

mattwarkentin commented Nov 4, 2020

While it's on my mind, do you think there is any value in adding an argument to Q()/Q_rows() that allows the user to pass the clustermq configuration options directly (e.g. clustermq.scheduler, etc.)? This would avoid the hidden-argument issue. Right now Q() isn't self-contained, since its behaviour depends on externally defined global options, which might live in the same script, a separate script, or a startup file like .Rprofile.

If it had an argument for opts/options/whatever, then you could pass these as a list. Global options set with options() could be used as a fallback. By default, the function's new options argument could look for global options and use qsys_default in their absence:

Q <- function(<current args>, options = getOption("clustermq.scheduler", qsys_default))
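Usage might then look something like this (purely hypothetical, naming aside):

    # hypothetical call if Q() grew an options argument
    Q(fx, x = 1:10, n_jobs = 1,
      options = list(
          clustermq.scheduler = "ssh",
          clustermq.ssh.host = "dev"
      ))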

Thoughts?

@mschubert
Owner

mschubert commented Nov 4, 2020

There is no hidden argument issue.

options only specify how Q is run, never what Q returns. Moving to a different setup with e.g. a different clustermq.scheduler does not rely on changing function arguments, and I'd argue that code should be portable between compute environments.

(Note that if you really want to you can already circumvent this by passing Q(..., workers=workers(qsys_id=...)).)
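For example, something along these lines (a sketch; pick the qsys_id and worker count you actually need):

    # sketch: choose the backend per call instead of via options()
    w <- workers(n_jobs = 1, qsys_id = "multicore")
    Q(function(x) x * 2, x = 1:3, workers = w)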
