Running the coordinating R/clustermq process on a different HPC node #216

Closed
mattwarkentin opened this issue Nov 4, 2020 · 6 comments

@mattwarkentin
Contributor

Hi @mschubert,

When using the ssh + slurm combination to run jobs, this is my mental model of how things seem to work (going with the simple case of a single worker process):

  • Process 1 - The calling R process which exists on my desktop (calls Q())
  • Process 2 - The calling process then spawns a coordinating R process on the HPC via SSH
  • Process 3 - The coordinating R process then submits the batch job, creating 1 persistent worker process

The worker communicates with the coordinating process, and the coordinating process communicates back with the spawning process. Is this a correct mental model?

If so, is there a way to get Process 2 to run on a node other than the login/head node?
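For reference, this is roughly how I set things up on the calling side (the host alias is just a placeholder for my own SSH config entry):

    # roughly my current setup; "hpc" is a placeholder SSH alias
    library(clustermq)
    options(
        clustermq.scheduler = "ssh",
        clustermq.ssh.host = "hpc"  # currently points at the login/head node
    )
    Q(function(x) x + 1, x = 1:3, n_jobs = 1)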

@mschubert
Owner

That's right, except that your Process 3 is one worker/process per array index.

The R process that runs on the head node should be very light in terms of CPU, but it does hold your common_data in memory. There is currently no way to circumvent this.

I suppose the reason you're asking is because you're worried about memory usage on the head node?

My answer to this would be that common_data should be small when sent via SSH, and larger data sets should be accessed via network storage.
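A rough sketch of what I mean (the path and object here are made up): pass only a file path through const and read the large object inside the worker function, where the network filesystem is mounted:

    # sketch: keep the SSH payload small by shipping a path, not the data
    fx <- function(x, data_path) {
        big <- readRDS(data_path)  # read from shared network storage on the worker
        sum(big[[x]])
    }
    Q(fx, x = 1:10,
      const = list(data_path = "/shared/project/big_data.rds"),  # made-up path
      n_jobs = 1)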

@mattwarkentin
Contributor Author

mattwarkentin commented Nov 4, 2020

I suppose the reason you're asking is because you're worried about memory usage on the head node?

Actually, I just received a bit of a scathing email from my institute's sysadmin. They are rather vigilant about stopping users from executing any long-running processes on the head node, understandably so. In this particular case, I was doing some debugging yesterday, and for various reasons my jobs terminated in an unusual way that left many orphaned coordinating R/cmq processes on the head node, which were apparently still running today.

Presumably if I provide a different address to options(clustermq.ssh.host = "..."), such as a dev/interactive node with slurm job submission permissions, this would circumvent the head node, right?

@mschubert
Owner

Presumably if I provide a different address to options(clustermq.ssh.host = "..."), such as a dev/interactive node with slurm job submission permissions, this would circumvent the head node, right?

Yes, that should work - as long as you set it up in your .ssh/config (I assume you know how, otherwise I can type it out)
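Roughly something like this in ~/.ssh/config on your desktop (host names and user are placeholders):

    # ~/.ssh/config (placeholder names)
    Host dev
        HostName dev-node.example.org
        User myuser
        # ProxyJump login.example.org  # only needed if the dev node is reachable via the head node

and then options(clustermq.ssh.host = "dev") in R.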

@mattwarkentin
Contributor Author

Okay, great. I will give it a try. Thanks.

@mattwarkentin
Contributor Author

mattwarkentin commented Nov 4, 2020

While it's on my mind, do you think there is any value in adding an argument to Q()/Q_rows() that allows the user to pass the clustermq configuration options directly (e.g. clustermq.scheduler, etc.)? This would avoid the hidden-argument issue. Right now Q() isn't self-contained, since its behaviour depends on externally defined global options, which might live in the same script, a separate script, or a startup file like .Rprofile.

If it had an argument for opts/options/whatever, then you could pass these as a list. Global options set with options() could be used as a fallback. By default, the function's new options argument could look for global options and use qsys_default in their absence:

Q <- function(<current args>, options = getOption("clustermq.scheduler", qsys_default))
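Usage might then look something like this (purely hypothetical, naming aside):

    # hypothetical call if Q() grew an options argument
    Q(fx, x = 1:10, n_jobs = 1,
      options = list(
          clustermq.scheduler = "ssh",
          clustermq.ssh.host = "dev"
      ))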

Thoughts?

@mschubert
Owner

mschubert commented Nov 4, 2020

There is no hidden argument issue.

options only specify how Q is run, never what Q returns. Moving to a different setup with e.g. a different clustermq.scheduler does not rely on changing function arguments, and I'd argue that code should be portable between compute environments.

(Note that if you really want to you can already circumvent this by passing Q(..., workers=workers(qsys_id=...)).)
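For example, something along these lines (a sketch; pick the qsys_id and worker count you actually need):

    # sketch: choose the backend per call instead of via options()
    w <- workers(n_jobs = 1, qsys_id = "multicore")
    Q(function(x) x * 2, x = 1:3, workers = w)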
