
Network interface for workers is not configurable #170

Closed

kendonB opened this issue Sep 13, 2019 · 17 comments
@kendonB commented Sep 13, 2019

Continuing from ropensci/drake#1005. I ran this script:

library(clustermq)
#> * Option 'clustermq.scheduler' not set, defaulting to 'LOCAL'
#> --- see: https://mschubert.github.io/clustermq/articles/userguide.html#configuration
library(rlang)
options(clustermq.scheduler = "slurm", 
        clustermq.template = "slurm_clustermq.tmpl")

test_run <- function(wait = 0.1) {
  workers <- workers(n_jobs = 2L, template = 
                       list(
                         log_file = "make.log", 
                         memory = 500,
                         walltime = 250
                       ))
  on.exit(workers$finalize())
  workers$set_common_data(
    export = list(),
    fun = identity,
    const = list(),
    rettype = list(),
    pkgs = character(0),
    common_seed = 0L,
    token = "set_common_data_token"
  )
  main_loop(workers = workers, wait = wait)
}

main_loop <- function(workers, wait) {
  counter <- 2L
  while (counter > 0L) {
    print(counter)
    msg <- workers$receive_data()
    if (!is.null(msg$result)) {
      counter <- counter - 1L
    }
    if (!identical(msg$token, "set_common_data_token")) {
      workers$send_common_data()
    }
    else if (counter > 0L) {
      workers$send_call(
        expr = {
          print("I'm working.")
          c(Sys.sleep(wait), 123)
          NULL
        },
        env = list(wait = wait)
      )
    } else {
      workers$send_shutdown_worker()
    }
  }
}

test_run(wait = 3600*1.5)

The master process hangs (presumably at the workers$receive_data() call) and the two worker logs are:

> clustermq:::worker("tcp://mahuika01:7649")
2019-09-13 12:58:54.413862 | Master: tcp://mahuika01:7649
2019-09-13 12:58:54.422901 | WORKER_UP to: tcp://mahuika01:7649
2019-09-13 12:58:54.424276 | > DO_SETUP (0.000s wait)
2019-09-13 12:58:54.424635 | token from msg: set_common_data_token
2019-09-13 12:58:54.449676 | > DO_CALL (0.000s wait)
[1] "I'm working."
2019-09-13 14:28:54.486711 | eval'd: {print("I'm working.")c(Sys.sleep(wait), 123)NULL
Error in clustermq:::worker("tcp://mahuika01:7649") :
  Timeout reached, terminating
Execution halted

and

> clustermq:::worker("tcp://mahuika01:7649")
2019-09-13 12:58:55.203638 | Master: tcp://mahuika01:7649
2019-09-13 12:58:55.217463 | WORKER_UP to: tcp://mahuika01:7649
2019-09-13 12:58:55.218325 | > DO_SETUP (0.000s wait)
2019-09-13 12:58:55.218681 | token from msg: set_common_data_token
2019-09-13 12:58:55.226476 | > DO_CALL (0.001s wait)
[1] "I'm working."
2019-09-13 14:28:55.327345 | eval'd: {print("I'm working.")c(Sys.sleep(wait), 123)NULL
Error in clustermq:::worker("tcp://mahuika01:7649") :
  Timeout reached, terminating
Execution halted

and on the master I see:

[1] 2
[1] 2
[1] 2
[1] 2
[1] 2

My template file is:

#!/bin/sh
#SBATCH --job-name={{ job_name }}
#SBATCH --output={{ log_file | /dev/null }}%a # %a is the array index
#SBATCH --mem-per-cpu={{ memory | 4096 }}
#SBATCH --array=1-{{ n_jobs }}
#SBATCH --account=landcare00063
#SBATCH --time={{ walltime }}
#SBATCH --partition={{ partition | large }}
source ~/.bashrc
source /etc/profile
## Export value of DEBUGME environment var to slave
## export DEBUGME=<%= Sys.getenv("DEBUGME") %>

ulimit -v $(( 1024 * {{ memory | 4096 }} ))
CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'clustermq:::worker("{{ master }}")'

My session info is:

> devtools::session_info()
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 3.6.1 (2019-07-05)
 os       CentOS Linux 7 (Core)
 system   x86_64, linux-gnu
 ui       X11
 language (EN)
 collate  en_NZ.UTF-8
 ctype    en_NZ.UTF-8
 tz       NZ
 date     2019-09-13

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version date       lib source
 assertthat    0.2.1   2019-03-21 [2] CRAN (R 3.6.1)
 backports     1.1.4   2019-04-10 [2] CRAN (R 3.6.1)
 callr         3.3.0   2019-07-04 [2] CRAN (R 3.6.1)
 cli           1.1.0   2019-03-19 [2] CRAN (R 3.6.1)
 clustermq   * 0.8.8   2019-06-05 [1] CRAN (R 3.6.1)
 crayon        1.3.4   2017-09-16 [2] CRAN (R 3.6.1)
 desc          1.2.0   2018-05-01 [2] CRAN (R 3.6.1)
 devtools      2.1.0   2019-07-06 [1] CRAN (R 3.6.1)
 digest        0.6.20  2019-07-04 [2] CRAN (R 3.6.1)
 fs            1.3.1   2019-05-06 [1] CRAN (R 3.6.1)
 glue          1.3.1   2019-03-12 [2] CRAN (R 3.6.1)
 magrittr      1.5     2014-11-22 [2] CRAN (R 3.6.1)
 memoise       1.1.0   2017-04-21 [2] CRAN (R 3.6.1)
 pkgbuild      1.0.3   2019-03-20 [2] CRAN (R 3.6.1)
 pkgload       1.0.2   2018-10-29 [1] CRAN (R 3.6.1)
 prettyunits   1.0.2   2015-07-13 [2] CRAN (R 3.6.1)
 processx      3.4.0   2019-07-03 [2] CRAN (R 3.6.1)
 ps            1.3.0   2018-12-21 [2] CRAN (R 3.6.1)
 R6            2.4.0   2019-02-14 [2] CRAN (R 3.6.1)
 Rcpp          1.0.2   2019-07-25 [1] CRAN (R 3.6.1)
 remotes       2.1.0   2019-06-24 [1] CRAN (R 3.6.1)
 rlang       * 0.4.0   2019-06-25 [2] CRAN (R 3.6.1)
 rprojroot     1.3-2   2018-01-03 [2] CRAN (R 3.6.1)
 rzmq          0.9.6   2019-03-01 [1] CRAN (R 3.6.1)
 sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.6.1)
 testthat      2.2.1   2019-07-25 [1] CRAN (R 3.6.1)
 usethis       1.5.1   2019-07-04 [1] CRAN (R 3.6.1)
 withr         2.1.2   2018-03-15 [2] CRAN (R 3.6.1)

And my ZeroMQ version is:

> rzmq::zmq.version()
[1] "4.2.5"

@mschubert, what other troubleshooting steps could I take? It does seem environmental, given that this code works for you and @wlandau.

@wlandau (Contributor) commented Sep 13, 2019

To follow up, I tried again with 2 workers on SGE:

> library(clustermq)
* Option 'clustermq.scheduler' not set, defaulting to 'SGE'
--- see: https://mschubert.github.io/clustermq/articles/userguide.html#configuration
> options(
+   clustermq.scheduler = "sge",
+   clustermq.template = "sge_clustermq.tmpl"
+ )
> 
> test_run <- function(wait = 0.1) {
+   workers <- workers(n_jobs = 2L)
+   on.exit(workers$finalize())
+   workers$set_common_data(
+     export = list(),
+     fun = identity,
+     const = list(),
+     rettype = list(),
+     pkgs = character(0),
+     common_seed = 0L,
+     token = "set_common_data_token"
+   )
+   main_loop(workers = workers, wait = wait)
+ }
> 
> main_loop <- function(workers, wait) {
+   counter <- 4L
+   while (counter > 0L) {
+     print(counter)
+     msg <- workers$receive_data()
+     if (!is.null(msg$result)) {
+       counter <- counter - 1L
+     }
+     if (!identical(msg$token, "set_common_data_token")) {
+       workers$send_common_data()
+     }
+     else if (counter > 0L) {
+       workers$send_call(
+         expr = c(Sys.sleep(wait), 123),
+         env = list(wait = wait)
+       )
+     } else {
+       workers$send_shutdown_worker()
+     }
+   }
+ }
> 
> test_run(wait = 90 * 60)
[1] 4
[1] 4
[1] 4
[1] 4
[1] 4
[1] 3
[1] 2
[1] 1
> 
> proc.time()
     user    system   elapsed 
    0.408     0.165 10806.550 

Template file:

#$ -N {{ job_name }}               # job name
#$ -t 1-{{ n_jobs }}               # submit jobs as array
#$ -o logs
#$ -e errs
#$ -cwd                            # use pwd as work dir
#$ -V                              # export environment variables
#$ -pe smp 1                       # request 1 core per job
module load CENSORED/3.6.0 # R version
ulimit -v $(( 1024 * {{ memory | 4096 }} ))
CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'clustermq:::worker("{{ master }}")'

Results were similar for multicore. Using ZeroMQ 4.2.3 and the sessionInfo() at ropensci/drake#1005 (comment).

@mschubert (Owner)

Thank you both. This will be a bit difficult to debug, because neither Will nor I can reproduce the issue.

Can you try the following?

  • Using the Q function directly: clustermq::Q(function(x) Sys.sleep(x), x=rep(90*60,3), n_jobs=2)
  • Does the problem persist with the current v0.9 head? (not sure if this works with drake yet)
  • Can you monitor your TCP/IP connection to see if the socket gets disconnected? (and check your system's timeout; a sketch of one approach follows below)
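For the monitoring point, something like the following may work even without root, assuming iproute2's ss utility is available on the login node (just a monitoring sketch, not part of clustermq):

# Watch established TCP connections from a second R session on the master node;
# 'ss -tn' does not normally require elevated privileges.
repeat {
  cat(format(Sys.time()), "\n")
  system("ss -tn")   # established TCP sockets, numeric addresses
  Sys.sleep(600)     # re-check every 10 minutes
}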

@kendonB (Author) commented Sep 15, 2019

As you suggested, I upgraded to the latest v0.9 branch commit (1a9843d) and then ran the following. The master output is shown as well, including the error after I manually interrupted it (after many hours):

library(clustermq)
options(clustermq.scheduler = "slurm", 
        clustermq.template = "slurm_clustermq.tmpl") # Same template as above

clustermq::Q(function(x) Sys.sleep(x), x=rep(90*60,3), n_jobs=2,
             template = list(
               log_file = "make.log", 
               memory = 500,
               walltime = 250
             ))
#> Submitting 2 worker jobs (ID: 7854) ...
#> Running 3 calculations (0 objs/0 Mb common; 1 calls/chunk) ...
#> [----------------------------------------------------]   0% (2/2 wrk) eta:  ?s^C
#> Error in poll_socket(list(private$socket), timeout = msec) :
#>   Interrupted system call

The worker logs are:

> clustermq:::worker("tcp://mahuika01:7854")
2019-09-15 22:29:33.923444 | Master: tcp://mahuika01:7854
2019-09-15 22:29:33.929172 | WORKER_UP to: tcp://mahuika01:7854
2019-09-15 22:29:33.979543 | > DO_SETUP (0.000s wait)
2019-09-15 22:29:33.980051 | token from msg: vdqdw
2019-09-15 22:29:33.981401 | > DO_CHUNK (0.000s wait)

and:

> clustermq:::worker("tcp://mahuika01:7854")
2019-09-15 22:29:33.923453 | Master: tcp://mahuika01:7854
2019-09-15 22:29:33.929174 | WORKER_UP to: tcp://mahuika01:7854
2019-09-15 22:29:33.932309 | > DO_SETUP (0.000s wait)

which, in contrast to the original call, do not show the timeout messages.

I also tested this with 30 minutes of Sys.sleep (rep(30*60,3)) and it ran as expected, returning a list of three NULL values.

I don't believe I'm able to monitor TCP/IP without sudo access. If you know of a way to do this without sudo, I can try that as well.

@mschubert (Owner)

Ok, just to be clear: this happens on your computing cluster, but not your own machine (using multicore)? And it never happens with only one worker?

I think your error is caused by the TCP/IP connection being closed earlier than we want, but I've got no idea why you see this problem and others don't. I also don't understand why it requires two workers.

What are your values of the following?

cat /proc/sys/net/ipv4/tcp_keepalive_time
cat /proc/sys/net/ipv4/tcp_keepalive_intvl
cat /proc/sys/net/ipv4/tcp_keepalive_probes

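Equivalently, from R (Linux-only /proc paths, the same files as above):

# Read the kernel's TCP keepalive settings; each file holds a single value.
params <- c("tcp_keepalive_time", "tcp_keepalive_intvl", "tcp_keepalive_probes")
vapply(params, function(p) readLines(file.path("/proc/sys/net/ipv4", p)),
       character(1))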

@kendonB (Author) commented Sep 17, 2019

This happens on my computing cluster. I have not yet confirmed that it works on my own machine, but I will check that now.

I also have not confirmed whether it works with one worker; I was using two workers because that is what I was originally running. I will run the above code with a single worker as well, to check.

And:

[kendon.bell@mahuika01 update_vcsn_fst]$ cat /proc/sys/net/ipv4/tcp_keepalive_time
1800
[kendon.bell@mahuika01 update_vcsn_fst]$ cat /proc/sys/net/ipv4/tcp_keepalive_intvl
75
[kendon.bell@mahuika01 update_vcsn_fst]$ cat /proc/sys/net/ipv4/tcp_keepalive_probes
9

Thanks heaps for your patience with this!

@kendonB (Author) commented Sep 17, 2019

I can confirm that I get the same behavior (i.e. it fails) using a single worker. I can also confirm that it works fine on my machine using multicore.

@mschubert (Owner)

Your tcp_keepalive_time is 0.5 hours (1800 seconds), whereas the default is 2 hours.

If this is the cause, it should also fail for me with Sys.sleep(180*60). I'm testing this now.

@kendonB (Author) commented Sep 17, 2019

If that is indeed the cause, would it be possible for clustermq to detect tcp_keepalive_time and restart the polling call before that window elapses?
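Something like this, perhaps (just a sketch of the idea using rzmq's poll.socket; 'socket' stands in for the master's socket, and none of this is clustermq's actual polling code):

# Poll in slices shorter than the kernel's keepalive window instead of one
# indefinite blocking call, so a stalled connection is noticed in time.
keepalive <- as.integer(readLines("/proc/sys/net/ipv4/tcp_keepalive_time"))
slice <- keepalive / 2  # re-enter the poll well within the keepalive window
repeat {
  ready <- rzmq::poll.socket(list(socket), list("read"), timeout = slice)
  if (isTRUE(ready[[1]]$read)) break  # data arrived; proceed to receive it
  # otherwise loop again, optionally sending a no-op message to generate traffic
}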

@mschubert (Owner)

If that's really the cause, we can have the sockets send keep-alives themselves so they don't get disconnected.

At this point, I'm pretty convinced that it has something to do with TCP/IP timeouts.

But they're not the whole story, because my 3 h test finishes fine with a 2 h timeout (and in the past I had 24 h jobs finish fine too).

Could it be that your computing cluster has some software running that more aggressively prunes apparently stale connections?

Alternatively, we can try explicit keepalives, but that will not happen before the 0.9 release (rzmq just does not support them).

@kendonB (Author) commented Sep 26, 2019

I have contacted NeSI support to see if they know whether our system has pruners in place.

@mschubert (Owner)

@kendonB Any news from your cluster support on what may have caused the apparent disconnects?

@kendonB (Author) commented Oct 8, 2019

This came through just now:

This situation is very strange. The user application relies on compute nodes being able to talk to the login node. And they can, but if there is no traffic for more than an hour, the packets sent from compute nodes just disappear into a black hole. All connections remain open, but nothing arrives at mahuika. I would say it is a problem on our end, but no idea how to fix it… The connection is NATed via [a certain host] and the packets appear on that [host], so something drops the packet further down the link...

One workaround is to submit a process from an interactive Slurm session, but the session needs to last for as long as the jobs run.

Another option is to use Infiniband networking on mahuika instead of Ethernet for the communication between workers and the master. I modified the clustermq library to verify that it works by hard coding the [host's IP] address [over Infiniband, which differs from its IP address over Ethernet]. Maybe ask the user to contact the clustermq developers for a proper way of doing that?

How likely do you think it is that the clustermq developers could support a connection over Infiniband, or guide us in doing the same?

@mschubert (Owner)

First of all, great job by your support team. They actually spent the time to track this down; I'm impressed 👍

The way I get the host is via Sys.info()["nodename"], stripping the domain qualification (using host instead of host.domain.name, because I found that the short name resolves on more systems).
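In code terms, roughly (a paraphrase of the above, not the exact source; 'port' is assumed to be chosen elsewhere):

# Build the master address from the unqualified hostname.
host   <- strsplit(Sys.info()[["nodename"]], ".", fixed = TRUE)[[1]][1]
master <- sprintf("tcp://%s:%s", host, port)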

That said, it would probably be better to provide an option for which interface to use (this would solve both issues). I am, however, not entirely sure how best to implement this in a portable manner, so suggestions are welcome.

@valtandor

@mschubert I'm a member of the NeSI support team and may be able to help come up with something.

How widely portable do you need the solution to be? E.g. Linux only but all flavours, Linux plus Mac OS X, Linux plus Mac OS X plus Windows?

@valtandor

@mschubert @kendonB I've just opened a PR that I hope offers a sensible solution: #172

mschubert changed the title from "clustermq run works for short-running targets but fails with long-running targets" to "Network interface for workers is not configurable" on Jan 27, 2020
@mschubert (Owner)

ifconfig is obsolete on Linux and will eventually be removed, so relying on it is not a good idea.

I've found that ZeroMQ supports network interfaces as endpoint identifiers, which is a better way to bind sockets to specific devices.

I will add an option in clustermq to specify this interface, which will solve this issue more robustly.
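For illustration, binding directly to a device looks like this (a minimal rzmq sketch; the interface name "ib0" and port 6000 are example values):

# ZeroMQ accepts an interface name in place of a hostname when binding,
# so a socket can be tied to a specific device such as Infiniband:
library(rzmq)
ctx  <- init.context()
sock <- init.socket(ctx, "ZMQ_REP")
bind.socket(sock, "tcp://ib0:6000")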

@mschubert (Owner) commented Jun 21, 2020

Hi @kendonB,

I've fixed this now: in develop, the option clustermq.host can be set to an interface name. Please let me know if this works for you, and reopen otherwise.
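For example (the interface name here is just a placeholder):

# Bind the master's socket to a specific network interface:
options(clustermq.host = "ib0")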
