
Network interface for workers is not configurable #170

Closed

kendonB opened this issue Sep 13, 2019 · 17 comments
@kendonB commented Sep 13, 2019

Continuing from ropensci/drake#1005. I ran this script:

library(clustermq)
#> * Option 'clustermq.scheduler' not set, defaulting to 'LOCAL'
#> --- see: https://mschubert.github.io/clustermq/articles/userguide.html#configuration
library(rlang)
options(clustermq.scheduler = "slurm", 
        clustermq.template = "slurm_clustermq.tmpl")

test_run <- function(wait = 0.1) {
  workers <- workers(n_jobs = 2L, template = 
                       list(
                         log_file = "make.log", 
                         memory = 500,
                         walltime = 250
                       ))
  on.exit(workers$finalize())
  workers$set_common_data(
    export = list(),
    fun = identity,
    const = list(),
    rettype = list(),
    pkgs = character(0),
    common_seed = 0L,
    token = "set_common_data_token"
  )
  main_loop(workers = workers, wait = wait)
}

main_loop <- function(workers, wait) {
  counter <- 2L
  while (counter > 0L) {
    print(counter)
    msg <- workers$receive_data()
    if (!is.null(msg$result)) {
      counter <- counter - 1L
    }
    if (!identical(msg$token, "set_common_data_token")) {
      workers$send_common_data()
    }
    else if (counter > 0L) {
      workers$send_call(
        expr = {
          print("I'm working.")
          c(Sys.sleep(wait), 123)
          NULL
        },
        env = list(wait = wait)
      )
    } else {
      workers$send_shutdown_worker()
    }
  }
}

test_run(wait = 3600*1.5)

The master process hangs (presumably at the workers$receive_data() call) and the two worker logs are:

> clustermq:::worker("tcp://mahuika01:7649")
2019-09-13 12:58:54.413862 | Master: tcp://mahuika01:7649
2019-09-13 12:58:54.422901 | WORKER_UP to: tcp://mahuika01:7649
2019-09-13 12:58:54.424276 | > DO_SETUP (0.000s wait)
2019-09-13 12:58:54.424635 | token from msg: set_common_data_token
2019-09-13 12:58:54.449676 | > DO_CALL (0.000s wait)
[1] "I'm working."
2019-09-13 14:28:54.486711 | eval'd: {print("I'm working.")c(Sys.sleep(wait), 123)NULL
Error in clustermq:::worker("tcp://mahuika01:7649") :
  Timeout reached, terminating
Execution halted

and

> clustermq:::worker("tcp://mahuika01:7649")
2019-09-13 12:58:55.203638 | Master: tcp://mahuika01:7649
2019-09-13 12:58:55.217463 | WORKER_UP to: tcp://mahuika01:7649
2019-09-13 12:58:55.218325 | > DO_SETUP (0.000s wait)
2019-09-13 12:58:55.218681 | token from msg: set_common_data_token
2019-09-13 12:58:55.226476 | > DO_CALL (0.001s wait)
[1] "I'm working."
2019-09-13 14:28:55.327345 | eval'd: {print("I'm working.")c(Sys.sleep(wait), 123)NULL
Error in clustermq:::worker("tcp://mahuika01:7649") :
  Timeout reached, terminating
Execution halted

and on the master I see:

[1] 2
[1] 2
[1] 2
[1] 2
[1] 2

My template file is:

#!/bin/sh
#SBATCH --job-name={{ job_name }}
#SBATCH --output={{ log_file | /dev/null }}%a # %a is the array index
#SBATCH --mem-per-cpu={{ memory | 4096 }}
#SBATCH --array=1-{{ n_jobs }}
#SBATCH --account=landcare00063
#SBATCH --time={{ walltime }}
#SBATCH --partition={{ partition | large }}
source ~/.bashrc
source /etc/profile
## Export value of DEBUGME environment var to slave
## export DEBUGME=<%= Sys.getenv("DEBUGME") %>

ulimit -v $(( 1024 * {{ memory | 4096 }} ))
CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'clustermq:::worker("{{ master }}")'

My session info is:

> devtools::session_info()
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 3.6.1 (2019-07-05)
 os       CentOS Linux 7 (Core)
 system   x86_64, linux-gnu
 ui       X11
 language (EN)
 collate  en_NZ.UTF-8
 ctype    en_NZ.UTF-8
 tz       NZ
 date     2019-09-13

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version date       lib source
 assertthat    0.2.1   2019-03-21 [2] CRAN (R 3.6.1)
 backports     1.1.4   2019-04-10 [2] CRAN (R 3.6.1)
 callr         3.3.0   2019-07-04 [2] CRAN (R 3.6.1)
 cli           1.1.0   2019-03-19 [2] CRAN (R 3.6.1)
 clustermq   * 0.8.8   2019-06-05 [1] CRAN (R 3.6.1)
 crayon        1.3.4   2017-09-16 [2] CRAN (R 3.6.1)
 desc          1.2.0   2018-05-01 [2] CRAN (R 3.6.1)
 devtools      2.1.0   2019-07-06 [1] CRAN (R 3.6.1)
 digest        0.6.20  2019-07-04 [2] CRAN (R 3.6.1)
 fs            1.3.1   2019-05-06 [1] CRAN (R 3.6.1)
 glue          1.3.1   2019-03-12 [2] CRAN (R 3.6.1)
 magrittr      1.5     2014-11-22 [2] CRAN (R 3.6.1)
 memoise       1.1.0   2017-04-21 [2] CRAN (R 3.6.1)
 pkgbuild      1.0.3   2019-03-20 [2] CRAN (R 3.6.1)
 pkgload       1.0.2   2018-10-29 [1] CRAN (R 3.6.1)
 prettyunits   1.0.2   2015-07-13 [2] CRAN (R 3.6.1)
 processx      3.4.0   2019-07-03 [2] CRAN (R 3.6.1)
 ps            1.3.0   2018-12-21 [2] CRAN (R 3.6.1)
 R6            2.4.0   2019-02-14 [2] CRAN (R 3.6.1)
 Rcpp          1.0.2   2019-07-25 [1] CRAN (R 3.6.1)
 remotes       2.1.0   2019-06-24 [1] CRAN (R 3.6.1)
 rlang       * 0.4.0   2019-06-25 [2] CRAN (R 3.6.1)
 rprojroot     1.3-2   2018-01-03 [2] CRAN (R 3.6.1)
 rzmq          0.9.6   2019-03-01 [1] CRAN (R 3.6.1)
 sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.6.1)
 testthat      2.2.1   2019-07-25 [1] CRAN (R 3.6.1)
 usethis       1.5.1   2019-07-04 [1] CRAN (R 3.6.1)
 withr         2.1.2   2018-03-15 [2] CRAN (R 3.6.1)

And my ZeroMQ version is:

> rzmq::zmq.version()
[1] "4.2.5"

@mschubert, what other troubleshooting steps could I take? It does seem environmental, given that this code works for you and @wlandau.

@wlandau (Contributor) commented Sep 13, 2019

To follow up, I tried again with 2 workers on SGE:

> library(clustermq)
* Option 'clustermq.scheduler' not set, defaulting to 'SGE'
--- see: https://mschubert.github.io/clustermq/articles/userguide.html#configuration
> options(
+   clustermq.scheduler = "sge",
+   clustermq.template = "sge_clustermq.tmpl"
+ )
> 
> test_run <- function(wait = 0.1) {
+   workers <- workers(n_jobs = 2L)
+   on.exit(workers$finalize())
+   workers$set_common_data(
+     export = list(),
+     fun = identity,
+     const = list(),
+     rettype = list(),
+     pkgs = character(0),
+     common_seed = 0L,
+     token = "set_common_data_token"
+   )
+   main_loop(workers = workers, wait = wait)
+ }
> 
> main_loop <- function(workers, wait) {
+   counter <- 4L
+   while (counter > 0L) {
+     print(counter)
+     msg <- workers$receive_data()
+     if (!is.null(msg$result)) {
+       counter <- counter - 1L
+     }
+     if (!identical(msg$token, "set_common_data_token")) {
+       workers$send_common_data()
+     }
+     else if (counter > 0L) {
+       workers$send_call(
+         expr = c(Sys.sleep(wait), 123),
+         env = list(wait = wait)
+       )
+     } else {
+       workers$send_shutdown_worker()
+     }
+   }
+ }
> 
> test_run(wait = 90 * 60)
[1] 4
[1] 4
[1] 4
[1] 4
[1] 4
[1] 3
[1] 2
[1] 1
> 
> proc.time()
     user    system   elapsed 
    0.408     0.165 10806.550 

Template file:

#$ -N {{ job_name }}               # job name
#$ -t 1-{{ n_jobs }}               # submit jobs as array
#$ -o logs
#$ -e errs
#$ -cwd                            # use pwd as work dir
#$ -V                              # export environment variables
#$ -pe smp 1                       # request 1 core per job
module load CENSORED/3.6.0 # R version
ulimit -v $(( 1024 * {{ memory | 4096 }} ))
CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'clustermq:::worker("{{ master }}")'

Results were similar for multicore. Using ZeroMQ 4.2.3 and the sessionInfo() at ropensci/drake#1005 (comment).

@mschubert (Owner)

Thank you both. This will be a bit difficult to debug, because neither Will nor I can reproduce the issue.

Can you try the following?

  • Using the Q function directly: clustermq::Q(function(x) Sys.sleep(x), x=rep(90*60,3), n_jobs=2)
  • Does the problem persist with the current v0.9 head? (not sure if this works with drake yet)
  • Can you monitor your TCP/IP connection to see if the socket gets disconnected? (and check your system's timeout; a sketch of one approach follows below)
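For the monitoring point, something like the following may work even without root, assuming iproute2's ss utility is available on the login node (just a monitoring sketch, not part of clustermq):

# Watch established TCP connections from a second R session on the master node;
# 'ss -tn' does not normally require elevated privileges.
repeat {
  cat(format(Sys.time()), "\n")
  system("ss -tn")   # established TCP sockets, numeric addresses
  Sys.sleep(600)     # re-check every 10 minutes
}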

@kendonB (Author) commented Sep 15, 2019

As you suggested, I upgraded to the latest v0.9 branch commit (1a9843d) and then ran the following. The master output is shown as well, including the error after I manually interrupted it (after many hours):

library(clustermq)
options(clustermq.scheduler = "slurm", 
        clustermq.template = "slurm_clustermq.tmpl") # Same template as above

clustermq::Q(function(x) Sys.sleep(x), x=rep(90*60,3), n_jobs=2,
             template = list(
               log_file = "make.log", 
               memory = 500,
               walltime = 250
             ))
#> Submitting 2 worker jobs (ID: 7854) ...
#> Running 3 calculations (0 objs/0 Mb common; 1 calls/chunk) ...
#> [----------------------------------------------------]   0% (2/2 wrk) eta:  ?s^C
#> Error in poll_socket(list(private$socket), timeout = msec) :
#>   Interrupted system call

The worker logs are:

> clustermq:::worker("tcp://mahuika01:7854")
2019-09-15 22:29:33.923444 | Master: tcp://mahuika01:7854
2019-09-15 22:29:33.929172 | WORKER_UP to: tcp://mahuika01:7854
2019-09-15 22:29:33.979543 | > DO_SETUP (0.000s wait)
2019-09-15 22:29:33.980051 | token from msg: vdqdw
2019-09-15 22:29:33.981401 | > DO_CHUNK (0.000s wait)

and:

> clustermq:::worker("tcp://mahuika01:7854")
2019-09-15 22:29:33.923453 | Master: tcp://mahuika01:7854
2019-09-15 22:29:33.929174 | WORKER_UP to: tcp://mahuika01:7854
2019-09-15 22:29:33.932309 | > DO_SETUP (0.000s wait)

which, in contrast to the original call, do not show the timeout messages.

I also tested this with 30 minutes of Sys.sleep (rep(30*60,3)) and it ran as expected, returning a list of three NULL values.

I don't believe I'm able to monitor TCP/IP without sudo access. If you know of a way to do this without sudo, I can try that as well.

@mschubert (Owner)

Ok, just to be clear: this happens on your computing cluster, but not your own machine (using multicore)? And it never happens with only one worker?

I think your error is caused by the TCP/IP connection being closed earlier than we want, but I've got no idea why you see this problem and others don't. I also don't understand why it requires two workers.

What are your values of the following?

cat /proc/sys/net/ipv4/tcp_keepalive_time
cat /proc/sys/net/ipv4/tcp_keepalive_intvl
cat /proc/sys/net/ipv4/tcp_keepalive_probes

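Equivalently, from R (Linux-only /proc paths, the same files as above):

# Read the kernel's TCP keepalive settings; each file holds a single value.
params <- c("tcp_keepalive_time", "tcp_keepalive_intvl", "tcp_keepalive_probes")
vapply(params, function(p) readLines(file.path("/proc/sys/net/ipv4", p)),
       character(1))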

@kendonB (Author) commented Sep 17, 2019

This happens on my computing cluster. I have not yet confirmed that it works on my own machine, but I will check that now.

I also have not confirmed whether it works with one worker; I was using two workers because that is what I was originally running. I will run the above code with a single worker as well, to check.

And:

[kendon.bell@mahuika01 update_vcsn_fst]$ cat /proc/sys/net/ipv4/tcp_keepalive_time
1800
[kendon.bell@mahuika01 update_vcsn_fst]$ cat /proc/sys/net/ipv4/tcp_keepalive_intvl
75
[kendon.bell@mahuika01 update_vcsn_fst]$ cat /proc/sys/net/ipv4/tcp_keepalive_probes
9

Thanks heaps for your patience with this!

@kendonB (Author) commented Sep 17, 2019

I can confirm that I get the same behavior (i.e. it fails) using a single worker. I can also confirm that it works fine on my machine using multicore.

@mschubert (Owner)

Your tcp_keepalive_time is 0.5 hours (1800 seconds), whereas the default is 2 hours.

If this is the cause, it should also fail for me with Sys.sleep(180*60). I'm testing this now.

@kendonB (Author) commented Sep 17, 2019

If that is indeed the cause, would it be possible for clustermq to detect tcp_keepalive_time and restart the polling call before that window elapses?
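Something like this, perhaps (just a sketch of the idea using rzmq's poll.socket; 'socket' stands in for the master's socket, and none of this is clustermq's actual polling code):

# Poll in slices shorter than the kernel's keepalive window instead of one
# indefinite blocking call, so a stalled connection is noticed in time.
keepalive <- as.integer(readLines("/proc/sys/net/ipv4/tcp_keepalive_time"))
slice <- keepalive / 2  # re-enter the poll well within the keepalive window
repeat {
  ready <- rzmq::poll.socket(list(socket), list("read"), timeout = slice)
  if (isTRUE(ready[[1]]$read)) break  # data arrived; proceed to receive it
  # otherwise loop again, optionally sending a no-op message to generate traffic
}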

@mschubert (Owner)

If that's really the cause, we can have the sockets send keep-alives themselves so they don't get disconnected.

At this point, I'm pretty convinced that it has something to do with TCP/IP timeouts.

But they're not the whole story, because my 3 h test finishes fine with a 2 h timeout (and in the past I had 24 h jobs finish fine too).

Could it be that your computing cluster has some software running that more aggressively prunes apparently stale connections?

Alternatively, we can try explicit keepalives, but that will not happen before the 0.9 release (rzmq just does not support them).

@kendonB (Author) commented Sep 26, 2019

I have contacted NeSI support to see if they know whether our system has pruners in place.

@mschubert (Owner)

@kendonB Any news from your cluster support on what may have caused the apparent disconnects?

@kendonB (Author) commented Oct 8, 2019

This came through just now:

This situation is very strange. The user application relies on compute nodes being able to talk to the login node. And they can, but if there is no traffic for more than an hour, the packets sent from compute nodes just disappear into a black hole. All connections remain open, but nothing arrives at mahuika. I would say it is a problem on our end, but no idea how to fix it… The connection is NATed via [a certain host] and the packets appear on that [host], so something drops the packet further down the link...

One workaround is to submit a process from an interactive Slurm session, but the session needs to last for as long as the jobs run.

Another option is to use Infiniband networking on mahuika instead of Ethernet for the communication between workers and the master. I modified the clustermq library to verify that it works by hard coding the [host's IP] address [over Infiniband, which differs from its IP address over Ethernet]. Maybe ask the user to contact the clustermq developers for a proper way of doing that?

How likely do you think it is that the clustermq developers could support a connection over Infiniband, or guide us in doing the same?

@mschubert (Owner)

First of all, great job by your support team. They actually spent the time to track this down; I'm impressed 👍

The way I get the host is via Sys.info()["nodename"], stripping the domain qualification (using host instead of host.domain.name, because I found that the short name resolves on more systems).
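In code terms, roughly (a paraphrase of the above, not the exact source; 'port' is assumed to be chosen elsewhere):

# Build the master address from the unqualified hostname.
host   <- strsplit(Sys.info()[["nodename"]], ".", fixed = TRUE)[[1]][1]
master <- sprintf("tcp://%s:%s", host, port)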

That said, it would probably be better to provide an option for which interface to use (this would solve both issues). I am, however, not entirely sure how best to implement this in a portable manner, so suggestions are welcome.

@valtandor

@mschubert I'm a member of the NeSI support team and may be able to help come up with something.

How widely portable do you need the solution to be? E.g. Linux only but all flavours, Linux plus Mac OS X, Linux plus Mac OS X plus Windows?

@valtandor

@mschubert @kendonB I've just opened a PR that I hope offers a sensible solution: #172

mschubert changed the title from "clustermq run works for short-running targets but fails with long-running targets" to "Network interface for workers is not configurable" on Jan 27, 2020
@mschubert (Owner)

ifconfig is obsolete on Linux and will eventually be removed, so relying on it is not a good idea.

I've found that ZeroMQ supports network interfaces as endpoint identifiers, which is a better way to bind sockets to specific devices.

I will add an option in clustermq to specify this interface, which will solve this issue more robustly.
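For illustration, binding directly to a device looks like this (a minimal rzmq sketch; the interface name "ib0" and port 6000 are example values):

# ZeroMQ accepts an interface name in place of a hostname when binding,
# so a socket can be tied to a specific device such as Infiniband:
library(rzmq)
ctx  <- init.context()
sock <- init.socket(ctx, "ZMQ_REP")
bind.socket(sock, "tcp://ib0:6000")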

@mschubert (Owner) commented Jun 21, 2020

Hi @kendonB,

I've fixed this now: in develop, the option clustermq.host can be set to an interface name. Please let me know if this works for you, and reopen otherwise.
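For example (the interface name here is just a placeholder):

# Bind the master's socket to a specific network interface:
options(clustermq.host = "ib0")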
