Network interface for workers is not configurable #170
To follow up, I tried again with 2 workers on SGE:

> library(clustermq)
* Option 'clustermq.scheduler' not set, defaulting to ‘SGE’
--- see: https://mschubert.github.io/clustermq/articles/userguide.html#configuration
> options(
+ clustermq.scheduler = "sge",
+ clustermq.template = "sge_clustermq.tmpl"
+ )
>
> test_run <- function(wait = 0.1) {
+ workers <- workers(n_jobs = 2L)
+ on.exit(workers$finalize())
+ workers$set_common_data(
+ export = list(),
+ fun = identity,
+ const = list(),
+ rettype = list(),
+ pkgs = character(0),
+ common_seed = 0L,
+ token = "set_common_data_token"
+ )
+ main_loop(workers = workers, wait = wait)
+ }
>
> main_loop <- function(workers, wait) {
+ counter <- 4L
+ while (counter > 0L) {
+ print(counter)
+ msg <- workers$receive_data()
+ if (!is.null(msg$result)) {
+ counter <- counter - 1L
+ }
+ if (!identical(msg$token, "set_common_data_token")) {
+ workers$send_common_data()
+ }
+ else if (counter > 0L) {
+ workers$send_call(
+ expr = c(Sys.sleep(wait), 123),
+ env = list(wait = wait)
+ )
+ } else {
+ workers$send_shutdown_worker()
+ }
+ }
+ }
>
> test_run(wait = 90 * 60)
[1] 4
[1] 4
[1] 4
[1] 4
[1] 4
[1] 3
[1] 2
[1] 1
>
> proc.time()
    user  system   elapsed
   0.408   0.165 10806.550

Template file:
Results were similar for multicore. Using ZeroMQ 4.2.3 and the …
Thank you both. This will be a bit difficult to debug, because neither me nor Will see the issue. Can you try the following?
As you suggested, I upgraded to the latest v0.9 branch commit (1a9843d) and then ran the following. The master output is also shown here, including the error after I manually interrupted (after many hours):

library(clustermq)
options(clustermq.scheduler = "slurm",
clustermq.template = "slurm_clustermq.tmpl") # Same template as above
clustermq::Q(function(x) Sys.sleep(x), x=rep(90*60,3), n_jobs=2,
template = list(
log_file = "make.log",
memory = 500,
walltime = 250
))
#> Submitting 2 worker jobs (ID: 7854) ...
#> Running 3 calculations (0 objs/0 Mb common; 1 calls/chunk) ...
#> [----------------------------------------------------] 0% (2/2 wrk) eta: ?s^C
#> Error in poll_socket(list(private$socket), timeout = msec) :
#> Interrupted system call

The worker logs are:

> clustermq:::worker("tcp://mahuika01:7854")
2019-09-15 22:29:33.923444 | Master: tcp://mahuika01:7854
2019-09-15 22:29:33.929172 | WORKER_UP to: tcp://mahuika01:7854
2019-09-15 22:29:33.979543 | > DO_SETUP (0.000s wait)
2019-09-15 22:29:33.980051 | token from msg: vdqdw
2019-09-15 22:29:33.981401 | > DO_CHUNK (0.000s wait)

and:

> clustermq:::worker("tcp://mahuika01:7854")
2019-09-15 22:29:33.923453 | Master: tcp://mahuika01:7854
2019-09-15 22:29:33.929174 | WORKER_UP to: tcp://mahuika01:7854
2019-09-15 22:29:33.932309 | > DO_SETUP (0.000s wait)

which, in contrast to the original call, do not show the timeout messages. I also tested this with 30 minutes of … I don't believe I can monitor TCP/IP traffic without sudo access; if you know of a way to do this without sudo access, I can try that as well.
Ok, just to be clear: this happens on your computing cluster, but not your own machine (using multicore)? And it never happens with only one worker? I think your error is somehow caused by closing the TCP/IP connection earlier than we want to, but I've got no idea why you have this problem and others don't. And I don't understand why it requires 2 workers. What are the values of:

cat /proc/sys/net/ipv4/tcp_keepalive_time
cat /proc/sys/net/ipv4/tcp_keepalive_intvl
cat /proc/sys/net/ipv4/tcp_keepalive_probes
This happens on my computing cluster. I have not confirmed that it works on my own machine, but I will check that now. I also have not confirmed whether it works with one worker; I was using two workers because that's what I was originally running. I will run the above code with just one worker as well to check. And:

[kendon.bell@mahuika01 update_vcsn_fst]$ cat /proc/sys/net/ipv4/tcp_keepalive_time
1800
[kendon.bell@mahuika01 update_vcsn_fst]$ cat /proc/sys/net/ipv4/tcp_keepalive_intvl
75
[kendon.bell@mahuika01 update_vcsn_fst]$ cat /proc/sys/net/ipv4/tcp_keepalive_probes
9

Thanks heaps for your patience with this!
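For context, with settings like these the kernel only declares an idle connection dead after tcp_keepalive_time + tcp_keepalive_intvl * tcp_keepalive_probes seconds, and only if SO_KEEPALIVE was enabled on the socket in the first place. A quick back-of-the-envelope check (Python used purely for illustration; values hard-coded from the output above):

```python
tcp_keepalive_time = 1800   # idle seconds before the first keepalive probe
tcp_keepalive_intvl = 75    # seconds between unanswered probes
tcp_keepalive_probes = 9    # unanswered probes before the peer is declared dead

# Worst-case time before the kernel notices a dead idle connection
dead_after = tcp_keepalive_time + tcp_keepalive_intvl * tcp_keepalive_probes
print(dead_after)  # 2475 seconds, roughly 41 minutes
```

Note that this timer is irrelevant for sockets that never opt in to keepalives, which is the common default.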
I can confirm that I get the same behavior (i.e. it fails) using a single worker. I can also confirm that it works fine on my machine using …
Your … If this is the cause, it should also fail for me with …
If that is indeed the cause, is it possible for …
If that's really the cause, we can have the sockets send keepalives themselves so they don't get disconnected. At this point, I'm pretty convinced that it has something to do with TCP/IP timeouts. But they're not the whole story, because my 3 h test finishes fine with a 2 h timeout (and in the past I had 24 h jobs finish fine too). Could it be that your computing cluster runs some software that more aggressively prunes apparently stale connections? Alternatively, we can try explicit keepalives, but that will not happen before the …
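At the plain-socket level, opting in to explicit keepalives looks roughly like the sketch below (Python stdlib, purely illustrative, not clustermq's implementation; the TCP_KEEP* constants are Linux-specific, which is why they are guarded, and the timing values are made-up examples). libzmq exposes equivalent knobs as the ZMQ_TCP_KEEPALIVE, ZMQ_TCP_KEEPALIVE_IDLE, ZMQ_TCP_KEEPALIVE_INTVL and ZMQ_TCP_KEEPALIVE_CNT socket options.

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Turn keepalives on at all (they are off by default on most systems)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

# Linux-specific fine-tuning: first probe after 10 min idle,
# then every 60 s, giving up after 5 unanswered probes
if hasattr(socket, "TCP_KEEPIDLE"):
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 600)
if hasattr(socket, "TCP_KEEPINTVL"):
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 60)
if hasattr(socket, "TCP_KEEPCNT"):
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)

print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))  # non-zero once enabled
s.close()
```

Per-socket settings like these override the system-wide /proc/sys/net/ipv4 defaults, so they would work regardless of what the cluster admins configured.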
I have contacted NeSI support to see if they know whether our system has pruners in place.
@kendonB Any news from your cluster support on what may have caused the apparent disconnects? |
This came through just now:

This situation is very strange. The user application relies on compute nodes being able to talk to the login node. And they can, but if there is no traffic for more than an hour, the packets sent from compute nodes just disappear into a black hole. All connections remain open, but nothing arrives to mahuika. I would say it is a problem on our end, but no idea how to fix it… Connection is NATed via [a certain host] and the packets appear on that [host], so something drops the packet further down the link... One workaround is to submit a process from an interactive Slurm session, but the session needs to last for as long as the jobs run. Another option is to use Infiniband networking on mahuika instead of Ethernet for the communication between workers and the master. I modified the clustermq library to verify that it works by hard-coding the [host's IP] address [over Infiniband, which differs from its IP address over Ethernet]. Maybe ask the user to contact the clustermq developers for a proper way of doing that?

How likely do you think it would be that the clustermq developers could do a connection over Infiniband, or guide us to do the same?
First of all, great job from your support team. They actually spent the time to track this down, I'm impressed 👍 The way I get the host is by … That said, it would probably be better to provide an option for which interface to use (this would solve both issues). I am, however, not entirely sure how to best implement this in a portable manner, so suggestions are welcome.
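As one possible starting point for such an option: POSIX systems can enumerate their network interface names, which a hypothetical user-facing setting (e.g. a `worker_interface` option; the name is invented here) could be validated against. A sketch using Python's standard library, purely for illustration:

```python
import socket

# (index, name) pairs for every network interface on this host,
# e.g. [(1, 'lo'), (2, 'eth0'), (3, 'ib0')] on a typical Linux box
for index, name in socket.if_nameindex():
    print(index, name)
```

An R equivalent would need the same information from the OS, but the enumeration idea carries over: let the user name an interface, check it exists, then resolve that interface's address instead of the default hostname lookup.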
@mschubert I'm a member of the NeSI support team and may be able to help come up with something. How widely portable do you need the solution to be? E.g. Linux only but all flavours, Linux plus Mac OS X, or Linux plus Mac OS X plus Windows?
@mschubert @kendonB I've just done a PR that I hope offers a sensible solution. #172 |
I've found that ZeroMQ supports network interfaces as endpoint identifiers, which is a better way to bind sockets to specific devices. I will add an option in …
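For reference, the ZeroMQ tcp transport accepts an interface name (or a wildcard) in place of an address when binding, so an interface option can map directly onto the endpoint string. The port number and address below are just examples:

```
tcp://*:5555             bind to all interfaces (wildcard)
tcp://eth0:5555          bind to the interface named eth0
tcp://ib0:5555           bind to an Infiniband interface
tcp://192.168.1.17:5555  bind to one specific local address
```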
Continuing from ropensci/drake#1005. I ran this script:
The master process hangs (presumably at the
workers$receive_data()
call), and the two worker logs are:

and:
and on the master I see:
My template file is:
My session info is:
And my ZeroMQ version is:
@mschubert, what other troubleshooting steps could I try? It does seem environmental, given that this code works for you and @wlandau.