Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Job port is set to NA after binding failure #270

Closed
statquant opened this issue Aug 13, 2021 · 11 comments
Closed

Job port is set to NA after binding failure #270

statquant opened this issue Aug 13, 2021 · 11 comments
Labels

Comments

@statquant
Copy link

Hello, for some reasons that elude me and that I cannot reproduce (yeah that's not much to go to), sometimes I cannot send jobs to my slurm grid.
What happen is I see

Submitting 50 worker jobs (ID: cmqNA) # notice the NA here

and then R just get stuck until I send an interrupt in the terminal. Then I have to wait really long to get an error and the terminal back (which might be a bug in itself ?)
Given this NA I was wondering if there is not something that can be done. Note that when I rerun the same command later on all works well.

Many thanks for the package, it's great and sorry for this unhelpful issue.

@mschubert
Copy link
Owner

Is this from CRAN or Github?

It looks like a port binding failure that returns NA instead of retrying/raising an error. I'll need to look at the exact implementation (i.e., version) to see how this is possible.

@aruaud
Copy link

aruaud commented Dec 7, 2021

I have the same issue with an SGE scheduler - although I do not have to wait long to have the session stop. I usually re-send the job again and again until it works.. Similar to @statquant, I can't reproduce it but it tends to occur when I am sending big data.

Also thanks for the great package :)

@luwidmer
Copy link

luwidmer commented Mar 23, 2022

I also just ran into this with LSF, R 4.1.0 and clustermq_0.8.95.1 off CRAN - might be a good idea to check if the port is NA and fail if it is?

@luwidmer
Copy link

luwidmer commented Mar 23, 2023

Ran into this recently again... pretty hard to reproduce. @mschubert since the NA comes from casting the port to integer (as.integer(sub(".*:", "", private$master))), would it be possible to add something along the lines of

if (is.na(private$port))
{
  stop("Port is NA, aborting, address was: ", private$master)
}

after line 33 in qsys.r on master (https://github.com/mschubert/clustermq/blob/master/R/qsys.r#L33) or after line 21 on develop (https://github.com/mschubert/clustermq/blob/develop/R/qsys.r#L20). This would prevent R from hanging and hopefully give use an error message that gets closer to the root cause.

Regarding the port selection: what are your thoughts on making this configurable (at the moment the range of 6000:9999 is hard-coded in util.r in master at https://github.com/mschubert/clustermq/blob/master/R/util.r#L35)? Sampling some currently free ports might be more robust (e.g. using parallelly::freePort, see https://parallelly.futureverse.org/reference/freePort.html) than sampling 100 ports from a fixed port range without checking whether these ports are free?

@luwidmer
Copy link

luwidmer commented Mar 23, 2023

I can confirm that port NA is generated when all sampled ports are in use (e.g. test by overriding the host method in the package):

library(clustermq)
cmq <- asNamespace("clustermq")
unlockBinding("host", cmq)
cmq$host <- function(
    node=getOption("clustermq.host", Sys.info()["nodename"]),
    ports=32781, # This port is in use 
    n=1
) {
  utils::head(sample(sprintf("tcp://%s:%i", node, ports)), n)
}

fx = function(x) {
  tibble(x = x)
}

Q(fx, x=1:3, n_jobs=3, pkgs = c("tidyverse") )

with the modified error handing yields

> Q(fx, x=1:3, n_jobs=3, pkgs = c("tidyverse") )
Error in super$initialize(..., template = template) : Port is NA, aborting, address was: 

This apparently also happens when the first port in the list is in use, the others are not checked (!):

cmq$host <- function(
    node=getOption("clustermq.host", Sys.info()["nodename"]),
    ports=32781:38000, # This port is in use 
    n=1
) {
  sprintf("tcp://%s:%i", node, ports)
}
> Q(fx, x=1:3, n_jobs=3, pkgs = c("tidyverse") )
Error in super$initialize(..., template = template) : Port is NA, aborting, address was: 

@luwidmer
Copy link

luwidmer commented Mar 23, 2023

@mschubert a suggestion for host() could be:

host <- function(
  node=getOption("clustermq.host", Sys.info()["nodename"]),
  ports=getOption("clustermq.portRange", 1024:65535), 
  n=20
) {
  free_ports <- numeric(n) * NA
  
  for (i in seq_len(n)){
    free_ports[i] <- parallelly::freePort(ports, default = NA)
    ports <- setdiff(ports, free_ports[i])
  }
  
  if (any(is.na(free_ports)))
  {
    stop("Free ports must not be NA")
  }
  
  sprintf("tcp://%s:%i", node, free_ports)
}

@luwidmer
Copy link

luwidmer commented Mar 23, 2023

Is it possible that https://github.com/mschubert/clustermq/blob/master/src/CMQMaster.cpp#L19 has a bug:

    std::string listen(Rcpp::CharacterVector addrs) {
        int i;
        for (i=0; i<addrs.length(); i++) {
            auto addr = Rcpp::as<std::string>(addrs[i]);
            try {
                sock.bind(addr);
            } catch(zmq::error_t const &e) {
                if (errno != EADDRINUSE)
                    Rf_error(e.what());
            }
            return sock.get(zmq::sockopt::last_endpoint);
        }
        Rf_error("Could not bind port after ", i, " tries");
    }

Shouldn't this read as follows (note the return statement location):

	std::string listen(Rcpp::CharacterVector addrs) {
        int i;
        for (i=0; i<addrs.length(); i++) {
            auto addr = Rcpp::as<std::string>(addrs[i]);
            try {
                sock.bind(addr);
		return sock.get(zmq::sockopt::last_endpoint);
            } catch(zmq::error_t const &e) {
                if (errno != EADDRINUSE)
                    Rf_error(e.what());
            }
        }
        Rf_error("Could not bind port after ", i, " tries");
    }

@mschubert
Copy link
Owner

Great catch @luwidmer, that return statement indeed looks off!

Note that it's fixed in develop, but happy to merge a PR if you don't want to wait for that

@luwidmer
Copy link

luwidmer commented Mar 23, 2023

@mschubert awesome, thanks! I patched this in the CRAN version for me, I'd be happy to wait for develop / the next version to hit CRAN. I use clustermq a lot 👍

What do you think of using parallelly in host to pre-populate the list with some ports that (should) be free (barring a race condition where something else is grabbing a bunch of ports between the R part and the C++ call), and making the port range a package option as in #270 (comment) ?

@mschubert
Copy link
Owner

Making the port range configurable via an option makes sense, but I'm not sure I see the advantage of using parallely::freePort?

@luwidmer
Copy link

Indeed... I suppose one could also just pass the entire port range into the C++ without pre-scanning in R! Thanks!

@mschubert mschubert added the bug label Mar 24, 2023
@mschubert mschubert changed the title Sometimes clustermq fails to send jobs to the slurm grid (job nb is NA) Job port is set to NA after binding failure Mar 24, 2023
mschubert added a commit that referenced this issue Mar 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants