Job port is set to NA after binding failure #270
Is this from CRAN or GitHub? It looks like a port binding failure that returns NA instead of retrying or raising an error. I'll need to look at the exact implementation (i.e., version) to see how this is possible. |
I have the same issue with an SGE scheduler, although I do not have to wait long for the session to stop. I usually re-send the job again and again until it works. Similar to @statquant, I can't reproduce it, but it tends to occur when I am sending big data. Also thanks for the great package :) |
I also just ran into this with LSF, R 4.1.0 and clustermq_0.8.95.1 off CRAN - might be a good idea to check if the port is NA and fail if it is? |
Ran into this recently again... pretty hard to reproduce. @mschubert since the
after line 33 in qsys.r on master (https://github.com/mschubert/clustermq/blob/master/R/qsys.r#L33) or after line 21 on develop (https://github.com/mschubert/clustermq/blob/develop/R/qsys.r#L20). This would prevent R from hanging and hopefully give us an error message that gets closer to the root cause. Regarding the port selection: what are your thoughts on making this configurable? At the moment the range 6000:9999 is hard-coded in util.r on master (https://github.com/mschubert/clustermq/blob/master/R/util.r#L35). Sampling ports that are currently free (e.g. using parallelly::freePort, see https://parallelly.futureverse.org/reference/freePort.html) might be more robust than sampling 100 ports from a fixed range without checking whether they are free. |
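To make the "sample only free ports" idea concrete, here is a minimal C++ sketch, assuming a POSIX system with BSD sockets. It is not clustermq or parallelly code: `port_is_free` and `sample_free_ports` are hypothetical names, and the probe simply tries to bind a throwaway TCP socket to each candidate.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <algorithm>
#include <cassert>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical helper: true if a TCP socket can be bound to `port` on all
// interfaces right now. The probe socket is closed again before returning.
bool port_is_free(int port) {
    int fd = ::socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return false;
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(static_cast<std::uint16_t>(port));
    bool ok = ::bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) == 0;
    ::close(fd);
    return ok;
}

// Hypothetical helper: shuffle the inclusive range [lo, hi] and keep up to
// `n` ports that pass the probe. May return fewer than `n`, or none.
std::vector<int> sample_free_ports(int lo, int hi, std::size_t n, unsigned seed) {
    assert(lo > 0 && lo <= hi && hi <= 65535);
    std::vector<int> candidates(static_cast<std::size_t>(hi - lo + 1));
    std::iota(candidates.begin(), candidates.end(), lo);
    std::mt19937 rng(seed);
    std::shuffle(candidates.begin(), candidates.end(), rng);

    std::vector<int> free_ports;
    for (int port : candidates) {
        if (free_ports.size() == n) break;
        if (port_is_free(port)) free_ports.push_back(port);
    }
    return free_ports;
}
```

A call such as `sample_free_ports(6000, 9999, 100, seed)` would mirror the range and sample size mentioned above. Note that the probe is only a snapshot: another process can still take a port between the check and the real bind, so the caller still has to handle a bind failure.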
I can confirm that port NA is generated when all sampled ports are in use (e.g. test by overriding the host method in the package):
with the modified error handling yields
This apparently also happens when the first port in the list is in use, the others are not checked (!):
|
@mschubert a suggestion for
|
Is it possible that https://github.com/mschubert/clustermq/blob/master/src/CMQMaster.cpp#L19 has a bug:
Shouldn't this read as follows (note the return statement location):
|
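For readers without the original snippets, the effect of a misplaced `return` in a bind loop can be sketched as follows. This is a hypothetical illustration, not the code from CMQMaster.cpp: `TryBind`, `bind_first_only`, and `bind_any` are made-up names, and the actual socket bind is replaced by a callback so the example stays self-contained.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a real socket bind: returns true on success.
using TryBind = std::function<bool(int)>;

// Buggy shape: the failure return sits inside the loop body, so the loop
// gives up after the first candidate and never checks the remaining ports.
int bind_first_only(const std::vector<int>& ports, const TryBind& try_bind) {
    assert(static_cast<bool>(try_bind));
    for (int port : ports) {
        if (try_bind(port))
            return port;
        return -1;  // misplaced: exits on the first failed attempt
    }
    return -1;
}

// Fixed shape: only a successful bind returns from inside the loop; the
// failure path is reached once every candidate has been tried.
int bind_any(const std::vector<int>& ports, const TryBind& try_bind) {
    assert(static_cast<bool>(try_bind));
    for (int port : ports) {
        if (try_bind(port))
            return port;
    }
    throw std::runtime_error("could not bind to any of " +
                             std::to_string(ports.size()) + " candidate ports");
}
```

With candidates 6000, 6001, 6002 and port 6000 busy, the first variant reports failure while the second binds 6001. Raising an error when all candidates fail, rather than returning a sentinel, is what keeps a missing port from propagating silently.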
@mschubert awesome, thanks! I patched this in the CRAN version for me, I'd be happy to wait for develop / the next version to hit CRAN. I use clustermq a lot 👍 What do you think of using parallelly in |
Making the port range configurable via an option makes sense, but I'm not sure I see the advantage of using |
Indeed... I suppose one could also just pass the entire port range into the C++ without pre-scanning in R! Thanks! |
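The "pass the whole range, no pre-scan" idea could look roughly like the sketch below. It is an assumption-laden illustration rather than clustermq's implementation: `TryBind`, `BindResult`, and `bind_in_range` are hypothetical names, and the real bind call is again abstracted as a callback. The bind attempt itself serves as the availability check, so there is no gap between checking a port and using it.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical stand-in for the real bind call; returns true on success.
using TryBind = std::function<bool(int)>;

struct BindResult {
    int port;      // bound port, or -1 if every candidate failed
    int attempts;  // number of candidates tried, including the winner
};

// Walks the inclusive, caller-supplied range [lo, hi] in random order and
// stops at the first port that binds.
BindResult bind_in_range(int lo, int hi, const TryBind& try_bind, unsigned seed) {
    assert(lo <= hi);
    std::vector<int> ports(static_cast<std::size_t>(hi - lo + 1));
    std::iota(ports.begin(), ports.end(), lo);
    std::mt19937 rng(seed);
    std::shuffle(ports.begin(), ports.end(), rng);

    int attempts = 0;
    for (int port : ports) {
        ++attempts;
        if (try_bind(port)) return BindResult{port, attempts};
    }
    return BindResult{-1, attempts};
}
```

Because `lo` and `hi` are parameters, the range can come from a user option instead of being hard-coded, and the caller can turn a `port` of -1 into an explicit error.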
Hello, for reasons that elude me and that I cannot reproduce (yeah, that's not much to go on), sometimes I cannot send jobs to my Slurm grid.
What happens is I see
and then R just gets stuck until I send an interrupt in the terminal. Then I have to wait a really long time to get an error and the terminal back (which might be a bug in itself?).
Given this
NA
I was wondering whether something can be done about this. Note that when I rerun the same command later on, everything works well. Many thanks for the package, it's great, and sorry for this unhelpful issue.