"localhost" does not always work for "multicore" #192

Closed
PhDyellow opened this issue Apr 24, 2020 · 4 comments

@PhDyellow
Contributor

super$initialize(..., node="localhost")

I am using clustermq v0.8.8.1 inside a container on an HPC, with drake.

I have not been able to use the PBS scheduler because I haven't figured out how to get "qsub" to work from the container; I can't just bind "qsub" in from the host system.

I would also prefer not to SSH from my laptop to the server for the long-running job, since I'm moving a lot of data around.

I would be happy to use the multicore scheduler, which will still let me use the 24-core nodes on the HPC.

However, while everything runs well on my laptop, when I move to the HPC, I get "Invalid Argument" from each worker when I try to call workers(5) on the HPC. After a bit of poking, it seems like "localhost" will not work on the HPC, while "127.0.0.1" does.

>     rzmq::connect.socket(socket, "udp://localhost:5000")
Protocol not supported
>     rzmq::connect.socket(socket, "tcp://localhost:5000")
Invalid argument
>     rzmq::connect.socket(socket, "tcp://127.0.0.1:5000")
> #success!
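
A quick way to see what the operating system resolver returns for "localhost" inside the container is sketched below. This is a Python standard-library stand-in, not part of clustermq or rzmq; the function name `resolve_host` and the port 5000 are illustrative assumptions. It only reports the OS view of the name, so ZeroMQ may still behave differently.

```python
import socket


def resolve_host(name, port=5000):
    """Return the distinct (family, address) pairs the OS resolver gives for name.

    Hypothetical helper for diagnosis only; not a clustermq or rzmq function.
    """
    try:
        infos = socket.getaddrinfo(name, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return []  # the name did not resolve at all
    labels = {socket.AF_INET: "IPv4", socket.AF_INET6: "IPv6"}
    seen = []
    for family, _type, _proto, _canon, sockaddr in infos:
        entry = (labels.get(family, str(family)), sockaddr[0])
        if entry not in seen:  # drop duplicate entries, keep resolver order
            seen.append(entry)
    return seen


if __name__ == "__main__":
    for name in ("localhost", "127.0.0.1"):
        print(name, "->", resolve_host(name))
```

Running this both on the laptop and inside the container on the HPC would show whether the two environments resolve "localhost" to different addresses or address families.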

Would there be any issues with changing the node argument to "127.0.0.1", or allowing it to be passed in as a parameter?
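
The idea of making the node a parameter with an IP default can be sketched as follows. This is a language-neutral illustration in Python, not clustermq's actual R code; `worker_endpoint` and its arguments are hypothetical names chosen for the example.

```python
def worker_endpoint(port, node="127.0.0.1"):
    """Build a ZeroMQ-style TCP endpoint string for a local worker.

    Hypothetical helper: `node` defaults to the IPv4 loopback address
    instead of "localhost", but callers can still override it.
    """
    if ":" in node and not node.startswith("["):
        node = f"[{node}]"  # IPv6 literals are bracketed in endpoint strings
    return f"tcp://{node}:{port}"


if __name__ == "__main__":
    print(worker_endpoint(5000))                    # default IPv4 loopback
    print(worker_endpoint(5000, node="localhost"))  # explicit host name
```

With a default like this, setups where "localhost" fails keep working, while anyone who needs the host name or another address can pass it explicitly.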

Some web pages that helped me debug the issue:

http://api.zeromq.org/2-1:zmq-tcp

https://stackoverflow.com/questions/6024003/why-doesnt-zeromq-work-on-localhost

@mschubert
Owner

mschubert commented Apr 25, 2020

Thanks for the report. I definitely want to support containers as much out of the box as possible, so maybe referencing the IP instead of the host name is a good idea here. The only drawback I see is that if a network is IPv6-only (quite unlikely), this would fail.

In the meantime, does localhost show up in your container's /etc/hosts?

If not, does adding localhost when starting your container make a difference?

docker run --add-host localhost:127.0.0.1

@PhDyellow
Contributor Author

I checked, and localhost does appear in /etc/hosts, within the container and on the host. It appears as

127.0.0.1 localhost.localdomain localhost

On my laptop, it appears as

127.0.0.1 localhost

I am using singularity, and I can't find an equivalent to --add-host in the documentation.

I did quickly check if I can specify an IPv6 address, and it doesn't work for tcp://[::1]:5000, even on my laptop.
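
To separate an operating-system limitation from a ZeroMQ setting, one could first check whether the IPv6 loopback address is usable at all. The sketch below uses only the Python standard library; `can_bind_loopback` is a hypothetical helper, not part of rzmq or clustermq. If the OS can bind `::1` but ZeroMQ still refuses it, one thing worth ruling out is that libzmq leaves IPv6 support off by default (the `ZMQ_IPV6` socket option); this thread does not confirm that as the cause, and whether rzmq exposes that option is not established here.

```python
import socket


def can_bind_loopback(address, family):
    """Return True if a TCP socket can be bound to `address` on an OS-chosen port."""
    try:
        with socket.socket(family, socket.SOCK_STREAM) as sock:
            sock.bind((address, 0))  # port 0 lets the OS pick a free port
        return True
    except OSError:
        # Covers a missing address family as well as a failed bind
        return False


if __name__ == "__main__":
    print("IPv4 loopback:", can_bind_loopback("127.0.0.1", socket.AF_INET))
    print("IPv6 loopback:", can_bind_loopback("::1", socket.AF_INET6))
```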

In the meantime, I have forked the repository and made the change I suggested. I won't submit a pull request for now, because it might not be a general solution.

@PhDyellow
Contributor Author

Another option: use clustermq:::host()

That succeeded on the HPC for me, but not on my laptop. The inverse problem.

@mschubert
Owner

This will be solved more generally with #170.

Thank you for reporting!
