Select singleserver port number on remote host #58

Merged
5 commits merged into jupyterhub:master on Nov 9, 2018

Conversation

cmd-ntrf
Contributor

Context

We provide compute nodes on our GPU cluster for a graduate deep learning course through JupyterHub and BatchSpawner. The nodes are available two hours a week under a reservation for the course (a lab period). Each node has 8 GPUs, each student is allocated one GPU, and around 30 students run notebooks at the same time. Multiple jupyterhub-singleserver instances can therefore run on a single compute node. Until last week, this worked flawlessly.

Problem

During last week's lab period, two users reported being unable to connect to their notebooks. After inspecting their notebook logs, I found this message:

[I 2018-02-16 10:26:57.826 SingleUserNotebookApp notebookapp:1191] The port 58824 is already in use, trying another port.
[C 2018-02-16 10:26:57.826 SingleUserNotebookApp notebookapp:1203] ERROR: the notebook server could not be started because no available port could be found.

The users could not access their notebooks because the singleserver could not start, and the singleserver could not start because it was assigned a port number already in use on the compute node, probably by another student's singleserver.

The port generation for the singleserver is done on the BatchSpawner side (batchspawner.py:272-278). The function used to generate the random port is jupyterhub.utils.random_port. Its content is reproduced here to help understand the issue:

def random_port():
    """Get a single random port."""
    sock = socket.socket()
    sock.bind(('', 0))
    port = sock.getsockname()[1]
    sock.close()
    return port

random_port creates a socket locally, on the Hub/Spawner side, retrieves the port number, and closes the socket. Once the function has closed the socket, the port number is available again, since nothing on the Hub side is bound to it, and random_port could return the same port number when called again. The randomness of the function depends on the kernel's handling of ephemeral port numbers. Furthermore, the function only reflects local ports: there is no guarantee that an ephemeral port available on the Hub will be available on the compute node, and this is the main issue with using this function to set the remote singleserver port.
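
To make the gap concrete, here is a small illustration (not part of the PR or of JupyterHub): only code running on the compute node can tell whether a given port is actually free there, and even that answer is only valid while the socket stays bound.

import socket

def port_is_free_here(port):
    """Return True if `port` can currently be bound on *this* host."""
    with socket.socket() as sock:
        try:
            sock.bind(('', port))   # succeeds only if nothing on this host holds the port
        except OSError:
            return False
        return True                 # the port may be taken again as soon as the socket closes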

Solution

Our team brainstormed possible solutions aimed at limiting the risk of port number collisions: hashing the job id, widening the range from ephemeral ports to all user-available ports, etc. They all had the same problem: they meant deciding the port number on the Hub side, thus having no guarantee that this port would be available on the compute node and risking a job failure. We concluded that the singleserver port has to be selected remotely and sent back to the Hub/Spawner.

This PR fixes the port generation issue by letting the port number be generated by the singleserver and sent back to BatchSpawner through a BSD socket. The BSD socket address and port are provided to the singleserver by command-line arguments in the job script. To add the command-line arguments and the port syncing, this PR implements a batchspawner-singleserver script and app that inherit from SingleUserNotebookApp.

The port number is received by the spawner on the created socket before BatchSpawner.start returns.
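
For readers unfamiliar with the mechanism, a minimal sketch of the exchange is shown below. The function names are illustrative, not the PR's actual code; the real logic lives in the batchspawner-singleserver app and the spawner.

import socket

# Hub/Spawner side (sketch): open a listener before submitting the job and
# pass its host and port to the job script as command-line arguments.
def open_port_listener():
    listener = socket.socket()
    listener.bind(('', 0))              # let the kernel pick a free local port
    listener.listen(1)
    return listener, listener.getsockname()[1]

def wait_for_remote_port(listener, timeout=300):
    """Block until the remote singleserver connects and reports its port."""
    listener.settimeout(timeout)
    conn, _ = listener.accept()
    with conn:
        return int(conn.recv(16).decode())

# Compute-node side (sketch): once the notebook has bound a free port on the
# node, connect back to the address received on the command line and report it.
def report_port(hub_host, hub_port, notebook_port):
    with socket.create_connection((hub_host, hub_port)) as conn:
        conn.sendall(str(notebook_port).encode())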

This solution has proven effective at eliminating our port number collisions.

Issue with PR solution

Using a socket to communicate between the Spawner/Hub and the notebook singleserver is not very Jupyter-like. It works for our use case, but it could be problematic in an environment with a firewall between the compute nodes and the Hub, since a random port number is used for the communication between the compute node and the Spawner.

There is also no validation that the data received by the Spawner is truly a port number sent by the right compute node.

Ideally, I think the selected port should be communicated back to the Hub through the REST API, but I am uncertain what that implies and how to properly implement it. Therefore, I think this PR should be accepted as is, but treated as the beginning of a solution to the aforementioned problem. I am willing to implement the right solution once we have converged on the proper way to do it.

@rcthomas
Contributor

Hi @cmd-ntrf, have you tried setting up a hub-managed service to extend the API? We've done some experiments around that because we have the same issue with "eager" port selection. I think the Spawner has to be customized to get the port info back to the Hub from the service.

@mbmilligan
Member

Hi, thanks for this work!

I tend to agree with you that opening a socket isn't very Jupyter-esque. The "Jupyter" solution would probably be to create a web API callback, but that might be a bit of overkill here. I see two easy and batchspawner-esque possible solutions:

A) Write our own version of random_port() that doesn't depend on local kernel behavior and is less likely to produce collisions, accepting that users will occasionally have to retry to get a usable port number (see the sketch after these two options).

B) Extend the read_job_state() machinery to interface with some logic in the job submission script. Then the port number can be read during polling when the job starts up, before the new route gets sent to the proxy API.
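
Purely for illustration, option (A) could look something like the sketch below; the range bounds are arbitrary, and collisions on the compute node remain possible, which is the retry trade-off mentioned above.

import random

def random_port(low=20000, high=60000):
    """Pick a port without consulting the local kernel's ephemeral range.

    Drawing from a wide, fixed range makes two Hub-side draws unlikely to
    coincide, but the port may still be busy on the compute node, in which
    case the user has to retry.
    """
    return random.randint(low, high)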

Either way, I would ask for your patience in not accepting this PR quite yet. Various Jupyter people have requested for some time that we put out a proper BatchSpawner release, so I think we need to get that sorted out before developing new functionality.

@cmd-ntrf force-pushed the remote_port branch 2 times, most recently from 61d0ab0 to 1c6834d on February 19, 2018 at 20:51
@cmd-ntrf
Contributor Author

Hi @rcthomas and @mbmilligan,

Thanks for the quick response. Following @rcthomas's suggestion, I dug into how the port configuration could be handled using a REST API, and before @mbmilligan's reply I was already down the rabbit hole... so I wrote an API handler for BatchSpawner.

I did not use a hub-managed service. I created an APIHandler that waits for a POST of the notebook port number and added it to JupyterHub's handlers list. To my surprise, it is actually simpler than my first socket solution. It also has the advantage of being somewhat secure, as the POST can only be done if the user is authenticated.

I understand the need to freeze the code for a release before adding a new feature. On my side, I will deploy the API handler solution and see how it goes. I will keep this thread updated if I face any issues.
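
For context, a minimal sketch of what such a handler might look like is shown below, assuming JupyterHub's APIHandler base class and default_handlers list; the class name and URL path are illustrative and may differ from the code actually added by this PR.

import json

from tornado import web
from jupyterhub.apihandlers import APIHandler, default_handlers

class PortReportAPIHandler(APIHandler):
    """Accept a POST from the single-user server reporting the port it bound."""

    @web.authenticated
    def post(self):
        user = self.get_current_user()
        data = self.get_json_body()
        user.spawner.current_port = int(data.get('port', 0))
        self.finish(json.dumps({'message': 'port received'}))

# Register the handler with the Hub (illustrative URL path).
default_handlers.append((r'/api/batchspawner', PortReportAPIHandler))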

@mbmilligan
Member

Hi @cmd-ntrf @rkdarst -

After some good conversations at the PEARC conference, I think the consensus is that we should go ahead and integrate this into BatchSpawner for now, and in parallel pursue getting an API added to JupyterHub core.

I think it's also about time to put together another release, so let's do that and tag this PR as one that we want to get into good shape for that.

@cmd-ntrf
Contributor Author

Good!

I updated the PR last week to integrate the most recent changes made to batchspawner. However, the tests are still failing. I am willing to help fix them, but I will need some guidance.

@cmd-ntrf changed the title from "Select singleserver port number on remote host" to "[WIP] Select singleserver port number on remote host" on Aug 13, 2018
@cmd-ntrf
Contributor Author

cmd-ntrf commented Aug 13, 2018

I have updated this PR to allow the user to set the port value instead of forcing it to be random. This should also work with the port range PR.
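
For example, a hypothetical jupyterhub_config.py snippet; whether a fixed port is appropriate depends on how many servers share a compute node, and 0 keeps the random behaviour:

# jupyterhub_config.py (illustrative)
c.Spawner.port = 8888   # fixed port for single-user servers; leave at 0 for random selection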

I have also updated the tests to fix the spawner port value.

Regarding tests:

  • JupyterHub 0.9.x: every test now passes.
  • JupyterHub 0.8.1: every test fails because of a bug in JupyterHub 0.8.x when the port number is 0 and the spawner server is None, fixed by @minrk in commit jupyterhub/jupyterhub@e3fd4ad. By adding a Server object during spawner initialisation, I was able to bypass the bug in JupyterHub 0.8.x. Every test passes.
  • JupyterHub 0.7.1: every test fails because the Spawner does not have a server object in JupyterHub 0.7. I am not sure this is worth fixing, since the batchspawner dev changelog states that the minimum requirement is now JupyterHub 0.8.1. It is an easy fix if we want to maintain support for JupyterHub 0.7.x; it is fixed by commit 410f7d9, and every test passes.

@cmd-ntrf changed the title from "[WIP] Select singleserver port number on remote host" to "Select singleserver port number on remote host" on Aug 14, 2018
Avoid error when notebook is not installed with JupyterHub
@mbmilligan
Member

Now that we have gotten the latest round of testing issues resolved, I can merge this into master. Please note that before the next release we need some documentation added to the README or elsewhere. The fact that users need to install a different jupyterhub-singleuser script to use this feature will not be obvious otherwise.

@mbmilligan mbmilligan merged commit 383e8a3 into jupyterhub:master Nov 9, 2018
user = self.get_current_user()            # the authenticated Hub user making the request
data = self.get_json_body()               # JSON body posted by the single-user server
port = int(data.get('port', 0))           # port the remote notebook actually bound
user.spawner.current_port = port          # record it on that user's spawner
Contributor

When using wrapspawner, this fails because user.spawner.current_port needs to be proxied to user.spawner.child_spawner.current_port. I made a quick fix to wrapspawner to proxy it there, but we should ask: where is the best place to do this?

When I made wrapspawner proxy all attributes with getattr, it failed in some other way that I haven't understood yet.

The main question is: should wrapspawner or batchspawner be responsible for this? I would think wrapspawner, but then how do we avoid having to special-case everything needed? I'll return to this later.
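
As a hypothetical illustration (not a tested fix), the proxying could be done in the wrapper with a property, assuming the wrapspawner package's WrapSpawner class:

from wrapspawner import WrapSpawner

class PortProxyingWrapSpawner(WrapSpawner):
    """Forward current_port to the wrapped child spawner instead of the wrapper."""

    @property
    def current_port(self):
        return self.child_spawner.current_port

    @current_port.setter
    def current_port(self, port):
        self.child_spawner.current_port = port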

@techie879

We have encountered the same issue you describe. I am no expert at this, but I know we are running the latest JupyterHub version. Are your fixes incorporated in the latest version, or do we need to pull the changes separately and merge them into the latest version? Can you please take a minute to help us with some pointers?
Thanks
