Select singleserver port number on remote host #58

Merged
5 commits merged into jupyterhub:master on Nov 9, 2018

Conversation

cmd-ntrf
Contributor

Context

We provide compute nodes on our GPU cluster for a graduate deep learning course through JupyterHub and BatchSpawner. The nodes are available two hours a week under a reservation for the course (a lab period). Each node has 8 GPUs, each student is allocated one GPU, and around 30 students run notebooks at the same time. Multiple jupyterhub-singleserver instances can therefore run on a single compute node. Until last week, this worked flawlessly.

Problem

During last week's lab period, two users reported being unable to connect to their notebooks. After inspecting their notebook logs, I found this message:

[I 2018-02-16 10:26:57.826 SingleUserNotebookApp notebookapp:1191] The port 58824 is already in use, trying another port.
[C 2018-02-16 10:26:57.826 SingleUserNotebookApp notebookapp:1203] ERROR: the notebook server could not be started because no available port could be found.

The users could not access their notebooks because the singleserver could not start, and the singleserver could not start because it was assigned a port number already in use on the compute node, probably by another student's singleserver.

The port generation for the singleserver is done on the BatchSpawner side (batchspawner.py:272-278). The function used to generate the random port is jupyterhub.utils.random_port. Its content is reproduced here to help understand the issue:

def random_port():
    """Get a single random port."""
    sock = socket.socket()
    sock.bind(('', 0))
    port = sock.getsockname()[1]
    sock.close()
    return port

random_port creates a socket locally, on the Hub/Spawner side, retrieves the port number, and closes the socket. Once the function has closed the socket, the port number is available again, since nothing on the Hub side is bound to it, and random_port could return the same port number when called again. The randomness of the function depends on the kernel's handling of ephemeral port numbers. Furthermore, the function only reflects local ports: there is no guarantee that an ephemeral port available on the Hub will be available on the compute node, and this is the main issue with using this function to set the remote singleserver port.
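
To make the gap concrete, here is a small illustration (not part of the PR or of JupyterHub): only code running on the compute node can tell whether a given port is actually free there, and even that answer is only valid while the socket stays bound.

import socket

def port_is_free_here(port):
    """Return True if `port` can currently be bound on *this* host."""
    with socket.socket() as sock:
        try:
            sock.bind(('', port))   # succeeds only if nothing on this host holds the port
        except OSError:
            return False
        return True                 # the port may be taken again as soon as the socket closes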

Solution

Our team brainstormed possible solutions aimed at limiting the risk of port number collisions: hashing the job id, widening the range from ephemeral ports to all user-available ports, etc. They all had the same problem: they meant deciding the port number on the Hub side, thus having no guarantee that this port would be available on the compute node and risking a job failure. We concluded that the singleserver port has to be selected remotely and sent back to the Hub/Spawner.

This PR fixes the port generation issue by letting the port number be generated by the singleserver and sent back to BatchSpawner through a BSD socket. The BSD socket address and port are provided to the singleserver by command-line arguments in the job script. To add the command-line arguments and the port syncing, this PR implements a batchspawner-singleserver script and app that inherit from SingleUserNotebookApp.

The port number is received by the spawner on the created socket before BatchSpawner.start returns.
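
For readers unfamiliar with the mechanism, a minimal sketch of the exchange is shown below. The function names are illustrative, not the PR's actual code; the real logic lives in the batchspawner-singleserver app and the spawner.

import socket

# Hub/Spawner side (sketch): open a listener before submitting the job and
# pass its host and port to the job script as command-line arguments.
def open_port_listener():
    listener = socket.socket()
    listener.bind(('', 0))              # let the kernel pick a free local port
    listener.listen(1)
    return listener, listener.getsockname()[1]

def wait_for_remote_port(listener, timeout=300):
    """Block until the remote singleserver connects and reports its port."""
    listener.settimeout(timeout)
    conn, _ = listener.accept()
    with conn:
        return int(conn.recv(16).decode())

# Compute-node side (sketch): once the notebook has bound a free port on the
# node, connect back to the address received on the command line and report it.
def report_port(hub_host, hub_port, notebook_port):
    with socket.create_connection((hub_host, hub_port)) as conn:
        conn.sendall(str(notebook_port).encode())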

This solution has proven effective at eliminating our port number collisions.

Issue with PR solution

Using a socket to communicate between the Spawner/Hub and the notebook singleserver is not very Jupyter-like. It works for our use case, but it could be problematic in an environment with a firewall between the compute nodes and the Hub, since a random port number is used for the communication between the compute node and the Spawner.

There is also no validation that the data received by the Spawner is truly a port number sent by the right compute node.

Ideally, I think the selected port should be communicated back to the Hub through the REST API, but I am uncertain what that implies and how to properly implement it. Therefore, I think this PR should be accepted as is, but treated as the beginning of a solution to the aforementioned problem. I am willing to implement the right solution once we have converged on the proper way to do it.

@rcthomas
Contributor

Hi @cmd-ntrf, have you tried setting up a hub-managed service to extend the API? We've done some experiments around that because we have the same issue with "eager" port selection. I think the Spawner has to be customized to get the port info back to the Hub from the service.

@mbmilligan
Member

Hi, thanks for this work!

I tend to agree with you that opening a socket isn't very Jupyter-esque. The "Jupyter" solution would probably be to create a web API callback, but that might be a bit of overkill here. I see two easy and batchspawner-esque possible solutions:

A) Write our own version of random_port() that doesn't depend on local kernel behavior and is less likely to produce collisions, accepting that users will occasionally have to retry to get a usable port number (see the sketch after these two options).

B) Extend the read_job_state() machinery to interface with some logic in the job submission script. Then the port number can be read during polling when the job starts up, before the new route gets sent to the proxy API.
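
Purely for illustration, option (A) could look something like the sketch below; the range bounds are arbitrary, and collisions on the compute node remain possible, which is the retry trade-off mentioned above.

import random

def random_port(low=20000, high=60000):
    """Pick a port without consulting the local kernel's ephemeral range.

    Drawing from a wide, fixed range makes two Hub-side draws unlikely to
    coincide, but the port may still be busy on the compute node, in which
    case the user has to retry.
    """
    return random.randint(low, high)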

Either way, I would ask for your patience in not accepting this PR quite yet. Various Jupyter people have requested for some time that we put out a proper BatchSpawner release, so I think we need to get that sorted out before developing new functionality.

@cmd-ntrf force-pushed the remote_port branch 2 times, most recently from 61d0ab0 to 1c6834d on February 19, 2018 at 20:51
@cmd-ntrf
Contributor Author

Hi @rcthomas and @mbmilligan,

Thanks for the quick response. Following @rcthomas's suggestion, I dug into how the port configuration could be handled using a REST API, and before @mbmilligan's reply I was already down the rabbit hole... so I wrote an API handler for BatchSpawner.

I did not use a hub-managed service. I created an APIHandler that waits for a POST of the notebook port number and added it to JupyterHub's handlers list. To my surprise, it is actually simpler than my first socket solution. It also has the advantage of being somewhat secure, as the POST can only be done if the user is authenticated.

I understand the need to freeze the code for a release before adding a new feature. On my side, I will deploy the API handler solution and see how it goes. I will keep this thread updated if I face any issues.
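
For context, a minimal sketch of what such a handler might look like is shown below, assuming JupyterHub's APIHandler base class and default_handlers list; the class name and URL path are illustrative and may differ from the code actually added by this PR.

import json

from tornado import web
from jupyterhub.apihandlers import APIHandler, default_handlers

class PortReportAPIHandler(APIHandler):
    """Accept a POST from the single-user server reporting the port it bound."""

    @web.authenticated
    def post(self):
        user = self.get_current_user()
        data = self.get_json_body()
        user.spawner.current_port = int(data.get('port', 0))
        self.finish(json.dumps({'message': 'port received'}))

# Register the handler with the Hub (illustrative URL path).
default_handlers.append((r'/api/batchspawner', PortReportAPIHandler))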

@mbmilligan
Member

Hi @cmd-ntrf @rkdarst -

After some good conversations at the PEARC conference, I think the consensus is that we should go ahead and integrate this into BatchSpawner for now, and in parallel pursue getting an API added to JupyterHub core.

I think it's also about time to put together another release, so let's do that and tag this PR as one that we want to get into good shape for that.

@cmd-ntrf
Contributor Author

Good!

I updated the PR last week to integrate the most recent changes made to batchspawner. However, the tests are still failing. I am willing to help fix them, but I will need some guidance.

@cmd-ntrf changed the title from "Select singleserver port number on remote host" to "[WIP] Select singleserver port number on remote host" on Aug 13, 2018
@cmd-ntrf
Contributor Author

cmd-ntrf commented Aug 13, 2018

I have updated this PR to allow the user to set the port value instead of forcing it to be random. This should also work with the port range PR.
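
For example, a hypothetical jupyterhub_config.py snippet; whether a fixed port is appropriate depends on how many servers share a compute node, and 0 keeps the random behaviour:

# jupyterhub_config.py (illustrative)
c.Spawner.port = 8888   # fixed port for single-user servers; leave at 0 for random selection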

I have also updated the tests to fix the spawner port value.

Regarding tests:

  • JupyterHub 0.9.x: every test now passes.
  • JupyterHub 0.8.1: every test fails because of a bug in JupyterHub 0.8.x when the port number is 0 and the spawner server is None, fixed by @minrk in commit jupyterhub/jupyterhub@e3fd4ad. By adding a Server object during spawner initialisation, I was able to bypass the bug in JupyterHub 0.8.x. Every test passes.
  • JupyterHub 0.7.1: every test fails because the Spawner does not have a server object in JupyterHub 0.7. I am not sure this is worth fixing, since the batchspawner dev changelog states that the minimum requirement is now JupyterHub 0.8.1. It is an easy fix if we want to maintain support for JupyterHub 0.7.x; it is fixed by commit 410f7d9, and every test passes.

@cmd-ntrf changed the title from "[WIP] Select singleserver port number on remote host" to "Select singleserver port number on remote host" on Aug 14, 2018
Avoid error when notebook is not installed with JupyterHub
@mbmilligan
Member

Now that we have gotten the latest round of testing issues resolved, I can merge this into master. Please note that before the next release we need some documentation added to the README or elsewhere. The fact that users need to install a different jupyterhub-singleuser script to use this feature will not be obvious otherwise.

@mbmilligan mbmilligan merged commit 383e8a3 into jupyterhub:master Nov 9, 2018
user = self.get_current_user()            # the authenticated Hub user making the request
data = self.get_json_body()               # JSON body posted by the single-user server
port = int(data.get('port', 0))           # port the remote notebook actually bound
user.spawner.current_port = port          # record it on that user's spawner
Contributor

When using wrapspawner, this fails because user.spawner.current_port needs to be proxied to user.spawner.child_spawner.current_port. I made a quick fix to wrapspawner to proxy it there, but we should ask: where is the best place to do this?

When I made wrapspawner proxy all attributes with getattr, it failed in some other way that I haven't understood yet.

The main question is: should wrapspawner or batchspawner be responsible for this? I would think wrapspawner, but then how do we avoid having to special-case everything needed? I'll return to this later.
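
As a hypothetical illustration (not a tested fix), the proxying could be done in the wrapper with a property, assuming the wrapspawner package's WrapSpawner class:

from wrapspawner import WrapSpawner

class PortProxyingWrapSpawner(WrapSpawner):
    """Forward current_port to the wrapped child spawner instead of the wrapper."""

    @property
    def current_port(self):
        return self.child_spawner.current_port

    @current_port.setter
    def current_port(self, port):
        self.child_spawner.current_port = port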

@techie879

We have encountered the same issue you describe. I am no expert at this, but I know we are running the latest JupyterHub version. Are your fixes incorporated in the latest version, or do we need to pull the changes separately and merge them into the latest version? Can you please take a minute to help us with some pointers?
Thanks
