
Add "connect to job" functionality, use that for CondorSpawner #200

Open
olifre wants to merge 10 commits into main

Conversation

@olifre (Contributor) commented Dec 18, 2020

This adds the possibility to start a connect_to_job background task on the hub at job start, which establishes connectivity to the actual single-user server. An example of this is condor_ssh_to_job for HTCondor batch systems.

Additionally, the background tasks are monitored:

  • for successful startup. The background task is given some time to successfully establish connectivity.
  • in poll() during job runtime

and if they fail, the job is terminated.

For the CondorSpawner, this leverages condor_ssh_to_job to forward the port of the single-user server to the hub, removing the need for direct connectivity from the hub to the execute nodes.
Notably, it allows using worker node setups where the workers do not allow direct inbound connectivity (e.g. NATted setups), requiring only outbound connectivity from worker to hub and delegating the other direction to the batch system.
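
To make the shape of this concrete, here is a minimal, purely illustrative sketch of how the new option might be used from jupyterhub_config.py. The option name connect_to_job_cmd and the {port}/{rport}/{host} placeholders follow the names used in this PR; the plain ssh command below is an assumption standing in for the condor_ssh_to_job default and is not copied from the patch:

```python
# jupyterhub_config.py -- illustrative sketch only, not the PR's actual default.
c.JupyterHub.spawner_class = "batchspawner.CondorSpawner"

# A long-running foreground command that forwards the notebook's remote port
# ({rport}) to a hub-local port ({port}); the hub starts it as a background
# task, waits for it to establish connectivity, and terminates the job if the
# task dies. The ssh invocation here is only an example of such a command.
c.CondorSpawner.connect_to_job_cmd = (
    "ssh -o ExitOnForwardFailure=yes -N -L {port}:localhost:{rport} {host}"
)
```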

@Hoeze (Contributor) commented Jan 31, 2021

Sounds like it would partially solve #169, right?
This + some client command to create new servers would allow spawning workers in arbitrary locations.

@olifre (Contributor, Author) commented Jan 31, 2021

> Sounds like it would partially solve #169, right?

It's surely an ingredient, but the focus here is different: it allows connecting to workers spawned by JupyterHub via a batch system command. In fact, if you spawn batch system workers in arbitrary locations and the batch system picks them up, you can also run notebooks there.

That's also one of the ways we'd like to use it: start an HTCondor startd, which registers with the batch system, and then let JupyterHub make use of it. This then works with arbitrary locations. The main ingredients here, though, are:

  • The batch system allowing workers to be added dynamically.
  • The batch system taking care of ensuring connectivity to that arbitrary location.

The necessary change in JupyterHub's batchspawner is that establishing connectivity can be delegated to an additional command (and in the case of HTCondor, condor_ssh_to_job is used to delegate that to the batch system). This is what this PR implements.

So in short, yes, it can be seen as an ingredient also for the issue you linked 😉.

@olifre (Contributor, Author) commented Mar 9, 2021

I pushed two more commits here which ensure no port collision happens for the forwarded port.
This is achieved by choosing a random port on the Hub and forwarding the notebook to that, instead of using the same port on both ends.

Again, this is implemented in a generalizable manner: it is optional and could also be used for other SSH tunneling approaches. However, I only added the condor_ssh_to_job connect command for the CondorSpawner, since this is what I can test with.
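
As a rough illustration of the idea (not the PR's actual code), the substitution could look like the sketch below. random_port from jupyterhub.utils is an existing helper; the function and argument names here are made up for the example:

```python
# Sketch of the port-substitution idea; names are hypothetical and do not
# mirror batchspawner's internals.
from jupyterhub.utils import random_port


def build_connect_cmd(template, remote_port, host):
    """Expand a connect_to_job_cmd template.

    If the template uses {rport}, pick a fresh local port on the hub for
    {port} and keep the notebook's original port as {rport}, so several
    forwarded notebooks cannot collide on the same hub-local port.
    """
    if "{rport}" in template:
        local_port = random_port()  # free port chosen locally on the hub
    else:
        local_port = remote_port    # no forwarding: same port on both ends
    cmd = template.format(port=local_port, rport=remote_port, host=host)
    return cmd, local_port
```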

@olifre (Contributor, Author) commented Mar 9, 2021

Rebased to current master to get the new CI working.

@olifre (Contributor, Author) commented May 13, 2021

I pushed three more commits with a small improvement and two further changes:

  • It is now possible to disable the connect_to_job feature for the CondorSpawner again, allowing a revert to the old behaviour if needed for any reason.
  • The string {host} is now replaced with self.ip in the connect_to_job_cmd. This may be useful if an implementation for other batch systems is added, or if somebody uses an additional service to proxy for these batch systems (see the sketch after this list).
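
For illustration, a hedged configuration sketch of these two knobs, assuming the option name from this PR; the commented-out spawner class and ssh command are hypothetical:

```python
# jupyterhub_config.py -- illustrative only.

# Revert to the old behaviour (direct hub-to-execute-node connectivity) by
# explicitly disabling the connect command for the CondorSpawner:
c.CondorSpawner.connect_to_job_cmd = ""

# For a hypothetical spawner on another batch system, {host} would be replaced
# with the spawner's self.ip, e.g. to tunnel via the node the job landed on:
# c.SomeBatchSpawner.connect_to_job_cmd = (
#     "ssh -N -L {port}:localhost:{rport} {host}"
# )
```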

I see that this branch now has conflicts after 1decdf2, which caused a lot of (useful) reformatting all over the project. I am not (yet) resolving these, but I am certainly willing to do so once somebody steps up to start a review (to minimize effort, I'd prefer to resolve conflicts only once).

olifre added 10 commits July 11, 2022 22:54
  • This adds the possibility to start a "connect_to_job" background task on the hub at job start, which establishes connectivity to the actual single-user server. An example of this can be "condor_ssh_to_job" for HTCondor batch systems. Additionally, the background tasks are monitored for successful startup (the background task is given some time to successfully establish connectivity) and in poll() during job runtime; if they fail, the job is terminated.
  • This leverages condor_ssh_to_job to forward the port of the single-user server to the hub, removing the need for direct connectivity from the hub to the execute nodes.
  • This allows using {rport} inside the connect_to_job_cmd. If this is done, {port} is set to a random port chosen locally on the Hub, and {rport} is set to the original remote port. This is useful e.g. for SSH port forwarding. If this is used, the {rport} of the notebook is saved into a new class variable in case it is needed again.
  • This uses the functionality of choosing a random local port on the hub to forward the notebook port to. It ensures that no port collisions occur between different forwarded notebooks.
  • Wrapping the coroutines into futures first allows checking the state of the futures directly (see the sketch after this list).
  • For implementations other than condor_ssh_to_job, it might be useful to use the hostname on which the notebook was spawned to proxy it.
  • If connect_to_job_cmd is explicitly set to an empty string, CondorSpawner will not override the hostname with localhost, allowing a revert to the old behaviour (assuming direct connectivity).
  • …ing. With this, the first database commit will already contain the forwarded port if connect_to_job is used, and the log will show the correct port number.
  • This is required since data may change after the connect_to_job function.
  • The background command cannot cleanly be simulated to stay running.
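
A rough sketch of the futures-based monitoring mentioned in the commit list above; the coroutine and function names are invented for illustration and are not batchspawner's:

```python
# Hypothetical sketch: wrap the connect coroutine in a future so its state can
# be checked right after startup and again later from poll().
import asyncio


async def run_connect_to_job(connect_coro, startup_grace=5.0):
    task = asyncio.ensure_future(connect_coro)

    # Give the background command some time to establish connectivity; if the
    # future is already done, the command exited early and the start failed.
    await asyncio.sleep(startup_grace)
    if task.done():
        raise RuntimeError(f"connect_to_job exited early: {task.exception()}")

    # poll() can later check task.done() and terminate the job if it is.
    return task
```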
@olifre (Contributor, Author) commented Jul 11, 2022

Since 1.5 years have passed, I've taken the opportunity to:

  • Rebase this onto current master. Merging automatically was impossible due to the large reformatting, so I re-applied all changes one by one.
  • Adapt to changes in master, e.g. the dropped support for JupyterHub <= 0.7.
  • Add some fixes and disable the tests for connect_to_job, since I did not have a good idea of how to emulate background commands and port forwarding in the test suite.

After this, the PR applies cleanly on master again. Furthermore, CI runs fine apart from Python 3.5 tests, which appear to be broken by an unrelated issue.

Hope this helps; we are still using this in production quite successfully 👍.
