
Add "connect to job" functionality, use that for CondorSpawner #200

Open
olifre wants to merge 10 commits into main

Conversation

@olifre (Contributor) commented Dec 18, 2020

This adds the possibility to start a connect_to_job background task on the hub at job start, which establishes connectivity to the actual single-user server. An example of this is condor_ssh_to_job for HTCondor batch systems.

Additionally, the background tasks are monitored:

  • for successful startup. The background task is given some time to successfully establish connectivity.
  • in poll() during job runtime

and if they fail, the job is terminated.

For the CondorSpawner, this leverages condor_ssh_to_job to forward the port of the single-user server to the hub, removing the need for direct connectivity from the hub to the execute nodes.
Notably, it allows using worker node setups where the workers do not allow direct inbound connectivity (e.g. NATted setups), requiring only outbound connectivity from worker to hub and delegating the other direction to the batch system.
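
To make the shape of this concrete, here is a minimal, purely illustrative sketch of how the new option might be used from jupyterhub_config.py. The option name connect_to_job_cmd and the {port}/{rport}/{host} placeholders follow the names used in this PR; the plain ssh command below is an assumption standing in for the condor_ssh_to_job default and is not copied from the patch:

```python
# jupyterhub_config.py -- illustrative sketch only, not the PR's actual default.
c.JupyterHub.spawner_class = "batchspawner.CondorSpawner"

# A long-running foreground command that forwards the notebook's remote port
# ({rport}) to a hub-local port ({port}); the hub starts it as a background
# task, waits for it to establish connectivity, and terminates the job if the
# task dies. The ssh invocation here is only an example of such a command.
c.CondorSpawner.connect_to_job_cmd = (
    "ssh -o ExitOnForwardFailure=yes -N -L {port}:localhost:{rport} {host}"
)
```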

@Hoeze (Contributor) commented Jan 31, 2021

Sounds like it would partially solve #169, right?
This + some client command to create new servers would allow spawning workers in arbitrary locations.

@olifre (Contributor, Author) commented Jan 31, 2021

> Sounds like it would partially solve #169, right?

It's surely an ingredient, but the focus here is different: it allows connecting to workers spawned by JupyterHub via a batch system command. In fact, if you spawn batch system workers in arbitrary locations and the batch system picks them up, you can also run notebooks there.

That's also one of the ways we'd like to use it: start an HTCondor startd, which registers with the batch system, and then let JupyterHub make use of it. This then works with arbitrary locations. The main ingredients here, though, are:

  • The batch system allowing workers to be added dynamically.
  • The batch system taking care of ensuring connectivity to that arbitrary location.

The necessary change in JupyterHub's batchspawner is that establishing connectivity can be delegated to an additional command (and in the case of HTCondor, condor_ssh_to_job is used to delegate that to the batch system). This is what this PR implements.

So in short, yes, it can be seen as an ingredient also for the issue you linked 😉.

@olifre (Contributor, Author) commented Mar 9, 2021

I pushed two more commits here which ensure no port collision happens for the forwarded port.
This is achieved by choosing a random port on the Hub and forwarding the notebook to that, instead of using the same port on both ends.

Again, this is implemented in a generalizable manner: it is optional and could also be used for other SSH tunneling approaches. However, I only added the condor_ssh_to_job connect command for the CondorSpawner, since this is what I can test with.
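
As a rough illustration of the idea (not the PR's actual code), the substitution could look like the sketch below. random_port from jupyterhub.utils is an existing helper; the function and argument names here are made up for the example:

```python
# Sketch of the port-substitution idea; names are hypothetical and do not
# mirror batchspawner's internals.
from jupyterhub.utils import random_port


def build_connect_cmd(template, remote_port, host):
    """Expand a connect_to_job_cmd template.

    If the template uses {rport}, pick a fresh local port on the hub for
    {port} and keep the notebook's original port as {rport}, so several
    forwarded notebooks cannot collide on the same hub-local port.
    """
    if "{rport}" in template:
        local_port = random_port()  # free port chosen locally on the hub
    else:
        local_port = remote_port    # no forwarding: same port on both ends
    cmd = template.format(port=local_port, rport=remote_port, host=host)
    return cmd, local_port
```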

@olifre (Contributor, Author) commented Mar 9, 2021

Rebased to current master to get the new CI working.

@olifre (Contributor, Author) commented May 13, 2021

I pushed three more commits with a small improvement and two further changes:

  • It is now possible to disable the connect_to_job feature for the CondorSpawner again, allowing a revert to the old behaviour if needed for any reason.
  • The string {host} is now replaced with self.ip in the connect_to_job_cmd. This may be useful if an implementation for other batch systems is added, or if somebody uses an additional service to proxy for these batch systems (see the sketch after this list).
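
For illustration, a hedged configuration sketch of these two knobs, assuming the option name from this PR; the commented-out spawner class and ssh command are hypothetical:

```python
# jupyterhub_config.py -- illustrative only.

# Revert to the old behaviour (direct hub-to-execute-node connectivity) by
# explicitly disabling the connect command for the CondorSpawner:
c.CondorSpawner.connect_to_job_cmd = ""

# For a hypothetical spawner on another batch system, {host} would be replaced
# with the spawner's self.ip, e.g. to tunnel via the node the job landed on:
# c.SomeBatchSpawner.connect_to_job_cmd = (
#     "ssh -N -L {port}:localhost:{rport} {host}"
# )
```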

I see that this branch now has conflicts after 1decdf2, which caused a lot of (useful) reformatting all over the project. I am not (yet) resolving these, but I am certainly willing to do so once somebody steps up to start a review (to minimize effort, I'd prefer to resolve conflicts only once).

olifre added 10 commits July 11, 2022 22:54
  • This adds the possibility to start a "connect_to_job" background task on the hub at job start, which establishes connectivity to the actual single-user server. An example of this can be "condor_ssh_to_job" for HTCondor batch systems. Additionally, the background tasks are monitored for successful startup (the background task is given some time to successfully establish connectivity) and in poll() during job runtime; if they fail, the job is terminated.
  • This leverages condor_ssh_to_job to forward the port of the single-user server to the hub, removing the need for direct connectivity from the hub to the execute nodes.
  • This allows using {rport} inside the connect_to_job_cmd. If this is done, {port} is set to a random port chosen locally on the Hub, and {rport} is set to the original remote port. This is useful e.g. for SSH port forwarding. If this is used, the {rport} of the notebook is saved into a new class variable in case it is needed again.
  • This uses the functionality of choosing a random local port on the hub to forward the notebook port to. It ensures that no port collisions occur between different forwarded notebooks.
  • Wrapping the coroutines into futures first allows checking the state of the futures directly (see the sketch after this list).
  • For implementations other than condor_ssh_to_job, it might be useful to use the hostname on which the notebook was spawned to proxy it.
  • If connect_to_job_cmd is explicitly set to an empty string, CondorSpawner will not override the hostname with localhost, allowing a revert to the old behaviour (assuming direct connectivity).
  • …ing. With this, the first database commit will already contain the forwarded port if connect_to_job is used, and the log will show the correct port number.
  • This is required since data may change after the connect_to_job function.
  • The background command cannot cleanly be simulated to stay running.
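
A rough sketch of the futures-based monitoring mentioned in the commit list above; the coroutine and function names are invented for illustration and are not batchspawner's:

```python
# Hypothetical sketch: wrap the connect coroutine in a future so its state can
# be checked right after startup and again later from poll().
import asyncio


async def run_connect_to_job(connect_coro, startup_grace=5.0):
    task = asyncio.ensure_future(connect_coro)

    # Give the background command some time to establish connectivity; if the
    # future is already done, the command exited early and the start failed.
    await asyncio.sleep(startup_grace)
    if task.done():
        raise RuntimeError(f"connect_to_job exited early: {task.exception()}")

    # poll() can later check task.done() and terminate the job if it is.
    return task
```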
@olifre (Contributor, Author) commented Jul 11, 2022

Since 1.5 years have passed, I've taken the opportunity to:

  • Rebase this onto current master. Merging automatically was impossible due to the large reformatting, so I re-applied all changes one by one.
  • Adapt to changes in master, e.g. the dropped support for JupyterHub <= 0.7.
  • Add some fixes and disable the tests for connect_to_job, since I did not have a good idea of how to emulate background commands and port forwarding in the test suite.

After this, the PR applies cleanly on master again. Furthermore, CI runs fine apart from Python 3.5 tests, which appear to be broken by an unrelated issue.

Hope this helps; we are still using this in production quite successfully 👍.
