[idea] batchspawner sprint #138

Closed
cmd-ntrf opened this issue Mar 28, 2019 · 18 comments

Comments

@cmd-ntrf
Contributor

With the coming release of JupyterHub 1.0, I would suggest we organize a small code sprint during the next Jupyter for Science User Facilities and High Performance Computing Workshop. It is happening from June 11 to 13.

I am not sure who is participating, but @mbmilligan, I think you are going? Would anyone else be interested in helping solve a few issues and paving the way to a batchspawner 1.0 release?

@rcthomas
Contributor

I was really hoping for this. Creating a list of topics for the hack sessions would be really useful.

@cbjhnsn

cbjhnsn commented Apr 8, 2019

I probably won't make the event, but if this happens I have some feedback I would like to give on possible adjustments.

@rkdarst
Contributor

rkdarst commented Apr 9, 2019

+1. I guess we should start by making wishlist issues about the things we would like to do, and someone could tag them with sprint. Or if they are too small to warrant their own issues, add them here.

@rcthomas
Contributor

Here are some thoughts I've had. The first set relates more to batchspawner than the others...

  • JupyterHub 1 supports end-to-end SSL to the notebook. For this to work, .create_certs() and .move_certs() need to be implemented on custom spawners. It might be interesting to look at whether there's a common solution for all of us batchspawner users (a rough sketch of one possibility follows this list). Here is the merged PR from @tgmachina (who will be with us).
  • Those of us using batchspawner should probably put some time into documenting how we've done our setups, and the interaction with wrapspawner especially.
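
To make the discussion concrete, here is a minimal sketch, assuming a shared home filesystem, of what a move_certs() override might look like for a batch spawner; the directory layout and ownership handling are assumptions to be checked against the JupyterHub 1.0 internal-SSL docs, and create_certs() is left untouched here:

```python
# Minimal sketch only: copy the hub-generated internal SSL certs somewhere the
# batch job can read them and return the new paths. Assumes a shared home
# filesystem; directory name and permissions are illustrative choices.
import os
import pwd
import shutil

from batchspawner import SlurmSpawner  # any BatchSpawnerBase subclass


class SharedFSSlurmSpawner(SlurmSpawner):
    def move_certs(self, paths):
        user = pwd.getpwnam(self.user.name)
        dest = os.path.join(user.pw_dir, ".jupyterhub-ssl")
        os.makedirs(dest, mode=0o700, exist_ok=True)
        os.chown(dest, user.pw_uid, user.pw_gid)

        moved = {}
        for key in ("keyfile", "certfile", "cafile"):
            target = os.path.join(dest, os.path.basename(paths[key]))
            shutil.copy(paths[key], target)
            os.chmod(target, 0o600)
            os.chown(target, user.pw_uid, user.pw_gid)
            moved[key] = target
        return moved
```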

Other things less related to batchspawner but still relevant to the sprint context...

  • At our Lab we have a custom solution that lets users (currently Lab extension developers) select the jupyter-labhub they want to run at launch time. It's a little wrapper script that we tell them to put at a magic place with the right command to run; if that script is executable, it runs instead of whatever the Hub is configured to run (a configuration sketch follows this list). @cmd-ntrf's customization with the app wrapper for sending back the port info makes me think that kind of approach would be better. The main thing this would get us is allowing users to manage extensions themselves.
  • The question of monitoring and audit trails has come up at least at our Lab. Some of this we can handle in our infrastructure, but some investigation of how to do more thorough logging of the notebook and terminal might be a good use of time. An example PR trying to address at least some of this is here.
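
For the wrapper-script idea above, here is a sketch of how it could be wired up in jupyterhub_config.py, assuming the Hub host can see the users' shared home directories; the wrapper path is an invented convention, not something batchspawner provides:

```python
# jupyterhub_config.py -- hypothetical sketch of the user-selectable command
# idea. The "magic place" below is made up for illustration.
import os
import pwd

USER_WRAPPER = ".local/bin/start-jupyter-labhub"  # assumed "magic place"


def maybe_use_user_wrapper(spawner):
    """If the user left an executable wrapper script at the agreed path,
    launch that instead of the Hub-configured single-user command."""
    home = pwd.getpwnam(spawner.user.name).pw_dir
    wrapper = os.path.join(home, USER_WRAPPER)
    if os.access(wrapper, os.X_OK):
        spawner.cmd = [wrapper]


c.Spawner.pre_spawn_hook = maybe_use_user_wrapper
```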

@mbmilligan
Member

Thanks all for getting the conversation started on this! I think if we can get enough batchspawner-interested parties together for the sprint session, this would be a great use of the time. I will certainly be there.

My priorities would probably be (subject to change, in no particular order):

  • Ensure Batchspawner and Wrapspawner are compatible with JupyterHub 1 (without breaking too much backward compat) and each other
  • Ensure both are compatible with the new SSL features
  • Update PyPI packaging for both
  • Update testing for Batchspawner, and implement testing for Wrapspawner

Additionally, there are some discussions I'd like to have that might go faster face-to-face than over Github threads:

  • What should our backwards-compat cutoffs actually be, based on the needs of actual users?
  • Are there changes to the Github workflow or testing infrastructure that would accelerate the incorporation of changes?
  • Is there significant technical debt that needs addressing?

For that matter, I'm happy to facilitate setting up a web meeting for Batchspawner folks ahead of the workshop, too, if there is interest in having some of this discussion in advance.

@rkdarst
Contributor

rkdarst commented Jun 11, 2019

I think we could drop 3.4 support... that would enable supporting progress updates (unless someone more clever than me can fix it). Backwards compatibility is nice, but if it's that or ongoing development, I'd prefer ongoing development.
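
For reference, here is a rough sketch of what surfacing queue information through the JupyterHub 0.9+ progress API could look like. Native async generators need Python 3.6, so on 3.5 this would have to go through the async_generator package as JupyterHub itself does, and the queue-query helper below is made up for illustration, not real batchspawner API:

```python
# Rough sketch: yield progress/informational messages while the batch job is
# queued. JupyterHub stops consuming this generator once the spawn completes.
import asyncio

from batchspawner import SlurmSpawner


class ProgressSlurmSpawner(SlurmSpawner):
    async def progress(self):
        while True:
            # _query_queue_for_state() is a hypothetical helper, not part of
            # batchspawner; it would return e.g. "PENDING" or "RUNNING".
            state = await self._query_queue_for_state()
            yield {
                "progress": 50,
                "message": "Batch job state: {}".format(state),
            }
            await asyncio.sleep(5)
```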

ProfileSpawner's functionality is also included in KubeSpawner - perhaps they could be unified? Low priority, though.

One high priority would be working Travis tests.

There are so many open pull requests that I don't know what to do... many are related or directly depend on each other, and my local commit graph is getting too complicated to be worth looking at (many branches aren't PRs yet because of this). Some issues relate to ones in profile/wrap spawner, and sometimes the same thing has been solved several times. I haven't had time to do much lately, but not having a direction is slowing things down too. I'll have time to do stuff this summer, if there's a path forward.

We should do something about the jupyterlab command (#141). Preferably, find a way to do this that doesn't require new commands, but I'm not expert enough in this.

Get the "select port on remote host" feature worked out a bit more. I think part of it can be sent to jupyterhub. It also relates to the above about a custom jupyterhub command - both of these require a new command, which creates a manual-maintenance situation. (I could be wrong here, going from memory...)

Creating a "Jupyter on HPC" guide would be great... I really should get around to writing up what I've done. It's pretty similar to what mbmilligan does, from what I understand.

I can take part in a remote sprint for the rest of this week. Let me know what I should be working on.

@rkdarst
Contributor

rkdarst commented Jun 11, 2019

Here's my classification of issues. If several are in the same bullet point, they are either duplicate issues or duplicate PRs that are mostly solved by one solution. Feel free to edit this comment to organize it better; I did this quickly.

High priority:

Medium:

Low:

Easy (don't have difficult interactions with other things, can be any category above. May need thought though.):

Support:

@mbmilligan
Member

Hello all,

While we have not found much time for hacking here at the JCW, we have had some very productive face-to-face discussions about the issues holding back Batchspawner. To summarize:

  1. Proposal: Having gotten no pushback from the people here, we would like to propose narrowing Batchspawner's backward-compatibility goals. This will allow us to simplify a number of complicated/redundant code paths, and dramatically simplify and speed up the CI testing process. Specifically we suggest:
    • Supported Python versions shall be limited to versions supported by the latest released version of Jupyterhub. At this time Jupyterhub 1.0 supports Python 3.5 and newer, so we will immediately stop testing against/avoiding features that would exclude Py3.4.
    • Supported Jupyterhub versions shall be limited to the newest released version and one earlier major version. At this time that would be Jupyterhub 1.0 and the 0.9 branch, so we will immediately stop testing against/maintaining support for Jupyterhub 0.8 and earlier. (A packaging sketch expressing both constraints follows this list.)
  2. Priorities: We will focus effort on issuing a new release that
  3. Testing: Felix (@cmd-ntrf) demonstrated a method for spinning up a transient SLURM cluster with Jupyterhub/Batchspawner. We want to use this approach to implement all-up integration testing for Batchspawner and Batchspawner/Wrapspawner configurations. This will allow us to focus the CI testing on unit and component tests, while also detecting breakage that is currently hard to identify with CI testing alone.
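
An illustrative setup.py fragment expressing the proposed support window (the exact pins would be decided when the release is actually cut):

```python
# Illustrative only: encode the proposed back-compat policy in packaging
# metadata so pip refuses unsupported environments.
from setuptools import setup

setup(
    name="batchspawner",
    # ... other metadata unchanged ...
    python_requires=">=3.5",
    install_requires=[
        "jupyterhub>=0.9",
    ],
)
```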

Other results from our discussions so far:

  • While dropping Py3.4 support will allow us to implement the progress bar feature, nobody here was enthusiastic about using it. The underlying issue is that batch systems do not give enough information to usefully estimate when a job will start. However, there is considerable interest in bubbling informational messages up to the user (e.g. job information, notice that resource constraints are wrong, etc). This works now in the case of fatal errors (exception messages are printed) but informational messages need work.
  • Remote port selection via API call is a natural feature to add to the Hub. We want to prepare a PR for Jupyterhub proposing this.
  • More generally, the Jupyterhub extension API should provide a way for spawners to add request handlers to exchange information with spawned tasks. We should open a discussion with the Jupyterhub core developers about this.
  • For some schedulers (looking at you, schedMD) querying the job queue is not an especially cheap operation. Many active user sessions, each causing the Hub to issue a periodic poll(), cause problems for these systems. We should implement an optional schedule polling helper (probably running as a Hub service) that polls a queue and caches the result, and an easy way to retarget our poll() methods to hit the helper instead of the normal tool (a rough sketch of such a helper follows this list).
  • Relatedly, we should investigate if it is possible for a spawner instance to reduce the polling frequency once a job has started. If so, we should provide that as an option and enable it by default.
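
A very rough sketch of the polling-helper idea as a standalone service (which could be registered as a JupyterHub service): poll the scheduler on a fixed interval, cache the result, and let spawners ask this service instead of shelling out to the queue tool themselves. The command, output format, port, and URL layout are all assumptions for illustration:

```python
# Standalone cached queue poller (sketch). Not part of batchspawner.
import asyncio

from tornado import web
from tornado.ioloop import IOLoop

SQUEUE_CMD = ["squeue", "-h", "-o", "%i %T"]  # assumed output: "<jobid> <state>"
POLL_INTERVAL = 30  # seconds

cache = {}  # job id -> last known state


async def refresh():
    proc = await asyncio.create_subprocess_exec(
        *SQUEUE_CMD, stdout=asyncio.subprocess.PIPE
    )
    out, _ = await proc.communicate()
    states = {}
    for line in out.decode().splitlines():
        job_id, _, state = line.strip().partition(" ")
        states[job_id] = state
    cache.clear()
    cache.update(states)


async def poll_loop():
    while True:
        try:
            await refresh()
        except Exception as exc:
            print("queue poll failed:", exc)
        await asyncio.sleep(POLL_INTERVAL)


class JobStateHandler(web.RequestHandler):
    def get(self, job_id):
        # Returns {"job_id": ..., "state": null} for unknown/finished jobs.
        self.write({"job_id": job_id, "state": cache.get(job_id)})


if __name__ == "__main__":
    app = web.Application([(r"/jobs/([^/]+)", JobStateHandler)])
    app.listen(8585)
    IOLoop.current().spawn_callback(poll_loop)
    IOLoop.current().start()
```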

Please consider this an open call for comments on these plans and proposals. Thank you!

@cbjhnsn

cbjhnsn commented Jun 13, 2019

> For some schedulers (looking at you, schedMD) querying the job queue is not an especially cheap operation. Many active user sessions, each causing the Hub to issue a periodic poll(), cause problems for these systems. We should implement an optional schedule polling helper (probably running as a Hub service) that polls a queue and caches the result, and an easy way to retarget our poll() methods to hit the helper instead of the normal tool.

This would help us, and it's something I've thought about trying to add. The current poll method is costly for us for two reasons: our scheduler (SGE) only updates every 30 seconds, so there is no reason to poll any faster, and as mentioned above it's quite expensive; excessive polling has been known to cause performance issues for us.

rkdarst added a commit to rkdarst/batchspawner that referenced this issue Jun 13, 2019
- As agreed during Jupyterhub for research facilities sprint, jupyterhub#138
@rkdarst
Contributor

rkdarst commented Jun 13, 2019

Agreed on the reduction of backwards compatibility - if it turns out to be easy, we can keep 3.5/0.9 support for longer in future releases.

Progress bar - agreed it may be hard to use well, but it is important to have the possibility. Likely it will mostly be used for informational messages.

I'll see what I can do on a merge strategy...

mbmilligan added a commit that referenced this issue Jun 13, 2019
Drop backwards compatibility to jupyterhub==0.9.6 and python==3.5

See discussion in #138 for proposed new back-compat guidelines.
@mbmilligan
Member

Some further notes:

  • @yuvipanda is very supportive of adding a port-setting API to core Jupyterhub, and we are requested to submit an issue proposing that. I will write that up shortly. @minrk is reportedly averse to giving spawners a general-purpose mechanism to add extension API request handlers, because chaos will ensue. So if we want to do that, we'll have to keep our hacky api import approach.
  • We should take a look at Kubespawner, which already has a mechanism for reflecting poll() calls to a caching layer (a spawner-side sketch of the same idea follows).
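
A hypothetical spawner-side counterpart to the cached-poll helper sketched earlier in this thread: ask the helper for the job state instead of running the scheduler's query tool on every poll(). The helper URL, state names, and job_id usage are all assumptions:

```python
# Sketch: retarget poll() at a caching helper service instead of squeue.
import json

from batchspawner import SlurmSpawner
from tornado.httpclient import AsyncHTTPClient

POLL_HELPER_URL = "http://127.0.0.1:8585/jobs/{job_id}"  # assumed helper endpoint

ALIVE_STATES = {"PENDING", "CONFIGURING", "RUNNING", "COMPLETING"}


class CachedPollSlurmSpawner(SlurmSpawner):
    async def poll(self):
        """Return None while the job is alive, or an exit status otherwise."""
        if not getattr(self, "job_id", None):
            return 0  # no job submitted -> treat as not running
        client = AsyncHTTPClient()
        resp = await client.fetch(POLL_HELPER_URL.format(job_id=self.job_id))
        state = json.loads(resp.body.decode()).get("state")
        return None if state in ALIVE_STATES else 1
```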

@rkdarst
Contributor

rkdarst commented Jun 14, 2019

I was wondering, did you talk about maintainers and how we can ensure maintenance keeps moving, both on strategic questions about what's good and on dealing with PRs?

@rcthomas
Contributor

A little, but we mainly agreed to start a monthly Zoom check-in around "batchspawner and friends" to help address those kinds of questions. I'll be in touch with everyone on this about setting that up. It may not need to be a perpetual meeting, but we could run it for as long as we feel is necessary.

@rkdarst
Contributor

rkdarst commented Jun 14, 2019

Monthly meetings would be good; hopefully they would help things go faster. But does that mean merges will only happen once a month? If we all commented on issues and PRs a bit more, we could figure out what to do. I am going through old PRs, and most can be merged now with a little bit of work, once we have a master to test against. It's always better for things to be integrated quickly so that we can test interactions; I'd rather not have to keep separate integration branches around for personal testing. I'm happy to do whatever is needed, just let me know.

@rcthomas
Contributor

I don't think that implies once-a-month merges, unless it were decided that was a good idea. A monthly call could include a check-in on any outstanding PRs and issues, and that could help move things along.

@rkdarst
Contributor

rkdarst commented Jun 16, 2019

Yeah, I misstated that in my comment; I didn't mean to imply that I thought monthly check-ins were opposed to continuous work. It would have been better to say: "is there any plan to try to get stuff reviewed and accepted faster outside of those meetings?". I've gone through most outstanding PRs and the vast majority can be fixed and merged or closed, but we need some of the basic ones handled first. The other options don't seem so good...

@mbmilligan
Member

So far as we discussed, I will continue to be responsible for merges to the official repository. Between our decisions about reduced requirements for back compatibility, and the regular opportunity to discuss strategy and blocking issues, the hope is that reviewing PRs will be considerably less arduous going forward.

More concretely, once I am back in the office, I will be dedicating significant time this week to using the guidance from the workshop to work through our backlog. After the backlog is dealt with, I hope we can settle into a roughly weekly merge cadence.

@rkdarst
Contributor

rkdarst commented Jun 17, 2019

That would be great - let me know what I can do. For my part I'll try to comment on things more and do initial reviews. I have a large PR and a mega PR that integrate most of the other important PRs, and they at least work for me on my test system. Not everything is perfect, but at this point it's so complex that I'd rather let it stabilize and then check for integration problems.

For several of the PRs, I have some relevant integration/forward porting in the comments. For others, the only integration is in my integration branch, but if #143 is accepted then it's possible to work on those individually.

Let me know what else I can do to help.
