Conversation

rkdarst
Contributor

@rkdarst rkdarst commented May 19, 2018

I have created my own integration branch for #66, #73, #74, and #72. Unfortunately, the individual pull requests have taken too long and essentially paused batchspawner development at our university for a while, so I took the initiative to collect these changes together and test them in my own integration branch.

In general, is there a better way to handle jupyterhub development? A better way to communicate, or a better way to get fast feedback? I'd also prefer individual PRs, but I haven't gotten any comments on them, so I can't easily be granular.

So far, I've added tests that cover #72, #73, and #66 somewhat.

The whole thing works with JupyterHub 0.9 + SlurmSpawner for me in basic testing. I will continue internal testing and update this PR with comments when I do more.

DeepHorizons and others added 12 commits April 26, 2018 01:26
- This was commented out because run_command was not in the class, so
  self.log was not available.  Re-enable it.
- If the batch_script or any other string has {{ or {% in it, treat it
  as a jinja2 template and do advanced formatting (see the sketch after
  this commit list).
- Allows for optional options to be more easily supported.
- Initial implementation for SlurmSpawner only.
- Closes: jupyterhub#46 (slurm + {nprocs})
- Closes: jupyterhub#70 (slurm + {nprocs})
- Closes: jupyterhub#38 (templates / fallbacks)
- This adds the option exec_prefix which can be used for overriding
  the sudo command to use when manipulating jobs.
- This defaults to "sudo -E -u {username}" and is prepended to every
  command before it is run.
- The condor_q command did not have this, but it likely can be used here as well.
- Closes: jupyterhub#39
- By using `srun` for cmd, slurm will properly signal shutdown with
  SIGTERM in advance.  Closes: jupyterhub#50
- Add prologue and epilogue options in case other setup is needed
  before or after the script runs (e.g. `unset XDG_RUNTIME_DIR`).
- Some other script niceties.
- This will eventually need porting to the other spawners, but will
  require some local knowledge.  Slurm first, then others.
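
For illustration, the {{ / {% detection described in the jinja2 commit above could look roughly like the sketch below. The helper name format_template and the exact fallback behavior are assumptions for this sketch, not necessarily the code in this PR:

from jinja2 import Template

def format_template(template, **subvars):
    # Only treat the string as a jinja2 template if it looks like one;
    # otherwise fall back to plain str.format() for backward compatibility.
    if '{{' in template or '{%' in template:
        return Template(template).render(**subvars)
    return template.format(**subvars)

# Optional options become easy to express with jinja2 conditionals, e.g.:
script = ("#SBATCH --ntasks={{ nprocs }}\n"
          "{% if runtime %}#SBATCH --time={{ runtime }}\n{% endif %}")
print(format_template(script, nprocs=4, runtime='1:00:00'))

Presumably exec_prefix ("sudo -E -u {username}") would be run through the same formatting before being prepended to each command.
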
@mbmilligan
Member

@rkdarst I'm sorry this progress has been frustratingly slow for you. I absolutely don't want to discourage you from working on the project! That said, I think you are correct to go ahead and maintain your own fork to develop features you need now. Certainly it is a little more work to develop now, and then have to go back and split out changes into a form the parent project can accept. But I hope the result is a higher quality, more sustainable parent project.

Talking more concretely, I will still want to merge individual PRs. For the ones I haven't commented on, you can assume I don't find them especially controversial, but that they are blocked by something else. In particular, most of your PRs in this series, as discussed in PR #66, are blocked waiting on the Jinja2 templates work to merge. And I am not comfortable merging that until we have tests that check the output of the current Torque/Slurm/Condor/etc. implementation classes, so we can be sure they aren't changing behind our backs when the Jinja changes land.

I guess the immediate question is, do you want to wait for me to write that functionality (which I will do in the int_templates branch) or do you want to work up another PR providing that?

@rkdarst
Contributor Author

rkdarst commented May 21, 2018

I started with some more tests in this branch. So far, the tests cover command templating, batch script templating, and exec_prefix. They don't yet go so far as to test the concrete spawners, but they can probably be extended to do that if you can figure out how. Basically,

# Test starting
... = [re.compile('.*echo'),
       re.compile('.*RUN'),
      ]
io_loop.run_sync(spawner.start, timeout=5)

will assert that the next two run_command calls match those regular expressions. So, if your expectations can be expressed in that form, then you just need to parameterize it somehow. Other maintainers should be able to push directly to this branch, so feel free to do so. (There is also spawner.cmd_expectlist, which asserts that the run_command stdout matches whatever you give it. Probably less useful.) How to make the spawner tests run without batch systems... I guess by mocking the outputs? Or by not actually running them, just asserting that the commands match the right patterns and not testing the outputs?
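
To make the mocking idea concrete, here is a minimal sketch; the class name, attribute names, canned outputs, and the exact run_command signature are illustrative assumptions, not necessarily what this branch uses:

import re
from batchspawner import SlurmSpawner

class DummySlurmSpawner(SlurmSpawner):
    """Never talks to a real batch system: each run_command call is checked
    against the next expected regex and answered with a canned output."""
    cmd_expectlist = None   # compiled regexes, consumed in call order
    out_list = None         # canned stdout returned for each call, in order

    async def run_command(self, cmd, input=None, env=None):
        if self.cmd_expectlist:
            expect = self.cmd_expectlist.pop(0)
            assert expect.search(cmd), "unexpected command: %s" % cmd
        return self.out_list.pop(0) if self.out_list else ''

# Hypothetical usage in a test:
# spawner.cmd_expectlist = [re.compile(r'sudo.*sbatch'),
#                           re.compile(r'sudo.*squeue')]
# spawner.out_list = ['12345', '12345 RUNNING node001']
# io_loop.run_sync(spawner.start, timeout=5)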

Anyway, I probably won't be able to get to the concrete spawners in the next few days, so if you start at it go ahead.

@mbmilligan
Member

Sounds like a plan - I should have time this week to develop the testing in the integration branch, and then we can start rebasing your PRs onto that branch. And yes, I'm thinking in terms of a framework that tests the generated strings, and passes back the string that would be expected from the corresponding batch system. I might cherry-pick the testing commits from this PR as a starting point, since you have a reasonable proof-of-concept going already.

And again, thanks for all the help!

@rkdarst
Contributor Author

rkdarst commented May 22, 2018

OK, I actually just did this, inspired by our last two comments making it obvious how to do it. So far, it is implemented for Torque and Slurm. Push coming soon. The one thing that is slightly missing is tests of the tests - not that I think it should go that far, but there have been several occurrences of mismatched kwargs resulting in tests not doing anything. Just consider this beta and give it a good look over at least.

It might be easier to just use my branch to integrate, rather than re-separating all the things.

@rkdarst
Contributor Author

rkdarst commented May 22, 2018

Now all spawners have some basic testing. Slurm is the best so far, and the tests for others can be improved as they are converted to jinja.

What other tests are needed for my PRs? I think all the major points are there, though maybe not perfectly yet.

@rkdarst
Contributor Author

rkdarst commented May 22, 2018

I went through and checked, and I think all of my changes are tested at least minimally. What is the next step now? Should I continue working on this branch, or can we unify soon? Luckily my next task (documentation) doesn't depend on this too much. I think it would be better not to cherry-pick from here unless you want to separate development again - feel free to use this branch as the integration branch. Everything is now together, and I would rather not try to separate it out again.

I could start working on specific spawner scripts, but that really should be done in individual PRs so that spawner experts can discuss. But, that can't be done until this is taken care of.

Any other major things on the horizon?

@mbmilligan
Member

Very nice, this looks like great work. I have to say that pulling in a giant branch of changes is still not my preferred approach, but the tests pass and it's hard to argue with good results. Let's proceed like so:

  • I will do a fresh review of the full set of changes to see if there are any issues to flag.
  • I will stand up a test instance against our Torque cluster as an all-up integration test.
  • You will confirm that you have done the same (sounds like you already have, but let's be explicit) against your Slurm cluster.

My part of that shouldn't take more than a couple of days. If all that checks out, I'll merge this branch and close the corresponding individual PRs. Sound fair to you?

Contributor

@willingc willingc left a comment

See other PRs for comments and small typos.

@rkdarst
Contributor Author

rkdarst commented May 23, 2018

Yes, I've been running this against my dev instance on my slurm cluster. Tonight I will migrate to production and see what happens...

Your plan sounds good!

@rkdarst
Contributor Author

rkdarst commented May 23, 2018

I did an upgrade to the latest jupyterhub + latest batchspawner on my production system (I figured I should test jupyterhub 0.9, too). No problems yet, besides remembering to update the config from dev -> production. The sqlite3 DB migration even worked, with a caveat.

I see some differences in environment variable handling in my SlurmSpawner, but this shouldn't be a blocker - just something I need to figure out and probably document for others to know.

@rkdarst
Contributor Author

rkdarst commented May 24, 2018

So using this in production today was OK. The one problem was with the slurm script - adding srun resulted in the environment not being exported at all. But now it's actually doing what it is supposed to. It turns out that in the previous script, {keepvars} did nothing. Now it does what is expected, but that may break what worked previously.
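
For context, here is a rough sketch of where {keepvars} and srun sit in the script, written as a jupyterhub_config.py override; this is an assumption about the shape of the template, not the actual default shipped with batchspawner:

# jupyterhub_config.py (sketch; option spellings may differ from the default)
c.SlurmSpawner.batch_script = """#!/bin/bash
#SBATCH --export={keepvars}
#SBATCH --job-name=spawner-jupyterhub
{prologue}
srun {cmd}
{epilogue}
"""

Once the {keepvars} substitution actually takes effect, --export restricts the job (and therefore the srun) environment to exactly those variables, which would explain the change in behavior.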

I think this should be fixed in another PR which starts to add more flexible environment handling (for example, an env var setting that adds to the defaults instead of replacing them...).

@mbmilligan
Member

Ok, as promised, I've given this integration branch a trial run against our Torque cluster (not in production, but in our pre-production testing environment), and it behaves as expected. I haven't found any points of concern reading through the code changes. Thanks @willingc for the comments you've added, too! I'm going to go ahead and pull this into master, and we'll continue development from that base. Thanks @rkdarst for your work on this!

@mbmilligan mbmilligan merged commit 2f95453 into jupyterhub:master May 29, 2018