Conversation

rkdarst
Contributor

@rkdarst rkdarst commented May 19, 2018

I have created my own integration branch for #66, #73, #74, and #72. Unfortunately, the individual pull requests have taken too long and essentially paused batchspawner development at our university for a while, so I took the initiative to collect these changes together and test them in my own integration branch.

In general, is there a better way to handle jupyterhub development? A better way to communicate, or a better way to get fast feedback? I'd also prefer individual PRs, but I haven't gotten any comments on them, so I can't easily be granular.

So far, I've added tests that cover #72, #73, and #66 somewhat.

The whole thing works with JupyterHub 0.9 + SlurmSpawner for me in basic testing. I will continue internal testing and update this PR with comments when I do more.

DeepHorizons and others added 12 commits April 26, 2018 01:26
- This was commented out because run_command was not in the class, so
  self.log was not available.  Re-enable it.
- If the batch_script or any other string has {{ or {% in it, treat it
  as a jinja2 template and do advanced formatting (see the sketch after
  this commit list).
- Allows for optional options to be more easily supported.
- Initial implementation for SlurmSpawner only.
- Closes: jupyterhub#46 (slurm + {nprocs})
- Closes: jupyterhub#70 (slurm + {nprocs})
- Closes: jupyterhub#38 (templates / fallbacks)
- This adds the option exec_prefix which can be used for overriding
  the sudo command to use when manipulating jobs.
- This defaults to "sudo -E -u {username}" and is prepended to every
  command before it is run.
- The condor_q command did not have this, but it likely can be used here as well.
- Closes: jupyterhub#39
- By using `srun` for cmd, slurm will properly signal shutdown with
  SIGTERM in advance.  Closes: jupyterhub#50
- Add prologue and epilogue options in case other setup is needed
  before or after the script runs (e.g. `unset XDG_RUNTIME_DIR`).
- Some other script niceties.
- This will eventually need porting to the other spawners, but will
  require some local knowledge.  Slurm first, then others.
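
For illustration, the {{ / {% detection described in the jinja2 commit above could look roughly like the sketch below. The helper name format_template and the exact fallback behavior are assumptions for this sketch, not necessarily the code in this PR:

from jinja2 import Template

def format_template(template, **subvars):
    # Only treat the string as a jinja2 template if it looks like one;
    # otherwise fall back to plain str.format() for backward compatibility.
    if '{{' in template or '{%' in template:
        return Template(template).render(**subvars)
    return template.format(**subvars)

# Optional options become easy to express with jinja2 conditionals, e.g.:
script = ("#SBATCH --ntasks={{ nprocs }}\n"
          "{% if runtime %}#SBATCH --time={{ runtime }}\n{% endif %}")
print(format_template(script, nprocs=4, runtime='1:00:00'))

Presumably exec_prefix ("sudo -E -u {username}") would be run through the same formatting before being prepended to each command.
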
@mbmilligan
Member

@rkdarst I'm sorry this progress has been frustratingly slow for you. I absolutely don't want to discourage you from working on the project! That said, I think you are correct to go ahead and maintain your own fork to develop features you need now. Certainly it is a little more work to develop now, and then have to go back and split out changes into a form the parent project can accept. But I hope the result is a higher quality, more sustainable parent project.

Talking more concretely, I will still want to merge individual PRs. For the ones I haven't commented on, you can assume I don't find them especially controversial, but that they are blocked by something else. In particular, most of your PRs in this series, as discussed in PR #66, are blocked waiting on the Jinja2 templates work to merge. And I am not comfortable merging that until we have tests that check the output of the current Torque/Slurm/Condor/etc. implementation classes, so we can be sure they aren't changing behind our backs when the Jinja changes land.

I guess the immediate question is, do you want to wait for me to write that functionality (which I will do in the int_templates branch) or do you want to work up another PR providing that?

@rkdarst
Contributor Author

rkdarst commented May 21, 2018

I started with some more tests in this branch. So far, the tests cover command templating, batch script templating, and exec_prefix. They don't yet go so far as to test the concrete spawners, but they can probably be extended to do that if you can figure out how. Basically,

# Test starting
... = [re.compile('.*echo'),
       re.compile('.*RUN'),
      ]
io_loop.run_sync(spawner.start, timeout=5)

will assert that the next two run_command calls match those regular expressions. So, if your expectations can be expressed in that form, then you just need to parameterize it somehow. Other maintainers should be able to push directly to this branch, so feel free to do so. (There is also spawner.cmd_expectlist, which asserts that the run_command stdout matches whatever you give it. Probably less useful.) How to make the spawner tests run without batch systems... I guess by mocking the outputs? Or by not actually running them, just asserting that the commands match the right patterns and not testing the outputs?
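
To make the mocking idea concrete, here is a minimal sketch; the class name, attribute names, canned outputs, and the exact run_command signature are illustrative assumptions, not necessarily what this branch uses:

import re
from batchspawner import SlurmSpawner

class DummySlurmSpawner(SlurmSpawner):
    """Never talks to a real batch system: each run_command call is checked
    against the next expected regex and answered with a canned output."""
    cmd_expectlist = None   # compiled regexes, consumed in call order
    out_list = None         # canned stdout returned for each call, in order

    async def run_command(self, cmd, input=None, env=None):
        if self.cmd_expectlist:
            expect = self.cmd_expectlist.pop(0)
            assert expect.search(cmd), "unexpected command: %s" % cmd
        return self.out_list.pop(0) if self.out_list else ''

# Hypothetical usage in a test:
# spawner.cmd_expectlist = [re.compile(r'sudo.*sbatch'),
#                           re.compile(r'sudo.*squeue')]
# spawner.out_list = ['12345', '12345 RUNNING node001']
# io_loop.run_sync(spawner.start, timeout=5)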

Anyway, I probably won't be able to get to the concrete spawners in the next few days, so if you start at it go ahead.

@mbmilligan
Member

Sounds like a plan - I should have time this week to develop the testing in the integration branch, and then we can start rebasing your PRs onto that branch. And yes, I'm thinking in terms of a framework that tests the generated strings, and passes back the string that would be expected from the corresponding batch system. I might cherry-pick the testing commits from this PR as a starting point, since you have a reasonable proof-of-concept going already.

And again, thanks for all the help!

@rkdarst
Contributor Author

rkdarst commented May 22, 2018

OK, I actually just did this, inspired by our last two comments making it obvious how to do it. So far, it is implemented for Torque and Slurm. Push coming soon. The one thing that is slightly missing is tests of the tests - not that I think it should go that far, but there have been several occurrences of mismatched kwargs resulting in tests not doing anything. Just consider this beta and give it a good look over at least.

It might be easier to just use my branch to integrate, rather than re-separating all the things.

@rkdarst
Contributor Author

rkdarst commented May 22, 2018

Now all spawners have some basic testing. Slurm is the best so far, and the tests for others can be improved as they are converted to jinja.

What other tests are needed for my PRs? I think all the major points are there, though maybe not perfectly yet.

@rkdarst
Contributor Author

rkdarst commented May 22, 2018

I went through and checked, and I think all of my changes are tested at least minimally. What is the next step now? Should I continue working on this branch, or can we unify soon? Luckily my next task (documentation) doesn't depend on this too much. I think it would be better not to cherry-pick from here unless you want to separate development again - feel free to use this branch as the integration branch. Everything is now together, and I would rather not try to separate it out again.

I could start working on specific spawner scripts, but that really should be done in individual PRs so that spawner experts can discuss. But, that can't be done until this is taken care of.

Any other major things on the horizon?

@mbmilligan
Member

Very nice, this looks like great work. I have to say that pulling in a giant branch of changes is still not my preferred approach, but the tests pass and it's hard to argue with good results. Let's proceed like so:

  • I will do a fresh review of the full set of changes to see if there are any issues to flag.
  • I will stand up a test instance against our Torque cluster as an all-up integration test.
  • You will confirm that you have done the same (sounds like you already have, but let's be explicit) against your Slurm cluster.

My part of that shouldn't take more than a couple of days. If all that checks out, I'll merge this branch and close the corresponding individual PRs. Sound fair to you?

Contributor

@willingc willingc left a comment

See other PRs for comments and small typos.

@rkdarst
Contributor Author

rkdarst commented May 23, 2018

Yes, I've been running this against my dev instance on my slurm cluster. Tonight I will migrate to production and see what happens...

Your plan sounds good!

@rkdarst
Contributor Author

rkdarst commented May 23, 2018

I did an upgrade to the latest jupyterhub + latest batchspawner on my production system (I figured I should test jupyterhub 0.9, too). No problems yet, besides remembering to update the config from dev -> production. The sqlite3 DB migration even worked, with a caveat.

I see some differences in environment variable handling in my SlurmSpawner, but this shouldn't be a blocker - just something I need to figure out and probably document for others to know.

@rkdarst
Contributor Author

rkdarst commented May 24, 2018

So using this in production today was OK. The one problem was with the slurm script - adding srun resulted in the environment not being exported at all. But now it's actually doing what it is supposed to. It turns out that in the previous script, {keepvars} did nothing. Now it does what is expected, but that may break what worked previously.
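
For context, here is a rough sketch of where {keepvars} and srun sit in the script, written as a jupyterhub_config.py override; this is an assumption about the shape of the template, not the actual default shipped with batchspawner:

# jupyterhub_config.py (sketch; option spellings may differ from the default)
c.SlurmSpawner.batch_script = """#!/bin/bash
#SBATCH --export={keepvars}
#SBATCH --job-name=spawner-jupyterhub
{prologue}
srun {cmd}
{epilogue}
"""

Once the {keepvars} substitution actually takes effect, --export restricts the job (and therefore the srun) environment to exactly those variables, which would explain the change in behavior.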

I think this should be fixed in another PR which starts to add more flexible environment handling (for example, an env var setting that adds to the defaults instead of replacing them...).

@mbmilligan
Member

Ok, as promised, I've given this integration branch a trial run against our Torque cluster (not in production, but in our pre-production testing environment), and it behaves as expected. I haven't found any points of concern reading through the code changes. Thanks @willingc for the comments you've added, too! I'm going to go ahead and pull this into master, and we'll continue development from that base. Thanks @rkdarst for your work on this!

@mbmilligan mbmilligan merged commit 2f95453 into jupyterhub:master May 29, 2018