Integration of run_command, exec_prefix, slurm script, and jinja2 #84
Conversation
- This was commented out because run_command was not in the class, so self.log was not available. Re-enable it.
- If the batch_script or any other string has `{{` or `{%` in it, treat it as a Jinja2 template and do advanced formatting.
- Allows optional options to be more easily supported.
- Initial implementation for SlurmSpawner only.
- Closes: jupyterhub#46 (slurm + {nprocs})
- Closes: jupyterhub#70 (slurm + {nprocs})
- Closes: jupyterhub#38 (templates / fallbacks)
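The `{{`/`{%` detection described above can be sketched roughly like this (a minimal illustration, not the actual batchspawner code; the name `format_string` is a hypothetical stand-in):

```python
import jinja2

def format_string(template, **subvars):
    """Render with Jinja2 if the string looks like a Jinja2 template
    (contains {{ or {%); otherwise fall back to plain str.format()."""
    if '{{' in template or '{%' in template:
        return jinja2.Template(template).render(**subvars)
    return template.format(**subvars)

# Old-style format template still works unchanged:
print(format_string("srun -n {nprocs}", nprocs=4))
# A Jinja2 template makes an option genuinely optional:
print(format_string(
    "sbatch{% if partition %} -p {{ partition }}{% endif %}",
    partition="gpu"))
```

Because plain `{nprocs}`-style strings contain neither `{{` nor `{%`, existing configurations keep working while Jinja2 users get conditionals and defaults.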
- This adds the option exec_prefix, which can be used to override the sudo command used when manipulating jobs.
- This defaults to `"sudo -E -u {username}"` and is prepended to every command before it is run.
- The condor_q command did not have this, but it can likely be used here.
- Closes: jupyterhub#39
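A rough sketch of the exec_prefix idea (the helper name `build_command` is hypothetical; the real spawner builds and runs the command internally via run_command):

```python
def build_command(cmd, exec_prefix="sudo -E -u {username}", **subvars):
    """Prepend exec_prefix to a batch-system command, then format the
    whole string so placeholders like {username} are substituted."""
    return (exec_prefix + " " + cmd).format(**subvars)

print(build_command("sbatch jobscript.sh", username="alice"))
# → sudo -E -u alice sbatch jobscript.sh
```

Keeping the prefix separate from the command means sites can swap sudo for another privilege mechanism without touching the per-command templates.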
- By using `srun` for cmd, Slurm will properly signal shutdown with SIGTERM in advance. Closes: jupyterhub#50
- Add prologue and epilogue options in case other setup is needed before or after the script runs (e.g. `unset XDG_RUNTIME_DIR`).
- Some other script niceties.
- This will eventually need porting to the other spawners, but will require some local knowledge. Slurm first, then others.
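To make the `srun` point concrete, a batch script along these lines (an illustrative sketch, not the actual template from this PR; the placeholder names are assumptions) would let Slurm signal the notebook server before killing it:

```jinja
#!/bin/bash
#SBATCH --job-name=jupyterhub-singleuser
{{ prologue }}
# Launching under srun (rather than exec'ing the command directly) means
# Slurm delivers SIGTERM to the server ahead of the hard kill at job end,
# giving the single-user server a chance to shut down cleanly.
srun {{ cmd }}
{{ epilogue }}
```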
@rkdarst I'm sorry this progress has been frustratingly slow for you. I absolutely don't want to discourage you from working on the project! That said, I think you are correct to go ahead and maintain your own fork to develop the features you need now. Certainly it is a little more work to develop now and then have to go back and split out changes into a form the parent project can accept, but I hope the result is a higher-quality, more sustainable parent project.

Talking more concretely, I will still want to merge individual PRs. For the ones I haven't commented on, you can assume I don't find them especially controversial, but they are blocked by something else. In particular, most of your PRs in this series, as discussed in PR #66, are blocked waiting on the Jinja2 templates work to merge. And I am not comfortable merging that until we have tests that check the output of the current Torque/Slurm/Condor/etc. implementation classes, so we can be sure they aren't changing behind our backs when the Jinja2 changes land. I guess the immediate question is: do you want to wait for me to write that functionality (which I will do in the […])
I started with some more tests in this branch. So far, they test templating of the command, templating of the batch script, and exec_prefix. It doesn't yet go so far as to test the concrete spawners, but it can probably be extended to do that if you can figure it out. Basically,

```python
# Test starting
… = [re.compile('.*echo'),
     re.compile('.*RUN'),
     ]
io_loop.run_sync(spawner.start, timeout=5)
```

will assert that the next […]. Anyway, I probably won't be able to get to the concrete spawners in the next few days, so if you start on it, go ahead.
Sounds like a plan - I should have time this week to develop the testing in the integration branch, and then we can start rebasing your PRs onto that branch. And yes, I'm thinking in terms of a framework that tests the generated strings, and passes back the string that would be expected from the corresponding batch system. I might cherry-pick the testing commits from this PR as a starting point, since you have a reasonable proof-of-concept going already. And again, thanks for all the help!
OK, I actually just did this, inspired by our last two comments making it obvious how to do it. So far, it is implemented for Torque and Slurm. Push coming soon. The one thing that is slightly missing is tests of the tests - not that I think it should go that far, but there have been several occurrences of mismatched kwargs resulting in tests not doing anything. Just consider this beta and give it a good looking-over at least. It might be easier to just use my branch to integrate, rather than re-separating all the things.
- Makes things consistent with the other spawners
- Also updates the other spawners
Now all spawners have some basic testing. Slurm is the best so far, and the tests for the others can be improved as they are converted to Jinja2. What other tests are needed for my PRs? I think all the major points are there, though maybe not perfectly yet.
I went through and checked, and I think all of my changes are tested at least minimally. What is the next step now? Should I continue working on this branch, or can we unify soon? Luckily my next task (documentation) doesn't depend on this too much. I think it would be good not to go cherry-picking from here unless you want to separate development again - feel free to use this branch as the integration point. Everything is now together and I would rather not try to separate it out again. I could start working on specific spawner scripts, but that really should be done in individual PRs so that spawner experts can discuss them. But that can't be done until this is taken care of. Any other major things on the horizon?
Very nice, this looks like great work. I have to say that pulling in a giant branch of changes is still not my preferred approach, but the tests pass and it's hard to argue with good results. Let's proceed like so:
My part of that shouldn't take more than a couple of days. If all that checks out, I'll merge this branch and close the corresponding individual PRs. Sound fair to you?
See other PRs for comments and small typos.
Yes, I've been running this against my dev instance on my slurm cluster. Tonight I will migrate to production and see what happens... Your plan sounds good!
I did an upgrade to the latest JupyterHub + latest batchspawner on my production system (I figured I should test JupyterHub 0.9, too). No problems yet besides remembering to update the config from dev to production. The sqlite3 DB migration even worked, with a caveat. I see some differences in environment variable handling in my SlurmSpawner, but this shouldn't be a blocker - just something I need to figure out and probably document for others.
So using this in production today was OK. The one problem was with the slurm script: adding […]. I think this should be fixed in another PR that starts to add more flexible environment handling (for example, an env var setting that adds to the defaults instead of replacing them...)
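The "add to the defaults instead of replacing them" idea mentioned above could look something like this (a hypothetical sketch, not an existing batchspawner option; `merged_env` is an invented name):

```python
def merged_env(defaults, extra):
    """Combine default environment variables with user additions:
    user entries win on conflicts, but defaults that were not
    overridden are kept instead of being discarded wholesale."""
    env = dict(defaults)  # copy, so the defaults dict is untouched
    env.update(extra)
    return env

defaults = {"PATH": "/usr/bin:/bin", "LANG": "C"}
print(merged_env(defaults, {"LANG": "en_US.UTF-8", "MYVAR": "1"}))
```

Today's behavior corresponds to using `extra` alone; the merge keeps site defaults intact while still letting users customize.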
Ok, as promised, I've given this integration branch a trial run against our Torque cluster (not in production, but in our pre-production testing environment), and it behaves as expected. I haven't found any points of concern reading through the code changes. Thanks @willingc for the comments you've added, too! I'm going to go ahead and pull this into master, and we'll continue development from that base. Thanks @rkdarst for your work on this!
I have created my own integration branch for #66, #73, #74, and #72. Unfortunately, the individual pull requests have taken too long and essentially paused batchspawner development at my university for a while, so I took the initiative to make my own integration branch and start collecting things together and testing them.
In general, is there a better way to handle JupyterHub development? A better way to communicate, or to get fast feedback? I'd also prefer individual PRs, but I haven't gotten any comments on them, so I can't easily be granular.
So far, I've added tests that cover #72, #73, and #66 somewhat.
The whole thing works with JupyterHub 0.9 + SlurmSpawner for me in basic testing. I will continue internal testing and update this PR with comments when I do more.