
Add HTCondor bindings to IPython.parallel #3465

Merged (12 commits, Jun 27, 2013)

Conversation

@jabooth (Contributor) commented Jun 25, 2013

This PR adds a new set of subclasses in IPython/parallel/apps/launcher.py to add support for the HTCondor batch management system.

I've tried to make a template that has sensible defaults that should work out of the box for most users.

Help config setup
I think I have added config/help support where required but I am still getting my head around the config/traitlets setup in IPython. Does everything look ok?

BatchLauncher changes

  1. Adding the job and queue templates to the batch script is now done via overridable methods (Condor has to place its job_array at the end).
  2. Added job_id_regexp_group to allow subclasses to specify which group of the job_id_regexp needs to be matched. By default this is 0, which means all current implementations remain unchanged, while allowing Condor to have a simple regexp setup. I'm not the world's foremost regexp expert, so there might be a way around this without introducing groups.
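A rough sketch of the overridable-hook idea in point 1 (the class and method names here are illustrative, not the actual launcher.py API):

```python
class BatchLauncher(object):
    """Illustrative base class, not the real IPython launcher."""

    def insert_job_array_in_script(self, script, directive):
        # Default behaviour: the job-array directive goes at the top
        # of the batch script, as the existing launchers expect.
        return directive + '\n' + script


class CondorLauncher(BatchLauncher):
    def insert_job_array_in_script(self, script, directive):
        # Condor requires its job array (e.g. 'queue N') at the end.
        return script + '\n' + directive
```

Because the base class calls the hook rather than splicing the template inline, a subclass can reorder the directive without touching the shared launch logic.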

Batch launch commands changes
Condor destroys sys.executable for the jobs it farms out, ruining Python's module path checking. This is pretty problematic for the current batch setup, which on the remote calls something like:

```
/local/python/path/bin/python -c 'commands required to start the engine or controller' --profile-dir=...
```

However, by instead calling

```
/local/python/path/bin/ipengine --profile-dir=...
/local/python/path/bin/ipcontroller --profile-dir=...
```

we can skirt the issue: on entry to these scripts the shebang causes /local/python/path/bin/python to be invoked, with the sys.executable path set correctly.

I can't think of any downside to this approach, other than that it now requires Python to be running from a folder that also contains ipengine and ipcontroller, which is of course the layout everyone gets from a standard install. My only concern is that I have no idea how these scripts (and Condor in general, for that matter) behave on Windows.

Documentation changes
Finally, once accepted the docs probably want to be upgraded to mention the support. Is that something I should do in this PR?

Previously I used condor_submit -verbose so that I could pull the job id from the first line. However, when submitting many jobs it's silly to have to search through so many lines of output. Now we have a regexp for Condor that matches on the non-verbose output. Here the job_id is at the end of the output, so using grouping in our job_id_regexp becomes very useful. To account for this I have added a job_id_regexp_group property to the BatchLauncher and its subclasses. The default of 0 means the whole regexp match is used; however, an integer can now be passed in to instead select a subgroup of the expression (see CondorLauncher and the mechanism will be clear).
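As an illustration (the condor_submit output line and the regexp below are assumptions for the sketch, not the PR's actual values), grouping lets the id be pulled off the end of the one-line non-verbose output:

```python
import re

# Hypothetical non-verbose condor_submit output and a matching regexp.
output = "1 job(s) submitted to cluster 1234."
job_id_regexp = r'submitted to cluster (\d+)'
job_id_regexp_group = 1  # select the capture group, not the whole match

m = re.search(job_id_regexp, output)
job_id = m.group(job_id_regexp_group)
print(job_id)  # -> 1234
```

With job_id_regexp_group left at its default of 0, m.group(0) would return the whole matched span, which is the existing behaviour the other launchers rely on.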
```python
def _insert_queue_in_script(self):
    """Inserts a queue if required into the batch script.
    """
    print self.queue_regexp.search(self.batch_template)
```

Review comment (Member): print statement

@minrk (Member) commented Jun 25, 2013

Is there a reason to have special bin_dir/ipcontroller instead of just ipcontroller for default path in condor?

@minrk (Member) commented Jun 25, 2013

They probably used to do things, but don't anymore. Go ahead and remove all the empty ones.

@jabooth (Contributor, Author) commented Jun 25, 2013

I think there is a reason for bin_dir. Firstly, this behaviour is in line with how it is done in the other launchers (lines 80-84).

Secondly, it seems the batch launcher templates are based on the assumption of a shared file system. Specifying a full path to the executable gives us dependable behaviour: activate whatever virtualenv you want on the machine you run ipcluster start on, and the default configuration guarantees you the same Python environment on launched nodes. Without the full path I would either have to get into transferring environment variables to achieve this (messy) or, by default, just get the system Python on the nodes (flaky). Of course you can get that behaviour if you want by providing a user-defined template; I just thought this was the most solid default option.

Btw, do you know why the approach in lines 80-84 is taken instead of the approach I'm using for Condor? Launching the ipengine/ipcontroller scripts seems like a better approach to me than manually repeating what these launchers do with the whole `python -c 'from IPython.parallel.apps...` string-manipulation stuff, but I thought it would be a bit bold to change it for all the launchers (and I'm not in a position to test PBS etc.).

@minrk (Member) commented Jun 25, 2013

Firstly, this behaviour is in line with how it is done with the other launchers (lines 80-84).

That's not what those lines do. Those lines load the ipcontroller/engine startup scripts with the same Python executable as the launcher, specifically so that the PATH to ipcontroller/engine cannot cause confusion by being somewhere weird. It's quite reasonable for the controller/engine scripts to be in a directory other than the Python executable (in fact, I personally have zero systems where ipengine is in the same dir as python).

For that reason, I try to use just the base ipcontroller when sys.executable doesn't make sense (e.g. SSH), and require that people make sure the PATH env makes sense in the batch environment.
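The two invocation styles being contrasted can be sketched as plain command lists (the module path below reflects IPython's layout at the time; treat the exact strings as illustrative):

```python
import sys

# Existing batch launchers: start the engine through the same interpreter
# that runs the launcher, so the batch environment's PATH never matters.
engine_via_executable = [
    sys.executable, '-c',
    'from IPython.parallel.apps.ipengineapp import launch_new_instance; '
    'launch_new_instance()',
]

# Where sys.executable is meaningless (SSH, or Condor in this PR):
# fall back to the bare script name and trust PATH in the batch env.
engine_via_path = ['ipengine']
```

The first style is robust against a weird PATH but assumes sys.executable survives into the job; the second makes the opposite trade, which is why it suits Condor here.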

@jabooth (Contributor, Author) commented Jun 26, 2013

I see, my assumption was clearly wrong about the scripts being reliably located.

Removed bin_dir and added documentation to CondorLauncher to explain all this - seem ok?


```python
def _insert_queue_in_script(self):
    """AFAIK, Condor doesn't have a concept of multiple queues that can be
    specified in the script..
```

Review comment (Member): extra .

@minrk (Member) commented Jun 26, 2013

Looks good to me after the one tiny typo. Thanks for going over it with me!

@jabooth (Contributor, Author) commented Jun 26, 2013

There you go. No problems, thanks for looking it over so swiftly!

@minrk (Member) commented Jun 26, 2013

Oh, one more thing: Can you add Condor to the tests in IPython.parallel.tests.test_launcher? Should be just two trivial entries below TestLSFControllerLauncher and TestLSFEngineSetLauncher. These tests don't do much yet, but may as well.

@jabooth (Contributor, Author) commented Jun 26, 2013

Just spotted two minor things:

  1. The docstrings in ipclusterapp don't mention Condor (or LSF, for that matter). I'll just tidy them up.
  2. Condor has officially been called HTCondor since October 2012. To save confusion it's probably best to update our classes now rather than in a few years' time.

I'll go ahead and implement these two changes now.

@jabooth (Contributor, Author) commented Jun 26, 2013

OK, that's all done now.
`grep -R '[^H][^T]Condor' ./IPython` comes back empty, and
`grep -R '[^h][^t]condor' ./IPython` yields only the condor_submit and condor_rm commands, so I think we're good.

@minrk (Member) commented Jun 27, 2013

Great, thanks.

minrk added a commit that referenced this pull request Jun 27, 2013
Add HTCondor bindings to IPython.parallel
@minrk minrk merged commit 09f2553 into ipython:master Jun 27, 2013
mattvonrocketstein pushed a commit to mattvonrocketstein/ipython that referenced this pull request Nov 3, 2014
Add HTCondor bindings to IPython.parallel