Allow PyCBC workflows to be run on OrangeGrid via OSG #559

Merged
merged 35 commits into from
Nov 9, 2015

duncan-brown
Contributor

This pull request enables us to run PyCBC inspiral workflows created by pycbc_make_coinc_search_workflow on OrangeGrid using the Open Science Grid Condor-CE infrastructure. It will also allow us to run on the Comet and Stampede XSEDE sites using the OSG interface, rather than running pegasus-mpi-cluster via Globus GRAM, which significantly simplifies workflow management for XSEDE job submission.

I have tested the patch and regular LIGO Data Grid submission on sugar works fine. To run on OrangeGrid the workflow should be created with the command line

pycbc_make_coinc_search_workflow --config-files ../config/analysis.ini ../config/data.ini ../config/gps_times_O1_analysis_3.ini  ../config/injections_minimal.ini  ../config/plotting.ini file-executables.ini --output output --config-overrides 'pegasus_profile-inspiral:hints|execution.site:orange-grid' 'workflow-main:staging-site:orange-grid=local' 'results_page:output-path:/home/dbrown/public_html/orange-grid/og-nonsharedfs-test-12' --workflow-name og-nonsharedfs-test-12

and then submitted with the command line

pycbc_submit_dax --no-create-proxy --accounting-group sugwg.osg --execution-sites orange-grid --append-pegasus-property 'pegasus.data.configuration=nonsharedfs' --append-pegasus-property 'pegasus.transfer.bypass.input.staging=true' --cache ../og-frames-c00.cache --dax og-nonsharedfs-test-12.dax --local-gsiftp-server sugar-dev2.phy.syr.edu

There are a couple of bugs in Pegasus 4.5.x that we have to work around. The first bug means that you can't get the bundled executables from a GridFTP server, and so you need to stage them locally on the submit site and provide an executables.ini file with file:// URLs, e.g.

hdf_trigger_merge = file:///home/dbrown/projects/osg/pycbc-software/v1.2.5/x86_64/composer_xe_2015.0.090/pycbc_coinc_mergetrigs
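As a rough illustration of the first workaround, the sketch below builds one such entry from the absolute path of a locally staged executable. The helper name `executable_entry` is hypothetical and is not part of PyCBC; it only shows the shape of the file:// URL.

```python
from urllib.parse import quote


def executable_entry(name, staged_path):
    """Return an ini line mapping an executable name to a file:// URL.

    Hypothetical helper for illustration only; not part of PyCBC.
    """
    if not staged_path.startswith("/"):
        raise ValueError("staged executables need an absolute path: %s"
                         % staged_path)
    # "file://" followed by an absolute path gives the file:///... form.
    return "%s = file://%s" % (name, quote(staged_path))


if __name__ == "__main__":
    print(executable_entry(
        "hdf_trigger_merge",
        "/home/dbrown/projects/osg/pycbc-software/v1.2.5/x86_64/"
        "composer_xe_2015.0.090/pycbc_coinc_mergetrigs"))
```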

The second bug causes Pegasus to create some jobs with gsiftp:// URLs rather than file:// URLs when staging output data on the local site. To work around this, after submitting the workflow, wait for the main.dax to plan and then run the commands

cd submitdir/work/main_ID0000001
for file in `grep -l nogrid *sub | egrep -v "(stage_in|stage_out|create_dir|stage_worker)" | sed 's/\.sub/\.sh/g'` ; do perl -pi.bak -e 's+gsiftp://sugar-dev2.phy.syr.edu+file://+g' $file ; done
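For readers less familiar with the shell one-liner above, here is a minimal Python sketch of the same two steps: pick the job scripts to patch, then rewrite the URLs. The function names are hypothetical, and unlike the perl pattern (where each dot matches any character) this version matches the server name literally.

```python
import re

# Same exclusions as the egrep -v pattern in the shell loop.
EXCLUDED = re.compile(r"(stage_in|stage_out|create_dir|stage_worker)")


def scripts_to_patch(sub_files):
    """Return the .sh names for .sub files that mention 'nogrid'.

    sub_files maps a submit file name to its text. File names matching
    the excluded transfer and setup job patterns are skipped.
    """
    selected = []
    for name, text in sorted(sub_files.items()):
        if "nogrid" in text and not EXCLUDED.search(name):
            selected.append(name.replace(".sub", ".sh"))
    return selected


def rewrite_urls(script_text, server="sugar-dev2.phy.syr.edu"):
    """Replace gsiftp://<server> with file:// so local paths are used."""
    return script_text.replace("gsiftp://" + server, "file://")
```

Applying `rewrite_urls` to the contents of each script returned by `scripts_to_patch` reproduces the effect of the loop, minus the .bak backups that `perl -pi.bak` leaves behind.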

Karan is aware of these bugs and will fix them when he gets back from vacation. No PyCBC code changes are needed because of these bugs, so this patch is ready to merge.

@duncan-brown
Contributor Author

@lppekows please can you verify that workflows run successfully on sugar, atlas and orange grid with this patch?

@@ -198,7 +198,10 @@ def __init__(self, cp, name,
                 raise TypeError("Failed to find %s executable "
                                 "at %s" % (name, exe_path))
 
-        self.add_pfn(exe_path)
+        if exe_path.startswith('gsiftp://'):
+            self.add_pfn(exe_path,site='pycbc-code')
Member

I think we've used the convention of using site='nonlocal' for this kind of use.
Also, a small thing. This needs a space after the comma.

Contributor Author

I think it's better to use something more specific here so that it can have an entry in the site catalog. "nonlocal" is a bit too generic.

Contributor

Just so I can understand this better: I never really understood why this needs a site= entry at all, and we do use this elsewhere (i.e. remote frame files if querying a non-local backup frame server). It's a gsiftp link, you get it by running globus-url-copy and that's it. Why do you need to know anything about the remote site that using a generic site="nonlocal" doesn't fulfill? Hell, using site="local" works in that the file will get gsiscp-ed, but then if using the symlink options Pegasus will try to symlink against a location that doesn't exist, and that Pegasus knows doesn't exist (to me that seems a bug, and I don't understand why you/Karan say it is not). I think I don't understand the "site=" property for gsiftp links at all. It should be irrelevant, no?

Using site='pycbc-code' implies only code.phy.syr.edu which seems restrictive.

Member

I agree with you Ian that the handling of the site property is a bit weird,
but here is my best understanding. Pegasus has no knowledge of where data
actually is and doesn't really associate them with "living" on a particular
site. It, by default, assumes that any url it has is valid from wherever it
tries to use it.

The site attribute really only does one thing.

  1. If you request to run on a particular site, and your job needs data, it
    will prefer PFNs that have your site attribute.

However this interacts badly with the force symlinks options.

  1. Force symlinks turns any PFN into a symlink if you run on the same
    site as the pool, and ignores the actual url structure, only looking at the
    folder path.

To me, it is the symlinks option that is broken. It should only transform
file:// urls and fail elsewhere.
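To make the two behaviours described above concrete, here is a toy model in Python. It is only a sketch of the rules as stated in this comment, not Pegasus code; the function names and the (url, site) tuple layout are assumptions made for illustration.

```python
def pick_pfn(pfns, execution_site):
    """Prefer a PFN tagged with the execution site, else take the first.

    pfns is a list of (url, site) tuples. Toy model only, not Pegasus.
    """
    for url, site in pfns:
        if site == execution_site:
            return url, site
    return pfns[0] if pfns else None


def can_symlink(url, pfn_site, execution_site):
    """Rule proposed above: symlink only file:// URLs on the same site."""
    return pfn_site == execution_site and url.startswith("file://")


if __name__ == "__main__":
    replicas = [("gsiftp://example.org/frames/a.gwf", "nonlocal"),
                ("file:///data/frames/a.gwf", "local")]
    print(pick_pfn(replicas, "local"))
    print(can_symlink("gsiftp://example.org/frames/a.gwf", "local", "local"))
```

Under this model a gsiftp:// URL tagged site="local" would still be selected for a local job, but it would be transferred rather than symlinked.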


Contributor

Okay, it sounds like you mostly agree with me, Alex. I take the point that the site= variable can be used to "prefer" one entry over another (although I might want to prefer one non-local entry over another, maybe because the file is "closer" or "quicker" to get at, and I'm not sure you could do that here!). I'm still missing why site="nonlocal" is not the right thing to do, though?

Contributor Author

I agree that ultimately the weirdness is in the Pegasus symlink handling. Symlinking was never a first-class feature in the design of Pegasus; it was added (years ago) to handle the LIGO case where we pre-stage frame files to a particular site. I'm not claiming that the way that Pegasus currently handles it is correct. There are a couple of other bigger issues in Pegasus' data model that I think need revision (for example, whether you have a shared file system or use Condor I/O is defined at the workflow level, not at the site level as it should be). The pool attribute in the static PFN cache was added as a workaround for the frame data, but we now use it for a lot more.

We should have a face to face with the Pegasus team to hash out some of these issues longer term. For now, I'll just change this to nonlocal for symmetry with other use cases.

Contributor Author

Read the last sentence of my response.

@duncan-brown
Contributor Author

I want to add a fix for #513 before merging, so please hold off merging for the moment.

@duncan-brown
Contributor Author

I did one last clean up given the above fix: I removed the project name and core count from the Stampede site catalogs as these can now be specified on the command line with

--append-site-property 'stampede:globus|project:TG-PHY000000' --append-site-property 'stampede:globus|count:256'
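The argument strings above appear to follow a site:namespace|key:value layout. The sketch below splits one such string into its parts; the layout is inferred only from the examples in this thread, and `parse_site_property` is a hypothetical helper, not the parser used by pycbc_submit_dax.

```python
def parse_site_property(spec):
    """Split 'site:namespace|key:value' into its four parts.

    Layout inferred from the examples in this thread; raises ValueError
    if a separator is missing.
    """
    site, rest = spec.split(":", 1)
    namespace_key, value = rest.split(":", 1)
    namespace, key = namespace_key.split("|", 1)
    return site, namespace, key, value


if __name__ == "__main__":
    print(parse_site_property("stampede:globus|project:TG-PHY000000"))
    print(parse_site_property("stampede:globus|count:256"))
```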

This is ready to merge once checks are complete.

@duncan-brown duncan-brown assigned lppekows and unassigned duncan-brown Nov 6, 2015
@duncan-brown
Contributor Author

Test Suite

I have launched two runs to test this branch against the v1.2.5 release. If these two runs complete successfully, then this branch is ready to merge. I have also started a run on OrangeGrid, which is not a pre-requisite for merging, but tests the OSG plumbing.

Reference workflow on sugar-dev3

globus-url-copy -vb gsiftp://pycbc.phy.syr.edu/var/opt/gitlab/ligo-cbc/pycbc-software/v1.2.5/x86_64/composer_xe_2015.0.090/pycbc_make_coinc_search_workflow file://`pwd`/pycbc_make_coinc_search_workflow
globus-url-copy -vb gsiftp://pycbc.phy.syr.edu/var/opt/gitlab/ligo-cbc/pycbc-software/v1.2.5/x86_64/composer_xe_2015.0.090/pycbc_submit_dax file://`pwd`/pycbc_submit_dax
chmod +x pycbc_*
./pycbc_make_coinc_search_workflow --config-files ../config/* --output output --config-overrides 'results_page:output-path:/home/dbrown/public_html/orange-grid/sugar-test-4-v1.2.5' --workflow-name sugar-test-4-v1.2.5
cd output
../pycbc_submit_dax --accounting-group sugwg.osg --dax sugar-test-4-v1.2.5.dax

Reference workflow with v1.2.5 code base

Normal run with code from this development branch on sugar-dev3

source ~/src/pycbc-dev/bin/activate
pycbc_make_coinc_search_workflow --config-files ../config/* --output output --config-overrides 'results_page:output-path:/home/dbrown/public_html/orange-grid/sugar-test-4' --workflow-name sugar-test-4
cd output
pycbc_submit_dax --accounting-group sugwg.osg --dax sugar-test-4.dax

Standard workflow with this branch

Run on OrangeGrid from sugar-dev2

pycbc_make_coinc_search_workflow --config-files ../config/analysis.ini ../config/data.ini ../config/gps_times_O1_analysis_3.ini  ../config/injections_minimal.ini  ../config/plotting.ini file-executables.ini --output output --config-overrides 'pegasus_profile-inspiral:hints|execution.site:orange-grid' 'workflow-main:staging-site:orange-grid=local' 'results_page:output-path:/home/dbrown/public_html/orange-grid/og-nonsharedfs-test-14' --workflow-name og-nonsharedfs-test-14
cd output
pycbc_submit_dax --accounting-group sugwg.osg --execution-sites orange-grid --append-pegasus-property 'pegasus.data.configuration=nonsharedfs' --append-pegasus-property 'pegasus.transfer.bypass.input.staging=true' --cache ../og-frames-c00.cache --dax og-nonsharedfs-test-14.dax --local-gsiftp-server sugar-dev2.phy.syr.edu --append-site-profile 'local:dagman|maxjobs:10000' --append-site-profile 'local:dagman|maxidle:500'
cd submitdir/work/main_ID0000001
for file in `grep -l nogrid *sub | egrep -v "(stage_in|stage_out|create_dir|stage_worker)" | sed 's/\.sub/\.sh/g'` ; do perl -pi.bak -e 's+gsiftp://sugar-dev2.phy.syr.edu+file://+g' $file ; done

Run on OrangeGrid

@duncan-brown
Contributor Author

The testing for regular jobs is working fine, so in principle this could be merged now, but I would like to fix #562 first in the branch, as it's an easy fix.

@duncan-brown
Contributor Author

Tested and this branch is now ready to merge.

@duncan-brown
Contributor Author

Check of nonshared filesystem mode is also complete:

https://sugar-dev2.phy.syr.edu/pegasus/u/dbrown/r/146/w?wf_uuid=35d22d80-44b7-4293-8683-5b57502a9b29

If there are no other comments, I will merge this Monday morning.

@duncan-brown duncan-brown assigned duncan-brown and unassigned lppekows Nov 9, 2015
duncan-brown added a commit that referenced this pull request Nov 9, 2015
Allow PyCBC workflows to be run on OrangeGrid via OSG
@duncan-brown duncan-brown merged commit 2b08ae4 into gwastro:master Nov 9, 2015

4 participants