Skip to content
This repository was archived by the owner on Oct 10, 2019. It is now read-only.

Conversation

@brianhlin
Copy link
Member

Use the pre-existing workdir test function, which catches permission
issues with the workdir as well as non-existence. Do this all before
we write out the job wrapper because why bother?

condor_submit doesn't benefit from this since it doesn't use
bls_add_job_wrapper (or any other common submit functions for that matter)

@edquist @PerilousApricot

Use the pre-existing workdir test function, which catches permission
issues with the workdir as well as non-existence. Do this all before
we write out the job wrapper because why bother?

condor_submit doesn't benefit from this since it doesn't use
bls_add_job_wrapper (or any other common submit functions for that matter)
@brianhlin brianhlin requested a review from djw8605 February 28, 2018 21:38
@brianhlin
Copy link
Member Author

ping @edquist

Copy link
Member

@djw8605 djw8605 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems ok to me. This would put the job on hold, of course.

if ! bls_fl_test_exists inputsand ; then
echo "Input sandbox file doesn't exist: $bls_fl_test_exists_result" >&2
echo Error # for the sake of waiting fgets in blahpd
rm -rf $bls_tmp_file
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be rm -f not rm -rf ... Both as a matter of safety, and (as it turns out) also to match code in this same function.

Also, separately, should $bls_tmp_file be quoted here? My read is that $bls_tmp_file contains $bls_opt_temp_dir, which is a command line argument for the -T option -- that is to say, it's under user control, and could in theory contain whitespace or shell metacharacters. But i realize that there are several other places in this script where that ought to be fixed, which could perhaps be done as a separate PR.

bls_test_input_files
bls_start_job_wrapper >> $bls_tmp_file
bls_finish_job_wrapper >> $bls_tmp_file
bls_test_working_dir
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that previously, bls_test_working_dir was getting run after $bls_tmp_file was written to, and then deleting $bls_tmp_file upon error. Now bls_test_input_files is run beforehand, but still removing the file on error. Not that it's a problem, but does this file get created somewhere else first? Or is this spot in bls_add_job_wrapper the first?

Copy link
Member Author

@brianhlin brianhlin Mar 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the description of the PR:

Use the pre-existing workdir test function, which catches permission
issues with the workdir as well as non-existence. Do this all before
we write out the job wrapper because why bother?

But yes, bls_tmp_file gets created in bls_setup_all_files, which is called long before bls_add_job_wrapper

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess my main question and point (which i did not find the answer to in the PR description) are:

does this file [$bls_tmp_file] get created somewhere else first? Or is this spot in bls_add_job_wrapper the first?

Because:

  • if it gets created somewhere else first, then it still makes sense to delete it in bls_test_input_files
  • if the above two lines in bls_add_job_wrapper are the first place where $bls_tmp_file is created/written to, then it is superfluous to do rm -f $bls_tmp_file in bls_test_input_files, as it will never yet exist at that point.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, then it makes sense as-is.

# Ensure local files actually exist. When called before job submission, this prevents
# unnecessary churn on the scheduler if the files don't exist.
if ! bls_fl_test_exists inputsand ; then
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously, this bls_fl_test_exists inputsand check was only in slurm_submit.sh.

Just to confirm: is it intentional that now we will do this check in the other submit scripts also? ({blah,lsf,pbs,sge}_submit.sh)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, we want to apply these checks across all batch system submissions

Copy link

@edquist edquist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved!

With a note that other instances of $bls_tmp_file throughout the script could stand to be quoted, possibly in a future PR.

@PerilousApricot
Copy link

@djw8605 Actually, having the job on hold isn't the desired behavior. If the work dir/input files don't exist, the job is a big fail and should go away. (The inputs probably won't magically appear later)

@edquist
Copy link

edquist commented Mar 13, 2018

Disapproved! (As Moate would say.)

@PerilousApricot
Copy link

That may be a condor-ce issue. Blahp is "doing the right thing" and returning a non-zero exit code

@brianhlin
Copy link
Member Author

I'll have to double-check with Jaime but I'm pretty that if grid job submission fails, HTCondor will put it on hold. I imagine the Slurm-specific solution that you're running puts the jobs on hold as well.

We could possibly add periodic removal expressions that check for these types of held jobs to the HTCondor-CE.

@brianhlin
Copy link
Member Author

Speaking with Jaime, blahp failure always yields a held job on the HTCondor side so that's the best we can do here. Merging.

@brianhlin brianhlin merged commit 1f6ad40 into opensciencegrid:v1_18_bosco Mar 15, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants