-
Notifications
You must be signed in to change notification settings - Fork 12
Check workdir and input files before job submission (SOFTWARE-3154) #69
Conversation
Use the pre-existing workdir test function, which catches permission issues with the workdir as well as non-existence. Do this all before we write out the job wrapper because why bother? condor_submit doesn't benefit from this since it doesn't use bls_add_job_wrapper (or any other common submit functions for that matter)
|
ping @edquist |
djw8605
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems ok to me. This would put the job on hold, of course.
| if ! bls_fl_test_exists inputsand ; then | ||
| echo "Input sandbox file doesn't exist: $bls_fl_test_exists_result" >&2 | ||
| echo Error # for the sake of waiting fgets in blahpd | ||
| rm -rf $bls_tmp_file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be rm -f not rm -rf ... Both as a matter of safety, and (as it turns out) also to match code in this same function.
Also, separately, should $bls_tmp_file be quoted here? My read is that $bls_tmp_file contains $bls_opt_temp_dir, which is a command line argument for the -T option -- that is to say, it's under user control, and could in theory contain whitespace or shell metacharacters. But i realize that there are several other places in this script where that ought to be fixed, which could perhaps be done as a separate PR.
| bls_test_input_files | ||
| bls_start_job_wrapper >> $bls_tmp_file | ||
| bls_finish_job_wrapper >> $bls_tmp_file | ||
| bls_test_working_dir |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that previously, bls_test_working_dir was getting run after $bls_tmp_file was written to, and then deleting $bls_tmp_file upon error. Now bls_test_input_files is run beforehand, but still removing the file on error. Not that it's a problem, but does this file get created somewhere else first? Or is this spot in bls_add_job_wrapper the first?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the description of the PR:
Use the pre-existing workdir test function, which catches permission
issues with the workdir as well as non-existence. Do this all before
we write out the job wrapper because why bother?
But yes, bls_tmp_file gets created in bls_setup_all_files, which is called long before bls_add_job_wrapper
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess my main question and point (which i did not find the answer to in the PR description) are:
does this file [
$bls_tmp_file] get created somewhere else first? Or is this spot in bls_add_job_wrapper the first?
Because:
- if it gets created somewhere else first, then it still makes sense to delete it in
bls_test_input_files - if the above two lines in
bls_add_job_wrapperare the first place where$bls_tmp_fileis created/written to, then it is superfluous to dorm -f $bls_tmp_fileinbls_test_input_files, as it will never yet exist at that point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, then it makes sense as-is.
| # Ensure local files actually exist. When called before job submission, this prevents | ||
| # unnecessary churn on the scheduler if the files don't exist. | ||
| if ! bls_fl_test_exists inputsand ; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously, this bls_fl_test_exists inputsand check was only in slurm_submit.sh.
Just to confirm: is it intentional that now we will do this check in the other submit scripts also? ({blah,lsf,pbs,sge}_submit.sh)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, we want to apply these checks across all batch system submissions
edquist
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved!
With a note that other instances of $bls_tmp_file throughout the script could stand to be quoted, possibly in a future PR.
|
@djw8605 Actually, having the job on hold isn't the desired behavior. If the work dir/input files don't exist, the job is a big fail and should go away. (The inputs probably won't magically appear later) |
|
Disapproved! (As Moate would say.) |
|
That may be a condor-ce issue. Blahp is "doing the right thing" and returning a non-zero exit code |
|
I'll have to double-check with Jaime but I'm pretty that if grid job submission fails, HTCondor will put it on hold. I imagine the Slurm-specific solution that you're running puts the jobs on hold as well. We could possibly add periodic removal expressions that check for these types of held jobs to the HTCondor-CE. |
|
Speaking with Jaime, blahp failure always yields a held job on the HTCondor side so that's the best we can do here. Merging. |
Use the pre-existing workdir test function, which catches permission
issues with the workdir as well as non-existence. Do this all before
we write out the job wrapper because why bother?
condor_submit doesn't benefit from this since it doesn't use
bls_add_job_wrapper (or any other common submit functions for that matter)
@edquist @PerilousApricot