HTCONDOR-2344 Better skip_if_dataflow documentation #2350
Adding the complete example is good, as is mentioning the conditions for a dataflow job to be skipped early.
I like the mention of DAGs at the end.
Dataflow Jobs
'''''''''''''

A **dataflow job** is a job that might not need to run because its desired
outputs already exist. To skip such a job, add the following line to your
submit file: :index:`dataflow<single: arguments; example>`
outputs already exist, and are more up-to-date than the input files.
- outputs already exist, and are more up-to-date than the input files.
+ outputs already exist, and don't need to be recomputed.
skip_if_dataflow = True

queue

A dataflow job meets any of the following criteria:
This sentence doesn't make any sense, and the whole concept is compromised by calling it "skip-if-dataflow"; "dataflow" makes sense as a type of job, not the state of a job (e.g., already completed). We can't change the submit command name at this point, but this makes the wordsmithing harder and more important.
We can say two different things here: "HTCondor assumes that you've specified all of your job's inputs and outputs; if you haven't, but set :subcom:`skip_if_dataflow` anyway, HTCondor could skip your job even if it should have been re-run." Or we can say "these are the technical conditions which require a job to be re-run/allow it to be skipped."
Actually, we should probably say both, but separately.
skip_if_dataflow = True

queue

A dataflow job meets any of the following criteria:
Maybe
- A dataflow job meets any of the following criteria:
+ HTCondor assumes that you've specified all of your job's inputs and outputs; if you haven't, but set :subcom:`skip_if_dataflow` anyway, HTCondor could skip your job even if it should have been re-run.
+ All of the following conditions must be true for HTCondor to skip the job:
?
* Output files exist, are newer than input files
* Execute file is newer than input files
* Standard input file is newer than input files
These don't seem like the right conditions, but I remember being confused by this before. It seems like the conditions ought to be:
- Output files exist and are younger than (the youngest of) the input files, the standard input file, and the executable.
However, the existing list matches what the code actually does, where "A dataflow job meets any of the following criteria" means "The job is skipped if any of the following are true."
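For concreteness, here is a minimal Python sketch of the single condition proposed in this comment: every output exists and is newer than the newest of the input files, the standard input file, and the executable. This is not HTCondor's implementation. The function name `can_skip` and its parameters are hypothetical, and the sketch assumes file modification times are the only freshness signal.

```python
import os
from typing import Iterable, Optional


def _mtime(path: str) -> Optional[float]:
    """Return the modification time of path, or None if it does not exist."""
    try:
        return os.path.getmtime(path)
    except OSError:
        return None


def can_skip(outputs: Iterable[str], inputs: Iterable[str],
             executable: str, stdin_file: Optional[str] = None) -> bool:
    """Hypothetical check: True only if every output exists and is newer
    than the newest of the inputs, the standard input file, and the executable."""
    output_times = [_mtime(p) for p in outputs]
    if not output_times or any(t is None for t in output_times):
        return False  # a missing (or undeclared) output means the job must run

    deps = list(inputs) + [executable]
    if stdin_file is not None:
        deps.append(stdin_file)
    dep_times = [_mtime(p) for p in deps]
    if any(t is None for t in dep_times):
        return False  # cannot reason about a dependency that is not there

    # The oldest output must still be newer than the newest dependency.
    return min(output_times) > max(dep_times)
```

A call might look like `can_skip(["out.dat"], ["in.dat"], "./analyze")`, where the file names are placeholders. Note that the check only sees the files it is given: an input that is not listed cannot make the result `False`, which is the caveat raised earlier in this thread about jobs whose inputs and outputs are not fully specified.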
This has been merged into HTCONDOR-1899; we're going to simplify the documentation by fixing the code (a little).
<Insert PR description here, and leave checklist below for code review.>
HTCondor Pull Request Checklist for internal reviewers
After the above