-
Notifications
You must be signed in to change notification settings - Fork 12
DM-53019: initial support for tracking retries in provenance #529
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #529 +/- ##
========================================
Coverage 88.93% 88.93%
========================================
Files 150 150
Lines 20634 20779 +145
Branches 2456 2464 +8
========================================
+ Hits 18350 18479 +129
- Misses 1702 1714 +12
- Partials 582 586 +4 ☔ View full report in Codecov by Sentry. |
319866a to
1898a95
Compare
The provenance QG can't know whether the dataset *still* exists, so this is hopefully less confusing.
1898a95 to
24690ae
Compare
MichelleGower
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple typos and a question about previous_process_quanta.
Not part of the code changes, but while testing I found it confusing that the ProvenanceQuantumGraphReader.read* functions return the updated reader instead of whatever the function was to read. Returning None might be less confusing.
Merge approved.
| # later via Pydantic; don't want one weird value bringing down | ||
| # our ability to report on an entire run. | ||
| if isinstance(v, float | int | str | bool): | ||
| result.metadata[k] = v |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might we want a debugging message about the skipped metadata?
python/lsst/pipe/base/log_capture.py
Outdated
| raise | ||
| else: | ||
| # If the quantum succeeded, we don't need to save the | ||
| # metadata in the logs, becauase we'll have saved them |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: becauase -> because
python/lsst/pipe/base/_status.py
Outdated
| FAILED = -1 | ||
| """Execution of the quantum failed. | ||
| This is always set if the task metadata dataset was not written but logs, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but the log was?
python/lsst/pipe/base/_status.py
Outdated
| This is always set if the task metadata dataset was not written but logs, | ||
| as is the case when a Python exception is caught and handled by the | ||
| execution system. It may also be set in cases where log were not written |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the log was not written?
| """ | ||
|
|
||
| attempts: list[ProvenanceQuantumAttemptModel] | ||
| """Information about each attempt to run this quantum. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this includes the last attempt as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. That just seemed simpler, even though some of the attributes of the last attempt are lifted up to siblings of this attribute for convenience (note that this lifting is not done for the corresponding storage model).
python/lsst/pipe/base/log_capture.py
Outdated
| """ | ||
|
|
||
| previous_process_quanta: list[uuid.UUID] = pydantic.Field(default_factory=list) | ||
| """The IDs of other quanta executed in the same process as this one. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would include "previously" in the description as well since the quanta executed in same process after this one aren't included.
This seems like a bunch of repetitive information when clustering (or straight pipetask run). Same comment for this entry in quantum_provenance_graph.py.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 to "previously".
I admit I was just thinking about small clusters, not what happens when you execute a big graphs in one process - though I suppose there's a practical upper limit when you're talking about a single process (i.e. not even pipetask -j, which spawns/forks a new process for every quantum).
But I also don't see a great alternative. The SingleQuantumExecutor that's writing this information doesn't know the order that we will try to run quanta, and if we don't write anything until the last quantum, we run the risk of a failure that prevents any information from being written (and since quanta don't have any linkage to what's run later in the process, we wouldn't have a way to get from a middle quantum to the list of what was run first).
This assumes we reuse the SingleQuantumExector instance, but that's what we do in MPGraphExecutor (when it runs in one process) and SimplePipelineExecutor, and everything else uses one of those two.
Writing task metadata datasets if and only if we succeed is an important invariant of the execution system that would be difficult to change, but we still want to record failure metadata for provenance (for resource usage at least). So we stuff the metadata in the log datasets instead.
This involves refactoring the Scanner class a bit to separate the concept of metadata existence from success. Note that metadata _dataset_ existence does still indicate success, so we don't ingest metadata for failures; instead they'll only be available as components of provenance datasets, once provenance ingest is implemented.
This factors the per-attempt information from the main quantum provenance data structures so we can store it in lists, and it modifies the low-level interfaces for reading logs and metadata to let those return lists as well. Right now this is set up to to be populated only via the extra information SingleQuantumExecutor stuffs into the log datasets (in practice, that means BPS auto-retries), but we should be able to use it to represent post-provenance-ingest graph-level restarts in the future.
dcfaeae to
e5cba2d
Compare
The quantum_provenance_graph (i.e. the implementation for pipetask report v2) will be deprecated after the new provenance system reaches feature-parity, so we don't want to be importing types from that module in the new system. - ExceptionInfo has just moved without change to _status.py, where we define the exceptions it's designed to serialize. - QuantumRunStatus has been copied to _status.py and renamed to QuantumAttemptStatus, with a small change to one enum variant to make its meaning more intuitive and docs that reflect the precise meanings in the new provenance system.
e5cba2d to
dbd4413
Compare
Checklist
package-docs builddoc/changes