DM-53019: initial support for tracking retries in provenance #529

TallJimbo · 2025-11-05T04:11:37Z

Checklist

ran Jenkins
ran and inspected package-docs build
added a release note for user-visible changes to doc/changes

codecov · 2025-11-05T04:21:06Z

Codecov Report

❌ Patch coverage is 91.07143% with 35 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.93%. Comparing base (c57584a) to head (dbd4413).
⚠️ Report is 18 commits behind head on main.
✅ All tests successful. No failed tests found.

Files with missing lines	Patch %	Lines
...sst/pipe/base/quantum_graph/aggregator/_scanner.py	82.05%	7 Missing and 7 partials ⚠️
...st/pipe/base/quantum_graph/aggregator/_progress.py	68.42%	10 Missing and 2 partials ⚠️
python/lsst/pipe/base/_status.py	90.90%	1 Missing and 2 partials ⚠️
python/lsst/pipe/base/single_quantum_executor.py	85.00%	1 Missing and 2 partials ⚠️
python/lsst/pipe/base/quantum_graph/_provenance.py	96.77%	0 Missing and 2 partials ⚠️
python/lsst/pipe/base/log_capture.py	97.36%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@           Coverage Diff            @@
##             main     #529    +/-   ##
========================================
  Coverage   88.93%   88.93%            
========================================
  Files         150      150            
  Lines       20634    20779   +145     
  Branches     2456     2464     +8     
========================================
+ Hits        18350    18479   +129     
- Misses       1702     1714    +12     
- Partials      582      586     +4



MichelleGower

A couple of typos and a question about previous_process_quanta.

Not part of the code changes, but while testing I found it confusing that the ProvenanceQuantumGraphReader.read* functions return the updated reader rather than whatever the function was supposed to read. Returning None might be less confusing.

Merge approved.
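To illustrate the reader behaviour discussed in this review, here is a minimal sketch of a reader whose read method populates its own state and returns None. `SketchReader` and `read_quanta` are hypothetical names for illustration only, not the actual ProvenanceQuantumGraphReader API.

```python
from dataclasses import dataclass, field


@dataclass
class SketchReader:
    """Hypothetical stand-in for a reader that accumulates state as it reads."""

    # Pretend on-disk content, keyed by component name (an assumption).
    source: dict[str, list[str]]
    # What has been read so far; callers inspect this attribute afterwards.
    quanta: list[str] = field(default_factory=list)

    def read_quanta(self) -> None:
        # Populate the reader's own state and return None, so the return
        # value cannot be mistaken for the thing that was read.
        self.quanta.extend(self.source.get("quanta", []))


reader = SketchReader(source={"quanta": ["q1", "q2"]})
result = reader.read_quanta()
```

With this shape the caller reads the data from `reader.quanta`, and `result` carries no information.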

MichelleGower · 2025-11-07T00:56:09Z

python/lsst/pipe/base/_status.py

+                # later via Pydantic; don't want one weird value bringing down
+                # our ability to report on an entire run.
+                if isinstance(v, float | int | str | bool):
+                    result.metadata[k] = v


Might we want a debugging message about the skipped metadata?

MichelleGower · 2025-11-10T21:51:47Z

python/lsst/pipe/base/log_capture.py

+                    raise
+                else:
+                    # If the quantum succeeded, we don't need to save the
+                    # metadata in the logs, becauase we'll have saved them


typo: becauase -> because

MichelleGower · 2025-11-10T22:22:48Z

python/lsst/pipe/base/_status.py

+    FAILED = -1
+    """Execution of the quantum failed.
+
+    This is always set if the task metadata dataset was not written but logs,


but the log was?

MichelleGower · 2025-11-10T22:23:18Z

python/lsst/pipe/base/_status.py

+
+    This is always set if the task metadata dataset was not written but logs,
+    as is the case when a Python exception is caught and handled by the
+    execution system.  It may also be set in cases where log were not written


the log was not written?

MichelleGower · 2025-11-11T00:01:47Z

python/lsst/pipe/base/quantum_graph/_provenance.py

+    """
+
+    attempts: list[ProvenanceQuantumAttemptModel]
+    """Information about each attempt to run this quantum.


So this includes the last attempt as well?

Yes. That just seemed simpler, even though some of the attributes of the last attempt are lifted up to siblings of this attribute for convenience (note that this lifting is not done for the corresponding storage model).

MichelleGower · 2025-11-11T18:55:38Z

python/lsst/pipe/base/log_capture.py

+    """
+
+    previous_process_quanta: list[uuid.UUID] = pydantic.Field(default_factory=list)
+    """The IDs of other quanta executed in the same process as this one.


I would include "previously" in the description as well, since the quanta executed in the same process after this one aren't included.

This seems like a bunch of repetitive information when clustering (or straight pipetask run). Same comment for this entry in quantum_provenance_graph.py.

👍 to "previously".

I admit I was just thinking about small clusters, not what happens when you execute a big graph in one process - though I suppose there's a practical upper limit when you're talking about a single process (i.e. not even pipetask -j, which spawns/forks a new process for every quantum).

But I also don't see a great alternative. The SingleQuantumExecutor that's writing this information doesn't know the order that we will try to run quanta, and if we don't write anything until the last quantum, we run the risk of a failure that prevents any information from being written (and since quanta don't have any linkage to what's run later in the process, we wouldn't have a way to get from a middle quantum to the list of what was run first).

This assumes we reuse the SingleQuantumExecutor instance, but that's what we do in MPGraphExecutor (when it runs in one process) and SimplePipelineExecutor, and everything else uses one of those two.
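The trade-off described in this thread can be sketched as follows, assuming a reused executor object that remembers what it has already run. `ExecutorSketch` and its `execute` method are hypothetical and only model the bookkeeping, not task execution.

```python
import uuid


class ExecutorSketch:
    """Hypothetical executor that is reused for every quantum in a process."""

    def __init__(self) -> None:
        self._executed: list[uuid.UUID] = []

    def execute(self, quantum_id: uuid.UUID) -> dict[str, list[uuid.UUID]]:
        # Snapshot only the quanta that ran *before* this one; the executor
        # cannot know what will run later in the same process.
        record = {"previous_process_quanta": list(self._executed)}
        self._executed.append(quantum_id)
        return record


ids = [uuid.uuid4() for _ in range(3)]
executor = ExecutorSketch()
records = [executor.execute(quantum_id) for quantum_id in ids]
```

Each record is written immediately, so a later failure cannot lose it. The cost is the repetition noted above: the k-th quantum stores k-1 IDs, so the total grows quadratically with the number of quanta run in one process.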

Writing task metadata datasets if and only if we succeed is an important invariant of the execution system that would be difficult to change, but we still want to record failure metadata for provenance (for resource usage at least). So we stuff the metadata in the log datasets instead.

This involves refactoring the Scanner class a bit to separate the concept of metadata existence from success. Note that metadata _dataset_ existence does still indicate success, so we don't ingest metadata for failures; instead they'll only be available as components of provenance datasets, once provenance ingest is implemented.

This factors the per-attempt information from the main quantum provenance data structures so we can store it in lists, and it modifies the low-level interfaces for reading logs and metadata to let those return lists as well. Right now this is set up to be populated only via the extra information SingleQuantumExecutor stuffs into the log datasets (in practice, that means BPS auto-retries), but we should be able to use it to represent post-provenance-ingest graph-level restarts in the future.

The quantum_provenance_graph (i.e. the implementation for pipetask report v2) will be deprecated after the new provenance system reaches feature-parity, so we don't want to be importing types from that module in the new system.
- ExceptionInfo has just moved without change to _status.py, where we define the exceptions it's designed to serialize.
- QuantumRunStatus has been copied to _status.py and renamed to QuantumAttemptStatus, with a small change to one enum variant to make its meaning more intuitive and docs that reflect the precise meanings in the new provenance system.

TallJimbo force-pushed the tickets/DM-53019 branch 5 times, most recently from 319866a to 1898a95 Compare November 5, 2025 16:52

TallJimbo marked this pull request as ready for review November 5, 2025 21:09

TallJimbo added 4 commits November 7, 2025 11:43

Add provenance dataset UUID to QG headers.

e944877

Refactor aggregator progress-tracking.

b626999

Rename provenance datasets 'exists' field to 'produced'.

a2ad345

The provenance QG can't know whether the dataset *still* exists, so this is hopefully less confusing.

Adapt to upstream ButlerLogRecords API changes.

75215bd

TallJimbo force-pushed the tickets/DM-53019 branch from 1898a95 to 24690ae Compare November 7, 2025 16:44

MichelleGower approved these changes Nov 11, 2025


TallJimbo added 7 commits November 11, 2025 23:08

Store past logs and exception information in logs.

bd65346

Record the IDs of other quanta executed in the same process.

e8f4fa6

This assumes we reuse the SingleQuantumExecutor instance, but that's what we do in MPGraphExecutor (when it runs in one process) and SimplePipelineExecutor, and everything else uses one of those two.

Fix stray character in docstring.

fb6459a

Add tests for storing and reporting previous-process quanta.

c0a6623

TallJimbo force-pushed the tickets/DM-53019 branch from dcfaeae to e5cba2d Compare November 12, 2025 04:08

TallJimbo added 6 commits November 12, 2025 15:20

Add changelog entry.

0d48620

Move __all__ to our standard location.

41abd99

Log when not propagating exception metadata.

9ee6525

Fix duplicate-read bug in ProvenanceQuantumGraphReader.

bc6e924

Return None instead of self in new QG readers.

dbd4413

TallJimbo force-pushed the tickets/DM-53019 branch from e5cba2d to dbd4413 Compare November 12, 2025 20:20

TallJimbo merged commit aa98aac into main Nov 12, 2025
22 checks passed

TallJimbo deleted the tickets/DM-53019 branch November 12, 2025 20:25
