DM-35051: Allow Prompt Processing to set up a local repo with existing files #23
A handful of comments, mostly requesting clarification.
This looks a lot more robust.
python/activator/activator.py
Outdated
if len(expid_set) < expected_visit.snaps:
    _log.warning(f"Processing {len(expid_set)} snaps, expected {expected_visit.snaps}.")
_log.info(f"Running pipeline on group: {expected_visit.group} "
          f"detector: {expected_visit.detector}")
Here's another place for that log-formatter on Visit.
The logger that we use in pipetask has an "MDC" that lets you assign a dict to the logger such that every log message reports that dict without having to explicitly add it each time. It's why every message includes the dataId. Is that something that would be useful here?
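As a rough stdlib analogue of the MDC idea described above (this is not the ButlerMDC API, just a sketch using `logging.LoggerAdapter`; the logger name and context values are illustrative), a context dict can be attached once and prefixed onto every message:

```python
import logging

# Sketch only: a LoggerAdapter carrying an MDC-like context dict,
# so every message reports the context without repeating it by hand.
class MDCAdapter(logging.LoggerAdapter):
    def process(self, msg, kwargs):
        context = " ".join(f"{k}={v}" for k, v in self.extra.items())
        return f"[{context}] {msg}", kwargs

logging.basicConfig(format="%(message)s")
base = logging.getLogger("activator")
log = MDCAdapter(base, {"group": "2022-08-01T12:00", "detector": 42})
# The context is prepended automatically:
log.warning("Processing 1 snaps, expected 2.")
```

Like a classic MDC, the context here is tied to the adapter object rather than to the underlying logger, so each worker (or visit) would need its own adapter.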
I can't find any mention of MDC in the Python logging package (am I looking in the wrong place?), but in other loggers I've worked with, MDC is attached to a thread or process. That's a problem here because none of the components are actually associated with a particular visit or even detector. Even each MiddlewareInterface object is subject to arbitrary reuse (perhaps a design flaw we should fix?), which is why the main content of this PR is basically a bunch of safety-checking code.
[Edit: I remembered the code wrong; there are in fact three Butlers for each worker repo. Each worker still has the same PID, though.]
MDC is not a Python logging feature. It's something we added to match lsst.log; it lives in lsst.daf.butler.core.logging.ButlerMDC. If you are running multi-threaded, though, the MDC is not going to work. pipetask always uses one subprocess per job; we didn't put the effort in to make it multi-threaded.
@timj : this is not connected to pipetask at all.
I wasn't imagining it would be. I was making a general comment about there existing a way to attach metadata to a logger so that every log message gets the content without the person having to explicitly include it in every log message. I was mentioning pipetask as an example of where we use the MDC -- it's where all those dataId messages come from in long-log mode. If you are using threads then it's not going to help without patching the MDC class to understand threads.
Actually, maybe we don't need the ButlerMDC system. We might be able to do exactly what we need if we use a custom handler and Handler.setFormatter (with suitable resource guards). That might let us clean up some of the duplicate formatting code while we're at it.
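The setFormatter route suggested here might look something like the sketch below. It uses only stdlib `logging`; the `visit_label` attribute name and the fallback text are hypothetical, and a real version would need the resource guards mentioned above (this formatter mutates the record, so it must not be shared across handlers):

```python
import logging

# Sketch: one formatter injects a per-visit label taken from a record
# attribute, falling back gracefully when the attribute is absent.
class VisitFormatter(logging.Formatter):
    def format(self, record):
        label = getattr(record, "visit_label", "unknown visit")
        record.msg = f"{label}: {record.msg}"  # note: mutates the record
        return super().format(record)

handler = logging.StreamHandler()
handler.setFormatter(VisitFormatter("%(levelname)s %(message)s"))
log = logging.getLogger("activator")
log.addHandler(handler)
# Callers supply the label through `extra`, once per call site:
log.warning("pipeline started", extra={"visit_label": "group 42 det 7"})
```

This keeps the formatting logic in one place instead of duplicating the group/detector boilerplate in every log call.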
MDC-ing the loggers is deferred to DM-35828.
    The datasets that exist in ``src_repo`` but not ``dest_repo``.
    """
    try:
        # TODO: storing this set in memory may be a performance bottleneck.
Why a bottleneck here? Is it just from the iterator->set conversion?
Yes, specifically the need to create all the prior data IDs, including the ones whose existence we don't need to check (because we don't need them for the upcoming run).
Oh, and the fact that this set will stick around at least until the iterator is exhausted. 🙂
""" | ||
try: | ||
# TODO: storing this set in memory may be a performance bottleneck. | ||
# In general, expect dest_repo to have more datasets than src_repo. |
Does anything go wrong if that expectation is violated?
Well, it's violated every time we use a fresh repo!
I meant this sentence to be a continuation of the previous one: not only are we materializing one of the sets, it's the bigger one (given an old enough repo).
I don't know of any algorithm for a set difference that's streaming on both sets, unless they're sorted (which queryDatasets does not guarantee), and it sounds like nobody else does, either.
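To make the trade-off being discussed concrete, here is a minimal sketch (generic names, not the PR's actual code) of a one-sided-streaming difference: only one side is materialized as a set, while the other is streamed.

```python
def missing_datasets(src_iter, dest_iter):
    """Yield items of src_iter that are absent from dest_iter.

    Only dest_iter is held in memory; src_iter is streamed.
    """
    seen = set(dest_iter)  # the side we must materialize
    for item in src_iter:
        if item not in seen:
            yield item

# If the materialized side is the larger one (e.g. an old repo),
# `seen` dominates memory use -- the bottleneck noted in the TODO.
print(list(missing_datasets([1, 2, 3, 4], [2, 4])))  # [1, 3]
```

Streaming both sides would require both inputs to arrive sorted, which is exactly the guarantee queryDatasets does not provide.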
Ah, ok. Maybe tweak the comment a bit with the above information somehow? It wasn't entirely clear to me that the second line was a continuation of that though.
Actually, I deleted it. I think it is impossible for the dest_repo results to be bigger than the src_repo ones (since the two queries use the same arguments, a bigger result would imply that there is something in dest_repo calibs/templates/refcats that did not come from src_repo). It also means the dest_repo results won't grow without bound, so no memory problems.
Multiple visits and detectors are more representative of the kinds of double-registrations we'd expect in deployment.
Detector number is useful for tracking parallel processing, but can't be put in the log labels because neither processes nor objects are uniquely associated with a detector.
This PR adds multiple-run support to MiddlewareInterface, and enhances the logs with information that proved useful in debugging. It also does some refactoring of the existing code. The multiple-run support is not completely unit-tested, as it is sensitive to parallel-computing issues that are hard to reproduce locally.