
Conversation

@kfindeisen (Member) commented Jul 14, 2022

This PR adds multiple-run support to MiddlewareInterface, and enhances the logs with information that proved useful in debugging. It also does some refactoring of the existing code.

The multiple-run support is not completely unit-tested, as it is sensitive to parallel computing issues that are hard to reproduce locally.

@kfindeisen force-pushed the tickets/DM-35051 branch 2 times, most recently from d368aa8 to 06288b4 (July 18, 2022 21:51)
@kfindeisen force-pushed the tickets/DM-35051 branch 4 times, most recently from f724391 to a29b752 (July 25, 2022 18:35)
@kfindeisen force-pushed the tickets/DM-35051 branch 2 times, most recently from 6e2bc73 to a8a6da7 (July 27, 2022 20:25)
@kfindeisen requested a review from parejkoj July 27, 2022 22:19
@kfindeisen marked this pull request as ready for review July 27, 2022 22:19
@parejkoj (Contributor) left a comment

A handful of comments, mostly requesting clarification.

This looks a lot more robust.

if len(expid_set) < expected_visit.snaps:
    _log.warning(f"Processing {len(expid_set)} snaps, expected {expected_visit.snaps}.")
_log.info(f"Running pipeline on group: {expected_visit.group} "
          f"detector: {expected_visit.detector}")
Contributor:

Here's another place for that log-formatter on Visit.
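
A minimal sketch of such a helper; the name describe_visit is invented, and only the Visit fields visible in the snippet above are assumed:

# Hypothetical helper: build the recurring "group ... detector ..." text in
# one place instead of repeating the f-string at each call site.
def describe_visit(visit):
    return f"group: {visit.group} detector: {visit.detector}"

# The info call above would then become, e.g.:
#     _log.info("Running pipeline on %s", describe_visit(expected_visit))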

Member:

The logger that we use in pipetask has an "MDC" that lets you assign a dict to the logger such that every log message reports that dict without having to explicitly add it each time. It's why every message includes the dataId. Is that something that would be useful here?
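
A minimal sketch of the MDC idea using only the standard logging module; this illustrates the pattern, not the ButlerMDC or pipetask API, and the context keys are invented:

import logging

_mdc = {}  # shared context; in a real MDC this would be scoped per thread or process

_old_factory = logging.getLogRecordFactory()

def _mdc_factory(*args, **kwargs):
    # Copy the current MDC contents onto every LogRecord as it is created.
    record = _old_factory(*args, **kwargs)
    record.mdc = dict(_mdc)
    return record

logging.setLogRecordFactory(_mdc_factory)
logging.basicConfig(format="%(levelname)s %(mdc)s %(message)s")

_mdc.update(group="2022-08-03T00:00:00", detector=42)  # invented visit metadata
logging.getLogger("activator").warning("Processing 1 snap, expected 2.")
# WARNING {'group': '2022-08-03T00:00:00', 'detector': 42} Processing 1 snap, expected 2.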

@kfindeisen (Member, author), Aug 3, 2022:

I can't find any mention of MDC in the Python logging package (am I looking in the wrong place?), but in other loggers I've worked with, MDC is attached to a thread or process. That's a problem here because none of the components are actually associated with a particular visit or even detector. Even each MiddlewareInterface object is subject to arbitrary reuse (perhaps a design flaw we should fix?), which is why the main content of this PR is basically a bunch of safety-checking code.

[Edit: I remembered the code wrong; there are in fact three Butlers for each worker repo. Each worker still has the same PID, though.]

Member:

MDC is not a Python logging feature; it's something we added to match lsst.log, and it lives in lsst.daf.butler.core.logging.ButlerMDC. If you are running multi-threaded, though, the MDC is not going to work: pipetask always uses one subprocess per job, so we didn't put in the effort to make it thread-aware.

Contributor:

@timj: this is not connected to pipetask at all.

Member:

I wasn't imagining it would be. I was making the general point that there is a way to attach metadata to a logger so that every log message gets that content without anyone having to include it explicitly each time; pipetask was just an example of where we use the MDC -- it's where all those dataId messages come from in long-log mode. If you are using threads, though, it won't help without patching the MDC class to understand threads.

@kfindeisen (Member, author), Aug 3, 2022:

Actually, maybe we don't need the ButlerMDC system. We might be able to do exactly what we need if we use a custom handler and Handler.setFormatter (with suitable resource guards). That might let us clean up some of the duplicate formatting code while we're at it.
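
A rough sketch of that approach, assuming a handler-level formatter that carries the active visit; the Visit fields and the handler wiring shown here are assumptions, not the merged code:

import logging

class VisitFormatter(logging.Formatter):
    """Prefix every record with the currently active visit's group/detector."""

    def __init__(self, visit):
        super().__init__("%(levelname)s (group=%(group)s, detector=%(detector)s) %(message)s")
        self._visit = visit

    def format(self, record):
        # Inject the visit metadata so the format string above can reference it.
        record.group = self._visit.group
        record.detector = self._visit.detector
        return super().format(record)

# When a worker starts handling a new visit, swap the formatter on its handler
# and restore the old one afterwards (the "suitable resource guards"):
#     old = handler.formatter
#     handler.setFormatter(VisitFormatter(expected_visit))
#     try:
#         ...run the pipeline...
#     finally:
#         handler.setFormatter(old)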

Member (author):

MDC-ing the loggers is deferred to DM-35828.

        The datasets that exist in ``src_repo`` but not ``dest_repo``.
    """
    try:
        # TODO: storing this set in memory may be a performance bottleneck.
Contributor:

Why a bottleneck here? Is it just from the iterator->set conversion?

@kfindeisen (Member, author), Aug 3, 2022:

Yes, specifically the need to create all the prior data IDs, including the ones whose existence we don't need to check (because we don't need them for the upcoming run).

Oh, and the fact that this set will stick around at least until the iterator is exhausted. 🙂
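
The pattern under discussion is roughly the following simplified sketch; the function name, arguments, and comparison key are assumptions rather than the code under review:

def _filter_datasets(src_repo, dest_repo, *args, **kwargs):
    """Yield datasets that exist in ``src_repo`` but not in ``dest_repo``."""
    # This is the set that gets materialized up front: every matching dataset
    # already in the destination, whether or not the upcoming run needs it.
    known = {(ref.datasetType.name, ref.dataId)
             for ref in dest_repo.registry.queryDatasets(*args, **kwargs)}
    # The source query stays lazy, but ``known`` lives until the generator
    # below is exhausted.
    return (ref for ref in src_repo.registry.queryDatasets(*args, **kwargs)
            if (ref.datasetType.name, ref.dataId) not in known)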

"""
try:
# TODO: storing this set in memory may be a performance bottleneck.
# In general, expect dest_repo to have more datasets than src_repo.
Contributor:

Does anything go wrong if that expectation is violated?

Member (author):

Well, it's violated every time we use a fresh repo!

I meant this sentence to be a continuation of the previous one: not only are we materializing one of the sets, it's the bigger one (given an old enough repo).

Member (author):

I don't know of any algorithm for a set difference that's streaming on both sets, unless they're sorted (which queryDatasets does not guarantee), and it sounds like nobody else does, either.
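
For illustration only, a streaming difference is possible when both inputs are sorted under the same key, which queryDatasets does not guarantee:

def sorted_difference(src, dest):
    """Yield items of sorted iterable ``src`` absent from sorted iterable ``dest``.

    Neither input is materialized; both are consumed once, in order.
    """
    dest_iter = iter(dest)
    d = next(dest_iter, None)
    for s in src:
        # Advance dest until it catches up with the current src item.
        while d is not None and d < s:
            d = next(dest_iter, None)
        if d is None or s < d:  # s never appears in dest
            yield s

# e.g. list(sorted_difference([1, 2, 4, 7], [2, 3, 7])) -> [1, 4]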

Contributor:

Ah, OK. Maybe tweak the comment a bit to include the above information somehow? It wasn't entirely clear to me that the second line was a continuation of the first.

@kfindeisen (Member, author), Aug 3, 2022:

Actually, I deleted it. I think it is impossible for the dest_repo results to be bigger than the src_repo ones (since the two queries use the same arguments, that would imply that there is something in dest_repo calibs/templates/refcats that did not come from src_repo). It also means the dest_repo results won't grow without bound, so no memory problems.

@kfindeisen merged commit ef17915 into main Aug 4, 2022
@kfindeisen deleted the tickets/DM-35051 branch August 4, 2022 17:41