DM-35051: Allow Prompt Processing to set up a local repo with existing files #23
A handful of comments, mostly requesting clarification.
This looks a lot more robust.
python/activator/activator.py
Outdated
if len(expid_set) < expected_visit.snaps:
    _log.warning(f"Processing {len(expid_set)} snaps, expected {expected_visit.snaps}.")
_log.info(f"Running pipeline on group: {expected_visit.group} "
          f"detector: {expected_visit.detector}")
Here's another place for that log-formatter on Visit.
The logger that we use in pipetask has an "MDC" that lets you assign a dict to the logger such that every log message reports that dict without having to explicitly add it each time. It's why every message includes the dataId. Is that something that would be useful here?
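As a rough stdlib analogue of the MDC idea described above (this is not the ButlerMDC API, just a sketch using `logging.LoggerAdapter`; the logger name and context values are illustrative), a context dict can be attached once and prefixed onto every message:

```python
import logging

# Sketch only: a LoggerAdapter carrying an MDC-like context dict,
# so every message reports the context without repeating it by hand.
class MDCAdapter(logging.LoggerAdapter):
    def process(self, msg, kwargs):
        context = " ".join(f"{k}={v}" for k, v in self.extra.items())
        return f"[{context}] {msg}", kwargs

logging.basicConfig(format="%(message)s")
base = logging.getLogger("activator")
log = MDCAdapter(base, {"group": "2022-08-01T12:00", "detector": 42})
# The context is prepended automatically:
log.warning("Processing 1 snaps, expected 2.")
```

Like a classic MDC, the context here is tied to the adapter object rather than to the underlying logger, so each worker (or visit) would need its own adapter.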
I can't find any mention of MDC in the Python logging package (am I looking in the wrong place?), but in other loggers I've worked with, MDC is attached to a thread or process. That's a problem here because none of the components are actually associated with a particular visit or even detector. Even each MiddlewareInterface object is subject to arbitrary reuse (perhaps a design flaw we should fix?), which is why the main content of this PR is basically a bunch of safety-checking code.
[Edit: I remembered the code wrong; there are in fact three Butlers for each worker repo. Each worker still has the same PID, though.]
MDC is not a Python logging feature. It's something we added to match lsst.log; it lives in lsst.daf.butler.core.logging.ButlerMDC. If you are running multi-threaded, though, the MDC is not going to work. pipetask always uses one subprocess per job; we didn't put the effort in to make it multi-threaded.
@timj : this is not connected to pipetask at all.
I wasn't imagining it would be. I was making a general comment about there existing a way to attach metadata to a logger so that every log message gets the content without the person having to explicitly include it in every log message. I was mentioning pipetask as an example of where we use the MDC -- it's where all those dataId messages come from in long-log mode. If you are using threads then it's not going to help without patching the MDC class to understand threads.
Actually, maybe we don't need the ButlerMDC system. We might be able to do exactly what we need if we use a custom handler and Handler.setFormatter (with suitable resource guards). That might let us clean up some of the duplicate formatting code while we're at it.
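The setFormatter route suggested here might look something like the sketch below. It uses only stdlib `logging`; the `visit_label` attribute name and the fallback text are hypothetical, and a real version would need the resource guards mentioned above (this formatter mutates the record, so it must not be shared across handlers):

```python
import logging

# Sketch: one formatter injects a per-visit label taken from a record
# attribute, falling back gracefully when the attribute is absent.
class VisitFormatter(logging.Formatter):
    def format(self, record):
        label = getattr(record, "visit_label", "unknown visit")
        record.msg = f"{label}: {record.msg}"  # note: mutates the record
        return super().format(record)

handler = logging.StreamHandler()
handler.setFormatter(VisitFormatter("%(levelname)s %(message)s"))
log = logging.getLogger("activator")
log.addHandler(handler)
# Callers supply the label through `extra`, once per call site:
log.warning("pipeline started", extra={"visit_label": "group 42 det 7"})
```

This keeps the formatting logic in one place instead of duplicating the group/detector boilerplate in every log call.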
MDC-ing the loggers is deferred to DM-35828.
    The datasets that exist in ``src_repo`` but not ``dest_repo``.
    """
    try:
        # TODO: storing this set in memory may be a performance bottleneck.
Why a bottleneck here? Is it just from the iterator->set conversion?
Yes, specifically the need to create all the prior data IDs, including the ones whose existence we don't need to check (because we don't need them for the upcoming run).
Oh, and the fact that this set will stick around at least until the iterator is exhausted. 🙂
""" | ||
try: | ||
# TODO: storing this set in memory may be a performance bottleneck. | ||
# In general, expect dest_repo to have more datasets than src_repo. |
Does anything go wrong if that expectation is violated?
Well, it's violated every time we use a fresh repo!
I meant this sentence to be a continuation of the previous one: not only are we materializing one of the sets, it's the bigger one (given an old enough repo).
I don't know of any algorithm for a set difference that's streaming on both sets, unless they're sorted (which queryDatasets does not guarantee), and it sounds like nobody else does, either.
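To make the trade-off being discussed concrete, here is a minimal sketch (generic names, not the PR's actual code) of a one-sided-streaming difference: only one side is materialized as a set, while the other is streamed.

```python
def missing_datasets(src_iter, dest_iter):
    """Yield items of src_iter that are absent from dest_iter.

    Only dest_iter is held in memory; src_iter is streamed.
    """
    seen = set(dest_iter)  # the side we must materialize
    for item in src_iter:
        if item not in seen:
            yield item

# If the materialized side is the larger one (e.g. an old repo),
# `seen` dominates memory use -- the bottleneck noted in the TODO.
print(list(missing_datasets([1, 2, 3, 4], [2, 4])))  # [1, 3]
```

Streaming both sides would require both inputs to arrive sorted, which is exactly the guarantee queryDatasets does not provide.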
Ah, ok. Maybe tweak the comment a bit with the above information somehow? It wasn't entirely clear to me that the second line was a continuation of that though.
Actually, I deleted it. I think it is impossible for the dest_repo results to be bigger than the src_repo ones (since the two queries use the same arguments, a bigger result would imply that there is something in dest_repo calibs/templates/refcats that did not come from src_repo). It also means the dest_repo results won't grow without bound, so no memory problems.
Multiple visits and detectors are more representative of the kinds of double-registrations we'd expect in deployment.
Detector number is useful for tracking parallel processing, but can't be put in the log labels because neither processes nor objects are uniquely associated with a detector.
This PR adds multiple-run support to MiddlewareInterface, and enhances the logs with information that proved useful in debugging. It also does some refactoring of the existing code. The multiple-run support is not completely unit-tested, as it is sensitive to parallel-computing issues that are hard to reproduce locally.