DM-28650: Add Butler.transfer_from #523
Conversation
(branch updated: 87d2bbe to c158daa)
Looks okay; lots of comments, but many of them are future concerns that need at most minor changes now. I'm all for solutions that get most of this on master sooner rather than later, though, as it already looks super useful.
In addition to the line comments, I am a bit worried that the logic for how to handle the many kinds of conflicts is going to be very hard for the user to reason about; that's an intrinsic problem at some level, but it might be better solved demanding more from the user about how they want conflicts to be resolved (maybe even a config object of some kind), rather than guessing.
python/lsst/daf/butler/_butler.py
Outdated
The source collection will be reconstructed in this butler using
the same names. If a dataset already exists in a RUN collection
of the same name no transfer will occur and this will not be an error
unless the DatasetRef uses UUID and there is a UUID mismatch.
"dataset already exists" means "a dataset with the same dataset type and data ID exists", then?
I don't understand the reasoning for the UUID behavior exception, in that case.
I think importDatasets silently skips a ref that already exists in the repo. If somehow the UUID does not match the ref in the registry but the DataId/DatasetType does match, then it will complain, won't it? We can imagine a scenario like this if people do a transfer, then prune their copy, forget, and transfer again. Hence the comment: it's fine if the UUID is used and matches, but not if the UUID differs.
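To make the conflict semantics discussed above concrete, here is a minimal sketch of the skip-or-raise logic. Everything here (`Ref`, `import_refs`, the key layout) is hypothetical stand-in code, not the real `importDatasets` implementation: a duplicate with a matching ID is silently skipped, while a UUID that disagrees for the same dataset type and data ID is treated as an error.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """Hypothetical stand-in for DatasetRef."""
    key: tuple   # (dataset type name, data ID) -- the uniqueness key
    id: object   # int or uuid.UUID


def import_refs(existing: dict, incoming: list) -> list:
    """Sketch of the conflict handling: silently skip exact duplicates,
    raise when a UUID disagrees for the same dataset type + data ID."""
    imported = []
    for ref in incoming:
        current = existing.get(ref.key)
        if current is None:
            existing[ref.key] = ref
            imported.append(ref)
        elif isinstance(ref.id, uuid.UUID) and current.id != ref.id:
            raise RuntimeError(f"UUID mismatch for {ref.key}")
        # else: dataset already present with a consistent ID -> skip
    return imported
```

This is the "transfer, prune, transfer again" scenario: the second transfer is a no-op when the UUIDs still match, but a regenerated dataset with a fresh UUID would trip the mismatch error.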
# Just because we can see the artifact when running
# the transfer doesn't mean it will be generally
# accessible to a user of this butler. For now warn
# but assume it will be accessible.
I'm not sure whether this should be an exception or silent, but warning seems like it's still going to be a problem if it's user error, while being annoying if the user knows what they are doing. Do we want a flag so the user can say, "trust me on this"?
In my head this scenario was extremely unlikely because it only happens with ingested datasets and not those created by processing. Since the transfer does not (yet?) copy dimension records over, and the receiving registry won't ever have seen these raw files, it seems like we aren't really going to get this far in the code. If we turn this into the default mode for transferring content from one butler to another (and only use export/import for cases where you can't see both butlers from your one python process), then yes, we need the code higher up to insert missing dimension records and then ask the user whether absolute URIs should be transferred or left as-is. I was putting in a warning there as a stop-gap for a scenario I'm not expecting to happen but don't want to raise an exception for immediately -- though a warning will generally prompt people to ask us what the warning means.
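The warn-vs-raise trade-off being debated here (with the possible "trust me" flag) can be sketched as follows. The function and its `on_missing` parameter are invented for illustration; they are not part of the daf_butler API:

```python
import warnings


def check_artifact_accessible(uri: str, accessible: bool,
                              on_missing: str = "warn") -> None:
    """Hypothetical handling of an absolute-URI artifact that may not be
    reachable from the receiving butler.

    on_missing="warn"   -- current stop-gap behavior: warn but continue.
    on_missing="error"  -- fail fast on possible user error.
    on_missing="ignore" -- the "trust me" flag: caller asserts the URI
                           will be accessible to users of this butler.
    """
    if accessible:
        return
    msg = f"Artifact {uri} may not be accessible from the receiving butler"
    if on_missing == "error":
        raise FileNotFoundError(msg)
    if on_missing == "warn":
        warnings.warn(msg)
    # "ignore": silently accept the absolute URI as-is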
(branch updated: 36c879c to 11c716e)
log.debug("Importing %d refs with %s into run %s",
          len(refs_to_import), datasetType.name, run)

# No way to know if this butler's registry uses UUID.
Not sure it's a job for this ticket, but I do think we want a public API for this. It's not just an implementation detail; it's a behavioral change that's important to a lot of higher-level code.
It's a transitory problem to a certain extent. Things break if you try to use UUID in an old registry, and if I knew that I could strip the ids before calling the import. The question is whether we actually care about that kind of migration. One question I have is whether you are thinking of this as a long-term API or a quick hack. I could easily see a "private" attribute on the dataset manager that we retire when we retire integer IDs. Do you envisage a more public API on Registry that will return the dataset_id_type, even though in the long term that would presumably be the type of DatasetRef.id?
I think totally public - things in pipe_base and maybe obs_base do care (thinking especially about Quantum and QuantumGraph because I'm working on those now).
I imagine we would probably deprecate this API when we deprecate int support, but that depends on how much we want to hedge against changing again in the future.
If we wanted to keep it around forever, we might want to think about whether there should be some other flags, in addition to the type; those might last longer:
- has_globally_unique_ids
- supports_deterministic_ids
Of course, if we expect both of those to always be true forever after int is gone, then maybe just the int vs uuid type flag is enough.
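A public capability API along the lines proposed above might look like this sketch. The names (`DatasetIdInfo`, `id_info_for`) and the mapping from ID scheme to flags are assumptions for illustration, not an actual Registry interface:

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetIdInfo:
    """Hypothetical public description of a registry's dataset-ID scheme,
    exposing capability flags alongside the raw ID type."""
    id_type: type                     # int or uuid.UUID
    has_globally_unique_ids: bool
    supports_deterministic_ids: bool


def id_info_for(uses_uuid: bool) -> DatasetIdInfo:
    """Sketch: a UUID-based registry gets both capabilities; an old
    integer-based registry gets neither."""
    if uses_uuid:
        return DatasetIdInfo(uuid.UUID, True, True)
    return DatasetIdInfo(int, False, False)
```

Higher-level code (e.g. QuantumGraph construction) could then branch on the flags rather than on the concrete ID type, which is what would let the flags outlive the int-to-UUID migration.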
Okay. I'll do this on another ticket because it will also lead to a cleanup of obs_base ingest-raws.
Sometimes the caller might have specified many collections so won't know which specific collection is causing this problem.
Transfer datasets from a run in one butler to a new butler.