
DM-28650: Add Butler.transfer_from #523

Merged: 5 commits merged into master on May 25, 2021
Conversation

@timj timj commented May 18, 2021

No description provided.

@timj timj force-pushed the tickets/DM-28650 branch 6 times, most recently from 87d2bbe to c158daa Compare May 19, 2021 20:50
@timj timj marked this pull request as ready for review May 19, 2021 20:52
@timj timj requested a review from TallJimbo May 19, 2021 20:52
@TallJimbo TallJimbo left a comment

Looks okay; lots of comments, but many of them are future concerns that need at most minor changes now. I'm all for solutions that get most of this on master sooner rather than later, though, as it already looks super useful.

In addition to the line comments, I am a bit worried that the logic for how to handle the many kinds of conflicts is going to be very hard for the user to reason about. That's an intrinsic problem at some level, but it might be better solved by demanding more from the user about how they want conflicts to be resolved (maybe even a config object of some kind), rather than by guessing.
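As a sketch of that config-object idea (all names here, `OnConflict`, `TransferConflictConfig`, and `resolve_conflict`, are hypothetical and not part of daf_butler), the user could state the desired action for each kind of conflict up front instead of relying on guessed defaults:

```python
from dataclasses import dataclass
from enum import Enum, auto


class OnConflict(Enum):
    """Possible actions when the target repo already holds a matching dataset."""
    SKIP = auto()       # leave the existing dataset in place, no error
    ERROR = auto()      # raise and abort the transfer
    OVERWRITE = auto()  # replace the existing dataset


@dataclass(frozen=True)
class TransferConflictConfig:
    """User-supplied conflict policy; the defaults mirror the behavior
    documented in the transfer_from docstring."""
    same_id: OnConflict = OnConflict.SKIP        # same data ID/type, same dataset ID
    id_mismatch: OnConflict = OnConflict.ERROR   # same data ID/type, different ID


def resolve_conflict(existing_id, incoming_id,
                     config: TransferConflictConfig = TransferConflictConfig()) -> OnConflict:
    """Pick an action for a dataset that already exists in the target repo."""
    return config.same_id if existing_id == incoming_id else config.id_mismatch
```

With this shape, the "hard to reason about" cases become explicit choices the caller made rather than behavior they have to reverse-engineer.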

The source collection will be reconstructed in this butler using
the same names. If a dataset already exists in a RUN collection
of the same name, no transfer will occur and this will not be an
error, unless the DatasetRef uses a UUID and there is a UUID mismatch.
TallJimbo (Member):

"dataset already exists" means "a dataset with the same dataset type and data ID exists", then?

I don't understand the reasoning for the UUID behavior exception, in that case.

timj (Member, Author):

I think for importDatasets, if the ref already exists in the repo it silently skips it. If somehow the UUID does not match the ref in the registry but does match the DataId/DatasetType, then it will complain, won't it? We can imagine a scenario like this if people do a transfer, then prune their copy, rerun, forget, and transfer again. Hence the comment about it being fine if a UUID is used and matches, but not if the UUID differs.

python/lsst/daf/butler/_butler.py (outdated; resolved)
python/lsst/daf/butler/_butler.py (outdated; resolved)
python/lsst/daf/butler/datastores/fileDatastore.py (outdated; resolved)
# Just because we can see the artifact when running
# the transfer doesn't mean it will be generally
# accessible to a user of this butler. For now warn
# but assume it will be accessible.
TallJimbo (Member):

I'm not sure whether this should be an exception or silent, but warning seems like it's still going to be a problem if it's user error, while being annoying if the user knows what they are doing. Do we want a flag so the user can say, "trust me on this"?

timj (Member, Author):

In my head this scenario was extremely unlikely, because it only happens with ingested datasets and not those created by processing. Since the transfer does not (yet?) copy dimension records over, and the receiving registry won't ever have seen these raw files, it seems like we aren't really going to get this far in the code. If we turn this into the default mode for transferring content from one butler to another (and only use export/import for cases where you can't see both butlers from your one python process), then yes, we need the code higher up to insert missing dimension records and then ask the user whether absolute URIs should be transferred or left as-is. I was putting in a warning there as a stop-gap for a scenario I'm not expecting it to hit, and I don't want to raise an exception immediately -- but the warning will generally prompt people to ask us what it means.
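The "trust me on this" flag TallJimbo suggests could look something like the sketch below (the names `check_artifact_access`, `UnreachableArtifactWarning`, and the `trust_accessible` flag are hypothetical, not the actual fileDatastore code): warn by default, stay silent when the caller has asserted the URI will remain reachable.

```python
import warnings


class UnreachableArtifactWarning(UserWarning):
    """An artifact visible during the transfer may not be reachable later
    by users of the target butler."""


def check_artifact_access(uri: str, *, is_absolute: bool,
                          trust_accessible: bool = False) -> None:
    # Stop-gap behavior: warn rather than raise, unless the caller has
    # explicitly said "trust me" that the absolute URI stays accessible.
    if is_absolute and not trust_accessible:
        warnings.warn(
            f"Artifact at {uri} may not be accessible to users of this "
            "butler; assuming it is.",
            UnreachableArtifactWarning,
            stacklevel=2,
        )
```

A flag like this keeps the warning useful for genuine user error while letting informed callers opt out of the noise.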

@timj timj force-pushed the tickets/DM-28650 branch 4 times, most recently from 36c879c to 11c716e Compare May 24, 2021 21:16
log.debug("Importing %d refs with %s into run %s",
len(refs_to_import), datasetType.name, run)

# No way to know if this butler's registry uses UUID.
TallJimbo (Member):

Not sure it's a job for this ticket, but I do think we want a public API for this. It's not just an implementation detail; it's a behavioral change that's important to a lot of higher-level code.

timj (Member, Author):

It's a transitory problem to a certain extent. Things break if you try to use UUIDs in an old registry, and if I knew that, I could strip the ids before calling the import. The question is whether we actually care about that kind of migration. One question I have is whether you are thinking of this as a long-term API or a quick hack. I could easily see a "private" attribute on the dataset manager that we retire when we retire integer IDs. Do you envisage a more public API on Registry that returns the dataset_id_type, even though in the long term that would presumably be the type of DatasetRef.id?
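The "strip the ids before calling the import" idea can be sketched as a small pure function (refs are modeled here as plain dicts with an `id` key; the real code would operate on DatasetRef objects, and `prepare_refs_for_import` is a hypothetical name):

```python
def prepare_refs_for_import(refs, target_uses_uuid: bool):
    """Drop dataset IDs when the target registry still uses integer IDs,
    so the import assigns fresh IDs instead of failing on UUIDs it
    cannot store. Refs are returned unchanged when UUIDs are supported."""
    if target_uses_uuid:
        return list(refs)
    # Copy each ref with its id cleared; the target registry will
    # allocate its own integer id on insert.
    return [{**ref, "id": None} for ref in refs]
```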

TallJimbo (Member):

I think totally public: things in pipe_base and maybe obs_base do care (thinking especially about Quantum and QuantumGraph, because I'm working on those now).
I imagine we would probably deprecate this API when we deprecate int support, but that depends on how much we want to hedge against changing again in the future.

If we wanted to keep it around forever, we might want to think about whether there should be some other flags, in addition to the type; those might last longer:

  • has_globally_unique_ids
  • supports_deterministic_ids

Of course, if we expect both of those to always be true forever after int is gone, then maybe just the int vs uuid type flag is enough.
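One way the proposed flags could hang together (everything here, `DatasetIdType`, `DatasetIdInfo`, and both properties, is a hypothetical sketch of the suggested public API, not existing Registry code) is to derive the longer-lived flags from the id-type enum, so they can later diverge if a new id scheme appears:

```python
import enum
from dataclasses import dataclass


class DatasetIdType(enum.Enum):
    INT = "int"    # legacy autoincrement integer dataset IDs
    UUID = "uuid"  # UUID-based dataset IDs


@dataclass(frozen=True)
class DatasetIdInfo:
    """Hypothetical public description of a registry's dataset-ID scheme."""
    id_type: DatasetIdType

    @property
    def has_globally_unique_ids(self) -> bool:
        # Integer IDs are only unique within a single registry.
        return self.id_type is DatasetIdType.UUID

    @property
    def supports_deterministic_ids(self) -> bool:
        # UUID5-style IDs can be computed deterministically from
        # the dataset type and data ID; integers cannot.
        return self.id_type is DatasetIdType.UUID
```

If both flags are expected to stay true forever once integer support is gone, the enum alone carries all the information; the properties just make the higher-level questions explicit.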

timj (Member, Author):

Okay. I'll do this on another ticket because it will also lead to a cleanup of obs_base ingest-raws.

timj added 3 commits May 25, 2021 09:27, including:
  • Sometimes the caller might have specified many collections, so won't know which specific collection is causing this problem.
  • Transfer datasets from a run in one butler to a new butler.
@timj timj merged commit e64b597 into master May 25, 2021
@timj timj deleted the tickets/DM-28650 branch May 25, 2021 23:29