
DM-28650: Add Butler.transfer_from #523

Merged: 5 commits merged into master on May 25, 2021
Conversation

@timj timj commented May 18, 2021

No description provided.

@timj timj force-pushed the tickets/DM-28650 branch 6 times, most recently from 87d2bbe to c158daa Compare May 19, 2021 20:50
@timj timj marked this pull request as ready for review May 19, 2021 20:52
@timj timj requested a review from TallJimbo May 19, 2021 20:52
@TallJimbo TallJimbo left a comment

Looks okay; lots of comments, but many of them are future concerns that need at most minor changes now. I'm all for solutions that get most of this on master sooner rather than later, though, as it already looks super useful.

In addition to the line comments, I am a bit worried that the logic for how to handle the many kinds of conflicts is going to be very hard for the user to reason about. That's an intrinsic problem at some level, but it might be better solved by demanding more from the user about how they want conflicts to be resolved (maybe even a config object of some kind), rather than by guessing.
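As a sketch of that config-object idea (all names here, `OnConflict`, `TransferConflictConfig`, and `resolve_conflict`, are hypothetical and not part of daf_butler), the user could state the desired action for each kind of conflict up front instead of relying on guessed defaults:

```python
from dataclasses import dataclass
from enum import Enum, auto


class OnConflict(Enum):
    """Possible actions when the target repo already holds a matching dataset."""
    SKIP = auto()       # leave the existing dataset in place, no error
    ERROR = auto()      # raise and abort the transfer
    OVERWRITE = auto()  # replace the existing dataset


@dataclass(frozen=True)
class TransferConflictConfig:
    """User-supplied conflict policy; the defaults mirror the behavior
    documented in the transfer_from docstring."""
    same_id: OnConflict = OnConflict.SKIP        # same data ID/type, same dataset ID
    id_mismatch: OnConflict = OnConflict.ERROR   # same data ID/type, different ID


def resolve_conflict(existing_id, incoming_id,
                     config: TransferConflictConfig = TransferConflictConfig()) -> OnConflict:
    """Pick an action for a dataset that already exists in the target repo."""
    return config.same_id if existing_id == incoming_id else config.id_mismatch
```

With this shape, the "hard to reason about" cases become explicit choices the caller made rather than behavior they have to reverse-engineer.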

The source collection will be reconstructed in this butler using
the same names. If a dataset already exists in a RUN collection
of the same name, no transfer will occur and this will not be an
error, unless the DatasetRef uses a UUID and there is a UUID mismatch.
TallJimbo (Member):

"dataset already exists" means "a dataset with the same dataset type and data ID exists", then?

I don't understand the reasoning for the UUID behavior exception, in that case.

timj (Member, Author):

I think for importDatasets, if the ref already exists in the repo it silently skips it. If somehow the UUID does not match the ref in the registry but does match the DataId/DatasetType, then it will complain, won't it? We can imagine a scenario like this if people do a transfer, then prune their copy, rerun, forget, and transfer again. Hence the comment about it being fine if a UUID is used and matches, but not if the UUID differs.

python/lsst/daf/butler/_butler.py (outdated; resolved)
python/lsst/daf/butler/_butler.py (outdated; resolved)
python/lsst/daf/butler/datastores/fileDatastore.py (outdated; resolved)
# Just because we can see the artifact when running
# the transfer doesn't mean it will be generally
# accessible to a user of this butler. For now warn
# but assume it will be accessible.
TallJimbo (Member):

I'm not sure whether this should be an exception or silent, but warning seems like it's still going to be a problem if it's user error, while being annoying if the user knows what they are doing. Do we want a flag so the user can say, "trust me on this"?

timj (Member, Author):

In my head this scenario was extremely unlikely, because it only happens with ingested datasets and not those created by processing. Since the transfer does not (yet?) copy dimension records over, and the receiving registry won't ever have seen these raw files, it seems like we aren't really going to get this far in the code. If we turn this into the default mode for transferring content from one butler to another (and only use export/import for cases where you can't see both butlers from your one python process), then yes, we need the code higher up to insert missing dimension records and then ask the user whether absolute URIs should be transferred or left as-is. I was putting in a warning there as a stop-gap for a scenario I'm not expecting it to hit, and I don't want to raise an exception immediately -- but the warning will generally prompt people to ask us what it means.
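The "trust me on this" flag TallJimbo suggests could look something like the sketch below (the names `check_artifact_access`, `UnreachableArtifactWarning`, and the `trust_accessible` flag are hypothetical, not the actual fileDatastore code): warn by default, stay silent when the caller has asserted the URI will remain reachable.

```python
import warnings


class UnreachableArtifactWarning(UserWarning):
    """An artifact visible during the transfer may not be reachable later
    by users of the target butler."""


def check_artifact_access(uri: str, *, is_absolute: bool,
                          trust_accessible: bool = False) -> None:
    # Stop-gap behavior: warn rather than raise, unless the caller has
    # explicitly said "trust me" that the absolute URI stays accessible.
    if is_absolute and not trust_accessible:
        warnings.warn(
            f"Artifact at {uri} may not be accessible to users of this "
            "butler; assuming it is.",
            UnreachableArtifactWarning,
            stacklevel=2,
        )
```

A flag like this keeps the warning useful for genuine user error while letting informed callers opt out of the noise.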

@timj timj force-pushed the tickets/DM-28650 branch 4 times, most recently from 36c879c to 11c716e Compare May 24, 2021 21:16
log.debug("Importing %d refs with %s into run %s",
len(refs_to_import), datasetType.name, run)

# No way to know if this butler's registry uses UUID.
TallJimbo (Member):

Not sure it's a job for this ticket, but I do think we want a public API for this. It's not just an implementation detail; it's a behavioral change that's important to a lot of higher-level code.

timj (Member, Author):

It's a transitory problem to a certain extent. Things break if you try to use UUIDs in an old registry, and if I knew that, I could strip the ids before calling the import. The question is whether we actually care about that kind of migration. One question I have is whether you are thinking of this as a long-term API or a quick hack. I could easily see a "private" attribute on the dataset manager that we retire when we retire integer IDs. Do you envisage a more public API on Registry that returns the dataset_id_type, even though in the long term that would presumably be the type of DatasetRef.id?
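The "strip the ids before calling the import" idea can be sketched as a small pure function (refs are modeled here as plain dicts with an `id` key; the real code would operate on DatasetRef objects, and `prepare_refs_for_import` is a hypothetical name):

```python
def prepare_refs_for_import(refs, target_uses_uuid: bool):
    """Drop dataset IDs when the target registry still uses integer IDs,
    so the import assigns fresh IDs instead of failing on UUIDs it
    cannot store. Refs are returned unchanged when UUIDs are supported."""
    if target_uses_uuid:
        return list(refs)
    # Copy each ref with its id cleared; the target registry will
    # allocate its own integer id on insert.
    return [{**ref, "id": None} for ref in refs]
```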

TallJimbo (Member):

I think totally public: things in pipe_base and maybe obs_base do care (thinking especially about Quantum and QuantumGraph, because I'm working on those now).
I imagine we would probably deprecate this API when we deprecate int support, but that depends on how much we want to hedge against changing again in the future.

If we wanted to keep it around forever, we might want to think about whether there should be some other flags, in addition to the type; those might last longer:

  • has_globally_unique_ids
  • supports_deterministic_ids

Of course, if we expect both of those to always be true forever after int is gone, then maybe just the int vs uuid type flag is enough.
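One way the proposed flags could hang together (everything here, `DatasetIdType`, `DatasetIdInfo`, and both properties, is a hypothetical sketch of the suggested public API, not existing Registry code) is to derive the longer-lived flags from the id-type enum, so they can later diverge if a new id scheme appears:

```python
import enum
from dataclasses import dataclass


class DatasetIdType(enum.Enum):
    INT = "int"    # legacy autoincrement integer dataset IDs
    UUID = "uuid"  # UUID-based dataset IDs


@dataclass(frozen=True)
class DatasetIdInfo:
    """Hypothetical public description of a registry's dataset-ID scheme."""
    id_type: DatasetIdType

    @property
    def has_globally_unique_ids(self) -> bool:
        # Integer IDs are only unique within a single registry.
        return self.id_type is DatasetIdType.UUID

    @property
    def supports_deterministic_ids(self) -> bool:
        # UUID5-style IDs can be computed deterministically from
        # the dataset type and data ID; integers cannot.
        return self.id_type is DatasetIdType.UUID
```

If both flags are expected to stay true forever once integer support is gone, the enum alone carries all the information; the properties just make the higher-level questions explicit.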

timj (Member, Author):

Okay. I'll do this on another ticket because it will also lead to a cleanup of obs_base ingest-raws.

timj added 3 commits May 25, 2021 09:27, including:
  • Sometimes the caller might have specified many collections, so won't know which specific collection is causing this problem.
  • Transfer datasets from a run in one butler to a new butler.
@timj timj merged commit e64b597 into master May 25, 2021
@timj timj deleted the tickets/DM-28650 branch May 25, 2021 23:29