
DM-41966: Add Butler.transfer_dimension_records_from API #921

Merged
merged 5 commits into main from tickets/DM-41966 Dec 7, 2023

Conversation

@timj timj (Member) commented Dec 5, 2023

Uses populated_by field to find other records to pull along.

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes

@timj timj force-pushed the tickets/DM-41966 branch 2 times, most recently from 92ec7c9 to 564e43e on December 5, 2023 at 23:00
# have to be scanned.
continue

if not can_query:
timj (Member Author):

Does it make sense to have a flag to not try to copy the derived records? This will break if a quantum graph is used as the source butler but I think that's fine because we shouldn't be enabling records transfer from the graph back to the primary butler because it's got all the records by definition.

Member:

I think there's also a use case for making a new repo from a QG, but I'd prefer to explicitly add that (and control what the interface for it is) rather than accidentally make it work one way and then have to continue to support that way. So I don't think we need that flag for that reason.

I am a bit more worried about cases where somebody intentionally does not want to transfer something like detector because they'd rather assert that someone has run butler register-instrument correctly on the destination repo, but I think that's an argument for being able to control the elements being transferred, as per a previous PR comment.

timj (Member Author):

I have just realized that the butler transfer-from-graph command lets you specify whether to copy dimension records or not. This now breaks things because of the populated_by follow-up query. We seem to have the following options:

  • if we can't query the source butler we issue a warning and transfer what we have.
  • if we can't query the source butler we query the target butler and if those populated_by records are found we return without complaint.
  • in the future we add the related records to the graph and add querying of records to the Graph Butler.
  • we remove the transfer dimensions option from the transfer-from-graph command (or change the default to False and, for now, raise if True). The default likely should be True anyhow since in all our cases we are transferring back to the butler that created the graph.
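The first two options above can be sketched in plain Python. This is a toy model, not the real Butler API: `resolve_populated_by` and `target_has` are hypothetical names, with `target_has` standing in for a query against the target butler.

```python
import warnings


def resolve_populated_by(records, can_query_source, target_has):
    """Toy sketch of the fallback options discussed above.

    records: dict mapping element name -> records already in hand
    can_query_source: whether the source butler supports queries
    target_has: hypothetical callable reporting whether the target
        butler already holds the populated_by records for an element
    """
    if can_query_source:
        # Normal path: the populated_by follow-up query would run here.
        return records
    missing = [name for name in records if not target_has(name)]
    if missing:
        # Option 1: issue a warning and transfer what we have.
        warnings.warn(
            "Cannot query source butler; populated_by records may be "
            f"incomplete for: {missing}"
        )
    # Option 2: the target already has the records; return without complaint.
    return records
```

The third and fourth options change the graph format or the CLI defaults, so they do not fit a small sketch like this.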


codecov bot commented Dec 5, 2023

Codecov Report

Attention: 11 lines in your changes are missing coverage. Please review.

Comparison is base (d41daf1) 87.50% compared to head (4845263) 87.50%.

Files                                     Patch %   Lines
python/lsst/daf/butler/direct_butler.py   82.69%    4 missing, 5 partials ⚠️
tests/test_simpleButler.py                81.81%    2 missing ⚠️
@@           Coverage Diff           @@
##             main     #921   +/-   ##
=======================================
  Coverage   87.50%   87.50%           
=======================================
  Files         292      292           
  Lines       38067    38124   +57     
  Branches     8062     8081   +19     
=======================================
+ Hits        33310    33362   +52     
- Misses       3553     3554    +1     
- Partials     1204     1208    +4     


source_refs : iterable of `DatasetRef`
Datasets defined in the source butler whose dimension records
should be transferred to this butler. In most circumstances,
transfer is faster if the dataset refs are expanded.
Member:

Might be better to make this an iterable of DataCoordinate, since that's what's actually holding the records and you can get that from Iterable[DatasetRef] with (ref.dataId for ref in source_refs).

And even then I think that's a bit strange for a method with this name; it sounds like it should be taking a bunch of dimension records and transferring those (as well as other less-obvious associated records). But of course that's not actually the interface we need right here, so maybe this is just a naming problem.

One option might be to make this a private method, if the goal is really to support transferring datasets. On the DMTN-249 prototype I wrote a little helper class that I hoped to use to unify the transfer-from and import_/export interfaces, by providing an abstraction over "a bunch of stuff you want to transfer self-consistently". I think we might need that in the transfer APIs to avoid a bunch of methods with names like transfer_dimension_records_from_given_dataset_refs.
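The suggested alternative can be shown with minimal stand-in classes. These are toy versions only, to show the shape of the one-liner; the real `DatasetRef` and `DataCoordinate` live in `lsst.daf.butler`, and `data_ids_from_refs` is a hypothetical helper name.

```python
from dataclasses import dataclass


# Toy stand-ins for the real lsst.daf.butler classes; the only property
# relied on here is that a DatasetRef carries a dataId attribute.
@dataclass(frozen=True)
class DataCoordinate:
    values: tuple


@dataclass(frozen=True)
class DatasetRef:
    dataId: DataCoordinate


def data_ids_from_refs(source_refs):
    # The one-liner from the comment: an Iterable[DatasetRef] already
    # yields the data coordinates that hold the dimension records.
    return (ref.dataId for ref in source_refs)
```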

timj (Member Author):

This needs to be a public API because the embargo transfers need to be able to call it before they transfer the raws from embargo to public. They can't use Butler.transfer_from() for raws because raws are relocated to public storage outside of butler before being ingested again (but without having to go through calling ingest-raws again since they have refs already from the embargo repo and are building up the FileDataset objects). That's also why refs are the interface and not dataIds.

Member:

Ok. I'm not thrilled with it, but if it's got a clear use case, go ahead, since the kind of generalization I want is even more design work (and I'd probably want to be even more cautious about releasing that half-baked), so we can cross the bridge of replacing this if and when we come to it.

) -> dict[DimensionElement, dict[DataCoordinate, DimensionRecord]]:
primary_records = self._extract_dimension_records_from_data_ids(
source_butler, data_ids, allowed_elements
)
Member:

There's an assumption here that if the destination butler already has records for some of these elements, insertDimensionData(..., skip_existing=True) is both efficient and correct as a way to resolve any conflicts. That's a reasonable-enough assumption for it to be the default, but we might want to provide more control for advanced users, especially if that could avoid queries against the source butler.
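The skip_existing semantics assumed here can be modeled in a few lines. This is a dict-based toy, not the registry implementation; `existing` stands in for the destination repo's dimension tables.

```python
def insert_dimension_data(existing, new_records, skip_existing=True):
    """Toy model of insertDimensionData(..., skip_existing=True):
    records whose keys are already present in the destination are
    silently skipped instead of raising a conflict."""
    inserted = []
    for key, record in new_records.items():
        if key in existing:
            if skip_existing:
                continue  # assume the existing record is equivalent
            raise ValueError(f"conflicting record for {key!r}")
        existing[key] = record
        inserted.append(key)
    return inserted
```

The correctness caveat in the comment is visible in the `continue` branch: skipping silently assumes the destination's record matches the source's.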

timj (Member Author):

Ok. For the populated_by records, wouldn't we have to query the target butler to see if they existed already?

Member:

I agree it's less important and trickier to let the user control those. If we expressed the user control as an "opt out" list of elements rather than an opt-in list we could probably still trust it, though.

timj (Member Author):

So you want a skip_elements: list[str] | None = None parameter to be added to butler.transfer_from and butler.transfer_dimension_records_from so that people could say "no detector/instrument/physical_filter or no visit_detector_region" or something. I agree that if detector/instrument/physical_filter are not present in the target repo then you probably do want to run register-instrument first, although the transfer being told to skip them wouldn't explain to people why the transfer failed in this case.
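The proposed opt-out could look roughly like this. It is hypothetical: this PR does not add a skip_elements parameter, and `filter_elements` is an illustrative name.

```python
def filter_elements(records_by_element, skip_elements=None):
    """Hypothetical sketch of the proposed skip_elements opt-out:
    drop records for any element the caller asserts already exists
    in the target repo (e.g. after butler register-instrument)."""
    skip = set(skip_elements or ())
    return {
        name: records
        for name, records in records_by_element.items()
        if name not in skip
    }
```

As noted above, skipping an element that is in fact absent from the target would make the transfer fail later with a less obvious error.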

Member:

Yes, that's about what I was thinking vaguely of. I don't care deeply about adding it on this ticket if you just want to get somebody else unblocked right now.


records = source_butler.registry.queryDimensionRecords( # type: ignore
element.name, **data_id.mapping # type: ignore
)
Member:

I think this is probably as efficient as we can make it now, but it will be really helpful when we can upload tables of data IDs to the query methods and join against them, both for efficiency and for simplifying the logic here (which is effectively a bunch of table joins written in Python). Right now I think we'd better hope this almost never gets called.
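Until data-ID table uploads exist, one cheap mitigation is to deduplicate the key combinations before querying, so each distinct combination is looked up once rather than once per dataset. This is a generic sketch, not code from this PR; `unique_query_keys` is an illustrative name and data IDs are modeled as plain mappings.

```python
def unique_query_keys(data_ids, element_keys):
    """Collapse a stream of data IDs (mappings) to the unique key
    combinations relevant to one dimension element, so a follow-up
    queryDimensionRecords-style call runs once per combination
    instead of once per dataset."""
    seen = set()
    for data_id in data_ids:
        key = tuple(
            sorted((k, data_id[k]) for k in element_keys if k in data_id)
        )
        if key and key not in seen:
            seen.add(key)
            yield dict(key)
```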

Commit messages:
  • This is clearer than trying to raise the same exception from itself.
  • Also copies related dimensions populated by the original set.
  • Butler.transfer_from now uses a part of this API.
  • This makes sure that exposure is inserted before visit and before visit_definition.
@timj timj merged commit f87c7a0 into main Dec 7, 2023
16 of 17 checks passed
@timj timj deleted the tickets/DM-41966 branch December 7, 2023 15:59