
DM-41966: Add Butler.transfer_dimension_records_from API #921

Merged
merged 5 commits into main from tickets/DM-41966 Dec 7, 2023

Conversation

@timj timj (Member) commented Dec 5, 2023

Uses populated_by field to find other records to pull along.

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes

@timj timj force-pushed the tickets/DM-41966 branch 2 times, most recently from 92ec7c9 to 564e43e on December 5, 2023 at 23:00
# have to be scanned.
continue

if not can_query:
timj (Member Author):

Does it make sense to have a flag to not try to copy the derived records? This will break if a quantum graph is used as the source butler but I think that's fine because we shouldn't be enabling records transfer from the graph back to the primary butler because it's got all the records by definition.

Member:

I think there's also a use case for making a new repo from a QG, but I'd prefer to explicitly add that (and control what the interface for it is) rather than accidentally make it work one way and then have to continue to support that way. So I don't think we need that flag for that reason.

I am a bit more worried about cases where somebody intentionally does not want to transfer something like detector because they'd rather assert that someone has run butler register-instrument correctly on the destination repo, but I think that's an argument for being able to control the elements being transferred, as per a previous PR comment.

timj (Member Author):

I have just realized that the butler transfer-from-graph command lets you specify whether to copy dimension records or not. This now breaks things because of the populated_by follow-up query. We seem to have the following options:

  • if we can't query the source butler we issue a warning and transfer what we have.
  • if we can't query the source butler we query the target butler and if those populated_by records are found we return without complaint.
  • in the future we add the related records to the graph and add querying of records to the Graph Butler.
  • we remove the transfer dimensions option from the transfer-from-graph command (or change the default to False and, for now, raise if True). The default likely should be True anyhow since in all our cases we are transferring back to the butler that created the graph.
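The first two options above can be sketched in plain Python. This is a toy model, not the real Butler API: `resolve_populated_by` and `target_has` are hypothetical names, with `target_has` standing in for a query against the target butler.

```python
import warnings


def resolve_populated_by(records, can_query_source, target_has):
    """Toy sketch of the fallback options discussed above.

    records: dict mapping element name -> records already in hand
    can_query_source: whether the source butler supports queries
    target_has: hypothetical callable reporting whether the target
        butler already holds the populated_by records for an element
    """
    if can_query_source:
        # Normal path: the populated_by follow-up query would run here.
        return records
    missing = [name for name in records if not target_has(name)]
    if missing:
        # Option 1: issue a warning and transfer what we have.
        warnings.warn(
            "Cannot query source butler; populated_by records may be "
            f"incomplete for: {missing}"
        )
    # Option 2: the target already has the records; return without complaint.
    return records
```

The third and fourth options change the graph format or the CLI defaults, so they do not fit a small sketch like this.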


codecov bot commented Dec 5, 2023

Codecov Report

Attention: 11 lines in your changes are missing coverage. Please review.

Comparison is base (d41daf1) 87.50% compared to head (4845263) 87.50%.

Files                                     Patch %   Lines
python/lsst/daf/butler/direct_butler.py   82.69%    4 missing, 5 partials ⚠️
tests/test_simpleButler.py                81.81%    2 missing ⚠️
@@           Coverage Diff           @@
##             main     #921   +/-   ##
=======================================
  Coverage   87.50%   87.50%           
=======================================
  Files         292      292           
  Lines       38067    38124   +57     
  Branches     8062     8081   +19     
=======================================
+ Hits        33310    33362   +52     
- Misses       3553     3554    +1     
- Partials     1204     1208    +4     


source_refs : iterable of `DatasetRef`
Datasets defined in the source butler whose dimension records
should be transferred to this butler. In most circumstances,
transfer is faster if the dataset refs are expanded.
Member:

Might be better to make this an iterable of DataCoordinate, since that's what's actually holding the records and you can get that from Iterable[DatasetRef] with (ref.dataId for ref in source_refs).

And even then I think that's a bit strange for a method with this name; it sounds like it should be taking a bunch of dimension records and transferring those (as well as other less-obvious associated records). But of course that's not actually the interface we need right here, so maybe this is just a naming problem.

One option might be to make this a private method, if the goal is really to support transferring datasets. On the DMTN-249 prototype I wrote a little helper class that I hoped to use to unify the transfer-from and import_/export interfaces, by providing an abstraction over "a bunch of stuff you want to transfer self-consistently". I think we might need that in the transfer APIs to avoid a bunch of methods with names like transfer_dimension_records_from_given_dataset_refs.
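The suggested alternative can be shown with minimal stand-in classes. These are toy versions only, to show the shape of the one-liner; the real `DatasetRef` and `DataCoordinate` live in `lsst.daf.butler`, and `data_ids_from_refs` is a hypothetical helper name.

```python
from dataclasses import dataclass


# Toy stand-ins for the real lsst.daf.butler classes; the only property
# relied on here is that a DatasetRef carries a dataId attribute.
@dataclass(frozen=True)
class DataCoordinate:
    values: tuple


@dataclass(frozen=True)
class DatasetRef:
    dataId: DataCoordinate


def data_ids_from_refs(source_refs):
    # The one-liner from the comment: an Iterable[DatasetRef] already
    # yields the data coordinates that hold the dimension records.
    return (ref.dataId for ref in source_refs)
```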

timj (Member Author):

This needs to be a public API because the embargo transfers need to be able to call it before they transfer the raws from embargo to public. They can't use Butler.transfer_from() for raws because raws are relocated to public storage outside of butler before being ingested again (but without having to go through calling ingest-raws again since they have refs already from the embargo repo and are building up the FileDataset objects). That's also why refs are the interface and not dataIds.

Member:

Ok. I'm not thrilled with it, but if it's got a clear use case, go ahead, since the kind of generalization I want is even more design work (and I'd probably want to be even more cautious about releasing that half-baked), so we can cross the bridge of replacing this if and when we come to it.

) -> dict[DimensionElement, dict[DataCoordinate, DimensionRecord]]:
primary_records = self._extract_dimension_records_from_data_ids(
source_butler, data_ids, allowed_elements
)
Member:

There's an assumption here that if the destination butler already has records for some of these elements, insertDimensionData(..., skip_existing=True) is both efficient and correct as a way to resolve any conflicts. That's a reasonable-enough assumption for it to be the default, but we might want to provide more control for advanced users, especially if that could avoid queries against the source butler.
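The skip_existing semantics assumed here can be modeled in a few lines. This is a dict-based toy, not the registry implementation; `existing` stands in for the destination repo's dimension tables.

```python
def insert_dimension_data(existing, new_records, skip_existing=True):
    """Toy model of insertDimensionData(..., skip_existing=True):
    records whose keys are already present in the destination are
    silently skipped instead of raising a conflict."""
    inserted = []
    for key, record in new_records.items():
        if key in existing:
            if skip_existing:
                continue  # assume the existing record is equivalent
            raise ValueError(f"conflicting record for {key!r}")
        existing[key] = record
        inserted.append(key)
    return inserted
```

The correctness caveat in the comment is visible in the `continue` branch: skipping silently assumes the destination's record matches the source's.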

timj (Member Author):

Ok. For the populated_by records, wouldn't we have to query the target butler to see if they existed already?

Member:

I agree it's less important and trickier to let the user control those. If we expressed the user control as an "opt out" list of elements rather than an opt-in list we could probably still trust it, though.

timj (Member Author):

So you want a skip_elements: list[str] | None = None parameter to be added to butler.transfer_from and butler.transfer_dimension_records_from so that people could say "no detector/instrument/physical_filter or no visit_detector_region" or something. I agree that if detector/instrument/physical_filter are not present in the target repo then you probably do want to run register-instrument first, although the transfer being told to skip them wouldn't explain to people why the transfer failed in this case.
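The proposed opt-out could look roughly like this. It is hypothetical: this PR does not add a skip_elements parameter, and `filter_elements` is an illustrative name.

```python
def filter_elements(records_by_element, skip_elements=None):
    """Hypothetical sketch of the proposed skip_elements opt-out:
    drop records for any element the caller asserts already exists
    in the target repo (e.g. after butler register-instrument)."""
    skip = set(skip_elements or ())
    return {
        name: records
        for name, records in records_by_element.items()
        if name not in skip
    }
```

As noted above, skipping an element that is in fact absent from the target would make the transfer fail later with a less obvious error.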

Member:

Yes, that's about what I was thinking vaguely of. I don't care deeply about adding it on this ticket if you just want to get somebody else unblocked right now.


records = source_butler.registry.queryDimensionRecords( # type: ignore
element.name, **data_id.mapping # type: ignore
)
Member:

I think this is probably as efficient as we can make it now, but it will be really helpful when we can upload tables of data IDs to the query methods and join against them, both for efficiency and for simplifying the logic here (which is effectively a bunch of table joins written in Python). Right now I think we'd better hope this almost never gets called.
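Until data-ID table uploads exist, one cheap mitigation is to deduplicate the key combinations before querying, so each distinct combination is looked up once rather than once per dataset. This is a generic sketch, not code from this PR; `unique_query_keys` is an illustrative name and data IDs are modeled as plain mappings.

```python
def unique_query_keys(data_ids, element_keys):
    """Collapse a stream of data IDs (mappings) to the unique key
    combinations relevant to one dimension element, so a follow-up
    queryDimensionRecords-style call runs once per combination
    instead of once per dataset."""
    seen = set()
    for data_id in data_ids:
        key = tuple(
            sorted((k, data_id[k]) for k in element_keys if k in data_id)
        )
        if key and key not in seen:
            seen.add(key)
            yield dict(key)
```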

Commit messages:
  • This is clearer than trying to raise the same exception from itself.
  • Also copies related dimensions populated by the original set.
  • Butler.transfer_from now uses a part of this API.
  • This makes sure that exposure is inserted before visit and before visit_definition.
@timj timj merged commit f87c7a0 into main Dec 7, 2023
16 of 17 checks passed
@timj timj deleted the tickets/DM-41966 branch December 7, 2023 15:59