Skip to content

Conversation

kfindeisen
Copy link
Member

This PR removes the export/import idiom from MiddlewareInterface. The old idiom needed to use the deprecated butler.datastore.root to anchor relative paths, and there's no guarantee that the central repo won't have its datasets split among multiple stores.

This change ensures that all three _export_* methods export only
datasets, making it easier to treat them identically.
This change makes all the `_export_*` methods agnostic to how we
actually do the export. Their responsibility is now focused exclusively
on identifying the target datasets.
This change allows collections to be exported independently from their
datasets (export/import ties everything to datasets). Calibration
collections, however, need special handling for their associations
under the new system.
By using a combination of transfer_from, registerCollection, and
certify, we can avoid needing to know where datasets are stored, and
therefore using the deprecated Butler.datastore.root.
@kfindeisen kfindeisen marked this pull request as ready for review September 25, 2023 23:08
@kfindeisen kfindeisen requested a review from timj September 25, 2023 23:09
Copy link
Member

@timj timj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. Do you notice any performance difference? We know that export/import via YAML is really slow compared to doing the direct transfer.

@kfindeisen
Copy link
Member Author

kfindeisen commented Sep 26, 2023

Do you notice any performance difference? We know that export/import via YAML is really slow compared to doing the direct transfer.

Actually, it seems about the same: the integration test for this ticket had downloads of <2.4-3.8 s for prep_butler, and uploads of <6.9-9.0 s for export_outputs, while a previous run of main had <2.6-3.2 s and <8.2-9.0 s, respectively.

By using a combination of transfer_from and syncDimensionData, we can
avoid needing to know where datasets are stored, and therefore using
the deprecated Butler.datastore.root.
@timj
Copy link
Member

timj commented Sep 26, 2023

The speed up mainly comes when you have large numbers of datasets. Switching execution butler to transfer_from from export/import made a huge difference for 100k datasets.

@kfindeisen kfindeisen merged commit af179e0 into main Sep 26, 2023
@kfindeisen kfindeisen deleted the tickets/DM-40163 branch September 26, 2023 22:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants