DM-40163: Rework dataset transfer code in MiddlewareInterface #85

kfindeisen · 2023-09-25T22:19:05Z

This PR removes the export/import idiom from MiddlewareInterface. The old idiom needed to use the deprecated butler.datastore.root to anchor relative paths, and there's no guarantee that the central repo won't have its datasets split among multiple stores.

This change ensures that all three _export_* methods export only datasets, making it easier to treat them identically.

This change makes all the `_export_*` methods agnostic to how we actually do the export. Their responsibility is now focused exclusively on identifying the target datasets.

This change allows collections to be exported independently from their datasets (export/import ties everything to datasets). Calibration collections, however, need special handling for their associations under the new system.

By using a combination of transfer_from, registerCollection, and certify, we can avoid needing to know where datasets are stored, and therefore using the deprecated Butler.datastore.root.
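As a rough illustration of the idiom described above (not the code in this PR), the download side might be sketched as follows. The helper name `transfer_calibs`, its parameters, and the `(ref, timespan)` input format are hypothetical. Only `transfer_from`, `registerCollection`, and `certify` come from the description, and the keyword arguments shown are assumptions to check against the daf_butler documentation.

```python
def transfer_calibs(central, local, refs, calib_collection, calib_type, associations):
    """Copy datasets and rebuild a calibration collection locally.

    Hypothetical helper; all parameter names are illustrative.

    central, local : Butler-like objects (source and destination repos).
    refs : dataset references to copy.
    calib_collection : name of the calibration collection to recreate.
    calib_type : collection type to register (e.g. CollectionType.CALIBRATION).
    associations : iterable of (ref, timespan) pairs giving validity ranges.
    """
    # The destination butler asks the source where each file lives, so this
    # code never needs a datastore root to anchor relative paths.
    transferred = local.transfer_from(
        central,
        refs,
        transfer="copy",
        register_dataset_types=True,
        transfer_dimensions=True,
    )

    # Collections are registered on their own, independently of any dataset.
    local.registry.registerCollection(calib_collection, calib_type)

    # Calibration collections need their validity ranges re-associated:
    # certify takes one timespan per call, so group the refs by timespan.
    by_timespan = {}
    for ref, timespan in associations:
        by_timespan.setdefault(timespan, []).append(ref)
    for timespan, group in by_timespan.items():
        local.registry.certify(calib_collection, group, timespan)

    return transferred
```

The grouping step is one possible way to handle the "special handling" of calibration associations mentioned above; the actual change may obtain and apply them differently.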

timj

This looks good to me. Do you notice any performance difference? We know that export/import via YAML is really slow compared to doing the direct transfer.

python/activator/middleware_interface.py

kfindeisen · 2023-09-26T20:11:57Z

> Do you notice any performance difference? We know that export/import via YAML is really slow compared to doing the direct transfer.

Actually, it seems about the same: the integration test for this ticket had downloads of <2.4-3.8 s for prep_butler, and uploads of <6.9-9.0 s for export_outputs, while a previous run of main had <2.6-3.2 s and <8.2-9.0 s, respectively.

By using a combination of transfer_from and syncDimensionData, we can avoid needing to know where datasets are stored, and therefore using the deprecated Butler.datastore.root.
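The upload direction can be sketched in the same spirit. This is a minimal sketch under stated assumptions, not the code in this PR: the helper name `upload_outputs` and the shape of `dimension_records` (a mapping from dimension element name to records) are hypothetical, and only `transfer_from` and `syncDimensionData` are taken from the sentence above.

```python
def upload_outputs(local, central, refs, dimension_records):
    """Push locally produced datasets to the central repo.

    Hypothetical helper; all parameter names are illustrative.

    local, central : Butler-like objects (source and destination repos).
    refs : dataset references to upload.
    dimension_records : mapping of dimension element name to an iterable of
        records that the central registry may not have yet.
    """
    # Datasets refer to dimension records, so make sure those exist in the
    # central registry first. update=False leaves existing rows untouched.
    for element, records in dimension_records.items():
        for record in records:
            central.registry.syncDimensionData(element, record, update=False)

    # As with downloads, the butlers negotiate file locations themselves,
    # so no datastore root is needed here.
    return central.transfer_from(
        local,
        refs,
        transfer="copy",
        transfer_dimensions=False,
    )
```

Syncing the records explicitly and then passing `transfer_dimensions=False` is one possible split of responsibilities, chosen here for illustration.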

timj · 2023-09-26T20:41:01Z

The speedup mainly shows up when you have large numbers of datasets. Switching execution butler from export/import to transfer_from made a huge difference for 100k datasets.

kfindeisen added 4 commits September 22, 2023 10:59

Factor _export_collections call out of _export_skymap_and_templates.

b8a7fd1

This change ensures that all three _export_* methods export only datasets, making it easier to treat them identically.

Factor references to RepoExportContext out of _export_*.

4b27195

This change makes all the `_export_*` methods agnostic to how we actually do the export. Their responsibility is now focused exclusively on identifying the target datasets.

Remove use of export/import in prep_butler.

87dc2bc

By using a combination of transfer_from, registerCollection, and certify, we can avoid needing to know where datasets are stored, and therefore using the deprecated Butler.datastore.root.

kfindeisen marked this pull request as ready for review September 25, 2023 23:08

kfindeisen requested a review from timj September 25, 2023 23:09

timj approved these changes Sep 26, 2023


python/activator/middleware_interface.py (review thread resolved)

Remove use of export/import in export_outputs.

ac32674

By using a combination of transfer_from and syncDimensionData, we can avoid needing to know where datasets are stored, and therefore using the deprecated Butler.datastore.root.

kfindeisen force-pushed the tickets/DM-40163 branch from 631b8ba to ac32674 Compare September 26, 2023 20:36

kfindeisen merged commit af179e0 into main Sep 26, 2023

kfindeisen deleted the tickets/DM-40163 branch September 26, 2023 22:14
