DM-39939: use query-results' grouping when processing iterables of DatasetRefs #863

TallJimbo · 2023-07-08T15:52:44Z

Checklist

ran Jenkins
added a release note for user-visible changes to doc/changes

codecov · 2023-07-08T16:22:22Z

Codecov Report

Patch coverage: 89.51% and project coverage change: +0.01 🎉

Comparison is base (5b07c4d) 87.90% compared to head (8213990) 87.92%.

❗ Current head 8213990 differs from pull request most recent head a3ec38a. Consider uploading reports for the commit a3ec38a to get more accurate results

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #863      +/-   ##
==========================================
+ Coverage   87.90%   87.92%   +0.01%     
==========================================
  Files         273      270       -3     
  Lines       35764    35697      -67     
  Branches     7474     7478       +4     
==========================================
- Hits        31440    31388      -52     
+ Misses       3166     3150      -16     
- Partials     1158     1159       +1

Impacted Files	Coverage Δ
python/lsst/daf/butler/registries/sql.py	`85.07% <ø> (+0.02%)`	⬆️
python/lsst/daf/butler/core/progress.py	`86.99% <79.62%> (+0.57%)`	⬆️
...ython/lsst/daf/butler/registry/queries/_results.py	`89.94% <80.00%> (+<0.01%)`	⬆️
python/lsst/daf/butler/core/datasets/ref.py	`84.33% <100.00%> (+0.36%)`	⬆️
tests/test_progress.py	`100.00% <100.00%> (ø)`

... and 29 files with indirect coverage changes


andy-slac

Looks good, one suggestion to make it a bit more type-checkable.

andy-slac · 2023-07-10T23:47:43Z

python/lsst/daf/butler/core/datasets/ref.py

+        if hasattr(refs, "_iter_by_dataset_type"):
+            return refs._iter_by_dataset_type()


I'm not super-happy with this dynamic approach, particularly because our favorite mypy cannot type-check this. Would it be possible to add an abstraction (or maybe Protocol with runtime check) so that isinstance can be used instead?

TallJimbo · 2023-07-14

I've switched to a runtime-checkable protocol, but I've given it a leading underscore (and documented it as "package private") since I don't want external code to start using this.

andy-slac · 2023-07-10T23:48:21Z

python/lsst/daf/butler/core/datasets/ref.py

+        refs : `~collections.abc.Iterable` [ `DatasetRef` ]
+            `DatasetRef` instances to group.  If this has a
+            ``_iter_by_dataset_type`` method, it will be called with no
+            arguments and the result reutrnd.


Suggested change

-            arguments and the result reutrnd.
+            arguments and the result returned.

andy-slac · 2023-07-10T23:57:51Z

tests/test_progress.py

+        self.assertEqual(MockProgressBar.last.total, 2)
+
+    def test_iter_item_chunks_not_sized(self):
+        """Test using `Progress.iter_item_chunks` with an unsized iterable


Suggested change

-        """Test using `Progress.iter_item_chunks` with an unsized iterable
+        """Test using `Progress.iter_item_chunks` with an unsized iterable of
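
As context for the test under review, one common way to make a progress wrapper tolerate unsized iterables is to ask for a total only when the input supports `len()`. The sketch below illustrates that idea in isolation; `iter_with_progress` and its `report` callback are hypothetical and are not the `Progress` API from this package.

```python
from collections.abc import Callable, Iterable, Iterator, Sized
from typing import Optional, TypeVar

T = TypeVar("T")


def iter_with_progress(
    iterable: Iterable[T],
    report: Callable[[int, Optional[int]], None],
) -> Iterator[T]:
    """Yield items from ``iterable``, reporting (count, total) for each one.

    ``total`` is ``None`` when the iterable has no length (e.g. a generator).
    """
    total = len(iterable) if isinstance(iterable, Sized) else None
    for count, item in enumerate(iterable, start=1):
        report(count, total)
        yield item
```

With a list the callback receives a known total; with a generator it receives `None`, and a progress bar would have to fall back to an indeterminate display.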

TallJimbo force-pushed the tickets/DM-39939 branch 2 times, most recently from 45e7f53 to 8213990, on July 10, 2023 15:27

andy-slac approved these changes Jul 10, 2023

TallJimbo force-pushed the tickets/DM-39939 branch from 8213990 to 8a9e8c4 on July 14, 2023 16:11

TallJimbo added 2 commits July 14, 2023 12:32

Make progress methods handle unsized iterables.

9af7b4f

Use query-results' natural grouping by dataset type when possible.

a3ec38a

When processing all dataset types in a collection together, this can represent a huge decrease in memory usage, by querying for and then processing only one dataset type at a time.
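
The memory argument can be illustrated with a small sketch, assuming a hypothetical `query` callable that returns all refs for a single dataset type. A generator that runs one query per type keeps only one group alive at a time, whereas collecting every ref first and grouping afterwards holds them all in memory at once. `FAKE_DB`, `fake_query`, and `iter_by_dataset_type_lazily` are illustrative names only, not part of the real registry API.

```python
from collections.abc import Callable, Iterable, Iterator

# Hypothetical in-memory "database" mapping dataset type name to ref labels.
FAKE_DB: dict[str, list[str]] = {
    "bias": ["bias-1", "bias-2"],
    "flat": ["flat-1"],
}

# Records which dataset types have been queried so far.
calls: list[str] = []


def fake_query(dataset_type: str) -> list[str]:
    """Stand-in for a per-dataset-type query."""
    calls.append(dataset_type)
    return list(FAKE_DB[dataset_type])


def iter_by_dataset_type_lazily(
    dataset_types: Iterable[str],
    query: Callable[[str], list[str]],
) -> Iterator[tuple[str, list[str]]]:
    """Yield (dataset_type, refs) pairs, querying one type at a time."""
    for dataset_type in dataset_types:
        # Only this type's refs are materialized here; earlier groups can be
        # freed once the caller has finished with them.
        yield dataset_type, query(dataset_type)


for dataset_type, refs in iter_by_dataset_type_lazily(FAKE_DB, fake_query):
    print(dataset_type, len(refs))
```

Because the generator is lazy, no query runs until the caller asks for the next group, so peak memory scales with the largest single dataset type rather than with the whole collection.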

TallJimbo force-pushed the tickets/DM-39939 branch from 8a9e8c4 to a3ec38a on July 14, 2023 16:32

TallJimbo merged commit f670e8b into main Jul 14, 2023
15 of 16 checks passed

TallJimbo deleted the tickets/DM-39939 branch July 14, 2023 16:53
