
DM-41365: Add findDataset/getDataset/getDatasetType to Butler API #899

Merged: 13 commits into main, Nov 3, 2023

Conversation

@timj (Member) commented Oct 26, 2023:

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes

codecov bot commented Oct 26, 2023:

Codecov Report

Attention: 19 lines in your changes are missing coverage. Please review.

Comparison: base (e89626a) 87.66% vs. head (17bc74e) 87.68%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #899      +/-   ##
==========================================
+ Coverage   87.66%   87.68%   +0.01%     
==========================================
  Files         276      276              
  Lines       36165    36326     +161     
  Branches     7547     7581      +34     
==========================================
+ Hits        31705    31851     +146     
- Misses       3305     3313       +8     
- Partials     1155     1162       +7     
Files Coverage Δ
python/lsst/daf/butler/__init__.py 100.00% <100.00%> (ø)
python/lsst/daf/butler/_butler.py 90.71% <100.00%> (+0.48%) ⬆️
python/lsst/daf/butler/registry/_registry.py 98.23% <100.00%> (ø)
...n/lsst/daf/butler/remote_butler/server/__init__.py 100.00% <100.00%> (ø)
.../daf/butler/remote_butler/server/_server_models.py 100.00% <100.00%> (+100.00%) ⬆️
python/lsst/daf/butler/script/ingest_files.py 92.59% <100.00%> (ø)
tests/test_butler.py 97.94% <100.00%> (ø)
tests/test_server.py 95.77% <100.00%> (+3.88%) ⬆️
tests/test_simpleButler.py 98.43% <100.00%> (ø)
tests/test_testRepo.py 100.00% <100.00%> (ø)
... and 3 more


Comment on lines +85 to +88
# Populate the test server.
server_butler.import_(filename=os.path.join(TESTDIR, "data", "registry", "base.yaml"))
server_butler.import_(filename=os.path.join(TESTDIR, "data", "registry", "datasets-uuid.yaml"))

Contributor:

It doesn't matter yet, but you may want to move this above the call to _make_remote_butler, in case we decide to read some data from the server during RemoteButler initialization.

@timj (author):
I'm hoping that it doesn't make a difference here either way, because the clients won't be downloading dataset type and collection caches before this runs, will they?

@TallJimbo (Member) left a comment:

I just took a look at the APIs, and while I left a lot of comments, they're all minor. I did not look at the implementations (I gather that's not needed from me).

python/lsst/daf/butler/_butler.py (outdated; resolved)
python/lsst/daf/butler/_butler.py (resolved)
*,
collections: CollectionArgType | None = None,
timespan: Timespan | None = None,
datastore_records: bool = False,
Member:

If we provide this as an option, we should also provide a way to expand the data ID (i.e. add dimension records) at the same time, and we should think about whether we want just one option to do both or allow the two kinds of expansion to happen separately.

@timj (author) commented Oct 30, 2023:

I had completely failed to notice that that flag was in there now. One option is for find_dataset to always expand and add everything. People shouldn't be calling it in bulk.

Contributor:

Always expanding could be inefficient; I think many clients do not care about datastore records at all, and we should not pessimize things for them.

@timj (author):

Presumably get_dataset should have the datastore_records and dimension_records booleans as well for consistency?

Member:

👍

@timj (author):

I also need to add datastore records to SerializedDatasetRef, because that was never updated and so it's impossible to return datastore records from the server at the moment...

Contributor:

Jim asked me not to add them to SerializedDatasetRef, as it may increase graph size on disk (but I don't think we give records to refs that go into the graph, at least not yet).

@timj (author):

We are sending SerializedDatasetRef over the wire, so we can only reconstruct on the far end what was put in at the other end. We need to add them somehow (but obviously they should be optional). For now I've turned the feature off in client/server, so you can't get the datastore records.
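
For illustration, a sketch of the signature shape this thread converges on, with data-ID expansion as its own flag. Only collections, timespan, and datastore_records appear in the diff context above; dimension_records is the name floated in the discussion, and the remaining parameter names are assumptions, not the confirmed API:

from __future__ import annotations


class Butler:
    def find_dataset(
        self,
        dataset_type: DatasetType | str,
        data_id: DataId | None = None,
        *,
        collections: CollectionArgType | None = None,
        timespan: Timespan | None = None,
        # Attach opaque datastore records to the returned ref.
        datastore_records: bool = False,
        # Expand the data ID with dimension records (proposed in this thread).
        dimension_records: bool = False,
    ) -> DatasetRef | None:
        ...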

python/lsst/daf/butler/_butler.py (outdated; resolved)
python/lsst/daf/butler/_butler.py (resolved)
python/lsst/daf/butler/_butler.py (outdated; resolved)
@@ -799,6 +799,23 @@ def get_dataset_type(self, name: str) -> DatasetType:
"""
raise NotImplementedError()

@abstractmethod
def get_dataset(self, id: DatasetId) -> DatasetRef | None:
Member:

I think we'll want a vectorized version of this in the future, but it's a pain to implement now. I don't think this interface needs to change in order to live alongside a vectorized one, but I wanted to raise it in case you think otherwise.

@timj (author):

There are two open questions:

  • If we add a vectorized version, get_datasets, do we keep get_dataset around?
  • What should get_datasets return: list[DatasetRef] or DatasetRefContainer?

I think we likely want a single request to return a single ref, so it seems like we always want Butler.get_dataset even if we later add Butler.get_datasets.

Contributor:

Hmm, do we ever expose bare DatasetId to butler clients (do we need this method at butler level)?

@timj (author):

It certainly makes client/server testing a lot easier to begin with, since it's the simplest possible API to implement. We do have VO services that are given a UUID (via DataLink from, e.g., ObsTAP) and need to convert that to a dataset ref, so it's not impossible that the server will need to do that (depending on how services connect to butlers).

Member:

I think we'll want to start exposing dataset IDs in more places, too; I've been thinking about how we might get them into the --where expression language, for instance. They appear pretty prominently in QGs already.
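
To make the shape concrete, a minimal sketch: the scalar get_dataset from the diff above, a hypothetical get_datasets companion (its name and return type are precisely the open questions raised in this thread), and the VO-style lookup timj describes. The helper name resolve_datalink_id is illustrative, not from the PR:

from __future__ import annotations

import uuid
from abc import ABC, abstractmethod
from collections.abc import Iterable


class Butler(ABC):
    @abstractmethod
    def get_dataset(self, id: DatasetId) -> DatasetRef | None:
        """Return the ref for one dataset ID, or None if it does not exist."""
        raise NotImplementedError()

    # Hypothetical vectorized companion; a plain list is one of the two
    # return types discussed (the other being a DatasetRefContainer).
    def get_datasets(self, ids: Iterable[DatasetId]) -> list[DatasetRef]:
        return [ref for ref in map(self.get_dataset, ids) if ref is not None]


def resolve_datalink_id(butler: Butler, hex_id: str) -> DatasetRef:
    """Resolve a bare UUID string (e.g. handed out via DataLink from ObsTAP)."""
    ref = butler.get_dataset(uuid.UUID(hex_id))
    if ref is None:
        raise LookupError(f"dataset {hex_id} not found")
    return ref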

@andy-slac (Contributor) left a comment:

Looks good; a few minor questions.

python/lsst/daf/butler/_butler.py (outdated; resolved)

python/lsst/daf/butler/_butler.py (outdated; resolved)
python/lsst/daf/butler/_butler.py (outdated; resolved)
Comment on lines +245 to +248
# Temporary hack. Assume strings for collections. In future
# want to construct CollectionWildcard and filter it through collection
# cache to generate list of collection names.
Contributor:

I'm working on the caching ticket now; in the long term, I would probably prefer not to have a complete list of collections on every client.

@timj (author):

Interesting. I thought we said last week that we would. I thought @TallJimbo wanted the client to know all possible collections when it was building the query object, before sending it to the server. The caching story between client and server is a bit tricky.

Contributor:

Yes, I know, but we may need to rethink that part; otherwise we may have a scaling issue. I looked at Butler initialization time today: the simplest butler query-data-ids --collections=HSC/raw/all /repo/main exposure takes 10 seconds, and a large fraction of that comes from pre-loading the collections cache. We have 300k collections today, and that number will only grow with time. There are probably ways to reduce that, but in my ideal world we would cache almost nothing on the client side. 🙂

@TallJimbo (Member) commented Oct 31, 2023:

I think having the caches in the client has worked well for us so far (i.e. with the server just being the DB), and I like the idea of being able to optimize server calls away entirely in the client sometimes. And I do think caching on the server is a lot harder, since we can't just punt cache misses to the user there.

But I do hesitate a bit, because I'm not sure whether "cache all the collections the user can see" stays in the "small enough to be irrelevant" category after DR1, both in terms of downloading the cache and holding it in memory. And I have no idea if my "do more on the client to keep the server load light" intuition is better than @dhirving's "do nothing on the client to keep version management easier" intuition.

Contributor:

Downloading a million collection records just to figure out that a pattern does not match any of them looks like a lot of overhead. It is of course about finding the right balance between what we do on the server vs. the client; we need to find something that works and scales.

Member:

Ok, I'm sold on not trying to aggressively fetch all collections the user can access. It's the dataset type definitions, chained collection members, and collection summaries that are needed for query construction, and we don't need all of any of those.

I think what may break if we drop the complete collection-name cache is our regex/glob collection query implementation: we'll need to start translating/limiting those patterns to something that can be run inside the DB. Yes, we could do a full cache in the RemoteButler server, but this scaling problem will hit DirectButler, too.

I'll do a brain dump on Slack later today about the places we use the caches in the query system, so somebody other than me can do some conceptual design work on what it should look like. I think we need a better sense of where this is all going, even if we don't decide to switch fully from the current approach yet, but I want to stay off at least this part of the critical path.
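
As a concrete illustration of that translation problem (not code from this PR), a minimal sketch that rewrites a shell-style glob as a SQL LIKE pattern so the match can run inside the DB rather than against a downloaded name cache:

def glob_to_sql_like(pattern: str) -> str:
    """Translate a shell-style glob into a SQL LIKE pattern.

    Handles only * and ?; literal % and _ are escaped, so the LIKE
    clause must declare ESCAPE '\\'. Character classes are unsupported
    and would need a different mechanism (e.g. regex support in the DB).
    """
    out = []
    for ch in pattern:
        if ch == "*":
            out.append("%")  # glob "any run" becomes SQL "any run"
        elif ch == "?":
            out.append("_")  # single-character wildcard
        elif ch in ("%", "_"):
            out.append("\\" + ch)  # escape SQL metacharacters so they match literally
        else:
            out.append(ch)
    return "".join(out)


assert glob_to_sql_like("HSC/runs/*") == "HSC/runs/%"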

# Temporary hack. Assume strings for collections. In future
# want to construct CollectionWildcard and filter it through collection
# cache to generate list of collection names.
wildcards = CollectionWildcard.from_expression(collections)
Contributor:

Maybe do a quick check here and raise if patterns specify something?

@timj (author):

It seems like I have to change the API to something a bit simpler for the collection list anyhow.
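
A sketch of the suggested guard, under the assumption that CollectionWildcard exposes the parsed literal names and regex patterns as strings/patterns attributes (its use in the diff implies something like this; the import path is likewise assumed):

from typing import Any

# Assumed import location for CollectionWildcard.
from lsst.daf.butler.registry.wildcards import CollectionWildcard


def collections_to_names(collections: Any) -> list[str]:
    """Reject wildcard patterns early, before anything is sent to the server."""
    wildcards = CollectionWildcard.from_expression(collections)
    if wildcards.patterns:  # a tuple of regexes, or `...` meaning "everything"
        raise TypeError(
            "Wildcard collection patterns are not supported here; "
            "pass an explicit sequence of collection names."
        )
    return list(wildcards.strings)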

) -> SerializedDatasetType:
"""Return the dataset type."""
butler = factory.create_butler()
datasetType = butler.get_dataset_type(dataset_type_name)
Contributor:

Is the exception transported transparently back to the client?
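
One conventional way to do this, sketched under the assumption that the server is a FastAPI app; the commits below confirm only that MissingDatasetTypeError is caught in the server and re-raised in the client, not this exact mechanism:

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
from lsst.daf.butler.registry import MissingDatasetTypeError

app = FastAPI()


@app.exception_handler(MissingDatasetTypeError)
def missing_dataset_type_handler(
    request: Request, exc: MissingDatasetTypeError
) -> JSONResponse:
    # Encode the exception type by name so the client can re-raise the
    # matching Python exception instead of a generic HTTP error.
    return JSONResponse(
        status_code=404,
        content={"exception": "MissingDatasetTypeError", "detail": str(exc)},
    )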

tests/test_server.py (resolved)
The underscores are the preferred way to add new APIs now.

The collections parameter should not support wildcards, so explicitly declare that it should be a sequence of str.
@timj force-pushed the tickets/DM-41365 branch 3 times, most recently from 07b0946 to 9afece7, on November 2, 2023 at 22:00
…aset

Datastore records do not work in remote butler because they
were never added to SerializedDatasetRef.
Demonstrate that MissingDatasetTypeError can be caught in
the server and raised again in the client.
@timj merged commit 5baa637 into main on Nov 3, 2023 (16 checks passed).
@timj deleted the tickets/DM-41365 branch on November 3, 2023 at 04:58.