DM-26685: Add command-line tool for Registry.queryDatasets #389
Conversation
Force-pushed 80e30ee to 3552d3d
```python
@repo_argument(required=True)
@glob_argument(help="GLOB is one or more glob-style expressions that fully or partially identify the "
                    "dataset types to be queried.")
@click.option("--collections",
```
we have a shared option "--collection-type" used in other places. What do you think about using that here?
That's a different kind of argument entirely, actually; `--collections` here is more like the positional GLOB argument to `butler query-collections`.

FWIW, `queryDatasets` does internally expand a `collections` argument that includes one or more patterns into an explicit list of collections in much the same way as `queryCollections`, and that could accept collection types from the user to provide more fine-grained control over what's in that expanded list. But that'd first require a change to `Registry.queryDatasets` itself, and hence is probably out of scope for this ticket.
```python
              order of `collections` passed in). If used, `collections` must specify at least one
              expression and must not contain wildcards."""))
@click.option("--components",
              type=click.Choice(["ALL", "NONE", "UNMATCHED"], case_sensitive=False),
```
I just realized a conflict between this --components and the flag --components/--no-components on line 154. Grumble...
I think I'm inclined to remove this option in the interests of keeping the user interface as simple as possible to start with. It seems to me that UNMATCHED is the right default in almost every situation.
```python
              help=unwrap("""A string expression similar to a SQL WHERE clause. May involve any column of a
                             dimension table or a dimension name as a shortcut for the primary key column of a
                             dimension table."""))
@click.option("--deduplicate",
```
I don't think deduplicate should be a user interface option at all. Why would someone ever want to see a duplicate row? All it does is add confusion to the output. I think you should use the deduplicate flag internally where you can, and if you can't, you need to throw the per-dataset-type refs into a set and then extract them from the set sorted by ref.id.
Maybe this needs a new name, but it is not at all equivalent to deduplicating results after the fact. `deduplicate=True` says to only return one result for each datasetType+dataId combination across all searched collections, searching them in order.
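Those semantics can be sketched with plain dicts standing in for collections; every name below is illustrative rather than the actual Registry API:

```python
# Sketch of deduplicate=True semantics: return only the first match for each
# (datasetType, dataId) pair, searching the collections in the order given.
# Plain dicts stand in for collections; none of these names are real API.
def find_first(collections, datasetType, dataId):
    for collection in collections:  # search order matters
        ref = collection.get((datasetType, dataId))
        if ref is not None:
            return ref
    return None

run_a = {("bias", "det=10"): "bias@run_a"}
run_b = {("bias", "det=10"): "bias@run_b", ("flat", "det=10"): "flat@run_b"}

# run_a is searched first, so its bias shadows run_b's; flat only exists in run_b.
print(find_first([run_a, run_b], "bias", "det=10"))  # bias@run_a
print(find_first([run_a, run_b], "flat", "det=10"))  # flat@run_b
```

Swapping the collection order would make `run_b`'s bias win instead, which is exactly why this is not the same as deduplicating rows after the fact.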
Even more important not to confuse the output here, then, where we get identical answers in the output and no one knows why. Are you saying that I got two rows because a dataset happened to be in two collections, but since the collection is not known to the DatasetRef we can't tell why we had two rows?
I think it's somewhere in between, because DatasetRef does know the dataset's RUN-type collection, so both that and the dataset_id will always be printed, and the fact that they're actually different and where they're coming from will be apparent. That should be the common case.

But it is also possible to get the exact same dataset multiple times because both the RUN it's in and a TAGGED collection it's in are being queried. Providing that information about why it appeared twice is hard right now, but just removing that duplication is as simple as stuffing the query results in a `set`, and that seems like the right thing to do anyway.
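The set-based cleanup can be sketched like this; `FakeRef` is a hypothetical stand-in for `DatasetRef` (frozen so instances are hashable and can live in a set):

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen => hashable, so instances can go in a set
class FakeRef:
    """Illustrative stand-in for DatasetRef; not the real class."""
    datasetType: str
    run: str
    id: int


# The same dataset returned twice, e.g. once via its RUN and once via a
# TAGGED collection that also contains it.
results = [
    FakeRef("bias", "HSC/calib", 1),
    FakeRef("bias", "HSC/calib", 1),
    FakeRef("flat", "HSC/calib", 2),
]

# The set collapses exact duplicates; sorting by id makes the output stable.
deduped = sorted(set(results), key=lambda ref: ref.id)
print(len(deduped))  # 2
```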
```python
row = OrderedDict(type=datasetRef.datasetType.name, run=datasetRef.run, id=datasetRef.id)
row.update(datasetRef.dataId.items())
rows.append(row)
tables[datasetRef.datasetType.name] = rows
```
I don't really like how this bit of code reads the array, updates it, and then assigns it back to the same place. That seems like it should be something that `defaultdict` can handle. The following code results in the same output, but there is only one line in the for loop rather than five and it's not continually writing the list back to the dict.
```python
from collections import defaultdict
from astropy.table import Table

tables = defaultdict(list)
for datasetRef in datasets:
    tables[datasetRef.datasetType.name].append({"type": datasetRef.datasetType.name,
                                                "run": datasetRef.run,
                                                "id": datasetRef.id,
                                                **datasetRef.dataId})

for datasetName, rows in tables.items():
    print("")
    Table(rows, names=rows[0].keys()).pprint_all()
    print("")
```
The only annoyance is that `Table` sorts the columns if they aren't an `OrderedDict`, so you need the names here explicitly (there is a comment in astropy that they should fix this when they drop older pythons).
@timj there is something about the sorting code that doesn't like sorting (I think) `DimensionElement`s:

```
[{'type': 'bias', 'run': 'HSC/calib/gen2/2013-11-03', 'id': 1, Dimension(instrument): 'HSC', Dimension(detector): 10}]
Traceback (most recent call last):
  File "/Users/n8pease/code/lsst/daf_butler/bin/butler", line 28, in <module>
    sys.exit(main())
  File "/Users/n8pease/code/lsst/daf_butler/python/lsst/daf/butler/cli/butler.py", line 311, in main
    return cli()
  File "/Users/n8pease/code/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-cb4e2dc/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/Users/n8pease/code/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-cb4e2dc/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/Users/n8pease/code/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-cb4e2dc/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/n8pease/code/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-cb4e2dc/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/n8pease/code/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-cb4e2dc/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/Users/n8pease/code/lsst/daf_butler/python/lsst/daf/butler/cli/cmd/commands.py", line 203, in query_datasets
    Table(rows, names=rows[0].keys()).pprint_all()
  File "/Users/n8pease/code/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-cb4e2dc/lib/python3.7/site-packages/astropy/table/table.py", line 610, in __init__
    init_func(data, names, dtype, n_cols, copy)
  File "/Users/n8pease/code/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-cb4e2dc/lib/python3.7/site-packages/astropy/table/table.py", line 852, in _init_from_list_of_dicts
    names_from_data = sorted(names_from_data)
  File "/Users/n8pease/code/lsst/daf_butler/python/lsst/daf/butler/core/dimensions/elements.py", line 305, in __gt__
    return self.universe.getElementIndex(self.name) > self.universe.getElementIndex(other.name)
AttributeError: 'str' object has no attribute 'name'
```

I can go debug that if we want(?), or I can use 3 rows in the loop and an `OrderedDict`:
```python
for datasetRef in datasets:
    row = OrderedDict(type=datasetRef.datasetType.name, run=datasetRef.run, id=datasetRef.id)
    row.update(datasetRef.dataId.items())
    tables[datasetRef.datasetType.name].append(row)
```
And then we don't have to do the names annoyance either.
What do you think?
That error is from astropy because you have not specified the names as an argument to the Table constructor. Did you remove the names parameter? Without the names parameter and without an OrderedDict it tries to work out the column names itself but sorts them. With the names parameter it doesn't bother.
You are right that the row is the OrderedDict and tables is the defaultdict so I did miss that last night.
Can't your new example code do the `**datasetRef.dataId` expansion in the `OrderedDict` constructor?
Does the following work then?

```python
for datasetRef in datasets:
    row = OrderedDict(type=datasetRef.datasetType.name, run=datasetRef.run, id=datasetRef.id,
                      **datasetRef.dataId)
    tables[datasetRef.datasetType.name].append(row)
```
Please add a comment that you are using `OrderedDict` specifically because current astropy has not yet been changed to understand that dicts are ordered. Otherwise people are going to get confused later on.
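For what it's worth, keyword arguments and a `**` expansion into `OrderedDict` both preserve order (guaranteed since Python 3.6/3.7), which is what the one-liner relies on; here a plain str-keyed dict stands in for the dataId:

```python
from collections import OrderedDict

# Stand-in for a dataId; the real keys are Dimension objects, not strings.
dataId = {"instrument": "HSC", "detector": 10}

# Keyword order, then dataId insertion order, is preserved in the result.
row = OrderedDict(type="bias", run="HSC/calib/gen2/2013-11-03", id=1, **dataId)
print(list(row))  # ['type', 'run', 'id', 'instrument', 'detector']
```

So the columns come out in the order the row was built, and no separate `row.update(...)` step is needed.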
If `rows` is a dict and I pass `names` to the `Table` init, it fails as above. See code below.

```python
for datasetRef in datasets:
    row = dict(type=datasetRef.datasetType.name, run=datasetRef.run, id=datasetRef.id)
    row.update(datasetRef.dataId.items())
    tables[datasetRef.datasetType.name].append(row)
```
and this in the command function:

```python
for datasetName, rows in tables.items():
    print("")
    Table(rows, names=rows[0].keys()).pprint_all()
    print("")
```
If I comment out `row.update(datasetRef.dataId.items())` then it does not fail.
Also, I've realized that if I make the astropy table in the script function I can do it without the intermediate data structure. Will investigate that briefly.
Okay, I think I already said I'm fine with OrderedDict if you use the `**` expansion rather than the separate update.
It looks like astropy sorts the column names even when a `names` is given, and then ignores the sorted result in favor of the names you asked for. The problem you are having is that the keys of a dataId are not actually strings, so you need to convert the dataId to a plain str-keyed dict for the sorting to work (or fix the comparison method in the class as you say above; something we should probably do eventually, since sorting the keys of a dataId is not completely ridiculous, but not for this ticket). None of this is relevant if you stick with `OrderedDict`, along with a note as to why.
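The str-key conversion being suggested can be sketched as follows; `FakeDimension` is a hypothetical stand-in for a non-str dataId key (like the real ones, it cannot be ordered against `str`, which is what trips up astropy's `sorted()`):

```python
class FakeDimension:
    """Hypothetical stand-in for a non-str dataId key such as Dimension."""

    def __init__(self, name):
        self.name = name

    def __hash__(self):
        return hash(self.name)

    def __eq__(self, other):
        return isinstance(other, FakeDimension) and self.name == other.name


dataId = {FakeDimension("instrument"): "HSC", FakeDimension("detector"): 10, "type": "bias"}

# Convert every key to a plain str; keys that are already str pass through
# unchanged because they have no .name attribute.
strDataId = {getattr(k, "name", k): v for k, v in dataId.items()}
print(sorted(strDataId))  # ['detector', 'instrument', 'type']
```

With homogeneous str keys, `sorted()` over the column names no longer raises, so the `names` workaround (or `OrderedDict`) becomes unnecessary.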
Force-pushed 96da9a9 to 3f0e1c3
Two main comments:
- I think we want to use the sorting from DM-26324: Enable sorting of datasets during export #396 so that we get predictable output.
- Don't ask for URIs unless people want them.
Plus some other comments. I tried out the command line and it worked great.
```python
    DatasetRef and component name, and component URI."""
    datasetTypeName = datasetRef.datasetType.name
    if name:
        datasetTypeName = ".".join((datasetTypeName, name))
```
Please use the proper API for this. Something like:

```python
datasetTypeName = datasetRef.datasetType.componentTypeName(name)
```
```python
                         *extraValues))

for datasetRef in datasets:
    primaryURI, componentURIs = butler.getURIs(datasetRef, collections=datasetRef.run)
```
getURIs takes a non-zero amount of time so I wouldn't call it unless `show_uri` is `True`.
As far as I've been able to tell, getURIs is the only way to get the components of the datasetRef. Is there another way, or do we want to not show the components if `show_uri` is false?
If people explicitly ask for `raw.wcs` as a dataset type then they can get the components they ask for, and getURIs should do the right thing. getURIs will only return multiple values if the dataset has been disassembled (in which case each component gets a separate URI; if it has not been disassembled each component would get the same URI). Therefore getURIs is only useful if people want to see URIs. If they don't want URIs, don't call it at all and give people what they got from queryDatasets (which may or may not include a component name).
```python
                         datasetRef.id,
                         *extraValues))

for datasetRef in datasets:
```
In DM-26324 (#396) I am adding support for sorting DatasetRefs. This would really help with reproducibility of the output here since at the moment we are at the mercy of the order returned by the database and that order is going to depend on the database. For example you end up with the detectors not coming out as you would expect so it's hard to work out if something obvious is missing. I think this means you can group all the returned DatasetRef from queryDatasets into a dict of sets (keyed by DatasetType name) and then sort the DatasetRef out of the sets.
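That grouping into a dict of sets followed by sorting can be sketched like this; `FakeRef` and the `id` sort key are illustrative stand-ins (the real ordering would come from DM-26324's DatasetRef sorting support):

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeRef:
    """Illustrative stand-in for DatasetRef."""
    datasetTypeName: str
    id: int


# Database order is arbitrary and may contain duplicates.
refs = [FakeRef("flat", 3), FakeRef("bias", 2), FakeRef("bias", 2), FakeRef("bias", 1)]

grouped = defaultdict(set)       # sets absorb duplicate refs
for ref in refs:
    grouped[ref.datasetTypeName].add(ref)

for name in sorted(grouped):     # deterministic table order by dataset type
    rows = sorted(grouped[name], key=lambda r: r.id)  # deterministic row order
    print(name, [r.id for r in rows])
# bias [1, 2]
# flat [3]
```

The output is now independent of whatever row order the database happens to return.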
```python
def __init__(self, columnNames):
    self.columnNames = columnNames
    # Rows are stored in the keys of a dict to get set-like
```
As I say below, I think you can use sets and sorting if my other ticket merges.
tests/test_cliCmdQueryDatasets.py (Outdated)
```python
if showUri:
    expectedNames.append("URI")

for table in tables:
```
Would it be clearer here to define the `astropy.table.Table` you expect to get and then use `assertEqual` directly, rather than try to write code to compare all the rows and columns yourself?
`assertEqual` (and variations) does not seem to work with astropy tables. I'm able to use `report_diff_values` to good effect though: https://docs.astropy.org/en/stable/api/astropy.utils.diff.report_diff_values.html#astropy.utils.diff.report_diff_values
Something like:

```python
diff = io.StringIO()
self.assertEqual(len(tables), 1)
self.assertTrue(report_diff_values(tables[0], expectedTable, fileobj=diff), msg=diff.getvalue())
```

Does that seem ok?
That works for me.
tests/test_cliCmdQueryDatasets.py (Outdated)
```python
self.maxDiff = None
print(self.root)
self.assertListEqual(tables[0]["URI"].tolist(), [
    self._makeUri(
```
The Butler datastore root is a ButlerURI, so for what you are doing something like `butler.datastore.root.join("ingest/run/test_blah/")` should work for you without having to use urlunparse.
Force-pushed 3f0e1c3 to 70029ae