DM-26685: Add command-line tool for Registry.queryDatasets #389
Conversation
Force-pushed 80e30ee to 3552d3d
```python
@repo_argument(required=True)
@glob_argument(help="GLOB is one or more glob-style expressions that fully or partially identify the "
                    "dataset types to be queried.")
@click.option("--collections",
```
we have a shared option "--collection-type" used in other places. What do you think about using that here?
That's a different kind of argument entirely, actually; `--collections` here is more like the positional GLOB argument to `butler query-collections`.

FWIW, `queryDatasets` does internally expand a `collections` argument that includes one or more patterns into an explicit list of collections in much the same way as `queryCollections`, and that could accept collection types from the user to provide more fine-grained control over what's in that expanded list. But that'd first require a change to `Registry.queryDatasets` itself, and hence is probably out of scope for this ticket.
```python
              order of `collections` passed in). If used, `collections` must specify at least one
              expression and must not contain wildcards."""))
@click.option("--components",
              type=click.Choice(["ALL", "NONE", "UNMATCHED"], case_sensitive=False),
```
I just realized a conflict between this --components and the flag --components/--no-components on line 154. Grumble...
I think I'm inclined to remove this option in the interests of keeping the user interface as simple as possible to start with. It seems to me that UNMATCHED is the right default in almost every situation.
```python
              help=unwrap("""A string expression similar to a SQL WHERE clause. May involve any column of a
                             dimension table or a dimension name as a shortcut for the primary key column of a
                             dimension table."""))
@click.option("--deduplicate",
```
I don't think deduplicate should be a user interface option at all. Why would someone ever want to see a duplicate row? All it does is add confusion to the output. I think you should use the deduplicate flag internally where you can, and if you can't, you need to throw the per-dataset-type refs into a set and then extract them from the set sorted by ref.id.
Maybe this needs a new name, but it is not at all equivalent to deduplicating results after the fact. `deduplicate=True` says to only return one result for each datasetType+dataId combination across all searched collections, searching them in order.
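Those semantics can be sketched with plain dicts standing in for collections; every name below is illustrative rather than the actual Registry API:

```python
# Sketch of deduplicate=True semantics: return only the first match for each
# (datasetType, dataId) pair, searching the collections in the order given.
# Plain dicts stand in for collections; none of these names are real API.
def find_first(collections, datasetType, dataId):
    for collection in collections:  # search order matters
        ref = collection.get((datasetType, dataId))
        if ref is not None:
            return ref
    return None

run_a = {("bias", "det=10"): "bias@run_a"}
run_b = {("bias", "det=10"): "bias@run_b", ("flat", "det=10"): "flat@run_b"}

# run_a is searched first, so its bias shadows run_b's; flat only exists in run_b.
print(find_first([run_a, run_b], "bias", "det=10"))  # bias@run_a
print(find_first([run_a, run_b], "flat", "det=10"))  # flat@run_b
```

Swapping the collection order would make `run_b`'s bias win instead, which is exactly why this is not the same as deduplicating rows after the fact.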
Even more important not to confuse the output here, then, where we get identical answers in the output and no one knows why. Are you saying that I got two rows because a dataset happened to be in two collections, but since the collection is not known to the DatasetRef we can't tell why we had two rows?
I think it's somewhere in between, because DatasetRef does know the dataset's RUN-type collection, so both that and the dataset_id will always be printed, and the fact that they're actually different and where they're coming from will be apparent. That should be the common case.

But it is also possible to get the exact same dataset multiple times because both the RUN it's in and a TAGGED collection it's in are being queried. Providing that information about why it appeared twice is hard right now, but just removing that duplication is as simple as stuffing the query results in a `set`, and that seems like the right thing to do anyway.
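The set-based cleanup can be sketched like this; `FakeRef` is a hypothetical stand-in for `DatasetRef` (frozen so instances are hashable and can live in a set):

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen => hashable, so instances can go in a set
class FakeRef:
    """Illustrative stand-in for DatasetRef; not the real class."""
    datasetType: str
    run: str
    id: int


# The same dataset returned twice, e.g. once via its RUN and once via a
# TAGGED collection that also contains it.
results = [
    FakeRef("bias", "HSC/calib", 1),
    FakeRef("bias", "HSC/calib", 1),
    FakeRef("flat", "HSC/calib", 2),
]

# The set collapses exact duplicates; sorting by id makes the output stable.
deduped = sorted(set(results), key=lambda ref: ref.id)
print(len(deduped))  # 2
```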
```python
row = OrderedDict(type=datasetRef.datasetType.name, run=datasetRef.run, id=datasetRef.id)
row.update(datasetRef.dataId.items())
rows.append(row)
tables[datasetRef.datasetType.name] = rows
```
I don't really like how this bit of code reads the array, updates it, and then assigns it back to the same place. That seems like it should be something that `defaultdict` can handle. The following code results in the same output, but there is only one line in the for loop rather than five and it's not continually writing the list back to the dict.
```python
from collections import defaultdict
from astropy.table import Table

tables = defaultdict(list)
for datasetRef in datasets:
    tables[datasetRef.datasetType.name].append({"type": datasetRef.datasetType.name,
                                                "run": datasetRef.run,
                                                "id": datasetRef.id,
                                                **datasetRef.dataId})

for datasetName, rows in tables.items():
    print("")
    Table(rows, names=rows[0].keys()).pprint_all()
    print("")
```
The only annoyance is that `Table` sorts the columns if they aren't an `OrderedDict`, so you need the names here explicitly (there is a comment in astropy that they should fix this when they drop older pythons).
@timj there is something about the sorting code that doesn't like sorting (I think) `DimensionElement`s:

```
[{'type': 'bias', 'run': 'HSC/calib/gen2/2013-11-03', 'id': 1, Dimension(instrument): 'HSC', Dimension(detector): 10}]
Traceback (most recent call last):
  File "/Users/n8pease/code/lsst/daf_butler/bin/butler", line 28, in <module>
    sys.exit(main())
  File "/Users/n8pease/code/lsst/daf_butler/python/lsst/daf/butler/cli/butler.py", line 311, in main
    return cli()
  File "/Users/n8pease/code/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-cb4e2dc/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/Users/n8pease/code/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-cb4e2dc/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/Users/n8pease/code/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-cb4e2dc/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/n8pease/code/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-cb4e2dc/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/n8pease/code/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-cb4e2dc/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/Users/n8pease/code/lsst/daf_butler/python/lsst/daf/butler/cli/cmd/commands.py", line 203, in query_datasets
    Table(rows, names=rows[0].keys()).pprint_all()
  File "/Users/n8pease/code/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-cb4e2dc/lib/python3.7/site-packages/astropy/table/table.py", line 610, in __init__
    init_func(data, names, dtype, n_cols, copy)
  File "/Users/n8pease/code/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-cb4e2dc/lib/python3.7/site-packages/astropy/table/table.py", line 852, in _init_from_list_of_dicts
    names_from_data = sorted(names_from_data)
  File "/Users/n8pease/code/lsst/daf_butler/python/lsst/daf/butler/core/dimensions/elements.py", line 305, in __gt__
    return self.universe.getElementIndex(self.name) > self.universe.getElementIndex(other.name)
AttributeError: 'str' object has no attribute 'name'
```

I can go debug that if we want(?), or I can use 3 rows in the loop and an `OrderedDict`:
```python
for datasetRef in datasets:
    row = OrderedDict(type=datasetRef.datasetType.name, run=datasetRef.run, id=datasetRef.id)
    row.update(datasetRef.dataId.items())
    tables[datasetRef.datasetType.name].append(row)
```
And then we don't have to do the names annoyance either.
What do you think?
That error is from astropy because you have not specified the names as an argument to the Table constructor. Did you remove the names parameter? Without the names parameter and without an OrderedDict it tries to work out the column names itself but sorts them. With the names parameter it doesn't bother.
You are right that the row is the OrderedDict and tables is the defaultdict so I did miss that last night.
Can't your new example code do the `**datasetRef.dataId` expansion in the `OrderedDict` constructor?
Does the following work then?

```python
for datasetRef in datasets:
    row = OrderedDict(type=datasetRef.datasetType.name, run=datasetRef.run, id=datasetRef.id,
                      **datasetRef.dataId)
    tables[datasetRef.datasetType.name].append(row)
```
Please add a comment that you are using `OrderedDict` specifically because current astropy has not yet been changed to understand that dicts are ordered. Otherwise people are going to get confused later on.
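For what it's worth, keyword arguments and a `**` expansion into `OrderedDict` both preserve order (guaranteed since Python 3.6/3.7), which is what the one-liner relies on; here a plain str-keyed dict stands in for the dataId:

```python
from collections import OrderedDict

# Stand-in for a dataId; the real keys are Dimension objects, not strings.
dataId = {"instrument": "HSC", "detector": 10}

# Keyword order, then dataId insertion order, is preserved in the result.
row = OrderedDict(type="bias", run="HSC/calib/gen2/2013-11-03", id=1, **dataId)
print(list(row))  # ['type', 'run', 'id', 'instrument', 'detector']
```

So the columns come out in the order the row was built, and no separate `row.update(...)` step is needed.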
If `rows` is a dict and I pass `names` to the `Table` init, it fails as above. See code below.

```python
for datasetRef in datasets:
    row = dict(type=datasetRef.datasetType.name, run=datasetRef.run, id=datasetRef.id)
    row.update(datasetRef.dataId.items())
    tables[datasetRef.datasetType.name].append(row)
```
and this in the command function:

```python
for datasetName, rows in tables.items():
    print("")
    Table(rows, names=rows[0].keys()).pprint_all()
    print("")
```
If I comment out `row.update(datasetRef.dataId.items())` then it does not fail.
Also, I've realized that if I make the astropy table in the script function I can do it without the intermediate data structure. Will investigate that briefly.
Okay, I think I already said I'm fine with OrderedDict if you use the `**` expansion rather than the separate update.
It looks like astropy sorts the column names even when a `names` is given, and then ignores the sorted result in favor of the names you asked for. The problem you are having is that the keys of a dataId are not actually strings, so you need to convert the dataId to a plain str-keyed dict for the sorting to work (or fix the comparison method in the class as you say above; something we should probably do eventually, since sorting the keys of a dataId is not completely ridiculous, but not for this ticket). None of this is relevant if you stick with `OrderedDict`, along with a note as to why.
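The str-key conversion being suggested can be sketched as follows; `FakeDimension` is a hypothetical stand-in for a non-str dataId key (like the real ones, it cannot be ordered against `str`, which is what trips up astropy's `sorted()`):

```python
class FakeDimension:
    """Hypothetical stand-in for a non-str dataId key such as Dimension."""

    def __init__(self, name):
        self.name = name

    def __hash__(self):
        return hash(self.name)

    def __eq__(self, other):
        return isinstance(other, FakeDimension) and self.name == other.name


dataId = {FakeDimension("instrument"): "HSC", FakeDimension("detector"): 10, "type": "bias"}

# Convert every key to a plain str; keys that are already str pass through
# unchanged because they have no .name attribute.
strDataId = {getattr(k, "name", k): v for k, v in dataId.items()}
print(sorted(strDataId))  # ['detector', 'instrument', 'type']
```

With homogeneous str keys, `sorted()` over the column names no longer raises, so the `names` workaround (or `OrderedDict`) becomes unnecessary.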
Force-pushed 96da9a9 to 3f0e1c3
Two main comments:
- I think we want to use the sorting from DM-26324: Enable sorting of datasets during export #396 so that we get predictable output.
- Don't ask for URIs unless people want them.
Plus some other comments. I tried out the command line and it worked great.
```python
    DatasetRef and component name, and component URI."""
    datasetTypeName = datasetRef.datasetType.name
    if name:
        datasetTypeName = ".".join((datasetTypeName, name))
```
Please use the proper API for this. Something like:

```python
datasetTypeName = datasetRef.datasetType.componentTypeName(name)
```
```python
                         *extraValues))

for datasetRef in datasets:
    primaryURI, componentURIs = butler.getURIs(datasetRef, collections=datasetRef.run)
```
getURIs takes a non-zero amount of time so I wouldn't call it unless `show_uri` is `True`.
As far as I've been able to tell, getURIs is the only way to get the components of the datasetRef. Is there another way, or do we want to not show the components if `show_uri` is false?
If people explicitly ask for `raw.wcs` as a dataset type then they can get the components they ask for, and getURIs should do the right thing. getURIs will only return multiple values if the dataset has been disassembled (in which case each component gets a separate URI; if it has not been disassembled each component would get the same URI). Therefore getURIs is only useful if people want to see URIs. If they don't want URIs, don't call it at all and give people what they got from queryDatasets (which may or may not include a component name).
```python
                         datasetRef.id,
                         *extraValues))

for datasetRef in datasets:
```
In DM-26324 (#396) I am adding support for sorting DatasetRefs. This would really help with reproducibility of the output here since at the moment we are at the mercy of the order returned by the database and that order is going to depend on the database. For example you end up with the detectors not coming out as you would expect so it's hard to work out if something obvious is missing. I think this means you can group all the returned DatasetRef from queryDatasets into a dict of sets (keyed by DatasetType name) and then sort the DatasetRef out of the sets.
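That grouping into a dict of sets followed by sorting can be sketched like this; `FakeRef` and the `id` sort key are illustrative stand-ins (the real ordering would come from DM-26324's DatasetRef sorting support):

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeRef:
    """Illustrative stand-in for DatasetRef."""
    datasetTypeName: str
    id: int


# Database order is arbitrary and may contain duplicates.
refs = [FakeRef("flat", 3), FakeRef("bias", 2), FakeRef("bias", 2), FakeRef("bias", 1)]

grouped = defaultdict(set)       # sets absorb duplicate refs
for ref in refs:
    grouped[ref.datasetTypeName].add(ref)

for name in sorted(grouped):     # deterministic table order by dataset type
    rows = sorted(grouped[name], key=lambda r: r.id)  # deterministic row order
    print(name, [r.id for r in rows])
# bias [1, 2]
# flat [3]
```

The output is now independent of whatever row order the database happens to return.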
```python
def __init__(self, columnNames):
    self.columnNames = columnNames
    # Rows are stored in the keys of a dict to get set-like
```
As I say below, I think you can use sets and sorting if my other ticket merges.
tests/test_cliCmdQueryDatasets.py (Outdated)
```python
if showUri:
    expectedNames.append("URI")

for table in tables:
```
Would it be clearer here to define the `astropy.table.Table` you expect to get and then use `assertEqual` directly, rather than try to write code to compare all the rows and columns yourself?
`assertEqual` (and variations) does not seem to work with astropy tables. I'm able to use `report_diff_values` to good effect though: https://docs.astropy.org/en/stable/api/astropy.utils.diff.report_diff_values.html#astropy.utils.diff.report_diff_values
Something like:

```python
diff = io.StringIO()
self.assertEqual(len(tables), 1)
self.assertTrue(report_diff_values(tables[0], expectedTable, fileobj=diff), msg=diff.getvalue())
```

Does that seem ok?
That works for me.
tests/test_cliCmdQueryDatasets.py (Outdated)
```python
self.maxDiff = None
print(self.root)
self.assertListEqual(tables[0]["URI"].tolist(), [
    self._makeUri(
```
The Butler datastore root is a ButlerURI, so for what you are doing something like `butler.datastore.root.join("ingest/run/test_blah/")` should work for you without having to use urlunparse.
Force-pushed 3f0e1c3 to 70029ae