
DM-26685: Add command-line tool for Registry.queryDatasets #389

Merged: 2 commits into master from tickets/DM-26685, Oct 21, 2020

Conversation

@n8pease (Contributor) commented Oct 7, 2020:

No description provided.

@n8pease n8pease force-pushed the tickets/DM-26685 branch 2 times, most recently from 80e30ee to 3552d3d Compare October 8, 2020 18:52
@repo_argument(required=True)
@glob_argument(help="GLOB is one or more glob-style expressions that fully or partially identify the "
"dataset types to be queried.")
@click.option("--collections",
n8pease (Contributor Author):

We have a shared option "--collection-type" used in other places. What do you think about using that here?

@TallJimbo (Member), Oct 8, 2020:

That's a different kind of argument entirely, actually; --collections here is more like the positional GLOB argument to butler query-collections.

FWIW, queryDatasets does internally expand a collections argument that includes one or more patterns into an explicit list of collections in much the same way as queryCollections, and that could accept collection types from the user to provide more fine-grained control over what's in that expanded list. But that'd first require a change to Registry.queryDatasets itself, and hence is probably out of scope for this ticket.
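For context, a rough sketch of that expansion from the Python API. The dataset type and pattern here are placeholders, and this assumes the Registry accepts compiled regular expressions as pattern entries in the collections expression, as the CLI's glob handling suggests:

import re

# Sketch only: pattern entries in `collections` are expanded into concrete
# collection names internally, much as queryCollections does.
refs = registry.queryDatasets(
    "calexp",                                 # placeholder dataset type
    collections=[re.compile("HSC/runs/.*")],  # pattern entry, expanded internally
)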

order of `collections` passed in). If used, `collections` must specify at least one
expression and must not contain wildcards."""))
@click.option("--components",
type=click.Choice(["ALL", "NONE", "UNMATCHED"], case_sensitive=False),
n8pease (Contributor Author):

I just realized a conflict between this --components and the flag --components/--no-components on line 154. Grumble...

Member:

I think I'm inclined to remove this option in the interests of keeping the user interface as simple as possible to start with. It seems to me that UNMATCHED is the right default in almost every situation.

help=unwrap("""A string expression similar to a SQL WHERE clause. May involve any column of a
dimension table or a dimension name as a shortcut for the primary key column of a
dimension table."""))
@click.option("--deduplicate",
Member:

I don't think deduplicate should be a user interface option at all. Why would someone ever want to see a duplicate row? All it does is add confusion to the output. I think you should use the deduplicate flag internally where you can, and if you can't, throw the per-dataset-type refs into a set and then extract them from the set sorted by ref.id.
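A minimal sketch of that fallback, where refsForType stands in for a hypothetical per-dataset-type collection of DatasetRefs:

# Collapse duplicates with a set, then impose a stable order on the output.
uniqueRefs = sorted(set(refsForType), key=lambda ref: ref.id)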

Member:

Maybe this needs a new name, but it is not at all equivalent to deduplicating results after the fact. deduplicate=True says to return only one result for each datasetType+dataId combination across all searched collections, searching them in order.
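In other words (a sketch, with placeholder collection names):

# deduplicate=True: at most one result per datasetType+dataId combination,
# taken from the first collection in the search order that contains it.
refs = registry.queryDatasets(
    "bias",
    collections=["HSC/calib/v2", "HSC/calib/v1"],  # searched in this order
    deduplicate=True,
)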

Member:

All the more important, then, not to confuse the output here, where we get identical answers and no one knows why. Are you saying that I got two rows because a dataset happened to be in two collections, but since the collection is not known to the DatasetRef we can't tell why we got two rows?

@TallJimbo (Member), Oct 9, 2020:

I think it's somewhere in between, because DatasetRef does know the dataset's RUN-type collection, so both that and the dataset_id will always be printed, and the fact that they're actually different and where they're coming from will be apparent. That should be the common case.

But it is also possible to get the exact same dataset multiple times because both the RUN it's in and a TAGGED collection it's in are being queried. Providing that information about why it appeared twice is hard right now, but just removing that duplication is as simple as stuffing the query results in a set, and that seems like the right thing to do anyway.

row = OrderedDict(type=datasetRef.datasetType.name, run=datasetRef.run, id=datasetRef.id)
row.update(datasetRef.dataId.items())
rows.append(row)
tables[datasetRef.datasetType.name] = rows
Member:

I don't really like how this bit of code reads the list, updates it, and then assigns it back to the same place. That seems like something defaultdict should handle. The following code produces the same output, but there is only one statement in the for loop rather than five, and it's not continually writing the list back to the dict.

from collections import defaultdict
from astropy.table import Table

tables = defaultdict(list)
for datasetRef in datasets:
    tables[datasetRef.datasetType.name].append({"type": datasetRef.datasetType.name, "run": datasetRef.run,
                                                "id": datasetRef.id, **datasetRef.dataId})

for datasetName, rows in tables.items():
    print("")
    Table(rows, names=rows[0].keys()).pprint_all()
print("")

The only annoyance is that Table sorts the columns if the rows aren't OrderedDicts, so you need the explicit names here (there is a comment in astropy that they should fix this when they drop older Pythons).

n8pease (Contributor Author):

@timj there is something about the sorting code that doesn't like sorting (I think) DimensionElements:


[{'type': 'bias', 'run': 'HSC/calib/gen2/2013-11-03', 'id': 1, Dimension(instrument): 'HSC', Dimension(detector): 10}]
Traceback (most recent call last):
  File "/Users/n8pease/code/lsst/daf_butler/bin/butler", line 28, in <module>
    sys.exit(main())
  File "/Users/n8pease/code/lsst/daf_butler/python/lsst/daf/butler/cli/butler.py", line 311, in main
    return cli()
  File "/Users/n8pease/code/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-cb4e2dc/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/Users/n8pease/code/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-cb4e2dc/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/Users/n8pease/code/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-cb4e2dc/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/n8pease/code/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-cb4e2dc/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/n8pease/code/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-cb4e2dc/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/Users/n8pease/code/lsst/daf_butler/python/lsst/daf/butler/cli/cmd/commands.py", line 203, in query_datasets
    Table(rows, names=rows[0].keys()).pprint_all()
  File "/Users/n8pease/code/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-cb4e2dc/lib/python3.7/site-packages/astropy/table/table.py", line 610, in __init__
    init_func(data, names, dtype, n_cols, copy)
  File "/Users/n8pease/code/conda/miniconda3-py37_4.8.2/envs/lsst-scipipe-cb4e2dc/lib/python3.7/site-packages/astropy/table/table.py", line 852, in _init_from_list_of_dicts
    names_from_data = sorted(names_from_data)
  File "/Users/n8pease/code/lsst/daf_butler/python/lsst/daf/butler/core/dimensions/elements.py", line 305, in __gt__
    return self.universe.getElementIndex(self.name) > self.universe.getElementIndex(other.name)
AttributeError: 'str' object has no attribute 'name'

I can go debug that if we want(?), or I can use three lines in the loop and an OrderedDict:

    for datasetRef in datasets:
        row = OrderedDict(type=datasetRef.datasetType.name, run=datasetRef.run, id=datasetRef.id)
        row.update(datasetRef.dataId.items())
        tables[datasetRef.datasetType.name].append(row)

And then we don't have to do the names annoyance either.

What do you think?

Member:

That error is from astropy because you have not specified the names as an argument to the Table constructor. Did you remove the names parameter? Without the names parameter, and without an OrderedDict, it tries to work out the column names itself, but sorts them. With the names parameter it doesn't bother.

You are right that the row is the OrderedDict and tables is the defaultdict so I did miss that last night.

Can't your new example code do the **datasetRef.dataId expansion in the OrderedDict constructor?

Does the following work then?

for datasetRef in datasets:
    row = OrderedDict(type=datasetRef.datasetType.name, run=datasetRef.run, id=datasetRef.id, **datasetRef.dataId)
    tables[datasetRef.datasetType.name].append(row)

Please add a comment noting that you are using OrderedDict specifically because current astropy has not yet been updated to understand that dicts are ordered. Otherwise people are going to get confused later on.
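Putting that together with the loop that worked above, the commented form might read (a sketch):

from collections import OrderedDict, defaultdict

tables = defaultdict(list)
for datasetRef in datasets:
    # Use OrderedDict, not dict: current astropy does not yet take advantage
    # of plain dicts being insertion-ordered and would sort the column names.
    row = OrderedDict(type=datasetRef.datasetType.name, run=datasetRef.run, id=datasetRef.id)
    row.update(datasetRef.dataId.items())
    tables[datasetRef.datasetType.name].append(row)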

n8pease (Contributor Author):

If the rows are plain dicts and I pass names to the Table init, it fails as above. See the code below.

    for datasetRef in datasets:
        row = dict(type=datasetRef.datasetType.name, run=datasetRef.run, id=datasetRef.id)
        row.update(datasetRef.dataId.items())
        tables[datasetRef.datasetType.name].append(row)

and this in the command function:

    for datasetName, rows in tables.items():
        print("")
        Table(rows, names=rows[0].keys()).pprint_all()
    print("")

If I comment out row.update(datasetRef.dataId.items()) then it does not fail.

n8pease (Contributor Author):

Also, I've realized that if I build the astropy table in the script function I can do it without the intermediate data structure. I will investigate that briefly.

Member:

Okay, I think I already said I'm fine with OrderedDict if you use the ** expansion rather than the separate update.

It looks like astropy sorts the column names even when names is given, and then ignores the sorted result in favor of the names you supplied. The problem you are having is that the keys of a dataId are not actually strings, so you need to convert the dataId to a normal str-keyed dict for the sorting to work (or fix the comparison method in the class, as you say above -- something we should probably do eventually, because sorting the keys in a dataId is not completely unreasonable -- but not for this ticket). Not relevant if you stick with OrderedDict, along with a note as to why.
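The str-key workaround mentioned here might look like the following sketch (it assumes str() of a dimension key yields its name):

# Convert dimension-keyed dataId entries to plain string keys so that
# astropy's sorted() over the column names compares str with str.
row = {str(k): v for k, v in datasetRef.dataId.items()}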

@n8pease n8pease force-pushed the tickets/DM-26685 branch 5 times, most recently from 96da9a9 to 3f0e1c3 Compare October 19, 2020 22:49
@timj (Member) left a review:

Two main comments:

Plus some other comments. I tried out the command line and it worked great.

DatasetRef and component name, and component URI."""
datasetTypeName = datasetRef.datasetType.name
if name:
datasetTypeName = ".".join((datasetTypeName, name))
Member:

Please use the proper API for this. Something like:

datasetTypeName = datasetRef.datasetType.componentTypeName(name)

*extraValues))

for datasetRef in datasets:
primaryURI, componentURIs = butler.getURIs(datasetRef, collections=datasetRef.run)
Member:

getURIs takes a non-zero amount of time so I wouldn't call it unless show_uri is True.

n8pease (Contributor Author):

As far as I've been able to tell, getURIs is the only way to get the components of the datasetRef. Is there another way, or do we want to not show the components if show_uri is false?

Member:

If people explicitly ask for raw.wcs as a dataset type then they can get the components they asked for, and getURIs should do the right thing. getURIs will only return multiple values if the dataset has been disassembled (in which case each component gets a separate URI; if it has not been disassembled, each component would get the same URI). Therefore getURIs is only useful if people want to see URIs. If they don't want URIs, don't call it at all, and give people what they got from queryDatasets (which may or may not include a component name).
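A sketch of that control flow, with the row-building details elided (show_uri is the CLI flag discussed above):

for datasetRef in datasets:
    if show_uri:
        # Only pay for the datastore lookup when URIs were requested.
        primaryURI, componentURIs = butler.getURIs(datasetRef, collections=datasetRef.run)
        # ... emit one row per URI ...
    else:
        # Report exactly what queryDatasets returned; no datastore round-trip.
        # ... emit one row per DatasetRef ...
        pass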

datasetRef.id,
*extraValues))

for datasetRef in datasets:
Member:

In DM-26324 (#396) I am adding support for sorting DatasetRefs. This would really help with the reproducibility of the output here, since at the moment we are at the mercy of the order returned by the database, and that order is going to depend on the database. For example, the detectors end up not coming out in the order you would expect, so it's hard to work out whether something obvious is missing. I think this means you can group all the DatasetRefs returned from queryDatasets into a dict of sets (keyed by DatasetType name) and then sort the DatasetRefs out of the sets; see the sketch below.
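A sketch of that grouping, assuming DM-26324 lands and makes DatasetRef orderable (datasets is the iterable returned by queryDatasets):

from collections import defaultdict

byType = defaultdict(set)
for ref in datasets:
    byType[ref.datasetType.name].add(ref)  # the set also drops exact duplicates

for name in sorted(byType):
    for ref in sorted(byType[name]):  # reproducible order; needs DM-26324
        ...                           # emit a row for ref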


def __init__(self, columnNames):
self.columnNames = columnNames
# Rows are stored in the keys of a dict to get set-like
Member:

As I say below, I think you can use sets and sorting if my other ticket merges.

if showUri:
expectedNames.append("URI")

for table in tables:
Member:

Would it be clearer here to define the astropy.table.Table you expect to get and then use assertEqual directly, rather than trying to write code to compare all the rows and columns yourself?

n8pease (Contributor Author):

assertEqual (and variations) does not seem to work with astropy tables. I'm able to use report_diff_values to good effect, though: https://docs.astropy.org/en/stable/api/astropy.utils.diff.report_diff_values.html#astropy.utils.diff.report_diff_values
Something like:

        diff = io.StringIO()
        self.assertEqual(len(tables), 1)
        self.assertTrue(report_diff_values(tables[0], expectedTable, fileobj=diff), msg=diff.getvalue())

Does that seem OK?

Member:

That works for me.

self.maxDiff = None
print(self.root)
self.assertListEqual(tables[0]["URI"].tolist(), [
self._makeUri(
Member:

The Butler datastore root is a ButlerURI so for what you are doing something like butler.datastore.root.join("ingest/run/test_blah/") should work for you without having to use urlunparse.
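For reference, the suggested form (the run path fragment comes from the test under discussion):

expectedRoot = butler.datastore.root.join("ingest/run/test_blah/")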

@n8pease merged commit b5a4b20 into master Oct 21, 2020