
DM-26690: Add command-line tool for Registry.queryDataIds #402

Merged: 8 commits merged into master from tickets/DM-26690 on Nov 4, 2020

Conversation

@n8pease (Contributor) commented Oct 22, 2020:

No description provided.

@n8pease force-pushed the tickets/DM-26690 branch 2 times, most recently from 43c7627 to 845db52, on October 27, 2020 00:16
@timj (Member) left a comment:

In addition to the code comments, I'm not entirely sure I completely understand the output of the command, so it would be great if there were some extra help text somewhere (I'm not sure how easy it is to add more information in click vs. adding extra text to the Sphinx page). Mentioning DimensionGraph in the help string may be confusing.

A few things jumped out:

  • I was able to run with a dimensions argument and no --collections option, so the docstring was confusing at best. It wasn't at all clear what I would do with the dimensions argument, and it was a pleasant surprise when, on a whim, I asked for "patch" on a calexp dataset and was rewarded with the tract/patch for each calexp. Even better with htm7. That's great, but not obvious.
  • Sometimes you get identical rows repeated and it's not clear why. It might be because there are repeated options with dimensions that we do not see, but it doesn't help anybody to include the duplicates. Filtering them out should be straightforward (see the sketch after this list).
  • Table sorting should be used as is now done in query-datasets.
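A minimal sketch of the duplicate filtering mentioned in the second point, assuming the output has already been collected into an astropy table (the variable name is illustrative):

```python
from astropy.table import unique

# Drop rows that are identical across every column before printing;
# `table` is assumed to be the astropy Table built from the query results.
table = unique(table, keys=table.colnames)
```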

collections=collections)

if len(results.graph) > 0:
    return Table([{str(k): v for k, v in item.full.items()} for item in list(results)])
Member commented:

Do you need the list() at the end there? results is an iterator so it should iterate automatically without first going through the list.

After you create the table you will need to sort the columns as we now do for query-datasets (we want to prioritize spatial/temporal columns over others).
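A minimal sketch combining both suggestions, iterating over the results directly and then reordering the columns; the priority list here is an illustrative assumption, not the ordering actually used by query-datasets:

```python
from astropy.table import Table

def makeDataIdTable(results):
    # Build the table straight from the result iterator; no intermediate list() needed.
    table = Table([{str(k): v for k, v in dataId.full.items()} for dataId in results])
    # Assumed spatial/temporal-first ordering; adjust to match query-datasets.
    priority = ["tract", "patch", "visit", "exposure"]
    ordered = [c for c in priority if c in table.colnames]
    ordered += [c for c in table.colnames if c not in ordered]
    return table[ordered]
```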

datasets=datasets,
where=where)

def _assertTablesEqual(self, tables, expectedTables):
Member commented:

I tweaked this method a little on the branch I recently merged. I had a problem with a missing comma in a one-element expectedTables definition, which led to the "tables" actually being rows rather than tables, so I added an assert to check that they were all tables. You can move helper methods to lsst.daf.butler.tests.
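A sketch of the kind of helper described, including the extra type check; the comparison details are illustrative rather than the merged implementation:

```python
from astropy.table import Table as AstropyTable

def _assertTablesEqual(self, tables, expectedTables):
    # A missing trailing comma can turn a one-element tuple of tables into a
    # tuple of rows, so first insist that every item really is a Table.
    for table in tuple(tables) + tuple(expectedTables):
        self.assertIsInstance(table, AstropyTable)
    self.assertEqual(len(tables), len(expectedTables))
    for table, expected in zip(tables, expectedTables):
        self.assertEqual(table.colnames, expected.colnames)
        for name in table.colnames:
            self.assertEqual(list(table[name]), list(expected[name]))
```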

def setUp(self):
    self.root = tempfile.mkdtemp(dir=TESTDIR)
    Butler.makeRepo(self.root, config=Config(self.configFile))
    self.butlerConfigFile = os.path.join(self.root, "butler.yaml")
Member commented:

Why is this being stored in an instance property? I think it's only used a few lines below. Also, you don't need it -- you should take the return value of Butler.makeRepo and pass that directly to the Butler constructor below.
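A sketch of the suggested simplification; whether the Butler needs to be writeable is an assumption about what the test requires:

```python
def setUp(self):
    self.root = tempfile.mkdtemp(dir=TESTDIR)
    # makeRepo returns the repository config, so pass it straight to the
    # Butler constructor instead of storing the butler.yaml path on self.
    butlerCfg = Butler.makeRepo(self.root, config=Config(self.configFile))
    self.butler = Butler(butlerCfg, writeable=True)
```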

self.butler.registry)

# Add needed Dimensions
self.butler.registry.insertDimensionData("instrument", {"name": "DummyCamComp"})
Member commented:

I think you may want to take a look at lsst.daf.butler.tests.makeTestRepo which does a lot of this heavy lifting for you. The downside is that it expects a single butler per class and assumes you'll create a new collection for each test.

Contributor Author (@n8pease) commented:

I can't work out how to make it do what I need, and I think it mocks some behaviors in a way that's hard to understand (it mentions a registry override and arbitrary cross-assignment of relationships). After a day I couldn't get it to do what I needed, Jim pointed me at something else (test_simpleButler), and I understand what this code does (I think it's clear from reading it, though maybe just because it's made of stuff I do grok). So I'd like to leave it as is, if that's ok.

Member commented:

Ok. I'm not sure what the real problem is, because makeTestRepo doesn't mock anything -- you can call it with empty dataIds iterables, and it's still less code than calling Butler.makeRepo yourself, isn't it?
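For reference, a minimal sketch of what using the helper might look like, based only on the behaviour described in this thread (one repo per class, empty dataIds allowed, a fresh collection per test); the class name and argument details are assumptions, not the merged test code:

```python
import tempfile
import unittest

from lsst.daf.butler.tests import makeTestCollection, makeTestRepo

class QueryDataIdsTestCase(unittest.TestCase):  # hypothetical test class
    @classmethod
    def setUpClass(cls):
        cls.root = tempfile.mkdtemp()
        # An empty dataIds mapping is accepted; individual tests can still
        # insert whatever dimension records they need afterwards.
        cls.creatorButler = makeTestRepo(cls.root, {})

    def setUp(self):
        # One shared repo for the class, but a fresh collection per test.
        self.butler = makeTestCollection(self.creatorButler)
```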

[563, 234, 456.7, 752, 8, 9, 27])

@staticmethod
def _addDatasetType(datasetTypeName, dimensions, storageClass, registry):
Member commented:

Use lsst.daf.butler.tests.addDatasetType
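A sketch of the suggested replacement; the dataset type name, dimensions, and storage class are placeholders, and the helper is assumed to take (butler, name, dimensions, storageClass):

```python
from lsst.daf.butler.tests import addDatasetType

# Register the dataset type via the shared test helper instead of a
# hand-rolled _addDatasetType; the helper returns the new DatasetType.
datasetType = addDatasetType(
    self.butler, "test_metric_comp", ("instrument", "visit"), "StructuredCompositeReadComp"
)
```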

return datasetType

@staticmethod
def _queryDataIds(repo, dimensions=(), collections=(), datasets=None, where=None, ):
Member commented:

I'm not sure what this trampoline gives us over importing queryDataIds from script and using it directly. Is it solely the defaulted arguments? If so, can you add a comment here?
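For comparison, a sketch of calling the script-layer function directly, assuming its signature mirrors the helper's and that it is importable from the script subpackage added on this ticket:

```python
from lsst.daf.butler.script import queryDataIds  # assumed import path

# Same call the _queryDataIds trampoline makes, with the defaults spelled out;
# "repo" stands in for the test repository root.
table = queryDataIds("repo", dimensions=("visit",), collections=(), datasets=None, where=None)
```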

res = self._queryDataIds(self.root, dimensions=("visit",))
expected = AstropyTable(
    array((
        ("R", "DummyCamComp", "d-r", "423", "1"),
Member commented:

The numbers here have to be numbers and not strings.
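One way to keep per-column types, sketched here with illustrative column names, is to let astropy infer the dtypes from Python rows rather than going through a homogeneous array of strings:

```python
from astropy.table import Table as AstropyTable

expected = AstropyTable(
    rows=[("R", "DummyCamComp", "d-r", 423, 1)],
    names=("band", "instrument", "physical_filter", "visit", "visit_system"),  # illustrative names
)
# expected["visit"] now has an integer dtype rather than holding strings.
```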

@repo_argument(required=True)
@dimensions_argument(help=unwrap("""DIMENSIONS are dimensions of the data IDs to yield. Will be expanded to a
complete DimensionGraph. --collections is required if using this
argument."""))
@TallJimbo (Member) commented Oct 27, 2020:

--collections would be required for the --datasets option, not this argument.

I'd just change this docstring to "DIMENSIONS are the keys of the data IDs to yield. Will be expanded to include any dependencies." I think that's a bit more beginner-friendly.

Member commented:

In fact I think it would help a lot if we gave some examples here and said "such as exposure, instrument, or tract".

Are we planning at some point to add an option that would list the full record for the dimension, e.g. listing the exposure.obs_id or detector.full_name?

Member commented:

> Are we planning at some point to add an option that would list the full record for the dimension, e.g. listing the exposure.obs_id or detector.full_name?

There is a simpler Python API for that via a different registry method, queryDimensionRecords. I figured we'd just make that another subcommand on another ticket; the full-records expanded data ID you can get back from queryDataIds (by calling expanded() on the result object) isn't something that lends itself to command-line pretty-printing.
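For reference, a minimal sketch of that registry method; the repo path and dimension element are illustrative:

```python
from lsst.daf.butler import Butler

butler = Butler("repo")  # illustrative repo path
for record in butler.registry.queryDimensionRecords("exposure"):
    print(record)  # full record, including fields such as obs_id
```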

Member commented:

Ok. That would be handy, so I'll maybe take a look at that myself.

@timj (Member) commented Oct 27, 2020:

One other thing I forgot: typing "butler query-data-ids repo" reports None -- other query subcommands return something, and I was initially at a loss as to what to do next since there wasn't any obvious drill-down.

@n8pease force-pushed the tickets/DM-26690 branch 5 times, most recently from a38a225 to 3a80090, on November 3, 2020 20:21
@n8pease requested a review from timj on November 3, 2020 20:45
@timj (Member) left a comment:

One minor comment about your new table assert.

@@ -177,7 +165,7 @@ def testShowURI(self):
     "visit", "URI")),
 )
 
-self._assertTablesEqual(tables, expectedTables)
+assertAstropyTablesEqual(self, tables, expectedTables)
Member commented:

You probably want assertAstropyTablesEqual to be in a class and then use multiple inheritance in this TestCase so that it inherits from unittest.TestCase and the butlerTestHelper (or whatever you call it). Then you'd still be able to do self.assertTablesEqual here and make it look like a normal test assert. We use the multiple inheritance technique a lot in obs_base butler tests.
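A sketch of the mixin arrangement described above; the helper-class name and the import location of the free function are assumptions:

```python
import unittest

# Assumed location of the free function already used in this test module.
from lsst.daf.butler.tests.utils import assertAstropyTablesEqual

class ButlerTestHelper:
    """Hypothetical mixin exposing the table comparison as a normal assert."""

    def assertAstropyTablesEqual(self, tables, expectedTables):
        # Name lookup resolves to the module-level function, so no recursion here.
        assertAstropyTablesEqual(self, tables, expectedTables)

class QueryDatasetsTestCase(ButlerTestHelper, unittest.TestCase):
    def testShowURI(self):
        tables = expectedTables = ()  # built by the real test
        self.assertAstropyTablesEqual(tables, expectedTables)
```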

@n8pease merged commit e2e77cb into master on Nov 4, 2020
@timj deleted the tickets/DM-26690 branch on February 16, 2024 17:17