
DM-26690: Add command-line tool for Registry.queryDataIds #402

Merged: 8 commits merged into master from tickets/DM-26690 on Nov 4, 2020

Conversation

@n8pease (Contributor) commented Oct 22, 2020:

No description provided.

@n8pease force-pushed the tickets/DM-26690 branch 2 times, most recently from 43c7627 to 845db52, on October 27, 2020 00:16
@timj (Member) left a comment:

In addition to the code comments, I'm not entirely sure I completely understand the output of the command, so it would be great if there were some extra help text somewhere (I'm not sure how easy it is to add more information in click vs. adding extra text to the Sphinx page). Mentioning DimensionGraph in the help string may be confusing.

A few things jumped out:

  • I was able to run with a dimensions argument and no --collections option, so the docstring was confusing at best. It wasn't at all clear what I would do with the dimensions argument, and it was a pleasant surprise when, on a whim, I asked for "patch" on a calexp dataset and was rewarded with the tract/patch for each calexp. Even better with htm7. That's great, but not obvious.
  • Sometimes you get identical rows repeated and it's not clear why. It might be because there are repeated options with dimensions that we do not see, but it doesn't help anybody to include the duplicates. Filtering them out should be straightforward (see the sketch after this list).
  • Table sorting should be used as is now done in query-datasets.
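A minimal sketch of the duplicate filtering mentioned in the second point, assuming the output has already been collected into an astropy table (the variable name is illustrative):

```python
from astropy.table import unique

# Drop rows that are identical across every column before printing;
# `table` is assumed to be the astropy Table built from the query results.
table = unique(table, keys=table.colnames)
```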

collections=collections)

if len(results.graph) > 0:
    return Table([{str(k): v for k, v in item.full.items()} for item in list(results)])
Member commented:

Do you need the list() at the end there? results is an iterator so it should iterate automatically without first going through the list.

After you create the table you will need to sort the columns as we now do for query-datasets (we want to prioritize spatial/temporal columns over others).
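A minimal sketch combining both suggestions, iterating over the results directly and then reordering the columns; the priority list here is an illustrative assumption, not the ordering actually used by query-datasets:

```python
from astropy.table import Table

def makeDataIdTable(results):
    # Build the table straight from the result iterator; no intermediate list() needed.
    table = Table([{str(k): v for k, v in dataId.full.items()} for dataId in results])
    # Assumed spatial/temporal-first ordering; adjust to match query-datasets.
    priority = ["tract", "patch", "visit", "exposure"]
    ordered = [c for c in priority if c in table.colnames]
    ordered += [c for c in table.colnames if c not in ordered]
    return table[ordered]
```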

datasets=datasets,
where=where)

def _assertTablesEqual(self, tables, expectedTables):
Member commented:

I tweaked this method a little on the branch I recently merged. I had a problem with a missing comma in a one-element expectedTables definition, which led to the "tables" actually being rows rather than tables, so I added an assert to check that they were all tables. You can move helper methods to lsst.daf.butler.tests.
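A sketch of the kind of helper described, including the extra type check; the comparison details are illustrative rather than the merged implementation:

```python
from astropy.table import Table as AstropyTable

def _assertTablesEqual(self, tables, expectedTables):
    # A missing trailing comma can turn a one-element tuple of tables into a
    # tuple of rows, so first insist that every item really is a Table.
    for table in tuple(tables) + tuple(expectedTables):
        self.assertIsInstance(table, AstropyTable)
    self.assertEqual(len(tables), len(expectedTables))
    for table, expected in zip(tables, expectedTables):
        self.assertEqual(table.colnames, expected.colnames)
        for name in table.colnames:
            self.assertEqual(list(table[name]), list(expected[name]))
```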

def setUp(self):
    self.root = tempfile.mkdtemp(dir=TESTDIR)
    Butler.makeRepo(self.root, config=Config(self.configFile))
    self.butlerConfigFile = os.path.join(self.root, "butler.yaml")
Member commented:

Why is this being stored in an instance property? I think it's only used a few lines below. Also, you don't need it -- you should take the return value of Butler.makeRepo and pass that directly to the Butler constructor below.
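A sketch of the suggested simplification; whether the Butler needs to be writeable is an assumption about what the test requires:

```python
def setUp(self):
    self.root = tempfile.mkdtemp(dir=TESTDIR)
    # makeRepo returns the repository config, so pass it straight to the
    # Butler constructor instead of storing the butler.yaml path on self.
    butlerCfg = Butler.makeRepo(self.root, config=Config(self.configFile))
    self.butler = Butler(butlerCfg, writeable=True)
```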

self.butler.registry)

# Add needed Dimensions
self.butler.registry.insertDimensionData("instrument", {"name": "DummyCamComp"})
Member commented:

I think you may want to take a look at lsst.daf.butler.tests.makeTestRepo which does a lot of this heavy lifting for you. The downside is that it expects a single butler per class and assumes you'll create a new collection for each test.

Contributor Author (@n8pease) commented:

I can't work out how to make it do what I need, and I think it mocks some behaviors in a way that's hard to understand (it mentions a registry override and arbitrary cross-assignment of relationships). After a day I couldn't get it to do what I needed, Jim pointed me at something else (test_simpleButler), and I understand what this code does (I think it's clear from reading it, though maybe just because it's made of stuff I do grok). So I'd like to leave it as is, if that's ok.

Member commented:

Ok. I'm not sure what the real problem is, because makeTestRepo doesn't mock anything -- you can call it with empty dataIds iterables, and it's still less code than calling Butler.makeRepo yourself, isn't it?
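For reference, a minimal sketch of what using the helper might look like, based only on the behaviour described in this thread (one repo per class, empty dataIds allowed, a fresh collection per test); the class name and argument details are assumptions, not the merged test code:

```python
import tempfile
import unittest

from lsst.daf.butler.tests import makeTestCollection, makeTestRepo

class QueryDataIdsTestCase(unittest.TestCase):  # hypothetical test class
    @classmethod
    def setUpClass(cls):
        cls.root = tempfile.mkdtemp()
        # An empty dataIds mapping is accepted; individual tests can still
        # insert whatever dimension records they need afterwards.
        cls.creatorButler = makeTestRepo(cls.root, {})

    def setUp(self):
        # One shared repo for the class, but a fresh collection per test.
        self.butler = makeTestCollection(self.creatorButler)
```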

[563, 234, 456.7, 752, 8, 9, 27])

@staticmethod
def _addDatasetType(datasetTypeName, dimensions, storageClass, registry):
Member commented:

Use lsst.daf.butler.tests.addDatasetType
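A sketch of the suggested replacement; the dataset type name, dimensions, and storage class are placeholders, and the helper is assumed to take (butler, name, dimensions, storageClass):

```python
from lsst.daf.butler.tests import addDatasetType

# Register the dataset type via the shared test helper instead of a
# hand-rolled _addDatasetType; the helper returns the new DatasetType.
datasetType = addDatasetType(
    self.butler, "test_metric_comp", ("instrument", "visit"), "StructuredCompositeReadComp"
)
```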

return datasetType

@staticmethod
def _queryDataIds(repo, dimensions=(), collections=(), datasets=None, where=None, ):
Member commented:

I'm not sure what this trampoline gives us over importing queryDataIds from script and using it directly. Is it solely the defaulted arguments? If so, can you add a comment here?
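For comparison, a sketch of calling the script-layer function directly, assuming its signature mirrors the helper's and that it is importable from the script subpackage added on this ticket:

```python
from lsst.daf.butler.script import queryDataIds  # assumed import path

# Same call the _queryDataIds trampoline makes, with the defaults spelled out;
# "repo" stands in for the test repository root.
table = queryDataIds("repo", dimensions=("visit",), collections=(), datasets=None, where=None)
```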

res = self._queryDataIds(self.root, dimensions=("visit",))
expected = AstropyTable(
    array((
        ("R", "DummyCamComp", "d-r", "423", "1"),
Member commented:

The numbers here have to be numbers and not strings.
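One way to keep per-column types, sketched here with illustrative column names, is to let astropy infer the dtypes from Python rows rather than going through a homogeneous array of strings:

```python
from astropy.table import Table as AstropyTable

expected = AstropyTable(
    rows=[("R", "DummyCamComp", "d-r", 423, 1)],
    names=("band", "instrument", "physical_filter", "visit", "visit_system"),  # illustrative names
)
# expected["visit"] now has an integer dtype rather than holding strings.
```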

@repo_argument(required=True)
@dimensions_argument(help=unwrap("""DIMENSIONS are dimensions of the data IDs to yield. Will be expanded to a
complete DimensionGraph. --collections is required if using this
argument."""))
@TallJimbo (Member) commented Oct 27, 2020:

--collections would be required for the --datasets option, not this argument.

I'd just change this docstring to "DIMENSIONS are the keys of the data IDs to yield. Will be expanded to include any dependencies." I think that's a bit more beginner-friendly.

Member commented:

In fact I think it would help a lot if we gave some examples here and said "such as exposure, instrument, or tract".

Are we planning at some point to add an option that would list the full record for the dimension, e.g. listing the exposure.obs_id or detector.full_name?

Member commented:

> Are we planning at some point to add an option that would list the full record for the dimension, e.g. listing the exposure.obs_id or detector.full_name?

There is a simpler Python API for that via a different registry method, queryDimensionRecords. I figured we'd just make that another subcommand on another ticket; the full-records expanded data ID you can get back from queryDataIds (by calling expanded() on the result object) isn't something that lends itself to command-line pretty-printing.
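For reference, a minimal sketch of that registry method; the repo path and dimension element are illustrative:

```python
from lsst.daf.butler import Butler

butler = Butler("repo")  # illustrative repo path
for record in butler.registry.queryDimensionRecords("exposure"):
    print(record)  # full record, including fields such as obs_id
```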

Member commented:

Ok. That would be handy, so I'll maybe take a look at that myself.

@timj (Member) commented Oct 27, 2020:

One other thing I forgot: typing "butler query-data-ids repo" reports None -- other query subcommands return something, and I was initially at a loss as to what to do next since there wasn't any obvious drill-down.

@n8pease force-pushed the tickets/DM-26690 branch 5 times, most recently from a38a225 to 3a80090, on November 3, 2020 20:21
@n8pease requested a review from timj on November 3, 2020 20:45
@timj (Member) left a comment:

One minor comment about your new table assert.

@@ -177,7 +165,7 @@ def testShowURI(self):
     "visit", "URI")),
 )
 
-self._assertTablesEqual(tables, expectedTables)
+assertAstropyTablesEqual(self, tables, expectedTables)
Member commented:

You probably want assertAstropyTablesEqual to be in a class and then use multiple inheritance in this TestCase so that it inherits from unittest.TestCase and the butlerTestHelper (or whatever you call it). Then you'd still be able to do self.assertTablesEqual here and make it look like a normal test assert. We use the multiple inheritance technique a lot in obs_base butler tests.
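A sketch of the mixin arrangement described above; the helper-class name and the import location of the free function are assumptions:

```python
import unittest

# Assumed location of the free function already used in this test module.
from lsst.daf.butler.tests.utils import assertAstropyTablesEqual

class ButlerTestHelper:
    """Hypothetical mixin exposing the table comparison as a normal assert."""

    def assertAstropyTablesEqual(self, tables, expectedTables):
        # Name lookup resolves to the module-level function, so no recursion here.
        assertAstropyTablesEqual(self, tables, expectedTables)

class QueryDatasetsTestCase(ButlerTestHelper, unittest.TestCase):
    def testShowURI(self):
        tables = expectedTables = ()  # built by the real test
        self.assertAstropyTablesEqual(tables, expectedTables)
```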

@n8pease merged commit e2e77cb into master on Nov 4, 2020
@timj deleted the tickets/DM-26690 branch on February 16, 2024 17:17