Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-27154: Usability improvement suggestions for butler collection commands #431

Merged
merged 9 commits into from Nov 17, 2020

Conversation

n8pease
Copy link
Contributor

@n8pease n8pease commented Nov 10, 2020

No description provided.

@n8pease n8pease force-pushed the tickets/DM-27154 branch 2 times, most recently from 28170eb to 2efb461 Compare November 10, 2020 19:22
Copy link
Member

@timj timj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mainly comments on test infrastructure changes but also a comment on the new exception class.


Returns
-------
`str`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm expecting this to return a ButlerURI.

Also this is the wrong doc syntax for a return value. I'm not completely convinced that this deserves a method of its own.

butler = self.butler
butler.put(metric, ref)

def addDatasetType(self, dimensions, datasetTypeName, storageClass):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why this is here. It's identical to lsst.daf.butler.tests.addDatasetType isn't it?

butler = Butler(self.root, run=run)
else:
butler = self.butler
butler.put(metric, ref)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably should return the DatasetRef returned by this put.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this whole routine can be something like:

def addDataset(...):
    metric = self._makeExampleMetrics()
    return self.butler.put(metric, datasetType if datasetType is not None else self.datasetType,
                           dataId=dataId, run=run)

datasetType = self.datasetType
ref = DatasetRef(datasetType, dataId, id=None)
metric = self._makeExampleMetrics()
if run:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't needed. You can say self.butler.put(metric, ref, run=run) below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It fails when I call this and the run doesn't exist yet.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be clearer then if you had a line here that created the run collection rather than without commentary relying on Butler to create it behind the scenes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe it's weird if/because I'm thinking about it wrong though...

Copy link
Member

@timj timj Nov 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without a comment here it all seems superfluous. Whereas:

if run:
    self.butler.registry.registerCollection(run, type=CollectionType.RUN)

is explicit about what you need. @TallJimbo do you prefer the "create a new butler for each run" paradigm? Maybe that's better for the general user. If that is the case I'd like a comment to explain why you are having to create a new butler.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. That makes sense, works.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Outside of tests, users are never going to create new runs or even call put themselves (it'll all be operations like ingest or pipetask doing that). So I think we can go with whatever maximizes test code readability. I don't have a super strong preference; the "new butler for each run" pattern is maybe a bit more familiar to those coming from Gen2, but I agree that it's less explicit.

"""
if not datasetType:
datasetType = self.datasetType
ref = DatasetRef(datasetType, dataId, id=None)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not necessary. You do not need to create a ref for the put.

butlerCfg = Butler.makeRepo("here")
# the purpose of this call is to create some collections
_ = Butler(butlerCfg, run=run, tags=[tag], collections=[tag])
_ = Butler(butlerCfg, run=run, tags=[tag], collections=[tag], writeable=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't need the _ =

def testChained(self):
with self.runner.isolated_filesystem():

# Create a butler and add some chained collections
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...and replace the datastore with a mock


# Create a butler and add some chained collections
butlerCfg = Butler.makeRepo("here")
with unittest.mock.patch.object(Datastore, "fromConfig", spec=Datastore.fromConfig):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the with a bit confusing here since it implies that something is freed or reversed when the block exits. This seems to be how it works though.

Copy link
Contributor Author

@n8pease n8pease Nov 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't totally understand it (it's cargo culted from something @TallJimbo wrote), so I did more research. The deal is that Datastore.fromConfig gets patched: replaced with a function that returns a magicmock. The butler initializer calls Datastore.fromConfig, which is a MagicMock instance and calling it returns a new MagicMock instance, so now the butler's self.datastore is a magicmock. Init finishes, and export and get are monkey patched onto the butler.datastore magic mock. Then the with block exits, which does restore Datastore.fromConfig to its normal function, but butler.datastore is of course not changed because how would it and anyway we don't want it to be - it's why we did all the above anyway.

Copy link
Contributor Author

@n8pease n8pease Nov 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, all that seems kind of complicated and it seems like it's not necessary to patch Datastore.fromConfig - the butler init works just fine with it not patched. We do run into problems when we call butler1.import_(filename=os.path.join(TESTDIR, "data", "registry", "base.yaml")) without patching, but before making that call we can simply replace the datastore functions that need replacing, without any context manager. Doing this seems to be enough:

        butler1.datastore.export = self._mock_export
        butler1.datastore.get = self._mock_get
        butler1.datastore.ingest = MagicMock()

(note that instead of datastore be a MagicMock with export and get monkey patched on, I've replaced those two with our alternative impls, and made ingest a mock because it will get called in the course of calling butler.ingest_, but does not seem to need to do anything.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TallJimbo if there's a reason to use the patch context manager, please let me know? In the meantime I'll change it to the simpler version I just described and you and @timj can 👍 or 👎 that impl if y'all want.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(hmm, it seems to need the context manager in test_simpleButler. working on understanding why...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aha. The butlers were getting created slightly differently, the one that worked was being init'd with the config returned by Butler.makeRepo, and the one that did not was building a config itself, and needed config["root"] to be set.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the first time I ever really used unittest.mock (other than small changes to tests others had written), so if you find a simpler way to accomplish what my code did, assume I just wasn't aware of that alternative.

formatter="lsst.daf.butler.formatters.json.JsonFormatter")

@staticmethod
def _mock_get(ref: DatasetRef, parameters: Optional[Mapping[str, Any]] = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than copying this from test_simpleButler.py, please put the mocks in a DatastoreMock base class in the tests hierarchy.

collectionType = self.registry.getCollectionType(name)
if purge and not unstore:
raise PruneCollectionsArgsError(PruneCollectionsArgsError.Reason.PURGE_WITHOUT_UNSTORE,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with a new subclass for specificity but couldn't we also make three different exceptions? You could have a new base class of PruneCollectionsArgsError and then have three subclasses of that such as PurgeWithoutUnstoreCollectionsError etc. They could have the error message burned in and you wouldn't need the enum here and you wouldn't need to check the reason later on -- you could simply catch the specific exception and replace it with a new error message suitable for click users.

Adds an execption type (inherits from TypeError) for the ways
arguments to Butler.pruneCollections can fail. This allows the
CLI script function to catch the exception, know the reason for
the failure, and format a message to report the error in a way
that will make sense on the command line.
If this seems like an acceptible solution to reviewers this pattern
may be used elsewhere to improve CLI error reporting as needed.
Adds various ways of formatting the collections.
Puts duplicated test repo code for the various CLI tests in a
shared location & adds some API for modifying the repo.
Numpy recently added a deprecation warning for "ragged" arrays;
in this case it was ragged because of a type mismatch;
CollectionSearch vs str. Making the CollectionSearch a string
in the list comprehension prevents us from encountering the
warning.
removes use of @patch, and just replaces mocked functions.
@n8pease n8pease merged commit 6353d3e into master Nov 17, 2020
@timj timj deleted the tickets/DM-27154 branch February 16, 2024 17:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants