Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-31366: Add confirmation request to butler prune-collection #561

Merged
merged 3 commits into from Sep 2, 2021

Conversation

n8pease
Copy link
Contributor

@n8pease n8pease commented Aug 18, 2021

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes

Copy link
Member

@timj timj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. Only real comment is about reporting non-RUN collections that are being cleared out.

pruneCollection_willRemoveMsg = "The following collections will be removed:"
pruneCollection_askContinueMsg = "Continue?"
pruneCollection_didRemoveCollections = "Removed collections."
pruneCollection_aborted = "Aborted."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the motivation for these being defined here and not embedded directly in the print messages below? I don't see them being used multiple times.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it promotes reuse in the future. I also think it's easier to proof read the user-facing text when it's in one place. If you want I can embed it though (?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see now that you did the same thing for pruneDatasets. I'm not sure it does promote reuse does it given the pruneCollection_ prefix and the fact that there also exists a pruneDatasets_askContinueMsg = "Continue?"? Wouldn't the case for reuse be significantly stronger if the "Continue?" variable was reused for both commands? Now there are two sets of these variables with no code reuse coming out of it. Can you at least assign the value of pruneCollection_askContinueMsg from the pruneDataset one?

The strongest argument for doing this would be if we were internationalizing the commands such that we could switch to Spanish based on the locale setting but I imagine there is some other infrastructure we should be using for that (gnu gettext?) Are you interested in I18N?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I want to use dialog snippets between commands, the need for subtle differences can become difficult (I've experienced this mostly in the shared option help text). But, it's an interesting point that both commands implement a verify-do/abort pattern that could be worth writing a separate shared implementation. Let's do it when we have a 3rd use case ;-). In the meantime I'll remove the separate definition here.
I crossed paths only a little with I18N ages ago when I was doing more automated testing work. I'm not specifically interested in it, but am also happy to do work that needs to get done.

confirm : `bool`
Get results for what would be removed and return the results for
display & confirmation, with a completion function to run after
confirmation.
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The return value needs to be documented. The confirm docs talk more about the return parameter than about the boolean itself.

def __init__(self, confirm):
# if `confirm == True`, will contain the astropy table describing data
# that will be removed.
self.removeTable = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are using type annotations elsewhere in this file so might be worth adding some here and to the pruneCollection definition below.

show_uri=False,
)

collections: Dict[str, int] = defaultdict(lambda: 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we may want to seed this dict with the output of registry.queryCollections() on that collection with includeChains because we want to be able to report that chain collections will be deleted as well as RUNs. If you prune a chained collection and that collection includes another chained collection I don't think it will turn up in the list at the moment (we should make it appear with ndatasets = 0 or something (or maybe saying "CHAIN" or "CALIBRATION"). Or else add in all the chain/calibration collections in at the end.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interesting, you're saying that butler.registry.queryDatasets will expand the first chained collection, but not subsequent chained collections?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. I don't think I'm saying that at all. I'm saying that datasets are by definition only associated with RUN collections. You are therefore only reporting RUN collections in your summary even though there may be some CHAINED collections also being removed. I'm just asking that you report those chained collections.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@timj please have a look & make sure this does what you want now, thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect:

$ butler prune-collection ../pipelines_check/DATA_REPO_WITH_COMP/ demo_collection_exe
Searching collections...
The following collections will be removed:
         Collection          Collection Type Number of Datasets
---------------------------- --------------- ------------------
demo_collection_exe          CHAINED         -                 
demo_collection_exe/YYYYMMDD RUN             21                
HSC/calib                    CALIBRATION     -                 
HSC/raw/all                  RUN             1                 
refcats                      RUN             2                 
Continue? [y/N]: 

That's a complete report of the scary deletes that are about to happen.

raise TypeError(
f"Cannot prune {e.collectionType} collection {e.collectionType.name} with --purge.") from e
class CollectionInfo:
def __init__(self, count, type=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please put a little bit of documentation in this class? In particular, type annotations. Note that if you turn mypy on for this file it really isn't going to be happy about self.count being an int or a string when code later on assumes that it will be just fine to increment it. Given that it's a private class you can get away with it but you'd probably want to write this differently once we go to mypy everywhere (in some future dream land).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added docs & type annotations. Let me know if it looks ok? During development I temporarily turned off the disallow_untyped_defs exception for the scripts folder, and made sure there were no reported untyped defs for this file, so hopefully this file is totally taken care of now.

class CollectionInfo:
"""Lightweight container to hold the type of collection and the number
of datasets in the collection if applicable."""
def __init__(self, count: Union[None, int], type: Union[str]):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can write the first one as Optional[int] and why is the second one a union at all?

of datasets in the collection if applicable."""
def __init__(self, count: Union[None, int], type: Union[str]):
self.count = count
self.type: str = type
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it need this type annotation if you are annotating the argument? You could simplify this by using a dataclass

Copy link
Contributor Author

@n8pease n8pease Sep 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type annotation was vestigial, thanks for catching (too bad mypy does not warn for the duplicate decl).

I did not know about dataclass (and had lamented that namedtuple is (of course) immutable). Thanks for letting me know about that.

@n8pease n8pease merged commit 4ccda24 into master Sep 2, 2021
@n8pease n8pease deleted the tickets/DM-31366 branch September 2, 2021 18:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants