New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DM-31366: Add confirmation request to butler prune-collection #561
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. Only real comment is about reporting non-RUN collections that are being cleared out.
pruneCollection_willRemoveMsg = "The following collections will be removed:" | ||
pruneCollection_askContinueMsg = "Continue?" | ||
pruneCollection_didRemoveCollections = "Removed collections." | ||
pruneCollection_aborted = "Aborted." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the motivation for these being defined here and not embedded directly in the print messages below? I don't see them being used multiple times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it promotes reuse in the future. I also think it's easier to proof read the user-facing text when it's in one place. If you want I can embed it though (?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see now that you did the same thing for pruneDatasets. I'm not sure it does promote reuse does it given the pruneCollection_
prefix and the fact that there also exists a pruneDatasets_askContinueMsg = "Continue?"
? Wouldn't the case for reuse be significantly stronger if the "Continue?" variable was reused for both commands? Now there are two sets of these variables with no code reuse coming out of it. Can you at least assign the value of pruneCollection_askContinueMsg
from the pruneDataset
one?
The strongest argument for doing this would be if we were internationalizing the commands such that we could switch to Spanish based on the locale setting but I imagine there is some other infrastructure we should be using for that (gnu gettext
?) Are you interested in I18N?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think I want to use dialog snippets between commands, the need for subtle differences can become difficult (I've experienced this mostly in the shared option help text). But, it's an interesting point that both commands implement a verify-do/abort pattern that could be worth writing a separate shared implementation. Let's do it when we have a 3rd use case ;-). In the meantime I'll remove the separate definition here.
I crossed paths only a little with I18N ages ago when I was doing more automated testing work. I'm not specifically interested in it, but am also happy to do work that needs to get done.
confirm : `bool` | ||
Get results for what would be removed and return the results for | ||
display & confirmation, with a completion function to run after | ||
confirmation. | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The return value needs to be documented. The confirm
docs talk more about the return parameter than about the boolean itself.
def __init__(self, confirm): | ||
# if `confirm == True`, will contain the astropy table describing data | ||
# that will be removed. | ||
self.removeTable = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are using type annotations elsewhere in this file so might be worth adding some here and to the pruneCollection
definition below.
show_uri=False, | ||
) | ||
|
||
collections: Dict[str, int] = defaultdict(lambda: 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we may want to seed this dict with the output of registry.queryCollections()
on that collection with includeChains
because we want to be able to report that chain collections will be deleted as well as RUNs. If you prune a chained collection and that collection includes another chained collection I don't think it will turn up in the list at the moment (we should make it appear with ndatasets = 0 or something (or maybe saying "CHAIN" or "CALIBRATION"). Or else add in all the chain/calibration collections in at the end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
interesting, you're saying that butler.registry.queryDatasets
will expand the first chained collection, but not subsequent chained collections?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. I don't think I'm saying that at all. I'm saying that datasets are by definition only associated with RUN collections. You are therefore only reporting RUN collections in your summary even though there may be some CHAINED collections also being removed. I'm just asking that you report those chained collections.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@timj please have a look & make sure this does what you want now, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perfect:
$ butler prune-collection ../pipelines_check/DATA_REPO_WITH_COMP/ demo_collection_exe
Searching collections...
The following collections will be removed:
Collection Collection Type Number of Datasets
---------------------------- --------------- ------------------
demo_collection_exe CHAINED -
demo_collection_exe/YYYYMMDD RUN 21
HSC/calib CALIBRATION -
HSC/raw/all RUN 1
refcats RUN 2
Continue? [y/N]:
That's a complete report of the scary deletes that are about to happen.
749e31e
to
9dc4d10
Compare
raise TypeError( | ||
f"Cannot prune {e.collectionType} collection {e.collectionType.name} with --purge.") from e | ||
class CollectionInfo: | ||
def __init__(self, count, type=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please put a little bit of documentation in this class? In particular, type annotations. Note that if you turn mypy on for this file it really isn't going to be happy about self.count
being an int or a string when code later on assumes that it will be just fine to increment it. Given that it's a private class you can get away with it but you'd probably want to write this differently once we go to mypy everywhere (in some future dream land).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added docs & type annotations. Let me know if it looks ok? During development I temporarily turned off the disallow_untyped_defs
exception for the scripts folder, and made sure there were no reported untyped defs for this file, so hopefully this file is totally taken care of now.
9dc4d10
to
f8f4825
Compare
class CollectionInfo: | ||
"""Lightweight container to hold the type of collection and the number | ||
of datasets in the collection if applicable.""" | ||
def __init__(self, count: Union[None, int], type: Union[str]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can write the first one as Optional[int] and why is the second one a union at all?
of datasets in the collection if applicable.""" | ||
def __init__(self, count: Union[None, int], type: Union[str]): | ||
self.count = count | ||
self.type: str = type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it need this type annotation if you are annotating the argument? You could simplify this by using a dataclass
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The type annotation was vestigial, thanks for catching (too bad mypy does not warn for the duplicate decl).
I did not know about dataclass (and had lamented that namedtuple is (of course) immutable). Thanks for letting me know about that.
f8f4825
to
1c4257e
Compare
1c4257e
to
67bb02b
Compare
Checklist
doc/changes