Implement *-discarded actions #146
Comments
I also ran into this bug. For some reason, only the Edit: very bad idea. Non-duplicated mails will be selected too... |
same is true for move-discarded |
So the core functionality of the tool is not implemented... Can we prioritize this bug (which actually makes the tool useless) so it gets fixed sooner rather than later? |
@vincentbernat is right. All In the meantime, you can indeed emulate the effect of The thing is all mail-deduplicate/mail_deduplicate/action.py Lines 82 to 91 in f423b41
Same for strategies by the way. They all have their own aliases to allow users to better map their own worldview to the operational logic. See: mail-deduplicate/mail_deduplicate/strategy.py Lines 206 to 224 in f423b41
The problem is at the selection process step, right before we perform the action:
Applying the strategy returns the list of selected mails, not the ones that were discarded. See how each implemented strategy only returns a subset of the duplicate pool: mail-deduplicate/mail_deduplicate/strategy.py Lines 37 to 39 in f423b41
mail-deduplicate/mail_deduplicate/strategy.py Lines 52 to 56 in f423b41
and so on...
By the time we need to perform the action, we only have a subset of the initial duplicate pool. We do not have the list of the mails that were discarded.
This is a limit of the code architecture inherited from the initial dumb script I wrote 10 years ago. Returning only the selection was an early optimization to reduce the memory footprint.
Given that context, it will be hard to implement the *-discarded actions with the current code structure.
I propose to first tackle #87, i.e. keep a cache of the canonical hashes used to ID each mail. That way we'll be in a position to deal only with sets of hashes in our selection/action phases, instead of parsed mail objects. This will give us much cleaner and more flexible code in which to implement the missing actions. |
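To illustrate the proposal above, here is a minimal sketch, not the project's actual code. It assumes every mail in a duplicate pool can be identified by a hashable ID; under that assumption the discarded mails are simply the pool minus whatever the strategy selected. The names `split_pool` and `select_smallest_id`, as well as the string IDs, are hypothetical and exist only for this illustration.

```python
from typing import Callable, FrozenSet, Iterable, Set, Tuple

# Hypothetical signature: a strategy receives the IDs of one duplicate pool
# and returns the subset it selects.
Strategy = Callable[[FrozenSet[str]], Set[str]]


def split_pool(pool: Iterable[str], strategy: Strategy) -> Tuple[Set[str], Set[str]]:
    """Return (selected, discarded) for one duplicate pool."""
    pool_ids = frozenset(pool)
    selected = set(strategy(pool_ids))
    # Guard against a strategy returning IDs that were never in the pool.
    if not selected <= pool_ids:
        raise ValueError("strategy selected IDs outside of the pool")
    # The discarded mails are whatever the strategy did not select.
    discarded = set(pool_ids) - selected
    return selected, discarded


def select_smallest_id(pool_ids: FrozenSet[str]) -> Set[str]:
    """Stand-in strategy for the example: select the smallest ID."""
    return {min(pool_ids)}


selected, discarded = split_pool(["mail-b", "mail-a", "mail-c"], select_smallest_id)
```

With both sets available, a `*-selected` action would act on `selected` and a `*-discarded` action on `discarded`, without keeping parsed mail objects around.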
@kdeldycke
|
I've managed to do the intended action by making it skip unique emails: mail_deduplicate/mail_deduplicate/deduplicate.py

    # Unique mails are always selected. No need to mobilize the whole
    # DuplicateSet machinery.
    if mail_count == 1:
        self.stats["mail_unique"] += 1
        self.stats["set_single"] += 1
        if self.conf.action == "delete-selected":
            logger.debug("Skipped deletion of unique mail.")
            self.stats["mail_skipped"] += 1
        else:
            logger.debug("Add unique message to selection.")
            self.stats["mail_selected"] += 1
            candidates = mail_set
    # We need to resort to a selection strategy to discriminate mails
    # within the set.
    else:
        duplicates = DuplicateSet(hash_key, mail_set, self.conf)
        candidates = duplicates.select_candidates()
        # Merge duplicate set's stats to global stats.
        self.stats += duplicates.stats

|
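The same idea can be written as a small standalone function, which makes the skipped case explicit by returning an empty candidate list. This is only a sketch under assumptions: `pick_candidates`, `drop_first`, and the list-of-strings representation of mails are hypothetical and are not part of the project's code.

```python
from typing import Callable, List, Sequence


def pick_candidates(
    mail_set: Sequence[str],
    action: str,
    select: Callable[[Sequence[str]], List[str]],
) -> List[str]:
    """Return the mails the action should be applied to for one hash bucket."""
    if len(mail_set) == 1:
        # A unique mail has no duplicate, so a destructive action must not
        # receive it: return an empty selection instead.
        if action == "delete-selected":
            return []
        return list(mail_set)
    # Several mails share the hash: defer to the selection strategy.
    return select(mail_set)


def drop_first(mails: Sequence[str]) -> List[str]:
    """Stand-in strategy for the example: select every mail but the first."""
    return list(mails[1:])


unique_case = pick_candidates(["a"], "delete-selected", drop_first)
duplicate_case = pick_candidates(["a", "b", "c"], "delete-selected", drop_first)
```

Returning an explicit empty list avoids relying on whatever value a `candidates` variable held from a previous iteration.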
Thanks, it's the exact place where I'm coding right now for myself, but on top of that I introduced a new option. |
Any progress on this?
Command line was
|
@dschrempf Can you please open a new ticket regarding select-newer misbehaving? |
The missing |
This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs. |
Preliminary checks
Describe the bug
When running
mdedup -a delete-discarded -s select-one ./.MyMailDirFolder
I get the following result: NotImplementedError: delete-discarded action not implemented yet.
To reproduce
Steps to reproduce the behavior:
The full CLI invocation you used:
mdedup -a delete-discarded -s select-one ./.MyMailDirFolder
The data set leading to the bug.
Try to produce here the minimal subset of mails leading to the bug, and add copies of those mails (censored if necessary).
This effort will help maintainers add this particular edge case to the set of unit tests to prevent future regressions.
You can reduce the issue down to a particular deduplicate subset by using the --hash-only parameter.
Expected behavior
Duplicated emails being deleted
Is this an expected behaviour?