Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement *-discarded actions #146

Closed
3 tasks done
vladox opened this issue Dec 29, 2020 · 12 comments · Fixed by #290
Closed
3 tasks done

Implement *-discarded actions #146

vladox opened this issue Dec 29, 2020 · 12 comments · Fixed by #290
Assignees
Labels
✨ enhancement Improvement or change to an existing feature

Comments

@vladox
Copy link

vladox commented Dec 29, 2020

Preliminary checks

Describe the bug

When running mdedup -a delete-discarded -s select-one ./.MyMailDirFolder I get the follwoing result:

NotImplementedError: delete-discarded action not implemented yet.

To reproduce

Steps to reproduce the behavior:

  1. The full mdedup -a delete-discarded -s select-one ./.MyMailDirFolder CLI invocation you used.

    $ mdedup -a delete-discarded -s select-one ./.MyMailDirFolder
    
  2. The data set leading to the bug.
    Try to produce here the minimal subset of mails leading to the bug, and add copies of those mails (eventually censored).
    This effort will help maintainers add this particular edge-case to the set of unittests to prevent future regressions.
    You can reduce down the issue to a particular deduplicate subset by using the --hash-only parameter.

Expected behavior

Duplicated emails being deleted

Is this an expected behaviour?

@vladox vladox added the bug label Dec 29, 2020
@vladox vladox changed the title \U0001F41B delete-discarded action not implemented yet delete-discarded action not implemented yet Dec 29, 2020
@vincentbernat
Copy link

vincentbernat commented Jan 5, 2021

I also ran into this bug. For some reason, only the *-selected actions are implemented. So, in many cases, you can use -a delete-selected (and maybe repeat several times if you get several duplicates).

Edit: very bad idea. Non-duplicated mails will be selected too...

@leggewie
Copy link
Contributor

same is true for move-discarded

@andy-shev
Copy link

So the core functionality of the tool is not implemented... Can we prioritize this bug (which actually makes tool useless) to be fixed sooner than later?

@kdeldycke kdeldycke changed the title delete-discarded action not implemented yet All *-discarded action not implemented yet Jan 20, 2021
@kdeldycke
Copy link
Owner

@vincentbernat is right. All *-discarded actions are not implemented yet.

In the mean time, you can indeed emulate the effect of -a delete-discarded -s select-one with -a delete-selected -s select-all-but-one. I.e. by reversing the logic.

The thing is all *-discarded actions are syntactic sugar. They should implement the reverse of *-selected actions at:

ACTIONS = FrozenDict(
{
COPY_SELECTED: copy_selected,
COPY_DISCARDED: None,
MOVE_SELECTED: move_selected,
MOVE_DISCARDED: None,
DELETE_SELECTED: delete_selected,
DELETE_DISCARDED: None,
}
)

Same for strategies by the way. They all have their own aliases to allow users to better map their own worldview to the operational logic. See:

# Groups strategy aliases and their definitions. Aliases are great useability
# features as it helps users to better reason about the selection operators
# dependening on their mental models.
STRATEGY_ALIASES = frozenset(
[
(SELECT_NEWEST, DISCARD_OLDER),
(SELECT_NEWER, DISCARD_OLDEST),
(SELECT_OLDEST, DISCARD_NEWER),
(SELECT_OLDER, DISCARD_NEWEST),
(SELECT_BIGGEST, DISCARD_SMALLER),
(SELECT_BIGGER, DISCARD_SMALLEST),
(SELECT_SMALLEST, DISCARD_BIGGER),
(SELECT_SMALLER, DISCARD_BIGGEST),
(SELECT_NON_MATCHING_PATH, DISCARD_MATCHING_PATH),
(SELECT_MATCHING_PATH, DISCARD_NON_MATCHING_PATH),
(SELECT_ALL_BUT_ONE, DISCARD_ONE),
(SELECT_ONE, DISCARD_ALL_BUT_ONE),
]
)

The problem is at the selection process step, right before we perform the action:

selected = apply_strategy(self.conf.strategy, self)

Applying the strategy return the list of selected mails, not the one that were discarded. See how each implemented strategies only returns the subset of the duplicate pool:

return {
mail for mail in duplicates.pool if mail.timestamp < duplicates.newest_timestamp
}

return {
mail
for mail in duplicates.pool
if mail.timestamp == duplicates.oldest_timestamp
}

and so on...

By the time we need to perform the action, we only have a subset of the initial duplicate pool. We do not have the list of those that were discarded.

This is the limit of the code architecture inherited from the initial dumb script I wrote 10 years ago. It was an early optimization to reduce the memory footprint. Given that context, it will be hard to easily implement the action with the current code structure.

I propose to first tackle #87, i.e. keep a cache of canonical hashes used to ID each mail. That way we'll be in a position to only deal with sets of hashes in our selection/action phases instead of parsed mail objects. This will bring much cleaner and flexible code to implement the missing actions.

@kdeldycke kdeldycke changed the title All *-discarded action not implemented yet Implement *-discarded actions Jan 20, 2021
@andy-shev
Copy link

andy-shev commented Jan 20, 2021

@kdeldycke
The problem is that single mails are got removed by your proposal. So, it's not equivalent substitution, please revert "bug" tag and prioritize the fix.

mdedup -n -i maildir -s select-all-but-one -a delete-selected -C 32 .

● Phase #4 - Report and statistics
╒════════════╤══════════╤══════════════════════════════════════════════════════════════╕
│ Mails      │   Metric │ Description                                                  │
╞════════════╪══════════╪══════════════════════════════════════════════════════════════╡
│ Found      │   221373 │ Total number of mails encountered from all mail sources.     │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Rejected   │     1067 │ Number of mails individuality rejected because they were     │
│            │          │ unparseable or did not had enough metadata to compute        │
│            │          │ hashes.                                                      │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Retained   │   220306 │ Number of valid mails parsed and retained for deduplication. │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Hashes     │   165621 │ Number of unique hashes.                                     │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Unique     │   121943 │ Number of unique mails (which where automaticcaly added to   │
│            │          │ selection).                                                  │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Duplicates │    98363 │ Number of duplicate mails (sum of mails in all duplicate     │
│            │          │ sets with at least 2 mails).                                 │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Skipped    │     8644 │ Number of mails ignored in the selection phase because the   │
│            │          │ whole set they belongs to was skipped.                       │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Discarded  │    40268 │ Number of mails discarded from the final selection.          │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Selected   │   171394 │ Number of mails kept in the final selection on which the     │
│            │          │ action will be performed.                                    │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Copied     │        0 │ Number of mails copied from their original mailbox to        │
│            │          │ another.                                                     │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Moved      │        0 │ Number of mails moved from their original mailbox to         │
│            │          │ another.                                                     │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Deleted    │   171394 │ Number of mails deleted from their mailbox in-place.         │
╘════════════╧══════════╧══════════════════════════════════════════════════════════════╛
╒════════════════════╤══════════╤════════════════════════════════════════════════════════════╕
│ Duplicate sets     │   Metric │ Description                                                │
╞════════════════════╪══════════╪════════════════════════════════════════════════════════════╡
│ Total              │   165621 │ Total number of duplicate sets.                            │
├────────────────────┼──────────┼────────────────────────────────────────────────────────────┤
│ Single             │   121943 │ Total number of sets containing a single mail and did not  │
│                    │          │ had to have a strategy applied to. They were automatticaly │
│                    │          │ kept in the final selection.                               │
├────────────────────┼──────────┼────────────────────────────────────────────────────────────┤
│ Skipped - Encoding │        0 │ Number of sets skipped from the selection process because  │
│                    │          │ they had encoding issues.                                  │
├────────────────────┼──────────┼────────────────────────────────────────────────────────────┤
│ Skipped - Size     │      510 │ Number of sets skipped from the selection process because  │
│                    │          │ they were too disimilar in size.                           │
├────────────────────┼──────────┼────────────────────────────────────────────────────────────┤
│ Skipeed - Content  │        0 │ Number of sets skipped from the selection process because  │
│                    │          │ they were too disimilar in content.                        │
├────────────────────┼──────────┼────────────────────────────────────────────────────────────┤
│ Skipped - Strategy │        0 │ Number of sets skipped from the selection process because  │
│                    │          │ the strategy could not be applied.                         │
├────────────────────┼──────────┼────────────────────────────────────────────────────────────┤
│ Deduplicated       │    40268 │ Number of valid sets on which the selection strategy was   │
│                    │          │ successfully applied.                                      │
╘════════════════════╧══════════╧════════════════════════════════════════════════════════════╛

@nutria007
Copy link

nutria007 commented Jan 20, 2021

I've managed to do the inteded action by making it skip unique emails:

https://github.com/nutria007/mail-deduplicate/blob/872ec4a10dd391c9ad1650e7317b75dd52d24abe/mail_deduplicate/deduplicate.py#L404-L422

mail_deduplicate/mail_deduplicate/deduplicate.py
Lines 404 to 422

            # Unique mails are always selected. No need to mobilize the whole
            # DuplicateSet machinery.
            if mail_count == 1:
                self.stats["mail_unique"] += 1
                self.stats["set_single"] += 1
                if self.conf.action == "delete-selected":
                    logger.debug("Skipped deletion of unique mail.")
                    self.stats["mail_skipped"] += 1
                else:
                    logger.debug("Add unique message to selection.")
                    self.stats["mail_selected"] += 1
                    candidates = mail_set
            # We need to resort to a selection strategy to discriminate mails
            # within the set.
            else:
                duplicates = DuplicateSet(hash_key, mail_set, self.conf)
                candidates = duplicates.select_candidates()
                # Merge duplicate set's stats to global stats.
                self.stats += duplicates.stats

@andy-shev
Copy link

I've managed to do the inteded action by making it skip unique emails:

mail_deduplicate/mail_deduplicate/deduplicate.py Lines 404 to 422

Thanks, it;s exact place where I'm coding right now for myself, but on top I introduced a new option.

@dschrempf
Copy link

Any progress on this? select-newer also doesn't work as expected. Single mails without duplicates are still selected.

│ Mails      │   Metric │ Description                                                  │
╞════════════╪══════════╪══════════════════════════════════════════════════════════════╡
│ Found      │     1060 │ Total number of mails encountered from all mail sources.     │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Rejected   │        0 │ Number of mails rejected individually because they were      │
│            │          │ unparseable or did not have enough metadata to compute       │
│            │          │ hashes.                                                      │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Retained   │     1060 │ Number of valid mails parsed and retained for deduplication. │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Hashes     │     1060 │ Number of unique hashes.                                     │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Unique     │     1060 │ Number of unique mails (which where automatically added to   │
│            │          │ selection).                                                  │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Duplicates │        0 │ Number of duplicate mails (sum of mails in all duplicate     │
│            │          │ sets with at least 2 mails).                                 │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Skipped    │        0 │ Number of mails ignored in the selection phase because the   │
│            │          │ whole set they belong to was skipped.                        │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Discarded  │        0 │ Number of mails discarded from the final selection.          │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Selected   │     1060 │ Number of mails kept in the final selection on which the     │
│            │          │ action will be performed.                                    │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Copied     │        0 │ Number of mails copied from their original mailbox to        │
│            │          │ another.                                                     │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Moved      │        0 │ Number of mails moved from their original mailbox to         │
│            │          │ another.                                                     │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Deleted    │     1060 │ Number of mails deleted from their mailbox in-place.         │

Command line was

 mdedup -s select-newer -a delete-selected -n -t ctime maildir

@leggewie
Copy link
Contributor

@dschrempf Can you please open a new ticket regarding select-newer misbehaving?

@djwf
Copy link

djwf commented Mar 4, 2021

FYI: just added an issue and PR regarding the option to only act on duplicates (#203 and #204, respectively)

@kdeldycke
Copy link
Owner

The missing *-discarded actions have been implemented in #290 and are now available upstream.

@github-actions
Copy link

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 13, 2021
@kdeldycke kdeldycke added ✨ enhancement Improvement or change to an existing feature and removed enhancement labels Nov 23, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
✨ enhancement Improvement or change to an existing feature
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants