No deletion: all mail differences within limits #97

Closed
vikasrawal opened this issue Oct 2, 2020 · 15 comments
Labels
🐛 bug Something isn't working, or a fix is proposed

Comments

@vikasrawal

vikasrawal commented Oct 2, 2020

I am using

mdedup -C -1 -S -1 -s delete-smaller maildirname

But I still get "Check that mail differences are within the limits." messages, and nothing is deleted. All duplicates are either ignored or skipped.

╒════════════╤══════════╕
│ Mails      │   Metric │
╞════════════╪══════════╡
│ Found      │      227 │
├────────────┼──────────┤
│ Rejected   │        0 │
├────────────┼──────────┤
│ Kept       │      227 │
├────────────┼──────────┤
│ Unique     │        7 │
├────────────┼──────────┤
│ Duplicates │      220 │
├────────────┼──────────┤
│ Deleted    │        0 │
╘════════════╧══════════╛
╒══════════════════════════════════════╤══════════╕
│ Duplicate sets                       │   Metric │
╞══════════════════════════════════════╪══════════╡
│ Total                                │      117 │
├──────────────────────────────────────┼──────────┤
│ Ignored                              │        7 │
├──────────────────────────────────────┼──────────┤
│ Skipped                              │      110 │
├──────────────────────────────────────┼──────────┤
│ Rejected (bad encoding)              │        0 │
├──────────────────────────────────────┼──────────┤
│ Rejected (too dissimilar in size)    │        0 │
├──────────────────────────────────────┼──────────┤
│ Rejected (too dissimilar in content) │        0 │
├──────────────────────────────────────┼──────────┤
│ Deduplicated                         │        0 │
╘══════════════════════════════════════╧══════════╛

What might be wrong?

Is there a way to rename the files rather than delete duplicates?

Thanks for help.

Vikas

@kdeldycke
Owner

I just released v4.0.0 of the CLI: https://github.com/kdeldycke/mail-deduplicate/releases/tag/v4.0.0

Can you try that out please? And tell me if it fixes your issue?

@kdeldycke
Owner

As for your suggestion of renaming the files instead of deleting the duplicates, I created a feature request for this at: #98

@kdeldycke kdeldycke changed the title from "no deletion happens" to "No deletion: all mail differences within limits" Oct 2, 2020
@vikasrawal
Author

The above result is for V4.0.0.

Thanks for responding.

Best wishes.

@kdeldycke
Owner

Can you run the CLI in debug mode (--verbosity=DEBUG parameter) and identify the relevant logs? Better yet, can you provide the content of two mails, the corresponding logs and your expected results?

@vikasrawal
Author

vikasrawal commented Oct 2, 2020

It gives messages like these:

--- 2 mails sharing hash c47ac7d8a9c3f103ca756dfb8b789527c0119d67b10b71d0a951c405
debug: <DuplicateSet hash=c47ac7d8a9c3f103ca756dfb8b789527c0119d67b10b71d0a951c405, size=2, conf=<mail_deduplicate.Config object at 0x7f5fab8c40d0>, pool=frozenset({<mail_deduplicate.mail.Mail object at 0x7f5faa3df0a0>, <mail_deduplicate.mail.Mail object at 0x7f5faa3b8070>})> created.
Check that mail differences are within the limits.
Skip checking for size differences.
Skip checking for content differences.
debug: Call delete_smaller() strategy.
Deleting all mails strictly smaller than 627 bytes...
0 candidates found for deletion.
Skip set: no deletion happened.
debug: <DuplicateSet hash=d44f163c514f6a454c4cf080fe8609040dd10ba3728376487d614538, size=2, conf=<mail_deduplicate.Config object at 0x7f5fab8c40d0>, pool=frozenset({<mail_deduplicate.mail.Mail object at 0x7f5faa3f5700>, <mail_deduplicate.mail.Mail object at 0x7f5faa3d8460>})> created.
Check that mail differences are within the limits.
Skip checking for size differences.
Skip checking for content differences.
debug: Call delete_smaller() strategy.
Deleting all mails strictly smaller than 2239 bytes...
0 candidates found for deletion.
Skip set: no deletion happened.
--- 2 mails sharing hash 54480fd2aa7cc864a2c427c7ede98b917426fc418d593918535609d5
debug: <DuplicateSet hash=54480fd2aa7cc864a2c427c7ede98b917426fc418d593918535609d5, size=2, conf=<mail_deduplicate.Config object at 0x7f5fab8c40d0>, pool=frozenset({<mail_deduplicate.mail.Mail object at 0x7f5faa3a4520>, <mail_deduplicate.mail.Mail object at 0x7f5faa3f5460>})> created.
Check that mail differences are within the limits.
Skip checking for size differences.
Skip checking for content differences.
debug: Call delete_smaller() strategy.
Deleting all mails strictly smaller than 85994 bytes...
0 candidates found for deletion.
Skip set: no deletion happened.

@kdeldycke
Owner

I just released a new v5.0.0 version: https://github.com/kdeldycke/mail-deduplicate/releases/tag/v5.0.0

Can you check it out please?

@vikasrawal
Author

vikasrawal commented Oct 6, 2020

Nothing seems to be deleted. I have tried different deletion strategies. See the relevant logs:

When using delete-older (date-header)

--- 2 mails sharing hash 01327175d87a9078a30b00ea63574694907b613f2efd06e79f2da5ea
debug: <DuplicateSet hash=01327175d87a9078a30b00ea63574694907b613f2efd06e79f2da5ea, size=2, conf=<mail_deduplicate.Config object at 0x7f78ac2c67c0>, pool=frozenset({<Mail '1553568282.27026_36.katkuian,U=36'>, <Mail '1529304528.22219_280.khandrai,U=36'>})> created.
Check that mail differences are within the limits.
Skip checking for size differences.
Skip checking for content differences.
debug: Call delete_older() strategy.
Select all mails strictly older than the 1453342388 timestamp...
0 candidates found for deletion.

When using delete-smaller

--- 2 mails sharing hash 01327175d87a9078a30b00ea63574694907b613f2efd06e79f2da5ea
debug: <DuplicateSet hash=01327175d87a9078a30b00ea63574694907b613f2efd06e79f2da5ea, size=2, conf=<mail_deduplicate.Config object at 0x7f03f6c8a7c0>, pool=frozenset({<Mail '1553568282.27026_36.katkuian,U=36'>, <Mail '1529304528.22219_280.khandrai,U=36'>})> created.
Check that mail differences are within the limits.
Skip checking for size differences.
Skip checking for content differences.
debug: Call delete_smaller() strategy.
Select all mails strictly smaller than 0 bytes...
0 candidates found for deletion.

@kdeldycke
Owner

I think I have an idea. It seems the two mails in each set are perfectly identical, even after all the normalisation and canonicalisation. That means that whatever strategy is selected, there is not enough difference (by date or by size) to choose which of the duplicates to keep or discard.

All in all, I guess we need a kind of delete-random strategy to only keep one mail in the DuplicateSet.
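
To illustrate the explanation above, here is a minimal sketch of a "strictly smaller" selection applied to two mails of identical size. It is not mail-deduplicate's actual code: `FakeMail` and `select_strictly_smaller` are hypothetical names, used only to show why a strict comparison finds no candidates when the duplicates do not differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeMail:
    """Hypothetical stand-in for a mail; not mail-deduplicate's Mail class."""

    name: str
    size: int


def select_strictly_smaller(pool):
    """Return the mails strictly smaller than the largest one in the pool."""
    biggest = max(mail.size for mail in pool)
    return {mail for mail in pool if mail.size < biggest}


# Two copies of the same size: neither is strictly smaller, so nothing is selected.
identical = {FakeMail("copy-a", 627), FakeMail("copy-b", 627)}
print(select_strictly_smaller(identical))  # prints set()

# Copies of different sizes: only the smaller one is selected.
different = {FakeMail("copy-a", 627), FakeMail("copy-b", 600)}
print(select_strictly_smaller(different))
```

The same reasoning would apply to a date-based comparison: if both mails carry the same timestamp, a strict "older than" test selects nothing.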

@kdeldycke kdeldycke added bug and removed question labels Oct 6, 2020
@vikasrawal
Author

If it is easier, one could also do keep-alphabetically-first or keep-alphabetically-last using file names. That would also solve the problem.
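
A rough sketch of what such a file-name-based tie-break could look like, assuming the duplicates are identified by their file names. The function name `select_all_but_alphabetically_first` is hypothetical and is not an option offered by the CLI; the two names are the ones that appear in the logs above.

```python
def select_all_but_alphabetically_first(names):
    """Return every name except the alphabetically first one, which is kept."""
    ordered = sorted(names)
    return set(ordered[1:])


pool = {
    "1553568282.27026_36.katkuian,U=36",
    "1529304528.22219_280.khandrai,U=36",
}
to_remove = select_all_but_alphabetically_first(pool)
print(to_remove)
```

Unlike a random choice, this gives the same result on every run, which makes a dry run followed by a real run predictable.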

@kdeldycke
Owner

I just implemented --strategy=select-one/--strategy=discard-all-but-one at:

import random

def discard_one(duplicates):
    """Randomly discards one duplicate, and select all others."""
    return {random.choice(tuple(duplicates.pool))}

def discard_all_but_one(duplicates):
    """Randomly discards all duplicates, but select one."""
    # tuple() is needed: random.sample() rejects sets on Python 3.11+.
    return set(random.sample(tuple(duplicates.pool), k=len(duplicates.pool) - 1))

This will be part of the upcoming v6.0.0 release, but you can try it out by cloning the develop branch.

@kdeldycke
Owner

Also in v6.0.0, I implemented #98, i.e. a way to move or copy selected duplicate mails to a brand new box instead of deleting them in place. With all that, you should be able to perform the deduplication you're looking for.
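
For readers who want to see the idea of moving duplicates to another box outside the CLI, here is a minimal sketch using Python's standard `mailbox` module. It is not how mdedup implements #98; `move_messages`, the box names, and the sample message are assumptions made for illustration, and everything runs inside a temporary directory.

```python
import mailbox
import tempfile
from pathlib import Path


def move_messages(source, destination, keys):
    """Copy the messages with the given keys into destination, then remove them from source."""
    for key in keys:
        destination.add(source.get_message(key))
        source.remove(key)


with tempfile.TemporaryDirectory() as tmp:
    # Two throwaway Maildir boxes created for the demonstration.
    inbox = mailbox.Maildir(str(Path(tmp) / "inbox"), create=True)
    duplicates_box = mailbox.Maildir(str(Path(tmp) / "duplicates"), create=True)

    # Add the same message twice to simulate a pair of duplicates.
    raw = "From: a@example.com\nSubject: hello\n\nSame body.\n"
    first_key = inbox.add(raw)
    second_key = inbox.add(raw)

    # Keep the first copy in place and move the second one away.
    move_messages(inbox, duplicates_box, [second_key])
    remaining, moved = len(inbox), len(duplicates_box)
    print(remaining, moved)
```

Moving rather than deleting keeps the operation reversible: the duplicates can be inspected in the second box before being discarded.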

@matclab

matclab commented Oct 13, 2020

Just tried with poetry run mdedup -a delete-selected -s select-all-but-one folder and it seems to work like a charm.

Thank you.

@vikasrawal
Author

vikasrawal commented Oct 13, 2020 via email

@kdeldycke
Owner

Thanks a lot for your feedback! I'll now close this issue as resolved then. I'll probably release v6.0.0 in the next few days.

@github-actions

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 18, 2021
@kdeldycke kdeldycke added 🐛 bug Something isn't working, or a fix is proposed and removed bug labels Nov 23, 2022

3 participants