
Add delete option? #2

Open
DarrienG opened this issue Jan 29, 2021 · 5 comments

@DarrienG

Having an all-in-one binary would be amazing. If this supported deleting all dupes after finding them, it would be great.

@jRimbault
Owner

jRimbault commented Jan 29, 2021

Most [all] of the issues related to that feature would be around UI/UX.

Do I just delete all duplicates? Obviously not, so I have to somehow defer control to the user over what gets deleted, and there are different ways to go about that. Show each group of duplicates to the user and let them choose which gets deleted? Then how do I present each group? A group can grow quite large and become cumbersome for a human to handle. Just expose a set of options, flags, and switches to act as criteria for deletion? But those would surely be different for each set of duplicates.
And then there are the easier technical aspects: do I build an interactive mode into the main tool, or do I output a dedicated script like rmlint does? rmlint's script is my preferred way; I find it quite clever, in fact, though I have style issues with the script it outputs.

I haven't thought of a good way to solve all that. I'm open to bouncing ideas.
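For illustration only, the kind of reviewable script I mean could be generated along these lines. This is a rough sketch, not anything yadf does today: it assumes the line-delimited JSON output mentioned further down (one JSON array of duplicate paths per line) and emits every rm commented out, so nothing is deleted until the user reviews and opts in. The "keep the first path alphabetically" criterion is purely illustrative.

#!/usr/bin/env python3
"""Turn line-delimited JSON duplicate groups into a reviewable deletion script."""

import json
import shlex
import sys


def main():
    print("#!/bin/sh")
    for line in sys.stdin:
        files = json.loads(line)
        files.sort()  # illustrative criterion: keep the first path alphabetically
        print(f"# group: keeping {shlex.quote(files[0])}")
        for filename in files[1:]:
            # emitted commented out; the user uncomments what they want gone
            print(f"# rm {shlex.quote(filename)}")


if __name__ == "__main__":
    main()

Something like yadf path/to/files -f ldjson | python3 make_cleanup.py > cleanup.sh (the script name is hypothetical) would then produce a shell script to review before running.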

@DarrienG
Author

Honestly, if there were just a --delete-all-dupes option without input, I would be OK with that. Nice and simple: just delete them all.

@jRimbault jRimbault added enhancement New feature or request investigate This needs to be researched labels Feb 4, 2021
@maluramichael

For my case it would be nice to keep just the oldest one and remove everything else. I'm trying to clean up a huge drive full of family photos. They are heavily cluttered and duplicated, so I would look up the exif creation date.

But that is just one case. I would be fine with some kind of interface inside the code, so we can extend the behaviour on our own.

A function which gets the list of duplicates and returns a new list of filenames that need to be deleted.

@jRimbault
Owner

jRimbault commented Feb 9, 2021

Thank you for your feedback; it adds to the list of items I'll keep in mind for the future.

I'm still not sure how (or whether) to proceed with this feature. I have been thinking about it, in the back of my mind, for quite some time now.

In the meantime, would you be able to make this kind of solution work?

Running yadf path/to/your/files -f ldjson | python_script.py, i.e. piping the line-delimited JSON output to a Python script doing the deletion with your own criteria?

Tested a bit:

#!/usr/bin/env python3

import fileinput
import json
import os


def main():
    # each line of input is one JSON array of paths to identical files
    for line in fileinput.input():
        files = json.loads(line)
        files.sort(key=exifdate)
        # keep the first file of the sorted group, delete the rest
        for filename in files[1:]:
            os.remove(filename)


def exifdate(filename):
    # get the exif date for each file
    # I don't know how to extract that information with the python stdlib
    # I'd expect PIL/Pillow has something for that, but it's a third party package
    return filename


if __name__ == "__main__":
    main()

Or a more elaborate script in this example.
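If pulling in Pillow is acceptable, the exifdate placeholder could plausibly be filled in along these lines. This is my assumption, untested: tag 306 is the basic EXIF DateTime, and files without usable EXIF data fall back to the filesystem modification time so the sort key stays a float for every file.

import os
from datetime import datetime

from PIL import Image  # third-party: pip install Pillow

EXIF_DATETIME = 306  # 0x0132, the plain "DateTime" EXIF tag


def exifdate(filename):
    # assumption: Pillow's getexif() exposes the base IFD, where tag 306
    # holds a "YYYY:MM:DD HH:MM:SS" string
    try:
        with Image.open(filename) as image:
            date = image.getexif().get(EXIF_DATETIME)
        if date:
            return datetime.strptime(date, "%Y:%m:%d %H:%M:%S").timestamp()
    except (OSError, ValueError, TypeError):
        pass  # not an image, or missing/unparseable EXIF data
    # fall back to the filesystem modification time
    return os.path.getmtime(filename)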

jRimbault added a commit that referenced this issue Feb 19, 2021
hyperfine -w 10 "./target/release/yadf ~" "./target/release/yadf -H ~" "./target/release/yadf ~"

Benchmark #1: ./target/release/yadf ~
  Time (mean ± σ):      2.977 s ±  0.031 s    [User: 9.598 s, System: 13.738 s]
  Range (min … max):    2.935 s …  3.021 s    10 runs

Benchmark #2: ./target/release/yadf -H ~
  Time (mean ± σ):      3.785 s ±  0.040 s    [User: 9.698 s, System: 13.917 s]
  Range (min … max):    3.730 s …  3.886 s    10 runs

Benchmark #3: ./target/release/yadf ~
  Time (mean ± σ):      2.954 s ±  0.025 s    [User: 9.555 s, System: 13.737 s]
  Range (min … max):    2.919 s …  2.991 s    10 runs

Summary
  './target/release/yadf ~' ran
    1.01 ± 0.01 times faster than './target/release/yadf ~'
    1.28 ± 0.02 times faster than './target/release/yadf -H ~'
@GGG-KILLER

GGG-KILLER commented Nov 23, 2021

In my case I'd prefer hardlinking the duplicate files so only one copy remains on disk.

EDIT: Maybe have a --merge-mode flag?
Then have a few options like:

  • delete-older
  • delete-newer
  • hardlink-older
  • hardlink-newer
  • softlink-older
  • softlink-newer

Though this would lead to the issue of "what's 'older' and what's 'newer'?" Do we check creation time, modification time, or access time?
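Until something like that exists in yadf itself, the ldjson-piping approach above could presumably cover the hardlink case too. A rough sketch, untested: it keeps the file with the oldest modification time and hardlinks the rest to it (note that os.link only works within a single filesystem).

#!/usr/bin/env python3

import fileinput
import json
import os


def main():
    # each input line is one JSON array of paths to identical files
    for line in fileinput.input():
        files = json.loads(line)
        files.sort(key=os.path.getmtime)  # oldest modification time first
        original = files[0]
        for duplicate in files[1:]:
            os.remove(duplicate)
            # replace the duplicate with a hardlink to the kept file;
            # os.link fails across filesystem boundaries
            os.link(original, duplicate)


if __name__ == "__main__":
    main()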
