
Offer a way of deleting / hardlinking / softlinking duplicated files automatically #27

Closed
aurelg opened this issue Sep 14, 2020 · 10 comments


aurelg commented Sep 14, 2020

fclones should offer a way of deleting / hardlinking / softlinking duplicated files automatically.

In #25:

@pkolaczk wrote:

That's right, fclones doesn't offer any way of deleting files automatically yet. I believe this is a task for a different program (or a subcommand) that would take output of fclones.

and @piranna replied:

From a UNIX perspective, yes, it makes sense for that task to be done by another command, but it would be so tightly coupled to the fclones output format... :-/ Maybe a shell script wrapper that offers an fdupes-compatible interface? :-) That would be easy to implement, but I'm not sure whether it should be hosted here in the fclones repo or be totally independent...

IMHO, a postprocessing script parsing the fclones output might require more complexity than adding a CLI switch. For instance, here's an (untested) Python implementation that leverages the CSV output (expected in fclones_out.csv) to replace duplicates with hard links:

#!/usr/bin/env python

import logging
from os import link, unlink
from os.path import isfile


def main() -> None:
    with open("fclones_out.csv") as f_handler:
        # Columns 0-2 are metadata; the duplicate paths start at column 3.
        for duplicates in (
            fclone_output_line.rstrip("\n").split(",")[3:]
            for fclone_output_line in f_handler
            if not fclone_output_line.startswith("size")
        ):
            src = duplicates[0]
            for dst in duplicates[1:]:
                logging.debug("%s -> %s", src, dst)
                if isfile(dst):
                    unlink(dst)
                link(src, dst)


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    main()

PS: I think this deserves a ticket on its own, feel free to delete it if you don't agree. :-)
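To make the fdupes-compatible wrapper idea above concrete, here is a minimal sketch (my own, assuming the same CSV layout as the script above, with paths from the fourth column on) that prints groups in fdupes' blank-line-separated output format:

```python
import sys


def print_fdupes_format(lines):
    """Print fclones duplicate groups in fdupes' output format:
    one path per line, with a blank line separating groups."""
    for line in lines:
        if line.startswith("size"):  # skip the CSV header row
            continue
        # columns 0-2 are metadata; paths start at the fourth column
        for path in line.rstrip("\n").split(",")[3:]:
            print(path)
        print()  # blank line terminates the group, as fdupes does


# Typical use on piped output: print_fdupes_format(sys.stdin)
```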


piranna commented Sep 14, 2020

IMHO, a postprocessing script parsing the fclones output might require more complexity than adding a CLI switch

What do you mean, exactly?

I like your approach of using Python; maybe bash is not enough, although it's more powerful than people would expect, and this could be done with it in a more portable way. The Python wrapper, though, would need to be an independent project, since it would not be just a helper command anymore... But yes, a fclones-helpers package would totally make sense :-)


aurelg commented Sep 14, 2020

What do you mean, exactly?

Most of the Python code above deals with reconstructing proper data structures from the fclones output. Such data structures are probably already available inside fclones, so a dedicated flag would bypass the need for implementing (and maintaining) a parser.

I'm not very happy with the Python dependency either. IMHO the link between an independent Python project and fclones would be so tight that I don't think it's worth the split.

I'd prefer a shell-based approach as well: it would be more portable. But I fear it could become rather limiting later, as shell scripts get pretty complex, and neither very readable nor reliable compared to Python, once tests, additional switches, or edge-case handling are needed.

Anyhow, a postprocessing step would probably limit (if not defeat) the speed advantage of fclones vs jdupes/fdupes.


piranna commented Sep 14, 2020

Anyhow, a postprocessing step would probably limit (if not defeat) the speed advantage of fclones vs jdupes/fdupes.

I think the bottleneck is in the hashing...

@pkolaczk pkolaczk added the enhancement New feature or request label Sep 14, 2020
pkolaczk (Owner) commented:

@aurelg The postprocessing step would be fast and definitely not a bottleneck. The main bottleneck is I/O for reading files to compute the hashes.

I generally agree this feature is much easier to implement inside fclones.
However, this is not as simple as the provided Python script. When automatically deleting user files, one has to be extremely cautious. There may be edge cases where, e.g., a file was moved to a different location during the scanning phase after fclones registered it as a duplicate; by the time fclones wants to delete it, there is no duplicate anymore.

This:

    if isfile(dst):
        unlink(dst)
    link(src, dst)

might end up deleting the only existing file.

Better to move the file aside first instead of deleting it, then create the link, and only if everything went fine, drop the moved file.
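A minimal sketch of that safer sequence (a hypothetical helper, not fclones' actual implementation; it assumes src and dst live on the same filesystem and uses a made-up `.fclones.bak` backup suffix):

```python
import os


def replace_with_hardlink(src: str, dst: str) -> None:
    """Replace dst with a hard link to src, keeping a backup of dst
    until the link has been created successfully."""
    backup = dst + ".fclones.bak"  # hypothetical backup name
    os.rename(dst, backup)         # move the duplicate aside first
    try:
        os.link(src, dst)          # create the hard link
    except OSError:
        os.rename(backup, dst)     # restore the original on failure
        raise
    os.unlink(backup)              # all went fine: drop the moved file
```

If creating the link fails (for instance because src disappeared in the meantime), the original file is renamed back, so the worst case is an unchanged tree rather than a lost file.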


aurelg commented Sep 14, 2020

It might also be nice to avoid creating dst if it has been removed since fclones was executed. Such edge cases come from the arbitrary amount of time (and changes on the filesystem) between the execution of fclones and the postprocessing. An implementation inside fclones could be far more robust. 👍

@pkolaczk pkolaczk self-assigned this Apr 20, 2021
@pkolaczk pkolaczk added this to the 0.12 milestone Apr 20, 2021
rleaver152 commented:

> fclones should offer a way of deleting / hardlinking / softlinking duplicated files automatically. [...]

I added a few things; love the code. Assumes you output the CSV file to /tmp for tidiness. Remember to put the primary directory last in the fclones path so those files are kept as the priority (in contrast to rdfind, where it's the first directory that is kept).

#!/usr/bin/env python3

import logging
import os
from pathlib import Path


def main() -> None:
    with open("/tmp/fclones_out.csv") as f_handler:
        for duplicates in (
            fclone_output_line.split(",")[3:]
            for fclone_output_line in f_handler
            if not fclone_output_line.startswith("size")
        ):
            src = duplicates[0]
            for dst in duplicates[1:]:
                # logging.debug("%s -> %s", src, dst)
                dst = dst.strip("\n")
                if Path(dst).is_file():
                    os.remove(dst)


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    main()



rleaver152 commented May 2, 2021

> fclones should offer a way of deleting / hardlinking / softlinking duplicated files automatically. [...]

And here is a version that just moves the duplicates to a duplicates directory ($HOME/Duplicates) for safety:


#!/usr/bin/env python3

import logging
import os
import shutil
from pathlib import Path


def main() -> None:
    moveto = os.path.expanduser("~/Duplicates/")
    with open("/tmp/fclones_out.csv") as f_handler:
        for duplicates in (
            fclone_output_line.split(",")[3:]
            for fclone_output_line in f_handler
            if not fclone_output_line.startswith("size")
        ):
            src = duplicates[0]
            for dst in duplicates[1:]:
                logging.debug("%s -> %s", src, dst)
                dst = dst.strip("\n")
                if Path(dst).is_file():
                    sink = moveto + os.path.basename(dst)
                    shutil.move(dst, sink)


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    main()





piranna commented May 2, 2021

Assumes you output the CSV file to /tmp for tidiness

Better if it gets the info directly from stdin :-)
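A sketch of that variant: if the parsing loop accepts any iterable of lines instead of opening a fixed path, piping fclones' CSV output straight into the script via sys.stdin works unchanged (same assumed CSV layout as the scripts above, paths from the fourth column on):

```python
import sys


def duplicate_groups(lines):
    """Yield lists of duplicate paths from fclones CSV lines,
    skipping the header row. Accepts any iterable of lines,
    including sys.stdin for piped input."""
    for line in lines:
        if line.startswith("size"):  # skip the CSV header row
            continue
        yield line.rstrip("\n").split(",")[3:]


# Piped use, e.g. `fclones ... | ./dedupe.py`:
#     for group in duplicate_groups(sys.stdin): ...
```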


rleaver152 commented May 2, 2021

Assumes you output the CSV file to /tmp for tidiness

Better if it gets the info directly from stdin :-)

I like to check before deleting!! :-) And the move version loses the directory structure, so I equally want to check first.


pkolaczk commented Jun 5, 2021

Implemented in #53, released as v0.12.0.

@pkolaczk pkolaczk closed this as completed Jun 5, 2021