
Offer a way of deleting / hardlinking / softlinking duplicated files automatically #27

Closed
aurelg opened this issue Sep 14, 2020 · 10 comments


aurelg commented Sep 14, 2020

fclones should offer a way of deleting / hardlinking / softlinking duplicated files automatically.

In #25:

@pkolaczk wrote:

That's right, fclones doesn't offer any way of deleting files automatically yet. I believe this is a task for a different program (or a subcommand) that would take output of fclones.

and @piranna replied:

From a UNIX perspective, yes, it makes sense for that task to be done by another command, but it would be so tightly coupled to the fclones output format... :-/ Maybe a shell script wrapper that offers an fdupes-compatible interface? :-) That would be easy to implement, but I'm not sure whether it should be hosted here in the fclones repo or be totally independent...

IMHO, a postprocessing script parsing the fclones output might require more complexity than adding a CLI switch. For instance, here's an (untested) Python implementation that leverages the CSV output (expected in fclones_out.csv) to replace duplicates with hard links:

#!/usr/bin/env python

import logging
from os import link, unlink
from os.path import isfile


def main() -> None:
    with open("fclones_out.csv") as f_handler:
        # Columns 0-2 are metadata; the duplicate paths start at column 3.
        for duplicates in (
            fclone_output_line.rstrip("\n").split(",")[3:]
            for fclone_output_line in f_handler
            if not fclone_output_line.startswith("size")
        ):
            src = duplicates[0]
            for dst in duplicates[1:]:
                logging.debug("%s -> %s", src, dst)
                if isfile(dst):
                    unlink(dst)
                link(src, dst)


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    main()

PS: I think this deserves a ticket on its own, feel free to delete it if you don't agree. :-)
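To make the fdupes-compatible wrapper idea above concrete, here is a minimal sketch (my own, assuming the same CSV layout as the script above, with paths from the fourth column on) that prints groups in fdupes' blank-line-separated output format:

```python
import sys


def print_fdupes_format(lines):
    """Print fclones duplicate groups in fdupes' output format:
    one path per line, with a blank line separating groups."""
    for line in lines:
        if line.startswith("size"):  # skip the CSV header row
            continue
        # columns 0-2 are metadata; paths start at the fourth column
        for path in line.rstrip("\n").split(",")[3:]:
            print(path)
        print()  # blank line terminates the group, as fdupes does


# Typical use on piped output: print_fdupes_format(sys.stdin)
```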


piranna commented Sep 14, 2020

IMHO, a postprocessing script parsing the fclones output might require more complexity than adding a CLI switch

What do you mean, exactly?

I like your approach of using Python; maybe bash is not enough, although it's more powerful than people would expect, and this could be done with it in a more portable way. The Python wrapper, though, would need to be an independent project, since it would not be just a helper command anymore... But yes, a fclones-helpers package would totally make sense :-)


aurelg commented Sep 14, 2020

What do you mean, exactly?

Most of the Python code above deals with reconstructing proper data structures from the fclones output. Such data structures are probably already available inside fclones, so a dedicated flag would bypass the need for implementing (and maintaining) a parser.

I'm not very happy with the Python dependency either. IMHO the link between an independent Python project and fclones would be so tight that I don't think it's worth the split.

I'd prefer a shell-based approach as well: it would be more portable. But I fear it could become rather limiting later, as shell scripts get pretty complex, and neither very readable nor reliable compared to Python, once tests, additional switches, or edge-case handling are needed.

Anyhow, a postprocessing step would probably limit (if not defeat) the speed advantage of fclones vs jdupes/fdupes.


piranna commented Sep 14, 2020

Anyhow, a postprocessing step would probably limit (if not defeat) the speed advantage of fclones vs jdupes/fdupes.

I think the bottleneck is in the hashing...

@pkolaczk pkolaczk added the enhancement New feature or request label Sep 14, 2020
pkolaczk (Owner) commented:

@aurelg The postprocessing step would be fast and definitely not a bottleneck. The main bottleneck is I/O for reading files to compute the hashes.

I generally agree this feature is much easier to implement inside fclones.
However, this is not as simple as the provided Python script. When automatically deleting user files, one has to be extremely cautious. There may be edge cases where, e.g., a file was moved to a different location during the scanning phase after fclones registered it as a duplicate; by the time fclones wants to delete it, there is no duplicate anymore.

This:

    if isfile(dst):
        unlink(dst)
    link(src, dst)

might end up deleting the only existing file.

Better to move the file aside first instead of deleting it, then create the link, and only if everything went fine, drop the moved file.
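A minimal sketch of that safer sequence (a hypothetical helper, not fclones' actual implementation; it assumes src and dst live on the same filesystem and uses a made-up `.fclones.bak` backup suffix):

```python
import os


def replace_with_hardlink(src: str, dst: str) -> None:
    """Replace dst with a hard link to src, keeping a backup of dst
    until the link has been created successfully."""
    backup = dst + ".fclones.bak"  # hypothetical backup name
    os.rename(dst, backup)         # move the duplicate aside first
    try:
        os.link(src, dst)          # create the hard link
    except OSError:
        os.rename(backup, dst)     # restore the original on failure
        raise
    os.unlink(backup)              # all went fine: drop the moved file
```

If creating the link fails (for instance because src disappeared in the meantime), the original file is renamed back, so the worst case is an unchanged tree rather than a lost file.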


aurelg commented Sep 14, 2020

It might also be nice to avoid creating dst if it has been removed since fclones was executed. Such edge cases come from the arbitrary amount of time (and changes on the filesystem) between the execution of fclones and the postprocessing. An implementation inside fclones could be far more robust. 👍

@pkolaczk pkolaczk self-assigned this Apr 20, 2021
@pkolaczk pkolaczk added this to the 0.12 milestone Apr 20, 2021
rleaver152 commented:

> fclones should offer a way of deleting / hardlinking / softlinking duplicated files automatically. [...]

I added a few things; love the code. Assumes you output the CSV file to /tmp for tidiness. Remember to put the primary directory last in the fclones path so those files are kept as the priority (in contrast to rdfind, where it's the first directory that is kept).

#!/usr/bin/env python3

import logging
import os
from pathlib import Path


def main() -> None:
    with open("/tmp/fclones_out.csv") as f_handler:
        for duplicates in (
            fclone_output_line.split(",")[3:]
            for fclone_output_line in f_handler
            if not fclone_output_line.startswith("size")
        ):
            src = duplicates[0]
            for dst in duplicates[1:]:
                # logging.debug("%s -> %s", src, dst)
                dst = dst.strip("\n")
                if Path(dst).is_file():
                    os.remove(dst)


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    main()



rleaver152 commented May 2, 2021

> fclones should offer a way of deleting / hardlinking / softlinking duplicated files automatically. [...]

And here is a version that just moves the duplicates to a duplicates directory ($HOME/Duplicates) for safety:


#!/usr/bin/env python3

import logging
import os
import shutil
from pathlib import Path


def main() -> None:
    moveto = os.path.expanduser("~/Duplicates/")
    with open("/tmp/fclones_out.csv") as f_handler:
        for duplicates in (
            fclone_output_line.split(",")[3:]
            for fclone_output_line in f_handler
            if not fclone_output_line.startswith("size")
        ):
            src = duplicates[0]
            for dst in duplicates[1:]:
                logging.debug("%s -> %s", src, dst)
                dst = dst.strip("\n")
                if Path(dst).is_file():
                    sink = moveto + os.path.basename(dst)
                    shutil.move(dst, sink)


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    main()





piranna commented May 2, 2021

Assumes you output the CSV file to /tmp for tidiness

Better if it gets the info directly from stdin :-)
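A sketch of that variant: if the parsing loop accepts any iterable of lines instead of opening a fixed path, piping fclones' CSV output straight into the script via sys.stdin works unchanged (same assumed CSV layout as the scripts above, paths from the fourth column on):

```python
import sys


def duplicate_groups(lines):
    """Yield lists of duplicate paths from fclones CSV lines,
    skipping the header row. Accepts any iterable of lines,
    including sys.stdin for piped input."""
    for line in lines:
        if line.startswith("size"):  # skip the CSV header row
            continue
        yield line.rstrip("\n").split(",")[3:]


# Piped use, e.g. `fclones ... | ./dedupe.py`:
#     for group in duplicate_groups(sys.stdin): ...
```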


rleaver152 commented May 2, 2021

Assumes you output the CSV file to /tmp for tidiness

Better if it gets the info directly from stdin :-)

I like to check before deleting!! :-) And the move version loses the directory structure, so I equally want to check first.


pkolaczk commented Jun 5, 2021

Implemented in #53, released as v0.12.0.

@pkolaczk pkolaczk closed this as completed Jun 5, 2021