
[Improvement] Feature to check for files [on an external disk] which are *not* present somewhere on the [backup] disk #162

Open
Wikinaut opened this issue Aug 5, 2019 · 3 comments

Comments

@Wikinaut

Wikinaut commented Aug 5, 2019

I wish to have a feature which makes intelligent use of the checksums/hashes of the huge "backup" drive X, so that when I connect a smaller drive Z to my computer, I can quickly list all those files which are

  • present on drive Z ; and/but
  • not present on drive X

This is a "one-way" check. I don't want the huge list of all differences. I only want to know which files from Z have, for one reason or another, not been copied (or later moved) to drive X, into any directory there. So basically it's a checksum/hash problem.

@Wikinaut
Author

Wikinaut commented Sep 13, 2019

Hello, can we talk about such a new feature? If you wish, I can explain again why rsync is not a solution.

It's something like https://askubuntu.com/a/767988

fdupes is an excellent program to find the duplicate files but it does not list the non-duplicate files, which is what you are looking for. However, we can list the files that are not in the fdupes output using a combination of find and grep.

@pixelb
Owner

pixelb commented Sep 13, 2019

OK, an rsync solution would work if the structure in the dest were similar to that in the source, i.e. something like:

    rsync -rl --dry-run --out-format="%f" --checksum Z/ X/

So I presume the structure of your source Z differs from that in dest X,
i.e. you want to list files not backed up, no matter where they are in Z,
so that you can copy them to the appropriate location in X, etc.

So you want the equivalent of the following, but with more efficient handling of unique file sizes, etc.:

    $ SRC=Z/; DST=X/
    $ find "$SRC" "$DST" -type f | xargs md5sum |
      sed "\|  $DST|p" |   # duplicate DST lines so their checksums are never unique
      sort | uniq -w32 -u | cut -d' ' -f3

One could avoid the overhead of scanning and checksumming $DST if it was not updated between fslint dedupe runs. In that case fslint could write an index of size,checksum,name which could be used directly in the process above.
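The index idea can be sketched as two small helpers. These names (`build_index`, `list_unbacked`) are illustrative, not fslint commands, and a full size,checksum,name index is simplified here to plain `md5sum` output:

```shell
# One-off pass over the backup disk: record "md5sum  path" per file.
build_index() {   # $1 = backup root   $2 = index file
    find "$1" -type f -print0 | xargs -0 md5sum > "$2"
}

# Print every file under the source root whose checksum does not
# appear anywhere in the index, i.e. content not yet backed up.
list_unbacked() { # $1 = source root   $2 = index file
    find "$1" -type f -print0 | xargs -0 md5sum |
      awk 'NR==FNR { seen[$1] = 1; next }            # pass 1: load index hashes
           !($1 in seen) { sub(/^[^ ]+  /, ""); print }' "$2" -
}
```

Once the index exists, checking a freshly connected drive Z only costs one scan of Z, regardless of how large the backup disk is.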

@Wikinaut
Author

Wikinaut commented Sep 13, 2019

Yes, the structure is different, or may be different, so we have to "search" for the file hash.

I also found this proposal for "fdupes" adrianlopezroche/fdupes#19

It would be good to save the hash/parse/analyze information of a specific fdupes run, in order to later compare this "virtual" file tree with a real file tree.


Currently I run the suggested sequence from https://askubuntu.com/a/767988 (see above) to list the files which are unique to backup/ (Z in my example), i.e. which are in backup/ but not in documents/. [My use case is the other way round: to look for files which are not yet anywhere in the "backup".]

    fdupes -r backup/ documents/ > dup.txt
    find backup/ -type f | grep -Fxvf dup.txt
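For the flipped use case (files on Z not yet anywhere in the backup), the same exclusion step can be pointed the other way. A self-contained sketch with illustrative paths, where the fdupes output is hand-built so the filter step can be shown on its own:

```shell
# Z holds one file already present in backup/ and one that is not.
tmp=$(mktemp -d)
mkdir -p "$tmp/Z" "$tmp/backup"
printf 'same' > "$tmp/Z/old.txt"
printf 'same' > "$tmp/backup/copy-of-old.txt"
printf 'new'  > "$tmp/Z/fresh.txt"

# Stand-in for: fdupes -r "$tmp/backup" "$tmp/Z" > dup.txt
# (fdupes itself would report this duplicate pair).
printf '%s\n%s\n' "$tmp/backup/copy-of-old.txt" "$tmp/Z/old.txt" > "$tmp/dup.txt"

# Exact-line exclusion: files on Z for which fdupes found no duplicate.
not_backed_up=$(find "$tmp/Z" -type f | grep -Fxvf "$tmp/dup.txt")
echo "$not_backed_up"
```

One caveat of this approach: a file duplicated within Z itself but absent from the backup would still be excluded, since fdupes reports intra-Z duplicate groups too.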
