
Duplicate bug and bad output format #1

Open
keul opened this issue Sep 22, 2013 · 0 comments
keul commented Sep 22, 2013

From Albert Hofkamp, back in November 2009:


I am looking for a utility to find duplicate files making a mess on my HD, and your program seemed a good first step, so I tried it.
Here are my findings; maybe they are useful for improving the program further.

First of all, the program is flawed:

[hat@localhost]~/tmp% md5sum *
d7e31943a69bdb8e403532b48e1543bc b
029b6cf9da0ded154bddbb323774b452 bla
d7e31943a69bdb8e403532b48e1543bc c
bbda6f89123d797e45af1f94b85ec7d3 q

[hat@localhost]~/tmp% ls -l
total 24
-rw-r--r--. 1 hat hat 2543 2009-11-28 09:00 b
-rw-r--r--. 1 hat hat 10652 2009-11-28 09:13 bla
-rw-r--r--. 1 hat hat 2543 2009-11-28 09:00 c
-rw-r--r--. 1 hat hat 2543 2009-11-28 09:13 q

Files b and c have identical contents, as you can see from the md5 checksums.
Your program detects that:
[hat@localhost]~/tmp% ~/.local/bin/duplicatefinder.py -v .
Starting checking directories /home/hat/tmp
Phase 1: walking directories
Phase 2: sorting by size
Phase 3: seek files with same size
The file /home/hat/tmp/c is a duplicate of /home/hat/tmp/b

Completed

So far so good. Now watch what happens if I rename q:

[hat@localhost]~/tmp% mv q a
[hat@localhost]~/tmp% ~/.local/bin/duplicatefinder.py -v .
Starting checking directories /home/hat/tmp
Phase 1: walking directories
Phase 2: sorting by size
Phase 3: seek files with same size

Completed

What happened to the equality of b and c?

Secondly, your output is nicely readable when a file has only one duplicate. With three or more files sharing the same contents, the number of output lines explodes, and I have to work out manually which files are identical:
[hat@localhost]~/tmp% cp b a
[hat@localhost]~/tmp% cp b bla
[hat@localhost]~/tmp% cp ../scancodes.pdf .
[hat@localhost]~/tmp% ll
total 60
-rw-r--r--. 1 hat hat 2543 2009-11-28 09:27 a
-rw-r--r--. 1 hat hat 2543 2009-11-28 09:00 b
-rw-r--r--. 1 hat hat 2543 2009-11-28 09:27 bla
-rw-r--r--. 1 hat hat 2543 2009-11-28 09:00 c
-rw-r--r--. 1 hat hat 44920 2009-11-28 09:27 scancodes.pdf
[hat@localhost]~/tmp% cp scancodes.pdf y
[hat@localhost]~/tmp% cp scancodes.pdf bbb
[hat@localhost]~/tmp% ~/.local/bin/duplicatefinder.py -v .
Starting checking directories /home/hat/tmp
Phase 1: walking directories
Phase 2: sorting by size
Phase 3: seek files with same size
The file /home/hat/tmp/b is a duplicate of /home/hat/tmp/a
The file /home/hat/tmp/bla is a duplicate of /home/hat/tmp/a
The file /home/hat/tmp/c is a duplicate of /home/hat/tmp/a
The file /home/hat/tmp/scancodes.pdf is a duplicate of /home/hat/tmp/bbb
The file /home/hat/tmp/y is a duplicate of /home/hat/tmp/bbb

Completed

I'd rather have either a single line for each set of duplicates, or groups of lines, as in

/home/hat/tmp/a
/home/hat/tmp/b
/home/hat/tmp/bla
/home/hat/tmp/c

/home/hat/tmp/scancodes.pdf
/home/hat/tmp/bbb
/home/hat/tmp/y

Much less clutter, much easier to understand.
All filenames on a single line is simpler if you want to feed the duplicate names into a second script for further processing:

/home/hat/tmp/a; /home/hat/tmp/b; /home/hat/tmp/bla; /home/hat/tmp/c
/home/hat/tmp/scancodes.pdf; /home/hat/tmp/bbb; /home/hat/tmp/y
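Both formats are easy to produce once the duplicates are collected as groups of paths rather than reported pairwise. A small sketch (the function name and group representation are my own assumptions, not from duplicatefinder.py):

```python
def format_groups(groups, one_line=False, sep='; '):
    """Render duplicate groups either as blank-line-separated blocks
    of paths, or as one separator-joined line per group."""
    if one_line:
        # One group per line, ready to pipe into another script.
        return '\n'.join(sep.join(group) for group in groups)
    # Blank-line-separated blocks, one path per line.
    return '\n\n'.join('\n'.join(group) for group in groups)
```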

If you can find a way to get rid of all the /home/hat/tmp/ prefixes, that would be great (but it may be difficult when duplicate files are spread over several directories).
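Stripping the shared prefix is doable even across directories: print each path relative to the deepest directory common to the whole group. A sketch using `os.path.commonpath` (Python 3.4+; the function name is my own):

```python
import os

def strip_common_prefix(paths):
    """Return paths relative to their deepest shared directory.

    Falls back to the original paths when no common prefix exists
    (e.g. a mix of absolute and relative paths).
    """
    try:
        prefix = os.path.commonpath(paths)
    except ValueError:  # mixed absolute/relative paths have no common path
        return list(paths)
    return [os.path.relpath(p, prefix) for p in paths]
```

When the duplicates are spread over several directories, the result keeps just enough of each path to stay unambiguous (e.g. `tmp/a` vs `doc/b`).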
