
Duplicate bug and bad output format #1

Open
keul opened this issue Sep 22, 2013 · 0 comments
keul commented Sep 22, 2013

From Albert Hofkamp, back in November 2009:


I am looking for a utility to find duplicate files making a mess on my HD, and your program seemed a good first step, so I tried it.
Here are my findings; maybe they are useful for improving the program further.

First of all, the program is flawed:

[hat@localhost]~/tmp% md5sum *
d7e31943a69bdb8e403532b48e1543bc b
029b6cf9da0ded154bddbb323774b452 bla
d7e31943a69bdb8e403532b48e1543bc c
bbda6f89123d797e45af1f94b85ec7d3 q

[hat@localhost]~/tmp% ls -l
total 24
-rw-r--r--. 1 hat hat 2543 2009-11-28 09:00 b
-rw-r--r--. 1 hat hat 10652 2009-11-28 09:13 bla
-rw-r--r--. 1 hat hat 2543 2009-11-28 09:00 c
-rw-r--r--. 1 hat hat 2543 2009-11-28 09:13 q

Files b and c have identical contents, as you can see from the md5 checksums.
Your program detects that:
[hat@localhost]~/tmp% ~/.local/bin/duplicatefinder.py -v .
Starting checking directories /home/hat/tmp
Phase 1: walking directories
Phase 2: sorting by size
Phase 3: seek files with same size
The file /home/hat/tmp/c is a duplicate of /home/hat/tmp/b

Completed

So far so good. Now watch what happens if I rename q:

[hat@localhost]~/tmp% mv q a
[hat@localhost]~/tmp% ~/.local/bin/duplicatefinder.py -v .
Starting checking directories /home/hat/tmp
Phase 1: walking directories
Phase 2: sorting by size
Phase 3: seek files with same size

Completed

What happened to the equality of b and c?

Secondly, your output is nicely readable when a file has only one duplicate. With three or more files sharing the same contents, the number of output lines explodes, and I have to work out manually which files are identical:
[hat@localhost]~/tmp% cp b a
[hat@localhost]~/tmp% cp b bla
[hat@localhost]~/tmp% cp ../scancodes.pdf .
[hat@localhost]~/tmp% ll
total 60
-rw-r--r--. 1 hat hat 2543 2009-11-28 09:27 a
-rw-r--r--. 1 hat hat 2543 2009-11-28 09:00 b
-rw-r--r--. 1 hat hat 2543 2009-11-28 09:27 bla
-rw-r--r--. 1 hat hat 2543 2009-11-28 09:00 c
-rw-r--r--. 1 hat hat 44920 2009-11-28 09:27 scancodes.pdf
[hat@localhost]~/tmp% cp scancodes.pdf y
[hat@localhost]~/tmp% cp scancodes.pdf bbb
[hat@localhost]~/tmp% ~/.local/bin/duplicatefinder.py -v .
Starting checking directories /home/hat/tmp
Phase 1: walking directories
Phase 2: sorting by size
Phase 3: seek files with same size
The file /home/hat/tmp/b is a duplicate of /home/hat/tmp/a
The file /home/hat/tmp/bla is a duplicate of /home/hat/tmp/a
The file /home/hat/tmp/c is a duplicate of /home/hat/tmp/a
The file /home/hat/tmp/scancodes.pdf is a duplicate of /home/hat/tmp/bbb
The file /home/hat/tmp/y is a duplicate of /home/hat/tmp/bbb

Completed

I'd rather have either a single line for each set of duplicates, or groups of lines, as in

/home/hat/tmp/a
/home/hat/tmp/b
/home/hat/tmp/bla
/home/hat/tmp/c

/home/hat/tmp/scancodes.pdf
/home/hat/tmp/bbb
/home/hat/tmp/y

Much less clutter, much easier to understand.
All filenames on a single line is simpler if you want to feed the duplicate names into a second script for further processing:

/home/hat/tmp/a; /home/hat/tmp/b; /home/hat/tmp/bla; /home/hat/tmp/c
/home/hat/tmp/scancodes.pdf; /home/hat/tmp/bbb; /home/hat/tmp/y
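Both formats are easy to produce once the duplicates are collected as groups of paths rather than reported pairwise. A small sketch (the function name and group representation are my own assumptions, not from duplicatefinder.py):

```python
def format_groups(groups, one_line=False, sep='; '):
    """Render duplicate groups either as blank-line-separated blocks
    of paths, or as one separator-joined line per group."""
    if one_line:
        # One group per line, ready to pipe into another script.
        return '\n'.join(sep.join(group) for group in groups)
    # Blank-line-separated blocks, one path per line.
    return '\n\n'.join('\n'.join(group) for group in groups)
```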

If you can find a way to get rid of all the /home/hat/tmp/ prefixes, that would be great (but it may be difficult when duplicate files are spread over several directories).
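Stripping the shared prefix is doable even across directories: print each path relative to the deepest directory common to the whole group. A sketch using `os.path.commonpath` (Python 3.4+; the function name is my own):

```python
import os

def strip_common_prefix(paths):
    """Return paths relative to their deepest shared directory.

    Falls back to the original paths when no common prefix exists
    (e.g. a mix of absolute and relative paths).
    """
    try:
        prefix = os.path.commonpath(paths)
    except ValueError:  # mixed absolute/relative paths have no common path
        return list(paths)
    return [os.path.relpath(p, prefix) for p in paths]
```

When the duplicates are spread over several directories, the result keeps just enough of each path to stay unambiguous (e.g. `tmp/a` vs `doc/b`).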
