From Albert Hofkamp, November 2009:
I am looking for a utility to find the mess on my HD, and your program seemed like a good first step, so I tried it.
Here are my findings; maybe they are useful for improving the program further.
First of all, the program is flawed:
[hat@localhost]~/tmp% md5sum *
d7e31943a69bdb8e403532b48e1543bc b
029b6cf9da0ded154bddbb323774b452 bla
d7e31943a69bdb8e403532b48e1543bc c
bbda6f89123d797e45af1f94b85ec7d3 q
[hat@localhost]~/tmp% ls -l
total 24
-rw-r--r--. 1 hat hat 2543 2009-11-28 09:00 b
-rw-r--r--. 1 hat hat 10652 2009-11-28 09:13 bla
-rw-r--r--. 1 hat hat 2543 2009-11-28 09:00 c
-rw-r--r--. 1 hat hat 2543 2009-11-28 09:13 q
Files b and c have the same contents, as you can see from the md5 checksums.
Your program detects that:
[hat@localhost]~/tmp% ~/.local/bin/duplicatefinder.py -v .
Starting checking directories /home/hat/tmp
Phase 1: walking directories
Phase 2: sorting by size
Phase 3: seek files with same size
The file /home/hat/tmp/c is a duplicate of /home/hat/tmp/b
Completed
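
(For reference, I assume the three phases amount to something like the following; this is only my sketch of the size-then-hash idea, I have not read the actual duplicatefinder.py code.)

import os
import hashlib
from collections import defaultdict

def find_duplicates(root):
    # Phase 1: walk the directory tree and collect every file path.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Phase 2: bucket by size; files of different sizes cannot be equal.
            by_size[os.path.getsize(path)].append(path)

    # Phase 3: within each same-size bucket, compare contents via a hash.
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        by_hash = defaultdict(list)
        for path in paths:
            with open(path, 'rb') as f:
                by_hash[hashlib.md5(f.read()).hexdigest()].append(path)
        for group in by_hash.values():
            for dup in group[1:]:
                print(f"The file {dup} is a duplicate of {group[0]}")

find_duplicates('/home/hat/tmp')
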
So far so good. Now watch what happens if I rename q:
[hat@localhost]~/tmp% mv q a
[hat@localhost]~/tmp% ~/.local/bin/duplicatefinder.py -v .
Starting checking directories /home/hat/tmp
Phase 1: walking directories
Phase 2: sorting by size
Phase 3: seek files with same size
Completed
What happened to the equality of b and c?
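My guess (purely from the output, I have not read the code) is that phase 3 compares every file in a size group only against the first file of that group. Once a sorts before b, both b and c are compared against a, found to differ, and are never compared against each other. A sketch of that suspected failure mode:

import filecmp

# Suspected (buggy) phase 3: each file in a size group is compared
# only against the first file of the group.
def report_duplicates_buggy(group):
    first = group[0]
    for other in group[1:]:
        if filecmp.cmp(first, other, shallow=False):
            print(f"The file {other} is a duplicate of {first}")

# Before the rename the 2543-byte group is ['b', 'c', 'q']:
#   c == b is found and q differs, so the output is correct by accident.
# After 'mv q a' the group is ['a', 'b', 'c']:
#   b != a and c != a, and b == c is never even tested.

Grouping each size bucket by a content hash, as in the sketch above, would avoid this.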
Secondly, your output is nicely readable when a set of duplicates contains only 2 files. With 3 or more files sharing the same contents, the number of output lines explodes, and I have to work out manually which files are the same:
[hat@localhost]~/tmp% cp b a
[hat@localhost]~/tmp% cp b bla
[hat@localhost]~/tmp% cp ../scancodes.pdf .
[hat@localhost]~/tmp% ll
total 60
-rw-r--r--. 1 hat hat 2543 2009-11-28 09:27 a
-rw-r--r--. 1 hat hat 2543 2009-11-28 09:00 b
-rw-r--r--. 1 hat hat 2543 2009-11-28 09:27 bla
-rw-r--r--. 1 hat hat 2543 2009-11-28 09:00 c
-rw-r--r--. 1 hat hat 44920 2009-11-28 09:27 scancodes.pdf
[hat@localhost]~/tmp% cp scancodes.pdf y
[hat@localhost]~/tmp% cp scancodes.pdf bbb
[hat@localhost]~/tmp% ~/.local/bin/duplicatefinder.py -v .
Starting checking directories /home/hat/tmp
Phase 1: walking directories
Phase 2: sorting by size
Phase 3: seek files with same size
The file /home/hat/tmp/b is a duplicate of /home/hat/tmp/a
The file /home/hat/tmp/bla is a duplicate of /home/hat/tmp/a
The file /home/hat/tmp/c is a duplicate of /home/hat/tmp/a
The file /home/hat/tmp/scancodes.pdf is a duplicate of /home/hat/tmp/bbb
The file /home/hat/tmp/y is a duplicate of /home/hat/tmp/bbb
Completed
I'd rather have either a single line for each set of duplicates, or a group of lines per set, as in

/home/hat/tmp/a
/home/hat/tmp/b
/home/hat/tmp/bla
/home/hat/tmp/c

/home/hat/tmp/scancodes.pdf
/home/hat/tmp/bbb
/home/hat/tmp/y

Much less clutter, much easier to understand.
All filenames on a single line is simpler if you want to feed the duplicate names into a second script for further processing:

/home/hat/tmp/a; /home/hat/tmp/b; /home/hat/tmp/bla; /home/hat/tmp/c
/home/hat/tmp/scancodes.pdf; /home/hat/tmp/bbb; /home/hat/tmp/y

If you can find a way to get rid of all the /home/hat/tmp/ prefixes, that would be great (though it may be difficult when the duplicate files are spread over several directories).
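
For what it's worth, stripping the prefixes looks doable even when the files live in different directories: print the longest common directory once per group, and make the file names relative to it. A rough sketch (the function name is mine, not from your program):

import os

def print_group(paths):
    # Print the longest shared directory prefix once, then the
    # duplicate names relative to it, all on a single line.
    prefix = os.path.commonpath(paths)
    print(f"{prefix}:")
    print('; '.join(os.path.relpath(p, prefix) for p in paths))

print_group(['/home/hat/tmp/a', '/home/hat/tmp/b',
             '/home/hat/tmp/bla', '/home/hat/tmp/c'])
# Output:
# /home/hat/tmp:
# a; b; bla; c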