dupd(1)
NAME
dupd - find duplicate files
SYNOPSIS
dupd COMMAND [OPTIONS]
DESCRIPTION
dupd scans all the files in the given path(s) to find files with duplicate content.
The sets of duplicate files are not displayed during a scan. Instead,
the duplicate info is saved into a database which can be queried with
subsequent commands without having to scan all files again.
Even though dupd can be used as a simple duplicate reporting tool similar to how other duplicate finders work (by running dupd scan ; dupd report), the real power of dupd comes from interactively exploring the filesystem for duplicates after the scan has completed. See the file, ls, dups, uniques and refresh commands.
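For example, a typical session might scan the home directory once and then explore the results interactively (the directory and file names here are illustrative):
% dupd scan --path $HOME
% dupd dups --path $HOME/docs
% dupd file --file $HOME/docs/report.pdf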
Additional documentation and examples are available under the docs directory in the source tree. If you don't have the source tree available, see https://github.com/jvirkki/dupd/blob/master/docs/index.md
COMMANDS
As noted in the synopsis, the first argument to dupd must be the command to run. The command is one of:
scan - scan files looking for duplicates
report - show duplicate report from last scan
file - check for duplicates of one file
ls - list info about every file
dups - list all duplicate files
uniques - list all unique files
refresh - remove deleted files from the database
validate - revalidate all duplicates in database
rmsh - create shell script to delete all duplicates (use with care!)
hash - hash a single file and display the result
help - show brief usage info
usage - show this documentation
man - show this documentation
license - show license info
version - show version and exit
OPTIONS
scan - Perform the filesystem scan for duplicates.
-p, --path PATH
Recursively scan the directory tree starting at this path. The
path option can be given multiple times to specify multiple
directory trees to scan. If no path option is given, the
default is to start scanning from the current directory.
-m, --minsize SIZE
Minimum size (in bytes) of files to include in the scan. By default all files of 1 byte or more are scanned. In practice, duplicates in very small files are rarely interesting, so you can speed up the scan by raising this minimum.
--buflimit LIMIT
Limit read buffer size. LIMIT may be an integer in bytes, or
include a suffix of M for megabytes or G for gigabytes. The scan
animation shows the percentage of buffer space in use (%b).
Unless that value goes up to 100% or beyond during a scan, there is no need to adjust this limit. Setting this limit to a low value will constrain dupd memory usage, but possibly at a cost to performance (depending on the data set).
-X, --one-file-system
For each path scanned, do not cross over to a different filesystem. This is helpful, for example, if you want to scan / but
want to avoid any other mounted filesystems such as NFS mounts
or external drives.
--hidden
Include hidden files (and hidden directories) in the scan. By
default these are not included.
--db PATH
Override the default database file location. The default is
$HOME/.dupd_sqlite. If you override the path during scan,
remember to provide this argument and the path for subsequent
operations so the database can be found.
-I, --hardlink-is-unique
Consider hard links to the same file content as unique. By
default hard links are listed as duplicates. See HARD LINKS
section below. Note that if this option is given during scan,
it cannot be given during interactive operations.
--stats-file FILE
On completion, create (or append to) FILE and save some stats from the run. These are the same stats that are displayed in verbose mode, but in a form more suitable for programmatic consumption.
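For example, to scan two directory trees while skipping files smaller than 1 KB (the paths shown are illustrative):
% dupd scan --path /data/photos --path /backup/photos --minsize 1024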
report - Display the list of duplicates.
--cut PATHSEG
Remove prefix PATHSEG from the file paths in the report output.
This can reduce clutter in the output text if all the files
scanned share a long identical prefix.
--minsize SIZE
Report only duplicate sets which consume at least this much disk space. Note this is the total size occupied by all the duplicates in a set, not their individual file size.
--format NAME
Produce the report in this output format. NAME is one of text,
csv, json. The default is text.
Note: The database format generated by scan is not guaranteed to be
compatible with future versions. You should run report (and all the
other commands below which access the database) using the same version
of dupd that was used to generate the database.
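For example, to produce a CSV report of only the larger duplicate sets (the threshold is illustrative, assuming SIZE is given in bytes as with the scan option):
% dupd report --format csv --minsize 1048576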
file - Report duplicate status of one file.
To check whether one given file still has known duplicates, use the file operation. Note that this does not do a new scan, so it will not find new duplicates. This checks whether the duplicates identified during the previous scan still exist and verifies (by hash) whether they are still duplicates.
--file PATH
Required: The file to check
--cut PATHSEG
Remove prefix PATHSEG from the file paths in the report output.
--exclude PATH
Ignore any duplicates under PATH when reporting duplicates.
This is useful if you intend to delete the entire tree under
PATH, to make sure you don't delete all copies of the file.
--hardlink-is-unique
Ignore the existence of hard links to the file for the purpose
of considering whether the file is unique.
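For example, to verify that a file still has duplicates somewhere outside a tree you are about to delete (the paths are illustrative):
% dupd file --file docs/old.doc --exclude $HOME/trash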
ls, uniques, dups - List matching files.
While the file command checks the duplicate status of a single file,
these commands do the same for all the files in a given directory tree.
ls - List all files, show whether they have duplicates or not.
uniques - List all unique files.
dups - List all files which have known duplicates.
--path PATH
Start from this directory (default is the current directory).
--cut PATHSEG
Remove prefix PATHSEG from the file paths in the output.
--exclude PATH
Ignore any duplicates under PATH when reporting duplicates.
--hardlink-is-unique
Ignore the existence of hard links to the file for the purpose
of considering whether the file is unique.
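For example, to list all known duplicates under the docs tree with a common prefix trimmed from the output (the paths are illustrative):
% dupd dups --path $HOME/docs --cut $HOME/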
refresh - Refresh the database.
As you remove duplicate files, they remain listed in the dupd database. Ideally you'd run the scan again to rebuild the database. Note that re-running the scan after deleting some duplicates can be very fast because the files are in the cache, so that is the best option.
However, when dealing with a set of files large enough that they don't
fit in the cache, re-running the scan may take a long time. For those
cases the refresh command offers a much faster alternative.
The refresh command checks whether all the files in the dupd database
still exist and removes those which do not.
Be sure to consider the limitations of this approach. The refresh command does not re-verify whether all files listed as duplicates are still duplicates. It also, of course, does not detect any new duplicates which may have appeared since the last scan.
In summary, if you have only been deleting duplicates since the previous scan, run the refresh command. It will prune all the deleted files from the database and will be much faster than a scan. However, if you have been adding and/or modifying files since the last scan, it is best to run a new scan.
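For example, after manually deleting some of the reported duplicates:
% dupd refresh
% dupd report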
validate - Validate the database.
The validate operation is primarily for testing but is documented here
as it may be useful if you want to reconfirm that all duplicates in the
database are still truly duplicates.
In most cases you will be better off re-running the scan operation
instead of using validate.
Validate is fairly slow as it will fully hash every file in the database.
rmsh - Create shell script to remove duplicate files.
As a policy dupd never modifies the filesystem!
As a convenience for those times when it is desirable to automatically remove files, this operation can create a shell script to do so. The output is a shell script (to stdout) which you can run to delete your files (if you're feeling lucky).
Review the generated script carefully to see if it truly does what you
want!
Automated deletion is generally not very useful because it takes human
intervention to decide which of the duplicates is the best one to keep
in each case. While the content is the same, one of them may have a
better file name and/or location.
Optionally, the shell script can create either soft or hard links from
each removed file to the copy being kept. The options are mutually
exclusive.
--link Create symlinks for deleted files.
--hardlink
Create hard links for deleted files.
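For example, to generate a script which replaces each deleted duplicate with a symlink to the kept copy, then review it carefully before running it (the script name is illustrative):
% dupd rmsh --link > cleanup.sh
% less cleanup.sh
% sh cleanup.sh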
hash - Hash a single file and display the result.
--file PATH
Required: The file to check
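For example, to display the hash of a single file, optionally selecting the hash function with the global -F option described below (the file name is illustrative):
% dupd hash --file docs/old.doc
% dupd hash --file docs/old.doc -F sha512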
Additional global options
-q Quiet, suppress all output.
-v Verbose mode. Can be repeated multiple times for ever-increasing verbosity.
-V, --verbose-level N
Set the logging verbosity level directly to N.
-h Show brief help summary.
--db PATH
Override the default database file location.
-C, --cache PATH
Override the default hash cache database file location.
-F, --hash NAME
Specify a different hash function. This applies to any command which uses content hashing. NAME is one of: md5, sha1, sha512, xxhash.
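For example, when keeping the database in a non-default location, the same --db path must be passed to every subsequent command so the database can be found (the path is illustrative):
% dupd scan --path $HOME --db /tmp/dupd.db
% dupd report --db /tmp/dupd.db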
HARD LINKS
Are hard links duplicates or not? The answer depends on "what do you
mean by duplicates?" and "what are you trying to do?"
If your primary goal for removing duplicates is to save disk space, then it makes sense to ignore hardlinks. If, on the other hand, your primary goal is to reduce filesystem clutter, then it makes more sense to think of hardlinks as duplicates.
By default dupd considers hardlinks as duplicates. You can switch this
around with the --hardlink-is-unique option. This option can be given
either during scan or to the interactive reporting commands (file, ls,
uniques, dups).
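For example, a scan performed without --hardlink-is-unique can still be queried afterwards with hard links treated as unique:
% dupd scan --path $HOME
% dupd dups --hardlink-is-unique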
SIGNALS
Sending SIGUSR1 to dupd will toggle between the default progress counters and highly verbose debug output (equivalent to -V 10).
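For example, to toggle debug output on an already running scan (assuming a single dupd process; pgrep(1) is used here to find its pid):
% kill -USR1 $(pgrep dupd)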
EXAMPLES
Scan all files in your home directory and then show the sets of duplicates found:
% dupd scan --path $HOME
% dupd report
Show duplicate status (duplicate or unique) for all files in the docs subdirectory:
% dupd ls --path docs
I'm about to delete docs/old.doc but want to check one last time that
it is a duplicate and I want to review where those duplicates are:
% dupd file --file docs/old.doc -v
Read the documentation in the dupd 'docs' directory or the online documentation for more usage examples.
EXIT
dupd exits with status code 0 on success, non-zero on error.
SEE ALSO
sqlite3(1)
https://github.com/jvirkki/dupd/blob/master/docs/index.md