Quickly remove duplicates, without changing the order, and without getting OOM on huge wordlists.

Duplicut

The duplicut tool finds and removes duplicate entries from a wordlist without changing the order of the remaining entries, and without running out of memory (OOM) on huge wordlists whose size exceeds available memory.


Quick start:

make release
./duplicut <WORDLIST_WITH_DUPLICATES> -o <NEW_CLEAN_WORDLIST>

Overview

Building statistically optimized wordlists for password cracking often requires finding and removing duplicate entries without changing the order.

Unfortunately, existing duplicate removal tools cannot handle very large wordlists without crashing due to insufficient memory.

Duplicut is written in C, and optimized to be as fast and memory-frugal as possible.

For example, duplicut's hash map saves up to 50% of its space by packing each line's size information into the extra bits of the line pointer.
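The packing idea can be sketched as follows. This is only an illustration, not duplicut's actual code: the type `packed_line_t`, the helper names, and the 8/56 bit split are assumptions, chosen because lines are limited to 255 chars (which fits in 8 bits) and because user-space addresses on common 64-bit platforms leave the top bits of a pointer unused. Storing one 8-byte word instead of a separate pointer and size field is what would halve the size of each entry.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout (an assumption, not taken from duplicut's sources):
 * top 8 bits hold the line length, low 56 bits hold the line address. */
typedef uint64_t packed_line_t;

#define ADDR_BITS 56
#define ADDR_MASK ((UINT64_C(1) << ADDR_BITS) - 1)

/* Pack a line pointer and its length (0..255) into a single 64-bit word. */
static packed_line_t pack_line(const char *ptr, uint8_t len)
{
    uint64_t addr = (uint64_t)(uintptr_t)ptr;

    /* Assumption: the platform never hands out addresses that use the
     * top 8 bits. If it did, packing would corrupt the pointer. */
    assert((addr & ~ADDR_MASK) == 0);
    return ((uint64_t)len << ADDR_BITS) | addr;
}

/* Recover the original pointer by masking off the length bits. */
static const char *line_ptr(packed_line_t line)
{
    return (const char *)(uintptr_t)(line & ADDR_MASK);
}

/* Recover the length stored in the top 8 bits. */
static uint8_t line_len(packed_line_t line)
{
    return (uint8_t)(line >> ADDR_BITS);
}
```

With this sketch, `pack_line(s, 8)` on an 8-char line returns a word from which `line_ptr()` gives back `s` and `line_len()` gives back 8.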

If the whole file doesn't fit in memory, it is split into chunks, and each chunk is tested against the chunks that follow it.

So the number of chunk passes is a triangular number of the chunk count.
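A minimal sketch of chunked processing, under stated assumptions, is shown below. It is not duplicut's implementation: the function name `dedupe_chunked` is hypothetical, lines are held in an in-memory array rather than read from a file, duplicates are marked by setting their slot to `NULL`, and plain `strcmp` loops stand in for the hash map. It also assumes that deduplicating a chunk against itself counts as one pass, so that k chunks cost k(k+1)/2 passes in total.

```c
#include <stddef.h>
#include <string.h>

/* Mark duplicate entries of lines[0..n) as NULL, keeping first occurrences
 * and the original order. Only `chunk` lines (chunk > 0) act as the
 * reference set at any time. Returns the number of chunk passes done. */
static size_t dedupe_chunked(const char **lines, size_t n, size_t chunk)
{
    size_t passes = 0;

    for (size_t base = 0; base < n; base += chunk) {
        size_t end = (base + chunk < n) ? base + chunk : n;

        /* Pass over the reference chunk itself. */
        for (size_t i = base; i < end; i++) {
            if (lines[i] == NULL)
                continue;
            for (size_t j = i + 1; j < end; j++)
                if (lines[j] != NULL && strcmp(lines[i], lines[j]) == 0)
                    lines[j] = NULL;
        }
        passes++;

        /* One pass per following chunk: drop lines already seen
         * in the reference chunk. */
        for (size_t next = end; next < n; next += chunk) {
            size_t next_end = (next + chunk < n) ? next + chunk : n;

            for (size_t j = next; j < next_end; j++) {
                if (lines[j] == NULL)
                    continue;
                for (size_t i = base; i < end; i++) {
                    if (lines[i] != NULL && strcmp(lines[i], lines[j]) == 0) {
                        lines[j] = NULL;
                        break;
                    }
                }
            }
            passes++;
        }
    }
    return passes;
}
```

For 7 lines processed 3 at a time there are 3 chunks, so the function performs 3 + 2 + 1 = 6 passes.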


Usage: duplicut [OPTION]... [INFILE] -o [OUTFILE]
Remove duplicate lines from INFILE without sorting.

Options:
-o, --outfile <FILE>       Write result to <FILE>
-t, --threads <NUM>        Max threads to use (default max)
-m, --memlimit <VALUE>     Limit max used memory (default max)
-l, --line-max-size <NUM>  Max line size (default 14)
-p, --printable            Filter ascii printable lines
-h, --help                 Display this help and exit
-v, --version              Output version information and exit

Example: duplicut wordlist.txt -o new-wordlist.txt
  • Features:

    • Handle huge wordlists, even those whose size exceeds available RAM.
    • Filtering by maximum line length (-l option).
    • Filtering of ASCII printable lines (-p option).
    • Press any key to get program status.
  • Implementation:

    • Written in pure C code, designed to be fast.
    • Compressed hash map items on 64-bit platforms.
    • [TODO]: Multi-threaded application.
    • [TODO]: Uses huge memory pages to increase performance.
  • Limitations:

    • Any line longer than 255 chars is ignored.
    • Heavily tested on Linux x64, mostly untested on other platforms.
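
The line filters behind the -l and -p options listed above can be illustrated with a small predicate. This is a sketch only: the function name `keep_line` and its parameters are hypothetical, and it assumes that "printable" means every byte lies in the ASCII range 0x20 to 0x7e, which may differ from what duplicut actually checks.

```c
#include <stdbool.h>
#include <stddef.h>

/* Return true if a line of `len` bytes passes both filters.
 * max_len:        longest line that is kept (cf. the -l option).
 * printable_only: when true, reject lines that contain any byte outside
 *                 0x20..0x7e (an assumed reading of the -p option). */
static bool keep_line(const char *line, size_t len,
                      size_t max_len, bool printable_only)
{
    if (len > max_len)
        return false;

    if (printable_only) {
        for (size_t i = 0; i < len; i++) {
            unsigned char c = (unsigned char)line[i];

            if (c < 0x20 || c > 0x7e)
                return false;
        }
    }
    return true;
}
```

Under these assumptions, a 5-char ASCII word passes with a limit of 14, while a 21-char line or a line containing UTF-8 multibyte characters (with the printable filter enabled) is rejected.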