⚠️ This project is an experiment in AI-assisted software development.The idea for the tool was suggested by
-4oas:Data Deduplication with LSH:
- Implement a near-duplicate detection tool using locality-sensitive hashing (LSH) for a collection of text documents.
- Add visualization for clusters of similar documents.
I picked out this idea from roughly a dozen others, and had
-4oflesh it out into a detailed spec.Implementation was done entirely in Cursor, using the
claude-3.5-sonnetmodel, with a dash ofo1-mini.I personally wrote low double-digit lines of code, documentation included! (This note included!)
Treat the logo below as a visual hint of what to expect. (And please disregard the "TM"!)
git clone git@github.com:othercriteria/ndetect.git
cd ndetect
direnv allow
ndetect .git clone git@github.com:othercriteria/ndetect.git
cd ndetect
python -m venv .venv
source .venv/bin/activate
pip install -e .
ndetect . # If installation was successful
# OR
python -m ndetect . # Always works as long as package is installedThere are enough text files in the repo to get started using the interactive mode.
- The tool identifies near-duplicate text documents based on their content similarity using MinHash.
- Groups of duplicates are formed by detecting transitive relationships (e.g., if A is similar to B and B is similar to C, all three form a group).
The tool operates in two primary modes:
- Interactive Mode (default):
- Groups of duplicates are displayed in a clear tabular format, showing:
- File paths
- File sizes (in bytes)
- Modification timestamps
- Approximate similarity percentage
- Actions include:
- Keeping all files in a group
- Deleting specific duplicates
- Moving duplicates to a designated directory
- Viewing detailed information
- Skipping groups
- Quitting the program
- Groups of duplicates are displayed in a clear tabular format, showing:
- Non-Interactive Mode:
- Automatically processes files according to specified criteria without user interaction
- Supports configurable options for:
- Similarity threshold
- Criteria to determine which file to keep (e.g., age, size)
- Output directory for duplicates
- Logging of actions
- The tool automatically excludes files that are not text-like
- Criteria for exclusion:
- Low ratio of printable characters in the file content
- Decoding errors when reading as UTF-8
- File extension filters (defaults to .txt, .md, .log, .csv)
-
Initialization:
- Shows a progress spinner while scanning files
- Reports the total number of valid text files found
-
Group Presentation:
-
Each group is displayed in a bordered panel showing:
~94.53% similar ┏━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓ ┃ # ┃ File ┃ Size ┃ Modified ┃ ┡━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩ │ 1 │ path/to/file1.txt │ 47,755 bytes │ 2025-01-13 13:57 │ │ 2 │ path/to/file2.txt │ 46,845 bytes │ 2025-01-13 13:57 │ └───┴───────────────────────────────────────┴──────────────┴──────────────────┘
-
For groups with 2 files: Shows "~XX.XX% similar"
-
For groups with 3+ files: Shows "~XX.XX% avg. similarity"
-
-
Default Keeper Selection:
- A default keeper file is automatically selected based on retention criteria
- Users can choose to accept or change the default keeper when performing actions
- When prompted "Do you want to select a different keeper? [y/n]:", users can:
- Accept the default (n)
- Choose a different file by number (y)
-
Available Actions:
-
[n] Next group: Skip to the next group without making changes
-
[d] Delete duplicates: Delete selected files, keeping the chosen keeper
-
[m] Move duplicates: Move selected files to a holding directory
-
[p] Preview: Display file contents with configurable limits:
╭─── path/to/file1.txt ────╮ │ First line of content │ │ Second line of content │ │ ... │ ╰── Size: 47,755 bytes ────╯
-
[s] Show similarities: Display pairwise similarity scores:
╭─── Pairwise Similarities ────╮ │ File 1 File 2 Similar │ │ file1.txt file2.txt 97.6% │ │ file1.txt file3.txt 89.0% │ ╰────────────────────────────╯
-
[q] Quit: Exit the program
-
[h] Help: Show available commands
-
-
Dry Run Mode:
- When using
--dry-run, actions are simulated:- Shows "Dry run: Would move these files:" or similar messages
- No actual file operations are performed
- Useful for previewing what actions would be taken
- When using
-
Pending Operations:
- Move operations are queued until explicitly confirmed
- Before exiting or after processing groups, prompts: "Execute pending moves? [y/n]:"
- Reports results: "Successfully moved X files"
-
Initialization:
- The tool scans the provided file paths, identifies text-like files, and computes MinHash signatures.
- Files are grouped based on content similarity.
-
Group Presentation:
-
Duplicate groups are presented to the user, e.g.:
Select files (space-separated numbers, 'all' or 'none'): 1. ../numpy-2.2.1.dist-info/LICENSE.txt 2. ../scipy-1.15.1.dist-info/LICENSE.txt
-
The user is prompted to act on each group.
-
-
Available Actions:
- [k] Keep all: No changes are made to this group.
- [d] Delete duplicates: The user selects which files to delete from the group.
- [m] Move duplicates to holding: The user selects files to move to a holding directory.
- [i] Show details: Show similarity scores for the group or detailed file metadata.
-
Dynamic Updates:
- After each action, groups are recalculated to account for changes (e.g., removed or moved files).
- Similarities are not re-computed; groups are updated based on the existing similarity graph.
- Batch Processing:
- Groups are formed automatically based on the similarity threshold.
- Actions are applied based on user-defined criteria:
- Similarity threshold: (e.g., 0.85).
- Retention criteria: Keep files based on:
- Smallest size.
- Oldest or newest modification time.
- Priority paths (e.g., files in
/important).
- Example behavior: For a group of duplicates:
- Retain the oldest file and move the rest to the purgatory directory.
- Logging:
- Outputs a log file summarizing actions taken (e.g., files moved, deleted, or skipped).
For technical background, see the MinHash paper and the Python implemetation in the datasketch library.
- Efficiently calculates content similarity for large collections of text documents.
- Groups duplicates using a similarity graph:
- Nodes represent files.
- Edges represent pairs of files with similarity above the threshold.
- Implementation details:
- Document fingerprinting using k-shingles:
- Text is normalized (lowercase, whitespace normalized)
- Content is split into overlapping k-shingles (default k=5)
- Each shingle is hashed and added to MinHash signature
- Similarity calculation:
- Jaccard similarity between MinHash signatures
- Configurable number of permutations (default: 128)
- Configurable shingle size for different use cases
- Document fingerprinting using k-shingles:
- Files that fail basic text-likeness checks are excluded from processing.
- Includes:
- Decoding as UTF-8.
- Checking for printable character ratio.
- Optional file extension filters.
- Groups are updated dynamically as actions are taken (e.g., deleting or moving files).
- Ensures that changes are immediately reflected in the presented groups.
- Users can specify a similarity threshold (e.g., 0.85) to adjust the sensitivity of duplicate detection.
- In non-interactive mode, users can specify criteria for which file to retain within a group:
- Age: Oldest or newest.
- Size: Smallest or largest.
- Priority paths: Retain files in certain directories.
- Logs all actions taken, including files excluded from processing and duplicates detected.
- Log format: human-readable text or structured formats like JSON.
# Interactive mode (default)
ndetect /path/to/files
# Non-interactive mode
ndetect --mode non-interactive /path/to/files--mode [interactive|non-interactive]: Operation mode (default: interactive)--threshold [float]: Minimum similarity threshold (default: 0.85)--holding-dir [path]: Directory for moved duplicates (default: ./holding)--dry-run: Preview actions without making changes--verbose: Enable verbose logging--log-file [path]: Path to log file (if not specified, only logs to console)
# Find duplicates with higher similarity threshold
ndetect --threshold 0.95 /path/to/files
# Move duplicates to custom directory with logging
ndetect --holding-dir /tmp/duplicates --log-file ndetect.log /path/to/files
# Non-interactive mode with custom retention strategy
ndetect --mode non-interactive \
--retention newest \
--holding-dir /archive \
/path/to/files
# Preview operations without making changes
ndetect --dry-run /path/to/files--min-printable-ratio [float]: Minimum ratio of printable characters (default: 0.8)--num-perm [int]: Number of MinHash permutations (default: 128)--shingle-size [int]: Size of text shingles for comparison (default: 5)--include-empty: Include empty (zero-byte) files in analysis (default: skip empty files)--max-workers [int]: Maximum number of worker processes for parallel scanning (default: CPU count)
--follow-symlinks: Follow symbolic links when scanning (default)--no-follow-symlinks: Do not follow symbolic links when scanning--max-symlink-depth [int]: Maximum depth when following symbolic links (default: 10)--base-dir [path]: Restrict symlink resolution to specified directory
The tool implements robust and secure symlink handling with the following features:
-
Base Directory Containment:
- Optional base directory restriction via
--base-dir - Prevents symlinks from accessing files outside the allowed directory tree
- Handles both absolute and relative symlinks securely
- Validates paths using strict resolution checks
- Optional base directory restriction via
-
Cycle Detection:
- Prevents infinite loops from circular symlinks
- Tracks visited paths during resolution
- Handles both direct and indirect cycles gracefully
- Clears visited path cache between resolutions
-
Depth Control:
- Configurable maximum symlink depth (default: 10)
- Prevents resource exhaustion from deep symlink chains
- Customizable via
--max-symlink-depthoption - Returns None for chains exceeding maximum depth
- Atomic symlink resolution to prevent TOCTOU vulnerabilities
- Strict path validation for base directory containment
- Graceful handling of broken symlinks and permission errors
- Comprehensive error reporting and logging
- Zero-byte file detection and handling
- Safe handling of relative path traversal attempts
The symlink handler:
- Validates existence before resolution
- Converts relative symlinks to absolute paths
- Performs base directory containment checks
- Maintains state for cycle detection
- Provides detailed error information
- Supports both interactive and non-interactive modes
For untrusted environments, it's recommended to:
- Set a lower
--max-symlink-depthvalue - Enable base directory containment with
--base-dir - Use
--no-follow-symlinksif symlink following is not required
- Files are validated for UTF-8 encoding and minimum printable character ratio
- Text content is normalized (lowercase, whitespace normalized)
- Content is split into overlapping k-shingles (default k=5)
- MinHash signatures are generated using configurable permutations (default: 128)
- Files are compared using MinHash-based Jaccard similarity
- Groups are formed using connected components with transitive relationships
- Groups are presented in order of highest similarity first
- Performance optimizations:
- Cached MinHash signatures
- Batched processing for memory efficiency
- Dynamic group updates after operations
The following parameters can be customized:
--num-perm: Number of MinHash permutations (default: 128)--shingle-size: Size of text shingles (default: 5)--threshold: Minimum similarity threshold (default: 0.8)--min-printable: Minimum ratio of printable characters (default: 0.8)
- Python 3.10 or higher (tested up to 3.13)
We take security seriously. Please see our Security Policy for:
- How to report vulnerabilities
- Our response process
- Supported versions
For information about setting up a development environment and contributing to ndetect, please see our Contributing Guide.
The project uses several code quality tools:
- Ruff for code formatting, linting, and import sorting
- (d)mypy for static type checking
- markdownlint for markdown linting
- pre-commit for automated checks before commits
All of these tools are automatically configured in the development environment when
using nix develop.
