# Duplicate File Finder

This tool scans a directory tree for duplicate files by comparing the MD5 hashes of their contents.
When duplicates are found, it generates a **review report** that you can edit
before confirming any deletions.
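
As a rough illustration of hash-based duplicate detection, the sketch below hashes every file under a directory and groups paths by digest. The helper names `md5_of_file` and `group_by_md5` are hypothetical and are not part of `src.duplicate_finder`; the real implementation may differ.

```python
import hashlib
import os
from collections import defaultdict


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files are never loaded whole.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_md5(root_dir):
    """Map each MD5 digest to the files sharing it, keeping groups of 2+."""
    by_hash = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_hash[md5_of_file(path)].append(path)
    # A digest seen only once has no duplicates, so drop it.
    return {digest: paths for digest, paths in by_hash.items() if len(paths) > 1}
```

Reading in chunks keeps memory use flat regardless of file size, which matters when a scan covers large media files.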

It supports **resume/checkpoint**: if a scan is interrupted, you can pick up
where you left off without re-hashing files that haven't changed.
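
One plausible way to implement resume is a checkpoint file that caches each file's hash together with its size and modification time. The sketch below is an assumption about how this could work, not a description of the tool's internals: the JSON layout, the size-plus-mtime change check, and the function names are all hypothetical.

```python
import hashlib
import json
import os


def load_checkpoint(checkpoint_path):
    """Load previously saved hashes, or start empty if no checkpoint exists."""
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "r", encoding="utf-8") as handle:
            return json.load(handle)
    return {}


def hash_with_checkpoint(path, cache):
    """Return the file's MD5, reusing the cached value if size and mtime match."""
    stat = os.stat(path)
    entry = cache.get(path)
    # Assumption: a file counts as unchanged when size and mtime are identical.
    if entry and entry["size"] == stat.st_size and entry["mtime"] == stat.st_mtime:
        return entry["md5"]
    with open(path, "rb") as handle:
        digest = hashlib.md5(handle.read()).hexdigest()
    cache[path] = {"size": stat.st_size, "mtime": stat.st_mtime, "md5": digest}
    return digest


def save_checkpoint(checkpoint_path, cache):
    """Persist the cache so an interrupted scan can pick up later."""
    with open(checkpoint_path, "w", encoding="utf-8") as handle:
        json.dump(cache, handle)
```

Saving the checkpoint periodically during a scan, rather than only at the end, is what would let an interrupted run keep its progress.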

In [None]:
# --- Configuration ---
# Set these variables before running the scan.

root_dir = '/path/to/scan/'               # Directory to scan (needs trailing slash)
resume = False                             # Set to True to resume a previous scan
report_path = 'duplicate_report.txt'       # Where to write the review report

In [None]:
# --- Scan Filters ---
# Set to a list of extensions to skip (e.g. [".jpg", ".png"]), or None to skip nothing
ignore_extensions = None

# Set to a list of extensions to scan exclusively (e.g. [".pdf"]), or None to scan all
# Note: cannot be used together with ignore_extensions
only_extensions = None

# Maximum file size to scan (files larger than this are skipped), or None for no limit
max_size = None
max_size_unit = "MB"  # "KB", "MB", or "GB"

# Minimum file size to scan (files smaller than this are skipped), or None for no limit
min_size = None
min_size_unit = "KB"  # "KB", "MB", or "GB"

## Scan Filters

- **`ignore_extensions`** — A list of file extensions to skip (e.g. `[".jpg", ".png"]`). Files with these extensions won't be scanned. Set to `None` to skip nothing.
- **`only_extensions`** — A list of file extensions to scan exclusively (e.g. `[".pdf"]`). Only files with these extensions will be scanned. Set to `None` to scan all files.
- **Note:** `ignore_extensions` and `only_extensions` are **mutually exclusive** — setting both will raise an error.
- **`max_size` / `max_size_unit`** — Skip files larger than this size. Set `max_size` to `None` for no upper limit.
- **`min_size` / `min_size_unit`** — Skip files smaller than this size. Set `min_size` to `None` for no lower limit.
- Units can be `"KB"`, `"MB"`, or `"GB"`.
- Extensions are matched **case-insensitively** (`.JPG` matches `.jpg`).
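
The filter rules above can be sketched as a single predicate. This is only an illustration: `should_scan` and `to_bytes` are hypothetical names, and the use of binary units (1 KB = 1024 bytes) is an assumption, since the tool may use decimal units instead.

```python
import os

# Assumption: binary units (1 KB = 1024 bytes); the tool may define them differently.
UNIT_BYTES = {"KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}


def to_bytes(size, unit):
    """Convert a size and unit to bytes, or return None when no limit is set."""
    if size is None:
        return None
    return size * UNIT_BYTES[unit.upper()]


def should_scan(path, size_bytes, ignore_extensions=None, only_extensions=None,
                max_size=None, max_size_unit="MB", min_size=None, min_size_unit="KB"):
    """Return True if a file passes the extension and size filters."""
    if ignore_extensions is not None and only_extensions is not None:
        raise ValueError("ignore_extensions and only_extensions are mutually exclusive")
    # Lowercase both sides so that ".JPG" matches ".jpg".
    ext = os.path.splitext(path)[1].lower()
    if ignore_extensions is not None and ext in {e.lower() for e in ignore_extensions}:
        return False
    if only_extensions is not None and ext not in {e.lower() for e in only_extensions}:
        return False
    max_bytes = to_bytes(max_size, max_size_unit)
    min_bytes = to_bytes(min_size, min_size_unit)
    if max_bytes is not None and size_bytes > max_bytes:
        return False
    if min_bytes is not None and size_bytes < min_bytes:
        return False
    return True
```

Applying the extension checks before the size checks keeps the cheap string comparison first, although the result is the same in either order.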

In [None]:
from src.duplicate_finder import ScanConfig, find_all_duplicate_files

config = ScanConfig(
    root_dir=root_dir,
    resume=resume,
    report_path=report_path,
    ignore_extensions=ignore_extensions,
    only_extensions=only_extensions,
    max_size=max_size,
    max_size_unit=max_size_unit,
    min_size=min_size,
    min_size_unit=min_size_unit,
)
grouped = find_all_duplicate_files(config)

total_dupes = sum(len(files) for files in grouped.values())
print(f'Scan complete. Found {len(grouped)} duplicate group(s) ({total_dupes} files total).')

In [None]:
from src.duplicate_finder import generate_report

generate_report(grouped, config.report_path)
print(f'Report written to: {config.report_path}')
print('Edit the report file to change KEEP/REMOVE labels before proceeding.')

## Report Format

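The report is a plain text file with tab-separated columns: a `KEEP`/`REMOVE` label, a date and time, and the file path. In the example below, `\t` stands for a literal tab character: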

```
# [md5: a1b2c3d4] [size: 200KB] [3 files]
KEEP\t2025-03-15 10:30\tC:/photos/IMG_001.jpg
REMOVE\t2025-03-15 10:30\tC:/backup/IMG_001_copy.jpg
REMOVE\t2025-06-01 14:22\tC:/downloads/IMG_001(1).jpg
```

- The first file in each group defaults to **KEEP**, the rest default to **REMOVE**.
- Change any `REMOVE` to `KEEP` (or vice versa) to control what gets deleted.
- **Rule:** Every group must have at least one `KEEP`. The loader will raise an error otherwise.
- Lines starting with `#` are group headers (for readability only).
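
To make the format rules concrete, here is a minimal parser sketch that enforces the "at least one `KEEP` per group" rule. It is not the actual `load_report` implementation: `parse_report` and `files_to_remove` are hypothetical names, and using header lines to delimit groups is an assumption.

```python
def parse_report(text):
    """Parse report text into a list of groups of (label, timestamp, path) tuples."""
    groups = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("#"):
            groups.append([])  # assumption: each header line opens a new group
            continue
        if not groups:
            raise ValueError("entry found before any group header")
        # Split on the first two tabs only, so paths containing tabs survive.
        label, timestamp, path = line.split("\t", 2)
        if label not in ("KEEP", "REMOVE"):
            raise ValueError(f"unknown label: {label!r}")
        groups[-1].append((label, timestamp, path))
    for index, group in enumerate(groups, start=1):
        if not any(label == "KEEP" for label, _, _ in group):
            raise ValueError(f"group {index} has no KEEP entry")
    return groups


def files_to_remove(groups):
    """Collect the paths labeled REMOVE across all groups."""
    return [path for group in groups for label, _, path in group if label == "REMOVE"]
```

Validating every group before returning anything means a malformed report fails early, before any file is touched.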

In [None]:
from src.duplicate_finder import load_report, get_files_to_remove

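# Parse the edited report; raises an error if any group has no KEEP entry.
report = load_report(config.report_path)
# Collect the files labeled REMOVE.
to_remove = get_files_to_remove(report)
# Note: this count assumes `report` holds one entry per file.
to_keep = len(report) - len(to_remove)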

print(f'{to_keep} file(s) to keep, {len(to_remove)} file(s) to remove.')

## Delete Duplicates

**Warning:** Deleting files is irreversible. Review the summary above and
make sure the report reflects what you actually want to delete.
Uncomment the code in the cell below to confirm deletion.

In [None]:
from src.duplicate_finder import remove_files

# Uncomment the lines below to delete all files marked REMOVE.
# remove_files(to_remove)
# print(f'Removed {len(to_remove)} file(s).')