Pin pyav to <12, add better tooling, refactor internals, bump hydrus_api (#43)

Possible breaking changes:

- Pin pyav to <12

  - v12 was giving me errors when trying it in the CI and locally. For now v11 will be fine. It can be bumped later if the issues are investigated/resolved. 

Internal Changes:

- Refactor DB into smaller classes and functions

- Add test Hydrus database (unused in CI currently)

- Bump hydrus_api to newest version (cd554b63)

- Move test files to "testdb" submodule to avoid git-lfs bandwidth issue and to reduce this repo's size

- Improve documentation

- Add CI basic unit tests and vpdq benchmark

- Fix lint errors
ianwal committed Mar 17, 2024
1 parent 80d1aba commit e5146a7
Showing 46 changed files with 873 additions and 437 deletions.
1 change: 0 additions & 1 deletion .gitattributes
@@ -1,4 +1,3 @@
tests/videos/** filter=lfs diff=lfs merge=lfs -text
*.mkv filter=lfs diff=lfs merge=lfs -text
*.mp4 filter=lfs diff=lfs merge=lfs -text
*.gif filter=lfs diff=lfs merge=lfs -text
66 changes: 62 additions & 4 deletions .github/workflows/test.yml
@@ -15,16 +15,74 @@ jobs:

steps:
- name: Checkout code
uses: actions/checkout@v3
- name: Set up Python 3.10
uses: actions/setup-python@v4
uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.10'
python-version: '3.11'

- name: Install hatch
run: pip install hatch

- name: Install dependencies
run: pip install -e .

- name: Lint with Ruff
run: hatch run lint:lint

- name: Format check with black
run: hatch run lint:format

unittest:
name: Unit Test
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ['3.10', '3.11', '3.12']

steps:
- name: Checkout code
uses: actions/checkout@v4
with:
lfs: 'false'
submodules: 'recursive'

- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}

- name: Install hatch
run: pip install hatch

- name: Unit test
run: |
hatch env create test
hatch run test:all
benchmark:
name: Benchmark
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ['3.10', '3.11', '3.12']

steps:
- name: Checkout code
uses: actions/checkout@v4
with:
lfs: 'false'
submodules: 'recursive'

- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}

- name: Install hatch
run: pip install hatch

- name: Benchmark
run: |
hatch env create benchmark
hatch run benchmark:vpdq --benchmark-json "${{ matrix.python-version }}.json"
3 changes: 3 additions & 0 deletions .gitignore
@@ -185,3 +185,6 @@ cython_debug/
.ionide

# End of https://www.toptal.com/developers/gitignore/api/visualstudiocode

# Hydrus
!.gitkeep
3 changes: 3 additions & 0 deletions .gitmodules
@@ -0,0 +1,3 @@
[submodule "testdb"]
path = tests/testdb
url = https://github.com/hydrusvideodeduplicator/testdb.git
33 changes: 11 additions & 22 deletions README.md
@@ -3,7 +3,7 @@
# Hydrus Video Deduplicator
<img src="https://github.com/hydrusvideodeduplicator/hydrus-video-deduplicator/assets/104981058/e65383e8-1978-46aa-88b6-6fdda9767367">

Hydrus Video Deduplicator finds potential duplicate videos through the Hydrus API
Hydrus Video Deduplicator finds potential duplicate videos through the Hydrus API and marks them as potential duplicates to allow manual filtering through the Hydrus Client GUI.


[![PyPI - Version](https://img.shields.io/pypi/v/hydrusvideodeduplicator.svg)](https://pypi.org/project/hydrusvideodeduplicator)
@@ -15,16 +15,11 @@ Hydrus Video Deduplicator finds potential duplicate videos through the Hydrus AP

---

## How It Works:
The deduplicator works by comparing videos similarity by their [perceptual hash](https://en.wikipedia.org/wiki/Perceptual_hashing).
Hydrus Video Deduplicator **does not modify your files**. It only marks videos as "potential duplicates" through the Hydrus API so that you can filter them manually in the duplicates processing page.

Potential duplicates can be processed through the Hydrus duplicates processing page just like images.
[See the Hydrus documentation for how duplicates are managed in Hydrus](https://hydrusnetwork.github.io/hydrus/duplicates.html).

You can choose to process only a subset of videos with `--query` using Hydrus tags, e.g. `--query="character:edward"` will only process videos with the tag `character:edward`.

For more information check out the [wiki](https://github.com/hydrusvideodeduplicator/hydrus-video-deduplicator/wiki) and the [FAQ](https://github.com/hydrusvideodeduplicator/hydrus-video-deduplicator/wiki/faq)

---
This program contains no telemetry. It only makes requests to the Hydrus API URL.

## [Installation:](https://github.com/hydrusvideodeduplicator/hydrus-video-deduplicator/wiki/Installation)
#### Dependencies:
@@ -38,31 +33,25 @@ python3 -m pip install hydrusvideodeduplicator

## [Usage:](https://github.com/hydrusvideodeduplicator/hydrus-video-deduplicator/wiki/Usage)

Simplest usage:

```sh
python3 -m hydrusvideodeduplicator --api-key="<your key>"
python3 -m hydrusvideodeduplicator --api-key="put your Hydrus api key in these quotes here"
```

For full list of options see `--help` or the [usage page.](https://github.com/hydrusvideodeduplicator/hydrus-video-deduplicator/wiki/Usage)
You should now see all potential video duplicates in the Hydrus duplicates processing page.

---
For many users, it should be as simple as the Usage command above.

## TODO:
- [ ] Option to rollback and remove potential duplicates
- [ ] OR predicates for --query
- [x] Parallelize hashing and duplicate search
- [ ] Automatically generate access key with Hydrus API
- [x] Docker container
- [ ] Upload Docker container to Docker Hub (GitHub Action)
- [x] Pure Python port of vpdq
- [x] Windows compatibility without WSL or Docker
For more information, see the [Usage](https://github.com/hydrusvideodeduplicator/hydrus-video-deduplicator/wiki/Usage) and [FAQ](https://github.com/hydrusvideodeduplicator/hydrus-video-deduplicator/wiki/faq).

---

## Contact:

Create an issue on GitHub for any problems/concerns. Provide as much detail as possible in your issue.

Message @applenanner on the [Hydrus Discord](https://discord.gg/wPHPCUZ) for other general questions/concerns
Message @applenanner on the [Hydrus Discord](https://discord.gg/wPHPCUZ) for other general questions/concerns.

---

38 changes: 28 additions & 10 deletions pyproject.toml
@@ -11,7 +11,7 @@ requires-python = ">=3.10"
license = "MIT"
keywords = []
authors = [
{ name = "hydrusvideodeduplicator", email = "applenannerapple@gmail.com" },
{ name = "hydrusvideodeduplicator", email = "hydrusvideodeduplicator@gmail.com" },
]
classifiers = [
"Development Status :: 4 - Beta",
@@ -31,8 +31,8 @@ dependencies = [
"psutil",
"joblib",
# Below is for vpdqpy
"Pillow",
"pyav",
"pillow",
"pyav<12",
]

[project.urls]
@@ -50,6 +50,24 @@ exclude = [
[tool.hatch.version]
path = "src/hydrusvideodeduplicator/__about__.py"

[tool.hatch.envs.benchmark]
dependencies = [
"pytest",
"pytest-benchmark",
]

[tool.hatch.envs.benchmark.scripts]
# pytest-benchmark is a plugin for pytest that allows for benchmarking
vpdq = "python -m pytest tests/test_benchmark_vpdqpy.py {args}"

[tool.hatch.envs.test]
dependencies = [
"pytest",
]

[tool.hatch.envs.test.scripts]
all = "python -m pytest tests/test_dedupe.py tests/test_vpdqpy.py {args}"

[tool.hatch.envs.lint]
dependencies = [
"black",
@@ -58,7 +76,7 @@ dependencies = [

[tool.hatch.envs.lint.scripts]
format = "black --check src"
lint = "ruff src"
lint = "ruff check src"

[tool.black]
target-version = ["py310", "py311"]
@@ -69,12 +87,12 @@ skip-string-normalization = true
# Enable the pycodestyle (`E`) and Pyflakes (`F`) rules by default.
# Unlike Flake8, Ruff doesn't enable pycodestyle warnings (`W`) or
# McCabe complexity (`C901`) by default.
select = ["E", "F"]
ignore = []
lint.select = ["E", "F"]
lint.ignore = []

# Allow autofix for all enabled rules (when `--fix`) is provided.
fixable = ["ALL"]
unfixable = []
lint.fixable = ["ALL"]
lint.unfixable = []

# Exclude a variety of commonly ignored directories.
exclude = [
@@ -100,13 +118,13 @@ exclude = [
"node_modules",
"venv",
]
per-file-ignores = {"tests/**/*" = ["PLR2004", "S101", "TID252"]}
lint.per-file-ignores = {"tests/**/*" = ["PLR2004", "S101", "TID252"]}

# Same as Black.
line-length = 120

# Allow unused variables when underscore-prefixed.
dummy-variable-rgx = "^(_+|(_+[a-zA-Z0-9_]*[a-zA-Z0-9]+?))$"
lint.dummy-variable-rgx = "^(_+|(_+[a-zA-Z0-9_]*[a-zA-Z0-9]+?))$"

# Assume Python 3.10.
target-version = "py310"
67 changes: 40 additions & 27 deletions src/hydrusvideodeduplicator/__main__.py
@@ -7,7 +7,15 @@
import hydrusvideodeduplicator.hydrus_api as hydrus_api

from .__about__ import __version__
from .config import HYDRUS_API_KEY, HYDRUS_API_URL, HYDRUS_LOCAL_FILE_SERVICE_KEYS, HYDRUS_QUERY, REQUESTS_CA_BUNDLE
from .client import HVDClient
from .config import (
HYDRUS_API_KEY,
HYDRUS_API_URL,
HYDRUS_LOCAL_FILE_SERVICE_KEYS,
HYDRUS_QUERY,
REQUESTS_CA_BUNDLE,
)
from .db import DedupeDB
from .dedup import HydrusVideoDeduplicator

"""
@@ -47,6 +55,9 @@ def main(
verbose: Annotated[Optional[bool], typer.Option(help="Verbose logging")] = False,
debug: Annotated[Optional[bool], typer.Option(hidden=True)] = False,
):
# Fix mypy errors from optional parameters
assert overwrite is not None and threshold is not None and skip_hashing is not None and job_count is not None

# CLI debug parameter sets log level to info or debug
loglevel = logging.WARNING
if debug:
@@ -63,7 +74,7 @@ def main(

# Clear cache
if clear_search_cache:
HydrusVideoDeduplicator.clear_search_cache()
DedupeDB.clear_search_cache()

# CLI overwrites env vars with no default value
if not api_key:
@@ -78,36 +89,31 @@ def main(
print("Hydrus URL not set. Exiting...")
raise typer.Exit(code=1)

print(f"Connecting to {api_url}")
# Client connection
# TODO: Try to connect with https first and then fallback to http with a strong warning
hydrus_client = hydrus_api.Client(
api_url=api_url,
access_key=api_key,
verify_cert=verify_cert,
)

print(f"Connecting to {api_url}")
error_connecting = True
error_connecting_exception_msg = ""
error_connecting_exception = ""
try:
superdeduper = HydrusVideoDeduplicator(
hydrus_client,
hvdclient = HVDClient(
file_service_keys=file_service_key,
job_count=job_count,
api_url=api_url,
access_key=api_key,
verify_cert=verify_cert,
)
except hydrus_api.InsufficientAccess as exc:
error_connecting_exception_msg = "Invalid Hydrus API key."
error_connecting_exception = exc
error_connecting_exception = str(exc)
except hydrus_api.DatabaseLocked as exc:
error_connecting_exception_msg = "Hydrus database is locked. Try again later."
error_connecting_exception = exc
error_connecting_exception = str(exc)
except hydrus_api.ServerError as exc:
error_connecting_exception_msg = "Unknown Server Error."
error_connecting_exception = exc
error_connecting_exception = str(exc)
except hydrus_api.APIError as exc:
error_connecting_exception_msg = "API Error"
error_connecting_exception = exc
error_connecting_exception = str(exc)
except hydrus_api.ConnectionError as exc:
# Probably SSL error
if "SSL" in str(exc):
@@ -119,31 +125,38 @@ def main(
)
else:
error_connecting_exception_msg = "Failed to connect to Hydrus. Is your Hydrus instance running?"
error_connecting_exception = exc
error_connecting_exception = str(exc)
else:
error_connecting = False

if error_connecting:
logging.fatal("FATAL ERROR HAS OCCURRED")
logging.fatal(error_connecting_exception)
print(f"[red] {error_connecting_exception_msg} ")
logging.fatal(str(error_connecting_exception))
print(f"[red] {str(error_connecting_exception_msg)} ")
raise typer.Exit(code=1)

# Deduplication parameters
if debug:
HVDClient._log.setLevel(logging.DEBUG)

# Deduplication

deduper = HydrusVideoDeduplicator(
client=hvdclient,
job_count=job_count,
)

if debug:
superdeduper.hydlog.setLevel(logging.DEBUG)
superdeduper._DEBUG = True
deduper.hydlog.setLevel(logging.DEBUG)
deduper._DEBUG = True

if threshold < 0:
if threshold < 0.0 or threshold > 100.0:
print("[red] ERROR: Invalid similarity threshold. Must be between 0 and 100.")
raise typer.Exit(code=1)
superdeduper.threshold = threshold
HydrusVideoDeduplicator.threshold = threshold

superdeduper.clear_trashed_files_from_db()
DedupeDB.clear_trashed_files_from_db(hvdclient)

# Run all deduplicate functionality
superdeduper.deduplicate(
deduper.deduplicate(
overwrite=overwrite,
custom_query=query,
skip_hashing=skip_hashing,
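
The hunk above tightens the threshold check from `threshold < 0` to also rejecting values over 100. A minimal sketch of that range check as a standalone helper follows. `validate_threshold` is a hypothetical name used only for illustration; the code in this commit performs the check inline and exits with `typer.Exit(code=1)` rather than raising `ValueError`.

```python
def validate_threshold(threshold: float) -> float:
    """Return the threshold unchanged if it lies in [0, 100], else raise ValueError."""
    if threshold < 0.0 or threshold > 100.0:
        raise ValueError("Invalid similarity threshold. Must be between 0 and 100.")
    return threshold


if __name__ == "__main__":
    # Try values on both sides of each boundary.
    for candidate in (-1.0, 0.0, 75.0, 100.0, 100.5):
        try:
            validate_threshold(candidate)
            print(f"{candidate}: ok")
        except ValueError as exc:
            print(f"{candidate}: rejected ({exc})")
```

Both boundaries are inclusive, so 0 and 100 are accepted, matching the "between 0 and 100" wording of the error message.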