Enable (tune?) parallelism #9

Open
trimitri opened this issue Jul 3, 2022 · 1 comment
Labels
enhancement New feature or request

Comments


trimitri commented Jul 3, 2022

Great tool!

When running the script, a worker seems to be spawned for each CPU core, but all the work then happens on just one of them.

It seems that creating the fingerprints takes most of the time, at least for small collections (20k images).

Fingerprint creation looks like it could be parallelized very well. Or would merging the results from the individual threads/processes be a hassle?

Even on a system more than six years old, the CPU and SSD load was only around 20%, so on current systems a speedup of up to 10x could probably be achieved.

I'm now thinking about hacking this together by launching parallel runs with separate fingerprint databases, and then merging them. I'm afraid stuff is going to break, given my skills...

Do you have plans to implement parallelism?

jhnc added the enhancement label Jul 17, 2022
Owner

jhnc commented Jul 17, 2022

Thank you. I agree it would be good to be able to run fingerprinting in parallel. Unfortunately, I think the code would need to be reworked substantially. For now, you could certainly do fingerprint runs to separate databases and then merge. Off the top of my head, something like this should work:

#!/bin/bash

# number of workers
par=4

workdir=$(mktemp -d)

# generate file lists (assumes no newlines in filenames; needs GNU split)
# use your own appropriate find equivalent
find /img/top/dir/ -type f |
split -a3 --numeric-suffixes=1 -n "r/$par" - "$workdir/flist."

# run fingerprinting processes (needs GNU xargs)
# one findimagedupes per file list, each writing to its own database
printf '%03d\n' $(seq "$par") |
xargs -P"$par" -I@ bash -c "findimagedupes -n -f '$workdir/db.@' -- - < '$workdir/flist.@'"

# merge the per-worker databases into fpdb-all
args=()
for db in "$workdir"/db.*; do args+=(-f "$db"); done
findimagedupes -n "${args[@]}" -M fpdb-all

# clean up
rm -r -- "$workdir"
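
To see how the file-list step in the script above distributes work, here is a minimal sketch that can be run without findimagedupes. It assumes GNU split, as the script does. The helper name `split_round_robin` and the `img*.jpg` names are made up for illustration and are not part of findimagedupes.

```shell
#!/bin/bash
# split_round_robin N DIR: read a newline-separated list on stdin and deal it
# into N chunk files DIR/flist.001, DIR/flist.002, ... using GNU split's
# round-robin mode. (Hypothetical helper, for illustration only.)
split_round_robin() {
  local n=$1 dir=$2
  split -a3 --numeric-suffixes=1 -n "r/$n" - "$dir/flist."
}

demo=$(mktemp -d)

# Seven made-up file names, one per line, dealt into three chunks.
printf 'img%d.jpg\n' 1 2 3 4 5 6 7 | split_round_robin 3 "$demo"

# Print each chunk on one line to see which names landed where.
for f in "$demo"/flist.*; do
  printf '%s: %s\n' "${f##*/}" "$(paste -sd' ' "$f")"
done

rm -r -- "$demo"
```

Round-robin mode deals lines out in turn, so here flist.001 receives names 1, 4 and 7, and the chunk sizes differ by at most one. That keeps the workers roughly balanced by file count. It does not balance by file size, so runtimes can still differ between workers.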
