Library: Add duplicate search with delete function #1457

Closed

SNThrailkill commented Aug 5, 2021

Draft PR of initial implementation. Will definitely need polish and tweaks.

CLAassistant commented Aug 5, 2021

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

SNThrailkill marked this pull request as draft August 5, 2021 04:10

SNThrailkill (Author) commented Aug 5, 2021

Sample API request:
http://localhost:2342/api/v1/duplicates/5352b00121b026185ab1ab400077c655c8dba877

Sample API Response:

[
  {
    "Name": "2021/04/20210402_154516_0895C5F8 - Copy.jpeg",
    "Root": "/",
    "Hash": "5352b00121b026185ab1ab400077c655c8dba877",
    "Size": 1802482,
    "ModTime": 1622606866
  },
  {
    "Name": "2021/04/20210402_154516_0895C5F8.jpeg",
    "Root": "/",
    "Hash": "5352b00121b026185ab1ab400077c655c8dba877",
    "Size": 1802482,
    "ModTime": 1622606866
  }
]
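
For anyone who wants to consume this response from a script, here is a minimal client-side sketch. It only assumes the JSON shape shown in the sample above; the `DuplicateFile` class and the `parse_duplicates` and `group_by_hash` helpers are hypothetical names for illustration and are not part of this PR or of PhotoPrism.

```python
import json
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DuplicateFile:
    """One entry of the duplicates response (attribute names are illustrative)."""
    name: str
    root: str
    hash: str
    size: int
    mod_time: int  # assumed to be a Unix timestamp, judging by the sample value


def parse_duplicates(payload: str) -> list[DuplicateFile]:
    """Decode a JSON array shaped like the sample response."""
    return [
        DuplicateFile(
            name=item["Name"],
            root=item["Root"],
            hash=item["Hash"],
            size=item["Size"],
            mod_time=item["ModTime"],
        )
        for item in json.loads(payload)
    ]


def group_by_hash(files: list[DuplicateFile]) -> dict[str, list[str]]:
    """Map each hash to the file names that share it."""
    groups: dict[str, list[str]] = defaultdict(list)
    for file in files:
        groups[file.hash].append(file.name)
    return dict(groups)


# The sample response from this comment, copied verbatim.
SAMPLE_RESPONSE = """
[
  {"Name": "2021/04/20210402_154516_0895C5F8 - Copy.jpeg", "Root": "/",
   "Hash": "5352b00121b026185ab1ab400077c655c8dba877",
   "Size": 1802482, "ModTime": 1622606866},
  {"Name": "2021/04/20210402_154516_0895C5F8.jpeg", "Root": "/",
   "Hash": "5352b00121b026185ab1ab400077c655c8dba877",
   "Size": 1802482, "ModTime": 1622606866}
]
"""

if __name__ == "__main__":
    for file_hash, names in group_by_hash(parse_duplicates(SAMPLE_RESPONSE)).items():
        print(file_hash, names)
```

Both sample entries share one hash, so they end up in a single group with two file names.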

SNThrailkill (Author) commented

Sample of current UI implementation
[Image: sample]

ark- commented Aug 5, 2021

Great work!

Out of interest, what method is used to compare the photos and determine a match? I note you match a hash but I'm not familiar with how Photoprism determines this hash.

lastzero (Member) commented Aug 5, 2021

> Out of interest, what method is used to compare the photos and determine a match? I note you match a hash but I'm not familiar with how Photoprism determines this hash.

PhotoPrism uses SHA-1 checksums to detect exact duplicates. While SHA-1 is older and shorter than, e.g., SHA-2, there is no need to worry about security issues, as it's not used as a cryptographic hash here. In addition, we index image properties like colors and contrast that could be used to find similar images.
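
To illustrate the general idea of exact-duplicate detection via SHA-1 checksums, here is a minimal standalone sketch. It is not PhotoPrism's indexer code; `sha1_of_file` and `find_exact_duplicates` are hypothetical helpers that simply hash every file under a folder and group identical digests.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha1_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root: Path) -> dict[str, list[Path]]:
    """Group files under root by SHA-1 and keep only groups with 2+ files."""
    groups: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            groups[sha1_of_file(path)].append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}


if __name__ == "__main__":
    # "." is just a placeholder; point this at any folder of files.
    for digest, paths in find_exact_duplicates(Path(".")).items():
        print(digest, [str(p) for p in paths])
```

Because the digest covers the raw bytes, any re-encoding, resizing, or metadata change produces a different checksum, which is why this approach only finds byte-identical copies.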

ark- commented Aug 5, 2021

Thanks Michael. I wonder if (further down the line) there could be an image "similarity" algorithm. I know your thoughts on this are that metadata is better at determining a match than any RGB comparison but perhaps it could be a suitable fallback when no metadata is available (e.g. no EXIF data or useful filename).

The https://github.com/qarmin/czkawka project has a fairly good image duplicate feature. However, it is written in Rust, and so is the library it uses, https://github.com/abonander/img_hash.

lastzero (Member) commented Aug 5, 2021

You can already group / sort by similarity:

https://demo.photoprism.org/browse?view=cards&order=similar&public=true&quality=3

This was specifically built for linear sorting - not to monitor unlicensed image usage, or find derived works with a near 100% confidence. There are other algos for this. We'll look into it when our roadmap has space for this.

ark- commented Aug 5, 2021

Very interesting! Didn't notice that feature so thanks for pointing it out.

G2G2G2G commented Aug 7, 2021

There's ahash, phash, dhash, etc. for "similar" images; regular cryptographic hashes will only give you exact matches. If we want to add a similar-image search (maybe you check a box for exact matches vs. similar), we can go deeper in the future, beyond just color similarity or whatever is used now.

In my mind, when looking for the "same image", I'd want to see all versions of that image, whether it's cropped, scaled to a different size, recompressed as a 50% quality JPEG, or whatever. All of those changes alter the image for the computer, but not really for the human eye.

Some links:
https://stackoverflow.com/questions/75891/algorithm-for-finding-similar-images

https://github.com/JohannesBuchner/imagehash/blob/master/find_similar_images.py

http://www.phash.org/ (strict but not as strict as cryptographic hash)

Unfortunately, not many of these are written directly in Go, and it is annoying to have to compile stuff per system, etc.
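
To make the dhash idea concrete, here is a rough sketch of a difference hash plus a Hamming distance for comparing two hashes. It is a simplification under stated assumptions: the image is assumed to be already converted to grayscale and downscaled (a typical dhash uses a 9x8 grid for 64 bits), which in practice would be done by an imaging library. The `dhash` and `hamming_distance` functions are illustrative names, not code from PhotoPrism or from the linked projects.

```python
def dhash(pixels: list[list[int]]) -> int:
    """Difference hash of a small grayscale grid.

    Each row contributes one bit per pair of horizontally adjacent pixels:
    1 if the left pixel is brighter than the right one, otherwise 0.
    """
    bits = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits; small values suggest similar images."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    # A tiny 2x4 grid standing in for an already downscaled image.
    original = [[10, 20, 15, 30], [40, 30, 20, 10]]
    # The same grid, uniformly brightened: the gradients are unchanged.
    brighter = [[value + 50 for value in row] for row in original]
    # The grid mirrored horizontally: the gradients flip.
    mirrored = [row[::-1] for row in original]

    print(hamming_distance(dhash(original), dhash(brighter)))  # 0
    print(hamming_distance(dhash(original), dhash(mirrored)))  # 6
```

Unlike a cryptographic hash, the result depends only on relative brightness between neighbors, so uniform brightness changes (and, after downscaling, mild recompression or resizing) tend to leave most bits intact. A threshold on the Hamming distance would then decide what counts as "similar"; the right threshold is something to tune and is not specified here.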

graciousgrey added the work-in-progress label (Please don't merge just yet) Aug 10, 2021

graciousgrey (Member) commented

@G2G2G2G Implementing algorithms other than SHA-1/perceptual hashing to find exact duplicates or similar files won't be part of this issue. This is about being able to delete already detected duplicates via the UI. I guess the title "advanced duplicate handling" is a bit misleading.

Ideally, compressed, cropped, or edited images are already stacked, e.g. if they were taken at the same place and time, have related filenames, or share the same document ID. Perceptual hashing is implemented as well, and we have an open ticket for providing a UI for it: #28

We know this won't catch all similar images, especially when files have no metadata. Implementing other algorithms is something we can do later; feel free to open a separate issue for this :)

stbenjam commented Oct 29, 2021

Are there plans to match and allow deletion of perceptual hash matches as well? I know they stack, but I ended up importing about half my library twice, and I really want the dupes gone. The first import was with a piece of software that compressed, resized, converted, etc., so the SHAs don't match.

graciousgrey (Member) commented

@stbenjam If your similar images are already stacked, you'll find a delete button on the edit dialog's file tab; see https://docs.photoprism.org/user-guide/organize/stacks/#remove-not-primary-files-permanently. We also have a ticket to show the user very similar images (based on perceptual hashes) that have not been stacked, so that they can be manually merged. In that case, a user could decide to delete one instead of merging (#28).

lastzero (Member) commented

@SNThrailkill My apologies for the long delay! Do you still want us to work with you on this PR? What's still missing (besides fixing conflicts) before it's ready for release?

lastzero added the waiting (Impediment / blocked / waiting) and feature (Feature Pull Request (PR)) labels Jun 27, 2023

lastzero changed the title from "Feature to display and delete duplicates" to "Library: Add duplicate search with delete function" Jun 27, 2023

lastzero (Member) commented

I'm sorry to have to close this PR for now, as we cannot easily merge it and have not received any feedback. If you have time to get the changes to a point where we can merge them without further refactoring and testing, feel free to reopen the PR or create a new one. Thank you very much for your contribution!

lastzero closed this Jul 23, 2023