Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use a more efficient datastructure for searching for duplicates #28

Open
ianwal opened this issue Jul 10, 2023 · 0 comments
Open

Use a more efficient datastructure for searching for duplicates #28

ianwal opened this issue Jul 10, 2023 · 0 comments
Labels
enhancement New feature or request

Comments

@ianwal
Copy link
Collaborator

ianwal commented Jul 10, 2023

Current searching requires comparing every video at least once, it takes n*(n-1)/2 comparisons.

There exists more appropriate data structures for searching for similar objects.


Vantage-point tree

A vantage-point tree would require rebuilding the tree when videos are added, O(n * log(n)), but searching would be greatly improved.

The distance function for the VP tree would be the similarity of a video. Currently, that is a number [0,100].

It could also be possible to store the individual PDQHashes in a VP tree, but the tree would become extremely large with large videos, and it would require storing a table for pdqhash:video. It would likely reduce search time though because it would avoid calculating the similarity of videos while traversing the tree.

@ianwal ianwal added the enhancement New feature or request label Jul 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant