Library for the heavy lifting part of finding similar images based on pixels.
Given a specific image in a collection, find X similar images in the same collection, where similar uses visual similarity such as color, shape or elements ("bike", "snow", "balloon") on the image.
When using a neural network for image classification and extracting feature vectors for determining image similarity, there is a need for finding the Nearest Neighbour in a 2048 dimensional vector space.
The current purpose of the fairly-similar
-repository is to explore ways of locating
such neighbours with a mix of pre-processing, accuracy and computing power that works
the the Royal Danish Library and its digital image collection.
Currently there are about 300K images in the collection, but the solution should scale into the lower millions. Let's say 5 million. The collection is semi-dynamic, meaning that images are added in batches. If a solution uses pre-processing, this pre-processing should be fast enough to be done on a nightly basis without excessive hardware requirements.
As we are currently experimenting, we are measuring CPU-core-requirements: If a solution uses 4 threads, its speed (number of images checked/second) will be divided by 4 when comparing to other solutions. Do not that it is a positive feature it it can take advantage of multiple cores when running in production.
A block with 270707 image vectors (2048 dimensions, represented with 4 byte floats) can be downloaded as binary (2.1 GB) or text (2.0 GB).
The numbers are all positive and there is a tendency for vectors to have a fairly small (10-100) number of dimensions that are significant, i.e. have a relatively large number.
- Download the binary
sample data and put then in
src/test/resources/
- Make a package (
mvn package
) - Run
java -cp target/similar-lib-0.1-SNAPSHOT.jar:src/test/resources/ dk.kb.similar.NearestNeighbour
- Create a new class in
dk.kb.similar.finder
that implementsNearestFinder
- Add an instance of your class to
NearestNeighbour.createFinders
- Run the
NearestNeighbour
as described above and compare the results
On an older 4 core i5 desktop:
java -cp target/similar-lib-0.1-SNAPSHOT.jar:src/test/resources/ dk.kb.similar.NearestNeighbour 50000 20
...
EarlyNearestFinder: 1571ms (636,443 points/sec, totalDist=3444.3)
StrongestSignalsFinder: 194ms (5,128,772 points/sec, totalDist=3459.1)
DiceNearestFinder: 170ms (5,877,439 points/sec, totalDist=4579.0)
RandomFinder: 3ms (306,026,768 points/sec, totalDist=7974.7)
EarlyNearestFinder
guarantees "correct" results, meaning the points that are nearest to each other, measured by Euclidian distance.