fairly-similar

Library for the heavy lifting part of finding similar images based on pixels.

Overall problem

Given a specific image in a collection, find X similar images in the same collection, where similar uses visual similarity such as color, shape or elements ("bike", "snow", "balloon") on the image.

Nearest Neighbor

When using a neural network for image classification and extracting feature vectors for determining image similarity, there is a need for finding the Nearest Neighbour in a 2048 dimensional vector space.

The current purpose of the fairly-similar-repository is to explore ways of locating such neighbours with a mix of pre-processing, accuracy and computing power that works the the Royal Danish Library and its digital image collection.

Currently there are about 300K images in the collection, but the solution should scale into the lower millions. Let's say 5 million. The collection is semi-dynamic, meaning that images are added in batches. If a solution uses pre-processing, this pre-processing should be fast enough to be done on a nightly basis without excessive hardware requirements.

As we are currently experimenting, we are measuring CPU-core-requirements: If a solution uses 4 threads, its speed (number of images checked/second) will be divided by 4 when comparing to other solutions. Do not that it is a positive feature it it can take advantage of multiple cores when running in production.

Test data

A block with 270707 image vectors (2048 dimensions, represented with 4 byte floats) can be downloaded as binary (2.1 GB) or text (2.0 GB).

The numbers are all positive and there is a tendency for vectors to have a fairly small (10-100) number of dimensions that are significant, i.e. have a relatively large number.

How to run

Download the binary sample data and put then in src/test/resources/
Make a package (mvn package)
Run java -cp target/similar-lib-0.1-SNAPSHOT.jar:src/test/resources/ dk.kb.similar.NearestNeighbour

How to fight

Create a new class in dk.kb.similar.finder that implements NearestFinder
Add an instance of your class to NearestNeighbour.createFinders
Run the NearestNeighbour as described above and compare the results

Current status

On an older 4 core i5 desktop: java -cp target/similar-lib-0.1-SNAPSHOT.jar:src/test/resources/ dk.kb.similar.NearestNeighbour 50000 20

...
EarlyNearestFinder: 1571ms (636,443 points/sec, totalDist=3444.3)
StrongestSignalsFinder: 194ms (5,128,772 points/sec, totalDist=3459.1)
DiceNearestFinder: 170ms (5,877,439 points/sec, totalDist=4579.0)
RandomFinder: 3ms (306,026,768 points/sec, totalDist=7974.7)

EarlyNearestFinder guarantees "correct" results, meaning the points that are nearest to each other, measured by Euclidian distance.

Name		Name	Last commit message	Last commit date
Latest commit History 53 Commits
data		data
src		src
.gitignore		.gitignore
CONSIDERATIONS.md		CONSIDERATIONS.md
Jenkinsfile		Jenkinsfile
LICENSE		LICENSE
README.md		README.md
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

data

data

src

src

.gitignore

.gitignore

CONSIDERATIONS.md

CONSIDERATIONS.md

Jenkinsfile

Jenkinsfile

LICENSE

LICENSE

README.md

README.md

pom.xml

pom.xml

Repository files navigation

fairly-similar

Overall problem

Nearest Neighbor

Test data

How to run

How to fight

Current status

About

Releases

Packages

Contributors 5

Languages

License

kb-dk/fairly-similar

Folders and files

Latest commit

History

Repository files navigation

fairly-similar

Overall problem

Nearest Neighbor

Test data

How to run

How to fight

Current status

About

Resources

License

Stars

Watchers

Forks

Languages