JImageHash is a performant perceptual image fingerprinting library written entirely in Java. The library returns a similarity score that helps identify images which are likely modifications of an original source, while remaining robust against various attack vectors, e.g. color, rotation and scale transformations.
A perceptual hash is a fingerprint of a multimedia file derived from various features from its content. Unlike cryptographic hash functions which rely on the avalanche effect of small changes in input leading to drastic changes in the output, perceptual hashes are "close" to one another if the features are similar.
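The contrast between the two kinds of hash can be made concrete with a small, self-contained sketch. It does not use JImageHash; `AvalancheSketch`, `toyPerceptualHash` and the sample pixel values are hypothetical names and data chosen for illustration. The toy perceptual hash simply sets one bit per pixel that is brighter than the mean, which is an assumption made to keep the example short.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class AvalancheSketch {

    /** Counts the bits that differ between two byte arrays of equal length. */
    static int differingBits(byte[] a, byte[] b) {
        int count = 0;
        for (int i = 0; i < a.length; i++) {
            count += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
        }
        return count;
    }

    /** Cryptographic hash (SHA-256, 256 bits) of the raw pixel values. */
    static byte[] sha256(int[] pixels) {
        byte[] raw = new byte[pixels.length];
        for (int i = 0; i < pixels.length; i++) {
            raw[i] = (byte) pixels[i];
        }
        try {
            return MessageDigest.getInstance("SHA-256").digest(raw);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    /** Toy perceptual hash: one bit per pixel, set when the pixel is brighter than the mean. */
    static byte[] toyPerceptualHash(int[] pixels) {
        long sum = 0;
        for (int p : pixels) {
            sum += p;
        }
        double mean = (double) sum / pixels.length;
        byte[] bits = new byte[(pixels.length + 7) / 8];
        for (int i = 0; i < pixels.length; i++) {
            if (pixels[i] > mean) {
                bits[i / 8] |= (byte) (1 << (i % 8));
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        int[] original = {10, 200, 30, 220, 15, 210, 25, 230};
        int[] modified = original.clone();
        modified[0] = 12; // tiny brightness change in a single pixel

        System.out.println("SHA-256 bits changed:    "
                + differingBits(sha256(original), sha256(modified)));
        System.out.println("Perceptual bits changed: "
                + differingBits(toyPerceptualHash(original), toyPerceptualHash(modified)));
    }
}
```

With these sample values the toy perceptual hash stays unchanged, because no pixel crosses the mean, while the SHA-256 digest differs in a large share of its 256 bits.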
This library was inspired by Dr. Neal Krawetz's blog post "kind of like that" and incorporates several improvements. A comprehensive overview of perceptual image hashing can be found in this paper by Christoph Zauner.
The project is hosted on Bintray and JCenter. Please be aware that migrating from one major version to another usually invalidates previously created hashes.
    <repositories>
        <repository>
            <id>jcenter</id>
            <url>https://jcenter.bintray.com/</url>
        </repository>
    </repositories>

    <dependency>
        <groupId>com.github.kilianB</groupId>
        <artifactId>JImageHash</artifactId>
        <version>3.0.0</version>
    </dependency>

    <!-- If you want to use the database image matcher you need to add h2 as well -->
    <dependency>
        <groupId>com.h2database</groupId>
        <artifactId>h2</artifactId>
        <version>1.4.197</version>
    </dependency>

    File img0 = new File("path/to/file.png");
    File img1 = new File("path/to/secondFile.jpg");

    // Hash both images with a single algorithm (32 bit resolution)
    HashingAlgorithm hasher = new PerceptiveHash(32);
    Hash hash0 = hasher.hash(img0);
    Hash hash1 = hasher.hash(img1);

    // The normalized Hamming distance lies in [0, 1]; lower means more similar
    double similarityScore = hash0.normalizedHammingDistance(hash1);
    if (similarityScore < 0.2) {
        // Considered a duplicate in this particular case
    }

    // Chaining multiple algorithms for a single image comparison
    SingleImageMatcher matcher = new SingleImageMatcher();
    matcher.addHashingAlgorithm(new AverageHash(64), 0.3);
    matcher.addHashingAlgorithm(new PerceptiveHash(32), 0.2);
    if (matcher.checkSimilarity(img0, img1)) {
        // Considered a duplicate in this particular case
    }

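
What the score and the thresholds mean can be sketched without the library. The following is a minimal illustration, assuming hashes are plain bit strings held in a `BigInteger` and assuming a chain accepts an image pair only if every algorithm stays below its own threshold. `HammingSketch` and its methods are hypothetical names, not part of the JImageHash API.

```java
import java.math.BigInteger;

class HammingSketch {

    /** Number of bit positions in which the two hashes differ. */
    static int hammingDistance(BigInteger a, BigInteger b) {
        return a.xor(b).bitCount();
    }

    /** Hamming distance divided by the hash length, giving a value in [0, 1]. */
    static double normalizedHammingDistance(BigInteger a, BigInteger b, int bitLength) {
        return hammingDistance(a, b) / (double) bitLength;
    }

    /** Assumed chaining rule: every distance must be below its threshold. */
    static boolean allBelowThreshold(double[] distances, double[] thresholds) {
        for (int i = 0; i < distances.length; i++) {
            if (distances[i] >= thresholds[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BigInteger hash0 = new BigInteger("10110100", 2);
        BigInteger hash1 = new BigInteger("10010110", 2);

        // The hashes differ in 2 of 8 bits
        System.out.println(normalizedHammingDistance(hash0, hash1, 8)); // prints 0.25

        // Two algorithms with thresholds 0.3 and 0.2, as in the chained example
        System.out.println(allBelowThreshold(new double[] {0.1, 0.25}, new double[] {0.3, 0.2})); // prints false
    }
}
```

A lower threshold makes a match stricter: in the second call the first algorithm would accept the pair, but the second one rejects it, so the chain as a whole rejects it.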
Below you can find examples of convenience methods used to get fast results. Further examples in the examples folder explain how to choose and optimize individual algorithms on your own.
File | Content |
---|---|
CompareImages.java | Compare the similarity of two images using a single algorithm and a custom threshold |
ChainAlgorithms.java | Chain multiple algorithms to achieve a better precision & recall. |
MatchMultipleImages.java | Precompute the hash of multiple images to retrieve all relevant images in a batch. |
DatabaseExample.java | Store hashes persistently in a database. Serialize and Deserialize the matcher. |
AlgorithmBenchmark.java | Test different algorithm/setting combinations against your images to see which settings give the best result. |
Clustering Example | Extensive tutorial matching 17,000 images, as described in the blog. |
The `persistent` package allows hashes and matchers to be saved to disk. The images themselves are not kept in memory and are only referenced by file path, which makes it possible to handle a large number of images at the same time.

The `cached` version keeps the BufferedImage objects in memory, which allows hashing algorithms to be changed on the fly and gives direct access to the buffered image objects of matching images.

The `categorize` package contains image clustering matchers: KMeans and Categorical as well as weighted matchers.

The `exotic` package features a BloomFilter and the SingleImageMatcher, which is used to match two images without any fancy additions.
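
The idea of referencing images only by file path can be illustrated with a small in-memory index. This is a sketch under stated assumptions, not the library's matcher implementation: `PathIndexSketch`, `add` and `matches` are hypothetical names, hashes are assumed to be precomputed bit strings, and the lookup is a plain linear scan.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PathIndexSketch {

    private final int bitLength;
    // Only the path and the hash are stored, never the decoded image
    private final Map<String, BigInteger> hashesByPath = new LinkedHashMap<>();

    PathIndexSketch(int bitLength) {
        this.bitLength = bitLength;
    }

    /** Registers a precomputed hash under the path of its image. */
    void add(String path, BigInteger hash) {
        hashesByPath.put(path, hash);
    }

    /** Returns the paths whose normalized Hamming distance to the query is below the threshold. */
    List<String> matches(BigInteger query, double threshold) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, BigInteger> entry : hashesByPath.entrySet()) {
            double distance = entry.getValue().xor(query).bitCount() / (double) bitLength;
            if (distance < threshold) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        PathIndexSketch index = new PathIndexSketch(8);
        index.add("a.png", new BigInteger("11110000", 2));
        index.add("b.png", new BigInteger("11110001", 2));
        index.add("c.png", new BigInteger("00001111", 2));

        System.out.println(index.matches(new BigInteger("11110000", 2), 0.2)); // prints [a.png, b.png]
    }
}
```

Because each entry is just a path and a few bytes of hash, such an index stays small even for large image collections; the image file only needs to be opened again when a match is actually inspected.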
Example images compared against each other: High Quality, Low Quality, Altered Copyright, Thumbnail and Ballon (image table not preserved).
Image matchers can be configured with different algorithms, each of which has its own properties.
Algorithm | Feature | Notes |
---|---|---|
AverageHash | Average Luminosity | Fast and good all purpose algorithm |
AverageColorHash | Average Color | Version 1.x.x AHash. Usually worse off than AverageHash. Not robust against color changes |
DifferenceHash | Gradient/Edge detection | A bit more robust against hue/sat changes compared to AColorHash |
Wavelet Hash | Frequency & Location | Feature extraction by applying Haar wavelets multiple times to the input image. Detection quality is somewhat better than halfway between aHash and pHash. |
PerceptiveHash | Frequency | Hash based on Discrete Cosine Transformation. Smaller hash distribution but best accuracy / bitResolution. |
MedianHash | Median Luminosity | Identical to AHash but takes the median value into account. A bit better to detect watermarks but worse at scale transformation |
AverageKernelHash | Average luminosity | Same as AHash with kernel preprocessing. So far usually performs worse, but testing is not done yet. |
Rotational Invariant | ||
RotAverageHash | Average Luminosity | Rotationally robust version of AHash. Performs well, but performance scales disastrously with higher bit resolutions. Conceptual issue: pixels further away from the center are weighted less. |
RotPHash | Frequency | Rotationally invariant version of pHash using ring partitioning to map pixels in a circular fashion. Lower complexity for high bit sizes, but due to sorting pixel values it usually maps to a lower normalized distance. Bit resolutions of >= 64 bits are usually preferable. |
Experimental. Hashes available but not well tuned and subject to changes | ||
HogHash | Angular Gradient based (detection of shapes?) | A hashing algorithm based on HOG feature detection, which extracts gradients and pools them by angle. Usually used for human outline detection with support vector machines/NNs. It's not entirely settled how the feature vectors should be encoded. Currently gives average but not great results, is expensive to compute and requires a rather high bit resolution. |
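
To make the "Average Luminosity" feature in the table more tangible, here is a minimal average hash built only on the Java standard library. It is a sketch of the general technique, not the library's `AverageHash` implementation: the scaling method, the luminance weights and the helper `twoTone` are assumptions made for this example.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;
import java.math.BigInteger;

class AverageHashSketch {

    /** Scales the image to side x side pixels and sets one bit per pixel brighter than the mean. */
    static BigInteger averageHash(BufferedImage image, int side) {
        // 1. Shrink the image so that only coarse structure remains
        BufferedImage small = new BufferedImage(side, side, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = small.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                RenderingHints.VALUE_INTERPOLATION_BILINEAR);
        g.drawImage(image, 0, 0, side, side, null);
        g.dispose();

        // 2. Convert every pixel to a luminance value and sum them up
        int[] luminance = new int[side * side];
        long sum = 0;
        for (int y = 0; y < side; y++) {
            for (int x = 0; x < side; x++) {
                int rgb = small.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int gr = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                int lum = (299 * r + 587 * gr + 114 * b) / 1000;
                luminance[y * side + x] = lum;
                sum += lum;
            }
        }
        double mean = (double) sum / luminance.length;

        // 3. One bit per pixel, row by row: 1 if brighter than the mean
        BigInteger hash = BigInteger.ZERO;
        for (int lum : luminance) {
            hash = hash.shiftLeft(1);
            if (lum > mean) {
                hash = hash.or(BigInteger.ONE);
            }
        }
        return hash;
    }

    /** Hypothetical test image: left half one gray level, right half another. */
    static BufferedImage twoTone(int size, int leftGray, int rightGray) {
        BufferedImage img = new BufferedImage(size, size, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < size; y++) {
            for (int x = 0; x < size; x++) {
                int v = x < size / 2 ? leftGray : rightGray;
                img.setRGB(x, y, (v << 16) | (v << 8) | v);
            }
        }
        return img;
    }

    public static void main(String[] args) {
        BigInteger dark = averageHash(twoTone(64, 40, 200), 8);
        BigInteger bright = averageHash(twoTone(64, 70, 230), 8); // same motif, brighter
        System.out.println("Differing bits: " + dark.xor(bright).bitCount());
    }
}
```

Because each bit only records whether a pixel is above or below the image's own mean, a uniform brightness shift leaves the hash of this test image unchanged. The same property explains why such a hash cannot distinguish images that differ only in color.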
Image clustering with fuzzy hashes, which represent hashes with probability bits instead of simple 0s and 1s.
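
One way to picture probability bits is to average the hashes of all images in a cluster, so that each position stores how often the bit was 1. The sketch below is an illustration of that idea only; `FuzzyHashSketch`, `bitProbabilities` and `weightedDistance` are hypothetical names, and the distance measure is an assumption rather than the one the library uses.

```java
class FuzzyHashSketch {

    /** For each bit position, the fraction of hashes in the cluster that have the bit set. */
    static double[] bitProbabilities(boolean[][] hashes) {
        int length = hashes[0].length;
        double[] probabilities = new double[length];
        for (boolean[] hash : hashes) {
            for (int i = 0; i < length; i++) {
                if (hash[i]) {
                    probabilities[i] += 1.0;
                }
            }
        }
        for (int i = 0; i < length; i++) {
            probabilities[i] /= hashes.length;
        }
        return probabilities;
    }

    /** Mean absolute difference between the probability bits and a binary hash, in [0, 1]. */
    static double weightedDistance(double[] probabilities, boolean[] hash) {
        double sum = 0;
        for (int i = 0; i < probabilities.length; i++) {
            sum += Math.abs(probabilities[i] - (hash[i] ? 1.0 : 0.0));
        }
        return sum / probabilities.length;
    }

    public static void main(String[] args) {
        boolean[][] cluster = {
            {true, false, true, true},
            {true, false, false, true},
            {true, true, false, true},
        };
        double[] fuzzy = bitProbabilities(cluster); // 1.0, 1/3, 1/3, 1.0

        boolean[] candidate = {true, false, false, true};
        System.out.println(weightedDistance(fuzzy, candidate)); // roughly 0.167
    }
}
```

Bits on which all cluster members agree contribute fully to the distance, while bits that vary inside the cluster are penalized less, which is the intuition behind using probabilities instead of hard 0s and 1s.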
See the wiki page on how to test different hashing algorithms with your set of images.