Matchbox Installation & Use
The duplicate image detection tool
To install you need:
- Latest version of OpenCV
- Python 2.7
Please refer to the INSTALL.md file in this repository.
The extractfeatures tool extracts features from images. The implemented feature set comprises basic image metadata extraction, the basic image processing features Color Histograms and Image Profiles, and more complex features based on interest point detection.
Features can be extracted either all at once or by explicitly specifying the required feature. Extracted features are stored either in the same directory as the corresponding image or in a specified directory. Features can be stored in gzipped XML format or in binary format. Binary storage enables faster processing, while XML provides more flexibility for data processing with third-party tools.
The stored feature filenames have the format:
<original_image_filename>.<feature_name>.<dat|xml.gz>, e.g. img00001.tif.ImageHistogram.xml.gz
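The naming scheme can be illustrated with a small helper. This is a sketch inferred from the example filename above, not code from Matchbox itself: the function name `feature_filename` and its parameters are hypothetical.

```python
import os


def feature_filename(image_path, feature_name, binary=False, out_dir=None):
    """Build the path of the stored feature file for one image.

    Hypothetical helper: it only mirrors the naming pattern shown in this
    README (image filename, feature name, then "dat" or "xml.gz").
    """
    extension = "dat" if binary else "xml.gz"
    # Default to the directory of the image, as described above.
    directory = out_dir if out_dir is not None else os.path.dirname(image_path)
    name = "{0}.{1}.{2}".format(
        os.path.basename(image_path), feature_name, extension)
    return os.path.join(directory, name)


print(feature_filename("img00001.tif", "ImageHistogram"))
# prints img00001.tif.ImageHistogram.xml.gz
```

Passing `binary=True` switches the suffix to `dat`, and `out_dir` redirects the file to a separate feature directory.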
The compare tool compares two extracted features and calculates a similarity estimation.
Both input files must belong to the same feature set; comparison between different feature sets is not possible. Only two files can be compared at a time, not a set of files.
The resulting similarity estimation is written in XML format to standard output (e.g. the command line interface).
The train tool is a specialized tool to create visual vocabularies based on visual bag-of-words. A visual bag-of-words is the counterpart of the bag-of-words in classical information retrieval, where each text document is represented as a histogram of its distinctive word occurrences. This approach has been adopted in image processing based on features from interest point detectors, especially SIFT features.
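To make the histogram idea concrete, the following sketch assigns each descriptor to its nearest visual word and counts the assignments. It is an illustration only: the function names are hypothetical, and the 2-D points stand in for real SIFT descriptors, which have 128 dimensions.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def bow_histogram(descriptors, vocabulary):
    """Count how many descriptors fall closest to each visual word."""
    histogram = [0] * len(vocabulary)
    for descriptor in descriptors:
        nearest = min(
            range(len(vocabulary)),
            key=lambda i: squared_distance(descriptor, vocabulary[i]))
        histogram[nearest] += 1
    return histogram


# Toy data (assumed for illustration): three visual words, four descriptors.
vocabulary = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
descriptors = [(1.0, 0.5), (9.0, 9.5), (0.2, 0.1), (0.5, 9.0)]
print(bow_histogram(descriptors, vocabulary))  # prints [2, 1, 1]
```

Two images with similar content tend to produce similar histograms, which is what makes the histogram usable as a compact feature for comparison.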
The train tool takes a list of SIFT descriptors and applies a clustering algorithm to it. The calculated centroids represent the visual vocabulary that is used in the further processing steps of certain workflows.
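The text does not name the clustering algorithm, so the sketch below assumes plain k-means (Lloyd's algorithm) purely for illustration. The function `train_vocabulary` is hypothetical, the 2-D points stand in for 128-dimensional SIFT descriptors, and the evenly spaced seeding is a simplification chosen to keep the example deterministic.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    count = float(len(points))
    return tuple(sum(p[d] for p in points) / count
                 for d in range(len(points[0])))


def train_vocabulary(descriptors, k, iterations=20):
    """Cluster descriptors with k-means and return the k centroids."""
    n = len(descriptors)
    # Simplified seeding: pick k evenly spaced input descriptors.
    centroids = [descriptors[i * n // k] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for descriptor in descriptors:
            nearest = min(
                range(k),
                key=lambda i: squared_distance(descriptor, centroids[i]))
            clusters[nearest].append(descriptor)
        # An empty cluster keeps its previous centroid.
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids


# Toy data (assumed): two well-separated groups of descriptors.
descriptors = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
vocab = train_vocabulary(descriptors, k=2)
print([tuple(round(v, 2) for v in c) for c in vocab])
# prints [(0.33, 0.33), (10.33, 10.33)]
```

Each returned centroid plays the role of one visual word; the vocabulary size `k` trades off histogram detail against training cost.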
Duplicate detection is the task of detecting duplicates within an image collection. The workflow consists of the following steps:
- extract SIFTComparison features of all images
- train a visual vocabulary on the extracted features
- extract BoWHistograms using the vocabulary and all extracted SIFTComparison features
- create a similarity matrix for the collection using compare on all BoWHistogram features
- take the top-most similar images for each image and compare their SIFTComparison features
- set a threshold based on the retrieved data
- images with an SSIM exceeding the threshold are considered to be duplicates
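The last three steps (shortlisting the most similar images, then thresholding on SSIM) can be sketched as follows. This is not the logic of FindDuplicates.py; the function names, the parameter `k`, and all numbers are assumptions for illustration. In a real run the SSIM values would come from compare rather than from a precomputed table.

```python
def top_candidates(similarity, k):
    """For each image index, list the k most similar other images."""
    candidates = {}
    for i, row in enumerate(similarity):
        others = [j for j in range(len(row)) if j != i]
        others.sort(key=lambda j: row[j], reverse=True)
        candidates[i] = others[:k]
    return candidates


def find_duplicates(similarity, ssim, k, threshold):
    """Return sorted pairs (i, j), i < j, whose SSIM exceeds the threshold."""
    pairs = set()
    for i, shortlist in top_candidates(similarity, k).items():
        for j in shortlist:
            pair = (min(i, j), max(i, j))
            if ssim[pair] > threshold:
                pairs.add(pair)
    return sorted(pairs)


# Made-up BoWHistogram similarity matrix for four images.
bow_similarity = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.6],
    [0.1, 0.2, 0.6, 1.0],
]
# Made-up SSIM values per image pair (placeholder for compare output).
ssim = {(0, 1): 0.89, (0, 2): 0.10, (0, 3): 0.05,
        (1, 2): 0.12, (1, 3): 0.08, (2, 3): 0.41}

print(find_duplicates(bow_similarity, ssim, k=1, threshold=0.8))
# prints [(0, 1)]
```

The shortlist keeps the expensive SIFTComparison step limited to a few candidates per image instead of every pair in the collection.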
Command line use
python2.7 ./FindDuplicates.py /home/matchbox/matchbox-data/ all
The output of the duplicate detection script is a list of possible duplicates (e.g. document 10 is a duplicate candidate for page 2):
[1 of 20] 1 [2 of 20] 2 =>  [3 of 20] 3
SSIM comparison sample
extractfeatures.exe <image1>
extractfeatures.exe <image2>
compare <image1>.feat.xml <image2>.feat.xml
The SSIM comparison outputs a value between 0 and 1, where values closer to 1 indicate higher similarity:
<SIFTComparison> <SSIM>0.889990</SSIM> ... </SIFTComparison>
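For scripted use, the SSIM value can be read from the XML output with the Python standard library. This is a minimal sketch: the helper `read_ssim` is hypothetical, and the sample string contains only the `SSIM` element shown above, leaving out the parts elided in the README. The lookup does not depend on the name of the root element.

```python
import xml.etree.ElementTree as ET


def read_ssim(xml_text):
    """Extract the SSIM value from compare's XML output as a float."""
    root = ET.fromstring(xml_text)
    # Accept SSIM as the root element or anywhere below it.
    node = root if root.tag == "SSIM" else root.find(".//SSIM")
    if node is None or node.text is None:
        raise ValueError("no SSIM element found")
    return float(node.text)


# Reduced sample (assumed): only the SSIM element is included.
sample = "<SIFTComparison><SSIM>0.889990</SSIM></SIFTComparison>"
print(read_ssim(sample))  # prints 0.88999
```

The returned float can then be checked against the duplicate threshold chosen for the collection.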
Features and roadmap
- Duplicate detection in a digital document collection
- SSIM image comparison