The duplicate image detection tool
Help! I have a million images and I'm sure there are duplicates, but which are they?
Identifying duplicates manually is time-consuming and error-prone. You need a tool to help you: Matchbox.
Matchbox is an open source tool which:
- provides decision-making support for duplicate image detection in or across collections
- identifies duplicate content even where the files differ (in format, size, rotation, cropping, colour enhancement etc.) or were scanned from different original copies of the same publication
- applies state-of-the-art image processing
- works where OCR will not, for example on images of handwriting or music scores
- is useful in assembling collections from multiple sources, and identifying missing files
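As a rough illustration of what near-duplicate detection involves, the sketch below fingerprints each image with an average hash and reports pairs whose fingerprints are close. This is not Matchbox's method: it is a far simpler technique that tolerates only resizing and uniform brightness changes, not rotation or cropping. All function names are hypothetical, and images are assumed to be already decoded into rows of grayscale values, at least 8 x 8 pixels in size.

```python
from itertools import combinations


def downscale(pixels, size=8):
    """Shrink a grayscale image (list of rows) to size x size by block averaging.

    Assumes the image is at least size pixels wide and high.
    """
    height, width = len(pixels), len(pixels[0])
    grid = []
    for gy in range(size):
        y0, y1 = gy * height // size, (gy + 1) * height // size
        row = []
        for gx in range(size):
            x0, x1 = gx * width // size, (gx + 1) * width // size
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(block) / len(block))
        grid.append(row)
    return grid


def average_hash(pixels, size=8):
    """Fingerprint: one bit per grid cell, set when the cell is brighter than the mean."""
    cells = [value for row in downscale(pixels, size) for value in row]
    mean = sum(cells) / len(cells)
    return tuple(value > mean for value in cells)


def hamming(a, b):
    """Number of positions at which two fingerprints differ."""
    return sum(x != y for x, y in zip(a, b))


def candidate_duplicates(images, max_distance=5):
    """Return name pairs whose fingerprints differ in at most max_distance bits.

    images maps a name to its grayscale pixel rows; max_distance is an
    arbitrary illustrative threshold, not a recommended value.
    """
    hashes = {name: average_hash(pixels) for name, pixels in images.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(hashes), 2)
        if hamming(hashes[a], hashes[b]) <= max_distance
    ]
```

Pairs returned by candidate_duplicates are only candidates for a human to review, which matches the decision-making support role described above. Comparing every pair is quadratic in the number of images, so a collection of a million images would need an index or some form of bucketing instead of this brute-force loop.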
Matchbox brings the following benefits:
- Automated quality assurance
- Reduced manual effort and error
- Saved time
- Lower costs, e.g. storage, effort
- Open source, standalone tool. Also available as a Taverna component for easy invocation
- Invariant to format, rotation, scale, translation, illumination, resolution, cropping, warping and distortions
- May be applied to a wide range of image collections, not just print images
There are numerous situations in which you may need to identify duplicate images in collections, for example:
- to ensure that a page or book has not been digitized twice
- to discover whether a master and service set of digitized images represent the same set of originals
- to confirm that all scans have gone through post-scan image processing
This work was partially supported by the SCAPE project. The SCAPE project is co-funded by the European Union under FP7 ICT-2009.4.1 (Grant Agreement number 270137).
Austrian Institute of Technology, 2013: Alexander Schindler alexander.schindler@ait.ac.at, Reinhold Huber-Mörk Reinhold.Huber-Moerk@ait.ac.at