Find Duplicates

kiwidude68 edited this page Mar 3, 2024 · 11 revisions

MobileRead History

Introduction

This plugin will help you to identify duplicate authors, titles, formats, series, publishers, tags and identifiers in your Calibre libraries.

  • Duplicate authors are where you have multiple variants of an author due to spacing, punctuation, spelling differences or word order. e.g. Kevin Anderson / Kevin J. Anderson / Keven Anderson / Anderson, Kevin / Anderson Kevin / Bloggs, Joe & Anderson, Kevin
  • Duplicate titles are where you have multiple book entries with either the same or varying titles. e.g. Martian Way / The Martian Way / The Martian Way (2010) / The Martian Way and Other Stories
  • Duplicate formats are where the contents of a particular format like ePub are binary identical to another in your library

The plugin offers a variety of matching algorithms for finding possible groups of duplicate candidates. Each algorithm combination provides a differing trade-off of the number of genuine duplicates found versus the number of false positives (near duplicates).

Menu

When the search is complete the results of each group are presented to you to navigate through. You can then do one of three things:

  • If the group contains genuine duplicates, use the existing Merge feature in the Edit metadata menu to resolve the duplicate book entries.
  • If the group contains non duplicates, you can mark the group as exempt to prevent those books or authors from appearing together in future searches.
  • Skip the group for now and just move to the next one, either deferring your decision or to mark all remaining groups as exemptions when finished.

The "Find metadata variations" menu allows you to find variations of author, publisher, series and tag names and rename them directly in its dialog. Again, a number of different matching algorithms are available for use.

You also have the ability to perform duplicate comparisons across multiple libraries. So for instance if you have a "working" library and a "main" library, you can search for duplicates between those libraries with the same range of algorithms and produce a report for later resolution.

Main Features

  • Searches either your entire library or only the books within any search restriction set at the time you run Find Duplicates.
  • Choose your desired combination of title and author matching from any of "identical", "similar", "soundex", "fuzzy" or "ignore" algorithms.
  • Choose alternative algorithms such as matching identifiers or binary comparison.
  • View the results either one group at a time, or showing all duplicate candidates at once using highlighting to show the groups.
  • When doing author duplicate searches (ignore title), optionally highlight the authors under consideration in the tag browser for ease of renaming
  • Sort the result groups either by title/author (default) or by the size of the group
  • Fine-tune the soundex algorithm options to make matching "fuzzier" or more exact.
  • Optionally include the languages field when comparing titles, so intentionally using the same book title in different languages does not show as duplicates.
  • Optionally have binary duplicate formats automatically removed from your library when doing a binary comparison.
  • Mark the current group as exempt or all groups as exempt from appearing as duplicates again
  • Review your duplicate exemptions with the opportunity to reverse the exemption allowing duplicate consideration again
  • Exempt either individual books (title searches) or authors (author searches)
  • Clicking the clear search button, setting a different restriction or choosing an explicit Clear duplicate results menu option will exit duplicate search mode.
  • Switching libraries or restarting Calibre will also clear any duplicate search results. Your exemptions will be remembered and are stored per library.
  • Find metadata variations for authors, publishers, series and tags to eradicate unwanted duplicates with an alternative simplified UI to rename them.
  • Find duplicates across multiple libraries, producing a report.
  • When placed on the toolbar, clicking the toolbar button without duplicate groups displayed will display the Find Duplicates options dialog.
    • When results are displayed, clicking on the button will move to the next result.
    • Ctrl+click or Shift+Click to navigate to the previous result.
  • Use the Delete key to remove an entry from the library list in the cross-library search options.
  • Customize the keyboard shortcuts for a number of the menu options.

Configuration

Access the configuration dialog via either of:

  • Preferences -> Plugins -> User interface action -> Find Duplicates -> Customize plugin
  • Find Duplicates -> Customize plugin...

Configuration Dialog

Find Book Duplicates

This feature is used to find individual book duplicates within a single library based on a flexible range of criteria. It will also respect whatever virtual library you might have selected to search for results within.

Find Book Duplicates Dialog

Duplicate Search Type

  • Title/Author - Compares books using the title and/or author metadata in your library.
    • You can define below whether the title, the author or both are used in the comparison.
  • Binary Compare - Compares book formats for binary identical files, ignoring all other book metadata.
    • Can help find cases where the book metadata for one of those is incorrect or inconsistent.
    • Note a limitation of the plugin is that it cannot tell you which format(s) are duplicates if a book has multiple.
  • Identifier - Compares books using the specified identifier.
    • Can also help find cases where a book has the wrong identifier or was linked to the wrong book by a metadata download.
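To make the idea of a Binary Compare concrete, here is a minimal sketch of how binary identical files can be detected in general. It is not the plugin's own code: the function names `file_digest` and `find_binary_duplicates` are hypothetical, and it assumes that two files with the same size and the same SHA-256 digest can be treated as identical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_binary_duplicates(paths):
    """Group paths whose contents hash to the same digest (hypothetical helper)."""
    # Cheap first pass: only files of equal size can be identical.
    by_size = defaultdict(list)
    for p in paths:
        by_size[Path(p).stat().st_size].append(p)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot have a binary duplicate
        by_hash = defaultdict(list)
        for p in same_size:
            by_hash[file_digest(p)].append(p)
        groups.extend(g for g in by_hash.values() if len(g) > 1)
    return groups
```

Note that a comparison like this only looks at file contents, which is why two books with completely different metadata can still be reported together.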

Title Matching

These are listed in order of how specific the match results will be. So an "Identical" title match will bring back far fewer duplicate results than a "Fuzzy" title match, for instance. However, that also means you will have fewer false positives in the results. I suggest you start with the most specific combinations, and only use Soundex or Fuzzy for edge cases where you can quickly scan the results.

  • Identical - The most specific type of search, where the title must exactly match to be considered a duplicate.
  • Similar - Removes common punctuation and prefixes from the title prior to comparing.
    • Uses the same matching logic as the calibre Automerge feature.
    • e.g. A First Book's Colour; would simplify to First Books Colour for matching purposes.
    • Note this will not find any spelling variations or typos.
  • Soundex - Uses an algorithm to index book titles by how they sound (see Wikipedia).
    • You can adjust how precise this algorithm is using the length.
    • The larger the number, the more variations it will generate.
    • Note that increasing this number will make comparisons slower and use more memory.
  • Fuzzy - This is an aggressive algorithm that removes any words after &, and, or and aka.
    • e.g. a title Jason and the Argonauts would simplify to Jason for title match purposes.
  • Ignore - Ignores the title metadata completely, so books will compare on Author name only.
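The Similar and Fuzzy title options can be pictured as normalization steps applied before titles are compared. The following is only a rough sketch of that idea, not the plugin's or calibre's actual logic: the function names `similar_title` and `fuzzy_title` and the list of leading articles are assumptions made for illustration.

```python
import re

LEADING_ARTICLES = ("a ", "an ", "the ")  # assumed prefix list, for illustration


def similar_title(title):
    """Lowercase, drop punctuation, collapse spaces and strip a leading article."""
    t = title.lower()
    t = re.sub(r"[^\w\s]", "", t)        # remove punctuation such as ' ; ( )
    t = re.sub(r"\s+", " ", t).strip()   # collapse runs of whitespace
    for article in LEADING_ARTICLES:
        if t.startswith(article):
            t = t[len(article):]
            break
    return t


def fuzzy_title(title):
    """Cut everything from the first '&', 'and', 'or' or 'aka', then normalize."""
    head = re.split(r"\s(?:&|and|or|aka)\s", title.lower(), maxsplit=1)[0]
    return similar_title(head)
```

With these definitions, `similar_title("A First Book's Colour;")` gives `first books colour` and `fuzzy_title("Jason and the Argonauts")` gives `jason`, mirroring the two examples in the list above. Two titles would then be duplicate candidates when their normalized forms are equal.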

Author Matching

  • Identical - The most specific type of search, where the author name must exactly match to be considered a duplicate.
  • Similar - Removes common punctuation, certain keywords like jr, author initials and ordering.
    • e.g. Smith Sr., John D. would simplify to Smith, John and John Smith for matching purposes.
    • Note this will not find any spelling variations or typos.
  • Soundex - Uses an algorithm to index author names by how they sound (see Wikipedia).
    • You can adjust how precise this algorithm is using the length.
    • The larger the number, the more variations it will generate.
    • Note that increasing this number will make comparisons slower and use more memory.
  • Fuzzy - This algorithm attempts to find variants of authors by shortening to initials.
    • e.g. John Smith would simplify to JSmith for author match purposes.
  • Ignore - Ignores the author metadata completely, so books will compare on title only.
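For readers unfamiliar with Soundex, here is a minimal sketch of the classic algorithm with an adjustable code length, plus a guess at what the Fuzzy author shortening might look like. Both functions are illustrative assumptions rather than the plugin's implementation: the `length` parameter here simply controls how many characters of the code are kept, and `fuzzy_author` assumes names written in "First Last" order.

```python
SOUNDEX_CODES = {
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    "l": "4",
    **dict.fromkeys("mn", "5"),
    "r": "6",
}


def soundex(word, length=4):
    """Classic Soundex code, padded with zeros or truncated to `length`."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = SOUNDEX_CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = SOUNDEX_CODES.get(c, "")
        if digit and digit != prev:
            code += digit
        if c not in "hw":  # h and w do not separate letters with equal codes
            prev = digit
    return (code + "0" * length)[:length]


def fuzzy_author(name):
    """Reduce 'John Smith' to 'JSmith': first initial plus last word (assumed rule)."""
    parts = name.replace(".", " ").split()
    if len(parts) < 2:
        return name.strip()
    return parts[0][0].upper() + parts[-1]
```

In this sketch `soundex("Kevin")` and `soundex("Keven")` produce the same code, which is how a spelling variation can end up in the same candidate group, and `fuzzy_author("John Smith")` returns `JSmith` as in the example above.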

Result Options

  • Show all groups at once with highlighting / Show one group at a time - The results of a book duplicate search are displayed in calibre.
    • Most users may prefer to see all results at once.
    • For more information see Book Duplicates Search Results below.
  • Highlight authors in the tag browser - For use if in Title Matching you chose the Ignore option.
  • Sort groups by number of duplicates - Optionally order the results so that groups with more duplicates appear first.
  • Include languages metadata when comparing titles - Some users may have the same book title but different Languages metadata.
    • Check this option to prevent these from showing as a duplicate.
  • When doing a Binary Compare, automatically remove duplicate formats - A convenience option for the issue mentioned above for Binary Compare.
    • There is no way to visually highlight for a book which format is the binary duplicate.
    • Checking this option will automatically delete all duplicate formats.
    • The remaining copy it keeps will be on your oldest book metadata row.
    • No other book metadata/formats are touched.
    • You will still have to merge/delete the remaining book metadata in the results.
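The effect of the languages and sort options can be sketched as changes to how candidate groups are keyed and ordered. This is an illustration only, under the assumption that books are plain dictionaries with `id`, `title`, `author` and `language` keys; the function name `group_candidates` is hypothetical and the plugin's real data structures are not shown here.

```python
from collections import defaultdict


def group_candidates(books, include_language=False, sort_by_size=False):
    """Group book ids sharing a (title, author[, language]) key; keep groups of 2+."""
    buckets = defaultdict(list)
    for book in books:
        key = (book["title"].strip().lower(), book["author"].strip().lower())
        if include_language:
            # Same title in a different language now lands in a different bucket.
            key += (book.get("language", ""),)
        buckets[key].append(book["id"])

    # Default order: by title/author key; single-book buckets are not duplicates.
    groups = [ids for _, ids in sorted(buckets.items()) if len(ids) > 1]
    if sort_by_size:
        groups.sort(key=len, reverse=True)  # largest groups first
    return groups
```

Adding the language to the key is what stops an English and a French edition with the same title from being reported together.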

Book Duplicates Search Results

After your search has completed, the results are presented to you, divided into groups. The example below has Show all groups at once with highlighting enabled. The first group is highlighted automatically; here we have two author variations for the same book.

Find Duplicates Output

Several important things to note here:

  • You have had a virtual library applied using a special search of marked:duplicates.
  • The marked:duplicate_group_0001 search is what is allowing the plugin to highlight the first group of results.
  • While working through these duplicate results it effectively prevents you from doing other searches across your library.

Now is the opportunity for you to work through the book duplicate results output, doing one highlighted group at a time. You would most likely do one of the following actions:

  • Hit the delete key on a book row you are happy to immediately remove, or
  • Merge two book rows together to combine metadata/formats, or
  • If this is a false positive result, you can choose Find Duplicates -> Mark current group as exempt. This will prevent that same combination being displayed to you the next time you do a duplicates search.

To navigate to the next duplicate result group:

  • Left click on the Find Duplicates button in your toolbar, or
  • Click on Find Duplicates -> Next result, or
  • Mark the group as exempt.

To navigate to the previous duplicate result group:

  • Ctrl + Left click on the Find Duplicates button in your toolbar, or
  • Shift + Left click on the Find Duplicates button in your toolbar, or
  • Click on Find Duplicates -> Previous result

Existing Book Duplicates results

To exit out of this mode (which will clear remaining duplicate search results), you can do one of:

  • Hit the Esc key
  • Click the Clear Search button
  • Click on Find Duplicates -> Clear search results

Find Library Duplicates

This feature is used to find book duplicates between your current library and another calibre library based on a flexible range of criteria. You should perform this operation from the library in which you wish to optionally remove any duplicates found. It will also respect whatever virtual library you might have selected to search for results within (as of v1.10.2).

Find Library Duplicates Dialog

The majority of the options are the same as for the Find Book Duplicates dialog above, though you will most likely want to stick to a very tightly controlled scope of duplicate comparison options like Identical/Similar.

  • Display duplicate books when search completes - Enabled by default, any duplicates found will be displayed in a virtual library of results for you to browse. Alternatively, if you just want the log file of results without changing what is displayed in calibre, then uncheck this option.
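Conceptually, a cross-library comparison indexes the books of the other library by a matching key and then looks up each book of the current library in that index. The sketch below shows this under stated assumptions: books are plain dictionaries, `cross_library_duplicates` is a hypothetical name, and `key_func` stands in for whichever title/author matching combination was chosen. It is not the plugin's code.

```python
def cross_library_duplicates(current_books, other_books, key_func):
    """Map each current-library book id to the matching ids in the other library."""
    # Index the other library once so each lookup is a dictionary access.
    other_index = {}
    for book in other_books:
        other_index.setdefault(key_func(book), []).append(book["id"])

    matches = {}
    for book in current_books:
        hits = other_index.get(key_func(book))
        if hits:
            matches[book["id"]] = hits
    return matches
```

A mapping like this is enough both to list the matched books in the current library and to write a report of which book in the other library each one matched.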

Library Duplicate Search Results

When the search completes, you will be presented with a dialog like the following. Click Show details to see the details log output, which you can optionally save to a file.

Find Library Output

In your current library, all of the books that were considered duplicates are displayed if you enabled the display option above. Once you have checked a match against the log and are happy that the book exists in your other library, you can simply delete it from this list.

Find Metadata Variations

This feature is used to find duplicates due to variations in your metadata such as Authors, Series, Publisher or Tags. This allows you to make bulk changes to many books at once rather than having to work through many individual book duplicate group results. It will also respect whatever virtual library you might have selected to search for results within (as of v1.10.1).

I would actually suggest doing an Authors metadata variations check as the first type of duplicates search you do. Once you clean up your author metadata variants, you can "tighten" the author matching in the Find Book Duplicates dialog, which will reduce the number of false positives and results you have to deal with.

Find Metadata Variations Dialog

The options in this dialog are similar to those described for the Find Book Duplicates dialog above.

Metadata Variations Search Results

After you click Search you will be presented with the list of results. As you click on each item in the list on the left, the right-hand side will show you all the "variations" that were found for that particular row.

You can optionally also display the matching books for that item (an Author, for example) by checking the Show matching books option. This will display all books for both the selected item and the variations selected on the right-hand side. If you deselect one or more of those variations, your calibre search will update to remove them from the query.

For each result you now have two options available:

  • If you think all books should have this item changed to a new name:
    • Specify your required new name in the Rename to: dropdown
    • Click on the Rename button to apply to all the matching book metadata in your library.
  • If you are happy that this variation is not one you want to make a change for:
    • Click on the Ignore button to remove it from the list of search results (no changes to your library).
    • Note: Currently the plugin does not support exemptions for metadata variations. So you will see the same results again in a future search.

Exemptions

Creating an "exemption" for a particular book duplicate group ensures that you will never see that result combination appear again in future. This can be useful when there are genuine reasons for a "not really a duplicate" condition to exist, such as:

  • an author re-using the same title for another book in an unrelated series.
  • two different authors sharing the same name (or nearly the same such as with initials).

You can view the underlying data for all books that have been exempted using Find Duplicates -> Show all book duplicate exemptions. This will display all the affected books in your library. To change an individual exemption, select a book and choose Find Duplicates -> Manage exemptions for this book which will present a dialog like the following:

Manage Exemptions Dialog

This is an example exemption where the author reused the same book title for a different series. If I wanted to remove this specific exemption then I would tick the Remove checkbox and click OK. It would then reappear in future "Identical Title, identical Author" searches, for instance.
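One simple way to think about exemptions is as a stored set of book pairs that should not be reported together. The sketch below models that idea only; the function names are hypothetical, and how the plugin actually stores and applies exemptions is not described on this page.

```python
from itertools import combinations


def mark_group_exempt(exemptions, group):
    """Record every pair of book ids in the group as exempt."""
    for a, b in combinations(group, 2):
        exemptions.add(frozenset((a, b)))


def is_group_exempt(exemptions, group):
    """A group is hidden only if every pair inside it has been exempted."""
    return all(frozenset(pair) in exemptions for pair in combinations(group, 2))


def remove_exemption(exemptions, a, b):
    """Reverse a single exemption so the pair is considered again."""
    exemptions.discard(frozenset((a, b)))
```

For example, after `mark_group_exempt(exemptions, [1, 2])` a later search producing the group `[1, 2]` would be skipped, and calling `remove_exemption(exemptions, 1, 2)` would let it appear again.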

Troubleshooting

Slow searches

Some users with very large numbers of exemptions have reported that this can slow down Find Duplicates searches.

If you run into this issue, please zip up your library's metadata.db and send it to me, if possible, so that I can attempt to replicate the problem and test any fixes. I don't need your underlying library book files, just that database file. You can send it to me via PM in the MobileRead forums. I have not personally encountered the situation, so without such a file I cannot tell whether any potential fix would improve it.

One way to confirm whether exemptions are the problem is to use Find Duplicates -> Customize plugin... -> View library preferences....

Library Preferences Dialog

You could copy the contents of the right-hand side into a file and save it for later if you want to keep the exemptions. Then click on Clear, then OK, and restart calibre. Then try your search again. If it performs considerably faster (albeit with all the normally exempted results now included), then we would have confirmation that this is the issue (again, please feed that info back on the forums if you can).

Unexpected library highlighting

There can be situations like a calibre crash while you were viewing Find Duplicates book results which give you some "unexpected" behavior when you restart calibre. This is due to the library highlighting feature. You can find more details on how to fix it in this post here

Donations

If you enjoy my calibre plugins or extensions, please feel free to show your appreciation!

paypal

paypal.me/kiwicalibre