Fuzzydups is a suite of programs for finding near-duplicate strings in large lists. Its goals are to be efficient and to provide a flexible Unix-style command-line interface.
I use fuzzydups to help manage tags and playlists in my music collection. Some of the tasks it helps with are:
  - Looking for near-duplicates in tags. This can reveal accidental duplicates in the collection (even when they are not byte-for-byte identical), and can also help correct typos when there are multiple versions of the same tag (for example, many tracks share the same artist, but some of the tags contain a typo).

  - Freshening stale playlists. I keep m3u-style text playlists that refer to tracks by filename, and the filenames contain the artist and title. Occasionally I correct a typo or other information in the artist or title, and the filename is updated to reflect the fix, which makes the playlist stale. Fuzzydups can efficiently find the most likely candidate for such a track.
The examples below illustrate the applications mentioned above more explicitly. See also the fuzzymunge manpages in the Documentation/ directory for more examples.
Imagine you have a large collection of music tracks, and a program to print some particular tag. You could then do something like:
find /path/to/music -name '*.flac' -print0 | xargs -0 print_tag artist | sort -u | fuzzydups
to get a list of near-duplicate pairs. (Using -print0 with xargs -0 keeps filenames containing whitespace intact.)
For freshening stale playlists, you would do something like:
find /path/to/music -name '*.flac' >tracks
fuzzymunge -b tracks -i playlist1 playlist2 playlist3
which will modify the three playlists in place, correcting any line in each playlist that does not have an exact match in the tracks file to its nearest match.
One naïve way to do near-duplicate detection is to calculate the Levenshtein edit distance between each input record and each possible match, sort the distances, and return the best match for each input record. When finding duplicates within a set of N records, this requires O(N^2) Levenshtein calculations, each of which is itself somewhat expensive.
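For concreteness, here is a sketch of that naïve approach (this is illustrative Python, not fuzzydups's actual code; the function names are made up for the example):

```python
from itertools import combinations

def levenshtein(a, b):
    """Classic dynamic-programming edit distance, counting
    substitutions, deletions, and additions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete ca
                           cur[j - 1] + 1,               # add cb
                           prev[j - 1] + (ca != cb)))    # substitute / match
        prev = cur
    return prev[-1]

def naive_near_duplicates(records, max_dist=1):
    """Compare every pair: O(N^2) full edit-distance calculations."""
    return [(a, b) for a, b in combinations(records, 2)
            if levenshtein(a, b) <= max_dist]
```

Every pair is compared from scratch, even when most records share long common prefixes; that redundancy is what the trie approach below eliminates.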
Fuzzydups takes a different approach. The core of the algorithm is a trie data structure; this structure breaks strings down into a sequence of character steps, with similar prefixes sharing nodes. For example, a trie containing "ab", "abc", and "abd" would look like:
        a        terminal=false ("a" does not exist)
        |
        b        terminal=true  ("ab" exists)
       / \
      c   d
      terminal=true    terminal=true
      ("abc" exists)   ("abd" exists)
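The structure above can be sketched with a minimal dict-based node (a hypothetical illustration; fuzzydups's real data structure may differ):

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # char -> TrieNode
        self.terminal = False  # True when a full input string ends here

def trie_insert(root, s):
    """Walk/extend one node per character, marking the last as terminal."""
    node = root
    for ch in s:
        node = node.children.setdefault(ch, TrieNode())
    node.terminal = True

root = TrieNode()
for word in ("ab", "abc", "abd"):
    trie_insert(root, word)

# The shared prefix "a" exists as a node but is not terminal;
# "ab" is terminal and branches to "c" and "d".
a = root.children["a"]
```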
We build the trie either from a set of "base" items, or incrementally from the input list. For each input line, we do a budgeted walk of the trie, spending zero "points" for an exact match, or one point for a single-character edit (i.e., a substitution, deletion, or addition). This has two advantages over the naïve solution:
  - Some of the edit-distance calculations are shared across multiple records. That is, if we have three records "foo1", "foo2", and "foo3", we traverse the "foo" part of the trie only once to match all three records.

  - We can bound the walk dynamically. If we run out of budget, we know there is no point in continuing our walk and can return early. This avoids calculating a very high edit distance that will not be of use to us. Similarly, if we have already found a match with budget N, we can stop walking other parts of the trie once our budget exceeds N.
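One common way to implement such a budgeted walk (again a sketch, not necessarily how fuzzydups does it) is to carry a Levenshtein DP row down each trie edge and abandon a subtree as soon as every cell of the row exceeds the budget:

```python
class Node:
    """Minimal trie node for the sketch."""
    def __init__(self):
        self.children = {}     # char -> Node
        self.terminal = False  # a full stored string ends here

def insert(root, s):
    node = root
    for ch in s:
        node = node.children.setdefault(ch, Node())
    node.terminal = True

def fuzzy_search(root, query, budget):
    """Return (string, distance) for every stored string within
    `budget` edits of `query`. Each trie edge extends one DP row;
    a subtree is pruned once no cell can still lead to a match."""
    results = []

    def walk(node, prefix, prev_row):
        if node.terminal and prev_row[-1] <= budget:
            results.append((prefix, prev_row[-1]))
        for ch, child in node.children.items():
            row = [prev_row[0] + 1]  # trie char consumed against empty query prefix
            for j in range(1, len(query) + 1):
                row.append(min(prev_row[j] + 1,      # drop the trie char
                               row[j - 1] + 1,       # drop the query char
                               prev_row[j - 1] + (ch != query[j - 1])))  # substitute/match
            if min(row) <= budget:   # some continuation may still fit the budget
                walk(child, prefix + ch, row)

    walk(root, "", list(range(len(query) + 1)))
    return results

root = Node()
for word in ("apple", "apply", "angle"):
    insert(root, word)

# "angle" is pruned early; "apple" and "apply" each match with one edit.
matches = sorted(fuzzy_search(root, "appla", 1))
```

Note how "apple" and "apply" share the work for their common "appl" prefix: the first four DP rows are computed once and reused for both candidates.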
To build from the git repository, you will need the mfm tool, which you can get here: http://github.com/peff/mfm

  mfm
  make
  make install
Asciidoc documentation is in the Documentation/ directory, and can be read as-is or built into manpages with "make doc" from the top-level.
Send questions or comments to firstname.lastname@example.org.