jbktree provides a generic BK-tree implemented as a java.util.Collection
. A BK-tree is a kind of metric tree designed for use in discrete metric spaces. BK-trees are generally used to efficiently conduct k-nearest neighbor searches.
A common use case for BK-trees (but certainly not the only use case) is "fuzzy matching" for strings, or "spell-checking." In this use case, we add a list of known words into a BK-tree, and can then search the tree for words that are within a certain edit distance of a query term. To demonstrate this use case, we can start by loading a list of words into a List
.
final List<String> words;
// Under macOS, /usr/share/dict/words contains a list of 235,886 English words
try (final BufferedReader reader = new BufferedReader(new FileReader("/usr/share/dict/words"))) {
words = reader.lines().collect(Collectors.toList());
}
Next, we define the distance function we'd like to use to calculate the edit distance between words in the list. In this case, we'll use the Levenshtein distance implementation from Commons Text:
final DiscreteDistanceFunction<String> distanceFunction = (first, second) ->
LevenshteinDistance.getDefaultInstance().apply(first, second);
With a distance function and a collection of words, constructing a BK-tree is staightforward:
final BKTree<String> bkTree = new BKTree<>(distanceFunction, words);
Now, let's say we have a word that we think is misspelled ("exaple"
), and we want to find some possible replacements. We can search the BK-tree for all other words that are within a certain edit distance ("radius") of our query term to get some suggestions:
final PriorityQueue<String> results = bkTree.getNearestNeighbors("exaple", 2);
That gives us the following results:
Neighbor | Distance |
---|---|
example | 1 |
hexapla | 2 |
vexable | 2 |
exciple | 2 |
exile | 2 |
exhale | 2 |
staple | 2 |
epaule | 2 |
enable | 2 |
elapse | 2 |
eagle | 2 |
saple | 2 |
maple | 2 |
exalt | 2 |