Skip to content

lispc/EditDistanceClusterer

Repository files navigation

EditDistanceJoiner

EditDistanceJoiner is a java library develop by database group of Tsinghua University which can help you (a) select similar string pairs and (b) get similar string clusters among lots of strings based on similarity measured by edit distance very effiently.

How does it work?

This library is based on a method called PassJoin proposed on VLDB2012, which is proved to be orders of magnitude faster than previous methods. The library can handle a dataset in 2 minutes which costs 70 minutes by naive brute force implementation used in simile-vicino, besides, unlike simile-vicino which uses blocking methods to speed up clustering with the loss of accuracy, this library can generate accurate results.

Usage

This library use similar interface with simile-vicino. You can have a look at the samples in joining and clustering at EditDistanceClustererTest and EditDistanceJoinerTest

About

Edit-distance-based similar string joiner and clusterer

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published