Basic similarity search for Apache Drill
Java
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
sample-data
src
.gitignore
LICENSE
README.md
pom.xml

README.md

drill-fuzzy-search

drill-fuzzy-search is a plugin for Apache Drill that supports simple similarity and distance search.

What is supported:

It's based on SimMetrics library which you may refer to for more informations: SimMetrics

Installation

Clone/download the sourcecode and run:

$ mvn clean package

then copy target/drill-fuzzy-search* jars to drill's jars/3rdparty directory. Also ensure that to copy simmetrics-core-3.2.3.jar in the same place.

You can download simmetrics dependency manually from maven repository - simmetrics or copy it from your local maven repo.

After building the code you can start Drill with:

$ bin/drill-embedded

You can also use my fork of Apache Drill with drill-gis - drill-fuzzysearch branch.

Usage

Sample dataset

There is a test CSV dataset included. You can copy it to Drill's sample-data directory.

The structure of the CSV is as follows:

text A, text B, similarity level

Following examples are based on queries to sample file which is embedded in drill-fuzzy-search jar file (classpath) i.e.:

select * from cp.`sample-data/similarities.csv`;

but you can also query dataset from filesystem:

select * from dfs.deault.`/home/k255/drill/sample-data/similarities.csv`;

Fuzzy queries

Comparison of texts with different metrics:

select columns[0] as textA, columns[1] as textB, 
    levenshtein(columns[0], columns[1]) as distance
    from cp.`sample-data/similarities.csv`;
select columns[0] as textA, columns[1] as textB,
    levenshtein(columns[0], columns[1]) as distance
    from cp.`sample-data/similarities.csv`
    where levenshtein(columns[0], columns[1]) > 0.7;
select columns[0] as textA, columns[1] as textB, 
    jaro(columns[0], columns[1]) as distance
    from cp.`sample-data/similarities.csv`;
select columns[0] as textA, columns[1] as textB, 
    cosine_similarity(columns[0], columns[1]) as distance
    from cp.`sample-data/similarities.csv`;

See also

Feature request - https://issues.apache.org/jira/browse/DRILL-3747

Pull request - https://github.com/apache/drill/pull/224

Author

Karol Potocki

License


Apache 2.0 License