remove Hamming, create StringDistance
matthieugomez committed Dec 12, 2019
1 parent 538c379 commit 82d5f3b
Showing 10 changed files with 128 additions and 176 deletions.
59 changes: 24 additions & 35 deletions README.md
@@ -6,16 +6,14 @@ This Julia package computes various distances between `AbstractString`s
## Installation
The package is registered in the [`General`](https://github.com/JuliaRegistries/General) registry and so can be installed at the REPL with `] add StringDistances`.

## Syntax
## Compare
The function `compare` returns a similarity score between two strings. The function always returns a score between 0 and 1, with a value of 0 being completely different and a value of 1 being completely similar. Its syntax is:

```julia
compare(::AbstractString, ::AbstractString, ::PreMetric = TokenMax(Levenshtein()))
compare(::AbstractString, ::AbstractString, ::StringDistance)
```
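As a rough illustration of what such a score means, a normalized edit distance can be turned into a similarity in exactly this way. The sketch below is a plain dynamic-programming Levenshtein with a simple normalization, not the package's optimized implementation:

```julia
# Plain DP Levenshtein distance (a sketch, not the package's implementation).
function levenshtein(a::AbstractString, b::AbstractString)
    s, t = collect(a), collect(b)
    m, n = length(s), length(t)
    prev = collect(0:n)               # row for the empty prefix of `a`
    for i in 1:m
        curr = similar(prev)
        curr[1] = i
        for j in 1:n
            cost = s[i] == t[j] ? 0 : 1
            curr[j + 1] = min(prev[j + 1] + 1,  # deletion
                              curr[j] + 1,      # insertion
                              prev[j] + cost)   # substitution
        end
        prev = curr
    end
    prev[n + 1]
end

# One way to map a distance onto a 0-1 similarity score.
similarity(a, b) = (isempty(a) && isempty(b)) ? 1.0 :
    1.0 - levenshtein(a, b) / max(length(a), length(b))
```

With this normalization, identical strings score `1.0` and fully different equal-length strings score `0.0`.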

## Distances
- Edit Distances
- [Hamming Distance](https://en.wikipedia.org/wiki/Hamming_distance) `Hamming()`
- [Jaro Distance](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance) `Jaro()`
- [Levenshtein Distance](https://en.wikipedia.org/wiki/Levenshtein_distance) `Levenshtein()`
- [Damerau-Levenshtein Distance](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance) `DamerauLevenshtein()`
@@ -35,38 +33,35 @@ compare(::AbstractString, ::AbstractString, ::PreMetric = TokenMax(Levenshtein()
- [TokenSet](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) adjusts for differences in word orders and word numbers by comparing the intersection of two strings with each string.
- [TokenMax](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) combines scores using the base distance, the `Partial`, `TokenSort` and `TokenSet` modifiers, with penalty terms depending on string lengths.

```julia
compare("martha", "marhta", Jaro())
compare("martha", "marhta", Winkler(Jaro()))
compare("william", "williams", QGram(2))
compare("william", "williams", Winkler(QGram(2)))
compare("New York Yankees", "Yankees", Levenshtein())
compare("New York Yankees", "Yankees", Partial(Levenshtein()))
compare("mariners vs angels", "los angeles angels at seattle mariners", Jaro())
compare("mariners vs angels", "los angeles angels at seattle mariners", TokenSet(Jaro()))
compare("mariners vs angels", "los angeles angels at seattle mariners", TokenMax(RatcliffObershelp()))
```
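The `TokenSort` modifier listed above can be sketched in a few lines: reorder the words alphabetically, then hand the result to the base distance (whitespace tokenization is an assumption of this sketch):

```julia
# Sketch of the TokenSort preprocessing step: word order stops mattering
# once both strings are sorted word-by-word.
token_sort(s::AbstractString) = join(sort(split(s)), " ")

token_sort("mariners vs angels")  # "angels mariners vs"
```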

## Find
`find_best` returns the index of the element with the highest similarity score.
It returns `nothing` if all elements have a similarity score below `min_score` (defaults to 0.0).
Some examples:
```julia
find_best("New York", ["NewYork", "Newark", "San Francisco"], Levenshtein())
#> 1
```

`find_all` returns the indices of the elements with a similarity score higher than a minimum value (defaults to 0.8).
A good distance for linking addresses etc. (where word order does not matter) is `TokenMax(Levenshtein())`.

```julia
find_all("New York", ["NewYork", "Newark", "San Francisco"], Levenshtein(); min_score = 0.8)
#> 1-element Array{Int64,1}:
#>  1
```
## Find
- `findmax` returns the value and index of the element in `iter` with the highest similarity score with `x`. Its syntax is:
```julia
findmax(x::AbstractString, iter::AbstractVector, dist::StringDistance)
```

- `findall` returns the indices of all elements in `iter` with a similarity score with `x` higher than a minimum value (defaults to 0.8). Its syntax is:
```julia
findall(x::AbstractString, iter::AbstractVector, dist::StringDistance)
```

The functions `findmax` and `findall` are particularly optimized for `Levenshtein` and `DamerauLevenshtein` distances (as well as their modifications via `Partial`, `TokenSort`, `TokenSet`, or `TokenMax`).
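A generic, unoptimized version of these lookups can be sketched against any similarity function (here `sim` stands in for `(a, b) -> compare(a, b, dist)`; the helper names are illustrative):

```julia
# Sketch: findmax/findall-style lookups over a vector, given a similarity
# function. The package's versions add early-exit optimizations on top.
function best_match(sim, x, iter; min_score = 0.0)
    score, i = findmax([sim(x, y) for y in iter])
    score >= min_score ? (score, i) : nothing
end

all_matches(sim, x, iter; min_score = 0.8) =
    findall(y -> sim(x, y) >= min_score, iter)

exact = (a, b) -> a == b ? 1.0 : 0.0   # trivial similarity for illustration
best_match(exact, "ab", ["cd", "ab"])       # (1.0, 2)
all_matches(exact, "ab", ["ab", "cd", "ab"])  # [1, 3]
```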

## Evaluate
The function `compare` returns a similarity score: a value of 0 means completely different and a value of 1 means completely similar. In contrast, the function `evaluate` returns the literal distance between two strings, with a value of 0 meaning the strings are identical. Some distances are bounded between 0 and 1, while others are unbounded.

```julia
evaluate(Levenshtein(), "New York", "New York")
#> 0
```
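The distance-versus-similarity relationship can be made concrete with a toy character-mismatch count (an illustration only, not one of the package's distances):

```julia
# Toy distance: mismatched characters position-by-position, plus the
# length difference (zip truncates to the shorter string).
toy_distance(a, b) = count(((x, y),) -> x != y, zip(a, b)) + abs(length(a) - length(b))

toy_similarity(a, b) = (isempty(a) && isempty(b)) ? 1.0 :
    1.0 - toy_distance(a, b) / max(length(a), length(b))

toy_distance("New York", "New York")    # 0: a distance of 0 means identical
toy_similarity("New York", "New York")  # 1.0: a similarity of 1 means identical
```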

## Which distance should I use?

As a rule of thumb,
- Standardize strings before comparing them (case, whitespace, accents, abbreviations...)
- The distance `TokenMax(Levenshtein())` (the default for `compare`) is a good choice for linking sequences of words (addresses, names) across datasets (see [`fuzzywuzzy`](https://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/))
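A minimal standardization pass along those lines might look like this (the exact rules, such as which characters to strip or how to expand abbreviations, are application-specific assumptions):

```julia
# Minimal string standardization before comparison: lowercase, trim,
# and collapse runs of whitespace. Accent and abbreviation handling
# would be extra, application-specific steps.
standardize(s::AbstractString) = replace(lowercase(strip(s)), r"\s+" => " ")

standardize("  New   YORK ")  # "new york"
```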

## References
- [The stringdist Package for Approximate String Matching](https://journal.r-project.org/archive/2014-1/loo.pdf) Mark P.J. van der Loo
- [fuzzywuzzy](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/)
19 changes: 18 additions & 1 deletion benchmark/.sublime2Terminal.jl
@@ -1 +1,18 @@
@time find_all(x[1], y, TokenMax(DamerauLevenshtein()))




# check
function h(t, x, y; min_score = 1/3)
    out = fill(false, length(x))
    for i in eachindex(x)
        if compare(x[i], y[i], t) < min_score
            out[i] = compare(x[i], y[i], t; min_score = min_score) ≈ 0.0
        else
            out[i] = compare(x[i], y[i], t; min_score = min_score) ≈ compare(x[i], y[i], t)
        end
    end
    all(out)
end
h(Levenshtein(), x, y)
h(DamerauLevenshtein(), x, y)
25 changes: 11 additions & 14 deletions benchmark/benchmark.jl
@@ -8,8 +8,6 @@ function f(t, x, y; min_score = 0.0)
[compare(x[i], y[i], t; min_score = min_score) for i in 1:length(x)]
end

@time f(Hamming(), x, y)
#0.05s
@time f(Jaro(), x, y)
#0.3s
@time f(Levenshtein(), x, y)
@@ -23,27 +21,26 @@ end



@time find_best(x[1], y, Levenshtein())
# 0.41
@time find_best(x[1], y, DamerauLevenshtein())
# 0.41
@time findmax(x[1], y, Levenshtein())
# 0.14
@time findmax(x[1], y, DamerauLevenshtein())
# 0.15

@time find_all(x[1], y, Levenshtein())
@time findall(x[1], y, Levenshtein())
# 0.06
@time find_all(x[1], y, DamerauLevenshtein())
@time findall(x[1], y, DamerauLevenshtein())
# 0.05
@time find_all(x[1], y, Partial(DamerauLevenshtein()))
@time findall(x[1], y, Partial(DamerauLevenshtein()))
# 0.9

@time find_all(x[1], y, TokenSort(DamerauLevenshtein()))
@time findall(x[1], y, TokenSort(DamerauLevenshtein()))
# 0.27
@time find_all(x[1], y, TokenSet(DamerauLevenshtein()))
# 0.8
@time find_all(x[1], y, TokenMax(DamerauLevenshtein()))
@time findall(x[1], y, TokenSet(DamerauLevenshtein()))
# 0.74
@time findall(x[1], y, TokenMax(DamerauLevenshtein()))
# 2.25


# 1.6s slower compared to StringDist



44 changes: 21 additions & 23 deletions src/StringDistances.jl
@@ -6,17 +6,30 @@ using Distances
import Distances: evaluate, result_type
using DataStructures # for SortedSet in TokenSort

##############################################################################
##
## include
##
##############################################################################
abstract type StringDistance <: SemiMetric end
include("utils.jl")
include("edit.jl")
include("qgram.jl")
include("compare.jl")
include("find.jl")

function result_type(m::StringDistance, a::AbstractString, b::AbstractString)
typeof(evaluate(m, oneunit(a), oneunit(b)))
end

##############################################################################
##
## Export
##
##############################################################################

export
evaluate,
compare,
result_type,
Hamming,
StringDistance,
Levenshtein,
DamerauLevenshtein,
Jaro,
@@ -31,25 +44,10 @@ Partial,
TokenSort,
TokenSet,
TokenMax,
qgram,
find_best,
find_all

##############################################################################
##
## include
##
##############################################################################
include("utils.jl")
include("edit.jl")
include("qgram.jl")
include("compare.jl")
include("find.jl")

function result_type(m::Union{Hamming, Jaro, Levenshtein, DamerauLevenshtein, RatcliffObershelp, AbstractQGramDistance, Winkler, Partial, TokenSort, TokenSet, TokenMax}, a::AbstractString, b::AbstractString)
typeof(evaluate(m, oneunit(a), oneunit(b)))
end

evaluate,
compare,
result_type,
qgram
end

##############################################################################
65 changes: 27 additions & 38 deletions src/compare.jl
@@ -6,36 +6,16 @@
##
##############################################################################
"""
compare(s1::AbstractString, s2::AbstractString, dist::PreMetric = TokenMax(Levenshtein()))
compare(s1::AbstractString, s2::AbstractString, dist::StringDistance)
compare returns a similarity score between the strings `s1` and `s2` based on the distance `dist`.
"""
function compare(s1::Union{AbstractString, Missing}, s2::Union{AbstractString, Missing}, dist::Hamming; min_score = 0.0)
(ismissing(s1) | ismissing(s2)) && return missing
s1, s2 = reorder(s1, s2)
len1, len2 = length(s1), length(s2)
len2 == 0 && return 1.0
1.0 - evaluate(dist, s1, s2) / len2
end

function compare(s1::Union{AbstractString, Missing}, s2::Union{AbstractString, Missing}, dist::Union{Jaro, RatcliffObershelp}; min_score = 0.0)
(ismissing(s1) | ismissing(s2)) && return missing
1.0 - evaluate(dist, s1, s2)
end

function compare(s1::Union{AbstractString, Missing}, s2::Union{AbstractString, Missing}, dist::AbstractQGramDistance; min_score = 0.0)
(ismissing(s1) | ismissing(s2)) && return missing
# When string length < q for qgram distance, returns s1 == s2
s1, s2 = reorder(s1, s2)
len1, len2 = length(s1), length(s2)
len1 <= dist.q - 1 && return convert(Float64, s1 == s2)
if typeof(dist) <: QGram
1.0 - evaluate(dist, s1, s2) / (len1 + len2 - 2 * dist.q + 2)
else
1.0 - evaluate(dist, s1, s2)
end
end

function compare(s1::Union{AbstractString, Missing}, s2::Union{AbstractString, Missing}, dist::Union{Levenshtein, DamerauLevenshtein}; min_score = 0.0)
(ismissing(s1) | ismissing(s2)) && return missing
s1, s2 = reorder(s1, s2)
@@ -51,8 +31,17 @@ function compare(s1::Union{AbstractString, M
end
end

function compare(s1::Union{AbstractString, Missing}, s2::Union{AbstractString, Missing})
compare(s1, s2, TokenMax(Levenshtein()))
function compare(s1::Union{AbstractString, Missing}, s2::Union{AbstractString, Missing}, dist::QGramDistance; min_score = 0.0)
(ismissing(s1) | ismissing(s2)) && return missing
# When string length < q for qgram distance, returns s1 == s2
s1, s2 = reorder(s1, s2)
len1, len2 = length(s1), length(s2)
len1 <= dist.q - 1 && return convert(Float64, s1 == s2)
if typeof(dist) <: QGram
1.0 - evaluate(dist, s1, s2) / (len1 + len2 - 2 * dist.q + 2)
else
1.0 - evaluate(dist, s1, s2)
end
end

##############################################################################
@@ -61,11 +50,11 @@ end
##
##############################################################################
"""
Winkler(dist::Premetric, p::Real = 0.1, boosting_threshold::Real = 0.7, l::Integer = 4)
Winkler(dist::StringDistance, p::Real = 0.1, boosting_threshold::Real = 0.7, l::Integer = 4)
Winkler is a `PreMetric` modifier that boosts the similarity score between two strings by a scale `p` when the strings share a common prefix with length lower than `l` (the boost is only applied when the similarity score is above `boosting_threshold`)
Winkler is a `StringDistance` modifier that boosts the similarity score between two strings by a scale `p` when the strings share a common prefix with length lower than `l` (the boost is only applied when the similarity score is above `boosting_threshold`)
"""
struct Winkler{T1 <: PreMetric, T2 <: Real, T3 <: Real, T4 <: Integer} <: PreMetric
struct Winkler{T1 <: StringDistance, T2 <: Real, T3 <: Real, T4 <: Integer} <: StringDistance
dist::T1
p::T2 # scaling factor. Default to 0.1
boosting_threshold::T3 # boost threshold. Default to 0.7
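For reference, the boost itself (elided from this hunk) follows the standard Jaro-Winkler form; the sketch below assumes `score` is the base similarity and `prefix_len` is the length of the common prefix:

```julia
# Sketch of the Winkler boost (standard Jaro-Winkler form, assumed here):
# the boost only kicks in above the boosting threshold, and the common
# prefix counts for at most `l` characters.
function winkler_boost(score, prefix_len; p = 0.1, boosting_threshold = 0.7, l = 4)
    score < boosting_threshold && return score
    score + min(prefix_len, l) * p * (1.0 - score)
end

winkler_boost(0.6, 3)  # 0.6: below the threshold, unchanged
winkler_boost(0.9, 3)  # boosted by 3 * 0.1 * (1 - 0.9)
```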
@@ -98,11 +87,11 @@ JaroWinkler() = Winkler(Jaro(), 0.1, 0.7)
##
##############################################################################
"""
Partial(dist::Premetric)
Partial(dist::StringDistance)
Partial is a `PreMetric` modifier that returns the maximal similarity score between the shorter string and substrings of the longer string
Partial is a `StringDistance` modifier that returns the maximal similarity score between the shorter string and substrings of the longer string
"""
struct Partial{T <: PreMetric} <: PreMetric
struct Partial{T <: StringDistance} <: StringDistance
dist::T
end
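The idea can be sketched directly: slide the shorter string across the longer one and keep the best window score (an ASCII-only sketch; `sim` is any similarity function):

```julia
# Sketch of Partial: best similarity between the shorter string and every
# window of equal length in the longer string (assumes 1-byte characters).
function partial_best(sim, s1::AbstractString, s2::AbstractString)
    length(s1) > length(s2) && ((s1, s2) = (s2, s1))
    n = length(s1)
    maximum(sim(s1, s2[i:i + n - 1]) for i in 1:(length(s2) - n + 1))
end

exact = (a, b) -> a == b ? 1.0 : 0.0
partial_best(exact, "Yankees", "New York Yankees")  # 1.0: exact substring match
```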

@@ -153,11 +142,11 @@ end
##
##############################################################################
"""
TokenSort(dist::Premetric)
TokenSort(dist::StringDistance)
TokenSort is a `PreMetric` modifier that adjusts for differences in word orders by reordering words alphabetically.
TokenSort is a `StringDistance` modifier that adjusts for differences in word orders by reordering words alphabetically.
"""
struct TokenSort{T <: PreMetric} <: PreMetric
struct TokenSort{T <: StringDistance} <: StringDistance
dist::T
end

@@ -175,11 +164,11 @@ end
##
##############################################################################
"""
TokenSet(dist::Premetric)
TokenSet(dist::StringDistance)
TokenSet is a `PreMetric` modifier that adjusts for differences in word orders and word numbers by comparing the intersection of two strings with each string.
TokenSet is a `StringDistance` modifier that adjusts for differences in word orders and word numbers by comparing the intersection of two strings with each string.
"""
struct TokenSet{T <: PreMetric} <: PreMetric
struct TokenSet{T <: StringDistance} <: StringDistance
dist::T
end
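The comparisons TokenSet performs can be sketched as follows (whitespace tokenization and the max-of-three combination are assumed here, loosely following fuzzywuzzy's token_set_ratio):

```julia
# Sketch of TokenSet: anchor on the sorted intersection of the word sets,
# then take the best of the three pairwise scores.
function token_set_score(sim, s1::AbstractString, s2::AbstractString)
    w1, w2 = Set(split(s1)), Set(split(s2))
    sorted(ws) = join(sort(collect(ws)), " ")
    inter = sorted(intersect(w1, w2))
    full1 = strip(inter * " " * sorted(setdiff(w1, w2)))
    full2 = strip(inter * " " * sorted(setdiff(w2, w1)))
    max(sim(inter, full1), sim(inter, full2), sim(full1, full2))
end

exact = (a, b) -> a == b ? 1.0 : 0.0
token_set_score(exact, "mariners vs angels", "angels vs mariners")  # 1.0
```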

@@ -207,11 +196,11 @@ end
##
##############################################################################
"""
TokenMax(dist::Premetric)
TokenMax(dist::StringDistance)
TokenMax is a `PreMetric` modifier that combines similarity scores using the base distance, its Partial, TokenSort and TokenSet modifiers, with penalty terms depending on string lengths.
TokenMax is a `StringDistance` modifier that combines similarity scores using the base distance, its Partial, TokenSort and TokenSet modifiers, with penalty terms depending on string lengths.
"""
struct TokenMax{T <: PreMetric} <: PreMetric
struct TokenMax{T <: StringDistance} <: StringDistance
dist::T
end
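The combination TokenMax performs can be sketched as a penalized maximum (the weights below are illustrative assumptions, loosely following fuzzywuzzy's WRatio, not the package's exact values):

```julia
# Sketch of TokenMax: each candidate score is damped by a penalty weight,
# and the best damped score wins.
token_max_score(base, partial, tsort, tset; unbase_scale = 0.95, partial_scale = 0.9) =
    max(base,
        partial * partial_scale,
        tsort * unbase_scale * partial_scale,
        tset * unbase_scale * partial_scale)

token_max_score(0.5, 0.9, 0.8, 0.7)  # max(0.5, 0.81, 0.684, 0.5985)
```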


2 comments on commit 82d5f3b

@matthieugomez
Owner Author


@JuliaRegistrator register()

@JuliaRegistrator


Registration pull request updated: JuliaRegistries/General/6617

After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.

This will be done automatically if Julia TagBot is installed, or can be done manually through the github interface, or via:

git tag -a v0.5.0 -m "<description of version>" 82d5f3bc91e896d67c6fc1bc6ce8cce0a4cf546b
git push origin v0.5.0
