remove Hamming, create StringDistance
matthieugomez committed Dec 12, 2019
1 parent 538c379 commit 82d5f3b
Showing 10 changed files with 128 additions and 176 deletions.
59 changes: 24 additions & 35 deletions README.md
@@ -6,16 +6,14 @@ This Julia package computes various distances between `AbstractString`s
## Installation
The package is registered in the [`General`](https://github.com/JuliaRegistries/General) registry and so can be installed at the REPL with `] add StringDistances`.

## Syntax
## Compare
The function `compare` returns a similarity score between two strings. The function always returns a score between 0 and 1, with a value of 0 being completely different and a value of 1 being completely similar. Its syntax is:

```julia
compare(::AbstractString, ::AbstractString, ::PreMetric = TokenMax(Levenshtein()))
compare(::AbstractString, ::AbstractString, ::StringDistance)
```
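As a rough illustration of what such a score means, a normalized edit distance can be turned into a similarity in exactly this way. The sketch below is a plain dynamic-programming Levenshtein with a simple normalization, not the package's optimized implementation:

```julia
# Plain DP Levenshtein distance (a sketch, not the package's implementation).
function levenshtein(a::AbstractString, b::AbstractString)
    s, t = collect(a), collect(b)
    m, n = length(s), length(t)
    prev = collect(0:n)               # row for the empty prefix of `a`
    for i in 1:m
        curr = similar(prev)
        curr[1] = i
        for j in 1:n
            cost = s[i] == t[j] ? 0 : 1
            curr[j + 1] = min(prev[j + 1] + 1,  # deletion
                              curr[j] + 1,      # insertion
                              prev[j] + cost)   # substitution
        end
        prev = curr
    end
    prev[n + 1]
end

# One way to map a distance onto a 0-1 similarity score.
similarity(a, b) = (isempty(a) && isempty(b)) ? 1.0 :
    1.0 - levenshtein(a, b) / max(length(a), length(b))
```

With this normalization, identical strings score `1.0` and fully different equal-length strings score `0.0`.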

## Distances
- Edit Distances
- [Hamming Distance](https://en.wikipedia.org/wiki/Hamming_distance) `Hamming()`
- [Jaro Distance](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance) `Jaro()`
- [Levenshtein Distance](https://en.wikipedia.org/wiki/Levenshtein_distance) `Levenshtein()`
- [Damerau-Levenshtein Distance](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance) `DamerauLevenshtein()`
@@ -35,38 +33,35 @@ compare(::AbstractString, ::AbstractString, ::PreMetric = TokenMax(Levenshtein()
- [TokenSet](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) adjusts for differences in word orders and word numbers by comparing the intersection of two strings with each string.
- [TokenMax](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) combines scores using the base distance, the `Partial`, `TokenSort` and `TokenSet` modifiers, with penalty terms depending on string lengths.

```julia
compare("martha", "marhta", Jaro())
compare("martha", "marhta", Winkler(Jaro()))
compare("william", "williams", QGram(2))
compare("william", "williams", Winkler(QGram(2)))
compare("New York Yankees", "Yankees", Levenshtein())
compare("New York Yankees", "Yankees", Partial(Levenshtein()))
compare("mariners vs angels", "los angeles angels at seattle mariners", Jaro())
compare("mariners vs angels", "los angeles angels at seattle mariners", TokenSet(Jaro()))
compare("mariners vs angels", "los angeles angels at seattle mariners", TokenMax(RatcliffObershelp()))
```
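The `TokenSort` modifier listed above can be sketched in a few lines: reorder the words alphabetically, then hand the result to the base distance (whitespace tokenization is an assumption of this sketch):

```julia
# Sketch of the TokenSort preprocessing step: word order stops mattering
# once both strings are sorted word-by-word.
token_sort(s::AbstractString) = join(sort(split(s)), " ")

token_sort("mariners vs angels")  # "angels mariners vs"
```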

## Find
`find_best` returns the index of the element with the highest similarity score.
It returns `nothing` if all elements have a similarity score below `min_score` (defaults to 0.0).
Some examples:
```julia
find_best("New York", ["NewYork", "Newark", "San Francisco"], Levenshtein())
#> 1
```

`find_all` returns the indices of the elements with a similarity score higher than a minimum value (defaults to 0.8).
A good distance for linking addresses etc. (where word order does not matter) is `TokenMax(Levenshtein())`.

```julia
find_all("New York", ["NewYork", "Newark", "San Francisco"], Levenshtein(); min_score = 0.8)
#> 1-element Array{Int64,1}:
#>  1
```
## Find
- `findmax` returns the value and index of the element in `iter` with the highest similarity score with `x`. Its syntax is:
```julia
findmax(x::AbstractString, iter::AbstractVector, dist::StringDistance)
```

- `findall` returns the indices of all elements in `iter` with a similarity score with `x` higher than a minimum value (defaults to 0.8). Its syntax is:
```julia
findall(x::AbstractString, iter::AbstractVector, dist::StringDistance)
```

The functions `findmax` and `findall` are particularly optimized for `Levenshtein` and `DamerauLevenshtein` distances (as well as their modifications via `Partial`, `TokenSort`, `TokenSet`, or `TokenMax`).
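A generic, unoptimized version of these lookups can be sketched against any similarity function (here `sim` stands in for `(a, b) -> compare(a, b, dist)`; the helper names are illustrative):

```julia
# Sketch: findmax/findall-style lookups over a vector, given a similarity
# function. The package's versions add early-exit optimizations on top.
function best_match(sim, x, iter; min_score = 0.0)
    score, i = findmax([sim(x, y) for y in iter])
    score >= min_score ? (score, i) : nothing
end

all_matches(sim, x, iter; min_score = 0.8) =
    findall(y -> sim(x, y) >= min_score, iter)

exact = (a, b) -> a == b ? 1.0 : 0.0   # trivial similarity for illustration
best_match(exact, "ab", ["cd", "ab"])       # (1.0, 2)
all_matches(exact, "ab", ["ab", "cd", "ab"])  # [1, 3]
```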

## Evaluate
The function `compare` returns a similarity score: a value of 0 means completely different and a value of 1 means completely similar. In contrast, the function `evaluate` returns the literal distance between two strings, with a value of 0 meaning the strings are identical. Some distances are bounded between 0 and 1, while others are unbounded.

```julia
evaluate(Levenshtein(), "New York", "New York")
#> 0
```
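The distance-versus-similarity relationship can be made concrete with a toy character-mismatch count (an illustration only, not one of the package's distances):

```julia
# Toy distance: mismatched characters position-by-position, plus the
# length difference (zip truncates to the shorter string).
toy_distance(a, b) = count(((x, y),) -> x != y, zip(a, b)) + abs(length(a) - length(b))

toy_similarity(a, b) = (isempty(a) && isempty(b)) ? 1.0 :
    1.0 - toy_distance(a, b) / max(length(a), length(b))

toy_distance("New York", "New York")    # 0: a distance of 0 means identical
toy_similarity("New York", "New York")  # 1.0: a similarity of 1 means identical
```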

## Which distance should I use?

As a rule of thumb,
- Standardize strings before comparing them (case, whitespace, accents, abbreviations...)
- The distance `TokenMax(Levenshtein())` (the default for `compare`) is a good choice for linking sequences of words (addresses, names) across datasets (see [`fuzzywuzzy`](https://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/))
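A minimal standardization pass along those lines might look like this (the exact rules, such as which characters to strip or how to expand abbreviations, are application-specific assumptions):

```julia
# Minimal string standardization before comparison: lowercase, trim,
# and collapse runs of whitespace. Accent and abbreviation handling
# would be extra, application-specific steps.
standardize(s::AbstractString) = replace(lowercase(strip(s)), r"\s+" => " ")

standardize("  New   YORK ")  # "new york"
```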

## References
- [The stringdist Package for Approximate String Matching](https://journal.r-project.org/archive/2014-1/loo.pdf) Mark P.J. van der Loo
- [fuzzywuzzy](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/)
19 changes: 18 additions & 1 deletion benchmark/.sublime2Terminal.jl
@@ -1 +1,18 @@
@time find_all(x[1], y, TokenMax(DamerauLevenshtein()))




# check
function h(t, x, y; min_score = 1/3)
    out = fill(false, length(x))
    for i in eachindex(x)
        if compare(x[i], y[i], t) < min_score
            out[i] = compare(x[i], y[i], t; min_score = min_score) ≈ 0.0
        else
            out[i] = compare(x[i], y[i], t; min_score = min_score) ≈ compare(x[i], y[i], t)
        end
    end
    all(out)
end
h(Levenshtein(), x, y)
h(DamerauLevenshtein(), x, y)
25 changes: 11 additions & 14 deletions benchmark/benchmark.jl
@@ -8,8 +8,6 @@ function f(t, x, y; min_score = 0.0)
[compare(x[i], y[i], t; min_score = min_score) for i in 1:length(x)]
end

@time f(Hamming(), x, y)
#0.05s
@time f(Jaro(), x, y)
#0.3s
@time f(Levenshtein(), x, y)
@@ -23,27 +21,26 @@ end



@time find_best(x[1], y, Levenshtein())
# 0.41
@time find_best(x[1], y, DamerauLevenshtein())
# 0.41
@time findmax(x[1], y, Levenshtein())
# 0.14
@time findmax(x[1], y, DamerauLevenshtein())
# 0.15

@time find_all(x[1], y, Levenshtein())
@time findall(x[1], y, Levenshtein())
# 0.06
@time find_all(x[1], y, DamerauLevenshtein())
@time findall(x[1], y, DamerauLevenshtein())
# 0.05
@time find_all(x[1], y, Partial(DamerauLevenshtein()))
@time findall(x[1], y, Partial(DamerauLevenshtein()))
# 0.9

@time find_all(x[1], y, TokenSort(DamerauLevenshtein()))
@time findall(x[1], y, TokenSort(DamerauLevenshtein()))
# 0.27
@time find_all(x[1], y, TokenSet(DamerauLevenshtein()))
# 0.8
@time find_all(x[1], y, TokenMax(DamerauLevenshtein()))
@time findall(x[1], y, TokenSet(DamerauLevenshtein()))
# 0.74
@time findall(x[1], y, TokenMax(DamerauLevenshtein()))
# 2.25


# 1.6s slower compared to StringDist



44 changes: 21 additions & 23 deletions src/StringDistances.jl
@@ -6,17 +6,30 @@ using Distances
import Distances: evaluate, result_type
using DataStructures # for SortedSet in TokenSort

##############################################################################
##
## include
##
##############################################################################
abstract type StringDistance <: SemiMetric end
include("utils.jl")
include("edit.jl")
include("qgram.jl")
include("compare.jl")
include("find.jl")

function result_type(m::StringDistance, a::AbstractString, b::AbstractString)
typeof(evaluate(m, oneunit(a), oneunit(b)))
end

##############################################################################
##
## Export
##
##############################################################################

export
evaluate,
compare,
result_type,
Hamming,
StringDistance,
Levenshtein,
DamerauLevenshtein,
Jaro,
@@ -31,25 +44,10 @@ Partial,
TokenSort,
TokenSet,
TokenMax,
qgram,
find_best,
find_all

##############################################################################
##
## include
##
##############################################################################
include("utils.jl")
include("edit.jl")
include("qgram.jl")
include("compare.jl")
include("find.jl")

function result_type(m::Union{Hamming, Jaro, Levenshtein, DamerauLevenshtein, RatcliffObershelp, AbstractQGramDistance, Winkler, Partial, TokenSort, TokenSet, TokenMax}, a::AbstractString, b::AbstractString)
typeof(evaluate(m, oneunit(a), oneunit(b)))
end

evaluate,
compare,
result_type,
qgram
end

##############################################################################
65 changes: 27 additions & 38 deletions src/compare.jl
@@ -6,36 +6,16 @@
##
##############################################################################
"""
compare(s1::AbstractString, s2::AbstractString, dist::PreMetric = TokenMax(Levenshtein()))
compare(s1::AbstractString, s2::AbstractString, dist::StringDistance)
compare returns a similarity score between the strings `s1` and `s2` based on the distance `dist`.
"""
function compare(s1::Union{AbstractString, Missing}, s2::Union{AbstractString, Missing}, dist::Hamming; min_score = 0.0)
(ismissing(s1) | ismissing(s2)) && return missing
s1, s2 = reorder(s1, s2)
len1, len2 = length(s1), length(s2)
len2 == 0 && return 1.0
1.0 - evaluate(dist, s1, s2) / len2
end

function compare(s1::Union{AbstractString, Missing}, s2::Union{AbstractString, Missing}, dist::Union{Jaro, RatcliffObershelp}; min_score = 0.0)
(ismissing(s1) | ismissing(s2)) && return missing
1.0 - evaluate(dist, s1, s2)
end

function compare(s1::Union{AbstractString, Missing}, s2::Union{AbstractString, Missing}, dist::AbstractQGramDistance; min_score = 0.0)
(ismissing(s1) | ismissing(s2)) && return missing
# When string length < q for qgram distance, returns s1 == s2
s1, s2 = reorder(s1, s2)
len1, len2 = length(s1), length(s2)
len1 <= dist.q - 1 && return convert(Float64, s1 == s2)
if typeof(dist) <: QGram
1.0 - evaluate(dist, s1, s2) / (len1 + len2 - 2 * dist.q + 2)
else
1.0 - evaluate(dist, s1, s2)
end
end

function compare(s1::Union{AbstractString, Missing}, s2::Union{AbstractString, Missing}, dist::Union{Levenshtein, DamerauLevenshtein}; min_score = 0.0)
(ismissing(s1) | ismissing(s2)) && return missing
s1, s2 = reorder(s1, s2)
@@ -51,8 +31,17 @@ function compare(s1::Union{AbstractString, M
end
end

function compare(s1::Union{AbstractString, Missing}, s2::Union{AbstractString, Missing})
compare(s1, s2, TokenMax(Levenshtein()))
function compare(s1::Union{AbstractString, Missing}, s2::Union{AbstractString, Missing}, dist::QGramDistance; min_score = 0.0)
(ismissing(s1) | ismissing(s2)) && return missing
# When string length < q for qgram distance, returns s1 == s2
s1, s2 = reorder(s1, s2)
len1, len2 = length(s1), length(s2)
len1 <= dist.q - 1 && return convert(Float64, s1 == s2)
if typeof(dist) <: QGram
1.0 - evaluate(dist, s1, s2) / (len1 + len2 - 2 * dist.q + 2)
else
1.0 - evaluate(dist, s1, s2)
end
end

##############################################################################
@@ -61,11 +50,11 @@ end
##
##############################################################################
"""
Winkler(dist::Premetric, p::Real = 0.1, boosting_threshold::Real = 0.7, l::Integer = 4)
Winkler(dist::StringDistance, p::Real = 0.1, boosting_threshold::Real = 0.7, l::Integer = 4)
Winkler is a `PreMetric` modifier that boosts the similarity score between two strings by a scale `p` when the strings share a common prefix with length lower than `l` (the boost is only applied when the similarity score is above `boosting_threshold`)
Winkler is a `StringDistance` modifier that boosts the similarity score between two strings by a scale `p` when the strings share a common prefix with length lower than `l` (the boost is only applied when the similarity score is above `boosting_threshold`)
"""
struct Winkler{T1 <: PreMetric, T2 <: Real, T3 <: Real, T4 <: Integer} <: PreMetric
struct Winkler{T1 <: StringDistance, T2 <: Real, T3 <: Real, T4 <: Integer} <: StringDistance
dist::T1
p::T2 # scaling factor. Default to 0.1
boosting_threshold::T3 # boost threshold. Default to 0.7
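For reference, the boost itself (elided from this hunk) follows the standard Jaro-Winkler form; the sketch below assumes `score` is the base similarity and `prefix_len` is the length of the common prefix:

```julia
# Sketch of the Winkler boost (standard Jaro-Winkler form, assumed here):
# the boost only kicks in above the boosting threshold, and the common
# prefix counts for at most `l` characters.
function winkler_boost(score, prefix_len; p = 0.1, boosting_threshold = 0.7, l = 4)
    score < boosting_threshold && return score
    score + min(prefix_len, l) * p * (1.0 - score)
end

winkler_boost(0.6, 3)  # 0.6: below the threshold, unchanged
winkler_boost(0.9, 3)  # boosted by 3 * 0.1 * (1 - 0.9)
```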
@@ -98,11 +87,11 @@ JaroWinkler() = Winkler(Jaro(), 0.1, 0.7)
##
##############################################################################
"""
Partial(dist::Premetric)
Partial(dist::StringDistance)
Partial is a `PreMetric` modifier that returns the maximal similarity score between the shorter string and substrings of the longer string
Partial is a `StringDistance` modifier that returns the maximal similarity score between the shorter string and substrings of the longer string
"""
struct Partial{T <: PreMetric} <: PreMetric
struct Partial{T <: StringDistance} <: StringDistance
dist::T
end
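The idea can be sketched directly: slide the shorter string across the longer one and keep the best window score (an ASCII-only sketch; `sim` is any similarity function):

```julia
# Sketch of Partial: best similarity between the shorter string and every
# window of equal length in the longer string (assumes 1-byte characters).
function partial_best(sim, s1::AbstractString, s2::AbstractString)
    length(s1) > length(s2) && ((s1, s2) = (s2, s1))
    n = length(s1)
    maximum(sim(s1, s2[i:i + n - 1]) for i in 1:(length(s2) - n + 1))
end

exact = (a, b) -> a == b ? 1.0 : 0.0
partial_best(exact, "Yankees", "New York Yankees")  # 1.0: exact substring match
```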

@@ -153,11 +142,11 @@ end
##
##############################################################################
"""
TokenSort(dist::Premetric)
TokenSort(dist::StringDistance)
TokenSort is a `PreMetric` modifier that adjusts for differences in word orders by reordering words alphabetically.
TokenSort is a `StringDistance` modifier that adjusts for differences in word orders by reordering words alphabetically.
"""
struct TokenSort{T <: PreMetric} <: PreMetric
struct TokenSort{T <: StringDistance} <: StringDistance
dist::T
end

@@ -175,11 +164,11 @@ end
##
##############################################################################
"""
TokenSet(dist::Premetric)
TokenSet(dist::StringDistance)
TokenSet is a `PreMetric` modifier that adjusts for differences in word orders and word numbers by comparing the intersection of two strings with each string.
TokenSet is a `StringDistance` modifier that adjusts for differences in word orders and word numbers by comparing the intersection of two strings with each string.
"""
struct TokenSet{T <: PreMetric} <: PreMetric
struct TokenSet{T <: StringDistance} <: StringDistance
dist::T
end
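The comparisons TokenSet performs can be sketched as follows (whitespace tokenization and the max-of-three combination are assumed here, loosely following fuzzywuzzy's token_set_ratio):

```julia
# Sketch of TokenSet: anchor on the sorted intersection of the word sets,
# then take the best of the three pairwise scores.
function token_set_score(sim, s1::AbstractString, s2::AbstractString)
    w1, w2 = Set(split(s1)), Set(split(s2))
    sorted(ws) = join(sort(collect(ws)), " ")
    inter = sorted(intersect(w1, w2))
    full1 = strip(inter * " " * sorted(setdiff(w1, w2)))
    full2 = strip(inter * " " * sorted(setdiff(w2, w1)))
    max(sim(inter, full1), sim(inter, full2), sim(full1, full2))
end

exact = (a, b) -> a == b ? 1.0 : 0.0
token_set_score(exact, "mariners vs angels", "angels vs mariners")  # 1.0
```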

@@ -207,11 +196,11 @@ end
##
##############################################################################
"""
TokenMax(dist::Premetric)
TokenMax(dist::StringDistance)
TokenMax is a `PreMetric` modifier that combines similarity scores using the base distance, its Partial, TokenSort and TokenSet modifiers, with penalty terms depending on string lengths.
TokenMax is a `StringDistance` modifier that combines similarity scores using the base distance, its Partial, TokenSort and TokenSet modifiers, with penalty terms depending on string lengths.
"""
struct TokenMax{T <: PreMetric} <: PreMetric
struct TokenMax{T <: StringDistance} <: StringDistance
dist::T
end
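The combination TokenMax performs can be sketched as a penalized maximum (the weights below are illustrative assumptions, loosely following fuzzywuzzy's WRatio, not the package's exact values):

```julia
# Sketch of TokenMax: each candidate score is damped by a penalty weight,
# and the best damped score wins.
token_max_score(base, partial, tsort, tset; unbase_scale = 0.95, partial_scale = 0.9) =
    max(base,
        partial * partial_scale,
        tsort * unbase_scale * partial_scale,
        tset * unbase_scale * partial_scale)

token_max_score(0.5, 0.9, 0.8, 0.7)  # max(0.5, 0.81, 0.684, 0.5985)
```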


2 comments on commit 82d5f3b

@matthieugomez
Owner Author


@JuliaRegistrator register()

@JuliaRegistrator


Registration pull request updated: JuliaRegistries/General/6617

After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.

This will be done automatically if Julia TagBot is installed, or can be done manually through the github interface, or via:

git tag -a v0.5.0 -m "<description of version>" 82d5f3bc91e896d67c6fc1bc6ce8cce0a4cf546b
git push origin v0.5.0
