Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug in DamerauLevenshtein #30

Closed
ericphanson opened this issue Jul 13, 2020 · 9 comments
Closed

bug in DamerauLevenshtein #30

ericphanson opened this issue Jul 13, 2020 · 9 comments

Comments

@ericphanson
Copy link

ericphanson commented Jul 13, 2020

using StringDistances
a = "qxKH"
b = collect(a); b[2]=rand('a':'z'); b = join(b);
 DamerauLevenshtein()(a,b)
#> 1
 DamerauLevenshtein()(a,b, 1.1)
#>2.1
 DamerauLevenshtein()(a,b, 2)
#>1

Here, DL(a, b, max_dist) should be 1 for any max_dist > 1. But it seems for 1<max_dist<2 we get the wrong answer.

@matthieugomez
Copy link
Owner

matthieugomez commented Jul 13, 2020

Thanks. How did you end up on this? I am now restricting the last argument to be an Integer (26221a1). Does that solve your issue?

@ericphanson
Copy link
Author

So I ended up at this issue by trying to answer: "does short word w appear in long string str up to 2 in DL-distance?". Since Partial invokes normalize, I first used e.g. Partial(DL)(w, str, 2 / length(w)), but I ran into issues where that would not identify correct matches, and I assumed it was due to floating point issues:

julia> w = "abcdef"
"abcdef"

julia> str = "1234abcxyf1234"
"1234abcxyf1234"

julia> Partial(DL)(w, str, 2 / length(w))
1.0

Here, the answer I want is 1/3, so that when I multiply by length(w) I indeed get 2, the unnormalized DL-distance between w and substrings of str. So I fixed it via

julia> Partial(DL)(w, str, (2 / length(w)) + eps())
0.3333333333333333

I thought things were working fine until I ran into a test query that did not match, which I reduced to the issue here.

I think therefore restricting to integers will at least error instead of giving the wrong answer, but I'm not sure it solves the problem completely, since I don't quite know how to do get the right functionality. I think maybe I should just vendor a copy of Partial and modify it to not normalize to avoid floating point issues altogether.

@ericphanson
Copy link
Author

Maybe I am using max_dist incorrectly. Is this right?

julia> DL("abcdef", "abcxyf", 2)
3

julia> DL("abcdef", "abcxyf",3)
2

I was thinking they both should be 2.

matthieugomez added a commit that referenced this issue Jul 13, 2020
@matthieugomez
Copy link
Owner

matthieugomez commented Jul 13, 2020

Right, it's a bug. Thanks for spotting it. I think I have solved it with 4df4bad.
Let me know if you encounter other issues.

@ericphanson
Copy link
Author

Thanks for the quick fixes! I'll try again tomorrow and see if I can spot any issues.

@matthieugomez
Copy link
Owner

Also, Partial no longer normalizes by default.

@ericphanson
Copy link
Author

I haven't been able to find any more issues, by the way. I'll post a new issue if I find any.

Mind registering the latest release?

@matthieugomez
Copy link
Owner

ok, done. Let me know if you encounter other bugs — this is very useful.

@ericphanson
Copy link
Author

Will do, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants