Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Levenshtein Distance is a true Metric, not a SemiMetric #19

Closed
oxinabox opened this issue Jan 27, 2020 · 8 comments
Closed

Levenshtein Distance is a true Metric, not a SemiMetric #19

oxinabox opened this issue Jan 27, 2020 · 8 comments

Comments

@oxinabox
Copy link

It statisfies the triangle inequality.
Here's a nice blog post on that fact.

But in the package we have subtyping StringDistance which subtypes SemiMetric.

Getting this wrong is a problem because, various kinds of neighbours trees (NearestNeighbours.jl) require a true Metric to work.

I suggest that we @deprecate_binding StringDistance SemiMetric
and just use the right type from Distances.jl for each thing directly.
The fact that a particular metric is a StringDistance does not particularly matter.

@oxinabox
Copy link
Author

The more minimal version of this is to just change Levenshtein

@matthieugomez
Copy link
Owner

matthieugomez commented Jan 28, 2020

That's interesting. I'm not sure what to do. The issue is that Levenshtein cannot both inherit from StringDistances and Metric.
To be clear, Levenshtein is still a semi-metric, so the current behavior is not wrong per se, it's just not precise enough.

@matthieugomez
Copy link
Owner

matthieugomez commented Jan 28, 2020

Removing the abtract type StringDistance would not solve everything.

This package introduces distance modifiers Partial, Winkler etc which are StringDistances parametrized by other StringDistances (see an example here). It is possible (I have not checked) that Partial{Levenshtein} is a true metric whereas Partial{Jaro} is only a semimetric. But there is no way to express these properties through inheritance in Julia: the type Partial must either inherit from SemiMetric or fromMetric.

This suggests that properties of the metric should a trait rather than an abstract type. I'm curious what other people think (e.g. @KristofferC)

@KristofferC
Copy link

The Distances.jl has some very strange stuff like https://github.com/JuliaStats/Distances.jl/blob/7f3a28c0d1372e3b3edbcbc28f00ba5645e1bbdb/src/metrics.jl#L107-L108 which makes it hard to understand how types are structured...

Regarding using a trait, it might make sense. The comment about "Partial{Levenshtein} is a true metric whereas Partial{Jaro} is only a semimetric" suggests it would be a good idea since subtyping cannot model that.

@oxinabox
Copy link
Author

Removing StringDistance would at least partially solve the issue.
Sinced even if we are left with Partial{Levenshtein} <: SemiMetric (which as you say is certainly not wrong),
its still means we can get Levenshtein<:Metric presice.

My gut is telling me that most of those parameterized distence modifies are not true Metrics
So this wold be a start until
A) a trait based system can be worked out
B) someone sits down and writes the actual proofs on which modifiers on which base metrics are result in which traits.

@matthieugomez
Copy link
Owner

matthieugomez commented Jan 28, 2020

Inheritance from StringDistances is useful for me. Maybe ask the NearestNeighbour package to allow premetric — they can always print a warning in this case.

@matthieugomez
Copy link
Owner

I've removed the inheritance from StringDistance (it is just now just a union of distances). Levenshtein is now a Metric. Let's see how it goes — I'll tag a new version in a few weeks if no one complains.

@oxinabox
Copy link
Author

oxinabox commented Feb 8, 2020

thanks, think that is a good solution, and if it becomes unworkable then traits become next inline

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants