
Jaro Winkler distance added #2044

Merged 36 commits into nltk:develop, Jul 10, 2018

@53X
Contributor

53X commented Jun 12, 2018

Jaro-Winkler distance has been added to nltk.metrics, and PEP-8 issues have been taken care of.

I have not used any built-in functions from nltk for this. This PR introduces a new string comparison metric, written as a standalone function named jaro_winkler_distance().
It fixes this issue

@53X changed the title from "Jaro Winkler distance added #2006 Issue fix attempt" to "Jaro Winkler distance added" Jun 12, 2018

@alvations alvations self-assigned this Jun 12, 2018

@53X

Contributor Author

53X commented Jun 13, 2018

@alvations, please give your thoughts and comments on this one.

p = type float , weight parameter with a standard value of 0.1
l_cnt = type int, common prefix at the start of the string (max val = 4)
jw_sim= type float, jaro winkler similarity between s1 and s2
jw_dist = type float, it is equal to 1 - jw_sim """

@alvations

alvations Jun 13, 2018

Contributor

The docstring has to be a lot clearer.

  • What are s1 and s2? I assume the function takes strings, but the other distance metrics in nltk.metrics.distance actually accept any sequence type. Jaro-Winkler seems generic enough to cover any sequence.

@alvations

alvations Jun 13, 2018

Contributor

Also, variable names in this function need to be clearer.

Since these names are scoped within the function, you can afford more explicit variable names, e.g.

  • mtch -> matches
  • t_cnt -> transposition_counts

@53X

53X Jun 13, 2018

Author Contributor

s1 and s2 are the two strings between which we want to calculate the JW distance. I am going to include that in the docstring. Thanks for pointing it out.

jw_sim= type float, jaro winkler similarity between s1 and s2
jw_dist = type float, it is equal to 1 - jw_sim """
s1 = s1.lower()
s2 = s2.lower()

@alvations

alvations Jun 13, 2018

Contributor

As commented above, if the input is a sequence, this would throw an AttributeError.

@alvations

alvations Jun 13, 2018

Contributor

If .lower() is the intention, then adding it as a kwarg in the function arguments would be better, i.e.

def jaro_winkler_distance(s1, s2, p=0.1, lowercase=True):
    # some code ...

    if lowercase:
        s1, s2 = s1.lower(), s2.lower()

    # other code ...
if(abs(pos1 - pos2) <= dist_bound):
mtch = mtch + 1
if(pos1 != pos2):
t_cnt = t_cnt + 1

@alvations

alvations Jun 13, 2018

Contributor

Actually, this looks very Levenshtein-like.

@alvations

alvations Jun 13, 2018

Contributor

Out of curiosity, Is there a simpler way to do this?

@53X

53X Jun 13, 2018

Author Contributor

Well, the above code actually counts the number of "matched" characters and the number of transpositions. Levenshtein also calculates transpositions.
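For reference, the standard Jaro definition counts matches only within a window of max(len(s1), len(s2)) // 2 - 1 positions, and counts matched-but-out-of-order pairs as transpositions. A minimal self-contained sketch of that definition (my own illustrative code, not the PR's implementation at this point):

```python
def jaro(s1, s2):
    # Characters only count as a match within this window of positions.
    bound = max(len(s1), len(s2)) // 2 - 1
    flagged1, flagged2 = [], []  # matched positions in s1 and s2
    for i, ch in enumerate(s1):
        for j in range(max(0, i - bound), min(i + bound + 1, len(s2))):
            if ch == s2[j] and j not in flagged2:
                flagged1.append(i)
                flagged2.append(j)
                break
    matches = len(flagged1)
    if matches == 0:
        return 0.0
    # Transpositions: matched pairs whose characters are out of order, halved.
    transpositions = sum(s1[i] != s2[j]
                         for i, j in zip(flagged1, sorted(flagged2))) // 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3

print(round(jaro("MARTHA", "MARHTA"), 3))  # 0.944
```

The MARTHA/MARHTA pair gives 6 matches and 1 transposition, so (1/3)(6/6 + 6/6 + 5/6) ≈ 0.944, the value reported in the Winkler papers cited later in this thread.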

@53X

53X Jun 13, 2018

Author Contributor

Simpler way? I don't know ..... I am gonna have to check it out. BTW this works too

if(l_cnt == 4):
break
jw_sim = j_sim+(l_cnt*p*(1-j_sim))
jw_dist = 1 - jw_sim

@alvations

alvations Jun 13, 2018

Contributor

Just return 1 - jw_sim here.

@53X

53X Jun 13, 2018

Author Contributor

Sure, it is actually redundant.

break
if(l_cnt == 4):
break
jw_sim = j_sim+(l_cnt*p*(1-j_sim))

@alvations

alvations Jun 13, 2018

Contributor

It would make sense to put this formula in the function docstring and then explain the related variables.

@53X

53X Jun 13, 2018

Author Contributor

yeah sure

@@ -143,7 +147,7 @@ def masi_distance(label1, label2):
return 1 - len_intersection / len_union * m


def interval_distance(label1,label2):
def interval_distance(label1, label2):

@alvations

alvations Jun 13, 2018

Contributor

👍

@@ -66,7 +68,7 @@ def edit_distance(s1, s2, substitution_cost=1, transpositions=False):
been done in other orders, but at least three steps are needed.
Allows specifying the cost of substitution edits (e.g., "a" -> "b"),
because sometimes it makes sense to assign greater penalties to substitutions.
because often it makes sense to assign bigger penalties to substitutions.

@alvations

alvations Jun 13, 2018

Contributor

Is this the right assumption that it's often and not sometimes?

Also greater penalties sounds better than bigger penalties, the latter sounds presidential ;P

@53X

53X Jun 13, 2018

Author Contributor

I am changing it back to 'sometimes' and 'greater'. Thanks for the heads up.

@@ -34,7 +35,8 @@ def _edit_dist_init(len1, len2):
return lev


def _edit_dist_step(lev, i, j, s1, s2, substitution_cost=1, transpositions=False):
def _edit_dist_step(lev, i, j, s1, s2,
substitution_cost=1, transpositions=False):

@alvations

alvations Jun 13, 2018

Contributor

No reason to break this unless it's over the 80-char limit. Even so, the break should be one line per kwarg, so

def _edit_dist_step(lev, i, j, s1, s2,
                    substitution_cost=1,
                    transpositions=False):

@53X

53X Jun 13, 2018

Author Contributor

It is over 80 chars, so I decided to break it this way.

@alvations

alvations Jun 13, 2018

Contributor

Break the other keyword into a second line too, that'll be more consistent. Thanks!

@53X

53X Jun 13, 2018

Author Contributor

cool

else:
break
if(l_cnt == 4):
break

@alvations

alvations Jun 13, 2018

Contributor

These condition checks don't exactly make use of numpy. I think there are better ways to do the matching.

@53X

53X Jun 13, 2018

Author Contributor

Well, I actually used numpy for np.floor(). Also, I went through RIBES, and the find_increasing_sequences(worder) function there actually returns monotonic sequences given a worder list. But l_cnt here is the length of the common prefix at the start of the string, up to a maximum of four characters, as per this: https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance

@alvations

Contributor

alvations commented Jun 13, 2018

I have to read more on the metric to really understand the algorithm implemented. But I've left some suggestions on the coding style and documentation issues that need to be improved. Hope that helps.

print("Jaro-Winkler_distance:", jaro_winkler_distance('JONES', 'JohnSon'))
print("Jaro-Winkler_similarity:",
1-jaro_winkler_distance('JONES', 'JohnSon'))

@alvations

alvations Jun 13, 2018

Contributor

One more request here =)

Could you reuse the edit_distance_examples and move this bit of demo code into the loop at L240?

@53X

53X Jun 13, 2018

Author Contributor

Yeah sure. I was afraid of breaking the code initially, so I avoided doing that.

@53X

Contributor Author

53X commented Jun 13, 2018

@alvations, I am going to make the edits real quick.

@53X

Contributor Author

53X commented Jun 13, 2018

@alvations, I made the changes. Please take a look.

@alvations

Contributor

alvations commented Jun 14, 2018

After reading the formula and other implementations, it looks like the winkler part is merely a function on the jaro similarity. I'll make some changes and push.

@53X if you don't mind, I'll do some adaptations to your code and push, so that it is better documented and suits Pythonic conventions.


Changes made

  • Decouple the jaro distance from the jaro-winkler function.

  • After some consideration, lowercasing the inputs should be a user decision; we shouldn't encourage or enforce .lower()

  • Store the lengths of the strings; they are used several times, so there's no point re-calculating them =)

  • Variable name changes,

    • match -> matches, since it's counting
    • transposition_count -> transpositions; since it's just counting and nothing other than integers is stored, it's okay to drop the _count. Also there's no need to store half the value; store the full integer and halve it when returning the Jaro similarity
    • dist_bound -> match_bound; it's a little strange to call it a distance bound because it's not the upper bound of the metric itself but of the matches
  • x = x + 1 -> x += 1

@alvations

Contributor

alvations commented Jun 14, 2018

@53X checking your algorithm, you have made certain assumptions with

    # Iterate through sequences, check for matches and compute transpositions.
    for ch1 in s1:     # Iterate through each character.
        if ch1 in s2:  # Check whether ch1 appears in s2.
            pos1 = s1.index(ch1)
            pos2 = s2.index(ch1)
            if(abs(pos1-pos2) <= dist_bound):
                match = match + 1
                if(pos1 != pos2):
                    transposition_count = transposition_count+1

When you use .index(), it will always return the first occurrence, so if there is a repeated character whose first occurrence isn't within the bounding dist_bound, it's not going to work correctly, right?

I don't think you can get both matches and transpositions in 1 iteration. You might have to iterate through and be a little more like dynamic programming when you iterate through 2 sequences with the bounding boxes, like in https://rosettacode.org/wiki/Jaro_distance#Python
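The .index() pitfall can be seen with a one-liner (illustrative snippet, not the PR's code):

```python
# str.index() always returns the FIRST occurrence, so every repeated
# character maps back to the position of its first copy.
s = "banana"
print([s.index(ch) for ch in s])  # [0, 1, 2, 1, 2, 1]
```

Both 'n's and all three 'a's report the position of their first copy, so the abs(pos1 - pos2) <= dist_bound check is applied to the wrong pairs whenever characters repeat.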

alvations added some commits Jun 14, 2018

@alvations

Contributor

alvations commented Jun 14, 2018

@53X I'm done with the edits. I hope the changes made are okay with you. I think it's best to be more explicit esp with some of these variable names and clearer in terms of what each loop / condition is doing.

Going back to the algorithm, I think you might need to do some tests on whether the loop here is really the same as the typical dynamic programming code from Rosetta Code; it looks different given the .index() assumptions.

    # Iterate through sequences, check for matches and compute transpositions.
    for ch1 in s1:     # Iterate through each character.
        if ch1 in s2:  # Check whether ch1 appears in s2.
            pos1 = s1.index(ch1)
            pos2 = s2.index(ch1)
            if(abs(pos1-pos2) <= dist_bound):
                match = match + 1
                if(pos1 != pos2):
                    transposition_count = transposition_count+1

Looks like there's some difference in https://www.kaggle.com/alvations/jaro-winkler

Oddly, this implementation https://github.com/nap/jaro-winkler-distance is also different:

>>> from pyjarowinkler.distance import get_jaro_distance
>>> string_distance_examples = [
...     ("rain", "shine"), ("abcdef", "acbdef"), ("language", "lnaguaeg"),
...     ("language", "lnaugage"), ("language", "lngauage")]
>>> 
>>> for s1, s2 in string_distance_examples:
...    get_jaro_distance(s1, s2, winkler=False)
... 
0.6333333333333333
0.9444444444444445
0.9166666666666666
0.9166666666666666
0.9583333333333334

alvations added some commits Jun 14, 2018

@alvations

Contributor

alvations commented Jun 14, 2018

With all the implementations giving different values, I suggest we follow this paper's output https://www.census.gov/srd/papers/pdf/rr93-8.pdf (page 35, Table 5) and aim our implementation towards that:

Table 5 Comparison of String Comparators Rescaled between 0 and 1

Strings Winkler Jaro D-L
billy billy 1.000 1.000 1.000
billy bill 0.967 0.933 0.800
billy blily 0.947 0.933 0.600
massie massey 0.944 0.889 0.600
yvette yevett 0.911 0.889 0.600
billy bolly 0.893 0.867 0.600
dwayne duane 0.858 0.822 0.400
dixon dickson 0.853 0.791 0.200
billy susan 0.000 0.000 0.000

Another table of results from https://www.census.gov/srd/papers/pdf/rr94-5.pdf

Table 2.1. Comparison of String Comparators Using Last Names, First Names, and Street Names

Strings Jaro Wink McLa Lynch
SHACKLEFORD SHACKELFORD 0.970 0.982 0.982 0.989
DUNNINGHAM CUNNIGHAM 0.896 0.896 0.896 0.931
NICHLESON NICHULSON 0.926 0.956 0.969 0.977
JONES JOHNSON 0.790 0.832 0.860 0.874
MASSEY MASSIE 0.889 0.933 0.953 0.953
ABROMS ABRAMS 0.889 0.922 0.946 0.952
HARDIN MARTINEZ 0.722 0.722 0.722 0.774
ITMAN SMITH 0.467 0.467 0.507 0.507
JERALDINE GERALDINE 0.926 0.926 0.948 0.966
MARHTA MARTHA 0.944 0.961 0.961 0.971
MICHELLE MICHAEL 0.869 0.921 0.938 0.944
JULIES JULIUS 0.889 0.933 0.953 0.953
TANYA TONYA 0.867 0.880 0.916 0.933
DWAYNE DUANE 0.822 0.840 0.873 0.896
SEAN SUSAN 0.783 0.805 0.845 0.845
JON JOHN 0.917 0.933 0.933 0.933
JON JAN 0.000 0.000 0.860 0.860
BROOKHAVEN BRROKHAVEN 0.933 0.947 0.947 0.964
BROOK HALLOW BROOK HLLW 0.944 0.967 0.967 0.977
DECATUR DECATIR 0.905 0.943 0.960 0.965
FITZRUREITER FITZENREITER 0.856 0.913 0.923 0.945
HIGBEE HIGHEE 0.889 0.922 0.922 0.932
HIGBEE HIGVEE 0.889 0.922 0.946 0.952
LACURA LOCURA 0.889 0.900 0.930 0.947
IOWA IONA 0.833 0.867 0.867 0.867
1ST IST 0.000 0.000 0.844 0.844
@53X

Contributor Author

53X commented Jun 15, 2018

@alvations, please take a look at the (JON, JAN) and (1ST, IST) examples from your comment (second table). I think there is a typo in the paper: there is no way (JON, JAN) and (1ST, IST) can be completely dissimilar as the paper suggests, since we have matching characters in both cases. Also, I have experimented with the value of the max_l parameter in the Jaro-Winkler formula. Keeping its value at 4 gives us the best results (i.e., completely replicating the results reported in the paper). For these two ambiguously reported values in the paper, my results are pretty much consistent with this online tool. I have also made a few more changes; please review them and give your feedback.

alvations and others added some commits Jun 26, 2018

@53X

Contributor Author

53X commented Jun 30, 2018

@alvations, I have made sure that all the test cases pass in the doctest. Please have a look at the latest commits and provide your feedback.

@alvations

Contributor

alvations commented Jul 2, 2018

@53X what is the p_val? Is it decimal places? If so, it shouldn't be in the similarity function but in the doctest.

Sorry about the previous comment; I saw the commits while commuting and misunderstood. There's no need to call it p_values; users might confuse it with the p-value that is normally used in statistics.

Why is there a need for a different p for different tests? Shouldn't they all use 0.1 like in the paper?

You shouldn't be hacking the p such that the values pass in the doctest =(

@53X is it because the default value of p isn't 0.1 in the paper but a variable?

@alvations

Contributor

alvations commented Jul 3, 2018

Let's debug this one instance at a time, in a way that we can copy and paste the code.

Here's the stripped down code without the docstring:

def jaro_similarity(s1, s2):
    # First, store the length of the strings
    # because they will be re-used several times.
    len_s1, len_s2 = len(s1), len(s2)

    # The upper bound of the distance for being a matched character.
    match_bound = max(len_s1, len_s2) // 2 - 1

    # Initialize the counts for matches and transpositions.
    matches = 0  # no.of matched characters in s1 and s2
    transpositions = 0  # no. of transpositions between s1 and s2
    flagged_1 = []  # positions in s1 which are matches to some character in s2
    flagged_2 = []  # positions in s2 which are matches to some character in s1

    # Iterate through sequences, check for matches and compute transpositions.
    for i in range(len_s1):     # Iterate through each character.
        upperbound = min(i+match_bound, len_s2-1)
        lowerbound = max(0, i-match_bound)
        for j in range(lowerbound, upperbound+1):
            if s1[i] == s2[j] and j not in flagged_2:
                matches += 1
                flagged_1.append(i)
                flagged_2.append(j)
                break
    flagged_2.sort()
    for i, j in zip(flagged_1, flagged_2):
        if s1[i] != s2[j]:
            transpositions += 1

    if matches == 0:
        return 0
    else:
        return 1/3 * (matches/len_s1 +
                      matches/len_s2 +
                      (matches-transpositions//2)/matches
                      )


def jaro_winkler_similarity(s1, s2, p=0.1, max_l=4):
    # Compute the Jaro similarity
    jaro_sim = jaro_similarity(s1, s2)

    # Initialize the upper bound for the no. of prefixes.
    # if user did not pre-define the upperbound, use smaller among length of s1
    # and length of s2

    # Compute the prefix matches.
    l_ = 0
    for i in range(min(len(s1), len(s2))):
        if s1[i] == s2[i]:
            l_ += 1
        else:
            break
        if l_ == max_l:
            break
    # Return the similarity value as described in docstring.
    return jaro_sim + (l_ * p * (1 - jaro_sim))

Then we run the tests:

winkler_examples = [("billy", "billy"), ("billy", "bill"), ("billy", "blily"), 
                    ("massie", "massey"), ("yvette", "yevett"), ("billy", "bolly"), ("dwayne", "duane"), 
                    ("dixon", "dickson"), ("billy", "susan")]

winkler_scores = [1.000, 0.967, 0.947, 0.944, 0.911, 0.893, 0.858, 0.853, 0.000]
jaro_scores =    [1.000, 0.933, 0.933, 0.889, 0.889, 0.867, 0.822, 0.790, 0.000]


max_l = 4  # prefix upper bound; needed by the loop below

for (s1, s2), jscore, wscore in zip(winkler_examples, jaro_scores, winkler_scores):
    if round(jaro_similarity(s1, s2), 3) != jscore:
        print('jaro', s1, s2, jscore, round(jaro_similarity(s1, s2), 3))
        
    #if round(jaro_winkler_similarity(s1, s2, p=p_val), 3) != wscore:
    # Compute the prefix matches.
    l_ = 0
    for i in range(min(len(s1), len(s2))):
        if s1[i] == s2[i]:
            l_ += 1
        else:
            break
        if l_ == max_l:
            break

    print('jaro_winkler', s1, s2, jscore, wscore, jaro_winkler_similarity(s1, s2), l_)

[out]:

jaro_winkler billy billy 1.0 1.0 1.0 4
jaro_winkler billy bill 0.933 0.967 0.96 4
jaro_winkler billy blily 0.933 0.947 0.94 1
jaro_winkler massie massey 0.889 0.944 0.9333333333333333 4
jaro_winkler yvette yevett 0.889 0.911 0.8999999999999999 1
jaro_winkler billy bolly 0.867 0.893 0.88 1
jaro_winkler dwayne duane 0.822 0.858 0.84 1
jaro_winkler dixon dickson 0.79 0.853 0.8323809523809523 2
jaro_winkler billy susan 0.0 0.0 0.0 0

So our Jaro similarity matches; we must have done something different in the Winkler rescaling.

I suspect it's because of how we're computing the prefix matches. If we take a look at the p_val hacks you put in the previous commits, they're of a specific denominator, i.e. 0.125 = 1/8, 0.20 = 1/5. So the first thing to check is the l_ value instead of putting up with a "variable" constant.

Lets take a look at

s1 = "billy"
s2 = "bill"

j = jaro_similarity(s1, s2)
w = jaro_winkler_similarity(s1, s2)
p = 0.1 # Keep this constant

l_ = 4
print(j + (l_*0.1*(1-j)))
l_ = 5 
print(j + (l_*0.1*(1-j))) # Now it matches the Table 5 values

[out]:

0.96
0.9666666666666666

Similarly:

s1 = "dixon"
s2 = "dickson"

j = jaro_similarity(s1, s2)
w = jaro_winkler_similarity(s1, s2)
p = 0.1 # Keep this constant

l_ = 2
print(j + (l_*0.1*(1-j)))
l_ = 3 
print(j + (l_*0.1*(1-j))) # Now it matches the Table 5 values

[out]:

0.8323809523809523
0.8533333333333333

It's possible that max_l default is not 4.

@alvations

Contributor

alvations commented Jul 3, 2018

I might be wrong about the l_ and p. Looking at the formula:

jaro_winkler_sim = jaro_sim + ( l * p * (1 - jaro_sim) )

It boils down to:

y = x + m * (1-x) 

And to ensure that y ranges between [0, 1], it looks like we need the condition 0.0 <= m <= 1.0.

That means l and p are not exactly independent of each other in the Jaro-Winkler formula. The default value of p=0.1 with the max of l set at 4 constrains the scaling factor on 1-x, so it worked; but when l*p is more than 1.0, we get:

>>> s1 = "billy"
>>> s2 = "bill"
>>> jaro_winkler_similarity(s1, s2, p=0.3, max_l=4) # Output > 1.0
1.0133333333333334

Similarly, there's no real need for max_l=4; it can be higher as long as its product with p is less than or equal to 1.0, e.g.

>>> s1 = "billy"
>>> s2 = "bill"
>>> jaro_winkler_similarity(s1, s2, p=0.05, max_l=7)
0.9466666666666665

After looking at the Winkler (1990) paper, it looks like the parameters in the experiments were ambiguously set. So I think the p variable hacks you've put in are okay. Let me try to clean it up a little and add an assert.
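The bound argument above can be checked numerically; a minimal sketch (rescale is my illustrative name for the y = x + m * (1 - x) step, not a function in the PR):

```python
def rescale(x, m):
    # Winkler's rescaling of a similarity x by a factor m = l * p.
    return x + m * (1 - x)

# With l = 4, p = 0.1 (m = 0.4) the result stays within [0, 1]:
print(rescale(0.9, 4 * 0.1))
# With l = 4, p = 0.3 (m = 1.2) the result overshoots 1.0,
# like the billy/bill example above:
print(rescale(0.9, 4 * 0.3))
```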

alvations added some commits Jul 3, 2018

Finalized JR similarity implementation.
Cleaner loop to check of `l` prefix size and added warning to unexpected output
@alvations

Contributor

alvations commented Jul 3, 2018

@53X there are still 3 examples that don't match up with the p_factors changes:

  • BROOK HALLOW and BROOK HLLW
  • DECATUR and DECATIR
  • SHACKLEFORD and SHACKELFORD
l_ += 1
else:
break
if l_ == max_l:
break
if l_ > max_l:

@53X

53X Jul 3, 2018

Author Contributor

The loop should terminate and stop counting when l_ has reached its maximum bound (max_l). Thus this would be the better option, as l_ shouldn't go beyond max_l:

    if l_ == max_l:
        break

@53X

53X Jul 3, 2018

Author Contributor

@alvations, please review this piece of code.

@53X

Contributor Author

53X commented Jul 3, 2018

The formula for the jaro_winkler_similarity goes as follows:
jaro_winkler_similarity = jaro_sim + l_ * p * (1 - jaro_sim).
Thus we can deduce that l_ * p <= 1 (not max_l * p <= 1) is the necessary condition for jaro_winkler_similarity <= 1.

Quoting from the comments in the code of this PR (see the jaro_winkler_similarity function in the code):

l_ is the length of the common prefix at the start of the string. This implementation provides an upper bound for the l_ value. A common value of this upper bound is 4.

max_l is this upper bound, which is set to its common value, 4, by default. It is worth noting that l_ stays below even max_l = 100 (say) in the case of (TANYA, TONYA), where p = 0.1 and l_ = 1.

I have put in a separate doctest case that demonstrates the same: jaro_winkler_similarity('TANYA', 'TONYA', p=0.1, max_l=4) and jaro_winkler_similarity('TANYA', 'TONYA', p=0.1, max_l=100) both output 0.880.

Thus 0 <= jaro_winkler_similarity <= 1 depends not on max_l * p <= 1 but on l_ * p <= 1.

One more reason why max_l = 4 by default is that we need l_ * p <= 1 for the similarity value to lie between 0 and 1. The max value that p can take is 0.25, so in cases where p = 0.25, to maintain the condition l_ * p <= 1, the limiting value of l_ is 4.


if not 0 <= max_l * p <= 1:
    warnings.warn("The product of `max_l * p` doesn't fall between [0,1]. "
                  "Jaro-Winkler similarity will not be between 0 and 1.")

This piece of code that generates the warning is logical, as there might be cases where l_ becomes equal to max_l such that l_ * p > 1. In light of this, we might change the warning to the following:

if not 0 <= max_l * p <= 1:
    warnings.warn("The product `l_ * p` might not fall between [0,1]. "
                  "Jaro-Winkler similarity might not be between 0 and 1.")

Also, instead of this code that calculates the prefix length, l_:

for s1_i, s2_i in zip(s1, s2): 
        if s1_i == s2_i:
            l_ += 1
        else:
            break
        if l_ > max_l:
            break

we should use this one :

for s1_i, s2_i in zip(s1, s2): 
        if s1_i == s2_i:
            l_ += 1
        else:
            break
        if l_ == max_l:
            break 

This is because the loop should terminate as soon as l_ reaches its upper bound (max_l), if at all it reaches the upper bound.
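The difference between the two exit conditions can be seen in a standalone sketch (prefix_len and its strict flag are my illustrative names, not the PR's code):

```python
def prefix_len(s1, s2, max_l, strict=True):
    # Count the common prefix; strict=True stops with `l == max_l`,
    # strict=False uses `l > max_l` (which only fires after overshooting).
    l = 0
    for a, b in zip(s1, s2):
        if a != b:
            break
        l += 1
        if (l == max_l) if strict else (l > max_l):
            break
    return l

print(prefix_len("aaaaaa", "aaaaaa", 4, strict=True))   # 4: stops at the bound
print(prefix_len("aaaaaa", "aaaaaa", 4, strict=False))  # 5: overshoots by one
```

With the `>` check, l must first reach max_l + 1 before the loop breaks, so the returned prefix length exceeds the bound by one whenever the common prefix is long enough.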

@alvations

Contributor

alvations commented Jul 3, 2018

@53X Thanks for the explanation on the if l_ == max_l:!

Regarding the warning: max_l is a user-changeable parameter, while l_ is the internal, variable count of the longest prefix length. Since max_l subsumes l_, checking max_l * p <= 1 effectively also checks l_ * p <= 1, and the warning to the user should be about something under the user's control.

@alvations

Contributor

alvations commented Jul 3, 2018

@53X just 3 small comments pending in the code review and some light changes and the PR will look good to merge!

Sorry that the process took a while because of the back and forth regarding the algorithm and the regression tests.

warnings.warn(str("The product of `max_l * p` doesn't fall between [0,1]."
"Jaro-Winkler similarity will not be between 0 and 1.")
)
warnings.warn(str("The product `l_ * p` might not fall between [0,1]."

@alvations

alvations Jul 3, 2018

Contributor

I think it's okay to use max_l here. See comment in PR.

@@ -260,80 +260,90 @@ def jaro_winkler_similarity(s1, s2, p=0.1, max_l=4):
American Statistical Association: 354-359.
such that:
jaro_winkler_sim = jaro_sim + ( l * p * (1 - jaro_sim) )
jaro_winkler_sim = jaro_sim + ( l_ * p * (1 - jaro_sim) )

@alvations

alvations Jul 3, 2018

Contributor

I think it's cleaner to use l here to be clearer when communicating the formula.

@alvations

alvations Jul 3, 2018

Contributor

I think it's okay to use l in the code too instead of l_ since the function is locally scoped, no global variable.

@alvations

Contributor

alvations commented Jul 4, 2018

@53X Thank you for the updates! Now it LGTM.

IMO, you've contributed the best-documented metric in nltk.metrics.distance.py, and it comes with a doctest too =)

Kudos for the amount of work done to document and validate the implementation!

@53X

Contributor Author

53X commented Jul 4, 2018

@alvations, @stevenbird, can this PR be merged now and the corresponding issue be closed?

@stevenbird

Member

stevenbird commented Jul 10, 2018

Thanks for an excellent contribution @53X.
Thanks for your careful reviewing work @alvations.

@stevenbird stevenbird merged commit ecced11 into nltk:develop Jul 10, 2018

1 check passed

NLTK tests Build finished.

alvations added a commit that referenced this pull request Aug 15, 2018

Revert "Merge pull request #2044 from 53X/develop"
This reverts commit ecced11, reversing
changes made to ab07520.