Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Axe FreqDist ordering/comparisons? #1457

Closed
iliakur opened this issue Aug 27, 2016 · 6 comments · Fixed by #2502
Closed

Axe FreqDist ordering/comparisons? #1457

iliakur opened this issue Aug 27, 2016 · 6 comments · Fixed by #2502

Comments

@iliakur
Copy link
Contributor

iliakur commented Aug 27, 2016

This StackOverflow question made me aware that the FreqDist class implements ordering comparisons. As the question shows, the implementation is buggy so we should do something about it.
I propose we get rid of it altogether, considering that its parent class Counter doesn't support comparisons and that it's not entirely clear what it means for one frequency distribution to be "greater than" another.

Thoughts?

@stevenbird
Copy link
Member

@copper-head: the motivation is given in section 4.1 of chapter 2 of the NLTK book.

The operation corresponds to subset, for multisets.

Since the book uses it, I would rather fix the implementation than break the book example.

@bdevnani3
Copy link

bdevnani3 commented Dec 22, 2017

Hey! Could I give this a shot? :)

@alvations
Copy link
Contributor

@bdevnani3 Sure! But this might be a little challenging since it has to do with builtin Python class.

The code referred to in this PR is in the nltk.probability.FreqDist specifically https://github.com/nltk/nltk/blob/develop/nltk/probability.py#L379

And the motivation of having __le__, __ge__, __lt__ and __gt__ functions in FreqDist is to check through all items in the FreqDist and make sure that all items are in the FreqDist has lesser/greater value than what it's compared to.

From http://www.nltk.org/book/ch02.html :

The FreqDist comparison method [3] permits us to check that the frequency of each letter in the candidate word is less than or equal to the frequency of the corresponding letter in the puzzle.

>>> puzzle_letters = nltk.FreqDist('egivrvonl') 
>>> obligatory = 'r' 
>>> wordlist = nltk.corpus.words.words() 
>>> [w for w in wordlist if len(w) >= 6 
...                      and obligatory in w  
...                      and nltk.FreqDist(w) <= puzzle_letters]  

['glover', 'gorlin', 'govern', 'grovel', 'ignore', 'involver', 'lienor', 'linger', 'longer', 'lovering', 'noiler', 'overling', 'region', 'renvoi', 'revolving', 'ringle', 'roving', 'violer', 'virole'] 

@alvations
Copy link
Contributor

alvations commented Dec 22, 2017

One possible solution is to remove these lines:

    def __le__(self, other):
        if not isinstance(other, FreqDist):
            raise_unorderable_types("<=", self, other)
        return set(self).issubset(other) and all(self[key] <= other[key] for key in self)

    # @total_ordering doesn't work here, since the class inherits from a builtin class
    __ge__ = lambda self, other: not self <= other or self == other
    __lt__ = lambda self, other: self <= other and not self == other
    __gt__ = lambda self, other: not self <= other

And write a generic inequalities function and make duck-types of the individual inequality:

    from operator import lt, gt, le, ge
    def inequalities(self, other, _operator):
        operator2string = {lt:"<", gt:">", le:">", ge:">",}
        if not isinstance(other, FreqDist):
            raise_unorderable_types(operator2string[_operator], self, other)
        return all(_operator(self[key], other[key]) for key in self)

    __lt__ = lambda self, other: self.inequalities(other, lt)
    __gt__ = lambda self, other: self.inequalities(other, gt)
    __le__ = lambda self, other: self.inequalities(other, le)
    __ge__ = lambda self, other: self.inequalities(other, ge)

[out]:

>>> import nltk
>>> nltk.FreqDist('abc') > nltk.FreqDist('abd')
False
>>> nltk.FreqDist('abd') < nltk.FreqDist('abc') 
False
>>> nltk.FreqDist('abcc') > nltk.FreqDist('abc')
False
>>> nltk.FreqDist('abcc') >= nltk.FreqDist('abc')
True
>>> nltk.FreqDist('aabbcc') > nltk.FreqDist('abc')
True
>>> nltk.FreqDist('aabbcc') < nltk.FreqDist('abc')
False
>>> nltk.FreqDist('abc') < nltk.FreqDist('aabbcc')
True
>>> nltk.FreqDist('abc') <= nltk.FreqDist('abc')
True
>>> nltk.FreqDist('abc') >= nltk.FreqDist('abc')
True
>>> nltk.FreqDist('abc') >= 'abc'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/liling.tan/git-stuff/nltk-alvas/nltk/nltk/probability.py", line 402, in <lambda>
    __ge__ = lambda self, other: self.inequalities(other, ge)
  File "/Users/liling.tan/git-stuff/nltk-alvas/nltk/nltk/probability.py", line 396, in inequalities
    raise_unorderable_types(operator2string[_operator], self, other)
  File "/Users/liling.tan/git-stuff/nltk-alvas/nltk/nltk/internals.py", line 982, in raise_unorderable_types
    raise TypeError("unorderable types: %s() %s %s()" % (type(a).__name__, ordering, type(b).__name__))
TypeError: unorderable types: FreqDist() > str() 

But it'll lead to some cases where it's awkward but still logical if the aim of the inequality is restricted to the definition of comparing counts of each word given the "self" i.e. the first FreqDist as the "deictic" :

>>> nltk.FreqDist('xyz') >= nltk.FreqDist('abc')
True
>>> nltk.FreqDist('xyz') <= nltk.FreqDist('abc')
False

>>> nltk.FreqDist('abc') <= nltk.FreqDist('xyz')
False
>>> nltk.FreqDist('abc') >= nltk.FreqDist('xyz')
True

>>> nltk.FreqDist('xyz') >= nltk.FreqDist('abcx')
True
>>> nltk.FreqDist('xyz') >= nltk.FreqDist('abcxx')
False
>>> nltk.FreqDist('abc') <= nltk.FreqDist('xyz')
False

@iliakur
Copy link
Contributor Author

iliakur commented Dec 26, 2017

I've given this some thought and actually I think we shouldn't change the implementation (it makes sense), but instead document it clearly.
This involves me updating my SO answer, I'm cool with that, so just let me know what the final verdict on this is!

@pyfisch
Copy link
Contributor

pyfisch commented Feb 13, 2020

Hi,

I just had a student ask me about how FreqDist less than and greater than works. After explaining a bit about the order relation for sets they showed me this example. The fdist1 is both bigger and smaller than fdist2. This does not make sense to me.

import nltk
liste1=["This","is","a","a","list"]
fdist1=nltk.FreqDist(liste1)
liste2=["This","This","is","a","list"]
fdist2=nltk.FreqDist(liste2)
print(nltk.__version__) # 3.4.5
print(fdist1>fdist2) # True
print(fdist2>fdist1) # True

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants