Axe FreqDist ordering/comparisons? #1457

iliakur · 2016-08-27T09:05:41Z

This StackOverflow question made me aware that the FreqDist class implements ordering comparisons. As the question shows, the implementation is buggy so we should do something about it.
I propose we get rid of it altogether, considering that its parent class Counter doesn't support comparisons and that it's not entirely clear what it means for one frequency distribution to be "greater than" another.

Thoughts?

stevenbird · 2016-08-27T10:24:35Z

@copper-head: the motivation is given in section 4.1 of chapter 2 of the NLTK book.

The operation corresponds to subset, for multisets.

Since the book uses it, I would rather fix the implementation than break the book example.
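To make the "subset, for multisets" reading concrete, here is a minimal sketch using plain `collections.Counter` objects rather than `FreqDist`. The helper name `is_submultiset` is made up for illustration and is not part of NLTK.

```python
from collections import Counter


def is_submultiset(a, b):
    """Return True if every item in `a` occurs at most as often as in `b`.

    Hypothetical helper: a missing key in a Counter counts as 0, so an item
    that appears in `a` but not in `b` makes the check fail.
    """
    return all(count <= b[item] for item, count in a.items())


print(is_submultiset(Counter("govern"), Counter("egivrvonl")))  # True
print(is_submultiset(Counter("abcc"), Counter("abc")))          # False
```

Under this reading, `a <= b` holds only when `a` can be built from the items available in `b`.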

bdevnani3 · 2017-12-22T01:45:20Z

Hey! Could I give this a shot? :)

alvations · 2017-12-22T02:01:41Z

@bdevnani3 Sure! But this might be a little challenging, since it involves a class that inherits from a builtin Python class.

The code referred to in this issue is in nltk.probability.FreqDist, specifically https://github.com/nltk/nltk/blob/develop/nltk/probability.py#L379

The motivation for having the __le__, __ge__, __lt__ and __gt__ functions in FreqDist is to go through all items in the FreqDist and check that each item's count is lesser/greater than the count of the same item in the FreqDist it's compared to.

From http://www.nltk.org/book/ch02.html :

The FreqDist comparison method [3] permits us to check that the frequency of each letter in the candidate word is less than or equal to the frequency of the corresponding letter in the puzzle.

>>> puzzle_letters = nltk.FreqDist('egivrvonl') 
>>> obligatory = 'r' 
>>> wordlist = nltk.corpus.words.words() 
>>> [w for w in wordlist if len(w) >= 6 
...                      and obligatory in w  
...                      and nltk.FreqDist(w) <= puzzle_letters]  

['glover', 'gorlin', 'govern', 'grovel', 'ignore', 'involver', 'lienor', 'linger', 'longer', 'lovering', 'noiler', 'overling', 'region', 'renvoi', 'revolving', 'ringle', 'roving', 'violer', 'virole']
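The same filter can be tried without downloading the words corpus. The sketch below assumes a tiny hand-picked `wordlist` (not the NLTK corpus) and uses `collections.Counter` with an explicit per-letter check in place of `FreqDist.__le__`; the function name `fits_puzzle` is hypothetical.

```python
from collections import Counter

puzzle_letters = Counter("egivrvonl")
obligatory = "r"

# Hypothetical stand-in for nltk.corpus.words.words().
wordlist = ["glover", "govern", "region", "letter", "violin", "over", "revolver"]


def fits_puzzle(word, letters):
    """True if no letter of `word` is used more often than `letters` allows."""
    return all(n <= letters[ch] for ch, n in Counter(word).items())


matches = [w for w in wordlist
           if len(w) >= 6
           and obligatory in w
           and fits_puzzle(w, puzzle_letters)]
print(matches)  # ['glover', 'govern', 'region']
```

"revolver" is rejected because it needs two r's and two e's while the puzzle has one of each, which is exactly the per-letter frequency check the book describes.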

alvations · 2017-12-22T02:19:00Z

One possible solution is to remove these lines:

    def __le__(self, other):
        if not isinstance(other, FreqDist):
            raise_unorderable_types("<=", self, other)
        return set(self).issubset(other) and all(self[key] <= other[key] for key in self)

    # @total_ordering doesn't work here, since the class inherits from a builtin class
    __ge__ = lambda self, other: not self <= other or self == other
    __lt__ = lambda self, other: self <= other and not self == other
    __gt__ = lambda self, other: not self <= other

And instead write one generic inequalities function, then define each comparison operator as a thin wrapper around it:

    from operator import lt, gt, le, ge
    def inequalities(self, other, _operator):
        operator2string = {lt: "<", gt: ">", le: "<=", ge: ">="}
        if not isinstance(other, FreqDist):
            raise_unorderable_types(operator2string[_operator], self, other)
        return all(_operator(self[key], other[key]) for key in self)

    __lt__ = lambda self, other: self.inequalities(other, lt)
    __gt__ = lambda self, other: self.inequalities(other, gt)
    __le__ = lambda self, other: self.inequalities(other, le)
    __ge__ = lambda self, other: self.inequalities(other, ge)

[out]:

>>> import nltk
>>> nltk.FreqDist('abc') > nltk.FreqDist('abd')
False
>>> nltk.FreqDist('abd') < nltk.FreqDist('abc')
False
>>> nltk.FreqDist('abcc') > nltk.FreqDist('abc')
False
>>> nltk.FreqDist('abcc') >= nltk.FreqDist('abc')
True
>>> nltk.FreqDist('aabbcc') > nltk.FreqDist('abc')
True
>>> nltk.FreqDist('aabbcc') < nltk.FreqDist('abc')
False
>>> nltk.FreqDist('abc') < nltk.FreqDist('aabbcc')
True
>>> nltk.FreqDist('abc') <= nltk.FreqDist('abc')
True
>>> nltk.FreqDist('abc') >= nltk.FreqDist('abc')
True
>>> nltk.FreqDist('abc') >= 'abc'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/liling.tan/git-stuff/nltk-alvas/nltk/nltk/probability.py", line 402, in <lambda>
    __ge__ = lambda self, other: self.inequalities(other, ge)
  File "/Users/liling.tan/git-stuff/nltk-alvas/nltk/nltk/probability.py", line 396, in inequalities
    raise_unorderable_types(operator2string[_operator], self, other)
  File "/Users/liling.tan/git-stuff/nltk-alvas/nltk/nltk/internals.py", line 982, in raise_unorderable_types
    raise TypeError("unorderable types: %s() %s %s()" % (type(a).__name__, ordering, type(b).__name__))
TypeError: unorderable types: FreqDist() >= str()

But it leads to some cases that look awkward yet are still logical, as long as the inequality is defined strictly as comparing the count of each item in "self", i.e. with the first FreqDist as the "deictic" reference point:

>>> nltk.FreqDist('xyz') >= nltk.FreqDist('abc')
True
>>> nltk.FreqDist('xyz') <= nltk.FreqDist('abc')
False

>>> nltk.FreqDist('abc') <= nltk.FreqDist('xyz')
False
>>> nltk.FreqDist('abc') >= nltk.FreqDist('xyz')
True

>>> nltk.FreqDist('xyz') >= nltk.FreqDist('abcx')
True
>>> nltk.FreqDist('xyz') >= nltk.FreqDist('abcxx')
False
>>> nltk.FreqDist('abc') <= nltk.FreqDist('xyz')
False
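These awkward cases follow from iterating only over the keys of the left operand. The sketch below reproduces that behaviour with plain `collections.Counter` objects; `keywise` mirrors the shape of the proposed `inequalities` function, while `symmetric_ge` is a hypothetical variant (not something proposed in this thread) that also inspects keys found only in the right operand.

```python
from collections import Counter
from operator import ge, le


def keywise(a, b, op):
    """Compare counts for the keys of `a` only, as in the proposal above."""
    return all(op(a[k], b[k]) for k in a)


def symmetric_ge(a, b):
    """Hypothetical variant: compare counts over the keys of both operands."""
    return all(a[k] >= b[k] for k in set(a) | set(b))


xyz, abc = Counter("xyz"), Counter("abc")

# Both directions report ">=" even though the two distributions differ,
# because keys that exist only in the right operand are never looked at.
print(keywise(xyz, abc, ge), keywise(abc, xyz, ge))  # True True
print(keywise(xyz, abc, le))                         # False

# Looking at the union of keys removes the asymmetry.
print(symmetric_ge(xyz, abc), symmetric_ge(abc, xyz))  # False False
```

Whether the one-sided or the symmetric reading is wanted is a design decision; the sketch only shows where the surprising results come from.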

iliakur · 2017-12-26T10:03:14Z

I've given this some thought and actually I think we shouldn't change the implementation (it makes sense), but instead document it clearly.
This involves me updating my SO answer, I'm cool with that, so just let me know what the final verdict on this is!

pyfisch · 2020-02-13T16:26:00Z

Hi,

I just had a student ask me how FreqDist less-than and greater-than work. After I explained a bit about the order relation for sets, they showed me this example, in which fdist1 is both bigger and smaller than fdist2. This does not make sense to me.

import nltk
liste1=["This","is","a","a","list"]
fdist1=nltk.FreqDist(liste1)
liste2=["This","This","is","a","list"]
fdist2=nltk.FreqDist(liste2)
print(nltk.__version__) # 3.4.5
print(fdist1>fdist2) # True
print(fdist2>fdist1) # True
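The contradiction can be traced to deriving `>` as the negation of `<=`, which is only valid when every pair of elements is comparable. The sketch below uses plain `collections.Counter` objects and made-up function names: `gt_by_negation` mirrors the `__gt__ = not self <= other` line quoted earlier in this thread, and `gt_partial` is one way to define strict "greater than" for a partial order. It is an illustration under those assumptions, not the code that was eventually merged.

```python
from collections import Counter


def le(a, b):
    """Multiset inclusion: each count in `a` is <= the matching count in `b`."""
    return all(n <= b[k] for k, n in a.items())


def gt_by_negation(a, b):
    """Mirrors the quoted definition: a > b is taken to mean not (a <= b)."""
    return not le(a, b)


def gt_partial(a, b):
    """Strict partial order: b is included in a, and a is not included in b."""
    return le(b, a) and not le(a, b)


fdist1 = Counter(["This", "is", "a", "a", "list"])
fdist2 = Counter(["This", "This", "is", "a", "list"])

# Neither distribution is included in the other, so negating <= says
# "greater" in both directions.
print(gt_by_negation(fdist1, fdist2), gt_by_negation(fdist2, fdist1))  # True True

# Under the partial order the two are simply incomparable.
print(gt_partial(fdist1, fdist2), gt_partial(fdist2, fdist1))  # False False
```

With a partial order, `a > b` and `a <= b` can both be False at once, which is the case the negation-based operators cannot express.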

alvations added book corpus good first issue nice idea labels Oct 4, 2017

alvations removed the good first issue label Dec 22, 2017

alvations added the good first issue label Dec 22, 2017

pyfisch mentioned this issue Feb 23, 2020

FreqDist ordering based on partial order #2502

Merged

stevenbird closed this as completed in #2502 Feb 23, 2020
