Axe FreqDist ordering/comparisons? #1457
Comments
@copper-head: the motivation is given in section 4.1 of chapter 2 of the NLTK book. The operation corresponds to subset, for multisets. Since the book uses it, I would rather fix the implementation than break the book example. |
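To make the "subset, for multisets" reading concrete, here is a minimal sketch using plain `collections.Counter`. The helper name `is_submultiset` is hypothetical and is not part of NLTK; it only illustrates the idea that one distribution is "less than or equal to" another when none of its counts exceeds the matching count on the other side.

```python
from collections import Counter


def is_submultiset(a, b):
    """Return True if every item occurs in `a` at most as often as in `b`."""
    # A Counter returns 0 for missing keys, so items absent from `b` fail the test.
    return all(count <= b[item] for item, count in a.items())


abc = Counter("abc")
aabbcc = Counter("aabbcc")
abd = Counter("abd")

print(is_submultiset(abc, aabbcc))  # True: every count in abc fits inside aabbcc
print(is_submultiset(abc, abd))     # False: 'c' is missing from abd
print(is_submultiset(abd, abc))     # False: 'd' is missing from abc
```

The last two lines show that this is only a partial order: some pairs of distributions are not comparable in either direction.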
Hey! Could I give this a shot? :) |
@bdevnani3 Sure! But this might be a little challenging, since it has to do with a builtin Python class. The code referred to in this PR is in the And the motivation of having From http://www.nltk.org/book/ch02.html :
|
One possible solution is to remove these lines:
And write a generic inequalities function, then define each individual comparison operator in terms of it:

from operator import lt, gt, le, ge

def inequalities(self, other, _operator):
    # Map each operator to its symbol for the error message.
    operator2string = {lt: "<", gt: ">", le: "<=", ge: ">="}
    if not isinstance(other, FreqDist):
        raise_unorderable_types(operator2string[_operator], self, other)
    # Compare the count of every key in self against its count in other.
    return all(_operator(self[key], other[key]) for key in self)

__lt__ = lambda self, other: self.inequalities(other, lt)
__gt__ = lambda self, other: self.inequalities(other, gt)
__le__ = lambda self, other: self.inequalities(other, le)
__ge__ = lambda self, other: self.inequalities(other, ge)

[out]:

>>> import nltk
>>> nltk.FreqDist('abc') > nltk.FreqDist('abd')
False
>>> nltk.FreqDist('abd') < nltk.FreqDist('abc')
False
>>> nltk.FreqDist('abcc') > nltk.FreqDist('abc')
False
>>> nltk.FreqDist('abcc') >= nltk.FreqDist('abc')
True
>>> nltk.FreqDist('aabbcc') > nltk.FreqDist('abc')
True
>>> nltk.FreqDist('aabbcc') < nltk.FreqDist('abc')
False
>>> nltk.FreqDist('abc') < nltk.FreqDist('aabbcc')
True
>>> nltk.FreqDist('abc') <= nltk.FreqDist('abc')
True
>>> nltk.FreqDist('abc') >= nltk.FreqDist('abc')
True
>>> nltk.FreqDist('abc') >= 'abc'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/liling.tan/git-stuff/nltk-alvas/nltk/nltk/probability.py", line 402, in <lambda>
    __ge__ = lambda self, other: self.inequalities(other, ge)
  File "/Users/liling.tan/git-stuff/nltk-alvas/nltk/nltk/probability.py", line 396, in inequalities
    raise_unorderable_types(operator2string[_operator], self, other)
  File "/Users/liling.tan/git-stuff/nltk-alvas/nltk/nltk/internals.py", line 982, in raise_unorderable_types
    raise TypeError("unorderable types: %s() %s %s()" % (type(a).__name__, ordering, type(b).__name__))
TypeError: unorderable types: FreqDist() > str()

But this leads to some cases that are awkward yet still logical, provided the inequality is defined strictly as comparing the count of each word in "self", i.e. taking the first FreqDist as the "deictic" reference point:
|
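A minimal sketch of the kind of awkward case this produces, assuming the comparison only visits the keys of the left-hand operand as in the proposal above. `SketchDist` is a hypothetical stand-in built on `collections.Counter`, not NLTK's `FreqDist`, and it returns `NotImplemented` instead of calling `raise_unorderable_types` so that it stays self-contained.

```python
from collections import Counter
from operator import ge, le


class SketchDist(Counter):
    """Hypothetical stand-in for FreqDist, used only to illustrate the proposal."""

    def _compare(self, other, op):
        if not isinstance(other, SketchDist):
            return NotImplemented
        # Only keys of `self` are visited; keys found only in `other` are ignored.
        return all(op(self[key], other[key]) for key in self)

    def __le__(self, other):
        return self._compare(other, le)

    def __ge__(self, other):
        return self._compare(other, ge)


small = SketchDist("abc")
large = SketchDist("abcd")

print(small >= large)  # True: 'd' is never inspected because it is not a key of small
print(large <= small)  # False: large has 'd' with count 1, small has count 0
```

Under this definition `a >= b` and `b <= a` can disagree, because each side only looks at its own keys.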
I've given this some thought and actually I think we shouldn't change the implementation (it makes sense), but instead document it clearly. |
Hi, I just had a student ask me about how

import nltk
liste1 = ["This", "is", "a", "a", "list"]
fdist1 = nltk.FreqDist(liste1)
liste2 = ["This", "This", "is", "a", "list"]
fdist2 = nltk.FreqDist(liste2)
print(nltk.__version__)  # 3.4.5
print(fdist1 > fdist2)  # True
print(fdist2 > fdist1)  # True
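One way a result like this can arise is when `>` is defined as the negation of `<=` on a partial order. The sketch below uses plain `collections.Counter` and two hypothetical helpers, `le` and `gt_by_negation`; it is an assumption made for illustration, not a statement about what the NLTK 3.4.5 source actually does.

```python
from collections import Counter


def le(a, b):
    """Multiset inclusion: no count in `a` exceeds the matching count in `b`."""
    return all(count <= b[item] for item, count in a.items())


def gt_by_negation(a, b):
    # Hypothetical definition: "greater than" as "not less than or equal".
    # This is only sound for total orders, and multiset inclusion is not one.
    return not le(a, b)


fdist1 = Counter(["This", "is", "a", "a", "list"])
fdist2 = Counter(["This", "This", "is", "a", "list"])

print(gt_by_negation(fdist1, fdist2))  # True: 'a' occurs twice in fdist1, once in fdist2
print(gt_by_negation(fdist2, fdist1))  # True: 'This' occurs twice in fdist2, once in fdist1
```

Because neither distribution is contained in the other, both negations come out true, which matches the surprising pair of results reported above.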
This StackOverflow question made me aware that the FreqDist class implements ordering comparisons. As the question shows, the implementation is buggy so we should do something about it.
I propose we get rid of it altogether, considering that its parent class Counter doesn't support comparisons and that it's not entirely clear what it means for one frequency distribution to be "greater than" another.
Thoughts?