Infinite loop in Wordnet closure and tree #2314

Open
basaldella opened this issue Jun 11, 2019 · 15 comments

basaldella commented Jun 11, 2019

When I try to compute the hyponymic closure of restrain.v.01 I get stuck in an infinite loop. Example code:

>>> import nltk
>>> nltk.__version__
'3.4.3'
>>> from nltk.corpus import wordnet as wn
>>> ss = wn.synset('restrain.v.01')
>>> list(ss.closure(lambda s:s.hyponyms()))

And from here it just hangs. I guess the reason is that the method is based on breadth_first, whose documentation explicitly states that it is not designed to handle loops.
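To see why, a toy two-node cycle is enough (hypothetical names standing in for the restrain.v.01/inhibit.v.04 pair): a plain breadth-first generator with no visited-node check revisits the same nodes forever.

```python
from collections import deque

# Hypothetical two-node cyclic relation, mirroring the WordNet loop:
rel = {"restrain": ["inhibit"], "inhibit": ["restrain"]}

def breadth_first(start, children):
    # Plain BFS without a visited set: on a cycle it never terminates.
    queue = deque([start])
    while queue:
        node = queue.popleft()
        yield node
        queue.extend(children(node))

gen = breadth_first("restrain", lambda n: rel[n])
first_four = [next(gen) for _ in range(4)]  # the two nodes alternate forever
```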

The same happens when running

>>> list(ss.tree(lambda s:s.hyponyms()))

which fails with a RecursionError instead of hanging.

This seems closely related to #52, which is now closed, but the bug is apparently still present.

@basaldella basaldella changed the title Infinite loop in Wordnet closure Infinite loop in Wordnet closure and tree Jun 11, 2019

basaldella commented Jun 11, 2019

A tentative solution for tree may be:

def tree(synset, rel, traversed=None):
    # Use None as the default to avoid sharing one mutable set
    # across top-level calls:
    if traversed is None:
        traversed = set()
    _tree = [synset]
    traversed.add(synset)
    for x in rel(synset):
        if x not in traversed:
            # Recurse with a copy of the traversed set for each branch:
            branch = traversed.copy()
            _tree += [tree(x, rel, branch)]
    return _tree

This seems to work for me. I don't know whether it's the most efficient or most elegant solution, but it works fine for my personal test cases.

By the way, I absolutely don't get the sense of depth and cut_mark in the current implementation of tree - they do nothing and seem to be the remnant of some old implementation of the method.

Edit: OK, maybe I get the sense of depth and cut_mark, but I still think the documentation should say something meaningful about them.


terryyz commented May 9, 2020

Just tested this issue. It still exists. Can I have a try at fixing it, then?


dimazest commented May 9, 2020 via email


ekaf commented Aug 2, 2020

In the original WordNet 3.0 database, there used to be endless cycles between 'restrain.v.01' and 'inhibit.v.04' with both hypernyms() and hyponyms(). Currently, nltk uses a patched version of WordNet 3.0, where the hypernym cycle (but not the hyponym cycle) was removed from the "nltk_data/corpora/wordnet/data.verb" database file, so while there is no longer any hypernym cycle to detect, both the following still loop forever:

>>> print(list(wn.synset('restrain.v.01').closure(lambda s:s.hyponyms())))
>>> print(list(wn.synset('inhibit.v.04').closure(lambda s:s.hyponyms())))

Additionally, there is a "topic_domain" cycle between "computer" and "computer_science", which is present in several of the latest Princeton WordNet versions (and also in English Wordnet 2020). So the following also hangs forever:

>>> print(list(wn.synset('computer.n.01').closure(lambda s:s.topic_domains())))

This contradicts the comment in util.py ("No need to check for cycles."). There is a need to check for cycles, in order to avoid eventual endless loops.


ekaf commented Aug 8, 2020

In "util.py", "breadth_first" does not call "tree" before looping forever, so changing the definition of "tree" cannot prevent endless cycles in "closure". One possibility is to change the definition of "breadth_first", but that is not my preferred solution, because it would also affect users of "breadth_first" outside of wordnet. A simpler solution is to change the definition of "closure" in "wordnet.py". Since "closure" already checks a list of traversed "synset_offsets", it is easy to add an "else" branch to that check and break out of the for loop when a cycle is detected.

(Update October 3) NB: this fix is not advisable after all! The problem is that it would truncate some closures, for example the hypernyms of 'dog.n.01' (please see the next comment).

    def closure(self, rel, depth=-1):
        from nltk.util import breadth_first

        synset_offsets = []
        for synset in breadth_first(self, rel, depth):
            if synset._offset != self._offset:
                if synset._offset not in synset_offsets:
                    synset_offsets.append(synset._offset)
                    yield synset
                else:
                    print("Warning: cycle at %s" % (synset,))
                    break

With the fix above, "closure" detects the "topic_domains" cycle:

>>> import nltk
>>> print("nltk v. %s" % nltk.__version__)
nltk v. 3.4.5

>>> wn=nltk.corpus.wordnet

>>> print(list(wn.synset('computer.n.01').closure(lambda s:s.topic_domains())))
Warning: cycle at Synset('computer_science.n.01')
[Synset('computer_science.n.01')]

This cycle has been present in all WordNet versions since WN 2.0, where topic_domains were first introduced.

The situation is completely different with the cycle in the hypernyms of 'restrain.v.01', because this cycle was only present in the original Princeton WordNet 3.0 database, and no longer exists in the latest WordNet version (v. 3.1). Nltk uses a patched version of WordNet 3.0, where the hypernym cycle (but not the hyponym cycle) was removed from the "nltk_data/corpora/wordnet/data.verb" database file, so there just is no cycle to detect for hypernyms (although there still is one for hyponyms):

>>> print(list(wn.synset('restrain.v.01').closure(lambda s:s.hypernyms())))
[Synset('inhibit.v.04'), Synset('suppress.v.04'), Synset('forget.v.01')]


ekaf commented Oct 2, 2020

Note, however, that although @ekaf's proposed solution above for "closure" stops the endless loops (by checking whether a synset was already traversed), it introduces a bug worse than the original problem, since it truncates trees that contain non-cyclic duplicate branches.

For example, please consider the hypernym tree of 'dog.n.01' (listed in the comments of the original source code in wordnet.py). This tree starts with two distinct hypernym branches ('canine.n.02' and 'domestic_animal.n.01'), which merge again at 'animal.n.01'. The hypernyms of 'animal.n.01' then only need to be collected once, but would need to be copied to the second branch in order to display the full tree.

On the contrary, @basaldella's solution for tree does not have that problem because, for each child, it recurses with a different copy of the "traversed" set.

Meanwhile, using only one global traversed set, "tree" does not display the full tree, but truncates the second branch:

[Synset('dog.n.01'), [Synset('canine.n.02'), [Synset('carnivore.n.01'), [Synset('placental.n.01'), [Synset('mammal.n.01'), [Synset('vertebrate.n.01'), [Synset('chordate.n.01'), [Synset('animal.n.01'), [Synset('organism.n.01'), [Synset('living_thing.n.01'), [Synset('whole.n.02'), [Synset('object.n.01'), [Synset('physical_entity.n.01'), [Synset('entity.n.01')]]]]]]]]]]]]], [Synset('domestic_animal.n.01'), [Synset('animal.n.01'), '(Already Traversed...)']]]

However, using only one global "traversed" set is the only solution that can terminate when the requested tree is too big to otherwise fit in memory, which happens when a relation produces a big Strongly Connected Component, as the also_sees() relation does, for example.
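The trade-off can be sketched on a toy DAG (hypothetical node names standing in for the dog.n.01 hypernyms): one shared traversed set truncates the branch that merges at "animal", while a per-path copy preserves it.

```python
# Toy DAG: two branches from "dog" merge again at "animal".
rel = {"dog": ["canine", "domestic_animal"],
       "canine": ["animal"], "domestic_animal": ["animal"], "animal": []}
children = lambda n: rel[n]

def tree_global(node, children, traversed):
    # One shared traversed set: the merged branch is truncated.
    traversed.add(node)
    return [node] + [tree_global(c, children, traversed)
                     for c in children(node) if c not in traversed]

def tree_per_path(node, children, traversed):
    # A fresh copy per child: duplicate (non-cyclic) branches are kept.
    return [node] + [tree_per_path(c, children, traversed | {c})
                     for c in children(node) if c not in traversed]

truncated = tree_global("dog", children, set())
full = tree_per_path("dog", children, {"dog"})
# truncated lacks "animal" under "domestic_animal"; full repeats it there.
```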


ekaf commented Oct 7, 2020

Loops are only endless when they occur within the same path in the tree. Once these are prevented, redundant searches in different paths are still possible.

When the tree is traversed top-down, each recursive call of "tree" stays in the same path so, at each recursive step, the current depth equals the number of nodes traversed in the current path.

import warnings

def tree(node, children=iter, depth=-1, cut_mark=None, traversed=None):
    if traversed is None:
        traversed = {node}
    # Current depth equals the number of traversed nodes in the current path:
    cur_depth = len(traversed)
    out_tree = [node]
    if cur_depth != depth:
        children_depth = -cur_depth - 1
        for child in children(node):
            if child in traversed:
                warnings.warn('Discarded redundant search for {0} at depth {1}'.format(child, children_depth), stacklevel=3)
                if cut_mark:
                    out_tree += ['Cycle({0},{1},{2})'.format(child, children_depth, cut_mark)]
            else:
                # Recurse with a different "traversed" set for each child:
                out_tree += [tree(child, children, depth, cut_mark, traversed.union({child}))]
    elif cut_mark:
        out_tree += [cut_mark]
    return out_tree

This detects (and discards) endless loops:

>>> print("nltk v. %s" % nltk.__version__)
nltk v. 3.5

>>> pprint(tree(wn.synset('computer.n.01'), lambda s:s.topic_domains()))
[Synset('computer.n.01'), [Synset('computer_science.n.01')]]

UserWarning: Discarded redundant search for Synset('computer.n.01') at depth -3

However, duplicate branches are preserved, yielding the same output as the current "tree" function:

>>> pprint(tree(wn.synset('dog.n.01'), lambda s:s.hypernyms()))
[Synset('dog.n.01'),
 [Synset('canine.n.02'),
  [Synset('carnivore.n.01'),
   [Synset('placental.n.01'),
    [Synset('mammal.n.01'),
     [Synset('vertebrate.n.01'),
      [Synset('chordate.n.01'),
       [Synset('animal.n.01'),
        [Synset('organism.n.01'),
         [Synset('living_thing.n.01'),
          [Synset('whole.n.02'),
           [Synset('object.n.01'),
            [Synset('physical_entity.n.01'),
             [Synset('entity.n.01')]]]]]]]]]]]]],
 [Synset('domestic_animal.n.01'),
  [Synset('animal.n.01'),
   [Synset('organism.n.01'),
    [Synset('living_thing.n.01'),
     [Synset('whole.n.02'),
      [Synset('object.n.01'),
       [Synset('physical_entity.n.01'), [Synset('entity.n.01')]]]]]]]]]


ekaf commented Oct 7, 2020

The issue with "closure" is different from "tree", because it repeatedly calls an external breadth-first generator, which cannot be interrupted without also losing any subsequent synset in the relation, and that would not be acceptable. So we need to change the search function itself, rather than the wrapper around it. While "tree" should preserve duplicate branches, "closure" should discard them, since it is only expected to return each synset once, so there is no need to keep track of the different branches.

The following top-down "aclosure" (acyclic closure) discards the cycles, while still reaching any subsequent synset in the relation:

    def aclosure(self, rel, depth=-1, cut_mark=None, traversed=None):
        """Top-down acyclic transitive closure of source under the rel
        relationship, discarding eventual cycles:
            >>> from nltk.corpus import wordnet as wn
            >>> dog = wn.synset('dog.n.01')
            >>> hyp = lambda s:s.hypernyms()
            >>> dog.aclosure(hyp)
            Warning: discarding cycle at Synset('animal.n.01')
            [Synset('dog.n.01'), Synset('canine.n.02'), Synset('carnivore.n.01'),
            Synset('placental.n.01'), Synset('mammal.n.01'), Synset('vertebrate.n.01'),
            Synset('chordate.n.01'), Synset('animal.n.01'), Synset('organism.n.01'),
            Synset('living_thing.n.01'), Synset('whole.n.02'), Synset('object.n.01'),
            Synset('physical_entity.n.01'), Synset('entity.n.01'),
            Synset('domestic_animal.n.01')]
        """
        if traversed is None:
            traversed = set()
        relset = []
        if depth != 0:
            if self in traversed:
                print('Warning: discarding cycle at %s' % self)
            else:
                relset.append(self)
                traversed.add(self)
                for x in rel(self):
                    relset.extend(x.aclosure(rel, depth - 1, cut_mark, traversed))
        elif cut_mark:
            relset.append(cut_mark)
        return relset

This discards endless loops:

>>> wn=nltk.corpus.wordnet
>>> print(wn.synset('computer.n.01').aclosure(lambda s:s.topic_domains()))
Warning: discarding cycle at Synset('computer.n.01')
[Synset('computer.n.01'), Synset('computer_science.n.01')]

But redundant paths do not prevent the search from reaching the topmost node. Please note that "aclosure" outputs the synsets in a slightly different order than "closure", because the second branch ('domestic_animal.n.01') is output at the end, after all the hypernyms of 'canine.n.02' have been exhausted.

>>> from pprint import pprint
>>> pprint(wn.synset('dog.n.01').aclosure(lambda s:s.hypernyms()))
Warning: discarding cycle at Synset('animal.n.01')
[Synset('dog.n.01'),
 Synset('canine.n.02'),
 Synset('carnivore.n.01'),
 Synset('placental.n.01'),
 Synset('mammal.n.01'),
 Synset('vertebrate.n.01'),
 Synset('chordate.n.01'),
 Synset('animal.n.01'),
 Synset('organism.n.01'),
 Synset('living_thing.n.01'),
 Synset('whole.n.02'),
 Synset('object.n.01'),
 Synset('physical_entity.n.01'),
 Synset('entity.n.01'),
 Synset('domestic_animal.n.01')]


ekaf commented Oct 11, 2020

Great style, @cubiedev!

However, due to the potential severity of this bug, it is preferable to raise a warning rather than silently ignore the cycles. So, although the fix can be coded compactly with list and set comprehensions, a more explicit coding style may be a better choice in most cases, in order to be able to report anomalies.

Also, since cyclicity is a general, severe problem that also occurs in graphs other than Wordnet, it is more helpful to provide a general-purpose solution outside of the Wordnet module, so that it is available to check other graphs as well.

So I would suggest implementing an "acyclic_breadth_first" function in nltk/util.py that detects and discards cycles. It need only be a simple clone of the existing "breadth_first" function, with a few additional lines to handle cycles.
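A minimal sketch of what such an "acyclic_breadth_first" could look like (the name and signature follow the suggestion above; the body is an illustrative assumption, not NLTK's actual code):

```python
from collections import deque

def acyclic_breadth_first(node, children=iter, depth=-1):
    # Breadth-first traversal plus a traversed set that discards
    # already-seen nodes, so cycles terminate.
    traversed = {node}
    queue = deque([(node, depth)])
    while queue:
        node, depth = queue.popleft()
        yield node
        if depth != 0:
            for child in children(node):
                if child not in traversed:
                    traversed.add(child)
                    queue.append((child, depth - 1))

# Toy "topic_domain"-style cycle between two hypothetical nodes:
rel = {"computer": ["computer_science"], "computer_science": ["computer"]}
result = list(acyclic_breadth_first("computer", lambda n: rel[n]))
```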

Then wordnet.py needs only a simple wrapper to implement relation closure:

    def closure(self, rel, depth=-1):
        """
        Return the transitive closure of source under the rel
        relationship, breadth-first, discarding cycles:
        """

        from nltk.util import acyclic_breadth_first
        return acyclic_breadth_first(self, rel, depth)

Likewise, with an "acyclic_depth_first" search function that detects cycles, the Wordnet "tree" wrapper can simply be:

    def tree(self, rel, depth=-1, cut_mark=None):
        """
        Return the full relation tree, including self,
        discarding cycles:

        >>> from nltk.corpus import wordnet as wn
        >>> from pprint import pprint
        >>> computer = wn.synset('computer.n.01')
        >>> topic = lambda s:s.topic_domains()
        >>> pprint(computer.tree(topic))

        UserWarning: Discarded redundant search ([(Synset('computer.n.01'), -3)])
        [Synset('computer.n.01'), [Synset('computer_science.n.01')]]
        """

        from nltk.util import acyclic_depth_first
        return acyclic_depth_first(self, rel, depth, cut_mark)


ekaf commented Oct 15, 2020

In addition to the small number of endless cycles already found, theory predicts that closure will also loop endlessly on any symmetric relation, since all symmetric relation links are defined in both directions. One could argue that the non-termination of symmetric closure is an inherent property rather than a bug. Nevertheless, it may still be an issue for unsuspecting users.

In WordNet, two relations between synsets are symmetric: there are 1748 verb_groups() links and 1278 attributes() links in the current NLTK version. Additionally, the also_sees() relation is almost (but not fully) symmetric. So, in total, closure and tree will loop endlessly on any of over 5,000 symmetric synset relation links.

For example, several senses of the verb "break" have verb_groups, and any of these calls loops endlessly:

>>> print(list(wn.synset('break.v.02').closure(lambda s:s.verb_groups())))
>>> print(wn.synset('break.v.02').tree(lambda s:s.verb_groups()))

The same problem also occurs with senses 03, 04, 05, and 07 of break.

Similarly, with attributes(), any call of closure or tree also hangs forever:

>>> print(wn.synset('accurate.a.01').tree(lambda s:s.attributes()))

But the worst problem by far is the also_sees() relation (2692 links between synsets), which has both 2490 symmetric links and 202 non-symmetric links involving synsets that also have at least one symmetric link. This combination is particularly toxic, because it produces a huge Strongly Connected Component (SCC) of 758 synsets, where every member of the SCC is transitively connected to all the other members. I am not sure how to estimate the size of such a tree, but guess that it could be roughly the average number of children per node raised to the 758th power. The full also_sees() tree, starting at any member of this SCC, for example 'concrete.a.01', could then not fit on any single computer, no matter how efficiently trees are represented or computed. The situation is better with the second-largest also_sees() SCC, which includes 19 synsets: their full also_sees() tree is computable in a short time, discarding only local cycles.

But with unwieldy SCCs, pruning the tree globally (as for "closure") currently seems the most effective solution, although for "tree" it has the unacceptable drawback of discarding legitimate duplicate paths, thus truncating the output tree.

Still, all implementations of "tree", including the current one in NLTK, can terminate when the "depth" parameter is limited. For example:

>>> from pprint import pprint
>>> pprint(wn.synset('concrete.a.01').tree(lambda s:s.also_sees(), cut_mark='...', depth=5))

stevenbird pushed a commit that referenced this issue Dec 12, 2020
* Fixes issue #2314 (Infinite loop in Wordnet closure and tree)

* Edited comments for closure and tree

* use pop(0) instead of pop() to output closure in correct order

* Include the caller node in acyclic_breadth_first() output, but discard it in Wordnet closure(). Improved comment in acyclic_depth_first().

* use deque instead of list

* Warn directly about each cycle instead of collecting them in a list

* Use acyclic_branch_depth_first for wordnet.tree, handle harder cycles with acyclic_depth_first

* fixed typo, added doctests

* doctests: adapted calls to match the documented function
@rmccampbell

Why does closure() print a "Discarded redundant search for Synset" warning? This doesn't seem actionable or useful to the caller. I can see why it may be relevant to know if the tree is truncated for tree() but closure is just a set, so discarding duplicates is just expected behavior.


ekaf commented Dec 17, 2023

@rmccampbell, cyclic relations are not expected in WordNet (except for a few occurrences of symmetry), so the warnings are meant to be actionable for debugging spurious cyclicity in the relational databases. This is, for example, the case with the initial example that started this issue.


rmccampbell commented Dec 17, 2023

It's not just cycles: it's any case where a node is duplicated in the tree, which happens very often for relationships like hypernyms. Even the examples in the documentation show this warning in the docstring output. If you want to ensure there are no cycles in WordNet, that's what tests are for, not warnings that developers have to silence with catch_warnings every time they call the function, or expose to end users who can do nothing about them.


ekaf commented Dec 18, 2023

Yes, these warnings are not relevant to end users. A solution is to add a verbose parameter to the 4 functions that call warnings.warn() in util.py, so that the warning is only emitted when verbose is True.
Of course, the default setting should be verbose=False.
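The verbose guard itself is simple; a hypothetical sketch (not the actual util.py code) of how one of the four functions could take it:

```python
import warnings
from collections import deque

def acyclic_breadth_first(node, children=iter, depth=-1, verbose=False):
    # Hypothetical sketch: emit the cycle warning only when verbose=True,
    # so end users see nothing while debuggers can opt in.
    traversed = {node}
    queue = deque([(node, depth)])
    while queue:
        node, depth = queue.popleft()
        yield node
        if depth != 0:
            for child in children(node):
                if child in traversed:
                    if verbose:
                        warnings.warn("Discarded redundant search for %s" % (child,))
                else:
                    traversed.add(child)
                    queue.append((child, depth - 1))

# Toy symmetric relation: quiet by default, warns only when asked.
rel = {"a": ["b"], "b": ["a"]}
quiet = list(acyclic_breadth_first("a", lambda n: rel[n]))
noisy = list(acyclic_breadth_first("a", lambda n: rel[n], verbose=True))
```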

@rmccampbell

Yes that would work, but I don't really see the benefit of the warnings anyway for closure, which is perfectly well defined whether or not there are multiple paths to a node or even cycles.
