CHUNK_TAG_PATTERN does not allow curly bracket quantifiers for tags #1597

Closed
chmeyer opened this issue Jan 19, 2017 · 0 comments

Comments

@chmeyer
Contributor

commented Jan 19, 2017

Running the supplementary example from http://www.nltk.org/book/ch07.html#exploring-text-corpora ("Your Turn" section):

import nltk

cp = nltk.RegexpParser('CHUNK: {<N.*>{4,}}')
brown = nltk.corpus.brown
for sent in brown.tagged_sents():
    tree = cp.parse(sent)
    for subtree in tree.subtrees():
        if subtree.label() == 'CHUNK':
            print(subtree)

yields a ValueError:

  File "nltk/chunk/regexp.py", line 1130, in __init__
    self._read_grammar(grammar, root_label, trace)
  File "nltk/chunk/regexp.py", line 1166, in _read_grammar
    rules.append(RegexpChunkRule.fromstring(line))
  File "nltk/chunk/regexp.py", line 381, in fromstring
    raise ValueError('Illegal chunk pattern: %s' % rule)
ValueError: Illegal chunk pattern: {<N.*>{4,}}

This is because nltk.chunk.regexp.CHUNK_TAG_PATTERN does not permit curly brackets anywhere in a tag pattern, so the {4,} quantifier is rejected before the rule is compiled. A simple workaround is to set:

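import re
import nltk

# Also accept {m}, {m,}, {m,n} and {,n} quantifiers outside angle brackets.
nltk.chunk.regexp.CHUNK_TAG_PATTERN = re.compile(r'^((%s|<%s>)*)$' %
                                (r'([^\{\}<>]|\{\d+,?\d*\}|\{\d*,?\d+\})+',
                                 r'[^\{\}<>]+'))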

which might be a viable fix as well.
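
To check the effect of the workaround without loading the Brown corpus, here is a minimal sketch that compares the two validation patterns using only the standard `re` module. `ORIGINAL` is my assumption of what the unpatched `CHUNK_TAG_PATTERN` looks like (both alternatives excluding curly brackets); `PATCHED` is the pattern from the workaround. The names `ORIGINAL`, `PATCHED` and `accepts` are illustrative only and are not part of NLTK.

```python
import re

# Assumed shape of the unpatched pattern: no curly brackets allowed anywhere.
ORIGINAL = re.compile(r'^((%s|<%s>)*)$' % (r'[^\{\}<>]+', r'[^\{\}<>]+'))

# Workaround pattern: additionally accepts {m}, {m,}, {m,n} and {,n}
# quantifiers outside angle brackets.
PATCHED = re.compile(r'^((%s|<%s>)*)$' %
                     (r'([^\{\}<>]|\{\d+,?\d*\}|\{\d*,?\d+\})+',
                      r'[^\{\}<>]+'))


def accepts(pattern, tag_pattern):
    """Return True if the validation pattern matches the whole tag pattern."""
    return pattern.match(tag_pattern) is not None


if __name__ == '__main__':
    # Tag patterns are given without the enclosing chunk braces.
    for tag_pattern in ['<N.*>+', '<N.*>{4,}', '<DT>?<JJ>{,2}<NN>', '<N.*>{a}']:
        print(tag_pattern,
              'original:', accepts(ORIGINAL, tag_pattern),
              'patched:', accepts(PATCHED, tag_pattern))
```

This only exercises the validation step. It does not show that the rest of the chunk-rule compilation handles the quantifier correctly, so the full example above is still worth running with the patch applied.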
