
WordAugmenter._tokenizer can't remove excessive spaces, leading to nltk error #48

Closed
chiachong opened this issue Sep 26, 2019 · 1 comment

@chiachong

Hi,

When a sentence contains excessive (consecutive) spaces, for example:

```python
text = 'The  quick brown fox jumps over the lazy dog . 1  2'
```

it causes an IndexError in nltk because the tokenizer produces empty tokens. The resulting tokens:

```python
['The', '', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.', '1', '', '2']
```

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/cenozai/mypy/tf_models/nlp/nlpaug/nlpaug/base_augmenter.py", line 61, in augment
    result = self.substitute(data)
  File "/home/cenozai/mypy/tf_models/nlp/nlpaug/nlpaug/augmenter/word/synonym.py", line 83, in substitute
    pos = self.model.pos_tag(tokens)
  File "/home/cenozai/mypy/tf_models/nlp/nlpaug/nlpaug/model/word_dict/wordnet.py", line 46, in pos_tag
    return nltk.pos_tag(tokens)
  File "/home/cenozai/.local/lib/python3.6/site-packages/nltk/tag/__init__.py", line 162, in pos_tag
    return _pos_tag(tokens, tagset, tagger, lang)
  File "/home/cenozai/.local/lib/python3.6/site-packages/nltk/tag/__init__.py", line 119, in _pos_tag
    tagged_tokens = tagger.tag(tokens)
  File "/home/cenozai/.local/lib/python3.6/site-packages/nltk/tag/perceptron.py", line 175, in tag
    context = self.START + [self.normalize(w) for w in tokens] + self.END
  File "/home/cenozai/.local/lib/python3.6/site-packages/nltk/tag/perceptron.py", line 175, in <listcomp>
    context = self.START + [self.normalize(w) for w in tokens] + self.END
  File "/home/cenozai/.local/lib/python3.6/site-packages/nltk/tag/perceptron.py", line 261, in normalize
    elif word[0].isdigit():
IndexError: string index out of range
```
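For reference, the failure is reproducible with nltk alone: any zero-length token passed to nltk.pos_tag reaches PerceptronTagger.normalize(), which indexes word[0]. A minimal sketch, independent of nlpaug (requires the averaged_perceptron_tagger resource):

```python
import nltk

# nltk.download('averaged_perceptron_tagger')  # one-time resource download

# The empty string comes from splitting on ' ' when spaces are doubled.
tokens = 'The  quick brown fox'.split(' ')
print(tokens)  # ['The', '', 'quick', 'brown', 'fox']

try:
    nltk.pos_tag(tokens)
except IndexError as err:
    print('IndexError:', err)  # string index out of range
```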

A quick fix could be as follows. The original `WordAugmenter._tokenizer` in `word_augmenter.py` is:

```python
return text.split(' ')
```

Fix:

```python
return [t for t in text.split(' ') if len(t) > 0]
```
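As a standalone sketch (the surrounding method definition in `word_augmenter.py` is omitted here), the filtered split drops the empty tokens so the example sentence tokenizes cleanly:

```python
def _tokenizer(text):
    # Split on single spaces, then drop the zero-length tokens that
    # consecutive spaces produce, so nltk.pos_tag never receives ''.
    return [t for t in text.split(' ') if len(t) > 0]

text = 'The  quick brown fox jumps over the lazy dog . 1  2'
print(_tokenizer(text))
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.', '1', '2']
```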

@makcedward added the enhancement (New feature or request) label on Sep 28, 2019
@makcedward
Owner

Thank you for the suggestion. Will use it as the default tokenizer.
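Until the default tokenizer changes, a possible workaround sketch is to pass the filtered splitter into the augmenter, assuming the constructor accepts a `tokenizer` callable (that keyword, and the `SynonymAug` call shown, are assumptions; check the installed version's constructor before relying on this):

```python
import nlpaug.augmenter.word as naw

# Assumption: the augmenter constructor accepts a `tokenizer` callable.
aug = naw.SynonymAug(
    tokenizer=lambda text: [t for t in text.split(' ') if len(t) > 0]
)
print(aug.augment('The  quick brown fox jumps over the lazy dog . 1  2'))
```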
