
WordAugmenter._tokenizer can't remove excessive spaces, leading to nltk error #48

Closed
chiachong opened this issue Sep 26, 2019 · 1 comment

@chiachong

Hi,

When a sentence contains excessive (consecutive) spaces, for example:

```python
text = 'The  quick brown fox jumps over the lazy dog . 1  2'
```

it causes an IndexError in nltk because the tokenizer produces empty tokens. The resulting tokens:

```python
['The', '', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.', '1', '', '2']
```

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/cenozai/mypy/tf_models/nlp/nlpaug/nlpaug/base_augmenter.py", line 61, in augment
    result = self.substitute(data)
  File "/home/cenozai/mypy/tf_models/nlp/nlpaug/nlpaug/augmenter/word/synonym.py", line 83, in substitute
    pos = self.model.pos_tag(tokens)
  File "/home/cenozai/mypy/tf_models/nlp/nlpaug/nlpaug/model/word_dict/wordnet.py", line 46, in pos_tag
    return nltk.pos_tag(tokens)
  File "/home/cenozai/.local/lib/python3.6/site-packages/nltk/tag/__init__.py", line 162, in pos_tag
    return _pos_tag(tokens, tagset, tagger, lang)
  File "/home/cenozai/.local/lib/python3.6/site-packages/nltk/tag/__init__.py", line 119, in _pos_tag
    tagged_tokens = tagger.tag(tokens)
  File "/home/cenozai/.local/lib/python3.6/site-packages/nltk/tag/perceptron.py", line 175, in tag
    context = self.START + [self.normalize(w) for w in tokens] + self.END
  File "/home/cenozai/.local/lib/python3.6/site-packages/nltk/tag/perceptron.py", line 175, in <listcomp>
    context = self.START + [self.normalize(w) for w in tokens] + self.END
  File "/home/cenozai/.local/lib/python3.6/site-packages/nltk/tag/perceptron.py", line 261, in normalize
    elif word[0].isdigit():
IndexError: string index out of range
```
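For reference, the failure is reproducible with nltk alone: any zero-length token passed to nltk.pos_tag reaches PerceptronTagger.normalize(), which indexes word[0]. A minimal sketch, independent of nlpaug (requires the averaged_perceptron_tagger resource):

```python
import nltk

# nltk.download('averaged_perceptron_tagger')  # one-time resource download

# The empty string comes from splitting on ' ' when spaces are doubled.
tokens = 'The  quick brown fox'.split(' ')
print(tokens)  # ['The', '', 'quick', 'brown', 'fox']

try:
    nltk.pos_tag(tokens)
except IndexError as err:
    print('IndexError:', err)  # string index out of range
```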

A quick fix could be as follows. The original `WordAugmenter._tokenizer` in `word_augmenter.py` is:

```python
return text.split(' ')
```

Fix:

```python
return [t for t in text.split(' ') if len(t) > 0]
```
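As a standalone sketch (the surrounding method definition in `word_augmenter.py` is omitted here), the filtered split drops the empty tokens so the example sentence tokenizes cleanly:

```python
def _tokenizer(text):
    # Split on single spaces, then drop the zero-length tokens that
    # consecutive spaces produce, so nltk.pos_tag never receives ''.
    return [t for t in text.split(' ') if len(t) > 0]

text = 'The  quick brown fox jumps over the lazy dog . 1  2'
print(_tokenizer(text))
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.', '1', '2']
```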

@makcedward added the enhancement (New feature or request) label on Sep 28, 2019
@makcedward
Owner

Thank you for the suggestion. Will use it as the default tokenizer.
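Until the default tokenizer changes, a possible workaround sketch is to pass the filtered splitter into the augmenter, assuming the constructor accepts a `tokenizer` callable (that keyword, and the `SynonymAug` call shown, are assumptions; check the installed version's constructor before relying on this):

```python
import nlpaug.augmenter.word as naw

# Assumption: the augmenter constructor accepts a `tokenizer` callable.
aug = naw.SynonymAug(
    tokenizer=lambda text: [t for t in text.split(' ') if len(t) > 0]
)
print(aug.augment('The  quick brown fox jumps over the lazy dog . 1  2'))
```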
