
Text augmenters do not properly recombine items with punctuation. #143

Closed
wiseyoungbuck opened this issue Aug 15, 2020 · 5 comments
Labels
enhancement New feature or request

Comments

@wiseyoungbuck

wiseyoungbuck commented Aug 15, 2020

input:

import nlpaug
print(nlpaug.__version__)
import nlpaug.augmenter.char as nac
raug = nac.RandomCharAug(action='swap')
naug = nac.OcrAug()
version_number = '0.0.16'
ip_address = '192.168.1.1'
version_sentence = "Just released version 0.0.0.16"
ip_sentence = "My IP address is 192.168.1.1. Ping it for me."
normal_sentence = "This is a sentence?"
print(raug.augment(normal_sentence))
print(raug.augment(version_number))
print(raug.augment(ip_address))
print(raug.augment(version_sentence))
print(raug.augment(ip_sentence))
print(naug.augment(normal_sentence))
print(naug.augment(version_number))
print(naug.augment(ip_address))
print(naug.augment(version_sentence))
print(naug.augment(ip_sentence))

output:

This is a sentence ?
0 . 0 . 16
192 . 168 . 1 . 1
Ujst released version 0 . 0 . 0 . 16
My IP address is 192 . 168 . 1 . 1 . Ping it for me .
This i8 a sentence ?
0 . D . 16
192 . 768 . 1 . I
Jost ke1eased version 0 . 0 . 0 . 16
My IP address is 492 . 168 . 1 . l . Pin9 it for me .

Note that all punctuation is padded with spaces on both sides. I don't believe this is intended.

@classmethod
def _reverse_tokenizer(cls, tokens):
    return ' '.join(tokens)

This does not properly recombine punctuation.
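For illustration, here is a rough sketch (not nlpaug's actual code) of what a punctuation-aware reverse tokenizer could look like; the regexes are my own assumption about what "recombining properly" should mean, and they still would not fix the email case reported later in this thread:

import re

def reverse_tokenize(tokens):
    # ' '.join() pads every token, including punctuation, with spaces
    text = ' '.join(tokens)
    # close up "digit . digit" so IPs and version numbers survive the round trip
    text = re.sub(r'(?<=\d) ([.\-]) (?=\d)', r'\1', text)
    # drop the space left in front of sentence punctuation
    text = re.sub(r' ([.,!?;:])', r'\1', text)
    return text

print(reverse_tokenize(['This', 'is', 'a', 'sentence', '?']))     # This is a sentence?
print(reverse_tokenize(['192', '.', '168', '.', '1', '.', '1']))  # 192.168.1.1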

@wiseyoungbuck
Author

Maybe replace the standard tokenizer/untokenizer with something like this?
https://www.nltk.org/_modules/nltk/tokenize/treebank.html
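For example (my own sketch of the suggestion, assuming NLTK is installed), the Treebank pair round-trips sentence punctuation that a plain space join leaves padded:

from nltk.tokenize.treebank import TreebankWordTokenizer, TreebankWordDetokenizer

tok = TreebankWordTokenizer()
detok = TreebankWordDetokenizer()
tokens = tok.tokenize("This is a sentence?")
print(tokens)                    # e.g. ['This', 'is', 'a', 'sentence', '?']
print(detok.detokenize(tokens))  # expected: This is a sentence?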

@makcedward makcedward added the enhancement New feature or request label Aug 15, 2020
@makcedward
Owner

In the short term, you can provide a pre-defined / custom tokenizer and reverse_tokenizer.

def your_tokenizer(text):
    pass

def your_reverse_tokenizer(tokens):
    pass

raug = nac.RandomCharAug(action='swap', tokenizer=your_tokenizer, reverse_tokenizer=your_reverse_tokenizer)
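For instance, a minimal filled-in version of that skeleton (my own sketch, not code from this thread) could split on whitespace so that IPs, versions, and emails stay as single tokens, and reverse it with a plain join:

import nlpaug.augmenter.char as nac

def your_tokenizer(text):
    # anything without whitespace (IPs, versions, emails) stays one token
    return text.split()

def your_reverse_tokenizer(tokens):
    # exact inverse of the whitespace split
    return ' '.join(tokens)

raug = nac.RandomCharAug(action='swap',
                         tokenizer=your_tokenizer,
                         reverse_tokenizer=your_reverse_tokenizer)
print(raug.augment("My IP address is 192.168.1.1. Ping it for me."))

The trade-off is that punctuation stays inside the tokens, so the character-level augmentation can swap or corrupt the dots in "192.168.1.1." themselves.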

@wiseyoungbuck
Author

wiseyoungbuck commented Aug 26, 2020

Unfortunately, this issue persists.

import nlpaug.augmenter.char as nac

nac.OcrAug().augment('192.168.1.1')
# '192. 168. 1. 4'

nac.OcrAug().augment('john.doe@gmail.com')
# 'john. due @ 9mail. com'

@wiseyoungbuck
Author

It now adds a space after the tokens.

@wiseyoungbuck
Author

wiseyoungbuck commented Aug 26, 2020

I used nltk.tokenize.treebank.TreebankWordDetokenizer; it solves the case for IP addresses, but it has the same issue with email addresses.
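Roughly how it can be wired in (a sketch based on the maintainer's note above; I'm assuming OcrAug accepts the same reverse_tokenizer argument shown for RandomCharAug):

from nltk.tokenize.treebank import TreebankWordDetokenizer
import nlpaug.augmenter.char as nac

detok = TreebankWordDetokenizer()
aug = nac.OcrAug(reverse_tokenizer=detok.detokenize)
print(aug.augment('My IP address is 192.168.1.1. Ping it for me.'))
print(aug.augment('john.doe@gmail.com'))  # the email still comes back with extra spaces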
