
Text augmenters do not properly recombine items with punctuation. #143

Closed
wiseyoungbuck opened this issue Aug 15, 2020 · 5 comments
Labels
enhancement New feature or request

Comments

@wiseyoungbuck

wiseyoungbuck commented Aug 15, 2020

input:

import nlpaug
print(nlpaug.__version__)
import nlpaug.augmenter.char as nac
raug = nac.RandomCharAug(action='swap')
naug = nac.OcrAug()
version_number = '0.0.16'
ip_address = '192.168.1.1'
version_sentence = "Just released version 0.0.0.16"
ip_sentence = "My IP address is 192.168.1.1. Ping it for me."
normal_sentence = "This is a sentence?"
print(raug.augment(normal_sentence))
print(raug.augment(version_number))
print(raug.augment(ip_address))
print(raug.augment(version_sentence))
print(raug.augment(ip_sentence))
print(naug.augment(normal_sentence))
print(naug.augment(version_number))
print(naug.augment(ip_address))
print(naug.augment(version_sentence))
print(naug.augment(ip_sentence))

output:

This is a sentence ?
0 . 0 . 16
192 . 168 . 1 . 1
Ujst released version 0 . 0 . 0 . 16
My IP address is 192 . 168 . 1 . 1 . Ping it for me .
This i8 a sentence ?
0 . D . 16
192 . 768 . 1 . I
Jost ke1eased version 0 . 0 . 0 . 16
My IP address is 492 . 168 . 1 . l . Pin9 it for me .

Note that all punctuation is padded with spaces on both sides. I don't believe this is intended.

@classmethod
def _reverse_tokenizer(cls, tokens):
    return ' '.join(tokens)

This does not properly recombine punctuation.
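For illustration, here is a rough sketch (not nlpaug's actual code) of what a punctuation-aware reverse tokenizer could look like; the regexes are my own assumption about what "recombining properly" should mean, and they still would not fix the email case reported later in this thread:

import re

def reverse_tokenize(tokens):
    # ' '.join() pads every token, including punctuation, with spaces
    text = ' '.join(tokens)
    # close up "digit . digit" so IPs and version numbers survive the round trip
    text = re.sub(r'(?<=\d) ([.\-]) (?=\d)', r'\1', text)
    # drop the space left in front of sentence punctuation
    text = re.sub(r' ([.,!?;:])', r'\1', text)
    return text

print(reverse_tokenize(['This', 'is', 'a', 'sentence', '?']))     # This is a sentence?
print(reverse_tokenize(['192', '.', '168', '.', '1', '.', '1']))  # 192.168.1.1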

@wiseyoungbuck
Author

Maybe replace the standard tokenizer/untokenizer with something like this?
https://www.nltk.org/_modules/nltk/tokenize/treebank.html
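For example (my own sketch of the suggestion, assuming NLTK is installed), the Treebank pair round-trips sentence punctuation that a plain space join leaves padded:

from nltk.tokenize.treebank import TreebankWordTokenizer, TreebankWordDetokenizer

tok = TreebankWordTokenizer()
detok = TreebankWordDetokenizer()
tokens = tok.tokenize("This is a sentence?")
print(tokens)                    # e.g. ['This', 'is', 'a', 'sentence', '?']
print(detok.detokenize(tokens))  # expected: This is a sentence?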

@makcedward makcedward added the enhancement New feature or request label Aug 15, 2020
@makcedward
Owner

In the short term, you can provide a pre-defined / custom tokenizer and reverse_tokenizer.

def your_tokenizer(text):
    pass

def your_reverse_tokenizer(tokens):
    pass

raug = nac.RandomCharAug(action='swap', tokenizer=your_tokenizer, reverse_tokenizer=your_reverse_tokenizer)
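For instance, a minimal filled-in version of that skeleton (my own sketch, not code from this thread) could split on whitespace so that IPs, versions, and emails stay as single tokens, and reverse it with a plain join:

import nlpaug.augmenter.char as nac

def your_tokenizer(text):
    # anything without whitespace (IPs, versions, emails) stays one token
    return text.split()

def your_reverse_tokenizer(tokens):
    # exact inverse of the whitespace split
    return ' '.join(tokens)

raug = nac.RandomCharAug(action='swap',
                         tokenizer=your_tokenizer,
                         reverse_tokenizer=your_reverse_tokenizer)
print(raug.augment("My IP address is 192.168.1.1. Ping it for me."))

The trade-off is that punctuation stays inside the tokens, so the character-level augmentation can swap or corrupt the dots in "192.168.1.1." themselves.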

@wiseyoungbuck
Author

wiseyoungbuck commented Aug 26, 2020

Unfortunately, this issue persists.

import nlpaug.augmenter.char as nac

nac.OcrAug().augment('192.168.1.1')
# '192. 168. 1. 4'

nac.OcrAug().augment('john.doe@gmail.com')
# 'john. due @ 9mail. com'

@wiseyoungbuck
Author

It now adds a space after the tokens.

@wiseyoungbuck
Author

wiseyoungbuck commented Aug 26, 2020

I used nltk.tokenize.treebank.TreebankWordDetokenizer; it solves the case for IP addresses, but it has the same issue with email addresses.
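Roughly how it can be wired in (a sketch based on the maintainer's note above; I'm assuming OcrAug accepts the same reverse_tokenizer argument shown for RandomCharAug):

from nltk.tokenize.treebank import TreebankWordDetokenizer
import nlpaug.augmenter.char as nac

detok = TreebankWordDetokenizer()
aug = nac.OcrAug(reverse_tokenizer=detok.detokenize)
print(aug.augment('My IP address is 192.168.1.1. Ping it for me.'))
print(aug.augment('john.doe@gmail.com'))  # the email still comes back with extra spaces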
