-
Notifications
You must be signed in to change notification settings - Fork 2.9k
-
Notifications
You must be signed in to change notification settings - Fork 2.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
span_tokenize failed when sentence contains double quotation #1750
Comments
Thanks @albertauyeung for reporting the issue. Do you have an example where you met an error with the Do you mean something like this? >>> from nltk.tokenize.treebank import TreebankWordTokenizer
>>> tbw = TreebankWordTokenizer
>>> tbw = TreebankWordTokenizer()
>>> s = '''This is a sentence with "quotes inside" and alsom some 'single quotes', etc.'''
>>> print(s)
This is a sentence with "quotes inside" and alsom some 'single quotes', etc.
>>> tbw.span_tokenize(s)
Traceback (most recent call last):
File "/usr/local/lib/python3.5/site-packages/nltk/tokenize/util.py", line 230, in align_tokens
start = sentence.index(token, point)
ValueError: substring not found
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.5/site-packages/nltk/tokenize/treebank.py", line 167, in span_tokenize
return align_tokens(tokens, text)
File "/usr/local/lib/python3.5/site-packages/nltk/tokenize/util.py", line 232, in align_tokens
raise ValueError('substring "{}" not found in "{}"'.format(token, sentence))
ValueError: substring "``" not found in "This is a sentence with "quotes inside" and alsom some 'single quotes', etc." Suboptimal solution: >>> s = '''This is a sentence with `` quotes inside '' and alsom some 'single quotes', etc.'''
>>> tbw.span_tokenize(s)
[(0, 4), (5, 7), (8, 9), (10, 18), (19, 23), (24, 26), (27, 33), (34, 40), (41, 43), (44, 47), (48, 53), (54, 58), (59, 66), (67, 73), (73, 74), (74, 75), (76, 79), (79, 80)] |
@alvations Yes. That's the exact error I got. Right now it seems we have to preprocess the sentence before submitting to span_tokenize. |
A simple solution would be to replace the quotes before calling the def span_tokenize(self, text):
tokens = self.tokenize(text)
tokens = ['"' if tok in ['``', "''"] else tok for tok in tokens]
return align_tokens(tokens, text) After the patch: >>> from nltk.tokenize.treebank import TreebankWordTokenizer
>>> tbw = TreebankWordTokenizer()
>>> s = '''This is a sentence with "quotes inside" and alsom some 'single quotes', etc.'''
>>> print(s)
This is a sentence with "quotes inside" and alsom some 'single quotes', etc.
>>> tbw.span_tokenize(s)
[(0, 4), (5, 7), (8, 9), (10, 18), (19, 23), (24, 25), (25, 31), (32, 38), (38, 39), (40, 43), (44, 49), (50, 54), (55, 62), (63, 69), (69, 70), (70, 71), (72, 75), (75, 76)] @albertauyeung do you want to take a stab at adding the patch and create a new pull-request? |
@alvations Yes, sure. Will do! |
Fixed on #1751 |
Please note that this fix will still throw an exception for a text with both types of quotes: |
Hi @alyaxey, what is the exception you see? I executed |
Sorry, I've provided a wrong test case. Please take a look at this one:
The expected output is
I'd like to suggest a different solution to 1) fix this and similar bugs, 2) provide more flexibility to users, 3) make code more clear. We can add a boolean parameter to |
I am running into an exception with the current version of span_tokenize for strings that contain brackets before quotation marks. I believe the regex is wrong since it also matches brackets and later replaces quotation marks in the "raw_tokens" with those brackets. Or am I missing something? Example: s = ' ( see 6) Biotin " " affinity'
w_spans = TreebankWordTokenizer().span_tokenize(s) Exception:
Suggested fix: |
Ok my bad, it has actually already been fixed in commit 4b21300 and is running like a charm in nltk-3.3 |
oh, it stiil has problem in nltk-3.3 like this:
|
@MeMeDa I confirm I can reproduce this bug. A solution is to add one more regex to match single quotes at the beginning of a string. Please see my branch on https://github.com/albertauyeung/nltk/tree/hotfix-span-tokenizer |
Confirmed:
|
Hi, I also experienced this bug, e. g. it happens on the following text:
The resulting error is as follows:
Are there any updates on this problem? |
Experienced this same crash today. Reproductible example:
yields |
@Toz3wm Thank you for the reproducible example, and for pointing me to this issue. I've submitted a PR which ought to solve this. Perhaps in the meantime you can use If this PR gets merged, then the tokenized output might become: |
If we feed in a sentence with double quotation into TreebankWordTokenizer's span_tokenize function, there will be errors. Probably this is because the function sends the raw string input along with the tokenized string to the align_tokens function, without considering that the tokenize function would replace double quotation marks with something else.
The text was updated successfully, but these errors were encountered: