word_tokenize keeps the opening single quotes and doesn't pad it with space #1995
If we make the following changes to `word_tokenize`:

```python
import re

from nltk.tokenize import sent_tokenize
from nltk.tokenize.treebank import TreebankWordTokenizer

# Standard word tokenizer.
_treebank_word_tokenizer = TreebankWordTokenizer()

# See discussion on
# - https://github.com/nltk/nltk/pull/1437
# - https://github.com/nltk/nltk/issues/1995
# Adding to TreebankWordTokenizer, the splits on
# - chevron quotes u'\xab' and u'\xbb'
# - unicode quotes u'\u2018', u'\u2019', u'\u201c' and u'\u201d'
# - opening single quotes if the token that follows isn't a clitic
improved_open_quote_regex = re.compile(u'([«“‘„]|[`]+)', re.U)
improved_open_single_quote_regex = re.compile(r"(?i)(\')(?!re|ve|ll|m|t|s|d)(\w)\b", re.U)
improved_close_quote_regex = re.compile(u'([»”’])', re.U)
improved_punct_regex = re.compile(r'([^\.])(\.)([\]\)}>"\'' u'»”’ ' r']*)\s*$', re.U)

_treebank_word_tokenizer.STARTING_QUOTES.insert(0, (improved_open_quote_regex, r' \1 '))
_treebank_word_tokenizer.STARTING_QUOTES.append((improved_open_single_quote_regex, r'\1 \2'))
_treebank_word_tokenizer.ENDING_QUOTES.insert(0, (improved_close_quote_regex, r' \1 '))
_treebank_word_tokenizer.PUNCTUATION.insert(0, (improved_punct_regex, r'\1 \2 \3 '))


def word_tokenize(text, language='english', preserve_line=False):
    """
    Return a tokenized copy of *text*,
    using NLTK's recommended word tokenizer
    (currently an improved :class:`.TreebankWordTokenizer`
    along with :class:`.PunktSentenceTokenizer`
    for the specified language).

    :param text: text to split into words
    :type text: str
    :param language: the model name in the Punkt corpus
    :type language: str
    :param preserve_line: an option to keep the sentence intact and not sentence-tokenize it
    :type preserve_line: bool
    """
    sentences = [text] if preserve_line else sent_tokenize(text, language)
    return [token for sent in sentences
            for token in _treebank_word_tokenizer.tokenize(sent)]
```
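Independent of NLTK, the quote-padding regexes above can be sanity-checked with plain `re` (a minimal sketch; the sample text is made up for illustration):

```python
import re

# Same patterns as in the snippet above: pad chevron/curly opening
# and closing quotes with spaces so they become separate tokens.
improved_open_quote_regex = re.compile(u'([«“‘„]|[`]+)', re.U)
improved_close_quote_regex = re.compile(u'([»”’])', re.U)

text = u'«Hello world»'
padded = improved_open_quote_regex.sub(r' \1 ', text)
padded = improved_close_quote_regex.sub(r' \1 ', padded)
print(padded.split())  # ['«', 'Hello', 'world', '»']
```

After padding, a plain whitespace split already separates the quotes from the words, which is exactly what inserting these rules into `STARTING_QUOTES`/`ENDING_QUOTES` achieves inside the tokenizer pipeline.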
The above regex hack will cover the following clitics: `'re`, `'ve`, `'ll`, `'m`, `'t`, `'s`, `'d`. Are there more clitics that should be added?
What about ending single quotes, which appear at the end of the possessive form of plurals, like "providers'"?
Does "readers'" need to be tokenized into "readers" and "'"? Also, what is the status of this bug so far? If the change mentioned above has not been implemented, I'd like to take up this issue.
This issue is about the opening quotes, and the clitic fix for that can easily be done. Feel free to contribute and open a pull request for it =) But handling the possessive plural is hard, because we would need to define the difference between a trailing quote that marks possession and one that closes a quotation.
There are too many instances where a possessive plural can be confused with a closing single quote. For the use of single quotes in the plural possessive, I don't think it's worth fixing.
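The clitic protection discussed above can be seen in isolation by applying the opening-single-quote pattern from the earlier snippet directly (a small sketch; the sample strings are made up):

```python
import re

# The opening-single-quote pattern from the snippet above: pad the quote
# only when what follows is NOT one of the clitics 're 've 'll 'm 't 's 'd.
pat = re.compile(r"(?i)(\')(?!re|ve|ll|m|t|s|d)(\w)\b", re.U)

print(pat.sub(r"\1 \2", "'ll"))   # 'll    -- clitic, left alone
print(pat.sub(r"\1 \2", "'ve"))   # 've    -- clitic, left alone
print(pat.sub(r"\1 \2", "'a b"))  # ' a b  -- not a clitic, quote padded
```

The negative lookahead is what keeps contractions intact while still detaching a genuine opening quote; note it only distinguishes opening quotes, so the possessive-plural ambiguity at the end of a word is untouched by this rule.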
Adding improved regexes to handle clitics, cf. #1995
```python
import codecs

import nltk

file = codecs.open('new2.txt', 'r', 'utf8')
fh = file.readlines()
# ['సతతహరిత', 'సమశీతోష్ణ', ' అడవి-ఇల్లు*అడవి '] - the line is stored in new1.txt
for line in fh:
    l = nltk.word_tokenize(line)
    print(l)
    # ['[', "'సతతహరిత", "'", ',', "'సమశీతోష్ణ", "'", ',', "'", 'అడవి-ఇల్లు', '', 'అడవి', "'", ']']
```
I just re-processed the result of the NLTK word tokenizer; it solved my problem.
Was there a regression from #2018, or did that PR not fix the issue? NLTK version: 3.8.1

```python
from nltk.tokenize import word_tokenize

sentence = "I've said many times, 'We'll make it through!'"
word_tokenize(sentence)
```

Expected: `['I', "'ve", 'said', 'many', 'times', ',', "'", 'We', "'ll", 'make', 'it', 'through', '!', "'"]`
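One possible explanation for the behavior reported above (my own reading of the pattern, not confirmed in this thread): the trailing `\b` in the padding regex must match immediately after the single captured `(\w)`, so the rule can only fire when the word following the quote is one character long. A quick check with plain `re`:

```python
import re

# The padding pattern discussed in this thread.
pat = re.compile(r"(?i)(\')(?!re|ve|ll|m|t|s|d)(\w)\b", re.U)

# For 'We, the \b would have to match between 'W' and 'e' -- two word
# characters -- so the pattern never fires and the quote stays attached.
print(pat.sub(r"\1 \2", "'We'll make it"))  # unchanged: 'We'll make it

# It only fires when the word after the quote is a single character:
print(pat.sub(r"\1 \2", "'A day"))  # ' A day
```

If this reading is right, the rule as written would rarely trigger on real text, which would be consistent with the expected output above not being produced.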
`word_tokenize` keeps the opening single quote and doesn't pad it with a space; this is to make sure that clitics get tokenized as `'ll`, `'ve`, etc. The original treebank tokenizer has the same behavior, but Stanford CoreNLP doesn't. It looks like some additional regex was put in to make sure that an opening single quote gets padded with spaces if it isn't followed by a clitic.

There should be a non-capturing regex to catch the non-clitics and pad the space.

Details on https://stackoverflow.com/questions/49499770/nltk-word-tokenizer-treats-ending-single-quote-as-a-separate-word/49506436#49506436