
word_tokenize keeps the opening single quotes and doesn't pad it with space #1995

Open
alvations opened this issue Mar 28, 2018 · 7 comments

Comments

@alvations
Contributor

word_tokenize keeps the opening single quote and doesn't pad it with a space; this is to make sure that clitics such as 'll, 've, etc. get tokenized correctly.

The original treebank tokenizer has the same behavior, but Stanford CoreNLP doesn't. It looks like some additional regex was put in to make sure that opening single quotes get padded with spaces if they aren't followed by a clitic.

There should be a non-capturing regex to catch the non-clitics and pad them with spaces.

Details on https://stackoverflow.com/questions/49499770/nltk-word-tokenizer-treats-ending-single-quote-as-a-separate-word/49506436#49506436
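
A quick reproduction (the output below illustrates the unpatched tokenizer, per the Stack Overflow post; the closing quote is split off while the opening one stays attached):

>>> from nltk import word_tokenize
>>> word_tokenize("The 'v' is a letter.")
['The', "'v", "'", 'is', 'a', 'letter', '.']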

@alvations
Contributor Author

alvations commented Mar 28, 2018

If we make the following changes to word_tokenize at https://github.com/nltk/nltk/blob/develop/nltk/tokenize/__init__.py, it achieves behavior similar to Stanford CoreNLP's:

import re
from nltk.tokenize import sent_tokenize  # needed when running this snippet outside nltk/tokenize/__init__.py, where sent_tokenize lives
from nltk.tokenize.treebank import TreebankWordTokenizer

# Standard word tokenizer.
_treebank_word_tokenizer = TreebankWordTokenizer()

# See discussion on 
#     - https://github.com/nltk/nltk/pull/1437
#     - https://github.com/nltk/nltk/issues/1995
# Adding to TreebankWordTokenizer, the splits on
#     - chevron quotes u'\xab' and u'\xbb'
#     - unicode quotes u'\u2018', u'\u2019', u'\u201c' and u'\u201d'
#     - opening single quotes if the token that follows isn't a clitic

improved_open_quote_regex = re.compile(u'([«“‘„]|[`]+)', re.U)
improved_open_single_quote_regex = re.compile(r"(?i)(\')(?!re|ve|ll|m|t|s|d)(\w)\b", re.U)
improved_close_quote_regex = re.compile(u'([»”’])', re.U)
improved_punct_regex = re.compile(r'([^\.])(\.)([\]\)}>"\'' u'»”’ ' r']*)\s*$', re.U)
_treebank_word_tokenizer.STARTING_QUOTES.insert(0, (improved_open_quote_regex, r' \1 '))
_treebank_word_tokenizer.STARTING_QUOTES.append((improved_open_single_quote_regex, r'\1 \2'))
_treebank_word_tokenizer.ENDING_QUOTES.insert(0, (improved_close_quote_regex, r' \1 '))
_treebank_word_tokenizer.PUNCTUATION.insert(0, (improved_punct_regex, r'\1 \2 \3 '))

def word_tokenize(text, language='english', preserve_line=False):
    """
    Return a tokenized copy of *text*,
    using NLTK's recommended word tokenizer
    (currently an improved :class:`.TreebankWordTokenizer`
    along with :class:`.PunktSentenceTokenizer`
    for the specified language).

    :param text: text to split into words
    :type text: str
    :param language: the model name in the Punkt corpus
    :type language: str
    :param preserve_line: A flag to keep the text as a single sentence and skip sentence tokenization.
    :type preserve_line: bool
    """
    sentences = [text] if preserve_line else sent_tokenize(text, language)
    return [token for sent in sentences
            for token in _treebank_word_tokenizer.tokenize(sent)]

[out]:

>>> print(word_tokenize("The 'v', I've been fooled but I'll seek revenge."))
['The', "'", 'v', "'", ',', 'I', "'ve", 'been', 'fooled', 'but', 'I', "'ll", 'seek', 'revenge', '.']
>>> word_tokenize("'v' 're'")
["'", 'v', "'", "'re", "'"]

The above regex hack will cover the following clitics:

're
've
'll
'd
't
's
'm

Are there more clitics that should be added?
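
As a standalone sanity check of the lookahead (a sketch, separate from the patch above):

import re

open_single_quote = re.compile(r"(?i)(\')(?!re|ve|ll|m|t|s|d)(\w)\b", re.U)

for tok in ["'re", "'ve", "'ll", "'d", "'t", "'s", "'m", "'v"]:
    # For clitics the negative lookahead fails, so no substitution
    # happens; for anything else a space is inserted after the quote.
    print(tok, '->', open_single_quote.sub(r'\1 \2', tok))

[out]:

're -> 're
've -> 've
'll -> 'll
'd -> 'd
't -> 't
's -> 's
'm -> 'm
'v -> ' v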

@Lingviston

What about ending single quotes that appear at the end of possessive plurals, like "providers'"?

@djinn-anthrope

Do "readers' " need to be tokenized to "readers" and "'"? Also, what is the status of this bug so far? If the change mentioned above has not been implemented, I'd like to take up this issue

@alvations
Contributor Author

This issue is about the opening quotes; the clitic fix for that can be done easily and will make word_tokenize behave like Stanford's. IMHO, it's a good feature to have.

Feel free to contribute and open a pull-request on it =)


But handling the possessive plural is harder, because we would need to define the difference between sentences like:

The providers' CEO went on a holiday.
He said, 'The CEO has fired the providers'.
Breaking news: 'Down to all providers'
The 'internet providers' have gone on a holiday.

There are too many instances where a possessive plural can be confused with a closing single quote, as the sketch below shows. For the plural-possessive use of single quotes, I don't think it's worth fixing.
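
A minimal sketch of the ambiguity (just the two readings side by side; no fix implied):

from nltk import word_tokenize

# Here the trailing quote on "providers'" marks a possessive plural...
print(word_tokenize("The providers' CEO went on a holiday."))

# ...and here the identical surface string closes a quotation; a local
# regex sees exactly the same characters in both cases.
print(word_tokenize("He said, 'The CEO has fired the providers'."))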

djinn-anthrope pushed a commit to djinn-anthrope/nltk that referenced this issue May 14, 2018
stevenbird added a commit that referenced this issue Nov 11, 2018
Adding improved regexes to handle clitics c.f. #1995
@durgaprasad-palanati-AI

import nltk
import codecs

# new1.txt contains the line: ['సతతహరిత', 'సమశీతోష్ణ', ' అడవి-ఇల్లు*అడవి ']
# new2.txt contains the line: the king's cat is caught with kit's.
file = codecs.open('new2.txt', 'r', 'utf8')
fh = file.readlines()

for line in fh:
    l = nltk.tokenize.word_tokenize(line)
    print(l)
    # new1.txt: ['[', "'సతతహరిత", "'", ',', "'సమశీతోష్ణ", "'", ',', "'", 'అడవి-ఇల్లు', '', 'అడవి', "'", ']']
    # new2.txt: ['\ufeffthe', 'king', "'s", 'cat', 'is', 'caught', 'with', 'kit', "'s", '.']

    ll = []  # to store the updated token list
    for i in l:
        if i[0] == "'":
            # strip every single quote from tokens that begin with one
            ix = i.replace("'", '')
            ll.append(ix)
        else:
            ll.append(i)

    # updated, corrected tokens
    print(ll)
    # new1.txt: ['[', 'సతతహరిత', '', ',', 'సమశీతోష్ణ', '', ',', '', 'అడవి-ఇల్లు', '', 'అడవి', '', ']']
    # new2.txt: ['\ufeffthe', 'king', 's', 'cat', 'is', 'caught', 'with', 'kit', 's', '.']

@durgaprasad-palanati-AI

I just post-processed the result of the NLTK word tokenizer, and it solved my problem. But it may not be an optimal solution; the NLTK library itself should be updated.
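
For reference, the same post-processing can be written as one comprehension (a sketch of the workaround above, not a change to NLTK; like the loop, it also strips the quote from clitics such as "'s"):

import nltk

line = "the king's cat is caught with kit's."
tokens = nltk.tokenize.word_tokenize(line)

# Any token that starts with a single quote has all of its
# quote characters removed, matching the loop-based version.
cleaned = [t.replace("'", '') if t.startswith("'") else t for t in tokens]
print(cleaned)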

@th0rntwig

Was there a regression from #2018, or did that PR not fix the issue?

nltk version: 3.8.1
python version: 3.10.12

from nltk.tokenize import word_tokenize

sentence = "I've said many times, 'We'll make it through!'"
word_tokenize(sentence)

Expected: ['I', "'ve", 'said', 'many', 'times', ',', "'", 'We', "'ll", 'make', 'it', 'through', '!', "'"]
Actual: ['I', "'ve", 'said', 'many', 'times', ',', "'We", "'ll", 'make', 'it', 'through', '!', "'"]
