
word_tokenize keeps the opening single quotes and doesn't pad it with space #1995

Open
alvations opened this issue Mar 28, 2018 · 7 comments

Comments

@alvations
Contributor

word_tokenize keeps the opening single quote and doesn't pad it with a space; this is to make sure that clitics such as 'll, 've, etc. get tokenized correctly.

The original treebank tokenizer has the same behavior, but Stanford CoreNLP doesn't. It looks like some additional regex was put in to make sure that opening single quotes get padded with spaces if they aren't followed by a clitic.

There should be a non-capturing regex to catch the non-clitics and pad them with spaces.

Details on https://stackoverflow.com/questions/49499770/nltk-word-tokenizer-treats-ending-single-quote-as-a-separate-word/49506436#49506436
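
A quick reproduction (the output below illustrates the unpatched tokenizer, per the Stack Overflow post; the closing quote is split off while the opening one stays attached):

>>> from nltk import word_tokenize
>>> word_tokenize("The 'v' is a letter.")
['The', "'v", "'", 'is', 'a', 'letter', '.']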

@alvations
Contributor Author

alvations commented Mar 28, 2018

If we make the following changes to word_tokenize at https://github.com/nltk/nltk/blob/develop/nltk/tokenize/__init__.py, it achieves behavior similar to Stanford CoreNLP's:

import re
from nltk.tokenize import sent_tokenize  # needed when running this snippet outside nltk/tokenize/__init__.py, where sent_tokenize lives
from nltk.tokenize.treebank import TreebankWordTokenizer

# Standard word tokenizer.
_treebank_word_tokenizer = TreebankWordTokenizer()

# See discussion on 
#     - https://github.com/nltk/nltk/pull/1437
#     - https://github.com/nltk/nltk/issues/1995
# Adding to TreebankWordTokenizer, the splits on
#     - chevron quotes u'\xab' and u'\xbb'
#     - unicode quotes u'\u2018', u'\u2019', u'\u201c' and u'\u201d'
#     - opening single quotes if the token that follows isn't a clitic

improved_open_quote_regex = re.compile(u'([«“‘„]|[`]+)', re.U)
improved_open_single_quote_regex = re.compile(r"(?i)(\')(?!re|ve|ll|m|t|s|d)(\w)\b", re.U)
improved_close_quote_regex = re.compile(u'([»”’])', re.U)
improved_punct_regex = re.compile(r'([^\.])(\.)([\]\)}>"\'' u'»”’ ' r']*)\s*$', re.U)
_treebank_word_tokenizer.STARTING_QUOTES.insert(0, (improved_open_quote_regex, r' \1 '))
_treebank_word_tokenizer.STARTING_QUOTES.append((improved_open_single_quote_regex, r'\1 \2'))
_treebank_word_tokenizer.ENDING_QUOTES.insert(0, (improved_close_quote_regex, r' \1 '))
_treebank_word_tokenizer.PUNCTUATION.insert(0, (improved_punct_regex, r'\1 \2 \3 '))

def word_tokenize(text, language='english', preserve_line=False):
    """
    Return a tokenized copy of *text*,
    using NLTK's recommended word tokenizer
    (currently an improved :class:`.TreebankWordTokenizer`
    along with :class:`.PunktSentenceTokenizer`
    for the specified language).

    :param text: text to split into words
    :type text: str
    :param language: the model name in the Punkt corpus
    :type language: str
    :param preserve_line: A flag to keep the text as a single sentence and skip sentence tokenization.
    :type preserve_line: bool
    """
    sentences = [text] if preserve_line else sent_tokenize(text, language)
    return [token for sent in sentences
            for token in _treebank_word_tokenizer.tokenize(sent)]

[out]:

>>> print(word_tokenize("The 'v', I've been fooled but I'll seek revenge."))
['The', "'", 'v', "'", ',', 'I', "'ve", 'been', 'fooled', 'but', 'I', "'ll", 'seek', 'revenge', '.']
>>> word_tokenize("'v' 're'")
["'", 'v', "'", "'re", "'"]

The above regex hack will cover the following clitics:

're
've
'll
'd
't
's
'm

Are there more clitics that should be added?
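
As a standalone sanity check of the lookahead (a sketch, separate from the patch above):

import re

open_single_quote = re.compile(r"(?i)(\')(?!re|ve|ll|m|t|s|d)(\w)\b", re.U)

for tok in ["'re", "'ve", "'ll", "'d", "'t", "'s", "'m", "'v"]:
    # For clitics the negative lookahead fails, so no substitution
    # happens; for anything else a space is inserted after the quote.
    print(tok, '->', open_single_quote.sub(r'\1 \2', tok))

[out]:

're -> 're
've -> 've
'll -> 'll
'd -> 'd
't -> 't
's -> 's
'm -> 'm
'v -> ' v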

@Lingviston

What about ending single quotes that appear at the end of possessive plurals, like "providers'"?

@djinn-anthrope

Do "readers' " need to be tokenized to "readers" and "'"? Also, what is the status of this bug so far? If the change mentioned above has not been implemented, I'd like to take up this issue

@alvations
Contributor Author

This issue is about the opening quotes; the clitic fix for that can be done easily and will make word_tokenize behave like Stanford's. IMHO, it's a good feature to have.

Feel free to contribute and open a pull-request on it =)


But handling the possessive plural is harder, because we would need to define the difference between sentences like:

The providers' CEO went on a holiday.
He said, 'The CEO has fired the providers'.
Breaking news: 'Down to all providers'
The 'internet providers' have gone on a holiday.

There are too many instances where a possessive plural can be confused with a closing single quote, as the sketch below shows. For the plural-possessive use of single quotes, I don't think it's worth fixing.
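
A minimal sketch of the ambiguity (just the two readings side by side; no fix implied):

from nltk import word_tokenize

# Here the trailing quote on "providers'" marks a possessive plural...
print(word_tokenize("The providers' CEO went on a holiday."))

# ...and here the identical surface string closes a quotation; a local
# regex sees exactly the same characters in both cases.
print(word_tokenize("He said, 'The CEO has fired the providers'."))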

djinn-anthrope pushed a commit to djinn-anthrope/nltk that referenced this issue May 14, 2018
stevenbird added a commit that referenced this issue Nov 11, 2018
Adding improved regexes to handle clitics c.f. #1995
@durgaprasad-palanati-AI

import nltk
import codecs

# new1.txt contains the line: ['సతతహరిత', 'సమశీతోష్ణ', ' అడవి-ఇల్లు*అడవి ']
# new2.txt contains the line: the king's cat is caught with kit's.
file = codecs.open('new2.txt', 'r', 'utf8')
fh = file.readlines()

for line in fh:
    l = nltk.tokenize.word_tokenize(line)
    print(l)
    # new1.txt: ['[', "'సతతహరిత", "'", ',', "'సమశీతోష్ణ", "'", ',', "'", 'అడవి-ఇల్లు', '', 'అడవి', "'", ']']
    # new2.txt: ['\ufeffthe', 'king', "'s", 'cat', 'is', 'caught', 'with', 'kit', "'s", '.']

    ll = []  # to store the updated token list
    for i in l:
        if i[0] == "'":
            # strip every single quote from tokens that begin with one
            ix = i.replace("'", '')
            ll.append(ix)
        else:
            ll.append(i)

    # updated, corrected tokens
    print(ll)
    # new1.txt: ['[', 'సతతహరిత', '', ',', 'సమశీతోష్ణ', '', ',', '', 'అడవి-ఇల్లు', '', 'అడవి', '', ']']
    # new2.txt: ['\ufeffthe', 'king', 's', 'cat', 'is', 'caught', 'with', 'kit', 's', '.']

@durgaprasad-palanati-AI

I just post-processed the result of the NLTK word tokenizer, and it solved my problem. But it may not be an optimal solution; the NLTK library itself should be updated.
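
For reference, the same post-processing can be written as one comprehension (a sketch of the workaround above, not a change to NLTK; like the loop, it also strips the quote from clitics such as "'s"):

import nltk

line = "the king's cat is caught with kit's."
tokens = nltk.tokenize.word_tokenize(line)

# Any token that starts with a single quote has all of its
# quote characters removed, matching the loop-based version.
cleaned = [t.replace("'", '') if t.startswith("'") else t for t in tokens]
print(cleaned)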

@th0rntwig

Was there a regression from #2018, or did that PR not fix the issue?

nltk version: 3.8.1
python version: 3.10.12

from nltk.tokenize import word_tokenize

sentence = "I've said many times, 'We'll make it through!'"
word_tokenize(sentence)

Expected: ['I', "'ve", 'said', 'many', 'times', ',', "'", 'We', "'ll", 'make', 'it', 'through', '!', "'"]
Actual: ['I', "'ve", 'said', 'many', 'times', ',', "'We", "'ll", 'make', 'it', 'through', '!', "'"]
