
Nltk word tokenizer strange behaviour #1699

Closed
grafael opened this issue Apr 24, 2017 · 3 comments

grafael commented Apr 24, 2017

Given the sentence:

word = "B. Young c Moin Khan b Wasim 0"

If we call nltk.word_tokenize(word), we'll get the following tokens:
['B', '.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']

But, if we change the sentence to be:
word = "C. Young c Moin Khan b Wasim 0"

The tokens will be:
['C.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']

The char '.' is now joined to the letter 'C'.

What is stranger is that 'B' being split from '.' only occurs when the next word is 'Young'. I tried several other words and none of them appears to reproduce the error.
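
For reference, the behaviour reproduces directly like this (assuming the punkt models have been downloaded, e.g. with nltk.download('punkt')):

>>> import nltk
>>> for sentence in ("B. Young c Moin Khan b Wasim 0",
...                  "C. Young c Moin Khan b Wasim 0"):
...     print(nltk.word_tokenize(sentence))
...
['B', '.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['C.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']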

@alvations alvations added the bug label May 4, 2017

alvations commented May 5, 2017

Rather strange bug:

>>> s = '{}. Young c Moin Khan b Wasim 0'
>>> for i in range(65, 65+26):
...     word_tokenize(s.format(chr(i)))
... 
['A', '.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['B', '.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['C.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['D.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['E.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['F.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['G.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['H.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['I', '.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['J', '.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['K.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['L.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['M.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['N.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['O', '.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['P.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['Q', '.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['R.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['S.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['T.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['U', '.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['V.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['W.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['X', '.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['Y', '.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['Z', '.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']

Without a capitalized word after the fullstop, it gives the expected behavior of the TreebankWordTokenizer, e.g. https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L39

>>> s = '{}. young c Moin Khan b Wasim 0'
>>> for i in range(65, 65+26):
...     word_tokenize(s.format(chr(i)))
... 
['A.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['B.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['C.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['D.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['E.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['F.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['G.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['H.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['I.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['J.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['K.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['L.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['M.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['N.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['O.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['P.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['Q.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['R.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['S.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['T.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['U.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['V.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['W.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['X.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['Y.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['Z.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']

@alvations

It's the sent_tokenize step inside word_tokenize that causes this issue:

>>> from nltk.tokenize.treebank import TreebankWordTokenizer
>>> from nltk import word_tokenize, sent_tokenize

>>> s = 'A. Young c Moin Khan b Wasim 0' # Causes unexpected splitting of "A ."

# Unexpected.
>>> word_tokenize(s)
['A', '.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']

# Expected.
>>> tbt = TreebankWordTokenizer()
>>> tbt.tokenize(s)
['A.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']

# Sent tokenize splits it into 2 sentences.
>>> sent_tokenize(s)
['A.', 'Young c Moin Khan b Wasim 0']

So when the sentence is split into 2 sentences, this regex in the TreebankWordTokenizer kicks in: https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L61

It will pad the final fullstop with a preceding space.
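
A quick way to see the two steps interacting (a small illustration; the exact rule is the final-period regex linked above):

>>> from nltk.tokenize.treebank import TreebankWordTokenizer
>>> tbt = TreebankWordTokenizer()
>>> # Inside a longer string the initial keeps its period attached...
>>> tbt.tokenize('A. Young')
['A.', 'Young']
>>> # ...but once sent_tokenize has cut 'A.' off as its own "sentence",
>>> # the final-period rule pads the dot and it becomes a separate token.
>>> tbt.tokenize('A.')
['A', '.']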

To verify the explanation above, note that 'C. Young c Moin Khan b Wasim 0' remains a single sentence after sent_tokenize():

>>> s = 'C. Young c Moin Khan b Wasim 0' 
>>> sent_tokenize(s)
['C. Young c Moin Khan b Wasim 0']
>>> 
>>> len(sent_tokenize(s))
1

This split most probably happens because, when the Punkt sentence tokenizer was trained, it learned that some single capitals preceding a fullstop are likely abbreviations (initials) while others are not; the latter are therefore treated as sentence boundaries and split.
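
One way to peek at what the pre-trained model learned (a sketch for inspection only; _params.abbrev_types is a private attribute of the PunktSentenceTokenizer, and the exact contents depend on the English pickle shipped with nltk_data):

>>> import nltk
>>> punkt = nltk.data.load('tokenizers/punkt/english.pickle')
>>> # Single-letter entries in this set were learned as likely abbreviations/initials,
>>> # so a fullstop after them is less likely to be treated as a sentence boundary.
>>> sorted(t for t in punkt._params.abbrev_types if len(t) == 1)

(Output omitted; which single letters the model learned as abbreviations is one of the factors behind which initials keep their period attached in the lists above.)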

@alvations alvations removed the bug label May 5, 2017

alvations commented May 5, 2017

Since this is idiosyncratic to your data, I think this is not a bug but a quaint feature of the Kiss and Strunk (2006) Punkt tokenizer =)

I've added a preserve_line option to be consistent with other tokenizers' user options, e.g. the Stanford tokenizer's.

In the latest commit of #1710, it'll allow this:

>>> from nltk import word_tokenize
>>> s = 'A. Young c Moin Khan b Wasim 0'
>>> word_tokenize(s)
['A', '.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
>>> word_tokenize(s, preserve_line=True)
['A.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']

This is done by skipping the sent_tokenize step in word_tokenize. The default is still to perform the sent_tokenize step; see here.
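
Roughly, the composition looks like this (a simplified sketch of how word_tokenize chains the two tokenizers, not the exact code from #1710):

from nltk.tokenize import sent_tokenize
from nltk.tokenize.treebank import TreebankWordTokenizer

_treebank_word_tokenizer = TreebankWordTokenizer()

def word_tokenize(text, language='english', preserve_line=False):
    # preserve_line=True skips the Punkt sentence split and hands the
    # raw line straight to the Treebank word tokenizer.
    sentences = [text] if preserve_line else sent_tokenize(text, language)
    return [token for sent in sentences
            for token in _treebank_word_tokenizer.tokenize(sent)]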

@grafael Thanks for catching this!
