
Nltk word tokenizer strange behaviour #1699

Closed
grafael opened this issue Apr 24, 2017 · 3 comments

grafael commented Apr 24, 2017

Given the sentence:

word = "B. Young c Moin Khan b Wasim 0"

If we call nltk.word_tokenize(word), we'll get the following tokens:
['B', '.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']

But, if we change the sentence to be:
word = "C. Young c Moin Khan b Wasim 0"

The tokens will be:
['C.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']

The char '.' is now joined to the letter 'C'.

What is stranger is that 'B' being split from '.' only occurs when the next word is 'Young'. I tried several other words and none of them appears to reproduce the error.
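
For reference, the behaviour reproduces directly like this (assuming the punkt models have been downloaded, e.g. with nltk.download('punkt')):

>>> import nltk
>>> for sentence in ("B. Young c Moin Khan b Wasim 0",
...                  "C. Young c Moin Khan b Wasim 0"):
...     print(nltk.word_tokenize(sentence))
...
['B', '.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['C.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']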

@alvations alvations added the bug label May 4, 2017

alvations commented May 5, 2017

Rather strange bug:

>>> s = '{}. Young c Moin Khan b Wasim 0'
>>> for i in range(65, 65+26):
...     word_tokenize(s.format(chr(i)))
... 
['A', '.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['B', '.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['C.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['D.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['E.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['F.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['G.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['H.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['I', '.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['J', '.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['K.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['L.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['M.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['N.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['O', '.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['P.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['Q', '.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['R.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['S.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['T.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['U', '.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['V.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['W.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['X', '.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['Y', '.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['Z', '.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']

Without a capitalized word after the fullstop, it gives the expected behavior of the TreebankWordTokenizer, e.g. https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L39

>>> s = '{}. young c Moin Khan b Wasim 0'
>>> for i in range(65, 65+26):
...     word_tokenize(s.format(chr(i)))
... 
['A.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['B.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['C.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['D.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['E.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['F.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['G.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['H.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['I.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['J.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['K.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['L.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['M.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['N.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['O.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['P.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['Q.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['R.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['S.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['T.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['U.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['V.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['W.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['X.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['Y.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['Z.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']

@alvations

It's the sent_tokenize step inside word_tokenize that causes this issue:

>>> from nltk.tokenize.treebank import TreebankWordTokenizer
>>> from nltk import word_tokenize, sent_tokenize

>>> s = 'A. Young c Moin Khan b Wasim 0' # Causes unexpected splitting of "A ."

# Unexpected.
>>> word_tokenize(s)
['A', '.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']

# Expected.
>>> tbt = TreebankWordTokenizer()
>>> tbt.tokenize(s)
['A.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']

# Sent tokenize splits it into 2 sentences.
>>> sent_tokenize(s)
['A.', 'Young c Moin Khan b Wasim 0']

So when the sentence is split into 2 sentences, this regex in the TreebankWordTokenizer kicks in: https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L61

It will pad the final fullstop with a preceding space.
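
A quick way to see the two steps interacting (a small illustration; the exact rule is the final-period regex linked above):

>>> from nltk.tokenize.treebank import TreebankWordTokenizer
>>> tbt = TreebankWordTokenizer()
>>> # Inside a longer string the initial keeps its period attached...
>>> tbt.tokenize('A. Young')
['A.', 'Young']
>>> # ...but once sent_tokenize has cut 'A.' off as its own "sentence",
>>> # the final-period rule pads the dot and it becomes a separate token.
>>> tbt.tokenize('A.')
['A', '.']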

To verify the explanation above, note that 'C. Young c Moin Khan b Wasim 0' remains a single sentence after sent_tokenize():

>>> s = 'C. Young c Moin Khan b Wasim 0' 
>>> sent_tokenize(s)
['C. Young c Moin Khan b Wasim 0']
>>> 
>>> len(sent_tokenize(s))
1

This split most probably happens because, when the Punkt sentence tokenizer was trained, it learned that some single capitals preceding a fullstop are likely abbreviations (initials) while others are not; the latter are therefore treated as sentence boundaries and split.
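
One way to peek at what the pre-trained model learned (a sketch for inspection only; _params.abbrev_types is a private attribute of the PunktSentenceTokenizer, and the exact contents depend on the English pickle shipped with nltk_data):

>>> import nltk
>>> punkt = nltk.data.load('tokenizers/punkt/english.pickle')
>>> # Single-letter entries in this set were learned as likely abbreviations/initials,
>>> # so a fullstop after them is less likely to be treated as a sentence boundary.
>>> sorted(t for t in punkt._params.abbrev_types if len(t) == 1)

(Output omitted; which single letters the model learned as abbreviations is one of the factors behind which initials keep their period attached in the lists above.)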

@alvations alvations removed the bug label May 5, 2017

alvations commented May 5, 2017

Since this is idiosyncratic to your data, I think this is not a bug but a quaint feature of the Kiss and Strunk (2006) Punkt tokenizer =)

I've added a preserve_line option to be consistent with other tokenizers' user options, e.g. the Stanford tokenizer's.

In the latest commit of #1710, it'll allow this:

>>> from nltk import word_tokenize
>>> s = 'A. Young c Moin Khan b Wasim 0'
>>> word_tokenize(s)
['A', '.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
>>> word_tokenize(s, preserve_line=True)
['A.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']

This is done by skipping the sent_tokenize step in word_tokenize. The default is still to perform the sent_tokenize step; see here.
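
Roughly, the composition looks like this (a simplified sketch of how word_tokenize chains the two tokenizers, not the exact code from #1710):

from nltk.tokenize import sent_tokenize
from nltk.tokenize.treebank import TreebankWordTokenizer

_treebank_word_tokenizer = TreebankWordTokenizer()

def word_tokenize(text, language='english', preserve_line=False):
    # preserve_line=True skips the Punkt sentence split and hands the
    # raw line straight to the Treebank word tokenizer.
    sentences = [text] if preserve_line else sent_tokenize(text, language)
    return [token for sent in sentences
            for token in _treebank_word_tokenizer.tokenize(sent)]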

@grafael Thanks for catching this!
